
Midscene.js: Assessing a natural language AI testing tool

We see how Midscene.js stacks up against traditionally coded Playwright tests

With almost everything AI flooding our feeds, regardless of what we do for work, it is difficult to pick out the AI tools that stand head and shoulders above the rest. In test engineering and software development, code generation tools are everywhere. So, staying with the AI theme, we’ve taken a different direction and looked at an AI tool which does NOT generate code, but instead uses natural language to carry out the tests.

Midscene.js is a tool that integrates with Playwright or Puppeteer. In this article, we’ll put it through its paces, seeing how it stacks up against traditionally coded Playwright tests in terms of speed, and observing how Midscene.js handles various test scenarios.

As AI continues to transform testing tools, this legitimate first take on natural language for writing test automation marks a promising development. Midscene.js and tools like it still need to move beyond novelty status and become usable tools in the test engineer’s toolbox. To get there, major improvements to speed, plus the ability for the AI to interpret the DOM alongside screenshots, are a must.

How does Midscene.js work?

Based on the diagram below, we can see that there is no interaction under the hood with the application’s underlying APIs. This means the tests can only work from what Midscene.js derives from the frontend.

How Midscene.js works
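
Before getting into the scenarios, here is roughly how the ai and aiAssert fixtures used in the tests below get wired into Playwright. This is a minimal sketch based on our understanding of the Midscene.js integration; the package path and fixture factory may differ between versions.

typescript
// fixture.ts: a minimal setup sketch. The package path and factory name are
// our understanding of the Midscene.js Playwright integration and may vary.
import { test as base } from "@playwright/test"
import { PlaywrightAiFixture } from "@midscene/web/playwright"

// Extends Playwright's test object with the ai/aiAssert fixtures used below.
// Midscene.js reads the OPENAI_API_KEY environment variable to call the model.
export const test = base.extend(PlaywrightAiFixture())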

How Midscene.js measured up

Let’s get to testing Midscene.js. We used three scenarios, each of which could represent a small part of a larger web app: a login form, adding and removing elements on a page, and a couple of end-to-end flows on real sites.

What is very exciting with this AI tool is the ability to use natural language: simply describe what the test step is trying to do, and off it goes. That brought the wow factor to another level, whereas code generation feels almost standard these days.

Login

For this first example, we used Nearform’s UI Testing Playground, a great place to take a tool through basic as well as more advanced tasks and tests on a page. The UI Testing Playground also lets anyone sharpen their UI testing practice.

Here we have the login scenario. We’ve used this as a basic test to measure:

  • Ease of use/readability

  • Execution time

  • Maintainability

For comparison, here’s what it looks like in Playwright, written the “traditional” way:

typescript
test("Can log in to the page with valid credentials", async ({ page }) => {
    const loginPage = new LoginPage(page)
    await loginPage.goto()
    await page.waitForLoadState()
    await loginPage.logIn(user, password)
    await loginPage.checkLoginResult("success")
    await loginPage.logout()
})
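
The LoginPage used above follows the Page Object Model. Here is a simplified sketch of what such a page object might look like; the selectors and the expected messages are assumptions for illustration, not our exact implementation.

typescript
// login-page.ts: a simplified, illustrative page object.
// Selectors and expected messages are assumptions, not our exact code.
import { expect, type Page } from "@playwright/test"

export class LoginPage {
  constructor(private readonly page: Page) {}

  async goto() {
    await this.page.goto("https://nearform.github.io/testing-playground/#/login-form")
  }

  async logIn(user: string, password: string) {
    await this.page.getByLabel("Username").fill(user)
    await this.page.getByLabel("Password").fill(password)
    await this.page.getByRole("button", { name: "Login" }).click()
  }

  async checkLoginResult(result: "success" | "failure") {
    const expected = result === "success" ? /logged in/i : /invalid/i
    await expect(this.page.getByRole("alert")).toContainText(expected)
  }

  async logout() {
    await this.page.getByRole("button", { name: "Logout" }).click()
  }
}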

Now, in natural language, powered by AI using Midscene.js:

typescript
test.beforeEach(async ({ page }) => {
    await page.goto("https://nearform.github.io/testing-playground/#/login-form")
    await page.waitForLoadState()
  });

test("Can log in with valid credentials", async ({ ai, aiAssert }) => {
    await ai("Fill in username field with the value admin")
    await ai("Fill in password field with the value Passw0rd!")
    await ai("Click on the Login button")
    await aiAssert("Check that the user has successfully logged in")
    await ai("Click on the Logout button")
    await ai("Check that the username and password fields are visible")
})

Slightly longer, but much more readable for, say, a non-technical member of the team.

Takeaways from this scenario:

Ease of writing/readability:

➕ Write it as you think it should execute, which is its strongest point. If a step didn’t work, we found that rewriting it in a simpler way usually did. Prompts can also be written in several languages (English, French, Chinese). Definitely some wow factor here.

➖ Can look awfully long for developers to read.

➖ There is also no way to hide the username and password in natural language. You have to send them directly in the prompt, whereas without Midscene.js you can keep them in a separate data file.

Execution time:

➖ Because this is powered by OpenAI and screen captures, execution was very slow; the AI needs time to plan and “think” about what it has to do. The report and JSON output clearly show how long each task takes.

➖ Locating elements by role is a recommended Playwright practice and inherently tests the a11y of the web app. Because Midscene.js works from screenshots, it moves away from this practice.

Midscene.js testing platform

For comparison, here are the results of running both versions above:

  • With Midscene.js: 45.8 seconds

  • Without Midscene.js: 1.9 seconds

Any time gained by writing the test in natural language is quickly outweighed once we consider the accumulated run time of this test executing on a recurring basis.
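
As a rough illustration, under assumed numbers: the 44-second difference per run, multiplied by a modest suite of 20 similar tests triggered 30 times a day, works out to 20 × 44 × 30 ≈ 26,400 seconds, or over seven hours, of extra execution time per day.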

Maintainability:

➕ If this login page were refactored (test IDs, accessibility tags), the test would likely still do what it needs to do correctly, whereas the Playwright-only version would require some maintenance.

➖ If the test itself needs maintenance, debugging is trial and error: you change the way you prompt it and rerun, with only the test report and JSON dumps to work with.

➖ You can also see that the Playwright test is written using the Page Object Model, which keeps it maintainable and scalable. With Midscene.js, writing the prompts directly in the test file is the whole point, so while writing in natural language is fast and easy, making the same modification across multiple test files would be a nightmare.

We also added a negative case, to see how Midscene.js would handle it. We changed the password to trigger an invalid set of credentials:

typescript
test("Cannot log in with invalid credentials", async ({ ai, aiAssert }) => {
    await ai("Fill in username field with the value admin")
    await ai("Fill in password field with the value password")
    await ai("Click on the Login button")
    await aiAssert("Check that the user has failed to log in")
})

The JSON output on the report gives good insight into what it was “thinking”, which could be helpful with debugging:

json
{
  "pass": true,
  "thought": "The login attempt failed because the message 'The credentials you have provided are invalid' is displayed."
}

Adding/removing elements test

In this example, we thought we’d step it up a notch, but still kept it simple. Here we will only be evaluating how Midscene.js handles a slightly more complex scenario.

For context, here is the page to be tested:

Midscene.js Nearform testing playground

Click on the green button to add a red button, click on a red button to remove it, and click on the blue button to remove all red buttons. Below, we go through the tests we wrote for this page, with the Playwright-only version included for comparison.

Test 1:

Natural language:

typescript
test("AI: Should add 3 elements to the page", async ({ ai, aiAssert }) => {
    await ai("click on Add Element button 3 times")
    await aiAssert("Check for 3 buttons with label Remove Element")
})

No natural language, just Playwright alone:

typescript
test("Should add 3 elements to the page", async ({ page }) => {
    const addRemovePage = new AddRemovePage(page)
    const clickCount = 3
    await addRemovePage.goto()
    await page.waitForLoadState()
    await addRemovePage.clickAddElement(clickCount)
    await addRemovePage.checkRemoveElementQuantity(clickCount)
})

Midscene.js handled this test easily: it detected the elements on the page and made the correct assertion from the prompts we gave it.

Midscene.js Nearform testing playground example two

This was the assertion output:

json
{
  "pass": true,
  "thought": "The page contains three buttons with the labels 'REMOVE ELEMENT 1', 'REMOVE ELEMENT 2', and 'REMOVE ELEMENT 3'."
}

Test 2:

In this test, we decided to see how it would handle a description using colour in one of the assertions. In Playwright, the assertion requires us to use .toHaveCSS(), pass the exact RGB colour, and do a bit of inspection to find it, roughly like the sketch below.
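
A minimal sketch of that Playwright-only colour check; the page route and the RGB value are assumptions taken from inspecting the element, which is exactly the extra step Midscene.js lets us skip.

typescript
// A sketch of the Playwright-only colour assertion.
// The page route and RGB value are assumptions for illustration.
import { test, expect } from "@playwright/test"

test("Remove Element buttons are red", async ({ page }) => {
  await page.goto("https://nearform.github.io/testing-playground/") // assumed route
  await page.getByRole("button", { name: "Add Element" }).click()

  const removeButton = page.getByRole("button", { name: /remove element/i }).first()
  // The exact rgb() value has to come from inspecting the computed style.
  await expect(removeButton).toHaveCSS("background-color", "rgb(220, 53, 69)")
})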

Natural language:

typescript
test("Should clear all red buttons on page using Clear Storage button",async ({ ai, aiAssert }) => {
    await ai("Click on Add Element button 5 times")
    await aiAssert("Check that there are 5 Remove Element buttons")
    await aiAssert("Check that Remove Element buttons are red")
    await ai("Click on Clear Storage button")
    await aiAssert("Check that there are no Remove Element buttons")
})

No natural language:

typescript
test("AI: Should clear all red buttons on page using Clear Storage Button",async ({ page }) => {
    const addRemovePage = new AddRemovePage(page)
    const clickCount = 5
    const clearState = 0
    await addRemovePage.goto()
    await page.waitForLoadState()
    await addRemovePage.clickAddElement(clickCount)
    await addRemovePage.checkRemoveElementQuantity(clickCount)
    await addRemovePage.clickClearButton()
    await addRemovePage.checkRemoveElementQuantity(clearState)
})

Midscene.js handled this test very well. The “thought” process was also fascinating to see when going through the steps:

Midscene.js testing

Here was the assertion thought process:

Assertion 1: Check that there are five “remove element” buttons:

json
{
  "pass": true,
  "thought": "The page contains five buttons labeled 'REMOVE ELEMENT 1' through 'REMOVE ELEMENT 5'."
}

Assertion 2: Check that the “remove element” buttons are red:

The output shows that AI detected red buttons through the screenshot!

json
{
  "pass": true,
  "thought": "The 'Remove Element' buttons are described with a class that suggests they are styled as buttons, but the color is not explicitly mentioned in the JSON. However, the visual inspection of the screenshot confirms that the buttons are red."
}

Assertion 3: Check that there are no “remove element” buttons:

json
{
  "pass": true,
  "thought": "The page description does not mention any 'Remove Element' buttons, and the JSON content only lists 'ADD ELEMENT' and 'CLEAR STORAGE' buttons. Therefore, it seems there are no 'Remove Element' buttons present."
}

The ability of Midscene.js to detect elements by colour saves the work of finding the element’s exact RGB value; you simply describe the colour in natural language.

This led us to take it a step further and write a test based purely on the colours of the buttons, not their labels.

Test 3:

This test did not have a non-natural-language equivalent, so there is no “vanilla” Playwright test for it. Playwright recommends using locators like .getByText(), .getByTestId(), or .getByRole().

Not using the recommended locators and relying on CSS locators instead can lead to flaky tests, as CSS locators are closely tied to the implementation of the element (see the locator sketch at the end of this section).

typescript
test("AI: Can add and remove elements by button colour",async ({ ai, aiAssert}) => {
    await ai("click on green button 10 times")
    await aiAssert("Check that there are 10 red button elements on the page")
    await ai("click on 3 red buttons")
    await aiAssert("Check that there are 7 red button elements on the page")
})

Here are the assertions again:

Assertion 1:

json
{
  "pass": true,
  "thought": "The page contains 10 red button elements labeled 'REMOVE ELEMENT 1' to 'REMOVE ELEMENT 10'."
}

Assertion 2:

json
{
  "pass": true,
  "thought": "The page contains buttons labeled 'REMOVE ELEMENT 4' through 'REMOVE ELEMENT 10', which are red. This totals 7 red buttons."
}

This test suggests that Midscene.js may be a strong choice for visual tests where colours are hard requirements in an application. However, different shades of a colour may not be distinguishable by Midscene.js (e.g. bright red vs. deep red).
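
For reference, here is the locator trade-off mentioned above as it looks in plain Playwright. This is a sketch with illustrative selectors; the CSS class name and page route are assumptions, not taken from the playground’s markup.

typescript
// A sketch contrasting Playwright's recommended, user-facing locators with a
// CSS locator tied to implementation details. Selectors are illustrative only.
import { test, expect } from "@playwright/test"

test("recommended vs CSS locators", async ({ page }) => {
  await page.goto("https://nearform.github.io/testing-playground/") // assumed route

  // Recommended: resolved through role and accessible name, so it also
  // exercises the a11y of the page.
  const addByRole = page.getByRole("button", { name: "Add Element" })

  // Brittle: breaks as soon as the class name or DOM structure changes,
  // and it is the only non-AI option left when only the colour matters.
  const addByCss = page.locator("button.btn-green") // assumed class name

  await expect(addByRole).toBeVisible()
  await expect(addByCss).toBeVisible()
})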

End-to-end tests

Test 1:

Now for some real-world examples. One of the pages we tested was Nintendo.com.

typescript
test("Should favourite an item from search",async ({ page, ai, aiAssert }) => {
    await page.goto("https://www.nintendo.com")
    await page.waitForLoadState()
    await ai("click on the Log in button, then click on the Log in button again")
    await ai("switch to the active browser tab")
    await ai("click on the sign in button")
    await ai("sign in with me@gmail.com and use Games123 for the password")
    await ai("click on the Sign in button")
    await ai("Use search field to search alarmo")
    await ai("click on the Hardware button filter")
    await aiAssert("ensure that the alarmo clock is displayed in results")
    await ai("find the heart icon and click on it")
    await ai("Click on the Wish List menu item")
    await aiAssert("Ensure that the Nintendo Sound Clock: Alarmo is displayed first")
})

Notice that in the second ai step we are trying to switch to the active browser tab. During execution, the video replay did not pick up that a second browser tab had been opened; Midscene.js only plans against what is currently in the single tab.

No matter how we phrased it (“switch to the second browser tab”, “switch to the other browser tab”), Midscene.js just couldn’t figure it out.

Unfortunately, this test did not get past the second step because of the lack of multi-tab support.
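
For context, plain Playwright handles a new tab by waiting for the browser context to emit a new page. Here is a minimal sketch; the locators for Nintendo’s login flow are assumptions, but the waitForEvent pattern is standard Playwright.

typescript
// A sketch of picking up a second tab in plain Playwright.
// The Nintendo locators are assumptions; the waitForEvent pattern is standard.
import { test } from "@playwright/test"

test("continue a flow in a newly opened tab", async ({ page, context }) => {
  await page.goto("https://www.nintendo.com")

  // Start waiting for the new tab before triggering the click that opens it.
  const newTabPromise = context.waitForEvent("page")
  await page.getByRole("button", { name: /log in/i }).first().click()

  const newTab = await newTabPromise
  await newTab.waitForLoadState()

  // Continue the flow on the second tab.
  await newTab.getByRole("button", { name: /sign in/i }).click()
})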

Test 2:

Our second test was on Amazon.com. This time we went with a simple search scenario:

typescript
test("Should filter results using brand", async ({ page, ai, aiAssert }) => {
    await page.goto("https://www.amazon.com")
    await page.waitForLoadState()
    await ai("use the search bar to search for a ps5")
    await aiAssert("ensure that there are PS5 items in the results")
    await ai("click on the first Playstation checkbox")
    await aiAssert("ensure that the Playstation checkbox is checked")
})

The last assertion could not be resolved. This was the info on the assertion:

json
{
  "pass": false,
  "thought": "The PlayStation checkbox is not marked as checked in the provided JSON data."
}

We looked into the provided JSON data and, indeed, there was no indication there that the checkbox was checked, even though it was checked on the page. A seemingly simple test threw us a false negative.

json
{"element":"The first Playstation checkbox in the Brands section"},"matchedSection":[],"matchedElement":[{"content":"","rect":{"left":20,"top":604,"width":16,"height":16,"zoom":1},"center":[28,612],"page":{"page":{"_type":"Page","_guid":"page@19d32d99fd2561dbffc60c4a4102a1ae"},"pageType":"playwright"},"locator":"[_midscene_retrieve_task_id='5e4476af29']","id":"5e4476af29","attributes":{"class":".a-icon.a-icon-checkbox","nodeType":"IMG Node"},"indexId":79}],"data":null,"taskInfo":{"durationMs":0,"rawResponse":"{\"elements\":[{\"reason\":\"Reason for finding element 79: It is located in the Brands section and is the first checkbox.\",\"text\":\"\",\"id\":\"5e4476af29\"}]}"}}},"cache":{"hit":false}}

Test 3:

One last test. For this one, we challenged Midscene.js to detect something outside of the viewport: a checkbox off screen.

typescript
test("Should filter results using out of stock filter", async ({ page, ai, aiAssert }) => {
    await page.goto("https://www.amazon.com")
    await page.waitForLoadState()
    await ai("use the search bar to search for a ps5")
    await aiAssert("ensure that there are PS5 items in the results")
    await ai("scroll down until the out of stock checkbox is visible")
    await ai("click on the out of stock checkbox")
    await aiAssert("ensure that the out of stock checkbox is checked")
    await aiAssert("ensure that there are playstation items in the results")
})

The test failed to locate the out of stock checkbox in this case. The report did show this as one of the actions:

Midscene.js output test

scrollUntilBottom appeared to be the culprit. The video also showed that we were indeed scrolled to the bottom:

Midscene.js Amazon test

Despite trying different prompts, such as “scroll down 10%” and “scroll down a little”, among other attempts, we just couldn’t find the right wording for Midscene.js to see the checkbox.
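
For comparison, plain Playwright side-steps the prompt guessing entirely: acting on a locator auto-scrolls it into view, and scrollIntoViewIfNeeded() makes the step explicit. A minimal sketch follows; the filter text and URL are assumptions about Amazon’s sidebar.

typescript
// A sketch of reaching an off-screen element without prompt trial and error.
// The filter text and URL are assumptions about Amazon's sidebar.
import { test, expect } from "@playwright/test"

test("toggle an off-screen filter", async ({ page }) => {
  await page.goto("https://www.amazon.com/s?k=ps5") // assumed search URL

  const outOfStock = page.getByText(/include out of stock/i).first()

  // Playwright scrolls the element into view before interacting with it;
  // scrollIntoViewIfNeeded() makes that step explicit.
  await outOfStock.scrollIntoViewIfNeeded()
  await outOfStock.click()

  await expect(outOfStock).toBeVisible()
})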

Conclusion

As seen from the various scenarios, AI testing tools have made some encouraging progress. Tools like Midscene.js have gained traction in the testing community, but still need to evolve further before they can replace the current toolset. While Midscene.js is a considerable advancement in using AI to write tests, excessive execution times, possible security issues, and the accuracy of actions and assertions remain significant drawbacks keeping it from being a legitimate testing solution for modern web apps.

As code generation and natural language tools like Midscene.js evolve and improve, a shift away from our current testing tools like Playwright and Cypress may be coming sooner than we think. Otherwise, it may not replace our mainstay tools, but just be another tool in the toolbox.

Below, we’ve summarised the strengths and weaknesses of Midscene.js for its practicality in test automation:

Strengths:

  • Fast initial test writing

  • Easy and fast to use for simple tasks/systems

  • Ability to locate elements by colour

  • Readability

  • Possibly a good visual testing tool

Weaknesses:

  • Execution time

  • Lack of ability to use variables

  • Slow and doesn’t understand the context of more complex systems or web apps

  • Only captures what is currently in the viewport

  • Lack of multi-tab support

  • Prompts require a framework of careful, consistent wording to achieve accurate actions and assertions

  • Possible security concern with screenshots being sent to and read by OpenAI (how are they stored?)
