Midscene.js: Assessing a natural language AI testing tool
We see how Midscene.js stacks up against traditionally coded Playwright tests
With AI flooding our feeds regardless of what we do for work, it is difficult to pick out the tools that stand head and shoulders above the rest. In test engineering and software development, code generation tools are everywhere. So, staying with the AI theme, we've taken a different direction and looked at an AI tool which does NOT generate code, but instead uses natural language to carry out the tests.
Midscene.js is a tool that integrates with Playwright or Puppeteer. In this article, we’ll put it through its paces, seeing how it stacks up against traditionally coded Playwright tests in terms of speed, and observing how Midscene.js handles various test scenarios.
As AI continues to transform testing tools, this legitimate first take on using natural language to write test automation marks a promising development. Midscene.js and similar tools need to move beyond novelty status and become usable tools in the test engineer’s toolbox. To get there, major improvements to speed, and to the AI’s ability to interpret the DOM alongside screenshots, are a must.
How does Midscene.js work?
Based on the diagram below, we can see that there is no direct interaction under the hood with the underlying APIs. This means the tests can only act on what the tool derives from the frontend.
How Midscene.js measured up
Let’s get to testing Midscene.js. We used three scenarios which could represent a smaller part of an entire web app:
A simple login page
An add/remove elements page
Nintendo.com and Amazon.com for real-world, e2e scenarios
What is very exciting about this AI tool is the ability to use natural language: simply describe what the test step is trying to do, and off it goes. This brought the wow factor to another level, whereas code generation is almost standard these days.
Login
For this first example, we used Nearform’s UI Testing Playground — a great place to take a tool through some basic as well as more advanced tasks and tests on a page. The UI Testing Playground also allows anyone to sharpen their UI testing practice.
Here we have the login scenario. We’ve used this as a basic test to measure:
Ease of use/readability
Execution time
Maintainability
For comparison, here’s what it looks like in Playwright, writing the “traditional way”:
Now, in natural language, powered by AI using Midscene.js:
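A sketch of the same flow driven by natural-language prompts. The fixture wiring follows Midscene.js’s documented Playwright integration; the URL, prompts and credentials are illustrative stand-ins for the ones we actually used:

```ts
import { test as base } from '@playwright/test';
import { PlaywrightAiFixture } from '@midscene/web/playwright';
import type { PlayWrightAiFixtureType } from '@midscene/web/playwright';

// Extend Playwright's test with Midscene's ai/aiAssert helpers.
// In a real suite this would live in a shared fixture file.
export const test = base.extend<PlayWrightAiFixtureType>(PlaywrightAiFixture());

test('logs in with valid credentials (natural language)', async ({ page, ai, aiAssert }) => {
  await page.goto('https://example.com/login'); // placeholder URL

  // Note: the credentials have to appear in plain text inside the prompts.
  await ai('type "testuser" into the username field');
  await ai('type "testpassword" into the password field');
  await ai('click the login button');

  await aiAssert('the page shows that the user is logged in');
});
```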
Slightly longer, but much more readable for, say, a non-technical member of a given team.
Takeaways from this scenario:
Ease of writing/readability:
➕ Write it as you think it should execute — which is its strongest point. We found that if it didn’t work, just try another, simpler way to write it! This also supports several languages (English, French, Chinese). Definitely some wow factor here.
➖ Can look awfully long for developers to read
➖ There is also no way to hide the username and password in natural language. You need to send them directly in the prompt, whereas without Midscene.js you can keep them in a separate data file.
Execution time:
➖ Because this is powered by OpenAI and screen captures, execution was very slow. The AI needed time to plan and “think” about what it needed to do. The report and JSON output clearly show how long each task takes.
➖ Locating elements by role in Playwright is a recommended practice, and inherently tests the a11y of the web app. As Midscene.js uses screenshots, it moves away from this practice.
For comparison, running the code above for both, here were the results:
With Midscene.js: 45.8 seconds
Without Midscene.js: 1.9 seconds
Any time gained by writing in natural language is minimal once we consider how the run time of this test accumulates when it is executed on a recurring basis.
Maintainability:
➕ If this login page were refactored (test IDs, accessibility tags), the test would likely still do what it needs to do. The Playwright-only test would require some maintenance.
➖ If the test itself needs maintenance, debugging is trial and error: you change the way you prompt the test, with only the test report and JSON dumps to work with.
➖ You can also notice that the Playwright test is written in Page Object Model form, which helps keep it maintainable and scalable. Writing the test directly in the test file is the whole point of Midscene.js, so while writing in natural language is fast and easy, making the same modification across multiple test files would be a nightmare.
We also added a negative case to see how Midscene.js would handle it. We changed the password to trigger an invalid set of credentials:
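A sketch of that negative case, with the same caveats as before about the placeholder URL and prompt wording:

```ts
import { test } from './fixture'; // the Midscene-extended test from the login example

test('shows an error for invalid credentials', async ({ page, ai, aiAssert }) => {
  await page.goto('https://example.com/login'); // placeholder URL

  await ai('type "testuser" into the username field');
  await ai('type "not-the-right-password" into the password field');
  await ai('click the login button');

  await aiAssert('an error message about invalid credentials is visible');
});
```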
The JSON output in the report gives good insight into what it was “thinking”, which could be helpful for debugging:
Adding/removing elements test
In this example, we thought we’d step it up a notch while still keeping it simple. Here we are only evaluating how Midscene.js handles a slightly more complex scenario.
For context, here is the page to be tested:
Click the green button to add a red button, click a red button to remove it, and click the blue button to remove all red buttons. Below, we go through the tests we wrote for this page, with the Playwright-only version included just for comparison.
Test 1:
Natural language:
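A sketch along these lines; the URL and exact prompt wording are placeholders:

```ts
import { test } from './fixture'; // the Midscene-extended test from the login example

test('add and remove an element (natural language)', async ({ page, ai, aiAssert }) => {
  await page.goto('https://example.com/add-remove-elements'); // placeholder URL

  await ai('click the green button to add an element');
  await aiAssert('there is one red "Remove Element" button on the page');

  await ai('click the red "Remove Element" button');
  await aiAssert('there are no red "Remove Element" buttons on the page');
});
```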
No natural language, just Playwright alone:
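A comparable Playwright-only sketch, assuming the buttons expose accessible names like “Add Element” and “Remove Element”:

```ts
import { test, expect } from '@playwright/test';

test('add and remove an element', async ({ page }) => {
  await page.goto('https://example.com/add-remove-elements'); // placeholder URL

  // Assumed accessible names; the real page's labels may differ.
  const addButton = page.getByRole('button', { name: 'Add Element' });
  const removeButtons = page.getByRole('button', { name: 'Remove Element' });

  await addButton.click();
  await expect(removeButtons).toHaveCount(1);

  await removeButtons.first().click();
  await expect(removeButtons).toHaveCount(0);
});
```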
Midscene.js handled this test easily, detecting the elements on the page and making the correct assertions from the prompts we gave it.
This was the assertion output:
Test 2:
In this test, we decided to see how it would handle a description using colour in one of the assertions. In Playwright, the assertion requires us to use .toHaveCSS(), pass the RGB colour and do a tiny bit of inspection to find it.
Natural language:
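A sketch of the natural-language version, matching the three assertions described below (placeholder URL and wording):

```ts
import { test } from './fixture'; // the Midscene-extended test from the login example

test('remove buttons are red (natural language)', async ({ page, ai, aiAssert }) => {
  await page.goto('https://example.com/add-remove-elements'); // placeholder URL

  await ai('click the green button five times');
  await aiAssert('there are five "Remove Element" buttons');
  await aiAssert('the "Remove Element" buttons are red');

  await ai('click the blue button to remove all elements');
  await aiAssert('there are no "Remove Element" buttons');
});
```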
No natural language:
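And the Playwright-only sketch; the button labels are assumptions, and the RGB value is a stand-in for the one you would normally pull out of the page’s CSS with the inspector:

```ts
import { test, expect } from '@playwright/test';

test('remove buttons are red', async ({ page }) => {
  await page.goto('https://example.com/add-remove-elements'); // placeholder URL

  const addButton = page.getByRole('button', { name: 'Add Element' });
  const removeButtons = page.getByRole('button', { name: 'Remove Element' });
  const removeAllButton = page.getByRole('button', { name: 'Remove All' }); // assumed label

  for (let i = 0; i < 5; i++) {
    await addButton.click();
  }
  await expect(removeButtons).toHaveCount(5);

  // The exact colour must be inspected from the page; rgb(220, 53, 69) is a stand-in.
  await expect(removeButtons.first()).toHaveCSS('background-color', 'rgb(220, 53, 69)');

  await removeAllButton.click();
  await expect(removeButtons).toHaveCount(0);
});
```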
Midscene.js handled this test very well. The “thought” process was also fascinating to see when going through the steps:
Here was the assertion thought process:
Assertion 1: Check that there are five “remove element” buttons:
Assertion 2: Check that the “remove element” buttons are red:
The output shows that the AI detected the red buttons through the screenshot!
Assertion 3: Check that there are no “remove element” buttons:
Midscene.js’s ability to detect elements by colour saves the work of finding the element’s exact RGB value; you can simply describe the colour in natural language.
This led us to take it a step further and write a test based purely on the colours of the buttons, not their labels.
Test 3:
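A sketch of a test that refers to the buttons only by colour (placeholder URL and wording):

```ts
import { test } from './fixture'; // the Midscene-extended test from the login example

test('drive the page by colour alone', async ({ page, ai, aiAssert }) => {
  await page.goto('https://example.com/add-remove-elements'); // placeholder URL

  await ai('click the green button three times');
  await aiAssert('there are three red buttons on the page');

  await ai('click the blue button');
  await aiAssert('there are no red buttons on the page');
});
```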
This test did not have a non-natural-language equivalent, so there is no “vanilla” Playwright test for it. Playwright recommends using locators like .getByText(), .getByTestId() or .getByRole(). Not using the recommended locators and using CSS locators instead could lead to flaky tests, as they are closely tied to the implementation of the element.
Here are the assertions again:
Assertion 1:
Assertion 2:
This test suggests that Midscene.js may be a strong choice for visual tests, where colours are hard requirements in an application. However, different shades of a colour may not be distinguishable by Midscene.js (e.g. bright red vs. deep red).
End-to-end tests
Test 1:
Now for some real-world examples. One of the pages we tested was Nintendo.com.
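The shape of the test was roughly as follows; the specific link clicked and the final assertion are illustrative, but the second step is the one discussed below:

```ts
import { test } from './fixture'; // the Midscene-extended test from the login example

test('browse Nintendo.com', async ({ page, ai, aiAssert }) => {
  await page.goto('https://www.nintendo.com');

  // This action opens its result in a new browser tab.
  await ai('click the first featured game on the home page'); // illustrative step

  // The test never got past this step; Midscene.js kept planning against
  // the original tab and could not switch to the newly opened one.
  await ai('switch to the active browser tab');

  await aiAssert('the game detail page is displayed'); // never reached
});
```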
Notice that in the second ai step we are trying to switch to the active browser tab. This is because, during the execution of the test, the video replay did not pick up that a second browser tab had opened; Midscene.js only plans against what is currently on the single tab. No matter what wording we used (“switch to the second browser tab”, “switch to the other browser tab”), Midscene.js just couldn’t figure it out.
Unfortunately, this test did not get past the second step because of the lack of multi-tab support.
Test 2:
Our second test was on Amazon.com. This time we went with a simple search scenario:
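A sketch of the scenario; the search term and the particular filter checkbox are stand-ins for the ones we used:

```ts
import { test } from './fixture'; // the Midscene-extended test from the login example

test('search and filter on Amazon', async ({ page, ai, aiAssert }) => {
  await page.goto('https://www.amazon.com');

  await ai('type "mechanical keyboard" into the search bar and press enter'); // illustrative search term
  await aiAssert('the results page lists mechanical keyboards');

  await ai('check the first brand filter checkbox in the left sidebar'); // illustrative filter

  // This is the assertion that could not be resolved in our run.
  await aiAssert('the brand filter checkbox is checked');
});
```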
The last assertion could not be resolved. This was the info on the assertion:
We looked into the provided JSON data and, indeed, there was no indication that the checkbox was checked. Midscene.js failed to detect the checked state, throwing a false negative on a seemingly simple test.
Test 3:
One last test. This time we challenged Midscene.js to detect something outside of the viewport: a checkbox off screen.
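A sketch of the attempt; the search term and checkbox label are assumptions, with the checkbox in question sitting below the fold in the results page’s filter sidebar:

```ts
import { test } from './fixture'; // the Midscene-extended test from the login example

test('check a filter that is outside the viewport', async ({ page, ai, aiAssert }) => {
  await page.goto('https://www.amazon.com');

  await ai('type "mechanical keyboard" into the search bar and press enter'); // illustrative search term

  // The target checkbox is off screen until the sidebar is scrolled.
  await ai('check the "Include Out of Stock" checkbox in the filter sidebar'); // assumed label

  await aiAssert('the "Include Out of Stock" checkbox is checked');
});
```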
The test failed to locate the out of stock checkbox in this case. The report did show this as one of the actions:
scrollUntilBottom appeared to be the culprit. The video also showed that we were indeed scrolled to the bottom:
Despite giving different prompts like “scroll down 10%” and “scroll down a little”, among other attempts, we just couldn’t find the right prompt for Midscene.js to see the checkbox.
Conclusion
As seen from the various scenarios, AI testing tools have made some encouraging progress. Tools like Midscene.js have gained traction in the testing community, but still need to evolve further before they can replace the current market’s toolset. While this is a considerable advancement in using AI to write tests, excessive execution times, possible security issues and inconsistent accuracy in actions and assertions are still significant drawbacks keeping Midscene.js from being a legitimate testing solution for modern web apps.
As code generation and natural language tools like Midscene.js evolve and improve, a shift away from current testing tools like Playwright and Cypress may come sooner than we think. If not, Midscene.js may never replace our mainstay tools, but simply become another tool in the toolbox.
Below, we’ve summarised the strengths and weaknesses of Midscene.js for its practicality in test automation:
Strengths:
Fast initial writing
Easy and fast to use for simple tasks/systems
Ability to locate elements by colour
Readability
Possibly a good visual testing tool
Weaknesses:
Execution time
Lack of ability to use variables
Slow and doesn’t understand the context of more complex systems or web apps
Only captures what is currently in the viewport
Lack of multi-tab support
Needs a framework for choosing the right language in prompts to increase the accuracy of actions and assertions
Possible security concern with screenshots being read by OpenAI (where are they stored?)