End to end and smoke tests give a really valuable angle on what the app is doing and can warn you about failures before they happen. However, because they’re working with a live app and a live database over a live network, they can introduce a lot of flakiness. Beyond just changes to the app, different data in the environment or other issues can cause a smoke test failure.
How do you handle the inherent flakiness of testing against a live app?
When do you run smokes? On every phoenix branch? Pre-prod? Prod only?
Who fixes the issues that the smokes find?
I think the reality is that there are lots of different levels of tests, we just don’t have names for all of them.
Even unit tests have levels. You have unit tests for a single function or method in isolation, then you have unit tests for a whole class that might set up quite a bit more mocks and test the class’s contract with the rest of the system.
Then there are tests for a whole module, that might test multiple classes working together, while mocking out the rest of the system.
A step up from that might be unit tests that use fakes instead of mocks. You might have a fake in-memory database, for example. That enables you to test a class or module at a higher level and ensure it can solve more complex problems and leave the database in the state you expect it in the end.
A step up from that might be integration tests between modules, but all things you control.
Up from that might be integration tests or end-to-end tests that include third-party components like databases, libraries, etc. or tests that bring up a real GUI on the desktop - but where you still try to eliminate variables that are out of your control like sending requests to the external network, testing top-level window focus, etc.
Then at the opposite extreme you have end-to-end tests that really do interact with components you don’t have 100% control over. That might mean calling a third-party API, so the test fails if the third-party has downtime. It might mean opening a GUI on the desktop and automating it with the mouse, which might fail if the desktop OS pops up a dialog over top of your app. Those last types of tests can still be very important and useful, but they’re never going to be 100% reliable.
I think the solution is to have a smaller number of those tests with external dependencies, don’t block the build on them, and look at statistics. Sound an alarm when a test fails multiple times in a row, but not for every failure.
Most of the other types of tests can be written in a way to drive flakiness down to almost zero. It’s not easy, but it can be doable. It requires a heavy investment in test infrastructure.
My team has just decided to make working smokes a mandatory part of merging a PR. If the smokes don’t work on your branch, it doesn’t merge to main. I’m somewhat conflicted - on one hand, we had frequent breaks in the smokes that developers didn’t fix, including ones that represented real production issues. On the other, smokes can fail for no reason and are time consuming to run.
We use playwright, running on github actions. The default free tier runner has been awful, and we’re moving to larger runners on the platform. We have a retry policy on any smokes that need to run in a step by step order, and we aggressively prune and remove smokes that frequently fail or don’t test for real issues.
Shouldn’t you find out, why the smoke tests keep failing? It’s not a good sign, if you can’t even guarantee stability in a controlled environment.
It’s not a fully controlled environment, that is the point of smokes.
How do you not fully control the environment your PRs are tested in?
dependencies.
As always I would say there is a huge “it depends”.
For context, I am part of a small team of engineers, working on a relatively new product, we have continuous deployment setup for our release branches. We prefer many small PRs, think at least a PR a day per engineer.
I am responsible for setting up a new e2e test suite right now, so it’s possible I reconsider later on. But, there are a couple lessons learned from our previous iteration.
- Our pipeline was slow (20-30 mins), flakiness was a no go. Decreasing pipeline time increased tolerance for flakiness.
- Flakiness on the pipeline translated to flakiness on the production instances. When we started caring for those our sentry got much more happy.
- We didn’t have the time to go back and fix issues, so we stopped having nightlies. If it’s important enough we should block merging on main and fix it.
I think/hope that the wording you used was a mistake.
End to end tests do not introduce flakiness, but uncover it.
Whenever we discover flakiness, we try to fix it immediately. When there is no time for the fix (which is more than often the case) we create a ticket that vanishes in the backlog.
For a long time the company I currently work at didn’t have end to end tests save unit tests for a lot of their code.
Through a push of newcomers we finally managed to add end to end tests to many more parts of the code. However, these are still not properly documented. Some end to end tests overlap and some only cover a small part of one larger functionality. That is why we often find bugs that were introduced by us, because we had no end to end tests covering those parts.
We used to run end end tests only every night on the whole product. They usually take an hour or more to complete. This takes too long to run them before each merge. However, we have them organized enough such that for sub-product A we can run the sub-product A end to end tests only before each merge where we assume that we did only touch code affecting sub-product A. In case the code changes affected some other parts of the product, the nightly tests help us out. We are doing this in my team for a long while now. But we just recently started to establish this procedure in the other teams of the company, too.
My experience with E2E testing is that the tools and methods necessary to test a complex app are flaky. Waits, checks for text or selectors and custom form field navigation all need careful balancing to make the test effective. On top of this, there is frequently a sequentiality to E2E tests that causes these failures to multiply in frequency, as you’re at the mercy of not just the worst test, but the product of every test in sequence.
I agree that the tests cause less flakiness in the app itself, but I have found smokes inherently flaky in a way that unit and integration tests are not.
Okay I must admit that I do not have much experience with smoke and integration tests. We run end to end tests only and skip running the other two types entirely. They would be covered by the end to end tests anyways.
Perhaps I am lucky in that our software doesn’t require us to use many waits at all. Most things are synchronous and those that are not mostly have API endpoints where the status of the process an be safely queried, i.e. a
wait(1000)
and hope for the best is not necessary, but ratherdo wait(1000) until isFinished()
.And yes, for us it is also a mess of errors popping up when one step in a pipeline fails, where many tests rely on this single step. I don’t know whether there is a way to approach this issue neatly. This is surely a chance in the market to be taken.
End-to-end tests are basically non-deterministic state machines. Flakiness can come from any point in the test: bad tests, bad state management, conflicting tests, network hiccups, etc.
Your goal is to reduce every single point of that flakiness. Just make sure you keep track of it. Sometimes flakiness in tests is really pointing at flakiness in the product itself.
Some things that can help reduce that flakiness:
- Dedicated network
- No external dependencies
- Polling instead of static waits/sleeps
Polling is certainly useful, but at some point introducing reliability degrades effectiveness. I certainly want to know if the app is unreachable over the open internet, and I absolutely need to know if a partner’s API is down.
I’m a fan of randomizing the test order. That helps catch ordering issues early.
Also, it’s usually valuable to have E2E tests all be as completely independent as possible so it’s impossible for one to affect another. Have each one spin up the whole system, even though it takes longer. Use more parallelism, use dozens of VMs each running a fraction of the tests rather than trying to get the sequential time down.
The problem with randomising the test order is that it compromises the reproducibility of results. If there are ordering issues, then your tests will sometimes fail and sometimes pass, but will developers look at that and think “ah there must be an ordering issue” or will they think “damn these flaky tests, guess I’d better rerun the pipeline”?
Wherever possible, this is a good idea. The campsite rule - tests don’t touch data they didn’t bring with them - helps as well.
However, many end to end tests exist as a pipeline, especially for entities that are core to the business function of the app. Cramming all sequentiality into single tests will give you all the problems described, but in a giant single test that you need to fish out the result for.
You run E2E test before each merge. So, you don’t merge very often?
How about running an integration test before each merge instead of a full fledged E2E and mocking out external dependencies (other services) during the test, then do E2E testing on a schedule like nightly?
I prefer it this way, because mocking out external dependencies cut out network instability and bugginess from dependencies. So, we can merge faster. Agree that test scenarios are overlapping, and if your E2E is very stable then it is probably not worth it, but unfortunately it’s not so stable in my environment.
Luckily, our e2e tests are pretty stable. And unfortunately we are not given the time to write integration tests as you describe. The good thing would be that with these mocks we were then also be able to load test single services instead of the whole product.
We merge multiple times a day and run only those e2e tests we think are relevant. Of course, this is not optimal and it is not too rare that one of the teams merges a regression, where one team or more talented at that than the others.
You see, we have issues and we realize we have them. Our management just thinks these are not important enough to spend time on writing integration tests. I think money and developer time are two of the reasons, but the lack of feature documentation, the lack of experts for parts of the codebase (some already left for another employer), and the amount of spaghetti code and infrastructure we have are other important reasons.
Reading the 3rd paragraph and I see myself 😄. Glad that you and the team managed to add another layer of testing successfully.
Having the e2e smokes be a requirement for PR merge is frustrating. But I’ve been on the fence before with this on my own teams. It’s enticing to have a completely “clean” main branch that has not been infested with regressions caused by a PR.
It also gives you confidence in the crummy cases where you need to push a fix to prod right now.
If the e2e’s flap too much, then it is not an option. I’ve tried it and it lasts one sprint before we nix it. It’s just too frustrating and development comes to a standstill.
We have a retry policy on any smokes that need to run in a step by step order, and we aggressively prune and remove smokes that frequently fail or don’t test for real issues.
I actually think that’s the best way to handle it.
Who fixes the issues that the smokes find?
On teams I’ve been on, typically a junior dev. Sounds crummy, but it actually gives them more experience with the product/code. I have been that junior dev before and I found it a positive experience.
I find it interesting that the junior would fix these issues. In my team we established a fictive role which each developer takes in sprint rotation. The developer in this role would then handle these cases and also drive the investigation of incoming support requests / bug reports, playing third level support in a sense.
How do you handle support requests in your team/company? Is it the developers or do you have a dedicated team for that?
My org has issues with e2e, but we keep them because they usually inform that something, somewhere isn’t quite right.
Our CI/CD pipeline is configured to automatically re-run a build if a test in the e2e suite fails. If it fails a second time, then it sends up the usual alerts and a human has to get involved.
In addition to that, we track transient failures on the main branch and have stats on which ones are the noisiest. Someone is always peeling the noisiest one off the stack to address why it’s failing (usually time zones or browser async issues that are easy to fix).
It’s imperfect, and still results in a lot of developers just spam retrying builds to “get stuff done”, but we’ve decided the signal to noise ratio is good enough that we want to keep things the way they are.
It depends, but generally as little as possible - nobody has time to inspect flakiness cause on a frequent basis when mindset is to move fast. Within team we identified components and degree of control we have over them to control flakiness
deleted by creator