• danA
    11 months ago

    I broke the home page of a big tech (FAANG) company.

    I added a call to an API created by another team. I did an initial test with 2% of production traffic + 50% of employee traffic, and it worked fine. After a day or two, I rolled out to 100% of users, and it broke the home page. It was broken for around 3 minutes until the deployment oncall found the killswitch I put in the code and turned it off. They noticed the issue quicker than I did.

    What I didn’t realise was that only some of the methods of this class had Memcache caching. The method I was calling did not. It turns out it was running a database query on a DB with a single shard and only 4 replicas, which wasn’t designed for production traffic. As soon as my code rolled out to 100% of users, the DBs immediately fell over from tens of thousands of simultaneous connections.

    Always use feature flags for risky work! It would have been broken for a lot longer if I didn’t add one and they had to re-deploy the site. The site was continuously pushed all day, but building and deploying could take 45+ mins.

    • jjjalljs@ttrpg.network
      11 months ago

      > Always use feature flags for risky work! It would have been broken for a lot longer if I didn’t add one and they had to re-deploy the site. The site was continuously pushed all day, but building and deploying could take 45+ mins.

      This reminds me of the old saying: everyone has a test environment. Some people are lucky enough to have a separate production environment, too.

        • danA
          11 months ago

          Feature flags are just checks that let you enable or disable code paths at runtime. For example, say you’re rewriting the profile page for your app. Instead of just replacing the old code with the new code, you’d do something like:

          if (featureIsEnabled('profile_v2')) {
            // new code
          } else {
            // old code
          }
          

          Then you’d have some UI to enable or disable the flag. If anything goes wrong with the new page after launch, flip the flag and it’ll switch back to the old version without having to modify the code or redeploy the site.
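
          Here is a minimal sketch of what could sit behind `featureIsEnabled`, assuming an in-memory flag store. The `flags` map, `setFlag`, and `renderProfile` are hypothetical names for illustration only; a real system would load flags from a config service or database so they can change at runtime without a redeploy.

```javascript
// Hypothetical in-memory flag store (an assumption for this sketch).
// New flags start disabled so the old code path stays the default.
const flags = new Map([['profile_v2', false]]);

// Unknown flags count as disabled, so a typo in a flag name
// falls back to the old code path instead of throwing.
function featureIsEnabled(name) {
  return flags.get(name) === true;
}

// Stand-in for the admin UI that flips a flag at runtime.
function setFlag(name, enabled) {
  flags.set(name, enabled);
}

// Hypothetical page renderer gated by the flag.
function renderProfile() {
  if (featureIsEnabled('profile_v2')) {
    return 'new profile page';
  }
  return 'old profile page';
}
```

          Calling `setFlag('profile_v2', true)` switches `renderProfile()` to the new path, and `setFlag('profile_v2', false)` acts as the killswitch.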

          Fancier gating systems let you do things like roll out to a subset of users (e.g. a percentage of all users, 50% of a particular country, or 20% of people who use the site in English) and also let you create a control group, so you can compare metrics between users in the test group and users in the control group.
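
          A rough sketch of how a percentage rollout with a control group might work, assuming users are bucketed by hashing their ID. The function names (`fnv1a`, `bucketFor`, `isInRollout`, `assignGroup`) are made up for this example and do not come from any particular gating system. Hashing the flag name together with the user ID keeps each user's assignment stable across requests while giving different flags different buckets.

```javascript
// FNV-1a 32-bit hash. This version works on UTF-16 code units rather
// than bytes, which is good enough for bucketing in a sketch.
function fnv1a(str) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Map a (flag, user) pair to a stable bucket from 0 to 99.
function bucketFor(flagName, userId) {
  return fnv1a(`${flagName}:${userId}`) % 100;
}

// A user is in the rollout if their bucket falls below the percentage.
// Raising the percentage only adds users; nobody already in drops out.
function isInRollout(flagName, userId, percent) {
  return bucketFor(flagName, userId) < percent;
}

// Split users into a test group, a control group, and everyone else.
// testPercent + controlPercent is assumed to be at most 100.
function assignGroup(flagName, userId, testPercent, controlPercent) {
  const bucket = bucketFor(flagName, userId);
  if (bucket < testPercent) return 'test';
  if (bucket < testPercent + controlPercent) return 'control';
  return 'none';
}
```

          With this shape, going from 2% to 100% is just a change to the `percent` value, and comparing metrics means comparing users where `assignGroup` returned `'test'` against those where it returned `'control'`.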

          Larger companies usually have custom in-house systems for this, but I’m sure there are some libraries that make it easy too.

          At my workplace, we don’t have any Git feature branches. Instead, all changes are merged directly to trunk/master, and new features are all gated using feature flags.