jacktli's comments | Hacker News

As Shopify? $35B market cap (SQ) vs. $86B (SHOP), SF vs. Ottawa HQ, and roughly 4,000 vs. 6,500 employees (based on LinkedIn numbers) is fairly close, but Square and Shopify are quite different businesses.


I was referring to Twitter and Square.


It’s about providing a platform for enabling global commerce. Our engineering blog has other posts that go into technical depth about many of the challenges we face and our solutions. For context, here is some info about Black Friday/Cyber Monday for us last year: https://engineering.shopify.com/blogs/engineering/preparing-...


I understand it's a hugely popular service and there is a lot that goes into something so big, but there are bigger websites that run on much, much smaller teams of developers. I just don't understand how there is even enough surface area on the app that 1,000 people could be working on it at the same time. Does that number count people working on the ops stuff or building other tooling not part of the main app?


Think of it this way: 100 different projects, each with different business goals, etc. Make some of the teams platform teams for good measure.


The limitation of Bors for us was throughput; we were more interested in testing multiple simultaneous pull requests merging at the same time rather than testing against the latest master.

With our CI times and PR volume, by the time any CI run completes, master would have drastically changed.


Serializing landing so that you can ensure passing tests is the main feature of bors. It sounds like the normal GitHub CI flow is already what you wanted.

In a bors workflow, bors is the only thing allowed to push to master, so master cannot get out of date.

The Rust project solves throughput with rollups, which are semi-automated. It would be nice if someone wrote fully automated rollup support into bors, but as far as I know no one has tried.


You don't need absolute sequentiality for bors's guarantees. You can speculatively build and try to merge multiple PRs in parallel even though only one will "win". That's fine, and not thundering-herd stupidity, if your build system is incremental so you can share work.

None of this is new at all, btw; I'm just regurgitating MVCC from PostgreSQL.


Yes, we have seen this before! The main difference is that throughput is extremely important for us, which we would not get with Bors. Also, compatibility of multiple simultaneously merging PRs is the case we are optimizing for, vs. compatibility with the current master.


If you don't mind me asking, how long does a CI run take for you? How do you manage running CI with so many merges?

Our CI takes ~8 hours of machine/VM time, which is about 35 minutes of wall time with our current testing cluster (including non-distributed parts like building). We skip certain long tests during the day, so that brings wall time down to ~13 minutes. But we also test 2-3 branches with decent churn. So even if we're only doing post-merge CI based on the current state of master, we're still getting 5+ commits fairly often.

I want to get to a world where CI is run before and after each merge with master, on every commit (or push/pull), but it seems like it would take so much more resources and infrastructure than we currently have.


There is an explanation about how we handle this case when I talk about the failure-tolerance threshold. I go deeper into this in my GitHub Universe talk where I also talk about an alternative (but costlier) solution through running parallel branches, but unfortunately that talk is not posted up yet.


Hi, Author here!

Pull requests are our unit of work, and the queue was created to support all pull requests. We do have feature flags as a tool, but we let our developers make the judgment call on how their changes should be rolled out.


Is anyone "signing off" on the deploys, or is it fully automatic? I can't really imagine it being manual 40 times per day, but I wanted to ask.

How do you handle the scenario where some developer pushes a send_me_all_the_credit_card_details() function to the code base which does something 'evil'? Do you rely on the reviewer "doing their work properly" to catch that?

I'm not saying formal "sign-off" steps in processes handle it, but some companies do have them for that reason.


We generally require 2 reviewers, and there is no sign-off on deploys. For PCI-compliant code things work a bit differently, but we try to follow this as closely as possible.


Interesting. It seems like you have a very flexible process for how to launch code, which could contribute to issues with visibility and rollbacks.

I’m curious as to why you had a queue instead of a develop branch before moving to CD? Was this to allow arbitrary commits to be launched to production rather than getting them batched by time?


A `develop` branch has several disadvantages.

You will want to make your `develop` branch the default branch in git and on GitHub, to make sure pull requests are automatically targeted properly (not doing this would be a major UX pain). However, that means that when you `git clone` a repository you are not guaranteed to get a working version.

The `develop` branch can still be broken, which is a problem that needs to be addressed. While you can revert breaking changes (or force-push the branch to a previous known-good SHA), and you can automate this process, the pull request is already marked as merged at that point. This means that developers have to open a new PR whenever that happens.

With the queue approach, pull requests remain open until we are sure they integrate properly. Also, we have the opportunity to use multiple branches to test different permutations of PRs, so we can still progress and merge some PRs even if the "happy path" that includes all PRs does not integrate properly.


Thanks, I was hoping for more of this in the blog post. Since tools are just an expression of process/policy, it's more interesting to hear about the process and why than it is about building "yet another CD tool". Appreciate the thoughtful and thorough response.

The major pain point I agree with on develop is changing the defaults to merge to that rather than master. It’s a shame this is not easier to do in git/github.

I’m not sure I agree with “develop can still be broken” as an issue that supports a queue. Whether it’s a queue or develop, one should run CI on each change to validate that merging it to master will not cause issues. It’s possible for both to be broken via the same scenarios just as it’s possible for master to be broken. Since CI runs before the branch is merged to develop and upon merge, a failure would “stop the world” and prevent more code from being merged unless that code fixes the failure.

I guess I’m not fully understanding how a queue prevents this. Since you don’t have a full picture of the state of master until something is merged from the queue, how do the CI checks in the queue prevent things that branch-based CI checks wouldn’t prevent in a “develop” branch? With branches and develop, pull requests remain open until they can be assured they merge properly with develop as well.

For clarity, I’m not arguing that a develop branch is the way to go, I think CD is much better.

Maybe I’m missing something big here but using multiple branches is permissible in other setups also. You can cherry pick a bunch of commits to a branch and test permutations but only certain branches get deployed to staging and production based on rules.

I’m glad that Shopify has found tools and a process that works. Honestly, I’m just having trouble comparing and constraining this to the other tools that are out there. The article never speaks about other approaches and whether or not they were considered and why you decided to go with a queue. It’s not clear to me if this was a case of improving the existing queue system because it was already in place or whether or not the queue was specifically chosen again because it was better than other alternatives (and why).


> I guess I’m not fully understanding how a queue prevents this. Since you don’t have a full picture of the state of master until something is merged from the queue, how do the CI checks in the queue prevent things that branch-based CI checks wouldn’t prevent in a “develop” branch? With branches and develop, pull requests remain open until they can be assured they merge properly with develop as well.

The trick of the merge queue is that it splits "merging a branch / pull request" into two steps:

1. Create a merge commit with master and your PR branch as ancestors.

2. Update the `master` ref to point to the merge commit.

Normally when you press the "Merge Pull Request" button, it does those two things in one go. By splitting it up into two distinct steps, we can run CI between step 1 and 2, and only fast-forward master if CI is green.

This means that master only ever gets forwarded to green commits. And because the SHA doesn't change during a fast-forward, all the CI statuses are retained. Only when we fast-forward will GitHub consider the pull request merged, so we don't have to "undo" pull request merges when they fail to integrate. If the merge commit fails to build successfully, we leave a comment on the PR saying that merging failed, and the PR stays open.

When we have multiple PRs in the queue, we can create merge commits on top of merge commits and run CI on each of them. When one of these CI runs comes back green, we can fast-forward master to it, potentially merging multiple pull requests at once with this approach.
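To make the two steps concrete, here's a rough sketch in a throwaway repo (all names are made up; `ci-staging` stands in for the queue's temporary branch, and the CI run itself is elided as a comment):

```shell
set -e
# Throwaway repo to demo the mechanics described above.
repo=$(mktemp -d); cd "$repo"
git init -q
git config user.email dev@example.com
git config user.name dev
main=$(git symbolic-ref --short HEAD)   # default branch, stands in for master

echo base > file.txt
git add file.txt && git commit -qm "base"

# A feature branch, standing in for a pull request.
git checkout -qb feature
echo change >> file.txt && git commit -qam "feature change"

# Step 1: create the merge commit on a staging branch; master does not move.
git checkout -q "$main" && git checkout -qb ci-staging
git merge -q --no-ff feature -m "merge feature"

# ... CI would run against ci-staging here; continue only if green ...

# Step 2: fast-forward master onto the already-tested merge commit.
git checkout -q "$main"
git merge -q --ff-only ci-staging
git log -1 --format=%s   # prints: merge feature
```

Because step 2 is `--ff-only`, master can only ever move to the exact commit CI already tested; if master moved underneath, the fast-forward fails instead of producing an untested state.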


I think I see where you are coming from. Since we use different tools, we wouldn't allow a pull to be merged if it wasn't up to date with master, which is similar but a different approach. You have to check at merge time because getting up to date could take a while and master could have changed. Jenkins does this, and it can be done in other CI/CD systems with a bit of custom code.

I’d imagine at 1,000 developers and with a monolithic codebase, you’re looking to minimize test runs both from a time and cost of runners perspective.

You may also want to look into Zuul or Bazel if cost of test suite runs is a factor in coming to this solution.


> Since we use different tools, we wouldn't allow a pull to be merged if it wasn't up to date with master, which is similar but a different approach

That wouldn't work for us due to the amount of changes we need to ship. If you rebase your branch and wait for CI to come back green, chances are another PR will have merged in the meantime, which means your rebased branch is no longer up to date with master. You end up stuck in a rebase cycle.

For this reason, we have no choice but to batch PRs, which is what the merge queue tool does. Faster CI will reduce this problem, and we're working on that as well, but it won't fully solve it.
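As a toy illustration of the batching idea (again a throwaway repo with made-up names, not the actual tool): stack merge commits on a queue branch, and if CI passes on the tip, a single fast-forward lands every PR in the batch at once.

```shell
set -e
# Throwaway repo: batch two "PRs" and fast-forward past both in one go.
repo=$(mktemp -d); cd "$repo"
git init -q
git config user.email dev@example.com
git config user.name dev
main=$(git symbolic-ref --short HEAD)   # stands in for master
echo base > base.txt && git add base.txt && git commit -qm "base"

# Two independent branches, standing in for queued pull requests.
for pr in pr1 pr2; do
  git checkout -q "$main" && git checkout -qb "$pr"
  echo "$pr" > "$pr.txt" && git add "$pr.txt" && git commit -qm "$pr change"
done

# Build the queue branch: merge pr1, then pr2 on top of it.
git checkout -q "$main" && git checkout -qb queue
git merge -q --no-ff pr1 -m "merge pr1"
git merge -q --no-ff pr2 -m "merge pr2"

# If CI is green on the tip, one fast-forward merges both PRs at once.
git checkout -q "$main"
git merge -q --ff-only queue
git log -1 --format=%s    # prints: merge pr2
git rev-list --count HEAD # prints: 5  (base + 2 changes + 2 merge commits)
```

If CI fails on the tip but passed on the "merge pr1" commit, master can still fast-forward to that earlier merge commit, which is how some PRs can land even when the full batch doesn't integrate.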


That’s understandable. I’d imagine at some point you’ll need to decouple the monolith a bit in order to work effectively as you scale. Best of luck with the challenge.


The queue is simply an automated "develop" branch.


From what I gathered in the article, that’s the case now but before the queue required manual merges.


No, even with v1, the merges weren't manual. A bot would merge for you, but directly into master.

Now the bot merges into a temporary branch that is fast-forwarded as the new master if CI validates it.


Interesting.

Would you say this is more of a decision based around the constraints of using GitHub or more of the ideal process for Shopify’s needs?

I’m curious because the article doesn’t mention the core reasons that you chose to write your own CD tool versus the other options that exist. The workflow you describe seems readily available in most tools. Perhaps the throughput was causing other options to break?


The ideal process for Shopify’s needs based on the constraints we have to work with (CI speed, deploy speed, rate of changes, etc).

