
September 24, 2007

ftp.m.o, shift-reloaded

For quite some time, we've been wanting to reorganize how bits and builds find their way out to the world.

It's a project we've affectionately referred to as "stage migration," and we're finally at a point where we're ready to make changes that will hopefully improve security and auditability, reliability, and expandability.

The primary goals of this move are to:

  • Ensure that builds get scanned before they're available for download
  • Audit access to the FTP tree and move to LDAP authentication for easier management
  • Move to expandable storage, backed by a highly-reliable storage platform
  • Combine ftp.m.o and archive.m.o, to make it easier to find builds

To address these goals, this week, the Build & IT teams will be reconfiguring the way the servers that make up ftp.m.o are set up.

The main changes that you'll notice are:

  • archive.m.o and ftp.m.o will be merged into one site, which will respond to both addresses. (So, no need to switch between them if you're regression testing!)
  • Directories with many builds, like firefox/nightly/, will be rearranged to allow much quicker access than the current archive.m.o. We'll be creating YYYY/MM subdirectories, and moving deliverables into those directories. For example, the builds from March of last year would be found in firefox/nightly/2006/03/, and in that directory, you would find directories with builds like 2006-03-01-04-mozilla1.8 and 2006-03-01-04-trunk. We will be doing this for all builds older than a month.
  • All files will be virus scanned before becoming available. We currently virus scan all builds, but depending on a number of factors, it was possible for unscanned builds to appear on the FTP site for a window of time; we've removed this window.
  • Interactive shell accounts on the FTP farm will be replaced with sftp-only accounts.
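
For anyone scripting against the nightly directories, the YYYY/MM rearrangement described above can be sketched roughly as follows. This is only an illustration: the function name, the `keep_days` cutoff (standing in for "older than a month"), and the guess that the fourth field of a build directory name is an hour are assumptions, not part of the actual migration tooling.

```python
import re
from datetime import date

# Build directory names in the post look like "2006-03-01-04-mozilla1.8".
# Assumption: the fields are year, month, day, an hour-like field, and a branch name.
BUILD_DIR = re.compile(r"^(\d{4})-(\d{2})-(\d{2})-(\d{2})-(.+)$")


def archived_path(base, build_dir, today, keep_days=31):
    """Return where a build directory would live under the new layout."""
    match = BUILD_DIR.match(build_dir)
    if match is None:
        raise ValueError(f"unrecognized build directory: {build_dir!r}")
    year, month, day = match.group(1), match.group(2), match.group(3)
    built = date(int(year), int(month), int(day))
    # Assumption: "older than a month" is approximated as keep_days days.
    if (today - built).days <= keep_days:
        return f"{base}/{build_dir}"
    # Older builds move into a YYYY/MM subdirectory.
    return f"{base}/{year}/{month}/{build_dir}"
```

Called with `"firefox/nightly"`, `"2006-03-01-04-mozilla1.8"`, and a date in late September 2007, this returns a path under `firefox/nightly/2006/03/`, matching the example in the list above.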

We will be doing most of this work during an outage window on Thursday, 27 September, from 10 pm to 2 am. The tree will be closed during this time, because we'll be teaching Tinderbox about the new FTP structure. Updates might be wonky on Friday, as well.

We will be moving the archived builds around earlier in the week, on Wednesday during the European day ("Go, go, gadget cf!" :-)

We're doing our best to minimize adverse changes, but because of the above, it's possible that your account could become disabled, or that data in your home directory on the current FTP machine could become unavailable. If that happens, please email build@mozilla.org and let us know, so we can fix it.

June 4, 2007

"Farewall axolotl. And we thank you."

During this Thursday's outage window, IT will be moving one of the final remaining services off of axolotl.1

Because this change will cause Tp numbers to noticeably shift, we will be closing the tree at 6 pm PDT, an hour before the outage window. Assuming the outage window goes well and the new pageload server stands up, we'll reopen the tree at midnight.

This will allow a few cycles using the new server, giving new base Tp numbers for the new pageload server.

Any questions, feel free to shoot build@m.o an email.






_________________
1 Don't know what an axolotl is? Yah... I didn't either.2
2 Why-oh-why wasn't this machine named WuperRuper?!
3 There is no 3 above, but I stole the picture from wikipedia and munged it; according to the license, I'm supposed to attribute that here; so this is the original copy.

May 16, 2007

AUS2's a movin'; can you help us lift the sofa?

AUS2, Mozilla's Automatic Update S{erver,ystem} is packing its bags!

rhelmer, morgamic, and oremj are furiously helping it stuff the update snippets into the U-Haul, as we prep to move it from OSL to the MPT colo. We've been planning to do this for some time now, but there's always been a release or firedrill or something else that prevented us from doing it.

We've got AUS2's new digs ready for it in San Jose, and we're starting testing now, before we have a housewarming party.

QA is helping us test on the client side1, and with Cesar Oliveira's help2, we're working on the Build side, testing the server directly. If all goes well, we'll point aus2.m.o at the new installation next week, on Monday, May 21st.

To ease into it, we'll be removing all of the release updates for a couple of days while we test nightly updates. So on Monday and Tuesday of next week, please let us know if you experience nightly update problems.

If everything goes well, we'll re-add the release updates to AUS2 on Tuesday the 22nd. At that time, we'll also push out the 2.0.0.4rc3 release on the beta channels via the new AUS2 installation.

This will give AUS2 a full week of move-in time before we need to push any updates for a full release.

If there are any questions or you experience issues with updates next week, please file a bug, and let us know in #build, so we can take a look at it.

The AUS2 migration team—and millions of Firefox, Thunderbird, and Sunbird users—thank you in advance.

______________________
1 If you're interested in helping, pop into #qa on IRC and ask about it...
2 Ok, ok... I confess... I had to beg Shaver to let us grab some of Cesar's awesomeness to help

April 19, 2007

The Branches That Bind

(Crossposted1 to mozilla.dev.planning, mozilla.dev.builds, mozilla.dev.apps.camino, mozilla.dev.apps.calendar, mozilla.dev.apps.seamonkey, and mozilla.dev.apps.thunderbird).

rhelmer and I have recently been discussing the release process as it relates to tagging and retrieving source from the CVS repository for Firefox and Thunderbird releases.

The way we do it now has a lot of historical vestiges and the policy was crafted around some assumptions that aren't really valid for the project and the way Firefox and Thunderbird (and other projects in the Community) use the repository these days.

So we've come up with the following proposal we'd like feedback on.

The Proposal

When code complete is reached for the current Firefox release milestone, a release branch will be cut. The standard FIREFOX_version_RC1 and _RELEASE tags will be applied, but to a MOZILLA_datespec_RELBRANCH branch that will replace the venerable FIREFOX_version_MINIBRANCH. This _RELBRANCH tag will be applied to the same files the _RC1 and _RELEASE tags have traditionally been applied to.

If respins during the release are required, developers would check the fixes into the relevant (1.8/1.8.0) branches, as always; they could then either check the code into the _RELBRANCH themselves, or we can take care of that, whichever is easier. When the code for the candidate is ready, the next _RCn tag will be applied and the _RELEASE tag will be moved.

Rinse and repeat until the release is shipped.

Any equivalent Thunderbird release (if one is required) will come off of this branch as well. (This is why the suggested name is MOZILLA_datespec_RELBRANCH and not something like FIREFOX_version_BRANCH; products other than Firefox could come off the branch, and versions other than version could come off the branch.)

This branch will then die at the end of the release cycle.
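
To make the tag bookkeeping concrete, here is a small sketch of the sequence of tag operations the proposal implies for a release with some number of respins. It is illustrative only: the function name is made up, the release branch name is passed in by the caller (the datespec format isn't specified here), and rendering the version with underscores is an assumption.

```python
def release_tag_plan(product, version, relbranch, respins):
    """List tag operations for one release: RC1 plus `respins` respins.

    `relbranch` is supplied by the caller (MOZILLA_datespec_RELBRANCH);
    the datespec format is deliberately not guessed at here.
    """
    # Assumption: dots in the version become underscores in tag names.
    prefix = f"{product.upper()}_{version.replace('.', '_')}"
    plan = [("branch", relbranch)]
    for n in range(1, respins + 2):
        # Every candidate gets its own _RCn tag on the release branch.
        plan.append(("add", f"{prefix}_RC{n}"))
        # _RELEASE is created alongside RC1 and moved for each later candidate.
        plan.append(("add" if n == 1 else "move", f"{prefix}_RELEASE"))
    return plan
```

With one respin, the plan is: cut the branch, add _RC1 and _RELEASE, then add _RC2 and move _RELEASE. All of it happens on the _RELBRANCH rather than on the development branch.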

This may be hard to visualize, so I took the liberty of adding it to our beloved branching diagram:2

The Reasoning

Currently, we apply _RCn and _RELEASE tags to the relevant development branch (i.e. MOZILLA_1_8_BRANCH) and then add _RCn tags and move the _RELEASE tag as we respin candidates. This process must be undertaken with extreme care, and while we record information when we perform the operations against the repository, because CVS does not version tag operations, that information is lost to external consumers of the source coming out of CVS (this is why we tag the _RCs individually; to track these changes).

Creating a release branch to isolate and record any changes required for a specific release is a long standing release engineering best practice. In the Mozilla Project's case, it will allow us to record the changes we make during a particular release cycle and isolate changes so that we are able to assert exactly what went into a release.

Additionally, it will make security firedrills significantly easier: the release branch can be revived at any point in time to release a fix to a particular security issue, so we can assert that a particular release is the previous release with *only* the required security changes, an issue we've run into in the past.

Finally, it will (admittedly, for me) simplify the release automation's respin support; the automation can now just track the _RELBRANCH, as opposed to attempting to fake branching by moving tags around on the regular development _BRANCHes.

The Impact

The impact to developers is quite low. In the worst case, those landing post-code complete fixes (fixes to a release candidate) will have to pull a branch to land their changes. In the best case (if a release engineer merges the change, which is TBD, but quite possible), then developers will not be impacted at all.

The impact to other projects relying on the Gecko milestone changes as well as any other consumer of Firefox-related source code should be nonexistent; the _RCn and _RELEASE tags will, from an external perspective, still exist and pulling them will do the "expected thing." We have discussed, at some point, forgoing application of the _RELEASE tag until a particular _RC is declared to be the final release, but this would really only impact consumers of the source tarball who pick up the RC tarball.

In terms of the repository, it is the case that one extra tag per release will be applied to all the files; the _MINIBRANCH will be replaced by the _RELBRANCH, so for the four files that were tagged with the _MINIBRANCH, they will have the exact same number of tags.

The Upshot

We're planning on implementing this change for the upcoming Firefox 2.0.0.4 release, so please discuss and let us know if you have any questions or concerns.

______________________________
1 Read: "spammed"
2 Sorry in advance for the unwieldy size.3
3 Click to enlarge!

April 12, 2007

Version Control System Shootout Redux Redux

Late last year, as the Mozilla project began looking at the tools we'd need for the Mozilla 2 development effort, it became clear that trusty ol' CVS, while carrying the torch for so long, would not meet our requirements for ground-breaking development.


Mortal Kombat II, Version Control Edition: The Prologue

At the last Mozilla Summit, we all met in a session and were able to narrow down the choices to two contenders for consideration: Mercurial and Bazaar.


Brendan's merge requirements bring all the version control systems to the yard, and they're like "It's better than yours..."

We were able to narrow down the decision quickly because the type of development Mozilla 2 will require dictates a model that differs from CVS so much that systems that emulate the CVS working model, like SVN, would not work well for us. We turned our attention to the big names in open source distributed systems: Git, Mercurial, Bazaar, and Monotone.

While it has made recent progress, Git was lacking in Win32 support; it was unclear that this would ever change, and if it did, it was unclear that Git-on-Win32 would ever become more than a second-class citizen. As good, performant Win32 (and Mac and Linux) support is a hard requirement, Git lost in early Kombat rounds. This is unfortunate because (as we would soon find out), lots of things that were issues in the other systems did "just work" in Git.

Monotone was ruled out relatively early as well, due to similar Win32 performance issues and not wanting to split developer resources with Monotone fix- and feature-requests.

At first, this left Mercurial standing in the ring. Ahh... at last, a simple Mozilla project decision.

But then, during the version control discussion at the Summit, Bazaar was brought up. It had a decent set of features that sounded interesting and useful to us.

We started investigating. And suddenly, there were two.

[Insert a quarter to continue...]

Continue reading "Version Control System Shootout Redux Redux" »

March 5, 2007

March Madness

Someone once told me: "[Your] release procedures are the most primitive and barbaric I've ever encountered."

After thinking about it for awhile (and getting over the initial sting of the commentary), I came to a couple of conclusions:

First, the person providing the critique had limited experience with... well... a complicated release process, one that involves releasing builds in 40+ locales and delivering updates to the desktops1 of millions of users. This person did have a lot of experience with release processes that he had [the fortune] of being able to design (and implement!) himself, and therefore seldom had to address the issues related to legacy- and/or poorly documented-infrastructure that we must deal with every day2.

Second, in some very real sense, he's right.

Our current release process is, from what I can tell, the fourth or fifth generation of approximately the third or fourth mostly-completely-different version of a Release Process (tm) that we've used. It is way more manual3 than it needs to be, has a number of especially-retarded-if-you-don't-happen-to-know-the-history steps and provisions, encapsulates processes in unnatural-and-hard-to-understand ways, generally takes too long4, and requires an extreme amount of mostly-constant-while-it's-going-on focus.

On the charitable side it... well... addresses the requirements necessary to release 40+ locales, on three major platforms, including automated updates, for... millions of users.

At the beginning of last October, rhelmer, after having done a number of releases, began looking at ways to automate the current Firefox/Thunderbird release process, warts and all.

He, with help from others, has made great strides in creating a solid framework for handling the various steps involved in putting out a release. A great side-effect of his work is that it has put our release process, warts and all, into the public repository with all our other code, so others can benefit5... and hopefully contribute.

One of our goals for the month of March is to knit together the release automation so that we can realize the seemingly-never-to-materialize-mirage of complete, end-to-end automation.

To that end, we're working fully in the open, and as I mentioned last week, this is the month for it.

I spent time today filing a bunch of bugs on the various issues we need to fix to get this to work, and made them all dependent on the tracking bug for this effort.

We also have a sorta-Joelish Google spreadsheet to help track the effort.

If you're interested in grabbing one of the bugs or generally helping out with the automation effort, please feel free to mosey into #build.

I can't promise that it will be the most glamorous work in the project, but it's some of the most-used and -useful work, and it will be appreciated—even if indirectly—by... well... millions of users.

In 40+ locales.

On three platforms.


______________________
1 The issues involved with releasing desktop software are not entirely similar to those related to releasing web-only software; that's not to say one is easier/better or harder/worse than the other, but rather that a lot of release engineers from my "generation," shall we say, have very limited desktop release experience, because they've only worked for dot-coms, and thus they were focused on website/web-app pushes.
2 As bsmedberg asked me in #build last week, "that's making things unhappy on msys... do you remember why it was necessary (shouldn't be)" and my answer was something along the lines of "No I don't, but I remember it was breaking something."
3 Read as "requiring human intervention, thus burning through people as if they were crumpled newsprint"
4 As in wall-clock time
5 Or, at least see how the magic happens.

February 23, 2007

It's been a hard day's night

The impending 2.0.0.2/1.5.0.10 release has been heretofore painful in a way reminiscent of the 2.0.0.1/1.5.0.9 release, but interestingly enough, it's been so for completely different reasons.

I'm currently at the office, signing bits and pushing them out to the mirror farm.

It's been an interesting release from a build/release standpoint: I fried my brain doing the first half of the release; rhelmer graciously saved my (and, by extension, our collective) ass(es) by stepping in last weekend to finish up another round of release candidates (which turned out to be _RELEASE candidates), but pulled the same burnout trick I pulled. So I swapped back in for him to finish it all up. There's been a bunch of handoffs that we've never really done before, but it's seemingly worked thus far.

I've done late nights before, as most of the Mozilla project is used to, but not as consistently during any release as I have during this one, except 1.5.0.2. Maybe.

Anyway, this will be a very different night from the morning I was here, working sometime around the RC1 cycle, and fell asleep, to be awoken by the finance guys coming in promptly at 8:30 am.

And it'll be different than the night I had to come in to do some binary signing, and "accidentally" had a party no one was supposed to know about, which, incidentally, "leaked" the top secret information of what the place where I spend all my time building and releasing looks like.1

Yes, this morning will be different because, armed with a pillow and snuggly blanket from home, I'm totally prepared to snag a quality 2.5 hours of shuteye, just to ensure maximal freshness for release activities.

I've learned my lesson.

I never thought I'd still be having slumber parties at this age.

But I must admit, I'm actually kinda happy that I still am.

______________________
1 Number one comment from people walking by? "You don't, by chance, like planes... do you?"

February 2, 2007

"Your mission, should you choose to accept it..."

Mozilla IT (thanks Aravind and Justin!) recently archived to tape a huge set of builds from the mozilla.org FTP staging server.1

These aren't just the builds, but also the original build artifacts, from the original tinderboxen, through what was released (and probably [hopefully?!] is still available) on the FTP server.

I asked IT to make three copies of this particular backup tape: one to store with the rest of our backups, one for the Build Team to keep offsite2, and one to keep significantly offsite.

To achieve the last requirement, Beltzner, who happened to be in town this week, will be helping me to find a safe, shady spot in the Toronto office for this little bit of Mozilla Project history.

_________________
1 Thus reclaiming a bunch of space so we could keep... releasing software.
2 Which is likely to translate to "somewhere in my apartment."

A Spoon Full of Triage...

It's a sour medicine, but if the list is ever going to become... realistic, it has to happen.

I, too, joined Coop and rhelmer in making my bug list reflect reality this evening. Now I only have five bugs!!

What ever will I do with all my free time?!

I think the weirdest part of going through my bug list was finding bugs I had actually already fixed, just never RESOLVED. D'oh!

Going through the queue also prompted me to update one of everyone's favorite bugs (with good news, even!) and file a new, necessary bug (it's that time of year... although I hear from the ReedBot it's a dup.)

I don't expect our bug list will reflect reality in the short term... but I think we can get there in the medium term. It will require being realistic about it, and it will involve gnashing of teeth in some cases, but... it'll be nice to look at an open bug list that doesn't

January 30, 2007

Downplaying the "Distributed" Dogma

Benjamin recently wrote about the current state of our effort to try to import our CVS repository to... something from this century.

His conclusion is spot on, although I think it... minimizes the head-banging he and I have been going through for a couple of weeks. My original characterization didn't turn out to be far from the truth, it seems... except, it's me and my quad-core-P4-with-4-gigs-of-RAM sitting there, bloodied and bruised on the floor, not ClearCase. ;-)

I was somewhat surprised by the number of responses to Benjamin's post that seemingly amounted to "Can't you just use Subversion? Subversion works. And if you want distributed, use SVK."

Well, the first issue with that is cvs2svn1 doesn't seem to import the Mozilla CVS tree anymore: it's hitting the error that Hg tends to hit2, and while completely dying is arguably more correct, bzr and cvs2svn 1.3.0's approach—annotating and ignoring the error, so the import can actually continue—is much more satisfying.

The second issue is that the march towards a distributed version control system really isn't about a distributed version control system; it's about using a tool that supports merging algorithms that weren't invented in the 80s, back when you never did branches anyway, because it was annoyingly difficult with the tools of the time.

During the original discussion, the main issue that limited Subversion's advancement in the race was that it didn't support any better merging functionality or techniques than its predecessor. It requires external tools to record which merges had been performed and the actual algorithms used are the old ones we all love and/or hate.

Now don't get me wrong: I use Subversion for all my personal stuff and I like it. I think it's a great improvement over CVS (which I used for years and imported from) and in many (most?) cases, I would recommend it.

But when you're going to be doing the kind of "agile"3, disruptive, reconstructive work that Mozilla 2.0 requires, at a minimum, you need a tool that makes branching and merging easy. SVN does work for me (and lots of other people and projects) because I'm not faced with, for example, renaming nsIFrame::GetPresContext, a task where a branch makes a lot of sense, and I'm going to be doing hundreds of renames.

I contend that it's not so much that we require (or necessarily even want) a "distributed" version control system. In fact, as a counter example, Perforce is a [closed source] centralized VCS that has a lot of great features, including merging primitives that are awesome. Accurev is another (although, I've never personally used it.)

We just happen to be focused on "distributed" VCSs because those are the only open source offerings that have merging facilities that handle complicated situations and get the merging stuff right. This is likely because a distributed version control system isn't worth anything if you can't merge your work back in easily and [more importantly] reliably.

I'll concede, of course, that once you have things like offline diff/commit and easy patch sharing among peers, all built-in-and-tracked-by the VCS, that's (possibly addictive) icing on the cake.

But it's not about the "distributed" part. It's about the capable-merging part.4

Breaking code apart is easy. Putting it back together is hard.

We want and need a tool that intrinsically expects, is designed to handle, and expertly supports the latter.

_____________________
1 As of 1.5.0
2 Which amounts to deleting files which don't exist on branches [possibly yet] that they're being deleted from.
3 I hate using that [buzz] word.
4 Coincidentally, Joel recently blogged about version control systems and large teams, and it seems the Windows team uses a model very similar to that of the 2.6 kernel developers, and possibly similar to what we'll end up using. It seems that easy branching (which is easy) and easy merging (which is hard) is the only real way to scale a development project into the thousands.

January 8, 2007

From the I-can't-believe-that-worked-department...

Last week, the Firefox trunk build needed to be clobbered due to a checkin that broke depend builds.

dbaron emailed us, requesting the clobber, but then he must've remembered me mentioning adding CLOBBER file support to the Tinderbox client.

The idea was that developers shouldn't have to find someone in #build to clobber builds for them; as long as the person requesting the clobber and reason can be tracked, developers should be able to clobber their own builds.

Implementation of the idea originally stalled, though, because CLOBBER file support relies on a) all the configs being public and b) all the Tinderboxen paying attention to those public configs. That's been done for a while now (thanks again Coop), but we never had a chance to finish testing the clobber file support, so I had never announced it.

Well, dbaron tried it out and—much to my surprise—it actually worked!

I also tested on the 1.8 branch later that day and it worked there, too.

So, if you need to clobber a build for a checkin, here's how you do it:

  1. After your code goes in...
  2. Check out mozilla/tools/tinderbox-configs; you probably want mozilla/tools/tinderbox-configs/[product]/[platform], e.g. mozilla/tools/tinderbox-configs/firefox/win32. If you want to clobber the 1.8 or 1.8.0 branch(es), check out those branches (-rMOZILLA_1_8_BRANCH).
  3. If there's not a CLOBBER file in that directory (case matters), cvs add one. Edit the file with the reason (and hopefully a bug number?) for the clobber request.
  4. Check it back in; include the reason in the commit message (yes, this is duplicated from the step above...)
  5. Annotate the build on Tinderbox to let others know it will (should ;-) be a clobber build, and thus will take longer to complete.
  6. If you're feeling benevolent, please stop by #build to let us know you requested a clobber, so we know to expect it
  7. Next build should be a clobber build; if it's not, please stop by #build to let us know, so we can track down why it didn't work
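
As a rough aid, steps 2 through 4 above can be written down as a command plan. This sketch only builds the list of steps and doesn't run anything. The function name and the "write" pseudo-step (standing in for editing the file by hand) are made up for illustration; the cvs subcommands are the ordinary co/add/commit, and the add and commit steps are meant to be run from inside the checked-out config directory.

```python
def clobber_request_steps(product, platform, reason, branch=None, clobber_exists=False):
    """Build the list of steps for requesting a clobber (nothing is executed)."""
    module = f"mozilla/tools/tinderbox-configs/{product}/{platform}"
    checkout = ["cvs", "co"]
    if branch:
        # For the 1.8 / 1.8.0 branches, check out the branch's configs.
        checkout += ["-r", branch]
    checkout.append(module)
    steps = [checkout]
    # "write" is a hypothetical pseudo-step: put the reason into the CLOBBER file.
    steps.append(["write", f"{module}/CLOBBER", reason])
    if not clobber_exists:
        # Only needed the first time; the file name is case sensitive.
        steps.append(["cvs", "add", "CLOBBER"])
    # The reason is repeated in the commit message, as step 4 asks.
    steps.append(["cvs", "commit", "-m", reason, "CLOBBER"])
    return steps
```

Steps 5 through 7 (annotating Tinderbox and stopping by #build) stay manual, which is rather the point: the request is tracked in the repository, and people still get told about it.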

Obviously, clobber builds take longer than depend builds, so please use this functionality only when necessary (there's a code change that requires it; nightlies are broken for some obvious reason). I'm pretty confident that we all know valid reasons for requesting clobbers. And please coordinate clobbering on IRC, if you can.

While it took us awhile to get here, this is the type of functionality I was talking about working more on this year to give developers the tools they need to be more productive/"agile," but in a framework that tracks the information we need to track.

I'm hoping this is just the beginning.

January 5, 2007

Come for the tulips and clogs, stay for the checkouts

mrz caught me in the hallway today to let me know about a late Christmas present for those doing checkouts on the other side of the globe: cvs-mirror.nl.mozilla.com!

This CVS mirror is not in production yet, but he told me about it so that he could get some help testing the mirror integrity/process.

So, if you've got a few extra seconds, you can help by doing a checkout of either a) your favorite tag or b) a specific date spec (-D) from cvs.m.o (or cvs-mirror.m.o) in one directory, then pointing your CVSROOT at cvs-mirror.nl.mozilla.com, performing the same checkout into a different directory, and running diff -r on the two. The process for logging into the mirror should be identical to the current process. (If it isn't, that's a bug.)

You should only see differences in the CVS/Root files.
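
If you'd rather not eyeball diff output, the same comparison can be sketched in a few lines of Python. This is an illustrative helper, not an official test tool; the function name is made up, and it reads whole files into memory, which is fine for a quick check but not tuned for huge trees.

```python
import os


def differing_files(dir_a, dir_b):
    """Return relative paths that differ between two checkouts, ignoring CVS/Root."""

    def snapshot(root):
        files = {}
        for parent, _dirs, names in os.walk(root):
            for name in names:
                full = os.path.join(parent, name)
                rel = os.path.relpath(full, root)
                # CVS/Root records which server was used, so it is expected to differ.
                if rel.split(os.sep)[-2:] == ["CVS", "Root"]:
                    continue
                with open(full, "rb") as handle:
                    files[rel] = handle.read()
        return files

    a, b = snapshot(dir_a), snapshot(dir_b)
    # A path counts as different if it is missing on one side or its bytes differ.
    return sorted(path for path in a.keys() | b.keys() if a.get(path) != b.get(path))
```

An empty list from `differing_files(main_checkout, mirror_checkout)` means the mirror matched for that tag or date spec; anything else is worth reporting.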

You can even go hogwild and check out different branches and make sure they're correct!

If you're on this side of the planet, and feel like helping out, you can perform the same test using cvs-mirror.sj.mozilla.com, which will someday grow up to become cvs-mirror.mozilla.org.

I want to stress that neither of these mirrors is in production yet, but you can help make sure they're production ready by helping mrz and me out.

November 27, 2006

Version Control System Shootout Redux

At the Firefox Summit last week, we convened a session death match to discuss which version control system the Mozilla Project would use going forward.

There had been some initial work1 to specify our requirements for a new system, but now with work for Mozilla 2.0 looming, it was time to get everyone in a room and make a final decision.

I've been asked a few times about the outcome of the discussion.

For those that didn't catch the Pay-per-view broadcast, here's a review of the event and decisions, complete with screencaps:

Continue reading "Version Control System Shootout Redux" »

November 11, 2006

"I'll have a build with barbacoa, guacamole, and no sour cream, please."

A new Chipotle that just happens to be equidistant between my house and work recently opened. This is extremely dangerous, because I love Chipotle.

If you've never been to one, it's much like a Subway or ToGo's in that they have a few "base items" you can order and then you talk directly to the person making your food and watch as they adorn your meal with sour cream, cheese, guacamole, or whatever.

While waiting in line last week, watching them make my burrito, I began thinking about a blog I read1 relating Starbucks' operations to asynchronous message handling patterns.

As I asked for guacamole, I thought to myself "Chipotle is kinda like open source." No, seriously!

At a "regular" restaurant, you don't get to see your food being made, and you don't get to talk to the chef. So, you have no real clue of what your options are. Maybe you'd have more of the salsa that plainly has those delicious habaneros you love.

Or maybe you'd take less of the garlic pesto, because you can't stand garlic. Maybe you want extra cheese. There are a hundred little customizations that you can confirm get done while you watch the person make your food. It's an opportunity to correct "issues" before they occur and help guide a successful outcome.

This is much better than pulling to the side of the drive thru, ripping out that burger you expressly said "No mayonnaise" four times for, taking off the bun, and cursing the fact that, yup, it has mayonnaise. Again.2

The analogy is particularly timely, because Coop just finished work on bug 337362: "Tracking bug to move tinderbox configurations into the public repo."

One of the first things I worked on was getting all of the mozconfigs and tinder-config.pls under version control. Many of the Tinderboxen had mozconfig.20051120 and mozconfig.pre-coop in them. Not only was this a mess to keep track of, it made it difficult to figure out what the settings had been for builds we shipped even a single release ago.

I originally put them into private CVS because I was in a hurry. That turned out to not be the best solution, since people (quite reasonably) wanting information about what might have changed between two releases had to ask one of us for the information. Coop has fixed that now.

They won't be entirely useful immediately, since the release configs for previous releases weren't moved out of the private repo, but as we march forward, people will be able to diff the options used between various releases, find out what changed, and re-produce their own versions of what we shipped.3

Coop also did a tremendous amount of work getting the Build:Farm page up to date, to include which tags to pull from mozilla/tools/tinderbox-configs/, if you're interested.4

The next steps I'd like to work on are standardizing the format of all the various configs, and working out how we make it easier for developers to commit nightly build config changes directly to the build system.

________________________
1 Yes, this was one of the entries in The Best Software Writing.
2 I do not mean to imply that open source software analogizes particularly accurately with only fast food. The same analogy could, I suppose, be made for a sushi restaurant, at the bar, vs. some four-star hotel restaurant.
3 Ahhhh... reproducibility. As Martha says, it's a good thing.
4 The tags may initially seem confusing. But there is a structure to them. Note to self: write up a wiki page, explaining the structure.

October 12, 2006

A case of the Mondays... on a Wednesday

Last time I did 2.0 release candidate localization builds, I ran into a problem with the tagging and checkout.

For various reasons, we only tag the locales that we ship for a particular release with the _RELEASE tag. During the build, client.mk checks out every locale specified in all-locales. This has worked beautifully for the 1.5.0.x maintenance series, since the locales we ship aren't shifting a lot, there aren't many, if any, new locales on that branch, and if you check out a directory that isn't tagged, you get... nothing. Exactly what we want, right?

Well, we repeated this process for the 2.0 RCs, and suddenly, the builds started failing with "cvs [checkout aborted]: no such tag ". This was certainly a surprise; "I had just created the tag," I kept thinking to myself.

"Am I going insane?"

I started debugging it, and was only able to reproduce it once originally. Once I got a checkout going, it seemed to work repeatably, and since I was busy with 2.0 RC 2, I didn't investigate more.

Well, it happened again with today's l10n builds. Originally, rhelmer and I thought it may have been a problem of using the wrong CVSROOT, since cvs-mirror.m.o only gets updated every few minutes, and I had just created the tag. This didn't make a huge amount of sense, as the command is run with -d, to specify the CVSROOT directly. A peek at the source confirms that -d takes precedence over $CVSROOT and CVS/Root. Then I thought maybe it was a compatibility problem. It turns out that we use CVS 1.11.2 on the client side to create release tags; maybe this is so old—it was released in 2003—that it was hiccuping with something server side?

After trying to reproduce this problem for rhelmer, I was only able to reproduce it once before it worked. Again. Something must be modifying the state server-side.

After some experimentation and more source reading, it turns out that an "optimization" introduced in the CVS 1.11 line, so-called "val-tags", is responsible for the bug.

In a nutshell, when using val-tags, CVS searches for the existence of a tag by 1) looking into val-tags, and then 2) looking at the RCS files themselves. This normally isn't a problem, except in the case where an untagged directory is requested before a tagged directory. In the case of RC2, l10n/af was not part of the release, and therefore untagged, but l10n/ar was tagged. They were checked out in that [alphabetical] order.

Running "cvs co l10n/af l10n/ar" will repeatably produce the (incorrect) "invalid tag" error until you run "cvs co l10n/ar" (or some other checkout for which the tag does exist) first. This adds the tag to the "val-tags" file, and after that point, CVS will check the val-tags file first, see that the tag does indeed exist, and then traverse all of the directories you've listed, instead of just the first one.
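To make the ordering problem concrete, here is a toy Python model of the behaviour described above. It is not the CVS source: the repository contents, the tag name EXAMPLE_RELEASE_TAG, and the function names are all hypothetical, and the fallback that looks only at the first requested directory is simply my reading of the symptoms.

```python
# Toy model of the val-tags behaviour described in the post. NOT CVS code.

class NoSuchTag(Exception):
    pass

TAG = "EXAMPLE_RELEASE_TAG"  # hypothetical tag name

# Which tags exist on the RCS files under each directory (hypothetical data).
REPO = {
    "l10n/af": set(),   # not part of the release, so untagged
    "l10n/ar": {TAG},   # shipped locale, so tagged
}

def tag_is_valid(tag, dirs, val_tags):
    """Consult the val-tags cache, then fall back to the RCS files.

    The fallback inspects only the first requested directory, which is
    the assumed cause of the spurious "no such tag" error.
    """
    if tag in val_tags:
        return True
    if tag in REPO[dirs[0]]:
        val_tags.add(tag)  # remember the tag for later checkouts
        return True
    return False

def checkout(tag, dirs, val_tags):
    """Return the directories that would actually produce files."""
    if not tag_is_valid(tag, dirs, val_tags):
        raise NoSuchTag(f"cvs [checkout aborted]: no such tag {tag}")
    # Untagged directories contribute nothing to the checkout.
    return [d for d in dirs if tag in REPO[d]]

val_tags = set()
try:
    checkout(TAG, ["l10n/af", "l10n/ar"], val_tags)
except NoSuchTag as err:
    print(err)                                           # fails: af comes first
print(checkout(TAG, ["l10n/ar"], val_tags))              # primes val-tags
print(checkout(TAG, ["l10n/af", "l10n/ar"], val_tags))   # now succeeds
```

The model also suggests the obvious workaround: check out a directory that is known to carry the tag before checking out the full locale list.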

All the gory, buggy details are in tag.c's tag_check_valid(), which, based upon my very cursory reading of the source code, still exists in CVS 1.11.22.

While reading through the source, I was surprised at the number of comments that... didn't inspire confidence:

/* FIXME: This routine doesn't seem to do any locking whatsoever
(and it is called from places which don't have locks in place).
If two processes try to write val-tags at the same time, it would
seem like we are in trouble.  */

/* FIXME: should check errors somehow (add dbm_error to myndbm.c?).*/


But, unlike so many other open source projects, at least we have...

/* warm fuzzies */
if (!really_quiet)

I think the moral of today's story is: open source is cool because you can look at the source... but if you ever do, there's a very real chance you could become very depressed.

Or scared.

Or both1,2.

***
In mostly unrelated4 news, today is National Coming Out Day.

I only remembered this because I was walking around Google's campus and saw a sign noting it.

It always seems to sneak up on me every year5.

___________________________
1 Which makes me wonder how much of our source has "gems" like that...

2 I'm sure at least someone out there is asking "So, Bigshot, where's the patch then?" Well, I spent some time looking at this... and decided that I couldn't spend any more time wrapping my brain around a function (start_recursion) that calls yet another long function (do_recursion), especially when it seems that most of the CVS devs don't either3
3 update.c says /* FIXME-twp: the arguments to start_recursion make me dizzy. This function call was copied from the update_fileproc call that follows it; someone should make sure that I did it right. */
4 Read "completely"
5 It's much like MMLRD, but on a yearly scale...

September 29, 2006

Sometimes, I don't understand VMware at all

pacifica-vm, which most of you probably know as the Firefox 2 Windows nightly build machine, has had an interesting week.

Last week, I took the machine down to back up its virtual disk image, increase the amount of RAM available to it, and, in an attempt to decrease the cycle-time, added a VMware virtual-CPU to the VM, increasing it to two.

This didn't really have the intended effect. Cycle times for both nightly and depend builds went up by about 15%.

Thinking that maybe the build system wasn't making as efficient usage of its shiny new virtual CPU as it could, I upped make's -j value from 3 to 4. This reduced cycle times... to what they were before the memory/CPU "upgrade," but also had the useful side-effect that make would hang every few builds, including most notably on nightly builds. (That is, incidentally, why nightly builds on Tuesday, Wednesday, and Thursday were all late; make kept hanging overnight with -j4.)

Finally, last night, I removed the second VCPU, but kept the extra memory and higher -j values.

That change not only made the machine start reliably producing nightlies again (or, at least, make stopped hanging), but it took the cycle time down to 40 minutes for a depend cycle, and 2ish hours for a nightly build. (Interestingly enough, that full build value seems to fluctuate anywhere from about 90 minutes to just over a couple of hours; I think this is because the trunk build machine and the 1.8 build machine are on the same VM box, and they're both starting their nightlies at the same time, which slows both of them down a bit.)

So, to recap here:

                             Nightly Build                     Depend Build
Before changes               ~2 hours                          ~1 hour
After memory/CPU "upgrade"   ~2 hours, 20 minutes              ~1 hour, 15 minutes
After adding -j4             ~2 hours; hung often              1 hour; hung often
Remove one VCPU              ~2 hours; jury still out, though  40 minutes

Things I've learned from this experience:
  • Linking Firefox, especially on Win32, takes memory. A lot of memory. In the couple of trials I paid attention to, it took around 700 megs. Seeing as the VM had 700 megs, a large part of the problem seemed to be the machine descending into swapping thrash-hell when trying to do the final link.
  • Win32 SMP = Teh suck. I had actually learned this from previous experiences in previous lives, but... a reminder is always good.
  • When you actually pay attention to VMs, and spend some time "tuning" them (which in this case amounted to creating a better match between the memory profile for the machine's task and the virtual hardware), VMs don't perform all that badly, relative to physical hardware. gaius-vm, for instance, has horrible cycle times compared to gaius, but it's not "just because it's a -vm." It's because no one's paid enough attention to it after migrating it to tune it. (No, I haven't taken what I've learned here and applied it to gaius.)
  • Once again, sometimes... VMware['s performance] continues to confuse the hell out of me.

September 19, 2006

A special update, just for Talk Like A Pirate Day

Tonight, with help from morgamic and sspitzer, we've published Firefox's first ever "major update."

This type of update is intended to pull people from (for instance) 1.5.0.7 to 2.0. We won't be immediately publishing any major updates (including that particular update path). The major update mechanism includes the ability to display EULAs and to let users ignore the major update, including forever (so they can stay on 1.5.0.x, if they want).

We'll be doing a couple of tests:

  • We've published an update for build 2006091813, win32/en-US only; most people have already updated to today's nightly, but if you'd like to try it out, you can download that build, run it, and check for updates. You should get offered a major update to "2.0mt1" ("mt" stands for "major [update] testing").

  • Within the next couple of days (hopefully tomorrow afternoon), we'll run a clobber build on the 1.8 branch, for all platforms/en-US, and publish an update to those builds that is a major update (will likely offer itself as "2.0mt2").

If you'd like to help test out the major update functionality, Seth, morgamic and I, along with millions of users, and at least a couple of pirates, would much appreciate it!

September 12, 2006

Moving on up... to that de-luxe [storage array] in the sky

You'd think 1.9 terabytes of disk storage would be enough for Mozilla's builds, but... it's not.

In fact, with all the simultaneous releases going on lately, we've had to spend a lot of time babysitting the chronically under-spaced stage.m.o.

Fortunately, we now have a replacement and a plan to get us using it.

In addition to giving us another terabyte of storage, we'll be able to reclaim some disk space on the current arrays, and provide better verification of builds going out to the mirror farm.

The plan is to migrate stage.m.o and ftp.m.o to different machines this Thursday. The downtime will start at 6 pm PDT and end at midnight.

Details of the plans can be found on the Wiki, including changes for contributors posting builds (they should be minimal).

If there are any questions or concerns, please feel free to email build@mozilla.org with them.

August 31, 2006

Subversive Subversion Conversion?

Bug 347069, "Setup a production SVN server," was recently RESOLVED as FIXED. This has spurred some confusion, so I wanted to answer some of the common questions we've received:

Is the Mozilla Project switching to Subversion?

There have been many discussions in the past few months about the version control system that the Mozilla project entrusts its code to. It's safe to say there's a desire from most of the community to thank CVS for taking good care of our source code—for the most part—and move into the 21st century.

Obviously, such a move is a big deal, and impacts the very core of the Mozilla Project: our source code. It's not a decision to be made lightly, or by a limited subset of people. It's a project-wide discussion.

The first part of these discussions has already taken place, and a set of the Project's requirements for a version control system has emerged. As the wiki page notes, some of these requirements are in conflict, so they represent a utopian ideal of version control systems.

But no decision has been made on which version control system to switch to, nor have any concrete plans (schedules, etc.) even been considered.

Why has a "production" Subversion server been set up?

As part of evaluating various options, we would want to test various systems out.

Subversion is an obvious contender and various other open source projects (Gnome, KDE, Apache) use it. The goal in filing the bug was to set up a production quality server to provide a place to experiment with setting up a robust infrastructure for not just Subversion, but another production-quality revision-control-system-that-is-not-CVS. Additionally, it allows us to gain some experience and insight into our (often implicitly expressed) requirements around things like authentication and clustering.

Please don't get hung up on the word "production." (More on that below.)

Its use merely means that we wanted this on a machine that wasn't someone's desktop and wanted to ensure that IT was included in its setup and evaluation, so their requirements for providing A-level service and support for a CVS-replacement could be (experimentally) gathered and added to the requirements list for a new revision control system.

It is important to note:

  • This server has a significantly lower SLA than most of the other IT services; basically, 9-5, M-F support. It is not considered required infrastructure. There is no oncall@ support for Subversion.
  • This server instance may get blown away and re-created multiple times. We will be playing with hook scripts, repository layout, and other options. We may decide to "start over" possibly multiple times, while exploring these options.
  • This server does not have any support services, such as Bonsai, LXR, Tinderbox coverage, fine-grained access control setup, or even backups!

Why Subversion first?

While there is a healthy debate about which system to switch to, I think everyone can agree that Subversion is one of the viable options.

Subversion was chosen as the first test system because some small portions of code that we rely on were in Subversion repositories (at OSL, for instance).

Additionally, members of the Subversion project came forward to offer assistance with deploying an initial setup, and we've successfully performed a test import of the Mozilla CVS tree into Subversion, thus meeting a (pretty important) initial requirement listed in our requirements.

What is this "production" Server going to be used for, then?

The server will be used for projects that have no dependencies on code in the CVS repository.

Currently, the guinea pig project is AMO3. Again, as they work on the project, we'll undoubtedly play with the server's options and possibly layout. Recreating the server is a likely outcome.

As this trial installation matures, we may solicit help in the form of testing from additional "segmented" projects to ensure that the infrastructure we've created/migrated meets the needs of other Mozilla projects/initiatives.

Can I get an account on the Subversion server?

If you're working on a project that is on the SVN server (currently only AMO) and want to test it out, file an IT ticket indicating your username on the CVS server, and which project you're working on.

If you'd like to help out by testing, please email which project you'd like to volunteer, where it is in the CVS repo, and what specific goals/requirements you're trying to test out to build@mozilla.org.

Note that we'll want to get AMO settled and other parts of the support infrastructure set up before we let more people into the sandbox, and that is likely to take at least a couple of months, especially given other priorities like Firefox 2.

Also, we're likely to tend towards projects that allow us to get the most coverage of disparate requirements, to maximize the utility of this pilot testing project.

Hope that clears everything up! Feel free to post any other questions you may have.

August 24, 2006

build_team++ redux

I mentioned this at the weekly status meeting, but please help me in welcoming MoCo's newest addition to the seemingly-always-strapped Build Team: TR Fullhart.

TR comes to us with years of experience in build automation and scripting, and should be a huge help in the efforts to "de-insanify" our build process and infrastructure.

Just the other day in #build, he said: < trf> really, I like scripting and programming, feel free to give me stuff

That's what I like to see.1

If you have a sec, drop by #build, and welcome trf!

Glad to have you with us, TR!

________________
1 Anyone remember when their buglist was that small?

August 22, 2006

Losing My Memory [leak test server]

For those of you watching balsa, our memory leak tinderbox, just a quick heads up that we'll be migrating to the virtualized balsa[s] very soonishly.

The new virtual memory leak tinderboxen names are balsa-trunk which, coincidentally enough, does trunkish builds, and balsa-18branch which does—you guessed it—Firefox 1.8 branch builds.

bz and I looked at the numbers for both the physical and virtual tinderboxen and they looked comparable/cogent/good. Because these are memory leak tests, the virtualized versions of these tinderboxen don't seem to be affected by virtualization, which we expected.

If you'd like to take a gander at these new tinderboxen, check out the 1.8 branch page and the trunk tinderbox page.

(They're also publishing to the Seamonkey-Ports page, but for some reason, both physical- and virtual-balsa are in various states of unhappiness... which, on the one hand, is good news in terms of validating that the VMs are coherent images of the physical machines, but bad in that... they're both broken.)

Because physical-balsa is literally sitting on the colo floor and the ever-gallant IT peeps are tripping over the machine, we'll probably shut it down within the next few days. If there's a reason why we shouldn't, please let us know.

July 24, 2006

Some UB-relief

Just a heads up that in general, Mac coverage may be spotty over the next couple of days. We're finally getting around to bug 327092, the infamous "Upgrade all the Macs so they can build universal binary builds"-bug.

atlantia and columbia will be the first two to get the OS, software, and hardware—all new Mac build machines will be RAID1'ed now—upgrades.

This should not affect core-product nightly base builds, i.e. Firefox Trunk, 1.8 and 1.8.0 will still be available, as will Thunderbird 1.8 and 1.8.0.

Builds that will be affected include XULRunner Mac builds and various l10n builds.

We'll work to get them back online as quickly as possible. This work should bring the cycle time, currently as high as some 4-odd hours, down to something more reasonable.

July 6, 2006

s/Release/Nightly/g

Just a heads up: on the Tinderboxen pages, you may see the build names starting to change from "Release" to "Nightly," going from names like "WINNT 5.0 patrocles Dep Tb-Release" to names like "MacOSX Darwin 8.7.0 bm-xserve02 Dep Fx-Nightly."

The reason for this change is to clear up confusion between the tinderboxen that produce nightly builds, and those that produce release builds. Those used to be the same thing, but now, they [can be] separate. This was more of an issue on the maintenance branches, where the "release builds" had to come from "Clbr" tinderboxen, since there already were "Release" tinderboxen, making everyone more confused.

If this breaks anyone's scripts or anything, or if you have questions, please do let me know.
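For scripts that match on build names, one way to ride out the transition is to normalize both spellings before comparing. This is only a sketch: it assumes the change is limited to a trailing "-Release" becoming "-Nightly", as in the example names above, and normalize_build_name is a hypothetical helper, not part of Tinderbox.

```python
import re

def normalize_build_name(name):
    """Rewrite a trailing "-Release" to "-Nightly"; leave other names alone.

    Assumption: the suffix is the only part of the name that changes.
    """
    return re.sub(r"-Release$", "-Nightly", name)

old_name = "WINNT 5.0 patrocles Dep Tb-Release"
new_name = "MacOSX Darwin 8.7.0 bm-xserve02 Dep Fx-Nightly"

print(normalize_build_name(old_name))
print(normalize_build_name(new_name))
```

Since the rename is rolling out slowly, normalizing on read means a script keeps working whether or not a given tinderbox has been moved yet.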

This change will take place slowly as we move tinderboxen around.

June 23, 2006

19:19 <preed> "It's only updates"

Nightly updates are broken again.

And not in a "Why can't I get my Firefox 1.0.3 Amiga build to auto-update to Firefox 2.0 alpha3" sort of way. In more of a "Every product, from every version, to any version, it's all broken!!!" sort of way.

We know, and morgamic and I have been working on it basically straight for the past couple of days.

Here's where we are, and what happened...

Continue reading "19:19 <preed> "It's only updates"" »

June 21, 2006

She sure was a good binary format. Farewell, PPC... and we thank you.

In the interest of reclaiming cycles to do more builds, I'd like to stop building nightly PPC builds for the 1.8 branch and trunk for Firefox, and possibly Thunderbird, but I have to talk to mscott first.

We've been doing both Universal builds and PPC builds for a while now, and I'm pretty sure that everyone wants Universal builds over PPC builds.

Please feel free to holler at me if this is not the case.

May 18, 2006

"First rule in government spending: why build one when you can have two at twice the price?"

rhelmer applied the first release tag to the tree—FIREFOX_1_5_0_4_RC3—since the CVS server upgrade.

It took just under four minutes to complete.

For comparison, the old CVS server?

Forty minutes. On a good day morning at 4 am.

(This is the point in the story where we all go hug justdave.)

***

Went to the Flock presentation this morning on the Browser Technology track. I've never used Flock, so it was interesting to see the demo and look at some of the features they have.

I asked what their automated update story was, since there have been two or three releases of Firefox since the release they're using (which, as I understand it, is 1.5 still).

They said "We couldn't find the code for the automatic updates stuff." Which we know about and are working on fixing. (In fact, I've been working on it at XTech!)

So, how did they solve this problem?

"We wrote our own replacement."

May 17, 2006

Microsummaries + performance data = a Sheriff's best deputy?

I found Myk's XTech talk on microsummaries very interesting.

Last night at dinner, I believe it was Axel who was suggesting that publishing tinderbox performance data as an RSS feed might offer some interesting possibilities.

At first, I didn't see the point of doing that exactly, but with microsummaries, a tree sheriff could put all the branches they're supposed to be watching in their toolbar, so they wouldn't have to scan a huge tinderbox page all day. Could maybe even whip up some XSLT (was it?) to make them change colors if the performance numbers jump outside of some pre-defined range.

Of course, the cool thing about microsummaries is you don't necessarily need the RSS feed, it sounds like.

It's an interesting idea, though... is there a more consumable format for perf data than we currently offer/publish?

Jitter Bug

One of the largest hurdles with the virtualization migration plan was this huge unknown question of whether or not the tinderboxen performing tests could be virtualized.

Now that we have one (somewhat modern) tinderbox, argo, cloned in a VM and running on physical hardware, we do have some data to look at.

"argo's" data is actually a bit confusing, because while the machine instance was cloned, not migrated, the machine's identity was cloned; that is, "argo" on May 10th was a physical machine; "argo" after May 10th was a virtual machine. And then, "argo" once again became a physical machine on May 16th, with the virtual machine copy appearing on the tinderbox page as "argo-vm."

On to the graphs!

Continue reading "Jitter Bug" »

May 6, 2006

Ahead of the Release Curve III: Virtually There

The How-Does-This-Affect-Me? Version

Various Tinderboxen will be down next week, in cycles, so we can migrate them into virtual machines.

These migrations will start on Tuesday, 9 May, and will be performed in three rounds, with about four tinderboxen per round.

During each round, these machines will be unavailable for a 24 hour period. The migrations will not affect the Bon Echo Alpha 2 release plans.

The Short Version

Step 1. Move all Tinderboxen to VMs
Step 2. ????
Step 3. PROFIT!

The Long(er) Version

Starting with planning and help from Chase, over the past five months, we've been working towards migrating all of Mozilla's Tinderboxen into virtual machines.

For those not familiar with the technology—VMware and Xen are players in the space with a lot of name recognition; Microsoft has an offering too, but it makes me giggle—virtualization offers the ability to run multiple instances of a full-blown operating system and an associated work load on the same piece of hardware. These OS instances are isolated from each other (conceptually, at least).

We've already migrated certain branches to virtualized Tinderboxen. Currently, the Firefox and Thunderbird maintenance branches are built using virtual machines. The 1.5.0.1, 1.5.0.2, and 1.5.0.3 releases have all come from VMs.

The major benefits for Mozilla, in addition to the marketing hype, include:

  • the ability to "deep-freeze" machine instances in their entirety, so we can go back and build previous releases, if necessary
  • the ability to provision entirely new machines in a couple of hours
  • the ability to "branch" machine configurations, so new software dependencies won't disturb other builds running on the same machine (because they won't be running on the same machine anymore)
  • being able to move bulky old desktop machines out of the colo, where the space costs are higher
  • being able to move bulky, old, unreliable PC hardware into RAIDed VMs, with little or no change

Virtualization is, of course, not free. There's a performance hit to allow six to seven builds to run on the same machine. But we've been using big behemoth machines that are dual-core, dual-CPU monsters with 4-8 gigabytes of RAM, and have found that the performance hit isn't as bad as we had worried it might be, and well worth it in terms of the configuration management/administration/provisioning wins.

The biggest outstanding question is: "Can we continue to run performance tests in VMs?" The short answer is "We don't yet know." There's been some discussion on mozilla.dev.builds about using the resource limitation features of VMware ESX to give strict CPU and I/O service-level requirements to each VM that executes performance testing.

We'll do that with the VMs we migrate, but to ensure the numbers are good, we'll continue to run certain tinderboxen after the migration, for comparison (currently, that list includes argo, gaius, prometheus, pacifica, btek, creature, and beast). It will be interesting to see if the performance numbers settle down, as it's been suggested that they might.

The ultimate goal with this rollout is to basically put every machine possible into a VM (*cough* are you listening, Apple?), and then work on defining reference platforms that are VMs and can be effectively versioned. When we're done, provisioning a new tinderbox should be a trivial task, involving cloning a VM and getting an IP address, as opposed to today, where it starts with a call to IT, involves installation CDs (ew!) and ends a few weeks later, with a build engineer whining that they can't get to it for a few more weeks, because they're busy doing releases.

April 20, 2006

Ahead of the Release Curve II: The Disappearing Act

One of the surprisingly common reports we get in #build are missing build reports: "The builds in latest-foobranch are five days old! Double-ewe-tee-eff!?"

Like a partner in a dysfunctional marriage, Tinderbox is an enabler of this bad behavior because after a certain amount of time, it just drops builds that haven't reported in without letting anyone1 know, so often times, by the time we receive a complaint about missing builds, they're not a day or two old, they're five or six days old.

This is unacceptable, and the Mozilla Community deserves better.

Having said that, with everything that the release team is typically doing during any given cycle, we don't have the bandwidth to sit there and monitor tinderboxen to make sure that every single one is building what it's supposed to. Often, this list changes so quickly that the documentation about what each is supposed to be building isn't even correct.

Rob and Dave have been working on fixing this, though, with—drumroll, please—automation.

Now, Nagios monitors the contents of ftp.m.o, and we get an email whenever builds in the latest-* directories for relevant branches are more than a day old. And we continue to get this email every few hours until it gets fixed.
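For the curious, a staleness check along these lines could look roughly like the Python sketch below. This is not the check Rob and Dave deployed; the function names and the directory argument are hypothetical. The only outside fact it relies on is the Nagios plugin convention that exit status 0 means OK and 2 means CRITICAL. The repeated email every few hours would come from Nagios's own notification settings, which are not modeled here.

```python
import os
import sys
import time

# Nagios plugin convention: exit status 0 is OK, 2 is CRITICAL.
OK, CRITICAL = 0, 2
MAX_AGE_SECONDS = 24 * 60 * 60  # "more than a day old"

def newest_mtime(directory):
    """Return the newest file modification time under directory, or None."""
    newest = None
    for root, _dirs, files in os.walk(directory):
        for name in files:
            mtime = os.path.getmtime(os.path.join(root, name))
            if newest is None or mtime > newest:
                newest = mtime
    return newest

def classify(newest, now, max_age=MAX_AGE_SECONDS):
    """Map the newest build time to a (status, message) pair."""
    if newest is None:
        return CRITICAL, "CRITICAL: no builds found"
    age_hours = (now - newest) / 3600
    if now - newest > max_age:
        return CRITICAL, f"CRITICAL: newest build is {age_hours:.1f} hours old"
    return OK, f"OK: newest build is {age_hours:.1f} hours old"

if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python check_latest.py /path/to/latest-trunk
    status, message = classify(newest_mtime(sys.argv[1]), time.time())
    print(message)
    sys.exit(status)
```

Keeping the age classification separate from the directory walk makes the threshold logic easy to test without touching a real FTP tree.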

This should help to cut down on having to let us know that builds haven't shown up for five (or even more) days. It's always been a reporting problem, as we've been typically able to respond to Tinderbox machine issues within 24 hours.

The lesson here is twofold:

  • Automation will save us all2. And we're working on deploying more of it.
  • Spam really is the best motivator to get stuff fixed

The next time you notice that the nightly build you were expecting to exist actually does... be sure to think of Rob and Dave.
___________________
1 Who can do anything about it... yes, these abandonings are announced in IRC, but in ways that seldom get noticed.
2 This, of course, isn't anything new... but it's nice to be in a place where we have the bandwidth to really start working on automation projects3
3 And we're working on even more projects I haven't had time to blah-g about...