The core of the CV Partner web application was written in Ruby on Rails eight years ago and is still going strong on Rails today. We started on Rails 3.2.2 and have since upgraded through two major versions, with plans to guide it through a third in the near future.
Major version upgrades can be difficult. Software developers usually only bump the major version when there are large, backwards-incompatible changes. When you upgrade to a new major version, you should expect to have to make significant changes for things to work. When the thing you’re upgrading is the foundational layer that your product is built on, you’re in for a ride.
This post is the story of a recent problem that we tracked down to our latest major version upgrade, from Rails 4 to Rails 5. It should give some insight into the kinds of problems that can come up during a major version upgrade.
CV Partner is a web application that helps any company that spends time maintaining CVs, references, or case studies stay on top of them. We have features to help sort through this data, tailor it to specific purposes, share it, internationalise it, and export it into popular formats. One of our features is the ability to create a “proposal,” which is a collection of CVs and references that you can tailor for a specific purpose.
After you’ve created and tailored your proposal, you can export the proposal into popular formats like Word, PDF and PowerPoint.
This is where we found our problem.
When either Word or PowerPoint was selected as the export format, the file would be downloaded with a .zip file extension. Double-clicking it on macOS or Windows would cause an error. Renaming the file to have a .docx or .pptx extension would make it open just fine.
Given that the files worked fine when renamed, we knew there was no problem with their content; only the extension was wrong. While that meant it wasn’t a catastrophic problem for our users, it was very annoying, and fixing it was a high priority.
How does a proposal export work?
Visually it looks a bit like this:
We knew that this had worked before our last major version upgrade, and the tests around it had not broken, so something fishy was going on.
Things I knew for sure before getting my hands dirty in this:
And after a bit of checking I also knew:
These last three points are deeply confusing. Why do PDFs work fine but the Office formats don’t? Why are the file extensions correct in S3 but suddenly change when downloaded?
Looking closer at one of the objects in S3, I noticed that the metadata for the object looked off.
Cross-referencing with the response headers when downloading the file confirmed that the Content-Type was indeed set to `application/zip`.
This made sense, as we use temporary, secure, direct links to S3 for the file download, so S3 serves the file with whatever Content-Type is stored in the object’s metadata. Changing the metadata Content-Type to `application/vnd.openxmlformats-officedocument.wordprocessingml.document`, the MIME type for .docx files, meant the file downloaded correctly with a .docx file extension.
How does that metadata get there?
I knew I was looking for something that set the Content-Type on our S3 uploads to `application/zip` instead of `application/vnd.openxmlformats-officedocument.wordprocessingml.document`. That made it much easier to know which bits to focus on.
A quick search of our code base for “Content-Type” and “application/zip” yielded nothing of interest. Why would it? That would be too easy. The problem must be in something we depend on.
For file uploads we use a popular Ruby gem called CarrierWave. It works well with Rails, has good support, and is easy to set up. It does allow you to set an explicit Content-Type when uploading files, but we don’t use this, so I started digging into their code base to figure out if they were setting a default for us.
Our `Gemfile.lock` file told me we were using version 2.1.0 of CarrierWave, and after an hour or so of digging in their code I found the following function:
I know we don’t set a content type, so I ruled out `existing_content_type` being the problem. The next function call looked promising, though.
CarrierWave makes use of a gem called MimeMagic to figure out the content type of a file if we don’t specify it explicitly. I didn’t have to look further than MimeMagic’s README.md to know I was on to something.
Microsoft Office 2007+ formats (xlsx, docx, and pptx) are not supported by the mime database at freedesktop.org. These files are all zipped collections of xml files and will be detected as "application/zip". Mimemagic comes with extra magic you can overlay on top of the defaults to correctly detect these file types.
Perfect. They also have some instructions on what to do to enable “extra magic” that would make all of my problems disappear. Except… it didn’t work. I enabled the extra magic, as instructed, but the problem persisted. I was deflated. What was I doing wrong?
What even is this extra magic?
Ho, boy. It took a while to figure out what this did but I got there in the end.
MimeMagic performs three checks in sequence for each Office file type:
If any of these checks fails, the file is deemed not to be of that MIME type.
Lots of file types start with unique sequences of bytes that identify what’s going to be in them. For example, the first 4 bytes of a typical zip file are `50 4b 03 04` in hex, which is `PK\003\004` when written as a string, and that is what check 1 looks for. These byte sequences are called “magic numbers,” hence the name “MimeMagic.”
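To make the magic number idea concrete, here is a minimal sketch of such a check in plain Ruby. It is not MimeMagic's code, and the helper name `zip_signature?` is made up for this illustration.

```ruby
require 'stringio'

# Hypothetical helper: true when the IO starts with the zip
# signature "PK\x03\x04" (bytes 50 4b 03 04).
def zip_signature?(io)
  io.rewind
  io.read(4) == "PK\x03\x04".b
end

docx_like = StringIO.new("PK\x03\x04rest of the archive".b)
pdf_like  = StringIO.new('%PDF-1.7 rest of the document'.b)

puts zip_signature?(docx_like) # true
puts zip_signature?(pdf_like)  # false
```

Only four bytes are ever read, which is why this kind of check is cheap even for very large files.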
To understand the next two checks we need a brief refresher on what’s in a zip file. Zip files are a collection of file entries, where a file entry is made up of a small header with the name of the file and other bits of information in it (last modified time, whether it’s compressed, etc.), followed by the file contents. At the end of the zip file is a “central directory” that lists all of the file entries and where to find them in the file. When you do `unzip -l <zip_file>` on the command line, what you’re seeing is the content of that central directory.
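The layout of the file entries can be sketched in a few lines of Ruby. This is an illustration only: `local_entry` and `entry_names` are hypothetical helpers, the entries are stored uncompressed with the CRC left at zero, and the central directory is omitted, so the result is not a valid archive for real zip tools.

```ruby
# Build one stored (uncompressed) entry: a 30-byte local file header,
# then the file name, then the file contents.
def local_entry(name, data)
  header = [
    0x04034b50,     # signature, "PK\x03\x04" when written little-endian
    20,             # version needed to extract
    0,              # general purpose flags
    0,              # compression method (0 = stored)
    0, 0,           # last modified time and date
    0,              # CRC-32 (not computed in this sketch)
    data.bytesize,  # compressed size
    data.bytesize,  # uncompressed size
    name.bytesize,  # file name length
    0               # extra field length
  ].pack('Vv5V3v2')
  header + name.b + data.b
end

# Walk the local file headers from the start of the bytes and collect
# [offset, name] pairs, stopping at the first thing that is not an entry.
def entry_names(bytes)
  entries = []
  offset = 0
  while bytes.byteslice(offset, 4) == "PK\x03\x04".b
    fields = bytes.byteslice(offset, 30).unpack('Vv5V3v2')
    compressed_size = fields[7]
    name_length = fields[9]
    extra_length = fields[10]
    entries << [offset, bytes.byteslice(offset + 30, name_length)]
    offset += 30 + name_length + extra_length + compressed_size
  end
  entries
end

archive = local_entry('word/document.xml', '<w:document/>') +
          local_entry('[Content_Types].xml', '<Types/>')
entry_names(archive).each { |offset, name| puts "#{offset}: #{name}" }
```

The point to notice is that entry names are spread throughout the file, each one sitting just before its contents, so how far into the file a given name appears depends on the size of everything written before it.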
Presumably Office files all contain a file called `[Content_Types].xml`, and each type has a directory inside of it that identifies it as a PowerPoint, Excel, or Word document.
I installed MimeMagic locally and fed it some of my not-zip files to see what it thought of them. Without the extra magic, it thought they were all zip files, except for the PDFs. It identified those correctly, which explains why PDFs had been working fine all this time. With the extra magic, the results were exactly the same.
Looking at each file in a hex editor showed that the `[Content_Types].xml` file was toward the end of the file, and the file size was around 250 KB. That’s the problem. MimeMagic doesn’t scan enough of our .docx files to know that they’re .docx files, so it falls back to classifying them as .zip files!
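The failure is easy to reproduce with a small stand-in for the three checks. This is a sketch, not MimeMagic's implementation: the helper name `docx_by_prefix?` is invented, and the 5,000-byte default window is an assumption chosen for the example.

```ruby
# Hypothetical re-creation of the three .docx checks, limited to the
# first `window` bytes of the file. The default window is an assumption.
def docx_by_prefix?(bytes, window: 5_000)
  prefix = bytes.byteslice(0, window).b
  prefix.start_with?("PK\x03\x04".b) &&          # check 1: zip signature
    prefix.include?('[Content_Types].xml'.b) &&  # check 2: content types entry
    prefix.include?('word/'.b)                   # check 3: Word directory
end

signature = "PK\x03\x04".b
small = signature + 'word/document.xml' + '[Content_Types].xml'
large = signature + 'word/document.xml' + ('x' * 250_000) + '[Content_Types].xml'

puts docx_by_prefix?(small)                          # true
puts docx_by_prefix?(large)                          # false
puts docx_by_prefix?(large, window: large.bytesize)  # true
```

The second call fails only because the `[Content_Types].xml` name sits beyond the scanned window, which matches what the hex editor showed for our exports.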
Modifying MimeMagic’s extra magic to search the entire file fixes the problem. This isn’t ideal for us, though, as some proposals can have thousands or tens of thousands of CVs in them. MimeMagic will read the whole file into memory before scanning it, and we can’t justify loading multi-gigabyte proposals into memory. It’s too risky from a denial-of-service perspective. Any user could create a really big proposal and kill our worker machines.
In the end we settled on not looking for the `[Content_Types].xml` file. The zip header and the “word/” entry are good enough for our use case, and in the documents we checked there were many “word/” entries throughout the file, and always one near the top.
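A relaxed check along those lines might look like the sketch below. It is not the code we shipped: `probably_docx?` and its `limit` value are hypothetical, and the important property is that it reads a bounded prefix from the IO rather than the whole export.

```ruby
require 'stringio'

# Hypothetical relaxed check: zip signature plus a "word/" entry name,
# reading at most `limit` bytes so memory use stays bounded however
# large the export is. The limit value is an assumption.
def probably_docx?(io, limit: 5_000)
  io.rewind
  prefix = io.read(limit).to_s.b
  prefix.start_with?("PK\x03\x04".b) && prefix.include?('word/'.b)
end

signature = "PK\x03\x04".b
export = StringIO.new(signature + 'word/document.xml' + ('x' * 250_000) + '[Content_Types].xml')
slides = StringIO.new(signature + 'ppt/slides/slide1.xml')

puts probably_docx?(export) # true
puts probably_docx?(slides) # false
```

The trade-off is a looser match: any zip with a “word/” name near the top would pass, which is acceptable when the only zip-based files being uploaded are our own exports.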
How did this code ever work?
During our Rails 5 upgrade, we went from CarrierWave 1.3.0 to 2.1.0. Looking at the code for 1.3.0 we find that the content type is figured out like so:
In 1.3.0, it uses the mime-types gem to find the content type by looking at the file extension. As we were always setting the file extension correctly, this explains why it worked before the upgrade.
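For contrast, an extension-based lookup never opens the file at all. The sketch below uses a small hand-written table as a stand-in for the mime-types gem, whose registry is far larger. The helper name `content_type_for` is hypothetical.

```ruby
# Hypothetical stand-in for an extension-based lookup. Only the formats
# mentioned in this post are listed; anything else falls back to a
# generic binary type.
def content_type_for(filename)
  types = {
    '.docx' => 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
    '.pptx' => 'application/vnd.openxmlformats-officedocument.presentationml.presentation',
    '.pdf'  => 'application/pdf',
    '.zip'  => 'application/zip'
  }
  types.fetch(File.extname(filename).downcase, 'application/octet-stream')
end

puts content_type_for('proposal.docx')
puts content_type_for('proposal.PPTX')
puts content_type_for('proposal')
```

With this approach a correctly named file always gets the right type, regardless of where `[Content_Types].xml` happens to sit inside the archive.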
The key things to take away from this experience:
Fingers crossed I don’t have to write another post like this when we upgrade to Rails 6!