CV Partner blog

Major Upgrades, Subtle Problems


The core of the CV Partner web application was written in Ruby on Rails 8 years ago and is still going strong on Rails today. We started at Rails 3.2.2 and we have since upgraded it through 2 major versions, with plans to guide it through its third major version upgrade in the near future.

Major version upgrades can be difficult. Software developers usually only bump the major version when there are large, backwards-incompatible changes. When you upgrade to a new major version, you should expect to have to make significant changes for things to work. When the thing you’re upgrading is the foundational layer that your product is built on, you’re in for a ride.

This post tells the story of a recent problem that we traced back to our latest major version upgrade, from Rails 4 to Rails 5. It should give some insight into the kinds of problems that can come up during a major version upgrade.

Background

CV Partner is a web application that helps companies that spend time maintaining CVs, references, or case studies stay on top of things. We have features to help sort through this data, tailor it to specific purposes, share it, internationalise it, and export it into popular formats. One of our features is the ability to create a “proposal,” which is a collection of CVs and references that you can tailor for a specific purpose.

Creating a proposal in CV Partner

After you’ve created and tailored your proposal, you can export the proposal into popular formats like Word, PDF and PowerPoint.

Exporting a proposal in CV Partner

This is where we found our problem.

When a user selected either Word or PowerPoint as the export format, the file would be downloaded with the .zip file extension. Double-clicking the file on macOS or Windows would cause an error. Renaming the file to have a .pptx or .docx extension would make the file open just fine.

The Tech Behind Proposal Exporting

Given that the files work fine when renamed, we knew that there was no problem with their content. It was just the extension that was wrong. While that means it wasn’t a catastrophic problem for our users, it was very annoying and fixing it was a high priority.

How does a proposal export work?

  1. A user clicks on the button to download a proposal, as shown above.
  2. Our web backend, which we call “CVP web”, handles this request by creating an export job record and saving it into Amazon DocumentDB.
  3. The web backend then adds this job record ID to a Redis queue.
  4. A separate backend, which we call “CVP worker”, polls this Redis queue and finds the new export job and starts processing it.
  5. As the worker processes the job, it updates the record in DocumentDB with its progress.
  6. While the job is processing, the user’s browser polls the web backend for the status of the job, which the web backend reads out of DocumentDB. This is what drives the progress bar in the gif in the previous section.
  7. When the job is finished, the CVP worker uploads the final result into S3, stores the S3 URL on the job record, and marks the job record as done.
  8. The user’s browser polls and finds that the job is done. It uses the newly provided URL to download the file.
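The steps above can be sketched with in-memory stand-ins. This is only an illustration of the flow, not our production code: a Hash plays the role of DocumentDB, an Array plays the role of the Redis queue, and every method name and the URL are hypothetical.

```ruby
require 'securerandom'

JOBS  = {}  # job id => job record (stand-in for DocumentDB)
QUEUE = []  # pending job ids (stand-in for the Redis queue)

# Steps 2-3: "CVP web" stores a job record and enqueues its id.
def enqueue_export(format)
  id = SecureRandom.hex(4)
  JOBS[id] = { status: 'queued', progress: 0, format: format, url: nil }
  QUEUE.push(id)
  id
end

# Steps 4-7: "CVP worker" takes a job, reports progress, stores the URL.
def work_one_job
  id = QUEUE.shift
  job = JOBS.fetch(id)
  job[:status] = 'processing'
  [25, 50, 75, 100].each { |pct| job[:progress] = pct }
  job[:url] = "https://example.invalid/exports/#{id}.#{job[:format]}"
  job[:status] = 'done'
end

# Steps 6 and 8: the browser polls the web backend for the job status.
def poll(id)
  JOBS.fetch(id).slice(:status, :progress, :url)
end

job_id = enqueue_export('docx')
before = poll(job_id)   # snapshot taken while the job is still queued
work_one_job
after = poll(job_id)    # snapshot taken once the worker has finished
puts before
puts after
```

The point of the split is that the web backend never does the heavy work itself; it only writes and reads job records.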

Visually it looks a bit like this:


Before the Rails 5 upgrade we knew that this worked, and the tests around it were still passing, so something fishy was going on.

Narrowing Down the Problem

Things I knew for sure before getting my hands dirty in this:

  1. The content of the file is as it should be, so there’s no problem creating the file.
  2. The tests we have to ensure that file creation works as expected were passing, reinforcing point 1.
  3. Our test suite doesn’t actually cover uploading to S3, because we use a local-only uploader in our tests. That was a blind spot that needed extra attention.

And after a bit of checking I also knew:

  1. The file extension when the file is generated is not .zip, it’s .docx.
  2. The file extension when the file is uploaded to S3 is also not .zip, it’s .docx.
  3. Exporting to .pdf worked correctly in all cases I tried.

These last three points are deeply confusing. Why do PDFs work fine but the Office formats don’t? Why are the file extensions correct in S3 but suddenly change when downloaded?

Looking closer at one of the objects in S3, I noticed that the metadata for the object looked off.


Cross-referencing with the response headers when downloading the file, it did indeed look like the Content-Type was set to “application/zip” when downloading.


This made sense: we use temporary, secure, direct links to S3 for the file download, so the Content-Type stored in the object’s metadata is what the browser receives. Changing the metadata Content-Type to `application/vnd.openxmlformats-officedocument.wordprocessingml.document`, the MIME type for .docx files, meant the file downloaded correctly with a .docx file extension.

How does that metadata get there?

Finding the Culprit

I knew I was looking for something that set the Content-Type on our S3 uploads to `application/zip` instead of `application/vnd.openxmlformats-officedocument.wordprocessingml.document`. That made it much easier to know which bits to focus on.

A quick search of our code base for “Content-Type” and “application/zip” yielded nothing of interest. Why would it? That would be too easy. The problem must be in something we depend on.

For file uploads we use a popular Ruby gem called CarrierWave. It works well with Rails, has good support, and is easy to set up. It does allow you to set an explicit Content-Type when uploading files, but we don’t use this, so I started digging into their code base to figure out if they were setting a default for us.

Our `Gemfile.lock` file told me we were using version 2.1.0 of CarrierWave, and after an hour or so of digging in their code I found the following function:


```ruby
def content_type
  @content_type ||=
    existing_content_type ||
    mime_magic_content_type ||
    mini_mime_content_type
end
```


I know we don’t set a content type, so I ruled out `existing_content_type` being the problem. The next function call looked promising, though.


```ruby
def mime_magic_content_type
  if path
    File.open(path) do |file|
      MimeMagic.by_magic(file).try(:type) || 'invalid/invalid'
    end
  end
rescue Errno::ENOENT
  nil
end
```
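Read together, the two methods form a fallback chain in which the first truthy value wins. A minimal sketch of that behaviour follows, with hypothetical stub bodies standing in for the real lookups. It assumes the last lookup is extension-based, as its name suggests. Only the `||` chain mirrors the CarrierWave code quoted above.

```ruby
# Hypothetical stubs; the return values model our .docx upload.
def existing_content_type
  nil # we never set a content type explicitly
end

def mime_magic_content_type
  'application/zip' # what content sniffing reports for the file
end

def mini_mime_content_type
  'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
end

# Same shape as the quoted method: the first truthy result is used.
def content_type
  existing_content_type || mime_magic_content_type || mini_mime_content_type
end

puts content_type
```

Because the sniffing step returns a non-nil string even when it guesses wrong (and `'invalid/invalid'` when it finds nothing), the final lookup is never consulted for a file that exists on disk.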


CarrierWave makes use of a gem called MimeMagic to figure out the content type of a file if we don’t specify it explicitly. I didn’t have to look further than MimeMagic’s README.md to know I was on to something.

> Microsoft Office 2007+ formats (xlsx, docx, and pptx) are not supported by the mime database at freedesktop.org. These files are all zipped collections of xml files and will be detected as "application/zip". Mimemagic comes with extra magic you can overlay on top of the defaults to correctly detect these file types.

Perfect. They also have some instructions on what to do to enable “extra magic” that would make all of my problems disappear. Except… it didn’t work. I enabled the extra magic, as instructed, but the problem persisted. I was deflated. What was I doing wrong?

What even is this extra magic?


```ruby
# Each entry is [mime_type, nested_rules]; a rule is [offset_or_range, bytes, child_rules].
[['application/vnd.openxmlformats-officedocument.presentationml.presentation', [[0, "PK\003\004", [[0..5000, '[Content_Types].xml', [[0..5000, 'ppt/']]]]]]],
 ['application/vnd.openxmlformats-officedocument.spreadsheetml.sheet', [[0, "PK\003\004", [[0..5000, '[Content_Types].xml', [[0..5000, 'xl/']]]]]]],
 ['application/vnd.openxmlformats-officedocument.wordprocessingml.document', [[0, "PK\003\004", [[0..5000, '[Content_Types].xml', [[0..5000, 'word/']]]]]]]].each do |magic|
  MimeMagic.add(magic[0], magic: magic[1])
end
```


Ho, boy. It took a while to figure out what this did but I got there in the end.

MimeMagic performs three checks in sequence for each Office file type:

  1. Check the first 4 bytes of the file to see if they’re equal to `PK\003\004`.
  2. Check the first 5000 bytes of the file looking for the string `[Content_Types].xml`.
  3. Check the first 5000 bytes of the file again looking for either `ppt/`, `xl/`, or `word/` depending on whether you’re looking for a PowerPoint, Excel, or Word document.

If any of these checks fails, the file is deemed not to be one of these MIME types.
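The three checks can be sketched in plain Ruby. This is a simplified stand-in written for this post, not MimeMagic’s actual implementation: `rule_matches?` is a hypothetical helper that works on an in-memory string, and the two sample strings are made up rather than real documents.

```ruby
DOCX_RULE = [[0, "PK\003\004", [[0..5000, '[Content_Types].xml', [[0..5000, 'word/']]]]]].freeze

# A rule is [offset_or_range, value, children]. A fixed offset must match
# exactly; a range means the value may start anywhere inside that range.
# Child rules are only tried when their parent matched.
def rule_matches?(data, rules)
  rules.any? do |offset, value, children|
    hit =
      if offset.is_a?(Range)
        window = data.byteslice(offset.begin, offset.end - offset.begin + value.bytesize)
        !window.nil? && window.include?(value)
      else
        data.byteslice(offset, value.bytesize) == value
      end
    hit && (children.nil? || rule_matches?(data, children))
  end
end

# [Content_Types].xml close to the start: all three checks pass.
near = "PK\003\004" + '[Content_Types].xml' + 'word/document.xml'
# [Content_Types].xml pushed past byte 5000: the second check fails.
far = "PK\003\004" + 'word/document.xml' + ('x' * 6000) + '[Content_Types].xml'

puts rule_matches?(near, DOCX_RULE)
puts rule_matches?(far, DOCX_RULE)
```

The second sample has a zip header and a `word/` entry, but it is still rejected because the middle check cannot see far enough into the data.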

Lots of file types start with unique sequences of bytes that identify what’s going to be in them. For example, the first 4 bytes of a zip file are always `50 4b 03 04`, which is equal to `PK\003\004`, the value being looked for in check 1. These byte sequences are called “magic numbers,” hence the name “MimeMagic.”
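As a quick check of that equivalence, the snippet below prints the bytes of the string literal in hexadecimal. The `\003` and `\004` parts are octal escapes for the byte values 3 and 4.

```ruby
magic = "PK\003\004"

# Format each byte as two hex digits, separated by spaces.
hex = magic.bytes.map { |byte| format('%02x', byte) }.join(' ')
puts hex # prints "50 4b 03 04"
```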

To understand the next two checks we need a brief refresher on what’s in a zip file. Zip files are a collection of file entries, where a file entry is made up of a small header with the name of the file and other bits of information in it (last modified time, whether it’s compressed, etc.), followed by the file contents. At the end of the zip file is a “central directory” that lists all of the file entries and where to find them in the file. When you do `unzip -l <zip_file>` on the command line, what you’re seeing is the content of that central directory.
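To make the layout concrete, here is a rough sketch that builds two uncompressed entries by hand and then walks the entry headers to find where each file name sits. Both helpers are hypothetical. The sample omits the central directory, so it is not a complete zip file, and the walker assumes sizes are recorded in each entry header (no trailing data descriptors).

```ruby
require 'zlib'

LOCAL_HEADER_SIG = "PK\x03\x04".b

# Build one "stored" (uncompressed) entry: a 30-byte header, the name, the body.
def stored_entry(name, body)
  header = [
    0x04034b50,        # entry header signature, "PK\x03\x04" on disk
    20,                # version needed to extract
    0,                 # general purpose flags
    0,                 # compression method: 0 = stored
    0, 0,              # last modified time and date
    Zlib.crc32(body),  # CRC-32 of the contents
    body.bytesize,     # compressed size
    body.bytesize,     # uncompressed size
    name.bytesize,     # file name length
    0                  # extra field length
  ].pack('VvvvvvVVVvv')
  header + name.b + body.b
end

# Walk the entry headers from the start; map each name to its byte offset.
def entry_offsets(data)
  offsets = {}
  pos = 0
  while data.byteslice(pos, 4) == LOCAL_HEADER_SIG
    comp_size, _uncomp_size, name_len, extra_len =
      data.byteslice(pos + 18, 12).unpack('VVvv')
    name = data.byteslice(pos + 30, name_len).force_encoding('UTF-8')
    offsets[name] = pos + 30
    pos += 30 + name_len + extra_len + comp_size
  end
  offsets
end

archive = stored_entry('word/document.xml', 'x' * 6000) +
          stored_entry('[Content_Types].xml', '<Types/>')
offsets = entry_offsets(archive)
puts offsets
```

An entry’s name only appears after everything that was written before it, so a single large entry early in the archive pushes every later name deeper into the file.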

Presumably Office files all contain a file called `[Content_Types].xml`, and each type has a directory inside of it that identifies it as a PowerPoint, Excel, or Word document.

I installed MimeMagic locally and fed it some of my not-zip files to see what it thought of them. Without the extra magic, it thought they were all zip files, except for the PDFs. It identified those correctly, which explains why PDFs had been working fine all this time. With the extra magic, the results were exactly the same.

Looking at each file in a hex editor showed that the `[Content_Types].xml` entry was toward the end of the file, and the file size was around 250 KB, so the entry sat far beyond the first 5000 bytes. That’s the problem. MimeMagic doesn’t scan enough of our .docx files to know that they’re .docx files, so it falls back to classifying them as .zip files!

Fixing the Problem

Modifying MimeMagic’s extra magic to search the entire file fixes the problem. This isn’t ideal for us, though, as some proposals can have thousands or tens of thousands of CVs in them. MimeMagic will read the whole file into memory before scanning it, and we can’t justify loading multi-gigabyte proposals into memory. It’s too risky from a denial-of-service perspective. Any user could create a really big proposal and kill our worker machines.
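For contrast, a bounded read keeps memory use constant no matter how large the proposal is. The sketch below is illustrative only: `head_bytes` is a hypothetical helper, and the 5000-byte limit simply mirrors the range used in the extra magic.

```ruby
require 'tempfile'

# Read at most `limit` bytes from the start of the file, in binary mode.
# File#read returns nil for an empty file, hence the fallback.
def head_bytes(path, limit = 5_000)
  File.open(path, 'rb') { |file| file.read(limit) } || ''.b
end

# Write a 1 MB sample file, then read back only its head.
head = Tempfile.create('proposal') do |file|
  file.binmode
  file.write('x' * 1_000_000)
  file.flush
  head_bytes(file.path)
end

puts head.bytesize # prints 5000
```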

In the end we settled on not looking for the `[Content_Types].xml` file. The zip header and the “word/” entry are good enough for our use case. In the documents we checked there were many “word/” entries throughout the file, and always one near the top.
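A relaxed overlay along those lines might look like the sketch below. It reuses the rule format of the extra magic shown earlier and drops the `[Content_Types].xml` level. Applying the same change to the PowerPoint and Excel rules is an assumption on my part, and `RELAXED_OFFICE_MAGIC` is a name invented for this example. It also assumes that rules registered with `MimeMagic.add` are consulted before the default zip rule.

```ruby
OOXML = 'application/vnd.openxmlformats-officedocument'

# Only two checks remain per type: the zip header at offset 0, then the
# type-specific directory name somewhere in the first 5000 bytes.
RELAXED_OFFICE_MAGIC = [
  ["#{OOXML}.presentationml.presentation", [[0, "PK\003\004", [[0..5000, 'ppt/']]]]],
  ["#{OOXML}.spreadsheetml.sheet", [[0, "PK\003\004", [[0..5000, 'xl/']]]]],
  ["#{OOXML}.wordprocessingml.document", [[0, "PK\003\004", [[0..5000, 'word/']]]]]
].freeze

# Register the rules only when the mimemagic gem has been loaded, so the
# data above can be inspected without it.
if defined?(MimeMagic)
  RELAXED_OFFICE_MAGIC.each { |type, magic| MimeMagic.add(type, magic: magic) }
end
```

The trade-off is a slightly weaker check: any zip file with a `word/` entry near the top would now be reported as a Word document.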

One Last Thing…

How did this code ever work?

During our Rails 5 upgrade, we went from CarrierWave 1.3.0 to 2.1.0. Looking at the code for 1.3.0 we find that the content type is figured out like so:


```ruby
def content_type
  return @content_type if @content_type
  if @file.respond_to?(:content_type) and @file.content_type
    @content_type = @file.content_type.to_s.chomp
  elsif path
    @content_type = ::MIME::Types.type_for(path).first.to_s
  end
end
```

In 1.3.0, it uses the mime-types gem to find the content type by looking at the file extension. As we were always setting the file extension correctly, this explains why it worked before the upgrade.
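The difference between the two approaches is easy to see with a toy extension lookup. The table below is a tiny hypothetical stand-in for the registry that the mime-types gem ships, and `type_from_extension` is not part of any library.

```ruby
EXTENSION_TYPES = {
  '.docx' => 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
  '.pptx' => 'application/vnd.openxmlformats-officedocument.presentationml.presentation',
  '.pdf'  => 'application/pdf',
  '.zip'  => 'application/zip'
}.freeze

# Look only at the file name; the file contents are never opened.
def type_from_extension(path)
  EXTENSION_TYPES.fetch(File.extname(path).downcase, 'application/octet-stream')
end

puts type_from_extension('proposal.docx')
puts type_from_extension('proposal.pdf')
```

A lookup like this never opens the file, so where `[Content_Types].xml` sits inside the archive has no effect on the result.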

Conclusion

The key things to take away from this experience:

  1. Major upgrades always carry risk, but so does not doing them. The further you fall behind, the harder it is to catch up. Try to keep your dependencies up to date.
  2. The more different your testing environment is to your production environment, the more likely you are to have this kind of subtle problem. Try to keep your testing environment as close to your production environment as possible.

Fingers crossed I don’t have to write another post like this when we upgrade to Rails 6!
