Monday, August 05, 2013

UNO IllegalArgument during import phase: Source file cannot be read. URL seems to be an unsupported one.

Creating Word documents in a fully Linux-based environment can be tricky. One common trick is to save HTML with a .doc extension, which Word will happily open, but it has some drawbacks - for example, if you make changes and try to save the document again, you can't save it as a .doc.

So we have a multi-step chain of services set up on our Quill Platform to allow us to render our articles as genuine Word documents.

In the Rails app:

  1. User clicks 'download as word doc' on an article - this results in a request like this: GET /articles/1234.doc
  2. Check for an existing Word doc representing the correct version of the requested article. If it's there, serve it back as a document. If not:
    1. Render the article as an HTML string, and save a WordDocumentConversion object which encapsulates the resulting HTML
    2. Submit a job to Resque to perform the actual conversion.
    3. Redirect the user, and flash a message saying "That document might take a minute or two to generate - we'll email it to you when it's ready"

In the Resque job:

  1. Load the WordDocumentConversion
  2. POST the saved HTML as a file input to our DocumentConverter web service - a little Sinatra app which provides a RESTful endpoint around LibreOffice
  3. Save the response as a file with a .doc extension, in binary mode, and email it to the user.

In the DocumentConverter web service:

  1. Accept the POST-ed HTML content
  2. Invoke unoconv - a command-line Python script that wraps the LibreOffice / OpenOffice headless document conversion service.
  3. Respond with the binary content of the returned Word doc.
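
The conversion step inside the DocumentConverter service can be sketched roughly as follows. This is a minimal illustration using only the Ruby standard library, not our actual Sinatra code: the method names `unoconv_command` and `convert_html_to_doc` are made up for this example, and it assumes unoconv is on the PATH and accepts `-f` (format) and `-o` (output file).

```ruby
require "open3"
require "tempfile"

# Build the argument list for unoconv. Passing an array (rather than one
# string) means file names are never interpreted by a shell.
# Hypothetical helper, named for this example only.
def unoconv_command(input_path, output_path, format: "doc")
  ["unoconv", "-f", format, "-o", output_path, input_path]
end

# Write the HTML to a temp file, run unoconv on it, and return the binary
# content of the resulting Word document. Raises if unoconv exits non-zero.
def convert_html_to_doc(html)
  Tempfile.create(["article", ".html"]) do |input|
    input.write(html)
    input.flush

    output_path = input.path.sub(/\.html\z/, ".doc")
    _stdout, stderr, status =
      Open3.capture3(*unoconv_command(input.path, output_path))
    raise "unoconv failed: #{stderr.strip}" unless status.success?

    begin
      File.binread(output_path) # binary mode, as in the Resque job step
    ensure
      File.delete(output_path) if File.exist?(output_path)
    end
  end
end
```

Keeping the command builder separate from the code that shells out makes it easy to check the arguments without having LibreOffice installed.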

It all works pretty well, most of the time, and was a good exercise in building complexity by keeping each individual part very simple. However, we recently moved our live platform servers from the US-EAST EC2 region over to the EU-WEST region, and took the opportunity to rebuild them from scratch on an updated Ubuntu. While setting up our staging server we got the above error ('UNO IllegalArgument during import phase: Source file cannot be read. URL seems to be an unsupported one.'), which had us scratching our heads for most of Friday.

To cut a long story short, this is another instance of what we refer to as "Tao errors" - the error which can be seen is not the true error. When it says that the URL is unsupported, what it actually means is "I can't handle the file format you've requested" - usually because some LibreOffice / OpenOffice packages are missing.

If you've only installed the base & core packages, that's not enough - to be able to render Word documents, you need to actually install the "writer" package as well. A quick scan of the unoconv documentation does give you this little tidbit -

Various sub-packages are needed for specific import or export filters, e.g. XML-based filters require the xsltfilter subpackage, e.g. libobasis3.5-xsltfilter.
Important: Neglecting these requirements will cause unoconv to fail with unhelpful and confusing error messages.
- so I guess we were warned... but still, problem solved at last.
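
One way to save the next person a Friday is to translate the misleading message into a hint wherever unoconv's stderr gets logged. The sketch below is only an illustration: the helper name `explain_unoconv_error` and the format-to-component table are hypothetical, and the only mapping taken from our experience above is "doc" needing the "writer" package.

```ruby
# Substring of the misleading message quoted at the top of this post.
MISLEADING_UNO_ERROR = "URL seems to be an unsupported one"

# Hypothetical lookup from target format to the LibreOffice component that
# provides its filter. Only the "doc" => "writer" entry comes from this post.
COMPONENT_FOR_FORMAT = { "doc" => "writer" }.freeze

# Return a human-friendly hint if stderr contains the misleading error,
# or nil if the error looks like something else.
def explain_unoconv_error(stderr, format)
  return nil unless stderr.include?(MISLEADING_UNO_ERROR)

  component = COMPONENT_FOR_FORMAT[format]
  package = component ? "#{component} package" : "package for that format"
  "unoconv could not handle format '#{format}'; " \
    "check that the LibreOffice #{package} is installed"
end
```

For example, passing the error from the title of this post with a format of "doc" returns a hint that mentions the writer package, while any unrelated error returns nil and can be logged as-is.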


Friday, January 04, 2013

MySQL "Row size too large" when saving many text fields

Using MySQL? InnoDB table type? Got a table with several TEXT or BLOB fields? Getting a "Row size too large" error when saving a row with lots of text in those TEXT fields? Confused, because you thought the whole point of TEXT fields was that they stored the text off-table? Well, read on....

Disclaimer: if you're storing many text fields in a relational database table, you might want to look again at whether that's the right place and method for storing that data - if it looks like a document and quacks like a document, then hey, maybe a document store would be more appropriate? But that's a whole other topic...

The detail is in the MySQL docs (section 14.4.5, "How InnoDB Stores Variable-Length Columns"), but I'll give a quick summary here. It hinges on the file format of your InnoDB engine.

As the docs say:
Early versions of InnoDB used an unnamed file format (now called Antelope) for database files. With that format, tables were defined with ROW_FORMAT=COMPACT (or ROW_FORMAT=REDUNDANT) and InnoDB stored up to the first 768 bytes of variable-length columns (such as BLOB and VARCHAR) in the index record within the B-tree node, with the remainder stored on the overflow pages.
(emphasis mine)

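So with the Antelope file format, it's perfectly possible to store a single very large TEXT or BLOB value without encountering the "Row size too large" error, but it can be impossible to store as few as 11 x 1k TEXT fields - because InnoDB will store the first 768 bytes of each TEXT field in the record itself (11 x 768 = 8448 bytes), and that exceeds the row size limit of roughly 8KB (a little under half of the default 16KB page). Any other columns in the table count towards the limit too, so in practice you can hit it with fewer.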
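
The arithmetic is easy to check with a few lines of Ruby. This is a deliberately simplified model, not how InnoDB actually lays out a record: it counts only the inline prefix of each column and ignores record headers, off-page pointers and any other columns. The 768-byte prefix comes from the docs quoted above; the default limit of 8126 bytes is an assumption, based on the figure MySQL reports in the error message for a 16KB page, so it is passed as a parameter. The method names are made up for this example.

```ruby
INLINE_PREFIX_BYTES = 768    # per the MySQL docs quoted above
ROW_SIZE_LIMIT_BYTES = 8126  # assumption: limit reported for 16KB pages

# Bytes each column contributes to the record itself under Antelope:
# the whole value if it is short, otherwise just the 768-byte prefix.
def inline_bytes(column_lengths)
  column_lengths.sum { |len| [len, INLINE_PREFIX_BYTES].min }
end

# True if the inline portions of all columns fit within the row limit.
def fits_in_row?(column_lengths, limit: ROW_SIZE_LIMIT_BYTES)
  inline_bytes(column_lengths) <= limit
end

one_huge_field  = [2_000_000]          # a single ~2MB value
many_small_ones = Array.new(11, 1024)  # eleven 1k values

puts inline_bytes(one_huge_field)      # 768
puts fits_in_row?(one_huge_field)      # true
puts inline_bytes(many_small_ones)     # 8448
puts fits_in_row?(many_small_ones)     # false
```

So the total amount of text matters far less than the number of long columns: each one costs up to 768 bytes of the row, no matter how big it is.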

The solution is straightforward (well, mostly :) - change your innodb_file_format variable to Barracuda, enable innodb_file_per_table, and alter the table to use the DYNAMIC or COMPRESSED ROW_FORMAT. With those row formats, long TEXT values are stored entirely "off-page", leaving only a 20-byte pointer in the record -

SET GLOBAL innodb_file_format=Barracuda;
SET GLOBAL innodb_file_per_table=ON;
ALTER TABLE (your table) ROW_FORMAT=COMPRESSED;

- and you're good to go.

The one complication is if you're running on Amazon RDS, in which case you'll get an error saying you don't have SUPER privileges. If that's the case, you just need to set the innodb_file_format parameter in your RDS instance parameter group, and allow a few seconds for it to propagate to all your instances.