Sunday, February 26, 2012

Python and byte order marks


I am working with some files that have a byte order mark (BOM) at the beginning. The BOM is the Unicode character U+FEFF, which UTF-8 encodes as the three bytes EF BB BF. When I load the files with the standard Python open command, those bytes show up as junk characters at the start of the first line, before any text. There are two basic problems here, both easily solved.

First, the file is encoded as UTF-8, which is an encoding of Unicode rather than plain ASCII (although any pure ASCII text is also valid UTF-8, byte for byte). Whether or not this file contains any non-ASCII characters, which UTF-8 represents as multi-byte sequences, it is more appropriate to decode it as UTF-8.
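As a small illustration of that compatibility, the sketch below encodes two made-up sample strings (they are not from the files discussed here) and compares the resulting bytes. It is only meant to show why ASCII-only files appear to "just work" until a non-ASCII character or a BOM turns up.

```python
# Two hypothetical sample strings, chosen only for illustration.
ascii_text = u'hello'
accented_text = u'caf\xe9'   # "cafe" with an acute accent on the e

ascii_bytes = ascii_text.encode('utf-8')
accented_bytes = accented_text.encode('utf-8')

# Pure ASCII text produces the same bytes under either encoding.
same_as_ascii = (ascii_bytes == ascii_text.encode('ascii'))

# The accented character takes two bytes in UTF-8, so four characters
# become five bytes.
accented_length = len(accented_bytes)
```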

Second, the byte order mark is showing up as part of the data. I could just assume it is there, read the first three bytes and ignore them, but I would rather let Python handle it properly so I don't have to think about it. Writing special code for the first bytes of the first line is annoying and ugly. Note that a BOM is not required for a UTF-8 file, but many applications, particularly on Windows, include it.
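To make the problem concrete, here is a minimal sketch using an in-memory byte string instead of a real file. The sample data is invented for illustration; the point is that decoding as plain UTF-8 leaves the BOM in the text as the character U+FEFF, and that removing it by hand is exactly the kind of special case I want to avoid.

```python
import codecs

# Hypothetical sample data: a UTF-8 BOM followed by one line of text.
raw = codecs.BOM_UTF8 + u'first line\n'.encode('utf-8')

bom_length = len(codecs.BOM_UTF8)   # the UTF-8 BOM is the bytes EF BB BF
text = raw.decode('utf-8')          # plain UTF-8 keeps the BOM as data
first_char = text[0]                # the BOM, now the character U+FEFF

# The manual workaround: check for the BOM and slice it off.
if text.startswith(u'\ufeff'):
    stripped = text[1:]
else:
    stripped = text
```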



Both of these are resolved with the codecs module in Python (note: I am using version 2.7; Python 3.x handles Unicode differently, and there the built-in open accepts the same encoding argument, so codecs.open is not needed). Instead of an ordinary open command, we use the following:

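import codecs

def read_lines(input_file_name):
    # 'utf-8-sig' decodes UTF-8 and drops a leading BOM if one is present
    with codecs.open(input_file_name, 'r', encoding='utf-8-sig') as infile:
        lines = infile.readlines()
    # strip() removes surrounding whitespace, including \r and \n
    return [line.strip() for line in lines]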


This decodes the file as UTF-8 and handles the BOM at the beginning. The strip() call in the list comprehension removes leading and trailing whitespace from each line, which includes the line feeds and carriage returns (another item that can differ depending on the platform the file came from). The utf-8-sig codec also reads files that have no BOM, so it is safe when you are not sure; if you know there is no BOM, plain utf-8 works as well (or utf-16 if that is the correct encoding, in which case the codec consumes the BOM itself). There are automated ways to guess the encoding, but these are not 100% reliable, and it is best to agree on the encoding (including the BOM), or know what your source produces, and handle it properly.
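For the simplest kind of automated check, looking only for a BOM, something along these lines could work. This is a rough sketch rather than a robust detector: the function name guess_codec_from_bom and its default argument are my own invention, and it relies only on the BOM constants that the codecs module provides.

```python
import codecs

# Ordered so the 4-byte UTF-32 marks are tested before the UTF-16 marks
# (BOM_UTF32_LE starts with the same two bytes as BOM_UTF16_LE).
_BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]

def guess_codec_from_bom(data, default='utf-8'):
    """Return a codec name based on a leading BOM, or `default` if none.

    `data` is the first few raw bytes of the file (hypothetical helper).
    """
    for bom, codec in _BOMS:
        if data.startswith(bom):
            return codec
    return default
```

You would read the first four bytes of the file in binary mode, pass them to this function, and then reopen the file with the returned codec. A file without a BOM simply falls through to the default, which is only a guess, so this does not replace knowing what your source produces.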

Submitted by richard on Sun, 02/26/2012 - 23:41
