We had this problem:
>> : x = '\xed\xa0\xbc\xed\xbc\xb8'
>> : x == x.decode('utf-8').encode('utf-8')
<< : False
That str is a UTF-8-encoded surrogate pair (two three-byte sequences),
which Python apparently merges into a single code point and then
re-encodes as one four-byte UTF-8 sequence.
Like this:
>> : u'\ud83c\udf38'
<< : u'\U0001f338'
I don't entirely understand that, but having a different byte representation
after round-tripping through unicode causes problems with replication and
listings.
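For reference, a sketch of the same round trip (assuming Python 2): the
re-encoded value is the standard four-byte UTF-8 encoding of U+1F338,
which is why the comparison above is False.
>> : x.decode('utf-8').encode('utf-8')
<< : '\xf0\x9f\x8c\xb8'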
This patch just rejects anything that doesn't re-encode to the same thing.
If someone smarter wants to do something different, please speak up.
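A minimal sketch of that kind of check, assuming Python 2 str/unicode
semantics; check_utf8_roundtrip is a hypothetical name for illustration,
not necessarily the helper this patch adds:

def check_utf8_roundtrip(value):
    """Accept only byte strings whose UTF-8 round trip is lossless."""
    try:
        decoded = value.decode('utf-8')
    except UnicodeDecodeError:
        return False  # not valid UTF-8 at all
    # Surrogate-pair byte sequences decode under Python 2, but re-encode
    # to different bytes, so they are rejected here.
    return decoded.encode('utf-8') == value

check_utf8_roundtrip('\xed\xa0\xbc\xed\xbc\xb8')  # False: re-encodes as 4 bytes
check_utf8_roundtrip('\xf0\x9f\x8c\xb8')          # True: already canonical UTF-8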
Change-Id: I9ac48ac2693e4121be6585c6e4f5d0079e9bb3e4