[Python-3000] [Python-Dev] Filename as byte string in python 2.6 or 3.0?

Fri Oct 3 23:36:25 CEST 2008

On Fri, Oct 3, 2008 at 3:05 PM, Glenn Linderman <v+python at g.nevcal.com> wrote:
> On approximately 10/3/2008 12:53 PM, came the following characters from the
> keyboard of James Y Knight:
>>
>> On Oct 3, 2008, at 3:24 PM, Glenn Linderman wrote:
>>>
>>> In order to work, the actual name must be preserved, or if translated,
>>> must be a reversible, 1-to-1 translation.  A lot of discussion here has
>>> talked about reversible translations, but haven't noted the requirement that
>>> it be 1-to-1... and if the translation produces something that looks like it
>>> could be a file name, then the reverse translation is unlikely to be 1-to-1!
>>>  Somewhere, you need to add a flag that indicates whether or not a reverse
>>> translation needs to be done, independently of the content of the translated
>>> name.
>>
>> That's not true. Both the U+0000 and UTF-8b proposals are 1-to-1
>> transforms.
>>
>> James
>
> My understanding of the Posix file names is that any byte values are valid
> except "/" and null.  Is this a correct understanding?
>
> The UTF-8b proposal seems to translate from a non-UTF-8 byte stream to a
> Unicode character stream.  Call the original byte stream FOO.  The
> transformation then produces FOOTR, a set of Unicode code points.  Now FOOTR
> has a representation in UTF-8, which is a byte stream, call that byte stream
> FOOTRUTF8.  How, by looking at FOOTR, do you know whether it represents the
> file name FOO or FOOTRUTF8 ?  And remember that the user might provide a
> Unicode character stream identical to FOOTR: should it be translated to FOO
> or FOOTRUTF8 when creating a new file according to the user-supplied name?

UTF-8b produces an *invalid* Unicode sequence by using lone
surrogates.  Any attempt to encode or decode them with a validating
UTF-8 (or UTF-16/UTF-32) codec would be rejected, which is why they
can be used unambiguously.

In other words, the result isn't Unicode (despite a resemblance), so
it's easy to be 1-to-1.
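
A minimal sketch of the round trip, with the caveat that it is an
assumption on my part: it uses the "surrogateescape" error handler
that later Python 3 releases ship, which behaves like the UTF-8b idea
described above.  The file name is made up for illustration.

```python
# Sketch only: 'surrogateescape' (later Python 3) stands in for UTF-8b here.
# The file name is a hypothetical example.

raw = b"caf\xe9.txt"  # Latin-1 bytes; 0xE9 is not valid UTF-8 in this position

# The undecodable byte 0xE9 is mapped to the lone surrogate U+DCE9.
name = raw.decode("utf-8", "surrogateescape")

# A validating (strict) codec refuses to encode the lone surrogate...
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# ...while the escape-aware encoder restores the original bytes exactly.
round_trip = name.encode("utf-8", "surrogateescape")

# A genuine Unicode name contains no lone surrogates, so it encodes to
# ordinary UTF-8 and cannot collide with the escaped form.
valid = "caf\u00e9.txt"
valid_bytes = valid.encode("utf-8", "surrogateescape")
```

Since user-supplied Unicode never contains lone surrogates, the
escaped name and the genuine Unicode name always map to different
byte strings.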

> So the U+0000 transform may be 1-to-1 since it introduces null characters
> into the translated "file name", which are effectively producing names that
> are invalid according to the Posix file name standard ... but if it
> introduces null characters into the translated "file name", then there is
> file name parsing software that it will be incompatible with, which may be
> as problematic as not translating the file names in the first place... deep
> analysis would have to be used to determine which problem is larger, or more
> significant.  I've certainly been "guilty" of writing software that assumes
> that there are no null characters in a file name.  I've even been "guilty"
> of writing software that assumes there are no space characters in a file
> name, although I've tried to break that habit in recent years...

Yup, U+0000 is Unicode, but a name containing it still can't be used
with many external APIs, as it's a transformation of the real file
name.  The only real advantage is that you can store it in certain
external formats, but wouldn't you know it, XML isn't one of them[1].
Can you think of any common formats where it would work?

[1] http://www.w3.org/International/questions/qa-controls
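
To illustrate both limits, here is a rough sketch assuming a recent
Python 3 and its standard library; the name "foo\x00bar" is a
hypothetical translated file name, not one from this thread.

```python
import os
import xml.etree.ElementTree as ET

# Hypothetical translated name containing U+0000.
name = "foo\x00bar"

# OS-level APIs take NUL-terminated strings, so recent Python 3 rejects
# the name before it ever reaches the system call.
try:
    os.stat(name)
    os_accepts = True
except ValueError:
    os_accepts = False

# XML 1.0 forbids U+0000 even as a character reference, so the parser
# refuses the document.
try:
    ET.fromstring("<name>foo&#0;bar</name>")
    xml_accepts = True
except ET.ParseError:
    xml_accepts = False
```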

-- 
Adam Olsen, aka Rhamphoryncus