[Python-Dev] Decoding incomplete unicode

Hye-Shik Chang hyeshik at gmail.com
Wed Jul 28 05:51:30 CEST 2004


On Tue, 27 Jul 2004 22:39:45 +0200, Walter Dörwald
<walter at livinglogic.de> wrote:
> Python's Unicode machinery currently has problems when decoding
> incomplete input.
> 
> When codecs.StreamReader.read() encounters a decoding error it
> reads more bytes from the input stream and retries decoding.
> This is broken for two reasons:
> 1) The error might be due to a malformed byte sequence in the input,
>     a problem that can't be fixed by reading more bytes.
> 2) There may be no more bytes available at this time. Once more
>     data is available decoding can't continue because bytes from
>     the input stream have already been read and thrown away.
> (sio.DecodingInputFilter has the same problems)

StreamReaders and StreamWriters from the CJK codecs do not suffer from
these problems, because they keep an internal buffer for state and for
the incomplete bytes of a sequence. In fact, the CJK codecs have their
own implementations of UTF-8 and UTF-16, built on their multibytecodec
system, which already provide a "working" StreamReader/Writer. :)
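
The buffering idea can be sketched as follows. This is only an
illustration, not the CJK codecs' code: it assumes the standard
library's codecs.getincrementaldecoder(), which was added to Python
after this thread, and decode_chunks is a hypothetical helper name.

```python
import codecs


def decode_chunks(chunks, encoding="utf-8"):
    """Decode byte chunks one by one, keeping incomplete sequences buffered.

    Hypothetical helper for illustration only.
    """
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    pieces = []
    for chunk in chunks:
        # Bytes of an unfinished sequence stay inside the decoder
        # until the rest of the sequence arrives in a later chunk.
        pieces.append(decoder.decode(chunk))
    # final=True raises UnicodeDecodeError if the input ends mid-sequence.
    pieces.append(decoder.decode(b"", final=True))
    return pieces


# U+20AC (EURO SIGN) is three bytes in UTF-8 (E2 82 AC); here it is
# split across three chunks.
result = decode_chunks([b"a\xe2", b"\x82", b"\xacb"])
```

No bytes are thrown away between calls, so decoding resumes as soon as
the missing bytes show up, while a genuinely malformed byte still
raises an error right away.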

> I've uploaded a patch that fixes these problems to SF:
> http://www.python.org/sf/998993
> 
> The patch implements a few additional features:
> - read() has an additional argument chars that can be used to
>    specify the number of characters that should be returned.
> - readline() is supported on all readers derived from
>    codecs.StreamReader().

I have no comments on these yet.
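
For readers who want to see what a character count on read() looks
like, here is a small sketch. It uses the read(size, chars) signature
found in current Python releases; whether that matches the patch
exactly as posted is an assumption.

```python
import codecs
import io

# "h\u00e9llo": the second character takes two bytes in UTF-8.
raw = io.BytesIO("h\u00e9llo".encode("utf-8"))
reader = codecs.getreader("utf-8")(raw)

# size=1 pulls one byte at a time from the stream; chars=2 asks for
# two decoded characters, so the reader keeps reading until it has them.
first = reader.read(size=1, chars=2)
second = reader.read(size=1, chars=2)
# With no arguments, read() returns whatever is left.
rest = reader.read()
```

The first call has to consume three bytes to produce two characters,
and the half-read character is kept in the reader's byte buffer in the
meantime rather than raising an error.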

> - readline() and readlines() have an additional option
>    for dropping the u"\n".

+1
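
A short sketch of that option, using the keepends parameter as it is
spelled in current Python releases (an assumption as far as the posted
patch is concerned):

```python
import codecs
import io

raw = io.BytesIO(b"one\ntwo\nthree\n")
reader = codecs.getreader("utf-8")(raw)

# keepends=False strips the trailing line ending from each line.
first_line = reader.readline(keepends=False)
remaining = reader.readlines(keepends=False)
```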

I wonder whether we then need to add an optional argument to writelines()
that appends a newline character to each line.
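
Such an option could behave roughly like the helper below. This is a
hypothetical sketch of the idea, not an existing StreamWriter feature;
the function name and its newline parameter are made up for
illustration.

```python
import codecs
import io


def writelines_with_newline(writer, lines, newline="\n"):
    """Write each line followed by a newline (hypothetical helper)."""
    for line in lines:
        writer.write(line + newline)


buf = io.BytesIO()
writer = codecs.getwriter("utf-8")(buf)
writelines_with_newline(writer, ["alpha", "beta"])
output = buf.getvalue()
```

This would make writelines() symmetric with a readlines() call that
drops the line endings.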


Hye-Shik

