[Python-Dev] Decoding incomplete unicode
Hye-Shik Chang
hyeshik at gmail.com
Wed Jul 28 05:51:30 CEST 2004
On Tue, 27 Jul 2004 22:39:45 +0200, Walter Dörwald
<walter at livinglogic.de> wrote:
> Python's unicode machinery currently has problems when decoding
> incomplete input.
>
> When codecs.StreamReader.read() encounters a decoding error it
> reads more bytes from the input stream and retries decoding.
> This is broken for two reasons:
> 1) The error might be due to a malformed byte sequence in the input,
> a problem that can't be fixed by reading more bytes.
> 2) There may be no more bytes available at this time. Once more
> data is available decoding can't continue because bytes from
> the input stream have already been read and thrown away.
> (sio.DecodingInputFilter has the same problems)
The StreamReaders and StreamWriters in the CJK codecs do not suffer
from these problems, because they keep an internal buffer for state
and for the incomplete bytes of a sequence. In fact, CJK codecs has
its own implementations of UTF-8 and UTF-16, built on its
multibytecodec system. It already provides a "working"
StreamReader/Writer. :)
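
To make the buffering idea concrete, here is a minimal sketch. It does
not use the CJK codecs' multibytecodec code; it assumes the incremental
decoder API (codecs.getincrementaldecoder) found in later Python
versions, and the sample bytes are made up for illustration. It shows
both cases from the list above: a truncated sequence is held back until
more bytes arrive, while a malformed one is reported immediately.

```python
import codecs

# u"\u4e2d" takes three bytes in UTF-8 (e4 b8 ad). Split the input so
# that the first chunk ends in the middle of that character.
data = "a\u4e2d".encode("utf-8")
first, second = data[:2], data[2:]

decoder = codecs.getincrementaldecoder("utf-8")()

# The lone lead byte e4 is kept in the decoder's internal buffer
# instead of being thrown away or raising an error.
out1 = decoder.decode(first)

# Once the remaining bytes arrive, the buffered byte is completed.
out2 = decoder.decode(second, final=True)

# A byte that can never start a valid sequence is a real error;
# reading more input would not help, so it is raised right away.
malformed_detected = False
try:
    codecs.getincrementaldecoder("utf-8")().decode(b"\xff")
except UnicodeDecodeError:
    malformed_detected = True

print(repr(out1), repr(out2), malformed_detected)
```

The point of the design is that the reader, not the caller, owns the
leftover bytes, so nothing read from the stream is lost between calls.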
> I've uploaded a patch that fixes these problems to SF:
> http://www.python.org/sf/998993
>
> The patch implements a few additional features:
> - read() has an additional argument chars that can be used to
> specify the number of characters that should be returned.
> - readline() is supported on all readers derived from
> codecs.StreamReader().
I have no comments on these yet.
> - readline() and readlines() have an additional option
> for dropping the u"\n".
+1
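
For reference, a short sketch of how such options look from the caller's
side. It assumes the codecs.StreamReader signatures of later Python
versions, readline(size=None, keepends=True) and read(size=-1,
chars=-1), not the patch as uploaded; the input bytes are invented for
the example.

```python
import codecs
import io

# Wrap an in-memory byte stream in a UTF-8 StreamReader.
reader = codecs.getreader("utf-8")(io.BytesIO(b"one\ntwo\n"))

# keepends=False drops the trailing line terminator.
line = reader.readline(keepends=False)

# chars limits the number of decoded characters returned, as opposed
# to size, which is only a hint about how many bytes to read.
head = reader.read(chars=2)

print(repr(line), repr(head))
```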
I wonder whether we then need to add an optional argument to
writelines() that appends a newline character to each line.
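
A rough sketch of what I mean, written as a stand-alone helper rather
than as a change to codecs.StreamWriter. The function name and its
newline parameter are hypothetical; only codecs.getwriter() and
StreamWriter.writelines() are existing API.

```python
import codecs
import io


def writelines_with_newline(writer, lines, newline="\n"):
    """Hypothetical helper: write each line followed by `newline`.

    The plain writelines() writes the strings back to back, so the
    terminator has to be appended to every line before the call.
    """
    writer.writelines([line + newline for line in lines])


raw = io.BytesIO()
writer = codecs.getwriter("utf-8")(raw)
writelines_with_newline(writer, ["spam", "eggs"])
result = raw.getvalue()
print(result)
```

This would be the mirror image of dropping the newline on the reading
side: lines read without terminators could be written back unchanged.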
Hye-Shik