python 3.x - Read UTF-8 character from byte stream
Given a stream of bytes (generator, file, etc.), how can I read a single UTF-8
encoded character?
- This operation must consume the bytes of that character from the stream.
- This operation must not consume any bytes of the stream beyond the first character.
- This operation should succeed on any Unicode character.
I could approach this by rolling my own UTF-8 decoding function, but I would prefer not to reinvent the wheel, since I'm sure this functionality must already be used elsewhere to parse UTF-8 strings.
Wrap the stream in a TextIOWrapper with encoding='utf8', then call .read(1) on it.
This assumes you started with a BufferedIOBase or something duck-type compatible with it (i.e. it has a read() method). If you have a generator or iterator, you may need to adapt the interface.
Example:
from io import TextIOWrapper

with open('/path/to/file', 'rb') as f:
    wf = TextIOWrapper(f, 'utf-8')
    # Ask the wrapper to pull one byte at a time so it does not read ahead.
    wf._CHUNK_SIZE = 1  # implementation detail, may not work everywhere
    wf.read(1)  # gives next UTF-8 encoded character
    f.read(1)   # gives next byte
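If the source is a plain iterator rather than a file-like object, or if depending on the private _CHUNK_SIZE attribute is a concern, one possible alternative is the standard library's incremental decoder from the codecs module. The following is only a sketch: read_char is a hypothetical helper name, and it assumes the iterator yields individual byte values as ints (which is what iterating over a bytes object gives you). It feeds the decoder one byte at a time and stops as soon as a full character comes out, so nothing past that character is consumed.

```python
import codecs


def read_char(byte_iter, encoding="utf-8"):
    """Read one character from an iterator of ints (0-255).

    Hypothetical helper: consumes only the bytes of the first character.
    Returns '' if the iterator is already exhausted.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    for byte in byte_iter:
        # The decoder buffers incomplete sequences and returns '' until
        # the character is complete.
        char = decoder.decode(bytes([byte]))
        if char:
            return char
    # End of input: final=True raises UnicodeDecodeError if a multi-byte
    # sequence was cut off part-way through.
    return decoder.decode(b"", final=True)


# Usage: pass an iterator (not a bytes object) so that position is kept.
stream = iter("é€a".encode("utf-8"))
print(read_char(stream))  # é  (2 bytes consumed)
print(read_char(stream))  # €  (3 bytes consumed)
print(next(stream))       # 97, the untouched byte for 'a'
```

A generator that yields larger bytes chunks would first need to be flattened into single bytes, for example with itertools.chain.from_iterable, before being passed in.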