python 3.x - Read a UTF-8 character from a byte stream


Given a stream of bytes (generator, file, etc.), how can I read a single UTF-8 encoded character?

  • This operation must consume the bytes of that character from the stream.
  • This operation must not consume any bytes of the stream beyond the first character.
  • This operation should succeed on any Unicode character.

I could approach this by rolling my own UTF-8 decoding function, but I would prefer not to reinvent the wheel, since I'm sure this functionality must already be used elsewhere to parse UTF-8 strings.

Wrap the stream in a TextIOWrapper with encoding='utf8', then call .read(1) on it.

This assumes you started with a BufferedIOBase or something duck-type compatible (i.e. it has a read() method). If you have a generator or iterator, you may need to adapt the interface.
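One way to adapt an iterator is sketched below. It is a minimal illustration, not part of the original answer: the class name IterStream is hypothetical, and it assumes the iterator yields bytes objects. It subclasses io.RawIOBase and implements readinto(), so the standard wrappers can sit on top of it.

```python
import io


class IterStream(io.RawIOBase):
    """Hypothetical adapter: expose an iterator of bytes chunks as a raw stream."""

    def __init__(self, chunks):
        self._chunks = iter(chunks)
        self._leftover = b""

    def readable(self):
        return True

    def readinto(self, buf):
        # Pull chunks until there is data to hand out or the iterator ends.
        while not self._leftover:
            try:
                self._leftover = next(self._chunks)
            except StopIteration:
                return 0  # signals end of stream
        n = min(len(buf), len(self._leftover))
        buf[:n] = self._leftover[:n]
        self._leftover = self._leftover[n:]
        return n


# Usage: the character is split across two chunks on purpose.
chunks = [b"\xc3", b"\xa9a"]
text = io.TextIOWrapper(io.BufferedReader(IterStream(chunks)), encoding="utf-8")
first = text.read(1)
```

Note that the buffering layers may pull more chunks from the iterator than the first character needs, so the adapter on its own does not satisfy the "must not consume extra bytes" requirement.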

Example:

    from io import TextIOWrapper

    with open('/path/to/file', 'rb') as f:
        wf = TextIOWrapper(f, 'utf-8')
        wf._CHUNK_SIZE = 1  # implementation detail, may not work everywhere

        wf.read(1)  # gives the next UTF-8 encoded character
        f.read(1)   # gives the next byte
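If relying on a private attribute is a concern, a different technique is to feed the stream one byte at a time into an incremental decoder from the codecs module. This is a sketch rather than the answer's method; the function name read_utf8_char is hypothetical, and it assumes the stream has a read() method that returns bytes.

```python
import codecs
import io


def read_utf8_char(stream):
    """Read one UTF-8 character, consuming only the bytes that encode it."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    while True:
        byte = stream.read(1)
        if not byte:
            # End of stream: final=True makes a truncated sequence raise
            # UnicodeDecodeError; a clean end returns an empty string.
            return decoder.decode(b"", final=True)
        char = decoder.decode(byte)
        if char:
            return char


# Usage with an in-memory stream: a 2-byte, a 3-byte and a 1-byte character.
stream = io.BytesIO("\u00e9\u20aca".encode("utf-8"))
first = read_utf8_char(stream)
position_after_first = stream.tell()
```

Reading byte by byte keeps the stream position exactly at the end of the character, at the cost of one read() call per byte, which can be slow on an unbuffered source.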
