Python

6343 readers

4 users here now

Welcome to the Python community on the programming.dev Lemmy instance!

📅 Events

Past

November 2023

PyCon Ireland 2023, 11-12th
PyData Tel Aviv 2023 14th

October 2023

PyConES Canarias 2023, 6-8th
DjangoCon US 2023, 16-20th (!django 💬)

July 2023

PyDelhi Meetup, 2nd
PyCon Israel, 4-5th
DFW Pythoneers, 6th
Django Girls Abraka, 6-7th
SciPy 2023 10-16th, Austin
IndyPy, 11th
Leipzig Python User Group, 11th
Austin Python, 12th
EuroPython 2023, 17-23rd
Austin Python: Evening of Coding, 18th
PyHEP.dev 2023 - "Python in HEP" Developer's Workshop, 25th

August 2023

PyLadies Dublin, 15th
EuroSciPy 2023, 14-18th

September 2023

PyData Amsterdam, 14-16th
PyCon UK, 22nd - 25th

🐍 Python project:

💓 Python Community:

#python IRC for general questions
#python-dev IRC for CPython developers
PySlackers Slack channel
Python Discord server
Python Weekly newsletters
Mailing lists
Forum

✨ Python Ecosystem:

🌌 Fediverse

Communities

#python on Mastodon
c/django on programming.dev
c/pythorhead on lemmy.dbzer0.com

Projects

Pythörhead: a Python library for interacting with Lemmy
Plemmy: a Python package for accessing the Lemmy API
pylemmy pylemmy enables simple access to Lemmy's API with Python
mastodon.py, a Python wrapper for the Mastodon API

Feeds

founded 1 year ago

MODERATORS

[email protected]

ISO-8859-x encodings and invalid bytes (programming.dev)

submitted 4 months ago by [email protected] to c/[email protected]

4 comments fedilink hide all child comments

Hi,

I have to interface with systems that use iso-8859-x encoding (not by choice...), and I'm surprised that the following doesn't throw an error:

>>> str(bytes(range(256)), encoding="iso-8859-1", errors="strict")
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0¡¢£¤¥¦§¨©ª«¬\xad®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ'

Bytes in the 0x80—0x9f range are not valid iso-8859-1, and I was expecting the above to raise a DecodeError of some sort; instead it looks like those are passed through.

I'm perfectly happy with this behaviour, I would like to make sure I can depend on it. Can I take an arbitrary byte buffer, decode as ISO-8859-1, and never get any error? Is it guaranteed to be lossless ?

all 5 comments

sorted by: hot top controversial new old

[–] [email protected] 3 points 4 months ago (2 children)

I was curious, so I did some searches on this topic for you and found these pages:

The second link in particular notes:

The reason that things are much easier with all ASCII data is that practically every Unicode encoding in existence maps bytes 0x00..0x7f to the corresponding code points, so byte strings and Unicode strings that contain the same all-ASCII data are basically equivalent, even semantically. What usually trips people up with non-ASCII data is that the semantic meaning of bytes in the range 0x80..0xff changes from one encoding to another.

But, thinking like a systems programmer again, for many purposes the semantic meaning of bytes 0x80..0xff doesn’t matter. All that matters is that those bytes are preserved unchanged by whatever operations are done. Typical operations like tokenizing strings, looking for markers indicating particular types of data, etc. only need to care about the meaning of bytes in the range 0x00..0x7f; bytes in the range 0x80..0xff are just along for the ride.

So the trick for beating Python 3 strings into submission is to put in encoding and decoding calls where you need to, choosing a single-byte encoding that doesn’t mutate 0x80..0xff. There are many of these; most of the Latin-{1..6} sequence (aka ISO-8859-1..10) is has this property. What you do not want to do is pick utf-8 or any of the multibyte Asian encodings. Latin-1 will do fine; in fact it has an advantage over the others in memory consumption, which we’ll describe below.

Whether depending on this is actually correct or not is beyond me, but it seems like people have actually been using that pass-through behavior in practice and put it into things like Python2 -> 3 migration guides.

The first link suggests that the seemingly undefined ranges are valid as C0 and C1 control codes which may be why it doesn't throw errors.

[–] [email protected] 2 points 4 months ago

Thank you for the pointers @[email protected]

My use case certainly fall into that described by ESR, I only really need to understand markup that falls in the ASCII range and pass the rest unmodified

[–] [email protected] 2 points 4 months ago (1 children)

Maybe I’m tripping here but this kinda also explains why the human genome contains lots of noncoding DNA.

[–] [email protected] 1 points 4 months ago

I think it has more to do with the fact Mother Nature is really inefficient and allocated much more DNA storage than necessary.