


Get a list of all the encodings Python can encode to

I am writing a script that will try encoding bytes into many different encodings in Python 2.6. Is there some way to get a list of available encodings that I can iterate over?

The reason I'm trying to do this is that a user has some text that is not encoded correctly. There are funny characters. I know the Unicode character that's messing it up. I want to be able to give them an answer like "Your text editor is interpreting that string as X encoding, not Y encoding". I thought I would try to encode that character using one encoding, then decode it again using another encoding, and see if we get the same character sequence.

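i.e. something like this:

    import itertools

    # encodinglist() is the function I am asking for; it does not exist yet.
    for encoding1, encoding2 in itertools.permutations(encodinglist(), 2):
        try:
            unicode_string = my_unicode_character.encode(encoding1).decode(encoding2)
        except (UnicodeError, LookupError):
            pass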
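The idea in the question can be sketched as a small helper. This is only an illustration, assuming Python 3 and a hand-picked list of candidate encodings; the function name `find_mojibake_pairs` and the sample strings are made up for the example.

```python
import itertools


def find_mojibake_pairs(original, garbled, candidate_encodings):
    """Return (encode_with, decode_with) pairs that turn `original` into `garbled`."""
    pairs = []
    for enc_out, enc_in in itertools.permutations(candidate_encodings, 2):
        try:
            round_tripped = original.encode(enc_out).decode(enc_in)
        except (UnicodeError, LookupError):
            # This pair cannot represent the text (or the codec is unknown); skip it.
            continue
        if round_tripped == garbled:
            pairs.append((enc_out, enc_in))
    return pairs


# "é" written as UTF-8 but read as a single-byte Western encoding shows up as "Ã©".
pairs = find_mojibake_pairs("é", "Ã©", ["utf_8", "latin_1", "cp1252"])
print(pairs)  # [('utf_8', 'latin_1'), ('utf_8', 'cp1252')]
```

Several pairs can match the same garbled text, so the result narrows the possibilities rather than naming one definite culprit.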


Here's a programmatic way to list all the encodings defined in the stdlib encodings package; note that this won't list user-defined encodings. This combines some of the tricks in the other answers, but actually produces a working list using the codec's canonical name.

    import encodings
    import pkgutil
    import pprint

    all_encodings = set()

    for _, modname, _ in pkgutil.iter_modules(
            encodings.__path__, encodings.__name__ + '.',
    ):
        try:
            mod = __import__(modname, fromlist=[str('__trash')])
        except (ImportError, LookupError):
            # A few encodings are platform specific: mbcs, cp65001
            # print('skip {}'.format(modname))
            continue  # nothing was imported, so do not reuse a stale `mod`

        try:
            all_encodings.add(mod.getregentry().name)
        except AttributeError as e:
            # the `aliases` module doesn't actually provide a codec
            # print('skip {}'.format(modname))
            if 'regentry' not in str(e):
                raise

    pprint.pprint(sorted(all_encodings))


I doubt there is any such method/functionality in the codecs module, but if you look at encodings/__init__.py, the search function looks through the encodings module folder, so you can do the same, e.g.

    >>> import os, encodings
    >>> os.listdir(os.path.dirname(encodings.__file__))
    ['cp500.pyc', 'utf_16_le.py', 'gb18030.py', 'mbcs.pyc', 'undefined.pyc', 'idna.pyc', 'punycode.pyc', 'cp850.py', 'big5hkscs.pyc', 'mac_arabic.py', '__init__.pyc', 'string_escape.py', 'hz.py', 'cp037.py', 'cp737.py', 'iso8859_5.pyc', 'iso8859_13.pyc', 'cp861.pyc', 'cp862.py', 'iso8859_9.pyc', 'cp949.py', 'base64_codec.pyc', 'koi8_r.py', 'iso8859_2.py', 'ptcp154.pyc', 'uu_codec.pyc', 'mac_croatian.pyc', 'charmap.pyc', 'iso8859_15.pyc', 'euc_jp.py', 'cp1250.py', 'iso8859_10.pyc', 'koi8_r.pyc', 'unicode_escape.pyc', 'cp863.pyc', 'iso8859_4.pyc', 'cp852.py', 'unicode_internal.py', 'big5hkscs.py', 'cp1257.pyc', 'cp1254.py', 'shift_jisx0213.py', 'shift_jis.pyc', 'cp869.pyc', 'hp_roman8.py', 'iso8859_4.py', 'cp775.py', 'cp1251.py', 'mac_cyrillic.pyc', 'mac_greek.pyc', 'mac_roman.pyc', 'iso8859_11.pyc', 'iso8859_6.py', 'utf_8_sig.py', 'iso8859_3.py', 'iso2022_jp_1.py', 'ascii.py', 'cp1026.pyc', 'cp1250.pyc', 'cp950.py', 'raw_unicode_escape.py', 'euc_jis_2004.pyc', 'cp775.pyc', 'euc_kr.py', 'mac_greek.py', 'big5.pyc', 'shift_jis_2004.pyc', 'gbk.pyc', 'cp1254.pyc', 'cp1255.pyc', 'cp855.pyc', 'string_escape.pyc', 'cp949.pyc', 'cp1258.pyc', 'iso8859_3.pyc', 'mac_iceland.pyc', 'cp1251.pyc', 'cp860.py', 'cp856.py', 'cp874.py', 'iso2022_kr.py', 'cp856.pyc', 'rot_13.py', 'palmos.py', 'iso2022_jp_2.pyc', 'mac_farsi.py', 'koi8_u.pyc', 'cp1256.py', 'iso8859_10.py', 'tis_620.py', 'iso8859_14.pyc', 'cp1253.py', 'cp1258.py', 'cp437.py', 'cp862.pyc', 'mac_turkish.py', 'undefined.py', 'euc_kr.pyc', 'gb18030.pyc', 'aliases.pyc', 'iso8859_9.py', 'uu_codec.py', 'gbk.py', 'quopri_codec.pyc', 'iso8859_7.py', 'mac_iceland.py', 'iso8859_2.pyc', 'euc_jis_2004.py', 'iso2022_jp_3.pyc', 'cp874.pyc', '__init__.py',
    'mac_roman.py', 'iso8859_16.py', 'cp866.py', 'unicode_internal.pyc', 'mac_turkish.pyc', 'johab.pyc', 'cp037.pyc', 'punycode.py', 'cp1253.pyc', 'euc_jisx0213.pyc', 'iso2022_jp_2004.pyc', 'iso2022_kr.pyc', 'zlib_codec.pyc', 'cp932.py', 'cp1255.py', 'iso2022_jp_1.pyc', 'cp857.pyc', 'cp424.pyc', 'iso2022_jp_2.py', 'iso2022_jp.pyc', 'mbcs.py', 'utf_8.py', 'palmos.pyc', 'cp1252.pyc', 'aliases.py', 'quopri_codec.py', 'latin_1.pyc', 'iso2022_jp.py', 'zlib_codec.py', 'cp1026.py', 'cp860.pyc', 'cp1252.py', 'hex_codec.pyc', 'iso8859_1.pyc', 'cp850.pyc', 'cp861.py', 'iso8859_15.py', 'cp865.pyc', 'hp_roman8.pyc', 'iso8859_7.pyc', 'mac_latin2.py', 'iso8859_11.py', 'mac_centeuro.pyc', 'iso8859_6.pyc', 'ascii.pyc', 'mac_centeuro.py', 'iso2022_jp_3.py', 'bz2_codec.py', 'mac_arabic.pyc', 'euc_jisx0213.py', 'tis_620.pyc', 'shift_jis_2004.py', 'utf_8.pyc', 'cp855.py', 'mac_romanian.pyc', 'iso8859_8.py', 'cp869.py', 'ptcp154.py', 'utf_16_be.py', 'iso2022_jp_ext.pyc', 'bz2_codec.pyc', 'base64_codec.py', 'latin_1.py', 'charmap.py', 'hz.pyc', 'cp950.pyc', 'cp875.pyc', 'cp1006.pyc', 'utf_16.py', 'shift_jisx0213.pyc', 'cp424.py', 'cp932.pyc', 'iso8859_5.py', 'mac_romanian.py', 'utf_8_sig.pyc', 'iso8859_1.py', 'cp875.py', 'cp437.pyc', 'cp865.py', 'utf_7.py', 'utf_16_be.pyc', 'rot_13.pyc', 'euc_jp.pyc', 'raw_unicode_escape.pyc', 'iso8859_8.pyc', 'utf_16.pyc', 'iso8859_14.py', 'iso8859_16.pyc', 'cp852.pyc', 'cp737.pyc', 'mac_croatian.py', 'mac_latin2.pyc', 'iso2022_jp_ext.py', 'cp1140.py', 'mac_cyrillic.py', 'cp1257.py', 'cp500.py', 'cp1140.pyc', 'shift_jis.py', 'unicode_escape.py', 'cp864.py', 'cp864.pyc', 'cp857.py', 'hex_codec.py', 'mac_farsi.pyc', 'idna.py', 'johab.py', 'utf_7.pyc', 'cp863.py', 'iso8859_13.py', 'koi8_u.py', 'gb2312.pyc', 'cp1256.pyc', 'cp866.pyc',
    'iso2022_jp_2004.py', 'utf_16_le.pyc', 'gb2312.py', 'cp1006.py', 'big5.py']

but since anyone can register a codec, that won't be an exhaustive list.
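To illustrate why a directory listing cannot be exhaustive, here is a minimal sketch, assuming Python 3.6 or later, that registers a codec at runtime through `codecs.register`. The codec name `demo_codec` and the search function are hypothetical; the codec simply reuses latin_1's encode and decode functions.

```python
import codecs
import encodings
import pkgutil


def demo_search(name):
    """Codec search function: only answers for the made-up name 'demo_codec'."""
    if name != "demo_codec":
        return None
    base = codecs.lookup("latin_1")
    return codecs.CodecInfo(name="demo_codec", encode=base.encode, decode=base.decode)


codecs.register(demo_search)

# The codec works like any other...
encoded = "é".encode("demo_codec")

# ...but no module of that name exists in the encodings package.
module_names = {info.name for info in pkgutil.iter_modules(encodings.__path__)}
print(encoded, "demo_codec" in module_names)
```

Any approach that scans the encodings package, or its alias table, will miss codecs registered this way.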


The Python source code has a script at Tools/unicode/listcodecs.py which lists all codecs.

Among the listed codecs, however, there are some that are not Unicode-to-byte converters, like base64_codec, quopri_codec and bz2_codec, as @John Machin pointed out.
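One way to separate the two kinds of codec is sketched below, assuming Python 3.4 or later, where `str.encode` refuses codecs that are not text encodings by raising `LookupError`. The helper name `is_text_encoding` is made up for this example.

```python
def is_text_encoding(name):
    """Return True if `name` can be used with str.encode(), i.e. maps text to bytes."""
    try:
        "".encode(name)
    except LookupError:
        # Unknown codec, or a bytes-to-bytes / str-to-str transform such as base64_codec.
        return False
    except UnicodeError:
        # Registered as a text encoding but refuses the input (e.g. 'undefined').
        return True
    return True


names = ["utf_8", "cp1252", "base64_codec", "zlib_codec", "rot_13"]
text_only = [name for name in names if is_text_encoding(name)]
print(text_only)  # ['utf_8', 'cp1252']
```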


Other answers here seem to indicate that constructing this list programmatically is difficult and fraught with traps. However, doing so is probably unnecessary, since the documentation contains a complete list of the standard encodings Python supports, and has done so since Python 2.3.

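You can find these lists (for each stable version of the language released so far) in the "Standard Encodings" section of each version's codecs documentation; the exact URLs appear in the script at the end of this answer.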

Below are the lists for each documented version of Python. Note that if you want backwards compatibility rather than just supporting one particular version of Python, you can copy the list from the latest Python version and check whether each encoding exists in the Python running your program before trying to use it.

Python 2.3 (59 encodings)

['ascii', 'cp037', 'cp424', 'cp437', 'cp500', 'cp737', 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp869', 'cp874', 'cp875', 'cp1006', 'cp1026', 'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255', 'cp1256', 'cp1257', 'cp1258', 'latin_1', 'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7', 'iso8859_8', 'iso8859_9', 'iso8859_10', 'iso8859_13', 'iso8859_14', 'iso8859_15', 'koi8_r', 'koi8_u', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2', 'mac_roman', 'mac_turkish', 'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7', 'utf_8']

Python 2.4 (85 encodings)

['ascii', 'big5', 'big5hkscs', 'cp037', 'cp424', 'cp437', 'cp500', 'cp737', 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'cp869', 'cp874', 'cp875', 'cp932', 'cp949', 'cp950', 'cp1006', 'cp1026', 'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255', 'cp1256', 'cp1257', 'cp1258', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213', 'euc_kr', 'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2', 'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1', 'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7', 'iso8859_8', 'iso8859_9', 'iso8859_10', 'iso8859_13', 'iso8859_14', 'iso8859_15', 'johab', 'koi8_r', 'koi8_u', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2', 'mac_roman', 'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213', 'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7', 'utf_8']

Python 2.5 (86 encodings)

['ascii', 'big5', 'big5hkscs', 'cp037', 'cp424', 'cp437', 'cp500', 'cp737', 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'cp869', 'cp874', 'cp875', 'cp932', 'cp949', 'cp950', 'cp1006', 'cp1026', 'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255', 'cp1256', 'cp1257', 'cp1258', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213', 'euc_kr', 'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2', 'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1', 'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7', 'iso8859_8', 'iso8859_9', 'iso8859_10', 'iso8859_13', 'iso8859_14', 'iso8859_15', 'johab', 'koi8_r', 'koi8_u', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2', 'mac_roman', 'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213', 'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7', 'utf_8', 'utf_8_sig']

Python 2.6 (90 encodings)

['ascii', 'big5', 'big5hkscs', 'cp037', 'cp424', 'cp437', 'cp500', 'cp737', 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'cp869', 'cp874', 'cp875', 'cp932', 'cp949', 'cp950', 'cp1006', 'cp1026', 'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255', 'cp1256', 'cp1257', 'cp1258', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213', 'euc_kr', 'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2', 'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1', 'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7', 'iso8859_8', 'iso8859_9', 'iso8859_10', 'iso8859_13', 'iso8859_14', 'iso8859_15', 'iso8859_16', 'johab', 'koi8_r', 'koi8_u', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2', 'mac_roman', 'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213', 'utf_32', 'utf_32_be', 'utf_32_le', 'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7', 'utf_8', 'utf_8_sig']

Python 2.7 (93 encodings)

['ascii', 'big5', 'big5hkscs', 'cp037', 'cp424', 'cp437', 'cp500', 'cp720', 'cp737', 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857', 'cp858', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'cp869', 'cp874', 'cp875', 'cp932', 'cp949', 'cp950', 'cp1006', 'cp1026', 'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255', 'cp1256', 'cp1257', 'cp1258', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213', 'euc_kr', 'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2', 'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1', 'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7', 'iso8859_8', 'iso8859_9', 'iso8859_10', 'iso8859_11', 'iso8859_13', 'iso8859_14', 'iso8859_15', 'iso8859_16', 'johab', 'koi8_r', 'koi8_u', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2', 'mac_roman', 'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213', 'utf_32', 'utf_32_be', 'utf_32_le', 'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7', 'utf_8', 'utf_8_sig']

Python 3.0 (89 encodings)

['ascii', 'big5', 'big5hkscs', 'cp037', 'cp424', 'cp437', 'cp500', 'cp737', 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'cp869', 'cp874', 'cp875', 'cp932', 'cp949', 'cp950', 'cp1006', 'cp1026', 'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255', 'cp1256', 'cp1257', 'cp1258', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213', 'euc_kr', 'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2', 'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1', 'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7', 'iso8859_8', 'iso8859_9', 'iso8859_10', 'iso8859_13', 'iso8859_14', 'iso8859_15', 'johab', 'koi8_r', 'koi8_u', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2', 'mac_roman', 'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213', 'utf_32', 'utf_32_be', 'utf_32_le', 'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7', 'utf_8', 'utf_8_sig']

Python 3.1 (90 encodings)

['ascii', 'big5', 'big5hkscs', 'cp037', 'cp424', 'cp437', 'cp500', 'cp737', 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'cp869', 'cp874', 'cp875', 'cp932', 'cp949', 'cp950', 'cp1006', 'cp1026', 'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255', 'cp1256', 'cp1257', 'cp1258', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213', 'euc_kr', 'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2', 'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1', 'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7', 'iso8859_8', 'iso8859_9', 'iso8859_10', 'iso8859_13', 'iso8859_14', 'iso8859_15', 'iso8859_16', 'johab', 'koi8_r', 'koi8_u', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2', 'mac_roman', 'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213', 'utf_32', 'utf_32_be', 'utf_32_le', 'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7', 'utf_8', 'utf_8_sig']

Python 3.2 (92 encodings)

['ascii', 'big5', 'big5hkscs', 'cp037', 'cp424', 'cp437', 'cp500', 'cp720', 'cp737', 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857', 'cp858', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'cp869', 'cp874', 'cp875', 'cp932', 'cp949', 'cp950', 'cp1006', 'cp1026', 'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255', 'cp1256', 'cp1257', 'cp1258', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213', 'euc_kr', 'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2', 'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1', 'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7', 'iso8859_8', 'iso8859_9', 'iso8859_10', 'iso8859_13', 'iso8859_14', 'iso8859_15', 'iso8859_16', 'johab', 'koi8_r', 'koi8_u', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2', 'mac_roman', 'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213', 'utf_32', 'utf_32_be', 'utf_32_le', 'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7', 'utf_8', 'utf_8_sig']

Python 3.3 (93 encodings)

['ascii', 'big5', 'big5hkscs', 'cp037', 'cp424', 'cp437', 'cp500', 'cp720', 'cp737', 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857', 'cp858', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'cp869', 'cp874', 'cp875', 'cp932', 'cp949', 'cp950', 'cp1006', 'cp1026', 'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255', 'cp1256', 'cp1257', 'cp1258', 'cp65001', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213', 'euc_kr', 'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2', 'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1', 'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7', 'iso8859_8', 'iso8859_9', 'iso8859_10', 'iso8859_13', 'iso8859_14', 'iso8859_15', 'iso8859_16', 'johab', 'koi8_r', 'koi8_u', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2', 'mac_roman', 'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213', 'utf_32', 'utf_32_be', 'utf_32_le', 'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7', 'utf_8', 'utf_8_sig']

Python 3.4 (96 encodings)

['ascii', 'big5', 'big5hkscs', 'cp037', 'cp273', 'cp424', 'cp437', 'cp500', 'cp720', 'cp737', 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857', 'cp858', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'cp869', 'cp874', 'cp875', 'cp932', 'cp949', 'cp950', 'cp1006', 'cp1026', 'cp1125', 'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255', 'cp1256', 'cp1257', 'cp1258', 'cp65001', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213', 'euc_kr', 'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2', 'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1', 'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7', 'iso8859_8', 'iso8859_9', 'iso8859_10', 'iso8859_11', 'iso8859_13', 'iso8859_14', 'iso8859_15', 'iso8859_16', 'johab', 'koi8_r', 'koi8_u', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2', 'mac_roman', 'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213', 'utf_32', 'utf_32_be', 'utf_32_le', 'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7', 'utf_8', 'utf_8_sig']

Python 3.5 (98 encodings)

['ascii', 'big5', 'big5hkscs', 'cp037', 'cp273', 'cp424', 'cp437', 'cp500', 'cp720', 'cp737', 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857', 'cp858', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'cp869', 'cp874', 'cp875', 'cp932', 'cp949', 'cp950', 'cp1006', 'cp1026', 'cp1125', 'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255', 'cp1256', 'cp1257', 'cp1258', 'cp65001', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213', 'euc_kr', 'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2', 'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1', 'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7', 'iso8859_8', 'iso8859_9', 'iso8859_10', 'iso8859_11', 'iso8859_13', 'iso8859_14', 'iso8859_15', 'iso8859_16', 'johab', 'koi8_r', 'koi8_t', 'koi8_u', 'kz1048', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2', 'mac_roman', 'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213', 'utf_32', 'utf_32_be', 'utf_32_le', 'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7', 'utf_8', 'utf_8_sig']

Python 3.6 (98 encodings)

['ascii', 'big5', 'big5hkscs', 'cp037', 'cp273', 'cp424', 'cp437', 'cp500', 'cp720', 'cp737', 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857', 'cp858', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'cp869', 'cp874', 'cp875', 'cp932', 'cp949', 'cp950', 'cp1006', 'cp1026', 'cp1125', 'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255', 'cp1256', 'cp1257', 'cp1258', 'cp65001', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213', 'euc_kr', 'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2', 'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1', 'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7', 'iso8859_8', 'iso8859_9', 'iso8859_10', 'iso8859_11', 'iso8859_13', 'iso8859_14', 'iso8859_15', 'iso8859_16', 'johab', 'koi8_r', 'koi8_t', 'koi8_u', 'kz1048', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2', 'mac_roman', 'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213', 'utf_32', 'utf_32_be', 'utf_32_le', 'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7', 'utf_8', 'utf_8_sig']
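The backwards-compatibility check suggested earlier (copy the newest list, then keep only what the running interpreter knows) could look like the sketch below. It assumes `codecs.lookup` raising `LookupError` is an acceptable test for "exists"; the helper name `available_encodings` and the short `wanted` list are illustrative only.

```python
import codecs


def available_encodings(names):
    """Keep only the encoding names that the running Python can look up."""
    usable = []
    for name in names:
        try:
            codecs.lookup(name)
        except LookupError:
            # Not provided by this Python version or platform.
            continue
        usable.append(name)
    return usable


# In practice `wanted` would be the full list copied from the newest table above.
wanted = ["utf_8", "cp1252", "not_a_real_codec"]
usable = available_encodings(wanted)
print(usable)  # ['utf_8', 'cp1252']
```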

In case they're relevant to anyone's use case, note that the docs also list some Python-specific encodings, many of which seem to be primarily for use by Python's internals or are otherwise weird in some way, like the 'undefined' encoding, which always throws an exception if you try to use it. You probably want to ignore these completely if, like the question-asker here, you're trying to figure out what encoding was used for some text you've come across in the real world. As of Python 3.6, the list is as follows:

["idna", "mbcs", "oem", "palmos", "punycode", "raw_unicode_escape", "rot_13", "undefined", "unicode_escape", "unicode_internal", "base64_codec", "bz2_codec", "hex_codec", "quopri_codec", "string_escape", "uu_codec", "zlib_codec"]

Finally, in case you'd like to update my tables above for a newer version of Python, here's the (crude, not very robust) script I used to generate them:

    import requests
    import lxml.html
    import pprint

    for version, url in [
        ('2.3', 'https://docs.python.org/2.3/lib/node130.html'),
        ('2.4', 'https://docs.python.org/2.4/lib/standard-encodings.html'),
        ('2.5', 'https://docs.python.org/2.5/lib/standard-encodings.html'),
        ('2.6', 'https://docs.python.org/2.6/library/codecs.html#standard-encodings'),
        ('2.7', 'https://docs.python.org/2.7/library/codecs.html#standard-encodings'),
        ('3.0', 'https://docs.python.org/3.0/library/codecs.html#standard-encodings'),
        ('3.1', 'https://docs.python.org/3.1/library/codecs.html#standard-encodings'),
        ('3.2', 'https://docs.python.org/3.2/library/codecs.html#standard-encodings'),
        ('3.3', 'https://docs.python.org/3.3/library/codecs.html#standard-encodings'),
        ('3.4', 'https://docs.python.org/3.4/library/codecs.html#standard-encodings'),
        ('3.5', 'https://docs.python.org/3.5/library/codecs.html#standard-encodings'),
        ('3.6', 'https://docs.python.org/3.6/library/codecs.html#standard-encodings'),
    ]:
        html = requests.get(url).text
        doc = lxml.html.fromstring(html)
        # First table after the "Standard Encodings" heading that has a "Codec" column
        standard_encodings_table = doc.xpath(
            '//table[preceding::h2[.//text()[contains(., "Standard Encodings")]]][//th/text()="Codec"]'
        )[0]
        codecs = standard_encodings_table.xpath('.//td[1]/text()')
        print("## Python %s (%i encodings)" % (version, len(codecs)))
        print('<pre><code>' + pprint.pformat(codecs) + '</code></pre>')


You could probably do this:

    from encodings.aliases import aliases
    print(aliases.keys())


You can use a technique to list all the modules in the encodings package.

    import pkgutil
    import encodings

    # Modules in the package that are not codecs themselves
    false_positives = set(["aliases"])

    found = set(name for imp, name, ispkg in pkgutil.iter_modules(encodings.__path__) if not ispkg)
    found.difference_update(false_positives)
    print(found)


Maybe you should try using the Universal Encoding Detector (chardet) library instead of implementing it yourself.

    >>> import chardet
    >>> s = '\xe2\x98\x83' # ☃
    >>> chardet.detect(s)
    {'confidence': 0.505, 'encoding': 'utf-8'}


Unfortunately encodings.aliases.aliases.keys() is NOT an appropriate answer.

aliases (as one would/should expect) contains several cases where different keys are mapped to the same value, e.g. 1252 and windows_1252 are both mapped to cp1252. You could save time if, instead of aliases.keys(), you use set(aliases.values()).

BUT THERE'S A WORSE PROBLEM: aliases doesn't contain codecs that don't have any aliases (like cp856, cp874, cp875, cp737, and koi8_u).

    >>> from encodings.aliases import aliases
    >>> def find(q):
    ...     return [(k,v) for k, v in aliases.items() if q in k or q in v]
    ...
    >>> find('1252') # multiple aliases
    [('1252', 'cp1252'), ('windows_1252', 'cp1252')]
    >>> find('856') # no codepage 856 in aliases
    []
    >>> find('koi8') # no koi8_u in aliases
    [('cskoi8r', 'koi8_r')]
    >>> 'x'.decode('cp856') # but cp856 is a valid codec
    u'x'
    >>> 'x'.decode('koi8_u') # but koi8_u is a valid codec
    u'x'
    >>>
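Both points about `aliases` can be checked in one pass. The sketch below assumes Python 3.6 or later and compares the alias targets with the module names found in the encodings package. Which names end up in `missing` depends on the Python version, so no particular output is claimed.

```python
import encodings
import pkgutil
from encodings.aliases import aliases

# Many alias keys point at the same codec, so the distinct targets are far fewer.
canonical = set(aliases.values())

# Codec modules shipped in the package, minus the alias table itself.
modules = {info.name for info in pkgutil.iter_modules(encodings.__path__)}
modules.discard("aliases")

# Modules that no alias points at: these are invisible if you only read `aliases`.
missing = sorted(modules - canonical)

print(len(aliases), "aliases map to", len(canonical), "distinct codecs")
print("codec modules with no alias:", missing)
```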

It's also worth noting that however you obtain a full list of codecs, it may be a good idea to ignore the codecs that aren't about encoding/decoding character sets but do some other transformation, e.g. zlib, quopri, and base64.

Which brings us to the question of WHY you want to "try encoding bytes into many different encodings". If we know that, we may be able to steer you in the right direction.

For a start, that's ambiguous. One DEcodes bytes into unicode, and one ENcodes unicode into bytes. Which do you want to do?

What are you really trying to achieve? Are you trying to determine which codec to use to decode some incoming bytes, and plan to attempt this with all possible codecs? [note: latin1 will decode anything] Are you trying to determine the language of some unicode text by trying to encode it with all possible codecs? [note: utf8 will encode anything].
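The note about latin1 matters in practice: "decodes without error" does not mean "is the right codec". A small sketch, assuming Python 3 and an arbitrary four-codec candidate list chosen for illustration:

```python
# latin_1 maps every byte value 0..255 to a character, so decoding never fails.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin_1")

# The UTF-8 bytes of a snowman decode "successfully" with several codecs,
# but only one of them gives back the snowman.
snowman_bytes = "☃".encode("utf_8")
candidates = ["ascii", "utf_8", "latin_1", "cp1252"]

decodable = []
for name in candidates:
    try:
        snowman_bytes.decode(name)
    except UnicodeDecodeError:
        continue
    decodable.append(name)

print(decodable)  # ['utf_8', 'latin_1', 'cp1252']
```

A brute-force search therefore needs some extra signal, such as a known expected character, to pick among the codecs that merely succeed.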