Why does Python print unicode characters when the default encoding is ASCII?
From the Python 2.6 shell:
>>> import sys
>>> print sys.getdefaultencoding()
ascii
>>> print u'\xe9'
é
>>>
"é"문자가 ASCII의 일부가 아니고 인코딩을 지정하지 않았기 때문에 print 문 뒤에 약간의 횡설수설 또는 오류가있을 것으로 예상됩니다. ASCII가 기본 인코딩이라는 것이 무엇인지 이해하지 못하는 것 같습니다.
편집하다
수정 사항을 답변 섹션 으로 옮기고 제안대로 수락했습니다.
다양한 답글의 비트와 조각 덕분에 설명을 할 수 있다고 생각합니다.
When trying to print a unicode string, u'\xe9', Python implicitly tries to encode that string using the encoding scheme currently stored in sys.stdout.encoding. Python picks up this setting from the environment it has been started from. If it can't find a proper encoding from the environment, only then does it revert to its default, ASCII.
For example, I use a bash shell whose encoding defaults to UTF-8. If I start Python from it, it picks up and uses that setting:
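In other words, for a unicode string, print behaves roughly like the following sketch (a simplification; the real implementation also handles the case where sys.stdout.encoding is None):
>>> import sys
>>> s = u'\xe9'
>>> sys.stdout.write(s.encode(sys.stdout.encoding) + '\n')  # roughly what "print s" does
é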
$ python
>>> import sys
>>> print sys.stdout.encoding
UTF-8
Let's exit the Python shell for a moment and set bash's environment with some bogus encoding:
$ export LC_CTYPE=klingon
# we should get some error message here, just ignore it.
Then start the Python shell again and verify that it does indeed revert to its default ASCII encoding:
$ python
>>> import sys
>>> print sys.stdout.encoding
ANSI_X3.4-1968
Bingo!
If you now try to output a unicode character outside the ASCII range, you get a nice error message:
>>> print u'\xe9'
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9'
in position 0: ordinal not in range(128)
Let's exit Python and discard the bash shell.
Now let's observe what happens after Python outputs strings. For this we'll first start a bash shell inside a graphical terminal (I use Gnome Terminal) and set the terminal to decode output with ISO-8859-1, aka latin-1 (graphical terminals usually have an option to set the character encoding in one of their dropdown menus). Note that this does not change the actual shell environment's encoding; it only changes the way the terminal itself decodes the output it is given, a bit like a web browser does. You can therefore change the terminal's encoding independently from the shell's environment. Let's then start Python from the shell and verify that sys.stdout.encoding is set to the shell environment's encoding (UTF-8):
$ python
>>> import sys
>>> print sys.stdout.encoding
UTF-8
>>> print '\xe9' # (1)
é
>>> print u'\xe9' # (2)
é
>>> print u'\xe9'.encode('latin-1') # (3)
é
>>>
(1) python outputs binary string as is, terminal receives it and tries to match its value with latin-1 character map. In latin-1, 0xe9 or 233 yields the character "é" and so that's what the terminal displays.
(2) python attempts to implicitly encode the Unicode string with whatever scheme is currently set in sys.stdout.encoding, in this instance it's "UTF-8". After UTF-8 encoding, the resulting binary string is '\xc3\xa9' (see later explanation). Terminal receives the stream as such and tries to decode 0xc3a9 using latin-1, but latin-1 goes from 0 to 255 and so, only decodes streams 1 byte at a time. 0xc3a9 is 2 bytes long, latin-1 decoder therefore interprets it as 0xc3 (195) and 0xa9 (169) and that yields 2 characters: Ã and ©.
(3) python encodes the unicode code point u'\xe9' (233) with the latin-1 scheme. It turns out the latin-1 code point range is 0-255 and points to the exact same characters as Unicode within that range. Therefore, Unicode code points in that range yield the same value when encoded in latin-1. So u'\xe9' (233) encoded in latin-1 also yields the binary string '\xe9'. The terminal receives that value and tries to match it on the latin-1 character map. Just like case (1), it yields "é" and that's what's displayed.
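You can check these byte values from any Python 2 shell, independently of what the terminal displays (a quick verification of cases (2) and (3)):
>>> u'\xe9'.encode('utf-8')     # case (2): 2 bytes
'\xc3\xa9'
>>> u'\xe9'.encode('latin-1')   # case (3): 1 byte, same value as the code point
'\xe9'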
Let's now change the terminal's encoding settings to UTF-8 from the dropdown menu (like you would change your web browser's encoding settings). No need to stop Python or restart the shell. The terminal's encoding now matches Python's. Let's try printing again:
>>> print '\xe9' # (4)
>>> print u'\xe9' # (5)
é
>>> print u'\xe9'.encode('latin-1') # (6)
>>>
(4) python outputs a binary string as is. Terminal attempts to decode that stream with UTF-8. But UTF-8 doesn't understand the value 0xe9 (see later explanation) and is therefore unable to convert it to a unicode code point. No code point found, no character printed.
(5) python attempts to implicitly encode the Unicode string with whatever's in sys.stdout.encoding. Still "UTF-8". The resulting binary string is '\xc3\xa9'. Terminal receives the stream and attempts to decode 0xc3a9 also using UTF-8. It yields back code value 0xe9 (233), which on the Unicode character map points to the symbol "é". Terminal displays "é".
(6) python encodes unicode string with latin-1, it yields a binary string with the same value '\xe9'. Again, for the terminal this is pretty much the same as case (4).
Conclusions:
- Python outputs non-unicode strings as raw data, without considering its default encoding. The terminal just happens to display them if its current encoding matches the data.
- Python outputs Unicode strings after encoding them using the scheme specified in sys.stdout.encoding.
- Python gets that setting from the shell's environment.
- The terminal displays output according to its own encoding settings.
- The terminal's encoding is independent from the shell's.
More details on unicode, UTF-8 and latin-1:
Unicode is basically a table of characters where some keys (code points) have been conventionally assigned to point to some symbols. e.g. by convention it's been decided that key 0xe9 (233) is the value pointing to the symbol 'é'. ASCII and Unicode use the same code points from 0 to 127, as do latin-1 and Unicode from 0 to 255. That is, 0x41 points to 'A' in ASCII, latin-1 and Unicode, 0xc8 points to 'È' in latin-1 and Unicode, 0xe9 points to 'é' in latin-1 and Unicode.
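For instance, a Python 2 shell can be used to look up a code point and the name conventionally assigned to it:
>>> import unicodedata
>>> unichr(0xe9)
u'\xe9'
>>> unicodedata.name(u'\xe9')
'LATIN SMALL LETTER E WITH ACUTE'
>>> ord(u'A'), ord(u'\xe9')
(65, 233)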
When working with electronic devices, Unicode code points need an efficient way to be represented electronically. That's what encoding schemes are about. Various Unicode encoding schemes exist (UTF-7, UTF-8, UTF-16, UTF-32). The most intuitive and straightforward encoding approach would be to simply use a code point's value in the Unicode map as its value for its electronic form, but Unicode currently has over a million code points, which means that some of them require 3 bytes to be expressed. To work efficiently with text, a 1 to 1 mapping would be rather impractical, since it would require that all code points be stored in exactly the same amount of space, with a minimum of 3 bytes per character, regardless of their actual need.
Most encoding schemes have shortcomings regarding space requirement, the most economic ones don't cover all unicode code points, for example ascii only covers the first 128, while latin-1 covers the first 256. Others that try to be more comprehensive end up also being wasteful, since they require more bytes than necessary, even for common "cheap" characters. UTF-16 for instance, uses a minimum of 2 bytes per character, including those in the ascii range ('B' which is 65, still requires 2 bytes of storage in UTF-16). UTF-32 is even more wasteful as it stores all characters in 4 bytes.
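The space trade-off is easy to see from a Python 2 shell by comparing how many bytes each scheme needs for the same character (utf-16-le and utf-32-le are used here so the byte-order mark doesn't get counted):
>>> len(u'B'.encode('utf-8')), len(u'B'.encode('utf-16-le')), len(u'B'.encode('utf-32-le'))
(1, 2, 4)
>>> len(u'\xe9'.encode('utf-8')), len(u'\xe9'.encode('utf-16-le')), len(u'\xe9'.encode('utf-32-le'))
(2, 2, 4)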
UTF-8 happens to have cleverly resolved the dilemma, with a scheme able to store code points with a variable amount of byte spaces. As part of its encoding strategy, UTF-8 laces code points with flag bits that indicate (presumably to decoders) their space requirements and their boundaries.
UTF-8 encoding of unicode code points in the ascii range (0-127):
0xxx xxxx (in binary)
- the x's show the actual space reserved to "store" the code point during encoding
- The leading 0 is a flag that indicates to the UTF-8 decoder that this code point will only require 1 byte.
- upon encoding, UTF-8 doesn't change the value of code points in that specific range (i.e. 65 encoded in UTF-8 is also 65). Considering that Unicode and ASCII are also compatible in the same range, it incidentally makes UTF-8 and ASCII also compatible in that range.
e.g. Unicode code point for 'B' is '0x42' or 0100 0010 in binary (as we said, it's the same in ASCII). After encoding in UTF-8 it becomes:
0xxx xxxx <-- UTF-8 encoding for Unicode code points 0 to 127
*100 0010 <-- Unicode code point 0x42
0100 0010 <-- UTF-8 encoded (exactly the same)
UTF-8 encoding of Unicode code points above 127 (non-ascii):
110x xxxx 10xx xxxx <-- (from 128 to 2047)
1110 xxxx 10xx xxxx 10xx xxxx <-- (from 2048 to 65535)
- the leading bits '110' indicate to the UTF-8 decoder the beginning of a code point encoded in 2 bytes, whereas '1110' indicates 3 bytes, 11110 would indicate 4 bytes and so forth.
- the inner '10' flag bits are used to signal the beginning of an inner byte.
- again, the x's mark the space where the Unicode code point value is stored after encoding.
e.g. 'é' Unicode code point is 0xe9 (233).
1110 1001 <-- 0xe9
When UTF-8 encodes this value, it determines that the value is larger than 127 and less than 2048, therefore should be encoded in 2 bytes:
110x xxxx 10xx xxxx <-- UTF-8 encoding for Unicode 128-2047
***0 0011 **10 1001 <-- 0xe9
1100 0011 1010 1001 <-- 'é' after UTF-8 encoding
C 3 A 9
The 0xe9 Unicode code point after UTF-8 encoding becomes 0xc3a9. Which is exactly how the terminal receives it. If your terminal is set to decode strings using latin-1 (one of the non-unicode legacy encodings), you'll see Ã©, because it just so happens that 0xc3 in latin-1 points to Ã and 0xa9 to ©.
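The same bit pattern can be checked from a Python 2 shell (format() is available from Python 2.6 onward):
>>> encoded = u'\xe9'.encode('utf-8')
>>> encoded
'\xc3\xa9'
>>> [format(ord(byte), '08b') for byte in encoded]
['11000011', '10101001']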
When Unicode characters are printed to stdout, sys.stdout.encoding is used. A non-Unicode character is assumed to be in sys.stdout.encoding and is just sent to the terminal. On my system (Python 2):
>>> import unicodedata as ud
>>> import sys
>>> sys.stdout.encoding
'cp437'
>>> ud.name(u'\xe9') # U+00E9 Unicode codepoint
'LATIN SMALL LETTER E WITH ACUTE'
>>> ud.name('\xe9'.decode('cp437'))
'GREEK CAPITAL LETTER THETA'
>>> '\xe9'.decode('cp437') # byte E9 decoded using code page 437 is U+0398.
u'\u0398'
>>> ud.name(u'\u0398')
'GREEK CAPITAL LETTER THETA'
>>> print u'\xe9' # Unicode is encoded to CP437 correctly
é
>>> print '\xe9' # Byte is just sent to terminal and assumed to be CP437.
Θ
sys.getdefaultencoding() is only used when Python doesn't have another option.
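For example, converting a unicode string to str without an explicit encoding is one situation where Python has no other option and falls back to that default:
>>> str(u'\xe9')   # implicit encoding falls back to sys.getdefaultencoding()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)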
Note that Python 3.6 or later ignores encodings on Windows and uses Unicode APIs to write Unicode to the terminal. There are no UnicodeEncodeError warnings, and the correct character is displayed if the font supports it. Even if the font doesn't support it, the characters can still be cut and pasted from the terminal into an application with a supporting font, and they will come through correctly. Upgrade!
The Python REPL tries to pick up what encoding to use from your environment. If it finds something sane then it all Just Works. It's when it can't figure out what's going on that it bugs out.
>>> print sys.stdout.encoding
UTF-8
You have specified an encoding by entering an explicit Unicode string. Compare the results of not using the u prefix.
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> '\xe9'
'\xe9'
>>> u'\xe9'
u'\xe9'
>>> print u'\xe9'
é
>>> print '\xe9'
>>>
In the case of \xe9, Python assumes your default encoding (ASCII), thus printing ... something blank.
It works for me:
import sys
# keep references to the original streams; reload(sys) can replace them in some shells
stdin, stdout = sys.stdin, sys.stdout
# reload(sys) restores setdefaultencoding(), which site.py removes at startup
reload(sys)
sys.stdin, sys.stdout = stdin, stdout
sys.setdefaultencoding('utf-8')
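Changing the interpreter-wide default encoding like this is generally discouraged; a narrower alternative (a sketch, assuming UTF-8 is the output encoding you want) is to wrap only the output stream:
import codecs
import sys

# encode unicode strings to UTF-8 as they are written to stdout
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
print u'\xe9'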
As per Python default/implicit string encodings and conversions:
- When printing unicode, it is encoded with <file>.encoding.
  - When the encoding is not set, the unicode is implicitly converted to str (since the codec for that is sys.getdefaultencoding(), i.e. ascii, any national characters would cause a UnicodeEncodeError).
  - For standard streams, the encoding is inferred from the environment. It is typically set for tty streams (from the terminal's locale settings), but is likely to not be set for pipes.
    - So a print u'\xe9' is likely to succeed when the output is to a terminal, and fail if it is redirected. A solution is to encode() the string with the desired encoding before printing (see the sketch after this list).
- When printing str, the bytes are sent to the stream as is. What glyphs the terminal shows will depend on its locale settings.
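As a sketch of the point about pipes (assuming a UTF-8 terminal and no PYTHONIOENCODING override), the same print succeeds on a terminal but fails when redirected, while an explicit encode() works in both cases:
$ python -c "print u'\xe9'"
é
$ python -c "print u'\xe9'" | cat
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
$ python -c "print u'\xe9'.encode('utf-8')" | cat
é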