Python mojibake problems

A collection of solutions to common encoding problems in Python.

PEP 0263 -- Defining Python Source Code Encodings
Authors: Marc-André Lemburg &lt;mal&gt;, Martin von Löwis &lt;martin at v.loewis.de&gt;
Status: Standards Track
Created: 06-Jun-2001
Last-Modified: 22:03:18 +0200 (Thu, 28 Jun 2007)
This PEP proposes to introduce a syntax to declare the encoding of
a Python source file. The encoding information is then used by the
Python parser to interpret the file using the given encoding. Most
notably this enhances the interpretation of Unicode literals in
the source code and makes it possible to write Unicode literals
using e.g. UTF-8 directly in a Unicode aware editor.

Problem
In Python 2.1, Unicode literals can only be written using the
Latin-1 based encoding "unicode-escape". This makes the
programming environment rather unfriendly to Python users who live
and work in non-Latin-1 locales such as many of the Asian
countries. Programmers can write their 8-bit strings using the
favorite encoding, but are bound to the "unicode-escape" encoding
for Unicode literals.

Proposed Solution
I propose to make the Python source code encoding both visible and
changeable on a per-source file basis by using a special comment
at the top of the file to declare the encoding.
To make Python aware of this encoding declaration a number of
concept changes are necessary with respect to the handling of
Python source code data.

Defining the Encoding
Python will default to ASCII as standard encoding if no other
encoding hints are given.
To define a source code encoding, a magic comment must
be placed into the source files either as first or second
line in the file, such as:
# coding=<encoding name>
or (using formats recognized by popular editors)
#!/usr/bin/python
# -*- coding: <encoding name> -*-
or
#!/usr/bin/python
# vim: set fileencoding=<encoding name> :
More precisely, the first or second line must match the regular
expression "coding[:=]\s*([-\w.]+)". The first group of this
expression is then interpreted as encoding name. If the encoding
is unknown to Python, an error is raised during compilation. There
must not be any Python statement on the line that contains the
encoding declaration.
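The detection rule above can be exercised directly with Python's re module; a small sketch (the sample comment lines are illustrative):

```python
import re

# the regular expression quoted in the PEP, applied to typical first/second lines
coding_re = re.compile(r"coding[:=]\s*([-\w.]+)")

for line in ("# -*- coding: latin-1 -*-",
             "# vim: set fileencoding=utf-8 :",
             "# coding=gbk"):
    m = coding_re.search(line)
    print(m.group(1))
```

Note that the vim form matches because `fileencoding=` contains the substring `coding=`, which is exactly how the PEP's single regex covers all three comment styles.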
To aid with platforms such as Windows, which add Unicode BOM marks
to the beginning of Unicode files, the UTF-8 signature
'\xef\xbb\xbf' will be interpreted as 'utf-8' encoding as well
(even if no magic encoding comment is given).
If a source file uses both the UTF-8 BOM mark signature and a
magic encoding comment, the only allowed encoding for the comment
is 'utf-8'.
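Modern CPython applies the same two rules when compile() is handed a byte string; a sketch of the expected behaviour (Python 3 semantics, source snippets illustrative — the exact error message text is an implementation detail):

```python
import codecs

# A UTF-8 BOM combined with a conflicting magic comment must be rejected.
source = codecs.BOM_UTF8 + b"# -*- coding: latin-1 -*-\nx = 1\n"
try:
    compile(source, "<demo>", "exec")
    bom_conflict_rejected = False
except SyntaxError:
    bom_conflict_rejected = True

# A BOM alone is fine: it implies utf-8 even without a magic comment.
code = compile(codecs.BOM_UTF8 + b"x = 1\n", "<demo>", "exec")
print(bom_conflict_rejected)
```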
Any other encoding will cause an error.

Examples
These are some examples to clarify the different styles for
defining the source code encoding at the top of a Python source
file:
1. With interpreter binary and using Emacs style file encoding
comment:
#!/usr/bin/python
# -*- coding: latin-1 -*-
import os, sys
#!/usr/bin/python
# -*- coding: iso-8859-15 -*-
import os, sys
#!/usr/bin/python
# -*- coding: ascii -*-
import os, sys
2. Without interpreter line, using plain text:
# This Python file uses the following encoding: utf-8
import os, sys
3. Text editors might have different ways of defining the file's
encoding, e.g.
#!/usr/local/bin/python
# coding: latin-1
import os, sys
4. Without encoding comment, Python's parser will assume ASCII
#!/usr/local/bin/python
import os, sys
5. Encoding comments which don't work:
Missing "coding:" prefix:
#!/usr/local/bin/python
# latin-1
import os, sys
Encoding comment not on line 1 or 2:
#!/usr/local/bin/python
#
# -*- coding: latin-1 -*-
import os, sys
Unsupported encoding:
#!/usr/local/bin/python
# -*- coding: utf-42 -*-
import os, sys
...

Concepts
The PEP is based on the following concepts which would have to be
implemented to enable usage of such a magic comment:
1. The complete Python source file should use a single encoding.
Embedding of differently encoded data is not allowed and will
result in a decoding error during compilation of the Python
source code.
Any encoding which allows processing the first two lines in the
way indicated above is allowed as source code encoding, this
includes ASCII compatible encodings as well as certain
multi-byte encodings such as Shift_JIS. It does not include
encodings which use two or more bytes for all characters like
e.g. UTF-16. The reason for this is to keep the encoding
detection algorithm in the tokenizer simple.
2. Handling of escape sequences should continue to work as it does
now, but with all possible source code encodings, that is
standard string literals (both 8-bit and Unicode) are subject to
escape sequence expansion while raw string literals only expand
a very small subset of escape sequences.
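The distinction between standard and raw string literals in concept 2 is easy to see; a small sketch:

```python
# In a standard literal, \n is expanded to a single newline character.
standard = '\n'
print(len(standard))  # 1

# In a raw literal, the backslash and the 'n' survive as two characters.
raw = r'\n'
print(len(raw))  # 2
```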
3. Python's tokenizer/compiler combo will need to be updated to
work as follows:
1. read the file
2. decode it into Unicode assuming a fixed per-file encoding
3. convert it into a UTF-8 byte string
4. tokenize the UTF-8 content
5. compile it, creating Unicode objects from the given Unicode data
and creating string objects from the Unicode literal data
by first reencoding the UTF-8 data into 8-bit string data
using the given file encoding
Note that Python identifiers are restricted to the ASCII
subset of the encoding, and thus need no further conversion
after step 4.

Implementation
For backwards-compatibility with existing code which currently
uses non-ASCII in string literals without declaring an encoding,
the implementation will be introduced in two phases:
1. Allow non-ASCII in string literals and comments, by internally
treating a missing encoding declaration as a declaration of
"iso-8859-1". This will cause arbitrary byte strings to
correctly round-trip between step 2 and step 5 of the
processing, and provide compatibility with Python 2.2 for
Unicode literals that contain non-ASCII bytes.
A warning will be issued if non-ASCII bytes are found in the
input, once per improperly encoded input file.
2. Remove the warning, and change the default encoding to "ascii".
The builtin compile() API will be enhanced to accept Unicode as
input. 8-bit string input is subject to the standard procedure for
encoding detection as described above.
If a Unicode string with a coding declaration is passed to compile(),
a SyntaxError will be raised.
SUZUKI Hisao is working on a patch; see [2] for details. A patch
implementing only phase 1 is available at [1].

Phases
Implementation of steps 1 and 2 above was completed in 2.3,
except for changing the default encoding to "ascii".
The default encoding was set to "ascii" in version 2.5.
This PEP intends to provide an upgrade path from the current
(more-or-less) undefined source code encoding situation to a more
robust and portable definition.

References
[1] Phase 1 implementation:
http://python.org/sf/526840
[2] Phase 2 implementation:
http://python.org/sf/534304

History
1.10 and above: see CVS history
1.8: Added '.' to the coding RE.
1.7: Added warnings to phase 1 implementation. Replaced the
Latin-1 default encoding with the interpreter's default
encoding. Added tweaks to compile().
1.4 - 1.6: Minor tweaks
1.3: Worked in comments by Martin v. Loewis:
UTF-8 BOM mark detection, Emacs style magic comment,
two phase approach to the implementation

Copyright
This document has been placed in the public domain.
Posted by chinakr (#12):

UTF-8 and GB2312 encodings in Python
Author: Yoker

I once read in an article that if you save your XML file as UTF-8 in Notepad, and the XML content itself is also UTF-8 encoded, it can be accessed directly without any problem. The code looked like this:
<?xml version="1.0" encoding="utf-8" ?>
<rss version="2.0">
<channel>
<title><![CDATA[RSS title text]]></title>
<link>/</link>
<description>
<![CDATA[text...]]>
</description>
<copyright><![CDATA[Copyright (c) . All Rights Reserved,Inc.]]></copyright>
<webMaster>yoker.</webMaster>
<pubDate> 18:14:28</pubDate>
<item>
<title><![CDATA[article title]]></title>
<link>/News/69.html</link>
<description><![CDATA[detailed article text]]></description>
<pubDate> 10:29:12</pubDate>
</item>
</channel>
</rss>

I didn't pay much attention to the article at the time (careless as always). Today, while rewriting a site's RSS feed, the problem surfaced. I wrote the program rss.asp, saved the file as UTF-8, and put it on the server to test. Since the XML data was generated dynamically, all I got was a "server not found" error page. Baffling! My gut said the XML format was wrong, but where? With nothing but an error page, what was there to reason about?
Next, I used FSO to wrap the dynamically fetched data in XML markup and write it out as rss.xml in the same directory. Accessing the generated rss.xml: mojibake! Stuck again.
I opened rss.xml and read it over and over, losing my bearings. I shouted for help in a QQ group; nobody answered. Was the question too simple, or had I failed to state it clearly? Either way, with no replies, I turned to my dearest teacher, the search engine, typed in "asp utf-8 保存文件", and got back a pile of results. Skimming a few pages, they all mentioned ADODB.Stream. I had once studied those parameters seriously, yet they looked so unfamiliar now. Back to the manual, and the picture slowly cleared: FSO saves files in ANSI encoding, while my XML data was UTF-8 encoded. No wonder!
So it came back to the root of the problem: encoding! After another round of blind thrashing with no result, I calmed down and thought it through. My program was saved as UTF-8, and I was again using UTF-8 to generate rss.xml; something about doubling up on UTF-8 had to be wrong. So I rewrote the code, saved rss.asp itself as ANSI, and used UTF-8 only when writing out the generated XML data to rss.xml. Opening rss.xml in Firefox: OK, the lovely XML tree appeared. Now I saw the connection to the article I had read. But how should the dynamically served rss.asp itself be handled? Half a second later: add Response.CharSet = "utf-8". The result was a plain list of items rather than the XML tree I wanted, and the source was mojibake again. Why now?
Frustrating; nothing goes smoothly! Calming down again, I looked at my header include file and found a line I will never forget: <%@ LANGUAGE = VBScript CodePage = 936 %>. 936 is gb2312! That little wretch cost me dearly. But I couldn't just change it, or every other page would break. Asking around some more, I found that Session.CodePage = 65001 also works; its drawback is that it changes the codepage for the whole site, which is no better than editing the include file. Then it hit me: set it, and restore it when done. Applied that, and it worked completely!
Conclusion: when using UTF-8, besides saving the file itself in UTF-8 format, you must also set the codepage and charset accordingly.
Appendix: code to save ANSI data as UTF-8:
Dim oStream
Set oStream = Server.CreateObject("ADODB.Stream")
With oStream
    .Mode = 3 : .Type = 2 : .Charset = "utf-8" : .Open
    .WriteText RssStr
    .SaveToFile Server.MapPath("/Rss.xml"), 2
    .Close
End With
Set oStream = Nothing
Posted by chinakr (#13):

Encoding conversion in Python
Author: Yoker

This section covers Python's encoding machinery and the conversions among unicode, utf-8, utf-16, GBK, GB2312, ISO-8859-1 and other encodings.
I. The encoding of a Python source file
If the source code contains Chinese (or other non-ASCII) characters, the encoding of the file itself must be declared. That declaration also determines the encoding of the byte strings declared in the source file.
#!/usr/bin/python
# -*- coding: utf-8 -*-
II. The encoding of external files read by Python
External input includes disk files as well as network data, and both involve encoding. If the file being read is in gb2312, it must be transcoded before it can be used inside the program:
def gethtml(url):
    """Read a gb2312-encoded page and convert it to utf-8."""
    import urllib
    html = urllib.urlopen(url).read().decode('gb2312')
    return html.encode('utf8')
III. Python's internal encodings and conversions
1. Converting unicode to another encoding (GBK, GB2312, etc.)
For example, if a is unicode and you want gb2312: a.encode('gb2312')
# -*- coding=gb2312 -*-
a = u"中文"
a_gb2312 = a.encode('gb2312')
print a_gb2312
2. Converting another encoding (utf-8, GBK) to unicode
For example, if a is gb2312-encoded and you want unicode: unicode(a, 'gb2312') or a.decode('gb2312')
# -*- coding=gb2312 -*-
a = u"中文"
a_gb2312 = a.encode('gb2312')
print a_gb2312
a_unicode = a_gb2312.decode('gb2312')
assert(a_unicode == a)
a_utf_8 = a_unicode.encode('utf-8')
print a_utf_8
3. Converting between two non-unicode encodings
To go from encoding 1 (GBK, GB2312) to encoding 2 (utf-8, utf-16, ISO-8859-1), first convert to unicode and then to encoding 2. For example, gb2312 to utf-8:
# -*- coding=gb2312 -*-
a = u"中文"
a_gb2312 = a.encode('gb2312')
print a_gb2312
a_unicode = a_gb2312.decode('gb2312')
assert(a_unicode == a)
a_utf_8 = a_unicode.encode('utf-8')
print a_utf_8
4. Determining a string's encoding
isinstance(s, str) tests for an ordinary byte string; isinstance(s, unicode) tests for unicode. If a string is already unicode, converting it to unicode again can fail (though not always).
The following code converts an arbitrary string to unicode:
def u(s, encoding):
    if isinstance(s, unicode):
        return s
    return s.decode(encoding)
5. The difference between unicode and the other encodings
Why not use unicode for every file, instead of GBK, utf-8 and the rest? Unicode is best thought of as an abstract, internal representation, and in general it cannot be stored directly. When saving to disk it must be converted to a concrete encoding such as utf-8 or utf-16.
6. Other approaches
Besides the methods above, you can also use codecs.open() to convert encodings transparently while reading and writing files.
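A minimal round-trip sketch with codecs.open (the file name is illustrative; the call works the same way in Python 2 and 3):

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# write unicode text; codecs.open encodes it to gb2312 on the way out
with codecs.open(path, "w", "gb2312") as f:
    f.write(u"中文")

# read it back, decoding with the same encoding
with codecs.open(path, "r", "gb2312") as f:
    text = f.read()
print(text == u"中文")
```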
Further reading:
http://blog.csdn.net/kiki113/archive//4062063.aspx
http://blog.csdn.net/lanphaday/archive//2834883.aspx
/daping_zhang/blog/item/09dda71ea9d7d21f4134173e.html
http://www.python.org/dev/peps/pep-0008/ *
http://boodebr.org/main/python/all-about-python-and-unicode *
Posted by chinakr (#14):

Python and Chinese fonts
Google keywords: python Bad Unicode data
Posted by cmsgoogle:
Reading and displaying Chinese text in Python has some problems. After some web searching, the link below turned out to give a solution: the key is to import the codecs module.
/articles/unicode/python.html
*********************************************************************
Original article:
Unicode in Python
The first thing to know about Python's Unicode support is that you may need to install a recent version of Python to get it. Users of RedHat Linux 7.x have Python 1.5.2 by default, for compatibility reasons. Unicode support was introduced in Python 1.6.
Unicode Strings in Python
Python has two different string types: an 8-bit non-Unicode string type (str) and a 16-bit Unicode string type (unicode).
Unicode strings are written with a leading u. They may contain Unicode escape sequences of the form \u0000, just as in Java. For example:
question = u'\u00bfHabla espa\u00f1ol?'
# ¿Habla español?

Some Unicode characters have numbers beyond U+FFFF, so Python has another escape: \U, which offers more than enough digits to specify any Unicode codepoint. (Recent C and C++ standards also offer this, but Java does not.)
Python also offers a \N escape which allows you to specify any Unicode character by name.
# This string has 7 characters in all, including the spaces
# between the symbols.
symbols = u'\N{BLACK STAR} \N{WHITE STAR} \N{LIGHTNING} \N{COMET}'

One more way to build a Unicode string object is with the built-in unichr() function, which is the Unicode version of chr().
Unicode Support in the Python Standard Library
Unicode strings are very similar to Python's ordinary 8-bit strings. They have the same useful methods (split(), strip(), find(), and so on). The + and * operators work on Unicode strings just as they do for plain strings. And like plain strings, Unicode strings can do printf-like formatting, using the % symbol. For the most part, you'll feel right at home.
This seamlessness extends to most of Python's standard library.
- Python regular expressions can search Unicode strings.
- Python's standard gettext module supports Unicode. This is the module to use for internationalization of Python programs.
- The Tkinter GUI toolkit offers excellent Unicode support. Here is a minimal Hello, world program using Unicode and Tkinter.
- Python's standard XML library is Unicode-aware (as required by the XML specification).
Most of the standard library works smoothly with Unicode strings. Some modules still are not fully Unicode-friendly, but the most important pieces are in place.
Unicode files and Python
Reading and writing Unicode files from Python is simple. Use codecs.open() and specify the encoding.
import codecs
# Open a UTF-8 file in read mode
infile = codecs.open("infile.txt", "r", "utf-8")
# Read its contents as one large Unicode string.
text = infile.read()
# Close the file.
infile.close()

The same function is used to open Unicode files for writing; just use "w" (write) or "a" (append) as the second argument.
A fourth argument, after the encoding, can be provided to specify error-handling. The possible values are:
- 'strict' - The default. Throw exceptions if errors are detected while encoding or decoding data.
- 'ignore' - Skip over errors or unencodeable characters.
- 'replace' - Replace bad or unencodeable data with a "replacement character", usually a question mark.
Since 'strict' is the default, expect a lot of UnicodeExceptions to be thrown if your data isn't quite right. Once you get the hang of it, those errors become much less frequent.
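The three strategies can be compared on one bad byte string; a small sketch (Python 3 spellings, with b'' literals):

```python
# two pairs of valid ASCII bytes surrounding one byte that is invalid as ASCII
data = b"ab\xffcd"

# 'strict' (the default) raises an exception
try:
    data.decode("ascii")
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True

# 'ignore' silently drops the bad byte
ignored = data.decode("ascii", "ignore")

# 'replace' substitutes U+FFFD, the Unicode replacement character
replaced = data.decode("ascii", "replace")
print(strict_raised, ignored, replaced)
```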
Sometimes a program simply needs to encode or decode a single chunk of Unicode data. This, too, is easy in Python: Unicode strings have an encode() method that returns a str, and str objects have a decode() method that returns a unicode string.
# Suppose we are given these bytes, perhaps over a socket
# or perhaps taken from a database.
bytes = 'Bun\xc4\x83-diminea\xc8\x9ba, lume'
# We want to convert these UTF-8 bytes to a Unicode string.
unicode_strg = bytes.decode('utf-8')
# Now print it, but in the ISO-8859-1 encoding, because
# (let's suppose) that is the format of our display.
print unicode_strg.encode('iso-8859-1', 'replace')

However, note that in this particular example, the source string contains two characters (ă and ț) that are not available in ISO-8859-1! Unfortunately, if our display can only handle ISO-8859-1 characters, there is no satisfactory answer to this problem. Some characters will be lost. The last line of the sample code instructs Python to use the 'replace' error-handling behavior instead of the default 'strict' behavior. This way, although some characters will be replaced with question marks, at least no exception will be thrown.
Of course, it would be better to use a display that can handle all Unicode characters, such as a Tk GUI.
print and Unicode strings
We now come to the most puzzling aspect of Python's Unicode support. Attempting to print a Unicode string causes an error:

>>> print u'\N{POUND SIGN}'
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)

Two elements combine to cause this error:
1. Python's default encoding is ASCII. The pound sign is not an ASCII character. (By contrast, Java's default encoding is usually something like Latin-1, which covers a bit more ground than ASCII.)
2. The default error behavior is 'strict'. If Python encounters a character that it can't encode, it raises a UnicodeError. (This is different from Java, which silently replaces the character with a ? instead.)
Python defaults to ASCII because ASCII is the only thing likely to work everywhere. The correct encoding is not always Latin-1. In fact, it depends on how you are accessing Python.
When Python executes a print statement, it simply passes the output to the operating system (using fwrite() or something like it), and some other program is responsible for actually displaying that output on the screen. For example, on Windows, it might be the Windows console subsystem that displays the result. Or if you're using Windows and running Python on a Unix box somewhere else, your Windows SSH client is actually responsible for displaying the data. If you are running Python in an xterm on Unix, then xterm and your X server handle the display.
To print data reliably, you must know the encoding that this display program expects.
Earlier it was mentioned that IBM PC computers use the "IBM Code Page 437" character set at the BIOS level. The Windows console still emulates CP437. So this print statement will work, on Windows, under a console window.
# Windows console mode only
>>> s = u'\N{POUND SIGN}'
>>> print s.encode('cp437')

Several SSH clients display data using the Latin-1 character set. Tkinter assumes UTF-8 when 8-bit strings are passed into it. So in general it is not possible to determine what encoding to use with print. It is therefore better to send Unicode output to files or Unicode-aware GUIs, not to sys.stdout.
*********************************************************************
A snippet for reading and writing:

import codecs
# the third parameter must match the input file's original encoding
infile = codecs.open('inputfile', 'r', 'utf-8')
# the third parameter for the output file is any encoding you like
outfile = codecs.open('outputfile', 'w', 'utf-8')
while True:
    line = infile.readline()
    if not line:
        break
    # the line can now be displayed normally, not as mojibake
    outfile.write(line)
outfile.close()
infile.close()
Posted by chinakr (#15):

Handling Chinese text with print
Google keywords: python script default codec
Source: http://blog.chinaunix.net/u2/68206/showart_668359.html
Reprint: Python, Unicode, and Chinese
Python's Chinese-text problems have long been a headache for newcomers. This article explains the subject in detail. Of course, it is almost certain that some future version of Python will solve the problem outright and spare us all this trouble.
First, check the Python version:

>>> import sys
>>> sys.version
'2.5.1 (r251:54863, Apr 18 :08) [MSC v.1310 32 bit (Intel)]'
Create a file ChineseTest.py in Notepad, saved with the default ANSI encoding:

s = "中文"

Give it a try:

E:\Project\Python\Test>python ChineseTest.py
  File "ChineseTest.py", line 1
SyntaxError: Non-ASCII character '\xd6' in file ChineseTest.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
Quietly change the file's encoding to UTF-8:

E:\Project\Python\Test>python ChineseTest.py
  File "ChineseTest.py", line 1
SyntaxError: Non-ASCII character '\xe4' in file ChineseTest.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

No luck...
Since the message gives a URL, let's read it. A quick skim reveals that if a file contains non-ASCII characters, an encoding declaration is required on the first or second line. Change ChineseTest.py back to ANSI and add the declaration:

# coding=gbk
s = "中文"

Try again:

E:\Project\Python\Test>python ChineseTest.py
It works :)
Check its length:

# coding=gbk
s = "中文"
print len(s)

The result is 4: s is of type str here, so each Chinese character counts as two English characters.
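The length arithmetic can be pinned down with explicit encodes; a sketch (runnable as-is under Python 3, where the byte strings are spelled as bytes):

```python
text = u"中文"                  # 2 characters
as_gbk = text.encode("gbk")     # 4 bytes: 2 bytes per Chinese character in GBK
as_utf8 = text.encode("utf-8")  # 6 bytes: 3 bytes per Chinese character in UTF-8

print(len(text), len(as_gbk), len(as_utf8))
```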
Now write it like this:

# coding=gbk
s = "中文"
s1 = u"中文"
s2 = unicode(s, "gbk") # omitting the second argument makes Python decode with its default, ASCII
s3 = s.decode("gbk")   # decode converts str to unicode; the unicode() function does the same thing
print len(s1)
print len(s2)
print len(s3)
Now for file handling. Create a file test.txt, saved as ANSI, with the content:

abc中文

Read it from Python:

# coding=gbk
print open("Test.txt").read()

Result: abc中文

Change the file's format to UTF-8:

Result: abc涓?枃
Clearly this needs decoding:

# coding=gbk
import codecs
print open("Test.txt").read().decode("utf-8")

Result: abc中文
I edited the test.txt above with Editplus, but when I edited it with Windows' built-in Notepad and saved it as UTF-8, running the script raised an error:

Traceback (most recent call last):
  File "ChineseTest.py", line 3, in <module>
    print open("Test.txt").read().decode("utf-8")
UnicodeEncodeError: 'gbk' codec can't encode character u'\ufeff' in position 0: illegal multibyte sequence

It turns out that some software, such as Notepad, inserts three invisible bytes (0xEF 0xBB 0xBF, the BOM) at the start of a file saved as UTF-8.
We therefore need to strip these bytes ourselves when reading; Python's codecs module defines this constant:
# coding=gbk
import codecs
data = open("Test.txt").read()
if data[:3] == codecs.BOM_UTF8:
    data = data[3:]
print data.decode("utf-8")
Result: abc中文

A leftover question
In the section above we used the unicode function and the decode method to convert str to unicode. Why was the argument to both of them "gbk"?
The first reaction is that our encoding declaration says gbk (# coding=gbk). But is that really the reason?
Modify the source file:
# coding=utf-8
s = "中文"
print unicode(s, "utf-8")
Running it raises an error:

Traceback (most recent call last):
  File "ChineseTest.py", line 3, in <module>
    s = unicode(s, "utf-8")
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid data
Clearly, if the earlier example worked because both sides used gbk, then keeping both sides consistently utf-8 here should also work, not raise an error.
A further experiment: what if we convert with gbk here after all?
# coding=utf-8
s = "中文"
print unicode(s, "gbk")
Result: 中文
I consulted an English article that explains roughly how print works in Python:
When Python executes a print statement, it simply passes the output to the operating system (using fwrite() or something like it), and some other program is responsible for actually displaying that output on the screen. For example, on Windows, it might be the Windows console subsystem that displays the result. Or if you're using Windows and running Python on a Unix box somewhere else, your Windows SSH client is actually responsible for displaying the data. If you are running Python in an xterm on Unix, then xterm and your X server handle the display.
To print data reliably, you must know the encoding that this display program expects.
In short, print in Python passes the string straight to the operating system, so you need to decode the str into a form consistent with the operating system's encoding. Windows uses CP936 (almost identical to gbk), which is why gbk works here.
One last test:
# coding=utf-8
s = "中文"
print unicode(s, "cp936")
Result: 中文
Reprint: Notes on Python encoding problems
A few concepts
Standard ASCII uses only 7 bits per character, so it can encode at most 128 characters. Extended ASCII uses 8 bits per character, for at most 256 characters.
Unicode uses 2 or even 4 bytes per character, and can therefore assign a single consistent code to every character in the world.
UTF, the Unicode transformation format, specifies how to turn unicode into byte sequences suitable for file storage and network transfer (unicode -> str). Encodings such as gb2312, gb18030, big5 play the same role as UTF; only the encoding scheme differs.
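The point that these encodings all serialize the same text, just differently, is easy to demonstrate; a small sketch:

```python
s = u"中文"

# the same two characters come out as different byte sequences
print(s.encode("utf-8"))    # b'\xe4\xb8\xad\xe6\x96\x87'
print(s.encode("gb2312"))   # b'\xd6\xd0\xce\xc4'

# gb18030 is a superset of gb2312, so common characters share the same bytes
print(s.encode("gb18030") == s.encode("gb2312"))
```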
Python has two data types for strings: str and unicode. Both derive from sequence; see the description in the Python Language Reference:
The items of a string are characters. There is no separate character type;
a character is represented by a string of one item.
Characters represent (at least) 8-bit bytes. The built-in functions
chr() and ord() convert between characters and nonnegative integers
representing the byte values. Bytes with the values 0-127 usually
represent the corresponding ASCII values, but the interpretation of
values is up to the program. The string data type is also used to
represent arrays of bytes, e.g., to hold data read from a file.
(On systems whose native character set is not ASCII, strings
may use EBCDIC in their internal representation, provided the
functions chr() and ord() implement a mapping between ASCII and
EBCDIC, and string comparison preserves the ASCII order. Or perhaps
someone can propose a better rule?)
The items of a Unicode object are Unicode code units. A
Unicode code unit is represented by a Unicode object of one item and
can hold either a 16-bit or 32-bit value representing a Unicode
ordinal (the maximum value for the ordinal is given in sys.maxunicode,
and depends on how Python is configured at compile time). Surrogate
pairs may be present in the Unicode object, and will be reported as
two separate items. The built-in functions unichr() and ord() convert
between code units and nonnegative integers representing the Unicode
ordinals as defined in the Unicode Standard 3.0. Conversion from and
to other encodings are possible through the Unicode method encode()
and the built-in function unicode().
Three sentences stand out:
"The items of a string are characters", "The items of a Unicode object
are Unicode code units", "The string data type is also used to
represent arrays of bytes, e.g., to hold data read from a file."
The first two say what the unit (item) of str and of unicode is (both being sequences). The default __len__ of a sequence returns the number of such units, which makes len('abcd') == 4 and len(u'我是中文') == 4 easy to understand.
The third sentence tells us that file input and output represent data as arrays of str. Not just files: network transfers presumably work the same way. That is why a unicode string must be encoded before being written to a file or sent over the network.
Encoding and decoding in Python are exactly the conversions between these two forms: encoding is unicode -> str, and decoding is str -> unicode.
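The two directions can be pinned down in one round trip; a sketch (in Python 3 spelling, where bytes plays the role of the old str and str plays the role of unicode):

```python
u = u"中文"                 # text (the old unicode type)
b = u.encode("utf-8")       # encode: unicode -> byte string
u2 = b.decode("utf-8")      # decode: byte string -> unicode

print(type(b), type(u2))
```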
The remaining problem is knowing when to encode and when to decode. Some libraries are unicode-oriented, so when transmitting or writing their return values we must consider encoding them into a suitable byte form.
About the "encoding declaration" at the top of a file, i.e. the # -*- coding: -*- line: Python assumes script files are ASCII-encoded by default, and when a file contains characters outside the ASCII range the declaration is needed to correct that.
About sys.defaultencoding: it is used when a decode happens without an explicit encoding. For example, take this code:

#! /usr/bin/env python
# -*- coding: utf-8 -*-
s = '中文'
# note that s is of type str here, not unicode
s.encode('gb18030')

The last line re-encodes s into gb18030 format, i.e. performs a unicode -> str conversion. Because s is itself a str, Python first implicitly decodes it to unicode and only then encodes to gb18030. Since that decode is automatic and we did not specify a codec, Python uses the one named by sys.defaultencoding. In most setups sys.defaultencoding is ASCII, and if s is not ASCII this fails.
That is exactly my situation: my sys.defaultencoding is ascii, while s is encoded like the file, in utf-8, so I get:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position
0: ordinal not in range(128)
There are two ways to fix this error.
First, state s's encoding explicitly:

#! /usr/bin/env python
# -*- coding: utf-8 -*-
s = '中文'
s.decode('utf-8').encode('gb18030')
Second, change sys.defaultencoding to the file's encoding:

#! /usr/bin/env python
# -*- coding: utf-8 -*-
import sys
reload(sys)  # Python 2.5 removes sys.setdefaultencoding after initialization, so reload sys to get it back
sys.setdefaultencoding('utf-8')
str = '中文'
str.encode('gb18030')
Reprint: Python, MySQL, and mojibake
To drive MySQL from Python you need to install Python-MySQL; search the web for it and install it like any other Python package.
Once installed, the module is called MySQLdb. It works on both Windows and Linux and is quite pleasant to use, except for the usual annoying mojibake, which I eventually solved.
I used the following measures to keep MySQL's output free of mojibake:
1. Declare the Python file's encoding as utf-8 (put #encoding=utf-8 at the top of the file).
2. Set the MySQL database's charset to utf-8.
3. Pass charset='utf8' when connecting to MySQL from Python.
4. Set Python's default encoding to utf-8 (sys.setdefaultencoding('utf-8')).
mysql_test.py:
#encoding=utf-8
import sys
import MySQLdb
reload(sys)
sys.setdefaultencoding('utf-8')
db=MySQLdb.connect(user='root',charset='utf8')
cur=db.cursor()
cur.execute('use mydb')
cur.execute('select * from mytb limit 100')
f = file("/home/user/work/tem.txt", 'w')
for i in cur.fetchall():
    f.write(str(i))
    f.write(" ")
f.close()
cur.close()

The script above is for Linux; it also runs fine on Windows!
Note: MySQL's own configuration must also be set to utf8.
In MySQL's my.cnf file, set the default character set in both the [client] and [mysqld] sections (the file is usually under /etc/f):
[client]
default-character-set = utf8
[mysqld]
default-character-set = utf8
Reprint: a Python program for URL encoding and decoding
import urllib
import sys
string = sys.argv[1]
string = unicode(string,"gbk")
utf8_string = string.encode("utf-8")
gbk_string=string.encode("gbk")
gbk = urllib.quote(gbk_string)
utf8 = urllib.quote(utf8_string)
print gbk
print utf8

For decoding, use the unquote and decode functions.
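Under Python 3 these functions moved to urllib.parse; a sketch of the same round trip (the string literal is illustrative):

```python
from urllib.parse import quote, unquote

text = u"中文"
utf8_quoted = quote(text.encode("utf-8"))  # percent-encode the utf-8 bytes
gbk_quoted = quote(text.encode("gbk"))     # percent-encode the gbk bytes

print(utf8_quoted)
print(gbk_quoted)

# unquote undoes the percent-escapes and decodes with the stated encoding
round_trip_utf8 = unquote(utf8_quoted, encoding="utf-8")
round_trip_gbk = unquote(gbk_quoted, encoding="gbk")
```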
Posted by chinakr (#16):

Fetching Chinese form input with Python's CGI module - a partial solution
Google keywords: python script default codec
Today, while trying out Python's CGI module, I ran into Chinese characters that would not display correctly, which was very frustrating. After careful searching I finally solved the problem; the solution is recorded below so I don't slip up again.
The page source is as follows:

#-*- coding: utf8 -*-
import cgitb, cgi
cgitb.enable()
form = cgi.FieldStorage()
if (form.has_key("name") and form.has_key("addr")):
    print "<p>name:", form["name"].value
    print "<p>addr:", form["addr"].value

(Only the addr parameter is tested with Chinese here.) It runs fine for ASCII input, but Chinese input comes out as mojibake. Switching the browser to GB2312 encoding displays it correctly, but I want the page served as UTF-8.
Changing the line to print "<p>addr:", form["addr"].value.encode('utf-8') produces this error:

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid data

After reading http://blog.chinaunix.net/u2/68206/showart_668359.html I finally understood.
Encoding and decoding in Python are the conversions between unicode and str: encoding is unicode -> str, decoding is str -> unicode. The remaining problem is knowing when to do which. About the "encoding declaration", the # -*- coding: -*- line: Python assumes script files are ASCII-encoded by default, and when a file contains characters outside that range the declaration is needed to correct it. About sys.defaultencoding: it is used when a decode happens without an explicit codec. For example, take this code:
#! /usr/bin/env python
# -*- coding: utf-8 -*-
s = '中文'
# note that s is of type str here, not unicode
s.encode('gb18030')
The last line re-encodes s into gb18030 format, i.e. performs a unicode -> str conversion. Because s is itself a str, Python first implicitly decodes it to unicode and then encodes to gb18030. Since the decode is automatic and no codec was specified, Python uses the one named by sys.defaultencoding, usually ASCII, and fails if s is not ASCII. That was my situation: my sys.defaultencoding is ascii, while s is encoded like the file, in utf-8, so it failed:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position
0: ordinal not in range(128)

There are two ways to fix this error.
First, state s's encoding explicitly:
#! /usr/bin/env python
# -*- coding: utf-8 -*-
s = '中文'
s.decode('utf-8').encode('gb18030')
Second, change sys.defaultencoding to the file's encoding:
#! /usr/bin/env python
# -*- coding: utf-8 -*-
import sys
reload(sys)  # Python 2.5 removes sys.setdefaultencoding after initialization, so reload sys to get it back
sys.setdefaultencoding('utf-8')
str = '中文'
str.encode('gb18030')
After reading that, I changed the line to:

print "<p>addr:", form["addr"].value.decode('gb2312').encode('utf-8')

Let me summarize why it has to be written this way:
1. When data comes back in an encoding different from the one declared in your script, you must convert it.
2. To convert, first decode the data from its own encoding into unicode, then encode that unicode as utf-8.
3. As for why my browser sends gb2312-encoded data back to the server: that presumably depends on the client system's encoding.
While we're here, a reprint on Chinese text with Python and MySQL:
Python, MySQL, and mojibake
The following measures keep MySQL's output free of mojibake:
1. Declare the Python file's encoding as utf-8 (put #encoding=utf-8 at the top of the file).
2. Set the MySQL database's charset to utf-8.
3. Pass charset='utf8' when connecting to MySQL from Python.
4. Set Python's default encoding to utf-8 (sys.setdefaultencoding('utf-8')).
#encoding=utf-8
import sys
import MySQLdb
reload(sys)
sys.setdefaultencoding('utf-8')
db=MySQLdb.connect(user='root',charset='utf8')
cur=db.cursor()
cur.execute('use mydb')
cur.execute('select * from mytb limit 100')
f=file("/home/user/work/tem.txt",'w')
for i in cur.fetchall():
    f.write(str(i))
    f.write(" ")
f.close()
cur.close()
Tested successfully with this link: index.psp?name=iamsese&addr=北京
--------------------------------------------------------------------------------
/yobin/blog/item/1547fbdc6ef53ba4cc116611.html
I wrote a simple script to load my data into MySQL and hit mojibake; a small change fixed it quickly. When connecting to MySQL, declare the utf-8 character set, and make sure all Chinese text is utf-8 (convert any GBK to utf-8 first). Stick to that principle and there will be no mojibake.
The conversion script is as follows:
#-*- coding: utf-8 -*-
import MySQLdb, os

def wgdata2DB():
    print "Convert weg game data to mysql"
    # connect to the database; note the utf8 charset here,
    # in addition to the utf-8 declaration at the top of the file
    db = MySQLdb.connect(host='localhost',
                         user='root',
                         passwd='123456',
                         db='testdb',
                         charset="utf8")
    cursor = db.cursor()
    if os.path.exists('test.dat'):
        rFile = open('test.dat', 'r')
        lines = rFile.readlines()
        rFile.close()
        loop = 0
        for line in lines:
            loop += 1
            print "handle line:%d" % (loop)
            myset = line.split(' ')
            sqlstr = "INSERT INTO wg_Content (type,title,url,speed,des,size) VALUES('%s','%s','%s','%s','%s','%s')" \
                     % (myset[0], myset[1], myset[2], myset[3], myset[4], myset[5])
            cursor.execute(sqlstr)
        db.commit()
    cursor.close()
    db.close()
Comment #1, by vb2005xu:
To drive MySQL from Python, install Python-MySQL, then:
1. Declare the Python file's encoding as utf-8 (put #encoding=utf-8 at the top of the file).
2. Set the MySQL database's charset to utf-8.
3. Pass charset='utf8' when connecting to MySQL from Python.
4. Set Python's default encoding to utf-8 (sys.setdefaultencoding('utf-8')).
I recommend SQLObject instead. In my test code against SQLite this problem never appeared, because TurboGears' default libraries are all unicode-based.
Posted by chinakr (#17):

Character sets revisited
Google keywords: UnicodeDecodeError: 'gb18030' codec can't decode bytes in position
Author: kino
For Chinese users, converting between character sets is genuinely troublesome: one slip and you get screenfuls of mojibake, so it is well worth reviewing the basics!
ISO8859-1, usually called Latin-1, adds the characters needed to write all Western European languages; gb2312 is the standard Chinese character set.
UTF-8 is a variable-length encoding of Unicode, specified in RFC 3629. Put simply, it covers a very large character set, so it can display text in many languages, enabling internationalization and localization.
To a system, UTF-8 can be read and written quickly with mask and shift operations, and it sorts easily. UTF-8 is also byte-order independent: its byte sequence is the same on every system, which gives it good performance.
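The byte-order claim can be checked directly; a small sketch:

```python
s = u"中"  # U+4E2D

# UTF-16 output depends on which byte order is chosen
le = s.encode("utf-16-le")  # b'\x2d\x4e'
be = s.encode("utf-16-be")  # b'\x4e\x2d'

# UTF-8 produces the same byte sequence on every system
u8 = s.encode("utf-8")      # b'\xe4\xb8\xad'
print(le, be, u8)
```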
I saw a very intuitive worked example on ChinaUnix; here it is, shared with everyone!

Python code:

>>> a = "我"
>>> b = unicode(a, "gb2312")
>>> a.__class__
<type 'str'>
>>> b.__class__
<type 'unicode'>

See — two kinds of strings. Again:

>>> a
'\xce\xd2'
>>> b
u'\u6211'

Variable a is two bytes; b is a single unicode character. For these two string types, see the Python docs -> Language Reference -> Data Model -> The standard type hierarchy -> Sequences, which describes Strings and Unicode. As for:

>>> z = u"我"
>>> # this, in this shell, is really nothing meaningful
>>> z.__class__
<type 'unicode'>
>>> z
u'\xce\xd2'

See that strange thing? (I later tried the same under Windows XP in a plain python command line and got a different result: z came out as u'\u6211'. I should never have experimented in pyshell; clearly much remains to be understood here.) Now for encode and decode: when do you use which? This confused me at first. The English name of each local character set is "Coded Character Set": converting into the coded form must be encode, and getting back out must be decode... decode converts some other encoding to unicode, equivalent to the unicode() function; encode converts a unicode string to a specific encoding. Continuing in pyshell: a is of type str, so calling encode on it fails (print converts via the default encoding into the system encoding?).
>>> a.decode("gb2312")
u'\u6211'
>>> print a.decode("gb2312")
我
>>> a.encode("gb2312")
Traceback (most recent call last):
  File "<input>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xce in position 0: ordinal not in range(128)
b is unicode, so printing it requires first encoding it into the system encoding:

>>> print b.encode("gb2312")
我
>>> b.encode("gb2312")
'\xce\xd2'
>>> b.decode("gb2312")
Traceback (most recent call last):
  File "<input>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\u6211' in position 0: ordinal not in range(128)
Converting strings between internal encodings is a problem you meet constantly in development.
In Java, you can transcode by calling getBytes() on a String and constructing a new String from the result, or by using Charset from the NIO package.
In Python, you transcode by calling decode and encode on the string.
For example, to convert a String object s from gbk to UTF-8:

s.decode('gbk').encode('utf-8')

In real development, however, I found this often raises an exception:

UnicodeDecodeError: 'gbk' codec can't decode bytes in position : illegal multibyte sequence

The cause is illegal characters. In particular, some programs written in C/C++ emit full-width spaces in several different ways, such as \xa3\xa0 or \xa4\x57. These look like full-width spaces but are not "legal" ones (the real full-width space is \xa1\xa1), so the conversion blows up.
This is maddening, because a single illegal character makes the whole string — sometimes a whole article — untranscodable.
Fortunately, tiny found the perfect fix (and I was duly criticized for not reading the documentation carefully, sigh...):
s.decode('gbk', 'ignore').encode('utf-8')

The signature is decode([encoding], [errors='strict']), so the second argument controls the error-handling policy. The default, strict, raises an exception on an illegal character;
if set to ignore, illegal characters are skipped;
if set to replace, illegal characters are replaced with ?;
if set to xmlcharrefreplace, XML character references are used instead.
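The same error handlers exist on the encode side, where they can be seen side by side; a small sketch:

```python
s = u"中文ab"

# 'strict' (the default) raises on the unencodable characters
try:
    s.encode("ascii")
    strict_raised = False
except UnicodeEncodeError:
    strict_raised = True

ignored = s.encode("ascii", "ignore")          # drops the Chinese characters
replaced = s.encode("ascii", "replace")        # substitutes ? for each of them
xml = s.encode("ascii", "xmlcharrefreplace")   # substitutes XML character references
print(strict_raised, ignored, replaced, xml)
```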
Original link: http://blog.chinaunix.net/uid--id-4074412.html