How to understand Cookie and opener in Python's build_opener


Source: WebSpider crawl. Time: 2016-08-05 02:13. Tags: python opener

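Python study notes
Official library reference
Chinese manual, good for a quick start
Python Cookbook (Chinese edition)
1. Numbers (real numbers in particular) are convenient, string operations are slick, and lists are built in
a = complex(1,0.4)
Prefixing a string literal with r/R makes it a raw string; prefixing it with u/U makes it a unicode string
A list's append() method adds a new element to the end of the list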
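2. Flow control
else in loops
funA(para)
A function with no return statement returns None; arguments are passed in as references
2) Default arguments: be careful when the default value is a list, a dict, or a class instance, because the default is created once and shared across calls
3) Variable arguments: def funB(king, *arguments, **keywords). Positional values without a keyword are stored in the tuple arguments; keyword/value pairs are stored in the dict keywords. This is really a combination of tuple packing and sequence unpacking.
funC(para1, para2, para3): the call funC(*list) spreads the list elements out as the function's arguments
5) Anonymous functions: lambda arg1, arg2...: <expression>
Characteristics: it creates a function object without binding it to an identifier (unlike def); lambda is an expression, not a statement; only a single expression may follow the ":"
('y', 'ye', 'yes'): xxxxx (usage of the keyword in)
7) f = lambda x: x*2 is equivalent to def f(x): return x*2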
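The warning about default arguments that are lists, dicts, or class instances can be illustrated with a small sketch. The function names append_shared and append_fresh are made up for this example; the None-sentinel pattern shown is a common workaround, not something the notes themselves prescribe.

```python
def append_shared(item, bucket=[]):
    # The default list is created once, when the def statement runs,
    # so every call that omits bucket appends to the same list.
    bucket.append(item)
    return bucket


def append_fresh(item, bucket=None):
    # Using None as the default and creating the list inside the body
    # gives each call its own list.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


first = append_shared(1)
second = append_shared(2)       # same list object as first
fresh_first = append_fresh(1)
fresh_second = append_fresh(2)  # a separate list
```

Here first and second refer to one shared list, while fresh_first and fresh_second are independent.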
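4. Data structures
1) [] help(list) append(x) extend(L) insert(i,x) remove(x) pop([i]) index(x) count(x) sort() reverse()
2) Functional programming with lists: filter()
3) List comprehensions: aimTags = [aimTag for aimTag in aimTags if aimTag not in filterAimTags]
4) del removes a list slice or an entire variable
5) () help(tuple): a tuple's elements, like a string's characters, cannot be changed. Tuples, strings, and lists are all sequences. Python requires a trailing comma in a single-element tuple to avoid ambiguity with a parenthesized expression; forgetting it is a common beginner mistake
6) {} help(dict): dictionaries, with keys() and has_key(). A dict can be built directly from a list of key/value tuples
7) Looping over a dict: for k, v in xxx.iteritems():... for item in xxx.items():... Over a sequence: for i, v in enumerate(['tic', 'tac', 'toe']):... Over several sequences at once: for q, a in zip(questions, answers):...
9) Sequence objects of the same type can be compared with <, >, and ==
10) Two ways to check a variable's type: isinstance(var, int) and type(var).__name__ == "int"
To check against several types: isinstance(s, (str, unicode)) returns True whether s is a plain string or a unicode string
11) Be especially careful when deleting list elements inside a loop. for i in listA: ... listA.remove(i) is buggy, because after an element is removed the following elements shift forward and one of them gets skipped; for i in range(len(listA)): ... del listA[i] is also buggy, because the length changes after a deletion and the index runs out of range
filter(lambda x: x != 4, listA) is a more elegant approach
listA = [i for i in listA if i != 4] is also fine; in other words, just build a new list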
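The skipped-element problem is easy to reproduce. The sketch below compares removing during iteration with rebuilding the list; the function names and sample data are made up for illustration.

```python
def remove_while_iterating(values, target):
    # Buggy on purpose: removing shifts later elements forward,
    # so the element right after a removed one is never visited.
    values = list(values)  # work on a copy of the input
    for v in values:
        if v == target:
            values.remove(v)
    return values


def remove_by_rebuilding(values, target):
    # Building a new list avoids mutating what is being iterated.
    return [v for v in values if v != target]


data = [1, 4, 4, 5]
buggy = remove_while_iterating(data, 4)   # one 4 survives
clean = remove_by_rebuilding(data, 4)
```

With two adjacent 4s in the input, the in-place version leaves one behind, while the rebuilt list contains none.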
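1) "if k in my_dict" is preferable to "if my_dict.has_key(k)"
2) "for k in my_dict" is preferable to "for k in my_dict.keys()", and also to "for k in [....]"
1) A module's name is available in the global variable __name__. The file fibo.py can be imported into other files or into the interpreter as the module fibo with import fibo; functions defined in fibo.py must then be called with the fibo. prefix
2) Import variant: from fibo import fib, fib2, after which the functions can be used directly without the prefix
3) sys.path
4) The built-in function dir() lists the names a module defines. It returns a sorted list of strings covering names of every kind: variables, modules, functions, and so on
help() serves a similar purpose
5) Packages: import packet1.packet2.module
from packet1.packet2 import module
from packet1.packet2.module import functionA
6) The import statement follows this convention: when from package import * runs, if the package's __init__.py defines a list named __all__, the module names given in that list are imported
7) sys.path holds the paths currently searched for Python libraries; new search paths can be added in code with sys.path.append("/xxx/xxx/xxx")
8) Python modules can be installed with easy_install; to uninstall, use easy_install -m pkg_name
9) __doc__ gives the documentation of a module, function, or object; __name__ gives its name (typical usage: if __name__ == '__main__': ...)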
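1) str() unicode()
xxx%(v1,v2)
2) f = open("fileName", "w") with modes w r a r+
On Windows and Macintosh there is also a "b" mode
f.read(size)
f.readline()
f.write(string)
f.writelines(list)
f.seek(offset, from_what) from_what: 0 = start of file, 1 = current position, 2 = end of file; offset: number of bytes
The linecache module makes it easy to fetch a given line of a file, but take care when using it on an HTTP server. It is especially dangerous with large files: under concurrency it can easily exhaust the machine's memory and bring the system down (a lesson I learned the hard way)
Handy for file operations
Walking all files under a directory
3) The pickle module (not limited to writing to files)
Pickling is similar to PHP serialization: pickle.dump(objectX, fileHandle)
Unpickling is similar to PHP deserialization: objectX = pickle.load(fileHandle)
msgpack (easy_install msgpack-python) is somewhat nicer to use than pickle and cPickle, and faster
msgpack.dump(my_var, file('test_file_name','w'))
msgpack.load(file('test_file_name','r'))
4) raw_input() reads user input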
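The remark that pickle is not limited to files can be shown with pickle.dumps and pickle.loads, which work on in-memory byte strings. The sample record below is invented for illustration.

```python
import pickle

# Any picklable object works; this dict is just sample data.
record = {"site": "example", "tags": ["a", "b"], "count": 3}

blob = pickle.dumps(record)    # serialize to a byte string in memory
restored = pickle.loads(blob)  # rebuild an equal, independent object
```

The byte string can be stored or sent anywhere bytes are accepted, for example in a cache or over a socket. Only unpickle data from a trusted source, since loading a pickle can execute arbitrary code.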
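1) Member variables and methods whose names start with two underscores and end with at most one underscore are private; a parent class's private members are not accessible in subclasses
2) Calling a parent-class method: (1) ParentClass.FuncName(self, args) (2) super(ChildName, self).FuncName(args). The second form requires the class to inherit from object, otherwise super raises an error
3) Static methods: put @staticmethod on the line before the method definition. They can be called directly through the class name.
#!/bin/python
#encoding=utf8
class A(object):
    def __init__(self, a, b):
        self.a = a
        self.b = b
    def show(self):
        print "A::show() a=%s b=%s" % (self.a,self.b)
class B(A):
    def __init__(self, a, b, c):
        #A.__init__(self,a,b)
        super(B,self).__init__(a,b) # this use of super requires the parent class to inherit from object
        self.c = c
if __name__ == "__main__":
    b = B(1,2,3)
    print b.a,b.b,b.c
    b.show()
xudongsong@sysdev:~$ python class_test.py
1 2 3
A::show() a=1 b=2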
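Common encoding conversions fall into the following cases:
unicode -> another encoding
Example: a is unicode and should become gb2312: a.encode('gb2312')
another encoding -> unicode
Example: a is gb2312-encoded and should become unicode: unicode(a, 'gb2312') or a.decode('gb2312')
encoding 1 -> encoding 2
Convert to unicode first, then to encoding 2
For example, gb2312 to big5:
unicode(a, 'gb2312').encode('big5')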
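The same two-step route (decode to unicode, then encode to the target) can be wrapped in a helper. This is a minimal sketch; the function name transcode is hypothetical, and the sample text is chosen because both characters exist in gb2312 and in big5.

```python
def transcode(data, source, target):
    # Step 1: bytes in the source encoding -> unicode text.
    # Step 2: unicode text -> bytes in the target encoding.
    return data.decode(source).encode(target)


text = u"\u4e2d\u6587"              # sample text present in both encodings
gb_bytes = text.encode("gb2312")
big5_bytes = transcode(gb_bytes, "gb2312", "big5")
```

The byte sequences differ between the two encodings, but decoding either one with its own codec gives back the same text. A character missing from the target encoding would raise UnicodeEncodeError.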
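Determining a string's type
isinstance(s, str) checks whether it is a plain (byte) string
isinstance(s, unicode) checks whether it is a unicode string
If a string is already unicode, running the unicode conversion on it again sometimes fails (though not always)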
>>> str2 = u"sfdasfafasf"
>>> type(str2)
<type 'unicode'>
>>> isinstance(str2,str)
False
>>> isinstance(str2,unicode)
True
>>> type(str2)
<type 'unicode'>
>>> str3 = "safafasdf"
>>> type(str3)
<type 'str'>
>>> isinstance(str3,unicode)
False
>>> isinstance(str3,str)
True
>>> str4 = r'asdfafadf'
>>> isinstance(str4,str)
True
>>> isinstance(str4,unicode)
False
>>> type(str4)
<type 'str'>
A general-purpose convert-to-unicode function can be written like this:
def u(s, encoding):
    # already unicode: return it unchanged instead of decoding again
    if isinstance(s, unicode):
        return s
    return unicode(s, encoding)
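1) To make a child thread exit together with the parent thread, call setDaemon() on the child thread
2) Calling join() on a child thread makes the parent thread wait until the child has exited before exiting itself
3) Ctrl+C can only be caught by the parent thread (child threads cannot call the handler-registration function signal.signal(signal, function)). Calling join() on a child thread keeps the parent from catching Ctrl+C until the child has exited
Appendix: an email from Mr. Cheng Yingyuan about signals in Python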
Some care must be taken if both signals and threads are used in the same program. The fundamental thing to remember in using signals and threads simultaneously is: always perform signal() operations in the main thread of execution. Any thread can perform an
alarm(), getsignal(), pause(), setitimer() or getitimer(); only the main thread can set a new signal handler, and the main thread will be the only one to receive signals (this is enforced by the Python signal module, even if the underlying thread implementation
supports sending signals to individual threads). This means that signals can’t be used as a means of inter-thread communication. Use locks instead.
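Always install signal handlers with signal() in the main thread; the main thread is the only one that handles signals. So do not rely on signals for inter-thread communication; use locks instead.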
The second, from http://docs.python.org/library/thread.html#module-thread:
Threads interact strangely with interrupts: the KeyboardInterrupt exception will be received by an arbitrary thread. (When the signal module is available, interrupts always go to the main thread.)
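When the signal module is available, the KeyboardInterrupt exception is always received by the main thread; otherwise it may be received by an arbitrary thread.
Pressing Ctrl+C makes Python receive SIGINT, which is turned into a KeyboardInterrupt raised in some thread; any threads that have not been marked with setDaemon keep running regardless. If kill delivers a signal other than SIGINT and no handler has been set for it, the whole process dies, no matter how many threads are still unfinished.
Here is an example of using signal: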
>>> import signal, time
>>> def f():
...     signal.signal(signal.SIGINT, sighandler)
...     signal.signal(signal.SIGTERM, sighandler)
...     while True:
...         time.sleep(1)
...
>>> def sighandler(signum,frame):
...     print signum,frame
...
>>> f()
^C2 <frame object at 0x15b2a40>
^C2 <frame object at 0x15b2a40>
^C2 <frame object at 0x15b2a40>
^C2 <frame object at 0x15b2a40>
Setting and clearing signal handlers:
import signal, time
term = False
def sighandler(signum, frame):
    print "terminate signal received..."
    global term
    term = True
def set_signal():
    signal.signal(signal.SIGTERM, sighandler)
    signal.signal(signal.SIGINT, sighandler)
def clear_signal():
    # SIG_DFL (numeric value 0) restores the default action
    signal.signal(signal.SIGTERM, signal.SIG_DFL)
    signal.signal(signal.SIGINT, signal.SIG_DFL)
set_signal()
while not term:
    print "hello"
    time.sleep(1)
print "jumped out of while loop"
clear_signal()
term = False
for i in range(5):
    print "hello, again"
    time.sleep(1)
[dongsong@bogon python_study]$ python signal_test.py
^Cterminate signal received...
jumped out of while loop
hello, again
hello, again
[dongsong@bogon python_study]$
In a multi-process program that uses signals, if the parent process is meant to catch the signal and then act on its children, register the handler only after the children have been started. Otherwise each child inherits the parent's address space, handler included, and the program becomes a mess.
from multiprocessing import Process, Pipe
import logging, time, signal
g_logLevel = logging.DEBUG
g_logFormat = "%(asctime)s %(levelname)s [%(filename)s:%(lineno)d]%(message)s"
def f(conn):
    conn.send([42, None, 'hello'])
    #conn.close()
def f2():
    logging.basicConfig(level=g_logLevel,format=g_logFormat,stream=None)
    logging.debug("hello,world")
    while True:
        print "hello,world"
        time.sleep(1)
termFlag = False
def sighandler(signum, frame):
    print "terminate signal received..."
    global termFlag
    termFlag = True
if __name__ == '__main__':
    parent_conn, child_conn = Pipe()
    p = Process(target=f, args=(child_conn,))
    p.start()
    print parent_conn.recv() # prints "[42, None, 'hello']"
    p.join()
    p = Process(target=f2)
    p.start()
    # register the handlers only after the child process has been started
    signal.signal(signal.SIGTERM, sighandler)
    signal.signal(signal.SIGINT, sighandler)
    while not termFlag:
        time.sleep(0.5)
    print "jump out of the main loop"
    p.terminate()
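10. Python's built-in function locals() returns a dict that maps the names of all local variables to their values
11. Extended positional arguments
def func(*args): ...
A single asterisk before the parameter name lets the function accept any number of positional arguments.
Python collects them into a tuple bound to the variable args. If no positional arguments are passed beyond the explicitly declared ones, args is an empty tuple.
12. Extended keyword arguments
def accept(**kwargs): ...
Two asterisks before the parameter name let the function accept any number of keyword arguments.
Note: kwargs is an ordinary Python dict holding argument names and values. If there are no extra keyword arguments, kwargs is an empty dict.
For positional and keyword arguments, see this article: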
>>> def func(arg1, arg2 = "hello", *arg3, **arg4):
...     print arg1
...     print arg2
...     print arg3
...     print arg4
...
>>> func("xds","t1",t2="t2",t3="t3")
xds
t1
()
{'t2': 't2', 't3': 't3'}
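13. Decorators: writing @another_method on the line before a function definition wraps the existing function, for example to run precondition checks and similar work. This article explains it thoroughly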
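As a rough illustration of the precondition-check use case, the sketch below defines a decorator that validates an argument before calling the wrapped function. The names require_positive and half are invented for this example.

```python
import functools


def require_positive(func):
    # Wrap func so its argument is checked before the real call.
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(n):
        if n <= 0:
            raise ValueError("n must be positive, got %r" % (n,))
        return func(n)
    return wrapper


@require_positive          # same as: half = require_positive(half)
def half(n):
    return n / 2.0


result = half(8)
```

Calling half(0) raises ValueError from the wrapper without ever entering the original function body.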
14. Exception-handling syntax
import sys
try:
    f = open('myfile.txt')
    s = f.readline()
    i = int(s.strip())
except IOError, (errno, strerror):
    print "I/O error(%s): %s" % (errno, strerror)
except ValueError:
    print "Could not convert data to an integer."
except:
    print "Unexpected error:", sys.exc_info()[0]
    raise
>>> try:
...     raise Exception('spam', 'eggs')
... except Exception, inst:
...     print "error %s" % str(inst)
...     print type(inst)     # the exception instance
...     print inst.args      # arguments stored in .args
...     print inst           # __str__ allows args to printed directly
...     x, y = inst          # __getitem__ allows args to be unpacked directly
...     print 'x =', x
...     print 'y =', y
...
error ('spam', 'eggs')
<type 'instance'>
('spam', 'eggs')
('spam', 'eggs')
x = spam
y = eggs
15. Command-line arguments can be handled with Python's optparse library; for detailed usage see this article
from optparse import OptionParser
def main():
    usage = "usage: %prog [options] arg"
    parser = OptionParser(usage)
    parser.add_option("-f", "--file", dest="filename",
                      help="read data from FILENAME")
    parser.add_option("-v", "--verbose",
                      action="store_true", dest="verbose")
    parser.add_option("-q", "--quiet",
                      action="store_false", dest="verbose")
    (options, args) = parser.parse_args()
    if len(args) != 1:
        parser.error("incorrect number of arguments")
    if options.verbose:
        print "reading %s..." % options.filename
if __name__ == "__main__":
    main()
In plain terms: make_option() and add_option() define how one command-line option of the script is parsed. After parse_args(), positional arguments end up in the args list and option values end up in options. dest names the attribute under which the value is stored; if it is omitted, the option's long name is used. help is the text shown for the option when the script is run with --help/-h. action describes how the option is handled: the default, store, saves the value that follows the option under dest; store_true saves True under dest when the option appears; store_false saves False under dest when the option appears.
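To see where each piece ends up, parse_args() can be given an explicit argument list instead of reading sys.argv. The sketch below reuses the options from the notes; the helper name build_parser and the sample arguments data.txt and input.csv are made up.

```python
from optparse import OptionParser


def build_parser():
    # Same three options as in the notes above.
    parser = OptionParser(usage="usage: %prog [options] arg")
    parser.add_option("-f", "--file", dest="filename",
                      help="read data from FILENAME")
    parser.add_option("-v", "--verbose", action="store_true", dest="verbose")
    parser.add_option("-q", "--quiet", action="store_false", dest="verbose")
    return parser


# Passing a list makes the example independent of the real command line.
options, args = build_parser().parse_args(["-f", "data.txt", "-v", "input.csv"])

# With no arguments at all, unset options default to None.
defaults, no_args = build_parser().parse_args([])
```

options.filename holds the value after -f, options.verbose is True because -v appeared, and the leftover positional argument lands in args.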
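16. I recently started using BeautifulSoup v4 and hit the following error (I had been using an older BeautifulSoup and never saw it)
HTMLParser.HTMLParseError: malformed start tag
Fix: run easy_install html5lib to install html5lib, which replaces HTMLParser
BeautifulSoup website:
BeautifulSoup manual:
Chinese manual (for a quick start):
Below are some BeautifulSoup usage examples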
[dongsong@localhost boosenspider]$ vpython
Python 2.6.6 (r266:84292, Dec
[GCC 4.4.6
(Red Hat 4.4.6-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from bs4 import BeautifulSoup as soup
>>> s = soup('<li class="dk_dk" id="dkdk"><a href="javascript:;" onclick="MOP.DZH.clickDaka();" class="btn_dk">打卡</a></li>')
>>> s
<html><head></head><body><li class="dk_dk" id="dkdk"><a class="btn_dk" href="javascript:;" onclick="MOP.DZH.clickDaka();">打卡</a></li></body></html>
>>> type(s)
<class 'bs4.BeautifulSoup'>
>>> t = s.body.contents[0]
>>> t
<li class="dk_dk" id="dkdk"><a class="btn_dk" href="javascript:;" onclick="MOP.DZH.clickDaka();">打卡</a></li>
>>> import re
>>> t.findAll(name='a',attrs={'class':re.compile(r"btn_dks")})
>>> t.findAll(name='a',attrs={'class':re.compile(r"btn_dk")})
[<a class="btn_dk" href="javascript:;" onclick="MOP.DZH.clickDaka();">打卡</a>]
>>> t.findAll(name='a',attrs={'class':re.compile(r"btn_dk"),'href':None})
>>> t.findAll(name='a',attrs={'class':re.compile(r"btn_dk"),'href':re.compile('')})
[<a class="btn_dk" href="javascript:;" onclick="MOP.DZH.clickDaka();">打卡</a>]
>>> t.contents[0]
<a class="btn_dk" href="javascript:;" onclick="MOP.DZH.clickDaka();">打卡</a>
>>> t.contents[0].string = "hello"
>>> t
<li class="dk_dk" id="dkdk"><a class="btn_dk" href="javascript:;" onclick="MOP.DZH.clickDaka();">hello</a></li>
>>> t.contents[0].text
>>> t.contents[0].string
>>> t.findAll(name='a',attrs={'class':re.compile(r"btn_dk"),'text':re.compile('')})
>>> t.findAll(name='a',attrs={'class':re.compile(r"btn_dk"),'text':re.compile('h')})
>>> t.findAll(name='a',attrs={'class':re.compile(r"btn_dk"),'text':re.compile('^h')})
>>> t.findAll(name='a',attrs={'class':re.compile(r"btn_dk")})
[<a class="btn_dk" href="javascript:;" onclick="MOP.DZH.clickDaka();">hello</a>]
>>> t.findAll(name='a',attrs={'class':re.compile(r"btn_dk")},text=re.compile(r''))
[<a class="btn_dk" href="javascript:;" onclick="MOP.DZH.clickDaka();">hello</a>]
>>> t.findAll(name='a',attrs={'class':re.compile(r"btn_dk")},text=re.compile(r'a'))
>>> t.findAll(name='a',attrs={'class':re.compile(r"btn_dk")},text=re.compile(r'^hell'))
[<a class="btn_dk" href="javascript:;" onclick="MOP.DZH.clickDaka();">hello</a>]
>>> t.findAll(name='a',attrs={'class':re.compile(r"btn_dk")},text=re.compile(r'^hello$'))
[<a class="btn_dk" href="javascript:;" onclick="MOP.DZH.clickDaka();">hello</a>]
>>> t.findAll(name='a',attrs={},text=re.compile(r'^hello$'))
[<a class="btn_dk" href="javascript:;" onclick="MOP.DZH.clickDaka();">hello</a>]
>>> t
<li class="dk_dk" id="dkdk"><a class="btn_dk" href="javascript:;" onclick="MOP.DZH.clickDaka();">hello</a></li>
>>> t1 = soup('<li class="dk_dk" id="dkdk"><a class="btn_dk" href="javascript:;" onclick="MOP.DZH.clickDaka();">hello</a></li>').body.contents[0]
>>> t1
<li class="dk_dk" id="dkdk"><a class="btn_dk" href="javascript:;" onclick="MOP.DZH.clickDaka();">hello</a></li>
>>> t == t1
>>> re.search(r'(^hello)|(^bbb)','hello')
<_sre.SRE_Match object at 0x25ef718>
>>> re.search(r'(^hello)|(^bbb)','hellosdfsd')
<_sre.SRE_Match object at 0x25ef7a0>
>>> re.search(r'(^hello)|(^bbb)','bbbsdfsdf')
<_sre.SRE_Match object at 0x25ef718>
>>> t2 = t1.contents[0]
>>> t2
<a class="btn_dk" href="javascript:;" onclick="MOP.DZH.clickDaka();">hello</a>
>>> t2.findAll(name='a')
[GCC 4.4.6
(Red Hat 4.4.6-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from bs4 import BeautifulSoup as soup
>>> s = soup('<li><a href="/techforum/articleslist/0/24.shtml" id="item天涯婚礼堂">天涯婚礼堂</a></li>')
>>> s.findAll(name='a',attrs={'href':None})
>>> s.findAll(name='a',attrs={'href':True})
[<a href="/techforum/articleslist/0/24.shtml" id="item天涯婚礼堂">天涯婚礼堂</a>]
>>> import re
>>> s.findAll(name='a',attrs={'href':re.compile(r'')})
[<a href="/techforum/articleslist/0/24.shtml" id="item天涯婚礼堂">天涯婚礼堂</a>]
<html><head></head><body><li><a href="/techforum/articleslist/0/24.shtml" id="item天涯婚礼堂">天涯婚礼堂</a></li></body></html>
>>> id(s1)
>>> s1.body.contents[0].contents[0]['href']=None
<html><head></head><body><li><a href id="item天涯婚礼堂">天涯婚礼堂</a></li></body></html>
<html><head></head><body><li><a href id="item天涯婚礼堂">天涯婚礼堂</a></li></body></html>
>>> id(s1)
>>> s.findAll(name='a',attrs={'href':re.compile(r'')})
>>> s.findAll(name='a',attrs={'href':True})
>>> s.findAll(name='a',attrs={'href':None})
[<a href id="item天涯婚礼堂">天涯婚礼堂</a>]
>>> s.findAll(name='a')
[<a href id="item天涯婚礼堂">天涯婚礼堂</a>]
#text is an argument for searching NavigableString objects. Its value may be a string, a regular expression, a list or dictionary, True or None, or a callable that takes a NavigableString
#None, False, and '' mean no requirement; re.compile('') and True mean a NavigableString must be present (unlike attrs, where an attribute set to False in the dict must be absent)
#Note how the text argument of findAll is used:
>>> rts = s2.findAll(name=u'ul',attrs={u'id': u'contentbar', u'st_type': 'nav'}, text=re.compile(r''))
>>> len(rts)
>>> rts = s2.findAll(name=u'ul',attrs={u'id': u'contentbar', u'st_type': 'nav'}, text='')
>>> len(rts)
>>> rts = s2.findAll(name=u'ul',attrs={u'id': u'contentbar', u'st_type': 'nav'}, text=True)
>>> len(rts)
>>> rts = s2.findAll(name=u'ul',attrs={u'id': u'contentbar', u'st_type': 'nav'}, text=False)
>>> len(rts)
>>> rts = s2.findAll(name=u'ul',attrs={u'id': u'contentbar', u'st_type': 'nav'}, text=None)
>>> len(rts)
#About the string attribute, and on which kinds of elements it is available
>>> from bs4 import BeautifulSoup as soup
>>> soup1 = soup('<b>hello,<img href="sfdsf">aaaa</img></b>').body.contents[0]
>>> soup1
<b>hello,<img href="sfdsf"/>aaaa</b>
>>> soup1.string
>>> soup1.name
>>> soup1.text
u'hello,aaaa'
>>> type(soup1)
<class 'bs4.element.Tag'>
>>> soup1.contents[0]
>>> type(soup1.contents[0])
<class 'bs4.element.NavigableString'>
>>> soup1.contents[0].string
>>> soup2 = soup('<b>hello</b>').body.contents[0]
>>> type(soup2)
<class 'bs4.element.Tag'>
>>> soup2.string
#Usage of limit; zero means no limit
>>> soup2.findAll(name='a',text=False,limit=0)
[<a href="/subject/4172417/"><img class="m_sub_img" src="/spic/s4424194.jpg"/></a>, <a href="/subject/4172417/">?OE+?OE+é,????</a>]
>>> soup2.findAll(name='a',text=False,limit=1)
[<a href="/subject/4172417/"><img class="m_sub_img" src="/spic/s4424194.jpg"/></a>]
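BeautifulSoup's performance is mediocre, but it is very good at repairing and tolerating invalid HTML tags. As for encodings: when the source page's encoding is known, it can be passed to the BeautifulSoup constructor through the from_encoding argument (for example, I passed from_encoding='gbk' when parsing Tianya pages); when it is not known, you can rely on bs's automatic encoding detection and conversion (which may produce garbled text; machines are not as smart as people, after all).
The object BeautifulSoup returns, and the data inside each of its nodes, are all unicode after conversion.
---------->
Ran into a small problem today
A piece of HTML source failed to build a bs object under bs3.2.1, raising UnicodeEncodeError. It failed whether the source was passed in as unicode, utf-8, or latin1, and the bs3.2.1 constructor does not even have a from_encoding argument
Under bs4 it went through without a hitch, whether passed in as unicode or as utf-8, with no need to specify from_encoding (with utf-8 input and no from_encoding the output was garbled, but at least nothing raised; bs3 is far more fragile)
The lesson: once code has been tested as stable against a particular library version, just install that version wherever the code is used. Why bend over backwards for compatibility? If an older version of the library has a bug, am I supposed to stay compatible with that too?
<-------------------- 18:20
Building objects with bs4: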
[dongsong@bogon boosenspider]$ cat bs_constrator.py
#encoding=utf-8
from bs4 import BeautifulSoup as soup
from bs4 import Tag
if __name__ == '__main__':
    sou = soup('<div></div>')
    tag1 = Tag(sou, name='div')
    tag1['id'] = 'gentie1'
    tag1.string = 'hello,tag1'
    sou.div.insert(0,tag1)
    tag2 = Tag(sou, name='div')
    tag2['id'] = 'gentie2'
    tag2.string = 'hello,tag2'
    sou.div.insert(1,tag2)
    print sou
[dongsong@bogon boosenspider]$ vpython bs_constrator.py
<html><head></head><body><div><div id="gentie1">hello,tag1</div><div id="gentie2">hello,tag2</div></div></body></html>
cgi can escape HTML strings (escape); HTMLParser can undo HTML escaping (unescape)
>>> t = Tag(name='t')
>>> t.string="<img src=''/>"
>>> t
<t>&lt;img src=''/&gt;</t>
>>> str(t)
"<t>&lt;img src=''/&gt;</t>"
>>> t.string
u"<img src=''/>"
>>> HTMLParser.HTMLParser().unescape(str(t))
u"<t><img src=''/></t>"
>>> s1 = HTMLParser.HTMLParser().unescape(str(t))
>>> s1
u"<t><img src=''/></t>"
>>> s2 = cgi.escape(s1)
>>> s2
u"&lt;t&gt;&lt;img src=''/&gt;&lt;/t&gt;"
>>> HTMLParser.HTMLParser().unescape(s2)
u"<t><img src=''/></t>"
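For reference, a standalone round trip of the same escape/unescape idea is sketched below. It is written for Python 3, where html.escape and html.unescape play the roles of cgi.escape and HTMLParser.unescape, and it falls back to the Python 2 modules used in these notes. The sample markup is the string from the session above.

```python
try:
    # Python 3: the html module provides both directions.
    from html import escape, unescape
except ImportError:
    # Python 2 fallback, matching the modules named in the notes.
    from cgi import escape
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape

raw = "<t><img src=''/></t>"

# quote=False leaves quote characters alone, like cgi.escape's default.
escaped = escape(raw, quote=False)
restored = unescape(escaped)
```

After escaping, the angle brackets become &lt; and &gt; so the text can be embedded in a page without being parsed as tags; unescaping gives back the original string.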
17. Hashing with the md5 module or the hashlib module
>>> import md5
>>> md5.md5("asdfadf").hexdigest()
'aeeefe03c361e1eed93589'
>>> import hashlib
>>> hashlib.md5("asdfadf").hexdigest()
'aeeefe03c361e1eed93589'
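A small hashlib sketch is given here for comparison, with an input whose MD5 digest is widely published. The input b"hello" and the helper name md5_hex are chosen for this example only; the b prefix keeps the call valid on Python 3, where hashlib requires bytes.

```python
import hashlib


def md5_hex(data):
    # data must be bytes; encode unicode text first (e.g. text.encode("utf-8")).
    return hashlib.md5(data).hexdigest()


digest = md5_hex(b"hello")
# Feeding the data in pieces with update() gives the same digest.
incremental = hashlib.md5()
incremental.update(b"hel")
incremental.update(b"lo")
```

An MD5 hex digest is always 32 hexadecimal characters. MD5 is a hash, not encryption, and it is no longer considered safe for security purposes such as password storage.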
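18. Without a timeout, urllib2.urlopen(url) may wait forever for the remote server's response and hang
urlFile = urllib2.urlopen(url, timeout=g_url_timeout)
urlData = urlFile.read()
19. Regular-expression matching
A string enclosed in triple single quotes can span lines, and the resulting string contains \n; keep that in mind
A single- or double-quoted string can also be continued across lines with a trailing \, in which case the resulting string contains neither the \ nor the \n. Do not write regex patterns this way, though: in a raw string the backslash and the newline both stay in the pattern, so it will fail to match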
>>> ss = '''
... hell0,a
... shhh
... liumingdong
... xudongsong
... hello
... '''
>>> ss
'\nhell0,a\nshhh\nliumingdong\nxudongsong\nhello\n'
SyntaxError: EOL while scanning string literal
>>> sss = 'aaaa\
... bbbb\
... cccccc'
>>> sss
'aaaabbbbcccccc'
>>> s3 = r'(^hello)|\
... (abc$)'
>>> re.search(s3,'hello,world')
<_sre.SRE_Match object at 0x7f>
#the pattern on the first line matches
>>> re.search(s3,'aaa,hello,worldabc')
#the pattern on the second line fails to match
>>> s4 = r'(^hello)|(abc$)'
#s4 is not continued across lines with a quote plus \, so both alternatives match
>>> re.search(s4,"hello,world")
<_sre.SRE_Match object at 0x182e690>
>>> re.search(s4,"aaa,hello,worldabc")
<_sre.SRE_Match object at 0x7f>
#Note how to extract matched substrings (wrap the part of the pattern to extract in parentheses; groups numbered from 1 correspond to the parenthesized parts)
>>> re.search(r'^(\d+)abc(\d+)$','232abc1').group(0,1,2)
('232abc1', '232', '1')
#Below is an example that mixes re and lambda
#encoding=utf-8
import re
f = lambda arg: re.search(u'^(\d+)\w+',arg).group(1)
print f(u'1111条评论')
try:
    f(u'aaaa')
except AttributeError,e:
    print str(e)
:!python re_lambda.py
'NoneType' object has no attribute 'group'
re.findall() is very handy too
>>> re.findall(r'\\@[A-Za-z0-9]+', s)
['\\@userA', '\\@userB']
'hello,world,\\@userA\\@userB'
>>> re.findall(r'\\@([A-Za-z0-9]+)', s)
['userA', 'userB']
20. I wrote a crawler. I used to handle URL joining by hand, case by case (./xxx, #xxxx, /xxx and so on all had to be considered), which was tedious; later I found there is a ready-made tool for it
>>> from urlparse import urljoin
>>> import urllib
>>> url = urljoin(r"/tag/?view=type",u"./网络小说")
>>> url
u'/tag/\u7f51\u7edc\u5c0f\u8bf4'
>>> conn2 = urllib.urlopen(url)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.6/urllib.py", line 86, in urlopen
    return opener.open(url)
  File "/usr/lib64/python2.6/urllib.py", line 179, in open
    fullurl = unwrap(toBytes(fullurl))
  File "/usr/lib64/python2.6/urllib.py", line 1041, in toBytes
    " contains non-ASCII characters")
UnicodeError: URL u'/tag/\u7f51\u7edc\u5c0f\u8bf4' contains non-ASCII characters
>>> conn2 = urllib.urlopen(url.encode('utf-8'))
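The cases that used to be handled by hand (./xxx, /xxx, #xxx) can be checked directly against urljoin. The sketch below runs on Python 3, where urljoin and quote live in urllib.parse, with a fallback import for Python 2. The base URL http://example.com/tag/?view=type and the relative links are placeholders invented for the example.

```python
try:
    from urllib.parse import urljoin, quote   # Python 3
except ImportError:
    from urlparse import urljoin              # Python 2
    from urllib import quote

base = "http://example.com/tag/?view=type"    # hypothetical page URL

relative = urljoin(base, "./novel")   # resolved against the base directory
absolute = urljoin(base, "/about")    # resolved against the site root
fragment = urljoin(base, "#top")      # keeps path and query, adds the fragment

# Non-ASCII path segments: encode to UTF-8 and percent-quote before joining,
# so the result is a plain ASCII URL that urlopen accepts.
segment = quote(u"\u7f51\u7edc\u5c0f\u8bf4".encode("utf-8"))
non_ascii = urljoin(base, segment)
```

Quoting the segment first avoids the "contains non-ASCII characters" error shown in the session above, because the joined URL then contains only ASCII.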
21. How to add headers to an HTTP request with urllib2, and how to read a cookie's value
>>> import urllib, urllib2
>>> request = urllib2.Request("/finance/pics/hv1/46/178/1.jpg",headers={'If-Modified-Since':'Wed, 02 May :20 GMT'})
#equivalent to request.add_header('If-Modified-Since','Wed, 02 May :20 GMT')
>>> urllib2.urlopen(request)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.6/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib64/python2.6/urllib2.py", line 397, in open
    response = meth(req, response)
  File "/usr/lib64/python2.6/urllib2.py", line 510, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python2.6/urllib2.py", line 435, in error
    return self._call_chain(*args)
  File "/usr/lib64/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.6/urllib2.py", line 518, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 304: Not Modified
>>> urllib.urlencode({"aaa":"bbb"})
'aaa=bbb'
>>> urllib.urlencode([("aaa","bbb")])
'aaa=bbb'
#Using urlencode: when submitting a POST form, the key/value parameters are run through urlencode and sent as the request body
#urllib2.urlopen(url,data=urllib.urlencode(...))
Today (13.7.4) I hit a problem: logging in to a certain site requires sending the csrftoken that the server planted on the first visit back to the server as part of the POST data, so I looked into how to read a cookie's value. I can't share the actual code, so here is an example from Stack Overflow (look mainly at the few lines that read the cookie data)
[dongsong@localhost python_study]$ cat cookie.py
from urllib2 import Request, build_opener, HTTPCookieProcessor, HTTPHandler
import httplib, urllib, cookielib, Cookie, os
conn = httplib.HTTPConnection('webapp.pucrs.br')
#COOKIE FINDER
cj = cookielib.CookieJar()
opener = build_opener(HTTPCookieProcessor(cj),HTTPHandler())
req = Request('http://webapp.pucrs.br/consulta/principal.jsp')
f = opener.open(req)
html = f.read()
import pdb
pdb.set_trace()
for cookie in cj:
    c = cookie
#FIM COOKIE FINDER
params = urllib.urlencode ({'pr1':, 'pr2':'sssssss'})
headers = {"Content-type":"text/html",
"Set-Cookie" : "JSESSIONID=70E78DA0349"}
# I couldn't set the value automaticaly here, the cookie object can't be converted to string, so I change this value on every session to the new cookie's value. Any solutions?
conn.request ("POST", "/consulta/servlet/consulta.aluno.ValidaAluno",params, headers) # Validation page
resp = conn.getresponse()
temp = conn.request("GET","/consulta/servlet/consulta.aluno.Publicacoes") # desired content page
resp = conn.getresponse()
print resp.read()
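The part of the listing above that matters is the loop over cj. A minimal sketch of pulling one cookie's value out of a CookieJar by name follows. The helper names make_cookie and cookie_value, the cookie name csrftoken, its value, and the domain example.com are all illustrative assumptions, and the jar is filled by hand so that the sketch runs without a network connection. It targets Python 3 and falls back to the Python 2 modules used in these notes.

```python
try:
    from http.cookiejar import Cookie, CookieJar                   # Python 3
    from urllib.request import HTTPCookieProcessor, build_opener
except ImportError:
    from cookielib import Cookie, CookieJar                        # Python 2
    from urllib2 import HTTPCookieProcessor, build_opener


def make_cookie(name, value, domain):
    # Stand-in for a cookie a server would set; only used to fill the jar offline.
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def cookie_value(jar, name):
    # Iterating a CookieJar yields Cookie objects with .name and .value.
    for cookie in jar:
        if cookie.name == name:
            return cookie.value
    return None


jar = CookieJar()
# The opener shares the jar: responses fill it, later requests send it back.
opener = build_opener(HTTPCookieProcessor(jar))

jar.set_cookie(make_cookie("csrftoken", "abc123", "example.com"))
token = cookie_value(jar, "csrftoken")
```

In real use the jar is filled by opener.open(url), and later requests made through the same opener send the stored cookies back automatically in a Cookie request header. A value usually only needs to be read out by hand when, as with a CSRF token, it must also go into the POST data. Set-Cookie is a response header; a client that builds its headers manually would send Cookie instead.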
22. How to change the file that logging writes to. This becomes more pressing in multi-process code built on the multiprocessing module, because child processes inherit the parent's log output file and format....
def change_log_file(fileName):
    h = logging.FileHandler(fileName)
    h.setLevel(g_logLevel)
    h.setFormatter(logging.Formatter(g_logFormat))
    logger = logging.getLogger()
    #print logger.handlers
    for handler in logger.handlers:
        handler.close()
    while len(logger.handlers) > 0:
        logger.removeHandler(logger.handlers[0])
    logger.addHandler(h)
For setting up logging's logger, handler, and formatter, see Django's settings file; below is a small example of my own
[dongsong@localhost python_study]$ cat logging_test.py
#encoding=utf-8
import logging, sys
if __name__ == '__main__':
    logger = logging.getLogger('test')
    logger.setLevel(logging.DEBUG)
    print 'log handlers: %s' % str(logger.manager.loggerDict)
    logger.error('here')
    logger.warning('here')
    logger.info('here')
    logger.debug('here')
    #handler = logging.FileHandler('test.log')
    handler = logging.StreamHandler(sys.stdout)
    handler.setLevel(logging.DEBUG)
    formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    handler.setFormatter(formatter)
    logger.addHandler(handler)
    #logging.getLogger('test').addHandler(logging.NullHandler()) # python 2.7+
    logger.error('here')
    logger.warning('here')
    logger.info('here')
    logger.debug('here')
[dongsong@localhost python_study]$ vpython logging_test.py
log handlers: {'test': <logging.Logger instance at 0x7f1dde0c2758>}
No handlers could be found for logger "test"
11:30:48,725 - test - ERROR - here
11:30:48,725 - test - WARNING - here
11:30:48,725 - test - INFO - here
11:30:48,725 - test - DEBUG - here
23. multiprocessing module usage demo
import multiprocessing
from multiprocessing import Process
import time
def func():
    for i in range(3):
        print "hello"
        time.sleep(1)
proc = Process(target = func)
proc.start()
while True:
    childList = multiprocessing.active_children()
    print childList
    if len(childList) == 0:
        break
    time.sleep(1)
[dongsong@bogon python_study]$ python multiprocessing_children.py
[<Process(Process-1, started)>]
[<Process(Process-1, started)>]
[<Process(Process-1, started)>]
[<Process(Process-1, started)>]
[dongsong@bogon python_study]$ fg
multiprocessing's Pool (process pool) is very handy. Today I nearly reinvented it by writing my own (writing one is easy enough, of course, but it would surely be less thorough than the official one)
[dongsong@bogon python_study]$ vpython
Python 2.6.6 (r266:84292, Jun 18 :47)
[GCC 4.4.6
(Red Hat 4.4.6-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from multiprocessing import Pool
>>> import time
>>> poolObj = Pool(processes = 10)
>>> procObj = poolObj.apply_async(time.sleep, (20,))
>>> procObj.get(timeout = 1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.6/multiprocessing/pool.py", line 418, in get
    raise TimeoutError
multiprocessing.TimeoutError
>>> print procObj.get(timeout = 21)
None
>>> poolObj.__dict__['_pool']
[<Process(PoolWorker-1, started daemon)>, <Process(PoolWorker-2, started daemon)>, <Process(PoolWorker-3, started daemon)>, <Process(PoolWorker-4, started daemon)>, <Process(PoolWorker-5, started daemon)>, <Process(PoolWorker-6, started daemon)>, <Process(PoolWorker-7, started daemon)>, <Process(PoolWorker-8, started daemon)>, <Process(PoolWorker-9, started daemon)>, <Process(PoolWorker-10, started daemon)>]
>>> poolObj.close()
>>> poolObj.join()
24. The demo below gives a glimpse of how bs encoding and the str() built-in interact (the built-in analogous to str() is unicode())
#encoding=utf-8
from bs4 import BeautifulSoup as soup
tag = soup((u"<p>白痴代码</p>"),from_encoding='unicode').body.contents[0]
newStr = str(tag) # tag's __str__() returns a utf-8 encoded string (if tag did not implement __str__(), it would behave as described in item 38 of these notes)
print type(newStr),isinstance(newStr,unicode),newStr
try:
    print u"[unicode]hello," + newStr # newStr is implicitly decoded as ASCII to build a unicode result, which raises
except Exception,e:
    print str(e)
print "[utf-8]hello," + newStr
print u"[unicode]hello," + newStr.decode('utf-8')
[dongsong@bogon python_study]$ vpython tag_str_test.py
<type 'str'> False <p>白痴代码</p>
'ascii' codec can't decode byte 0xe7 in position 3: ordinal not in range(128)
[utf-8]hello,<p>白痴代码</p>
[unicode]hello,<p>白痴代码</p>
25. Some notes on using MySQLdb
1> database.py is a database-access interface I wrapped up for a project in 2011; concrete database operations can subclass it and implement the business-specific interfaces
2> Rows fetched with cursor.execute() and cursor.fetchall() come back as unicode, even when connect is given charset='utf8'
3> For pitfalls in query statements, see the test code below. The recommended form is cursor.execute(sql, args), because the underlying library escapes the strings automatically
If you're not familiar with the Python DB-API, note that the SQL statement in cursor.execute() uses placeholders, "%s", rather than
adding parameters directly within the SQL. If you use this technique, the underlying database library will automatically add quotes and escaping to your parameter(s) as necessary. (Also note that Django expects the "%s"
placeholder, not the "?" placeholder, which is used by the SQLite Python bindings. This is for the sake of consistency and sanity.)
4> Proper practice is to call conn.commit() after conn.cursor().execute(); otherwise there will be problems on database versions that do not autocommit
#encoding=utf-8
import MySQLdb
conn = MySQLdb.connect(host = "127.0.0.1", port = 3306, user = "xds", passwd = "xds", db = "xds_db", charset = 'utf8')
cursor = conn.cursor()
print cursor
siteName = u"百度贴吧"
bbsNames = [u"明星", u"影视"]
siteName = siteName.encode('utf-8')
for index in range(len(bbsNames)):
    bbsNames[index] = bbsNames[index].encode('utf-8')
#correct usage (recommended: the driver quotes and escapes the arguments)
#args = tuple([siteName] + bbsNames)
#sql = "select bbs from t_site_bbs where site = %s and bbs in (%s,%s)"
#rts = cursor.execute(sql,args)
#print rts
#also works, but the values are pasted into the SQL without any escaping
args = tuple([siteName] + bbsNames)
sql = "select bbs from t_site_bbs where site = '%s' and bbs in ('%s','%s')" % args
rts = cursor.execute(sql)
#wrong usage, raises an error
#args = tuple([siteName] + bbsNames)
#sql = "select bbs from t_site_bbs where site = %s and bbs in (%s,%s)" % args
#rts = cursor.execute(sql)
#wrong usage: no error, but no rows are found (it only works when the members of bbsNames are digit strings or English strings)
#sql = "select bbs from t_site_bbs where site = '%s' and bbs in %s" % (siteName, str(tuple(bbsNames)))
#print sql
#rts = cursor.execute(sql)
#print rts
rts = cursor.fetchall()
for rt in rts:
    print rt[0]
For tables with an auto-increment column, cursor.lastrowid after an insert gives the auto-increment id of the row just inserted; this does not work for update
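Since a MySQL server is not always at hand, the placeholder style and lastrowid can be tried out with the standard-library sqlite3 module instead. This is a substitute for illustration, not MySQLdb itself: sqlite3 uses "?" placeholders where MySQLdb uses "%s", as the quoted passage above points out. The table layout mirrors t_site_bbs from the notes; the rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
cursor = conn.cursor()
cursor.execute(
    "create table t_site_bbs "
    "(id integer primary key autoincrement, site text, bbs text)"
)

# Values are passed separately from the SQL text, so the driver handles
# quoting; note the apostrophe in one of the values.
rows = [("tieba", "stars"), ("tieba", "o'clock"), ("other", "stars")]
for site, bbs in rows:
    cursor.execute("insert into t_site_bbs (site, bbs) values (?, ?)", (site, bbs))
last_id = cursor.lastrowid           # auto-increment id of the last insert
conn.commit()                        # make the inserts permanent

cursor.execute(
    "select bbs from t_site_bbs where site = ? and bbs in (?, ?) order by id",
    ("tieba", "stars", "o'clock"),
)
found = [row[0] for row in cursor.fetchall()]
conn.close()
```

The value containing an apostrophe goes through unharmed, which is exactly what breaks when values are formatted into the SQL string by hand.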
26. About time
[dongsong@bogon boosencms]$ vpython
Python 2.6.6 (r266:84292, Dec
[GCC 4.4.6
(Red Hat 4.4.6-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import time
>>> time.gmtime()
time.struct_time(tm_year=2012, tm_mon=5, tm_mday=18, tm_hour=4, tm_min=14, tm_sec=55, tm_wday=4, tm_yday=139, tm_isdst=0)
>>> time.localtime()
time.struct_time(tm_year=2012, tm_mon=5, tm_mday=18, tm_hour=12, tm_min=15, tm_sec=2, tm_wday=4, tm_yday=139, tm_isdst=0)
>>> time.time()
>>> time.timezone
>>> time.gmtime(time.time())
time.struct_time(tm_year=2012, tm_mon=5, tm_mday=18, tm_hour=4, tm_min=19, tm_sec=45, tm_wday=4, tm_yday=139, tm_isdst=0)
>>> time.localtime(time.time())
time.struct_time(tm_year=2012, tm_mon=5, tm_mday=18, tm_hour=12, tm_min=19, tm_sec=54, tm_wday=4, tm_yday=139, tm_isdst=0)
>>> time.strftime("%a, %d %b %Y %H:%M:%S +0800", time.localtime(time.time()))
'Fri, 18 May :20 +0800'
>>> time.strftime("%a, %d %b %Y %H:%M:%S +0000", time.gmtime(time.time()))
'Fri, 18 May :36 +0000'
#How is %Z actually supposed to be used? The experiments below did not clear it up either
>>> time.strftime("%a, %d %b %Y %H:%M:%S %Z", time.gmtime(time.time()))
'Fri, 18 May :09 CST'
>>> time.strftime("%a, %d %b %Y %H:%M:%S %Z", time.localtime(time.time()))
'Fri, 18 May :31 CST'
>>> timeStr = time.strftime("%a, %d %b %Y %H:%M:%S +0000", time.gmtime(time.time()))
>>> timeStr
'Fri, 18 May :29 +0000'
>>> t = time.strptime(timeStr, "%a, %d %b %Y %H:%M:%S %Z")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.6/_strptime.py", line 454, in _strptime_time
    return _strptime(data_string, format)[0]
  File "/usr/lib64/python2.6/_strptime.py", line 325, in _strptime
    (data_string, format))
ValueError: time data 'Fri, 18 May :29 +0000' does not match format '%a, %d %b %Y %H:%M:%S %Z'
>>> t = time.strptime(timeStr, "%a, %d %b %Y %H:%M:%S +0000")
>>> t
time.struct_time(tm_year=2012, tm_mon=5, tm_mday=18, tm_hour=4, tm_min=24, tm_sec=29, tm_wday=4, tm_yday=139, tm_isdst=-1)
#datetime usage below
>>> import datetime
>>> datetime.datetime.today()
datetime.datetime(, 12, 28, 25, 892141)
>>> datetime.datetime(,23,54)
datetime.datetime(, 23, 54)
>>> datetime.datetime(,23,54,32)
datetime.datetime(, 23, 54, 32)
>>> datetime.datetime.fromtimestamp(time.time())
datetime.datetime(, 12, 29, 15, 130257)
>>> datetime.datetime.utcfromtimestamp(time.time())
datetime.datetime(, 4, 29, 34, 897017)
>>> datetime.datetime.now()
datetime.datetime(, 12, 29, 52, 558249)
>>> datetime.datetime.utcnow()
datetime.datetime(, 4, 30, 6, 164009)
>>> datetime.datetime.fromtimestamp(time.time()).strftime("%a, %d %b %Y %H:%M:%S")
'Fri, 18 May :30'
>>> datetime.datetime.today().strftime("%a, %d %b %Y %H:%M:%S")
'Fri, 18 May :44'
>>> datetime.datetime.strptime('Fri, 18 May :29', "%a, %d %b %Y %H:%M:%S")
datetime.datetime(, 4, 24, 29)
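%a abbreviated weekday name
%A full weekday name
%b abbreviated month name
%B full month name
%c locale's date and time representation
%d day of the month, 01-31
%H hour (24-hour clock), 00-23
%I hour (12-hour clock), 01-12
%m month, 01-12
%M minute, 00-59
%S second, 00-61 (that is how the official docs put it)
%j day of the year, 001-366
%w weekday as a number, 0 (Sunday) to 6
%x locale's date representation
%X locale's time representation
%y two-digit year, 00-99
%Y year with century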
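The %Z confusion in the session above can usually be avoided with %z, which stands for a numeric offset such as +0000. What follows is a minimal sketch written for Python 3 (the notes themselves use Python 2); the names STAMP_FORMAT, format_utc and parse_stamp are made up for this illustration, and the weekday and month names assume the default C/English locale.

```python
from datetime import datetime, timezone

# Same layout as the session above, but with %z (numeric offset) instead of %Z.
STAMP_FORMAT = "%a, %d %b %Y %H:%M:%S %z"


def format_utc(moment):
    """Render an aware datetime as a UTC timestamp string."""
    return moment.astimezone(timezone.utc).strftime(STAMP_FORMAT)


def parse_stamp(text):
    """Parse a string produced by format_utc back into an aware datetime."""
    return datetime.strptime(text, STAMP_FORMAT)


moment = datetime(2012, 5, 18, 4, 24, 29, tzinfo=timezone.utc)
stamp = format_utc(moment)
restored = parse_stamp(stamp)
print(stamp)
print(restored == moment)
```

Because the offset travels inside the string, the same format string works in both directions and no literal +0000 has to be hard-coded into it.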
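27. A pitfall when converting integers to strings
Some integers are int and some are long. repr() of a long is the digits followed by L, and since str() of a list or other container calls repr() on each element, the L also shows up when str() is applied to a container holding a long (str() applied directly to a long returns the digits alone). Use with care!
As for when an integer is an int and when it is a long, I was still looking into it... As far as I can tell, Python 2 switches to long automatically once a value exceeds sys.maxint, and a literal with an L suffix or a call to long() always gives a long.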
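Python 3 no longer has a separate long type, so the L suffix cannot be reproduced there, but the underlying mechanism (str() of a container uses repr() of its elements) is the same. The sketch below shows it with fractions.Fraction as a stand-in; the helper name join_as_text is hypothetical.

```python
from fractions import Fraction

value = Fraction(1, 3)

# str() of the value itself uses its __str__ ...
direct = str(value)

# ... but str() of a list uses repr() of every element.
inside_list = str([value])


def join_as_text(values):
    """Build a list-like string that uses str() for each element instead."""
    return "[" + ", ".join(str(v) for v in values) + "]"


print(direct)
print(inside_list)
print(join_as_text([value, 2]))
```

Formatting the elements explicitly, as join_as_text does, is one way to keep repr-only decorations out of the final string.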
28. Using join() (every element of the list must be a string)
>>> l = ['a','b','c','d']
>>> '&'.join(l)
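Since join() only accepts strings, a list with other element types has to be converted first. A small sketch in Python 3 syntax; join_values is a name invented for this example.

```python
def join_values(values, sep=","):
    """Join arbitrary values by converting each one with str() first."""
    return sep.join(str(v) for v in values)


letters = join_values(["a", "b", "c", "d"], "-")
numbers = join_values([1, 2, 3])

# Joining non-string elements directly raises TypeError.
try:
    ",".join([1, 2, 3])
    raised = False
except TypeError:
    raised = True

print(letters)
print(numbers)
print(raised)
```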
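29. Debugging Python with pdb
It works much like gdb:
b line_number  set a breakpoint; a file or function can also be specified
b 180, childWeiboRt.retweetedId == 6906  conditional breakpoint
b  list all breakpoints
cl breakpoint_number  clear one breakpoint
cl  clear all breakpoints
s  step into the function
whatis obj  show the type of a variable (equivalent to the built-in type())
up  move one frame up the call stack to look at the code and variables at that call site (where the program has actually got to does not change)
down  move one frame down the call stack to look at the code and variables at that call site (where the program has actually got to does not change)
To inspect the attribute values of an instance (instanceObj) while debugging, use the following statement:
for it in [(attr,getattr(instanceObj,attr)) for attr in dir(instanceObj)]: print it[0],'-->',it[1]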
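The attribute-dumping one-liner above can be wrapped in a reusable helper. This is a sketch in Python 3 syntax, not something pdb provides; dump_attrs, Retweet and its fields are hypothetical names chosen for the example.

```python
def dump_attrs(obj, include_private=False):
    """Return (name, value) pairs for the attributes of obj, sorted by name."""
    rows = []
    for name in dir(obj):
        if not include_private and name.startswith("_"):
            continue
        try:
            value = getattr(obj, name)
        except Exception as exc:  # some properties may raise when read
            value = "<unreadable: %s>" % exc
        rows.append((name, value))
    return rows


class Retweet(object):
    """Hypothetical object standing in for the one being debugged."""

    def __init__(self, retweeted_id, text):
        self.retweeted_id = retweeted_id
        self.text = text


rows = dump_attrs(Retweet(6906, "hi"))
for name, value in rows:
    print(name, "-->", value)
```

Skipping names that start with an underscore keeps the dunder methods out of the listing, which is usually what is wanted at a pdb prompt.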
30. Getting a function's name from inside the function
>>> import sys
>>> def f2():
...     print sys._getframe().f_code.co_name
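The same frame lookup can be moved into a helper so that every function does not have to repeat it. A sketch in Python 3 syntax; current_function_name is a made-up helper, and sys._getframe is a CPython implementation detail, so inspect.currentframe() is shown as the alternative.

```python
import inspect
import sys


def current_function_name():
    """Return the name of the function that called this helper."""
    # Frame 0 is this helper; frame 1 is its caller.
    return sys._getframe(1).f_code.co_name


def f2():
    return current_function_name()


def f3():
    # inspect.currentframe() returns the frame of the running function.
    return inspect.currentframe().f_code.co_name


print(f2())
print(f3())
```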
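31. Handling spaces and other special characters in a URL
When a URL contains special characters such as +, space, /, ?, %, #, & or =, the server may not receive the correct parameter values. What can be done?
Convert these characters into a form the server can recognize. The mapping is as follows:
URL character escapes
(Alternatively, replace them with other characters or use the full-width forms.)
Character   Meaning in a URL                                Escaped form
+           stands for a space                              %2B
space       may be written as + or percent-encoded          %20
/           separates directories and subdirectories        %2F
?           separates the actual URL from the parameters    %3F
%           introduces an escaped special character         %25
#           marks a fragment (bookmark)                     %23
&           separates the parameters given in the URL       %26
=           assigns the value of a parameter in the URL     %3D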
>>> import urllib
>>> import urlparse
>>> urlparse.urljoin('/weibo/',urllib.quote('python c++'))
'/weibo/python%20c%2B%2B'
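Things get harder when the URL carries special characters and the parameter is then handed to a search engine that has special characters of its own (Lucene and the like)...
The value has to be escaped twice; otherwise the special characters travel safely through HTTP, are decoded once, and reach the search engine unescaped, so the results are not what you asked for...
Looking at the URLs shows that the browser-side script does the same double escaping.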
[dongsong@bogon python_study]$ cat url.py
#encoding=utf-8
import urllib, urlparse

if __name__ == '__main__':
    baseUrl = '/weibo/'
    # quote twice so the value is still escaped after the server decodes it once
    url = urlparse.urljoin(baseUrl, urllib.quote(urllib.quote('python c++')))
    print url
    conn = urllib.urlopen(url)
    data = conn.read()
    f = file('/tmp/d.html', 'w')
    f.write(data)
    f.close()
[dongsong@bogon python_study]$ vpython url.py
/weibo/python%2520c%252B%252B
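In Python 3 the same functions live in urllib.parse. The sketch below repeats the double escaping there; build_search_url and the host example.com are placeholders invented for this illustration, not part of the original script.

```python
from urllib.parse import quote, unquote, urljoin


def build_search_url(base_url, term, rounds=2):
    """Percent-encode term `rounds` times and append it to base_url."""
    encoded = term
    for _ in range(rounds):
        # safe="" also escapes "/" so the term stays a single path segment.
        encoded = quote(encoded, safe="")
    return urljoin(base_url, encoded)


once = quote("python c++", safe="")
url = build_search_url("http://example.com/weibo/", "python c++")

# Decoding twice recovers the original search term.
term = unquote(unquote(url.rsplit("/", 1)[1]))

print(once)
print(url)
print(term)
```

The second round turns every % of the first round into %25, which is why one decoding step on the server still leaves an escaped value for the search engine.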
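32. Encoding issues with the json module
Default behaviour of json.dumps():
Every string in the data structure is converted to unicode, the non-ASCII characters are then escaped (\u56fd becomes \\u56fd), and the whole result is returned as a str JSON string (str inputs are decoded with the encoding parameter, which defaults to utf-8; there is no need to touch it).
Mixed encodings among the elements of the original data structure do not affect dumps(), because every string element is converted to unicode before the JSON string is produced.
The ensure_ascii parameter defaults to True; setting it to False changes the behaviour of dumps():
If the strings in the original data structure are unicode, the resulting JSON string is a unicode string and the unicode text inside is not escaped (\u56fd stays \u56fd);
If the strings in the original data structure are utf-8 encoded, the resulting JSON string is a utf-8 str and the bytes inside are not escaped (\xe5\x9b\xbd stays \xe5\x9b\xbd);
If the elements of the original data structure have mixed encodings, dumps() fails with an error.
A JSON string obtained this way can be transcoded as a whole; one produced by the default behaviour cannot (the string elements have already been escaped, so transcoding the whole JSON string does not touch them).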
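warning --> 10:00:
Today I ran into a problem: dumping a dict that contained traditional Chinese characters this way succeeded, but inserting the JSON string into the database raised
_mysql_exceptions.Warning: Incorrect string value: '\xF0\x9F\x91\x91\xE7\xAC...' for column 'detail' at row 1
Storing the output of the first (default) approach caused no problem, so my first guess was that json.dumps(ensure_ascii = False) mishandles traditional characters. A likelier cause is that \xF0\x9F\x91\x91 is a 4-byte UTF-8 sequence (U+1F451), which a MySQL column using the 3-byte utf8 character set cannot store, whereas the default output escapes it to plain ASCII.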
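The bytes quoted in the MySQL warning start with \xF0\x9F\x91\x91, a 4-byte UTF-8 sequence. The sketch below, in Python 3 syntax, decodes that prefix and checks for characters outside the Basic Multilingual Plane, which is one way to spot text that a 3-byte utf8 column would reject; find_non_bmp is a hypothetical helper name.

```python
import json


def find_non_bmp(text):
    """Return the characters in text whose code point needs 4 bytes in UTF-8."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


# The prefix from the warning; the trailing \xE7\xAC is an incomplete
# sequence (the message was cut off), so it is dropped by "ignore".
sample = b"\xF0\x9F\x91\x91\xE7\xAC".decode("utf-8", "ignore")

offenders = find_non_bmp(sample)
escaped = json.dumps(sample)  # default ensure_ascii=True

print(["U+%X" % ord(ch) for ch in offenders])
print(escaped)
```

With the default ensure_ascii=True the character is written as a pair of \u escapes, so the stored string is pure ASCII, which is consistent with the first approach inserting without complaint.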
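With messily encoded data, json.loads() may raise UnicodeDecodeError (for example, the utf8-encoded JSON string returned by the QQ open platform API kept failing to decode for me today). It can be worked around as follows:
myString = jsonStr.decode('utf-8', 'ignore') # convert to unicode and ignore errors
jsonObj = json.loads(myString)
Some data may be lost, but that beats doing nothing.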
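For comparison, here is how the two dumps() modes and the lenient decode look in Python 3, where all text is unicode and only the ensure_ascii distinction remains. This is a sketch rather than a port of the test script; loads_lenient is a name made up here.

```python
import json

data = {"中国": "北京"}

# Default: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(data)

# ensure_ascii=False keeps the characters as they are.
readable = json.dumps(data, ensure_ascii=False)


def loads_lenient(raw_bytes):
    """Decode bytes as UTF-8, dropping invalid sequences, then parse JSON."""
    return json.loads(raw_bytes.decode("utf-8", "ignore"))


# \xff can never appear in valid UTF-8, so it is silently dropped.
recovered = loads_lenient(b'{"k": "ab\xffc"}')

print(escaped)
print(readable)
print(recovered)
```

Both output strings parse back to the same dictionary; they differ only in whether the JSON text itself is restricted to ASCII.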
#encoding=utf-8
import json
from pprint import pprint

def show_rt(rt):
    pprint(rt)
    print rt
    print "type(rt) is %s" % type(rt)

if __name__ == '__main__':
    # keys and values are unicode objects
    unDic = {
        u'中国':u'北京',
        u'日本':u'东京',
        u'法国':u'巴黎'
    }
    # keys and values are utf-8 encoded str objects
    utf8Dic = {
        r'中国':r'北京',
        r'日本':r'东京',
        r'法国':r'巴黎'
    }
    pprint(unDic)
    pprint(utf8Dic)
    print "\nunicode instance dumps to string:"
    rt = json.dumps(unDic)
    show_rt(rt)
    print "utf-8 instance dumps to string:"
    rt = json.dumps(utf8Dic)
    show_rt(rt)
    #encoding is the character encoding for str instances, default is UTF-8
    #If ensure_ascii is False, then the return value will be a unicode instance, default is True
    print "\nunicode instance dumps(ensure_ascii=False) to string:"
    rt = json.dumps(unDic,ensure_ascii=False)
    show_rt(rt)
    print "utf-8 instance dumps(ensure_ascii=False) to string:"
    rt = json.dumps(utf8Dic,ensure_ascii=False)
    show_rt(rt)
    print "\n-----------------数据结构混杂编码-----------------"
    # mix the two encodings inside each dict
    unDic[u'日本'] = r'东京'
    utf8Dic[r'日本'] = u'东京'
    pprint(unDic)
    pprint(utf8Dic)
    print "\nunicode instance dumps to string:"
    try:
        rt = json.dumps(unDic)
    except Exception,e:
        print "%s:%s" % (type(e),str(e))
    else:
        show_rt(rt)
    print "utf-8 instance dumps to string:"
    try:
        rt = json.dumps(utf8Dic)
    except Exception,e:
        print "%s:%s" % (type(e),str(e))
    else:
        show_rt(rt)
    print "\nunicode instance dumps(ensure_ascii=False) to string:"
    try:
        rt = json.dumps(unDic, ensure_ascii=False)
    except Exception,e:
        print "%s:%s" % (type(e),str(e))
    else:
        show_rt(rt)
    print "utf-8 instance dumps(ensure_ascii=False) to string:"
    try:
        rt = json.dumps(utf8Dic, ensure_ascii=False)
    except Exception,e:
        print "%s:%s" % (type(e),str(e))
    else:
        show_rt(rt)
[dongsong@bogon python_study]$ vpython json_test.py
{u'\u4e2d\u56fd': u'\u5317\u4eac',
u'\u65e5\u672c': u'\u4e1c\u4eac',
u'\u6cd5\u56fd': u'\u5df4\u9ece'}
{'\xe4\xb8\xad\xe5\x9b\xbd': '\xe5\x8c\x97\xe4\xba\xac',
'\xe6\x97\xa5\xe6\x9c\xac': '\xe4\xb8\x9c\xe4\xba\xac',
'\xe6\xb3\x95\xe5\x9b\xbd': '\xe5\xb7\xb4\xe9\xbb\x8e'}
unicode instance dumps to string:
'{"\\u4e2d\\u56fd": "\\u5317\\u4eac", "\\u65e5\\u672c": "\\u4e1c\\u4eac", "\\u6cd5\\u56fd": "\\u5df4\\u9ece"}'
{"\u4e2d\u56fd": "\u5317\u4eac", "\u65e5\u672c": "\u4e1c\u4eac", "\u6cd5\u56fd": "\u5df4\u9ece"}
type(rt) is <type 'str'>
utf-8 instance dumps to string:
'{"\\u4e2d\\u56fd": "\\u5317\\u4eac", "\\u6cd5\\u56fd": "\\u5df4\\u9ece", "\\u65e5\\u672c": "\\u4e1c\\u4eac"}'
{"\u4e2d\u56fd": "\u5317\u4eac", "\u6cd5\u56fd": "\u5df4\u9ece", "\u65e5\u672c": "\u4e1c\u4eac"}
type(rt) is <type 'str'>
unicode instance dumps(ensure_ascii=False) to string:
u'{"\u4e2d\u56fd": "\u5317\u4eac", "\u65e5\u672c": "\u4e1c\u4eac", "\u6cd5\u56fd": "\u5df4\u9ece"}'
{"中国": "北京", "日本": "东京", "法国": "巴黎"}
type(rt) is <type 'unicode'>
utf-8 instance dumps(ensure_ascii=False) to string:
'{"\xe4\xb8\xad\xe5\x9b\xbd": "\xe5\x8c\x97\xe4\xba\xac", "\xe6\xb3\x95\xe5\x9b\xbd": "\xe5\xb7\xb4\xe9\xbb\x8e", "\xe6\x97\xa5\xe6\x9c\xac": "\xe4\xb8\x9c\xe4\xba\xac"}'
{"中国": "北京", "法国": "巴黎", "日本": "东京"}
type(rt) is <type 'str'>
-----------------数据结构混杂编码-----------------
{u'\u4e2d\u56fd': u'\u5317\u4eac',
u'\u65e5\u672c': '\xe4\xb8\x9c\xe4\xba\xac',
u'\u6cd5\u56fd': u'\u5df4\u9ece'}
{'\xe4\xb8\xad\xe5\x9b\xbd': '\xe5\x8c\x97\xe4\xba\xac',
'\xe6\x97\xa5\xe6\x9c\xac': u'\u4e1c\u4eac',
'\xe6\xb3\x95\xe5\x9b\xbd': '\xe5\xb7\xb4\xe9\xbb\x8e'}
unicode instance dumps to string:
'{"\\u4e2d\\u56fd": "\\u5317\\u4eac", "\\u65e5\\u672c": "\\u4e1c\\u4eac", "\\u6cd5\\u56fd": "\\u5df4\\u9ece"}'
{"\u4e2d\u56fd": "\u5317\u4eac", "\u65e5\u672c": "\u4e1c\u4eac", "\u6cd5\u56fd": "\u5df4\u9ece"}
type(rt) is <type 'str'>
utf-8 instance dumps to string:
'{"\\u4e2d\\u56fd": "\\u5317\\u4eac", "\\u6cd5\\u56fd": "\\u5df4\\u9ece", "\\u65e5\\u672c": "\\u4e1c\\u4eac"}'
{"\u4e2d\u56fd": "\u5317\u4eac", "\u6cd5\u56fd": "\u5df4\u9ece", "\u65e5\u672c": "\u4e1c\u4eac"}
type(rt) is <type 'str'>
unicode instance dumps(en