Can Python scrape the content shown inside a mobile app?

Two ways to fetch an image from a web page and save it in Python

Method 1: read the image data with urllib.urlopen() and write the bytes out yourself.

import urllib
import urllib2

def getImage(addr):
    u = urllib.urlopen(addr)
    data = u.read()
    splitPath = addr.split('/')
    fName = splitPath.pop()
    print fName
    f = open(fName, 'wb')
    f.write(data)
    f.close()

Here addr is the image URL.

Method 2: let urllib.urlretrieve() download and save in a single call.

def getImage2(addr):
    try:
        splitPath = addr.split('/')
        fName = splitPath.pop()
        print fName
        urllib.urlretrieve(addr, fName)
    except Exception, e:
        print "[Error]Can't download %s: %s" % (fName, e)

urllib.urlretrieve(addr, fName) fetches and saves in one step; fName is the name of the saved file, and it may of course include a path.

A variant of getImage2 uses urllib2.urlopen, which is really the same as the first method:

def getImage2(addr):
    try:
        splitPath = addr.split('/')
        fName = splitPath.pop()
        print fName
        open(fName, "wb").write(urllib2.urlopen(addr).read())
    except Exception, e:
        print "[Error]Can't download %s: %s" % (fName, e)

(In Python, ''' … ''' can also be used for comment blocks; note that they must still follow the indentation rules.)
How to get all files and directories under a directory in Python
Python's os.listdir() returns all the files and directories in a single directory, but it does not recurse. When you want the full list of files under a directory, recursively, you can call the get_recursive_file_list() function below.
File name: file_util.py
#! /usr/bin/python
'''Utilities for files & directories.'''

import os

# Get all the files & directories under the specified directory (path), recursively.
def get_recursive_file_list(path):
    current_files = os.listdir(path)
    all_files = []
    for file_name in current_files:
        full_file_name = os.path.join(path, file_name)
        all_files.append(full_file_name)
        # Recurse into sub-directories.
        if os.path.isdir(full_file_name):
            next_level_files = get_recursive_file_list(full_file_name)
            all_files.extend(next_level_files)
    return all_files
Usage example:
test@a_fly_bird /home/test/examples/python % ll
-rwxrwxrwx 1 test users  501 2月 26 20:04 file_util.py*
-rwxrwxrwx 1 test users  109 2月 26 20:04 test.py*
test@a_fly_bird /home/test/examples/python % mkdir aaa
test@a_fly_bird /home/test/examples/python % echo "first" > ./aaa/first.txt
test@a_fly_bird /home/test/examples/python % echo "second" > ./aaa/second.txt
test@a_fly_bird /home/test/examples/python % cat ./aaa/first.txt
first
test@a_fly_bird /home/test/examples/python % python
Python 3.2.3 (default, Sep ...) [GCC 4.7.1 (prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import file_util
>>> files = file_util.get_recursive_file_list(".")
>>> files
['./aaa', './aaa/second.txt', './aaa/first.txt', './file_util.py', './test.py', './__pycache__', './__pycache__/file_util.cpython-32.pyc']
>>> exit()
test@a_fly_bird /home/test/examples/python % ll
drwxr-xr-x 2 test users 4096 2月 26 20:06 aaa/
-rwxrwxrwx 1 test users  501 2月 26 20:04 file_util.py*
drwxr-xr-x 2 test users 4096 2月 26 20:06 __pycache__/
-rwxrwxrwx 1 test users  109 2月 26 20:04 test.py*
test@a_fly_bird /home/test/examples/python %
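The standard library's os.walk can produce the same recursive listing without writing the recursion by hand; a small sketch (the function name get_recursive_file_list_walk is an illustrative alternative, not part of file_util.py):

import os

# Sketch: the same recursive listing built on os.walk (illustrative alternative).
def get_recursive_file_list_walk(path):
    all_files = []
    for root, dirs, files in os.walk(path):
        # os.walk yields each directory under path together with its sub-directories and files.
        for name in dirs + files:
            all_files.append(os.path.join(root, name))
    return all_files

The entries come out in a slightly different order than the hand-rolled version, but the set of paths is the same.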
Using Python to scrape all the data from the CSDN blog home page, and to keep fetching newly published posts into MongoDB on a schedule
Original article: /blogpost_detailDashboard.action?id=94
The script relies on pymongo, jieba, and HTMLParser.
# -*- coding: utf-8 -*-
'''
@author: jiangfuqiang
'''
from HTMLParser import HTMLParser
import re
import time
from datetime import date
import pymongo
import urllib2
import sys
import traceback
import jieba

default_encoding = 'utf-8'
if sys.getdefaultencoding() != default_encoding:
    reload(sys)
    sys.setdefaultencoding(default_encoding)

# Flag flipped once we reach a post that was already stored on a previous run.
isExist = False
class FetchCnblog(HTMLParser):
    def __init__(self, id):
        HTMLParser.__init__(self)
        self.result = []
        self.data = {}
        self.isTitleLink = False
        self.id = id
        self.isSummary = False
        self.isPostItem = False
        self.isArticleView = False

    def handle_data(self, data):
        if self.isTitleLink and self.isPostItem:
            self.data['title'] = data
            self.isTitleLink = False
        elif self.isSummary and self.isPostItem:
            data = data.strip()
            self.data['desc'] = data

    def handle_starttag(self, tag, attrs):
        global isExist
        if tag == 'a':
            for key, value in attrs:
                if key == 'class':
                    if value == 'titlelnk':
                        self.isTitleLink = True
                    elif value == 'gray' and self.isArticleView:
                        self.isArticleView = False
                        for key, value in attrs:
                            if key == 'href':
                                self.data['readmoreLink'] = value
                                # The post id is the number embedded in the "read more" link.
                                reg = r'\d+'
                                result = re.search(reg, value)
                                self.isPostItem = False
                                if result:
                                    self.data['id'] = int(result.group())
                                else:
                                    self.data = {}
                                    continue
                                if self.data['id'] <= self.id:
                                    # Reached a post that was stored on a previous run: stop.
                                    self.data = {}
                                    isExist = True
                                else:
                                    self.data['source'] = ""
                                    self.data['source_key'] = 'cnblogs'
                                    self.data['fetchTime'] = str(date.today())
                                    self.data['keyword'] = ",".join(jieba.cut(self.data['title']))
                                    self.result.append(self.data)
                                    self.data = {}
        elif tag == 'p':
            for key, value in attrs:
                if key == 'class' and value == 'post_item_summary':
                    self.isSummary = True
        elif tag == 'img':
            for key, value in attrs:
                if key == 'class' and value == 'pfs':
                    for key, value in attrs:
                        if key == 'src':
                            self.data['imgSrc'] = value
        elif tag == 'div':
            for key, value in attrs:
                if key == 'class' and value == 'post_item_foot':
                    self.isSummary = False
                elif key == 'class' and value == 'post_item':
                    self.isPostItem = True
        elif tag == 'span':
            for key, value in attrs:
                if key == 'class' and value == 'article_view':
                    self.isArticleView = True

    def getResult(self):
        return self.result
if __name__ == "__main__":
    con = pymongo.Connection('localhost', 27017)
    db = con.blog
    fetchblog = db.fetch_blog
    record = db.record
    url = "/#p%d"   # page URL template; the host part is missing in the original text
    count = 1
    flag = False
    headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/ Firefox/3.5.6'}
    reco = record.find_one({"type": 'cnblogs'})
    id = reco['maxId']
    while isExist == False:
        try:
            req = urllib2.Request(url % count, headers=headers)
            request = urllib2.urlopen(req)
            data = request.read()
            fj = FetchCnblog(id)
            fj.feed(data)
            result = fj.getResult()
            if len(result) < 1:
                isExist = True
            else:
                if flag == False:
                    # Remember the newest post id from the first page so the next run
                    # can stop once it reaches posts that were already stored.
                    flag = True
                    dic = result[0]
                    id = int(dic['id'])
                    record.update({"type": 'cnblogs'}, {"$set": {'maxId': id}}, True, False)
                result.reverse()
                for doc in result:
                    fetchblog.insert(doc)
                print "page is %d" % count
                count += 1
            time.sleep(5)
        except Exception, e:
            traceback.print_exc()
            print "parse error", e
If the program runs on Linux or Mac, you can schedule it as a cron job with crontab -e; on Windows, just add a timer inside the program itself.
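For the Windows case, a minimal sketch of such an in-program timer (the function name run_forever, the fetch_once callback, and the 30-minute interval are illustrative, not from the original post):

import time

# Minimal in-program timer sketch (illustrative): call fetch_once() every 30 minutes.
def run_forever(fetch_once, interval_seconds=30 * 60):
    while True:
        try:
            fetch_once()            # e.g. the fetch logic from the __main__ block above
        except Exception, e:        # Python 2 syntax, matching the rest of the post
            print "fetch failed:", e
        time.sleep(interval_seconds)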
# -*- coding: utf-8 -*-
import re
import urllib2
import urllib
import time
import MySQLdb
import datetime
# from datetime import date

# ----------- App Store rankings spider -----------
class Spider_Model:

    def __init__(self):
        self.page = 1
        self.pages = []
        self.enable = False
    def startWork(self, url, tabName):
        nowtime = int(time.time())
        content = self.GetCon(url)
        oneItems = self.Match(content)          # extract the first-level fields from the chart page
        time.sleep(1)
        for index, item in enumerate(oneItems):
            content_two = self.GetCon(item[1])  # follow each app's detail page
            twoItems = self.Match_two(content_two)
            oneItems[index].append([twoItems[0], twoItems[1]])
            if oneItems[index][6][0] == '0':
                fabutime = '0'
            else:
                fabutime = int(time.mktime(time.strptime(oneItems[index][6][0].strip(), '%Y年%m月%d日')))
            sql = "INSERT INTO " + tabName + "(`rank`,`detailurl`,`logo`,`name`,`type`,`appid`,`appstoretime`,`compatible`,`ctime`) values(%s,%s,%s,%s,%s,%s,%s,%s,%s)" % ('"'+oneItems[index][0]+'"', '"'+oneItems[index][1]+'"', '"'+oneItems[index][2]+'"', '"'+oneItems[index][3]+'"', '"'+oneItems[index][4]+'"', '"'+oneItems[index][5]+'"', fabutime, '"'+oneItems[index][6][1]+'"', nowtime)
            self.contentDb(sql)
            time.sleep(1)
    def GetCon(self, url):
        myUrl = url
        # Some sites block bare crawlers; sending browser-like headers works around this.
        headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
                   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'}
        try:
            req = urllib2.Request(myUrl, headers=headers)
            global myResponse
            myResponse = urllib2.urlopen(req)
        except urllib2.HTTPError, e:
            # The exception handling is required: without it the request can still come back
            # as 403 even though we mimic a browser. The reason is unclear......
            print e.fp.read()
        myPage = myResponse.read()
        # encode() turns a unicode object into a byte string in the given encoding;
        # decode() turns a byte string in some encoding back into unicode.
        # unicodePage = myPage.decode('utf-8').encode('gbk','ignore')
        # unicodePage = myPage.decode('utf-8','ignore')
        return myPage
    def Match(self, con):
        # Grab the <section class="section apps grid"> block that holds the chart,
        # then pull each <li> entry out of it.
        # re.S lets '.' also match newlines.
        pattenA = re.compile(r'<section class="section apps grid">(.*?)</section>', re.U | re.S)
        pattenB = re.compile(r'<li><strong>(.*?).</strong><a href="(.*?)".*?><img src="(.*?)".*?></a><h3><a.*?>(.*?)</a></h3><h4><a.*?>(.*?)</a></h4><a.*?>.*?</a></li>', re.U | re.S)
        match = re.findall(pattenA, con)
        myItems = re.findall(pattenB, match[0])
        items = []
        for item in myItems:
            # rank, detail URL, logo, name (before the '-'), type, app id (parsed out of the URL)
            items.append([item[0].replace("\n", ""),
                          item[1].replace("\n", ""),
                          item[2].replace("\n", ""),
                          (item[3].replace("\n", "")).split('-')[0],
                          item[4].replace("\n", ""),
                          (item[1].split('id')[1]).split('?')[0]])
        return items
    def Match_two(self, con):
        # Release date and OS requirement from the app detail page.
        pattenTwoA = re.compile(r'<li.*?class="release-date"><span.*?>.*?</span>(.*?)</li>', re.U | re.S)
        pattenTwoB = re.compile(r'<span.*?class="app-requirements">.*?</span>(.*?)</p>', re.U | re.S)
        matchTwoA = self.is_empty(re.findall(pattenTwoA, con))
        matchTwoB = self.is_empty(re.findall(pattenTwoB, con))
        itemsTwo = [matchTwoA, matchTwoB]
        return itemsTwo

    def is_empty(self, param):
        # Return the first match, or '0' when nothing matched.
        res = '0'
        if len(param):
            res = param[0]
        return res
    def contentDb(self, sql):
        try:
            conn = MySQLdb.connect(host="<host>", user="<user>", passwd="<password>", db="<database>", charset='utf8')
            cur = conn.cursor()
            result = cur.execute(sql)
            conn.commit()
        except MySQLdb.Error, e:
            print "Mysql Error %d: %s" % (e.args[0], e.args[1])
addArr = [["/jp/itunes/charts/free-apps/",'cg_jp_free'],
["/jp/itunes/charts/paid-apps/",'cg_jp_paid']]
myModel = Spider_Model()
for val in addArr:
myModel.startWork(val[0],val[1])
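As a side note, building the INSERT by string concatenation as above is fragile (quoting, injection); a hedged sketch of the same insert using MySQLdb's parameter binding (table and column names as in the post, connection values illustrative):

import MySQLdb

# Sketch: the same insert with parameter binding instead of string concatenation (illustrative).
def insert_app(tabName, row, nowtime):
    # row = [rank, detailurl, logo, name, type, appid, [appstoretime, compatible]]
    conn = MySQLdb.connect(host="<host>", user="<user>", passwd="<password>", db="<database>", charset='utf8')
    try:
        cur = conn.cursor()
        sql = ("INSERT INTO " + tabName +
               " (`rank`,`detailurl`,`logo`,`name`,`type`,`appid`,`appstoretime`,`compatible`,`ctime`)"
               " VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s)")
        cur.execute(sql, (row[0], row[1], row[2], row[3], row[4], row[5], row[6][0], row[6][1], nowtime))
        conn.commit()
    finally:
        conn.close()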
This was my first encounter with Python, so the code is admittedly rough — entirely my own doing...
Python version: 2.7.5; test environments: Linux and Windows.
Feedback and corrections from more experienced people are very welcome — take me flying with you!
Extracting a value from a string in Python — how do I grab it from this string? (百度知道)
The string is callback({"code":"14PPXJ","state":"xxsss"}); and I want to grab the "14PPXJ" part. The code value is different on every visit, so I need to extract it each time.
I tried a regular expression:
import re
te = 'callback({"code":"14PPXJ","state":"xxsss"});'
pattern = re.compile('callback({"code":"(.*?)","state":"xxsss"})')
r = pattern.search(te)
but it doesn't work. Could someone point me in the right direction? I'm a beginner.
Accepted answer
You don't need a regular expression here — the two keys "code" and "state" are fixed, so you can simply write:
te.split('"code":"')[1].split('","state":')[0]
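A regular-expression version also works once the literal parentheses and braces are escaped; a small sketch (variable names follow the question):

import re

te = 'callback({"code":"14PPXJ","state":"xxsss"});'
# Escape the literal parentheses/braces and capture the value of "code".
pattern = re.compile(r'callback\(\{"code":"(.*?)","state":"xxsss"\}\)')
m = pattern.search(te)
if m:
    print m.group(1)   # -> 14PPXJ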