Can Python scrape the content shown inside a mobile app?

Two ways to fetch an image from a web page and save it in Python

Method 1: read the image data with urllib.urlopen() and write the bytes out yourself.

import urllib
import urllib2

def getImage(addr):
    u = urllib.urlopen(addr)
    data = u.read()
    splitPath = addr.split('/')
    fName = splitPath.pop()
    print fName
    f = open(fName, 'wb')
    f.write(data)
    f.close()

Here addr is the image URL.

Method 2: let urllib.urlretrieve() download and save in a single call.

def getImage2(addr):
    try:
        splitPath = addr.split('/')
        fName = splitPath.pop()
        print fName
        urllib.urlretrieve(addr, fName)
    except Exception, e:
        print "[Error]Can't download %s: %s" % (fName, e)

urllib.urlretrieve(addr, fName) fetches and saves in one step; fName is the name of the saved file, and it may of course include a path.

A variant of getImage2 uses urllib2.urlopen, which is really the same as the first method:

def getImage2(addr):
    try:
        splitPath = addr.split('/')
        fName = splitPath.pop()
        print fName
        open(fName, "wb").write(urllib2.urlopen(addr).read())
    except Exception, e:
        print "[Error]Can't download %s: %s" % (fName, e)

(In Python, ''' … ''' can also be used for comment blocks; note that they must still follow the indentation rules.)
How to get all files and directories under a directory in Python
Python's os.listdir() returns all the files and directories in a single directory, but it does not recurse. When you want the full list of files under a directory, recursively, you can call the get_recursive_file_list() function below.
File name: file_util.py
#! /usr/bin/python
'''Utilities for files & directories.'''

import os

# Get all the files & directories under the specified directory (path), recursively.
def get_recursive_file_list(path):
    current_files = os.listdir(path)
    all_files = []
    for file_name in current_files:
        full_file_name = os.path.join(path, file_name)
        all_files.append(full_file_name)
        # Recurse into sub-directories.
        if os.path.isdir(full_file_name):
            next_level_files = get_recursive_file_list(full_file_name)
            all_files.extend(next_level_files)
    return all_files
Usage example:
test@a_fly_bird /home/test/examples/python % ll
-rwxrwxrwx 1 test users  501 2月 26 20:04 file_util.py*
-rwxrwxrwx 1 test users  109 2月 26 20:04 test.py*
test@a_fly_bird /home/test/examples/python % mkdir aaa
test@a_fly_bird /home/test/examples/python % echo "first" > ./aaa/first.txt
test@a_fly_bird /home/test/examples/python % echo "second" > ./aaa/second.txt
test@a_fly_bird /home/test/examples/python % cat ./aaa/first.txt
first
test@a_fly_bird /home/test/examples/python % python
Python 3.2.3 (default, Sep ...) [GCC 4.7.1 (prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import file_util
>>> files = file_util.get_recursive_file_list(".")
>>> files
['./aaa', './aaa/second.txt', './aaa/first.txt', './file_util.py', './test.py', './__pycache__', './__pycache__/file_util.cpython-32.pyc']
>>> exit()
test@a_fly_bird /home/test/examples/python % ll
drwxr-xr-x 2 test users 4096 2月 26 20:06 aaa/
-rwxrwxrwx 1 test users  501 2月 26 20:04 file_util.py*
drwxr-xr-x 2 test users 4096 2月 26 20:06 __pycache__/
-rwxrwxrwx 1 test users  109 2月 26 20:04 test.py*
test@a_fly_bird /home/test/examples/python %
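The standard library's os.walk can produce the same recursive listing without writing the recursion by hand; a small sketch (the function name get_recursive_file_list_walk is an illustrative alternative, not part of file_util.py):

import os

# Sketch: the same recursive listing built on os.walk (illustrative alternative).
def get_recursive_file_list_walk(path):
    all_files = []
    for root, dirs, files in os.walk(path):
        # os.walk yields each directory under path together with its sub-directories and files.
        for name in dirs + files:
            all_files.append(os.path.join(root, name))
    return all_files

The entries come out in a slightly different order than the hand-rolled version, but the set of paths is the same.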
Using Python to scrape all the data from the CSDN blog home page, and to keep fetching newly published posts into MongoDB on a schedule
Original article: /blogpost_detailDashboard.action?id=94
The script relies on pymongo, jieba, and HTMLParser.
# -*- coding: utf-8 -*-
'''
@author: jiangfuqiang
'''
from HTMLParser import HTMLParser
import re
import time
from datetime import date
import pymongo
import urllib2
import sys
import traceback
import jieba

default_encoding = 'utf-8'
if sys.getdefaultencoding() != default_encoding:
    reload(sys)
    sys.setdefaultencoding(default_encoding)

# Flag flipped once we reach a post that was already stored on a previous run.
isExist = False
class FetchCnblog(HTMLParser):
    def __init__(self, id):
        HTMLParser.__init__(self)
        self.result = []
        self.data = {}
        self.isTitleLink = False
        self.id = id
        self.isSummary = False
        self.isPostItem = False
        self.isArticleView = False

    def handle_data(self, data):
        if self.isTitleLink and self.isPostItem:
            self.data['title'] = data
            self.isTitleLink = False
        elif self.isSummary and self.isPostItem:
            data = data.strip()
            self.data['desc'] = data

    def handle_starttag(self, tag, attrs):
        global isExist
        if tag == 'a':
            for key, value in attrs:
                if key == 'class':
                    if value == 'titlelnk':
                        self.isTitleLink = True
                    elif value == 'gray' and self.isArticleView:
                        self.isArticleView = False
                        for key, value in attrs:
                            if key == 'href':
                                self.data['readmoreLink'] = value
                                # The post id is the number embedded in the "read more" link.
                                reg = r'\d+'
                                result = re.search(reg, value)
                                self.isPostItem = False
                                if result:
                                    self.data['id'] = int(result.group())
                                else:
                                    self.data = {}
                                    continue
                                if self.data['id'] <= self.id:
                                    # Reached a post that was stored on a previous run: stop.
                                    self.data = {}
                                    isExist = True
                                else:
                                    self.data['source'] = ""
                                    self.data['source_key'] = 'cnblogs'
                                    self.data['fetchTime'] = str(date.today())
                                    self.data['keyword'] = ",".join(jieba.cut(self.data['title']))
                                    self.result.append(self.data)
                                    self.data = {}
        elif tag == 'p':
            for key, value in attrs:
                if key == 'class' and value == 'post_item_summary':
                    self.isSummary = True
        elif tag == 'img':
            for key, value in attrs:
                if key == 'class' and value == 'pfs':
                    for key, value in attrs:
                        if key == 'src':
                            self.data['imgSrc'] = value
        elif tag == 'div':
            for key, value in attrs:
                if key == 'class' and value == 'post_item_foot':
                    self.isSummary = False
                elif key == 'class' and value == 'post_item':
                    self.isPostItem = True
        elif tag == 'span':
            for key, value in attrs:
                if key == 'class' and value == 'article_view':
                    self.isArticleView = True

    def getResult(self):
        return self.result
if __name__ == "__main__":
    con = pymongo.Connection('localhost', 27017)
    db = con.blog
    fetchblog = db.fetch_blog
    record = db.record
    url = "/#p%d"   # page URL template; the host part is missing in the original text
    count = 1
    flag = False
    headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/ Firefox/3.5.6'}
    reco = record.find_one({"type": 'cnblogs'})
    id = reco['maxId']
    while isExist == False:
        try:
            req = urllib2.Request(url % count, headers=headers)
            request = urllib2.urlopen(req)
            data = request.read()
            fj = FetchCnblog(id)
            fj.feed(data)
            result = fj.getResult()
            if len(result) < 1:
                isExist = True
            else:
                if flag == False:
                    # Remember the newest post id from the first page so the next run
                    # can stop once it reaches posts that were already stored.
                    flag = True
                    dic = result[0]
                    id = int(dic['id'])
                    record.update({"type": 'cnblogs'}, {"$set": {'maxId': id}}, True, False)
                result.reverse()
                for doc in result:
                    fetchblog.insert(doc)
                print "page is %d" % count
                count += 1
            time.sleep(5)
        except Exception, e:
            traceback.print_exc()
            print "parse error", e
If the program runs on Linux or Mac, you can schedule it as a cron job with crontab -e; on Windows, just add a timer inside the program itself.
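For the Windows case, a minimal sketch of such an in-program timer (the function name run_forever, the fetch_once callback, and the 30-minute interval are illustrative, not from the original post):

import time

# Minimal in-program timer sketch (illustrative): call fetch_once() every 30 minutes.
def run_forever(fetch_once, interval_seconds=30 * 60):
    while True:
        try:
            fetch_once()            # e.g. the fetch logic from the __main__ block above
        except Exception, e:        # Python 2 syntax, matching the rest of the post
            print "fetch failed:", e
        time.sleep(interval_seconds)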
# -*- coding: utf-8 -*-
import re
import urllib2
import urllib
import time
import MySQLdb
import datetime
# from datetime import date

# ----------- App Store rankings spider -----------
class Spider_Model:

    def __init__(self):
        self.page = 1
        self.pages = []
        self.enable = False
    def startWork(self, url, tabName):
        nowtime = int(time.time())
        content = self.GetCon(url)
        oneItems = self.Match(content)          # extract the first-level fields from the chart page
        time.sleep(1)
        for index, item in enumerate(oneItems):
            content_two = self.GetCon(item[1])  # follow each app's detail page
            twoItems = self.Match_two(content_two)
            oneItems[index].append([twoItems[0], twoItems[1]])
            if oneItems[index][6][0] == '0':
                fabutime = '0'
            else:
                fabutime = int(time.mktime(time.strptime(oneItems[index][6][0].strip(), '%Y年%m月%d日')))
            sql = "INSERT INTO " + tabName + "(`rank`,`detailurl`,`logo`,`name`,`type`,`appid`,`appstoretime`,`compatible`,`ctime`) values(%s,%s,%s,%s,%s,%s,%s,%s,%s)" % ('"'+oneItems[index][0]+'"', '"'+oneItems[index][1]+'"', '"'+oneItems[index][2]+'"', '"'+oneItems[index][3]+'"', '"'+oneItems[index][4]+'"', '"'+oneItems[index][5]+'"', fabutime, '"'+oneItems[index][6][1]+'"', nowtime)
            self.contentDb(sql)
            time.sleep(1)
    def GetCon(self, url):
        myUrl = url
        # Some sites block bare crawlers; sending browser-like headers works around this.
        headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
                   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'}
        try:
            req = urllib2.Request(myUrl, headers=headers)
            global myResponse
            myResponse = urllib2.urlopen(req)
        except urllib2.HTTPError, e:
            # The exception handling is required: without it the request can still come back
            # as 403 even though we mimic a browser. The reason is unclear......
            print e.fp.read()
        myPage = myResponse.read()
        # encode() turns a unicode object into a byte string in the given encoding;
        # decode() turns a byte string in some encoding back into unicode.
        # unicodePage = myPage.decode('utf-8').encode('gbk','ignore')
        # unicodePage = myPage.decode('utf-8','ignore')
        return myPage
    def Match(self, con):
        # Grab the <section class="section apps grid"> block that holds the chart,
        # then pull each <li> entry out of it.
        # re.S lets '.' also match newlines.
        pattenA = re.compile(r'<section class="section apps grid">(.*?)</section>', re.U | re.S)
        pattenB = re.compile(r'<li><strong>(.*?).</strong><a href="(.*?)".*?><img src="(.*?)".*?></a><h3><a.*?>(.*?)</a></h3><h4><a.*?>(.*?)</a></h4><a.*?>.*?</a></li>', re.U | re.S)
        match = re.findall(pattenA, con)
        myItems = re.findall(pattenB, match[0])
        items = []
        for item in myItems:
            # rank, detail URL, logo, name (before the '-'), type, app id (parsed out of the URL)
            items.append([item[0].replace("\n", ""),
                          item[1].replace("\n", ""),
                          item[2].replace("\n", ""),
                          (item[3].replace("\n", "")).split('-')[0],
                          item[4].replace("\n", ""),
                          (item[1].split('id')[1]).split('?')[0]])
        return items
    def Match_two(self, con):
        # Release date and OS requirement from the app detail page.
        pattenTwoA = re.compile(r'<li.*?class="release-date"><span.*?>.*?</span>(.*?)</li>', re.U | re.S)
        pattenTwoB = re.compile(r'<span.*?class="app-requirements">.*?</span>(.*?)</p>', re.U | re.S)
        matchTwoA = self.is_empty(re.findall(pattenTwoA, con))
        matchTwoB = self.is_empty(re.findall(pattenTwoB, con))
        itemsTwo = [matchTwoA, matchTwoB]
        return itemsTwo

    def is_empty(self, param):
        # Return the first match, or '0' when nothing matched.
        res = '0'
        if len(param):
            res = param[0]
        return res
    def contentDb(self, sql):
        try:
            conn = MySQLdb.connect(host="<host>", user="<user>", passwd="<password>", db="<database>", charset='utf8')
            cur = conn.cursor()
            result = cur.execute(sql)
            conn.commit()
        except MySQLdb.Error, e:
            print "Mysql Error %d: %s" % (e.args[0], e.args[1])
addArr = [["/jp/itunes/charts/free-apps/",'cg_jp_free'],
["/jp/itunes/charts/paid-apps/",'cg_jp_paid']]
myModel = Spider_Model()
for val in addArr:
myModel.startWork(val[0],val[1])
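As a side note, building the INSERT by string concatenation as above is fragile (quoting, injection); a hedged sketch of the same insert using MySQLdb's parameter binding (table and column names as in the post, connection values illustrative):

import MySQLdb

# Sketch: the same insert with parameter binding instead of string concatenation (illustrative).
def insert_app(tabName, row, nowtime):
    # row = [rank, detailurl, logo, name, type, appid, [appstoretime, compatible]]
    conn = MySQLdb.connect(host="<host>", user="<user>", passwd="<password>", db="<database>", charset='utf8')
    try:
        cur = conn.cursor()
        sql = ("INSERT INTO " + tabName +
               " (`rank`,`detailurl`,`logo`,`name`,`type`,`appid`,`appstoretime`,`compatible`,`ctime`)"
               " VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s)")
        cur.execute(sql, (row[0], row[1], row[2], row[3], row[4], row[5], row[6][0], row[6][1], nowtime))
        conn.commit()
    finally:
        conn.close()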
This was my first encounter with Python, so the code is admittedly rough — entirely my own doing...
Python version: 2.7.5; test environments: Linux and Windows.
Feedback and corrections from more experienced people are very welcome — take me flying with you!
Extracting a value from a string in Python — how do I grab it from this string? (百度知道)
The string is callback({"code":"14PPXJ","state":"xxsss"}); and I want to grab the "14PPXJ" part. The code value is different on every visit, so I need to extract it each time.
I tried a regular expression:
import re
te = 'callback({"code":"14PPXJ","state":"xxsss"});'
pattern = re.compile('callback({"code":"(.*?)","state":"xxsss"})')
r = pattern.search(te)
but it doesn't work. Could someone point me in the right direction? I'm a beginner.
Accepted answer
You don't need a regular expression here — the two keys "code" and "state" are fixed, so you can simply write:
te.split('"code":"')[1].split('","state":')[0]
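A regular-expression version also works once the literal parentheses and braces are escaped; a small sketch (variable names follow the question):

import re

te = 'callback({"code":"14PPXJ","state":"xxsss"});'
# Escape the literal parentheses/braces and capture the value of "code".
pattern = re.compile(r'callback\(\{"code":"(.*?)","state":"xxsss"\}\)')
m = pattern.search(te)
if m:
    print m.group(1)   # -> 14PPXJ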