python pandas IO tools: read_csv parameters in detail
Read a CSV file with pd.read_csv(); write one back out with DataFrame.to_csv().
pandas can also read the following formats:
read_excel,
read_json,
read_msgpack (experimental),
read_html,
read_gbq (experimental),
read_stata,
read_clipboard,
with the corresponding writers:
to_msgpack (experimental),
to_gbq (experimental),
to_clipboard,
to_pickle.
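A minimal round trip with the two CSV calls above (the file name here is a placeholder, not from the original post):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.to_csv('example.csv', index=False)  # write, omitting the row index
df2 = pd.read_csv('example.csv')       # read it back as a DataFrame
print(df2)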
Reading a CSV file with the common parameters
import pandas as pd
obj = pd.read_csv('f:/ceshi.csv')
print(type(obj))
# <class 'pandas.core.frame.DataFrame'>
print(obj.dtypes)
# Unnamed: 0    ...
# dtype: object
ceshi.csv is data with a column index (header row) but no designated row index; read_csv adds an integer row index automatically, and it does so even when the original dataset carries its own row labels (which then appear as an ordinary column, here Unnamed: 0).
read_csv returns a DataFrame; obj.dtypes shows the data type of each column.
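The post never shows ceshi.csv itself; judging from the Unnamed: 0 column and the usecols examples below, a hypothetical stand-in with a leading unnamed label column plus three data columns is assumed for the rest of this section:

,a,b,c
0,10,20,30
1,40,50,60
2,70,80,90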
obj_2 = pd.read_csv('f:/ceshi.csv', header=None, names=range(2, 5))
print(obj_2)
header=None declares that the raw file has no header row, so read_csv generates a column index itself, unless, as here, you supply the column names via names.
obj_2 = pd.read_csv('f:/ceshi.csv', header=0, names=range(2, 5))
print(obj_2)
header=0 says that row 0 of the file (the first row; Python indexing starts at 0) is the column index, and passing names then replaces those original column names.
obj_2 = pd.read_csv('f:/ceshi.csv', index_col=0)
print(obj_2)
obj_2 = pd.read_csv('f:/ceshi.csv', index_col=[0, 2])
print(obj_2)
index_col specifies which column of the data becomes the DataFrame's row index; you can also pass several columns to build a hierarchical (MultiIndex) row index. The default is None, i.e. no column is used and an integer row index counting from 0 is generated automatically.
obj_2 = pd.read_csv('f:/ceshi.csv', index_col=0, usecols=[0, 1, 2, 3])
print(obj_2)
obj_2 = pd.read_csv('f:/ceshi.csv', index_col=0, usecols=[1, 2, 3])
print(obj_2)
usecols selects which columns of the original file to load. This file has 4 columns, so usecols=[0, 1, 2, 3] selects all of them and index_col=0 then makes the first one the row index. With usecols=[1, 2, 3], loading starts from the second column, and index_col=0 now refers to the first of the selected columns, i.e. the second column of the original file becomes the row index.
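usecols also accepts column labels rather than positions, which avoids this positional bookkeeping (a sketch using the hypothetical column names assumed above):

obj_2 = pd.read_csv('f:/ceshi.csv', index_col='a', usecols=['a', 'b', 'c'])
print(obj_2)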
obj_2 = pd.read_csv('f:/ceshi.csv', index_col=0, nrows=3)
print(obj_2)
nrows limits how many rows are read from the original dataset; it can only take rows counting from the first one down to row nrows.
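To start reading somewhere other than the top, nrows can be combined with skiprows (an addition for illustration, not from the original post):

# skip data rows 1 and 2 but keep the header row, then read the next 3 rows
obj_2 = pd.read_csv('f:/ceshi.csv', index_col=0, skiprows=range(1, 3), nrows=3)
print(obj_2)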
obj_3 = pd.read_csv('f:/ceshi_date.csv', index_col=0)
print(obj_3)
print(type(obj_3.index))
# <class 'pandas.indexes.numeric.Int64Index'>
obj_3 = pd.read_csv('f:/ceshi_date.csv', index_col=0, parse_dates=True)
print(obj_3)
print(type(obj_3.index))
# <class 'pandas.tseries.index.DatetimeIndex'>
parse_dates=True makes read_csv parse date strings into datetime format.
data = 'date,value,cat\n1/6/2000,5,a\n2/6/2000,10,b\n3/6/2000,15,c'
print(data)
date,value,cat
1/6/2000,5,a
2/6/2000,10,b
3/6/2000,15,c
from io import StringIO  # Python 2: from StringIO import StringIO
print(pd.read_csv(StringIO(data), parse_dates=[0], index_col=0))
print(pd.read_csv(StringIO(data), parse_dates=[0], index_col=0, dayfirst=True))
The common US date format is MM/DD/YYYY; dayfirst=True switches parsing to DD/MM/YYYY.
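Concretely, the two calls above parse the same strings into different dates (sketched from the two format rules):

# parse_dates=[0]                 -> 2000-01-06, 2000-02-06, 2000-03-06  (month first)
# parse_dates=[0], dayfirst=True  -> 2000-06-01, 2000-06-02, 2000-06-03  (day first)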
Delimiters and thousands separators
tem = 'id|level|category\npatient1|123,000|x\npatient2|23,000|y\npatient3|1,234,018|z'
print(tem)
id|level|category
patient1|123,000|x
patient2|23,000|y
patient3|1,234,018|z
print(pd.read_csv(StringIO(tem), sep='|'))
#          id      level category
# 0  patient1    123,000        x
# 1  patient2     23,000        y
# 2  patient3  1,234,018        z
print(pd.read_csv(StringIO(tem), sep='|', thousands=','))
#          id    level category
# 0  patient1   123000        x
# 1  patient2    23000        y
# 2  patient3  1234018        z
sep='|' sets the field delimiter; thousands=',' strips the thousands separators, so level is parsed as integers rather than strings.
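The difference is easiest to see in the dtype of the level column (a quick check on the same data):

df_raw = pd.read_csv(StringIO(tem), sep='|')
df_num = pd.read_csv(StringIO(tem), sep='|', thousands=',')
print(df_raw['level'].dtype)  # object -- the commas keep the values as strings
print(df_num['level'].dtype)  # int64  -- thousands=',' strips the commas so they parse as integers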
pandas: garbled characters (乱码) when reading and writing CSV files with Chinese text
When a CSV file containing Chinese text is read with pandas, the characters may come out garbled. Passing an encoding argument avoids this, e.g. pd.read_csv('ee.csv', encoding='gbk'). Remember to pass encoding on export as well, otherwise the exported file is garbled when opened in Excel (it opens fine in EditPlus), e.g. df.to_csv('sel.csv', index=False, encoding='gbk'). For background on encodings, see: /blog/2007/10/ascii_unicode_and_utf-8.html
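Put together as a round trip (the file names are the ones from the post):

import pandas as pd

df = pd.read_csv('ee.csv', encoding='gbk')         # decode the GBK-encoded Chinese text correctly
df.to_csv('sel.csv', index=False, encoding='gbk')  # write it back in the same encoding so Excel reads it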
From: lokta (部落), Board: Python
Subject: Re: For data analysis with pandas, is it better to keep the data in CSV-type files, or store it in a database?
Posted: 水木社区 (Sat Jul 23 12:07:47 2016)

Go with CSV; with large amounts of data, pandas is very slow at writing xlsx.
MongoDB also works.

> chinalupin wrote:
> Or put another way: in what situations is it worth setting up a dedicated SQL database for the data?
From: lokta (部落), Board: Python
Subject: Re: For data analysis with pandas, is it better to keep the data in CSV-type files, or store it in a database?
Posted: 水木社区 (Sat Jul 23 22:39:41 2016)

Never used SQL; day to day I just use those two. Quick and rough, but it gets the job done.

> chinalupin wrote:
> What about SQL? Once you have CSV, do you still need SQL?
From: lokta (部落), Board: Python
Subject: Re: For data analysis with pandas, is it better to keep the data in CSV-type files, or store it in a database?
Posted: 水木社区 (Mon Jul 25 16:09:34 2016)

If it's just storage, why wouldn't it hold up? It's all sequential reads and writes.

> chinalupin wrote:
> Can a CSV still hold up if it has millions of records?
From: lokta (部落), Board: Python
Subject: Re: For data analysis with pandas, is it better to keep the data in CSV-type files, or store it in a database?
Posted: 水木社区 (Wed Jul 27 13:35:26 2016)

For handling UTF-8 CSV there is no ready-made lib, and I'm too lazy to write my own, so I use pandas.
Cut corners where you can.

> seablue wrote:
> CSV is fine. But Python's built-in csv module is pretty underwhelming; I find it more satisfying to read a line and work on the list that str.split(',') gives you.
python - Pandas memory error when reading CSV? - Stack Overflow
I've been trying to process some CSV data with Pandas, but I keep running into memory problems. I have a CSV file that is about 1.4 GB, and I have tried different things to make Pandas read_csv work, to no avail.
It didn't work when I used the iterator=True and chunksize=number parameters. Moreover, the smaller the chunksize, the longer it takes to process the same amount of data. Simple per-chunk overhead doesn't explain it, because it was far slower when the number of chunks was large. I suspect that when processing each chunk, pandas needs to go through all the chunks before it to "get to it", instead of jumping right to the start of the chunk. That seems the only way this can be explained.
Then, as a last resort, I split the CSV file into 6 parts and tried to read them one by one. But the memory error still persists. I monitored Python's memory usage while running the code below, and found that each time Python finishes processing a file and moves on to the next, the memory usage goes straight up. It seems quite obvious that pandas doesn't release the memory for the previous file once it has finished processing it.
The code may not make sense, but that's because I removed the part where it writes into an SQL database, to simplify it and isolate the problem.
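(For reference, the chunked-reading pattern described above looks like this; a generic sketch with a placeholder file name, not the asker's actual code:)

import pandas as pd

# read the large file in fixed-size pieces instead of all at once
for chunk in pd.read_csv('big_file.csv', chunksize=100000):
    subset = chunk.iloc[:, [5, 8, 15, 16]].dropna(how='any')
    # ... process or store `subset` here ...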
import glob
import pandas as pd

filenameStem = 'Crimes'
counter = 0
for filename in glob.glob(filenameStem + '_part*.csv'):  # reading files Crimes_part1.csv through Crimes_part6.csv
    chunk = pd.read_csv(filename)
    df = chunk.iloc[:, [5, 8, 15, 16]]
    df = df.dropna(how='any')
    counter += 1
    print(counter)
You may try to parse only those columns that you need (as @BrenBarn said in the comments):

import glob
import pandas as pd

def get_merged_csv(flist, **kwargs):
    return pd.concat([pd.read_csv(f, **kwargs) for f in flist], ignore_index=True)

fmask = 'Crimes_part*.csv'
cols = [5, 8, 15, 16]
df = get_merged_csv(glob.glob(fmask), index_col=None, usecols=cols).dropna(how='any')
print(df.head())

PS: this will include only 4 of the at least 17 columns in your resulting data frame.
Thanks for the reply.
After some debugging, I located the problem: the "iloc" subsetting in pandas created a circular reference, which prevented garbage collection. Detailed discussion can be found
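(A common mitigation for that kind of leak is to drop the references and force a collection pass between files; a generic sketch, not taken from the linked discussion:)

import gc

del chunk, df  # drop the per-file references before the next iteration
gc.collect()   # force a garbage-collection pass to break cycles and release memory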