09 Python: Data Processing and Wrangling with Pandas
Published: 2019-04-26


Data Processing and Wrangling with Pandas

import numpy as np
import pandas as pd
import pandas_datareader.data as web
import datetime

# Fetch AAPL daily prices from Yahoo Finance
df = web.DataReader("AAPL", "yahoo", datetime.datetime(2019, 4, 1), datetime.date.today())
# Display the data
df
High Low Open Close Volume Adj Close
Date
2019-04-01 191.679993 188.380005 191.639999 191.240005 27862000 191.240005
2019-04-02 194.460007 191.050003 191.089996 194.020004 22765700 194.020004
2019-04-03 196.500000 193.149994 193.250000 195.350006 23271800 195.350006
2019-04-04 196.369995 193.139999 194.789993 195.690002 19114300 195.690002
2019-04-05 197.100006 195.929993 196.449997 197.000000 18526600 197.000000
2019-04-08 200.229996 196.339996 196.419998 200.100006 25881700 200.100006
2019-04-09 202.850006 199.229996 200.320007 199.500000 35768200 199.500000
2019-04-10 200.740005 198.179993 198.679993 200.619995 21695300 200.619995
2019-04-11 201.000000 198.440002 200.850006 198.949997 20900800 198.949997
2019-04-12 200.139999 196.210007 199.199997 198.869995 27744300 198.869995

1 Handling missing data

1.1 Drop rows that contain missing values

df_drop = df.dropna()
df_drop.head()
High Low Open Close Volume Adj Close
Date
2019-04-01 191.679993 188.380005 191.639999 191.240005 27862000 191.240005
2019-04-02 194.460007 191.050003 191.089996 194.020004 22765700 194.020004
2019-04-03 196.500000 193.149994 193.250000 195.350006 23271800 195.350006
2019-04-04 196.369995 193.139999 194.789993 195.690002 19114300 195.690002
2019-04-05 197.100006 195.929993 196.449997 197.000000 18526600 197.000000

1.2 Fill in missing values

df_fillna = df.fillna(value=0)
df_fillna.head()
High Low Open Close Volume Adj Close
Date
2019-04-01 191.679993 188.380005 191.639999 191.240005 27862000 191.240005
2019-04-02 194.460007 191.050003 191.089996 194.020004 22765700 194.020004
2019-04-03 196.500000 193.149994 193.250000 195.350006 23271800 195.350006
2019-04-04 196.369995 193.139999 194.789993 195.690002 19114300 195.690002
2019-04-05 197.100006 195.929993 196.449997 197.000000 18526600 197.000000

1.3 Test whether values are NaN, producing a boolean mask

df_isnull = pd.isnull(df)
df_isnull.head()
High Low Open Close Volume Adj Close
Date
2019-04-01 False False False False False False
2019-04-02 False False False False False False
2019-04-03 False False False False False False
2019-04-04 False False False False False False
2019-04-05 False False False False False False
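
The AAPL frame above happens to contain no missing values, so dropna and fillna return it unchanged. A minimal sketch on a small hypothetical frame (the names and values are made up) shows what they actually do:

demo = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                     "b": [np.nan, 5.0, 6.0]})
demo.dropna()                 # keeps only rows with no NaN (here, just the last row)
demo.dropna(how="all")        # drops a row only if every value in it is NaN
demo.fillna(value=0)          # replaces every NaN with 0
demo.fillna(method="ffill")   # fills each NaN with the previous row's value
pd.isnull(demo)               # boolean mask marking where the NaNs are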

2 Function application and mapping

# Mean of each column
df.mean()
High         1.981070e+02
Low          1.950050e+02
Open         1.962690e+02
Close        1.971340e+02
Volume       2.435307e+07
Adj Close    1.971340e+02
dtype: float64
# Mean of each row
df.mean(1)
Date
2019-04-01    4.643826e+06
2019-04-02    3.794444e+06
2019-04-03    3.878796e+06
2019-04-04    3.185879e+06
2019-04-05    3.087931e+06
2019-04-08    4.313782e+06
2019-04-09    5.961534e+06
2019-04-10    3.616050e+06
2019-04-11    3.483633e+06
2019-04-12    4.624216e+06
dtype: float64
# The skipna parameter defaults to True, i.e. missing values are excluded
df.mean(axis=1, skipna=False)
Date
2019-04-01    4.643826e+06
2019-04-02    3.794444e+06
2019-04-03    3.878796e+06
2019-04-04    3.185879e+06
2019-04-05    3.087931e+06
2019-04-08    4.313782e+06
2019-04-09    5.961534e+06
2019-04-10    3.616050e+06
2019-04-11    3.483633e+06
2019-04-12    4.624216e+06
dtype: float64
# Sort by row index (the dates)
sorted_row_df = df.sort_index()
sorted_row_df.head()
High Low Open Close Volume Adj Close
Date
2019-04-01 191.679993 188.380005 191.639999 191.240005 27862000 191.240005
2019-04-02 194.460007 191.050003 191.089996 194.020004 22765700 194.020004
2019-04-03 196.500000 193.149994 193.250000 195.350006 23271800 195.350006
2019-04-04 196.369995 193.139999 194.789993 195.690002 19114300 195.690002
2019-04-05 197.100006 195.929993 196.449997 197.000000 18526600 197.000000
# Sort by column names
sorted_col_df = df.sort_index(axis=1)
sorted_col_df.head()
Adj Close Close High Low Open Volume
Date
2019-04-01 191.240005 191.240005 191.679993 188.380005 191.639999 27862000
2019-04-02 194.020004 194.020004 194.460007 191.050003 191.089996 22765700
2019-04-03 195.350006 195.350006 196.500000 193.149994 193.250000 23271800
2019-04-04 195.690002 195.690002 196.369995 193.139999 194.789993 19114300
2019-04-05 197.000000 197.000000 197.100006 195.929993 196.449997 18526600
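
The heading of this section also promises function application and mapping, which the examples above do not show. A minimal sketch of apply and element-wise mapping on the same frame (the lambdas are only illustrative):

df.apply(lambda col: col.max() - col.min())              # one function per column: the range of each column
df.apply(np.mean, axis=1)                                # per row: same result as df.mean(axis=1)
df[["Open", "Close"]].applymap(lambda x: round(x, 2))    # element-wise over a DataFrame
df["Close"].map(lambda x: "%.1f" % x)                    # element-wise over a Series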

The methods shown above are the most commonly used, but there are many more. Some of them are listed below for reference; a short sketch of a few of them follows the list.

count            Number of non-NA values

describe         Summary statistics for a Series or for each DataFrame column

min, max         Minimum and maximum values

argmin, argmax   Integer index positions of the minimum and maximum values

idxmin, idxmax   Index labels of the minimum and maximum values

quantile         Sample quantile (between 0 and 1)

sum              Sum of the values

mean             Mean of the values

median           Arithmetic median of the values

mad              Mean absolute deviation from the mean

var              Sample variance of the values

std              Sample standard deviation of the values

skew             Sample skewness (third moment) of the values

kurt             Sample kurtosis (fourth moment) of the values

cumsum           Cumulative sum of the values

cummin, cummax   Cumulative minimum and cumulative maximum of the values

cumprod          Cumulative product of the values

diff             First-order difference

pct_change       Percentage change
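
A quick sketch of a few of these, applied to the Close column (output omitted):

close = df["Close"]
close.describe()      # count, mean, std, min, quartiles, max in one call
close.idxmax()        # index label (the date) of the highest close
close.cumsum()        # running total
close.diff()          # day-over-day change; the first value is NaN
close.pct_change()    # day-over-day percentage change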

3 Data wrangling

Pandas provides many methods that make it easy to combine Series, DataFrame, and Panel objects with various kinds of set logic and index alignment.

concat: stacks multiple objects together along an axis

append: appends rows to a DataFrame

duplicated / drop_duplicates: identify and remove duplicate rows

3.1 Stacking data with concat

df1 = web.DataReader("AAPL", "yahoo", datetime.datetime(2019, 4, 1), datetime.datetime(2019, 4, 3))
df1
High Low Open Close Volume Adj Close
Date
2019-04-01 191.679993 188.380005 191.639999 191.240005 27862000 191.240005
2019-04-02 194.460007 191.050003 191.089996 194.020004 22765700 194.020004
2019-04-03 196.500000 193.149994 193.250000 195.350006 23271800 195.350006
df2 = web.DataReader("AAPL", "yahoo", datetime.datetime(2019, 4, 5), datetime.datetime(2019, 4, 8))
df2
High Low Open Close Volume Adj Close
Date
2019-04-04 196.369995 193.139999 194.789993 195.690002 19114300 195.690002
2019-04-05 197.100006 195.929993 196.449997 197.000000 18526600 197.000000
2019-04-08 200.229996 196.339996 196.419998 200.100006 25881700 200.100006

Vertical concatenation (the default):

pd.concat([df1,df2],axis=0)
High Low Open Close Volume Adj Close
Date
2019-04-01 191.679993 188.380005 191.639999 191.240005 27862000 191.240005
2019-04-02 194.460007 191.050003 191.089996 194.020004 22765700 194.020004
2019-04-03 196.500000 193.149994 193.250000 195.350006 23271800 195.350006
2019-04-04 196.369995 193.139999 194.789993 195.690002 19114300 195.690002
2019-04-05 197.100006 195.929993 196.449997 197.000000 18526600 197.000000
2019-04-08 200.229996 196.339996 196.419998 200.100006 25881700 200.100006

Horizontal concatenation; positions where the indexes do not align are filled with NaN:

pd.concat([df1,df2],axis=1)
High Low Open Close Volume Adj Close High Low Open Close Volume Adj Close
Date
2019-04-01 191.679993 188.380005 191.639999 191.240005 27862000.0 191.240005 NaN NaN NaN NaN NaN NaN
2019-04-02 194.460007 191.050003 191.089996 194.020004 22765700.0 194.020004 NaN NaN NaN NaN NaN NaN
2019-04-03 196.500000 193.149994 193.250000 195.350006 23271800.0 195.350006 NaN NaN NaN NaN NaN NaN
2019-04-04 NaN NaN NaN NaN NaN NaN 196.369995 193.139999 194.789993 195.690002 19114300.0 195.690002
2019-04-05 NaN NaN NaN NaN NaN NaN 197.100006 195.929993 196.449997 197.000000 18526600.0 197.000000
2019-04-08 NaN NaN NaN NaN NaN NaN 200.229996 196.339996 196.419998 200.100006 25881700.0 200.100006
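
concat can also tag where each block came from; passing keys builds a hierarchical index (a minimal sketch, the key names here are arbitrary):

combined = pd.concat([df1, df2], keys=["early", "mid"])
combined.loc["early"]    # selects only the rows that came from df1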

3.2 Appending data with append

df1
High Low Open Close Volume Adj Close
Date
2019-04-01 191.679993 188.380005 191.639999 191.240005 27862000 191.240005
2019-04-02 194.460007 191.050003 191.089996 194.020004 22765700 194.020004
2019-04-03 196.500000 193.149994 193.250000 195.350006 23271800 195.350006
s = df1.iloc[0]
s
High         1.916800e+02
Low          1.883800e+02
Open         1.916400e+02
Close        1.912400e+02
Volume       2.786200e+07
Adj Close    1.912400e+02
Name: 2019-04-01 00:00:00, dtype: float64
# ignore_index=False keeps the original index labels
df1.append(s, ignore_index=False)
High Low Open Close Volume Adj Close
Date
2019-04-01 191.679993 188.380005 191.639999 191.240005 27862000.0 191.240005
2019-04-02 194.460007 191.050003 191.089996 194.020004 22765700.0 194.020004
2019-04-03 196.500000 193.149994 193.250000 195.350006 23271800.0 195.350006
2019-04-01 191.679993 188.380005 191.639999 191.240005 27862000.0 191.240005
# ignore_index=True resets the index
df1.append(s, ignore_index=True)
High Low Open Close Volume Adj Close
0 191.679993 188.380005 191.639999 191.240005 27862000.0 191.240005
1 194.460007 191.050003 191.089996 194.020004 22765700.0 194.020004
2 196.500000 193.149994 193.250000 195.350006 23271800.0 195.350006
3 191.679993 188.380005 191.639999 191.240005 27862000.0 191.240005
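
Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. On current versions the same result comes from pd.concat; a minimal equivalent sketch:

# s is a Series, so turn it into a one-row frame first
pd.concat([df1, s.to_frame().T])                      # same as df1.append(s, ignore_index=False)
pd.concat([df1, s.to_frame().T], ignore_index=True)   # same as df1.append(s, ignore_index=True)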

3.3 Removing duplicate rows with duplicated

z = df1.append(s, ignore_index=False)
z
High Low Open Close Volume Adj Close
Date
2019-04-01 191.679993 188.380005 191.639999 191.240005 27862000.0 191.240005
2019-04-02 194.460007 191.050003 191.089996 194.020004 22765700.0 194.020004
2019-04-03 196.500000 193.149994 193.250000 195.350006 23271800.0 195.350006
2019-04-01 191.679993 188.380005 191.639999 191.240005 27862000.0 191.240005

Flag the duplicate rows:

z.duplicated()
Date
2019-04-01    False
2019-04-02    False
2019-04-03    False
2019-04-01     True
dtype: bool

Remove the duplicate rows:

z.drop_duplicates()
High Low Open Close Volume Adj Close
Date
2019-04-01 191.679993 188.380005 191.639999 191.240005 27862000.0 191.240005
2019-04-02 194.460007 191.050003 191.089996 194.020004 22765700.0 194.020004
2019-04-03 196.500000 193.149994 193.250000 195.350006 23271800.0 195.350006
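
drop_duplicates can also compare only selected columns and keep a different copy (a minimal sketch on the same frame):

z.drop_duplicates(subset=["Open", "Close"])   # compare only these columns when looking for duplicates
z.drop_duplicates(keep="last")                # keep the last occurrence instead of the first
z.duplicated(keep=False)                      # flag every duplicated row, including the first occurrence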

4 Grouping

z.groupby("Open").sum()
High Low Close Volume Adj Close
Open
191.089996 194.460007 191.050003 194.020004 22765700.0 194.020004
191.639999 383.359985 376.760010 382.480011 55724000.0 382.480011
193.250000 196.500000 193.149994 195.350006 23271800.0 195.350006
z.groupby(["Open","Close"]).sum()
High Low Volume Adj Close
Open Close
191.089996 194.020004 194.460007 191.050003 22765700.0 194.020004
191.639999 191.240005 383.359985 376.760010 55724000.0 382.480011
193.250000 195.350006 196.500000 193.149994 23271800.0 195.350006
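
groupby is not limited to sum; agg lets each column get its own statistic (a minimal sketch on the same frame):

z.groupby("Open").agg({"High": "max", "Low": "min", "Volume": "mean"})
z.groupby("Open")["Close"].mean()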

Reprinted from: http://jdzdf.baihongyu.com/
