Python Data Analysis for Beginners (4): Pandas (Part 3), the DataFrame Data Structure

Posted: 2020-02-17 16:02:42


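Staying home to do your part for the country gets boring, so why not learn a little Python with me

Life is short, I use Python

Previous posts in this series:

Python Data Analysis for Beginners (1): Data Analysis Basics

Python Data Analysis for Beginners (2): Pandas (Part 1), Overview

Python Data Analysis for Beginners (3): Pandas (Part 2), the Series Data Structure

Introduction

A DataFrame is a two-dimensional labeled data structure whose columns can hold different types.

Simply put, it is structured like an Excel spreadsheet or a SQL table.

DataFrame is the most commonly used Pandas object. Like Series, it accepts many kinds of input data:

Dict of 1-D ndarrays, lists, dicts, or Series
2-D numpy.ndarray
Structured or record ndarray
Series
DataFrame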
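
As a minimal sketch of the less common input types in the list above, the snippet below builds a DataFrame from a 2-D ndarray, from a structured ndarray, and from another DataFrame. The variable names and the field names id and score are made up for illustration.

```python
import numpy as np
import pandas as pd

# 2-D ndarray: column labels are passed explicitly here;
# without them, pandas uses default integer labels.
arr = np.arange(6).reshape(2, 3)
df_arr = pd.DataFrame(arr, columns=['x', 'y', 'z'])

# Structured ndarray: the field names become the column names.
rec = np.array([(1, 2.0), (3, 4.0)], dtype=[('id', 'i4'), ('score', 'f8')])
df_rec = pd.DataFrame(rec)

# Another DataFrame: the constructor accepts it directly.
df_copy = pd.DataFrame(df_arr)

print(df_arr)
print(df_rec)
print(df_copy.equals(df_arr))
```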

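Building a DataFrame

Like Excel, a DataFrame has row labels (index) and column labels (columns), which you can think of as Excel's rows and columns.

When building a DataFrame, you can optionally pass the index and columns arguments.

Doing so guarantees the index and columns of the resulting DataFrame.

Note: With Python >= 3.6 and Pandas >= 0.23, when the data is a dict and the columns argument is not specified, the DataFrame's columns follow the dict's insertion order.

With Python < 3.6 or Pandas < 0.23, when the columns argument is not specified, the DataFrame's columns are sorted alphabetically by dict key.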
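
The ordering rule in the note can be checked with a small sketch like the one below. The dict data and the variable names are illustrative only, and the behavior in the comments assumes Python >= 3.6 with Pandas >= 0.23.

```python
import pandas as pd

# Keys are deliberately inserted in non-alphabetical order.
data = {'b': [1, 2], 'a': [3, 4]}

# No columns argument: columns follow the dict's insertion order.
df_default = pd.DataFrame(data)

# Explicit columns argument: the order we pass wins.
df_explicit = pd.DataFrame(data, columns=['a', 'b'])

print(list(df_default.columns))
print(list(df_explicit.columns))
```

Passing columns explicitly is the safer choice when code has to behave the same way across versions.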

Building a DataFrame from a dict of Series or a dict of dicts

Let's start with a simple example:

import pandas as pd

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df)

Output:

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

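When a DataFrame is built from Series, the resulting index is the union of the indexes of the individual Series.

Any nested dicts are converted to Series first. If no columns are specified, the DataFrame's columns are the ordered list of the dict keys.

Here the dict uses the two strings one and two as its keys, and the DataFrame constructor automatically uses those keys as its columns.

If we specify an index manually when constructing the DataFrame, that index is used instead, for example: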

df1 = pd.DataFrame(d, index=['d', 'b', 'a'])
print(df1)

Output:

   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0

If we specify both index and columns, the DataFrame uses the index and columns we pass. Any index label or column that does not exist in the data is filled with NaN, for example:

df2 = pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
print(df2)

Output:

   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN

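Note: when columns are passed together with a data dict, the passed columns override the dict's keys.

When a DataFrame is built from a single Series, it inherits that Series's index. If no column name is specified, the column name defaults to the name of the input Series.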
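
A minimal sketch of the single-Series case described above. The Series name ser and the variable names are made up for illustration.

```python
import pandas as pd

# A named Series: its name becomes the column label.
ser = pd.Series([1., 2., 3.], index=['a', 'b', 'c'], name='ser')
df_named = pd.DataFrame(ser)

# An unnamed Series: the single column gets the default integer label 0.
df_unnamed = pd.DataFrame(pd.Series([1., 2., 3.], index=['a', 'b', 'c']))

print(df_named)
print(df_unnamed)
```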

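Building a DataFrame from a dict of ndarrays or lists

First, the arrays must all have the same length.

If an index argument is passed, its length must match the length of the arrays.

If no index argument is passed, an integer index starting from 0 is generated automatically, for example: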

d1 = {'one': [1., 2., 3., 4.],
      'two': [4., 3., 2., 1.]}

df3 = pd.DataFrame(d1)
print(df3)

df4 = pd.DataFrame(d1, index=['a', 'b', 'c', 'd'])
print(df4)

Output:

   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0

   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0
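
The two length rules for this kind of input can be seen failing in a short sketch like the following. The helper try_build is hypothetical and exists only for this illustration; the exact error message text varies between Pandas versions, so it is printed rather than assumed.

```python
import pandas as pd


def try_build(data, index=None):
    """Hypothetical helper: return the error message if construction fails, else None."""
    try:
        pd.DataFrame(data, index=index)
        return None
    except ValueError as exc:
        return str(exc)


# Arrays of different lengths are rejected.
err_lengths = try_build({'one': [1., 2., 3.], 'two': [4., 3.]})

# An index whose length differs from the arrays is rejected too.
err_index = try_build({'one': [1., 2., 3.]}, index=['a', 'b'])

print(err_lengths)
print(err_index)
```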

Building a DataFrame from a list of dicts

d2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

df5 = pd.DataFrame(d2)
print(df5)

df6 = pd.DataFrame(d2, index=['first', 'second'], columns=['a', 'b'])
print(df6)

Output:

   a   b     c
0  1   2   NaN
1  5  10  20.0

        a   b
first   1   2
second  5  10
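
In the first output above, the first record has no c key, so its cell is filled with NaN and the c column ends up as floating point. A small sketch that checks this, using an illustrative variable name:

```python
import pandas as pd

records = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df_records = pd.DataFrame(records)

# Each dict is one row; the columns are the union of all keys.
print(list(df_records.columns))

# The missing 'c' value in the first row becomes NaN,
# which forces the column to a float dtype.
print(df_records['c'].isna().tolist())
print(df_records.dtypes)
```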

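Building a DataFrame from a dict of tuples

Passing a dict keyed by tuples automatically creates a DataFrame with a multi-level index (MultiIndex).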

d3 = ({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
       ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
       ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
       ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
       ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})

df7 = pd.DataFrame(d3)
print(df7)

Output:

       a              b      
       b    a    c    a     b
A B  1.0  4.0  5.0  8.0  10.0
  C  2.0  3.0  6.0  7.0   NaN
  D  NaN  NaN  NaN  NaN   9.0
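
As a rough sketch of how such a multi-level DataFrame can be read back, the snippet below uses a cut-down version of the dict above (two columns only, an assumption made to keep it short) and selects by outer column label and by full row and column tuples.

```python
import pandas as pd

# A smaller dict of tuples than the one above, for illustration only.
data = {('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
        ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4}}
df_multi = pd.DataFrame(data)

# Both axes are MultiIndex objects with two levels each.
print(type(df_multi.columns).__name__)
print(df_multi.index.nlevels)

# Selecting the outer column label returns the inner columns under it.
sub = df_multi['a']
print(sub)

# A full (row tuple, column tuple) pair addresses a single cell.
cell = df_multi.loc[('A', 'B'), ('a', 'b')]
print(cell)
```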

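Selecting, adding, and deleting

Once a DataFrame has been created, we naturally want to manipulate it dynamically, so the standard CRUD operations are essential.

The examples below show how to retrieve data, using df4 for the demonstration:

Selecting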

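# Print the whole DataFrame
print(df4)
# Select a column
print(df4['one'])
# Select a row by label (loc) or by integer position (iloc)
print(df4.loc['a'])
print(df4.iloc[0])

# Add new columns derived from existing ones
df4['three'] = df4['one'] * df4['two']
df4['flag'] = df4['one'] > 2
print(df4)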

Output:

   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0

a    1.0
b    2.0
c    3.0
d    4.0
Name: one, dtype: float64

one    1.0
two    4.0
Name: a, dtype: float64

one    1.0
two    4.0
Name: a, dtype: float64

   one  two  three   flag
a  1.0  4.0    4.0  False
b  2.0  3.0    6.0  False
c  3.0  2.0    6.0   True
d  4.0  1.0    4.0   True

Deleting

# Delete columns: del removes a column in place, pop removes it and returns it
del df4['two']
df4.pop('three')
print(df4)

Output:

   one   flag
a  1.0  False
b  2.0  False
c  3.0   True
d  4.0   True
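
The difference between the two deletion styles used above is that pop hands back the removed column as a Series, while del discards it. A minimal sketch on a small stand-in frame (df_demo is an illustrative name, not the df4 from the article):

```python
import pandas as pd

df_demo = pd.DataFrame({'one': [1., 2.], 'two': [4., 3.], 'three': [4., 6.]},
                       index=['a', 'b'])

# del removes the column in place and returns nothing.
del df_demo['two']

# pop removes the column in place and returns it as a Series.
popped = df_demo.pop('three')

print(df_demo)
print(popped)
```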

Adding

Assigning a scalar value broadcasts it to fill the entire column:

# Insert data
df4['foo'] = 'bar'
print(df4)

Output:

   one   flag  foo
a  1.0  False  bar
b  2.0  False  bar
c  3.0   True  bar
d  4.0   True  bar

When you insert a Series whose index differs from the DataFrame's, the Series is aligned to the DataFrame's index:

df4['one_trunc'] = df4['one'][:2]
print(df4)

Output:

   one   flag  foo  one_trunc
a  1.0  False  bar        1.0
b  2.0  False  bar        2.0
c  3.0   True  bar        NaN
d  4.0   True  bar        NaN

You can also insert raw ndarrays, but their length must match the length of the DataFrame's index.
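
A short sketch of the ndarray case, using an illustrative frame df_demo and made-up column names. Unlike a Series, a raw array carries no labels, so values are placed by position, and a wrong length is rejected.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'one': [1., 2., 3., 4.]}, index=['a', 'b', 'c', 'd'])

# Same length as the index: values are assigned by position.
df_demo['arr'] = np.array([10, 20, 30, 40])

# Different length: pandas raises ValueError and the column is not added.
error = None
try:
    df_demo['bad'] = np.array([1, 2])
except ValueError as exc:
    error = str(exc)

print(df_demo)
print(error)
```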

By default, new columns are appended at the end of the DataFrame. The insert method lets you place a column at a specific position, counted from 0, for example:

df4.insert(1, 'bar', df4['one'])
print(df4)

Output:

   one  bar   flag  foo  one_trunc
a  1.0  1.0  False  bar        1.0
b  2.0  2.0  False  bar        2.0
c  3.0  3.0   True  bar        NaN
d  4.0  4.0   True  bar        NaN
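
One thing to watch with insert is that it refuses to add a column label that already exists unless allow_duplicates=True is passed. The sketch below, on an illustrative frame df_demo, shows the rejection and also uses insert to append at the end by passing the current number of columns as the position.

```python
import pandas as pd

df_demo = pd.DataFrame({'one': [1., 2.], 'flag': [False, True]}, index=['a', 'b'])

# Insert a copy of 'one' at position 1 (the second column).
df_demo.insert(1, 'bar', df_demo['one'])

# Inserting the same label again raises ValueError by default.
duplicate_error = None
try:
    df_demo.insert(0, 'bar', df_demo['one'])
except ValueError as exc:
    duplicate_error = str(exc)

# Position len(columns) appends at the end; a scalar fills the whole column.
df_demo.insert(len(df_demo.columns), 'last', 0)

print(df_demo)
print(duplicate_error)
```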

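Sample code

As usual, all the sample code is uploaded to the code repositories on Github and Gitee so that everyone can grab it easily.

Sample code - Github

Sample code - Gitee

References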

https://www.pypandas.cn/docs/getting_started/dsintro.html


Original article: https://www.cnblogs.com/babycomeon/p/12321717.html
