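I work in data analysis and use pandas and NumPy all the time. Even after using them for a long while, a few things still confuse me, so I am taking some time to sort them out properly.
This post looks at them from two angles: what they are (what) and how to convert between them (how).
As usual: talk is cheap, show me the code.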
Ⅰ. What
1.1 numpy.ndarray
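numpy.ndarray (ndarray from here on) can be understood as a multidimensional, homogeneous array. The name breaks down as n (multi) - d (dimension) - array. It consists of two parts:
- array object: the data held in the array;
- data-type object: the metadata that describes that data.
The data has the following properties:
- multidimensional
- homogeneous (every element has the same data type)
- fixed-size (every item occupies the same number of bytes)
The metadata mainly covers the byte order, the number of bytes each element occupies, the data type, and so on.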
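To make the two parts concrete, here is a minimal sketch that inspects an array's data-type object and shows what homogeneous means in practice. The variable names are illustrative only, and the dtype is pinned to int64 on purpose so the result does not depend on the platform.

```python
import numpy as np

# Pin the dtype so the example behaves the same on every platform.
arr = np.arange(6, dtype=np.int64).reshape(2, 3)

# The data-type object carries the metadata shared by every element.
meta = {
    "name": arr.dtype.name,            # element type
    "itemsize": arr.dtype.itemsize,    # bytes per element
    "byteorder": arr.dtype.byteorder,  # '=' means native byte order
}
print(meta)
print(arr.ndim, arr.shape)

# Homogeneous: mixing ints and floats upcasts every element to one common dtype.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)
```

The array object answers "what are the values", while `arr.dtype` answers "how is each value laid out in memory".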
The official documentation describes it as follows:
numpy.ndarray
An array object represents a multidimensional, homogeneous array of fixed-size items. An associated data-type object describes the format of each element in the array (its byte-order, how many bytes it occupies in memory, whether it is an integer, a floating point number, or something else, etc.)
Parameters
class numpy.ndarray(shape, dtype=float, buffer=None, offset=0, strides=None, order=None)
Examples
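ndarray is typically used to create and manipulate matrices. Below we create a simple ndarray object.
import numpy as np
# nda = np.array(range(12)).reshape(3, -1)  # same result as the line below
nda = np.arange(12).reshape(3, -1)  # 12 elements in 3 rows; -1 lets NumPy infer 4 columns
nda[1]  # the second row
nda[1, 1] == nda[1][1]  # True
Inspect the object and its type: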
>>> nda
Out[72]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> type(nda)
Out[73]: numpy.ndarray
>>> nda.shape
Out[74]: (3, 4)
>>> nda.dtype
Out[75]: dtype('int32')
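The int32 shown above reflects the machine the example was run on; the default integer dtype can differ across platforms and NumPy versions. The following is a small sketch, assuming you want reproducible results, that passes dtype explicitly and also exercises the two indexing spellings used earlier.

```python
import numpy as np

# An explicit dtype removes the platform dependence; -1 is inferred as 4 here.
nda = np.arange(12, dtype=np.int64).reshape(3, -1)

row = nda[1]                    # second row
col = nda[:, 1]                 # second column
same = nda[1, 1] == nda[1][1]   # both spellings reach the same element

print(nda.dtype, nda.shape)
print(row, col, same)
```

Both spellings return the same element, but `nda[1, 1]` does it in a single indexing step, whereas `nda[1][1]` first builds the intermediate row and then indexes into it.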
1.2 pandas.Series
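A one-dimensional array with axis labels ("One-dimensional ndarray with axis labels (including time series)"). Unlike a typical ndarray, its elements may be of different types, in which case the Series falls back to dtype object.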
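As a quick illustration of the mixed-type case, the sketch below builds one Series from values of a single type and one from mixed values, then compares their dtypes. The values are made up for the example.

```python
import pandas as pd

same_type = pd.Series([1, 2, 3])
mixed_type = pd.Series([1, "hello", 3.5])

print(same_type.dtype)   # an integer dtype
print(mixed_type.dtype)  # object

# With dtype object, each element keeps its own Python type.
element_types = [type(v).__name__ for v in mixed_type]
print(element_types)
```

A Series with dtype object loses most of the speed benefit of vectorized operations, which is one reason mixed types are usually avoided.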
Official documentation: pandas.Series
One-dimensional ndarray with axis labels (including time series).
Labels need not be unique but must be a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index. Statistical methods from ndarray have been overridden to automatically exclude missing data (currently represented as NaN).
Parameters
class pandas.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)
data array-like, Iterable, dict, or scalar value
Contains data stored in Series. If data is a dict, argument order is maintained.
index array-like or Index (1d)
Values must be hashable and have the same length as data. Non-unique index values are allowed. Will default to RangeIndex (0, 1, 2, …, n) if not provided. If data is dict-like and index is None, then the keys in the data are used as the index. If the index is not None, the resulting Series is reindexed with the index values.
dtype str, numpy.dtype, or ExtensionDtype, optional
Data type for the output Series. If not specified, this will be inferred from data. See the user guide for more usages.
name str, optional
The name to give to the Series.
copy bool, default False
Copy input data.
Examples
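Below we create a few Series.
import pandas as pd
d1 = {'a': 1, 'b': 2, 'c': 3}
# d1 = {'a': 1, 'b': 2, 'c': 'hello'}  # mixed value types are allowed but generally not recommended
ser1 = pd.Series(data=d1, index=['a', 'b', 'c', 'd'])  # 'd' is not a key of d1, so it becomes NaN
d2 = [['python', 10, 99, 'male'], ['java', 14, 92, 'female'], ['c', 18, 97, 'male'], ['go', 22, 90, 'female']]
ser2 = pd.Series(data=d2, index=['lst', '2nd', '3rd', '4th'])  # each element is a whole list
ser1.iloc[1]  # positional access; plain ser1[1] on a label index is deprecated in recent pandas
The output: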
>>> ser1
Out[88]:
a 1.0
b 2.0
c 3.0
d NaN
dtype: float64
>>> ser2
Out[89]:
lst [python, 10, 99, male]
2nd [java, 14, 92, female]
3rd [c, 18, 97, male]
4th [go, 22, 90, female]
dtype: object
>>> type(ser1)
Out[92]: pandas.core.series.Series
>>> type(ser2)
Out[92]: pandas.core.series.Series
>>> ser1.iloc[1]  # positional access; same value as the older ser1[1]
Out[1]: 2.0
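The documentation quoted above mentions both integer- and label-based indexing, and statistical methods that skip missing data. Here is a minimal sketch of both, rebuilding ser1 so the block runs on its own.

```python
import pandas as pd

ser1 = pd.Series({"a": 1, "b": 2, "c": 3}, index=["a", "b", "c", "d"])

by_label = ser1.loc["b"]               # label-based lookup
by_position = ser1.iloc[1]             # position-based lookup
total = ser1.sum()                     # the NaN under 'd' is skipped by default
strict_total = ser1.sum(skipna=False)  # NaN propagates when skipping is turned off
n_missing = int(ser1.isna().sum())     # number of missing values

print(by_label, by_position, total, strict_total, n_missing)
```

Using `.loc` and `.iloc` makes it explicit whether a lookup is by label or by position, which avoids ambiguity when the index itself contains integers.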
1.3 pandas.DataFrame
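A DataFrame is a two-dimensional, size-mutable table whose columns can hold different data types. You can think of a DataFrame as a table in MySQL.
Official documentation: pandas.DataFrame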
Two-dimensional, size-mutable, potentially heterogeneous tabular data.
Data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.
Parameters
class pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
data ndarray (structured or homogeneous), Iterable, dict, or DataFrame
Dict can contain Series, arrays, constants, dataclass or list-like objects. If data is a dict, column order follows insertion-order. Changed in version 0.25.0: If data is a list of dicts, column order follows insertion-order.
index Index or array-like
Index to use for resulting frame. Will default to RangeIndex if no indexing information part of input data and no index provided.
columns Index or array-like
Column labels to use for resulting frame. Will default to RangeIndex (0, 1, 2, …, n) if no column labels are provided.
dtype dtype, default None
Data type to force. Only a single dtype is allowed. If None, infer.
copy bool, default False
Copy data from inputs. Only affects DataFrame / 2d ndarray input.
Examples
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
>>> df
   col1  col2
0     1     3
1     2     4
d2 = [['python', 10, 99, 'male'], ['java', 14, 92, 'female'], ['c', 18, 97, 'male'], ['go', 22, 90, 'female']]
df = pd.DataFrame(data=d2, columns=['lang', 'age', 'popular', 'sex'], index=['lst', '2nd', '3rd', '4th'])
>>> df
Out[110]:
       lang  age  popular     sex
lst  python   10       99    male
2nd    java   14       92  female
3rd       c   18       97    male
4th      go   22       90  female
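Two points from the quoted documentation are worth seeing in action: a DataFrame behaves like a dict-like container of Series, and arithmetic aligns on labels. The sketch below uses small made-up frames; the row labels r1, r2, r3 are hypothetical.

```python
import pandas as pd

data = [["python", 10, 99, "male"], ["java", 14, 92, "female"]]
df = pd.DataFrame(data, columns=["lang", "age", "popular", "sex"], index=["r1", "r2"])

age = df["age"]            # selecting one column returns a Series
print(type(age).__name__)
print(df.dtypes)           # each column keeps its own dtype

# Arithmetic aligns on labels: rows present in only one operand become NaN.
left = pd.DataFrame({"x": [1, 2]}, index=["r1", "r2"])
right = pd.DataFrame({"x": [10, 20]}, index=["r2", "r3"])
aligned = left + right
print(aligned)
```

Only r2 exists in both frames, so it is the only row with a numeric result; r1 and r3 come out as NaN rather than raising an error.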
Ⅱ. How
Below we show how to convert among Series, DataFrame, and ndarray.
The conversions are simple: to get an ndarray, pass the object to np.array(). To get a pandas object, pass the data to pd.Series() or pd.DataFrame().
# ndarray => Series
npa = np.arange(12)
ser = pd.Series(npa)
# Series => ndarray
npa_s = np.array(ser)
# ndarray => DataFrame
npa2 = npa.reshape(3, -1)
df = pd.DataFrame(npa2)
# DataFrame => ndarray
npa_d = np.array(df)
npa_v = df.values  # npa_d and npa_v hold the same data
# DataFrame => Series
type(df[0])  # pandas.core.series.Series; selecting one column yields a Series
# Series => DataFrame
pd.DataFrame(ser)
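One detail the conversions above do not show is what happens to labels and dtypes along the way. The following sketch, using a made-up two-column frame, relies on DataFrame.to_numpy() and Series.to_frame(), which pandas provides alongside the constructors used above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3.5, 4.5]}, index=["x", "y"])

# to_numpy() is the documented alternative to df.values.
arr = df.to_numpy()
# An int column and a float column are upcast to one common dtype in the array.
print(arr.dtype)

# Labels are not stored in the ndarray, so pass them back in explicitly.
restored = pd.DataFrame(arr, index=df.index, columns=df.columns)
print(restored)

# A named Series becomes a one-column DataFrame whose column takes that name.
ser = pd.Series(np.arange(3), name="n")
frame = ser.to_frame()
print(frame.columns.tolist())
```

Because the ndarray is homogeneous, a round trip through it can change column dtypes (here column a comes back as float), so keep the original DataFrame if per-column types matter.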