Contents
- Data Load
  - Format Transformation
  - Converting to tf.data.Dataset
    - Reading
    - Operations
- Data Preprocessing
  - vectorized & standardized
  - Text
    - Basics
    - One-hot encoding
  - Image & CSV: normalizing features
  - Label categories
  - More
    - For processing
    - For building a Dataset
  - Preprocessing layers can be built into the Model
Data Load
Format Transformation
original format:
- Images
- Text files
- CSV data
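You need to make your data available in one of three formats:
- NumPy arrays: suitable for data that is not too large (it has to fit in memory).
- tf.data.Dataset objects:
①Tuned for GPU training: they keep the GPU busy better than the other formats, for example by preprocessing and prefetching batches while the GPU is working.
②Can read data from disk that is too large to fit in memory.
- Python generators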
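To make the three formats concrete, here is a minimal sketch that exposes the same toy data as NumPy arrays, as a tf.data.Dataset, and as a Python generator. The array contents and the helper name `batch_generator` are made up for illustration only.

```python
import numpy as np
import tensorflow as tf

# Toy data (hypothetical values): 6 samples with 2 features each.
features = np.arange(12, dtype="float32").reshape(6, 2)
labels = np.array([0, 1, 0, 1, 0, 1], dtype="int32")

# Format 1: the NumPy arrays themselves (features, labels).

# Format 2: a tf.data.Dataset that yields batches of 2 samples.
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(2)

# Format 3: a Python generator that yields the same batches.
def batch_generator(x, y, batch_size):
    for start in range(0, len(x), batch_size):
        yield x[start:start + batch_size], y[start:start + batch_size]

dataset_batches = list(dataset.as_numpy_iterator())
generator_batches = list(batch_generator(features, labels, 2))
print(len(dataset_batches), len(generator_batches))
```

Both the Dataset and the generator produce three batches with identical contents; the Dataset version is the one that can later be extended with streaming from disk.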
Converting to tf.data.Dataset
Reading
- Images:
tf.keras.preprocessing.image_dataset_from_directory(...)
# image files sorted into class-specific folders
main_directory/
...class_a/
......a_image_1.jpg
......a_image_2.jpg
...class_b/
......b_image_1.jpg
......b_image_2.jpg
from tensorflow import keras

dataset = keras.preprocessing.image_dataset_from_directory(
    'path/to/main_directory', batch_size=64, image_size=(200, 200))

# For demonstration, iterate over the batches yielded by the dataset.
for data, labels in dataset:
    print(data.shape)    # (64, 200, 200, 3): 64 images per batch, 200x200 pixels, 3 RGB channels
    print(data.dtype)    # float32
    print(labels.shape)  # (64,): 64 labels per batch
    print(labels.dtype)  # int32
- Text files:
keras.preprocessing.text_dataset_from_directory(...)
As with images, the documents are sorted into class-specific folders.
from tensorflow import keras

dataset = keras.preprocessing.text_dataset_from_directory(
    'path/to/main_directory', batch_size=64)

# For demonstration, iterate over the batches yielded by the dataset.
for data, labels in dataset:
    print(data.shape)    # (64,)
    print(data.dtype)    # string
    print(labels.shape)  # (64,)
    print(labels.dtype)  # int32
- Other:
tf.data.experimental.make_csv_dataset
to load structured data from CSV files.
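As a rough sketch of how this might be used, the snippet below writes a tiny CSV file to a temporary directory and reads it back in batches. The file name, the column names (`height`, `weight`, `label`) and the values are invented for the example.

```python
import os
import tempfile

import tensorflow as tf

# Write a small, made-up CSV file to a temporary directory.
csv_path = os.path.join(tempfile.mkdtemp(), "toy.csv")
with open(csv_path, "w") as f:
    f.write("height,weight,label\n"
            "1.0,2.0,0\n"
            "3.0,4.0,1\n"
            "5.0,6.0,0\n"
            "7.0,8.0,1\n")

# num_epochs=1 stops after one pass; shuffle=False keeps the file order.
dataset = tf.data.experimental.make_csv_dataset(
    csv_path, batch_size=2, label_name="label", num_epochs=1, shuffle=False)

# Each element is (dict of feature columns, label batch).
batches = list(dataset.as_numpy_iterator())
for feature_columns, label_batch in batches:
    print(dict(feature_columns), label_batch)
```

Each batch pairs a dictionary that maps column names to feature values with the values of the label column.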
tf.data.Dataset.from_tensor_slices()
to build a Dataset from in-memory arrays or tensors, sliced along the first dimension.
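A small sketch of from_tensor_slices, using toy features and labels chosen to match the printed elements in the inspection examples of this section (three samples, two integer features each, and a one-element string label):

```python
import tensorflow as tf

# Three samples: 2 integer features each, plus a 1-element string label.
features = tf.constant([[1, 3], [2, 1], [3, 3]])
labels = tf.constant([["A"], ["B"], ["A"]])

# Slicing along the first dimension gives one (features, label) pair per sample.
dataset = tf.data.Dataset.from_tensor_slices((features, labels))

elements = list(dataset.as_numpy_iterator())
for sample_features, sample_label in elements:
    print(sample_features, sample_label)
```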
Operations
- Inspecting, option 1: materialize the iterator
.as_numpy_iterator()
print(list(dataset.as_numpy_iterator()))
# [(array([1, 3], dtype=int32), array([b'A'], dtype=object)),
# (array([2, 1], dtype=int32), array([b'B'], dtype=object)),
# (array([3, 3], dtype=int32), array([b'A'], dtype=object))]
- Inspecting, option 2: a for loop
for element in dataset.as_numpy_iterator():
    print(element)
# (array([1, 3], dtype=int32), array([b'A'], dtype=object))
# (array([2, 1], dtype=int32), array([b'B'], dtype=object))
# (array([3, 3], dtype=int32), array([b'A'], dtype=object))
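.take(count)
: takes the first count elements of the dataset (whole batches once the dataset has been batched).
for inputs, targets in dataset.take(1):
    print(inputs)   # tf.Tensor([1 3], shape=(2,), dtype=int32)
    print(targets)  # tf.Tensor([b'A'], shape=(1,), dtype=string)
.batch()
: sets the batch size. A Dataset passed to fit() must already be batched; otherwise each single sample is treated as a batch, and fit() typically fails with a shape error.
# 32 samples per batch
dataset = dataset.batch(32)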
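The effect of .take() and .batch() on element shapes can be sketched as follows. The three-sample toy data is the same assumption as in the inspection examples above; `drop_remainder=True` is an optional argument of batch() that discards an incomplete final batch.

```python
import tensorflow as tf

features = tf.constant([[1, 3], [2, 1], [3, 3]])
labels = tf.constant([["A"], ["B"], ["A"]])
dataset = tf.data.Dataset.from_tensor_slices((features, labels))

# Unbatched: take(1) returns a single sample with shape (2,).
sample_inputs, sample_targets = next(iter(dataset.take(1)))
unbatched_shape = tuple(sample_inputs.shape)

# Batched: 3 samples in batches of 2 give one full and one partial batch.
batched = dataset.batch(2)
batch_shapes = [tuple(inputs.shape) for inputs, _ in batched]

# drop_remainder=True discards the incomplete last batch.
full_only = dataset.batch(2, drop_remainder=True)
full_shapes = [tuple(inputs.shape) for inputs, _ in full_only]

print(unbatched_shape, batch_shapes, full_shapes)
```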
Data Preprocessing
vectorized & standardized
In short:
- vectorized: non-numeric features are mapped to numbers, for example [dog, cat] → [0, 1].
- standardized: values are rescaled into a range such as [0.0, 1.0], or shifted and scaled to zero mean and unit variance.
In detail:
- Text files
①need to be read into string tensors,
②then split into words.
③Finally, the words need to be indexed & turned into integer tensors.
- Images
①need to be read and decoded into integer tensors,
②then converted to floating point and normalized to small values (usually between 0 and 1).
- CSV data
①needs to be parsed, with numerical features converted to floating point tensors and categorical features indexed and converted to integer tensors.
②Then each feature typically needs to be normalized to zero-mean and unit-variance.
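The two ideas can be sketched with plain NumPy before bringing in any Keras layers. The category names and numeric values below are illustrative assumptions, not data from a real dataset.

```python
import numpy as np

# Vectorize: map each category name to an integer index.
animals = ["dog", "cat", "cat", "dog"]
index_of = {"dog": 0, "cat": 1}
vectorized = np.array([index_of[name] for name in animals])

# Rescale: bring pixel-like values from [0, 255] into [0.0, 1.0].
pixels = np.array([0.0, 51.0, 255.0])
rescaled = pixels / 255.0

# Standardize: subtract the mean and divide by the standard deviation.
values = np.array([1.0, 4.0, 7.0])
standardized = (values - values.mean()) / values.std()

print(vectorized, rescaled, standardized)
```

After standardization the values have mean 0 and standard deviation 1, which is what the Normalization layer discussed later computes per feature.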
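Text
Basics
tensorflow.keras.layers.experimental.preprocessing.TextVectorization
: holds a mapping between string tokens and integer indices.
- The vocabulary must consist of strings.
- Index 0 is reserved for padding (the empty token "" that fills up sentences with fewer words than the longest one), and index 1 is reserved for out-of-vocabulary tokens (the vocabulary is set by adapt()).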
from tensorflow.keras.layers.experimental import preprocessing

vocabulary = ["aa bb cc"]
data = ["aa bb cc cc dd ee"]
layer = preprocessing.TextVectorization()
layer.adapt(vocabulary)        # build the vocabulary from this input
normalized_data = layer(data)  # encode data with the vocabulary learned by adapt()
print(normalized_data)
# tf.Tensor([[4 3 2 2 1 1]], shape=(1, 6), dtype=int64)
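- The repeated word cc maps to 2 both times.
- The out-of-vocabulary words dd and ee both map to 1.
- When adapt() builds the vocabulary, punctuation and whitespace are ignored and only words count. Duplicate words are kept once.
- vocabulary can be a 1-D array such as ["aa bb cc"] (a sentence), ["aa bb", "bb cc"] (sentences), or ["aa", "bb", "cc"] (words). It cannot be a bare string "aa bb cc", nor a 2-D array with several columns such as [["aa", "bb"], ["aa", "cc"]], but a single-column 2-D array works: [["aa bb"], ["aa cc"]] (sentences) or [["aa"], ["bb"], ["cc"]] (words).
- data follows the same format rules, and the result is always 2-D. Note that each row is treated as one "..." string.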
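data = ["aa bb cc", "cc dd"]
''' tf.Tensor(
[[2 4 3]
 [3 1 0]], shape=(2, 3), dtype=int64) '''
- "Not enough words" refers to the two sentences in data: the largest word count becomes the number of columns of the result, and shorter sentences get 0 for each missing word.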
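The padding and out-of-vocabulary rules can be checked with a small sketch that fixes the vocabulary explicitly instead of calling adapt(), so the index of each word is predictable. It assumes a TensorFlow release (2.6 or later) where the layer is available as `tf.keras.layers.TextVectorization` and accepts a `vocabulary` argument; with the older experimental import path the details may differ.

```python
import numpy as np
import tensorflow as tf

# Explicit vocabulary: indices 0 (padding) and 1 (out-of-vocabulary) are
# reserved, so "aa", "bb", "cc" receive 2, 3 and 4 in the given order.
layer = tf.keras.layers.TextVectorization(vocabulary=["aa", "bb", "cc"])

data = tf.constant(["aa bb cc", "cc dd"])
encoded = np.asarray(layer(data))

print(layer.get_vocabulary())
print(encoded)
```

The second sentence has only two words, so its third position is padded with 0, and the unknown word dd becomes 1.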
One-hot encoding
# Example: one-hot encoded bigrams
from tensorflow.keras.layers.experimental import preprocessing

vocabulary = ["aa bb cc"]
data = ["aa", "bb", "cc", "dd", ""]
# output_mode="binary" is called "multi_hot" in later TensorFlow releases
layer = preprocessing.TextVectorization(output_mode="binary", ngrams=2)
layer.adapt(vocabulary)
integer_data = layer(data)
print(integer_data)
''' tf.Tensor(
[[0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 1. 0. 0.]
 [0. 1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]], shape=(5, 6), dtype=float32) '''
Each non-empty row has exactly one position set to 1 and all others set to 0 (the out-of-vocabulary word dd sets column 0); the empty string "" gives an all-zero row.
Image & CSV: normalizing features
- Zero mean and unit variance:
tensorflow.keras.layers.experimental.preprocessing.Normalization
adapt() accepts three kinds of input: a batched Dataset, a Tensor, or a NumPy array. A pd.DataFrame cannot be passed directly; convert it first (for example with DataFrame.to_numpy()).
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers.experimental import preprocessing

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32)
normalizer = preprocessing.Normalization()
normalizer.adapt(data)
normalized_data = normalizer(data)
print(normalized_data)
''' tf.Tensor(
[[-1.2247448 -1.2247448 -1.2247448]
 [ 0.         0.         0.       ]
 [ 1.2247448  1.2247448  1.2247448]], shape=(3, 3), dtype=float32) '''
# tf.keras.utils.normalize() does something different: it scales a NumPy array
# to unit L2 norm along an axis (the last one by default), not to zero mean and unit variance
normalized_data = tf.keras.utils.normalize(data)
print(normalized_data)
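To see what the layer computes, the following sketch compares it with a manual per-column standardization and with a manual L2 normalization of each row. It assumes a TensorFlow release (2.6 or later) where the layer is exposed as `tf.keras.layers.Normalization`; the variable names are illustrative.

```python
import numpy as np
import tensorflow as tf

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32)

# The layer learns the mean and variance of every column in adapt().
normalizer = tf.keras.layers.Normalization(axis=-1)
normalizer.adapt(data)
layer_output = np.asarray(normalizer(data))

# Manual equivalent: (x - column mean) / column standard deviation.
manual_output = (data - data.mean(axis=0)) / data.std(axis=0)

# L2 normalization of each row is a different operation:
# every row is divided by its own Euclidean length.
l2_output = data / np.linalg.norm(data, axis=-1, keepdims=True)

print(layer_output)
print(l2_output)
```

The layer output matches the manual standardization, while the L2-normalized rows have unit length but neither zero mean nor unit variance per column.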
- Rescaling the range:
tensorflow.keras.layers.experimental.preprocessing.Rescaling
import numpy as np
from tensorflow.keras.layers.experimental import preprocessing

# Example image data, with values in the [0, 255] range
training_data = np.random.randint(0, 256, size=(64, 200, 200, 3)).astype("float32")

# Rescale from [0, 255] to [0.0, 1.0]
output_data = preprocessing.Rescaling(scale=1.0 / 255)(training_data)
With a NumPy array you can also divide directly:
a = np.array([25.5, 255])
a = a / 255
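Label categories
num_classes (3 here) must be at least the largest label + 1. The classes in y should cover [0, MAX], so that num_classes = MAX + 1 fits exactly. Labels that start from 1 also work, but the result then contains a column 0 that is never used.
import numpy as np
from tensorflow import keras

y = np.array([0, 2, 1, 2, 1])  # three classes: 0, 1, 2
y = keras.utils.to_categorical(y, 3)
print(y)
''' [[1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 1. 0.]] '''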
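The unused column that appears with labels starting from 1 can be shown in a short sketch. The label values are made up, and shifting the labels by one is only one possible way to avoid the extra column.

```python
import numpy as np
import tensorflow as tf

# Hypothetical labels that start at 1 instead of 0.
labels_from_one = np.array([1, 3, 2, 3])

# num_classes must be at least max label + 1 = 4, so column 0 stays empty.
with_unused_column = tf.keras.utils.to_categorical(labels_from_one, num_classes=4)

# Shifting the labels to start at 0 removes the unused column.
compact = tf.keras.utils.to_categorical(labels_from_one - 1, num_classes=3)

print(with_unused_column)
print(compact)
```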
More
For processing
Categorical data preprocessing layers
- CategoryEncoding layer
- Hashing layer
- Discretization layer
- StringLookup layer
- IntegerLookup layer
- CategoryCrossing layer
Image preprocessing & augmentation layers
- Resizing layer
- Rescaling layer
- CenterCrop layer
- RandomCrop layer
- RandomFlip layer
- RandomTranslation layer
- RandomRotation layer
- RandomZoom layer
- RandomHeight layer
- RandomWidth layer
Core preprocessing layers
- TextVectorization layer
- Normalization layer
For building a Dataset
Dataset preprocessing
- Image data preprocessing
  - image_dataset_from_directory function
  - load_img function
  - img_to_array function
  - ImageDataGenerator class
  - flow method
  - flow_from_dataframe method
  - flow_from_directory method
- Timeseries data preprocessing
  - timeseries_dataset_from_array function
  - pad_sequences function
  - TimeseriesGenerator class
- Text data preprocessing
  - text_dataset_from_directory function
  - Tokenizer class
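Preprocessing layers can be built into the Model
normalizer = preprocessing.Normalization()
normalizer.adapt(x_train)

inputs = keras.Input(shape=input_shape)
x = normalizer(inputs)  # the adapted preprocessing layer is the first layer
outputs = rest_of_the_model(x)  # rest_of_the_model stands for the remaining layers
model = keras.Model(inputs, outputs)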
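A runnable version of this pattern might look like the sketch below. The random training data, the input shape of 3 features, and the single Dense layer standing in for the rest of the model are all assumptions made for illustration; it also assumes a TensorFlow release (2.6 or later) where `tf.keras.layers.Normalization` is available.

```python
import numpy as np
import tensorflow as tf

# Made-up training data: 32 samples, 3 features in the [0, 255] range.
rng = np.random.default_rng(0)
x_train = rng.uniform(0, 255, size=(32, 3)).astype("float32")

# Adapt the preprocessing layer on the training data first.
normalizer = tf.keras.layers.Normalization(axis=-1)
normalizer.adapt(x_train)

# Then use it as the first layer of the model.
inputs = tf.keras.Input(shape=(3,))
x = normalizer(inputs)
outputs = tf.keras.layers.Dense(1)(x)  # placeholder for the rest of the model
model = tf.keras.Model(inputs, outputs)

# The model now accepts raw, unnormalized inputs.
predictions = np.asarray(model(x_train))

# A helper model that exposes the normalized features for inspection.
normalized = np.asarray(tf.keras.Model(inputs, x)(x_train))
print(predictions.shape, normalized.mean(axis=0))
```

Because the normalization is part of the model, the same statistics are applied at training and inference time without a separate preprocessing step.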