keras: Data Load & Data Preprocessing (Overview)

Contents

Data Load
- Format Transformation
- Converting to tf.data.Dataset
  - Reading
  - Operations
Data Preprocessing
- vectorized & standardized
- Text
  - Basics
  - One-hot encoding
- Image & CSV: normalizing features
- Encoding labels
- More
  - For preprocessing
  - For generating a Dataset
Preprocessing layers can be built into the Model

Data Load

Format Transformation

original format:

Images
Text files
CSV data

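You need to make your data available in one of 3 formats:

NumPy arrays
Suitable for data that is not too large (it has to fit in memory).
tf.data.Dataset objects:
① Optimized for GPU pipelines, so they make better use of the GPU than the other formats.
② Can read data from disk that is too large to fit in memory.
Python generators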
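Of the three formats, the Python generator is the least obvious, so here is a minimal sketch of one. It assumes the data is already in NumPy arrays; the function name batch_generator and the toy data are hypothetical and only serve as illustration.

```python
import numpy as np

def batch_generator(features, targets, batch_size):
    """Yield (inputs, targets) batches in an endless loop over the data."""
    num_samples = len(features)
    while True:
        for start in range(0, num_samples, batch_size):
            end = start + batch_size
            yield features[start:end], targets[start:end]

# Hypothetical toy data: 10 samples with 4 features each, binary targets.
features = np.arange(40, dtype="float32").reshape(10, 4)
targets = np.arange(10) % 2

gen = batch_generator(features, targets, batch_size=4)
first_inputs, first_targets = next(gen)
print(first_inputs.shape, first_targets.shape)  # (4, 4) (4,)
```

Because this generator never stops, a call such as model.fit(gen, steps_per_epoch=3) would need steps_per_epoch so that Keras knows where an epoch ends (10 samples in batches of 4 give 3 batches).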

Converting to tf.data.Dataset

Reading

Images: tf.keras.preprocessing.image_dataset_from_directory(...)

# image files sorted into class-specific folders
main_directory/
...class_a/
......a_image_1.jpg
......a_image_2.jpg
...class_b/
......b_image_1.jpg
......b_image_2.jpg

from tensorflow import keras

dataset = keras.preprocessing.image_dataset_from_directory(
    'path/to/main_directory', batch_size=64, image_size=(200, 200))

# For demonstration, iterate over the batches yielded by the dataset.
for data, labels in dataset:
    print(data.shape)    # (64, 200, 200, 3): 64 images per batch, 200x200 pixels, 3 RGB channels
    print(data.dtype)    # float32
    print(labels.shape)  # (64,): 64 labels per batch
    print(labels.dtype)  # int32

Text files: keras.preprocessing.text_dataset_from_directory(...)
Likewise, the documents are sorted into class-specific folders.

from tensorflow import keras

dataset = keras.preprocessing.text_dataset_from_directory(
    'path/to/main_directory', batch_size=64)

# For demonstration, iterate over the batches yielded by the dataset.
for data, labels in dataset:
    print(data.shape)    # (64,)
    print(data.dtype)    # string
    print(labels.shape)  # (64,)
    print(labels.dtype)  # int32

Other:
tf.data.experimental.make_csv_dataset to load structured data from CSV files.
tf.data.Dataset.from_tensor_slices() to build a dataset from in-memory tensors or arrays.
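The article does not show how the example dataset used with .as_numpy_iterator() was built. The sketch below is one way to get a dataset of that shape with from_tensor_slices(); the toy values are an assumption, picked to match the sample output listed for .as_numpy_iterator() in this article.

```python
import tensorflow as tf

# Hypothetical toy data, chosen to match the article's sample output.
features = [[1, 3], [2, 1], [3, 3]]
labels = [["A"], ["B"], ["A"]]

# The inputs are sliced along the first axis, so each dataset element
# is one (features, label) pair.
dataset = tf.data.Dataset.from_tensor_slices((features, labels))

elements = list(dataset.as_numpy_iterator())
print(len(elements))  # 3
print(elements[0])
```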

Operations

Inspection method 1: iterate with .as_numpy_iterator()

print(list(dataset.as_numpy_iterator()))
# [(array([1, 3], dtype=int32), array([b'A'], dtype=object)), 
# (array([2, 1], dtype=int32), array([b'B'], dtype=object)), 
# (array([3, 3], dtype=int32), array([b'A'], dtype=object))]

Inspection method 2: a for loop

for element in dataset.as_numpy_iterator():
    print(element)
# (array([1, 3], dtype=int32), array([b'A'], dtype=object))
# (array([2, 1], dtype=int32), array([b'B'], dtype=object))
# (array([3, 3], dtype=int32), array([b'A'], dtype=object))

.take(count): takes the first count elements of the dataset (count batches, if the dataset has been batched).

for inputs, targets in dataset.take(1):
    print(inputs)   # tf.Tensor([1 3], shape=(2,), dtype=int32)
    print(targets)  # tf.Tensor([b'A'], shape=(1,), dtype=string)

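.batch(): sets the batch_size. A Dataset passed to fit() has to be batched; an unbatched one typically makes fit() fail with an error.

# 32 samples per batch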
dataset = dataset.batch(32)
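To see what .batch() does to the element shapes, here is a small sketch on a hypothetical dataset of 100 scalars. The last batch is smaller when the dataset size is not a multiple of the batch size, unless drop_remainder=True is passed.

```python
import tensorflow as tf

# Hypothetical dataset of 100 scalar elements.
dataset = tf.data.Dataset.range(100)

batched = dataset.batch(32)
sizes = [int(batch.shape[0]) for batch in batched]
print(sizes)  # [32, 32, 32, 4]

# drop_remainder=True discards the final partial batch.
even = dataset.batch(32, drop_remainder=True)
even_sizes = [int(batch.shape[0]) for batch in even]
print(even_sizes)  # [32, 32, 32]
```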

Data Preprocessing

vectorized & standardized

In short:

vectorized: non-numeric features are mapped to numbers, e.g. [dog, cat] → [0, 1]
standardized: values are rescaled into [0.0, 1.0], or standardized to zero mean and unit variance
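The two ideas above can be sketched in plain NumPy. This only illustrates the arithmetic and is not how the Keras layers are implemented; the toy values are made up.

```python
import numpy as np

# Vectorizing: map each distinct category to an integer index.
animals = ["dog", "cat", "cat", "dog"]
index_of = {name: i for i, name in enumerate(dict.fromkeys(animals))}
encoded = np.array([index_of[name] for name in animals])
print(encoded)  # [0 1 1 0]

# Rescaling into [0.0, 1.0] (min-max scaling).
values = np.array([10.0, 20.0, 30.0, 40.0])
rescaled = (values - values.min()) / (values.max() - values.min())
print(rescaled)

# Standardizing to zero mean and unit variance.
standardized = (values - values.mean()) / values.std()
print(standardized.mean(), standardized.std())
```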

In detail:

Text files
①need to be read into string tensors,
②then split into words.
③Finally, the words need to be indexed & turned into integer tensors.
Images
①need to be read and decoded into integer tensors,
②then converted to floating point and normalized to small values (usually between 0 and 1).
CSV data
①needs to be parsed, with numerical features converted to floating point tensors and categorical features indexed and converted to integer tensors.
②Then each feature typically needs to be normalized to zero-mean and unit-variance.
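As a rough illustration of the CSV steps above, the following sketch parses a tiny in-memory CSV with the standard csv module and NumPy. The column names and values are hypothetical, and a real pipeline would more likely use make_csv_dataset or preprocessing layers.

```python
import csv
import io
import numpy as np

# Hypothetical CSV content: one numerical and one categorical feature.
raw = "age,color\n20,red\n30,blue\n40,red\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Numerical feature -> floating point array.
age = np.array([float(r["age"]) for r in rows], dtype="float32")

# Categorical feature -> integer indices (sorted for a stable mapping).
categories = sorted({r["color"] for r in rows})
color_index = {c: i for i, c in enumerate(categories)}
color = np.array([color_index[r["color"]] for r in rows], dtype="int64")

# Normalize the numerical feature to zero mean and unit variance.
age_normalized = (age - age.mean()) / age.std()

print(color)  # [1 0 1]
print(age_normalized)
```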

Text

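Basics

tensorflow.keras.layers.experimental.preprocessing.TextVectorization: holds a mapping between string tokens and integer indices.

The vocabulary must consist of strings.
Index 0 is the padding value (the empty token "" used when a sentence has fewer words than the longest one); index 1 stands for out-of-vocabulary tokens (the vocabulary is set by adapt()).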

from tensorflow.keras.layers.experimental import preprocessing

vocabulary = ["aa bb cc"]
data = ["aa bb cc cc dd ee"]
layer = preprocessing.TextVectorization()
layer.adapt(vocabulary)          # build the vocabulary from this input
normalized_data = layer(data)    # encode data with the vocabulary learned by adapt()
print(normalized_data)
# tf.Tensor([[4 3 2 2 1 1]], shape=(1, 6), dtype=int64)

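The repeated word cc maps to 2 both times.
The out-of-vocabulary words dd and ee both map to 1.
When adapt() builds the vocabulary, punctuation and whitespace are ignored and only the words count. Duplicate words are kept once.
The vocabulary can be a 1-D array such as ["aa bb cc"] (a sentence), ["aa bb", "bb cc"] (sentences), or ["aa", "bb", "cc"] (words). It cannot be a bare string "aa bb cc" or a multi-column 2-D array such as [["aa", "bb"], ["aa", "cc"]], but a single-column 2-D array works: [["aa bb"], ["aa cc"]] (sentences) or [["aa"], ["bb"], ["cc"]] (words).
The data being encoded has the same format requirements, and the result is always 2-D. Note that each row is treated as one string "...".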

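data = ["aa bb cc", "cc dd"]
'''
tf.Tensor(
[[2 4 3]
 [3 1 0]], shape=(2, 3), dtype=int64)
'''

"Not enough words" refers to the two sentences in data: the largest word count becomes the column dimension of the result, and the missing words of shorter sentences are filled with 0. (The indices here differ from the previous example, so this output presumably comes from a different adapted vocabulary; what matters is the trailing 0.)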
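The indexing and padding rules can be made concrete with a pure-Python sketch. It is a simplified model of the idea and not the Keras implementation: the helper names are hypothetical, words are indexed in first-seen order (Keras orders its vocabulary differently), and only the conventions "0 = padding, 1 = out-of-vocabulary" are taken from the text.

```python
def build_vocabulary(sentences):
    """Map each distinct word to an index; 0 is padding, 1 is out-of-vocabulary."""
    vocab = {"": 0, "[UNK]": 1}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def vectorize(sentences, vocab):
    """Encode sentences as equal-length integer rows, padded with 0."""
    encoded = [[vocab.get(w, 1) for w in s.lower().split()] for s in sentences]
    width = max(len(row) for row in encoded)
    return [row + [0] * (width - len(row)) for row in encoded]

vocab = build_vocabulary(["aa bb cc"])
result = vectorize(["aa bb cc", "cc dd"], vocab)
print(result)  # [[2, 3, 4], [4, 1, 0]]
```

The second row shows both rules at once: dd is not in the vocabulary and becomes 1, and the row is padded with 0 to the length of the longest sentence.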

One-hot encoding

# Example: one-hot encoded bigrams
from tensorflow.keras.layers.experimental import preprocessing

vocabulary = ["aa bb cc"]
data = ["aa", "bb", "cc", "dd", ""]
layer = preprocessing.TextVectorization(output_mode="binary", ngrams=2)
layer.adapt(vocabulary)
integer_data = layer(data)
print(integer_data)
'''
tf.Tensor(
[[0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 1. 0. 0.]
 [0. 1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]], shape=(5, 6), dtype=float32)
'''

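Each non-empty input row has exactly one position set to 1 and 0 everywhere else; the empty string gives an all-zero row.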
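Why the output above has 6 columns becomes clearer with a pure-Python sketch of bigram extraction. This is an illustration under assumptions, not the Keras implementation: the helper names are hypothetical and the slot order is not the one Keras uses, but the slot count (one out-of-vocabulary slot, three words, two bigrams) matches.

```python
def tokens_with_bigrams(sentence):
    """Return the words of a sentence followed by its adjacent word pairs."""
    words = sentence.split()
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams

# Slot 0 is reserved for out-of-vocabulary tokens in this sketch.
vocab_tokens = ["[UNK]"] + tokens_with_bigrams("aa bb cc")
index_of = {token: i for i, token in enumerate(vocab_tokens)}

def multi_hot(sentence):
    """Set a 1 at the slot of every token present in the sentence."""
    row = [0.0] * len(vocab_tokens)
    for token in tokens_with_bigrams(sentence):
        row[index_of.get(token, 0)] = 1.0
    return row

print(vocab_tokens)     # ['[UNK]', 'aa', 'bb', 'cc', 'aa bb', 'bb cc']
print(multi_hot("aa"))  # [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
print(multi_hot("dd"))  # [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(multi_hot(""))    # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```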

Image & CSV: normalizing features

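Zero mean and unit variance: tensorflow.keras.layers.experimental.preprocessing.Normalization
adapt() accepts three kinds of input: a batched Dataset, a Tensor, or a NumPy array. A pd.DataFrame cannot be passed directly; convert it first, e.g. with df.to_numpy().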

import numpy as np
from tensorflow.keras.layers.experimental import preprocessing

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32)
normalizer = preprocessing.Normalization()
normalizer.adapt(data)           # learn the mean and variance of each column
normalized_data = normalizer(data)
print(normalized_data)
'''
tf.Tensor(
[[-1.2247448 -1.2247448 -1.2247448]
 [ 0.         0.         0.       ]
 [ 1.2247448  1.2247448  1.2247448]], shape=(3, 3), dtype=float32)
'''

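# tf.keras.utils.normalize() takes a NumPy array. Note that it L2-normalizes along the last axis
# (each row is scaled to unit length); it does not standardize to zero mean and unit variance.
import tensorflow as tf

normalized_data = tf.keras.utils.normalize(data)
print(normalized_data)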
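The Normalization layer and tf.keras.utils.normalize() compute different things, which is easy to check by hand in NumPy. The sketch below reproduces the arithmetic only: per-column standardization (what the Normalization output above shows) versus per-row L2 normalization (the default behaviour of tf.keras.utils.normalize()).

```python
import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32)

# Standardization per feature (column): subtract the mean, divide by the standard deviation.
standardized = (data - data.mean(axis=0)) / data.std(axis=0)
print(standardized)

# L2 normalization per sample (row): divide each row by its Euclidean length.
l2_normalized = data / np.linalg.norm(data, axis=1, keepdims=True)
print(l2_normalized)
```

The first result has rows of about -1.2247, 0 and 1.2247, matching the Normalization output; in the second, every row has length 1.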

Rescaling the range: tensorflow.keras.layers.experimental.preprocessing.Rescaling

import numpy as np
from tensorflow.keras.layers.experimental import preprocessing

# Example image data, with values in the [0, 255] range
training_data = np.random.randint(0, 256, size=(64, 200, 200, 3)).astype("float32")

# Rescale from [0, 255] to [0.0, 1.0]
output_data = preprocessing.Rescaling(scale=1.0 / 255)(training_data)

For a NumPy array you can simply divide:

a = np.array([25.5,255])
a = a/255
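The Rescaling layer computes inputs * scale + offset, so ranges other than [0.0, 1.0] are possible as well. A NumPy sketch of that arithmetic, with made-up pixel values:

```python
import numpy as np

pixels = np.array([0.0, 127.5, 255.0], dtype="float32")

# [0, 255] -> [0.0, 1.0]: scale by 1/255, no offset.
unit_range = pixels * (1.0 / 255)
print(unit_range)       # approximately 0, 0.5, 1

# [0, 255] -> [-1.0, 1.0]: scale by 1/127.5, then shift by -1.
symmetric_range = pixels * (1.0 / 127.5) + (-1.0)
print(symmetric_range)  # approximately -1, 0, 1
```

With the layer itself, the second case would correspond to Rescaling(scale=1.0 / 127.5, offset=-1).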

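Encoding labels

num_classes (3 here) must be greater than or equal to the largest label value + 1.
The class ids in y should cover [0, MAX], which matches num_classes exactly. Labels that start at 1 also work, but the result then contains a column 0 that is never used.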

import numpy as np
from tensorflow import keras

y = np.array([0, 2, 1, 2, 1])  # three classes: 0, 1, 2
y = keras.utils.to_categorical(y, 3)
print(y)
'''
[[1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 1. 0.]]
'''
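The rule about num_classes and the unused column 0 can be checked with a small NumPy stand-in for to_categorical. The helper one_hot is hypothetical and is only a sketch of the same idea, not the Keras function.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) float32 matrix with a single 1 per row."""
    labels = np.asarray(labels)
    if labels.max() >= num_classes:
        raise ValueError("num_classes must be at least max(labels) + 1")
    return np.eye(num_classes, dtype="float32")[labels]

print(one_hot([0, 2, 1, 2, 1], 3))

# Labels starting at 1 need num_classes=4, and column 0 is never set.
shifted = one_hot([1, 3, 2, 3, 2], 4)
print(shifted[:, 0])  # [0. 0. 0. 0. 0.]
```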

More

For preprocessing

Categorical data preprocessing layers

CategoryEncoding layer
Hashing layer
Discretization layer
StringLookup layer
IntegerLookup layer
CategoryCrossing layer

Image preprocessing & augmentation layers

Resizing layer
Rescaling layer
CenterCrop layer
RandomCrop layer
RandomFlip layer
RandomTranslation layer
RandomRotation layer
RandomZoom layer
RandomHeight layer
RandomWidth layer

Core preprocessing layers

TextVectorization layer
Normalization layer

For generating a Dataset

Dataset preprocessing

Image data preprocessing
- image_dataset_from_directory function
- load_img function
- img_to_array function
- ImageDataGenerator class
- flow method
- flow_from_dataframe method
- flow_from_directory method
Timeseries data preprocessing
- timeseries_dataset_from_array function
- pad_sequences function
- TimeseriesGenerator class
Text data preprocessing
- text_dataset_from_directory function
- Tokenizer class

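Preprocessing layers can be built into the Model

normalizer = preprocessing.Normalization()
normalizer.adapt(x_train)

inputs = keras.Input(shape=input_shape)
x = normalizer(inputs)             # the adapted preprocessing layer is the first layer of the model
outputs = rest_of_the_model(x)     # placeholder for the remaining layers
model = keras.Model(inputs, outputs)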
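For a version of this pattern that runs end to end, here is a minimal sketch. Several things in it are assumptions and not from the article: the random toy data, the two Dense layers standing in for the rest of the model, and the import path keras.layers.Normalization, which is where newer TensorFlow releases expose the layer (older releases use the experimental.preprocessing path shown elsewhere in this article).

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical training data: 100 samples with 3 features on a large scale.
x_train = np.random.rand(100, 3).astype("float32") * 100.0
y_train = np.random.randint(0, 2, size=(100, 1)).astype("float32")

# Learn the per-feature mean and variance before building the model.
normalizer = layers.Normalization()
normalizer.adapt(x_train)

inputs = keras.Input(shape=(3,))
x = normalizer(inputs)                      # preprocessing happens inside the model
x = layers.Dense(8, activation="relu")(x)   # stand-in for the rest of the model
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)

model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(x_train, y_train, batch_size=32, epochs=1, verbose=0)

# Raw, unnormalized inputs can be passed straight to the model.
predictions = model.predict(x_train[:5], verbose=0)
print(predictions.shape)  # (5, 1)
```

Keeping the normalizer inside the model means the same statistics are applied at training and inference time, so a saved model can be fed raw features.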
