
keras: Data Load Data Preprocessing

Published: 2024-01-12 15:23:26

Contents

  • Data Load
    • Format Transformation
    • Converting to tf.data.Dataset
      • Reading
      • Operations
  • Data Preprocessing
    • vectorized & standardized
    • Text
      • Basics
      • One-hot encoding
    • Image & CSV: normalizing features
    • Label encoding
    • More
      • For preprocessing
      • For building a Dataset
  • Preprocessing layers can be built into the Model


Data Load

Format Transformation

original format:

  • Images
  • Text files
  • CSV data

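You need to make your data available in one of 3 formats:

  • NumPy arrays
    Suitable when the data is small enough to fit in memory.
  • tf.data.Dataset objects:
    ① A high-performance input pipeline (for example, it can prefetch data asynchronously), which helps keep the GPU busy better than the other formats.
    ② Can stream data from disk, so it handles datasets too large to fit in memory.
  • Python generators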
Converting to tf.data.Dataset

Reading

  • Images: tf.keras.preprocessing.image_dataset_from_directory(...)
# image files sorted into class-specific folders
main_directory/
...class_a/
......a_image_1.jpg
......a_image_2.jpg
...class_b/
......b_image_1.jpg
......b_image_2.jpg
from tensorflow import keras

dataset = keras.preprocessing.image_dataset_from_directory(
    'path/to/main_directory', batch_size=64, image_size=(200, 200))

# For demonstration, iterate over the batches yielded by the dataset.
for data, labels in dataset:
    print(data.shape)    # (64, 200, 200, 3): 64 images per batch, 200x200 pixels, 3 RGB channels
    print(data.dtype)    # float32
    print(labels.shape)  # (64,): 64 labels per batch
    print(labels.dtype)  # int32
  • Text files: keras.preprocessing.text_dataset_from_directory(...)
    Likewise, the documents are sorted into class-specific folders.
dataset = keras.preprocessing.text_dataset_from_directory(
    'path/to/main_directory', batch_size=64)

# For demonstration, iterate over the batches yielded by the dataset.
for data, labels in dataset:
    print(data.shape)    # (64,)
    print(data.dtype)    # string
    print(labels.shape)  # (64,)
    print(labels.dtype)  # int32
  • Other:
    tf.data.experimental.make_csv_dataset to load structured data from CSV files.
    tf.data.Dataset.from_tensor_slices() to build a Dataset from in-memory tensors or NumPy arrays.
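
As a rough illustration of the last option, the sketch below builds a small Dataset from NumPy arrays with tf.data.Dataset.from_tensor_slices(). The toy values are assumptions, chosen to resemble the elements printed in the next section.

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy data: three samples, each with two integer features and one string label.
features = np.array([[1, 3], [2, 1], [3, 3]], dtype=np.int32)
labels = np.array([["A"], ["B"], ["A"]])

# Slice along the first axis: each dataset element is one (features, label) pair.
dataset = tf.data.Dataset.from_tensor_slices((features, labels))

elements = list(dataset.as_numpy_iterator())
print(len(elements))  # 3
print(elements[0])
```

Note that string values come back as bytes objects when the elements are converted to NumPy.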

Operations

  • Inspection method 1: materialize the elements with .as_numpy_iterator()
print(list(dataset.as_numpy_iterator()))
# [(array([1, 3], dtype=int32), array([b'A'], dtype=object)), 
# (array([2, 1], dtype=int32), array([b'B'], dtype=object)), 
# (array([3, 3], dtype=int32), array([b'A'], dtype=object))]
  • Inspection method 2: a for loop
for element in dataset.as_numpy_iterator():
    print(element)
# (array([1, 3], dtype=int32), array([b'A'], dtype=object))
# (array([2, 1], dtype=int32), array([b'B'], dtype=object))
# (array([3, 3], dtype=int32), array([b'A'], dtype=object))
  • .take(count): takes the first count elements (these are batches if the dataset has been batched).
for inputs, targets in dataset.take(1):
    print(inputs)   # tf.Tensor([1 3], shape=(2,), dtype=int32)
    print(targets)  # tf.Tensor([b'A'], shape=(1,), dtype=string)
  • .batch(): sets the batch size. Keras does not batch a Dataset for you, so call it before fit(); an unbatched dataset typically makes fit() raise an error.
# 32 samples per batch
dataset = dataset.batch(32)
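
The effect of .take() and .batch() on element shapes can be sketched as follows. The toy arrays are assumptions for illustration only.

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy data: 5 samples with 2 features each, plus integer labels.
features = np.arange(10, dtype=np.float32).reshape(5, 2)
labels = np.array([0, 1, 0, 1, 0], dtype=np.int32)

dataset = tf.data.Dataset.from_tensor_slices((features, labels))

# Unbatched: each element is a single sample.
for x, y in dataset.take(1):
    unbatched_shape = tuple(x.shape)

# Batched: each element groups up to 2 samples; the last batch may be smaller.
batched = dataset.batch(2)
batch_shapes = [tuple(x.shape) for x, _ in batched]

print(unbatched_shape)  # (2,)
print(batch_shapes)     # [(2, 2), (2, 2), (1, 2)]
```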

Data Preprocessing

vectorized & standardized

In short:

  • vectorized: map non-numeric features to numbers, e.g. [dog, cat] → [0, 1]
  • standardized: rescale the values, either into the range [0.0, 1.0] or to zero mean and unit variance

In detail:

  • Text files
    ①need to be read into string tensors,
    ②then split into words.
    ③Finally, the words need to be indexed & turned into integer tensors.
  • Images
    ①need to be read and decoded into integer tensors,
    ②then converted to floating point and normalized to small values (usually between 0 and 1).
  • CSV data
    ①needs to be parsed, with numerical features converted to floating point tensors and categorical features indexed and converted to integer tensors.
    ②Then each feature typically needs to be normalized to zero-mean and unit-variance.
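
The image and CSV steps above can be sketched in plain NumPy. This is only an illustration of the arithmetic, not what Keras does internally, and all the values are made up.

```python
import numpy as np

# Images: decoded pixels are integers in [0, 255]; convert to float and scale to [0, 1].
pixels = np.array([[0, 128, 255]], dtype=np.uint8)   # hypothetical decoded pixels
scaled = pixels.astype(np.float32) / 255.0

# Categorical CSV column: index each distinct string with an integer.
colors = np.array(["red", "green", "red", "blue"])   # hypothetical column
categories, color_ids = np.unique(colors, return_inverse=True)

# Numerical CSV column: standardize to zero mean and unit variance.
ages = np.array([20.0, 30.0, 40.0])                  # hypothetical column
standardized = (ages - ages.mean()) / ages.std()

print(scaled)
print(categories, color_ids)  # ['blue' 'green' 'red'] [2 1 2 0]
print(standardized)
```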

Text

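Basics

tensorflow.keras.layers.experimental.preprocessing.TextVectorization: holds a mapping between string tokens and integer indices.

  • The vocabulary must consist of strings.
  • Index 0 is the padding value (the empty token "" used when a sentence has fewer words than the longest one), and index 1 stands for out-of-vocabulary words (the vocabulary is set by adapt()).
from tensorflow.keras.layers.experimental import preprocessing

vocabulary = ["aa bb cc"]
data = ["aa bb cc cc dd ee"]
layer = preprocessing.TextVectorization()
layer.adapt(vocabulary)        # build the vocabulary from this text
normalized_data = layer(data)  # encode data with the vocabulary learned by adapt()
print(normalized_data)
# tf.Tensor([[4 3 2 2 1 1]], shape=(1, 6), dtype=int64)
  • The repeated word cc maps to 2 both times.
  • The out-of-vocabulary words dd and ee both map to 1.
  • When adapt() builds the vocabulary, punctuation and whitespace are ignored and only the words count. Duplicate words are kept once.
  • The vocabulary can be a 1-D array such as ["aa bb cc"] (a sentence), ["aa bb", "bb cc"] (sentences), or ["aa", "bb", "cc"] (words). It cannot be a plain string "aa bb cc", and it cannot be a multi-column 2-D array such as [["aa", "bb"], ["aa", "cc"]], but a single-column 2-D array is fine: [["aa bb"], ["aa cc"]] (sentences) or [["aa"], ["bb"], ["cc"]] (words).
  • data has the same format requirements, and the result is always 2-D. Note that each row is treated as one "..."
data = ["aa bb cc", "cc dd"]
# The exact indices depend on the vocabulary passed to adapt().
''' tf.Tensor(
[[2 4 3]
 [3 1 0]], shape=(2, 3), dtype=int64) '''
  • "Too short" refers to the sentences in data: the largest word count among them sets the number of columns in the result, and every shorter sentence is padded with 0 for its missing words.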
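
The indexing rules above (0 for padding, 1 for out-of-vocabulary words, real words from 2 upward) can be imitated in pure Python. This is a simplified sketch, not the layer's implementation: the helper names are hypothetical, punctuation is not stripped, and ties between equally frequent words are broken alphabetically, which may differ from the order TextVectorization uses.

```python
from collections import Counter

def build_vocabulary(sentences):
    """Map each distinct word to an index >= 2; 0 is padding, 1 is out-of-vocabulary."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    # Most frequent first; ties broken alphabetically (an assumption of this sketch).
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i + 2 for i, word in enumerate(ordered)}

def vectorize(sentences, vocab):
    """Turn sentences into equal-length rows of indices, padding short rows with 0."""
    rows = [[vocab.get(w, 1) for w in s.lower().split()] for s in sentences]
    width = max(len(r) for r in rows)
    return [r + [0] * (width - len(r)) for r in rows]

vocab = build_vocabulary(["aa bb cc"])
encoded = vectorize(["aa bb cc", "cc dd"], vocab)
print(vocab)    # {'aa': 2, 'bb': 3, 'cc': 4}
print(encoded)  # [[2, 3, 4], [4, 1, 0]]
```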

One-hot encoding

# Example: one-hot encoded bigrams
from tensorflow.keras.layers.experimental import preprocessing

vocabulary = ["aa bb cc"]
data = ["aa", "bb", "cc", "dd", ""]
# Note: newer TensorFlow releases call this output mode "multi_hot".
layer = preprocessing.TextVectorization(output_mode="binary", ngrams=2)
layer.adapt(vocabulary)
integer_data = layer(data)
print(integer_data)
''' tf.Tensor(
[[0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 1. 0. 0.]
 [0. 1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]], shape=(5, 6), dtype=float32) '''

Each non-empty input here sets exactly one position to 1 and leaves the rest at 0; the empty string yields an all-zero row.
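
To see where the 6 columns come from, the sketch below lists the terms that ngrams=2 would produce from "aa bb cc": every single word plus every pair of adjacent words. The helper `ngrams_up_to` is hypothetical, and reading the sixth column as the out-of-vocabulary slot is an interpretation based on the "dd" row in the output above.

```python
def ngrams_up_to(tokens, n):
    """Return all contiguous word groups of length 1..n, joined by spaces."""
    grams = []
    for size in range(1, n + 1):
        for start in range(len(tokens) - size + 1):
            grams.append(" ".join(tokens[start:start + size]))
    return grams

tokens = "aa bb cc".split()
vocab_terms = ngrams_up_to(tokens, 2)
print(vocab_terms)           # ['aa', 'bb', 'cc', 'aa bb', 'bb cc']
print(len(vocab_terms) + 1)  # 6: the five terms plus one out-of-vocabulary slot
```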

Image & CSV: normalizing features

  • Zero mean and unit variance: tensorflow.keras.layers.experimental.preprocessing.Normalization
    adapt() accepts three input types: a batched Dataset, a Tensor, or a NumPy array. A pd.DataFrame cannot be passed directly.
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers.experimental import preprocessing

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32)
normalizer = preprocessing.Normalization()
normalizer.adapt(data)
normalized_data = normalizer(data)
print(normalized_data)
''' tf.Tensor(
[[-1.2247448 -1.2247448 -1.2247448]
 [ 0.         0.         0.       ]
 [ 1.2247448  1.2247448  1.2247448]], shape=(3, 3), dtype=float32) '''
# tf.keras.utils.normalize() works on a NumPy array. By default it L2-normalizes
# along the last axis, which is not the same as mean/variance standardization.
normalized_data = tf.keras.utils.normalize(data)
print(normalized_data)
  • Rescaling to a range: tensorflow.keras.layers.experimental.preprocessing.Rescaling
import numpy as np
from tensorflow.keras.layers.experimental import preprocessing

# Example image data, with values in the [0, 255] range
training_data = np.random.randint(0, 256, size=(64, 200, 200, 3)).astype("float32")
# Rescale from [0, 255] to [0.0, 1.0]
output_data = preprocessing.Rescaling(scale=1.0 / 255)(training_data)

With a NumPy array, you can simply divide:

a = np.array([25.5, 255])
a = a / 255
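
The Normalization layer and tf.keras.utils.normalize() used earlier in this section compute different things, which a NumPy sketch can make visible. It assumes the default settings: per-feature (column) statistics for the layer, and an L2 norm along the last axis for the utility function.

```python
import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32)

# Standardization: per column, subtract the mean and divide by the standard deviation.
standardized = (data - data.mean(axis=0)) / data.std(axis=0)

# L2 normalization: per row, divide by the Euclidean norm of that row.
l2_normalized = data / np.linalg.norm(data, axis=-1, keepdims=True)

print(standardized)   # first row is about -1.2247 in every column
print(l2_normalized)  # every row has length 1
```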

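Label encoding

  • num_classes (3 here) must be at least the largest label value + 1.
  • The classes in y should be numbered [0, MAX], which matches num_classes exactly. Starting from 1 also works, but the result then contains a column 0 that is never used.
import numpy as np
from tensorflow import keras

y = np.array([0, 2, 1, 2, 1])  # three classes: 0, 1, 2
y = keras.utils.to_categorical(y, 3)
print(y)
''' [[1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 1. 0.]] '''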
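
The same encoding can be reproduced in NumPy, which also shows why labels that start from 1 leave an unused column. The helper `one_hot` is a hypothetical stand-in for illustration, not the Keras implementation.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Row i has a 1.0 in column labels[i] and 0.0 elsewhere."""
    return np.eye(num_classes, dtype=np.float32)[labels]

y = np.array([0, 2, 1, 2, 1])
encoded = one_hot(y, 3)
print(encoded)

# Labels starting from 1 need num_classes = max + 1 = 4, and column 0 is never set.
y_from_one = np.array([1, 3, 2])
encoded_from_one = one_hot(y_from_one, 4)
print(encoded_from_one[:, 0])  # [0. 0. 0.]
```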

More

For preprocessing

Categorical data preprocessing layers

  • CategoryEncoding layer
  • Hashing layer
  • Discretization layer
  • StringLookup layer
  • IntegerLookup layer
  • CategoryCrossing layer

Image preprocessing & augmentation layers

  • Resizing layer
  • Rescaling layer
  • CenterCrop layer
  • RandomCrop layer
  • RandomFlip layer
  • RandomTranslation layer
  • RandomRotation layer
  • RandomZoom layer
  • RandomHeight layer
  • RandomWidth layer

Core preprocessing layers

  • TextVectorization layer
  • Normalization layer

For building a Dataset

Dataset preprocessing

  • Image data preprocessing
    • image_dataset_from_directory function
    • load_img function
    • img_to_array function
    • ImageDataGenerator class
    • flow method
    • flow_from_dataframe method
    • flow_from_directory method
  • Timeseries data preprocessing
    • timeseries_dataset_from_array function
    • pad_sequences function
    • TimeseriesGenerator class
  • Text data preprocessing
    • text_dataset_from_directory function
    • Tokenizer class

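Preprocessing layers can be built into the Model

# x_train, input_shape and rest_of_the_model are placeholders for your own data and layers.
normalizer = preprocessing.Normalization()
normalizer.adapt(x_train)

inputs = keras.Input(shape=input_shape)
x = normalizer(inputs)
outputs = rest_of_the_model(x)
model = keras.Model(inputs, outputs)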
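
A complete, runnable version of this pattern might look like the sketch below. It makes several assumptions: the training data is a made-up toy array, a single Dense layer stands in for the rest of the model, and the layer is imported as keras.layers.Normalization, the non-experimental path available in newer TensorFlow releases.

```python
import numpy as np
from tensorflow import keras

# Hypothetical training data: 4 samples with 3 features each.
x_train = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]], dtype=np.float32)

# Learn the per-feature mean and variance from the training data.
normalizer = keras.layers.Normalization(axis=-1)
normalizer.adapt(x_train)

# The normalizer is the first layer, so the model accepts raw, unnormalized features.
inputs = keras.Input(shape=(3,))
x = normalizer(inputs)
outputs = keras.layers.Dense(1)(x)  # stand-in for the rest of the model
model = keras.Model(inputs, outputs)

predictions = model(x_train)
print(predictions.shape)  # (4, 1)
```

With the preprocessing inside the model, the saved model can be fed raw features at inference time without a separate normalization step.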