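I. Introduction to tf.data

II. Reading Data

1. Reading data from memory (NumPy arrays)

2. Reading data from files

III. Transforming the Elements of a Dataset

1. Preprocessing data with Dataset.map()

2. Batching dataset elements with Dataset.batch()

3. Randomly shuffling input data with Dataset.shuffle()

4. Iterating over a dataset for multiple epochs with Dataset.repeat()

IV. Creating an Iterator to Access Dataset Elements

1. One-shot iterator

2. Initializable iterator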

I. Introduction to tf.data

tf.data is used to build input pipelines that load data into a model.

tf.data introduces two new abstractions into TensorFlow: tf.data.Dataset and tf.data.Iterator.

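Dataset: the base class for creating and transforming datasets. A dataset can be initialized in two ways: from data in memory or from a Python generator.

TextLineDataset: reads lines from text files to create a dataset.

TFRecordDataset: reads records from TFRecord files to create a dataset.

FixedLengthRecordDataset: reads fixed-size records from binary files to create a dataset.

Iterator: provides access to the elements of a dataset.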

II. Reading Data

1. Reading data from memory (NumPy arrays)

This approach suits small datasets: load all the data into NumPy arrays, then create a Dataset with tf.data.Dataset.from_tensor_slices().

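import numpy as np
import tensorflow as tf

# Load the training data into two NumPy arrays, for example using `np.load()`.
# The file must be an `.npz` archive holding `features` and `labels` arrays.
with np.load("/var/data/training_data.npz") as data:
  features = data["features"]
  labels = data["labels"]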

# Assume that each row of `features` corresponds to the same row as `labels`.
assert features.shape[0] == labels.shape[0]

dataset = tf.data.Dataset.from_tensor_slices((features, labels))
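To make the slicing behavior concrete, here is a rough pure-Python sketch of what "slicing along the first dimension" means for a (features, labels) pair. It is only an illustration, not TensorFlow code: the helper name `slice_rows` and the toy data are hypothetical, and the real from_tensor_slices() works on tensors rather than Python lists.

```python
def slice_rows(features, labels):
    """Yield one (feature_row, label) pair per index along the first dimension."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same number of rows")
    for feature_row, label in zip(features, labels):
        yield feature_row, label


# Hypothetical toy data: three examples with two features each.
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
labels = [0, 1, 0]

# Each dataset element pairs row i of `features` with row i of `labels`.
elements = list(slice_rows(features, labels))
```

Under these assumptions the three rows become three elements, which is why the row counts of the two arrays must match.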

2. Reading data from files

tf.data supports several file formats, so it can process large datasets that do not fit in memory.

Use the tf.data.TFRecordDataset class to read TFRecord files:

# Creates a dataset that reads all of the examples from two files.
filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
dataset = tf.data.TFRecordDataset(filenames)

Use the tf.data.TextLineDataset class to read text files:

filenames = ["/var/data/file1.txt", "/var/data/file2.txt"]
dataset = tf.data.TextLineDataset(filenames)

Use the tf.contrib.data.CsvDataset class to read CSV files:

# Creates a dataset that reads all of the records from two CSV files, each with
# eight float columns
filenames = ["/var/data/file1.csv", "/var/data/file2.csv"]
record_defaults = [tf.float32] * 8   # Eight required float columns
dataset = tf.contrib.data.CsvDataset(filenames, record_defaults)

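III. Transforming the Elements of a Dataset

The most common transformations on Dataset elements are map (transform each element), batch, shuffle, and repeat (iterate for multiple epochs).

1. Preprocessing data with Dataset.map()

The Dataset.map(f) transformation produces a new dataset by applying the given function f to each element of the input dataset.

Parsing tf.train.Example protocol buffer messages. Many input pipelines extract tf.train.Example protocol buffer messages from TFRecord files. Each tf.train.Example record contains one or more features, which the input pipeline typically converts into tensors.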

# Transforms a scalar string `example_proto` into a pair of a scalar string and
# a scalar integer, representing an image and its label, respectively.
def _parse_function(example_proto):
  features = {"image": tf.FixedLenFeature((), tf.string, default_value=""),
              "label": tf.FixedLenFeature((), tf.int64, default_value=0)}
  parsed_features = tf.parse_single_example(example_proto, features)
  return parsed_features["image"], parsed_features["label"]

# Creates a dataset that reads all of the examples from two files, and extracts
# the image and label features.
filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(_parse_function)

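Decoding image data and resizing it. When training a neural network on real-world image data, images of different sizes are usually converted to a common size so that they can be batched into fixed-size tensors.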

# Reads an image from a file, decodes it into a dense tensor, and resizes it
# to a fixed shape.
def _parse_function(filename, label):
  image_string = tf.read_file(filename)
  image_decoded = tf.image.decode_jpeg(image_string)
  image_resized = tf.image.resize_images(image_decoded, [28, 28])
  return image_resized, label

# A vector of filenames.
filenames = tf.constant(["/var/data/image1.jpg", "/var/data/image2.jpg", ...])

# `labels[i]` is the label for the image in `filenames[i]`.
labels = tf.constant([0, 37, ...])

dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(_parse_function)

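2. Batching dataset elements with Dataset.batch()

Dataset.batch() stacks n consecutive elements of a dataset into a single element. The constraint is that, for each component i, the tensors of all elements must have exactly the same shape.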

inc_dataset = tf.data.Dataset.range(100)
dec_dataset = tf.data.Dataset.range(0, -100, -1)
dataset = tf.data.Dataset.zip((inc_dataset, dec_dataset))
batched_dataset = dataset.batch(4)

iterator = batched_dataset.make_one_shot_iterator()
next_element = iterator.get_next()

sess = tf.Session()  # TF 1.x graph-mode session used by the calls below

print(sess.run(next_element))  # ==> ([0, 1, 2,   3],   [ 0, -1,  -2,  -3])
print(sess.run(next_element))  # ==> ([4, 5, 6,   7],   [-4, -5,  -6,  -7])
print(sess.run(next_element))  # ==> ([8, 9, 10, 11],   [-8, -9, -10, -11])
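The grouping that batching performs can be sketched in plain Python. This is an illustration only, not how TensorFlow implements it: the helper `batch` below is hypothetical and produces lists, whereas Dataset.batch() stacks tensors. It does mirror one detail of the default behavior, namely that a final batch smaller than the batch size is still emitted.

```python
def batch(elements, batch_size):
    """Group consecutive elements into lists of at most `batch_size` items."""
    current = []
    for element in elements:
        current.append(element)
        if len(current) == batch_size:
            yield current
            current = []
    if current:
        # The last batch may be smaller than batch_size; it is kept, not dropped.
        yield current


batches = list(batch(range(10), 4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Ten elements with a batch size of 4 give two full batches and one partial batch of two elements.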

3. Randomly shuffling input data with Dataset.shuffle()

Dataset.shuffle() maintains a fixed-size buffer and picks the next element at random from that buffer.

filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(...)
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(32)
dataset = dataset.repeat()
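The buffer-based idea can be sketched in plain Python to show why buffer_size matters. This is a simplified model under stated assumptions, not TensorFlow's implementation: the function name `buffered_shuffle` and its `seed` parameter are hypothetical. The sketch fills a buffer, emits a random buffered element each time a new input arrives, and drains the buffer in random order at the end.

```python
import random


def buffered_shuffle(elements, buffer_size, seed=None):
    """Yield elements in an order randomized within a sliding buffer."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for element in elements:
        if len(buffer) < buffer_size:
            buffer.append(element)  # still filling the buffer
            continue
        index = rng.randrange(buffer_size)
        yield buffer[index]      # emit a random buffered element
        buffer[index] = element  # the new input takes its slot
    rng.shuffle(buffer)          # input exhausted: drain what is left
    yield from buffer


shuffled = list(buffered_shuffle(range(20), buffer_size=5, seed=0))
unchanged = list(buffered_shuffle(range(10), buffer_size=1))
```

In this model an element can move forward by at most buffer_size - 1 positions, so a buffer of 1 leaves the order unchanged, and only a buffer at least as large as the dataset gives a full shuffle.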

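4. Iterating over a dataset for multiple epochs with Dataset.repeat()

Dataset.repeat() creates a dataset that repeats its input for multiple epochs. When called without an argument, it repeats the input indefinitely.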
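The order in which repeat and batch are applied changes what the batches contain. The pure-Python sketch below illustrates this; the helpers `repeat`, `batch`, and `one_pass` are hypothetical stand-ins for the semantics, not TensorFlow code.

```python
import itertools


def repeat(make_pass, count=None):
    """Yield the elements of a fresh pass `count` times (forever if None)."""
    epoch = 0
    while count is None or epoch < count:
        yield from make_pass()
        epoch += 1


def batch(elements, batch_size):
    """Group consecutive elements into lists of at most `batch_size` items."""
    iterator = iter(elements)
    while True:
        chunk = list(itertools.islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


def one_pass():
    """One epoch over a toy dataset of five elements."""
    return range(5)


# Repeating first lets batches straddle the boundary between epochs.
repeat_then_batch = list(batch(repeat(one_pass, count=2), 2))
# Batching first keeps every epoch's batches separate.
batch_then_repeat = list(repeat(lambda: batch(one_pass(), 2), count=2))

print(repeat_then_batch)  # [[0, 1], [2, 3], [4, 0], [1, 2], [3, 4]]
print(batch_then_repeat)  # [[0, 1], [2, 3], [4], [0, 1], [2, 3], [4]]
```

In this sketch, repeating before batching produces the batch [4, 0], which mixes two epochs, while batching before repeating ends each epoch with the partial batch [4].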

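IV. Creating an Iterator to Access Dataset Elements

To read values from a Dataset, build an iterator object, which gives access to one element of the dataset at a time.

1. One-shot iterator

A one-shot iterator, created with Dataset.make_one_shot_iterator(), supports only a single pass over the dataset and needs no explicit initialization. Currently, one-shot iterators are the only type that is easy to use with an Estimator.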

dataset = tf.data.Dataset.range(100)
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

sess = tf.Session()  # TF 1.x graph-mode session used by the loop below

for i in range(100):
  value = sess.run(next_element)
  assert i == value

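2. Initializable iterator

An initializable iterator, created with Dataset.make_initializable_iterator(), lets you parameterize the dataset definition with one or more tf.placeholder() tensors. You must explicitly run iterator.initializer before any elements can be read.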

max_value = tf.placeholder(tf.int64, shape=[])
dataset = tf.data.Dataset.range(max_value)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

sess = tf.Session()  # TF 1.x graph-mode session used by the calls below

# Initialize an iterator over a dataset with 10 elements.
sess.run(iterator.initializer, feed_dict={max_value: 10})
for i in range(10):
  value = sess.run(next_element)
  assert i == value

# Initialize the same iterator over a dataset with 100 elements.
sess.run(iterator.initializer, feed_dict={max_value: 100})
for i in range(100):
  value = sess.run(next_element)
  assert i == value
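The initialize-then-read pattern above can be modeled in plain Python. This is a loose analogy under stated assumptions, not TensorFlow code: the class `InitializableIterator` and its methods are hypothetical. In TensorFlow, reading past the end raises tf.errors.OutOfRangeError, whereas this sketch raises StopIteration.

```python
class InitializableIterator:
    """Iterator over a parameterized dataset that must be initialized before use."""

    def __init__(self, make_dataset):
        self._make_dataset = make_dataset  # callable that builds the dataset
        self._iterator = None

    def initialize(self, **params):
        """(Re)start iteration with concrete parameter values."""
        self._iterator = iter(self._make_dataset(**params))

    def get_next(self):
        if self._iterator is None:
            raise RuntimeError("iterator has not been initialized")
        return next(self._iterator)


iterator = InitializableIterator(lambda max_value: range(max_value))

# Initialize over a dataset with 10 elements.
iterator.initialize(max_value=10)
first_pass = [iterator.get_next() for _ in range(10)]

# Re-initialize the same iterator over a dataset with 100 elements.
iterator.initialize(max_value=100)
second_pass = [iterator.get_next() for _ in range(100)]
```

The same iterator object serves both passes. Only the parameter supplied at initialization changes, which is the role max_value plays through feed_dict in the example above.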

References:

https://developers.googleblog.com/2017/09/introducing-tensorflow-datasets.html

https://www.tensorflow.org/guide/datasets?hl=zh-cn#basic_mechanics
