A Grad Student Tries to Explain Things Simply


tf.data tutorial translation (5)

This post translates the guide below (with personal study notes added). Examples target TF 2.0.

https://www.tensorflow.org/guide/data?hl=en


Using high-level APIs

tf.keras

The tf.keras API simplifies many aspects of creating and executing machine learning models. Its .fit(), .evaluate(), and .predict() methods accept datasets as inputs. Here is a quick dataset and model setup using Keras.

import numpy as np
import tensorflow as tf

train, test = tf.keras.datasets.fashion_mnist.load_data()

images, labels = train
images = images/255.0
labels = labels.astype(np.int32)

fmnist_train_ds = tf.data.Dataset.from_tensor_slices((images, labels))
fmnist_train_ds = fmnist_train_ds.shuffle(5000).batch(32)

model = tf.keras.Sequential([
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(10)
])

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), 
              metrics=['accuracy'])

Passing a dataset of (feature, label) pairs is all that Model.fit and Model.evaluate need.

model.fit(fmnist_train_ds, epochs=2)

Train for 1875 steps
Epoch 1/2
1875/1875 [==============================] - 4s 2ms/step - loss: 0.6066 - accuracy: 0.7949
Epoch 2/2
1875/1875 [==============================] - 4s 2ms/step - loss: 0.4627 - accuracy: 0.8417

<tensorflow.python.keras.callbacks.History at 0x7feeac4f45f8>

If you pass an infinite dataset, for example one made with Dataset.repeat(), you also need to pass the steps_per_epoch argument to model.fit.

model.fit(fmnist_train_ds.repeat(), epochs=2, steps_per_epoch=20)

Train for 20 steps
Epoch 1/2
20/20 [==============================] - 0s 14ms/step - loss: 0.4443 - accuracy: 0.8531
Epoch 2/2
20/20 [==============================] - 0s 2ms/step - loss: 0.4467 - accuracy: 0.8422

<tensorflow.python.keras.callbacks.History at 0x7feeb4454198>

Evaluation works as follows.

loss, accuracy = model.evaluate(fmnist_train_ds)
print("Loss :", loss)
print("Accuracy :", accuracy)

1875/1875 [==============================] - 3s 2ms/step - loss: 0.4412 - accuracy: 0.8485
Loss : 0.4411765540321668
Accuracy : 0.84845

For a long dataset, such as one extended with repeat(), pass the steps argument to set how many batches to evaluate.

loss, accuracy = model.evaluate(fmnist_train_ds.repeat(), steps=10)
print("Loss :", loss)
print("Accuracy :", accuracy)

10/10 [==============================] - 0s 3ms/step - loss: 0.4964 - accuracy: 0.8156
Loss : 0.4964326351881027
Accuracy : 0.815625

Model.predict does not require labels.

predict_ds = tf.data.Dataset.from_tensor_slices(images).batch(32)
result = model.predict(predict_ds, steps = 10)
print(result.shape)

(320, 10)

If you pass a dataset that does contain labels, the labels are ignored automatically.

result = model.predict(fmnist_train_ds, steps = 10)
print(result.shape)

(320, 10)

tf.estimator

- Please refer to the official guide.


tf.data tutorial translation (4)

This post translates the guide below (with personal study notes added). Examples target TF 2.0.

https://www.tensorflow.org/guide/data?hl=en

Preprocessing data

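Dataset.map(f) produces a new dataset by applying a given function f to each element of the input dataset. It is based on the map() function that functional programming languages commonly apply to lists and other structures. The function f takes the tf.Tensor objects that represent a single element of the input and returns the tf.Tensor objects that will represent a single element of the new dataset. Its implementation uses standard TensorFlow operations to transform one element into another.

This section covers how to use Dataset.map().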

Decoding image data and resizing it

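When training on real-world image data, images of different sizes usually have to be converted to a common size so that they can be batched into fixed-size batches. Let's use the flowers dataset.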

# flowers_root is assumed to be a pathlib.Path pointing at the extracted flower_photos directory.
list_ds = tf.data.Dataset.list_files(str(flowers_root/'*/*'))

The following function turns each file path into an (image, label) pair.

# Reads an image from a file, decodes it into a dense tensor, and resizes it
# to a fixed shape.
def parse_image(filename):
  parts = tf.strings.split(filename, '/')
  label = parts[-2]

  image = tf.io.read_file(filename)
  image = tf.image.decode_jpeg(image)
  image = tf.image.convert_image_dtype(image, tf.float32)
  image = tf.image.resize(image, [128, 128])
  return image, label

tf.io.read_file: reads the file contents,
tf.image.decode_jpeg: decodes the JPEG data into an image tensor,
tf.image.convert_image_dtype: converts the image to tf.float32,
tf.image.resize: resizes the image to [128, 128].

Let's test it.

file_path = next(iter(list_ds))
image, label = parse_image(file_path)

import matplotlib.pyplot as plt

def show(image, label):
  plt.figure()
  plt.imshow(image)
  plt.title(label.numpy().decode('utf-8'))
  plt.axis('off')

show(image, label)

Now apply it to the whole dataset with map.

images_ds = list_ds.map(parse_image)

for image, label in images_ds.take(2):
  show(image, label)

Applying arbitrary Python logic

For performance reasons, use TensorFlow operations for preprocessing whenever possible. Sometimes, though, a Python library function is useful for processing input data. For that, use tf.py_function() inside Dataset.map().

For example, you may want to apply a random rotation, but tf.image only offers tf.image.rot90, which is not very useful for augmentation. To try out tf.py_function(), let's use scipy.ndimage.rotate instead.

import scipy.ndimage as ndimage

def random_rotate_image(image):
  image = ndimage.rotate(image, np.random.uniform(-30, 30), reshape=False)
  return image

image, label = next(iter(images_ds))
image = random_rotate_image(image)
show(image, label)

To use this function with Dataset.map(), you have to declare the return shapes and types, just as with Dataset.from_generator.

def tf_random_rotate_image(image, label):
  im_shape = image.shape
  [image,] = tf.py_function(random_rotate_image, [image], [tf.float32])
  image.set_shape(im_shape)
  return image, label

The shape is declared with set_shape, and the type with the [tf.float32] argument passed to tf.py_function.

rot_ds = images_ds.map(tf_random_rotate_image)

for image, label in rot_ds.take(2):
  show(image, label)

Recall that images_ds yields the flower images resized to [128, 128], together with their labels.

Parsing tf.Example protocol buffer messages

Many input pipelines extract tf.train.Example protocol buffer messages from the TFRecord format. Each tf.train.Example contains one or more "features", and the input pipeline typically converts these features into tensors.

fsns_test_file = tf.keras.utils.get_file("fsns.tfrec", "https://storage.googleapis.com/download.tensorflow.org/data/fsns-20160927/testdata/fsns-00000-of-00001")
dataset = tf.data.TFRecordDataset(filenames = [fsns_test_file])
dataset

<TFRecordDatasetV2 shapes: (), types: tf.string>

You can parse a single record with the tf.train.Example proto to inspect the data held in the tf.data.Dataset.

raw_example = next(iter(dataset))
parsed = tf.train.Example.FromString(raw_example.numpy())

feature = parsed.features.feature
raw_img = feature['image/encoded'].bytes_list.value[0]
img = tf.image.decode_png(raw_img)
plt.imshow(img)
plt.axis('off')
_ = plt.title(feature["image/text"].bytes_list.value[0])

tf.train.Example.FromString parses the record into its features. feature['image/encoded'] holds the encoded image, and feature["image/text"] holds the label text for that image.

raw_example = next(iter(dataset))

def tf_parse(eg):
  example = tf.io.parse_example(
      eg[tf.newaxis], {
          'image/encoded': tf.io.FixedLenFeature(shape=(), dtype=tf.string),
          'image/text': tf.io.FixedLenFeature(shape=(), dtype=tf.string)
      })
  return example['image/encoded'][0], example['image/text'][0]
  
img, txt = tf_parse(raw_example)
print(txt.numpy())
print(repr(img.numpy()[:20]), "...")

b'Rue Perreyon'
b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x02X' ...
tf.io.FixedLenFeature describes a fixed-length input feature to be parsed.

decoded = dataset.map(tf_parse)
decoded

<MapDataset shapes: ((), ()), types: (tf.string, tf.string)>

image_batch, text_batch = next(iter(decoded.batch(10)))
image_batch.shape

TensorShape([10])

Time series windowing

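For an end-to-end time series example, see the time series forecasting tutorial in the official TensorFlow documentation.

Time series data is usually organized with the time axis left intact. Let's use Dataset.range to demonstrate.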

range_ds = tf.data.Dataset.range(100000)

Models that work with this kind of data typically want a contiguous time slice (for example per day, per week, or per month). The simplest approach is to batch the data.

Using batch

batches = range_ds.batch(10, drop_remainder=True)

for batch in batches.take(5):
  print(batch.numpy())

[0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24 25 26 27 28 29]
[30 31 32 33 34 35 36 37 38 39]
[40 41 42 43 44 45 46 47 48 49]
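drop_remainder=True drops the final batch when it has fewer elements than the batch size.

Alternatively, to make dense predictions one step into the future, shift the features and labels by one step relative to each other.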

def dense_1_step(batch):
  # Shift features and labels one step relative to each other.
  return batch[:-1], batch[1:]

predict_dense_1_step = batches.map(dense_1_step)

for features, label in predict_dense_1_step.take(3):
  print(features.numpy(), " => ", label.numpy())

[0 1 2 3 4 5 6 7 8]  =>  [1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18]  =>  [11 12 13 14 15 16 17 18 19]
[20 21 22 23 24 25 26 27 28]  =>  [21 22 23 24 25 26 27 28 29]

batches = range_ds.batch(15, drop_remainder=True)

def label_next_5_steps(batch):
  return (batch[:-5],   # Features: everything except the last 5 steps
          batch[-5:])   # Labels: the last 5 steps

predict_5_steps = batches.map(label_next_5_steps)

for features, label in predict_5_steps.take(3):
  print(features.numpy(), " => ", label.numpy())

The dataset uses a batch size of 15. In label_next_5_steps, batch[:-5] returns the first 10 values (0 to 9) as features, and batch[-5:] returns the last 5 values (10 to 14) as labels.
[0 1 2 3 4 5 6 7 8 9]  =>  [10 11 12 13 14]
[15 16 17 18 19 20 21 22 23 24]  =>  [25 26 27 28 29]
[30 31 32 33 34 35 36 37 38 39]  =>  [40 41 42 43 44]

To allow the features of one batch to overlap with the labels of another, use Dataset.zip.

feature_length = 10
label_length = 5

features = range_ds.batch(feature_length, drop_remainder=True)
labels = range_ds.batch(feature_length).skip(1).map(lambda labels: labels[:label_length])

predict_5_steps = tf.data.Dataset.zip((features, labels))

for features, label in predict_5_steps.take(3):
  print(features.numpy(), " => ", label.numpy())

[0 1 2 3 4 5 6 7 8 9]  =>  [10 11 12 13 14]
[10 11 12 13 14 15 16 17 18 19]  =>  [20 21 22 23 24]
[20 21 22 23 24 25 26 27 28 29]  =>  [30 31 32 33 34]
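
To see why the zip recipe lines up, here is a minimal pure-Python sketch (no TensorFlow). The helper name `batch` is hypothetical and only mimics Dataset.batch on a plain range; it is meant as an illustration, not as how tf.data is implemented.

```python
from itertools import islice

def batch(iterable, size, drop_remainder=False):
    """Group consecutive items into lists of `size` (a stand-in for Dataset.batch)."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk or (drop_remainder and len(chunk) < size):
            return
        yield chunk

feature_length = 10
label_length = 5

features = batch(range(100), feature_length, drop_remainder=True)
# Equivalent of skip(1): drop the first batch so each label block comes from the *next* batch.
shifted = islice(batch(range(100), feature_length), 1, None)
labels = (b[:label_length] for b in shifted)

pairs = list(islice(zip(features, labels), 3))
for f, l in pairs:
    print(f, "=>", l)
```

Each feature batch is paired with the first five values of the batch that follows it, which is exactly the overlap shown in the output above.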

Using window

Dataset.batch works, but there are situations where you need finer control. Dataset.window gives you that control, with one catch: it returns a Dataset of Datasets. See the Dataset structure section of the official guide for details.

window_size = 5

windows = range_ds.window(window_size, shift=1)
for sub_ds in windows.take(5):
  print(sub_ds)

<_VariantDataset shapes: (), types: tf.int64>
<_VariantDataset shapes: (), types: tf.int64>
<_VariantDataset shapes: (), types: tf.int64>
<_VariantDataset shapes: (), types: tf.int64>
<_VariantDataset shapes: (), types: tf.int64>

The Dataset.flat_map method takes a dataset of datasets and flattens it into a single dataset.

for x in windows.flat_map(lambda x: x).take(30):
  print(x.numpy(), end=' ')

0 1 2 3 4 1 2 3 4 5 2 3 4 5 6 3 4 5 6 7 4 5 6 7 8 5 6 7 8 9
windows yields data as [0, 1, 2, 3, 4] --> [1, 2, 3, 4, 5] --> [2, 3, 4, 5, 6] --> ...

In nearly all cases, you will want to .batch each sub-dataset first.

def sub_to_batch(sub):
  return sub.batch(window_size, drop_remainder=True)

for example in windows.flat_map(sub_to_batch).take(5):
  print(example.numpy())

[0 1 2 3 4]
[1 2 3 4 5]
[2 3 4 5 6]
[3 4 5 6 7]
[4 5 6 7 8]

The shift argument controls how far each window moves over. The following function puts this together.

def make_window_dataset(ds, window_size=5, shift=1, stride=1):
  windows = ds.window(window_size, shift=shift, stride=stride)

  def sub_to_batch(sub):
    return sub.batch(window_size, drop_remainder=True)

  windows = windows.flat_map(sub_to_batch)
  return windows

ds = make_window_dataset(range_ds, window_size=10, shift = 5, stride=3)

for example in ds.take(10):
  print(example.numpy())

[ 0 3 6 9 12 15 18 21 24 27]
[ 5 8 11 14 17 20 23 26 29 32]
[10 13 16 19 22 25 28 31 34 37]
[15 18 21 24 27 30 33 36 39 42]
[20 23 26 29 32 35 38 41 44 47]
[25 28 31 34 37 40 43 46 49 52]
[30 33 36 39 42 45 48 51 54 57]
[35 38 41 44 47 50 53 56 59 62]
[40 43 46 49 52 55 58 61 64 67]
[45 48 51 54 57 60 63 66 69 72]
The shift is visible in the first value of each row, and the stride in the difference between neighbouring values within a row.
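
The roles of window_size, shift, and stride can be checked without TensorFlow. The sketch below is a plain-Python approximation of make_window_dataset on a finite range; the function name `make_windows` is made up for this illustration, and incomplete windows are dropped to mirror drop_remainder=True.

```python
def make_windows(values, window_size=5, shift=1, stride=1):
    """Plain-Python stand-in for ds.window(...) followed by batching each window."""
    values = list(values)
    windows = []
    for start in range(0, len(values), shift):      # shift: gap between window starts
        window = values[start::stride][:window_size]  # stride: gap inside a window
        if len(window) == window_size:              # mirror drop_remainder=True
            windows.append(window)
    return windows

rows = make_windows(range(100), window_size=10, shift=5, stride=3)
for row in rows[:3]:
    print(row)
```

The first rows start at 0, 5, 10 (the shift) and step by 3 inside each row (the stride), matching the TensorFlow output above.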

Now labels are easy to extract, as before.

dense_labels_ds = ds.map(dense_1_step)

for inputs,labels in dense_labels_ds.take(3):
  print(inputs.numpy(), "=>", labels.numpy())

[ 0 3 6 9 12 15 18 21 24] => [ 3 6 9 12 15 18 21 24 27]
[ 5 8 11 14 17 20 23 26 29] => [ 8 11 14 17 20 23 26 29 32]
[10 13 16 19 22 25 28 31 34] => [13 16 19 22 25 28 31 34 37]
dense_1_step is the function defined earlier that returns batch[:-1] and batch[1:].

Resampling

When working with a heavily class-imbalanced dataset, you may want to resample it. tf.data provides two ways to do this. The credit card fraud dataset is a good example of this kind of problem.

zip_path = tf.keras.utils.get_file(
    origin='https://storage.googleapis.com/download.tensorflow.org/data/creditcard.zip',
    fname='creditcard.zip',
    extract=True)

csv_path = zip_path.replace('.zip', '.csv')

creditcard_ds = tf.data.experimental.make_csv_dataset(
    csv_path, batch_size=1024, label_name="Class",
    # Set the column types: 30 floats and an int.
    column_defaults=[float()]*30+[int()])

Check the class distribution. It is highly skewed.

def count(counts, batch):
  features, labels = batch
  class_1 = labels == 1
  class_1 = tf.cast(class_1, tf.int32)

  class_0 = labels == 0
  class_0 = tf.cast(class_0, tf.int32)

  counts['class_0'] += tf.reduce_sum(class_0)
  counts['class_1'] += tf.reduce_sum(class_1)

  return counts

counts = creditcard_ds.take(10).reduce(
    initial_state={'class_0': 0, 'class_1': 0},
    reduce_func = count)

counts = np.array([counts['class_0'].numpy(),
                   counts['class_1'].numpy()]).astype(np.float32)

fractions = counts/counts.sum()
print(fractions)

Dataset.reduce folds the whole dataset into a single result by repeatedly applying reduce_func to the running state and the next element.
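
The same fold can be written with functools.reduce from the Python standard library. This is only an analogy for how the state is threaded through, and the label batches below are made up for illustration rather than taken from the credit card data.

```python
from functools import reduce

def count(counts, batch):
    """Fold one batch of labels into the running class counts."""
    return {
        "class_0": counts["class_0"] + sum(1 for y in batch if y == 0),
        "class_1": counts["class_1"] + sum(1 for y in batch if y == 1),
    }

# Hypothetical label batches, invented for this sketch.
label_batches = [[0, 0, 1, 0], [0, 0, 0, 0], [1, 0, 0, 0]]

# The third argument plays the role of initial_state.
counts = reduce(count, label_batches, {"class_0": 0, "class_1": 0})
total = counts["class_0"] + counts["class_1"]
fractions = [counts["class_0"] / total, counts["class_1"] / total]
print(counts, fractions)
```

As with Dataset.reduce, the function receives the state returned by the previous call, so the final value summarizes every batch.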

tf.data provides a couple of ways to deal with class imbalance.

Datasets sampling

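The first approach is to use sample_from_datasets. It is most applicable when you have a separate tf.data.Dataset for each class.

Here, filter is used to build the per-class datasets from the credit card fraud data.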

negative_ds = (
  creditcard_ds
    .unbatch()
    .filter(lambda features, label: label==0)
    .repeat())
positive_ds = (
  creditcard_ds
    .unbatch()
    .filter(lambda features, label: label==1)
    .repeat())

for features, label in positive_ds.batch(10).take(1):
  print(label.numpy())

[1 1 1 1 1 1 1 1 1 1]

Pass the datasets to tf.data.experimental.sample_from_datasets along with a weight for each.

balanced_ds = tf.data.experimental.sample_from_datasets(
    [negative_ds, positive_ds], [0.5, 0.5]).batch(10)

The dataset now produces examples of each class with 50/50 probability.

for features, labels in balanced_ds.take(10):
  print(labels.numpy())

[0 0 1 1 0 0 1 1 1 1]
[0 0 0 0 1 1 1 1 0 0]
[0 0 0 0 0 0 1 0 1 1]
[1 0 1 1 0 0 0 0 1 0]
[0 1 0 1 0 1 1 0 1 0]
[1 0 0 1 0 1 1 0 1 0]
[0 1 1 1 0 0 1 0 1 1]
[1 0 0 1 0 0 1 0 0 0]
[1 0 0 1 0 0 0 1 0 1]
[1 1 0 1 1 0 1 1 1 0]
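
Conceptually, weighted sampling from several datasets can be pictured as follows. This is a simplified pure-Python sketch under stated assumptions: `sample_from` is a hypothetical helper, the two sources are infinite iterators standing in for negative_ds and positive_ds, and TensorFlow's own implementation may differ in detail.

```python
import random
from itertools import cycle, islice

def sample_from(sources, weights, rng):
    """For every draw, pick one source at random (by weight) and take its next item."""
    iterators = [iter(s) for s in sources]
    indices = range(len(iterators))
    while True:
        (i,) = rng.choices(indices, weights=weights, k=1)
        yield next(iterators[i])

rng = random.Random(0)   # fixed seed so the run is repeatable
negative = cycle([0])    # stands in for negative_ds (repeats forever)
positive = cycle([1])    # stands in for positive_ds (repeats forever)

balanced = list(islice(sample_from([negative, positive], [0.5, 0.5], rng), 10000))
print(sum(balanced) / len(balanced))   # fraction of positives, close to 0.5
```

Because every draw chooses the source independently, the class mix follows the weights no matter how rare positives are in the original data.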

Rejection resampling

First, a note: rejection resampling is a commonly used resampling technique. If you are interested, it is worth looking up and studying on your own.

One problem with experimental.sample_from_datasets is that it needs a separate tf.data.Dataset per class. Dataset.filter solves that, but it results in the data being loaded twice.

The data.experimental.rejection_resample function rebalances a dataset while loading it only once. Elements are dropped from the dataset to achieve balance. data.experimental.rejection_resample takes a class_func argument, which is applied to each dataset element to decide which class the element belongs to for balancing purposes.

The elements of creditcard_ds are already (features, label) pairs, so class_func just returns the label.

def class_func(features, label):
  return label

The resampler requires a target distribution and, optionally, an estimate of the initial distribution.

resampler = tf.data.experimental.rejection_resample(
    class_func, target_dist=[0.5, 0.5], initial_dist=fractions)

The resampler works on individual elements, so the dataset must be unbatched before applying it.

resample_ds = creditcard_ds.unbatch().apply(resampler).batch(10)

The resampler returns (class, example) pairs, where class is the output of class_func. Here the example is already a (feature, label) pair, so use map to drop the duplicate label (that is, the class).

balanced_ds = resample_ds.map(lambda extra_label, features_and_label: features_and_label)

The dataset now produces each class in a 50/50 ratio.

for features, labels in balanced_ds.take(10):
  print(labels.numpy())

[0 1 0 1 0 1 0 0 0 0]
[1 0 1 1 1 1 0 0 1 1]
[0 1 1 1 0 0 1 1 1 0]
[1 1 0 1 0 0 0 1 0 0]
[0 0 0 0 1 1 0 0 1 1]
[0 1 1 0 0 1 0 1 0 1]
[1 1 1 0 0 0 1 0 1 0]
[1 0 1 1 1 1 1 1 1 0]
[1 1 1 1 0 1 0 1 0 1]
[1 0 0 0 0 0 1 1 0 1]
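
The idea behind rejection resampling can be sketched in plain Python. This is an illustrative approximation, not TensorFlow's implementation: the function below, the synthetic stream, and the 1% positive rate are all assumptions made for the example. Each element is kept with a probability proportional to target/initial for its class, scaled so that the most under-represented class is always kept.

```python
import random

def rejection_resample(examples, class_func, target_dist, initial_dist, rng):
    """Keep each example with a class-dependent probability so kept classes follow target_dist."""
    ratios = [t / p for t, p in zip(target_dist, initial_dist)]
    max_ratio = max(ratios)
    accept_prob = [r / max_ratio for r in ratios]   # most under-represented class: always kept
    for example in examples:
        c = class_func(example)
        if rng.random() < accept_prob[c]:
            yield c, example                        # (class, example) pairs, as in the text

rng = random.Random(0)
# Synthetic (features, label) stream with roughly 1% positives, invented for this sketch.
stream = [((i,), 1 if rng.random() < 0.01 else 0) for i in range(200000)]
initial = [0.99, 0.01]

kept = list(rejection_resample(stream, lambda ex: ex[1], [0.5, 0.5], initial, rng))
positives = sum(c for c, _ in kept)
print(len(kept), positives / len(kept))
```

The data is read once, most negatives are rejected, and the surviving examples are close to balanced. The price is that the majority of the input is thrown away.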


tf.data tutorial translation (3)

This post translates the guide below (with personal study notes added). Examples target TF 2.0.

https://www.tensorflow.org/guide/data?hl=en

Batching dataset elements

Simple batching

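The simplest form of batching stacks n consecutive elements of a dataset into a single element. The Dataset.batch() transformation does exactly this, with the same constraints as the tf.stack() operator applied to each component: that is, for each component, all elements must have exactly the same shape.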

inc_dataset = tf.data.Dataset.range(100)
dec_dataset = tf.data.Dataset.range(0, -100, -1)
dataset = tf.data.Dataset.zip((inc_dataset, dec_dataset))
batched_dataset = dataset.batch(4)

for batch in batched_dataset.take(4):
  print([arr.numpy() for arr in batch])

[array([0, 1, 2, 3]), array([ 0, -1, -2, -3])]
[array([4, 5, 6, 7]), array([-4, -5, -6, -7])]
[array([ 8, 9, 10, 11]), array([ -8, -9, -10, -11])]
[array([12, 13, 14, 15]), array([-12, -13, -14, -15])]

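tf.data tries to propagate shape information, but with its default settings Dataset.batch cannot know the size of the last batch, so the batch dimension is reported as None. For example, with a batch size of 32 and 100 elements, the last batch has only 4 elements.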

batched_dataset

<BatchDataset shapes: ((None,), (None,)), types: (tf.int64, tf.int64)>

With the drop_remainder argument, the incomplete last batch is dropped and the batch dimension becomes fully defined.

batched_dataset = dataset.batch(7, drop_remainder=True)
batched_dataset

<BatchDataset shapes: ((7,), (7,)), types: (tf.int64, tf.int64)>

Batching tensors with padding

The examples above all used data with the same shape. However, many models (e.g. sequence models) work with inputs of varying size (sequences do not all have the same length). To handle this case, the Dataset.padded_batch transformation lets you batch tensors of different shapes by padding them.

dataset = tf.data.Dataset.range(100)
dataset = dataset.map(lambda x: tf.fill([tf.cast(x, tf.int32)], x))
dataset = dataset.padded_batch(4, padded_shapes=(None,))

for batch in dataset.take(2):
  print(batch.numpy())
  print()

[[0 0 0]
[1 0 0]
[2 2 0]
[3 3 3]]

[[4 4 4 4 0 0 0]
[5 5 5 5 5 0 0]
[6 6 6 6 6 6 0]
[7 7 7 7 7 7 7]]
tf.fill([tf.cast(x, tf.int32)], x) creates a vector of length x filled with the value x.

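Dataset.padded_batch lets you set different padding for each dimension of each component, and the padded length may be variable (None in the example above) or constant. The padding value defaults to 0 but can be overridden.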
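
What padding to a variable length means can be reproduced in a few lines of plain Python. The helper `padded_batch` below is a hypothetical stand-in written for this note; it pads every sequence to the longest one in its own batch, which is the behaviour the None padded shape gives in the example above.

```python
def padded_batch(sequences, batch_size, pad_value=0):
    """Group sequences into batches, padding each to the longest length in its batch."""
    batches = []
    for start in range(0, len(sequences), batch_size):
        group = sequences[start:start + batch_size]
        width = max(len(seq) for seq in group)
        batches.append([list(seq) + [pad_value] * (width - len(seq)) for seq in group])
    return batches

# Element x is the value x repeated x times, like tf.fill([x], x) in the example above.
sequences = [[x] * x for x in range(8)]
batches = padded_batch(sequences, batch_size=4)
for batch in batches:
    for row in batch:
        print(row)
    print()
```

The first batch is 3 columns wide and the second is 7 columns wide, since the width is decided per batch rather than for the whole dataset.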

Training workflows

Processing multiple epochs

The tf.data API offers two main ways to process multiple epochs of the same data.

The simplest way to iterate over a dataset for multiple epochs is Dataset.repeat(). First, let's load the Titanic dataset.

titanic_file = tf.keras.utils.get_file("train.csv", "https://storage.googleapis.com/tf-datasets/titanic/train.csv")
titanic_lines = tf.data.TextLineDataset(titanic_file)

def plot_batch_sizes(ds):
  batch_sizes = [batch.shape[0] for batch in ds]
  plt.bar(range(len(batch_sizes)), batch_sizes)
  plt.xlabel('Batch number')
  plt.ylabel('Batch size')

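Calling Dataset.repeat() with no arguments repeats the input indefinitely.

Dataset.repeat concatenates its repetitions without signaling the end of one epoch and the beginning of the next. Because of this, a Dataset.batch applied after Dataset.repeat produces batches that straddle epoch boundaries. The effect is easiest to see by comparing this example with the next one: here there is no visible boundary between epochs.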

titanic_batches = titanic_lines.repeat(3).batch(128)
plot_batch_sizes(titanic_batches)

To keep epochs clearly separated, apply batch before repeat.

titanic_batches = titanic_lines.batch(128).repeat(3)

plot_batch_sizes(titanic_batches)
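
Since the plots are not reproduced here, the batch sizes for the two orderings can be worked out by hand. The sketch below assumes the file has 628 lines, which follows from the batch shapes printed later in this post (4 * 128 + 116); `batch_sizes` is a helper invented for this calculation.

```python
def batch_sizes(num_elements, batch_size):
    """Sizes of the batches produced when batching num_elements items."""
    full, rest = divmod(num_elements, batch_size)
    return [batch_size] * full + ([rest] if rest else [])

num_lines = 628   # assumed from the printed shapes: 4 * 128 + 116

repeat_then_batch = batch_sizes(num_lines * 3, 128)   # lines.repeat(3).batch(128)
batch_then_repeat = batch_sizes(num_lines, 128) * 3   # lines.batch(128).repeat(3)

print(repeat_then_batch)
print(batch_then_repeat)
```

With repeat first there is a single short batch at the very end, so epochs blur together. With batch first, every epoch ends in its own short batch of 116.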

If you want to run a custom computation at the end of each epoch (for example, to collect statistics), the simplest option is to restart the dataset iteration on each epoch.

epochs = 3
dataset = titanic_lines.batch(128)

for epoch in range(epochs):
  for batch in dataset:
    print(batch.shape)
  print("End of epoch: ", epoch)

(128,) (128,) (128,) (128,) (116,) End of epoch: 0
(128,) (128,) (128,) (128,) (116,) End of epoch: 1
(128,) (128,) (128,) (128,) (116,) End of epoch: 2

Randomly shuffling input data

Dataset.shuffle() maintains a fixed-size buffer and picks the next element uniformly at random from that buffer.

Add an index to the data so the effect is visible.

lines = tf.data.TextLineDataset(titanic_file)
counter = tf.data.experimental.Counter()

dataset = tf.data.Dataset.zip((counter, lines))
dataset = dataset.shuffle(buffer_size=100)
dataset = dataset.batch(20)
dataset

<BatchDataset shapes: ((None,), (None,)), types: (tf.int64, tf.string)>

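Since buffer_size is 100 and the batch size is 20, the first batch contains no elements with an index of 120 or more. Each time an element is drawn from the buffer, the next element of the input takes its place, so the range of indices that can appear grows steadily until the whole dataset has been consumed.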

n,line_batch = next(iter(dataset))
print(n.numpy())

[ 73 71 16 28 6 65 91 12 42 68 54 40 81 46 4 98 105 89
67 11]
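
The bound on the first batch follows directly from how a shuffle buffer works. Below is a minimal pure-Python sketch of such a buffer. It is an approximation written for this note (the name `buffered_shuffle` is made up), not TensorFlow's code, and the 628-element input is an assumption based on the Titanic file size implied elsewhere in this post.

```python
import random

def buffered_shuffle(iterable, buffer_size, rng):
    """Yield items in shuffled order using a fixed-size buffer."""
    buffer = []
    for item in iterable:
        if len(buffer) < buffer_size:
            buffer.append(item)      # still filling the buffer
            continue
        i = rng.randrange(buffer_size)
        yield buffer[i]              # emit a random buffered item...
        buffer[i] = item             # ...and put the incoming item in its slot
    rng.shuffle(buffer)              # input exhausted: drain what is left
    yield from buffer

rng = random.Random(0)
shuffled = list(buffered_shuffle(range(628), buffer_size=100, rng=rng))
first_batch = shuffled[:20]
print(max(first_batch))   # at most 118 in this sketch
```

In this sketch the k-th element drawn can have index at most 98 + k, so the first 20 draws stay below 120, consistent with the statement above. A small buffer therefore only shuffles locally.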

As with Dataset.batch, the order relative to Dataset.repeat matters.

Dataset.shuffle does not signal the end of an epoch until the shuffle buffer is empty. So a shuffle placed before a repeat shows every element of one epoch before moving on to the next.

dataset = tf.data.Dataset.zip((counter, lines))
shuffled = dataset.shuffle(buffer_size=100).batch(10).repeat(2)

print("Here are the item ID's near the epoch boundary:\n")
for n, line_batch in shuffled.skip(60).take(5):
  print(n.numpy())

Here are the item ID's near the epoch boundary:

[541 569 508 599 578 418 559 595 401 594]
[282 522 395 552 362 442 389 619 506 523]
[612 585 482 518 604 617 608 622]
[85 27 73 57 16 47 43 50 55 64]
[ 90 89 24 59 9 101 97 65 14 99]

shuffle_repeat = [n.numpy().mean() for n, line_batch in shuffled]
plt.plot(shuffle_repeat, label="shuffle().repeat()")
plt.ylabel("Mean item ID")
plt.legend()

A repeat placed before a shuffle, on the other hand, mixes the epoch boundaries together.

dataset = tf.data.Dataset.zip((counter, lines))
shuffled = dataset.repeat(2).shuffle(buffer_size=100).batch(10)

print("Here are the item ID's near the epoch boundary:\n")
for n, line_batch in shuffled.skip(55).take(15):
  print(n.numpy())

repeat_shuffle = [n.numpy().mean() for n, line_batch in shuffled]

plt.plot(shuffle_repeat, label="shuffle().repeat()")
plt.plot(repeat_shuffle, label="repeat().shuffle()")
plt.ylabel("Mean item ID")
plt.legend()


Modularity Matters: Learning Invariant Relational Reasoning Tasks

Abstract

In this article we focus on two supervised visual reasoning tasks whose labels encode a semantic relational rule between two or more objects in an image: the MNIST Parity task and the colorized Pentomino task. The objects in the images undergo random translation, scaling, rotation and coloring transformations. Thus these tasks involve invariant relational reasoning. We observed uneven performance of various deep convolutional neural network (CNN) models on these two tasks. For the MNIST Parity task, we report that the VGG19 model soundly outperforms a family of ResNet models. Moreover, the family of ResNet models exhibits a general sensitivity to random initialization for the MNIST Parity task. For the colorized Pentomino task, now both the VGG19 and ResNet models exhibit sluggish optimization and very poor test generalization, hovering around 30% test error. The CNN models we tested all learn hierarchies of fully distributed features and thus encode the distributed representation prior. We are motivated by a hypothesis from cognitive neuroscience which posits that the human visual cortex is modularized (as opposed to fully distributed), and that this modularity allows the visual cortex to learn higher order invariances. To this end, we consider a modularized variant of the ResNet model, referred to as a Residual Mixture Network (ResMixNet) which employs a mixture-of-experts architecture to interleave distributed representations with more specialized, modular representations. We show that very shallow ResMixNets are capable of learning each of the two tasks well, attaining less than 2% and 1% test error on the MNIST Parity and the colorized Pentomino tasks respectively. Most importantly, the ResMixNet models are extremely parameter efficient: generalizing better than various non-modular CNNs that have over 10x the number of parameters. These experimental results support the hypothesis that modularity is a robust prior for learning invariant relational reasoning.

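This paper addresses two supervised visual reasoning tasks, MNIST Parity and Colorized Pentomino, whose labels encode a semantic relational rule between two or more objects in an image.

Each object in an image undergoes random translation, scaling, rotation, and coloring.

The tasks therefore involve invariant relational reasoning.

Different CNN models perform unevenly on the two tasks.

On MNIST Parity, VGG19 clearly outperforms the ResNet family.

The ResNet family is also sensitive to random initialization on MNIST Parity.

On Colorized Pentomino, both VGG19 and ResNet optimize slowly and generalize poorly, with test error around 30%.

All of the CNN models tested learn hierarchies of fully distributed features, and so encode the distributed representation prior.

The work is motivated by a hypothesis from cognitive neuroscience that the human visual cortex is modularized (as opposed to fully distributed), and that this modularity lets it learn higher-order invariances.

The paper proposes ResMixNet, a modularized ResNet variant that uses a mixture-of-experts architecture to interleave distributed representations with more specialized, modular ones.

Very shallow ResMixNets learn both tasks well, reaching less than 2% and 1% test error on MNIST Parity and Colorized Pentomino respectively.

Most importantly, ResMixNet is extremely parameter efficient: it generalizes better than non-modular CNNs with over 10x as many parameters.

These results support the hypothesis that modularity is a robust prior for learning invariant relational reasoning.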

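Summary

Conventional CNNs learn discriminative representations well, but because of the i.i.d. assumption they are vulnerable to adversarial attacks.
Mainstream CNNs are very good at learning distributed features hierarchically. This paper uses two datasets to study invariant relational rules.
First, the MNIST Parity dataset consists of (64, 64) images in which digits are drawn with random scale, rotation, and color. If the digits in an image have the same parity (both even or both odd), the label is 1; otherwise it is 0.

The Colorized Pentomino dataset also uses (64, 64) images, with the shapes placed randomly as shown in the paper's example figure. If the shapes in an image are the same, the label is 0; otherwise it is 1.

The two tasks differ in that MNIST digits have curved strokes, whereas Pentomino shapes consist only of rectangular features; their labels resemble the AND gate and XOR gate problems respectively. The paper argues that Pentomino is harder to learn because the curved features of MNIST help a great deal during training.
The model has to recognize the randomly generated shapes inside an image while the image label stays the same. However, the more properties such as rotation and scale it must be invariant to, the harder the inference problem becomes.
ResMixNet is proposed to modularize these invariances so that each module can specialize.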

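The first stage of the model is built from Conv layers so that low-level features are shared. Next comes a simple module, M, made of stacked Experts. G (the Gater) is simply a stack of 4 Conv layers, and the weights it produces are matrix-multiplied with the outputs of E (the Experts) to produce the final result.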
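
The gating step can be illustrated with a toy calculation. This is only a sketch under assumptions that are not taken from the paper: the gate scores are normalized with a softmax, each expert outputs a small vector, and all numbers are invented. The real ResMixNet operates on convolutional feature maps.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mix_experts(gate_scores, expert_outputs):
    """Weighted sum of expert output vectors, weighted by softmax(gate_scores)."""
    weights = softmax(gate_scores)
    dim = len(expert_outputs[0])
    return [sum(w * out[d] for w, out in zip(weights, expert_outputs)) for d in range(dim)]

# Hypothetical numbers: 3 experts, each producing a 2-dimensional feature vector.
gate_scores = [2.0, 0.0, 0.0]
expert_outputs = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

mixed = mix_experts(gate_scores, expert_outputs)
print(mixed)
```

Because the gate favours the first expert, the mixed output is dominated by that expert's vector, which is how a gater can route an input to a specialized module.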
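On the MNIST Parity dataset, notably, the VGG model outperforms the ResNet family.

On Pentomino, the proposed model performs well.

On the CIFAR datasets, there is little difference, or performance drops. The paper attributes this to CIFAR-100 having many classes with similar characteristics, and suggests that the proposed model does not learn high-level features well. (The more similar the classes are, the better the model must be at capturing fine-grained features.)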

Reference

Jo, J., Verma, V., & Bengio, Y. (2018). Modularity Matters: Learning Invariant Relational Reasoning Tasks. arXiv preprint arXiv:1806.06765.

https://www.youtube.com/watch?v=dAGI3mlOmfw&list=PLWKf9beHi3Tg50UoyTe6rIm20sVQOH1br&index=93


tf.data tutorial translation (2)

This post translates the guide below (with personal study notes added). Examples target TF 2.0.

https://www.tensorflow.org/guide/data?hl=en


Reading Input Data

Consuming NumPy arrays

For more examples of loading NumPy arrays, see the following.

https://www.tensorflow.org/tutorials/load_data/numpy

If all of your data fits in memory, the simplest way to create a Dataset from it is to convert it to tf.Tensor objects with Dataset.from_tensor_slices().

train, test = tf.keras.datasets.fashion_mnist.load_data()

images, labels = train
images = images/255

dataset = tf.data.Dataset.from_tensor_slices((images, labels))
dataset

<TensorSliceDataset shapes: ((28, 28), ()), types: (tf.float64, tf.uint8)>
(28, 28) is the shape of each image, and () means each label is a scalar.

Consuming Python generators

Another common data source that is easy to consume as a tf.data.Dataset is the Python generator.

def count(stop):
  i = 0
  while i<stop:
    yield i
    i += 1

for n in count(5):
  print(n)

0 1 2 3 4

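The Dataset.from_generator constructor turns a Python generator into a fully functional tf.data.Dataset.

The constructor takes a callable as input, not an iterator. This lets it restart the generator when it reaches the end. It also takes an optional args argument, which is passed to the callable as its arguments.

The output_types argument is required because tf.data builds a tf.Graph internally, and graph edges require a tf.dtype.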

ds_counter = tf.data.Dataset.from_generator(count, args=[25], output_types=tf.int32, output_shapes = (), )

count is the Python generator defined above.
args holds the arguments for that callable; here it supplies the stop argument of count.

for count_batch in ds_counter.repeat().batch(10).take(10):
  print(count_batch.numpy())

[0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24 0 1 2 3 4]
[ 5 6 7 8 9 10 11 12 13 14]
[15 16 17 18 19 20 21 22 23 24]
[0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24 0 1 2 3 4]
[ 5 6 7 8 9 10 11 12 13 14]
[15 16 17 18 19 20 21 22 23 24]
Note that batch(10).take(10) fetches 10 batches of 10 elements, and the generator counts up to stop=25. In the third array you can see the generator restart once it reaches 25.
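
Why a callable is needed rather than an iterator can be shown in plain Python. In this sketch `repeat_from_callable` is a hypothetical helper that imitates restarting the generator on each pass; it is not part of tf.data.

```python
def count(stop):
    i = 0
    while i < stop:
        yield i
        i += 1

def repeat_from_callable(make_iter, args, times):
    """Call make_iter(*args) afresh for every pass, so each pass starts from the beginning."""
    for _ in range(times):
        yield from make_iter(*args)

gen = count(3)             # a generator object, i.e. an iterator
first_pass = list(gen)     # consumes the iterator
second_pass = list(gen)    # empty: an exhausted iterator cannot be rewound

restarted = list(repeat_from_callable(count, [3], times=2))
print(first_pass, second_pass, restarted)
```

An exhausted generator object yields nothing the second time, whereas calling the function again produces a fresh generator. That is what makes repeat() possible on a generator-backed dataset.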

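The output_shapes argument is not required but is highly recommended, because many TensorFlow operations do not support tensors of unknown rank. If the length of a particular axis is unknown or variable, set it to None.

output_shapes and output_types also matter for the other dataset methods applied afterwards.

Here is an example generator that shows why both are needed. It yields tuples in which the second item is a vector of unknown length.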

def gen_series():
  i = 0
  while True:
    size = np.random.randint(0, 10)
    yield i, np.random.normal(size=(size,))
    i += 1

for i, series in gen_series():
  print(i, ":", str(series))
  if i > 5:
    break

0 : [ 0.3464 -0.306 ]
1 : [-0.7686]
2 : [1.9423]
3 : []
4 : [ 0.5828 -0.0588 0.0094 -0.9467]
5 : [-0.8972 -0.4949 1.1115 0.8208 0.843 0.2968 -2.7236 -0.844 -1.7327]
6 : [ 1.2727 -0.6278 0.1622 -1.4087 -0.7683 -0.3966 0.3112]

The first output is an int32 and the second is a float32.

The first item is a scalar with shape (). The second item is a vector of unknown length with shape (None,).

ds_series = tf.data.Dataset.from_generator(
    gen_series, 
    output_types=(tf.int32, tf.float32), 
    output_shapes=((), (None,)))

ds_series

<FlatMapDataset shapes: ((), (None,)), types: (tf.int32, tf.float32)>

It can now be used like a regular tf.data.Dataset. When batching a dataset with a variable shape, use Dataset.padded_batch.

ds_series_batch = ds_series.shuffle(20).padded_batch(10, padded_shapes=([], [None]))

ids, sequence_batch = next(iter(ds_series_batch))
print(ids.numpy())
print()
print(sequence_batch.numpy())

[ 3 2 5 8 0 11 24 26 16 12]

[[ 3.3757 0.791 -0.7864 -0.5299 -0.5024 0. 0. 0. 0. ]
 [-0.8493 0. 0. 0. 0. 0. 0. 0. 0. ]
 [-0.3736 0.2187 0.3256 -0.8628 2.3045 0.7726 1.9534 0.1123 0.3906]
 [ 0.3752 1.0399 -1.6983 -1.2217 -1.2176 -1.1055 0.7014 0. 0. ]
 [ 0.2049 -0.5775 -1.5055 0. 0. 0. 0. 0. 0. ]
 [-2.0829 0.7266 -0.0104 -1.2408 -0.715 -0.232 0.2391 0. 0. ]
 [-0.0439 -0.3391 1.5569 -0.7063 1.3729 -0.31 0.9572 -0.0446 0.0635]
 [ 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
 [-0.5468 0.3916 -0.432 0.6168 -1.0789 0.8624 -1.2116 -1.1322 0.2158]
 [ 0. 0. 0. 0. 0. 0. 0. 0. 0. ]]
Because of padded_batch(10), elements are fetched 10 at a time (the 10 int IDs show this). padded_batch pads the variable-length sequences with zeros up to the longest sequence in the batch. The lengths here range from 0 to 9, since np.random.randint(0, 10) excludes 10.
shuffle(20) means the shuffle buffer holds 20 elements. As a result, the first batch of 10 can only contain IDs up to 28: the buffer starts with IDs 0 to 19 and takes in one new ID for each element drawn.

For a more realistic example, try wrapping preprocessing.image.ImageDataGenerator. First, download the data.

flowers = tf.keras.utils.get_file(
    'flower_photos',
    'https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz',
    untar=True)

Create the image.ImageDataGenerator.

img_gen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1./255, rotation_range=20)

images, labels = next(img_gen.flow_from_directory(flowers))

print(images.dtype, images.shape)
print(labels.dtype, labels.shape)

float32 (32, 256, 256, 3)
float32 (32, 5)

ds = tf.data.Dataset.from_generator(
    img_gen.flow_from_directory, args=[flowers], 
    output_types=(tf.float32, tf.float32), 
    output_shapes=([32,256,256,3], [32,5])
)

ds

<FlatMapDataset shapes: ((32, 256, 256, 3), (32, 5)), types: (tf.float32, tf.float32)>
This shows the ImageDataGenerator wrapped as a tf.data.Dataset.

Consuming TFRecord data

For an end-to-end example, see Loading TFRecords.

The tf.data API supports a variety of file formats so that you can process large datasets that do not fit in memory. For example, the TFRecord file format is a simple record-oriented binary format that many TensorFlow applications use for training data. The tf.data.TFRecordDataset class lets you stream the contents of one or more TFRecord files as part of an input pipeline.

Here is an example using French Street Name Signs (FSNS).

# Creates a dataset that reads all of the examples from two files.
fsns_test_file = tf.keras.utils.get_file("fsns.tfrec", "https://storage.googleapis.com/download.tensorflow.org/data/fsns-20160927/testdata/fsns-00000-of-00001")

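The filenames argument to the TFRecordDataset initializer can be a string, a list of strings, or a tf.Tensor of strings. So if you have two sets of files for training and validation, you can write a factory method that takes the filenames as input and produces the dataset.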

dataset = tf.data.TFRecordDataset(filenames = [fsns_test_file])
dataset

<TFRecordDatasetV2 shapes: (), types: tf.string>

Many TensorFlow projects store serialized tf.train.Example records in their TFRecord files. These need to be decoded before they can be inspected.

raw_example = next(iter(dataset))
parsed = tf.train.Example.FromString(raw_example.numpy())

parsed.features.feature['image/text']

bytes_list { value: "Rue Perreyon" }

Consuming text data

For an end-to-end example, see the linked tutorial.

Many datasets are distributed as one or more text files. tf.data.TextLineDataset provides an easy way to extract lines from such files. Given one or more filenames, a TextLineDataset produces one string-valued element per line of those files.

directory_url = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
file_names = ['cowper.txt', 'derby.txt', 'butler.txt']

file_paths = [
    tf.keras.utils.get_file(file_name, directory_url + file_name)
    for file_name in file_names
]

dataset = tf.data.TextLineDataset(file_paths)

Here are the first five lines of the first file.

for line in dataset.take(5):
  print(line.numpy())

b"\xef\xbb\xbfAchilles sing, O Goddess! Peleus' son;"
b'His wrath pernicious, who ten thousand woes'
b"Caused to Achaia's host, sent many a soul"
b'Illustrious into Ades premature,'
b'And Heroes gave (so stood the will of Jove)'

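Dataset.interleave lets you alternate lines between files. The following example shows the first lines from each translation. Because cycle_length=3, three files are read at the same time, and with the default block_length of 1 the output takes one line from each file in turn.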

files_ds = tf.data.Dataset.from_tensor_slices(file_paths)
lines_ds = files_ds.interleave(tf.data.TextLineDataset, cycle_length=3)

for i, line in enumerate(lines_ds.take(9)):
  if i % 3 == 0:
    print()
  print(line.numpy())

b"\xef\xbb\xbfAchilles sing, O Goddess! Peleus' son;"
b"\xef\xbb\xbfOf Peleus' son, Achilles, sing, O Muse,"
b'\xef\xbb\xbfSing, O goddess, the anger of Achilles son of Peleus, that brought'

b'His wrath pernicious, who ten thousand woes'
b'The vengeance, deep and deadly; whence to Greece'
b'countless ills upon the Achaeans. Many a brave soul did it send'

b"Caused to Achaia's host, sent many a soul"
b'Unnumbered ills arose; which many a soul'
b'hurrying down to Hades, and many a hero did it yield a prey to dogs and'
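The round-robin order seen in this output can be imitated in plain Python. The sketch below is only an approximation of the idea (how many sources are open at once, and how many items are taken from each per turn); the `interleave` helper and the short placeholder strings are assumptions made for this illustration, not part of tf.data.

```python
from itertools import islice


def interleave(sources, cycle_length, block_length=1):
    """Cycle over up to `cycle_length` open sources, taking
    `block_length` items from each source per turn."""
    pending = iter(sources)
    active = [iter(src) for src in islice(pending, cycle_length)]
    while active:
        still_open = []
        for it in active:
            block = list(islice(it, block_length))
            yield from block
            if len(block) == block_length:
                still_open.append(it)
            else:
                # Source exhausted: open the next unread source, if any.
                nxt = next(pending, None)
                if nxt is not None:
                    still_open.append(iter(nxt))
        active = still_open


# Placeholder "files": three lines each, named after the three translations.
cowper = ["c1", "c2", "c3"]
derby = ["d1", "d2", "d3"]
butler = ["b1", "b2", "b3"]

one_line_each = list(interleave([cowper, derby, butler], cycle_length=3))
one_file_at_a_time = list(interleave([cowper, derby, butler], cycle_length=1))
two_lines_each = list(interleave([cowper, derby, butler], cycle_length=3, block_length=2))
print(one_line_each)
```

With cycle_length=3 the lines alternate file by file, while cycle_length=1 degenerates to reading the files one after another.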

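By default, a TextLineDataset yields every line of each file, which may not be desirable if, for example, the file starts with a header line or contains comments. Such lines can be removed with the Dataset.skip() and Dataset.filter() transformations. The following example skips the first line and then filters to keep only the survivors.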

titanic_file = tf.keras.utils.get_file("train.csv", "https://storage.googleapis.com/tf-datasets/titanic/train.csv")
titanic_lines = tf.data.TextLineDataset(titanic_file)

for line in titanic_lines.take(10):
  print(line.numpy())

b'survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone'
b'0,male,22.0,1,0,7.25,Third,unknown,Southampton,n'
b'1,female,38.0,1,0,71.2833,First,C,Cherbourg,n'
b'1,female,26.0,0,0,7.925,Third,unknown,Southampton,y'
b'1,female,35.0,1,0,53.1,First,C,Southampton,n'
b'0,male,28.0,0,0,8.4583,Third,unknown,Queenstown,y'
b'0,male,2.0,3,1,21.075,Third,unknown,Southampton,n'
b'1,female,27.0,0,2,11.1333,Third,unknown,Southampton,n'
b'1,female,14.0,1,0,30.0708,Second,unknown,Cherbourg,n'
b'1,female,4.0,1,1,16.7,Third,G,Southampton,n'
This loads the first ten lines of the Titanic data. Note that they include a header line that we do not need.

def survived(line):
  return tf.not_equal(tf.strings.substr(line, 0, 1), "0")

survivors = titanic_lines.skip(1).filter(survived)

for line in survivors.take(10):
  print(line.numpy())

b'1,female,38.0,1,0,71.2833,First,C,Cherbourg,n'
b'1,female,26.0,0,0,7.925,Third,unknown,Southampton,y'
b'1,female,35.0,1,0,53.1,First,C,Southampton,n'
b'1,female,27.0,0,2,11.1333,Third,unknown,Southampton,n'
b'1,female,14.0,1,0,30.0708,Second,unknown,Cherbourg,n'
b'1,female,4.0,1,1,16.7,Third,G,Southampton,n'
b'1,male,28.0,0,0,13.0,Second,unknown,Southampton,y'
b'1,female,28.0,0,0,7.225,Third,unknown,Cherbourg,y'
b'1,male,28.0,0,0,35.5,First,A,Southampton,y'
b'1,female,38.0,1,5,31.3875,Third,unknown,Southampton,n'
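tf.strings.substr(line, 0, 1) extracts the first character of each line, and the filter drops every line whose first character is "0". Since the first column is survived, a 1 marks a survivor. skip(1) removes the header line.
tf.not_equal(x, y) returns the boolean value of (x != y), so it is True for lines that start with 1. filter keeps only the elements for which the predicate returns True.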
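The same skip-then-filter logic can be checked on a few of the lines shown above with plain Python. This is just a sketch using `itertools.islice` and the built-in `filter` in place of Dataset.skip and Dataset.filter; the list `lines` is a hand-copied subset of the printed rows.

```python
from itertools import islice

lines = [
    "survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone",
    "0,male,22.0,1,0,7.25,Third,unknown,Southampton,n",
    "1,female,38.0,1,0,71.2833,First,C,Cherbourg,n",
    "1,female,26.0,0,0,7.925,Third,unknown,Southampton,y",
    "0,male,28.0,0,0,8.4583,Third,unknown,Queenstown,y",
]


def survived(line):
    # Same test as tf.not_equal(tf.strings.substr(line, 0, 1), "0").
    return line[:1] != "0"


# skip(1) drops the header, filter keeps lines where the predicate is True.
survivors = list(filter(survived, islice(lines, 1, None)))

# Without the skip, the header slips through because "s" != "0".
without_skip = list(filter(survived, lines))

print(survivors)
```

The second result shows why the skip matters: the predicate only rejects lines that start with "0", so the header would be kept as if it were a survivor.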

Consuming CSV data

For more examples, see the two linked tutorials.

The CSV file format is a popular format for storing tabular data in plain text.

titanic_file = tf.keras.utils.get_file("train.csv", "https://storage.googleapis.com/tf-datasets/titanic/train.csv")

df = pd.read_csv(titanic_file, index_col=None)
df.head()

If the data fits in memory, the same Dataset.from_tensor_slices method works on dictionaries, so the data can be imported easily.

titanic_slices = tf.data.Dataset.from_tensor_slices(dict(df))

for feature_batch in titanic_slices.take(1):
  for key, value in feature_batch.items():
    print("  {!r:20s}: {}".format(key, value))

take(1) fetches a single element (one row, not a batch). Each element is a dict, so items() yields its key and value pairs.
'survived'          : 0
'sex'               : b'male'
'age'               : 22.0
'n_siblings_spouses': 1
'parch'             : 0
'fare'              : 7.25
'class'             : b'Third'
'deck'              : b'unknown'
'embark_town'       : b'Southampton'
'alone'             : b'n'
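Slicing a dict of columns into per-row dicts can be pictured with a small pure-Python sketch. The `slice_columns` helper and the three shortened columns are assumptions for illustration only; they show the shape of the result, not how from_tensor_slices works internally.

```python
def slice_columns(columns):
    """Turn a dict of equal-length columns into one dict per row."""
    keys = list(columns)
    for values in zip(*(columns[key] for key in keys)):
        yield dict(zip(keys, values))


# A shortened stand-in for dict(df): column name -> list of values.
columns = {
    "survived": [0, 1],
    "sex": ["male", "female"],
    "age": [22.0, 38.0],
}

rows = list(slice_columns(columns))
first = rows[0]
for key, value in first.items():
    print("  {!r:20s}: {}".format(key, value))
```

Each element keeps the column names as keys, which is why the loop above can call items() on a single element.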

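A more scalable approach is to load from disk as needed.
The tf.data module provides methods to extract records from one or more CSV files that comply with RFC 4180. (RFC 4180 is the document that describes the common CSV format.)

experimental.make_csv_dataset is the high-level interface for reading sets of CSV files. It supports column type inference and many other features, such as batching and shuffling, to make usage simple.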

titanic_batches = tf.data.experimental.make_csv_dataset(
    titanic_file, batch_size=4,
    label_name="survived")

for feature_batch, label_batch in titanic_batches.take(1):
  print("'survived': {}".format(label_batch))
  print("features:")
  for key, value in feature_batch.items():
    print("  {!r:20s}: {}".format(key, value))

'survived': [1 0 1 0]
features:
  'sex'               : [b'female' b'male' b'female' b'male']
  'age'               : [30. 28.  2. 28.]
  'n_siblings_spouses': [3 0 0 0]
  'parch'             : [0 0 1 0]
  'fare'              : [21.      7.7958 12.2875 26.55  ]
  'class'             : [b'Second' b'Third' b'Third' b'First']
  'deck'              : [b'unknown' b'unknown' b'unknown' b'C']
  'embark_town'       : [b'Southampton' b'Southampton' b'Southampton' b'Southampton']
  'alone'             : [b'n' b'y' b'n' b'y']

Use the select_columns argument if you only need a subset of the columns.

titanic_batches = tf.data.experimental.make_csv_dataset(
    titanic_file, batch_size=4,
    label_name="survived", select_columns=['class', 'fare', 'survived'])

for feature_batch, label_batch in titanic_batches.take(1):
  print("'survived': {}".format(label_batch))
  for key, value in feature_batch.items():
    print("  {!r:20s}: {}".format(key, value))

'survived': [0 0 0 0]
'fare' : [29.125 7.8958 77.2875 7.75 ]
'class' : [b'Third' b'Third' b'First' b'Third']

There is also a lower-level experimental.CsvDataset class that provides finer-grained control. It does not support column type inference; instead, you must specify the type of each column.

titanic_types  = [tf.int32, tf.string, tf.float32, tf.int32, tf.int32, tf.float32, tf.string, tf.string, tf.string, tf.string] 
dataset = tf.data.experimental.CsvDataset(titanic_file, titanic_types , header=True)

for line in dataset.take(10):
  print([item.numpy() for item in line])

[0, b'male', 22.0, 1, 0, 7.25, b'Third', b'unknown', b'Southampton', b'n']
[1, b'female', 38.0, 1, 0, 71.2833, b'First', b'C', b'Cherbourg', b'n']
[1, b'female', 26.0, 0, 0, 7.925, b'Third', b'unknown', b'Southampton', b'y']
[1, b'female', 35.0, 1, 0, 53.1, b'First', b'C', b'Southampton', b'n']
[0, b'male', 28.0, 0, 0, 8.4583, b'Third', b'unknown', b'Queenstown', b'y']
[0, b'male', 2.0, 3, 1, 21.075, b'Third', b'unknown', b'Southampton', b'n']
[1, b'female', 27.0, 0, 2, 11.1333, b'Third', b'unknown', b'Southampton', b'n']
[1, b'female', 14.0, 1, 0, 30.0708, b'Second', b'unknown', b'Cherbourg', b'n']
[1, b'female', 4.0, 1, 1, 16.7, b'Third', b'G', b'Southampton', b'n']
[0, b'male', 20.0, 0, 0, 8.05, b'Third', b'unknown', b'Southampton', b'y']

If some columns are empty, this low-level interface lets you provide default values instead of column types.

%%writefile missing.csv
1,2,3,4
,2,3,4
1,,3,4
1,2,,4
1,2,3,
,,,

Writing missing.csv

# Creates a dataset that reads all of the records from two CSV files, each with
# four float columns which may have missing values.

record_defaults = [999,999,999,999]
dataset = tf.data.experimental.CsvDataset("missing.csv", record_defaults)
dataset = dataset.map(lambda *items: tf.stack(items))
dataset

<MapDataset shapes: (4,), types: tf.int32>

for line in dataset:
  print(line.numpy())

[1 2 3 4]
[999   2   3   4]
[  1 999   3   4]
[  1   2 999   4]
[  1   2   3 999]
[999 999 999 999]

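By default, a CsvDataset yields every column of every line of the file. This may not be desirable, for example when the file starts with a header line that should be ignored, or when some columns are not needed. These lines and fields can be removed with the header and select_cols arguments respectively.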

# Creates a dataset that reads all of the records from two CSV files with
# headers, extracting float data from columns 2 and 4.
record_defaults = [999, 999] # Only provide defaults for the selected columns
dataset = tf.data.experimental.CsvDataset("missing.csv", record_defaults, select_cols=[1, 3])
dataset = dataset.map(lambda *items: tf.stack(items))
dataset

<MapDataset shapes: (2,), types: tf.int32>

for line in dataset:
  print(line.numpy())

[2 4]
[2 4]
[999 4]
[2 4]
[ 2 999]
[999 999]
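The default-filling and column-selection behavior shown in the two outputs above can be reproduced with the standard csv module. This is a sketch of the idea only: `read_csv_with_defaults` is a hypothetical helper written for this note, and it infers each column's type from its default value, as the record_defaults list does in the example.

```python
import csv
import io

# Same content as missing.csv above.
MISSING_CSV = "1,2,3,4\n,2,3,4\n1,,3,4\n1,2,,4\n1,2,3,\n,,,\n"


def read_csv_with_defaults(text, record_defaults, select_cols=None):
    """Parse CSV text, replacing empty fields with per-column defaults."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        cols = select_cols if select_cols is not None else range(len(record))
        row = []
        for default, col in zip(record_defaults, cols):
            field = record[col]
            # An empty field takes the default; otherwise cast to its type.
            row.append(type(default)(field) if field != "" else default)
        rows.append(row)
    return rows


all_cols = read_csv_with_defaults(MISSING_CSV, [999, 999, 999, 999])
two_cols = read_csv_with_defaults(MISSING_CSV, [999, 999], select_cols=[1, 3])

for row in all_cols:
    print(row)
for row in two_cols:
    print(row)
```

Note that with select_cols only two defaults are needed, one per selected column, which matches the comment in the example above.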

Consuming sets of files

Many datasets are distributed as a set of files, where each file is one example.

flowers_root = tf.keras.utils.get_file(
    'flower_photos',
    'https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz',
    untar=True)
flowers_root = pathlib.Path(flowers_root)

The root directory contains a directory for each class.

for item in flowers_root.glob("*"):
  print(item.name)

sunflowers
daisy
LICENSE.txt
roses
tulips
dandelion

The files in each class directory are the examples.

list_ds = tf.data.Dataset.list_files(str(flowers_root/'*/*'))

for f in list_ds.take(5):
  print(f.numpy())

b'/home/kbuilder/.keras/datasets/flower_photos/roses/6409000675_6eb6806e59.jpg'
b'/home/kbuilder/.keras/datasets/flower_photos/tulips/4520577328_a94c11e806_n.jpg'
b'/home/kbuilder/.keras/datasets/flower_photos/sunflowers/4933229889_c5d9e36392.jpg'
b'/home/kbuilder/.keras/datasets/flower_photos/roses/22506717337_0fd63e53e9.jpg'
b'/home/kbuilder/.keras/datasets/flower_photos/daisy/20182559506_40a112f762.jpg'

Dataset.list_files() takes a glob pattern (here, every file under every class directory) and returns a dataset of the matching file paths. By default the paths come back in shuffled order.

Read the data using the tf.io.read_file function and extract the label from the path, returning (image, label) pairs.

def process_path(file_path):
  label = tf.strings.split(file_path, '/')[-2]
  return tf.io.read_file(file_path), label

labeled_ds = list_ds.map(process_path)

tf.strings.split splits the path on '/', and the second-to-last component (the class directory name) is used as the label.

for image_raw, label_text in labeled_ds.take(1):
  print(repr(image_raw.numpy()[:100]))
  print()
  print(label_text.numpy())

b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00\xff\xdb\x00C\x00\x03\x02\x02\x03\x02\x02\x03\x03\x03\x03\x04\x03\x03\x04\x05\x08\x05\x05\x04\x04\x05\n\x07\x07\x06\x08\x0c\n\x0c\x0c\x0b\n\x0b\x0b\r\x0e\x12\x10\r\x0e\x11\x0e\x0b\x0b\x10\x16\x10\x11\x13\x14\x15\x15\x15\x0c\x0f\x17\x18\x16\x14\x18\x12\x14\x15\x14\xff\xdb\x00C\x01\x03\x04\x04\x05\x04\x05'

b'sunflowers'
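The label extraction is ordinary string handling, so it can be tried without TensorFlow. The sketch below applies the same split to three of the paths printed earlier; `label_from_path` is a name chosen for this illustration, and the `pathlib` variant is shown only as an equivalent way to read the parent directory name.

```python
from pathlib import PurePosixPath

paths = [
    "/home/kbuilder/.keras/datasets/flower_photos/roses/6409000675_6eb6806e59.jpg",
    "/home/kbuilder/.keras/datasets/flower_photos/tulips/4520577328_a94c11e806_n.jpg",
    "/home/kbuilder/.keras/datasets/flower_photos/daisy/20182559506_40a112f762.jpg",
]


def label_from_path(path):
    # Same idea as tf.strings.split(file_path, '/')[-2]:
    # the second-to-last component is the class directory.
    return path.split("/")[-2]


labels = [label_from_path(p) for p in paths]
parent_names = [PurePosixPath(p).parent.name for p in paths]
print(labels)
```

Splitting on a hard-coded '/' assumes POSIX-style paths, as in the output above.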


tf.data tutorial 번역 (1)

This is a translation of the following guide (plus personal study notes). The examples are based on TF 2.0.

https://www.tensorflow.org/guide/data?hl=en


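The tf.data API lets you build complex input pipelines from simple, reusable pieces. For example, the pipeline for an image model might aggregate data from files in a distributed file system, apply random perturbations to each image, and merge randomly selected images into a batch for training. The pipeline for a text model might involve extracting symbols from raw text data, converting them to embedding identifiers with a lookup table, and batching together sequences of different lengths. The tf.data API makes it possible to handle large amounts of data, read from different data formats, and perform complex transformations.

The tf.data API introduces the tf.data.Dataset abstraction, which represents a sequence of elements. For example, in an image pipeline, an element might be a single training example, with a pair of tensor components representing the image and its label.

There are two distinct ways to create a dataset.

A data source constructs a Dataset from data stored in memory or in one or more files.
A data transformation constructs a dataset from one or more tf.data.Dataset objects.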

Basic mechanics

To create an input pipeline, you must start with a data source. For example, to construct a Dataset from data in memory, use tf.data.Dataset.from_tensors() or tf.data.Dataset.from_tensor_slices(). If your input data is stored in files in the TFRecord format, use tf.data.TFRecordDataset().

Once you have a Dataset object, you can transform it into a new Dataset by chaining method calls on the tf.data.Dataset object. For example, you can apply per-element transformations such as Dataset.map() and multi-element transformations such as Dataset.batch(). See the tf.data.Dataset documentation for the complete list of transformations.

The Dataset object is a Python iterable, so its elements can be consumed with a for loop.

dataset = tf.data.Dataset.from_tensor_slices([8, 3, 0, 8, 2, 1])
dataset

<TensorSliceDataset shapes: (), types: tf.int32>

for elem in dataset:
  print(elem.numpy())

8
3
0
8
2
1

Alternatively, create a Python iterator explicitly with iter and consume its elements with next.

it = iter(dataset)

print(next(it).numpy())

Alternatively, dataset elements can be consumed with the reduce transformation, which reduces all elements to a single result. The following example shows how to use reduce to compute the sum of a dataset of integers.

print(dataset.reduce(0, lambda state, value: state + value).numpy())
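For comparison, the same iteration and reduction can be written with built-in Python tools on the list used above. This is only an analogy: `functools.reduce` stands in for Dataset.reduce, with the initial state 0 and the same (state, value) lambda.

```python
from functools import reduce

values = [8, 3, 0, 8, 2, 1]

# Explicit iterator, as with iter(dataset) / next(it).
it = iter(values)
first = next(it)

# Fold every element into a single running state, starting from 0.
total = reduce(lambda state, value: state + value, values, 0)

print(first)
print(total)
```

Here `first` is 8 and `total` is 22, the sum of the six values.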

Dataset structure

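A dataset contains elements that all have the same (nested) structure. The individual components of that structure can be of any type representable by tf.TypeSpec, including Tensor, SparseTensor, RaggedTensor, TensorArray, and Dataset.

The Dataset.element_spec property lets you inspect the type of each element component. It returns a nested structure of tf.TypeSpec objects that matches the structure of the element, which may be a single component, a tuple of components, or a nested tuple of components. For example: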

dataset1 = tf.data.Dataset.from_tensor_slices(tf.random.uniform([4, 10]))

dataset1.element_spec

TensorSpec(shape=(10,), dtype=tf.float32, name=None)

dataset2 = tf.data.Dataset.from_tensor_slices(
   (tf.random.uniform([4]),
    tf.random.uniform([4, 100], maxval=100, dtype=tf.int32)))

dataset2.element_spec

(TensorSpec(shape=(), dtype=tf.float32, name=None), TensorSpec(shape=(100,), dtype=tf.int32, name=None))

dataset3 = tf.data.Dataset.zip((dataset1, dataset2))

dataset3.element_spec

(TensorSpec(shape=(10,), dtype=tf.float32, name=None), (TensorSpec(shape=(), dtype=tf.float32, name=None), TensorSpec(shape=(100,), dtype=tf.int32, name=None)))

# Dataset containing a sparse tensor.
dataset4 = tf.data.Dataset.from_tensors(tf.SparseTensor(indices=[[0, 0], [1, 2]], values=[1, 2], dense_shape=[3, 4]))

dataset4.element_spec

SparseTensorSpec(TensorShape([3, 4]), tf.int32)

Here, tf.SparseTensor(indices=[[0, 0], [1, 2]], values=[1, 2], dense_shape=[3, 4]) represents the following dense tensor.

[[1, 0, 0, 0]
[0, 0, 2, 0]
[0, 0, 0, 0]]

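indices states that the non-zero values sit at positions [0, 0] and [1, 2]. In the result above, position [0, 0] holds 1, the first entry of values, and position [1, 2] holds 2, the second entry. dense_shape=[3, 4] means the tensor is two-dimensional with shape (3, 4).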
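The mapping from (indices, values, dense_shape) to the dense matrix can be spelled out in a few lines of plain Python. `to_dense` is a helper written for this note and handles only the two-dimensional case discussed here.

```python
def to_dense(indices, values, dense_shape):
    """Expand a 2-D sparse description into a dense list of lists."""
    n_rows, n_cols = dense_shape
    dense = [[0] * n_cols for _ in range(n_rows)]
    for (row, col), value in zip(indices, values):
        dense[row][col] = value   # every other position stays zero
    return dense


dense = to_dense(indices=[[0, 0], [1, 2]], values=[1, 2], dense_shape=[3, 4])
for row in dense:
    print(row)
```

The i-th entry of values goes to the i-th position listed in indices, which gives the 3 by 4 matrix shown above.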

# Use value_type to see the type of value represented by the element spec
dataset4.element_spec.value_type

tensorflow.python.framework.sparse_tensor.SparseTensor

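Dataset transformations support datasets of any structure. When using the Dataset.map() and Dataset.filter() transformations, which apply a function to each element, the element structure determines the arguments of the function.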

dataset1 = tf.data.Dataset.from_tensor_slices(
    tf.random.uniform([4, 10], minval=1, maxval=10, dtype=tf.int32))

dataset1

<TensorSliceDataset shapes: (10,), types: tf.int32>

for z in dataset1:
  print(z.numpy())

[8 9 2 7 2 4 8 8 1 7]
[6 8 6 5 6 5 1 7 3 7]
[8 6 9 8 7 9 4 7 4 7]
[4 1 2 5 9 9 1 9 6 8]

dataset2 = tf.data.Dataset.from_tensor_slices(
   (tf.random.uniform([4]),
    tf.random.uniform([4, 100], maxval=100, dtype=tf.int32)))

dataset2

<TensorSliceDataset shapes: ((), (100,)), types: (tf.float32, tf.int32)>

dataset3 = tf.data.Dataset.zip((dataset1, dataset2))

dataset3

<ZipDataset shapes: ((10,), ((), (100,))), types: (tf.int32, (tf.float32, tf.int32))>

for a, (b,c) in dataset3:
  print('shapes: {a.shape}, {b.shape}, {c.shape}'.format(a=a, b=b, c=c))

shapes: (10,), (), (100,)
shapes: (10,), (), (100,)
shapes: (10,), (), (100,)
shapes: (10,), (), (100,)


Curriculum Learning

Abstract

Humans and animals learn much better when the examples are not randomly presented but organized in a meaningful order which illustrates gradually more concepts, and gradually more complex ones. Here, we formalize such training strategies in the context of machine learning, and call them “curriculum learning”. In the context of recent research studying the difficulty of training in the presence of non-convex training criteria (for deep deterministic and stochastic neural networks), we explore curriculum learning in various set-ups. The experiments show that significant improvements in generalization can be achieved. We hypothesize that curriculum learning has both an effect on the speed of convergence of the training process to a minimum and, in the case of non-convex criteria, on the quality of the local minima obtained: curriculum learning can be seen as a particular form of continuation method (a general strategy for global optimization of non-convex functions).

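Humans and animals learn better when examples are not presented at random but are organized in a meaningful order that gradually introduces more concepts and more complex ones.

The paper formalizes this kind of training strategy in the context of machine learning and calls it curriculum learning.

Building on recent research into the difficulty of training with non-convex criteria (for deep deterministic and stochastic neural networks), the authors explore curriculum learning in various set-ups.

The experiments show significant improvements in generalization.

The authors hypothesize that curriculum learning affects both the speed of convergence to a minimum and, for non-convex criteria, the quality of the local minima obtained. Curriculum learning can be seen as a particular form of continuation method (a general strategy for global optimization of non-convex functions).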

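Summary

Curriculum learning takes the way people learn over a long period, from elementary material up to university-level material, and applies it to training machine learning models.
Its advantages are better generalization and faster convergence.
The continuation method mentioned in the paper is a way of finding good local minima of a non-convex objective. Like curriculum learning, it starts from an easy version of the objective function and gradually makes it harder, carrying the minimum found so far over to each next step.
Put simply, curriculum learning shows the model only easy samples at first and then gradually introduces harder ones. Rather than training on all of the data at once, you define which samples are easy and which are hard, and train in the order [easy -> hard].
The paper gives two ways to define easy samples. The first is the number of noisy (irrelevant) inputs. The second is the margin, the distance from the decision boundary for data drawn from Gaussian distributions: the larger the margin, the easier the sample, and the closer to the boundary, the harder.
The experiments include shape recognition, where the easy set contains only regular shapes such as circles and squares (Basic Shape) and the hard set also includes rectangles, ellipses, and so on (Geom Shape).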
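The [easy -> hard] ordering can be sketched as a schedule that sorts samples by a difficulty score and grows the training pool stage by stage. This is a generic illustration and not the paper's exact protocol: `curriculum_stages`, the sample names, and the use of the noisy-input count as the score are all assumptions made here.

```python
import math


def curriculum_stages(samples, difficulty, num_stages):
    """Return one training pool per stage, easiest samples first.

    Each stage contains all samples of the previous stage plus the
    next-hardest ones, so the last stage is the full dataset.
    """
    ordered = sorted(samples, key=difficulty)
    stages = []
    for stage in range(1, num_stages + 1):
        cutoff = math.ceil(len(ordered) * stage / num_stages)
        stages.append(ordered[:cutoff])
    return stages


# (name, number of noisy inputs): fewer noisy inputs = easier sample.
samples = [("s1", 3), ("s2", 0), ("s3", 1), ("s4", 2)]

stages = curriculum_stages(samples, difficulty=lambda s: s[1], num_stages=2)
for i, pool in enumerate(stages, start=1):
    print("stage", i, [name for name, _ in pool])
```

A training loop would then run some epochs on each pool in turn. Swapping in a different `difficulty` function (for example one based on the margin) changes the ordering without touching the schedule.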

Reference

Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009, June). Curriculum learning. In Proceedings of the 26th annual international conference on machine learning (pp. 41-48).

https://www.youtube.com/watch?v=fQtuWEuwXrA&list=PLWKf9beHi3Tg50UoyTe6rIm20sVQOH1br&index=85


Net2Net: Accelerating Learning via Knowledge Transfer

ABSTRACT

We introduce techniques for rapidly transferring the information stored in one neural net into another neural net. The main purpose is to accelerate the training of a significantly larger neural net. During real-world workflows, one often trains very many different neural networks during the experimentation and design process. This is a wasteful process in which each new model is trained from scratch. Our Net2Net technique accelerates the experimentation process by instantaneously transferring the knowledge from a previous network to each new deeper or wider network. Our techniques are based on the concept of function-preserving transformations between neural network specifications. This differs from previous approaches to pre-training that altered the function represented by a neural net when adding layers to it. Using our knowledge transfer mechanism to add depth to Inception modules, we demonstrate a new state of the art accuracy rating on the ImageNet dataset.

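This paper introduces techniques for rapidly transferring the information stored in one neural network into another.

The main purpose is to accelerate the training of a significantly larger network.

In real-world workflows, many different networks are trained during experimentation and design.

Training each new model from scratch is a wasteful process.

The Net2Net technique accelerates experimentation by instantly transferring the knowledge of a previous network to each new deeper or wider network.

The technique is based on function-preserving transformations between neural network specifications.

This differs from earlier pre-training approaches, which changed the function represented by the network when layers were added.

Using this method to add depth to Inception modules, the authors report a new state-of-the-art accuracy on the ImageNet dataset.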

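Summary

The paper proposes a way to transfer the knowledge of a small, already trained network into a deeper or wider network.
In practice, solving a problem means trying many networks, and training each from scratch takes a great deal of time. The paper starts from the question: can the information in a previous network be reused when training a larger one?

An existing method, FitNets, can do something similar, but it has the drawback of requiring training. FitNets trains a network using the feature maps of the previous network as targets.
The method in this paper instead constrains the network architecture and transfers without any training.
To make the network wider, each added unit copies a randomly chosen unit of the teacher network, and the outgoing weights of every copied unit are divided by n, the number of its copies, so that the output of the network stays the same.

To make the network deeper, a new layer initialized as an identity mapping is inserted. This works with ReLU but not with sigmoid, because applying the activation a second time must leave the values unchanged, which holds for ReLU and not for sigmoid.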
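The widening step can be checked numerically on a tiny two-layer network. The sketch below is a from-scratch illustration written with plain lists, not the authors' code: `net2wider`, `forward`, and the example weights are assumptions, and the layout (w1 as inputs by hidden units, w2 as hidden units by outputs) is chosen for readability.

```python
import math
import random


def relu(v):
    return [max(0.0, x) for x in v]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def forward(x, w1, w2):
    """x -> ReLU hidden layer -> linear output."""
    hidden = relu([sum(x[i] * w1[i][j] for i in range(len(x)))
                   for j in range(len(w1[0]))])
    return [sum(hidden[j] * w2[j][k] for j in range(len(hidden)))
            for k in range(len(w2[0]))]


def net2wider(w1, w2, new_width, seed=0):
    """Widen the hidden layer while keeping the network function."""
    rng = random.Random(seed)
    old_width = len(w2)
    # Original units keep their place; each extra unit copies a random one.
    mapping = list(range(old_width)) + [
        rng.randrange(old_width) for _ in range(new_width - old_width)]
    counts = [mapping.count(j) for j in range(old_width)]
    wider_w1 = [[row[j] for j in mapping] for row in w1]      # copy incoming
    wider_w2 = [[w / counts[j] for w in w2[j]] for j in mapping]  # split outgoing
    return wider_w1, wider_w2


x = [1.0, 2.0]
w1 = [[1.0, -2.0], [0.5, 3.0]]
w2 = [[2.0, 1.0], [-1.0, 0.5]]

wide_w1, wide_w2 = net2wider(w1, w2, new_width=4)
y_before = forward(x, w1, w2)
y_after = forward(x, wide_w1, wide_w2)
same = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(y_before, y_after))

# Identity-layer condition: applying the activation twice changes nothing.
v = [-1.0, 2.0]
relu_ok = relu(relu(v)) == relu(v)
sigmoid_ok = sigmoid(sigmoid(v)) == sigmoid(v)
print(same, relu_ok, sigmoid_ok)
```

The widened network gives the same output as the original, and the last two checks show why an identity-initialized layer is safe after ReLU but not after sigmoid.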

Reference

https://www.youtube.com/watch?v=btsZOMsyH_o&list=PLWKf9beHi3Tg50UoyTe6rIm20sVQOH1br&index=78

Chen, T., Goodfellow, I., & Shlens, J. (2015). Net2net: Accelerating learning via knowledge transfer. arXiv preprint arXiv:1511.05641.


keras custom generator - 2

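This is an implementation of a data generator that combines an image generator with additional data you want to use.

The images are loaded through an ImageDataGenerator, while the additional data, color, is fetched directly by index, one batch at a time.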

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

class DataGenerator(tf.keras.utils.Sequence):
    def __init__(self, df, batch_size = 32, target_size = (112, 112), shuffle = True):
        self.len_df = len(df)
        self.batch_size = batch_size
        self.target_size = target_size
        self.shuffle = shuffle
        self.class_col = ['black', 'blue', 'brown', 'green', 'red', 'white', 
             'dress', 'shirt', 'pants', 'shorts', 'shoes']
        self.generator = ImageDataGenerator(rescale = 1./255)
        self.df_generator = self.generator.flow_from_dataframe(dataframe=df, 
                                                          directory='',
                                                            x_col = 'image',
                                                            y_col = self.class_col,
                                                            target_size = self.target_size,
                                                            color_mode='rgb',
                                                            class_mode='other',
                                                            batch_size=self.batch_size,
                                                            seed=42)
        self.colors_df = df['color']
        self.on_epoch_end()
        
    def __len__(self):
        # Number of full batches per epoch; a trailing partial batch is dropped.
        return int(np.floor(self.len_df / self.batch_size))
    
    def on_epoch_end(self):
        self.indexes = np.arange(self.len_df)
        if self.shuffle:
            np.random.shuffle(self.indexes)
        
    def __getitem__(self, index):
        indexes = self.indexes[index * self.batch_size : (index + 1) * self.batch_size]
        colors = self.__data_generation(indexes)
        
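        # NOTE: df_generator shuffles with its own index order (shuffle defaults to True),
        # so the rows behind `images` line up with `colors` only if both use the same order.
        images, labels = self.df_generator.__getitem__(index)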
        
        # return multi-input and output
        return [images, colors], labels
    
    def __data_generation(self, indexes):
        colors = self.colors_df[indexes].to_numpy()
        # or
        # colors = np.array([self.colors_df[k] for k in indexes])
        
        return colors
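The index bookkeeping in `__len__`, `on_epoch_end`, and `__getitem__` can be tried in isolation. The following sketch reproduces only that arithmetic in plain Python; `epoch_batches` is a name invented here, and it returns index lists rather than images or colors.

```python
import random


def epoch_batches(n_samples, batch_size, shuffle=True, seed=0):
    """Index lists for one epoch, mirroring the generator's slicing."""
    indexes = list(range(n_samples))
    if shuffle:
        random.Random(seed).shuffle(indexes)    # on_epoch_end
    n_batches = n_samples // batch_size         # __len__: full batches only
    return [indexes[i * batch_size:(i + 1) * batch_size]   # __getitem__
            for i in range(n_batches)]


batches = epoch_batches(n_samples=100, batch_size=32)
used = {i for batch in batches for i in batch}
print(len(batches), len(used))
```

With 100 samples and a batch size of 32 there are 3 batches per epoch, so 4 samples are left out of each epoch (a different 4 whenever the order is reshuffled).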

1 - https://hwiyong.tistory.com/241


DEEP COMPRESSION: COMPRESSING DEEP NEURAL NETWORKS WITH PRUNING, TRAINED QUANTIZATION AND HUFFMAN CODING

ABSTRACT

Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources. To address this limitation, we introduce “deep compression”, a three stage pipeline: pruning, trained quantization and Huffman coding, that work together to reduce the storage requirement of neural networks by 35× to 49× without affecting their accuracy. Our method first prunes the network by learning only the important connections. Next, we quantize the weights to enforce weight sharing, finally, we apply Huffman coding. After the first two steps we retrain the network to fine tune the remaining connections and the quantized centroids. Pruning, reduces the number of connections by 9× to 13×; Quantization then reduces the number of bits that represent each connection from 32 to 5. On the ImageNet dataset, our method reduced the storage required by AlexNet by 35×, from 240MB to 6.9MB, without loss of accuracy. Our method reduced the size of VGG-16 by 49× from 552MB to 11.3MB, again with no loss of accuracy. This allows fitting the model into on-chip SRAM cache rather than off-chip DRAM memory. Our compression method also facilitates the use of complex neural networks in mobile applications where application size and download bandwidth are constrained. Benchmarked on CPU, GPU and mobile GPU, compressed network has 3× to 4× layerwise speedup and 3× to 7× better energy efficiency.

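Neural networks are both computationally intensive and memory intensive, which makes them hard to deploy on embedded systems with limited hardware resources.

To address this limitation, the paper introduces "deep compression", a three-stage pipeline of pruning, trained quantization, and Huffman coding that reduces the storage requirement by 35× to 49× without loss of accuracy.

The method first prunes the network by learning only the important connections.

Next, the weights are quantized to enforce weight sharing, and finally Huffman coding is applied.

After the first two steps, the network is retrained to fine-tune the remaining connections and the quantized centroids.

Pruning reduces the number of connections by 9× to 13×, and quantization reduces the number of bits representing each connection from 32 to 5.

On the ImageNet dataset, the method reduced the storage required by AlexNet by 35× without loss of accuracy (240MB -> 6.9MB).

It also reduced the size of VGG-16 by 49×, again without loss of accuracy (552MB -> 11.3MB).

This allows the model to fit into on-chip SRAM cache rather than off-chip DRAM memory.

The compression method also makes complex neural networks usable in mobile applications, where application size and download bandwidth are constrained.

Benchmarked on CPU, GPU, and mobile GPU, the compressed network shows a 3× to 4× layerwise speedup and 3× to 7× better energy efficiency.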

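Summary

The pipeline proposed in the paper is Pruning -> Quantization -> Huffman coding.

Huffman coding assigns a variable number of bits to each symbol according to its probability (frequent symbols get shorter codes) and can be decoded efficiently.
Pruning cuts the connections in the network that carry little meaning. Pruning and retraining are repeated so that accuracy is maintained.
Quantization means dividing by a fixed value or storing representative values. Here the index of a representative value (centroid) is stored for each weight, which reduces the number of bits. All weights with the same index share the same representative value. In the figure in the original post, for example, every weight with index 1 uses the same (blue) value.

For initializing the quantization centroids, uniform init (linearly spaced centroids) gives the best results. The other initializations end up under-representing the weights with very large or very small values, and that is where the accuracy drops.
Using pruning and quantization together gives a better weight distribution than using either alone, so the combination is more effective in terms of both accuracy loss and model compression.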
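To see how index storage and Huffman coding save bits, here is a small sketch on eight made-up weights. It is a simplification under stated assumptions: the centroids are just linearly spaced between the smallest and largest weight (no k-means and no retraining, unlike the paper), and `linear_centroids`, `quantize`, and `huffman_code_lengths` are names chosen for this illustration.

```python
import heapq
from collections import Counter


def linear_centroids(weights, k):
    """k centroids spaced evenly between the min and max weight."""
    lo, hi = min(weights), max(weights)
    return [lo + (hi - lo) * i / (k - 1) for i in range(k)]


def quantize(weights, centroids):
    """Replace each weight by the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: abs(w - centroids[c]))
            for w in weights]


def huffman_code_lengths(symbols):
    """Code length in bits for each distinct symbol."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (frequency, tie-breaker, {symbol: depth so far}).
    heap = [(n, i, {s: 0}) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, d1 = heapq.heappop(heap)
        n2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]


weights = [-1.0, -0.9, 0.05, 0.0, -0.05, 0.1, 1.0, 0.0]
centroids = linear_centroids(weights, k=3)
indices = quantize(weights, centroids)
lengths = huffman_code_lengths(indices)
huffman_bits = sum(lengths[i] for i in indices)
fixed_bits = 2 * len(indices)   # 3 centroids need 2 bits per index
print(indices, lengths, huffman_bits, fixed_bits)
```

Most weights fall on the middle centroid, so its index gets a 1-bit code and the eight indices take 11 bits instead of 16 with a fixed 2-bit code. The shared codebook of three centroid values is stored once, separately from the indices.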

Reference

Han, S., Mao, H., & Dally, W. J. (2015). Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149.

https://www.youtube.com/watch?v=9mFZmpIbMDs&list=PLWKf9beHi3Tg50UoyTe6rIm20sVQOH1br&index=72

