tf.data.Dataset.window example

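tf.data.Dataset.window is very useful when working with time-series data.
With time-series data, you usually have to write a helper like the following yourself to build sequences, which is tedious.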

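import numpy as np

def make_sequence(data, n):
    """Build (X, y) pairs from a DataFrame whose last column is the target."""
    X, y = [], []

    # Stop n rows before the end so every window has a target row after it.
    for i in range(len(data) - n):
        X.append(data.iloc[i:i + n, :-1].to_numpy())  # n rows of features
        y.append(data.iloc[i + n, -1])                # target right after the window

    return np.array(X), np.array(y)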

With tf.data, the multi-line code above can be replaced by a single method, window.

import tensorflow as tf

dataset = tf.data.Dataset.range(10)
dataset = dataset.window(5, shift=1)
# Each element of the windowed dataset is itself a small Dataset.
for window_dataset in dataset:
  for val in window_dataset:
    print(val.numpy(), end=" ")
  print()

The first argument to dataset.window is the window size, and the second, shift, is how many elements the window moves forward at each step.
The result is as follows.

0 1 2 3 4
1 2 3 4 5
2 3 4 5 6
3 4 5 6 7
4 5 6 7 8
5 6 7 8 9
6 7 8 9
7 8 9
8 9
9

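In the output, we get windows of window_size = 5 at first, but near the end we get unwanted results such as [6, 7, 8, 9], [7, 8, 9], ...
This happens because a window that starts near the end runs past the end of the dataset, so fewer than window_size elements remain.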
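The same short tail appears with other shift values. As a small sketch (the size of 3 and the shift of 2 are arbitrary values chosen for illustration, not taken from the example above), the start of each window advances by shift elements, and the last window is cut off where the dataset ends:

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10)
# Each window starts 2 elements after the previous one (shift=2).
dataset = dataset.window(3, shift=2)

# Collect every window as a plain Python list for inspection.
windows = [[int(v) for v in w.as_numpy_iterator()] for w in dataset]
for w in windows:
    print(w)
```

The final window here holds only [8, 9], because only two elements remain from position 8 onward.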

To prevent this, pass the drop_remainder=True argument.

dataset = tf.data.Dataset.range(10)
dataset = dataset.window(5, shift=1, drop_remainder=True)
for window_dataset in dataset:
  for val in window_dataset:
    print(val.numpy(), end=" ")
  print()
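To make the effect of drop_remainder concrete, here is a pure-Python sketch that mimics the behavior described above. The helper name sliding_windows is hypothetical and is not part of tf.data; it only imitates the size, shift, and drop_remainder arguments on a plain list.

```python
def sliding_windows(items, size, shift=1, drop_remainder=False):
    """Imitate the windowing behavior described above on a Python list."""
    result = []
    for start in range(0, len(items), shift):
        window = items[start:start + size]
        # A window near the end may hold fewer than `size` elements.
        if drop_remainder and len(window) < size:
            break
        result.append(window)
    return result


data = list(range(10))
all_windows = sliding_windows(data, 5)
full_windows = sliding_windows(data, 5, drop_remainder=True)
print(len(all_windows), len(full_windows))  # 10 6
print(full_windows[-1])                     # [5, 6, 7, 8, 9]
```

With drop_remainder=True only the six full windows survive, which matches the first six lines of the earlier output.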

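The nested for-loop is needed because dataset.window returns a Dataset of Datasets, not Tensors.
You can use flat_map to flatten each window_dataset so that it can be used directly.
Put simply, without flat_map each window hands you its elements one at a time through an iterator (5 -> 4 -> 3 -> ...), whereas with flat_map and batch you receive the whole window at once as a single tensor such as [5, 4, 3, 2, 1].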

dataset = tf.data.Dataset.range(10)
dataset = dataset.window(5, shift=1, drop_remainder=True)
dataset = dataset.flat_map(lambda window: window.batch(5))
for window in dataset:
  print(window.numpy())
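One way to see the difference is to inspect element_spec before and after flat_map. This is a sketch that assumes TensorFlow 2.x; passing drop_remainder=True to batch is an addition that is not in the example above, used here only so that the window length shows up as a static shape.

```python
import tensorflow as tf

windowed = tf.data.Dataset.range(10).window(5, shift=1, drop_remainder=True)
# Each element is itself a Dataset, which is why a nested loop was needed.
print(windowed.element_spec)

flat = windowed.flat_map(lambda window: window.batch(5))
# Each element is now a 1-D int64 tensor; its length is not known statically.
print(flat.element_spec)

static = windowed.flat_map(lambda window: window.batch(5, drop_remainder=True))
# drop_remainder=True in batch makes the length 5 part of the static shape.
print(static.element_spec)
```

A static shape can matter later if the windows are fed to layers that need to know the sequence length in advance.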

Finally, you can also use it as follows to split each window into an input and a label.

dataset = tf.data.Dataset.range(10)
dataset = dataset.window(5, shift=1, drop_remainder=True)
dataset = dataset.flat_map(lambda window: window.batch(5))
dataset = dataset.map(lambda window: (window[:-1], window[-1:]))
for x,y in dataset:
  print(x.numpy(), y.numpy())

The result is as follows.
[0 1 2 3] [4]
[1 2 3 4] [5]
[2 3 4 5] [6]
[3 4 5 6] [7]
[4 5 6 7] [8]
[5 6 7 8] [9]
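Putting the pieces together, the make_sequence helper from the beginning can be approximated with tf.data alone. This is a minimal sketch, not a definitive implementation: the function name make_sequence_dataset and the toy array are hypothetical, and it assumes a 2-D float array whose last column is the target.

```python
import numpy as np
import tensorflow as tf


def make_sequence_dataset(values, n):
    """Yield (X, y): n rows of features and the target of the row after them."""
    ds = tf.data.Dataset.from_tensor_slices(values)
    # n feature rows plus one extra row that supplies the target.
    ds = ds.window(n + 1, shift=1, drop_remainder=True)
    ds = ds.flat_map(lambda w: w.batch(n + 1, drop_remainder=True))
    # Features: first n rows, all but the last column. Target: last row, last column.
    return ds.map(lambda w: (w[:-1, :-1], w[-1, -1]))


# Toy data (assumption): 6 rows, column 0 is a feature, column 1 is the target.
values = np.arange(12, dtype=np.float32).reshape(6, 2)
pairs = [(x.numpy(), y.numpy()) for x, y in make_sequence_dataset(values, 3)]
for x, y in pairs:
    print(x.ravel(), y)
```

With 6 rows and n = 3 this yields 3 samples, the same count the make_sequence loop would produce for a DataFrame holding the same values.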

reference

https://www.tensorflow.org/guide/data

