

Solving Natural Language Inference with Rally

Before you begin, make sure Python 3.6 or later is installed, then run the following command to install Tensor2Tensor:

pip install tensor2tensor==1.10.0

SNLI

SNLI is a dataset of about 570,000 natural language inference (NLI) examples. Each instance consists of a pair of sentences (a premise and a hypothesis) and a label (entailment, contradiction, or neutral).

Name        Value
Premise     A man inspects the uniform of a figure in some East Asian country.
Hypothesis  The man is sleeping.
Label       contradiction
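
In code, you can think of one instance as a simple record like the sketch below (illustrative only; these field names are descriptive and not necessarily the keys used in the raw data files).

# One SNLI instance as a plain Python dict (illustrative only).
instance = {
    "premise": "A man inspects the uniform of a figure in some East Asian country.",
    "hypothesis": "The man is sleeping.",
    "label": "contradiction",  # one of: entailment, contradiction, neutral
}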

This tutorial uses the Decomposable Attention model (Parikh et al., 2016) to solve SNLI.

# Imports we need.
import tensorflow as tf
import os

# Enable TF Eager execution
tfe = tf.contrib.eager
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
tfe.enable_eager_execution(config=config)

# Disable TensorFlow debugging log
tf.logging.set_verbosity(tf.logging.ERROR)

Directory setup

  • tmp_dir: the location where the SNLI dataset will be downloaded

  • data_dir: the location where the SNLI dataset and vocab will be stored

  • train_dir: the location where the Decomposable Attention model will be saved

# Setup some directories
tmp_dir = os.path.expanduser("/tmp/snli")
data_dir = os.path.expanduser("/scatter/workspace/youngwook/rally/data/snli")
train_dir = os.path.expanduser(
    "/scatter/workspace/youngwook/rally/train/snli/decomposable_attention/seq2seq_small"
)

Preparing the dataset

First, we need to define a Problem that fetches the dataset and converts it. Here we use the SNLI problem implemented in Rally.

from rally.data.datasets.snli import SNLI

# Fetch the SNLI problem
snli_problem = SNLI()

# The generate_data method of a problem will download data and process it into
# a standard format ready for training and evaluation.
snli_problem.generate_data(data_dir, tmp_dir)

# Get the encoders from the problem
encoders = snli_problem.feature_encoders(data_dir)

Let's decode the token IDs of a training example back into strings to inspect the contents.

it = tfe.Iterator(snli_problem.dataset(tf.estimator.ModeKeys.TRAIN, data_dir))
example = it.next()


def print_example(example, encoders):
    print("ID")
    for key in ["premise", "hypothesis", "targets"]:
        encoder = encoders[key]
        feature = example[key]
        print(
            "| {:<10} | {}".format(
                key, ", ".join("{}".format(i) for i in feature.numpy().ravel())
            )
        )
    print()
    print("Text")
    for key in ["premise", "hypothesis", "targets"]:
        encoder = encoders[key]
        feature = example[key]
        print(
            "| {:<10} | {}".format(
                key, encoder.decode(feature.numpy().ravel(), strip_extraneous=True)
            )
        )


print_example(example, encoders)
:::MLPv0.5.0 transformer 1542525539.019731045 (<ipython-input-4-ced681d97999>:1) input_order
ID
| premise    | 4, 67, 25, 307, 5, 623, 98, 5, 46, 12, 3, 550, 1468, 13, 37, 175, 5, 3, 698, 2, 1
| hypothesis | 4, 25, 1739, 90, 3, 698, 379, 12, 623, 17, 190, 89, 9, 37, 138, 2, 1
| targets    | 2

Text
| premise    | A little girl covered in paint sits in front of a painted rainbow with her hands in a bowl.
| hypothesis | A girl splashes around a bowl full of paint, getting some on her dress.
| targets    | neutral

Exercise 1

Problem.feature_encoders is a function that returns a Dict of <feature name, TextEncoder>, where TextEncoder is a class that encodes/decodes inputs and outputs.

For example, Rally's SentencePairClassificationProblem implements feature_encoders as follows.

class SentencePairClassificationProblem(text_problems.Text2TextProblem):
    """Base class for NLI (sentence pair classification) problems."""

    def feature_encoders(self, data_dir):
        encoder = self.get_or_create_vocab(data_dir, None, force_get=True)
        return {
            FeatureNames.PREMISE: encoder,
            FeatureNames.HYPOTHESIS: encoder,
            "targets": text_encoder.ClassLabelEncoder(self.class_labels(data_dir)),
        }

The default implementation of get_or_create_vocab builds a SubwordTextEncoder when vocab_type == SUBWORD, and SubwordTextEncoder uses wordpieces to encode/decode inputs and outputs.
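
For instance, you can round-trip a sentence through the subword encoder created above. This is a sketch that reuses the `encoders` dict from the SNLI problem; the exact subword IDs depend on the generated vocabulary.

# Round-trip a sentence through the subword encoder built by the SNLI problem.
premise_encoder = encoders["premise"]
ids = premise_encoder.encode("A man inspects the uniform of a figure.")
print(ids)                          # subword token ids
print(premise_encoder.decode(ids))  # back to the original text

# The "targets" encoder is a ClassLabelEncoder; per the example above, id 2 is "neutral".
print(encoders["targets"].decode([2]))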

If you want to use a different TextEncoder, you can override get_or_create_vocab.

Question: how would you build a Problem for a character-level model? (Fill in the blanks below.)

class SNLICharacter(SNLI):
    """StanfordNLI classification problems with a character-based tokenizer."""

    @property
    def vocab_type(self):
        return NotImplemented

    def get_or_create_vocab(self, data_dir, tmp_dir, force_get=False):
        del force_get  # Always get
        ################################################################################
        # TODO:
        # Build a `TextEncoder` that uses character-level tokens.
        # The tokens must include `oov_token`, " ", ".", and the upper- and
        # lower-case English letters.
        # `oov_token`: a string that is never used as a token, e.g. "<UNK>".
        #
        # Classes you can use:
        # * rally.data.tokenizer.CharacterTokenizer
        # * rally.data.vocabulary.Vocabulary
        # * rally.data.token_encoder.TokenEncoder
        ################################################################################
        # Your code

        ################################################################################
        #                              END OF YOUR CODE                               #
        ################################################################################
        return encoder
def check_character_problem():
    character_data_dir = os.path.expanduser("~/t2t_data/snli_character")
    tf.gfile.MakeDirs(character_data_dir)

    snli_character_problem = SNLICharacter()
    snli_character_problem.generate_data(character_data_dir, tmp_dir)

    # Get the encoders from the problem
    character_encoders = snli_character_problem.feature_encoders(character_data_dir)

    character_it = tfe.Iterator(
        snli_character_problem.dataset(tf.estimator.ModeKeys.TRAIN, character_data_dir)
    )
    character_example = character_it.next()
    print_example(character_example, character_encoders)


check_character_problem()
:::MLPv0.5.0 transformer 1542525541.277723789 (<ipython-input-6-bf2cebb16253>:9) input_order
ID
| premise    | 31, 3, 8, 13, 23, 13, 18, 24, 9, 22, 9, 23, 24, 9, 8, 3, 29, 19, 25, 18, 11, 3, 27, 19, 17, 5, 18, 3, 5, 18, 8, 3, 5, 18, 3, 19, 16, 8, 9, 22, 3, 17, 5, 18, 3, 23, 24, 5, 18, 8, 3, 6, 29, 3, 5, 3, 6, 5, 22, 4, 1
| hypothesis | 50, 12, 9, 3, 27, 19, 17, 5, 18, 3, 13, 23, 3, 12, 19, 16, 8, 13, 18, 11, 3, 5, 3, 8, 22, 13, 18, 15, 4, 1
| targets    | 2

Text
| premise    | A disinterested young woman and an older man stand by a bar.<EOS>
| hypothesis | The woman is holding a drink.<EOS>
| targets    | neutral

Model

A model must be implemented by subclassing T2TModel. In most cases you only need to implement body. For the SNLI problem we will use the Decomposable Attention model.
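
For orientation, a minimal T2TModel subclass could look like the sketch below. It is illustrative only (MyTinyModel is not part of Rally or Tensor2Tensor): the feature names that body receives depend on the Problem, and the base class takes care of embedding lookup, the output projection, and the loss.

from tensor2tensor.utils import registry
from tensor2tensor.utils import t2t_model


@registry.register_model
class MyTinyModel(t2t_model.T2TModel):
    """Illustrative model: only `body` is implemented."""

    def body(self, features):
        # `features` holds already-embedded tensors; for a plain text-to-text
        # problem the main input is features["inputs"].
        inputs = features["inputs"]
        return tf.layers.dense(
            inputs, self._hparams.hidden_size, activation=tf.nn.relu
        )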

Exercise 2

Below is a Decomposable Attention model class that does not use an LSTM encoder. Add a bidirectional LSTM encoder to it.

from rally.layers.mlp import MLP
import rally.layers.decomposable_attention as layers
from rally.models.entailment import EntailmentModel


class DecomposableAttentionLSTM(EntailmentModel):
    """
    This model implements the sentence pair classification with decomposable
    attention model (A Decomposable Attention Model for Natural Language
    Inference, 2016) with bi-directional LSTM encoder.
    """

    def body(self, features):
        embedding_dim = self._hparams.hidden_size * 2

        attention_mapper_num_layers = [embedding_dim] * 2
        attention_mapper_l2_coef = 0.0  # l2 coef in attention mapper kills attention
        attention_mapper = MLP(
            attention_mapper_num_layers,
            dropout=True,
            activation=tf.nn.relu,
            kernel_initializer=tf.contrib.layers.xavier_initializer(),
            kernel_regularizer=tf.contrib.layers.l2_regularizer(
                scale=attention_mapper_l2_coef
            ),
            name="attend_attention_mapper",
        )

        attend_dim = self._hparams.hidden_size * 2
        compare_mapper_num_layers = [(embedding_dim + attend_dim) / 2] * 2
        compare_mapper_l2_coef = 3e-4
        compare_mapper = MLP(
            compare_mapper_num_layers,
            dropout=True,
            activation=tf.nn.relu,
            kernel_initializer=tf.contrib.layers.xavier_initializer(),
            kernel_regularizer=tf.contrib.layers.l2_regularizer(
                scale=compare_mapper_l2_coef
            ),
            name="compare_mapper",
        )

        compare_dim = (embedding_dim + attend_dim) / 2
        aggregate_mapper_num_layers = [compare_dim // 2]
        aggregate_mapper_l2_coef = 3e-4
        aggregate_mapper = MLP(
            aggregate_mapper_num_layers,
            dropout=True,
            activation=tf.nn.relu,
            kernel_initializer=tf.contrib.layers.xavier_initializer(),
            kernel_regularizer=tf.contrib.layers.l2_regularizer(
                scale=aggregate_mapper_l2_coef
            ),
            name="aggregate_mapper",
        )

        premise_encoder = None
        hypothesis_encoder = None
        ################################################################################
        # TODO:
        # Build bidirectional LSTMs for `premise_encoder` and `hypothesis_encoder`.
        # Use "premise_encoder" and "hypothesis_encoder" as the respective `name`s.
        #
        # Hyperparameters you should use:
        # * self._hparams.hidden_size
        # * self._hparams.num_hidden_layers
        # * self._hparams.dropout
        #
        # Class you should use:
        # * rally.layers.rnn.BidirectionalLSTMEncoder
        ################################################################################
        # Your code

        ################################################################################
        #                              END OF YOUR CODE                               #
        ################################################################################

        model = layers.DecomposableAttention(
            attention_mapper,
            compare_mapper,
            aggregate_mapper,
            premise_encoder=premise_encoder,
            hypothesis_encoder=hypothesis_encoder,
        )

        train = self._hparams.mode == tf.estimator.ModeKeys.TRAIN
        premise = features.get("premise")
        hypothesis = features.get("hypothesis")
        return model([premise, hypothesis], training=train)

Define the same hyperparameters that the pretrained model was trained with.

from tensor2tensor.utils import trainer_lib
from tensor2tensor.layers import common_hparams
from tensor2tensor.utils import registry


@registry.register_hparams
def seq2seq_small():
    """hparams for Seq2seq encoder models."""
    hparams = common_hparams.basic_params1()
    hparams.daisy_chain_variables = False
    hparams.batch_size = 1024
    hparams.hidden_size = 128
    hparams.num_hidden_layers = 2
    hparams.initializer = "uniform_unit_scaling"
    hparams.initializer_gain = 1.0
    hparams.weight_decay = 0.0
    return hparams


# Create hparams and the model
hparams_set = "seq2seq_small"
hparams = trainer_lib.create_hparams(
    hparams_set, data_dir=data_dir, problem_name="snli"
)
da_model = DecomposableAttentionLSTM(hparams, tf.estimator.ModeKeys.EVAL)

The pretrained model's checkpoint must be in train_dir.

ckpt_path = tf.train.latest_checkpoint(train_dir)
ckpt_path
'/Users/ywkim/t2t_train/snli/decomposable_attention/seq2seq_small/model.ckpt-195000'
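
If restoring fails later on, a quick sanity check (a sketch) is to list a few of the variables stored in the checkpoint:

# List a few (name, shape) pairs stored in the checkpoint.
for name, shape in tf.train.list_variables(ckpt_path)[:5]:
    print(name, shape)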

Let's make predictions with the pretrained model.

# Restore and infer!
def infer(example):
    input_batch = {
        key: tf.cast(tf.expand_dims(example[key], 0), tf.int32) for key in example
    }
    with tfe.restore_variables_on_create(ckpt_path):
        model_output = da_model.infer(input_batch)["outputs"]
    return encoders["targets"].decode(model_output.numpy().ravel())


example = it.next()
print_example(example, encoders)

outputs = infer(example)
print()
print("Outputs: %s" % outputs)
ID
| premise    | 4, 124, 21, 31, 645, 6, 380, 415, 64, 7, 1320, 2, 1
| hypothesis | 864, 6, 7, 742, 12, 7, 622, 2, 1
| targets    | 2

Text
| premise    | A lady wearing an apron is selling fish from the boxes.
| hypothesis | She is the owner of the business.
| targets    | neutral

:::MLPv0.5.0 transformer 1542525543.597415924 (/Users/ywkim/.local/share/virtualenvs/rally-f4OA2-t-/lib/python3.6/site-packages/tensor2tensor/utils/t2t_model.py:213) model_hp_initializer_gain: 1.0

Outputs: neutral
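
You can also run the model on your own sentence pair by encoding it with the same encoders and reusing the infer helper above. This is a sketch: appending the EOS id (1) and passing a dummy "targets" entry are assumptions about what the Rally problem expects at inference time.

# Classify a custom premise/hypothesis pair (sketch; see assumptions above).
def infer_pair(premise_text, hypothesis_text):
    example = {
        "premise": tf.constant(encoders["premise"].encode(premise_text) + [1]),
        "hypothesis": tf.constant(encoders["hypothesis"].encode(hypothesis_text) + [1]),
        "targets": tf.constant([0]),  # placeholder label, ignored at inference
    }
    return infer(example)


print(infer_pair(
    "A man inspects the uniform of a figure in some East Asian country.",
    "The man is sleeping.",
))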