

Solving Natural Language Inference with Rally

Before you begin, make sure Python 3.6 or later is installed, then run the following command to install Tensor2Tensor:

pip install tensor2tensor==1.10.0

SNLI

SNLI is a dataset of about 570,000 natural language inference (NLI) examples. Each instance consists of a pair of sentences (a premise and a hypothesis) and a label (entailment, contradiction, or neutral).

Name        Value
Premise     A man inspects the uniform of a figure in some East Asian country.
Hypothesis  The man is sleeping.
Label       contradiction
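
In code, you can think of one instance as a simple record like the sketch below (illustrative only; these field names are descriptive and not necessarily the keys used in the raw data files).

# One SNLI instance as a plain Python dict (illustrative only).
instance = {
    "premise": "A man inspects the uniform of a figure in some East Asian country.",
    "hypothesis": "The man is sleeping.",
    "label": "contradiction",  # one of: entailment, contradiction, neutral
}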

This tutorial uses the Decomposable Attention model (Parikh et al., 2016) to solve SNLI.

# Imports we need.
import tensorflow as tf
import os

# Enable TF Eager execution
tfe = tf.contrib.eager
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
tfe.enable_eager_execution(config=config)

# Disable TensorFlow debugging log
tf.logging.set_verbosity(tf.logging.ERROR)

Directory setup

  • tmp_dir: the location where the SNLI dataset will be downloaded

  • data_dir: the location where the SNLI dataset and vocab will be stored

  • train_dir: the location where the Decomposable Attention model will be saved

# Setup some directories
tmp_dir = os.path.expanduser("/tmp/snli")
data_dir = os.path.expanduser("/scatter/workspace/youngwook/rally/data/snli")
train_dir = os.path.expanduser(
    "/scatter/workspace/youngwook/rally/train/snli/decomposable_attention/seq2seq_small"
)

Preparing the dataset

First, we need to define a Problem that fetches the dataset and converts it. Here we use the SNLI problem implemented in Rally.

from rally.data.datasets.snli import SNLI

# Fetch the SNLI problem
snli_problem = SNLI()

# The generate_data method of a problem will download data and process it into
# a standard format ready for training and evaluation.
snli_problem.generate_data(data_dir, tmp_dir)

# Get the encoders from the problem
encoders = snli_problem.feature_encoders(data_dir)

Let's decode the token IDs of a training example back into strings to inspect the contents.

it = tfe.Iterator(snli_problem.dataset(tf.estimator.ModeKeys.TRAIN, data_dir))
example = it.next()


def print_example(example, encoders):
    print("ID")
    for key in ["premise", "hypothesis", "targets"]:
        encoder = encoders[key]
        feature = example[key]
        print(
            "| {:<10} | {}".format(
                key, ", ".join("{}".format(i) for i in feature.numpy().ravel())
            )
        )
    print()
    print("Text")
    for key in ["premise", "hypothesis", "targets"]:
        encoder = encoders[key]
        feature = example[key]
        print(
            "| {:<10} | {}".format(
                key, encoder.decode(feature.numpy().ravel(), strip_extraneous=True)
            )
        )


print_example(example, encoders)
:::MLPv0.5.0 transformer 1542525539.019731045 (<ipython-input-4-ced681d97999>:1) input_order
ID
| premise    | 4, 67, 25, 307, 5, 623, 98, 5, 46, 12, 3, 550, 1468, 13, 37, 175, 5, 3, 698, 2, 1
| hypothesis | 4, 25, 1739, 90, 3, 698, 379, 12, 623, 17, 190, 89, 9, 37, 138, 2, 1
| targets    | 2

Text
| premise    | A little girl covered in paint sits in front of a painted rainbow with her hands in a bowl.
| hypothesis | A girl splashes around a bowl full of paint, getting some on her dress.
| targets    | neutral

Exercise 1

Problem.feature_encoders is a function that returns a Dict of <feature name, TextEncoder>, where TextEncoder is a class that encodes/decodes inputs and outputs.

For example, Rally's SentencePairClassificationProblem implements feature_encoders as follows.

class SentencePairClassificationProblem(text_problems.Text2TextProblem):
    """Base class for NLI (sentence pair classification) problems."""

    def feature_encoders(self, data_dir):
        encoder = self.get_or_create_vocab(data_dir, None, force_get=True)
        return {
            FeatureNames.PREMISE: encoder,
            FeatureNames.HYPOTHESIS: encoder,
            "targets": text_encoder.ClassLabelEncoder(self.class_labels(data_dir)),
        }

The default implementation of get_or_create_vocab builds a SubwordTextEncoder when vocab_type == SUBWORD, and SubwordTextEncoder uses wordpieces to encode/decode inputs and outputs.
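
For instance, you can round-trip a sentence through the subword encoder created above. This is a sketch that reuses the `encoders` dict from the SNLI problem; the exact subword IDs depend on the generated vocabulary.

# Round-trip a sentence through the subword encoder built by the SNLI problem.
premise_encoder = encoders["premise"]
ids = premise_encoder.encode("A man inspects the uniform of a figure.")
print(ids)                          # subword token ids
print(premise_encoder.decode(ids))  # back to the original text

# The "targets" encoder is a ClassLabelEncoder; per the example above, id 2 is "neutral".
print(encoders["targets"].decode([2]))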

If you want to use a different TextEncoder, you can override get_or_create_vocab.

Question: how would you build a Problem for a character-level model? (Fill in the blanks below.)

class SNLICharacter(SNLI):
    """StanfordNLI classification problems with a character-based tokenizer."""

    @property
    def vocab_type(self):
        return NotImplemented

    def get_or_create_vocab(self, data_dir, tmp_dir, force_get=False):
        del force_get  # Always get
        ################################################################################
        # TODO:
        # Build a `TextEncoder` that uses character-level tokens.
        # The tokens must include `oov_token`, " ", ".", and the upper- and
        # lower-case English letters.
        # `oov_token`: a string that is never used as a token, e.g. "<UNK>".
        #
        # Classes you can use:
        # * rally.data.tokenizer.CharacterTokenizer
        # * rally.data.vocabulary.Vocabulary
        # * rally.data.token_encoder.TokenEncoder
        ################################################################################
        # Your code

        ################################################################################
        #                              END OF YOUR CODE                               #
        ################################################################################
        return encoder
def check_character_problem():
    character_data_dir = os.path.expanduser("~/t2t_data/snli_character")
    tf.gfile.MakeDirs(character_data_dir)

    snli_character_problem = SNLICharacter()
    snli_character_problem.generate_data(character_data_dir, tmp_dir)

    # Get the encoders from the problem
    character_encoders = snli_character_problem.feature_encoders(character_data_dir)

    character_it = tfe.Iterator(
        snli_character_problem.dataset(tf.estimator.ModeKeys.TRAIN, character_data_dir)
    )
    character_example = character_it.next()
    print_example(character_example, character_encoders)


check_character_problem()
:::MLPv0.5.0 transformer 1542525541.277723789 (<ipython-input-6-bf2cebb16253>:9) input_order
ID
| premise    | 31, 3, 8, 13, 23, 13, 18, 24, 9, 22, 9, 23, 24, 9, 8, 3, 29, 19, 25, 18, 11, 3, 27, 19, 17, 5, 18, 3, 5, 18, 8, 3, 5, 18, 3, 19, 16, 8, 9, 22, 3, 17, 5, 18, 3, 23, 24, 5, 18, 8, 3, 6, 29, 3, 5, 3, 6, 5, 22, 4, 1
| hypothesis | 50, 12, 9, 3, 27, 19, 17, 5, 18, 3, 13, 23, 3, 12, 19, 16, 8, 13, 18, 11, 3, 5, 3, 8, 22, 13, 18, 15, 4, 1
| targets    | 2

Text
| premise    | A disinterested young woman and an older man stand by a bar.<EOS>
| hypothesis | The woman is holding a drink.<EOS>
| targets    | neutral

Model

A model must be implemented by subclassing T2TModel. In most cases you only need to implement body. For the SNLI problem we will use the Decomposable Attention model.
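
For orientation, a minimal T2TModel subclass could look like the sketch below. It is illustrative only (MyTinyModel is not part of Rally or Tensor2Tensor): the feature names that body receives depend on the Problem, and the base class takes care of embedding lookup, the output projection, and the loss.

from tensor2tensor.utils import registry
from tensor2tensor.utils import t2t_model


@registry.register_model
class MyTinyModel(t2t_model.T2TModel):
    """Illustrative model: only `body` is implemented."""

    def body(self, features):
        # `features` holds already-embedded tensors; for a plain text-to-text
        # problem the main input is features["inputs"].
        inputs = features["inputs"]
        return tf.layers.dense(
            inputs, self._hparams.hidden_size, activation=tf.nn.relu
        )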

Exercise 2

Below is a Decomposable Attention model class that does not use an LSTM encoder. Add a bidirectional LSTM encoder to it.

from rally.layers.mlp import MLP
import rally.layers.decomposable_attention as layers
from rally.models.entailment import EntailmentModel


class DecomposableAttentionLSTM(EntailmentModel):
    """
    This model implements the sentence pair classification with decomposable
    attention model (A Decomposable Attention Model for Natural Language
    Inference, 2016) with bi-directional LSTM encoder.
    """

    def body(self, features):
        embedding_dim = self._hparams.hidden_size * 2

        attention_mapper_num_layers = [embedding_dim] * 2
        attention_mapper_l2_coef = 0.0  # l2 coef in attention mapper kills attention
        attention_mapper = MLP(
            attention_mapper_num_layers,
            dropout=True,
            activation=tf.nn.relu,
            kernel_initializer=tf.contrib.layers.xavier_initializer(),
            kernel_regularizer=tf.contrib.layers.l2_regularizer(
                scale=attention_mapper_l2_coef
            ),
            name="attend_attention_mapper",
        )

        attend_dim = self._hparams.hidden_size * 2
        compare_mapper_num_layers = [(embedding_dim + attend_dim) / 2] * 2
        compare_mapper_l2_coef = 3e-4
        compare_mapper = MLP(
            compare_mapper_num_layers,
            dropout=True,
            activation=tf.nn.relu,
            kernel_initializer=tf.contrib.layers.xavier_initializer(),
            kernel_regularizer=tf.contrib.layers.l2_regularizer(
                scale=compare_mapper_l2_coef
            ),
            name="compare_mapper",
        )

        compare_dim = (embedding_dim + attend_dim) / 2
        aggregate_mapper_num_layers = [compare_dim // 2]
        aggregate_mapper_l2_coef = 3e-4
        aggregate_mapper = MLP(
            aggregate_mapper_num_layers,
            dropout=True,
            activation=tf.nn.relu,
            kernel_initializer=tf.contrib.layers.xavier_initializer(),
            kernel_regularizer=tf.contrib.layers.l2_regularizer(
                scale=aggregate_mapper_l2_coef
            ),
            name="aggregate_mapper",
        )

        premise_encoder = None
        hypothesis_encoder = None
        ################################################################################
        # TODO:
        # Build bidirectional LSTMs for `premise_encoder` and `hypothesis_encoder`.
        # Use "premise_encoder" and "hypothesis_encoder" as the respective `name`s.
        #
        # Hyperparameters you should use:
        # * self._hparams.hidden_size
        # * self._hparams.num_hidden_layers
        # * self._hparams.dropout
        #
        # Class you should use:
        # * rally.layers.rnn.BidirectionalLSTMEncoder
        ################################################################################
        # Your code

        ################################################################################
        #                              END OF YOUR CODE                               #
        ################################################################################

        model = layers.DecomposableAttention(
            attention_mapper,
            compare_mapper,
            aggregate_mapper,
            premise_encoder=premise_encoder,
            hypothesis_encoder=hypothesis_encoder,
        )

        train = self._hparams.mode == tf.estimator.ModeKeys.TRAIN
        premise = features.get("premise")
        hypothesis = features.get("hypothesis")
        return model([premise, hypothesis], training=train)

Define the same hyperparameters that the pretrained model was trained with.

from tensor2tensor.utils import trainer_lib
from tensor2tensor.layers import common_hparams
from tensor2tensor.utils import registry


@registry.register_hparams
def seq2seq_small():
    """hparams for Seq2seq encoder models."""
    hparams = common_hparams.basic_params1()
    hparams.daisy_chain_variables = False
    hparams.batch_size = 1024
    hparams.hidden_size = 128
    hparams.num_hidden_layers = 2
    hparams.initializer = "uniform_unit_scaling"
    hparams.initializer_gain = 1.0
    hparams.weight_decay = 0.0
    return hparams


# Create hparams and the model
hparams_set = "seq2seq_small"
hparams = trainer_lib.create_hparams(
    hparams_set, data_dir=data_dir, problem_name="snli"
)
da_model = DecomposableAttentionLSTM(hparams, tf.estimator.ModeKeys.EVAL)

The pretrained model's checkpoint must be in train_dir.

ckpt_path = tf.train.latest_checkpoint(train_dir)
ckpt_path
'/Users/ywkim/t2t_train/snli/decomposable_attention/seq2seq_small/model.ckpt-195000'
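
If restoring fails later on, a quick sanity check (a sketch) is to list a few of the variables stored in the checkpoint:

# List a few (name, shape) pairs stored in the checkpoint.
for name, shape in tf.train.list_variables(ckpt_path)[:5]:
    print(name, shape)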

Let's make predictions with the pretrained model.

# Restore and infer!
def infer(example):
    input_batch = {
        key: tf.cast(tf.expand_dims(example[key], 0), tf.int32) for key in example
    }
    with tfe.restore_variables_on_create(ckpt_path):
        model_output = da_model.infer(input_batch)["outputs"]
    return encoders["targets"].decode(model_output.numpy().ravel())


example = it.next()
print_example(example, encoders)

outputs = infer(example)
print()
print("Outputs: %s" % outputs)
ID
| premise    | 4, 124, 21, 31, 645, 6, 380, 415, 64, 7, 1320, 2, 1
| hypothesis | 864, 6, 7, 742, 12, 7, 622, 2, 1
| targets    | 2

Text
| premise    | A lady wearing an apron is selling fish from the boxes.
| hypothesis | She is the owner of the business.
| targets    | neutral

:::MLPv0.5.0 transformer 1542525543.597415924 (/Users/ywkim/.local/share/virtualenvs/rally-f4OA2-t-/lib/python3.6/site-packages/tensor2tensor/utils/t2t_model.py:213) model_hp_initializer_gain: 1.0

Outputs: neutral
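
You can also run the model on your own sentence pair by encoding it with the same encoders and reusing the infer helper above. This is a sketch: appending the EOS id (1) and passing a dummy "targets" entry are assumptions about what the Rally problem expects at inference time.

# Classify a custom premise/hypothesis pair (sketch; see assumptions above).
def infer_pair(premise_text, hypothesis_text):
    example = {
        "premise": tf.constant(encoders["premise"].encode(premise_text) + [1]),
        "hypothesis": tf.constant(encoders["hypothesis"].encode(hypothesis_text) + [1]),
        "targets": tf.constant([0]),  # placeholder label, ignored at inference
    }
    return infer(example)


print(infer_pair(
    "A man inspects the uniform of a figure in some East Asian country.",
    "The man is sleeping.",
))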