I am writing my first English blog post for two reasons: first, the dataset used in this post is in English; second, I'd like to expand my reach and attract a wider audience, although I should admit that probably nobody cares.
Data
Initially I wanted to use a Chinese corpus, but I could not find a proper one. The data should look something like this:
If you can find a dataset where the 'similarity score' is a double (a real-valued score), please do not hesitate to email me.
So the choice has to be an English corpus. The datasets used in this experiment are the STS benchmark and the SICK data. The SICK data contains 10,000 sentence pairs labeled with semantic relatedness and an entailment relation.
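For reference, here is a minimal sketch of how the STS benchmark can be loaded into a DataFrame. The column names (`sent_1`, `sent_2`, `sim`) match how the data is used later in `run_experiment`; the file path is just an assumption.

    import pandas as pd

    def load_sts_dataset(filename):
        # Each line of the STS benchmark is tab-separated; the gold similarity
        # score is in column 4, the two sentences are in columns 5 and 6.
        sent_pairs = []
        with open(filename, encoding="utf-8") as f:
            for line in f:
                fields = line.strip().split("\t")
                if len(fields) >= 7:
                    sent_pairs.append((fields[5], fields[6], float(fields[4])))
        return pd.DataFrame(sent_pairs, columns=["sent_1", "sent_2", "sim"])

    # Hypothetical path; point this at the unpacked STS benchmark files.
    sts_dev = load_sts_dataset("stsbenchmark/sts-dev.csv")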
Similarity Methods
Baseline
As the baseline, we simply take the embeddings of the words in a sentence and compute their average, weighted by the frequency of each word.
    import math
    from collections import Counter

    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    def run_avg_benchmark(sentences1, sentences2, model=None, use_stoplist=False, doc_freqs=None):
        if doc_freqs is not None:
            N = doc_freqs["NUM_DOCS"]

        sims = []
        for (sent1, sent2) in zip(sentences1, sentences2):
            tokens1 = sent1.tokens_without_stop if use_stoplist else sent1.tokens
            tokens2 = sent2.tokens_without_stop if use_stoplist else sent2.tokens

            # Keep only the tokens that have an embedding in the model.
            tokens1 = [token for token in tokens1 if token in model]
            tokens2 = [token for token in tokens2 if token in model]

            if len(tokens1) == 0 or len(tokens2) == 0:
                sims.append(0)
                continue

            tokfreqs1 = Counter(tokens1)
            tokfreqs2 = Counter(tokens2)

            # TF-IDF-style weights when document frequencies are available.
            weights1 = [tokfreqs1[token] * math.log(N / (doc_freqs.get(token, 0) + 1))
                        for token in tokfreqs1] if doc_freqs else None
            weights2 = [tokfreqs2[token] * math.log(N / (doc_freqs.get(token, 0) + 1))
                        for token in tokfreqs2] if doc_freqs else None

            embedding1 = np.average([model[token] for token in tokfreqs1], axis=0, weights=weights1).reshape(1, -1)
            embedding2 = np.average([model[token] for token in tokfreqs2], axis=0, weights=weights2).reshape(1, -1)

            sim = cosine_similarity(embedding1, embedding2)[0][0]
            sims.append(sim)

        return sims
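As a usage sketch, the baseline can be run with pre-trained word2vec vectors loaded through gensim; the `Sentence` wrapper is the same one used in `run_experiment` below, and the model path is an assumption.

    from gensim.models import KeyedVectors

    # Pre-trained Google News vectors; the file path is hypothetical.
    word2vec = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)

    sentences1 = [Sentence("A man is playing a guitar.")]
    sentences2 = [Sentence("Someone plays an instrument.")]
    print(run_avg_benchmark(sentences1, sentences2, model=word2vec))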
The baseline, as before, is a very simple and crude way of computing a sentence embedding. Word frequency does not reliably reflect a word's semantic importance to the sentence. Smooth Inverse Frequency (SIF) tries to solve this problem.
SIF is very similar to the weighted average we used before, with the difference that it is weighted by this formula: $$ \operatorname{SIF}(w) = \frac{a}{a + p(w)} $$ where $a$ is a hyper-parameter (set to 0.001 by default) and $p(w)$ is the estimated word frequency in the corpus. (Note that this weight is different from both TF and IDF.)
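To make the difference with TF and IDF concrete, here is a tiny sketch with made-up frequencies: a very frequent word like 'the' gets a weight close to zero, while a rare content word keeps a weight close to one.

    def sif_weight(p_w, a=0.001):
        # p_w is the estimated unigram probability of the word in the corpus.
        return a / (a + p_w)

    print(sif_weight(0.05))     # frequent word such as 'the' -> ~0.02
    print(sif_weight(0.00001))  # rare content word           -> ~0.99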
In addition, we need to perform common component removal: we subtract from the sentence embeddings obtained above their projection onto the first principal component of the embedding matrix. This corrects for the influence of high-frequency words that mainly have a syntactic or discourse function, such as 'but', 'and', etc. You can find more information in this paper. Because the input here is the raw sentence, without tokenization or stop-word filtering, words like 'but' and 'and' inevitably show up.
    from sklearn.decomposition import TruncatedSVD

    def remove_first_principal_component(X):
        # Fit a rank-1 truncated SVD to find the first principal component.
        svd = TruncatedSVD(n_components=1, n_iter=7, random_state=0)
        svd.fit(X)
        pc = svd.components_
        # Subtract each row's projection onto that component.
        XX = X - X.dot(pc.transpose()) * pc
        return XX
    def run_sif_benchmark(sentences1, sentences2, model, freqs, use_stoplist=False, a=0.001):
        total_freq = sum(freqs.values())
        embeddings = []
        # Collect all sentence embeddings first, then perform common component removal.
        for (sent1, sent2) in zip(sentences1, sentences2):
            tokens1 = sent1.tokens_without_stop if use_stoplist else sent1.tokens
            tokens2 = sent2.tokens_without_stop if use_stoplist else sent2.tokens
            tokens1 = [token for token in tokens1 if token in model]
            tokens2 = [token for token in tokens2 if token in model]
            # SIF weight: a / (a + p(w)), with p(w) estimated from corpus counts.
            weights1 = [a / (a + freqs.get(token, 0) / total_freq) for token in tokens1]
            weights2 = [a / (a + freqs.get(token, 0) / total_freq) for token in tokens2]
            embedding1 = np.average([model[token] for token in tokens1], axis=0, weights=weights1)
            embedding2 = np.average([model[token] for token in tokens2], axis=0, weights=weights2)
            embeddings.append(embedding1)
            embeddings.append(embedding2)
        embeddings = remove_first_principal_component(np.array(embeddings))
        sims = [cosine_similarity(embeddings[idx * 2].reshape(1, -1),
                                  embeddings[idx * 2 + 1].reshape(1, -1))[0][0]
                for idx in range(len(embeddings) // 2)]
        return sims
InferSent is a pre-trained encoder that produces sentence embeddings, open-sourced by Facebook. The Google Sentence Encoder is Google's answer to Facebook's InferSent. In contrast to InferSent, the Google Sentence Encoder was trained on a combination of unsupervised data and supervised data (the SNLI corpus), which tends to give better results.
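The snippet below is my sketch of the graph setup that the session code underneath relies on. It assumes the TF1-style API and loads the Universal Sentence Encoder from TF Hub; `sts_input1`, `sts_input2` and `sim_scores` are the tensors fed and fetched in that session code.

    import tensorflow as tf
    import tensorflow_hub as hub

    # Load the Universal Sentence Encoder (TF1-style hub module).
    embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")

    # Placeholders for the raw sentence strings.
    sts_input1 = tf.placeholder(tf.string, shape=(None,))
    sts_input2 = tf.placeholder(tf.string, shape=(None,))

    # L2-normalize the embeddings so the dot product equals cosine similarity.
    sts_encode1 = tf.nn.l2_normalize(embed(sts_input1), axis=1)
    sts_encode2 = tf.nn.l2_normalize(embed(sts_input2), axis=1)
    sim_scores = tf.reduce_sum(tf.multiply(sts_encode1, sts_encode2), axis=1)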
    def run_gse_benchmark(sentences1, sentences2):
        with tf.Session() as session:
            session.run(tf.global_variables_initializer())
            session.run(tf.tables_initializer())

            # Feed the raw sentences and fetch the similarity scores.
            [gse_sims] = session.run(
                [sim_scores],
                feed_dict={
                    sts_input1: [sent1.raw for sent1 in sentences1],
                    sts_input2: [sent2.raw for sent2 in sentences2]
                })
        return gse_sims
Experiments
    def run_experiment(df, benchmarks):
        sentences1 = [Sentence(s) for s in df['sent_1']]
        sentences2 = [Sentence(s) for s in df['sent_2']]
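The rest of the function is cut off above. Here is my best guess at a complete version, scoring every benchmark against the gold `sim` column with Pearson and Spearman correlation from scipy; the `(label, method)` structure of `benchmarks` is an assumption.

    import scipy.stats

    def run_experiment(df, benchmarks):
        # 'benchmarks' is assumed to be a list of (label, method) pairs, e.g.
        # [("AVG-W2V", partial(run_avg_benchmark, model=word2vec)), ...].
        sentences1 = [Sentence(s) for s in df['sent_1']]
        sentences2 = [Sentence(s) for s in df['sent_2']]

        pearson_cors, spearman_cors = [], []
        for label, method in benchmarks:
            sims = method(sentences1, sentences2)
            pearson_cors.append(scipy.stats.pearsonr(sims, df['sim'])[0])
            spearman_cors.append(scipy.stats.spearmanr(sims, df['sim'])[0])
        return pearson_cors, spearman_cors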