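  • This is a short summary of my last two months of work. The demo is on GitHub and implements several neural network models: CNN, LSTM, BiLSTM, GRU, combinations of CNN with LSTM and BiLSTM, and multi-layer, multi-channel variants of CNN, LSTM, and BiLSTM. This post summarizes the problems I ran into recently, how I handled them, and the strategies involved, along with what little experience I have (I am still a beginner).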
  • Demo Site:  https://github.com/bamtercelboo/cnn-lstm-bilstm-deepcnn-clstm-in-pytorch

(1) A Brief Introduction to PyTorch

  • PyTorch is a relatively new, Python-first deep learning framework that provides tensors and dynamic neural networks with strong GPU acceleration.
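As a small illustration of what "tensors and dynamic neural networks" means in practice, here is a minimal sketch, assuming PyTorch is installed. The function name `forward` and the toy computation are made up for this example and are not part of the demo; the point is only that the graph is built by ordinary Python control flow at run time and that the same code runs on CPU or GPU.

```python
import torch

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# A leaf tensor that records the operations applied to it.
x = torch.tensor([1.0, 2.0, 3.0], device=device, requires_grad=True)

def forward(x, steps):
    # Hypothetical toy model: the number of operations depends on a
    # Python loop, so the graph is rebuilt on every call.
    y = x
    for _ in range(steps):
        y = y * 2
    return y.sum()

loss = forward(x, steps=3)   # computes sum(8 * x)
loss.backward()              # backpropagate through the recorded graph
grad = x.grad.cpu().tolist() # d(loss)/dx, one value per element of x
```

Calling `forward` with a different `steps` value builds a different graph without any recompilation step.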



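  1. The corpus I currently use is mostly clean (see the sample below), but loading it still requires some preprocessing, such as handling letter case, digits, and characters like "\n" and "\t". I currently use the third-party library torchtext to load and preprocess the data.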

    You Should Pay Nine Bucks for This : Because you can hear about suffering Afghan refugees on the news and still be unaffected . ||| 2
    Dramas like this make it human . ||| 4
    import torchtext.data as data
    # lower word
    text_field = data.Field(lower=True)
    import re
    from torchtext import data

    def clean_str(string):
        # replace every character outside the whitelist with a space
        string = re.sub(r"[^A-Za-z0-9(),!?'`]", " ", string)
        # split English contractions off the preceding word
        string = re.sub(r"'s", " 's", string)
        string = re.sub(r"'ve", " 've", string)
        string = re.sub(r"n't", " n't", string)
        string = re.sub(r"'re", " 're", string)
        string = re.sub(r"'d", " 'd", string)
        string = re.sub(r"'ll", " 'll", string)
        # pad punctuation with spaces so that it becomes a separate token
        string = re.sub(r",", " , ", string)
        string = re.sub(r"!", " ! ", string)
        string = re.sub(r"\(", " ( ", string)
        string = re.sub(r"\)", " ) ", string)
        string = re.sub(r"\?", " ? ", string)
        # collapse runs of whitespace
        string = re.sub(r"\s{2,}", " ", string)
        return string.strip()

    # text_field is the data.Field created above
    text_field.preprocessing = data.Pipeline(clean_str)
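To make the input format concrete, the following is a minimal sketch of parsing one corpus line of the form `sentence ||| label` and normalizing case, digits, and whitespace before the text reaches torchtext. The helper names `normalize` and `parse_line`, and the choice to map every digit to `0`, are assumptions for illustration and are not taken from the demo.

```python
import re

def normalize(text):
    """Lowercase, map every digit to 0, collapse whitespace (newlines, tabs)."""
    text = text.lower()
    text = re.sub(r"\d", "0", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()

def parse_line(line):
    """Split 'sentence ||| label' into (tokens, label)."""
    # Split on the last separator so a stray '|||' in the text is kept.
    sentence, label = line.rsplit("|||", 1)
    return normalize(sentence).split(" "), int(label)

tokens, label = parse_line("Dramas like this make it human . ||| 4\n")
```

Whether digits should be collapsed at all depends on the task; for sentiment data they rarely carry signal, which is the only reason they are collapsed here.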
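  • When loading the datasets, the examples can be shuffled with random:
    import random

    # shuffle is a boolean flag; examples_* are the lists of loaded examples
    if shuffle:
        random.shuffle(examples_train)
        random.shuffle(examples_dev)
        random.shuffle(examples_test)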
  • When torchtext builds the iterators for the training, development, and test sets, you can choose whether the data is reshuffled on every pass, as the docstring of its Iterator class shows:
    class Iterator(object):
        """Defines an iterator that loads batches of data from a Dataset.

        Attributes:
            dataset: The Dataset object to load Examples from.
            batch_size: Batch size.
            sort_key: A key to use for sorting examples in order to batch together
                examples with similar lengths and minimize padding. The sort_key
                provided to the Iterator constructor overrides the sort_key
                attribute of the Dataset, or defers to it if None.
            train: Whether the iterator represents a train set.
            repeat: Whether to repeat the iterator for multiple epochs.
            shuffle: Whether to shuffle examples between epochs.
            sort: Whether to sort examples according to self.sort_key.
                Note that repeat, shuffle, and sort default to train, train, and
                (not train).
            device: Device to create batches on. Use -1 for CPU and None for the
                currently active GPU device.
        """
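The effect of the `shuffle` and `sort` options described in the docstring above can be sketched in plain Python. This is not torchtext's implementation; the generator `batches` and its parameters are hypothetical and only show the idea: a training iterator reshuffles on every epoch, while an evaluation iterator sorts by length so that each batch needs little padding.

```python
import random

def batches(examples, batch_size, shuffle=True, sort_key=None, rng=random):
    """Yield lists of examples; reshuffle on every call when shuffle is True."""
    data = list(examples)  # copy so the caller's list is left untouched
    if shuffle:
        rng.shuffle(data)
    elif sort_key is not None:
        data.sort(key=sort_key)
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

examples = ["a b c", "a", "a b", "a b c d"]
rng = random.Random(0)  # fixed seed so runs are reproducible

# Training: a fresh order on every epoch.
epoch1 = list(batches(examples, 2, shuffle=True, rng=rng))
# Evaluation: no shuffling, sorted by length to minimize padding.
dev = list(batches(examples, 2, shuffle=False, sort_key=len))
```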
(4) Word Embedding

  1. Put simply, word embedding maps every word in the corpus to its word vector. The most widely used way to train word vectors at present is probably word2vec (see http://www.cnblogs.com/bamtercelboo/p/7181899.html).

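  2. The vocabulary was already built with torchtext above. There are two ways to initialize the word vectors: load externally pretrained vectors, or initialize them randomly. Pretrained vectors naturally tend to work much better, but vectors you train yourself may not, because your corpus may be too small. Well-known pretrained sets such as the Google News vectors are widely recognized as good, but they are very large, so if your hardware (GPU) is limited it is better not to attempt them.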



  • Load the vectors for those vocabulary words that can be found in the pretrained embedding file
    # load word embedding
    def load_my_vecs(path, vocab, freqs):
        # freqs is unused here; it is kept so existing callers still work
        word_vecs = {}
        with open(path, encoding="utf-8") as f:
            # skip the first line (in word2vec text files this is the header)
            lines = f.readlines()[1:]
            for line in lines:
                # strip the trailing newline before splitting into fields
                values = line.rstrip().split(" ")
                word = values[0]
                # word = word.lower()
                if word in vocab:  # keep only words that occur in the vocabulary
                    # everything after the word itself is the vector
                    word_vecs[word] = [float(val) for val in values[1:]]
        return word_vecs
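  • Handle the vocabulary words that cannot be found in the pretrained embeddings, commonly called OOV (out of vocabulary). The more OOV words there are, the larger their effect on the results may be, so handling them well is especially important. A few strategies to consider:
  • Use the average of the vectors that were found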
    # solve unknown words by the average of the found word embeddings
    def add_unknown_words_by_avg(word_vecs, vocab, k=100):
        # collect the vectors of all vocabulary words found in word_vecs
        found = [word_vecs[word] for word in vocab if word in word_vecs]
        print(len(found))
        if found:
            # per-dimension mean over the found vectors, rounded to 6 decimals
            avg = [round(sum(vec[i] for vec in found) / len(found), 6)
                   for i in range(k)]
        else:
            # nothing was found: fall back to a zero vector
            avg = [0.0] * k
        list_word2vec = []
        oov = 0
        iov = 0
        for word in vocab:
            if word not in word_vecs:
                # word_vecs[word] = np.random.uniform(-0.25, 0.25, k).tolist()
                # word_vecs[word] = [0.0] * k
                oov += 1
                word_vecs[word] = avg  # every OOV word shares the average vector
            else:
                iov += 1
            list_word2vec.append(word_vecs[word])
        print("oov count", oov)
        print("iov count", iov)
        return list_word2vec
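  • Random initialization or all zeros. Either way, all OOV words can share a single vector, or each OOV word can get its own random vector; which works better depends on your own experiments.
  • For random initialization, the papers I have read draw values from (-0.25, 0.25) or from (-0.1, 0.1). Test this yourself, since results will probably differ across datasets and pretrained embeddings; in my tests 0.25 worked better than 0.1.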
    import numpy as np

    # solve unknown word by uniform(-0.25,0.25)
    def add_unknown_words_by_uniform(word_vecs, vocab, k=100):
        list_word2vec = []
        oov = 0
        iov = 0
        # to share one random vector among all OOV words, draw it once here:
        # uniform = np.random.uniform(-0.25, 0.25, k).round(6).tolist()
        for word in vocab:
            if word not in word_vecs:
                oov += 1
                # a separate random vector for each OOV word
                word_vecs[word] = np.random.uniform(-0.25, 0.25, k).round(6).tolist()
                # word_vecs[word] = np.random.uniform(-0.1, 0.1, k).round(6).tolist()
                # word_vecs[word] = uniform
            else:
                iov += 1
            list_word2vec.append(word_vecs[word])
        print("oov count", oov)
        print("iov count", iov)
        return list_word2vec
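  • Pay particular attention to whether the OOV vectors fall within a reasonable range after processing. Always check this afterwards, by hand or with a small script. If values such as 15 or 30 show up, the cause is probably your handling method or a bug in your own code, and this has a large impact on the results.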
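The range check described above can be automated with a few lines. This is a minimal sketch: the function name `find_suspicious_vectors` and the threshold `limit=1.0` are assumptions chosen for illustration, and the right threshold depends on the pretrained embeddings you use.

```python
def find_suspicious_vectors(word_vecs, limit=1.0):
    """Return {word: largest absolute component} for vectors exceeding limit."""
    suspicious = {}
    for word, vector in word_vecs.items():
        peak = max(abs(v) for v in vector)
        if peak > limit:
            suspicious[word] = peak
    return suspicious

# Toy data: two well-behaved vectors and one that is clearly out of range.
word_vecs = {
    "good": [0.12, -0.20, 0.05],
    "oov_ok": [0.25, -0.25, 0.0],
    "oov_bad": [15.3, -0.1, 30.0],
}
bad = find_suspicious_vectors(word_vecs, limit=1.0)
```

Running such a check right after the OOV step, before the vectors are copied into the embedding layer, makes a faulty averaging or initialization routine visible immediately.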



