Terms

  • NLU (Natural Language Understanding) / NLI (Natural Language Inference)

  • NLG (Natural Language Generation)

  • Tokenization (分词)

    • Chinese Word Segmentation (中文分词)
    • BPE/wordpiece/unigram
  • Stemming (词干提取)

  • Lemmatization (词形还原)

  • Part-of-speech tagging (词性标注)

  • Named-entity Recognition (NER,命名实体识别)

  • Chunking (分块)

  • Language Model: A language model is a function that takes a sequence of words and returns a probability distribution over all possible next words in that sequence (see the sketch after this list).

    • autoregressive/causal language model

    • masked language model (BERT)

  • BoW (Bag-of-Words)

  • seq2seq

  • Encoder-Decoder

  • Attention layer

  • Transformer

  • BERT

  • GPT
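
A minimal sketch of the language-model definition above, using a toy bigram model estimated from raw counts (the corpus and function names here are made up for illustration):

from collections import Counter, defaultdict

# toy corpus: one tokenized sentence per list
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]

# count how often each word follows each other word
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    for prev, nxt in zip(sentence, sentence[1:]):
        bigram_counts[prev][nxt] += 1

def next_word_distribution(prev_word):
    """Return P(next word | prev_word) as a dict of probabilities."""
    counts = bigram_counts[prev_word]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

print(next_word_distribution("the"))  # {'cat': 0.25, 'mat': 0.25, 'dog': 0.25, 'rug': 0.25}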

Tutorials

References

Textbooks

Projects

Data Representation

from gensim import corpora

# a single document
document = "A survey of user opinion of computer system response time"

# corpus := the collection of all documents
documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

# tokens
tokens = ["a", "survey", "of", "user", "opinion", "of", "computer", "system", "response", "time"]

# dictionary: maps each token to an integer id
dictionary = {
    "a": 1,
    "survey": 2,
    "of": 3,
    ...
}

# bag-of-words: a list of (token_id, count) pairs
BoW = [
    (2, 1),  # the document contains token id 2 once
    (3, 4),  # the document contains token id 3 four times
    ...
]

# one-hot encoding tokens
encoded_tokens_of_a_document = [
    [0, 1, 0, 0, 0], # first word's encoding
    [1, 0, 0, 0, 0], # second word's encoding
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    ...
] # one row per word in the document; one column per token in the dictionary

# word2vec
embedding = [
    [0.1, 0.2, -0.3], # first word's embedding
    [0.2, 0.2, -0.1], # second word's embedding
    [0.4, -0.2, -0.1],
    [0.1, 0.2, 0.3],
] # one row per word in the document; one column per embedding dimension
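
Assuming the gensim package imported above is available, a hedged sketch of how the dictionary and bag-of-words structures are typically built from `documents` (gensim's `simple_preprocess` is used here only as a convenient tokenizer):

from gensim import corpora
from gensim.utils import simple_preprocess

# lowercase and tokenize every document
texts = [simple_preprocess(doc) for doc in documents]

# build the token -> integer id mapping over the whole corpus
dictionary = corpora.Dictionary(texts)
print(dictionary.token2id)  # e.g. {'abc': 0, 'applications': 1, ...}

# convert a single document into its (token_id, count) pairs
bow = dictionary.doc2bow(simple_preprocess(document))
print(bow)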

Evaluations

Networks

Stanford CS 230 Deep Learning - Cheatsheet

History

Word Embedding

  • word2vec (see the training sketch after this list)
    • CBOW: continuous bag-of-words
    • SG: skip-gram
  • GloVe
  • fasttext
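
A minimal training sketch using gensim's Word2Vec class; the toy sentences and hyperparameters are illustrative (`vector_size` is the gensim 4.x argument name, older versions call it `size`; `sg=1` selects skip-gram, `sg=0` selects CBOW):

from gensim.models import Word2Vec

# toy corpus: one tokenized sentence per list
sentences = [
    ["human", "machine", "interface", "for", "lab", "abc", "computer", "applications"],
    ["a", "survey", "of", "user", "opinion", "of", "computer", "system", "response", "time"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=20)

vector = model.wv["computer"]                     # 50-dimensional embedding of "computer"
print(model.wv.most_similar("computer", topn=3))  # nearest neighbours in embedding space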

seq2seq

  • tasks
    • text generation
    • translation

Transformer

The Illustrated Transformer

The Annotated Transformer

Abandon Illusions, Fully Embrace the Transformer: Comparing NLP's Three Major Feature Extractors (CNN/RNN/Transformer)

Analyzing the Transformer Model's Parameter Count, Compute, Intermediate Activations, and KV Cache
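
The articles above cover the full architecture; as a quick reminder of the core operation they describe, here is a minimal NumPy sketch of single-head, unmasked scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V (an illustration, not code from any of the references):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # pairwise similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                                        # weighted sum of values

# toy example: 3 tokens with 4-dimensional queries/keys/values
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4)), rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)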

Datasets