Terms
- NLU (Natural Language Understanding) / NLI (Natural Language Inference)
- NLG (Natural Language Generation)
- Tokenization (分词)
  - Chinese Word Segmentation (中文分词)
  - BPE/WordPiece/unigram
- Stemming (词干提取)
- Lemmatization (词形还原)
- Part-of-speech tagging (词性标注)
- Named-entity Recognition (NER, 命名实体识别)
- Chunking (分块)
- Language Model: a language model is a function that takes in a sequence of words and returns a probability distribution over all the possible next words in that sequence (see the toy bigram sketch after this list)
- autoregressive/causal language model (e.g. GPT; see the pipeline example after this list)
- masked language model (e.g. BERT; see the pipeline example after this list)
- BoW (Bag-of-Words)
- seq2seq
- Encoder-Decoder
- Attention layer (see the scaled dot-product attention sketch after this list)
- Transformer
- BERT
- GPT
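To make the language-model definition concrete, here is a minimal toy sketch (not from any real library): it estimates the next-word distribution from bigram counts over a tiny made-up corpus, which is exactly "a distribution over possible next words" in its simplest form.

from collections import Counter, defaultdict

# Tiny made-up corpus, for illustration only.
corpus = [
    "the user likes the system",
    "the user likes the interface",
    "the system likes the user",
]

# Count bigrams: how often each word follows a given previous word.
follow_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        follow_counts[prev][nxt] += 1

def next_word_distribution(prev_word):
    """Return P(next word | previous word) as a dict (an unsmoothed toy bigram LM)."""
    counts = follow_counts[prev_word]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

print(next_word_distribution("the"))  # {'user': 0.5, 'system': 0.333..., 'interface': 0.166...}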
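A hedged sketch of the causal vs. masked distinction using Hugging Face transformers pipelines (assuming the transformers library and the gpt2 / bert-base-uncased checkpoints are available): a causal model continues a prefix left to right, while a masked model fills in a blanked-out token using context from both sides.

from transformers import pipeline

# Causal / autoregressive LM: predicts the next token given only the left context.
generator = pipeline("text-generation", model="gpt2")
print(generator("Natural language processing is", max_new_tokens=10))

# Masked LM: predicts a masked token using context on both sides.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("Natural language [MASK] is fun."))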
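A minimal numpy sketch of the scaled dot-product attention used inside Transformer attention layers, softmax(QK^T / sqrt(d_k)) V, with made-up shapes:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # (seq_q, seq_k) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                         # weighted sum of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))    # 4 query positions, d_k = 8
K = rng.normal(size=(6, 8))    # 6 key positions
V = rng.normal(size=(6, 16))   # values with d_v = 16
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 16)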
Tutorials
References
- word2vec
- self-attention
Textbooks
Projects
- pretrained models
  - core models: use for general-purpose tasks
  - starter models: use for fine-tuning
- pretrained models
- Chinese
- tokenizer
- Embeddings
Data Representation
from gensim import corpora  # gensim can build the dictionary/BoW shown below (see the sketch after this block)
# a single document
document = "A survey of user opinion of computer system response time"
# a corpus := the collection of (all) documents
documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]
# tokens
tokens = ["a", "survey", "of", "user", "opinion", "of", "computer", "system", "response", "time"]
# dictionary: maps each token to an integer id
dictionary = {
    "a": 1,
    "survey": 2,
    "of": 3,
    # ...
}
# bag of words
BoW = [
    (2, 1),  # the document contains 1 occurrence of token 2
    (3, 4),  # the document contains 4 occurrences of token 3
    # ...
]
# one-hot encoding tokens
encoded_tokens_of_a_document = [
    [0, 1, 0, 0, 0],  # first word's encoding
    [1, 0, 0, 0, 0],  # second word's encoding
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    # ...
]  # n_row = number of words in the document, n_col = size of the whole dictionary
# word2vec
embedding = [
    [0.1, 0.2, -0.3],  # first word's embedding
    [0.2, 0.2, -0.1],  # second word's embedding
    [0.4, -0.2, -0.1],
    [0.1, 0.2, 0.3],
]  # n_row = number of words in the document, n_col = embedding dimension
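The same structures can be built with gensim itself; a minimal sketch (assuming gensim is installed, and reusing the document and documents variables defined above; the actual ids and counts will differ from the toy values):

from gensim import corpora

texts = [doc.lower().split() for doc in documents]   # naive whitespace tokenization
dictionary = corpora.Dictionary(texts)               # token -> integer id
bow = dictionary.doc2bow(document.lower().split())   # [(token_id, count), ...]
print(dictionary.token2id)
print(bow)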
Evaluations
- Perplexity: the exponentiated average negative log-likelihood of a held-out text; intuitively, the number of equally likely next tokens the model is choosing between at each step (see the sketch after this list)
  - only comparable between models that use the same tokenization
  - HuggingFace - Perplexity of fixed-length models
  - The Gradient - Evaluation Metrics for Language Modeling
- BLEU (Bilingual Evaluation Understudy)
- GLUE/SuperGLUE: General Language Understanding Evaluation
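A minimal sketch of computing perplexity from per-token probabilities (made-up numbers; a real evaluation would take the probabilities from a language model over a held-out corpus):

import math

# Toy per-token probabilities a model assigned to the tokens of a held-out text.
token_probs = [0.2, 0.05, 0.4, 0.1]

nll = [-math.log(p) for p in token_probs]    # negative log-likelihood per token
perplexity = math.exp(sum(nll) / len(nll))   # exp of the average NLL
print(perplexity)                            # ~7.1: like choosing among ~7 equally likely tokens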
Networks
History
Word Embedding
- word2vec (see the gensim sketch after this list)
  - CBOW: continuous bag-of-words (predicts the center word from its context)
  - SG: skip-gram (predicts context words from the center word)
- GloVe
- fasttext
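A minimal gensim sketch (assuming gensim >= 4.0 is installed) that trains word2vec on the toy documents list from the Data Representation section; sg=0 selects CBOW and sg=1 selects skip-gram.

from gensim.models import Word2Vec

# Toy corpus: reuse the documents list from the Data Representation section.
sentences = [doc.lower().split() for doc in documents]

# sg=0 -> CBOW, sg=1 -> skip-gram
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

print(model.wv["system"])                # the 50-dimensional embedding of "system"
print(model.wv.most_similar("system"))   # nearest neighbours in the embedding space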
seq2seq
- tasks
  - text generation
  - translation