Terms
- NLU (Natural Language Understanding) / NLI (Natural Language Inference)
- NLG (Natural Language Generation)
- Tokenization (分词)
  - Chinese Word Segmentation (中文分词)
  - BPE/WordPiece/unigram
- Stemming (词干提取)
- Lemmatization (词形还原)
- Part-of-speech tagging (词性标注)
- Named-entity Recognition (NER, 命名实体识别)
- Chunking (分块)
- Language Model: a language model is a function that takes in a sequence of words and returns a probability distribution over all the possible next words in that sequence (see the toy bigram sketch after this list)
- autoregressive/causal language model (e.g. GPT; see the pipeline example after this list)
- masked language model (e.g. BERT; see the pipeline example after this list)
- BoW (Bag-of-Words)
- seq2seq
- Encoder-Decoder
- Attention layer (see the scaled dot-product attention sketch after this list)
- Transformer
- BERT
- GPT
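To make the language-model definition concrete, here is a minimal toy sketch (not from any real library): it estimates the next-word distribution from bigram counts over a tiny made-up corpus, which is exactly "a distribution over possible next words" in its simplest form.

from collections import Counter, defaultdict

# Tiny made-up corpus, for illustration only.
corpus = [
    "the user likes the system",
    "the user likes the interface",
    "the system likes the user",
]

# Count bigrams: how often each word follows a given previous word.
follow_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        follow_counts[prev][nxt] += 1

def next_word_distribution(prev_word):
    """Return P(next word | previous word) as a dict (an unsmoothed toy bigram LM)."""
    counts = follow_counts[prev_word]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

print(next_word_distribution("the"))  # {'user': 0.5, 'system': 0.333..., 'interface': 0.166...}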
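A hedged sketch of the causal vs. masked distinction using Hugging Face transformers pipelines (assuming the transformers library and the gpt2 / bert-base-uncased checkpoints are available): a causal model continues a prefix left to right, while a masked model fills in a blanked-out token using context from both sides.

from transformers import pipeline

# Causal / autoregressive LM: predicts the next token given only the left context.
generator = pipeline("text-generation", model="gpt2")
print(generator("Natural language processing is", max_new_tokens=10))

# Masked LM: predicts a masked token using context on both sides.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("Natural language [MASK] is fun."))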
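A minimal numpy sketch of the scaled dot-product attention used inside Transformer attention layers, softmax(QK^T / sqrt(d_k)) V, with made-up shapes:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # (seq_q, seq_k) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                         # weighted sum of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))    # 4 query positions, d_k = 8
K = rng.normal(size=(6, 8))    # 6 key positions
V = rng.normal(size=(6, 16))   # values with d_v = 16
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 16)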
Tutorials
References
- word2vec
- self-attention
Textbooks
Projects
- pretrained models
  - core models: use for general-purpose tasks
  - starter models: use for fine-tuning
- pretrained models
- Chinese
- tokenizer
- Embeddings
Data Representation
from gensim import corpora  # gensim can build the dictionary/BoW shown below (see the sketch after this block)
# a single document
document = "A survey of user opinion of computer system response time"
# a corpus := the collection of (all) documents
documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]
# tokens
tokens = ["a", "survey", "of", "user", "opinion", "of", "computer", "system", "response", "time"]
# dictionary: maps each token to an integer id
dictionary = {
    "a": 1,
    "survey": 2,
    "of": 3,
    # ...
}
# bag of words
BoW = [
    (2, 1),  # the document contains 1 occurrence of token 2
    (3, 4),  # the document contains 4 occurrences of token 3
    # ...
]
# one-hot encoding tokens
encoded_tokens_of_a_document = [
    [0, 1, 0, 0, 0],  # first word's encoding
    [1, 0, 0, 0, 0],  # second word's encoding
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    # ...
]  # n_row = number of words in the document, n_col = size of the whole dictionary
# word2vec
embedding = [
    [0.1, 0.2, -0.3],  # first word's embedding
    [0.2, 0.2, -0.1],  # second word's embedding
    [0.4, -0.2, -0.1],
    [0.1, 0.2, 0.3],
]  # n_row = number of words in the document, n_col = embedding dimension
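The same structures can be built with gensim itself; a minimal sketch (assuming gensim is installed, and reusing the document and documents variables defined above; the actual ids and counts will differ from the toy values):

from gensim import corpora

texts = [doc.lower().split() for doc in documents]   # naive whitespace tokenization
dictionary = corpora.Dictionary(texts)               # token -> integer id
bow = dictionary.doc2bow(document.lower().split())   # [(token_id, count), ...]
print(dictionary.token2id)
print(bow)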
Evaluations
- Perplexity: the exponentiated average negative log-likelihood of a held-out text; intuitively, the number of equally likely next tokens the model is choosing between at each step (see the sketch after this list)
  - only comparable between models that use the same tokenization
  - HuggingFace - Perplexity of fixed-length models
  - The Gradient - Evaluation Metrics for Language Modeling
- BLEU (Bilingual Evaluation Understudy)
- GLUE/SuperGLUE: General Language Understanding Evaluation
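A minimal sketch of computing perplexity from per-token probabilities (made-up numbers; a real evaluation would take the probabilities from a language model over a held-out corpus):

import math

# Toy per-token probabilities a model assigned to the tokens of a held-out text.
token_probs = [0.2, 0.05, 0.4, 0.1]

nll = [-math.log(p) for p in token_probs]    # negative log-likelihood per token
perplexity = math.exp(sum(nll) / len(nll))   # exp of the average NLL
print(perplexity)                            # ~7.1: like choosing among ~7 equally likely tokens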
Networks
History
Word Embedding
- word2vec (see the gensim sketch after this list)
  - CBOW: continuous bag-of-words (predicts the center word from its context)
  - SG: skip-gram (predicts context words from the center word)
- GloVe
- fasttext
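A minimal gensim sketch (assuming gensim >= 4.0 is installed) that trains word2vec on the toy documents list from the Data Representation section; sg=0 selects CBOW and sg=1 selects skip-gram.

from gensim.models import Word2Vec

# Toy corpus: reuse the documents list from the Data Representation section.
sentences = [doc.lower().split() for doc in documents]

# sg=0 -> CBOW, sg=1 -> skip-gram
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

print(model.wv["system"])                # the 50-dimensional embedding of "system"
print(model.wv.most_similar("system"))   # nearest neighbours in the embedding space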
seq2seq
- tasks
  - text generation
  - translation