N-gram
An N-gram is a contiguous subsequence of n letters taken from a given string after removing all spaces. For example, the 3-grams that can be generated from "good morning" are "goo", "ood", "odm", "dmo", "mor" and so forth.
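The extraction described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation; the function name char_ngrams is chosen here for the example and does not come from any particular library.

```python
def char_ngrams(text, n):
    """Return the contiguous n-letter substrings of text, ignoring whitespace."""
    letters = "".join(text.split())  # remove all spaces before sliding the window
    # One n-gram starts at every position that leaves room for n letters.
    return [letters[i:i + n] for i in range(len(letters) - n + 1)]


trigrams = char_ngrams("good morning", 3)
print(trigrams)  # begins with 'goo', 'ood', 'odm', 'dmo', 'mor'
```

A string of m letters yields m - n + 1 N-grams, so a string shorter than n yields none.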
By converting a string to N-grams, it can be embedded in a vector space, which allows the string to be compared to other strings efficiently. For example, if we convert strings containing only letters of the English alphabet into 3-grams, we get a 26³-dimensional space, that is, 17,576 dimensions (the first dimension measures the number of occurrences of "aaa", the second "aab", and so forth for all possible combinations of three letters). Note that this representation loses information about the string. For example, the strings "abcba" and "bcbab" give rise to exactly the same 2-grams. However, we know empirically that if two strings of real text have a similar vector representation (for example, a small cosine distance), then they are likely to be similar.
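As a rough illustration of this embedding, the sketch below stores only the non-zero coordinates of the vector in a Counter instead of allocating all 26³ dimensions, and compares two strings by cosine similarity (cosine distance is one minus this value). The helper names ngram_counts and cosine_similarity are assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def ngram_counts(text, n):
    """Sparse N-gram vector: maps each n-gram to its number of occurrences."""
    letters = "".join(text.split())
    return Counter(letters[i:i + n] for i in range(len(letters) - n + 1))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction; treat it as dissimilar
    return dot / (norm_a * norm_b)


# The two strings from the text map to the same vector, so information is lost.
print(ngram_counts("abcba", 2) == ngram_counts("bcbab", 2))  # True

similarity = cosine_similarity(ngram_counts("good morning", 3),
                               ngram_counts("good evening", 3))
print(similarity)
```

The two greetings share four of their nine 3-grams each ("goo", "ood", "nin", "ing"), which gives a similarity strictly between 0 and 1.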
N-grams are commonly used to design kernels that allow machine learning algorithms such as support vector machines to learn from string data. They can also be used to find likely candidates for the correct spelling of a misspelled word. They are also used in compression algorithms, where a small area of data requires N-grams of greater length to improve compression.
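One simple way to find spelling candidates, sketched below, is to rank the words of a vocabulary by how many 2-grams they share with the misspelled word. The scoring used here is the Dice coefficient over sets of 2-grams, which is one possible choice among several and not necessarily what any given spell checker uses; the function names and the tiny vocabulary are hypothetical.

```python
def bigram_set(word):
    """Set of distinct 2-grams occurring in word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice(a, b):
    """Dice coefficient between the 2-gram sets of two words (0.0 to 1.0)."""
    grams_a, grams_b = bigram_set(a), bigram_set(b)
    if not grams_a and not grams_b:
        return 0.0  # both words are too short to contain any 2-gram
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def suggest(misspelled, vocabulary, k=3):
    """Return the k vocabulary words whose 2-grams best match the input."""
    return sorted(vocabulary, key=lambda w: dice(misspelled, w), reverse=True)[:k]


# Hypothetical vocabulary, for illustration only.
vocabulary = ["morning", "mourning", "warning", "evening"]
print(suggest("mornig", vocabulary, k=1))  # ['morning']
```

Here "mornig" shares four 2-grams with "morning" ("mo", "or", "rn", "ni") but fewer with every other word in the list, so "morning" ranks first.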
The same concept is applied to words: text is broken into sequences of two or three consecutive words in order to assess the probability of a given word sequence appearing. For example, the sentence "This is a test" yields the bigrams "This is", "is a" and "a test". When this is done over a large number of sentences, it becomes possible to estimate how often a given word sequence occurs, allowing better design of systems for speech recognition, optical character recognition (OCR), intelligent character recognition (ICR), machine translation and so on.
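A minimal sketch of such an estimate is given below. It counts word bigrams over a collection of sentences and estimates the probability of one word following another by relative frequency, with no smoothing. The three-sentence corpus and the function names are invented for the example; a practical system would use far more text.

```python
from collections import Counter


def word_bigrams(sentence):
    """Pairs of consecutive words in a whitespace-separated sentence."""
    words = sentence.split()
    return list(zip(words, words[1:]))


def bigram_probability(sentences, first, second):
    """Estimate P(second | first) as count(first second) / count(first ...)."""
    pair_counts = Counter()
    first_counts = Counter()
    for sentence in sentences:
        for a, b in word_bigrams(sentence):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    if first_counts[first] == 0:
        return 0.0  # the first word never starts a bigram in this corpus
    return pair_counts[(first, second)] / first_counts[first]


# Invented toy corpus, for illustration only.
corpus = ["This is a test", "This is a sentence", "This was a test"]

print(word_bigrams("This is a test"))
print(bigram_probability(corpus, "This", "is"))
```

In this toy corpus "This" is followed by "is" in two of its three occurrences, so the estimate for "is" after "This" is 2/3.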
N-grams are also used in speech recognition over sequences of phonemes, the sound components that make up speech.