What is a Co-Occurrence Matrix?
A co-occurrence matrix is a grid (matrix) that records how frequently two items (such as words, phrases, or objects) appear together in a given context.
📌 Example:
Imagine you have the following three sentences:
- 1️⃣ "The cat 🐱 sat on the mat."
- 2️⃣ "The dog 🐶 sat on the mat."
- 3️⃣ "The cat 🐱 and the dog 🐶 are friends."
Now, let’s consider some words:
- "cat"
- "dog"
- "mat"
- "sat"
- "on"
A co-occurrence matrix for these words would show how often each word appears alongside the others.
How is it Structured?
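One common layout: both the rows and the columns are labeled with the words of interest, and each cell stores how many contexts (here, sentences) contain both the row word and the column word. Using the three example sentences above with a sentence-level window, the matrix looks like this:

|     | cat | dog | mat | sat | on |
|-----|-----|-----|-----|-----|----|
| cat | 0   | 1   | 1   | 1   | 1  |
| dog | 1   | 0   | 1   | 1   | 1  |
| mat | 1   | 1   | 0   | 2   | 2  |
| sat | 1   | 1   | 2   | 0   | 2  |
| on  | 1   | 1   | 2   | 2   | 0  |

For instance, "mat" and "sat" share two sentences (1 and 2), so their cell holds 2. Here is a minimal Python sketch that builds this matrix; the sentence-level window and the whitespace tokenizer are simplifying assumptions:

```python
from collections import Counter
from itertools import combinations

# Build a sentence-level co-occurrence matrix for a fixed vocabulary.
# Each co-occurring pair is counted once per sentence.
sentences = [
    "the cat sat on the mat",
    "the dog sat on the mat",
    "the cat and the dog are friends",
]
vocab = ["cat", "dog", "mat", "sat", "on"]

counts = Counter()
for sentence in sentences:
    present = set(sentence.split()) & set(vocab)
    for w1, w2 in combinations(present, 2):
        counts[(w1, w2)] += 1
        counts[(w2, w1)] += 1  # keep the matrix symmetric

# Print the matrix with row and column labels.
print("      " + " ".join(f"{w:>4}" for w in vocab))
for w1 in vocab:
    print(f"{w1:>5} " + " ".join(f"{counts[(w1, w2)]:>4}" for w2 in vocab))
```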
Different Types of Counts in a Co-Occurrence Matrix
1️⃣ Raw Co-Occurrence Count (R) 📊
R(j, k) → The number of times two words or phrases appear together in a text window.
✅ Example: If the phrase "artificial intelligence" appears in 20 documents together with "machine learning," the raw co-occurrence count for ("artificial intelligence", "machine learning") is 20. (A code sketch covering all three counts follows the third count below.)
2️⃣ Disjunctive Interesting Count (D) 🧠
D(j, k) → The number of times either of the two phrases appears as highlighted text (bold, underline, hyperlink, etc.).
✅ Example:
In a document, "Bitcoin" appears in bold and "cryptocurrency" is underlined.
The count D(Bitcoin, cryptocurrency) is increased because at least one of them appears in a highlighted format.
3️⃣ Conjunctive Interesting Count (C) 🔗
C(j, k) → The number of times both words/phrases appear as highlighted text in a document.
✅ Example:
If both "Elon Musk" and "Tesla" appear in bold in a document, then C(Elon Musk, Tesla) is increased.
This stricter requirement prevents frequently highlighted but irrelevant phrases (like "terms & conditions") from being treated as important.
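Putting the three counts together, here is a minimal sketch. It assumes each document is represented as plain text plus the set of phrases that appear highlighted in it; this representation, the substring matching, and the sample data are illustrative assumptions, not from the source:

```python
def raw_count(docs, a, b):
    # R(j, k): documents whose text contains both phrases.
    return sum(1 for d in docs if a in d["text"] and b in d["text"])

def disjunctive_count(docs, a, b):
    # D(j, k): co-occurring documents where at least one phrase is highlighted.
    return sum(
        1 for d in docs
        if a in d["text"] and b in d["text"]
        and (a in d["highlighted"] or b in d["highlighted"])
    )

def conjunctive_count(docs, a, b):
    # C(j, k): documents where both phrases appear as highlighted text.
    return sum(
        1 for d in docs
        if a in d["highlighted"] and b in d["highlighted"]
    )

docs = [
    {"text": "bitcoin is the best known cryptocurrency",
     "highlighted": {"bitcoin", "cryptocurrency"}},   # both highlighted
    {"text": "bitcoin and cryptocurrency markets fluctuate",
     "highlighted": {"bitcoin"}},                      # only one highlighted
    {"text": "bitcoin mining uses a lot of energy",
     "highlighted": set()},                            # pair does not co-occur
]

pair = ("bitcoin", "cryptocurrency")
print(raw_count(docs, *pair))          # 2
print(disjunctive_count(docs, *pair))  # 2
print(conjunctive_count(docs, *pair))  # 1
```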
🔹 Importance of the Co-Occurrence Matrix 🚀
- 1️⃣ Finding Relationships 🔗 → Helps natural language processing (NLP) systems understand word associations.
- 2️⃣ Building Search Engines 🔍 → Improves keyword recommendations and autocomplete suggestions.
- 3️⃣ Content Categorization 📂 → Groups similar topics together using related phrases.
- 4️⃣ Spam Detection 🚫 → Identifies common word patterns used in spam messages.
- 5️⃣ Machine Learning & AI 🤖 → Helps train algorithms to make better predictions.
Phrase Lists and the Co-Occurrence Matrix in Indexing
📋 Phrase Lists Used in Co-Occurrence Matrices
- 1️⃣ Possible Phrase List 📖 → A list of phrases that might be useful but have not been confirmed yet.
- 2️⃣ Good Phrase List ✅ → Phrases that are useful and frequently appear in meaningful contexts.
- 3️⃣ Bad Phrase List ❌ → Phrases that rarely appear and are not useful.
✅ Example:
If a phrase appears in 10+ documents and at least 5 times as a highlighted phrase, it is added to the Good Phrase List ✅.
If a phrase appears in fewer than 2 documents and never as a highlighted phrase, it is considered a Bad Phrase ❌.
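A sketch of that classification rule, assuming we already have per-phrase statistics; the function name and inputs are illustrative, and the thresholds are the ones stated above:

```python
def classify_phrase(doc_count, highlighted_count):
    # Thresholds from the example above: 10+ documents and 5+ highlighted
    # occurrences -> Good; fewer than 2 documents and never highlighted -> Bad.
    if doc_count >= 10 and highlighted_count >= 5:
        return "good"
    if doc_count < 2 and highlighted_count == 0:
        return "bad"
    return "possible"  # not yet confirmed either way

print(classify_phrase(doc_count=25, highlighted_count=7))  # good
print(classify_phrase(doc_count=1, highlighted_count=0))   # bad
print(classify_phrase(doc_count=4, highlighted_count=1))   # possible
```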
Filtering Meaningful Phrases Using the Co-Occurrence Matrix
To remove useless phrases, a predictive measure is applied:
- 1️⃣ Compute the Expected Value (E) → How often would the phrase appear alongside others if words were randomly distributed?
- 2️⃣ Compute the Actual Co-Occurrence Rate (A) → How often does it actually appear with other phrases?
- 3️⃣ Compare Expected vs. Actual → If the phrase appears much more frequently than expected, it is meaningful!
✅ Formula:
📐 Information Gain (I) = Actual Co-Occurrence Rate (A) / Expected Co-Occurrence Rate (E)
✅ Example:
If "machine learning" appears 100 times more often than expected alongside "AI," then it is a strong predictor of "AI" and should be kept in the Good Phrase List ✅.
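A sketch of this filter in code, assuming document-frequency estimates for each phrase; the numbers and the cutoff are illustrative assumptions:

```python
def information_gain(actual_rate, expected_rate):
    # I = A / E: how much more often the pair co-occurs than chance predicts.
    return actual_rate / expected_rate

# Expected rate under independence: the product of the phrases'
# individual document frequencies. All numbers here are illustrative.
p_ml = 0.02             # fraction of documents containing "machine learning"
p_ai = 0.05             # fraction of documents containing "AI"
expected = p_ml * p_ai  # 0.001 if the phrases were unrelated

actual = 0.10           # observed co-occurrence rate

gain = information_gain(actual, expected)
print(gain)             # 100.0 -> co-occurs 100x more often than expected

CUTOFF = 1.5            # illustrative threshold, an assumption
print("keep in Good Phrase List" if gain > CUTOFF else "filter out")
```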