What is Open Information Extraction?
🔍 What is Open IE?
Open Information Extraction (Open IE) is a fact-extraction technique that identifies noun phrases in text and links them through the relationships expressed between them.
✅ Instead of relying on predefined databases or knowledge bases, Open IE automatically extracts facts from text!
🎯 Example:
👉 "Google bought Wavii for $30 million in 2013."
✅ Extracted Fact: (Google, bought, Wavii, $30M, 2013)
🚀 Key Concepts in Open IE
1️⃣ Fact Extraction 🧐
Open IE extracts facts from text without needing a structured database.
📝 Example:
📌 "Apple was founded by Steve Jobs in 1976."
🔹 Extracted: (Apple, founded by, Steve Jobs, 1976)
2️⃣ Connecting Nouns 📡
Open IE links nouns using relationships, making it easier to understand how entities are connected.
📝 Example:
📌 "Tesla's CEO is Elon Musk."
🔹 Extracted: (Tesla, CEO, Elon Musk)
3️⃣ Classifier & Confidence Score 🎯
A classifier assigns a confidence score to relationships between two nouns. This helps determine how accurate the extracted fact is.
📝 Example:
📌 "Microsoft is the parent company of LinkedIn."
🔹 Confidence Score: 95% (Highly accurate)
📌 "Microsoft owns Google."
🔹 Confidence Score: 10% (Incorrect relationship)
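The idea of scoring a candidate fact can be sketched with a toy corpus-frequency heuristic. The corpus, candidates, and scoring rule below are illustrative stand-ins, not a real Open IE classifier:

```python
# Toy confidence scorer: a candidate (e1, rel, e2) is scored by how
# often the exact phrase "e1 rel e2" appears in a small reference corpus.
def confidence(candidate, corpus):
    e1, rel, e2 = candidate
    pattern = f"{e1} {rel} {e2}".lower()
    hits = sum(pattern in sentence.lower() for sentence in corpus)
    return hits / len(corpus)

corpus = [
    "Microsoft is the parent company of LinkedIn.",
    "LinkedIn was acquired by Microsoft in 2016.",
    "Microsoft is the parent company of LinkedIn since 2016.",
]

good = ("Microsoft", "is the parent company of", "LinkedIn")
bad = ("Microsoft", "owns", "Google")

print(confidence(good, corpus))  # 2/3 -- supported by two sentences
print(confidence(bad, corpus))   # 0.0 -- no supporting evidence
```

A real system replaces this string match with a trained classifier, but the interface is the same: a candidate tuple in, a confidence score out.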
📊 Open IE & Text-to-Data Conversion
📢 Open IE converts unstructured text into structured data that machines can understand and analyze.
📝 Example:
📌 "Barack Obama was the 44th President of the United States."
🔹 Before: Just text
🔹 After: (Barack Obama, was, 44th President, USA) → Structured Data ✅
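The text-to-data step can be sketched with a single pattern. The regex below is an illustrative toy covering only "X was/is Y" sentences, not a real Open IE extractor:

```python
import re

# Minimal pattern-based extraction: split a sentence into
# (subject, relation, object) around a "was"/"is" verb.
def extract(sentence):
    m = re.match(r"(.+?)\s+(was|is)\s+(.+?)\.?$", sentence)
    if not m:
        return None
    return (m.group(1), m.group(2), m.group(3))

fact = extract("Barack Obama was the 44th President of the United States.")
print(fact)
# ('Barack Obama', 'was', 'the 44th President of the United States')
```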
📌 Relation Types in Open IE
- 🔹 "created by" → (Harry Potter, created by, J.K. Rowling)
- 🔹 "author of" → (J.K. Rowling, author of, Harry Potter)
- 🔹 "is from" → (Cristiano Ronaldo, is from, Portugal)
- 🔹 "located in" → (Eiffel Tower, located in, Paris)
🔬 Open IE & Unknown Entities 🤖
Unlike traditional Named Entity Recognition (NER), Open IE can detect unknown or minor entities that are not pre-registered in a database.
📝 Example:
📌 "Zara Khan won the Best New Artist award."
🔹 Even if "Zara Khan" is not in any database, Open IE can still extract and relate her name to the award. 🏆
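Because Open IE works from surface form rather than a registry of known entities, even a never-before-seen name becomes a candidate argument. A toy chunker, assuming capitalized word runs mark noun phrases (real systems use a POS tagger):

```python
import re

# Any run of capitalized words is treated as a candidate entity --
# no database lookup required, so unknown names like "Zara Khan" qualify.
def candidate_entities(sentence):
    return re.findall(r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*", sentence)

print(candidate_entities("Zara Khan won the Best New Artist award."))
# ['Zara Khan', 'Best New Artist']
```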
Process for Open Information Extraction
1️⃣ Corpus of Text
The system starts with a big collection of text, like articles from the internet. This text is the main source of information.
2️⃣ Training Data
A small part of this text is picked and used as training data. This helps the system learn how to find useful information.
3️⃣ Self-Supervised Learner
A self-learning program is trained using the selected text. It learns to tell the difference between good and bad information.
4️⃣ Single-Pass Extractor
The system scans all the text and pulls out useful pieces of information (called tuples). These are simple fact-like statements.
5️⃣ Classifier
A checker (classifier) looks at the extracted information. It removes incorrect or uncertain facts and keeps only reliable ones.
6️⃣ Verify with Multiple Sources
The system checks if the same information appears in different places. If a fact is mentioned in many sources, it's more likely to be true.
7️⃣ Store the Information
The final trusted facts are saved in an organized structure (like a database). This makes it easy to search and use later.
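The seven steps above can be sketched end-to-end as a tiny pipeline. Every component here is a stand-in (real systems use parsers, taggers, and a trained classifier); only the data flow matches the description:

```python
from collections import Counter

# Step 1: a (very small) corpus of text.
corpus = [
    "Apple was founded by Steve Jobs.",
    "Apple was founded by Steve Jobs in 1976.",
    "Bananas was founded by nobody.",  # noise
]

# Steps 2-3: a hand-written rule stands in for the self-supervised learner.
def looks_trustworthy(subj, obj):
    return subj[0].isupper() and obj[0].isupper()

# Step 4: single pass over the corpus, pulling out candidate tuples.
def extract_all(corpus):
    for sent in corpus:
        if " was founded by " in sent:
            subj, rest = sent.split(" was founded by ", 1)
            obj = rest.rstrip(".").split(" in ")[0]
            yield (subj, "was founded by", obj)

# Step 5: the classifier filters out untrustworthy candidates.
candidates = [t for t in extract_all(corpus) if looks_trustworthy(t[0], t[2])]

# Step 6: keep only tuples supported by more than one sentence.
counts = Counter(candidates)
trusted = {t for t, n in counts.items() if n > 1}

# Step 7: store the surviving facts in an organized structure.
database = sorted(trusted)
print(database)  # [('Apple', 'was founded by', 'Steve Jobs')]
```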
🏗️ TEXTRUNNER’s Architecture & Key Components
TEXTRUNNER has a modular design built from three main components. Each one plays a distinct role in processing text and extracting reliable facts.
1. Self-Supervised Learner 🤖
What It Does:
Learns on its own: It takes a small sample of text (training data) and automatically labels extraction candidates as “trustworthy” or “not trustworthy.”
No manual tagging required: Instead of hand-tagging, the system uses a parser and heuristics to generate positive (good) and negative (bad) examples.
Trains a Classifier: It then trains a Naive Bayes classifier that can later decide whether a candidate relation from the text is reliable.
How It Works:
Parsing & Candidate Generation:
The system parses several sentences to identify base noun phrases (e.g., names of people, places, companies).
For each pair of noun phrases, it searches the connecting words to form a candidate tuple, e.g.,
Tuple format: (Entity1, Relation, Entity2)
📝 Example:
From the sentence: “Oppenheimer taught at Berkeley and CalTech,” it creates:
(Oppenheimer, taught at, Berkeley)
(Oppenheimer, taught at, CalTech)
Heuristics & Labeling:
It checks syntactic constraints such as:
- Dependency chain length: The chain linking two noun phrases should not be too long (e.g., ≤ 4 words).
- Sentence boundaries: The path should not cross boundaries like relative clauses.
- Avoiding pronouns only: Both entities should be more than just pronouns.
If the candidate meets the criteria, it is labeled positive; otherwise, it is labeled negative.
Feature Extraction & Training:
Each tuple is mapped to a feature vector (using low-level, language-specific features like part-of-speech tags, token counts, etc.).
The Naive Bayes classifier is trained on these features to learn what a “trustworthy” relation looks like.
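The heuristic labeling step can be sketched as a small function. The pronoun list and the 4-word threshold below mirror the constraints described above, but the implementation is an illustrative stand-in for the parser-based checks:

```python
# Label candidate tuples positive or negative from syntactic constraints,
# with no hand tagging. The labeled examples would then serve as training
# data for a Naive Bayes classifier.
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them"}

def label(candidate):
    e1, relation, e2 = candidate
    # Constraint: the connecting phrase must not be too long (<= 4 words).
    if len(relation.split()) > 4:
        return "negative"
    # Constraint: neither argument may be just a pronoun.
    if e1.lower() in PRONOUNS or e2.lower() in PRONOUNS:
        return "negative"
    return "positive"

print(label(("Oppenheimer", "taught at", "Berkeley")))  # positive
print(label(("he", "taught at", "Berkeley")))           # negative
print(label(("Oppenheimer", "who had long been known to teach at",
             "Berkeley")))                              # negative: chain too long
```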
2. Single-Pass Extractor ⚡
What It Does:
Efficient Extraction: It makes one pass over the entire text corpus to extract all possible candidate tuples.
Lightweight and Fast: It uses a lightweight noun-phrase chunker rather than a full parser, which makes it very fast.
How It Works:
Tagging & Chunking:
Each word in every sentence is tagged with its most likely part-of-speech.
A chunker then finds noun phrases (e.g., “Tesla,” “Elon Musk”) and also provides a confidence score for each detected entity.
Identifying Relations:
The text between noun phrases is examined to extract potential relations.
Heuristic rules help remove unnecessary words like extra prepositional phrases or adverbs.
📝 Example:
From the sentence:
“Scientists from many universities are studying climate change.”
The extractor simplifies the phrase by removing the extra modifier “from many universities” and focuses on:
(Scientists, are studying, climate change)
Classification:
Each candidate tuple is sent to the classifier from the Self-Supervised Learner.
Only those tuples that are labeled as trustworthy are kept.
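The extractor's flow can be sketched in a few lines. Here the two noun phrases are assumed given (a real system finds them with a POS tagger and chunker), and a single regex stands in for the heuristic rules that drop modifiers:

```python
import re

# Toy single-pass extraction: take the words between two noun phrases as
# the relation, then strip a leading "from ..." prepositional phrase.
def extract(sentence, np1, np2):
    start = sentence.index(np1) + len(np1)
    end = sentence.index(np2)
    relation = sentence[start:end].strip()
    # Heuristic rule: drop a prepositional modifier before the verb.
    relation = re.sub(r"^from\s+(?:\S+\s+)*?(?=are|is|was|were)", "", relation)
    return (np1, relation.strip(), np2)

print(extract("Scientists from many universities are studying climate change.",
              "Scientists", "climate change"))
# ('Scientists', 'are studying', 'climate change')
```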
3. Redundancy-Based Assessor 🔍
What It Does:
Normalizes and Merges: It converts the extracted relation phrases to a normalized form (e.g., “was originally developed by” → “was developed by”).
Assesses Reliability: It counts how many distinct sentences support each tuple and then uses a probabilistic model to assign a probability score to each tuple.
How It Works:
Normalization:
Non-essential modifiers are removed so that similar relations are grouped together.
📝 Example:
“is located in” and “is situated in” might be normalized to the same basic relation.
Merging Duplicates:
Tuples that are identical (after normalization) are merged, and the number of unique supporting sentences is tallied.
Probability Assignment:
Using the redundancy (i.e., the number of times a tuple appears across different sentences), the system calculates a probability score.
📝 Example:
(Einstein, presented, The Theory of Relativity) might appear in 10 different sentences, resulting in a high probability (e.g., 0.95).
A less reliable extraction like (The Theory of Relativity, in, a paper) might have a low probability (e.g., 0.3) and be discarded.
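One simple way to turn redundancy into a probability is a noisy-or model: assume each independent supporting sentence is correct with some base probability, and score the tuple as the chance that at least one extraction is correct. This is a simplification for illustration; TEXTRUNNER's actual assessor uses a more refined probabilistic model:

```python
# Noisy-or redundancy score: with k supporting sentences, each correct
# with probability p0, the tuple is wrong only if all k are wrong.
def redundancy_score(num_sentences, p0=0.3):
    return 1 - (1 - p0) ** num_sentences

print(round(redundancy_score(10), 2))  # well-supported tuple -> high score
print(round(redundancy_score(1), 2))   # single-sentence extraction -> low score
```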
Output – The Extraction Graph:
The final output is an extraction graph where nodes represent entities and edges represent the extracted relationships along with their confidence scores.
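The extraction graph can be sketched as a simple adjacency structure, with entities as nodes and scored relations as edges (the facts and scores below are the illustrative examples from this section):

```python
# Nodes are entities; each outgoing edge carries a relation, the target
# entity, and the confidence score assigned by the assessor.
graph = {}

def add_fact(e1, relation, e2, prob):
    graph.setdefault(e1, []).append((relation, e2, prob))

add_fact("Einstein", "presented", "The Theory of Relativity", 0.95)
add_fact("Tesla", "CEO", "Elon Musk", 0.90)

print(graph["Einstein"])
# [('presented', 'The Theory of Relativity', 0.95)]
```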