Published: June 25, 2026

Mapping Political-Elite Networks in Europe with a Multilingual Joint Entity-Relation Extraction Pipeline

By Kirill Solovev, Jana Lasser

Research TL;DR

"A modular, open-weight pipeline for multilingual joint entity-relation extraction using span-based NER, a three-stage linking cascade to Wikidata, and ontology-constrained MoE with guided decoding to build signed temporal knowledge graphs from news corpora."

Abstract

Whether political elites organise into rent-seeking coalitions that capture public resources or civic networks that sustain governance is a central question in comparative politics. Yet observing these complex, informal, and adversarial ties at scale has historically required intensive manual coding, while automated text-as-data methods have largely been limited to simple co-occurrence. Recent large language model (LLM) approaches offer a path forward but often rely on proprietary APIs, lack cross-lingual capability, and struggle with scalable entity resolution. We present a modular, fully open-weight pipeline for multilingual joint entity-relation extraction that builds signed, temporal knowledge graphs from massive unstructured news corpora. It combines span-based named-entity recognition (NER) with a three-stage linking cascade mapping mentions to language-independent Wikidata identifiers; a high-throughput, ontology-constrained mixture-of-experts model then uses guided decoding to extract directed, signed relationships grounded in a domain ontology. A full-coverage spot-check against a 3491-relation gold standard shows high textual correctness (68.2% strict to 93.7% lenient). Two large-scale case studies validate the pipeline against the public record. In Austria, it reconstructs a political party's complete lifecycle, dating internal fractures and tracking personnel into successor factions and court convictions. In a Polish corpus, it uncovers the overlapping economic and governance networks of state-enterprise patronage, alongside the structurally balanced, signed conflict network of the polarized Civic Platform (Platforma Obywatelska, PO)–Law and Justice (Prawo i Sprawiedliwość, PiS) duopoly. By bridging raw multilingual text and structured relational data, our framework provides a robust, replicable foundation for cross-national empirical computational social science.

Technical Analysis & Implementation

Technical Summary

This paper presents a fully open-weight pipeline for multilingual joint entity-relation extraction (ERE) from large unstructured news corpora, building signed and temporal knowledge graphs (KGs). The pipeline consists of three main components: (1) span-based named-entity recognition (NER), (2) a three-stage entity linking cascade to Wikidata, and (3) an ontology-constrained mixture-of-experts (MoE) model with guided decoding for relation extraction.

Span-based NER

A pretrained multilingual language model (e.g., XLM-R) is fine-tuned with a span classification head. For each input token sequence $X = [x_1, ..., x_n]$, all possible spans $s_{i,j}$ (contiguous subsequences) are enumerated. Each span is represented as the concatenation of its start/end token embeddings and a width embedding, fed into a classifier to predict entity type (PER, ORG, GPE, etc.) or "non-entity". The loss is a cross-entropy over spans.
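
The span enumeration and span representation described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `max_width` cap, the function names, and the toy embeddings are assumptions introduced here (the text only says that spans are enumerated and represented by start, end, and width embeddings).

```python
from typing import List, Tuple

def enumerate_spans(n_tokens: int, max_width: int) -> List[Tuple[int, int]]:
    """Return all (start, end) index pairs, end inclusive, with width <= max_width."""
    return [
        (i, j)
        for i in range(n_tokens)
        for j in range(i, min(i + max_width, n_tokens))
    ]

def span_representation(token_vecs, width_table, start, end):
    """Concatenate the start-token, end-token, and width embeddings of one span."""
    width = end - start + 1
    return token_vecs[start] + token_vecs[end] + width_table[width - 1]

tokens = ["Maria", "Muster", "visited", "Vienna"]
# Toy 2-dimensional token embeddings; a real system would use encoder outputs.
token_vecs = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
# One toy 1-dimensional embedding per span width (1, 2, 3).
width_table = [[1.0], [2.0], [3.0]]

spans = enumerate_spans(len(tokens), max_width=3)
features = {s: span_representation(token_vecs, width_table, *s) for s in spans}
```

Each entry of `features` is the vector a span classifier would score against the entity types plus the "non-entity" class. Capping the width keeps the number of spans linear in sequence length rather than quadratic.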

Three-Stage Entity Linking

Mentions are linked to Wikidata Q-IDs in three stages:

  1. Candidate Generation: Fuzzy string matching against a precomputed index of Wikidata labels/aliases.
  2. Contextual Disambiguation: A bi-encoder (Sentence-BERT) scores candidate entities against the mention's left/right context window.
  3. Coreference Resolution: Within-document and cross-document clustering using agglomerative clustering with a learned pairwise similarity threshold.

The pipeline outputs a set of Wikidata IDs for each document.
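
The first two linking stages can be illustrated with a small, dependency-free sketch. Everything concrete in it is an assumption made for illustration: the alias index, the placeholder IDs (`Q_PARTY`, `Q_SOFTWARE`), the toy vectors standing in for bi-encoder embeddings, and the use of `difflib` for fuzzy matching. The coreference-clustering stage is omitted.

```python
import difflib
import math

# Toy alias index: lower-cased surface form -> candidate IDs (placeholders, not real Q-IDs).
ALIAS_INDEX = {
    "civic platform": ["Q_PARTY"],
    "platforma obywatelska": ["Q_PARTY"],
    "platform": ["Q_PARTY", "Q_SOFTWARE"],
}
# Toy entity vectors standing in for bi-encoder embeddings.
ENTITY_VECS = {"Q_PARTY": [0.9, 0.1], "Q_SOFTWARE": [0.1, 0.9]}

def generate_candidates(mention, cutoff=0.8):
    """Stage 1: fuzzy-match the mention against known aliases and collect their IDs."""
    aliases = difflib.get_close_matches(mention.lower(), list(ALIAS_INDEX), n=5, cutoff=cutoff)
    candidates = []
    for alias in aliases:
        for qid in ALIAS_INDEX[alias]:
            if qid not in candidates:
                candidates.append(qid)
    return candidates

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def disambiguate(candidates, context_vec):
    """Stage 2: pick the candidate whose vector is closest to the context vector."""
    if not candidates:
        return None
    return max(candidates, key=lambda qid: cosine(ENTITY_VECS[qid], context_vec))

candidates = generate_candidates("Platform")
political_context = [0.8, 0.2]   # toy embedding of a politics-related context window
linked = disambiguate(candidates, political_context)
```

Here the ambiguous mention "Platform" yields two candidates, and the context vector decides between them. Because the output is an ID rather than a surface string, mentions in different languages resolve to the same node.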

Ontology-Constrained MoE for Relation Extraction

Relations are extracted using a decoder-only MoE transformer (e.g., Mixtral 8x7B) with guided decoding constrained by a domain ontology. The ontology defines relation types (e.g., "member_of", "conflict", "supports") with direction and sign (positive/negative). The model takes as input the concatenation of two entity IDs and the context text, and generates a relation token via constrained beam search. The MoE architecture uses a gating network $G(x) = \text{softmax}(W_g x)$ to select the top-$k$ of $N$ experts, each expert $E_i$ being a feed-forward network (FFN). With the gate values $G(x)_i$ of unselected experts set to zero, the output is:

$$y = \sum_{i=1}^N G(x)_i E_i(x)$$
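
A toy, pure-Python version of this sparse weighted sum may help make the routing concrete. It is a sketch under stated assumptions: the experts are simple scaling functions rather than FFNs, and the `renormalize` flag (rescaling the selected gate values so they sum to one, as some top-$k$ routers do) is an option added here, not something the text specifies.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(x, gate_weights, experts, k=2, renormalize=True):
    """Compute y = sum_i G(x)_i * E_i(x), keeping only the top-k gate values."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    gates = softmax(logits)
    top = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:k]
    norm = sum(gates[i] for i in top) if renormalize else 1.0
    y = [0.0] * len(x)
    for i in top:
        expert_out = experts[i](x)  # toy experts preserve the input dimension
        y = [yi + (gates[i] / norm) * oi for yi, oi in zip(y, expert_out)]
    return y, top

def make_expert(scale):
    """Toy expert: scales its input (a stand-in for a feed-forward network)."""
    return lambda x: [scale * v for v in x]

experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [-1.0, 0.0]]  # one row of W_g per expert
y, selected = moe_layer([1.0, 0.0], gate_weights, experts, k=2)
```

Only the `k` selected experts are evaluated, which is why an MoE model can have many parameters while spending the compute of a much smaller dense model per token.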

To enforce ontology constraints, the decoding step masks invalid tokens (e.g., disallowing "supports" between two organizations if not defined). Relations are extracted as tuples $(h, r, t, s, \tau)$ where $h$ and $t$ are the Wikidata IDs of the head and tail entities, $r$ is the relation type, $s \in \{-1, +1\}$ the sign, and $\tau$ the timestamp.
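
One possible in-memory shape for these five-element relation records, together with an ontology check and a simple signed-edge aggregation, is sketched below. The ontology fragment, the entity types, the placeholder IDs, and the choice to sum signs per directed pair are all assumptions for illustration; the text does not say how the graph is aggregated.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date

# Hypothetical ontology fragment: relation -> (head type, tail type, sign).
ONTOLOGY = {
    "supports": ("PER", "ORG", +1),
    "conflict": ("PER", "PER", -1),
    "member_of": ("PER", "ORG", +1),
}

@dataclass(frozen=True)
class Relation:
    head: str        # ID of the source entity
    relation: str    # relation type from the ontology
    tail: str        # ID of the target entity
    sign: int        # +1 or -1
    timestamp: date  # when the relation was reported

def is_valid(rel, entity_types):
    """Check relation type, argument types, and sign against the ontology."""
    spec = ONTOLOGY.get(rel.relation)
    if spec is None:
        return False
    head_type, tail_type, sign = spec
    return (entity_types.get(rel.head) == head_type
            and entity_types.get(rel.tail) == tail_type
            and rel.sign == sign)

def signed_edge_weights(relations):
    """Sum the signs of all relations per directed (head, tail) pair."""
    weights = defaultdict(int)
    for rel in relations:
        weights[(rel.head, rel.tail)] += rel.sign
    return dict(weights)

entity_types = {"Q_A": "PER", "Q_B": "PER", "Q_ORG": "ORG"}  # placeholder IDs
extracted = [
    Relation("Q_A", "conflict", "Q_B", -1, date(2020, 1, 1)),
    Relation("Q_A", "conflict", "Q_B", -1, date(2021, 1, 1)),
    Relation("Q_A", "member_of", "Q_ORG", +1, date(2019, 1, 1)),
    Relation("Q_ORG", "supports", "Q_A", +1, date(2019, 6, 1)),  # wrong argument types
]
valid = [r for r in extracted if is_valid(r, entity_types)]
weights = signed_edge_weights(valid)
```

Keeping the timestamp on each record means the same list can be filtered by date before aggregation to obtain graph snapshots over time.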

Implementation Details

  • NER: Fine-tuned XLM-R Large on a multilingual dataset of political news (hand-annotated).
  • Entity Linking: Precomputed index of ~10M Wikidata entities; bi-encoder trained on Wikipedia hyperlinks.
  • Relation Extraction: Mixtral 8x7B with retrieval-augmented generation (RAG) for context; constrained decoding via a custom grammar.
  • Pipeline Scaling: Document-level parallelism with distributed Redis-backed mention queues.
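
The document-level parallelism in the last item can be pictured with a small sketch. It deliberately swaps in standard-library pieces: an in-process `queue.Queue` stands in for the Redis-backed mention queue, a thread pool stands in for distributed workers, and `extract_mentions` is a hypothetical placeholder for the NER stage. None of this reflects the actual deployment.

```python
from concurrent.futures import ThreadPoolExecutor
from queue import Queue

def extract_mentions(doc):
    """Hypothetical stand-in for the NER stage: treat capitalised tokens as mentions."""
    doc_id, text = doc
    return [(doc_id, tok.strip(".,")) for tok in text.split() if tok[:1].isupper()]

def run_documents(docs, workers=4):
    """Process documents independently and collect their mentions from a shared queue."""
    mention_queue = Queue()  # in-process stand-in for a Redis-backed queue

    def worker(doc):
        for mention in extract_mentions(doc):
            mention_queue.put(mention)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(worker, docs))  # consume the iterator to surface worker exceptions

    mentions = []
    while not mention_queue.empty():
        mentions.append(mention_queue.get())
    return mentions

docs = [(1, "talks resume in Vienna."), (2, "talks continue in Warsaw.")]
mentions = sorted(run_documents(docs))
```

Because each document is processed independently, arrival order in the queue is not deterministic, so downstream stages should not rely on it.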

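Example Code Snippet

# Sketch of relation extraction with guided decoding (illustrative, not the released pipeline)
from dataclasses import dataclass

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint ID; the text only says "e.g., Mixtral 8x7B"
MODEL_ID = "mistralai/Mixtral-8x7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

# Ontology constraints: allowed relations and their [head type, tail type] signatures
ontology_constraints = {
    "conflict": ["POLITICIAN", "POLITICIAN"],
    "supports": ["POLITICIAN", "POLICY"],
    # ...
}

@dataclass
class Entity:
    qid: str   # Wikidata Q-ID
    type: str  # ontology type, e.g. "POLITICIAN"

def get_valid_relation_ids(constraints, head_type, tail_type):
    """Token-ID sequences (ending in EOS) of the relations allowed for this type pair."""
    names = [r for r, sig in constraints.items() if sig == [head_type, tail_type]]
    return [tokenizer.encode(r, add_special_tokens=False) + [tokenizer.eos_token_id] for r in names]

def guided_generate(context, entity_h, entity_t, max_tokens=8):
    input_text = f"Context: {context}\nRelation between {entity_h.qid} and {entity_t.qid}:"
    input_ids = tokenizer.encode(input_text, return_tensors="pt").to(model.device)
    prompt_len = input_ids.shape[1]

    valid_seqs = get_valid_relation_ids(ontology_constraints, entity_h.type, entity_t.type)
    if not valid_seqs:
        return None  # the ontology defines no relation for this type pair

    def allowed_tokens(batch_id, ids):
        # Allow only tokens that extend a valid relation name; EOS once one is complete
        done = ids[prompt_len:].tolist()
        nxt = {s[len(done)] for s in valid_seqs if len(s) > len(done) and s[:len(done)] == done}
        return list(nxt) or [tokenizer.eos_token_id]

    outputs = model.generate(
        input_ids,
        max_new_tokens=max_tokens,
        prefix_allowed_tokens_fn=allowed_tokens,
        do_sample=False,
    )
    return tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True).strip()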

Evaluation

On a gold-standard set of 3,491 relations, the pipeline achieves 68.2% strict and 93.7% lenient textual correctness. Two large-scale case studies (Austrian party lifecycle and Polish patronage networks) demonstrate validity against historical records.