GraphRAG

Earlier this year, GraphRAG was introduced: a graph-based approach to retrieval-augmented generation (RAG) for question-answering over private datasets. Now available on GitHub, GraphRAG offers more structured information retrieval and more comprehensive responses than traditional RAG methods, and includes a solution accelerator for easy, code-free deployment on Azure. It uses a large language model (LLM) to automatically create a knowledge graph from text documents, then detects and organises the data into hierarchical “communities” of related topics, giving users an overview of the dataset's structure and themes before any questions are asked.
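To illustrate the idea (this is not the GraphRAG library's actual API), the sketch below uses a hypothetical llm(prompt) callable to extract entity-relationship triples from text chunks, merges them into a networkx graph, and clusters the graph into communities; GraphRAG itself uses hierarchical Leiden clustering, for which Louvain serves here as a readily available stand-in.

```python
# Illustrative sketch only, not the GraphRAG library API.
import json
import networkx as nx
from networkx.algorithms.community import louvain_communities

def extract_triples(chunk: str, llm) -> list[dict]:
    """Ask the LLM for (source, relation, target) triples, returned as JSON."""
    prompt = (
        "Extract entities and their relationships from the text below. "
        'Respond with a JSON list of objects like '
        '{"source": "...", "relation": "...", "target": "..."}.\n\n' + chunk
    )
    return json.loads(llm(prompt))  # llm is a hypothetical text-in/text-out callable

def build_knowledge_graph(chunks: list[str], llm) -> nx.Graph:
    """Merge triples from every chunk into a single entity-relationship graph."""
    graph = nx.Graph()
    for chunk in chunks:
        for t in extract_triples(chunk, llm):
            graph.add_edge(t["source"], t["target"], relation=t["relation"])
    return graph

def detect_communities(graph: nx.Graph) -> list[set[str]]:
    """Cluster entities into communities of related topics.
    GraphRAG uses hierarchical Leiden clustering; Louvain is a stand-in here."""
    return louvain_communities(graph, seed=42)
```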

Knowledge graph of entity nodes and relationship edges derived from a news dataset, with different colours representing various communities. Level 0 communities (left) represent the highest-level themes of the dataset, while level 1 communities (right) show the emergence of more granular topics within these themes.

Advantages of community summaries for “global questions”

In a recent preprint, researchers explored how GraphRAG’s community summaries address global questions that span an entire dataset, a task where traditional RAG methods typically fall short. For instance, asking “What are the main themes in the dataset?” prompts naive RAG to extract answers from text chunks that are superficially similar to the question, which may not represent the dataset as a whole. Because naive RAG relies on only the top-k most similar chunks, it can produce misleading answers to such questions.
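To make this limitation concrete, here is a minimal sketch of the naive RAG retrieval step with precomputed chunk embeddings (the function and variable names are illustrative, not from any particular library): only the k chunks most similar to the question reach the LLM, so a dataset-wide question is answered from a small, similarity-biased sample of the corpus.

```python
# Minimal top-k retrieval sketch with precomputed embeddings (illustrative only).
import numpy as np

def top_k_chunks(question_vec: np.ndarray, chunk_vecs: np.ndarray,
                 chunks: list[str], k: int = 5) -> list[str]:
    # Cosine similarity between the question and every chunk embedding.
    sims = chunk_vecs @ question_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(question_vec) + 1e-9
    )
    # Only these k chunks reach the LLM; the rest of the dataset is ignored.
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]
```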

Community summaries in GraphRAG overcome this challenge by leveraging a graph index of entity and relationship descriptions derived from all input texts. This approach employs a map-reduce strategy for question answering:

Group: Aggregate community reports within the LLM context window size.
Map: Apply the question across each group to generate community-specific answers.
Reduce: Consolidate relevant community answers into a final global answer.
This method ensures that responses reflect the comprehensive scope of the dataset, providing accurate answers to global inquiries.
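The following is a minimal sketch of this group-map-reduce pattern, assuming pre-built community summary strings and a hypothetical llm(prompt) callable; the actual GraphRAG implementation is more involved (for example, it budgets by tokens rather than characters and scores and filters the partial answers).

```python
# Hedged sketch of the group-map-reduce answering strategy described above.
def group_reports(summaries: list[str], max_chars: int = 12_000) -> list[list[str]]:
    """Group: pack community summaries into batches that fit the context window
    (character count is used here as a crude proxy for tokens)."""
    groups, current, size = [], [], 0
    for s in summaries:
        if current and size + len(s) > max_chars:
            groups.append(current)
            current, size = [], 0
        current.append(s)
        size += len(s)
    if current:
        groups.append(current)
    return groups

def map_step(question: str, groups: list[list[str]], llm) -> list[str]:
    """Map: answer the question independently against each group of summaries."""
    return [
        llm(f"Using only these community reports:\n{chr(10).join(g)}\n\n"
            f"Answer: {question}")
        for g in groups
    ]

def reduce_step(question: str, partial_answers: list[str], llm) -> str:
    """Reduce: consolidate the community-level answers into one global answer."""
    joined = "\n\n".join(partial_answers)
    return llm(f"Combine these partial answers into a single, complete answer "
               f"to '{question}':\n\n{joined}")
```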

Evaluation and results

To evaluate GraphRAG against naive RAG and hierarchical source-text summarisation, researchers used GPT-4 to generate activity-centred sense-making questions from podcast transcripts and news articles. The comparison employed three metrics evaluated by an LLM: comprehensiveness, diversity, and empowerment. GraphRAG, with its community summaries, achieved a win rate of approximately 70-80% over naive RAG on comprehensiveness and diversity, and outperformed source-text summarisation at lower token cost (approximately 20-70% of tokens per query). For the highest-level communities, GraphRAG remained competitive with hierarchical summarisation while achieving far greater efficiency, using significantly fewer tokens (approximately 2-3% per query).
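As a simplified illustration of this head-to-head setup (the preprint's actual grading prompts and procedure may differ), an LLM judge can be asked to pick a winner per question and per metric, with win rates aggregated across questions:

```python
# Simplified pairwise LLM-as-judge sketch; the preprint's actual prompts differ.
def judge_pair(question: str, answer_a: str, answer_b: str,
               metric: str, llm) -> str:
    """Ask an LLM grader which of two answers wins on a given metric.
    Returns 'A' or 'B'; win rates are aggregated over many questions."""
    prompt = (
        f"Question: {question}\n\n"
        f"Answer A:\n{answer_a}\n\n"
        f"Answer B:\n{answer_b}\n\n"
        f"Which answer is better in terms of {metric}? Reply with 'A' or 'B' only."
    )
    return llm(prompt).strip()
```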

Comparison of naive RAG and GraphRAG responses to a global question about a news dataset indicates that GraphRAG outperformed naive RAG in terms of comprehensiveness, diversity, and empowerment.

Research insights and future directions

Initial findings highlight that LLMs can effectively generate detailed knowledge graphs from unstructured text, enabling robust global queries beyond naive RAG's capability, while hierarchical source-text summarisation proves too resource-intensive. The applicability of GraphRAG therefore hinges on whether its benefits, such as structured knowledge and community summaries, outweigh the cost of constructing the graph index. Researchers are actively exploring ways to reduce these costs, including optimising LLM prompts and employing NLP techniques to approximate the knowledge graph. By publicly releasing GraphRAG and its solution accelerator, the aim is to democratise graph-based RAG and invite community feedback to refine and expand its applications.

This article is inspired by https://www.microsoft.com/en-us/research/blog/graphrag-new-tool-for-complex-data-discovery-now-on-github/
