Document Summarisation and Comparison with LLMs

Two commonplace and tedious tasks people are often faced with are summarising documents and comparing them for similarities and differences. The good news is that both can now be handled using LLMs (Large Language Models). The even better news is that it can now be done locally using Ollama and Llama 3.

The motivating case for this project was that someone I know was comparing two PDFs about deer conservation policy in Scotland. I was interested to see whether I could produce a zero-shot comparator that could compare the documents automatically using an LLM.

One of these is not like the others.

Ollama and Llama 3

I have already covered getting started with Ollama in an earlier post. Some really useful local models are now available, including Llama 3 from Meta and Phi 3 from Microsoft. For this project I decided to use Llama 3.

Document Length

Python can read documents directly from the web. However, in this case it was simpler to download the deer PDFs and ingest them using pypdf. If you are interested in seeing the documents you can find them here and here.
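
As a minimal sketch, reading the downloaded PDFs with pypdf looks something like this (the file names below are placeholders for the two deer documents, not the actual files):

```python
from pypdf import PdfReader

def pdf_to_text(path: str) -> str:
    """Extract the plain text from every page of a PDF."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

# Placeholder file names for the two downloaded deer policy PDFs
doc_a = pdf_to_text("wild_deer_national_approach.pdf")
doc_b = pdf_to_text("fes_deer_strategy.pdf")
```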

A common problem LLMs face when asked to handle large documents is that the documents exceed their context window. This can cause information to be missed, or hallucinations to creep in. I tried two approaches to solving this.

The first approach was to break the two documents into chunks in a vector database (ChromaDB in my case) and then compare chunks from the two documents with high cosine similarity. This tries to ensure that you compare the most relevant sections of the two documents with each other. The problem with this chunk-based comparison was that it lost the overall view of the documents, which was important for deep comparison. The method might be worth revisiting, but I decided on a different approach, namely summarisation.
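
For reference, a rough sketch of that first chunk-matching idea is below. It assumes the doc_a and doc_b strings from the pypdf snippet above, uses a crude fixed-size chunking for brevity, and relies on ChromaDB's default embedding function; it is an illustration rather than the exact code I used.

```python
import chromadb

client = chromadb.Client()
collection = client.create_collection(
    name="doc_a_chunks",
    metadata={"hnsw:space": "cosine"},  # match chunks by cosine distance
)

# Crude fixed-size chunking, purely for illustration
chunks_a = [doc_a[i:i + 2000] for i in range(0, len(doc_a), 2000)]
collection.add(documents=chunks_a, ids=[f"a-{i}" for i in range(len(chunks_a))])

chunks_b = [doc_b[i:i + 2000] for i in range(0, len(doc_b), 2000)]
for chunk in chunks_b:
    # The nearest chunk of document A becomes the candidate for a pairwise comparison
    match = collection.query(query_texts=[chunk], n_results=1)
    paired_chunk = match["documents"][0][0]
    # ...pass (chunk, paired_chunk) to the LLM to compare...
```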

Recursive Summarisation

I decided to divide each document to be compared into chunks and then use an LLM to summarise each chunk to make it shorter. Using LangChain's RecursiveCharacterTextSplitter to split the initial text into chunks, I then summarised each chunk with Llama 3 using a prompt which specified the desired amount of compression. I joined these chunk summaries back together to give an overall summary. This summary was then checked for length: if it was short enough to fit inside Llama 3's context window of 8,000 tokens in the later comparison steps, it was returned. Otherwise, if still too long, the summary was recursively fed back into the summariser to compress it further.
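
A condensed sketch of that loop is shown below. It uses the ollama Python package to call Llama 3; the prompt wording and the rough four-characters-per-token estimate are my own simplifications rather than the exact code in the workbook.

```python
import ollama
from langchain_text_splitters import RecursiveCharacterTextSplitter

MAX_TOKENS = 8000  # Llama 3's context window

def approx_tokens(text: str) -> int:
    return len(text) // 4  # rough heuristic: ~4 characters per token

def summarise_chunk(chunk: str, compression: int = 4) -> str:
    """Summarise one chunk, asking for a specified amount of compression."""
    prompt = (
        f"Summarise the following text to roughly 1/{compression} of its length, "
        "keeping all key facts, figures and policy points:\n\n" + chunk
    )
    reply = ollama.chat(model="llama3", messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"]

def recursive_summarise(text: str, chunk_size: int = 6000) -> str:
    """Split, summarise and rejoin; recurse if the result is still too long."""
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=200)
    chunks = splitter.split_text(text)
    summary = "\n".join(summarise_chunk(c) for c in chunks)
    if approx_tokens(summary) > MAX_TOKENS:
        return recursive_summarise(summary, chunk_size)
    return summary
```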

Some care had to be taken in the prompt engineering to ensure that the summaries handed back were suitable for use. I also found it useful to have a second function which used Llama 3 as an editor to make the completed summary more coherent after compression.
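
The editor pass can be as simple as a second prompt; the wording below is illustrative rather than the exact prompt used.

```python
import ollama  # as in the summarisation sketch above

def edit_summary(summary: str) -> str:
    """Ask Llama 3 to act as an editor and smooth the joined chunk summaries."""
    prompt = (
        "You are an editor. Rewrite the following summary so it reads as a single "
        "coherent document, removing repetition and smoothing the joins between "
        "sections, without adding any new information:\n\n" + summary
    )
    reply = ollama.chat(model="llama3", messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"]
```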

Comparison

Once the documents had been summarised down to a suitable length and edited for coherence, it was time to compare them. As with the previous steps, it was important to be specific about what comparison to conduct.

I found I got the best results when I not only asked for numbered lists of similarities and differences between the two documents but also specified what forms of similarity and difference should be noted. For instance, you might be interested in similarities of style when comparing two authors, but similarities of content for factual documents. Alternatively, you might want a more technical comparison, as in the case of the two deer conservation documents. Here, one of the requirements was to compare the difference in strategy between the two documents, which required a slight change to create a function with a bespoke prompt.
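
One possible shape for that bespoke comparison function is sketched below. The focus parameter and prompt wording are illustrative, and edited_summary_1 / edited_summary_2 stand for the outputs of the summarisation and editing steps above.

```python
import ollama

def compare_documents(summary_1: str, summary_2: str, focus: str = "strategy") -> str:
    """Ask Llama 3 for numbered similarities and differences, restricted to one aspect."""
    prompt = (
        f"Compare the two passages below in terms of {focus} only. "
        "Give a numbered list of SIMILARITIES followed by a numbered list of "
        f"DIFFERENCES, considering only similarities and differences of {focus}.\n\n"
        f"Passage 1:\n{summary_1}\n\nPassage 2:\n{summary_2}"
    )
    reply = ollama.chat(model="llama3", messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"]

print(compare_documents(edited_summary_1, edited_summary_2, focus="strategy"))
```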

Results

A typical result when Llama 3 was asked to compare the two documents was:

Here are the similarities and differences in strategy between the two passages:

**SIMILARITIES**

1. Both strategies prioritize balancing environmental, social, and economic benefits.
2. They both aim to manage deer populations sustainably, considering ecosystem services, biodiversity, and human well-being.
3. Both strategies emphasize the importance of conservation of Scotland’s natural heritage and ecosystems.
4. They both recognize the need for collaboration with stakeholders and partnerships to achieve sustainable deer management.
5. Both strategies prioritize deer welfare, including minimizing non-native species spread and ensuring strong skill bases in management.

**DIFFERENCES**

1. Passage 1 focuses on a national approach to deer management, while Passage 2 is specific to Forestry Commission Scotland’s Forest Enterprise Scotland (FES) agency and its deer management strategy.
2. Passage 1 places more emphasis on promoting sustainable economic development, recreational opportunities, and community cooperation, whereas Passage 2 prioritizes environmental protection and ecosystem services.
3. The national approach in Passage 1 sets out six strategic priorities, including prioritizing wild deer welfare, conserving natural heritage, and encouraging public access to land and outdoor recreation. In contrast, FES’s strategy focuses on managing deer populations, protecting the environment, and contributing to Scotland’s Wild Deer: a National Approach.
4. While both strategies recognize the importance of monitoring and research, Passage 1 places more emphasis on tracking trends in ecosystem health, deer numbers, and habitat condition, whereas FES’s strategy focuses on monitoring deer densities, culling targets, and habitat assessments.
5. Passage 2 provides specific details about FES’s deer management practices, including using evidence-based methods to determine deer densities and culling targets, while Passage 1 provides a more general overview of the national approach.

Conclusion

Overall, I would assess that this is a pretty good attempt at comparing the two papers. This method of summarisation and comparison certainly has the capability to be useful where large quantities of information need to be summarised and compared. Since the task was completed on local hardware with a local LLM, it is both very cheap and suitable for confidential documents. The entire process took just a few minutes to run, meaning that it is significantly faster and cheaper than attempting the comparison by hand.

The workbook to run this project can be found on Github. The two documents compared can be found here and here.
