Bio-Medical Text Summarization

Summarizing COVID documents using LLMs

The Challenge: Information Overload in Pandemic Research

When the COVID-19 pandemic hit, the Allen Institute released CORD-19 - a mammoth collection of over 1 million biomedical papers. But with 4,000+ COVID studies being published weekly at the peak, even domain experts struggled to keep up.

Our team set out to solve this with DistilBART-CORD19, an AI system that automatically generates technical summaries of complex medical literature. The results? A 28% improvement in ROUGE scores over baseline models - but also hard lessons about AI's limitations in clinical contexts.

Data Sources and Preparation

We focused on two key metadata fields from the CORD-19 dataset:

  • Abstract: The abstract of the paper, which provides a brief overview of the study.
  • Full Text: The complete text of the paper, which contains detailed information about the study.

# Dataset structure
  DatasetDict({
      train: Dataset({
          features: ['abstract', 'fulltext', 'select'],
          num_rows: 84077
      }),
      test: Dataset({
          features: ['abstract', 'fulltext', 'select'],
          num_rows: 10510
      }),
      validation: Dataset({
          features: ['abstract', 'fulltext', 'select'],
          num_rows: 10510
      })
  })
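
The row counts above (84,077 / 10,510 / 10,510) are consistent with an 80/10/10 split of 105,097 papers. The post does not say how the split was actually produced, so the following is only a minimal sketch of one way to get those proportions; `split_indices` and the seed are hypothetical.

```python
import random


def split_indices(n_rows, seed=42):
    """Shuffle row indices and cut them 80/10/10 (hypothetical helper)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = n_rows * 8 // 10
    n_test = (n_rows - n_train) // 2

    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    validation = indices[n_train + n_test:]
    return train, test, validation


# Total size taken from the three splits shown in the dataset structure.
train_idx, test_idx, val_idx = split_indices(84077 + 10510 + 10510)
print(len(train_idx), len(test_idx), len(val_idx))
```

With these inputs the three index lists have the same sizes as the train, test, and validation splits listed above.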

Baseline System: Simple Yet Revealing

Before implementing advanced transformer models, we established a baseline system using:

The baseline performance metrics revealed an interesting pattern:

While ROUGE-1 scores were surprisingly high, the dismal ROUGE-L score (2.5) exposed a critical flaw: the model could match individual words but failed to reproduce them in meaningful sequences. This validated our need for more sophisticated approaches that understand contextual relationships.
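
To make the gap between word matching and sequence matching concrete, here is a simplified sketch of the two ideas: unigram overlap for ROUGE-1 and longest common subsequence (LCS) for ROUGE-L. It computes recall only and skips stemming, so it is not the scoring code we used, and the sentences are made-up examples.

```python
from collections import Counter


def rouge1_recall(reference, candidate):
    """Fraction of reference words that also appear in the candidate."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())
    return overlap / max(sum(ref.values()), 1)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rougeL_recall(reference, candidate):
    """LCS length divided by the reference length."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    return lcs_length(ref, cand) / max(len(ref), 1)


reference = "the vaccine reduced severe illness in older adults"
scrambled = "adults older in illness severe reduced vaccine the"

print(rouge1_recall(reference, scrambled))  # 1.0: every word is present
print(rougeL_recall(reference, scrambled))  # 0.125: almost no shared word order
```

The scrambled sentence contains every reference word, so its unigram score is perfect, yet only one word survives as a common subsequence. That is the same kind of pattern the baseline showed.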

Technical Breakdown: Why DistilBART?

We selected DistilBART-CNN over alternatives like T5 or PEGASUS because:

# Importing the pre-trained model and tokenizer
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "lxyuan/distilbart-finetuned-summarization"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Why ROUGE is Preferred Over BLEU in Text Summarization

These metrics were created with different tasks in mind:

This distinction in their original purpose has shaped how they measure textual similarity, with ROUGE being intentionally oriented toward summarization tasks.

ROUGE offers several variants that capture different aspects of summarization quality:

This versatility allows researchers to evaluate both lexical coverage and structural coherence.
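
The difference in orientation can be sketched with plain n-gram counts. ROUGE-N is usually described in terms of recall (how much of the reference the summary covers), while BLEU is built on n-gram precision (how much of the output is found in the reference). The snippet below is a simplified illustration with a made-up sentence pair; it leaves out BLEU's brevity penalty and its combination of several n-gram orders, and `overlap_scores` is a hypothetical helper.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_scores(reference, candidate, n):
    """Return (recall, precision) of clipped n-gram overlap."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    overlap = sum((ref & cand).values())  # clipped counts
    recall = overlap / max(sum(ref.values()), 1)      # ROUGE-N style
    precision = overlap / max(sum(cand.values()), 1)  # BLEU-style building block
    return recall, precision


reference = "remdesivir shortened recovery time in hospitalized patients"
candidate = "remdesivir shortened recovery time"

for n in (1, 2):
    recall, precision = overlap_scores(reference, candidate, n)
    print(f"n={n} recall={recall:.2f} precision={precision:.2f}")
```

Here the candidate is entirely correct but drops the patient population. Precision stays at 1.0 for both n-gram orders, while recall falls to 4/7 for unigrams and 0.5 for bigrams, which is why a recall-oriented metric is the more natural fit for judging whether a summary covers the source.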

Implementation: Fine-tuning DistilBART for COVID-19 Research

Our implementation process involved fine-tuning the pre-trained DistilBART model on the CORD-19 dataset. Here's the core training configuration we used:

# Training arguments
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

args = Seq2SeqTrainingArguments(
  output_dir="./distilbart-finetuned-cord-19",
  evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
  learning_rate=2e-5,
  per_device_train_batch_size=12,
  per_device_eval_batch_size=12,
  weight_decay=0.01,
  save_total_limit=3,
  num_train_epochs=1,
  predict_with_generate=True,
)

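# Initialize trainer
# (tokenized_train, tokenized_validation, data_collator, and compute_metrics
# must already be defined; their setup is not shown in this post)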
trainer = Seq2SeqTrainer(
  model=model,
  args=args,
  train_dataset=tokenized_train,
  eval_dataset=tokenized_validation,
  data_collator=data_collator,
  tokenizer=tokenizer,
  compute_metrics=compute_metrics,
)

# Start training
trainer.train()

Initially, we planned for 5 epochs of training, but computational limitations forced us to cut this to just one. Surprisingly, we achieved comparable results with the single-epoch approach, suggesting efficient knowledge transfer from the pre-trained model to our domain-specific task.
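
The trainer configuration refers to `tokenized_train` and `tokenized_validation` without showing how they are built. A minimal sketch of a tokenization step follows, assuming the full text is the model input and the abstract is the target summary. The helper name `make_preprocess_fn` and the length limits (1024 and 256) are assumptions for illustration, not values from our run. A toy tokenizer stands in for the real one so the snippet runs without downloading a model.

```python
def make_preprocess_fn(tokenizer, max_input_length=1024, max_target_length=256):
    """Build a batched preprocessing function (hypothetical helper)."""

    def preprocess(batch):
        # Assumption: the full text is the model input ...
        model_inputs = tokenizer(
            batch["fulltext"], max_length=max_input_length, truncation=True
        )
        # ... and the abstract is the summary the model should produce.
        labels = tokenizer(
            text_target=batch["abstract"],
            max_length=max_target_length,
            truncation=True,
        )
        model_inputs["labels"] = labels["input_ids"]
        return model_inputs

    return preprocess


class WhitespaceTokenizer:
    """Toy stand-in for a real tokenizer so the sketch runs offline."""

    def __call__(self, text=None, text_target=None, max_length=None, truncation=False):
        texts = text if text is not None else text_target
        # Fake token ids: the length of each whitespace-separated word.
        ids = [[len(word) for word in t.split()] for t in texts]
        if truncation and max_length is not None:
            ids = [seq[:max_length] for seq in ids]
        return {"input_ids": ids, "attention_mask": [[1] * len(seq) for seq in ids]}


batch = {"fulltext": ["a bb ccc dddd"], "abstract": ["a bb"]}
preprocess = make_preprocess_fn(
    WhitespaceTokenizer(), max_input_length=3, max_target_length=1
)
example = preprocess(batch)
print(example["input_ids"], example["labels"])
```

With the real tokenizer, something like `dataset.map(make_preprocess_fn(tokenizer), batched=True)` would produce the tokenized splits, and `DataCollatorForSeq2Seq(tokenizer, model=model)` from `transformers` is one common choice for `data_collator`.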

Training Results: How Did DistilBART Perform?

We fine-tuned DistilBART on a subset of the CORD-19 dataset using TPUs. Although computational constraints limited us to a single epoch rather than the five we had planned, we observed consistent improvements in evaluation metrics like ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum.

Challenges and Limitations

Despite our promising results, we encountered several significant challenges:

Conclusion and Future Directions

Our DistilBART model for COVID-19 research paper summarization demonstrates the significant potential of domain-specific language models in addressing information overload in specialized biomedical fields. The substantial improvements in ROUGE scores after just one epoch of fine-tuning highlight the effectiveness of our approach.

For future work, we plan to:

In conclusion, while our current implementation has shown promising results, there remains substantial room for improvement, particularly in maintaining the critical medical insights and conclusions from original research. As the COVID-19 pandemic has demonstrated, efficient access to the latest research findings can literally save lives, making this work not just a technical challenge but a meaningful contribution to public health infrastructure.