CSCE 689 - Special Topics in NLP for Science (Spring 2025)

Course Information

Grading

Schedule (Subject to Change)

Week Date Topic Papers Slides Presenter
W1 1/14 Course Overview - Link Instructor
1/16 Scientific LLMs: Encoder-Only & Encoder-Decoder * SciBERT: A Pretrained Language Model for Scientific Text [EMNLP 2019]
* BioBERT: A Pre-trained Biomedical Language Representation Model for Biomedical Text Mining [Bioinformatics 2020]
* ELECTRAMed: A New Pre-trained Language Representation Model for Biomedical NLP [arXiv 2021]
* SciFive: A Text-to-Text Transformer Model for Biomedical Literature [arXiv 2021]
Link Instructor
W2 1/21 Scientific LLMs: Decoder-Only * Solving Quantitative Reasoning Problems with Language Models [NeurIPS 2022]
* SciInstruct: A Self-Reflective Instruction Annotated Dataset for Training Scientific Language Models [NeurIPS 2024]
* BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains [ACL 2024]
* OceanGPT: A Large Language Model for Ocean Science Tasks [ACL 2024]
Instructor
1/23 Citation Prediction * SPECTER: Document-Level Representation Learning using Citation-Informed Transformers [ACL 2020]
* Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings [EMNLP 2022]
* Explaining Relationships between Scientific Documents [ACL 2021]
* SciRepEval: A Multi-Format Benchmark for Scientific Document Representations [EMNLP 2023]
Instructor
W3 1/28 Scientific Literature Retrieval * MedCPT: Contrastive Pre-trained Transformers with Large-scale PubMed Search Logs for Zero-shot Biomedical Information Retrieval [Bioinformatics 2023]
* BMRetriever: Tuning Large Language Models as Better Biomedical Text Retrievers [EMNLP 2024]
* Fact or Fiction: Verifying Scientific Claims [EMNLP 2020]
* Pre-training Multi-task Contrastive Learning Models for Scientific Literature Understanding [EMNLP 2023]
Instructor
1/30 Scientific Question Answering * PubMedQA: A Dataset for Biomedical Research Question Answering [EMNLP 2019]
* Better to Ask in English: Cross-Lingual Evaluation of Large Language Models for Healthcare Queries [WWW 2024]
* MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models [ICLR 2024]
Student
W4 2/4 Scientific Knowledge Extraction * AIONER: All-in-One Scheme-Based Biomedical Named Entity Recognition using Deep Learning [Bioinformatics 2023]
* SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [EMNLP 2024]
* ReactIE: Enhancing Chemical Reaction Extraction with Weak Supervision [ACL 2023]
* ActionIE: Action Extraction from Scientific Literature with Programming Languages [ACL 2024]
Instructor
2/6 Paper Classification * The Effect of Metadata on Scientific Literature Tagging: A Cross-Field Cross-Model Study [WWW 2023]
* Hierarchical Multi-Label Classification of Scientific Documents [EMNLP 2022]
* BERTMeSH: Deep Contextual Representation Learning for Large-Scale High-Performance MeSH Indexing with Full Text [Bioinformatics 2020]
Student
W5 2/11 Scientific VLMs: Bioimaging * MedCLIP: Contrastive Learning from Unpaired Medical Images and Text [EMNLP 2022]
* A Visual–Language Foundation Model for Pathology Image Analysis using Medical Twitter [Nature Medicine 2023]
* LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day [NeurIPS 2023]
* A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks [Nature Medicine 2024]
Instructor
2/13 Scientific VLMs: Geometry * UniMath: A Foundational and Multimodal Mathematical Reasoner [EMNLP 2023]
* G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model [arXiv 2023]
* Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models [EMNLP 2024]
Student
2/16 Project Proposal Due (Sunday)
W6 2/18 [Guest Lecture] Hanwen Xu (University of Washington): Towards Patient Level Representations for Better Clinical Outcome
* Suggested Reading: A Whole-Slide Foundation Model for Digital Pathology from Real-World Data [Nature 2024]
Guest Lecturer
2/20 Scientific VLMs: Miscellaneous * UrbanCLIP: Learning Text-Enhanced Urban Region Profiling with Contrastive Language-Image Pretraining from the Web [WWW 2024]
* BioCLIP: A Vision Foundation Model for the Tree of Life [CVPR 2024]
* MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI [CVPR 2024]
Student
W7 2/25 Protein Language Models * Evolutionary-Scale Prediction of Atomic-Level Protein Structure with a Language Model [Science 2023]
* Large Language Models Generate Functional Protein Sequences across Diverse Families [Nature Biotechnology 2023]
* ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts [ICML 2023]
* BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations [EMNLP 2023]
Instructor
2/27 DNA/RNA/Single-Cell Language Models * DNABERT: Pre-trained Bidirectional Encoder Representations from Transformers Model for DNA-Language in Genome [Bioinformatics 2021]
* A 5' UTR Language Model for Decoding Untranslated Regions of mRNA and Function Predictions [Nature Machine Intelligence 2024]
* scGPT: Toward Building a Foundation Model for Single-Cell Multi-omics using Generative AI [Nature Methods 2024]
Student
W8 3/4 Molecule Language Models * Text2Mol: Cross-Modal Molecule Retrieval with Natural Language Queries [EMNLP 2021]
* Translation between Molecules and Natural Language [EMNLP 2022]
* LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset [COLM 2024]
* Fine-Tuned Language Models Generate Stable Inorganic Materials as Text [ICLR 2024]
Instructor
3/6 Urban Language Models * SpaBERT: A Pretrained Language Model from Geographic Data for Geo-Entity Representation [EMNLP 2022]
* GeoLM: Empowering Language Models for Geospatially Grounded Language Understanding [EMNLP 2023]
* UrbanGPT: Spatio-Temporal Large Language Models [KDD 2024]
Student
3/7 Literature Review Due (Friday)
W9 3/11 Spring Break (No Class)
3/13 Spring Break (No Class)
W10 3/18 [Guest Lecture] Bowen Jin (University of Illinois Urbana-Champaign): Large Language Models on Scientific Text-Attributed Graphs
* Suggested Reading: Graph Chain-of-Thought: Augmenting Large Language Models by Reasoning on Graphs [ACL 2024]
Guest Lecturer
3/20 Midterm Project Presentations Students
3/23 Midterm Report Due (Sunday)
W11 3/25 Language Models with Academic Graphs * OAG-BERT: Towards a Unified Backbone Language Model for Academic Knowledge Services [KDD 2022]
* LinkBERT: Pretraining Language Models with Document Links [ACL 2022]
* Metadata-Induced Contrastive Learning for Zero-Shot Multi-Label Text Classification [WWW 2022]
* Investigating Instruction Tuning Large Language Models on Graphs [COLM 2024]
Instructor
3/27 Table Language Models * TaBERT: Learning Contextual Representations for Natural Language Utterances and Structured Tables [ACL 2020]
* TableLlama: Towards Open Large Generalist Models for Tables [NAACL 2024]
* UniHGKR: Unified Instruction-aware Heterogeneous Knowledge Retrievers [arXiv 2024]
Student
W12 4/1 LLMs for Research: Idea Generation * ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models [arXiv 2024]
* Nova: An Iterative Planning and Search Approach to Enhance Novelty and Diversity of LLM Generated Ideas [arXiv 2024]
* Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers [arXiv 2024]
Student
4/3 LLMs for Research: Content Generation * Mapping the Increasing Use of LLMs in Scientific Papers [COLM 2024]
* Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews [ICML 2024]
* Let's Get to the Point: LLM-Supported Planning, Drafting, and Revising of Research-Paper Blog Posts [arXiv 2023]
Student
W13 4/8 [Guest Lecture] Qingyun Wang (University of Illinois Urbana-Champaign): AI4Scientist: Accelerating and Democratizing Scientific Research Lifecycle
* Suggested Reading: SciMON: Scientific Inspiration Machines Optimized for Novelty [ACL 2024]
Guest Lecturer
4/10 LLMs for Research: Reviewing * Can Large Language Models Provide Useful Feedback on Research Papers? A Large-Scale Empirical Analysis [NEJM AI 2024]
* LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing [EMNLP 2024]
* AgentReview: Exploring Peer Review Dynamics with LLM Agents [EMNLP 2024]
Student
W14 4/15 LLMs for Research: Miscellaneous * A Search Engine for Discovery of Scientific Challenges and Directions [AAAI 2022]
* Chain-of-Factors Paper-Reviewer Matching [arXiv 2023]
* ARIES: A Corpus of Scientific Paper Edits Made in Response to Peer Reviews [ACL 2024]
Student
4/17 Scientific Agents * Autonomous Chemical Research with Large Language Models [Nature 2023]
* Augmenting Large Language Models with Chemistry Tools [Nature Machine Intelligence 2024]
* Monte Carlo Thought Search: Large Language Model Querying for Complex Scientific Reasoning in Catalyst Design [EMNLP 2023]
Student
W15 4/22 Final Project Presentations Students
4/24 Final Project Presentations Students
W16 5/4 Final Report Due (Sunday)