CSCE 689 - Special Topics in NLP for Science (Spring 2025)
Course Information
Instructor:
Yu Zhang
(yuzhang [AT] tamu [DOT] edu)
Lectures:
Time:
Tuesdays and Thursdays 3:55pm – 5:10pm
Location:
HRBB 126
Office Hours:
Time:
Thursdays 2pm – 3pm
Location:
PETR 222 (or email me at least 1 day in advance if you would like to join via Zoom: Link)
Syllabus:
Link
Grading
Participation:
10%
Attendance:
8%
Pre-Lecture Questions:
2%
[due 1 day before the lecture]
Literature Review:
10%
[due 3/7]
Paper Presentation:
20%
Slides:
5%
[due 2 days before the lecture]
Completeness, Clarity, and Q&A:
15%
Project:
60%
Project Proposal:
5%
[due 2/16]
Midterm Spotlight Presentation:
5%
Midterm Report:
10%
[due 3/23]
Final Project Presentation:
15%
Final Report:
25%
[due 5/4]
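As a rough illustration of how the components above combine, the sketch below checks that the listed weights add up to 100% and computes an overall percentage from per-component scores. The weights are copied from the list above; the function name `weighted_total`, the dictionary keys, and the 0 to 100 score scale are assumptions made for this example and are not an official grading tool.

```python
# Weights (in percent) copied from the Grading list above.
# Key names and the 0-100 score scale are illustrative assumptions.
WEIGHTS = {
    "attendance": 8,
    "pre_lecture_questions": 2,
    "literature_review": 10,
    "presentation_slides": 5,
    "presentation_completeness_clarity_qa": 15,
    "project_proposal": 5,
    "midterm_spotlight_presentation": 5,
    "midterm_report": 10,
    "final_project_presentation": 15,
    "final_report": 25,
}

# Sanity check: the sub-items should account for the full grade.
assert sum(WEIGHTS.values()) == 100


def weighted_total(scores):
    """Return the overall percentage given per-component scores in [0, 100]."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


if __name__ == "__main__":
    # Hypothetical student: full marks everywhere except the final report.
    example = {name: 100 for name in WEIGHTS}
    example["final_report"] = 80
    print(weighted_total(example))
```

For instance, losing 20 points on the final report (weighted 25%) lowers the overall percentage by 5 points.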
Schedule (Subject to Change)
Week
Date
Topic
Papers
Slides
Presenter
W1
1/14
Course Overview
-
Link
Instructor
1/16
Scientific LLMs: Encoder-Only & Encoder-Decoder
*
SciBERT: A Pretrained Language Model for Scientific Text
[EMNLP 2019]
*
BioBERT: A Pre-trained Biomedical Language Representation Model for Biomedical Text Mining
[Bioinformatics 2020]
*
ELECTRAMed: A New Pre-trained Language Representation Model for Biomedical NLP
[arXiv 2021]
*
SciFive: A Text-to-Text Transformer Model for Biomedical Literature
[arXiv 2021]
Link
Instructor
W2
1/21
Scientific LLMs: Decoder-Only
*
Solving Quantitative Reasoning Problems with Language Models
[NeurIPS 2022]
*
SciInstruct: A Self-Reflective Instruction Annotated Dataset for Training Scientific Language Models
[NeurIPS 2024]
*
BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains
[ACL 2024]
*
OceanGPT: A Large Language Model for Ocean Science Tasks
[ACL 2024]
Instructor
1/23
Citation Prediction
*
SPECTER: Document-Level Representation Learning using Citation-Informed Transformers
[ACL 2020]
*
Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings
[EMNLP 2022]
*
Explaining Relationships between Scientific Documents
[ACL 2021]
*
SciRepEval: A Multi-Format Benchmark for Scientific Document Representations
[EMNLP 2023]
Instructor
W3
1/28
Scientific Literature Retrieval
*
MedCPT: Contrastive Pre-trained Transformers with Large-scale PubMed Search Logs for Zero-shot Biomedical Information Retrieval
[Bioinformatics 2023]
*
BMRetriever: Tuning Large Language Models as Better Biomedical Text Retrievers
[EMNLP 2024]
*
Fact or Fiction: Verifying Scientific Claims
[EMNLP 2020]
*
Pre-training Multi-task Contrastive Learning Models for Scientific Literature Understanding
[EMNLP 2023]
Instructor
1/30
Scientific Question Answering
*
PubMedQA: A Dataset for Biomedical Research Question Answering
[EMNLP 2019]
*
Better to Ask in English: Cross-Lingual Evaluation of Large Language Models for Healthcare Queries
[WWW 2024]
*
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
[ICLR 2024]
Student
W4
2/4
Scientific Knowledge Extraction
*
AIONER: All-in-One Scheme-Based Biomedical Named Entity Recognition using Deep Learning
[Bioinformatics 2023]
*
SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents
[EMNLP 2024]
*
ReactIE: Enhancing Chemical Reaction Extraction with Weak Supervision
[ACL 2023]
*
ActionIE: Action Extraction from Scientific Literature with Programming Languages
[ACL 2024]
Instructor
2/6
Paper Classification
*
The Effect of Metadata on Scientific Literature Tagging: A Cross-Field Cross-Model Study
[WWW 2023]
*
Hierarchical Multi-Label Classification of Scientific Documents
[EMNLP 2022]
*
BERTMeSH: Deep Contextual Representation Learning for Large-Scale High-Performance MeSH Indexing with Full Text
[Bioinformatics 2020]
Student
W5
2/11
Scientific VLMs: Bioimaging
*
MedCLIP: Contrastive Learning from Unpaired Medical Images and Text
[EMNLP 2022]
*
A Visual–Language Foundation Model for Pathology Image Analysis using Medical Twitter
[Nature Medicine 2023]
*
LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day
[NeurIPS 2023]
*
A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks
[Nature Medicine 2024]
Instructor
2/13
Scientific VLMs: Geometry
*
UniMath: A Foundational and Multimodal Mathematical Reasoner
[EMNLP 2023]
*
G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model
[arXiv 2023]
*
Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models
[EMNLP 2024]
Student
2/16
Project Proposal Due (Sunday)
W6
2/18
[Guest Lecture] Hanwen Xu (University of Washington): Towards Patient Level Representations for Better Clinical Outcome
* Suggested Reading:
A Whole-Slide Foundation Model for Digital Pathology from Real-World Data
[Nature 2024]
Guest Lecturer
2/20
Scientific VLMs: Miscellaneous
*
UrbanCLIP: Learning Text-Enhanced Urban Region Profiling with Contrastive Language-Image Pretraining from the Web
[WWW 2024]
*
BioCLIP: A Vision Foundation Model for the Tree of Life
[CVPR 2024]
*
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
[CVPR 2024]
Student
W7
2/25
Protein Language Models
*
Evolutionary-Scale Prediction of Atomic-Level Protein Structure with a Language Model
[Science 2023]
*
Large Language Models Generate Functional Protein Sequences across Diverse Families
[Nature Biotechnology 2023]
*
ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts
[ICML 2023]
*
BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations
[EMNLP 2023]
Instructor
2/27
DNA/RNA/Single-Cell Language Models
*
DNABERT: Pre-trained Bidirectional Encoder Representations from Transformers Model for DNA-Language in Genome
[Bioinformatics 2021]
*
A 5' UTR Language Model for Decoding Untranslated Regions of mRNA and Function Predictions
[Nature Machine Intelligence 2024]
*
scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics using Generative AI
[Nature Methods 2024]
Student
W8
3/4
Molecule Language Models
*
Text2Mol: Cross-Modal Molecule Retrieval with Natural Language Queries
[EMNLP 2021]
*
Translation between Molecules and Natural Language
[EMNLP 2022]
*
LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset
[COLM 2024]
*
Fine-Tuned Language Models Generate Stable Inorganic Materials as Text
[ICLR 2024]
Instructor
3/6
Urban Language Models
*
SpaBERT: A Pretrained Language Model from Geographic Data for Geo-Entity Representation
[EMNLP 2022]
*
GeoLM: Empowering Language Models for Geospatially Grounded Language Understanding
[EMNLP 2023]
*
UrbanGPT: Spatio-Temporal Large Language Models
[KDD 2024]
Student
3/7
Literature Review Due (Friday)
W9
3/11
Spring Break (No Class)
3/13
Spring Break (No Class)
W10
3/18
[Guest Lecture] Bowen Jin (University of Illinois Urbana-Champaign): Large Language Models on Scientific Text-Attributed Graphs
* Suggested Reading:
Graph Chain-of-Thought: Augmenting Large Language Models by Reasoning on Graphs
[ACL 2024]
Guest Lecturer
3/20
Midterm Project Presentations
Students
3/23
Midterm Report Due (Sunday)
W11
3/25
Language Models with Academic Graphs
*
OAG-BERT: Towards a Unified Backbone Language Model for Academic Knowledge Services
[KDD 2022]
*
LinkBERT: Pretraining Language Models with Document Links
[ACL 2022]
*
Metadata-Induced Contrastive Learning for Zero-Shot Multi-Label Text Classification
[WWW 2022]
*
Investigating Instruction Tuning Large Language Models on Graphs
[COLM 2024]
Instructor
3/27
Table Language Models
*
TaBERT: Learning Contextual Representations for Natural Language Utterances and Structured Tables
[ACL 2020]
*
TableLlama: Towards Open Large Generalist Models for Tables
[NAACL 2024]
*
UniHGKR: Unified Instruction-aware Heterogeneous Knowledge Retrievers
[arXiv 2024]
Student
W12
4/1
LLMs for Research: Idea Generation
*
ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models
[arXiv 2024]
*
Nova: An Iterative Planning and Search Approach to Enhance Novelty and Diversity of LLM Generated Ideas
[arXiv 2024]
*
Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers
[arXiv 2024]
Student
4/3
LLMs for Research: Content Generation
*
Mapping the Increasing Use of LLMs in Scientific Papers
[COLM 2024]
*
Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews
[ICML 2024]
*
Let's Get to the Point: LLM-Supported Planning, Drafting, and Revising of Research-Paper Blog Posts
[arXiv 2023]
Student
W13
4/8
[Guest Lecture] Qingyun Wang (University of Illinois Urbana-Champaign): AI4Scientist: Accelerating and Democratizing Scientific Research Lifecycle
* Suggested Reading:
SciMON: Scientific Inspiration Machines Optimized for Novelty
[ACL 2024]
Guest Lecturer
4/10
LLMs for Research: Reviewing
*
Can Large Language Models Provide Useful Feedback on Research Papers? A Large-Scale Empirical Analysis
[NEJM AI 2024]
*
LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing
[EMNLP 2024]
*
AgentReview: Exploring Peer Review Dynamics with LLM Agents
[EMNLP 2024]
Student
W14
4/15
LLMs for Research: Miscellaneous
*
A Search Engine for Discovery of Scientific Challenges and Directions
[AAAI 2022]
*
Chain-of-Factors Paper-Reviewer Matching
[arXiv 2023]
*
ARIES: A Corpus of Scientific Paper Edits Made in Response to Peer Reviews
[ACL 2024]
Student
4/17
Scientific Agents
*
Autonomous Chemical Research with Large Language Models
[Nature 2023]
*
Augmenting Large Language Models with Chemistry Tools
[Nature Machine Intelligence 2024]
*
Monte Carlo Thought Search: Large Language Model Querying for Complex Scientific Reasoning in Catalyst Design
[EMNLP 2023]
Student
W15
4/22
Final Project Presentations
Students
4/24
Final Project Presentations
Students
W16
5/4
Final Report Due (Sunday)