CSCE 689 - Special Topics in NLP for Science (Spring 2025)

Course Information

Instructor: Yu Zhang (yuzhang [AT] tamu [DOT] edu)
Lectures:
- Time: Tuesdays and Thursdays 3:55pm – 5:10pm
- Location: HRBB 126
Office Hours:
- Time: Thursdays 2pm – 3pm
- Location: PETR 222 (to join via Zoom instead, email me at least 1 day in advance: https://tamu.zoom.us/j/6411788612)
Syllabus: PDF
Link to Submit Pre-Lecture Questions: https://docs.google.com/forms/d/e/1FAIpQLSdKAGdPP41dsKXylloWJCCFXWaNqobX-u4DL7b5IIw2Yy2OBw/viewform?usp=dialog

Grading

Participation: 10%
- Attendance: 8%
- Pre-Lecture Questions: 2% [due 1 day before the lecture]
Literature Review: 10% [due 3/7]
Paper Presentation: 20%
- Slides: 5% [due 2 days before the lecture]
- Completeness, Clarity, and Q&A: 15%
Project: 60%
- Project Proposal: 5% [due 2/23]
- Midterm Spotlight Presentation: 5%
- Midterm Report: 10% [due 3/23]
- Final Project Presentation: 15%
- Final Report: 25% [due 5/4]

Schedule (Subject to Change)

Week	Date	Topic	Papers	Slides	Presenter
W1	1/14	Course Overview	-	PDF	Instructor
	1/16	Scientific LLMs: Encoder-Only & Encoder-Decoder	* SciBERT: A Pretrained Language Model for Scientific Text [EMNLP 2019] * BioBERT: A Pre-trained Biomedical Language Representation Model for Biomedical Text Mining [Bioinformatics 2020] * ELECTRAMed: A New Pre-trained Language Representation Model for Biomedical NLP [arXiv 2021] * SciFive: A Text-to-Text Transformer Model for Biomedical Literature [arXiv 2021]	PDF	Instructor
W2	1/21	Campus-Wide Class Cancellation
	1/23	Scientific LLMs: Decoder-Only	* Solving Quantitative Reasoning Problems with Language Models [NeurIPS 2022] * SciInstruct: A Self-Reflective Instruction Annotated Dataset for Training Scientific Language Models [NeurIPS 2024] * BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains [ACL 2024] * OceanGPT: A Large Language Model for Ocean Science Tasks [ACL 2024]	PDF	Instructor
W3	1/28	Citation Prediction	* SPECTER: Document-Level Representation Learning using Citation-Informed Transformers [ACL 2020] * Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings [EMNLP 2022] * Explaining Relationships between Scientific Documents [ACL 2021] * SciRepEval: A Multi-Format Benchmark for Scientific Document Representations [EMNLP 2023]	PDF	Instructor
	1/30	Scientific Question Answering	* PubMedQA: A Dataset for Biomedical Research Question Answering [EMNLP 2019] * Better to Ask in English: Cross-Lingual Evaluation of Large Language Models for Healthcare Queries [WWW 2024] * MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models [ICLR 2024]	PDF	Yichen
W4	2/4	Scientific Knowledge Extraction	* AIONER: All-in-One Scheme-Based Biomedical Named Entity Recognition using Deep Learning [Bioinformatics 2023] * SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [EMNLP 2024] * ReactIE: Enhancing Chemical Reaction Extraction with Weak Supervision [ACL 2023] * ActionIE: Action Extraction from Scientific Literature with Programming Languages [ACL 2024]	PDF	Instructor
	2/6	Scientific Literature Retrieval	* MedCPT: Contrastive Pre-trained Transformers with Large-scale PubMed Search Logs for Zero-shot Biomedical Information Retrieval [Bioinformatics 2023] * BMRetriever: Tuning Large Language Models as Better Biomedical Text Retrievers [EMNLP 2024] * Fact or Fiction: Verifying Scientific Claims [EMNLP 2020] * Pre-training Multi-task Contrastive Learning Models for Scientific Literature Understanding [EMNLP 2023]	PDF	Instructor
W5	2/11	Scientific VLMs: Bioimaging	* MedCLIP: Contrastive Learning from Unpaired Medical Images and Text [EMNLP 2022] * A Visual–Language Foundation Model for Pathology Image Analysis using Medical Twitter [Nature Medicine 2023] * LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day [NeurIPS 2023] * A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks [Nature Medicine 2024]	PDF	Instructor
	2/13	Scientific VLMs: Geometry	* UniMath: A Foundational and Multimodal Mathematical Reasoner [EMNLP 2023] * G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model [ICLR 2025] * Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models [EMNLP 2024]	PDF	Shuo
W6	2/18	[Guest Lecture] Hanwen Xu (University of Washington): Towards Patient Level Representations for Better Clinical Outcome * Suggested Reading: A Whole-Slide Foundation Model for Digital Pathology from Real-World Data [Nature 2024]		N/A	Guest Lecturer
	2/20	Scientific VLMs: Miscellaneous	* UrbanCLIP: Learning Text-Enhanced Urban Region Profiling with Contrastive Language-Image Pretraining from the Web [WWW 2024] * BioCLIP: A Vision Foundation Model for the Tree of Life [CVPR 2024] * MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI [CVPR 2024]	PDF	Hasnat
	2/23	Project Proposal Due (Sunday)
W7	2/25	Protein Language Models	* Evolutionary-Scale Prediction of Atomic-Level Protein Structure with a Language Model [Science 2023] * Large Language Models Generate Functional Protein Sequences across Diverse Families [Nature Biotechnology 2023] * ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts [ICML 2023] * BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations [EMNLP 2023]	PDF	Instructor
	2/27	DNA/RNA/Single-Cell Language Models	* DNABERT: Pre-trained Bidirectional Encoder Representations from Transformers Model for DNA-Language in Genome [Bioinformatics 2021] * A 5' UTR Language Model for Decoding Untranslated Regions of mRNA and Function Predictions [Nature Machine Intelligence 2024] * scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics using Generative AI [Nature Methods 2024]	PDF	Omnia
W8	3/4	Molecule Language Models	* Text2Mol: Cross-Modal Molecule Retrieval with Natural Language Queries [EMNLP 2021] * Translation between Molecules and Natural Language [EMNLP 2022] * LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset [COLM 2024] * Fine-Tuned Language Models Generate Stable Inorganic Materials as Text [ICLR 2024]	PDF	Instructor
	3/6	Urban Language Models	* SpaBERT: A Pretrained Language Model from Geographic Data for Geo-Entity Representation [EMNLP 2022] * GeoLM: Empowering Language Models for Geospatially Grounded Language Understanding [EMNLP 2023] * UrbanGPT: Spatio-Temporal Large Language Models [KDD 2024]	PDF	Shaohuai
	3/7	Literature Review Due (Friday)
W9	3/11	Spring Break (No Class)
	3/13	Spring Break (No Class)
W10	3/18	[Guest Lecture] Bowen Jin (University of Illinois Urbana-Champaign): Large Language Models on Scientific Text-Attributed Graphs * Suggested Reading: Graph Chain-of-Thought: Augmenting Large Language Models by Reasoning on Graphs [ACL 2024]		PDF	Guest Lecturer
	3/20	Language Models with Academic Graphs	* OAG-BERT: Towards a Unified Backbone Language Model for Academic Knowledge Services [KDD 2022] * LinkBERT: Pretraining Language Models with Document Links [ACL 2022] * Metadata-Induced Contrastive Learning for Zero-Shot Multi-Label Text Classification [WWW 2022] * Investigating Instruction Tuning Large Language Models on Graphs [COLM 2024]	PDF	Instructor
W11	3/25	Midterm Project Presentations		N/A	Students
	3/27	Table Language Models	* TaBERT: Learning Contextual Representations for Natural Language Utterances and Structured Tables [ACL 2020] * TableLlama: Towards Open Large Generalist Models for Tables [NAACL 2024] * UniHGKR: Unified Instruction-aware Heterogeneous Knowledge Retrievers [NAACL 2025] * Accurate Predictions on Small Data with a Tabular Foundation Model [Nature 2025]	PDF	Instructor
	3/30	Midterm Report Due (Sunday)
W12	4/1	LLMs for Research: Idea Generation	* ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models [NAACL 2025] * Nova: An Iterative Planning and Search Approach to Enhance Novelty and Diversity of LLM Generated Ideas [arXiv 2024] * Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers [ICLR 2025]	PDF	Hangxiao
	4/3	LLMs for Research: Content Generation	* Mapping the Increasing Use of LLMs in Scientific Papers [COLM 2024] * Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews [ICML 2024] * Let's Get to the Point: LLM-Supported Planning, Drafting, and Revising of Research-Paper Blog Posts [arXiv 2023]	PDF	Ethan
W13	4/8	[Guest Lecture] Qingyun Wang (University of Illinois Urbana-Champaign): AI4Scientist: Accelerating and Democratizing Scientific Research Lifecycle * Suggested Reading: SciMON: Scientific Inspiration Machines Optimized for Novelty [ACL 2024]		N/A	Guest Lecturer
	4/10	LLMs for Research: Reviewing	* Can Large Language Models Provide Useful Feedback on Research Papers? A Large-Scale Empirical Analysis [NEJM AI 2024] * LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing [EMNLP 2024] * AgentReview: Exploring Peer Review Dynamics with LLM Agents [EMNLP 2024]	PDF	Michael
W14	4/15	LLMs for Research: Miscellaneous	* A Search Engine for Discovery of Scientific Challenges and Directions [AAAI 2022] * Chain-of-Factors Paper-Reviewer Matching [WWW 2025] * ARIES: A Corpus of Scientific Paper Edits Made in Response to Peer Reviews [ACL 2024]	PDF	Instructor
	4/17	Scientific Agents	* Autonomous Chemical Research with Large Language Models [Nature 2023] * Augmenting Large Language Models with Chemistry Tools [Nature Machine Intelligence 2024] * Monte Carlo Thought Search: Large Language Model Querying for Complex Scientific Reasoning in Catalyst Design [EMNLP 2023]	PDF	Rithik
W15	4/22	Final Project Presentations		N/A	Students
	4/24	Final Project Presentations		N/A	Students
W16	5/4	Final Report Due (Sunday)