PhD machine learning researcher specializing in multimodal representation learning.
Published in Nature Machine Intelligence and Cell Press journals. Experienced in developing novel algorithms and translating them into scalable, production-ready ML pipelines.
Core Expertise
Algorithms and Research Areas
Single- and multi-agent reinforcement learning ∙ Distributed training ∙ Graph learning ∙ Large language models ∙ Multimodal models ∙ Representation learning
Main Languages and Tools
PyTorch ∙ Ray ∙ NCCL ∙ Docker ∙ LaTeX ∙ Illustrator
Techniques
Adversarial self-play ∙ Big data ∙ Generalizability ∙ Interpretability ∙ Memory optimization ∙ Quantization ∙ Reward modeling ∙ Spatiotemporal modeling ∙ Transformers
Leadership
Cross-functional collaboration ∙ End-to-end ownership ∙ Executive communication ∙ Research strategy ∙ Research-to-production ∙ Researcher supervision
Education
Doctor of Philosophy - Computer Science, Minor in Mathematics
University of Wisconsin-Madison (2020 - 2025)
Biologically-aware multimodal representation learning deciphers single-cell functions and dynamics
Bachelor of Science - Mathematics and Computer Science
DePaul University (2017 - 2020)
Work Experience
-
AI-powered platform built on behavioral science to engage and support women throughout various life stages
Oversee the production of a large language model application that provides real-time emotional support, integrating supervised fine-tuning and tone-matching to align model behavior with domain-specific requirements.
Design reward modeling and reinforcement learning pipelines that enforce safety constraints and align generated content and tone with curated samples (a schematic sketch follows this role's bullets).
Supervise application development and strategy, including data collection, fine-tuning, and frontend integration.
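For illustration, a minimal pairwise (Bradley-Terry) reward-modeling head in PyTorch, the standard building block for the reward-modeling step referenced above. RewardHead, pairwise_loss, and all shapes are hypothetical sketches, not the production pipeline.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardHead(nn.Module):
    # Maps a language model's pooled hidden state to a scalar reward.
    def __init__(self, hidden_size):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, pooled):
        return self.score(pooled).squeeze(-1)  # (batch,) scalar rewards

def pairwise_loss(r_chosen, r_rejected):
    # Bradley-Terry objective: preferred responses should score higher.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

head = RewardHead(hidden_size=768)   # hidden size is illustrative
chosen = torch.randn(4, 768)         # pooled states of preferred replies
rejected = torch.randn(4, 768)       # pooled states of rejected replies
loss = pairwise_loss(head(chosen), head(rejected))
loss.backward()

The trained scalar reward can then drive an RL fine-tuning loop and double as a safety filter at inference time.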
-
Laboratory focused on developing machine learning and artificial intelligence approaches and bioinformatics tools to bridge computation and biology for mechanistic insights into complex brains and brain diseases
Designed and implemented a multi-agent reinforcement learning framework to infer virtual cell environments. Agents learn to coordinate through distributed training in large biological state spaces; the framework's novel approaches and PPO optimizations generalize beyond biology.
Developed a distributed training architecture for reinforcement learning applications using PyTorch and Ray for multi-node, multi-GPU configurations, with a custom NCCL communication backend (a minimal sketch follows this role's bullets).
Conceptualized, produced, and analyzed novel variational autoencoder and optimal transport techniques for representation learning across multiple modalities, with a focus on interpretability.
Led teams of up to 5 researchers, resulting in 8+ publications. Guided collaborators in reinforcement learning, attention, and graph modeling theory and analysis.
Prepared and delivered lesson plans on transformer architectures and related applications, including encoder- and decoder-only large language models.
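A minimal sketch of the multi-node, multi-GPU pattern referenced above, using stock Ray Train and PyTorch DDP. It relies on Ray's standard NCCL-backed setup and does not reproduce the custom communication backend; the dummy model and objective are illustrative.

import torch
import torch.nn as nn
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, prepare_model

def train_loop_per_worker(config):
    # prepare_model moves the model to this worker's GPU and wraps it in DDP.
    model = prepare_model(nn.Linear(config["in_dim"], 1))
    opt = torch.optim.Adam(model.parameters(), lr=config["lr"])
    device = next(model.parameters()).device
    for _ in range(config["steps"]):
        x = torch.randn(32, config["in_dim"], device=device)
        loss = model(x).pow(2).mean()  # dummy objective for the sketch
        opt.zero_grad(); loss.backward(); opt.step()

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"in_dim": 16, "lr": 1e-3, "steps": 10},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
# trainer.fit()  # launches NCCL-backed DDP workers across the Ray cluster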
-
Independent research unit of The Mom Project, providing behavioral research and predictive analytics to help improve employee workplace experiences
Researched and prototyped autoencoder-based anonymization methodologies for privacy-focused representation learning.
Delivered technical briefings on emerging machine learning methodologies for applied research, including generative modeling and representation learning.
Built client-facing interactive dashboards linking key performance indicators to quantified public perception.
-
Digital talent platform designed to connect women to flexible, family-friendly work opportunities and inclusive employers
Conceptualized, developed, and maintained the talent-matching algorithm from scratch, supporting the company through its Series A ($8M funding) and until 2022.
Developed shared data analysis libraries used across data science and product engineering teams.
Created statistically grounded client pricing structures and contract evaluation models based on company size, proposals, and previous contracts, structured to reward large-scale partnerships.
-
Departments of Computer Science and Mathematics
Prepared and presented class curricula for discrete mathematics and problem solving using computers.
Graded logic-based exams and assignments consistently, following up with students and providing detailed feedback for continual improvement on course principles.
Publications
-
Noah Cohen Kalafut, Chenfeng He, Jie Sheng, Pramod Bharadwaj Chandrashekar, Jerome Choi, Daifeng Wang (Full Text)
Single cells interact continuously to form a cell environment that drives key biological processes. Cells and cell environments are highly dynamic across time and space, fundamentally governed by molecular mechanisms such as gene expression. Recent sequencing techniques measure single-cell-level gene expression under specific conditions, either temporally or spatially. Using these datasets, emerging works, such as virtual cells, can learn biologically useful representations of individual cells. However, these representations are typically static and overlook the underlying cell environment and its dynamics. To address this, we developed CellTRIP, a multi-agent reinforcement learning method that infers a virtual cell environment to simulate the cell dynamics and interactions underlying given single-cell data. Specifically, cells are modeled as individual agents with dynamic interactions, which can be learned through self-attention mechanisms via reinforcement learning. CellTRIP also applies novel truncated reward bootstrapping and adaptive input rescaling to stabilize training. We can manipulate any combination of cells and genes in silico in our learned virtual cell environment, predict spatial and/or temporal cell changes, and prioritize corresponding genes at the single-cell level. We applied and benchmarked CellTRIP on various simulated and real gene expression datasets, including recapitulating cellular dynamic processes simulated by gene regulatory networks and stochastic models, imputing the spatial organization of mouse cortical cells, predicting developmental gene expression changes after drug treatment in cancer cells, and spatiotemporally reconstructing Drosophila embryonic development, demonstrating its superior performance and broad applicability. Interactive manipulation of these virtual cell environments, including in silico perturbation, can prioritize spatial and developmental genes for single-cell-level changes, enabling new insights into cell dynamics over time and space. CellTRIP is open source as a general tool and available at github.com/daifengwanglab/CellTRIP.
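One standard reading of the truncated reward bootstrapping named above, sketched as an illustration rather than the paper's exact implementation: when a rollout is cut by a time limit rather than a true terminal state, bootstrap the return from the critic's value estimate instead of zeroing it.

import torch

def returns_with_truncation_bootstrap(rewards, values, terminated, truncated, gamma=0.99):
    # rewards: (T,); values: (T+1,) critic estimates for states s_0..s_T;
    # terminated/truncated: (T,) bool flags marking how step t ended.
    T = rewards.shape[0]
    returns = torch.zeros_like(rewards)
    next_return = values[T]                 # bootstrap past the rollout end
    for t in reversed(range(T)):
        if terminated[t]:
            next_return = torch.zeros(())   # true terminal: no future value
        elif truncated[t]:
            next_return = values[t + 1]     # time-limit cut: bootstrap, don't zero
        returns[t] = rewards[t] + gamma * next_return
        next_return = returns[t]
    return returns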
-
*Chirag Gupta, *Noah Cohen Kalafut, *Declan Clarke, Jerome J Choi, Kalpana Hanthanan Arachchilage, Saniya Khullar, Yan Xia, Xiao Zhou, Cagatay Dursun, Mark Gerstein, Daifeng Wang (Full Text)
Neuropsychiatric disorders lack effective treatments due to a limited understanding of underlying cellular and molecular mechanisms. To address this, we integrated population-scale single-cell genomics data and analyzed cell-type-level gene regulatory networks across schizophrenia, bipolar disorder, and autism (23 cell classes/subclasses). Our analysis revealed potential druggable transcription factors co-regulating known risk genes that converge into cell-type-specific co-regulated modules. We applied graph neural networks on those modules to prioritize novel risk genes and leveraged them in a network-based drug repurposing framework to identify 220 drug molecules with the potential for targeting specific cell types. We found evidence for 37 of these drugs in reversing disorder-associated transcriptional phenotypes. Additionally, we discovered 335 drug-associated cell-type eQTLs, revealing genetic variation's influence on drug target expression at the cell-type level. Our results constitute a single-cell network medicine resource that provides mechanistic insights for advancing treatment options for neuropsychiatric disorders.
-
Jerome J. Choi, Noah Cohen Kalafut, Tim Gruenloh, Corinne D. Engelman, Tianyuan Lu, Daifeng Wang (Full Text)
Single-omics approaches often provide a limited perspective on complex biological systems, whereas multi-omics integration enables a more comprehensive understanding by combining diverse data views. However, integrating heterogeneous data types and interpreting complex relationships between biological features, both within and across views, remains a major challenge. Here, to address these challenges, we introduce COSIME (Cooperative Multi-view Integration with a Scalable and Interpretable Model Explainer). COSIME applies the backpropagation of a learnable optimal transport algorithm to deep neural networks, thus enabling the learning of latent features from several views to predict disease phenotypes. It also incorporates Monte Carlo sampling to enable interpretable assessments of both feature importance and pairwise feature interactions, within and across views. We applied COSIME to both simulated and real-world datasets, including single-cell transcriptomics, spatial transcriptomics, epigenomics, and metabolomics, to predict Alzheimer's disease-related phenotypes. Benchmarking against existing methods demonstrated that COSIME improves prediction accuracy and provides interpretability. For example, it reveals that synergistic interactions between astrocyte and microglia genes associated with Alzheimer's disease are more likely to localize at the edges of the middle temporal gyrus. Finally, COSIME is also publicly available as an open-source tool.
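A textbook sketch of the kind of differentiable (entropic) optimal transport layer that can be backpropagated through, as the abstract describes; the log-domain Sinkhorn form below is generic, not COSIME's exact algorithm, and all names are illustrative.

import math
import torch

def sinkhorn_plan(cost, eps=0.1, iters=100):
    # Entropic optimal-transport plan between uniform marginals, computed
    # in the log domain; every step is differentiable w.r.t. `cost`.
    n, m = cost.shape
    log_a = torch.full((n,), -math.log(n), dtype=cost.dtype)
    log_b = torch.full((m,), -math.log(m), dtype=cost.dtype)
    log_K = -cost / eps
    log_u = torch.zeros(n, dtype=cost.dtype)
    log_v = torch.zeros(m, dtype=cost.dtype)
    for _ in range(iters):  # alternating marginal-matching updates
        log_u = log_a - torch.logsumexp(log_K + log_v[None, :], dim=1)
        log_v = log_b - torch.logsumexp(log_K.T + log_u[None, :], dim=1)
    return torch.exp(log_K + log_u[:, None] + log_v[None, :])

# The OT cost <P, C> can enter a training loss, letting gradients flow
# back into the encoders that produced the cost matrix:
x = torch.randn(8, 4, requires_grad=True)
y = torch.randn(10, 4)
C = torch.cdist(x, y) ** 2
loss = (sinkhorn_plan(C) * C).sum()
loss.backward()  # gradients reach x through the Sinkhorn iterations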
-
Robert Hermod Olson, Noah Cohen Kalafut, Daifeng Wang (Full Text)
The bigger picture
Recently, it has become possible to obtain multiple types of data (modalities) from individual neurons, such as how genes are used (gene expression), how a neuron responds to electrical signals (electrophysiology), and what it looks like (morphology). These datasets can be used to group similar neurons together and learn their functions, but the complexity of the data can make this process difficult for researchers without sufficient computational skills. Various methods have been developed specifically for combining these modalities, and open-source software tools can alleviate the computational burden on biologists analyzing new data. Open-source tools performing modality combination (integration), clustering, and visualization have the potential to streamline the research process. It is our hope that intuitive and freely available software will advance neuroscience research by making advanced computational methods and visualizations more accessible.
Highlights
• MANGEM enables single-cell multimodal learning and visualization in a cloud-based app
• Application to Patch-seq data identifies multimodal functions of neuronal cells
• Visualizations reveal cross-modal relationships of neurons
• Supports asynchronous learning and background job running for large-scale data analyses
Summary
Single-cell techniques like Patch-seq have enabled the acquisition of multimodal data from individual neuronal cells, offering systematic insights into neuronal functions. However, these data can be heterogeneous and noisy. To address this, machine learning methods have been used to align cells from different modalities onto a low-dimensional latent space, revealing multimodal cell clusters. Using these methods can be challenging without computational expertise or infrastructure suited to their computational expense. To address this, we developed a cloud-based web application, MANGEM (multimodal analysis of neuronal gene expression, electrophysiology, and morphology). MANGEM provides a step-by-step, accessible, and user-friendly interface to machine learning alignment methods for neuronal multimodal data. It can run asynchronously for large-scale data alignment, provide users with various downstream analyses of aligned cells, and visualize the analytic results. We demonstrated the usage of MANGEM by aligning multimodal data of neuronal cells in the mouse visual cortex.
-
Noah Cohen Kalafut, Xiang Huang, Daifeng Wang (Full Text)
Single-cell multimodal datasets have measured various characteristics of individual cells, enabling a deep understanding of cellular and molecular mechanisms. However, multimodal data generation remains costly and challenging, and modalities are frequently missing. Recently, machine learning approaches have been developed for data imputation but typically require fully matched multimodalities to learn common latent embeddings that potentially lack modality specificity. To address these issues, we developed an open-source machine learning model, Joint Variational Autoencoders for multimodal Imputation and Embedding (JAMIE). JAMIE takes single-cell multimodal data that can have partially matched samples across modalities. Variational autoencoders learn the latent embeddings of each modality. Then, embeddings from matched samples across modalities are aggregated to identify joint cross-modal latent embeddings before reconstruction. To perform cross-modal imputation, the latent embeddings of one modality can be used with the decoder of the other modality. For interpretability, Shapley values are used to prioritize input features for cross-modal imputation and known sample labels. We applied JAMIE to both simulated data and emerging single-cell multimodal data, including gene expression, chromatin accessibility, and electrophysiology in human and mouse brains. JAMIE significantly outperforms existing state-of-the-art methods and prioritizes multimodal features for imputation, providing potentially novel mechanistic insights at cellular resolution.
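A schematic of the decoder-swap imputation the abstract describes: encode with one modality's encoder, decode with the other modality's decoder. Layer sizes, dimensions, and the aggregation comment are illustrative, not JAMIE's exact architecture.

import torch
import torch.nn as nn

class ModalityVAE(nn.Module):
    def __init__(self, dim, latent=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 2 * latent))
        self.dec = nn.Sequential(nn.Linear(latent, 128), nn.ReLU(), nn.Linear(128, dim))

    def encode(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize

vae_rna = ModalityVAE(dim=2000)   # e.g., gene expression
vae_ephys = ModalityVAE(dim=50)   # e.g., electrophysiology features

def impute_ephys_from_rna(x_rna):
    z = vae_rna.encode(x_rna)     # latent from the observed modality
    return vae_ephys.dec(z)       # decode with the other modality's decoder

# For matched samples, JAMIE-style training would additionally aggregate the
# two modalities' latents into a joint embedding before reconstruction.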
-
Chenfeng He, Noah Cohen Kalafut, Soraya O. Sandoval, Ryan Risgaard, Carissa L. Sirois, Chen Yang, Saniya Khullar, Marin Suzuki, Xiang Huang, Qiang Chang, Xinyu Zhao, Andre M.M. Sousa, Daifeng Wang (Full Text)
Motivation
Organoids have become valuable models for understanding cellular and molecular mechanisms in human development, including brain development. However, whether developmental gene expression programs are preserved between human organoids and brains, especially in specific cell types, remains unclear. Importantly, there is a lack of effective computational approaches for comparative data analyses between organoids and developing human brains. To address this, we developed a machine-learning framework for comparative gene expression analysis of brains and organoids to identify conserved and specific developmental trajectories, as well as developmentally expressed genes and functions, especially at cellular resolution.
Highlights
• Manifold alignment for comparing gene expression of organoids with developing brains
• Global alignment by given development time and local refinement by common manifolds
• Developmental similarity of brain regions and organoids in human and non-human primates
• Conserved and specific cell trajectories and genes across brains and organoids
Summary
Our machine-learning framework, brain and organoid manifold alignment (BOMA), first performs a global alignment of developmental gene expression data between brains and organoids. It then applies manifold learning to locally refine the alignment, revealing conserved and specific developmental trajectories across brains and organoids. Using BOMA, we found that human cortical organoids align better with certain brain cortical regions than with other non-cortical regions, implying organoid-preserved developmental gene expression programs specific to brain regions. Additionally, our alignment of non-human primate and human brains reveals highly conserved gene expression around birth. Also, we integrated and analyzed developmental single-cell RNA sequencing (scRNA-seq) data of human brains and organoids, showing conserved and specific cell trajectories and clusters. Further identification of genes expressed in such clusters, together with enrichment analyses, reveals brain- or organoid-specific developmental functions and pathways. Finally, we experimentally validated important specifically expressed genes using immunofluorescence. BOMA is openly available as an open-source web tool for community use.
-
Jie Sheng, Noah Cohen Kalafut, Daifeng Wang (Full Text)
Single-cell and spatial transcriptomics enable the analysis of cellular states and dynamics in gene expression, revealing how diverse biological processes relate to these states over time and space. To study these dynamics, trajectory inference methods order cells along computationally inferred paths to reconstruct gradual transitions in cell states. However, by encouraging smooth and continuous trajectories, these approaches tend to conflate co-occurring processes, such as proliferation, maturation, and spatial organization, that are jointly reflected in gene expression, potentially overlooking process-specific gene expression dynamics. To address this, we developed VAPOR, which integrates a variational autoencoder with transport operators to model and disentangle cellular gene expression dynamics for potentially co-occurring biological processes. VAPOR inputs single-cell (or spatial) gene expression data into a variational autoencoder (VAE) to learn the latent states of cells and then models their latent dynamics as an ordinary differential equation. The latent dynamics are further decomposed into process-specific components parameterized by transport operators (TOs) and their corresponding process weights. Each TO defines process-specific dynamics, and its weight for each cell quantifies the process's contribution to the cell's dynamics. After assessment in simulation studies, we benchmarked VAPOR on real data, including time-course scRNA-seq from postconceptional human brain development, spatial transcriptomics of the mouse hippocampus, and cross-species scRNA-seq spanning human and macaque first-trimester forebrain development. In these applications, VAPOR identified a variety of temporally and spatially co-occurring processes, such as cell cycle, gliogenesis, neurogenesis, and neuronal migration, along with associated dynamic genes, including those specific to human or macaque development. VAPOR is available as an open-source tool for general-purpose use.
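One way to read the transport-operator decomposition described above, as a hedged sketch: latent dynamics dz/dt = sum_k w_k(cell) * A_k z, with one learnable generator A_k per process and per-cell weights w_k quantifying each process's contribution. Dimensions and the Euler rollout are illustrative, not VAPOR's exact parameterization.

import torch
import torch.nn as nn

class TransportOperatorDynamics(nn.Module):
    def __init__(self, latent_dim, n_processes):
        super().__init__()
        # One learnable generator matrix per biological process.
        self.A = nn.Parameter(0.01 * torch.randn(n_processes, latent_dim, latent_dim))

    def forward(self, z, w):
        # z: (batch, latent); w: (batch, n_processes) per-cell process weights.
        # Returns dz/dt as a weighted sum of process-specific flows.
        flows = torch.einsum('kij,bj->bki', self.A, z)  # (batch, k, latent)
        return torch.einsum('bk,bki->bi', w, flows)

# Simple Euler rollout of cell latent trajectories under the learned flows:
dyn = TransportOperatorDynamics(latent_dim=16, n_processes=3)
z = torch.randn(4, 16)                         # latent states of 4 cells
w = torch.softmax(torch.randn(4, 3), dim=-1)   # per-cell process weights
for _ in range(10):
    z = z + 0.1 * dyn(z, w)                    # dt = 0.1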
-
*Pramod Bharadwaj Chandrashekar, *Sayali Anil Alatkar, *Noah Cohen Kalafut, *Ting Jin, Chirag Gupta, Ryan Burzak, Xiang Huang, Shuang Liu, Athan Z. Li, PsychAD Consortium, Kiran Girdhar, Georgios Voloudakis, Gabriel E. Hoffman, Jaroslav Bendl, John F. Fullard, Donghoon Lee, Panos Roussos, Daifeng Wang (Full Text)
Precision medicine for brain diseases faces many challenges, including understanding the heterogeneity of disease phenotypes. Such heterogeneity can be attributed to variations in cellular and molecular mechanisms across individuals. However, personalized mechanisms remain elusive, especially at the single-cell level. To address this, the PsychAD project generated population-level single-nucleus RNA-seq data for 1,494 human brains with over 6.3 million nuclei covering diverse clinical phenotypes and neuropsychiatric symptoms (NPSs) in Alzheimer's disease (AD). Leveraging these data, we analyzed personalized single-cell functional genomics involving cell type interactions and gene regulatory networks. In particular, we developed a knowledge-guided graph neural network model to learn latent representations of functional genomics (embeddings) and quantify importance scores of cell types, genes, and their interactions for each individual. Our embeddings improved phenotype classification and revealed potentially novel subtypes and population trajectories for AD progression, cognitive impairment, and NPSs. Our importance scores prioritized personalized functional genomic information and showed significant differences in regulatory mechanisms at the cell-type level across various phenotypes. Such information also allowed us to further identify subpopulation-level biological pathways, including ancestry-specific pathways for AD. Finally, we associated genetic variants with cell type-gene regulatory network changes across individuals, i.e., gene regulatory QTLs (grQTLs), providing novel functional genomic insights compared to existing QTLs. We further validated our results using external cohorts. Our analyses are available through iBrainMap, an open-source computational framework, and as a personalized functional genomic atlas for Alzheimer's disease.
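A minimal sketch of the general idea of attention weights doubling as interpretable importance scores for edges of a knowledge graph, as in the importance scoring described above. This illustrates the concept only, not iBrainMap's knowledge-guided model; the layer and the loop-based softmax are simplifications.

import torch
import torch.nn as nn

class AttentionGraphLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.att = nn.Linear(2 * dim, 1)

    def forward(self, x, edge_index):
        # x: (n_nodes, dim); edge_index: (2, n_edges) from a knowledge graph.
        src, dst = edge_index
        h = self.proj(x)
        scores = self.att(torch.cat([h[src], h[dst]], dim=-1)).squeeze(-1)
        # Normalize over each node's incoming edges (slow loop, clear intent).
        alpha = torch.zeros_like(scores)
        for node in dst.unique():
            mask = dst == node
            alpha[mask] = torch.softmax(scores[mask], dim=0)
        out = torch.zeros_like(h).index_add_(0, dst, alpha[:, None] * h[src])
        return out, alpha  # alpha: per-edge importance scores

x = torch.randn(6, 8)
edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 0]])
out, importance = AttentionGraphLayer(8)(x, edge_index)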
-
Xiang Huang, Noah Cohen Kalafut, Sayali Alatkar, Athan Z. Li, Qiping Dong, Qiang Chang, Daifeng Wang (Full Text)
Studying temporal features of neural activities is crucial for understanding the functions of neurons as well as underlying neural circuits. To this end, recent researches employ emerging techniques including calcium imaging, Neuropixels, depth electrodes, and Patch-seq to generate multimodal time-series data that depict the activities of single neurons, groups of neurons, and behaviors. However, challenges persist, including the analysis of noisy, high-sampling-rate neuronal data, and the modeling of temporal dynamics across various modalities. To address these challenges, we developed NeuroTD, a novel deep learning approach to align multimodal time-series datasets and infer cross-modality temporal relationships such as time delays or shifts. Particularly, NeuroTD integrates Siamese neural networks with frequency domain transformations and complex value optimization for inference. We applied NeuroTD to three multimodal datasets to (1) analyze electrophysiological (ephys) time series measured by depth electrodes, identifying time delays among neurons across various positions, (2) investigate neural activity and behavioral time series data derived from Neuropixels and 3D motion captures, establishing causal relationships between neural activities and corresponding behavioral activities, and (3) explore gene expression and ephys data of single neurons from Patch-seq, identifying gene expression signatures highly correlated with time shifts in ephys responses. Finally, NeuroTD is open-source at https://github.com/daifengwanglab/NeuroTD for general use.