dailymachinelearning

By James Asher

Daily summaries of the latest Machine Learning research papers from Arxiv.

2025-06-20 • Found 24 papers

AZT1D: A Real-World Dataset for Type 1 Diabetes

Saman Khamesian, Asiful Arefeen, Bithika M. Thompson, Maria Adela Grando, Hassan Ghasemzadeh
  • AZT1D is a publicly available, real-world dataset for Type 1 Diabetes management, collected from 25 participants using automated insulin delivery (AID) systems.
  • The dataset includes detailed continuous glucose monitoring (CGM) data, insulin delivery logs, carbohydrate intake, and device mode (regular, sleep, exercise).
  • It uniquely provides fine-grained bolus insulin delivery details, such as bolus type and correction-specific doses, which are rarely found in other datasets.
  • The data was collected over 6–8 weeks per participant during routine clinical care, ensuring naturalistic and representative patient behaviors.
  • AZT1D enables diverse machine learning applications, including glucose prediction, therapy optimization, and simulation-based evaluations.
Abstract
The AZT1D dataset is a novel, real-world dataset designed to advance machine learning and artificial intelligence applications in the management of Type 1 Diabetes (T1D). The dataset includes detailed data collected from 25 individuals with T1D using automated insulin delivery (AID) systems over a period of 6–8 weeks. It contains continuous glucose monitoring (CGM) data, insulin pump records, carbohydrate intake, and device mode information (regular, sleep, and exercise). A unique feature of AZT1D is its granular bolus insulin delivery data, including bolus types and correction-specific amounts, which are rarely available in existing datasets. The dataset was collected during routine clinical care at the Mayo Clinic and reflects naturalistic, day-to-day diabetes management behaviors. By providing comprehensive and high-resolution data, AZT1D supports a wide range of machine learning applications, including personalized therapy optimization, glucose prediction, and digital twin modeling.
Methodology
The dataset was collected retrospectively from 25 individuals with T1D during routine endocrinology visits at the Mayo Clinic. Data sources included CGM devices (Dexcom G6 Pro) and Tandem t:slim X2 insulin pumps. Data preprocessing involved aligning timestamps across multiple data streams (e.g., CGM readings, insulin delivery, carbohydrate intake) and extracting basal rates and device modes from PDF reports using Optical Character Recognition (OCR). The resulting dataset was unified into a common time-series format, with missing values handled appropriately.
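As a rough illustration of the alignment step described above, the following pandas sketch resamples hypothetical CGM, bolus, and carbohydrate streams onto a shared 5-minute grid. The file names, column names, and grid resolution are invented for the example and are not the dataset's actual schema.

```python
# Hypothetical sketch of aligning CGM and pump event streams onto one grid;
# file/column names and the 5-minute resolution are illustrative only.
import pandas as pd

cgm = pd.read_csv("cgm.csv", parse_dates=["timestamp"])      # e.g. one row per CGM reading
bolus = pd.read_csv("bolus.csv", parse_dates=["timestamp"])  # bolus type, units, correction dose
carbs = pd.read_csv("carbs.csv", parse_dates=["timestamp"])

# Resample every stream onto a common 5-minute grid.
cgm = cgm.set_index("timestamp").resample("5min").mean(numeric_only=True)
bolus = bolus.set_index("timestamp").resample("5min").sum(numeric_only=True)
carbs = carbs.set_index("timestamp").resample("5min").sum(numeric_only=True)

# Outer-join into one time series; gaps remain as NaN for downstream handling.
unified = cgm.join([bolus, carbs], how="outer").sort_index()
print(unified.head())
```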
Results
The AZT1D dataset comprises 26,707 hours of continuous monitoring data, including 320,488 CGM entries. It provides detailed records of insulin administration, carbohydrate intake, and device modes. The dataset captures real-world diabetes management behaviors and offers a high level of granularity, particularly in bolus insulin delivery details. This makes it a valuable resource for developing and evaluating machine learning models in T1D care.
Implications
AZT1D has the potential to significantly advance research in T1D management by enabling the development of personalized therapy optimization algorithms, glucose prediction models, and digital twin systems. Its real-world nature and comprehensive data coverage make it a critical resource for improving clinical decision-making and individualized care. Additionally, the dataset can support simulation-based evaluations of new diabetes management technologies and interventions.
View on arXiv

Active Learning-Guided Seq2Seq Variational Autoencoder for Multi-target Inhibitor Generation

Júlia Vilalta-Mor, Alexis Molina, Laura Ortega Varga, Isaac Filella-Merce, Victor Guallar
  • The paper proposes a Seq2Seq VAE framework integrated with active learning to generate molecules with multi-target affinity.
  • The workflow alternates between expanding latent chemical space and applying multi-target docking constraints to refine molecule generation.
  • A proof-of-concept study targeting coronavirus main proteases demonstrates the method's ability to generate diverse pan-inhibitor candidates.
  • Active learning is shown to overcome challenges such as sparse rewards and low-data regimes in multi-target drug discovery.
  • The framework is generalizable and can be applied to other polypharmacological drug discovery tasks.
Abstract
This paper introduces a novel active learning (AL)-guided sequence-to-sequence (Seq2Seq) variational autoencoder (VAE) framework for generating molecules with simultaneous affinity for multiple therapeutic targets. The proposed workflow addresses challenges in multi-target drug discovery, such as sparse rewards and conflicting design constraints, by iteratively refining the chemical space using active learning loops. The method alternates between expanding chemically feasible regions of the latent space and applying increasingly stringent multi-target docking thresholds to constrain molecule generation. A proof-of-concept study targeting the main proteases of SARS-CoV-2, SARS-CoV, and MERS-CoV demonstrates the framework's ability to efficiently generate structurally diverse pan-inhibitor candidates. The study highlights the importance of strategically integrating chemical filters and active learning to enhance exploration and optimization in multi-objective drug design. This approach provides a generalizable roadmap for navigating complex polypharmacological landscapes in drug discovery.
Methodology
The proposed workflow combines a Seq2Seq VAE with a two-level active learning process. The VAE is first pretrained on a general molecular dataset to learn chemical grammar and then fine-tuned on a dataset of molecules with known affinities for multiple targets. The active learning process iteratively generates and refines molecules by promoting those with desirable physicochemical properties and multi-target affinities. Molecular docking simulations are used to evaluate and filter generated molecules based on their predicted binding affinities to multiple targets.
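The alternating expand/constrain/refine loop can be sketched as below. All helper functions (sample_from_latent, passes_chemical_filters, dock_against, fine_tune) and the threshold schedule are placeholders for the components described above, not a real API from the paper.

```python
# Schematic active-learning loop around a pretrained Seq2Seq VAE.
# Every helper function here is a placeholder, not an actual implementation.

def active_learning_loop(vae, targets, n_rounds=10, n_samples=1000):
    threshold = -6.0  # illustrative docking-score cutoff, tightened each round
    selected = []
    for _ in range(n_rounds):
        # 1. Expand: decode candidate molecules from the current latent space.
        candidates = sample_from_latent(vae, n_samples)
        candidates = [m for m in candidates if passes_chemical_filters(m)]

        # 2. Constrain: keep molecules that dock well against *all* targets.
        keep = [m for m in candidates
                if all(dock_against(m, t) <= threshold for t in targets)]
        selected.extend(keep)

        # 3. Refine: fine-tune the VAE on surviving multi-target hits and
        #    tighten the docking threshold for the next round.
        vae = fine_tune(vae, keep)
        threshold -= 0.5
    return selected
```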
Results
The framework successfully generated a diverse set of pan-inhibitor candidates targeting the main proteases of SARS-CoV-2, SARS-CoV, and MERS-CoV. The study demonstrated that the integration of active learning and chemical filters significantly improved the exploration of beneficial chemical space and the generation of molecules with optimized multi-target affinities.
Implications
This framework provides a scalable and efficient approach for multi-target drug discovery, enabling the design of polypharmacological drugs and pan-inhibitors. It has potential applications in treating complex diseases such as cancer and viral infections by targeting homologous proteins across different organisms. The methodology can be extended to other therapeutic areas requiring multi-objective molecular optimization.
View on arXiv

Bound by semanticity: universal laws governing the generalization-identification tradeoff

Marco Nurisso, Jesseba Fernando, Raj Deshpande, Alan Perotti, Raja Marjieh, Steven M. Frankland, Richard L. Lewis, Taylor W. Webb, Declan Campbell, Francesco Vaccarino, Jonathan D. Cohen, Giovanni Petri
  • The paper derives a universal Pareto front that quantifies the tradeoff between generalization and identification under finite semantic resolution.
  • Closed-form expressions predict a sharp 1/n collapse in multi-input processing capacity, highlighting the limitations of simultaneous input processing.
  • Empirical validation shows that neural networks self-organize resolution boundaries during training, closely following theoretical predictions.
  • The findings generalize across architectures, from simple ReLU networks to CNNs and vision-language models, confirming the universality of the tradeoff.
  • Finite-resolution similarity emerges as a fundamental constraint on the representational capacity of both artificial and biological systems.
Abstract
This paper investigates the fundamental tradeoff between generalization and identification in intelligent systems, proposing a universal theoretical framework to quantify this relationship. The authors derive closed-form expressions for the Pareto front that governs the tradeoff, showing that finite semantic resolution imposes a fundamental limit on the ability of models to simultaneously generalize and identify inputs. They extend their analysis to noisy and heterogeneous input spaces, as well as multi-input scenarios, revealing a sharp 1/n collapse in processing capacity as the number of inputs increases. Empirical validation is conducted using a minimal ReLU network, convolutional neural networks (CNNs), and state-of-the-art vision-language models, demonstrating that the theoretical predictions hold across diverse architectures. The study concludes that finite-resolution similarity is a universal constraint on representational capacity, with implications for both artificial neural networks and biological cognitive systems.
Methodology
The authors develop a theoretical framework based on similarity functions and distance metrics in latent spaces to quantify the tradeoff between generalization and identification. They derive closed-form expressions for the Pareto front and extend the analysis to noisy, heterogeneous, and multi-input scenarios. Empirical validation is performed using a minimal ReLU network, CNNs, and vision-language models, comparing their behavior to the theoretical predictions.
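The basic tension can be seen in a toy numerical example. The Gaussian similarity kernel and the two scores below are illustrative assumptions, not the paper's closed-form Pareto front: shrinking the resolution parameter improves identification of distinct inputs but destroys generalization to nearby ones, and vice versa.

```python
# Toy illustration of the generalization-identification tension under a
# finite-resolution (Gaussian) similarity kernel; kernel and scores are assumptions.
import numpy as np

def similarity(x, y, sigma):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
anchor = rng.normal(size=8)
related = anchor + 0.1 * rng.normal(size=8)    # nearby input: should generalize
distinct = anchor + 2.0 * rng.normal(size=8)   # far input: should be told apart

for sigma in [0.05, 0.2, 1.0, 5.0]:
    gen = similarity(anchor, related, sigma)           # high = good generalization
    ident = 1.0 - similarity(anchor, distinct, sigma)  # high = good identification
    print(f"sigma={sigma:>4}: generalization={gen:.2f}, identification={ident:.2f}")
# Small sigma identifies everything but fails to generalize; large sigma does the opposite.
```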
Results
The study demonstrates that finite semantic resolution imposes a universal tradeoff between generalization and identification, with empirical trajectories of neural networks closely matching theoretical predictions. The 1/n collapse in multi-input processing capacity is confirmed, and the findings generalize across diverse neural architectures, establishing finite-resolution similarity as a universal constraint.
Implications
The results provide a theoretical foundation for understanding the tradeoff between generalization and identification in neural networks and cognitive systems. This has implications for designing models with improved representational efficiency and understanding the limitations of multi-input processing in both artificial and biological systems. The findings could inform the development of architectures that balance generalization and identification more effectively.
View on arXiv

CACTUS as a Reliable Tool for Early Classification of Age-related Macular Degeneration

Luca Gherardini, Imre Lengyel, Tunde Peto, Caroline C.W. Klaver, Magda A. Meester-Smoor, Johanna Maria Colijn, EYE-RISK Consortium, E3 Consortium, Jose Sousa
  • CACTUS is an explainable AI model designed for early classification of AMD using diverse data sources such as genetic, dietary, clinical, and demographic factors.
  • The model builds knowledge graphs to represent feature interactions, enabling better handling of missing and noisy data.
  • CACTUS outperforms traditional machine learning models in accuracy and robustness while providing interpretable and trustworthy outputs.
  • The tool identifies key decision-making factors, allowing clinicians to validate and refine its predictions, reducing biases in the dataset.
  • The approach aligns with the growing demand for transparent AI in healthcare, addressing regulatory and ethical concerns.
Abstract
This paper introduces the Comprehensive Abstraction and Classification Tool for Uncovering Structures (CACTUS), a novel machine learning model designed for the early classification of Age-related Macular Degeneration (AMD). AMD is a chronic retinal disease that affects millions globally, with limited treatment options and a critical need for early diagnosis to enable preventive strategies. Traditional diagnostic methods rely heavily on retinal imaging and human interpretation, which are resource-intensive and prone to limitations such as data incompleteness and biases. CACTUS addresses these challenges by leveraging explainable AI techniques to classify AMD stages using a combination of genetic, dietary, clinical, and demographic data. The model builds knowledge graphs to represent feature interactions, handles missing and noisy data effectively, and provides interpretable outputs that align with existing medical knowledge. The study demonstrates that CACTUS outperforms standard machine learning models in accuracy and robustness while offering enhanced transparency and trustworthiness. By identifying key decision-making factors and simulating clinical scenarios, CACTUS facilitates feedback from clinicians and reduces biases, making it a reliable tool for early AMD diagnosis.
Methodology
CACTUS employs knowledge graph-based modeling to abstract and analyze complex datasets with missing and noisy values. It integrates genetic, dietary, clinical, and demographic data to classify AMD stages. The model emphasizes explainability by identifying key features influencing its predictions and simulating clinical scenarios to validate its outputs against medical knowledge.
Results
CACTUS demonstrated superior performance compared to standard machine learning models in classifying AMD stages. It effectively identified critical decision-making factors, reduced biases in the dataset, and provided interpretable outputs that align with existing medical knowledge. The model's robustness and transparency make it a reliable tool for early AMD diagnosis.
Implications
The adoption of CACTUS in clinical settings could significantly improve early AMD diagnosis, enabling preventive strategies and reducing the burden on healthcare systems. Its explainable AI approach addresses regulatory and ethical concerns, fostering trust among clinicians and patients. Additionally, the methodology could be extended to other diseases requiring early diagnosis and classification.
View on arXiv

CAWR: Corruption-Averse Advantage-Weighted Regression for Robust Policy Optimization

Ranting Hu
  • The paper identifies over-conservatism in AWR algorithms as a result of poor explorations in offline datasets.
  • CAWR introduces robust loss functions to mitigate the sensitivity of policy optimization to poor explorations.
  • An advantage-based prioritized experience replay mechanism is used to filter out low-advantage actions from training data.
  • Theoretical analysis validates the robustness of CAWR against data corruption.
  • Empirical results on the D4RL benchmark show that CAWR outperforms IQL and other state-of-the-art offline RL algorithms.
Abstract
This paper addresses the over-conservatism problem in offline reinforcement learning (RL), particularly within the Advantage-Weighted Regression (AWR) family of algorithms. Over-conservatism arises when policies become overly cautious due to poor explorations (low-advantage actions) in suboptimal offline datasets, leading to underutilization of high-quality data. The authors identify two key factors contributing to this issue: the sensitivity of the loss function to poor explorations and the proportion of poor explorations in the dataset. To mitigate these challenges, the paper introduces Corruption-Averse Advantage-Weighted Regression (CAWR), which incorporates robust loss functions to reduce the impact of poor explorations and employs an advantage-based prioritized experience replay mechanism to filter out low-quality data. Theoretical analysis and empirical validation on the D4RL benchmark demonstrate that CAWR significantly improves policy optimization performance compared to state-of-the-art methods like Implicit Q-Learning (IQL).
Methodology
The authors propose CAWR, which integrates robust loss functions to reduce the influence of poor explorations during policy optimization. Additionally, an advantage-based prioritized experience replay mechanism is employed to prioritize high-advantage actions in the training process. Theoretical analysis is conducted to validate the robustness of the approach, and extensive experiments are performed on the D4RL benchmark to evaluate its effectiveness.
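The following sketch shows what one such update step could look like: advantage-weighted regression toward dataset actions with a robust (Huber) loss in place of squared error, and advantage-prioritized sampling so low-advantage transitions are drawn less often. The specific loss, weighting, and sampling scheme are stand-ins for the components described above, not CAWR's exact objective.

```python
# Sketch of an advantage-weighted regression step with a robust Huber loss and
# advantage-prioritized sampling; an illustration of the idea, not CAWR itself.
import torch
import torch.nn.functional as F

def cawr_style_update(policy, optimizer, states, actions, advantages,
                      beta=1.0, batch_size=256):
    # Advantage-based prioritized sampling: poor explorations are drawn less often.
    probs = torch.softmax(advantages / beta, dim=0)
    idx = torch.multinomial(probs, batch_size, replacement=True)
    s, a, adv = states[idx], actions[idx], advantages[idx]

    # Exponential advantage weights, clipped for numerical stability.
    weights = torch.clamp(torch.exp(adv / beta), max=20.0)

    # Robust regression toward the dataset actions (Huber instead of MSE).
    pred = policy(s)
    per_sample = F.huber_loss(pred, a, reduction="none").mean(dim=-1)
    loss = (weights * per_sample).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```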
Results
CAWR achieves superior performance across multiple datasets in the D4RL benchmark, surpassing state-of-the-art algorithms like IQL. The use of robust loss functions and prioritized experience replay significantly enhances policy optimization, particularly in scenarios with suboptimal offline data.
Implications
The proposed CAWR algorithm has the potential to improve offline RL applications in domains where data quality is inconsistent or suboptimal, such as robotics, healthcare, and autonomous systems. By addressing over-conservatism, CAWR enables more effective utilization of offline datasets, paving the way for robust policy optimization in real-world scenarios.
View on arXiv

Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework

Laura Kopf, Nils Feldhus, Kirill Bykov, Philine Lou Bommer, Anna Hedström, Marina M.-C. Höhne, Oliver Eberle
  • PRISM introduces a multi-concept feature description framework that captures polysemanticity in neural network features.
  • The framework includes polysemanticity scoring to measure the diversity of concepts encoded by a feature and description scoring to assess the quality of feature descriptions.
  • PRISM outperforms existing methods in generating accurate and nuanced descriptions for both polysemantic and monosemantic features.
  • The analysis of language models using PRISM reveals that many features encode multiple semantically distinct concepts, challenging the monosemanticity assumption.
  • The authors provide publicly available code to facilitate further research on interpretability in neural networks.
Abstract
This paper introduces PRISM (Polysemantic FeatuRe Identification and Scoring Method), a novel framework for generating and evaluating multi-concept feature descriptions in neural networks. Unlike traditional methods that assume monosemanticity (i.e., each feature encodes a single concept), PRISM captures the inherent polysemanticity of features, where individual neurons or components encode multiple distinct concepts. The framework provides nuanced textual descriptions for both polysemantic and monosemantic features, improving the interpretability of large language models (LLMs). PRISM includes a polysemanticity scoring mechanism to quantify the diversity of concepts associated with a feature and a description scoring metric to evaluate the alignment between feature activations and their descriptions. The authors benchmark PRISM against existing methods and demonstrate its superior ability to produce accurate, granular, and faithful feature descriptions. The framework is applied to analyze features in LLMs, revealing that many features encode a heterogeneous set of concepts, challenging the traditional single-concept assumption. The authors make their code publicly available to encourage further research in this area.
Methodology
PRISM generates multi-concept feature descriptions by clustering token-level inputs that elicit similar neuron activations. It evaluates the quality of these descriptions using two metrics: polysemanticity scoring, which quantifies the diversity of concepts associated with a feature, and description scoring, which measures the alignment between feature activations and their textual descriptions. The framework is benchmarked against existing feature description methods using quantitative evaluations and applied to analyze features in large language models.
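A minimal sketch of the clustering idea is shown below: group the contexts that strongly activate one feature and treat the number and balance of the resulting clusters as a crude polysemanticity signal. The agglomerative clustering, distance threshold, and entropy-based score are illustrative assumptions, not PRISM's actual metrics.

```python
# Sketch: cluster high-activation contexts of one feature; clustering choice and
# entropy score are illustrative assumptions, not PRISM's scoring functions.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def concept_clusters(context_embeddings, distance_threshold=1.0):
    """context_embeddings: (n, d) embeddings of token contexts with high activation."""
    labels = AgglomerativeClustering(
        n_clusters=None, distance_threshold=distance_threshold, linkage="average"
    ).fit_predict(context_embeddings)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    entropy = float(-(p * np.log(p)).sum())  # 0 when one concept dominates
    return len(counts), entropy

rng = np.random.default_rng(0)
mono = rng.normal(0.0, 0.1, size=(150, 16))                                 # one concept
poly = np.vstack([rng.normal(c, 0.1, size=(50, 16)) for c in (-2, 0, 2)])   # three concepts
print(concept_clusters(mono))   # expect (1, 0.0)
print(concept_clusters(poly))   # expect (3, ~log 3)
```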
Results
PRISM produces more accurate and granular feature descriptions compared to existing methods, as evidenced by higher scores in both polysemanticity and description evaluations. The framework reveals that many features in language models encode a diverse set of concepts, providing a more comprehensive understanding of model internals. PRISM successfully identifies both polysemantic and monosemantic features, offering a significant improvement in interpretability.
Implications
PRISM has the potential to advance the field of interpretability in machine learning by providing a more nuanced understanding of neural network internals. Its ability to capture polysemanticity could improve debugging, auditing, and fine-tuning of large language models. Additionally, the framework could be applied to other domains, such as computer vision or reinforcement learning, where understanding feature representations is critical.
View on arXiv

Conditional Generative Modeling for Enhanced Credit Risk Management in Supply Chain Finance

Qingkai Zhang, L. Jeff Hong, Houmin Yan
  • The paper introduces a novel credit risk management framework tailored for third-party logistics (3PL)-led supply chain finance (SCF) in cross-border e-commerce (CBEC), addressing both credit risk assessment and loan size determination.
  • The proposed Quantile-Regression-based Generative Metamodeling (QRGMM) models the full conditional sales distribution, enabling nuanced risk analysis.
  • Integration with Deep Factorization Machines (DeepFM) allows the model to capture complex covariate interactions in e-commerce sales data.
  • The framework provides theoretical guarantees and supports flexible estimation of multiple risk measures.
  • Extensive experiments validate the model's efficacy in improving credit risk assessment and loan sizing decisions.
Abstract
This paper addresses the challenges of credit risk management and loan size determination in the context of third-party logistics (3PL)-led supply chain finance (SCF) for small and medium-sized enterprises (SMEs) engaged in cross-border e-commerce (CBEC). SMEs often face financing barriers due to limited credit histories and collateral. The authors propose a novel framework that leverages conditional generative modeling to estimate sales distributions and assess credit risk. The core methodology, Quantile-Regression-based Generative Metamodeling (QRGMM), is integrated with Deep Factorization Machines (DeepFM) to capture complex interactions in e-commerce sales data. This unified framework enables flexible estimation of multiple risk measures and introduces a functional risk measure formulation to systematically relate risk measures to varying loan levels. The approach is validated through experiments on both synthetic and real-world datasets, demonstrating its effectiveness in improving credit risk assessment and loan size determination. This work represents a pioneering application of generative AI in CBEC SCF, offering a robust foundation for enhanced credit practices and improved SME access to capital.
Methodology
The authors propose a unified framework based on Quantile-Regression-based Generative Metamodeling (QRGMM) to model conditional sales distributions. This is combined with Deep Factorization Machines (DeepFM) to capture complex feature interactions in e-commerce data. The framework introduces a functional risk measure formulation to systematically relate risk measures to loan levels. Theoretical guarantees are provided, and the model is validated using synthetic and real-world datasets.
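The core generative idea of quantile-regression metamodeling can be sketched as follows: fit one conditional quantile model per level on a grid, then draw samples via the inverse-CDF trick. Gradient-boosted quantile regressors stand in here for the paper's DeepFM-based model, and the data are synthetic; this is an illustration of the mechanism, not the authors' implementation.

```python
# Minimal sketch of quantile-regression-based generative sampling; the
# boosting models and synthetic data are stand-ins, not the paper's pipeline.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(2000, 3))                               # covariates (e.g. SME features)
y = 100 * X[:, 0] + rng.gamma(2.0, 10 * (1 + X[:, 1]), size=2000)   # skewed "sales"

quantile_grid = np.linspace(0.05, 0.95, 19)
models = [GradientBoostingRegressor(loss="quantile", alpha=q, n_estimators=100).fit(X, y)
          for q in quantile_grid]

def sample_conditional_sales(x, n_samples=1000):
    """Draw sales samples for one covariate vector x by interpolating the
    estimated conditional quantile curve at uniform random levels."""
    q_values = np.array([m.predict(x.reshape(1, -1))[0] for m in models])
    u = rng.uniform(quantile_grid[0], quantile_grid[-1], size=n_samples)
    return np.interp(u, quantile_grid, q_values)

draws = sample_conditional_sales(X[0])
print(draws.mean(), np.quantile(draws, 0.05))  # e.g. expected sales and a tail-risk estimate
```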
Results
The proposed framework demonstrated superior performance in credit risk assessment and loan size determination compared to traditional methods. Experiments showed that the model effectively captures sales variability and tail risks, enabling more informed and robust credit decisions. The integration of QRGMM and DeepFM proved effective in leveraging the richness of e-commerce data for predictive analytics.
Implications
This study highlights the potential of generative AI in enhancing credit risk management for CBEC SCF. The proposed framework can improve SME access to capital by enabling more accurate and flexible credit risk assessments. It also provides a foundation for developing advanced financial products and services tailored to the needs of SMEs in global e-commerce.
View on arXiv

Event-Driven Online Vertical Federated Learning

Ganyu Wang, Boyu Wang, Bin Gu, Charles Ling
  • Introduces an event-driven online VFL framework to handle asynchronous data reception in real-world scenarios.
  • Incorporates dynamic local regret (DLR) to address non-convex models and non-stationary environments.
  • Theoretical analysis proves the DLR bound for the proposed framework with partial client activation.
  • Experiments demonstrate improved stability under non-stationary conditions and reduced communication and computation costs.
  • Addresses a critical gap in online VFL research by moving beyond the assumption of synchronous data reception.
Abstract
This paper introduces an event-driven online Vertical Federated Learning (VFL) framework to address challenges in real-world online VFL scenarios. Unlike traditional VFL approaches that assume synchronous data reception across clients, the proposed framework accounts for asynchronous, event-driven data generation, where only a subset of clients is activated during each event. The framework incorporates a dynamic local regret (DLR) approach to handle non-convex models and non-stationary environments, which are common in practical applications. The authors provide a theoretical analysis of the DLR bound for their framework and demonstrate its effectiveness through extensive experiments. The results show that the proposed framework is more stable under non-stationary data conditions and significantly reduces communication and computation costs compared to existing methods.
Methodology
The authors propose an event-driven online VFL framework where only a subset of clients is activated during each event, while others passively collaborate. They adapt the dynamic local regret (DLR) approach to handle non-convex models and non-stationary data streams. The framework is analyzed theoretically to establish a DLR bound under these conditions. Extensive experiments are conducted to evaluate the framework's performance in terms of stability, communication, and computation efficiency.
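The partial-activation mechanism can be sketched as below: when an event arrives, only the activated clients recompute embeddings, while the server reuses cached embeddings for the others, so gradients (and communication) reach only the head and the active clients. Class and variable names are illustrative only.

```python
# Schematic of event-driven partial activation in vertical FL; names and
# structure are illustrative, not the paper's implementation.
import torch
import torch.nn.functional as F

class EventDrivenVFLServer:
    def __init__(self, client_models, head, embed_dim):
        self.client_models = client_models                            # one local model per party
        self.head = head                                              # server-side prediction head
        self.cache = [torch.zeros(embed_dim) for _ in client_models]  # last embedding per client

    def on_event(self, active_clients, local_features, label, optimizer):
        # Only activated clients recompute embeddings; others contribute cached ones.
        embeddings = []
        for cid in range(len(self.client_models)):
            if cid in active_clients:
                emb = self.client_models[cid](local_features[cid])
                self.cache[cid] = emb.detach()          # refresh cache for later events
            else:
                emb = self.cache[cid]                   # stale but free: no communication
            embeddings.append(emb)

        pred = self.head(torch.cat(embeddings))
        loss = F.binary_cross_entropy_with_logits(pred, label)

        # Gradients reach the head and only the activated clients' local models.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()
```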
Results
The proposed framework demonstrates greater stability compared to existing online VFL methods under non-stationary data conditions. It also significantly reduces communication and computation costs by activating only a subset of clients during each event. The theoretical analysis confirms the effectiveness of the dynamic local regret approach in handling non-convex and non-stationary scenarios.
Implications
This framework has significant implications for real-world applications of VFL, such as in industries where data is generated asynchronously (e.g., finance, healthcare, and IoT sensor networks). By reducing communication and computation costs while maintaining stability, the framework enables scalable and efficient VFL deployments in dynamic environments.
View on arXiv

GFLC: Graph-based Fairness-aware Label Correction for Fair Classification

Modar Sulaiman, Kallol Roy
  • Introduces GFLC, a method to address instance-dependent label noise while ensuring fairness in machine learning models.
  • Combines prediction confidence, graph-based regularization, and fairness constraints to correct noisy labels.
  • Utilizes k-NN graphs, Forman–Ricci curvature, and discrete Ricci flow for structural insights and label correction.
  • Achieves improved trade-offs between fairness and performance metrics compared to baseline methods.
  • Demonstrates robustness in high-noise scenarios, validating the method's effectiveness through empirical experiments.
Abstract
This paper introduces GFLC (Graph-based Fairness-aware Label Correction), a novel method aimed at addressing the dual challenges of label noise and fairness in machine learning. Label noise, particularly when it is instance-dependent and influenced by sensitive attributes, can exacerbate biases in machine learning models and degrade their performance. GFLC is designed to correct noisy labels while maintaining demographic parity, a key fairness metric. The method integrates three core components: prediction confidence measures, graph-based regularization using Ricci-flow-optimized graph Laplacians, and explicit fairness constraints. By leveraging structural insights from k-nearest neighbor (k-NN) graphs and advanced graph-theoretic concepts like Forman–Ricci curvature, GFLC effectively balances the trade-off between fairness and performance. Experimental results demonstrate that GFLC outperforms baseline methods, particularly in scenarios with high noise rates, by achieving significant improvements in both fairness and classification accuracy.
Methodology
GFLC employs a combination of prediction confidence measures, graph-based regularization using Ricci-flow-optimized graph Laplacians, and explicit demographic parity incentives. The method leverages structural insights from k-NN graphs and advanced graph-theoretic concepts like Forman–Ricci curvature to correct noisy labels while preserving fairness. The approach is evaluated through experiments comparing its performance and fairness metrics against baseline methods under varying noise conditions.
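For orientation, the sketch below builds a k-NN graph over the training points and computes the simple (non-augmented) combinatorial Forman-Ricci curvature of each edge, F(u, v) = 4 - deg(u) - deg(v). This is only the raw structural quantity; how GFLC feeds curvatures into its Ricci-flow and label-correction steps is not reproduced here.

```python
# Sketch: k-NN graph plus the simple combinatorial Forman-Ricci edge curvature.
import numpy as np
import networkx as nx
from sklearn.neighbors import kneighbors_graph

def knn_forman_curvature(X, k=10):
    adj = kneighbors_graph(X, n_neighbors=k, mode="connectivity")
    G = nx.from_scipy_sparse_array(adj)       # undirected k-NN graph
    curvature = {(u, v): 4 - G.degree(u) - G.degree(v) for u, v in G.edges()}
    return G, curvature

X = np.random.default_rng(0).normal(size=(200, 5))
G, curv = knn_forman_curvature(X)
print(np.mean(list(curv.values())))  # densely connected regions give more negative curvature
```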
Results
The experimental evaluation shows that GFLC significantly improves the trade-off between fairness and performance metrics, particularly in datasets with high levels of instance-dependent label noise. The method outperforms baseline approaches in both classification accuracy and demographic parity, demonstrating its robustness and effectiveness in mitigating the impact of noisy labels on fairness-aware classification.
Implications
GFLC has potential applications in domains where fairness is critical, such as healthcare, legal decision-making, and hiring systems. By addressing label noise and fairness simultaneously, the method can contribute to the development of more trustworthy and equitable machine learning systems. Additionally, its graph-based approach could inspire further research into leveraging network science for fairness-aware learning.
View on arXiv

Global Ground Metric Learning with Applications to scRNA data

Damin Kühn, Michael T. Schaub
  • Introduces Global Ground Metric Learning (GGML) for learning task-specific ground metrics in optimal transport.
  • GGML requires only class labels at the distribution level, making it applicable to arbitrary distributions over a shared space.
  • Demonstrates improved performance in clustering, classification, and embedding tasks using scRNA-seq data.
  • Leverages the Mahalanobis distance as a learnable global metric, optimized via gradient descent.
  • Validates the approach on synthetic and real-world datasets, highlighting its interpretability and effectiveness.
Abstract
This paper introduces Global Ground Metric Learning (GGML), a novel framework for learning ground metrics in the context of optimal transport (OT). Unlike traditional approaches that rely on predefined metrics (e.g., Euclidean distance) or supervised learning with labeled data, GGML learns a global metric that can handle arbitrary distributions over a shared space using only class labels at the distribution level. The authors demonstrate the utility of GGML in analyzing single-cell RNA sequencing (scRNA-seq) data, where each patient is represented as a distribution of high-dimensional gene expression vectors. By learning a task-specific ground metric, GGML improves the accuracy of OT distances, leading to better performance in downstream tasks such as clustering, classification, and embedding. The framework is validated on synthetic and real-world scRNA-seq datasets, showcasing its ability to capture biologically meaningful relationships and enhance interpretability.
Methodology
The authors propose GGML, which learns a differentiable metric (e.g., Mahalanobis distance) using class labels of distributions. The metric is optimized via gradient descent to minimize distances between similar distributions while maintaining a margin between dissimilar ones. The learned metric is then used as the ground metric in OT to compute Wasserstein distances. The framework is tested on synthetic and real-world scRNA-seq datasets, with applications in clustering, classification, and feature importance analysis.
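The sketch below shows the downstream use of such a metric: plug a (learned) Mahalanobis ground cost into an optimal-transport distance between two empirical distributions using the POT library. The contrastive training of L by gradient descent is omitted; L is set to the identity here purely for illustration.

```python
# Sketch: learned Mahalanobis ground metric inside an OT distance via POT.
import numpy as np
import ot  # POT: Python Optimal Transport

def mahalanobis_cost(X, Y, L):
    """Pairwise squared Mahalanobis cost d(x, y) = ||L(x - y)||^2."""
    XL, YL = X @ L.T, Y @ L.T
    return ot.dist(XL, YL, metric="sqeuclidean")

def wasserstein_ggml(X, Y, L):
    a = np.full(len(X), 1.0 / len(X))      # uniform weights over cells of patient X
    b = np.full(len(Y), 1.0 / len(Y))
    return ot.emd2(a, b, mahalanobis_cost(X, Y, L))

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1, size=(60, 20))      # e.g. cells x genes for patient A
Y = rng.normal(0.5, 1, size=(80, 20))      # patient B
L = np.eye(20)                             # identity = plain Euclidean; GGML would learn L
print(wasserstein_ggml(X, Y, L))
```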
Results
The GGML framework outperforms traditional OT approaches with predefined metrics in clustering and classification tasks. It provides biologically meaningful insights when applied to scRNA-seq data, effectively distinguishing between patient groups with different disease states. The learned metric also enhances interpretability by identifying important features (genes) that contribute to the separation of distributions.
Implications
GGML has significant implications for analyzing high-dimensional biological data, such as scRNA-seq, where understanding relationships between distributions is critical. Its ability to learn task-specific metrics without requiring shared support or pairwise labels makes it broadly applicable to other domains like computer vision, natural language processing, and genomics. The framework could facilitate more accurate and interpretable analyses in fields requiring distribution-level comparisons.
View on arXiv

Heavy-Ball Momentum Method in Continuous Time and Discretization Error Analysis

Bochen Lyu, Xiaojing Zhang, Fangyi Zheng, He Wang, Zheng Wang, Zhanxing Zhu
  • The authors introduce HB Flow (HBF), a piece-wise continuous differential equation that captures the discrete dynamics of the Heavy-Ball momentum method with high precision by controlling discretization error to arbitrary orders of the step size.
  • The study explicitly derives the leading order of the discretization error for HBF and provides a systematic method to reduce this error, bridging the gap between discrete and continuous time models for HB.
  • HBF reveals that HB implicitly regularizes directional smoothness, leading to different learning dynamics compared to gradient descent (GD).
  • The paper applies HBF to analyze the implicit bias of HB in diagonal linear networks, highlighting differences from gradient flow (GF) that cannot be captured by simpler continuous time models.
  • Numerical experiments validate the theoretical findings and demonstrate the practical utility of HBF in understanding momentum-based optimization methods.
Abstract
This paper investigates the Heavy-Ball (HB) momentum method, a widely used optimization algorithm, by bridging the gap between its discrete dynamics and continuous time approximations. The authors propose a novel continuous time model, termed HB Flow (HBF), which incorporates counter terms to explicitly account for discretization error. This approach allows the discretization error to be controlled to arbitrary orders of the step size, providing a more precise characterization of the HB method in continuous time. The study also explores the implicit regularization properties of HB, particularly its preference for solutions with smaller directional smoothness, and applies the findings to analyze the implicit bias of HB in diagonal linear networks. The theoretical contributions are supported by numerical experiments, demonstrating the practical relevance of the proposed framework.
Methodology
The authors construct a piece-wise continuous differential equation (HBF) by adding counter terms to the rescaled gradient flow (RGF) model to explicitly cancel discretization error. They derive the leading order of the discretization error and provide a series expansion to control it to arbitrary orders of the step size. Theoretical analysis is complemented by numerical experiments to validate the proposed model and its implications.
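For reference, the standard discrete Heavy-Ball recursion and its naive continuous-time limit (the rescaled gradient flow that HBF corrects with counter terms) are shown below. These are the textbook formulations, not the paper's exact counter-term expansion.

```latex
% Standard Heavy-Ball recursion and its leading-order continuous-time limit.
\begin{aligned}
  x_{k+1} &= x_k - \eta \,\nabla f(x_k) + \mu\,(x_k - x_{k-1}), \qquad \mu \in [0,1),\\[4pt]
  \dot{x}(t) &= -\frac{1}{1-\mu}\,\nabla f\big(x(t)\big)
  \quad\text{(RGF: the } O(\eta^{0}) \text{ approximation as } \eta \to 0\text{)}.
\end{aligned}
```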
Results
The proposed HBF model achieves a more precise approximation of the discrete HB momentum method compared to existing continuous time models. It demonstrates that HB exhibits implicit regularization for directional smoothness and provides insights into the implicit bias of HB in diagonal linear networks. Numerical experiments confirm the theoretical predictions and show that HBF can effectively capture the learning dynamics of HB.
Implications
The findings have significant implications for understanding and improving momentum-based optimization methods in machine learning. The ability to precisely model HB in continuous time can aid in designing more effective optimization algorithms and provide deeper insights into their implicit regularization properties. The results are particularly relevant for applications in deep learning, where momentum methods are widely used.
View on arXiv

Integrating Dynamical Systems Learning with Foundational Models: A Meta-Evolutionary AI Framework for Clinical Trials

Joseph Geraci, Bessi Qorri, Christian Cumbaa, Mike Tsay, Paul Leonczyk, Luca Pani
  • NetraAI combines dynamical systems theory, information geometry, and evolutionary algorithms to identify stable and interpretable patient subgroups ('Personas').
  • The framework incorporates a meta-evolutionary layer where an LLM acts as a 'Strategist,' guiding the discovery process and ensuring robustness.
  • NetraAI emphasizes stability, interpretability, and domain knowledge integration, making it well-suited for small, high-dimensional clinical datasets.
  • Case studies in schizophrenia, depression, and pancreatic cancer show that NetraAI can significantly improve model performance by identifying high-effect-size subpopulations.
  • The framework represents a shift toward symbiotic AI systems, where specialized learners and LLMs collaborate rather than compete.
Abstract
This paper introduces a novel AI framework, NetraAI, which integrates dynamical systems learning with foundational large language models (LLMs) to address challenges in clinical trial analysis. NetraAI is designed for small, high-dimensional, and sensitive clinical datasets, prioritizing stability, interpretability, and domain knowledge integration over brute-force predictive performance. The framework employs contraction mappings, information geometry, and evolutionary algorithms to identify stable and interpretable patient subgroups, termed 'Personas,' which are defined by compact sets of 2–4 variables. These Personas are clinically meaningful and actionable for trial enrichment. The framework also incorporates a meta-evolutionary layer, where an LLM acts as a 'Strategist,' guiding the discovery process by injecting domain knowledge, prioritizing variables, and ensuring robustness. This two-tier architecture mirrors the human scientific process, with NetraAI functioning as the experimentalist and the LLM as the theorist. Case studies in schizophrenia, depression, and pancreatic cancer demonstrate that NetraAI can transform weak baseline models into near-perfect classifiers by uncovering high-effect-size subpopulations. The paper positions NetraAI as a step toward adaptive, self-reflective AI systems that align with emerging paradigms like concept-level reasoning and embedding-based prediction.
Methodology
NetraAI uses a dynamical systems approach, employing contraction mappings to iteratively cluster patient data into stable attractors that represent latent subgroups. These subgroups are refined using evolutionary algorithms to identify compact, interpretable feature sets ('Personas'). An LLM serves as a meta-evolutionary layer, guiding the process by injecting domain knowledge, prioritizing variables, and validating outputs. The framework embeds principles of reliability engineering to ensure traceable and trustworthy results.
Results
NetraAI demonstrated its effectiveness in case studies involving schizophrenia, depression, and pancreatic cancer. It identified small, high-effect-size subpopulations that transformed weak baseline models (AUC ≈ 0.50–0.68) into near-perfect classifiers using only a few features. This highlights its potential for uncovering actionable insights in clinical trials with limited data.
Implications
NetraAI offers a new paradigm for clinical trial analysis, enabling the discovery of interpretable and actionable patient subgroups in small, high-dimensional datasets. Its emphasis on stability and explainability makes it particularly valuable for high-stakes domains like biomedicine, where transparency and reliability are critical. The framework also exemplifies a symbiotic approach to AI development, where specialized systems and LLMs collaborate, potentially accelerating scientific discovery in other fields.
View on arXiv

LIT-LVM: Structured Regularization for Interaction Terms in Linear Predictors using Latent Variable Models

Mohammadreza Nemati, Zhipeng Huang, Kevin S. Xu
  • LIT-LVM introduces a structured regularization approach for interaction terms in linear predictors using latent variable models.
  • The method assumes a low-dimensional structure for the interaction coefficient matrix, reducing overfitting in high-dimensional settings.
  • LIT-LVM outperforms elastic net and factorization machines in predictive accuracy, especially when the number of interaction terms is large relative to the sample size.
  • The approach provides interpretable low-dimensional latent representations of features, useful for visualization and analysis.
  • A case study on kidney transplantation demonstrates the practical utility of LIT-LVM in biomedical applications.
Abstract
This paper introduces LIT-LVM, a novel approach for estimating coefficients of interaction terms in linear predictors by leveraging structured regularization based on latent variable models. Linear predictors, while simple and interpretable, often struggle to capture non-linear relationships between features. Interaction terms, which represent pairwise feature interactions, can address this limitation but lead to high-dimensional parameter spaces that are prone to overfitting, especially when the number of features (p) is large relative to the number of samples (n). LIT-LVM addresses this challenge by hypothesizing that the interaction coefficient matrix has an approximate low-dimensional structure. Each feature is represented as a latent vector in a low-dimensional space, and the interaction coefficients are modeled as a function of these latent representations. This structured regularization complements traditional methods like lasso and elastic net, improving predictive accuracy and interpretability. The authors demonstrate the effectiveness of LIT-LVM through simulations and real-world datasets, showing superior performance compared to elastic net and factorization machines, particularly in high-dimensional settings. Additionally, the latent feature representations provided by LIT-LVM enable visualization and analysis of feature relationships, with an application to kidney transplantation compatibility modeling highlighted as a case study.
Methodology
The authors propose a structured regularization framework that arranges the interaction coefficients into a matrix and imposes a low-dimensional structure on it. Each feature is represented as a latent vector in a lower-dimensional space, and the interaction coefficients are modeled as a function (e.g., dot product) of these latent vectors. This approach is integrated into linear predictors, such as linear regression and logistic regression, and combined with traditional regularization techniques like elastic net. The method is evaluated on simulated and real-world datasets across regression and classification tasks.
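A minimal sketch of the low-rank interaction idea is given below: each feature gets a latent vector, and the pairwise interaction coefficient is the dot product of the two latent vectors added on top of the main-effect linear term. The loss, penalty form, and hyperparameters are simplified assumptions rather than the authors' exact model.

```python
# Sketch of a linear predictor with low-rank-structured interaction coefficients
# w_ij ~ <u_i, u_j>; a simplified illustration, not the authors' exact model.
import torch
import torch.nn as nn

class LowRankInteractionLinear(nn.Module):
    def __init__(self, n_features, latent_dim=5):
        super().__init__()
        self.linear = nn.Linear(n_features, 1)                               # main effects
        self.U = nn.Parameter(0.01 * torch.randn(n_features, latent_dim))    # latent feature vectors

    def forward(self, x):
        # Interaction matrix W = U U^T; contribution = sum_{i<j} W_ij x_i x_j.
        W = self.U @ self.U.T
        W = W - torch.diag(torch.diag(W))                 # drop self-interactions
        interaction = 0.5 * torch.einsum("bi,ij,bj->b", x, W, x)
        return self.linear(x).squeeze(-1) + interaction

    def penalty(self, l2=1e-3):
        # Structured shrinkage on the latent vectors (stands in for elastic-net terms).
        return l2 * self.U.pow(2).sum()

model = LowRankInteractionLinear(n_features=50)
x = torch.randn(32, 50)
y_hat = model(x)   # train with e.g. MSE or BCE-with-logits plus model.penalty()
```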
Results
LIT-LVM achieves superior predictive accuracy compared to elastic net and factorization machines, particularly in high-dimensional settings where the number of features or interaction terms exceeds the number of samples. The method also provides interpretable latent representations of features, enabling visualization and analysis of feature relationships. In a case study on kidney transplantation, LIT-LVM effectively models donor-recipient compatibility, showcasing its practical applicability.
Implications
LIT-LVM has significant implications for high-dimensional machine learning problems, particularly in domains like biomedicine, where feature interactions are critical but data is often limited. The method's ability to improve prediction accuracy while providing interpretable feature representations makes it valuable for tasks requiring both performance and explainability. Its application to kidney transplantation suggests potential for broader use in healthcare and other fields requiring interaction modeling.
View on arXiv

LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning

Gabriel J. Perin, Runjin Chen, Xuxi Chen, Nina S. T. Hirata, Zhangyang Wang, Junyuan Hong
  • Fine-tuning can erode the safety-critical low-rank subspaces in LLMs, making them vulnerable to attacks.
  • The proposed Low-Rank Extrapolation (LoX) method strengthens safety robustness by extrapolating alignment weight updates in the low-rank subspace.
  • LoX is a training-free, lightweight approach that does not compromise the model's ability to adapt to new tasks.
  • Experimental results show significant reductions in attack success rates, with ASR dropping from 52% to 7% for benign fine-tuning and from 63% to 9% for malicious fine-tuning.
  • LoX moves model parameters into a flatter safety landscape, reducing sensitivity to perturbations.
Abstract
This paper addresses the vulnerability of safety-aligned large language models (LLMs) to fine-tuning, which can erode their safety protections even when the fine-tuning data appears benign. The authors identify that this vulnerability stems from the sensitivity of safety-critical low-rank subspaces in LLM parameters. To mitigate this issue, they propose a novel, training-free method called Low-Rank Extrapolation (LoX). LoX enhances the robustness of safety-aligned LLMs by extrapolating the safety subspace of alignment weight updates, thereby making the models less sensitive to perturbations introduced during fine-tuning. Experimental results demonstrate that LoX significantly reduces attack success rates (ASR) for both benign and malicious fine-tuning attacks, while preserving the model's adaptability to new tasks. The paper also provides insights into the safety landscape of LLMs, showing that LoX moves the model parameters into a flatter region, which is inherently more robust to perturbations.
Methodology
The authors first analyze the role of low-rank subspaces in LLM safety and demonstrate their susceptibility to fine-tuning. Based on this insight, they propose LoX, a training-free method that extrapolates the alignment weight updates in the low-rank subspace using a scaling factor. This extrapolation shifts the model parameters into a flatter safety landscape, enhancing robustness. Extensive experiments and ablation studies are conducted to evaluate the effectiveness of LoX and the impact of different subspace choices and scaling coefficients.
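The extrapolation step itself is simple to sketch: take the weight delta produced by safety alignment, keep its top-k singular subspace, and push the aligned weights further along that subspace by a factor alpha. The rank k and scaling factor below are illustrative choices, not the paper's reported settings.

```python
# Sketch of low-rank extrapolation of an alignment weight update; k and alpha
# are illustrative, and the toy matrices stand in for real LLM weights.
import torch

def low_rank_extrapolate(w_base, w_aligned, k=8, alpha=0.5):
    delta = w_aligned - w_base                                   # alignment weight update
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    delta_lowrank = U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]     # safety-relevant subspace
    return w_aligned + alpha * delta_lowrank                     # extrapolate along that subspace

w_base = torch.randn(256, 256)                       # pre-alignment weight matrix (toy size)
w_aligned = w_base + 0.01 * torch.randn(256, 256)    # post-alignment weights
w_robust = low_rank_extrapolate(w_base, w_aligned)
```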
Results
LoX achieves significant improvements in safety robustness against fine-tuning attacks. For benign fine-tuning using the Dolly dataset, the attack success rate (ASR) decreases from 52% to 7%. For malicious fine-tuning using the Pure Bad dataset, the ASR drops from 63% to 9%. These results demonstrate that LoX effectively mitigates safety vulnerabilities while maintaining the model's adaptability to new tasks.
Implications
The findings highlight the critical role of low-rank subspaces in LLM safety and provide a simple yet effective method to enhance robustness against fine-tuning attacks. LoX can be applied to existing safety-aligned LLMs without retraining, making it a practical solution for real-world applications. This work has implications for improving the safety of LLMs in sensitive domains, such as healthcare, finance, and content moderation, where maintaining robust safety guardrails is essential.
View on arXiv

MedSyn: Enhancing Diagnostics with Human-AI Collaboration

Burcu Sayin, Ipek Baris Schlicht, Ngoc Vo Hong, Sara Allievi, Jacopo Staiano, Pasquale Minervini, Andrea Passerini
  • MedSyn enables multi-step, interactive dialogues between physicians and LLMs to refine diagnoses and treatment decisions.
  • The framework addresses limitations of static decision-support tools by fostering dynamic, real-time collaboration.
  • Preliminary evaluations of 25 open-source LLMs identified promising candidates for sustained, in-depth medical dialogues.
  • Simulated physician-LLM interactions demonstrated improved diagnostic accuracy and clarity through iterative questioning.
  • Future work will involve real-world validation with clinicians to assess MedSyn's impact on diagnostic accuracy and patient outcomes.
Abstract
This paper introduces MedSyn, a hybrid human-AI framework designed to enhance clinical decision-making through interactive, multi-step dialogues between physicians and large language models (LLMs). Unlike static decision-support tools, MedSyn facilitates dynamic exchanges where physicians can challenge AI-generated suggestions and receive alternative perspectives. The framework aims to mitigate cognitive biases, incomplete information, and case complexity often encountered in clinical settings. Using a curated dataset from MIMIC-IV and MIMIC-IV-Note, the authors evaluate 25 open-source LLMs for their ability to sustain coherent, multi-turn medical dialogues. Preliminary results indicate that interactive exchanges between physicians and LLMs improve diagnostic accuracy and clarity. The study highlights the potential of open-source LLMs as conversational partners in medical diagnostics and outlines future plans to validate MedSyn with real clinicians in live clinical environments.
Methodology
The authors developed the MedSyn framework to facilitate multi-turn interactions between physicians and LLMs. They curated a dataset from MIMIC-IV and MIMIC-IV-Note and evaluated 25 open-source LLMs for their ability to engage in coherent, multi-turn medical dialogues. Promising models, including LLaMA3 (8B and 70B), Gemma2 (27B), and DeepSeek-R1, were selected for further experimentation. Simulated physician-LLM conversations were conducted in a controlled setting to assess the framework's effectiveness.
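Structurally, such a multi-turn exchange is just a growing chat-message list that alternates model suggestions and physician challenges. In the sketch below, `chat(messages)` is a placeholder for whatever local LLM backend serves the selected open-source model; it is not a real API from the paper.

```python
# Schematic multi-turn physician-LLM exchange; `chat(messages)` is a placeholder
# for a local LLM backend, not an actual MedSyn interface.

def diagnostic_dialogue(chat, case_summary, physician_turns):
    messages = [
        {"role": "system", "content": "You are a clinical decision-support assistant. "
                                      "Ask clarifying questions and justify each suggestion."},
        {"role": "user", "content": case_summary},
    ]
    for turn in physician_turns:
        reply = chat(messages)                                   # model proposes or refines a differential
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": turn})       # physician challenges or adds information
    return chat(messages), messages                              # final assessment plus full transcript
```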
Results
Preliminary results show that interactive, multi-step exchanges between physicians and LLMs lead to more comprehensive patient assessments and enhanced diagnostic clarity. While some models struggled with multi-turn coherence, others demonstrated the ability to sustain in-depth discussions. Qualitative analysis by physicians supported the potential of MedSyn to improve diagnostic accuracy.
Implications
MedSyn has the potential to transform clinical decision-making by integrating human-AI collaboration into medical diagnostics. Its ability to enhance diagnostic accuracy and mitigate cognitive biases could improve patient outcomes, particularly in complex or ambiguous cases. Future real-world validation could pave the way for its deployment in clinical settings, offering a scalable solution for augmenting physician expertise.
View on arXiv

Muon Optimizes Under Spectral Norm Constraints

Lizhang Chen, Jonathan Li, Qiang Liu
  • Muon is interpreted as a special case of the Lion-K optimizer family, using the nuclear norm as the convex function.
  • Muon implicitly enforces spectral norm constraints on weight matrices, acting as a form of spectral regularization.
  • Theoretical convergence guarantees are established for Muon under both deterministic and stochastic gradient settings.
  • The Lion-K framework allows for generalizations of Muon to other convex functions, enabling new optimization algorithms with diverse regularization effects.
  • Empirical results confirm Muon's implicit spectral regularization and its adaptability to large-scale neural network training.
Abstract
This paper provides a theoretical foundation for the Muon optimizer, a recently proposed optimization algorithm that has shown strong empirical performance in training large-scale neural networks. The authors situate Muon within the Lion-K framework of optimizers, demonstrating that Muon corresponds to a special case where the nuclear norm is used as the convex function. They show that Muon implicitly solves an optimization problem that enforces spectral norm constraints on weight matrices, offering a novel interpretation of its regularization effects. The paper also extends Muon by generalizing it to a broader family of optimizers based on alternative convex functions. The authors provide rigorous convergence analysis for Muon under both deterministic and stochastic gradient settings, proving that it converges to Karush–Kuhn–Tucker (KKT) points of the constrained optimization problem. Empirical experiments validate the theoretical findings, showing that Muon enforces implicit spectral regularization and can be adapted to other forms of regularization through the Lion-K framework.
Methodology
The authors analyze Muon within the Lion-K framework, leveraging its theoretical properties to interpret Muon's behavior as solving a constrained optimization problem. They conduct convergence analysis using Lyapunov functions and Karush–Kuhn–Tucker (KKT) conditions. Empirical experiments are performed to validate the theoretical insights, including toy examples, constraint verification, and large-scale model training.
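For intuition, a Muon-style matrix update keeps a momentum buffer of the gradient and orthogonalizes it before the weight step, which is what ties the method to spectral-norm control. The sketch below uses an exact SVD for clarity, whereas practical Muon implementations approximate the orthogonalization with a Newton-Schulz iteration; hyperparameters are illustrative.

```python
# Sketch of a Muon-style update for one weight matrix: momentum, then
# orthogonalize the momentum before stepping. SVD replaces the usual
# Newton-Schulz approximation; learning rate and beta are illustrative.
import torch

def muon_style_step(W, grad, momentum_buf, lr=0.02, beta=0.95):
    momentum_buf.mul_(beta).add_(grad)                        # M <- beta * M + G
    U, _, Vh = torch.linalg.svd(momentum_buf, full_matrices=False)
    update = U @ Vh                                           # orthogonalized momentum
    W.sub_(lr * update)                                       # spectral-norm-bounded step
    return W, momentum_buf

W = torch.randn(128, 64)
momentum_buf = torch.zeros_like(W)
grad = torch.randn_like(W)                                    # stand-in for a real gradient
W, momentum_buf = muon_style_step(W, grad, momentum_buf)
```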
Results
The paper establishes that Muon converges to KKT points of a spectral-norm-constrained optimization problem, with proven convergence rates for both deterministic and stochastic gradient scenarios. Empirical experiments demonstrate that Muon enforces implicit spectral regularization and achieves competitive performance in large-scale neural network training. Additionally, generalizations of Muon using alternative convex functions are shown to produce diverse regularization effects.
Implications
The findings provide a deeper understanding of Muon's regularization effects and its theoretical underpinnings, making it a promising tool for training large-scale neural networks. The generalization of Muon through the Lion-K framework opens up new avenues for designing optimizers with tailored regularization properties, potentially improving training stability and generalization in various machine learning tasks.
View on arXiv

ODD: Overlap-aware Estimation of Model Performance under Distribution Shift

Aayush Mishra, Anqi Liu
  • The paper identifies a limitation of the Dis2 framework: it becomes unstable in regions where the source and target domains overlap, leading to overly pessimistic estimates.
  • The authors propose Overlap-aware Disagreement Discrepancy (ODD), which incorporates domain-overlap awareness to improve performance estimation.
  • ODD uses domain classifiers to estimate overlap and discounts disagreement in overlapping regions, leading to tighter performance bounds.
  • Theoretical analysis shows that ODD maintains the validity of performance bounds while improving their tightness.
  • Experiments on multiple benchmarks demonstrate that ODD achieves lower performance-estimation error compared to Dis2 while maintaining reliability.
Abstract
This paper addresses the challenge of estimating the performance of machine learning models under distribution shifts, a critical problem for deploying models in high-stakes applications. The authors build upon the Disagreement Discrepancy (Dis2) framework, which bounds the target domain error of a source-trained classifier by optimizing for a worst-case critic that maximally disagrees with the classifier in the target domain while agreeing in the source domain. However, the authors identify a limitation in Dis2: it creates instability in the overlapping region between source and target domains, leading to overly pessimistic performance estimates. To address this, the authors propose Overlap-aware Disagreement Discrepancy (ODD), which incorporates domain-overlap awareness into the optimization process. By leveraging domain classifiers to estimate overlap and discounting disagreement in overlapping regions, ODD provides tighter and more reliable performance bounds. The paper demonstrates the effectiveness of ODD through theoretical analysis and extensive experiments on benchmark datasets, showing that it outperforms Dis2 in terms of prediction accuracy and reliability.
Methodology
The authors extend the Dis2 framework by introducing a new training objective that discounts disagreement in overlapping regions between source and target domains. They use domain classifiers to estimate the degree of overlap and incorporate this information into the optimization process. Theoretical analysis is conducted using the notion of the ideal joint hypothesis, and empirical validation is performed on a variety of datasets and training methods.
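The overlap-discounting idea can be sketched as follows: a domain classifier scores how source-like each target point is, and the critic's disagreement is down-weighted where overlap is high. The specific weighting (1 - p_source) is an illustrative choice, not ODD's exact discounting scheme.

```python
# Sketch: domain-classifier overlap estimate used to discount disagreement in
# overlapping regions; the weighting form is an illustrative assumption.
import numpy as np
from sklearn.linear_model import LogisticRegression

def overlap_weights(X_source, X_target):
    X = np.vstack([X_source, X_target])
    d = np.concatenate([np.zeros(len(X_source)), np.ones(len(X_target))])  # 1 = target
    clf = LogisticRegression(max_iter=1000).fit(X, d)
    p_source = clf.predict_proba(X_target)[:, 0]     # high where domains overlap
    return 1.0 - p_source                            # discount disagreement in overlap

def weighted_disagreement(critic_preds, classifier_preds, weights):
    disagree = (critic_preds != classifier_preds).astype(float)
    return float(np.average(disagree, weights=weights))

rng = np.random.default_rng(0)
X_s = rng.normal(0.0, 1, size=(500, 10))
X_t = rng.normal(0.5, 1, size=(500, 10))             # shifted but overlapping target
w = overlap_weights(X_s, X_t)
print(w.mean())  # closer to 0.5 when the domains overlap heavily
```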
Results
ODD achieves tighter performance bounds compared to Dis2, reducing performance-estimation error while maintaining reliable coverage. The method is shown to be effective across multiple benchmarks, demonstrating its robustness and practical utility in scenarios with distribution shifts.
Implications
The proposed ODD framework has significant implications for deploying machine learning models in safety-critical applications, such as healthcare and autonomous systems, where reliable performance estimation under distribution shifts is essential. It also provides a foundation for further research into overlap-aware methods for domain adaptation and robustness.
View on arXiv

Pixel-level Certified Explanations via Randomized Smoothing

Alaa Anani, Tobias Lorenz, Mario Fritz, Bernt Schiele
  • Introduces the first pixel-level certification framework for attribution methods, certifying robustness under ℓ2-bounded perturbations.
  • Reframes attribution as a segmentation task by sparsifying attribution maps into binary pixel-importance classes.
  • Proposes three novel evaluation metrics: %certified (robustness), Certified GridPG (localization), and a deletion-based faithfulness score.
  • Demonstrates the framework's effectiveness on 12 attribution methods and 5 ImageNet models, with LRP and RISE showing the best trade-offs.
  • Provides actionable insights for creating robust, interpretable, and trustworthy attribution maps for downstream tasks.
Read More
Abstract
This paper introduces the first certification framework for pixel-level robustness in post-hoc attribution methods, addressing the vulnerability of attribution maps to small, imperceptible input perturbations. The authors propose a novel approach that reformulates attribution as a segmentation task by sparsifying attribution maps into binary pixel-importance classes. Using randomized smoothing, the framework certifies whether each pixel is robustly important, unimportant, or abstains from certification under ℓ2-bounded perturbations. The paper also introduces three evaluation metrics—%certified, Certified Grid Pointing Game (Certified GridPG), and a deletion-based faithfulness score—to assess robustness, localization, and faithfulness of certified attributions. Extensive experiments on 12 attribution methods across 5 ImageNet models (including CNNs and transformers) demonstrate that methods like LRP and RISE achieve the best trade-offs between robustness, localization, and faithfulness. The proposed certified attribution maps provide interpretable and reliable explanations, enabling their use in high-stakes applications such as autonomous driving and medical imaging.
Methodology
The authors reformulate attribution as a segmentation task by sparsifying attribution maps into binary classes ('important' or 'unimportant') and apply randomized smoothing to certify pixel-level robustness under ℓ2-bounded perturbations. The framework is evaluated using three metrics—%certified, Certified GridPG, and a deletion-based faithfulness score—on 12 attribution methods across 5 ImageNet models, including CNNs and transformers.
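A minimal sketch of per-pixel certification via randomized smoothing is given below, assuming a user-supplied `binarize_attr` function that maps an image to a {0, 1} pixel-importance map. The noise level, sample count, confidence level, and the Clopper–Pearson bound with a Gaussian certified radius follow standard randomized-smoothing practice rather than the paper's exact procedure.

```python
import numpy as np
from scipy.stats import beta, norm

def certify_pixels(binarize_attr, x, sigma=0.25, n=100, alpha=0.001):
    """Per-pixel certificate via randomized smoothing.

    Returns a decision map (1 = certified important, 0 = certified unimportant,
    -1 = abstain) and a per-pixel certified l2 radius.
    """
    votes = None
    for _ in range(n):
        noisy = x + sigma * np.random.randn(*x.shape)
        b = binarize_attr(noisy).astype(float)
        votes = b if votes is None else votes + b

    k_major = np.maximum(votes, n - votes)                # majority-class vote counts
    p_lower = beta.ppf(alpha, k_major, n - k_major + 1)   # Clopper-Pearson lower bound
    certified = p_lower > 0.5
    label = (votes > n - votes).astype(int)               # 1 = important, 0 = unimportant
    decision = np.where(certified, label, -1)             # -1 = abstain
    radius = np.where(certified, sigma * norm.ppf(p_lower), 0.0)
    return decision, radius
```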
Results
The proposed framework successfully certifies pixel-level robustness for various attribution methods, creating interpretable and reliable certified attribution maps. LRP and RISE achieve the best trade-offs between robustness, localization, and faithfulness. Quantitative evaluations reveal significant differences between attribution methods, highlighting the utility of the proposed metrics in assessing robustness and interpretability.
Implications
This work has significant implications for high-stakes applications like autonomous driving, medical imaging, and judicial systems, where robust and interpretable model explanations are critical. The certified attribution maps can enhance trust in AI systems by providing reliable and interpretable insights into model predictions, paving the way for safer and more transparent deployment of machine learning models.
View on arXiv

Protein Language Model Zero-Shot Fitness Predictions are Improved by Inference-only Dropout

Aditya Ravuri, Neil D. Lawrence
  • Inference-time dropout improves zero-shot fitness predictions of PLMs without requiring retraining.
  • The method involves injecting a dropout layer between the embedding and transformer blocks and averaging outputs over multiple forward passes.
  • Performance gains are observed across all model sizes, with the most significant improvement in smaller models (e.g., 35M parameters).
  • A dropout rate of 0.1 is consistently effective across different PLM configurations.
  • The approach is computationally efficient and yields a proxy for model uncertainty that can support out-of-domain detection.
Read More
Abstract
This paper explores the use of inference-time dropout to improve the zero-shot fitness prediction capabilities of Protein Language Models (PLMs), specifically ESM2. The authors introduce a dropout layer between the embedding and transformer blocks of the PLM during inference, without requiring retraining or fine-tuning of the model. By averaging the outputs over multiple forward passes (akin to Monte Carlo dropout), the method enhances the model's performance on a subset of the ProteinGym dataset, which evaluates the effects of mutations on protein fitness. The study demonstrates that this approach improves Spearman rank correlation coefficients (SRCC) across various model sizes, with a dropout rate of 0.1 being particularly effective. The authors hypothesize that the improvement stems from better model calibration and propose future work to expand the dataset, refine scoring functions, and explore larger models.
Methodology
The authors use the ESM2 suite of pretrained PLMs and evaluate their method on a subset of the ProteinGym dataset, which includes 50 protein families. They introduce a dropout layer at inference time between the embedding and transformer blocks of the PLM. By running 100 Monte Carlo forward passes and averaging the outputs, they compute scalar fitness proxies using a simplistic scoring function. Performance is measured using the Spearman rank correlation coefficient (SRCC) between predicted and true fitness values.
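The core mechanism is straightforward to reproduce with any PyTorch transformer. The sketch below assumes generic `embed` and `encoder` submodules (real ESM2 checkpoints expose different attribute names and a specific fitness-scoring function): it injects dropout after the embedding and averages logits over stochastic forward passes.

```python
import torch
import torch.nn as nn

class InferenceDropoutPLM(nn.Module):
    """Wraps a frozen protein language model, injecting dropout between the
    embedding layer and the transformer blocks at inference time."""
    def __init__(self, embed, encoder, p=0.1):
        super().__init__()
        self.embed, self.encoder = embed, encoder
        self.dropout = nn.Dropout(p)

    def forward(self, tokens):
        return self.encoder(self.dropout(self.embed(tokens)))

@torch.no_grad()
def mc_average_logits(model, tokens, n_samples=100):
    """Average token logits over stochastic forward passes (Monte Carlo dropout)."""
    model.eval()
    model.dropout.train()   # keep only the injected dropout stochastic
    return torch.stack([model(tokens) for _ in range(n_samples)]).mean(dim=0)
```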
Results
The proposed inference-time dropout method improves zero-shot fitness prediction performance across all tested PLM sizes. The 35M parameter model shows the most striking improvement, while larger models (150M and 15B parameters) also benefit when dropout is applied to early transformer layers. A dropout rate of 0.1 is found to be optimal. The method enhances model calibration and provides better predictions for out-of-domain examples.
Implications
This approach has significant implications for protein engineering and bioinformatics, as it enables improved zero-shot fitness predictions without requiring additional training. The method could be applied to other domains where PLMs are used for property prediction, particularly in scenarios with limited labeled data. Additionally, it provides insights into model uncertainty and calibration, which could inform the development of more robust PLMs.
View on arXiv

SFT-GO: Supervised Fine-Tuning with Group Optimization for Large Language Models

Gyuhak Kim, Sumiran Singh Thakur, Su Min Park, Wei Wei, Yujia Bao
  • SFT-GO introduces a token grouping mechanism to prioritize semantically important tokens during supervised fine-tuning of LLMs.
  • The method combines worst-group loss with standard cross-entropy loss to improve learning dynamics and handle diverse token distributions.
  • Three token grouping strategies are proposed: TF-IDF-based, semantics-based (LLMLingua-2), and task-specific (Rho-1).
  • Theoretical analysis proves SFT-GO's convergence efficiency, and empirical results show consistent improvements across benchmarks.
  • SFT-GO demonstrates flexibility in defining token importance, making it adaptable to various tasks and datasets.
Read More
Abstract
This paper introduces SFT-GO (Supervised Fine-Tuning with Group Optimization), a novel approach to improve supervised fine-tuning (SFT) for large language models (LLMs) by prioritizing semantically important tokens during training. Traditional SFT methods treat all tokens equally, which can lead to suboptimal performance as not all tokens contribute equally to task-specific semantics. SFT-GO addresses this by grouping tokens based on their importance and optimizing the model using a weighted combination of the worst-group loss and standard cross-entropy loss. This approach emphasizes challenging token groups, improving the model's ability to handle diverse token distributions. The authors propose three token grouping strategies: a statistics-based method using TF-IDF, a semantics-based method leveraging a compression model (LLMLingua-2), and a task-specific method reformulated from an existing framework (Rho-1). Theoretical analysis demonstrates SFT-GO's efficiency and convergence properties, while empirical evaluations on instruction-tuning datasets (LIMA and Alpaca) show consistent performance improvements across multiple benchmarks and base models. These results highlight the effectiveness and robustness of SFT-GO in enhancing LLM fine-tuning.
Methodology
SFT-GO groups tokens based on their importance using three strategies: (1) TF-IDF scores for statistical significance, (2) token-selection probabilities from LLMLingua-2 for semantic relevance, and (3) excess loss calculations from Rho-1 for task-specific utility. The model is optimized using a weighted combination of the worst-group loss and standard cross-entropy loss. Theoretical convergence properties are analyzed, and experiments are conducted on LIMA and Alpaca datasets using Llama 3.2-3B and Llama 3.1-8B models.
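The loss combination can be sketched directly from this description: compute per-token cross-entropy, average it within each importance group, and mix the worst group's loss with the standard mean. The grouping and the mixing weight `lam` below are illustrative placeholders rather than the paper's tuned settings.

```python
import torch
import torch.nn.functional as F

def sft_go_loss(logits, labels, group_ids, num_groups, lam=0.5):
    """Weighted combination of worst-group and standard token-level CE loss.

    group_ids assigns each target token to an importance group (e.g. TF-IDF
    buckets); empty groups contribute zero so the max stays well defined.
    """
    per_token = F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), reduction="none"
    )
    gids = group_ids.view(-1)
    group_losses = torch.stack([
        per_token[gids == g].mean() if (gids == g).any() else per_token.new_zeros(())
        for g in range(num_groups)
    ])
    return lam * group_losses.max() + (1.0 - lam) * per_token.mean()
```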
Results
SFT-GO consistently outperforms standard fine-tuning baselines across seven widely recognized benchmarks. The TF-IDF and LLMLingua-2 grouping strategies show significant improvements in commonsense reasoning tasks. The method demonstrates robustness across datasets and base models, validating its effectiveness in enhancing instruction-tuning pipelines.
Implications
SFT-GO has the potential to improve the alignment of LLMs with human expectations and task-specific requirements by focusing on semantically important tokens. Its flexibility in defining token importance makes it applicable to a wide range of tasks, including instruction-tuning, domain-specific fine-tuning, and improving model performance on challenging datasets. This approach could also inspire further research into token-level optimization techniques for LLMs.
View on arXiv

Sampling 3D Molecular Conformers with Diffusion Transformers

J. Thorben Frank, Winfried Ripken, Gregor Lied, Klaus-Robert Müller, Oliver T. Unke, Stefan Chmiela
  • Introduces DiTMC, a modular Diffusion Transformer architecture tailored for 3D molecular conformer generation.
  • Proposes two novel graph-based conditioning strategies to integrate molecular connectivity into the generative process.
  • Explores the impact of standard (non-equivariant) and SO(3)-equivariant self-attention mechanisms on model performance.
  • Achieves state-of-the-art precision and physical validity on benchmarks like GEOM-QM9, -DRUGS, and -XL.
  • Demonstrates scalability and competitive performance of simpler, non-equivariant attention mechanisms.
Read More
Abstract
This paper introduces DiTMC, a novel adaptation of Diffusion Transformers (DiTs) for the task of 3D molecular conformer generation. Molecular conformers, which represent the 3D arrangements of atoms in a molecule, are critical for applications in drug discovery and material design. The authors address key challenges in adapting DiTs to molecular data, such as integrating discrete molecular graph information with continuous 3D geometries, handling Euclidean symmetries, and designing scalable conditioning mechanisms for molecules of varying sizes. DiTMC incorporates two graph-based conditioning strategies and explores different self-attention mechanisms, including both standard (non-equivariant) and SO(3)-equivariant formulations. The proposed architecture achieves state-of-the-art (SOTA) performance on established benchmarks (GEOM-QM9, -DRUGS, -XL), demonstrating high precision and physical validity in the generated molecular conformers. The study also highlights the trade-offs between model accuracy and computational efficiency, offering insights into the role of architectural choices and symmetry priors in generative modeling for molecular systems.
Methodology
The authors adapt the Diffusion Transformer (DiT) architecture to molecular conformer generation by introducing a modular framework, DiTMC, that separates the processing of 3D atomic coordinates from molecular graph conditioning. Two graph-based conditioning strategies are proposed, leveraging trainable tokens for atomic pairs. The model employs various self-attention mechanisms, including standard non-equivariant and SO(3)-equivariant formulations, to balance accuracy and computational efficiency. The framework predicts atomic velocities to model a probability flow ODE, enabling sampling from the molecular conformer distribution.
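Sampling from such a velocity-predicting model amounts to integrating a probability-flow ODE. The sketch below uses a plain Euler integrator with a user-supplied `velocity_model` and graph-conditioning object; the time grid, parameterization, and interface are assumptions rather than the paper's implementation.

```python
import torch

@torch.no_grad()
def sample_conformer(velocity_model, graph_cond, num_atoms, steps=100):
    """Euler integration of a probability-flow ODE: start from Gaussian noise
    at t = 1 and follow the predicted atomic velocities down to t = 0."""
    x = torch.randn(num_atoms, 3)                       # noisy 3D coordinates
    ts = torch.linspace(1.0, 0.0, steps + 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        v = velocity_model(x, float(t0), graph_cond)    # predicted per-atom velocities
        x = x + (float(t1) - float(t0)) * v             # Euler step (dt < 0)
    return x
```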
Results
DiTMC achieves state-of-the-art performance on standard molecular conformer generation benchmarks (GEOM-QM9, -DRUGS, -XL), producing conformers with high precision and physical validity. The study finds that while SO(3)-equivariant attention mechanisms improve fidelity, simpler non-equivariant mechanisms are computationally efficient and still competitive. The generated molecular ensembles exhibit realistic physical properties, validating the model's effectiveness.
Implications
The proposed DiTMC framework has significant implications for computational drug discovery and material design, where accurate and efficient sampling of 3D molecular conformers is critical. Its scalability and performance make it a promising candidate for large-scale generative modeling of molecular structures, potentially accelerating the discovery of novel drugs and materials.
View on arXiv

Self-Composing Policies for Scalable Continual Reinforcement Learning

Mikel Malagón, Josu Ceberio, Jose A. Lozano
  • Introduces CompoNet, a modular neural network architecture for continual reinforcement learning that autonomously composes previously learned policies.
  • Avoids catastrophic forgetting and interference by freezing learned modules and selectively composing them for new tasks.
  • Achieves linear growth in parameters with respect to the number of tasks, significantly improving scalability compared to prior methods.
  • Demonstrates robust knowledge transfer and efficient learning on benchmark continuous control and visual tasks.
  • Balances scalability and plasticity without requiring additional networks for module composition.
Read More
Abstract
This paper introduces CompoNet, a modular and growable neural network architecture designed for continual reinforcement learning (CRL). CompoNet addresses the challenges of catastrophic forgetting and interference by enabling autonomous composition of previously learned policy modules. Unlike traditional growable neural networks, which often suffer from quadratic growth in parameters or require additional networks for module composition, CompoNet grows linearly in size with respect to the number of tasks. Each module in CompoNet selectively combines outputs from prior modules with its own internal policy, facilitating efficient knowledge transfer and accelerating learning for new tasks. The architecture is evaluated on diverse robotic manipulation tasks from the Meta-World environment and visual control tasks from the Arcade Learning Environment, demonstrating superior performance in knowledge transfer, robustness, and scalability compared to existing CRL methods.
Methodology
CompoNet adds a new trainable module for each new task while freezing previously learned modules. Each module can access and compose the outputs of earlier modules alongside its internal policy, creating a cascading structure of policies. The architecture is evaluated using Soft Actor-Critic (SAC) for robotic manipulation tasks and Proximal Policy Optimization (PPO) for visual control tasks. Experiments compare CompoNet's performance to other CRL methods in terms of knowledge transfer, robustness, and scalability.
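A minimal version of this cascading composition is a new trainable head that consumes the observation together with the outputs of the frozen earlier policies. The concatenation-based composition below is an illustrative simplification; CompoNet's actual composition mechanism may differ.

```python
import torch
import torch.nn as nn

class ComposingPolicy(nn.Module):
    """New task head that composes the frozen outputs of earlier policies."""
    def __init__(self, obs_dim, act_dim, prior_modules, hidden=256):
        super().__init__()
        self.prior = nn.ModuleList(prior_modules)
        for module in self.prior:                 # freeze previously learned policies
            for p in module.parameters():
                p.requires_grad_(False)
        in_dim = obs_dim + act_dim * len(self.prior)
        self.head = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, act_dim)
        )

    def forward(self, obs):
        with torch.no_grad():                     # prior policies are inference-only
            prior_out = [m(obs) for m in self.prior]
        return self.head(torch.cat([obs, *prior_out], dim=-1))
```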
Results
CompoNet outperforms alternative CRL methods in terms of knowledge transfer and task performance across diverse benchmarks. It efficiently learns new tasks when prior modules provide useful information and can learn from scratch without interference when prior knowledge is irrelevant. The architecture achieves linear parameter growth and scales efficiently in inference time, addressing key limitations of existing growable neural networks.
Implications
CompoNet's scalable and modular design makes it well-suited for real-world applications requiring continual learning, such as robotics, autonomous systems, and adaptive control in dynamic environments. Its ability to balance scalability and plasticity could inspire future research in lifelong learning and modular neural network architectures.
View on arXiv

Semi-supervised Graph Anomaly Detection via Robust Homophily Learning

Guoguo Ai, Hezhe Qiao, Hui Yan, Guansong Pang
  • RHO introduces adaptive frequency response filters (AdaFreq) to capture diverse homophily patterns in labeled normal nodes.
  • Graph normality alignment (GNA) ensures consistency between channel-wise and cross-channel homophily representations.
  • RHO addresses the limitations of existing methods that assume uniform homophily among normal nodes.
  • The proposed method achieves state-of-the-art performance on eight real-world GAD datasets.
  • RHO is particularly effective in handling low-homophily normal nodes, which are often misclassified by existing methods.
Read More
Abstract
This paper introduces RHO (Robust Homophily Learning), a novel approach for semi-supervised graph anomaly detection (GAD). Semi-supervised GAD aims to identify abnormal nodes in a graph using a small set of labeled normal nodes. Existing methods often assume that normal nodes exhibit uniform homophily and that labeled nodes adequately represent the overall homophily patterns. However, these assumptions fail in real-world datasets where normal nodes display diverse homophily levels. RHO addresses this limitation with two key modules: adaptive frequency response filters (AdaFreq) and graph normality alignment (GNA). AdaFreq learns adaptive spectral filters to capture varying homophily patterns across node attributes in both channel-wise and cross-channel views. GNA ensures consistency in the learned representations by aligning the homophily representations across these views. Extensive experiments on eight real-world datasets demonstrate that RHO significantly outperforms state-of-the-art methods, effectively learning robust and consistent representations of normal nodes with diverse homophily levels.
Methodology
RHO consists of two modules: (1) AdaFreq, which learns adaptive spectral filters to capture varying homophily patterns in both channel-wise and cross-channel views of node attributes, and (2) GNA, which aligns the homophily representations learned across these views by maximizing the similarity of positive pairs (same node across views) and minimizing the similarity of negative pairs (different nodes). This ensures robust and consistent representations of normal nodes with diverse homophily levels.
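The GNA alignment step, as described, is a contrastive objective over the two views. A common instantiation, treating the same node across views as the positive pair and all other nodes as negatives, is sketched below; the temperature and cosine-similarity choice are illustrative, not necessarily RHO's.

```python
import torch
import torch.nn.functional as F

def gna_alignment_loss(z_channel, z_cross, temperature=0.5):
    """Contrastive alignment of channel-wise and cross-channel node embeddings:
    the same node across views is the positive pair, other nodes are negatives."""
    z1 = F.normalize(z_channel, dim=-1)
    z2 = F.normalize(z_cross, dim=-1)
    logits = z1 @ z2.t() / temperature                    # [N, N] cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)
```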
Results
RHO outperforms state-of-the-art semi-supervised GAD methods across eight real-world datasets. It demonstrates superior robustness in learning heterogeneous normal patterns, particularly for low-homophily normal nodes. The experiments validate the effectiveness of AdaFreq in capturing diverse homophily patterns and GNA in ensuring representation consistency.
Implications
RHO's ability to handle diverse homophily patterns in graph data makes it highly applicable to real-world scenarios such as fraud detection, spam detection, and abusive user identification. Its robust performance in semi-supervised settings can reduce the reliance on extensive labeled data, making it practical for large-scale, real-world graphs.
View on arXiv

Unifying VXAI: A Systematic Review and Framework for the Evaluation of Explainable AI

David Dembinsky, Adriano Lucieri, Stanislav Frolov, Hiba Najjar, Ko Watanabe, Andreas Dengel
  • The paper identifies a lack of standardized evaluation protocols in XAI and proposes a unified framework (VXAI) to address this gap.
  • VXAI categorizes evaluation metrics into 41 groups and introduces a three-dimensional scheme based on explanation type, evaluation contextuality, and quality desiderata.
  • The authors highlight the challenges of subjective and inconsistent evaluation practices in XAI, emphasizing the need for rigorous and systematic approaches.
  • The framework supports comparability across XAI methods and provides a flexible foundation for future research and extensions.
  • The study underscores the importance of distinguishing between human-grounded and functionality-grounded evaluations to assess explanation comprehensibility and faithfulness.
Read More
Abstract
This paper addresses the critical need for standardized evaluation protocols in the field of Explainable AI (XAI), which aims to make black-box AI models more interpretable and trustworthy. The authors conduct a systematic literature review of 362 publications, following PRISMA guidelines, to identify and categorize existing evaluation metrics for XAI. They introduce a unified framework called VXAI, which organizes these metrics into 41 functionally similar groups and proposes a three-dimensional categorization scheme based on explanation type, evaluation contextuality, and explanation quality desiderata. The framework aims to provide a comprehensive and structured approach to evaluating XAI methods, enabling systematic metric selection, comparability across studies, and a foundation for future research. The paper highlights the challenges of current XAI evaluation practices, including subjectivity, lack of ground truth, and inconsistent methodologies, and emphasizes the need for both human-grounded and functionality-grounded evaluation approaches.
Methodology
The authors conducted a systematic literature review of 362 publications using PRISMA guidelines. They aggregated existing evaluation metrics into 41 functionally similar groups and proposed a three-dimensional categorization scheme. The framework was designed to address gaps in current XAI evaluation practices and to provide a structured approach for future research.
Results
The VXAI framework offers the most comprehensive and structured overview of XAI evaluation metrics to date. It enables systematic metric selection, promotes comparability across methods, and provides a flexible foundation for extending evaluation protocols. The framework also highlights the need for distinguishing between human-grounded and functionality-grounded evaluations to ensure both comprehensibility and faithfulness of explanations.
Implications
The VXAI framework has the potential to standardize XAI evaluation practices, improving the trustworthiness and applicability of explainable AI systems in high-stakes domains such as healthcare, finance, and autonomous systems. By providing a structured approach, it can foster better comparability across studies, guide the development of new evaluation metrics, and support the broader adoption of XAI in real-world applications.
View on arXiv