Curvature Tuning: Provable Training-free Model Steering From a Single Parameter

Pre-publication

Leyang Hu, Randall Balestriero

Abstract

The scaling of model size and data size has reshaped the paradigm of AI. As a result, the common protocol for leveraging the latest models is to steer them towards a specific downstream task of interest through fine-tuning. Despite its importance, the main methods for fine-tuning remain limited to full fine-tuning or low-rank adapters, which involve numerous hyperparameters and lack interpretability. In this paper, we take a step back and demonstrate how novel and explainable post-training steering solutions can be derived theoretically from spline operators, a rich mathematical framing of Deep Networks that was recently developed. Our method, coined Curvature Tuning (CT), has a single parameter that provably modulates the curvature of the model’s decision boundary, thereby allowing training-free steering. This makes CT both more efficient and more interpretable than conventional fine-tuning methods. We empirically validate its effectiveness in improving the generalization and robustness of pretrained models. For example, CT improves the out-of-distribution transfer performance of ResNet-18/50 by 2.57%/1.74% across seventeen downstream datasets, and improves RobustBench robust accuracy by 11.76%/348.44%. Additionally, we apply CT to ReLU-based Swin-T/S, improving their generalization on nine downstream datasets by 2.43%/3.33%. Our code is available at https://github.com/Leon-Leyang/curvature-tuning.

Preprint sourced from arxiv.org/html/2502.07783v1

Review history is meant to serve as an example; this article did not undergo double-blind peer review.

1 Introduction

The scaling of model and data sizes has given rise to foundation models, such as Llama3 (Dubey et al., 2024) for natural language processing (NLP), DINOv2 (Oquab et al., 2023) for computer vision (CV), CLIP (Radford et al., 2021) and SigLIP (Zhai et al., 2023) for multimodal tasks, and OpenVLA (Kim et al., 2024) for embodied agents. These models are more universally capable than ever, accelerating a paradigm shift in artificial intelligence (AI): from training task-specific models from scratch to leveraging models pretrained on large datasets and fine-tuning them for downstream applications.

Figure 1: Illustration of the Curvature Tuning (CT) mechanism for model steering. CT steers a pretrained model by replacing ReLUs with a $\beta$-parameterized activation function and tuning $\beta$ from 1 to 0, progressively smoothing the model’s decision boundary across tasks (e.g., classification and regression). The $\beta$-parameterized activation function is defined in Equation 7.

Full fine-tuning, the process of steering a pretrained model by adapting all its parameters to downstream datasets, was once the primary approach for transferring knowledge. While it effectively enhances generalization (Radford, 2018) and robustness (Jeddi et al., 2020), it is computationally expensive. To mitigate this, parameter-efficient fine-tuning (PEFT) methods such as Serial Adapter (Houlsby et al., 2019) and LoRA (Hu et al., 2021) have been introduced, which partially alleviate the computational burden (as further training is still required) by fine-tuning only a small subset of parameters. However, these approaches face two additional challenges: a lack of principled design and limited interpretability. For instance, they rely on heuristic choices–such as LoRA’s rank, placement, and initialization–with minimal theoretical guidance. Moreover, they treat the model as a black box, making it unclear how pretrained knowledge is preserved or how the model is steered for downstream tasks. This combination of partial efficiency, heuristic-driven design, and poor interpretability underscores the need for fine-tuning methods that are efficient, principled, and interpretable. We thus ask the following question: How can we construct principled steering solutions addressing both efficiency and interpretability?

We take a first step toward an overarching answer to how new PEFT solutions can be derived from theoretically grounded frameworks. Leveraging the spline framework of Deep Learning (Montufar et al., 2014; Balestriero et al., 2018), we develop a novel solution, Curvature Tuning (CT), which modulates a model’s decision boundary curvature through a single parameter, $\beta$. CT offers several advantages, which we briefly outline below.

CT steers a model in inference mode without backpropagation. Since CT uses a single parameter to modulate the model’s curvature, its optimal value can be determined via cross-validation without requiring training or backpropagation. This property ensures maximal computational and memory efficiency.

CT is interpretable for any value of $\beta$. CT replaces internal activation functions such as ReLU and Leaky ReLU with a convex combination of a reparameterized Swish function (Ramachandran et al., 2017) and a Softplus function, controlled by the parameter $\beta$. This theoretically grounded construction directly modulates the model’s decision boundary curvature. When $\beta=1$, the original activation function is recovered, resulting in a piecewise affine decision boundary. When $\beta=0$, the model becomes entirely linear, making the decision boundary globally affine. Intermediate values of $\beta$ gradually smooth the decision boundary, offering a continuous transition between these two extremes.
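To make the smoothing effect concrete, the following Python sketch implements one possible $\beta$-parameterized activation. It is a hedged illustration, not the paper's Equation 7 verbatim: the function name `ct_activation`, the mixing weight `c = 0.5`, and the way $\beta$ is mapped to the sharpness of the Swish and Softplus terms are all assumptions, chosen so that $\beta \to 1$ recovers ReLU. The behaviour at $\beta=0$ depends on the exact definition in Equation 7 and is not asserted here.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softplus(z):
    # Numerically stable ln(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def relu(x):
    return max(x, 0.0)


def ct_activation(x, beta, c=0.5):
    """Assumed beta-parameterized activation (hypothetical form).

    A convex combination, with assumed weight c, of
      - a Swish term    x * sigmoid(b * x)      with b = beta / (1 - beta)
      - a Softplus term softplus(b * x) / b     with b = 1 / (1 - beta)
    Both terms approach ReLU as beta -> 1.
    """
    if beta >= 1.0:
        return relu(x)  # limiting case: the original activation
    swish_term = x * sigmoid(beta * x / (1.0 - beta))
    softplus_term = (1.0 - beta) * softplus(x / (1.0 - beta))
    return c * swish_term + (1.0 - c) * softplus_term


if __name__ == "__main__":
    # Print the activation on a few inputs for decreasing beta.
    for beta in (1.0, 0.9, 0.5):
        row = [round(ct_activation(x, beta), 4) for x in (-2.0, -0.5, 0.0, 0.5, 2.0)]
        print(beta, row)
```

Under these assumptions, `ct_activation(x, 0.999)` is numerically indistinguishable from `max(x, 0)`, while `ct_activation(0.0, 0.5)` is strictly positive, reflecting the rounded corner at the origin.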

CT significantly improves a model’s performance across tasks and domains while enhancing robustness. We empirically validate the effectiveness of CT through extensive experiments, demonstrating improvements in both generalization and robustness. For same-task generalization, transferring ResNet-18 (He et al., 2016) across seventeen image classification datasets—including MNIST, CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), and ImageNet (Deng et al., 2009)—yields a relative accuracy gain of 2.57%. For cross-task generalization, CT achieves a relative improvement of 0.41% in the mIoU of a PSPNet (Zhao et al., 2017) using an ImageNet-pretrained ResNet-50 as backbone on VOC2012 (Everingham et al., ). Moreover, CT delivers a relative improvement of 11.76% in the robust accuracy of an ImageNet-pretrained ResNet-18 on RobustBench (Croce et al., 2020). Additional experiments with models such as ResNet-50/152, Swin-T/S, as well as additional datasets, further confirm CT’s effectiveness.

A visual depiction of the CT mechanism is shown in Figure 1, and our key contributions are summarized below:

Theoretical Contribution: We introduce Curvature Tuning (CT), a training-free model steering technique that provably adjusts the curvature of model decision boundaries using a single parameter. This principled design ensures both efficiency and interpretability. Details are provided in Section 3.
Empirical Contribution: We demonstrate in Section 4 that CT enhances generalization and robustness across various models, datasets, and tasks. For example, CT improves out-of-distribution transfer performances of ResNet-18/50 by 2.57%/1.74% across seventeen downstream datasets, and improves RobustBench robust accuracy by 11.76%/348.44%. It also improves generalization of ReLU-based Swin-T/S on nine downstream datasets by 2.43%/3.33%.

The remainder of this paper is organized as follows: Section 2 reviews current fine-tuning techniques and introduces relevant spline concepts, the foundation for our method. Section 3 details our proposed method and its theoretical guarantees. Section 4 presents experimental results, and Section 5 summarizes our findings and potential future directions.

2 Background

This section presents a concise review of current fine-tuning techniques and their limitations in Section 2.1, followed in Section 2.2 by an introduction to relevant concepts in splines and their connections to Deep Networks (DNs), which are foundational for understanding CT.

2.1 The Fine-tuning Menagerie

Fine-tuning, in the context of this paper, refers to adapting a pretrained model to improve its ability to solve a particular downstream task of interest. Initially, the common practice was to take the downstream task and continue training all of the model parameters, a process commonly referred to as full fine-tuning. Notable examples include GPT (Radford, 2018) and DINO (Caron et al., 2021). However, as model sizes continue to grow, performing full fine-tuning on the latest models would require immense infrastructure and often result in poor performance due to the small size of many downstream task datasets. Given these challenges, parameter-efficient fine-tuning (PEFT) methods were developed to mitigate the cost while maintaining effectiveness.

To better understand the landscape of PEFT approaches, we adopt the categorization proposed by Han et al. (2024), which organizes these methods into four primary categories. Additive PEFT introduces additional trainable parameters to the pretrained model, training only these new parameters during fine-tuning. Examples include Serial Adapter (Houlsby et al., 2019), Prefix-tuning (Li & Liang, 2021), and (IA)³ (Liu et al., 2022). Selective PEFT identifies a subset of existing parameters for fine-tuning, with examples such as U-Diff pruning and S-Diff pruning (Guo et al., 2020). Reparameterized PEFT decomposes pretrained weights into low-rank matrices, fine-tuning only the low-rank components, which are converted back during inference; examples include LoRA (Hu et al., 2021) and DyLoRA (Valipour et al., 2022). Hybrid PEFT combines multiple PEFT approaches, such as UniPELT (Mao et al., 2021) and S4 (Chen et al., 2023).

While these techniques vary in the parameters they modify, they all require further training, which remains computationally expensive. In particular, backpropagation presents significant challenges for larger models. Additionally, their application often involves tuning numerous hyperparameters, typically guided by heuristics with limited theoretical justification, making it difficult to determine optimal values. Moreover, deep learning training remains largely opaque, complicating the understanding of how pretrained knowledge is preserved and limiting interpretability. For instance, deploying LoRA involves multiple design choices, including selecting the layers where it should be applied (Gao et al., 2024), determining its rank (Valipour et al., 2022; Chen et al., 2024), choosing the scaling factor during inference (Kalajdzievski, 2023), and initializing its parameters (Hayou et al., 2024), all of which rely primarily on heuristics. Furthermore, even with a low-rank configuration, fine-tuning LoRA variants of ResNets—relatively small models compared to contemporary large models—still requires tens of thousands to over a million parameters, as shown in Table 7.

In contrast, our proposed method, CT, bypasses training entirely: it requires no backpropagation, which significantly improves efficiency. Moreover, CT offers greater interpretability, as it directly and provably adjusts the model’s decision boundary, as demonstrated in later sections.

2.2 The Spline formulation of Deep Networks

In this subsection, we review relevant concepts in splines, which provide a mathematical framework for understanding the relationship between piecewise-affine functions and DNs.

A spline function is a function $s:\mathbb{R}^{D}\rightarrow\mathbb{R}$ defined piecewise by polynomials. An affine spline function is a special case where each piece is defined by an affine function. Such a function can be parameterized by three components:

$\mathbf{A}\in\mathbb{R}^{R\times D}$: a matrix representing the slopes of the affine functions.
$\mathbf{b}\in\mathbb{R}^{R}$: a vector representing the offsets of the affine functions.
$\Omega\triangleq\{\omega_{1},\dots,\omega_{R}\}$: a partition of the input space $\mathbb{R}^{D}$ into $R$ regions.

For an input $\mathbf{x}\in\mathbb{R}^{D}$, the affine spline function is defined as:

$$s[\mathbf{A},\mathbf{b},\Omega](\mathbf{x})=\sum_{r=1}^{R}\big(\langle\mathbf{A}_{r,\cdot},\mathbf{x}\rangle+\mathbf{b}_{r}\big)\mathbf{1}_{\{\mathbf{x}\in\omega_{r}\}}$$

where $\mathbf{1}_{\{\mathbf{x}\in\omega_{r}\}}$ is an indicator function that equals 1 if $\mathbf{x}$ belongs to region $\omega_{r}$ and 0 otherwise.
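As a hedged illustration of this definition, the sketch below evaluates an affine spline in plain Python. The function name `affine_spline` and the representation of each region $\omega_{r}$ as a membership predicate are assumptions made for the example; the absolute-value function on $\mathbb{R}$ serves as a two-region instance.

```python
def affine_spline(A, b, regions, x):
    """Evaluate sum_r (<A[r], x> + b[r]) * 1[x in omega_r].

    A       : list of R slope vectors, each of length D
    b       : list of R offsets
    regions : list of R predicates; regions[r](x) is True iff x lies in omega_r
              (hypothetical encoding of the partition Omega)
    x       : input vector of length D
    """
    total = 0.0
    for slope, offset, contains in zip(A, b, regions):
        if contains(x):  # indicator function of region omega_r
            total += sum(a_i * x_i for a_i, x_i in zip(slope, x)) + offset
    return total


# Example: |x| on the real line, with omega_1 = (-inf, 0) and omega_2 = [0, inf).
abs_A = [[-1.0], [1.0]]
abs_b = [0.0, 0.0]
abs_regions = [lambda x: x[0] < 0.0, lambda x: x[0] >= 0.0]

if __name__ == "__main__":
    for value in (-3.0, 0.0, 2.5):
        print(value, affine_spline(abs_A, abs_b, abs_regions, [value]))
```

Because the regions form a partition, exactly one indicator is active for any input, so the sum reduces to the single affine piece that owns $\mathbf{x}$.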

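A max-affine spline function is a special case of an affine spline function that does not require explicit knowledge of $\Omega$. Instead, its output is computed as the maximum value over the affine functions:

$$s[\mathbf{A},\mathbf{b}](\mathbf{x})=\max_{r=1,\dots,R}\big(\langle\mathbf{A}_{r,\cdot},\mathbf{x}\rangle+\mathbf{b}_{r}\big)$$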
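A minimal Python sketch of this maximum-over-affine-pieces computation follows. The names `max_affine_spline` and `active_piece` are hypothetical; ReLU, written as $\max(0, x)$, is used as a one-dimensional instance with $R=2$ pieces.

```python
def affine_values(A, b, x):
    # Value of every affine piece <A[r], x> + b[r] at the input x.
    return [
        sum(a_i * x_i for a_i, x_i in zip(slope, x)) + offset
        for slope, offset in zip(A, b)
    ]


def max_affine_spline(A, b, x):
    # Output is the largest affine piece; no partition needs to be stored.
    return max(affine_values(A, b, x))


def active_piece(A, b, x):
    # Index of the piece attaining the maximum (first index on ties).
    values = affine_values(A, b, x)
    return max(range(len(values)), key=values.__getitem__)


# ReLU as a max-affine spline: pieces 0 * x + 0 and 1 * x + 0.
relu_A = [[0.0], [1.0]]
relu_b = [0.0, 0.0]

if __name__ == "__main__":
    for value in (-2.0, 3.0):
        print(
            value,
            max_affine_spline(relu_A, relu_b, [value]),
            active_piece(relu_A, relu_b, [value]),
        )
```

The index returned by `active_piece` shows how the partition is implied by the parameters: each input falls in the region where its piece attains the maximum, which is why $\Omega$ need not be given explicitly.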
