Ahnaf Rafi

Hi! I'm an Assistant Professor of Economics in the Department of Economics at the University of Virginia. My research interests are in theoretical and applied econometrics. You can download my CV or email me.

Published, accepted or forthcoming papers

Bootstrap based asymptotic refinements for high-dimensional nonlinear models.
With Joel Horowitz.
Journal of Econometrics, Volume 249, Part B, May 2025, Article 105977.
[ArXiv: 2303.09680]
We consider penalized extremum estimation of a high-dimensional, possibly nonlinear model that is sparse in the sense that most of its parameters are zero but some are not. We use the SCAD penalty function, which provides model selection consistent and oracle efficient estimates under suitable conditions. However, asymptotic approximations based on the oracle model can be inaccurate with the sample sizes found in many applications. This paper gives conditions under which the bootstrap, based on estimates obtained through SCAD penalization with thresholding, provides asymptotic refinements of size $O \left( n^{- 2} \right)$ for the error in the rejection (coverage) probability of a symmetric hypothesis test (confidence interval) and $O \left( n^{- 1} \right)$ for the error in the rejection (coverage) probability of a one-sided or equal-tailed test (confidence interval). The results of Monte Carlo experiments show that the bootstrap can provide large reductions in errors in coverage probabilities. The bootstrap is consistent, though it does not necessarily provide asymptotic refinements, even if some parameters are close but not equal to zero. Random-coefficients logit and probit models and nonlinear moment models are examples of models to which the procedure applies.
On the Performance of the Neyman Allocation with Small Pilots.
With Yong Cai.
Journal of Econometrics, Volume 242, Issue 1, May 2024, Article 105793.
[ArXiv: 2206.04643]
The Neyman Allocation is used in many papers on experimental design, which typically assume that researchers have access to large pilot studies. This may be unrealistic. To understand the properties of the Neyman Allocation with small pilots, we study it in a novel asymptotic framework for two-wave experiments which takes pilot size to be fixed while the main wave grows. We find that the method can produce estimates of the ATE with higher asymptotic variance than balanced randomization, particularly for relatively homoskedastic populations. Empirical examples show that this occurs for values of homoskedasticity that are relevant for researchers.
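As a rough illustration of the mechanism (a textbook-style sketch, not code from the paper), the Neyman Allocation assigns a treated share proportional to the treated-arm standard deviation, $\sigma_1 / (\sigma_1 + \sigma_0)$. The pilot data, function names, and the assumed "true" standard deviations below are all hypothetical.

```python
import statistics


def neyman_share(sd_treated, sd_control):
    """Treated share that minimizes the variance of a difference in means."""
    return sd_treated / (sd_treated + sd_control)


def scaled_variance(share, sd_treated, sd_control):
    """Sample size times the variance of the difference in means
    when a fraction `share` of units is treated (requires 0 < share < 1)."""
    return sd_treated ** 2 / share + sd_control ** 2 / (1 - share)


# Hypothetical pilot with only three outcomes per arm.
pilot_treated = [1.0, 2.0, 3.0]
pilot_control = [0.0, 2.0, 4.0]

# Plug the noisy pilot standard deviations into the allocation rule.
share_hat = neyman_share(
    statistics.stdev(pilot_treated), statistics.stdev(pilot_control)
)

# Assumption for illustration: the population is actually homoskedastic,
# with both standard deviations equal to 1.
variance_pilot_based = scaled_variance(share_hat, 1.0, 1.0)
variance_balanced = scaled_variance(0.5, 1.0, 1.0)
```

In this made-up example the pilot suggests treating one third of the main wave, which gives a scaled variance of 4.5 against 4 under balanced randomization. That is the kind of loss the abstract describes for relatively homoskedastic populations.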
K-Anonymity: A Note on the Trade-Off between Data Utility and Data Security.
With Tatiana Komarova, Denis Nekipelov, and Evgeny Yakovlev.
Applied Econometrics, Vol. 48, 2017, 44-62.
[SSRN]
Researchers often use data from multiple datasets to conduct credible econometric and statistical analysis. The most reliable way to link entries across such datasets is to exploit unique identifiers if those are available. Such linkage, however, may result in privacy violations that reveal sensitive information about some individuals in a sample. Thus, a data curator with concerns for individual privacy may choose to remove certain individual information from the private dataset they plan on releasing to researchers. The individual information the data curator keeps in the private dataset can still allow a researcher to link the datasets, most likely with some errors, and usually results in the researcher having several feasible combined datasets. One conceptual framework a data curator may rely on is $k$-anonymity, $k \geq 2$, which has gained wide popularity in the computer science and statistics communities. To ensure $k$-anonymity, the data curator releases only the amount of identifying information in the private dataset that guarantees that every entry in it can be linked to at least $k$ different entries in the publicly available datasets the researcher will use. In this paper, we look at the data combination task and the estimation task from both perspectives: that of the researcher estimating the model and that of a data curator who restricts identifying information in the private dataset to make sure that $k$-anonymity holds. We illustrate how to construct identifiers in practice and use them to combine some entries across two datasets. We also provide an empirical illustration of how a data curator can ensure $k$-anonymity and the consequences it has for the estimation procedure. Naturally, the utility of the combined data gets smaller as $k$ increases, which is also evident from our empirical illustration.
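The linkage condition described above can be sketched as a simple check: for a given set of released identifying fields, count how many public entries each private entry matches, and require that the smallest count is at least $k$. This is a minimal sketch under exact matching, not the procedure from the paper; the field names, records, and function names are hypothetical.

```python
from collections import Counter


def min_link_count(private_rows, public_rows, keys):
    """Smallest number of public entries that any private entry links to
    when matching exactly on the released fields in `keys`.
    Assumes `private_rows` is non-empty."""
    public_counts = Counter(tuple(row[k] for k in keys) for row in public_rows)
    return min(public_counts[tuple(row[k] for k in keys)] for row in private_rows)


def is_k_anonymous(private_rows, public_rows, keys, k):
    """True if every private entry links to at least k public entries."""
    return min_link_count(private_rows, public_rows, keys) >= k


# Hypothetical public and private datasets.
public = [
    {"zip": "22903", "birth_year": 1980},
    {"zip": "22903", "birth_year": 1985},
    {"zip": "22904", "birth_year": 1980},
    {"zip": "22904", "birth_year": 1980},
]
private = [
    {"zip": "22903", "birth_year": 1980, "income": 50},
    {"zip": "22904", "birth_year": 1980, "income": 70},
]

# Releasing both fields lets the first private entry be linked uniquely.
both_fields_ok = is_k_anonymous(private, public, ("zip", "birth_year"), k=2)
# Releasing only the zip code leaves two candidate links for every entry.
zip_only_ok = is_k_anonymous(private, public, ("zip",), k=2)
```

Here releasing both fields violates 2-anonymity, while releasing only the zip code satisfies it at the cost of less precise linkage.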

Working papers

Efficient Semiparametric Estimation of Average Treatment Effects Under Covariate Adaptive Randomization.
Experiments that use covariate adaptive randomization (CAR) are commonplace in applied economics and other fields. In such experiments, the experimenter first stratifies the sample according to observed baseline covariates and then assigns treatment randomly within these strata so as to achieve balance according to pre-specified stratum-specific target assignment proportions. In this paper, we compute the semiparametric efficiency bound for estimating the average treatment effect (ATE) in such experiments with binary treatments allowing for the class of CAR procedures considered in Bugni, Canay, and Shaikh (2018, 2019). This is a broad class of procedures and is motivated by those used in practice. The stratum-specific target proportions play the role of the propensity score conditional on all baseline covariates (and not just the strata) in these experiments. Thus, the efficiency bound is a special case of the bound in Hahn (1998), but conditional on all baseline covariates. Additionally, this efficiency bound is shown to be achievable under the same conditions as those used to derive the bound by using a cross-fitted Nadaraya-Watson kernel estimator to form nonparametric regression adjustments.
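For concreteness, one simple scheme of this kind is stratified block randomization: within each stratum, a fixed number of units determined by the target proportion is drawn at random for treatment. The sketch below is only an illustration of that assignment step, not the estimator studied in the paper; the strata labels, target shares, and function name are hypothetical.

```python
import random
from collections import defaultdict


def stratified_assignment(strata, target_share, rng):
    """Assign treatment within strata.

    strata: list with one stratum label per unit.
    target_share: dict mapping each label to its target treated share.
    rng: a random.Random instance, so the draw is reproducible.
    Returns a list of 0/1 treatment indicators, one per unit.
    """
    members = defaultdict(list)
    for unit, label in enumerate(strata):
        members[label].append(unit)

    treated = [0] * len(strata)
    for label, units in members.items():
        # Number treated in the stratum, truncated to a whole number of units.
        n_treated = int(target_share[label] * len(units))
        for unit in rng.sample(units, n_treated):
            treated[unit] = 1
    return treated


# Hypothetical experiment: two strata of four units each.
strata = ["a"] * 4 + ["b"] * 4
targets = {"a": 0.5, "b": 0.25}
assignment = stratified_assignment(strata, targets, random.Random(0))

treated_in_a = sum(assignment[:4])
treated_in_b = sum(assignment[4:])
```

Whatever the seed, stratum "a" ends up with two treated units and stratum "b" with one, so the realized shares match the targets exactly within each stratum.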
Regression Discontinuity Design with Spillovers.
With Eric Auerbach and Yong Cai.
Researchers who estimate treatment effects using a regression discontinuity design (RDD) typically assume that there are no spillovers between the treated and control units. This may be unrealistic. We characterize the estimand of RDD in a setting where spillovers occur between units that are close in their values of the running variable. Under the assumption that spillovers are linear-in-means, we show that the estimand depends on the ratio of two terms: (1) the radius over which spillovers occur and (2) the choice of bandwidth used for the local linear regression. Specifically, RDD estimates the direct treatment effect when the radius is of larger order than the bandwidth, and the total treatment effect when the radius is of smaller order than the bandwidth. In the more realistic regime where the radius is of similar order to the bandwidth, the RDD estimand is a mix of the above effects. To recover direct and spillover effects, we propose incorporating estimated spillover terms into local linear regression – the local analog of peer effects regression. We also clarify the settings under which the donut-hole RD is able to eliminate the effects of spillovers.

Work in progress

Gaussian approximation for maximum score and non-smooth M-estimators with multiway dependence.
With Harold D. Chiang.
The maximum score estimator of Manski (1975) provides an elegant approach to estimating slope coefficients in binary choice models without requiring parametric assumptions on the error distribution. However, under i.i.d. sampling, it admits a non-Gaussian limiting distribution and exhibits cube-root asymptotics, which complicates statistical inference. We show that, under multiway dependence, the maximum score estimator attains asymptotic normality at a parametric rate. We obtain this surprising result through the development of a general M-estimation theory that accommodates non-smooth objective functions under multiway dependence. We further propose and establish the validity of a bootstrap procedure for inference.
Nonparametric inference for a class of functionals in the random coefficients logit model.
The random coefficients logit model is widely used in choice analysis, empirical industrial organization, and transport economics, among other fields. Much recent work has gone into relaxing distributional assumptions made about the random coefficients (RCs). Many objects of interest in this model can be represented as averages against the distribution of RCs, e.g. certain welfare measures, choice probabilities and their derivatives. This paper provides a nonparametric estimator of the RC distribution under which the implied plug-in estimators of such averages are asymptotically normal. A consistent estimator of the variance of this limiting distribution is also provided. Together, these results make consistent tests of hypotheses and valid confidence intervals possible in the RC logit model when the distribution of RCs is estimated nonparametrically.

Teaching

Some notes on econometrics and statistics

Note: WIP = Work in progress.