-
Full distribution of the ground-state energy of potentials with weak disorder
Authors:
Naftali R. Smith
Abstract:
We study the full distribution $P(E)$ of the ground-state energy of a single quantum particle in a potential $V(\bf x) = V_0(\bf x) + \sqrt{ε} \, V_1(\bf x)$, where $V_0(\bf x)$ is a deterministic "background" trapping potential and $V_1(\bf x)$ is the disorder. In the weak-disorder limit $ε\to 0$, we find that $P(E)$ scales as $P(E) \sim e^{-s(E)/ε}$. The large-deviation function $s(E)$ is obtained by calculating the most likely configuration of $V(\bf x)$ conditioned on a given ground-state energy $E$. We consider arbitrary trapping potentials $V_0(\bf x)$ and white-noise disorder $V_1(\bf x)$. For infinite systems, we obtain $s(E)$ analytically in the limits $E \to \pm \infty$ and $E \simeq E_0$ where $E_0$ is the ground-state energy in the absence of disorder. We perform explicit calculations for the case of a harmonic trap $V_0(\bf x) \propto x^2$. Next, we calculate $s(E)$ exactly for a finite, periodic one-dimensional system with a homogeneous background $V_0(x)=0$. We find that, remarkably, the system exhibits a sudden change of behavior as $E$ crosses a critical value $E_c < 0$: At $E>E_c$, the most likely configuration of $V(x)$ is homogeneous, whereas at $E < E_c$ it is inhomogeneous, thus spontaneously breaking the translational symmetry of the problem. As a result, $s(E)$ is nonanalytic: Its second derivative jumps at $E=E_c$. We interpret this singularity as a second-order dynamical phase transition.
Submitted 10 September, 2024;
originally announced September 2024.
-
Symmetry constrained neural networks for detection and localization of damage in metal plates
Authors:
James Amarel,
Christopher Rudolf,
Athanasios Iliopoulos,
John Michopoulos,
Leslie N. Smith
Abstract:
The present paper is concerned with deep learning techniques applied to detection and localization of damage in a thin aluminum plate. We used data generated on a tabletop apparatus by mounting to the plate four piezoelectric transducers, each of which took a turn generating a Lamb wave that then traversed the region of interest before being received by the remaining three sensors. On training a neural network to analyze time-series data of the material response, which displayed damage-reflective features whenever the plate guided waves interacted with a contact load, we achieved a model that detected damage with greater than 99% accuracy in addition to a model that localized it with $3.14 \pm 0.21$ mm mean distance error and captured more than 60% of test examples within the diffraction limit. For each task, the best-performing model was designed according to the inductive bias that our transducers were both similar and arranged in a square pattern on a nearly uniform plate.
Submitted 9 September, 2024;
originally announced September 2024.
-
OLMoE: Open Mixture-of-Experts Language Models
Authors:
Niklas Muennighoff,
Luca Soldaini,
Dirk Groeneveld,
Kyle Lo,
Jacob Morrison,
Sewon Min,
Weijia Shi,
Pete Walsh,
Oyvind Tafjord,
Nathan Lambert,
Yuling Gu,
Shane Arora,
Akshita Bhagia,
Dustin Schwenk,
David Wadden,
Alexander Wettig,
Binyuan Hui,
Tim Dettmers,
Douwe Kiela,
Ali Farhadi,
Noah A. Smith,
Pang Wei Koh,
Amanpreet Singh,
Hannaneh Hajishirzi
Abstract:
We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B. We present various experiments on MoE training, analyze routing in our model showing high specialization, and open-source all aspects of our work: model weights, training data, code, and logs.
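The gap between total and active parameters in a sparse MoE comes from routing each token to only a few experts. The following is a minimal sketch of top-k routing in plain Python, not the OLMoE router: the expert count, `k`, the toy "experts" and the renormalization of the selected weights are all assumptions made for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Weights are renormalized over the selected experts (an assumption;
    real routers differ on this detail).
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_layer(x, experts, router_logits, k):
    """Weighted sum of the outputs of the k selected experts only."""
    out = [0.0] * len(x)
    for i, w in route_top_k(router_logits, k):
        y = experts[i](x)  # the other experts are never evaluated
        out = [o + w * yi for o, yi in zip(out, y)]
    return out

# Hypothetical toy setup: 8 "experts", each just scales its input.
experts = [lambda x, s=s: [s * xi for xi in x] for s in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.2]  # made-up router scores

chosen = route_top_k(logits, k=2)
output = moe_layer([1.0, 2.0], experts, logits, k=2)
```

Only the two selected experts run for this token, so the per-token compute tracks the active parameters rather than the full parameter count.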
Submitted 3 September, 2024;
originally announced September 2024.
-
Toward a More Complete OMR Solution
Authors:
Guang Yang,
Muru Zhang,
Lin Qiu,
Yanming Wan,
Noah A. Smith
Abstract:
Optical music recognition (OMR) aims to convert music notation into digital formats. One approach to tackle OMR is through a multi-stage pipeline, where the system first detects visual music notation elements in the image (object detection) and then assembles them into a music notation (notation assembly). Most previous work on notation assembly unrealistically assumes perfect object detection. In this study, we focus on the MUSCIMA++ v2.0 dataset, which represents musical notation as a graph with pairwise relationships among detected music objects, and we consider both stages together. First, we introduce a music object detector based on YOLOv8, which improves detection performance. Second, we introduce a supervised training pipeline that completes the notation assembly stage based on detection output. We find that this model is able to outperform existing models trained on perfect detection output, showing the benefit of considering the detection and assembly stages in a more holistic way. These findings, together with our novel evaluation metric, are important steps toward a more complete OMR solution.
Submitted 30 August, 2024;
originally announced September 2024.
-
Risks and NLP Design: A Case Study on Procedural Document QA
Authors:
Nikita Haduong,
Alice Gao,
Noah A. Smith
Abstract:
As NLP systems are increasingly deployed at scale, concerns about their potential negative impacts have attracted the attention of the research community, yet discussions of risk have mostly been at an abstract level and focused on generic AI or NLP applications. We argue that clearer assessments of risks and harms to users--and concrete strategies to mitigate them--will be possible when we specialize the analysis to more concrete applications and their plausible users. As an illustration, this paper is grounded in cooking recipe procedural document question answering (ProcDocQA), where there are well-defined risks to users such as injuries or allergic reactions. Our case study shows that an existing language model, applied in "zero-shot" mode, quantitatively answers real-world questions about recipes as well as or better than the humans who have answered the questions on the web. Using a novel questionnaire informed by theoretical work on AI risk, we conduct a risk-oriented error analysis that could then inform the design of a future system to be deployed with lower risk of harm and better performance.
Submitted 16 August, 2024;
originally announced August 2024.
-
EFT Workshop at Notre Dame
Authors:
Nick Smith,
Daniel Spitzbart,
Jennet Dickinson,
Jon Wilson,
Lindsey Gray,
Kelci Mohrman,
Saptaparna Bhattacharya,
Andrea Piccinelli,
Titas Roy,
Garyfallia Paspalaki,
Duarte Fontes,
Adam Martin,
William Shepherd,
Sergio Sánchez Cruz,
Dorival Goncalves,
Andrei Gritsan,
Harrison Prosper,
Tom Junk,
Kyle Cranmer,
Michael Peskin,
Andrew Gilbert,
Jonathon Langford,
Frank Petriello,
Luca Mantani,
Andrew Wightman
, et al. (5 additional authors not shown)
Abstract:
The LPC EFT workshop was held April 25-26, 2024 at the University of Notre Dame. The workshop was organized into five thematic sessions: "how far beyond linear" discusses issues of truncation and validity in interpretation of results with an eye towards practicality; "reconstruction-level results" visits the question of how best to design analyses directly targeting inference of EFT parameters; "logistics of combining likelihoods" addresses the challenges of bringing a diverse array of measurements into a cohesive whole; "unfolded results" tackles the question of designing fiducial measurements for later use in EFT interpretations, and the benefits and limitations of unfolding; and "building a sample library" addresses how best to generate simulation samples for use in data analysis. This document serves as a summary of presentations, subsequent discussions, and actionable items identified over the course of the workshop.
Submitted 20 August, 2024;
originally announced August 2024.
-
CPS-TaskForge: Generating Collaborative Problem Solving Environments for Diverse Communication Tasks
Authors:
Nikita Haduong,
Irene Wang,
Bo-Ru Lu,
Prithviraj Ammanabrolu,
Noah A. Smith
Abstract:
Teams can outperform individuals; could adding AI teammates further bolster performance of teams solving problems collaboratively? Collaborative problem solving (CPS) research commonly studies teams with two agents (human-human or human-AI), but team research literature finds that, for complex tasks, larger teams are more effective. Progress in studying collaboration with more than two agents, through textual records of team interactions, is hindered by a major data challenge: available CPS corpora are predominantly dyadic, and adapting pre-existing CPS tasks to more agents is non-trivial. We address this data challenge by developing a CPS task generator, CPS-TaskForge, that can produce environments for studying CPS under a wide array of conditions, and releasing a CPS task design checklist grounded in the theoretical PISA 2015 CPS framework to help facilitate the development of CPS corpora with more agents. CPS-TaskForge takes the form of a resource management (tower defense) game, and different CPS tasks can be studied by manipulating game design parameters. We conduct a case study with groups of 3-4 humans to validate production of diverse natural language CPS communication in a game instance produced by CPS-TaskForge. We discuss opportunities for advancing research in CPS (both with human-only and human-AI teams) using different task configurations. We will release data and code.
Submitted 16 August, 2024;
originally announced August 2024.
-
Coarsening and parallelism with reduction multigrids for hyperbolic Boltzmann transport
Authors:
S. Dargaville,
R. P. Smedley-Stevenson,
P. N. Smith,
C. C. Pain
Abstract:
Reduction multigrids have recently shown good performance in hyperbolic problems without the need for Gauss-Seidel smoothers. When applied to the hyperbolic limit of the Boltzmann Transport Equation (BTE), these methods result in very close to $\mathcal{O}(n)$ growth in work with problem size on unstructured grids. This scalability relies on the CF splitting producing an $A_\textrm{ff}$ block that is easy to invert. We introduce a parallel two-pass CF splitting designed to give diagonally dominant $A_\textrm{ff}$. The first pass computes a maximal independent set in the symmetrized strong connections. The second pass converts F-points to C-points based on the row-wise diagonal dominance of $A_\textrm{ff}$. We find this two-pass CF splitting outperforms common CF splittings available in hypre.
Furthermore, parallelisation of reduction multigrids in hyperbolic problems is difficult as we require both long-range grid-transfer operators and slow coarsenings (with rates of $\sim$1/2 in both 2D and 3D). We find that good parallel performance in the setup and solve is dependent on several factors: repartitioning the coarse grids, reducing the number of active MPI ranks as we coarsen, truncating the multigrid hierarchy and applying a GMRES polynomial as a coarse-grid solver.
We compare the performance of two different reduction multigrids, AIRG (that we developed previously) and the hypre implementation of $\ell$AIR. In the streaming limit with AIRG, we demonstrate 81% weak scaling efficiency in the solve from 2 to 64 nodes (256 to 8196 cores) with only 8.8k unknowns per core, with solve times up to 5.9$\times$ smaller than the $\ell$AIR implementation in hypre.
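The two-pass CF splitting described in this abstract can be sketched serially in a few lines. This is an illustration under stated assumptions, not the authors' parallel algorithm: the strength threshold `theta`, the dominance threshold `max_ratio`, the greedy serial independent-set pass, and the choice to treat the independent set as the F-points (so that $A_\textrm{ff}$ has no strong off-diagonal entries) are all assumptions, and the matrix is a made-up toy.

```python
def strong_graph(A, theta):
    """Symmetrized strong connections: i~j if |a_ij| or |a_ji| is at least
    theta times the largest off-diagonal magnitude in its row."""
    n = len(A)
    adj = [set() for _ in range(n)]
    for i in range(n):
        off = [abs(A[i][j]) for j in range(n) if j != i]
        biggest = max(off) if off else 0.0
        if biggest == 0.0:
            continue
        for j in range(n):
            if j != i and abs(A[i][j]) >= theta * biggest:
                adj[i].add(j)
                adj[j].add(i)  # symmetrize
    return adj

def first_pass(adj):
    """Greedy maximal independent set; members become F-points (assumption)."""
    is_f = [None] * len(adj)
    for i in range(len(adj)):
        if is_f[i] is None:
            is_f[i] = True
            for j in adj[i]:
                if is_f[j] is None:
                    is_f[j] = False  # neighbours of an F-point become C-points
    return is_f

def second_pass(A, is_f, max_ratio):
    """Convert F-points to C-points where the row of A_ff is not
    diagonally dominant enough (rows are visited in order)."""
    n = len(A)
    is_f = list(is_f)
    for i in range(n):
        if not is_f[i]:
            continue
        off = sum(abs(A[i][j]) for j in range(n) if j != i and is_f[j])
        if off > max_ratio * abs(A[i][i]):
            is_f[i] = False
    return is_f

# Toy matrix: strong nearest-neighbour couplings plus weak longer-range ones.
A = [
    [2.0, -1.0, -0.4, 0.0, 0.0],
    [-1.0, 2.0, -1.0, 0.0, 0.0],
    [-0.4, -1.0, 2.0, -1.0, -0.4],
    [0.0, 0.0, -1.0, 2.0, -1.0],
    [0.0, 0.0, -0.4, -1.0, 2.0],
]
first = first_pass(strong_graph(A, theta=0.5))
final = second_pass(A, first, max_ratio=0.25)
```

In this toy case the first pass marks points 0, 2 and 4 as F-points, and the second pass moves point 2 to the C-set because its weak couplings to the other F-points exceed the chosen dominance threshold.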
Submitted 15 August, 2024;
originally announced August 2024.
-
One Year of SN 2023ixf: Breaking Through the Degenerate Parameter Space in Light-Curve Models with Pulsating Progenitors
Authors:
Brian Hsu,
Nathan Smith,
Jared A. Goldberg,
K. Azalee Bostroem,
Griffin Hosseinzadeh,
David J. Sand,
Jeniveve Pearson,
Daichi Hiramatsu,
Jennifer E. Andrews,
Emma R. Beasor,
Yize Dong,
Joseph Farah,
Lluís Galbany,
Sebastian Gomez,
Estefania Padilla Gonzalez,
Claudia P. Gutiérrez,
D. Andrew Howell,
Réka Könyves-Tóth,
Curtis McCully,
Megan Newsome,
Manisha Shrestha,
Giacomo Terreran,
V. Ashley Villar,
Xiaofeng Wang
Abstract:
We present and analyze the extensive optical broadband photometry of the Type II SN 2023ixf up to one year after explosion. We find that, when compared to two pre-existing model grids, the pseudo-bolometric light curve is consistent with drastically different combinations of progenitor and explosion properties. This may be an effect of known degeneracies in Type IIP light-curve models. We independently compute a large grid of ${\tt MESA+STELLA}$ single-star progenitor and light-curve models with various zero-age main-sequence masses, mass-loss efficiencies, and convective efficiencies. Using the observed progenitor variability as an additional constraint, we select stellar models consistent with the pulsation period and explode them according to previously established scaling laws to match plateau properties. Our hydrodynamic modeling indicates that SN 2023ixf is most consistent with a moderate-energy ($E_{\rm exp}\approx7\times10^{50}$ erg) explosion of an initially high-mass red supergiant progenitor ($\gtrsim 17\ M_{\odot}$) that lost a significant amount of mass in its prior evolution, leaving a low-mass hydrogen envelope ($\lesssim 3\ M_{\odot}$) at the time of explosion, with a radius $\gtrsim 950\ R_{\odot}$ and a synthesized $^{56}$Ni mass of $0.07\ M_{\odot}$. We posit that previous mass transfer in a binary system may have stripped the envelope of SN 2023ixf's progenitor. The analysis method with pulsation period presented in this work offers a way to break degeneracies in light-curve modeling in the future, particularly with the upcoming Vera C. Rubin Observatory Legacy Survey of Space and Time, when a record of progenitor variability will be more common.
Submitted 14 August, 2024;
originally announced August 2024.
-
Does Liking Yellow Imply Driving a School Bus? Semantic Leakage in Language Models
Authors:
Hila Gonen,
Terra Blevins,
Alisa Liu,
Luke Zettlemoyer,
Noah A. Smith
Abstract:
Despite their wide adoption, the biases and unintended behaviors of language models remain poorly understood. In this paper, we identify and characterize a phenomenon never discussed before, which we call semantic leakage, where models leak irrelevant information from the prompt into the generation in unexpected ways. We propose an evaluation setting to detect semantic leakage both by humans and automatically, curate a diverse test suite for diagnosing this behavior, and measure significant semantic leakage in 13 flagship models. We also show that models exhibit semantic leakage in languages besides English and across different settings and generation scenarios. This discovery highlights yet another type of bias in language models that affects their generation patterns and behavior.
Submitted 12 September, 2024; v1 submitted 12 August, 2024;
originally announced August 2024.
-
Pauli Check Sandwiching for Quantum Characterization and Error Mitigation during Runtime
Authors:
Joshua Gao,
Ji Liu,
Alvin Gonzales,
Zain H. Saleem,
Nikos Hardavellas,
Kaitlin N. Smith
Abstract:
This work presents a novel quantum system characterization and error mitigation framework that applies Pauli check sandwiching (PCS). We motivate our work with prior art in software optimizations for quantum programs like noise-adaptive mapping and multi-programming, and we introduce the concept of PCS while emphasizing design considerations for its practical use. We show that by carefully embedding Pauli checks within a target application (i.e. a quantum circuit), we can learn quantum system noise profiles. Further, PCS combined with multi-programming unlocks non-trivial fidelity improvements in quantum program outcomes.
Submitted 14 August, 2024; v1 submitted 10 August, 2024;
originally announced August 2024.
-
Performance Prediction of Hub-Based Swarms
Authors:
Puneet Jain,
Chaitanya Dwivedi,
Vigynesh Bhatt,
Nick Smith,
Michael A Goodrich
Abstract:
A hub-based colony consists of multiple agents who share a common nest site called the hub. Agents perform tasks away from the hub like foraging for food or gathering information about future nest sites. Modeling hub-based colonies is challenging because the size of the collective state space grows rapidly as the number of agents grows. This paper presents a graph-based representation of the colony that can be combined with graph-based encoders to create low-dimensional representations of collective state that can scale to many agents for a best-of-N colony problem. We demonstrate how the information in the low-dimensional embedding can be used in two experiments. First, we show how the information in the tensor can be used to cluster collective states by the probability of choosing the best site for a very small problem. Second, we show how structured collective trajectories emerge when a graph encoder is used to learn the low-dimensional embedding, and these trajectories have information that can be used to predict swarm performance.
Submitted 8 August, 2024;
originally announced August 2024.
-
Circumstellar Interaction in the Ultraviolet Spectra of SN 2023ixf 14-66 Days After Explosion
Authors:
K. Azalee Bostroem,
David J. Sand,
Luc Dessart,
Nathan Smith,
Saurabh W. Jha,
Stefano Valenti,
Jennifer E. Andrews,
Yize Dong,
Alexei V. Filippenko,
Sebastian Gomez,
Daichi Hiramatsu,
Emily T. Hoang,
Griffin Hosseinzadeh,
D. Andrew Howell,
Jacob E. Jencson,
Michael Lundquist,
Curtis McCully,
Darshana Mehta,
Nicolas E. Meza Retamal,
Jeniveve Pearson,
Aravind P. Ravi,
Manisha Shrestha,
Samuel Wyatt
Abstract:
SN 2023ixf was discovered in M101 within a day of explosion and rapidly classified as a Type II supernova with flash features. Here we present ultraviolet (UV) spectra obtained with the Hubble Space Telescope 14, 19, 24, and 66 days after explosion. Interaction between the supernova ejecta and circumstellar material (CSM) is seen in the UV throughout our observations in the flux of the first three epochs and asymmetric MgII emission at day 66. We compare our observations to CMFGEN supernova models which include CSM interaction ($\dot{M}<10^{-3}$ $M_{\odot}$/yr) and find that the power from CSM interaction is decreasing with time, from $L_{\rm sh}\approx5\times10^{42}$ erg/s at day 14 to $L_{\rm sh}\approx1\times10^{40}$ erg/s at day 66. We examine the contribution of individual atomic species to the spectra at day 14 and 19, showing that the majority of the features are dominated by iron, nickel, magnesium, and chromium absorption in the ejecta. The UV spectral energy distribution of SN 2023ixf sits between that of supernovae which show no definitive signs of CSM interaction and those with persistent signatures assuming the same progenitor radius and metallicity. Finally, we show that the evolution and asymmetric shape of the MgII emission are not unique to SN 2023ixf. These observations add to the early measurements of dense, confined CSM interaction, tracing the mass-loss history of SN 2023ixf to $\sim33$ yr prior to explosion and the density profile to a radius of $\sim5.7\times10^{15}$ cm. They show the relatively short evolution from quiescent red supergiant wind to high mass loss.
Submitted 7 August, 2024;
originally announced August 2024.
-
Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?
Authors:
Jonathan Hayase,
Alisa Liu,
Yejin Choi,
Sewoong Oh,
Noah A. Smith
Abstract:
The pretraining data of today's strongest language models is opaque; in particular, little is known about the proportions of various domains or languages represented. In this work, we tackle a task which we call data mixture inference, which aims to uncover the distributional make-up of training data. We introduce a novel attack based on a previously overlooked source of information: byte-pair encoding (BPE) tokenizers, used by the vast majority of modern language models. Our key insight is that the ordered list of merge rules learned by a BPE tokenizer naturally reveals information about the token frequencies in its training data. Given a tokenizer's merge list along with example data for each category of interest, we formulate a linear program that solves for the proportion of each category in the tokenizer's training set. In controlled experiments, we show that our attack recovers mixture ratios with high precision for tokenizers trained on known mixtures of natural languages, programming languages, and data sources. We then apply our approach to off-the-shelf tokenizers released with recent LMs. We confirm much publicly disclosed information about these models, and also make several new inferences: GPT-4o and Mistral NeMo's tokenizers are much more multilingual than their predecessors, training on 39% and 47% non-English language data, respectively; Llama 3 extends GPT-3.5's tokenizer primarily for multilingual (48%) use; GPT-3.5's and Claude's tokenizers are trained on predominantly code (~60%). We hope our work sheds light on current design practices for pretraining data, and inspires continued research into data mixture inference for LMs.
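The key insight stated in this abstract, that the order of BPE merges reflects pair frequencies in the training data, can be seen in a toy setting. The sketch below is a simplified BPE learner written for illustration only; it is not any production tokenizer and it does not implement the paper's linear program. The two word lists and the 8:2 mixture ratios are made-up assumptions.

```python
from collections import Counter

def learn_merges(corpus, num_merges):
    """Learn BPE merges from a list of words; return the ordered merge list."""
    words = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        # Most frequent pair wins; ties are broken lexicographically.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Apply the merge to every word.
        merged = Counter()
        for symbols, count in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += count
        words = merged
    return merges

# Hypothetical categories and mixtures.
english_like = ["the", "then", "this"]
code_like = ["def", "defer", "del"]

mostly_english = learn_merges(english_like * 8 + code_like * 2, num_merges=2)
mostly_code = learn_merges(english_like * 2 + code_like * 8, num_merges=2)
```

The same two categories yield different early merges depending on their proportions: the English-heavy mixture merges `t`+`h` first, the code-heavy one merges `d`+`e` first. This is the kind of signal the abstract says a merge list carries about its training mixture.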
Submitted 5 September, 2024; v1 submitted 23 July, 2024;
originally announced July 2024.
-
Pair-instability evolution and explosions in massive stars
Authors:
M. Renzo,
N. Smith
Abstract:
Very massive stars are radiation pressure dominated. Before running out of viable nuclear fuel, they can reach a thermodynamic state where electron-positron pair-production robs them of radiation support, triggering their collapse. Thermonuclear explosion(s) in the core ensue. These have long been predicted to result in either repeated episodic mass loss (pulsational pair instability), which reduces the mass available to eventually form a black hole, or, if sufficient energy is generated, the complete unbinding of all stellar material in one single explosive episode (pair instability supernova), which leaves behind no black hole. Despite theoretical agreement among modelers, the wide variety of predicted signatures and the rarity of very high-mass stellar progenitors have so far resulted in a lack of observational confirmation. Nevertheless, because of the impact of pair instability evolution on black hole masses relevant to gravitational-wave astronomy, as well as the present and upcoming expanded capabilities of time-domain astronomy and high redshift spectroscopy, interest in these explosions remains high. We review the current understanding of pair instability evolution, with particular emphasis on the known uncertainties. We also summarize the existing claimed electromagnetic counterparts and discuss prospects for future direct and indirect searches.
Submitted 19 August, 2024; v1 submitted 22 July, 2024;
originally announced July 2024.
-
The Long-lived Broadband Afterglow of Short Gamma-Ray Burst 231117A and the Growing Radio-Detected Short GRB Population
Authors:
Genevieve Schroeder,
Wen-fai Fong,
Charles D. Kilpatrick,
Alicia Rouco Escorial,
Tanmoy Laskar,
Anya E. Nugent,
Jillian Rastinejad,
Kate D. Alexander,
Edo Berger,
Thomas G. Brink,
Ryan Chornock,
Clecio R. de Bom,
Yuxin Dong,
Tarraneh Eftekhari,
Alexei V. Filippenko,
Celeste Fuentes-Carvajal,
Wynn V. Jacobson-Galan,
Matthew Malkan,
Raffaella Margutti,
Jeniveve Pearson,
Lauren Rhodes,
Ricardo Salinas,
David J. Sand,
Luidhy Santana-Silva,
Andre Santos
, et al. (6 additional authors not shown)
Abstract:
We present multiwavelength observations of the Swift short $γ$-ray burst GRB 231117A, localized to an underlying galaxy at redshift $z = 0.257$ at a small projected offset ($\sim 2~$kpc). We uncover long-lived X-ray (Chandra) and radio/millimeter (VLA, MeerKAT, and ALMA) afterglow emission, detected to $\sim 37~$days and $\sim 20~$days (rest frame), respectively. We measure a wide jet ($\sim 10.4^\circ$) and relatively high circumburst density ($\sim 0.07~{\rm cm}^{-3}$) compared to the short GRB population. Our data cannot be easily fit with a standard forward shock model, but they are generally well fit with the incorporation of a refreshed forward shock and a reverse shock at $< 1~$day. We incorporate GRB 231117A into a larger sample of 132 X-ray detected events, 71 of which were radio-observed (17 cm-band detections), for a systematic study of the distributions of redshifts, jet and afterglow properties, galactocentric offsets, and local environments of events with and without detected radio afterglows. Compared to the entire short GRB population, the majority of radio-detected GRBs are at relatively low redshifts ($z < 0.6$) and have high circumburst densities ($> 10^{-2}~{\rm cm}^{-3}$), consistent with their smaller ($< 8~$kpc) projected galactocentric offsets. We additionally find that 70% of short GRBs with opening angle measurements were radio-detected, indicating the importance of radio afterglows in jet measurements, especially in the cases of wide ($> 10^\circ$) jets where observational evidence of collimation may only be detectable at radio wavelengths. Owing to improved observing strategies and the emergence of sensitive radio facilities, the number of radio-detected short GRBs has quadrupled in the past decade.
Submitted 18 July, 2024;
originally announced July 2024.
-
The Art of Saying No: Contextual Noncompliance in Language Models
Authors:
Faeze Brahman,
Sachin Kumar,
Vidhisha Balachandran,
Pradeep Dasigi,
Valentina Pyatkin,
Abhilasha Ravichander,
Sarah Wiegreffe,
Nouha Dziri,
Khyathi Chandu,
Jack Hessel,
Yulia Tsvetkov,
Noah A. Smith,
Yejin Choi,
Hannaneh Hajishirzi
Abstract:
Chat-based language models are designed to be helpful, yet they should not comply with every user request. While most existing work primarily focuses on refusal of "unsafe" queries, we posit that the scope of noncompliance should be broadened. We introduce a comprehensive taxonomy of contextual noncompliance describing when and how models should not comply with user requests. Our taxonomy spans a wide range of categories including incomplete, unsupported, indeterminate, and humanizing requests (in addition to unsafe requests). To test noncompliance capabilities of language models, we use this taxonomy to develop a new evaluation suite of 1000 noncompliance prompts. We find that most existing models show significantly high compliance rates in certain previously understudied categories with models like GPT-4 incorrectly complying with as many as 30% of requests. To address these gaps, we explore different training strategies using a synthetically-generated training set of requests and expected noncompliant responses. Our experiments demonstrate that while direct finetuning of instruction-tuned models can lead to both over-refusal and a decline in general capabilities, using parameter efficient methods like low rank adapters helps to strike a good balance between appropriate noncompliance and other capabilities.
Submitted 2 July, 2024;
originally announced July 2024.
-
Measuring the Angular Momentum of a Neutron Using Earth's Rotation
Authors:
Niels Geerits,
Stephan Sponar,
Kyle E. Steffen,
William M. Snow,
Steven R. Parnell,
Giacomo Mauri,
Gregory N. Smith,
Robert M. Dalgliesh,
Victor de Haan
Abstract:
A coupling between Earth's rotation and orbital angular momentum (OAM), known as the Sagnac effect, is observed in entangled neutrons produced using a spin echo interferometer. After correction for instrument systematics, the measured coupling is within 5% of theory, with an uncertainty of 7.2%. The OAM in our setup is transverse to the propagation direction and scales linearly with wavelength (4 Å to 12.75 Å), hence the coupling can be varied without mechanically rotating the device. Therefore, the systematic error is lower than in previous experiments. The detected transverse OAM of our beam corresponds to $4098 \pm 295~\hbar$ Å$^{-1}$, 5 orders of magnitude lower than in previous neutron experiments, thereby demonstrating the feasibility of using the Sagnac effect to definitively measure neutron OAM and paving the way towards observations of the quantum Sagnac effect.
Submitted 19 July, 2024; v1 submitted 12 July, 2024;
originally announced July 2024.
-
MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization
Authors:
Orevaoghene Ahia,
Sachin Kumar,
Hila Gonen,
Valentin Hoffman,
Tomasz Limisiewicz,
Yulia Tsvetkov,
Noah A. Smith
Abstract:
In multilingual settings, non-Latin scripts and low-resource languages are usually disadvantaged in terms of language models' utility, efficiency, and cost. Specifically, previous studies have reported multiple modeling biases that the current tokenization algorithms introduce to non-Latin script languages, the main one being over-segmentation. In this work, we propose MAGNET, multilingual adaptive gradient-based tokenization, to reduce over-segmentation via adaptive gradient-based subword tokenization. MAGNET learns to predict segment boundaries between byte tokens in a sequence via sub-modules within the model, which act as internal boundary predictors (tokenizers). Previous gradient-based tokenization methods aimed for uniform compression across sequences by integrating a single boundary predictor during training and optimizing it end-to-end through stochastic reparameterization alongside the next token prediction objective. However, this approach still results in over-segmentation for non-Latin script languages in multilingual settings. In contrast, MAGNET offers a customizable architecture where byte-level sequences are routed through language-script-specific predictors, each optimized for its respective language script. This modularity enforces equitable segmentation granularity across different language scripts compared to previous methods. Through extensive experiments, we demonstrate that in addition to reducing segmentation disparities, MAGNET also enables faster language modelling and improves downstream utility.
Submitted 11 July, 2024;
originally announced July 2024.
-
Relative Sensitivities and Correlation of Factors Introducing Uncertainty in Radiotherapy Dosimetry Audits
Authors:
Padmini Krishnadas,
Spencer Angus Thomas,
Jessica Goldring,
Nadia A. S. Smith,
Mohammad Hussein
Abstract:
Dosimetry audits are carried out to determine how well radiotherapy is delivered to the patient. They are also used to understand the uncertainty introduced into the measurement result when using different computational models. As measurement procedures are becoming increasingly complex with technological advancements, it is harder to establish sources of variability in measurements and understand if they stem from true differences in measurands or in the measurement pipelines themselves. The gamma index calculation is a widely accepted metric used for the comparison of measured and predicted doses in radiotherapy. However, various steps in the measurement pipeline can introduce variation in the measurement result. In this paper, we perform a sensitivity and correlation analysis to investigate the influence of various input factors (i.e., settings) in gamma index calculations on the uncertainty introduced in dosimetry audits. We identify a number of factors where standardization will improve measurements by reducing variability in outputs. Furthermore, we also compare gamma index metrics and similarities across audit sites.
Submitted 10 July, 2024;
originally announced July 2024.
-
MUSE: Machine Unlearning Six-Way Evaluation for Language Models
Authors:
Weijia Shi,
Jaechan Lee,
Yangsibo Huang,
Sadhika Malladi,
Jieyu Zhao,
Ari Holtzman,
Daogao Liu,
Luke Zettlemoyer,
Noah A. Smith,
Chiyuan Zhang
Abstract:
Language models (LMs) are trained on vast amounts of text data, which may include private and copyrighted content. Data owners may request the removal of their data from a trained model due to privacy or copyright concerns. However, exactly unlearning only these datapoints (i.e., retraining with the data removed) is intractable in modern-day models. This has led to the development of many approximate unlearning algorithms. The evaluation of the efficacy of these algorithms has traditionally been narrow in scope, failing to precisely quantify the success and practicality of the algorithm from the perspectives of both the model deployers and the data owners. We address this issue by proposing MUSE, a comprehensive machine unlearning evaluation benchmark that enumerates six diverse desirable properties for unlearned models: (1) no verbatim memorization, (2) no knowledge memorization, (3) no privacy leakage, (4) utility preservation on data not intended for removal, (5) scalability with respect to the size of removal requests, and (6) sustainability over sequential unlearning requests. Using these criteria, we benchmark how effectively eight popular unlearning algorithms on 7B-parameter LMs can unlearn Harry Potter books and news articles. Our results demonstrate that most algorithms can prevent verbatim memorization and knowledge memorization to varying degrees, but only one algorithm does not lead to severe privacy leakage. Furthermore, existing algorithms fail to meet deployer's expectations because they often degrade general model utility and also cannot sustainably accommodate successive unlearning requests or large-scale content removal. Our findings identify key issues with the practicality of existing unlearning algorithms on language models, and we release our benchmark to facilitate further evaluations: muse-bench.github.io
Submitted 14 July, 2024; v1 submitted 8 July, 2024;
originally announced July 2024.
-
Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects
Authors:
Orevaoghene Ahia,
Anuoluwapo Aremu,
Diana Abagyan,
Hila Gonen,
David Ifeoluwa Adelani,
Daud Abolade,
Noah A. Smith,
Yulia Tsvetkov
Abstract:
Yorùbá, an African language with roughly 47 million speakers, encompasses a continuum with several dialects. Recent efforts to develop NLP technologies for African languages have focused on their standard dialects, resulting in disparities for dialects and varieties for which there are little to no resources or tools. We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus, YORÙLECT, across three domains and four regional Yorùbá dialects. To develop this corpus, we engaged native speakers, travelling to communities where these dialects are spoken, to collect text and speech data. Using our newly created corpus, we conducted extensive experiments on (text) machine translation, automatic speech recognition, and speech-to-text translation. Our results reveal substantial performance disparities between standard Yorùbá and the other dialects across all tasks. However, we also show that with dialect-adaptive finetuning, we are able to narrow this gap. We believe our dataset and experimental analysis will contribute greatly to developing NLP tools for Yorùbá and its dialects, and potentially for other African languages, by improving our understanding of existing challenges and offering a high-quality dataset for further development. We release the YORÙLECT dataset and models publicly under an open license.
Submitted 27 June, 2024;
originally announced June 2024.
-
Decoding-Time Language Model Alignment with Multiple Objectives
Authors:
Ruizhe Shi,
Yifang Chen,
Yushi Hu,
Alisa Liu,
Hannaneh Hajishirzi,
Noah A. Smith,
Simon Du
Abstract:
Aligning language models (LMs) to human preferences has emerged as a critical pursuit, enabling these models to better serve diverse user needs. Existing methods primarily focus on optimizing LMs for a single reward function, limiting their adaptability to varied objectives. Here, we propose $\textbf{multi-objective decoding (MOD)}$, a decoding-time algorithm that outputs the next token from a linear combination of predictions of all base models, for any given weightings over different objectives. We exploit a common form among a family of $f$-divergence regularized alignment approaches (such as PPO, DPO, and their variants) to identify a closed-form solution by Legendre transform, and derive an efficient decoding strategy. Theoretically, we show why existing approaches can be sub-optimal even in natural settings and obtain optimality guarantees for our method. Empirical results demonstrate the effectiveness of the algorithm. For example, compared to a parameter-merging baseline, MOD achieves 12.8% overall reward improvement when equally optimizing towards $3$ objectives. Moreover, we experiment with MOD on combining three fully-finetuned LLMs of different model sizes, each aimed at different objectives such as safety, coding, and general user preference. Unlike traditional methods that require careful curation of a mixture of datasets to achieve comprehensive improvement, we can quickly experiment with preference weightings using MOD to find the best combination of models. Our best combination reduces toxicity on Toxigen to nearly 0% and achieves 7.9--33.3% improvement across other three metrics ($\textit{i.e.}$, Codex@1, GSM-COT, BBH-COT).
Submitted 28 June, 2024; v1 submitted 26 June, 2024;
originally announced June 2024.
-
Evaluating Copyright Takedown Methods for Language Models
Authors:
Boyi Wei,
Weijia Shi,
Yangsibo Huang,
Noah A. Smith,
Chiyuan Zhang,
Luke Zettlemoyer,
Kai Li,
Peter Henderson
Abstract:
Language models (LMs) derive their capabilities from extensive training on diverse data, including potentially copyrighted material. These models can memorize and generate content similar to their training data, posing potential concerns. Therefore, model creators are motivated to develop mitigation methods that prevent generating protected content. We term this procedure copyright takedowns for LMs, noting the conceptual similarity to (but legal distinction from) the DMCA takedown. This paper introduces the first evaluation of the feasibility and side effects of copyright takedowns for LMs. We propose CoTaEval, an evaluation framework to assess the effectiveness of copyright takedown methods, the impact on the model's ability to retain uncopyrightable factual knowledge from the training data whose recitation is embargoed, and how well the model maintains its general utility and efficiency. We examine several strategies, including adding system prompts, decoding-time filtering interventions, and unlearning approaches. Our findings indicate that no tested method excels across all metrics, showing significant room for research in this unique problem setting and indicating potential unresolved challenges for live policy proposals.
Submitted 11 July, 2024; v1 submitted 26 June, 2024;
originally announced June 2024.
-
Pauli Check Extrapolation for Quantum Error Mitigation
Authors:
Quinn Langfitt,
Ji Liu,
Benchen Huang,
Alvin Gonzales,
Kaitlin N. Smith,
Nikos Hardavellas,
Zain H. Saleem
Abstract:
Pauli Check Sandwiching (PCS) is an error mitigation scheme that uses pairs of parity checks to detect errors in the payload circuit. While increasing the number of check pairs improves error detection, it also introduces additional noise to the circuit and exponentially increases the required sampling size. To address these limitations, we propose a novel error mitigation scheme, Pauli Check Extrapolation (PCE), which integrates PCS with an extrapolation technique similar to Zero-Noise Extrapolation (ZNE). However, instead of extrapolating to the `zero-noise' limit, as is done in ZNE, PCE extrapolates to the `maximum check' limit--the number of check pairs theoretically required to achieve unit fidelity. In this study, we focus on applying a linear model for extrapolation and also derive a more general exponential ansatz based on the Markovian error model. We demonstrate the effectiveness of PCE by using it to mitigate errors in the shadow estimation protocol, particularly for states prepared by the variational quantum eigensolver (VQE). Our results show that this method can achieve higher fidelities than the state-of-the-art Robust Shadow (RS) estimation scheme, while significantly reducing the number of required samples by eliminating the need for a calibration procedure. We validate these findings on both fully-connected topologies and simulated IBM hardware backends.
Submitted 20 June, 2024;
originally announced June 2024.
-
LHC EFT WG Note: SMEFT predictions, event reweighting, and simulation
Authors:
Alberto Belvedere,
Saptaparna Bhattacharya,
Giacomo Boldrini,
Suman Chatterjee,
Alessandro Calandri,
Sergio Sánchez Cruz,
Jennet Dickinson,
Franz J. Glessgen,
Reza Goldouzian,
Alexander Grohsjean,
Laurids Jeppe,
Charlotte Knight,
Olivier Mattelaer,
Kelci Mohrman,
Hannah Nelson,
Vasilije Perovic,
Matteo Presilla,
Robert Schöfbeck,
Nick Smith
Abstract:
This note gives an overview of the tools for predicting expectations in the Standard Model effective field theory (SMEFT) at the tree level and one loop available through event generators. Methods of event reweighting, the separate simulation of squared matrix elements, and the simulation of the full SMEFT process are compared in terms of statistical efficacy and potential biases.
Submitted 28 June, 2024; v1 submitted 20 June, 2024;
originally announced June 2024.
-
Evaluating $n$-Gram Novelty of Language Models Using Rusty-DAWG
Authors:
William Merrill,
Noah A. Smith,
Yanai Elazar
Abstract:
How novel are texts generated by language models (LMs) relative to their training corpora? In this work, we investigate the extent to which modern LMs generate $n$-grams from their training data, evaluating both (i) the probability LMs assign to complete training $n$-grams and (ii) $n$-novelty, the proportion of $n$-grams generated by an LM that did not appear in the training data (for arbitrarily large $n$). To enable arbitrary-length $n$-gram search over a corpus in constant time, we develop Rusty-DAWG, a novel search tool inspired by indexing of genomic data. We compare the novelty of LM-generated text to human-written text and explore factors that affect generation novelty, focusing on the Pythia models. We find that, for $n > 4$, LM-generated text is less novel than human-written text, though it is more novel for smaller $n$. Larger LMs and more constrained decoding strategies both decrease novelty. Finally, we show that LMs complete $n$-grams with lower loss if they are more frequent in the training data. Overall, our results reveal factors influencing the novelty of LM-generated text, and we release Rusty-DAWG to facilitate further pretraining data research.
Submitted 25 June, 2024; v1 submitted 18 June, 2024;
originally announced June 2024.
-
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
Authors:
Yushi Hu,
Weijia Shi,
Xingyu Fu,
Dan Roth,
Mari Ostendorf,
Luke Zettlemoyer,
Noah A Smith,
Ranjay Krishna
Abstract:
Humans draw to facilitate reasoning: we draw auxiliary lines when solving geometry problems; we mark and circle when reasoning on maps; we use sketches to amplify our ideas and relieve our limited-capacity working memory. However, such actions are missing in current multimodal language models (LMs). Current chain-of-thought and tool-use paradigms only use text as intermediate reasoning steps. In this work, we introduce Sketchpad, a framework that gives multimodal LMs a visual sketchpad and tools to draw on the sketchpad. The LM conducts planning and reasoning according to the visual artifacts it has drawn. Different from prior work, which uses text-to-image models to enable LMs to draw, Sketchpad enables LMs to draw with lines, boxes, marks, etc., which is closer to human sketching and better facilitates reasoning. Sketchpad can also use specialist vision models during the sketching process (e.g., draw bounding boxes with object detection models, draw masks with segmentation models), to further enhance visual perception and reasoning. We experiment with a wide range of math tasks (including geometry, functions, graphs, and chess) and complex visual reasoning tasks. Sketchpad substantially improves performance on all tasks over strong base models with no sketching, yielding an average gain of 12.7% on math tasks, and 8.6% on vision tasks. GPT-4o with Sketchpad sets a new state of the art on all tasks, including V*Bench (80.3%), BLINK spatial reasoning (83.9%), and visual correspondence (80.8%). All codes and data are in https://visualsketchpad.github.io/.
Submitted 10 July, 2024; v1 submitted 13 June, 2024;
originally announced June 2024.
-
Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback
Authors:
Hamish Ivison,
Yizhong Wang,
Jiacheng Liu,
Zeqiu Wu,
Valentina Pyatkin,
Nathan Lambert,
Noah A. Smith,
Yejin Choi,
Hannaneh Hajishirzi
Abstract:
Learning from preference feedback has emerged as an essential step for improving the generation quality and performance of modern language models (LMs). Despite its widespread use, the way preference-based learning is applied varies wildly, with differing data, learning algorithms, and evaluations used, making disentangling the impact of each aspect difficult. In this work, we identify four core aspects of preference-based learning: preference data, learning algorithm, reward model, and policy training prompts, systematically investigate the impact of these components on downstream model performance, and suggest a recipe for strong learning for preference feedback. Our findings indicate that all aspects are important for performance, with better preference data leading to the largest improvements, followed by the choice of learning algorithm, the use of improved reward models, and finally the use of additional unlabeled prompts for policy training. Notably, PPO outperforms DPO by up to 2.5% in math and 1.2% in general domains. High-quality preference data leads to improvements of up to 8% in instruction following and truthfulness. Despite significant gains of up to 5% in mathematical evaluation when scaling up reward models, we surprisingly observe marginal improvements in other categories.
We publicly release the code used for training (https://github.com/hamishivi/EasyLM) and evaluating (https://github.com/allenai/open-instruct) our models, along with the models and datasets themselves (https://huggingface.co/collections/allenai/tulu-v25-suite-66676520fd578080e126f618).
Submitted 13 June, 2024;
originally announced June 2024.
-
Shockingly Bright Warm Carbon Monoxide Molecular Features in the Supernova Remnant Cassiopeia A Revealed by JWST
Authors:
J. Rho,
S. -H. Park,
R. Arendt,
M. Matsuura,
D. Milisavljevic,
T. Temim,
I. De Looze,
W. P. Blair,
A. Rest,
O. Fox,
A. P. Ravi,
B. -C. Koo,
M. Barlow,
A. Burrows,
R. Chevalier,
G. Clayton,
R. Fesen,
C. Fransson,
C. Fryer,
H. L. Gomez,
H. -T. Janka,
F. Kirchschlarger,
J. M. Laming,
S. Orlando,
D. Patnaude
, et al. (14 additional authors not shown)
Abstract:
We present JWST NIRCam (F356W and F444W filters) and MIRI (F770W) images and NIRSpec-IFU spectroscopy of the young supernova remnant Cassiopeia A (Cas A). We obtained the data as part of a JWST survey of Cas A. The NIRCam and MIRI images map the spatial distributions of synchrotron radiation, Ar-rich ejecta, and CO on both large and small scales, revealing remarkably complex structures. The CO emission is stronger at the outer layers than the Ar ejecta, which indicates the reformation of CO molecules behind the reverse shock. NIRSpec-IFU spectra (3 - 5.5 microns) were obtained toward two representative knots in the NE and S fields. Both regions are dominated by the bright fundamental rovibrational band of CO in the two R and P branches, with strong [Ar VI] and relatively weaker, variable strength ejecta lines of [Si IX], [Ca IV], [Ca V] and [Mg IV]. The NIRSpec-IFU data resolve individual ejecta knots and filaments spatially and in velocity space. The fundamental CO band in the JWST spectra reveals unique shapes of CO, showing a few tens of sinusoidal patterns of rovibrational lines with pseudo-continuum underneath, which is attributed to the high-velocity widths of CO lines. The CO also shows high J lines at different vibrational transitions. Our results with LTE modeling of CO emission indicate a temperature of 1080 K and provide unique insight into the correlations between dust, molecules, and highly ionized ejecta in supernovae, and have strong ramifications for modeling dust formation that is led by CO cooling in the early Universe.
Submitted 5 June, 2024;
originally announced June 2024.
-
Capillary Flow Printing of Submicron Carbon Nanotube Transistors
Authors:
Brittany N. Smith,
Faris M. Albarghouthi,
James L. Doherty,
Xuancheng Pei,
Quentin Macfarlane,
Matthew Salfity,
Daniel Badia,
Marc Pascual,
Pascal Boncenne,
Nathan Bigan,
Amin M'Barki,
Aaron D. Franklin
Abstract:
Although printed transistors have a wide range of applications, the limited resolution of printing techniques (10-30 µm) has been a barrier to advancement and scaling, particularly down to submicron dimensions. While previous works have shown creative approaches to realizing submicron channel lengths with printing, reliance on chemical processes unique to specific inks or tedious post-processing limits their applicability. Here, we report the use of capillary flow printing (CFP) to repeatably create fully printed submicron carbon nanotube thin-film transistors (CNT-TFTs) without chemical modification or physical manipulation post-printing. The versatility of this printing technique is demonstrated by printing conducting, semiconducting, and insulating inks on several types of substrates (SiO2, Kapton, and paper) and through the fabrication of various TFT device (contacting/gating) architectures. Notably, CFP of these CNT-TFTs yielded on-currents of 1.12 mA/mm when back gated on Si/SiO2, and 490 µA/mm when side gated through ion gel on Kapton, demonstrating the strong transistor performance achievable with CFP. Mechanical bending and sweep rate resilience of devices printed on Kapton show the wide utility of CFP-fabricated devices for flexible applications. This work highlights the ability of CFP as a viable fabrication method for submicron electronics through cleanroom-free printing techniques.
Submitted 7 June, 2024; v1 submitted 5 June, 2024;
originally announced June 2024.
-
Dissecting the Crab Nebula with JWST: Pulsar wind, dusty filaments, and Ni/Fe abundance constraints on the explosion mechanism
Authors:
Tea Temim,
J. Martin Laming,
P. J. Kavanagh,
Nathan Smith,
Patrick Slane,
William P. Blair,
Ilse De Looze,
Niccolò Bucciantini,
Anders Jerkstrand,
Nicole Marcelina Gountanis,
Ravi Sankrit,
Dan Milisavljevic,
Armin Rest,
Maxim Lyutikov,
Joseph DePasquale,
Thomas Martin,
Laurent Drissen,
John Raymond,
Ori D. Fox,
Maryam Modjaz,
Anatoly Spitkovsky,
Lou Strolger
Abstract:
We present JWST observations of the Crab Nebula, the iconic remnant of the historical SN 1054. The observations include NIRCam and MIRI imaging mosaics, plus MIRI/MRS IFU spectra that probe two select locations within the ejecta filaments. We derive a high-resolution map of dust emission and show that the grains are concentrated in the innermost, high-density filaments. These dense filaments coincide with multiple synchrotron bays around the periphery of the Crab's pulsar wind nebula (PWN). We measure synchrotron spectral index changes in small-scale features within the PWN's torus region, including the well-known knot and wisp structures. The index variations are consistent with Doppler boosting of emission from particles with a broken power-law distribution, providing the first direct evidence that the curvature in the particle injection spectrum is tied to the acceleration mechanism at the termination shock. We detect multiple nickel and iron lines in the ejecta filaments and use photoionization models to derive nickel-to-iron abundance ratios that are a factor of 3-8 higher than the solar ratio. We also find that the previously reported order-of-magnitude higher Ni/Fe values from optical data are consistent with the lower values from JWST when we reanalyze the optical emission using updated atomic data and account for local extinction from dust. We discuss the implications of our results for understanding the nature of the explosion that produced the Crab Nebula and conclude that the observational properties are most consistent with a low-mass iron-core-collapse supernova, even though an electron-capture explosion cannot be ruled out.
Submitted 31 May, 2024;
originally announced June 2024.
-
Extended Shock Breakout and Early Circumstellar Interaction in SN 2024ggi
Authors:
Manisha Shrestha,
K. Azalee Bostroem,
David J. Sand,
Griffin Hosseinzadeh,
Jennifer E. Andrews,
Yize Dong,
Emily Hoang,
Daryl Janzen,
Jeniveve Pearson,
Jacob E. Jencson,
M. J. Lundquist,
Darshana Mehta,
Aravind P. Ravi,
Nicolas Meza Retamal,
Stefano Valenti,
Peter J. Brown,
Saurabh W. Jha,
Colin Macrie,
Brian Hsu,
Joseph Farah,
D. Andrew Howell,
Curtis McCully,
Megan Newsome,
Estefania Padilla Gonzalez,
Craig Pellegrino
, et al. (18 additional authors not shown)
Abstract:
We present high-cadence photometric and spectroscopic observations of supernova (SN) 2024ggi, a Type II SN with flash spectroscopy features which exploded in the nearby galaxy NGC 3621 at $\sim$7 Mpc. The light-curve evolution over the first 30 hours can be fit by two power law indices with a break after 22 hours, rising from $M_V \approx -12.95$ mag at +0.66 days to $M_V \approx -17.91$ mag after 7 days. In addition, the densely sampled color curve shows a strong blueward evolution over the first few days and then behaves as a normal SN II with a redward evolution as the ejecta cool. Such deviations could be due to interaction with circumstellar material (CSM). Early high- and low-resolution spectra clearly show high-ionization flash features from the first spectrum to +3.42 days after the explosion. From the high-resolution spectra, we calculate the CSM velocity to be 37 $\pm~4~\mathrm{km\,s^{-1}} $. We also see the line strength evolve rapidly from 1.22 to 1.49 days in the earliest high-resolution spectra. Comparison of the low-resolution spectra with CMFGEN models suggests that the pre-explosion mass-loss rate of SN 2024ggi falls in a range of $10^{-3}$ to $10^{-2}$ M$_{\odot}$ yr$^{-1}$, which is similar to that derived for SN 2023ixf. However, the rapid temporal evolution of the narrow lines in the spectra of SN 2024ggi ($R_\mathrm{CSM} \sim 2.7 \times 10^{14} \mathrm{cm}$) could indicate a smaller spatial extent of the CSM than in SN 2023ixf ($R_\mathrm{CSM} \sim 5.4 \times 10^{14} \mathrm{cm}$) which in turn implies lower total CSM mass for SN 2024ggi.
Submitted 1 August, 2024; v1 submitted 28 May, 2024;
originally announced May 2024.
-
Disorder effects in planar semiconductor-superconductor structures: Majorana wires versus Josephson junctions
Authors:
Purna P. Paudel,
Nathan O. Smith,
Tudor D. Stanescu
Abstract:
Disorder effects in hybrid semiconductor-superconductor (SM-SC) nanowires, widely recognized as the main obstacle to realizing stable Majorana zero modes (MZMs) in these structures, have been systematically investigated theoretically in recent years. However, there are no corresponding detailed studies of disorder effects in planar Josephson junction (JJ) structures, which represent a promising alternative to the Majorana nanowire platform. In this paper, we perform a numerical analysis of the low-energy physics of JJ structures based on an effective microscopic model that includes two types of disorder, charge impurities inside the semiconductor and roughness on the surface of the superconducting film. We consider different parameter regimes, including low and high chemical potential values, weak and strong effective SM-SC coupling strengths, and weak and strong disorder strengths. The results are benchmarked using disordered hybrid nanowires realized in planar SM-SC structures similar to those involved in the fabrication of Josephson junctions and having similar model parameters and disorder strengths. We find that the topological superconducting phase hosted by a JJ structure is, generally, more robust against disorder than the topological superconductivity realized in a hybrid nanowire with similar parameters. On the other hand, we find that operating the JJ in a regime characterized by large values of chemical potential results in huge finite-size effects that can destroy the stability of MZMs.
Submitted 20 May, 2024;
originally announced May 2024.
-
Large deviations in statistics of the local time and occupation time for a run and tumble particle
Authors:
Soheli Mukherjee,
Pierre Le Doussal,
Naftali R. Smith
Abstract:
We investigate the statistics of the local time $\mathcal{T} = \int_0^T δ(x(t)) dt$ that a run and tumble particle (RTP) $x(t)$ in one dimension spends at the origin, with or without an external drift. By relating the local time to the number of times the RTP crosses the origin, we find that the local time distribution $P(\mathcal{T})$ satisfies the large deviation principle $P(\mathcal{T}) \sim \, e^{-T \, I(\mathcal{T} / T)} $ in the large observation time limit $T \to \infty$. Remarkably, we find that in the presence of drift the rate function $I(ρ)$ is nonanalytic: We interpret its singularity as dynamical phase transitions of first order. We then extend these results by studying the statistics of the amount of time $\mathcal{R}$ that the RTP spends inside a finite interval (i.e., the occupation time), with qualitatively similar results. In particular, this yields the long-time decay rate of the probability $P(\mathcal{R} = T)$ that the particle does not exit the interval up to time $T$. We find that the conditional endpoint distribution exhibits an interesting change of behavior from unimodal to bimodal as a function of the size of the interval. To study the occupation time statistics, we extend the Donsker-Varadhan large-deviation formalism to the case of RTPs, for general dynamical observables and possibly in the presence of an external potential.
Submitted 10 August, 2024; v1 submitted 11 May, 2024;
originally announced May 2024.
-
What Can Natural Language Processing Do for Peer Review?
Authors:
Ilia Kuznetsov,
Osama Mohammed Afzal,
Koen Dercksen,
Nils Dycke,
Alexander Goldberg,
Tom Hope,
Dirk Hovy,
Jonathan K. Kummerfeld,
Anne Lauscher,
Kevin Leyton-Brown,
Sheng Lu,
Mausam,
Margot Mieskes,
Aurélie Névéol,
Danish Pruthi,
Lizhen Qu,
Roy Schwartz,
Noah A. Smith,
Thamar Solorio,
Jingyan Wang,
Xiaodan Zhu,
Anna Rogers,
Nihar B. Shah,
Iryna Gurevych
Abstract:
The number of scientific articles produced every year is growing rapidly. Providing quality control over them is crucial for scientists and, ultimately, for the public good. In modern science, this process is largely delegated to peer review -- a distributed procedure in which each submission is evaluated by several independent experts in the field. Peer review is widely used, yet it is hard, time-consuming, and prone to error. Since the artifacts involved in peer review -- manuscripts, reviews, discussions -- are largely text-based, Natural Language Processing has great potential to improve reviewing. As the emergence of large language models (LLMs) has enabled NLP assistance for many new tasks, the discussion on machine-assisted peer review is picking up the pace. Yet, where exactly is help needed, where can NLP help, and where should it stand aside? The goal of our paper is to provide a foundation for the future efforts in NLP for peer-reviewing assistance. We discuss peer review as a general process, exemplified by reviewing at AI conferences. We detail each step of the process from manuscript submission to camera-ready revision, and discuss the associated challenges and opportunities for NLP assistance, illustrated by existing work. We then turn to the big challenges in NLP for peer review as a whole, including data acquisition and licensing, operationalization and experimentation, and ethical issues. To help consolidate community efforts, we create a companion repository that aggregates key datasets pertaining to peer review. Finally, we issue a detailed call for action for the scientific community, NLP and AI researchers, policymakers, and funding bodies to help bring the research in NLP for peer review forward. We hope that our work will help set the agenda for research in machine-assisted scientific quality control in the age of AI, within the NLP community and beyond.
Submitted 10 May, 2024;
originally announced May 2024.
-
SN2023fyq: A Type Ibn Supernova With Long-standing Precursor Activity Due to Binary Interaction
Authors:
Yize Dong,
Daichi Tsuna,
Stefano Valenti,
David J. Sand,
Jennifer E. Andrews,
K. Azalee Bostroem,
Griffin Hosseinzadeh,
Emily Hoang,
Saurabh W. Jha,
Daryl Janzen,
Jacob E. Jencson,
Michael Lundquist,
Darshana Mehta,
Aravind P. Ravi,
Nicolas E. Meza Retamal,
Jeniveve Pearson,
Manisha Shrestha,
Alceste Bonanos,
D. Andrew Howell,
Nathan Smith,
Joseph Farah,
Daichi Hiramatsu,
Koichi Itagaki,
Curtis McCully,
Megan Newsome
, et al. (7 additional authors not shown)
Abstract:
We present photometric and spectroscopic observations of SN 2023fyq, a type Ibn supernova in the nearby galaxy NGC 4388 (D$\simeq$18~Mpc). In addition, we trace long-standing precursor emission at the position of SN 2023fyq using data from DLT40, ATLAS, ZTF, ASAS-SN, Swift, and amateur astronomer Koichi Itagaki. Precursor activity is observed up to nearly three years before the supernova explosion, with a relatively rapid rise in the final 100 days. The double-peaked post-explosion light curve reaches a luminosity of $\sim10^{43}~\rm erg\,s^{-1}$. The strong intermediate-width He lines observed in the nebular spectrum of SN 2023fyq imply the interaction is still active at late phases. We found that the precursor activity in SN 2023fyq is best explained by the mass transfer in a binary system involving a low-mass He star and a compact companion. An equatorial disk is likely formed in this process ($\sim$0.6$\rm M_{\odot}$), and the interaction of SN ejecta with this disk powers the main peak of the supernova. The early SN light curve reveals the presence of dense extended material ($\sim$0.3$\rm M_{\odot}$) at $\sim$3000$\rm R_{\odot}$ ejected weeks before the SN explosion, likely due to final-stage core silicon burning or runaway mass transfer resulting from binary orbital shrinking, leading to rapid rising precursor emission within $\sim$30 days prior to explosion. The final explosion could be triggered either by the core-collapse of the He star or by the merger of the He star with a compact object. SN 2023fyq, along with SN 2018gjx and SN 2015G, forms a unique class of Type Ibn SNe which originate in binary systems and are likely to exhibit detectable long-lasting pre-explosion outbursts with magnitudes ranging from $-$10 to $-$13.
Submitted 7 May, 2024;
originally announced May 2024.
-
Learning Syntax Without Planting Trees: Understanding When and Why Transformers Generalize Hierarchically
Authors:
Kabir Ahuja,
Vidhisha Balachandran,
Madhur Panwar,
Tianxing He,
Noah A. Smith,
Navin Goyal,
Yulia Tsvetkov
Abstract:
Transformers trained on natural language data have been shown to learn its hierarchical structure and generalize to sentences with unseen syntactic structures without explicitly encoding any structural bias. In this work, we investigate sources of inductive bias in transformer models and their training that could cause such generalization behavior to emerge. We extensively experiment with transformer models trained on multiple synthetic datasets and with different training objectives and show that while other objectives e.g. sequence-to-sequence modeling, prefix language modeling, often failed to lead to hierarchical generalization, models trained with the language modeling objective consistently learned to generalize hierarchically. We then conduct pruning experiments to study how transformers trained with the language modeling objective encode hierarchical structure. When pruned, we find joint existence of subnetworks within the model with different generalization behaviors (subnetworks corresponding to hierarchical structure and linear order). Finally, we take a Bayesian perspective to further uncover transformers' preference for hierarchical generalization: We establish a correlation between whether transformers generalize hierarchically on a dataset and whether the simplest explanation of that dataset is provided by a hierarchical grammar compared to regular grammars exhibiting linear generalization.
Submitted 31 May, 2024; v1 submitted 25 April, 2024;
originally announced April 2024.
-
The LHCb VELO Upgrade Module Construction
Authors:
K. Akiba,
M. Alexander,
C. Bertella,
A. Biolchini,
A. Bitadze,
G. Bogdanova,
S. Borghi,
T. J. V. Bowcock,
K. Bridges,
M. Brock,
A. T. Burke,
J. Buytaert,
W. Byczynski,
J. Carroll,
V. Coco,
P. Collins,
A. Davis,
O. De Aguiar Francisco,
K. De Bruyn,
S. De Capua,
K. De Roo,
F. Doherty,
L. Douglas,
L. Dufour,
R. Dumps
, et al. (62 additional authors not shown)
Abstract:
The LHCb detector has undergone a major upgrade for LHC Run 3. This Upgrade I detector facilitates operation at higher luminosity and utilises full-detector information at the LHC collision rate, critically including the use of vertex information. A new vertex locator system, the VELO Upgrade, has been constructed. The core elements of the new VELO are the double-sided pixelated hybrid silicon detector modules which operate in vacuum close to the LHC beam in a high radiation environment. The construction and quality assurance tests of these modules are described in this paper. The modules incorporate 200 $\mathrm{μm}$ thick, n-on-p silicon sensors bump-bonded to 130 nm technology ASICs. These are attached with high precision to a silicon microchannel substrate that uses evaporative CO$_2$ cooling. The ASICs are controlled and read out with flexible printed circuits that are glued to the substrate and wire-bonded to the chips. The mechanical support of the module is given by a carbon fibre plate, two carbon fibre rods and an aluminium plate. The sensor attachment was achieved with an average precision of 21 $\mathrm{μm}$, more than 99.5\% of all pixels are fully functional, and a thermal figure of merit of 3 $\mathrm{K\,cm^{2}\,W^{-1}}$ was achieved. The production of the modules was successfully completed in 2021, with the final assembly and installation completed in time for data taking in 2022.
Submitted 21 April, 2024;
originally announced April 2024.
-
BLINK: Multimodal Large Language Models Can See but Not Perceive
Authors:
Xingyu Fu,
Yushi Hu,
Bangzheng Li,
Yu Feng,
Haoyu Wang,
Xudong Lin,
Dan Roth,
Noah A. Smith,
Wei-Chiu Ma,
Ranjay Krishna
Abstract:
We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by humans "within a blink" (e.g., relative depth estimation, visual correspondence, forensics detection, and multi-view reasoning). However, we find these perception-demanding tasks pose significant challenges for current multimodal LLMs because they resist mediation through natural language. Blink reformats 14 classic computer vision tasks into 3,807 multiple-choice questions, paired with single or multiple images and visual prompting. While humans get 95.70% accuracy on average, Blink is surprisingly challenging for existing multimodal LLMs: even the best-performing GPT-4V and Gemini achieve accuracies of 51.26% and 45.72%, only 13.17% and 7.63% higher than random guessing, indicating that such perception abilities have not "emerged" yet in recent multimodal LLMs. Our analysis also highlights that specialist CV models could solve these problems much better, suggesting potential pathways for future improvements. We believe Blink will stimulate the community to help multimodal LLMs catch up with human-level visual perception.
Submitted 3 July, 2024; v1 submitted 18 April, 2024;
originally announced April 2024.
-
Deep JWST/NIRCam imaging of Supernova 1987A
Authors:
Mikako Matsuura,
M. Boyer,
Richard G. Arendt,
J. Larsson,
C. Fransson,
A. Rest,
A. P. Ravi,
S. Park,
P. Cigan,
T. Temim,
E. Dwek,
M. J. Barlow,
P. Bouchet,
G. Clayton,
R. Chevalier,
J. Danziger,
J. De Buizer,
I. De Looze,
G. De Marchi,
O. Fox,
C. Gall,
R. D. Gehrz,
H. L. Gomez,
R. Indebetouw,
T. Kangas
, et al. (24 additional authors not shown)
Abstract:
JWST/NIRCam obtained high angular-resolution (0.05-0.1''), deep near-infrared 1--5 micron imaging of Supernova (SN) 1987A taken 35 years after the explosion. In the NIRCam images, we identify: 1) faint H2 crescents, which are emissions located between the ejecta and the equatorial ring, 2) a bar, which is a substructure of the ejecta, and 3) the bright 3-5 micron continuum emission exterior to the equatorial ring. The emission of the remnant in the NIRCam 1-2.3 micron images is mostly due to line emission, which is mostly emitted in the ejecta and in the hot spots within the equatorial ring. In contrast, the NIRCam 3-5 micron images are dominated by continuum emission. In the ejecta, the continuum is due to dust, obscuring the centre of the ejecta. In contrast, in the ring and exterior to the ring, synchrotron emission contributes a substantial fraction to the continuum.
Dust emission contributes to the continuum at outer spots and diffuse emission exterior to the ring, but little within the ring. This shows that dust cooling and destruction time scales are shorter than the synchrotron cooling time scale, and the time scale of hydrogen recombination in the ring is even longer than the synchrotron cooling time scale.
With the advent of high sensitivity and high angular resolution images provided by JWST/NIRCam, our observations of SN 1987A demonstrate that NIRCam opens up a window to study particle-acceleration and shock physics in unprecedented details, probed by near-infrared synchrotron emission, building a precise picture of how a SN evolves.
Submitted 15 April, 2024;
originally announced April 2024.
-
Investigating Ocean Circulation Dynamics Through Data Assimilation: A Mathematical Study Using the Stommel Box Model with Rapid Oscillatory Forcings
Authors:
Nathaniel Smith,
Anvaya Shiney-Ajay,
Emmanuel Fleurantin,
Ivo Pasmans
Abstract:
We investigate ocean circulation changes through the lens of data assimilation using a reduced-order model. Our primary interest lies in the Stommel box model, one of the most practicable models able to reproduce the meridional overturning circulation. The Stommel box model has at most two regimes: TH (temperature-driven circulation with sinking near the north pole) and SA (salinity-driven circulation with sinking near the equator). Currently, the meridional overturning is in the TH regime. Using box-averaged Met Office EN4 ocean temperature and salinity data, our goal is to provide a probability that a future regime change occurs and establish how this probability depends on the uncertainties in initial conditions, parameters and forcings. We will achieve this using data assimilation tools and DAPPER within the Stommel box model with fast oscillatory regimes.
Submitted 10 April, 2024;
originally announced April 2024.
-
Analysis Facilities White Paper
Authors:
D. Ciangottini,
A. Forti,
L. Heinrich,
N. Skidmore,
C. Alpigiani,
M. Aly,
D. Benjamin,
B. Bockelman,
L. Bryant,
J. Catmore,
M. D'Alfonso,
A. Delgado Peris,
C. Doglioni,
G. Duckeck,
P. Elmer,
J. Eschle,
M. Feickert,
J. Frost,
R. Gardner,
V. Garonne,
M. Giffels,
J. Gooding,
E. Gramstad,
L. Gray,
B. Hegner
, et al. (41 additional authors not shown)
Abstract:
This white paper presents the current status of the R&D for Analysis Facilities (AFs) and attempts to summarize the views on the future direction of these facilities. These views have been collected through the High Energy Physics (HEP) Software Foundation's (HSF) Analysis Facilities forum, established in March 2022; the Analysis Ecosystems II workshop, which took place in May 2022; and the WLCG/HSF pre-CHEP workshop, which took place in May 2023. The paper attempts to cover all the aspects of an analysis facility.
Submitted 15 April, 2024; v1 submitted 2 April, 2024;
originally announced April 2024.
-
A Taxonomy of Ambiguity Types for NLP
Authors:
Margaret Y. Li,
Alisa Liu,
Zhaofeng Wu,
Noah A. Smith
Abstract:
Ambiguity is a critical component of language that allows for more effective communication between speakers, but is often ignored in NLP. Recent work suggests that NLP systems may struggle to grasp certain elements of human language understanding because they may not handle ambiguities at the level that humans naturally do in communication. Additionally, different types of ambiguity may serve different purposes and require different approaches for resolution, and we aim to investigate how language models' abilities vary across types. We propose a taxonomy of ambiguity types as seen in English to facilitate NLP analysis. Our taxonomy can help make meaningful splits in language ambiguity data, allowing for more fine-grained assessments of both datasets and model performance.
Submitted 20 March, 2024;
originally announced March 2024.
-
RewardBench: Evaluating Reward Models for Language Modeling
Authors:
Nathan Lambert,
Valentina Pyatkin,
Jacob Morrison,
LJ Miranda,
Bill Yuchen Lin,
Khyathi Chandu,
Nouha Dziri,
Sachin Kumar,
Tom Zick,
Yejin Choi,
Noah A. Smith,
Hannaneh Hajishirzi
Abstract:
Reward models (RMs) are at the crux of successfully using RLHF to align pretrained models to human preferences, yet there has been relatively little study that focuses on evaluation of those models. Evaluating reward models presents an opportunity to understand the opaque technologies used for alignment of language models and which values are embedded in them. Resources for reward model training and understanding are sparse in the nascent open-source community around them. To enhance scientific understanding of reward models, we present RewardBench, a benchmark dataset and code-base for evaluation. The RewardBench dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety, to benchmark how reward models perform on challenging, structured and out-of-distribution queries. We create specific comparison datasets for RMs that have subtle, but verifiable reasons (e.g. bugs, incorrect facts) why one answer should be preferred to another. On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods, such as the direct MLE training of classifiers and the implicit reward modeling of Direct Preference Optimization (DPO). We present many findings on propensity for refusals, reasoning limitations, and instruction following shortcomings of various reward models towards a better understanding of the RLHF process.
Submitted 8 June, 2024; v1 submitted 20 March, 2024;
originally announced March 2024.
-
Efficient Encoder-Decoder Transformer Decoding for Decomposable Tasks
Authors:
Bo-Ru Lu,
Nikita Haduong,
Chien-Yu Lin,
Hao Cheng,
Noah A. Smith,
Mari Ostendorf
Abstract:
Transformer-based NLP models are powerful but have high computational costs that limit deployment. Finetuned encoder-decoder models are popular in specialized domains and can outperform larger, more generalized decoder-only models, such as GPT-4. We introduce a new configuration for encoder-decoder models that improves efficiency on structured output and decomposable tasks where multiple outputs are required for a single shared input. Our method, prompt-in-decoder (PiD), encodes the input once and decodes the output in parallel, boosting both training and inference efficiency by avoiding duplicate input encoding and increasing the operational intensity (the ratio of arithmetic operations to memory accesses) of the decoding process by sharing the input key-value cache. We achieve computation reduction that roughly scales with the number of subtasks, gaining up to 4.6x speed-up over state-of-the-art models for dialogue state tracking, summarization, and question-answering tasks, with comparable or better performance.
Submitted 23 May, 2024; v1 submitted 19 March, 2024;
originally announced March 2024.
-
Average circuit eigenvalue sampling on NISQ devices
Authors:
Emilio Pelaez,
Victory Omole,
Pranav Gokhale,
Rich Rines,
Kaitlin N. Smith,
Michael A. Perlin,
Akel Hashim
Abstract:
Average circuit eigenvalue sampling (ACES) was introduced by Flammia in arXiv:2108.05803 as a protocol to characterize the Pauli error channels of individual gates across the device simultaneously. The original paper posed using ACES to characterize near-term devices as an open problem. This work advances in this direction by presenting a full implementation of ACES for real devices and deploying it to Superstaq arXiv:2309.05157, along with a device-tailored resource estimation obtained through simulations and experiments. Our simulations show that ACES is able to estimate one- and two-qubit non-uniform Pauli error channels to an average eigenvalue absolute error of under $0.003$ and total variation distance of under 0.001 between simulated and reconstructed probability distributions over Pauli errors with $10^5$ shots per circuit using 5 circuits of depth 14. The question of estimating general error channels through twirling techniques in real devices remains open, as it is dependent on a device's native gates, but simulations with the Clifford set show results in agreement with reported hardware data. Experimental results on IBM's Algiers and Osaka devices are presented, where we characterize their error channels as Pauli channels without twirling.
Submitted 20 March, 2024; v1 submitted 19 March, 2024;
originally announced March 2024.
-
Third-Party Language Model Performance Prediction from Instruction
Authors:
Rahul Nadkarni,
Yizhong Wang,
Noah A. Smith
Abstract:
Language model-based instruction-following systems have lately shown increasing performance on many benchmark tasks, demonstrating the capability of adapting to a broad variety of instructions. However, such systems are often not designed to be transparent about their limitations; a user may easily prompt a model with an instruction without any idea of whether the responses should be expected to be accurate, or whether the system is even capable of performing the task. We propose a third-party performance prediction framework, where a separate model is trained to predict the metric resulting from evaluating an instruction-following system on a task while assuming access only to its inputs and outputs at inference time. We perform this analysis with a variety of both open and closed instruction-following models as well as multiple performance predictors, and examine the effect of various factors such as model size, number of training tasks, and prompt format. Our findings indicate that third-party performance prediction is very challenging, and much work remains in developing predictors that can automatically reveal the limitations of modern instruction-following natural language processing systems.
Submitted 18 March, 2024;
originally announced March 2024.
-
5 Year Update to the Next Steps in Quantum Computing
Authors:
Kenneth Brown,
Fred Chong,
Kaitlin N. Smith,
Tom Conte,
Austin Adams,
Aniket Dalvi,
Christopher Kang,
Josh Viszlai
Abstract:
It has been 5 years since the Computing Community Consortium (CCC) Workshop on Next Steps in Quantum Computing, and significant progress has been made in closing the gap between useful quantum algorithms and quantum hardware. Yet much remains to be done, in particular in terms of mitigating errors and moving towards error-corrected machines. As we begin to transition from the Noisy Intermediate-Scale Quantum (NISQ) era to a future of fault-tolerant machines, now is an opportune time to reflect on how to apply what we have learned thus far and what research needs to be done to realize computational advantage with quantum machines.
Submitted 26 January, 2024;
originally announced March 2024.
-
Know Your Audience: The benefits and pitfalls of generating plain language summaries beyond the "general" audience
Authors:
Tal August,
Kyle Lo,
Noah A. Smith,
Katharina Reinecke
Abstract:
Language models (LMs) show promise as tools for communicating science to the general public by simplifying and summarizing complex language. Because models can be prompted to generate text for a specific audience (e.g., college-educated adults), LMs might be used to create multiple versions of plain language summaries for people with different levels of familiarity with scientific topics. However, it is not clear what the benefits and pitfalls of adaptive plain language are. When is simplifying necessary, what are the costs in doing so, and do these costs differ for readers with different background knowledge? Through three within-subjects studies in which we surface summaries for different envisioned audiences to participants of different backgrounds, we found that while simpler text led to the best reading experience for readers with little to no familiarity with a topic, high-familiarity readers tended to ignore certain details in overly plain summaries (e.g., study limitations). Our work provides methods and guidance on ways of adapting plain language summaries beyond the single "general" audience.
Submitted 7 March, 2024;
originally announced March 2024.