Royer Research Labs, LLC

Compute-efficient model architectures

Royer Research Labs (RRL) is an independent AI research lab. We design and validate model-architecture techniques that reduce the inference cost and extend the usable context length of transformer-class language models.

Our work targets the structural sources of serving cost (attention, positional encoding, and feed-forward computation) and validates each technique empirically before it reaches deployment.

What we work on

Four lines of architecture research, each aimed at lowering cost without conceding quality. Results to date are described qualitatively.

Linear-time sequence mixing

A mixer that replaces most attention layers, reducing KV-cache memory and improving serving throughput at long context while preserving model quality.

Positional encoding for long context

A method that keeps retrieval reliable as context extends beyond the trained length — the regime where rotary encodings degrade.

Efficient feed-forward blocks

Feed-forward designs that improve quality at matched parameter count, usable as drop-in replacements for the standard block.

Token-conditioned feed-forward

Conditional computation that lets a smaller model match a larger dense baseline's quality with fewer active computations per token.

News

Releases and notes from the lab.

July 20, 2026

Open-sourcing TriGLU, a controlled study of how much attention a decoder needs

We have released TriGLU v0.2.0, a controlled study of how much of a decoder's attention is actually load-bearing — and whether the rest can be replaced with something cheaper. In an 89-million-parameter, 20-layer decoder trained on one billion FineWeb-Edu tokens, replacing more than half of the attention layers with cheap token-local blocks held language-modeling quality at parity while training about 19% faster and using 55% less key/value cache, with the efficiency gains widening at longer context. The finding is that where and how much attention you replace matters more than the exact mixer used.
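The key/value cache figure can be sanity-checked with simple arithmetic. The sketch below is not taken from the release. It assumes that every layer of the baseline uses full attention, that a token-local block stores no keys or values, and that 11 of the 20 layers were replaced (one count consistent with "more than half" and with the 55% figure, but an inference rather than a number stated above). The head count, head dimension, and byte width are hypothetical placeholders.

```python
def kv_cache_bytes(n_attention_layers, context_len, n_kv_heads=8, head_dim=64,
                   bytes_per_value=2, batch_size=1):
    """Bytes needed to cache keys and values for every attention layer.

    n_kv_heads, head_dim and bytes_per_value are hypothetical defaults,
    not the configuration used in the study.
    """
    # Factor 2: one key tensor and one value tensor per attention layer.
    per_layer = 2 * n_kv_heads * head_dim * context_len * bytes_per_value
    return n_attention_layers * per_layer * batch_size


def cache_saving(total_layers, replaced_layers, context_len):
    """Fraction of KV cache saved when replaced layers keep no cache (assumption)."""
    baseline = kv_cache_bytes(total_layers, context_len)
    hybrid = kv_cache_bytes(total_layers - replaced_layers, context_len)
    return 1.0 - hybrid / baseline


if __name__ == "__main__":
    for context_len in (1024, 8192):
        baseline = kv_cache_bytes(20, context_len)
        hybrid = kv_cache_bytes(20 - 11, context_len)  # 11 replaced layers is assumed
        saving = cache_saving(20, 11, context_len)
        print(context_len, baseline, hybrid, f"{saving:.0%}")
```

Under these assumptions the fraction saved is the same at every context length, while the absolute number of bytes saved grows linearly with context.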

TriGLU — the Triple-Product Gated Linear Unit — is the token-local mixer at the center of the study: a simple three-factor gated product that mixes channels within each token and moves no information between positions. We are deliberate about what we do not claim — the algebra belongs to a known family of gated units, and the contribution is the controlled study and its map, not a new building block. Every number is backed by a raw record: the release ships the full implementation, the exact configuration for every run, and an evidence archive of resolved configs, metric streams, environment metadata, and data hashes — enough to audit or reproduce any comparison. The code is released under the Apache-2.0 license.
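To make "three-factor gated product" and "token-local" concrete, here is a minimal sketch of one plausible form of such a unit. It is not the released TriGLU implementation: the class name `TripleProductGLU`, the choice of SiLU on the gate branch, the projection names, and the initialization are all assumptions made for illustration. Each token is projected three ways, the three projections are multiplied elementwise, and the result is projected back, with every position processed independently.

```python
import math
import random


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def silu(v):
    """SiLU activation, used here as an assumed gate nonlinearity."""
    return v / (1.0 + math.exp(-v))


def make_matrix(rows, cols, rng):
    scale = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class TripleProductGLU:
    """Hypothetical three-factor gated unit; not the released TriGLU code."""

    def __init__(self, d_model, d_hidden, seed=0):
        rng = random.Random(seed)
        self.w_a = make_matrix(d_hidden, d_model, rng)
        self.w_b = make_matrix(d_hidden, d_model, rng)
        self.w_gate = make_matrix(d_hidden, d_model, rng)
        self.w_out = make_matrix(d_model, d_hidden, rng)

    def forward_token(self, x):
        a = matvec(self.w_a, x)
        b = matvec(self.w_b, x)
        g = [silu(v) for v in matvec(self.w_gate, x)]
        # Elementwise product of three factors mixes channels within one token.
        hidden = [ai * bi * gi for ai, bi, gi in zip(a, b, g)]
        return matvec(self.w_out, hidden)

    def forward(self, sequence):
        # Each position is handled on its own: nothing moves between positions.
        return [self.forward_token(x) for x in sequence]
```

Because `forward` maps each position independently, changing one token's input leaves the outputs at all other positions untouched. That is the property that lets such a block run without a key/value cache.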

Code & study on GitHub → Release v0.2.0 → Archived record (DOI) →

June 7, 2026

Open-sourcing Niah, a long-context retrieval harness

We have released Niah, a compact harness for measuring how reliably a language model retrieves information buried in a long context. It implements needle-in-a-haystack testing through likelihood-based multiple-choice scoring with a control-prior correction — an approach that stays stable for small and from-scratch models, where generation-and-match retrieval is noisy.
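The scoring idea can be sketched in a few lines. The exact correction Niah applies is not described here, so the following is an assumption: each candidate answer is scored by its log-likelihood given the context that contains the needle, minus its log-likelihood given a control context without the needle. The function names and the example numbers are hypothetical, and the log-likelihoods would in practice come from the model under test.

```python
def corrected_scores(needle_logprobs, control_logprobs):
    """Subtract each candidate's control-context log-likelihood (its prior)."""
    if len(needle_logprobs) != len(control_logprobs):
        raise ValueError("one control score is required per candidate")
    return [n - c for n, c in zip(needle_logprobs, control_logprobs)]


def pick_answer(needle_logprobs, control_logprobs):
    """Index of the candidate whose likelihood rose most once the needle was present."""
    scores = corrected_scores(needle_logprobs, control_logprobs)
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Hypothetical log-likelihoods for three candidate answers.
    with_needle = [-2.0, -1.5, -4.0]
    control = [-6.0, -1.4, -4.2]
    raw_choice = max(range(3), key=with_needle.__getitem__)
    print("raw argmax:", raw_choice)
    print("corrected choice:", pick_answer(with_needle, control))
```

In this example the second candidate has the highest raw likelihood only because the model favors it regardless of context. After the correction, the first candidate wins, since its likelihood is the one that the needle actually raised.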

Niah runs both as a standalone evaluation over a checkpoint and as an in-the-loop metric during training, ships depth-by-context-length charting, and integrates with the EleutherAI LM Evaluation Harness. It is paired with an open model architecture and training loop so the full pipeline runs from a clean clone, and is released under the MIT license. Long-context retrieval is the regime our positional-encoding research targets directly; releasing the harness we use to measure it is a step toward making those results easy to reproduce.
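A depth-by-context-length chart is, at its core, a grid of retrieval accuracy indexed by how long the context is and how deep in it the needle sits. The sketch below shows one way to aggregate trial outcomes into such a grid and print it as text. It is illustrative only: the trial tuple layout and the function names are assumptions, not Niah's interface.

```python
from collections import defaultdict


def accuracy_grid(trials):
    """Aggregate (context_len, depth_fraction, correct) trials into cell accuracies."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for context_len, depth, correct in trials:
        counts[(context_len, depth)] += 1
        hits[(context_len, depth)] += int(correct)
    return {cell: hits[cell] / counts[cell] for cell in counts}


def render_grid(grid):
    """Render the grid as text: one row per depth, one column per context length."""
    lengths = sorted({n for n, _ in grid})
    depths = sorted({d for _, d in grid})
    lines = ["depth  " + "".join(f"{n:>8}" for n in lengths)]
    for d in depths:
        cells = "".join(
            f"{grid[(n, d)]:>8.2f}" if (n, d) in grid else f"{'-':>8}"
            for n in lengths
        )
        lines.append(f"{d:<7.2f}" + cells)
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical trial outcomes, not measured results.
    trials = [
        (1024, 0.0, True), (1024, 0.0, False),
        (1024, 0.5, True), (2048, 0.5, True),
    ]
    print(render_grid(accuracy_grid(trials)))
```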

View on GitHub →

Approach

Serving transformer models at long context is expensive, and most of that cost is fixed at the architecture level — by how attention scales, how positions are encoded, and how much computation each token triggers. Choices made before a model is trained set the ceiling on how cheaply it can later be served.

RRL works at that level. We develop architecture techniques that reduce serving cost before deployment rather than optimizing around a fixed design afterward, and we validate each technique empirically through controlled experiments at scale. Core techniques have been validated at scales up to 1.6 billion parameters, with a consistent aim: significant inference-throughput gains and reduced KV-cache memory while matching a strong baseline.

About

RRL is an independent research lab focused on model efficiency. It was founded by an engineer with more than 20 years of systems and software architecture experience across telecom, automotive, and finance, now directed toward reducing the cost of language-model inference.

The lab runs active experimentation at small-to-medium scale with selective scale-up, advancing techniques only as the evidence supports them.

Get in touch

For research collaboration or technical correspondence, reach us directly:

contact@royerresearchlabs.com