Royer Research Labs, LLC

Compute-efficient model architectures

Royer Research Labs (RRL) is an independent AI research lab. We design and validate model-architecture techniques that reduce the inference cost and extend the usable context length of transformer-class language models.

Our work targets the structural sources of serving cost (attention, positional encoding, and feed-forward computation) and validates each technique empirically before it reaches deployment.

What we work on

Four lines of architecture research, each aimed at lowering cost without conceding quality. Results to date are described qualitatively.

Linear-time sequence mixing

A mixer that replaces most attention layers, reducing KV-cache memory and improving serving throughput at long context while preserving model quality.

Positional encoding for long context

A method that keeps retrieval reliable as context extends beyond the trained length — the regime where rotary encodings degrade.

Efficient feed-forward blocks

Feed-forward designs that improve quality at matched parameter count, usable as drop-in replacements for the standard block.

Token-conditioned feed-forward

Conditional computation that lets a smaller model match a larger dense baseline's quality with fewer active computations per token.

News

Releases and notes from the lab.

  1. Open-sourcing Niah, a long-context retrieval harness

    We have released Niah, a compact harness for measuring how reliably a language model retrieves information buried in a long context. It implements needle-in-a-haystack testing through likelihood-based multiple-choice scoring with a control-prior correction — an approach that stays stable for small and from-scratch models, where generation-and-match retrieval is noisy.

    Niah runs both as a standalone evaluation over a checkpoint and as an in-the-loop metric during training, ships depth-by-context-length charting, and integrates with the EleutherAI LM Evaluation Harness. It is paired with an open model architecture and training loop so the full pipeline runs from a clean clone, and is released under the MIT license. Long-context retrieval is the regime our positional-encoding research targets directly; releasing the harness we use to measure it is a step toward making those results easy to reproduce.
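To make the scoring idea concrete, here is a minimal sketch of likelihood-based multiple-choice scoring with a prior correction. It is not Niah's code. It assumes the correction subtracts each candidate's log-likelihood under a control context (the haystack without the needle) from its log-likelihood under the full context. The function names and the toy numbers are hypothetical.

```python
from typing import List, Sequence


def corrected_scores(
    needle_logprobs: Sequence[float],
    control_logprobs: Sequence[float],
) -> List[float]:
    """Score each candidate by how much the needle raises its log-likelihood.

    needle_logprobs[i]:  log p(candidate i | haystack containing the needle)
    control_logprobs[i]: log p(candidate i | control context without the needle)
    The subtraction is an assumed form of the control-prior correction.
    """
    if len(needle_logprobs) != len(control_logprobs):
        raise ValueError("need one control log-prob per candidate")
    return [n - c for n, c in zip(needle_logprobs, control_logprobs)]


def pick_answer(
    needle_logprobs: Sequence[float],
    control_logprobs: Sequence[float],
) -> int:
    """Return the index of the candidate with the highest corrected score."""
    scores = corrected_scores(needle_logprobs, control_logprobs)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy log-probs (hypothetical). Candidate 1 is likely a priori;
# candidate 2 is the one the needle actually supports.
needle = [-4.0, -2.5, -6.0]
control = [-4.2, -2.0, -9.0]

raw_choice = max(range(len(needle)), key=needle.__getitem__)
corrected_choice = pick_answer(needle, control)
```

In this toy case the raw likelihoods favor candidate 1 simply because the model finds it likely in any context, while the corrected score favors candidate 2, whose likelihood rises most once the needle is present. Comparing likelihoods rather than matching generated text is what keeps the measurement usable for small models that cannot yet produce clean answers.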

Approach

Serving transformer models at long context is expensive, and most of that cost is fixed at the architecture level — by how attention scales, how positions are encoded, and how much computation each token triggers. Choices made before a model is trained set the ceiling on how cheaply it can later be served.

RRL works at that level. We develop architecture techniques that reduce serving cost before deployment rather than optimizing around a fixed design afterward, and we validate each technique empirically through controlled experiments at scale. Core techniques have been validated at scales up to 1.6 billion parameters, with a consistent aim: significant inference-throughput gains and reduced KV-cache memory while matching the quality of a strong baseline.
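As a rough illustration of why KV-cache memory is set at the architecture level, the sketch below applies the standard size estimate for a transformer's KV cache: two tensors (keys and values) per attention layer, each of size kv_heads × head_dim × sequence length. The model shapes are hypothetical, not RRL configurations, and the estimate ignores any fixed-size state a non-attention mixer would carry.

```python
def kv_cache_bytes(
    attn_layers: int,
    kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch: int = 1,
    bytes_per_elem: int = 2,  # 2 bytes per element, e.g. fp16 or bf16
) -> int:
    """Estimate KV-cache size: keys plus values for every attention layer."""
    return 2 * attn_layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Hypothetical shapes, chosen only to make the arithmetic easy to follow.
full_attention = kv_cache_bytes(attn_layers=24, kv_heads=16, head_dim=128, seq_len=32768)
mostly_mixer = kv_cache_bytes(attn_layers=4, kv_heads=16, head_dim=128, seq_len=32768)

GIB = 2**30
print(f"24 attention layers: {full_attention / GIB:.1f} GiB per sequence")
print(f" 4 attention layers: {mostly_mixer / GIB:.1f} GiB per sequence")
```

Under these assumed shapes the cache grows linearly with both context length and the number of attention layers, so keeping 4 of 24 attention layers cuts the per-sequence cache from 6 GiB to 1 GiB. No serving-time optimization changes that ratio, because it follows from the layer count chosen before training.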

About

RRL is an independent research lab focused on model efficiency. It was founded by an engineer with more than 20 years of systems and software architecture experience across telecom, automotive, and finance, now directed toward reducing the cost of language-model inference.

The lab runs active experimentation at small-to-medium scale with selective scale-up, advancing techniques only as the evidence supports them.

Get in touch

For research collaboration or technical correspondence, reach us directly:

contact@royerresearchlabs.com