Cost Engineering
Stop Paying Premium Prices for Repeated Log Content
Craig Conboy

I recently discovered something surprising while examining customer logs that included OAuth scope lists within the JSON context of log messages. What started as a routine cost investigation turned into a masterclass in how the smallest optimizations can sometimes have the biggest impact on your bill.

The Problem: When One Field Breaks the Bank

The field seemed innocent enough at first glance. Just a list of string tokens, right? But here's what I found: each token was 15-45 characters in length, and these lists could get quite long. So it wasn't unusual for this one field to occupy several KB of space per log entry.

Now, as you know, with many log services you pay for the bytes you ingest. When I dug into the billing breakdown, a surprising 25% of the total logging cost was coming from ingesting this single field. Twenty-five percent! That's not a rounding error—that's real money walking out the door every month for what was essentially the same information repeated over and over again.

The Insight: Repetition is Opportunity

Here's where it gets interesting. When I analyzed the actual content of these token lists, I discovered something that changed everything: there was massive repetition in the data. The same common tokens appeared in log entry after log entry—while only occasionally would something new or uncommon show up.

This repetition wasn't a bug; it was a feature. And more importantly, it was an opportunity.

The Solution: 128-bit Bitfield Token Compression

We implemented what I'm calling 128-bit Bitfield Token Compression—a technique that compresses repeated string token lists using a hybrid approach that's both effective and future-proof.

Here's how it works:

Core Method:

  • Map the 128 most frequent token values to bit positions in a 128-bit (16-byte) bitfield
  • Encode the bitfield as base64 (22 characters of data, plus padding)
  • Handle unknown or rare tokens as comma-separated fallback strings

Format: #<base64_bitfield>|<unknown_token1>,<unknown_token2>

Example:

Original: {'app:feature:alpha', 'sys:read:config', 'user:profile:basic', 'api:v2:access', 'new:experimental:beta'}

Compressed: #rwAAAAAAAAAAAAAAAAAAAA==|new:experimental:beta

Progress! Instead of hundreds of characters for common token combinations, we're down to about 22 characters for the base64 bitfield, plus only the truly uncommon tokens spelled out.
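To make that concrete, here's a minimal Python sketch of the encoding side. The vocabulary table, the compress_tokens name, and the bit ordering are illustrative assumptions rather than our production code, so its output won't necessarily match the example string above byte for byte:

    import base64

    # Hypothetical vocabulary: the 128 most frequent tokens, each assigned a bit position.
    # In practice this table would be generated from observed token frequencies.
    TOKEN_TO_BIT = {
        "app:feature:alpha": 0,
        "sys:read:config": 1,
        "user:profile:basic": 2,
        "api:v2:access": 3,
        # ... up to 128 entries
    }

    def compress_tokens(tokens):
        """Encode a token list as '#<base64 bitfield>|<comma-separated unknown tokens>'."""
        bits = 0
        unknown = []
        for token in tokens:
            pos = TOKEN_TO_BIT.get(token)
            if pos is None:
                unknown.append(token)   # rare or brand-new tokens fall back to plain text
            else:
                bits |= 1 << pos        # set the bit for a known token
        # Pack the 128-bit integer into 16 bytes and base64-encode it (22 data characters plus padding).
        bitfield = base64.b64encode(bits.to_bytes(16, "big")).decode("ascii")
        return f"#{bitfield}|{','.join(unknown)}" if unknown else f"#{bitfield}"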

The Results: 97% Compression Ratio

The numbers don't lie. We achieved a 97% compression ratio on this data. That field that was eating up 25% of our logging budget? It's now consuming less than 1% of the ingestion costs.

But what if new tokens start appearing in the system? The technique is future-proof because of the extensibility mechanism built in. As the popular tokens shift, the compression ratio might start to decline, but everything keeps working: new tokens simply get added to the fallback string section.
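With the toy vocabulary from the sketch above, a token the encoder has never seen simply rides along in the fallback section (the scope name here is made up for illustration):

    print(compress_tokens(["app:feature:alpha", "org:billing:export"]))
    # -> '#AAAAAAAAAAAAAAAAAAAAAQ==|org:billing:export'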

The Tradeoff: Readability vs. Cost

There's no such thing as a free lunch, and this optimization comes with an important tradeoff. Engineers can no longer easily read the tokens directly in the log viewer.

Instead of seeing {'app:feature:alpha', 'sys:read:config', 'user:profile:basic'}, they see #rwAAAAAAAAAAAAAAAAAAAA==.

The problem, of course, is that debugging often requires human-readable logs. So we built a simple web page that restores the encoded value to its original format. Engineers can paste in the compressed token string and immediately see what the original values were. We also provide a command-line script for those who prefer that workflow.
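The decoder behind a tool like that is just the encoder run in reverse. Here's a sketch, again assuming the TOKEN_TO_BIT table from earlier (decompress_tokens is my name for it, not a real API):

    def decompress_tokens(value):
        """Restore the original token list from '#<base64 bitfield>|<fallback tokens>'."""
        bitfield_b64, _, fallback = value.lstrip("#").partition("|")
        bits = int.from_bytes(base64.b64decode(bitfield_b64), "big")
        tokens = [tok for tok, pos in TOKEN_TO_BIT.items() if bits & (1 << pos)]
        if fallback:
            tokens.extend(fallback.split(","))
        return tokens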

Why This Matters Beyond OAuth Scopes

This isn't just about OAuth scopes—it's about recognizing that log data often contains highly repetitive structured information that you're paying premium prices to store and search. User agent strings, error codes, API endpoints, feature flags, environment configurations—all of these can be targets for similar compression techniques.

The key insight is that when you're dealing with a constrained vocabulary (a limited set of possible values that appear frequently), bitfield compression can dramatically reduce your ingestion costs while maintaining all the information you actually need.
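The only real setup cost is choosing that vocabulary. One way it could be derived from a sample of real log entries (build_vocabulary is a hypothetical helper, not part of any library):

    from collections import Counter

    def build_vocabulary(token_lists, size=128):
        """Map the `size` most frequent tokens to stable bit positions."""
        counts = Counter(tok for tokens in token_lists for tok in tokens)
        return {tok: pos for pos, (tok, _) in enumerate(counts.most_common(size))}

One caveat worth stating: once a table has been used to write logs, its bit positions have to stay fixed, otherwise older entries would decode to the wrong tokens.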

Bill First, Optimize Second

Here's the broader lesson: you can't optimize what you don't measure. I only discovered this opportunity because I started with the bill, saw that 25% line item, and traced it back through metrics to the actual code generating these oversized log entries.

If I had started by looking at the code and asking "how can we make logging more efficient?", I might have focused on reducing log frequency or restructuring the JSON. Instead, starting with the bill led me directly to the highest-impact optimization.

In Closing

Sometimes the best optimizations are the ones that feel almost too simple. Map common values to bits, encode as base64, handle the edge cases. It's been done before in other contexts, but applying it to log ingestion costs? That turned a significant budget line item into a rounding error.

The technique is sitting there in production now, quietly compressing token data and saving money every day. And the best part? As our application evolves and new tokens become popular, the system adapts automatically. It's an old idea applied to a modern problem, but sometimes that's exactly what you need to stop paying premium prices for repeated content.

At Frugal, we're always looking for practical ways to make cloud applications more cost-efficient. If you're dealing with similar log ingestion cost challenges, we'd love to hear about them.
