Binary data: never base64-encode inside JSON
The most common JSON overreach is using it to transport binary data — images, audio, PDF files, compiled binaries — by base64-encoding the content and embedding it in a JSON string. This is wrong on multiple levels.
Base64 encoding inflates the payload by about 33% (every three bytes become four characters) before the JSON envelope is even added. The base64 alphabet needs no JSON escaping, but the serializer and the parser still have to scan every character of the string. The receiver must then JSON-parse the document and base64-decode the string: two steps where one would do.
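The 33% figure is easy to check. The sketch below uses only the Python standard library; the 3,072-byte dummy payload and the `base64_size` helper are illustrative assumptions, not part of any real upload API. It encodes a buffer, wraps it in a JSON document, and then performs the two decode steps a receiver would need.

```python
import base64
import json
import math

def base64_size(n_bytes: int) -> int:
    """Length of the padded base64 encoding of n_bytes of input."""
    return 4 * math.ceil(n_bytes / 3)

raw = bytes(range(256)) * 12  # 3,072 dummy bytes standing in for a real image
encoded = base64.b64encode(raw).decode("ascii")
document = json.dumps({"filename": "photo.jpg", "content": encoded})

# Receiver side: two passes over the data instead of one.
parsed = json.loads(document)                   # step 1: JSON parse
restored = base64.b64decode(parsed["content"])  # step 2: base64 decode

print(len(raw), len(encoded), round(len(encoded) / len(raw), 2))
```

The ratio holds for any input size, apart from up to two bytes of padding on the final group.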
// Wrong — do not do this
POST /api/upload
Content-Type: application/json

{
  "filename": "photo.jpg",
  "content": "iVBORw0KGgoAAAANSUhEUgAA..." // base64: 33% larger, double-parsed
}
// Right — use multipart/form-data for mixed content
POST /api/upload
Content-Type: multipart/form-data; boundary=----WebKitFormBoundary

------WebKitFormBoundary
Content-Disposition: form-data; name="metadata"
Content-Type: application/json

{"filename": "photo.jpg", "description": "Profile photo"}
------WebKitFormBoundary
Content-Disposition: form-data; name="file"; filename="photo.jpg"
Content-Type: image/jpeg

[binary data here]
------WebKitFormBoundary--
// Or just use two separate requests: upload binary to storage,
// store the URL in JSON. This is what every cloud storage service does.
Configuration files: use TOML or YAML, not JSON
JSON was not designed for configuration files. It has no comment syntax, no multi-line string support without escape sequences, and strict syntax that makes hand-editing error-prone. Teams using JSON for configuration (package.json excepted, since it has an ecosystem) are working against the format.
TOML (Tom's Obvious, Minimal Language) is the underappreciated alternative. It reads like an INI file, supports comments, handles multi-line strings natively, has a clean date/time literal syntax, and maps to a typed data model rather than YAML's implicit coercion minefield. It is the configuration format of Rust's Cargo, Python's pyproject.toml, and Hugo.
# TOML: much better for human-authored configuration
[database]
host = "localhost"
port = 5432  # integer, not string; no implicit coercion surprises
name = "myapp"

[server]
# Development vs production is obvious here
port = 8080
debug = false

[feature_flags]
new_dashboard = true
beta_api = false

# Compare to the JSON equivalent: no comments, no context
Tabular data at scale: use Parquet or Arrow
For analytics, data exports, and machine learning datasets, JSON is the wrong format past a few thousand rows. A JSON array of objects repeats every field name with every record. For a dataset with 20 columns and 1 million rows, that is 20 million repeated keys, and at roughly ten bytes per key the field names alone account for hundreds of megabytes of wasted space.
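A back-of-the-envelope estimate makes the overhead concrete. The helper below is only a sketch: `key_overhead_bytes` and the ten-character average key length are assumptions chosen for illustration, and it counts nothing but the quoted key and its colon in compact JSON.

```python
def key_overhead_bytes(columns: int, rows: int, avg_key_len: int) -> int:
    """Bytes spent on repeated field names in a compact JSON array of objects.

    Each key occurrence costs the key itself, two quotes, and one colon.
    """
    per_key = avg_key_len + 3
    return columns * rows * per_key

overhead = key_overhead_bytes(columns=20, rows=1_000_000, avg_key_len=10)
print(f"{overhead / 1_000_000:.0f} MB of field names")  # prints: 260 MB of field names
```

A columnar format stores each field name once, so this cost disappears regardless of row count.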
Apache Parquet is a columnar binary format designed for analytical workloads. The same dataset that is 1GB as JSON might be 50MB as Parquet — a 20x reduction — and it can be queried column-by-column, meaning tools like DuckDB can compute a sum over one column without reading the others. For any dataset that will be queried analytically, Parquet is not a nice-to-have; it is the correct tool.
# Python: write analytics data to Parquet instead of JSON
import pandas as pd
df = pd.DataFrame(records) # records is a list of dicts
# Bad — JSON export of analytics data
df.to_json('export.json', orient='records') # 1.2 GB
# Good — Parquet with compression
df.to_parquet('export.parquet', compression='snappy') # ~60 MB
# DuckDB can query it without loading into memory
import duckdb
result = duckdb.query("SELECT user_id, SUM(revenue) FROM 'export.parquet' GROUP BY 1")
High-frequency real-time data: use MessagePack or Protocol Buffers
Game state updates, sensor readings, financial tick data, real-time telemetry: these are workloads where JSON's text-based encoding and parsing overhead becomes measurable. At 10,000 messages per second, JSON serialization and deserialization can consume on the order of 10-20% of a core, depending on message size and the parser in use.
MessagePack is a binary serialization format that encodes the same data model as JSON (strings, numbers, arrays, objects, booleans, null), plus a raw binary type, in a form that is typically 20-30% smaller and often parses 2-3x faster. It requires no schema, has implementations for most mainstream languages, and is close to a drop-in replacement for JSON when you control both ends of the connection. This is the format to reach for when JSON performance becomes a bottleneck and you do not want to deal with schema management.
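Where the size saving comes from is easier to see with the bytes in hand. The following is a hand-written encoder for a small subset of MessagePack, included purely as an illustration: it handles only null, booleans, unsigned integers up to 32 bits, strings up to 31 bytes, and arrays and maps of up to 15 items. The `pack` function is hypothetical and no substitute for a real MessagePack library.

```python
import json
import struct

def pack(value) -> bytes:
    """Encode a small subset of MessagePack (illustration only)."""
    if value is None:
        return b"\xc0"
    if value is True:
        return b"\xc3"
    if value is False:
        return b"\xc2"
    if isinstance(value, int):
        if 0 <= value <= 0x7F:
            return bytes([value])                      # positive fixint: one byte
        if 0 <= value <= 0xFF:
            return b"\xcc" + bytes([value])            # uint 8
        if 0 <= value <= 0xFFFF:
            return b"\xcd" + struct.pack(">H", value)  # uint 16
        if 0 <= value <= 0xFFFFFFFF:
            return b"\xce" + struct.pack(">I", value)  # uint 32
        raise ValueError("integer outside the range this sketch supports")
    if isinstance(value, str):
        raw = value.encode("utf-8")
        if len(raw) > 31:
            raise ValueError("string too long for this sketch")
        return bytes([0xA0 | len(raw)]) + raw          # fixstr: length lives in the tag
    if isinstance(value, list):
        if len(value) > 15:
            raise ValueError("array too long for this sketch")
        return bytes([0x90 | len(value)]) + b"".join(pack(v) for v in value)
    if isinstance(value, dict):
        if len(value) > 15:
            raise ValueError("map too large for this sketch")
        body = b"".join(pack(k) + pack(v) for k, v in value.items())
        return bytes([0x80 | len(value)]) + body       # fixmap
    raise TypeError(f"unsupported type: {type(value).__name__}")

message = {"type": "game_state", "tick": 1024}
binary = pack(message)
as_json = json.dumps(message, separators=(",", ":")).encode("utf-8")
print(len(binary), len(as_json))  # prints: 25 33
```

The saving comes from dropping quotes, colons, and commas and from storing `1024` as a three-byte integer instead of four ASCII digits. Field names are still repeated in every message, which is the part a schema-based format removes.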
Protocol Buffers and FlatBuffers go further — they require a schema definition, but in return they produce payloads 60-80% smaller than JSON and deserialize at a fraction of the cost. FlatBuffers in particular allows zero-copy access: you can read fields from the buffer without deserializing the entire structure. For game engines, high-frequency trading, and embedded systems, this matters enormously.
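The zero-copy idea can be illustrated without FlatBuffers itself. The sketch below uses a hand-rolled fixed layout, which is not the FlatBuffers wire format; the field names and offsets are hypothetical. It reads one field straight out of the buffer with `struct.unpack_from` and never decodes the other bytes.

```python
import struct

# Hypothetical fixed layout, little-endian, no padding:
#   offset 0   uint32   tick
#   offset 4   float32  x
#   offset 8   float32  y
#   offset 12  uint16   health
RECORD = struct.Struct("<IffH")
HEALTH_OFFSET = 12

def read_health(buffer) -> int:
    """Read only the health field; no other bytes are decoded."""
    (health,) = struct.unpack_from("<H", buffer, HEALTH_OFFSET)
    return health

packet = RECORD.pack(1024, 1.5, -2.25, 87)
view = memoryview(packet)  # a view over the received bytes, not a copy
print(RECORD.size, read_health(view))  # prints: 14 87
```

With JSON, reaching the same field would mean parsing the whole document first. Schema-based formats can skip that because both sides agree on the layout in advance.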
// MessagePack — drop-in replacement for JSON in WebSocket messages
import { encode, decode } from '@msgpack/msgpack';
// Server: encode to binary
const message = { type: 'game_state', tick: 1024, players: [/* ... */] };
const binary = encode(message); // Uint8Array, ~25% smaller than JSON
// Send over WebSocket
ws.send(binary);
// Client: decode from binary
ws.on('message', (data) => {
  const message = decode(data);
  // message is identical to original object
});
Logs: use NDJSON (newline-delimited JSON)
Structured logging — machine-parseable log lines — is broadly accepted as the correct approach for production systems. The question is format. Plain JSON is technically valid for log lines but fails in practice: a multi-line JSON document cannot be processed line-by-line by log aggregators, and a JSON array of log entries must be fully parsed before any entry can be processed.
NDJSON (Newline-Delimited JSON, also called JSON Lines) is the right format for logs and streaming data. Each line is a complete, valid JSON object. Log aggregators can process line-by-line without buffering. Files can be appended to without invalidating the structure. Tail-following works natively.
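The append-and-stream behaviour can be sketched in a few lines of Python. The helper names `append_event` and `iter_events` are hypothetical, and an in-memory `io.StringIO` stands in for a log file opened in append mode.

```python
import io
import json

def append_event(stream, event: dict) -> None:
    """Append one event as a single compact JSON line."""
    stream.write(json.dumps(event, separators=(",", ":")) + "\n")

def iter_events(stream):
    """Yield events one line at a time; only the current line is held in memory."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

log = io.StringIO()  # stands in for open("app.log", "a+")
append_event(log, {"level": "info", "msg": "Server started", "port": 8080})
append_event(log, {"level": "error", "msg": "DB error", "err": "connection refused"})

log.seek(0)
errors = [e["msg"] for e in iter_events(log) if e["level"] == "error"]
print(errors)  # prints: ['DB error']
```

Because `json.dumps` escapes newlines inside strings, each event is guaranteed to occupy exactly one line, which is what keeps the file safe to append to and to tail.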
// Each line is a complete JSON object — one log event per line
{"ts":"2026-05-31T10:00:00Z","level":"info","msg":"Server started","port":8080}
{"ts":"2026-05-31T10:00:01Z","level":"info","msg":"Request","method":"GET","path":"/api/users","ms":12}
{"ts":"2026-05-31T10:00:02Z","level":"error","msg":"DB error","err":"connection refused"}
// Process with standard UNIX tools
cat app.log | grep '"level":"error"' | jq .
cat app.log | jq -c 'select(.ms > 100)' # slow requests, filtered one line at a time
The practical thresholds
To make this concrete: here are the thresholds at which JSON stops being the right choice, based on payload characteristics rather than vague scale concerns:
Scenario                         JSON OK?                          Better alternative
--------------------------------------------------------------------------------------
API responses < 1 MB             Yes                               -
API responses > 10 MB            No                                Streaming + pagination
Config files (human-edited)      No                                TOML or YAML
Config files (generated)         Yes                               -
Real-time messages < 100/s       Yes                               -
Real-time messages > 1000/s      Maybe                             MessagePack
Analytics exports < 10k rows     Yes                               -
Analytics exports > 100k rows    No                                Parquet
Binary file content              Never                             multipart/form-data
Log lines                        NDJSON only (JSON, one per line)  -
Schema-stable RPC at scale       No                                Protocol Buffers
Below these thresholds, JSON's universality, readability, and zero-dependency ecosystem make it the right choice. The moment you cross them, you are trading a known engineering liability for short-term convenience. Be intentional about that tradeoff.