MongoDB Schema Design Patterns for Developers
MongoDB is a document database where your schema design has a massive impact on performance and maintainability. Unlike relational databases, MongoDB doesn’t enforce a rigid schema — but that flexibility means you have to make deliberate choices. The central question in every MongoDB schema design is: embed or reference? Get it wrong and you’ll either hit document size limits or write slow, join-heavy queries that MongoDB wasn’t designed to run.
Embedding vs Referencing
Embedding (Denormalization)
Embed related data inside the same document when you always read it together. This is MongoDB’s native style — queries stay within a single document, no joins required.
{
"_id": "user123",
"name": "Alice",
"address": {
"street": "123 Main St",
"city": "Mumbai",
"pin": "400001"
},
"preferences": {
"theme": "dark",
"notifications": true
}
}
Embedding is ideal when:
- The nested data is only meaningful in the context of the parent (a user’s address)
- You always fetch the parent and child together
- The embedded data is bounded in size and won’t grow unboundedly
- Write operations mostly update the whole document
Referencing (Normalization)
Reference a separate document by storing its _id. MongoDB doesn’t have foreign keys, so enforcement is up to your application.
// posts collection
{
"_id": "post456",
"title": "Getting Started with MongoDB",
"author_id": "user123",
"body": "..."
}
// users collection
{
"_id": "user123",
"name": "Alice"
}
Use references when:
- The related data is large or unbounded (a user’s comment history)
- The related data is shared across multiple parents (a category used by many posts)
- You sometimes need the related data and sometimes don’t
- The related entity is updated frequently and independently
Common Schema Design Patterns
Pattern 1: Attribute Pattern
When a document has many similar fields that vary by instance, collapse them into a key-value array. This makes queries and indexes far cleaner.
Before (hard to index on every spec field):
{
"_id": "product1",
"name": "Gaming Chair",
"color": "black",
"weight_kg": 15,
"max_load_kg": 120,
"material": "leather"
}
After (attribute pattern):
{
"_id": "product1",
"name": "Gaming Chair",
"specs": [
{ "k": "color", "v": "black" },
{ "k": "weight_kg", "v": 15 },
{ "k": "max_load_kg", "v": 120 },
{ "k": "material", "v": "leather" }
]
}
Now you create a single compound index on specs.k and specs.v and query any attribute uniformly:
db.products.createIndex({ "specs.k": 1, "specs.v": 1 })
db.products.find({ specs: { $elemMatch: { k: "color", v: "black" } } })
Pattern 2: Bucket Pattern
For time-series data or streams of events, grouping many small documents into “buckets” reduces document count and index overhead significantly.
Instead of one document per IoT sensor reading:
{ "_id": "...", "device_id": "sensor1", "temp": 22.5, "ts": "2026-05-25T10:00:00Z" }
{ "_id": "...", "device_id": "sensor1", "temp": 22.7, "ts": "2026-05-25T10:01:00Z" }
Bucket them by device and hour:
{
"_id": { "device_id": "sensor1", "hour": "2026-05-25T10:00:00Z" },
"readings": [
{ "temp": 22.5, "ts": "2026-05-25T10:00:00Z" },
{ "temp": 22.7, "ts": "2026-05-25T10:01:00Z" }
],
"count": 2,
"avg_temp": 22.6
}
This can reduce document count by 60x for per-minute sensor data. Storing pre-aggregated avg_temp avoids recomputing it on every read.
Pattern 3: Outlier Pattern
When 99% of documents are small but 1% are very large — a celebrity with 10M followers vs a typical user with 200 — use a flag to handle outliers separately rather than letting them bloat the main collection.
// Typical user — followers embedded directly
{
"_id": "user789",
"name": "Bob",
"followers": ["user1", "user2", "user3"],
"has_overflow": false
}
// Celebrity — first N embedded, rest in overflow collection
{
"_id": "user001",
"name": "Priya",
"followers": ["user1", "user2"],
"has_overflow": true
}
// followers_overflow collection
{ "user_id": "user001", "followers": ["user3", "user4", ...] }
Pattern 4: Computed Pattern
Pre-compute expensive aggregations and store them on the document. Update them asynchronously or on write, not on every read.
{
"_id": "product1",
"name": "Gaming Chair",
"review_count": 1482,
"avg_rating": 4.3,
"rating_sum": 6373
}
When a new review comes in, increment review_count and rating_sum, then recompute avg_rating. The average is always ready to read — no aggregation pipeline on every product page load.
Pattern 5: Extended Reference Pattern
Instead of always looking up the referenced document, embed a small subset of its fields directly. Reduces join-like lookups for common queries.
{
"_id": "order101",
"customer": {
"_id": "user123",
"name": "Alice",
"email": "alice@example.com"
},
"items": [...],
"total": 4999
}
You still keep the reference _id for full lookups, but you inline the fields you display on the order summary page.
Document Size and Growth
MongoDB documents have a 16 MB BSON limit. More importantly, documents that grow unboundedly are a red flag. Avoid patterns like this:
// BAD: appending to an array with no bound
db.posts.updateOne(
{ _id: "post456" },
{ $push: { comments: newComment } }
)
If a post can accumulate thousands of comments, store comments in their own collection and reference the post:
{
"_id": "comment789",
"post_id": "post456",
"author_id": "user123",
"body": "Great article!",
"created_at": "2026-05-25T10:00:00Z"
}
Indexing
Create indexes on the fields you filter and sort most often:
// Single field index
db.orders.createIndex({ customer_id: 1 })
// Compound index — follow the ESR rule: Equality fields first, then Sort, then Range
db.orders.createIndex({ customer_id: 1, status: 1, created_at: -1 })
// Multikey index — automatically used when indexing an array field
db.products.createIndex({ "specs.k": 1, "specs.v": 1 })
// Partial index — only indexes documents matching a condition
db.orders.createIndex(
{ created_at: -1 },
{ partialFilterExpression: { status: "pending" } }
)
Use explain() to verify a query hits an index:
db.orders.find({ customer_id: "user123", status: "pending" }).explain("executionStats")
Look for IXSCAN (index scan) instead of COLLSCAN (full collection scan) in the output.
Embed vs Reference: Decision Guide
| Factor | Embed | Reference |
|---|---|---|
| Access pattern | Always together | Sometimes separate |
| Data size | Small, bounded | Large, unbounded |
| Update frequency | Parent and child updated together | Child updated independently |
| Sharing | Owned by one parent | Shared across parents |
| Relationship type | One-to-one, one-to-few | One-to-many, many-to-many |
Conclusion
MongoDB schema design is driven by your query patterns, not by normalization rules. Start by listing how your application reads and writes data, then choose embedding or referencing to serve those patterns efficiently. The patterns covered here — attribute, bucket, outlier, computed, and extended reference — each solve a specific class of problem. Apply them selectively rather than adopting any single pattern across your entire schema, and use explain() to confirm your indexes are doing their job.