Your new analytics engineer just spent three days figuring out why some models use leading commas and others don't. Your data prep layer has five different naming conventions because five different people built it. Your PR reviews have become style debates instead of logic reviews.
This is the tax you pay for not having a style guide. PR reviews that should take 20 minutes drag into hours. Onboarding that should take a week takes three. Every model becomes a choose-your-own-adventure where the reader has to learn this file's conventions before they can understand this file's logic.
We've been in the dbt ecosystem since 2017, back when Fishtown Analytics was still our neighbor in Philadelphia. We've seen what works across dozens of implementations, and we've codified it here. These aren't theoretical best practices. They're battle-tested conventions that actually stick when real teams, real deadlines, and real production pressure show up.
What's in this guide:
- SQL formatting, naming, project structure, testing, and documentation standards
- A ready-to-use .sqlfluff config that enforces everything automatically
- A PR checklist your reviewers can use today
Download the complete PDF and .sqlfluff config at the end.
General Principles
Four principles underpin everything in this guide. When a specific rule doesn't cover your situation, return to these.
1. Optimize for the reader, not the writer.
Code is read 10x more often than it's written. Every decision should prioritize someone reading the code for the first time, not minimizing your keystrokes.
This means: descriptive names over abbreviations (users_with_subscription_history not usr_sub_hist), explicit syntax even when optional, and generous whitespace between logical sections.
2. Only as complex as absolutely necessary.
Every line of code is a liability. Complexity should be earned, not assumed.
Before adding abstraction, ask: Can this be expressed more simply? Would a new team member understand this without explanation? Are you solving a real problem or a hypothetical future one?
3. Consistency beats "better."
A codebase that consistently uses one convention is easier to navigate than one that uses "better" conventions inconsistently.
The hierarchy: Project consistency beats team conventions. Team conventions beat this guide. An intentional, consistent deviation is far better than inconsistent adherence.
4. Make the right thing easy.
Structure your project so following conventions is the path of least resistance. Use SQLFluff to enforce formatting automatically. Put templates where developers will find them. Document conventions in the repo, not a wiki nobody checks.
When doing the right thing requires extra effort, people stop doing it.
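One way to make enforcement automatic is a pre-commit hook. SQLFluff ships `sqlfluff-lint` and `sqlfluff-fix` hooks; the sketch below assumes you use the pre-commit framework, and the `rev` and adapter package shown are illustrative, so pin whatever versions your team actually standardizes on:

```yaml
# .pre-commit-config.yaml -- a sketch; pin the rev your team uses
repos:
  - repo: https://github.com/sqlfluff/sqlfluff
    rev: 3.0.7  # illustrative version
    hooks:
      - id: sqlfluff-lint
        additional_dependencies:
          ['sqlfluff-templater-dbt', 'dbt-snowflake']  # match your adapter
      - id: sqlfluff-fix
        additional_dependencies:
          ['sqlfluff-templater-dbt', 'dbt-snowflake']
```

With this in place, formatting violations are caught before a commit ever lands, which keeps them out of PR review entirely.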
Quick Reference
| Category | Convention |
|---|---|
| Keywords | lowercase |
| Indentation | 2 spaces |
| Comma placement | Leading |
| Aliases | Explicit `as` keyword |
| Line length | 120 characters max |
| Model prefixes | `base_`, `prep_`, `enr_`, `intg_`, `core_` |
| Base model naming | `base_{source}__{table}` |
Print this. Tape it to your monitor.
SQL Formatting Rules
Keywords and Functions
All SQL keywords, functions, data types, and literals: lowercase.
-- Good
select
  id
  , upper(trim(name)) as name_normalized
  , case when status = 'active' then true else false end as is_active
  , cast(created_at as date) as created_date
from users
where deleted_at is null

-- Bad
SELECT
  id,
  UPPER(TRIM(name)) AS name_normalized,
  CASE WHEN status = 'active' THEN TRUE ELSE FALSE END AS is_active,
  CAST(created_at AS DATE) AS created_date
FROM users
WHERE deleted_at IS NULL

This isn't about right or wrong; it's about picking one and enforcing it everywhere. We picked lowercase because it's easier on the eyes in long queries and most modern editors handle syntax highlighting anyway.
Indentation
Two spaces. Never tabs. Configure your editor once and forget it.
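One low-friction way to "configure once" across every editor on the team is an EditorConfig file at the repo root (assuming your editors support EditorConfig, which most modern ones do out of the box or via a plugin):

```ini
# .editorconfig -- picked up automatically by most editors
[*.sql]
indent_style = space
indent_size = 2
trim_trailing_whitespace = true
insert_final_newline = true
```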
select
  column_1
  , column_2
  , case
      when condition_1 then 'value_1'
      when condition_2 then 'value_2'
      else 'default'
    end as derived_column
from table_name
where condition

Comma Placement
Leading commas-commas at the beginning of lines.
-- Good: Leading commas
select
  id
  , first_name
  , last_name
  , email
  , created_at
from users

-- Bad: Trailing commas
select
  id,
  first_name,
  last_name,
  email,
  created_at
from users

Here's what they don't tell you about leading commas: when you add a column, your git diff changes one line instead of two. When you comment out a column, it's one keystroke. When you're scanning a long select list, the commas create a visual gutter that's easier to follow. Small things that compound across thousands of lines.
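To make the git-diff claim concrete, here's a sketch of appending a hypothetical `email` column at the end of a select list under each style:

```diff
 -- Leading commas: one added line
 select
   id
   , first_name
+  , email
 from users

 -- Trailing commas: two changed lines (the old last column needs a comma)
 select
   id,
-  first_name
+  first_name,
+  email
 from users
```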
No trailing comma after the last item. SQLFluff will catch this.
Aliasing
Always use the explicit as keyword, for both columns and tables.
-- Good: Explicit aliasing
select
  u.id as user_id
  , u.email as user_email
  , o.id as order_id
from users as u
left join orders as o
  on u.id = o.user_id

-- Bad: Implicit aliasing
select
  u.id user_id
  , u.email user_email
  , o.id order_id
from users u
left join orders o
  on u.id = o.user_id

Implicit aliasing saves a few characters. Explicit aliasing saves every future reader from wondering "wait, is that a typo or an alias?"
Line Length
Keep lines under 120 characters. Break long lines at logical points:
select
  s.created_at as subscription_created_at
  , s.id as subscription_id
  , s.account_id
  , coalesce(
      s.annual_revenue,
      s.monthly_revenue * 12,
      0
    ) as estimated_arr
from subscriptions as s

Join Formatting
Align join conditions vertically when you have multiple:
select
  s.subscription_id
  , a.account_id
  , p.plan_name
from subscriptions as s
left join accounts as a
  on s.account_id = a.id
  and s.workspace_id = a.workspace_id
left join plans as p
  on s.plan_id = p.id
  and s.billing_interval = p.billing_interval

Where Clause Formatting
Each condition on its own line, aligned:
where
  s.created_at::date >= '2023-01-01'
  and s.plan_type <> 'internal'
  and s.status = 'active'

CTE Patterns
CTEs are where most dbt model logic lives. Get these right.
CTE Naming
Use descriptive snake_case names that explain the transformation:
with normalized_domains as (...)       -- Describes the transformation
, accounts_with_duplicates as (...)    -- Indicates intermediate state
, accounts_deduped as (...)            -- Final cleaned version
, subscription_events as (...)         -- Describes the data content

Bad CTE names: cte1, temp, data, final. Good CTE names tell you what's in the box without opening it.
CTE Structure
Three rules:
1. Start with `with` for the first CTE
2. Use a leading comma and a blank line before subsequent CTEs
3. End with a simple `select * from final_cte`
-- Pre-normalize domains for efficient joining
with normalized_domains as (
  select distinct
    domain
    , lower(trim(domain)) as domain_normalized
  from {{ ref('base_clearbit__company_domains') }}
)

, accounts_with_enrichment as (
  select
    a.id as account_id
    , a.company_name
    , a.domain
    , d.industry
  from {{ ref('base_app__accounts') }} as a
  left join normalized_domains as d
    on lower(trim(a.domain)) = d.domain_normalized
)

, accounts_deduped as (
  select *
  from accounts_with_enrichment
  qualify row_number() over (
    partition by account_id
    order by domain nulls last
  ) = 1
)

select * from accounts_deduped

Progressive Transformation Pattern
Structure CTEs to build progressively from raw data to final output:
1. Normalization CTEs - Prepare data for efficient joining
2. Base CTEs - Core data with initial joins
3. Enrichment CTEs - Add derived columns and business logic
4. Deduplication CTEs - Handle data quality issues
5. Final select - Output the cleaned dataset
This isn't arbitrary. When something breaks, you can comment out later CTEs and inspect intermediate states. Debug time drops from hours to minutes.
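For example, to inspect the pre-deduplication state of the accounts model above, the only edit needed is the final select (a temporary change while debugging, not something to commit):

```sql
-- Temporarily point the final select at an earlier CTE
-- to inspect intermediate state:
-- select * from accounts_deduped          -- original final select
select * from accounts_with_enrichment     -- debug: rows before dedup
```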
CTE Comments
Add a single-line comment above CTEs explaining purpose:
-- Pre-normalize company domains for efficient joining
with normalized_domains as (
  ...
)

-- Deduplicate accounts to ensure unique account_id
select *
from accounts_base
qualify row_number() over (...)

Naming Conventions
Model Prefixes
| Layer | Prefix | Example |
|---|---|---|
| Source Mirror | `base_` | `base_stripe__subscriptions` |
| Data Prep | `prep_` | `prep_stripe__subscriptions` |
| Enrich | `enr_` | `enr_subscription_events` |
| Integrate | `intg_` | `intg_customer_subscriptions` |
| Core | `core_` | `core_monthly_recurring_revenue` |
| Serve | *(consumer name)* | `retail_usa_daily_orders` |
Base Model Naming
Base models follow the pattern: base_{source_schema}__{table_name}
base_stripe__subscriptions
base_stripe__invoices
base_app_public__users
base_salesforce__accounts
base_segment__tracks

The double underscore (__) separates the source schema from the table name. This matters when you're scanning a list of 200 models and need to find all Stripe sources fast.
Column Naming
Use snake_case with descriptive suffixes:
| Type | Suffix | Examples |
|---|---|---|
| Boolean flags | `is_`, `has_` | `is_active`, `has_valid_payment_method` |
| Timestamps | `_at` | `created_at`, `converted_at`, `churned_at` |
| Dates | `_date` | `trial_start_date`, `renewal_date` |
| Counts | `_count` | `seat_count`, `login_count` |
| Amounts | `_amount` | `mrr_amount`, `discount_amount` |
| IDs | `_id` | `user_id`, `account_id` |
| Percentages | `_pct`, `_rate` | `churn_rate`, `conversion_pct` |
| Normalized | `_normalized` | `email_normalized`, `domain_normalized` |
These suffixes aren't just style; they're type hints. When you see _at, you know it's a timestamp. When you see is_, you know it's a boolean. Your BI tool and your analysts will thank you.
Table Aliases
Short, 1-2 character aliases in joins:
from subscriptions as s
left join accounts as a
  on s.account_id = a.id
left join users as u
  on a.owner_id = u.id
left join plans as p
  on s.plan_id = p.id

Common aliases: u for users, a for accounts, s for subscriptions, i for invoices, p for plans/products, e for events, o for orders.
Project Structure
Directory Layout
dbt/
├── models/
│   ├── source_mirror/       # 1:1 copies from source systems
│   │   ├── stripe/
│   │   ├── salesforce/
│   │   └── ...
│   ├── data_prep/           # Cleaned, renamed, typed
│   │   ├── stripe/
│   │   └── ...
│   ├── enrich/              # Source-specific business logic
│   ├── integrate/           # Cross-source combinations
│   ├── core/                # Org-wide canonical tables
│   └── serve/               # Consumer-specific delivery
│       ├── retail_usa/
│       └── finance/
├── macros/
│   ├── communication/       # Domain-specific macros
│   │   ├── advance_notice/
│   │   └── reminders/
│   └── utilities/
├── tests/                   # Singular tests
├── seeds/
├── snapshots/
├── docs/                    # Documentation
├── dbt_project.yml
└── .sqlfluff

Purpose-Driven Layer Architecture
Name your directories for what they do. Six directories, six purposes:
| Layer | Directory | Purpose |
|---|---|---|
| Source Mirror | `source_mirror/` | 1:1 source copy, changelog, track deletes |
| Data Prep | `data_prep/` | Field renaming, type casting, no business logic |
| Enrich | `enrich/` | Source-specific business logic |
| Integrate | `integrate/` | Combine data from different systems |
| Core | `core/` | Org-wide analytical tables |
| Serve | `serve/` | Consumer-specific delivery |
Source Mirror
- Schema: `source_mirror`
- Purpose: 1:1 copy of the source table. Track schema changes, log deletes, maintain a full changelog. Nothing else.
- Materialization (how dbt builds the model: as a view, table, or incremental refresh): View, or incremental (only processes new or changed rows) for high-volume sources
{{
  config(
    materialized='view',
    schema='source_mirror'
  )
}}

select
  *
  , _fivetran_synced as _loaded_at
  , _fivetran_deleted as _is_deleted
from {{ source('stripe', 'subscriptions') }}

Data Prep
- Schema: `data_prep`
- Purpose: Field renaming, type casting, null handling. No business logic. If you're writing a CASE statement, you've gone too far.
- Materialization: View
{{
  config(
    materialized='view',
    schema='data_prep'
  )
}}

select
  id as subscription_id
  , cast(created_at as timestamp_ntz) as created_at
  , nullif(trim(plan_name), '') as plan_name
from {{ ref('base_stripe__subscriptions') }}

Enrich
- Schema: `enrich`
- Purpose: Source-specific business logic: deduplication, status derivation, temporal calculations. Each model still lives within a single source domain.
- Materialization: View (table for expensive computations)
{{
  config(
    materialized='view',
    schema='enrich'
  )
}}

with subscription_events as (
  select
    subscription_id
    , plan_name
    , created_at
    , lead(created_at) over (
        partition by account_id order by created_at
      ) as next_event_at
  from {{ ref('prep_stripe__subscriptions') }}
)

select * from subscription_events

Integrate
- Schema: `integrate`
- Purpose: Combine data from different systems. This is where Stripe subscriptions meet Salesforce accounts meet product usage events. All combinations across systems live here and nowhere else.
- Materialization: View (table for large cross-source joins)
{{
  config(
    materialized='view',
    schema='integrate'
  )
}}

select
  a.account_id
  , a.company_name
  , s.subscription_id
  , s.plan_name
  , u.last_login_at
from {{ ref('enr_salesforce__accounts') }} as a
left join {{ ref('enr_stripe__subscriptions') }} as s
  on a.stripe_customer_id = s.customer_id
left join {{ ref('enr_app__user_activity') }} as u
  on a.account_id = u.account_id

Core
- Schema: `core`
- Purpose: Org-wide analytical tables. Canonical definitions of customers, subscriptions, revenue. Governed, tested, documented. The tables your CFO trusts.
- Materialization: Table (always)
{{
  config(
    materialized='table',
    schema='core',
    cluster_by=['account_id', 'created_at']
  )
}}

Serve
- Schema: Named for the consumer, e.g. `retail_usa_data`, `finance_reporting`
- Purpose: Consumer-specific views. Pre-filtered, pre-aggregated, shaped for a specific team, dashboard, or data sync to business tools.
- Materialization: Table or incremental
{{
  config(
    materialized='table',
    schema='retail_usa_data',
    cluster_by=['store_id', 'order_date']
  )
}}

The Fast Lane
Sometimes data needs to move faster than six layers allow. A real-time inventory feed or a fraud-detection signal can't wait for a full pipeline run. For these cases, build a fast lane: a model that goes directly from Source Mirror to Serve, skipping the layers in between. Tag it, document why it exists, and revisit it quarterly. The fast lane is a conscious trade-off, not a shortcut.
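A fast-lane model might look like the sketch below. The model name, tag, and docs path are illustrative; the point is that the bypass is tagged and documented where the next reader will see it:

```sql
-- serve/retail_usa/retail_usa_live_inventory.sql
-- FAST LANE: reads directly from Source Mirror, skipping
-- data_prep/enrich/integrate for latency. Revisit quarterly.
{{
  config(
    materialized='view',
    schema='retail_usa_data',
    tags=['fast_lane']
  )
}}

select
  store_id
  , sku
  , quantity_on_hand
  , _loaded_at
from {{ ref('base_inventory__stock_levels') }}
```

The `fast_lane` tag also lets you list every bypass in one command (`dbt ls --select tag:fast_lane`) when it's time for that quarterly review.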
Model Configuration
Config Block Placement
Config blocks go at the top of the file. Always.
{{
  config(
    materialized='table',
    cluster_by=['subscription_id', 'created_at']
  )
}}

-- Model logic starts here
with ...

Materialization Strategy
| Layer | Default | Override When |
|---|---|---|
| Source Mirror | view | High-volume sources needing incremental loads |
| Data Prep | view | Large reference tables frequently joined |
| Enrich | view | Expensive computations, complex window functions |
| Integrate | view | Large cross-source joins with multiple aggregations |
| Core | table | Always tables. These are your governed, canonical models |
| Serve | table | Incremental for high-volume delivery targets |
Clustering
Use cluster_by for tables with common filter/join patterns:
{{
  config(
    materialized='table',
    cluster_by=['account_id', 'created_at']
  )
}}

Choose clustering keys based on: frequently filtered columns, common join keys, date/time partitioning columns.
Post-Hooks
Use post-hooks for granting permissions or logging:
{{
  config(
    materialized='table',
    schema='core',
    post_hook=[
      "grant select on {{ this }} to role analyst",
      "{{ log_model_refresh(this.identifier) }}"
    ]
  )
}}

Schema Change Handling
In dbt_project.yml (note that on_schema_change only affects incremental models):
models:
  your_project:
    +on_schema_change: "sync_all_columns"

Testing Strategy
Test Hierarchy
1. Generic tests (column-level) - Quick validation with broad coverage
2. Singular tests (dataset-level) - Complex business logic validation
Generic Tests
Apply to primary keys and critical columns in schema.yml:
version: 2

models:
  - name: prep_subscriptions
    columns:
      - name: subscription_id
        description: "Unique identifier for the subscription"
        tests:
          - not_null
          - unique
      - name: created_at
        description: "Timestamp when the subscription was created"
        tests:
          - not_null

Every primary key gets not_null and unique. No exceptions. This is the minimum bar: the tests that catch 80% of data issues with 5% of the effort.
Singular Tests
Create SQL files in tests/ for complex validation:
-- tests/assert_core_mrr_data_quality.sql
with validation_checks as (
  select
    -- Check for duplicate records
    count(*) as total_records
    , count(distinct subscription_id) as unique_subscriptions
    -- Validate temporal consistency
    , sum(case when churned_at < created_at then 1 else 0 end) as invalid_dates
    -- Validate MRR calculation accuracy
    , sum(
        case
          when mrr_amount != (
            base_mrr + expansion_mrr - contraction_mrr
          )
          then 1
          else 0
        end
      ) as incorrect_mrr_totals
  from {{ ref('core_monthly_recurring_revenue') }}
)

-- Return rows only when failures are detected (dbt convention)
select
  'Data Quality Issues Found' as test_name
  , *
from validation_checks
where
  total_records != unique_subscriptions
  or invalid_dates > 0
  or incorrect_mrr_totals > 0

Test Naming Convention
tests/
├── assert_{model_name}_data_quality.sql
├── assert_{model_name}_outliers.sql
└── no_{business_rule_violation}.sql

Tests return 0 rows on success. Any returned rows indicate failure.
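As a minimal sketch of the no_{business_rule_violation} pattern, here's a singular test for a hypothetical rule (MRR can never be negative); the model name comes from the earlier examples, the rule itself is illustrative:

```sql
-- tests/no_negative_mrr.sql
-- Returns the offending rows; zero rows returned means the test passes
select
  subscription_id
  , mrr_amount
from {{ ref('core_monthly_recurring_revenue') }}
where mrr_amount < 0
```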
Documentation Standards
Schema.yml Structure
version: 2

models:
  - name: model_name
    description: |
      **Brief title or purpose**

      Detailed description of the model's business purpose.

      **Key Features:**
      - Feature 1
      - Feature 2

      **Business Questions Answered:**
      - Question 1?
      - Question 2?
    meta:
      owner: "Team or Person Name"
      tags: ["domain", "category", "tier-1"]
      business_domain: "Domain Name"
    columns:
      - name: column_name
        description: |
          Clear description of what this column represents.
          Include calculation logic for derived columns:
          Formula: column_a + column_b
        tests:
          - not_null
          - unique

Calculated Columns
Document the formula. Future you will thank present you.
- name: health_score
  description: |
    Composite customer health score (0-100 scale) combining:
    - Product usage frequency (30% weight)
    - Feature adoption breadth (25% weight)
    - Support ticket sentiment (20% weight)
    - NPS survey responses (25% weight)

    Scores < 50 indicate churn risk requiring intervention.

- name: days_to_convert
  description: |
    Calculated: Days between trial start and paid conversion.
    Formula: DATEDIFF(day, trial_started_at, converted_at)
    NULL if not yet converted.

Deprecation Notices
Mark deprecated models clearly:
- name: mart_revenue_legacy
  description: |
    **DEPRECATED - Use `core_monthly_recurring_revenue` instead**

    Legacy model maintained for backward compatibility.
    Scheduled for removal: 2025-Q2
  meta:
    tags: ["legacy", "scheduled-removal"]

Macro Conventions
One Macro Per File
Each macro in its own file with a matching name:
macros/
├── communication/
│   ├── advance_notice/
│   │   ├── email_attempt_dates.sql
│   │   ├── sms_attempt_dates.sql
│   │   ├── attempt_date_cols.sql
│   │   └── shipped_diff_cols.sql
│   └── reminders/
│       └── ...
└── utilities/
    └── explore_data.sql

Macro Naming
Use descriptive snake_case that indicates output:
{action}_{subject}_{output_type}.sql

email_attempt_dates.sql
shipped_diff_cols.sql
create_feature_registry_table.sql

Macro Structure
{% macro attempt_date_cols(prefix, channel_expr, channel_name) %}
  -- Emits attempt 1..3 dates like: {prefix}1_{channel_name}_date
  {% for n in [1, 2, 3] %}
    , to_char(
        max(case when {{ channel_expr }} and attempt_number = {{ n }} then contact_at end),
        'YYYY-MM-DD'
      ) as {{ prefix }}{{ n }}_{{ channel_name }}_date
  {% endfor %}
{% endmacro %}

Wrapper Macros
Create thin wrappers for common use cases:
-- email_attempt_dates.sql
{% macro email_attempt_dates(prefix) %}
  {{ attempt_date_cols(prefix, "activity_type = 'Email'", "email") }}
{% endmacro %}

-- sms_attempt_dates.sql
{% macro sms_attempt_dates(prefix) %}
  {{ attempt_date_cols(prefix, "activity_type = 'SMS'", "sms") }}
{% endmacro %}

SQLFluff Configuration
Here's the complete .sqlfluff config that enforces everything in this guide:
[sqlfluff]
templater = dbt
dialect = snowflake
exclude_rules = L034, L044
max_line_length = 120
[sqlfluff:templater:dbt]
profiles_dir = ~/.dbt
profile = your_profile_name
target = dev
[sqlfluff:rules:capitalisation.keywords]
capitalisation_policy = lower
[sqlfluff:rules:capitalisation.functions]
extended_capitalisation_policy = lower
[sqlfluff:rules:capitalisation.literals]
capitalisation_policy = lower
[sqlfluff:rules:capitalisation.types]
extended_capitalisation_policy = lower
[sqlfluff:rules:aliasing.table]
aliasing = explicit
[sqlfluff:rules:aliasing.column]
aliasing = explicit
[sqlfluff:rules:aliasing.expression]
aliasing = explicit
[sqlfluff:rules:layout.long_lines]
ignore_comment_lines = true
ignore_comment_clauses = true

[sqlfluff:rules:structure.column_order]
select_clause_element_order_preference = any

[sqlfluff:indentation]
indent_unit = space
tab_space_size = 2

[sqlfluff:layout:type:comma]
line_position = leading

[sqlfluff:rules:convention.select_trailing_comma]
select_clause_trailing_comma = forbid

[sqlfluff:rules:convention.terminator]
require_final_newline = true

Common Commands
# Lint all models
sqlfluff lint models/
# Lint specific layer
sqlfluff lint models/source_mirror/
sqlfluff lint models/data_prep/
sqlfluff lint models/enrich/
sqlfluff lint models/integrate/
sqlfluff lint models/core/
sqlfluff lint models/serve/
# Auto-fix linting issues
sqlfluff fix models/
# Fix specific file
sqlfluff fix models/data_prep/prep_stripe__subscriptions.sql
# Format SQL files
sqlfluff format models/

CI/CD Integration
Add SQLFluff to your pipeline:
# .github/workflows/dbt-lint.yml
name: dbt Lint
on: [pull_request]

jobs:
  sqlfluff:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install SQLFluff
        run: pip install sqlfluff sqlfluff-templater-dbt
      - name: Lint SQL
        run: sqlfluff lint models/ --dialect snowflake

Action: Add this workflow to your repo today. Takes 5 minutes. Style debates in PRs drop to zero.
Get the Complete Package
Everything in this guide, plus a one-page cheatsheet and PR reviewer checklist, is available for download:
Download the Tabby Stack dbt Style Guide (PDF + .sqlfluff config)
The PDF includes a complete example model showing all conventions working together, and a PR checklist you can paste into your review template.
When Standards Aren't Enough
This guide solves the consistency problem. But consistency is table stakes.
The harder problems (model sprawl that nobody can navigate, tests that don't catch real issues, Snowflake bills that grow 30% month-over-month with no visibility) require more than a style guide.
We help data teams get their dbt projects to a state where following the rules is the easy path and the architecture actually scales. If you've adopted standards but still feel like you're fighting your data stack, that's the work we do.