dbt (data build tool) is an open-source command-line tool that has gained popularity in recent years. It allows data engineers to transform data using SQL more effectively (according to Wikipedia). In this blog, I document my learning steps and the questions I had along the way.
Installation
I created a new conda environment locally and used pip to install dbt. I also installed the DuckDB adapter while I was at it.
pip install dbt-core
pip install dbt-duckdb

The version I started with was 1.10.9 for dbt-core (and 1.9.4 for dbt-duckdb).
Data
I wanted to use real-world data as I was learning, so I downloaded the following two datasets:
- Electric Vehicle Population Data from https://catalog.data.gov/dataset/electric-vehicle-population-data
- Alternative Fuels Data Center corridor data from https://afdc.energy.gov/corridors (I downloaded the data for WA)
Folder Structure
- data (where I saved the raw data)
- dbt_project.yml
- logs (dbt automatically creates this)
- models (a folder for each layer of models)
- profiles.yml
- seeds
- target (dbt automatically creates this)
dbt_project.yml
dbt looks specifically for a file named dbt_project.yml in the project root. This file tells dbt how to organize and materialize everything we do in this project. My dbt_project.yml file looks like this:
name: 'dbt_intro'
version: '1.0.0'
config-version: 2
profile: 'dbt_intro'
model-paths: ['models']
seed-paths: ['seeds']

models:
  dbt_intro:
    +materialized: view
    staging:
      +materialized: view
    intermediate:
      +materialized: table
    marts:
      +materialized: table

seeds:
  dbt_intro:
    +quote_columns: false

The explanation for each of the lines is as follows:
- ‘name’ is the name of the project. I chose dbt_intro.
- ‘version’ is the project’s version.
- ‘config-version’: dbt had an older format, config-version: 1. The modern format is config-version: 2, so all current projects use 2.
- ‘profile’ tells dbt which database connection settings to use; these are added in profiles.yml later.
- ‘model-paths’: the directory where all the models are stored. You can have subdirectories in this directory for each layer (here: raw, staging, intermediate and marts).
- ‘seed-paths’: the directory where seeds are stored. Seeds are typically small reference data, always stored as .csv files.
- ‘materialized’ tells dbt how the models are built in the database.
- view means dbt creates a SQL view.
- table means dbt runs the query and stores the result as a table.
- incremental means dbt builds the table once, then only adds new or changed rows on each run.
- ephemeral means dbt doesn’t build anything in the database; it inlines the SQL as a subquery in downstream models.
- ‘models’: configuration for the models, organized by folder.
- ‘raw’: I set the configuration for this layer in the model files themselves, so it isn’t included here.
- ‘staging’: lightly cleaned data. Do basic cleaning like type casting, renaming columns, etc. These models mirror the source tables.
- ‘intermediate’: join staging tables, do calculations, and add business logic.
- ‘marts’: final business-facing models.
- ‘seeds’: configuration for seeds.
- ‘quote_columns: false’ tells dbt not to quote column names automatically, i.e., don’t add quotation marks around them. Simple, lowercase names with underscores are cleaner and safer.
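As an illustration of the incremental mode (which I don’t use in this project — the model and column names below are hypothetical), an incremental model typically filters to new rows with the is_incremental() macro, which is dbt’s standard pattern:

```sql
{{ config(materialized='incremental') }}

select *
from {{ ref('stg_events') }}
{% if is_incremental() %}
  -- on incremental runs, only pick up rows newer than what's already in this table
  where event_time > (select max(event_time) from {{ this }})
{% endif %}
```

On the first run dbt builds the full table; on later runs the if-block is active and only new rows are inserted.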
profiles.yml
DBT needs to know how to connect to the data warehouse. To do that it looks for profiles.yml in the default location. ‘.dbt’ is a hidden directory where dbt expects to find the profiles.yml.
~/.dbt/profiles.yml

When we build the models, dbt reads ‘dbt_project.yml’ from the project folder and looks for a matching profile name in ‘profiles.yml’. In our case, the profile name is dbt_intro. It then uses the connection details found there. We can keep a copy of profiles.yml in our repo for version control, but dbt still needs one in the default location. There is only one ‘profiles.yml’ at ‘~/.dbt/’ for all projects on the machine; inside this file, we can define multiple profiles.
dbt_intro:                         # (1) Profile name, must match "profile:" in dbt_project.yml
  target: dev                      # (2) Default environment to use
  outputs:                         # (3) List of available environments/connections
    dev:                           # (4) Name of this environment
      type: duckdb                 # (5) Warehouse type (DuckDB, Postgres, BigQuery, Snowflake, etc.)
      path: data/warehouse.duckdb  # (6) Path to DuckDB file (your local warehouse DB)
      threads: 2                   # (7) Number of parallel threads dbt can use (compiles/runs faster)
      schema: main                 # (8) Default schema inside the warehouse (like a namespace)

Models
This is where we write the SQL queries to process the data. I have four subdirectories in it, one for each layer.
Raw
This is where I read in the data from source (/data). In raw, I apply only the minimum technical cleanup (e.g., safer column names) to make the data queryable. Here are the two raw models.
raw_ev_charging_stations.sql
{{ config(materialized='table', schema='raw') }}
with src as (
select *
from read_csv_auto('data/alt_fuel_stations (Aug 20 2025).csv', header=true)
)
select
"Station Name" as station_name,
City as city,
State as state,
cast(ZIP as varchar) as postal_code,
"EV Level1 EVSE Num" as level1_ports,
"EV Level2 EVSE Num" as level2_ports,
"EV DC Fast Count" as fast_chargers,
Latitude as latitude,
Longitude as longitude,
"EV Network" as ev_network,
"EV Connector Types" as connector_types,
"Date Last Confirmed" as last_confirmed,
"Open Date" as open_date,
"Fuel Type Code" as fuel_type_code,
"Status Code" as status_code
from src
where "Fuel Type Code" = 'ELEC'

raw_ev_registrations.sql
{{ config(materialized='table', schema='raw') }}
select
"VIN (1-10)" as vin_prefix,
County as county,
City as city,
State as state,
"Postal Code" as postal_code,
"Model Year" as model_year,
Make as make,
Model as model,
"Electric Vehicle Type" as ev_type,
"Clean Alternative Fuel Vehicle (CAFV) Eligibility" as cafv_eligibility,
"Electric Range" as electric_range,
"Base MSRP" as base_msrp,
"Legislative District" as legislative_district,
"DOL Vehicle ID" as dol_vehicle_id,
"Vehicle Location" as vehicle_location,
"Electric Utility" as electric_utility,
"2020 Census Tract" as census_tract
from read_csv_auto('data/Electric_Vehicle_Population_Data.csv', header=true)

Note how I added {{ config(materialized='table', schema='raw') }} to the top of each of the models (instead of including it in dbt_project.yml).
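A side note: the more conventional dbt way to reference pre-existing warehouse tables is to declare them as sources in a .yml file and use source() instead of reading files directly. I read the CSVs with read_csv_auto here because the raw files live outside the warehouse. A generic sketch (all names below are hypothetical, not part of this project):

```yaml
# models/raw/sources.yml
version: 2
sources:
  - name: dmv            # arbitrary grouping name for the source system
    schema: main         # schema where the tables already exist
    tables:
      - name: ev_population
```

A model would then use `select * from {{ source('dmv', 'ev_population') }}`, which lets dbt track the dependency and document the source.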
Building the models
There are several ways to build the model. We use selectors with -s to tell dbt what to build.
By folder name
dbt run -s raw

This matches anything under models/raw, because raw is part of the model’s fully qualified name (FQN).
By explicit path
dbt run -s path:models/raw

This is guaranteed to select just this directory, and is a more unambiguous way of selecting models.
By exact model name
dbt run -s raw_ev_registrations raw_ev_charging_stations

Here we explicitly state which models we want to run.
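Selectors also support graph operators (a general dbt feature, not specific to this project): a + next to a selector pulls in related models from the DAG. For example:

```
dbt run -s raw_ev_registrations+   # the model plus everything downstream of it
dbt run -s +stg_ev_registrations   # the model plus everything it depends on
```

This becomes useful once later layers exist and you want to rebuild a model together with its dependents.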
When we run this command, dbt does the following:
- load config and connect
- reads dbt_project.yml
- looks up matching profile (in this case it is dbt_intro) in ~/.dbt/profiles.yml
- connects to data/warehouse.duckdb (as given in profiles.yml).
- sets concurrency to 2 threads (as given in profiles.yml)
- parse the project
- scans ‘models/’ for .sql model files and schema .yml files.
- build a DAG (Directed Acyclic Graph)
- Figures out run order from dependencies.
- Here in raw, models don’t depend on anything, and so they are the start of the graph.
- select nodes
- the selector -s picks only the models that we want to run.
- compiles SQL
- runs SQL
- makes sure the target schema exists or creates it. Since the custom schema is raw and the profile’s schema is main, by default it is named ‘main_raw’ (target schema + ‘_’ + custom schema).
- since materialized is set to table, it creates/overwrites physical tables. The following tables are created:
- main_raw.raw_ev_registrations
- main_raw.raw_ev_charging_stations
- create summary and logs
- in target/, some files are created (I haven’t looked into what they are yet).
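The “figures out run order” step can be sketched in plain Python. This is only an illustration of topological ordering, not dbt’s actual code, and the model names beyond the raw layer are placeholders for where this project is headed:

```python
from graphlib import TopologicalSorter

# Each model maps to the set of models it depends on (via ref()).
# Raw models have no dependencies, so they start the graph.
deps = {
    "raw_ev_registrations": set(),
    "raw_ev_charging_stations": set(),
    "stg_ev_registrations": {"raw_ev_registrations"},
    "stg_ev_charging_stations": {"raw_ev_charging_stations"},
    "mart_ev_summary": {"stg_ev_registrations", "stg_ev_charging_stations"},
}

# static_order() yields each node only after all of its dependencies.
order = list(TopologicalSorter(deps).static_order())
print(order)  # raw models come before staging, staging before the mart
```

dbt does the same dependency resolution from the ref() calls it finds while compiling, then runs independent models in parallel up to the configured thread count.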
No tests were conducted in the raw layer.
Staging
This is where I do light, semantic cleanup so downstream joins/metrics are easy. I use ref() to point at the raw tables, normalize text, cast types, and (for this project) filter to Washington so both datasets line up.
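For context, ref() does two things: it records the dependency in the DAG, and it compiles to the fully qualified relation name. With my profile, I’d expect something roughly like the following (the exact quoting and qualification vary by adapter, so treat this as a sketch):

```sql
-- what I write:
select * from {{ ref('raw_ev_registrations') }}

-- roughly what dbt compiles it to:
select * from "warehouse"."main_raw"."raw_ev_registrations"
```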
stg_ev_registrations.sql
with src as (
select * from {{ ref('raw_ev_registrations') }}
),
typed as (
select
vin_prefix,
county,
trim(city) as city,
upper(trim(state)) as state,
cast(postal_code as varchar) as postal_code,
try_cast(model_year as int) as model_year,
make,
model,
ev_type,
cafv_eligibility,
try_cast(electric_range as int) as electric_range,
try_cast(base_msrp as double) as base_msrp,
try_cast(legislative_district as int) as legislative_district,
try_cast(dol_vehicle_id as bigint) as dol_vehicle_id,
vehicle_location,
electric_utility,
cast(census_tract as varchar) as census_tract
from src
)
select * from typed
where state = 'WA'
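A note on try_cast, since it appears throughout these models: in DuckDB, a plain cast raises an error on malformed input, while try_cast returns NULL instead, which makes it the safer choice for messy columns like base_msrp:

```sql
select cast('abc' as int);      -- raises a conversion error
select try_cast('abc' as int);  -- returns NULL
```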
stg_ev_charging_stations.sql
with src as (
select * from {{ ref('raw_ev_charging_stations') }}
),
typed as (
select
station_name,
trim(city) as city,
upper(trim(state)) as state,
cast(postal_code as varchar) as postal_code,
try_cast(level1_ports as int) as level1_ports,
try_cast(level2_ports as int) as level2_ports,
try_cast(fast_chargers as int) as fast_chargers,
try_cast(latitude as double) as latitude,
try_cast(longitude as double) as longitude,
ev_network,
connector_types,
cast(last_confirmed as date) as last_confirmed,
cast(open_date as date) as open_date,
status_code
from src
)
select * from typed
where state = 'WA'

schema.yml
version: 2

models:
  - name: stg_ev_registrations
    description: Cleaned EV registrations (typed + normalized)
    columns:
      - name: state
        tests:
          - not_null
          - accepted_values:
              arguments:
                values: ['WA']  # all rows should be Washington
  - name: stg_ev_charging_stations
    description: Cleaned Washington EV charging stations (typed + normalized)
    columns:
      - name: city
        tests: [not_null]
      - name: state
        tests:
          - not_null
          - accepted_values:
              arguments:
                values: ['WA']

The schema.yml file holds the tests and documentation metadata.
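Beyond not_null and accepted_values, dbt ships other built-in generic tests such as unique and relationships. For example (I haven’t added this to the project, and haven’t verified the column is actually duplicate-free), dol_vehicle_id looks like a natural candidate for a uniqueness check:

```yaml
columns:
  - name: dol_vehicle_id
    tests:
      - unique
      - not_null
```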
Similar to what we did in raw, we can run
dbt run -s path:models/staging

After this, we can run our tests (state is 'WA', and no null values in city and state) using the following command.
dbt test -s path:models/staging

In my case, it returned:
20:35:36 Found 4 models, 5 data tests, 429 macros
20:35:36
20:35:36 Concurrency: 2 threads (target='dev')
20:35:36
20:35:37 1 of 5 START test accepted_values_stg_ev_charging_stations_state__WA ........... [RUN]
20:35:37 2 of 5 START test accepted_values_stg_ev_registrations_state__WA ............... [RUN]
20:35:37 1 of 5 PASS accepted_values_stg_ev_charging_stations_state__WA ................. [PASS in 0.24s]
20:35:37 3 of 5 START test not_null_stg_ev_charging_stations_city ....................... [RUN]
20:35:37 2 of 5 PASS accepted_values_stg_ev_registrations_state__WA ..................... [PASS in 0.25s]
20:35:37 4 of 5 START test not_null_stg_ev_charging_stations_state ...................... [RUN]
20:35:37 3 of 5 PASS not_null_stg_ev_charging_stations_city ............................. [PASS in 0.06s]
20:35:37 4 of 5 PASS not_null_stg_ev_charging_stations_state ............................ [PASS in 0.06s]
20:35:37 5 of 5 START test not_null_stg_ev_registrations_state .......................... [RUN]
20:35:37 5 of 5 PASS not_null_stg_ev_registrations_state ................................ [PASS in 0.07s]
20:35:37
20:35:37 Finished running 5 data tests in 0 hours 0 minutes and 0.68 seconds (0.68s).
20:35:37
20:35:37 Completed successfully
20:35:37
20:35:37 Done. PASS=5 WARN=0 ERROR=0 SKIP=0 NO-OP=0 TOTAL=5