Python · Feb 12, 2026 · 20 minute read

Pandas Method Chaining: Write Data Transformations That Read Like a Story

Conor Hollern

Write Clean, Composable Data Transformation Pipelines

Most pandas code starts clean… and ends as a pile of intermediate variables. Method chaining fixes that. Instead of mutating the same DataFrame over and over, you build a clear, linear transformation pipeline — one expression, top to bottom, readable like a story. Let’s break down how and why to use it.

1. What Is Method Chaining?

Method chaining means calling multiple DataFrame methods sequentially on the same object. Note: wrap the chain in parentheses so you don't need a backslash at the end of every line.

```python
import pandas

raw_df = pandas.read_csv('myData.csv')

transformed_df = (raw_df
    .dropna(subset=['revenue'])
    .assign(revenue_usd=lambda x: x['revenue'] / 100)
    .query('revenue_usd > 10000')
    .sort_values('revenue_usd', ascending=False)
)
```

This code is much cleaner and reads more easily than the equivalent step-by-step version:

```python
import pandas

raw_df = pandas.read_csv('myData.csv')

transformed_df = raw_df.copy()
transformed_df = transformed_df.dropna(subset=['revenue'])
transformed_df['revenue_usd'] = transformed_df['revenue'] / 100
transformed_df = transformed_df[transformed_df['revenue_usd'] > 10000]
transformed_df = transformed_df.sort_values('revenue_usd', ascending=False)
```

The chained version:

  • Avoids mutation confusion
  • Avoids accidental state reuse
  • Makes transformation order explicit
  • Feels closer to SQL

2. Building the Chain

Nearly every action that you would take to manipulate a dataset can be done via method chaining — although some methods are more intuitive than others.

Let's import our data:

```python
import pandas

raw_df = pandas.read_csv('salesData.csv')
```

From here, we can apply some of the more intuitive data transformations. In the following example, we will sort, then rename columns, select specific columns, and perform a left join with another dataset.

```python
# Define a dictionary to use for renaming AND selecting specific columns
result_columns = {
    'customer__unique_id': 'customer_id',
    'account_name': 'customer_name',
    'contract_id': 'contract_number',
    'reporting_month': 'month',
    'monthly_revenue_usd': 'revenue'
}

transformed_df = (raw_df
    .sort_values(by=['account_name', 'contract_id'], ascending=[False, True])
    .rename(columns=result_columns)
    .filter(result_columns.values())
    .merge(customer_detail_df, on='customer_id', how='left')
)
```

In the example, the following transformations were made:

  • Sort the DataFrame by account_name then contract_id.
  • Rename the DataFrame's columns using a dictionary defined above. Each key in the dictionary will be renamed to the corresponding value.
  • Select only the columns defined in the dictionary. Note: we are using .values() to get an iterable of just the column names after they are modified.
  • Merge (Left Join) with the DataFrame customer_detail_df.

This dictionary enhances method chaining even further by allowing us to define logic once and only once.

We can add additional methods to the chain to continue the transformation process. Most pandas methods make this extremely straightforward, so I won't cover every case. In the example below, we will take transformed_df, group it, aggregate, and return only 10 rows to generate a report of the Top 10 Customers by Revenue and Month.

```python
report_df = (transformed_df
    .groupby(['customer_name', 'month'], as_index=False)
    .agg({'revenue': 'sum'})
    .sort_values('revenue', ascending=False)
    .head(10)
)
```

Sorting, limiting, de-duplicating, dropping nulls, filling nulls, renaming columns, selecting columns, grouping, aggregating, pivoting, un-pivoting, merging (joining): the list goes on. All of these actions fit nicely into the conceptual framework of method chaining. However, two very important and common data transformation operations were left off this list: column mutation and filtering, which are covered below.
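To illustrate, several of the operations just listed can sit side by side in a single chain. This is a minimal sketch using hypothetical sales data (the column names and values are made up for the example):

```python
import pandas

# Hypothetical monthly sales data with a duplicated row and a missing value
sales_df = pandas.DataFrame({
    'customer': ['ACME Corp', 'ACME Corp', 'ACME Corp', 'Globex'],
    'month': ['2025-01', '2025-01', '2025-02', '2025-01'],
    'revenue': [100.0, 100.0, None, 250.0],
})

wide_df = (sales_df
    .drop_duplicates()                                  # remove the repeated row
    .assign(revenue=lambda x: x['revenue'].fillna(0))   # fill nulls
    .pivot_table(index='customer', columns='month',
                 values='revenue', aggfunc='sum')       # pivot months into columns
    .reset_index()
)
```

De-duplication, null filling, and pivoting each occupy one readable line, and the intermediate states never need names.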

3. Performing Column Mutations

The traditional way to create new columns and manipulate existing ones looks like the following:

```python
# Modify the existing revenue column by filling NAs
sales_df['revenue'] = sales_df['revenue'].fillna(value=0)

# Create a margin column
sales_df['margin'] = sales_df['revenue'] - sales_df['monthly_cost']
```

This works fine and is convenient for quick analysis. However, when a column needs to be created or modified in the middle of the data transformation process, this style breaks the method chain.

Use .assign(): This method assigns a new column or overwrites an existing column in a DataFrame.

We can re-write the above examples using assign:

```python
transformed_df = (sales_df
    .assign(revenue=lambda x: x['revenue'].fillna(value=0))    # Overwrite existing
    .assign(margin=lambda x: x['revenue'] - x['monthly_cost']) # Create new column
)
```

Now column mutations can fully integrate into method chaining data pipelines.
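A single .assign() call can also take several keyword arguments at once, and pandas documents that later arguments may refer to columns created or modified earlier in the same call. A small sketch with hypothetical revenue and cost columns:

```python
import pandas

# Hypothetical data; column names are illustrative
sales_df = pandas.DataFrame({
    'revenue': [100.0, None],
    'monthly_cost': [40.0, 10.0],
})

transformed_df = sales_df.assign(
    revenue=lambda x: x['revenue'].fillna(0),           # overwrite existing
    margin=lambda x: x['revenue'] - x['monthly_cost'],  # sees the filled revenue
)
```

Because the keyword arguments are evaluated in order, margin is computed against the already-filled revenue column.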

4. Filtering Data

Many analysts will filter DataFrames in the following manner:

```python
# Filtering on a single column
revenue_df = revenue_df[revenue_df['customer_name'] == 'ACME Corp']

# Filtering on multiple columns
revenue_df = revenue_df[
    (revenue_df['customer_name'] == 'ACME Corp') &
    (revenue_df['invoice_month'] == '2025-01-01')
]
```

Similar to the traditional method for column mutations, this approach works, but it breaks method chaining and is harder to read.

There are two alternatives, and both fit cleanly into method chaining.

1. The .query() method

The .query() method is intuitive and works extremely well. Despite this, it is not my preference, only because it uses a string-based expression syntax, which differs slightly from standard pandas boolean masking.

To provide a very simple example, this method works as follows:

```python
# We can reference this variable in the .query() method
customer_name = 'ACME Corp'

transformed_df = (raw_df
    .query("""
        customer_name == @customer_name and
        invoice_month == '2025-01-01'
    """)
)
```

This approach has several pros:

  • The syntax is clean and clear
  • This line is readable and works in a method chain
  • It feels similar to SQL
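Beyond simple comparisons, .query() also supports membership tests with in and references to local variables via the @ prefix. A runnable sketch with hypothetical data:

```python
import pandas

df = pandas.DataFrame({
    'customer_name': ['ACME Corp', 'Globex', 'Initech'],
    'revenue': [500.0, 120.0, 80.0],
})

target_customers = ['ACME Corp', 'Globex']
min_revenue = 100

# Bare names resolve to columns; @-prefixed names resolve to local variables
filtered_df = df.query('customer_name in @target_customers and revenue >= @min_revenue')
```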

2. The .loc[] method

I prefer using .loc[] because it is explicit and more pythonic. This method uses the same boolean masking that we used in the "bad" example above, but it can flow in a chain of methods.

```python
customer_name = 'ACME Corp'

filtered_df = (raw_df
    .loc[lambda x: (x['customer_name'] == customer_name) &
                   (x['invoice_month'] == '2025-01-01')]
)
```

This approach also has several pros:

  • Native python syntax
  • Full IDE support (autocomplete, linting)
  • Safer for dynamic expressions

Both .query() and .loc[] work well; choose whichever you prefer.

A note on chain-safety

In both the .loc[] filtering example and the .assign() column mutation example, I used lambda functions. This is necessary to ensure we receive the most recently modified version of the DataFrame.

See this example, where we rename account_name to customer_name, then attempt to filter on account_name:

```python
# This won't work
filtered_df = (raw_df
    .rename(columns={'account_name': 'customer_name'})
    .loc[raw_df['customer_name'] == 'ACME Corp']
)

# This will work
filtered_df = (raw_df
    .rename(columns={'account_name': 'customer_name'})
    .loc[lambda x: x['customer_name'] == 'ACME Corp']
)
```

  • The first code snippet will not work, because we are attempting to filter using the original raw_df DataFrame, which does not have a customer_name column.
  • The second code snippet will work, because .loc[] is receiving the latest DataFrame, with account_name renamed to customer_name.

It is my recommendation to default to using lambda when method chaining, even though it isn't technically always necessary.

5. Building Reusable Logic

As your pandas workflows grow, duplication of code becomes a concern. .pipe() solves this by letting you move transformation logic into reusable and composable functions, while keeping your method chain clean.

The core idea

.pipe() does the following:

  • Passes the DataFrame as the first argument
  • Accepts additional arguments
  • Preserves chain flow

```python
example_df.pipe(my_function, some_arg=value)
```

Let's say we have a function that takes a DataFrame and filters for an active status.

```python
import pandas

def filter_active(df: pandas.DataFrame) -> pandas.DataFrame:
    return df.loc[df['status'] == 'Active']
```

We could pass our DataFrame into this function like so:

```python
new_df = filter_active(current_df)
```

But this doesn't fit into the method chaining framework. This is where .pipe() comes into play. .pipe() works just like any other chained method, taking a function that expects and returns a DataFrame.

```python
transformed_df = (raw_df
    .pipe(filter_active)
)
```
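As noted above, .pipe() also forwards additional arguments to the function, which makes piped steps parameterizable. A sketch using a hypothetical filter_by_status function and made-up data:

```python
import pandas

def filter_by_status(df: pandas.DataFrame, status: str) -> pandas.DataFrame:
    """Keep only rows matching the given status."""
    return df.loc[lambda x: x['status'] == status]

# Hypothetical data for the example
raw_df = pandas.DataFrame({
    'customer': ['ACME Corp', 'Globex'],
    'status': ['Active', 'Churned'],
})

transformed_df = (raw_df
    .pipe(filter_by_status, status='Active')  # extra kwargs pass straight through
)
```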

There are many pros to this approach:

  • Modular business logic
  • Reusable definitions for metrics and reporting
  • Cleaner pipelines
  • Removal of duplicated business logic across modules
  • Easier testing
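On the testing point: because each piped step is an ordinary function from DataFrame to DataFrame, it can be unit-tested on a tiny fixture, independent of the full pipeline. A sketch with a hypothetical add_margin helper:

```python
import pandas

def add_margin(df: pandas.DataFrame) -> pandas.DataFrame:
    """Derive margin from revenue and monthly_cost."""
    return df.assign(margin=lambda x: x['revenue'] - x['monthly_cost'])

# The function can be verified in isolation on a one-row fixture...
fixture = pandas.DataFrame({'revenue': [100.0], 'monthly_cost': [40.0]})
assert add_margin(fixture)['margin'].iloc[0] == 60.0

# ...and still composes into a larger chain unchanged
report_df = (fixture
    .pipe(add_margin)
    .sort_values('margin', ascending=False)
)
```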

Final Thoughts

Method chaining improves readability and can support virtually any data transformation, as long as you understand the tools available.