Contents

Speed up data exploration with ad-hoc data filters in Streamlit

I love Streamlit. It is an amazing tool for quickly creating interactive data apps. In data science, it is often beneficial to get first results early and then improve iteratively. Making data available and accessible to domain experts is an important step in that journey.

With Streamlit, it is straightforward to build custom applications. Apps can easily be tailored to specific data science projects. But with a few tricks, they can also be made more generally applicable.

In this post, I want to highlight a few techniques I used to create this dataset-agnostic data visualization app. If you are familiar with Streamlit, you may skip the introduction of the library and description of its components and jump to this section.

What is Streamlit?

Just in case you have never heard of Streamlit, here is a brief description. Streamlit is an open-source Python framework for creating custom web applications. It is specifically designed for machine learning and data science, but it is in no way limited to those use cases. The underlying Python code is executed server-side, and the resulting outputs are rendered for the user. Streamlit applications can be run on a local machine or deployed to a cloud platform such as Streamlit Cloud or Heroku.
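Running an app locally takes a single command (assuming Streamlit is installed via pip and your script is saved as, say, app.py; the filename here is just a placeholder):

```shell
# Install Streamlit, then launch the app on a local development server
pip install streamlit
streamlit run app.py
```

Streamlit then serves the app at http://localhost:8501 by default and can live-reload it whenever the script changes.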

Almost no boilerplate code is required, and fairly sophisticated interactive apps can be written in only a few lines of code. Here is an example that showcases how little effort is needed to create a reasonably good-looking interface users can interact with.

/posts/datascience/adhoc-data-filters-streamlit/streamlit-intro.gif

We can see a heading, some explanatory text with parts highlighted in bold and italics, followed by a textbox. In this simple demo, the user is supposed to enter a name. The input is written to a final text element, but only if something was actually entered. Otherwise, a blue info box with instructions is shown.

Below is the entire code for the example app — less than 15 lines, including a couple of empty ones. The syntax is so clean that the code almost explains itself.

# st_intro.py
import streamlit as st

"""# A simple Streamlit app
Here could be an **explanation of the use case** and a description
of the app's functionality. *Strings are interpreted as markdown*.
"""

name = st.text_input("What is your name?")
if not name:
    st.info("Please enter your name above.")
else:
    st.write(f"Hello, **{name}**!")

There are many excellent resources out there describing the basics of how Streamlit works. A good place to start is the official documentation. TL;DR: The app is executed top to bottom. Any literal or variable given on its own line is printed to the UI. In addition, Streamlit provides many different UI elements to display and receive data. Values given through input elements (such as the textbox in the example) can be stored in Python variables and accessed later in the code.

Limitations of Streamlit
Streamlit’s simplicity also comes with some drawbacks. If you are not careful, a Streamlit app can become slow very quickly. Handling bigger datasets needs extra care and additional effort to run smoothly. Also, there is little flexibility in the layout/design of your app and only recently have stateful behavior (July 2021) and multi-page apps (June 2022) been made officially available. Here is a great article comparing Streamlit to other alternatives in R and Python.

Building an interactive data app

Through the following sections, we will build, from scratch, a Streamlit app that creates a colored regression plot for user-specified variables. The underlying data can be interactively sliced into narrower subsets, which I have found to be very useful in exploratory data analysis and discussions with domain experts. Although the app’s functionality itself is not that impressive, the underlying concept is applicable to all sorts of data apps.

Hard-coded, static baseline

Before making the app dataset-agnostic, we are going to hard-code everything. As example data, I like to use the Palmer penguins dataset1. It provides various body characteristics of penguins from three different species — more info here.

Here is the code for a fairly underwhelming, static app that creates and displays a colored scatter plot inside Streamlit.

from matplotlib import pyplot as plt
import streamlit as st
import seaborn as sns

data = sns.load_dataset("penguins")

xvar = "bill_length_mm"
yvar = "body_mass_g"
huevar = "species"

fig = plt.figure()
sns.scatterplot(data=data, x=xvar, y=yvar, hue=huevar, alpha=0.75)
st.pyplot(fig)

Under the hood, seaborn is used to create the plot, which was an arbitrary choice for this example. Streamlit supports many other plotting libraries very well. Below is the corresponding app. It’s really unexciting, but we are going to build upon it to create a powerful tool for EDA.

/posts/datascience/adhoc-data-filters-streamlit/Progression/st_adhoc_filters_1.jpg

Improving the regression plot

Briefly, I want to extend the plot itself to also show a regression line and choose a different color palette for numerical data. For this, the plotting code is wrapped inside a plot_regression function:

def plot_regression(df, x, y, hue, regression=True):
    """Create (colored) scatter plot with optional regression line.
    """
    fig = plt.figure()
    palette = "mako" if hue in df.select_dtypes("number") else None
    sns.scatterplot(data=df, x=x, y=y, hue=hue, alpha=0.75, palette=palette)

    if regression:
        sns.regplot(data=df, x=x, y=y, scatter=False, line_kws={"color": ".3"})

    return fig

fig = plot_regression(data, x=xvar, y=yvar, hue=huevar)
st.pyplot(fig)

Again, this is just an example. The plot could be improved in many ways, but we want to focus on the Streamlit setup. The plot below shows the improved regression plot. It highlights that there may be a need to consider different subgroups in the data.

/posts/datascience/adhoc-data-filters-streamlit/Progression/st_adhoc_filters_2.jpg

User-selected plot variables

We can start to enable user interaction by making the X, Y and hue column selectable. All we need is a couple of st.selectbox elements. X and Y should be numeric columns, hue can be either numeric or categorical. Because the app should work with any dataset, we need to first make sure that there are some numeric columns to display.

numeric_vars = data.select_dtypes("number").columns
if len(numeric_vars) < 1:
    st.warning("No numeric columns found for plotting.")
    st.stop()

If there are no numeric columns, we warn the user and stop the execution. st.stop() in Streamlit, much like return in a regular function, stops the execution early without raising an error. Having a list of numeric columns as a separate variable avoids code duplication in the creation of the X and Y selection boxes.
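The select_dtypes pattern does a lot of the heavy lifting throughout this post, so here is a quick standalone illustration (with made-up, penguin-like columns) of how it selects columns and supports membership tests:

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["Adelie", "Gentoo"],     # object dtype
    "bill_length_mm": [39.1, 47.5],      # float dtype
    "year": [2007, 2008],                # int dtype
})

# select_dtypes("number") keeps only numeric columns, in original order
numeric_vars = df.select_dtypes("number").columns
assert list(numeric_vars) == ["bill_length_mm", "year"]

# `col in df.select_dtypes(...)` tests column membership by name
assert "year" in df.select_dtypes("number")
assert "species" not in df.select_dtypes("number")
```

This membership idiom is what later allows functions like plot_regression to branch on a column's data type.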

At this point, I want to introduce a two-column layout that features the plot on the left and variable options on the right. Streamlit makes this easy. The following code creates two columns, where the first is twice as wide as the second one.

leftcol, rightcol = st.columns([2, 1])

We can now use the leftcol and rightcol as context managers to add UI elements to each column. In the code, the right column needs to be filled first, as the plot on the left relies upon user input gathered here.

with rightcol:  # plot setup selectors on the right
    xvar = st.selectbox("X variable", numeric_vars)
    yvar = st.selectbox("Y variable", numeric_vars, index=len(numeric_vars)-1)

    # hue column is optional - the "None" string is replaced by actual None
    huevar = st.selectbox("Color by", ["None"] + data.columns.tolist())
    if huevar == "None":
        huevar = None

The selection boxes for X and Y are straightforward. By default, the first numeric column in data is chosen as X (implicitly) and the last numeric column as Y, which is often a reasonable choice (😉). Because the st.selectbox does not allow empty selections, a small trick is needed to make coloring optional. The hack is inspired by comments from this discussion. We manually add a "None" option to the list and later replace it with Python’s actual None to indicate an empty selection.

With X, Y and hue selected by the user, we can create the regression plot and display it on the left:

with leftcol:
    fig = plot_regression(data, x=xvar, y=yvar, hue=huevar)
    st.pyplot(fig)

And this is what we get:

/posts/datascience/adhoc-data-filters-streamlit/Progression/st_adhoc_filters_3.gif

Loading any dataset

By now, the code for the app does not reference any variable names and could therefore be used with any tidy dataset. Streamlit provides the st.file_uploader widget, which allows the user to provide the app with any data they choose. Let’s replace loading the penguins data with a file_uploader. It is good practice in Streamlit to use caching for potentially resource-intensive or time-consuming tasks. We create a read_data function and decorate it with st.cache. For this demo, only CSV files are allowed, but read_data could easily be extended to allow other file types like Excel or Feather.

import pandas as pd

@st.cache
def read_data(uploaded_file):
    return pd.read_csv(uploaded_file)

With caching, a copy of the data frame is stored in memory and accessed upon re-execution. This improves performance considerably, as the CSV content does not need to be parsed with every user interaction.
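As an aside, here is one possible sketch of how read_data could dispatch on the file extension to also support Excel and Feather. The `name` attribute is exposed by the objects st.file_uploader returns; the @st.cache decorator is omitted so the function stands alone:

```python
import pandas as pd

def read_data(uploaded_file):
    # Files from st.file_uploader carry a `name` attribute we can inspect.
    # Re-add @st.cache when wiring this into the app.
    name = uploaded_file.name.lower()
    if name.endswith((".xlsx", ".xls")):
        return pd.read_excel(uploaded_file)    # requires openpyxl
    if name.endswith(".feather"):
        return pd.read_feather(uploaded_file)  # requires pyarrow
    return pd.read_csv(uploaded_file)
```

Remember to also add the new extensions to the file_uploader's allowed types.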

In the app, we can add the file_uploader to the sidebar, so that it does not clutter the main area. As long as no data is given, a message is shown to the user and the execution stopped.

datafile = st.sidebar.file_uploader("Upload dataset", ["csv"])
if datafile is None:
    st.info("""Upload a dataset (.csv) in the sidebar to get started.""")
    st.stop()

data = read_data(datafile).copy()

Note that the file uploader is explicitly limited to only accept CSV files. Also note that after reading the cached data, the data frame is immediately copied again. This is important in the next step, when filters are applied to the data, which would mutate the cached object.
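To see why the copy matters, here is a toy, pandas-only stand-in for the caching behavior (not Streamlit's actual implementation): the cache hands out the same object on every call, so in-place changes leak into all later reads.

```python
import pandas as pd

_cache = {}

def cached_read(key):
    # Toy cache: the *same* DataFrame object is returned on every call.
    if key not in _cache:
        _cache[key] = pd.DataFrame({"x": [1, 2, 3]})
    return _cache[key]

df = cached_read("a")
df["x"] = 0  # mutates the cached object in place
assert cached_read("a")["x"].tolist() == [0, 0, 0]  # cache is corrupted

df = cached_read("b").copy()  # work on a copy instead
df["x"] = 0
assert cached_read("b")["x"].tolist() == [1, 2, 3]  # cache stays intact
```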

Now, any CSV dataset can be uploaded and processed by the app:

/posts/datascience/adhoc-data-filters-streamlit/Progression/st_adhoc_filters_4.gif

Ad-hoc data filters

The regression analysis for the penguins dataset is overshadowed by clear groupings in the data. Depending on the goal, it might make sense to investigate different subpopulations individually. So we need interactive elements that allow the user to subset the data on the fly. Again, this is quite straightforward to do in Streamlit and we can even make the filter setup dataset-agnostic.

Adding categorical filters on the fly

Before showing my final implementation of this app, I want to highlight the approach taken by only considering the categorical columns. First, we add a multiselect element to the sidebar that allows the user to “promote” specific columns to filter columns. For now, we exclude numbers and datetimes, as they will require different UI elements.

data = read_data(datafile).copy()

categories = data.select_dtypes(exclude=["number", "datetime"]).columns
filter_cols = st.sidebar.multiselect("Filter columns", categories)

For each promoted column, we can now create a new st.multiselect element that allows the user to select or deselect specific categories in the dataset. The selections are stored for later use in a dictionary, with column names as keys and selections as values.

filters = {}
with st.sidebar.expander("Filters", expanded=True):
    for col in filter_cols:
        options = data[col].dropna().unique()
        selection = st.multiselect(col, options, default=options)
        if selection:
            filters[col] = selection

All of this is done inside an st.expander, which visually sets the filters apart from the rest of the sidebar. Note that empty selections are not included in the dictionary here. They would lead to an empty data frame, which is quite useless for our data app.

We could choose to filter the data already within the for-loop above. However, this leads to behavior where the selection on the first filter influences the options available for a second filter — and vice versa. I have found this to be unintuitive and not very user-friendly, because once the options of a Streamlit input element change, it is reset to its default value.

Instead, the data is filtered in a separate step. We can leverage pandas' query method:

for col, selection in filters.items():
    data = data.query(f"`{col}` in @selection")
    st.write(f"*`{col}` in {selection}*")
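The backtick/@ combination is plain pandas query syntax and works outside Streamlit as well; here is a minimal sketch with made-up penguin rows:

```python
import pandas as pd

data = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "body_mass_g": [3700, 5000, 3800],
})
col = "species"
selection = ["Adelie", "Gentoo"]

# Backticks handle arbitrary column names (e.g. containing spaces);
# the @ prefix references local Python variables inside the query string.
filtered = data.query(f"`{col}` in @selection")
assert filtered["species"].tolist() == ["Adelie", "Gentoo"]
```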

Here is the code and a screen capture of the app so far. Only 64 lines already achieve quite a lot.

from matplotlib import pyplot as plt
import streamlit as st
import seaborn as sns
import pandas as pd


def plot_regression(df, x, y, hue, regression=True):
    """Create (colored) scatter plot with optional regression line.
    """
    fig = plt.figure()
    palette = "mako" if hue in df.select_dtypes("number") else None
    sns.scatterplot(data=df, x=x, y=y, hue=hue, alpha=0.75, palette=palette)

    if regression:
        sns.regplot(data=df, x=x, y=y, scatter=False, line_kws={"color": ".3"})

    return fig


@st.cache
def read_data(uploaded_file):
    return pd.read_csv(uploaded_file)


datafile = st.sidebar.file_uploader("Upload dataset", ["csv"])
if datafile is None:
    st.info("""Upload a dataset (.csv) in the sidebar to get started.""")
    st.stop()

data = read_data(datafile).copy()

categories = data.select_dtypes(exclude=["number", "datetime"]).columns
filter_cols = st.sidebar.multiselect("Filter columns", categories)

filters = {}
with st.sidebar.expander("Filters", expanded=True):
    for col in filter_cols:
        options = data[col].dropna().unique()
        selection = st.multiselect(col, options, default=options)
        if selection:
            filters[col] = selection

for col, selection in filters.items():
    data = data.query(f"`{col}` in @selection")
    st.write(f"*`{col}` in {selection}*")

numeric_vars = data.select_dtypes("number").columns
if len(numeric_vars) < 1:
    st.warning("No numeric columns found for plotting.")
    st.stop()

leftcol, rightcol = st.columns([2, 1])
with rightcol:  # plot setup selectors on the right
    xvar = st.selectbox("X variable", numeric_vars)
    yvar = st.selectbox("Y variable", numeric_vars, index=len(numeric_vars)-1)

    # hue column is optional - the "None" string is replaced by actual None
    huevar = st.selectbox("Color by", ["None"] + data.columns.tolist())
    if huevar == "None":
        huevar = None

with leftcol:
    fig = plot_regression(data, x=xvar, y=yvar, hue=huevar)
    st.pyplot(fig)

/posts/datascience/adhoc-data-filters-streamlit/Progression/st_adhoc_filters_5.gif

Data slicers for all data types

I have found categorical filters most useful during EDA in my projects, but there is no need to limit ourselves to categorical variables. We can create two-sided range sliders for numeric and even datetime types. Let’s create a function that takes a column name and creates the corresponding Streamlit element.

def create_slicer(df, col):
    """Create a data filter (slider/multiselect) and return user selection.

    Filters are created for 'number', 'date' and 'object'/'category' types.
    Returns `None` otherwise.
    """
    if col in df.select_dtypes(include=["number", "datetime"]):
        if col in df.select_dtypes(include="integer"):
            valmin, valmax = int(df[col].min()), int(df[col].max())
        elif col in df.select_dtypes(include="datetime"):
            valmin, valmax = df[col].min().date(), df[col].max().date()
        else:
            valmin, valmax = float(df[col].min()), float(df[col].max())
        return st.slider(col, valmin, valmax, (valmin, valmax))
    elif col in df.select_dtypes(["object", "category"]):
        options = df[col].dropna().unique()
        return st.multiselect(col, options, default=options)

    return None

With this, the code for creating the filter UI elements becomes a bit more readable:

filter_cols = st.sidebar.multiselect("Filter columns", data.columns)
filters = {}
with st.sidebar.expander("Filters", expanded=True):
    for col in filter_cols:
        selection = create_slicer(data, col)
        if selection:
            filters[col] = selection

Finally, we need to apply the user’s slicer selections on the dataset. Depending on the column’s data type, the filter syntax is slightly different, which is best addressed in a separate function.

def apply_slicer(df, col, selection):
    """Filter dataset according to slicer selections.
    """
    if col in df.select_dtypes(include=["number", "datetime"]):
        low, high = selection
        df = df.query(f"@low <= `{col}` <= @high")
    if col in df.select_dtypes(exclude=["number", "datetime"]):
        df = df.query(f"`{col}` in @selection")
    return df

If the column is either numeric or a datetime, we can use Python’s concise chained-comparison notation. If it is any other data type, we can assume the selection is a list of included values. Applying this function is almost trivial:

for col, selection in filters.items():
    data = apply_slicer(data, col, selection)
    st.write(f"*`{col}` in {selection}*")
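The range branch of apply_slicer relies on pandas query supporting Python-style chained comparisons; a quick standalone check with made-up values:

```python
import pandas as pd

df = pd.DataFrame({"body_mass_g": [3700, 5000, 3800, 4300]})
low, high = 3750, 4500

# pandas query accepts chained comparisons, just like plain Python
subset = df.query("@low <= `body_mass_g` <= @high")
assert subset["body_mass_g"].tolist() == [3800, 4300]
```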

Below is the entire code so far. It is not that much different from the final data app.

from matplotlib import pyplot as plt
import streamlit as st
import seaborn as sns
import pandas as pd


def plot_regression(df, x, y, hue, regression=True):
    """Create (colored) scatter plot with optional regression line.
    """
    fig = plt.figure()
    palette = "mako" if hue in df.select_dtypes("number") else None
    sns.scatterplot(data=df, x=x, y=y, hue=hue, alpha=0.75, palette=palette)

    if regression:
        sns.regplot(data=df, x=x, y=y, scatter=False, line_kws={"color": ".3"})

    return fig


def create_slicer(df, col):
    """Create a data filter (slider/multiselect) and return user selection.

    Filters are created for 'number', 'date' and 'object'/'category' types.
    Returns `None` otherwise.
    """
    if col in df.select_dtypes(include=["number", "datetime"]):
        if col in df.select_dtypes(include="integer"):
            valmin, valmax = int(df[col].min()), int(df[col].max())
        elif col in df.select_dtypes(include="datetime"):
            valmin, valmax = df[col].min().date(), df[col].max().date()
        else:
            valmin, valmax = float(df[col].min()), float(df[col].max())
        return st.slider(col, valmin, valmax, (valmin, valmax))
    elif col in df.select_dtypes(["object", "category"]):
        options = df[col].dropna().unique()
        return st.multiselect(col, options, default=options)

    return None


def apply_slicer(df, col, selection):
    """Filter dataset according to slicer selections.
    """
    if col in df.select_dtypes(include=["number", "datetime"]):
        low, high = selection
        df = df.query(f"@low <= `{col}` <= @high")
    if col in df.select_dtypes(exclude=["number", "datetime"]):
        df = df.query(f"`{col}` in @selection")
    return df


@st.cache
def read_data(uploaded_file):
    return pd.read_csv(uploaded_file)


datafile = st.sidebar.file_uploader("Upload dataset", ["csv"])
if datafile is None:
    st.info("""Upload a dataset (.csv) in the sidebar to get started.""")
    st.stop()

data = read_data(datafile).copy()

filter_cols = st.sidebar.multiselect("Filter columns", data.columns)
filters = {}
with st.sidebar.expander("Filters", expanded=True):
    for col in filter_cols:
        selection = create_slicer(data, col)
        if selection:
            filters[col] = selection

for col, selection in filters.items():
    data = apply_slicer(data, col, selection)
    st.write(f"*`{col}` in {selection}*")

numeric_vars = data.select_dtypes("number").columns
if len(numeric_vars) < 1:
    st.warning("No numeric columns found for plotting.")
    st.stop()

leftcol, rightcol = st.columns([2, 1])
with rightcol:  # plot setup selectors on the right
    xvar = st.selectbox("X variable", numeric_vars)
    yvar = st.selectbox("Y variable", numeric_vars, index=len(numeric_vars)-1)

    # hue column is optional - the "None" string is replaced by actual None
    huevar = st.selectbox("Color by", ["None"] + data.columns.tolist())
    if huevar == "None":
        huevar = None

with leftcol:
    fig = plot_regression(data, x=xvar, y=yvar, hue=huevar)
    st.pyplot(fig)

Bonus: Catch-all text query

Just in case the interactive slicers are not sufficient, we can add a final trick to allow any filters the user can imagine. We can achieve this by exposing the pandas query function directly to the user through a text input. Then, the user can enter any valid query string to perform a custom data selection.

All we need is two additional lines:

with st.sidebar.expander("Filters", expanded=True):
    # ...
    query = st.text_area("Custom query") or "tuple()"
# later:
data = data.query(query, engine="python")

Note that an empty query is replaced by "tuple()", which selects all rows in the data frame (more details here). Alternatively, we could also check for an empty string and simply not execute the query.
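That second alternative can be sketched as a small helper that simply skips empty input (apply_custom_query is a hypothetical name, not part of the app's code):

```python
import pandas as pd

def apply_custom_query(df, query):
    # Skip empty or whitespace-only input instead of the "tuple()" trick.
    query = (query or "").strip()
    if not query:
        return df
    return df.query(query, engine="python")

df = pd.DataFrame({"bill_length_mm": [39.1, 47.5, 49.0]})
assert apply_custom_query(df, "").equals(df)
assert len(apply_custom_query(df, "bill_length_mm > 40")) == 2
```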

Final Streamlit EDA app with ad-hoc data filters

With the bonus above, we have re-created the complete app I introduced at the top of this post. You can access the app here and you are welcome to build upon the code behind the app.

/posts/datascience/adhoc-data-filters-streamlit/final-app-description.png

Conclusion

Of course, the app is fairly limited in its functionality, but I personally still think it is quite impressive for less than 140 lines of code (including some comments and descriptive text elements). The intuitive and clean syntax of Streamlit makes it incredibly easy to create interactive data apps. In mere hours, a data scientist can create a functioning prototype and thus quickly generate value for stakeholders.

One could easily extend and improve the app in various ways. For example, you could use Plotly instead of seaborn to make the charts themselves interactive. You could also add box/violin plots for categorical X/Y variables, or choose a different data visualization altogether. This app is 100% dataset-agnostic and only requires a CSV file in standard tidy format — other file types could be implemented as well. For other use cases, it might make more sense to tailor the app to a specific dataset and/or to connect to a data source directly.

Finally, hosting the app is really easy with Streamlit Cloud, which works directly from GitHub and is free for public repositories. There are other options to deploy the app as well, such as Heroku. In many cases however, it may already be enough to spin up the app on your local machine while working through the data alone or together with domain experts.

References


  1. K.B. Gorman, T.D. Williams and W.R. Fraser, “Ecological sexual dimorphism and environmental variability within a community of Antarctic penguins (genus Pygoscelis),” PLoS ONE, vol. 9, no. 3, p. e90081, 2014, doi:10.1371/journal.pone.0090081