As a recent article in Charlottesville Tomorrow reports, Charlottesville Area Transit doesn't have enough drivers. Based on a quote from the story, CAT was operating at 79% of its current reduced service schedule as of April, and the city website continues to list multiple buses out of service on a daily basis.

Can we measure those delays more precisely? How often are buses out of service, and which routes are most affected? Has service improved since April? I wasn't able to get precise answers from CAT staff, who have their hands full trying to operate an understaffed bus system. But the CAT tracker website is powered by a public application programming interface (API) that anybody can use to look up bus locations, routes, and stops in real time. Reader, I used it.

Every few seconds, the CAT tracker sends a request to https://catpublic.etaspot.net/service.php?service=get_vehicles&includeETAData=1&orderedETAArray=1&token=TESTING. If you open this URL while buses are in service, you'll see coordinates for each CAT bus in operation, as well as metadata about each bus, including estimates times of arrival for upcoming stops. As far as I can tell, the API only provides real-time information and can't be used to look up historical data. So I wrote a script to store bus locations to a database, and used cron to fetch the data every fifteen seconds.

Note: this analysis is only as accurate as the tracking data provided by CAT. If the transponder for one of the buses stops working, we won't know that the bus is in service.

To verify that this works, let's replay all bus locations for June 10th:

import datetime

import altair as alt
import branca
import folium
import geojson
import pandas as pd
import pytz
import shapely.wkt
from folium.plugins import TimestampedGeoJson
from google.cloud import bigquery

bq_client = bigquery.Client()

# Define utility functions

def get_position(row):
    return geojson.Feature(
        id=row.name,
        geometry=shapely.wkt.loads(row.position),
        properties={
            "icon": "circle",
            "iconstyle": {"color": row.color, "fillColor": row.color, "fillOpacity": 1},
            "times": [row.receiveminute.replace(tzinfo=pytz.timezone("US/Eastern")).timestamp() * 1000],
        },
    )

# Adapted from https://github.com/mrcagney/examples_folium/blob/develop/notebooks/categorical_legend.ipynb
def add_categorical_legend(map_, title, colors, labels):
    """
    Given a Folium map, add to it a categorical legend with the given title, colors, and corresponding labels.
    The given colors and labels will be listed in the legend from top to bottom.
    Return the resulting map.
    
    Based on `this example <http://nbviewer.jupyter.org/gist/talbertc-usgs/18f8901fc98f109f2b71156cf3ac81cd>`_.
    """
    # Error check
    if len(colors) != len(labels):
        raise ValueError("colors and labels must have the same length.")

    color_by_label = dict(zip(labels, colors))

    # Make legend HTML
    template = f"""
    {{% macro html(this, kwargs) %}}

    <!doctype html>
    <html lang="en">
    <head>
      <meta charset="utf-8">
      <meta name="viewport" content="width=device-width, initial-scale=1">
    </head>
    <body>
    <div id='maplegend' class='maplegend'>
      <div class='legend-title'>{title}</div>
      <div class='legend-scale'>
        <ul class='legend-labels'>
    """

    for label, color in color_by_label.items():
        template += f"<li><span style='background:{color}'></span>{label}</li>"

    template += """
        </ul>
      </div>
    </div>

    </body>
    </html>

    <style type='text/css'>
      .maplegend {
        position: absolute;
        z-index:9999;
        background-color: rgba(255, 255, 255, 1);
        border-radius: 5px;
        border: 2px solid #bbb;
        padding: 10px;
        font-size:12px;
        right: 10px;
        bottom: 20px;
      }
      .maplegend .legend-title {
        text-align: left;
        margin-bottom: 5px;
        font-weight: bold;
        font-size: 90%;
        }
      .maplegend .legend-scale ul {
        margin: 0;
        margin-bottom: 5px;
        padding: 0;
        float: left;
        list-style: none;
        }
      .maplegend .legend-scale ul li {
        font-size: 80%;
        list-style: none;
        margin-left: 0;
        line-height: 18px;
        margin-bottom: 2px;
        }
      .maplegend ul.legend-labels li span {
        display: block;
        float: left;
        height: 16px;
        width: 30px;
        margin-right: 5px;
        margin-left: 0;
        border: 0px solid #ccc;
        }
      .maplegend .legend-source {
        font-size: 80%;
        color: #777;
        clear: both;
        }
      .maplegend a {
        color: #777;
        }
    </style>
    {% endmacro %}
    """

    macro = branca.element.MacroElement()
    macro._template = branca.element.Template(template)
    map_.get_root().add_child(macro)

    return map_

vehicles = bq_client.query(
    """
    with ranked as (
      select
        *,
        st_geogpoint(lng, lat) as position,
        timestamp_trunc(receivetimestamp, minute) as receiveminute,
        row_number() over (partition by equipmentid, timestamp_trunc(receivetimestamp, minute) order by receivetimestamp desc) as rank
      from `cvilledata.cat.vehicles` vehicles
      join `cvilledata.cat.routes` routes on vehicles.routeid = cast(routes.id as string)
      where receivetimestamp >= '2022-06-10T06:00:00'
        and receivetimestamp < '2022-06-10T22:30:00'
        and inservice = 1
    )
    select
        *
    from ranked
    where rank = 1
    """
).result().to_dataframe()
positions = vehicles.apply(get_position, axis=1)
features = geojson.FeatureCollection(list(positions))

routes_sorted = vehicles[["color", "name", "id"]].drop_duplicates().sort_values(["id"])
route_names = list(routes_sorted.name)
route_colors = list(routes_sorted.color)

map_ = folium.Map([38.0293, -78.4767,], zoom_start=13)

g = TimestampedGeoJson(
    dict(features),
    auto_play=True,
    period="PT1M",
    duration="PT1S",
    max_speed=25,
    loop=False,
).add_to(map_)

add_categorical_legend(map_, "Routes", route_colors, route_names)

map_

This looks like a plausible record of bus positions on a given day, but it doesn't tell us whether the buses were on time or not. Next, let's plot the number of buses in service over time for the 4, which serves my neighborhood:

bus_counts_by_minute = bq_client.query(
    """
    with counts as (
      select
        routeid,
        timestamp_trunc(receivetimestamp, minute) as receiveminute,
        count(distinct equipmentid) as buses
      from `cvilledata.cat.vehicles`
      where inservice = 1
        and receivetimestamp >= '2022-06-10T06:00:00'
        and receivetimestamp < '2022-06-10T22:30:00'
      group by routeid, receiveminute
      order by receiveminute
    )
    select
      *,
      routes.name as routename,
    from counts
    join `cvilledata.cat.routes` routes on counts.routeid = cast(routes.id as string)
    """
).result().to_dataframe()

alt.Chart(bus_counts_by_minute[bus_counts_by_minute.routeid == "5"]).mark_point().encode(
    y=alt.Y("buses:Q"),
    x=alt.X("receiveminute:T", sort="x"),
).properties(
    title="Buses in service per minute, Route 4",
)

As expected from the CAT schedule, the 4 has two buses in service from about 6:30 a.m. to 9:00 a.m., and one bus in service until the end of service at 10:30 p.m. Aside from a few moments when one of the buses doesn't report in, the 4 provided the expected (if infrequent) level of service last Friday.

Let's look at a few more routes, aggregating bus counts by hour to handle noisy data:

bus_counts_by_hour = bq_client.query(
    """
    select
      *
    from `cvilledata.cat_derived.buses_by_route_by_hour`
    where receivehour between '2022-06-10' and '2022-06-11'
    """
).result().to_dataframe()

alt.Chart(bus_counts_by_hour[bus_counts_by_hour.routeid.isin(["2", "6", "8", "1"])]).mark_point().encode(
    y=alt.Y("buses:Q"),
    x=alt.X("receivehour:T", sort="x"),
    color="routename:N",
).properties(
    title="Buses in service per hour",
).facet(
    facet="routename:N",
    columns=2,
)

To summarize:

The 1 (PVCC/Riverside) is scheduled to have a single bus in service all day, and it does. In other words, the 1 operated at 100% of its current service model.
The 5 (BRSC/Walmart) is scheduled to have three buses in service all day, but it only has two: 67% of its service model.
The 7 (UVA Health/BRSC/FSQ) is scheduled to have three buses in service all day, but has two most of the day—except for a brief stint in the early afternoon when three buses are in service: 69% service.
The Trolley has one bus in service all day, but it's supposed to have two: 50% service.

Overall, CAT operated at about 81% of its proposed service model on June 10th, similar to the 79% figure from the Charlottesville Tomorrow report:

bq_client.query(
    """
    select
      sum(buses) / sum(busesexpected) as busratio,
    from `cvilledata.cat_derived.buses_by_route_by_hour`
    where receivehour between '2022-06-10' and '2022-06-11'
    """
).result().to_dataframe()

Since I began collecting data from the CAT tracker, the overall service level has been similar, at about 78%:

bq_client.query(
    """
    select
      sum(buses) / sum(busesexpected) as busratio,
    from `cvilledata.cat_derived.buses_by_route_by_hour`
    """
).result().to_dataframe()

Overall, CAT has been providing about 80% of the service scheduled in its current reduced service schedule. Are gaps in service consistent, or does the driver shortage affect some routes or times more than others?

bus_counts_by_route = bq_client.query(
    """
    select
      routename,
      sum(buses) / sum(busesexpected) as busratio,
    from `cvilledata.cat_derived.buses_by_route_by_hour`
    group by routename
    """
).result().to_dataframe()

alt.Chart(bus_counts_by_route).mark_bar().encode(
    y=alt.X("routename:N", sort="-x"),
    x="busratio:Q",
    tooltip=["routename", "busratio"],
).properties(
    title="Bus service ratio by route",
)

bus_counts_by_time_of_day = bq_client.query(
    """
    select
      extract(hour from receivehour) as receivehour,
      sum(buses) / sum(busesexpected) as busratio,
    from `cvilledata.cat_derived.buses_by_route_by_hour`
    group by receivehour
    """
).result().to_dataframe()
bus_counts_by_time_of_day["receivetime"] = bus_counts_by_time_of_day.receivehour.apply(lambda value: str(datetime.time(value)))

alt.Chart(bus_counts_by_time_of_day).mark_bar().encode(
    y=alt.Y("receivetime:N"),
    x="busratio:Q",
    tooltip=["receivetime", "busratio"],
).properties(
    title="Bus service ratio by time of day",
)

bus_counts_by_day_of_week = bq_client.query(
    """
    with means as (
    select
      timestamp_trunc(receivehour, day) as receiveday,
      sum(buses) / sum(busesexpected) as busratio,
    from `cvilledata.cat_derived.buses_by_route_by_hour`
    group by receiveday
    )
    select
      *,
      format_date('%A', receiveday) as receivedayname,
      extract(dayofweek from receiveday) as receivedayord,
    from means
    """
).result().to_dataframe()

alt.Chart(bus_counts_by_day_of_week).mark_bar().encode(
    y=alt.Y("receivedayname:N", sort=alt.SortField("receivedayord")),
    x="busratio:Q",
    tooltip=["receivedayname", "busratio"],
).properties(
    title="Bus service ratio by day of week",
)

Some routes have much higher service levels than others, relative to scheduled frequency. Perhaps not by coincidence, the routes with the lowest service ratios are generally the routes with the highest scheduled frequency, including the 5, 7, and Trolley routes. When there aren't enough drivers, CAT might prefer to reduce a three-bus route to two buses than to take a one-bus route out of service entirely.
Service ratios are relatively high from mid-morning to late afternoon but decline later in the day, ranging from 86% at 2:00 p.m. to 69% at 10:00 p.m.
With the caveat that this analysis includes just over a month's worth of data, reliability has been highest for Mondays (95%) and lowest for Saturdays (76%).

Overall, despite the switch to a reduced service model last year, CAT still has fewer buses in service than expected. The effects of the driver shortage are more pronounced for certain routes, times of day, and days of the week, but in general, CAT may be closer to 60-minute frequencies than the frequencies of 30 minutes or less promised for some routes.

CAT is working to address the driver shortage, and city staff may soon be able to push for higher wages via collective bargaining. Hopefully service will improve as CAT raises wages and recruits more drivers. To follow their progress, I've also set up a live dashboard showing the number of buses in service relative to scheduled service levels. Check back later in the year to see how CAT reliability changes over time.

I hope this has been interesting. If you have comments, corrections, etc., reach out and let me know! And if you want to look at the data yourself, it's available on BigQuery.