Skip to Content

Visualizing Transit Systems with R and GTFS Feeds

chicago-cta

Up until about 6 months ago, spatial data manipulation in R was kind of a mess. It was slow, fragmented into different packages, and completely outclassed by PostGIS and geopandas. That all seems to have changed with the release of sf, which uses the simple features access model while providing PostGIS-like spatial manipulation functions and geopandas-like spatial dataframes. In other words, it’s awesome.

I wanted a project to test out sf and the newly-released version of gganimate. I use a lot of transit and driving data for work, so I figured it might be fun to try to visualize a GTFS feed. A GTFS feed contains all the scheduling data for a city’s transit system, as well as stop locations, the shape of each train or bus line, and a bunch of other stuff. The goal is to make an animation which emulates the style of the game Mini Metro, with distinct, brightly colored train lines and little trains running on them. The challenge is to do it using only R, no PostGIS or Python allowed.

Setup

First we need to load the relevant libraries. Be sure to grab all of these from GitHub, the versions on CRAN are outdated and won’t work with this script.

library(imputeTS)
library(ggplot2)
library(gganimate)
library(gtfsr)
library(sf)

Getting The Data

Next we need to get a GTFS feed. I chose Chicago’s feed, but any feed should work with this script given a little tweaking. GTFS feeds are always zipped, per the standard, and there’s no need to unzip them. The gtfsr package will extract the GTFS feed into individual dataframes.

url <- "http://www.transitchicago.com/downloads/sch_data/google_transit.zip"
download.file(url, "gtfs.zip")

gtfs <- import_gtfs("gtfs.zip", local = TRUE)

We need a temporary dataframe of only the train route IDs. We could also include buses here, but it would up the complexity a fair bit.

temp_df <- gtfs[["routes_df"]] %>%
  filter(route_type == 1) %>%
  select(route_id, route_long_name, route_color)

Creating Static Spatial Geometries

Next we need to create dataframes of the static geometries for lines and stops. To do so, we essentially need to aggregate all of the individual points that make up a line, order them, then convert them to an sf linestring.

lines_df <- temp_df %>%
  inner_join(gtfs$trips_df, by = "route_id") %>%
  distinct(route_id, shape_id) %>%
  left_join(gtfs$shapes_df, by = "shape_id") %>%
  st_as_sf(coords = c("shape_pt_lon", "shape_pt_lat"), crs = 4326) %>%
  group_by(shape_id) %>%
  summarize(do_union = FALSE) %>%
  st_cast("LINESTRING") %>%
  left_join(gtfs$trips_df, by = "shape_id") %>%
  group_by(route_id) %>%
  summarize() %>% 
  left_join(temp_df, by = "route_id") %>%
  mutate(route_color = paste0("#", route_color))

Then do the same thing for individual stops. The reason there’s so many joins is because many of the attributes for each stop are stored in the separate dataframes created by gtfsr.

stops_df <- temp_df %>%
  inner_join(gtfs$trips_df, by = "route_id") %>%
  left_join(gtfs$stop_times_df, by = "trip_id") %>%
  left_join(gtfs$stops_df, by = "stop_id") %>%
  distinct(route_id, stop_id, route_color,
           stop_lon, stop_lat, stop_sequence) %>%
  mutate(route_color = paste0("#", route_color)) %>%
  rename(
    lat = stop_lat,
    lon = stop_lon
  )

Creating Trips and Interpolating

Here’s where it gets a little bit complicated. This animation works by moving the train cars from stop to stop (point to point) according to the schedule data from the GTFS feed. The problem is, if the cars move stop to stop, they will fly straight between stops, disregarding curves in the actual track. In order to move the train cars along their respective tracks, they need points from the actual track shape interspersed between each stop. We can extract these points from the points that make up the tracks, then merge them with the data for each inidividual trip.

# Getting the points of each track
shape_df <- temp_df %>%
  inner_join(gtfs$trips_df, by = "route_id") %>%
  distinct(route_id, shape_id) %>%
  left_join(gtfs$shapes_df, by = "shape_id") %>%
  group_by(shape_id) %>%
  rename(
    lat = shape_pt_lat,
    lon = shape_pt_lon,
    dist = shape_dist_traveled
    ) %>%
  select(-shape_pt_sequence)

# Creating a dataframe of all individual trips
trips_df <- temp_df %>%
  inner_join(gtfs$trips_df, by = "route_id") %>%
  left_join(gtfs$stop_times_df, by = "trip_id") %>%
  left_join(gtfs$stops_df, by = "stop_id") %>%
  distinct(
    route_id, route_color, shape_id, trip_id,
    stop_lat, stop_lon, arrival_time, shape_dist_traveled) %>%
  rename(
    lat = stop_lat,
    lon = stop_lon,
    dist = shape_dist_traveled
  ) %>%
  mutate(time = as.POSIXct(
    arrival_time,
    format = "%H:%M:%S",
    tz = "UTC")) %>%
  na.omit()

Next we merge the track points with the trips dataframe, ordering the points for each trip by the distance traveled for each car (which is available in both datasets). The cars move according to their scheduled arrival time at each stop, however there’s no scheduled times for the newly interspersed points. As such, we have to interpolate the arrival times for all the points between each stop. This can be done fairly easily with the imputeTS package.

temp_df_2 <- trips_df %>%
  group_by(trip_id, shape_id) %>%
  summarize() %>%
  left_join(shape_df, by = "shape_id")

trips_df <- trips_df %>%
  bind_rows(temp_df_2) %>%
  group_by(trip_id) %>%
  arrange(trip_id, dist) %>%
  mutate(time = as.POSIXct(
    na.interpolation(
      as.numeric(time),
      option = "stine"),
    origin = '1970-01-01', tz = 'UTC')
    ) %>%
  fill(arrival_time) %>%
  select(trip_id, arrival_time, lat, lon, time, dist) %>%
  group_by(trip_id, dist) %>%
  filter(row_number() == 1)

Plotting and Animating

Next, we create the ggplot which will be animated with gganimate, which will draw individual frames (plots) according to a field in your dataframe. It will tween the animation according to the number of frames specified, more frames = smoother animation.

There are three geoms in each plot: the static lines, the static stops, and the dynamic, animated trips.

plot <- ggplot() +
  geom_point(
    data = stops_df,
    aes(x = lon, y = lat, color = route_id),
    size = 2.5,
    show.legend = FALSE
    ) +
  geom_sf(
    data = lines_df,
    aes(color = route_id),
    show.legend = FALSE
    ) +
  geom_point(
    data = trips_df,
    aes(x = lon, y = lat),
    size = 1.5,
    shape = 15
    ) + 
  scale_color_manual(values = lines_df$route_color) +
  transition_components(trip_id, time) +
  ease_aes("sine-in-out") +
  theme_bw() +
  labs(
    title = "Chicago Rail ('L') System Map",
    subtitle = '{frame_time}')  + 
  theme(
    line = element_blank(),
    rect = element_blank(),
    axis.text = element_blank(),
    axis.title = element_blank(),
    plot.title = element_text(
      face = "bold",
      size = 24,
      margin = margin(b = -70, t = 42)),
    plot.subtitle = element_text(
      size = 18,
      margin = margin(b = -122, t = 80)),
    panel.grid.major = element_line(colour = "transparent")
    ) + 
  annotate(
    "text",
    x = -87.9350395,
    y = 41.734339,
    label = "Created by Dan Snow \ngithub.com/dfsnow",
    hjust = 0,
    size = 5,
    color = "grey60")

Next, we call animate() on the plot object to render each individual frame as a .png file. The specifications here will make a smooth animation but will take a long time to render. Generating 15,000 frames took about 4 hours on my machine. The individual pngs are saved in /tmp/.

frames <- as.numeric(length(unique(trips_df$arrival_time))) * 3
plot_mg <- animate(
  plot, frames, fps = 50, width = 1024, height = 1024)

Finally, we can put it all together by creating an mp4 file using ffmpeg. I use Docker for convenience. If you make over 10,000 frames you’ll have to rename the pngs to order them correctly. The files will go from gganim_plot8888.png to 08888.png.

ls *.png | rename 's/gganim_plot//'
ls *.png | rename 's/\d+/sprintf("%05d",$&)/e'

docker run --rm -v /tmp/:/tmp jrottenberg/ffmpeg -framerate 50 \
-pattern_type glob -i '/tmp/*.png' -c:v libx265 \
-pix_fmt yuv420p plot.mp4

docker cp {container_id}:/tmp/workdir/plot.mp4 plot.mp4

Overall, this method works fairly well and should be easy to adapt to other cities. There are a few weird glitches where trains will move very quickly between some stops. This is a side effect of interpolating the time of arrival for points along the tracks. The times are interpolated with linear approximation, but the points are not equidistant, so the time between two points 50 meters apart and 500 meters apart is the same. Nonetheless, the actual time that each train takes to travel between each official stop is accurate, and the animation works fairly well overall.