PDF Parser and REST API for Canteen at Ulm University

2022-11-06

Summary

If you don’t want to read the whole post, here are the most important links:

  • REST API: uulm.anter.dev/api/v1/canteens/ul_uni/sued
  • Telegram bot: @uniulm_bot (use the command /mensa)

Background

As a new student at Ulm University, I wanted an easy way to check the canteen plan for the current day. So why not implement a Telegram bot that sends me the current plan? However, there were some small hurdles when getting the daily plans.

The Current Issue

Unfortunately, the weekly canteen plans are only provided as PDF files by the Studierendenwerk Ulm on their website. There is no official API that provides the plan in a machine-readable format. This is a problem, as I don’t want the bot to send large PDF files (~1MB) to me every time I want to see the daily plan.

There is an unofficial API hosted by the Fachschaft Elektrotechnik at Ulm University which can be accessed at the following URL: mensaplan.fs-et.de/data/mensaplan.json. However, the parser currently has a bug that prevents the student prices for the meals from being parsed correctly.

This means that you could use the unofficial API to get the meals, but without student prices, which hurts usability. To fix this issue, I had three options:

  • Look for another parser (Problem: I wasn’t able to find one)
  • Fix the bugs of the current parser (Problem: I don’t know PHP)
  • Implement a new one from scratch

Approach

So by process of elimination, I implemented a parser from scratch. I decided on Python as there exist multiple PDF parsing libraries and it is the language I’m most comfortable with. The slower runtime performance in comparison to other programming languages is not an issue here, because the script only needs to run once per week.

The general process looks like this:

  1. Scrape all PDF links on the Studierendenwerk Ulm website
  2. Filter the links to get the relevant canteens (i.e. ignore plans from other universities)
  3. For every plan, parse the PDF file with a library
  4. Convert the intermediary dictionary to a specific JSON format (e.g. OpenMensa or from the Fachschaft Elektrotechnik)
  5. Provide the JSON file with a REST API
  6. Fetch the daily plans from the API and send them via a Telegram bot

Step 0: Getting the PDF files

This is the easy part: Just get all <a> tags with the BeautifulSoup library. As all links to the PDF plans have the link text “hier” (the German word for “here”), the relevant links can be found very easily, like this:

from bs4 import BeautifulSoup
import urllib.request

# URL of the Studierendenwerk Ulm page that lists the weekly plans
BASE_URL = "..."

with urllib.request.urlopen(BASE_URL) as response:
    speiseplan_source = response.read().decode("utf-8")
    soup = BeautifulSoup(speiseplan_source, "html.parser")
    # all plan links on the page have the link text "hier"
    links = soup.find_all("a", string="hier")
    plans = [a["href"] for a in links]

After getting all the links to the PDF plans, you can filter the relevant canteens by parsing the filenames.
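
As a minimal sketch of such a filter: the substring check below is an assumption about the filenames, not the Studierendenwerk’s actual naming scheme.

# Hypothetical filter: keep only PDF links whose filename mentions the
# desired canteen. The real filenames may follow a different scheme.
relevant_plans = [
    link for link in plans
    if link.lower().endswith(".pdf") and "ulm" in link.rsplit("/", 1)[-1].lower()
]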

Step 1: Parsing the PDF file

Scraping data from PDF files is tricky. Tables inside PDFs might look structured to the human eye, but most PDF parsers struggle to automagically identify the cells of a table. Sure, you can extract all of the text, but then all the layout information is missing. After trying out some Python libraries, I decided on PyMuPDF.

At first, I tried parsing the extracted text all at once. That worked well for fully packed plans. But when there are empty cells (e.g. on public holidays), you can’t reliably tell from the extracted text which day a specific meal belongs to.

So I decided on another approach. Instead of extracting the text all at once, you can give the PyMuPDF library a specific area from where the text should be extracted, like this:

import fitz  # PyMuPDF

document = fitz.open(filename="ulm.pdf")
page = document[0]
# Rect takes the top-left (x0, y0) and bottom-right (x1, y1) corners
area = fitz.Rect(x0, y0, x1, y1)

day_column = page.get_text("text", clip=area)
print(day_column)  # extracted text for a specific date

For this method, I had to determine the areas for every weekday and hardcode them into the parser. This is a disadvantage compared to the previous method, as layout changes can break the parser’s results.
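
To give an idea of what this hardcoding looks like, here is a minimal sketch; the coordinates below are made-up placeholders, and the real parser may organize this differently:

import fitz  # PyMuPDF

# One clip rectangle per weekday; the numbers are invented placeholders
# and have to be measured once for the actual plan layout.
WEEKDAY_AREAS = {
    "Montag": fitz.Rect(140, 90, 270, 720),
    "Dienstag": fitz.Rect(270, 90, 400, 720),
    # ... one entry per weekday
}

document = fitz.open(filename="ulm.pdf")
page = document[0]
for day, area in WEEKDAY_AREAS.items():
    day_text = page.get_text("text", clip=area)
    # parse the meal names and prices for this day from day_text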

After extracting the texts for every weekday, the actual parsing is relatively straightforward. If you want to try out this area text extraction on your own PDF files, there is also a tool that helps you determine the coordinates of the areas:

git clone https://github.com/pymupdf/PyMuPDF-Utilities.git
cd PyMuPDF-Utilities/examples
python wxTableExtract.py

When you run the Python script, you get a GUI where you can create a new rectangle by dragging the mouse on the PDF. After that, the x, y, width and height values are shown on the left side:

Get coordinates of rectangle with wxTableExtract

Step 2: Providing the data via a REST API

After the meals are parsed, the data can be provided to other services via a REST API. For this, I created a separate project, imported the PDF parser, and implemented the REST endpoints with Flask. For the structure of the JSON file, I roughly followed the OpenMensa JSON structure.
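
As a minimal sketch of what such an endpoint could look like (the route mirrors the URL shape of the deployed API, but the names and wiring here are assumptions, not the project’s actual code):

from flask import Flask, jsonify

app = Flask(__name__)

def get_plan() -> dict:
    # placeholder: in the real project this returns the parsed weekly plan
    # (a cached version of this function is shown below)
    return {"days": []}

# hypothetical route, mirroring the URL shape of the deployed API
@app.route("/api/v1/canteens/<canteen>/<line>")
def canteen_plan(canteen: str, line: str):
    return jsonify(get_plan())

if __name__ == "__main__":
    app.run()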

As there is only one new plan per week, the parsed plan can be cached to reduce traffic to the webserver of Studierendenwerk Ulm. For this, I used the cachetools library:

from cachetools import cached, TTLCache

@cached(cache=TTLCache(maxsize=4, ttl=3600))
def get_plan() -> dict:
    # ...
    return plan
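
With ttl=3600, a parsed plan is cached for an hour, so the Studierendenwerk website is hit at most once per hour instead of on every API request.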

Result

I’m very happy with how this project turned out. You can try out the REST API here: uulm.anter.dev/api/v1/canteens/ul_uni/sued

If you want to try out the Telegram Bot, add it here and use the command /mensa: @uniulm_bot

/mensa command of Telegram Bot

For the future, I might implement a proper web frontend for the API so that people without Telegram can benefit from the improved Mensa parser.

If you have any questions or feedback regarding this project, feel free to contact me here.