BeautifulSoup
[1]:
import requests
url = "https://de.wikipedia.org/wiki/Liste_der_Stra%C3%9Fen_und_Pl%C3%A4tze_in_Berlin-Mitte"
r = requests.get(url)
Install:
With Spack you can make BeautifulSoup available in your kernel:
$ spack env activate python-311
$ spack install py-beautifulsoup4~html5lib~lxml
Alternatively, you can install BeautifulSoup with other package managers, for example
$ uv add beautifulsoup4
With r.content we can output the HTML of the page.
Next, we have to parse this content into a Python representation of the page with BeautifulSoup:
[2]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.content, "html.parser")
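To see what such a representation looks like without a network request, the following minimal sketch parses a small, made-up HTML snippet. The markup and the names html and demo_soup are assumptions for illustration only and are not part of the Wikipedia page:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, standing in for r.content
html = (
    b"<html><head><title>Demo</title></head>"
    b"<body><p>Hello <b>world</b></p></body></html>"
)

demo_soup = BeautifulSoup(html, "html.parser")

# The first matching tag can be reached as an attribute of the parsed tree
print(demo_soup.title.string)
print(demo_soup.p.get_text())
```

The same attribute access works on the soup object created from r.content above.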
To structure the code, we create a new function, get_dom (Document Object Model), that includes all the previous code:
[3]:
def get_dom(url):
    r = requests.get(url)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page
    r.raise_for_status()
    return BeautifulSoup(r.content, "html.parser")
Individual elements can be filtered out, for example, via CSS selectors. To determine a selector in Firefox, right-click one of the table cells in the first column of the table and open the Inspector. There, right-click the element again and select Copy → CSS Selector. The clipboard will then contain, for example, table.wikitable:nth-child(13) > tbody:nth-child(2) > tr:nth-child(1). We now clean up this CSS selector: we do not want to restrict the match to a table.wikitable that is the 13th child of its parent, or to a tbody that is the 2nd child of its table. Instead, we only want the links in the 1st column of each row within tbody.
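The effect of dropping the positional parts can be tried out on a small example. The following sketch uses a hypothetical page with two tables; the markup and the names narrow and broad are assumptions chosen for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical page with two tables; only the structure matters here
html = """
<div>
  <table class="wikitable"><tbody>
    <tr><td><a href="/a">A</a></td><td>x</td></tr>
  </tbody></table>
  <table class="wikitable"><tbody>
    <tr><td><a href="/b">B</a></td><td><a href="/c">C</a></td></tr>
  </tbody></table>
</div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# Position-bound selector: only a table that is the 2nd child of its parent
narrow = demo_soup.select(
    "table.wikitable:nth-child(2) > tbody > tr > td:nth-child(1) > a"
)

# Cleaned-up selector: links in the first column of every wikitable
broad = demo_soup.select("table.wikitable > tbody > tr > td:nth-child(1) > a")

print([a.text for a in narrow])
print([a.text for a in broad])
```

The position-bound selector only finds the link in the second table, while the cleaned-up selector finds the first-column links of both tables and still skips the link in the second column.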
Finally, with limit=4 in this notebook, we only display the first four results as an example:
[4]:
links = soup.select(
    "table.wikitable > tbody > tr > td:nth-child(1) > a", limit=4
)
print(links)
[<a href="/wiki/Ackerstra%C3%9Fe_(Berlin)" title="Ackerstraße (Berlin)">Ackerstraße</a>, <a class="mw-disambig" href="/wiki/Adalbertstra%C3%9Fe" title="Adalbertstraße">Adalbertstraße</a>, <a class="new" href="/w/index.php?title=Albrechtstra%C3%9Fe&action=edit&redlink=1" title="Albrechtstraße (Seite nicht vorhanden)">Albrechtstraße</a>, <a href="/wiki/Alexanderplatz" title="Alexanderplatz">Alexanderplatz</a>]
However, we do not want the entire HTML link, but only its text content:
[5]:
for content in links:
    print(content.text)
Ackerstraße
Adalbertstraße
Albrechtstraße
Alexanderplatz
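Besides the text, the attributes of each link can be read in a similar way. The following is a minimal sketch that hard-codes two of the links from the output above, so it runs without a network request; the names snippet and demo_links are assumptions:

```python
from bs4 import BeautifulSoup

# Two links copied from the output above, used as a stand-alone snippet
snippet = (
    '<a href="/wiki/Alexanderplatz" title="Alexanderplatz">Alexanderplatz</a>'
    '<a class="mw-disambig" href="/wiki/Adalbertstra%C3%9Fe" '
    'title="Adalbertstraße">Adalbertstraße</a>'
)
demo_links = BeautifulSoup(snippet, "html.parser").select("a")

for link in demo_links:
    # Attributes behave like dictionary entries; get() returns None if missing
    print(link.get_text(), link["href"], link.get("class"))
```

Note that class is a multi-valued attribute, so BeautifulSoup returns it as a list.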
See also