xkcd Comics
Caution
This example is for reference only. It is not extensively tested, and it is not intended to be a fully-fledged Concourse resource for production pipelines. Copy and paste at your own risk.
This example showcases the SelfOrganisingConcourseResource, and how to use it to save lines of code and avoid implementing the logic that determines which versions are newer than the last. In this particular example we will build a resource type that triggers on new xkcd comics.
Relevant xkcd
xkcd Version
Each xkcd comic is associated with an integer, and so these form a natural choice for the version. We’ll inherit from TypedVersion to minimise the code we need, and make sure our versions are hashable with unsafe_hash=True as is required by the resource. Finally, we’ll also implement the __lt__() method to make sure our comparisons use the comic ID as an orderable integer.
@dataclass(unsafe_hash=True)
class ComicVersion(TypedVersion, SortableVersionMixin):
    comic_id: int

    def __lt__(self, other: Any) -> bool:
        return self.comic_id < other.comic_id
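To see why both the hash and `__lt__()` matter, here is a minimal standalone sketch. `PlainComicVersion` is a hypothetical stand-in that drops the concoursetools base classes so that the snippet runs on its own; it is not part of the library.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(unsafe_hash=True)
class PlainComicVersion:
    """Hypothetical stand-in for ComicVersion, without the concoursetools base classes."""
    comic_id: int

    def __lt__(self, other: Any) -> bool:
        return self.comic_id < other.comic_id


# Hashable versions can live in a set, so the duplicate ID 3 collapses.
versions = {PlainComicVersion(3), PlainComicVersion(1), PlainComicVersion(3), PlainComicVersion(2)}

# sorted() only needs __lt__, so the versions come back in comic order.
ordered_ids = [version.comic_id for version in sorted(versions)]
```

The set ends up with three members because the dataclass derives both equality and the hash from `comic_id`, and the sort order comes entirely from the `__lt__()` method.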
xkcd Resource
We start by inheriting from the SelfOrganisingConcourseResource. We pass in our version, and also include an optional configuration parameter for the xkcd link. Although this is unlikely to matter to 99% of users, it is good practice to allow this to be configured in case a user is hosting their own version of the comics (or if the URL changes), and it will avoid them needing to reimplement your work.
def __init__(self, url: str = "https://xkcd.com"):
    super().__init__(ComicVersion)
    self.url = url
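Every endpoint the resource touches in this example is derived from this one base URL. The sketch below collects them in one place; `build_endpoints` and the mirror host are hypothetical, and stripping a trailing slash is an optional refinement that the resource in this example does not perform.

```python
def build_endpoints(url: str, comic_id: int) -> dict:
    """Hypothetical helper: derive the URLs used in this example from a base URL."""
    base = url.rstrip("/")  # tolerate a trailing slash in the configured URL
    return {
        "feed": f"{base}/atom.xml",
        "info": f"{base}/{comic_id}/info.0.json",
        "comic": f"{base}/{comic_id}/",
    }


default = build_endpoints("https://xkcd.com", 1)
# A made-up self-hosted mirror, configured with a trailing slash.
mirror = build_endpoints("https://comics.example.org/", 1)
```

Keeping the base URL as the only configurable value means a self-hosted copy just needs to serve the same three paths.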
Next we need to override fetch_all_versions(). Because our versions are comparable, the resource will only return versions considered “greater than” the previous version, if one was passed (the normal rules apply if it was not), and will even order them for us, so we don’t need to worry about that ourselves.
Although we could technically scrape the xkcd website to check for new versions, it is much more polite to make use of the available RSS feed or, in this case, the Atom feed, which is served at /atom.xml under the site root. Although there are existing libraries designed to parse Atom feeds, it doesn’t take too much effort to pull out the comic IDs, which is all we really need for this to work:
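The selection described above can be sketched with plain integers standing in for versions. This is an illustration, not the library's implementation: `select_new_versions` is a hypothetical name, and it assumes that the “normal rules” with no previous version mean only the latest version is emitted.

```python
from typing import Iterable, List, Optional


def select_new_versions(all_versions: Iterable[int], previous: Optional[int] = None) -> List[int]:
    """Hypothetical sketch: keep only versions newer than the previous one, in order."""
    ordered = sorted(set(all_versions))
    if not ordered:
        return []
    if previous is None:
        # Assumption: with no previous version, only the latest is of interest.
        return ordered[-1:]
    return [version for version in ordered if previous < version]


after_three = select_new_versions({5, 3, 4}, previous=3)
first_check = select_new_versions({5, 3, 4})
```

The point of the base class is that this bookkeeping lives in one place, so the resource itself only has to say which versions exist.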
def yield_comic_links(xml_data: str):
    root = ET.fromstring(xml_data)
    for entry in root:
        # Tags are namespace-qualified, so match on the suffix only.
        if entry.tag.endswith("entry"):
            for child in entry:
                if child.tag.endswith("link"):
                    items = dict(child.items())
                    yield items["href"]


def yield_comic_ids(xml_data: str):
    for comic_url in yield_comic_links(xml_data):
        # The path of each comic link is the comic ID wrapped in slashes.
        parsed_url = urllib.parse.urlparse(comic_url)
        comic_id = parsed_url.path.strip("/")
        yield int(comic_id)
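As a quick offline check, the two helpers can be run against a small hand-written feed. The feed below is made up for illustration (it is not real xkcd data), and the helpers are repeated so that the snippet runs on its own.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Hand-written sample in the Atom namespace; not real feed content.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example feed</title>
  <link href="https://xkcd.com/" rel="alternate"/>
  <entry>
    <title>Newer comic</title>
    <link href="https://xkcd.com/2001/" rel="alternate"/>
  </entry>
  <entry>
    <title>Older comic</title>
    <link href="https://xkcd.com/2000/" rel="alternate"/>
  </entry>
</feed>"""


def yield_comic_links(xml_data: str):
    root = ET.fromstring(xml_data)
    for entry in root:
        if entry.tag.endswith("entry"):
            for child in entry:
                if child.tag.endswith("link"):
                    yield dict(child.items())["href"]


def yield_comic_ids(xml_data: str):
    for comic_url in yield_comic_links(xml_data):
        parsed_url = urllib.parse.urlparse(comic_url)
        yield int(parsed_url.path.strip("/"))


comic_ids = list(yield_comic_ids(SAMPLE_FEED))
```

Only links nested inside an entry are read, so the feed-level link to the site root is skipped rather than tripping up the `int()` conversion.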
When given the contents of the feed as a string, yield_comic_ids() will give us the integer IDs of all the comics in the feed. All we need to do is fetch the data from the source, and return our set of versions:
def fetch_all_versions(self):
    atom_url = f"{self.url}/atom.xml"
    response = requests.get(atom_url)
    feed_data = response.text
    return {ComicVersion(comic_id) for comic_id in yield_comic_ids(feed_data)}
The next step is to implement actually “getting” the version. Technically, a version could drop out of the feed between being checked and being fetched. However, Randall has been kind enough to implement a JSON API, and so we can just use that.
Note
Since this API yields a link to the current comic, it would be possible to implement this example as a TriggerOnChangeConcourseResource, but we’ve already got an example for that one.
We’ll fetch this information and pull out some useful fields such as the title and upload date. We’ll also generate the comic URL, and set these aside as Step Metadata:
def download_version(self, version: ComicVersion, destination_dir: pathlib.Path,
                     build_metadata: BuildMetadata, image: bool = True,
                     link: bool = True, alt: bool = True):
    comic_info_url = f"{self.url}/{version.comic_id}/info.0.json"
    response = requests.get(comic_info_url)
    info = response.json()

    title = info["title"]
    url = f"{self.url}/{version.comic_id}/"

    upload_date = datetime(year=int(info["year"]), month=int(info["month"]),
                           day=int(info["day"]))
    metadata = {
        "Title": title,
        "Uploaded": upload_date.strftime(r"%d/%m/%Y"),
        "URL": f"{self.url}/{version.comic_id}/",
    }
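The metadata step can be tried in isolation with a made-up payload. `build_step_metadata` is a hypothetical helper that mirrors the code above, and the sample values are invented rather than fetched; the date parts are assumed to arrive as strings, which is why they are passed through `int()`.

```python
from datetime import datetime


def build_step_metadata(info: dict, base_url: str, comic_id: int) -> dict:
    """Hypothetical helper mirroring the metadata built in download_version."""
    upload_date = datetime(year=int(info["year"]), month=int(info["month"]),
                           day=int(info["day"]))
    return {
        "Title": info["title"],
        "Uploaded": upload_date.strftime("%d/%m/%Y"),
        "URL": f"{base_url}/{comic_id}/",
    }


# Invented payload containing only the keys that the helper reads.
sample_info = {"title": "Example Comic", "year": "2020", "month": "1", "day": "31"}
metadata = build_step_metadata(sample_info, "https://xkcd.com", 1234)
```

Going through `datetime` rather than joining the strings directly gives a zero-padded date and rejects impossible dates early.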
Next we write the full comic info returned by the API to a file, in case the user wishes to use it themselves:
    info_path = destination_dir / "info.json"
    info_path.write_text(json.dumps(info))
We optionally download the image (this is enabled by default):
    if image:
        image_path = destination_dir / "image.png"
        image_request = requests.get(info["img"], stream=True)
        with open(image_path, "wb") as wf:
            for chunk in image_request:
                wf.write(chunk)
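Iterating over a streamed response yields the body in chunks, so the whole image never has to sit in memory at once. The chunked write can be exercised without any network access by factoring it into a helper that accepts any iterable of bytes; `write_chunks` is a hypothetical name and the byte strings below are placeholder data, not a real image.

```python
import pathlib
import tempfile
from typing import Iterable


def write_chunks(chunks: Iterable[bytes], path: pathlib.Path) -> int:
    """Hypothetical helper: write byte chunks to path and return the total bytes written."""
    written = 0
    with open(path, "wb") as wf:
        for chunk in chunks:
            written += wf.write(chunk)
    return written


with tempfile.TemporaryDirectory() as tmp:
    image_path = pathlib.Path(tmp) / "image.png"
    # Placeholder chunks standing in for a streamed response body.
    total = write_chunks([b"\x89PNG", b"fake", b"data"], image_path)
    contents = image_path.read_bytes()
```

With requests, one option would be to pass `response.iter_content(chunk_size=8192)` as the iterable to control the chunk size explicitly.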
And finally we do something similar with the comic link, and the alt text:
    if link:
        link_path = destination_dir / "link.txt"
        link_path.write_text(url)

    if alt:
        alt_path = destination_dir / "alt.txt"
        alt_path.write_text(info["alt"])

    return version, metadata
We don’t intend for the resource to publish new comics (unless Randall displays some interest), but publish_new_version() is an abstract method, and so we need to explicitly override it for the resource to be usable at all. We do this by raising a NotImplementedError:
def publish_new_version(self, sources_dir, build_metadata):
    raise NotImplementedError
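The reason the override is needed comes from how Python's abstract base classes behave in general, which a small standalone sketch can show. The classes below are hypothetical stand-ins and deliberately do not use concoursetools.

```python
from abc import ABC, abstractmethod


class BaseResource(ABC):
    """Hypothetical stand-in for a resource base class with two abstract methods."""

    @abstractmethod
    def download_version(self): ...

    @abstractmethod
    def publish_new_version(self): ...


class IncompleteResource(BaseResource):
    # publish_new_version is left abstract, so this class cannot be instantiated.
    def download_version(self):
        return "downloaded"


class ReadOnlyResource(IncompleteResource):
    def publish_new_version(self):
        raise NotImplementedError


try:
    IncompleteResource()
    instantiated = True
except TypeError:
    instantiated = False

resource = ReadOnlyResource()
try:
    resource.publish_new_version()
    publish_failed = False
except NotImplementedError:
    publish_failed = True
```

Overriding the method with a body that raises keeps the class instantiable while still failing loudly if anything tries to publish.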
xkcd Conclusion
The final resource only requires 93 lines of code, and looks like this:
# (C) Crown Copyright GCHQ
from dataclasses import dataclass
from datetime import datetime
import json
import pathlib
from typing import Any
import urllib.parse
import xml.etree.ElementTree as ET

import requests

from concoursetools.additional import SelfOrganisingConcourseResource
from concoursetools.metadata import BuildMetadata
from concoursetools.version import SortableVersionMixin, TypedVersion


@dataclass(unsafe_hash=True)
class ComicVersion(TypedVersion, SortableVersionMixin):
    comic_id: int

    def __lt__(self, other: Any) -> bool:
        return self.comic_id < other.comic_id


class XKCDResource(SelfOrganisingConcourseResource):

    def __init__(self, url: str = "https://xkcd.com"):
        super().__init__(ComicVersion)
        self.url = url

    def fetch_all_versions(self):
        atom_url = f"{self.url}/atom.xml"
        response = requests.get(atom_url)
        feed_data = response.text
        return {ComicVersion(comic_id) for comic_id in yield_comic_ids(feed_data)}

    def download_version(self, version: ComicVersion, destination_dir: pathlib.Path,
                         build_metadata: BuildMetadata, image: bool = True,
                         link: bool = True, alt: bool = True):
        comic_info_url = f"{self.url}/{version.comic_id}/info.0.json"
        response = requests.get(comic_info_url)
        info = response.json()

        title = info["title"]
        url = f"{self.url}/{version.comic_id}/"

        upload_date = datetime(year=int(info["year"]), month=int(info["month"]),
                               day=int(info["day"]))
        metadata = {
            "Title": title,
            "Uploaded": upload_date.strftime(r"%d/%m/%Y"),
            "URL": f"{self.url}/{version.comic_id}/",
        }

        info_path = destination_dir / "info.json"
        info_path.write_text(json.dumps(info))

        if image:
            image_path = destination_dir / "image.png"
            image_request = requests.get(info["img"], stream=True)
            with open(image_path, "wb") as wf:
                for chunk in image_request:
                    wf.write(chunk)

        if link:
            link_path = destination_dir / "link.txt"
            link_path.write_text(url)

        if alt:
            alt_path = destination_dir / "alt.txt"
            alt_path.write_text(info["alt"])

        return version, metadata

    def publish_new_version(self, sources_dir, build_metadata):
        raise NotImplementedError


def yield_comic_ids(xml_data: str):
    for comic_url in yield_comic_links(xml_data):
        parsed_url = urllib.parse.urlparse(comic_url)
        comic_id = parsed_url.path.strip("/")
        yield int(comic_id)


def yield_comic_links(xml_data: str):
    root = ET.fromstring(xml_data)
    for entry in root:
        if entry.tag.endswith("entry"):
            for child in entry:
                if child.tag.endswith("link"):
                    items = dict(child.items())
                    yield items["href"]