xkcd Comics

Caution

This example is for reference only. It is not extensively tested, and it is not intended to be a fully-fledged Concourse resource for production pipelines. Copy and paste at your own risk.

This example will showcase the SelfOrganisingConcourseResource, and how to use it to save lines of code and avoid implementing logic to determine which versions are newer than the last. In this particular example we will build a resource type to trigger on new xkcd comics.

'Automating' comes from the roots 'auto-' meaning 'self-', and 'mating', meaning 'screwing'.

Relevant xkcd

xkcd Version

Each xkcd comic is associated with an integer, and so these form a natural choice for the version. We’ll inherit from TypedVersion to minimise the code we need, and make sure our versions are hashable with unsafe_hash=True as is required by the resource. Finally, we’ll also implement the __lt__() method to make sure our comparisons use the comic ID as an orderable integer.

@dataclass(unsafe_hash=True)
class ComicVersion(TypedVersion, SortableVersionMixin):
    comic_id: int

    def __lt__(self, other: object) -> bool:
        if not isinstance(other, type(self)):
            return NotImplemented
        return self.comic_id < other.comic_id

xkcd Resource

We start by inheriting from the SelfOrganisingConcourseResource. We pass in our version, and also include an optional configuration parameter for the xkcd link. Although this is unlikely to matter to 99% of users, it is good practice to allow this to be configured in case a user is hosting their own version of the comics (or if the URL changes), as it saves them from needing to reimplement your work.

def __init__(self, url: str = "https://xkcd.com"):
    super().__init__(ComicVersion)
    self.url = url

Next we need to overload fetch_all_versions(). Because our versions are comparable, the resource will only return versions considered “greater than” the previous version, if one has been passed (the normal rules apply if it has not). It will also order them for us, so we don’t need to worry about that ourselves.
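To make the idea of "greater than" concrete, here is a minimal sketch of that kind of filtering and ordering. It is not the library's implementation: StandInComicVersion and newer_versions are hypothetical names that avoid importing concoursetools, the comic IDs are made up, and the case where no previous version is passed is deliberately not modelled.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(unsafe_hash=True)
class StandInComicVersion:
    """Hypothetical stand-in for ComicVersion, without concoursetools."""
    comic_id: int

    def __lt__(self, other: object) -> bool:
        if not isinstance(other, type(self)):
            return NotImplemented
        return self.comic_id < other.comic_id


def newer_versions(all_versions: set[StandInComicVersion],
                   previous: StandInComicVersion) -> list[StandInComicVersion]:
    """Keep only versions strictly greater than ``previous``, oldest first."""
    # sorted() only needs __lt__, which is why the version defines it.
    return sorted(version for version in all_versions if previous < version)


# Made-up comic IDs, in no particular order (a set is unordered anyway).
fetched = {StandInComicVersion(n) for n in (2901, 2899, 2900, 2898)}
result = newer_versions(fetched, StandInComicVersion(2899))
print([version.comic_id for version in result])
```

Because the versions are hashable, returning them as a set from fetch_all_versions() removes duplicates for free, and the comparison method is all that is needed to put them in order.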

Although we could technically scrape the xkcd website to check for new versions, it is much more polite to make use of the available RSS feed or, in this case, the Atom feed, which can be found here. Although there are existing libraries designed to parse Atom feeds, it doesn’t take too much effort to pull out the comic IDs, which is all we really need for this to work:

def yield_comic_links(xml_data: str) -> Generator[str]:
    root = ET.fromstring(xml_data)
    for entry in root:
        if entry.tag.endswith("entry"):
            for child in entry:
                if child.tag.endswith("link"):
                    items = dict(child.items())
                    yield items["href"]


def yield_comic_ids(xml_data: str) -> Generator[int]:
    for comic_url in yield_comic_links(xml_data):
        parsed_url = urllib.parse.urlparse(comic_url)
        comic_id = parsed_url.path.strip("/")
        yield int(comic_id)
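The two helpers above can be tried offline against a small feed. The sketch below repeats them so that it runs on its own, and feeds them a hand-written Atom document; the sample feed and its comic IDs are assumptions made for illustration, not a copy of the real xkcd feed.

```python
from __future__ import annotations

from collections.abc import Iterator
import urllib.parse
import xml.etree.ElementTree as ET

# Hand-written sample, loosely modelled on an Atom feed (an assumption).
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Sample feed</title>
  <link href="https://xkcd.com/" rel="alternate"/>
  <entry>
    <title>Second sample</title>
    <link href="https://xkcd.com/1002/" rel="alternate"/>
  </entry>
  <entry>
    <title>First sample</title>
    <link href="https://xkcd.com/1001/" rel="alternate"/>
  </entry>
</feed>"""


def yield_comic_links(xml_data: str) -> Iterator[str]:
    root = ET.fromstring(xml_data)
    for entry in root:
        # Tags carry the namespace, e.g. "{http://www.w3.org/2005/Atom}entry",
        # so we match on the end of the tag name.
        if entry.tag.endswith("entry"):
            for child in entry:
                if child.tag.endswith("link"):
                    yield dict(child.items())["href"]


def yield_comic_ids(xml_data: str) -> Iterator[int]:
    for comic_url in yield_comic_links(xml_data):
        # "https://xkcd.com/1002/" has the path "/1002/".
        path = urllib.parse.urlparse(comic_url).path
        yield int(path.strip("/"))


comic_ids = list(yield_comic_ids(SAMPLE_FEED))
print(comic_ids)  # [1002, 1001]
```

Note that the feed-level link is ignored, since only links nested inside an entry are yielded, and that the IDs come out in document order rather than sorted.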

When given the contents of the feed as a string, yield_comic_ids() will give us the integer ID of every comic in the feed. All we need to do is fetch the data from the source, and return our set of versions:

def fetch_all_versions(self) -> set[ComicVersion]:
    atom_url = f"{self.url}/atom.xml"
    response = requests.get(atom_url)
    feed_data = response.text
    return {ComicVersion(comic_id) for comic_id in yield_comic_ids(feed_data)}

The next step is to implement actually “getting” the version. Technically it is possible that between checking the versions and fetching one, it is no longer available in the feed. However, Randall has been kind enough to implement a JSON API, and so we can just use that.

Note

Since this API yields a link to the current comic, it would be possible to implement this example as a TriggerOnChangeConcourseResource, but we’ve already got an example for that one.

We’ll fetch this information and pull out some of the useful information such as the title and upload date. We’ll also generate the URL, and set this aside as Step Metadata:

def download_version(self, version: ComicVersion, destination_dir: Path,
                     build_metadata: BuildMetadata, image: bool = True,
                     link: bool = True, alt: bool = True) -> tuple[ComicVersion, dict[str, str]]:
    comic_info_url = f"{self.url}/{version.comic_id}/info.0.json"
    response = requests.get(comic_info_url)
    info = response.json()

    title = info["title"]
    url = f"{self.url}/{version.comic_id}/"

    upload_date = datetime(year=int(info["year"]), month=int(info["month"]),
                           day=int(info["day"]))
    metadata = {
        "Title": title,
        "Uploaded": upload_date.strftime(r"%d/%m/%Y"),
        "URL": f"{self.url}/{version.comic_id}/",
    }

Next we write the full comic info to a file, in case the user wishes to use it themselves:

info_path = destination_dir / "info.json"
info_path.write_text(json.dumps(info))

We then download the image, unless the user has disabled it (image defaults to True):

if image:
    image_path = destination_dir / "image.png"
    image_request = requests.get(info["img"], stream=True)
    with open(image_path, "wb") as wf:
        for chunk in image_request:
            wf.write(chunk)

And finally we do something similar with the comic link, and the alt text:

if link:
    link_path = destination_dir / "link.txt"
    link_path.write_text(url)

if alt:
    alt_path = destination_dir / "alt.txt"
    alt_path.write_text(info["alt"])

return version, metadata
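The metadata step can be checked without any network access. The sketch below pulls that logic into a hypothetical helper, build_step_metadata, and runs it on a made-up info dictionary; the field values are invented, and the date fields are given as strings only because the int() calls in the method above suggest that is how they arrive.

```python
from __future__ import annotations

from datetime import datetime


def build_step_metadata(info: dict[str, str], base_url: str,
                        comic_id: int) -> dict[str, str]:
    """Hypothetical helper mirroring the metadata logic of download_version."""
    upload_date = datetime(year=int(info["year"]), month=int(info["month"]),
                           day=int(info["day"]))
    return {
        "Title": info["title"],
        "Uploaded": upload_date.strftime(r"%d/%m/%Y"),
        "URL": f"{base_url}/{comic_id}/",
    }


# Made-up payload for illustration; not real comic data.
sample_info = {
    "title": "Sample title",
    "year": "2020",
    "month": "3",
    "day": "7",
    "alt": "Sample alt text",
    "img": "https://example.com/sample.png",
}

metadata = build_step_metadata(sample_info, "https://xkcd.com", 1)
print(metadata["Uploaded"])  # 07/03/2020
```

Every metadata value is a string, which matches the dict[str, str] in the return annotation of download_version().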

We don’t intend for the resource to publish new comics (unless Randall displays some interest), but the publish_new_version() method is an abstractmethod(), and so we need to explicitly overload it to ensure that the resource can be used at all. We do this by explicitly raising a NotImplementedError:

def publish_new_version(self, sources_dir: Path, build_metadata: BuildMetadata) -> tuple[ComicVersion, dict[str, str]]:
    raise NotImplementedError
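The reason for the explicit overload can be seen with a small stand-in. The classes below are hypothetical and do not use concoursetools; they only demonstrate the standard abc behaviour that a class with an abstract method left unimplemented cannot be instantiated.

```python
from abc import ABC, abstractmethod


class StandInResource(ABC):
    """Hypothetical base class with one abstract method."""

    @abstractmethod
    def publish_new_version(self) -> None:
        ...


class IncompleteResource(StandInResource):
    """Leaves publish_new_version() abstract."""


class ReadOnlyResource(StandInResource):
    """Overloads the method, but refuses to publish."""

    def publish_new_version(self) -> None:
        raise NotImplementedError


try:
    IncompleteResource()
except TypeError as error:
    # Python refuses to create an instance while a method is still abstract.
    print(f"Cannot instantiate: {error}")

resource = ReadOnlyResource()  # works, because the method is overloaded
try:
    resource.publish_new_version()
except NotImplementedError:
    print("Publishing is not supported")
```

With the overload in place the resource can be created and used for checking and fetching, and the error is only raised if something actually tries to publish.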

xkcd Conclusion

The final resource only requires 97 lines of code, and looks like this:

# (C) Crown Copyright GCHQ
from __future__ import annotations

from collections.abc import Generator
from dataclasses import dataclass
from datetime import datetime
import json
from pathlib import Path
import urllib.parse
import xml.etree.ElementTree as ET

import requests

from concoursetools.additional import SelfOrganisingConcourseResource
from concoursetools.metadata import BuildMetadata
from concoursetools.version import SortableVersionMixin, TypedVersion


@dataclass(unsafe_hash=True)
class ComicVersion(TypedVersion, SortableVersionMixin):
    comic_id: int

    def __lt__(self, other: object) -> bool:
        if not isinstance(other, type(self)):
            return NotImplemented
        return self.comic_id < other.comic_id


class XKCDResource(SelfOrganisingConcourseResource[ComicVersion]):

    def __init__(self, url: str = "https://xkcd.com"):
        super().__init__(ComicVersion)
        self.url = url

    def fetch_all_versions(self) -> set[ComicVersion]:
        atom_url = f"{self.url}/atom.xml"
        response = requests.get(atom_url)
        feed_data = response.text
        return {ComicVersion(comic_id) for comic_id in yield_comic_ids(feed_data)}

    def download_version(self, version: ComicVersion, destination_dir: Path,
                         build_metadata: BuildMetadata, image: bool = True,
                         link: bool = True, alt: bool = True) -> tuple[ComicVersion, dict[str, str]]:
        comic_info_url = f"{self.url}/{version.comic_id}/info.0.json"
        response = requests.get(comic_info_url)
        info = response.json()

        title = info["title"]
        url = f"{self.url}/{version.comic_id}/"

        upload_date = datetime(year=int(info["year"]), month=int(info["month"]),
                               day=int(info["day"]))
        metadata = {
            "Title": title,
            "Uploaded": upload_date.strftime(r"%d/%m/%Y"),
            "URL": f"{self.url}/{version.comic_id}/",
        }

        info_path = destination_dir / "info.json"
        info_path.write_text(json.dumps(info))

        if image:
            image_path = destination_dir / "image.png"
            image_request = requests.get(info["img"], stream=True)
            with open(image_path, "wb") as wf:
                for chunk in image_request:
                    wf.write(chunk)

        if link:
            link_path = destination_dir / "link.txt"
            link_path.write_text(url)

        if alt:
            alt_path = destination_dir / "alt.txt"
            alt_path.write_text(info["alt"])

        return version, metadata

    def publish_new_version(self, sources_dir: Path, build_metadata: BuildMetadata) -> tuple[ComicVersion, dict[str, str]]:
        raise NotImplementedError


def yield_comic_ids(xml_data: str) -> Generator[int]:
    for comic_url in yield_comic_links(xml_data):
        parsed_url = urllib.parse.urlparse(comic_url)
        comic_id = parsed_url.path.strip("/")
        yield int(comic_id)


def yield_comic_links(xml_data: str) -> Generator[str]:
    root = ET.fromstring(xml_data)
    for entry in root:
        if entry.tag.endswith("entry"):
            for child in entry:
                if child.tag.endswith("link"):
                    items = dict(child.items())
                    yield items["href"]