Sharpening the ax
Optimizing maintained.sh to be less absurd · January 1st, 2025
So a while back (apparently October of two years ago?!) I wrote a shell script to automate the maintenance of my Alpine packages. It started out as a simple version checker, and grew into a full blown workflow automation tool. I love this little script, and use it literally several times a day. I wouldn't consider managing all of the packages I do by hand ever again. But for all of the love I have for this tool, it is severely flawed.
And it's by design! Entirely and utterly my fault! In fact, all of the flaws I put into this script were utterly intentional at the time I wrote it. See, I was on this Python binge, had to do a whole bunch of it for work, and so it weaseled its way into my shell script. And this past year I've been rubying a ton, and that is all over the place too! I even started to rewrite my terribly janky script in Ruby, but that really only made the problem worse.
You see, my main system is just not that strong. It turns out that when you rely on an old armv7 CPU and a gig of RAM, it really can only do so much, and it really struggles with unoptimized, resource-hungry languages like Python and Ruby. Those languages trade performance for ease of development. So while I can absolutely bang out a Python or Ruby script in a few lines of what feels like pseudo-code, it just does not run "well". And that is exactly the jank we're dealing with now. I've suffered my own technical debt for too long.
This was my bright idea two years ago. I didn't want to deal with parsing XML inside of the shell script; I wanted to be lazy. So what if I just heredoc'd a really shitty Python script into the Python interpreter? Instead of admonishing me for this stupid idea, it actually worked, and thus my ENTIRE version checking pipeline was born!
check_feed() {
    # $1 is the feed URL; $pkg comes from the surrounding script
    title=$(python3 - <<EOF
import feedparser
feed = feedparser.parse("$1")
entry = feed.entries[0]
print(entry.title)
EOF
)
    # Strip the package name, then grab the first version-looking token
    echo "$title" | sed 's/'$pkg'//g' | grep -m1 -Eo "([0-9]+)((\.)[0-9]+)*[a-z]*" | head -n1
}
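Calling it looked something like this (the package name and URL here are made-up examples, not one of my actual packages):

pkg="mytool"
check_feed "https://github.com/example/mytool/releases.atom"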
But of course, this was a temporary solution, I'd rewrite it later, right? NOPE. This solution just got WORSE, because every temporary solution is, for some god awful reason, permanent. And of course it turns out that my quick little oversimplified Python in a shell script was not up to the task of actually parsing all of the wild things that people shove into their git forge RSS/Atom feeds. To the point where I needed to keep notes on what it could and couldn't do and what it choked on, literally duplicating the entire script with different handling, because it's pretty important to know when a release is a beta or alpha or rc. Yeah, it was terrible frankly.
check_feed() {
    title=$(python3 - <<EOF
import feedparser
from bs4 import BeautifulSoup
feed = feedparser.parse("$1")
entry = feed.entries[0]
print(entry.title)
if "-v" in "$2":
    for k in entry.content:
        if k["type"] == "text/html":
            detail = BeautifulSoup(k.value, features="lxml")
            print(detail.get_text())
elif "-d" in "$2":
    print(entry)
EOF
)
    if [ -z "$2" ]; then
        ver=$(echo "$title" | sed 's/'$pkg'//g' | grep -m1 -Eo "([0-9]+)((\.)[0-9]+)*[a-z]*" | head -n1)
        pr=$(echo "$title" | grep -oi "alpha\|beta\|rc[0-9]\|rc\.[0-9]")
        if [ "$ver" == "" ]; then
            # No version in the title? Spawn ANOTHER python to try the link instead.
            link=$(python3 - <<EOF
import feedparser
from bs4 import BeautifulSoup
feed = feedparser.parse("$1")
entry = feed.entries[0]
print(entry.link)
EOF
)
            ver=$(echo "$link" | sed 's/'$pkg'//g' | grep -m1 -Eo "([0-9]+)((\.)[0-9]+)*[a-z]*" | head -n1)
            pr=$(echo "$link" | grep -oi "alpha\|beta\|rc[0-9]")
            if [ "$pr" == "" ]; then
                echo "$ver"
            else
                echo "$ver [$pr]"
            fi
        else
            if [ "$pr" == "" ]; then
                echo "$ver"
            else
                echo "$ver [$pr]"
            fi
        fi
    else
        echo "$title"
    fi
}
Quantifying the jank
We can all look at that code and immediately realize there is a major problem. It never should have made it into "production", but how bad was it exactly?
Well, the last iteration of that Python-in-a-shell-script jank took this long to skim through ~170 different RSS feeds. Who wants to waste 10 minutes of their life every time they run the script, just to realize "oh yes, I need to update things", or maybe not? I sure don't.
real 9m 46.78s
user 7m 27.79s
sys 1m 2.00s
Now some of this pain is self-inflicted. I insist on using a weak armv7 system with a minimal amount of RAM, and this script, as terrible as it was, ran OKish enough on x86_64 hardware. For a long while I was using my Chuwi netbook with its 4c Celeron J series processor and it couldn't have cared less about this. Couple minutes in and out at most. But when the code is just this poorly written, and the language chosen is optimized for low developer complexity rather than performance, the results can be terrible. There's no reason my Droid can't handle this type of workload just as quickly; the limiting factor is that the code needs to be... better.
PEBKAC, enough said.
Let's make it awk-ward
Now my gut reaction here was to rewrite the entire tool in something that compiles real small and runs real fast. Nim is a GREAT candidate for this! Fennel would be another excellent choice if compiled statically like tkts is. Or maybe even going so far as to pick up a new language, Janet comes to mind!
But alas, as the 9 blog posts I managed to write in 2024 indicate, I didn't really have time for that. Learning a new language is high effort and requires a lot of time. Maintained.sh is a massive glob of things wrapped around recutils, which is another bottleneck I need to address, and fixing that would mean migrating to sqlite3 and writing helper functions for manual data correction. MEH. None of these felt like they would fit. So instead, I took a quick detour into my friend AWK! It's a great language that we all probably just think of as a tool we call to strip out text. You know, ps aux | awk '{print $2}', that jazz. Well awk, my friends, is so much more than that.
Awk will happily chew away at several different regexp patterns in a single pass, parsing the contents of XML tags and then consuming what is inside them to ultimately attempt to find several matches. All of this effort is necessary because of semver, and its inconsistent application. See, semver is a flawed system. It isn't enough to just grep for [0-9] and hope you get things; a semver isn't an int, nor a float, it's a string! So we get to split the string into bits and compare each int (see the sketch below). Easy enough in theory, lots and lots of libraries out there to support it, we'll make do. But what if people treat the semver like the string it is? Software development is a messy affair and people often litter their release tags with nuggets of information, like alpha, beta, RC[0-9], a/b[0-9]+, or sometimes literally emojis. These weird edge cases cause real frustration when developing automated package maintenance tooling.
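To make the split-and-compare idea concrete, here's a minimal standalone sketch in awk (not part of maintained.sh, and the versions are made-up examples):

awk -v a="1.10.2" -v b="1.9.9" 'BEGIN {
    # Split each version on the dots and compare field by field as ints
    na = split(a, av, ".")
    nb = split(b, bv, ".")
    n = (na > nb) ? na : nb
    for (i = 1; i <= n; i++) {
        x = av[i] + 0; y = bv[i] + 0    # a missing field counts as 0
        if (x != y) {
            print ((x > y) ? a " is newer" : b " is newer")
            exit
        }
    }
    print "versions are equal"
}'

A naive string comparison would call 1.9.9 newer than 1.10.2, which is exactly the trap.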
It is extremely important to know that a tag is a release candidate and not the actual release version, and denoting that by tacking RC1 or similar onto the semver is very common. But there is no standard: perhaps 3.0.0RC1 should be 3.0.0-RC1 or 3.0.0 RC1 or maybe 3.0.0b1? These patterns are all easy enough to parse, but each variant requires its own handling logic. And more and more I'm seeing projects on Github and Gitlab insert meaningless emojis and other nonsense into their project's tags. And this isn't even to say anything of people who don't use a version system at all and just expect their project to be built from HEAD. It's a ridiculous state of affairs we package maintainers must deal with. But ultimately, if you're the one writing the software, and you're providing it open source and libre, then I will work around those weird edge cases to make sure I can deliver that software to people who use Alpine. Keep rocking your emojis, Mealie devs, you make a wicked cool application.
Anyways, this is the revamped awk-ward version that attempts to compensate for all of these weird edge cases. It behaves exceptionally well for repos that just follow semver as expected, and tries to massage other common patterns as best as it can. I'm positive it will be extended throughout its lifetime; I already found a couple of edge cases with this new parser.
check_feed() {
    if [ -n "$1" ]; then
        read -r -d '' parser << 'EOF'
BEGIN {
    RS = "[<>]"   # Split the document on XML tags
    in_entry = 0
    in_title = 0
    found_first = 0
    OFS = "\t"    # Output field separator
}
/^entry/ || /^item/ { in_entry = 1 }
/^\/entry/ || /^\/item/ { in_entry = 0 }
/^title/ { in_title = 1; next }
/^\/title/ { in_title = 0; next }
in_entry && in_title && !found_first {
    gsub(/^[ \t]+|[ \t]+$/, "")
    if (length($0) > 0) {
        title = $0
        version = ""
        type = ""
        # py3-bayesian-optimizations uses a 3.0.0b1 variant, this needs checking.
        # nyxt uses pre-release in some of their tags.
        # Pattern 1: Version with space + Beta/Alpha/RC
        if (match(title, /[vV]?[0-9][0-9\.]+[0-9]+[ \t]+(Beta|Alpha|RC[0-9]*|beta|alpha|rc[0-9]*)/)) {
            full_match = substr(title, RSTART, RLENGTH)
            split(full_match, parts, /[ \t]+/)
            version = parts[1]
            type = parts[2]
        }
        # Pattern 2: Version with hyphen + qualifier
        else if (match(title, /[vV]?[0-9][0-9\.]+[0-9]+-(Beta|Alpha|RC[0-9]*|beta|alpha|rc[0-9]*)/)) {
            full_match = substr(title, RSTART, RLENGTH)
            split(full_match, parts, /-/)
            version = parts[1]
            type = parts[2]
        }
        # Pattern 3: Just version number
        else if (match(title, /[vV]?[0-9][0-9\.]+[0-9]+/)) {
            version = substr(title, RSTART, RLENGTH)
        }
        # Clean up version and type if found
        if (version) {
            # Remove leading v/V if present
            sub(/^[vV]/, "", version)
            if (type) {
                # Convert type to lowercase for consistency
                type = tolower(type)
                print version, type
            } else {
                print version
            }
            found_first = 1
            exit 0
        }
    }
}
EOF
        # Set strict error handling
        set -eu
        # Configure curl to be lightweight and time out quickly
        CURL_OPTS="-s --max-time 10 --compressed --no-progress-meter"
        local feed_url="$1"
        version=$(curl $CURL_OPTS "$feed_url" | awk "$parser")
        case "$version" in
            *$'\t'*)
                # A tab in the output means we got both a version and a pre-release type
                ver="${version%%$'\t'*}"
                pr="${version#*$'\t'}"
                echo "$ver [$pr]"
                ;;
            *)
                ver="$version"
                echo "$ver"
                ;;
        esac
    else
        echo "000"
    fi
}
The major optimization here is that we aren't spawning a Python sub-process for every single freaking check! To nobody's surprise, that works amazingly better. I could probably have gotten similar "better" results by properly using Python here, I'll admit that fully. But since this is a very personal ad hoc maintenance script, awk was the right choice for a night of hacking.
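If you want to feel that interpreter startup cost yourself, here's a crude sketch (no numbers shown, they vary wildly by machine):

# Spawn 100 do-nothing Python interpreters vs 100 do-nothing awks
time sh -c 'i=0; while [ $i -lt 100 ]; do python3 -c ""; i=$((i+1)); done'
time sh -c 'i=0; while [ $i -lt 100 ]; do awk "BEGIN {}"; i=$((i+1)); done'

The gap only grows on weak hardware.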
Quantifying the effort
So we made it a lot more complicated than a couple lines of Python; was it worth it? On that same ~170 RSS feeds we're now looking at a much saner 3m load time. And this is still a decently inefficient system built on top of a recfile DB. We could optimize even further by migrating to sqlite3, or by batching (or even better, parallelizing) our requests; a rough sketch of that follows the timings below.
real 3m 3.28s
user 1m 45.47s
sys 0m 21.26s
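For the curious, parallelizing the fetches might look something like this rough sketch. The feeds.txt file (one URL per line), the scratch directory, and the concurrency cap of 8 are all made-up assumptions, not how maintained.sh actually stores its feeds:

mkdir -p /tmp/feeds
i=0
while read -r url; do
    # Fire off each fetch in the background, reusing the same curl options
    curl -s --max-time 10 --compressed -o "/tmp/feeds/$i.xml" "$url" &
    i=$((i + 1))
    # Crude throttle: wait for the whole batch every 8 requests
    [ $((i % 8)) -eq 0 ] && wait
done < feeds.txt
wait
# Then run the awk parser over each saved response
for f in /tmp/feeds/*.xml; do
    awk "$parser" "$f"
done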
So yeah, revisit those temporary solutions from time to time; they can really suck the life out of otherwise wonderful tooling. And I cannot believe I spent 2 years letting this thing churn for 10m each time it ran. Yikes!