For the past six years, I have been employed on and off at an international NGO with offices in more than 30 countries. My work has mainly been in Fundraising or Business Intelligence, doing fun things like converting .xlsx files to .csv and generating latent contempt for people with “digital” or “social media” in their job titles. However, as the main tech wizard, I have been asked to do some very weird things. I want to tell you a story of intrigue, technology from a forgotten time, and a prototypically badly planned project. All of this is from memory and happened a few years back, so details might be slightly off.
Before reading on, know that I love GNU. It took me years to learn sed, AWK, cut, cat, sort and tr, mainly because I had no purpose for anything these wonderful tools can do. All the tutorials I found had to do with system management and log files, and while I broke systems and reinstalled them regularly, I never managed one.
Then I got my first job that forced MS Excel on me. When that happened, I soon pulled out my rusty *NIX-toolbox and got to learning by doing.
I want to tell this story for two reasons. The first being that I like the story and actually solved a problem in a cheap and dirty way. The second, and much less self-indulgent, reason is that I believe these tools are undervalued because they are misunderstood, and the steep learning curve combined with the lack of reasonable examples makes using them an elitist practice. I am all for elitism, as long as it’s inclusionary and everybody can be a part of the elite.
A brief introduction:
My good friend and colleague hits me up during a coffee-break.
Colleague: “Hey Mads, I have this small issue, could you take a look at it?”
Me, oblivious and naive, thinking it’s an hour at max: “Sure”
He explains the problem: Download these freaking posters
The NGO in which I work ran an international campaign with an attached competition. To raise awareness, people all around the world were asked to create posters in the theme of the campaign, and upload those to a website. After a certain deadline, the *best* posters would be chosen by a panel of judges.
That was about as much of a plan as the project started out with.
A web bureau was charged with setting up an international site that allowed for uploading these posters. Fast forward three months and what have you: more than 2000 posters have been submitted. The campaign people are blown away; this exceeds expectations wildly!
The campaign people are not happy for long, though. Suddenly they need to download these 2000 posters to find the best one. With the well-prepared project planning phase, surely the web bureau was asked to add a bulk download button?
Nah. Planning ahead is not for NGOs apparently. Can they add one? Nah, they’re “consultants” and that would cost us a ton of money nobody wants to pay.
Okay, so we need to manually download each of the 2000 posters from a WordPress site, one poster after another.
Furthermore, we need the metadata for these posters. There’s no reason to find the best poster and then realize you have no idea who made it.
Luckily, the metadata is stored in one large csv file.
After crying my eyes out from realizing I do not like web developers, I look at the problem. There’s no way I’m clicking 2000 times and manually pairing up metadata with posters. I think of Selenium and Python and decide there’s got to be a better way than making a browser bot out of a test-automation suite.
I first dive into the metadata from the **csv** file. I want to pet the wildebeest when possible, so I fire up a bash prompt and run
head -1 stupid_csv_file.csv
The output is a 263-character-long, hard-to-read string. I left it here, unformatted, as I want any reader to feel my pain. It gets lonely when screaming into the void.
I realize I must pet the wildebeest harder. If I pipe the head command to tr, I can translate the commas into newlines. If I feed that into cat, I can number the lines, effectively giving me an index into this mess of a csv. (This is one of my favourite one-liners btw!)
head -1 stupid_csv_file.csv | tr , '\n' | cat -n
Much better. Now I can read what’s in that terrible file.
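If you want to see the one-liner in action without my terrible file, here is the same trick run on a tiny, made-up CSV (the field names are invented for illustration):

```shell
# Build a toy CSV header, then number its fields:
printf 'id,name,email,uri\n' > toy.csv
head -1 toy.csv | tr , '\n' | cat -n
```

Each comma becomes a newline, and cat -n prepends a line number, so every field name arrives on its own numbered line.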
The csv file contains 24 fields. Most of them, like a duplicate email field (the 17th value), are useless, but I do manage to strike gold. The contact info is all there, and better yet, something that looks like part of a URL is there as well! The values all start with the substring “poster-container/”. Hmm. It’s probably a folder with all the posters. The field name is uri. After the common substring, each value has a unique 17-char ID of the pattern 8 digits, a dash and 8 more digits. This ID is also duplicated in the “alias” field. Such redundancy.
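IDs with a pattern like that are easy to sanity-check with grep’s extended regexes. A minimal sketch, with sample values invented for this demo:

```shell
# Keep only values matching 8 digits, a dash, 8 more digits
printf '12345678-87654321\nnot-an-id\n' | grep -E '^[0-9]{8}-[0-9]{8}$'
```

Only the first sample line survives; anything that doesn’t look like an alias is filtered out.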
It turns out that the uri field, when appended to the URL of the campaign site, leads to an HTML page that displays the poster I want to download. However, the image is rescaled on this page, and I want the full-resolution posters. Luckily, there is a URL straight to the poster in the HTML, and the original unscaled posters are stored on the server with the same name, but with .orig appended as an extension.
I strike up a plan:
1) Use cut or AWK to extract aliases into a separate file
2) Use wget to fetch an HTML page that contains the link I need, for each alias.
3) Grep for the link
4) Download the actual poster
5) Use grep and awk to extract metadata related to the poster, save it in a similarly named file
So far so good!
Since I have a csv file with the aliases, step 1 is easy.
The field I need is field 5 and the file is comma separated. -d, means “this is comma separated” and -fX means give me field(s) X.
cut -d, -f5 stupid_csv_file.csv > poster_aliases
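A toy run of that step, on an invented row where field 5 holds the alias (the other fields are placeholders):

```shell
# Fabricated sample row; only field 5 matters here
printf 'name,email,country,uri,12345678-12345678,age\n' > toy_rows.csv
cut -d, -f5 toy_rows.csv > poster_aliases
cat poster_aliases
```

One alias per input line lands in poster_aliases, ready to be looped over.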
Now for step 2 and 3. This takes a bit more code, but bash is up to the task.
while read p; do
# The actual URL is hidden, so we will dig it out by fetching the HTML
# for the poster-container page and extract the direct link for download
URL=$(wget -q -O - "https://www.stupidpostercontest.org/en/poster-container/$p" | grep "<img class=\"loaded" | cut -d\" -f4)
The code is pretty self-explanatory, if you know what a pipe does and how command substitution, $( ), works in bash.
I read one alias at a time into $p, from the file of aliases I created in step 1.
I append the alias to a partial URL and download the corresponding HTML page. I pipe that page directly to grep and search for a specific tag on the page. The matching line is then piped to cut, where the actual URL of the poster is extracted.
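To make the grep-and-cut step concrete, here it is run on a single fabricated line of HTML shaped like the poster page (the real page’s markup may well differ):

```shell
# cut -d\" -f4 splits on double quotes; the src URL sits in the fourth field
line='<img class="loaded" src="https://www.stupidpostercontest.org/files/poster.jpg">'
echo "$line" | grep '<img class="loaded' | cut -d\" -f4
```

Counting quote-separated fields: field 1 is the tag opening, field 2 the class value, field 3 the src= bit, and field 4 the URL itself.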
I stumbled upon a few minor hassles in this step. The website allows for both jpegs and pngs and enforces no naming scheme. You can have .jpg, .jpeg, and even .JpEg if you want to. I want all the posters named in a regular fashion, as either .jpg or .png, which happens in this case statement. I use shell expansion to get the original extension and update it to what I want the extension to be. If you do not know what shell expansions are, look it up. You might need it one day, when downloading 2000 posters unprepared.
# Get the file extension, so we can name the file correctly
EXT="${URL##*.}"
case "$EXT" in
    [Jj][Pp]*) EXT="jpg" ;; *) EXT="png" ;;
esac
Now for step 4, actually downloading the poster. We finally have the URL and the proper extension for the file. wget solves it all!
The -q flag makes wget shut up.
The -O flag designates an output file for wget. I want the poster to be named alias.file-extension. I have the alias stored in the variable $p, so that’s easy.
wget -q -O "$p.$EXT" "$URL.orig"
Phew. Now I only need to create those metadata files. It seems to be time for more sophisticated tools than cut and grep. I pull out the almighty spreadsheet killer, AWK (the actual GNU AWK, not the craptastic version Apple supplies and calls AWK).
I made a quick script, specifying a field separator that can deal with double-quote-enclosed strings (grrr!), and simply printing the fields I want in the correct order. Now I just need some data I can input, and AWK should do the rest. This was done so fast that I even forgot to rename the script. My first test worked, so I just kept the name test.awk.
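I never showed test.awk, so here is a guess at its shape: use the literal string "," as the field separator so commas inside quoted fields survive, strip the stray outer quotes, and print fields by position as labeled lines. The field numbers and labels below are invented for this sketch:

```shell
# Reconstruct a plausible test.awk (layout and labels are assumptions)
cat > test.awk <<'EOF'
BEGIN { FS = "\",\"" }
{
    gsub(/^"|"$/, "")   # drop the leading/trailing quote; awk re-splits $0
    print "Name: " $2
    print "Email: " $3
}
EOF
echo '"12345678-12345678","Jane Doe","jane@example.org"' | awk -f test.awk
```

Printing labeled "Email: …" lines is consistent with the grep-for-Email step further down.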
Since I still have the alias stored in a variable, and the alias is present (twice!) in the csv file, I can just grep for that. While I’m at it, my good friend asks for all the emails of poster submitters. Might as well fetch that immediately.
echo $(grep "$p" ../files/fixed.csv) | awk -f ../test.awk - > "$p.txt"
grep Email "$p.txt" | cut -d" " -f2 >> tmpListOfEmails.txt
done < poster_aliases
I am done. Finally. I run the script, get 2000+ posters and their corresponding metadata saved into one directory. I zip it and stoically walk over to my colleague’s desk.
Me: “Hey, I’m done. How do you want your zip?”
I spend a minute explaining file sizes and how this won’t fit in an email.
We agree on delivery and my colleague thanks me for my effort.
Colleague: “Oh by the way, Germany would like all of their posters separately. Well, so do all the countries. Can you do that as well?”
Me: “Sure, it won’t take that long”.
Colleague: “It’d also be nice if you could make a pdf, per country, containing the posters. Oh, and could you put the name and age of the poster’s creator in the pdf as well? Not as text on a page, but in the actual poster?”
Me, muttering curse words in multiple languages: “Sure.”
Part 2 coming at some point. Probably.