
Extract titles and meta descriptions using a shell script

Wed, Aug 28, 2019

Motivation

I had to extract all titles and meta descriptions from a small website so they could be optimized. The HTML wasn't consistent, so I decided to write a small script that gets the job done using wget, curl, xmllint and XPath expressions.

The shell script

#!/bin/bash
if [ -n "$1" ]; then
  rm -f urls.txt metadata.tsv
  # Mirror the site and collect every fetched URL from wget's log output
  wget -m "$1" 2>&1 | grep '^--' | awk '{ print $3 }' > urls.txt
  echo -e "URL\tTitle\tMeta-Description" > metadata.tsv
  while read -r url; do
    curl -s "$url" > tmp_file
    # Pull the <title> text and the description <meta> tag via XPath
    title=$(xmllint --html --xpath '/html/head/title/text()' tmp_file 2>/dev/null)
    metadescription=$(xmllint --html --xpath 'string(/html/head/meta[@name="description"]/@content)' tmp_file 2>/dev/null)
    echo -e "$url\t$title\t$metadescription" >> metadata.tsv
  done < urls.txt
  rm -f urls.txt tmp_file
else
  echo "Usage: ./extract.sh <URL>"
fi

Usage

To use the script, make sure wget, curl and xmllint are installed, make the script executable and run ./extract.sh https://example.com/. The results end up in metadata.tsv.
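
For example, assuming the script is saved as extract.sh, a run could look like this (column is just one way to pretty-print the tab-separated result):

chmod +x extract.sh
./extract.sh https://example.com/
column -t -s $'\t' metadata.tsv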

Notes

The script mirrors the website using wget to generate a list of all URLs. In a second step it fetches every URL again with curl and extracts the titles and meta descriptions. This is not ideal, but I was too lazy to build a solution that only makes one request per URL. Feel free to send me such a solution and I will update this post.
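
Since wget has already downloaded every page, one possible single-request variant is to run xmllint over the mirrored files instead of re-fetching them with curl. A minimal sketch, assuming wget's -E (--adjust-extension) flag so every saved page ends in .html, and reconstructing approximate URLs from the file paths:

#!/bin/bash
# Sketch of a single-pass variant: extract metadata from the files that
# wget has already mirrored, instead of fetching each URL a second time.
if [ -n "$1" ]; then
  # -E (--adjust-extension) makes wget save every HTML page with an
  # .html suffix; the mirror lands in a directory named after the host.
  wget -m -E "$1"
  host=$(echo "$1" | awk -F/ '{ print $3 }')
  echo -e "URL\tTitle\tMeta-Description" > metadata.tsv
  find "$host" -name '*.html' | while read -r file; do
    title=$(xmllint --html --xpath '/html/head/title/text()' "$file" 2>/dev/null)
    desc=$(xmllint --html --xpath 'string(/html/head/meta[@name="description"]/@content)' "$file" 2>/dev/null)
    # Rebuild an approximate URL from the file path (the .html suffix
    # added by -E may not match the original URL exactly).
    echo -e "https://${file}\t$title\t$desc" >> metadata.tsv
  done
else
  echo "Usage: ./extract.sh <URL>"
fi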