When you're working with HTML day in and day out, the same types of tasks come up time and time again. Whether your focus is client or server side, the end goal is the same: quality content and modern interfaces for your users. And whether you're generating that content with PHP or jQuery, it's not an easy task. You Need a Helping Hand!
What Is the Solution
When that's combined with the ever-increasing complexity of modern applications, doing this by hand is slow, error-prone and, let's admit it, painful. So in today's post, for your time- and sanity-saving pleasure, I present to you 6 XPath queries that will save you time, stress and effort.
We're going to look at a set of use cases that come up again and again, and use the excellent XPath Checker plugin for Firefox to make it quick and simple. For the site, we need one that's reasonably content rich: loaded with information, links, videos, images; you know, the full gamut of what we work with daily.
So the site we're using is tagesschau.de. Now if you don't speak a word of German, other than Bier, Bratwurst and Lederhosen, that's perfectly fine - you don't need to. I've chosen the site because it fulfils all of the criteria above (plus, I'm a student of German).
Let's Get Going
So what are we going to be finding today? Well, here's what we'll be covering: 6 use cases for the kind of searches that you'd perform on a page like Tagesschau.
- Get All the Page Links
- Retrieve only the Nachrichten (News) and Sport Content Blocks
- Get the Social Media Links
- Get the Keywords Metadata Content
- Get the Even Video Links
- Retrieve only the Header and Footer
So, let's crack on!
Get All the Page Links
I thought we'd start simply with this one. Actually, it's a little too easy, almost trivial even. Our first XPath query will be simply:
//a
What this says is: look throughout the whole document, identified by '//', then retrieve any anchor element found, identified by 'a'. The output will appear as below:
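If you'd like to try the same idea outside the browser, here's a minimal sketch in Python. It assumes the standard library's xml.etree.ElementTree module, which only understands a subset of XPath, and a small made-up snippet of well-formed markup rather than the real Tagesschau page. The sample HTML and the variable names are mine, not from the site.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup; not taken from tagesschau.de.
SAMPLE = """
<html>
  <body>
    <div><a href="/sport">Sport</a></div>
    <p>Mehr dazu: <a href="/wetter">Wetter</a></p>
    <span>no link here</span>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE)

# ElementTree wants a leading "." so the search starts at the root element;
# ".//a" then matches every <a> at any depth, much like //a in the browser.
links = root.findall(".//a")
hrefs = [a.get("href") for a in links]
print(hrefs)
```

Real-world pages are rarely well-formed XML, so treat this as a way to experiment with the query rather than a scraper.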
Retrieve only the Nachrichten (News) and Sport Content Blocks
Ok, now here's one with a bit more meat on the proverbial bones. You see in the screenshot below that the page is composed of three columns, with the middle one being the core content. In the middle column, there are a series of content blocks, with the first two being Nachrichten and Sport.
Well, we're going to extract the first two and leave the rest out. To do this, we're going to use the XPath union operator, the pipe (|). Much like an OR in SQL or PHP, it lets us retrieve content that matches any one of a series of expressions: each expression is evaluated on its own and the results are combined. So, what is the XPath query?
Feast your eyes on:
/html//div[@class="ardTeaserNormal sport"] | /html//div[@class="ardTeaserAufmacher tagesschau"]
NB: Formatted for readability.
I've made this one a bit more verbose than it could be; it could be simplified further. It says: anywhere within the html content, identified by '/html//', find divs that have their class attribute set to either 'ardTeaserNormal sport' or 'ardTeaserAufmacher tagesschau'.
The output of this query is as below
This goes a bit further than before by inspecting the attributes of an element, which we do with the [@…] syntax. We could look for more elements by joining on more expressions with the union operator (the pipe). What else do you think you could find?
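To see the attribute predicate at work in code, here's a rough Python sketch using the standard library's xml.etree.ElementTree. That module doesn't support the pipe, so I've swapped it for two separate queries whose results are joined in Python. The sample markup is invented for the example; only the two class values come from the query above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup; only the class values come from the article.
SAMPLE = """
<html><body>
  <div class="ardTeaserAufmacher tagesschau"><h2>Nachrichten</h2></div>
  <div class="ardTeaserNormal sport"><h2>Sport</h2></div>
  <div class="ardTeaserNormal wetter"><h2>Wetter</h2></div>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# [@class='...'] compares the whole attribute string, so the order of the
# class names and the single space between them must match exactly.
sport = root.findall(".//div[@class='ardTeaserNormal sport']")
news = root.findall(".//div[@class='ardTeaserAufmacher tagesschau']")

# Stand-in for the XPath union: join the two result lists.
blocks = sport + news
titles = [block.findtext("h2") for block in blocks]
print(titles)
```

One difference to be aware of: a real XPath union hands back nodes in document order, whereas the joined list here keeps the order in which the two queries were run.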
Get the Social Media Links
Let's keep stepping up the difficulty of the queries. At the bottom of the right-hand column, you'll see the social media links, including RSS, YouTube, Tagesschau, Sportschau and Facebook. We're going to compose a query that retrieves just that block of content.
The block is contained in a div with the classes 'ardReSpBlock service'. The first div within that has the classes 'ReSpUeber ard4col' and the content text 'Mobil | RSS | Social Media'. We're going to compose a query that checks for:
- The parent div by its classes
- The first inner, or title, div by its classes
- That the inner div contains the words Social Media
Let's see what it looks like.
/html//div[@class="ardReSpBlock service"]/div[@class="ReSpUeber ard4col" and contains(text(),"Social Media")]/parent::node()
NB: Formatted for readability.
In this XPath query, we've built on the last query, where we looked for all divs in the HTML body which had a specific set of class names. This time, we've gone one step further, by then looking for a child div with a set of class names. Still with me? Good.
Then, we've stepped up a bit further by using contains() and text(). contains() asks whether the first parameter contains the text in the second parameter. text() simply retrieves the text from the node identified.
Putting the two together here, I can check that the text in the div contains the words Social Media, which it does. The last part introduces what's called XPath Axes. What are they? Well, said most simply over on XMLPlease, XPath Axes are:
An XPath axis is a path through the node tree making use of particular relationship between nodes.
We can use these to find related elements, such as child, parent, descendant, preceding, self and so on. This is really handy for us here. Without it, we'd only be retrieving the title div, when what we want is the div that it lies within.
So by using parent::node(), we step back up a level and grab the content at that point. This gives us our social media node below:
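Here's how the same selection might look in Python with the standard library's xml.etree.ElementTree. That module has no contains() function, so in this sketch the text check is done with a plain Python substring test, and instead of stepping up with parent::node() the outer divs are filtered directly, which ends up at the same node. The sample markup, the id attributes and the helper name has_social_title are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup; the id attributes exist only to label results.
SAMPLE = """
<html><body>
  <div class="ardReSpBlock service" id="social">
    <div class="ReSpUeber ard4col">Mobil | RSS | Social Media</div>
    <ul><li><a href="/rss">RSS</a></li></ul>
  </div>
  <div class="ardReSpBlock service" id="other">
    <div class="ReSpUeber ard4col">Wetter</div>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE)


def has_social_title(block):
    """Python stand-in for contains(text(), "Social Media") on the title div."""
    titles = block.findall("div[@class='ReSpUeber ard4col']")
    return any("Social Media" in (title.text or "") for title in titles)


# Select the outer blocks by class, then keep those whose title div matches.
candidates = root.findall(".//div[@class='ardReSpBlock service']")
social = [block for block in candidates if has_social_title(block)]
ids = [block.get("id") for block in social]
print(ids)
```

Both outer divs share the same classes, so it's the text check on the title div that singles out the social media block.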
Get the Keywords Metadata Content
Let's ease back on the throttle a wee bit here and look at something simpler, the site metadata. Specifically, let's retrieve the stock standard (if a bit outdated) SEO-related keywords from the page.
By doing this, we look at another aspect of XPath. We're able to extract the text, not of an element, but of an element's attribute. Look at the following query and I'll go through it.
This time, we've filtered out everything but meta tags. We've then limited our search to the keywords meta tag. Now, we want to see the actual keywords, which aren't in the element's text but in one of its attributes, the content attribute.
So we can't use the text function as before. This time, we append /@content, which will retrieve and display the contents of that attribute, if available. It is, and it contains the following text:
ARD.de, Fernsehen, Radio, Das Erste, Tagesschau, Sportschau, Nachrichten, Sport, Börse
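As a rough illustration, a query along the lines of //meta[@name="keywords"]/@content would fit the description above, though that exact form is my assumption. The Python sketch below does the equivalent with the standard library's xml.etree.ElementTree, which can't select attributes in a path, so the attribute is read with get() once the element is found. The sample markup is invented and shortened.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample markup; name="keywords" is assumed.
SAMPLE = """
<html>
  <head>
    <meta name="description" content="Nachrichten der ARD" />
    <meta name="keywords" content="ARD.de, Fernsehen, Radio, Tagesschau, Sport" />
  </head>
  <body />
</html>
"""

root = ET.fromstring(SAMPLE)

# Narrow the search to the keywords meta tag, then read its content
# attribute, which is what /@content would return in a full XPath engine.
meta = root.find(".//meta[@name='keywords']")
content = meta.get("content") if meta is not None else ""

# Split the comma-separated string into a clean list of keywords.
keywords = [word.strip() for word in content.split(",") if word.strip()]
print(keywords)
```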
Not too bad, wouldn't you say? I'm sure your mind's already racing with the possibilities that XPath offers.
Get the Even Video Links
Right, we're down to the final two. Navigating a bit through the site takes us to a page with loads of links under the Sportschau section.
In the right-hand column, you'll see 5 video links. Let's say, for random's sake, we're only interested in the even links. How do we filter out the odd ones? Have a look at the XPath below, then we'll step through it together.
//div[@class="ardReSpBlock"]//ul/li[position() mod 2 = 0]
Firstly, we grab the list items in the div with the class name "ardReSpBlock". Then we take advantage of the fact that each element has an index. In this list, we start at 1 and end at 5. Now, without some form of voodoo, this might be really difficult; enter a function and an operator: position() and mod.
Using the position function, we are able to find the position value of the element. Then we perform a bit of arithmetic on that value and we're able to filter out the odd elements. I don't know about you, but my math is rusty. However, by using: position() mod 2 = 0 we get the result that we need.
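If the arithmetic is easier to follow in code, here's a small Python sketch of the same filter. The standard library's xml.etree.ElementTree doesn't support position() or mod inside a predicate, so the sketch fetches the list items and applies the 1-based position test in Python. The sample list and its links are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup with five video links, as in the article.
SAMPLE = """
<html><body>
  <div class="ardReSpBlock">
    <ul>
      <li><a href="/video/1">Video 1</a></li>
      <li><a href="/video/2">Video 2</a></li>
      <li><a href="/video/3">Video 3</a></li>
      <li><a href="/video/4">Video 4</a></li>
      <li><a href="/video/5">Video 5</a></li>
    </ul>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE)

even_items = []
for ul in root.findall(".//div[@class='ardReSpBlock']//ul"):
    # position() counts from 1 within each list, so start enumerate at 1.
    for position, li in enumerate(ul.findall("li"), start=1):
        # Same test as position() mod 2 = 0: no remainder means even.
        if position % 2 == 0:
            even_items.append(li)

hrefs = [li.find("a").get("href") for li in even_items]
print(hrefs)
```

Positions 1, 3 and 5 leave a remainder of 1 when divided by 2, so only the second and fourth items survive.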
How seriously simple is that? At this stage, you're seeing the benefits of XPath, I'm sure!
Retrieve only the Header and Footer
Now for the final query. Returning to the home page, we see that the page has a nice banner in the header and a nice set of text links in the footer. Let's write a simple query that retrieves just that, the header and footer.
In the source, we see that the header div has the class "impretc" and the footer has the id "footer". As you can likely guess by now, the query will be:
We've simply extracted the divs with the respective class and id.
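Going by the description, a query such as //div[@class="impretc"] | //div[@id="footer"] would do the job, though that exact form is my assumption. Here's a hedged Python sketch of the same selection using the standard library's xml.etree.ElementTree, again with two separate queries joined in Python because that module doesn't support the pipe. The sample markup is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup; only "impretc" and "footer" come from the article.
SAMPLE = """
<html><body>
  <div class="impretc"><img src="/banner.png" alt="Banner" /></div>
  <div class="content"><p>Inhalt</p></div>
  <div id="footer"><a href="/impressum">Impressum</a></div>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# One predicate tests the class attribute, the other tests the id attribute.
header = root.findall(".//div[@class='impretc']")
footer = root.findall(".//div[@id='footer']")

# Join the two result lists in place of the XPath union.
page_frame = header + footer
labels = [div.get("class") or div.get("id") for div in page_frame]
print(labels)
```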
Bonus: Handy Tips To Save You Time
Now, there are harder and simpler ways to write XPath queries. The hard way is to look at the source and read down to the element that you want. The simple way is to let the browser help: in Google Chrome, right click and choose "Inspect Element" to open the dev tools; in Firefox, right click and choose "View XPath".
By doing so, you can quickly see sample XPaths and the element in the HTML document that you want to work with, in or around. One thing to bear in mind with the XPath Checker plugin for Firefox: when you click "View XPath", it can often give you a really long query. This is fine initially.
But you don't want to write such long-winded ones forever. The main reasons are efficiency and simplicity. If you're new to XPath, then don't rush to make your queries shorter. Just know that you should aim for them to be as simple and concise as possible.
So there you have it. We've covered 6 great XPath queries that you can use, adapt, experiment with and change to help you do less, but achieve more, in your daily coding life. Now it's over to you. Do you believe XPath gives you the skills you need to be more productive? Share your thoughts with me in the comments below.
- XPath Tutorial (zvon.org)
- XPath Spec (w3c)
- XML in a Nutshell by O'Reilly Press
- XPath Wikipedia reference
- XPath on MDN