PHP XPATH within changing XML structure

I have an XML file like this:
<tr class="station">
<td class="realtime">
<span>
15:11
</span>
</td>
</tr>
<tr class="station">
<td class="clock">
15:20
</td>
</tr>
<tr class="station">
<td class="clock">
15:30
</td>
</tr>
<tr class="station">
<td class="realtime">
<span>
15:41
</span>
</td>
</tr>
and I want to parse it with XPath in PHP. The XML is updated and parsed quite often.
I always want to get the first time (in this case 15:11).
The problem is that I can't be sure whether the surrounding tag is a td with class "clock" or "realtime".
If the surrounding td has class "realtime", there is a span tag inside it; otherwise there is not.
In fact, the information that matters is always in the first tag with class "station".
So is it possible to tell XPath to evaluate only within this tag?
Is there a good method for doing this in XPath?
(Sorry for my bad English.)

"In fact, it's always the first "station"-class tag in which the information is, that matters. So is it possible to tell xpath to just evaluate within this tag?"
With this well-formed input source:
<table>
<tr class="station">
<td class="realtime">
<span>
15:41
</span>
</td>
</tr>
<tr class="station">
<td class="clock">
15:20
</td>
</tr>
<tr class="station">
<td class="clock">
15:30
</td>
</tr>
<tr class="station">
<td class="realtime">
<span>
15:41
</span>
</td>
</tr>
</table>
This XPath expression:
/table/tr[@class='station'][1]/td
Note: Just select the element you want and use the proper DOM API method to get the string value. It doesn't matter whether there is a span element or not.
If you want to...
/table/tr[@class='station'][1]/td//text()
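As a rough illustration of the note above (select the td and read its string value), here is a minimal Python sketch using only the standard library's xml.etree.ElementTree. The function name first_station_time and the two sample strings are made up for this example. ElementTree supports only a subset of XPath, so find(), which returns the first match in document order, stands in for the [1] predicate here.

```python
import xml.etree.ElementTree as ET

# Hypothetical samples: the first station row with and without a span.
WITH_SPAN = """<table>
<tr class="station"><td class="realtime"><span>
15:11
</span></td></tr>
<tr class="station"><td class="clock">
15:20
</td></tr>
</table>"""

WITHOUT_SPAN = """<table>
<tr class="station"><td class="clock">
15:20
</td></tr>
<tr class="station"><td class="realtime"><span>
15:41
</span></td></tr>
</table>"""


def first_station_time(xml_text):
    """Return the text of the td in the first tr with class 'station'."""
    root = ET.fromstring(xml_text)
    # find() returns the first matching element, or None if there is none.
    td = root.find("tr[@class='station']/td")
    if td is None:
        return None
    # itertext() walks all descendant text, so a nested span needs no special case.
    return "".join(td.itertext()).strip()


print(first_station_time(WITH_SPAN))     # 15:11
print(first_station_time(WITHOUT_SPAN))  # 15:20
```

The same idea carries over to PHP's DOM API: select the td, then read its text content instead of addressing the span explicitly.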

Related

How to get strings with Regex

I want to get the strings between the td tags, but one of the td elements has no closing tag. How can I get the text from that tag along with the other strings?
<tr>
<td class="exclass">Text 0
<td class="exclass">Text 1</td>
<td class="exclass">Text 2</td>
<td class="exclass3" >Text</td >
<td class="exclass"> Text </td>
<td class="exclass3">Text</td>
<td class="exclass">Text</td>
<td class="exclass">Text</td><td class="exclass">Text</td>
<td class="exclass2">Text</td>
<td class="exclass">Text</td>
<td class="exclass" width="20"><img src="exampleSrc"></td>
</tr>
As you can see in the code above, I want to get "Text 0" and the other strings with PHP.
So far, I have tried:
<td.+?>([\w\W]*?)<\/td.+?|<td

I assume you can't use a DOM parser because one of the td elements has no closing tag.
Here is my regex solution:
(?<=>)([\s\w\n]+)(?=<)
https://regex101.com/r/BRaJAu/1
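To see what that pattern actually captures, here is a small Python sketch (the question is about PHP, but the lookaround syntax used here behaves the same way in Python's re module). The helper name cell_texts and the shortened sample row are assumptions for illustration. Note that the character class only allows word characters and whitespace, so cell text containing punctuation would not be matched, and whitespace-only gaps between tags have to be filtered out.

```python
import re

# The pattern from the answer: text preceded by '>' and followed by '<'.
PATTERN = re.compile(r"(?<=>)([\s\w\n]+)(?=<)")

# Shortened, hypothetical version of the row in the question;
# the first td has no closing tag.
ROW = """<tr>
<td class="exclass">Text 0
<td class="exclass">Text 1</td>
<td class="exclass3" >Text</td >
<td class="exclass"> Text </td>
</tr>"""


def cell_texts(html):
    """Return the non-empty text runs found between tags."""
    # Line breaks between tags also match, so drop whitespace-only hits.
    return [m.strip() for m in PATTERN.findall(html) if m.strip()]


print(cell_texts(ROW))  # ['Text 0', 'Text 1', 'Text', 'Text']
```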

Extracting table cell text contents with xpath in rows for consumption?

I have something along the following lines in terms of HTML. I would like to extract the various contents of the table cells, however I discovered that there are some embedded divs occasionally in the cells and perhaps other oddities that I'm not sure of yet:
<p align="center">
<img src="some_image.gif" alt="Some Title">
</p>
<TABLE WIDTH=500 BORDER=1 class=textwhite ALIGN=center CELLPADDING=0 CELLSPACING=0>
<TR>
<TD colspan=4 ALIGN=center><b>Title</b></TD>
</TR>
<TR>
<TD ALIGN=center>Title</TD>
<TD ALIGN=center>date</TD>
<TD ALIGN=center>value</TD>
<TD ALIGN=center>value</TD>
</TR><TR>
<TD ALIGN=center>Title2</TD>
<TD ALIGN=center></TD>
<TD ALIGN=center><div class=redtext>----</div></TD>
<TD> </TD>
</TR><TR>
<TD ALIGN=center>Title3</TD>
<TD ALIGN=center><div class=yellowtext>value</div></TD>
<TD ALIGN=center><div class=redtext>value</div></TD>
<TD ALIGN=center>value<SUP>6</SUP></TD>
</TR><TR>
<TD ALIGN=center>Title4</TD>
<TD ALIGN=center><div class=bluetext>value</div></TD>
<TD ALIGN=center><div class=redtext>value</div></TD>
<TD> </TD>
</TR></TABLE>
<blockquote>
<p class="textstyle">
Text.
</p>
</blockquote>
My first impulse was to extract ALL element texts and just programmatically slice it up. I would watch for Title1, Title2, etc. to know when a row starts and then if a "----" is found meaning no value, just skip this row and move on. However, I realized that there is probably a better way of handling this with xpath directly.
How could this be solved with xpath so as to essentially give each cell's final child text content vs having to walk into each div if it exists? Or is there a more xpath like way to approach this?
Obviously I'm attempting to have the most flexible solution that will not be brittle if other unexpected elements crop up, even though they are unlikely.

The provided text isn't a well-formed XML document, so XPath can't be applied to it directly.
If you correct it and convert it to a well-formed XML document such as the one below, an expression like this might be useful:
/*/TABLE//TD//text()
or even:
//TABLE//TD//text()
Here is a well-formed XML document, constructed from the provided HTML:
<html>
<p align="center">
<img src="some_image.gif" alt="Some Title"/>
</p>
<TABLE WIDTH="500" BORDER="1" class="textwhite" ALIGN="center" CELLPADDING="0" CELLSPACING="0">
<TR>
<TD colspan="4" ALIGN="center">
<b>Title</b>
</TD>
</TR>
<TR>
<TD ALIGN="center">Title</TD>
<TD ALIGN="center">date</TD>
<TD ALIGN="center">value</TD>
<TD ALIGN="center">value</TD>
</TR>
<TR>
<TD ALIGN="center">Title2</TD>
<TD ALIGN="center"></TD>
<TD ALIGN="center">
<div class="redtext">----</div>
</TD>
<TD> </TD>
</TR>
<TR>
<TD ALIGN="center">Title3</TD>
<TD ALIGN="center">
<div class="yellowtext">value</div>
</TD>
<TD ALIGN="center">
<div class="redtext">value</div>
</TD>
<TD ALIGN="center">value
<SUP>6</SUP>
</TD>
</TR>
<TR>
<TD ALIGN="center">Title4</TD>
<TD ALIGN="center">
<div class="bluetext">value</div>
</TD>
<TD ALIGN="center">
<div class="redtext">value</div>
</TD>
<TD> </TD>
</TR>
</TABLE>
<blockquote>
<p class="textstyle"> Text. </p>
</blockquote>
</html>
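A minimal sketch of how such a well-formed document could be consumed row by row, assuming Python's standard xml.etree.ElementTree (which understands only a subset of XPath). The function name table_rows and the trimmed sample are made up for this example. Joining itertext() for each TD gives the cell's full text whether or not a div or SUP is nested inside it.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical excerpt of the well-formed document above.
SAMPLE = """<html>
<TABLE>
<TR>
<TD ALIGN="center">Title2</TD>
<TD ALIGN="center"></TD>
<TD ALIGN="center"><div class="redtext">----</div></TD>
<TD> </TD>
</TR>
<TR>
<TD ALIGN="center">Title3</TD>
<TD ALIGN="center"><div class="yellowtext">value</div></TD>
<TD ALIGN="center"><div class="redtext">value</div></TD>
<TD ALIGN="center">value<SUP>6</SUP></TD>
</TR>
</TABLE>
</html>"""


def table_rows(xml_text):
    """Return one list of cell strings per TR inside any TABLE."""
    root = ET.fromstring(xml_text)
    rows = []
    for tr in root.iterfind(".//TABLE//TR"):
        # itertext() includes text of nested elements such as div or SUP.
        rows.append(["".join(td.itertext()).strip() for td in tr.iterfind("TD")])
    return rows


for row in table_rows(SAMPLE):
    print(row)
# ['Title2', '', '----', '']
# ['Title3', 'value', 'value', 'value6']
```

Skipping rows that contain "----" then becomes an ordinary list filter rather than something the XPath expression has to encode.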

So maybe you don't want to walk the divs, but here is my solution using lxml, which I highly recommend:
import re
try:
    from cStringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3
from lxml import etree
def getTable(html, table_xpath, rows_xpath, cells_xpath):
    """Get a table on a webpage"""
    parser = etree.HTMLParser()
    # Build document tree and get table
    root = etree.parse(StringIO(html), parser)
    table = root.find(table_xpath)
    if table is None:
        print('No table.')
        return []
    rows = table.findall(rows_xpath)
    document = []
    def cleanText(text):
        """Clean up text by removing line breaks and tabs."""
        return re.sub(r'[\r\n\t]+', '', str(text).strip())
    # Iterate over the table rows and collect the text from each cell.
    for r in rows:
        cells = r.findall(cells_xpath)
        rowdata = []
        for c in cells:
            text = ''
            # itertext() yields the text of the cell and of any nested elements
            it = c.itertext()
            for i in it:
                text += cleanText(i) + ' '
            rowdata.append(text)
        document.append(rowdata)
    return document
html = """
<html><head><title></title></head><body>
<p align="center">
<img src="some_image.gif" alt="Some Title">
</p>
<TABLE WIDTH=500 BORDER=1 class=textwhite ALIGN=center CELLPADDING=0 CELLSPACING=0>
<TR>
<TD colspan=4 ALIGN=center><b>Title</b></TD>
</TR>
<TR>
<TD ALIGN=center>Title</TD>
<TD ALIGN=center>date</TD>
<TD ALIGN=center>value</TD>
<TD ALIGN=center>value</TD>
</TR><TR>
<TD ALIGN=center>Title2</TD>
<TD ALIGN=center></TD>
<TD ALIGN=center><div class=redtext>----</div></TD>
<TD> </TD>
</TR><TR>
<TD ALIGN=center>Title3</TD>
<TD ALIGN=center><div class=yellowtext>value</div></TD>
<TD ALIGN=center><div class=redtext>value</div></TD>
<TD ALIGN=center>value<SUP>6</SUP></TD>
</TR><TR>
<TD ALIGN=center>Title4</TD>
<TD ALIGN=center><div class=bluetext>value</div></TD>
<TD ALIGN=center><div class=redtext>value</div></TD>
<TD> </TD>
</TR></TABLE>
</body>
</html>
"""
tp = "//table[@width='500']"
rt = "tr"
cp = "td[@align='center']"
doc = getTable(html, tp, rt, cp)
print(repr(doc))

I believe that your program is going to run into many problems as the input data is manipulated -- what if the case of 'title' changes, or there is a typo?
It's not really possible to make a rigorous solution to scraping someone else's website, as they can at no notice completely change everything. Better is normally to write tolerant and flexible code that at least tries to verify that its output is sane. In this case it's probably best to iterate over the results of '//table/tr', then inside this loop, process the td elements:
import lxml.etree
tree = lxml.etree.fromstring("<table><tr><td>test</td></tr><tr><td><div>test2</div></td></tr></table>")
stringify = lambda x : "".join(x.xpath(".//text()"))
for x in tree.xpath("//table/tr"):
    print("New row")
    for y in x.xpath("td"):
        print(stringify(y))
Output:
New row
test
New row
test2
The following code will, however, get the list you ask for:
print(list(map(stringify, tree.xpath("//table/tr/td"))))
Output:
['test', 'test2']
This will find all text nodes that are descended from a td which is a direct child of a tr which is in turn a direct child of a table.
(Simply asking for all text() elements will create some funny bugs when run on HTML which contains "<td>Foo <b>bar</b></td>" or similar.)

XPath select descendent of parents sibling with in limits

My xpath:
(//tr[td[contains(., 'Refine by Vehicle Types')]])[1] /following-sibling::tr /td/div/table /tr/td/font /ul/li/a
My source:
<tr><td><font color="White">Refine by Vehicle Types</font></td> </tr>
<tr><td><div>
<table> <tr> <td><font<ul><li><a> Automobile/Light Trucks</a></li></ul></font></td> </tr> </table>
</div></td> </tr>
<tr> <td></td> </tr>
<tr> <td><font>Refine by Category</font></td> </tr>
<tr> <td><div>
<table> <tr> <td><font><ul><li><a>Agricultural</a></li></ul></font></td></tr>
I'm trying to scrape this source and collect the <li> nodes after "Refine by Vehicle Types" but not after "Refine by Category".
Any help is appreciated.

You are almost there.
Change:
(//tr
[td[contains(., 'Refine by Vehicle Types')]]
)
[1]
/following-sibling::tr
/td/div/table
/tr/td/font
/ul/li/a
to:
(//tr
[td[contains(., 'Refine by Vehicle Types')]]
)
[1]
/following-sibling::tr[1]
/td/div/table
/tr/td/font
/ul/li/a
When the second XPath expression is evaluated against the following XML document (your severely malformed text corrected to become a well-formed XML document):
<table>
<tr>
<td>
<font color="White">Refine by Vehicle Types</font>
</td>
</tr>
<tr>
<td>
<div>
<table>
<tr>
<td>
<font>
<ul>
<li>
<a> Automobile/Light Trucks</a>
</li>
</ul>
</font>
</td>
</tr>
</table>
</div>
</td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td>
<font>Refine by Category</font>
</td>
</tr>
<tr>
<td>
<div>
<table>
<tr>
<td>
<font>
<ul>
<li><a>Agricultural</a></li>
</ul>
</font>
</td>
</tr>
</table>
</div>
</td>
</tr>
</table>
Only one -- the wanted -- a element is selected:
<a> Automobile/Light Trucks</a>
Note: Did I mention that an XPath Visualizer will help you a lot?
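For comparison, here is a rough Python sketch of the same "first following sibling" idea using only the standard library. xml.etree.ElementTree has no following-sibling axis, so this sketch emulates following-sibling::tr[1] by taking the next entry in the list of sibling rows. The names links_after_label and LINK_PATH are assumptions for this example, and unlike //tr it only looks at the top-level rows.

```python
import xml.etree.ElementTree as ET

# Compact copy of the corrected document above.
DOC = """<table>
<tr><td><font color="White">Refine by Vehicle Types</font></td></tr>
<tr><td><div><table><tr><td><font><ul>
<li><a> Automobile/Light Trucks</a></li>
</ul></font></td></tr></table></div></td></tr>
<tr><td></td></tr>
<tr><td><font>Refine by Category</font></td></tr>
<tr><td><div><table><tr><td><font><ul>
<li><a>Agricultural</a></li>
</ul></font></td></tr></table></div></td></tr>
</table>"""

# Same element path as the tail of the XPath expression.
LINK_PATH = "td/div/table/tr/td/font/ul/li/a"


def links_after_label(xml_text, label):
    """Return link texts from the row right after the row containing label."""
    root = ET.fromstring(xml_text)
    rows = root.findall("tr")  # top-level rows are siblings of each other
    for index, tr in enumerate(rows[:-1]):
        if any(label in "".join(td.itertext()) for td in tr.findall("td")):
            following = rows[index + 1]  # plays the role of following-sibling::tr[1]
            return [a.text.strip() for a in following.iterfind(LINK_PATH) if a.text]
    return []


print(links_after_label(DOC, "Refine by Vehicle Types"))  # ['Automobile/Light Trucks']
```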

For a robust XPath, which will work no matter how many tr/li elements are between the two text labels, try:
(//tr
[td[contains(., 'Refine by Vehicle Types')]]
)[1]
/following-sibling::tr[not(preceding-sibling::tr
[contains(., 'Refine by Category')])]
/td/div/table
/tr/td/font
/ul/li/a
(Borrowing from @Dimitre's formatting.)
The above is inefficient (could be O(n^2)), so if you have a long page, it could get slow.
But for moderate pages it should be fine.
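If the quadratic cost ever matters, the same selection can be done in one linear pass outside XPath. The following is only a sketch in Python with the standard library, not a translation of the expression above: links_between and LINK_PATH are hypothetical names, the "Motorcycles" row is invented sample data, and only top-level rows are scanned. It collects links from every row after the start label and stops at the row containing the stop label.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with two link rows between the labels.
DOC = """<table>
<tr><td><font color="White">Refine by Vehicle Types</font></td></tr>
<tr><td><div><table><tr><td><font><ul>
<li><a> Automobile/Light Trucks</a></li>
</ul></font></td></tr></table></div></td></tr>
<tr><td><div><table><tr><td><font><ul>
<li><a>Motorcycles</a></li>
</ul></font></td></tr></table></div></td></tr>
<tr><td></td></tr>
<tr><td><font>Refine by Category</font></td></tr>
<tr><td><div><table><tr><td><font><ul>
<li><a>Agricultural</a></li>
</ul></font></td></tr></table></div></td></tr>
</table>"""

LINK_PATH = "td/div/table/tr/td/font/ul/li/a"


def links_between(xml_text, start_label, stop_label):
    """Collect link texts from rows after start_label and before stop_label."""
    root = ET.fromstring(xml_text)
    collecting = False
    found = []
    for tr in root.findall("tr"):  # single pass over the sibling rows
        text = "".join(tr.itertext())
        if not collecting:
            # Start collecting with the row after the start label.
            collecting = start_label in text
            continue
        if stop_label in text:
            break
        found.extend(a.text.strip() for a in tr.iterfind(LINK_PATH) if a.text)
    return found


print(links_between(DOC, "Refine by Vehicle Types", "Refine by Category"))
# ['Automobile/Light Trucks', 'Motorcycles']
```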

php replace images with divs

Below is the markup I'm pulling from my database table. Basically, I want to replace the image
<img src="http://newvision.co.ug/IM/logo_white_big.gif" width="80" style="background-color:white;padding:1px">
to
<div style='background:url(http://newvision.co.ug/IM/logo_white_big.gif) center center no-repeat;width:40px;height:40px'></div>
I don't want to use regular expressions, just an HTML parser that ships with PHP.
<table>
<tbody>
<tr>
<td valign="top"><a href="http://newvision.co.ug/PA/8/13/748484" target=
"_blank"><img src="http://newvision.co.ug/IM/logo_white_big.gif" width="80"
style="background-color:white;padding:1px" /></a></td>
<td valign="top">
<table>
<tbody>
<tr>
<td></td>
</tr>
<tr>
<td valign="top"><b><a target="_blank" href=
"http://newvision.co.ug/PA/8/13/748484" style="font-size:9pt">The New
Vision Online : Holland withholds sh10b over CHOGM</a></b></td>
</tr>
<tr>
<td valign="top"><a href="http://newvision.co.ug/PA/8/13/748484" style=
"font-size:8pt;color:silver"
target="_blank">http://newvision.co.ug/PA/8/13/748484</a></td>
</tr>
<tr>
<td valign="top" style="font-size:8pt;font-weight:normal">The New Vision
is Uganda's leading daily newspaper.</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>

You don't need regular expressions for this. PHP ships with the DOM extension (DOMDocument), and a library such as phpQuery lets you manipulate the DOM in a jQuery-like manner, so you can use selectors to easily swap out chunks of HTML.
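The DOM-based replacement itself is short in any language. Below is a minimal sketch in Python with the standard library, assuming the markup has first been made well-formed; the function name replace_images, the size_px parameter, and the one-cell sample are made up for this example. In PHP's DOM extension the equivalent steps would use createElement and replaceChild.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed fragment based on the first cell in the question.
CELL = (
    '<td valign="top"><a href="http://newvision.co.ug/PA/8/13/748484" target="_blank">'
    '<img src="http://newvision.co.ug/IM/logo_white_big.gif" width="80" '
    'style="background-color:white;padding:1px" /></a></td>'
)


def replace_images(xml_text, size_px=40):
    """Replace every img element with a div that shows it as a background."""
    root = ET.fromstring(xml_text)
    # Snapshot the elements first so the tree can be edited while looping.
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag != "img":
                continue
            div = ET.Element("div")
            div.set(
                "style",
                "background:url(%s) center center no-repeat;width:%dpx;height:%dpx"
                % (child.get("src"), size_px, size_px),
            )
            div.tail = child.tail  # keep any text that followed the image
            parent[index] = div
    # method="html" writes <div ...></div> instead of a self-closing tag.
    return ET.tostring(root, encoding="unicode", method="html")


print(replace_images(CELL))
```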

PHP REGEX: Find a dom node based on innerHTML

I'm well aware that PHP's DOM extension can solve half of my problem, but I need a way (not necessarily regex) to find a certain DOM element based on a given innerHTML.
Say, for example, I have this code:
<tr>
<td class="ranking_rank" style="vertical-align:middle;">48697</td>
<td class="ranking_ign" style="vertical-align:middle;">kanineh</td>
<td class="ranking_img" style="vertical-align:middle;">
<img src="http://avatar.maplesea.com/Character/NKGEHGDLFNINKPMFLDCNNOHKHKBOHBKLGCBLABFLABHAGBPAEMDEFABJBLKJIHJAANGEKFJGELEPKMCNLKPCINEJDGAJFLKG.gif" onerror="this.src='/images/ranking/noimage.jpg'"/>
</td>
<td class="ranking_lvl" style="vertical-align:middle;">122</td>
<td class="ranking_world" style="vertical-align:middle;">
<img src="/images/ranking/Bootes.gif" onMouseover="ddrivetip('Bootes','white', 70)" onMouseout="hidetip()">
</td>
<td class="ranking_job" style="vertical-align:middle;">
<img src="/images/ranking/Warrior.gif" onMouseover="ddrivetip('Warrior','white', 70)" onMouseout="hidetip()">
</td>
<td class="ranking_fame" style="vertical-align:middle;">449</td>
</tr>
<tr>
<td class="ranking_rank" style="vertical-align:middle;">48698</td>
<td class="ranking_ign" style="vertical-align:middle;">WannaLogic</td>
<td class="ranking_img" style="vertical-align:middle;">
<img src="http://avatar.maplesea.com/Character/DOMELFGEGCGDBFCOLADBDOJLHADCIBNKEGKGINPNBEKPDDKOEEGBLMDLBGBDHGCNPGLAECAMLGKEMDKJGPODIDKCOJCMNNKN.gif" onerror="this.src='/images/ranking/noimage.jpg'"/>
</td>
<td class="ranking_lvl" style="vertical-align:middle;">122</td>
<td class="ranking_world" style="vertical-align:middle;">
<img src="/images/ranking/Aquila.gif" onMouseover="ddrivetip('Aquila','white', 70)" onMouseout="hidetip()">
</td>
<td class="ranking_job" style="vertical-align:middle;">
<img src="/images/ranking/Magician.gif" onMouseover="ddrivetip('Magician','white', 70)" onMouseout="hidetip()">
</td>
<td class="ranking_fame" style="vertical-align:middle;">56</td>
</tr>
I need to get hold of the whole row node containing the td that has WannaLogic in it. Once I have this table row, I can easily traverse its nodes using PHP DOM. I'm a sucker for regular expressions, so I'd really appreciate it if you could shed some light on this.

Using regex on a DOM tree is a no-no and bound to fail when faced with malformed XML/HTML. Try this:
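// $doc is assumed to be a DOMDocument that already holds the parsed page
$xpath = new DOMXPath($doc);
$query = "//*[.='WannaLogic']";
$entries = $xpath->query($query);
foreach ($entries as $entry) {
    // $entry is the matching td; $entry->parentNode is its table row
}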
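The same lookup can also target the row directly instead of the cell. Here is a minimal sketch in Python with the standard library, assuming well-formed input; find_row and the reduced sample table are made up for this example. ElementTree's [td='text'] predicate matches a tr that has a td child whose complete text equals the given value. The name is interpolated into the path as-is, so this sketch would break on names containing quotes.

```python
import xml.etree.ElementTree as ET

# Reduced, hypothetical version of the ranking rows in the question.
TABLE = """<table>
<tr>
<td class="ranking_rank">48697</td>
<td class="ranking_ign">kanineh</td>
<td class="ranking_lvl">122</td>
</tr>
<tr>
<td class="ranking_rank">48698</td>
<td class="ranking_ign">WannaLogic</td>
<td class="ranking_lvl">122</td>
</tr>
</table>"""


def find_row(xml_text, name):
    """Return the first tr that has a td whose text is exactly name, or None."""
    root = ET.fromstring(xml_text)
    return root.find(".//tr[td='%s']" % name)


row = find_row(TABLE, "WannaLogic")
if row is not None:
    # Map each cell's class attribute to its text.
    print({td.get("class"): td.text for td in row})
# {'ranking_rank': '48698', 'ranking_ign': 'WannaLogic', 'ranking_lvl': '122'}
```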
