Parse span class text with DOM PHP

I've been having an issue trying to parse text in a span class with DOM. Here is my code example.
$remote = "http://website.com/";
$doc = new DOMDocument();
@$doc->loadHTMLFile($remote);
$xpath = new DOMXpath($doc);
$node = $xpath->query('//span[@class="user"]');
echo $node;
This returns the following error: "Catchable fatal error: Object of class DOMNodeList could not be converted to string". I am lost and need help.
What I am trying to do is parse the user name between this span tag.
<span class="user">bballgod093</span>
Here is the full source from the remote website.
<div id="randomwinner">
<div id="rndmLeftCont">
<h2 id="rndmTitle">Hourly Random <span>Winner</span></h2>
</div>
<div id="rndmRightCont">
<div id="rndmClaimImg">
<table cellspacing="0" cellpadding="0" width="200">
<tbody>
<tr>
<td align="right" valign="middle">
</td>
</tr>
</tbody>
</table>
</div>
<div id="rndmCaimTop">
<span class="user">bballgod093</span>You've won 1000 SB</div>
<div id="rndmCaimBottom">
<a id="rndmCaimBtn" class="btn1 btn2" href="/?cmd=cp-claim-random" rel="nofollow">Claim Bucks</a>
</div>
</div>
<div class="clear"></div>
</div>

This call
$node = $xpath->query('//span[@class="user"]');
does not return a string; it returns a DOMNodeList.
You can use this list somewhat like an array: $node->length gives the number of elements and $node->item(0) returns the first one as a DOMNode object. Each DOMNode has a nodeValue property, which is a string.
So you would do something like
$node = $xpath->query('//span[@class="user"]');
if ($node->length != 1) {
    // error?
}
echo $node->item(0)->nodeValue;
Of course, renaming $node to something more descriptive (it holds a list, not a single node) would be nice.
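As a minimal, self-contained sketch of the approach above: the snippet below parses an inline copy of the relevant markup instead of fetching the remote page, so the $html string and the $users and $userName names are assumptions made for illustration.

```php
<?php
// Inline stand-in for the remote page (assumption: the real page contains this span).
$html = '<div id="rndmCaimTop"><span class="user">bballgod093</span>You\'ve won 1000 SB</div>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// query() returns a DOMNodeList, never a string.
$users = $xpath->query('//span[@class="user"]');

// Read the first match only if there is one.
$userName = $users->length > 0 ? trim($users->item(0)->nodeValue) : null;

echo $userName, PHP_EOL;
```

Checking length before calling item(0) avoids reading nodeValue from null when the span is missing from the page.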

Related

Using DOMXpath to find data in not so nice html

I am trying to get some data from a plant list site. This proves to be a bit problematic because their html isn't really well-formed. These are two lines from the search result (disclaimer: I am not responsible for this code):
<tr>
<td>
<i class="glyphicons-icon leaf"></i>
</td>
<td>
<a title="Cimicifuga simplex" href="/taxon/wfo-0000604773" class="result">
<h4 class="h4Results"><em>Cimicifuga simplex</em>(DC.) Wormsk. ex Turcz.</h4>
</a>
Bull. Soc. Imp. Naturalistes Moscou<br/>
<div>
<em>Status:</em><span id="entryStatus">Synonym of </span>
<em>Actaea simplex</em>(DC.) Wormsk. ex Prantl
</div>
<div>
<em>Rank:</em><span id="entryRank">Species</span>
</div>
<div>
<em>Family:</em> Ranunculaceae
</div>
</td>
<td>
<img title="No Image Available" src="/css/images/no_image.jpg" class="thumbnail pull-right"/>
</td>
</tr>
<tr>
<td>
<i class="glyphicons-icon leaf"></i>
</td>
<td>
<a title="Actaea simplex" href="/taxon/wfo-0000519124" class="result">
<h4 class="h4Results"><strong><em>Actaea simplex</em>(DC.) Wormsk. ex Prantl</strong></h4>
</a>
Bot. Jahrb. Syst.<br/>
<div>
<em>Status:</em><span id="entryStatus">Accepted Name</span>
</div>
<div>
<em>Rank:</em><span id="entryRank">Species</span>
</div>
<div>
<em>Family:</em> Ranunculaceae</div>
<div>
<em>Order:</em> Ranunculales
</div>
</td>
<td>
<img title="No Image Available" src="/css/images/no_image.jpg" class="thumbnail pull-right"/>
</td>
</tr>
I added some layout myself, otherwise it wasn't readable.
Anyway, I loaded the page in PHP with DOMXPath, and now I want to do two things:
Select the row that has Accepted Name in it
Get the species name and the corresponding link from it
In this case the result would be "Actaea simplex" and "/taxon/wfo-0000519124". Mind that there will be more results resembling the first row, and that the position of the row that I am looking for doesn't have to be the second one.
Normally I just try, use Google, try some more, and get there in the end, but in this case IDs are used as classes and are not unique. This makes it impossible to use an XPath tester, and perhaps even makes DOMXPath useless.
So, is it possible to get my data with DOMXpath, and if yes - what query do I use?

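Try something like this (loadXML() expects $xml to be well-formed XML with a single root element, so wrap the rows in a <table> if needed):
$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);
// Select the <a> inside the cell whose status span reads "Accepted Name"
$target = $xpath->query("//td[.//span[.='Accepted Name']]/a");
$link = $target[0]->getAttribute('href');
$title = $target[0]->getAttribute('title');
echo $title, " ", $link;
Output:
Actaea simplex /taxon/wfo-0000519124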
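If the page is real-world HTML rather than well-formed XML, the same query can be run after loadHTML(). The following is only a sketch: the markup is a trimmed, hypothetical version of the two rows from the question, and normalize-space() is used so that stray whitespace inside the span does not break the match.

```php
<?php
// Trimmed, hypothetical copy of the two result rows from the question.
$html = <<<'HTML'
<table>
<tr><td>
<a title="Cimicifuga simplex" href="/taxon/wfo-0000604773" class="result"><em>Cimicifuga simplex</em></a>
<div><em>Status:</em><span id="entryStatus">Synonym of </span></div>
</td></tr>
<tr><td>
<a title="Actaea simplex" href="/taxon/wfo-0000519124" class="result"><em>Actaea simplex</em></a>
<div><em>Status:</em><span id="entryStatus">Accepted Name</span></div>
</td></tr>
</table>
HTML;

// Duplicate ids trigger parser warnings; keep them out of the output.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// The cell is selected by the text of its status span, not by position or id.
$links = $xpath->query("//td[.//span[normalize-space(.)='Accepted Name']]/a");

$title = null;
$link = null;
if ($links->length > 0) {
    $title = $links->item(0)->getAttribute('title');
    $link = $links->item(0)->getAttribute('href');
}

echo $title, ' ', $link, PHP_EOL;
```

Because the predicate tests the span's text, the non-unique ids in the page do not matter here.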

How to return in php DOMXPath object?

Calling query() on the result of an earlier $xp->query() call fails. Does query() return a string?
How can I make the following code work?
$xp = new \DOMXPath(@\DOMDocument::loadHTMLFile($url));
$list = $xp->query('//table[@class="table-list quality series"] tbody');
$link = $list->query('//tr[@class="item"]');
$arr_links = [];
foreach ($link as $link_in_cycle) {
$link_quality = $link_in_cycle->query('//td[@class="column first video"]');
$link_audio = $link_in_cycle->query('//td[@class="column audio"]');
$link_size = $link_in_cycle->query('//td[@class="column size"]');
$link_seed = $link_in_cycle->query('//td[@class="column seed-leech"] span[@class="seed"]');
$link_download_url = $link_in_cycle->query('//td[@class="column last download"] a')->getAttribute("data-default");
}
HTML source, as requested by @nigel-ren.
This is the markup I need to extract the information from:
<tbody>
<tr class="item">
<td class="column first video">720x400</td>
<td class="column audio">mp3</td>
<td class="column size">5.70 Gb</td>
<td class="column seed-leech">
<span class="seed">15</span>
<span class="leech">26</span>
</td>
<td class="column updated">07.07.2017</td>
<td class="column consistence"></td>
<td class="column last download">
<a class="button middle rounded download zona-link"
data-type="download"
data-zona="0"
data-torrent=""
data-default="url_data"
data-not-installed=""
data-installed=""
data-metriks="{'eventType': 'click', 'data' : { 'type': 'show_download', 'id': '84358'}}"
title="text in title" href="javascript:void(0);" >Download</a> </td>

I've made a few changes to help me debug the code. The main problem is that your XPath expressions were invalid. You can always try a site like FreeFormatter, which lets you check your expressions against some example source.
$doc = new \DOMDocument();
$doc->loadHTMLFile($url);
$xp = new \DOMXPath($doc);
$list = $xp->query('//table[@class="table-list quality series"]//tr[@class="item"]');
$arr_links = [];
foreach ($list as $link_in_cycle) {
    // The leading "." makes each query relative to the current row;
    // a bare "//" would search the whole document on every iteration.
    $link_quality = $xp->query('.//td[@class="column first video"]/text()', $link_in_cycle)[0]->wholeText;
    $link_audio = $xp->query('.//td[@class="column audio"]/text()', $link_in_cycle)[0]->wholeText;
    $link_size = $xp->query('.//td[@class="column size"]/text()', $link_in_cycle)[0]->wholeText;
    $link_seed = $xp->query('.//td[@class="column seed-leech"]//span[@class="seed"]/text()', $link_in_cycle)[0]->wholeText;
    $link_download_url = $xp->query('.//td[@class="column last download"]//a/@data-default', $link_in_cycle)[0]->value;
    echo $link_quality.PHP_EOL;
    echo $link_audio.PHP_EOL;
    echo $link_size.PHP_EOL;
    echo $link_seed.PHP_EOL;
    echo $link_download_url.PHP_EOL;
}
Each XPath expression retrieves the text nodes of the matching element and returns them as a list, so the code uses [0] to fetch the first item. It assumes there is no whitespace around the actual content. The wholeText property is simply the content of the DOMText node.
With the sample content you gave (plus the surrounding bits I had to invent) it gives...
720x400
mp3
5.70 Gb
15
Download
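A variation on the loop above, offered as a sketch rather than the answerer's method: DOMXPath::evaluate() with the XPath string() function returns plain strings, which avoids indexing into a node list. The second row and its values are invented for illustration, as is the $rows array.

```php
<?php
// Hypothetical sample: the first row mirrors the question, the second is invented.
$html = <<<'HTML'
<table class="table-list quality series"><tbody>
<tr class="item">
<td class="column first video">720x400</td>
<td class="column audio">mp3</td>
<td class="column size">5.70 Gb</td>
<td class="column seed-leech"><span class="seed">15</span><span class="leech">26</span></td>
<td class="column last download"><a data-default="url_data" href="javascript:void(0);">Download</a></td>
</tr>
<tr class="item">
<td class="column first video">1280x720</td>
<td class="column audio">aac</td>
<td class="column size">9.10 Gb</td>
<td class="column seed-leech"><span class="seed">7</span><span class="leech">3</span></td>
<td class="column last download"><a data-default="url_data_2" href="javascript:void(0);">Download</a></td>
</tr>
</tbody></table>
HTML;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($doc);
$rows = [];
foreach ($xp->query('//table[@class="table-list quality series"]//tr[@class="item"]') as $row) {
    // Every expression starts with "." so it is resolved against the current row.
    $rows[] = [
        'quality' => $xp->evaluate('string(.//td[@class="column first video"])', $row),
        'audio'   => $xp->evaluate('string(.//td[@class="column audio"])', $row),
        'size'    => $xp->evaluate('string(.//td[@class="column size"])', $row),
        'seed'    => $xp->evaluate('string(.//span[@class="seed"])', $row),
        'url'     => $xp->evaluate('string(.//td[@class="column last download"]//a/@data-default)', $row),
    ];
}

print_r($rows);
```

string() yields an empty string when nothing matches, so a missing cell does not raise an error; whether that is preferable to an explicit check depends on the data.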

HTML DOM remove/replace between <tr> and </tr> tags

I've searched for a solution but I'm lost. I have to remove, or replace with blank content, everything between <tr> tags. I'm loading an HTML file that contains many <tr> tags, and my goal is to remove the <tr> with a specific id. My <tr> looks like this:
<tr id="ctl00_cphMain_DisplayRecords1_RepeaterResults_ctl03_trZSD">
<td id="ctl00_cphMain_DisplayRecords1_RepeaterResults_ctl03_tdZSD" class="td-zsd footable-visible footable-last-column footable-first-column" colspan="9">
<div id="divZSDBanners" class="table-banners-zsd clearfix">
<div>
<div class="medium-4 columns zsd-ext-ad">
<div>
<script type="text/javascript">
</script>
<script>
</script>
<div id="ctl00_cphMain_DisplayRecords1_RepeaterResults_ctl03_ctl00_divSpace1" class="adSpacer">
</div>
</div>
</div>
<script type="text/javascript">
</script>
</div>
</div>
</td>
</tr>
I'm using Simple HTML DOM. I've already tried $html->find('tr[id=tr_id]'), but I don't know how to replace everything inside it, including the divs and script tags.
Any ideas?

Use ->innertext property:
$tr = $html->find( 'tr[id=tr_id]', 0 ); // Select first node (0)
$tr->innertext = '';
echo $html->save();
Output:
<tr id="tr_id"></tr>
Or:
$tr->innertext = '<td>New Content</td>';
echo $html->save();
Output:
<tr id="tr_id"><td>New Content</td></tr>

To remove the TR element itself via DOM, use the removeChild method of its parent node:
$tr->parentNode->removeChild($tr);
To remove the element's contents, either set its textContent property to an empty string '' (PHP 5.6.1+) or remove all child nodes one by one in a loop using the element's removeChild() method, e.g.:
while ($tr->lastChild) {
$tr->removeChild($tr->lastChild);
}
A SimpleXMLElement object can be converted to a DOMElement object using the dom_import_simplexml() function.
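Put together with PHP's built-in DOM extension (not Simple HTML DOM), the two operations might look like the sketch below. The table markup and the "keep" row are made up for the example; tr_id stands in for the long generated id from the question.

```php
<?php
// Hypothetical markup: one row to keep and one row to clear and remove.
$html = '<table>'
      . '<tr id="keep"><td>stay</td></tr>'
      . '<tr id="tr_id"><td><div><script>var x = 1;</script></div></td></tr>'
      . '</table>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$tr = $xpath->query('//tr[@id="tr_id"]')->item(0);

$childCountAfterClear = null;
if ($tr !== null) {
    // Step 1: empty the row by removing its children, last one first.
    while ($tr->lastChild) {
        $tr->removeChild($tr->lastChild);
    }
    $childCountAfterClear = $tr->childNodes->length;

    // Step 2: remove the now-empty row itself through its parent.
    $tr->parentNode->removeChild($tr);
}

echo $doc->saveHTML();
```

Only one of the two steps is needed in practice: stop after step 1 to keep an empty row in place, or go straight to step 2 to drop the row together with everything inside it.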

Find and separate the HTML blocks to an array

First, let me describe the idea. Any CMS or simple website has repeating blocks of some kind, for example the list of articles on the main page of a WordPress site, where each article is shown as a block of information: title, author, content, date, and so on.
The main question is how to find and separate such HTML blocks and append each of them to an array.
I thought I would first need to strip them of classes, ids and styles.
Step 1:
<div id="box1">
<h3 class="title_style">Title1</h3>
<p>content for box1</p>
<div class="author">Author Name1<span class="style_date">date1</span>any text</div>
</div>
<div id="box2">
<h3 class="title_style">Title2</h3>
<p>content for box2</p>
<div class="author">Author Name2<span class="style_date">date2</span>any text2</div>
</div>
to
<div>
<h3>Title1</h3>
<p>content for box1</p>
<div>Author Name1<span>date1</span>any text</div>
</div>
<div>
<h3>Title2</h3>
<p>content for box2</p>
<div>Author Name2<span>date2</span>any text2</div>
</div>
Step 2:
I need to find each block and write it to an array so that I can put each block into a table row like this. (Note that such blocks are present on almost any site, so it doesn't matter which tags they use; they just repeat with different content and attributes, and only the structure is the same.)
<table>
<tr id="block1">
<td>Title1</td>
<td>content for box1</td>
<td>Author Name1</td>
<td>date1</td>
<td>any text</td>
</tr>
<tr id="block2">
<td>Title2</td>
<td>content for box2</td>
<td>Author Name2</td>
<td>date2</td>
<td>any text</td>
</tr>
</table>
Any ideas? I need the logic for how to do this, not the code itself.

You can walk the DOM of the document using PHP's DOMDocument class.
So you can do something like this:
$str = <<<STR
<div id="box1">
<h3 class="title_style">Title1</h3>
<p>content for box1</p>
<div class="author">Author Name1<span class="style_date">date1</span>any text</div>
</div>
<div id="box2">
<h3 class="title_style">Title2</h3>
<p>content for box2</p>
<div class="author">Author Name2<span class="style_date">date2</span>any text2</div>
</div>
STR;
$dom = new DOMDocument();
$dom->loadHTML($str);
$divs = $dom->getElementsByTagName('div');
foreach ($divs as $div) {
//read child elements
}
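To make the "read child elements" step concrete, here is one possible sketch. It assumes the blocks can be recognised by an id starting with "box", which holds for the sample but is not a general rule, and it treats every non-empty text node inside a block as one table cell. The $blocks and $cells names are illustrative.

```php
<?php
$str = <<<'HTML'
<div id="box1">
<h3 class="title_style">Title1</h3>
<p>content for box1</p>
<div class="author">Author Name1<span class="style_date">date1</span>any text</div>
</div>
<div id="box2">
<h3 class="title_style">Title2</h3>
<p>content for box2</p>
<div class="author">Author Name2<span class="style_date">date2</span>any text2</div>
</div>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($str);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$blocks = [];
// Assumption: each repeating block is a div whose id starts with "box".
foreach ($xpath->query('//div[starts-with(@id, "box")]') as $box) {
    $cells = [];
    // Collect the non-whitespace text nodes of the block in document order.
    foreach ($xpath->query('.//text()[normalize-space()]', $box) as $text) {
        $cells[] = trim($text->nodeValue);
    }
    $blocks[] = $cells;
}

print_r($blocks);
```

Because only text nodes are read, classes, ids and styles never enter the result, so the separate clean-up step from the question is not needed in this variant. Each inner array can then be written out as one table row.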

Try this library Simple HTML Dom Parser.

How do I parse HTML using PHP DOMDocument?

I have an HTML block here:
<div class="title">
<a href="http://test.com/asus_rt-n53/p195257/">
Asus RT-N53
</a>
</div>
<table>
<tbody>
<tr>
<td class="price-status">
<div class="status">
<span class="available">Yes</span>
</div>
<div name="price" class="price">
<div class="uah">758<span> ua.</span></div>
<div class="usd">$ 62</div>
</div>
How do I parse the link (http://test.com/asus_rt-n53/p195257/), title (Asus RT-N53) and price (758)?
Here is my parsing code:
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML($content);
$xpath = new DOMXPath($dom);
$models = $xpath->query('//div[@class="title"]/a');
foreach ($models as $model) {
echo $model->nodeValue;
$prices = $xpath->query('//div[@class="uah"]');
foreach ($prices as $price) {
echo $price->nodeValue;
}
}

One ugly solution is to cast the price to an integer, which keeps only the leading digits:
echo (int) $price->nodeValue;
Or you can find the span inside the current div and remove it before reading the price (inside the prices foreach):
// "./span" is evaluated relative to the current $price node
$span = $xpath->query('./span', $price)->item(0);
$price->removeChild($span);
echo $price->nodeValue;
Edit:
To retrieve the link, simply use getAttribute() to read the href attribute:
$model->getAttribute('href')
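The pieces above can be combined into one pass that collects link, title and price per product. This is a sketch under the assumption that each price block follows its title in document order, as in the sample; the $products array and the closed-off markup are illustrative.

```php
<?php
// The question's snippet, closed off so that it forms a complete fragment.
$content = <<<'HTML'
<div class="title">
<a href="http://test.com/asus_rt-n53/p195257/">
Asus RT-N53
</a>
</div>
<table><tbody><tr>
<td class="price-status">
<div class="status"><span class="available">Yes</span></div>
<div name="price" class="price">
<div class="uah">758<span> ua.</span></div>
<div class="usd">$ 62</div>
</div>
</td>
</tr></tbody></table>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($content);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$products = [];
foreach ($xpath->query('//div[@class="title"]/a') as $model) {
    $products[] = [
        'link'  => $model->getAttribute('href'),
        'title' => trim($model->nodeValue),
        // First text node of the nearest following "uah" div, i.e. the number without the span.
        'price' => (int) $xpath->evaluate('string(following::div[@class="uah"][1]/text()[1])', $model),
    ];
}

print_r($products);
```

Reading only the div's first text node leaves the document untouched, unlike the removeChild() variant, which modifies the DOM as a side effect.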
