DOMDocument PHP web scraping

I was wondering if there is any way to use DOM to select elements that have dynamic ids. All of the ids start with link_(some id).
Example:
<tr id="link_111111">something in here...</tr>
<tr id="link_222222">something in here...</tr>
<tr id="link_333333">something in here...</tr>
<tr id="link_444444">something in here...</tr>
<tr id="link_555555">something in here...</tr>
I was wondering if I could grab all the tr's whose id starts with link_, because I don't have the specific ids; they are random.

You can use an XPath expression to achieve this:
//tr[starts-with(@id, "link")]
Example:
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//tr[starts-with(@id, "link")]');
foreach ($nodes as $node) {
    // Do whatever
}
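For a self-contained check, here is a minimal sketch that runs the same kind of query against rows like the ones in the question. The $html string and the extra row with id "other" are assumptions added for illustration, and the prefix is tightened to "link_" to match the ids shown in the question.

```php
<?php
// Sample markup assumed for illustration; the real HTML would come from the scraped page.
$html = '<table>
<tr id="link_111111"><td>first</td></tr>
<tr id="other"><td>ignored</td></tr>
<tr id="link_222222"><td>second</td></tr>
</table>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Collect the id of every <tr> whose id attribute starts with "link_".
$ids = [];
foreach ($xpath->query('//tr[starts-with(@id, "link_")]') as $node) {
    $ids[] = $node->getAttribute('id');
}
print_r($ids);
```

The row with id "other" is skipped because starts-with() only matches ids that begin with the given prefix.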

The DOM way, using some string functions:
$dom = new DOMDocument;
$dom->loadHTML($html);
$tagK = 'link_';
foreach ($dom->getElementsByTagName('tr') as $tag) {
    // Compare the lowercased start of the id with the prefix
    if (substr(strtolower($tag->getAttribute('id')), 0, strlen($tagK)) === $tagK) {
        echo $tag->getAttribute('id') . PHP_EOL;
    }
}
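If you are on PHP 8 or later, the same prefix test can probably be written more readably with str_starts_with(). This is only a sketch: the sample $html, the $prefix variable name and the mixed-case id are assumptions made up for the example.

```php
<?php
// Sample markup assumed for illustration.
$html = '<table>
<tr id="link_111111"><td>a</td></tr>
<tr id="LINK_222222"><td>b</td></tr>
<tr id="row_3"><td>c</td></tr>
</table>';
$prefix = 'link_';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument;
$dom->loadHTML($html);

$matches = [];
foreach ($dom->getElementsByTagName('tr') as $tr) {
    $id = $tr->getAttribute('id');
    // Lowercase the id first so the comparison is case-insensitive.
    if (str_starts_with(strtolower($id), $prefix)) {
        $matches[] = $id;
    }
}
print_r($matches);
```

The ids are stored with their original casing; only the comparison is lowercased.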

Or, if you want a more flexible and easier way to scrape the web, I suggest you take a look at
https://github.com/fabpot/goutte, which acts as a wrapper that you can also use for clicking a link or submitting a form.
I made some tutorials on web scraping with the Goutte class. Feel free to check them out:
http://iapdesign.com/webdev/laravel-4-webdev/superb-web-scraping-tutorials-using-laravel-4/

Related

Get DOMNodeList of elements with only the given class

I am parsing a 3rd party HTML page using PHP DOMDocument and DomXPath.
I use the following code:
$dom = new DOMDocument();
$html = file_get_contents($url);
$dom->loadHTML('<?xml encoding="UTF-8">' . $html);
$dom->encoding = "UTF-8";
$finder = new DomXPath($dom);
Now there are several elements using the same class, but I want to target the one that uses only the given class, for example:
<table class="tbl"></table>
<table class="tbl red"></table>
<table class="tbl large blue"></table>
I use the following selector:
$classname = "tbl";
$nodes = $finder->query("//*[contains(@class, '$classname')]");
Which, of course, fetches all three tables given above. Is there a simple way to get only the first one?
Thanks

Yes, there is a way.
Note that with your XPath query you can already access the desired node this way:
$nodes->item(0);
To select only the first node, modify your pattern like this:
$nodes = $finder->query("(//*[contains(@class, '$classname')])[1]");
Either way, you still access the desired node with this syntax:
$nodes->item(0);
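If "only the given class" means the class attribute holds nothing except tbl, another option (not taken from the answer above, just one possible approach) is to compare the whole attribute instead of using contains(). A minimal sketch, assuming the three tables from the question as input:

```php
<?php
// The three tables from the question, used as assumed sample input.
$html = '<table class="tbl"></table>
<table class="tbl red"></table>
<table class="tbl large blue"></table>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
$finder = new DOMXPath($dom);

$classname = 'tbl';
// normalize-space() trims stray whitespace, so class=" tbl " would still match.
$nodes = $finder->query("//*[normalize-space(@class) = '$classname']");
echo $nodes->length, PHP_EOL;
```

This matches by content rather than by position, so it should keep working if the order of the tables on the page changes.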

Trying to retrieve text only from a div with xpath

I'm trying to write a script that will go through a poorly coded webpage and return the title element. However, the person who made the website I plan on scraping did not use ANY classes, just plain div elements. Here's the source of the webpage I'm trying to scrape:
<tbody>
<tr>
<td style = "...">
<div style = "...">
<div style = "...">TEXT I WANT</div>
</div>
</td>
</tr>
</tbody>
and when I copy the xpath in chrome I get this string:
/html/body/table/tbody/tr[2]/td[3]/table/tbody/tr[1]/td/div/div[3]
I'm having trouble figuring out where I put that string in an xpath query.
If not an xpath query maybe I should do a preg_match?
I tried this:
$location = '/html/body/table/tbody/tr[2]/td[3]/table/tbody/tr[1]/td/div/div[3]';
$html = file_get_contents($URL);
$doc = new DomDocument();
$doc->loadHtml($html);
$xpath = new DomXPath($doc);
// Now query the document:
foreach ($xpath->query($location) as $node) {
echo $node, "\n";
}
but nothing is printed to the page.
Thanks.
EDIT: Full source code here:
http://pastebin.com/K5tZ4dFH
EDIT2: Cleaner code screen shot: http://i.imgur.com/lWKheBy.png

From looking at your source, try the following:
$html = file_get_contents($URL);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$nodes = $xpath->query("//div[contains(@style, 'left:20px')]");
foreach ($nodes as $node) {
    echo $node->textContent;
}

It looks like you want the text just before the first </div>, so this regex will find that:
[^<>]+(?=<\/div>)

Using regex to get the value from a tag in PHP

Using regex in PHP how can I get the 108 from this tag?
<td class="registration">108</td>

Regex isn't a good solution for parsing HTML. Use a DOM Parser instead:
$str = '<td class="registration">108</td>';
$dom = new DOMDocument();
$dom->loadHTML($str);
$tds = $dom->getElementsByTagName('td');
foreach($tds as $td) {
echo $td->nodeValue;
}
Output:
108
The above code loads your HTML string using the loadHTML() method, finds all the <td> tags, loops through them, and echoes each node value.
If you want to get only the specific class name, you can use an XPath:
$dom = new DOMDocument();
$dom->loadHTML($str);
$xpath = new DomXPath($dom);
// get the td tag with 'registration' class
$tds = $xpath->query("//*[contains(@class, 'registration')]");
foreach($tds as $td) {
echo $td->nodeValue;
}
This is similar to the above code, except that it uses XPath to find the required tag. You can find more information about XPaths in the PHP manual documentation. This post should get you started.

If you wish to force regex, use the <td class=["']?registration["']?>(.*)</td> expression
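As a rough sketch of how that expression could be used with preg_match(): the ~ delimiters and the variable names are choices made for this example, and $str is the tag from the question.

```php
<?php
$str = '<td class="registration">108</td>';

// "~" is used as the delimiter so the "/" in "</td>" needs no escaping.
$pattern = '~<td class=["\']?registration["\']?>(.*)</td>~';

$value = null;
if (preg_match($pattern, $str, $m) === 1) {
    $value = $m[1]; // contents of the first capture group
}
echo $value, PHP_EOL;
```

Note that (.*) is greedy, so with several <td> tags on one line it would capture too much. Switching to (.*?) is one way around that, but it is also a good illustration of why a DOM parser is usually the safer choice.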

PHP DOMDocument how to get that content of this tag?

I am using domDocument hoping to parse this little html code. I am looking for a specific span tag with a specific id.
<span id="CPHCenter_lblOperandName">Hello world</span>
My code:
$dom = new domDocument;
@$dom->loadHTML($html); // the @ is to silence errors caused by malformed HTML
$dom->preserveWhiteSpace = false;
$nodes = $dom->getElementsByTagName('//span[@id="CPHCenter_lblOperandName"');
foreach($nodes as $node){
echo $node->nodeValue;
}
But for some reason I think something is wrong with either the code or the HTML (how can I tell?):
When I count nodes with echo count($nodes); the result is always 1
I get nothing outputted in the nodes loop
How can I learn the syntax of these complex queries?
What did I do wrong?

You can use simple getElementById:
$dom->getElementById('CPHCenter_lblOperandName')->nodeValue
or in selector way:
$selector = new DOMXPath($dom);
$list = $selector->query('/html/body//span[@id="CPHCenter_lblOperandName"]');
echo($list->item(0)->nodeValue);
//or
foreach($list as $span) {
$text = $span->nodeValue;
}
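One thing to watch with the getElementById one-liner: it returns null when no element has that id, so chaining ->nodeValue directly would fail on a page that lacks the span. A small sketch with a null check; the textById() helper and the 'no_such_id' value are hypothetical, made up for this example.

```php
<?php
$html = '<span id="CPHCenter_lblOperandName">Hello world</span>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);

// Hypothetical helper: returns the element's text, or null if the id is absent.
function textById(DOMDocument $dom, string $id): ?string
{
    $node = $dom->getElementById($id);
    return $node === null ? null : $node->textContent;
}

$found = textById($dom, 'CPHCenter_lblOperandName');
$missing = textById($dom, 'no_such_id');
var_dump($found, $missing);
```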

Your four part question gets an answer in three parts:
1. getElementsByTagName does not take an XPath expression; you need to give it a tag name.
2. Nothing is output because no tag would ever match the tag name you provided (see #1).
3. It looks like what you want is XPath, which means you need to create an XPath object. See the PHP docs for more.
Also, a better method of controlling the libxml errors is to use libxml_use_internal_errors(true) (rather than the '@' operator, which will also hide other, more legitimate errors). That would leave you with code that looks something like this:
<?php
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach ($xpath->query("//span[@id='CPHCenter_lblOperandName']") as $node) {
    echo $node->textContent;
}
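With internal errors enabled, the parser's complaints are stored rather than printed, and you can look at them with libxml_get_errors(). That may also help with the "how can I tell?" part of the question. A minimal sketch, assuming a small $html string with a made-up <foo> tag; whether that tag actually produces an error depends on the libxml version, so no particular error output is promised here.

```php
<?php
// Assumed sample input; <foo> is a made-up tag that some libxml versions report as invalid.
$html = '<span id="CPHCenter_lblOperandName">Hello world</span><foo></foo>';

$previous = libxml_use_internal_errors(true); // remember the old setting
$dom = new DOMDocument();
$dom->loadHTML($html);

$errors = libxml_get_errors(); // array of LibXMLError objects, possibly empty
libxml_clear_errors();         // free the stored errors
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    echo 'line ', $error->line, ': ', trim($error->message), PHP_EOL;
}

$xpath = new DOMXPath($dom);
$text = $xpath->query("//span[@id='CPHCenter_lblOperandName']")->item(0)->textContent;
echo $text, PHP_EOL;
```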

preg_match() find all values inside of table?

hey guys,
a curl function returns a string $widget that contains regular html -> two divs where the first div holds a table with various values inside of <td>'s.
i wonder what's the easiest and best way for me to extract only all the values inside of the <td>'s so i have blank values without the remaining html.
any idea what the pattern for the preg_match should look like?
thank you.

Regex is not a suitable solution. You're better off loading it up in a DOMDocument and parsing it.

You're better off using a DOM parser for that task:
$html = <<<HTML
<div>
<table>
<tr>
<td>foo</td>
<td>bar</td>
</tr>
<tr>
<td>hello</td>
<td>world</td>
</tr>
</table>
</div>
<div>
Something irrelevant
</div>
HTML;
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$tds = $xpath->query('//div/table/tr/td');
foreach ($tds as $cell) {
echo "{$cell->textContent}\n";
}
Would output:
foo
bar
hello
world

You shouldn't use regexps to parse HTML. Use DOM and XPath instead. Here's an example:
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//td');
$result = array();
foreach ($nodes as $node) {
$result[] = $node->nodeValue;
}
// $result holds the values of the tds

Only if you have very limited, well-defined HTML can you expect to parse it with regular expressions. The highest ranked SO answer of all time addresses this issue.
He comes ...
