Traversing a DOM with PHP

I want to find a specific element in the DOM of a particular site.
Inside the DOM there is a tag called "cufon".
Assume the URL is http://www.xyzw.com/
The code I use is the following:
$dom = new DOMDocument();
@$dom->loadHTMLFile('http://www.xyzw.com/');
$teams = $dom->getElementById('cufon');
At this point the $teams variable is supposed to contain all of the cufon elements inside the DOM, but it contains nothing. If I search for "div" elements instead, it finds them all.
What is the problem?

If, as you say, there is a tag called cufontext, then getElementById is the wrong tool: it looks up a single element by its id attribute (IDs need to be unique) and never returns a collection of nodes. Perhaps you want to find all elements with the specified tag name?
$dom = new DOMDocument();
$dom->loadHTMLFile('http://www.xyzw.com/');
$teams = $dom->getElementsByTagName('cufontext');
if( $teams ){
foreach($teams as $team){
/* do stuff */
}
}
As we have not been given the actual URL involved, I had to test like this:
/* random url - just happened to be open in browser just now */
$url='http://www.interparcel.com/';
/* the tag to search for */
$tag='div';
$dom = new DOMDocument();
$dom->loadHTMLFile( $url );
$teams = $dom->getElementsByTagName( $tag );
/* As pointed out by @Pieter, the node list is always truthy, so also check its length */
if( $teams && $teams->length > 0 ){
foreach($teams as $team){
echo $team->nodeValue;
}
}
This will print a lot of content from the remote URL. If you are unable to find a tag called cufontext, I'd suggest confirming that the page actually contains tags of that name.
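The difference between the two lookups can be checked without a remote site. The following is a minimal sketch, assuming a made-up HTML fragment that contains two cufontext tags (the markup and the variable names are illustrative and not taken from the site in the question):

```php
<?php
// Hypothetical markup standing in for the remote page.
$html = '<div id="wrap"><cufontext>Team A</cufontext><cufontext>Team B</cufontext></div>';

// cufontext is not a standard HTML tag, so collect libxml's warnings quietly.
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Looks for one element whose id attribute is "cufontext": there is none.
$byId = $dom->getElementById('cufontext');

// Looks for every element whose tag name is "cufontext".
$names = array();
foreach ($dom->getElementsByTagName('cufontext') as $node) {
    $names[] = $node->textContent;
}

var_dump($byId);
print_r($names);
```

The id lookup comes back empty while the tag-name lookup yields both elements, which matches the symptom described in the question.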

Related

PHP - preg_replace - html tags and attributes

I'm trying to allow some tags and attributes using an array, and remove the rest.
Here is my example:
$allowed=array("img", "p", "style");
$text='<img src="image.gif" onerror="myFunction()" style="background:red" onclick="myFunction()">
<p>A function is triggered if an error occurs when loading the image. The function shows an alert box with a text.
In this example we refer to an image that does not exist, therefore the onerror event occurs.</p>
<script>
function myFunction() {
alert(\'The image could not be loaded.\');
}
</script>';
Using $text= preg_replace('#<script(.*?)>(.*?)</script>#is', '', $text);
I could remove the script tag along with its content, but I need to remove everything that is not in the $allowed array.

I would suggest using DOMDocument for better readability if you are mixing scripts with HTML like this. Keep an eye on performance if it matters for your use case.
http://php.net/manual/en/class.domdocument.php

This function should do what you want. Given a DOMDocument ($doc) and a node ($node) to search from, it recursively iterates over the children of that node, removing any tags that are not in the $allowed_tags array, and, for those tags that are kept, removing any attributes not in the $allowed_attributes array:
function remove_nodes_and_attributes($doc, $node, $allowed_tags, $allowed_attributes) {
$xpath = new DOMXPath($doc);
foreach ($xpath->query('./*', $node) as $child) {
if (!in_array($child->nodeName, $allowed_tags)) {
$node->removeChild($child);
continue;
}
$a = 0;
while ($a < $child->attributes->length) {
$attribute = $child->attributes->item($a)->name;
if (!in_array($attribute, $allowed_attributes)) {
$child->removeAttribute($attribute);
// don't increment the pointer as the list will shift with the removal of the attribute
}
else {
// allowed attribute, skip it
$a++;
}
}
// remove any children as necessary
remove_nodes_and_attributes($doc, $child, $allowed_tags, $allowed_attributes);
}
}
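You would use this function like this. Note that it is necessary to wrap the HTML in a top-level element, which is then stripped off again at the end using substr.
// whitelists inferred from the question's $allowed array and the sample output below
$allowed_tags = array("img", "p", "style");
$allowed_attributes = array("style");
$doc = new DOMDocument();
$doc->loadHTML("<html>$text</html>", LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$html = $doc->getElementsByTagName('html')[0];
remove_nodes_and_attributes($doc, $html, $allowed_tags, $allowed_attributes);
// strip the leading "<html>" (6 characters) and the trailing "</html>" plus newline (8 characters)
echo substr($doc->saveHTML(), 6, -8);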
Output (for your sample data):
<img style="background:red">
<p>A function is triggered if an error occurs when loading the image. The function shows an alert box with a text. In this example we refer to an image that does not exist, therefore the onerror event occurs.</p>
Demo on 3v4l.org

Using DOMDocument is generally the most reliable way to manipulate HTML, because it understands the structure of the document.
In this solution I use XPath to find any nodes which are not in the allowed list. The XPath expression will look something like...
//body//*[not(name() = "img" or name() = "p" or name() = "style")]
This looks for any element in the <body> tag (loadHTML will automatically put this tag in for you) whose name isn't in the list of allowed tags. The XPath is built dynamically from the $allowed list, so you just change the list of tags to update it...
$allowed=array("img", "p", "style");
$text='<img src="image.gif" onerror="myFunction()" style="background:red" onclick="myFunction()">
<p>A function is triggered if an error occurs when loading the image. The function shows an alert box with a text.
In this example we refer to an image that does not exist, therefore the onerror event occurs.</p>
<script>
function myFunction() {
alert(\'The image could not be loaded.\');
}
</script>';
$doc = new DOMDocument();
$doc->loadHTML($text);
$xp = new DOMXPath($doc);
$find = '//body//*[not(name() = "'.implode ('" or name() = "', $allowed ).
'")]';
echo "XPath = ".$find.PHP_EOL;
$toRemove = $xp->evaluate($find);
print_r($toRemove);
foreach ( $toRemove as $remove ) {
$remove->parentNode->removeChild($remove);
}
// recreate HTML
$outHTML = "";
foreach ( $doc->getElementsByTagName("body")[0]->childNodes as $tag ) {
$outHTML.= $doc->saveHTML($tag);
}
echo $outHTML;
If you also want to remove attributes, you can repeat the process using @* as part of the XPath expression...
// an empty list means that every attribute is removed
$allowedAttribs = array();
$find = '//body//@*[not(name() = "'.implode ('" or name() = "', $allowedAttribs ).
'")]';
$toRemove = $xp->evaluate($find);
foreach ( $toRemove as $remove ) {
$remove->parentNode->removeAttribute($remove->nodeName);
}
It would be possible to combine these two, but it makes the code less legible (IMHO).
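For illustration, here is one way the two passes could be combined. This is only a sketch: strip_disallowed is a hypothetical helper name, the sample input is shortened, and the whitelists are assumed to be trusted values, because they are concatenated straight into the XPath expression.

```php
<?php
/**
 * Hypothetical helper: keeps only whitelisted tags and attributes.
 * Assumes $html is a non-empty fragment and the whitelists are trusted.
 */
function strip_disallowed($html, array $allowedTags, array $allowedAttribs)
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if (!$body) {
        return '';
    }
    $xp = new DOMXPath($doc);

    // Builds 'name() = "a" or name() = "b"'; an empty list matches nothing.
    $nameTest = function (array $names) {
        return $names
            ? 'name() = "' . implode('" or name() = "', $names) . '"'
            : 'false()';
    };

    // 1. Remove elements whose tag is not whitelisted (their children go too).
    foreach ($xp->query('//body//*[not(' . $nameTest($allowedTags) . ')]') as $element) {
        if ($element->parentNode) {
            $element->parentNode->removeChild($element);
        }
    }

    // 2. Remove non-whitelisted attributes from the elements that remain.
    foreach ($xp->query('//body//@*[not(' . $nameTest($allowedAttribs) . ')]') as $attr) {
        $attr->ownerElement->removeAttributeNode($attr);
    }

    // Serialize only the children of <body>, not the wrapper libxml added.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$clean = strip_disallowed(
    '<img src="image.gif" onerror="myFunction()" style="background:red">'
    . '<p onclick="x()">Text</p><script>alert(1)</script>',
    array('img', 'p'),
    array('style')
);
echo $clean, PHP_EOL;
```

Elements are removed before attributes so that the second query only visits nodes that will actually be kept.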

Simple html dom parser get tr from table

I am trying to scrape http://spys.one/free-proxy-list/ but I only want to get the proxy ip:port column.
I checked the website and there are 3 tables.
Can anyone help me out?
<?php
require "scrapper/simple_html_dom.php";
$html=file_get_html("http://spys.one/free-proxy-list/");
$html=new simple_html_dom($html);
$rows = array();
$table = $html->find('table',3);
var_dump($table);

Try the below script. It should fetch you only the required items and nothing else:
<?php
include 'simple_html_dom.php';
$url = "http://spys.one/free-proxy-list/";
$html = file_get_html($url);
foreach($html->find("table[width='65%'] tr[onmouseover]") as $file) {
$data = $file->find('td', 0)->plaintext;
echo $data . "<br/>";
}
?>
Output it produces like:
176.94.2.84
178.150.141.93
124.16.84.208
196.53.99.7
31.146.161.238

I really don't know what your Simple HTML DOM library does. In any case, PHP nowadays has everything on board that you need for parsing specific DOM elements. Just use PHP's own DOMXPath class for querying DOM elements.
Here's a short example for getting the first column of a table.
$dom = new \DOMDocument();
// loadHTML() expects markup, so use loadHTMLFile() to fetch a URL
$dom->loadHTMLFile('https://your.url.goes.here');
$xpath = new \DOMXPath($dom);
// query the first td with class "value" inside the table with class "attributes"
$elements = $xpath->query('(//table[@class="attributes"]//td[@class="value"])[1]');
// iterate through all found td elements
foreach ($elements as $element) {
    echo $element->nodeValue;
}
This is only one possible example. It does not exactly solve your issue with http://spys.one/free-proxy-list/, but it shows how you could get the first column of a specific table. The only thing left to do is to find the right query for the table you want in the DOM of the given site. Because that site uses a fairly complex, dated table layout and the table you want to parse has no unique id or similar hook, you will have to work that query out yourself.
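To make the column selection concrete, here is a small self-contained sketch. The table markup and the class name "proxies" are made up for the example and do not reflect the real structure of the site in the question.

```php
<?php
// Hypothetical table markup; the class name "proxies" is an assumption.
$html = '<table class="proxies">'
      . '<tr><td>10.0.0.1:8080</td><td>HTTP</td></tr>'
      . '<tr><td>10.0.0.2:3128</td><td>HTTPS</td></tr>'
      . '</table>';

$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);

// td[1] is the first cell of each row, which together form the first column.
$cells = $xpath->query('//table[@class="proxies"]//tr/td[1]');

$firstColumn = array();
foreach ($cells as $cell) {
    $firstColumn[] = trim($cell->nodeValue);
}
print_r($firstColumn);
```

The positional predicate td[1] is evaluated per row, which is why it yields one cell for every tr rather than a single node.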

How to get iTunes-specific child nodes of RSS feeds?

I'm trying to process an RSS feed using PHP and there are some tags such as 'itunes:image' which I need to process. The code I'm using is below, and for some reason these elements are not returning any value. The length of the result is 0.
How can I read these tags and get their attributes?
$f = $_REQUEST['feed'];
$feed = new DOMDocument();
$feed->load($f);
$items = $feed->getElementsByTagName('channel')->item(0)->getElementsByTagName('item');
foreach($items as $key => $item)
{
$title = $item->getElementsByTagName('title')->item(0)->firstChild->nodeValue;
$pubDate = $item->getElementsByTagName('pubDate')->item(0)->firstChild->nodeValue;
$description = $item->getElementsByTagName('description')->item(0)->textContent; // textContent
$arrt = $item->getElementsByTagName('itunes:image');
print_r($arrt);
}

getElementsByTagName is specified by DOM, and PHP is just following that. It doesn't consider namespaces. Instead, use getElementsByTagNameNS, which requires the full namespace URI (not the prefix). This appears to be http://www.itunes.com/dtds/podcast-1.0.dtd*. So:
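$img = $item->getElementsByTagNameNS('http://www.itunes.com/dtds/podcast-1.0.dtd', 'image');
// Set preemptive fallback, then set value if check passes
$urlImage = '';
// getElementsByTagNameNS returns a DOMNodeList, so check its length and read the first node
if ($img->length > 0) {
    $urlImage = $img->item(0)->getAttribute('href');
}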
Or put the namespace in a constant.
You might be able to get away with simply removing the prefix and getting all image tags of any namespace with getElementsByTagName.
Make sure to check whether a given item has an itunes:image element at all (example now given); in the example podcast, some don't, and I suspect that was also giving you trouble. (If there's no href attribute, getAttribute will return either null or an empty string per the DOM spec without erroring out.)
*In case you're wondering, there is no actual DTD file hosted at that location, and there hasn't been for about ten years.
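A self-contained way to try the namespace lookup is sketched below. The feed is a hypothetical two-item document written for this example, and ITUNES_NS is just an illustrative constant name holding the namespace URI mentioned above.

```php
<?php
// Hypothetical feed: only the first item carries an itunes:image element.
$xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
  <channel>
    <item>
      <title>Episode 1</title>
      <itunes:image href="http://example.com/ep1.png"/>
    </item>
    <item>
      <title>Episode 2</title>
    </item>
  </channel>
</rss>
XML;

// The namespace URI, not the "itunes" prefix, identifies the elements.
const ITUNES_NS = 'http://www.itunes.com/dtds/podcast-1.0.dtd';

$feed = new DOMDocument();
$feed->loadXML($xml);

$images = array();
foreach ($feed->getElementsByTagName('item') as $item) {
    $title = $item->getElementsByTagName('title')->item(0)->textContent;
    $img = $item->getElementsByTagNameNS(ITUNES_NS, 'image');
    // Fall back to an empty string for items without an image.
    $images[$title] = $img->length > 0 ? $img->item(0)->getAttribute('href') : '';
}
print_r($images);
```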

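<?php
$rss_feed = simplexml_load_file("url link");
if (!empty($rss_feed)) {
    foreach ($rss_feed->channel->item as $feed_item) {
        // children('itunes', true) selects the child elements that use the itunes: prefix
        $image = $feed_item->children('itunes', true)->image;
        // skip items that have no itunes:image element
        if (isset($image[0])) {
            echo $image->attributes()->href;
        }
    }
}
?>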

How do I account for missing xPaths and keep my data uniform when scraping a website using DOMXPath query method?

I am attempting to scrape a website using the DOMXPath query method. I have successfully scraped the 20 profile URLs of each News Anchor from this page.
$url = "http://www.sandiego6.com/about-us/meet-our-team";
$xPath = "//p[@class='bio']/a/@href";
$html = new DOMDocument();
@$html->loadHtmlFile($url);
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query($xPath);
$profileurl = array();
foreach ($nodelist as $n){
$value = $n->nodeValue;
$profileurl[] = $value;
}
I used the resulting array as the URL to scrape data from each of the News Anchor's bio pages.
$imgurl = array();
for($z=0;$z<$elementCount;$z++){
$html = new DOMDocument();
@$html->loadHtmlFile($profileurl[$z]);
$xpath = new DOMXPath($html);
$nodelist = $xpath->query("//img[@class='photo fn']/@src");
foreach($nodelist as $n){
$value = $n->nodeValue;
$imgurl[] = $value;
}
}
Each News Anchor profile page has 6 xPaths I need to scrape (the $imgurl array is one of them). I am then sending this scraped data to MySQL.
So far, everything works great - except when I attempt to get the Twitter URL from each profile because this element isn't found on every News Anchor profile page. This results in MySQL receiving 5 columns with 20 full rows and 1 column (twitterurl) with 18 rows of data. Those 18 rows are not lined up with the other data correctly because if the xPath doesn't exist, it seems to be skipped.
How do I account for missing xPaths? Looking for an answer, I found someone's statement that said, "The nodeValue can never be null because without a value, the node wouldn't exist." That being considered, if there is no nodeValue, how can I programmatically recognize when these xPaths don't exist and fill that iteration with some other default value before it loops through to the next iteration?
Here's the query for the Twitter URLs:
$twitterurl = array();
for($z=0;$z<$elementCount;$z++){
$html = new DOMDocument();
@$html->loadHtmlFile($profileurl[$z]);
$xpath = new DOMXPath($html);
$nodelist = $xpath->query("//*[@id='bio']/div[2]/p[3]/a/@href");
foreach($nodelist as $n){
$value = $n->nodeValue;
$twitterurl[] = $value;
}
}

Since the twitter node appears zero or one times, change the foreach to
$twitterurl [] = $nodelist->length ? $nodelist->item(0)->nodeValue : NULL;
That will keep the contents in sync. You will, however, have to make arrangements to handle NULL values in the query you use to insert them in the database.
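The effect on alignment can be seen in a small sketch. The three page fragments and the "twitter" class are made up for this example; the point is only that every page contributes exactly one array entry.

```php
<?php
// Hypothetical profile pages: the second one has no Twitter link.
$pages = array(
    '<div id="bio"><a class="twitter" href="http://twitter.com/a">A</a></div>',
    '<div id="bio"><p>No social links</p></div>',
    '<div id="bio"><a class="twitter" href="http://twitter.com/c">C</a></div>',
);

$twitterurl = array();
foreach ($pages as $page) {
    $html = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $html->loadHTML($page);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($html);
    $nodelist = $xpath->query("//*[@id='bio']//a[@class='twitter']/@href");

    // Exactly one entry per page: the value if found, otherwise null.
    $twitterurl[] = $nodelist->length ? $nodelist->item(0)->nodeValue : null;
}
var_dump($twitterurl);
```

Because index 1 holds null instead of being skipped, index 2 still belongs to the third profile.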

I think you have multiple issues in the way you scrape the data, and I will try to outline those in my answer in the hope that it also clarifies your central question:
I found someone's statement that said, "The nodeValue can never be null because without a value, the node wouldn't exist." That being considered, if there is no nodeValue, how can I programmatically recognize when these xPaths don't exist and fill that iteration with some other default value before it loops through to the next iteration?
First of all collecting the URLs of each profile (detail) page is a good idea. You can even benefit more from it by putting this into the overall context of your scraping job:
* profile pages
`- profile page
+- name
+- role
+- img
+- email
+- facebook
`- twitter
This is the structure of the data you would like to obtain. You have already managed to obtain all the profile page URLs:
$url = "http://www.sandiego6.com/about-us/meet-our-team";
$xPath = "//p[@class='bio']/a/@href";
$html = new DOMDocument();
@$html->loadHtmlFile($url);
$xpath = new DOMXPath($html);
$nodelist = $xpath->query($xPath);
$profileurl = array();
foreach ($nodelist as $n) {
$value = $n->nodeValue;
$profileurl[] = $value;
}
As you know that the next step is to load and query the 20+ profile pages, one of the very first things you could do is to extract the part of your code that creates a DOMXPath from a URL into a function of its own. This also makes better error handling easy:
/**
* @param string $url
*
* @throws RuntimeException
* @return DOMXPath
*/
function xpath_from_url($url)
{
$html = new DOMDocument();
$saved = libxml_use_internal_errors(true);
$result = $html->loadHtmlFile($url);
libxml_use_internal_errors($saved);
if (!$result) {
throw new RuntimeException(sprintf('Failed to load HTML from "%s"', $url));
}
$xpath = new DOMXPath($html);
return $xpath;
}
Merely moving that code into the xpath_from_url function already makes the main processing more compact:
$xpath = xpath_from_url($url);
$nodelist = $xpath->query($xPath);
$profileurl = array();
foreach ($nodelist as $n) {
$value = $n->nodeValue;
$profileurl[] = $value;
}
But it does also allow you another change to the code: You can now process the URLs directly in the structure of your main extraction routine:
$url = "http://www.sandiego6.com/about-us/meet-our-team";
$xpath = xpath_from_url($url);
$profileUrls = $xpath->query("//p[@class='bio']/a/@href");
foreach ($profileUrls as $profileUrl) {
$profile = xpath_from_url($profileUrl->nodeValue);
// ... extract the six (incl. optional) values from a profile
}
As you can see, this code skips creating the array of profile URLs, because the first XPath operation already provides a collection of all of them.
What is still missing is the extraction of up to six fields from the detail page. With this new way of iterating over the profile URLs, that is easy to manage: just create one XPath expression for each field and fetch the data. If you use DOMXPath::evaluate instead of DOMXPath::query, you can get string values directly. The string value of a non-existent node is an empty string. This does not really test whether the node exists; if you need NULL instead of "" (empty string), this has to be done differently (I can show that too, but that's not the point right now). In the following example the anchor's name and role are extracted:
foreach ($profileUrls as $i => $profileUrl) {
$profile = xpath_from_url($profileUrl->nodeValue);
printf(
"#%02d: %s (%s)\n", $i + 1,
$profile->evaluate('normalize-space(//h1[@class="entry-title"])'),
$profile->evaluate('normalize-space(//h2[@class="fn"])')
);
// ... extract the other four (incl. optional) values from a profile
}
I chose to output the values directly (rather than adding them to an array or a similar structure), so that it's easy to follow what happens:
#01: Marc Bailey (Morning Anchor)
#02: Heather Myers (Morning Anchor)
#03: Jim Patton (10pm Anchor)
#04: Neda Iranpour (10 p.m. Anchor / Reporter)
...
Fetching the details about email, facebook and twitter works the same:
foreach ($profileUrls as $i => $profileUrl) {
$profile = xpath_from_url($profileUrl->nodeValue);
printf(
"#%02d: %s (%s)\n", $i + 1,
$profile->evaluate('normalize-space(//h1[@class="entry-title"])'),
$profile->evaluate('normalize-space(//h2[@class="fn"])')
);
printf(
" email...: %s\n",
$profile->evaluate('substring-after(//*[@class="bio-email"]/a/@href, ":")')
);
printf(
" facebook: %s\n",
$profile->evaluate('string(//*[@class="bio-facebook url"]/a/@href)')
);
printf(
" twitter.: %s\n",
$profile->evaluate('string(//*[@class="bio-twitter url"]/a/@href)')
);
}
This now already outputs the data as you need it (I've left the images out because those can't be displayed well in text mode):
#01: Marc Bailey (Morning Anchor)
 email...: m.bailey@sandiego6.com
 facebook: https://www.facebook.com/marc.baileySD6
 twitter.: http://www.twitter.com/MarcBaileySD6
#02: Heather Myers (Morning Anchor)
 email...: heather.myers@sandiego6.com
 facebook: https://www.facebook.com/heather.myersSD6
 twitter.: http://www.twitter.com/HeatherMyersSD6
#03: Jim Patton (10pm Anchor)
 email...: jim.patton@sandiego6.com
 facebook: https://www.facebook.com/Jim.PattonSD6
 twitter.: http://www.twitter.com/JimPattonSD6
#04: Neda Iranpour (10 p.m. Anchor / Reporter)
 email...: Neda.Iranpour@sandiego6.com
 facebook: https://www.facebook.com/lightenupwithneda
 twitter.: http://www.twitter.com/@LightenUpWNeda
...
So now these little lines of code with one foreach loop already fairly well represent the original structure outlined:
* profile pages
`- profile page
+- name
+- role
+- img
+- email
+- facebook
`- twitter
All you have to do is make your code follow the overall structure in which the data is available. Then, once you see that all the data can be obtained as desired, do the store operation in the database: one insert per profile, that is, one row per profile. You don't have to keep all the data in memory; you can just insert the data for each row as you go (perhaps after checking whether it already exists).
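As a rough sketch of the "one row per profile" idea, including NULL for a missing field, consider the following. The page fragments, the names in them, and the first_or_null helper are hypothetical; the class names in the XPath expressions are the ones used above. The database insert itself is left out.

```php
<?php
/**
 * Hypothetical helper: string value of the first match, or null if nothing matches.
 */
function first_or_null(DOMXPath $xpath, $expression)
{
    $nodes = $xpath->query($expression);
    return ($nodes !== false && $nodes->length > 0)
        ? trim($nodes->item(0)->nodeValue)
        : null;
}

// Hypothetical profile pages: the second one has no Twitter link.
$pages = array(
    '<h1 class="entry-title">Jane Doe</h1><h2 class="fn">Anchor</h2>'
    . '<p class="bio-twitter url"><a href="http://www.twitter.com/janedoe">t</a></p>',
    '<h1 class="entry-title">John Roe</h1><h2 class="fn">Reporter</h2>',
);

$rows = array();
foreach ($pages as $page) {
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($page);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    $profile = new DOMXPath($doc);

    // One complete row per profile; this is where a single insert would go.
    $rows[] = array(
        'name'    => $profile->evaluate('normalize-space(//h1[@class="entry-title"])'),
        'role'    => $profile->evaluate('normalize-space(//h2[@class="fn"])'),
        'twitter' => first_or_null($profile, '//*[@class="bio-twitter url"]/a/@href'),
    );
}
print_r($rows);
```

Each row is complete on its own, so a missing Twitter link can no longer shift the values of other profiles.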
Hope that helps.
Appendix: Code in full
<?php
/**
* Scraping detail pages based on index page
*/
/**
* @param string $url
*
* @throws RuntimeException
* @return DOMXPath
*/
function xpath_from_url($url)
{
$html = new DOMDocument();
$saved = libxml_use_internal_errors(true);
$result = $html->loadHtmlFile($url);
libxml_use_internal_errors($saved);
if (!$result) {
throw new RuntimeException(sprintf('Failed to load HTML from "%s"', $url));
}
$xpath = new DOMXPath($html);
return $xpath;
}
$url = "http://www.sandiego6.com/about-us/meet-our-team";
$xpath = xpath_from_url($url);
$profileUrls = $xpath->query("//p[@class='bio']/a/@href");
foreach ($profileUrls as $i => $profileUrl) {
$profile = xpath_from_url($profileUrl->nodeValue);
printf(
"#%02d: %s (%s)\n", $i + 1, $profile->evaluate('normalize-space(//h1[@class="entry-title"])'),
$profile->evaluate('normalize-space(//h2[@class="fn"])')
);
printf(" email...: %s\n", $profile->evaluate('substring-after(//*[@class="bio-email"]/a/@href, ":")'));
printf(" facebook: %s\n", $profile->evaluate('string(//*[@class="bio-facebook url"]/a/@href)'));
printf(" twitter.: %s\n", $profile->evaluate('string(//*[@class="bio-twitter url"]/a/@href)'));
}

PHP - Extracting two values from a line

I'm a beginner with regular expressions and am working on a server where I cannot install anything (does using DOM methods require installing anything?).
I have a problem that I cannot solve with my current knowledge.
I would like to extract from the line below the album id and image url.
There are more lines and other url elements in the string (file), but the album ids and image urls I need are all in strings similar to the one below:
<img alt="/" src="http://img255.imageshack.us/img00/000/000001.png" height="133" width="113">
So in this case I would like to get '774' and 'http://img255.imageshack.us/img00/000/000001.png'
I've seen multiple examples of extracting just the url or one other element from a string, but I really need to keep these both together and store these in one record of the database.
Any help is really appreciated!

Since you are new to this, I'll explain that you can use PHP's HTML parser known as DOMDocument to extract what you need. You should not use a regular expression as they are inherently error prone when it comes to parsing HTML, and can easily result in many false positives.
To start, let's say you have your HTML:
$html = '<img alt="/" src="http://img255.imageshack.us/img00/000/000001.png" height="133" width="113">';
And now, we load that into DOMDocument:
$doc = new DOMDocument;
$doc->loadHTML( $html);
Now that we have the HTML loaded, it's time to find the elements we need. Let's assume that you can encounter other <a> tags within your document, so we want only those <a> tags that have an <img> tag as a direct child. Once we have checked that we have the correct nodes, we extract the information we need. So, let's have at it:
$results = array();
// Loop over all of the <a> tags in the document
foreach( $doc->getElementsByTagName( 'a') as $a) {
// If there are no children, continue on
if( !$a->hasChildNodes()) continue;
// Find the child <img> tag, if it exists
foreach( $a->childNodes as $child) {
if( $child->nodeType == XML_ELEMENT_NODE && $child->tagName == 'img') {
// Now we have the <a> tag in $a and the <img> tag in $child
// Get the information we need:
parse_str( parse_url( $a->getAttribute('href'), PHP_URL_QUERY), $a_params);
$results[] = array( $a_params['album'], $child->getAttribute('src'));
}
}
}
A print_r( $results); now leaves us with:
Array
(
[0] => Array
(
[0] => 774
[1] => http://img255.imageshack.us/img00/000/000001.png
)
)
Note that this omits basic error checking. One thing you can add is in the inner foreach loop, you can check to make sure you successfully parsed an album parameter from the <a>'s href attribute, like so:
if( isset( $a_params['album'])) {
$results[] = array( $a_params['album'], $child->getAttribute('src'));
}
Every function I've used in this can be found in the PHP documentation.
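The same extraction can also be written with XPath. The sketch below is self-contained, but note the assumption: the href value gallery.php?album=774&page=2 is hypothetical, since the markup shown in the question does not include the link target.

```php
<?php
// The href below is hypothetical; only the <img> attributes come from the question.
$html = '<a href="gallery.php?album=774&amp;page=2">'
      . '<img alt="/" src="http://img255.imageshack.us/img00/000/000001.png" height="133" width="113">'
      . '</a>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$results = array();
// "a/img" selects only <img> elements that are direct children of an <a>.
foreach ($xpath->query('//a/img') as $img) {
    // Pull the query string out of the link, then split it into parameters.
    $query = parse_url($img->parentNode->getAttribute('href'), PHP_URL_QUERY);
    $params = array();
    parse_str((string) $query, $params);

    // Keep the pair only if the link really carries an album parameter.
    if (isset($params['album'])) {
        $results[] = array($params['album'], $img->getAttribute('src'));
    }
}
print_r($results);
```

Keeping the album id and the image URL in one inner array makes it straightforward to store them together as a single database record.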

If you've already narrowed it down to this line, then you can use a regex like the following:
$matches = array();
preg_match('#.+album=(\d+).+src="([^"]+)#', $yourHtmlLineHere, $matches);
Now if you
echo $matches[1];
echo " ";
echo $matches[2];
You'll get the following:
774 http://img255.imageshack.us/img00/000/000001.png
