Using cURL and dom to scrape data with php [duplicate] - php

This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 9 years ago.
Hi, I am using cURL to get data from a website. I need to get multiple items but cannot get them by tag name or ID. I have managed to put together some code that will get one item by class name by passing it through a loop; I then pass it through another loop to get the text from the element.
I have a few problems here. The first is that I can see there must be a more convenient way of doing this. The second is that I will need to get multiple elements and stack them together, i.e. title, description, tags and a URL link.
# Create a DOM parser object and load HTML
$dom = new DOMDocument();
$result = $dom->loadHTML($html);
$finder = new DomXPath($dom);
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), 'classname')]");
$tmp_dom = new DOMDocument();
foreach ($nodes as $node)
{
$tmp_dom->appendChild($tmp_dom->importNode($node,true));
}
$innerHTML = trim($tmp_dom->saveHTML());
$buffdom = new DOMDocument();
$buffdom->loadHTML($innerHTML);
# Iterate over all the <a> tags
foreach ($buffdom->getElementsByTagName('a') as $link)
{
# Show the <a href>
echo $link->nodeValue, "<br />", PHP_EOL;
}
I want to stick with PHP only.

I wonder if your problem is in the line:
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), 'classname')]");
As it stands, this literally looks for nodes that belong to the class 'classname', where 'classname' is not a variable but the actual name. It looks like you may have copied an example from somewhere. Or did you literally name your class that?
I imagine that the data you are looking for may not be in such nodes. If you could post a short piece of the actual HTML you are trying to parse, it should be possible to do a better job of guiding you to a solution.
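If the class name you are after varies, you can build that XPath expression from a variable instead of hard-coding it. A minimal self-contained sketch (the HTML string here is just a stand-in for whatever page you actually fetch):

```php
$html = '<div class="item first">one</div><div class="other">two</div>';
$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses warnings on sloppy real-world HTML
$finder = new DOMXPath($dom);

$class = 'item'; // the class you are actually looking for
$query = sprintf(
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
    $class
);
foreach ($finder->query($query) as $node) {
    echo $node->textContent, PHP_EOL;
}
```

Padding the class list with spaces (' item ') keeps 'item' from also matching a class like 'itemized'.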
As an example, I just made the following complete code (based on yours, but adding code to open the stackoverflow.com home page, and changing 'classname' to 'question', since there seemed to be a lot of classes with 'question' in the name, so I figured I should get a good harvest). I was not disappointed.
<?php
// create curl resource
$ch = curl_init();
// set url
curl_setopt($ch, CURLOPT_URL, "http://stackoverflow.com");
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// $output contains the output string
$output = curl_exec($ch);
// close curl resource to free up system resources
curl_close($ch);
//print_r($output);
$dom = new DOMDocument();
@$dom->loadHTML($output);
$finder = new DomXPath($dom);
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), 'question')]");
print_r($nodes);
$tmp_dom = new DOMDocument();
foreach ($nodes as $node)
{
$tmp_dom->appendChild($tmp_dom->importNode($node,true));
}
$innerHTML = trim($tmp_dom->saveHTML());
$buffdom = new DOMDocument();
@$buffdom->loadHTML($innerHTML);
# Iterate over all the <a> tags
foreach($buffdom->getElementsByTagName('a') as $link) {
# Show the <a href>
echo $link->nodeValue, PHP_EOL;
echo "<br />";
}
?>
Resulted in many many lines of output. Try it - the page is at http://www.floris.us/SO/scraper.php
(or paste the above code into a page of your own). You were very, very close!
NOTE - this doesn't produce all the output you want - you need to include other properties of the node, not just print out the nodeValue, to get everything. But I figure you can take it from here (again, without actual samples of your HTML it's impossible for anyone else to get much further than this in helping you...)
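For example, to pair each link's text with its destination, read the href attribute alongside nodeValue. A self-contained sketch (the inline HTML stands in for the scraped page):

```php
$buffdom = new DOMDocument();
@$buffdom->loadHTML('<p><a href="/q/1">First</a> <a href="/q/2">Second</a></p>');
foreach ($buffdom->getElementsByTagName('a') as $link) {
    // nodeValue is the link text; getAttribute('href') is the URL it points to
    echo $link->nodeValue, ' => ', $link->getAttribute('href'), PHP_EOL;
}
```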

Related

How do I extract all URL links from an RSS feed? [duplicate]

This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 7 years ago.
I need to extract all the links to news articles from the NY Times RSS feed to a MySQL database periodically. How do I go about doing this? Can I use some regular expression (in PHP) to match the links? Or is there some other alternative way? Thanks in advance.
UPDATE 2 I tested the code below and had to modify the
$links = $dom->getElementsByTagName('a');
and change it to:
$links = $dom->getElementsByTagName('link');
It successfully outputted the links. Good Luck
UPDATE Looks like there is a complete answer here: How do you parse and process HTML/XML in PHP.
I developed a solution so that I could recurse all the links on my website. I've removed the code which verified the domain was the same with each recursion (since the question didn't ask for this), but you can easily add it back in if you need it.
Using DOMDocument, you can parse an HTML or XML document to read links. It is better than using a regex. Try something like this:
<?php
//300 seconds = 5 minutes - or however long you need so php won't time out
ini_set('max_execution_time', 300);
// using a global to store the links in case there is recursion, it makes it easy.
// You could of course pass the array by reference for cleaner code.
$alinks = array();
// set the link to whatever you are reading
$link = "http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml";
// do the search
linksearch($link, $alinks);
// show results
var_dump($alinks);
function linksearch($url, & $alinks) {
// use $queue if you want this fn to be recursive
$queue = array();
echo "<br>Searching: $url";
$href = array();
//Load the HTML page
$html = file_get_contents($url);
//Create a new DOM document
$dom = new DOMDocument;
//Parse the HTML. The @ is used to suppress any parsing errors
//that will be thrown if the $html string isn't valid XHTML.
@$dom->loadHTML($html);
//Get all links. You could also use any other tag name here,
//like 'img' or 'table', to extract other tags.
$links = $dom->getElementsByTagName('link');
//Iterate over the extracted links and display their URLs
foreach ($links as $link){
//Extract and show the "href" attribute.
$href[] = $link->getAttribute('href');
}
foreach (array_unique($href) as $link) {
// add to list of links found
$queue[] = $link;
}
// remove duplicates
$queue = array_unique($queue);
// get links that haven't yet been processed
$queue = array_diff($queue, $alinks);
// update array passed by reference with new links found
$alinks = array_merge($alinks, $queue);
if (count($queue) > 0) {
foreach ($queue as $link) {
// recursive search - uncomment out if you use this
// remember to check that the domain is the same as the one starting from
// linksearch($link, $alinks);
}
}
}
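If you do enable the recursive call, a same-domain check like the following keeps the crawl from wandering off-site. This is a sketch; the helper name and the example URLs are made up:

```php
// Return true when $candidate is on the same host as $start.
// Links without a host component (relative URLs) are treated as same-domain.
function same_domain($start, $candidate) {
    $startHost = parse_url($start, PHP_URL_HOST);
    $candidateHost = parse_url($candidate, PHP_URL_HOST);
    if ($candidateHost === null) {
        return true; // relative link, stays on the same site
    }
    return strcasecmp($startHost, $candidateHost) === 0;
}

var_dump(same_domain('http://example.com/a', 'http://example.com/b')); // bool(true)
var_dump(same_domain('http://example.com/a', 'http://other.org/c'));   // bool(false)
```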
DOM+Xpath allows you to fetch nodes using expressions.
RSS Item Links
To fetch the RSS link elements (the link for each item):
$xml = file_get_contents($url);
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXPath($document);
$expression = '//channel/item/link';
foreach ($xpath->evaluate($expression) as $link) {
var_dump($link->textContent);
}
Atom Links
The atom:link elements have different semantics: they are part of the Atom namespace and are used to describe relations. The NYT uses the standout relation to mark featured stories. To fetch the Atom links you need to register a prefix for the namespace. Attributes are nodes, too, so you can fetch them directly:
$xml = file_get_contents($url);
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXPath($document);
$xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');
$expression = '//channel/item/a:link[@rel="standout"]/@href';
foreach ($xpath->evaluate($expression) as $link) {
var_dump($link->value);
}
There are other relations like prev and next.
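To see all the relations an item carries, you can list each atom:link with its rel attribute. A self-contained sketch using a minimal feed fragment in place of the real NYT feed (the rel values here are just examples):

```php
$xml = <<<'XML'
<rss xmlns:atom="http://www.w3.org/2005/Atom">
  <channel><item>
    <atom:link rel="standout" href="http://example.com/featured"/>
    <atom:link rel="next" href="http://example.com/page2"/>
  </item></channel>
</rss>
XML;
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXPath($document);
$xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');
foreach ($xpath->evaluate('//channel/item/a:link') as $link) {
    echo $link->getAttribute('rel'), ' => ', $link->getAttribute('href'), PHP_EOL;
}
```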
HTML Links (a elements)
The description elements contain HTML fragments. To extract the links from them you have to load the HTML into a separate DOM document.
$xml = file_get_contents($url);
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXPath($document);
$xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');
$expression = '//channel/item/description';
foreach ($xpath->evaluate($expression) as $description) {
$fragment = new DOMDocument();
$fragment->loadHtml($description->textContent);
$fragmentXpath = new DOMXpath($fragment);
foreach ($fragmentXpath->evaluate('//a[@href]/@href') as $link) {
var_dump($link->value);
}
}

How do I use str_replace with DomDocument

I am using DomDocument to pull content from a specific div on a page.
I would then like to replace all instances of links with a path equal to http://example.com/test/ with http://example.com/test.php.
$url = "http://pugetsoundbasketball.com/stackoverflow_sample.php";
$doc = new DomDocument('1.0', 'UTF-8');
$doc->loadHtml(file_get_contents($url));
$div = $doc->getElementById('upcoming_league_dates');
foreach ($div->getElementsByTagName('a') as $item) {
$item->setAttribute('href', 'http://example.com/test.php');
}
echo $doc->saveHTML($div);
As you can see in the example above, str_replace causes problems after I target the upcoming_league_dates div with getElementById. I understand this but unfortunately I don't know where to go from here!
I've tried several different ways including executing the str_replace above the getElementById function (I figured I could replace the strings first and then target the specific div), with no luck.
What am I missing here?
EDIT: UPDATED CODE TO SHOW WORKING SOLUTION
You can't just use str_replace on that node. You need to access it properly first. Through the DOMElement class you can use the ->setAttribute() method to make the replacement.
Example:
$url = "http://pugetsoundbasketball.com/stackoverflow_sample.php";
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTMLFile($url);
$xpath = new DOMXpath($dom); // use xpath
$needle = 'http://example.com/test/';
$replacement = 'http://example.com/test.php';
// target the link
$links = $xpath->query("//div[@id='upcoming_league_dates']/a[contains(@href, '$needle')]");
foreach($links as $anchor) {
// replacement of those href values
$anchor->setAttribute('href', $replacement);
}
echo $dom->saveHTML();
Update: After your revision, your code is now working anyway. This just answers how to do the logic replacement (a la str_replace search/replace) from your previous question.
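If you ever do need a partial, str_replace-style rewrite of each href (rather than overwriting the whole attribute), read the value out, run str_replace on the plain string, and write it back with setAttribute. A self-contained sketch; the inline HTML and the ?p= mapping are invented for illustration:

```php
$dom = new DOMDocument();
@$dom->loadHTML('<div id="upcoming_league_dates"><a href="http://example.com/test/page1">x</a></div>');
$xpath = new DOMXpath($dom);
$needle = 'http://example.com/test/';
$replacement = 'http://example.com/test.php?p=';
foreach ($xpath->query("//a[contains(@href, '$needle')]") as $anchor) {
    // str_replace works on the attribute value as an ordinary string
    $newHref = str_replace($needle, $replacement, $anchor->getAttribute('href'));
    $anchor->setAttribute('href', $newHref);
}
```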

Scraping within a table using PHP DOM

So I have been experimenting with the PHP Simple HTML Parser and the built-in PHP DOM parser for PHP 5 to scrape a website.
We'll take this as an example: http://www.ammunitiondepot.com/12-Gauge-Shotgun-Ammo-s/1922.htm
I am trying to grab all of the product images within the v65-productDisplay table. I am able to grab all of the images on the page, but am having difficulty trying to grab only the images within the table.
This is the code I am using to grab all of the images:
$html = file_get_contents('http://www.ammunitiondepot.com/12-Gauge-Shotgun-Ammo-s/1922.htm');
$dom = new domDocument;
$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
echo $image->getAttribute('src');
}
The reason you get all <img> src attributes is basically that you run
$images = $dom->getElementsByTagName('img');
That says you want to get all <img> elements from the whole document. But actually you want only those images that are inside a specific table. At least you think so right now, but let's do this straight for the moment. The table you're looking for is the 13th table in document order. So what you do now to fix your problem is first get the <table> and then get all <img> elements from it:
$tables = $dom->getElementsByTagName('table');
$table = $tables->item(12); # 13th table in the document
$images = $table->getElementsByTagName('img');
This will then already give you the images you ask for in your question (excerpt of src attributes):
//cdn3.volusion.com/sfvhn.cpdkd/v/vspfiles/photos/SPL12-00BK-1.jpg?1381182668
/v/vspfiles/templates/ammunition/images/clear1x1.gif
/v/vspfiles/templates/ammunition/images/clear1x1.gif
/v/vspfiles/templates/ammunition/images/clear1x1.gif
/v/vspfiles/templates/ammunition/images/clear1x1.gif
/v/vspfiles/templates/ammunition/images/buttons/btn_addtocart_small.gif
/v/vspfiles/templates/ammunition/images/clear1x1.gif
...
Obviously this leads to some further problems:
The list of images has tons of those you're not interested in. You want the .jpg files and not all those many 1-pixel-gifs or shopping-cart-buttons.
The number of the table is hardcoded. This is not very stable; it would be better, say, to look for the class="v65-productDisplay" attribute (you even write it in your question already).
The image URLs are relative to the document, so need to be resolved.
First, I'll show how to solve the first two problems.
It seems that getElementsByTagName, even if useful, is not that flexible for your scraping needs. There is a better way to query elements from the document, called XPath (ref). It's a query language in which you express which elements you want. So we want image src attributes, from within a specific table, that are JPEGs. The XPath query looks like this:
//table[@class="v65-productDisplay"]/tr/td/a/img/@src[contains(., ".jpg")]
This is run with the help of a DOMXPath illustrated as following:
$xpath = new DOMXPath($dom);
$srcQuery = '//table[@class="v65-productDisplay"]/tr/td/a/img/@src[contains(., ".jpg")]';
/** @var DOMAttr $src */
foreach ($xpath->query($srcQuery) as $src) {
echo $src->nodeValue, "\n";
}
This now already reduces the list greatly to what you're looking for while being less verbose:
//cdn3.volusion.com/sfvhn.cpdkd/v/vspfiles/photos/SPL12-00BK-1.jpg?1381182668
//cdn3.volusion.com/sfvhn.cpdkd/v/vspfiles/photos/GTL1275-1.jpg?1380206953
//cdn3.volusion.com/sfvhn.cpdkd/v/vspfiles/photos/SS12L8-1.jpg?1390326206
//cdn3.volusion.com/sfvhn.cpdkd/v/vspfiles/photos/LEF127RS-1.jpg?1368458526
//cdn3.volusion.com/sfvhn.cpdkd/v/vspfiles/photos/LE13300-1.jpg?1368458467
//cdn3.volusion.com/sfvhn.cpdkd/v/vspfiles/photos/ADLE13300AC-1.jpg?1393516003
...
So now only the cleanup of the URIs is left, that is, resolving them against the document URI (as there is no further base URI in the document) and probably cleaning up the query string. I do this with the help of Net_URL2; here is the src processing alone:
/** @var DOMAttr $src */
foreach ($xpath->query($srcQuery) as $src) {
$href = $uri->resolve($src->nodeValue);
$href->setQuery(false);
echo $href, "\n";
}
And here is a full example:
<?php
/*
 * @link http://stackoverflow.com/questions/24344420/scraping-within-a-table-using-php-dom
 * @author hakre <http://hakre.wordpress.com>
*/
# uses Net_URL2 -- http://pear.php.net/package/Net_URL2/ -- https://packagist.org/packages/pear/net_url2
require __DIR__ . '/vendor/autoload.php';
$uri = new Net_URL2('http://www.ammunitiondepot.com/12-Gauge-Shotgun-Ammo-s/1922.htm');
$cache = '12-Gauge-Shotgun-Ammo-s-1922.htm';
if (is_readable($cache)) {
$html = file_get_contents($cache);
} else {
$options = [
'http' => [
'user_agent' => "Godzilla/42.4 (Gabba Gandalf Client 7.3; C128; Z80) Lord of the Table Weed Edition (KHTML, like Gold Dust Day Gecko) Chrome/97.0.43043.0 Safari/1337.42",
'max_redirects' => 1, # do not follow redirects
]
];
$context = stream_context_create($options);
$html = file_get_contents($uri, null, $context);
file_put_contents($cache, $html);
}
$dom = new domDocument;
$dom->preserveWhiteSpace = false;
$save = libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_use_internal_errors($save);
$dom->documentURI = $uri;
$xpath = new DOMXPath($dom);
$srcQuery = '//table[@class="v65-productDisplay"]/tr/td/a/img/@src[contains(., ".jpg")]';
/** @var DOMAttr $src */
foreach ($xpath->query($srcQuery) as $src) {
$href = $uri->resolve($src->nodeValue);
$href->setQuery(false);
echo $href, "\n";
}
And here is the HTML for future reference: http://pastebin.com/HCTTRm9E

DOM in PHP: Decoded entities and setting nodeValue

I want to perform certain manipulations on an XML document with PHP, using the DOM part of its standard library. As others have already discovered, one then has to deal with decoded entities. To illustrate what bothers me, I'll give a quick example.
Suppose we have the following code
$doc = new DOMDocument();
$doc->loadXML(<XML data>);
$xpath = new DOMXPath($doc);
$node_list = $xpath->query(<some XPath>);
foreach($node_list as $node) {
//do something
}
If the code in the loop is something like
$attr = "<some string>";
$val = $node->getAttribute($attr);
//do something with $val
$node->setAttribute($attr, $val);
it works fine. But if it's more like
$text = $node->textContent;
//do something with $text
$node->nodeValue = $text;
and $text contains some decoded &, it doesn't get encoded, even if one does nothing with $text at all.
At the moment, I apply htmlspecialchars on $text before I set $node->nodeValue to it. Now I want to know
if that is sufficient,
if not, what would suffice,
and if there are more elegant solutions for this, as in the case of attribute manipulation.
The XML documents I have to deal with are mostly feeds, so a solution should be pretty general.
EDIT
It turned out that my original question had the wrong scope, sorry for that. Here I provide an example where the described behaviour actually happens.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://feeds.bbci.co.uk/news/rss.xml?edition=uk");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($ch);
curl_close($ch);
$doc = new DOMDocument();
$doc->loadXML($output);
$xpath = new DOMXPath($doc);
$node_list = $xpath->query('//item/link');
foreach($node_list as $node) {
$node->nodeValue = $node->textContent;
}
echo $doc->saveXML();
If I execute this code on the CLI with
php beeb.php |egrep 'link|Warning'
I get results like
<link>http://www.bbc.co.uk/news/world-africa-23070006#sa-ns_mchannel=rss</link>
which should be
<link>http://www.bbc.co.uk/news/world-africa-23070006#sa-ns_mchannel=rss&ns_source=PublicRSS20-sa</link>
(and is, if the loop is omitted) and according warnings
Warning: main(): unterminated entity reference ns_source=PublicRSS20-sa in /private/tmp/beeb.php on line 15
When I apply htmlspecialchars to $node->textContent, it works fine, but I feel very uncomfortable doing that.
Your question is basically whether DOMText::nodeValue expects an XML-encoded string or a verbatim string.
So let's just try that out: set it to & and to &amp; and see what happens:
$doc = new DOMDocument();
$doc->loadXML('<root>*</root>');
$text = $doc->documentElement->childNodes->item(0);
echo "Before Edit: ", $doc->saveXML($text), "\n";
$text->nodeValue = "&";
echo "After Edit 1: ", $doc->saveXML($text), "\n";
$text->nodeValue = "&amp;";
echo "After Edit 2: ", $doc->saveXML($text), "\n";
The output then is as the following (PHP 5.0.0 - 5.5.0):
Before Edit: *
After Edit 1: &amp;
After Edit 2: &amp;amp;
This shows that setting the nodeValue of a DOMText-node expects a UTF-8 encoded string and the DOM library encodes the XML reserved characters automatically.
So you should not apply htmlspecialchars() onto any text you add this way. That would create a double-encoding.
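To see the double-encoding concretely: pre-encoding with htmlspecialchars() and then assigning to nodeValue encodes the ampersand twice.

```php
$doc = new DOMDocument();
$doc->loadXML('<root>*</root>');
$text = $doc->documentElement->childNodes->item(0);
// htmlspecialchars('A & B') already yields "A &amp; B" ...
$text->nodeValue = htmlspecialchars('A & B');
// ... and serializing encodes the & of &amp; once more:
echo $doc->saveXML($text), PHP_EOL; // A &amp;amp; B
```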
As you write that you experience the opposite, I suggest you execute an isolated PHP example on the command line / within your IDE so that you can see the exact output, and not that your browser renders this as HTML, making you think the reserved XML characters have not been encoded.
As you have pointed out, you're not editing a DOMText but a DOMElement node. That works a bit differently: here the & character needs to be passed as the entity &amp; instead of verbatim &, however only this character.
So this needs a little bit more work:
Read out the text content and turn it into a DOMText node. Everything will be perfectly encoded.
Remove the node value of the element node so that it's empty.
Append the DOMText node from the first step as a child.
And done. Here is your inner foreach, modified to show this:
foreach($node_list as $node) {
$text = $doc->createTextNode($node->textContent);
$node->nodeValue = "";
$node->appendChild($text);
}
For your concrete example, though, I must admit I don't understand why you do that: it does not change the value, so none of this would be needed.
Tip: In PHP DOMDocument can open this feed directly, you don't need curl here:
$doc = new DOMDocument();
$doc->load("http://feeds.bbci.co.uk/news/rss.xml?edition=uk");
As hakre explained, the problem is that in PHP's DOM library, the behaviour of setting nodeValue w.r.t. entities depends on the class of the node, in particular DOMText and DOMElement differ in this regard.
To illustrate this, an example:
$doc = new DOMDocument();
$doc->formatOutput = True;
$doc->loadXML('<root/>');
$s = 'text &<<"\'&text;&text';
$root = $doc->documentElement;
$node = $doc->createElement('tag1', $s); #line 10
$root->appendChild($node);
$node = $doc->createElement('tag2');
$text = $doc->createTextNode($s);
$node->appendChild($text);
$root->appendChild($node);
$node = $doc->createElement('tag3');
$text = $doc->createCDATASection($s);
$node->appendChild($text);
$root->appendChild($node);
echo $doc->saveXML();
outputs
Warning: DOMDocument::createElement(): unterminated entity reference text in /tmp/DOMtest.php on line 10
<?xml version="1.0"?>
<root>
<tag1>text &<<"'&text;</tag1>
<tag2>text &amp;&lt;&lt;"'&amp;text;&amp;text</tag2>
<tag3><![CDATA[text &<<"'&text;&text]]></tag3>
</root>
In this particular case, it is appropriate to alter the nodeValue of DOMText nodes. Combining hakre's two answers one gets a quite elegant solution.
$doc = new DOMDocument();
$doc->loadXML(<XML data>);
$xpath = new DOMXPath($doc);
$node_list = $xpath->query(<some XPath>);
$visitTextNode = function (DOMText $node) {
$text = $node->textContent;
/*
do something with $text
*/
$node->nodeValue = $text;
};
foreach ($node_list as $node) {
if ($node->nodeType == XML_TEXT_NODE) {
$visitTextNode($node);
} else {
foreach ($node->childNodes as $child) {
if ($child->nodeType == XML_TEXT_NODE) {
$visitTextNode($child);
}
}
}
}

How to get the html inside a $node rather than just the $nodeValue [duplicate]

This question already has answers here:
How to get innerHTML of DOMNode?
(9 answers)
Closed 2 years ago.
Description of the current situation:
I have a folder full of pages (pages-folder); each page inside that folder has (among other things) a div with id="short-info".
I have code that pulls all the <div id="short-info">...</div> elements from that folder and displays the text inside by using textContent (which is, for this purpose, the same as nodeValue).
The code that loads the divs:
<?php
$filename = glob("pages-folder/*.php");
sort($filename);
foreach ($filename as $filenamein) {
$doc = new DOMDocument();
$doc->loadHTMLFile($filenamein);
$xpath = new DOMXpath($doc);
$elements = $xpath->query("*//div[#id='short-info']");
foreach ($elements as $element) {
$nodes = $element->childNodes;
foreach ($nodes as $node) {
echo $node->textContent;
}
}
}
?>
Now the problem is that if the page I am loading has a child, like an image: <div id="short-info"> <img src="picture.jpg"> Hello world </div>, the output will only be Hello world rather than the image and then Hello world.
Question:
How do I make the code display the full html inside the div id="short-info" including for instance that image rather than just the text?
You have to make an undocumented call on the node.
$node->c14n() Will give you the HTML contained in $node.
Crazy right? I lost some hair over that one.
http://php.net/manual/en/class.domnode.php#88441
Update
This will modify the markup, though: c14n() produces canonical XML, not an HTML serialization. It is better to use
$html = $Node->ownerDocument->saveHTML( $Node );
instead.
You'd want what amounts to 'innerHTML', which PHP's DOM doesn't directly support. One workaround for it is in the PHP docs.
Another option is to take the $node you've found, insert it as the top-level element of a new DOM document, and then call saveHTML() on that new document.
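That idea can also be done without a second document by serializing each child of the node individually. A sketch; the innerHTML function name is made up:

```php
// Concatenate the serialized form of every child node of $node.
function innerHTML(DOMNode $node) {
    $html = '';
    foreach ($node->childNodes as $child) {
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$doc = new DOMDocument();
@$doc->loadHTML('<div id="short-info"><img src="picture.jpg"> Hello world </div>');
echo innerHTML($doc->getElementById('short-info'));
// prints the <img> tag followed by the text, not just "Hello world"
```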
