Using Zend_Dom as a screen scraper - php

How?
More to the point...
this:
$url = 'http://php.net/manual/en/class.domelement.php';
$client = new Zend_Http_Client($url);
$response = $client->request();
$html = $response->getBody();
$dom = new Zend_Dom_Query($html);
$result = $dom->query('div.note');
Zend_Debug::dump($result);
gives me this:
object(Zend_Dom_Query_Result)#867 (7) {
["_count":protected] => NULL
["_cssQuery":protected] => string(8) "div.note"
["_document":protected] => object(DOMDocument)#79 (0) {
}
["_nodeList":protected] => object(DOMNodeList)#864 (0) {
}
["_position":protected] => int(0)
["_xpath":protected] => NULL
["_xpathQuery":protected] => string(33) "//div[contains(#class, ' note ')]"
}
And I cannot for the life of me figure out how to do anything with this.
I want to extract the various parts of the retrieved data (that being the div with the class "note" and any of the elements inside it... like the text and urls) but cannot get anything working.
Someone pointed me to the DOMElement class over at php.net but when I try using some of the methods mentioned, I can't get things to work. How would I grab a chunk of html from a page and go through it grabbing the various parts? How do I inspect this object I am getting back so I can at least figure out what is in it?
Help?

The Iterator implementation of Zend_Dom_Query_Result returns a DOMElement object for each iteration:
foreach ($result as $element) {
    var_dump($element instanceof DOMElement); // always true
}
From the $element variable, you can use any DOMElement method:
foreach ($result as $element) {
    echo 'Element Id: '.$element->getAttribute('id').PHP_EOL;
    if ($element->hasChildNodes()) {
        echo 'Element has child nodes'.PHP_EOL;
    }
    $aNodes = $element->getElementsByTagName('a');
    // etc
}
You can also access the owner document, either from the element itself or via Zend_Dom_Query_Result:
$document1 = $element->ownerDocument;
$document2 = $result->getDocument();
var_dump($document1 === $document2); // true
echo $document1->saveHTML();
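Coming back to the original goal (the text and URLs inside each div.note), a minimal sketch built on the loop above could look like this; it assumes the interesting links are plain <a> elements inside each note:
foreach ($result as $element) {
    // Plain text of the whole note
    echo trim($element->textContent).PHP_EOL;

    // Any URLs inside the note
    foreach ($element->getElementsByTagName('a') as $a) {
        echo $a->getAttribute('href').PHP_EOL;
    }
}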

Related

Is there any php function to find any url title or description?

I am new to data scraping. I am working on URL-to-title scraping; I want to write a function that takes a URL/link as the request and in return gives me the <title></title>, og:title, og:description, and all the other meta properties.
So far I am trying this function to scrape only the title:
/**
 * @param Request $request
 * @return \Illuminate\Http\JsonResponse
 *
 * @throws ValidationException
 */
public function getTitle(Request $request)
{
    $this->validate($request, [
        'link' => 'required',
    ]);
    $link = $request->input('link');
    $str = @file_get_contents($link);
    if (strlen($str) > 0) {
        $str = trim(preg_replace('/\s+/', ' ', $str));
        preg_match("/\<title\>(.*)\<\/title\>/i", $str, $title);
        $result = $title[1];
    }
    return Response::json([
        'message' => 'Get title',
        'data' => $result,
    ], \Symfony\Component\HttpFoundation\Response::HTTP_OK);
}
route
Route::post('request-title', 'BuyShipRequestController@getTitle');
Example of what I submit in the input field:
Amazon-url
and what I want in my return response:
<title>Amazon.com: Seagate Portable 2TB External Hard Drive Portable HDD – USB 3.0 for PC, Mac, PS4, & Xbox - 1-Year Rescue Service (STGX2000400): Computers & Accessories</title>
and
<meta name="description"/> , <meta name="title"/>, <meta name="keywords" /> , link
In the return response I only want the content/value of those meta properties.
A pretty straightforward way without the need to use external libraries would be to use XPath to query an HTML document:
XPath expression   Result
//div              Returns all div tags
//meta             Returns all meta tags
//meta[@name]      Returns all meta tags having a 'name' attribute
In PHP, XPath is available via DomXPath. Since XPath works on a DOM tree, we'd need a DomDocument first:
$dom = new DomDocument;
$dom->loadHTML($some_html);
$xpath = new DomXPath($dom);
$xpath->query(".//meta");
So, given the document you've provided...
$html = file_get_contents('amazon.html');
...we could write up a basic function to query it for a set of tags:
function get_from_html(string $html, array $tags) {
    $collect = [];
    // Turn off default error reporting so we're not drowning
    // in errors when the HTML is malformed. We can get a
    // hold of them anytime via libxml_get_errors().
    // Cf. https://www.php.net/libxml_use_internal_errors
    libxml_use_internal_errors(true);
    // Turn HTML string into a DOM tree.
    $dom = new DomDocument;
    $dom->loadHTML($html);
    // Set up XPath
    $xpath = new DomXPath($dom);
    // Query the DOM tree for the given set of tags.
    foreach ($tags as $tag) {
        // You can do *a lot* more with XPath, cf. this cheat sheet:
        // https://gist.github.com/LeCoupa/8c305ec8c713aad07b14
        $result = $xpath->query("//{$tag}");
        if ($result instanceof DOMNodeList) {
            $collect[$tag] = $result;
        }
    }
    // Clear errors to free up memory, cf.
    // https://www.php.net/manual/de/function.libxml-use-internal-errors.php#78236
    libxml_clear_errors();
    return $collect;
}
When invoking it ...
$results = get_from_html($html, ['title', 'meta']);
...it returns an array of iterable DOMNodeList objects, which you could easily evaluate further (for example, to examine the attributes of all nodes in the list):
// For demonstration purposes, just walk the results and turn each found node
// back to its HTML representation.
//
// For real world stuff, cf.:
// - https://www.php.net/manual/en/class.domnodelist.php
// - https://www.php.net/manual/en/class.domnode.php
// - https://www.php.net/manual/en/class.domelement.php
if (!empty($results)) {
    foreach ($results as $key => $nodes) {
        if ($key == 'title') {
            $node = $nodes->item(0);
            // Get HTML, cf. https://stackoverflow.com/a/12909924/3323348
            // Output: <title>Amazon.com: Seagate (...)</title>
            var_dump($node->ownerDocument->saveHTML($node));
        }
        if ($key == 'meta') {
            foreach ($nodes as $node) {
                // Get HTML, cf. https://stackoverflow.com/a/12909924/3323348
                // Output: <meta (...)>
                var_dump($node->ownerDocument->saveHTML($node));
                // Or get an attribute
                if ($node->hasAttribute('name')) {
                    // Output: "keywords", or "description", or...
                    var_dump($node->getAttribute('name'));
                }
            }
        }
    }
}
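And if, as in the controller above, only the title text and the content of each named meta tag are wanted, the node lists can be reduced to a plain array before building the JSON response. A sketch, reusing get_from_html() from above (the og: tags use a property attribute rather than name, hence the fallback):
$results = get_from_html($html, ['title', 'meta']);
$data = ['title' => null, 'meta' => []];

if (isset($results['title']) && $results['title']->length > 0) {
    $data['title'] = trim($results['title']->item(0)->textContent);
}
if (isset($results['meta'])) {
    foreach ($results['meta'] as $node) {
        // <meta name="description" content="..."> or <meta property="og:title" content="...">
        $name = $node->getAttribute('name') ?: $node->getAttribute('property');
        if ($name !== '') {
            $data['meta'][$name] = $node->getAttribute('content');
        }
    }
}
// $data could now be passed to Response::json(...) or similar.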
On XPath:
https://github.com/code4libtoronto/2016-07-28-librarycarpentrylessons/blob/master/xpath-xquery/lesson.md (Introduction, Tutorial)
https://devhints.io/xpath (Cheatsheet)
https://gist.github.com/LeCoupa/8c305ec8c713aad07b14 (Cheatsheet)

Copy SimpleXMLElement children to a different object

Given two SimpleXMLElement objects structured as follows (identical):
$obj1
object SimpleXMLElement {
["#attributes"] =>
array(0) {
}
["event"] =>
array(1) {
[0] =>
object(SimpleXMLElement) {
["#attributes"] =>
array(1) {
["url"] =>
string "example.com"
}
}
}
}
$obj2
object SimpleXMLElement {
["#attributes"] =>
array(0) {
}
["event"] =>
array(1) {
[0] =>
object(SimpleXMLElement) {
["#attributes"] =>
array(1) {
["url"] =>
string "another-example.com"
}
}
}
}
I am trying to copy all the child event items from the second object to the first like so:
foreach ($obj2->event as $event) {
    $obj1->event[] = $event;
}
They get copied, but the copied objects are now empty.
$obj1
object SimpleXMLElement {
["#attributes"] =>
array(0) {
}
["event"] =>
array(2) {
[0] =>
object(SimpleXMLElement) {
["#attributes"] =>
array(1) {
["url"] =>
string "example.com"
}
}
[1] =>
object(SimpleXMLElement) {
}
}
}
SimpleXML's editing functionality is rather limited, and in this case, both the assignment you're doing and ->addChild can only set the text content of the new element, not its attributes and children.
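A quick sketch of that limitation, using two minimal documents similar to yours (not part of the original question):
$obj1 = simplexml_load_string('<foo><event url="example.com"/></foo>');
$obj2 = simplexml_load_string('<foo><event url="another-example.com"/></foo>');

$obj1->event[] = $obj2->event[0]; // only the text content is copied
echo $obj1->asXML();              // the appended <event/> has no url attribute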
This may be one case where you need the power of the more complex DOM API. Luckily, you can mix the two in PHP with almost no penalty, using dom_import_simplexml and simplexml_import_dom to switch to the other wrapper.
Once you have a DOM representation, you can use the appendChild method of DOMNode to add a full node, but there's a catch - you can only add nodes "owned by" the same document. So you have to first call the importNode method on the document you're trying to edit.
Putting it all together, you get something like this:
// Set up the objects you're copying from and to
// These don't need to be complete documents, they could be any element
$obj1 = simplexml_load_string('<foo><event url="example.com" /></foo>');
$obj2 = simplexml_load_string('<foo><event url="another.example.com" /></foo>');
// Get a DOM representation of the root element we want to add things to
// Note that $obj1_dom will always be a DOMElement not a DOMDocument,
// because SimpleXML has no "document" object
$obj1_dom = dom_import_simplexml($obj1);
// Loop over the SimpleXML version of the source elements
foreach ($obj2->event as $event) {
    // Get the DOM representation of the element we want to copy
    $event_dom = dom_import_simplexml($event);
    // Copy the element into the "owner document" of our target node
    $event_dom_copy = $obj1_dom->ownerDocument->importNode($event_dom, true);
    // Add the node as a new child
    $obj1_dom->appendChild($event_dom_copy);
}
// Check that we have the right output, not trusting var_dump or print_r
// Note that we don't need to convert back to SimpleXML
// - $obj1 and $obj1_dom refer to the same data internally
echo $obj1->asXML();
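If you need this in more than one place, the same round trip can be wrapped in a small helper. A sketch; the function name sxml_append is made up for the example:
// Append a deep copy of $child (attributes and descendants included) to $parent.
function sxml_append(SimpleXMLElement $parent, SimpleXMLElement $child)
{
    $parent_dom = dom_import_simplexml($parent);
    $child_dom  = dom_import_simplexml($child);

    // importNode(..., true) makes a deep copy owned by the target's document
    $copy = $parent_dom->ownerDocument->importNode($child_dom, true);
    $parent_dom->appendChild($copy);
}

// Usage with the objects from above:
foreach ($obj2->event as $event) {
    sxml_append($obj1, $event);
}
echo $obj1->asXML();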

SimpleXml::xpath() after simplexml_import_dom() operates on the whole DomDocument and not just the Node

I'm not sure if this is the expected behavior or if I'm doing something wrong:
<?php
$xml = '<?xml version="1.0"?>
<foobar>
<foo>
<nested>
<img src="example1.png"/>
</nested>
</foo>
<foo>
<nested>
<img src="example2.png"/>
</nested>
</foo>
</foobar>';
$dom = new DOMDocument();
$dom->loadXML($xml);
$node = $dom->getElementsByTagName('foo')[0];
$simplexml = simplexml_import_dom($node);
echo $simplexml->asXML() . "\n";
echo " === With // ====\n";
var_dump($simplexml->xpath('//img'));
echo " === With .// ====\n";
var_dump($simplexml->xpath('.//img'));
Even though I only imported a specific DomNode, and asXml() returns only that part, the xpath() still seems to operate on the whole document.
I can prevent that by using .//img, but that seemed rather strange to me.
Result:
<foo>
<nested>
<img src="example1.png"/>
</nested>
</foo>
=== With // ====
array(2) {
[0] =>
class SimpleXMLElement#4 (1) {
public $@attributes =>
array(1) {
'src' =>
string(12) "example1.png"
}
}
[1] =>
class SimpleXMLElement#5 (1) {
public $@attributes =>
array(1) {
'src' =>
string(12) "example2.png"
}
}
}
=== With .// ====
array(1) {
[0] =>
class SimpleXMLElement#5 (1) {
public $@attributes =>
array(1) {
'src' =>
string(12) "example1.png"
}
}
}
It is expected behavior. You're importing a DOM element node into a SimpleXMLElement. This does not modify the XML document in the background - the node keeps its context.
There are Xpath expressions that go up (parent::, ancestor::) or to siblings (preceding-sibling::, following-sibling::), so the imported node still knows about the rest of the document.
Location paths starting with a / are always relative to the document, not the context node. An explicit reference to the current node with . avoids that. .//img is short for ./descendant-or-self::node()/img - an alternative would be descendant::img.
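With the $simplexml object from the question, the difference is easy to check; a small sketch:
var_dump(count($simplexml->xpath('//img')));            // int(2) - whole document
var_dump(count($simplexml->xpath('.//img')));           // int(1) - below the imported <foo>
var_dump(count($simplexml->xpath('descendant::img')));  // int(1) - same as .//img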
However you don't need to convert the DOM node into a SimpleXMLElement to use Xpath.
$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXpath($document);
foreach ($xpath->evaluate('//foo[1]') as $foo) {
    var_dump(
        $xpath->evaluate('string(.//img/@src)', $foo)
    );
}
Output:
string(12) "example1.png"
//foo[1] fetches the first foo element node in the document. If there is no matching element in the document, it will return an empty list. Using foreach avoids an error in that case: the loop iterates once or not at all.
string(.//img/@src) fetches the src attributes of the descendant img elements and casts the first one into a string. If there is no matching node, the return value will be an empty string. The second argument to DOMXpath::evaluate() is the context node.
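To make the context argument concrete: evaluating the same relative expression against the second foo element (reusing $xpath from above) returns the other image; a short sketch:
$secondFoo = $xpath->evaluate('//foo[2]')->item(0);
var_dump($xpath->evaluate('string(.//img/@src)', $secondFoo));
// string(12) "example2.png"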

get data in object - SimpleXML

So I've got a simple function that works, but I'm trying to build up my experience with OOP and make sure I can reuse my code without having to edit everything.
Here is my simple function
$xmlfeed = file_get_contents('/forum/syndication.php?limit=3');
$xml = new SimpleXMLElement($xmlfeed);
$result = $xml->xpath('channel/item/title');
while (list(, $node) = each($result)) {
    echo $node;
}
Now, so far I have got to this point:
class ForumFeed {
    private function getXMLFeeds($feed = 'all'){
        /*
        Fetch the XML feeds
        */
        $globalFeedXML = file_get_contents('/forum/syndication.php?limit=3');
        $newsFeedXML = file_get_contents('/forum/syndication.php?fid=4&limit=3');
        /*
        Turn feed strings into actual objects
        */
        $globalFeed = new SimpleXMLElement($globalFeedXML);
        $newsFeed = new SimpleXMLElement($newsFeedXML);
        /*
        Return requested feed
        */
        if ($feed == 'news') {
            return $newsFeed;
        } else if ($feed == 'all') {
            return $globalFeed;
        } else {
            return false;
        }
    }
    public function formatFeeds($feed) {
        /*
        Format Feeds for displayable content..
        For now we're only interested in the titles of each feed
        */
        $getFeed = $this->getXMLFeeds($feed);
        return $getFeed->xpath('channel/item/title');
    }
}
$feeds = new ForumFeed();
However, when I try to echo $feeds->formatFeeds('all'); it doesn't output anything. The result is blank.
What am I doing wrong?
var_dump($feeds->formatFeeds('all')); returns
array(3) {
[0]=>
object(SimpleXMLElement)#3 (0) {
}
[1]=>
object(SimpleXMLElement)#4 (0) {
}
[2]=>
object(SimpleXMLElement)#5 (0) {
}
}
According to PHP's documentation, SimpleXMLElement::xpath returns an array of SimpleXMLElement objects or false on error. Maybe var_dump($feeds->formatFeeds('all')); prints something you can then use to debug.
Edit: The XPath query does return results, so probably there is a logical error in your query or the returned elements don't have any content.
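For what it's worth: echoing the returned array directly only prints "Array", so the titles need to be echoed one by one, each SimpleXMLElement being cast to a string, much like the original while loop did. A minimal sketch:
foreach ($feeds->formatFeeds('all') as $title) {
    echo (string) $title, PHP_EOL;
}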

How to extract all image urls from a nested object or json string

Given a JSON parsed object with deep nesting, I would like to extract an (array of) image(s) from a nested structure like this:
object: {
type: "...",
title:"...",
description: {
image:[
src:"logo1.png",
...:...
]
},
somethingelse: {
deeper:[
{imageurl:"logo2.jpg"}
]
}
}
How would I create a function that returns an array of images, like this?
$images = getAllImagesFromObject(json_decode($jsonstring));
I do not know beforehand how deep the nesting will be or what the keys will be; any string beginning with http and ending in jpg, png or gif would be useful.
I do not have an example since I do not know which method to use, and I do not care what the key is, so some var_dump output is also OK.
Perhaps even running a regex over the JSON string for "http://....[jpg|gif|png]" would be a solution.
Something like this should do the trick. I did not test though.
function getAllImagesFromObject($obj) {
    $result = array();
    foreach ($obj as $prop) {
        // Recurse into nested arrays/objects, collect string values
        // that look like image file names.
        if (is_array($prop) || is_object($prop)) {
            $result = array_merge($result, getAllImagesFromObject($prop));
        } elseif (is_string($prop) && preg_match('/\.(jpg|jpeg|gif|png|bmp)$/', $prop)) {
            $result[] = $prop;
        }
    }
    return $result;
}
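Usage could then look roughly like this; json_decode and the http filter follow the question's description, and the regex for full URLs is just an example:
$data = json_decode($jsonstring);
$images = getAllImagesFromObject($data);

// Optionally keep only full http(s) URLs ending in an image extension
$imageUrls = array_filter($images, function ($value) {
    return (bool) preg_match('#^https?://.+\.(jpg|jpeg|gif|png)$#i', $value);
});

var_dump($imageUrls);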
