PHP XPath Scrape: Possible Namespace Issue

UPDATE: The source code is very much different from what Developer Tools shows.
Check out the source: view-source:http://www.machinerytrader.com/list/list.aspx?ETID=1&catid=1002
Is that JavaScript that needs to be rendered by a browser into HTML? If so, how can I have PHP do that so that I have HTML to parse? It's weird that you can use XPath Checker to return the items I'm looking for (see below), but you cannot access the full HTML!
(XPath: //table[contains(@id, 'ctl00_ContentPlaceHolder1') and (contains(@id,"tblContent") or contains(@id,"tblListingHeader"))])
END UPDATE
I need to scrape some information off of this site for work on a regular basis. I am attempting to write some PHP code to scrape this data. I think I have some namespace issues here, having read a number of other posts on SO. I have never encountered namespace problems before and used the approach shown on another SO post (to no avail :().
It appears the xpath query is just not happening for whatever reason. If you have any guesses or solutions as to how to handle this issue, I am open for suggestions.
Also here is the output from my code:
object(DOMXPath)#2 (0) {
}
Debug 1
array(0) {
}
array(0) {
}
I left out the bottom of the code where I var_dump testarray and create and var_dump otherarray. Their output is included above. Obviously the two arrays will be empty if the DOMXPath element has length 0 as well.
$string = 'http://www.machinerytrader.com/list/list.aspx?ETID=1&catid=1002';
$machine_trader = file_get_contents($string);
$xml = new DOMDocument();
$xml->loadHTML($machine_trader);
$xpath = new DOMXPath($xml);
$rootNamespace = $xml->lookupNamespaceUri($xml->namespaceURI);
$xpath->registerNamespace('x', $rootNamespace);
$tableRows = $xpath->query("//x:table[contains(@id, 'ctl00_ContentPlaceHolder1') and (contains(@id,'tblContent') or contains(@id,'tblListingHeader'))]");
var_dump($xpath);
$testarray = array();
$otherarray = array();
foreach ( $tableRows as $row )
{
echo "Debug 1"."\n";
$testarray[] = $row->nodeValue;
}

This is not an XPath issue: the actual content is delivered in response to a form post, which your code never reaches. The JavaScript in the page source does nothing more than authenticate a proper 'user' for the information request and then send the request via form submission.
On each request, the salt / encryption 'key' is randomized, which prevents simple scrapes.
You could port that JavaScript to PHP and then issue two requests, battling the authentication process along the way.
Or, rather than reverse-engineering this, you could switch your scraping to Node.js and use something like PhantomJS, which can evaluate JavaScript while still giving you programmatic access. Given the complexity of this task, it'd be much simpler to use the right tool.
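On the namespace suspicion in the question: as far as I know, DOMDocument::loadHTML() puts elements in no namespace, so the query should work without registerNamespace() or an x: prefix once the markup is actually present. A minimal sketch, assuming a made-up HTML fragment in place of the real page (the table IDs come from the question, everything else is hypothetical):

```php
<?php
// Hypothetical stand-in for the rendered page; the real markup is not reproduced here.
$html = <<<HTML
<html><body>
<table id="ctl00_ContentPlaceHolder1_tblListingHeader"><tr><td>Header</td></tr></table>
<table id="ctl00_ContentPlaceHolder1_tblContent"><tr><td>Listing</td></tr></table>
<table id="unrelated"><tr><td>Other</td></tr></table>
</body></html>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// No namespace prefix: elements loaded with loadHTML() are matched by plain name.
$query = "//table[contains(@id, 'ctl00_ContentPlaceHolder1') and "
       . "(contains(@id, 'tblContent') or contains(@id, 'tblListingHeader'))]";

$matchedIds = [];
foreach ($xpath->query($query) as $table) {
    $matchedIds[] = $table->getAttribute('id');
}
echo count($matchedIds), " tables matched\n";
```

If the same query returns nothing against the string fetched with file_get_contents(), that points to the tables being absent from the response rather than to the XPath.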

Related

PHP - file_get_html not returning anything

I am trying to scrape data from this site. Using "inspect" I checked the class of the div, but when I try to get it, nothing is displayed:
Trying to get the "Diamond" below "Supremacy".
What I am using:
<?php
include('simple_html_dom.php');
$memberName = $_GET['memberName'];
$html = file_get_html('https://destinytracker.com/d2/profile/pc/'.$memberName.'');
preg_match("/<div id=\"dtr-rating\".*span>/", $html, $data);
var_dump($data);
?>
FYI, simple_html_dom is a package available on SourceForge at http://simplehtmldom.sourceforge.net/. See the documentation.
file_get_html(), from simple_html_dom, does not return a string; it returns an object that has methods you can call to traverse the HTML document. To get a string from the object, do:
$url = 'https://destinytracker.com/d2/profile/pc/'.$memberName;
$html_str = file_get_html($url)->plaintext;
But if you are going to do that, you might as well just do:
$html_str = file_get_contents($url);
and then run your regex on $html_str.
BUT ... if you want to use the power of simple_html_dom ...
$html_obj = file_get_html($url);
$the_div = $html_obj->find('div[id=dtr-rating]', 0);
$inner_str = $the_div->innertext;
I'm not sure how to do exactly what you want, because when I look at the source of the web link you provided, I cannot find a <div> with id="dtr-rating".
My other answer is about using simple_html_dom. After looking at the HTML doc in more detail, I see the problem is different than I first thought (I'll leave it there for pointers on better use of simple_html_dom).
I see that the web page you are scraping is a VueJS application. That means the HTML sent by the web server causes JavaScript to run and build the dynamic contents of the web page that you see displayed. In other words, the <div> you are looking for with the regex DOES NOT EXIST in the HTML sent by the server. Your regex cannot find anything because it's not there.
In Chrome, press Ctrl+U to see what the web server sent (no "Supremacy"). Press Ctrl+Shift+I and look under the "Elements" tab to see the HTML after the JavaScript has done its magic (this does have "Supremacy").
This means you won't be able to get the initial HTML of the web page and scrape it to get the data you want.
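One way to confirm this from PHP is to test whether the element exists in the markup the server actually sent. A rough sketch, assuming two invented HTML strings standing in for the server response and the rendered DOM; staticHtmlHasId is a hypothetical helper, not part of any library:

```php
<?php
// Hypothetical helper: reports whether any element with the given id exists in the markup.
function staticHtmlHasId(string $html, string $id): bool
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate imperfect real-world HTML
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // XPath 1.0 has no parameter binding; this sketch assumes $id contains no quote characters.
    return $xpath->query("//*[@id='" . $id . "']")->length > 0;
}

// Invented examples: an empty application shell vs. the DOM after the scripts have run.
$serverSent = '<html><body><div id="app"></div><script src="app.js"></script></body></html>';
$afterJs = '<html><body><div id="app"><div id="dtr-rating"><span>Diamond</span></div></div></body></html>';

var_dump(staticHtmlHasId($serverSent, 'dtr-rating')); // bool(false)
var_dump(staticHtmlHasId($afterJs, 'dtr-rating'));    // bool(true)
```

Running the same check on the string returned by file_get_contents($url) would show which of the two situations applies.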

Get pixel coordinates of HTML/DOM elements using PHP

I am working on a web crawler/site analyzer in PHP. What I need to do is extract some tags from an HTML file and compute some attributes (such as image size, for example). I can easily do this using a DOM parser, but I also need to find the pixel coordinates and size of an HTML/DOM tree element (say I have a div and I need to know which area it covers and at which coordinates it starts). I can define a standard screen resolution, that is not a problem for me, but I need to retrieve the pixel coordinates automatically, using a server-side PHP script (or by calling some Java app from the console or something similar, if needed).
From what I understand, I need a headless browser in PHP that would simulate/render a webpage, from which I can retrieve the pixel coordinates I need. Could you recommend an open-source solution for that? Some code snippets would also be useful, so that I don't install the solution and then find it does not provide pixel coordinates.
PS: I see people who answered missed the point of the question, so it means I did not explain well that I need this solution to work COMPLETELY server-side. Say I use a crawler and it feeds html pages to my script. I could launch it from browser, but also from console (like 'php myScript.php').
Maybe you can store the coordinates as metadata on the tag itself using JavaScript:
$("element").each(function () { var pos = $(this).offset(); $(this).attr("data-coordinates", pos.top + "," + pos.left); });
Then you have to request it with PHP:
$html = file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$tags = $doc->getElementsByTagName('element');
foreach ($tags as $tag) {
echo $tag->getAttribute('data-coordinates'); // prints the coordinates of each tag
}
A headless browser is overkill for what you're trying to achieve. Just use cookies to store whatever you want.
So any time you get some piece of information, such as an X,Y coordinate, scroll position, etc. in javascript, simply send it to a PHP script that makes a cookie out of it with some unique string index.
Eventually, you'll have a large array of cookie data that will be directly available to any PHP or javascript file, and you can do anything you'd like with it at that point.
For example, if you wanted to just store stuff in sessions, you could do:
jquery:
// save whatever you want from javascript
// note: probably better to POST, since we're not getting anything really, just showing quick example
$.get('save-attr.php?attr=xy_coord&value=300,550');
PHP:
// this will be the save-attr.php file
session_start();
$_SESSION[$_GET['attr']] = $_GET['value'];
// now any other script can get this value like so:
$coordinates = $_SESSION['xy_coord'];
// where $coordinates would now equal "300,550"
Simply continue this pattern for whatever you need access to in PHP.

How can I load a url obtained from $_SERVER['REQUEST_URI'] into DOMDocument?

How to load a URL obtained from $_SERVER['REQUEST_URI'] into domDocument?
I am trying to load a dynamic webpage into DOMDocument to be parsed for certain words. Ultimately I want to create a glossary for my site (Tiki Wiki CMS). I started very simple and right now I am only trying to load a page and parse the text for testing purposes.
I am new to DOMDocument and after reading several articles on this site and in the PHP Manual, I know that I have to load an HTML page with loadHTMLFile, then parse the site with getElementById or getElementsByTagName in order to do stuff with it. It works fine for static pages, but the main problem I am having is that I cannot enter a static URL into loadHTMLFile, because parsing should be performed when the site is uploaded by the user.
Here's the code that DID work:
$url = 'http://mysite.org/bbk/tiki-index.php?page=pagetext';
$dom = new DOMDocument;
$dom->loadHTMLFile($url);
$a = $dom->getElementsByTagName('a');
foreach ($a as $link) {
echo $link->nodeValue;
}
So, I thought I could use $_SERVER['REQUEST_URI'] for the job, but it did not work.
This did NOT work (no error message):
$url = $_SERVER['REQUEST_URI'];
$dom = new DOMDocument;
$dom->loadHTMLFile($url);
$a = $dom->getElementsByTagName('a');
foreach ($a as $link) {
echo $link->nodeValue;
}
After checking what the $url output was, I decided to add http://mysite.org to it to make it identical to the url that worked. However, no luck either and this time I got an internal server error.
This did NOT work either (Internal Server Error):
$url = 'http://mysite.org' . $_SERVER['REQUEST_URI'];
$dom = new DOMDocument;
$dom->loadHTMLFile($url);
$a = $dom->getElementsByTagName('a');
foreach ($a as $link) {
echo $link->nodeValue;
}
I think I am missing something substantial here and I thought it might just not be possible to use DOMDocument in this way, so I was searching the web for help again (if it is possible to use $_SERVER['REQUEST_URI'] in combination with DOMdocument at all), but I didn't find an answer. So I hope anybody here can help. Any suggestions including third party parsers etc. would be helpful, except anything that requires parsing with regex. Tiki Wiki CMS already has a glossary option done with regex, but it is very buggy.
Thanks.
UPDATE
I haven't found an answer to the problem, but I think I have an idea on where my mistake was. I was expecting $_SERVER['REQUEST_URI'] to run on a dynamic page that was not completely built yet. I ran the script on the main setup page, so I guess the html was not rendered yet, when I tried to point $_SERVER['REQUEST_URI'] to it. When I noticed that this might be the problem, I abandoned the idea of parsing the document with DomDocument and used a javascript solution that can be loaded after the document is ready.
I can think of two things that you can do (probably won't solve your problem directly, but will help you greatly with solving it):
1) $_SERVER['REQUEST_URI'] doesn't contain what you think it does. Try echoing or var_dumping it, and see if it matches your expectations.
2) Enable error reporting. The reason you are seeing a generic 500 error page is that error reporting is disabled. Enable it using error_reporting().
Also note that DOMDocument only parses HTML. If you have dynamic DOM nodes generated and added to the page by a client-side language, or CSS pseudo-elements, they won't be present unless you deploy a JS/CSS engine as well (which is not trivial).
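To illustrate the first point: $_SERVER['REQUEST_URI'] holds only the path and query string of the current request, not the scheme or host, so loadHTMLFile() would treat it as a local file path. A minimal sketch of building an absolute URL from it, assuming a hand-made array in place of the real $_SERVER; absoluteUrlFromServer is a hypothetical helper name:

```php
<?php
// Show all errors while debugging instead of a generic 500 page.
error_reporting(E_ALL);
ini_set('display_errors', '1');

// Hypothetical helper: builds an absolute URL from $_SERVER-style values.
function absoluteUrlFromServer(array $server): string
{
    $https  = !empty($server['HTTPS']) && $server['HTTPS'] !== 'off';
    $scheme = $https ? 'https' : 'http';
    $host   = $server['HTTP_HOST'] ?? 'localhost';
    $uri    = $server['REQUEST_URI'] ?? '/';

    return $scheme . '://' . $host . $uri;
}

// Invented values shaped like the URL from the question.
$fakeServer = [
    'HTTP_HOST'   => 'mysite.org',
    'REQUEST_URI' => '/bbk/tiki-index.php?page=pagetext',
];

echo $fakeServer['REQUEST_URI'], "\n";          // path and query only
echo absoluteUrlFromServer($fakeServer), "\n";  // full URL
```

One caveat, stated as a guess rather than a diagnosis: if the script that calls loadHTMLFile() is itself part of the page being requested, fetching the current URL makes the page request itself again, which could explain the internal server error.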

Is there a simple way to get and manipulate nested <div> tags with php

First off, I'm far from awesome with PHP, having only a basic familiarity with it, but I'm looking for a way to manipulate the contents of nested divs with PHP. This is a basic site for a local non-profit food bank that will allow them to post events for their clientele.
For example, the file I want to parse and work with has this structure (consider this the complete file though there may be more than 2 entries at any point in time):
<div class="event">
<div class="eventTitle">title text</div>
<div class="eventContent">event content</div>
</div>
<div class="event">
<div class="eventTitle">title2</div>
<div class="eventContent">event content2</div>
</div>
My thoughts are to parse it (what's the best way?), and build a multidimensional array of all div with class="event", and the nested contents of each. However, up to this point all my attempts have ended in failure.
The point of this is allow the user (non-technical food bank admin) to add, edit, and delete these structures. I have the code working to add the structures - but am uncertain as to how I would re-open the file at a later date to then edit and/or delete select instances of the "event" divs and their nested contents. It seems like it should be an easy task but I just can't wrap my head around the search results I have found online.
I have tried some stuff with preg_match(), getElementById(), and getElementsByTagName(). I'd really like to help this organization out but I'm at the point where I have to defer to my betters for advice on how to solve the task at hand.
Thanks in advance.
To Clarify:
This is for their website, hosted on an external service by a provider that does not allow them to host a DB or provide ftp/sftp/ssh access to the server for regular maintenance. The plan is to get the site up there once, and from then on, have it maintained via an unsecure (no other options at this point) url.
Can anyone provide a sample php syntax to parse the above html and create a multidimensional array of the div tags? As I mentioned, I have attempted to thumb my way through it, but have been unsuccessful. I know what I need to do, I just get lost in the syntax.
IE: this is what I've come up with to do this, but it doesn't seem to work, and I don't have a strong enough understanding of php to understand exactly why it does not.
<?php
$doc = new DOMDocument();
$doc->load('events.php');
$events = array();
foreach ($doc->getElementsByTagName('div') as $node) {
// looks at each <div> tag and creates an array from the other named tags below // hopefully...
$edetails = array (
'title' => $node->getElementsByTagName('eventTitle')->item(0)->nodeValue,
'desc' => $node->getElementsByTagName('eventContent')->item(0)->nodeValue
);
array_push($events, $edetails);
}
foreach ($events as &$edetails) {
// walk through the $events array and write out the appropriate information.
echo $edetails['title'] . "<br>";
echo $edetails['desc'] . "<br>";
}
print_r($events); // this is currently empty and not being populated
?>
Error:
PHP Warning: DOMDocument::load(): Extra content at the end of the document in /var/www/html/events.php, line: 7 in /var/www/html/test.php on line 4
Looking at this now, I realize this would never work because it is looking for tags named eventTitle and eventContent, not classes. :(
I would use a "database", whether it's an sqlite database or a simple text file (seems sufficient for your needs), and use php scripts to manipulate that file and build the required html to manage the text/database file and display the contents.
That would be a lot easier than using DOM manipulation to add / edit / remove events.
By the way, I would probably look for a sponsor, get a decent hosting provider and use a real database...
If you want to keep using the "php" file you have (which I think is needlessly complex), the reasons your current code fails are:
1) The load() method for DOMDocument is designed for XML, and expects a well formed file. The work around for this would be to either use the loadHTMLFile() method, or to wrap everything in a parent element.
2) The looping fails as the getElementsByTagName() is looking for tags - so the outermost loop gets 6 different divs in your current example (the parent event, and the children eventTitle and eventContent)
3) The inner loops fail of course, as you're again using getElementsByTagName(). Note that the tag names are all still 'div'; what you're really trying/wanting to search on is the value of 'class' attribute. In theory, you could work around this by putting in a lot of logic using things like hasChildNodes() and/or getAttribute().
Alternatively, you could restructure using valid XML, rather than this weird hybrid you're trying to use - if you do that, you could use DOMDocument to write out the file, as well as read it. Probably overkill, unless you're looking to learn how to use the PHP DOM libraries and XML.
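For what it's worth, the three points above can be addressed together by loading the fragment as HTML and selecting on the class attribute with XPath. A minimal sketch, assuming the markup from the question is held in a string (in practice it could be read from the file) and that each class attribute holds exactly one class name:

```php
<?php
// The sample structure from the question, inlined so the sketch is self-contained.
$html = <<<HTML
<div class="event">
<div class="eventTitle">title text</div>
<div class="eventContent">event content</div>
</div>
<div class="event">
<div class="eventTitle">title2</div>
<div class="eventContent">event content2</div>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // loadHTML() tolerates fragments; silence its warnings
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$events = [];
// Select only the outer event divs, then look up the children by class relative to each one.
foreach ($xpath->query("//div[@class='event']") as $eventNode) {
    $title   = $xpath->query("./div[@class='eventTitle']", $eventNode)->item(0);
    $content = $xpath->query("./div[@class='eventContent']", $eventNode)->item(0);
    $events[] = [
        'title' => $title ? trim($title->textContent) : '',
        'desc'  => $content ? trim($content->textContent) : '',
    ];
}
print_r($events);
```

The exact-match test on @class would miss elements carrying several classes; for those, a contains()-based test would be needed.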
As others have mentioned, I'd change the format of events.php into something besides a bunch of divs. Since a database isn't an option, I'd probably go for a pipe-delimited file, something like:
title text|event content
title2|event content2
The code to parse this would be much simpler, something along the lines of:
<?php
$events = array();
$filename = 'events.txt';
if (file_exists($filename)) {
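$lines = file($filename, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES); // strip trailing newlines, skip blank lines
foreach ($lines as $line) {
list($title, $desc) = explode('|', $line, 2) + array('', ''); // split on the first '|' only; tolerate a missing one
$event = array('title'=>$title, 'desc'=>$desc);
$events[] = $event; //better way of adding one element to an array than array_push (http://php.net/manual/en/function.array-push.php)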
}
}
print_r($events);
?>
Note that this code reads the whole file into memory, so if they have too many events or super long descriptions, this could get unwieldy, but should work fine for hundreds, even thousands, of events or so.
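Since the question also asks about editing and deleting events, here is a rough sketch of the round trip: parse the pipe-delimited text, change the array, and serialize it back. It works on strings so it can run as-is; parseEvents and serializeEvents are hypothetical helper names, and it assumes titles and descriptions contain neither line breaks nor, in the title, a '|' character.

```php
<?php
// Hypothetical helper: turns pipe-delimited text into an array of events.
function parseEvents(string $text): array
{
    $events = [];
    foreach (preg_split('/\R/', $text, -1, PREG_SPLIT_NO_EMPTY) as $line) {
        $parts = explode('|', $line, 2); // split on the first '|' only
        $events[] = ['title' => $parts[0], 'desc' => $parts[1] ?? ''];
    }
    return $events;
}

// Hypothetical helper: turns the array back into pipe-delimited text.
function serializeEvents(array $events): string
{
    $lines = [];
    foreach ($events as $event) {
        $lines[] = $event['title'] . '|' . $event['desc'];
    }
    return $lines ? implode("\n", $lines) . "\n" : '';
}

$events = parseEvents("title text|event content\ntitle2|event content2\n");

$events[1]['desc'] = 'updated content';                              // edit the second event
array_splice($events, 0, 1);                                         // delete the first event
$events[] = ['title' => 'new title', 'desc' => 'new content'];       // add an event

$output = serializeEvents($events);
echo $output;
```

In the real script, the input would come from file_get_contents('events.txt') and the output would go back with file_put_contents().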

PHP Parsing with simple_html_dom, please check

I made a simple parser that saves all images per page, using simple html dom and a get image class, but I had to put a loop inside a loop in order to go page by page. I think something in my code is just not optimized, as it is very slow and always times out or exceeds the memory limit. Could someone have a quick look at the code? Maybe you'll see something really stupid that I did.
Here is the code without libraries included...
$pageNumbers = array(); //Array to hold number of pages to parse
$url = 'http://sitename/category/'; //target url
$html = file_get_html($url);
//Simply detecting the paginator class and pushing into an array to find out how many pages to parse placing it into an array
foreach($html->find('td.nav .str') as $pn){
array_push($pageNumbers, $pn->innertext);
}
// initializing the get image class
$image = new GetImage;
$image->save_to = $pfolder.'/'; // save to folder, value from post request.
//Start reading pages array and parsing all images per page.
foreach($pageNumbers as $ppp){
$target_url = 'http://sitename.com/category/'.$ppp; //Here i construct a page from an array to parse.
$target_html = file_get_html($target_url); //Reading the page html to find all images inside next.
//Final loop to find and save each image per page.
foreach($target_html->find('img.clipart') as $element) {
$image->source = url_to_absolute($target_url, $element->src);
$get = $image->download('curl'); // download via cURL
echo 'saved'.url_to_absolute($target_url, $element->src).'<br />';
}
}
Thank you.
I suggest making a function to do the actual simple html dom processing.
I usually use the following 'template'... note the 'clear memory' section.
Apparently there is a memory leak in PHP 5... at least I read that someplace.
function scraping_page($iUrl)
{
// create HTML DOM
$html = file_get_html($iUrl);
// get text elements
$aObj = $html->find('img');
// do something with the element objects
// clean up memory (prevent memory leaks in PHP 5)
$html->clear(); // **** very important ****
unset($html); // **** very important ****
return; // also can return something: array, string, whatever
}
Hope that helps.
You are doing quite a lot here, I'm not surprised the script times out. You download multiple web pages, parse them, find images in them, and then download those images... how many pages, and how many images per page? Unless we're talking very small numbers then this is to be expected.
I'm not sure what your question really is, given that, but I'm assuming it's "how do I make this work?". You have a few options, it really depends what this is for. If it's a one-off hack to scrape some sites, ramp up the memory and time limits, maybe chunk up the work to do a little, and next time write it in something more suitable ;)
If this is something that happens server-side, it should probably be happening asynchronously to user interaction - i.e. rather than the user requesting some page, which has to do all this before returning, this should happen in the background. It wouldn't even have to be PHP, you could have a script running in any language that gets passed things to scrape and does it.
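The "chunk up the work" idea mentioned above could look roughly like this: split the page list into batches and handle one batch per run, raising the limits for that run. A sketch only; batchForRun is a hypothetical helper, the limit values are arbitrary, and how the run index is remembered between runs (a file, a query parameter, a cron argument) is left open.

```php
<?php
// Arbitrary illustrative limits for a one-off scrape; tune them to the host.
set_time_limit(120);
ini_set('memory_limit', '256M');

// Hypothetical helper: returns the pages to process in the given run (0-based).
function batchForRun(array $pages, int $batchSize, int $runIndex): array
{
    $batches = array_chunk($pages, $batchSize);
    return $batches[$runIndex] ?? []; // empty array once every batch is done
}

// Stand-in for the $pageNumbers array collected from the paginator.
$pages = ['1', '2', '3', '4', '5'];

foreach (batchForRun($pages, 2, 0) as $page) {
    // The real work (fetch the page, find the images, download them) would go here.
    echo "processing page $page\n";
}
```

Each run then touches only a couple of pages, which keeps both the execution time and the number of parsed documents held in memory small.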
