PHP Parsing with simple_html_dom, please check

I made a simple parser that saves all images on a page using simple_html_dom and a GetImage class, but I had to nest one loop inside another in order to go page by page, and I think something in my code is just not optimized: it is very slow and always times out or exceeds the memory limit. Could someone have a quick look at the code and maybe spot something really stupid that I did?
Here is the code without libraries included...
$pageNumbers = array(); // array to hold the page numbers to parse
$url = 'http://sitename/category/'; // target url

$html = file_get_html($url);

// Detect the paginator class and push each page number into the array,
// so we know how many pages to parse.
foreach ($html->find('td.nav .str') as $pn) {
    array_push($pageNumbers, $pn->innertext);
}

// Initialize the GetImage class.
$image = new GetImage;
$image->save_to = $pfolder . '/'; // destination folder, value from the POST request

// Read the pages array and parse all images on each page.
foreach ($pageNumbers as $ppp) {
    $target_url = 'http://sitename.com/category/' . $ppp; // construct the page URL from the array
    $target_html = file_get_html($target_url); // read the page HTML to find all images in it

    // Final loop: find and save each image on the page.
    foreach ($target_html->find('img.clipart') as $element) {
        $image->source = url_to_absolute($target_url, $element->src);
        $get = $image->download('curl'); // download via cURL
        echo 'saved ' . url_to_absolute($target_url, $element->src) . '<br />';
    }
}
Thank you.

I suggest making a function to do the actual simple html dom processing.
I usually use the following 'template'... note the 'clear memory' section.
Apparently there is a memory leak in PHP 5... at least I read that someplace.
function scraping_page($iUrl)
{
    // create HTML DOM
    $html = file_get_html($iUrl);

    // get the image elements
    $aObj = $html->find('img');

    // do something with the element objects

    // clean up memory (prevent memory leaks in PHP 5)
    $html->clear(); // **** very important ****
    unset($html);   // **** very important ****

    return; // can also return something: an array, a string, whatever
}
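For example, the page loop from the question could go through a function like that so each page's DOM is freed before the next one is fetched. A rough sketch (the function name is made up; it assumes the GetImage object and the url_to_absolute helper from the question):
function scrape_images_from_page($pageUrl, $image)
{
    // create HTML DOM for one category page
    $html = file_get_html($pageUrl);
    if (!$html) {
        return 0; // page could not be loaded
    }

    // find and save each clipart image on this page
    $saved = 0;
    foreach ($html->find('img.clipart') as $element) {
        $image->source = url_to_absolute($pageUrl, $element->src);
        $image->download('curl');
        $saved++;
    }

    // clean up memory before the next page is fetched
    $html->clear();
    unset($html);

    return $saved;
}

// one call per page number collected from the paginator
foreach ($pageNumbers as $ppp) {
    scrape_images_from_page('http://sitename.com/category/' . $ppp, $image);
}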
Hope that helps.

You are doing quite a lot here; I'm not surprised the script times out. You download multiple web pages, parse them, find images in them, and then download those images... how many pages, and how many images per page? Unless we're talking very small numbers, this is to be expected.
I'm not sure what your question really is, given that, but I'm assuming it's "how do I make this work?". You have a few options; it really depends on what this is for. If it's a one-off hack to scrape some sites, ramp up the memory and time limits, maybe chunk the work up a little, and next time write it in something more suitable ;)
If this is something that happens server-side, it should probably happen asynchronously to user interaction - i.e. rather than the user requesting some page which has to do all this before returning, it should happen in the background. It wouldn't even have to be PHP; you could have a script running in any language that gets passed things to scrape and does it.
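If you do go the quick-and-dirty route, the limits can be raised for just that one script. A minimal sketch (the values are arbitrary):
set_time_limit(0);               // no execution time limit for this run
ini_set('memory_limit', '512M'); // allow more room for the DOM objects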

Related

Parallel Processing of Numerous HTML pages with PHP

I have the following function in PHP that reads page URLs from an array and fetches the HTML content of the corresponding pages for parsing. The code works fine.
public function fetchContent($HyperLinks){
    foreach ($HyperLinks as $link) {
        $content = file_get_html($link);
        foreach ($content->find('blablabla') as $result) {
            $this->HyperLink[] = $result->xmltext;
        }
    }
    return $this->HyperLink;
}
The problem with the code is that it is very slow: it takes about a second to fetch a page and parse its content. Considering the very large number of pages to read, I am looking for a parallel version of the code above. The content of each page is just a few kilobytes.
I did search and found the exec command but cannot figure out how to use it. I want to have a function and call it in parallel N times so the execution takes less time. The function would get one link as input, like below:
public function FetchContent($HyperLink){
    // reading and parsing code
}
I tried this exec call:
print_r(exec("FetchContent", $HyperLink, $this->Title[]));
but no luck. I also replaced "FetchContent" with "FetchContent($HyperLink)" and removed the second parameter, but neither works.
Thanks. Please let me know if anything is missing. Feel free to suggest any approach that helps me quickly process the content of numerous files, at least 200-500 pages.
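A common alternative to exec for this kind of job is curl_multi, which issues all the HTTP requests concurrently and lets you parse each response once the transfers finish. A rough sketch (the 'blablabla' selector is the placeholder from the question, and simple_html_dom's str_get_html is assumed for the parsing):
// Rough sketch: fetch many pages concurrently with curl_multi,
// then parse each response with simple_html_dom's str_get_html().
function fetchContentParallel(array $hyperLinks)
{
    $multi = curl_multi_init();
    $handles = array();

    foreach ($hyperLinks as $link) {
        $ch = curl_init($link);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_multi_add_handle($multi, $ch);
        $handles[] = $ch;
    }

    // run all transfers until every request has finished
    $running = null;
    do {
        curl_multi_exec($multi, $running);
        curl_multi_select($multi);
    } while ($running > 0);

    // collect and parse the responses
    $results = array();
    foreach ($handles as $ch) {
        $html = str_get_html(curl_multi_getcontent($ch));
        if ($html) {
            foreach ($html->find('blablabla') as $result) {
                $results[] = $result->xmltext;
            }
            $html->clear();
        }
        curl_multi_remove_handle($multi, $ch);
        curl_close($ch);
    }
    curl_multi_close($multi);

    return $results;
}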

Get pixel coordinates of HTML/DOM elements using PHP

I am working on a web crawler/site analyzer in PHP. What I need to do is extract some tags from an HTML file and compute some attributes (such as image size, for example). I can easily do this using a DOM parser, but I also need to find the pixel coordinates and size of an HTML/DOM tree element (say I have a div and I need to know which area it covers and at which coordinates it starts). I can define a standard screen resolution, that is not a problem for me, but I need to retrieve the pixel coordinates automatically, using a server-side PHP script (or by calling some Java app from the console or something similar, if needed).
From what I understand, I need a headless browser in PHP that would simulate/render a web page, from which I can retrieve the pixel coordinates I need. Would you recommend an open-source solution for that? Some code snippets would also be useful, so I don't install the solution and then notice it does not provide pixel coordinates.
PS: I see people who answered missed the point of the question, so it seems I did not explain well that I need this solution to work COMPLETELY server-side. Say I use a crawler and it feeds HTML pages to my script. I could launch it from a browser, but also from the console (like 'php myScript.php').
Maybe you can set the coordinates as some kind of metadata on your tag using JavaScript:
$("element").attr("data-coordinates", function () { var o = $(this).offset(); return o.top + "," + o.left; });
Then you have to request the page with PHP:
$html = file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$tags = $doc->getElementsByTagName('element');
foreach ($tags as $tag) {
    echo $tag->getAttribute('data-coordinates'); // this will print the coordinates of each tag
}
A headless browser is overkill for what you're trying to achieve. Just use cookies to store whatever you want.
So any time you get some piece of information, such as an X,Y coordinate, scroll position, etc. in javascript, simply send it to a PHP script that makes a cookie out of it with some unique string index.
Eventually, you'll have a large array of cookie data that will be directly available to any PHP or javascript file, and you can do anything you'd like with it at that point.
For example, if you wanted to just store stuff in sessions, you could do:
jquery:
// save whatever you want from javascript
// note: probably better to POST, since we're not getting anything really, just showing quick example
$.get('save-attr.php?attr=xy_coord&value=300,550');
PHP:
// this will be the save-attr.php file
session_start();
$_SESSION[$_GET['attr']] = $_GET['value'];
// now any other script can get this value like so:
$coordinates = $_SESSION['xy_coord'];
// where $coordinates would now equal "300,550"
Simply continue this pattern for whatever you need access to in PHP.

How to extract contents from URLs?

I am having a problem. This is what I have to do and the code is taking extremely long to run:
There is 1 website I need to collect data from, and to do so I need my algorithm to visit over 15,000 subsections of this website (i.e. www.website.com/item.php?rid=$_id), where $_id will be the current iteration of a for loop.
Here are the problems:
The method I am currently using to get the source code of each page is file_get_contents, and, as you can imagine, it takes super long to run file_get_contents on 15,000+ pages.
Each page contains over 900 lines of code, but all I need to extract is about 5 lines' worth, so the algorithm seems to waste a lot of time retrieving all 900 lines of it.
Some of the pages do not exist (i.e. maybe www.website.com/item.php?rid=2 exists but www.website.com/item.php?rid=3 does not), so I need a way of quickly skipping over these pages before the algorithm tries to fetch their contents and wastes a bunch of time.
In short, I need a method of extracting a small portion of the page from 15,000 webpages in as quick and efficient a manner as possible.
Here is my current code.
for ($_id = 0; $_id < 15392; $_id++){
    //****************************************************** Locating page
    $_location = "http://www.website.com/item.php?rid=".$_id;
    $_headers = @get_headers($_location);
    if (strpos($_headers[0], "200") === FALSE){
        continue;
    } // end if
    $_source = file_get_contents($_location);

    //****************************************************** Extracting price
    $_needle_initial = "<td align=\"center\" colspan=\"4\" style=\"font-weight: bold\">Current Price:";
    $_needle_terminal = "</td>";
    $_position_initial = (stripos($_source, $_needle_initial)) + strlen($_needle_initial);
    $_position_terminal = stripos($_source, $_needle_terminal);
    $_length = $_position_terminal - $_position_initial;
    $_current_price = strip_tags(trim(substr($_source, $_position_initial, $_length)));
} // end for
Any help at all is greatly appreciated since I really need a solution to this!
Thank you in advance for your help!
the short of it: don't.
Longer: if you want to do this much work, you shouldn't do it on demand. Do it in the background! You can use the code you have here, or any other method you're comfortable with, but instead of showing the result to a user, save it in a database or a local file. Call this script with a cron job every x minutes (depending on the interval you need), and just show the latest content from your local cache (be it a database or a file).
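As a rough sketch of that idea (the paths and the cron interval are made up), a CLI script run from cron could write the extracted prices to a local cache file that the user-facing page then reads:
// scrape_prices.php - run from cron, e.g. every 30 minutes:
//   */30 * * * * php /path/to/scrape_prices.php
// The user-facing page then only reads prices.json, so it stays fast.
$prices = array();
for ($_id = 0; $_id < 15392; $_id++) {
    $_location = "http://www.website.com/item.php?rid=" . $_id;
    $_headers = @get_headers($_location);
    if ($_headers === false || strpos($_headers[0], "200") === false) {
        continue; // page does not exist or is unreachable, skip it
    }
    // ... same price extraction as in the question ...
    // $prices[$_id] = $_current_price;
}
file_put_contents('/path/to/cache/prices.json', json_encode($prices));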

Status report during form process

I created a little script that imports WordPress posts from an XML file:
if (isset($_POST['wiki_import_posted'])) {
    // Get uploaded file
    $file = file_get_contents($_FILES['xml']['tmp_name']);
    $file = str_replace('&', '&amp;', $file);

    // Get and parse XML
    $data = new SimpleXMLElement($file, LIBXML_NOCDATA);

    foreach ($data->RECORD as $key => $item) {
        // Build post array
        $post = array(
            'post_title' => $item->title,
            ........
        );

        // Insert new post
        $id = wp_insert_post($post);
    }
}
The problem is that my XML file is really big, and when I submit the form, the browser just hangs for a couple of minutes.
Is it possible to display some progress during the import, like a dot after every item is imported?
Unfortunately, no, not easily. Especially if you're building this on top of the WP framework, you'll find it not worth your while at all. When you're interacting with a PHP script, you are sending a request and awaiting a response. However long it takes that PHP script to finish processing and start sending output is how long it usually takes the client to start seeing a response.
There are a few things to consider if you want output to start showing as soon as possible (i.e. as soon as the first echo or output statement is reached):
Turn off output buffering so that output begins sending immediately.
Output whatever you want inside the loop that would indicate the progress you wish to know about (see the sketch below).
Note that if you're doing this with an AJAX request, content may not be immediately ready to transport to the DOM via your XMLHttpRequest object. Also note that some browsers do their own buffering before content is displayed to the user (IE, for example).
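For illustration, a minimal sketch of streaming one dot per imported post; it reuses the loop from the question and assumes nothing upstream (a reverse proxy, for instance) is buffering the response:
@ini_set('zlib.output_compression', 'Off'); // compression would delay output
while (ob_get_level() > 0) {
    ob_end_flush();      // drop any existing output buffers
}
ob_implicit_flush(true); // flush automatically after every echo

foreach ($data->RECORD as $key => $item) {
    $post = array(
        'post_title' => (string) $item->title,
        // ... other fields as in the question ...
    );
    wp_insert_post($post);

    echo '.';            // one dot per imported post
    flush();
}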
Some suggestions you may want to look into to speed up your script, however:
Why are you doing str_replace('&', '&amp;', $file) on a large file? You realize that has cost with little benefit, right? If you're trying to handle the HTML entity &amp; yourself, you probably have some of your logic very wrong; encoding is something you want to let the XML parser handle.
You can use curl_multi instead of file_get_contents to do multiple HTTP requests concurrently and save time if you are transferring a lot of files. It will be much faster since it's non-blocking I/O.
You should use DOMDocument instead of SimpleXML: a DOMXPath query can get you your array much faster than what you're currently doing. It's a much nicer interface than SimpleXML, and I always recommend it over SimpleXML since in most cases SimpleXML makes things incredibly difficult for no good reason. Don't let the name fool you.
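A rough sketch of that DOMDocument/DOMXPath route for the same import (the XPath expressions are guesses based on the RECORD/title structure shown in the question):
// Sketch: build the post array with DOMDocument + DOMXPath instead of SimpleXML.
$doc = new DOMDocument();
$doc->loadXML($file, LIBXML_NOCDATA);
$xpath = new DOMXPath($doc);

foreach ($xpath->query('//RECORD') as $record) {
    $title = $xpath->evaluate('string(title)', $record); // text of the <title> child
    $post = array(
        'post_title' => $title,
        // ... other fields ...
    );
    wp_insert_post($post);
}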

Including/Excluding content with xPath/DOM > PHP

I'm trying to take an existing PHP file which I've built for a page of my site (blue.php) and grab the parts I really want with some XPath to create a different version of that page (blue-2.php).
I've been successful in pulling in my existing .php file with
$documentSource = file_get_contents("http://mysite.com/blue.php");
I can alter an attribute, and have my changes reflected correctly within blue-2.php, for example:
$xpath->query("//div[#class='walk']");
foreach ($xpath->query("//div[#class='walk']") as $node) {
$source = $node->getAttribute('class');
$node->setAttribute('class', 'run');
With my current code, I'm limited to making changes like in the example above. What I really want to be able to do is remove/exclude certain divs and other elements from showing on my new PHP page (blue-2.php).
By using echo $doc->saveHTML(); at the end of my code, it appears that everything from blue.php is included in blue-2.php's output, when I only want to output certain elements while excluding others.
So the essence of my question is:
Can I parse an entire page using $documentSource = file_get_contents("http://mysite.com/blue.php"); and pick and choose (include and exclude) which elements show on my new page with XPath? Or am I limited to making modifications to the existing code, as in my 'div class walk/run' example above?
Thank you for any guidance.
I've tried this, and it just throws errors:
$xpath->query("//img[#src='blue.png']")->remove();
What part of the documentation made you think remove() is a method of DOMNodeList? Use DOMNode::removeChild:
foreach ($xpath->query("//img[@src='blue.png']") as $node) {
    $node->parentNode->removeChild($node);
}
I would suggest browsing a bit through all the classes & functions of the DOM extension (which is not PHP-only, BTW) to get a feel for what to find where.
On a side note: it is probably much more resource-efficient to add a switch to your original blue.php that produces the different output, because this solution (an extra HTTP request plus a full DOM load & manipulation) has a LOT of unneeded overhead compared to that.
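For example, a hypothetical switch in blue.php itself (the parameter name and markup are made up) could produce the trimmed-down output directly:
<?php
// blue.php (sketch): skip the optional blocks when ?variant=lite is requested
$lite = (isset($_GET['variant']) && $_GET['variant'] === 'lite');
?>
<div class="<?php echo $lite ? 'run' : 'walk'; ?>">
    shared content here
</div>
<?php if (!$lite): ?>
    <img src="blue.png" alt="only shown on the full page">
<?php endif; ?>
blue-2.php could then simply set the parameter and include blue.php, with no extra HTTP request or DOM manipulation.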
