Parallel Processing of Numerous HTML Pages with PHP

I have a function in PHP that reads URLs of pages from an array and fetches the HTML content of the corresponding pages for parsing. The following code works fine:
public function fetchContent($HyperLinks){
    foreach($HyperLinks as $link){
        $content = file_get_html($link);
        foreach($content->find('blablabla') as $result)
            $this->HyperLink[] = $result->xmltext;
    }
    return $this->HyperLink;
}
The problem with the code is that it is very slow: it takes about one second to fetch and parse each page. Considering the very large number of files to read, I am looking for a parallel version of the above code. The content of each page is only a few kilobytes.
I searched and found the exec command, but cannot figure out how to use it. I want to have a function and call it in parallel N times so the execution takes less time. The function would take one link as input, like below:
public function FetchContent($HyperLink){
// reading and parsing code
}
I tried this exec call:
print_r(exec("FetchContent",$HyperLink ,$this->Title[]));
but it did not work. I also replaced "FetchContent" with "FetchContent($HyperLink)" and removed the second parameter, but neither works.
Thanks. Please let me know if anything is missing. Any approach that helps me quickly process the content of numerous pages (at least 200-500) would be appreciated.
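One common approach that avoids exec entirely is PHP's curl_multi API, which fetches many URLs concurrently in a single process. A minimal sketch, assuming parsing is done afterwards on the returned HTML strings (the function name and batch size are illustrative choices, not from the question):

```php
<?php
// Fetch many URLs concurrently with curl_multi, in batches.
// Returns an array mapping each URL to its response body (null on failure).
function fetchAllParallel(array $urls, int $batchSize = 20): array {
    $results = [];
    foreach (array_chunk($urls, $batchSize) as $batch) {
        $mh = curl_multi_init();
        $handles = [];
        foreach ($batch as $url) {
            $ch = curl_init($url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
            curl_setopt($ch, CURLOPT_TIMEOUT, 10);
            curl_multi_add_handle($mh, $ch);
            $handles[$url] = $ch;
        }
        // Drive all transfers in this batch until they finish.
        do {
            $status = curl_multi_exec($mh, $active);
            if ($active && curl_multi_select($mh) === -1) {
                usleep(1000); // avoid busy-waiting if select fails
            }
        } while ($active && $status === CURLM_OK);
        foreach ($handles as $url => $ch) {
            $body = curl_multi_getcontent($ch);
            $results[$url] = ($body === false || $body === '') ? null : $body;
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
        }
        curl_multi_close($mh);
    }
    return $results;
}
```

Each returned HTML string can then be handed to str_get_html() (the string-input sibling of file_get_html in simple_html_dom) so the existing parsing logic stays unchanged while network time is overlapped.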

Related

How to make dynamic links in php without eval()

I am using WordPress for a web site. I am using snippets (my own custom PHP code) to fetch data from a database and echo that data onto my web site.
if($_GET['commentID'] && is_numeric($_GET['commentID'])){
    $comment_id = $_GET['commentID'];
    $sql = "SELECT comments FROM database WHERE commentID=$comment_id";
    $result = $database->get_results($sql);
    echo "<dl><dt>Comments:</dt>";
    foreach($result as $item):
        echo "<dd>".$item->comment."</dd>";
    endforeach;
    echo "</dl>";
}
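As an aside, the is_numeric() guard above can be hardened by casting the ID to int before it is interpolated into the SQL, so that only a number can ever reach the query. A small sketch (the helper name is illustrative; $database stands in for the site's DB wrapper):

```php
<?php
// Build the comment query with the ID forced to an integer.
// Anything non-numeric collapses to 0 rather than reaching the SQL text.
function buildCommentQuery($commentId): string {
    $id = (int)$commentId;
    return "SELECT comments FROM database WHERE commentID=" . $id;
}
```

Used in place of the interpolated string: `$result = $database->get_results(buildCommentQuery($_GET['commentID']));`.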
This specific page reads an ID from the URL and shows all comments related to that ID. In most cases, these comments are texts. But some comments should be able to point to other pages on my web site.
For example, I would like to be able to input into the comment-field in the database:
This is a magnificent comment. You should also check out this other section for more information
Here, getURLtoSectionPage() is a function I have declared in my functions.php that provides the static URL to each section of my home page, in order to prevent broken links if I change my URL pattern in the future.
I do not want to do this by using eval(), and I have not been able to accomplish this by using output buffers either. I would be grateful for any hints as to how I can get this working as safely and cleanly as possible. I do not wish to execute any custom php code, only make function calls to my already existing functions which validates input parameters.
Update:
Thanks for your replies. I have been thinking of this problem a lot, and spent the evening experimenting, and I have come up with the following solution.
My SQL "shortcode":
This is a magnificent comment. You should also check out this other section for more information
My php snippet in wordpress:
ob_start();
// All my code that echo content to my page comes here
// Retrieve ID from url
// Echo all page contents
// Finished generating page contents
$entire_page=ob_get_clean();
replaceInternalLinks($entire_page);
The PHP function in my functions.php in WordPress:
if(!function_exists("replaceInternalLinks")){
    function replaceInternalLinks($reference){
        mb_ereg_search_init($reference, "\[custom_func:([^\]]*):([^\]]*)\]");
        if(mb_ereg_search()){
            $matches = mb_ereg_search_getregs(); // get first result
            do{
                if($matches[1] == "getURLtoSectionPage" && is_numeric($matches[2])){
                    $reference = str_replace($matches[0], getURLtoSectionPage($matches[2]), $reference);
                }else{
                    echo "Help! An invalid function has been inserted into my tables. Have I been hacked?";
                }
                $matches = mb_ereg_search_regs(); // get next result
            }while($matches);
        }
        echo $reference;
    }
}
This way I can decide which functions it is possible to call via the shortcode format and can validate that only integer references can be used.
Am I safe now?
Don't store the code in the database; store the ID, then process it when you need to. BTW, I'm assuming you really need it to be dynamic and can't just store the final URL.
So, I'd change your example comment-field text to something like:
This is a magnificent comment. You should also check out this other section for more information
Then, when you need to display that text, do something like a regular expression search-replace on 'href="#comment-([0-9]+)"', calling your getURLtoSectionPage() function at that point.
Does that make sense?
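The suggested search-replace can be sketched with preg_replace_callback. The placeholder format here ("#section-<id>") and the resolver body are illustrative assumptions; in the real site getURLtoSectionPage() is the asker's existing function:

```php
<?php
// Hypothetical resolver standing in for the asker's getURLtoSectionPage().
function getURLtoSectionPage(int $sectionId): string {
    return "/section/" . $sectionId;
}

// Replace placeholder links like href="#section-42" with real URLs at
// display time, so only the numeric ID is stored in the database.
function expandSectionLinks(string $text): string {
    return preg_replace_callback(
        '/href="#section-([0-9]+)"/',
        function ($m) {
            return 'href="' . getURLtoSectionPage((int)$m[1]) . '"';
        },
        $text
    );
}
```

For example, `expandSectionLinks('<a href="#section-7">more</a>')` resolves the placeholder through the function, so URL patterns can change later without touching stored comments.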
I do not want to do this by using eval(), and I have not been able to accomplish this by using output buffers either. I would be grateful for any hints as to how I can get this working as safely and cleanly as possible. I do not wish to execute any custom php code, only make function calls to my already existing functions which validates input parameters.
Eval is a terrible approach, as is allowing people to submit raw PHP at all. It's highly error-prone, and the results of an error could be catastrophic (and that's without even considering the possibility that code designed by a malicious attacker gets submitted).
You need to use something custom. Possibly something inspired by BBCode.

How to gather code to be used in a htmlentities() function without rendering beforehand?

To put my question into context, I'm working on an entirely static website where 'post' pages are created by myself manually - there's no CMS behind it. Each page will require a <pre> <code> block to display code as text in a styled block. There could be anywhere from very few to several of these, which is why I'm trying to do this for ease.
Here's what I've done -
function outputCode($code) {
    return "<pre class='preBlock'><code class='codeBlock'>".htmlentities($code)."</code></pre>";
}
The code works as expected and produces an expected outcome when it's able to grab code. My idea is to somehow wrap the code for the code block with this function and echo it out for the effect, fewer lines and better readability.
As I'm literally just creating pages as they're needed, is there even a way to create the needed code blocks with such function to avoid having to manually repeat all the code for each code block? Cheers!
EDIT:
I was previously using this function and it was working great, as I was pulling code from .txt documents in a directory and storing the code for the code blocks in a variable with file_get_contents(). However, now I'm trying to get the function to work by manually inputting the code into the function.
Well. Wrapping the function input in ' ' completely slipped my mind! It works just fine now!
If I understand correctly, you want to re-use your outputCode function in several different PHP files, corresponding to posts. If yes, you could put this 1 function in its own file, called outputcode.php for example, and then do
include "outputcode.php";
in every post/PHP file that needs to re-use this function. This will pull in the code, from the one common/shared file, for use in each post/PHP file that needs it. Or maybe I'm misreading your last paragraph :(
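Since the asker's fix was quoting the input, it may be worth noting that a nowdoc keeps quotes and $variables literal, so multi-line snippets need no escaping at all. A small sketch combining that with the outputCode function from the question:

```php
<?php
// The asker's function: escape the snippet and wrap it in styled pre/code tags.
function outputCode($code) {
    return "<pre class='preBlock'><code class='codeBlock'>" . htmlentities($code) . "</code></pre>";
}

// A nowdoc (<<<'CODE') treats everything literally, so the snippet's own
// quotes and dollar signs survive without backslashes.
$snippet = <<<'CODE'
$greeting = "Hello";
echo $greeting . ", world!";
CODE;

echo outputCode($snippet);
```

The echoed block renders the snippet as visible text, since htmlentities() converts the angle brackets and quotes to entities.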

Running PHP code from a database

I've got some code that runs HTML code from a database, in order not to have to make a new file for every page I might want to make. The problem is that if the page contains PHP code, it won't run. I'm pretty sure you can do this with eval(), but it has security risks, so I was trying to find some alternatives. I can paste the code if necessary. I've got a PHP script that gets data from a table, and a main PHP script that gets the formatting in HTML and PHP for the data; however, it just outputs the PHP code as if it were strings from the database.
Here is an image of the code that's meant to be run from the PHP script:
Here is the PHP code in that row:
And this is the main script that is meant to run it:
Why not make PHP functions? For example, if you wanted to output data from your databases using PHP, but without a lot of code, you could do something like the following (for user profiles).
Like, you can make a funcs.php
then inside of it do:
function user_page($user_id){
?> [Create user page here] <?php
}
Then inside of any other file just do:
include 'funcs.php';
if(isset($_GET['username']) === true){
    user_page($_GET['username']);
} else {
    // if not loading the user page then do something else
}
EDIT
Okay, I just saw your screenshot.
To do something like that, a function would be good.
Ex:
function print_data($info1, $info2, $info3, $info4){
    echo("<center>$info1</center><br>$info2<br>$info3<br>$info4");
}
then just calling the function with the $row[] information you have in your screenshot.
Like so:
print_data($row['email'],$row['name'],$row['username'],$row['ip_addr']);
This question must be added to some "doing it wrong" list, honestly. But, returning to your question: you have three options, all of them more or less painful, and only one of them right.
1. Do it right: rewrite your engine or take some CMS/framework; store text data in the database, and scripts/templates on disk.
2. eval() (which you don't want).
3. Parse the code yourself (which will be very hard, slow, and totally crazy).
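The first option (data in the database, templates on disk) can be sketched as follows. The row layout, field names, and template directory here are illustrative assumptions, not from the question:

```php
<?php
// Render a page whose layout lives on disk and whose data comes from a DB row.
// $row might look like: ['template' => 'user_profile', 'data' => ['name' => '...']]
function renderPage(array $row, string $templateDir = __DIR__ . '/templates'): string {
    // basename() strips any path components, so a DB value can't
    // escape the template directory.
    $template = basename($row['template']) . '.php';
    $path = $templateDir . '/' . $template;
    if (!is_file($path)) {
        throw new RuntimeException("Unknown template: " . $template);
    }
    extract($row['data'], EXTR_SKIP); // expose fields like $name to the template
    ob_start();
    include $path;            // the template is ordinary PHP, parsed normally
    return ob_get_clean();
}
```

The template file is real PHP that the engine parses once and caches in the opcode cache, so nothing from the database is ever executed as code.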

Inserting output into page after document has been executed

In PHP I have a situation where I need the page to be mostly executed, but then have an item inserted into the output from that page.
I think output buffering may be of some help, but I can't work out how to implement it in my situation.
My code looks like this:
// this document is part of a global functions file
function pageHeader(){
    // I'm using $GLOBALS here because it works; I would really rather use a better method if possible
    $GLOBALS['error_handler'] = new ErrorHandler(); // ErrorHandler sets a function for set_error_handler, which gathers an array of errors from the executed page
    require_once($_SERVER['DOCUMENT_ROOT'].'/sales/global/_header.php');
    // I would like the unordered list from ->displayErrorNotice() to be displayed here, but if I do that the list is empty, because the list is output before the rest of the document has executed
}
function pageFooter(){
    $GLOBALS['error_handler']->displayErrorNotice(); // displays the errors as an HTML unordered list
    include($_SERVER['DOCUMENT_ROOT']."/sales/global/_footer.php");
}
Most pages on the site include this document and use the pageHeader() and pageFooter() functions. What I am trying to achieve is to put an unordered list of the PHP generated errors into an HTML list just at a point after _header.php has been included. I can get the list to work as intended if I put it in the footer (after the document has been executed), but I don't want it there. I guess I could move it with JS, but I think there must be a PHP solution.
UPDATE
I'm wondering whether a callback function for ob_start() which searches the buffer by regex where to put the error list, and then inserts it will be the solution.
UPDATE 2 I have solved the problem, my answer is below. I will accept it in 2 days when I am allowed.
Worked it out finally. The key was to buffer the output, and search the buffer for a given snippet of html, and replace it with the unordered list.
My implementation is like this:
function outputBufferCallback($buffer){
    return str_replace("<insert_errors>", $GLOBALS['error_handler']->returnErrorNotice(), $buffer);
}

function pageHeader(){
    ob_start('outputBufferCallback');
    // I'm using $GLOBALS here because it works; I would really rather use a better method if possible
    $GLOBALS['error_handler'] = new ErrorHandler(); // ErrorHandler sets a function for set_error_handler, which gathers an array of errors from the executed page
    require_once($_SERVER['DOCUMENT_ROOT'].'/sales/global/_header.php');
    echo '<insert_errors>'; // this snippet is replaced by the ul at buffer flush
}

function pageFooter(){
    include($_SERVER['DOCUMENT_ROOT']."/sales/global/_footer.php");
    ob_end_flush();
}
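The technique in isolation (placeholder string, buffer callback, replacement at flush) can be demonstrated with a self-contained sketch; the header/footer strings and the hard-coded error list here are stand-ins for the real site's includes and ErrorHandler:

```php
<?php
// Minimal demonstration of placeholder replacement via an output-buffer callback.
function bufferCallback($buffer) {
    return str_replace('<insert_errors>', '<ul><li>Example error</li></ul>', $buffer);
}

function renderDemoPage(): string {
    ob_start();                  // outer buffer just captures the final output for this demo
    ob_start('bufferCallback');  // the buffer whose callback does the replacement
    echo '<header>Header</header>';
    echo '<insert_errors>';      // placeholder, rewritten when this buffer is flushed
    echo '<footer>Footer</footer>';
    ob_end_flush();              // runs bufferCallback, emits into the outer buffer
    return ob_get_clean();
}

echo renderDemoPage();
```

Note that the callback runs only when the buffer is flushed (ob_end_flush() or script end), not on ob_get_contents(), which is why the real code delays the flush until pageFooter().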
If I'm getting this right, you're trying to insert some calculated code/errors between the header and footer. I'm guessing that the errors are being totalled up at the very end of the page and would only be complete after the page footer.
If this is true, I can't think of any way to do this with pure PHP. It can only run through a page once, and cannot double back. What you can do is create an element after the footer and move it using JavaScript to the area where you want to display it. This would be the easiest way, I would think. You can do this easily with jQuery.
I can explain further if I am on the right track, but I'm not 100% sure what you're asking yet...
The jquery command you would use is .appendTo().

PHP Parsing with simple_html_dom, please check

I made a simple parser that saves all images per page with simple_html_dom and the GetImage class, but I had to make a loop inside the loop in order to go page by page, and I think something in my code is just not optimized, as it is very slow and always times out or exceeds the memory limit. Could someone have a quick look at the code? Maybe you will see something really stupid that I did.
Here is the code without libraries included...
$pageNumbers = array(); // array to hold the number of pages to parse
$url = 'http://sitename/category/'; // target url
$html = file_get_html($url);

// Detect the paginator class and push its entries into an array to find out how many pages to parse
foreach($html->find('td.nav .str') as $pn){
    array_push($pageNumbers, $pn->innertext);
}

// Initialize the GetImage class
$image = new GetImage;
$image->save_to = $pfolder.'/'; // save-to folder, value from POST request

// Read the pages array and parse all images per page
foreach($pageNumbers as $ppp){
    $target_url = 'http://sitename.com/category/'.$ppp; // construct a page URL from the array
    $target_html = file_get_html($target_url); // read the page HTML to find all images inside next
    // Find and save each image on the page
    foreach($target_html->find('img.clipart') as $element) {
        $image->source = url_to_absolute($target_url, $element->src);
        $get = $image->download('curl');
        echo 'saved '.url_to_absolute($target_url, $element->src).'<br />';
    }
}
Thank you.
I suggest making a function to do the actual simple html dom processing.
I usually use the following 'template'... note the 'clear memory' section.
Apparently there is a memory leak in PHP 5... at least I read that someplace.
function scraping_page($iUrl)
{
    // create HTML DOM
    $html = file_get_html($iUrl);
    // get the image elements
    $aObj = $html->find('img');
    // ... do something with the element objects ...
    // clean up memory (prevent memory leaks in PHP 5)
    $html->clear(); // **** very important ****
    unset($html);   // **** very important ****
    return; // can also return something: an array, a string, whatever
}
Hope that helps.
You are doing quite a lot here; I'm not surprised the script times out. You download multiple web pages, parse them, find images in them, and then download those images... how many pages, and how many images per page? Unless we're talking very small numbers, this is to be expected.
I'm not sure what your question really is, given that, but I'm assuming it's "how do I make this work?". You have a few options; it really depends what this is for. If it's a one-off hack to scrape some sites, ramp up the memory and time limits, maybe chunk up the work a little, and next time write it in something more suitable ;)
If this is something that happens server-side, it should probably be happening asynchronously to user interaction - i.e. rather than the user requesting some page, which has to do all this before returning, this should happen in the background. It wouldn't even have to be PHP, you could have a script running in any language that gets passed things to scrape and does it.