I am having a problem. This is what I have to do and the code is taking extremely long to run:
There is one website I need to collect data from, and to do so I need my algorithm to visit over 15,000 subsections of this website (e.g. www.website.com/item.php?rid=$_id), where $_id is the current iteration of a for loop.
Here are the problems:
The method I am currently using to get the source code of each page is file_get_contents, and, as you can imagine, running file_get_contents on 15,000+ pages takes a very long time.
Each page contains over 900 lines of code, but all I need to extract is about 5 lines' worth, so it seems as though the algorithm is wasting a lot of time by retrieving all 900 of them.
Some of the pages do not exist (i.e. maybe www.website.com/item.php?rid=2 exists but www.website.com/item.php?rid=3 does not), so I need a method of quickly skipping over these pages before the algorithm tries to fetch their contents and wastes a bunch of time.
In short, I need a method of extracting a small portion of the page from 15,000 webpages in as quick and efficient a manner as possible.
Here is my current code.
for ($_id = 0; $_id < 15392; $_id++){
    //****************************************************** Locating page
    $_location = "http://www.website.com/item.php?rid=".$_id;
    $_headers = @get_headers($_location);
    if($_headers === FALSE || strpos($_headers[0],"200") === FALSE){
        continue;
    } // end if
    $_source = file_get_contents($_location);
    //****************************************************** Extracting price
    $_needle_initial = "<td align=\"center\" colspan=\"4\" style=\"font-weight: bold\">Current Price:";
    $_needle_terminal = "</td>";
    $_position_initial = stripos($_source,$_needle_initial)+strlen($_needle_initial);
    $_position_terminal = stripos($_source,$_needle_terminal,$_position_initial); // search from the match onward, not from the start of the page
    $_length = $_position_terminal-$_position_initial;
    $_current_price = strip_tags(trim(substr($_source,$_position_initial,$_length)));
} // end for
Any help at all is greatly appreciated since I really need a solution to this!
Thank you in advance for your help!
The short of it: don't.
Longer: if you want to do this much work, you shouldn't do it on demand. Do it in the background! You can use the code you have here, or any other method you're comfortable with, but instead of showing the result to a user, save it in a database or a local file. Call this script with a cron job every X minutes (depending on the interval you need), and just show the latest content from your local cache (be it a database or a file).
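A rough, untested sketch of that background job (the cache file name and the extractPrice() helper are placeholders I've made up; the URL pattern and the skip logic are the ones from the question):
<?php
// fetch_prices.php - run from cron, e.g. every 30 minutes:
// */30 * * * * php /path/to/fetch_prices.php
$cache = array();
for ($_id = 0; $_id < 15392; $_id++) {
    $location = "http://www.website.com/item.php?rid=" . $_id;
    $headers = @get_headers($location);
    if ($headers === FALSE || strpos($headers[0], "200") === FALSE) {
        continue; // page does not exist, skip it
    }
    $source = file_get_contents($location);
    if ($source === FALSE) {
        continue;
    }
    $cache[$_id] = extractPrice($source); // hypothetical helper wrapping the needle/substr logic above
}
// The user-facing page only ever reads this file, so it stays fast.
file_put_contents(__DIR__ . '/price_cache.json', json_encode($cache));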
I have a PHP script that can take a few minutes to finish. It's a search engine of sorts which executes a bunch of regex commands and retrieves the results for the user.
I start by displaying a "loading page" which does an AJAX call to the big processing method in my controller (let's call it 'P'). This method then returns a partial view and I just replace my "loading page" content with that partial view. It works fine.
Now what I would like to do is give the user some information about the process (and later on, some control over it), like how many results the script has already found. To achieve that, I do another AJAX call every 5 seconds which is supposed to retrieve the current number of results and display it in a simple html element. This call uses a method 'R' in the same controller as method 'P'.
Now the problem I have is that I'm not able to retrieve the correct current number of results. I tried two things:
Session variable ('file' driver): in 'P' I first set a session variable 'v' to 0 and then update 'v' every time a new result is found. 'R' simply returns response()->json(session('v')).
Controller variable: same principle as above, but I use a variable declared at the top of my controller.
The AJAX calls work in both cases, but every time, and in both cases, 'R' returns 0. If I send back 'v' at the end of the 'P' script, it has the correct value.
So to me it looks like 'R' can't access the actual current value of 'v'; it only accesses some 'cached' version of it.
Does anyone have an idea about how I'm going to be able to achieve what I'd like to do? Is there another "cleaner" approach and/or what is wrong with mine?
Thank you, have a nice day!
__
Some pseudo-code to hopefully make it a bit more precise.
SearchController.php
function P() {
    $i = 0;
    session(['count' => $i]); // set session variable
    $results = sqlQuery();    // get rows from DB
    foreach ($results as $result) {
        if (regexFunction($result))
            $i++;
        session(['count' => $i]); // update session variable
    }
    return response()->json('a bunch of stuff');
}
function R() {
    return response()->json(session('count')); // always returns 0
}
I would recommend a different approach here.
Read a bit more about flushing content here http://php.net/manual/en/ref.outcontrol.php and then use it.
Long story short: in order to display the number of rows processed while flushing, you can just loop over the result and flush from time to time, or after an exact number of rows; the need for the 5-second AJAX poll is gone. Small untested example:
$cnt = 0;
foreach ($result as $key => $val) {
    //do your processing here
    $cnt++;
    if ($cnt % 100 == 0) {
        //here echo smth for flushing, you can echo some javascript, though not nice
        echo "<script>showProcess({$cnt});</script>";
        ob_flush();
        flush(); // push the buffer out to the browser as well
    }
}
// now render the processed full result
And in the showProcess JavaScript function do whatever you want... some jQuery text replacement or some graphical stuff...
Hopefully you are not using FastCGI, because in order to activate output buffering you need to disable some important features.
I believe you have hit a wall with PHP limitations: PHP doesn't multithread well. To achieve that level of interaction you would probably need to edit the session files directly; their path can be found in the session.save_path setting via phpinfo(), and you can change this path with session_save_path($path). This isn't recommended usage though, so do so at your own risk.
Alternatively, use a JSON text file stored somewhere on your server, identified in a similar manner to the session files.
You should store the current progress of the query in a file, along with whether the transaction has been interrupted by the user. A check should be performed on the interrupt bit/boolean before continuing to iterate over the result set.
The issue arises when you consider concurrency: what if the boolean is edited just slightly before, or at the same time as, the count? Perhaps you just keep updating the file with interrupts until the other script gets the message. This, however, is not an elegant solution.
Nor does this solution allow for concurrent queries run by the same user. To counter this, an additional check should be performed on the session file to determine whether something is already running, and an error should be flagged to notify the user.
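For illustration only, a minimal sketch of that file-based progress idea in the Laravel context of the question (the file name and helper functions are made-up assumptions, not part of any framework API):
<?php
// One progress file per session; storage_path() and session() are standard Laravel helpers.
function progressPath() {
    return storage_path('app/search_progress_' . session()->getId() . '.json');
}

function writeProgress($count, $interrupted = false) {
    file_put_contents(progressPath(), json_encode([
        'count'       => $count,
        'interrupted' => $interrupted,
    ]), LOCK_EX);
}

function readProgress() {
    $raw = @file_get_contents(progressPath());
    return $raw ? json_decode($raw, true) : ['count' => 0, 'interrupted' => false];
}

// In P(): call writeProgress($i) as results come in, and check
// readProgress()['interrupted'] before each iteration to honour a user interrupt.
// In R(): return response()->json(readProgress());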
Given the option, I would personally rewrite the code in either JSP or ASP.NET.
All in all this is a lot of work for an unreliable feature.
I have the following function in PHP that reads page URLs from an array and fetches the HTML content of the corresponding pages for parsing. The following code works fine.
public function fetchContent($HyperLinks){
    foreach($HyperLinks as $link){
        $content = file_get_html($link);
        foreach($content->find('blablabla') as $result)
            $this->HyperLink[] = $result->xmltext;
    } //foreach
    return($this->HyperLink);
}
The problem with the code is that it is very slow and takes about 1 second to fetch and parse each page. Considering the very large number of files to read, I am looking for a parallel model of the above code. The content of each page is just a few kilobytes.
I searched and found the exec command, but I cannot figure out how to do it. I want to have a function and call it in parallel N times so the execution takes less time. The function would get one link as input, like below:
public function FetchContent($HyperLink){
    // reading and parsing code
}
I tried this exec call:
print_r(exec("FetchContent",$HyperLink ,$this->Title[]));
but no luck. I also replaced "FetchContent" with "FetchContent($HyperLink)" and removed the second parameter, but neither works.
Thanks. Please let me know if anything is missing. You may suggest any way that helps me quickly process the content of numerous files, at least 200-500 pages.
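For what it's worth, exec() runs a shell command rather than a PHP function, so a parallel attempt along those lines would look roughly like the sketch below (worker.php is a hypothetical script that fetches and parses a single URL and stores its own result, e.g. in a file or database):
<?php
foreach ($HyperLinks as $link) {
    // The trailing "&" backgrounds each worker so the fetches run concurrently;
    // stdout is discarded, so worker.php must store its result itself.
    exec('php worker.php ' . escapeshellarg($link) . ' > /dev/null 2>&1 &');
}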
Below is my code:
foreach(simplexml_load_file('http://www.bbc.co.uk/radio1/playlist.xml')->item as $link){
    $linked = $link->artist;
    $xml_data = file_get_contents('http://ws.audioscrobbler.com/2.0/?method=artist.getimages&artist=' . $linked . '&api_key=b25b959554ed76058ac220b7b2e0a026');
    $xml = new SimpleXMLElement($xml_data);
    foreach($xml->images as $test){
        $new = $test->image->sizes->size[4];
        echo "<img src='$new'>";
        ?><br /><?php
    }
}
?>
This does work, but it only displays one record out of many; it shows just the first record from the XML file. I want it to display all of the records.
What I am trying to achieve from this code is:
I have an XML file I am getting the artist names from. I am then listing all of the artist names and inserting each into a link, which is therefore dynamically created from the generated artist names. I then want to take the dynamically created link, which points to another XML file, and parse that file to get the size node, which is an image link (the image is of the artist). I then want to echo that image link out into an image tag which displays the image.
It partially works, but as I said earlier, it only displays one record instead of all the records in the XML file.
The returned XML is structured like this:
<lfm>
  <images>
    <image>
    <image>
    …
    <image>
Which means you have to iterate
$xml->images->image
Example:
$lfm = simplexml_load_file('http://…');
foreach ($lfm->images->image as $image) {
    echo $image->sizes->size[4];
}
On a side note, there is no reason to use file_get_contents there. Either use simplexml_load_file or use new SimpleXmlElement('http://…', false, true). And really, no offense, but given that I have already given you an almost identical solution in the comments to When extracting artist name from XML file only 1 record shows, I strongly suggest you try to understand what is happening there instead of just copying and pasting.
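For example, the inner part of the question's loop could become something like this (untested sketch; it assumes allow_url_fopen is enabled so SimpleXMLElement can fetch the URL directly, and it adds urlencode(), which the original omitted):
$url = 'http://ws.audioscrobbler.com/2.0/?method=artist.getimages&artist='
     . urlencode((string) $linked) . '&api_key=b25b959554ed76058ac220b7b2e0a026';
$xml = new SimpleXMLElement($url, 0, true); // third argument: treat the string as a URL

foreach ($xml->images->image as $image) {
    echo "<img src='" . $image->sizes->size[4] . "'><br />";
}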
Problems:
Rate limiting. My comment from the question:
Please note how much network traffic you are generating on each execution of this script, and cache accordingly. It's quite possible you could be rate-limited if you execute too often or too many times in a day (and API rate-limits are often a lot lower than one might think).
Even if you are just "testing", or you and a "few other people" use this, every single request makes 40 automated requests to ws.audioscrobbler.com! They are not going to be happy about this, and since it appears they are smart, they have banned this kind of traffic.
When I run this script, ws.audioscrobbler.com serves up the first result (Artist: Adele), but gives request-failed warnings on many subsequent requests until some time period has passed, obviously due to a rate limit.
Remedies:
Check if the API for ws.audioscrobbler.com has a multiple-artist query. This would allow you to get multiple artists with one request.
Create a manager interface that can get AND CACHE results for one artist at a time. Then, perform this process when you need updates and use the cached results all other times (a minimal sketch follows below).
Regardless of which method you use, cache, cache, cache!
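A minimal per-artist file cache might look like this (the cache directory, TTL, and function name are illustrative assumptions):
<?php
function getArtistImagesXml($artist, $ttl = 86400) {
    $cacheFile = __DIR__ . '/cache/' . md5($artist) . '.xml'; // assumes the cache/ directory exists

    // Only hit ws.audioscrobbler.com when the cached copy is missing or older than $ttl seconds.
    if (!is_file($cacheFile) || time() - filemtime($cacheFile) > $ttl) {
        $url = 'http://ws.audioscrobbler.com/2.0/?method=artist.getimages&artist='
             . urlencode($artist) . '&api_key=b25b959554ed76058ac220b7b2e0a026';
        file_put_contents($cacheFile, file_get_contents($url));
    }

    return simplexml_load_file($cacheFile);
}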
Wrong argument supplied to the inner foreach. file_get_contents returns a string. Even though the contents are XML, you haven't loaded it into an XML parser. You need to do that before you can iterate on it.
I made a simple parser for saving all images per page with Simple HTML DOM and the GetImage class, but I had to make a loop inside the loop in order to go page by page, and I think something is just not optimized in my code, as it is very slow and always times out or exceeds the memory limit. Could someone have a quick look at the code and maybe spot something really stupid that I did?
Here is the code without libraries included...
$pageNumbers = array(); //Array to hold number of pages to parse
$url = 'http://sitename/category/'; //target url
$html = file_get_html($url);

//Detecting the paginator class and pushing the page numbers into an array to find out how many pages to parse
foreach($html->find('td.nav .str') as $pn){
    array_push($pageNumbers, $pn->innertext);
}

// initializing the get image class
$image = new GetImage;
$image->save_to = $pfolder.'/'; // save to folder, value from post request.

//Start reading pages array and parsing all images per page.
foreach($pageNumbers as $ppp){
    $target_url = 'http://sitename.com/category/'.$ppp; //Here I construct a page URL from the array to parse.
    $target_html = file_get_html($target_url); //Reading the page html to find all images inside next.

    //Final loop to find and save each image per page.
    foreach($target_html->find('img.clipart') as $element) {
        $image->source = url_to_absolute($target_url, $element->src);
        $get = $image->download('curl'); // using GD
        echo 'saved'.url_to_absolute($target_url, $element->src).'<br />';
    }
}
Thank you.
I suggest making a function to do the actual Simple HTML DOM processing.
I usually use the following 'template'... note the 'clear memory' section.
Apparently there is a memory leak in PHP 5... at least I read that someplace.
function scraping_page($iUrl)
{
    // create HTML DOM
    $html = file_get_html($iUrl);

    // get text elements
    $aObj = $html->find('img');

    // do something with the element objects

    // clean up memory (prevent memory leaks in PHP 5)
    $html->clear();  // **** very important ****
    unset($html);    // **** very important ****

    return; // also can return something: array, string, whatever
}
Hope that helps.
You are doing quite a lot here; I'm not surprised the script times out. You download multiple web pages, parse them, find images in them, and then download those images... how many pages, and how many images per page? Unless we're talking about very small numbers, this is to be expected.
I'm not sure what your question really is, given that, but I'm assuming it's "how do I make this work?". You have a few options; it really depends on what this is for. If it's a one-off hack to scrape some sites, ramp up the memory and time limits, maybe chunk up the work a little, and next time write it in something more suitable ;)
If this is something that happens server-side, it should probably happen asynchronously to user interaction - i.e. rather than the user requesting some page which has to do all this before returning, this should happen in the background. It wouldn't even have to be PHP; you could have a script running in any language that gets passed things to scrape and does it.
I'm trying to find a PHP/JS script that will let me take an image and, when it's clicked, increase a number in a flat file and save that file.
I know how to include the file to get the vote total.
I'm going insane trying to find something to plug and play into my website. I'd love to have IP logging and a cool fade in/out refresh update, but at this point I'll settle for the basics.
I'd like to avoid using MySQL, but if it's necessary I can work with it.
Your best bet is to use the AJAX support in jQuery to access, but not show to the user, some kind of URL that writes the increment to the file. If you're using any kind of a thorough platform, you should consider doing this with your database. However, it'd be simple enough to use jQuery's $.get() function to access the URL /increment_number.php?image=whatever.jpg. If you ever start using a database, you'd just have to change this script to perform a DB query. For your case, you'd have a simple script like this (which has in no way been optimized and has no security considerations whatsoever):
$image = $_GET['image'];
$number = @file_get_contents("tracker_for_{$image}.txt"); // double quotes so {$image} is interpolated
$number = (int) $number + 1; // a missing or empty file simply counts as 0
$file = fopen("tracker_for_{$image}.txt", 'w');
fwrite($file, $number);
fclose($file);
And just remember to have the following bit of JS on the page with the image:
$(document).ready(function(){
    $('img.incrementme').click(function(){
        $.get('/increment_number.php?image=' + encodeURIComponent($(this).attr('src')));
    });
});
I haven't tested this code so it might not work, but it's in the spirit of what you'd have to do.
Won't something simple like this work?
<?php
// Link to this file: <a href='onclick.php'><img src='yourimg'></a>
$count = file_get_contents("count.file");
$count += 1;
file_put_contents("count.file", $count);
// Possibly log an IP too? open a file
$f = fopen("ipaddresses.file", "a");
fwrite($f, $_SERVER["REMOTE_ADDR"] . "\n");
fclose($f);
?>
If you are doing this for a voting system like Stack Overflow, creating lots of files to store this one bit of information is going to become unwieldy. This is perfect for a database.
That way, you also wouldn't include the file, but would perform a query to get the total score.
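Roughly, with a hypothetical votes table (image_id, ip_address) and a PDO connection in $pdo, both the click handler and the total lookup become queries instead of flat-file reads:
<?php
// Record one vote, including the IP for basic logging.
$stmt = $pdo->prepare('INSERT INTO votes (image_id, ip_address) VALUES (?, ?)');
$stmt->execute([$imageId, $_SERVER['REMOTE_ADDR']]);

// Display the total score with a query rather than including a flat file.
$stmt = $pdo->prepare('SELECT COUNT(*) FROM votes WHERE image_id = ?');
$stmt->execute([$imageId]);
$total = (int) $stmt->fetchColumn();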