Parse a website, get all the links and save them into a MySQL database - PHP

I'm working with PHP and MySQL along with PHP Simple HTML DOM Parser. I have to parse a website's pages and fetch some content. For that I used the homepage of the website as the initial URL and fetched all the anchor tags available on that page.
I have to filter those URLs, as not every link is useful to me, so I used a regular expression. The required links must be saved into my MySQL database.
My questions are:
If I extract all the links (around 120,000) and try to save them into the MySQL DB, I get the following error:
Fatal error: Maximum execution time of 60 seconds exceeded in C:\xampp\htdocs\search-engine\index.php on line 12
I can't store the data in the database.
I couldn't filter the links.
include('mysql_connection.php');
include('simplehtmldom_1_5/simple_html_dom.php');

$website_name = "xyz.html/";
$html = file_get_html("xyz.html/");

foreach ($html->find('div') as $div)
{
    foreach ($html->find('a') as $a_burrp)
    {
        echo $a1 = $a_burrp->href . '<br>';
        if (preg_match('/.+?event.+/', $a1, $a_match))
        {
            mysql_query("INSERT INTO scrap_urls(url, website_name, date_added) VALUES('$a1', '$website_name', now())");
        }
    }
}

You are receiving "Fatal error: Maximum execution time of 60 seconds exceeded" because of a configuration limit in PHP. You can raise this limit by adding a line like this at the top of your code:
set_time_limit(320);
More info: http://www.php.net/manual/en/function.set-time-limit.php
You can also just raise max_execution_time in the php.ini file of your XAMPP installation.
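For illustration, this is how the override might sit at the top of the script from the question; the exact value (or 0 for no limit) is a judgment call, and the php.ini path mentioned is only the typical XAMPP location:
<?php
// Raise (or remove) the time limit for this request only.
set_time_limit(0);                        // 0 = no limit; a finite value such as 320 also works
// ini_set('max_execution_time', '320');  // equivalent runtime override, if you prefer ini_set
// Or set it globally in php.ini (usually \xampp\php\php.ini under XAMPP): max_execution_time = 320

include('mysql_connection.php');
include('simplehtmldom_1_5/simple_html_dom.php');
// ... rest of the crawling and INSERT code from the question ...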

Actually, PHP is not the best solution here. A PHP script is intended to perform quick operations and return a response. In your case the script can run for quite a long time. Although you are able to increase max_execution_time, I encourage you to use another technology that is more flexible than standard PHP for this kind of job, such as Python or JavaScript (Node.js).

I also usually work with PHP scripts that need "some time" to finish.
I always run those scripts either as a cron job or directly from the shell/command line using:
php script.php parameters
That way I don't have to worry about the execution time limit.
There is a reason why max_execution_time is usually kept at 60 seconds or less for web requests.
Regards.
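For reference, the cron variant is just a crontab entry pointing at the PHP binary; the schedule, paths and log file below are placeholders:
# Run the scraper every night at 02:00; adjust the PHP binary and script paths.
0 2 * * * /usr/bin/php /path/to/script.php parameters >> /var/log/scraper.log 2>&1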

Related

Using PHP to read real time serial data

I am using a Raspberry Pi to read real time RS232 data via a USB port using a Prolific 2303 adaptor.
I want to parse this and display it in a web page.
My first approach was to use a Python script. It could read and parse the data, but passing it to a PHP web page proved a problem: either I could use GET to send it, or save it to a file and read it with PHP. The latter is a non-starter with an SD card (it's not going to last long), and the former might involve file storage anyway.
I then tried a PHP approach:
$device = "/dev/ttyUSB0";
$fp = fopen($device,"r") or die();
while(true){
$xml = fread($fp,"400");
print_r($xml);
}
fclose($fp);
print_r($xml) produces:
CC128-v0.110011909:27:3016.70000951000660008500024 CC128-v0.110011909:27:3616.70000951000670008600024 CC128-v0.110011909:27:4216.70000951000680008700027 CC128-v0.110011909:27:4816.70000951000680008600024 CC128-v0.110011909:27:5516.70000951000680008800024
This is five bursts of XML stripped of all its tags. The complete XML stream is 340 characters long, 57600-N-8-1, sent every 6 seconds.
Then it crashes with "Fatal error: Maximum execution time of 30 seconds exceeded in ~/meter.php on line 79". Sometimes there is missing or corrupted data, too.
If I use $xml = fgets($fp); I get no data.
The stream I am expecting, as read using Python, is:
['<msg><src>CC128-v0.11</src><dsb>00114</dsb><time>17:45:02</time><tmpr>16.8</tmpr><sensor>0</sensor><id>00095</id><type>1</type><ch1><watts>00065</watts></ch1><ch2><watts>00093</watts></ch2><ch3><watts>00024</watts></ch3></msg>\r\n']
I tried to use PECL DIO but could not locate all the dependencies. Apparently it is deprecated for the current version of Debian and will be removed from the next; I don't know if that refers to Buster, Bullseye or Bookworm. I also tried using php_serial.class, which I found on GitHub, but could not get a complete XML file output, or even the stripped-down data-only stream.
What am I missing to get a PHP variable updated every 6 seconds?
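For context, a common shape for this kind of loop is sketched below. It assumes the port has already been configured from the shell (for example with stty -F /dev/ttyUSB0 57600 raw) and that each burst ends in \r\n as in the Python output above; it is an illustration, not a verified fix:
<?php
set_time_limit(0);                          // avoid the 30-second fatal error for a long-running reader

$device = "/dev/ttyUSB0";
$fp = fopen($device, "r") or die("Cannot open $device");
stream_set_blocking($fp, true);             // block until a full line arrives

while (($line = fgets($fp)) !== false) {    // one 340-character <msg>...</msg> burst per line
    $msg = simplexml_load_string(trim($line));
    if ($msg !== false) {
        $watts = (string) $msg->ch1->watts; // e.g. "00065"
        // ... store or display $watts, updated roughly every 6 seconds ...
    }
}
fclose($fp);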

Make python scripts run threaded when calling from php

I have a PHP webpage that makes multiple queries to the database and displays the results on charts.
The logic is: there is index.php, where the query can be made. After submitting the data, 6 different PHP pages are called. The PHP pages log the query, run the appropriate Python script and build the charts with JavaScript. Each of those 6 PHP pages is displayed in index.php in divs. All of the Python scripts take the same input and query the same database; the difference comes from the data pulled from the database and from the subsequent JavaScript used to make the charts.
Example of calling one of the PHP pages:
$("#chartFOO").load("http://example/test/get_foo.php? bar=".concat(bar)+"&start=".concat(start)+"&end=".concat(end), function(responseTxt, statusTxt, xhr){
if(statusTxt == "error")
alert("Error: " + xhr.status + ": " + xhr.statusText);
});
Example of calling the Python script:
if ($msisdn) {
    $command = escapeshellcmd("/home/example/scripts/graph_foo.py $bar $start $end");
    $output = shell_exec($command);
}
And the output is then used in the PHP file, to make charts. All of the PHP files are displayed in divs with different styling on index.php.
The problem is, it doesn't run them on multiple threads and it locks up the system, which makes the response time for the query quite slow. Is it right that only one shell command can be run at a time?!
I have tried putting all the Python scripts as functions and the 6 PHP files as strings in one file, trying to call it all with one command, but so far I have problems formatting the PHP files: I can't use '{}' to format, because the PHP files already contain those. I had the idea of using the threading module to run the functions, and of using one connection to the database to save the time of connecting 6 times, because each connection takes time.
Is there any reasonable solution to have the scripts run threaded, without having to rework the whole webpage? How can PHP, JavaScript and Python be mixed?
A lot to read and a lot to ask, but thanks in advance for your time.
EDIT:
I created a new file, which basically has all 6 files in it, but calling the Python scripts is a bit different now. From index.php I am only calling this one file, like I did before with the 6 files.
Example of new way:
$part->handles = [
    popen("/home/example/scripts/graph_foo.py {$bar} {$start} {$end}", 'r'),
    popen("/home/example/scripts/graph_foo2.py {$bar} {$start} {$end}", 'r')
];
And the way I solved the memory issue:
$output0 = '';
while (!feof($part->handles[0])) {
    $output0 .= fread($part->handles[0], 32768);
}
$output1 = '';
while (!feof($part->handles[1])) {
    $output1 .= fread($part->handles[1], 32768);
}
I don't know if this is the best way, but it works; I don't know PHP well. It did take 0.5 minutes off the request time, which helps.
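For readers who want to extend this to all six scripts, a generalized shape might look like the sketch below; the $bar/$start/$end variables are carried over from the post, while the script list and everything else is an assumption:
// Start every script first so the Python processes run concurrently, then drain each pipe.
$scripts = ['graph_foo.py', 'graph_foo2.py']; // hypothetical list of all six scripts
$handles = [];
foreach ($scripts as $script) {
    $handles[$script] = popen("/home/example/scripts/{$script} {$bar} {$start} {$end}", 'r');
}

$outputs = [];
foreach ($handles as $script => $h) {
    $outputs[$script] = stream_get_contents($h); // read the whole pipe for this script
    pclose($h);
}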

How to rerun the script when an error occurred?

I have written a script that searches for values in XML files. I retrieve these files online via the following code:
# Reads entire file into a string
$result = file_get_contents($xml_link);
# Converts the XML string into an object
$xml = simplexml_load_string($result);
But the XML files are sometimes big, with the consequence that I get the following error: Fatal error: Maximum execution time of 30 seconds exceeded.
I have adapted php.ini with max_execution_time set to 360 seconds, but I still get the same error.
I have two options in mind.
If this error occurs, run the line again. But I couldn't find anything online (I am probably searching with the wrong search terms). Is there a possibility to run the line where the error occurs again?
Save the XML files temporarily to local storage, search for the information that way, and remove the files at the end of the process. Here I have no idea how to remove them after retrieving all the data. And would this actually solve the problem? My script still needs to search through the XML file, so won't it take the same amount of time?
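For the first option, a minimal retry shape might look like the sketch below; the attempt count and pause are illustrative assumptions. Note that a fatal "maximum execution time exceeded" error cannot be caught and retried from within the same request, so this only helps when the download itself fails:
// Fetch and parse the feed, retrying the download a few times before giving up.
function load_feed($xml_link, $attempts = 3)
{
    while ($attempts-- > 0) {
        $result = @file_get_contents($xml_link);   // returns false on failure
        if ($result !== false) {
            return simplexml_load_string($result); // false if the XML is invalid
        }
        sleep(1);                                  // brief pause before retrying
    }
    return false;
}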
When I used these two lines in my script the problem was solved.
ini_set('max_execution_time', 300);
set_time_limit(0);

Check cron job has run script properly - proper way to log errors in batch processing

I have set up a cron job to run a script daily. This script pulls out a list of IDs from a database, loops through each to get more data from the database and generates an XML file based on the data retrieved.
This seems to have run fine for the first few days; however, the list of IDs is getting bigger and today I have noticed that not all of the XML files have been generated. It seems to be random IDs that have not run. I have manually run the script to generate the XML for some of the missing IDs individually and they ran without any issues.
I am not sure how to locate the problem, as the cron job is definitely running but not always generating all of the XML files. Any ideas on how I can pinpoint this problem and quickly find out which files have not been generated?
I thought perhaps I could add timestart and timeend fields to the database and enter these values at the start and end of each XML generation run; this way I could see what had run and what hadn't, but I wondered if there was a better way.
set_time_limit(0);

//connect to database
$db = new msSqlConnect('dbconnect');

$select = "SELECT id FROM ProductFeeds WHERE enabled = 'True' ";
$run = mssql_query($select);

while ($row = mssql_fetch_array($run)) {
    $arg = $row['id'];
    //echo $arg . '<br />';
    exec("php index.php \"$arg\"", $output);
    //print_r($output);
}
My suggestion would be to add some logging to the script. A simple
error_log("Passing ID:".$arg."\n",3,"log.txt");
can give you some info on whether the ID is being passed. If you find that it is, you can introduce logging to index.php to evaluate the problem further.
By the way, can you explain why you are using exec() to run a PHP script? Why not execute a function in the loop? This could well be the source of the problem.
Because with exec() I think the process will run in the background and the loop will continue, so you could really choke your server that way; maybe that's worth looking into as well. (I think this also depends on how the output is handled:
Note: If a program is started with this function, in order for it to continue running in the background, the output of the program must be redirected to a file or another output stream. Failing to do so will cause PHP to hang until the execution of the program ends.
Maybe some other users can comment on this.
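To illustrate the note quoted above, a call of this shape redirects the output and backgrounds each invocation so exec() returns immediately; the log path is a placeholder:
// Redirect stdout/stderr to a file and append '&' so PHP does not wait for index.php to finish.
exec("php index.php \"$arg\" >> /tmp/xmlgen.log 2>&1 &");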
It turned out that Apache was timing out, so it had nothing to do with using a function versus the exec() call.

PHP Memory differing server to server

I have a hefty PHP script.
So much so that I have had to do
ini_set('memory_limit', '3000M');
set_time_limit (0);
It runs fine on one server, but on another I get: Out of memory (allocated 1653342208) (tried to allocate 71 bytes) in /home/writeabo/public_html/propturk/feedgenerator/simple_html_dom.php on line 848
Both are on the same package from the same host, but different servers.
Above problem solved; new problem below for bounty.
Update: The script is so big because it crawls a site and parses data from 252 pages, including over 60,000 images, of which it makes two copies. I have since broken it down into parts.
I have another problem now, though, when I am writing the images from the outside site to my server like this:
try {
    $imgcont = file_get_contents($va); // $va is an img src from an array of thousands of srcs
    $h = fopen($writeTo, 'w');
    fwrite($h, $imgcont);
    fclose($h);
} catch (Exception $e) {
    $error .= (!isset($error)) ? "error with <img src='" . $va . "' />" : "<br/>And <img src='" . $va . "' />";
}
All of a sudden it goes to a 500 internal server error page and I have to run it again, at which point it works, because files are only copied if they don't already exist. Is there any way I can catch the 500 response code and send the request back to the URL to make it go again? This is all supposed to be an automated process.
If this is memory related, I would personally use copy() rather than file_get_contents(). It supports the file wrappers the same way, and I don't see any advantage in loading the whole file into memory just to write it back to the filesystem.
Otherwise, your error_log might give you more information as to why the 500 happens.
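For illustration, the copy() version is a single call using the question's variables:
// Stream the remote image straight to disk instead of buffering it all in memory.
if (!copy($va, $writeTo)) {
    $error .= "error with <img src='" . $va . "' />";
}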
There are three parties involved here:
Remote - The server(s) that contain the images you're after
Server - The computer that is running your php script
Client - Your home computer if you are running the script from a web browser, or the same computer as the server if you are running it from Cron.
Is the 500 error you are seeing being generated by 'Remote' and seen by 'Server' (i.e. the images are temporarily unavailable), or is it being generated by 'Server' and seen by 'Client' (i.e. there is a problem with your script)?
If it is being generated by 'Remote', then see Ali's answer for how to retry.
If it is being generated by your script on 'Server', then you need to identify exactly what the error is - the php error logs should give you more information. I can think of two likely causes:
Reaching PHP's time limit. PHP will only spend a certain amount of time working before returning a 500 error. You can set this to a higher value, or regularly re-set the timer with a call to set_time_limit(), but that won't work if your server is configured in safe mode.
Reaching PHP's memory limit. You seem to have encountered this already, but it's worth making sure your script still isn't eating lots of memory. Consider outputting debug data (possibly only if you set $config['debug_mode'] = true or something). I'd suggest:
try {
    echo 'Getting '.$va.'...';
    $imgcont = file_get_contents($va); // $va is an img src from an array of thousands of srcs
    $h = fopen($writeTo, 'w');
    fwrite($h, $imgcont);
    fclose($h);
    echo 'saved. Memory usage: '.(memory_get_usage() / (1024 * 1024)).' <br />';
    unset($imgcont);
} catch (Exception $e) {
    $error .= (!isset($error)) ? "error with <img src='" . $va . "' />" : "<br/>And <img src='" . $va . "' />";
}
I've also added a line to remove the image from memory, in case PHP isn't doing this correctly itself (in theory that line shouldn't be necessary).
You can avoid both problems by making your script process fewer images at a time and calling it regularly - either using Cron on the server (the ideal solution, although not all shared webhosts allow this), or some software on your desktop computer. If you do this, make sure you consider what will happen if there are two copies of the script running at the same time - will they both fetch the same image at the same time?
So it sounds like you're running this process via a web browser. I'm guessing that you may be getting the 500 error from Apache timing out after a certain period of time, or the process dying, or something funky. I would suggest you do one of the following:
A) Move the image downloading to a background process. You can run the crawl script in the browser, which will write the URLs of the images to be downloaded to the DB or similar, and another script will fire up via cron and fetch all the images. You could also have this script work in batches of 100 or so at a time to keep memory consumption down.
B) Call the script directly from the command line (this is really the preferred method for something like this anyway, and you should still probably separate the image fetching into another script).
C) If the command line is not an option for some reason, have your browser-loaded script touch a file, and have a cron job that runs every minute and looks for the file to exist. Then it fires up your script; you can have the output written to a file for you to check later, or send an email when it's completed.
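A minimal sketch of option C, with placeholder file names, might look like this:
// Browser-loaded script: request a crawl by touching a flag file.
touch('/tmp/start_crawl.flag');

// Cron script (run every minute): start the crawl only if the flag exists.
if (file_exists('/tmp/start_crawl.flag')) {
    unlink('/tmp/start_crawl.flag'); // consume the flag so the crawl runs once
    // ... run the crawler here, writing progress to a log file ...
}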
Is there any way I can catch the 500 response code and send the request back to the URL to make it go again? This is all supposed to be an automated process.
Here's the simple version of how I would do it:
function getImage($va, $writeTo, $retries = 3)
{
    while ($retries > 0) {
        if ($imgcont = file_get_contents($va)) {
            file_put_contents($writeTo, $imgcont);
            return true;
        }
        $retries--;
    }
    return false;
}
This doesn't create the file unless we successfully get our image file, and it will retry three times by default. You will of course need to add any required exception handling, error checking, etc.
I would definitely stop using file_get_contents() and write the files in chunks, like this:
$read = fopen($url, 'rb');
$write = fopen($local, 'wb');
$chunk = 8096;
while (!feof($read)) {
    fwrite($write, fread($read, $chunk));
}
fclose($read);
fclose($write);
This will be nicer to your server, and should hopefully solve your 500 problems. As for "catching" a 500 error, this is simply not possible. It is an irretrievable error thrown by your script and written to the client by the web server.
I'm with Swish, this is not really the kind of task that PHP is intended for; you'd be much better off using some sort of server-side scripting.
Is there any way I can catch the 500 response code and send the request back to the URL to make it go again?
Have you considered using another library? Fetching files from an external server seems to me more like a job for curl or FTP than file_get_contents and the like. If the error is external and you're using curl, you can detect the 500 return code and handle it appropriately without crashing. If not, then maybe you should split your program into two files: one that fetches a single file/image, and another that uses curl to repeatedly call the first one. Unless the 500 error means that all PHP execution crashes, you would be able to detect the failure and handle it.
Something like this pseudocode:
file1.php:
foreach (list_of_files as filename) {
    do {
        x = call_curl('file2.php', filename);
    } while (x == 500);
}
file2.php:
filename = $_GET['filename'];
results = use_curl_to_get_page(filename);
echo results;
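For concreteness, a runnable version of file1.php could look like the sketch below; the URL, file list and 'filename' parameter are assumptions carried over from the pseudocode:
// file1.php: call file2.php once per filename via curl, retrying while it returns 500.
$list_of_files = array('img1.jpg', 'img2.jpg'); // hypothetical list
foreach ($list_of_files as $filename) {
    do {
        $ch = curl_init('http://localhost/file2.php?filename=' . urlencode($filename));
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // capture the body instead of printing it
        $body = curl_exec($ch);
        $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);
    } while ($status == 500); // retry on server error
}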
Thanks for all your input. I had separated everything by the time I wrote this question, so the crawler fires the image grabber, and so on.
I took on board the solution to split the number of images, and that also helped.
I also added a try/catch around the file read.
This was only being called from the browser during testing, but now that it is all up and running it is going to be a cron job.
Thanks Swish and Benubird for your particularly detailed and educational answers. Unfortunately I had no cooperation with the developers on the backend where the images are coming from (long and complicated story).
Anyway, all good now, so thanks. (Swish, how do you call a script from the command line? My knowledge of this field is severely lacking.)
