Writing a PHP web crawler using cron - php

I have written myself a web crawler using simplehtmldom, and have got the crawl process working quite nicely. It crawls the start page, adds all links into a database table, sets a session pointer, and meta refreshes the page to carry onto the next page. That keeps going until it runs out of links
That works fine however obviously the crawl time for larger websites is pretty tedious. I wanted to be able to speed things up a bit though, and possibly make it a cron job.
Any ideas on making it as quick and efficient as possible other than setting the memory limit / execution time higher?

Looks like you're running your script in a web browser. You may consider running it from the command line. You can execute multiple scripts to crawl on different pages at the same time. That should speed things up.

Memory must not be an problem for a crawler.
Once you are done with one page and have written all relevant data to the database you should get rid of all variables you created for this job.
The memory usage after 100 pages must be the same as after 1 page. If this is not the case find out why.
You can split up the work between different processes: Usually parsing a page does not take as long as loading it,so you can write all links that you find to a database and have multiple other processes that just download the documents to a temp directory.
If you do this you must ensure that
no link is downloaded by to workers.
your processes wait for new links if there are none.
temp files are removed after each scan.
the download process stop when you run out of links. You can archive this by setting a "kill flag" this can be a file with a special name or an entry in the database.

Related

PHP Website exceeding 1MB Memory allocation on Shared Server

I have a small website / web application (HTML/Jquery/PHP/MYSQL) that loads an HTML document, then simultaneously calls php files in the backend to fetch data sets.
For example, a contacts.php page is loaded, then AJAX calls to 4 PHP scripts at the same time to load different data sets:
contacts-names.php
contacts-groups.php
contacts-tags.php
contacts-locations.php
These datasets are rather small and are less than 100 rows in the DB.
individually these guys run fine. But when called from main page (initial page load) I get the memory limit hits.
If it was just one file causing the issue, i could dig in and optimize it.. but every time i load the page, one or 2 of the above calls get an error (hitting memory limit) and they seem so random.
I went as far as to putting an exit(); code on the top of the script (to stop it from running) to two of the files i would still get max resource hit randomly, sometimes on the file with the exit() code! doesn't make sense. The darn file isn't running any code anymore.
The only way that seem to fix this is when i remove some of the calls (2 scripts) w/c renders my App useless.
Or if I set a delayed call to 2 file (JS timeout) ..
So it seems that calling all php scripts at the same time causes the memory limit issue. Is this normal? I could simply settle w/ this delayed call strategy but I wanted to know if you guys had to do this as well (on limited shared hosts)
Other notes:
- im on a very good cloud unix type shared server
- even if i terminate 2 of the scripts being called simultaneously w/ the other 2 scripts , i still get issues. So it must not be my code and optimization wont do me any good.
- i profiled my scripts individually via xdebug and cachegrind and they all seem fine.
Best

PHP script with Wikipedia API is suddenly slow, what could be wrong?

I run a php script that uses the wikipedia api to locate wikipedia pages about certain movies, based on a long list with titles and year of release. This takes 1-2 seconds per query on average, and I do about 5 queries per minute. This has been working well for years. But since february 11 it suddenly became very slow : 30 seconds per query seems the norm now.
This is a example from a random movie in my list, and the link my script loads with file_get_contents();
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=yaml&rvsection=0&titles=Deceiver_(film)
I can put this link in my browser directly and it takes no more than a few seconds to load and open it. So I don't think the wikipedia api servers have suddenly become slow. When I load the link to my php script from my webserver in my browser, it takes between 20 to 40 seconds before the page is loaded and the result from one query is shown. When I run the php script from the command line of my webserver, I have the same slow loading times. My script still manage to save some results to the database now and then, so I'm probably not blocked either.
Other parts of my php scripts have not slowed down. There is a whole bunch of calculations done with the results of the wikipedia api, and all that is still working at a regular speed. So my webserver is still healthy. The load is always pretty low, and I'm not even using this server for something else. I have restarted apache, but found no difference in loading times.
My questions :
has something changed in the wikipedia api system recently ? Perhaps my way of using it is outdated and I need to use something new ?
where could I look for the cause of this slow loading ? I've dug through error files and tested what I could test, but I don't even know where something goes wrong. If I knew what to look for, I might perhaps easily fix it.
git log --until=2013-02-11 --merges in operations/puppet.git shows no changes on that day. Sounds like a DNS issue or similar on your end.

Where to store a system flag?

This may seem a silly question, but having been pondering it for a few days, i've only come up with one answer, so I thought i'd throw it open and see what other people think.
I have a PHP script which runs as a CRON job, and checks that our CDN is up and running. If it is, it needs to set a flag to show 'true', if not to show 'false'. The main PHP webpages then need to check this flag everytime a user visits the site, and set the CDN URL variable appropriately.
The query is where do we store this flag? We could put it in a database, but then the database would get 'hit' every page load. Also, as we're using Memcache it could result in a delay to it being updated (although there are ways around that, obviously).
My next thought was to save it into a simple text file on the server. But a) we have load-balanced servers, and although we could run the same CRON job on each server, that seems inefficient, and b) opening and reading a text file could, i thought, slow the page load down, as its involving disk activity everytime - that isn't good when the system is delivering 3 million web pages a week!
The ideal would be PHP system variable that retains its value between pages - but of course, that doesn't exist! Its not the kind of information that should be stored as a cookie on the users machine either.
Does any one have any thoughts? All comments welcome!
If you are already using Memcache, then why do you need the CronJob at all?
Put the check for the CDN into your regular application code. Cache the result in Memcache with a lifetime of what your Cron Interval is right now. If the cache is stale, do the full checking on the CDN, otherwise return the cached result.
If you want to keep the CRON job and are concerned about Disk I/O, consider using a RAM disk or Semaphores or Shared Memory to store the results. But it seems kinda pointless when you are already using memcache.
Do you already use a database connection?
If so, use that; you can fit it into your existing replication mechanism pretty easily, I'm sure.
If not, then whatever you do here is going to add overhead, be it reading a text file on disk or reading from a database where you weren't before.

How do php 'daemons' work?

I'm learning php and I'd like to write a simple forum monitor, but I came to a problem. How do I write a script that downloads a file regularly? When the page is loaded, the php is executed just once, and if I put it into a loop, it would all have to be ran before the page is finished loading. But I want to, say, download a file every minute and make a notification on the page when the file changes. How do I do this?
Typically, you'll act in two steps :
First, you'll have a PHP script that will run every minute -- using the crontab
This script will do the heavy job : downloading and parsing the page
And storing some information in a shared location -- a database, typically
Then, your webpages will only have to check in that shared location (database) if the information is there.
This way, your webpages will always work :
Even if there are many users, only the cronjob will download the page
And even if the cronjob doesn't work for a while, the webpage will work ; worst possible thing is some information being out-dated.
Others have already suggested using a periodic cron script, which I'd say is probably the better option, though as Paul mentions, it depends upon your use case.
However, I just wanted to address your question directly, which is to say, how does a daemon in PHP work? The answer is that it works in the same way as a daemon in any other language - you start a process which doesn't end immediately, and put it into the background. That process then polls files or accepts socket connections or somesuch, and in so doing, accepts some work to do.
(This is obviously a somewhat simplified overview, and of course you'd typically need to have mechanisms in place for process management, signalling the service to shut down gracefully, and perhaps integration into the operating system's daemon management, etc. but the basics are pretty much the same.)
How do I write a script that downloads
a file regularly?
there are shedulers to do that, like 'cron' on linux (or unix)
When the page is loaded, the php is
executed just once,
just once, just like the index.php of your site....
If you want to update a page which is show in a browser than you should use some form of AJAX,
if you want something else than your question is not clear to /me......

How do I avoid this PHP Script causing a server standstill?

I'm currently running a Linux based VPS, with 768MB of Ram.
I have an application which collects details of domains and then connect to a service via cURL to retrieve details of the pagerank of these domains.
When I run a check on about 50 domains, it takes the remote page about 3 mins to load with all the results, before the script can parse the details and return it to my script. This causes a problem as nothing else seems to function until the script has finished executing, so users on the site will just get a timer / 'ball of death' while waiting for pages to load.
**(The remote page retrieves the domain details and updates the page by AJAX, but the curl request doesnt (rightfully) return the page until loading is complete.
Can anyone tell me if I'm doing anything obviously wrong, or if there is a better way of doing it. (There can be anything between 10 and 10,000 domains queued, so I need a process that can run in the background without affecting the rest of the site)
Thanks
A more sensible approach would be to "batch process" the domain data via the use of a cron triggered PHP cli script.
As such, once you'd inserted the relevant domains into a database table with a "processed" flag set as false, the background script would then:
Scan the database for domains that aren't marked as processed.
Carry out the CURL lookup, etc.
Update the database record accordingly and mark it as processed.
...
To ensure no overlap with an existing executing batch processing script, you should only invoke the php script every five minutes from cron and (within the PHP script itself) check how long the script has been running at the start of the "scan" stage and exit if its been running for four minutes or longer. (You might want to adjust these figures, but hopefully you can see where I'm going with this.)
By using this approach, you'll be able to leave the background script running indefinitely (as it's invoked via cron, it'll automatically start after reboots, etc.) and simply add domains to the database/review the results of processing, etc. via a separate web front end.
This isn't the ideal solution, but if you need to trigger this process based on a user request, you can add the following at the end of your script.
set_time_limit(0);
flush();
This will allow the PHP script to continue running, but it will return output to the user. But seriously, you should use batch processing. It will give you much more control over what's going on.
Firstly I'm sorry but Im an idiot! :)
I've loaded the site in another browser (FF) and it loads fine.
It seems Chrome puts some sort of lock on a domain when it's waiting for a server response, and I was testing the script manually through a browser.
Thanks for all your help and sorry for wasting your time.
CJ
While I agree with others that you should consider processing these tasks outside of your webserver, in a more controlled manner, I'll offer an explanation for the "server standstill".
If you're using native php sessions, php uses an exclusive locking scheme so only a single php process can deal with a given session id at a time. Having a long running php script which uses sessions can certainly cause this.
You can search for combinations of terms like:
php session concurrency lock session_write_close()
I'm sure its been discussed many times here. I'm too lazy to search for you. Maybe someone else will come along and make an answer with bulleted lists and pretty hyperlinks in exchange for stackoverflow reputation :) But not me :)
good luck.
I'm not sure how your code is structured but you could try using sleep(). That's what I use when batch processing.

Categories