Fetching multiple URLs and updating DB with PHP script - php

I have a website that uses MySQL. I am using a table named "People" in which each row represents, obviously, a person. When a user opens a person's page I would like to show news related to that person (along with the information from the MySQL table). For that purpose, I decided to use the Bing News Source API.
The problem with calling the Bing API on every page load is that it increases the load time of my page (a round trip to Bing's servers). Therefore, I have decided to pre-fetch all the news and save it in my table under a column named "News".
Since my table contains 5,000+ people, running a PHP script that downloads the news for every person and updates the table in one go results in "Fatal error: Maximum execution time exceeded" (I would not like to disable the timeout, since it is a good security measure).
What would be a good and efficient way to run such a script? I know I can run a cron job every 5 minutes that updates only a portion of the rows each time - but even then, what is the best way to save the current offset? Should I save it in MySQL, or as a server variable?

Use a cron job for a complex job like this.
You should increase the timeout if you plan to run it as a cron job (you are pulling things from another site, not serving the public).
Consider creating a master script (triggered by the cron job) that spawns multiple sub-scripts (with some control), so that you can pull the data from the Bing News Source in parallel (downloading the 5,000+ profiles in batches) rather than one by one sequentially - think batch processing.
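A minimal sketch of that batch idea using curl_multi, which lets a single PHP worker fetch news for a whole batch of people in parallel; buildBingNewsUrl() is a hypothetical helper standing in for the real Bing News API request:

<?php
// Fetch news for a batch of people in parallel with curl_multi.
function fetchNewsBatch(array $people) {   // $people: person id => name
    $mh = curl_multi_init();
    $handles = array();
    foreach ($people as $id => $name) {
        $ch = curl_init(buildBingNewsUrl($name));   // hypothetical URL builder
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        curl_multi_add_handle($mh, $ch);
        $handles[$id] = $ch;
    }
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running) {
            curl_multi_select($mh);   // wait for activity instead of busy-looping
        }
    } while ($running && $status === CURLM_OK);

    $results = array();
    foreach ($handles as $id => $ch) {
        $results[$id] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $results;   // person id => raw news payload, ready to UPDATE into the "News" column
}

Each sub-script spawned by the master would call something like this for its own slice of the 5,000+ rows.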
Update
Cron is a time-based job scheduler in Unix-like computer operating systems. The name cron comes from the word "chronos", Greek for "time". Cron enables users to schedule jobs (commands or shell scripts) to run periodically at certain times or dates. It is commonly used to automate system maintenance or administration, though its general-purpose nature means that it can be used for other purposes, such as connecting to the Internet and downloading email
Cron - on Wiki

Why not load the news section of the page via AJAX? This would mean that the rest of the page would load quickly, and the delay created from waiting for BING would only affect the news section, which you could allocate a loading placeholder to.
Storing the news in the DB doesn't sound like a very efficient/practical solution; the ongoing management of those records alone could become a headache in the future.
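A rough sketch of the AJAX route, assuming a hypothetical news.php endpoint that the profile page calls after it has rendered; the slow Bing round trip then only delays the news box, not the whole page (lookupPersonName() and fetchBingNews() are stand-ins for the real DB lookup and API call):

<?php
// news.php - called from the profile page via XMLHttpRequest once the page has loaded.
$personId = (int) $_GET['person_id'];
$name = lookupPersonName($personId);   // hypothetical DB lookup
$news = fetchBingNews($name);          // hypothetical Bing News API call - the slow part
header('Content-Type: application/json');
echo json_encode(array('person_id' => $personId, 'news' => $news));

The page itself shows a loading placeholder and swaps in the JSON result when the request completes.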

Related

Getting a PHP script to run frequently on a web server

I am currently developing a PHP script that pulls XML data from a web page; so far it gets the XML data and stores it in a MySQL table. However, it only stores the data when the PHP script is run, so I was wondering if there is a function or a tool (or several options - let me know) that would run the script every x seconds. Since it's to do with currency changes, I need the XML pulled very frequently.
I've heard that a cron job will execute a script every set amount of time, but I've also heard they are really bad news for highly frequent use. Any other suggestions?
Also, this is for an app, so another option is to fetch the data only when a user requests it and send it to them then - but that will be saved for another post. If this way sounds better, let me know, since I'm not the greatest with web servers.
Cron jobs will be fine even if you need the task done frequently. The problem with cron jobs is that you can only run a task every minute (without getting too hacky), and you might get weird results if the query takes a long time (e.g. longer than one minute).
You should be totally fine though.
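A minimal guard against that overlap problem, assuming the script is started by cron every minute; flock() makes a second instance exit immediately instead of piling up when one pull runs long (the lock file path is illustrative):

<?php
// Prevent overlapping cron runs with an exclusive, non-blocking file lock.
$lock = fopen('/tmp/currency_pull.lock', 'c');
if (!flock($lock, LOCK_EX | LOCK_NB)) {
    exit(0);   // a previous run is still going; let it finish
}
// ... pull the XML feed and update the MySQL table here ...
flock($lock, LOCK_UN);
fclose($lock);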

Help writing an algorithm for indexing/parsing a limited chunk of data on cron run

Here's the situation. I am scraping a website to get the data from its articles, using a robots page supplied by that website (a list of URLs pointing to every article posted on the site). So far, I do a database merge to 'upsert' the URLs into my table. I know that each scraping run will take a good while because there are over 1,400 articles to parse. I need to write an algorithm that will only do a small chunk of the job on each cron run so it doesn't overload my server, etc.
Edit: I should mention that I'm using Drupal 7. Also, this has to be an ongoing script that runs over time; I'm not so worried about the time it takes for the initial fill of the database. The robots page is dynamic: URLs get added to it periodically as articles are posted. I'm currently using hook_cron() for this, but I'm open to better methods if there's something better for the job.
You can use the Drupal queue operations API to enqueue each page to scrape as a queue item. You can, but are not required to, declare your queue as cron-executed. Drupal will then take care of executing as many queue items as it can at each cron run without exceeding the queue's declared maximum execution time.
See aggregator_cron for an example of item enqueuing, and aggregator_cron_queue_info for the declaration that lets Drupal process these queued items during its cron.
If queue processing during normal Drupal cron is an issue, you can process your queue independently with the help of modules like Waiting Queue or Beanstalkd integration.
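A minimal sketch of that pattern for a hypothetical mymodule, following the aggregator example: hook_cron() enqueues the URLs that still need scraping, and hook_cron_queue_info() tells Drupal which worker callback to run for queued items on each cron run:

<?php
// mymodule.module (module name and helpers are illustrative).
function mymodule_cron() {
  $queue = DrupalQueue::get('mymodule_scrape');
  foreach (mymodule_get_unscraped_urls() as $url) {   // hypothetical helper
    $queue->createItem($url);
  }
}

function mymodule_cron_queue_info() {
  return array(
    'mymodule_scrape' => array(
      'worker callback' => 'mymodule_scrape_article',
      'time' => 60,   // seconds of queue processing allowed per cron run
    ),
  );
}

function mymodule_scrape_article($url) {
  // Fetch and parse one article, then record that this URL is done.
}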
Most likely the HTTP overhead of fetching each article will vastly outweigh the overhead of the database operations. Just don't fetch too many articles in parallel and you should be fine. Most webmasters frown on scrapers, especially ones doing 10, 20, 500+ parallel fetches.
So, you already have the URLs in your database. Add a status column to that table - scraped or not. The cron job can kick off every so often, grab the next URLs that have not been scraped, and mark them as scraped.
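A short sketch of that status-column approach (table and column names are illustrative, and the connection details are placeholders):

<?php
// Grab the next small batch of unscraped URLs, process them, and mark them done.
$db = new PDO('mysql:host=localhost;dbname=site', 'user', 'pass');
$rows = $db->query("SELECT id, url FROM urls WHERE scraped = 0 LIMIT 20")->fetchAll();
foreach ($rows as $row) {
    scrapeArticle($row['url']);   // hypothetical: fetch and parse one article
    $db->prepare("UPDATE urls SET scraped = 1 WHERE id = ?")->execute(array($row['id']));
}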

php multithreading, mysql

I have a php script which I use to make about 1 mil. requests every day to a specific web service.
The problem is that in a "normal" workflow the script works almost the whole day to complete the job.
Therefore I've worked on an additional component. Basically, I developed a script which accesses the main script via multi-curl GET requests, generates a random temp ID for every 500 records, and finally makes another multi-curl POST request with all the generated temp IDs.
However, I don't feel this is the right way, so I would like some advice/solutions for adding multithreading capabilities to the main script without using additional/external applications (e.g. the curl script that I'm currently using).
Here is the main script : http://pastebin.com/rUQ6pwGS
If you want to do it right you should install a message queue. My preference goes out to redis because it is a "data structure server since keys can contain strings, hashes, lists, sets and sorted sets". Also redis is extremely fast.
Use BLPOP (spawning a couple of worker processes with php <yourscript> to process work concurrently) to listen for new messages (work), and RPUSH to push new messages onto the queue. Spawning processes is (relatively) expensive, and with a message queue it only has to be done once, when each worker is created.
I would go for phpredis if you can (you need to be able to compile the extension) because it is written in C and is therefore going to be a lot faster than the pure PHP clients. Otherwise, Predis is also a pretty mature library you could use.
You could also use BRPOP/RPUSH as some sort of lock (if you need to). This is because:
Multiple clients can block for the same key. They are put into a queue, so the first to be served will be the one that started to wait earlier, in a first-BLPOP first-served fashion.
I would advise you to have a look at Simon's redis tutorial to get an impression of the sheer power that redis has to offer.
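A stripped-down sketch of that producer/worker pattern using Predis (the 'work' list name, batch shape, and processBatch() are illustrative):

<?php
require 'vendor/autoload.php';   // Predis installed via Composer

$redis = new Predis\Client();

// Producer side: push each batch of 500 records onto the queue.
// $redis->rpush('work', json_encode($batch));

// Worker side: run several copies of this script to process work concurrently.
while (true) {
    $msg = $redis->blpop('work', 0);   // blocks until a job is available
    if ($msg === null) {
        continue;
    }
    list($list, $payload) = $msg;
    $batch = json_decode($payload, true);
    processBatch($batch);   // hypothetical: make the web service calls for this batch
}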
This is a background process, correct? In that case, you should not run it via a web server. Run it from the command line, either as a daemon or as a cron job.
My preference is a cron job because you get automatic restarts for free. Be sure that you don't have more instances of the program running than desired (you can achieve this by locking a file in the filesystem, doing something atomic in a database, etc.).
Then you just need to start the number of processes you want, and have them read work from a queue.
Normally the pattern for doing this is to have a table with columns that record who is currently executing a given task:
CREATE TABLE sometasks (
ID of some kind,
Other info required to do task,
some data we need to know if the task is due yet or complete,
locked_by_host VARCHAR(64) NULL,
locked_by_pid INT NULL
)
Then the process runs the following pseudo-query to lock a set of tasks (batch_size is how many tasks per batch; it can be 1):
UPDATE sometasks SET locked_by_host=my_hostname, locked_by_pid=my_pid
WHERE not_done_already AND locked_by_host IS NULL ORDER BY ID LIMIT batch_size
Then select the rows back out to find the current process's tasks. Process the tasks, mark them as "done", and clear out the lock.
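A sketch of that claim-then-process cycle with PDO, matching the pseudo-table above; gethostname() and getmypid() supply my_hostname and my_pid, while the done column and processTask() helper are illustrative:

<?php
$db   = new PDO('mysql:host=localhost;dbname=jobs', 'user', 'pass');
$host = gethostname();
$pid  = getmypid();

// Claim a batch of unlocked, not-yet-done tasks (MySQL allows ORDER BY/LIMIT on UPDATE).
$claim = $db->prepare(
    "UPDATE sometasks SET locked_by_host = ?, locked_by_pid = ?
     WHERE done = 0 AND locked_by_host IS NULL
     ORDER BY ID LIMIT 10"
);
$claim->execute(array($host, $pid));

// Read back the tasks this process just claimed, then work through them.
$tasks = $db->prepare(
    "SELECT * FROM sometasks WHERE locked_by_host = ? AND locked_by_pid = ? AND done = 0"
);
$tasks->execute(array($host, $pid));

foreach ($tasks->fetchAll() as $task) {
    processTask($task);   // hypothetical worker function
    $db->prepare("UPDATE sometasks
                  SET done = 1, locked_by_host = NULL, locked_by_pid = NULL
                  WHERE ID = ?")->execute(array($task['ID']));
}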
I'd opt for a cron job with a controller process which starts up N child processes and monitors them. The child processes could periodically die (remember PHP does not have good GC, so it can easily leak memory) and be respawned to prevent resource leaks.
If the work is all done, the parent could quit, and wait to be respawned by cron (the next hour or something).
NB: locked_by_host stores the host name (PIDs aren't unique across hosts) to allow for distributed processing, but maybe you don't need that, in which case you can omit it.
You can make this design more robust by adding a locked_time column and detecting when a task has been taking too long - you can then alert, kill the process, and retry, or similar.
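A bare-bones sketch of such a controller using pcntl_fork (requires the pcntl extension and the CLI); workRemains() and runWorker() are hypothetical stand-ins for checking the task table and processing one batch:

<?php
// Controller started by cron: keep $numWorkers child processes alive while work remains.
$numWorkers = 4;
$children = array();

while (workRemains()) {
    while (count($children) < $numWorkers) {
        $pid = pcntl_fork();
        if ($pid === 0) {
            runWorker();   // process one batch of tasks, then exit
            exit(0);       // the child dies here, so any leaked memory is reclaimed by the OS
        }
        $children[$pid] = true;
    }
    $done = pcntl_waitpid(-1, $status);   // block until any child finishes
    unset($children[$done]);
}

// Let the remaining children finish before the controller exits.
while (count($children) > 0) {
    $done = pcntl_waitpid(-1, $status);
    unset($children[$done]);
}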

Real-time updates in PHP for a PBBG (PHP, JavaScript, and HTML)

I am trying to create a PBBG (persistent browser-based game) like OGame, Space4k, and others.
My problem is with the constantly updating resource collection and with the build times: a completion time is set when a building, ship, research item, etc. finishes, and the user's profile has to be updated even if the user is offline. What and/or where should I learn in order to make this? Should it be a constantly running script in the background?
Note that I wish to use only PHP, HTML, CSS, JavaScript, and MySQL, but I will learn something new if needed.
Cron jobs or the Windows equivalent seem to be the way, but that doesn't seem right or best to me.
Do you have to query your DB for many users' properties, like "show me all users who already have a ship of the galaxy class"?
If you do not need this, you could just check the build queue when someone requests the profile.
If this is not an option, you could add a "finished_at" column to your database and include "WHERE finished_at >= SYSDATE()" in your query. In that case all resources (future and present) are in the same table.
Always keep in mind: what use is there to having "live" data if no one is requesting it?
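A sketch of that check-on-request idea, assuming a hypothetical build_queue table with a finished_at timestamp and an applied flag; when someone opens the profile, any builds whose time has passed are applied on the spot, so nothing has to run while the user is offline:

<?php
// Apply finished builds at profile-load time instead of via a background script.
$db = new PDO('mysql:host=localhost;dbname=game', 'user', 'pass');
$userId = (int) $_GET['user_id'];

$stmt = $db->prepare(
    "SELECT id, item FROM build_queue
     WHERE user_id = ? AND applied = 0 AND finished_at <= NOW()"
);
$stmt->execute(array($userId));

foreach ($stmt->fetchAll() as $build) {
    grantItem($userId, $build['item']);   // hypothetical: add the building/ship to the profile
    $db->prepare("UPDATE build_queue SET applied = 1 WHERE id = ?")->execute(array($build['id']));
}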
My problem is with the always-updating resource collection and with the building times, as in a time is set when the building, ship, research, and etc completes building and updates the user's profile even if the user is offline
I think the best way to do this is to install a message queue (you do need to install/compile it) like beanstalkd and do offline processing. Let's say it takes 30 seconds to build a ship. With a Pheanstalk client (I like Pheanstalk) you first put a message onto the queue using:
$pheanstalk->put($data, $pri, $delay, $ttr);
See the beanstalkd protocol for the meaning of all the arguments.
Here, pass $delay=30: when a worker process does a reserve(), it will only receive the message once the 30 seconds are up.
$job = $pheanstalk->reserve();
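Putting the two calls together, a minimal worker sketch (the 'builds' tube name and completeBuild() are illustrative, and constructor/namespace details vary between Pheanstalk versions):

<?php
require 'vendor/autoload.php';

$pheanstalk = new Pheanstalk\Pheanstalk('127.0.0.1');

// Producer: queue a "ship finished" event that becomes available in 30 seconds.
$pheanstalk->useTube('builds')->put(
    json_encode(array('user' => 42, 'item' => 'ship')),
    1024,   // $pri
    30,     // $delay
    60      // $ttr
);

// Worker (a separate long-running CLI process):
$pheanstalk->watch('builds');
while (true) {
    $job  = $pheanstalk->reserve();   // blocks until a delayed job becomes ready
    $data = json_decode($job->getData(), true);
    completeBuild($data['user'], $data['item']);   // hypothetical: update the user's profile
    $pheanstalk->delete($job);
}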
Streaming data to user in real-time
Also you could look into XMPP over BOSH to stream the new data to all users in real-time.
http://www.ibm.com/developerworks/xml/tutorials/x-realtimeXMPPtut/index.html
http://abhinavsingh.com/blog/2010/08/php-code-setup-and-demo-of-jaxl-boshchat-application/

Opinions/expertise/suggestions wanted - provide feedback from php performing a lengthy task

I am thinking about converting a Visual Basic application (that takes pipe-delimited files and imports them into a Microsoft SQL Server database) into a PHP page. One of these files is on average about 9 megabytes in size. (I couldn't be accurate about the number of lines involved, but I'd say it's about 20 thousand.)
One of the advantages is that any changes made to the page would be automatically 'deployed' to the intended users (currently, when I make changes to the Visual Basic app, which was originally created by someone else, I have to put the latest version on all the PCs of the people that use it).
The problem is that these imports can take about two minutes to complete. Two minutes is a long time in website time, so I would like to provide feedback to the user to indicate that the page hasn't failed/timed out and is definitely doing something.
The best idea I can think of so far is to use AJAX to do it incrementally: import 1,000 records at a time, report back, import the next 1,000, report back, and so on.
Are there better ways of doing this sort of thing that wouldn't require me to learn new programming languages or download apps or libraries?
You don't have to make the Visual Basic -> PHP switch. You can stick with VB syntax in ASP or ASP.NET applications. With an ASP based solution, you can reuse plenty of the existing code so it won't be learning a new language / starting from scratch.
As for how to present a long-running process to the user, you're looking for "asynchronous handlers" - the basic premise being that the user visits a web page (A) which starts the process page (B).
(A) initiates (B), reports to the user that it has started, and sets the page to reload in n seconds.
(B) does all the heavy lifting - just like your existing VB app. Progress is stored in some shared space (a flat file, a database, a memory cache, etc.).
Upon reload, (A) reports the current progress of (B) by read-only access to the shared space (B) keeps its progress in.
Scope of (A):
Look for a running (B) process - report its status if found, or initiate a fresh (B) process. Since (B) appears to be based on the existence of files (from your description), you might grant (A) the ability to determine whether there's any point in calling (B) at all (i.e. if files exist, call (B); else report "nothing to do"), or you may wish to keep the scopes entirely separate and always call (B).
Report progress of (B).
Should take very little time to execute; you may want to include an HTTP refresh header so the user automatically gets updates.
Scope of (B):
Same as existing VB script – look for files, load… yada yada yada.
Should take similar time to execute as existing VB script (2 minutes)
Potential Improvements:
(A) could use an AJAX interface, so instead of a page reload (HTTP refresh), an AJAX call is made every n seconds and just the status box is updated. Some sort of animated icon (spinning wheel) will give the user the impression that something is going on between refreshes.
It sounds like (B) could benefit from a multi-threaded approach (loading multiple files at once), depending on whether the files are related. As pointed out by Ponies, there may be a better strategy for such a load, but that's a different topic altogether :)
Some sort of semaphore/flag approach may be required if page (A) could be hit simultaneously by multiple users while (B) takes a few seconds to start up and report its status.
Both (A) and (B) can be developed in PHP or ASP technology.
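A rough sketch of the shared-space idea with a plain progress file (the file name, helper, and JSON shape are illustrative): (B) writes its progress as it inserts rows, and (A) - or the AJAX call that replaces the page reload - simply reads the file back:

<?php
// Page (B): the long-running import, updating shared progress as it goes.
$rows  = file('import.txt');   // the uploaded pipe-delimited file (illustrative name)
$total = count($rows);
foreach ($rows as $i => $line) {
    importRow(explode('|', trim($line)));   // hypothetical: insert one record
    if ($i % 1000 === 0 || $i === $total - 1) {
        file_put_contents('progress.json',
            json_encode(array('done' => $i + 1, 'total' => $total)));
    }
}

// Page (A), or the AJAX endpoint it polls: a read-only status report.
$progress = json_decode(@file_get_contents('progress.json'), true);
echo $progress
    ? "Imported {$progress['done']} of {$progress['total']} records..."
    : "Nothing running yet.";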
How are you importing the data into the database? Ideally, you should be using SQL Server's BULK INSERT, which would likely speed things up. But it's still a matter of uploading the file for parsing...
I don't think it's worth the effort to report the status of individual insertions - most sites only display an animated GIF (an hourglass or the like) to indicate that the system is processing, with no real details.
