PHP script with Wikipedia API is suddenly slow, what could be wrong? - php

I run a PHP script that uses the Wikipedia API to locate Wikipedia pages about certain movies, based on a long list of titles and years of release. This takes 1-2 seconds per query on average, and I do about 5 queries per minute. This has been working well for years. But since February 11 it has suddenly become very slow: 30 seconds per query seems to be the norm now.
This is an example for a random movie in my list, and the link my script loads with file_get_contents():
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=yaml&rvsection=0&titles=Deceiver_(film)
I can put this link in my browser directly and it takes no more than a few seconds to load and open. So I don't think the Wikipedia API servers have suddenly become slow. When I load my PHP script from my webserver in my browser, it takes between 20 and 40 seconds before the page is loaded and the result of one query is shown. When I run the PHP script from the command line of my webserver, I get the same slow loading times. My script still manages to save some results to the database now and then, so I'm probably not blocked either.
Other parts of my PHP script have not slowed down. There is a whole bunch of calculations done with the results of the Wikipedia API, and all of that still runs at its regular speed. So my webserver is still healthy; the load is always pretty low, and I'm not even using this server for anything else. I have restarted Apache, but found no difference in loading times.
My questions:
Has something changed in the Wikipedia API system recently? Perhaps my way of using it is outdated and I need to use something new?
Where could I look for the cause of this slow loading? I've dug through error files and tested what I could test, but I don't even know where things go wrong. If I knew what to look for, I might be able to fix it easily.

A git log --until=2013-02-11 --merges in operations/puppet.git shows no changes on that day. This sounds like a DNS issue or something similar on your end.
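One way to test that theory is to time the individual phases of a single request from the affected server. Here is a minimal sketch, assuming the cURL extension is available and reusing the API URL from the question; it prints the name-lookup, connect, and transfer times separately, which should show whether DNS resolution is what eats the 30 seconds:

<?php
// Rough timing breakdown of one request: is DNS, connecting, or the
// transfer itself slow? Assumes ext/curl is installed.
$url = 'http://en.wikipedia.org/w/api.php?action=query&prop=revisions'
     . '&rvprop=content&format=yaml&rvsection=0&titles=Deceiver_(film)';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_exec($ch);

printf("DNS lookup:  %.3fs\n", curl_getinfo($ch, CURLINFO_NAMELOOKUP_TIME));
printf("Connect:     %.3fs\n", curl_getinfo($ch, CURLINFO_CONNECT_TIME));
printf("First byte:  %.3fs\n", curl_getinfo($ch, CURLINFO_STARTTRANSFER_TIME));
printf("Total:       %.3fs\n", curl_getinfo($ch, CURLINFO_TOTAL_TIME));
curl_close($ch);

If the name-lookup time dominates, the problem is on the resolver side (for example a first nameserver in /etc/resolv.conf that has to time out before the next one is tried), not in the PHP code itself.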

Related

3 second time to first byte

I've been maintaining a PHP/SQL e-commerce application, and the customer has called about their TTFB jumping up to almost 3 seconds consistently.
What I've tried:
Creating a test.php page that just echoes some text produces a 30ms TTFB.
Traveling back through the commit history and checking whether any recent changes could have been the culprit.
The test page loading quickly leads me to believe that it's some sort of query or logic that runs on every other page of the site (auth?), but none of the recent commits since the jump in TTFB have had any effect. How could this just randomly happen?
This is actually a performance optimisation problem that can be solved with profiling.
For profiling you can use Xdebug or other tools that are out there; however, I personally didn't find a useful solution when facing a similar situation, so I just made a simple profiling module adapted to the app.
What you want to do is mimic on a local or staging server the exact settings you have in production: server settings, database entries, etc. Then measure the execution time from the first line of index.php to the key parts of the app, for example the database read/write class or the HTTP request class, and write the data to a database so you can generate a profiling report.
For each route and/or operation you want to see how many database requests were made and how long they took to execute, how many API calls were made (if applicable), and so on. In the end the goal is to have a good idea of which part of the execution flow takes how long.
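As a minimal sketch of that idea (the class and section names below are made up, not taken from any framework), a tiny timing helper can record named sections into an array and flush them to the error log at the end of the request:

<?php
// Minimal hand-rolled profiler: record how long named sections of the
// request take, then dump the report at shutdown.
class SectionTimer
{
    private static $sections = array();

    public static function start($name)
    {
        self::$sections[$name]['start'] = microtime(true);
    }

    public static function stop($name)
    {
        self::$sections[$name]['elapsed'] =
            microtime(true) - self::$sections[$name]['start'];
    }

    public static function report()
    {
        foreach (self::$sections as $name => $data) {
            error_log(sprintf('%s: %.4fs', $name, $data['elapsed']));
        }
    }
}

// Usage, e.g. at the top of index.php:
register_shutdown_function(array('SectionTimer', 'report'));
SectionTimer::start('bootstrap');
// ... framework bootstrap ...
SectionTimer::stop('bootstrap');
SectionTimer::start('auth');
// ... the session/auth check that runs on every page ...
SectionTimer::stop('auth');

Once the slow section has been narrowed down this way, you can drill into the individual queries or calls inside it.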

Zurmo reports (Export to CSV) painfully slow

So I have a XAMPP setup with Zurmo 2.6.5 running on it. Everything works like a charm: the speed at which it pulls up contacts, pages through records, etc. is considerably fast. I have 2 GB RAM and this is the only web app that runs on the machine, so you could call it dedicated, I guess. The problem arises when I attempt to export a fairly decent amount of data to Excel (CSV is the only option available). For example, I tried exporting 200-odd rows of data and it timed out because of the max_execution_time setting. I increased it first from around 300 to 600, and now finally to 1200, but the script keeps running as though there were no end to it.
Surprisingly, when I first apply the filter (not many, just one), it takes around 10-15 seconds to display the first 10 records. That indicates the query executes well within the time limit. I have memcached installed, as they suggest, to alleviate performance issues.
I checked Zurmo's forums and the net in general, but unfortunately I did not get even a single hit with reference to this issue. Can any fellow Zurmo developer / power user help me get this resolved?
Much appreciated. Thanks.

Developing an always-running PHP script

(Our server is Linux-based.)
I'm an experienced PHP developer, but this is the first time I'll develop a bot that runs continuously and fetches some data.
I'll explain my application with a simple (and sample) scenario. I have about 2000 website URLs, and my application will visit these URLs and record the contents of the web pages. The application will work 7 days a week, 24 hours a day, and will start over again as soon as it has finished all 2000 websites.
But I need some suggestions for my server. As you can see, my application will run indefinitely until I shut the server down. I could do this with an infinite loop:
while (true) {
    // application code here
}
But I think this will be evil for the server :) Is it possible to do something like this on the server side?
I have also thought about cron jobs, but they don't fit my scenario, because my script should start working again as soon as it finishes. I need "start again when you finish your work", not "start every 30 minutes", because I don't know whether fetching all 2000 websites will take more or less than 30 minutes.
I hope I explained it well.
I'm also worried about memory usage. As you know, the garbage collector cleans up memory when a PHP script stops. But as I said, my app won't stop for days (maybe weeks), so the garbage collector won't be triggered. I'm manually unsetting (with unset()) all used variables at the end of each pass. Is that enough?
I need some suggestions from server administrators :)
PS: I'm developing it as a console application, not a web application. I can execute it from the command line.
Batch processing: store all the sites in a CSV file or something, mark them after completion, then work on all the non-marked ones, then on all the marked ones, and so on. Only do, say, 1 or 5 at a time, and initiate the batch script every minute from cron.
Don't even try to work on all of them at once; if anything errors out you won't know what happened.
You could even store the jobs in a database along with processing stats, etc., which allows for fine-tuning and better reporting (a sketch of that variant follows below).
You will probably hit time limits trying to run infinite PHP scripts, even from the command line, and your server admin will hate you. You will probably also run into memory limits if you don't release resources properly, which is far too easily done with PHP.
Read: http://www.ibm.com/developerworks/opensource/library/os-php-batch/
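Here is a rough sketch of that database-backed variant; the sites table and its columns are an assumed schema, not anything that already exists in your setup. A cron job runs the script every minute; it claims a small batch of unprocessed URLs, fetches them, and marks them done, so every run stays short:

<?php
// Cron-driven batch worker: process a handful of unprocessed sites per run,
// then exit. Table "sites" (id, url, processed_at) is an assumed schema.
$pdo = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'pass');

$rows = $pdo->query(
    "SELECT id, url FROM sites WHERE processed_at IS NULL LIMIT 5"
)->fetchAll(PDO::FETCH_ASSOC);

$update = $pdo->prepare("UPDATE sites SET processed_at = NOW() WHERE id = ?");

foreach ($rows as $row) {
    $content = @file_get_contents($row['url']);   // fetch the page
    // ... store $content wherever your app records it ...
    $update->execute(array($row['id']));          // mark as done
}
// The script exits here; cron starts a fresh process a minute later,
// so memory is released after every batch.

When the SELECT returns no rows, the whole list has been processed, and the same run could reset processed_at to NULL for all sites to start the next pass.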
Your script could just run through the list once and quit. That way, whatever resources PHP is holding can be freed.
Then have a shell script that calls the PHP script in an infinite loop.
As PHP is not designed for long-running tasks, I am not sure if the garbage collection is up to it. Quitting after every run will force it to release everything.
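A sketch of that structure (store_page() stands in for your own recording logic and is not a real function): the PHP script walks the list once and exits, and something outside PHP restarts it immediately:

<?php
// crawl_once.php - one full pass over the URL list, then exit so the OS
// reclaims all memory. An external loop restarts it right away.
$urls = file('urls.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

foreach ($urls as $url) {
    $html = @file_get_contents($url);
    if ($html !== false) {
        store_page($url, $html);   // placeholder for your own storage code
    }
}
// No while(true) here: exiting is exactly what frees the resources.

The outer loop can then be as simple as a shell script containing while true; do php crawl_once.php; done, which gives you the "start again as soon as you finish" behaviour without any long-lived PHP process.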

PHP execution time: factors to consider in determining the speed of execution

As all my requests go through an index script, I tried to time the response time of all my requests.
It's simply the difference between the start time (start of the script) and the end time (end of the script).
I cache my data in memcached, and users are all served from memcached.
I mostly get a response time of less than a second, but at times there are weird spikes of more than a second; the worst case can go up to 200+ seconds.
I was wondering: if mobile users have a slow connection, does that reflect on my response time?
I am serving primarily mobile users.
Thanks!
No, it's the runtime of your script. It does not include the latency to the user; that's something the underlying web server worries about. Something in your script just takes very long. I recommend you profile your script to find out what that is. Xdebug is a good way to do so.
If you're measuring in PHP (which it sounds like you are), that's the time it takes for the page to be generated on the server side, not the time it takes to be downloaded.
Drop timers in throughout the page, and try to narrow it down to the section that is causing the huge delay of 200+ seconds.
You could even add a small script that emails you details of how long each section took, if it doesn't happen often enough for you to see it yourself.
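A rough sketch of that idea (the checkpoint names, threshold, and address are placeholders): record checkpoints with microtime() and mail the breakdown only when the request was unusually slow:

<?php
// Checkpoint timing that only reports itself when the request was slow.
$checkpoints = array('start' => microtime(true));

// ... load data from memcached ...
$checkpoints['cache'] = microtime(true);

// ... render the page ...
$checkpoints['render'] = microtime(true);

$total = $checkpoints['render'] - $checkpoints['start'];
if ($total > 5) {                              // arbitrary "too slow" threshold
    $report = '';
    $prev = $checkpoints['start'];
    foreach ($checkpoints as $name => $t) {
        $report .= sprintf("%s: +%.3fs\n", $name, $t - $prev);
        $prev = $t;
    }
    mail('you@example.com', 'Slow request: ' . $_SERVER['REQUEST_URI'], $report);
}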
It could also be that the script cannot finish because a client is downloading the results very, very slowly. If you don't use a front-end server like nginx, the first thing to do is to try one.
Someone already mentioned Xdebug, but normally you would not want to run Xdebug in production. I would suggest using XHProf to profile pages on development/staging/production. You can turn XHProf on conditionally, which makes it really easy to run in production.
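A sketch of that conditional setup, assuming the xhprof extension is loaded; sampling roughly 1 in 100 requests and the output path are arbitrary choices here:

<?php
// Enable XHProf for about 1 in 100 requests so production overhead stays low.
$profiling = extension_loaded('xhprof') && mt_rand(1, 100) === 1;

if ($profiling) {
    xhprof_enable(XHPROF_FLAGS_CPU | XHPROF_FLAGS_MEMORY);
}

// ... normal request handling via index.php ...

if ($profiling) {
    $data = xhprof_disable();
    // Write the raw run somewhere your XHProf viewer can read it;
    // the directory is just an example and must already exist.
    file_put_contents(
        sprintf('/tmp/xhprof/%s.%d.xhprof', date('Ymd_His'), getmypid()),
        serialize($data)
    );
}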

Writing a PHP web crawler using cron

I have written myself a web crawler using simplehtmldom and have got the crawl process working quite nicely. It crawls the start page, adds all links to a database table, sets a session pointer, and meta-refreshes the page to carry on to the next one. That keeps going until it runs out of links.
That works fine, but obviously the crawl time for larger websites is pretty tedious. I want to be able to speed things up a bit, and possibly make it a cron job.
Any ideas on making it as quick and efficient as possible, other than setting the memory limit / execution time higher?
It looks like you're running your script in a web browser. Consider running it from the command line. You can execute multiple scripts to crawl different pages at the same time. That should speed things up.
Memory must not be a problem for a crawler.
Once you are done with one page and have written all relevant data to the database, you should get rid of all the variables you created for the job.
The memory usage after 100 pages must be the same as after 1 page. If this is not the case, find out why.
You can split up the work between different processes: parsing a page usually does not take as long as loading it, so you can write all the links you find to a database and have multiple other processes that just download the documents to a temp directory (a sketch of such a download worker follows the list below).
If you do this, you must ensure that
no link is downloaded by two workers,
your processes wait for new links if there are none,
temp files are removed after each scan,
the download process stops when you run out of links. You can achieve this by setting a "kill flag"; this can be a file with a special name or an entry in the database.
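A minimal sketch of the "no link is downloaded by two workers" part, using an assumed links table with id, url, claimed_by, and done columns: each download worker atomically claims one row with an UPDATE before fetching it, waits when there is no work, and stops when the kill flag appears:

<?php
// Download worker: atomically claim one unclaimed link, fetch it, repeat.
// Table "links" (id, url, claimed_by, done) is an assumed schema.
$workerId = getmypid();
$pdo = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'pass');

while (true) {
    // Stop when the kill flag is set (here: a file with a special name).
    if (file_exists('/tmp/crawler.stop')) {
        break;
    }

    // Claim exactly one unclaimed link. The UPDATE is atomic, so two
    // workers can never grab the same row.
    $claim = $pdo->prepare(
        "UPDATE links SET claimed_by = ? WHERE claimed_by IS NULL LIMIT 1"
    );
    $claim->execute(array($workerId));

    if ($claim->rowCount() === 0) {
        sleep(5);                      // no new links yet - wait, don't exit
        continue;
    }

    $select = $pdo->prepare(
        "SELECT id, url FROM links WHERE claimed_by = ? AND done = 0 LIMIT 1"
    );
    $select->execute(array($workerId));
    $link = $select->fetch(PDO::FETCH_ASSOC);

    // Download to the temp directory (assumed to exist), then mark as done.
    file_put_contents('/tmp/crawl/' . $link['id'] . '.html',
                      file_get_contents($link['url']));
    $pdo->prepare("UPDATE links SET done = 1 WHERE id = ?")
        ->execute(array($link['id']));
}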
