Collecting and Processing data with PHP (Twitter Streaming API)

After reading through all of the Twitter Streaming API and Phirehose PHP documentation, I've come across something I have yet to do: collect and process data separately.
The logic behind it, if I understand correctly, is to prevent a logjam at the processing phase that would back up the collecting process. I've seen examples before, but they basically write straight to a MySQL database right after collection, which seems to go against what Twitter recommends you do.
What I'd like some advice/help on is the best way to handle this. It seems that people recommend writing all the data directly to a text file and then parsing/processing it with a separate function, but I'd assume this method could be a memory hog.
Here's the catch: it's all going to be running as a daemon/background process. So does anyone have any experience with solving a problem like this, or more specifically with the Twitter Phirehose library? Thanks!
Some notes:
* The connection will be through a socket, so my guess is that the file will constantly be appended to? Not sure if anyone has any feedback on that.

The Phirehose library comes with an example of how to do this. See:
Collect: https://github.com/fennb/phirehose/blob/master/example/ghetto-queue-collect.php
Consume: https://github.com/fennb/phirehose/blob/master/example/ghetto-queue-consume.php
This uses a flat file, which is very scalable and fast: your average hard disk can write sequentially at 40 MB/s+, and performance scales linearly (i.e. unlike a database, it doesn't slow down as it gets bigger).
You don't need any database functionality to consume a stream (i.e. you just want the next tweet; there's no "querying" involved).
If you rotate the file fairly often, you will get near-real-time performance (if desired).
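For a feel of how the two halves fit together, here is a rough sketch modelled on that idea (not the actual ghetto-queue code): the directory, file-naming scheme, and 10-second rotation interval are assumptions, and the constructor/authentication arguments vary between Phirehose versions (newer releases use OauthPhirehose with a token and secret).

    <?php
    // collect.php -- streams tweets and appends them to small, rotated queue files.
    require_once 'Phirehose.php';

    class QueueCollector extends Phirehose
    {
        protected $queueDir    = '/tmp/tweet-queue'; // illustrative; must already exist
        protected $rotateEvery = 10;                 // seconds between rotations (assumption)
        protected $lastRotated = 0;
        protected $activeFile  = null;
        protected $fp          = null;

        // Phirehose calls this once per status received (raw JSON string).
        public function enqueueStatus($status)
        {
            $now = time();
            if ($this->fp === null || ($now - $this->lastRotated) > $this->rotateEvery) {
                $this->rotateFile($now);
            }
            fwrite($this->fp, $status . "\n");       // append-only sequential write, no parsing here
        }

        protected function rotateFile($now)
        {
            if ($this->fp !== null) {
                fclose($this->fp);
                // Hand the finished chunk to the consumer by renaming it.
                rename($this->activeFile, str_replace('.active', '.queue', $this->activeFile));
            }
            $this->activeFile  = $this->queueDir . '/tweets.' . $now . '.active';
            $this->fp          = fopen($this->activeFile, 'ab');
            $this->lastRotated = $now;
        }
    }

    $collector = new QueueCollector('username', 'password', Phirehose::METHOD_SAMPLE);
    $collector->consume();   // blocks and streams forever

The consumer runs as a completely separate process and only ever sees closed files, so a slow database insert never backs up the stream:

    <?php
    // consume.php -- run as a second daemon (or from cron in a loop).
    $queueDir = '/tmp/tweet-queue';

    while (true) {
        $files = glob($queueDir . '/tweets.*.queue');
        if (empty($files)) {
            sleep(5);        // nothing ready yet, wait for the collector
            continue;
        }
        foreach ($files as $file) {
            $fp = fopen($file, 'rb');
            while (($line = fgets($fp)) !== false) {
                $tweet = json_decode($line, true);
                if ($tweet !== null) {
                    // ... insert into MySQL, run whatever processing you like ...
                }
            }
            fclose($fp);
            unlink($file);   // this chunk is done
        }
    }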

Related

How to process massive data-sets and provide a live user experience

I am a programmer at an internet marketing company that primarily makes tools. These tools have certain requirements:
They run in a browser and must work in all of them.
The user either uploads something (.csv) to process, or they provide a URL and API calls are made to retrieve information about it.
They are moving around THOUSANDS of lines of data (think large databases). These tools literally run for hours, usually overnight.
The user must be able to watch live as their information is processed and presented to them.
Currently we are writing in PHP, MySQL and Ajax.
My question is: how do I process LARGE quantities of data and provide a live user experience while the tool is running? Currently I use a custom queue system that sends Ajax calls and inserts rows into tables or data into divs.
This method is a huge pain in the ass and can't possibly be the correct method. Should I be using a templating system, or is there a better way to refresh chunks of the page with A LOT of data? And I really mean a lot of data, because we come close to maxing out PHP's memory limit, which is something we are always on the lookout for.
Also, I would love to make it so these tools could run on the server by themselves: upload a .csv, close the browser window, and then have an email sent to the user when the tool is done.
Does anyone have any methods (programming standards) for me that are better than using Ajax calls? Thank you.
I wanted to update with some notes in case anyone has the same question. I am looking into the following to see which is the best solution:
SlickGrid / DataTables
GearMan
Web Socket
Ratchet
Node.js
These are in no particular order and the one I choose will be based on what works for my issue and what can be used by the rest of my department. I will update when I pick the golden framework.
First of all, you cannot handle big data via Ajax alone. To let users watch the process live, you can use WebSockets. As you are experienced in PHP, I can suggest Ratchet, which is quite new.
On the other hand, for the heavy calculations and for storing big data, I would use NoSQL instead of MySQL.
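As a taste of what a Ratchet push server looks like, here is a minimal broadcast sketch based on Ratchet's documented hello-world pattern; the class name, port, and the idea of broadcasting progress strings are assumptions, and wiring the long-running PHP job into it (Ratchet's docs describe a ZeroMQ-based push setup for that) is left out:

    <?php
    // push-server.php -- run from the command line, not through Apache.
    use Ratchet\MessageComponentInterface;
    use Ratchet\ConnectionInterface;
    use Ratchet\Server\IoServer;
    use Ratchet\Http\HttpServer;
    use Ratchet\WebSocket\WsServer;

    require __DIR__ . '/vendor/autoload.php';

    class ProgressPusher implements MessageComponentInterface
    {
        protected $clients;

        public function __construct()
        {
            $this->clients = new \SplObjectStorage();
        }

        public function onOpen(ConnectionInterface $conn)
        {
            $this->clients->attach($conn);          // remember this browser tab
        }

        public function onMessage(ConnectionInterface $from, $msg)
        {
            // Broadcast progress updates (e.g. "processed 12,000 of 80,000 rows")
            // to every connected client.
            foreach ($this->clients as $client) {
                $client->send($msg);
            }
        }

        public function onClose(ConnectionInterface $conn)
        {
            $this->clients->detach($conn);
        }

        public function onError(ConnectionInterface $conn, \Exception $e)
        {
            $conn->close();
        }
    }

    IoServer::factory(
        new HttpServer(new WsServer(new ProgressPusher())),
        8080
    )->run();

Every connected browser receives the same progress message as soon as it is sent, with no polling.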
Since you're kind of pinched for time already, migrating to Node.js may not be practical right now. It would, however, also help with the question of notifying users when the results are ready, as it can push notifications to the browser without polling. And since it uses JavaScript, you might find some of your client-side code is reusable.
I think you can run what you need in the background with some kind of Queue manager. I use something similar with CakePHP and it lets me run time intensive processes in the background asynchronously, so the browser does not need to be open.
Another plus side for this is that it's scalable, as it's easy to increase the number of queue workers running.
Basically, with PHP you just need a cron job that runs every once in a while and starts a worker, which checks a queue database for pending tasks. If none are found, it keeps running in a loop until one shows up.
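A bare-bones sketch of that worker, assuming a tasks table with id/status/payload/email columns (all made-up names here) and a cron entry such as "* * * * * php /path/to/worker.php":

    <?php
    // worker.php -- started by cron; exits immediately if another copy is already running.
    $lock = fopen('/tmp/worker.lock', 'c');
    if (!flock($lock, LOCK_EX | LOCK_NB)) {
        exit;                                          // another worker already owns the queue
    }

    $pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

    while (true) {
        $task = $pdo->query(
            "SELECT * FROM tasks WHERE status = 'pending' ORDER BY id LIMIT 1"
        )->fetch(PDO::FETCH_ASSOC);

        if ($task === false) {
            sleep(10);                                 // nothing pending, poll again shortly
            continue;
        }

        $pdo->prepare("UPDATE tasks SET status = 'running' WHERE id = ?")
            ->execute(array($task['id']));

        // ... the long-running work goes here: parse the uploaded .csv, call the API, etc. ...

        $pdo->prepare("UPDATE tasks SET status = 'done' WHERE id = ?")
            ->execute(array($task['id']));

        // Matches the "close the browser and get an email" idea from the question.
        mail($task['email'], 'Your import has finished', 'The results are ready.');
    }

The browser never has to stay open; the user picks the results up later from the email or the results page.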

Analytics, statistics or logging information for a PHP script

I have a WordPress plugin, which checks for an updated version of itself every hour with my website. On my website, I have a script running which listens for such update requests and responds with data.
What I want to implement is some basic analytics for this script, which can give me information like the number of requests per day, the number of unique requests per day/week/month, etc.
What is the best way to go about this?
Use some existing analytics script which can do the job for me
Log this information in a file on the server and process that file on my computer to get the information out
Log this information in a database on the server and use queries to fetch the information
Also there will be about 4000 to 5000 requests every hour, so whatever approach I take should not be too heavy on the server.
I know this is a very open ended question, but I couldn't find anything useful that can get me started in a particular direction.
Wow. I'm surprised this doesn't have any answers yet. Anyways, here goes:
1. Using an existing script / framework
Obviously, Google Analytics won't work for you since it is JavaScript based. I'm sure there exist PHP analytics frameworks out there. Whether you use them or not is really a matter of personal choice. Do these existing frameworks record everything you need? If not, do they lend themselves to being easily modified? You could use a good existing framework and choose not to reinvent the wheel. Personally, I would write my own, just for the learning experience.
I don't know any such frameworks off the top of my head because I've never needed one. I could do a Google search and paste the first few results here, but then so could you.
2. Log in a file or MySQL
There is absolutely NO GOOD REASON to log to a file. You'd first log it to a file, then write a script to parse this file. Tomorrow you decide you want to capture some additional information. You now need to modify your parsing script. This will get messy. What I'm getting at is: you do not need to use a file as an intermediate store before the database. 4-5k write requests an hour (I don't think there will be a lot of read requests apart from when you query the DB) is a breeze for MySQL. Furthermore, since this DB won't be used to serve up data to users, you don't care if it is slightly unoptimized. As I see it, you're the only one who'll be querying the database.
EDIT:
When you talked about using a file, I assumed you meant to use it as a temporary store only until you process the file and transfer the contents to a DB. If you did not mean that, and instead meant to store the information permanently in files, that would be a nightmare. Imagine trying to query for certain information that is scattered across files. Not only would you have to write a script that can parse the files, you'd have to write a non-trivial script that can query them without loading all the contents into memory. That would get nasty very, very fast and tremendously impair your ability to spot trends in the data, etc.
Once again: 4-5K might seem like a lot of requests, but a well-optimized DB can handle it. Querying a reasonably optimized DB will be orders of magnitude faster than parsing and querying numerous files.
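To make the MySQL route concrete, a minimal sketch might look like this; the table, columns, and the way a "unique" requester is identified are assumptions:

    <?php
    // Inside the update-check endpoint on your website: one INSERT per request.
    $pdo = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass');

    $stmt = $pdo->prepare(
        'INSERT INTO update_checks (site_hash, plugin_version, requested_at)
         VALUES (?, ?, NOW())'
    );
    $stmt->execute(array(
        sha1($_SERVER['REMOTE_ADDR']),                               // anonymised identifier for "unique" counts
        isset($_GET['version']) ? $_GET['version'] : 'unknown',
    ));

    // Reporting later is a single query, e.g. total and unique requests per day:
    $report = $pdo->query(
        'SELECT DATE(requested_at)        AS day,
                COUNT(*)                  AS requests,
                COUNT(DISTINCT site_hash) AS unique_requests
         FROM update_checks
         GROUP BY DATE(requested_at)'
    )->fetchAll(PDO::FETCH_ASSOC);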
I would recommend using an existing script or framework. It is always a good idea to use a specialized tool in which people have invested a lot of time and ideas. Since you are using PHP, Piwik seems to be one way to go. From the website:
Piwik is a downloadable, Free/Libre (GPLv3 licensed) real time web analytics software program. It provides you with detailed reports on your website visitors: the search engines and keywords they used, the language they speak, your popular pages…
Piwik provides a Tracking API, and you can track custom variables. The DB schema seems highly optimized; have a look at their testimonials page.

How to make a php backend communicate with an HTML frontend?

This is THE MOST noobish question ever, I see it done so often.
I want a PHP page which is constantly running in the background (the backend) to occasionally push updated data to the frontend. YES, this is the way I want to do it.
But the only way I know of contacting a page is to re-request that PHP with XHR - so I would XHR "index.php?data=newdata", but this would create a new process server-side. What do I do?
(Please ask for more info or correct me if there is a better way of doing it)
This is a great SO question/answer to look at:
Using comet with PHP?
The upshot is, you can do it with PHP...
Another way to do this is to create a bridge between your Apache setup and Node. If you read through the guides about Node, you will see that it:
Is designed for high networking loads
Only spawns new threads when it needs to do blocking tasks such as I/O
Is extremely simple to use, and is based on Google's V8 JavaScript engine
Can handle thousands of concurrent connections
With the above in mind, my road map would be to create a database for your PHP application and create two connections to it:
One connection used in the PHP application
One connection used within Node.
The Node side of things would be simple:
Create a simple socket server (~20 lines)
Create an array
Listen for new connections, place the resource into the array.
Attach an event listener for the database
When the event gets fired, pipe the new data to all the clients in the array.
All the clients will receive the data at pretty much the same time. This should be stable, and it's an extremely lightweight solution: 1K connections would use one process with a few I/O threads, and the RAM used would be about 8 MB.
Your first step would be to set up Node.js on your server. If you Google around you will be able to find out how to do that; a simple way under Ubuntu is to do:
apt-get install nodejs
You should read the following resources:
http://nodejs.org/
http://www.youtube.com/watch?v=jo_B4LTHi3I
http://jeffkreeftmeijer.com/2010/experimenting-with-node-js/
http://www.slideshare.net/kompozer/nodejs-and-websockets-intro
http://remysharp.com/2010/02/14/slicehost-nodejs-websockets/
For more technical assistance you should connect to the #node.js IRC channel on freenode.net; those guys will really help you out over there!
Hope this helps.
Comet may be a way to go; however, having a finalized HTML page and doing Ajax requests to get updates is the usual, and more robust, way to do this. I would investigate ways to implement an Ajax-based approach that is optimized for speed.
You say you are doing complex calculations, which you would have to repeat for each request when going the Ajax route. You may be able to mitigate that, e.g. by employing smart caching. If you write the results of whatever you do as JSON-encoded data into a text file, you can fetch that with almost no overhead at all.
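A minimal sketch of that caching idea, with invented paths and a placeholder for the real calculation; the background job writes the JSON, and the Ajax endpoint only streams the file back:

    <?php
    // generate.php -- run from cron or the background job, NOT on every request.
    $results = array('processed' => 12345, 'rows' => array());   // stand-in for your real calculation output
    file_put_contents('/var/www/cache/results.json.tmp', json_encode($results));
    rename('/var/www/cache/results.json.tmp', '/var/www/cache/results.json');  // swap the file in atomically

The Ajax endpoint then becomes little more than:

    <?php
    // results.php -- what the browser polls; no recalculation per hit.
    header('Content-Type: application/json');
    readfile('/var/www/cache/results.json');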

What's the most efficient way to scrape data from a website (in PHP)?

I'm trying to scrape data from IMDb, but naturally there are a lot of pages, and doing it in a serial fashion takes way too long, even when I do multi-threaded cURL.
Is there a faster way of doing it?
Yes, I know IMDb offers text files, but they don't offer everything in any sane fashion.
I've done a lot of brute-force scraping with PHP, and sequential processing seems to be fine. I'm not sure what "a long time" is to you, but I often do other stuff while it scrapes.
Typically nothing is dependent on my scraping in real time; it's the data that counts, and I usually scrape it and massage it at the same time.
Other times I'll use a crafty wget command to pull down a site and save locally. Then have a PHP script with some regex magic extract the data.
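The extraction half of that is usually only a few lines; the mirror path and the markup in the regex below are invented, so treat it as a shape rather than working IMDb code:

    <?php
    // extract.php -- run over pages already mirrored locally, e.g. with
    // "wget -r -np http://www.example.com/".
    foreach (glob('/data/mirror/www.example.com/*/index.html') as $file) {
        $html = file_get_contents($file);
        if (preg_match('~<span class="rating">([\d.]+)</span>~', $html, $m)) {
            echo basename(dirname($file)), "\t", $m[1], "\n";   // page id and the captured value
        }
    }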
I use curl_* in PHP and it works very well.
You could have a parent job that forks child processes, providing them URLs to scrape, which they process, saving the data locally (DB, filesystem, etc.). The parent is responsible for making sure the same URL isn't processed twice and that children don't hang.
This is easy to do on Linux (pcntl_*, fork, etc.), harder on Windows boxes.
You could also add some logic to look at the last-modified time (which you previously stored) and skip scraping the page if the content hasn't changed or you already have it. There are probably a bunch of optimization tricks like that you could do.
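A rough parent/worker sketch of that idea using pcntl_fork (POSIX only); the URL list file, worker count, and storage path are assumptions, and deduplication/last-modified checks are left out for brevity:

    <?php
    // scrape-parent.php
    // Simplest possible fetch; in practice you'd use cURL with timeouts here.
    function fetch_and_save($url)
    {
        $html = @file_get_contents($url);
        if ($html !== false) {
            file_put_contents('/tmp/pages/' . md5($url) . '.html', $html);
        }
    }

    @mkdir('/tmp/pages', 0777, true);
    $urls    = file('urls.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    $workers = 4;
    $chunks  = array_chunk($urls, max(1, (int) ceil(count($urls) / $workers)));

    $pids = array();
    foreach ($chunks as $chunk) {
        $pid = pcntl_fork();
        if ($pid === -1) {
            die("fork failed\n");
        } elseif ($pid === 0) {
            // Child: scrape its own slice of URLs, then exit.
            foreach ($chunk as $url) {
                fetch_and_save($url);
            }
            exit(0);
        }
        $pids[] = $pid;                     // parent keeps a list of its children
    }

    // Parent waits for every child so none are left as zombies; it could also
    // respawn any worker that exits abnormally.
    foreach ($pids as $pid) {
        pcntl_waitpid($pid, $status);
    }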
If you are properly using cURL with curl_multi_add_handle and curl_multi_select, there is not much more you can do. You can test to find an optimal number of handles to process for your system: too few and you will leave your bandwidth unused, too many and you will lose too much time switching handles.
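For reference, the curl_multi loop being described looks roughly like this; the URL list and the pool size of 10 are placeholders to tune:

    <?php
    $urls       = array('http://example.com/page1', 'http://example.com/page2' /* ... */);
    $maxHandles = 10;                                    // tune this for your bandwidth

    $mh          = curl_multi_init();
    $activeCount = 0;

    $addHandle = function ($url) use ($mh, &$activeCount) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        curl_setopt($ch, CURLOPT_PRIVATE, $url);         // remember which URL this handle fetches
        curl_multi_add_handle($mh, $ch);
        $activeCount++;
    };

    // Prime the pool, then keep it topped up as transfers finish.
    while (!empty($urls) && $activeCount < $maxHandles) {
        $addHandle(array_shift($urls));
    }

    do {
        curl_multi_exec($mh, $running);
        if (curl_multi_select($mh, 1.0) === -1) {
            usleep(100000);                              // some systems return -1 immediately; avoid a busy loop
        }

        while ($info = curl_multi_info_read($mh)) {
            $ch   = $info['handle'];
            $url  = curl_getinfo($ch, CURLINFO_PRIVATE);
            $html = curl_multi_getcontent($ch);
            // ... parse/save $html for $url here ...
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
            $activeCount--;

            if (!empty($urls)) {
                $addHandle(array_shift($urls));          // keep the handle count constant
            }
        }
    } while ($running > 0 || $activeCount > 0);

    curl_multi_close($mh);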
You can try to use a master-worker multi-process pattern to have many script instances running in parallel, each one using cURL to fetch and later process a block of pages. Frameworks like http://gearman.org/?id=gearman_php_extension can help in creating an elegant solution, but using process control functions on Unix or calling your script in the background (either via the system shell or over non-blocking HTTP) can also work well.

Best way to send PHP script usage statistics to an external script

I'm writing an application in PHP which I plan to sell copies of. If you need to know, it's used to allow users to download files using expiring download links.
I would like to know whenever one of my sold applications generates a download.
What would be the best way to send a notice to my PHP application on my server, which simply tells it "Hey, one of your scripts has done something", and what would be the best way to keep a count of the number of "hits" my server gets of this nature? A database record, or a flat text file?
I ask because I want to display a running count of the total number of downloads on my homepage, sort of like:
"Responsible for X downloads so far!"
A pure PHP solution is ideal, but I suppose an Ajax call would be OK too. The simpler the better, since all I am really doing is a simple $var++, only on a larger scale, right?
Anyone care to point me in the right direction?
Whether by JavaScript or PHP, you need to set up a URL on your server that other scripts can call. That URL should then point to a script that increments a counter. I'd put a number in a database and increment it, or if you wanted to be more detailed, you could easily break this down by month/client, etc.
If you go with calling the URL from PHP, take care to ensure that the URL doesn't block execution - i.e. if your site goes down, the script you sell shouldn't sit waiting for your server to respond. You can work around this in various ways - I'd do it by registering a shutdown function.
Alternatives that don't have this problem are loading the URL with JavaScript, or as an image (but both will likely be slightly less accurate) - I would go with the image myself, as you'll get marginally better browser support.
Also, remember that unless you compile the code with something like Zend Guard, anyone can remove the remote call and prevent your counter incrementing!
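A minimal sketch of both halves; the table name, endpoint URL, and two-second timeout are assumptions:

    <?php
    // count.php -- lives on YOUR server; every sold copy pings this once per download.
    $pdo = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass');
    $pdo->exec('UPDATE download_counter SET hits = hits + 1');   // single-row counter table
    header('HTTP/1.1 204 No Content');                           // nothing to display

Inside the sold script, the shutdown-function approach defers the ping until the script's own work is done, and the short timeout caps how long an unreachable stats server can hang it:

    // After the download response has been handled:
    register_shutdown_function(function () {
        $ctx = stream_context_create(array('http' => array('timeout' => 2)));  // short safety-net timeout
        @file_get_contents('http://your-server.example/count.php', false, $ctx);
    });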
Yeah, something URL-callable is what you want. SOAP is probably the easiest way to go about this.
http://php.net/soap
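If you do go the SOAP route, a non-WSDL sketch is only a few lines on each side; the URN, function name, and endpoint URL below are made up:

    <?php
    // soap-endpoint.php -- on YOUR server.
    function recordDownload()
    {
        $pdo = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass');
        $pdo->exec('UPDATE download_counter SET hits = hits + 1');
        return true;
    }

    $server = new SoapServer(null, array('uri' => 'urn:downloadcounter'));
    $server->addFunction('recordDownload');
    $server->handle();

And in the sold application:

    $client = new SoapClient(null, array(
        'location'           => 'http://your-server.example/soap-endpoint.php',
        'uri'                => 'urn:downloadcounter',
        'connection_timeout' => 2,     // don't let a slow stats server hold up the customer
    ));
    $client->recordDownload();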
Benlumley has a lot of valid points regarding this solution.
Also, if you want to offload the work to the user's browser (making web-service calls from your app might annoy the people who buy it, due to the bandwidth/CPU cost), then Ajax might be a better solution.
