I have a script that receives JSON data from various sources and processes it.
I have a list of known good sources, stored both in a database and in a text file. The list has thousands of records.
Before processing I want to compare the source value from the JSON with the source values in the list. Data is received every 10 seconds. The list does not change often.
At the moment I can make this work either by querying the database for the source list or by reading the list from the text file. However, it seems redundant to do this every 10 seconds upon receiving JSON, since the list is going to be the same 99% of the time.
The question is: what is a good way to do this?
Assuming this DB is something you have more than read access to: you mentioned the database records do not change often, so you could add a trigger on the DB for any changes. Have the trigger set a single row in a new table, say "listUpdated", to true.
Load the list into an array in your PHP and check your incoming data against that. Every 10 seconds you can just check whether the "listUpdated" flag has been set to true. If it has, reload your array and change the value back to false.
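Roughly, the PHP side could look like this. This is a sketch only: the sources table, the listUpdated.flag column, and a long-running receiver process are all assumptions (if your script is re-invoked every 10 seconds, keep the array in a cache such as APC instead of a plain variable):

<?php
// Sketch only: table/column names (sources.source, listUpdated.flag) are assumptions.
$db = new mysqli('localhost', 'user', 'pass', 'mydb');

function loadSources($db) {
    $list = array();
    $result = $db->query('SELECT source FROM sources');
    while ($row = $result->fetch_assoc()) {
        $list[$row['source']] = true;          // array keys give fast lookups
    }
    return $list;
}

$knownSources = loadSources($db);              // load once at startup

// Then, every 10 seconds when JSON arrives:
$row = $db->query('SELECT flag FROM listUpdated LIMIT 1')->fetch_row();
if ($row && (int)$row[0] === 1) {              // the trigger set the flag
    $knownSources = loadSources($db);          // reload the list
    $db->query('UPDATE listUpdated SET flag = 0');
}

$data = json_decode($jsonPayload, true);       // $jsonPayload: the incoming JSON
if (isset($knownSources[$data['source']])) {
    // source is on the known-good list: process the record
}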
Say a PHP script appends some text to an already existing data in a MySQL database. Say the existing data is abc. Now one user wants to append 123 and another wants to append xyz. Both of them run the script at the same time. Whose will be appended first? Will the final result be abc123xyz or abcxyz123?
In the worst case, say the script first reads the data, appends the given text to it, and then replaces the old data in the database. Whose change will 'survive' then? Will the result be abc123 or abcxyz?
Sorry if it has been asked before.
That depends on the storage engine you use. With InnoDB, MySQL performs row-level locking only for writes.
There will be a tiny difference in when each query hits the database. MySQL will execute the first query and then the second, so the data will end up as either abc123xyz or abcxyz123, depending on which query executed first.
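For what it's worth, the read-then-replace worst case can be avoided entirely by doing the append in a single SQL statement, so neither update is lost. A minimal sketch, assuming a messages table with id and body columns (names are illustrative):

<?php
// Sketch: append inside SQL so there is no read-modify-write race.
$db = new mysqli('localhost', 'user', 'pass', 'mydb');

$stmt = $db->prepare('UPDATE messages SET body = CONCAT(body, ?) WHERE id = ?');
$suffix = '123';
$id = 1;
$stmt->bind_param('si', $suffix, $id);
$stmt->execute();

// Two such statements run concurrently serialize on InnoDB's row lock,
// so starting from abc the result is abc123xyz or abcxyz123, never a lost update.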
A tutorial here shows how to build an aggregator in PHP, but I'm having some trouble finding the best way to avoid inserting the same items into my database twice.
If I were to run the script on http://visualwebsiteoptimizer.com/split-testing-blog/feed/ and then run it again in 5 minutes it'll just insert the same items again.
That tutorial just has an interval specified at which it will reload the RSS feed and save all the items.
I was wondering if RSS implements some request header that will only send the items after a certain date. I see here that I could use lastBuildDate and maybe ignore channels whose date is older than the last fetch, but it doesn't say whether that element is mandatory.
My question here is: how can I check RSS feeds regularly and insert it in a database without inserting the same item more than once?
I'm thinking the only way to do it is to check whether a record already exists using the link, and only insert if it doesn't exist already. I know link is optional, but I won't save items that don't have one anyway. This seems a bit inefficient, though; checking before every insert might be fine in the beginning, but when the database starts filling up it might get very slow.
You might have to use a few different strategies depending on how well the site you are consuming has implemented the spec.
First I would try adding a unique index on the database for the GUID value; GUIDs by their nature should be unique (http://en.wikipedia.org/wiki/Globally_unique_identifier). Then, depending on which DB you are using, you should be able to use syntax like INSERT IGNORE INTO... or INSERT ... ON DUPLICATE KEY UPDATE... and just have the update clause not really do anything.
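A sketch of what that could look like, assuming an items table with a guid column (the table and column names are illustrative, not from the tutorial):

<?php
// Sketch: unique index on guid plus an insert that silently skips duplicates.
// One-time schema change (illustrative names):
//   ALTER TABLE items ADD UNIQUE INDEX idx_items_guid (guid);
$db = new mysqli('localhost', 'user', 'pass', 'feeds');

$stmt = $db->prepare(
    'INSERT INTO items (guid, title, link, published)
     VALUES (?, ?, ?, ?)
     ON DUPLICATE KEY UPDATE guid = guid'      // no-op when the guid already exists
);
$stmt->bind_param('ssss', $guid, $title, $link, $published);

foreach ($feedItems as $item) {                // $feedItems: items parsed from the feed
    $guid      = (string)$item->guid;
    $title     = (string)$item->title;
    $link      = (string)$item->link;
    $published = date('Y-m-d H:i:s', strtotime((string)$item->pubDate));
    $stmt->execute();                          // duplicates are skipped by the index
}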
If some sites don't have a guid field (I am assuming you will end up consuming more than just the example), you could add the unique index on the siteId field plus either the time or the title. Both are less than ideal, of course; contacting the site owner to get them to implement a guid might work too ;)
You could also run an md5 hash on the post content and store that alongside the post; that should stop duplicates too.
How big are you expecting the DB to get? With proper indexing I would have thought it would have to be huge before it runs slow; indexes on siteId, guid, time and/or the hash, with the lookup limited to just 1 row and selecting just the rowId, should be quick enough, especially if you can get your script to run from the command line / on a cron job rather than through a webserver.
I certainly can't solve this problem by myself after many days of trying. This is the problem:
We need to display information on the screen (HTML) that is being generated in real time inside a PHP file.
The PHP is performing very active crawling, returning huge arrays of URLs. Each URL needs to be displayed in real time in the HTML as soon as the PHP captures it, which is why we are using ob_flush() and flush() to echo and print the arrays as soon as we get them.
Meanwhile we need to display this information somehow so the users can see it while it works (since it could take more than one hour until it finishes).
As far as I understand, it's not possible to do this with AJAX, since we need to make only 1 request and read the information inside the array. I'm not totally sure either whether COMET can do something like this, since it would interrupt the connection as soon as it gets new information, and the array is increasing in size really rapidly.
Additionally, and just to make things more complex, there's no real need to print or echo the information (URLs) inside the array, since the HTML file is being included as the user interface of the same file that is processing and generating the array that we need to display.
Long story short; we need to place here:
<ul>
<li></li>
<li></li>
<li></li>
<li></li>
<li></li>
...
</ul>
A never-ending, real-time updated list of URLs being generated and pushed into an array, 1,000 lines below, in a PHP loop.
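To illustrate, this is a simplified sketch of the pattern described above (getNextUrls() is just a placeholder for the real crawler logic):

<?php
// Simplified sketch of the current output-flushing approach.
// getNextUrls() is a placeholder for the real crawler.
header('Content-Type: text/html; charset=utf-8');
echo '<ul>';

while (true) {
    $urls = getNextUrls();                     // URLs captured since the last call
    if ($urls === null) {
        break;                                 // crawl finished
    }
    foreach ($urls as $url) {
        echo '<li>' . htmlspecialchars($url) . '</li>';
    }
    ob_flush();                                // push buffered output...
    flush();                                   // ...to the browser immediately
}

echo '</ul>';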
Any help would be really more than appreciated.
Thanks in advance!
Try WebSockets.
They offer real-time communication between client and server, and using socket.io provides cross-browser compatibility. It basically gives you the same results as long-polling / COMET, but there is less overhead between requests, so it's faster.
In this case you would use web sockets to send updates to the client about the current status of the processing (or whatever it was doing).
See this Using PHP with Socket.io
Suppose you used a scheme where PHP was writing to a Memcached server.
Each record is written under a key such as rec1, rec2, rec3, and so on.
You also store a current_min and a current_max
You have the user constantly polling with ajax. For each request they include the last key they saw, call this k. The server then returns all the records from k to max.
If no records are immediately available, the server goes into a wait loop for a max of, say 3 seconds, checking if there are new records every 100ms
If records become available, they are immediately sent.
Whenever the client receives updates or the connection is terminated, they immediately start a new request...
Writing a new record is just a matter of storing it at key max+1 and incrementing max (and incrementing min once max-min exceeds the number of records you want to keep available)...
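A rough sketch of the long-poll endpoint for that scheme, assuming the pecl Memcached extension and keys named rec1, rec2, ... plus current_min / current_max counters (all names are illustrative):

<?php
// Sketch of the polling endpoint described above.
// Keys: "rec<N>" for records, "current_min" / "current_max" as counters.
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

$lastSeen = isset($_GET['k']) ? (int)$_GET['k'] : 0;   // last key index the client saw

$records  = array();
$deadline = microtime(true) + 3.0;                     // wait at most ~3 seconds
do {
    $max = (int)$mc->get('current_max');
    if ($max > $lastSeen) {
        $min = (int)$mc->get('current_min');
        for ($i = max($lastSeen + 1, $min); $i <= $max; $i++) {
            $rec = $mc->get("rec$i");
            if ($rec !== false) {
                $records[$i] = $rec;
            }
        }
        break;                                         // records available: send now
    }
    usleep(100000);                                    // nothing new: retry in 100 ms
} while (microtime(true) < $deadline);

header('Content-Type: application/json');
echo json_encode(array('max' => $max, 'records' => $records));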
An alternative to web sockets is COMET
I wrote an article about this, along with a followup describing my experiences.
COMET in my experience is fast. Web sockets are definitely the future, but if you're in a situation where you just need to get it done, you can have COMET up and running in under an hour.
Definitely some sort of shared memory structure is needed here - perhaps an in-memory temp table in your database, or Memcached as stephen already suggested.
I think the best way to do this would be to have the first PHP script save each record to a database (MySQL or SQLite perhaps), and then have a second PHP script which reads from the database and outputs the newest records. Then use AJAX to call this script every so often and add the records it sends to your table. You will have to find a way of triggering the first script.
The javascript should record the id of the last url it already has, and send it in the AJAX request, then PHP can select all rows with ids greater than that.
If the number of URLs is so huge that you can't store a database that large on your server (one might ask how a browser is going to cope with a table as large as that!) then you could always have the PHP script which outputs the most recent records delete them from the database as well.
Edit: When doing a lot of MySQL inserts there are several things you can do to speed it up. There is an excellent answer here detailing them. In short, use MyISAM and insert as many rows as you can in a single query (keep a buffer array in PHP, add URLs to it, and when it is full insert the whole buffer in one query).
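A sketch of the second script, assuming a urls table with auto-increment id and url columns (names are illustrative):

<?php
// Sketch: return only the rows the client hasn't seen yet, as JSON.
$db = new mysqli('localhost', 'user', 'pass', 'crawler');

$lastId = isset($_GET['last_id']) ? (int)$_GET['last_id'] : 0;

$stmt = $db->prepare('SELECT id, url FROM urls WHERE id > ? ORDER BY id LIMIT 500');
$stmt->bind_param('i', $lastId);
$stmt->execute();
$rows = $stmt->get_result()->fetch_all(MYSQLI_ASSOC);

header('Content-Type: application/json');
echo json_encode($rows);   // the JavaScript appends these as <li> items and remembers the last id

And the buffered insert on the crawler side could look something like this:

<?php
// Sketch: collect URLs in a buffer and insert them in one multi-row query.
function flushBuffer($db, &$buffer) {
    if (empty($buffer)) {
        return;
    }
    $values = array();
    foreach ($buffer as $url) {
        $values[] = "('" . $db->real_escape_string($url) . "')";
    }
    $db->query('INSERT INTO urls (url) VALUES ' . implode(',', $values));
    $buffer = array();
}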
If I were you, I would try to solve this in one of two ways.
First, I would encode the output array as JSON, and with JavaScript's setTimeout function I would decode it and append it to <ul id="appendHere"></ul>, so when the list is updated the page updates itself automatically, like a cron job in JS.
Second, if you really can't produce output while processing, then inserting the data into MySQL seems pointless to me; use MongoDB or similar to increase speed.
By the way, with your key you'll get what you need and never duplicate an inserted value.
Disclaimer: I'm familiar with PHP, MySQL, jQuery, AJAX, but am by no means an expert in any of them.
I'm working on a web application that checks for updates to a MySQL database every two seconds. Currently there are 5 tables, and for the sake of discussion we can assume each has fewer than 50 rows. The design I inherited was to refresh 5 iframes every two seconds (roughly, each iframe corresponds to a table).
I've since attempted to improve upon this by replacing the iframes with divs, checking the UPDATE_TIME in INFORMATION_SCHEMA, and only updating the divs whose content has changed since the previously saved UPDATE_TIME. To do this, I use a jQuery AJAX call to get the new data from a PHP script. The problem with this strategy is that an external program updates the database asynchronously, so it can make multiple updates within a single second.
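For reference, the check looks roughly like this (simplified; the database and table names are placeholders):

<?php
// Simplified version of the check: has this table changed since the last refresh?
$db = new mysqli('localhost', 'user', 'pass', 'mydb');

$stmt = $db->prepare(
    'SELECT UPDATE_TIME
       FROM INFORMATION_SCHEMA.TABLES
      WHERE TABLE_SCHEMA = ? AND TABLE_NAME = ?'
);
$schema = 'mydb';
$table  = 'table1';
$stmt->bind_param('ss', $schema, $table);
$stmt->execute();
$updateTime = $stmt->get_result()->fetch_row()[0];

$lastSeen = $_GET['lastSeen'];   // UPDATE_TIME stored after the previous refresh, sent by the client
if ($updateTime !== null && $updateTime > $lastSeen) {
    // refresh the div -- but UPDATE_TIME only has whole-second resolution,
    // so several updates within the same second look identical here
}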
This question is very similar to other questions, with the exception that the whole-second resolution provided by UPDATE_TIME is not enough in my case if I'm to base my updates solely on it.
Query to find tables modified in the last hour
Any solutions would be greatly appreciated!
I had a similar implementation. The tables I needed to fetch from had a column called CREATED_TIME. This column can be filled with CURRENT_TIMESTAMP using the column's default value, or from the application.
Initially, once you load the contents into the div, keep a JavaScript variable for each div which stores CLIENT_MAX(CREATED_TIME). Each time you need to update the div with the latest rows, follow these steps:
1. Request the server with the CheckIfTable1Updated?maxTime=CLIENT_MAX(CREATED_TIME) value using ajax.
2. The server fetches the SERVER_MAX(CREATED_TIME) value from Table1 and compares it with the value sent by the client.
3. If the max value in the table is greater than the value sent by the client, the response is the SERVER_MAX(CREATED_TIME); otherwise it is 0.
4. If the client receives a 0, do nothing.
5. If the client receives a value greater than zero, i.e. the SERVER_MAX(CREATED_TIME), call the server with ajax: 'RetrieveTable1Updates?fromTime=CLIENT_MAX(CREATED_TIME)&toTime=SERVER_MAX_TIME_JUST_RECEIVED'.
6. The server handles this and fetches the rows with the constraint BETWEEN CLIENT_MAX_TIME AND SERVER_MAX_TIME_ACCORDING_TO_CLIENT.
7. The server sends the HTML elements.
8. The client receives the data and appends it to the corresponding div.
9. Set CLIENT_MAX(CREATED_TIME) to the SERVER_MAX(CREATED_TIME) received in the first ajax call.
Note: This can be handled with the row_id too, which would be much easier than timestamp handling, as BETWEEN would need CLIENT_MAX_TIME + 1 to avoid duplication.
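The two server-side handlers from the steps above could be sketched like this (the script, table, and parameter names follow the example; adapt them to your schema). Using an exclusive lower bound instead of BETWEEN also sidesteps the +1 adjustment mentioned in the note:

<?php
// CheckIfTable1Updated.php (steps 2-3): compare the client's max CREATED_TIME
// with the server's and return the newer timestamp, or 0.
$db = new mysqli('localhost', 'user', 'pass', 'mydb');

$clientMax = $_GET['maxTime'];                 // CLIENT_MAX(CREATED_TIME) from the browser

$serverMax = $db->query('SELECT MAX(CREATED_TIME) FROM Table1')->fetch_row()[0];

echo ($serverMax !== null && $serverMax > $clientMax) ? $serverMax : '0';

<?php
// RetrieveTable1Updates.php (steps 6-7): return the rows created in the window.
$db = new mysqli('localhost', 'user', 'pass', 'mydb');

$from = $_GET['fromTime'];                     // CLIENT_MAX(CREATED_TIME)
$to   = $_GET['toTime'];                       // SERVER_MAX(CREATED_TIME) from the first call

$stmt = $db->prepare(
    'SELECT * FROM Table1 WHERE CREATED_TIME > ? AND CREATED_TIME <= ?'
);
$stmt->bind_param('ss', $from, $to);
$stmt->execute();
$rows = $stmt->get_result()->fetch_all(MYSQLI_ASSOC);

// Send HTML fragments for the client to append (step 8):
foreach ($rows as $row) {
    echo '<div>' . htmlspecialchars(implode(' | ', $row)) . '</div>';
}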
Use HTML5 WebSockets along with jQuery to get maximum benefit.
I have an area which gets populated with about 100 records from a PHP/MySQL query.
I need to add AJAX pagination, meaning load 10 records first, then, when the user clicks, let's say, a "+" character, populate the next 20.
So the area, instead of displaying everything at once, will display 20, then on click the next 20, and so on, with no refresh.
Should I dump the 100 records into a session variable?
Should I run the MySQL query every time the user clicks the next-page trigger?
I'm unsure what the best approach is, or the best way to go about this.
My concern is the database will be growing from 100 to 10,000.
Any help or direction is greatly appreciated.
If you have a large record set that will be viewed often (but not often updated), look at APC to cache the data and share it among sessions. You can also create a static file that is rewritten when the data changes.
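A sketch of the APC idea, using the extension's apc_fetch / apc_store functions (the cache key, TTL, and records table are illustrative):

<?php
// Sketch: cache the rarely changing record set in APC so each request
// doesn't have to hit MySQL.
function getRecords($db) {
    $records = apc_fetch('records_cache', $hit);
    if ($hit) {
        return $records;
    }

    $result  = $db->query('SELECT id, name FROM records ORDER BY id');
    $records = $result->fetch_all(MYSQLI_ASSOC);

    apc_store('records_cache', $records, 300);   // refresh at most every 5 minutes
    return $records;
}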
If the data needs to be sorted/manipulated on the page, you will want to limit the number of records loaded to keep the JavaScript from running too long. ExtJS has some nice widgets that do this, just provide it with JSON data (use PHP's encode method on your record set). We made one talk to Oracle and do the 20-record paging fairly easily.
If your large record set is frequently updated, and the data must be "real time" accurate, you have some significant challenges ahead. I would look at comet, ape, or web workers for a polling/push solution and build your API to only deal in updates to the "core" data--again, probably cached on the server rather than pulled from the DB every time.
Your AJAX call should call a page which only pulls the exact number of rows it needs from the database: select the first 20 rows for the first page, and so on (in MySQL this is done with LIMIT and OFFSET). Your AJAX call can take a parameter called pagenum, and that determines which records you actually pull from the database. No need for session variables.
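A minimal sketch of such a page, assuming an items table and a page size of 20 (names are illustrative):

<?php
// Sketch: fetch only the rows for the requested page, straight from MySQL.
$db = new mysqli('localhost', 'user', 'pass', 'mydb');

$perPage = 20;
$pageNum = isset($_GET['pagenum']) ? max(1, (int)$_GET['pagenum']) : 1;
$offset  = ($pageNum - 1) * $perPage;

$stmt = $db->prepare('SELECT id, name FROM items ORDER BY id LIMIT ? OFFSET ?');
$stmt->bind_param('ii', $perPage, $offset);
$stmt->execute();
$rows = $stmt->get_result()->fetch_all(MYSQLI_ASSOC);

header('Content-Type: application/json');
echo json_encode($rows);   // the AJAX caller appends these to the page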