I have a very large dataset that I am exporting using a batch process to keep the page from timing out. The whole process can take over an hour, and I'm using Drupal's Batch API, which basically reloads the page with a status on how far the process has completed. Each page request essentially runs the query again, which includes a sort that takes a while. It then exports the data to a temp file. The next page load runs the full Mongo query, sorts, skips the entries already exported, and exports more to the temp file. The problem is that each page load makes Mongo rerun the entire query and sort. I'd like the next batch page to pick up the same cursor where it left off and continue pulling the next set of results.
The MongoDB Manual entry for cursor.skip() gives some advice:
Consider using range-based pagination for these kinds of tasks. That is, query for a range of objects, using logic within the application to determine the pagination rather than the database itself. This approach features better index utilization, if you do not need to easily jump to a specific page.
E.g. if your nightly batch process runs over the data accumulated in the last 24 hours, perhaps you can run date-range-based queries (maybe one per hour of the day) and process your data that way. I'm assuming that your data contains some sort of usable timestamp per document, but you get the idea.
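To make the idea concrete, here's a minimal sketch (in Python) of how a batch job could slice a day into hourly windows and build one MongoDB range filter per window. The `created_at` field name is an assumption; substitute whatever timestamp your documents actually carry.

```python
from datetime import datetime, timedelta

def hourly_windows(day_start):
    """Split a 24-hour period into [start, end) ranges, one per hour."""
    return [(day_start + timedelta(hours=h), day_start + timedelta(hours=h + 1))
            for h in range(24)]

def range_query(start, end):
    """Build a MongoDB filter for one window; 'created_at' is an assumed field."""
    return {"created_at": {"$gte": start, "$lt": end}}

windows = hourly_windows(datetime(2013, 6, 1))
queries = [range_query(s, e) for s, e in windows]
```

Each batch step then runs one of these small indexed range queries instead of re-running (and re-sorting) the whole result set and skipping past what was already exported.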
Although cursors live on the server and only time out after roughly 10 minutes of inactivity, the PHP driver does not support persisting cursors between requests.
At the end of each request the driver will kill all cursors created during that request that have not been exhausted.
This also happens when all references to the MongoCursor object are removed (e.g. $cursor = null).
This is done because it's unfortunately fairly common for applications not to iterate over the entire cursor, and we don't want to leave unused cursors around on the server, as that could have performance implications.
For your specific case, the best way to work around this problem is to improve your indexes so loading the cursor is faster.
You may also want to only select some subset of the data so you have a fixed point you can request data between.
Say, for reports, your first request may ask for all data from 1am to 2am.
Then your next request asks for all data from 2am to 3am, and so on, as Saftschleck explains.
You may also want to look into the aggregation framework, which is designed to do "online reporting": http://docs.mongodb.org/manual/aggregation/
Related
I have a fairly large amount of data that I'm trying to insert into MySQL. It's a data dump from a provider that is about 47,500 records. Right now I'm simply testing the insert method through a PHP script just to get things dialed in.
What I'm seeing is that, first, the inserts continue long after the PHP script "finishes". By the time the browser no longer shows an "X" to cancel the request and shows a "reload" instead (indicating the script is done from the browser's perspective), I can see inserts still occurring for a good 10+ minutes. I assume this is MySQL caching the queries. Is there any way to keep the script "alive" until all queries have completed? I put a 15-minute timeout on my script.
Second, and more disturbing, is that I won't get every insert. Of the 47,500 records I'll get anywhere between 28,000 and 38,000 records but never more - and that number is random each time I run the script. Anything I can do about that?
Lastly, I have a couple simple echo statements at the end of my script for debugging, these never fire - leading me to believe that a time out might be happening (although I don't get any errors about time-outs or memory outages). I'm thinking this has something to do with the problem but am not sure.
I tried changing my table to an ARCHIVE table, but not only did that not help, it also meant I lost the ability to update the records in the table when I want to; I did it only as a test.
Right now the insert is in a simple loop: it loops over each record in the JSON data I get from the source and runs an insert statement, then moves on to the next iteration. Should I instead use the loop to build one massive insert and run a single insert statement at the end? My concern is that I would exceed the max_allowed_packet configuration that is hard-coded by my hosting provider.
So I guess the real question is what is the best method to insert nearly 50,000 records into MySQL using PHP based on what I've explained here.
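One middle ground between one-row-at-a-time inserts and a single massive statement is to build multi-row INSERTs and start a new statement whenever the next row would push the SQL text past a byte budget. A hedged sketch (table and column names are made up; values are quoted naively here, where real code should escape them or use prepared statements):

```python
def chunked_inserts(table, columns, rows, max_packet=1_000_000):
    """Build multi-row INSERT statements, starting a new statement whenever
    adding the next row would push the SQL text past max_packet bytes
    (a stand-in for MySQL's max_allowed_packet)."""
    header = f"INSERT INTO {table} ({', '.join(columns)}) VALUES "
    statements, batch, size = [], [], len(header)
    for row in rows:
        tup = "(" + ", ".join(f"'{v}'" for v in row) + ")"
        if batch and size + len(tup) + 2 > max_packet:   # +2 for ", " separator
            statements.append(header + ", ".join(batch))
            batch, size = [], len(header)
        batch.append(tup)
        size += len(tup) + 2
    if batch:
        statements.append(header + ", ".join(batch))
    return statements

stmts = chunked_inserts("records", ["name", "value"],
                        [("a", 1), ("b", 2), ("c", 3)], max_packet=60)
```

With a realistic max_packet of a few hundred kilobytes, 47,500 records would collapse into a handful of statements, which is typically far faster than 47,500 separate inserts.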
I am running an application (built on PHP & MySQL) on a VPS. I have an article table which has millions of records in it. Whenever a user logs in, I display the last 50 records for each section.
So every time a user logs in or refreshes the page, it executes a SQL query to get those records. Now there are lots of users on the website, and because of that my page speed has dropped significantly.
I did some research on caching and found that I can read the MySQL data based on section and number of articles, e.g. (section - 1 and no. of articles - 50), and store it in a disk file cache/md5(section no.).
Then, in future, when I get a request for that section, I can just get the data from cache/md5(section no.).
The above solution looks great, but before I go ahead I would really like to clarify a few doubts with the experts:
Will it really speed up my application? (I know disk I/O is faster than a MySQL query, but I don't know by how much.)
I am currently using pagination on my page: display the first 5 articles, and when the user clicks "display more", display the next 5 articles, etc. This can easily be done in a MySQL query. I have no idea how I should do it if I store all 50 records in a cache file. If someone could share some info, that would be great.
Any alternative solution if you believe the above will not work?
Any open-source application (PHP) if you know of one?
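As a rough sketch of the cache/md5(section) idea, including the pagination doubt (slice the 50 cached articles instead of re-querying): this Python example writes to a temp directory, and the helper names and the 300-second max age are illustrative assumptions.

```python
import hashlib
import json
import os
import tempfile
import time

CACHE_DIR = tempfile.mkdtemp()   # in production this would be e.g. ./cache/

def cache_path(section):
    return os.path.join(CACHE_DIR, hashlib.md5(str(section).encode()).hexdigest())

def cache_set(section, articles):
    """Store all 50 articles for a section in one file, with a write time."""
    with open(cache_path(section), "w") as f:
        json.dump({"time": time.time(), "articles": articles}, f)

def cache_get(section, max_age=300, offset=0, limit=5):
    """Return one page of cached articles, or None if the cache is missing/stale."""
    try:
        with open(cache_path(section)) as f:
            data = json.load(f)
    except FileNotFoundError:
        return None
    if time.time() - data["time"] > max_age:
        return None
    return data["articles"][offset:offset + limit]

cache_set(1, [f"article-{i}" for i in range(50)])
first_page = cache_get(1, offset=0, limit=5)
second_page = cache_get(1, offset=5, limit=5)
```

The key point for pagination: since the whole 50-article list lives in one cache entry, "display more" is just a different offset into the same file, with no extra query.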
Thank you in advance
Regards,
Raj
I ran into the same issue where every page load results in 2+ queries being run. Thankfully they're very similar queries being run over and over so caching (like your situation) is very helpful.
You have a couple options:
offload the database to a separate VPS on the same network to scale it up and down as needed
cache the data from each query and try to retrieve from the cache before hitting the database
In the end we chose both, installing Memcached and its PHP extension for query caching purposes. Memcached is a key-value store (much like PHP's associative arrays) with a set expiration time, measured in seconds, for each value stored. Since it stores everything in RAM, the tradeoff for volatile cache data is extremely fast read/write times, much better than the filesystem.
Our implementation was basically to run every query through a filter: if it's a SELECT statement, cache it by setting the Memcached key to "namespace_[md5 of query]" and the value to a serialized version of an array with all resulting rows. Caching for 120 seconds (2 minutes) should be more than enough to help with the server load.
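The filter could look roughly like this sketch; a plain dict stands in for the Memcached server so the example stays self-contained, and the namespace and TTL values are illustrative.

```python
import hashlib
import time

class QueryCache:
    """Minimal stand-in for the Memcached-backed query filter described above;
    a dict plays the role of the Memcached server so the sketch is runnable."""
    def __init__(self, namespace="app", ttl=120):
        self.namespace, self.ttl, self.store = namespace, ttl, {}

    def _key(self, sql):
        return f"{self.namespace}_{hashlib.md5(sql.encode()).hexdigest()}"

    def query(self, sql, run_query):
        """Return cached rows for a SELECT, or run it and cache the result."""
        if sql.lstrip().upper().startswith("SELECT"):
            key = self._key(sql)
            hit = self.store.get(key)
            if hit and time.time() - hit[0] < self.ttl:
                return hit[1]
            rows = run_query(sql)
            self.store[key] = (time.time(), rows)
            return rows
        return run_query(sql)   # non-SELECTs always go to the database

calls = []
def fake_db(sql):
    calls.append(sql)
    return [{"id": 1}]

cache = QueryCache()
cache.query("SELECT * FROM articles", fake_db)
cache.query("SELECT * FROM articles", fake_db)   # second call served from cache
```

With the real Memcached extension the dict operations become get/set calls with the TTL passed to set, but the filter logic is the same.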
If Memcached isn't a viable solution, store all 50 articles for each section as an RSS feed. You can pull all articles at once, grabbing the content of each article with SimpleXML and wrapping it in your site's article template HTML, as per the site design. Once the data is there, use CSS styling to only display X articles, using JavaScript for pagination.
Since two processes modifying the same file at the same time would be a bad idea, have adding a new story to a section trigger an event, which would add the story to a message queue. That message queue would be processed by a worker which does two consecutive things, also using SimpleXML:
Remove the oldest story at the end of the XML file
Add a newer story given from the message queue to the top of the XML file
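The answer describes the worker with SimpleXML (PHP); the same two steps can be sketched in Python with the standard library's ElementTree, assuming a conventional rss/channel/item layout.

```python
import xml.etree.ElementTree as ET

def rotate_feed(xml_text, new_title):
    """Apply the two worker steps: drop the oldest <item>, prepend the newest."""
    root = ET.fromstring(xml_text)
    channel = root.find("channel")
    items = channel.findall("item")
    if items:
        channel.remove(items[-1])                  # 1. remove the oldest story
    new_item = ET.Element("item")
    ET.SubElement(new_item, "title").text = new_title
    children = list(channel)
    idx = next((i for i, c in enumerate(children) if c.tag == "item"),
               len(children))                      # keep <title> etc. above items
    channel.insert(idx, new_item)                  # 2. add the queued story on top
    return ET.tostring(root, encoding="unicode")

feed = ("<rss><channel><title>Section 1</title>"
        "<item><title>old-1</title></item>"
        "<item><title>old-2</title></item></channel></rss>")
rotated = rotate_feed(feed, "new-story")
```

Because only the single queue worker ever rewrites the file, the two-processes-one-file race the answer warns about never arises.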
If you'd like, RSS feeds according to section can be a publicly facing feature.
I have multiple devices (eleven, to be specific) which send information every second. This information is received by an Apache server, parsed by a PHP script, stored in the database, and finally displayed in a GUI.
What I am doing right now is checking whether a row for the current day exists; if it doesn't, I create a new one, otherwise I update it.
The reason I do it like that is because I need to poll the information from the database and display it in a C++ application to make it look sort of real-time. If I created a row every time a device sent information, processing and reading the data would take a significant amount of time as well as system resources (memory, CPU, etc.), making the display of data not quite real-time.
I wrote a report generation tool which takes the information for every day (from 00:00:00 to 23:59:59) and put it in an excel spreadsheet.
My questions are basically:
Is it possible to do the insertion/updating part directly in the database server, or do I have to do the logic in the PHP script?
Is there a better (more efficient) way to store the information without a decrease in performance in the display device?
Regarding report generation: if I want to sample intervals, let's say starting from yesterday at 15:50:00 and ending today at 12:45:00, it cannot be done with my current data structure. What do I need to consider in order to make a data structure which would allow me to create such queries?
The components I use:
- Apache 2.4.4
- PostgreSQL 9.2.3-2
- PHP 5.4.13
My recommendation: just store all the information your devices are sending. With proper indexes and queries you can process and retrieve information from the DB really fast.
For your questions:
Yes, it is possible to build any logic you desire inside a Postgres DB using SQL, PL/pgSQL, PL/PHP, PL/Java, PL/Python, and many other languages built into Postgres.
As I said before - proper indexing can do magic.
If you cannot get desired query speed with full table - you can create a small table with 1 row for every device. And keep in this table last known values to show them in sort of real-time.
1) The technique is called upsert. In PG 9.1+ it can be done with a writable CTE (http://www.depesz.com/2011/03/16/waiting-for-9-1-writable-cte/)
2) If you really want it to be real-time, you should be sending the data directly to the application; storing it in memory or a plaintext file will also be faster if you only care about the last few values. But PG does have LISTEN/NOTIFY channels, so your lag will probably be just 100-200 milliseconds, and that shouldn't matter much given you're only displaying it.
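For point 1, a writable-CTE upsert has roughly the shape below (shown as a parameterized SQL string; the device_daily table, its columns, and the %(name)s placeholder style used by psycopg2 are assumptions for illustration):

```python
# Sketch of the pre-9.5 "upsert via writable CTE" pattern: try the UPDATE
# first, and INSERT only if the UPDATE matched no row. Note this pattern is
# not race-free under concurrent writers without additional locking.
UPSERT_SQL = """
WITH updated AS (
    UPDATE device_daily
       SET payload = %(payload)s
     WHERE device_id = %(device_id)s AND day = %(day)s
 RETURNING device_id
)
INSERT INTO device_daily (device_id, day, payload)
SELECT %(device_id)s, %(day)s, %(payload)s
 WHERE NOT EXISTS (SELECT 1 FROM updated)
"""
```

This replaces the "check if a row for the current day exists, then insert or update" round trip from PHP with a single statement executed in the database.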
I think you are overestimating the memory and system requirements given the process you have described. Adding a row of data every second (or 11 per second) is not a resource hog. In fact, an UPDATE is likely more time-consuming than adding a new row. Also, if you add a TIMESTAMP column to your table, sort operations are lightning fast. Just add some garbage-collection handling (deletion of old data) as a cron job once a day or so and you are golden.
However to answer your questions:
Is it possible to do the insertion/updating part directly in the database server, or do I have to do the logic in the PHP script?
Writing logic from within the database engine is usually not very straightforward. To keep it simple, stick with the logic in the PHP script, e.g. UPDATE table SET var1='assignment1', var2='assignment2' WHERE id='checkedID' (or an equivalent INSERT INTO for new rows).
Is there a better (more efficient) way to store the information without a decrease in performance in the display device?
It's hard to answer because you haven't described the display device connectivity. There are more efficient ways to do the process however none that have locking mechanisms required for such frequent updating.
Regarding report generation: if I want to sample intervals, let's say starting from yesterday at 15:50:00 and ending today at 12:45:00, it cannot be done with my current data structure. What do I need to consider in order to make a data structure which would allow me to create such queries?
You could use a TIMESTAMP column type. This records the date and time of the insert or update. Then it's just a simple WHERE clause using date functions in the database query.
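As a sketch of the report query once rows carry a TIMESTAMP: in SQL it is just `WHERE recorded_at >= '2013-04-01 15:50:00' AND recorded_at < '2013-04-02 12:45:00'`. The same half-open-window logic in Python (the `recorded_at` column name and the sample values are assumptions):

```python
from datetime import datetime

def in_window(rows, start, end):
    """Select readings whose timestamp falls in [start, end); rows are
    (recorded_at, value) pairs, standing in for per-second table rows."""
    return [r for r in rows if start <= r[0] < end]

rows = [
    (datetime(2013, 4, 1, 15, 49, 59), 10),
    (datetime(2013, 4, 1, 15, 50, 0), 11),
    (datetime(2013, 4, 2, 12, 44, 59), 12),
    (datetime(2013, 4, 2, 12, 45, 0), 13),
]
window = in_window(rows, datetime(2013, 4, 1, 15, 50), datetime(2013, 4, 2, 12, 45))
```

The half-open convention (>= start, < end) keeps adjacent report windows from double-counting the boundary second.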
I certainly can't solve this problem by myself, after many days of trying already. This is the problem:
We need to display information on the screen (HTML) that is being generated in real time inside a PHP file.
The PHP is performing very active crawling, returning huge arrays of URLs. Each URL needs to be displayed in real time in the HTML as soon as the PHP captures it; that's why we are using the ob_flush() and flush() methods to echo and print the arrays as soon as we get them.
Meanwhile we need to display this information somehow so the users can see it while it works (since it could take more than one hour until it finishes).
As far as I understand, it's not possible with AJAX, since we need to make only one request and read the information inside the array. I'm not totally sure whether Comet can do something like this either, since it would interrupt the connection as soon as it gets new information, and the array is growing in size really rapidly.
Additionally, just to make things more complex, there's no real need to print or echo the information (URLs) inside the array, since the HTML file is being included as the user interface of the same file that is processing and generating the array that we need to display.
Long story short; we need to place here:
<ul>
<li></li>
<li></li>
<li></li>
<li></li>
<li></li>
...
</ul>
A never-ending, real-time-updated list of URLs being generated and pushed into an array, 1,000 lines below, in a PHP loop.
Any help would be really more than appreciated.
Thanks in advance!
Try web-sockets.
They offer real-time communication between client and server, and using socket.io provides cross-browser compatibility. It basically gives you the same results as long-polling/Comet, but there is less overhead between requests, so it's faster.
In this case you would use web sockets to send updates to the client about the current status of the processing (or whatever it was doing).
See: Using PHP with Socket.io
Suppose you used a scheme where PHP was writing to a Memcached server.
Each key you write is named rec1, rec2, rec3, and so on.
You also store a current_min and a current_max.
You have the user constantly polling with AJAX. Each request includes the last key they saw; call this k. The server then returns all the records from k to max.
If no records are immediately available, the server goes into a wait loop for a maximum of, say, 3 seconds, checking for new records every 100 ms.
If records become available, they are immediately sent.
Whenever the client receives updates or the connection is terminated, they immediately start a new request...
Writing a new record is just a matter of inserting max+1 and incrementing min and max, where max-min is the number of records you want to keep available.
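A runnable sketch of that scheme, with a dict standing in for the Memcached server and illustrative names throughout:

```python
import time

class RecordWindow:
    """Sketch of the rec1..recN scheme: current_min/current_max bound the
    keys still available, and old keys are evicted as new ones arrive."""
    def __init__(self, keep=100):
        self.store, self.keep = {}, keep
        self.min, self.max = 1, 0

    def write(self, value):
        self.max += 1
        self.store[f"rec{self.max}"] = value
        while self.max - self.min + 1 > self.keep:   # evict beyond the window
            self.store.pop(f"rec{self.min}", None)
            self.min += 1

    def read_after(self, k, timeout=3.0, poll=0.1):
        """Return (new_max, records in (k, max]); wait up to `timeout` if empty."""
        deadline = time.time() + timeout
        while True:
            if self.max > k:
                start = max(k + 1, self.min)
                return self.max, [self.store[f"rec{i}"]
                                  for i in range(start, self.max + 1)]
            if time.time() >= deadline:
                return k, []
            time.sleep(poll)

w = RecordWindow(keep=3)
for url in ["u1", "u2", "u3", "u4"]:
    w.write(url)
last, batch = w.read_after(0)
```

Each AJAX response would return `last` to the client, which passes it back as k on the next request, so every record is delivered exactly once per client.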
An alternative to web sockets is COMET
I wrote an article about this, along with a follow-up describing my experiences.
COMET in my experience is fast. Web sockets are definitely the future, but if you're in a situation where you just need to get it done, you can have COMET up and running in under an hour.
Definitely some sort of shared memory structure is needed here - perhaps an in-memory temp table in your database, or Memcached as stephen already suggested.
I think the best way to do this would be to have the first PHP script save each record to a database (MySQL or SQLite perhaps), and then have a second PHP script which reads from the database and outputs the newest records. Then use AJAX to call this script every so often and add the records it sends to your table. You will have to find a way of triggering the first script.
The javascript should record the id of the last url it already has, and send it in the AJAX request, then PHP can select all rows with ids greater than that.
If the number of URLs is so huge that you can't store a database that large on your server (one might ask how a browser is going to cope with a table as large as that!) then you could always have the PHP script which outputs the most recent records delete them from the database as well.
Edit: When doing a lot of MySQL inserts there are several things you can do to speed it up. There is an excellent answer here detailing them. In short use MyISAM, and enter as many rows as you can in a single query (have a buffer array in PHP, which you add URLs to, and when it is full insert the whole buffer in one query).
If I were you, I'd try to solve this in two ways.
First of all, I would encode the output array as JSON, and with JavaScript's setTimeout function I'd decode it and append it to <ul id="appendHere"></ul>, so when the list is updated it will automatically update itself, like a cron job in JS.
Secondly, if you can't produce output while processing, then I think inserting the data into MySQL is pointless; use MongoDB or the like to increase speed.
By the way, you'll get what you need with your key, and you'll never duplicate an inserted value.
Brief overview of my use case: consider a database (most probably MongoDB) having a million entries. The value of each entry needs to be updated every day by calling an API. How do you design such a cron job? I know Facebook does something similar. The only thing I can think of is to have multiple jobs which divide the database entries into batches, with each job updating a batch. I am certain there are smarter solutions out there. I am also not sure what technology to use. Any advice is appreciated.
-Karan
Given the updated question context of "keeping the caches warm", a strategy of touching all of your database documents would likely diminish rather than improve performance unless that data will comfortably fit into available memory.
Caching in MongoDB relies on the operating system behaviour for file system cache, which typically frees cache by following a Least Recently Used (LRU) approach. This means that over time, the working data set in memory should naturally be the "warm" data.
If you force data to be read into memory, you could be loading documents that are rarely (or never) accessed by end users, potentially at the expense of data that may actually be requested more frequently by the application users.
There is a use case for "prewarming" the cache, for example when you restart a MongoDB server and want to load data or indexes into memory.
In MongoDB 2.2, you can use the new touch command for this purpose.
Other strategies for prewarming are essentially doing reverse optimization with an explain(). Instead of trying to minimize the number of index entries (nscanned) and documents (nscannedObjects), you would write a query that intentionally will maximize these entries.
With your API response-time goal: even if someone's initial call required their data to be fetched into memory, that should still be a reasonably quick indexed retrieval. A goal of a 3-to-4-second response seems generous unless your application has a lot of processing overhead; the default "slow" query threshold in MongoDB is 100 ms.
From a technical standpoint, you can write scripts for the mongo shell and execute them via cron. If you schedule cron to run a command like:
./mongo server:27017/dbname --quiet my_commands.js
MongoDB will execute the contents of the my_commands.js script. Now, for an overly simple example just to illustrate the concept: if you wanted to find a person named Sara and add an attribute (yes, an unrealistic example), you could put the following in your .js script file.
person = db.people.findOne( { name : "sara" } );
person.validated = "true";
db.people.save( person );
Then every time the cron job runs, that record will be updated. Now add a loop and a call to your API, and you might have a solution. More information on these commands, with examples, can be found in the MongoDB docs.
However, from a design perspective: are you sure you need to update every single record each night? Is there a way to identify a more reasonable subset of records that needs to be processed? Or can the API possibly be called on the data as it's retrieved and served to whoever is going to consume it?