I have a script that transfers about 1.5 million rows (~400 MB of data) from one table to another (during this process, some data is converted, modified, and placed in the correct field). It's a simple script: it recursively loads data, then places it in the new tables under the correct fields and formats. The script works by (as an example) pulling all of the users from the table, then looping through the users and inserting them into the new table, then pulling all of the posts for that user, looping through and inserting them into the correct table, then pulling all of the comments for a post and inserting those, then jumping back up and pulling all of the contacts for that user, and finally moving on to the next user, where the same process repeats.
I'm just having a problem with the immense amount of data being transferred. Because it is so large, and because there isn't any sort of memory management in PHP besides garbage collection (that I know of), I'm unable to complete the script: it gets through about 15,000 connections and items transferred before it maxes out at 200 MB of memory.
This is a one time thing, so I'm doing it on my local computer, not an actual server.
Since unset() does not actually free up memory, is there any other way to free up the data held in a variable? One thing I attempted was overwriting the variable with NULL, but that didn't seem to help.
Any advice would be awesome, because man, this stinks.
If you're actually doing this recursively then that's your problem - you should be doing it iteratively. Recursive processing leaves overhead (plus garbage) every time the next call is made, so eventually you hit the limit. An iterative approach doesn't have that problem, and garbage should be collected as you go.
You're also talking about a mind-numbing number of connections - why are there so many? I guess I don't completely understand your process, and why this approach is needed rather than one retrieve connection and one store connection. Even if you were, say, reconnecting for each row, you should look at using persistent connections, which allow the second connection to the same DB to reuse the last one. Persistent connections aren't a great idea for a multi-user web app (for scalability reasons), but in your very targeted case they should be fine.
unset() does free up memory, but only if the object you're unsetting has no other references pointing to it. Since PHP uses reference counting rather than 'real' GC, this can bite you if you have circular references somewhere - a typical culprit is inside an ORM, where you often have a Database object that holds references to some Table objects, and each Table object has a reference back to the Database. Even if no outside reference exists to either object, they both still reference each other, preventing the reference count from hitting zero.
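For illustration, here is a minimal sketch of such a cycle - the Database/Table classes below are stand-ins, not real ORM code, and note that PHP 5.3+ does ship a cycle collector that can be invoked explicitly:

class Database {
    public $tables = [];               // Database holds references to its Table objects
}

class Table {
    public $db;
    public function __construct(Database $db) {
        $this->db = $db;               // ...and each Table points back at the Database
    }
}

$db = new Database();
$db->tables[] = new Table($db);
unset($db);                            // the refcounts never reach zero, so the pair is not freed right away

gc_collect_cycles();                   // PHP 5.3+ cycle collector can reclaim such structures explicitly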
Also, are both tables on the same database? If so, all you need might be a simple INSERT ... SELECT query, mapping columns and doing a bit of conversion on the fly (although the processing you need to perform might not be possible or feasible in SQL).
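For example, if the conversion can be expressed in SQL, something along these lines moves the rows without ever pulling them into PHP. The table, column, and function choices here are invented for illustration, and a PDO connection $pdo is assumed:

$pdo->exec("
    INSERT INTO new_users (id, full_name, created_at)
    SELECT id, CONCAT(first_name, ' ', last_name), FROM_UNIXTIME(created)
    FROM old_users
");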
Other than that, you don't need that many connections. Just open one for the reader, one for the writer; prepare a statement on the writer, execute the reader query, fetch one row at a time (this is important: do not fetch them all at once) from the reader query, do the processing, stuff it in the prepared writer statement, rinse and repeat. PHP's memory usage should remain roughly constant after the first few rows.
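A rough sketch of that pattern with PDO - the connection details, table, and column names are placeholders and the conversion step is only illustrative; the unbuffered-query attribute is what keeps the reader from loading the whole result set into memory at once:

$reader = new PDO('mysql:host=localhost;dbname=old_db', 'user', 'pass', [
    PDO::MYSQL_ATTR_USE_BUFFERED_QUERY => false,   // stream rows instead of buffering the full result set
]);
$writer = new PDO('mysql:host=localhost;dbname=new_db', 'user', 'pass');

$insert = $writer->prepare('INSERT INTO new_users (id, name, email) VALUES (?, ?, ?)');

$rows = $reader->query('SELECT id, name, email FROM old_users');
while ($row = $rows->fetch(PDO::FETCH_ASSOC)) {    // fetch one row at a time
    // ...convert / modify the fields as needed here...
    $insert->execute([$row['id'], trim($row['name']), strtolower($row['email'])]);
}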
I'm having a problem retrieving documents from a MongoDB collection immediately after inserting them. I'm creating documents, then running a query (that cannot be determined in advance) to get a subset of all documents in the collection. The problem is that some or all of the documents I inserted aren't included in the results.
The process is:
Find the timestamp of the most recent record
Find transactions that have taken place since that time
Generate records for those transactions and insert() each one (this can and will become a single bulk insert)
find() some of the records
The documents are always written successfully, but more often than not the new documents aren't included when I run the find(). They are available after a few seconds.
I believe the new documents haven't propagated to all members of the replica set by the time I try to retrieve them, though I am suspicious that this may not be the case as I'm using the same connection to insert() and find().
I believe this can be solved with a write concern, but I'm not sure what value to specify to ensure that the documents have propagated to all members of the replica set, or at least the member that will be used for the find() operation if it's possible to know that in advance.
I don't want to hard code the total number of members, as this will break when another member is added. It doesn't matter if the insert() operations are slow.
Read preference
When you write to a collection, it's a good practice to set the readPreference to "primary" to make sure you're reading from the same MongoDB server that you've written to.
You do that with the MongoCollection::setReadPreference() method.
$db->mycollection->setReadPreference(MongoClient::RP_PRIMARY);
$db->mycollection->insert(['foo' => 'bar']);
$result = $db->mycollection->find([]);
Write concern (don't do it!)
You might be tempted to use a write concern to wait for the data to be replicated to all secondaries by using w=3 (for a 3-server setup). However, this is not the way to go.
One of the nice things about MongoDB replication is that it will do automatic failover. In that case you might have fewer than 3 servers that can accept the data, causing your script to wait forever.
There is no w=all to write to all servers that are up, and using such a write concern wouldn't be a good idea anyway. A secondary that has just recovered from a failover might be hours behind and take a long time to catch up; your script would wait (hang) until all secondaries are caught up.
A good practice is never to use w=N with N > majority outside of administrative tasks.
Basically you are looking for a write concern, which (in layman's terms) allows you to specify when an insert is considered finished.
In PHP this is done by providing an option in the insert statement, so you need something like:
w=N (Replica Set Acknowledged): the write will be acknowledged by the primary server, and replicated to N-1 secondaries.
or, if you do not want to hard-code N:
w= (Replica Set Tag Set Acknowledged): the write will be acknowledged by members of the entire tag set.
$collection->insert($someDoc, ["w" => 3]);
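If you would rather not hard-code the member count (as the question asks), the driver also accepts the string "majority" as a write concern, which waits for a majority of the replica set rather than a fixed number - a sketch, assuming the same $collection and $someDoc as above:

$collection->insert($someDoc, ["w" => "majority"]);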
I have a program that creates logs, and these logs are used to calculate balances, trends, etc. for each individual client. Currently, I store everything in separate MySQL tables. I link all the logs to a specific client by joining the two tables. When I access a client, it pulls all the logs from the log_table and generates a report. The report varies depending on what filters are in place, mostly date- and category-specific.
My concern is the performance of my program as we accumulate more logs and clients. My intuition tells me to store the log information in the user_table in the form of a serialized array so only one query is used for the entire session. I can then take that log array and filter it using PHP, whereas before it was filtered in a MySQL query (using multiple methods, such as BETWEEN for dates and other comparisons).
My question is, do you think performance would be improved if I used serialized arrays to store the logs as opposed to using a MySQL table to store each individual log? We are estimating about 500-1000 logs per client, with around 50,000 clients (and growing).
It sounds like you don't understand what makes databases powerful. It's not about "storing data", it's about "storing data in a way that can be indexed, optimized, and filtered". You don't store serialized arrays, because the database can't do anything with that. All it sees is a single string without any structure that it can meaningfully work with. Using it that way voids the entire reason to even use a database.
Instead, figure out the schema for your array data, and then insert your data properly, with one field per dedicated table column so that you can actually use the database as a database, allowing it to optimize its storage, retrieval, and database algebra (selecting, joining and filtering).
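As a sketch of what that might look like - the column names and types here are invented, since the actual log fields aren't shown, and a PDO connection $pdo is assumed - with indexes matching the date and category filters mentioned in the question:

$pdo->exec("
    CREATE TABLE logs (
        id         INT UNSIGNED  NOT NULL AUTO_INCREMENT PRIMARY KEY,
        client_id  INT UNSIGNED  NOT NULL,
        category   VARCHAR(50)   NOT NULL,
        amount     DECIMAL(12,2) NOT NULL,
        created_at DATETIME      NOT NULL,
        INDEX idx_client_date (client_id, created_at),  -- supports the per-client date-range filters
        INDEX idx_client_cat  (client_id, category)     -- supports the per-client category filters
    )
");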
Are serialized arrays in a DB faster than native PHP? No, of course not: you've forced the database to act as a flat file, with the extra DBMS overhead on top.
Is using the database properly faster than native PHP? Usually, yes, by a lot.
Plus, and this part is important, it means that your database can live "anywhere", including on a faster machine next to your webserver, so that your database can return results in 0.1s rather than PHP pegging 100% CPU to filter your data and preventing users of your website from getting page results because you blocked all the threads. In fact, for that very reason it makes absolutely no sense to keep this task in PHP, even if you implement your schema and queries badly, forget to cache results and do subsequent searches inside those cached results, forget to index the tables on columns for extremely fast retrieval, and so on.
PHP is not for doing all the heavy lifting. It should ask other things for the data it needs, and act as the glue between "a request comes in", "response base data is obtained" and "response is sent back to the client". It should start up, make the calls, generate the result, and die as fast as it can again.
It really depends on how you need to use the data. You might want to look into storing it with Mongo if you don't need to search that data. If you do, leave it in individual rows and create your indexes in a way that makes lookups fast.
If you have 10 billion rows, and need to look up 100 of them to do a calculation, it should still be fast if you have your indexes done right.
Now if you have 10 billion rows and you want to do a sum on 10,000 of them, it would probably be more efficient to save that total somewhere. Whenever a new row is added, removed or updated that would affect that total, you can change that total as well. Consider a bank, where all items in the ledger are stored in a table, but the balance is stored on the user account and is not calculated based on all the transactions every time the user wants to check his balance.
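A sketch of that "bank ledger" idea in PHP/MySQL - the table and column names are made up, and a PDO connection $pdo plus $clientId and $amount values are assumed. Keeping the stored balance and the new row in the same transaction means they can never drift apart:

$pdo->beginTransaction();
$pdo->prepare('INSERT INTO ledger (client_id, amount, created_at) VALUES (?, ?, NOW())')
    ->execute([$clientId, $amount]);
$pdo->prepare('UPDATE clients SET balance = balance + ? WHERE id = ?')
    ->execute([$amount, $clientId]);
$pdo->commit();                        // the stored total is updated atomically with the new ledger row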
I have a MySQL table with about 9.5K rows; these won't change much, but I may slowly add to them.
I have a process where, if someone scans a barcode, I have to check if that barcode matches a value in this table. What would be the fastest way to accomplish this? I must mention there is no pattern to these values.
Here Are Some Thoughts
Ajax call to a PHP file to query the MySQL table (my thought is this would be the slowest)
Load this MySQL table into an array on login. Then when scanning, make an Ajax call to a PHP file to check the array.
Load this table into an array on login. When viewing the scanning page, somehow load that array into a JavaScript array and check with JavaScript. (This seems to me to be the fastest because it eliminates the Ajax call and MySQL query. Would it be efficient to split it into smaller arrays so I don't lag the server and browser?)
Honestly, I'd never load the entire table for anything. All I'd do is make an AJAX request back to a PHP gateway that then queries the database, and returns the result (or nothing). It can be very fast (as it only depends on the latency) and you can cache that result heavily (via memcached, or something like it).
There's really no reason to ever load the entire array for "validation"...
It's much faster to use a well-indexed MySQL table than to look through an array for something.
But in the end it all depends on what you really want to do with the data.
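For illustration, a minimal PHP gateway along the lines of the first answer above - the memcached location, database credentials, and the barcodes table/code column are assumptions:

// barcode_check.php - called via AJAX with ?code=...
$code = isset($_GET['code']) ? $_GET['code'] : '';

$memcached = new Memcached();
$memcached->addServer('127.0.0.1', 11211);

$cacheKey = 'barcode_' . $code;
$cached   = $memcached->get($cacheKey);

if ($cached === false) {                                   // cache miss: ask the database
    $pdo  = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
    $stmt = $pdo->prepare('SELECT 1 FROM barcodes WHERE code = ? LIMIT 1');
    $stmt->execute([$code]);
    $cached = $stmt->fetchColumn() ? 'yes' : 'no';
    $memcached->set($cacheKey, $cached, 300);              // remember the answer for 5 minutes
}

header('Content-Type: application/json');
echo json_encode(['valid' => $cached === 'yes']);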
As you mention, your table contains around 9.5K rows of data. There is no sense in loading the data on login or on the scanning page.
Better to index your table and do an Ajax call whenever required.
Best of Luck!!
While 9.5K rows are not that many, the corresponding amount of data would still need some time to transfer.
Therefore - and in general - I'd propose to run validation of values on the server side. AJAX is the right technology to do this quite easily.
Loading all 9.5K rows only to find one specific row is definitely a waste of resources. Run a SELECT query for the single value.
Exposing PHP functionality on the client side / AJAX
Have a look at the xajax project, which allows you to expose whole PHP classes or single methods as AJAX methods on the client side. Moreover, xajax helps with the exchange of parameters between client and server.
Indexing the attributes to be searched
Please ensure that the column which holds the barcode value is indexed. If the verification process tends to be slow, look out for MySQL table scans.
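For example, assuming the table and column are called barcodes and code (and a PDO connection $pdo), the index can be added with:

$pdo->exec('ALTER TABLE barcodes ADD INDEX idx_code (code)');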
Avoiding table scans
To avoid table scans and keep your queries fast, use fixed-size fields. E.g. VARCHAR(), among other types, makes queries slower, since rows no longer have a fixed size. Tables without fixed-size rows effectively prevent the database from easily predicting the location of the next row of the result set. Therefore, use e.g. CHAR(20) instead of VARCHAR().
Finally: Security!
Don't forget that any data transferred to the client side may expose sensitive information. While your 9.5K rows may not get rendered by the client's browser, the rows do exist in the generated HTML page. Using "view source", any user would be able to figure out all valid numbers.
Exposing valid barcode values may or may not be a security problem in your project context.
PS: While not related to your question, I'd propose using PHPExcel for reading or writing spreadsheet data. Unlike other solutions, e.g. a PEAR-based framework, PHPExcel depends on nothing.
I'm concerned about my page loading speed; I know there are a lot of factors that affect page loading time.
Is retrieving records (categories) from an array instead of the DB faster?
Thanks
It is faster to keep it all in PHP till you have an absurd amount of records and you use up RAM.
BUT, both of these things are super fast. Selecting a handful of records on a single table that has an index should take less than a msec. Are you sure that you know the source of your web page slowness?
I would be a little bit cautious about having your data in your code. It will make your system less maintainable. How will users change categories?
This gets back to deciding whether you want your site to be static or dynamic.
Yes, of course retrieving data from an array is much faster than retrieving data from a database, but usually arrays and databases have totally different use cases, because data in an array is static (you type the values in code or in a separate file and you can't modify them) while data in a database is dynamic.
Yes, it's probably faster to have an array of your categories directly in your PHP script, especially if you need all the categories on every page load. This makes it possible for APC to cache the array (if you have APC running), and also lessens the traffic to/from the database.
But is this where your bottleneck is? It seems to me that the categories should already be cached in the query cache and therefore be easily retrieved. If this is not your biggest bottleneck, chances are you won't see any decrease in loading times. Make sure to profile your application to find the large bottlenecks, or you will waste your time on getting only small performance gains.
If you store categories in a database, you have to connect to the database, prepare a SQL statement, send it to the server, fetch the result set, and (probably) store the results in an array. (But you'll probably already have a connection to the database anyway, and hardware and software are designed to do this kind of work quickly.)
Storing and retrieving categories from a database trades speed for maintenance. Your data is always up to date; it might take a little longer to get it.
You can also store categories as constants or as literals in an array assignment. It would be smart to generate the constants or the array literals from data stored in the database, but you might not have to do that for every page load. If "categories" doesn't change much, you might be able to get away with generating the code once or twice a day, plus whenever someone adds a category. It depends on your application.
Storing and retrieving categories from an array trades maintenance for speed. Your data loads a little faster; it might be incomplete.
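A rough sketch of the generation step described above - the file name, credentials, and the categories table are assumptions:

// regenerate_categories.php - run from cron, or after someone edits a category
$pdo  = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$rows = $pdo->query('SELECT id, name FROM categories')->fetchAll(PDO::FETCH_KEY_PAIR);

// Write the array literal to a PHP file that pages can simply include (and APC can cache).
file_put_contents(
    'categories.generated.php',
    "<?php\nreturn " . var_export($rows, true) . ";\n"
);

// A page then loads it with:  $categories = include 'categories.generated.php';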
The unsatisfying answer is that you're not going to be able to tell how different storage and page generation strategies affect page loading speed until you test them. And even testing isn't that easy, because the effect of changing server and database parameters can be, umm, surprising.
(You can also generate static pages from the database using php. I suggest you test some static pages to give you an idea of "best case" performance.)
Within a PHP/MySQL system we have a number of configuration values (approx. 200); these are mostly booleans or ints and store things such as the number of results per page and whether pages are 2 or 3 columns. This is all stored in a single MySQL table, and we use a single function to return these values as they are requested; on certain page loads there can probably be up to around 100 requests to this config table. With the number of sites using this system, this means potentially thousands of requests each second to retrieve these values. The question is whether this method makes sense, or whether it would be preferable to perform a single request per page, store all the configs in an array, and retrieve them from there each time instead.
Use a cache such as memcache, APC, or any other. Load the settings once, cache them, and share them across your sessions with a singleton object.
Even if the query is served from the query cache, it's a waste of time and resources to query the database over and over. Instead, on any request that modifies the values, invalidate the in-memory cache so it is reloaded the next time someone requests a value from it.
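A minimal sketch of that approach using APC - the cache key, TTL, and config table layout are assumptions, and on newer PHP the apcu_* functions are the equivalent:

class Config
{
    private static $values = null;                 // shared within the current request (singleton-style)

    public static function get($name)
    {
        if (self::$values === null) {
            self::$values = apc_fetch('site_config');              // shared across requests
            if (self::$values === false) {
                $pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
                self::$values = $pdo->query('SELECT name, value FROM config')
                                    ->fetchAll(PDO::FETCH_KEY_PAIR);
                apc_store('site_config', self::$values, 300);      // fall back to the DB at most every 5 minutes
            }
        }
        return isset(self::$values[$name]) ? self::$values[$name] : null;
    }

    public static function invalidate()
    {
        apc_delete('site_config');                 // call this from any request that changes a value
        self::$values = null;
    }
}

A page would then call something like Config::get('results_per_page') instead of hitting the config table directly.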
If you enable the MySQL query cache, the query that selects your values will be cached in memory, and MySQL will give an instant answer from memory unless the query or the data in the underlying tables change.
This is excellent both for performance and for manageability.
The query results may be reused between sessions: that means if you have 1,000 sessions, you don't need to keep 1,000 copies of your data.
You might want to consider using memcache for this. I think it would be faster than multiple DB queries (even with query caching on), and you won't need a database connection to get them.
You might want to consider just loading them from a flat file into memory; this affords you the opportunity to version-control your config values.
I would definitely recommend memcache for this. We have a similar setup and it has noticeably brought resource usage down on that server.