Doctrine entity manager and multiple threads updating database - php

I currently have an XHR request that can fire off N times from the client. These requests are handled by the server, and each request typically creates a new row in the database (all Doctrine / XML mappings).
Before I persist() the object, I make sure it has a unique filename (I am uploading assets). I do this by overriding persist(), calling my getUniqueFilename() and then calling parent::persist().
I get a race condition when I perform multiple XHR requests w/ the same filename. What happens is that multiple threads run at the exact same time and each checks the database for duplicates in order to generate the unique filename:
Upload file
Check database if filename exists
Increment filename (eg: filename_1)
However, when multiple XHR requests run in parallel, a race condition occurs and multiple files are inserted into the database w/ the same name (filename_1 is generated multiple times).
I think the way to solve this is either
A mysql trigger
Adding a unique constraint to the table for filename, and wrapping in try/catch in code
What would you do?

Adding a unique constraint is the safest way to ensure consistent data. Anything at the PHP level could have some race conditions unless you have some other forms of locking, which will be less efficient.
You could also avoid the problem by making the filenames unique using other attributes, such as the user they came from, or by keeping version history, which would simply make it look like a new version was uploaded at almost the same time.
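For illustration, a minimal sketch of the constraint-plus-catch approach, assuming Doctrine 2 with a unique constraint already defined on the filename column in the XML mapping; the entity and helper names are made up, and older DBAL versions may only expose the error as a generic exception with SQLSTATE 23000:

<?php
use Doctrine\DBAL\Exception\UniqueConstraintViolationException;

$asset = new Asset();                      // hypothetical entity
$asset->setFilename($uniqueName);          // name produced by getUniqueFilename()

try {
    $em->persist($asset);
    $em->flush();
} catch (UniqueConstraintViolationException $e) {
    // Another request inserted the same filename between our check and the
    // flush. Note: after a failed flush the EntityManager is closed, so to
    // retry with an incremented name you must recreate/reset the EM first.
    handleDuplicateFilename($asset);       // hypothetical retry/error path
}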

I would suggest using a different strategy: use the MySQL auto_increment to get an id, save the asset to the database, retrieve the id and then add it to the filename. All the other ways have drawbacks where you have to perform partial rollbacks, handle duplicated filenames etc.
I would also suggest not using the original filename for storing the object: you run into trouble with forbidden characters on different operating systems as well as character encoding, and possible duplicates for some reason (e.g. because the database is case-sensitive where the file system is not). There may be other drawbacks, like maximum file name length, which you might not be aware of right now.
My solution is simply using the MySQL auto increment as the filename. If you think about it, it makes sense: the auto increment is used as a unique identifier. If you make sure to only store objects from one table in one folder, you have no problems identifying different assets, filenames etc.
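A rough sketch of this id-as-filename strategy with Doctrine, assuming an auto-generated identifier and an uploaded temp file; the entity, method and path names below are all illustrative:

<?php
$asset = new Asset();                                   // hypothetical entity
$asset->setOriginalFilename($_FILES['file']['name']);   // keep for display only
$asset->setPath('pending');                             // placeholder until we know the id

$em->persist($asset);
$em->flush();                                           // auto_increment id is assigned here

// Build the real filename from the generated id, not from user input.
$ext      = pathinfo($_FILES['file']['name'], PATHINFO_EXTENSION);
$filename = $asset->getId() . ($ext !== '' ? '.' . strtolower($ext) : '');

move_uploaded_file($_FILES['file']['tmp_name'], '/data/assets/' . $filename);

$asset->setPath($filename);                             // store the final name
$em->flush();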
If you insist on keeping your approach, you could make the filename unique in the database and then retry on a failing flush, as you suggested.

Related

Issues with storing images in filesystem vs DB

There are several questions with excellent answers on SO regarding the quintessential BLOB vs filesystem question. However, none of them seem to represent my scenario so I'm asking this.
Say there's a social network (a hypothetical scenario, of course) where users are all free to change anyone's profile picture. And each user's profile is stored in a MySQL table with the following schema:
ID [unsigned int, primary]
USERNAME [varchar(20)]
PROFILENAME [varchar(60)]
PROFILEPIC [blob]
Now, here's the thing: What if I want to store profile images as files on the server instead of BLOBs in the db? I can understand there will have to be some kind of naming convention to ensure all files have a unique name that also maps it to the primary key on the table for easy access. So, say, the primary key is the filename for the corresponding image stored on the disk. But in my context there could be simultaneous read/writes, and quite a lot of them. MySQL would typically handle that with no problems since it locks out the row while it's being updated. But how does one handle such situations in a filesystem model?
In your application layer, you could lock the block that does DB transaction and file IO to alleviate concurrency issues (lock example in C#).
Within this block, run your inserts/updates/deletes in a transaction. Follow that with adding/replacing/deleting the photo on disk. Let's write some pseudo-code:
lock (obj)
{
    connection.StartTransaction();

    connection.PerformAction();        // DB insert/update/delete
    if failed, rollback and return false;

    photoMgmt.PerformAction();         // add/replace/delete the photo on disk
    if failed, rollback and return false;

    connection.CommitTransaction();
}
Apply a similar technique with PHP; additionally, use flock() to perform file locking.
In other words, commit to DB after committing to filesystem. If either DB or filesystem operation fails, perform cleansing so no change is saved.
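A rough PHP counterpart of that pseudo-code, assuming PDO/MySQL (InnoDB), the flock() approach mentioned above, and invented table/variable names:

<?php
$lock = fopen('/tmp/profilepic.lock', 'c');   // any writable lock file
if (!flock($lock, LOCK_EX)) {                 // serialize concurrent writers
    exit('could not acquire lock');
}

try {
    $pdo->beginTransaction();

    $stmt = $pdo->prepare('UPDATE users SET profile_pic = :file WHERE id = :id');
    $stmt->execute([':file' => $newName, ':id' => $userId]);

    // Write the file only after the row update succeeded ...
    if (!move_uploaded_file($_FILES['pic']['tmp_name'], $photoDir . $newName)) {
        throw new RuntimeException('file write failed');
    }

    // ... and commit to the DB last, as suggested above.
    $pdo->commit();
} catch (Exception $e) {
    if ($pdo->inTransaction()) {
        $pdo->rollBack();                     // undo the DB change
    }
    @unlink($photoDir . $newName);            // cleanse any partial file
} finally {
    flock($lock, LOCK_UN);
    fclose($lock);
}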
I'd use bigint ID as the primary key and GUID filenames on disk. If users preferred the application to hold the name they provided, I'd create a field called user_filename to store the filename provided by the user, and for all other purposes I'd use the GUID.
Hopefully this will provide some direction.

Race conditions with Apache, MySQL and PHP

My understanding is that Apache creates a separate PHP process for each incoming request. That means that if I have code that does something like:
check if record exists
if record doesn't exist, create it
Then this is susceptible to a race condition, is it not? If two requests come in at the same time, and they both hit (1) simultaneously, they will both come back false, and then both attempt to insert a new record.
If so, how do people deal with this? Would creating a MySQL transaction around those 2 requests solve the issue, or do we need to do a full table lock?
As far as I know you cannot create a transaction across different connections. One solution would be to make the column you are checking unique. That way, if two connections both check for record 10 and it does not exist, they will both try to create it. One will finish inserting the row first, and all is well; the connection just a second behind will fail because the column isn't unique. If you catch the exception that is thrown, you can subsequently SELECT the record from the database.
Honestly, I've very rarely run into this situation. Often it can be alleviated by re-evaluating the business requirements. Even if two different users were trying to insert the exact same data, I would defer management of duplicates to the users rather than the application.
However, if there were a reason to enforce a unique constraint in the application logic, I would use an INSERT IGNORE... ON DUPLICATE KEY UPDATE... query (with the corresponding UNIQUE index in the table, of course).
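As an illustration, a sketch of the ON DUPLICATE KEY UPDATE variant via PDO (table and column names are invented, and a UNIQUE index on `name` is assumed):

<?php
// Either inserts the row or, if the unique key already exists, just touches it -
// no race, because the database enforces uniqueness atomically.
$stmt = $pdo->prepare(
    'INSERT INTO items (name, created_at)
     VALUES (:name, NOW())
     ON DUPLICATE KEY UPDATE updated_at = NOW()'
);
$stmt->execute([':name' => $name]);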
I think that handling errors on the second step ought to be sufficient. If two processes try to create a record then one of them will fail, as long as you've configured the MySQL table appropriately. Using UNIQUE across the right fields is one way to do the trick.
Apache does not "create a separate PHP process for each incoming request".
It either uses a pool of processes (default, prefork mode), or threads.
The race conditions you mention may also be referred to as (or cause) DB "deadlocks".
(See: what is a deadlock in a database?)
Using transactions where needed should solve this problem, yes.
By making sure you check if a record exists and create it within a transaction, the whole operation is atomic.
Hence, other requests will not try to create duplicate records (or, depending on the actual queries, create inconsistencies or enter actual deadlocks).
Also note that MySQL does not (yet) support nested transactions:
You cannot have transactions inside transactions, as the first commit will commit everything.
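To illustrate the transactional check-and-create described above, a sketch assuming InnoDB and PDO (table and column names are invented); a UNIQUE index on the checked column is still a sensible safety net:

<?php
$pdo->beginTransaction();
try {
    // Locks the matching row (or the gap where it would go) until commit.
    $stmt = $pdo->prepare('SELECT id FROM records WHERE code = :code FOR UPDATE');
    $stmt->execute([':code' => $code]);
    $id = $stmt->fetchColumn();

    if ($id === false) {
        $ins = $pdo->prepare('INSERT INTO records (code) VALUES (:code)');
        $ins->execute([':code' => $code]);
        $id = $pdo->lastInsertId();
    }

    $pdo->commit();
} catch (PDOException $e) {
    $pdo->rollBack();
    // Concurrent inserts into the same gap can still produce a deadlock or,
    // with a UNIQUE index, a duplicate-key error; retrying the whole block
    // once is usually enough.
    throw $e;
}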

PHP Array efficiency vs mySQL query

I have a MySQL table with about 9.5K rows; these won't change much, but I may slowly add to them.
I have a process where, if someone scans a barcode, I have to check whether that barcode matches a value in this table. What would be the fastest way to accomplish this? I must mention there is no pattern to these values.
Here Are Some Thoughts
Ajax call to a PHP file to query the MySQL table (my thought is this would be the slowest)
Load this MySQL table into an array on login. Then, when scanning, make an Ajax call to a PHP file to check the array
Load this table into an array on login. When viewing the scanning page, somehow load that array into a JavaScript array and check with JavaScript. (This seems to me to be the fastest because it eliminates the Ajax call and MySQL query. Would it be efficient to split it into smaller arrays so I don't lag the server & browser?)
Honestly, I'd never load the entire table for anything. All I'd do is make an AJAX request back to a PHP gateway that then queries the database, and returns the result (or nothing). It can be very fast (as it only depends on the latency) and you can cache that result heavily (via memcached, or something like it).
There's really no reason to ever load the entire array for "validation"...
Much faster to use a well-indexed MySQL table than to look through an array for something.
But in the end it all depends on what you really want to do with the data.
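A minimal sketch of such a gateway, assuming an existing PDO connection plus the Memcached extension; the endpoint, table and column names are all invented:

<?php
// barcode_check.php - look the scanned barcode up in MySQL and return a tiny
// JSON answer; cache the result so repeated scans skip the query entirely.
$barcode = isset($_GET['code']) ? trim($_GET['code']) : '';

$memcached = new Memcached();
$memcached->addServer('127.0.0.1', 11211);

$found = $memcached->get('barcode_' . $barcode);
if ($found === false) {                       // cache miss
    $stmt = $pdo->prepare('SELECT 1 FROM barcodes WHERE code = :code LIMIT 1');
    $stmt->execute([':code' => $barcode]);
    $found = $stmt->fetchColumn() ? 1 : 0;
    $memcached->set('barcode_' . $barcode, $found, 300);   // cache for 5 minutes
}

header('Content-Type: application/json');
echo json_encode(['exists' => (bool) $found]);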
As you mention, your table contains around 9.5K rows. There is no reason to load that data on the login or scanning page.
Better to index your table and make an AJAX call whenever required.
Best of Luck!!
While 9.5 K rows are not that much, the related amount of data would need some time to transfer.
Therefore - and in general - I'd propose to run validation of values on the server side. AJAX is the right technology to do this quite easily.
Loading all 9.5K rows only to find one specific row is definitely a waste of resources. Run a SELECT query for the single value.
Exposing PHP-functionality at the client-side / AJAX
Have a look at the xajax project, which allows you to expose whole PHP classes or single methods as AJAX methods on the client side. Moreover, xajax helps with the exchange of parameters between client and server.
Indexing to be searched attributes
Please ensure, that the column, which holds the barcode value, is indexed. In case the verification process tends to be slow, look out for MySQL table scans.
Avoiding table scans
To avoid table scans and keep your queries fast, use fixed-size fields. E.g. VARCHAR(), among other types, makes queries slower, since rows no longer have a fixed size; without fixed-size rows the database cannot easily predict the location of the next row of the result set. Therefore, use e.g. CHAR(20) instead of VARCHAR().
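As a concrete illustration of the indexing advice above (hypothetical table and column names, reusing the barcode example):

<?php
// One-off: index the barcode column so lookups don't scan all 9.5K rows.
$pdo->exec('ALTER TABLE barcodes ADD UNIQUE INDEX idx_code (code)');

// EXPLAIN should now report the index instead of a full table scan:
// EXPLAIN SELECT 1 FROM barcodes WHERE code = '4006381333931';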
Finally: Security!
Don't forget that any data transferred to the client side may expose sensitive data. While your 9.5K rows may not get rendered by the client's browser, the rows do exist in the generated HTML page. Using "view source", any user would be able to figure out all valid numbers.
Exposing valid barcode values may or may not be a security problem in your project context.
PS: While not related to your question, I'd propose using PHPExcel for reading or writing spreadsheet data. Unlike other solutions, e.g. PEAR-based frameworks, PHPExcel has no dependencies.

How to implement semaphores in PHP without PHP Semaphore?

Question:
How can I implement a shared memory variable in PHP without the semaphore package (http://php.net/manual/en/function.shm-get-var.php)?
Context
I have a simple web application (actually a plugin for WordPress)
this gets a url
this then checks the database if that url already exists
if not then it goes out and does some operations
and then writes the record in the database with the url as unique entry
What happens in reality is that 4, 5, 6, ... sessions request the URL at the same time and I get up to 9 duplicate entries for that URL in the database (possibly 9 because the processing time and database write of the first entry take just enough time to let 9 other requests fall through). After that, all requests read the correct entry that the record already exists, so that is good.
Since it is a WordPress plugin there will be many users on all kind of shared hosting platforms with variable compiles/settings of PHP.
So I'm looking for a more generic solution. I can't use database or text file writes since these will be too slow: while I write to the DB, the next session will already have passed.
fyi: the database code: http://plugins.svn.wordpress.org/wp-favicons/trunk/includes/server/plugins/metadata_favicon/inc/class-favicon.php
update
Using a unique key on a new MD5 hash of the URI, together with try/catch around the insert, seems to work.
I found 1 duplicate entry with
SELECT uri, COUNT(uri) AS NumOccurrences
FROM edl40_21_wpfavicons_1
GROUP BY uri
HAVING COUNT(uri) > 1
LIMIT 0, 30
So I thought it did not work but this was because they were:
http://en.wikipedia.org/wiki/Book_of_the_dead
http://en.wikipedia.org/wiki/Book_of_the_Dead
(capitals grin)
This could be achieved with MySQL.
You could do it explicitly by locking the table from read access. This will prevent any read access to the entire table though, so it may not be preferable. http://dev.mysql.com/doc/refman/5.5/en/lock-tables.html
Otherwise, if the field in the table is defined as unique, then when the next session tries to write the same URL to the table it will get an error; you can catch that error and continue, as there's no need to do anything if the entry is already there. The only time wasted is when two or more sessions build the same URL; the result is still one record, as the database won't add the same unique URL again.
As discussed in the comments, because a URL can be very long, a fixed-length unique hash of it can help overcome that issue.
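A rough sketch of that approach in the WordPress context, assuming a table with a UNIQUE key on a `uri_hash` column; the table and column names are invented:

<?php
global $wpdb;
$table = $wpdb->prefix . 'wpfavicons_urls';
$hash  = md5($uri);                       // fixed 32-char value, safe to index

// suppress_errors() keeps the duplicate-key error out of the page output.
$wpdb->suppress_errors(true);
$inserted = $wpdb->insert($table, array('uri' => $uri, 'uri_hash' => $hash));
$wpdb->suppress_errors(false);

if ($inserted === false) {
    // Another request won the race - just read the existing row instead.
    $row = $wpdb->get_row(
        $wpdb->prepare("SELECT * FROM $table WHERE uri_hash = %s", $hash)
    );
}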
There are other shared memory modules in PHP (shmop or APC for example), but I think what you are saying is that there is an issue relying on non-standard/not pre-installed libraries.
My suggestion is that before you go and do "other operations" you need to make an entry in the database, perhaps with a status of "compiling" (or something) so you know it is still not available. This way you don't run into issues with getting multiple entries. I would also be sure you are using transactions when they are available so your commits are atomic.
Then, when your "other operations" are done, update the database entry to "available" and do whatever else it is you need to do.
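A sketch of that "reserve the row first" idea with PDO (names invented; fetch_and_process_favicon() stands in for the slow work); it relies on a UNIQUE key on `uri_hash` so only one request can create the placeholder:

<?php
// Try to claim the URL; INSERT IGNORE affects 0 rows if someone else already did.
$stmt = $pdo->prepare(
    "INSERT IGNORE INTO favicons (uri_hash, uri, status)
     VALUES (:hash, :uri, 'compiling')"
);
$stmt->execute([':hash' => md5($uri), ':uri' => $uri]);

if ($stmt->rowCount() === 1) {                 // we won the race
    $icon = fetch_and_process_favicon($uri);   // the slow "other operations"
    $upd  = $pdo->prepare(
        "UPDATE favicons SET icon = :icon, status = 'available'
         WHERE uri_hash = :hash"
    );
    $upd->execute([':icon' => $icon, ':hash' => md5($uri)]);
}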

What Would be a Suitable Way to Log Changes Within a Database Using CodeIgniter

I want to create a simple auditing system for my small CodeIgniter application. Such that it would take a snapshot of a table entry before the entry has been edited. One way I could think of would be to create a news_audit table, which would replicate all the columns in the news table. It would also create a new record for each change with the added column of date added. What are your views, and opinions of building such functionality into a PHP web application?
There are a few things to take into account before you decide which solution to use:
If your table is large (or could become large) your audit trail needs to be in a separate table, as you describe, or performance will suffer.
If you need an audit that can't (potentially) be modified except to add new entries, it needs to have INSERT permissions only for the application (and to be cast iron it needs to be on a dedicated logging server...)
I would avoid creating audit records in the same table: it might be confusing to another developer (who might not realize they need to filter out the old rows without dates) and it will clutter the table with audit rows, which will force the db to cache more disk blocks than it needs to (== performance cost). Also, indexing this properly might be a problem if your db does not index NULLs. Querying for the most recent version will involve a sub-query if you choose to timestamp them all.
The cleanest way to solve this, if your database supports it, is to create an UPDATE TRIGGER on your news table that copies the old values to a separate audit table (which needs only INSERT permissions). This way the logic is built into the database, so your applications need not be concerned with it; they just UPDATE the data and the db takes care of keeping the change log. The body of the trigger will just be an INSERT statement, so if you haven't written one before it should not take long to do.
If I knew which db you are using I might be able to post an example...
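Assuming MySQL (the DBMS was left open above) and invented table/column names, the trigger could be created once, e.g. from a CodeIgniter migration:

<?php
$this->db->query("
    CREATE TRIGGER news_before_update
    BEFORE UPDATE ON news
    FOR EACH ROW
      INSERT INTO news_audit (news_id, title, body, changed_at)
      VALUES (OLD.id, OLD.title, OLD.body, NOW())
");
// After this, a plain UPDATE on `news` automatically leaves a copy of the
// previous values in `news_audit`; the PHP code never has to think about it.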
What we do (and you would want to set up archiving beforehand depending on size and use) is create an audit table that stores user information, the time, and then the changes as XML along with the table name.
If you are in SQL2005+ you can then easily search the XML for changes if needed.
We then added triggers to our table to catch what we wanted to audit (inserts, deletes, updates...)
Then with simple serialization we are able to restore and replicate changes.
What scale are we looking at here? On average, are entries going to be edited often or infrequently?
Depending on how many edits you expect for the average item, it might make more sense to store diff's of large blocks of data as opposed to a full copy of the data.
One way I like is to put it into the table itself. You would simply add a 'valid_until' column. When you "edit" a row, you simply make a copy of it and stamp the 'valid_until' field on the old row. The valid rows are the ones without 'valid_until' set. In short, you make it copy-on-write. Don't forget to make your primary keys a combination of the original primary key and the valid_until field. Also set up constraints or triggers to make sure that for each ID there can be only one row that does not have its valid_until set.
This has upsides and downsides. The upside is fewer tables. The downside is far more rows in your tables. I would recommend this structure if you often need to access old data. By adding a simple WHERE to your queries you can query the state of a table at a previous date/time.
If you only need to access your old data occasionally then I would not recommend this though.
You can take this all the way to the extreme by building a temporal database.
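A small sketch of the copy-on-write edit described above (invented table/column names, PDO): the old row gets stamped, the new row becomes the current one, and "current" queries simply filter on valid_until IS NULL:

<?php
$pdo->beginTransaction();

// Stamp the currently valid row ...
$pdo->prepare('UPDATE news SET valid_until = NOW()
               WHERE id = :id AND valid_until IS NULL')
    ->execute([':id' => $id]);

// ... and insert the edited copy as the new current version.
$pdo->prepare('INSERT INTO news (id, title, body, valid_until)
               VALUES (:id, :title, :body, NULL)')
    ->execute([':id' => $id, ':title' => $title, ':body' => $body]);

$pdo->commit();

// Reading the current state stays simple:
// SELECT * FROM news WHERE valid_until IS NULL;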
In small to medium size project I use the following set of rules:
All code is stored under Revision Control System (i.e. Subversion)
There is a directory for SQL patches in source code (i.e. patches/)
All files in this directory start with serial number followed by short description (i.e. 086_added_login_unique_constraint.sql)
All changes to DB schema must be recorded as separate files. No file can be changed after it's checked in to version control system. All bugs must be fixed by issuing another patch. It is important to stick closely to this rule.
A small script remembers the serial number of the last executed patch in the local environment and runs subsequent patches when needed.
This way you can guarantee that you can recreate your DB schema easily without the need to import a whole data dump. Creating such patches is a no-brainer: just run the command in the console/UI/web frontend and copy-paste it into a patch file if successful. Then just add it to the repo and commit the changes.
This approach scales reasonably well. Worked for PHP/PostgreSQL project consisting of 1300+ classes and 200+ tables/views.
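A minimal sketch of such a runner script (the paths, file layout and the one-statement-per-patch assumption are mine, not from the original description):

<?php
// Remembers the serial of the last applied patch and runs anything newer.
$stateFile = 'patches/.last_applied';
$applied   = is_file($stateFile) ? (int) file_get_contents($stateFile) : 0;

foreach (glob('patches/*.sql') as $file) {     // glob() returns sorted names
    $serial = (int) basename($file);           // "086_added_..." -> 86
    if ($serial <= $applied) {
        continue;                              // already run
    }
    $pdo->exec(file_get_contents($file));      // assumes one statement per patch
    file_put_contents($stateFile, $serial);
    echo "applied $file\n";
}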
