OO performance question - php

I will be quick and simple on this.
Basically I need to merge multiple Invoice objects quickly.
A simple idea is to
$invoice1 = new Invoice(1);
$invoice2 = new Invoice(2);
$invoice3 = new Invoice(3);
$invoice1->merge($invoice2, $invoice3);
$invoice1->save();
Since each object will query its own data, the number of queries increases as the number of invoices to be merged increases.
However, this is a case where a single query
SELECT * FROM invoice WHERE id IN (1,2,3)
will suffice. However, the implementation will not be as elegant as the above.
Initial benchmarks on sample data indicate a 2.5x-3x slowdown for the object-per-invoice approach, due to the sheer number of MySQL queries.
Advice please

Use an Invoice factory. You ask it for invoices using various methods: newest(n), get(id), get(array(id, id, id)) and so on, and it returns either an array of invoices or a single Invoice object.
<?php
$invoice56 = InvoiceFactory::Get(56); // Gets invoice 56
$invoices = InvoiceFactory::Newest(25); // Gets an array of the newest 25 invoices
?>
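A rough sketch of how the array form of Get() might be backed by a single IN() query; the Db::pdo() connection helper and Invoice::fromRow() hydration method are hypothetical, not from the original post:
<?php
class InvoiceFactory
{
    public static function Get($ids)
    {
        $ids = is_array($ids) ? array_values($ids) : array($ids);

        // One query for any number of invoices instead of one query per invoice
        $placeholders = implode(',', array_fill(0, count($ids), '?'));
        $stmt = Db::pdo()->prepare("SELECT * FROM invoice WHERE id IN ($placeholders)");
        $stmt->execute($ids);

        $invoices = array();
        foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
            $invoices[] = Invoice::fromRow($row); // hydrate without re-querying
        }
        return count($invoices) === 1 ? $invoices[0] : $invoices;
    }
}
?>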

Could you make the Invoice object lazy and let merge() load everything that hasn't been loaded yet?
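For example, a lazy Invoice might store only its id up front and let merge() batch-load whatever hasn't been fetched yet. A sketch; Invoice::loadMany() is a hypothetical helper that runs the single IN() query and returns rows keyed by id:
<?php
class Invoice
{
    private $id;
    private $data = null; // row data is not loaded until actually needed

    public function __construct($id)
    {
        $this->id = $id;
    }

    public function merge(/* Invoice ... */)
    {
        $invoices = array_merge(array($this), func_get_args());

        // Collect every invoice whose data has not been loaded yet
        $pending = array();
        foreach ($invoices as $inv) {
            if ($inv->data === null) {
                $pending[$inv->id] = $inv;
            }
        }

        if ($pending) {
            // Hypothetical helper: one "SELECT ... WHERE id IN (...)" for all of them
            foreach (Invoice::loadMany(array_keys($pending)) as $id => $row) {
                $pending[$id]->data = $row;
            }
        }

        // ... combine the line items here ...
    }
}
?>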

Make sure you work on the same DB connection the whole time. Check that it does not reconnect within a single script execution.

I would suggest looking into an actual ORM (object-relational mapper) in order to create a separation between your actual queries and the objects used. Take a look at Propel or (my favorite) Doctrine (version 2 is very easy to use).
That way you could have exactly what you want in about the same amount of code.

Related

Laravel Query efficiency difference

Could somebody please tell me which select is most efficient in Laravel:
$car = Car::all(); ------- $car = Car::find();
$car = DB::table('car')->get(); ------ $car = DB::table('car')->first();
Your first approach:
$car = Car::all(); ------- $car = Car::find();
Makes use of Eloquent. This means all the rows received from the query will be hydrated into Model instances, and all of those will be injected into an instance of Collection (for multiple elements, of course). This is useful because you then have all the benefits that Eloquent brings. However, it comes with a small decrease in performance (understandably).
Your second one:
$car = DB::table('car')->get(); ------ $car = DB::table('car')->first();
Uses the Query Builder instead. The results (as a whole) will also be cast into an instance of Collection, but its items will be plain records rather than Eloquent models. This means the process will be faster (more performant), at the cost of not having all the nice features of Eloquent.
There's an even more performant option: using raw queries. That also has trade-offs: results are not hydrated into a Collection instance.
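For completeness, the raw-query option mentioned above might look roughly like this (the cars table and the year filter are just made-up placeholders):
// Raw query: fastest, but you get plain rows and none of the Collection/Eloquent helpers
$cars = DB::select('select * from cars where year > ?', [2015]);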
Which one to use? It depends on your needs. Usually I go for the Eloquent option; I use the Query Builder directly when I need to query big tables and need the speed.
For me the most efficient is selecting from the Model, like Car::all(), but it's always better if you use pagination or just don't pull all of the records from the database with the all() method.
Selecting with DB is a bit faster, though, and in some cases it may be the better choice.
In the end, it always depends on what your problem is and how you want to solve it.
For a better understanding I recommend watching this video, and after that maybe keep looking for more information or just try it out yourself.
https://www.youtube.com/watch?v=uVsY_OXRq5o&t=205s

How to avoid calling DB queries inside of loops for child data

I've been struggling with trying to figure out what is the best, most efficient way to handle displaying child data. In my specific case I'm using PHP and MySQL, but I feel that this is more of a "generally in any language" sort of deal.
My two thoughts are (for this example I'll be listing invoices and their line items):
Joining the child data (invoice items) to the main data (invoices) so as to have only a single query. My problem with this is that, say I have 500 line items on an invoice (probably not realistic, but things happen), then I would have sent the overall invoice data from the MySQL server to my PHP script 500 times over, which sounds ridiculous since I only need it once.
The second option would be to select each invoice's line items while looping through the invoices and displaying the overall invoice data. And this, of course, is now contacting the database 500 more times.
Are there any other options for dealing with this data that makes logical sense (with the given schema)? I'm almost 100% sure there are, since I can't believe that I'm the first person to think about this issue, but I think I'm just having difficulty finding the right way to search for more information on this topic.
Joining the child data (invoice items) to the main data (invoices) as to only have a single query
That's the conventional way of handling this requirement. It does, indeed, transfer redundant data, and there is some overhead.
But.
That's the reason it's possible to specify a compressed connection to an RDBMS from a client: compression mitigates the network overhead of the redundant data (a connection-option sketch follows below this list).
The redundant data in a single result set costs much less than the repeated queries.
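For what it's worth, with PDO's MySQL driver a compressed connection is just a driver option; a sketch, with placeholder DSN and credentials:
$pdo = new PDO('mysql:host=localhost;dbname=billing', 'user', 'secret', array(
    PDO::MYSQL_ATTR_COMPRESS => true, // compress client/server traffic
));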
Most folks just retrieve the redundant data in this kind of application. Program products like Crystal Reports do this.
If it just won't work for you, you retrieve and save one result set for your master records ... maybe something like this.
SELECT master_id, name, address, whatever
FROM master m
WHERE m.whatever = whatever
ORDER BY whatever
Then, put those into an associative array by master_id.
Then, retrieve the detail records.
SELECT d.master_id, d.detail_id, d.item, d.unit, d.quantity
FROM detail d
JOIN master m ON d.master_id = m.master_id
WHERE m.whatever = whatever
ORDER BY d.master_id, d.detail_id, whatever
You'll get a result set with all your relevant detail records (invoice items) tagged with the master_id value. You can then match them up to the master records in your PHP program. You're basically doing the join in your application code.
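A minimal PHP sketch of that application-side join, assuming a PDO connection in $pdo and the placeholder column names from the two queries above:
$masters = array();
foreach ($pdo->query("SELECT master_id, name, address FROM master") as $row) {
    $row['details'] = array();
    $masters[$row['master_id']] = $row;       // associative array keyed by master_id
}

$detailSql = "SELECT d.master_id, d.detail_id, d.item, d.unit, d.quantity
              FROM detail d
              JOIN master m ON d.master_id = m.master_id
              ORDER BY d.master_id, d.detail_id";
foreach ($pdo->query($detailSql) as $row) {
    $masters[$row['master_id']]['details'][] = $row;   // attach items to their master record
}
// $masters now holds each master record with its detail rows, from just two queries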
If all that sounds like too much of a pain in the neck... just go for the redundant data and get your project done. You can always optimize later if you must.

Doctrine native query(createNativeQuery) select query for 1 million rows high memory usage

I am trying to select a million rows with Doctrine but am having some problems.
First I tried doing it with an ORM query, but then I found out the native query is faster. (I don't need the ORM mapping for this.)
I am already using an array hydrator (creating objects would be pointless since I only need to read the data).
I also heard about PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, but I get an error if I turn it off, since I am working with multiple result sets (cursors) at the same time.
So the memory usage is very high: around 100 MB of RAM for just one integer column over about 1.3 million rows.
Example part of my code (using the non-custom HYDRATE_ARRAY):
function getResult() {
    $rsm = new ResultSetMapping();
    $q = $this->getEntityManager()->createNativeQuery(
        "select {$this->getTableIdColName()} from {$this->getTableName()}",
        $rsm
    );
    return $q->iterate(null, \Doctrine\ORM\Query::HYDRATE_ARRAY);
}
Upon calling said function, the memory is consumed even if I don't iterate the result at all.
I've also made a custom hydrator which does pretty much the same as the default one but uses less memory (since it doesn't map column names). But the result isn't good either way.
Am I missing something, or is it normal for a query to take 100 MB of RAM without even using the result?
The problem was with query buffering; for anyone interested:
http://php.net/manual/en/mysqlinfo.concepts.buffering.php
By default it's enabled, hence the whole result set is kept in memory.
Basically the line below fixes the problem, but it imposes a limitation: it requires you to consume/close all results before making new queries to the database (Doctrine has no internal way of closing them, by the way).
$em->getConnection()->getWrappedConnection()->setAttribute(\PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);
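A sketch of how that line might be used around the big read, assuming the code runs inside the same repository class as getResult() above; buffering is switched back on afterwards so later queries behave normally:
$pdo = $this->getEntityManager()->getConnection()->getWrappedConnection();

// Unbuffered mode only for the large read
$pdo->setAttribute(\PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);

foreach ($this->getResult() as $row) {
    // process one row at a time; memory use stays roughly constant
}

// Once the result is fully consumed, restore buffered mode before running other queries
$pdo->setAttribute(\PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, true);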

Reasons not to use GROUP_CONCAT?

I just discovered the amazingly useful MySQL function GROUP_CONCAT. It seems so useful and so over-simplifying that I'm actually afraid of using it, mainly because it's been quite some time since I started in web programming and I've never seen it used anywhere. A sample of awesome usage would be the following:
Table clients holds clients (you don't say...), one row per client with unique IDs.
Table currencies has 3 columns: client_id, currency and amount.
Now if I wanted to get user 15's name from the clients table and his balances, with the "old" method of array overwriting I would have to use the following SQL:
SELECT id, name, currency, amount
FROM clients LEFT JOIN currencies ON clients.id = client_id
WHERE clients.id = 15
Then in PHP I would have to loop through the result set and do an array overwrite (which I'm really not a big fan of, especially with massive result sets), like:
$result = array();
foreach ($stmt->fetchAll() as $row) {
    $result[$row['id']]['name'] = $row['name'];
    $result[$row['id']]['currencies'][$row['currency']] = $row['amount'];
}
However with the newly discovered function I can use this
SELECT id, name, GROUP_CONCAT(currency) AS currencies, GROUP_CONCAT(amount) AS amounts
FROM clients LEFT JOIN currencies ON clients.id = client_id
WHERE clients.id = 15
GROUP BY clients.id
Then at the application level things are so awesome and pretty:
$results = $stmt->fetchAll();
foreach ($results as $k => $v) {
    $results[$k]['currencies'] = array_combine(explode(',', $v['currencies']), explode(',', $v['amounts']));
}
The question I would like to ask is: are there any drawbacks to using this function, in performance or anything at all? To me it just looks like pure awesomeness, which makes me think there must be a reason people don't use it more often.
EDIT:
I want to ask, eventually, what are the other options besides array overwriting to end up with a multidimensional array from a MySQL result set, because if I'm selecting 15 columns it's a really big pain in the neck to write that beast.
Using GROUP_CONCAT() usually invokes the group-by logic and creates temporary tables, which are usually a big negative for performance. Sometimes you can add the right index to avoid the temp table in a group-by query, but not in every case.
As @MarcB points out, the default length limit of a group-concatenated string is pretty short, and many people have been confused by truncated lists. You can increase the limit with group_concat_max_len.
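For example, a sketch of raising the limit for the current session before running the query (the value is in bytes; $pdo is whatever connection you already use):
$pdo->exec('SET SESSION group_concat_max_len = 1000000');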
Exploding a string into an array in PHP does not come for free. Just because you can do it in one function call in PHP doesn't mean it's the best for performance. I haven't benchmarked the difference, but I doubt you have either.
GROUP_CONCAT() is a MySQLism. It is not supported widely by other SQL products. In some cases (e.g. SQLite), they have a GROUP_CONCAT() function, but it doesn't work exactly the same as in MySQL, so this can lead to confusing bugs if you have to support multiple RDBMS back-ends. Of course, if you don't need to worry about porting, this is not an issue.
If you want to fetch multiple columns from your currencies table, then you need multiple GROUP_CONCAT() expressions. Are the lists guaranteed to be in the same order? That is, does the third field in one list correspond to the third field in the next list? The answer is no, not unless you specify the order with an ORDER BY clause inside the GROUP_CONCAT().
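For instance, the query from the question could pin both lists to the same order by repeating the same ORDER BY inside each GROUP_CONCAT; ordering by currency here is just an arbitrary choice:
$sql = "SELECT id, name,
               GROUP_CONCAT(currency ORDER BY currency) AS currencies,
               GROUP_CONCAT(amount   ORDER BY currency) AS amounts
        FROM clients LEFT JOIN currencies ON clients.id = client_id
        WHERE clients.id = 15
        GROUP BY clients.id";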
I usually favor your first code format: use a conventional result set, loop over the results, and save them to a new array indexed by client id, appending the currencies to a sub-array. This is a straightforward solution, it keeps the SQL simple and easier to optimize, and it works better if you have multiple columns to fetch.
I'm not trying to say GROUP_CONCAT() is bad! It's really useful in many cases. But trying to make any one-size-fits-all rule to use (or to avoid) any function or language feature is simplistic.
The biggest problem that I see with GROUP_CONCAT is that it is highly specific to MySQL: if you want to port your code to run against any other platform, you would have to rewrite every query that uses GROUP_CONCAT. For example, your first query is a lot more portable; you can probably run it against any major RDBMS engine without changing a single character.
If you are fine with working only with MySQL (say, because you are writing a tool that is meant to be specific to MySQL), the queries with GROUP_CONCAT will probably be faster, because the RDBMS does more work for you, saving on the size of the data transfer.

Performance issues with mongo + PHP with pagination, distinct values

I have a MongoDB collection that contains lots of books with many fields. Some key fields which are relevant to my question are:
{
    book_id : 1,
    book_title : "Hackers & Painters",
    category_id : "12",
    related_topics : [ { topic_id : "8", topic_name : "Computers" },
                       { topic_id : "11", topic_name : "IT" } ]
    ...
    ... (at least 20 more fields)
    ...
}
We have a form for filtering results (with many inputs/select boxes) on our search page, and of course there is also pagination. Alongside the filtered results, we show all categories on the page, and for each category the number of results found in that category.
We are trying to use MongoDB instead of PostgreSQL, because performance and speed are our main concerns for this process.
Now the question is :
I can easily filter results by feeding the "find" function with all the filter parameters. That's cool. I can paginate results with the skip and limit functions:
$data = $lib_collection->find($filter_params, array())->skip(20)->limit(20);
But I have to collect the number of results found for each category_id and topic_id before pagination occurs. And I don't want to "foreach" over all the results, collect the categories and manage pagination in PHP, because the filtered data often consists of nearly 200,000 results.
Problem 1: I found the MongoDB::command() function in the PHP manual with a "distinct" example. I think I can get distinct values with this method, but the command function doesn't accept conditional parameters (for filtering). I don't know how to apply the same filter params while asking for distinct values.
Problem 2: Even if there is a way to send filter parameters with the MongoDB::command function, it will be another query in the process and will take approximately the same time (maybe more) as the previous query, I think. And this will be another speed penalty.
Problem 3: Getting distinct topic_ids with the number of results will be yet another query, and yet another speed penalty :(
I am new to working with MongoDB. Maybe I am looking at the problems from the wrong point of view. Can you help me solve the problems and give your opinions about the fastest way to get:
filtered results
pagination
distinct values with number of results found
from a large data set.
So the easy way to do filtered results and pagination is as follows:
$cursor = $lib_collection->find($filter_params, array())
$count = $cursor->count();
$data = $cursor->skip(20)->limit(20);
However, this method may be somewhat inefficient. If you query on fields that are not indexed, the only way for the server to count() is to load each document and check. If you do skip() and limit() with no sort(), then the server just needs to find the first 20 matching documents, which is much less work.
The number of results per category is going to be more difficult.
If the data does not change often, you may want to precalculate these values using regular map/reduce jobs. Otherwise you have to run a series of distinct() commands or in-line map/reduce. Neither one is generally intended for ad-hoc queries.
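For what it's worth, in the legacy PHP driver MongoCollection::distinct() does accept a query filter as its second argument, so a sketch like the one below could reuse the search filter. Note, though, that a count per category still means one extra count() round trip each, which is exactly the cost being described:
$categoryIds = $lib_collection->distinct('category_id', $filter_params);

$counts = array();
foreach ($categoryIds as $catId) {
    // one additional count() per category, all constrained by the same filter
    $counts[$catId] = $lib_collection->count(
        array_merge($filter_params, array('category_id' => $catId))
    );
}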
The only other option is basically to load all of the search results and then count on the webserver (instead of the DB). Obviously, this is also inefficient.
Getting all of these features is going to require some planning and tradeoffs.
Pagination
Be careful with pagination on large datasets. Remember that skip() and limit(), whether you use an index or not, will have to perform a scan. Therefore, skipping very far is very slow.
Think of it this way: the database has an index (a B-tree) that can compare values to each other; it can tell you quickly whether something is bigger or smaller than some given x. Hence, search times in well-balanced trees are logarithmic. This is not true for position-based lookups: a B-tree has no way to tell you quickly what the 15,000th element is; it will have to walk and enumerate the entire tree.
From the documentation:
Paging Costs
Unfortunately skip can be (very) costly and requires the
server to walk from the beginning of the collection, or index, to get
to the offset/skip position before it can start returning the page of
data (limit). As the page number increases skip will become slower and
more cpu intensive, and possibly IO bound, with larger collections.
Range based paging provides better use of indexes but does not allow
you to easily jump to a specific page.
Make sure you really need this feature: typically, nobody cares about the 42,436th result. Note that most large websites never let you paginate very far, let alone show exact totals. There's a great website about this topic, but I don't have the address at hand nor the name to find it.
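A sketch of the range-based paging the docs mention, using the same legacy-driver style as the question; $lastId is assumed to hold the _id of the last document on the previous page:
$filter = $filter_params;
if ($lastId !== null) {
    $filter['_id'] = array('$gt' => $lastId);  // continue after the previous page
}
$data = $lib_collection->find($filter)
                       ->sort(array('_id' => 1))
                       ->limit(20);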
Distinct Topic Counts
I believe you might be using a sledgehammer as a flotation device. Take a look at your data: related_topics. I personally hate RDBMSs because of object-relational mapping, but this seems to be the perfect use case for a relational database.
If your documents are very large, performance is a problem, and you hate ORMs as much as I do, you might want to consider using both MongoDB and the RDBMS of your choice: let MongoDB fetch the results and the RDBMS aggregate the best matches for a given category. You could even run the queries in parallel! Of course, writing changes then needs to occur on both databases.
