I'm developing an app with FuelPHP & MySQL and I'm using the provided ORM functionality. The problem is with the following tables:
Table: pdm_data
Massive table (350+ columns, many rows)
Table data is rather static (updates only once a day)
Primary key: obj_id
Table: change_request
Only a few columns
Data changes often (10-20 times / min)
References primary key (obj_id from table pdm_data)
Users can customize the datasheet that is visible to them, e.g. they can save filters (e.g. change_request.obj_id=34 AND pdm_data.state = 6) on columns, which are then translated into queries in real time with the ORM.
However, querying with the ORM is really slow, as the table pdm_data is large and even ~100 rows result in many MBs of data. The largest problem seems to be in the FuelPHP ORM: even if the query itself is relatively fast, model hydration etc. takes many seconds. The ideal solution would be to cache results from the pdm_data table, as it is rather static. However, as far as I know, FuelPHP doesn't let you cache tables through relations (you can cache the complete result of a query, thus both tables or neither).
Furthermore, using a normal SQL query with a join instead of the ORM is not an ideal solution, as I need to handle other tasks where hydrated models are awesome.
I currently have the following code:
//Initialize the query and use eager-loading
$query = Model_Changerequest::query()->related('pdmdata');
foreach ($filters as $filter)
{
    //First parameter can point to either table
    $query->where($filter[0], $filter[1], $filter[2]);
}
$result = $query->get();
...
Does someone have a good solution for this?
Thanks for reading!
The slowness of the version 1 ORM is a known problem which is being addressed in v2. My current benchmarks show that the v1 ORM takes 2.5 seconds (on my machine, YMMV) to hydrate 40k rows, while the current v2 alpha takes around 800ms.
For now, I'm afraid the easiest solution is to do away with the ORM for large selects and construct the queries using the DB class. I know you said that you want to keep the abstraction of the ORM to ease development; one solution is to use as_object('MyModel') to return populated model objects.
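For example, a rough sketch of what that could look like for the query above (untested; the join condition on obj_id is assumed from the table descriptions, and you would probably want to limit the selected columns rather than pulling everything):

//Select from change_request, join the static pdm_data table and hydrate
//Model_Changerequest instances directly via as_object()
$query = DB::select()
    ->from('change_request')
    ->join('pdm_data', 'LEFT')
    ->on('change_request.obj_id', '=', 'pdm_data.obj_id')
    ->as_object('Model_Changerequest');

foreach ($filters as $filter)
{
    $query->where($filter[0], $filter[1], $filter[2]);
}

$result = $query->execute();

Note that objects hydrated this way bypass the ORM's relation handling, so the pdm_data columns arrive as plain properties rather than as a related model.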
On the other hand, if performance is your main concern, then the ORM is simply not suitable.
Related
Based on a relational database analogy, I would like to know how Solr fits into the picture.
Based on what I've figured out so far, "documents" in Solr are similar to "rows" in SQL (if my SQL table has 100 rows, I need to insert 100 documents into Solr) and "cores" are similar to "tables" (or databases?!?).
The questions are:
If I have 2 sets of unrelated information, let's say a table with car information (id, name, series, color, description) and a table with user information (id, name, address, age, sex), where do I insert these things in Solr?
Do I make 2 cores (core_car, core_user) and populate each of them with documents from the corresponding table?
Or do I make 1 core (core_general) and insert all the documents from both tables there (separated somehow, though I don't know how)?
In the first case, with 2 cores, I feel like I'm creating 2 databases with 1 table each (overkill).
In the second, I feel like I'm creating 1 table with all the unrelated fields mushed together (this wouldn't be the case if there were some form of separation that I don't know of at the moment).
Please confirm or correct my presumptions.
Thank you in advance.
Great that you explored before posting the question. Here's my opinion.
Solr Document: Probably a more suitable way of perceiving this concept is thinking in terms of results. Each Solr document is nothing but one result entry in your result set after executing a search query.
If you were indexing Wikipedia, each article would be a Solr Document. When you search for "sorting algorithms", the results you would like to see are "bubble sort", "merge sort", etc. Each of them is an article, a Solr document, and a result in the result-set.
If you want to relate this back to RDBMS concepts, I would say that each search result (i.e. a Solr document) could be a row in the result set of a SQL query. That row could be a row from a single table, or a row from JOINed tables.
A Solr Core is nothing but a wrapper around ONE Lucene index. Each Solr web app can operate multiple Solr cores.
The best way to speed up your understanding is to avoid relating concepts in Solr to RDBMS.
Explore what Solr offers that an RDBMS doesn't (efficiently).
Here's another link that might help you : Solr Terminology
Your use-case
The beauty of Solr/Lucene is the flexible schema, or I'd say no schema. Each document can have totally different fields and attributes from the previously indexed document.
It is perfectly fine to have different types of documents (car, person, etc.) in the same Lucene index (Solr core in your case), as long as they scale well together.
For example, if you have 500M car entries and 3 billion person entries, it makes sense to index them separately. If you have 1M persons and 500k cars, you can put all of them in the same index with an identifier field containing the entity type.
Your question is very subjective; not everyone would agree with what I said. Deciding between one core and multiple cores depends on many more factors.
For example,
Do those two entities (persons and cars) complement each other to serve as a logical chunk in order to support a product feature?
Are there any situations where you'd have to get both types of results for a query?
How often do you update each type of entity? (There's no update option in Solr; it's only delete & re-add.)
Do they belong in different product features?
Are they owned by different teams? And so on.
This question occurred to me today: should my repositories always return full objects? Can't they return partial data (in an array, for example)?
For example, I have the method getUserFriends(User $user) inside my Friends repository; in this method I execute the following DQL:
$dql = 'SELECT userFriend FROM Entities\User\Friend f JOIN f.friend userFriend WHERE f.user = ?0';
But this way I'm returning the user entities with all their properties; the generated SQL is a SELECT of all fields from the User table. Let's say I just need the id and the name of the user's friends; wouldn't it be more interesting (and quicker) to fetch just these values?
$dql = 'SELECT userFriend.id, userFriend.name FROM Entities\User\Friend f JOIN f.friend userFriend WHERE f.user = ?0';
These methods are executed in my service class.
From a database perspective, performance will not be affected that much by the number of fields, unless the number of rows to return is really huge (millions of rows, probably): the hardest part for the DB is to do the joins and build the result set from the tables.
From a php perspective, that depends on multiple factors, like the complexity and the number of objects created.
I would take the problem differently: I would profile and stress-test my code to see whether performance is an issue or not, and decide to refactor only if needed (switching from Doctrine to a hand-made model is time-consuming; will the performance gain be worth it?).
EDIT: and to answer your initial question: fetching complete objects leads to easier caching, if needed, and better data encapsulation. I would keep them until they represent a big performance issue.
You can use the partial keyword in your DQL: http://www.doctrine-project.org/docs/orm/2.0/en/reference/partial-objects.html?highlight=partial
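For the query in the question, that would look something like this (a sketch using the entity and field names from the question):

$dql = 'SELECT partial userFriend.{id, name} FROM Entities\User\Friend f JOIN f.friend userFriend WHERE f.user = ?0';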
But only do that if your app has performance issues.
I'm using PHP and MySQL. I have records for:
events with various "event types" that are hierarchical (events can have multiple categories and subcategories, but there is a fixed number of such categories and subcategories) (timestamped)
What is the best way to set up the table? Should I have a bunch of columns (30 or so) with enums for yes or no, indicating membership in that category? Or should I use the MySQL SET datatype?
http://dev.mysql.com/tech-resources/articles/mysql-set-datatype.html
Basically I have performance in mind and I want to be able to retrieve all of the ids of the events for a given category. Just looking for some insight on the most efficient way to do this.
It sounds like you're chiefly concerned with performance.
A couple of people have suggested splitting this into 3 tables (a category table plus either a simple cross-reference table or a more sophisticated way of modeling the tree hierarchy, like nested set or materialized path), which is the first thing I thought of when I read your question.
With indexes, a fully normalized approach like that (which adds two JOINs) will still have "pretty good" read performance. One issue is that an INSERT or UPDATE to an event may now also include one or more INSERT/UPDATE/DELETEs to the cross-reference table, which on MyISAM means the cross-reference table is locked and on InnoDB means the rows are locked, so if your database is busy with a significant number of writes you're going to have larger contention problems than if just the event rows were locked.
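For reference, the normalized read with the two JOINs would look roughly like this (a sketch; table and column names are assumed, following the cross-reference table answer further down):

SELECT e.id, e.title
FROM event e
JOIN event_category_event_xref x ON x.event_id = e.id
JOIN event_category c ON c.id = x.event_category_id
WHERE c.name = 'music';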
Personally, I would try out this fully normalized approach before optimizing. But I'll assume you know what you're doing, that your assumptions are correct (categories never change), and that you have a usage pattern (lots of writes) that calls for a less-normalized, flat structure. That's totally fine and is part of what NoSQL is about.
SET vs. "lots of columns"
So, as to your actual question, "SET vs. lots of columns": I can say that I've worked at two companies with smart engineers (whose products were CRM web applications; one was actually events management), and they both used the "lots of columns" approach for this kind of static set data.
My advice would be to think about all of the queries you will be doing on this table (weighted by their frequency) and how the indexes would work.
First, with the "lots of columns" approach you are going to need an index on each of these columns so that you can do SELECT ... FROM events WHERE CategoryX = TRUE. With the indexes, that is a super-fast query.
Versus with SET, you must use bitwise AND (&), LIKE, or FIND_IN_SET() to do this query. That means the query can't use an index and must do a linear scan of all rows (you can use EXPLAIN to verify this). Slow query!
That's the main reason SET is a bad idea -- its index is only useful if you're selecting by exact groups of categories. SET works great if you'd be selecting categories by event, but not the other way around.
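To make the contrast concrete (a sketch; the column and category names are made up):

-- "lots of columns": the index on category_music can be used
SELECT id FROM events WHERE category_music = TRUE;

-- SET column: FIND_IN_SET() cannot use an index, so every row is examined
SELECT id FROM events WHERE FIND_IN_SET('music', categories) > 0;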
The primary problem with the less-normalized "lots of columns" approach (versus fully normalized) is that it doesn't scale. If you have 5 categories and they never change, fine; but if you have 500 and they change, it's a big problem. In your scenario, with around 30 that never change, the primary issue is that there's an index on every column, so if you're doing frequent writes, those queries become slower because of the number of indexes that have to be updated. If you choose this approach, you might want to check the MySQL slow query log to make sure there aren't outlier slow queries due to contention at busy times of day.
In your case, if yours is a typical read-heavy web app, I think going with the "lots of columns" approach (as the two CRM products did, for the same reason) is probably sane. It is definitely faster than SET for that SELECT query.
TL;DR Don't use SET because the "select events by category" query will be slow.
It's good that the number of categories is fixed. If it weren't, you couldn't use either approach.
Check the "Why You Shouldn't Use SET" section on the page you linked; I think that should give you a comprehensive guide.
I think the most important point is about indexes. Also, modifying a SET is slightly more complex.
The relationship between events and event types/categories is a many-to-many relationship, as echo says, but a simple xref table will leave you with a problem: if you want to query for all descendants of any given node, you must make multiple recursive queries. On a deep tree, that will be very inefficient.
So when you say "retrieve all IDs for a given category", if you do mean all descendants, then you want to use the Nested Set Model:
http://mikehillyer.com/articles/managing-hierarchical-data-in-mysql/
The Nested Set model makes writes and updates a bit slower, but makes it very easy to retrieve subtrees:
To get the Televisions subtree, you query for all categories with left >= 2 and right <= 9.
Leaf nodes always have left = right - 1.
You can find the count of descendants without pulling those rows: (right - left - 1)/2
Finding inheritance paths and depth is also very easy (single query stuff). See the article for full details.
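As a concrete sketch, using the column names from the linked article and the Televisions left/right values mentioned above:

-- All categories in the Televisions subtree
SELECT category_id, name
FROM nested_category
WHERE lft >= 2 AND rgt <= 9;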
You might try using a cross-reference (Xref) table, to create a many-to-many relationship between your events and their types.
create table event_category_event_xref
(
    event_id int,
    event_category_id int,
    foreign key (event_id) references event(id),
    foreign key (event_category_id) references event_category(id)
);
Event/category membership is defined by records in this table. So if you have a record with {event_id = 3, event_category_id = 52}, it means event #3 is in category #52. Similarly, you can have records for {event_id = 3, event_category_id = 27}, and so on.
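Retrieving all of the event IDs for a given category is then a single lookup (fast if event_category_id is indexed):

SELECT event_id
FROM event_category_event_xref
WHERE event_category_id = 52;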
I will keep this quick and simple.
Basically, I need to merge multiple Invoice objects quickly.
A simple idea is:
$invoice1 = new Invoice(1);
$invoice2 = new Invoice(2);
$invoice3 = new Invoice(3);
$invoice1->merge($invoice2, $invoice3);
$invoice1->save();
Since each object will query its own data, the number of queries increases as the number of invoices to be merged increases.
However, this is a case where a single query
SELECT * FROM invoice WHERE id IN (1,2,3)
will suffice, although the implementation will not be as elegant as the above.
Initial benchmarks on sample data indicate a 2.5x-3x decrease in speed for the approach above, due to the sheer number of MySQL queries.
Advice please
Use an Invoice factory. You ask it for invoices using various methods (newest(n), get(id), get(array(id, id, id)), and so on), and it returns arrays of invoices or single Invoice objects.
<?php
$invoice56 = InvoiceFactory::Get(56); // Gets invoice 56
$invoices = InvoiceFactory::Newest(25); // Gets an array of the newest 25 invoices
?>
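A sketch of how Get() could batch an array of IDs into a single query (PDO is used here purely for illustration, and Invoice::fromRow() is a hypothetical hydration helper; adapt both to whatever DB layer the Invoice class actually uses):

<?php
class InvoiceFactory
{
    private static $db; // assumed to be an initialized PDO connection

    // Accepts a single id or an array of ids and issues ONE query either way
    public static function Get($ids)
    {
        $ids = array_map('intval', (array) $ids);
        $placeholders = implode(',', array_fill(0, count($ids), '?'));

        $stmt = self::$db->prepare("SELECT * FROM invoice WHERE id IN ($placeholders)");
        $stmt->execute($ids);

        $invoices = array();
        foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
            $invoices[] = Invoice::fromRow($row); // hypothetical: build an Invoice from a row
        }

        return count($invoices) === 1 ? $invoices[0] : $invoices;
    }
}
?>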
Could you make the Invoice object lazy and let merge() load everything that hasn't been loaded yet?
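Roughly, the idea would be something like this (a sketch; loadRows() is a hypothetical helper that runs a single SELECT ... WHERE id IN (...) query):

<?php
class Invoice
{
    private $id;
    private $row = null; // nothing is fetched in the constructor

    public function __construct($id)
    {
        $this->id = $id;
    }

    public function merge()
    {
        $invoices = array_merge(array($this), func_get_args());

        // Collect every invoice that has not been loaded yet
        $pending = array();
        foreach ($invoices as $invoice) {
            if ($invoice->row === null) {
                $pending[$invoice->id] = $invoice;
            }
        }

        // Load all of them with one query instead of one query per invoice
        if ($pending) {
            // loadRows() is hypothetical: fetches rows for the given ids in one query
            foreach (self::loadRows(array_keys($pending)) as $row) {
                $pending[$row['id']]->row = $row;
            }
        }

        // ...the actual merge logic then works on the now-populated rows...
    }
}
?>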
Make sure you work on the same DB connection all the time. Check that it does not reconnect within one script execution.
I would suggest looking into using an actual ORM (object-relational mapper) in order to create a separation between your actual queries and the objects used. Take a look at Propel or (my favorite) Doctrine (version 2 is very easy to use).
That way you could have exactly what you want, in about the same amount of code...
I'm experimenting with the Doctrine ORM (v1.2) for PHP. I have defined a class "liquor" with two child classes, "gin" and "whiskey". I am using concrete inheritance (class table inheritance in most literature) to map the classes to three separate database tables.
I am attempting to execute the following:
$liquor_table = Doctrine_Core::getTable('liquor');
$liquors = $liquor_table->findAll();
Initially, I expected $liquors to be a Doctrine_Collection containing all liquors, whether whiskey or gin. But when I execute the code, I get an empty collection, despite having several rows in the whiskey and gin database tables. Based on the generated SQL, I understand why: the ORM is querying the "liquor" table, and not the whiskey/gin tables where the actual data is stored.
Note that the code works perfectly when I switch the inheritance type to column aggregation (simple table inheritance).
What's the best way to obtain a Doctrine_Collection containing all liquors?
Update
After some more research, it looks like I'm expecting Doctrine to perform a SQL UNION operation behind the scenes to combine the result sets from the "whiskey" and "gin" tables.
This is known as a polymorphic query.
According to this ticket, this functionality is not available in Doctrine 1.x. It is destined for the 2.0 release. (also see Doctrine 2.0 docs for CTI).
So in light of this information, what would be the cleanest, most efficient way to work around this deficiency? Switch to single table inheritance? Perform two DQL queries and manually merge the resulting Doctrine_Collections?
The only stable and useful inheritance mode in Doctrine at the moment is column_aggregation. I have tried the others in different projects. With column_aggregation you can imitate polymorphic queries.
Inheritance in general is a bit buggy in Doctrine 1.x. This will change with 2.x, so we may have better options in the future.
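If I remember the Doctrine 1.x API correctly, the column_aggregation setup looks roughly like this (a sketch; the discriminator column name and values are made up):

class Liquor extends Doctrine_Record
{
    public function setTableDefinition()
    {
        $this->hasColumn('name', 'string', 255);
        $this->hasColumn('type', 'string', 20); // discriminator column (assumed name)

        // Tells Doctrine which subclass to hydrate for each "type" value
        $this->setSubclasses(array(
            'Whiskey' => array('type' => 'whiskey'),
            'Gin'     => array('type' => 'gin'),
        ));
    }
}

class Whiskey extends Liquor {}
class Gin extends Liquor {}

With that in place, Doctrine_Core::getTable('liquor')->findAll() returns a mixed collection of Whiskey and Gin records from the single liquor table, which is the polymorphic behaviour you were after.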
A while back I wrote the (not production-ready) beginnings of an ORM that would do exactly what you're looking for, just so I could have a proof of concept. Everything I studied indicated that you are in some way mixing code and data (subclass information in the liquor table).
So what you might do is write a method on your liquor class/table class that queries its own table. The best way to avoid hard-coding all the subclasses in your liquor class is to have a column which contains the class name of the subclass.
How you spread the details around is entirely up to you. I think the most normalized way to do it (and anyone can correct me if I'm wrong here) is to store all fields that appear in your liquor class in the liquor table. Then, for each subclass, have a table that stores the specific data that pertains to that subclass type.
This is the point at which you are mixing code and data, because your code reads the liquor table to get the name of the subclass and then performs a join (see the query sketch after the example below).
I'll use cars & bikes, with some minimal, trivial differences between them, for my example:
Ride
----
id
name
type
(1, 'Sebring', 'Car')
(2, 'My Bike', 'Bicycle')
Bicycle
-------
id
bike_chain_length
(2, '2 feet')
Car
---
id
engine_size
(1, '6 cylinders')
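For completeness, the join mentioned above, pulling a full "Car" ride out of this layout, would be something like (a sketch):

SELECT r.id, r.name, c.engine_size
FROM Ride r
JOIN Car c ON c.id = r.id
WHERE r.type = 'Car';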
There are all kinds of variations from here forward, like storing all liquor class data in the subclass table and only storing references and subclass names in the liquor table. I like that one the least, though, because aggregating the common data in the liquor table saves you from having to query every subclass table for the common fields.
Hope this helps!