Let's say I have the following "expenses" MySQL table:
| id | amount | vendor | tag  |
|----|--------|--------|------|
| 1  | 100    | google | foo  |
| 2  | 450    | GitHub | bar  |
| 3  | 22     | GitLab | fizz |
| 4  | 75     | AWS    | buzz |
I'm building an API that should return expenses based on partial "vendor" or "tag" filters, so vendor="Git" should return records 2&3, and tag="zz" should return records 3&4.
I was thinking of utilizing Elasticsearch's capabilities, but I'm not sure of the correct way.
Most articles I read suggest replicating the table records (using a Logstash pipeline or other methods) to an Elasticsearch index.
So my API doesn't even query the DB and returns an array of documents directly from ES?
Is this considered good practice? Replicating the whole table to Elasticsearch?
What about table relations... What if I want to filter by a nested table relation?
So my API doesn't even query the DB and return an array of documents
directly from ES?
Yes. Since you are querying Elasticsearch, you will get results only from Elasticsearch. Another option is to fetch only the ids from Elasticsearch and use them to retrieve the documents from MySQL, but this extra round trip might impact response time.
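For illustration, the two-step approach could look roughly like this (a sketch only; it assumes the elasticsearch-php 7.x client and a PDO connection, and the index/table names come from the question). Note that the partial matching itself still needs the analyzer setup discussed further down:

```php
<?php
// Sketch: query Elasticsearch for matching ids only, then hydrate the rows from MySQL.
// Assumes the elasticsearch-php 7.x client and a PDO connection; credentials are placeholders.
require 'vendor/autoload.php';

$es = Elasticsearch\ClientBuilder::create()->build();

$result = $es->search([
    'index' => 'expenses',
    'body'  => [
        '_source' => false,                              // we only need the document ids
        'query'   => ['match' => ['vendor' => 'Git']],   // needs the n-gram mapping below to match partially
    ],
]);

$ids = array_map(static fn ($hit) => (int) $hit['_id'], $result['hits']['hits']);

$expenses = [];
if ($ids !== []) {
    $pdo          = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
    $placeholders = implode(',', array_fill(0, count($ids), '?'));
    $stmt         = $pdo->prepare("SELECT * FROM expenses WHERE id IN ($placeholders)");
    $stmt->execute($ids);
    $expenses = $stmt->fetchAll(PDO::FETCH_ASSOC);
}
```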
Is this considered good practice? replicating the whole table to
elastic? What about table relations... What If I want to filter by
nested table relation?...
It is not a question of good or bad practice; it depends on what kind of functionality and use case you want to implement, and based on that the technology stack can be chosen and data can be duplicated. There are lots of companies using Elasticsearch as a secondary data store, with duplicated data, simply because their use case is a better fit for Elasticsearch or another NoSQL DB.
Elasticsearch is a NoSQL DB and does not maintain relationships between data. Hence, you need to denormalize your data before indexing it into Elasticsearch. You can read this article for more about denormalization and why it is required.
Elasticsearch provides the nested and join data types for parent-child relationships, but both have limitations and a performance impact.
Below is what they have mentioned for join field type:
The join field shouldn’t be used like joins in a relation database. In
Elasticsearch the key to good performance is to de-normalize your data
into documents. Each join field, has_child or has_parent query adds a
significant tax to your query performance. It can also trigger global
ordinals to be built.
Below is what they have mentioned for nested field type:
When ingesting key-value pairs with a large, arbitrary set of keys,
you might consider modeling each key-value pair as its own nested
document with key and value fields. Instead, consider using the
flattened data type, which maps an entire object as a single field and
allows for simple searches over its contents. Nested documents and
queries are typically expensive, so using the flattened data type for
this use case is a better option.
most articles I read suggest replicating the table records (using
logstash pipe or other methods) to elastic index.
Yes, you can use Logstash or any language client (Java, Python, etc.) to sync data from the DB to Elasticsearch. You can check this SO answer for more information.
Your Search Requirements
If you go ahead with Elasticsearch, you can use the N-gram tokenizer or a regexp query to achieve your search requirements.
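To make that concrete, here is a rough sketch of what an N-gram setup could look like with the elasticsearch-php client (7.x assumed); the analyzer and field names are only illustrative:

```php
<?php
// Sketch of an n-gram setup for partial vendor/tag matching (elasticsearch-php 7.x assumed).
require 'vendor/autoload.php';

$es = Elasticsearch\ClientBuilder::create()->build();

// Index time: split values into 2-3 character grams, so "GitHub" also indexes "git", "hub", ...
$es->indices()->create([
    'index' => 'expenses',
    'body'  => [
        'settings' => [
            'analysis' => [
                'tokenizer' => [
                    'partial_tokenizer' => ['type' => 'ngram', 'min_gram' => 2, 'max_gram' => 3],
                ],
                'analyzer' => [
                    'partial' => [
                        'type'      => 'custom',
                        'tokenizer' => 'partial_tokenizer',
                        'filter'    => ['lowercase'],
                    ],
                ],
            ],
        ],
        'mappings' => [
            'properties' => [
                'vendor' => ['type' => 'text', 'analyzer' => 'partial', 'search_analyzer' => 'standard'],
                'tag'    => ['type' => 'text', 'analyzer' => 'partial', 'search_analyzer' => 'standard'],
                'amount' => ['type' => 'integer'],
            ],
        ],
    ],
]);

// Query time: a plain match now behaves like a partial filter.
// vendor=Git -> records 2 and 3, tag=zz -> records 3 and 4.
$hits = $es->search([
    'index' => 'expenses',
    'body'  => ['query' => ['match' => ['vendor' => 'Git']]],
]);
```

The alternative is a regexp (or wildcard) query such as `{"regexp": {"tag": ".*zz.*"}}` against a normally analyzed field; it needs no special mapping but is generally much slower on large indices than the n-gram approach.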
Maybe you can try TiDB: https://medium.com/@shenli3514/simplify-relational-database-elasticsearch-architecture-with-tidb-c19c330b7f30
If you want to scale your MySQL and have fast filtering and aggregating, TiDB could simplify the architecture and reduce development work.
Related
I have a situation where I need suggestions on database table design.
BACKGROUND
I am developing an application in PHP (CakePHP to be precise) where we upload an XML file; the application parses the file and saves the data in the database. These XML sources can be files or URL feeds, and they are purchased from various suppliers of data. The intent is to collect venue data from the source URLs; venues can be anything like hotels, cinemas, schools, restaurants, etc.
Problem
The initial table structure for these venues is below. The table is designed to store generic information initially.
id
Address
Postcode
Lat
Long
SourceURL
Source
Type
Phone
Email
Website
With more data coming from different sources, I realized that there are many attributes for different types of venues.
For example
a hotel can have some attributes like
price_for_one_day, types_of_accommodation, Number_of_rooms etc
whereas schools will not have them but have a different set of attributes. Restaurants will have some other attributes.
My first idea is to create two tables called venue_attribute_names and venue_attributes:
##table venue_attribute_names
_____________________________
id
name
##table venue_attributes
________________________
id
venue_id
venue_attribute_name_id
value
So whenever I detect a new attribute, I create a row for it in venue_attribute_names and store its value in venue_attributes with a relation. But I doubt this is the correct approach. Is there a better approach for this? Besides, if the table grows huge there could be performance issues because of the increase in joins and SQL queries.
Is creating the widest possible table, with all possible attributes as columns, the right approach? Please let me know. If there are any links I could refer to, I will follow them. Thanks.
This is a surprisingly common problem.
The design you describe is commonly known as "Entity/Attribute/Value" or EAV. It has the benefit of allowing you to store all kinds of data without knowing in advance what the schema for that data is. It has the drawback of being hard to query - imagine finding all hotels in a given location, where the daily room rate is between $100 and $150, whose name starts with "Waldorf". Writing queries against all the attributes and applying boolean logic quickly becomes harder than you'd want it to be. You also can't easily apply database-level consistency checks like "hotel_name must not be null" or "daily_room_rate must be a number".
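To illustrate the querying pain with the venue_attributes schema from the question: the "Waldorf hotels between $100 and $150" example needs one self-join per attribute you filter on. A sketch (PDO; the venue table name, attribute ids and attribute names are assumed):

```php
<?php
// Sketch of the kind of query EAV forces, using the venue_attributes schema from the question.
// Every attribute you filter on becomes another join; attribute ids and names are assumed.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

$sql = "
    SELECT v.*
    FROM venues v
    JOIN venue_attributes name_attr
      ON name_attr.venue_id = v.id
     AND name_attr.venue_attribute_name_id = :name_attr_id   -- 'hotel_name'
     AND name_attr.value LIKE 'Waldorf%'
    JOIN venue_attributes rate_attr
      ON rate_attr.venue_id = v.id
     AND rate_attr.venue_attribute_name_id = :rate_attr_id   -- 'daily_room_rate'
     AND CAST(rate_attr.value AS DECIMAL(10,2)) BETWEEN 100 AND 150
    WHERE v.Type = 'hotel'
";

$stmt = $pdo->prepare($sql);
$stmt->execute(['name_attr_id' => 1, 'rate_attr_id' => 2]); // ids looked up from venue_attribute_names
$hotels = $stmt->fetchAll(PDO::FETCH_ASSOC);
```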
If neither of those concerns worries you, maybe your design works.
The second option is to store the "common" fields in a traditional relational structure, but to store the variant data in some kind of document - MySQL supports XML, for instance. That allows you to define an XML schema, and query using XPath etc.
This approach gives you better data integrity than EAV, because you can apply schema constraints. It does mean that you have to create a schema for each type of data you're dealing with. That might be okay for you - I'm guessing that the business doesn't add dozens of new venue types every week.
Performance with XML querying can be tricky, and general tooling and the development approach will make it harder to build than "just SQL".
The final option if you want to stick with a relational database is to simply bite the bullet and use "pure" SQL. You can create a "master" table with the common attributes, and a "restaurant" table with the restaurant-specific attributes, a "hotel" table with the hotel attributes. This works as long as you have a manageable number of venue types, and they don't crop up unpredictably.
Finally, you could look at NoSQL options.
If you are sticking with a relational database, that's it. The options you listed are pretty much what they can give you.
For your situation MongoDB (or another document-oriented NoSQL system) could be a good option. These DB systems are very good if you have a lot of records with different attributes.
I am trying to make my first application using PHP and MongoDB.
What I am trying to make is a listing of fruit trees in a given area, and also I want to have a list of fruit trees that could be grown in an area.
I come from a MySQL background, so in SQL I would have a trees table that holds information about trees. Then I would have a table of actual trees, which would be left joined on IDs with the trees table, and would also hold location and other information.
However in Mongo, I am not sure how this is done, or HOW it should be done.
Basically what I want is a list of trees, and then users can make a reference to their own tree.
Any help or direction on how to do this would be great.
$db->trees->fruittrees
$db->trees->userstrees
Chris
There are quite a few things missing from your question however:
What I am trying to make is a listing of Fruit trees in a given area, and Also I want to have a list of fruit trees that could be grown in an area.
Smells like a geospatial query: http://docs.mongodb.org/manual/core/geospatial-indexes/ .
So the user would have a location:
{
    name: 'sammaye',
    location: [107,206]
}
And each growing tree could take advantage of an array of areas:
{
    name: 'apple tree',
    locations_of_growth: [[74,59],[67,-45]]
}
And you would simply do a $near query on the user's location, using the Earth distance calculation ( http://docs.mongodb.org/manual/core/geospatial-indexes/#distance-calculation ), to work out what trees exist near that user.
The nice thing about geospatial queries is that they can be constrained with further information as well. Say you want to allow the user to compare the leaf colour of apple trees in the area, to find out how many apple trees nearby have green leaves: you can simply add that as a condition within the tree document and add it to your $near query.
There is one downside to a geospatial query: you can only have one geospatial index per collection ( http://docs.mongodb.org/manual/core/geospatial-indexes/#create-a-geospatial-index ). This means that if you also wish to provide a list of trees that can be planted in that area, you might need two collections, one called other_trees and one called growable_trees. When you want to allow the user to compare their tree to others in the area you query other_trees, and when you want to show what trees can be grown in the area you query growable_trees.
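A sketch of what that could look like with the legacy PHP Mongo driver used in the question (collection and field names follow the examples above and are otherwise assumed):

```php
<?php
// Sketch with the legacy PHP Mongo driver (as in the question); collection and field names
// follow the examples above and are otherwise assumed.
$m  = new MongoClient();
$db = $m->treedb;

// One 2d geospatial index per collection, on the growth locations.
$db->other_trees->ensureIndex(['locations_of_growth' => '2d']);

$user = $db->users->findOne(['name' => 'sammaye']);

// Trees growing near the user, with an extra condition combined into the same query.
$cursor = $db->other_trees->find([
    'locations_of_growth' => [
        '$near'        => $user['location'],
        '$maxDistance' => 10,
    ],
    'leaf_colour' => 'green',
]);

foreach ($cursor as $tree) {
    echo $tree['name'], "\n";
}
```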
This is of course just one way of doing this.
Edit
I would not "recommend" using Doctrine; it is up to you and it depends on your scenario. Doctrine is a very heavy and surprisingly slow ORM, and there are a lot of faster and more lightweight ORMs for PHP out there: http://docs.mongodb.org/ecosystem/drivers/php-libraries/ - if you want to use an ORM at all, that is; maybe you have a full-stack framework, or you want to go it alone with the driver.
Also, MongoDB does not require "JSON PHP handling" knowledge (whatever that is). All results from MongoDB come into the driver as BSON and are then deserialized into a standard structure within PHP, namely an associative array. You do not need JSON handling knowledge here.
Also, unlike other answers, be very wary about "store all that information in a single JSON document", partly because MongoDB is BSON rather than JSON, but mainly because embedding should be thought about very carefully and judged on whether it fits your scenario. In fact you will find that, in most cases, embedding anything more than the _ids of other rows can cause problems, but as I said, it is scenario dependent.
MongoDB is a document-oriented database. This means you should design your model completely differently from what you are used to in MySQL or another relational database. For example, if you want to store user information (including the trees that they have) you should store all that information in a single JSON document. For more reference please read this.
I also recommend using Doctrine, which lets you store objects in your MongoDB, so it will be far easier to model your database; you will only need to create an object-oriented model that represents your objects. Please refer to this page.
For this you will need knowledge of JSON handling in PHP and object-oriented programming. One thing that always helps at the beginning of the transition from MySQL to a NoSQL database is to model your database and then denormalize the tables; that way you will know all the information you need to put in your objects/documents.
Hope this helps you
I've found quite often when modeling data in Mongo it depends on how you will use the data in the UI. Mongo isn't good at doing joins (as you know), but you can still do them. You can do them in a middle tier or in the UI.
So for your example, I would probably do the following:
A tree in the tree collection:
{
    "_id" : 123,
    "name" : "Orange",
    "price" : 10,
    "climate" : "Any"
}
User:
{
    "user" : "Bob",
    "selected_trees" : [ 123, 124, 156 ]
}
Then in your UI, you load Bob and make one call to the available tree collection to get the details of the trees for him.
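In PHP that middle-tier "join" is just two queries (legacy Mongo driver, collection names assumed):

```php
<?php
// Sketch of the "join in the middle tier" approach (legacy PHP Mongo driver, names assumed).
$m  = new MongoClient();
$db = $m->treedb;

// 1) Load the user document.
$bob = $db->users->findOne(['user' => 'Bob']);

// 2) One extra query pulls the details of his selected trees.
$trees = iterator_to_array(
    $db->trees->find(['_id' => ['$in' => $bob['selected_trees']]])
);
```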
Also a common practice is to actually include the full tree object with the user. You will recoil in disgust at this idea coming from a relational background, but it's very common with Mongo. You have copies of the same data all over. That is normal and expected.
Symfony ACL allows me to grant access to an entity, and then check it:
if (false === $securityContext->isGranted('EDIT', $comment)) {
    throw new AccessDeniedException();
}
However, if I have thousands of entities in the database and the user has access only to 10 of them, I don't want to load all the entities in memory and hydrate them.
How can I do a simple "SELECT * FROM X" while filtering only on the entities the user has access to (at the SQL level)?
Well there it is: it's not possible.
In the last year I've been working on an alternative ACL system that allows filtering directly in database queries.
My company recently agreed to open source it, so here it is: http://myclabs.github.io/ACL/
As pointed out by @gregor in the previous discussion,
In your first query, get a list (with a custom query) of all the object_identity_ids (for a specific entity/class X) a user has access to.
Then, when querying a list of objects for entity/class X, add "IN (object_identity_ids)" to your query.
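As a sketch with Doctrine's query builder (how you load the allowed ids from the ACL tables is up to you; the loader call below is only a placeholder, and Comment is just an example entity):

```php
<?php
// Sketch: restrict the query to the object identity ids the user has access to.
// $aclIdLoader->getObjectIdentityIds() is a placeholder for your own custom query
// against the acl_* tables.
$allowedIds = $aclIdLoader->getObjectIdentityIds($user, Comment::class, 'VIEW');

$comments = $entityManager->createQueryBuilder()
    ->select('c')
    ->from(Comment::class, 'c')
    ->where('c.id IN (:ids)')
    ->setParameter('ids', $allowedIds)
    ->getQuery()
    ->getResult();
```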
Matthieu, I wasn't satisfied with replying with more conjecture (since my conjectures don't add anything valuable to the conversation), so I did some benchmarking of this approach (Digital Ocean $5/mo VPS).
As expected, table size doesn't matter when using the IN array approach. But a big array size indeed makes things get out of control.
So, Join approach vs IN array approach?
JOIN is indeed better when the array size is huge. BUT this assumes that we shouldn't consider the table size. It turns out that in practice the IN array approach is faster - except when there's a large table of objects and the ACL entries cover almost every object (see the linked question).
I've expanded on my reasoning on a separate question. Please see When using Symfony's ACL, is it better to use a JOIN query or an IN array query?
You could have a look at Doctrine filters. That way you can extend all queries. I have not done this yet and there are some limitations documented, but maybe it helps you. You'll find a description of the ACL database tables here.
UPDATE
Each filter will return a string and all those strings will be added to the SQL queries like so:
SELECT ... FROM ... WHERE ... AND (<result of filter 1> AND <result of filter 2> ...)
Also, the table alias is passed to the filter method, so I think you can add subqueries here to filter your entities.
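As a rough, untested sketch, such a filter class could look like this; the acl_* table and column names are taken from the standard Symfony ACL schema linked above and should be double-checked, and the entity class is just an example:

```php
<?php
// Rough, untested sketch of a Doctrine SQL filter that appends an ACL constraint.
// The acl_* table/column names are from the standard Symfony ACL schema and should be
// verified against the link above; Comment is just an example entity, and a real filter
// would also check the mask and granting columns.
use Doctrine\ORM\Mapping\ClassMetadata;
use Doctrine\ORM\Query\Filter\SQLFilter;

class AclFilter extends SQLFilter
{
    public function addFilterConstraint(ClassMetadata $targetEntity, $targetTableAlias)
    {
        if ($targetEntity->getName() !== Comment::class) {
            return ''; // leave other entities alone
        }

        // getParameter() returns the value already quoted for SQL.
        return sprintf(
            '%s.id IN (
                SELECT oid.object_identifier
                FROM acl_object_identities oid
                INNER JOIN acl_classes c ON c.id = oid.class_id
                INNER JOIN acl_entries e ON e.object_identity_id = oid.id
                INNER JOIN acl_security_identities sid ON sid.id = e.security_identity_id
                WHERE c.class_type = %s AND sid.identifier = %s
            )',
            $targetTableAlias,
            $this->getParameter('class_type'),
            $this->getParameter('sid')
        );
    }
}

// Enable it once, e.g. in a listener or controller:
// $em->getConfiguration()->addFilter('acl', AclFilter::class);
// $filter = $em->getFilters()->enable('acl');
// $filter->setParameter('class_type', Comment::class);
// $filter->setParameter('sid', $sidIdentifier);
```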
I have a Postgres DB containing some configuration data spread over several tables.
These configurations need to be tested before they get deployed to the production system.
Now I'm looking for a way to
store single configuration objects with their child entities in SVN, and
deploy these objects with their child entities to different target DBs.
The point is that the relations between the objects need to be maintained somehow without the actual IDs, which would cause conflicts when copying the data to another DB.
For example, if the database contained data about music artists, albums and tracks with a simple tree table schema like artist -> has albums -> has tracks, then the solution I'm looking for would allow exporting e.g. one selected album with all its tracks (or one artist with all albums and all tracks) into one file which could be stored in SVN and later be 'deployed' to whatever DB has the same schema.
I was thinking of implementing something myself, e.g. a config file describing the dependencies, and an export script which replaces IDs with PHP variables and generates some kind of PHP-SQL INSERT or UPDATE script.
But then I thought it would be really silly not to ask before to double check if something like this already exists :o)
This is one of the arguments for Natural Keys. An album has an artist and is made up of tracks. No "id" necessary to link these pieces of information together, just use the names. Perl-esque example of a data file:
"Bob Artist" => {
"First Album" => ["My Best Song", "A Slow Song",],
"Comeback Album" => ["One-Hit Wonder", "ORM Blues",],
}, "Noname Singer" => {
"Parse This Record!" => ["Song Named 'D'",],
}
To add the data, just walk the tree creating INSERT statements based on each level of parent data, and if you must have an id, use "RETURNING id" (a PostgreSQL extension) at the end of each INSERT statement to get the auto-generated ids to pass to the next level down in the tree.
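A minimal PHP/PDO sketch of that tree walk, using the data above (table and column names are assumed):

```php
<?php
// Sketch of the tree walk: insert each level and feed the generated id to the next one
// via PostgreSQL's RETURNING. Table and column names are assumed.
$pdo = new PDO('pgsql:host=localhost;dbname=music', 'user', 'pass');

$data = [
    'Bob Artist' => [
        'First Album'    => ['My Best Song', 'A Slow Song'],
        'Comeback Album' => ['One-Hit Wonder', 'ORM Blues'],
    ],
    'Noname Singer' => [
        'Parse This Record!' => ["Song Named 'D'"],
    ],
];

$insertArtist = $pdo->prepare('INSERT INTO artists (name) VALUES (:name) RETURNING id');
$insertAlbum  = $pdo->prepare('INSERT INTO albums (artist_id, name) VALUES (:artist_id, :name) RETURNING id');
$insertTrack  = $pdo->prepare('INSERT INTO tracks (album_id, name) VALUES (:album_id, :name)');

foreach ($data as $artist => $albums) {
    $insertArtist->execute(['name' => $artist]);
    $artistId = $insertArtist->fetchColumn();

    foreach ($albums as $album => $tracks) {
        $insertAlbum->execute(['artist_id' => $artistId, 'name' => $album]);
        $albumId = $insertAlbum->fetchColumn();

        foreach ($tracks as $track) {
            $insertTrack->execute(['album_id' => $albumId, 'name' => $track]);
        }
    }
}
```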
I second Matthew's suggestion. As a refinement of that concept, you may want to create "derived natural keys", for example "bob_artist" for "Bob Artist". The derived natural key would be well suited as a filename when storing the record into svn, for example.
The derived natural key should be generated such that any two different natural keys result in different derived natural keys. That way conflicts can't happen between independent datasets.
The concept of Rails migrations seems relevant although it aims mainly on performing schema updates: http://guides.rubyonrails.org/migrations.html
The idea has been transferred to PHP under the name Ruckusing, but it seems to support only MySQL at this point: http://code.google.com/p/ruckusing/wiki/BigPictureOverview
Doctrine also provides migrations functionality, but again it seems to focus on schema transformations rather than on migrating or deploying data: http://www.doctrine-project.org/projects/migrations/2.0/docs/en
Possibly Ruckusing or Doctrine could be used (abused?), or if needed modified / extended, to do the job?
I am lost on how to best approach the site search component. I have a user-content site similar to Yelp. People can search for local places, local events, local photos, members, etc. So if I enter "Tom" in the search box I expect the search to return results from all user objects that match Tom. Now the word Tom can be anywhere: in a restaurant name, in the description of the restaurant, in a review, in someone's comment, etc.
So if I design this purely using normalized SQL, I will need to join about 15 object tables to scan all the different user objects, plus scan multiple columns in each table to search all the fields. Now I don't know if this is how it is done normally or if there is a better way. I have seen stuff like Solr/Apache/Elasticsearch but I am not sure how these fit my use case, and even if I use them, I assume I still need to scan all the 15 tables + 30-40 columns, correct? My platform is PHP/MySQL. Also, is there any coding / component architecture / DB design practice to follow for this? A friend said I should combine all objects into one table, but that won't work as you can't combine photos, videos, comments, pages, profiles, etc. into one table, so I am lost on how to implement this.
Probably your friend meant combining all the searchable fields into one table.
The basic idea would be to create a table that acts as the index. One column is indexable and stores words, whereas the other column contains a list of references to the objects that contain that word in one of their searchable fields (for example, an object may be a picture, and its searchable fields might be title and comments).
The list of references can be stored in many ways. You could, for example, have a string of variable length, say a BLOB, and store in it a JSON-encoded array of the ids & types of the objects, so that you could easily find them afterwards by looking up each id in the table corresponding to its object type.
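A minimal sketch of such an index table in PHP/MySQL (table, column and function names are assumed, and nothing beyond lowercasing is done to the words):

```php
<?php
// Sketch of a single "search index" table: one row per word, with a JSON-encoded list
// of references to the objects containing it. Names are assumed; no stemming or
// locking is shown.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

$pdo->exec('
    CREATE TABLE IF NOT EXISTS search_index (
        word VARCHAR(64) NOT NULL PRIMARY KEY,
        refs BLOB        NOT NULL
    )
');

// Indexing side: merge a new {type, id} reference into the word entry.
function indexWord(PDO $pdo, string $word, string $type, int $id): void
{
    $word = strtolower($word);

    $stmt = $pdo->prepare('SELECT refs FROM search_index WHERE word = ?');
    $stmt->execute([$word]);

    $json   = $stmt->fetchColumn();
    $refs   = $json ? json_decode($json, true) : [];
    $refs[] = ['type' => $type, 'id' => $id];

    $stmt = $pdo->prepare(
        'INSERT INTO search_index (word, refs) VALUES (?, ?)
         ON DUPLICATE KEY UPDATE refs = VALUES(refs)'
    );
    $stmt->execute([$word, json_encode($refs)]);
}

// Search side: one indexed lookup instead of a 15-table join.
$stmt = $pdo->prepare('SELECT refs FROM search_index WHERE word = ?');
$stmt->execute([strtolower('Tom')]);
$hits = json_decode($stmt->fetchColumn() ?: '[]', true);
// e.g. [['type' => 'restaurant', 'id' => 12], ['type' => 'comment', 'id' => 873]]
```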
Of course, on any addition / removal / modification of indexable data, you should update your index accordingly (but you can use lazy update techniques that eventually update the index in the background; most people expect indexes to be accurate to within maybe a few minutes of the current state of the data). One implementation of such an index is Apache Cassandra, but I wouldn't use it for small-scale projects where you don't need distributed databases and such.