Storing language dictionaries in a database - PHP

I'm creating a language app that currently only features Mandarin Chinese and Spanish.
Currently, I have a self-created dictionary simply loaded as JSON without storing it in the DB, but I've found full downloadable dictionaries, such as CEDICT for Chinese, to provide the definitions for me. The catch is that this file is 115k rows long, with 6 columns per row.
I also need to do this for Spanish, and then every other language I plan on including.
Notes:
MySQL DB
Laravel ORM (PHP)
Given that, what's the best way to store this data?
I'm assuming as separate tables, dictionary_zh, dictionary_es, but I could also store each dictionary in a single dictionary table, with an added column for locale, and query based on that. This SO answer states that 1m records isn't "too much" for a table to handle; it simply depends on how you index the table.
Btw, does anyone have a recommendation for a good downloadable Spanish-English dictionary?
Note: I'm downloading the dictionary and cutting it up into something I can load as a CSV:
Traditional | Simplified | Pinyin | Meaning      | Level | Quest
佟          | 佟         | Tong2  | surname Tong | 1     | 2
...
I'm translating it by simply passing in the identifying character, in this case 佟, and grabbing its Meaning.
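Roughly, what the lookup does now (a simplified sketch; the file name is made up, the column order matches the CSV above):
<?php
// Load the parsed CEDICT CSV into an array keyed by the simplified
// character, then grab its Meaning.
$dictionary = [];
$handle = fopen('cedict.csv', 'r');
fgetcsv($handle); // skip the header row
while (($row = fgetcsv($handle)) !== false) {
    // [Traditional, Simplified, Pinyin, Meaning, Level, Quest]
    $dictionary[$row[1]] = $row[3];
}
fclose($handle);
echo $dictionary['佟']; // "surname Tong"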

I would store each dictionary in a separate table and abstract how the definition for a word is fetched depending on the locale, so that callers don't need to know how a dictionary (mapped as a Dictionary type) performs its translation. This is useful when you have dictionaries which don't reside in your DB, such as ones translating via an API.
The method translate() is implemented differently for each type of Dictionary (in your case ChineseDictionary or SpanishDictionary).
Another advantage of this approach from a data management point of view is that you will not have to make a lot of operations on the data when new versions of your dictionary are released, which makes it cheap to maintain.
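A minimal sketch of that abstraction, assuming Laravel's query builder and invented table/column names (dictionary_zh, simplified, meaning):
<?php
use Illuminate\Support\Facades\DB;

// Callers only ever see Dictionary::translate(); each locale supplies
// its own implementation.
interface Dictionary
{
    // Returns the definition for the given word, or null if unknown.
    public function translate(string $word): ?string;
}

// Backed by its own table, e.g. dictionary_zh.
class ChineseDictionary implements Dictionary
{
    public function translate(string $word): ?string
    {
        $entry = DB::table('dictionary_zh')
            ->where('simplified', $word)
            ->first();

        return $entry->meaning ?? null;
    }
}

// A dictionary that doesn't live in the DB at all.
class ApiDictionary implements Dictionary
{
    public function translate(string $word): ?string
    {
        // ... call a remote translation API here ...
        return null; // placeholder
    }
}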

Related

Translation of model properties

I am developing an MVC framework (please don't question that, I know..) and am currently designing a translation mechanism to make translating applications as painless as possible. So far I have a lang folder which contains translation files for different pages:
./lang/en/system.php
./lang/es/system.php
./lang/fr/system.php
and so on. Let's say this file contains translations of system messages such as
./lang/en/system.php
<?php
return array(
    'yourIP' => 'Your IP address is :1'
);
To access that in a page I will use a facade class Lang, which will fetch the file based on the selected language (stored in the session) and give me the translations.
Controller
public function index() {
    return new View('index', ['translations' => Lang::get('system')]);
}
View
<h1><?= $translations->get('yourIP', System::getClientIP()) ?></h1>
This seems to work pretty fast as I can group translations efficiently in separate files for separate modules/pages.
The problem I am trying to solve now is translating models. For example, let's say I'm building a multilingual blog and saving posts in a database. Each post will need translations of its own, but theoretically there can be an unlimited number of posts. The current method I'm using does not seem very practical.
What I will have to do is create a child directory and store translations like
./lang/en/posts/post-1.php
./lang/en/posts/post-2.php
...
./lang/en/posts/post-n.php
And that would be for every language; in each such file I will store all the translatable (is this a word?) fields of the model and load it in the model's constructor.
Problems regarding this solution:
The filesystem will get stuffed with lots of very small files - I'm not really a filesystem expert, so I would like to ask whether having a large number of small files like that can harm the filesystem itself, including slowing down reads and such.
There will be n filesystem reads when retrieving a set of models, where n is the number of models. The hard drive is the slowest component in a computer, and performing lots of FS reads in a script will cause a significant slowdown; with SSDs maybe not as much, but it is still not a minor problem.
The other solution I came up with is to use an additional system database table, which will store translations by table and primary key, something like
table INT
model_pk INT
lang INT
translations TEXT
where table will be a CRC32 hash of the name of the table the translations belong to, model_pk will be the PK (id) of the model, lang needs no explanation, and translations will be a serialized string containing all translatable properties.
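Roughly, loading a model's translations would then look like this (a sketch; the table name system_translations is made up):
<?php
// Fetch and unserialize the translated properties for one model.
function loadTranslations(PDO $db, string $table, int $pk, int $lang): array
{
    $stmt = $db->prepare(
        'SELECT translations FROM system_translations
         WHERE `table` = :t AND model_pk = :pk AND lang = :lang'
    );
    $stmt->execute([
        't'    => crc32($table), // CRC32 of the table name, as described above
        'pk'   => $pk,
        'lang' => $lang,
    ]);

    $row = $stmt->fetchColumn();

    return $row === false ? [] : unserialize($row);
}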
Problems with this approach:
Forces the developer to use a database and obligates them to have a certain table (currently the framework does not require you to have a database, and thus there are no system tables even when you actually use one).
Models with composite primary keys will not be able to benefit from this, since the model_pk column cannot store a composite key, so only models with a single-column primary key will be translatable.
These are just my observations and thoughts; I may be wrong or I may be missing something. I'm posting this question to get advice from someone with greater experience on which solution will be less problematic, or to have a completely different one proposed. I'm open to everything.
If you want to build something great, first you must understand that what you have made so far is called localization (setting the app language, which drives the translation of static data); you still need translation (which means translating dynamic data, such as data coming from the database).
For more details: http://content.lionbridge.com/the-difference-between-translation-and-localization-for-multilingual-website-projects-definitions/
For more about localization: https://laravel.com/docs/5.3/localization
and see also https://github.com/mcamara/laravel-localization
For translation, you may find this package for the Laravel framework interesting: https://github.com/dimsav/laravel-translatable

Right MySQL table design/relations in this scenario

I have a situation where I need suggestions on database table design.
BACKGROUND
I am developing an application in PHP (CakePHP, to be precise) where we upload an XML file, and the application parses the file and saves the data in the database. These XML sources can be files or URL feeds, and they are purchased from various data suppliers. The intent is to collect venue data from the source URLs; venues can be anything like hotels, cinemas, schools, restaurants, etc.
Problem
The initial table structure for these venues is below. The table is designed to store generic information initially.
id
Address
Postcode
Lat
Long
SourceURL
Source
Type
Phone
Email
Website
With more data coming from different sources, I realized that there are many attributes for different types of venues.
For example
a hotel can have some attributes like
price_for_one_day, types_of_accommodation, number_of_rooms, etc.,
whereas schools will not have them but will have a different set of attributes, and restaurants will have yet another set.
My first idea is to create two tables called venue_attribute_names and venue_attributes:
##table venue_attribute_names
_____________________________
id
name
##table venue_attributes
________________________
id
venue_id
venue_attribute_name_id
value
So whenever I detect a new attribute, I create a row for it in venue_attribute_names and store its value in venue_attributes with a relation. But I doubt this is the correct approach, and I believe there may be a better one. Besides, if the table grows huge, there could be performance issues because of the increase in joins and SQL queries.
Is creating the widest possible table, with all possible attributes as columns, the right approach? Please let me know; if there are any links I could refer to, I will follow them. Thanks.
This is a surprisingly common problem.
The design you describe is commonly known as "Entity/Attribute/Value" or EAV. It has the benefit of allowing you to store all kinds of data without knowing in advance what the schema for that data is. It has the drawback of being hard to query - imagine finding all hotels in a given location, where the daily room rate is between $100 and $150, whose name starts with "Waldorf". Writing queries against all the attributes and applying boolean logic quickly becomes harder than you'd want it to be. You also can't easily apply database-level consistency checks like "hotel_name must not be null", or "daily_room_rate must be a number".
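To make the querying pain concrete, the "Waldorf" example might look something like this against the venue_attributes design sketched in the question (a sketch; the attribute-id placeholders are invented, and each filtered attribute costs another self-join):
<?php
// One join against venue_attributes per attribute you filter on.
$sql = <<<SQL
SELECT v.*
FROM venues v
JOIN venue_attributes name_attr
  ON name_attr.venue_id = v.id
 AND name_attr.venue_attribute_name_id = :name_attr_id
 AND name_attr.value LIKE 'Waldorf%'
JOIN venue_attributes rate_attr
  ON rate_attr.venue_id = v.id
 AND rate_attr.venue_attribute_name_id = :rate_attr_id
 -- value is a string, so a numeric comparison needs a cast
 AND CAST(rate_attr.value AS DECIMAL(10,2)) BETWEEN 100 AND 150
WHERE v.Postcode = :postcode
SQL;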
If neither of those concerns worry you, maybe your design works.
The second option is to store the "common" fields in a traditional relational structure, but to store the variant data in some kind of document - MySQL supports XML, for instance. That allows you to define an XML schema, and query using XPath etc.
This approach gives you better data integrity than EAV, because you can apply schema constraints. It does mean that you have to create a schema for each type of data you're dealing with. That might be okay for you - I'm guessing that the business doesn't add dozens of new venue types every week.
Performance with XML querying can be tricky, and general tooling and the development approach will make it harder to build than "just SQL".
The final option if you want to stick with a relational database is to simply bite the bullet and use "pure" SQL. You can create a "master" table with the common attributes, and a "restaurant" table with the restaurant-specific attributes, a "hotel" table with the hotel attributes. This works as long as you have a manageable number of venue types, and they don't crop up unpredictably.
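A sketch of that last layout (column names are invented for illustration):
<?php
// "Master" table holds the common fields; one child table per venue
// type shares the master's id (Class Table Inheritance).
$schema = <<<SQL
CREATE TABLE venues (
    id       INT PRIMARY KEY AUTO_INCREMENT,
    address  VARCHAR(255),
    postcode VARCHAR(16),
    lat      DECIMAL(10, 7),
    lng      DECIMAL(10, 7),
    type     VARCHAR(32)
);

CREATE TABLE hotels (
    venue_id          INT PRIMARY KEY,
    price_for_one_day DECIMAL(10, 2),
    number_of_rooms   INT,
    FOREIGN KEY (venue_id) REFERENCES venues (id)
);

CREATE TABLE restaurants (
    venue_id INT PRIMARY KEY,
    cuisine  VARCHAR(64),
    FOREIGN KEY (venue_id) REFERENCES venues (id)
);
SQL;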
Finally, you could look at NoSQL options.
If you are sticking with a relational database, that's it; the options listed above are pretty much what it can give you.
For your situation, MongoDB (or another document-oriented NoSQL system) could be a good option. These DB systems are very good when you have a lot of records with different attributes.

How to store & deploy chunks of relational data?

I have a Postgres DB containing some configuration data spread over several tables.
These configurations need to be tested before they get deployed to the production system.
Now I'm looking for a way to
store single configuration objects with their child entities in SVN, and
to deploy these objects with their child entities to different target DBs.
The point is that the relations between the objects need to be maintained somehow, without the actual IDs, which would cause conflicts when copying the data to another DB.
For example, if the database contained data about music artists, albums and tracks with a simple tree schema like artist -> has albums -> has tracks, then the solution I'm looking for would allow exporting e.g. one selected album with all its tracks (or one artist with all albums and all tracks) into one file, which could be stored in SVN and later be 'deployed' to whatever DB has the same schema.
I was thinking of implementing something myself, e.g. having a config file describing the dependencies, and an export script which replaces IDs with PHP variables and generates some kind of PHP-SQL INSERT or UPDATE script.
But then I thought it would be really silly not to ask first, to double-check whether something like this already exists :o)
This is one of the arguments for Natural Keys. An album has an artist and is made up of tracks; no "id" is necessary to link these pieces of information together, just use the names. A Perl-esque example of a data file:
"Bob Artist" => {
"First Album" => ["My Best Song", "A Slow Song",],
"Comeback Album" => ["One-Hit Wonder", "ORM Blues",],
}, "Noname Singer" => {
"Parse This Record!" => ["Song Named 'D'",],
}
To add the data, just walk the tree creating INSERT statements based on each level of parent data, and if you must have an id, use "RETURNING id" (a PostgreSQL extension) at the end of each INSERT statement to get the auto-generated id to pass to the next level down in the tree.
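A rough PHP/PDO version of that walk (the table and column names are made up for the artist -> album -> track example):
<?php
// Walk the natural-key tree, inserting each level and passing the
// generated id down. RETURNING is a PostgreSQL extension.
function importArtists(PDO $db, array $tree): void
{
    foreach ($tree as $artistName => $albums) {
        $stmt = $db->prepare('INSERT INTO artists (name) VALUES (:n) RETURNING id');
        $stmt->execute(['n' => $artistName]);
        $artistId = $stmt->fetchColumn();

        foreach ($albums as $albumName => $tracks) {
            $stmt = $db->prepare(
                'INSERT INTO albums (artist_id, name) VALUES (:a, :n) RETURNING id'
            );
            $stmt->execute(['a' => $artistId, 'n' => $albumName]);
            $albumId = $stmt->fetchColumn();

            foreach ($tracks as $trackName) {
                $db->prepare('INSERT INTO tracks (album_id, name) VALUES (:a, :n)')
                   ->execute(['a' => $albumId, 'n' => $trackName]);
            }
        }
    }
}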
I second Matthew's suggestion. As a refinement of that concept, you may want to create "derived natural keys", e.g. "bob_artist" for "Bob Artist". The derived natural key would be well suited as a filename when storing the record in SVN, for example.
The derived natural key should be generated such that any two different natural keys result in different derived natural keys. That way conflicts can't happen between independent datasets.
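A trivial way to derive such a key (just an illustration; note that a scheme this naive maps "Bob Artist" and "Bob-Artist" to the same key, so a real one must be strict enough for your data to preserve that uniqueness property):
<?php
// "Bob Artist" -> "bob_artist": lowercase, runs of non-alphanumerics
// collapsed to single underscores.
function derivedKey(string $naturalKey): string
{
    return trim(preg_replace('/[^a-z0-9]+/', '_', strtolower($naturalKey)), '_');
}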
The concept of Rails migrations seems relevant, although it aims mainly at performing schema updates: http://guides.rubyonrails.org/migrations.html
The idea has been ported to PHP under the name Ruckusing, but it seems to support only MySQL at this point: http://code.google.com/p/ruckusing/wiki/BigPictureOverview
Doctrine also provides migrations functionality, but again it seems to focus on schema transformations rather than on migrating or deploying data: http://www.doctrine-project.org/projects/migrations/2.0/docs/en
Possibly Ruckusing or Doctrine could be used (abused?), or if needed, modified / extended to do the job?

Sitewide multi object search - database design / code strategy?

I am lost on how best to approach the site search component. I have a user-content site similar to Yelp. People can search for local places, local events, local photos, members, etc. So if I enter "Tom" in the search box, I expect the search to return results from all user objects that match Tom. The word Tom can be anywhere: in a restaurant name, in the description of the restaurant, in a review, in someone's comment, etc.
So if I design this using purely normalized SQL, I will need to join about 15 object tables to scan all the different user objects, plus scan multiple columns in each table to cover all the fields. I don't know if this is how it is normally done, or whether there is a better way. I have seen tools like Solr/Apache/Elasticsearch, but I am not sure how they fit my use case, and even if I use them, I assume I still need to scan all 15 tables and 30-40 columns, correct? My platform is PHP/MySQL. Also, is there any coding / component architecture / DB design practice to follow for this? A friend said I should combine all objects into one table, but that won't work, as you can't combine photos, videos, comments, pages, profiles, etc. into one table, so I am lost on how to implement this.
Probably your friend meant combining all the searchable fields into one table.
The basic idea would be to create a table that acts as the index. One column is indexable and stores words, whereas the other column contains a list of references to objects that contain that word in one of those fields (for example, an object may be a picture, and its searchable fields might be title and comments).
The list of references can be stored in many ways; you could, for example, have a string of variable length, say a BLOB, and store in it a JSON-encoded array of the ids and types of the objects, so that you could easily find each of them afterwards by looking up its id in the table corresponding to its type.
Of course, on any addition / removal / modification of indexable data you should update your index accordingly (though you can use lazy techniques that update the index in the background, since most people expect an index to be accurate only to within a few minutes of the current state of the data). One implementation of such an index is Apache Cassandra, but I wouldn't use it for small-scale projects where you don't need distributed databases and such.
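A sketch of such an index table and lookup (all names are illustrative):
<?php
// One row per word; refs holds a JSON-encoded list of matching objects,
// e.g. [{"type": "photo", "id": 42}, {"type": "comment", "id": 7}].
//
// CREATE TABLE search_index (
//     word VARCHAR(64) PRIMARY KEY,
//     refs BLOB
// );
function search(PDO $db, string $word): array
{
    $stmt = $db->prepare('SELECT refs FROM search_index WHERE word = :w');
    $stmt->execute(['w' => mb_strtolower($word)]);
    $refs = $stmt->fetchColumn();

    return $refs === false ? [] : json_decode($refs, true);
}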

PHP/MySQL database design for various/variable content - modular system

I'm trying to build (right now just thinking/planning/drawing relations :] ) a little modular system for building basic websites (mostly to simplify the common tasks we as web designers do routinely).
I got a little stuck with the database design / the whole idea of storing content.
1. What is most painful on most websites (from my experience) are pages with a quasi-identical layout/skeleton but different information - e.g. a title, a picture, and a set of information. Making special templates / special modules in the CMS tends to cost more energy than editing it as text - however, there we lose some operational potential: we can't get "only the titles", because the CMS/system understands the whole content as one text field.
So, I would like to have two tables - one to hold the information about what structure the content has (e.g. a variable number of photos <1;500> :], title & text & photo (large) & gallery) - the HOW - and another table with all the content, modules, and parts of "collections" (my working name for variously structured information) - the WHAT.
table module_descriptors (HOW)
id int
structure - *???*
table modules (WHAT)
id int
module_type - #link to module_descriptors id
content - *???*
2. What I like about this is that I don't need many tables - I don't like databases with 6810 tables, one for each module, for its description, for misc. number-to-text relations, ... and I also don't like tables with 60 columns, like content_us, content_it, category_id, parent_id.
I'm thinking I could hold the structure description and the content itself (marked ??? above) as either XML or CSV, but maybe I'm trying to reinvent the wheel and the answer is hidden in some design pattern I haven't looked into.
Hope I make some sense and get some replies - give me your opinion, pros, cons... or send me to hell. Thank you.
EDIT: My question is also this: Does this approach make sense? Is it edit-friendly? Isn't there something better? Is it moral? Don't kittens die when I do this? Isn't it too much for the server if I want to read and compare 30 XMLs pulled from the DB (e.g. when I want to compare something)? The technical part - how to do it - is just one part of the question :)
The design pattern you're hinting at is called Serialized LOB. You can store some data in the conventional way (as columns) for attributes that are the same for every entry. For attributes that are variable, format them as XML or Markdown or whatever you want, and store them in a TEXT BLOB.
Of course you lose the ability to use SQL expressions to query individual elements within the BLOB. Anything you need to use in searching or sorting should be in conventional columns.
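A minimal sketch of that, reusing the modules table from the question (the title column and the content structure are invented for illustration):
<?php
// Fixed attributes live in real columns; everything page-specific is
// serialized into the content TEXT blob.
//
// CREATE TABLE modules (
//     id          INT PRIMARY KEY AUTO_INCREMENT,
//     module_type INT,          -- link to module_descriptors.id
//     title       VARCHAR(255), -- searchable/sortable: keep as a column
//     content     TEXT          -- serialized variable structure
// );
$content = serialize([
    'photos'  => ['a.jpg', 'b.jpg'],
    'gallery' => true,
]);
// ... INSERT the row, then later:
$data = unserialize($row['content']); // back to a PHP array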
Re comment: If your text blob is in XML format, you could search it with XML functions supported by MySQL 5.1 and later. But this cannot benefit from an index, so it's going to result in very slow searches.
The same is true if you try to use LIKE or RLIKE with wildcards. Without an index, searches will result in full table scans.
You could also try to use a MySQL FULLTEXT index, but this isn't a good solution for searching XML data, because it won't be able to tell the difference between text content and XML tag names and XML attributes.
So just use conventional columns for any fields you want to search or sort by. You'll be happier that way.
Re question: If your documents really require variable structure, you have few choices. When used properly, SQL assumes that every row has the same structure (that is, the same columns). Your alternatives are:
Single Table Inheritance or Concrete Table Inheritance or Class Table Inheritance
Serialized LOB
Non-relational databases
Some people resort to an antipattern called Entity-Attribute-Value (EAV) to store variable attributes, but honestly, don't go there. For a story about how bad this can go wrong, read this article: Bad CaRMa.
