Parsing URLs vs Storing URL Parts - php

I'm using PHP and MySQL, and I'm building a database that needs to store URLs. I'm going to have to do a lot of work with the parts of those URLs, and it's going to end up being millions of records.
My question is, what makes more sense:

- store the parts of the URL in several fields, negating the need to parse, or
- store the whole URL in one field, and parse it out every time?
Thanks for any advice you can offer!

The rule of thumb when designing a new database schema is not to denormalize until it is proven necessary.
So start with the most normalized, simplest schema. Only if you run into performance issues should you profile your application and solve the particular bottleneck.

It depends on your querying pattern. If you're going to do things like SELECT * FROM urls WHERE hostname = ..., then you obviously want the parts split into their own fields. If you're never going to slice and dice your data using queries, then storing just the full URL itself is fine. Either way, you never want to parse db-side; if you find yourself doing that, store the parsed data instead.
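A minimal sketch of the split-column approach, assuming a hypothetical urls table with one column per parse_url() component (the table, column names, and connection details are placeholders, not from the question):

    <?php
    // Hypothetical schema: urls(id, scheme, hostname, path, query, fragment, full_url)
    // Parse once in PHP at insert time, so the database never has to parse.
    $pdo = new PDO('mysql:host=localhost;dbname=test;charset=utf8mb4', 'user', 'pass');

    $url   = 'https://example.com/articles/42?ref=home#comments';
    $parts = parse_url($url);   // array of components, or false on malformed URLs

    $stmt = $pdo->prepare(
        'INSERT INTO urls (scheme, hostname, path, query, fragment, full_url)
         VALUES (?, ?, ?, ?, ?, ?)'
    );
    $stmt->execute([
        $parts['scheme']   ?? null,
        $parts['host']     ?? null,
        $parts['path']     ?? null,
        $parts['query']    ?? null,
        $parts['fragment'] ?? null,
        $url,
    ]);

    // Slicing by part is now a plain indexed lookup, with no parsing in SQL:
    $stmt = $pdo->prepare('SELECT full_url FROM urls WHERE hostname = ?');
    $stmt->execute(['example.com']);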

Database structure really depends on the queries you are planning to run.
If you need to search by URL parts such as the domain name, keep those parts somewhere else, outside the big urls table(s), so those queries run against a smaller table.
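For illustration, a rough sketch of that normalization with made-up hosts and urls tables: queries by domain hit the small hosts table first and join back to the big one.

    <?php
    // Hosts are stored once in a small table; the big urls table only keeps a key.
    $pdo = new PDO('mysql:host=localhost;dbname=test;charset=utf8mb4', 'user', 'pass');

    $pdo->exec('CREATE TABLE IF NOT EXISTS hosts (
        id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(255) NOT NULL UNIQUE
    )');
    $pdo->exec('CREATE TABLE IF NOT EXISTS urls (
        id      INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        host_id INT UNSIGNED NOT NULL,
        url     TEXT NOT NULL,
        KEY (host_id)
    )');

    // Queries by domain touch the small hosts table first, then join:
    $stmt = $pdo->prepare(
        'SELECT u.url FROM urls u JOIN hosts h ON h.id = u.host_id WHERE h.name = ?'
    );
    $stmt->execute(['example.com']);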

Related

Question about storing/retrieving data for a complex page

My stack is php and mysql.
I am trying to design a page to display details of a mutual fund.
Data for a single fund is distributed over 15-20 different tables.
Currently, my front-end is a brute-force PHP page that queries/joins these tables using 8 different queries for a single scheme. It's messy and performs poorly.
I am considering alternatives. Good thing is that the data changes only once a day, so I can do some preprocessing.
One option I am considering is to run these queries for every fund (about 2000 funds), build a complex JSON object for each of them, store it in MySQL indexed by fund code, then retrieve the JSON at run time and show the data. I am thinking of using MySQL's json_object() function to create the JSON, and json_decode() in PHP to get the values for display. Is this a good approach?
I was tempted to store them in a separate MongoDB store - would that be overkill for this?
Any other suggestion?
Thanks much!
To meet your objective of quick pageviews, your overnight-run approach is very good. You could generate JSON objects with your distilled data, or even prerendered HTML pages, and store them.
You can certainly store JSON objects in MySQL columns. If you don't need the database server to search the objects, simply use TEXT (or LONGTEXT) data types to store them.
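As a rough sketch of that pattern (the fund_json table, its columns, and the sample data are all made up for illustration):

    <?php
    // Overnight job: build one JSON document per fund, keyed by fund code.
    $pdo = new PDO('mysql:host=localhost;dbname=funds;charset=utf8mb4', 'user', 'pass');

    $pdo->exec('CREATE TABLE IF NOT EXISTS fund_json (
        fund_code  VARCHAR(20) PRIMARY KEY,
        doc        LONGTEXT NOT NULL,
        updated_at DATETIME NOT NULL
    )');

    // $fundData would come from your 8 existing queries, merged into one array.
    $fundData = ['code' => 'ABC123', 'name' => 'Example Growth Fund', 'nav' => 41.23];

    $stmt = $pdo->prepare(
        'REPLACE INTO fund_json (fund_code, doc, updated_at) VALUES (?, ?, NOW())'
    );
    $stmt->execute([$fundData['code'], json_encode($fundData)]);

    // Page view: one primary-key lookup plus a json_decode().
    $stmt = $pdo->prepare('SELECT doc FROM fund_json WHERE fund_code = ?');
    $stmt->execute(['ABC123']);
    $fund = json_decode($stmt->fetchColumn(), true);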
To my way of thinking, adding a new type of server (MongoDB) to your operations to store a few thousand JSON objects does not seem worth the trouble. If you find it necessary to search the contents of your JSON objects, however, another type of server might be useful.
Other things to consider:
Optimize your SQL queries. Read up: https://use-the-index-luke.com and other sources of good info. Consider your queries one-by-one starting with the slowest one. Use the EXPLAIN or even the EXPLAIN ANALYZE command to get your MySQL server to tell you how it plans each query. And judiciously add indexes. Using the query-optimization tag here on StackOverflow, you can get help. Many queries can be optimized by adding indexes to MySQL without changing anything in your php code or your data. So this can be an ongoing project rather than a big new software release.
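For example, something along these lines, where the holdings table and fund_code column are just placeholders for one of your slow queries:

    <?php
    // Ask MySQL how it plans a slow query, then add an index and compare.
    $pdo = new PDO('mysql:host=localhost;dbname=funds;charset=utf8mb4', 'user', 'pass');

    // EXPLAIN shows the access type, the candidate keys, and the estimated rows.
    $plan = $pdo->query(
        "EXPLAIN SELECT * FROM holdings WHERE fund_code = 'ABC123'"
    )->fetchAll(PDO::FETCH_ASSOC);
    print_r($plan);   // type=ALL with a large row estimate usually means a full scan

    // An index on the filter column often turns that scan into a cheap ref lookup.
    $pdo->exec('CREATE INDEX idx_holdings_fund_code ON holdings (fund_code)');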
Consider measuring your query times. You can do this with MySQL's slow query log. The point of this is to identify your "dirty dozen" slowest queries in a particular time period. Then, see step one.
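If your MySQL user has the privileges, the slow query log can be switched on at runtime; a small sketch (the one-second threshold is just an example, and you can set these in my.cnf instead):

    <?php
    // Requires SUPER or SYSTEM_VARIABLES_ADMIN; otherwise configure in my.cnf.
    $pdo = new PDO('mysql:host=localhost;dbname=funds;charset=utf8mb4', 'user', 'pass');
    $pdo->exec("SET GLOBAL slow_query_log = 'ON'");
    $pdo->exec('SET GLOBAL long_query_time = 1');   // log anything slower than 1 second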
Make your pages fill up progressively, to keep your users busy reading while you get the data they need. Put the toplevel stuff (fund name, etc) in server-side HTML so search engines can see it. Use some sort of front-end tech (React, maybe, or Datatables that fetch data via AJAX) to render your pages client-side, and provide REST endpoints on your server to get the data, in JSON format, for each data block in the page.
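A minimal endpoint for one data block might look roughly like this, reusing the hypothetical fund_json table from above (the script name and code parameter are made up):

    <?php
    // GET /fund.php?code=ABC123  ->  returns the precomputed JSON for one fund.
    header('Content-Type: application/json');

    $code = $_GET['code'] ?? '';
    $pdo  = new PDO('mysql:host=localhost;dbname=funds;charset=utf8mb4', 'user', 'pass');

    $stmt = $pdo->prepare('SELECT doc FROM fund_json WHERE fund_code = ?');
    $stmt->execute([$code]);
    $doc = $stmt->fetchColumn();

    if ($doc === false) {
        http_response_code(404);
        echo json_encode(['error' => 'unknown fund code']);
    } else {
        echo $doc;   // already JSON, no re-encoding needed
    }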
In your overnight run create a sitemap file along with your JSON data rows. That lets you control exactly how you want search engines to present your data.
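The sitemap step can be folded into the same overnight run; a sketch, assuming one detail page per fund at a made-up URL pattern and output path:

    <?php
    // Write a sitemap.xml listing every fund detail page, once per overnight run.
    $pdo   = new PDO('mysql:host=localhost;dbname=funds;charset=utf8mb4', 'user', 'pass');
    $codes = $pdo->query('SELECT fund_code FROM fund_json')->fetchAll(PDO::FETCH_COLUMN);

    $xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
    $xml .= '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
    foreach ($codes as $code) {
        $xml .= '  <url><loc>https://example.com/fund/' . rawurlencode($code) . '</loc></url>' . "\n";
    }
    $xml .= '</urlset>' . "\n";

    file_put_contents('/var/www/html/sitemap.xml', $xml);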

Reading a file or searching in a database?

I am creating a web-based app for Android, and I've reached the point of the account system. Previously I stored all data for a person inside a text file, located at users/<name>.txt. Now that I'm thinking about doing it in a database (like you probably should), wouldn't that take longer to load, since it has to look for the row where the name is equal to the input?
So, my question is: is it faster to read data from a text file, which is easy to open because the app knows its location, or would it be faster to get the information from a database, even though it would have to first scan line by line until it reaches the one with the correct name?
I don't care about the safety; I know the first option is not safe at all. It doesn't really matter in this case.
Thanks,
Merijn
In any question about performance, the first answer is usually: Try it out and see.
In your case, you are reading a file line-by-line to find a particular name. If you have only a few names, then the file is probably faster. With more lines, you could be reading for a while.
A database can optimize this using an index. Do note that the index will not have much effect until you have a fair amount of data (tens of thousands of bytes). The reason is that the database reads the records in units called data pages. So, it doesn't read one record at a time, it reads a page's worth of records. If you have hundreds of thousands of names, a database will be faster.
Perhaps the main performance advantage of a database is that after the first time you read the data, it will reside in the page cache. Subsequent access will use the cache and just read it from memory -- automatically, I might add, with no effort on your part.
The real advantage to a database is that it then gives you the flexibility to easily add more data, to log interactions, and to store other types of data the might be relevant to your application. On the narrow question of just searching for a particular name, if you have at most a few dozen, the file is probably fast enough. The database is more useful for a large volume of data and because it gives you additional capabilities.
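To make the comparison concrete, here is a rough sketch of both lookups; the file layout comes from the question, while the users table definition is an assumption:

    <?php
    // Option 1: one file per user -- the OS path lookup is effectively the "index".
    function loadUserFromFile(string $name): ?string
    {
        $path = 'users/' . basename($name) . '.txt';   // basename() blocks path tricks
        return is_file($path) ? file_get_contents($path) : null;
    }

    // Option 2: a users table with an index on name -- the DB reads only a few pages.
    function loadUserFromDb(PDO $pdo, string $name): ?array
    {
        // Assumes: CREATE TABLE users (id INT AUTO_INCREMENT PRIMARY KEY,
        //                               name VARCHAR(100), data TEXT, UNIQUE KEY (name));
        $stmt = $pdo->prepare('SELECT * FROM users WHERE name = ?');
        $stmt->execute([$name]);
        return $stmt->fetch(PDO::FETCH_ASSOC) ?: null;
    }

Note that neither approach scans line by line: the filesystem resolves the path directly, and the database walks the index on name.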
A bit of googling came up with this question: https://dba.stackexchange.com/questions/23124/whats-better-faster-mysql-or-filesystem
I think the answer suits this one as well.
The file system is useful if you are looking for a particular file, as operating systems maintain a sort of index. However, the contents of a txt file won't be indexed, which is one of the main advantages of a database. Another is understanding the relational model, so that data doesn't need to be repeated over and over. Another is understanding types. If you have a txt file, you'll need to parse numbers, dates, etc.
So - the file system might work for you in some cases, but certainly not all.
That's where database indexes come in.
You may wish to take a look at How does database indexing work? :)
The solution is quite simple: use a database.
Not because it's faster or slower, but because it has mechanisms to prevent data loss or corruption.
A failed write to the text file can happen, and you will lose a user's profile info.
With a database engine, it's much more difficult to lose data like that.
EDIT:
Also, a big question: is this about the server side or the app side?
Because on the app side you realistically won't have more than 100 users per smartphone... More likely you will have 1-5 users who share the phone and thus need their own profiles, and in the majority of cases you will have a single user.

Storing language and styles. What would be best? Files or DB (i18n)

I'm starting an Incident Tracking System for IT, and it's likely my first PHP project.
I've been designing it in my mind based on software I've seen, like vBulletin, and I'd like it to have i18n and editable styles.
So my first question goes here:
What is the best method to store these things, knowing they will likely be static? I've been thinking about getting the file content with PHP, showing it in a text editor, and when a save is made, replacing the old one (making a copy if it has never been edited before, so we keep the "original").
I think this would be considerably faster than using MySQL and storing the language / style.
What about security here? Should I create an .htaccess file to ask for a password on this folder?
I know how to do the replacement using a foreach over an array fetched from the database and str_replace($name, $value, $file), but if I store the language in a file, can I build an associative array from its content (like JSON)?
Thanks a lot, and sorry for so many questions; I'm a newbie.
This is what I'm doing in my CMS:
For each plugin/program/entity (you name it) I develop, I create a /translations folder.
I put all my translations there, named like el.txt, de.txt, uk.txt, etc. for all languages.
I store the translation data in JSON, because it's easy to write, easy to read, and easiest for everyone to contribute theirs.
Files can easily be UTF-8 encoded in-file without messing with databases, making it possible to read them in file mode (just JSON.parse them).
On installation of such plugins, I just loop through all the translations and put them in the database, one language per table row (e.g. a data column of TEXT datatype).
For each page render I query the database once to fetch the row for the selected language, call json_decode() on the whole result, and then put it in $_SESSION so that subsequent requests get flash-speed translated strings for the currently selected language.
The whole thing was developed with both performance and compatibility in mind.
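As a rough sketch of that render path (the translations table, its columns, and the session key are my names, not the poster's):

    <?php
    session_start();

    // Returns the translation map for $lang, hitting the database only once per session.
    function getTranslations(PDO $pdo, string $lang): array
    {
        if (!isset($_SESSION['i18n'][$lang])) {
            // Assumes: translations(lang CHAR(2) PRIMARY KEY, data TEXT)
            // where data holds the JSON imported from el.txt, de.txt, ...
            $stmt = $pdo->prepare('SELECT data FROM translations WHERE lang = ?');
            $stmt->execute([$lang]);
            $_SESSION['i18n'][$lang] = json_decode((string) $stmt->fetchColumn(), true) ?: [];
        }
        return $_SESSION['i18n'][$lang];
    }

    // Usage on a page render:
    // $t = getTranslations($pdo, 'el');
    // echo $t['welcome_message'] ?? 'welcome_message';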
The benefit of storing on the HDD vs. the DB is that backups won't waste as much space; e.g. once the file has been backed up, it doesn't take up tape the next day, whereas a DB gets fully backed up every day and takes up increasing amounts of space. The downside to writing it to disk is that it increases the chance of somebody uploading something malicious and being clever enough to figure out how to execute it. You just need to be more careful, that's all.
Yes, use .htaccess to limit any action on a writable folder. Good job thinking ahead of that risk.
Your approach sounds like a good strategy.
Good luck.

How to import data to analyse it the fastest way

I just have a question about which way gives me more performance and would be easier to get done. We have a DB with over 120,000 rows of data. This data is currently exported as a CSV file to an FTP location.
Now a web form should be created to filter the datasets from this CSV file. What would you recommend regarding performance and the work to do? Should I parse the CSV file and get the information out to the webpage, or should I reimport the CSV file into a DB (MySQL) and use SQL queries to filter the data? (Note: the original DB and export are on a different server than the webpage/webform.)
A direct connection to the DB on the original server is not possible.
I prefer reimporting it into a DB, because it makes development easier: I simply need to build the SQL query from the filter criteria entered in the web form and run it.
Any ideas?
Thanks...
WorldSignia
The database is undoubtedly the best answer. Since you are looking to use a web form to analyze the results and perform complex queries, the other alternative may prove VERY expensive in terms of server processing time, and considerably more difficult to implement. After all, on the one hand you have SQL handling all the filtering details for you, and on the other you would have to implement something yourself.
Performance-wise, I would advise that you create indexes for all fields that you know you will be using as criteria, and that you display results partially, say 50 per page, to minimize load times.
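As a sketch, with made-up table and column names standing in for your actual filter criteria:

    <?php
    // Index the columns used as filter criteria, then page the results 50 at a time.
    $pdo = new PDO('mysql:host=localhost;dbname=imported;charset=utf8mb4', 'user', 'pass');
    $pdo->setAttribute(PDO::ATTR_EMULATE_PREPARES, false);   // real binds for LIMIT/OFFSET

    $pdo->exec('CREATE INDEX idx_records_category ON records (category)');
    $pdo->exec('CREATE INDEX idx_records_created  ON records (created_at)');

    $page   = max(1, (int) ($_GET['page'] ?? 1));
    $offset = ($page - 1) * 50;

    $stmt = $pdo->prepare(
        'SELECT * FROM records WHERE category = ? ORDER BY created_at DESC LIMIT 50 OFFSET ?'
    );
    $stmt->bindValue(1, $_GET['category'] ?? '', PDO::PARAM_STR);
    $stmt->bindValue(2, $offset, PDO::PARAM_INT);
    $stmt->execute();
    $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);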
These data is currently exported as CSV file to an ftp location.
There are so many things wrong in that one sentence.
Should I parse the csv file and get the information out to the webpage
Definitely not.
While it is technically possible, and will probably be faster given the number of rows if you use the right tools, this is a high-risk approach that gives a lot less clarity of code. And while it may meet your immediate requirement, it is rather inflexible.
Since the only sensible option is to transfer the data to another database, perhaps you should think about how you can do this:
- without using FTP
- without using CSV
What happens to the data after it has been filtered?
I think the DB with indexes may be the better solution if you need to filter the data; after all, optimizing your work with data is the whole point of a DB. But you could profile your work, measure the performance, and then just choose.
Hmm, good question.
I would think the analysis with a DB is faster: you can set indexes and optimize the analysis.
But it could take some time to load the CSV into the database.
Analysing the CSV without a DB could also take some time; you would have to create a concrete algorithm, and that may be a lot of work :)
So I think you have to try both and take the best performance... evaluate them ;-)
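For the import half, a rough sketch assuming a CSV with a header row and an imported_data table whose columns match it (all names made up here):

    <?php
    // Load the exported CSV into MySQL once, then let SQL do the filtering.
    $pdo = new PDO('mysql:host=localhost;dbname=imported;charset=utf8mb4', 'user', 'pass');
    $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    $fh = fopen('export.csv', 'r');
    fgetcsv($fh);                       // skip the header row

    $stmt = $pdo->prepare('INSERT INTO imported_data (col_a, col_b, col_c) VALUES (?, ?, ?)');

    $pdo->beginTransaction();           // one transaction keeps 120,000 inserts fast
    while (($row = fgetcsv($fh)) !== false) {
        $stmt->execute($row);
    }
    $pdo->commit();
    fclose($fh);

If your server setup allows it, MySQL's LOAD DATA INFILE can do the same import in a single statement.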

Storing a URL value in MYSQL

I am developing a URL bookmark application (PHP and MySQL) for myself. With this application I will store URLs in MySQL.
The question is, should I store URLs in a TEXT column, or should I first parse the URL and store its components (host, path, query, fragment) in separate columns of one table? The latter also gives me the chance to generate statistical data by grouping on servers and so on. Or maybe I should store servers in a separate table and use a JOIN. What do you think?
Thanks.
I'd go with storing them in TEXT columns to start. As your application grows, you can build up the parsing and analysis functionality if you really want to. From what it sounds like, it's all just pie-in-the-sky functionality right now. Do what you need to get the basic application up and running first so that you have something to work with. You can always refactor and go from there.
The answer depends on how you would like to use this data in the future.
If you would like to analyze the different parts of the URL, splitting them is the way to go.
If not, the INSERT, as well as the SELECT, will be faster if you store them in just one field.
If you know the URLs are no longer than 255 characters, VARCHAR(255) will perform better than TEXT.
If you seriously think that you're going to be using it for getting interesting data, then sure, do it as a series of columns. Honestly, I'd say it'd probably just be easier to do it as a single column, though.
Also, don't forget that it's easy for you to convert back and forth later if you want to. Single to multiple is just a SELECT, a regex (or parse), and an INSERT into another table; multiple to single is just an INSERT ... SELECT with CONCAT.
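A sketch of the single-to-multiple direction using parse_url() in PHP instead of a regex, with illustrative table names; the reverse direction is shown as a commented-out INSERT ... SELECT:

    <?php
    // One-off migration: read full URLs, split them, write the parts to another table.
    $pdo = new PDO('mysql:host=localhost;dbname=bookmarks;charset=utf8mb4', 'user', 'pass');

    $insert = $pdo->prepare(
        'INSERT INTO url_parts (id, host, path, query, fragment) VALUES (?, ?, ?, ?, ?)'
    );

    foreach ($pdo->query('SELECT id, url FROM bookmarks') as $row) {
        $p = parse_url($row['url']) ?: [];
        $insert->execute([
            $row['id'],
            $p['host'] ?? null,
            $p['path'] ?? null,
            $p['query'] ?? null,
            $p['fragment'] ?? null,
        ]);
    }

    // Multiple to single, entirely in SQL (assumes an https scheme for simplicity):
    // INSERT INTO bookmarks (id, url)
    // SELECT id, CONCAT('https://', host, IFNULL(path, ''),
    //                   IF(query IS NULL, '', CONCAT('?', query)))
    // FROM url_parts;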
