Computation: better to truncate before insertion or let MySQL truncate?

Difficult question to phrase, so let me explain.
As part of an RSS caching system I'm inserting a lot of rows into a DB, several times a day. One of the columns is 'snippet', for the description node in the RSS feeds.
Sometimes this node is far longer than I want, since the corresponding DB column is type TINYTEXT (max: 255 bytes).
So, in terms of computation/memory, is it better for me to truncate via PHP before insertion, or just feed the whole, too-long string to MySQL and have it do the truncation?
Both of course work, but I wondered if one was better practice than the other.

In cases like this it's probably best to measure. If you don't notice a difference then it doesn't matter.
My intuition tells me that, since your snippet size is very small and the plain text can be very big, it would be better to truncate beforehand. Take the performance hit in PHP so you don't spend a lot of time sending a large query to MySQL.
For readability and code clarity it would also be better to do the truncation in PHP because that makes it explicit. You can even do clever truncating by word or by sentence.
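A minimal sketch of truncating by word in PHP before the insert. The function name truncate_snippet() and the 255-byte default are assumptions for illustration, the input is assumed to be UTF-8, and the mbstring extension is assumed to be available. Truncating in PHP also avoids depending on the server configuration: with MySQL's strict SQL mode enabled, an over-long value is rejected with an error instead of being silently truncated.

```php
<?php
// Hypothetical helper: trims a feed description so it fits a TINYTEXT column.
function truncate_snippet(string $text, int $maxBytes = 255): string
{
    $text = trim($text);
    if (strlen($text) <= $maxBytes) {
        return $text; // already short enough
    }
    // mb_strcut() cuts by bytes but never splits a multibyte character.
    $cut = mb_strcut($text, 0, $maxBytes, 'UTF-8');
    // If the cut landed in the middle of a word, back up to the last space.
    if ($text[strlen($cut)] !== ' ') {
        $lastSpace = strrpos($cut, ' ');
        if ($lastSpace !== false) {
            $cut = substr($cut, 0, $lastSpace);
        }
    }
    return rtrim($cut);
}
```

The result can then be bound to the insert statement as usual, so MySQL never sees the oversized string.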

Related

Raw Text as a Substitute for MySQL

I am renting a server which does not support MySQL. An upgrade would be significantly expensive.
So for the moment, I am trying to cope with using raw text.
So here is a database syntax I am thinking about:
{first line} metadata (id, name, date, number of rows, number of columns, etc.)
{second line} column headers
{rest of lines} column data, separated by a delimiter
Example (using * as a delimiter):
rmC2xA7f*Users*1436703535*3*5
id*first*last*email*password
d29JHVca*Example*User*example.user@example.com*examplepassword123
tGpy3CM6*Foo*Bar*foo.bar@foobar.com*foobarpassword456
PdQMDHsK*Bla*Bla*bla.bla@bla.com*blablapassword789
I would then create a PHP library for manipulating this text. I know that it wouldn't be as efficient, scalable or fast as MySQL, but would this be an acceptable substitute for a small, personal website?
Are there any issues with it, or any way I could improve it? I'll probably change the * to something else if you're thinking that.
Also, comment if this question should be on a different network...
Thanks :).
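For illustration, a reader for the layout described in the question could look roughly like this. It is only a sketch: parse_text_table() and the returned array keys are hypothetical names, and it makes no attempt to handle a value that itself contains the delimiter.

```php
<?php
// Hypothetical reader: line 1 = table metadata, line 2 = column names,
// remaining lines = one row each, all split on the same delimiter.
function parse_text_table(string $raw, string $delimiter = '*'): array
{
    $lines = preg_split('/\R/', trim($raw));
    if (count($lines) < 2) {
        throw new InvalidArgumentException('Expected a metadata line and a header line.');
    }
    $meta = explode($delimiter, array_shift($lines));
    $columns = explode($delimiter, array_shift($lines));
    $rows = [];
    foreach ($lines as $line) {
        $values = explode($delimiter, $line);
        if (count($values) !== count($columns)) {
            continue; // skip malformed rows, e.g. a value containing the delimiter
        }
        $rows[] = array_combine($columns, $values);
    }
    return ['meta' => $meta, 'columns' => $columns, 'rows' => $rows];
}

$table = parse_text_table("rmC2xA7f*Users*1436703535*1*3\nid*first*last\nd29JHVca*Example*User");
echo $table['rows'][0]['first'], "\n"; // Example
```

The skipped-row case is the main weakness of a hand-rolled format. PHP's built-in fgetcsv() and str_getcsv() accept a custom delimiter and already handle quoted values, so they may be a safer base than explode().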

Find phrases using mysql and php

I am working on a project and I need your suggestions in a database query. I am using PHP and MySQL.
Context
I have a table named phrases containing a phrases column in which there are phrases stored, each of which consists of one to three words.
I have a text string which contains 500 - 1000 words
I need to highlight all the phrases in the text string which exist in my phrases database table.
My solution
I go through every phrase in the phrase list and compare it against the text, but the number of phrases is large (100k) so it takes about 2 min or more to do this matching.
Is there any more efficient way of doing this?

I'm going to focus on how to do the comparison against 100K values. This requires two steps.
a) Write a C++ library and link it to PHP as an extension. Google PHP-CPP; it is a framework that lets you do this.
b) Inside C/C++, build a data structure whose lookup time is O(n), where n is the length of the phrase you're searching for. This is called a trie. It is conventionally used for single words without spaces (not phrases), but you can certainly write your own variant.
Here is a link that contains a word-level implementation, i.e. a dictionary:
http://www.geeksforgeeks.org/trie-insert-and-search/
This takes quite a bit of memory, since there are 100K phrases, so you need a reasonably large system. But when you're looking for better performance, memory tends to be the tradeoff.
Alternative Approach
PHP only. Extract candidate phrases from your input text and look each one up in a hash. The phrases from the table should also be stored in a hash (this needs a lot of memory). Each lookup is then very fast, O(1) on average. Since phrases are one to three words long, a text of k words yields at most about 3k candidates, so the whole pass takes O(k) lookups.
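The PHP-only approach could be sketched roughly as below. This is not the answerer's code: find_phrases() and its parameters are hypothetical names, the phrase list is assumed to be already loaded from the phrases table, the text is assumed to be valid UTF-8, and lowercasing is ASCII-only.

```php
<?php
// Hypothetical helper: returns the phrases from $phrases that occur in $text.
function find_phrases(string $text, array $phrases, int $maxWords = 3): array
{
    // Build a hash set so each candidate costs one array-key lookup.
    $lookup = array_fill_keys(array_map('strtolower', $phrases), true);
    // Split the text into words on anything that is not a letter or digit.
    $words = preg_split('/[^\p{L}\p{N}]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $count = count($words);
    $found = [];
    for ($i = 0; $i < $count; $i++) {
        // Try every run of 1..$maxWords words starting at position $i.
        for ($n = 1; $n <= $maxWords && $i + $n <= $count; $n++) {
            $candidate = implode(' ', array_slice($words, $i, $n));
            if (isset($lookup[$candidate])) {
                $found[$candidate] = true;
            }
        }
    }
    return array_map('strval', array_keys($found));
}
```

The function only reports which phrases matched. Highlighting would additionally need the word positions, which the loop already has in $i and $n.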

rawurlencode for storing data

I have always used rawurlencode to store user-entered data in my MySQL databases. The main reason is that I find it makes storing foreign characters very simple. I'd then use rawurldecode to retrieve and display the data.
I read somewhere that rawurlencode was not meant for this purpose. Are there any disadvantages to what I'm doing?
So let's say I have a German address with many characters like umlauts etc. What is the simplest way to store this in a MySQL database with no risk of it coming out wrong, while keeping it searchable by a search script? So far rawurlencode has been excellent for our system. Perhaps the practice can be improved upon by only encoding foreign letters and not common characters like spaces etc., which is a waste of space, I totally agree.

Sure there are.
Let's start with the practical: for a large class of characters you are spending 3 bytes of storage for every byte of data. The description of rawurlencode (and of course the RFC) say that those characters are
all non-alphanumeric characters except -_.~
This means that there is a total of 26 + 26 + 10 (alphanumeric) + 4 (special exceptions) = 66 characters for which you do not waste space.
Then there are also the logical drawbacks: You are not storing the data itself, but rather a representation of the data tailored to URLs. Unless the data itself is URLs, that's not what you should be doing.
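To make the storage cost concrete, here is a small sketch that compares the byte length of a value before and after rawurlencode. The function name encoding_overhead() and the sample address are made up for illustration; the "\u{...}" escapes need PHP 7 or later.

```php
<?php
// Hypothetical helper: reports how many bytes a value takes raw vs. URL-encoded.
function encoding_overhead(string $value): array
{
    $encoded = rawurlencode($value);
    return [
        'raw_bytes' => strlen($value),
        'encoded_bytes' => strlen($encoded),
        'encoded' => $encoded,
    ];
}

// "\u{FC}" is u-umlaut and "\u{DF}" is sharp s, each 2 bytes in UTF-8.
print_r(encoding_overhead("M\u{FC}ller Stra\u{DF}e"));
// raw_bytes => 15, encoded_bytes => 25, encoded => M%C3%BCller%20Stra%C3%9Fe
```

Each umlaut grows from 2 bytes to 6, and even the plain space grows from 1 byte to 3.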

Drawbacks I can think of:
Waste of disk space.
Waste of CPU cycles encoding and decoding on every read and every write.
Additional complexity (you can't even inspect data with a MySQL client).
Impossibility to use full text searches.
URL encoding is not necessarily unique (there're at least two RFCs). It may not lead to data loss but it can lead to duplicate data (e.g., unique indexes where two rows actually contain the same piece of data).
You can accidentally encode a non-string piece of data such as a date: 2012-04-20%2013%3A23%3A00
But the main consideration is that such technique is completely arbitrary and unnecessary since MySQL doesn't have the least problem storing the complete Unicode catalogue. You could also decide to swap e's and o's in all strings: Holle, werdl!. Your app would run fine but it would not provide any added value.
Update: As Your Common Sense points out, a SQL clause as basic as ORDER BY is no longer usable. It's not that international chars will be ignored; you'll basically get an arbitrary sort order based on the ASCII code of the % and hexadecimal characters. If you can't SELECT * FROM city ORDER BY city_name reliably, you've rendered your DB useless.
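The sorting problem can be reproduced without a database. In this sketch a plain byte-wise PHP sort stands in for ORDER BY on the encoded column; it is not MySQL's collation logic, and sort_as_stored() and the city names are hypothetical.

```php
<?php
// Hypothetical helper: sorts names the way their URL-encoded form would sort,
// then decodes them again for display.
function sort_as_stored(array $names): array
{
    $stored = array_map('rawurlencode', $names);
    sort($stored, SORT_STRING); // byte order of the encoded strings
    return array_map('rawurldecode', $stored);
}

// "\u{D6}" is O-umlaut; it is stored as %C3%96, and "%" sorts before every letter.
print_r(sort_as_stored(['Zwickau', "\u{D6}stersund", 'Aachen']));
// The O-umlaut city comes first, ahead of Aachen and Zwickau.
```

With the characters stored as is, a suitable MySQL collation can place such a name next to the other names starting with O.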

I am using a fork to eat soup.
I am using banknotes to light the coals for a BBQ.
I am using a kettle to boil eggs.
I am using a microscope to hammer in nails.
Are there any disadvantages to what I'm doing?
YES
You are using a tool for something other than its purpose. This is always a disadvantage.
A sane person always uses the tool intended for the job, not some randomly picked one, especially when there is no shortage of the right tools.
URL encoding is not intended to be used with databases, as one can tell from the name. That alone is reason enough for a sane developer. Take a look around and find the proper tool.
There is a thing called "common sense", widely used in everyday life but for some reason always absent in the PHP world.
Common sense warns us: if we use the wrong tool, it may spoil the work. Sooner or later it will. No need to ask for specific details; it's a general rule. We learn this rule at about the age of 5.
Why not use it when playing with web things too?
Why not to ask yourself a question:
What's wrong with storing foreign characters at all?
urlencode makes storing foreign characters very simple
Any hardships you encountered without urlencode?
Although I feel that common sense should be enough to answer the question, people always look for the "omen", the proof. Here you are:
A database's job is not limited to just storing and retrieving data. A plain text file can handle such a primitive task as well.
Data manipulation is what we use databases for.
The most widely used operations are sorting and filtering.
A database is intelligent enough to sort and filter data case-insensitively, which is a very handy feature. But of course it can do so only if the characters are saved as is, not as some random codes.
Sorting text may also use an ordering other than plain binary order in the character table. Some umlaut characters may sit in other parts of the table, but the database collation will put them in the right place. Again, this works only if the characters are saved as is, not as some random codes.
Sometimes we have to manipulate data that is already stored in the database. Say, cut a piece from a string and compare it with an entered value. How is that supposed to be done with URL-encoded data?

What is the best way to check for duplicate TEXT fields in MYSQL/PHP?

My code pulls ~1000 HTML files, extracts the relevant information & then stores that information in a MySQL TEXT field (as it is usually quite long). I am looking for a system to prevent duplicate entries in the DB
My first idea is to add a HASH field to the table (probably MD5), pull the hash list at the beginning of each run & check for duplicates before inserting into the DB.
Second idea is to store the file length (bytes or chars or whatever), index that, & check for duplicate file lengths, doublechecking content if a duplicate length is found.
No idea what is the best solution performance-wise. Perhaps there is a better way?
If there is an efficient way to check if files are >95% similar that would be ideal, but I doubt there is?
Thanks for any help!
BTW I am using PHP5/Kohana
EDIT:
just had an idea on checking for similarity: I could count all alphanumeric characters & log the occurrence of each
eg: 17aB... = 1a,7b,10c,27c,...
potential problem would be the upper limit for a char count (around 61?)
I imagine false positives would still be rare . . .
good idea/bad idea?

The hash idea is probably the best. You might have collisions, but they would be exceedingly rare.
Make the hash field a unique key for the table, and catch the duplicate-key error code. Or use INSERT IGNORE or REPLACE INTO.
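A sketch of the unique-hash approach with PDO. The table and column names (pages, content_hash, body) and both function names are assumptions for illustration, and content_hash is assumed to carry a UNIQUE key.

```php
<?php
// Hypothetical schema: pages(id, content_hash CHAR(32) UNIQUE, body TEXT).
function content_hash(string $body): string
{
    return md5($body); // 32 hex characters
}

// Stores the text unless a row with the same hash already exists.
function insert_if_new(PDO $db, string $body): bool
{
    // INSERT IGNORE skips the row instead of failing on a duplicate key.
    $stmt = $db->prepare('INSERT IGNORE INTO pages (content_hash, body) VALUES (?, ?)');
    $stmt->execute([content_hash($body), $body]);
    return $stmt->rowCount() === 1; // true only when a new row was written
}
```

Letting the unique key do the check means the hash list does not have to be pulled into PHP at the start of each run.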

That sounds pretty good, I have implemented something similar. The hash field should be a key since duplicates are not allowed.
If each text record is long you could compute a constant number (say 2) of hashes per record, each over a different part of the text. Then maybe if just one of them is identical, that is close enough. Obviously the more hashes you have per record the closer you get to comparing the full text.
MD5's are 16 bytes. How many potential hashes will there be over time? If this number stays reasonable, you should be okay doing the comparison in memory.
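One way to read the several-hashes idea is to hash equal-sized parts of the text and treat two records as near-duplicates when any part matches. The sketch below assumes that reading; chunk_hashes() and shares_chunk() are hypothetical names.

```php
<?php
// Hypothetical helper: splits the text into up to $parts equal-sized pieces
// and returns one MD5 hash per piece.
function chunk_hashes(string $body, int $parts = 2): array
{
    $size = max(1, (int) ceil(strlen($body) / $parts));
    return array_map('md5', str_split($body, $size));
}

// True when the two hash lists have at least one hash in common.
function shares_chunk(array $a, array $b): bool
{
    return count(array_intersect($a, $b)) > 0;
}
```

Fixed-position chunks only match when the shared text starts at the same offset, so a single inserted character near the start defeats the comparison. That limits how far this can stand in for a real similarity measure.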

Autodetect Presence of CSV Headers in a File

Short question: How do I automatically detect whether a CSV file has headers in the first row?
Details: I've written a small CSV parsing engine that places the data into an object that I can access as (approximately) an in-memory database. The original code was written to parse third-party CSV with a predictable format, but I'd like to be able to use this code more generally.
I'm trying to figure out a reliable way to automatically detect the presence of CSV headers, so the script can decide whether to use the first row of the CSV file as keys / column names or start parsing data immediately. Since all I need is a boolean test, I could easily specify an argument after inspecting the CSV file myself, but I'd rather not have to (go go automation).
I imagine I'd have to parse the first 3 to ? rows of the CSV file and look for a pattern of some sort to compare against the headers. I'm having nightmares of three particularly bad cases in which:
The headers include numeric data for some reason
The first few rows (or large portions of the CSV) are null
The headers and data look too similar to tell them apart
If I can get a "best guess" and have the parser fail with an error or spit out a warning if it can't decide, that's OK. If this is something that's going to be tremendously expensive in terms of time or computation (and take more time than it's supposed to save me) I'll happily scrap the idea and go back to working on "important things".
I'm working with PHP, but this strikes me as more of an algorithmic / computational question than something that's implementation-specific. If there's a simple algorithm I can use, great. If you can point me to some relevant theory / discussion, that'd be great, too. If there's a giant library that does natural language processing or 300 different kinds of parsing, I'm not interested.

As others have pointed out, you can't do this with 100% reliability. There are cases where getting it 'mostly right' is useful, however - for example, spreadsheet tools with CSV import functionality often try to figure this out on their own. Here are a few heuristics that would tend to indicate the first line isn't a header:
The first row has columns that are not strings or are empty
The first row's columns are not all unique
The first row appears to contain dates or other common data formats (eg, xx-xx-xx)
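These three heuristics could be sketched as a single check on the first row. The function name is hypothetical, numeric cells stand in for "not strings", and the date pattern is just one assumed reading of "xx-xx-xx".

```php
<?php
// Hypothetical helper: true when the first row is unlikely to be a header.
function first_row_looks_like_data(array $row): bool
{
    $cells = array_map(fn($cell) => trim((string) $cell), $row);
    foreach ($cells as $cell) {
        if ($cell === '' || is_numeric($cell)) {
            return true; // empty or numeric cell
        }
        if (preg_match('/^\d{1,4}[-\/.]\d{1,2}[-\/.]\d{1,4}$/', $cell)) {
            return true; // looks like a date such as 12-01-20
        }
    }
    // Header names are normally unique; repeated values suggest data.
    return count(array_unique($cells)) !== count($cells);
}
```

A false result only means none of the heuristics fired. It does not prove the row is a header.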

In the most general sense, this is impossible. This is a valid csv file:
Name
Jim
Tom
Bill
Most csv readers will just take hasHeader as an option, and allow you to pass in your own header if you want. Even in the case you think you can detect, that being character headers and numeric data, you can run into a catastrophic failure. What if your column is a list of BMW series?
M
3
5
7
You will process this incorrectly. Worst of all, you will lose the best car!

In the purely abstract sense, I don't think there is a foolproof algorithmic answer to your question, since it boils down to: "How do I distinguish dataA from dataB if I know nothing about either of them?" There will always be the potential for dataA to be indistinguishable from dataB. That said, I would start simple and only add complexity as needed. For example, when examining the first five rows, if for a given column (or columns) the datatypes in rows 2-5 are all the same but differ from the datatype in row 1, there's a good chance that a header row is present (larger sample sizes reduce the possibility of error). This would (sort of) solve #1 and #3; perhaps throw an exception if the rows are all populated but the data is indistinguishable, so the calling program can decide what to do next. For #2, simply don't count a row as a row unless and until it contains non-null data. That would work for everything but an empty file (in which case you'd hit EOF). It would never be foolproof, but it might be "close enough".
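The row-comparison idea could look roughly like this. The names cell_type() and guess_has_header() are hypothetical, only "number" and "text" are distinguished as datatypes, and the sketch returns null where the answer suggests throwing an exception.

```php
<?php
// Hypothetical classifier: a very coarse notion of "datatype".
function cell_type(string $cell): string
{
    $cell = trim($cell);
    if ($cell === '') {
        return 'empty';
    }
    return is_numeric($cell) ? 'number' : 'text';
}

// Returns true when a header is likely, or null when the sample cannot decide.
function guess_has_header(array $rows, int $sample = 5): ?bool
{
    // Ignore rows that contain no data at all (case #2).
    $rows = array_values(array_filter(
        $rows,
        fn($row) => implode('', array_map('strval', $row)) !== ''
    ));
    $rows = array_slice($rows, 0, $sample);
    if (count($rows) < 2) {
        return null;
    }
    $first = array_shift($rows);
    foreach ($first as $col => $cell) {
        $types = [];
        foreach ($rows as $row) {
            $types[cell_type((string) ($row[$col] ?? ''))] = true;
        }
        $types = array_keys($types);
        // One consistent type below the first row, and the first row differs.
        if (count($types) === 1 && $types[0] !== 'empty'
            && $types[0] !== cell_type((string) $cell)) {
            return true;
        }
    }
    return null;
}
```

A single all-text column gives null, and a column like M, 3, 5, 7 is reported as having a header, so the caller still needs a manual override.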

It really depends on just how "general" you want your tool to be. If the data will always be numeric, you have it easy as long as you assume non-numeric headers (which seems like a pretty fair assumption).
But beyond that, if you don't already know what patterns are present in the data, then you can't really test for them ahead of time.
FWIW, I actually just wrote a script for parsing out some stuff from TSVs, all from the same source. The source's approach to headers/formatting was so scattered that it made sense to just make the script ask me questions from the command line while executing. (Is this a header? Which columns are important?). So no automation, but it lets me fly through the data sets I'm working on, instead of trying to anticipate each funny formatting case. Also, my answers are saved in a file, so I only have to be involved once per file. Not ideal, but efficient.

This article provides some good guidance:
Basically, you do statistical analysis on columns based on whether the first row contains a string and the rest of the rows numbers, or something like that.
http://penndsg.com/blog/detect-headers/

If your CSV has a header like this:
ID, Name, Email, Date
1, john, john@john.com, 12 jan 2020
then running filter_var($value, FILTER_VALIDATE_EMAIL) on the cells of the header row will fail, since the email address appears only in the row data. So check the header row for an email address (assuming your CSV has email addresses in it).
Second idea.
http://php.net/manual/en/function.is-numeric.php
Check the header row with is_numeric: a header row most likely has no numeric data in it, while a data row most likely does.
If you know you have dates in your columns, then checking the header row for a date would also work.
Obviously you need to know what type of data you are expecting. I am "expecting" email addresses.
