Autodetect Presence of CSV Headers in a File - php

Short question: How do I automatically detect whether a CSV file has headers in the first row?
Details: I've written a small CSV parsing engine that places the data into an object that I can access as (approximately) an in-memory database. The original code was written to parse third-party CSV with a predictable format, but I'd like to be able to use this code more generally.
I'm trying to figure out a reliable way to automatically detect the presence of CSV headers, so the script can decide whether to use the first row of the CSV file as keys / column names or start parsing data immediately. Since all I need is a boolean test, I could easily specify an argument after inspecting the CSV file myself, but I'd rather not have to (go go automation).
I imagine I'd have to parse the first 3 to ? rows of the CSV file and look for a pattern of some sort to compare against the headers. I'm having nightmares of three particularly bad cases in which:
The headers include numeric data for some reason
The first few rows (or large portions of the CSV) are null
The headers and data look too similar to tell them apart
If I can get a "best guess" and have the parser fail with an error or spit out a warning if it can't decide, that's OK. If this is something that's going to be tremendously expensive in terms of time or computation (and take more time than it's supposed to save me) I'll happily scrap the idea and go back to working on "important things".
I'm working with PHP, but this strikes me as more of an algorithmic / computational question than something that's implementation-specific. If there's a simple algorithm I can use, great. If you can point me to some relevant theory / discussion, that'd be great, too. If there's a giant library that does natural language processing or 300 different kinds of parsing, I'm not interested.

As others have pointed out, you can't do this with 100% reliability. There are cases where getting it 'mostly right' is useful, however - for example, spreadsheet tools with CSV import functionality often try to figure this out on their own. Here are a few heuristics that would tend to indicate the first line isn't a header:
The first row has columns that are not strings or are empty
The first row's columns are not all unique
The first row appears to contain dates or other common data formats (e.g., xx-xx-xx)
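A minimal sketch of those heuristics in PHP, assuming the file is already readable; the function name, file name and date pattern are illustrative, not from any library:

<?php
// Heuristic check: does the first row look like a header?
function firstRowLooksLikeHeader(array $firstRow): bool
{
    foreach ($firstRow as $cell) {
        $cell = trim((string) $cell);
        // Empty or purely numeric cells suggest data, not a header.
        if ($cell === '' || is_numeric($cell)) {
            return false;
        }
        // Date-like cells (e.g. 12-01-20 or 2020/01/12) suggest data.
        if (preg_match('~^\d{1,4}[-/]\d{1,2}[-/]\d{1,4}$~', $cell)) {
            return false;
        }
    }
    // Header names are usually unique; duplicates suggest data.
    return count($firstRow) === count(array_unique($firstRow));
}

$fh = fopen('input.csv', 'r');   // hypothetical file name
$firstRow = fgetcsv($fh);
$hasHeader = ($firstRow !== false) && firstRowLooksLikeHeader($firstRow);
fclose($fh);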

In the most general sense, this is impossible. This is a valid CSV file:
Name
Jim
Tom
Bill
Most CSV readers will simply take hasHeader as an option, and allow you to pass in your own header if you want. Even in the case you think you can detect (string headers over numeric data), you can run into a catastrophic failure. What if your column is a list of BMW series?
M
3
5
7
You will process this incorrectly. Worst of all, you will lose the best car!

In the purely abstract sense, I don't think there is a foolproof algorithmic answer to your question, since it boils down to: "How do I distinguish dataA from dataB if I know nothing about either of them?" There will always be the potential for dataA to be indistinguishable from dataB. That said, I would start simple and only add complexity as needed. For example, examine the first five rows: if, for a given column (or columns), the datatype in rows 2-5 is consistent but differs from the datatype in row 1, there's a good chance that a header row is present (a larger sample reduces the possibility of error). This would (sorta) solve #1/#3 - perhaps throw an exception if the rows are all populated but the data is indistinguishable, to allow the calling program to decide what to do next. For #2, simply don't count a row as a row until it pulls non-null data... that would work in all but an empty file (in which case you'd hit EOF). It would never be foolproof, but it might be "close enough".
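A rough sketch of that sampling idea, with illustrative names (guessHasHeader, cellType) and the sample size as a parameter; treat it as a starting point rather than a finished detector:

<?php
// Classify a cell as 'numeric' or 'string'.
function cellType(string $cell): string
{
    return is_numeric(trim($cell)) ? 'numeric' : 'string';
}

// Compare row 1's type signature against the next few non-empty rows.
function guessHasHeader(string $path, int $sampleSize = 5): bool
{
    $fh = fopen($path, 'r');
    $rows = [];
    while (count($rows) < $sampleSize && ($row = fgetcsv($fh)) !== false) {
        $row = array_map(fn ($c) => trim((string) $c), $row);
        // Skip rows that are entirely empty (case #2 in the question).
        if (implode('', $row) === '') {
            continue;
        }
        $rows[] = array_map('cellType', $row);
    }
    fclose($fh);

    if (count($rows) < 2) {
        return false; // not enough data to tell
    }

    $first = array_shift($rows);
    foreach (array_keys($first) as $col) {
        $dataTypes = array_unique(array_column($rows, $col));
        // Data rows agree on one type, but row 1 differs: likely a header.
        if (count($dataTypes) === 1 && $dataTypes[0] !== $first[$col]) {
            return true;
        }
    }
    return false; // indistinguishable; let the caller decide
}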

It really depends on just how "general" you want your tool to be. If the data will always be numeric, you have it easy as long as you assume non-numeric headers (which seems like a pretty fair assumption).
But beyond that, if you don't already know what patterns are present in the data, then you can't really test for them ahead of time.
FWIW, I actually just wrote a script for parsing out some stuff from TSVs, all from the same source. The source's approach to headers/formatting was so scattered that it made sense to just make the script ask me questions from the command line while executing (Is this a header? Which columns are important?). So no automation, but it lets me fly through the data sets I'm working on, instead of trying to anticipate each funny formatting case. Also, my answers are saved in a file, so I only have to be involved once per file. Not ideal, but efficient.

This article provides some good guidance: http://penndsg.com/blog/detect-headers/
Basically, you do statistical analysis on each column based on whether the first row contains a string while the rest of the rows contain numbers, or something along those lines.

If your CSV has a header like this:
ID, Name, Email, Date
1, john, john@john.com, 12 jan 2020
Then doing filter_var($str, FILTER_VALIDATE_EMAIL) on the header row will fail, since the email address only appears in the data rows. So check the header row for an email address (assuming your CSV has email addresses in it).
Second idea.
http://php.net/manual/en/function.is-numeric.php
Check the header row with is_numeric(): a header row most likely does not contain numeric data, but a data row most likely does.
If you know you have dates in your columns, then checking the header row for a date would also work.
Obviously you need to know what type of data you are expecting. I am "expecting" email addresses.
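A small sketch combining both checks on the first row; the file name is hypothetical and the checks assume the data rows contain email addresses and numeric IDs:

<?php
$fh = fopen('contacts.csv', 'r');   // hypothetical file name
$firstRow = fgetcsv($fh);
fclose($fh);

$looksLikeData = false;
foreach ($firstRow as $cell) {
    // A valid email address or a purely numeric value in the first row
    // suggests it is data, not a header.
    if (filter_var($cell, FILTER_VALIDATE_EMAIL) !== false || is_numeric($cell)) {
        $looksLikeData = true;
        break;
    }
}
$hasHeader = !$looksLikeData;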

Related

Scraping Oracle text-file using pcre in php

I would like to scrape a text file which is the output from Oracle AP. I don't have access to Oracle, but I need to assist in bug hunting and compare the text file against two CSV files from other systems. Importing the CSV files into a database is not a problem, but I'm struggling with this text file.
The text file is divided in two parts: what was successfully imported, and what was rejected. Each column has a specific width set by Oracle when creating the report, and they will not change the setting for column width. If the content of a column exceeds the width, it simply continues on the row below. The columns for imported and rejected entries are not 100% the same.
For the successful imports it's simple, as there is one version of every row, but the rejected entries might span more than one row for different reasons.
The import file is shortened and obfuscated for obvious reasons, as it can be several thousands of lines. It's best viewed in a text editor without word-wrap. I cannot get it to look any good in this forum with blockquote or code sample in forum editor, so please view/copy it from links below.
I'm showing the successful ones on regex101.com here.
Regex finding the imported (I'm sure it could be better, but it works and that is good enough for me):
\s(\d+)\s+([\D]{2,})(\d+)\s+(\d{1,2}-[a-zA-Z]{3}-\d{2})\s+(\w+)\s+([\w+\,]*\.\d+)\s+(\d)\s+([\w+\,]*\.\d+)\s+(\d{1,2}-[a-zA-Z]{3}-\d{2})
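For reference, a quick sketch of applying that pattern in PHP with preg_match_all; the report file name is hypothetical:

<?php
// Read the whole Oracle AP report into a string (hypothetical file name).
$report = file_get_contents('oracle_ap_report.txt');

$imported = '/\s(\d+)\s+([\D]{2,})(\d+)\s+(\d{1,2}-[a-zA-Z]{3}-\d{2})\s+(\w+)\s+([\w+\,]*\.\d+)\s+(\d)\s+([\w+\,]*\.\d+)\s+(\d{1,2}-[a-zA-Z]{3}-\d{2})/';

if (preg_match_all($imported, $report, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $row) {
        // $row[1]..$row[9] hold the captured columns of one imported line,
        // e.g. ready to be written out as CSV for comparison.
    }
}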
I'm struggling with the rejected ones however, due to the variations.
Duplicate invoice number, if there is more than one reason (column) for not being imported.
Missing supplier number and supplier name (they always show up as a pair).
Here is what I've done so far with the rejected ones.
Regex finding rejected:
^\s(\d+)\s+([\D]{2,})(\d+)\s+(\d{1,2}-[a-zA-Z]{3}-\d{2})\s+(\w+)\s+(-?[\w]{1,}\.?\d+)\s+
Clearly my regex for rejected is not the final result. It's crap at the moment. It would even scrape a successful row.
My questions:
Is it possible to have only one regex for the rejected rows that catches the variations mentioned above? An example would be appreciated.
Is it possible to fetch the word-wrapped parts of a column? An example would be appreciated.
I'm trying to understand the PCRE documentation regarding conditionals as it might be of help when dealing with the rejected variations, but so far I'm struggling with it.
Regards,
Bjørn

Computation: better to truncate before insertion or let MySQL truncate?

Difficult question to phrase, so let me explain.
As part of an RSS caching system I'm inserting a lot of rows into a DB, several times a day. One of the columns is 'snippet', for the description node in the RSS feeds.
Sometimes this node is far longer than I want, since the corresponding DB column is type "tiny text" (max: 255 chars).
So, in terms of computation/memory, is it better for me to truncate via PHP before insertion, or just feed the whole, too-long string to MySQL and have it do the truncation?
Both of course work, but I wondered if one was better practice than the other.
In cases like this it's probably best to measure. If you don't notice a difference then it doesn't matter.
My intuition tells me that, since your snippet size is very small and the plain text can be very big, it would be better to truncate beforehand. Take the performance hit in PHP so you don't spend a lot of time sending a large query to MySQL.
For readability and code clarity it would also be better to do the truncation in PHP because that makes it explicit. You can even do clever truncating by word or by sentence.
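A small sketch of truncating in PHP before the INSERT, assuming a TINYTEXT snippet column; the table and connection details are hypothetical, and note that TINYTEXT is limited to 255 bytes, so multibyte text may need a smaller character limit:

<?php
function truncateSnippet(string $text, int $max = 255): string
{
    if (mb_strlen($text) <= $max) {
        return $text;
    }
    $cut = mb_substr($text, 0, $max);
    // Optional: avoid cutting mid-word by backing up to the last space.
    $lastSpace = mb_strrpos($cut, ' ');
    return $lastSpace !== false ? mb_substr($cut, 0, $lastSpace) : $cut;
}

$description = '...long description node from the RSS feed...';

$pdo = new PDO('mysql:host=localhost;dbname=rss;charset=utf8mb4', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO items (snippet) VALUES (:snippet)');
$stmt->execute([':snippet' => truncateSnippet($description)]);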

Prevent people to submit a form with meaningless data

I work on a website which allows people to describe how they were treated when they requested support from companies. The issue is that some people are playing with the platform, submitting meaningless data like
blabla bal bla bka asdfdsff sdfs sdf
Is there a way to prevent this?
Can't do the validation of data manually because the website is very dynamic with a lot of data.
Thanks
Improve your form validation checks.
For the phone number, make sure it's exactly the appropriate length and isn't (for example) the same digit repeated (i.e. the number 0777777777 is probably fake).
Calculate the letter usage in the text. The most used letters in the English language are e and a (I think). If the ratio is completely different (for example, if there is no letter e in a 200-letter text), there is a big problem.
Also match the words against a dictionary. If the ratio of unknown words is larger than 60%, you can consider the text invalid.
Check for dates: if you're expecting a date in the next few days, you shouldn't accept dates from 30 years ago.
Think of the data that you're expecting to receive, and find limits to it, that's the only way. Good luck !
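A rough sketch of the letter-frequency and dictionary checks described above, with placeholder thresholds and a word list you would supply yourself:

<?php
// Returns true if the text looks like gibberish under two crude checks.
function looksLikeGibberish(string $text, array $dictionary): bool
{
    $letters = preg_replace('/[^a-z]/', '', strtolower($text));
    if ($letters === '') {
        return true;
    }

    // 'e' is the most common letter in English; a long text with almost
    // none of it is suspicious.
    if (strlen($letters) > 100 && substr_count($letters, 'e') / strlen($letters) < 0.02) {
        return true;
    }

    // Reject if more than ~60% of the words are not in the dictionary.
    $words = preg_split('/\s+/', strtolower(trim($text)), -1, PREG_SPLIT_NO_EMPTY);
    $unknown = 0;
    foreach ($words as $word) {
        if (!in_array(preg_replace('/[^a-z]/', '', $word), $dictionary, true)) {
            $unknown++;
        }
    }
    return count($words) > 0 && ($unknown / count($words)) > 0.6;
}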
Short answer: no.
Long answer: you may want to try matching words against a dictionary. But this is not foolproof, and if the matching is too strict you may get a lot of false positives.
Another way may be to build a blacklist of bogus words and match against that.
Also, you may want to reconsider making that particular field required. When a lot of people fill in bogus data, the form is probably set up wrong.
You can do it to an extent:
Validation on certain fields (phone number, email, numeric/text only fields etc...)
Restrict the user to pre-defined items, such as drop-downs and check-boxes, rather than plain text inputs where they have total freedom
Run the text through a dictionary check and require a minimum percentage of recognised words before accepting a submission.
Regardless of what you do, it'll never be 100%. The only (almost!) guaranteed method of correct validation with user input outside of pre-determined values would be to sit someone down and manually check every submitted piece of data. Even then, they're prone to human error and it still wouldn't be 100%.
My advice would be to keep all important fields to values you've already specified yourself with drop-downs, check-boxes, number spinners etc...
Add fields for 'additional comments' on certain items, but keep those fields unnecessary to the main process handling of a submitted form.
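A minimal sketch of that whitelist approach on the server side; the field name and option values are hypothetical:

<?php
// Only accept values that correspond to the drop-down options we rendered.
$allowedOutcomes = ['resolved', 'partially resolved', 'unresolved', 'no response'];

$outcome = $_POST['outcome'] ?? '';
if (!in_array($outcome, $allowedOutcomes, true)) {
    http_response_code(422);
    exit('Please choose one of the listed outcomes.');
}
// Safe to store $outcome; the free-text comment field stays optional.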

Storing text in db: how to choose varchar size (considering formatting), storing formatting separately?

How do you best choose a size for a varchar/text/... column in a (MySQL) database (let's assume the text the user can type into a text area should be max 500 chars), considering that the user might also use formatting (HTML/BB code/...), which is not visible to the user and should not count toward the 500-char limit?
1) Theoretically, to prevent any error, the varchar size would have to be almost unlimited if the user e.g. uses 20 links like this (http://[huge number of chars]) or whatever... or not?
2) Should/could you save the formatting in a separate column, e.g. so an index (like FULLTEXT) doesn't pick up wrong values (words that are contained in the formatting but not in the real text)?
If yes, how would you best do this? Do you remember at which point the formatting was used, save that position and the formatting, and then put the information back together when outputting?
(php/mysql, java script, jquery)
Thank you very much in advance!
A good solution is to take the amount of formatting characters into account.
If you do not, then to avoid data loss you need to allow much more space for the text in the database and check the length before saving, or use a TEXT column.
Keeping the same data twice in one table is not a good solution. It all depends on your project, but it's usually better to filter the formatting in PHP.
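A small sketch of that length check in PHP, assuming HTML-style formatting (for BB code you would strip the tags with a regex instead of strip_tags) and a hypothetical field name:

<?php
$input = $_POST['comment'] ?? '';

// The length limit applies to the visible text only, not the markup.
$visible = strip_tags($input);
if (mb_strlen($visible) > 500) {
    exit('Text may be at most 500 characters.');
}

// Store the formatted version in a TEXT column (no tight length cap);
// a stripped copy could go in a separate column for a FULLTEXT index.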

rawurlencode for storing data

I have always used rawurlencode to store user-entered data in my MySQL databases. The main reason I do this is that it makes storing foreign characters very simple, I find. I'd then use rawurldecode to retrieve and display the data.
I read somewhere that rawurlencode was not meant for this purpose. Are there any disadvantages to what I'm doing?
So let's say I have a German address with many characters like umlauts etc. What is the simplest way to store this in a MySQL database with no risk of it coming out wrong, while remaining searchable using a search script? So far rawurlencode has been excellent for our system. Perhaps the practice could be improved by only encoding foreign letters and not common characters like spaces etc., which I totally agree is a waste of space.
Sure there are.
Let's start with the practical: for a large class of characters you are spending 3 bytes of storage for every byte of data. The description of rawurlencode (and of course the RFC) say that those characters are
all non-alphanumeric characters except -_.~
This means that there is a total of 26 + 26 + 10 (alphanumeric) + 4 (special exceptions) = 66 characters for which you do not waste space.
Then there are also the logical drawbacks: You are not storing the data itself, but rather a representation of the data tailored to URLs. Unless the data itself is URLs, that's not what you should be doing.
Drawbacks I can think of:
Waste of disk space.
Waste of CPU cycles encoding and decoding on every read and every write.
Additional complexity (you can't even inspect data with a MySQL client).
Impossibility to use full text searches.
URL encoding is not necessarily unique (there are at least two RFCs). It may not lead to data loss, but it can lead to duplicate data (e.g., a unique index failing to catch two rows that actually contain the same piece of data).
You can accidentally encode a non-string piece of data such as a date: 2012-04-20%2013%3A23%3A00
But the main consideration is that such technique is completely arbitrary and unnecessary since MySQL doesn't have the least problem storing the complete Unicode catalogue. You could also decide to swap e's and o's in all strings: Holle, werdl!. Your app would run fine but it would not provide any added value.
Update: As Your Common Sense points out, a SQL clause as basic as ORDER BY is no longer usable. It's not that international chars will be ignored; you'll basically get an arbitrary sort order based on the ASCII codes of the % and hexadecimal characters. If you can't SELECT * FROM city ORDER BY city_name reliably, you've rendered your DB useless.
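A minimal sketch of the alternative implied here: store the text as-is over a utf8mb4 connection and let MySQL handle the characters. Connection details and table/column names are hypothetical:

<?php
$pdo = new PDO(
    'mysql:host=localhost;dbname=app;charset=utf8mb4',
    'user',
    'pass',
    [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]
);

$address = 'Königstraße 45, 70173 Stuttgart';  // umlauts stored directly

$stmt = $pdo->prepare('INSERT INTO addresses (street) VALUES (:street)');
$stmt->execute([':street' => $address]);

// Sorting and searching now work as expected, e.g.:
// SELECT street FROM addresses ORDER BY street;
// SELECT street FROM addresses WHERE street LIKE '%straße%';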
I am using a fork to eat a soup
I am using money bills to fire the coals for BBQ
I am using a kettle to boil eggs.
I am using a microscope to hammer the nails.
Are there any disadvantages to what I'm doing?
YES
You are using a tool for something it was not intended for. This is always a disadvantage.
A sane human being always uses a tool that is intended for the job at hand, not some randomly picked one. Especially when there is no shortage of the right tools.
URL encoding is not intended to be used with a database, as one can tell from the name. That alone is reason enough for the sane developer. Take a look around: find the proper tool.
There is a thing called "common sense" - a thing widely used in regular life but for some reason often absent in the PHP world.
Common sense can warn us: if we're using the wrong tool, it may spoil the work. Sooner or later it will spoil it. No need to ask for specific details; it's a general rule. We learn this rule at about the age of 5.
Why not use it while playing with some web thingies too?
Why not ask yourself a question:
What's wrong with storing foreign characters at all?
urlencode makes storing foreign characters very simple
Any hardships you encountered without urlencode?
Although I feel that common sense should be enough to answer the question, people always look for the "omen", the proof. Here you are:
A database's job is not limited to just storing and retrieving data. A plain text file can handle such a primitive task as well.
Data manipulation is what we use databases for.
The most widely used manipulations are sorting and filtering.
Such an intelligent thing as a database can sort and filter data case-insensitively, which is a very handy feature. But of course that can be done only if the characters are saved as-is, not as some random codes.
Sorting text may also use an ordering other than the plain binary order of the character table. Some umlaut characters sit in other parts of the table, but the database collation will put them in the right place. Of course, that only works if the characters are saved as-is, not as some random codes.
Sometimes we have to manipulate data that is already stored in the database, say, cut a piece out of a string and compare it with an entered value. How is that supposed to be done with urlencoded data?
