Scraping Oracle text-file using pcre in php

Scraping Oracle text-file using pcre in php - php

I would like to scrape a text-file which is the output from Oracle AP. I don't have access to Oracle, but need to assist in bug hunting and compare text-file against two csv-files from other systems. Importing the csv-files into a database is not a problem, but I'm struggling with this text-file.
The text-file is divided in two parts. What is successfully imported, and what is rejected. Each column has a specific width set by Oracle when creating the report. They will not change the setting for column width. If content of a column exceeds the width it simply continues on the row below. And columns for imported and rejected are not 100% the same.
For the successful imports it's simple, as there is one version of every row, but the rejected one might have more than one row for different reasons.
The import file is shortened and obfuscated for obvious reasons, as it can be several thousands of lines. It's best viewed in a text editor without word-wrap. I cannot get it to look any good in this forum with blockquote or code sample in forum editor, so please view/copy it from links below.
I'm showing the successful ones on regex101.com here.
Regex finding the imported (I'm sure it could be better, but it works and that is good enough for me):
\s(\d+)\s+([\D]{2,})(\d+)\s+(\d{1,2}-[a-zA-Z]{3}-\d{2})\s+(\w+)\s+([\w+\,]*\.\d+)\s+(\d)\s+([\w+\,]*\.\d+)\s+(\d{1,2}-[a-zA-Z]{3}-\d{2})
I'm struggling with the rejected ones however, due to the variations.
Duplicate invoice number, if there are more than one reason (column) for not being imported.
Missing supplier number and supplier name (always shows up in pair).
Here is what I'm done so far with the rejected ones.
Regex finding rejected:
^\s(\d+)\s+([\D]{2,})(\d+)\s+(\d{1,2}-[a-zA-Z]{3}-\d{2})\s+(\w+)\s+(-?[\w]{1,}\.?\d+)\s+
Clearly my regex for rejected is not the final result. It's crap at the moment. It would even scrape a successful row.
My questions:
Is it possible to have only one regex for rejected catching the variations mentioned in bullet points above? Example would be appreciated.
Is it possible to fetch the word-wrapped parts of a column? Example would be appreciated.
I'm trying to understand the PCRE documentation regarding conditionals as it might be of help when dealing with the rejected variations, but so far I'm struggling with it.
Regards,
Bjørn

Related

getCalculatedValue reports different response to exported spreadsheet

I am working with a fairly large, complex spreadsheet (there are 6 sheets, each with 200-400 rows) and am having trouble getting the correct values out of some cells.
My workflow is roughly:
User data is inputted on front-end
Data is validated and then placed into certain cells on the spreadsheet
Calculations in other cells reference the user-input cells
I use getCalculatedValue on particular cells to retrieve the necessary values
For debug purposes I then save out the modified spreadsheet so that I can easily see that the data has been inputted and generated correctly.
PHPExcel has been working great, but I have ran into an issue where the getCalculatedValue method (step 4) is returning an incorrect value, but when I inspect the spreadsheet that has been saved out (step 5) the values are correct.
The calculations consist of general mathematical equations, IF conditions, some date manipulation and multiple VLOOKUPs.
I am currently picking my way through the calculations in order to trace the issue, but was wondering if there may be a simpler solution to this that I am not aware of. Perhaps some setting that affects the outcome of various different calculations? This may even be a subtle change in calculations that is subsequently snow-balling into a bigger change further down the line.
Thanks in advance.

Turned out to be a syntax error in the spreadsheet that I was provided.
A round function was being used like so:
ROUND(NUMBER,)
Excel compensated for this by using 0 as the second parameter, whereas PHPExcel (quite correctly) didn't.

Date/Number ranges displaying incorrectly

I am using a 3rd party "shopping cart" program (CartWeaver) that uses a MySQL database. The products in my store have primary and secondary categories, the latter of which where my issue lies.
I am using date ranges (e.g. 1930-1939, 1940-1949, etc), however when I view the ranges on the product search page, the first number sequence is displaying incorrectly. For example, the date range 1930-1939 is displaying as 3860-1930, and 1940-1949 is displaying as 3880-1949 (you can see the issue at www.silverscreencollectibles.com/searchpage.php).
I have tried multiple variables to try to get around this, all to no available. Here is what I've tried: Starting the sequence with an alpha character, starting it with a special character, putting the range in single as well as double quotes, putting a space between the sequences, replacing the dash with the alpha "to". I have also deleted and recreated the subcategories, and nothing I have tried changes the result. The entry for "2000-Present" also displays incorrectly (as 4000-Present).
I'm just a dumb user, not a programmer, so any responses that anyone is kind enough to offer will need to be "dumbed down". The application support group wants me to either send them my entire database, or allow them to directly access my site...and neither option appeals to me, from a security standpoint. I thought I would throw the issue out to the StackOverFlow community to see if anyone has seen this sort of issue before, and may be able to point me in the right direction to address the issue. Thank you so much for your time.

PHP Repairing Bad Text

This is something I'm working on and I'd like input from the intelligent people here on StackOverflow.
What I'm attempting is a function to repair text based on combining various bad versions of the same text page. Basically this can be used to combine different OCR results into one with greater accuracy than any of them individually.
I start with a dictionary of 600,000 English words, that's pretty much everything including legal and medical terms and common names. I have this already.
Then I have 4 versions of the text sample.
Something like this:
$text[0] = 'Fir5t text sample is thisline';
$text[1] = 'Fir5t text Smplee is this line.';
$text[2] = 'First te*t sample i this l1ne.';
$text[3] = 'F i r st text s ample is this line.';
I attempting to combine the above to get an output which looks like:
$text = 'First text sample is this line.';
Don't tell me it's impossible, because it is certainly not, just very difficult.
I would very much appreciate any ideas anyone has towards this.
Thank you!
My current thoughts:
Just checking the words against the dictionary will not work, since some of the spaces are in the wrong place and occasionally the word will not be in the dictionary.
The major concern is repairing broken spacings, once this is fixed then then the most commonly occurring dictionary word can be chosen if exists, or else the most commonly occurring non-dictionary word.

Have you tried using a longest common subsequence algorithm? These are commonly seen in the "diff" text comparison tools used in source control apps and some text editors. A diff algorithm helps identify changed and unchanged characters in two text samples.
http://en.wikipedia.org/wiki/Diff
Some years ago I worked on an OCR app similar to yours. Rather than applying multiple OCR engines to one image, I used one OCR engine to analyze multiple versions of the same image. Each of the processed images was the result of applying different denoising technique to the original image: one technique worked better for low contrast, another technique worked better when the characters were poorly formed. A "voting" scheme that compared OCR results on each image improved the read rate for arbitrary strings of text such as "BQCM10032". Other voting schemes are described in the academic literature for OCR.
On occasion you may need to match a word for which no combination of OCR results will yield all the letters. For example, a middle letter may be missing, as in either "w rd" or "c tch" (likely "word" and "catch"). In this case it can help to access your dictionary with any of three keys: initial letters, middle letters, and final letters (or letter combinations). Each key is associated with a list of words sorted by frequency of occurrence in the language. (I used this sort of multi-key lookup to improve the speed of a crossword generation app; there may well be better methods out there, but this one is easy to implement.)
To save on memory, you could apply the multi-key method only to the first few thousand common words in the language, and then have only one lookup technique for less common words.
There are several online lists of word frequency.
http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists
If you want to get fancy, you can also rely on prior frequency of occurrence in the text. For example, if "Byrd" appears multiple times, then it may be the better choice if the OCR engine(s) reports either "bird" or "bard" with a low confidence score. You might load a medical dictionary into memory only if there is a statistically unlikely occurrence of medical terms on the same page--otherwise leave medical terms out of your working dictionary, or at least assign them reasonable likelihoods. "Prosthetics" is a common word; "prostatitis" less so.
If you have experience with image processing techniques such as denoising and morphological operations, you can also try preprocessing the image before passing it to the OCR engine(s). Image processing could also be applied to select areas after your software identifies the words or regions where the OCR engine(s) fared poorly.
Certain letter/letter and letter/numeral substitutions are common. The numeral 0 (zero) can be confused with the letter O, C for O, 8 for B, E for F, P for R, and so on. If a word is found with low confidence, or if there are two common words that could match an incompletely read word, then ad hoc shape-matching rules could help. For example, "bcth" could match either "both" or "bath", but for many fonts (and contexts) "both" is the more likely match since "o" is more similar to "c" in shape. In a long string of words such as a a paragraph from a novel or magazine article, "bath" is a better match than "b8th."
Finally, you could probably write a plugin or script to pass the results into a spellcheck engine that checks for noun-verb agreement and other grammar checks. This may catch a few additional errors. Maybe you could try VBA for Word or whatever other script/app combo is popular these days.

Tackling complex algorithms like this by yourself will probably take longer and be more error prone than using a third party tool - unless you really need to program this yourself, you can check the Yahoo Spelling Suggestion API. They allow 5.000 requests per IP per day, I believe.
Others may offer something similar (I think there's a bing API, too).
UPDATE: Sorry, I just read that they've stopped this service in April 2011. They claim to offer a similar service called "Spelling Suggestion YQL table" now.

This is indeed a rather complicated problem.
When I do wonder how to spell a word, the direct way is to open a dictionary. But what if it is a small complex sentence that I'm trying to spell correctly ? One of my personal trick, which works most of the time, is to call Google. I place my sentence between quotes on Google and count the results. Here is an example : entering "your very smart" on Google gives 13'600k page. Entering "you're very smart" gives 20'000k pages. Then, likely, the correct spelling is "you're very smart". And... indeed it is ;)
Based on this concept, I guess you have samples which, for the most parts, are correctly misspelled (well, maybe not if your develop for a teens gaming site...). Can you try to divide the samples into sub pieces, not going up to the words, and matching these by frequency ? The most frequent piece is the most likely correctly spelled. Prior to this, you can already make a dictionary spellcheck with your 600'000 terms to increase the chance that small spelling mistakes will alredy be corrected. This should increase the frequency of correct sub pieces.
Dividing the sentences in pieces and finding the right "piece-size" is also tricky.
What concerns me a little too : how do you extract the samples and match them together to know the correctly spelled sentence is the same (or very close?). Your question seems to assume you have this, which also seems something very complex for me.
Well, what precedes is just a general tip based on my personal and human experience. Donno if this can help. This is obviously not a real answer and is not meant to be one.

You could try using google n-grams to achieve this.

If you need to get right string only by comparing other. Then Something like this maybe will help.
It not finished yet, but already gives some results.
$text[0] = 'Fir5t text sample is thisline';
$text[1] = 'Fir5t text Smplee is this line.';
$text[2] = 'First te*t sample i this l1ne.';
$text[3] = 'F i r st text s ample is this line.';
function getRight($arr){
$_final='';
$count=count($arr);
// Remove multi spaces AND get string lengths
for($i=0;$i<$count;$i++){
$arr[$i]=preg_replace('/\s\s+/', ' ',$arr[$i]);
$len[$i]=strlen($arr[$i]);
}
// Max length
$_max=max($len);
for($i=0;$i<$_max;$i++){
$_el=array();
for($j=0;$j<$count;$j++){
// Cheking letter counts
$_letter=$arr[$j][$i];
if(isset($_el[$_letter]))$_el[$_letter]++;
else$_el[$_letter]=1;
}
//Most probably count
list($mostProbably) = array_keys($_el, max($_el));
$_final.=$mostProbably;
// If probbaly example is not space
if($_el!=' '){
// THERE NEED TO BE CODE FOR REMOVING SPACE FROM LINES WHERE $text[$i] is space
}
}
return $_final;
}
echo getRight($text);

storing/generating barcode string in database (mysql)

I never worked with barcode and now i must design a whole app with barcode support. I was wondering what type of barcode i can use, how can i make shure that barcode string is uniqe and how would i store that in MySQL.
I was thinkin about generating some barcode strings and print them to stickers so my clients can use them. I was thinking to do generating part in php/mysql then prepare for printing (render in pdf). Let's say i generated 100 strings and store them to database and next time i want to generate another 200 that must be unique.
I don't even know where to begin with string. What information can i store in barcode string?
Can i do this: XXX-ZZZZZ-YYYY-autincrementID?
Where XXX is country ID, ZZZZZ is client ID, YYYY is barcode string ID. Should i use surrogative key for my primary key or should i split those to multiple tables?
Did i mentioned that all autoincrementID's should start from 1 for each client :) I am sooooo confused about all this.
Thanks

First decide on the barcode format you want to use.
Then check if there is a PHP implementation out there (there will be for most - if not all - barcode formats).
A basic example (using PEAR Image_Barcode) can be found at Using barcodes in your web application.
You just store the text in the DB and can generate the corresponding image using the Image_Barcode class (it supports Code 39, Code 128, EAN 13, INT 25, PostNet and UPCA).
I once wrote an app creating EAN 13 barcodes, don't remember which lib I used though (I'll check at home if I can find the source).

We need to separate some concerns.
First is the action of printing any given string as a barcode. The other answers talk about how to do that.
The other action has nothing to do with barcodes and is about database design. Your example suggests the barcode will be a combination of values. However, I get the idea (correct me if I am wrong) that the larger application is not yet clearly spelled out. Therefore it does not matter what kind of "play" table you create for unique codes right now -- create whatever you want. When you know what values must be printed as barcodes, then we are into a database design question.

A barcode is just a way to print and/or read a string. It involves
special fonts,
some calculation (for check digits)
Your first step should be to identify wich barcode you need to support. Many companies manufacturing barcode printers and readers also provide some help about that.
I found some great help here, including free fonts. It's a french site but a few things are available in English.

Autodetect Presence of CSV Headers in a File

Short question: How do I automatically detect whether a CSV file has headers in the first row?
Details: I've written a small CSV parsing engine that places the data into an object that I can access as (approximately) an in-memory database. The original code was written to parse third-party CSV with a predictable format, but I'd like to be able to use this code more generally.
I'm trying to figure out a reliable way to automatically detect the presence of CSV headers, so the script can decide whether to use the first row of the CSV file as keys / column names or start parsing data immediately. Since all I need is a boolean test, I could easily specify an argument after inspecting the CSV file myself, but I'd rather not have to (go go automation).
I imagine I'd have to parse the first 3 to ? rows of the CSV file and look for a pattern of some sort to compare against the headers. I'm having nightmares of three particularly bad cases in which:
The headers include numeric data for some reason
The first few rows (or large portions of the CSV) are null
There headers and data look too similar to tell them apart
If I can get a "best guess" and have the parser fail with an error or spit out a warning if it can't decide, that's OK. If this is something that's going to be tremendously expensive in terms of time or computation (and take more time than it's supposed to save me) I'll happily scrap the idea and go back to working on "important things".
I'm working with PHP, but this strikes me as more of an algorithmic / computational question than something that's implementation-specific. If there's a simple algorithm I can use, great. If you can point me to some relevant theory / discussion, that'd be great, too. If there's a giant library that does natural language processing or 300 different kinds of parsing, I'm not interested.

As others have pointed out, you can't do this with 100% reliability. There are cases where getting it 'mostly right' is useful, however - for example, spreadsheet tools with CSV import functionality often try to figure this out on their own. Here's a few heuristics that would tend to indicate the first line isn't a header:
The first row has columns that are not strings or are empty
The first row's columns are not all unique
The first row appears to contain dates or other common data formats (eg, xx-xx-xx)

In the most general sense, this is impossible. This is a valid csv file:
Name
Jim
Tom
Bill
Most csv readers will just take hasHeader as an option, and allow you to pass in your own header if you want. Even in the case you think you can detect, that being character headers and numeric data, you can run into a catastrophic failure. What if your column is a list of BMW series?
M
3
5
7
You will process this incorrectly. Worst of all, you will lose the best car!

In the purely abstract sense, I don't think there is an foolproof algorithmic answer to your question since it boils down to: "How do I distinguish dataA from dataB if I know nothing about either of them?". There will always be the potential for dataA to be indistinguishable from dataB. That said, I would start with the simple and only add complexity as needed. For example, if examining the first five rows, for a given column (or columns) if the datatype in rows 2-5 are all the same but differ from the datatype in row 1, there's a good chance that a header row is present (increased sample sizes reduce the possibility of error). This would (sorta) solve #1/#3 - perhaps throw an exception if the rows are all populated but the data is indistinguishable to allow the calling program to decide what to do next. For #2, simply don't count a row as a row unless and until it pulls non-null data....that would work in all but an empty file (in which case you'd hit EOF). It would never be foolproof, but it might be "close enough".

It really depends on just how "general" you want your tool to be. If the data will always be numeric, you have it easy as long as you assume non-numeric headers (which seems like a pretty fair assumption).
But beyond that, if you don't already know what patterns are present in the data, then you can't really test for them ahead of time.
FWIW, I actually just wrote a script for parsing out some stuff from TSVs, all from the same source. The source's approach to headers/formatting was so scattered that it made sense to just make the script ask me questions from the command line while executing. (Is this a header? Which columns are important?). So no automation, but it let's me fly through the data sets I'm working on, instead of trying to anticipate each funny formatting case. Also, my answers are saved in a file, so I only have to be involved once per file. Not ideal, but efficient.

This article provides some good guidance:
Basically, you do statistical analysis on columns based on whether the first row contains a string and the rest of the rows numbers, or something like that.
http://penndsg.com/blog/detect-headers/

If you CSV has a header like this.
ID, Name, Email, Date
1, john, john#john.com, 12 jan 2020
Then doing a filter_var(str, FILTER_VALIDATE_EMAIL) on the header row will fail. Since the email address is only in the row data. So check header row for an email address (assuming your CSV has email addresses in it).
Second idea.
http://php.net/manual/en/function.is-numeric.php
Check header row for is_numeric, most likely a header row does not have numeric data in it. But most likely a data row would have numeric data.
If you know you have dates in your columns, then checking the header row for a date would also work.
Obviously you need to what type of data you are expecting. I am "expecting" email addresses.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.