I have never worked with barcodes, and now I must design a whole app with barcode support. I was wondering what type of barcode I can use, how I can make sure that each barcode string is unique, and how I would store that in MySQL.
I was thinking about generating some barcode strings and printing them on stickers so my clients can use them. I was thinking of doing the generating part in PHP/MySQL, then preparing them for printing (rendering in PDF). Let's say I generate 100 strings and store them in the database; the next time I want to generate another 200, they must be unique.
I don't even know where to begin with the string. What information can I store in a barcode string?
Can I do this: XXX-ZZZZZ-YYYY-autoincrementID?
Where XXX is a country ID, ZZZZZ is a client ID, and YYYY is a barcode string ID. Should I use a surrogate key for my primary key, or should I split those into multiple tables?
Did I mention that all autoincrement IDs should start from 1 for each client? :) I am so confused about all this.
Thanks
First decide on the barcode format you want to use.
Then check if there is a PHP implementation out there (there will be for most - if not all - barcode formats).
A basic example (using PEAR Image_Barcode) can be found at Using barcodes in your web application.
You just store the text in the DB and can generate the corresponding image using the Image_Barcode class (it supports Code 39, Code 128, EAN 13, INT 25, PostNet and UPCA).
I once wrote an app creating EAN 13 barcodes; I don't remember which lib I used, though (I'll check at home whether I can find the source).
We need to separate some concerns.
First is the action of printing any given string as a barcode. The other answers talk about how to do that.
The other action has nothing to do with barcodes and is about database design. Your example suggests the barcode will be a combination of values. However, I get the idea (correct me if I am wrong) that the larger application is not yet clearly spelled out. Therefore it does not matter what kind of "play" table you create for unique codes right now -- create whatever you want. When you know what values must be printed as barcodes, then we are into a database design question.
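If it helps, here is a minimal sketch of how the XXX-ZZZZZ-YYYY idea could be put together. The table layout, names, and padding widths are just illustrative, not a recommendation:

```php
<?php
// Illustrative schema with a per-client counter that starts at 1,
// as the question asks for. Names are made up:
//
// CREATE TABLE barcode (
//   client_id INT NOT NULL,
//   seq       INT NOT NULL,          -- per-client counter, starts at 1
//   code      VARCHAR(32) NOT NULL,
//   PRIMARY KEY (client_id, seq),    -- composite key, no surrogate needed
//   UNIQUE KEY (code)                -- guarantees the code string is unique
// );

// Build the printable code string. Zero-padding keeps all codes the
// same length, which barcode fonts and scanners tend to prefer.
function buildBarcodeString(int $countryId, int $clientId, int $seq): string
{
    return sprintf('%03d-%05d-%04d', $countryId, $clientId, $seq);
}

echo buildBarcodeString(354, 42, 1);  // 354-00042-0001
```

To get the next seq for a client, you would SELECT MAX(seq) for that client_id inside a transaction (or keep a counter table); the UNIQUE KEY on code is the safety net either way.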
A barcode is just a way to print and/or read a string. It involves
special fonts,
some calculation (for check digits)
Your first step should be to identify which barcode format you need to support. Many companies that manufacture barcode printers and readers also provide some guidance on that.
I found some great help here, including free fonts. It's a French site, but a few things are available in English.
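As an example of the check-digit calculation mentioned above, here is a rough PHP sketch for EAN 13 (assuming that's the format you end up with; other formats use different rules):

```php
<?php
// EAN 13 check digit: the first 12 digits are weighted alternately
// 1, 3, 1, 3, ... from the left; the check digit brings the weighted
// sum up to the next multiple of 10.
function ean13CheckDigit(string $first12Digits): int
{
    $sum = 0;
    foreach (str_split($first12Digits) as $i => $digit) {
        $sum += (int)$digit * ($i % 2 === 0 ? 1 : 3);
    }
    return (10 - $sum % 10) % 10;
}

echo ean13CheckDigit('400638133393');  // 1 -> full code is 4006381333931
```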
Related
I would like to scrape a text file which is the output from Oracle AP. I don't have access to Oracle, but I need to assist in bug hunting and compare the text file against two CSV files from other systems. Importing the CSV files into a database is not a problem, but I'm struggling with this text file.
The text file is divided into two parts: what was successfully imported, and what was rejected. Each column has a specific width, set by Oracle when creating the report, and they will not change the column-width setting. If the content of a column exceeds its width, it simply continues on the row below. The columns for imported and rejected rows are also not 100% the same.
For the successful imports it's simple, as there is one version of every row, but the rejected ones might have more than one row, for different reasons.
The import file is shortened and obfuscated for obvious reasons, as it can be several thousand lines. It's best viewed in a text editor without word wrap. I cannot get it to look good in this forum with blockquote or code samples in the forum editor, so please view/copy it from the links below.
I'm showing the successful ones on regex101.com here.
Regex finding the imported (I'm sure it could be better, but it works and that is good enough for me):
\s(\d+)\s+([\D]{2,})(\d+)\s+(\d{1,2}-[a-zA-Z]{3}-\d{2})\s+(\w+)\s+([\w+\,]*\.\d+)\s+(\d)\s+([\w+\,]*\.\d+)\s+(\d{1,2}-[a-zA-Z]{3}-\d{2})
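For reference, this is roughly how the pattern can be applied in PHP. The sample line below is made up to fit the pattern; in practice you would read the report with file_get_contents():

```php
<?php
// Apply the "imported" pattern to the report text. PREG_SET_ORDER
// gives one array per matched row, with capture groups 1..9.
$pattern = '/\s(\d+)\s+([\D]{2,})(\d+)\s+(\d{1,2}-[a-zA-Z]{3}-\d{2})\s+'
         . '(\w+)\s+([\w+\,]*\.\d+)\s+(\d)\s+([\w+\,]*\.\d+)\s+'
         . '(\d{1,2}-[a-zA-Z]{3}-\d{2})/';

// Made-up sample line; replace with file_get_contents('report.txt').
$reportText = ' 123  ACME Ltd 456  12-Jan-20  INV001  1,234.56  1  1,234.56  13-Jan-20';

preg_match_all($pattern, $reportText, $matches, PREG_SET_ORDER);

foreach ($matches as $row) {
    // $row[1] = invoice no, $row[2] = supplier name, $row[4] = date, ...
    echo trim($row[1]) . ' / ' . trim($row[2]) . ' / ' . $row[4] . PHP_EOL;
}
```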
I'm struggling with the rejected ones, however, due to the variations:
Duplicate invoice numbers, if there is more than one reason (column) for not being imported.
Missing supplier number and supplier name (always shows up in pair).
Here is what I've done so far with the rejected ones.
Regex finding rejected:
^\s(\d+)\s+([\D]{2,})(\d+)\s+(\d{1,2}-[a-zA-Z]{3}-\d{2})\s+(\w+)\s+(-?[\w]{1,}\.?\d+)\s+
Clearly my regex for the rejected rows is not the final result; it's crap at the moment. It would even match a successful row.
My questions:
Is it possible to have only one regex for rejected catching the variations mentioned in bullet points above? Example would be appreciated.
Is it possible to fetch the word-wrapped parts of a column? Example would be appreciated.
I'm trying to understand the PCRE documentation regarding conditionals, as it might help when dealing with the rejected variations, but so far I'm struggling with it.
Regards,
Bjørn
I am looking into implementing a bag-of-words approach for dealing with emails stored as text files. I want to use keywords that could indicate that an email needs a reply, analyse the emails into binary form (something like 1|0|1|0|0 etc., depending on whether each word is used) and then obtain feature vectors that I could use with different ML algorithms.
I was thinking about using PHP to obtain the feature vectors, but I can't find any existing implementations. Is it even possible to do something like that in PHP?
Yes, bag of words makes a lot of sense for building classifiers. I am also doing my thesis on text classification, and I am using PHP and MySQL for it. I was a little confused about creating the bag of words at first, but after some time it can be done.
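A minimal sketch of the binary approach described in the question, in plain PHP (the keyword list here is made up):

```php
<?php
// Build a binary bag-of-words vector: 1 if the keyword occurs in the
// email text, 0 otherwise. Tokenisation is a crude \W+ split; a real
// pipeline would also stem words, handle stop words, etc.
function binaryBagOfWords(string $text, array $keywords): array
{
    $tokens = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $present = array_flip($tokens);  // token => index, for O(1) lookup

    $vector = [];
    foreach ($keywords as $word) {
        $vector[] = isset($present[$word]) ? 1 : 0;
    }
    return $vector;
}

$keywords = ['please', 'reply', 'urgent', 'invoice', 'meeting'];
$email = 'Please reply as soon as possible regarding the invoice.';

echo implode('|', binaryBagOfWords($email, $keywords));  // 1|1|0|1|0
```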
How do I best choose a size for a varchar/text/... column in a (MySQL) database? Let's assume the text the user can type into a text area should be at most 500 characters, considering that the user might also use formatting (HTML/BB code/...), which is not visible to the user and should not count towards the 500-character limit.
1) Theoretically, to prevent any error, the varchar size would have to be almost endless if the user e.g. uses 20 links like this (http://[huge number of chars]) or whatever... or not?
2) Should/could you save the formatting in a separate column, e.g. so as not to feed an index (like FULLTEXT) wrong values (words that are contained in the formatting but not in the real text)?
If yes, how best to do this? Do you remember at which point the formatting was used, save that position and the formatting, and put this information back together when outputting?
(PHP/MySQL, JavaScript, jQuery)
Thank you very much in advance!
A good solution is to factor in the number of formatting characters.
If you do not, then to avoid data loss you need to reserve much more space for the text in the database and check the length of the record before saving, or use a TEXT column.
Keeping the same data twice in one table is not a good solution. It all depends on your project, but it's usually better to filter the formatting in PHP.
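For example, a rough sketch of checking the visible length in PHP, assuming HTML formatting (for BB code you would strip the [tags] with a regex instead):

```php
<?php
// Count only the characters the user actually sees: strip the HTML
// first, then measure in characters (not bytes) with mb_strlen(),
// so accented letters count once.
function visibleLength(string $formattedText): int
{
    return mb_strlen(strip_tags($formattedText));
}

$input = '<b>Hello</b> <a href="http://example.com/very/long/url">world</a>';

if (visibleLength($input) > 500) {
    // reject or truncate before saving
}
echo visibleLength($input);  // 11 ("Hello world")
```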
Short question: How do I automatically detect whether a CSV file has headers in the first row?
Details: I've written a small CSV parsing engine that places the data into an object that I can access as (approximately) an in-memory database. The original code was written to parse third-party CSV with a predictable format, but I'd like to be able to use this code more generally.
I'm trying to figure out a reliable way to automatically detect the presence of CSV headers, so the script can decide whether to use the first row of the CSV file as keys / column names or start parsing data immediately. Since all I need is a boolean test, I could easily specify an argument after inspecting the CSV file myself, but I'd rather not have to (go go automation).
I imagine I'd have to parse the first 3 to ? rows of the CSV file and look for a pattern of some sort to compare against the headers. I'm having nightmares about three particularly bad cases, in which:
The headers include numeric data for some reason
The first few rows (or large portions of the CSV) are null
The headers and data look too similar to tell them apart
If I can get a "best guess" and have the parser fail with an error or spit out a warning if it can't decide, that's OK. If this is something that's going to be tremendously expensive in terms of time or computation (and take more time than it's supposed to save me) I'll happily scrap the idea and go back to working on "important things".
I'm working with PHP, but this strikes me as more of an algorithmic / computational question than something that's implementation-specific. If there's a simple algorithm I can use, great. If you can point me to some relevant theory / discussion, that'd be great, too. If there's a giant library that does natural language processing or 300 different kinds of parsing, I'm not interested.
As others have pointed out, you can't do this with 100% reliability. There are cases where getting it 'mostly right' is useful, however - for example, spreadsheet tools with CSV import functionality often try to figure this out on their own. Here are a few heuristics that would tend to indicate the first line isn't a header:
The first row has columns that are not strings or are empty
The first row's columns are not all unique
The first row appears to contain dates or other common data formats (eg, xx-xx-xx)
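A rough sketch of these heuristics in PHP (assuming the row has already been parsed into an array of cell strings):

```php
<?php
// Return true if the first row plausibly looks like a header:
// no empty cells, no numeric cells, and no duplicate column names.
// This is a heuristic, not a guarantee.
function looksLikeHeader(array $firstRow): bool
{
    foreach ($firstRow as $cell) {
        if ($cell === '' || is_numeric($cell)) {
            return false;  // empty or numeric cell -> likely data
        }
    }
    // Duplicate column names are unusual for a header row.
    return count($firstRow) === count(array_unique($firstRow));
}

var_dump(looksLikeHeader(['id', 'name', 'email']));    // bool(true)
var_dump(looksLikeHeader(['1', 'Jim', 'jim@x.com']));  // bool(false)
```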
In the most general sense, this is impossible. This is a valid csv file:
Name
Jim
Tom
Bill
Most csv readers will just take hasHeader as an option, and allow you to pass in your own header if you want. Even in the one case you might think is detectable - string headers over numeric data - you can run into a catastrophic failure. What if your column is a list of BMW series?
M
3
5
7
You will process this incorrectly. Worst of all, you will lose the best car!
In the purely abstract sense, I don't think there is a foolproof algorithmic answer to your question, since it boils down to: "How do I distinguish dataA from dataB if I know nothing about either of them?". There will always be the potential for dataA to be indistinguishable from dataB. That said, I would start simple and only add complexity as needed. For example, if, examining the first five rows, the datatype of a given column (or columns) is the same in rows 2-5 but differs in row 1, there's a good chance that a header row is present (increased sample sizes reduce the possibility of error). This would (sorta) solve #1/#3 - perhaps throw an exception if the rows are all populated but the data is indistinguishable, to let the calling program decide what to do next. For #2, simply don't count a row as a row unless and until it pulls non-null data... that would work in all but an empty file (in which case you'd hit EOF). It would never be foolproof, but it might be "close enough".
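A sketch of that sampling idea in PHP (the five-row sample and the crude number/string typing are just illustrative):

```php
<?php
// Crude cell "type": numbers vs everything else.
function cellType(string $cell): string
{
    return is_numeric($cell) ? 'number' : 'string';
}

// Guess that row 1 is a header if some column has a uniform type in
// the sampled body rows that differs from row 1's type in that column.
function probablyHasHeader(array $rows, int $sample = 5): bool
{
    $first = $rows[0];
    $body = array_slice($rows, 1, $sample);
    foreach (array_keys($first) as $col) {
        $bodyTypes = array_values(array_unique(
            array_map(fn($r) => cellType($r[$col]), $body)
        ));
        if (count($bodyTypes) === 1 && $bodyTypes[0] !== cellType($first[$col])) {
            return true;
        }
    }
    return false;
}

$rows = [
    ['id', 'amount'],
    ['1', '9.50'],
    ['2', '3.25'],
    ['3', '7.00'],
];
echo probablyHasHeader($rows) ? 'header' : 'no header';  // header
```

As noted above, the BMW-series case ('M' over 3, 5, 7) would still fool this, which is why hasHeader as an explicit option remains the safe default.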
It really depends on just how "general" you want your tool to be. If the data will always be numeric, you have it easy as long as you assume non-numeric headers (which seems like a pretty fair assumption).
But beyond that, if you don't already know what patterns are present in the data, then you can't really test for them ahead of time.
FWIW, I actually just wrote a script for parsing out some stuff from TSVs, all from the same source. The source's approach to headers/formatting was so scattered that it made sense to just have the script ask me questions from the command line while executing (Is this a header? Which columns are important?). So no automation, but it lets me fly through the data sets I'm working on instead of trying to anticipate each funny formatting case. Also, my answers are saved in a file, so I only have to be involved once per file. Not ideal, but efficient.
This article provides some good guidance: http://penndsg.com/blog/detect-headers/
Basically, you do statistical analysis on the columns, based on whether the first row contains strings and the rest of the rows contain numbers, or something like that.
If your CSV has a header like this:
ID, Name, Email, Date
1, john, john@john.com, 12 jan 2020
Then doing filter_var($str, FILTER_VALIDATE_EMAIL) on the header row will fail, since the email address is only in the row data. So check the header row for an email address (assuming your CSV has email addresses in it).
Second idea.
http://php.net/manual/en/function.is-numeric.php
Check the header row with is_numeric(): most likely a header row does not contain numeric data, but a data row most likely would.
If you know you have dates in your columns, then checking the header row for a date would also work.
Obviously you need to know what type of data you are expecting. I am "expecting" email addresses.
I have this string that I get back from Adobe Presenter 7. It's called suspend_data and is of type CMIString4096 (per the docs):
CMIString4096 A set of ASCII characters with a maximum length
of 4096 characters.
This is the string:
aG1111111000000000BB001EC%2EacC%7E%24GS%2AayjHm110BKCBBB0B0EBAB1B1ED%2EicC%7E%24GS%2AlfkHm110BKDBCB0B0EBBB0B0EBAB1B1EE%2EwcC%7E%24GS%2ACBlHm100BKDB2BCBCDB1BABBDB0BBBADF%2E7cC%7E%24GS%2A4GmHm110BKBB0Ebl%C3%A1rRbl%C3%A1r%3Bgr%C3%A6nn%3Brau%C3%B0urB
It looks like base64 with some urlencoded characters. When I urldecode() the string, the last few characters resemble some data, but it's in UTF-8; then I utf8_decode() it and see this:
aG1111111000000000BB001EC.acC~$GS*ayjHm110BKCBBB0B0EBAB1B1ED.icC~$GS*
lfkHm110BKDBCB0B0EBBB0B0EBAB1B1EE.wcC~$GS*CBlHm100BKDB2BCBCDB1BABBDB0BBBADF.
7cC~$GS*4GmHm110BKBB0EblárRblár;grænn;rauðurB
OK, I'm closer to some data (at the end), but it still looks like a mess. When I base64_decode() it I get some binary mess, but I don't know what on earth it is.
Does anyone know what this data is and how I can get some sense out of it? I'm using PHP, by the way, so only functions available in it are applicable.
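For reference, the decoding steps described above as one small sketch. This only undoes the transport encoding; the inner format remains Presenter-specific. The sample string is shortened, and mb_convert_encoding() is used here as the equivalent of utf8_decode():

```php
<?php
// Shortened sample of the suspend_data string from the question.
$suspendData = 'aG1111111000000000BB001EC%2EacC%7E%24GS%2AayjHm110BKBB0E'
             . 'bl%C3%A1rRbl%C3%A1r%3Bgr%C3%A6nn%3Brau%C3%B0urB';

// Step 1: undo the %XX escapes.  Step 2: the escaped bytes were UTF-8,
// so convert them back to a single-byte (ISO-8859-1) string, which is
// what utf8_decode() does.
$decoded = mb_convert_encoding(urldecode($suspendData), 'ISO-8859-1', 'UTF-8');

echo $decoded;
```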
The data stored in the cmi.suspend_data field is simply a bucket of data that the SCO (i.e., the content) can use to persist its current state. There is no semantic meaning or defined structure to the data. In many cases the meaning of the data can be guessed at or reverse engineered, but that does not appear to be the case with content produced by Adobe Presenter.
The suspend_data field is limited to 4096 ASCII characters. For some SCOs this doesn't provide enough storage to fully persist the current state. In many cases, a content developer faced with this predicament will apply a compression algorithm to the state data in order to squeeze it into the limited size. It looks like that is what Adobe Presenter is doing here. My guess is that they compressed their data to the unencoded state that you found, then applied url encoding to ensure that all of the resulting characters were safe to send to the LMS.
The string of 1's and 0's at the start of the suspend data might be something meaningful. It could well correspond to which of the slides in the course the learner has previously viewed. To verify this, it might be helpful to run the course through a tool like SCORM TestTrack (freely available at scorm.com) and use the generated debug logs to watch how the suspend data changes as the user progresses through the course.
SCORM provides quite a few other data model elements which do have a specific meaning relating to the current status of the course. Here is a list of all available data model elements. The SCORM TestTrack debug logs will also show you which of those data model elements Adobe Presenter content uses.
I don't think that SCORM defines what the suspend_data field contains or what format it is in.
This is entirely up to the content/lesson (Adobe Presenter in your case), but it can only be text and is limited to 4096 characters.
This field can be used by the content to store any kind of state which should be passed back to the content the next time it is started.
Found this in here:
13. cmi.suspend_data
Read / Write
Intended to act as a location to store any information that a SCO would like to persist until a subsequent session.
So as Martin wrote, SCORM only defines the data type, not the encoding or the content of cmi.suspend_data. Perhaps this could help you in determining the encoding.