I have this string that I get back from Adobe Presenter 7. It's called suspend_data and, according to the docs, is of type CMIString4096:
CMIString4096 A set of ASCII characters with a maximum length
of 4096 characters.
This is the string:
aG1111111000000000BB001EC%2EacC%7E%24GS%2AayjHm110BKCBBB0B0EBAB1B1ED%2EicC%7E%24GS%2AlfkHm110BKDBCB0B0EBBB0B0EBAB1B1EE%2EwcC%7E%24GS%2ACBlHm100BKDB2BCBCDB1BABBDB0BBBADF%2E7cC%7E%24GS%2A4GmHm110BKBB0Ebl%C3%A1rRbl%C3%A1r%3Bgr%C3%A6nn%3Brau%C3%B0urB
It looks like Base64 with some URL-encoded characters. When I urldecode() the string, the last few characters resemble some data, but it's in UTF-8; when I then utf8_decode() it, I see this:
aG1111111000000000BB001EC.acC~$GS*ayjHm110BKCBBB0B0EBAB1B1ED.icC~$GS*
lfkHm110BKDBCB0B0EBBB0B0EBAB1B1EE.wcC~$GS*CBlHm100BKDB2BCBCDB1BABBDB0BBBADF.
7cC~$GS*4GmHm110BKBB0EblárRblár;grænn;rauðurB
OK, I'm closer to some data (at the end), but it still looks like a mess. When I base64_decode() it, I get some binary mess, and I don't know what on earth it is.
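In code, my attempts look roughly like this (string truncated):

$suspend = 'aG1111111000000000BB001EC%2EacC%7E%24GS%2Aayj...'; // truncated
$step1 = urldecode($suspend);    // percent-decoding: readable text appears at the end
$step2 = utf8_decode($step1);    // fixes the double-encoded characters like blár
$step3 = base64_decode($step2);  // just produces binary garbage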
Does anyone know what this data is and how I can make some sense of it? I'm using PHP, by the way, so only PHP functions are applicable.
The data stored in the cmi.suspend_data field is simply a bucket of data that the SCO (i.e., the content) can use to persist its current state. There is no semantic meaning or defined structure to the data. In many cases the meaning of the data can be guessed at or reverse engineered, but that does not appear to be the case with content produced by Adobe Presenter.
The suspend_data field is limited to 4096 ASCII characters. For some SCOs this doesn't provide enough storage to fully persist the current state. In many cases, a content developer faced with this predicament will apply a compression algorithm to the state data in order to squeeze it into the limited size. It looks like that is what Adobe Presenter is doing here. My guess is that they compressed their data to the unencoded state that you found, then applied url encoding to ensure that all of the resulting characters were safe to send to the LMS.
The string of 1's and 0's at the start of the suspend data might be something meaningful; it likely corresponds to which of the slides in the course the learner has previously viewed. To verify this, it might be helpful to run the course through a tool like SCORM TestTrack (freely available at scorm.com) and use the generated debug logs to watch how the suspend data changes as the user progresses through the course.
SCORM provides quite a few other data model elements which do have a specific meaning relating to the current status of the course. Here is a list of all available data model elements. The SCORM TestTrack debug logs will also show you which of those data model elements Adobe Presenter content uses.
I don't think that SCORM defines what the suspend_data field contains or in what format it is.
This is entirely up to the content/lesson (Adobe Presenter in your case), but it can only be text and is limited to 4096 characters.
This field can be used by the content to store any kind of state which should be passed back to the content the next time it is started.
Found
13. cmi.suspend_data
Read / Write
Intended to act as a location to store
any information that a SCO would like
to persist until a subsequent session.
in here. So, as Martin wrote, SCORM only defines the data type, not the encoding or the content of cmi.suspend_data. Perhaps this could help you in determining the encoding.
I have a site where users can change their location.
I have all of the available countries stored in a DB and an image for each of these in a folder in the same directory.
However, some countries have special characters and don't display properly, or else the image can't be found.
The countries in question are:
Côte d'Ivoire
Česká republika
I tried URL-encoding them so they look like this: %C4%8Cesk%C3%A1+republika
I need a way to store these in the DB in such a way as that they display the name correctly on the site and find the image of the same name.
First of all, see UTF-8 all the way through for all the things you need to do correctly to make non-ASCII characters work in your app in general.
Secondly, it's… tricky… to serve files with non-ASCII file names over the web. First, you need to ensure that you encode all URLs for these files with percent-encoding, as you already seem to do. Second, the web server will take that URL, percent-decode it to a byte string, and then ask the underlying operating/file system to look for a file whose name matches that byte string. This is the tricky part: you won't know exactly what byte string your OS/file system uses to represent that file. You would need to figure that out first, then encode the URL so it decodes to exactly the right string.
And when you move to a different server, especially if you're moving from Windows to *NIX or vice versa, you may get to do all of that over again, since those systems do things very differently.
In a nutshell, it's often more hassle than it's worth, and you should store your images with ASCII-only names to avoid all that. Specifically for countries, it'd make a whole lot of sense to use the two-character country codes for the image name (e.g. "cz.jpg").
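A minimal sketch of that approach (the array contents and file names are my assumptions):

// DB stores the display name (UTF-8); images use ASCII-only ISO country codes
$countryImages = array(
    "Côte d'Ivoire"   => 'ci.jpg',
    'Česká republika' => 'cz.jpg',
);

$country = 'Česká republika';
echo htmlspecialchars($country, ENT_QUOTES, 'UTF-8'); // display name renders correctly
echo '<img src="/flags/' . $countryImages[$country] . '" alt="">'; // ASCII-safe path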
Human readable, meaning the string is a real word. This is essentially a form validation. Ideally I'd like to test the 'texture' of the form responses to determine if an actual user has filled out the form versus someone looking for form vulnerabilities. Possibly using a dictionary look-up on the POSTed data and then giving a threshold of returned 'real words'.
I don't see anything in the PHP docs, and the Google machine isn't offering up anything, at least nothing this specific. I suspect that someone out there has written a PHP class or even a jQuery plugin that can do this. Something like so:
$string = "laiqbqi";
is_this_string_human_readable($string);
Any ideas?
This can be done using something called Markov Chains.
Essentially, they read through a large chunk of text in a given language (English, French, Russian, etc.) and determine the probability of one character following another.
e.g. a "q" has a much lower probability of occurring after a "z" than a vowel such as "a" does.
At a lower level, this is actually implemented as a state machine.
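A toy PHP illustration of the bigram idea (the corpus file and the 0.5 threshold are placeholders of mine; in practice you'd train on a large chunk of text):

// Count which two-character pairs occur in known-good text
function bigram_counts($text) {
    $counts = array();
    $text = strtolower($text);
    for ($i = 0, $n = strlen($text) - 1; $i < $n; $i++) {
        $pair = substr($text, $i, 2);
        $counts[$pair] = isset($counts[$pair]) ? $counts[$pair] + 1 : 1;
    }
    return $counts;
}

// Score a string by the fraction of its pairs ever seen in training
function looks_human_readable($string, array $counts, $threshold = 0.5) {
    $string = strtolower($string);
    $total = max(strlen($string) - 1, 1);
    $hits = 0;
    for ($i = 0; $i < $total; $i++) {
        if (isset($counts[substr($string, $i, 2)])) {
            $hits++;
        }
    }
    return ($hits / $total) >= $threshold;
}

$counts = bigram_counts(file_get_contents('english_corpus.txt')); // hypothetical corpus file
var_dump(looks_human_readable('laiqbqi', $counts)); // rare pairs like 'qb' should drag this below the threshold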
As per Mike's comment, a PHP version of this can be found here.
For flavor, here's an amusing Daily WTF article on Markov chains.
I have always used rawurlencode to store user-entered data in my MySQL databases. The main reason I do this is that I find it makes storing foreign characters very simple. I then use rawurldecode to retrieve and display the data.
I read somewhere that rawurlencode was not meant for this purpose. Are there any disadvantages to what I'm doing?
So let's say I have a German address with many characters like umlauts. What is the simplest way to store this in a MySQL database with no risk of it coming out wrong, while keeping it searchable with a search script? So far rawurlencode has been excellent for our system. Perhaps the practice could be improved by only encoding foreign letters and not common characters like spaces, which I totally agree is a waste of space.
Sure there are.
Let's start with the practical: for a large class of characters you are spending 3 bytes of storage for every byte of data. The description of rawurlencode (and of course the RFC) says that those characters are
all non-alphanumeric characters except -_.~
This means that there is a total of 26 + 26 + 10 (alphanumeric) + 4 (special exceptions) = 66 characters for which you do not waste space.
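A quick demonstration:

echo strlen('Привет');               // 12: six two-byte UTF-8 characters
echo strlen(rawurlencode('Привет')); // 36: every byte becomes a three-byte %XX sequence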
Then there are also the logical drawbacks: You are not storing the data itself, but rather a representation of the data tailored to URLs. Unless the data itself is URLs, that's not what you should be doing.
Drawbacks I can think of:
Waste of disk space.
Waste of CPU cycles encoding and decoding on every read and every write.
Additional complexity (you can't even inspect data with a MySQL client).
Inability to use full-text searches.
URL encoding is not necessarily unique (there are at least two RFCs). It may not lead to data loss, but it can lead to duplicate data (e.g., unique indexes where two rows actually contain the same piece of data).
You can accidentally encode a non-string piece of data such as a date: 2012-04-20%2013%3A23%3A00
But the main consideration is that this technique is completely arbitrary and unnecessary, since MySQL has no problem at all storing the complete Unicode catalogue. You could also decide to swap e's and o's in all strings: Holle, werld!. Your app would run fine, but it would not provide any added value.
Update: As Your Common Sense points out, a SQL clause as basic as ORDER BY is no longer usable. It's not that international chars will be ignored; you'll basically get an arbitrary sort order based on the ASCII codes of the % and hexadecimal characters. If you can't SELECT * FROM city ORDER BY city_name reliably, you've rendered your DB useless.
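You can reproduce the effect in PHP itself; a sketch using sort() as a stand-in for ORDER BY:

$cities  = array('Ávila', 'Berlin', 'Örebro');
$encoded = array_map('rawurlencode', $cities);
sort($encoded);
print_r($encoded);
// %C3%81vila, %C3%96rebro, Berlin: '%' (0x25) sorts before 'B',
// so every accented name jumps ahead of every plain one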
I am using a fork to eat soup.
I am using banknotes to light the coals for a BBQ.
I am using a kettle to boil eggs.
I am using a microscope to hammer nails.
Are there any disadvantages to what I'm doing?
YES
You are using a tool for a purpose it was not meant for. That is always a disadvantage.
A sane human being always uses the tool that is intended for the job at hand, not some randomly picked one, especially when there is no shortage of proper tools.
URL encoding is not intended to be used with a database, as one can tell from the name. That alone is reason enough for the sane developer. Take a look around: find the proper tool.
There is a thing called "common sense", a thing widely used in regular life but for some reason always absent in the PHP world.
Common sense warns us: if we're using the wrong tool, it may spoil the work. Sooner or later it will spoil it. No need to ask for the specific details; it's a general rule. We learn this rule at about the age of five.
Why not use it when playing with some web thingies too?
Why not ask yourself:
What's wrong with storing foreign characters at all?
urlencode makes storing foreign characters very simple
What hardships did you encounter without urlencode?
Although I feel that common sense should be enough to answer the question, people always look for the "omen", the proof. Here you are:
A database's job is not limited to just storing and retrieving data; a plain text file can handle such a primitive task as well.
Data manipulation is what we use databases for.
The most widely used manipulations are sorting and filtering.
Such a quite intelligent thing as a database can sort and filter data case- and accent-insensitively, which is a very handy feature. But of course this only works if the characters are saved as is, not as some random codes.
Sorting text may also use an ordering other than the binary order of the character table. Some umlaut characters sit in other parts of the table, but the database collation will put them in the right place. Again, this only works if the characters are saved as is, not as some random codes.
Sometimes we have to manipulate data that is already stored in the database; say, cut some piece from a string and compare it with an entered value. How is that supposed to be done with urlencoded data?
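For contrast, a minimal sketch of using the right tool: a UTF-8 connection and plain strings (credentials and table name are placeholders):

// utf8mb4 covers the full Unicode range in MySQL
$pdo = new PDO('mysql:host=localhost;dbname=shop;charset=utf8mb4', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO customers (name) VALUES (?)');
$stmt->execute(array('Jürgen Müßigmann')); // stored as is: sortable, filterable, searchable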
I have no clue whether this is a normal issue or not, but I have a small Flash application that handles management for my company. It's a small company, so it's not a big deal; it's just a bunch of INSERTs, SELECTs, UPDATEs and other stuff to manage their clients, addresses, phone numbers, etc.
The Flash app (in AS3) sends the variables through a URLRequest to several PHP pages, and the PHP handles the requests to MySQL.
My problem is that, sometimes, instead of inserting the string I sent, it inserts a weird string made mostly, but not only, of numbers (and it happens to about 1 column out of 10 per INSERT, so it's fairly common).
Is this a known issue? Could it be because of the encoding (I used UTF-8, which I believe is what we use here in Portugal, due to special characters like ã, à, á, etc.)?
Thank you for your time.
Marco Fox.
After connecting to the DB, try running the query "SET CHARACTER SET utf8;".
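For example, with mysqli (credentials are placeholders; mysqli's set_charset() is the more thorough option):

$db = new mysqli('localhost', 'user', 'pass', 'mydb');
$db->query('SET CHARACTER SET utf8');
// or, preferably, let the client library handle it:
$db->set_charset('utf8');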
Make sure every PHP page is saved as UTF-8.
To do that, open the file in Notepad++ and use the menu Encoding -> Convert to UTF-8 without BOM. (Plain Notepad's Save As dialog has an encoding dropdown below the file name, but it saves the BOM, which is not good.)
Some IDEs can save in ANSI, UTF-8 and more, or have a conversion option.
In Flash, use encodeURI() on your URLLoader data if you are passing it by GET.
Hope this solves your problem (if it is, in fact, an encoding issue).
Short question: How do I automatically detect whether a CSV file has headers in the first row?
Details: I've written a small CSV parsing engine that places the data into an object that I can access as (approximately) an in-memory database. The original code was written to parse third-party CSV with a predictable format, but I'd like to be able to use this code more generally.
I'm trying to figure out a reliable way to automatically detect the presence of CSV headers, so the script can decide whether to use the first row of the CSV file as keys / column names or start parsing data immediately. Since all I need is a boolean test, I could easily specify an argument after inspecting the CSV file myself, but I'd rather not have to (go go automation).
I imagine I'd have to parse the first 3 to ? rows of the CSV file and look for a pattern of some sort to compare against the headers. I'm having nightmares about three particularly bad cases:
The headers include numeric data for some reason
The first few rows (or large portions of the CSV) are null
The headers and data look too similar to tell them apart
If I can get a "best guess" and have the parser fail with an error or spit out a warning if it can't decide, that's OK. If this is something that's going to be tremendously expensive in terms of time or computation (and take more time than it's supposed to save me) I'll happily scrap the idea and go back to working on "important things".
I'm working with PHP, but this strikes me as more of an algorithmic / computational question than something that's implementation-specific. If there's a simple algorithm I can use, great. If you can point me to some relevant theory / discussion, that'd be great, too. If there's a giant library that does natural language processing or 300 different kinds of parsing, I'm not interested.
As others have pointed out, you can't do this with 100% reliability. There are cases where getting it "mostly right" is useful, however; for example, spreadsheet tools with CSV import functionality often try to figure this out on their own. Here are a few heuristics that tend to indicate the first line isn't a header (see the sketch after this list):
The first row has columns that are not strings or are empty
The first row's columns are not all unique
The first row appears to contain dates or other common data formats (e.g., xx-xx-xx)
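A rough PHP sketch of those checks (the function name and date pattern are mine):

// True if the first row is probably data rather than a header,
// based on the three heuristics above
function first_row_is_probably_data(array $row) {
    foreach ($row as $cell) {
        if ($cell === '' || is_numeric($cell)) {
            return true; // empty or numeric cell
        }
        if (preg_match('#^\d{1,4}[-/.]\d{1,2}[-/.]\d{1,4}$#', $cell)) {
            return true; // looks like a date such as xx-xx-xx
        }
    }
    return count(array_unique($row)) !== count($row); // duplicate column names
}

$firstRow = fgetcsv(fopen('input.csv', 'r')); // hypothetical file
var_dump(first_row_is_probably_data($firstRow));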
In the most general sense, this is impossible. This is a valid CSV file:
Name
Jim
Tom
Bill
Most CSV readers just take hasHeader as an option and allow you to pass in your own header if you want. Even in the case you think you can detect, character headers over numeric data, you can run into a catastrophic failure. What if your column is a list of BMW series?
M
3
5
7
You will process this incorrectly. Worst of all, you will lose the best car!
In the purely abstract sense, I don't think there is a foolproof algorithmic answer to your question, since it boils down to: "How do I distinguish dataA from dataB if I know nothing about either of them?" There will always be the potential for dataA to be indistinguishable from dataB.
That said, I would start simple and only add complexity as needed. For example, when examining the first five rows: if, for a given column (or columns), the data type in rows 2-5 is consistent but differs from the data type in row 1, there's a good chance that a header row is present (increased sample sizes reduce the possibility of error). This would (sorta) solve #1 and #3; perhaps throw an exception if the rows are all populated but the data is indistinguishable, to let the calling program decide what to do next.
For #2, simply don't count a row as a row unless and until it pulls non-null data. That would work in all but an empty file (in which case you'd hit EOF). It would never be foolproof, but it might be "close enough".
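A sketch of that sampling approach (the five-row sample and the numeric/text distinction are simplifications of mine):

// Guess that a header is present if some column is non-numeric in
// row 1 but consistently numeric in the following sample rows
function guess_has_header($file, $sampleSize = 5) {
    $fh = fopen($file, 'r');
    $rows = array();
    while (count($rows) < $sampleSize && ($row = fgetcsv($fh)) !== false) {
        if ($row !== array(null)) { // per #2: skip blank lines entirely
            $rows[] = $row;
        }
    }
    fclose($fh);
    if (count($rows) < 2) {
        return false; // not enough data to decide
    }
    foreach (array_keys($rows[0]) as $col) {
        if (is_numeric($rows[0][$col])) {
            continue; // row 1 is numeric here: no signal from this column
        }
        $restNumeric = true;
        for ($i = 1; $i < count($rows); $i++) {
            if (!isset($rows[$i][$col]) || !is_numeric($rows[$i][$col])) {
                $restNumeric = false;
                break;
            }
        }
        if ($restNumeric) {
            return true; // text over numbers: probably a header
        }
    }
    return false;
}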
It really depends on just how "general" you want your tool to be. If the data will always be numeric, you have it easy as long as you assume non-numeric headers (which seems like a pretty fair assumption).
But beyond that, if you don't already know what patterns are present in the data, then you can't really test for them ahead of time.
FWIW, I actually just wrote a script for parsing some stuff out of TSVs, all from the same source. The source's approach to headers/formatting was so scattered that it made sense to just have the script ask me questions from the command line while executing (Is this a header? Which columns are important?). So there's no automation, but it lets me fly through the data sets I'm working on instead of trying to anticipate each funny formatting case. Also, my answers are saved in a file, so I only have to be involved once per file. Not ideal, but efficient.
This article provides some good guidance: http://penndsg.com/blog/detect-headers/
Basically, you do statistical analysis on the columns, based on whether the first row contains a string while the rest of the rows contain numbers, or something like that.
If your CSV has a header like this:
ID, Name, Email, Date
1, john, john@john.com, 12 jan 2020
Then running filter_var($str, FILTER_VALIDATE_EMAIL) on the header row will fail, since the email addresses only appear in the data rows. So check the header row for an email address (assuming your CSV has email addresses in it).
Second idea.
http://php.net/manual/en/function.is-numeric.php
Check the header row with is_numeric(): a header row most likely does not contain numeric data, but a data row most likely would.
If you know you have dates in your columns, then checking the header row for a date would also work.
Obviously you need to know what type of data you are expecting. I am "expecting" email addresses.
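Putting those checks together (a sketch; the file name is a placeholder):

$fh = fopen('contacts.csv', 'r');
$firstRow = fgetcsv($fh);
$hasHeader = true;
foreach ($firstRow as $cell) {
    $cell = trim($cell);
    if (is_numeric($cell) || filter_var($cell, FILTER_VALIDATE_EMAIL)) {
        $hasHeader = false; // numbers or an email in row 1: it's data, not a header
        break;
    }
}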