How can I identify certain symbols on a scanned form? - php

I'm building a web application in PHP, and part of the requirement is that I need to be able to quickly process data on a scanned copy of a fairly simple form, and save it to a database for later retrieval.
Given the following image
how can I identify and assign a database field a value of either true or false (true when it sees a tick, and false otherwise)?
I'm thinking along the following line of implementation:
I will keep two copies of the above image - the first will have ticks shown (as above), and the second will be a "clean" copy of the image with the borders left behind. Comparing between the two images will yield a difference; the difference will return either a value of true or false.
As far as I can see, this approach has drawbacks. What happens if the user scribbles something in a box (as seen above) that does not mean anything? And how do I ensure that the returned true or false values are assigned to the appropriate columns in the database?
I don't have any code implementation at this point in time, and I'm not asking for it. Rather, I'm asking for guidance on where to look and how I can efficiently do this.

You could try the OpenCV bindings for PHP (https://github.com/mgdm/OpenCV-for-PHP, http://mgdm.net/talks/confoo11/making-php-see.pdf) and use contour detection (or another classifier) to find marks like "V" and skip false positives.

You might want to use a PHP OCR library.

I would do this in the following way: divide the image into a 2x6 grid and count the black pixels in each row. If the count n lies in the range <A;B>, we can assume that the row is checked. If someone scratches out an answer, n will be larger than B.
If n is in the <A;B> range, we can additionally check its pattern, for example the part common to all marked rows, which reflects the user's handwriting.
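The counting idea above can be sketched in plain PHP. This is only a rough illustration: it assumes the scan has already been reduced to a 2D array of 0/1 pixels (for instance by thresholding each pixel read with GD's imagecolorat()), and the thresholds TICK_MIN/TICK_MAX (A and B), the function names, and the cell-to-field map are all hypothetical. Keying each cell by a database column name is one way to make sure each result lands in the right column.

```php
<?php
// Assumed thresholds: a cell is "ticked" when its dark-pixel count is in [A, B].
// Real values would have to be tuned against actual scans.
const TICK_MIN = 3; // A
const TICK_MAX = 8; // B

/**
 * Count dark pixels inside one rectangular cell.
 * $pixels is a 2D array indexed [row][col] holding 0 (white) or 1 (dark).
 */
function countDarkPixels(array $pixels, int $top, int $left, int $height, int $width): int
{
    $count = 0;
    for ($y = $top; $y < $top + $height; $y++) {
        for ($x = $left; $x < $left + $width; $x++) {
            $count += $pixels[$y][$x] ?? 0; // pixels outside the image count as white
        }
    }
    return $count;
}

/** Map a dark-pixel count to 'empty', 'ticked' or 'scribbled'. */
function classifyCell(int $darkPixels): string
{
    if ($darkPixels < TICK_MIN) {
        return 'empty';
    }
    return $darkPixels <= TICK_MAX ? 'ticked' : 'scribbled';
}

/**
 * $cells maps a (hypothetical) database column name to [top, left, height, width].
 * Returns column name => classification.
 */
function classifyForm(array $pixels, array $cells): array
{
    $result = [];
    foreach ($cells as $field => [$top, $left, $height, $width]) {
        $result[$field] = classifyCell(countDarkPixels($pixels, $top, $left, $height, $width));
    }
    return $result;
}
```

A 'ticked' result would be stored as true and 'empty' as false; 'scribbled' cells are probably best flagged for manual review rather than guessed.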

Related

PHPExcel - get current page during document creation

Client works in pharmacy industry.
We print bills for items bought by clients.
General structure of document is the following:
Header
List of items
Footer
Sometimes the whole list of items fills N pages.
In this case the footer is pushed onto a separate last page.
Local law forbids such documents, i.e. the footer cannot stand alone. Its page must contain at least one sold item.
Thoughts about implementation:
It would be easy to implement if I could get the current page while calling operations like setCellValue or fromArray.
In that case I could check how many lines I still have and whether I'm on a new page.
If I'm on a new page and have zero items, I would add an empty line before the last item.
I suspect this is not possible in PHPExcel because of the way Excel itself works.
For now I can think of only one dirty solution:
create the document upon user request
-> store it in a temporary folder on the server -> read it back and check whether the last page has no items; if so, add an empty item and recreate the document (a kind of recursion with limited depth).
But this solution is too messy. Are there any better ways to handle this problem?
If the print format is fixed (font type and size, header format, paper size, and so on), you can check how many rows the sheet contains:
$allRowCount = $objPHPExcel->setActiveSheetIndex(0)->getHighestRow();
// assuming 0 is the index of the sheet you are working on
You can then work out how many rows it takes to push the footer onto a new page. Basically, the formula is:
$footerWillBeOnSeparateSheet = $allRowCount % $maximumRowPerPage === $numberOfRowsWhenFooterGetsOnTheNextPage;
If $footerWillBeOnSeparateSheet is true then you need to add the empty row you mentioned.
I'd use a different approach instead of adding an empty line. Try to separate the last two items and footer for example into a different page in case the value of $footerWillBeOnSeparateSheet is true.
Edit: using getRowHeight() is also an option although it would be hard to maintain the differences of the different templates.
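A minimal sketch of that check, under simplifying assumptions that are not in the answer: every row has the same height, each printed page holds a fixed number of rows, and the footer starts directly after the last item row. In this simplified model the "magic" remainder is 0. The function name and parameters are hypothetical, and no PHPExcel calls are involved.

```php
<?php
/**
 * Returns true when the first footer row would start a fresh printed page,
 * i.e. the rows before it (header + items) fill whole pages exactly.
 * Assumes uniform row height and a fixed number of rows per page.
 */
function footerStartsNewPage(int $rowsBeforeFooter, int $rowsPerPage): bool
{
    if ($rowsPerPage < 1) {
        throw new InvalidArgumentException('rowsPerPage must be at least 1');
    }
    return $rowsBeforeFooter > 0 && $rowsBeforeFooter % $rowsPerPage === 0;
}

// Usage with made-up numbers: 2 header rows + 38 item rows, 40 rows per page.
$needsFix = footerStartsNewPage(2 + 38, 40);
// When $needsFix is true, add the empty row or move the last items
// onto the footer's page before writing the file.
```

A multi-row footer or rows of differing heights would change the remainder that triggers the problem, which is why the answer leaves it as a template-specific constant.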

A view interface for large object/array dumps

I want to embed in a page a detailed structure report of my model objects, like print_r() or var_export() produce (now I’m doing this with running var_export() on get_object_vars()). But what I actually want to see is only some properties (in most cases), but at this moment I have to use Ctrl+F and seek the variable I want, instead of just staring at it right after the page completes loading. So I’m embedding buttons to show/hide large arrays etc. but thought: ‘What if there already is the thing I do right now?’ So is there?
Update:
What would your ideal interface look like?
First of all, the dumped models fit on the first screen. All the properties can be seen at first glance (there are not many of them, around 10 each, three models in total, so it is possible). Small arrays can be shown unrolled too. The size below which an array counts as ‘small’ should be configurable. Ideally, the user can see the values of the properties without clicking, scrolling, or typing anything.
There must be some improvements to representing the values, say, if an array is empty, show
array ‘My_big_array’ is empty
and if a boolean variable starting with is_, has_, had_ has a false as the value, make the variable (let us take is_available for example) shown as
is_NOT_available
in red, and if it has true as the value, show
is_available
in green. Without any value shown. The same goes for defined constants.
That would be ideal.
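The display rules described above could be sketched roughly like this. It is only an illustration: the function name renderSwitch, the returned array shape, and the choice of black for everything not covered by the rules are assumptions, and turning the colour name into HTML is left out.

```php
<?php
/**
 * Turn one property into display text plus a colour name.
 * Booleans named is_*, has_* or had_* are shown without a value:
 * true keeps the name (green), false inserts NOT after the prefix (red).
 */
function renderSwitch(string $name, $value): array
{
    if (is_array($value) && count($value) === 0) {
        // Colour for empty arrays is not specified in the description; black is assumed.
        return ['text' => "array '$name' is empty", 'color' => 'black'];
    }
    if (is_bool($value) && preg_match('/^(is|has|had)_(.+)$/', $name, $m)) {
        return $value
            ? ['text' => $name, 'color' => 'green']
            : ['text' => "{$m[1]}_NOT_{$m[2]}", 'color' => 'red'];
    }
    // Anything else falls back to a plain name = value line.
    return ['text' => $name . ' = ' . var_export($value, true), 'color' => 'black'];
}
```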
I want to focus on this kind of switch.
Krumo seems useful, but since it always closes up the variable without making difference of how large it is, I cannot use it as is, but there might appear something similar on github soon :)
Second update starts here:
Any programmer who sees is_available = false will know what it means, no need to do more
When bringing in color indication I forgot about one thing: the ‘switches’, let’s call them that, may be important or not. So right now I have some that show in green or red; this is for something global, like caching, which is shown as
Caching is… ON
with ‘ON’ written in green, (and ‘OFF’ in red when disabled) while the words about what it is, i.e. ‘Caching is… ’ are written in black.
And some which are not so important, for example I haven’t defined
REVEAL_TIES is… not set
with ‘not set’ written in gray, while the words describing what it is stay in black. And if it would be set the whole phrase would be in black since there is nothing important: if this small utility for showing some undercover things is working, I will see some messages after it, if it isn’t — site will be working independently of its state.
Dividing switches into important ones and not with corresponding color match should improve readability, especially for those users who are not programmers and just enabled debug mode because some guy from bugzilla said do that — for them it would help to understand what is important and what is not.
A big thanks to all who replied.

Need an algorithm to find near-duplicate text values

I run a photo website where users are free to enter any tag they like, even tags not used before. As a result, a photo of an insect may sometimes be tagged as "insect" whilst somebody else tags it as "insects".
I'd like to keep the free-tagging capability, yet would like to have a way to filter out such near-duplicates. The total collection of tags is currently at 1,500. My idea is to read all of them from the DB into memory and then run an algorithm on them that displays "suspects".
My idea of a suspect is that x% of the characters in the string are the same (same char and order), where x is configurable. I could probably code a really inefficient way to do this but I was wondering if there is an existing solution to this problem?
Edit: Forgot to mention: just sorting the tags isn't enough, as that would require me to go through the entire set to find dupes.
There are some flaws in your logic. For example, what happens when the plural of a word differs from the singular by more than a suffix (e.g. person vs. people, or even candy vs. candies)?
If English is the primary language, check out Soundex which allows phonetic matches. Also consider using a crowd-sourced synonym model where users can create links to existing tags.
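As a small illustration of the Soundex suggestion, PHP ships a built-in soundex() function. The sketch below groups tags by their Soundex key and reports every group with more than one member; the function name phoneticSuspects is hypothetical. Note that irregular plurals such as person/people will generally not share a key, so this only catches part of the problem.

```php
<?php
/**
 * Group tags by Soundex key; each group with two or more tags
 * is a set of phonetic "suspects" for manual review.
 */
function phoneticSuspects(array $tags): array
{
    $groups = [];
    foreach ($tags as $tag) {
        $groups[soundex($tag)][] = $tag;
    }
    // Keep only keys shared by more than one tag, then drop the keys.
    $shared = array_filter($groups, function (array $group) {
        return count($group) > 1;
    });
    return array_values($shared);
}
```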
Maybe the algorithm you are looking for is approximate string matching.
http://en.wikipedia.org/wiki/Approximate_string_matching.
Given a word, you can compare it against the list of words, and if the 'distance' is small, add it to the suspects.
A fast implementation is to use dynamic programming like the Needleman–Wunsch algorithm.
I have made a blog example of this in C# where you can configure the 'distance' using a matrix character lookup file.
http://kunuk.wordpress.com/2010/10/17/dynamic-programming-example-with-c-using-needleman-wunsch-algorithm/
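For a PHP-side version of the same idea, one option is the built-in levenshtein() edit distance. This is a different (simpler) measure than the Needleman–Wunsch scoring in the linked post, swapped in here because it needs no extra code. With about 1,500 tags a brute-force pass is roughly 1.1 million comparisons, which should be tolerable for an occasional admin report. The function name and the default threshold are assumptions.

```php
<?php
/**
 * Return every pair of tags whose edit distance is at most $maxDistance.
 * Tags are lower-cased and de-duplicated first.
 */
function nearDuplicateTags(array $tags, int $maxDistance = 1): array
{
    $tags = array_values(array_unique(array_map('strtolower', $tags)));
    $suspects = [];
    $n = count($tags);
    for ($i = 0; $i < $n; $i++) {
        for ($j = $i + 1; $j < $n; $j++) {
            // Cheap pre-check: the distance is never smaller than the length difference.
            if (abs(strlen($tags[$i]) - strlen($tags[$j])) > $maxDistance) {
                continue;
            }
            if (levenshtein($tags[$i], $tags[$j]) <= $maxDistance) {
                $suspects[] = [$tags[$i], $tags[$j]];
            }
        }
    }
    return $suspects;
}
```

A fixed distance treats short and long tags alike; dividing the distance by the longer tag's length would give something closer to the "x% of characters" idea in the question.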
Is "either contains either" fine? You could do a SQL query something like this, if your images are in a database (which would only make sense):
SELECT * FROM ImageTags WHERE INSTR('theNewTag', TagName) > 0 OR INSTR(TagName, 'theNewTag') > 0 LIMIT 1;
If you really want to do this efficiently I would suggest some sort of JavaScript implementation that displays possibilities as the user types the tag they want. Not only will it save the user time to see five suggestions as they type, it will also stop them from typing "suspects" when "suspect" shows up as a suggestion. That is, of course, unless they really want "suspects" for a specific reason.
You could load a huge list of words and narrow them down as the user types. I get the feeling that this could be very simple, especially if you want to anticipate correctly spelled words. If someone misses a letter, they'll probably go back to fix it when they see a list of suggestions that isn't at all what they meant to type. And when they do type a word correctly, it'll pop up in the suggestions.

using php to get two even columns of text

Does anyone know a clever way to create even columns of text using php?
So let's say I have a few paragraphs of text and I want to split them into two columns of even length (not string length, I'm talking about even visible length).
At the moment I'm splitting based on word count, which (as you can imagine) isn't working too well. For instance, on one page I have a list (ul li style) which increases the number of line breaks but not the word count. What happens is that the left column (with the list in it) is visibly longer than the right column (and if the list were in the right-hand column, it would be the other way round).
So does anyone have a clever way to split text? For instance, from my knowledge of Objective-C, there is a "size that fits" function. I know how wide the columns are going to be, so is there any way to take that, and the string, and work out how high it's going to be? Then cut it in half? Or similar?
Thanks
ps: no css3 nonsense please, we're targeting browsers as far back as ie6 (shudder). :)
I know you're looking at a PHP solution but since the number of lines will depend on how it's rendered in the browser, you'll need to use some javascript.
You basically need to know the dimensions of the container the text is in and using the height divided by the text's line-height, you'll get the number of lines.
Here's a fiddle using jQuery: http://jsfiddle.net/bh8ZR/
There is not a lot of information here as to the source data. However, if you know that you have 20 lines of data, and want to split it, why not simply use an array of the display lines, then divide by two. Then you can take the first half of the PHP array and push it into the second column when you hit the limit of the first.
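A rough sketch of the "array of display lines, divided by two" idea. The helper toDisplayLines() is an assumption layered on top of the answer: it approximates rendered lines with PHP's wordwrap() at a guessed characters-per-line width, which is only a crude stand-in for what a browser does with a proportional font. Both function names are hypothetical.

```php
<?php
/**
 * Approximate the rendered lines of a plain-text block by wrapping
 * at an assumed number of characters per line.
 */
function toDisplayLines(string $text, int $charsPerLine): array
{
    return explode("\n", wordwrap($text, $charsPerLine, "\n"));
}

/**
 * Split an array of display lines into two columns; when the count
 * is odd the left column gets the extra line.
 */
function splitIntoColumns(array $lines): array
{
    $half = (int) ceil(count($lines) / 2);
    return [array_slice($lines, 0, $half), array_slice($lines, $half)];
}
```

List items could be fed in as one display line each, so that a ul adds to the line count even though it adds few words.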
I think you're going to have trouble displaying these columns in a web browser and having a consistent look and feel because you're trying to apply simple programming logic to a visual layout. CSS and jQuery were designed to help layout issues. jQuery does have IE6 compatibility.
I really don't think you're going to find a magic bullet here if you have HTML formatting inside the data you're trying to display. The browser is going to render this based on a lot of variables. Page width, font size, etc. This is exactly why CSS and other layout styles are there, to handle this sort of formatting.
Is there any reason why you're not trying to solve this in the browser instead of PHP? IE6 to me is not a strong enough case not to do this where it belongs.

how to store and search mp3 by its content

I want to store multiple MP3 files and search them by supplying part of a song, in order to detect which song it is.
I am thinking of storing all binary content in mysql and when I want to search for a specific song by content I will take some middle portion of song and actually match it with the binary data in MySQL.
My questions are:
Is this a reasonable way to find songs by their content?
Is it right to store the songs' content in the database or should I use the filesystem?
This is not going to work. MP3 is a "lossy" format. That means the encoder discards and alters subtle nuances of the music, so different encoders, bitrates, or settings produce totally different byte-wise data for the same song.
Also, even in an uncompressed format like WAV, two otherwise identical recordings at different volumes will produce different byte data. So it is impossible to compare music by comparing the byte values of the files' contents.
A binary comparison will work only for two exact identical copies of the same MP3 file. It won't even work anymore when you re-encode the same MP3 file with identical settings.
Comparing music is not a trivial matter, several approaches exist but to my knowledge none that can be used in PHP.
If you're lucky, there exists a web service that allows some kind of matching. Expect it to be commercial in some way, though - I doubt we are at the stage where this kind of thing can be used free of charge.
Is it a right way to find songs by content of song.
Only if you can be sure that the part you get as the search criterion will actually be an excerpt from that particular MP3 file... and that is very, very unlikely. If the part can come from a different source (e.g. a different recording of the same song, or just a differently compressed MP3), you'll have to use audio fingerprinting, which is vastly more complicated.
Is it right to store songs content in database or file store normally will work?
If you do simple binary matching, there is no point in using a database. If you have a more complex indexing technique (such as audio fingerprints) then using a database can make sense.
As others have pointed out - comparing MP3s by looking at the binary content of files is not going to work.
I wrote something like this in Java whilst at university for my final year project. I'd be more than happy to send you the source code. It dealt in relative similarities - "song X is more similar to song Y than it is to song Z", rather than matches, but it might be a step in the right direction.
And please, whatever you do, don't try and do this in PHP. The algorithm I used needed me to compute (if I remember correctly - I worked on this around 3 years ago) 30 30x30 matrices for each MP3 it analysed. Each song took around 30 seconds to process to a set of matrices on my clunky old machine (I'm sure my new PC could get the job done significantly quicker). Once I had those matrices for n songs a second step computed differences between each pair of songs, and a third step reduced those differences down to m-dimensional space. Each of these 3 steps takes a fair amount of horsepower, and PHP definitely isn't the right horse for the job.
What PHP might work for is a frontend - I ended up with a queryable web-app written in Ruby on Rails, where I had a simple backend which stored the co-ordinates of each song in m-dimensional space (I happened to choose m = 6) - given a particular song, or fragment, X, you could then compute songs within a certain "distance" of X.
NB. I should probably point out that all the code I wrote was basically just a wrapper around libraries others had written - which were by some smart people at a university in Austria - those libraries took two songs and generated the matrices - all I did was compute distances and map distances of lots of songs into m-dimensional space. Wish I was smart enough to have done the first bit too!
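The query side described above (stored coordinates, then "songs within a certain distance of X") is cheap enough for a PHP frontend. A minimal sketch, assuming Euclidean distance (the answer does not say which metric was used) and a hypothetical in-memory map of title => coordinates:

```php
<?php
/** Euclidean distance between two points with the same number of dimensions. */
function euclideanDistance(array $a, array $b): float
{
    if (count($a) !== count($b)) {
        throw new InvalidArgumentException('Points must have the same dimension');
    }
    $sum = 0.0;
    foreach ($a as $i => $value) {
        $sum += ($value - $b[$i]) ** 2;
    }
    return sqrt($sum);
}

/**
 * Return title => distance for every song within $radius of $query,
 * nearest first. $songs maps a title to its coordinate array.
 */
function songsWithin(array $songs, array $query, float $radius): array
{
    $hits = [];
    foreach ($songs as $title => $coords) {
        $distance = euclideanDistance($coords, $query);
        if ($distance <= $radius) {
            $hits[$title] = $distance;
        }
    }
    asort($hits); // sort by distance, keeping titles as keys
    return $hits;
}
```

Computing the coordinates in the first place is the expensive part and, as the answer says, is better left to something other than PHP.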
I don't fully understand what you're trying to do, but if you're going to index an MP3 collection, it's probably a better idea to store a hash (of sufficient length) rather than the actual file.
The problem is that the bytes don't give you any insight to the CONTENT of the file, i.e. the music in it. Even if you cut the metadata from the bytes to compare (to get rid of noise like changes in spelling/capitalisation of metadata), you only know something about the unique file itself. So you could compare two identical files (i.e. exact duplicates) for equality, but you couldn't compare any two random files for similarity.
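To make the hash idea concrete, here is a small sketch using PHP's built-in hash_file(). As explained above, it only detects byte-for-byte identical files, not the same song in a different encoding. The function name and the choice of SHA-256 are assumptions.

```php
<?php
/**
 * Group file paths by the SHA-256 hash of their contents and return
 * only the groups containing exact duplicates (hash => list of paths).
 */
function findExactDuplicates(array $paths): array
{
    $byHash = [];
    foreach ($paths as $path) {
        $hash = hash_file('sha256', $path);
        if ($hash === false) {
            continue; // unreadable file, skip it
        }
        $byHash[$hash][] = $path;
    }
    return array_filter($byHash, function (array $group) {
        return count($group) > 1;
    });
}
```

In a real index the hash would be stored in a database column next to the file path, with the MP3 itself kept on the filesystem.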
To search songs, you will probably want to index their tags and focus on a nice, easy-to-use UI so users can look for them in flexible ways.
As said above, the same song will show different content bytes depending on the encoding.
However, one idea pointing in your direction, and I'm not sure how feasible it is, would be to index patterns in a song that may uniquely identify it. For example, what do all Johnny Cash songs have in common? Volume, tone, a combination of them? When you get a portion of content, you could extract that same pattern from it and match. That would be an interesting concept.
