I have seen some captchas being decoded using JavaScript, PHP, etc. How do they do it?
For example, the captcha of the very popular Megaupload site has also been decoded.
I'm an image processing specialist and CAPTCHA decoder, and I've done many CAPTCHA-solving projects before.
OK, let's start with the CAPTCHA-solving steps!
Decoding any kind of CAPTCHA has 3 main steps:
1- Removing background
Clear the CAPTCHA from any noise (using any image processing methods).
Note for those fighting CAPTCHA decoding: if you want a good CAPTCHA, add stronger noise. Use a randomly noised background whose color is similar to that of the characters.
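As a rough illustration of step 1, here is a minimal sketch that assumes the image has already been read into rows of 0-255 brightness values (for example via GD's imagecolorat() or a Canvas getImageData() call) and that the text is darker than the background. The names `binarize` and `removeSpecks` and the threshold of 128 are assumptions made for this example; real CAPTCHAs usually need filters tuned to their particular noise.

```javascript
// Hypothetical helper: turn a grayscale image into ink (1) / background (0).
// `gray` is an array of rows, each an array of 0-255 brightness values.
// Assumes dark text on a lighter background.
function binarize(gray, threshold = 128) {
  return gray.map(row => row.map(v => (v < threshold ? 1 : 0)));
}

// Hypothetical helper: drop isolated ink pixels, i.e. pixels with no ink
// among their 8 neighbours. This removes simple "salt" noise only.
function removeSpecks(bits) {
  const h = bits.length;
  const w = h ? bits[0].length : 0;
  return bits.map((row, y) =>
    row.map((v, x) => {
      if (!v) return 0;
      for (let dy = -1; dy <= 1; dy++) {
        for (let dx = -1; dx <= 1; dx++) {
          if (dy === 0 && dx === 0) continue;
          const ny = y + dy;
          const nx = x + dx;
          if (ny >= 0 && ny < h && nx >= 0 && nx < w && bits[ny][nx]) {
            return 1; // has at least one ink neighbour, keep it
          }
        }
      }
      return 0; // lonely pixel, treat as noise
    })
  );
}
```

This also shows why a noisy background in a color close to the characters is harder to remove: a single brightness threshold can no longer separate ink from noise.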
2- Splitting characters
Easy step when they are separate and very hard when they're not.
*Note for those fighting CAPTCHA decoding: if you want a good CAPTCHA, don't leave the characters separate! Make them overlap, and do NOT use different colors for the characters, because decoders can then split them very easily. (Most developers are unaware of this and think a colorful CAPTCHA is better!) The best choice is an overlapping string in plain black. For an experienced CAPTCHA decoder, a colorful CAPTCHA is no problem to decode; it's just beautiful, not useful! :) Use random curved lines which connect all the characters to each other.*
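For the easy case in step 2, where the characters do not touch, one common approach is a column projection: scan the columns and cut wherever a column contains no ink. The sketch below assumes a binarized image given as rows of 1 (ink) and 0 (background); `splitCharacters` is a name made up for this example.

```javascript
// Hypothetical helper: split a binarized image into per-character images
// by cutting at columns that contain no ink at all.
// `bits` is an array of rows of 1 (ink) / 0 (background).
function splitCharacters(bits) {
  const h = bits.length;
  const w = h ? bits[0].length : 0;
  const segments = [];
  let start = -1; // first column of the character currently being read
  for (let x = 0; x <= w; x++) {
    let hasInk = false;
    if (x < w) {
      for (let y = 0; y < h; y++) {
        if (bits[y][x]) {
          hasInk = true;
          break;
        }
      }
    }
    if (hasInk && start < 0) {
      start = x; // a character begins here
    } else if (!hasInk && start >= 0) {
      // an empty column (or the right edge) ends the character
      segments.push(bits.map(row => row.slice(start, x)));
      start = -1;
    }
  }
  return segments;
}
```

Overlapping characters, or a line drawn through all of them, leave no empty column, so this simple cut returns the whole string as one segment. That is exactly why such designs are harder to split.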
3- Converting separate images into characters
After separation we have a set of character images (we don't have a string yet, just images and pixels), and we need to convert those images into a string. But how?!
There are several ways. If the characters are not rotated and have a fixed font and size (such as the freeglobes CAPTCHA), you can define a pattern set, and your program loops through the patterns to find the best match for each image. If the characters vary a lot and would need a very large pattern set, you should use a "Neural Network" to recognize them.
A neural network for CAPTCHA solving takes a character image, and we tell the network what that character is. For example, we give it an image of "A" and tell the NN: it's "A"! It then "LEARNS" this character and stores what it has learned. This procedure is called "TRAINING". Later, when we give the trained network a new character image, it returns the best match from what it has learned.
Usually decoder specialists use the CAPTCHA itself to train the neural network. Be careful! Using appropriate data for training can make or break your results.
Note for those fighting CAPTCHA decoding: if you want a good CAPTCHA, use any method which prevents a decoder from recognizing the characters, even with a neural network. Deform the characters randomly, use many fonts instead of one, rotate the characters as well, etc.
Finally, we concatenate all single characters into one and return it as result.
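The pattern-set variant of step 3, followed by the final concatenation, might look like the sketch below. It assumes each character image and each pattern is a same-sized grid of 1 (ink) and 0 (background); the function names and the simple "fraction of equal pixels" score are choices made for this example, not a fixed algorithm.

```javascript
// Fraction of pixels that are equal in two same-sized binary grids.
function matchScore(a, b) {
  let same = 0;
  let total = 0;
  for (let y = 0; y < a.length; y++) {
    for (let x = 0; x < a[y].length; x++) {
      total++;
      if (a[y][x] === b[y][x]) same++;
    }
  }
  return total ? same / total : 0;
}

// Loop through the pattern set and return the label of the best match.
// `patterns` maps a label such as "A" to its reference grid.
function recognize(glyph, patterns) {
  let best = null;
  let bestScore = -1;
  for (const [label, pattern] of Object.entries(patterns)) {
    // This sketch only compares grids of identical size.
    if (pattern.length !== glyph.length) continue;
    if (pattern.length && pattern[0].length !== glyph[0].length) continue;
    const score = matchScore(glyph, pattern);
    if (score > bestScore) {
      bestScore = score;
      best = label;
    }
  }
  return best;
}

// Recognize every character image and concatenate the results.
// Unmatched images are reported as "?".
function decode(glyphs, patterns) {
  return glyphs.map(g => recognize(g, patterns) ?? "?").join("");
}
```

Random deformation, rotation and multiple fonts defeat this kind of pixel-by-pixel comparison quickly, which is where the neural network approach comes in.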
Unfortunately, there is no fixed algorithm for solving every CAPTCHA; each new CAPTCHA needs new analysis and training. You can't make one CAPTCHA decoder that decodes all CAPTCHAs.
What should you know before starting:
1- Image processing fundamentals
2- General understanding of a Neural Network
3- Simple image processing functions (in any language)
For PHP:
imagecreate()
imagecreatetruecolor()
imagecolorat()
imagecolorsforindex()
imagesetpixel()
.
.
.
For .NET:
Bitmap type,
GetPixel()
SetPixel()
.
.
.
For JavaScript and HTML5:
You should know the Canvas very well.
Lastly:
Note for those fighting CAPTCHA decoding: if you wonder how someone can decode a CAPTCHA and want to prevent it, you should first be a CAPTCHA decoder yourself, or hire someone who knows the weaknesses and the attack algorithms very well!
Hope to help! ;)
See:
OCR and Neural Nets in JavaScript
Here John Resig (creator of the jQuery JavaScript library) explains exactly how it has been done.
Take a look at PWNtcha
You can also read Breaking a Visual CAPTCHA
I was involved in a project to circumvent Captcha images on the TicketMaster website about 8-9 years ago for a third-party ticket seller. When an event went on-sale, like a concert, our network of machines would use multiple credit cards and mailing addresses to buy any and every seat possible in the first 10 rows.
Rather than generating new captchas each time, TM had a limited pool of images they could re-use. We'd create a unique digital fingerprint (checksum) for each image, then simply attack it with some imaging tools (LEADTOOLS.com) to remove extraneous elements, enhance contrast, etc., and then use OCR tools. It was surprisingly effective.
We were able to crack a great number programmatically, and we'd store the ones we couldn't crack for human processing. Sometimes they'd have a pool of 20K images, so at first we'd get maybe 60-70% automatically, but eventually we'd get 100% success because we could identify the images our humans processed (offline) based on looking up their hash in our database. (That is, we could check a captcha image against our database based on the hash we created and if we already had the solution we could just submit the answer immediately.)
Occasionally, they'd flush and replace their pool of captcha images with a new set, but again, it would just take us a bit of time to get back up to a 100% rate. The fatal flaw with this particular system was that they recycled images, rather than programmatically generating new captcha images each time.
But the fact is, if the financial incentive to crack the captcha is high enough, it doesn't take much to create a distributed platform where low-wage unskilled workers can sit around earning pocket change to crack them all day.
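To see why recycling images is fatal, consider the minimal sketch below: once an image's bytes hash to a known fingerprint, the stored answer can be returned without any image processing at all. The FNV-1a hash, the `SolutionCache` class and its method names are assumptions made for this example (the original system's checksum is not specified), and a real system would want a longer hash to avoid collisions.

```javascript
// 32-bit FNV-1a hash of a byte array, returned as 8 hex digits.
// Chosen here only because it is short; not the original system's checksum.
function fingerprint(bytes) {
  let h = 0x811c9dc5;
  for (const b of bytes) {
    h ^= b;
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16).padStart(8, "0");
}

// Hypothetical cache keyed by image fingerprint.
class SolutionCache {
  constructor() {
    this.answers = new Map();
  }
  // Remember the answer for an image, however it was obtained.
  store(imageBytes, answer) {
    this.answers.set(fingerprint(imageBytes), answer);
  }
  // Returns the stored answer, or null if the image was never seen.
  lookup(imageBytes) {
    return this.answers.get(fingerprint(imageBytes)) ?? null;
  }
}
```

A generator that renders a fresh image for every request makes this lookup useless, because no two images share the same bytes.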
Inside India's CAPTCHA solving economy
http://www.zdnet.com/blog/security/inside-indias-captcha-solving-economy/1835
There are services for recognition, such as 2captcha. This is a PHP tool for working with them: https://github.com/jumper423/decaptcha/
I am trying to integrate an Adobe signature into a PDF so that the end user can sign it in the browser itself. I want his/her handwritten signature on it, drawn with the mouse. The PDF creation is written in PHP and the application uses the Adobe APIs.
I referred to the Handwritten Adobe page and Adobe tags
I have also referred to Stack 1 and Stack 2, but they do not match my requirement.
I was able to sign the custom runtime-generated PDF document using {{Sig_es_:signer1:signature}}
I checked several places, including Stack Overflow, but I can't find any reference document that can guide me in coding handwritten signatures. I also need to understand whether handwritten signatures have any limitations, drawbacks, or privacy/security issues.
Let me know if anyone knows how to proceed on this.
Draw a signature with a mouse? That will not work. I can't do that. A finger on a phone would work better. Still clumsy, but better.
Drawn signatures are old fashioned in the digital world, and require complex verifiable encryption. You would have to prove that the digital copy you have, was indeed drawn within the exact document it appears in. Digital things can, after all, be copied easily. Whenever there's a dispute you would have to prove that the signature is an inherent part of the unchangeable digital document. This is far more difficult than it seems at first. That's why it is usually quite expensive.
I would strongly advise against going down this road. Find another solution.
You haven't explained what you want to use the signature for, which makes it difficult for me to suggest another solution, so I won't.
Re:
i also need to understand if Hand written signatures have any limitation or drawbacks or any privacy/security issues.
Yes, there are lots of limitations and drawbacks. You need to consider the issues of forgery (someone else signing as me, Larry) and non-repudiation (I signed it but later claim that it wasn't me. How do you prove that it was Larry who signed it?)
There's also the overall context of the signature: what is the value of the agreement? What are the consequences of not being able to prove that the right person did sign the document?
Adobe Sign (and their competitors) have answers to all of the above. eSignatures are far more complicated than just getting something that looks like the person's signature on the PDF.
Pro-tip: how the signature looks on the PDF is the least important part of the process.
I'm currently working on a personal project for decoding a text or any object in an image.
I'm using GD library for processing image. I have access to every pixel of image and its rgb color.
My question is not about coding. I'm just looking for an algorithm to decode the image, or any advice on how to do that. I don't want to use any API; I want to do it by myself.
I know that PHP has a face detection library, but it only recognizes faces in an image, and I don't know how it does that.
To start, I assume that the object is white and the background is black (or any two distinct colors).
Summary: how can I define an object or a word for a PHP program and train it to recognize that object or word in a picture?
There are some APIs which decode simple captchas like this.
Check this link: Captcha Decoded
And try this API: http://www.opendecoder.com/api. There are many more APIs if you search on Google.
The process you are trying to implement is called "optical character recognition" (OCR), and there is some free software available that does this. Searching for this term will give you more information.
You did not specify the kind of software component you are looking for, so it is hard to be more specific.
This is usually an error-prone process, but you might get better results if you can make regularity assumptions on your input, especially if you already know which character types are used in your input.
Useful starting points could be
http://jwilk.net/software/ocrodjvu
http://unpaper.berlios.de/
If converting to DjVu and using Python on a UNIX system is an option for you, you might consider the first link as a solution. Otherwise you may use the various tools supported by ocrodjvu to start your research. The second link is more about pre-processing you might want to do before OCR, but it could still be useful if you want to implement your own procedure.
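For the do-it-yourself route the question asks about, a common first step before any recognition is to locate each object. Under the question's own assumption (white object on black background), one way is to label connected regions with a flood fill and record their bounding boxes. The sketch below works on rows of 1 (object) and 0 (background); the name `findObjects` and the choice of 4-connectivity are assumptions for this example, and the same loop could be written in PHP over GD's imagecolorat() values.

```javascript
// Find connected regions of object pixels and their bounding boxes.
// `bits` is an array of rows of 1 (object) / 0 (background).
// Pixels are connected if they touch horizontally or vertically.
function findObjects(bits) {
  const h = bits.length;
  const w = h ? bits[0].length : 0;
  const seen = bits.map(row => row.map(() => false));
  const boxes = [];
  for (let y = 0; y < h; y++) {
    for (let x = 0; x < w; x++) {
      if (!bits[y][x] || seen[y][x]) continue;
      // Start a new region and flood-fill it with an explicit stack.
      const box = { minX: x, minY: y, maxX: x, maxY: y, pixels: 0 };
      const stack = [[x, y]];
      seen[y][x] = true;
      while (stack.length) {
        const [cx, cy] = stack.pop();
        box.pixels++;
        box.minX = Math.min(box.minX, cx);
        box.maxX = Math.max(box.maxX, cx);
        box.minY = Math.min(box.minY, cy);
        box.maxY = Math.max(box.maxY, cy);
        for (const [dx, dy] of [[1, 0], [-1, 0], [0, 1], [0, -1]]) {
          const nx = cx + dx;
          const ny = cy + dy;
          if (nx >= 0 && nx < w && ny >= 0 && ny < h && bits[ny][nx] && !seen[ny][nx]) {
            seen[ny][nx] = true;
            stack.push([nx, ny]);
          }
        }
      }
      boxes.push(box);
    }
  }
  return boxes;
}
```

Each bounding box can then be cropped and compared against known shapes, or used as training input for a classifier.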
Ok, so my company has a client that has an interface for posting content - standard MySQL database, PHP-based, etc.
Anyway, they've continually had an intern or someone post content to this interface straight from an MS Word doc. The interface is coded poorly and takes this input as is, with no formatting.
My company has now been contracted out to fix this particular problem, as it is continually breaking their site, and my company has repeatedly had to manually go into the database, and delete the offending values.
Is there a quick and easy way to do this, or am I going to have to just do a replace operation on each offending character?
I see htmlentities() may be a partial solution - but as far as I know, that won't remove everything.
What's a good solution to this problem? Is there anything out there to make this easier?
We're also considering writing a content validator as well, probably just server-side (though maybe client-side, if my week is going slowly enough/I finish the rest of this quickly enough).
It depends on how many clients (or potential clients) you are supporting and how much time you have to invest. Options:
Write your own function to strip out the metadata
Teach your clients to remove it themselves, such as by pasting into Notepad first, or supply a knowledge base article explaining how to do it in the software, perhaps behind a "Help" section or icon they can click on: http://support.microsoft.com/default.aspx?scid=kb;en-us;223396
Use a WYSIWYG editor such as TinyMCE which has built in functionality to remove it
But like I said in the comments, unless you are using your own function, prepare for clients to continue to paste directly and wonder why there is a problem.
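A "write your own function" option could start from something like the sketch below, which maps the typographic characters Word commonly inserts to plain ASCII and drops control characters. The character list is an assumption about what is breaking the site, not a complete inventory, and `cleanWordText` is a made-up name; the same mapping could be ported to PHP for the server side.

```javascript
// Replace common Word "smart" punctuation with plain ASCII and remove
// control characters. The list below is illustrative, not exhaustive.
function cleanWordText(text) {
  const replacements = {
    "\u2018": "'",   // left single quote
    "\u2019": "'",   // right single quote / apostrophe
    "\u201C": '"',   // left double quote
    "\u201D": '"',   // right double quote
    "\u2013": "-",   // en dash
    "\u2014": "-",   // em dash
    "\u2026": "...", // ellipsis
    "\u00A0": " ",   // non-breaking space
  };
  return text
    .replace(/[\u2018\u2019\u201C\u201D\u2013\u2014\u2026\u00A0]/g, ch => replacements[ch])
    // Strip control characters but keep tab, line feed and carriage return.
    .replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F]/g, "");
}
```

Running every submission through such a function on the server means the content stays clean even when someone keeps pasting straight from Word.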
I recently ran into the issue that I do not have the money to buy or rent any professional captcha service.
So I tried to look around for open-source captcha generators and captcha designs.
I also had a brief brainstorm about my own simple captcha design.
Do you have any preferences, or can you give me good advice on handling captchas in PHP without huge performance leaks?
(My attempt to design a simple captcha: .pdf)
EDIT: Thanks to all of you, I am sorry for only giving one "right-answer", but +1 for every good answer ;)
I actually would suggest that rather than rolling your own you use reCAPTCHA as it is free and of very good quality (used by this site, Facebook, Craigslist etc).
It also meets your requirements in that it isn't resource intensive, as all the image generation and distortion is done on the reCAPTCHA server.
PHP examples can be found here
If you want to design your own captcha, I highly recommend you take a look at this tutorial. It goes through a basic captcha design, allowing you to alter the design of the captcha as you wish, using various PHP image modification functions.
You could alter the code to use random fonts, make each character a different size, skew the image, etc. The tutorial is to show you how a code is created, how it's used with a session, and how to actually use the image in an input form.
Leaving aside the problem of CAPTCHAs being a horrible barrier for users… ReCaptcha should solve the budgetary issues without making you reinvent the wheel.
The first and foremost thing to consider is that your captcha must not be easily breakable. Some good old captchas have already been broken/decoded using JavaScript. For further info please visit these pages:
http://ejohn.org/blog/ocr-and-neural-nets-in-javascript/
http://blog.makezine.com/archive/2009/01/javascript_captcha_decoder.html
To generate one, you could use something like Captcha Creator, a powerful and complete PHP captcha script that generates captcha images.
The classic approach is to generate some random text, apply some random effects to it and convert it into an image.
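The "generate some random text" part of that classic approach could be sketched as follows. The alphabet (which leaves out the easily confused I, O, 0 and 1), the default length of 6 and the injectable `random` parameter are all assumptions for this example. Math.random is not cryptographically secure, so a real captcha should draw from a secure random source instead.

```javascript
// Generate random captcha text from an alphabet without look-alike
// characters. `random` must return a number in [0, 1); it defaults to
// Math.random for illustration only.
function randomCaptchaText(length = 6, random = Math.random) {
  const alphabet = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789";
  let text = "";
  for (let i = 0; i < length; i++) {
    text += alphabet[Math.floor(random() * alphabet.length)];
  }
  return text;
}
```

The generated text would then be stored server-side (for example in the session), drawn onto an image with random distortion, and compared with what the user types.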
I'm working on an information warehousing site for HIV prevention. Lots of collaborators will be posting articles via a tinyMCE GUI.
The graphic designers, of course, want control over page lengths. They would like automatic pagination based on the height of content in the page.
Anyone seen AJAX code to manage this?
Barring that, has anyone seen PHP code that can do a character count and a look-behind regex to avoid splitting words or tags?
Any links much appreciated!
If it doesn't need to be exact there's no reason you can't use a simple word count function to determine an appropriate place to break the page (at the nearest paragraph I suppose). You could go so far as to reduce the words per page based on whether there are images in the post, even taking the size of the images into account.
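A minimal sketch of that word-count idea, breaking only between paragraphs, might look like this. It assumes the article has already been split into an array of paragraph strings; `paginateByWords` and the `maxWords` limit are names chosen for this example, and images are ignored here.

```javascript
// Group paragraphs into pages so that each page holds at most `maxWords`
// words, breaking only at paragraph boundaries. A single paragraph longer
// than the limit still gets a page of its own.
function paginateByWords(paragraphs, maxWords) {
  const pages = [];
  let current = [];
  let count = 0;
  for (const p of paragraphs) {
    const words = p.trim().split(/\s+/).filter(Boolean).length;
    if (current.length && count + words > maxWords) {
      pages.push(current);
      current = [];
      count = 0;
    }
    current.push(p);
    count += words;
  }
  if (current.length) pages.push(current);
  return pages;
}
```

To account for images, one option is to add a fixed word "cost" per image to `count`, scaled by the image height if it is known.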
That could get ugly fast though, I think the best way to do it is to allow them to manually set the page dividers with a tag in the article that you can parse out. Something like [pagebreak] is pretty straightforward and you'll get much more logical and readable page breaks than any automated solution would achieve.
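Parsing out a manual [pagebreak] marker can be very small. In the sketch below, `splitPages` and `getPage` are hypothetical names, and the page number is assumed to arrive from the request, for example as a GET parameter.

```javascript
// Split an article into pages at [pagebreak] markers (case-insensitive),
// trimming the whitespace around each marker.
function splitPages(article) {
  return article.split(/\s*\[pagebreak\]\s*/i).filter(page => page.length > 0);
}

// Return one page (1-based), clamping out-of-range page numbers.
function getPage(article, pageNumber) {
  const pages = splitPages(article);
  const wanted = Number.isInteger(pageNumber) ? pageNumber : 1;
  const index = Math.min(Math.max(wanted, 1), pages.length) - 1;
  return { text: pages[index] ?? "", page: index + 1, total: pages.length };
}
```

Because the authors place the markers themselves, pages break at real section boundaries and never in the middle of a word or tag.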
You don't just have to worry about character count, you also have to worry about image heights if there are images or any other kind of embedded objects in your pages that can take up height. Character count will also not give you an idea of paragraph structure (a single long paragraph with more characters than a page with many paragraphs might be shorter).
If you're willing to use JavaScript, that might be the ideal solution, post the entire article to the client and let JavaScript handle the pagination. From the client you can detect image and object heights. You could use PHP to place markers about where you think the pages should be, and then use JavaScript to make it happen. Unless the pages are very long I don't think you'll need to do several xmlHttpRequests (AJAX).
A straight PHP solution is also simple, but probably not ideal, as you're not just dealing with a matter of managing row counts. You could use a GET variable to determine where you are in the article.
This might not be the exact answer you're looking for, but you should really make sure your site doesn't have a fixed height. Flexible widths are really nice, but not as critical as a flexible height.
Especially for a cause like this, and a content-heavy site; it's fair to require flexible heights.
As mentioned by apphacker, you can't really detect the height from within PHP and you're kind of stuck with javascript. If you're absolutely stuck with paging, it's probably better to let your content authors decide when to break off the page, so you break it on a real section, instead of mid-word, sentence, etc.
Edit: usability should dictate design, not the other way around. You're doing it wrong ;)
Good pagination is not a simple task, and it is not simply a matter of coding. Scientific research by Plass (1981) proved that optimal page breaking is in general NP-hard.
You have to worry about floating figures, line breaks, different font styles, etc.
And the only thing an HTML engine can help you with is parsing a page into a DOM tree. What about sizes? Yes, you can get font width and font height, margins and paddings, picture sizes. But that's all. All the layout is on your shoulders. And doing it in JavaScript... meh...
So the only feasible solution for automatic fixed-height pagination would be server-side. PrinceXML is currently the best HTML2PDF converter. But it costs a lot.
If you are good with different page heights, you could use epalla's suggestion. But this is also not as simple as it seems.
Some references for pagination:
Optimal pagination techniques, Plass, 1981
On the Pagination of Complex Documents, 1998
Pagination reconsidered
Knuth's Digital Typography