I'm looking for a way to analyze a string of text and find out in which tense it was written, for example : "I'm going to the store" == current, "I bought a car" == past ect..
Any tips on how I could this done?
Yes, this is going to be extremely difficult... I had started to do something similar for what was going to be a quick weekend project until I realized this... nonetheless here is a resource I found to be helpful.
Download the source code of Wordnet 3.0 from Princeton, which has a database of english words. The file /dict/index.verb is a list of present tense english verbs you should be able to import into your database as a CSV without too much trouble. From there, you're on your own, and will need to figure out how to handle the weirdness that is the English language.
This could be a rather tasking process. How detailed do you want to get? Do you want to consider only past, present, and future? Or do you want to consider Simple Present, Present Progressive, Simple Past, etc?
In any case, you'll also have to evaluate the Affirmative forms, Negative forms, and Question forms. A great chart online that can help can be found at http://www.ego4u.com/en/cram-up/grammar/tenses
Note the rules and signal words.
Tokenize / find action words from db/file (or at least, guess - *th=past, for example) / count tense hits?
For such a task, I believe Regular expressions won't be enough : it's a pretty difficult task...
Either you won't get anything good at all from regex, or you'll end with some kind of super-monster-regex that not even you will understand and be able to maintain...
This probably requires more than regex... Something like some kind of "linguistic-engine", I suppose...
If you actually need it and aren't just playing around, you might take a look at nltk. Parsing is a complex matter. Parsing natural languages is even more complex. And parsing a highly irregular language, such as English, is even worse. If you can narrow the problem scope down, you stand a better chance at a solution.
What do you need it for?
You can find a basic Brill Parser implementation for PHP at Ian Barber's PHP/ir site. The algorithm will tag your words.
If you enter the words "I think", the result will be:
I/NN think/VBP
NN= Noun,
VBP= Verb Present
Related
... server side and using PHP.
I read this SO article on when to use regexes and it basically states that you can use regexes to parse HTML in certain cases.
<title></title>
should be easy to match.
I see no problem with this. I think the popular answer is voted so much for not b.c. of correctness but b.c. of entrainment value.
Is this O.K?
Yes, it is
/<title[^>]*>(.*?)<\/title>/is
Different people have different opinions, though. And you should only use regex if you know what you're doing.
This might me a very interesting read: When you should NOT use Regular Expressions?
Your best bet is to use an HTML parsing library (like this one), not regex. You may get away with using regex in this case, but it's like using a hammer to pound in a screw.
If you are looking for anything non-trivial in the HTML, regex is going to be very confusing and hard to read, and in many cases, regex cannot do the job without making many assumptions about the content of the HTML.
I'm searching for a simple algorithm or an open source library (PHP) allowing to estimate whether a text mainly uses a specific language. I found the following answer relating to Python, which probably leads in the right direction. But something working out-of-the-box for PHP would be a charm.
Of course something like an n-gram estimator wouldn't be too hard to implement, but it requires a reference database as well.
The actual problem to solve is as follows. I run a WordPress blog, which currently is flooded by SPAM. The blog is in German language and virtually all trackback spam is English. My idea is to inmmediately spam all trackbacks seeming to be English. However, I cannot use marker words, because I do not want to spam typos or citations.
My solution:
Using the answers to this question I implemented a solution, which detects German by a simple stopword ratio. Any comment must contain at least 25% German stopwords, if it has a link. So you can still comment something like "cool article", which has no stopwords at all, but if you put a link, you should bother to write proper language.
Unfortunately the stopwords from NLTK are incorrect. The list contains words, which do not exist in German. So I used the snowball list. Using the Perl regexp optimizer I condensed the entire list into a single regexp and count the stopwords using preg_match_all(). The whole filter is 25 lines, a third of the Perl code to produce the regexp from the list. Let's see how it performs in the wild.
Thanks for your help.
I agree with #Thomas that what you are looking for is a spam classifier rather than a language detection algorithm. Nonetheless, I think this language detection solution is simple enough and out of the box as you want. Basically if you count the number of stop-words in different languages and select the language with higher number of them in the document you have a simple, yet very effective language classifier.
Now, the best part is that you do not need to code almost anything as you can used standard stop-words list and processing packages like nltk to deal with the information. Here you have the example of how to implement it from scratch with Python and nltk.
I hope this helps.
If all you want to do is recognize English then there's a very easy hack. If you just check for the letters in a post, English is one of the only languages that will be entirely in the pure-ASCII range. It's hacky, but it's a decently simplification on an otherwise very difficult problem I believe.
My guess on efficacy, just doing some quick back on the envelope calculations on a couple French and German blogs would be ~85%, which isn't foolproof, but is pretty good for the simplicity of it I would think.
I know this question has been asked about a dozen times, but this one is not technically a dupe (check the others if you like) ;)
Basically, I have a Javascript regex that checks email addresses which I use for front-end validation, and I use CodeIgniter to double check on the back end, in case the validation on the front end fails to run properly (browser issues, for instance.) It's QUITE a long regular expression, and I have no idea where to begin converting it by hand.
I'm pretty much looking for a tool that converts JS regexes to PHP regexes - I haven't found one in any of the answers to similar questions (of course, it's possible that such a tool doesn't exist.) Okay, I lied - one of them suggested a tool that costs $39.95, but I really don't want to spend that much to convert a single expression (and no, there isn't a free trial as suggested by the answer to the aforementioned question.)
Here's the Javascript expression, graciously provided by aSeptik:
/^((([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+(\.([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+)*)|((\x22)((((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(([\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|\x21|[\x23-\x5b]|[\x5d-\x7e]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(\\([\x01-\x09\x0b\x0c\x0d-\x7f]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]))))*(((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(\x22)))#((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.?$/i
And the one used by CodeIgniter, which I don't want to use because it doesn't follow the same rules (disallows some valid addresses):
/^([a-z0-9\+_\-]+)(\.[a-z0-9\+_\-]+)*#([a-z0-9\-]+\.)+[a-z]{2,6}$/ix
I want to use the same rules set by the Javascript regex in PHP.
Having this sort of inconsistency where my front-end code is saying that the email address is okay, and then Codeigniter says it isn't, is of course the behavior I'm trying to fix in my application.
Thanks for any and all tips! :D
There are some differences between regex engines in Javascript and PHP. Please check Comparison of regular-expression engines
article for theoretical and Difference between PHP regex and JavaScript regex answer for practical information.
Most of the time, you can use Javascript regex patterns in PHP with small modifications. As a fundamental difference, PHP regex is defined as a string (or in a string) like this:
preg_match('/^\(?(\d{3})\)?[- ]?(\d{3})[- ]?(\d{4})$/',$telephone);
Javascript regex is not, it's defined in its own way:
var ptr = new RegExp(/^\(?(\d{3})\)?[- ]?(\d{3})[- ]?(\d{4})$/);
// or
var ptr = /^\(?(\d{3})\)?[- ]?(\d{3})[- ]?(\d{4})$/;
You can give it a try by running the regex on PHP. As a recommendation, do not replace it in Codeigniter files, you can simply extend or replace native library. You can check Creating Libraries out for more information.
I was able to solve this in a better-than-expected manner. I was unable to convert the Javascript regex that I wanted to use (even after purchasing RegexBuddy - it'll come in handy, but it was not able to produce a proper conversion), so I decided to go looking on the Regex Validate Email Address site to see if they had any recommendations anywhere for good regexes. That's when I found this:
"The expression with the best score is currently the one used by PHP's filter_var()":
/^(?!(?:(?:\x22?\x5C[\x00-\x7E]\x22?)|(?:\x22?[^\x5C\x22]\x22?)){255,})(?!(?:(?:\x22?\x5C[\x00-\x7E]\x22?)|(?:\x22?[^\x5C\x22]\x22?)){65,}#)(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2F-\x39\x3D\x3F\x5E-\x7E]+)|(?:\x22(?:[\x01-\x08\x0B\x0C\x0E-\x1F\x21\x23-\x5B\x5D-\x7F]|(?:\x5C[\x00-\x7F]))*\x22))(?:\.(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2F-\x39\x3D\x3F\x5E-\x7E]+)|(?:\x22(?:[\x01-\x08\x0B\x0C\x0E-\x1F\x21\x23-\x5B\x5D-\x7F]|(?:\x5C[\x00-\x7F]))*\x22)))*#(?:(?:(?!.*[^.]{64,})(?:(?:(?:xn--)?[a-z0-9]+(?:-[a-z0-9]+)*\.){1,126}){1,}(?:(?:[a-z][a-z0-9]*)|(?:(?:xn--)[a-z0-9]+))(?:-[a-z0-9]+)*)|(?:\[(?:(?:IPv6:(?:(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){7})|(?:(?!(?:.*[a-f0-9][:\]]){7,})(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){0,5})?::(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){0,5})?)))|(?:(?:IPv6:(?:(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){5}:)|(?:(?!(?:.*[a-f0-9]:){5,})(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){0,3})?::(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){0,3}:)?)))?(?:(?:25[0-5])|(?:2[0-4][0-9])|(?:1[0-9]{2})|(?:[1-9]?[0-9]))(?:\.(?:(?:25[0-5])|(?:2[0-4][0-9])|(?:1[0-9]{2})|(?:[1-9]?[0-9]))){3}))\]))$/iD
It matches with only 4/86 errors, while the Javascript one I was using matches with 8/86 errors, so the PHP one is a little more accurate. So, I extended the CodeIgniter Form_validation library to instead use return filter_var($str, FILTER_VALIDATE_EMAIL);.
...But does it work in Javascript?
var pattern = new RegExp(/^(?!(?:(?:\x22?\x5C[\x00-\x7E]\x22?)|(?:\x22?[^\x5C\x22]\x22?)){255,})(?!(?:(?:\x22?\x5C[\x00-\x7E]\x22?)|(?:\x22?[^\x5C\x22]\x22?)){65,}#)(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2F-\x39\x3D\x3F\x5E-\x7E]+)|(?:\x22(?:[\x01-\x08\x0B\x0C\x0E-\x1F\x21\x23-\x5B\x5D-\x7F]|(?:\x5C[\x00-\x7F]))*\x22))(?:\.(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2F-\x39\x3D\x3F\x5E-\x7E]+)|(?:\x22(?:[\x01-\x08\x0B\x0C\x0E-\x1F\x21\x23-\x5B\x5D-\x7F]|(?:\x5C[\x00-\x7F]))*\x22)))*#(?:(?:(?!.*[^.]{64,})(?:(?:(?:xn--)?[a-z0-9]+(?:-[a-z0-9]+)*\.){1,126}){1,}(?:(?:[a-z][a-z0-9]*)|(?:(?:xn--)[a-z0-9]+))(?:-[a-z0-9]+)*)|(?:\[(?:(?:IPv6:(?:(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){7})|(?:(?!(?:.*[a-f0-9][:\]]){7,})(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){0,5})?::(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){0,5})?)))|(?:(?:IPv6:(?:(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){5}:)|(?:(?!(?:.*[a-f0-9]:){5,})(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){0,3})?::(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){0,3}:)?)))?(?:(?:25[0-5])|(?:2[0-4][0-9])|(?:1[0-9]{2})|(?:[1-9]?[0-9]))(?:\.(?:(?:25[0-5])|(?:2[0-4][0-9])|(?:1[0-9]{2})|(?:[1-9]?[0-9]))){3}))\]))$/i);
Zing! Works like a charm! Not only did I get the consistency I was looking for between front and back end validation, but I also got a more accurate regex in the process. Double win!
Thank you to all those who provided suggestions!
Today there is exists the site https://regex101.com/ where you can transform one JS regex to PHP or some another languages.
I've seen this question, which is very nice and informative. However, it doesn't deal with a rather common scenario.
Say I need to scrape a multitude of websites (or even pages in the same domain), but the author of that website didn't care enough for his code, and has some seriously malformed code "that kinda works". I need to take information from that website.
How do I do it in this case? Ideally without going í͞ń̡͢͡s̶̢̛á̢̕͘ń̵͢҉e̶̸̢̛.
Is it possible? Do I have to revert to RegExp?
You need a DOM Parser. Php has one. And then there are some alternatives (and more... just google for them). You can even run the "garbled HTML" trhu HTML Purifier if you want.
I don't know how your are scraping the site, but working with RegExp will allow you to add many conditions to the scrap code. This may take time, depending on the number of footprints and your RegExp skills.
You may also use Tidy on the site HTML, but this will lead to strange results as well IMO.
Does it have to be PHP? Python has a wonderful library called Beautiful Soup ("You didn't write that awful page. You're just trying to get some data out of it"). From my experience I'd recommend it so much that I'd say if you have the option, write a quick Python script to parse your nodes into a clean file that your PHP can pick up.
(Know PHP is in the title & this doesn't directly answer your question. Apologies if you don't have the option of (or dislike) Python, just wanted to present a good alternative.)
We're getting ready to translate our PHP website into various languages, and the gettext support in PHP looks like the way to go.
All the tutorials I see recommend using the english text as the message ID, i.e.
gettext("Hi there!")
But is that really a good idea? Let's say someone in marketing wants to change the text to "Hi there, y'all!". Then don't you have to update all the language files because that string -- which is actually the message ID -- has changed?
Is it better to have some kind of generic ID, like "hello.message", and an english translations file?
Wow, I'm surprised that no one is advocating using the English as a key. I used this style in a couple of software projects, and IMHO it worked out pretty well. The code readability is great, and if you change an English string it becomes obvious that the message needs to be considered for re-translation (which is a good thing).
In the case that you're only correcting spelling or making some other change that definitely doesn't require translation, it's a simple matter to update the IDs for that string in the resource files.
That said, I'm currently evaluating whether or not to carry this way of doing I18N forward to a new project, so it's good to hear some thoughts on why it might not be a good idea.
I strongly disagree with Richard Harrisons answer about which he states it is "the only way". Dear asker, do not trust an answer that states it is the only way, because the "only way" doesn't exist.
Here is another way which IMHO has a few advantages over Richards approach:
Start with using the proto-version of the English string as Original.
Don't display these proto-strings but create a translation file for English nontheless
Copy the proto-strings to the translation for the beginning
Advantages:
readable code
text in your code is very close if not identical to what your view displays
if you want to change the English text, you don't change the proto-string but the translation
if you want to translate the same thing twice, just write a slightly different proto-string or just add 'version for this and that' and you still have a perfectly readable code
I use meaningful IDs such as "welcome_back_1" which would be "welcome back, %1" etc. I always have English as my "base" language so in the worst case scenario when a specific language doesn't have a message ID, I fall-back on English.
I don't like to use actual English phrases as message ID's because if the English changes so does the ID. This might not affect you much if you use some automated tools, but it bothers me. I don't like to use simple codes (like msg3975) because they don't mean anything, so reading the code is more difficult unless you litter comments everywhere.
The reason for the IDs being English is so that the ID is returned if the translation fails for whatever reason - the translation for the current language and token not being available, or other errors.
That of course assumes the developer is writing the original English text, not some documentation person.
Also if the English text changes then probably the other translations need to be updated?
In practice we also use Pure IDs rather than then English text, but it does mean we have to do lots of extra work to default to English.
There is a lot to consider and answer is not so easy.
Using plain English
Pros
Easy to write and READ code
In most cases, it works even without running translation functions in code
Cons
Involved programmers must be also good copywriters :)
You need to write correct precise texts fully in English, even in the case that first language you need to run is something else (ie we're starting lof of projects in Czech language and we're localizing them to EN later).
In a lot of cases, you need to use contexts. If you fail to do it from begginig, it's a lot of work to add them later. To explain: In English, one word can have many different meands - and you need to use contexts to differentiate them - and it's not always so easy (order = sort order, or it can be purchase order).
It can be very hard to correct English later in the process. Corrections of the source strings will very often lead to loss of already translated phrases. It's very frustrating to loose translation to 3 different languages just because you corrected English.
Using keys
Pros
You can use localization platform functions even for the English language. I.e. we're using the lovely Crowdin platform. There is a lot of handy tools - or rather a complete workflow - for translation management: voting for different translations, translation history, glossaries (which helps to keep translation/language coherent), proofing, approval, etc. Using keys make this process much more smooth.
It's much easier to send Engish texts for proofreading etc. Usually, it's not a good idea to let copywriters to modify your code directly :)
Cons
More complicated project setup.
Harder to use %d, %s etc.
In a word don't do this.
The same word/phrase in English can often enough have more than one meaning, and each meaning a different translation.
Define mnemonic ids for your strings,and treat English as just another language.
Agree with other posters that id numbers in code are a nightmare for code readability.
Ex localisation engineer
Haven't you already answered your own question? :)
Clearly, if you intend to support i18n of your application, you should treat all the language implementations the same. If someone decides a string needs to change, you make a similar change in all the language files. The metadata with the checkin should group all the language files together in the same change. If your "default" language is handled differently, that makes it more difficult to maintain.
At the end of the day, a translator should be able to sit down and change the texts for every language (so they match in meaning) without having to involve the programmer that already did his/her job.
This makes me feel like the proper answer is to use a modified version of gettext where you put strings like this
_(id, backup_text, context)
_('ABOUT_ME', 'About Me', 'HOMEPAGE')
context being optional
why like this?
because you need to identify text in the system using unique ID's not english text that could get repeated elsewhere.
You should also keep the backup, id and context in the same place in your code to reduce discrepancies.
The id's also have to be readable, which brings in the problem of synonyms and duplicate use (even as ids), we could prefix the ids like this "HOMEPAGE_ABOUT_ME" or "MAIL_LETTER", but
people forget to do this at the start and changing it later is a problem
its more flexible for the system to be able to group both by id and context
which is why I also added the context variable at the end
the backup text can be pretty much anything, could even be "[ABOUT_ME#HOMEPAGE text failed to load, please contact example#example.com]"
It won't work with the current gettext editing programs like "poedit", but I think you can define custom variable names for translations like just "t()" without the underscore at the start.
I know that gettext also has support for contexts, but its not very well documented or widely used.
P.S. I'm not sure about the best variable order to enforce good and extendable code so suggestions are welcome.
I'd go so far as to say that you never (for most values of never) want to use free text as keys to anything. Imagine if SO used the query title as key to this page for instance. If someone links to it, and then the title is edited, the link is no longer valid.
Your problem is similar, except you would also be responsible for updating all links...
Like Douglas Leeder mentions, what you probably want to do is use English as the default (backup) language, although an interface that uses English and another language intermixed is highly confusing (but mildly amusing, too).
In addition to the considerations above, there are many cases where you'd want the "key" (msgid) to be different from the source text (English). For example, in the HTML view, I might want to say [yyyy] where the destination and label of that anchor tag depend on the locale of the user. E.g. it might be a link to a social network, and in US it would be Facebook but in China it would be Weibo. So the MsgIds might be something like socialSiteUrl and socialSiteLabel.
I use a mix.
For basic strings that I don't think will have conflicts/changes/weird meanings, I'll make the key be the same as the English.
We use Dutch. The strings should be written in the native language of the writer; this makes communication with translators less prone to errors, since the writer(s) can communicatie in their native language with them.