How to validate a Twitter username using Regex - php

I have used the pattern /[a-z0-9_]+/i within the function:
function validate_twitter($username) {
if (eregi('/[a-z0-9_]+/i', $username)) {
return true;
}
}
With this, I test whether the input is a valid Twitter username, but I'm having difficulties, as it is not giving me a valid result.
Can someone help me find a solution?

To validate if a string is a valid Twitter handle:
function validate_username($username)
{
return preg_match('/^[A-Za-z0-9_]{1,15}$/', $username);
}
If you are trying to match @username within a string.
For example: RT @username: lorem ipsum @cjoudrey etc...
Use the following:
$string = 'RT @username: lorem ipsum @cjoudrey etc...';
preg_match_all('/@([A-Za-z0-9_]{1,15})/', $string, $usernames);
print_r($usernames);
You can use the latter with preg_replace_callback to linkify usernames in a string.
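As a rough illustration of that last point, the sketch below wraps each mention in a link using preg_replace_callback. It assumes mentions are written with a leading @. The function name linkify_mentions and the https://twitter.com/ URL prefix are assumptions made for this example; they are not taken from Twitter's libraries.

```php
<?php
// Hypothetical helper: turn each @mention into an HTML link.
function linkify_mentions(string $text): string
{
    return preg_replace_callback(
        '/@([A-Za-z0-9_]{1,15})/',
        function (array $m): string {
            $name = $m[1]; // the username without the leading @
            // The URL prefix is an assumption for illustration only.
            return '<a href="https://twitter.com/' . $name . '">@' . $name . '</a>';
        },
        $text
    );
}

echo linkify_mentions('RT @username: lorem ipsum @cjoudrey etc...'), "\n";
```

Because the pattern is not anchored, a longer run of allowed characters is only matched up to its first 15 characters; whether that matters depends on your input.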
Edit: Twitter also open sourced text libraries for Java and Ruby for matching usernames, hashtags, etc. You could probably look into the code and find the regex patterns they use.
Edit (2): Here is a PHP port of the Twitter Text Library: https://github.com/mzsanford/twitter-text-php#readme

Don't use / with ereg*.
In fact, don't use ereg* at all if you can avoid it. http://php.net/preg_match
edit: Note also that /[a-z0-9_]+/i will match on strings such as "spaces are invalid" and "not-a-real-name", because it only needs one run of allowed characters somewhere in the input. You almost certainly want /^[a-z0-9_]+$/i.
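To make the difference concrete, here is a small sketch comparing the unanchored and anchored patterns with preg_match. The variable names and sample strings are only for illustration.

```php
<?php
// Unanchored: succeeds if ANY run of allowed characters exists in the input.
$unanchored = '/[a-z0-9_]+/i';
// Anchored: the whole input must consist of allowed characters.
$anchored = '/^[a-z0-9_]+$/i';

$candidates = ['spaces are invalid', 'not-a-real-name', 'valid_name'];
foreach ($candidates as $candidate) {
    printf(
        "%-20s unanchored=%d anchored=%d\n",
        $candidate,
        preg_match($unanchored, $candidate),
        preg_match($anchored, $candidate)
    );
}
```

Note that $ also matches just before a trailing newline, so if that matters you may want \z or the D modifier instead.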

I believe that you're using the PCRE form, in which case you should be using the preg_match function instead.

eregi() doesn't expect / delimiters or trailing modifiers. Just use eregi('[a-z0-9_]+', $username).

Your regular expression is valid, although it allows spaces FYI. (If you want to test out regular expressions I recommend: http://rubular.com/).
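The first issue here is your use of eregi, which is deprecated as of PHP 5.3. It is recommended that you use preg_match instead; it takes the pattern and the subject in the same order, and your pattern already has the delimiters and the i modifier that it expects. Give that a try and see if it helps.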
PHP Documentation for preg_match: http://www.php.net/manual/en/function.preg-match.php
PHP Documentation for eregi: http://php.net/manual/en/function.eregi.php

Twitter user names have from 1 to 15 chars... so this could be even better with /^[a-z0-9_]{1,15}$/i.

Related

PHP preg_replace RCE [duplicate]

I'm currently improving my knowledge about security holes in HTML, PHP, JavaScript etc.
A few hours ago, I stumbled across the /e modifier in regular expressions and I still don't get how it works. I've taken a look at the documentation, but that didn't really help.
What I understood is that this modifier can be manipulated to give someone the opportunity to execute PHP code (in preg_replace(), for example). I've seen the following example describing a security hole, but it wasn't explained, so could someone please explain to me how to call phpinfo() in the following code?
$input = htmlentities("");
if (strpos($input, 'bla'))
{
echo preg_replace("/" .$input ."/", $input ."<img src='".$input.".png'>", "bla");
}
The e Regex Modifier in PHP with example vulnerability & alternatives
What e does, with an example...
The e modifier is a deprecated regex modifier which allows you to use PHP code in the replacement string of preg_replace(). This means that whatever you pass in will be evaluated as a part of your program.
For example, we can use something like this:
$input = "Bet you want a BMW.";
echo preg_replace("/([a-z]*)/e", "strtoupper('\\1')", $input);
This will output BET YOU WANT A BMW.
Without the e modifier, we get this very different output:
strtoupper('')Bstrtoupper('et')strtoupper('') strtoupper('you')strtoupper('') strtoupper('want')strtoupper('') strtoupper('a')strtoupper('') strtoupper('')Bstrtoupper('')Mstrtoupper('')Wstrtoupper('').strtoupper('')
Potential security issues with e...
The e modifier is deprecated for security reasons. Here's an example of an issue you can run into very easily with e:
$password = 'secret';
...
$input = $_GET['input'];
echo preg_replace('|^(.*)$|e', '"\1"', $input);
If I submit my input as "$password", the output to this function will be secret. It's very easy, therefore, for me to access session variables, all variables being used on the back-end and even take deeper levels of control over your application (eval('cat /etc/passwd');?) through this simple piece of poorly written code.
Like the similarly deprecated mysql libraries, this doesn't mean that you cannot write code which is not subject to vulnerability using e, just that it's more difficult to do so.
What you should use instead...
You should use preg_replace_callback in nearly all places you would consider using the e modifier. The code is definitely not as brief in this case but don't let that fool you -- it's twice as fast:
$input = "Bet you want a BMW.";
echo preg_replace_callback(
    "/([a-z]*)/",
    function ($matches) {
        // $matches[0] holds the full text of the current match
        return strtoupper($matches[0]);
    },
    $input
);
On performance, there's no reason to use e...
Unlike the mysql libraries (which were also deprecated for security purposes), e is not quicker than its alternatives for most operations. For the example given, it's twice as slow: preg_replace_callback (0.14 sec for 50,000 operations) vs e modifier (0.32 sec for 50,000 operations)
The e modifier is a PHP-specific modifier that triggers PHP to run the resulting string as PHP code. It is basically an eval() wrapped inside a regex engine.
eval() on its own is considered a security risk and a performance problem; wrapping it inside a regex amplifies both those issues significantly.
It is therefore considered bad practice; it was formally deprecated in PHP 5.5 and removed in PHP 7.0.
PHP has provided for several versions now an alternative solution in the form of preg_replace_callback(), which uses callback functions instead of using eval(). This is the recommended method of doing this kind of thing.
With specific regard to the code you've quoted:
I don't see an e modifier in the sample code you've given in the question. It has a slash at each end as the regex delimiter; the e would have to be outside of that, and it isn't. Therefore I don't think the code you've quoted is likely to be directly vulnerable to having an e modifier injected into it.
However, if $input contains any / characters, it will be vulnerable to being entirely broken (ie throwing an error due to invalid regex). The same would apply if it had anything else that made it an invalid regular expression.
Because of this, it is a bad idea to use an unvalidated user input string as part of a regex pattern - even if you are sure that it can't be hacked to use the e modifier, there's plenty of other mischief that could be achieved with it.
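If user input really has to appear inside a pattern as literal text, one common mitigation is preg_quote, which escapes regex metacharacters and, when given as the second argument, the delimiter. The sketch below is only an illustration; the helper name count_literal is made up for this example.

```php
<?php
// Hypothetical helper: count literal occurrences of $needle in $haystack.
function count_literal(string $needle, string $haystack): int
{
    // preg_quote escapes regex metacharacters; passing '/' also escapes
    // the delimiter so the input cannot terminate the pattern early.
    $pattern = '/' . preg_quote($needle, '/') . '/';
    return preg_match_all($pattern, $haystack);
}

var_dump(count_literal('a/b', 'a/b and a/b'));
var_dump(count_literal('.', 'a.b'));
```

This only makes the input safe as literal text; it does not help if the user is meant to supply an actual regular expression.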
As explained in the manual, the /e modifier makes preg_replace() evaluate the replacement string, after the backreferences have been substituted, as PHP code. The example given in the manual is:
$html = preg_replace(
'(<h([1-6])>(.*?)</h\1>)e',
'"<h$1>" . strtoupper("$2") . "</h$1>"',
$html
);
This matches any "<hX>XXXXX</hX>" text (i.e. headline HTML tags), replaces this text with "<hX>" . strtoupper("XXXXX") . "</hX>", then executes "<hX>" . strtoupper("XXXXX") . "</hX>" as PHP code, then puts the result back into the string.
If you run this on arbitrary user input, any user has a chance to slip something in which will actually be evaluated as PHP code. If he does it correctly, the user can use this opportunity to execute any code he wants to. In the above example, imagine if in the second step the text would be "<hX>" . strtoupper("" . shell('rm -rf /') . "") . "</hX>".
It's evil, that's all you need to know :p
More specifically, it generates the replacement string as normal, but then runs it through eval.
You should use preg_replace_callback instead.
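For comparison, here is a sketch of how the manual's headline example might look with preg_replace_callback. The sample $html string and the ~ delimiter are assumptions chosen for this illustration.

```php
<?php
$html = '<h1>hello world</h1><p>text</p>';

$result = preg_replace_callback(
    '~<h([1-6])>(.*?)</h\1>~',
    function (array $m): string {
        // $m[1] is the heading level, $m[2] the heading text.
        // Nothing is evaluated as code; the match is only passed as data.
        return "<h{$m[1]}>" . strtoupper($m[2]) . "</h{$m[1]}>";
    },
    $html
);

echo $result, "\n";
```

The matched text only ever reaches the callback as a string, so it cannot be turned into executable PHP the way it can with the e modifier.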

How can I parse out specific "tags" from a string in php

I like how StackOverflow allows you to search for tags by specifying [tagname] in the search field. How could I go about writing a parser that would help me separate out tags from normal text? I can think of the manual way, which would be to use some combination of substring and/or regex to get the position of opening and closing square brackets, and then extract out those strings, but I'm curious if there's a better way (and my regex skill is subpar at best).
// example
$query = 'How to use [jQuery] [selector] selectors';
$tags = getTags($query); // $tags == 'jQuery, selector'
$text = getText($query); // $text == 'How to use selectors'
Regular expressions are probably the way to go. The more you can specify about how the tags are set, the easier it will be to capture the right ones (in the expression below I limit it to word characters, \w, which already covers letters, digits, and underscores). The function uses a capture group (enclosed in parens) to pull out the relevant tags.
function getTags($query) {
    preg_match_all("/\[(\w+)\]/", $query, $matches);
    // $matches[1] holds the text captured inside the brackets
    return $matches[1];
}
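To cover both halves of the question, here is one possible sketch that extracts the tags and also returns the remaining text. The names extract_tags and extract_text are hypothetical, and the assumption is that a tag is a run of word characters inside square brackets.

```php
<?php
// Hypothetical helper: return the tag names found in [brackets].
function extract_tags(string $query): array
{
    preg_match_all('/\[(\w+)\]/', $query, $matches);
    return $matches[1]; // captured names, without the brackets
}

// Hypothetical helper: return the query with all [tags] removed.
function extract_text(string $query): string
{
    $withoutTags = preg_replace('/\[\w+\]/', '', $query);
    // Collapse the whitespace left behind where the tags were.
    return trim(preg_replace('/\s+/', ' ', $withoutTags));
}

$query = 'How to use [jQuery] [selector] selectors';
echo implode(', ', extract_tags($query)), "\n";
echo extract_text($query), "\n";
```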
Regex would probably work best, just don't try to parse HTML.
https://www.debuggex.com/
is a really good site for visually seeing what your regex string is doing. I would recommend reading up on the PHP regex functions and learning some more; there is a cheat sheet at the bottom of the site.
.*\[(tag)\].*
Would work to get the tags, using a captured group. The preg_match_all function is really good for working with multiple results, just make sure to read the official documentation to get it working how you need it.
For parsing more complex, or irregular things (like html, which is extremely difficult to do reliably), it is better to do it manually. Regex has worked for all my non HTML parsing needs in the past.

PHP preg_replace();

I've got a problem with regexp function, preg_replace(), in PHP.
I want to get viewstate from html's input, but it doesn't work properly.
This code:
$viewstate = preg_replace('/^(.*)(<input\s+id="__VIEWSTATE"\s+type="hidden"\s+value=")(.*[^"])("\s+name="__VIEWSTATE">)(.*)$/u','^\${3}$',$html);
Returns this:
%0D%0A%0D%0A%3C%21DOCTYPE+html+PUBLIC+%22-%2F%2FW3C%2F%2FDTD+XHTML+1.0+Transitional%2F%2FEN%22+%22http%3A%2F%2Fwww.w3.org%2FTR%2Fxhtml1%2FDTD%2Fxhtml1-transitional.dtd%22%3E%0D%0A%0D%0A%3Chtml+xmlns%3D%22http%3A%2F%2Fwww.w3.org%2F1999%2Fxhtml%22+%3E%0D%0A%3Chead%3E%3Ctitle%3E%0D%0A%09Strava.cz%0D%0A%3C%2Ftitle%3E%3Clink+rel%3D%22shortcut+icon%22+href%3D%22..%2FGrafika%2Ffavicon.ico%22+type%3D%22image%2Fx-icon%22+%2F%3E%3Clink+rel%3D%22stylesheet%22+type%3D%22text%2Fcss%22+media%3D%22screen%22+href%3D%22..%2FStyly%2FZaklad.css%22+%2F%3E%0D%0A++++%3Cstyle+type%3D%22text%2Fcss%22%3E%0D%0A++++++++.style1%0D%0A++++++++%7B%0D%0A++++++++++++width%3A+47px%3B%0D%0A++++++++%7D%0D%0A++++++++.style2%0D%0A++++++++%7B%0D%0A++++++++++++width%3A+64px%3B%0D%0A++++++++%7D%0D%0A++++%3C%2Fstyle%3E%0D%0A%0D%0A%3Cscript+type%3D%22text%2Fjavascript%22%3E%0D%0A%0D%0A++var+_gaq+%3D+_gaq+%7C%7C+%5B%5D%3B%0D%0A++_gaq.push%28%5B
EDIT: Sorry, I left this question for a long time. Finally I used DOMDocument.
To be sure, I'd split this match into two phases:
Find the relevant input element
Get the value
Because you cannot be certain what the attributes order in the element will be.
if(preg_match('/<input[^>]+name="__VIEWSTATE"[^>]*>/i', $input, $match))
$value = preg_replace('/.*value="([^"]*)".*/i', '$1', $match[0]);
And, of course, always consider DOM and DOMXpath over regex for parsing html/xml.
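A minimal sketch of the DOM approach, assuming the page contains an input whose name attribute is __VIEWSTATE. The helper name get_viewstate is made up for this example.

```php
<?php
// Hypothetical helper: returns the value of the __VIEWSTATE input, or null.
function get_viewstate(string $html): ?string
{
    // Collect libxml warnings internally instead of emitting them on sloppy HTML.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Attribute order in the markup does not matter to the XPath query.
    $node = $xpath->query('//input[@name="__VIEWSTATE"]')->item(0);

    return $node instanceof DOMElement ? $node->getAttribute('value') : null;
}

$html = '<form><input id="__VIEWSTATE" type="hidden" value="abc123==" name="__VIEWSTATE"></form>';
var_dump(get_viewstate($html));
```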
You should only capture when you're planning on using the data, so most of the () in that regexp pattern are unnecessary. Not a cause for failure, but I thought I'd mention it.
Instead of using [^"] to mark that you don't want that character you could use the non-greedy modifier - ?. This makes sure the pattern is matching as little as it can. Since you have name="__VIEWSTATE" following the value this should be safe.
Let's put this in practice and simplify the pattern some. This works as you want:
'/.*<input\s+id="__VIEWSTATE"\s+type="hidden"\s+value="(.+?)"\s+name="__VIEWSTATE">.*/'
I would strongly recommend checking out an alternative to regexp for DOM operations. This makes certain your code also works if the attributes change order. Plus it's so much nicer to work with.
The main mistake was the use of the function preg_replace, which returns the subject with the replacements applied (or unchanged if nothing matched), not the matched text. Thank you for your ideas and for the recommendation of DOMDocument. m93a
http://www.php.net/manual/en/function.preg-replace.php#refsect1-function.preg-replace-returnvalues
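The difference can be seen in a small sketch. The sample $html below is made up; with multi-line input like this, one likely reason a ^(.*)...(.*)$ pattern fails is that . does not match newlines unless the s modifier is set, and when nothing matches preg_replace simply returns the subject.

```php
<?php
$html = "<html>\n<input name=\"token\" value=\"abc\">\n</html>";

// No match (the . does not cross newlines by default), so preg_replace
// hands back the subject unchanged.
$replaced = preg_replace('/^(.*)value="([^"]*)"(.*)$/', '$2', $html);
var_dump($replaced === $html); // bool(true)

// preg_match reports whether there was a match and fills $m with the captures.
$value = preg_match('/value="([^"]*)"/', $html, $m) === 1 ? $m[1] : null;
var_dump($value); // string(3) "abc"
```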

PHP expressions fail when changing from ereg to preg_match

I have a class that uses PHP's ereg() which is deprecated.
Looking on PHP.net I thought I could just get away and change to preg_match()
But I get errors with the regular expressions or they fail!!
Here are two examples:
function input_login() {
if (ereg("[^-_#\.A-Za-z0-9]", $input_value)) { // WORKS
// if (preg_match("[^-_#\.A-Za-z0-9]", $input_value)) { // FAILS
echo "is NOT alphanumeric with characters - _ # . allowed";
}
}
// validate email
function validate_email($email) {
// return eregi("^[_\.0-9a-zA-Z-]+@([0-9a-zA-Z][0-9a-zA-Z-]+\.)+[a-zA-Z]{2,6}$", $email); // FAILS
}
You forgot the delimiters:
if (preg_match("/[^-_#.A-Za-z0-9]/", $input_value))
Also, the dot doesn't need to be escaped inside a character class.
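A short sketch of why the undelimited version fails quietly instead of raising an error: PHP accepts bracket pairs as delimiters, so the outer [ and ] are consumed as delimiters and the character class disappears. The sample input is an assumption for illustration.

```php
<?php
$input = 'bad!';

// Without explicit delimiters, PHP treats the leading [ and trailing ] as the
// delimiter pair, so the pattern actually compiled is ^-_#.A-Za-z0-9
$withoutDelimiters = preg_match('[^-_#.A-Za-z0-9]', $input);

// With / delimiters the brackets form a negated character class again.
$withDelimiters = preg_match('/[^-_#.A-Za-z0-9]/', $input);

var_dump($withoutDelimiters, $withDelimiters);
```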
For your validation function, you need to make the regex case-insensitive by using the i modifier:
return preg_match('/^[_.0-9a-zA-Z-]+@([0-9a-zA-Z][0-9a-zA-Z-]+\.)+[a-zA-Z]{2,6}$/i', $email);
I can't suppress the suspicion anymore that people simply don't like my +@example.org email address (I changed the right part only). It is an absolutely normal address: it's valid, short and easy to type. People simply don't like it! One cannot register on any page using that mail.
So why not be nice and use PHP's Filter extension to verify the mail, or use a PCRE which definitely allows all valid emails in use (excluding only the @[] ones):
/^[^@]+@(?:[^.]+\.)+[A-Za-z]{2,6}$/
Thanks for saving my email, it's a dying species!
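A minimal sketch of the Filter extension route, using filter_var with FILTER_VALIDATE_EMAIL. The wrapper name is_valid_email and the sample addresses are made up for this example.

```php
<?php
// Hypothetical wrapper around the Filter extension's email validation.
function is_valid_email(string $email): bool
{
    // filter_var returns the filtered string on success and false on failure.
    return filter_var($email, FILTER_VALIDATE_EMAIL) !== false;
}

var_dump(is_valid_email('user+tag@example.org'));
var_dump(is_valid_email('not an email'));
```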
try preg_match("/[^-_#\.A-Za-z0-9]/", $input_value)

Regular expression (PCRE) for URL matching

The input: we get some plain text as an input string and we have to highlight all URLs in it with <a href={url}>{url}</a>.
For some time I've used a regex taken from http://flanders.co.nz/2009/11/08/a-good-url-regular-expression-repost/, which I modified several times, but it's built for another purpose: to check whether the whole input string is a URL or not.
So, what regex do you use in such issues?
UPD: it would be nice if answers were related to php :-[
Take a look at a couple of modules available on CPAN:
URI::Find
URI::Find::Schemeless
where the latter is a little more forgiving. The regular expressions are available in the source code (the latter's, for example).
For example:
#! /usr/bin/perl
use warnings;
use strict;
use URI::Find::Schemeless;
my $text = "http://stackoverflow.com/users/251311/zerkms is swell!\n";
URI::Find::Schemeless
->new(sub { qq[$_[0]] })
->find(\$text);
print $text;
Output:
http://stackoverflow.com/users/251311/zerkms is swell!
For Perl, I usually use one of the modules defining common regex, Regexp::Common::URI::*. You might find a good regexp for you in the sources of those modules.
http://search.cpan.org/search?query=Regexp%3A%3ACommon%3A%3AURI&mode=module
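Since the question asks for PHP, here is a rough sketch of the same idea. The pattern is a deliberately simple assumption (it only recognises http and https URLs and stops at whitespace, angle brackets, or quotes), not a complete URL grammar, and the function name linkify_urls is made up for this example.

```php
<?php
// Hypothetical helper: escape plain text for HTML and wrap URLs in links.
function linkify_urls(string $text): string
{
    // With PREG_SPLIT_DELIM_CAPTURE the captured URLs land at odd indexes
    // and the surrounding plain text at even indexes.
    $parts = preg_split(
        '~(\bhttps?://[^\s<>"\']+)~i',
        $text,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );

    $out = '';
    foreach ($parts as $i => $part) {
        $safe = htmlspecialchars($part, ENT_QUOTES, 'UTF-8');
        $out .= ($i % 2 === 1) ? "<a href=\"{$safe}\">{$safe}</a>" : $safe;
    }
    return $out;
}

echo linkify_urls('http://stackoverflow.com/users/251311/zerkms is swell!'), "\n";
```

Trailing punctuation directly after a URL (a full stop at the end of a sentence, for instance) would be swallowed into the link by this pattern, so a stricter expression may be needed for real text.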
