Trying to find Twitter RT's with Regular Expressions and PHP

I'm trying to find the correct Regular Expression to match all RT scenarios on Twitter (can't wait for Twitter's new retweet API).
The way I see it, RT's can be at the beginning, middle, or end of the string returned from Twitter.
So, I need something at the beginning and end of this Regular Expression:
([Rr])([Tt])
No matter what I try, I cannot match all scenarios in one Regular Expression. I tried
[^|\s+]
to match the scenario where the RT appears either at the beginning of the string or after one or more whitespace characters, but it didn't work, and neither did the equivalent for the end of the string. I tried
[\s+|$]
to match the case where the RT appears either at the end of the string or is followed by one or more whitespace characters. As with the 'pre' part, it didn't work.
Can someone please explain what I am doing wrong here? Any help or suggestions will be highly appreciated (as always :) )

You'll probably be happiest with something like:
/\brt\b/i
This finds isolated instances of RT (that is, surrounded by word boundaries); the /i modifier at the end of the regex makes it case-insensitive.
You want the word boundaries so that you don't end up thinking random tweets containing words like "Art" and "Quartz" are actually retweets. Even then, it's going to have false positives.
By default, a regular expression can (and will) match anywhere inside a string, so you don't need to account for what may precede or follow your match if indeed you don't care what it is or if it is present.
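
As for why the attempts in the question failed: square brackets create a character class, so [^|\s+] is a negated class (any one character that is not |, whitespace, or +) and [\s+|$] matches a single whitespace, +, | or $ character. Neither expresses "start of string or whitespace". Below is a minimal sketch of the \b approach; the sample tweets and the function name looksLikeRetweet are made up for illustration.

```php
<?php
// Hypothetical helper: true if the tweet contains an isolated "RT" in any case.
function looksLikeRetweet(string $tweet): bool
{
    // \b is a zero-width word boundary, so "art" and "quartz" do not match.
    return preg_match('/\brt\b/i', $tweet) === 1;
}

// Made-up sample tweets: RT at the beginning, in the middle, at the end, and absent.
$tweets = [
    'RT @alice: regex is fun',
    'Great point! rt @bob said it first',
    'So true (via @carol) RT',
    'I love modern art and quartz',
];

foreach ($tweets as $tweet) {
    echo (looksLikeRetweet($tweet) ? 'retweet: ' : 'other:   ') . $tweet . "\n";
}
```

If you did want to spell out "start of string or whitespace" explicitly, it would need a group rather than a class, for example (?:^|\s)rt(?=\s|$).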

if (preg_match('/\brt\s*@(\w+)/i', $tweet, $match))
    echo 'Somebody retweeted ' . $match[1] . "\n";

Related

Match words in a string which are not in anchor of link with regex

I'm trying to find some words (or expressions, such as two-word phrases) in a string that are not inside the anchor text of a link (the string contains HTML code and is usually UTF-8 encoded). The plan is to replace those words with links afterwards.
I'm not really good with regex. I've searched the web and Stack Overflow and found two regex patterns that help, but each of them has an issue. I'm hoping someone can help me combine the two to get a good one.
First pattern: /('.$tag.')(?![^<]*<\/a>)/is
This pattern finds the words, but if, for example, I'm trying to find "express" in the string:
In computing, a regular expression provides a concise and flexible means...
...I don't expect to find a match, yet a match is found inside the word "expression".
Second pattern: \'(?!((<.*?)|(<a.*?)))(\b'.$tag.'\b)(?!(([^<>]*?)>)|([^>]*?</a>))\'is
This pattern doesn't have the previous issue, but if the word or expression I'm trying to find ends with a special UTF-8 character, I don't get a match.
Example word: apă
Example string: ...care transformă umiditatea din aer în apă potabilă. Dacă iniţial a fost creată pentru situaţia ţărilor...

Assuming the second regular expression works for you (I haven't tested it and I really don't think you should use regexes for this kind of stuff), all you need to do is add a u modifier like @hakre said:
\'(?!((<.*?)|(<a.*?)))(\b'.$tag.'\b)(?!(([^<>]*?)>)|([^>]*?</a>))\'isu
Personally, I'd use DOMDocument for this task.
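
To illustrate what the u modifier changes, here is a small sketch using the example word and part of the example string from the question. It assumes the source file is saved as UTF-8 and uses a simplified pattern (just the word between \b anchors), not the full pattern above.

```php
<?php
// Sample data from the question; the file itself is assumed to be UTF-8 encoded.
$text = 'care transformă umiditatea din aer în apă potabilă.';
$tag  = 'apă';

// preg_quote() escapes any regex metacharacters that the search word may contain.
$pattern = '/\b' . preg_quote($tag, '/') . '\b/';

// Without "u" the subject is treated as bytes, so the bytes of "ă" are not
// word characters and the trailing \b has nothing to anchor to.
$withoutU = preg_match($pattern, $text);

// With "u" the pattern and subject are treated as UTF-8, and "ă" counts as a letter.
$withU = preg_match($pattern . 'u', $text);

var_dump($withoutU, $withU);
```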

Positive look ahead regex confusing

I'm building this regex with a positive lookahead in it. Basically, it must select all text in the line up to the last period that precedes a ":" and add a "|" to the end to delimit it. Some sample text is below. I am testing this in gskinner and editpadpro, which apparently has full grep regex support, so if I could get the answers in that form I'd appreciate it.
The regex below works to a degree but I am unsure if it is correct. Also it falls down if the text contains brackets.
Finally I would like to add another ignore rule like the one that ignores but includes "Co." in the selection. This second ignore rule would ignore but include periods that have a single Capital letter before them. Sample text below too. Thanks for all the help.
^(?:[^|]+\|){3}(.*?)[^(?:Co)]\.(?=[^:]*?\:)
121| Ryan, T.N. |2001. |I like regex. But does it like me (2) 2: 615-631.
122| O' Toole, H.Y. |2004. |(Note on the regex). Pages 90-91 In: Ryan, A. & Toole, B.L. (Editors) Guide to the regex functionality in php. Timmy, Tommy& Stewie, Quohog. * Produced for Family Guy in Quohog.

I don't think I understand what you want to do. But this part [^(?:Co)] is definitely not correct.
With the square brackets you are creating a character class, because of the ^ it is a negated class. That means at this place you don't want to match one of those characters (?:Co), in other words it will match any other character than "?)(:Co".
Update:
I don't think it's possible. How should I distinguish between L. Co. or something similar and the end of the sentence?
But I found another error in your regex. The last part (?=[^:]*?\:) should be (?=[^.]*?\:) if you want to match the last dot before the :. With your expression it will match on the first dot.
See it here on Regexr
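
The point about [^(?:Co)] being a negated character class can be checked with a quick sketch. The pattern is anchored here so that each test string is exactly one character; the list of test characters is arbitrary.

```php
<?php
// Anchored to a single character so each test shows whether that character is allowed.
$pattern = '/^[^(?:Co)]$/';

foreach (['(', '?', ':', 'C', 'o', ')', 'x', 'e'] as $char) {
    $result = preg_match($pattern, $char) === 1 ? 'matches' : 'does not match';
    echo "'$char' $result\n";
}
```

Since ")" and "o" are among the excluded characters, a period directly after a closing bracket or after any word ending in "o" cannot satisfy [^(?:Co)]\. at all, which may be related to the trouble with brackets mentioned in the question.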

This seems to do what you want.
(.*\.)(?=[^:]*?:)
It quite simply matches all text up to the last full stop that occurs before the colon.
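
A rough sketch of how that pattern could be used to add the "|" delimiter, applied to the first sample line from the question. The function name delimitBeforeColon is hypothetical, and this does not address the "Co." or single-capital-letter rules.

```php
<?php
// Hypothetical helper: insert "|" after the last period that still has a colon after it.
function delimitBeforeColon(string $line): string
{
    // Greedy .* runs to the end of the line and backtracks to the last "."
    // for which the lookahead can still find a ":" further on.
    return preg_replace('/(.*\.)(?=[^:]*?:)/', '$1|', $line, 1);
}

$line = '121| Ryan, T.N. |2001. |I like regex. But does it like me (2) 2: 615-631.';
echo delimitBeforeColon($line), "\n";
```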

Reg-Ex for filtering out, parsing and replacing a specific string? (php)

While developing a private CMS for a client, I had the idea of implementing a flexible, server-side "language" with PHP underneath.
I'm having trouble finding a regular expression that finds (filters out) the following string ([..] is the code, which will be parsed after it has been filtered out). I want to match the string including the line breaks.
<(
[..]
)>
I was looking for a solution all night, but I didn't find a solution.

First off: Listen to Dan Grossman's advice above.
From my current understanding of your question, you want to get the verbatim content between <( and )> - no exceptions, no comment handling.
If so, try this RegExp
'/<\(((?:.|\s)*?)\)>/'
which you can use like this
preg_match_all('/<\(((?:.|\s)*?)\)>/', $yourstring, $matches)
It doesn't need case insensitivity, and it does lazy matching (so you can apply it to a string with several instances of matches).
Explanation of the RegExp: Starting with <(, ending with )> (brackets escaped of course), in between is the capturing group. At its core, we take either regular characters . or whitespace \s (which solves your problem, since line breaks are whitespace too). We don't want to capture every single character, so the inner group is non capturing - just either whitespace or character: (?:.|\s). This is repeated any number of times (including zero), but only until the first match is complete: *? for lazy 0-n. That's about it, hope it helps.
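
As a possible alternative to (?:.|\s), PCRE's s modifier makes the dot match line breaks as well, which gives a shorter pattern. This is a sketch of that variant, not the pattern from the answer above; the sample input is made up.

```php
<?php
// Made-up sample input with one multi-line block and one single-line block.
$source = "intro <(\nfirst block\n)> middle <( second )> end";

// The "s" modifier lets "." match newlines; ".*?" stays lazy so each
// match stops at the nearest ")>".
preg_match_all('/<\((.*?)\)>/s', $source, $matches);

// $matches[1] holds the text between each "<(" and ")>" pair.
var_dump($matches[1]);
```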

Match a regular expression against any non-character or number

Ok, here again.
I promise to study regular expressions in depth soon :P
Language: PHP
Problem:
Match if some bad word exists inside a string and do something.
The word must not be part of a larger word. I mean, if I search for "rob" (sorry Rob, I don't think you're a bad word), the word "problem" has to pass the check.
I googled around but found nothing that worked for me. So I thought of something like this:
If I match the word with any of the following characters before and after it:
.
,
;
:
!
?
(
)
+
-
[whitespace]
I can simulate a check for a single word inside a string.
Finally, the questions:
Is there a better way to do it?
If not, what would be the correct regexp to match [all_that_char]word[all_that_char]?
Thanks in advance to anyone who can help!
Maybe this is a very stupid question, but today is one of those days when moving our neurons causes an incredible headache :|

Look up \b (word boundary):
"Matches at the position between a word character (anything matched by \w) and a non-word character (anything matched by [^\w] or \W) as well as at the start and/or end of the string if the first and/or last characters in the string are word characters."
(http://www.regular-expressions.info/reference.html)
So: \brob\b matches rob, but not problem.
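
A minimal sketch of how this could be wrapped for a list of bad words. The function name containsBadWord and the case-insensitive /i flag are assumptions, not something the question asked for.

```php
<?php
// Hypothetical helper: true if any listed word occurs as a whole word in $text.
function containsBadWord(string $text, array $badWords): bool
{
    foreach ($badWords as $word) {
        // preg_quote() keeps characters such as "." or "+" in a word from acting as regex syntax.
        $pattern = '/\b' . preg_quote($word, '/') . '\b/i';
        if (preg_match($pattern, $text) === 1) {
            return true;
        }
    }
    return false;
}

var_dump(containsBadWord('Hi rob, how are you?', ['rob']));
var_dump(containsBadWord('I have a problem', ['rob']));
```

Note that \b only behaves as expected when the word itself starts and ends with a word character.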

You can use \b, see Whole word boundaries.

Need to negate this regex pattern, but no clue how

I found a regex pattern for PHP that does the exact OPPOSITE of what I'm needing, and I'm wondering how I can reverse it?
Let's say I have the following text: Item_154 ($12)
This pattern /\((.*?)\)/ gets what's inside the parentheses, but I need to get "Item_154" and cut out what's in parentheses and the space before the parentheses.
Anybody know how I can do that?
Regex is above my head apparently...

/^([^( ]*)/
Match everything from the start of the string until the first space or (.
If the item you need to match can have spaces in it, and you only want to get rid of whitespace immediately before the parenthetical, then you can use this instead:
/^([^(]*?)\s*\(/
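
A short sketch of both patterns in use. The second input, with spaces in the item name, is a made-up example.

```php
<?php
// First pattern: everything up to the first space or "(".
preg_match('/^([^( ]*)/', 'Item_154 ($12)', $first);

// Second pattern: allows spaces in the name and drops only the
// whitespace directly before the "(".
preg_match('/^([^(]*?)\s*\(/', 'Deluxe Item 154 ($12)', $second);

echo $first[1], "\n";
echo $second[1], "\n";
```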

The following will match anything that looks like text (...) but returns just the text part in the match.
\w+(?=\s*\([^)]*\))
Explanation:
The \w includes alphanumeric and underscore, with + saying match one or more.
The (?= ) group is positive lookahead, saying "confirm this exists but don't match it".
Then we have \s for whitespace, and * saying zero or more.
The \( and \) match literal ( and ) characters (since it's normally a special char).
The [^)] is anything non-) character, and again * is zero or more.
Hopefully all makes sense?
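
In PHP the lookahead version might be used as below. Because the lookahead consumes nothing, the full match in index 0 is already just the text part. The longer sample string is made up to show that items without a parenthetical are skipped.

```php
<?php
// Made-up input: one item followed by a parenthetical, one without.
$text = 'Item_154 ($12) and Item_200 with no price';

// \w+ is only accepted where optional whitespace and "(...)" follow it.
preg_match_all('/\w+(?=\s*\([^)]*\))/', $text, $matches);

var_dump($matches[0]);
```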

/(.*)\(.*\)/
What is not in () will now be in your first capture group (with the trailing space still attached) :)

One site that really helped me was http://gskinner.com/RegExr/
It'll let you build a regex and then paste in some sample targets/text to test it against, highlighting matches. All of the possible regex components are listed on the right with (essentially) a tooltip describing the function.

<?php
$string = 'Item_154 ($12)';
// Lazy capture followed by \s* so the space before "(" is not part of the capture.
$pattern = '/(.*?)\s*\(.*?\)/';
preg_match($pattern, $string, $matches);
var_dump($matches[1]);
?>
Should get you Item_154

The following regex works on your string if you use it as the pattern of a replacement (replacing the match with an empty string), if that helps:
\s*\(.*?\)
Here's an explanation of what it's doing...
Whitespace, any number of repetitions - \s*
Literal - \(
Any character, any number of repetitions, as few as possible - .*?
Literal - \)
I've found Expresso (http://www.ultrapico.com/) is the best way of learning/working out regular expressions.
HTH
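
For reference, a minimal sketch of that replacement in PHP, assuming the match is replaced with an empty string:

```php
<?php
$text = 'Item_154 ($12)';

// Remove optional whitespace plus the parenthesised part.
$clean = preg_replace('/\s*\(.*?\)/', '', $text);

var_dump($clean);
```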

Here is a one-shot to do the whole thing
$text = 'Item_154 ($12)';
// The replacement must be a quoted string; $1, $2 and $3 refer to the capture groups.
$text = preg_replace('/([^\s]*)\s(\()[^)]*(\))/', '$1$2$3', $text);
var_dump($text);
//Outputs: Item_154()
Keep in mind that using any PCRE functions involves a fair amount of overhead, so if you are using something like this in a long loop and the text is simple, you could probably do something like this with substr/strpos and then concat the parens on to the end since you know that they should be empty anyway.
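
A rough sketch of that substr/strpos idea, assuming the text contains at most one parenthetical and that it comes last. The function name emptyParens is hypothetical.

```php
<?php
// Hypothetical helper: keep the text before "(" and append empty parentheses.
function emptyParens(string $text): string
{
    $open = strpos($text, '(');
    if ($open === false) {
        // No parenthetical at all: return the text unchanged.
        return $text;
    }
    // rtrim() drops the space that preceded the "(".
    return rtrim(substr($text, 0, $open)) . '()';
}

echo emptyParens('Item_154 ($12)'), "\n";
```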
That said, if you are looking to learn REGEXs and be productive with them, I would suggest checking out: http://rexv.org
I've found the PCRE tool there to be very useful, though it can be quirky in certain ways. In particular, any examples that you work with there should only use single quotes if possible, as it doesn't work with double quotes correctly.
Also, to really get a grip on how to use regexs, I would check out Mastering Regular Expressions by Jeffrey Friedl ISBN-13:978-0596528126
Since you are using PHP, I would try to get the 3rd Edition since it has a section specifically on PHP PCRE. Just make sure to read the first 6 chapters first since they give you the foundation needed to work with the material in that particular chapter. If you see the 2nd Edition on the cheap somewhere, that's pretty much the same core material, so it would be a good buy as well.
