I have an API call that essentially returns the HTML of a hosted wiki application page. I'm then doing some substr, str_replace and preg_replace kung-fu to format it as per my site's style guides.
I do one set of calls to format my left nav (changing a link to pageX to my wikiParse?page=pageX type of thing). I can safely do this on the left nav. In the body text, however, I cannot safely assume a link is a link to an internal page. It could very well be a link to an external resource. So I need to do a preg_replace that matches href= that is not followed by http://.
Here is my stab at it:
$result = preg_replace('href\=\"(?!http\:\/\/)','href="bla?id=',$result);
This seems to strip out the entire contents on the page. Anyone see where I slipped up? I don't think I'm too far off, just can't see where to go next.
Cheers
The preg_* functions expect Perl-Compatible Regular Expressions (PCRE). The structural difference from normal regular expressions is that the expression itself is wrapped in delimiters that separate the expression from possible modifiers. The classic delimiter is / but PHP allows any other non-alphanumeric, non-whitespace character except the backslash. See also Introduction to PCRE in PHP.
So try this:
$result = preg_replace('/href="(?!http:\/\/)/', 'href="bla?id=', $result);
Here href="(?!http://) is the regular expression. But as we use / as delimiters, the occurrences of / inside the regular expression must be escaped using backslashes.
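As a minimal sketch of the above (the sample HTML is made up, and also skipping https:// links is my own assumption, not part of the question), choosing # as the delimiter avoids escaping the slashes altogether. The last call illustrates why the original attempt seemed to wipe the page: without valid delimiters, preg_replace fails and returns NULL.

```php
<?php
// Made-up body HTML: one internal link and two external links.
$html = '<a href="pageX">X</a> <a href="http://example.com/">ext</a> <a href="https://example.org/">ext2</a>';

// '#' is the delimiter, so the slashes in the pattern need no escaping.
// (?!https?://) is a negative lookahead: href=" matches only when it is
// NOT immediately followed by http:// or https:// (the https part is an
// assumption added for this sketch).
$result = preg_replace('#href="(?!https?://)#', 'href="bla?id=', $html);
echo $result, "\n";

// The delimiter-less pattern from the question is rejected by PCRE.
// preg_replace emits a warning (suppressed here with @) and returns NULL,
// so assigning it back to $result discards the whole page.
$broken = @preg_replace('href="(?!http://)', 'href="bla?id=', $html);
var_dump($broken); // NULL
```

Only the internal link is rewritten to href="bla?id=pageX"; both external links are left untouched.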
Your regexp is missing starting and ending delimiters (by convention '/'):
$result = preg_replace('/href\=\"(?!http\:\/\/)/','href="bla?id=',$result);
Related
I'm not familiar with regular expressions. I'm trying to understand them, but it's difficult.
I've got a regular expression which will wrap any URL in an anchor tag. However, it's also wrapping URLs which are already in an anchor tag. I would like to prevent that, so I found a regular expression which does this for me.
?![^<]*</a>
However, I have no idea how I would add this to my existing regular expression. This is my current regular expression:
preg_replace('!(((ht)tp(s)?://)[-a-zA-Zа-яА-Я()0-9#:%_+.~#?&;//=]+)!i', '$1', $text);
So, how can I skip an URL that is already wrapped in an anchor tag?
I'm gonna join the choir and say: Don't use regex for this - use a html parser.
This said - the regex you found isn't really a regex in itself. It's part of a negative look-ahead that kind of checks you aren't in an anchor. (It should really be (?![^<]*</a>).) It checks that the following text up to the next < (or the end) isn't followed by </a>.
Appending this to the end of your original RE will sometimes do the trick. I won't spend time thinking of situations it'll fail - but it probably will.
Along with some simplifications your regex should look like this:
(https?:\/\/[-\wа-яА-Я()#:%+.~#?&;\/=]+)(?![^<]*<\/a>)
This probably will work for you mostly, but probably will fail at times as well.
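A rough sketch of how this could be applied in PHP. The sample text and the anchor markup in the replacement are my own assumptions, and the Cyrillic ranges are left out to keep the example ASCII-only (with them you would likely want the u modifier as well).

```php
<?php
// Made-up input: one bare URL and one URL that is already linked.
$text = 'Visit http://example.com/a and <a href="http://example.org/b">http://example.org/b</a>';

// '/' is the delimiter, so slashes inside the pattern are escaped.
// The trailing (?![^<]*<\/a>) rejects a URL when the text up to the next '<'
// is immediately followed by a closing </a>.
$pattern = '/(https?:\/\/[-\w()#:%+.~?&;\/=]+)(?![^<]*<\/a>)/i';

// Hypothetical replacement markup: wrap the URL in a plain anchor.
$linked = preg_replace($pattern, '<a href="$1">$1</a>', $text);
echo $linked, "\n";
```

Here only the bare URL gets wrapped. As the answer warns, it is easy to break: if the link text is nested in another tag, e.g. <a href="http://x"><b>http://x</b></a>, the next closing tag is </b> rather than </a>, so the URLs inside the anchor would be wrapped again.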
Regards
I have made a regular expression to remove a script tag from an imported page (fetched with cURL).
<script[\s\S]*?/script> is my expression.
When I used it with preg_replace to remove the tag, it gave me this error:
Warning: preg_replace() [function.preg-replace]: Unknown modifier 'c' in C:\xampp\htdocs\get_page.php on line 21
Can anyone help me?
thanks
You should choose a suitable delimiter for your regular expression (preferably one that doesn't occur anywhere in your pattern, so that you don't need to escape it). For example:
"#<script[\s\S]*?/script>#"
Also, don't do that if you are trying to prevent malicious people from injecting Javascript into your page. It can easily be worked around. Use a whitelist of known safe constructs rather than trying to remove dangerous code.
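PHP requires delimiters on RegExp patterns. Also, your expression can be shortened, though note that .+ is greedy and, without the s modifier, does not match newlines, so this is not an exact equivalent:
|<script.+/script>|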
Did you wrap your regexp in forward slashes?
$str = preg_replace('/<script[\s\S]*?\/script>/', ...);
Did you surround your regular expression with a delimiter, such as /? If you didn't, you need to. If you did, and you used / (as opposed to your other choices) you'll need to escape the / in your /script, so it'll look like \/script instead.
Use the following code :
$result = preg_replace('%<script[\s\S]*?/script>%', $change_to, $subject);
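To make the delimiter point concrete, here is a small sketch with made-up input. The "Unknown modifier 'c'" warning is what you would expect if the pattern was wrapped in / without escaping the slash in /script: the pattern then ends at that slash, the following s is read as a valid modifier, and c is not one. Using # as the delimiter sidesteps this. The i modifier is my own addition.

```php
<?php
// Made-up page fragment containing a multi-line script block.
$html = "<p>Hello</p>\n<script type=\"text/javascript\">\nalert('hi');\n</script>\n<p>Bye</p>";

// '#' does not occur in the pattern, so nothing needs escaping.
// [\s\S]*? lazily matches any characters, including newlines,
// up to the first "/script>".
$clean = preg_replace('#<script[\s\S]*?/script>#i', '', $html);

echo $clean, "\n";
```

As noted above, this is fine for tidying markup but should not be relied on as a defence against script injection.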
Is there an equivalent of the PHP function preg_split for JavaScript?
Any string in JavaScript can be split using the string.split function, e.g.
"foo:bar".split(/:/)
where split takes as an argument either a regular expression or a literal string.
You can use regular expressions with split.
The problem is the escape characters in the string: the (? opens a non-capturing group, but there is no corresponding ) to close it, so it identifies the string to look for as '
If you want support for all of the preg_split arguments see https://github.com/kvz/phpjs/blob/master/_workbench/pcre/preg_split.js (though not sure how well tested it is).
Just bear in mind that JavaScript's regex syntax is somewhat different from PHP's (mostly less expressive). We would like to integrate XRegExp at some point, as that makes up for some of the PHP regex features that JavaScript lacks (as well as fixing the many browser reliability problems with functions like String.split()).
I'm trying to write a regular expression for matching the following HTML.
<span class="hidden_text">Some text here.</span>
I'm struggling to write out the condition to match it and have tried the following, but in some cases it selects everything after the span as well.
$condition = "/<span class=\"hidden_text\">(.*)<\/span>/";
If anyone could highlight what I'm doing wrong that would be great.
You need to use a non-greedy selection by adding ? after .* :
$condition = "/<span class=\"hidden_text\">(.*?)<\/span>/";
Note: If you need to match generic HTML, you should use an XML parser like DOM.
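A small sketch of the difference, using made-up input with two hidden spans. The greedy (.*) runs to the last </span> in the string, while the non-greedy (.*?) stops at the first one after each opening tag.

```php
<?php
// Made-up input with two hidden spans on one line.
$html = '<span class="hidden_text">First</span> visible <span class="hidden_text">Second</span>';

// Greedy: .* grabs as much as possible, so the single match spans both elements.
preg_match('/<span class="hidden_text">(.*)<\/span>/', $html, $greedy);

// Non-greedy: .*? grabs as little as possible, giving one match per span.
preg_match_all('/<span class="hidden_text">(.*?)<\/span>/', $html, $lazy);

echo $greedy[1], "\n";   // First</span> visible <span class="hidden_text">Second
print_r($lazy[1]);       // the two captures: First, Second
```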
You shouldn’t try to use regular expressions on a non-regular language like HTML. Better use a proper HTML parser to parse the document.
See the following questions for further information on how to do that with PHP:
How to parse HTML with PHP?
Best methods to parse HTML
$condition = "/<span class=\"hidden_text\">(?<=^|>)[^><]+?(?=<|$)<\/span>/";
I got it. ;)
Chances are that you have multiple spans, and the regexp you're using will default to greedy mode.
It's a lot easier to use PHP's DOM parser to extract content from HTML.
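For comparison, a minimal sketch of the DOM route using PHP's bundled DOMDocument and DOMXPath classes. The sample markup and the variable names are assumptions made for illustration.

```php
<?php
// Made-up fragment: one hidden span and one ordinary span.
$html = '<div><span class="hidden_text">Some text here.</span><span>Visible</span></div>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// XPath query: every <span> whose class attribute is exactly "hidden_text".
$xpath = new DOMXPath($doc);
$texts = [];
foreach ($xpath->query('//span[@class="hidden_text"]') as $span) {
    $texts[] = $span->textContent;
}

print_r($texts); // one entry: "Some text here."
```

Note that @class="hidden_text" only matches when that is the sole class on the element; elements with several classes would need a different XPath test.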
I think this is what they call a teachable moment. :P Let us now compare and contrast the regex in your self-answer:
"/<span class=\"hidden_text\">(?<=^|>)[^><]+?(?=<|$)<\/span>/"
...and this one:
'~<span class="hidden_text">[^><]++</span>~'
PHP's double-quoted strings are subject to interpolation of embedded variables ($my_var) and of more complex variable expressions wrapped in braces ({$obj->name}). If you aren't using those features, it's best to use single-quoted strings to avoid surprises. As a bonus, you don't have to escape those double-quotes any more.
PHP allows you to use almost any ASCII punctuation character for the regex delimiters. By replacing your slashes with ~ I eliminated the need to escape the slash in the closing tag.
The lookbehind - (?<=^|>) - was not doing anything useful. It would only ever be evaluated immediately after the opening tag had been matched, so the previous character was always >.
[^><]+? is good (assuming you don't want to allow other tags in the content), but the quantifier doesn't need to be reluctant. [^><]+ can't possibly overrun the closing </span> tag, so there's no point sneaking up on it. In fact, go ahead and kick the door in with a possessive quantifier: [^><]++.
Like the lookbehind before it, (?=<|$) was only taking up space. If [^><]+ consumes everything it can and the next character is not <, you don't need a lookahead to tell you the match is going to fail.
Note that I'm just critiquing your regex, not fixing it; your regex and mine would probably yield the same results every time. There are many ways both of them can go wrong, even if the HTML you're working with is perfectly valid. Matching HTML with regexes is like trying to catch a greased pig.
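As a quick, non-exhaustive check of that claim, the sketch below runs both patterns against the span from the question and compares the full matches. It says nothing about how either behaves on messier HTML.

```php
<?php
// The span from the question, surrounded by made-up text.
$html = 'before <span class="hidden_text">Some text here.</span> after';

// The self-answer's pattern, written single-quoted so no escaping of " is needed.
$original = '/<span class="hidden_text">(?<=^|>)[^><]+?(?=<|$)<\/span>/';

// The trimmed-down pattern: ~ delimiters, no lookarounds, possessive ++.
$simplified = '~<span class="hidden_text">[^><]++</span>~';

preg_match($original, $html, $a);
preg_match($simplified, $html, $b);

var_dump($a[0] === $b[0]); // bool(true)
echo $b[0], "\n";          // <span class="hidden_text">Some text here.</span>
```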
On Twitter, when you write #moustafa,
it changes to <a href='user/moustafa'>#moustafa</a>.
Now I want to do the same thing:
when I write #moustafa followed by a space, only #moustafa should be changed.
One regular expression that could be used (shamelessly stolen from the #anywhere javascript library mentioned in another answer) would be:
\B\#([a-zA-Z0-9_]{1,20})
This looks for a non-word-boundary (to prevent a#b [i.e. emails] from matching) followed by #, then between one and 20 (inclusive) characters in that character class. Of course, the anything-except-space route, as in other answers, would also work; it depends very much on what values are to be (dis)allowed in the label part of the #label.
To use the highlighted regex in PHP, something like the following could be used to replace a string $subject.
$subject = 'Hello, #moustafa how are you today?';
echo preg_replace('/\B\#([a-zA-Z0-9_]{1,20})/', '$0', $subject);
The above outputs something like:
Hello, #moustafa how are you today?
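To actually produce a link, the replacement string needs to contain the anchor markup. A sketch, assuming the user/NAME URL scheme shown in the question (the exact markup is a guess): $1 is the captured name without the #, and $0 is the whole match.

```php
<?php
// Made-up input: one mention, plus an a#b token that must not be linked.
$subject = 'Hello, #moustafa how are you today? Write to a#b.';

// \B requires that there is NO word boundary before '#', so the '#'
// in "a#b" (preceded by a word character) is skipped.
$linked = preg_replace(
    '/\B\#([a-zA-Z0-9_]{1,20})/',
    '<a href="user/$1">$0</a>',   // hypothetical link format
    $subject
);

echo $linked, "\n";
// Hello, <a href="user/moustafa">#moustafa</a> how are you today? Write to a#b.
```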
You're looking for a regular expression that matches #username, where username doesn't have a space? You can use:
#[^ ]+
If you know the allowed characters in a username you can be more specific, like if they have to be alphanumeric:
#[A-Za-z0-9]+
Regular expressions in PHP are just strings that start and end with the same delimiter character. By convention this character is /.
So you can use something like this as an argument to any of the many php regular expression functions:
Not space:
"/[^ ]+/"
Alphanumeric only:
"/[A-Za-z0-9]+/"
Why not use the #anywhere javascript library that Twitter have recently released?
There are several libraries that perform this selection and linking for you. Currently I know of Java, Ruby, and PHP libraries under mzsanford's GitHub account: http://github.com/mzsanford/twitter-text-rb