Regex replace matched subexpression (and nothing else)? - php

I've used regex for ages but somehow I managed to never run into something like this.
I'm looking to do some bulk search/replace operations within a file where I need to replace some data within tag-like elements. For example, converting <DelayEvent>13A</DelayEvent> to just <DelayEvent>X</DelayEvent> where X might be different for each.
The current way I'm doing this is such:
$new_data = preg_replace('|<DelayEvent>(\w+)</DelayEvent>|', '<DelayEvent>X</DelayEvent>', $data);
I can shorten this a bit to:
$new_data = preg_replace('|(<DelayEvent>)(\w+)(</DelayEvent>)|', '${1}X${3}', $data);
But really all I want to do is simulate a "replace text between tags T with X".
Is there a way to do such a thing? In essence I'm trying to prevent having to match all the surrounding data and reassembling it later. I just want to replace a given matched sub-expression with something else.
Edit: The data is not XML, although it does have what appear to be tag-like elements. I know better than to parse HTML and XML with RegEx. ;)

It is possible using lookarounds:
$new_data = preg_replace('|(?<=<DelayEvent>)\w+(?=</DelayEvent>)|', 'X', $data);
See it working online: ideone
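Since X might be different for each match, the same lookaround pattern can also be paired with preg_replace_callback. This is only a minimal sketch: the sample data and the $codes lookup table are hypothetical and not part of the original question.

```php
<?php
// Hypothetical sample data and lookup table, for illustration only.
$data  = "<DelayEvent>13A</DelayEvent>\n<DelayEvent>7B</DelayEvent>";
$codes = ['13A' => 'WEATHER', '7B' => 'CREW'];

// The lookarounds assert the tags without consuming them,
// so only the text between the tags is matched and replaced.
$new_data = preg_replace_callback(
    '|(?<=<DelayEvent>)\w+(?=</DelayEvent>)|',
    function (array $m) use ($codes) {
        // Fall back to the original value when no mapping exists.
        return $codes[$m[0]] ?? $m[0];
    },
    $data
);

echo $new_data, "\n";
```

Note that PCRE lookbehinds generally need a fixed length, which a literal tag name like <DelayEvent> satisfies.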

Related

How to get the number out of a HTML string without tags?

I have the following string inside the source of some website:
user_count: <b>5.122.512</b>
Is it possible to get the number out of this string, even if the tags around the number were different? I mean, the "user_count:" part won't change, but the tags can change, to strong for example. Or the tags could be doubled, or whatever.
How can I do that?
You can use
user_count:\s*<.*?>(.*?)<.*?>
See DEMO
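A sketch of how that pattern might be used from PHP. The sample string uses strong instead of b to show that the tag name does not matter; treating "." as a thousands separator is an assumption. With doubled tags such as <b><i>...</i></b> the capture group would come back empty, so this only covers the single-tag case.

```php
<?php
// Sample input based on the question; the surrounding tag may vary.
$html = 'user_count: <strong>5.122.512</strong>';

// Group 1 captures whatever sits between the first tag and the next tag.
if (preg_match('~user_count:\s*<.*?>(.*?)<.*?>~', $html, $m)) {
    $raw   = $m[1];
    // Assumption: "." is a thousands separator, so drop it before casting.
    $count = (int) str_replace('.', '', $raw);
    echo $raw, ' => ', $count, "\n";
}
```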
I'd imagine you have to use JS to extract the content between the tags <b>5.122.512</b> from the DOM.
If you can assign an ID to this you can probably use document.getElementById('NAME_OF_YOUR_ID').innerHTML; to extract the number between it. If you need to process this inside a PHP script, you would probably need to POST this back to the server.
There are a couple of ways to get the number out of the string. One would be just to strip the tags and run a regular expression.
$s = "user_count: <b>5.122.512</b>";
preg_match_all("#user_count: (.+)#", strip_tags($s), $matches);
print_r($matches);
$matches[1][0] should contain the number.

PHP Regex string returns two identical arrays

I've got a Regex query here to pull out all of the <tr> tags in a page. It looks like this:
preg_match_all('%<tr[^>]++>(.*?)</tr>%s', $pageText, $rows);
The problem is that, while it does find all of the tags on the page, it returns a multidimensional array in which each entry of the outer array contains an array of all of the matches. In other words, it hands me multiple copies of the array I actually want.
Help please?
EDIT: Also relevant: I'm not allowed to use DOM for this application despite it being a significantly easier (and better) way of going about things.
What you're actually asking about is the $rows[0] list, which redundantly contains the <tr>...</tr> blob again. If you just care about the (.*?) inner data, then use \K to reset the full match.
preg_match_all('=<tr\b[^>]*+>(.*?)</tr>\K=s', $pageText, $rows);
It's not possible to get rid of $rows[0] completely. You'll have to ignore it, and use $rows[1] alone.
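To make the shape of the result concrete, here is a small sketch with a hypothetical two-row table. It also shows the PREG_SET_ORDER flag, which groups the result per match instead of per capture group, in case that layout is easier to work with.

```php
<?php
// Hypothetical input, for illustration only.
$pageText = '<table><tr class="a"><td>1</td></tr><tr><td>2</td></tr></table>';

// Default order: $rows[0] holds the full matches, $rows[1] the first capture group.
preg_match_all('~<tr\b[^>]*>(.*?)</tr>~s', $pageText, $rows);
print_r($rows[1]);

// With PREG_SET_ORDER each entry is one match: [0] is the full text, [1] the inner text.
preg_match_all('~<tr\b[^>]*>(.*?)</tr>~s', $pageText, $sets, PREG_SET_ORDER);
foreach ($sets as $set) {
    echo $set[1], "\n";
}
```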
Try this one:
preg_match_all('~<tr(?:\\s+[^>]*)?>(.*?)</tr>~si', $pageText, $rows);
var_dump($rows[1]);
Don't use % to wrap RegExps. It's a character somehow reserved for printf() like functions and with %s or %i at the end of your Pattern, it can be quite confusing.

Regex for a Function Call with Multiple Optional Parameters

I'm looking for a regex that will scan a document to match a function call, and return the value of the first parameter (a string literal) only.
The function call could look like any of the following:
MyFunction("MyStringArg");
MyFunction("MyStringArg", true);
MyFunction("MyStringArg", true, true);
I'm currently using:
$pattern = '/Use\s*\(\s*"(.*?)\"\s*\)\s*;/';
This pattern will only match the first form, however.
Thanks in advance for your help!
Update
I was able to solve my problem with:
$pattern = '/Use\s*\(\s*"(.*?)\"/';
Thanks Justin!
~Scott
If you only care about the value of the first parameter, you can just chop off the end of the regex:
$pattern = '/Use\s*\(\s*"(.*?)\"/';
However, you should understand that this (or any pure-regex solution for this problem) will not be perfect, and there will be some possible cases it handles incorrectly. In this case, you'll get false positives, and escaped quotes (\") will break it.
You can handle escaped quotes by complicating it a bit:
$pattern = '/Use\s*\(\s*"((?:[^"\\\\]|\\\\.)*)"/';
This treats a backslash followed by any character as a single unit, so a " character with an odd number of backslashes in front of it stays inside the quoted string instead of ending it.
However, the false-positives issue can't be helped without introducing false negatives, and vice versa. This is because PHP is not a regular language, so it can't be parsed with "pure" regex, and even modern regex engines that allow recursion are going to need some pretty complex code to do a really thorough job at this.
All I'm saying is, if you're planning a one-off job to quickly scrape through some PHP you wrote yourself, regex is probably fine. If you're looking for something robust and open-ended that will do this on arbitrary PHP code, you need some kind of reflection or PHP parser.
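For the quick one-off case, a common way to tolerate escaped quotes is to consume a backslash together with the character that follows it. The sketch below runs that idea against hypothetical sample calls (the function name Use comes from the pattern in the question; the sample text is made up).

```php
<?php
// Hypothetical source text: the three call forms, one with escaped quotes.
$source = <<<'SRC'
Use("MyStringArg");
Use("Second", true);
Use("He said \"hi\"", true, true);
SRC;

// Inside the quotes, accept either a character that is neither " nor \,
// or a backslash followed by any single character.
$pattern = '/Use\s*\(\s*"((?:[^"\\\\]|\\\\.)*)"/';

preg_match_all($pattern, $source, $matches);
print_r($matches[1]);
```

The captured values keep their backslashes, so unescaping them (for example with stripslashes) would be a separate step.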
This might be slightly simpler, though will only work if you have double quotes and not single quotes:
$pattern = '/Use\s*[^"]*"([^"]*)"/';

PHP Extract Text from Webpage

Is it possible to do something with PHP where I can set up a connection to a URL like http://en.wikipedia.org/wiki/Wiki and extract any words that contain a prefix like "Exa" and "ins" such that the resulting PHP page will print out all the words that it found. For example with "Exa", the word "Example" would be printed out each time it found an instance of "Example". Same thing for words that start with "ins".
$data = strip_tags(file_get_contents($url));
$matches = array();
// Match whole words starting with "Exa" or "ins"; $matches[0] holds every hit.
preg_match_all('/\b(?:Exa|ins)\w*/', $data, $matches);
foreach ($matches[0] as $word) {
    echo "Match: '".$word."'\r\n";
}
Probably something like this, though I'm not so sure about the regex, I haven't tested it yet...
Edit: I changed it, it should work now... (\B => \b and strip_tags to prevent HTML-classes from being matched).
I don't have a full answer with example to give you, but yes, you should be able to read the whole page into a string variable and then do normal string operations on it. It will read in all the HTML, so you will probably need to do a lot of regex to eliminate tags if you don't want them.
Read the page into a string using file_get_contents. Use one of the various string functions to examine the page.
Yes, this is possible. A potential approach would be to:
Use something like fopen (if allow_url_fopen is enabled - failing that use CURL) to grab the external web page content.
Remove the (presumably not required) HTML tags via strip_tags.
Use strtok to tokenise and iterate over the remaining content, checking for whatever conditions you require.
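The steps above can be sketched roughly as follows. The helper name findWordsWithPrefixes, the delimiter set, and the inline sample HTML are all assumptions made for illustration; in practice the HTML might come from file_get_contents($url) or cURL.

```php
<?php
// Hypothetical helper: returns every word in $html that starts with one of $prefixes.
function findWordsWithPrefixes(string $html, array $prefixes): array
{
    // Drop the markup so attribute values and tag names are not matched.
    $text  = strip_tags($html);
    $found = [];
    // strtok() walks the text one token at a time, splitting on these delimiters.
    $delims = " \t\r\n.,;:!?()[]\"'";
    for ($word = strtok($text, $delims); $word !== false; $word = strtok($delims)) {
        foreach ($prefixes as $prefix) {
            // Case-sensitive prefix check.
            if (strncmp($word, $prefix, strlen($prefix)) === 0) {
                $found[] = $word;
                break;
            }
        }
    }
    return $found;
}

// Sample input standing in for a fetched page.
$html = '<p class="insult">An <b>Example</b> of instant text, for instance.</p>';
print_r(findWordsWithPrefixes($html, ['Exa', 'ins']));
```

Because the tags are stripped first, the class name "insult" in the sample is not reported as a match.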

preg_replace() help in PHP

Consider this string
hello awesome <a href="" rel="external" title="so awesome is cool"> stuff stuff
What regex could I use to match any occurrence of awesome which doesn't appear within the title attribute of the anchor?
So far, this is what I've come up with (it doesn't work, sadly)
/[^."]*(awesome)[^."]*/i
Edit
I took Alan M's advice and used a regex to capture every word and send it to a callback. Thanks Alan M for your advice. Here is my final code.
$plantDetails = end($this->_model->getPlantById($plantId));
$botany = new Botany_Model();
$this->_botanyWords = $botany->getArray();
foreach($plantDetails as $key=>$detail) {
$detail = preg_replace_callback('/\b[a-z]+\b/iU', array($this, '_processBotanyWords'), $detail);
$plantDetails[$key] = $detail;
}
And the _processBotanyWords()...
private function _processBotanyWords($match) {
$botanyWords = $this->_botanyWords;
$word = $match[0];
if (array_key_exists($word, $botanyWords)) {
return '' . $word . '';
} else {
return $word;
}
}
Hope this will help someone else some day! Thanks again for all your answers.
This subject comes up pretty much every day here and basically the issue is this: you shouldn't be using regular expressions to parse or alter HTML (or XML). That's what HTML/XML parsers are for. The above problem is just one of the issues you'll face. You may get something that mostly works but there'll still be corner cases where it doesn't.
Just use an HTML parser.
Assuming this is related to the question you posted and deleted a little while ago (that was you, wasn't it?), it's your fundamental approach that's wrong. You said you were generating these HTML links yourself by replacing words from a list of keywords. The trouble is that keywords farther down the list sometimes appear in the generated title attributes and get replaced by mistake--and now you're trying to fix the mistakes.
The underlying problem is that you're replacing each keyword using a separate call to preg_replace, effectively processing the entire text over and over again. What you should do is process the text once, matching every single word and looking it up in your list of keywords; if it's on the list, replace it. I'm not set up to write/test PHP code, but you probably want to use preg_replace_callback:
$text = preg_replace_callback('/\b[A-Za-z]+\b/', "the_callback", $text);
"the_callback" is the name of a function that looks up the word and, if it's in the list, generates the appropriate link; otherwise it returns the matched word. It may sound inefficient, processing every word like this, but in fact it's a great deal more efficient than your original approach.
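A minimal sketch of that single-pass idea, using a closure in place of a named callback. The keyword list, URLs, and generated title text are hypothetical. The point it illustrates is that text produced by the callback is not scanned again, so a keyword inside a generated title attribute is left alone.

```php
<?php
// Hypothetical keyword list mapping each word to a link target.
$keywords = ['awesome' => '/wiki/awesome', 'stuff' => '/wiki/stuff'];

$text = 'hello awesome stuff, so awesome is cool';

// One pass over the text: every word is looked up exactly once.
$linked = preg_replace_callback(
    '/\b[A-Za-z]+\b/',
    function (array $m) use ($keywords) {
        $key = strtolower($m[0]);
        if (!isset($keywords[$key])) {
            return $m[0]; // not a keyword: leave the word unchanged
        }
        // The generated title contains the keyword, but it is never re-matched.
        return '<a href="' . $keywords[$key] . '" title="' . $m[0] . ' is cool">' . $m[0] . '</a>';
    },
    $text
);

echo $linked, "\n";
```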
Sure, using a parsing library is the industrial-strength solution, but we all have times where we just want to write something in 10 seconds and be done. Next time you want to process the meaty text of a page, ignoring tags, try just running your input through strip_tags first. This way you will get only the plain, visible text and your regex powers will again reign supreme.
This is so horrible I hesitate to post it, but if you want a quick hack, reverse the problem--instead of finding the stuff that isn't X, find the stuff that IS, change it, do the thing and change it back.
This is assuming you're trying to change awesome (to "wonderful"). If you're doing something else, adjust accordingly.
$string = 'Awesome is the man who <b>awesome</b> does and awesome is.';
$string = preg_replace('#(title\s*=\s*\"[^"]*?)awesome#is', "$1PIGDOG", $string);
$string = preg_replace('#awesome#is', 'wonderful', $string);
$string = preg_replace('#pigdog#is', 'awesome', $string);
Don't vote me down. I know it's a hack.
