Regular expression match formatted text in PHP

I have formatted text like this:
Record
name=aaa
age=16
info=blabla bla
Record
name=bbb
age=15
info=foo bar foo bar
I would like to convert it into arrays using a regular expression in PHP. So far I've tried:
preg_match_all("/Record.*\n(?m:^(.+)=(.+)$)+/",$text,$matches);
But it only catches "Record name=aaa" and "Record name=bbb".
I'm wondering why the + does not work in this case. How should I form my pattern here?

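You have not matched the newlines after the first one: once a key=value line has matched, the next repetition of the group starts in front of a \n, where ^ cannot match. Move the \n inside the (?m:...) group. Note that a repeated capture group keeps only its last iteration, so the individual keys and values still have to be extracted in a second step.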
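A minimal sketch of that fix, assuming Unix line endings (\n) and the sample data from the question. The second preg_match_all pass and the $blocks, $kv and $records names are illustrative additions, not part of the original answer:

```php
<?php
$text = "Record\nname=aaa\nage=16\ninfo=blabla bla\n"
      . "Record\nname=bbb\nage=15\ninfo=foo bar foo bar\n";

// With \n inside the repeated group, each repetition consumes its own line break,
// so $blocks[0] holds one complete string per record.
preg_match_all('/Record.*(?m:\n^(.+)=(.+)$)+/', $text, $blocks);

$records = [];
foreach ($blocks[0] as $block) {
    // Second pass: pull every key=value line out of the record.
    preg_match_all('/^(.+?)=(.*)$/m', $block, $kv);
    $records[] = array_combine($kv[1], $kv[2]);
}

print_r($records);
```

The two-pass shape is a design choice forced by the regex engine: the groups inside a quantified group are overwritten on each repetition, so one pass can find the records but not every field.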

This will do it.
$data = array_values(array_map(
    function ($e) {
        preg_match_all('/(.*?)=([^\r\n]*)/', $e, $m);
        return array_combine($m[1], $m[2]);
    },
    array_filter(explode("Record", $text))
));
First it splits the data on the delimiter Record using explode, and array_filter drops the empty chunk in front of the first record. Then, for each chunk, it extracts the key-value pairs using preg_match_all and builds an associative array with array_combine. The outer array_values reindexes the result from 0.

Related

preg_replace with arrays of different sizes

I know that the answer is probably simple and I am just not seeing it.
This code gets me an array of "tags" in the layout (eg [[tag]]), and I have an array of replacements that comes in with the request ($this->data). My first inclination was to use preg_match_all to get an array of all the tags and just pass in both arrays:
if(isset($this->layout))
{
    ob_start();
    include(VIEWS.'layouts/'.$this->layout.'.phtml');
    $this->layout = ob_get_contents();
    preg_match_all('/\[\[.*\]\]/', $this->layout, $tags);
    print preg_replace($tags, $this->data, $this->layout);
}
But the arrays are not the same length (most of the time). The layout might reuse some tags, and the passed in data variables might not include some tags in the layout.
I feel like there must be a more efficient way to do this than doing a foreach and building the output in iterations.
This project is way too small to implement a full templating engine like Smarty or Twig... it is actually just a few pages and a few replacements. My client just wants a simple way to add things like page titles and email recipients, etc.
Any advice would be appreciated. Like I said, I am positive that it is something simple that I am overlooking.
EDIT:
$this->data is an array of replacement text that look like:
tag => replacement_text
EDIT 2:
If I use preg_match_all('/\[\[(.*)\]\]/', $this->layout, $tags); the array includes JUST the tags (no [[]]). I just need a way to match them up with the array of replacement strings in $this->data.

You can simply use str_replace for this job, creating an array of search strings and replacements from $this->data:
$search = array_map(function ($s) { return "[[$s]]"; }, array_keys($this->data));
$replacements = array_values($this->data);
echo str_replace($search, $replacements, $this->layout);
Demo on 3v4l.org

You don't need to get $tags by matching $this->layout; the information is all in the keys of $this->data. You just need to add [[...]] around the keys and turn each one into a pattern:
$tags = array_map(function($key) {
    return '/\[\[' . preg_quote($key, '/') . '\]\]/';
}, array_keys($this->data));
echo preg_replace($tags, array_values($this->data), $this->layout);
Another solution is to use preg_replace_callback() to look the tag up in $this->data:
echo preg_replace_callback('/\[\[(.*?)\]\]/', function($matches) {
    return $this->data[$matches[1]] ?? $matches[0];
}, $this->layout);
Note that I changed the regexp to use a non-greedy quantifier; your regexp will match from the beginning of the first tag to the end of the last tag.
If the tag isn't found in $this->data, ?? $matches[0] leaves it unchanged.
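A self-contained sketch of the callback approach, so it can be run outside the class. The renderLayout function, the sample layout and the sample data are hypothetical; the arrow function assumes PHP 7.4 or later:

```php
<?php
// Hypothetical stand-alone helper; in the question this logic sits in a class using $this->data.
function renderLayout(string $layout, array $data): string
{
    return preg_replace_callback(
        '/\[\[(.*?)\]\]/',
        // Look the tag name up in $data; tags without a replacement are returned unchanged.
        fn(array $m): string => $data[$m[1]] ?? $m[0],
        $layout
    );
}

$layout = '<title>[[title]]</title><p>Mail [[recipient]] about [[title]]. [[unknown]]</p>';
$html = renderLayout($layout, [
    'title'     => 'Home',
    'recipient' => 'bob@example.com',
    'unused'    => 'never appears in the layout',
]);

echo $html;
```

Reused tags, tags missing from the data, and data entries missing from the layout are all handled, which is the mismatch in array sizes the question ran into.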

You could make use of preg_replace_callback:
$result = preg_replace_callback('/\[\[(?<tag>.*?)]]/', function ($matches) {
    return $this->data[$matches['tag']] ?? $matches[0];
}, $this->layout);
Demo: https://3v4l.org/I9Vvh
Shorter PHP 7.4 version:
$result = preg_replace_callback('/\[\[(?<tag>.*?)]]/', fn($matches) =>
    $this->data[$matches['tag']] ?? $matches[0], $this->layout);
Edited with the ?? $matches[0] (courtesy of @Barmar). This is basically the same answer; I'm just leaving it in case the PHP 7.4 syntax is useful.

How to improve my algorithm? (searching and replacing words in formatted text)

I have a source of html, and an array of keywords. I'm trying to find all words which begin with any keyword in the keywords array and wrap it in a link tag.
For example, the keyword array has two values: [ABC, DEF]. It should match ABCDEF, DEFAD, etc. and wrap each word with hyperlink markup.
Here is the code I've got so far:
$_keys = array('ABC', 'DEF');
$text = 'Some ABCDD <strong>HTML</strong> text. DEF';
function search_and_replace(($key,$text)
{
$words = preg_split('/\s+/', trim($text)); //to seprate words in $_text
for($words as $word)
{
if(strpos($word,$key) !== false)
{
if($word.startswith($key))
{
str_replace($word,''.$word.',$_text);
}
}
}
return text;
}
for($_keys as $_key)
{
$text = search_and_replace($key,$text);
}
My questions:
Would this algorithm work?
How would I modify this to work with UTF-8?
How can I recognize hyperlinks in the html and ignore them (don't want to put a hyperlink in a hyperlink).
Is this algorithm safe?

Is the algorithm "true"? (I'm reading that as "accurate".)
No, it is not. str_replace returns
a string or an array with all occurrences of search in subject replaced with the given replace value.
So the word you matched is not the only thing being replaced: every occurrence of that string in the text is. Using your example, if you ran this function against your data set, you'd end up wrapping each occurrence of ABC in multiple tags (just run your code to see it, but you'll have to fix the syntax errors first).
Work with UTF-8 alphabets?
Not sure, but as written, I don't think so. See Preg_Replace and UTF8. The PREG functions can handle multibyte text when the pattern uses the u modifier.
I want to ignore all words inside each a tag for the search operation
That's awfully hard. You'll have to avoid <a ...>word</a>, which starts to make a big mess fast. Regex matching HTML reliably is a fool's errand.
Probably the best approach would be to parse the webpage as XML or HTML. Have you considered doing this in JavaScript? Why do it on the server side? The advantage of JS is twofold: one, it runs on the client side, so you're offloading and distributing the work; and two, since the DOM is already parsed, you can find all text nodes and replace them fairly easily. In fact, I was helping a friend working on a Chrome extension to do almost exactly what you're describing; you could modify it to do what you're looking for easily.
A better alternative method?
Definitely. What you're showing here is one of the worst methods of doing this. I'd push for you to use preg_replace (another answer has a good start for the regex you'd want, matching word breaks rather than whitespace), but since you want to avoid changing some elements, I'm thinking now that doing this in JS client-side is far better.
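For readers who want to stay on the server, the "parse it as HTML and walk the text nodes" idea can also be sketched in PHP with the DOM extension instead of JavaScript. This is only a sketch under stated assumptions: linkKeywords, its $href parameter and the sample input are hypothetical, and the sample is ASCII-only because DOMDocument::loadHTML typically needs an explicit encoding hint before it treats input as UTF-8.

```php
<?php
// Hypothetical helper: wraps every word that starts with a keyword in a link,
// skipping text that is already inside an <a> element.
function linkKeywords(string $html, array $keys, string $href = '#'): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);   // keep parser warnings quiet
    // The <div> wrapper gives the fragment a single root element.
    $doc->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $alternatives = implode('|', array_map(fn($k) => preg_quote($k, '/'), $keys));
    $pattern = '/\b((?:' . $alternatives . ')\w*)/u';

    $xpath = new DOMXPath($doc);
    // Copy the node list first, because the loop modifies the tree.
    $textNodes = iterator_to_array($xpath->query('//text()[not(ancestor::a)]'));

    foreach ($textNodes as $node) {
        // Odd indexes of $parts are the matched words, even indexes the text between them.
        $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue;   // no keyword in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $a = $doc->createElement('a');
                $a->setAttribute('href', $href);
                $a->appendChild($doc->createTextNode($part));
                $fragment->appendChild($a);
            } elseif ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialize the children of the wrapper, leaving the wrapper itself out.
    $out = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$result = linkKeywords(
    'Some ABCDD <strong>HTML</strong> text. <a href="/x">DEF</a> DEFAD',
    ['ABC', 'DEF']
);
echo $result;
```

Because the matching runs per text node, existing links are never touched and no link can end up nested inside another.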

In order to maximize your performance you should look into the Trie (also called a retrieval tree) data structure (http://en.wikipedia.org/wiki/Trie). If I were you I would first build a Trie containing the words in the HTML page. At this step you could also check whether the word is inside an <a> tag and, if it is, not add it to the Trie. You can easily do that with a regex match.
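A small sketch of the trie idea, with one deliberate difference from the description above: here the keywords, not the words of the page, go into the trie, and each word is then tested against it. buildTrie and startsWithKeyword are hypothetical names, keywords are assumed to be non-empty, and the comparison is byte-wise, which also works for UTF-8 prefixes. The ??= operator assumes PHP 7.4 or later.

```php
<?php
// Build a trie as nested arrays, one level per byte; the 'end' key marks a complete keyword.
function buildTrie(array $keys): array
{
    $trie = [];
    foreach ($keys as $key) {
        $node = &$trie;
        foreach (str_split($key) as $byte) {
            $node[$byte] ??= [];
            $node = &$node[$byte];
        }
        $node['end'] = true;   // cannot collide with a single-byte key
        unset($node);          // break the reference before the next keyword
    }
    return $trie;
}

// True when some keyword in the trie is a prefix of $word.
function startsWithKeyword(array $trie, string $word): bool
{
    $node = $trie;
    foreach (str_split($word) as $byte) {
        if (!isset($node[$byte])) {
            return false;      // no keyword continues with this byte
        }
        $node = $node[$byte];
        if (isset($node['end'])) {
            return true;       // a complete keyword has been consumed
        }
    }
    return false;
}

$trie = buildTrie(['ABC', 'DEF', 'ABČ']);
var_dump(startsWithKeyword($trie, 'ABCDD'));   // keyword ABC is a prefix
var_dump(startsWithKeyword($trie, 'HTML'));    // no keyword is a prefix
```

Each lookup costs at most one step per byte of the word, independent of how many keywords there are, which is where the performance gain over looping through all keywords would come from.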

How about regex?
preg_match_all('/\b' . preg_quote($word, '/') . '\w*\b/', $text, $matches);
foreach ($matches[0] as $each) {
    print($each);
}
(Sorry, my PHP is a bit rusty)

For a simple task like this PHP regular expressions will serve well. The idea is to find all hyperlinks (and optionally some other HTML elements) and replace them with unique tokens. After that we are free to search for and replace the desired keywords, and in the end we restore the removed HTML elements.
$_keys = array( 'ABC', 'DEF', 'ABČ' );
$text =
'Some <a href="#" >ABC</a> ABCDđD <strong>ABCDEF</strong> text. DEF
<p class="test">
PHP is <em>the</em> most ABCwidely used
langČuage ABC for ABČogr ammDEFing on the webABC DEFABC.
</p>';
// array for holding html items replaced with tokens
$tokens = array();
$id = 0;
// we will replace all links and strong elements (a|strong)
$text = preg_replace_callback( '/<(a|strong)[^>]*>.*?<\/\1\s*>/s',
    function( $matches ) use ( &$tokens, &$id )
    {
        // store matches into the tokens array
        $tokens[ '#'.++$id.'#' ] = $matches[0];
        // replace matches with the unique id
        return '#'.$id.'#';
    },
    $text
);
echo htmlentities( $text );
/* - outputs (as rendered in a browser): Some #1# ABCDđD #2# text. DEF <p class="test"> PHP is <em>the</em> most ABCwidely used langČuage ABC for ABČogr ammDEFing on the webABC DEFABC. </p>
- note the #1# and #2# tokens
*/
// wrap the words that start with items in the $_keys array ( with the u (PCRE_UTF8) modifier );
// href="#" is a placeholder link target
$text = preg_replace( '/\b('. implode( '|', $_keys ) . ')\w*\b/u', '<a href="#">$0</a>', $text );
// replace the tokens with the stored values
$text = str_replace( array_keys($tokens), array_values($tokens), $text );
echo $text;

PHP: preg_replace function

$text = "
<tag>
<html>
HTML
</html>
</tag>
";
I want to replace all the text present inside the tags with htmlspecialchars(). I tried this:
$regex = '/<tag>(.*?)<\/tag>/s';
$code = preg_replace($regex,htmlspecialchars($regex),$text);
But it doesn't work.
I am getting the output as htmlspecialchars of the regex pattern. I want to replace each match with htmlspecialchars of the data that matched the pattern.
What should I do?

You're replacing the match with the pattern itself: htmlspecialchars($regex) is evaluated once, before preg_replace ever runs. You're not using back-references and the e flag (deprecated since PHP 5.5 and removed in PHP 7), but in this case preg_replace_callback is the way to go:
$code = preg_replace_callback($regex, 'replaceCallback', $text);
This passes the matched groups to the callback and uses its return value as the replacement. The groups arrive as an array, so htmlspecialchars cannot be passed as the callback directly; wrap it in a function instead:
function replaceCallback($matches)
{
    if (is_array($matches))
    {
        $matches = implode('', array_slice($matches, 1)); // first element is the full match
    }
    return htmlspecialchars($matches);
}
Or, if your PHP version permits it (anonymous functions need PHP 5.3 or later):
$code = preg_replace_callback($regex, function($matches)
{
    $return = '';
    for ($i = 1, $j = count($matches); $i < $j; $i++)
    { // starting at 1 skips the full match and allows for any number of groups
        $return .= htmlspecialchars($matches[$i]);
    }
    return $return;
}, $text);
Try either of the above until you find something that works. Incidentally, if all you want to remove is <tag> and </tag>, why not go for the much faster:
echo htmlspecialchars(str_replace(array('<tag>','</tag>'), '', $text));
That's just keeping it simple, and it'll almost certainly be faster, too.

If you want to isolate the actual contents as defined by your pattern, you could use preg_match($regex,$text,$hits);. This gives you an array of hits: $hits[0] contains the whole matched string, and the bits that were between the parentheses in the pattern start at $hits[1]. You can then manipulate these matches, possibly using htmlspecialchars, and combine them again into $code.
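Applied to the sample from the question, a callback that escapes only the captured content could look like the sketch below. The $text and $regex values are taken from the question; the closure is one possible wrapper, not the only one:

```php
<?php
$text = "<tag>\n<html>\nHTML\n</html>\n</tag>";
$regex = '/<tag>(.*?)<\/tag>/s';

// The callback receives the match array; $m[1] is the text between <tag> and </tag>.
$code = preg_replace_callback($regex, function (array $m): string {
    return htmlspecialchars($m[1]);
}, $text);

echo $code;
```

Here the <tag> wrapper itself is dropped and only its contents are escaped; returning '<tag>' . htmlspecialchars($m[1]) . '</tag>' instead would keep the wrapper.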

preg_match acting very strange

I am using preg_match() to extract pieces of text from a variable, and let's say the variable looks like this:
[htmlcode]This is supposed to be displayed[/htmlcode]
middle text
[htmlcode]This is also supposed to be displayed[/htmlcode]
I want to extract the contents of the [htmlcode] blocks and put them into an array. I am doing this by using preg_match().
preg_match('/\[htmlcode\]([^\"]*)\[\/htmlcode\]/ms', $text, $matches);
foreach($matches as $value){
return $value . "<br />";
}
The above code outputs
[htmlcode]This is supposed to be displayed[/htmlcode]middle text[htmlcode]This is also supposed to be displayed[/htmlcode]
instead of
[htmlcode]This is supposed to be displayed[/htmlcode]
[htmlcode]This is also supposed to be displayed[/htmlcode]
and I have officially run out of ideas.

As explained already, the * quantifier is greedy. Another thing is to use the preg_match_all() function, which returns a multi-dimensional array of matched content.
preg_match_all('#\[htmlcode\]([^\"]*?)\[/htmlcode\]#ms', $text, $matches);
foreach( $matches[1] as $value ) {
    echo $value . "<br />";
}
And you'll get this: http://codepad.viper-7.com/z2GuSd

The * quantifier is greedy, i.e. it will eat everything up to the last [/htmlcode]. Try replacing * with the non-greedy *?.

* is by default greedy, ([^\"]*?) (notice the added ?) should make it lazy.
What do lazy and greedy mean in the context of regular expressions?
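A short side-by-side sketch of the two quantifiers on the text from the question. The $greedy and $lazy variable names are illustrative:

```php
<?php
$text = "[htmlcode]This is supposed to be displayed[/htmlcode]\n"
      . "middle text\n"
      . "[htmlcode]This is also supposed to be displayed[/htmlcode]";

// Greedy: [^"]* runs to the end of the text and backtracks to the LAST [/htmlcode].
preg_match_all('/\[htmlcode\]([^"]*)\[\/htmlcode\]/', $text, $greedy);

// Lazy: [^"]*? stops at the FIRST [/htmlcode] it can reach.
preg_match_all('/\[htmlcode\]([^"]*?)\[\/htmlcode\]/', $text, $lazy);

echo count($greedy[1]), " greedy match, ", count($lazy[1]), " lazy matches\n";
print_r($lazy[1]);
```

The greedy pattern yields a single match spanning both blocks and the middle text, while the lazy one yields one match per block.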

Look at this piece of code:
preg_match('/\[htmlcode\]([^\"]*)\[\/htmlcode\]/ms', $text, $matches);
foreach($matches as $value){
return $value . "<br />";
}
Now, even if your pattern works fine, you should know:
A return statement breaks out of all loops and exits the function.
The first element in $matches is the whole match. In your case, because the quantifier is greedy, that is essentially all of $text.
So what you did was return the first big string and exit the function.
I suggest you check $matches[1], which holds the contents of the capture group, for the desired result; the pattern has only one group, so there is no $matches[2].

preg_match just returns 'Array'?

When I use this code to fetch data from http://www.ea.com/uk/football/profile/Calfreezy/360, it just echoes back the word 'Array'.
Here is the code I'm using currently:
<?php
$content = file_get_contents("http://www.ea.com/uk/football/profile/Calfreezy/360");
preg_match('#<div class="stat">Titles Won<span>([0-9\.]*)<span class="sprite13 goalImage cup"></span></span>#', $content, $titleswon);
echo 'Titles Won: '.$titleswon.'';
?>
And this is the HTML I am trying to pull from the url:
<div class="stat">
Titles Won <span>0<span class="sprite13 goalImage cup"></span></span>
</div>
This just returns Titles Won: Array
When working, it should return Titles Won: 0
What am I doing wrong? Thanks.

You are printing the entire matches array instead of selecting the index(es) that you want from it and printing them.
See the documentation
If matches is provided, then it is filled with the results of search. $matches[0] will contain the text that matched the full pattern, $matches[1] will have the text that matched the first captured parenthesized subpattern, and so on.

preg_match() produces an array of matches. If you print out an array in string context, you get Array as the text. e.g.
$arr = array('foo' => 'bar');
echo $arr; // prints "Array"
echo $arr['foo']; // prints "bar"

preg_match fills an array with the matches, as can be seen from the documentation.
If you want to see all the contents of the array, use
var_dump( $titleswon );
If you just need the matched value, you have to address that specific element.

The best approach would be:
if (preg_match('#<div class="stat">Titles Won<span>([0-9\.]*)<span class="sprite13 goalImage cup"></span></span>#', $content, $titleswon)) {
    echo 'Titles Won: '.$titleswon[1].'';
}

The third parameter to preg_match is passed by reference and will contain an array with the match for every capture group. You effectively have two "groups": the whole match, and ([0-9\.]*), which is the second element. So I expect you need this:
echo 'Titles Won: '.$titleswon[1].''; // note the array is indexed from 0

If you have a look at the preg_match documentation:
http://php.net/manual/en/function.preg-match.php
you can see that the $matches argument is actually an array, with $matches[0] containing the whole match and the following elements containing the captured subpatterns.
If you do var_dump($titleswon) or print_r($titleswon) you will see the whole content; then you can access the desired info, which in your case is $titleswon[1].
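One more thing worth checking is the pattern itself: the HTML quoted in the question has a line break after the opening div and a space before the span, which the original pattern does not allow for. A hedged sketch, assuming the snippet in the question reflects the markup of the page; the sample is hard-coded here instead of fetched, and the \s* additions and the $titles variable are assumptions:

```php
<?php
// Sample markup copied from the question instead of fetching the live page.
$content = "<div class=\"stat\">\n"
         . "Titles Won <span>0<span class=\"sprite13 goalImage cup\"></span></span>\n"
         . "</div>";

$titles = null;
// \s* tolerates the whitespace and line breaks between the pieces of markup.
if (preg_match('#<div class="stat">\s*Titles Won\s*<span>([0-9.]+)<span#', $content, $titleswon)) {
    $titles = $titleswon[1];   // index 1 is the first capture group
    echo 'Titles Won: ' . $titles;
}
```

Testing the return value of preg_match before reading $titleswon[1] avoids an undefined-index notice when the page layout changes and nothing matches.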
