Find a pattern within two or more sets of text

I have lots of data that I need to search through for certain patterns.
Problem is when looking for said patterns I have no reference to what I'm looking for.
In other words, I have two paragraphs on similar topics. I need to compare them and find patterns: phrases that appear in both paragraphs, and how many times each appears.
I can't seem to find a solution, because preg_match and similar functions require you to supply the thing you're looking for.
Example paragraphs
Paragraph 1:
Bee Pollen is made by honeybees, and is the food of the young bee. It
is considered one of nature's most completely nourishing foods as it
contains nearly all nutrients required by humans. Bee-gathered pollens
are rich in proteins (approximately 40% protein), free amino acids,
vitamins, including B-complex, and folic acid.
Paragraph 2:
Bee Pollen is made by honeybees. It is required for the fertilization
of the plant. The tiny particles consist of 50/1,000-millimeter
corpuscles, formed at the free end of the stamen in the heart of the
blossom, nature's most completely nourishing foods. Every variety of
flower in the universe puts forth a dusting of pollen. Many orchard
fruits and agricultural food crops do, too.
So from those examples these patterns:
Bee Pollen is made by honeybees
and:
nature's most completely nourishing foods
Both phrases are found in both paragraphs.

This is potentially a complex question depending on whether you're looking for similar phrases or phrases that match word for word.
Finding exact word-for-word matches is quite simple: all you need to do is split on common breaks such as punctuation marks (e.g. .,;:) and perhaps on conjunctions as well (e.g. and, or). The problem comes with, for example, adjectives: two phrases might be exactly the same apart from one word, like so:
The world is spinning around its axis at a tremendous speed.
The world is spinning around its axis at a magnificent speed.
These won't match because tremendous and magnificent are used in place of one another. You could potentially work around this, but that would be a more complex question.
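One possible way to approach the near-match case is to score phrase pairs with PHP's built-in similar_text() and accept pairs above a threshold. This is only a rough sketch: the helper name and the 80% threshold are assumptions made for illustration, not part of the answer.

```php
<?php
// Hypothetical helper: true when two phrases are at least $threshold percent similar.
function phrasesAreSimilar(string $a, string $b, float $threshold = 80.0): bool
{
    // similar_text() writes the similarity percentage into its third argument.
    similar_text(strtolower($a), strtolower($b), $percent);
    return $percent >= $threshold;
}

$s1 = 'The world is spinning around its axis at a tremendous speed.';
$s2 = 'The world is spinning around its axis at a magnificent speed.';
var_dump(phrasesAreSimilar($s1, $s2)); // the sentences differ in a single word
```

A character-based score like this is crude; comparing word lists instead would be a reasonable alternative if single-word substitutions are the main concern.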
Answer
If we stick to the simple side of things we can achieve phrase matching with just a few lines of code (4 in this example; not including the formatting for comments/readability).
$wordSplits = 'and or on of as'; //List of words to split on
preg_match_all('/(?<m1>.*?)([.,;:\-]| '.str_replace(' ', ' | ', trim($wordSplits)).' )/i', $para1, $matches1);
preg_match_all('/(?<m2>.*?)([.,;:\-]| '.str_replace(' ', ' | ', trim($wordSplits)).' )/i', $para2, $matches2);
$commonPhrases = array_filter( //Removes blank $key=>$value pairs
array_intersect( //Finds matching patterns
array_map(function($item){
return(strtolower(trim($item))); //Cleans array for $para1 values - removes leading and following spaces
}, $matches1['m1']),
array_map(function($item){
return(strtolower(trim($item))); //Cleans array for $para2 values - removes leading and following spaces
}, $matches2['m2'])
)
);
var_dump($commonPhrases);
/**
OUTPUT:
array(2) {
[0]=>
string(31) "bee pollen is made by honeybees"
[5]=>
string(41) "nature's most completely nourishing foods"
}
*/
The above code splits both paragraphs on punctuation (defined in [...] in the preg_match_all pattern) and on the words in the word list (a word only counts as a break when it has a space before and after it), then keeps the phrases that occur in both.
Wordlist
You can change the word list to include any breaks you like, editing the list until you get the phrases you are after, examples:
$wordSplits = 'and or';
$wordSplits = 'and but if or';
$wordSplits = 'a an as and by but because if in is it of off on or';
Punctuation
You can add any punctuation marks you like into the list (between [ and ]), however remember that some characters do have special meanings and may need to be escaped (or placed appropriately): - and ^ should become \- and \^ or be placed where their special meaning doesn't come into play.
You may consider changing:
([.,;:\-]|
To:
([.,;:\-] | //Adding a space before the pipe
So that you only split punctuation marks which are followed by a space. For example: this would mean that items like 50,000 won't be split.
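To illustrate the effect of requiring a space after the punctuation mark, here is a small sketch using preg_split() on a made-up sentence (the sentence is an assumption for demonstration only):

```php
<?php
$text = 'The lot sold for 50,000 dollars, which surprised everyone; nobody expected it.';

// Split on a punctuation mark only when a space follows it.
$parts = preg_split('/[.,;:\-] /', $text);
print_r($parts);
```

The comma inside 50,000 is followed by a digit, so the number stays in one piece, and the final full stop is kept because nothing follows it.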
Spaces and breaks
You may also consider changing the spaces to \s so that tabs and newlines etc are included and not just spaces. Like so:
'/(?<m1>.*?)([.,;:\-]|\s'.str_replace(' ', '\s|\s', trim($wordSplits)).'\s)/i'
This would also apply to:
([.,;:\-]\s|
If you decide to go down that route.
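The question also asks how many times each phrase was said. One way this could be layered on top of the phrase list is a plain substr_count() per paragraph. The sketch below is an assumption about how that might look; countPhraseOccurrences() is a hypothetical helper, and the paragraphs are shortened.

```php
<?php
// Hypothetical helper: count how often each phrase occurs in every text (case-insensitive).
function countPhraseOccurrences(array $phrases, array $texts): array
{
    $counts = array();
    foreach ($phrases as $phrase) {
        foreach ($texts as $label => $text) {
            $counts[$phrase][$label] = substr_count(strtolower($text), strtolower($phrase));
        }
    }
    return $counts;
}

$para1 = "Bee Pollen is made by honeybees, and is the food of the young bee.";
$para2 = "Bee Pollen is made by honeybees. It is required for the fertilization of the plant.";
$counts = countPhraseOccurrences(
    array('bee pollen is made by honeybees'),
    array('para1' => $para1, 'para2' => $para2)
);
print_r($counts);
```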

I've been working on this code, don't know if it suits your needs... Feel free to expand it!
$p1 = "Bee Pollen is made by honeybees, and is the food of the young bee. It is considered one of nature's most completely nourishing foods as it contains nearly all nutrients required by humans. Bee-gathered pollens are rich in proteins (approximately 40% protein), free amino acids, vitamins, including B-complex, and folic acid.";
$p2 = "Bee Pollen is made by honeybees. It is required for the fertilization of the plant. The tiny particles consist of 50/1,000-millimeter corpuscles, formed at the free end of the stamen in the heart of the blossom, nature's most completely nourishing foods. Every variety of flower in the universe puts forth a dusting of pollen. Many orchard fruits and agricultural food crops do, too.";
// Strip strings of periods etc.
$p1 = strtolower(str_replace(array('.', ',', '(', ')'), '', $p1));
$p2 = strtolower(str_replace(array('.', ',', '(', ')'), '', $p2));
// Extract words from first paragraph
$w1 = explode(" ", $p1);
// Build search string
$search = '';
$old_search = '';
$num_occurred = 0;
$add = FALSE; // initialised up front so the first miss does not read undefined variables
$found = array();
foreach ($w1 as $word) {
//echo 'Word: ' . $word . "<br />";
$search = trim($search . ' ' . $word);
//echo '. . Search string: '. $search . "<br /><br />";
$count = substr_count($p2, $search);
if ($count) {
$old_search = $search;
$num_occurred = $count;
//echo " . . . found!" . "<br /><br /><br />";
$add = TRUE;
} else {
//echo " . . . not found! Generating new search string: " . $word . '<br />';
if ($add) {
$found[] = array('pattern' => $old_search, 'occurrences' => $num_occurred);
$add = FALSE;
}
$old_search = '';
$search = $word;
}
}
// Record a pattern that was still matching when the words ran out
if ($add) {
$found[] = array('pattern' => $old_search, 'occurrences' => $num_occurred);
}
print_r($found);
The above code finds occurrences of patterns from the first string in the second one.
I'm sure it can be written better, but since it's past midnight (local time), I'm not as "fresh" as I'd like to be...
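As one possible alternative, shared phrases can be found by comparing the two texts word by word and tracking how long each common run of words is (the classic longest-common-substring table, applied to words instead of characters). This is a sketch under assumptions: commonWordRuns() and the minimum run length of 3 words are made up for illustration, and the paragraphs are shortened.

```php
<?php
// Hypothetical helper: word sequences of at least $minWords words shared by both texts.
function commonWordRuns(string $a, string $b, int $minWords = 3): array
{
    // Lowercase and split into words, dropping punctuation between them.
    $tokenize = function (string $s): array {
        return preg_split("/[^a-z0-9'%-]+/", strtolower($s), -1, PREG_SPLIT_NO_EMPTY);
    };
    $w1 = $tokenize($a);
    $w2 = $tokenize($b);

    $runs = array();
    $prev = array(); // run lengths for the previous word of $w1, indexed by position in $w2
    for ($i = 0; $i < count($w1); $i++) {
        $curr = array();
        for ($j = 0; $j < count($w2); $j++) {
            if ($w1[$i] !== $w2[$j]) {
                continue;
            }
            // Extend the run that ended on the previous word of both texts.
            $len = ($prev[$j - 1] ?? 0) + 1;
            $curr[$j] = $len;
            // Record the run only when the next word pair does not extend it.
            $extends = isset($w1[$i + 1], $w2[$j + 1]) && $w1[$i + 1] === $w2[$j + 1];
            if (!$extends && $len >= $minWords) {
                $runs[] = implode(' ', array_slice($w1, $i - $len + 1, $len));
            }
        }
        $prev = $curr;
    }
    return array_values(array_unique($runs));
}

$a = "Bee Pollen is made by honeybees, and is the food of the young bee.";
$b = "Bee Pollen is made by honeybees. It is required for the fertilization of the plant.";
print_r(commonWordRuns($a, $b));
```

Because punctuation is discarded, a run may span a sentence boundary; splitting into sentences first would avoid that if it matters.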

Related

preg_replace_callback highlight pattern not match in result

I have this code:
$string = 'The quick brown fox jumped over the lazy dog and lived to tell about it to his crazy moped.';
$text = explode("#", str_replace(" ", " #", $string)); //ugly trick to preserve space when exploding, but it works (faster than preg_split)
foreach ($text as $value) {
echo preg_replace_callback("/(.*p.*e.*d.*|.*a.*y.*)/", function ($matches) {
return " <strong>".$matches[0]."</strong> ";
}, $value);
}
The point of it is to be able to enter a sequence of characters (in the code above it's a fixed pattern), and it finds and highlights those characters in the matched word. The code I have now highlights the entire word. I'm looking for the most efficient way of highlighting the characters.
The result of the current code:
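The quick brown fox <strong>jumped</strong> over the <strong>lazy</strong> dog and lived to tell about it to his <strong>crazy</strong> <strong>moped.</strong>
What I would like to have:
The quick brown fox jum<strong>p</strong><strong>e</strong><strong>d</strong> over the l<strong>a</strong>z<strong>y</strong> dog and lived to tell about it to his cr<strong>a</strong>z<strong>y</strong> mo<strong>p</strong><strong>e</strong><strong>d</strong>.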
Did I take the wrong approach? It would be awesome if someone could point me in the right way, I've been searching for hours and didn't find what I was looking for.
EDIT 2:
Divaka's been a great help. Almost there.. I apologize if I haven't been clear enough on what my goal is. I will try to explain further.
- Part A -
One of the things I will be using this code for is a phone book. A simple example:
When following characters are entered:
Jan
I need it to match following examples:
Jan Verhoeven
Arjan Peters
Raj Naren
Jered Von Tran
The problem is that I will be iterating over the entire phone book, one person record at a time. Each person also has email addresses, a postal address, maybe a website, an extra note, etc. This means that the text I'm actually searching can contain anything: letters, numbers, special characters (&#()%_- etc.), newlines, and most importantly spaces. So an entire record (CSV) might contain the following info:
Name;Address;Email address;Website;Note
Jan Verhoeven;Veldstraat 2a, 3209 Herkstad;jan@werk.be;www.janophetwerk.be,jan@telemet.be;Jan die ik ontmoet heb op de bouwbeurs.\n Zelfstandige vertegenwoordiger van bouwmaterialen.
Raj Naren;Kerklaan 334, 5873 Biep;raj@werk.be;;Rechtstreekse contactpersoon bij Werk.be (#654 intern)
The \n is meant to be an actual newline. So if I search for @werk.be, I'd like to see both these records as a result.
- Part B -
Something else I want to use this for is searching song-texts. When I'm looking for a song and I can only remember it had to do something with ducks or docks and a circle, I would enter dckcircle and get the following result:
... and the ducks were all dancing in a great big circle, around the great big bonfire ...
To be able to fine-tune the searching I'd like to be able to limit the number of spaces (or any other character), because I would imagine it finding a simple pattern like eve in every song while I'm only looking for a song that has the exact word eve in it.
- Conclusion -
If I summarize this in pseudo-regex, for a search pattern abc with a max of 3 spaces in-between it would be something like this: (I might be totally off here)
(a)(any character, max 3 spaces)(b)(any character, max 3 spaces)(c)
Or more generic:
(a)({any character}{these characters with a limit of 3})(b)({any character}{these characters with a limit of 3})(c)
This can even be extended to this fairly easily I'm guessing:
(a)({any character}{these characters with a limit of 3}{not these characters})(b)({any character}{these characters with a limit of 3}{not these characters})(c)
(I know the ´{}´ brackets are not to be used that way in a regular expression, but I don't know how else to put it without using a character that has a meaning in regular expressions.)
If anyone wonders, I know the sql like statement would be able to do 80% (I'm guessing, might even be more) of what I'm trying to do, but I'm trying to avoid using a database to make this as portable as possible.
When the correct answer has been found, I'll clean this question (and the code) up and post the resulting php-class here (maybe I'll even put it up on github if that would be useful), so anyone looking for the same will have a fully working class to work with :).
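The pseudo-regex above could be turned into a real pattern roughly as follows. This is a sketch, not a finished class: highlightSubsequence() is a hypothetical name, "these characters with a limit of 3" is interpreted as whitespace, and HTML-escaping of the output is left out for brevity.

```php
<?php
// Hypothetical helper: highlight the characters of $term where they occur in order in $text,
// allowing at most $maxSpaces whitespace characters between two consecutive matched characters.
function highlightSubsequence(string $text, string $term, int $maxSpaces = 3): string
{
    // Split the term into characters (UTF-8 aware).
    $chars = preg_split('//u', $term, -1, PREG_SPLIT_NO_EMPTY);
    if (!$chars) {
        return $text;
    }
    // Gap between two characters: non-whitespace runs separated by
    // at most $maxSpaces whitespace characters, matched lazily.
    $gap = '((?:\S*?\s){0,' . $maxSpaces . '}?\S*?)';
    $quoted = array_map(function ($c) {
        return '(' . preg_quote($c, '/') . ')';
    }, $chars);
    $pattern = '/' . implode($gap, $quoted) . '/iu';

    return preg_replace_callback($pattern, function (array $m): string {
        $out = '';
        // Odd groups hold the matched characters, even groups the text between them.
        for ($g = 1; $g < count($m); $g++) {
            $out .= ($g % 2 === 1) ? '<strong>' . $m[$g] . '</strong>' : $m[$g];
        }
        return $out;
    }, $text);
}

echo highlightSubsequence('Raj Naren', 'Jan', 1), "\n";
echo highlightSubsequence('Jered Von Tran', 'Jan', 3), "\n";
```

Lowering $maxSpaces narrows the search: with a limit of 1, Jered Von Tran no longer matches Jan because two spaces separate the J from the a.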

I've come up with this. Tell me if it's what you want!
//$string = "The quick brown fox jumped over the lazy dog and lived to tell about it to his crazy moped.";
$string = "abcdefo";
//$pattern_array1 = array('a', 'y');
//$pattern_array2 = array('p', 'e', 'd');
$pattern_array1 = array('e', 'f');
//$pattern_array2 = array('o');
$pattern_array2 = array('a', 'f');
$number_of_patterns = 2;
$regexp1 = generate_regexp($pattern_array1, 1);
$regexp2 = generate_regexp($pattern_array2, 2);
$string = preg_replace($regexp1["pattern"], $regexp1["replacement"], $string);
$string = preg_replace($regexp2["pattern"], $regexp2["replacement"], $string);
$string = transform_multimatched_chars($string);
// transforming other chars after transforming the multimatched ones
for($i = 1; $i <= $number_of_patterns; $i++) {
$string = str_replace("#{$i}", "<strong>", $string);
$string = str_replace("#/{$i}", "</strong>", $string);
}
echo $string;
function generate_regexp($pattern_array, $pattern_num) {
$regexp["pattern"] = "/";
$regexp["replacement"] = "";
$i = 0;
foreach($pattern_array as $key => $char) {
$regexp["pattern"] .= "({$char})";
$regexp["replacement"] .= "#{$pattern_num}\$". ($key + $i+1) . "#/{$pattern_num}";
if($key < count($pattern_array) - 1) {
$regexp["pattern"] .= "(?s)((?:(?!{$pattern_array[$key + 1]})(?!\s).)*)";
$regexp["replacement"] .= "\$".($key + $i+2) . "";
}
$i = $key + 1;
}
$regexp["pattern"] .= "/";
return $regexp;
}
function transform_multimatched_chars($string)
{
preg_match_all("/((#[0-9]){2,})(.*)((#\/[0-9]){2,})/", $string, $matches);
// change this for your purposes
$start_replacement = '<span style="color:red;">';
$end_replacement = '</span>';
foreach($matches[1] as $key => $match)
{
$string = str_replace($match, $start_replacement, $string);
$string = str_replace($matches[4][$key], $end_replacement, $string);
}
return $string;
}

wrap words in string with regex

This is the string
(code)
Pivot: 96.75<br />Our preference: Long positions above 96.75 with targets # 97.8 & 98.25 in extension.<br />Alternative scenario: Below 96.75 look for further downside with 96.35 & 95.9 as targets.<br />Comment the pair has broken above its resistance and should post further advance.<br />
(text)
"Pivot: 96.75Our preference: Long positions above 96.75 with targets # 97.8 & 98.25 in extension.Alternative scenario: Below 96.75 look for further downside with 96.35 & 95.9 as targets.Comment the pair has broken above its resistance and should post further advance."
the result should be
(code)
<b>Pivot</b>: 96.75<br /><b>Our preference</b>: Long positions above 96.75 with targets # 97.8 & 98.25 in extension.<br /><b>Alternative scenario</b>: Below 96.75 look for further downside with 96.35 & 95.9 as targets.<br />Comment the pair has broken above its resistance and should post further advance.<br />
(text)
Pivot: 96.75Our preference: Long positions above 96.75 with targets # 97.8 & 98.25 in extension.Alternative scenario: Below 96.75 look for further downside with 96.35 & 95.9 as targets.Comment the pair has broken above its resistance and should post further advance.
The purpose:
Wrap all the words before the : sign in <b> tags.
I've tried this regex: ((\A )|(<br />))(?P<G>[^:]*):, but it only works in a Python environment. I need this in PHP:
$pattern = '/((\A)|(<br\s\/>))(?P<G>[^:]*):/';
$description = preg_replace($pattern, '<b>$1</b>', $description);
Thanks.

This preg_replace should do the trick:
preg_replace('#(^|<br ?/>)([^:]+):#m','$1<b>$2</b>:',$input)

I should start by saying that HTML operations are better done with a proper parser such as DOMDocument. This particular problem is straightforward, so regular expressions may work without too much hocus pocus, but be warned :)
You can use look-around assertions; this frees you from having to restore the neighbouring strings during the replacement:
echo preg_replace('/(?<=^|<br \/>)[^:]+(?=:)/m', '<b>$0</b>', $str);
First, the look-behind assertion matches either the start of each line or a preceding <br />. Then, any characters except the colon are matched; the look-ahead assertion makes sure it's followed by a colon.
The /m modifier is used to make ^ match the start of each line as opposed to \A which always matches the start of the subject string.
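The difference between ^ with /m and \A can be seen in a small sketch on a two-line string (the string is a shortened, assumed version of the input with a real newline in place of the break tag):

```php
<?php
$text = "Pivot: 96.75\nOur preference: Long positions";

// With /m, ^ matches at the start of every line.
preg_match_all('/^[^:]+(?=:)/m', $text, $perLine);
// \A matches only at the very start of the subject, even with /m.
preg_match_all('/\A[^:]+(?=:)/m', $text, $startOnly);

print_r($perLine[0]);   // both labels
print_r($startOnly[0]); // only the first label
```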

The most "general" and least regex-expensive way to do this that I could come up with was this:
$parts = explode('<br', $str);//don't include space and `/`, as tags may vary
$formatted = '';
foreach($parts as $part)
{
$formatted .= preg_replace('/^\s*[\/>]{0,2}\s*([^:]+:)/', '<b>$1</b>',$part).'<br/>';
}
echo $formatted;
Or:
$formatted = array();
foreach($parts as $part)
{
$formatted[] = preg_replace('/^\s*[\/>]{0,2}\s*([^:]+:)/', '<b>$1</b>',$part);
}
echo implode('<br/>', $formatted);
Tested with the string from the question, this gave me the following output:
Pivot: 96.75Our preference: Long positions above 96.75 with targets # 97.8 & 98.25 in extension.Alternative scenario: Below 96.75 look for further downside with 96.35 & 95.9 as targets.Comment the pair has broken above its resistance and should post further advance.
That being said, I do find this bit of data weird, and, if I were you, I'd consider str_replace or preg_replace-ing all breaks with PHP_EOL:
$str = preg_replace('/\<\s*br\s*\/?\s*\>/i', PHP_EOL, $str);//allow for any form of break tag
And then your string looks exactly like the data I had to parse, for which I ended up with this regex:
$str = preg_replace(...);
$formatted = preg_replace('/^([^:\n\\\\]++)\s{0,}:((\n(?![^\n:\\\\]++\s{0,}:)|.)*+)/','<b>$1:</b>$2<br/>', $str);//\\\\ in a single-quoted string gives the regex a literal backslash inside the character class

In PHP, get entire word from MySQL search result using "LIKE"

What I want is:
Let's suppose I searched for "goo" using a query like ...WHERE message LIKE '%goo%', and it returned a result, for example: I love Google to make my searches, but I'm starting to worry about privacy. It is displayed as a result because the word Google matches my search criteria.
How do I, based on my search string, save the entire word Google in a variable?
I need this because I'm using a regular expression that highlights the searched word and displays the content before and after it, but it only works when the search term exactly matches the word in the result. It is also badly constructed, so it won't work well with words that are not surrounded by spaces.
This is the regular expression code
<?=preg_replace('/^.*?\s(.{0,'.$size.'})(\b'.$_GET['s'].'\b)(.{0,'.$size.'})\s.*?$/',
'...$1<strong>$2</strong>$3...',$message);?>
What I want is to replace $_GET['s'] with a variable that contains the whole word found by my query.
How do I achieve this ?

I bet it will be easier to change your regular expression to match any word containing the term. What about:
<?=preg_replace('/^.*?(.{0,'.$size.'})(\b\S*'.$_GET['s'].'\S*\b)(.{0,'.$size.'}).*?$/i',
'...$1<strong>$2</strong>$3...',$message);?>

I read your discussion on this, and a more robust implementation might be in order, especially taking your need to support diacritics into account. Using a single regular expression to fix all your problems might seem tempting, but the more complicated it becomes, the harder it gets to maintain or expand upon. To quote Jamie Zawinski:
Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.
As I have problems with iconv on my local machine, I used a simpler implementation instead. Feel free to use something more complicated or robust if your situation requires it.
I use a simple regular expression in this solution to get a set of alphanumeric characters only (also known as a "word"). The part of the regular expression that reads \p{L}\p{M} makes sure we also get letters and combining marks outside ASCII, i.e. the multibyte characters.
You can see this code working on IDEone.
<?php
function stripAccents($p_sSubject) {
$sSubject = (string) $p_sSubject;
$sSubject = str_replace('æ', 'ae', $sSubject);
$sSubject = str_replace('Æ', 'AE', $sSubject);
$sSubject = strtr(
utf8_decode($sSubject)
, utf8_decode('àáâãäåçèéêëìíîïñòóôõöøùúûüýÿÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝ')
, 'aaaaaaceeeeiiiinoooooouuuuyyAAAAAACEEEEIIIINOOOOOOUUUUY'
);
return $sSubject;
}
function emphasiseWord($p_sSubject, $p_sSearchTerm){
$aSubjects = preg_split('#([^a-z0-9\p{L}\p{M}]+)#iu', $p_sSubject, -1, PREG_SPLIT_DELIM_CAPTURE); // -1 means no limit; passing null is deprecated since PHP 8.1
foreach($aSubjects as $t_iKey => $t_sSubject){
$sSubject = stripAccents($t_sSubject);
if(stripos($sSubject, $p_sSearchTerm) !== false || mb_stripos($t_sSubject, $p_sSearchTerm) !== false){
$aSubjects[$t_iKey] = '<strong>' . $t_sSubject . '</strong>';
}
}
$sSubject = implode('', $aSubjects);
return $sSubject;
}
/////////////////////////////// Test \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
$aTest = array(
'goo' => 'I love Google to make my searches, but I`m starting to worry about privacy.'
, 'peo' => 'people, People, PEOPLE, peOple, people!, people., people?, "people, people" péo'
, 'péo' => 'people, People, PEOPLE, peOple, people!, people., people?, "people, people" péo'
, 'gen' => '"gente", "inteligente", "VAGENS", and "Gente" ...vocês da física que passam o dia protegendo...'
, 'voce' => '...vocês da física que passam o dia protegendo...'
, 'o' => 'Characters like æ,ø,å,Æ,Ø and Å are used in Denmark, Sweden and Norway'
, 'ø' => 'Characters like æ,ø,å,Æ,Ø and Å are used in Denmark, Sweden and Norway'
, 'ae' => 'Characters like æ,ø,å,Æ,Ø and Å are used in Denmark, Sweden and Norway'
, 'Æ' => 'Characters like æ,ø,å,Æ,Ø and Å are used in Denmark, Sweden and Norway'
);
$sContent = '<dl>';
foreach($aTest as $t_sSearchTerm => $t_sSubject){
$sContent .= '<dt>' . $t_sSearchTerm . '</dt><dd>' . emphasiseWord($t_sSubject, $t_sSearchTerm) .'</dd>';
}
$sContent .= '</dl>';
echo $sContent;
?>

I don't understand the importance of matching everything else in the search string, wouldn't this simply be enough?
<?=preg_replace('/\b\S*'.$_GET['s'].'\S*\b/i', '<strong>$0</strong>', $message);?>
As far as I can tell, you are only putting the matched word in a html tag, but not doing anything to the rest of the string?
The above regex works fine for cases where you are only matching whole words, captures multiple matches within a string (should there be more than one) and also works fine with case insensitivity.
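Since the search term comes straight from user input, it may contain regex metacharacters. A hedged variant of the one-liner above that treats the term as literal text via preg_quote() could look like this (highlightWordsContaining() is a hypothetical helper name; escaping the message for HTML output is left to the caller):

```php
<?php
// Hypothetical helper: wrap every word containing $term in <strong>, treating $term as literal text.
function highlightWordsContaining(string $message, string $term): string
{
    if ($term === '') {
        return $message;
    }
    // preg_quote() escapes regex metacharacters; the second argument also escapes the delimiter.
    $pattern = '/\b\S*' . preg_quote($term, '/') . '\S*\b/i';
    return preg_replace($pattern, '<strong>$0</strong>', $message);
}

echo highlightWordsContaining('I love Google to make my searches', 'goo');
```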

How to replace glossary terms in HTML text with links?

I would like to run a str_replace or preg_replace which looks for certain words (found in $glossary_terms) in my $content and replaces them with links (like term).
However, the $content is full HTML and my links/images are being affected too, which isn't what I'm after.
An example of $content is:
<div id="attachment_542" class="wp-caption alignleft" style="width: 135px"><img class="size-thumbnail wp-image-542" title="Amazonas English" src="http://www.seriouslyfish.com/dev/wp-content/uploads/2011/12/Amazonas-English-1-288x381.jpg" alt="Amazonas English" width="125" height="165" /><p class="wp-caption-text">Amazonas Magazine - now in English!</p></div>
<p>Edited by Hans-Georg Evers, the magazine ‘Amazonas’ has been widely-regarded as among the finest regular publications in the hobby since its launch in 2005, an impressive achievment considering it’s only been published in German to date. The long-awaited English version is just about to launch, and we think a subscription should be top of any serious fishkeeper’s Xmas list…</p>
<p>The magazine is published in a bi-monthly basis and the English version launches with the January/February 2012 issue with distributors already organised in the United States, Canada, the United Kingdom, South Africa, Australia, and New Zealand. There are also mobile apps availablen which allow digital subscribers to read on portable devices.</p>
<p>It’s fair to say that there currently exists no better publication for dedicated hobbyists with each issue featuring cutting-edge articles on fishes, invertebrates, aquatic plants, field trips to tropical destinations plus the latest in husbandry and breeding breakthroughs by expert aquarists, all accompanied by excellent photography throughout.</p>
<p>U.S. residents can subscribe to the printed edition for just $29 USD per year, which also includes a free digital subscription, with the same offer available to Canadian readers for $41 USD or overseas subscribers for $49 USD. Please see the Amazonas website for further information and a sample digital issue!</p>
<p>Alternatively, subscribe directly to the print version here or digital version here. Just gonna add this to the end of the post so I can do some testing.</p>
I came across this link, but I wasn't sure if such a method would work with nested HTML.
Is there any way I can str_replace or preg_replace content within <p> tags only; excluding any nested <a>, <img> or <h1/2/3/4/5> tags?
Thanks in advance,

A "by-the-book solution" would be like this:
<?php
$html = "<your HTML string>";
$glossary_terms = array('fishes', 'invertebrates', 'aquatic plants');
$dom = new DOMDocument;
$dom->loadHTML($html);
dom_link_glossary($dom, $glossary_terms);
echo $dom->saveHTML();
// wraps all occurrences of the glossary terms in links
function dom_link_glossary(&$document, &$glossary) {
$xpath = new DOMXPath($document);
$urls = array();
$pattern = array();
// build a normalized lookup (case-insensitive, whitespace-agnostic)
foreach ($glossary as $term) {
$term_norm = preg_replace('/\s+/', ' ', strtoupper(trim($term)));
$pattern[] = preg_replace('/ /', '\\s+', preg_quote($term_norm));
$urls[$term_norm] = '/glossary/initial/' . rawurlencode($term);
}
$pattern = '/\b(' . implode('|', $pattern) . ')\b/i';
$text_nodes = $xpath->query('//text()[not(ancestor::a)]');
foreach($text_nodes as $original_node) {
$text = $original_node->nodeValue;
$hitcount = preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);
if ($hitcount == 0) continue;
$offset = 0;
$parent = $original_node->parentNode;
$refnode = $original_node->nextSibling;
$parent->removeChild($original_node);
foreach ($matches[0] as $i => $match) {
$term_txt = $match[0];
$term_pos = $match[1];
$term_norm = preg_replace('/\s+/', ' ', strtoupper($term_txt));
// insert any text before the term instance
$prefix = substr($text, $offset, $term_pos - $offset);
$parent->insertBefore($document->createTextNode($prefix), $refnode);
// insert the actual term instance as a link
$link = $document->createElement("a", $term_txt);
$link->setAttribute("href", $urls[$term_norm]);
$parent->insertBefore($link, $refnode);
$offset = $term_pos + strlen($term_txt);
if ($i == $hitcount - 1) { // last match, append remaining text
$suffix = substr($text, $offset);
$parent->insertBefore($document->createTextNode($suffix), $refnode);
}
}
}
}
?>
Here is how dom_link_glossary() works:
It normalizes the glossary terms (trim, uppercase, white-space) and builds a lookup array and a regex pattern that matches all terms.
It uses XPath to find all text nodes that are not already part of a link. Text nodes are returned irrespective of their nesting depth (i.e. no recursion necessary on our part). I use \b to prevent partial matches.
For each text node that contains terms:
The original text node is deleted ($parent->removeChild())
Now new nodes are created and inserted into the DOM: text nodes for anything before (or after) a glossary term, element nodes (<a>) for the actual glossary terms.
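The solution preserves original case and white space. Assuming glossary entries for term and foo bar:
term will become <a href="/glossary/initial/term">term</a>
Term will become <a href="/glossary/initial/term">Term</a>
Foo Bar will become <a href="/glossary/initial/foo%20bar">Foo Bar</a>. Surplus whitespace or line breaks in the HTML will not break the mechanism.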
Note that it is perfectly all-right to use regex on the plain text node values. It is not okay to use regex on full HTML.
I would recommend pairing the glossary terms with their respective URLs in an array, instead of calculating the URLs in the function. That way you can make multiple terms point to the same URL.
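A sketch of what that pairing could look like: the glossary becomes a term => URL map, and the normalized lookup is built from it. The terms and URLs below are made up for illustration; the normalization mirrors the one in dom_link_glossary().

```php
<?php
// Hypothetical glossary: each term is paired with its URL; two terms share one page.
$glossary = array(
    'fishes'         => '/glossary/fish',
    'fish'           => '/glossary/fish',
    'aquatic plants' => '/glossary/aquatic-plants',
);

$urls = array();
$alternatives = array();
foreach ($glossary as $term => $url) {
    // Same normalization as in dom_link_glossary(): trim, uppercase, collapse whitespace.
    $term_norm = preg_replace('/\s+/', ' ', strtoupper(trim($term)));
    $alternatives[] = preg_replace('/ /', '\\s+', preg_quote($term_norm, '/'));
    $urls[$term_norm] = $url;
}
$pattern = '/\b(' . implode('|', $alternatives) . ')\b/i';

// Look up the URL for a matched term the same way the function does.
preg_match($pattern, 'Articles on aquatic   plants and fishes', $m);
$matched_url = $urls[preg_replace('/\s+/', ' ', strtoupper($m[0]))];
echo $matched_url;
```

Inside the function, only the loop that fills $urls and $pattern would need to change; the text-node handling stays the same.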

You can try this:
$content = preg_replace('/(<p\sclass=\"wp\-caption\-text\">)[^<]+(<\/p>)/i', '', $content);

Text wrap nightmare in PHP after PREG_REPLACE

Provinces is a group_concat of all the individual records that contain province, some of which are blank.
So, when I encode:
$provinces = ($row['provinces']);
echo "<td>".wordwrap($provinces, 35, "<br />")."</td>";
This is what the result looks like:
Minas Gerais,,,Rio Grande do
Sul,Santa Catarina,Paraná,São Paulo
However, when I try to preg_replace out some of the nulls, and add some spaces with this expression:
$provinces = preg_replace($patterns,
$replaces, ($row['provinces']));
echo "<td>".wordwrap($provinces, 35, "<br />")."</td>";
This is what I get!!! :(
Minas Gerais, Rio Grande do
Sul, Santa
Catarina, Paraná, São Paulo
The output is very unnatural looking.
BTW: Here are the search and replace arrays:
$patterns[0] = '/,,([,]+)?/'; $replaces[0] = ',&nbsp;';
$patterns[1] = '/^,/'; $replaces[1] = '';
$patterns[2] = '/,$/'; $replaces[2] = '';
$patterns[3] = '/\b,\b/'; $replaces[3] = ',&nbsp;';
$patterns[4] = '/\s,/'; $replaces[4] = ',&nbsp;';
UPDATE: I even tried to change Paraná to Parana
Minas Gerais, Rio Grande do
Sul, Santa
Catarina, Parana, São
Paulo

Don't use &nbsp; as the replacement. wordwrap() considers that 6 characters; it doesn't interpret the HTML entity. That's why your lines are breaking funny. If you want &nbsp;, replace the spaces after you wordwrap().
Also, your first pattern should be:
// match one or more commas together
$patterns[0] = '/,+/';

Is the wordwrap() really necessary? It sounds like you are rendering this content into a table cell of some fixed width and you don't want individual entries to split across lines.
If this inference is correct - and if none of your entries is actually so long that forcing it to a single line will break your layout - then how about this: explode() on commas into an array, remove the whitespace-only entries, replace normal spaces in each array entry with &nbsp;, and implode() back on ', ' (a comma followed by a space). Then let the rendering browser break lines wherever it needs.
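Those steps could be sketched like this. formatProvinces() is a hypothetical helper name, and the input is the group_concat string from the question; HTML-escaping the names is omitted here.

```php
<?php
// Hypothetical helper implementing the explode / filter / replace / implode steps.
function formatProvinces(string $provinces): string
{
    $entries = array_map('trim', explode(',', $provinces));
    // Drop empty and whitespace-only entries.
    $entries = array_filter($entries, function ($entry) {
        return $entry !== '';
    });
    // Non-breaking spaces keep each province name on one line.
    $entries = array_map(function ($entry) {
        return str_replace(' ', '&nbsp;', $entry);
    }, $entries);
    return implode(', ', $entries);
}

echo formatProvinces('Minas Gerais,,,Rio Grande do Sul,Santa Catarina,Paraná,São Paulo');
```

The only normal spaces left are the ones after the commas, so the browser can wrap between provinces but not inside a name.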
