PHP extra whitespace not being deleted - php

I'm counting words in an article and removing common words such as "and" or "the".
I'm removing them using preg_replace,
and after that is done I do a quick clean of extra whitespace using:
$search_body = preg_replace('/\s+/',' ',$search_body);
However I've got some very stubborn white space that will not go away. I've tried
if($word == "" OR $word == " "){
//chop it's head off
}
But the if statement does not see $word as being just whitespace. I've also tried printing it to the screen to get the raw data type of it and it's still just showing up blank.
Here is the full regex that I'm using.
$pattern = array(
'/\&quot\;/',
'/[0-9]/',
'/\,/',
'/\./',
'/\!/',
'/\#/',
'/\#/',
'/\$/',
'/\%/',
'/\^/',
'/\&/',
'/\*/',
'/\(/',
'/\)/',
'/\_/',
'/\"/',
'/\'/',
'/\:/',
'/\;/',
'/\?/',
'/\`/',
'/\~/',
'/\[/',
'/\]/',
'/\{/',
'/\}/',
'/\|/',
'/\+/',
'/\=/',
'/\-/',
'/–/',
'/°/',
'/\bthe\b/',
'/\band\b/',
'/\bthat\b/',
'/\bhave\b/',
'/\bfor\b/',
'/\bnot\b/',
'/\bwith\b/',
'/\byou\b/',
'/\bthis\b/',
'/\bbut\b/',
'/\bhis\b/',
'/\bfrom\b/',
'/\bthey\b/',
'/\bsay\b/',
'/\bher\b/',
'/\bshe\b/',
'/\bwill\b/',
'/\bone\b/',
'/\ball\b/',
'/\bwould\b/',
'/\bthere\b/',
'/\btheir\b/',
'/\bwhat\b/',
'/\bout\b/',
'/\babout\b/',
'/\bwho\b/',
'/\bget\b/',
'/\bwhich\b/',
'/\bwhen\b/',
'/\bmake\b/',
'/\bcan\b/',
'/\blike\b/',
'/\btime\b/',
'/\bjust\b/',
'/\bhim\b/',
'/\bknow\b/',
'/\btake\b/',
'/\bpeople\b/',
'/\binto\b/',
'/\byear\b/',
'/\byour\b/',
'/\bgood\b/',
'/\bsome\b/',
'/\bcould\b/',
'/\bthem\b/',
'/\bsee\b/',
'/\bother\b/',
'/\bthan\b/',
'/\bthen\b/',
'/\bnow\b/',
'/\blook\b/',
'/\bonly\b/',
'/\bcome\b/',
'/\bits\b/', //it's?
'/\bover\b/',
'/\bthink\b/',
'/\balso\b/',
'/\bback\b/',
'/\bafter\b/',
'/\buse\b/',
'/\btwo\b/',
'/\bhow\b/',
'/\bour\b/',
'/\bwork\b/',
'/\bfirst\b/',
'/\bwell\b/',
'/\bway\b/',
'/\beven\b/',
'/\bnew\b/',
'/\bwant\b/',
'/\bbecause\b/',
'/\bany\b/',
'/\bthese\b/',
'/\bgive\b/',
'/\bday\b/',
'/\bmost\b/',
'/\bare\b/',
'/\bwas\b/',
'/\<\w+\>/', '/\<\/\w+\>/',
'/\b\w{1}\b/', //1 letter word
'/\b\w{2}\b/', //2 letter word
'/\//',
'/\</',
'/\>/'
);
$search_body = strip_tags($body);
$search_body = strtolower($search_body);
$search_body = preg_replace($pattern, ' ', $search_body);
$search_body = preg_replace('/\s+/',' ',$search_body);
$search_body = explode(" ", $search_body);
When exploded blank values show up left and right
Example text that I am using is too long to post here. But I copied and pasted
This article to give it a test and it showed 32 counts of white space, not including the white space in front of or behind of other words even after using trim().
Here's a js.fiddle of the raw data that is being handled by php.
htmlentities and htmlspecialchars also show nothing.
Here's the code that counts all the values and puts them into one array.
$inhere = array();
$body_hold = array();
foreach($search_body as $value){
$value = trim($value);
if(in_array($value, $inhere) && $value != ""){
$key = array_search($value, $inhere);
$body_hold[$key]['count'] = $body_hold[$key]['count']+1;
}elseif($value != ""){
$inhere[] = $value;
$body_hold[] = array(
'count' => 1,
'word' => $value
);
}
}
rsort($body_hold);
Basic foreach to see values.
foreach($body_hold as $value){
$count = $value['count'];
$word = trim($value['word']);
echo "Count: ".$count;
echo " Word: ".$word;
echo '<br>';
}
Here's a PHP example of what it's returning

Are you sure you put the exact same data you're processing in the js.fiddle? Or did you get it from a subsequent post-processed step?
It's obviously a Wikipedia article. I went to that article on Wikipedia and opened it in Edit mode, and saw that there are &nbsp;s in the raw wikitext. However, those nbsp's don't appear in your js.fiddle data.
TL;DR: Check for &nbsp; in your processing (and convert it to spaces, etc.).

Character 160 looks like a space but it is not one. Replacing every occurrence with a regular space (32) and then collapsing repeated spaces will fix your problem.
$search_body = str_replace(chr(160), chr(32), $search_body);
$search_body = trim(preg_replace('/\s+/', ' ', $search_body));
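One caveat on the snippet above: chr(160) is the non-breaking space only in single-byte encodings such as ISO-8859-1. In UTF-8 text the same character is the two-byte sequence "\xC2\xA0", and it may also arrive as the literal entity &nbsp;. Below is a minimal sketch that handles both cases and then counts words. It assumes the input is valid UTF-8, and the function names normalize_whitespace and count_words are made up for illustration.

```php
<?php
// Hypothetical helpers, assuming $text is valid UTF-8.

function normalize_whitespace(string $text): string
{
    // Turn entities such as &nbsp; into the characters they stand for.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Under the /u flag, \x{00A0} matches the non-breaking space code point.
    $collapsed = preg_replace('/[\s\x{00A0}]+/u', ' ', $text);
    // preg_replace() returns null on failure (e.g. invalid UTF-8); fall back to the input.
    return trim($collapsed ?? $text);
}

function count_words(string $text, array $stopWords = []): array
{
    $words = explode(' ', normalize_whitespace(strtolower($text)));
    // Drop stop words and any empty string left over from empty input.
    $words = array_diff($words, $stopWords, ['']);
    $counts = array_count_values($words);
    arsort($counts); // highest count first
    return $counts;
}
```

Calling count_words($body, ['the', 'and']) returns an array keyed by word, so the separate $inhere lookup array from the question is not needed.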

Related

PHP Form implode / explode

I am using a DB field, merchant_sku_item, in a form. The original value is separated by / in the DB like this:
2*CC689/1*CC368-8/1*SW6228-AB
I want to display in a text area on each line so I tried like this:
<textarea name="merchant_sku_item" rows="5" class="form-control" id="merchant_sku_item"><?
$items=explode('/',$merchant_sku_item);
foreach($items as $item){
echo $item."\r\n";
}
?></textarea>
All works fine:
2*CC689
1*CC368-8
1*SW6228-AB
but when I post the form I get a value like this:
2*CC689 1*CC368-8 1*SW6228-AB
but I want it back in the original format so I can update the DB correctly:
2*CC689/1*CC368-8/1*SW6228-AB
I tried to implode it with the / but I think it's just one string now so it's not working. I could replace the spaces I guess but this will not work if the field contains spaces.
Could somebody please tell me the best way to handle this?
The explode is correct, but you cannot just echo $item . "\r\n" because if $item contains </textarea> or other HTML you'll screw up the page. You have to use echo htmlspecialchars($item) . "\n";. Normally, HTML pages have Linux line endings with "\n" and not Windows line endings with "\r\n".
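A minimal sketch of that escaping step, assuming the value comes from the DB as a slash-separated string (the helper name sku_lines_for_textarea is hypothetical):

```php
<?php
// Hypothetical helper: turn "a/b/c" into escaped lines for a <textarea>.
function sku_lines_for_textarea(string $dbValue): string
{
    $items = explode('/', $dbValue);
    // Escape each item so stray HTML cannot close the textarea early.
    $escaped = array_map(
        fn(string $item): string => htmlspecialchars($item, ENT_QUOTES, 'UTF-8'),
        $items
    );
    return implode("\n", $escaped);
}
```

In the template, the result would be echoed between the opening and closing textarea tags in place of the foreach loop.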
To re-create the value for the DB, you have to take into consideration that the user may add some spaces or new lines. So you might not just get "\r\n" between the values but also " \n" or some other combination. This is why a regular expression will be more flexible than a simple explode().
The regular expression pattern: \s+
The pattern \s matches any space, tab or newline character. Adding the + sign after it means one or more times. So " \r\n" will match, as it contains spaces, a carriage return and a new line. In PHP, you put the pattern between a delimiter character of your choice, placed at the beginning and the end. Commonly it's a slash, so it becomes /\s+/, but you sometimes also see #\s+# or ~\s+~. After the closing delimiter, you can add flags that change the way the regular expression is executed. Typically /hello/i will match "Hello" or "hello" because the i flag makes the search case-insensitive.
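To make the delimiter and flag rules concrete, here is a small sketch (the sample strings are invented for illustration):

```php
<?php
// The same pattern written with three different delimiters.
$input = "a \r\n b\tc";
$withSlash = preg_split('/\s+/', $input);
$withHash  = preg_split('#\s+#', $input);
$withTilde = preg_split('~\s+~', $input);

// The i flag after the closing delimiter makes the match case-insensitive.
$sensitive   = preg_match('/hello/', 'Hello world');
$insensitive = preg_match('/hello/i', 'Hello world');
```

All three preg_split calls produce the same array, and only the second preg_match call finds a match.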
Similar to what you did: explode and re-implode example:
<?php
// Example of values that could be posted because users are always
// stupid and add spaces that they then don't see anymore.
$examples = [
"2*CC689 1*CC368-8 1*SW6228-AB", // spaces
"2*CC689\n1*CC368-8\n1*SW6228-AB", // new lines
"2*CC689\r\n1*CC368-8\r\n1*SW6228-AB", // carriage returns and new lines
"2*CC689\n 1*CC368-8 \n1*SW6228-AB", // new lines and spaces
];
foreach ($examples as $merchant_sku_item) {
$values = preg_split('/\s+/', $merchant_sku_item);
$merchant_sku_item_for_db = implode('/', $values);
echo $merchant_sku_item_for_db . "\n";
}
?>
Output:
2*CC689/1*CC368-8/1*SW6228-AB
2*CC689/1*CC368-8/1*SW6228-AB
2*CC689/1*CC368-8/1*SW6228-AB
2*CC689/1*CC368-8/1*SW6228-AB
More simply, you could also just do a replacement with the same regular expression, like this:
<?php
// Example of values that could be posted.
$examples = [
"2*CC689 1*CC368-8 1*SW6228-AB", // spaces
"2*CC689\n1*CC368-8\n1*SW6228-AB", // new lines
"2*CC689\r\n1*CC368-8\r\n1*SW6228-AB", // carriage returns and new lines
"2*CC689\n 1*CC368-8 \n1*SW6228-AB", // new lines and spaces
];
foreach ($examples as $merchant_sku_item) {
$merchant_sku_item_for_db = preg_replace('/\s+/', '/', $merchant_sku_item);
echo $merchant_sku_item_for_db . "\n";
}
?>
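Both variants above assume the posted value has no leading or trailing whitespace. A textarea filled with one item per line often ends with a line break, which would leave a trailing slash (or an empty last element). Here is a hedged sketch that guards against this, with sku_text_to_db as a hypothetical helper name:

```php
<?php
// Hypothetical helper: join whitespace-separated items with slashes.
function sku_text_to_db(string $posted): string
{
    // PREG_SPLIT_NO_EMPTY drops the empty pieces that leading or
    // trailing whitespace would otherwise produce.
    $items = preg_split('/\s+/', $posted, -1, PREG_SPLIT_NO_EMPTY);
    // preg_split() returns false on failure; treat that as "no items".
    return implode('/', $items ?: []);
}
```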
And just another important point regarding the data the user could input: What happens if the user types "2*CC/689" in the textarea?
Well, this will break your DB value :-/
This means that you have to validate the user input with some checks:
<?php
header('Content-Type: text/plain');
$examples = [
"2*CC689 1*CC368-8 1*SW6228-AB", // spaces
"2*CC689\n1*CC368-8\n1*SW6228-AB", // new lines
"2*CC689\r\n1*CC368-8\r\n1*SW6228-AB", // carriage returns and new lines
"2*CC689\n 1*CC368-8 \n1*SW6228-AB", // new lines and spaces
// Test with invalid data:
"2*C/C689\n1*CC368-8\n1*SW6228-AB", // slash not allowed
"*CC689\n1*CC368-8\n1*SW6228-AB", // missing number before the *
"1*\n1*CC368-8\n1*SW62?28-AB", // missing product identifier and invalid ?
"1CC689 1*CC368-8 1SW6228-AB", // missing *
];
foreach ($examples as $example_nbr => $merchant_sku_item) {
echo str_repeat('=', 80) . "\n";
echo "Example $example_nbr\n\$merchant_sku_item = \"$merchant_sku_item\"\n";
$values = preg_split('/\s+/', $merchant_sku_item);
$errors = [];
foreach ($values as $i => $value) {
echo "Value $i = \"$value\"";
// Pattern: a number followed by * and followed by a product id (length between 3 and 10).
if (!preg_match('/^\d+\*[\d\w-]{3,10}$/i', $value)) {
echo " <-- ERROR\n";
$errors[] = $value;
} else {
echo "\n"; // It's ok
}
}
if (!empty($errors)) {
// You should handle the error and reload the form with the posted value and an error
// message explaining to the user what format is allowed.
echo "ERROR: Cannot save the value because the following products are wrong:\n";
echo implode("\n", $errors) . "\n";
}
}
?>
Test it here: https://onecompiler.com/php/3xtff6nk8
You can use preg_replace on your final posted string, as in the code below:
$str = "2*CC689 blue 1*CC368-8 red 1*SW6228-AB";
$items = preg_replace('/\s+/', '/', $str);
echo $items;
Output:
2*CC689/blue/1*CC368-8/red/1*SW6228-AB
I hope I understood your question correctly.
Try replacing the created spaces and removing newline characters like so:
<?php
$merchant_sku_item = str_replace("\r\n","/",trim($_POST["merchant_sku_item"]));
?>
Make it this way:
<textarea name="merchant_sku_item" rows="5" class="form-control" id="merchant_sku_item"><?
$items=explode('//',$merchant_sku_item);
foreach($items as $item){
echo $item."\r\n";
}
?></textarea>

php preg_match excluding text within html tags/attributes to find correct place to cut a string

I am trying to determine the absolute position of certain words within a block of html, but only if they are outside of an actual html tag. For instance, if I wanted to determine the position of the word "join" using preg_match in this text:
<p>There are 14 more days until our <a href="/somepage.html" target="_blank" rel="noreferrer noopener" aria-label="join us">holiday special</a> so come join us!</p>
I could use:
preg_match('/join/', $post_content, $matches, PREG_OFFSET_CAPTURE, $offset);
The problem is that this is matching the word within the aria-label attribute, when what I need is the one just after the link. It would be fine to match between the <a> and </a>, just not inside the brackets themselves.
My actual end goal, most of which I (think I) have working aside from this last element: I am trimming a block of html (not a full document) to cut off at a specific word count. I am trying to determine which character that last word ends at, and then joining the left side of the html block with only the html from the right side, so all html tags close gracefully. I thought I had it working until I ran into an example like the one I showed, where the last word was also within an html attribute, causing me to split the string at the wrong location. This is my code so far:
$post_content = strip_tags ( $p->post_content, "<a><br><p><ul><li>" );
$post_content_stripped = strip_tags ( $p->post_content );
$post_content_stripped = preg_replace("/[^A-Za-z0-9 ]/", ' ', $post_content_stripped);
$post_content_stripped = preg_replace("/\s+/", ' ', $post_content_stripped);
$post_content_stripped_array = explode ( " " , trim($post_content_stripped) );
$excerpt_wordcount = count( $post_content_stripped_array );
$cutpos = 0;
while($excerpt_wordcount>48){
$thiswordrev = "/" . strrev($post_content_stripped_array[$excerpt_wordcount - 1]) . "/";
preg_match($thiswordrev, strrev($post_content), $matches, PREG_OFFSET_CAPTURE, $cutpos);
$cutpos = $matches[0][1] + (strlen($thiswordrev) - 2);
array_pop($post_content_stripped_array);
$excerpt_wordcount = count( $post_content_stripped_array );
}
if($pwordcount>$excerpt_wordcount){
preg_match_all('/<\/?[^>]*>/', substr( $post_content, strlen($post_content) - $cutpos ), $closetags_result);
$excerpt_closetags = "" . $closetags_result[0][0];
$post_excerpt = substr( $post_content, 0, strlen($post_content) - $cutpos ) . $excerpt_closetags;
}else{
$post_excerpt = $post_content;
}
I am actually searching the string in reverse in this case, since I am walking word by word backwards from the end of the string, so I know that my html brackets are backwards, eg:
>p/<!su nioj emoc os >a/<laiceps yadiloh>"su nioj"=lebal-aira "renepoon rerreferon"=ler "knalb_"=tegrat "lmth.egapemos/"=ferh a< ruo litnu syad erom 41 era erehT>p<
But it's easy enough to flip all of the brackets before doing the preg_match, or I am assuming should be easy enough to have the preg_match account for that.
Do not use regex to parse HTML.
You have a simple objective: limit the text content to a given number of words, ensuring that the HTML remains valid.
To this end, I would suggest looping through text nodes until you count a certain number of words, and then removing everything after that.
$dom = new DOMDocument();
$dom->loadHTML($post_content);
$xpath = new DOMXPath($dom);
$all_text_nodes = $xpath->query("//text()");
$words_left = 48;
foreach( $all_text_nodes as $text_node) {
$text = $text_node->textContent;
$words = explode(" ", $text); // TODO: maybe preg_split on /\s/ to support more whitespace types
$word_count = count($words);
if( $word_count < $words_left) {
$words_left -= $word_count;
continue;
}
// reached the threshold
$words_that_fit = implode(" ", array_slice($words, 0, $words_left));
// If the above TODO is implemented, this will need to be adjusted to keep the specific whitespace characters
$text_node->textContent = $words_that_fit;
$remove_after = $text_node;
while( $remove_after->parentNode) {
while( $remove_after->nextSibling) {
$remove_after->parentNode->removeChild($remove_after->nextSibling);
}
$remove_after = $remove_after->parentNode;
}
break;
}
$output = substr($dom->saveHTML($dom->getElementsByTagName("body")->item(0)), strlen("<body>"), -strlen("</body>"));
Live demo
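The TODO in the code above (supporting more whitespace types) can be sketched by matching runs of non-whitespace with their byte offsets instead of exploding on a single space. This is a sketch under stated assumptions, not a drop-in replacement: truncate_html_words is a hypothetical name, the fragment is wrapped in a div so it can be extracted again, and the input is assumed to be ASCII or ISO-8859-1 compatible because loadHTML is called without an encoding hint.

```php
<?php
// Sketch only: hypothetical helper built on the DOM approach above.
function truncate_html_words(string $html, int $limit): string
{
    if ($limit < 1) {
        return '';
    }
    $previous = libxml_use_internal_errors(true); // silence parser warnings
    $dom = new DOMDocument();
    $dom->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $wrapper = $dom->getElementsByTagName('div')->item(0);
    $xpath = new DOMXPath($dom);
    $wordsLeft = $limit;

    foreach ($xpath->query('.//text()', $wrapper) as $textNode) {
        // Each match is [word, byte offset]; any whitespace separates words.
        preg_match_all('/\S+/', $textNode->data, $matches, PREG_OFFSET_CAPTURE);
        $words = $matches[0];
        if (count($words) < $wordsLeft) {
            $wordsLeft -= count($words);
            continue;
        }
        // Threshold reached: keep the text up to the end of the last word that fits.
        [$lastWord, $offset] = $words[$wordsLeft - 1];
        $textNode->data = substr($textNode->data, 0, $offset + strlen($lastWord));
        // Remove everything that follows, walking up to the wrapper.
        for ($node = $textNode; !$node->isSameNode($wrapper); $node = $node->parentNode) {
            while ($node->nextSibling !== null) {
                $node->parentNode->removeChild($node->nextSibling);
            }
        }
        break;
    }

    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}
```

Because the cut is made by offset inside the original text node, the whitespace between the kept words is preserved as written.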
Ok, I figured out a workaround. I don't know if this is the most elegant solution, so if someone sees a better one I would still love to hear it, but for now I realized that I don't have to actually have the html in the string I am searching to determine the position to cut, I just need it to be the same length. I grabbed all of the html elements and just created a dummy string replacing all of them with the same number of asterisks:
// create faux string with placeholders instead of html for search purposes
preg_match_all('/<\/?[^>]*>/', $post_content, $alltags_result);
$tagcount = count( $alltags_result );
$post_content_dummy = $post_content;
foreach($alltags_result[0] as $thistag){
$post_content_dummy = str_replace($thistag, str_repeat("*",strlen($thistag)), $post_content_dummy);
}
Then I just use $post_content_dummy in the while loop instead of $post_content, in order to find the cut position, and then $post_content for the actual cut. So far seems to be working fine.
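The masking step can also be done in a single pass with preg_replace_callback, which avoids the separate preg_match_all and the loop. A minimal sketch, using the same tag pattern as above; mask_tags is a hypothetical name:

```php
<?php
// Hypothetical helper: replace every tag with asterisks of the same length,
// so character offsets in the masked string match the original.
function mask_tags(string $html): string
{
    $masked = preg_replace_callback(
        '/<\/?[^>]*>/',
        fn(array $m): string => str_repeat('*', strlen($m[0])),
        $html
    );
    // preg_replace_callback() returns null on failure; fall back to the input.
    return $masked ?? $html;
}
```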

remove HTML from displaying in PHP

I have this text : http://pastebin.com/2Zgbs7hi
And I want to be able to remove the HTML code from it and just display the plain text, but I want to keep at least one line break where there are currently several line breaks.
I have tried:
$ticket["summary"] = 'pastebin example';
$TicketSummaryDisplay = nl2br($ticket["summary"]);
$TicketSummaryDisplay = stripslashes($TicketSummaryDisplay);
$TicketSummaryDisplay = trim(strip_tags($TicketSummaryDisplay));
$TicketSummaryDisplay = preg_replace('/\n\s+$/m', '', $TicketSummaryDisplay);
echo $TicketSummaryDisplay;
That displays as plain text, but it shows it all as one big block of text with no line breaks at all.
Maybe this will save you some time.
<?php
libxml_use_internal_errors(true); //crazy o tags
$html = file_get_contents('http://pastebin.com/raw.php?i=2Zgbs7hi');
$dom = new DOMDocument;
$dom->loadHTML($html);
$result='';
foreach ($dom->getElementsByTagName('p') as $node) {
if (strstr($node->nodeValue, 'Legal Disclaimer:')){
break;
}
$result .= $node->nodeValue;
}
echo $result;
This example should successfully store text from html into an array of strings.
After stripping all the tags, you can use preg_split with the \R escape sequence (which matches any newline sequence) to convert the string into an array. That array will have several blank values, along with a number of HTML non-breaking space entities, so we filter it with array_filter(), which removes every item that does not satisfy the callback (in our case, empty values). There is a problem with the &nbsp; entity: &nbsp; and the space character are not the same and have different character codes, so trim() will not remove &nbsp;. Here are two possible solutions: the first, uncommented part only replaces &nbsp; and then checks for whitespace, while the second, commented-out part decodes all HTML entities and then checks for non-breaking spaces.
PHP:
$text = file_get_contents( 'http://pastebin.com/raw.php?i=2Zgbs7hi' );
$text = strip_tags( $text );
$array = array_filter(
preg_split( '/\R/', $text ),
function( &$item ) {
$item = str_replace( '&nbsp;', ' ', $item );
return trim( $item );
// $item = html_entity_decode( $item );
// return trim( str_replace( "\xC2\xA0", ' ', $item ) );
}
);
foreach( $array as $value ) {
echo $value . '<br />';
}
Array output:
Array
(
[8] => Hi,
[11] => Ashley has explained that I need to ask for another line and broadband for the wifi to work, please can you arrange this.
[13] => Regards
[23] => Legal Disclaimer:
[24] => This email and its attachments are confidential. If you received it by mistake, please don’t share it. Let us know and then delete it. Its content does not necessarily represent the views of The Dragon Enterprise
[25] => Centre and we cannot guarantee the information it contains is complete. All emails are monitored and may be seen by another member of The Dragon Enterprise Centre's staff for internal use
)
Now you should have a clean array containing only the items that have a value. By the way, newlines in HTML are expressed through <br />, not through \n; your example, viewed as a response in a web browser, still has them, but they are only visible in the page source. I hope I did not miss the point of the question.
Try this to get text output with line breaks:
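The steps described above (strip tags, deal with non-breaking spaces, split on newlines, drop empty lines) can also be written as a function that returns a re-indexed array. This is a sketch assuming UTF-8 input; html_to_plain_lines is a hypothetical name.

```php
<?php
// Hypothetical helper: reduce an HTML string to its non-empty text lines.
function html_to_plain_lines(string $html): array
{
    $text = strip_tags($html);
    // Decode entities, so &nbsp; becomes the real non-breaking space character.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // trim() does not remove U+00A0, so convert it to a plain space first.
    $text = str_replace("\u{00A0}", ' ', $text);
    // \R matches any newline sequence, including "\r\n" as one unit.
    $lines = preg_split('/\R/u', $text) ?: [];
    $lines = array_map('trim', $lines);
    // Keep only lines that still contain something, and re-index from 0.
    return array_values(array_filter($lines, fn(string $line): bool => $line !== ''));
}
```

To display the result with one line break between entries, the lines could be escaped with htmlspecialchars and joined with "<br />\n".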
<?php
$ticket["summary"] = file_get_contents('http://pastebin.com/raw.php?i=2Zgbs7hi');
$TicketSummaryDisplay = nl2br($ticket["summary"]);
echo strip_tags($TicketSummaryDisplay,'<br>');
?>
You are asking on how to add line-breaks to your "one big block of text with no line breaks at all".
Short answer
After you stripped the HTML tags, apply wordwrap with a desired text-block length
$text = wordwrap($text, 90, "<br />\n");
I really wonder, why nobody suggested that function before.
There is also chunk_split, which doesn't take words into account and just splits after a certain number of characters, breaking words, but that's not what you want, I guess.
PHP
<?php
$text = file_get_contents('http://pastebin.com/raw.php?i=2Zgbs7hi');
/**
* Returns string without html tags, also
* takes control chars, spaces and "&nbsp;" into account.
*/
function dropHtmlTags($string) {
// remove html tags
//$string = preg_replace ('/<[^>]*>/', ' ', $string);
$string = strip_tags($string);
// control characters and "&nbsp"
$string = str_replace("\r", '', $string); // remove
$string = str_replace("\n", ' ', $string); // replace with space
$string = str_replace("\t", ' ', $string); // replace with space
$string = str_replace("&nbsp;", ' ', $string);
// remove multiple spaces
$string = preg_replace('/ {2,}/', ' ', $string);
$string = trim($string);
return $string;
}
$text = dropHtmlTags($text);
// The Answer: insert line breaks after 95 chars,
// to get rid of the "one big block of text with no line breaks at all"
$text = wordwrap($text, 95, "<br />\n");
// if you want to insert line-breaks before the legal disclaimer,
// uncomment the next line
//$text = str_replace("Regards Legal Disclaimer", "<br /><br />Regards Legal Disclaimer", $text);
echo $text;
?>
Result
first section shows your text block
second section shows the text with wordwrap applied (code from above)
Hello it can be done as follows:
$abc= file_get_contents('http://pastebin.com/raw.php?i=2Zgbs7hi');
$abc = strip_tags($abc);
echo $abc;
Please, let me know whether it works
you may use
<?php
$a= file_get_contents('a.txt');
echo nl2br(htmlspecialchars($a));
?>
<?php
$handle = @fopen("pastebin.html", "r");
if ($handle) {
while (!feof($handle)) {
$buffer = fgetss($handle, 4096); // fgetss() strips tags; it was removed in PHP 8.0
echo $buffer;
}
fclose($handle);
}
?>
output is
Hi,
Ashley has explained that I need to ask for another line and broadband for the wifi to work, please can you arrange this.
Regards
Legal Disclaimer:
This email and its attachments are confidential. If you received it by mistake, please don’t share it. Let us know and then delete it. Its content does not necessarily represent the views of The Dragon Enterprise
Centre and we cannot guarantee the information it contains is complete. All emails are monitored and may be seen by another member of The Dragon Enterprise Centre's staff for internal use
You can probably write additional code to convert &nbsp; to spaces etc.
I'm not sure I did understand everything correctly but this seems to be your expected result:
$txt = file_get_contents('http://pastebin.com/raw.php?i=2Zgbs7hi');
var_dump(preg_replace("/(\&nbsp\;(\s{1,})?)+/", "\n", trim(strip_tags(preg_replace("/(\s){1,}/", " ", $txt)))));
//more readable
$txt = preg_replace("/(\s){1,}/", " ", $txt);
$txt = trim(strip_tags($txt));
$txt = preg_replace("/(\&nbsp\;(\s{1,})?)+/", "\n", $txt);
The strip_tags() function strips HTML and PHP tags from a string, if that is what you are trying to accomplish.
Examples from the docs:
<?php
$text = '<p>Test paragraph.</p><!-- Comment --> Other text';
echo strip_tags($text);
echo "\n";
// Allow <p> and <a>
echo strip_tags($text, '<p><a>');
?>
The above example will output:
Test paragraph. Other text
<p>Test paragraph.</p> Other text

Merging preg_match_all and preg_replace

I have some code running which finds out hashtags in the string and turns them into links. I have done this using preg_match_all as shown below:
if(preg_match_all('/(#[A-z_]\w+)/', $postLong, $arrHashTags) > 0){
foreach ($arrHashTags[1] as $strHashTag) {
$long = str_replace($strHashTag, ''.$strHashTag.'', $postLong);
}
}
Also, for my search script, I need to bold the searched keywords in the result string. Something similar to the below code using preg_replace:
$string = "This is description for Search Demo";
$searchingFor = "/" . $searchQuery . "/i";
$replacePattern = "<b>$0<\/b>";
preg_replace($searchingFor, $replacePattern, $string);
The problem that I am having is that both have to work together and should be thrown as a combined result. One way I can think of is to run the resultant string from preg_match_all with the preg_replace code but what if the tags and the searched string are the same? The second block will bold my tag as well which is not desired.
update the code i'm running based on the answer given below but it still doesn't work
if(preg_match_all('/(#[A-z_]\w+)/', $postLong, $arrHashTags) > 0){
foreach ($arrHashTags[1] as $strHashTag) {
$postLong = str_replace($strHashTag, ''.$strHashTag.'', $postLong);
}
}
And immediately after this, i run this
$searchingFor = "/\b.?(?<!#)" . $keystring . "\b/i";
$replacePattern = "<b>$0<\/b>";
preg_replace($searchingFor, $replacePattern, $postLong);
Just so you know, this is all going inside a while loop, which is generating the list
You just need to modify the search pattern to avoid matches that start with a '#'.
$postLong = "This is description for Search Demo";
if(preg_match_all('/(#[A-z_]\w+)/', $postLong, $arrHashTags) > 0){
foreach ($arrHashTags[1] as $strHashTag) {
$postLong = str_replace($strHashTag, ''.$strHashTag.'', $postLong);
}
}
# This expression finds any text with 0 or 1 characters in front of it
# and then does a negative look-behind to make sure that the character isn't a #
$searchingFor = "/\b.?(?<!#)" . $searchQuery . "\b/i";
$replacePattern = "<b>$0</b>";
$postLong = preg_replace($searchingFor, $replacePattern, $postLong);
Or if you don't need an array of the available hashes for another reason, you could use preg_replace only.
$postLong = "This is description for #Search Demo";
$patterns = array('/(#[A-z_]\w+)/', "/\b.?(?<!#)" . $searchQuery . "\b/i");
$replacements = array('$0', '<b>$0</b>'); // put your hashtag link markup around the first $0
$postLong = preg_replace($patterns, $replacements, $postLong);
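Note that the .? in the patterns above also consumes the character in front of the keyword, so that character ends up inside the bold tags. Below is a hedged alternative sketch that puts the check entirely in a look-behind and escapes the keyword with preg_quote. The function name bold_keyword is hypothetical, and the sketch assumes the keyword starts and ends with a word character.

```php
<?php
// Hypothetical helper: bold a keyword unless it is part of a hashtag
// or the tail of a longer word.
function bold_keyword(string $text, string $keyword): string
{
    // (?<![#\w]) fails if the previous character is '#' or a word character.
    $pattern = '/(?<![#\w])' . preg_quote($keyword, '/') . '\b/i';
    // preg_replace() returns null on failure; fall back to the input.
    return preg_replace($pattern, '<b>$0</b>', $text) ?? $text;
}
```

If hashtags are turned into links whose URLs contain the tag text, running this step before the link step keeps the keyword inside those URLs from being wrapped.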

Replace text ignoring HTML tags

I have a simple text with HTML tags, for example:
Once <u>the</u> activity reaches the resumed state, you can freely add and remove fragments to the activity. Thus, <i>only</i> while the activity is in the resumed state can the <b>lifecycle</b> of a <hr/> fragment change independently.
I need to replace some parts of this text while ignoring its HTML tags. For example, I need to replace the string Thus, <i>only</i> while with my string Hello, <i>its only</i> while. The text and the strings to be replaced are dynamic. I need your help with my preg_replace pattern.
$text = '<b>Some html</b> tags with <u>and</u> there are a lot of tags <i>in</i> this text';
$arrayKeys= array('Some html' => 'My html', 'and there' => 'is there', 'in this text' => 'in this code');
foreach ($arrayKeys as $key => $value)
$text = preg_replace('...$key...', '...$value...', $text);
echo $text; // output should be: <b>My html</b> tags with <u>is</u> there are a lot of tags <i>in</i> this code';
Please help me to find solution. Thank you
Basically we're going to build dynamic arrays of matches and patterns off of plain text using Regex. This code only matches what was originally asked for, but you should be able to get an idea of how to edit the code from the way I've spelled it all out. We're catching either an open or a close tag and white space as a passthru variable and replacing the text around it. This is setup based on two and three word combinations.
<?php
$text = '<b>Some html</b> tags with <u>and</u> there are a lot of tags <i>in</i> this text';
$arrayKeys= array(
'Some html' => 'My html',
'and there' => 'is there',
'in this text' =>'in this code');
function make_pattern($string){
$patterns = array(
'!(\w+)!i',
'#^#',
'! !',
'#$#');
$replacements = array(
"($1)",
'!',
//This next line is where we capture the possible tag or
//whitespace so we can ignore it and pass it through.
'(\s?<?/?[^>]*>?\s?)',
'!i');
$new_string = preg_replace($patterns,$replacements,$string);
return $new_string;
}
function make_replacement($replacement){
$patterns = array(
'!^(\w+)(\s+)(\w+)(\s+)(\w+)$!',
'!^(\w+)(\s+)(\w+)$!');
$replacements = array(
'$1\$2$3\$4$5',
'$1\$2$3');
$new_replacement = preg_replace($patterns,$replacements,$replacement);
return $new_replacement;
}
foreach ($arrayKeys as $key => $value){
$new_Patterns[] = make_pattern($key);
$new_Replacements[] = make_replacement($value);
}
//For debugging
//print_r($new_Patterns);
//print_r($new_Replacements);
$new_text = preg_replace($new_Patterns,$new_Replacements,$text);
echo $new_text."\n";
echo $text;
?>
Output
<b>My html</b> tags with <u>is</u> there are a lot of tags <i>in</i> this code
<b>Some html</b> tags with <u>and</u> there are a lot of tags <i>in</i> this text
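The same idea can be sketched more compactly with preg_replace_callback: join the search words with a group that captures whitespace and tags, then put the captured gaps back between the replacement words. This is only a sketch. replace_phrase_keep_tags is a hypothetical name, both phrases are assumed to have the same number of words, and the pattern does not check word boundaries or skip text inside attributes.

```php
<?php
// Hypothetical helper: replace a phrase word by word, leaving any tags
// and whitespace between the words untouched.
function replace_phrase_keep_tags(string $html, string $search, string $replace): string
{
    if (trim($search) === '') {
        return $html;
    }
    $searchWords  = preg_split('/\s+/', trim($search)) ?: [];
    $replaceWords = preg_split('/\s+/', trim($replace)) ?: [];
    if (count($searchWords) !== count($replaceWords)) {
        throw new InvalidArgumentException('Both phrases need the same number of words.');
    }
    // A gap between two words: any mix of whitespace and whole tags, captured.
    $gap = '((?:\s|<[^>]*>)+)';
    $quoted = array_map(fn(string $w): string => preg_quote($w, '~'), $searchWords);
    $pattern = '~' . implode($gap, $quoted) . '~';

    $result = preg_replace_callback(
        $pattern,
        function (array $m) use ($replaceWords): string {
            $out = $replaceWords[0];
            // $m[$i] is the gap that preceded the word at index $i in the match.
            for ($i = 1; $i < count($replaceWords); $i++) {
                $out .= $m[$i] . $replaceWords[$i];
            }
            return $out;
        },
        $html
    );
    return $result ?? $html;
}
```

Calling it once per entry of $arrayKeys from the question, feeding each result into the next call, applies all the replacements in turn.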
Here we go. This piece of code should work, assuming you respect only two constraints:
Pattern and replacement must have the same number of words. (Logical, since you want to keep position)
You must not split a word around a tag. (<b>Hel</b>lo World won't work.)
But if these are respected, this should work just fine !
<?php
// Splits a string in parts delimited with the sequence.
// '<b>Hey</b> you' becomes '~-=<b>~-=Hey~-=</b>~-= you' that make us get
// array ("<b>", "Hey", "</b>", " you")
function getTextArray ($text, $special) {
$text = preg_replace ('#(<.*>)#isU', $special . '$1' . $special, $text); // Adding spaces to make explode work fine.
return preg_split ('#' . $special . '#', $text, -1, PREG_SPLIT_NO_EMPTY);
}
$text = "
<html>
<div>
<p>
<b>Hey</b> you ! No, you don't have <em>to</em> go!
</p>
</div>
</html>";
$replacement = array (
"Hey you" => "Bye me",
"have to" => "need to",
"to go" => "to run");
// This is a special sequence that you must be sure to find nowhere in your code. It is used to split sequences, and will disappear.
$special = '~-=';
$text_array = getTextArray ($text, $special);
// $restore is the array that will finally contain the result.
// Now we're only storing the tags.
// We'll be storing the text later.
//
// $clean_text is the text without the tags, but with the special sequence instead.
$restore = array ();
$clean_text = ''; // initialize before appending to avoid an undefined-variable notice
for ($i = 0; $i < sizeof ($text_array); $i++) {
$str = $text_array[$i];
if (preg_match('#<.+>#', $str)) {
$restore[$i] = $str;
$clean_text .= $special;
}
else {
$clean_text .= $str;
}
}
// Here comes the tricky part.
// We wanna keep the position of each part of the text so the tags don't
// move after.
// So we're making the regex look like (~-=)*Hey(~-=)* you(~-=)*
// And the replacement look like $1Bye$2 me $3.
// So that we keep the separators at the right place.
foreach ($replacement as $regex => $newstr) {
$regex_array = explode (' ', $regex);
$regex = '(' . $special . '*)' . implode ('(' . $special . '*) ', $regex_array) . '(' . $special . '*)';
$newstr_array = explode (' ', $newstr);
$newstr = "$1";
for ($i = 0; $i < count ($regex_array) - 1; $i++) {
$newstr .= $newstr_array[$i] . '$' . ($i + 2) . ' ';
}
$newstr .= $newstr_array[count($regex_array) - 1] . '$' . (count ($regex_array) + 1);
$clean_text = preg_replace ('#' . $regex . '#isU', $newstr, $clean_text);
}
// Here we re-split one last time.
$clean_text_array = preg_split ('#' . $special . '#', $clean_text, -1, PREG_SPLIT_NO_EMPTY);
// And we merge with $restore.
for ($i = 0, $j = 0; $i < count ($text_array); $i++) {
if (!isset($restore[$i])) {
$restore[$i] = $clean_text_array[$j];
$j++;
}
}
// Now we reorder everything, and make it go back to a string.
ksort ($restore);
$result = implode ($restore);
echo $result;
?>
Will output Bye me ! No, you don't need to run!
[EDIT] Now supporting a custom pattern, which allows to avoid adding useless spaces.
