wrap words in string with regex - php

This is the string
(code)
Pivot: 96.75<br />Our preference: Long positions above 96.75 with targets # 97.8 & 98.25 in extension.<br />Alternative scenario: Below 96.75 look for further downside with 96.35 & 95.9 as targets.<br />Comment the pair has broken above its resistance and should post further advance.<br />
(text)
"Pivot: 96.75Our preference: Long positions above 96.75 with targets # 97.8 & 98.25 in extension.Alternative scenario: Below 96.75 look for further downside with 96.35 & 95.9 as targets.Comment the pair has broken above its resistance and should post further advance."
the result should be
(code)
<b>Pivot</b>: 96.75<br /><b>Our preference</b>: Long positions above 96.75 with targets # 97.8 & 98.25 in extension.<br /><b>Alternative scenario</b>: Below 96.75 look for further downside with 96.35 & 95.9 as targets.<br />Comment the pair has broken above its resistance and should post further advance.<br />
(text)
Pivot: 96.75Our preference: Long positions above 96.75 with targets # 97.8 & 98.25 in extension.Alternative scenario: Below 96.75 look for further downside with 96.35 & 95.9 as targets.Comment the pair has broken above its resistance and should post further advance.
The porpuse:
Wrap all the words before : sign.
I've tried this regex: ((\A )|(<br />))(?P<G>[^:]*):, but its working only on python environment. I need this in PHP:
$pattern = '/((\A)|(<br\s\/>))(?P<G>[^:]*):/';
$description = preg_replace($pattern, '<b>$1</b>', $description);
Thanks.

This preg_replace should do the trick:
preg_replace('#(^|<br ?/>)([^:]+):#m','$1<b>$2</b>:',$input)
PHP Fiddle - Run (F9)

I should start by saying that HTML operations are better done with a proper parser such as DOMDocument. This particular problem is straightforward, so regular expressions may work without too much hocus pocus, but be warned :)
You can use look-around assertions; this frees you from having to restore the neighbouring strings during the replacement:
echo preg_replace('/(?<=^|<br \/>)[^:]+(?=:)/m', '<b>$0</b>', $str);
Demo
First, the look-behind assertion matches either the start of each line or a preceding <br />. Then, any characters except the colon are matched; the look-ahead assertion makes sure it's followed by a colon.
The /m modifier is used to make ^ match the start of each line as opposed to \A which always matches the start of the subject string.

The most "general" and least regex-expensive way to do this that I could come up with was this:
$parts = explode('<br', $str);//don't include space and `/`, as tags may vary
$formatted = '';
foreach($parts as $part)
{
$formatted .= preg_replace('/^\s*[\/>]{0,2}\s*([^:]+:)/', '<b>$1</b>',$part).'<br/>';
}
echo $formatted;
Or:
$formatted = array();
foreach($parts as $part)
{
$formatted[] = preg_replace('/^\s*[\/>]{0,2}\s*([^:]+:)/', '<b>$1</b>',$part);
}
echo implode('<br/>', $formatted);
Tested with, and gotten this as output
Pivot: 96.75Our preference: Long positions above 96.75 with targets # 97.8 & 98.25 in extension.Alternative scenario: Below 96.75 look for further downside with 96.35 & 95.9 as targets.Comment the pair has broken above its resistance and should post further advance.
That being said, I do find this bit of data weird, and, if I were you, I'd consider str_replace or preg_replace-ing all breaks with PHP_EOL:
$str = preg_replace('/\<\s*br\s*\/?\s*\>/i', PHP_EOL, $str);//allow for any form of break tag
And then, your string looks exactly like the data I had to parse, and got the regex for that here:
$str = preg_replace(...);
$formatted = preg_replace('/^([^:\n\\]++)\s{0,}:((\n(?![^\n:\\]++\s{0,}:)|.)*+)/','<b>$1:</b>$2<br/>', $str);

Related

How do I replace multiple instances of less than < in a php string that also uses strip_tags?

I have the following string stored in a database table that contains HTML I need to strip out before rendering on a web page (This is old content I had no control over).
<p>I am <30 years old and weight <12st</p>
When I have used strip_tags it is only showing I am.
I understand why the strip_tags is doing that so I need to replace the 2 instances of the < with <
I have found a regex that converts the first instance but not the 2nd, but I can't work out how to amend this to replace all instances.
/<([^>]*)(<|$)/
which results in I am currently <30 years old and less than
I have a demo here https://eval.in/1117956
It's a bad idea to try to parse html content with string functions, including regex functions (there're many topics that explain that on SO, search them). html is too complicated to do that.
The problem is that you have poorly formatted html on which you have no control.
There're two possible attitudes:
There's nothing to do: the data are corrupted, so informations are loss once and for all and you can't retrieve something that has disappear, that's all. This is a perfectly acceptable point of view.
May be you can find another source for the same data somewhere or you can choose to print the poorly formatted html as it.
You can try to repair. In this case you have to ensure that all the document problems are limited and can be solved (at least by hand).
In place of a direct string approach, you can use the PHP libxml implementation via DOMDocument. Even if the libxml parser will not give better results than strip_tags, it provides errors you can use to identify the kind of error and to find the problematic positions in the html string.
With your string, the libxml parser returns a recoverable error XML_ERR_NAME_REQUIRED with the code 68 on each problematic opening angle bracket. Errors can be seen using libxml_get_errors().
Example with your string:
$s = '<p>I am <30 years old and weight <12st</p>';
$libxmlErrorState = libxml_use_internal_errors(true);
function getLastErrorPos($code) {
$errors = array_filter(libxml_get_errors(), function ($e) use ($code) {
return $e->code === $code;
});
if ( !$errors )
return false;
$lastError = array_pop($errors);
return ['line' => $lastError->line - 1, 'column' => $lastError->column - 2 ];
}
define('XML_ERR_NAME_REQUIRED', 68); // xmlParseEntityRef: no name
$patternTemplate = '~(?:.*\R){%d}.{%d}\K<~A';
$dom = new DOMDocument;
$dom->loadHTML($s, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
while ( false !== $position = getLastErrorPos(XML_ERR_NAME_REQUIRED) ) {
libxml_clear_errors();
$pattern = vsprintf($patternTemplate, $position);
$s = preg_replace($pattern, '<', $s, 1);
$dom = new DOMDocument;
$dom->loadHTML($s, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
}
echo $dom->saveHTML();
libxml_clear_errors();
libxml_use_internal_errors($libxmlErrorState);
demo
$patternTemplate is a formatted string (see sprintf in the php manual) in which the placeholders %d stand for respectively the number of lines before and the position from the start of the line. (0 and 8 here)
Pattern details: The goal of the pattern is to reach the angle bracket position from the start of the string.
~ # my favorite pattern delimiter
(?:
.* # all character until the end of the line
\R # the newline sequence
){0} # reach the desired line
.{8} # reach the desired column
\K # remove all on the left from the match result
< # the match result is only this character
~A # anchor the pattern at the start of the string
An other related question in which I used a similar technique: parse invalid XML manually
try this
$string = '<p>I am <30 years old and weight <12st</p>';
$html = preg_replace('/^\s*<[^>]+>\s*|\s*<\/[^>]+>\s*\z/', '', $string);// remove html tags
$final = preg_replace('/[^A-Za-z0-9 !##$%^&*().]/u', '', $html); //remove special character
Live DEMO
A simple use of str_replace() would do it.
Replace the <p> and </p> with [p] and [/p]
replace the < with <
put the p tags back i.e. Replace the [p] and [/p] with <p> and </p>
Code
<?php
$description = "<p>I am <30 years old and weight <12st</p>";
$d = str_replace(['[p]','[/p]'],['<p>','</p>'],
str_replace('<', '<',
str_replace(['<p>','</p>'], ['[p]','[/p]'],
$description)));
echo $d;
RESULT
<p>I am <30 years old and weight <12st</p>
My guess is that here we might want to design a good right boundary to capture < in non-tags, maybe a simple expression similar to:
<(\s*[+-]?[0-9])
might work, since we should normally have numbers or signs right after <. [+-]?[0-9] would likely change, if we would have other instances after <.
Demo
Test
$re = '/<(\s*[+-]?[0-9])/m';
$str = '<p>I am <30 years old and weight <12st I am < 30 years old and weight < 12st I am <30 years old and weight < -12st I am < +30 years old and weight < 12st</p>';
$subst = '<$1';
$result = preg_replace($re, $subst, $str);
echo $result;

how to do echo from a string, only from values that are between a specific stretch[href tag] of the string?

[PHP]I have a variable for storing strings (a BIIGGG page source code as string), I want to echo only interesting strings (that I need to extract to use in a project, dozens of them), and they are inside the quotation marks of the tag
but I just want to capture the values that start with the letter: N (news)
[<a href="/news7044449/exclusive_news_sunday_"]
<a href="/n[ews7044449/exclusive_news_sunday_]"
that is, I think you will have to work with match using: [a href="/n]
how to do that to define that the echo will delete all the texts of the variable, showing only:
note that there are other hrefs tags with values that start with other letters, such as the letter 'P' : href="/profiles... (This does not interest me.)
$string = '</div><span class="news-hd-mark">HD</span></div><p>exclusive_news_sunday_</p><p class="metadata"><span class="bg">Czech AV<span class="mobile-hide"> - 5.4M Views</span>
- <span class="duration">7 min</span></span></p></div><script>xv.thumbs.preparenews(7044449);</script>
<div id="news_31720715" class="thumb-block "><div class="thumb-inside"><div class="thumb"><a href="/news31720715/my_sister_running_every_single_morning"><img src="https://static-hw.xnewss.com/img/lightbox/lightbox-blank.gif"';
I imagine something like this:
$removes_everything_except_values_from_the_href_tag_starting_with_the_letter_n = ('/something regex expresion I think /' or preg_match, substring?);
echo $string = str_replace($removes_everything_except_values_from_the_href_tag_starting_with_the_letter_n,'',$string);
expected output: /news7044449/exclusive_news_sunday_
NOTE: it is not essential to be through a variable, it can be from a .txt file the place where the extracts will be extracted, and not necessarily a variable.
thanks.
I believe this will help her.
<?php
$source = file_get_contents("code.html");
preg_match_all("/<a href=\"(\/n(?:.+?))\"[^>]*>/", $source, $results);
var_export( end($results) );
Step by Step Regex:
Regex Demo
Regex Debugger
To get just the links out of the $results array from Valdeir's answer:
foreach ($results as $r) {
echo $r;
// alt: to display them with an HTML break tag after each one
echo $r."<br>\n";
}

preg match text between tags excluding same tag in between

Well I know there several questions similar but could not find any with this specific case.
I took one code and tweak it to my needs but now I'm founding a bug on it that I can't correct.
Code:
$tag = 'namespace';
$match = Tags::get($f, $tag);
var_dump($match);
static function get( $xml, $tag) { // http://stackoverflow.com/questions/3404433/get-content-within-a-html-tag-using-7-processing
// bug case string(56) "<namespaces>
// <namespace key="-2">Media</namespace>"
$tag_ini = "<{$tag}[^\>]*?>"; $tag_end = "<\\/{$tag}>";
$tag_regex = '/' . $tag_ini . '(.*?)' . $tag_end . '/si';
preg_match_all($tag_regex,
$xml,
$matches,
PREG_OFFSET_CAPTURE);
return $matches;
}
As you can see, there is a bug if the tag is nested:
<namespaces> <namespace key="-2">Media</namespace>
When it should return 'Media', or even the outer '<namespaces>' and then the inside ones.
I tried to add "<{$tag}[^\>|^\r\n ]*?>", ^\s+, changing the * to *?, and other few things that in best case turned to recognize only the bugged case.
Also tried "<{$tag}[^{$tag}]*?>" which gives blank, I suppose it nullifies itself.
I'm a newb on regex, I can tell that to fix this just is needed to add don't let open a new tag of the same type.
Or I could even use a hack answer for my use case, that excludes if the inside text has new line carriage.
Can anyone get the right syntax for this?
You can check an extract of the text here: http://pastebin.com/f2naN2S3
After the proposed change: $tag_ini = "<{$tag}\\b[^>]*>"; $tag_end = "<\\/{$tag}>"; it does work for the the example case, but not for this one:
<namespace key="0" />
<namespace key="1">Talk</namespace>
As it results in:
<namespace key="1">Talk"
It's because numbers and " and letters are considered inside word boundary. How could I address that?
The main problem is that you did not use a word boundary after the opening tag and thus, namespace in the pattern could also match namespaces tag, and many others.
The subsequent issue is that the <${tag}\b[^>]*>(.*?)<\/${tag}> pattern would overfire if there is a self-closing namespace tag followed with a "normal" paired open/close namespace tag. So, you need to either use a negative lookbehind (?<!\/) before the > (see demo), or use a (?![^>]*\/>) negative lookahead after \b (see demo).
So, you can use
$tag_ini = "<{$tag}\\b[^>]*(?<!\\/)>"; $tag_end = "<\\/{$tag}>";
This is probably not the idea answer, but I was messing with a regex generator:
<?php
# URL that generated this code:
# http://txt2re.com/index-php.php3?s=%3Cnamespace%3E%3Cnamespace%20key=%22-2%22%3EMedia%3C/namespace%3E&12&11
$txt='arstarstarstarstarstarst<namespace key="-2">Media</namespace>arstarstarstarstarst';
$re1='.*?'; # Non-greedy match on filler
$re2='(?:[a-z][a-z]+)'; # Uninteresting: word
$re3='.*?'; # Non-greedy match on filler
$re4='(?:[a-z][a-z]+)'; # Uninteresting: word
$re5='.*?'; # Non-greedy match on filler
$re6='(?:[a-z][a-z]+)'; # Uninteresting: word
$re7='.*?'; # Non-greedy match on filler
$re8='((?:[a-z][a-z]+))'; # Word 1
if ($c=preg_match_all ("/".$re1.$re2.$re3.$re4.$re5.$re6.$re7.$re8."/is", $txt, $matches))
{
$word1=$matches[1][0];
print "($word1) \n";
}
#-----
# Paste the code into a new php file. Then in Unix:
# $ php x.php
#-----
?>
This line is what I needed
$tag_ini = "<{$tag}\\b[^>|^\\/>]*>"; $tag_end = "<\\/{$tag}>";
Thank you very much you #Alison and #Wictor for your help and directions

php regular expression help needed on special charecters

here goes my code
$string="According to a report on the Times of In#dia, &#8220 Telan#gana Rashtra Samiti chief K Chandrasekhar #Rao has seen a #sinister motive behind the protests against the formation of Telangana";
preg_match_all('/(?!\b)(#\w+\b)/' ,$string, $matches);
foreach($matches[1] as $match){
$string = str_replace("$match","[h]".$match."[/h]",$string);
}
echo $string;
output
According to a report on the Times of In#dia, &[h]#8220[/h] Telan#gana
Rashtra Samiti chief K Chandrasekhar [h]#Rao[/h] has seen a
[h]#sinister[/h] motive behind the protests against the formation of
Telangana
i want to replace only the string starts with # but it also replacing &#8220 to &[h]#8220[/h] . please help me on this.
Try using a positive lookbehind since there's always a word boundary before a hash # :
/(?<=\s|^)(#\w+\b)/
Which makes sure there's either a space or the beginning of the string before the hashed word.
You can use this in a preg_replace:
$string="According to a report on the Times of In#dia, &#8220 Telan#gana Rashtra Samiti chief K Chandrasekhar #Rao has seen a #sinister motive behind the protests against the formation of Telangana";
$result = preg_replace('/(?<=\s|^)(#\w+\b)/', "[h]$1[/h]", $string);

Find a pattern within two or more sets of text

I have lots of data that I need to search through for certain patterns.
Problem is when looking for said patterns I have no reference to what I'm looking for.
Or in other words, I have two paragraphs. Each on similar topics. I need to be able to compare both paragraphs and find patterns. Phrases said in both paragraphs and how many times both were said.
Can't seem to find the solution because preg_match and other functions your required to supply the things your looking for.
Example paragraphs
Paragraph 1:
Bee Pollen is made by honeybees, and is the food of the young bee. It
is considered one of nature's most completely nourishing foods as it
contains nearly all nutrients required by humans. Bee-gathered pollens
are rich in proteins (approximately 40% protein), free amino acids,
vitamins, including B-complex, and folic acid.
Paragraph 2:
Bee Pollen is made by honeybees. It is required for the fertilization
of the plant. The tiny particles consist of 50/1,000-millimeter
corpuscles, formed at the free end of the stamen in the heart of the
blossom, nature's most completely nourishing foods. Every variety of
flower in the universe puts forth a dusting of pollen. Many orchard
fruits and agricultural food crops do, too.
So from those examples these patterns:
Bee Pollen is made by honeybees
and:
nature's most completely nourishing foods
Both phrases are found in both paragraphs.
This is potentially a complex question depending on whether you're looking for similar phrases or phrases that match word for word.
Finding exact word-for-word matches is quite simple all you need to do is split on common breaks like punctuation marks (e.g. .,;:) and perhaps on conjunctions as well (e.g. and or). However, the problem comes when you come to, for example, adjectives two phrases might be exactly the same but have one word different, like so:
The world is spinnnig around its axis at a tremendous speed.
The world is spinning around its axis at a magnificent speed.
This won't match because tremendous and magnificent are used in place of one another. Potentially you could work around this, however, that would be a more complex question.
Answer
If we stick to the simple side of things we can achieve phrase matching with just a few lines of code (4 in this example; not including the formatting for comments/readability).
$wordSplits = 'and or on of as'; //List of words to split on
preg_match_all('/(?<m1>.*?)([.,;:\-]| '.str_replace(' ', ' | ', trim($wordSplits)).' )/i', $para1, $matches1);
preg_match_all('/(?<m2>.*?)([.,;:\-]| '.str_replace(' ', ' | ', trim($wordSplits)).' )/i', $para2, $matches2);
$commonPhrases = array_filter( //Removes blank $key=>$value pairs
array_intersect( //Finds matching paterns
array_map(function($item){
return(strtolower(trim($item))); //Cleans array for $para1 values - removes leading and following spaces
}, $matches1['m1']),
array_map(function($item){
return(strtolower(trim($item))); //Cleans array for $para2 values - removes leading and following spaces
}, $matches2['m2'])
)
);
var_dump($commonPhrases);
/**
OUTPUT:
array(2) {
[0]=>
string(31) "bee pollen is made by honeybees"
[5]=>
string(41) "nature's most completely nourishing foods"
}
/*
The above code will find matches splitting both on punctuation (defined in [...] of the preg_match_all pattern) it will also concatenate the word list (matching only words in the word list with a preceding and following space).
Wordlist
You can change the word list to include any breaks you like, editing the list until you get the phrases you are after, examples:
$wordSplits = 'and or';
$wordSplits = 'and but if or';
$wordSplits = 'a an as and by but because if in is it of off on or';
Punctuation
You can add any punctuation marks you like into the list (between [ and ]), however remember that some characters do have special meanings and may need to be escaped (or placed appropriately): - and ^ should become \- and \^ or be placed where their special meaning doesn't come into play.
You may consider changing:
([.,;:\-]|
To:
([.,;:\-] | //Adding a space before the pipe
So that you only split punctuation marks which are followed by a space. For example: this would mean that items like 50,000 won't be split.
Spaces and breaks
You may also consider changing the spaces to \s so that tabs and newlines etc are included and not just spaces. Like so:
'/(?<m1>.*?)([.,;:\-]|\s'.str_replace(' ', '\s|\s', trim($wordSplits)).'\s)/i'
This would also apply to:
([.,;:\-]\s|
If you decide to go down that route.
I've been working on this code, don't know if it suits your needs... Feel free to expand it!
$p1 = "Bee Pollen is made by honeybees, and is the food of the young bee. It is considered one of nature's most completely nourishing foods as it contains nearly all nutrients required by humans. Bee-gathered pollens are rich in proteins (approximately 40% protein), free amino acids, vitamins, including B-complex, and folic acid.";
$p2 = "Bee Pollen is made by honeybees. It is required for the fertilization of the plant. The tiny particles consist of 50/1,000-millimeter corpuscles, formed at the free end of the stamen in the heart of the blossom, nature's most completely nourishing foods. Every variety of flower in the universe puts forth a dusting of pollen. Many orchard fruits and agricultural food crops do, too.";
// Strip strings of periods etc.
$p1 = strtolower(str_replace(array('.', ',', '(', ')'), '', $p1));
$p2 = strtolower(str_replace(array('.', ',', '(', ')'), '', $p2));
// Extract words from first paragraph
$w1 = explode(" ", $p1);
// Build search string
$search = '';
$found = array();
foreach ($w1 as $word) {
//echo 'Word: ' . $word . "<br />";
$search .= ' ' . $word;
$search = trim($search);
//echo '. . Search string: '. $search . "<br /><br />";
if (substr_count($p2, $search)) {
$old_search = $search;
$num_occured = substr_count($p2, $search);
//echo " . . . found!" . "<br /><br /><br />";
$add = TRUE;
} else {
//echo " . . . not found! Generating new search string: " . $word . '<br />';
if ($add) {
$found[] = array('pattern' => $old_search, 'occurences' => $num_occured);
$add = FALSE;
}
$old_search = '';
$search = $word;
}
}
print_r($found);
The above code finds occurences of patterns from the first string in the second one.
I'm sure it can be written better, but since it's past midnight (local time), I'm not as "fresh" as I'd like to be...
Codepad-link

Categories