Extract certain text from a string with regex - php

I tried to extract a coded string from a string, for instance,
$string = 'Louise Bourgeois and Tracey Emin: Do Not Abandon Me [date]Until 31 August 2011[ /date ]';
$description = preg_replace('/\[(?: |\s)*([date]+)(?: |\s)*\](.*?)\[(?: |\s)*([\/date]+)(?: |\s)*\]/is', '',$string);
$date = preg_replace('/\[(?: |\s)*([date]+)(?: |\s)*\](.*?)\[(?: |\s)*([\/date]+)(?: |\s)*\]/is', '$3',$string);
echo $date;
result:
Louise Bourgeois and Tracey Emin: Do Not Abandon Me /date
intended result:
Until 31 August 2011
I got the $description right but I can't get the [date] right. Any ideas?

I think a rather simpler form would do:
#.*?\[\s*?date\s*?\](.*)\[\s*?/date\s*?\].*#
for instance?

([date]+)
This matches a run of one or more characters taken from the set d, a, t, e, in any order (so it would also match "eat" or "dddd"). [] are regex metacharacters that define a character class and will not be treated as literal characters for matching purposes. You'd probably want:
(\[date\]) and (\[\/date\])
to properly match those opening/closing "tags".
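As a hedged sketch of a complete fix: in the question's pattern the text between the tags is the second capture group, so replacing with $3 returns the closing tag name instead of the date. The pattern below is a simplified assumption (plain \s* inside the brackets rather than the original alternation), and it uses preg_match to read the capture group directly.

```php
<?php
$string = 'Louise Bourgeois and Tracey Emin: Do Not Abandon Me [date]Until 31 August 2011[ /date ]';

// Literal brackets are escaped; \s* tolerates spaces inside the tags.
$pattern = '~\[\s*date\s*\](.*?)\[\s*/date\s*\]~is';

// Description: the string with the whole [date]...[/date] block removed.
$description = trim(preg_replace($pattern, '', $string));

// Date: the first capture group, or null when no tag pair is present.
$date = preg_match($pattern, $string, $m) ? trim($m[1]) : null;

echo $description, "\n";
echo $date, "\n";
```

Using ~ as the delimiter avoids having to escape the slash in /date.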


How do I replace multiple instances of less than < in a php string that also uses strip_tags?

I have the following string stored in a database table that contains HTML I need to strip out before rendering on a web page (This is old content I had no control over).
<p>I am <30 years old and weight <12st</p>
When I use strip_tags, it only shows "I am".
I understand why strip_tags does that, so I need to replace the 2 instances of < with &lt;
I have found a regex that converts the first instance but not the second, and I can't work out how to amend it to replace all instances.
/<([^>]*)(<|$)/
which results in I am currently <30 years old and less than
I have a demo here https://eval.in/1117956
It's a bad idea to try to parse HTML content with string functions, including regex functions (many topics on SO explain why; search for them). HTML is too complicated for that.
The problem is that you have poorly formatted HTML over which you have no control.
There are two possible attitudes:
There's nothing to do: the data is corrupted, so information is lost once and for all and you can't retrieve something that has disappeared. This is a perfectly acceptable point of view.
Maybe you can find another source for the same data somewhere, or you can choose to print the poorly formatted HTML as is.
You can try to repair it. In this case you have to ensure that the document's problems are limited and can all be solved (at least by hand).
Instead of a direct string approach, you can use the PHP libxml implementation via DOMDocument. Even if the libxml parser does not give better results than strip_tags, it reports errors you can use to identify the kind of problem and to find the problematic positions in the HTML string.
With your string, the libxml parser returns a recoverable error XML_ERR_NAME_REQUIRED (code 68) on each problematic opening angle bracket. The errors can be inspected with libxml_get_errors().
Example with your string:
$s = '<p>I am <30 years old and weight <12st</p>';
$libxmlErrorState = libxml_use_internal_errors(true);
function getLastErrorPos($code) {
    $errors = array_filter(libxml_get_errors(), function ($e) use ($code) {
        return $e->code === $code;
    });
    if ( !$errors )
        return false;
    $lastError = array_pop($errors);
    return ['line' => $lastError->line - 1, 'column' => $lastError->column - 2 ];
}
define('XML_ERR_NAME_REQUIRED', 68); // xmlParseEntityRef: no name
$patternTemplate = '~(?:.*\R){%d}.{%d}\K<~A';
$dom = new DOMDocument;
$dom->loadHTML($s, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
while ( false !== $position = getLastErrorPos(XML_ERR_NAME_REQUIRED) ) {
    libxml_clear_errors();
    $pattern = vsprintf($patternTemplate, $position);
    $s = preg_replace($pattern, '&lt;', $s, 1);
    $dom = new DOMDocument;
    $dom->loadHTML($s, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
}
echo $dom->saveHTML();
libxml_clear_errors();
libxml_use_internal_errors($libxmlErrorState);
$patternTemplate is a format string (see sprintf in the PHP manual) in which the two %d placeholders stand for, respectively, the number of lines before the target line and the offset from the start of that line (0 and 8 here).
Pattern details: The goal of the pattern is to reach the angle bracket position from the start of the string.
~ # my favorite pattern delimiter
(?:
.* # all character until the end of the line
\R # the newline sequence
){0} # reach the desired line
.{8} # reach the desired column
\K # remove all on the left from the match result
< # the match result is only this character
~A # anchor the pattern at the start of the string
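To see the pattern in isolation, here is a small sketch that fills the template with the position from the example (line 0, column 8) and applies it once, without involving libxml. The replacement string &lt; is assumed to be the intended entity.

```php
<?php
$s = '<p>I am <30 years old and weight <12st</p>';

// Same template as above: skip N full lines, then M characters, then match "<".
$patternTemplate = '~(?:.*\R){%d}.{%d}\K<~A';

// Fill in line 0 and column 8 by hand instead of reading them from libxml.
$pattern = vsprintf($patternTemplate, [0, 8]);

// Replace only the bracket at that exact position.
$fixed = preg_replace($pattern, '&lt;', $s, 1);

echo $pattern, "\n";
echo $fixed, "\n";
```

Only the first stray bracket is escaped here; in the full loop each pass handles one more reported position.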
Another related question in which I used a similar technique: parse invalid XML manually
Try this:
$string = '<p>I am <30 years old and weight <12st</p>';
$html = preg_replace('/^\s*<[^>]+>\s*|\s*<\/[^>]+>\s*\z/', '', $string);// remove html tags
$final = preg_replace('/[^A-Za-z0-9 !@#$%^&*().]/u', '', $html); //remove special character
A simple use of str_replace() would do it.
Replace the <p> and </p> with [p] and [/p]
Replace the < with &lt;
Put the p tags back, i.e. replace the [p] and [/p] with <p> and </p>
Code
<?php
$description = "<p>I am <30 years old and weight <12st</p>";
$d = str_replace(['[p]','[/p]'],['<p>','</p>'],
        str_replace('<', '&lt;',
            str_replace(['<p>','</p>'], ['[p]','[/p]'],
                $description)));
echo $d;
RESULT
<p>I am &lt;30 years old and weight &lt;12st</p>
My guess is that here we might want to design a good right boundary to capture < in non-tags, maybe a simple expression similar to:
<(\s*[+-]?[0-9])
might work, since we should normally have numbers or signs right after <. [+-]?[0-9] would likely change, if we would have other instances after <.
Test
$re = '/<(\s*[+-]?[0-9])/m';
$str = '<p>I am <30 years old and weight <12st I am < 30 years old and weight < 12st I am <30 years old and weight < -12st I am < +30 years old and weight < 12st</p>';
$subst = '&lt;$1';
$result = preg_replace($re, $subst, $str);
echo $result;
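A minimal sketch of how this idea combines with strip_tags, assuming (as the answer does) that a stray < is always followed by an optionally signed number. It uses a lookahead instead of a capture group, so nothing has to be put back in the replacement.

```php
<?php
$str = '<p>I am <30 years old and weight < 12st</p>';

// Escape only a "<" that is directly followed by an optionally signed number.
$escaped = preg_replace('/<(?=\s*[+-]?[0-9])/', '&lt;', $str);

// Real tags are still intact, so strip_tags removes just <p> and </p>.
$text = strip_tags($escaped);

echo $escaped, "\n";
echo $text, "\n";
```

The entities stay encoded in $text, which is what you want when the result is printed into a web page.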

Str_replace alternative for following strings in PHP

In my script I replace every "," (comma) with a quotation mark plus a space.
But numbers such as 3,456,778 also get their commas converted. Is there a way to make the command ignore big numbers like that, so that it doesn't produce:
3" 456" 778"
Alternatively: if there is a quotation mark + space followed by any digit, convert that quotation mark + space back to a comma. I know how to do replacements with str_replace, but I don't know how to match any digit 0-9.
Any help? The result should be:
3,456,778
I think I need to elaborate. I need to convert this text:
Value=3,456,778,id=777
To:
Value=3,456,778" id=777"
But the problem is that it also converts the commas in between the numbers.
So it would be good if I could change my str_replace command to something like
"If the comma is not in between two numbers, then convert the comma to quotation mark + space". Is that possible?
What about this?
preg_replace("/,([^0-9]|$)/", "\"$1", $text);
This matches only commas that are not followed by a digit (or that end the string), so commas inside numbers are left alone.
For instance, this:
$text = "123,23 adas , asdsa d, asdasd sa 1234,234324,asdas 324324 234,";
echo $text; echo "<br/>";
echo preg_replace("/,([^0-9]|$)/", "\"$1", $text);
Will echo this:
123,23 adas , asdsa d, asdasd sa 1234,234324,asdas 324324 234,
123,23 adas " asdsa d" asdasd sa 1234,234324"asdas 324324 234"
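As a hedged variant, the rule from the question ("only convert a comma that is not in between two numbers") can be written with lookarounds, so the neighbouring characters are checked but not consumed. This is a sketch against the question's sample input, not a tested drop-in for the asker's script.

```php
<?php
$text = 'Value=3,456,778,id=777';

// Match a comma that is NOT preceded by a digit, or NOT followed by a digit.
// A comma with digits on both sides (3,456) fails both branches and is kept.
$result = preg_replace('/(?<!\d),|,(?!\d)/', '" ', $text);

echo $result, "\n";
```

Because the lookarounds are zero-width, the replacement can be a plain quotation mark plus space without re-inserting a captured character.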
It is not really clear from your description what you actually want to do.
This might be a step in the right direction, however:
preg_replace('/([0-9]+)" /', '\\1,', '3" 456" 778"');
Maybe not the best solution, but you can give it a try.
$copy_date = '3" 456" 778"';
$copy_date = preg_replace("(\"\s{1})", ",", $copy_date); // quote + whitespace becomes a comma
$copy_date1 = preg_replace("(\")", "", $copy_date); // drop the remaining trailing quote
print $copy_date1;
Output: 3,456,778

wrap words in string with regex

This is the string
(code)
Pivot: 96.75<br />Our preference: Long positions above 96.75 with targets # 97.8 & 98.25 in extension.<br />Alternative scenario: Below 96.75 look for further downside with 96.35 & 95.9 as targets.<br />Comment the pair has broken above its resistance and should post further advance.<br />
(text)
"Pivot: 96.75Our preference: Long positions above 96.75 with targets # 97.8 & 98.25 in extension.Alternative scenario: Below 96.75 look for further downside with 96.35 & 95.9 as targets.Comment the pair has broken above its resistance and should post further advance."
the result should be
(code)
<b>Pivot</b>: 96.75<br /><b>Our preference</b>: Long positions above 96.75 with targets # 97.8 & 98.25 in extension.<br /><b>Alternative scenario</b>: Below 96.75 look for further downside with 96.35 & 95.9 as targets.<br />Comment the pair has broken above its resistance and should post further advance.<br />
(text)
Pivot: 96.75Our preference: Long positions above 96.75 with targets # 97.8 & 98.25 in extension.Alternative scenario: Below 96.75 look for further downside with 96.35 & 95.9 as targets.Comment the pair has broken above its resistance and should post further advance.
The purpose:
Wrap all the words before the : sign.
I've tried this regex: ((\A )|(<br />))(?P<G>[^:]*):, but it only works in a Python environment. I need this in PHP:
$pattern = '/((\A)|(<br\s\/>))(?P<G>[^:]*):/';
$description = preg_replace($pattern, '<b>$1</b>', $description);
Thanks.
This preg_replace should do the trick:
preg_replace('#(^|<br ?/>)([^:]+):#m','$1<b>$2</b>:',$input)
I should start by saying that HTML operations are better done with a proper parser such as DOMDocument. This particular problem is straightforward, so regular expressions may work without too much hocus pocus, but be warned :)
You can use look-around assertions; this frees you from having to restore the neighbouring strings during the replacement:
echo preg_replace('/(?<=^|<br \/>)[^:]+(?=:)/m', '<b>$0</b>', $str);
First, the look-behind assertion matches either the start of each line or a preceding <br />. Then, any characters except the colon are matched; the look-ahead assertion makes sure it's followed by a colon.
The /m modifier is used to make ^ match the start of each line as opposed to \A which always matches the start of the subject string.
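A short, hedged run of the look-around version on a shortened form of the question's string (the sample text is abbreviated here for readability):

```php
<?php
$str = 'Pivot: 96.75<br />Our preference: Long positions above 96.75.<br />'
     . 'Comment the pair has broken above its resistance.<br />';

// Bold the label that starts a line or follows <br /> and ends right before a colon.
$result = preg_replace('/(?<=^|<br \/>)[^:]+(?=:)/m', '<b>$0</b>', $str);

echo $result, "\n";
```

The last segment contains no colon, so it is left untouched. Note that [^:]+ can also run across a <br />, so a colon-free segment followed by a labelled one would be bolded together with that label.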
The most "general" and least regex-expensive way to do this that I could come up with was this:
$parts = explode('<br', $str);//don't include space and `/`, as tags may vary
$formatted = '';
foreach($parts as $part)
{
    $formatted .= preg_replace('/^\s*[\/>]{0,2}\s*([^:]+:)/', '<b>$1</b>',$part).'<br/>';
}
echo $formatted;
Or:
$formatted = array();
foreach($parts as $part)
{
    $formatted[] = preg_replace('/^\s*[\/>]{0,2}\s*([^:]+:)/', '<b>$1</b>',$part);
}
echo implode('<br/>', $formatted);
Tested with the string from the question, this gave the following output:
Pivot: 96.75Our preference: Long positions above 96.75 with targets # 97.8 & 98.25 in extension.Alternative scenario: Below 96.75 look for further downside with 96.35 & 95.9 as targets.Comment the pair has broken above its resistance and should post further advance.
That being said, I do find this bit of data weird, and, if I were you, I'd consider str_replace or preg_replace-ing all breaks with PHP_EOL:
$str = preg_replace('/\<\s*br\s*\/?\s*\>/i', PHP_EOL, $str);//allow for any form of break tag
And then, your string looks exactly like the data I had to parse, and got the regex for that here:
$str = preg_replace(...);
$formatted = preg_replace('/^([^:\n\\]++)\s{0,}:((\n(?![^\n:\\]++\s{0,}:)|.)*+)/','<b>$1:</b>$2<br/>', $str);

String padding for a text template

I'm creating a PDF file from a txt-template with tcpdf (Example 8). The txt-template looks like this:
SALUTATION
FIRSTNAME LASTNAME
STREET CURRENTDATE
SOMEMOREINFORMATION MYWEBSITE
I replace those markers with the correct value. So that it would look like this:
Mr.
John Doe
Downingstreet 10 14th May, 2010
john@doe.com www.stackoverflow.com
In this example, when I replace the values, the indentation of the date depends on the length of the street name (which I don't want). I could solve this issue with str_pad, but the problem is that I normally use three columns and there are lines which only have content in col1 and col3, as in the last line. How can I solve that problem? Is there something like the "overwrite" function in Word, where the text you type simply overwrites what is already there?
Thanks in advance.
Count the length of the street string and then increase or reduce the left padding of the date accordingly.
You can use sprintf, e.g.
function something($street, $currentDate, $foo) {
    $s = sprintf('%-20s %-18s %s',
        $street,
        $currentDate,
        $foo
    );
    return $s;
}
echo something('streetA', '14th May, 2010', 'lalala'), "\n";
echo something('Downingstreet 10', '14th May, 2010', 'lalala'), "\n";
echo something('abcdefghijklmnopqrstuvwxyz 10', '14th May, 2010', 'lalala'), "\n";
prints
streetA              14th May, 2010     lalala
Downingstreet 10     14th May, 2010     lalala
abcdefghijklmnopqrstuvwxyz 10 14th May, 2010     lalala
(as you can see from the third line the width specification is the minimum length, so you might have to use something like substr())
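One hedged way to avoid the overflow without a separate substr() call is the precision part of the sprintf format: for %s it acts as a cutoff. The helper name formatRow and the column widths below are assumptions for illustration only.

```php
<?php
// Hypothetical helper: three fixed-width columns, each padded AND truncated.
function formatRow(string $col1, string $col2, string $col3): string {
    // %-20.20s: left-align, pad to 20 characters, cut off anything beyond 20.
    return sprintf('%-20.20s %-18.18s %s', $col1, $col2, $col3);
}

echo formatRow('Downingstreet 10', '14th May, 2010', 'lalala'), "\n";
echo formatRow('abcdefghijklmnopqrstuvwxyz 10', '14th May, 2010', 'lalala'), "\n";
// A row with an empty middle column still keeps the third column aligned.
echo formatRow('Some more information', '', 'www.stackoverflow.com'), "\n";
```

With these widths the third column always starts at offset 40. sprintf counts bytes, so multibyte text would need a different approach, and alignment in the PDF also depends on using a monospaced font.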
I presume you are just str_replace()'ing the placeholders with their values?
$streetPlaceHolder = 'STREET ';
$streetReplacement = str_pad('Downingstreet 10', strlen($streetPlaceHolder));
$template = str_replace($streetPlaceHolder, $streetReplacement, $template);
Presumably you will run into the same problem with SOMEMOREINFORMATION. This same solution can be used.
I realize you said str_pad was not an ideal solution for you. However, I do not understand why, even if you extend this to three columns. You can still get by with this method.

php preg_match_all html dates with slashes error

I'm trying to preg_match_all a date with slashes in it, sitting between 2 HTML tags; however, it's returning null.
Here is the HTML:
<td width='40%' align='right'class='SmallDimmedText'>Last Login: 11/14/2009</td>
Here is my preg_match_all() code
preg_match_all('/<td width=\'40%\' align=\'right\' class=\'SmallDimmedText\'>Last([a-zA-Z0-9\s\.\-\',]*)<\/td>/', $h, $table_content, PREG_PATTERN_ORDER);
where $h is the html above.
What am I doing wrong?
Thanks in advance.
It (from a quick glance) is because you are trying to match:
Last Login: 11/14/2009
With this regex:
Last([a-zA-Z0-9\s\.\-\',]*)
The regex doesn't contain the required characters of : and / which are included in the text string. Changing the required part of the regex to:
Last([a-zA-Z0-9\s\.\-\',:/]*)
Gives a match
Would it be better to simply use a DOM parser, and then perform the regex on the result of the DOM lookup? It makes for nicer regex...
EDIT
The other issue is that your HTML is:
...40%' align='right'class='SmallDimmedText'>...
Where there is no space between align='right' and class='SmallDimmedText'
However your regex for that section is:
...40%\' align=\'right\' class=\'SmallDimmedText\'>...
Where it is indicated there is a space.
Use a DOM parser. It will save you more headaches caused by subtle bugs than you can count.
Just to give you an idea of how simple it is to parse using Simple HTML DOM:
$html = str_get_html(...);
// find() without an index returns an array of matching elements
$elems = $html->find('.SmallDimmedText');
if ( count($elems) != 1 ){
    throw new Exception('Too many/few elements found');
}
$text = $elems[0]->plaintext;
//parsing here is only an example, but you have removed all
//the html so that any regex used is really simple.
$date = substr($text, strlen('Last Login: '));
$unixTime = strtotime($date);
I see at least two problems :
in your HTML string, there is no space between 'right' and class=, and there is one space there in your regex
you must add at least these 3 characters to the list of matched characters, between the [] :
':' (there is one between "Login" and the date),
' ' (there are spaces between "Last" and "Login", and between ":" and the date),
and '/' (between the date parts)
With this code, it seems to work better :
$h = "<td width='40%' align='right'class='SmallDimmedText'>Last Login: 11/14/2009</td>";
if (preg_match_all("#<td width='40%' align='right'class='SmallDimmedText'>Last([a-zA-Z0-9\s\.\-',: /]*)<\/td>#",
$h, $table_content, PREG_PATTERN_ORDER)) {
var_dump($table_content);
}
I get this output :
array
0 =>
array
0 => string '<td width='40%' align='right'class='SmallDimmedText'>Last Login: 11/14/2009</td>' (length=80)
1 =>
array
0 => string ' Login: 11/14/2009' (length=18)
Note I have also used :
# as a regex delimiter, to avoid having to escape slashes
" as a string delimiter, to avoid having to escape single quotes
My first suggestion would be to minimize the amount of text you have in the preg_match_all, why not just do between a ">" and a "<"? Second, I'd end up writing the regex like this, not sure if it helps:
#>.*[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}<#
That will look for the end of one tag, then any character, then a date, then the beginning of another tag.
I agree with Yacoby.
At the very least, remove all reference to any of the HTML specific and simply make the regex
preg_match_all('#Last Login: ([\d+/?]+)#', ...
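A minimal sketch of that stripped-down approach, with a slightly stricter date pattern and a conversion to a DateTime object. The month/day/year order is an assumption based on the sample value 11/14/2009.

```php
<?php
$h = "<td width='40%' align='right'class='SmallDimmedText'>Last Login: 11/14/2009</td>";

$date = null;
// "#" as delimiter means the slashes in the date need no escaping.
if (preg_match('#Last Login:\s*(\d{1,2}/\d{1,2}/\d{4})#', $h, $m)) {
    // "!" resets the time fields so only the parsed date is set.
    $date = DateTime::createFromFormat('!m/d/Y', $m[1]);
}

echo $date ? $date->format('Y-m-d') : 'no match', "\n";
```

Since the pattern no longer mentions the surrounding td attributes, the missing space before class= cannot break the match.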
