PHP + Regex expressions - php

Below are 2 pager alert messages and I have a sore head of a time trying to extract the address and job details of the second message into a php string using Regex...
Here are 2 example messages:
0571040 15:45:21 30-04-12 ##ALERT F546356345 THEB8 STRUC1 SMELL OF SMOKE AND ALARM OPERATING 900 SOME ROAD SOMESUBURB /CROSSSTREET1 RD //CROSSTREET2 AV M 99 A1 (429085) CTHEB CBOROS PT28 [THEB]
0571040 15:45:21 30-04-12 ##ALERT F546356345 THEB8 STRUC1 SMELL OF SMOKE AND ALARM OPERATING 4 / 900 SOME ROAD SOMESUBURB /CROSSSTREET1 RD //CROSSTREET2 AV M 99 A1 (429085) CTHEB CBOROS PT28 [THEB]
You will note the second address has 4 / 900 at the start, or it could say Unit 4 / 900... and this is where my issue starts! The addresses come in different formats. I have "normal" numbered addresses and "Corner Of" addresses sorted elsewhere, but this one, No. 4 at 900 Some Road, has me stumped. The extra / has screwed my expression up... Help! :)
In my expression I use the first slash as the first cross street but in the second case above the first / is now a part of the address... Below is what I have so far:
function get_string_between2($string, $start, $end){
$string = " ".$string;
$ini = strpos($string,$start);
if ($ini == 0) return "";
$ini += strlen($start);
$len = strpos($string,$end,$ini) - $ini;
return substr($string,$ini,$len);
}
$fullstring = "$rawPage";
if ( strpos($fullstring, ' STRUC1 ')!== false )
{
$parsed = get_string_between2($fullstring, "STRUC1", "/");
}
$input = "$parsed";
preg_match('/([^0-9]+)(.*)/', $input, $matches);
$jobdet = "$matches[1]";
$jobadd = "$matches[2]";
Now this works fine for the top message and I get this as the result:
$jobdet = SMELL OF SMOKE AND ALARM OPERATING
$jobadd = 900 SOME ROAD SOMESUBURB
$firstcrossstreet = /CROSSSTREET1 RD
$secondcrossstreet = //CROSSSTREET2 AV
For the second message it's all wrong with this the result:
$jobdet = SMELL OF SMOKE AND ALARM OPERATING
$jobadd = 4
$firstcrossstreet = / 900 SOME ROAD SOMESUBURB /CROSSSTREET1 RD
$secondcrossstreet = //CROSSSTREET2 AV
I know it's the / causing it, but how can I make an expression that handles either case?

With regular expressions, you must escape the forward slash, because slashes are used to delimit the expression. A typical expression looks as follows:
/expression/modifiers
expression is your regex, and modifiers change the execution and result type, e.g.:
/<[^>]+>/g
This should return all HTML tags in a string. The regex is <[^>]+> and it sits between two forward slashes. Therefore you escape a forward slash inside the expression - \/ - to match a literal forward slash. (In PHP, use preg_match_all() rather than the g modifier, which the preg functions do not accept.)
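One way to handle both address forms is to make the unit prefix an optional part of the address, so that its slash is consumed before the pattern looks for the cross-street slash. The following is a minimal sketch, not a tested parser for every pager format. The function name parse_job() is hypothetical, and it assumes that the job details contain no digits, that a unit prefix looks like "4 / " or "Unit 4 / ", and that a cross-street slash is always followed directly by a non-space character.

```php
<?php
// Hypothetical helper: split a STRUC1 pager message into job details and address.
function parse_job(string $message): ?array
{
    $pattern = '~STRUC1\s+'
        . '(?<det>\D+?)\s+'                 // job details: shortest run without digits
        . '(?<add>'
        . '(?:(?:UNIT\s+)?\d+\s*/\s*)?'     // optional "4 / " or "Unit 4 / " prefix
        . '\d+.*?'                          // street number and the rest of the address
        . ')'
        . '\s+/(?=\S)~i';                   // stop at the first cross street ("/NAME")

    if (!preg_match($pattern, $message, $m)) {
        return null;
    }
    return ['jobdet' => $m['det'], 'jobadd' => $m['add']];
}

$msg = 'STRUC1 SMELL OF SMOKE AND ALARM OPERATING 4 / 900 SOME ROAD SOMESUBURB '
     . '/CROSSSTREET1 RD //CROSSTREET2 AV M 99 A1';
print_r(parse_job($msg));
```

Using ~ as the delimiter avoids having to escape the slashes inside the pattern.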

Related

How do I replace multiple instances of less than < in a php string that also uses strip_tags?

I have the following string stored in a database table that contains HTML I need to strip out before rendering on a web page (This is old content I had no control over).
<p>I am <30 years old and weight <12st</p>
When I have used strip_tags it is only showing I am.
I understand why strip_tags is doing that, so I need to replace the 2 instances of < with &lt;
I have found a regex that converts the first instance but not the 2nd, but I can't work out how to amend this to replace all instances.
/<([^>]*)(<|$)/
which results in I am currently <30 years old and less than
I have a demo here https://eval.in/1117956
It's a bad idea to try to parse HTML content with string functions, including regex functions (there are many topics on SO that explain why, search for them). HTML is too complicated for that.
The problem is that you have poorly formatted HTML over which you have no control.
There are two possible attitudes:
There's nothing to do: the data are corrupted, so the information is lost once and for all and you can't retrieve something that has disappeared. This is a perfectly acceptable point of view.
Maybe you can find another source for the same data somewhere, or you can choose to print the poorly formatted HTML as it is.
You can try to repair it. In this case you have to ensure that the document's problems are limited and can be solved (at least by hand).
In place of a direct string approach, you can use the PHP libxml implementation via DOMDocument. Even if the libxml parser will not give better results than strip_tags, it provides errors you can use to identify the kind of error and to find the problematic positions in the html string.
With your string, the libxml parser returns a recoverable error XML_ERR_NAME_REQUIRED with the code 68 on each problematic opening angle bracket. Errors can be seen using libxml_get_errors().
Example with your string:
$s = '<p>I am <30 years old and weight <12st</p>';
$libxmlErrorState = libxml_use_internal_errors(true);
function getLastErrorPos($code) {
$errors = array_filter(libxml_get_errors(), function ($e) use ($code) {
return $e->code === $code;
});
if ( !$errors )
return false;
$lastError = array_pop($errors);
return ['line' => $lastError->line - 1, 'column' => $lastError->column - 2 ];
}
define('XML_ERR_NAME_REQUIRED', 68); // xmlParseEntityRef: no name
$patternTemplate = '~(?:.*\R){%d}.{%d}\K<~A';
$dom = new DOMDocument;
$dom->loadHTML($s, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
while ( false !== $position = getLastErrorPos(XML_ERR_NAME_REQUIRED) ) {
libxml_clear_errors();
$pattern = vsprintf($patternTemplate, $position);
$s = preg_replace($pattern, '&lt;', $s, 1);
$dom = new DOMDocument;
$dom->loadHTML($s, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
}
echo $dom->saveHTML();
libxml_clear_errors();
libxml_use_internal_errors($libxmlErrorState);
$patternTemplate is a formatted string (see sprintf in the php manual) in which the placeholders %d stand for respectively the number of lines before and the position from the start of the line. (0 and 8 here)
Pattern details: The goal of the pattern is to reach the angle bracket position from the start of the string.
~ # my favorite pattern delimiter
(?:
.* # all character until the end of the line
\R # the newline sequence
){0} # reach the desired line
.{8} # reach the desired column
\K # remove all on the left from the match result
< # the match result is only this character
~A # anchor the pattern at the start of the string
Another related question in which I used a similar technique: parse invalid XML manually
try this
$string = '<p>I am <30 years old and weight <12st</p>';
$html = preg_replace('/^\s*<[^>]+>\s*|\s*<\/[^>]+>\s*\z/', '', $string);// remove html tags
$final = preg_replace('/[^A-Za-z0-9 !@#$%^&*().]/u', '', $html); //remove special character
A simple use of str_replace() would do it.
Replace the <p> and </p> with [p] and [/p].
Replace the < with &lt;.
Put the p tags back, i.e. replace the [p] and [/p] with <p> and </p>.
Code
<?php
$description = "<p>I am <30 years old and weight <12st</p>";
$d = str_replace(['[p]','[/p]'],['<p>','</p>'],
str_replace('<', '&lt;',
str_replace(['<p>','</p>'], ['[p]','[/p]'],
$description)));
echo $d;
RESULT
<p>I am &lt;30 years old and weight &lt;12st</p>
My guess is that here we might want to design a good right boundary to capture < in non-tags, maybe a simple expression similar to:
<(\s*[+-]?[0-9])
might work, since we should normally have numbers or signs right after <. [+-]?[0-9] would likely change, if we would have other instances after <.
Test
$re = '/<(\s*[+-]?[0-9])/m';
$str = '<p>I am <30 years old and weight <12st I am < 30 years old and weight < 12st I am <30 years old and weight < -12st I am < +30 years old and weight < 12st</p>';
$subst = '&lt;$1';
$result = preg_replace($re, $subst, $str);
echo $result;
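The same idea can be turned around: instead of listing what may follow a stray <, escape every < that cannot start a tag. This is only a sketch under the assumption that real tags always begin with < followed directly by a letter, /, ! or ?; the helper name escape_stray_lt() is made up for illustration. Text such as "a <b" would still be mistaken for a tag.

```php
<?php
// Hypothetical helper: escape every "<" that cannot be the start of a tag.
function escape_stray_lt(string $html): string
{
    // Negative lookahead: the "<" must NOT be followed by a letter, "/", "!" or "?".
    return preg_replace('~<(?![A-Za-z/!?])~', '&lt;', $html);
}

$s = '<p>I am <30 years old and weight <12st</p>';
$fixed = escape_stray_lt($s);
echo $fixed, "\n";             // <p>I am &lt;30 years old and weight &lt;12st</p>
echo strip_tags($fixed), "\n"; // I am &lt;30 years old and weight &lt;12st
```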

How to remove first </p> and last <p> tag in a string

I have a string like this:
$str = '<div class="content"><br />
<strong>0730</strong> – Check in direct to Compass at Marlin Wharf Berth 18</p> <p><strong>0800 </strong> – Depart Marlin Marina Cairns to the Great Barrier Reef</p>
<p><strong>1015</strong> – Arrive at your first Great Barrier Reef Location</p> <p><strong>1230</strong> – BBQ Lunch with fresh salads</p>
<p><strong>1300</strong> Cruise to 2nd Reef Location 1530 – Depart the Great Barrier Reef</p>
<p><strong>1730</strong> – Approximately Arrival time at Cairns Marina<br /> </div>';
I want to use preg_replace function to remove first </p> and last <p> tag because they are redundant, I have used this pattern but it didn't work.
$patterns = array(
'#^\s*</p>#',
'#<p>\s*$#',
);
$str = preg_replace( $patterns, '', $str );
You could do this with one expression:
$str = preg_replace("#(.*?)</p>(.*)<p>(.*)#s", "$1$2$3", $str, 1 );
This will do a non-greedy capture of text before the first </p>, then capture text greedily until <p> (which will be the last one because of the greediness). And finally the remaining text is also captured. The three captured groups are maintained, the 2 tags are not.
The s modifier is needed to allow the dot to also match new line characters.
Note that this does not check whether the removal is actually needed. It just does it, so if the HTML was already OK, you will get an undesirable result.
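If the check matters, it can be done with plain string functions before anything is removed. The sketch below is one possible approach, not part of the answer above: the function name strip_unbalanced_p() is hypothetical, and it assumes the tags are written exactly as <p> and </p>, with no attributes and no nesting.

```php
<?php
// Hypothetical helper: drop a leading stray </p> and a trailing stray <p>,
// but only when they really are unbalanced.
function strip_unbalanced_p(string $html): string
{
    $firstOpen  = strpos($html, '<p>');
    $firstClose = strpos($html, '</p>');
    // A closing tag that appears before any opening tag is redundant.
    if ($firstClose !== false && ($firstOpen === false || $firstClose < $firstOpen)) {
        $html = substr_replace($html, '', $firstClose, strlen('</p>'));
    }

    $lastOpen  = strrpos($html, '<p>');
    $lastClose = strrpos($html, '</p>');
    // An opening tag that appears after the last closing tag is redundant.
    if ($lastOpen !== false && ($lastClose === false || $lastOpen > $lastClose)) {
        $html = substr_replace($html, '', $lastOpen, strlen('<p>'));
    }
    return $html;
}

echo strip_unbalanced_p('a</p> <p>b</p> <p>c'), "\n"; // a <p>b</p> c
echo strip_unbalanced_p('<p>x</p>'), "\n";            // <p>x</p>
```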
This should do what you need
$str = '<div class="content"><br />
<strong>0730</strong> – Check in direct to Compass at Marlin Wharf Berth 18</p> <p><strong>0800 </strong> – Depart Marlin Marina Cairns to the Great Barrier Reef</p>
<p><strong>1015</strong> – Arrive at your first Great Barrier Reef Location</p> <p><strong>1230</strong> – BBQ Lunch with fresh salads</p>
<p><strong>1300</strong> Cruise to 2nd Reef Location 1530 – Depart the Great Barrier Reef</p>
<p><strong>1730</strong> – Approximately Arrival time at Cairns Marina<br /> </div>';
//Replace the first one, easy enough
$str = preg_replace('/<\/p>/', "", $str, 1);
$stringReplace = "<p>";
$stringLen = strlen($stringReplace);
//Get the position of the last one, with strrpos (reverse check)
$pos = strrpos($str, $stringReplace);
//Make sure there is one
if($pos !== false){
//If so, replace it with nothing
$str = substr_replace($str, "", $pos, $stringLen);
}
You can use something like this, but I haven't tested it myself, so let me and others know whether it works.
$text = "Quick \"brown fox jumps \"over\" the lazy\" dog";
$resault = Regex.Replace(text, "(?<=^[^\"]*)\"|\"(?=[^\"]*$)", "\"\"\"");

Solving 140 characters Twitter status limit with PHP regex

So, the text I want to post on Twitter is sometimes longer than 140 characters, so I need to check the length and then either leave it unchanged if it is shorter than 140, or slice the text into two pieces (the text and the link), take the text part and cut it to e.g. 100 characters, chopping off the rest.
Then take the now 100-character-long part and put it together with the URL.
How to do that?
my code so far:
if (strlen($status) < 140) {
// continue
} else {
// 1. slice the $status into $text and $url (every message has url so
// checking is not important right now
// 2. shorten the text to 100 char
// something like $text = substr($text, 0, 100); ?
// 3. put them back together
$status = $text . ' ' . $url;
}
How should I change my code? My biggest problem is the first part: separating the URL from the text.
Btw. each $status contains only 1 URL, so checking for multiple URLs is not necessary.
Example of a text that is longer than it should be:
What is now Ecuador was home to a variety of indigenous groups that were gradually incorporated into the Inca Empire during the fifteenth century. The territory was colonized by Spain during the sixteenth century, achieving independence in 1820 as part of Gran Colombia, from which it emerged as its own sovereign state in 1830. The legacy of both empires is reflected in Ecuador's ethnically diverse population, with most of its 15.2 million people being mestizos, followed by large minorities of European, Amerindian, and African descendant. https://en.wikipedia.org/wiki/Ecuador
should become in the end this:
What is now Ecuador was home to a variety of indigenous groups that were gradually incorporated int https://en.wikipedia.org/wiki/Ecuador
If you can be sure that the URL does not contain any spaces (no well-formed URL should) and that it is always present at the end, try it like this:
preg_match('/^(.*)\s(\S+)$/s', $status, $matches);
$text = $matches[1];
$url = $matches[2];
$text = substr($text, 0, 100);
But possibly the length of the text should be adapted to the length of the url, so you would use
$text = substr($text, 0, 140-strlen($url)-1);
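Put together, the steps could look like the following sketch. The function name fit_status() and the $limit parameter are made up for illustration, and the code assumes a single-byte encoding and a status that ends with its URL; for multi-byte text, mb_strlen() and mb_substr() would be the safer choice.

```php
<?php
// Hypothetical helper: shorten the text part so that "text URL" fits the limit.
function fit_status(string $status, int $limit = 140): string
{
    if (strlen($status) <= $limit) {
        return $status; // short enough, nothing to do
    }
    // Greedy (.*) backtracks to the last whitespace; (\S+) is the trailing URL.
    if (!preg_match('/^(.*)\s(\S+)$/s', $status, $m)) {
        return substr($status, 0, $limit); // no URL found, plain cut
    }
    $text = $m[1];
    $url  = $m[2];
    // Leave room for the URL plus one separating space.
    $room = max(0, $limit - strlen($url) - 1);
    return rtrim(substr($text, 0, $room)) . ' ' . $url;
}

$status = str_repeat('word ', 50) . 'https://en.wikipedia.org/wiki/Ecuador';
echo fit_status($status), "\n";
```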
$reg = '/\b(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)[-A-Z0-9+&@#\/%=~_|$?!:,.]*[A-Z0-9+&@#\/%=~_|$]/i';
$string = "What is now Ecuador was home to a variety of indigenous groups that were gradually incorporated into the Inca Empire during the fifteenth century. The territory was colonized by Spain during the sixteenth century, achieving independence in 1820 as part of Gran Colombia, from which it emerged as its own sovereign state in 1830. The legacy of both empires is reflected in Ecuador's ethnically diverse population, with most of its 15.2 million people being mestizos, followed by large minorities of European, Amerindian, and African descendant. https://en.wikipedia.org/wiki/Ecuador";
preg_match_all($reg, $string, $matches, PREG_PATTERN_ORDER);
$cut_string = substr($string, 0, (140-strlen($matches[0][0])-1));
$your_twitt = $cut_string . " " . $matches[0][0];
echo $your_twitt;
// outputs : "What is now Ecuador was home to a variety of indigenous groups that were gradually incorporated into t https://en.wikipedia.org/wiki/Ecuador"
This might be what you want :
$status = 'What is now Ecuador was home to a variety of indigenous groups that were gradually incorporated into the Inca Empire during the fifteenth century. The territory was colonized by Spain during the sixteenth century, achieving independence in 1820 as part of Gran Colombia, from which it emerged as its own sovereign state in 1830. The legacy of both empires is reflected in Ecuador\'s ethnically diverse population, with most of its 15.2 million people being mestizos, followed by large minorities of European, Amerindian, and African descendant. https://en.wikipedia.org/wiki/Ecuador';
if (strlen($status) < 140) {
echo 'Length ok';
} else {
// Number of 100-character pieces, rounded up so the tail is not lost
$totalPart = ceil(strlen($status)/100);
$fulltweet = array();
for ($i=0; $i < $totalPart; $i++) {
// Take up to 100 characters starting at the i-th block
$fulltweet[$i] = substr($status, $i * 100, 100);
}
}
If the string is 140 characters or longer, it is split into an array with up to 100 characters in each row.

preg_replace_callback highlight pattern not match in result

I have this code:
$string = 'The quick brown fox jumped over the lazy dog and lived to tell about it to his crazy moped.';
$text = explode("#", str_replace(" ", " #", $string)); //ugly trick to preserve space when exploding, but it works (faster than preg_split)
foreach ($text as $value) {
echo preg_replace_callback("/(.*p.*e.*d.*|.*a.*y.*)/", function ($matches) {
return " <strong>".$matches[0]."</strong> ";
}, $value);
}
The point of it is to be able to enter a sequence of characters (in the code above it's a fixed pattern), and it finds and highlights those characters in the matched word. The code I have now highlights the entire word. I'm looking for the most efficient way of highlighting the characters.
The result of the current code:
The quick brown fox jumped over the lazy dog and lived to tell about it to his crazy moped.
What I would like to have:
The quick brown fox jumped over the lazy dog and lived to tell about it to his crazy moped.
Did I take the wrong approach? It would be awesome if someone could point me in the right way, I've been searching for hours and didn't find what I was looking for.
EDIT 2:
Divaka's been a great help. Almost there.. I apologize if I haven't been clear enough on what my goal is. I will try to explain further.
- Part A -
One of the things I will be using this code for is a phone book. A simple example:
When following characters are entered:
Jan
I need it to match following examples:
Jan Verhoeven
Arjan Peters
Raj Naren
Jered Von Tran
The problem is that I will be iterating over the entire phone book, person-record by person-record. Each person also has email addresses, a postal address, maybe a website, an extra note, etc. This means that the text I'm actually searching can contain anything from letters, numbers, special characters (&#()%_- etc.), newlines, and most importantly spaces. So an entire record (csv) might contain the following info:
Name;Address;Email address;Website;Note
Jan Verhoeven;Veldstraat 2a, 3209 Herkstad;jan@werk.be;www.janophetwerk.be,jan@telemet.be;Jan die ik ontmoet heb op de bouwbeurs.\n Zelfstandige vertegenwoordiger van bouwmaterialen.
Raj Naren;Kerklaan 334, 5873 Biep;raj@werk.be;;Rechtstreekse contactpersoon bij Werk.be (#654 intern)
The \n is meant to be an actual newline. So if I search for @werk.be, I'd like to see both these records as a result.
- Part B -
Something else I want to use this for is searching song-texts. When I'm looking for a song and I can only remember it had to do something with ducks or docks and a circle, I would enter dckcircle and get the following result:
... and the ducks were all dancing in a great big circle, around the great big bonfire ...
To be able to fine-tune the searching I'd like to be able to limit the number of spaces (or any other character), because I would imagine it finding a simple pattern like eve in every song while I'm only looking for a song that has the exact word eve in it.
- Conclusion -
If I summarize this in pseudo-regex, for a search pattern abc with a max of 3 spaces in-between it would be something like this: (I might be totally off here)
(a)(any character, max 3 spaces)(b)(any character, max 3 spaces)(c)
Or more generic:
(a)({any character}{these characters with a limit of 3})(b)({any character}{these characters with a limit of 3})(c)
This can even be extended to this fairly easily I'm guessing:
(a)({any character}{these characters with a limit of 3}{not these characters})(b)({any character}{these characters with a limit of 3}{not these characters})(c)
(I know the ´{}´ brackets are not to be used that way in a regular expression, but I don't know how else to put it without using a character that has a meaning in regular expressions.)
If anyone wonders, I know the sql like statement would be able to do 80% (I'm guessing, might even be more) of what I'm trying to do, but I'm trying to avoid using a database to make this as portable as possible.
When the correct answer has been found, I'll clean this question (and the code) up and post the resulting php-class here (maybe I'll even put it up on github if that would be useful), so anyone looking for the same will have a fully working class to work with :).
I've come up with this. Tell me if it's what you want!
//$string = "The quick brown fox jumped over the lazy dog and lived to tell about it to his crazy moped.";
$string = "abcdefo";
//$pattern_array1 = array('a','y');
//$pattern_array2 = array('p','e','d');
$pattern_array1 = array('e','f');
$pattern_array2 = array('o');
$pattern_array2 = array('a','f');
$number_of_patterns = 2;
$regexp1 = generate_regexp($pattern_array1, 1);
$regexp2 = generate_regexp($pattern_array2, 2);
$string = preg_replace($regexp1["pattern"], $regexp1["replacement"], $string);
$string = preg_replace($regexp2["pattern"], $regexp2["replacement"], $string);
$string = transform_multimatched_chars($string);
// transforming other chars after transforming the multimatched ones
for($i = 1; $i <= $number_of_patterns; $i++) {
$string = str_replace("#{$i}", "<strong>", $string);
$string = str_replace("#/{$i}", "</strong>", $string);
}
echo $string;
function generate_regexp($pattern_array, $pattern_num) {
$regexp["pattern"] = "/";
$regexp["replacement"] = "";
$i = 0;
foreach($pattern_array as $key => $char) {
$regexp["pattern"] .= "({$char})";
$regexp["replacement"] .= "#{$pattern_num}\$". ($key + $i+1) . "#/{$pattern_num}";
if($key < count($pattern_array) - 1) {
$regexp["pattern"] .= "(?s)((?:(?!{$pattern_array[$key + 1]})(?!\s).)*)";
$regexp["replacement"] .= "\$".($key + $i+2) . "";
}
$i = $key + 1;
}
$regexp["pattern"] .= "/";
return $regexp;
}
function transform_multimatched_chars($string)
{
preg_match_all("/((#[0-9]){2,})(.*)((#\/[0-9]){2,})/", $string, $matches);
// change this for your purposes
$start_replacement = '<span style="color:red;">';
$end_replacement = '</span>';
foreach($matches[1] as $key => $match)
{
$string = str_replace($match, $start_replacement, $string);
$string = str_replace($matches[4][$key], $end_replacement, $string);
}
return $string;
}
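For the simpler case of a single search sequence, a more compact variant is to capture each searched character and each gap in its own group, then wrap only the character groups inside a callback. This is a rough sketch rather than the finished class the question asks for: highlight_sequence() is a hypothetical name, the gaps are assumed to contain no whitespace (so a match stays within one word), and the search string is assumed to consist of single-byte characters.

```php
<?php
// Hypothetical helper: highlight the characters of $needle, in order, inside words.
function highlight_sequence(string $needle, string $text): string
{
    // One capture group per searched character, escaped for use in a regex.
    $parts = array_map(function (string $c): string {
        return '(' . preg_quote($c, '~') . ')';
    }, str_split($needle));

    // Between two searched characters allow any run of non-space characters.
    $pattern = '~' . implode('(\S*?)', $parts) . '~i';

    return preg_replace_callback($pattern, function (array $m): string {
        $out = '';
        // Odd groups are searched characters, even groups are the gaps.
        for ($i = 1; $i < count($m); $i++) {
            $out .= ($i % 2) ? '<strong>' . $m[$i] . '</strong>' : $m[$i];
        }
        return $out;
    }, $text);
}

echo highlight_sequence('ay', 'the lazy dog'), "\n";
// the l<strong>a</strong>z<strong>y</strong> dog
```

Limiting the number of spaces in a gap, as described in the question, would mean replacing \S*? with a more specific gap pattern.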

How can I match a string between two other known strings and nothing else with REGEX?

I want to extract a string between two other strings. The strings happen to be within HTML tags but I would like to avoid a conversation about whether I should be parsing HTML with regex (I know I shouldn't and have solved the problem with stristr(), but would like to know how to do it with regular expressions).
A string might look like this:
...uld select “Apply” below.<br/><br/><b>Primary Location</b>: United States-Washington-Seattle<br/><b>Travel</b>: Yes, 75 % of the Time <br/><b>Job Type</b>: Standard<br/><b>Region</b>: US Service Lines: ASL - Business Intelligence<br/><b>Job</b>: Business Intelligence<br/><b>Capability Group</b>: Con/Sol - BI&C<br/><br/>LOC:USA
I am interested in <b>Primary Location</b>: United States-Washington-Seattle<br/> and want to extract 'United States-Washington-Seattle'
I tried '(?<=<b>Primary Location</b>:)(.*?)(?=<br/>)' which worked in RegExr but not PHP:
preg_match("/(?<=<b>Primary Location</b>:)(.*?)(?=<br/>)/", $description,$matches);
You used / as regex delimiter, so you need to escape it if you want to match it literally or use a different delimiter
preg_match("/(?<=<b>Primary Location</b>:)(.*?)(?=<br/>)/", $description,$matches);
to
preg_match("/(?<=<b>Primary Location<\/b>:)(.*?)(?=<br\/>)/", $description,$matches);
or this
preg_match("~(?<=<b>Primary Location</b>:)(.*?)(?=<br/>)~", $description,$matches);
Update
I just tested it on www.writecodeonline.com/php and
$description = "uld select “Apply” below.<br/><br/><b>Primary Location</b>: United States-Washington-Seattle<br/><b>Travel</b>: Yes, 75 % of the Time <br/><b>Job Type</b>: Standard<br/><b>Region</b>: US Service Lines: ASL - Business Intelligence<br/><b>Job</b>: Business Intelligence<br/><b>Capability Group</b>: Con/Sol - BI&C<br/><br/>LOC:USA";
preg_match("~(?<=<b>Primary Location</b>:)(.*?)(?=<br/>)~", $description, $matches);
print_r($matches);
is working. Output:
Array ( [0] => United States-Washington-Seattle [1] => United States-Washington-Seattle )
You can also get rid of the capturing group and do
$description = "uld select “Apply” below.<br/><br/><b>Primary Location</b>: United States-Washington-Seattle<br/><b>Travel</b>: Yes, 75 % of the Time <br/><b>Job Type</b>: Standard<br/><b>Region</b>: US Service Lines: ASL - Business Intelligence<br/><b>Job</b>: Business Intelligence<br/><b>Capability Group</b>: Con/Sol - BI&C<br/><br/>LOC:USA";
preg_match("~(?<=<b>Primary Location</b>:).*?(?=<br/>)~", $description, $matches);
print($matches[0]);
Output
United States-Washington-Seattle
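If the two known strings change from call to call, escaping them by hand gets error-prone. A small wrapper around preg_quote() can do it instead. This is a sketch only: the function name between() is hypothetical, it uses a capture group rather than lookarounds, and it trims the surrounding whitespace from the result.

```php
<?php
// Hypothetical helper: return the text between two literal strings, or null.
function between(string $start, string $end, string $subject): ?string
{
    // preg_quote() escapes regex metacharacters and the "~" delimiter.
    $pattern = '~' . preg_quote($start, '~') . '(.*?)' . preg_quote($end, '~') . '~s';
    return preg_match($pattern, $subject, $m) ? trim($m[1]) : null;
}

$description = 'below.<br/><br/><b>Primary Location</b>: United States-Washington-Seattle<br/><b>Travel</b>: Yes';
echo between('<b>Primary Location</b>:', '<br/>', $description), "\n";
// United States-Washington-Seattle
```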
