Repeating capture group with a regex pattern - php

I'm trying to get a list of products off a website including the individual product codes.
The product codes are 5 digit codes, the elements range in complexity from
<p>Part Number: 67001</p>
<p>Part Number: 50545 – 450g Cartridge 50525 - 2.5kg Tub 50520 - 20kg Pail 50555 - 55kg Drum 50575 *Indent - 175kg Drum</p>
Unfortunately, the 5 digit patterns are throughout the web pages, so I can't just use /\d{5}/
I'm after a regex that extracts only the 5 digits in the Part Number elements and not from the rest of the web page.
Something like: /\<p\>Part\s*Number\:\s*((\d{5}) repeat this capture group n times)\<\/p\>/
I know I can do it by breaking the page down in stages and applying one regex after another, e.g.:
1st stage /\<p\>Part\s*Number\:\s*.*?\<\/p\>/
2nd stage /\d{5}/
But is it possible to do it in one regex pattern, and if so, how?

I am far wiser now than I was a year ago, so I have completely scrubbed my original advice. The best / most reliable approach when trying to parse valid html is to use a dom parser. XPath makes node/element hunting super easy. A regex pattern is still an appropriate tool once you have disqualified <p> tags that do not contain the Part Number keyword.
Code: (Demo)
$html = <<<HTML
<p>Zip Code: 99501</p>
<p>Part Number: 67001</p>
<p>Part Number: 98765 - 10000kg capacity</p>
<p>Some dummy/interfering text. Part Number: 12345</p>
<p>Zip Codes: 99501, 99524 , 85001 and 72201</p>
<p>Part Number: 50545 – 450g Cartridge 50525 - 2.5kg Tub 50520 - 20kg Pail 50555 - 55kg Drum 50575 *Indent - 175kg Drum</p>
HTML;
$partnos = [];
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach ($xpath->query("//p[starts-with(., 'Part Number: ')]") as $node) {
    // echo "Qualifying text: {$node->nodeValue}\n";
    if (preg_match_all('~\b\d{5}\b~', $node->nodeValue, $matches)) {
        $partnos = array_merge($partnos, $matches[0]); // or array_push($partnos, ...$matches[0]);
    }
}
var_export($partnos);
Output:
array (
  0 => '67001',
  1 => '98765',
  2 => '50545',
  3 => '50525',
  4 => '50520',
  5 => '50555',
  6 => '50575',
)
The xpath query says:
//p                      # find p tags at any level/position in the dom
[starts-with(.           # with a substring at the start of the node's text
, 'Part Number: ')]      # that literally matches "Part Number: "
The regex pattern uses word boundary metacharacters (\b) to differentiate part numbers from non-part numbers. If you need the pattern to be adjusted because of some data that is not represented in your question, let me know and I'll offer further guidance.
Finally, I did flirt with a pure regex solution that incorporated \G to "continue" matching after Part Number: OR a previous match, but this type of pattern is a little harder to conceptualize, and again, a dom parser is a more stable tool than regex when processing valid html.
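For readers curious about the \G idea, it might look roughly like the sketch below. This is an assumption about what such a pattern could be, not the pattern the answer's author had in mind. It assumes each qualifying paragraph opens with exactly `<p>Part Number:` and sits on one line; the sample markup is a shortened copy of the answer's test data.

```php
<?php
// Shortened copy of the sample markup used in the answer above.
$html = <<<HTML
<p>Zip Code: 99501</p>
<p>Part Number: 67001</p>
<p>Part Number: 98765 - 10000kg capacity</p>
<p>Some dummy/interfering text. Part Number: 12345</p>
<p>Part Number: 50545 – 450g Cartridge 50525 - 2.5kg Tub 50520 - 20kg Pail</p>
HTML;

// (?:<p>Part Number:|\G(?!^))  start at a qualifying label, or resume where the
//                              previous match ended (but never at string start)
// (?:(?!</p>).)*?              skip ahead lazily without leaving the paragraph
// \b\K\d{5}\b                  discard the skipped text, keep a standalone 5-digit code
$pattern = '~(?:<p>Part Number:|\G(?!^))(?:(?!</p>).)*?\b\K\d{5}\b~';

preg_match_all($pattern, $html, $matches);
$partnos = $matches[0];
var_export($partnos);
```

Because \G only succeeds at the position where the last match ended, the chain breaks as soon as a `</p>` is reached, and the next match has to begin at another `<p>Part Number:` label. Paragraphs that merely mention "Part Number:" somewhere in the middle are skipped, which mirrors the starts-with() check in the XPath version.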

If I understood your question correctly, you should just be able to do this:
Part\sNumber:\s(\d{5})
Given that your string contains all the Part Numbers, as demonstrated below:
<p>Part Number: 67001</p>
<p>Part Number: 50545 – 450g Cartridge 50525 - 2.5kg Tub 50520 - 20kg Pail 50555 - 55kg Drum 50575 *Indent - 175kg Drum</p>
<p>Part Number: 23425 - 55kg Drum 50575 *Indent - 175kg Drum</p>
<p>Part Number: 52232</p>

Related

regexp php - negate 4 digit (year ) number

I am trying to separate and get a number out of a string that contains 2 similar HTML statements:
1 - <td class="center"><p class="texte">1914</p></td>
2 - <td class="center"><p class="texte">135.000</p></td>
So, I am looking for the number 135.000 and not the number 1914.
IMPORTANT: This is not US notation for number. 135.000 is actually one hundred and thirty five thousands.
I have tried things like ([1-9][0-9]{1,2}), but that will capture 191 out of statement 1 above, which is not intended.
Thanks
You are dealing with html, you need to use an html parser first (XPATH is your friend). Then you need the preg_match function to filter numbers with your desired format. Example:
$dom = new DOMDocument;
$dom->loadHTML($yourHtmlString);
$xp = new DOMXPath($dom);
// you need to register the function `preg_match` to use it in your xpath query
$xp->registerNamespace("php", "http://php.net/xpath");
$xp->registerPhpFunctions('preg_match');
// The xpath query
$targetNodeList = $xp->query('//td[@class="center"]/p[@class="texte"][php:functionString("preg_match", "~^[1-9][0-9]{0,2}(?:\.[0-9]{3})*$~", .) > 0]');
# //td[@class="center"]/p[@class="texte"]      describes the path in the DOM tree
# [php:functionString("preg_match", ...) > 0]  is the predicate that checks the content format
foreach ($targetNodeList as $node) {
    echo $node->nodeValue, PHP_EOL;
}
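If registering PHP functions inside XPath feels heavy, the same filtering can be done in an ordinary loop. The following is a minimal sketch under a few assumptions: the wrapping `<table>` markup is invented for the demo, and the pattern requires at least one dot-separated group of three digits, so that a bare year such as 1914 is rejected.

```php
<?php
// Hypothetical wrapper markup around the two cells from the question.
$html = '<table><tr>
<td class="center"><p class="texte">1914</p></td>
<td class="center"><p class="texte">135.000</p></td>
</tr></table>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

$amounts = [];
foreach ($xp->query('//td[@class="center"]/p[@class="texte"]') as $node) {
    $text = trim($node->nodeValue);
    // Require at least one ".ddd" group so a plain 4-digit year does not qualify.
    if (preg_match('~^[1-9][0-9]{0,2}(?:\.[0-9]{3})+$~', $text)) {
        // The dots are thousands separators here, so strip them before casting.
        $amounts[] = (int) str_replace('.', '', $text);
    }
}
var_export($amounts);
```

Keeping the format check in PHP makes the pattern easier to test on its own, at the cost of iterating over every candidate node.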
Give this a shot :)
\s*[\d.]+(?=<)
Here is the link:
Regex Example

php, strpos extract digit from string

I have a huge block of html to scan. Until now I have been using preg_match_all to extract the desired parts from it. The problem from the start was that it consumed a lot of CPU time, so we decided to try another extraction method. I read in some articles that preg_match can be compared in performance with strpos; they claim that strpos is up to 20 times more efficient than the regex scanner. I thought I would try this method, but I don't really know how to get started.
Let's say I have this html string:
<li id="ncc-nba-16451" class="che10">23 - Star</li>
<li id="ncd-bbt-5674" class="che10">54 - Moon</li>
<li id="ertw-cxda-c6543" class="che10">34,780 - Sun</li>
I want to extract only the number from each id and only the text (letters) from the content of the a tags, so I run this preg_match_all scan:
'/<li.*?id=".*?([\d]+)".*?<a.*?>.*?([\w]+)<\/a>/s'
here you can see the result: LINK
Now, if I wanted to switch to strpos, what would the approach look like? I understand that strpos returns the index where the match starts. But how can I use it to:
get all possible matches, not just one
extract numbers or text from desired place in string
Thank you for all the help and tips ;)
Using DOM
$html = '
<html>
<head></head>
<body>
<li id="ncc-nba-16451" class="che10">23 - Star</li>
<li id="ncd-bbt-5674" class="che10">54 - Moon</li>
<li id="ertw-cxda-c6543" class="che10">34,780 - Sun</li>
</body>
</html>';
$dom_document = new DOMDocument();
$dom_document->loadHTML($html);
$rootElement = $dom_document->documentElement;
$getId = $rootElement->getElementsByTagName('li');
$res = [];
foreach ($getId as $tag)
{
    $data = explode('-', $tag->getAttribute('id'));
    $res['li_id'][] = end($data);
}
$getNode = $rootElement->getElementsByTagName('a');
foreach ($getNode as $tag)
{
    $res['a_node'][] = $tag->parentNode->textContent;
}
print_r($res);
Output :
Array
(
    [li_id] => Array
        (
            [0] => 16451
            [1] => 5674
            [2] => c6543
        )
    [a_node] => Array
        (
            [0] => 23 - Star
            [1] => 54 - Moon
            [2] => 34,780 - Sun
        )
)
This regex finds a match in 24 steps using 0 backtracks
(?:id="[^\d]*(\d*))[^<]*(?:<a href="[^>]*>[^a-z]*([a-z]*))
The regex you posted requires 134 steps. Maybe you will notice a difference? Note that regex engines can optimize a pattern so that it minimizes backtracking. I used the debugger of RegexBuddy to arrive at these numbers.
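Neither answer shows the strpos approach the question asks about, so here is a minimal sketch of how it could look. The helper name extractIdNumbers is hypothetical, and the sketch only handles the id part of the task, assuming every id value is wrapped in double quotes. Whether it is actually faster than a tuned regex would need to be measured on the real data.

```php
<?php
// Hypothetical helper: collect the trailing digits of every id="..." value
// using only strpos/substr, with no regex engine involved.
function extractIdNumbers(string $html): array
{
    $numbers = [];
    $needle = 'id="';
    $offset = 0;
    // strpos() takes a start offset, so each call resumes after the previous hit.
    while (($start = strpos($html, $needle, $offset)) !== false) {
        $start += strlen($needle);
        $end = strpos($html, '"', $start);
        if ($end === false) {
            break; // unterminated attribute: stop scanning
        }
        $id = substr($html, $start, $end - $start);
        // Count the digits at the end of the id by measuring the leading
        // digit run of the reversed string.
        $digits = strspn(strrev($id), '0123456789');
        if ($digits > 0) {
            $numbers[] = substr($id, -$digits);
        }
        $offset = $end + 1;
    }
    return $numbers;
}

$html = '<li id="ncc-nba-16451" class="che10">23 - Star</li>
<li id="ncd-bbt-5674" class="che10">54 - Moon</li>
<li id="ertw-cxda-c6543" class="che10">34,780 - Sun</li>';

var_export(extractIdNumbers($html));
```

The loop answers both parts of the question: passing an offset to strpos yields every match rather than just the first, and substr between the two found positions extracts the text at the desired place.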

PHP Preg_split selecting the internal contents of an HTML tag

I have a string which contains text in a variety of HTML tags. I need to clean the HTML tags themselves, so the data between the < and > such that
<p class="MsoNormal" style="text-align: justify;">1939 After considerable negotiation between the Kemp estate and the Dunwich Trusts, the charter was purchased and returned to Dunwich.</p>
becomes
<p>1939 After considerable negotiation between the Kemp estate and the Dunwich Trusts, the charter was purchased and returned to Dunwich.</p>
I did this with
$value = preg_replace("/<p[^>]+>/", "<p>", $value);
But I need to preserve the contents of the <a> tags, within the string, but also clean the excess such as that style content.
I intend to do this by running a loop and extracting the anchor tag and then working on each anchor tag, splitting at the spaces and keeping the exploded array values starting with href=,title= etc etc.
But now my issue is this:
How can I split a string to take the contents of the <a> tag with a Preg_split regex ?
If I do
$value = preg_split("/<a[^>]+>/", $value);
Then $value contains the content outside of the anchor tags, rather than inside them. I do not know what is inside the anchor tag, so I can only base the pattern on <a.......>
I want to make an array of anchor tags from a string, such that :
<h2>Headlines</h2>
Charter Returned to Dunwich in 1939
Thomas Gardner Visits Dunwich
Treasure Chest Purchases
Dunwich Charter 1215
Why did Dunwich have a Charter?
</div>
can give me:
$array[0] = 'a href="index.php?id=11"';
$array[1] = 'a href="index.php?id=10"';
$array[2] = 'a href="index.php?id=9"';
$array[3] = 'a href="index.php?id=8"';
$array[4] = 'a href="index.php?id=7"';
Use just preg_match_all:
$re = "/<a[^>]+>/";
$str = "<h2>Headlines</h2>\nCharter Returned to Dunwich in 1939 \nThomas Gardner Visits Dunwich \nTreasure Chest Purchases \nDunwich Charter 1215 \nWhy did Dunwich have a Charter? \n</div> ";
preg_match_all($re, $str, $matches);
$matches will contain:
a href="index.php?id=11"
a href="index.php?id=10"
a href="index.php?id=9"
a href="index.php?id=8"
a href="index.php?id=7"
Have a look at the demo program.
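If the angle brackets should be left out of the results, as in the array the question asks for, a capture group can do that. The sketch below is only an illustration: the sample anchors, including the style attribute, are assumed, since the original markup is not shown.

```php
<?php
// Hypothetical sample: these anchors are assumed for illustration only.
$html = '<h2>Headlines</h2>
<a href="index.php?id=11" style="color: red;">Charter Returned to Dunwich in 1939</a>
<a href="index.php?id=10">Thomas Gardner Visits Dunwich</a>
</div>';

// Group 1 holds the tag name plus its attributes, without the angle brackets.
// \b keeps <a from also matching longer tag names such as <abbr> or <article>.
preg_match_all('~<(a\b[^>]*)>~i', $html, $matches);
$tags = $matches[1];
var_export($tags);
```

Each entry in $tags can then be split into attributes and filtered down to href, title and so on, as the question intends.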

How can I match a string between two other known strings and nothing else with REGEX?

I want to extract a string between two other strings. The strings happen to be within HTML tags, but I would like to avoid a conversation about whether I should be parsing HTML with regex (I know I shouldn't, and I have solved the problem with stristr(), but I would like to know how to do it with regular expressions).
A string might look like this:
...uld select “Apply” below.<br/><br/><b>Primary Location</b>: United States-Washington-Seattle<br/><b>Travel</b>: Yes, 75 % of the Time <br/><b>Job Type</b>: Standard<br/><b>Region</b>: US Service Lines: ASL - Business Intelligence<br/><b>Job</b>: Business Intelligence<br/><b>Capability Group</b>: Con/Sol - BI&C<br/><br/>LOC:USA
I am interested in <b>Primary Location</b>: United States-Washington-Seattle<br/> and want to extract 'United States-Washington-Seattle'
I tried '(?<=<b>Primary Location</b>:)(.*?)(?=<br/>)' which worked in RegExr but not PHP:
preg_match("/(?<=<b>Primary Location</b>:)(.*?)(?=<br/>)/", $description,$matches);
You used / as the regex delimiter, so you need to escape it wherever you want to match it literally, or use a different delimiter. Change
preg_match("/(?<=<b>Primary Location</b>:)(.*?)(?=<br/>)/", $description,$matches);
to
preg_match("/(?<=<b>Primary Location<\/b>:)(.*?)(?=<br\/>)/", $description,$matches);
or this
preg_match("~(?<=<b>Primary Location</b>:)(.*?)(?=<br/>)~", $description,$matches);
Update
I just tested it on www.writecodeonline.com/php and
$description = "uld select “Apply” below.<br/><br/><b>Primary Location</b>: United States-Washington-Seattle<br/><b>Travel</b>: Yes, 75 % of the Time <br/><b>Job Type</b>: Standard<br/><b>Region</b>: US Service Lines: ASL - Business Intelligence<br/><b>Job</b>: Business Intelligence<br/><b>Capability Group</b>: Con/Sol - BI&C<br/><br/>LOC:USA";
preg_match("~(?<=<b>Primary Location</b>:)(.*?)(?=<br/>)~", $description, $matches);
print_r($matches);
is working. Output:
Array ( [0] => United States-Washington-Seattle [1] => United States-Washington-Seattle )
You can also get rid of the capturing group and do
$description = "uld select “Apply” below.<br/><br/><b>Primary Location</b>: United States-Washington-Seattle<br/><b>Travel</b>: Yes, 75 % of the Time <br/><b>Job Type</b>: Standard<br/><b>Region</b>: US Service Lines: ASL - Business Intelligence<br/><b>Job</b>: Business Intelligence<br/><b>Capability Group</b>: Con/Sol - BI&C<br/><br/>LOC:USA";
preg_match("~(?<=<b>Primary Location</b>:).*?(?=<br/>)~", $description, $matches);
print($matches[0]);
Output
United States-Washington-Seattle
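To generalise this to any pair of known strings, the delimiters can be passed through preg_quote() so that characters such as / or ( need no manual escaping. The helper below, extractBetween, is a hypothetical name and only a sketch; it uses \K instead of a lookbehind so that whitespace after the start marker can be skipped.

```php
<?php
// Hypothetical helper: return the text between $start and $end, or null if not found.
function extractBetween(string $haystack, string $start, string $end): ?string
{
    // preg_quote() escapes regex metacharacters and the chosen delimiter (~).
    $pattern = '~' . preg_quote($start, '~')
        . '\s*\K.*?(?=\s*' . preg_quote($end, '~') . ')~s';
    return preg_match($pattern, $haystack, $m) ? $m[0] : null;
}

// Shortened version of the string from the question.
$description = '<br/><b>Primary Location</b>: United States-Washington-Seattle<br/>'
    . '<b>Travel</b>: Yes, 75 % of the Time <br/><b>Job Type</b>: Standard<br/>';

echo extractBetween($description, '<b>Primary Location</b>:', '<br/>'), PHP_EOL;
echo extractBetween($description, '<b>Travel</b>:', '<br/>'), PHP_EOL;
```

The \s* on both sides trims surrounding whitespace from the result, so no separate trim() call is needed.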

PHP: How to find the beginning and end of a substring in a string?

This is the content of one mysql table field:
Flash LEDs: 0.5W
LED lamps: 5mm
Low Powers: 0.06W, 0.2W
Remarks(1): this is remark1
----------
Accessories: Light Engine
Lifestyle Lights: Ambion, Crane Fun
Office Lights: OL-Deluxe Series
Street Lights: Dolphin
Retrofits: SL-10A, SL-60A
Remarks(2): this is remark2
----------
Infrared Receiver Module: High Data Rate Short Burst
Optical Sensors: Ambient Light Sensor, Proximity Sensor, RGB Color Sensor
Photo Coupler: Transistor
Remarks(3): this is remark3
----------
Display: Dot Matrix
Remarks(4): this is remark4
Now, I want to read the remarks and store them in a variable. Remarks(1), Remarks(2), etc. are fixed. 'this is remark1', etc. come from form input fields, so they are flexible.
Basically what I need is: Read everything between 'Remarks(1):' and '--------' and save it in a variable.
Thanks for your help.
You can use regex:
preg_match_all("~Remarks\(([^)]+)\):([^\n]+)~", $str, $m);
As seen on ideone.
The regex will put X in match group 1, Y in match group 2 (Remarks(X): Y)
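As a small usage sketch of the pattern above, the two match groups can be combined into an array keyed by the remark number. The $str value here is a shortened copy of the field content from the question; the variable name $remarks is an assumption.

```php
<?php
// Shortened copy of the field content from the question.
$str = <<<TXT
Flash LEDs: 0.5W
Remarks(1): this is remark1
----------
Accessories: Light Engine
Remarks(2): this is remark2
----------
TXT;

$remarks = [];
if (preg_match_all('~Remarks\(([^)]+)\):([^\n]+)~', $str, $m)) {
    // $m[1] holds the numbers, $m[2] the remark text (trimmed of the leading space).
    $remarks = array_combine($m[1], array_map('trim', $m[2]));
}
var_export($remarks);
```

Since [^\n]+ stops at the end of the line, the dashed separator on the following line never ends up in the captured remark.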
This would be a job for regular expressions, which allow you to match on exactly the kinds of rules your requirements express. Here is a tutorial for you.
Use a preg function for this; otherwise, you can use the explode and implode functions to get the correct result. Don't use substr, as it may not give the correct result.
Example of the implode and explode functions for your string:
$sdr = "Remarks(4): this is remark4";
$sdr1 = explode(":",$sdr);
$frst = $sdr1[0];
$sdr2 = array_shift($sdr1);
$secnd = implode(" ", $sdr1);
echo "First String - ".$frst;
echo "<br>";
echo "Second String - ".$secnd;
echo "<br>";
Output:
First String - Remarks(4)
Second String - this is remark4
