I am trying to extract the number 203 from this sample.
Here is the sample I am running the regex against:
<span class="crAvgStars" style="white-space:no-wrap;"><span class="asinReviewsSummary" name="B00KFQ04CI" ref="cm_cr_if_acr_cm_cr_acr_pop_" getargs="{"tag":"","linkCode":"sp1"}">
<img src="https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/customer-reviews/ratings/stars-4-5._CB192238104_.gif" width="55" alt="4.3 out of 5 stars" align="absbottom" title="4.3 out of 5 stars" height="12" border="0" /> </span>(203 customer reviews)</span>
Here is the code I am using, which does not work:
preg_match('/^\D*(\d+)customer reviews.*$/',$results[0], $clean_results);
echo "<pre>";
print_r( $clean_results);
echo "</pre>";
//expecting 203
It is just returning
<pre>array ()</pre>
Your regexp has two problems.
First, there are other numbers in the string before the number of customer reviews (like 4.3 out of 5 stars and height="12"), but \D* prevents matching that -- it only matches if there are no digits anywhere between the beginning of the string and the number of reviews.
Second, you have no space between (\d+) and customer reviews, but the input string has a space there.
There's no need to match any of the string before or after the part that contains the number of customer reviews. Just match the part you care about:
preg_match('/(\d+) customer reviews/',$results[0], $clean_results);
$num_reviews = $clean_results[1];
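To be safe, it may be worth checking the return value of preg_match before reading the capture group, since $clean_results[1] is unset when nothing matches. A minimal sketch, assuming a shortened stand-in ($html, hypothetical) for $results[0]:

```php
<?php
// Hypothetical, shortened stand-in for $results[0] from the question.
$html = '<img alt="4.3 out of 5 stars" height="12" border="0" /> </span>(203 customer reviews)</span>';

// preg_match() returns 1 on a match, 0 on no match, and false on error.
if (preg_match('/(\d+) customer reviews/', $html, $m) === 1) {
    $num_reviews = (int) $m[1]; // first capture group: the digits before " customer reviews"
} else {
    $num_reviews = null;        // pattern not found
}

var_dump($num_reviews); // int(203)
```

The earlier numbers (4.3, 5, 12) are skipped because none of them is directly followed by " customer reviews".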
I'm trying to get a list of products off a website including the individual product codes.
The product codes are 5-digit numbers, and the elements containing them range in complexity from
<p>Part Number: 67001</p>
to
<p>Part Number: 50545 – 450g Cartridge 50525 - 2.5kg Tub 50520 - 20kg Pail 50555 - 55kg Drum 50575 *Indent - 175kg Drum</p>
Unfortunately, the 5 digit patterns are throughout the web pages, so I can't just use /\d{5}/
I'm after a regex that extracts only the 5 digits in the Part Number elements and not from the rest of the web page.
Something like: /\<p\>Part\s*Number\:\s*((\d{5}) repeat this capture group n times)\<\/p\>/
I know I can do it by breaking the page down in stages and applying one regex after another. eg
1st stage /\<p\>Part\s*Number\:\s*.*?\<\/p\>/
2nd stage /\d{5}/
But is it possible to do it in one regex pattern, and if so, how?
I am far wiser now than I was a year ago, so I have completely scrubbed my original advice. The most reliable approach when trying to parse valid HTML is to use a DOM parser. XPath makes node/element hunting super easy. A regex pattern is still an appropriate tool once you have disqualified <p> tags that do not start with the Part Number keyword.
Code:
$html = <<<HTML
<p>Zip Code: 99501</p>
<p>Part Number: 67001</p>
<p>Part Number: 98765 - 10000kg capacity</p>
<p>Some dummy/interfering text. Part Number: 12345</p>
<p>Zip Codes: 99501, 99524 , 85001 and 72201</p>
<p>Part Number: 50545 – 450g Cartridge 50525 - 2.5kg Tub 50520 - 20kg Pail 50555 - 55kg Drum 50575 *Indent - 175kg Drum</p>
HTML;
$partnos = [];
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach ($xpath->query("//p[starts-with(., 'Part Number: ')]") as $node) {
    // echo "Qualifying text: {$node->nodeValue}\n";
    if (preg_match_all('~\b\d{5}\b~', $node->nodeValue, $matches)) {
        $partnos = array_merge($partnos, $matches[0]); // or array_push($partnos, ...$matches[0]);
    }
}
var_export($partnos);
Output:
array (
0 => '67001',
1 => '98765',
2 => '50545',
3 => '50525',
4 => '50520',
5 => '50555',
6 => '50575',
)
The xpath query says:
//p #find p tags at any level/position in the dom
[starts-with(. #with a substring at the start of the node's text
, 'Part Number: ')] #that literally matches "Part Number: "
The regex pattern uses word boundary metacharacters (\b) to differentiate part numbers from non-part numbers. If you need the pattern to be adjusted because of some data that is not represented in your question, let me know and I'll offer further guidance.
Finally, I did flirt with a pure regex solution that incorporated \G to "continue" matching after Part Number: OR a previous match, but this type of pattern is a little harder to conceptualize and, again, a DOM parser is a more stable tool than regex when processing valid HTML.
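For the curious, here is a rough sketch of what such a \G pattern could look like. This is my own illustration of the idea, not a tested production pattern: it assumes every qualifying element literally starts with <p>Part Number: and contains no nested tags, and the sample data is shortened.

```php
<?php
$html = <<<HTML
<p>Zip Code: 99501</p>
<p>Part Number: 67001</p>
<p>Some dummy/interfering text. Part Number: 12345</p>
<p>Part Number: 50545 - 450g Cartridge 50525 - 2.5kg Tub 50520 - 20kg Pail</p>
HTML;

// (?:<p>Part Number:|\G(?!\A))  start at a qualifying <p>, or continue where the previous match ended
// [^<]*?                        lazily skip text, but never cross into the next tag
// \b\K\d{5}\b                   forget what was skipped (\K) and keep only a standalone 5-digit run
$pattern = '~(?:<p>Part Number:|\G(?!\A))[^<]*?\b\K\d{5}\b~';

preg_match_all($pattern, $html, $matches);
var_export($matches[0]);
```

With this sample, $matches[0] holds 67001, 50545, 50525 and 50520. The zip code and the "interfering text" paragraph are skipped because neither <p> starts with the Part Number label.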
If I understood your question correctly you should just be able to do this:
Part\sNumber:\s(\d{5})
This assumes your string contains all of the Part Number elements, as shown below. Note that it captures only the first 5-digit code after each "Part Number:" label:
<p>Part Number: 67001</p>
<p>Part Number: 50545 – 450g Cartridge 50525 - 2.5kg Tub 50520 - 20kg Pail 50555 - 55kg Drum 50575 *Indent - 175kg Drum</p>
<p>Part Number: 23425 - 55kg Drum 50575 *Indent - 175kg Drum</p>
<p>Part Number: 52232</p>
I have to find the values between specific tags in an HTML page using PHP regex, but I only want to run preg_match_all if the page contains multiple values; otherwise, do nothing.
For example, if preg_match finds 4 values in the HTML, run preg_match_all in the next phase; if preg_match finds only 1 tag value, do nothing.
<td class="page">
<span class="my-tag">value1</span>
<span class="my-tag">value2</span>
<span class="my-tag">value3</span>
<span class="my-tag">value4</span>
</td>
preg_match('/<td class="page">(.*?)<\/td>/s', $html, $matches);
Now run preg_match_all in the next phase, because preg_match found 4 values:
preg_match_all('|\<span class="my-tag"\>(.*?)\</span\>|', $html, $string);
But if the HTML contains only 1 value, like this:
<td class="page">
<span class="my-tag">value1</span>
</td>
then do nothing.
Basically, from your preg_match, you would be getting an array back that looks like this:
Array
(
[0] => <td class="page">
<span class="my-tag">value1</span>
<span class="my-tag">value2</span>
<span class="my-tag">value3</span>
<span class="my-tag">value4</span>
</td>
[1] =>
<span class="my-tag">value1</span>
<span class="my-tag">value2</span>
<span class="my-tag">value3</span>
<span class="my-tag">value4</span>
)
With that, we can just go ahead and do the match - regardless of if it only found one match or multiple matches. (Because it's not going to hurt anything to match one item or four items, I am proposing to move the logic down in your code.) Then we can just count how many it found and store that in a variable named $count.
// CHECK TO SEE IF WE FOUND A MATCH
if (isset($matches[1])) {
    // GO AHEAD AND DO THE MATCH ON THE SPANS
    preg_match_all('~<span class="my-tag">(.*?)</span>~s', $string, $span_matches);
    $count = count($span_matches[1]);
    // IF WE FOUND MULTIPLE MATCHES, LIST THEM OUT
    if ($count > 1) {
        print 'COUNT IS: '.$count;
        print_r($span_matches[1]);
    }
    // WE DID NOT MATCH ANY SPAN TAGS
    elseif ($count == 0) {
        print 'COUNT IS ZERO - CRAP';
    }
    // IF WE ONLY FOUND ONE MATCH, WE DON'T NEED TO DO ANYTHING
    else {
        print 'COUNT IS EXACTLY 1 - DO NOTHING';
    }
}
// WE DID NOT FIND AN INITIAL MATCH TO BEGIN WITH
else {
    print 'WE DID NOT FIND A MATCH';
}
From there, it's just a simple if/else statement to do what you want with it.
Here is a working demo:
http://ideone.com/SiPiOx
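In case the demo link goes away, here is a condensed, self-contained sketch of the same idea. It makes one assumption that differs from the snippet above: the span search runs on the captured cell contents ($matches[1]) rather than on the whole page, so spans outside the td are ignored. $html and $values are hypothetical names.

```php
<?php
// Hypothetical sample input with two values inside the cell.
$html = <<<HTML
<td class="page">
<span class="my-tag">value1</span>
<span class="my-tag">value2</span>
</td>
HTML;

$values = []; // stays empty unless more than one span is found

// Stage 1: grab the contents of the table cell.
if (preg_match('~<td class="page">(.*?)</td>~s', $html, $matches)) {
    // Stage 2: collect every span value inside that cell.
    preg_match_all('~<span class="my-tag">(.*?)</span>~s', $matches[1], $span_matches);
    // Keep the values only when there is more than one; otherwise do nothing.
    if (count($span_matches[1]) > 1) {
        $values = $span_matches[1];
    }
}

print_r($values);
```

With a single span in the cell, $values would remain an empty array.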
You can combine multiple patterns in a single regex with the pipe character in parentheses:
preg_match('/(cats?|dogs?|re.*tion)/', $string, $matches);
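A quick sketch of that alternation in action, using a made-up subject string (an assumption, not from the question). preg_match_all is used here so that every alternative that matches is collected:

```php
<?php
// Hypothetical subject string for illustration.
$string = 'I have two dogs and a cat';

// Each alternative between the pipes is tried at every position.
preg_match_all('/(cats?|dogs?|re.*tion)/', $string, $matches);

print_r($matches[0]); // full matches, in the order they occur in the subject
```

Here $matches[0] contains 'dogs' and 'cat'. Be aware that re.*tion is greedy, so on longer text it can swallow everything from the first "re" to the last "tion" on a line.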
I have a string which contains text in a variety of HTML tags. I need to clean the HTML tags themselves, so the data between the < and > such that
<p class="MsoNormal" style="text-align: justify;">1939 After considerable negotiation between the Kemp estate and the Dunwich Trusts, the charter was purchased and returned to Dunwich.</p>
becomes
<p>1939 After considerable negotiation between the Kemp estate and the Dunwich Trusts, the charter was purchased and returned to Dunwich.</p>
I did this with
$value = preg_replace("/<p[^>]+>/", "<p>", $value);
But I need to preserve the contents of the <a> tags, within the string, but also clean the excess such as that style content.
I intend to do this by running a loop and extracting the anchor tag and then working on each anchor tag, splitting at the spaces and keeping the exploded array values starting with href=,title= etc etc.
But now my issue is this:
How can I split a string to take the contents of the <a> tag with a preg_split regex?
If I do
$value = preg_split("/<a[^>]+>/", $value);
Then value returns the content outside of the anchor tag, rather than inside the anchor tag. I do not know what is inside the anchor tag, so can only base the pattern on <a.......>
I want to make an array of anchor tags from a string, such that :
<h2>Headlines</h2>
Charter Returned to Dunwich in 1939
Thomas Gardner Visits Dunwich
Treasure Chest Purchases
Dunwich Charter 1215
Why did Dunwich have a Charter?
</div>
can give me:
$array[0] = 'a href="index.php?id=11"';
$array[1] = 'a href="index.php?id=10"';
$array[2] = 'a href="index.php?id=9"';
$array[3] = 'a href="index.php?id=8"';
$array[4] = 'a href="index.php?id=7"';
Use just preg_match_all:
$re = "/<a[^>]+>/";
$str = "<h2>Headlines</h2>\nCharter Returned to Dunwich in 1939 \nThomas Gardner Visits Dunwich \nTreasure Chest Purchases \nDunwich Charter 1215 \nWhy did Dunwich have a Charter? \n</div> ";
preg_match_all($re, $str, $matches);
$matches will contain:
a href="index.php?id=11"
a href="index.php?id=10"
a href="index.php?id=9"
a href="index.php?id=8"
a href="index.php?id=7"
Have a look at the demo program.
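If you want the tag contents without the surrounding angle brackets, as in the array from the question, a capture group does it. A minimal sketch with hypothetical input (the anchors and the style attribute below are assumed for illustration):

```php
<?php
// Hypothetical input: the anchor tags and their attributes are assumed.
$str = '<h2>Headlines</h2>'
     . '<a href="index.php?id=11" style="color:red">Charter Returned to Dunwich in 1939</a>'
     . '<a href="index.php?id=10">Thomas Gardner Visits Dunwich</a>';

// Group 1 captures everything between "<" and ">" of each opening <a ...> tag.
// The \s after "a" keeps tags such as <abbr> from matching; closing </a> tags never match.
preg_match_all('/<(a\s[^>]+)>/', $str, $matches);

print_r($matches[1]);
```

$matches[0] still holds the full tags including the brackets, while $matches[1] holds only the inner text, ready to be split on spaces for attribute filtering.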
I search in many threads in Stackoverflow but I didn't find anything relevant for my case.
Here is the source text :
<span class="red"><span>70</span><span style="display:none">1</span><span>,89</span> € TTC<br /></span>
I want to extract 70,89 with a regular expression.
So I tried :
<span class="red"><span>([0-9]+)(<\/span><span style="display:none">1<\/span><span>)(,[0-9]+)<\/span>
which returns an array (with preg_match_all in PHP) with 3 groups :
1/ 70
2/
</span><span style="display:none">1</span><span>
3/ ,89
I would like to exclude group 2 and merge 1 & 3.
So I also tried :
<span class="red"><span>([0-9]+)(?:<\/span><span style="display:none">1<\/span><span>)(,[0-9]+)<\/span>
but it returns :
70
,89
How can I merge the two groups?
Thanks a lot for your answers, I am going crazy searching for this regular expression! :)
Have a good day !
Just match the numbers that are wrapped with a plain <span>:
$str = '<span class="red"><span>70</span><span style="display:none">1</span><span>,89</span> € TTC<br /></span>';
if (preg_match_all('#<span>([,\d]+)</span>#', $str, $matches)) {
echo join('', $matches[1]);
}
// output: 70,89
This is the content of one mysql table field:
Flash LEDs: 0.5W
LED lamps: 5mm
Low Powers: 0.06W, 0.2W
Remarks(1): this is remark1
----------
Accessories: Light Engine
Lifestyle Lights: Ambion, Crane Fun
Office Lights: OL-Deluxe Series
Street Lights: Dolphin
Retrofits: SL-10A, SL-60A
Remarks(2): this is remark2
----------
Infrared Receiver Module: High Data Rate Short Burst
Optical Sensors: Ambient Light Sensor, Proximity Sensor, RGB Color Sensor
Photo Coupler: Transistor
Remarks(3): this is remark3
----------
Display: Dot Matrix
Remarks(4): this is remark4
Now, I want to read the remarks and store them in a variable. Remarks(1), Remarks(2), etc. are fixed. 'this is remark1', etc. come from form input fields, so they are flexible.
Basically what I need is: Read everything between 'Remarks(1):' and '--------' and save it in a variable.
Thanks for your help.
You can use regex:
preg_match_all("~Remarks\(([^)]+)\):([^\n]+)~", $str, $m);
As seen on ideone.
For each line of the form Remarks(X): Y, the regex puts X in match group 1 and Y in match group 2.
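A short sketch of how the two groups could be turned into a lookup keyed by remark number. The field content is abbreviated, and $remarks is a hypothetical name:

```php
<?php
// Abbreviated stand-in for the table field from the question.
$str = "Flash LEDs: 0.5W\nRemarks(1): this is remark1\n----------\n"
     . "Display: Dot Matrix\nRemarks(4): this is remark4";

preg_match_all('~Remarks\(([^)]+)\):([^\n]+)~', $str, $m);

// $m[1] holds the numbers, $m[2] the texts (with a leading space, hence trim).
$remarks = array_combine($m[1], array_map('trim', $m[2]));

print_r($remarks);
```

Because [^\n]+ stops at the end of the line, the dashed separator on the following line is never included in the remark.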
This would be a job for regular expressions, which allow you to match on exactly the kinds of rules your requirements express. Here is a tutorial for you.
Use a preg function for this, or alternatively use the explode and implode functions to get the correct result. Don't use substr, as it may not give the correct result.
Example of the explode and implode approach for your string:
$sdr = "Remarks(4): this is remark4";
$sdr1 = explode(":", $sdr);          // split on every colon
$frst = array_shift($sdr1);          // label before the first colon
$secnd = trim(implode(":", $sdr1));  // rejoin the rest so colons inside the remark survive
echo "First String - ".$frst;
echo "<br>";
echo "Second String - ".$secnd;
echo "<br>";
Output:
First String - Remarks(4)
Second String - this is remark4