I want to know the most optimized regular expression for capturing keywords in a piece of HTML.
Note that I am using PHP.
I have a piece of HTML code like this:
...
<li><span class="fl">
Dish</span>
<div class="oflow">
<span class="1F4446484E1FCB4FC3C21FC04AC6C21E232020211F underline">
pasta</span>
, <span class="1F4446484E1FCB4FC3C21FC04AC6C21E23202A251F underline">
rice</span>
, <span class="1F4446484E1FCB4FC3C21FC04AC6C21E2320202B1F underline">
potatoes</span>
</div>
</li>
...
I want to select the available dishes (pasta, rice and potatoes), knowing that the only word that is always the same is "Dish" and that each keyword I want to retrieve is always wrapped in a span.
Thank you in advance.
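<?php
// Assumes $sHtml holds only the <div class="oflow"> block, not the "Dish" label.
// "var" is only valid for class properties, so it is dropped here.
$aDishes = array_map('trim', explode(',', strip_tags($sHtml)));
?>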
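If a regex is still wanted, here is a minimal sketch of a two-step approach: first isolate the <div> that follows the "Dish" label, then collect the text of each <span> inside it. The variable names $sHtml and $aDishes follow the snippet above; the patterns assume the markup always has the shape shown in the question and will break if the structure changes.

```php
<?php
// Sample input copied from the question.
$sHtml = <<<HTML
<li><span class="fl">
Dish</span>
<div class="oflow">
<span class="1F4446484E1FCB4FC3C21FC04AC6C21E232020211F underline">
pasta</span>
, <span class="1F4446484E1FCB4FC3C21FC04AC6C21E23202A251F underline">
rice</span>
, <span class="1F4446484E1FCB4FC3C21FC04AC6C21E2320202B1F underline">
potatoes</span>
</div>
</li>
HTML;

$aDishes = [];
// Step 1: isolate the <div> that directly follows the "Dish" label.
// The s modifier lets "." match newlines.
if (preg_match('#Dish\s*</span>\s*<div[^>]*>(.*?)</div>#s', $sHtml, $m)) {
    // Step 2: capture the trimmed text of every <span> inside that block.
    preg_match_all('#<span[^>]*>\s*(.*?)\s*</span>#s', $m[1], $spans);
    $aDishes = $spans[1];
}
print_r($aDishes);
```

Anchoring on "Dish" keeps the match independent of the class values, which appear to change from span to span.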
I have a div which contains other HTML tags along with text.
I want to extract only the text from this div, including the text inside all nested HTML tags.
<div class="rpr-help m-chm">
<div class="header">
<h2 class="h6">Repair Help</h2>
</div><!-- /end .header -->
<div class="inner m-bsc">
<ul>
<li>Repair Video</li>
<li>Repair Q&A</li>
</ul>
</div>
<div>
<br>
<span class="h4">Cross Reference Information</span><br>
<p>Part Number 285753A (AP3963893) replaces 1195967, 280152, 285140, 285743, 285753, 3352470, 3363664, 3364002, 3364003, 62672, 62693, 661560, 80008, 8559748, AH1485646, EA1485646, PS1485646.
<br>
</p>
</div>
</div>
Here is my Regexp
preg_match_all("/<div class=\"rpr-help m-chm\">(.*)<\/.*>/s", $urlcontent, $description);
It works fine when I assign this complete div to the $urlcontent variable.
But when I fetch the data from a real URL, e.g. $urlcontent = "www.test.com/test.html";,
it returns the complete web page.
How can I get inside content of <div class="rpr-help m-chm"> ?
Does my regexp need any correction?
Any help would be appreciated. Thanks
It's not possible to reliably parse HTML/XHTML with regex. Source:
You can't parse [X]HTML with regex. Because HTML can't be parsed by
regex. Regex is not a tool that can be used to correctly parse HTML
Depending on the language you use, please consider using a third-party library for HTML parsing.
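Use this function:
function GetclassContent($tagStart, $tagEnd, $content)
{
    // Everything after the first occurrence of the start tag.
    $first_step = explode($tagStart, $content, 2);
    if (!isset($first_step[1])) {
        return ''; // start tag not found
    }
    // Cut at the first end tag; with nested <div>s this stops at the first inner </div>.
    $second_step = explode($tagEnd, $first_step[1], 2);
    return $second_step[0];
}
Steps to use the above function:
$website = "http://www.test.com/test.html"; // file_get_contents() needs the scheme to fetch a URL
$content = file_get_contents($website);
$tagStart = '<div class="rpr-help m-chm">';
$tagEnd = "</div>"; // no space inside the tag, otherwise it never matches
$RequiredContent = GetclassContent($tagStart, $tagEnd, $content);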
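As an alternative to string splitting, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath, which handle the nested <div>s. The sample markup is a shortened stand-in for the fetched page (the "other" div is made up for illustration); in real use $urlcontent would come from file_get_contents() with a full URL.

```php
<?php
// Shortened stand-in for the fetched page; the "other" div is hypothetical.
$urlcontent = <<<HTML
<html><body>
<div class="rpr-help m-chm">
<div class="header">
<h2 class="h6">Repair Help</h2>
</div>
<div class="inner m-bsc">
<ul>
<li>Repair Video</li>
<li>Repair Q&A</li>
</ul>
</div>
</div>
<div class="other">Unrelated</div>
</body></html>
HTML;

// Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($urlcontent);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Exact match on the class attribute value used in the question.
$nodes = $xpath->query('//div[@class="rpr-help m-chm"]');

$description = '';
if ($nodes !== false && $nodes->length > 0) {
    // textContent joins the text of the element and all its descendants;
    // runs of whitespace are collapsed to single spaces.
    $description = trim(preg_replace('/\s+/', ' ', $nodes->item(0)->textContent));
}
echo $description, PHP_EOL;
```

Because the parser tracks nesting, the result covers the whole outer div rather than stopping at the first inner </div>.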
I am trying to extract contents that lie outside two sets of html tags.
The HTML is set up like so:
<div class="col-md-4 col-sm-6 col-lg-3">
<small class="text-muted pull-right">4.4</small>
<i class="custom-icon"></i>
desired content to retrieve
<span class="text-muted">some other text here</span>
</div>
I need to retrieve the content "desired content to retrieve" which lies after the </i> and before the <span class="text-muted">.
I've tried:
$custom_regex= '#</i>(.*?)<span class="text-muted">#';
$text_scan = preg_match_all( $custom_regex, $content_to_scan, $text_array );
with no success. The $text_array variable returns empty.
I'm not that great with regex, so maybe my expression is incorrect for what I'm after.
Wouldn't lookarounds work better?
(?<=<\/i>)\s*(.*?)\n.*(?=<span)
Demo: https://regex101.com/r/zK2wD8/8
If you insist on regex, try this (PCRE in PHP has no g modifier; use preg_match_all to get every match):
/<\/i>\s*(.*?)\n.*<span class="text-muted"/
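The original pattern most likely returns nothing because "." does not match newlines by default, and the wanted text sits on its own line. A minimal sketch of the same pattern with the s modifier added, using the variable names from the question and the sample markup as input:

```php
<?php
// Sample input copied from the question.
$content_to_scan = <<<HTML
<div class="col-md-4 col-sm-6 col-lg-3">
<small class="text-muted pull-right">4.4</small>
<i class="custom-icon"></i>
desired content to retrieve
<span class="text-muted">some other text here</span>
</div>
HTML;

// The s modifier lets "." match newlines; the \s* on both sides of the
// capture group keeps surrounding whitespace out of the result.
$custom_regex = '#</i>\s*(.*?)\s*<span class="text-muted">#s';
$text_scan = preg_match_all($custom_regex, $content_to_scan, $text_array);

// $text_array[1] holds the captured groups, one entry per match.
print_r($text_array[1]);
```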
I have this html
<div class="price-box">
<p class="old-price">
<span class="price-label">This:</span>
<span class="price" id="old-price-326">
8,69 € </span>
</p>
<p class="special-price">
<span class="price-label">This is:</span>
<span class="price" id="product-price-326">
1,99 € </span> <span style="">/ 6.87 </span>
</p>
</div>
I need to get "1,99 €", but the number in the id 'product-price-326' is generated randomly. How can I match 'product-price-*'? I tried
foreach($preke->find('span[id="product-price-[0-9]"]') as $div)
and
foreach($preke->find('span[id="product-price-"]') as $div)
but it doesn't work.
As per my comment, here's what you need to do:
foreach($preke->find('span[id^="product-price-"]') as $div) {} // note the ^ before the =
^= means starts with.
I am not sure what $preke is, but if it's a DOM selector that supports proper attribute selectors you can use
$preke->find('span[id^="product-price"]')
or
$preke->find('span[id*="product-price"]')
The ^= tells it to look for elements that have an ID starting with "product-price" and the *= tells it to look for elements that have an ID that contains "product-price".
Try this; it might work:
foreach($preke->find('span[id^="product-price-"]') as $div) { /* Code */ }
Why not get it using the class?
echo $preke->find('.special-price', 0)->find('.price', 0)->plaintext;
This will get you 1,99 €.
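If $preke's selector engine is not available, the same "starts with" lookup can be sketched with PHP's built-in DOMDocument and the XPath starts-with() function. This is only an illustration on the markup from the question; the <meta> prefix is an assumption added here to hint that the input is UTF-8 so the € sign survives parsing.

```php
<?php
// Sample input copied from the question.
$html = <<<HTML
<div class="price-box">
<p class="old-price">
<span class="price-label">This:</span>
<span class="price" id="old-price-326">
8,69 € </span>
</p>
<p class="special-price">
<span class="price-label">This is:</span>
<span class="price" id="product-price-326">
1,99 € </span> <span style="">/ 6.87 </span>
</p>
</div>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
// Assumption: the input is UTF-8; the meta tag tells the parser so.
$dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$prices = [];
// starts-with() is the XPath counterpart of the ^= attribute selector.
foreach ($xpath->query('//span[starts-with(@id, "product-price-")]') as $span) {
    $prices[] = trim($span->textContent);
}
print_r($prices);
```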
I'm trying to get the content from the html tags
function get_model($html){
return preg_match('!<b>Model:</b>(.*?)<br>!i', $html, $matches) ? $matches[1] : '';
}
But it returns an empty string.
The entire html code looks like:
<div class="prodInfo">
<div class="prodOptions">
<div class="redBtn">
-
<input type="text" class="tnyTxt" value="1" name="quantity"/>
+
</div>
<br/>
<a href="/0-30cb9a-adjustable-pan-connector-p-mw555"
onclick="addToCart(139, $('.tnyTxt').val() ); return false;" class="redBtn"
id="button-cart">Add to Cart</a>
</div>
<p>
<b>Our Price: <span class="price">£5.55</span></b><br/>
<span class="grey">
(Exc. 20% VAT)<br/>
(£6.66 Inc. VAT)
</span>
</p>
<p>
<b>Model:</b> MW555<br/>
<b>Availability:</b> 2 - 3 Days</p>
</div>
I don't quite understand why this happens. Even if I write preg_match('!<b>Model:</b>) it also returns an empty result. Could you help me please?
Please use the PHP Simple HTML DOM Parser.
This question also has a duplicate:
How parse HTML in PHP?
I'd recommend using phpQuery for this job.
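For what it's worth, the pattern in the question looks for <br> while the markup uses <br/>, which would explain the empty result for the sample shown. A minimal sketch of a more tolerant pattern follows; the function name get_model_fixed is made up here to keep it apart from the original get_model.

```php
<?php
// Hypothetical variant of get_model() that accepts <br>, <br/> and <br />.
function get_model_fixed($html) {
    // \s* on both sides of the group trims the captured value;
    // <br\s*/?> matches the tag with or without the self-closing slash.
    return preg_match('!<b>Model:</b>\s*(.*?)\s*<br\s*/?>!i', $html, $matches)
        ? $matches[1]
        : '';
}

// Fragment copied from the question's markup.
$html = <<<HTML
<p>
<b>Model:</b> MW555<br/>
<b>Availability:</b> 2 - 3 Days</p>
HTML;

echo get_model_fixed($html), PHP_EOL;
```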
Unfortunately I really cannot get my head around regular expressions so my last resort is to ask the help of you fine people.
I have this existing code:
<li id="id-21" class="listClass" data-author="newbie">
<div class="someDiv">
<span class="spanClass">Some content</span>
</div>
<div class="controls faint">
Link 2
Link 3
</div>
</li>
Due to a number of reasons, I have to use preg_replace to inject an additional piece of code:
Link 1
I think you can guess where that should go, but for the sake of clarity, my desire is for the resulting string to look like:
<li id="id-21" class="listClass" data-author="newbie">
<div class="someDiv">
<span class="spanClass">Some content</span>
</div>
<div class="controls faint">
Link 1
Link 2
Link 3
</div>
</li>
Can anyone help me with the appropriate regular expression to achieve this?
Try this:
$html = '<li id="id-21" class="listClass" data-author="newbie">
<div class="someDiv">
<span class="spanClass">Some content</span>
</div>
<div class="controls faint">
Link 2
Link 3
</div>
</li>';
$eleName = 'a';
$eleAttr = 'href';
$eleAttrValue = 'link2';
$addBefore = 'Link 1';
$result = regexAddBefore($html, $eleName, $eleAttr, $eleAttrValue, $addBefore);
var_dump($result);
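function regexAddBefore($subject, $eleName, $eleAttr, $eleAttrValue, $addBefore){
    // Escape the inputs so characters such as "/" or "." are matched literally.
    $eleName = preg_quote($eleName, '/');
    $eleAttr = preg_quote($eleAttr, '/');
    $eleAttrValue = preg_quote($eleAttrValue, '/');
    // Match the opening tag that carries the given attribute value, with or without quotes.
    $regex = "/(<\s*".$eleName."[^>]*".$eleAttr."\s*=\s*(\"|\')?\s*".$eleAttrValue."\s*(\"|\')?[^>]*>)/s";
    // Put the new markup on its own line before the matched tag ($1).
    $replace = $addBefore."\r\n$1";
    return preg_replace($regex, $replace, $subject);
}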
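A simpler variant, sketched here under the assumption that the new item should always go right after the opening <div class="controls faint"> tag, anchors on that tag instead of on the following link. The variable names are illustrative.

```php
<?php
// Sample input copied from the question.
$html = <<<HTML
<li id="id-21" class="listClass" data-author="newbie">
<div class="someDiv">
<span class="spanClass">Some content</span>
</div>
<div class="controls faint">
Link 2
Link 3
</div>
</li>
HTML;

// Capture the opening tag plus the whitespace after it, then re-emit it
// followed by the new line. The limit of 1 replaces only the first match.
$pattern = '/(<div class="controls faint">\s*)/';
// Single quotes keep PHP from interpolating ${1}; it is the regex backreference.
$replacement = '${1}Link 1' . "\n";
$result = preg_replace($pattern, $replacement, $html, 1);

echo $result, PHP_EOL;
```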
I can suggest two things (although I couldn't understand your problem clearly):
$newStr = preg_replace ('/<[^>]*>/', ' ', $htmlText);
This will replace every HTML tag in the string with a space. I don't know if it will be useful for you.
Another recommendation would be to use the strip_tags function. Its second parameter is optional and lets you define the tags you want to keep.
$str = '<li id="id-21" class="listClass" data-author="newbie">
<div class="someDiv">
<span class="spanClass">Some content</span>
</div>
<div class="controls faint">
Link 2
Link 3
</div>
</li>';
echo strip_tags ($str,'<a>');
This will give you output containing just the links and whatever text is in the HTML string.
Sorry if this also doesn't help.