In my application I am trying to get the number of pages Google has indexed, and I found out that the number is available in the following div:
<div id="resultStats"> About 1,960,000 results (0.38 seconds) </div>
My question is how to extract the number from the above div in a web page.
Never use regexp to parse HTML. (See: RegEx match open tags except XHTML self-contained tags)
Use an HTML parser, like Simple HTML DOM (http://simplehtmldom.sourceforge.net/).
You can then use CSS selectors to pick the element:
$html = file_get_html('http://www.google.com/');
$divContent = $html->find('div#resultStats', 0)->plaintext;
$matches = array();
preg_match('/([0-9,]+)/', $divContent, $matches);
echo $matches[1];
Outputs: "1,960,000"
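If adding a third-party library is not an option, PHP's bundled DOM extension can do the same lookup. The following is a minimal sketch, assuming the page markup is already available as a string; the $html sample and the $count variable are illustrative and not part of the original answer. It also converts the matched text into an integer.

```php
<?php
// Sample markup standing in for the fetched page (an assumption for this sketch).
$html = '<html><body><div id="resultStats"> About 1,960,000 results (0.38 seconds) </div></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

// Select the div by its id attribute.
$xpath = new DOMXPath($doc);
$node = $xpath->query('//div[@id="resultStats"]')->item(0);

$count = null;
if ($node !== null && preg_match('/[0-9][0-9,]*/', $node->textContent, $m)) {
    // Strip the thousands separators before casting to int.
    $count = (int) str_replace(',', '', $m[0]);
}

var_dump($count);
```

The regex here only runs on the plain text of one element, so the HTML structure itself is still handled by the parser.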
$str = '<div id="resultStats"> About 1,960,000 results (0.38 seconds) </div> ';
$matches = array();
preg_match('/<div id="resultStats"> About ([0-9,]+?) results[^<]+<\/div>/', $str, $matches);
print_r($matches);
Output:
Array (
[0] => <div id="resultStats"> About 1,960,000 results (0.38 seconds) </div>
[1] => 1,960,000
)
This is a simple regex with subpatterns:
([0-9,]+?) - matches the digits 0-9 and the , character, at least once and non-greedily.
[^<]+ - matches any character except <, one or more times.
echo $matches[1]; - will print the number you want.
You can use a regex (preg_match) for that:
$your_div_string = '<div id="resultStats"> About 1,960,000 results (0.38 seconds) </div>';
preg_match('/<div.*>(.*)<\/div>/i', $your_div_string, $result);
print_r( $result );
The output will be:
Array (
[0] => <div id="resultStats"> About 1,960,000 results (0.38 seconds) </div>
[1] => About 1,960,000 results (0.38 seconds)
)
This way you can get the content inside the div.
I need to convert a PDF file with tables into CSV, so I used PdfParser to extract the entire text, and then preg_match_all to search for the patterns of each table so I can build an array from each table in the PDF.
The structure of the following PDF is:
When I parse it I get this:
ECO-698 Acondicionador Frio-Calor ECO-CHI-522 Chimenea eléctrica con patas
I figured out how to match all the ECO-XXXXX codes with preg_match_all, but I don't know how to match all the descriptions.
This is what is working for ECO-XXXXXX:
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('publication.pdf');
$text = $pdf->getText();
echo $text;
$pattern = '/ECO-[.-^*-]{3,}| ECO-[.-^*-]{4,}\s\b[NMB]\b|ECO-[.-^*-]{4,}\sUP| ECO-[.-^*-]{3,}\sUP\s[B-N-M]{1}| ECO-[.-^*-]{3,}\sRX/' ;
preg_match_all($pattern, $text, $array);
echo "<hr>";
print_r($array);
I get
Array ( [0] => Array ( [0] => ECO-698 [1] => ECO-CHI-522 [2]
You may try this regex:
(ECO[^\s]+)\s+(.*?)(?=ECO|\z)
For the input string, group 1 contains the ECO block and group 2 contains the description.
Explanation:
(ECO[^\s]+) captures the full ECO block until it reaches whitespace.
\s+ matches one or more whitespace characters.
(.*?)(?=ECO|\z) Here (.*?) matches the description and (?=ECO|\z) is a positive lookahead for ECO or the end of the string (\z).
Source code:
$re = '/(ECO[^\s]+)\s+(.*?)(?=ECO|\z)/m';
$str = 'ECO-698 Acondicionador Frio-Calor ECO-CHI-522 Chimenea eléctrica con patas';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
$val=1;
foreach ($matches as $value)
{
echo "\n\nRow no:".$val++;
echo "\ncol 1:".$value[1]."\ncol 2:".$value[2];
}
UPDATE, as per the comment:
((?:ECO-(?!DE)[^\s]+)(?: (?:RX|B|N|M|UP|UP B|UP N|UP M))?)\s+(.*?)(?=(?:ECO-(?!DE))|\z)
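A minimal sketch of how a pattern of this shape could be run, under a few assumptions: the sample string with a "UP B" suffix is made up for illustration, the \S shorthand replaces [^\s], and the suffix alternatives are reordered. PCRE tries alternatives from left to right, so this sketch lists the two-word suffixes first so that "UP B" is tried before the shorter "UP".

```php
<?php
// Hypothetical input: the "UP B" suffix is added to exercise the optional group.
$text = 'ECO-698 Acondicionador Frio-Calor ECO-CHI-522 UP B Chimenea eléctrica con patas';

// Group 1: ECO code plus optional suffix. Group 2: description up to the next code or end of string.
$re = '/((?:ECO-(?!DE)\S+)(?: (?:UP [BNM]|UP|RX|B|N|M))?)\s+(.*?)(?=ECO-(?!DE)|\z)/s';

preg_match_all($re, $text, $matches, PREG_SET_ORDER);

$rows = [];
foreach ($matches as $m) {
    // trim() drops the space left before the next ECO code.
    $rows[] = [$m[1], trim($m[2])];
}

print_r($rows);
```

Each entry of $rows is one table row (code, description), which can then be written out as a CSV line.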
I have placeholders that users can insert into a WYSIWYG editor (which contains HTML code). Sometimes, when they paste from apps like Word, HTML gets injected inside the placeholders.
E.g. it pastes %<span>firstname</span>% instead of %firstname%.
Here is an example of my regex code:
$html = '
<p>%firstname%</p>
<p>%<span>firstname</span>%</p>
<p>%<span class="blah">firstname</span>%</p>
<p>%<span><span>firstname</span></span>%</p>
<p>%<span><span><span>firstname</span></span></span>%</p>
<p>%<span class="blah"><span>firstname</span></span>%</p>
<div>other random <strong>HTML</strong> that needs to be preserved.</div>
';
preg_match_all(
'/\%(?![0-9])((?:<[^<]+?>)?[a-zA-z0-9_-]+(?:[\s]?<[^<]+?>)?)\%/U',
$html,
$matches
);
echo '<pre>';
print_r($matches);
echo '</pre>';
Which outputs the following:
Array
(
[0] => Array
(
[0] => %firstname%
[1] => %firstname%
[2] => %firstname%
)
[1] => Array
(
[0] => firstname
[1] => firstname
[2] => firstname
)
)
As soon as there is more than one span inside the placeholder it doesn't work. I'm not quite sure what to adjust in my regex.
/\%(?![0-9])((?:<[^<]+?>)?[a-zA-z0-9_-]+(?:[\s]?<[^<]+?>)?)\%/U
How would I achieve this?
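Try this regex. It allows any number of opening tags before the name and any number of closing tags after it, and captures only the name (note that the character range is A-Z, not A-z, which would also match characters such as [ and ^):
/\%(?![0-9])(?:<[^<]+?>)*([a-zA-Z0-9_-]+)(?:[\s]?<\/[^<]+?>)*\%/U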
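As a rough illustration of the same idea, here is a sketch that uses preg_replace to rewrite each polluted placeholder back to its bare form. It is a variant rather than the exact pattern above: it drops the U modifier and uses [^<>] inside the tags, and the $pattern and $clean names are only illustrative.

```php
<?php
$html = '
<p>%firstname%</p>
<p>%<span>firstname</span>%</p>
<p>%<span class="blah">firstname</span>%</p>
<p>%<span><span>firstname</span></span>%</p>
<p>%<span><span><span>firstname</span></span></span>%</p>
<p>%<span class="blah"><span>firstname</span></span>%</p>
<div>other random <strong>HTML</strong> that needs to be preserved.</div>
';

// Any number of opening tags, the placeholder name, any number of closing tags.
$pattern = '/%(?![0-9])(?:<[^<>]+>)*([a-zA-Z0-9_-]+)(?:\s?<\/[^<>]+>)*%/';

// Replace each match with the bare placeholder, dropping the injected tags.
$clean = preg_replace($pattern, '%$1%', $html);

echo $clean;
```

Only the tags between the two % signs are removed, so the surrounding <p> elements and the unrelated <strong> markup stay in place.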
You could use a parser and the textContent property if it is a WYSIWYG editor anyway:
<?php
$html = '
<p>%firstname%</p>
<p>%<span>firstname</span>%</p>
<p>%<span class="blah">firstname</span>%</p>
<p>%<span><span>firstname</span></span>%</p>
<p>%<span><span><span>firstname</span></span></span>%</p>
<p>%<span class="blah"><span>firstname</span></span>%</p>
<div>A cool div with %firstname%</div>
<span>And a very neat span with %firstname%</span>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
# query only root elements here
$containers = $xpath->query("/*");
foreach ($containers as $container) {
echo $container->textContent . "\n";
}
?>
This prints the text content of the whole document, in which every placeholder appears as plain %firstname%.
Do you really need a regex for this? You could have simply used strip_tags() here.
Try this:
echo strip_tags($html);
I have a huge piece of HTML to scan. Until now I have been using preg_match_all to extract the desired parts from it. The problem from the start was that it consumed an extreme amount of CPU time, so we finally decided to use some other method for extraction. I read some articles that compare the performance of preg_match with strpos; they claim that strpos beats the regex scanner by up to 20 times in efficiency. I thought I would try this method, but I don't really know how to get started.
Let's say I have this HTML string:
<li id="ncc-nba-16451" class="che10">23 - Star</li>
<li id="ncd-bbt-5674" class="che10">54 - Moon</li>
<li id="ertw-cxda-c6543" class="che10">34,780 - Sun</li>
I want to extract only the number from each id and only the text (letters) from the content of the a tags, so I run this preg_match_all scan:
'/<li.*?id=".*?([\d]+)".*?<a.*?>.*?([\w]+)<\/a>/s'
Now, if I wanted to replace my method with strpos, what would the approach look like? I understand that strpos returns the index where the match starts. But how can I use it to:
get all possible matches, not just one
extract numbers or text from the desired place in the string
Thank you for all the help and tips ;)
Using DOM
$html = '
<html>
<head></head>
<body>
<li id="ncc-nba-16451" class="che10">23 - Star</li>
<li id="ncd-bbt-5674" class="che10">54 - Moon</li>
<li id="ertw-cxda-c6543" class="che10">34,780 - Sun</li>
</body>
</html>';
$dom_document = new DOMDocument();
$dom_document->loadHTML($html);
$rootElement = $dom_document->documentElement;
$getId = $rootElement->getElementsByTagName('li');
$res = [];
foreach($getId as $tag)
{
$data = explode('-',$tag->getAttribute('id'));
$res['li_id'][] = end($data);
}
// The sample markup above contains no <a> elements, so read the text of each <li> directly.
foreach($getId as $tag)
{
$res['a_node'][] = $tag->textContent;
}
print_r($res);
Output :
Array
(
[li_id] => Array
(
[0] => 16451
[1] => 5674
[2] => c6543
)
[a_node] => Array
(
[0] => 23 - Star
[1] => 54 - Moon
[2] => 34,780 - Sun
)
)
This regex finds a match in 24 steps with 0 backtracks:
(?:id="[^\d]*(\d*))[^<]*(?:<a href="[^>]*>[^a-z]*([a-z]*))
The regex you posted requires 134 steps, so you may notice a difference. Note that regex engines can optimize a pattern to minimize backtracking. I used the RegexBuddy debugger to get these numbers.
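For the strpos approach the question asks about, a minimal sketch could look like the following. The key is the third argument of strpos, an offset, which lets a loop continue searching after the previous match. The helper name extractBetween and the marker strings are assumptions made for this example, and it only works when the markup is as regular as the sample.

```php
<?php
// Hypothetical helper: collect the text between each $open and $close marker.
function extractBetween(string $haystack, string $open, string $close): array
{
    $results = [];
    $offset = 0;
    // strpos() returns false when there are no more matches after $offset.
    while (($start = strpos($haystack, $open, $offset)) !== false) {
        $start += strlen($open);
        $end = strpos($haystack, $close, $start);
        if ($end === false) {
            break;
        }
        $results[] = substr($haystack, $start, $end - $start);
        $offset = $end + strlen($close);
    }
    return $results;
}

$html = '<li id="ncc-nba-16451" class="che10">23 - Star</li>
<li id="ncd-bbt-5674" class="che10">54 - Moon</li>
<li id="ertw-cxda-c6543" class="che10">34,780 - Sun</li>';

$numbers = [];
foreach (extractBetween($html, 'id="', '"') as $id) {
    // Keep only the digits at the end of the id.
    $numbers[] = substr($id, strlen(rtrim($id, '0123456789')));
}

$words = [];
foreach (extractBetween($html, '">', '</li>') as $text) {
    // Keep only the last word of the item text.
    $words[] = substr($text, strrpos($text, ' ') + 1);
}

print_r($numbers);
print_r($words);
```

Whether this is actually faster than the regex depends on the input, so it is worth timing both on real data.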
I want to make a script that automatically gets the content between HTML tags (start and end) and stores it in an array.
Example:
Input:
$str = <p>This is a sample <b>text</b> </p> this is out of tags.<p>This is <p>another text</p>for same aggregate <i>tags</i>.</p>
output:
$blocks[0] = <p>This is a sample <b>text</b> </p>
$blocks[1] = <p>This is <p>another text</p>for same aggregate <i>tags</i>.</p>
NB: the first block starts with <p>, so it must stop at </p>. The second block also starts with <p>, but it contains another opening and closing paragraph [<p></p>] inside it, so it must stop at its own matching </p>. That means I want to keep all of the data and inner tags between the start and end tags.
I'll try to provide an answer to this, although this solution does not give you exactly what you are looking for, since nested <p> tags are not valid HTML. Using PHP's DOMDocument, you can extract the paragraph tags like this:
<?php
$test = "<p>This is a sample <b>text</b> </p> this is out of tags.<p>This is <p>another text</p>for same aggregate <i>tags</i>.</p>";
$html = new DOMDocument();
$html->loadHTML($test);
$p_tags = array();
foreach ($html->getElementsByTagName('p') as $p) {
$p_tags[] = $html->saveHTML($p);
}
print_r($p_tags);
?>
After throwing some warnings at you because of the invalid tag nesting, the output should be the following:
Array
(
[0] => <p>This is a sample <b>text</b> </p>
[1] => <p>This is </p>
[2] => <p>another text</p>
)
You can use the Simple HTML DOM library to do this. Here is an example:
require_once('simple_html_dom.php');
$html = " <p>This is a sample <b>text</b> </p> this is out of tags.<p>This is <p>another text</p>for same aggregate <i>tags</i>.</p>";
$html = str_get_html($html);
$p = $html->find('p');
$contentArray = array();
foreach($p as $element)
$contentArray[] = $element->innertext; //You can try $element->outertext to get the output with tag. ie. <p>content</p>
print_r($contentArray);
The output looks like this:
Array
(
[0] => This is a sample <b>text</b>
[1] => This is
[2] => another text
)
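Both parsers above split the nested paragraph, because they repair the invalid nesting. If the outermost blocks are needed exactly as written, one alternative is to count opening and closing tags by hand. This is only a sketch: the function name extractTopLevelBlocks is hypothetical, and it assumes the tags are balanced and that no <p appears inside attributes or comments.

```php
<?php
// Hypothetical helper: return each outermost <tag>...</tag> block, nested tags included.
function extractTopLevelBlocks(string $str, string $tag = 'p'): array
{
    $blocks = [];
    $depth = 0;
    $start = 0;

    // Match opening and closing tags of the given name; group 1 is "/" for closing tags.
    $pattern = '/<(\/?)' . preg_quote($tag, '/') . '\b[^>]*>/i';
    preg_match_all($pattern, $str, $tokens, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

    foreach ($tokens as $token) {
        [$text, $pos] = $token[0];
        $isClosing = $token[1][0] === '/';

        if (!$isClosing) {
            if ($depth === 0) {
                $start = $pos;          // an outermost block begins here
            }
            $depth++;
        } elseif ($depth > 0) {
            $depth--;
            if ($depth === 0) {         // the outermost block just closed
                $blocks[] = substr($str, $start, $pos + strlen($text) - $start);
            }
        }
    }
    return $blocks;
}

$str = '<p>This is a sample <b>text</b> </p> this is out of tags.<p>This is <p>another text</p>for same aggregate <i>tags</i>.</p>';
$blocks = extractTopLevelBlocks($str);
print_r($blocks);
```

Text outside any <p> block, such as "this is out of tags.", is skipped because it falls between two outermost blocks.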
I have this code that extracts the first image from an article in Joomla:
<?php preg_match('/<img (.*?)>/', $this->article->text, $match); ?>
<?php echo $match[0]; ?>
Is there a way to extract all the images that are available in the article and not only one?
First, I would suggest not using regular expressions to parse HTML. You should use an appropriate parser such as DOMDocument::loadHTML, which uses libxml.
Then you may query for the desired tags you want. Something like this may work (untested):
$doc = new DOMDocument;
$doc->loadHTML($htmlSource);
$xpath = new DOMXPath($doc);
$query = '//img';
$entries = $xpath->query($query);
foreach ($entries as $entry) {
// $entry->getAttribute('src')
}
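A runnable variant of the same idea, as a minimal sketch: the $htmlSource sample and the $sources array are made up for illustration, and in the Joomla case the input would be the article text instead.

```php
<?php
// Sample markup standing in for the article text (an assumption for this sketch).
$htmlSource = '<p>Intro <img src="a.png" alt="A"> text <img src="b.jpg" /></p>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // article fragments are rarely valid documents
$doc->loadHTML($htmlSource);
libxml_clear_errors();

// Collect the src attribute of every <img> element.
$sources = [];
foreach ($doc->getElementsByTagName('img') as $img) {
    $sources[] = $img->getAttribute('src');
}

print_r($sources);
```

The parser handles both the <img ...> and <img ... /> forms, so no special case is needed for the trailing slash.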
Use preg_match_all, and modify the pattern like so to take into account the trailing '/' inside the img tag:
$str = '<img src="asdf" />stuff more stuff <img src="qwerty" />';
preg_match_all('/<img (.*?)\/>/', $str, $matches);
print_r($matches);
Array
(
[0] => Array
(
[0] => <img src="asdf" />
[1] => <img src="qwerty" />
)
[1] => Array
(
[0] => src="asdf"
[1] => src="qwerty"
)
)
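The pattern above only matches tags that end in "/>". A sketch of a slightly more tolerant variant follows, where the slash is optional; the pattern and the sample string are illustrative, and it will still misfire on a ">" inside an attribute value.

```php
<?php
// One self-closing tag and one without the trailing slash.
$str = '<img src="asdf" />stuff more stuff <img src="qwerty">';

// \/? makes the trailing slash optional; [^>]*? keeps the match inside one tag.
preg_match_all('/<img\s+([^>]*?)\s*\/?>/i', $str, $matches);

print_r($matches[0]);   // the full tags
print_r($matches[1]);   // the attribute strings
```

$matches[0] holds every complete img tag, which is what the original one-image snippet echoed for the first match.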