Parse html with regexp - php

I want to find all <h3> blocks in this example:
<h3>sdf</h3>
sdfsdf
<h3>sdf</h3>
32
<h2>fs</h2>
<h3>23sd</h3>
234
<h1>h1</h1>
(From h3 to other h3 or h2) This regexp find only first h3 block
~\<h3[^>]*\>[^>]+\<\/h3\>.+(?:\<h3|\<h2|\<h1)~is
I use php function preg_match_all (Quote from docs: After the first match is found, the subsequent searches are continued on from end of the last match.)
What i have to modify in my regexp?
ps
<h3>1</h3>
1content
<h3>2</h3>
2content
<h2>h2</h2>
<h3>3</h3>
3content
<h1>h1</h1>
this content have to be parsed as:
[0] => <h3>1</h3>1content
[1] => <h3>2</h3>2content
[2] => <h3>2</h3>3content

with DOMDocument:
$dom = new DOMDocument();
#$dom->loadHTML($html);
$nodes = $dom->getElementsByTagName('body')->item(0)->childNodes;
$flag = false;
$results = array();
foreach ($nodes as $node) {
if ( $node->nodeType == XML_ELEMENT_NODE &&
preg_match('~^h(?:[12]|(3))$~i', $node->nodeName, $m) ):
if ($flag)
$results[] = $tmp;
if (isset($m[1])) {
$tmp = $dom->saveXML($node);
$flag = true;
} else
$flag = false;
elseif ($flag):
$tmp .= $dom->saveXML($node);
endif;
}
echo htmlspecialchars(print_r($results, true));
with regex:
preg_match_all('~<h3.*?(?=<h[123])~si', $html, $matches);
echo htmlspecialchars(print_r($matches[0], true));

You shouldn't use Regex to parse HTML if there is any nesting involved.
Regex
(<(h\d)>.*?<\/\2>)[\r\n]([^\r\n<]+)
Replacement
\1\3
or
$1$3
http://regex101.com/r/uQ3uC2

preg_match_all('/<h3>(.*?)<\/h3>/is', $stringHTML, $matches);

Related

set tags in html using domdocument and preg_replace_callback

I try to replace words that are in my dictionary of terminology with an (html)anchor so it gets a tooltip. I get the replace-part done, but I just can't get it back in the DomDocument object.
I've made a recursive function that iterates the DOM, it iterates every childnode, searching for the word in my dictionary and replacing it with an anchor.
I've been using this with an ordinary preg_match on HTML, but that just runs into problems.. when HTML gets complex
The recursive function:
$terms = array(
'example'=>'explanation about example'
);
function iterate_html($doc, $original_doc = null)
{
global $terms;
if(is_null($original_doc)) {
self::iterate_html($doc, $doc);
}
foreach($doc->childNodes as $childnode)
{
$children = $childnode->childNodes;
if($children) {
self::iterate_html($childnode);
} else {
$regexes = '~\b' . implode('\b|\b',array_keys($terms)) . '\b~i';
$new_nodevalue = preg_replace_callback($regexes, function($matches) {
$doc = new DOMDocument();
$anchor = $doc->createElement('a', $matches[0]);
$anchor->setAttribute('class', 'text-info');
$anchor->setAttribute('data-toggle', 'tooltip');
$anchor->setAttribute('data-original-title', $terms[strtolower($matches[0])]);
return $doc->saveXML($anchor);
}, $childnode->nodeValue);
$dom = new DOMDocument();
$template = $dom->createDocumentFragment();
$template->appendXML($new_nodevalue);
$original_doc->importNode($template->childNodes, true);
$childnode->parentNode->replaceChild($template, $childnode);
}
}
}
echo iterate_html('this is just some example text.');
I expect the result to be:
this is just some <a class="text-info" data-toggle="tooltip" data-original-title="explanation about example">example</a> text
I don't think building a recursive function to walk the DOM is usefull when you can use an XPath query. Also, I'm not sure that preg_replace_callback is an adapted function for this case. I prefer to use preg_split. Here is an example:
$html = 'this is just some example text.';
$terms = array(
'example'=>'explanation about example'
);
// sort by reverse order of key size
// (to be sure that the longest string always wins instead of the first in the pattern)
uksort($terms, function ($a, $b) {
$diff = mb_strlen($b) - mb_strlen($a);
return ($diff) ? $diff : strcmp($a, $b);
});
// build the pattern inside a capture group (to have delimiters in the results with the PREG_SPLIT_DELIM_CAPTURE option)
$pattern = '~\b(' . implode('|', array_map(function($i) { return preg_quote($i, '~'); }, array_keys($terms))) . ')\b~i';
// prevent eventual html errors to be displayed
$libxmlInternalErrors = libxml_use_internal_errors(true);
// determine if the html string have a root html element already, if not add a fake root.
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$fakeRootElement = false;
if ( $dom->documentElement->nodeName !== 'html' ) {
$dom->loadHTML("<div>$html</div>", LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
$fakeRootElement = true;
}
libxml_use_internal_errors($libxmlInternalErrors);
// find all text nodes (not already included in a link or between other unwanted tags)
$xp = new DOMXPath($dom);
$textNodes = $xp->query('//text()[not(ancestor::a)][not(ancestor::style)][not(ancestor::script)]');
// replacement
foreach ($textNodes as $textNode) {
$parts = preg_split($pattern, $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
$fragment = $dom->createDocumentFragment();
foreach ($parts as $k=>$part) {
if ($k&1) {
$anchor = $dom->createElement('a', $part);
$anchor->setAttribute('class', 'text-info');
$anchor->setAttribute('data-toggle', 'tooltip');
$anchor->setAttribute('data-original-title', $terms[strtolower($part)]);
$fragment->appendChild($anchor);
} else {
$fragment->appendChild($dom->createTextNode($part));
}
}
$textNode->parentNode->replaceChild($fragment, $textNode);
}
// building of the result string
$result = '';
if ( $fakeRootElement ) {
foreach ($dom->documentElement->childNodes as $childNode) {
$result .= $dom->saveHTML($childNode);
}
} else {
$result = $dom->saveHTML();
}
echo $result;
demo
Feel free to put that into one or more functions/methods, but keep in mind that this kind of editing has a non-neglictable weight and should be used each time the html is edited (and not each time the html is displayed).

Php html parsing, I want to save parsed elements into an array

I'm trying to parse the html page and accessing some of the tags. I am parsing all of those tags and displaying the result in form of indentation which is according to the level of tags e.g. header tags h1, h2, h3 etc. Now, I want to save the resultant data (indented table of contents) into an array along with the name of the tags. Kindly help me to sort out my problem.
Here is my php code... I'm using html dom parser.
include ("simple_html_dom.php");
session_start();
error_reporting(0);
$string = file_get_contents('test.php');
$tags = array(0 => '<h1', 1 => '<h2', 2 => '<h3', 3 => '<h4', 4 => '<h5', 5 => '<h6');
function parser($html, $needles = array()){
$positions = array();
foreach ($needles as $needle){
$lastPos = 0;
while (($lastPos = strpos($html, $needle, $lastPos))!== false)
{
$positions[] = $lastPos;
$lastPos = $lastPos + strlen($needle);
}
unset($needles[0]);
if(count($positions) > 0){
break;
}
}
if(count($positions) > 0){
for ($i = 0; $i < count($positions); $i++) {
?>
<div class="<?php echo $i; ?>" style="padding-left: 20px; font-size: 14px;">
<?php
if($i < count($positions)-1){
$temp = explode('</', substr($html, $positions[$i]+4));
$pos = strpos($temp[0], '>');
echo substr($temp[0], $pos);
parser(substr($html, $positions[$i]+4, $positions[$i+1]-$positions[$i]-4), $needles);
} else {
$temp = explode('</', substr($html, $positions[$i]+4));
$pos = strpos($temp[0], '>');
echo substr($temp[0], $pos+1);
parser(substr($html, $positions[$i]+4), $needles);
}
?>
</div>
<?php
}
} else {
// not found any position of a tag
}
}
parser($string, $tags);
If you wanted to do it using SimpleXML and XPath, there is a shorter and much more readable version you could try...
$xml = new SimpleXMLElement($string);
$tags = $xml->xpath("//h1 | //h2 | //h3 | //h4");
$data = [];
foreach ( $tags as $tag ) {
$elementData['name'] = $tag->getName();
$elementData['content'] = (string)$tag;
$data[] = $elementData;
}
print_r($data);
You can see the pattern in the XPath - it combines any of the elements you need. The use of // means to find at any level and then the name of the element you want to find. These are combined using |, which is the 'or' operator. This could easily be expanded using the same type of expression to build a full set of tags you need.
The program then loops over the elements found and builds an array of each element at a time. Taking the name and content and adding them to the $data array.
Update:
If your file isn't well formed XML, you may have to use DOMDocument and loadHTML. Only a slight difference but is more tollerant of errors...
$string = file_get_contents("links.html");
$xml = new DOMDocument();
libxml_use_internal_errors();
$xml->loadHTML($string);
$xp = new DOMXPath($xml);
$tags = $xp->query("//h1 | //h2 | //h3 | //h4");
$data = [];
foreach ( $tags as $tag ) {
$elementData['name'] = $tag->tagName;
$elementData['content'] = $tag->nodeValue;
$data[] = $elementData;
}
print_r($data);

PHP Regex or DOMDocument for Matching & Removing URLs?

I'm trying to extract links from html page using DOM:
$html = file_get_contents('links.html');
$DOM = new DOMDocument();
$DOM->loadHTML($html);
$a = $DOM->getElementsByTagName('a');
foreach($a as $link){
//echo out the href attribute of the <A> tag.
echo $link->getAttribute('href').'<br/>';
}
Output:
http://dontwantthisdomain.com/dont-want-this-domain-name/
http://dontwantthisdomain2.com/also-dont-want-any-pages-from-this-domain/
http://dontwantthisdomain3.com/dont-want-any-pages-from-this-domain/
http://domain1.com/page-X-on-domain-com.html
http://dontwantthisdomain.com/dont-want-link-from-this-domain-name.html
http://dontwantthisdomain2.com/dont-want-any-pages-from-this-domain/
http://domain.com/page-XZ-on-domain-com.html
http://dontwantthisdomain.com/another-page-from-same-domain-that-i-dont-want-to-be-included/
http://dontwantthisdomain2.com/same-as-above/
http://domain3.com/page-XYZ-on-domain3-com.html
I would like to remove all results matching dontwantthisdomain.com, dontwantthisdomain2.com and dontwantthisdomain3.com so the output will looks like that:
http://domain1.com/page-X-on-domain-com.html
http://domain.com/page-XZ-on-domain-com.html
http://domain3.com/page-XYZ-on-domain3-com.html
Some people saying I should not use regex for html and others that it's ok. Could somebody point the best way how I can remove unwanted urls from my html file? :)
Maybe something like this:
function extract_domains($buffer, $whitelist) {
preg_match_all("#<a\s+.*?href=\"(.+?)\".*?>(.+?)</a>#i", $buffer, $matches);
$result = array();
foreach($matches[1] as $url) {
$url = urldecode($url);
$parts = #parse_url((string) $url);
if ($parts !== false && in_array($parts['host'], $whitelist)) {
$result[] = $parts['host'];
}
}
return $result;
}
$domains = extract_domains(file_get_contents("/path/to/html.htm"), array('stackoverflow.com', 'google.com', 'sub.example.com')));
It does a rough match on the all the <a> with href=, grabs what's between the quotes, then filters it based on your whitelist of domains.
None regex solution (without potential errors :-) :
$html='
http://dontwantthisdomain.com/dont-want-this-domain-name/
http://dontwantthisdomain2.com/also-dont-want-any-pages-from-this-domain/
http://dontwantthisdomain3.com/dont-want-any-pages-from-this-domain/
http://domain1.com/page-X-on-domain-com.html
http://dontwantthisdomain.com/dont-want-link-from-this-domain-name.html
http://dontwantthisdomain2.com/dont-want-any-pages-from-this-domain/
http://domain.com/page-XZ-on-domain-com.html
http://dontwantthisdomain.com/another-page-from-same-domain-that-i-dont-want-to-be-included/
http://dontwantthisdomain2.com/same-as-above/
http://domain3.com/page-XYZ-on-domain3-com.html
';
$html=explode("\n", $html);
$dontWant=array('dontwantthisdomain.com','dontwantthisdomain2.com','dontwantthisdomain3.com');
foreach ($html as $link) {
$ok=true;
foreach($dontWant as $notWanted) {
if (strpos($link, $notWanted)>0) {
$ok=false;
}
if (trim($link=='')) $ok=false;
}
if ($ok) $final_result[]=$link;
}
echo '<pre>';
print_r($final_result);
echo '</pre>';
outputs
Array
(
[0] => http://domain1.com/page-X-on-domain-com.html
[1] => http://domain.com/page-XZ-on-domain-com.html
[2] => http://domain3.com/page-XYZ-on-domain3-com.html
)

Within parsing html to dom tree, how to split string by tag in php?

Here is the string:
<div>This is a test.</div>
<div>This <b>another</b> a test.</div>
<div/>
<div>This is last a test.</div>
I wanna to separate the following string to array like this:
{"This is a test.", "This <b>another</b> a test.", "", "This is last a test."}
Any idea to do so in php? Thank you.
I assume your HTML is malformed on purpose
There are many options, includin xpath and numerous libraries. Regex is not a good idea. I find DOMDocument fast and relatively simple.
getElementsByTagName then iterate over them getting the innerHTML.
Example:
<?php
function get_inner_html( $node ) {
$innerHTML= '';
$children = $node->childNodes;
foreach ($children as $child) {
$innerHTML .= $child->ownerDocument->saveXML( $child );
}
return $innerHTML;
}
$str = <<<'EOD'
<div>This is a test.</div>
<div>This <b>another</b> a test.</div>
<div/>
<div>This is last a test.</div>
EOD;
$doc = new DOMDocument();
$doc->loadHTML($str);
$ellies = $doc->getElementsByTagName('div');
foreach ($ellies as $one_el) {
if ($ih = get_inner_html($one_el))
$array[] = $ih;
}
?>
<pre>
<?php print_r($array); ?>
</pre>
// Output
// Note that there would be
// a 4th array elemnt w/o the `if ($ih = get_inner_html($one_el))` check:
Array
(
[0] => This is a test.
[1] => This <b>another</b> a test.
[2] => This is last a test.
)
Try it out here
Note:
The above will work fine as long as you don't have nested DIVS. If you do have nesting, you have to exclude the nested children as you loop through innerHTML.
For example let's say you have this HTML:
<div>One
<div>Two</div>
<div>Three</div>
<div/>
<div>Four
<div>Five</div>
</div>
Here's how to deal with the above and get an array that has the number in order:
Dealing with nesting
<?php
function get_inner_html_unnested( $node, $exclude ) {
$innerHTML= '';
$children = $node->childNodes;
foreach ($children as $child) {
if (!property_exists($child, 'tagName') || ($child->tagName != $exclude))
$innerHTML .= trim($child->ownerDocument->saveXML( $child ));
}
return $innerHTML;
}
$str = <<<'EOD'
<div>One
<div>Two</div>
<div>Three</div>
<div/>
<div>Four
<div>Five</div>
</div>
EOD;
$doc = new DOMDocument();
$doc->loadHTML($str);
$ellies = $doc->getElementsByTagName('div');
foreach ($ellies as $one_el) {
if ($ih = get_inner_html_unnested($one_el, 'div'))
$array[] = $ih;
}
?>
<pre>
<?php print_r($array); ?>
</pre>
Try it out here
This make_array function should do the trick for you:
function make_array($string)
{
$regexp = "(\s*</?div/?>\s*)+";
$string = preg_replace("#^$regexp#is", "", $string);
$string = preg_replace("#$regexp$#is", "", $string);
return preg_split("#$regexp#is", $string);
}
When passed the string you gave as an example, it outputs the following array:
Array
(
[0] => "This is a test."
[1] => "This <b>another</b> a test."
[2] => "This is last a test."
)

(PHP) Regex for finding specific href tag

i have a html document with n "a href" tags with different target urls and different text between the tag.
For example:
<span ....>lorem ipsum</span>
<span ....>example</span>
example3
<img ...>test</img>
without a d as target url
As you can see the target urls switch between "d?, d., d/d?, d/d." and between the "a tag" there could be any type of html which is allowed by w3c.
I need a Regex which gives me all links which has one of these combination in the target url:
"d?, d., d/d?, d/d." and has "Lorem" or "test" between the "a tags" in any position including sub html tags.
My Regex so far:
href=[\"\']([^>]*?/[d]+[.|\?][^"]*?[\"\'][^>]*[/]?>.*?</a>)
I tried to include the lorem / test as followed:
href=[\"\']([^>]*?/[d]+[.|\?][^"]*?[\"\'][^>]*[/]?>(lorem|test)+</a>)
but this will only works if I put a ".*?" before and after the (lorem|test) and this would be to greedy.
If there is a easier way with SimpleXml or any other DOM parser, please let me know. Otherwise I would appreciate any help with the regex.
Thanks!
Here you go:
$html = array
(
'<span ....>lorem ipsum</span>',
'<span ....>example</span>',
'example3',
'<img ...>test</img>',
'without a d as target url',
);
$html = implode("\n", $html);
$result = array();
$anchors = phXML($html, '//a[contains(., "lorem") or contains(., "test")]');
foreach ($anchors as $anchor)
{
if (preg_match('~d[.?]~', strval($anchor['href'])) > 0)
{
$result[] = strval($anchor['href']);
}
}
echo '<pre>';
print_r($result);
echo '</pre>';
Output:
Array
(
[0] => http://www.example.com/d?12345abc
[1] => http://www.example.com/d/d.1234
)
The phXML() function is based on my DOMDocument / SimpleXML wrapper, and goes as follows:
function phXML($xml, $xpath = null)
{
if (extension_loaded('libxml') === true)
{
libxml_use_internal_errors(true);
if ((extension_loaded('dom') === true) && (extension_loaded('SimpleXML') === true))
{
if (is_string($xml) === true)
{
$dom = new DOMDocument();
if (#$dom->loadHTML($xml) === true)
{
return phXML(#simplexml_import_dom($dom), $xpath);
}
}
else if ((is_object($xml) === true) && (strcmp('SimpleXMLElement', get_class($xml)) === 0))
{
if (isset($xpath) === true)
{
$xml = $xml->xpath($xpath);
}
return $xml;
}
}
}
return false;
}
I'm too lazy not to use this function right now, but I'm sure you can get rid of it if you need to.
Here is a Regular Expression which works:
$search = '/<a\s[^>]*href=["\'](?:http:\/\/)?(?:[a-z0-9-]+(?:\.[a-z0-9-]+)*)\/(?:d\/)?d[?.].*?>.*?(?:lorem|test)+.*?<\/a>/i';
$matches = array();
preg_match_all($search, $html, $matches);
The only thing is it relies on there being a new-line character between each ` tag. Otherwise it will match something like:
example3<img ...>test</img>
Use an HTML parser. There are lots of reasons that Regex is absolutely not the solution for parsing HTML.
There's a good list of them here:
Robust and Mature HTML Parser for PHP
Will print only first and fourth link because two conditions are met.
preg_match_all('#href="(.*?)"(.*?)>(.*?)</a>#is', $string, $matches);
$count = count($matches[0]);
unset($matches[0], $matches[2]);
for($i = 0; $i < $count; $i++){
if(
strpos($matches[1][$i], '/d') !== false
&&
preg_match('#(lorem|test)#is', $matches[3][$i]) == true
)
{
echo $matches[1][$i];
}
}

Categories