get several links inside specific div with one regex - php

Take a look at this html:
<div class="foo"><a href="#">link1</a><a href="#">link2</a></div>
<div class="bar"><a href="#">barlink</a></div>
I would like to know whether I can loop over all links inside foo with a regular expression in PHP.
I tried this, but it isn't working:
preg_match_all(
'#<div.*?class="foo".*?<a.*?>(?P<text>.*?)</a>#xi',
$text,
$matches,
PREG_SET_ORDER
);
Sadly, in this case it must be a regex, not XML or other parsers.

DON'T USE REGEX TO PARSE HTML.
<?php
$content =
'<div class="foo">
<a href="#">link1</a>
<a href="#">link2</a>
</div>
<div class="bar">
<a href="#">barlink</a>
</div>';
$dom = new DOMDocument();
$dom->loadHTML($content);
$divs = $dom->getElementsByTagName('div');
foreach($divs as $div)
{
$classes = explode(' ', $div->getAttribute('class'));
if(in_array('foo', $classes) || trim($div->getAttribute('class')) === 'foo')
{
foreach($div->getElementsByTagName('a') as $link)
{
echo $dom->saveXML($link);
}
}
}
?>
This will output all matching links under any div with class 'foo'.
Regular expressions should NOT be used to parse HTML, since HTML is not a regular language. The patterns get sloppy quickly, and you can end up with more problems than you started with, especially when you may be dealing with malformed HTML.
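The same selection can also be written as a single XPath query. This is a minimal sketch, assuming the links are ordinary a elements; the href values and the sample string are placeholders, not taken from the question:

```php
<?php
// Hypothetical sample markup; the href values are placeholders.
$content = '<div class="foo"><a href="#1">link1</a><a href="#2">link2</a></div>'
         . '<div class="bar"><a href="#3">barlink</a></div>';

$dom = new DOMDocument();
$dom->loadHTML($content);
$xpath = new DOMXPath($dom);

// Token-wise class test: matches class="foo" and class="a foo b",
// but not class="foobar".
$links = $xpath->query(
    '//div[contains(concat(" ", normalize-space(@class), " "), " foo ")]//a'
);

$texts = [];
foreach ($links as $link) {
    $texts[] = $link->textContent;
}
print_r($texts);
```

The concat/normalize-space idiom is the usual XPath 1.0 way to test for a single class name among several.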

Related

Regular Expression To Match Header Tags Not In Specific Div

So I have PHP code that puts out HTML that looks like this:
<div class="wrapper">
<h2>This is a header</h2>
<h2>This is one too</h2>
<h4>Here's one</h4>
<div class="ignore">
<h5>I'm one in here too</h5>
</div>
</div>
What I'm trying to do is run preg_match_all on the header tags. My regular expression (<h([1-6]{1})[^>]*)>.*<\/h\2> returns all of them appropriately, but I don't want to grab the headers that are in the div with the class "ignore". I was reading about negative lookaheads, but it gets tricky. Any help would be appreciated.
Desired output:
<h2>This is a header</h2>
<h2>This is one too</h2>
<h4>Here's one</h4>
Note that "I'm one in here too" is omitted because it's wrapped in a div with class "ignore".
Don't mess around with regular expressions here - unleash the power of DOMDocument in combination with xpath queries:
<?php
$html = <<<EOT
<div class="wrapper">
<h2>This is a header</h2>
<h2>This is one too</h2>
<h4>Here's one</h4>
<div class="ignore">
<h5>I'm one in here too</h5>
</div>
</div>
EOT;
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$headers = $xpath->query("
//div[not(contains(@class, 'ignore'))]
/*[self::h2 or self::h4 or self::h5]");
foreach ($headers as $header) {
echo $header->nodeValue . "\n";
}
?>
This will yield
This is a header
This is one too
Here's one
With DOMDocument and DOMXPath:
$html = <<<'HTML'
<div class="wrapper">
<h2>This is a header</h2>
<h2>This is one too</h2>
<h4>Here's one</h4>
<div class="ignore">
<h5>I'm one in here too</h5>
</div>
</div>
HTML;
$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);
$nodeList = $xp->query('
//*
[contains(";h1;h2;h3;h4;h5;h6;", concat(";", local-name(), ";"))]
[not(ancestor::div[
contains(concat(" ", normalize-space(@class), " "), " ignore ")
])
]');
foreach ($nodeList as $node) {
echo 'tag name: ', $node->nodeName, PHP_EOL,
'html content: ', $dom->saveHTML($node), PHP_EOL,
'text content: ', $node->textContent, PHP_EOL,
PHP_EOL;
}
If you aren't comfortable with XPath take a look at the zvon tutorial.
Since you specify that you want to do it with preg_match(), here is an example of a negative lookbehind (i.e. one that only matches occurrences NOT preceded by XYZ): https://regex101.com/r/FeAsuj/1
The lookbehind itself is (?<!<div class=\"ignore\">).
But in the test snippet, notice how:
the regex depends on the exact use of whitespace ...
... so a platform-dependent \r\n can break the regex
the lookbehind cannot have a variable length, i.e. \n? - see Regular Expression Lookbehind doesn't work with quantifiers ('+' or '*')
If you MUST continue to work with regexes, consider a 2-step approach:
step 1, you use preg_replace() to eliminate all unwanted sections.
step 2, use your existing regex.
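The two steps could look roughly like this. It is only a sketch: it assumes the ignored div contains no nested div (otherwise the lazy match stops at the first closing tag), and it uses a slightly simplified header pattern rather than the exact one from the question:

```php
<?php
$html = <<<'HTML'
<div class="wrapper">
<h2>This is a header</h2>
<h2>This is one too</h2>
<h4>Here's one</h4>
<div class="ignore">
<h5>I'm one in here too</h5>
</div>
</div>
HTML;

// Step 1: remove every <div class="ignore">...</div> block.
// Assumption: no <div> is nested inside the ignored block.
$cleaned = preg_replace('#<div\s+class="ignore"\s*>.*?</div>#is', '', $html);

// Step 2: run a header regex on what is left.
preg_match_all('#<h([1-6])[^>]*>.*?</h\1>#is', $cleaned, $matches);

print_r($matches[0]);
```

Even this version breaks as soon as the class attribute carries a second class name or uses single quotes, which is why the parser-based answers are more robust.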
In general, I would concur with the other posters: avoid regex and go with an HTML parser.

Difficulties with the function preg_match_all

I would like to get back the number which is between span HTML tags. The number may change!
<span class="topic-count">
::before
"
24
"
::after
</span>
I've tried the following code:
preg_match_all("#<span class=\"topic-count\">(.*?)</span>#", $source, $nombre[$i]);
But it doesn't work.
Entire code:
$result=array();
$page = 201;
while ($page>=1) {
$source = file_get_contents ("http://www.jeuxvideo.com/forums/0-27047-0-1-0-".$page."-0-counter-strike-global-offensive.htm");
preg_match_all("#<span class=\"topic-count\">(.*?)</span>#", $source, $nombre[$i]);
$result = array_merge($result, $nombre[$i][1]);
print("Page : ".$page ."\n");
$page-=25;
}
print_r ($nombre);
You can do it with
preg_match_all(
'#<span class="topic-count">[^\d]*(\d+)[^\d]*?</span>#s',
$html,
$matches
);
which would capture any digits before the end of the span.
However, note that this regex will only work for exactly this piece of html. If there is a slight variation in the markup, for instance, another class or another attribute, the pattern will not work anymore. Writing reliable regexes for HTML is hard.
Hence the recommendation to use a DOM parser instead, e.g.
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTMLFile('http://www.jeuxvideo.com/forums/0-27047-0-1-0-1-0-counter-strike-global-offensive.htm');
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);
foreach ($xpath->evaluate('//span[contains(@class, "topic-count")]') as $node) {
if (preg_match_all('#\d+#s', $node->nodeValue, $topics)) {
echo $topics[0][0], PHP_EOL;
}
}
DOM will parse the entire page into a tree of nodes, which you can then query conveniently via XPath. Note the expression
//span[contains(@class, "topic-count")]
which will give you all the span elements with a class attribute containing the string topic-count. Then if any of these nodes contain a digit, echo it.
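Note that contains() is a plain substring test, so it would also accept a class such as "topic-counter". A stricter, token-wise variant is sketched below; the inline HTML string is a made-up stand-in for the downloaded page, so the example runs without network access:

```php
<?php
// Hypothetical stand-in for the downloaded page.
$source = '<ul><li><span class="topic-count"> " 24 " </span></li>'
        . '<li><span class="topic-count extra">7</span></li>'
        . '<li><span class="topic-counter">99</span></li></ul>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($source);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$counts = [];

// Match "topic-count" only as a whole class token.
$query = '//span[contains(concat(" ", normalize-space(@class), " "), " topic-count ")]';
foreach ($xpath->query($query) as $span) {
    // Take the first run of digits from the span's text.
    if (preg_match('/\d+/', $span->textContent, $m)) {
        $counts[] = (int) $m[0];
    }
}
print_r($counts);
```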

Get text between 2 tags that change (regex)(php)

How should I get the text between two HTML tags that are not always the same? How can I let the regex "ignore" a part?
Let's say this is my HTML:
<html>
...
<span id="ctl00_ContentPlaceHolder1_gvDomain_ctl03_lblName">string 1</span>
...
<span id="ctl00_ContentPlaceHolder1_gvDomain_ctl04_lblName">string 2</span>
...
<span id="ctl00_ContentPlaceHolder1_gvDomain_ctl53_lblName">string 3</span>
...
</html>
As you can see, the ctlxx part is not always the same. This code only gets the first string:
preg_match('#\\<span id="ctl00_ContentPlaceHolder1_gvDomain_ctl03_lblName">(.+)\\</span>#s',$html,$matches);
$match = $matches[0];
echo $match;
How can I let regex ignore the ctlxx part and echo all the strings?
Thanks in advance
You can do it with DOMDocument and DOMXPath, calling preg_match from inside the XPath expression:
$dom = new DOMDocument();
$dom->loadHTML($str);
$x = new DOMXPath($dom);
// The next two lines make PHP functions available within the XPath expression
$x->registerNamespace("php", "http://php.net/xpath");
$x->registerPHPFunctions();
// Select span tags whose id matches the pattern; "> 0" turns the numeric result into a boolean instead of a position test
foreach($x->query('//span[php:functionString("preg_match", "/^ctl00_ContentPlaceHolder1_gvDomain_ctl\d+_lblName$/", @id) > 0]') as $node)
echo $node->nodeValue;
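If you want to solve it using a regular expression, you can match any ctl number with \d+ and collect all spans with preg_match_all:
<?php
preg_match_all('/<span id="ctl00_ContentPlaceHolder1_gvDomain_ctl\d+_lblName">(.+?)<\/span>/is', $html, $matches);
// $matches[1] holds the captured text of every matching span
foreach ($matches[1] as $match) {
echo $match, PHP_EOL;
}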
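A DOM-only variant that needs no registered PHP functions is sketched below. It assumes the ids always start with the same prefix and end in _lblName, as in the question; the extra span with id "unrelated" is made up to show that it is skipped:

```php
<?php
$html = '<div>'
      . '<span id="ctl00_ContentPlaceHolder1_gvDomain_ctl03_lblName">string 1</span>'
      . '<span id="ctl00_ContentPlaceHolder1_gvDomain_ctl04_lblName">string 2</span>'
      . '<span id="unrelated">skip me</span>'
      . '</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// XPath 1.0 has no ends-with(), so the suffix test uses substring():
// "_lblName" is 8 characters long, hence string-length(@id) - 7.
$query = '//span[starts-with(@id, "ctl00_ContentPlaceHolder1_gvDomain_ctl")'
       . ' and substring(@id, string-length(@id) - 7) = "_lblName"]';

$names = [];
foreach ($xpath->query($query) as $span) {
    $names[] = $span->textContent;
}
print_r($names);
```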

Change specific words into links within HTML using PHP [duplicate]

I need your help here.
I want to turn this:
sometext sometext http://www.somedomain.com/index.html sometext sometext
into:
sometext sometext <a href="http://www.somedomain.com/index.html">www.somedomain.com/index.html</a> sometext sometext
I have managed it by using this regex:
preg_replace("#((http|https|ftp)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)#i", '<a href="$1">$3</a>$4', $text);
The problem is it’s also replacing the img URL, for example:
sometext sometext <img src="http://domain.com/image.jpg"> sometext sometext
is turned into:
sometext sometext <img src="<a href="http://domain.com/image.jpg">domain.com/image.jpg</a>"> sometext sometext
Please help.
Streamlined version of Gumbo's answer:
$html = <<< HTML
<html>
<body>
<p>
This is a text with a <a href="http://example.com/1">link</a>
and another <a href="http://example.com/2">http://example.com/2</a>
and also another http://example.com with the latter being the
only one that should be replaced. There is also images in this
text, like <img src="http://example.com/foo"/> but these should
not be replaced either. In fact, only URLs in text that is no
a descendant of an anchor element should be converted to a link.
</p>
</body>
</html>
HTML;
Let's use an XPath that only fetches those elements that actually are textnodes containing http:// or https:// or ftp:// and that are not themselves textnodes of anchor elements.
$dom = new DOMDocument;
$dom->loadHTML($html);
$xPath = new DOMXPath($dom);
$texts = $xPath->query(
'/html/body//text()[
not(ancestor::a) and (
contains(.,"http://") or
contains(.,"https://") or
contains(.,"ftp://") )]'
);
The XPath above will give us a TextNode with the following data:
and also another http://example.com with the latter being the
only one that should be replaced. There is also images in this
text, like
Since PHP 5.3, we could also use PHP functions inside the XPath and select our nodes with the regex pattern instead of the three calls to contains.
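That variant might look like the following sketch. The sample paragraph and the simplified scheme pattern are assumptions for illustration; the explicit "> 0" keeps the numeric preg_match result from being read as a position:

```php
<?php
// Made-up sample: one bare URL, one URL already inside a link, one text without URL.
$html = '<p>Visit http://example.com today, or follow '
      . '<a href="http://example.com/2">http://example.com/2</a>. No URL here.</p>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$xPath = new DOMXPath($dom);
// Make preg_match callable from within XPath expressions.
$xPath->registerNamespace('php', 'http://php.net/xpath');
$xPath->registerPhpFunctions('preg_match');

// Text nodes outside of anchors whose content contains a URL scheme.
$texts = $xPath->query(
    '//text()[not(ancestor::a) and '
    . 'php:functionString("preg_match", "~(?:https?|ftp)://~i", .) > 0]'
);

foreach ($texts as $text) {
    echo $text->data, PHP_EOL;
}
```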
Instead of splitting the textnodes apart in the standards compliant way, we will use a document fragment and just replace the entire textnode with the fragment. Non-standard in this case only means, the method we will be using for this, is not part of the W3C specification of the DOM API.
foreach ($texts as $text) {
$fragment = $dom->createDocumentFragment();
$fragment->appendXML(
preg_replace(
"~((?:http|https|ftp)://(?:\S*?\.\S*?))(?=\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)~i",
'<a href="$1">$1</a>',
$text->data
)
);
$text->parentNode->replaceChild($fragment, $text);
}
echo $dom->saveXML($dom->documentElement);
and this will then output:
<html><body>
<p>
This is a text with a <a href="http://example.com/1">link</a>
and another <a href="http://example.com/2">http://example.com/2</a>
and also another <a href="http://example.com">http://example.com</a> with the latter being the
only one that should be replaced. There is also images in this
text, like <img src="http://example.com/foo"/> but these should
not be replaced either. In fact, only URLs in text that is no
a descendant of an anchor element should be converted to a link.
</p>
</body></html>
You shouldn’t do that with regular expressions – at least not regular expressions only. Use a proper HTML DOM parser like the one of PHP’s DOM library instead. You then can iterate the nodes, check if it’s a text node and do the regular expression search and replace the text node appropriately.
Something like this should do it:
$pattern = "~((?:http|https|ftp)://(?:\S*?\.\S*?))(?=\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)~i";
$doc = new DOMDocument();
$doc->loadHTML($str);
// for every element in the document
foreach ($doc->getElementsByTagName('*') as $elem) {
// for every child node in each element
foreach ($elem->childNodes as $node) {
if ($node->nodeType === XML_TEXT_NODE) {
// split the text content to get an array of 1+2*n elements for n URLs in it
$parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
$n = count($parts);
if ($n > 1) {
$parentNode = $node->parentNode;
// insert for each pair of non-URL/URL parts one DOMText and DOMElement node before the original DOMText node
for ($i=1; $i<$n; $i+=2) {
$a = $doc->createElement('a');
$a->setAttribute('href', $parts[$i]);
$a->setAttribute('target', '_blank');
$a->appendChild($doc->createTextNode($parts[$i]));
$parentNode->insertBefore($doc->createTextNode($parts[$i-1]), $node);
$parentNode->insertBefore($a, $node);
}
// insert the last part before the original DOMText node
$parentNode->insertBefore($doc->createTextNode($parts[$i-1]), $node);
// remove the original DOMText node
$node->parentNode->removeChild($node);
}
}
}
}
OK: since the DOMNodeLists returned by getElementsByTagName and childNodes are live, every change in the DOM is reflected in those lists, so you cannot use foreach, because it would also iterate over the newly added nodes. Instead, you need to use for loops and keep track of the added elements, increasing the index pointers and, ideally, the pre-calculated array boundaries accordingly.
But since that is quite difficult in such a somewhat complex algorithm (you would need one index pointer and array boundary for each of the three for loops), using a recursive algorithm is more convenient:
function mapOntoTextNodes(DOMNode $node, $callback) {
if ($node->nodeType === XML_TEXT_NODE) {
return $callback($node);
}
for ($i=0, $n=count($node->childNodes); $i<$n; ++$i) {
$nodesChanged = 0;
switch ($node->childNodes->item($i)->nodeType) {
case XML_ELEMENT_NODE:
$nodesChanged = mapOntoTextNodes($node->childNodes->item($i), $callback);
break;
case XML_TEXT_NODE:
$nodesChanged = $callback($node->childNodes->item($i));
break;
}
if ($nodesChanged !== 0) {
$n += $nodesChanged;
$i += $nodesChanged;
}
}
return 0; // changes inside this node do not alter the caller's child list
}
function foo(DOMText $node) {
$pattern = "~((?:http|https|ftp)://(?:\S*?\.\S*?))(?=\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)~i";
$parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
$n = count($parts);
if ($n > 1) {
$parentNode = $node->parentNode;
$doc = $node->ownerDocument;
for ($i=1; $i<$n; $i+=2) {
$a = $doc->createElement('a');
$a->setAttribute('href', $parts[$i]);
$a->setAttribute('target', '_blank');
$a->appendChild($doc->createTextNode($parts[$i]));
$parentNode->insertBefore($doc->createTextNode($parts[$i-1]), $node);
$parentNode->insertBefore($a, $node);
}
$parentNode->insertBefore($doc->createTextNode($parts[$i-1]), $node);
$parentNode->removeChild($node);
}
return $n-1;
}
$str = '<div>sometext http://www.somedomain.com/index.html sometext <img src="http://domain.com/image.jpg"> sometext sometext</div>';
$doc = new DOMDocument();
$doc->loadHTML($str);
$elems = $doc->getElementsByTagName('body');
mapOntoTextNodes($elems->item(0), 'foo');
Here mapOntoTextNodes is used to map a given callback function onto every DOMText node in a DOM document. You can either pass the whole DOMDocument node or just a specific DOMNode (in this case just the BODY node).
The function foo then finds and replaces the plain URLs in the DOMText node's content: it splits the content string into non-URL/URL parts using preg_split while capturing the delimiter, which results in an array of 1+2·n items. The non-URL parts become new DOMText nodes and the URL parts become new A elements; all of them are inserted before the original DOMText node, which is removed at the end. Since mapOntoTextNodes walks the tree recursively, it suffices to call it once on a specific DOMNode.
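Another way around the live-list problem, shown here only as a minimal sketch, is to copy the live DOMNodeList into a plain array with iterator_to_array before modifying the document. The sample markup and the inserted paragraphs are made up for illustration:

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<p>one</p><p>two</p>');

// Snapshot of the live list: later insertions no longer affect the loop.
$paragraphs = iterator_to_array($doc->getElementsByTagName('p'));

foreach ($paragraphs as $p) {
    // Inserting another <p> would grow the live list while it is being iterated;
    // the array copy stays at its original two entries.
    $copy = $doc->createElement('p', 'copy of ' . $p->textContent);
    $p->parentNode->insertBefore($copy, $p);
}

$total = $doc->getElementsByTagName('p')->length;
echo $total, PHP_EOL;
```

Results of DOMXPath::query are not live either, which is one reason the XPath-based answer can use foreach safely.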
Thanks for the reply, but it still doesn't work. I fixed it using this function:
function livelinked ($text){
$old = array();
$new = array();
preg_match_all("#((http|https|ftp)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)|^(jpg)#i", $text, $ccs);
foreach ($ccs[3] as $cc) {
// strpos() returns 0 for a match at the start, so compare with === false
if (strpos($cc,"jpg")===false && strpos($cc,"gif")===false && strpos($cc,"png")===false ) {
$old[] = "http://".$cc;
$new[] = '<a href="http://'.$cc.'">'.$cc.'</a>';
}
}
return str_replace($old,$new,$text);
}
If you'd like to keep using a regex (and in this case, a regex is quite appropriate), you can have the regex match only URLs that "stand alone". Note that a word boundary (\b) is not enough here, because there is also a word boundary between a quote character and http. A negative lookbehind, (?<!\S), only lets the regex match where http is immediately preceded by whitespace or the beginning of the text:
preg_replace("#(?<!\S)((http|https|ftp)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)#i", '<a href="$1">$3</a>$4', $text);
//             ^^^^^^^ the lookbehind
Thus, "http://..." won't match, but http:// as its own word will.
DOMDocument is more mature and runs much faster, so this is just an alternative in case someone wants to use PHP Simple HTML DOM Parser:
<?php
require_once('simple_html_dom.php');
$html = str_get_html('sometext sometext http://www.somedomain.com/index.html sometext sometext
<a href="http://www.somedomain.com/index.html">http://www.somedomain.com/index.html</a>
sometext sometext <img src="http://domain.com/image.jpg"> sometext sometext');
foreach ($html->find('text') as $element)
{
// you can add any tag into the array to exclude from replace
if (!in_array($element->parent()->tag, array('a')))
$element->innertext = preg_replace("#((http|https|ftp)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)#i", '<a href="$1">$3</a>$4', $element->innertext);
}
echo $html;
You can try my code from this question:
echo preg_replace('/<a href="([^"]*)([^<\/]*)<\/a>/i', "$1", 'sometext sometext <img src="http//domain.com/image.jpg"> sometext sometext');
If you want to convert some other tags, that's easy enough:
echo preg_replace('/<img src="([^"]*)([^\/><]*)>/i', "$1", 'sometext sometext <img src="http//domain.com/image.jpg"> sometext sometext');
Match a whitespace character (\s) at the start and end of the URL string. This ensures that
"http://url.com"
is not matched, while
http://url.com
is matched.

Extract text from HTML

<strong class="nfpd">Actors</strong>: example world <br />
How can I get "example world" from this using a regular expression in PHP?
preg_match('/<strong class="nfpd">Actors<\/strong>:([^<]+)<br \/>/', $text, $matches);
print_r($matches);
Like Gumbo already pointed out in the comments to this question, and like you have been told in a number of your previous questions as well, regex isn't the right tool for parsing HTML.
The following will use DOM to get the first following sibling of any strong elements with a class attribute of nfpd. In the case of the example HTML, this would be the content of the TextNode, e.g. : example world.
Example HTML:
$html = <<< HTML
<p>
<strong class="nfpd">Actors</strong>: example world <br />
something else
</p>
HTML;
And extraction with DOM
libxml_use_internal_errors(TRUE);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xPath = new DOMXPath($dom);
libxml_clear_errors();
$nodes = $xPath->query('//strong[@class="nfpd"]/following-sibling::text()[1]');
foreach($nodes as $node) {
echo $node->nodeValue; // : example world
}
You can also do it without XPath, though it gets more verbose then:
$nodes = $dom->getElementsByTagName('strong');
foreach($nodes as $node) {
if($node->hasAttribute('class') &&
$node->getAttribute('class') === 'nfpd' &&
$node->nextSibling) {
echo $node->nextSibling->nodeValue; // : example world
}
}
Removing the colon and whitespace is trivial: Use trim.
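For instance, trim accepts a second argument listing the characters to strip from both ends. In this small sketch the $raw value is a made-up stand-in for the node value returned by the examples above:

```php
<?php
// Hypothetical value, shaped like $node->nodeValue in the DOM examples.
$raw = ": example world \n";

// Strip the colon plus the default whitespace characters from both ends.
$actors = trim($raw, ": \t\n\r\0\x0B");

echo $actors, PHP_EOL;
```

Passing a character list replaces the default set, which is why the whitespace characters are repeated next to the colon.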
