I'm trying to use preg_match_all to extract all urls from a block of HTML code. I'm also trying to ignore all images.
Example HTML block:
$html = '<p>This is a test</p><br>http://www.facebook.com<br><img src="http://www.google.com/photo.jpg">www.yahoo.com https://www.aol.com<br>';
I'm using the following to try to build an array of URLs only (not images):
if(preg_match_all('~(?:(?:https://)|(?:http://)|(?:www\.))(?![^" ]*(?:jpg|png|gif|"))[^" <>]+~', $html, $links))
{
print_r($links);
}
In the example above the $links array should contain:
http://www.facebook.com, www.yahoo.com, https://www.aol.com
Google is left out because it contains the .jpg image extension. The problem occurs when I add an image like this one to $html:
<img src="http://www.google.com/image%201.jpg">
It seems as though the percent sign causes preg_match to break apart the URL and extract the following "link".
http://www.google.com/image
Any idea how to grab ONLY URLs that are not images (even if they contain special characters that URLs commonly have)?
Using DOM lets you work with the structure of an HTML document. In your case, it lets you pick out the parts you want to fetch the URLs from:
Load the HTML using DOM
Fetch urls from link href attributes using Xpath (only if you want them, too)
Fetch text nodes from the DOM using Xpath
Use RegEx on text node value to match urls
Here is an example implementation:
$html = <<<'HTML'
<p>This is a test</p>
<br>
http://www.facebook.com
<br>
<img src="http://www.google.com/photo.jpg">
www.yahoo.com
https://www.aol.com
<a href="http://www.google.com">Link</a>
<!-- http://comment.ingored.url -->
<br>
HTML;
$urls = array();
$dom = new DOMDocument();
$dom->loadHtml($html);
$xpath = new DOMXpath($dom);
// fetch urls from link href attributes
foreach ($xpath->evaluate('//a[@href]/@href') as $href) {
$urls[] = $href->value;
}
// fetch urls inside text nodes
$pattern = '(
(?:(?:https?://)|(?:www\.))
(?:[^"\'\\s]+)
)xS';
foreach ($xpath->evaluate('/html/body//text()') as $text) {
$matches = array();
preg_match_all($pattern, $text->nodeValue, $matches);
foreach ($matches[0] as $href) {
$urls[] = $href;
}
}
var_dump($urls);
Output:
array(4) {
[0]=>
string(21) "http://www.google.com"
[1]=>
string(23) "http://www.facebook.com"
[2]=>
string(13) "www.yahoo.com"
[3]=>
string(19) "https://www.aol.com"
}
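If image URLs can still end up in the list (for example through an href that points at a picture), one option is to check the file extension of the URL path after collecting. The following is only a sketch: isImageUrl() is a hypothetical helper that is not part of the answer above, and the list of extensions is an assumption you would adjust. It decodes the path first, so a percent-encoded name such as image%201.jpg is still recognized.

```php
<?php
// Hypothetical helper (not from the original answer): true if the URL path ends in an image extension.
function isImageUrl(string $url): bool
{
    // parse_url() finds the path reliably only when a scheme is present,
    // so add one for bare "www." URLs
    if (!preg_match('~^https?://~i', $url)) {
        $url = 'http://' . $url;
    }
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return false; // no path at all, e.g. http://www.facebook.com
    }
    // decode %20 and friends before looking at the extension
    $extension = strtolower(pathinfo(rawurldecode($path), PATHINFO_EXTENSION));
    return in_array($extension, ['jpg', 'jpeg', 'png', 'gif'], true);
}

$urls = [
    'http://www.facebook.com',
    'http://www.google.com/image%201.jpg',
    'www.yahoo.com',
    'https://www.aol.com',
];

// keep everything that does not look like an image
$links = array_values(array_filter($urls, function ($url) {
    return !isImageUrl($url);
}));
print_r($links);
```

This only looks at the extension, so an image served from a URL without one would not be caught.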
I use this code to make clickable links from a string:
$string = "
<h2>This is a test</h2>
<p>Just a test to see if it works!</p>
<img src='./images/test.jpg' alt='test image' />
";
$wordtoconvert = "test";
$content = str_ireplace($wordtoconvert, "<a href='https://www.google.com' class='w3-text-blue'>" . $wordtoconvert . "</a>", $string);
This works almost perfectly for me, except that I do NOT want to convert the word 'test' in the heading and the image tag. I only want to convert occurrences inside the paragraph tag.
How can that be done please?
PHP - Make anchors of a given word in the paragraphs content of an html string
Here I show and explain a simple demo implementing the function makeAnchorsOfTargetWordInHtml($html, $targetText) using the DOMDocument class available in PHP. Doing the string operations yourself is unsafe because the input HTML may contain content you are not aware of, and a very basic string replacement can then break the markup.
How to: parse an html string, make transformations on its nodes and return the resulting html string
As suggested in comments I think it would be much safer if you process the html content using a legit parser.
The idea is, starting from an html content string:
Parse it with DOMDocument
https://www.php.net/manual/en/class.domdocument.php
Select all the p elements in the parsed html node
https://www.php.net/manual/en/domdocument.getelementsbytagname
Then, for each one of them:
Take the p tag string content (->nodeValue)
*This approach only works if the p node contains plain text: ->nodeValue returns just the concatenated text, so any child elements are lost when the paragraph is rebuilt
Split it into text fragments, isolating the target word.
the point is producing an array of strings in the order they are found in the p content, including the target text as a separate element. So that for example if you have the string "word1 word2 word3" and the target is "ord" the array we need is ["w", "ord", "1 w", "ord", "2 w", "ord", "3"]
Empty the p node content ->nodeValue = ''
For each text fragments we had in the array we got before, create a
new node and append it to the paragraph. Such node will be a new
anchor node if the fragment is the target word, otherwise it will be
a text node.
https://www.php.net/manual/en/domdocument.createelement.php
https://www.php.net/manual/en/domelement.setattribute
In the end take the whole parent node processed so far and return its
html serialization with ->saveHTML()
https://www.php.net/manual/en/domdocument.savehtml
Demo
https://onlinephp.io/c/b64d3
<?php
$html = "<h2>This is a test</h2><p>Just a test to see if it works!</p><img src='./images/test.jpg' alt='test image' />";
$res = makeAnchorsOfTargetWordInHtml($html, 'test');
echo $res;
function makeAnchorsOfTargetWordInHtml($html, $targetText){
//loads the html document passed
$dom = new DOMDocument();
$dom->loadHTML($html);
//find all p elements and loop through them
$ps = $dom->getElementsByTagName('p');
foreach ($ps as $p) {
//grabs the p content
$content = $p->nodeValue;
//split it in text fragment addressing the targetText
$textFragments = getTextFragments($content, $targetText);
//reset the content of the paragraph
$p->nodeValue = '';
//for each text fragment splitting the content in segments delimiting the targetText
foreach($textFragments as $fragment){
//if the fragment is the targetText, set $node as an anchor
if($fragment == $targetText){
$node = $dom->createElement('a', $fragment);
$node->setAttribute('href', 'https://www.google.com');
$node->setAttribute('class', 'w3-text-blue');
}
//otherwise set $node as a textNode
else{
$node = $dom->createTextNode($fragment);
}
//appends the node to the parent p
$p->appendChild($node);
}
}
//return the processed html
return $dom->saveHTML();
}
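function getTextFragments($input, $textToFind){
$fragments = explode($textToFind, $input);
//index of the last fragment; comparing by index avoids skipping the target
//when an earlier fragment has the same value as the last one
$lastIndex = count($fragments) - 1;
$result = array();
foreach ($fragments as $index => $fragment) {
$result[] = $fragment;
//re-insert the target text between two consecutive fragments
if ($index < $lastIndex) {
$result[] = $textToFind;
}
}
return $result;
}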
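If the paragraphs can contain nested elements (em, strong and so on), a variation is to work on the text nodes inside each p instead of on ->nodeValue, so the inner markup survives. This is only a sketch under my own assumptions: linkWordInParagraphs() is a hypothetical name, the match is case-insensitive like the str_ireplace in the question, and text that is already inside a link is skipped.

```php
<?php
// Hypothetical helper (not from the demo above): wraps $word in a link
// inside every text node below a <p>, leaving nested elements intact.
function linkWordInParagraphs(string $html, string $word, string $url): string
{
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    // copy the result into an array because the tree is modified in the loop
    $textNodes = iterator_to_array($xpath->query('//p//text()[not(ancestor::a)]'));
    foreach ($textNodes as $textNode) {
        // with PREG_SPLIT_DELIM_CAPTURE the matched words sit at the odd indexes
        $parts = preg_split(
            '/(' . preg_quote($word, '/') . ')/i',
            $textNode->nodeValue,
            -1,
            PREG_SPLIT_DELIM_CAPTURE
        );
        if (count($parts) < 2) {
            continue; // word not present in this text node
        }
        $fragment = $dom->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $a = $dom->createElement('a');
                $a->setAttribute('href', $url);
                $a->setAttribute('class', 'w3-text-blue');
                $a->appendChild($dom->createTextNode($part));
                $fragment->appendChild($a);
            } elseif ($part !== '') {
                $fragment->appendChild($dom->createTextNode($part));
            }
        }
        // swap the original text node for the mixed text/anchor fragment
        $textNode->parentNode->replaceChild($fragment, $textNode);
    }
    return $dom->saveHTML();
}

$out = linkWordInParagraphs(
    '<h2>This is a test</h2><p>Just a <em>small test</em> to test it!</p>',
    'test',
    'https://www.google.com'
);
echo $out;
```

The heading is left alone because the XPath expression only selects text below p elements.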
I want to get all the href links in the html. I came across two possible ways. One is the regex:
$input = urldecode(base64_decode($html_file));
$regexp = "href\s*=\s*(\"??)([^\" >]*?)\\1[^>]*>(.*)\s*";
if(preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER)) {
foreach($matches as $match) {
echo $match[2] ;//= link address
echo $match[3]."<br>" ;//= link text
}
}
And the other one is creating DOM document and parsing it:
$html = urldecode(base64_decode($html_file));
//Create a new DOM document
$dom = new DOMDocument;
//Parse the HTML. The @ is used to suppress any parsing errors
//that will be thrown if the $html string isn't valid XHTML.
@$dom->loadHTML($html);
//Get all links. You could also use any other tag name here,
//like 'img' or 'table', to extract other tags.
$links = $dom->getElementsByTagName('a');
//Iterate over the extracted links and display their URLs
foreach ($links as $link){
//Extract and show the "href" attribute.
echo $link->nodeValue;
echo $link->getAttribute('href'), '<br>';
}
I don't know which of these is more efficient, and the code will be run many times, so I want to find out which one is better to go with. Thank you!
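As a side note on the DOM variant: instead of silencing loadHTML() with @, the parser warnings can be collected with libxml_use_internal_errors(). The snippet below is a minimal sketch with a made-up $html string (the base64/urldecode step from the question is left out). It says nothing about which approach is faster; that would have to be measured on real input.

```php
<?php
// made-up input for illustration; the second paragraph is deliberately left unclosed
$html = '<p><a href="/one">One</a> <a href="/two">Two</a></p><p>unclosed';

// collect libxml warnings internally instead of emitting PHP warnings
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();                 // discard whatever the parser complained about
libxml_use_internal_errors($previous); // restore the earlier setting

$hrefs = [];
foreach ($dom->getElementsByTagName('a') as $link) {
    $hrefs[] = $link->getAttribute('href');
}
print_r($hrefs);
```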
Say I have the following string:
<a name="anchor" title="anchor title">
Currently I can extract name and title with strpos and substr, but I want to do it right. How can I do this with regex? And what if I wanted to extract from many of these tags within a block of text?
I've tried this regex:
/name="([A-Z,a-z])\w+/g
But it gets the name=" part as well, I just want the value.
The regex (\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']? can be used to extract all attributes
DOMDocument example:
<?php
$titles = array();
$doc = new DOMDocument();
$doc->loadHTML('<html><body>Test<br><a name="anchor" title="anchor title"></body></html>');
$links = $doc->getElementsByTagName('a');
if ($links->length!=0) {
foreach ($links as $a) {
$titles[] = $a->getAttribute('title');
}
}
?>
You commented: "I'm actually parsing the data before the page is rendered so DOM is not possible, right?"
We're working with the scraped HTML, so we construct a DOM with these functions and parse like XML.
Good examples in the comments here: http://php.net/manual/en/domdocument.getelementsbytagname.php
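For the "many of these tags within a block of text" part, the same idea extends to an XPath query. This is just a sketch with a made-up $html string; it collects name and title from every a element that has a name attribute.

```php
<?php
// made-up input for illustration
$html = '<p>Test <a name="anchor" title="anchor title">one</a> and '
      . '<a name="other" title="other title">two</a></p>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$attributes = [];
// every <a> element that carries a name attribute
foreach ($xpath->query('//a[@name]') as $a) {
    $attributes[] = [
        'name'  => $a->getAttribute('name'),
        'title' => $a->getAttribute('title'),
    ];
}
print_r($attributes);
```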
Using the following code I get the "img" tags from some HTML and check whether they are wrapped in "a" tags. If the current "img" tag is not inside an "a" (hyperlink), I want to wrap it in one, adding the opening and closing "a" tags and setting the target. For this I want the whole HTML of the "img" tag to work with.
The question is how I can pass the "img" tag's HTML to the regexp. I need some PHP variable in the place marked with ??? signs.
$doc = new DOMDocument();
$doc->loadHTML($article_header);
$imgs = $doc->getElementsByTagName('img');
foreach ($imgs as $img) {
if ($img->parentNode->tagName != "a") {
preg_match_all("|<img(.*)\/>|U", ??? , $matches, PREG_PATTERN_ORDER);
}
}
You do not want to use regex for this. You already have a DOM, so use it:
foreach ($imgs as $img) {
$container = $img->parentNode;
if ($container->tagName != "a") {
$a = $doc->createElement("a");
$a->appendChild( $img->cloneNode(true) );
$container->replaceChild($a, $img);
}
}
see documentation on
DOMDocument::createElement
DOMNode::appendChild
DOMNode::cloneNode
DOMNode::replaceChild
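Since the question also mentions setting a target, here is a slightly extended sketch. The sample $article_header, the choice of linking to the image's own src, and target="_blank" are my assumptions, not something stated in the question. The node list is copied into an array first, because getElementsByTagName() returns a live list and the loop changes the tree.

```php
<?php
// made-up input: one bare image and one that is already linked
$article_header = '<div><img src="a.jpg"><a href="b.html"><img src="b.jpg"></a></div>';

$doc = new DOMDocument();
$doc->loadHTML($article_header);

// snapshot of the live node list, safe to loop over while the tree changes
$imgs = iterator_to_array($doc->getElementsByTagName('img'));
foreach ($imgs as $img) {
    if (strtolower($img->parentNode->nodeName) !== 'a') {
        $a = $doc->createElement('a');
        $a->setAttribute('href', $img->getAttribute('src')); // assumption: link to the image itself
        $a->setAttribute('target', '_blank');                // assumption: open in a new tab
        // put the new <a> where the image was, then move the image into it
        $img->parentNode->replaceChild($a, $img);
        $a->appendChild($img);
    }
}

$out = $doc->saveHTML();
echo $out;
```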
<li class="zk_list_c2 f_l"><a title="abc" target="_blank" href="link">
abc
</a> </li>
How would I extract abc and link?
$pattern="/<li class=\"zk_list_c2 f_l\"><a title=\"(.*)\" target=\"_blank\" href=\"(.*)\">\s*(.*)\s*<\/a> <\/li>/m";
preg_match_all($pattern, $content, $matches);
The one I have right now doesn't seem to work.
Considering you are trying to extract some data from an HTML string, regexes are generally not the right/best tool for the job.
Instead, why not use a DOM parser, like the DOMDocument class provided with PHP, and its DOMDocument::loadHTML method?
Then, you could navigate through your HTML document using DOM methods, which is much easier than using regex, especially considering that HTML is not quite regular.
Here, for example, you could use something like this :
$html = <<<HTML
<li class="zk_list_c2 f_l"><a title="abc" target="_blank" href="link">
abc
</a> </li>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$as = $dom->getElementsByTagName('a');
foreach ($as as $a) {
var_dump($a->getAttribute('href'));
var_dump(trim($a->nodeValue));
}
And you would get the following output :
string(4) "link"
string(3) "abc"
The code is not very hard, I'd say, but, in a few words, here is what it's doing :
Load the HTML string : DOMDocument::loadHTML
Extract all <a> tags : DOMDocument::getElementsByTagName
Foreach tag found :
get the href attribute : DOMElement::getAttribute
and the value of the node : DOMNode::$nodeValue
Just a note : you might want to check if the href attribute exists, with DOMElement::hasAttribute, before trying to use its value...
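A minimal sketch of that check, using a made-up list in which one anchor has no href (the $found array is only there for illustration):

```php
<?php
// made-up input: the second <a> has no href attribute
$html = '<ul><li><a title="abc" href="link">abc</a></li>'
      . '<li><a name="no-href">plain anchor</a></li></ul>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$found = [];
foreach ($dom->getElementsByTagName('a') as $a) {
    // skip anchors without an href instead of collecting an empty string
    if (!$a->hasAttribute('href')) {
        continue;
    }
    $found[] = [$a->getAttribute('href'), trim($a->nodeValue)];
}
var_dump($found);
```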
EDIT after the comments : here's a quick example using DOMXpath to get to the links ; I supposed you want the link that's inside the <li> tag with class="zk_list_c2 f_l" :
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$as = $xpath->query('//li[@class="zk_list_c2 f_l"]/a');
foreach ($as as $a) {
var_dump($a->getAttribute('href'));
var_dump(trim($a->nodeValue));
}
And, again, you get :
string(4) "link"
string(3) "abc"
As you can see, the only thing that changes is the way you get to the right <a> tag : instead of DOMDocument::getElementsByTagName, it's just a matter of :
instantiating the DOMXPath class
and calling DOMXPath::query with the right XPath query.
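One caveat with comparing @class to the full string is that it stops matching as soon as the classes appear in a different order or another class is added. A common XPath 1.0 idiom tests for a single class token instead. The following is only a sketch with made-up markup, matching on zk_list_c2 alone:

```php
<?php
// made-up input: same classes as in the question, but in a different order
$html = '<ul><li class="f_l zk_list_c2"><a href="link">abc</a></li>'
      . '<li class="other"><a href="skip">no</a></li></ul>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// pad the class list with spaces so that only a whole class token can match
$query = '//li[contains(concat(" ", normalize-space(@class), " "), " zk_list_c2 ")]/a';

$hrefs = [];
foreach ($xpath->query($query) as $a) {
    $hrefs[] = $a->getAttribute('href');
}
var_dump($hrefs);
```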