Is this even possible...
Say I have some text with a link with a class of 'click':
<p>I am some text, i am some text, i am some text, i am some text
<a class="click" href="http://www.google.com">I am a link</a>
i am some text, i am some text, i am some text, i am some text</p>
Using PHP, how can I get the link with the class name 'click' and then get its href value?
There are a few ways to do this; the quickest is to use XPath:
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodeList = $xpath->query('//a[@class="click"]');
foreach ($nodeList as $node) {
$href = $node->getAttribute('href');
$text = $node->textContent;
}
You actually don't need to complicate your life at all:
$string='that html code with links';
// collect every match in one pass (a while(preg_match(...)) loop over the same
// string would never terminate, because it keeps finding the first match again)
preg_match_all('/<a class="click" href="([^"]*)">/', $string, $matches);
foreach ($matches[1] as $url) {
    // print the captured group, which is the URL you're searching for
    echo $url;
}
Related
I have the following div:
<div class="myclass"><strong><a rel="nofollow noopener" href="some link">dynamic content</a></strong></div>
and I want to get only the "dynamic content" anchor text.
So far I have tried preg_match_all with:
"'<div class=\"myclass\">(.*?)</div>'si"
that returns all div content.
I tried to combine it with:
"|<a.*(?=href=\"([^\"]*)\")[^>]*>([^<]*)</a>|i"
which returns the anchor text, but I cannot make it work.
Can someone help?
Thank you.
You can use DOMDocument instead of preg_match_all:
$html = '<div class="myclass"><strong><a rel="nofollow noopener" href="some link">dynamic content</a></strong></div>';
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$query = './/div[@class="myclass"]/strong/a';
$nodes = $xpath->query($query);
echo $nodes[0]->textContent;
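If the markup can contain more than one such div, a small variation is to iterate over the same $nodes list instead of indexing it:
foreach ($nodes as $node) {
    // print the anchor text of every matching div
    echo $node->textContent, PHP_EOL;
}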
I'm fetching some Wikipedia content in two different ways:
$html = file_get_contents('https://en.wikipedia.org/wiki/Sans-serif');
The first one gets the first paragraph:
$dom = new DomDocument();
@$dom->loadHTML($html);
$p = $dom->getElementsByTagName('p')->item(0)->nodeValue;
echo $p;
The second one gets the first paragraph after a specific $id:
$dom = new DOMDocument();
@$dom->loadHTML($html);
$p = $dom->getElementById($id)->getElementsByTagName('p')->item(0);
echo $p->nodeValue;
I'm looking for a third way to get the whole first part.
So I was thinking about getting all the <p> elements before the element with id or class "toc", which is the table of contents.
Any idea how to do that?
If you're just looking for the intro in plain text, you can simply use Wikipedia's API:
https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&explaintext=&titles=Sans-serif
If you want HTML formatting as well (excluding inner images and the like):
https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&titles=Sans-serif
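For example, a minimal sketch of reading the plain-text intro from the first API URL in PHP; the extract sits under query.pages.<pageid>, and since the page id is not known in advance, reset() grabs the first page entry:
$url = 'https://en.wikipedia.org/w/api.php?format=json&action=query'
     . '&prop=extracts&exintro=&explaintext=&titles=Sans-serif';
$json = json_decode(file_get_contents($url), true);
// the page id is the array key, so just take the first page entry
$page = reset($json['query']['pages']);
echo $page['extract'];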
You could use DOMDocument and DOMXPath with, for example, an XPath expression like:
//div[@id="toc"]/preceding-sibling::p
$doc = new DOMDocument();
libxml_use_internal_errors(true); // the page is HTML, not well-formed XML, so use the HTML loader
$doc->loadHTMLFile("https://en.wikipedia.org/wiki/Sans-serif");
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//div[@id="toc"]/preceding-sibling::p');
foreach ($nodes as $node) {
echo $node->nodeValue;
}
That would give you the content of the paragraphs preceding the div with id = toc.
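If you want to keep the markup of those paragraphs rather than only their text, a small variation that reuses $doc and $nodes from above is to serialize each node back to HTML:
$intro = '';
foreach ($nodes as $node) {
    // passing the node to saveHTML() serializes just that node
    $intro .= $doc->saveHTML($node);
}
echo $intro;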
I want to get all the href links in the HTML. I came across two possible ways. One is regex:
$input = urldecode(base64_decode($html_file));
$regexp = "href\s*=\s*(\"??)([^\" >]*?)\\1[^>]*>(.*)\s*";
if(preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER)) {
foreach($matches as $match) {
echo $match[2] ;//= link address
echo $match[3]."<br>" ;//= link text
}
}
And the other one is creating a DOM document and parsing it:
$html = urldecode(base64_decode($html_file));
//Create a new DOM document
$dom = new DOMDocument;
//Parse the HTML. The @ is used to suppress any parsing errors
//that will be thrown if the $html string isn't valid XHTML.
@$dom->loadHTML($html);
//Get all links. You could also use any other tag name here,
//like 'img' or 'table', to extract other tags.
$links = $dom->getElementsByTagName('a');
//Iterate over the extracted links and display their URLs
foreach ($links as $link){
//Extract and show the "href" attribute.
echo $link->nodeValue;
echo $link->getAttribute('href'), '<br>';
}
I don't know which one of these is more efficient, but the code will be used many times, so I want to clarify which is the better one to go with. Thank you!
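One rough way to compare them on your own data is to time both approaches; this is only a sketch, and the sample HTML and iteration count are assumptions for illustration:
$html = '<p><a href="https://example.com">a link</a></p>'; // assumed sample input

$start = microtime(true);
for ($i = 0; $i < 1000; $i++) {
    preg_match_all('/href\s*=\s*"([^"]*)"/i', $html, $m);
}
echo 'regex: ', microtime(true) - $start, " s\n";

$start = microtime(true);
for ($i = 0; $i < 1000; $i++) {
    $dom = new DOMDocument;
    @$dom->loadHTML($html);
    foreach ($dom->getElementsByTagName('a') as $a) {
        $a->getAttribute('href');
    }
}
echo 'DOM: ', microtime(true) - $start, " s\n";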
Here I'm looking for a regular expression in PHP which would match an anchor with a specific target="_parent" on it. I would like to get anchors with text, like:
preg_match_all('Text here', $subject, $matches, PREG_SET_ORDER);
HTML:
<a href="http://" target="_parent">
<FONT style="font-size:10pt" color=#000000 face="Tahoma">
<DIV><B>Text</B> - Text </DIV>
</FONT>
</a>
</DIV>
To be honest, the best way would be not to use a regular expression at all. Otherwise, you are going to miss all kinds of links, especially if you don't know that the links are always generated the same way.
The best way is to use an XML parser.
<?php
$html = 'Text here';
function extractTags($html) {
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html); // because dom will complain about badly formatted html
$sxe = simplexml_import_dom($dom);
$nodes = $sxe->xpath("//a[#target='_parent']");
$anchors = array();
foreach($nodes as $node) {
$anchor = trim((string)dom_import_simplexml($node)->textContent);
$attribs = $node->attributes();
$anchors[$anchor] = (string)$attribs->href;
}
return $anchors;
}
print_r(extractTags($html));
This will output:
Array (
[Text here] => http://
)
Even using it on your example:
$html = '<a href="http://" target="_parent">
<FONT style="font-size:10pt" color=#000000 face="Tahoma">
<DIV><B>Text</B> - Text </DIV>
</FONT>
</a>
</DIV>
';
print_r(extractTags($html));
will output:
Array (
[Text - Text] => http://
)
If you feel that the HTML is still not clean enough to be used with DOMDocument, then I would recommend using a project such as HTMLPurifier (see http://htmlpurifier.org/) to first clean the HTML up completely (and remove unneeded HTML) and use the output from that to load into DOMDocument.
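As a rough sketch of that clean-then-parse flow (assuming HTML Purifier is installed and autoloadable, e.g. via Composer, and that $dirtyHtml holds your raw markup):
require 'vendor/autoload.php'; // assumption: HTML Purifier installed via Composer

$purifier = new HTMLPurifier(HTMLPurifier_Config::createDefault());
$clean    = $purifier->purify($dirtyHtml); // strip unwanted/broken markup first

$dom = new DOMDocument;
@$dom->loadHTML($clean); // then parse the cleaned HTML as usual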
You should be using the DOMDocument class instead of regex. You would get a lot of false positives if you handle HTML with regex.
<?php
$html='Text here';
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $tag) {
if ($tag->getAttribute('target') === '_parent') {
echo $tag->nodeValue;
}
}
OUTPUT:
Text here
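If you also need the URL and not only the link text, the same loop (reusing $dom from above) can read the href attribute as well:
foreach ($dom->getElementsByTagName('a') as $tag) {
    if ($tag->getAttribute('target') === '_parent') {
        // print "url => link text" for every matching anchor
        echo $tag->getAttribute('href'), ' => ', $tag->nodeValue, PHP_EOL;
    }
}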
I'm working on using htmlpurifier to create a text-only version of my site.
I now need to replace all the a hrefs with the text-only URL, i.e. 'www.example.com/aboutus' becomes 'www.example.com/text/aboutus'.
Initially I tried a simple str_replace on the domain (I use a global variable for the domain), but the problem is that links to files also get replaced, i.e.
'www.example.com/document.pdf' becomes 'www.example.com/text/document.pdf', which then fails.
Is there a regular expression where I can say "replace domain with domain/text" where the URL does not include a certain string?
Thanks for any pointers you might be able to give me :)
Use a negative lookahead:
$output = preg_replace(
'#www\.example\.com(?!/text/)#',
'www.example.com/text',
$input
);
Better yet, use DOM with it:
$html = 'foo
<p>hello</p>
bar';
libxml_use_internal_errors(true); // suppresses DOM errors
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->query('//a/#href');
foreach ($hrefs as $href) {
$href->value = preg_replace(
'#^www\.example\.com(?!/text/)(.*?)(?<!\.pdf)$#',
'www.example.com/text\\1',
$href->value
);
}
This should give you:
foo
<p>hello</p>
bar