Problem getting content from an HTML string variable in PHP

I have this code:
$doc = new \DOMDocument();
$doc->loadHTML($content);
$links = [];
$container = $doc->getElementById("content");
$arr = $container->getElementsByTagName("a");
foreach($arr as $item) {
$href = $item->getAttribute("href");
$title = $item->getAttribute("title");
$links[] = [
'href' => $href,
'title' => $title
];
}
for($i = 0, $l = count($links); $i < $l; ++$i) {
echo $links[$i]['title'].' '.$links[$i]['href'].'<br />';
}
The HTML structure looks like this:
<div class="post-content right-col">
<a title="" href="https://www.swisscars.pl/samochody/516321/">
<img src="https://swisscars.pl/uploads2/180843_0.jpg" alt="" class="thumb alignleft" height="75" width="75"/>
</a>
<h2 style="line-height:150%;">
<a href="https://www.swisscars.pl/samochody/516321/" rel="bookmark" title="Renault Kangoo II (96’011 km)">
Renault Kangoo II (96’011 km) </a>
</h2>
Do końca aukcji: <span id="countdown100">2018-10-23 14:00:00 GMT+02:00</span><p>DATA ZAKONCZENIA AUKCJI: 2018-10-23 14:00</p>
</div>
</div>
I want to get values only from the a tags that have the attribute rel="bookmark". I tried to use the hasAttribute function, but it is not working. How can I get the content only from a tags with the rel="bookmark" attribute? Does PHP have a hasAttribute() function or something similar?
Thanks for the help.

XPath might be an option, you could do this:
$doc = new \DOMDocument();
$doc->loadHTML($content);
$xpath = new DOMXPath($doc);
$links = $xpath->query('//a[@rel="bookmark"]');
That should return a DOMNodeList you could loop through.
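As a minimal sketch of that loop, the snippet below collects the href and title of each matched anchor. The sample markup is a shortened version of the question's HTML, and the id="content" wrapper is an assumption. DOMElement::hasAttribute() does exist, but it only reports whether an attribute is present, so the XPath predicate is what compares the value.

```php
<?php
// Shortened sample markup; the id="content" wrapper is assumed for illustration.
$content = '<div id="content"><div class="post-content right-col">'
    . '<a title="" href="https://www.swisscars.pl/samochody/516321/">thumb</a>'
    . '<h2><a href="https://www.swisscars.pl/samochody/516321/" rel="bookmark" '
    . 'title="Renault Kangoo II">Renault Kangoo II</a></h2>'
    . '</div></div>';

$doc = new DOMDocument();
$doc->loadHTML($content);
$xpath = new DOMXPath($doc);

$links = [];
// Only anchors whose rel attribute is exactly "bookmark" are selected.
foreach ($xpath->query('//a[@rel="bookmark"]') as $a) {
    $links[] = [
        'href'  => $a->getAttribute('href'),
        'title' => $a->getAttribute('title'),
    ];
}

foreach ($links as $link) {
    echo $link['title'] . ' ' . $link['href'] . "\n";
}
```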

Related

How to extract a link between paragraph tags

I'm trying to fetch a link that sits between p tags, but my result is "/playlist" and I need a link like "song/54826/father-friend".
I've been on this for hours now. Please help me out.
<div class="track__body">
<p class="track__track">
Father & Friend
<span class="track__artist" data-artist="" id="zwartelijst_artist">Alain Clark</span>
</p>
<a href="/playlist" class="track__playlist">
</a>
include('simple_html_dom.php');
$url="some url";
$html = file_get_contents($url);
$links = [];
$document = new DOMDocument;
$document ->loadHTML($html);
$xPath = new DOMXPath($document );
$anchorTags = $xPath->evaluate("//div[@class=\"track__body\"]//a/@href");
foreach ($anchorTags as $anchorTag) {
$links[] = $anchorTag->nodeValue;
}
echo $links[1];
You need to modify your xpath so it scopes to the right element.
$document = new DOMDocument;
$document ->loadHTML('<div class="track__body">
<p class="track__track">
Father & Friend
<span class="track__artist" data-artist="" id="zwartelijst_artist">Alain Clark</span>
</p>
<a href="/playlist" class="track__playlist">
</a>');
$xPath = new DOMXPath($document );
$anchorTags = $xPath->evaluate("//div[@class=\"track__body\"]/p[@class=\"track__track\"]/a/@href");
foreach ($anchorTags as $anchorTag) {
echo $anchorTag->nodeValue;
}
https://3v4l.org/YY0dD
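The scoped query only returns something if the paragraph actually contains an anchor. The sketch below assumes the track title is wrapped in an a element inside the p, which is what the expected result "song/54826/father-friend" suggests; that anchor is an assumption and is not visible in the markup as posted.

```php
<?php
// Assumption: the title inside <p class="track__track"> is an <a> element.
$html = '<div class="track__body">'
    . '<p class="track__track">'
    . '<a href="song/54826/father-friend">Father &amp; Friend</a>'
    . '<span class="track__artist">Alain Clark</span>'
    . '</p>'
    . '<a href="/playlist" class="track__playlist"></a>'
    . '</div>';

$document = new DOMDocument();
$document->loadHTML($html);
$xPath = new DOMXPath($document);

// The path is limited to anchors that are direct children of the track paragraph,
// so the /playlist link that follows the paragraph is not matched.
$query = '//div[@class="track__body"]/p[@class="track__track"]/a/@href';

$hrefs = [];
foreach ($xPath->query($query) as $attr) {
    $hrefs[] = $attr->nodeValue;
}
print_r($hrefs);
```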

PHP string search and replace - possible use of DOM Needed

I can't seem to figure out how to achieve my goal.
I want to find and replace the href of a link with a specific class in output generated from an RSS feed (I need to be able to replace it later, no matter what link is there).
Example HTML:
<a class="epclean1" href="#">
WHAT IT SHOULD LOOK LIKE:
<a class="epclean1" href="google.com">
I may need to use the DOM to get the element, as the full PHP below already creates a document. If that is the case, I need to know how to find the link by class and set the href URL that way.
FULL PHP:
<?php
$rss = new DOMDocument();
$feed = array();
$urlArray = array(array('url' => 'https://feeds.megaphone.fm')
);
foreach ($urlArray as $url) {
$rss->load($url['url']);
foreach ($rss->getElementsByTagName('item') as $node) {
$item = array (
'title' => $node->getElementsByTagName('title')->item(0)->nodeValue
);
array_push($feed, $item);
}
}
usort( $feed, function ( $a, $b ) {
return strcmp($a['title'], $b['title']);
});
$limit = sizeof($feed);
$previous = null;
$count_firstletters = 0;
$i = 1; // counter for the numbered epclean class, so the first link gets epclean1
for ($x = 0; $x < $limit; $x++) {
$firstLetter = substr($feed[$x]['title'], 0, 1); // Getting the first letter from the Title you're going to print
if($previous !== $firstLetter) { // If the first letter is different from the previous one then output the letter and start the UL
if($count_firstletters != 0) {
echo '</ul>'; // Closing the previously open UL only if it's not the first time
echo '</div>';
}
echo '<button class="glanvillecleancollapsible">'.$firstLetter.'</button>';
echo '<div class="glanvillecleancontent">';
echo '<ul style="list-style-type: none">';
$previous = $firstLetter;
$count_firstletters ++;
}
$title = str_replace(' & ', ' &amp; ', $feed[$x]['title']);
echo '<li>';
echo '<a class="epclean'.$i++.'" href="#" target="_blank">'.$title.'</a>';
echo '</li>';
}
echo '</ul>'; // Close the last UL
echo '</div>';
?>
</div>
</div>
The above full PHP renders on the site like so (shortened, as there are 200+ entries):
<div class="modal-glanvillecleancontent">
<span class="glanvillecleanclose">×</span>
<p id="glanvillecleaninstruct">Select the first letter of the episode that you wish to get clean version for:</p>
<br>
<button class="glanvillecleancollapsible">8</button>
<div class="glanvillecleancontent">
<ul style="list-style-type: none">
<li><a class="epclean1" href="#" target="_blank">80's Video Vixen Tawny Kitaen 044</a></li>
</ul>
</div>
<button class="glanvillecleancollapsible">A</button>
<div class="glanvillecleancontent">
<ul style="list-style-type: none">
<li><a class="epclean2" href="#" target="_blank">Abby Stern</a></li>
<li><a class="epclean3" href="#" target="_blank">Actor Nick Hounslow 104</a></li>
<li><a class="epclean4" href="#" target="_blank">Adam Carolla</a></li>
<li><a class="epclean5" href="#" target="_blank">Adrienne Janic</a></li>
</ul>
</div>
You're not very clear about how your question relates to the code shown, but I don't see any attempt to replace the attribute within the DOM code. You'd want to look at XPath to find the desired elements:
function change_clean($content) {
$dom = new DomDocument;
$dom->loadXML($content);
$xpath = new DomXpath($dom);
$nodes = $xpath->query("//a[@class='epclean1']");
foreach ($nodes as $node) {
if ($node->getAttribute("href") === "#") {
$node->setAttribute("href", "https://google.com/");
}
}
return $dom->saveXML();
}
$xml = '<?xml version="1.0"?><foo><bar><a class="epclean1" href="#">test1</a></bar><bar><a class="epclean1" href="https://example.com">test2</a></bar></foo>';
echo change_clean($xml);
Output:
<foo><bar><a class="epclean1" href="https://google.com/">test1</a></bar><bar><a class="epclean1" href="https://example.com">test2</a></bar></foo>
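The same idea can be applied to an HTML fragment such as the list the script echoes. The following is a sketch only: set_clean_links and its class-to-URL map are hypothetical names introduced for illustration, and the fragment is wrapped in a temporary div so it can be serialized again without the html/body shell that loadHTML adds.

```php
<?php
// Hypothetical helper: sets href on <a> elements whose class starts with "epclean",
// using a class => URL map supplied by the caller.
function set_clean_links(string $html, array $urlsByClass): string
{
    $dom = new DOMDocument();
    // Wrap the fragment in a known container so it can be extracted afterwards.
    $dom->loadHTML('<div id="wrap">' . $html . '</div>');
    $xpath = new DOMXPath($dom);

    foreach ($xpath->query('//a[starts-with(@class, "epclean")]') as $a) {
        $class = $a->getAttribute('class');
        if (isset($urlsByClass[$class])) {
            $a->setAttribute('href', $urlsByClass[$class]);
        }
    }

    // Serialize only the children of the wrapper.
    $out = '';
    $wrap = $xpath->query('//div[@id="wrap"]')->item(0);
    foreach ($wrap->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$list = '<ul><li><a class="epclean1" href="#" target="_blank">A</a></li>'
    . '<li><a class="epclean2" href="#" target="_blank">B</a></li></ul>';
$out = set_clean_links($list, ['epclean1' => 'https://google.com/']);
echo $out;
```

Links whose class is not in the map keep their original href, so the map can be extended later without touching the markup.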
Hmm. I think your pattern and replacement might be the problem: the closing quote after epclean1 is missing, and the pattern has no delimiters.
What you have:
$pattern = 'class="epclean1 href="(.*?)"';
$replacement = 'class="epclean1 href="google.com"';
Fix:
$pattern = '/class="epclean1" href="[^"]*"/';
$replacement = 'class="epclean1" href="google.com"';
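A minimal sketch of applying such a pattern with preg_replace is shown below. The sample line is taken from the rendered output in the question, and the target URL is only a placeholder. The href value is matched with [^"]* so the match stops at the closing quote and leaves target="_blank" untouched.

```php
<?php
$html = '<li><a class="epclean1" href="#" target="_blank">Abby Stern</a></li>';

// [^"]* matches everything up to the closing quote of the href value.
$pattern = '/class="epclean1" href="[^"]*"/';
$replacement = 'class="epclean1" href="https://google.com/"';

$result = preg_replace($pattern, $replacement, $html);
echo $result;
```

This only works while the attributes appear in exactly this order, which is why the DOM approach is the more robust option.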

parse the html data to array data in php

I am trying to parse HTML data into arrays using the classes of the a tags, but I was not able to get the desired format. Below is my data:
$text ='<div class="result results_links results_links_deep web-result ">
<div class="links_main links_deep result__body">
<h2 class="result__title">
<a rel="nofollow" class="result__a" href="">Text1</a>
</h2>
<a class="result__snippet" href="">Text1</a>
<a class="result__url" href="">
example.com
</a>
</div>
</div>
<div class="result results_links results_links_deep web-result ">
<div class="links_main links_deep result__body">
<h2 class="result__title">
<a rel="nofollow" class="result__a" href="">text3</a>
</h2>
<a class="result__snippet" href="">text23</a>
<a class="result__url" href="">
text.com
</a>
</div>
</div>';
I am trying to get the result using the code below:
$lines = explode("\n", $text);
$out = array();
foreach ($lines as $line) {
$parts = explode(" > ", $line);
$ref = &$out;
while (count($parts) > 0) {
if (isset($ref[$parts[0]]) === false) {
$ref[$parts[0]] = array();
}
$ref = &$ref[$parts[0]];
array_shift($parts);
}
}
print_r($out);
But I need the result to look exactly like this:
array:2 [
0 => array:3 [
0 => "Text1"
1 => "Text1"
2 => "example.com"
]
1 => array:3 [
0 => "text3"
1 => "text23"
2 => "text.com"
]
]
Demo : https://eval.in/746170
I also tried the DOM, like below, in Laravel:
$dom = new DOMDocument;
$dom->loadHTML($text);
foreach($dom->getElementsByTagName('a') as $node)
{
$array[] = $dom->saveHTML($node);
}
print_r($array);
So how can I use the classes to separate the data the way I want? Any suggestions, please. Thank you.
Here you go, try this and tell me if you need any more help:
<?php
$test = <<<EOS
<div class="result results_links results_links_deep web-result ">
<div class="links_main links_deep result__body">
<h2 class="result__title">
<a rel="nofollow" class="result__a" href="">Text1</a>
</h2>
<a class="result__snippet" href="">Text1</a>
<a class="result__url" href="">
example.com
</a>
</div>
</div>
<div class="result results_links results_links_deep web-result ">
<div class="links_main links_deep result__body">
<h2 class="result__title">
<a rel="nofollow" class="result__a" href="">text3</a>
</h2>
<a class="result__snippet" href="">text23</a>
<a class="result__url" href="">
text.com
</a>
</div>
</div>
EOS;
$document = new DOMDocument();
$document->loadHTML($test);
// first extract all the divs with the links_main class
$divs = [];
foreach ($document->getElementsByTagName('div') as $div) {
// getAttribute() returns an empty string when the attribute is missing
$classes = $div->getAttribute('class');
if (!$classes) continue;
$classes = explode(' ', $classes);
if (in_array('links_main', $classes)) {
$divs[] = $div;
}
}
// now iterate through them and retrieve all the links in order
$results = [];
foreach ($divs as $div) {
$temp = [];
foreach ($div->getElementsByTagName('a') as $link) {
$temp[] = $link->nodeValue;
}
$results[] = $temp;
}
var_dump($results);
Working version - http://sandbox.onlinephpfunctions.com/code/e7ed2615ea32c5b9f0a89e3460da28a2702343f1
I would do it using DOMDocument and DOMXPath to target the interesting parts more easily. To be more precise, I register a function that checks whether a class attribute contains a set of classes:
function hasClasses($attrValue, $requiredClasses) {
$requiredClasses = explode(' ', $requiredClasses);
$classes = preg_split('~\s+~', $attrValue, -1, PREG_SPLIT_NO_EMPTY);
return array_diff($requiredClasses, $classes) ? false : true;
}
$dom = new DOMDocument;
$state = libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_use_internal_errors($state);
$xp = new DOMXPath($dom);
$xp->registerNamespace('php', 'http://php.net/xpath');
$xp->registerPhpFunctions('hasClasses');
$mainDivClasses = 'result results_links results_links_deep web-result';
$childDivClasses = 'links_main links_deep result__body';
$divNodeList = $xp->query('//div[php:functionString("hasClasses", @class, "' . $mainDivClasses . '")]
/div[php:functionString("hasClasses", @class, "' . $childDivClasses . '")]');
$results = [];
foreach ($divNodeList as $divNode) {
$results[] = [
trim($xp->evaluate('string(./h2/a[@class="result__a"])', $divNode)),
trim($xp->evaluate('string(.//a[@class="result__snippet"])', $divNode)),
trim($xp->evaluate('string(.//a[@class="result__url"])', $divNode))
];
}
print_r($results);
Without registering a function, you can also use the XPath function contains in your predicates. It's less precise, since it only checks whether a substring occurs in a larger string (and not whether the class attribute has a specific class, as the hasClasses function does), but it should be enough:
$dom = new DOMDocument;
$state = libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_use_internal_errors($state);
$xp = new DOMXPath($dom);
$divNodeList = $xp->query('//div[contains(@class, "results_links_deep")]
[contains(@class, "web-result")]
/div[contains(@class, "links_main")]
[contains(@class, "links_deep")]
[contains(@class, "result__body")]');
$results = [];
foreach ($divNodeList as $divNode) {
$results[] = [
trim($xp->evaluate('string(./h2/a[@class="result__a"])', $divNode)),
trim($xp->evaluate('string(.//a[@class="result__snippet"])', $divNode)),
trim($xp->evaluate('string(.//a[@class="result__url"])', $divNode))
];
}
print_r($results);
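A middle ground between the two approaches is the common XPath 1.0 idiom that pads the class attribute with spaces and then looks for the class name as a whole token. The sketch below is one way to write it; classPredicate is a hypothetical helper name, and the sample markup is a single result block from the question.

```php
<?php
$html = '<div class="result results_links results_links_deep web-result ">'
    . '<div class="links_main links_deep result__body">'
    . '<h2 class="result__title"><a rel="nofollow" class="result__a" href="">Text1</a></h2>'
    . '<a class="result__snippet" href="">Text1</a>'
    . '<a class="result__url" href=""> example.com </a>'
    . '</div></div>';

// Hypothetical helper: builds a predicate that matches a whole class token,
// so "result" does not match "result__body".
function classPredicate(string $class): string
{
    return 'contains(concat(" ", normalize-space(@class), " "), " ' . $class . ' ")';
}

$dom = new DOMDocument();
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

$query = '//div[' . classPredicate('web-result') . ']'
    . '/div[' . classPredicate('links_main') . ']';

$results = [];
foreach ($xp->query($query) as $div) {
    $row = [];
    foreach (['result__a', 'result__snippet', 'result__url'] as $class) {
        // string() returns the text of the first matching anchor, or '' if none.
        $row[] = trim($xp->evaluate('string(.//a[' . classPredicate($class) . '])', $div));
    }
    $results[] = $row;
}
print_r($results);
```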

Php Dom Document results error

I would like to scrape some elements from HTML, but I am unable to get the data I need.
HTML:
<div class="opinions">
<ul>
<li>
<div class="imgcontainers">
<a href="domainname.com" title="title"> <img width="160" src="image.jpg" />
</a>
</div>
<p class="caption">
asdfad
<span>November 03, 2015 09:29 This is article title</span>
</p>
</li>
</ul>
</div>
$dom = new DOMDocument();
$classname = 'opinions';
$html = get_page($url);
@$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$xpath = new DOMXPath($dom);
$articles = $xpath->query("//*[@class='" . $classname . "']");
$p = $articles->getElementsByTagName('a');
$div = $articles->getElementsByTagName('div');
foreach($p as $value){
$title = $value->getAttribute("href");
echo $title;
}
When I run this script, I get this error: "Call to undefined method DOMNodeList::getElementsByTagName()".
What I need is every href link, the img src path (if present), and the span text value. Please advise how to achieve this.
Maybe you can learn something from my code
Or, if you decide to include my function, here is how I do it:
$html = ""; //your html
$props = array(
array("tagname"=>"div", "props"=>array("class"=>"opinions")),
//the '/' before 'a' is for all descendant <a> of <div>
array("tagname"=>"/a"),
);
$options = array("property"=>"href");
require_once 'getNodeValue.php';
$hrefs = getNodeValue($html, $props, $options);
print_r($hrefs);
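For reference, the error itself comes from calling getElementsByTagName() on the DOMNodeList that query() returns; the method exists on DOMElement and DOMDocument, not on the list. A sketch using only the built-in DOM classes, assuming each li holds one article as in the sample markup:

```php
<?php
$html = '<div class="opinions"><ul><li>'
    . '<div class="imgcontainers">'
    . '<a href="domainname.com" title="title"><img width="160" src="image.jpg"></a>'
    . '</div>'
    . '<p class="caption">asdfad <span>November 03, 2015 09:29 This is article title</span></p>'
    . '</li></ul></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$items = [];
// query() returns a DOMNodeList; iterate it and work with each DOMElement.
foreach ($xpath->query('//*[@class="opinions"]//li') as $li) {
    $a    = $li->getElementsByTagName('a')->item(0);
    $img  = $li->getElementsByTagName('img')->item(0);
    $span = $li->getElementsByTagName('span')->item(0);
    $items[] = [
        'href' => $a ? $a->getAttribute('href') : null,
        'src'  => $img ? $img->getAttribute('src') : null,   // null when there is no image
        'text' => $span ? trim($span->textContent) : null,
    ];
}
print_r($items);
```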

How do I extract this value using PHP Dom

I have an HTML file; this is just a part of it:
<div id="result" >
<div class="res_item" id="1" h="63c2c439b62a096eb3387f88465d36d0">
<div class="res_main">
<h2 class="res_main_top">
<img
src="/ff/gigablast.com.png"
alt="favicon for gigablast.com"
width=16
height=16
/>
<a
href="http://www.gigablast.com/"
rel="nofollow"
>
Gigablast
</a>
<div class="res_main">
<h2 class="res_main_top">
<img
src="/ff/ask.com.png"
alt="favicon for ask.com"
width=16
height=16
/>
<a
href="http://ask.com/" rel="nofollow"
>
Ask.com - What's Your Question?
</a>....
I want to extract only the URLs (for example, http://www.gigablast.com and http://ask.com/; there are at least 10 URLs in that HTML) from the above using PHP DOMDocument. I know this much, but I don't know how to go further:
$doc = new DomDocument;
$doc->loadHTMLFile('urllist.html');
$data = $doc->getElementById('result');
Then what? This is inside a tag, hence I can't use $data->getElementsByTagName() here!
Using XPath to narrow down the field to a elements inside the <div class="res_main"> element:
$doc = new DomDocument();
$doc->loadHTMLFile('urllist.html');
$xpath = new DomXpath($doc);
$query = '//div[@class="res_main"]//a';
$nodes = $xpath->query($query);
$urls = array();
foreach ($nodes as $node) {
$href = $node->getAttribute('href');
if (!empty($href)) {
$urls[] = $href;
}
}
This solves the problem of picking up all the <a> elements inside of the document, since it allows you to filter only the ones you want (since you don't care about navigation links, etc)...
You can call getElementsByTagName on a DOMElement object:
$doc = new DomDocument;
$doc->loadHTMLFile('urllist.html');
$result = $doc->getElementById('result');
$anchors = $result->getElementsByTagName('a');
$urls = array();
foreach ($anchors as $a) {
$urls[] = $a->getAttribute('href');
}
If you want to get image sources as well, that would be easy to add.
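A sketch of that addition, assuming a simplified, well-formed version of the markup from the question: the same container element is queried once for a tags and once for img tags.

```php
<?php
// Simplified, well-formed sample based on the question's markup.
$html = '<div id="result"><div class="res_main"><h2 class="res_main_top">'
    . '<img src="/ff/gigablast.com.png" alt="favicon for gigablast.com" width="16" height="16">'
    . '<a href="http://www.gigablast.com/" rel="nofollow">Gigablast</a>'
    . '</h2></div></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$result = $doc->getElementById('result');

$urls = [];
$images = [];
// getElementsByTagName() on a DOMElement searches only that element's descendants.
foreach ($result->getElementsByTagName('a') as $a) {
    $urls[] = $a->getAttribute('href');
}
foreach ($result->getElementsByTagName('img') as $img) {
    $images[] = $img->getAttribute('src');
}
print_r($urls);
print_r($images);
```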
If you are just trying to extract the href attribute of all a tags in the document (and the <div id="result"> doesn't matter), you could use this:
$doc = new DomDocument;
$doc->loadHTMLFile('urllist.html');
$anchors = $doc->getElementsByTagName('a');
$urls = array();
foreach($anchors as $anchor) {
$urls[] = $anchor->getAttribute('href');
}
// $urls is your collection of urls in the original document.
