Parse the HTML data to array data in PHP

I am trying to parse HTML data into arrays using the classes of the a tags, but I was not able to get the desired format. Below is my data:
$text ='<div class="result results_links results_links_deep web-result ">
<div class="links_main links_deep result__body">
<h2 class="result__title">
<a rel="nofollow" class="result__a" href="">Text1</a>
</h2>
<a class="result__snippet" href="">Text1</a>
<a class="result__url" href="">
example.com
</a>
</div>
</div>
<div class="result results_links results_links_deep web-result ">
<div class="links_main links_deep result__body">
<h2 class="result__title">
<a rel="nofollow" class="result__a" href="">text3</a>
</h2>
<a class="result__snippet" href="">text23</a>
<a class="result__url" href="">
text.com
</a>
</div>
</div>';
I am trying to get the result using the code below:
$lines = explode("\n", $text);
$out = array();
foreach ($lines as $line) {
$parts = explode(" > ", $line);
$ref = &$out;
while (count($parts) > 0) {
if (isset($ref[$parts[0]]) === false) {
$ref[$parts[0]] = array();
}
$ref = &$ref[$parts[0]];
array_shift($parts);
}
}
print_r($out);
But I need the result exactly like this:
array:2 [
0 => array:3 [
0 => "Text1"
1 => "Text1"
2 => "example.com"
]
1 => array:3 [
0 => "text3"
1 => "text23"
2 => "text.com"
]
]
Demo : https://eval.in/746170
I also tried DOMDocument in Laravel, like below:
$dom = new DOMDocument;
$dom->loadHTML($text);
foreach($dom->getElementsByTagName('a') as $node)
{
$array[] = $dom->saveHTML($node);
}
print_r($array);
So how can I use the classes to separate the data the way I want? Any suggestions, please. Thank you.

Here you go, try this and tell me if you need any more help:
<?php
$test = <<<EOS
<div class="result results_links results_links_deep web-result ">
<div class="links_main links_deep result__body">
<h2 class="result__title">
<a rel="nofollow" class="result__a" href="">Text1</a>
</h2>
<a class="result__snippet" href="">Text1</a>
<a class="result__url" href="">
example.com
</a>
</div>
</div>
<div class="result results_links results_links_deep web-result ">
<div class="links_main links_deep result__body">
<h2 class="result__title">
<a rel="nofollow" class="result__a" href="">text3</a>
</h2>
<a class="result__snippet" href="">text23</a>
<a class="result__url" href="">
text.com
</a>
</div>
</div>
EOS;
$document = new DOMDocument();
$document->loadHTML($test);
// first extract all the divs with the links_main class
$divs = [];
foreach ($document->getElementsByTagName('div') as $div) {
// getAttribute() returns an empty string when the div has no class attribute
$classes = $div->getAttribute('class');
if (!$classes) continue;
$classes = explode(' ', $classes);
if (in_array('links_main', $classes)) {
$divs[] = $div;
}
}
// now iterate through them and retrieve all the links in order
$results = [];
foreach ($divs as $div) {
$temp = [];
foreach ($div->getElementsByTagName('a') as $link) {
$temp[] = $link->nodeValue;
}
$results[] = $temp;
}
var_dump($results);
Working version - http://sandbox.onlinephpfunctions.com/code/e7ed2615ea32c5b9f0a89e3460da28a2702343f1

I would do it using DOMDocument and DOMXPath to target the interesting parts more easily. To be more precise, I register a function that checks whether a class attribute contains a given set of classes:
function hasClasses($attrValue, $requiredClasses) {
$requiredClasses = explode(' ', $requiredClasses);
$classes = preg_split('~\s+~', $attrValue, -1, PREG_SPLIT_NO_EMPTY);
return array_diff($requiredClasses, $classes) ? false : true;
}
$dom = new DOMDocument;
$state = libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_use_internal_errors($state);
$xp = new DOMXPath($dom);
$xp->registerNamespace('php', 'http://php.net/xpath');
$xp->registerPhpFunctions('hasClasses');
$mainDivClasses = 'result results_links results_links_deep web-result';
$childDivClasses = 'links_main links_deep result__body';
$divNodeList = $xp->query('//div[php:functionString("hasClasses", @class, "' . $mainDivClasses . '")]
/div[php:functionString("hasClasses", @class, "' . $childDivClasses . '")]');
$results = [];
foreach ($divNodeList as $divNode) {
$results[] = [
trim($xp->evaluate('string(./h2/a[@class="result__a"])', $divNode)),
trim($xp->evaluate('string(.//a[@class="result__snippet"])', $divNode)),
trim($xp->evaluate('string(.//a[@class="result__url"])', $divNode))
];
}
print_r($results);
Without registering a function, you can also use the XPath function contains in your predicates. It is less precise, since it only checks whether a substring occurs in a larger string (not whether the class attribute has a specific class, as the hasClasses function does), but it should be enough here:
$dom = new DOMDocument;
$state = libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_use_internal_errors($state);
$xp = new DOMXPath($dom);
$divNodeList = $xp->query('//div[contains(@class, "results_links_deep")]
[contains(@class, "web-result")]
/div[contains(@class, "links_main")]
[contains(@class, "links_deep")]
[contains(@class, "result__body")]');
$results = [];
foreach ($divNodeList as $divNode) {
$results[] = [
trim($xp->evaluate('string(./h2/a[@class="result__a"])', $divNode)),
trim($xp->evaluate('string(.//a[@class="result__snippet"])', $divNode)),
trim($xp->evaluate('string(.//a[@class="result__url"])', $divNode))
];
}
print_r($results);
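A possible middle ground between the two versions above, as a sketch only: the common XPath 1.0 idiom for matching a whole class token pads the normalized class attribute with spaces and searches for the padded name, so no PHP function has to be registered. The sample markup (including the made-up class links_main_extra) and the variable names are assumptions for illustration.

```php
<?php
// Hypothetical sample: "links_main_extra" must NOT be matched as "links_main".
$html = '<div class="result web-result ">'
      . '<div class="links_main result__body"><a class="result__a" href="">Text1</a></div>'
      . '<div class="links_main_extra"><a class="result__a" href="">Other</a></div>'
      . '</div>';

$dom = new DOMDocument;
$state = libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_use_internal_errors($state);

$xp = new DOMXPath($dom);

// Pad the class attribute with spaces so only the exact token " links_main " matches.
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " links_main ")]';

$titles = [];
foreach ($xp->query($query) as $divNode) {
    $titles[] = trim($xp->evaluate('string(.//a[@class="result__a"])', $divNode));
}
print_r($titles);
```

A plain contains(@class, "links_main") would also have matched the second inner div, which is the imprecision described above.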

Related

PHP string search and replace - possible use of DOM Needed

I can't seem to figure out how to achieve my goal.
I want to find and replace the link of a specific class, based on a generated RSS feed (I need the option to replace it later no matter what link is there).
Example HTML:
<a class="epclean1" href="#">
WHAT IT SHOULD LOOK LIKE:
<a class="epclean1" href="google.com">
I may need to get the element through the DOM, as the full PHP already creates a document. If that is the case, I would need to know how to find the element by class and add the href URL that way.
FULL PHP:
<?php
$rss = new DOMDocument();
$feed = array();
$urlArray = array(array('url' => 'https://feeds.megaphone.fm')
);
foreach ($urlArray as $url) {
$rss->load($url['url']);
foreach ($rss->getElementsByTagName('item') as $node) {
$item = array (
'title' => $node->getElementsByTagName('title')->item(0)->nodeValue
);
array_push($feed, $item);
}
}
usort( $feed, function ( $a, $b ) {
return strcmp($a['title'], $b['title']);
});
$limit = sizeof($feed);
$previous = null;
$count_firstletters = 0;
for ($x = 0; $x < $limit; $x++) {
$firstLetter = substr($feed[$x]['title'], 0, 1); // Getting the first letter from the Title you're going to print
if($previous !== $firstLetter) { // If the first letter is different from the previous one then output the letter and start the UL
if($count_firstletters != 0) {
echo '</ul>'; // Closing the previously open UL only if it's not the first time
echo '</div>';
}
echo '<button class="glanvillecleancollapsible">'.$firstLetter.'</button>';
echo '<div class="glanvillecleancontent">';
echo '<ul style="list-style-type: none">';
$previous = $firstLetter;
$count_firstletters ++;
}
$title = str_replace(' & ', ' &amp; ', $feed[$x]['title']);
echo '<li>';
echo '<a class="epclean'.$i++.'" href="#" target="_blank">'.$title.'</a>';
echo '</li>';
}
echo '</ul>'; // Close the last UL
echo '</div>';
?>
</div>
</div>
The full PHP above renders on the site like this (shortened, as there are 200+ entries):
<div class="modal-glanvillecleancontent">
<span class="glanvillecleanclose">×</span>
<p id="glanvillecleaninstruct">Select the first letter of the episode that you wish to get clean version for:</p>
<br>
<button class="glanvillecleancollapsible">8</button>
<div class="glanvillecleancontent">
<ul style="list-style-type: none">
<li><a class="epclean1" href="#" target="_blank">80's Video Vixen Tawny Kitaen 044</a></li>
</ul>
</div>
<button class="glanvillecleancollapsible">A</button>
<div class="glanvillecleancontent">
<ul style="list-style-type: none">
<li><a class="epclean2" href="#" target="_blank">Abby Stern</a></li>
<li><a class="epclean3" href="#" target="_blank">Actor Nick Hounslow 104</a></li>
<li><a class="epclean4" href="#" target="_blank">Adam Carolla</a></li>
<li><a class="epclean5" href="#" target="_blank">Adrienne Janic</a></li>
</ul>
</div>

You're not very clear about how your question relates to the code shown, but I don't see any attempt to replace the attribute within the DOM code. You'd want to look at XPath to find the desired elements:
function change_clean($content) {
$dom = new DomDocument;
$dom->loadXML($content);
$xpath = new DomXpath($dom);
$nodes = $xpath->query("//a[@class='epclean1']");
foreach ($nodes as $node) {
if ($node->getAttribute("href") === "#") {
$node->setAttribute("href", "https://google.com/");
}
}
return $dom->saveXML();
}
$xml = '<?xml version="1.0"?><foo><bar><a class="epclean1" href="#">test1</a></bar><bar><a class="epclean1" href="https://example.com">test2</a></bar></foo>';
echo change_clean($xml);
Output:
<foo><bar><a class="epclean1" href="https://google.com/">test1</a></bar><bar><a class="epclean1" href="https://example.com">test2</a></bar></foo>

Hmm. I think your pattern and replacement might be your problem.
What you have
$pattern = 'class="epclean1 href="(.*?)"';
$replacement = 'class="epclean1 href="google.com"';
Fix
$pattern = '/class="epclean1" href="[^"]*"/'; // [^"]* stops at the closing quote; a greedy .* would run on to the last quote on the line
$replacement = 'class="epclean1" href="google.com"';
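A quick way to check a pattern of this kind against one of the generated links is sketched below. The sample link is copied from the question's output; the capture groups and the https://google.com/ target are assumptions made for the illustration, not part of the original answer.

```php
<?php
// One link as rendered by the question's script.
$html = '<a class="epclean1" href="#" target="_blank">Abby Stern</a>';

// Group 1 keeps everything up to the opening quote of href, group 2 keeps the closing quote.
// [^"]* matches only the current href value, whatever it is.
$pattern = '/(class="epclean1" href=")[^"]*(")/';
$replacement = '${1}https://google.com/${2}';

$result = preg_replace($pattern, $replacement, $html);
echo $result;
```

This only works while the attributes are written in exactly this order; the DOM-based answer above does not depend on attribute order.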

How to web-scrape in divs with DOMparser

I am trying to get a div, and for the other pages I am trying to put it in a foreach.
But I am facing some trouble:
<div class="article_info">
<ul class="c-result_box">
<li>
<div class="inner cf">
<div class="c-header">
<div class="c-logo">
<img src="/e/designs/31sumai/common/img/logo_08.png" alt="#">
</div>
<p class="c-supplier">三井のマンション</p>
<p class="c-name">
パークリュクス大阪天満
</p>
I'm trying to get the text inside the <a> element. Here is my code; what am I missing?
$start_id = 1501;
while(true){
$url = 'https://www.31sumai.com/mfr/K'.$start_id.'/outline.html';
$html = file_get_contents($url);
libxml_use_internal_errors(true);
$DOMParser = new \DOMDocument();
$DOMParser->loadHTML($html);
$xpath = new \DOMXPath($DOMParser);
$classname="c-name";
$nodes = $finder->query("//*[contains(@class, '$classname')]");
$MyTable = false;
$insertData = [];
foreach($nodes as $node){
$allNames = [];
foreach($node->getElementsByTagName('a') as $a){
$name = $a->getElementsByTagName('a');
$allProperties[] = [
'names' => $name];
}
}
Thank you for helping!

You can rely on your XPath query to pull all the text nodes that you want, and then just read the nodeValue property within your loop:
$start_id = "1501";
$url = "https://www.31sumai.com/mfr/K$start_id/outline.html";
$html = file_get_contents($url);
libxml_use_internal_errors(true);
$DOMParser = new \DOMDocument();
$DOMParser->loadHTML($html);
$xpath = new \DOMXPath($DOMParser);
$classname="c-name";
$nodes = $xpath->query("//*[contains(@class, '$classname')]/a/text()");
foreach($nodes as $node){
echo $node->nodeValue;
}

Problem to get content from html string variable in php

I have this code:
$doc = new \DOMDocument();
$doc->loadHTML($content);
$links = [];
$container = $doc->getElementById("content");
$arr = $container->getElementsByTagName("a");
foreach($arr as $item) {
$href = $item->getAttribute("href");
$title = $item->getAttribute("title");
$links[] = [
'href' => $href,
'title' => $title
];
}
for($i = 0, $l = count($links); $i < $l; ++$i) {
echo $links[$i]['title'].' '.$links[$i]['href'].'<br />';
}
The HTML structure looks like this:
<div class="post-content right-col">
<a title="" href="https://www.swisscars.pl/samochody/516321/">
<img src="https://swisscars.pl/uploads2/180843_0.jpg" alt="" class="thumb alignleft" height="75" width="75"/>
</a>
<h2 style="line-height:150%;">
<a href="https://www.swisscars.pl/samochody/516321/" rel="bookmark" title="Renault Kangoo II (96’011 km)">
Renault Kangoo II (96’011 km) </a>
</h2>
Do końca aukcji: <span id="countdown100">2018-10-23 14:00:00 GMT+02:00</span><p>DATA ZAKONCZENIA AUKCJI: 2018-10-23 14:00</p>
</div>
</div>
I want to get only the values from the a tag with the attribute rel="bookmark". Please help me with this. I tried to use the hasAttribute function, but it is not working. Please explain how I can get only the content of the a tag with the rel="bookmark" attribute. Does PHP have a hasAttribute() function or something similar?
Thanks for the help.

XPath might be an option, you could do this:
$doc = new \DOMDocument();
$doc->loadHTML($content);
$xpath = new DOMXPath($doc);
$links = $xpath->query('//a[@rel="bookmark"]');
That should return a DOMNodeList you could loop through.
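Looping through that list could look like the sketch below, which rebuilds the question's $links array from the matched nodes only. The $content string is a shortened, made-up stand-in for the real page. On the question itself: DOMElement does provide hasAttribute(), so filtering inside the original foreach with $item->getAttribute("rel") === "bookmark" would be another option.

```php
<?php
// Hypothetical input modelled on the question's markup.
$content = '<div id="content">'
         . '<a title="" href="https://www.swisscars.pl/samochody/516321/">thumb</a>'
         . '<h2><a href="https://www.swisscars.pl/samochody/516321/" rel="bookmark" '
         . 'title="Renault Kangoo II">Renault Kangoo II</a></h2>'
         . '</div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML often triggers parser warnings
$doc->loadHTML($content);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

$links = [];
// Only <a> elements whose rel attribute is exactly "bookmark" are returned.
foreach ($xpath->query('//a[@rel="bookmark"]') as $a) {
    $links[] = [
        'href'  => $a->getAttribute('href'),
        'title' => $a->getAttribute('title'),
    ];
}
print_r($links);
```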

How can I turn the values inside a UL into an associative array in PHP

I have a ul list like this:
<ul>
<li>
<div class="time">18:45</div>
<div class="info">description goes here</div>
<div class="clearAll"></div>
</li>
<li>
<div class="time">19:15</div>
<div class="info">some info</div>
<div class="clearAll"></div>
</li>
</ul>
How can I turn this into an array like this:
$array = array(
1 => array('18:45','description goes here'),
2 => array('19:15','some info')
);

Stay away from a regex for this. DOMDocument is your friend:
$dom = new DOMDocument;
$dom->loadHTML( $theHTMLstring );
$array = array();
foreach ( $dom->getElementsByTagName('li') as $li ) {
$divs = $li->getElementsByTagName('div');
$array[] = array(
$divs->item(0)->textContent,
$divs->item(1)->textContent
);
}
See it here in action: http://codepad.viper-7.com/5ExOqJ

By not using regex:
$sx = new SimpleXMLElement($xml);
foreach ($sx->xpath('//li') as $node) {
$time = current($node->xpath("div[@class='time']"));
$time = "$time";
$info = current($node->xpath("div[@class='info']"));
$info = "$info";
$data[] = array($time, $info);
}
http://codepad.viper-7.com/lo8k5c
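Both answers above build a numerically indexed list of pairs. If a truly associative array is wanted, as the title suggests, one possible sketch keys each description by its time. This assumes the times are unique, which the question does not state, and the names $byTime and $theHTMLstring are only illustrative.

```php
<?php
// The list from the question, as a single string.
$theHTMLstring = '<ul>'
    . '<li><div class="time">18:45</div><div class="info">description goes here</div><div class="clearAll"></div></li>'
    . '<li><div class="time">19:15</div><div class="info">some info</div><div class="clearAll"></div></li>'
    . '</ul>';

$dom = new DOMDocument;
$dom->loadHTML($theHTMLstring);
$xpath = new DOMXPath($dom);

$byTime = [];
foreach ($xpath->query('//li') as $li) {
    // Select the divs by class rather than by position, relative to the current <li>.
    $time = trim($xpath->evaluate('string(div[@class="time"])', $li));
    $info = trim($xpath->evaluate('string(div[@class="info"])', $li));
    $byTime[$time] = $info; // a repeated time would overwrite the earlier entry
}
print_r($byTime);
```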

Extract text and put into array with PHP

I have the following string and need to extract the text inside the div's (EDITOR'S PREFACE, MORE CONTENT, etc) and put them into an array with php. How could I do this?
Thanks in advance.
<div class='classit'><a href='site.php?site=1&filename=aname4'>EDITOR'S PREFACE</a></div>
<div class='classit'><a href='site.php?site=4&filename=aname3'>MORE CONTENT</a></div>
<div class='classit'><a href='site.php?site=3&filename=aname4'>LAST LINE</a></div>

Use Simple HTML DOM
$html = <<<HTML
<div class='classit'><a href='site.php?site=1&filename=aname4'>EDITOR'S PREFACE</a></div>
<div class='classit'><a href='site.php?site=4&filename=aname3'>MORE CONTENT</a></div>
<div class='classit'><a href='site.php?site=3&filename=aname4'>LAST LINE</a></div>
HTML;
$src = str_get_html($html);
$elem = $src->find("div.classit a");
foreach ($elem as $link) {
$links[] = $link->plaintext;
}
print_r($links);

You could use PHP's own DOM extension
$string = '<div><a>Elem 1</a></div><div><a>Elem 2</a></div>...etc';
$dom = new DOMDocument();
$dom->loadHTML($string);
$elements = $dom->getElementsByTagName('a');
$textElements = array();
foreach($elements as $node) {
$textElements[] = $node->nodeValue;
}
If you want to load a larger HTML extract, you could use DOMXPath to query the DOMDocument in order to just get the elements you want.
$xPathObj = new DOMXPath($dom);
$elements = $xPathObj->query("//div[@class='classit']/a");
Edit
DOMNodeList supports foreach, so I've changed for($i = 0; $i < $elements->length; $i++) {$elements->item($i)->nodeValue;} to foreach($elements as $node) {$node->nodeValue}

You could use preg_match_all:
<?php
$html = <<<HTML
<div class='classit'><a href='site.php?site=1&filename=aname4'>EDITOR'S PREFACE</a></div>
<div class='classit'><a href='site.php?site=4&filename=aname3'>MORE CONTENT</a></div>
<div class='classit'><a href='site.php?site=3&filename=aname4'>LAST LINE</a></div>
HTML;
$result = array();
if (preg_match_all('/>([^><]+)(?=<\/a>)/', $html, $matches))
{
$result = $matches[1];
}
print_r($result);

You could also do it using strip_tags:
$s = "<div class='classit'><a href='site.php?site=1&fn=aname4'>EDITOR'S PREFACE</a></div>
<div class='classit'><a href='site.php?site=4&filename=aname3'>MORE CONTENT</a></div>
<div class='classit'><a href='site.php?site=3&filename=aname4'>LAST LINE</a></div> ";
foreach (explode("\n", $s) as $val){
$new[] = strip_tags($val);
}
var_dump($new);
