PHP XPath parsing HTML

How can I parse nested HTML tags like this structure:
<article class="tile">
<div class="tile-content">
ignore
<div class="tile-content__text tile-content__text--arrow-white">
<label class="label-date label-date--blue">01.12.2021</label>
<h4><a class="link-color-black" href="link-1">title-1</a></h4>
<p class="tile-content__paragraph tile-content__paragraph--gray pd-ver-10">
content-1
</p>
</div>
more
</div>
</article>
<article class="tile">
<div class="tile-content">
ignore
<div class="tile-content__text tile-content__text--arrow-white">
<label class="label-date label-date--blue">02.12.2021</label>
<h4><a class="link-color-black" href="link-2">title-2</a></h4>
<p class="tile-content__paragraph tile-content__paragraph--gray pd-ver-10">
content-2
</p>
</div>
more
</div>
</article>
into an array like:
$parsedArray = [
0 =>
['title' => 'title-1',
'link' => 'link-1',
'date' => '2021-12-01',
'content' => 'content-1'],
1 =>
['title' => 'title-2',
'link' => 'link-2',
'date' => '2021-12-02',
'content' => 'content-2'],
// ...
];
I use XPath as shown below, but this removes all the tags, and afterwards I only have the imploded text from all of them. I need to extract the info from each tag separately. Any tips?
$dom = new DOMDocument();
$dom->loadHTML($html['html']);
$xpath = new DOMXPath($dom);
$nodelist = $xpath->query("//article[contains(@class, 'tile')]");
foreach ($nodelist as $n) {
echo '<pre>';
var_dump($n);
echo '</pre>';
}

var_dump won't parse the DOM :)
You just need to re-query for your elements within each tile, then assign them to the array.
Define a working item array first if the structure matters; otherwise just build up the result as you go.
<?php
$str = '<article class="tile">
<div class="tile-content">
ignore
<div class="tile-content__text tile-content__text--arrow-white">
<label class="label-date label-date--blue">02.12.2021</label>
<h4><a class="link-color-black" href="link-2">title-2</a></h4>
<p class="tile-content__paragraph tile-content__paragraph--gray pd-ver-10">
content-2
</p>
</div>
more
</div>
</article>';
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHtml($str);
libxml_clear_errors();
$xpath = new DOMXPath($dom);
$result = [];
foreach ($xpath->query("//article[contains(@class, 'tile')]") as $tile) {
// define item structure
$item = [
'title' => '',
'link' => '',
'date' => '',
'content' => ''
];
// find date (the leading "." keeps the search inside the current tile)
$query = $xpath->query(".//label[contains(@class, 'label-date')][1]", $tile);
if (count($query)) {
$item['date'] = $query[0]->nodeValue;
}
// find link/title (relative to the current tile)
$query = $xpath->query(".//h4/a[1]", $tile);
if (count($query)) {
$item['link'] = $query[0]->getAttribute('href');
$item['title'] = $query[0]->nodeValue;
}
// find content (relative to the current tile)
$query = $xpath->query(".//p[contains(@class, 'tile-content__paragraph')][1]", $tile);
if (count($query)) {
$item['content'] = $query[0]->nodeValue;
}
// assign
$result[] = $item;
// cleanup
unset($item, $query);
}
print_r($result);
Output:
Array
(
[0] => Array
(
[title] => title-2
[link] => link-2
[date] => 02.12.2021
[content] =>
content-2
)
)
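The question asked for ISO dates (2021-12-01) and clean content strings, and its input held more than one tile. Here is a rough sketch of how that could look, not a definitive version. It assumes the dates always come as day.month.year, and the helper name parseTiles and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper: parse every tile into title/link/date/content.
function parseTiles(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // <article> makes the HTML parser emit warnings
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $result = [];

    foreach ($xpath->query("//article[contains(@class, 'tile')]") as $tile) {
        // The leading "." keeps each query inside the current tile.
        $label  = $xpath->query(".//label[contains(@class, 'label-date')]", $tile)->item(0);
        $anchor = $xpath->query(".//h4/a", $tile)->item(0);
        $para   = $xpath->query(".//p[contains(@class, 'tile-content__paragraph')]", $tile)->item(0);

        // Assumption: dates always look like 01.12.2021 (day.month.year).
        $date = $label
            ? DateTime::createFromFormat('d.m.Y', trim($label->textContent))
            : false;

        $result[] = [
            'title'   => $anchor ? trim($anchor->textContent) : '',
            'link'    => $anchor ? $anchor->getAttribute('href') : '',
            'date'    => $date ? $date->format('Y-m-d') : '',
            'content' => $para ? trim($para->textContent) : '',
        ];
    }

    return $result;
}

// Made-up sample with two well-formed tiles.
$sample = '
<article class="tile"><div class="tile-content">
<label class="label-date">01.12.2021</label>
<h4><a href="link-1">title-1</a></h4>
<p class="tile-content__paragraph">
content-1
</p>
</div></article>
<article class="tile"><div class="tile-content">
<label class="label-date">02.12.2021</label>
<h4><a href="link-2">title-2</a></h4>
<p class="tile-content__paragraph">
content-2
</p>
</div></article>';

$tiles = parseTiles($sample);
print_r($tiles);
```

Using item(0) returns null when nothing matches, so a tile with a missing label or link just yields empty strings instead of an error.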

Related

XPath : Parsing a page

Let's say I have this HTML:
<div class="area">Area One</div>
<div class="key">AAA</div>
<div class="value">BBB</div>
<div class="key">CCC</div>
<div class="value">DDD</div>
<div class="key">EEE</div>
<div class="value">FFF</div>
<div class="area">Area Two</div>
I want to use XPath to make an array:
my_array['area']
[0] =>
['AAA'] => "BBB"
['CCC'] => "DDD"
['EEE'] => "FFF"
[1] => ...
And so on. Any thoughts on how this can be accomplished? What I'm trying to do is use "area" as the marker between sub-arrays.
My knowledge of PHP is somewhat limited, but you can try:
<?php
$html = <<<'HTML'
<div class="area">Area One</div>
<div class="key">AAA</div>
<div class="value">BBB</div>
<div class="key">CCC</div>
<div class="value">DDD</div>
<div class="key">EEE</div>
<div class="value">FFF</div>
<div class="area">Area Two</div>
<div class="key">GGG</div>
<div class="value">HHH</div>
<div class="key">III</div>
<div class="value">JJJ</div>
<div class="key">KKK</div>
<div class="value">LLL</div>
HTML;
$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXpath($document);
$nbarea = count($xpath->query('//*[contains(text(),"Area")]'));
$i=1;
$j=1;
for ($a = 1; $a <= $nbarea; $a++) {
for ($b = 1; $b <= 3; $b++) {
$element1 = $xpath->query('//*[contains(text(),"Area")]['.$i.']/following::div['.$j.']');
$j++;
$element2 = $xpath->query('//*[contains(text(),"Area")]['.$i.']/following::div['.$j.']');
$h1 = $element1->item(0)->nodeValue;
$h2 = $element2->item(0)->nodeValue;
$area[$i-1][$h1] = $h2;
$j++;
}
$i++;
$j=1;
}
print_r($area)
?>
Output :
Array
(
[0] => Array
(
[AAA] => BBB
[CCC] => DDD
[EEE] => FFF
)
[1] => Array
(
[GGG] => HHH
[III] => JJJ
[KKK] => LLL
)
)
Side note : I've assumed you always have the same number of elements for each area (=3).
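If the number of key/value pairs per area is not fixed, a single pass over the divs in document order may be simpler than indexing with following::div. This is only a sketch under the assumption that each div carries exactly one of the classes area, key or value; the function name groupByArea is invented here.

```php
<?php
// Hypothetical helper: group key/value <div>s under the preceding "area" <div>.
function groupByArea(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $areas = [];
    $index = -1;        // index of the area currently being filled
    $pendingKey = null; // last "key" div seen, waiting for its "value"

    // The node list is in document order, so one pass is enough.
    $query = '//div[@class="area" or @class="key" or @class="value"]';
    foreach ($xpath->query($query) as $div) {
        $text = trim($div->textContent);
        switch ($div->getAttribute('class')) {
            case 'area':
                $areas[++$index] = [];
                $pendingKey = null;
                break;
            case 'key':
                $pendingKey = $text;
                break;
            case 'value':
                if ($index >= 0 && $pendingKey !== null) {
                    $areas[$index][$pendingKey] = $text;
                    $pendingKey = null;
                }
                break;
        }
    }

    return $areas;
}

// Made-up sample: the two areas hold a different number of pairs.
$sample = '
<div class="area">Area One</div>
<div class="key">AAA</div><div class="value">BBB</div>
<div class="key">CCC</div><div class="value">DDD</div>
<div class="area">Area Two</div>
<div class="key">GGG</div><div class="value">HHH</div>';

$grouped = groupByArea($sample);
print_r($grouped);
```

The area names themselves are dropped here, as in the answer above; they could be used as outer keys instead of 0, 1, ... if that is more useful.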

PHP create recursive list of header tags from DOM

I want to parse some HTML to create a nested navigation based on the headings in that document.
An array like this is what I'm trying to create:
[
'name' => 'section 1',
'number' => '1',
'level' => 1,
'children' => [
[
'name' => 'sub section 1',
'number' => '1.1',
'level' => 2,
'children' => []
],
[
'name' => 'sub section 2',
'number' => '1.2',
'level' => 2,
'children' => []
]
],
]
So if the document has an H3 after an H2, the code should parse this and create a nested array with child elements for each successive tier of H headings.
I guess it needs to do a few main things:
Get all of the headings
Recursively loop (H3 after a H2 should be a child in the array)
Create the section number 1.1.1 or 1.1.2 for example
This is my code to extract the headings:
$dom = new \DomDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
// Extract the heading structure
$xpath = new \DomXPath($dom);
$headings = $xpath->query('//h1|//h2|//h3|//h4|//h5|//h6');
I've tried to create a recursive function, but I'm not sure of the best way to get it working.
This is difficult to test thoroughly, as the result depends on how complex the HTML is and on the specific pages you use. The code does a lot, so I will leave it to you to work through it; a full explanation would go on for some time. The XPath is based on the question "XPath select all elements between two specific elements", which shows how to pick out the data between two tags. The test source (test.html) is simply:
<html>
<head>
</head>
<body>
<h2>Header 1</h2>
<h2>Header 2</h2>
<h3>Header 2.1</h3>
<h4>Header 2.1.1</h4>
<h2>Header 3</h2>
<h3>Header 3.1</h3>
</body>
</html>
The actual code is...
function extractH ( $level, $xpath, $dom, $position = 0, $number = '' ) {
$output = [];
$prevLevel = $level-1;
$headings = $xpath->query("//*/h{$level}[count(preceding-sibling::h{$prevLevel})={$position}]");
foreach ( $headings as $key => $heading ) {
$sectionNumber = ltrim($number.".".($key+1), ".");
$newOutput = ["name" => $heading->nodeValue,
"number" => $sectionNumber,
"level" => $level
];
$children = extractH($level+1, $xpath, $dom, $key+1, $sectionNumber);
if ( !empty($children) ) {
$newOutput["children"] = $children;
}
$output[] =$newOutput;
}
return $output;
}
$html = file_get_contents("test.html");
$dom = new \DomDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new \DomXPath($dom);
$output = extractH(2, $xpath, $dom);
print_r($output);
The call to extractH() takes a few parameters. As the sample HTML starts with h2 tags (no h1), the first parameter is 2. The next two are the XPath and DomDocument objects to work with.
The accepted answer does not work for me with a structure like this:
<h2>a</h2>
<h3>aa</h3>
<h4>aaa</h4>
<h5>aaaa</h5>
<h6>aaaaa</h6>
<h2>b</h2>
<h2>c</h2>
<h3>ca</h3>
<h3>cb</h3>
<h3>cc</h3>
<h2>d</h2>
<h3>da</h3>
<h4>daa</h4>
<h5>daaa</h5>
<h6>daaaa</h6>
The tree in the "d" section is replaced with the tree from the "a" section.
This solution works for me:
class Parser {
private $counter = [
1 => 0,
2 => 0,
3 => 0,
4 => 0,
5 => 0,
6 => 0,
];
public function generate(string $text) {
$dom = new DOMDocument('1.0', 'utf-8');
$dom->loadHTML(mb_convert_encoding($text, 'HTML-ENTITIES', 'UTF-8'), LIBXML_HTML_NOIMPLIED);
$xpath = new DOMXPath($dom);
$tree = $this->extractHeadings(2, $xpath, $dom);
return $tree;
}
private function extractHeadings($level, DOMXPath $xpath, DOMDocument $dom, $position = 0) {
$result = [];
$prevLevel = $level-1;
$query = "//*/h{$level}[count(preceding::h{$prevLevel})={$position}]";
$headings = $xpath->query($query);
foreach ($headings as $key => $heading) {
$this->counter[$level]++;
$item = [
'value' => $heading->nodeValue,
'level' => $level,
'children' => [],
];
$children = $this->extractHeadings($level+1, $xpath, $dom, $this->counter[$level]);
if (!empty($children)) {
$item['children'] = $children;
}
$result[] = $item;
}
return $result;
}
}
$text = "
<h2>a</h2>
<h3>aa</h3>
<h4>aaa</h4>
<h5>aaaa</h5>
<h6>aaaaa</h6>
<h2>b</h2>
<h2>c</h2>
<h3>ca</h3>
<h3>cb</h3>
<h3>cc</h3>
<h2>d</h2>
<h3>da</h3>
<h4>daa</h4>
<h5>daaa</h5>
<h6>daaaa</h6>
";
$parser = new Parser();
$parser->generate($text);
It still expects the headings to appear in order, though.
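One way to avoid counting preceding headings in XPath is to read all headings once, in document order, and nest them by level in plain PHP. The following is just a sketch of that idea, not a drop-in replacement for either answer. The function names are invented, and 'level' here is the heading tag level (2 for h2), not the tier as in the question.

```php
<?php
// Hypothetical helper: turn a flat, document-ordered heading list into a tree.
// $flat items look like ['name' => string, 'level' => int];
// $i is the shared read position, advanced as headings are consumed.
function buildTree(array $flat, int &$i, int $parentLevel, string $prefix): array
{
    $nodes = [];
    while ($i < count($flat) && $flat[$i]['level'] > $parentLevel) {
        $heading = $flat[$i++];
        $number = ($prefix === '' ? '' : $prefix . '.') . (count($nodes) + 1);
        $nodes[] = [
            'name'     => $heading['name'],
            'number'   => $number,
            'level'    => $heading['level'],
            // Every deeper heading, up to the next one at the same or a
            // shallower level, becomes a child of this heading.
            'children' => buildTree($flat, $i, $heading['level'], $number),
        ];
    }
    return $nodes;
}

function headingTree(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $flat = [];
    foreach ($xpath->query('//h1|//h2|//h3|//h4|//h5|//h6') as $h) {
        $flat[] = [
            'name'  => trim($h->textContent),
            'level' => (int) substr($h->nodeName, 1), // "h3" -> 3
        ];
    }

    $i = 0;
    return buildTree($flat, $i, 0, '');
}

$sample = '
<h2>a</h2><h3>aa</h3><h4>aaa</h4>
<h2>b</h2>
<h2>c</h2><h3>ca</h3><h3>cb</h3><h3>cc</h3>
<h2>d</h2><h3>da</h3><h4>daa</h4>';

$tree = headingTree($sample);
print_r($tree);
```

Because nesting only compares each heading with the one it follows, a skipped level (an h4 directly after an h2) simply becomes a child of the nearest shallower heading.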

PHP Array Iterator to DIV

I want to loop through an array of unknown depth with RecursiveIteratorIterator in SELF_FIRST mode, together with RecursiveArrayIterator.
If an array value is itself an array, I open a DIV so the "subarray" ends up inside this DIV. Something like:
$array = array(
'key0' => '0',
'key1' => array(
'value0' => '1',
'value1' => '2',
),
'key2' => '3',
'key3' => array(
'value2' => '4',
'value3' => array(
'value4' => '5'
),
'value4' => array(
'value5' => '6'
),
),
);
Then the HTML should be:
<div>
<div>
<p>key0 is 0</p>
</div>
<div>
<p>key1</p>
<div>
<p>value0 is 1</p>
<p>value1 is 2</p>
</div>
</div>
<div>
<p>key2 is 3</p>
</div>
<div>
<p>key3</p>
<div>
<p>value2 is 4</p>
<p>value3</p>
<div>
<p>value4 is 5</p>
</div>
<p>value4</p>
<div>
<p>value5 is 6</p>
</div>
</div>
</div>
</div>
But the problem is that my code can only close one <div> tag each time. I have no idea how to keep track of how deep I am, so that I can loop and echo the right number of </div> tags.
My current code:
<?php
echo '<div>';
$iterator = new RecursiveIteratorIterator(new RecursiveArrayIterator($iterator_array), RecursiveIteratorIterator::SELF_FIRST);
$is_start = true;
$last_element = '';
foreach($iterator as $key => $value) {
if(is_array($value) && $is_start) {
echo '<div><p>' . $key . '</p>';
$is_start = false;
$last_element = end($value);
} elseif(is_array($value) && !$is_start) {
echo '</div><div><p>' . $key . '</p>';
$last_element = end($value);
} elseif(!is_array($value)) {
echo '<div><p>' . $key . ' is ' . $value . '</p></div>';
if($last_element == $value) {
echo '</div>';
}
}
}
echo '</div>';
?>
Use this recursive function; it may help you:
get_div($array);
function get_div($arr) {
foreach($arr as $k => $a){
echo '<div>';
if(is_array($a)) {
echo "<p>".$k."</p>";
get_div($a);
} else {
echo "<p>".$k." is ".$a."</p>";
}
echo '</div>';
}
}
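If you want to stay with RecursiveIteratorIterator, its getDepth() method tells you how deep the current element is, which is enough to work out how many <div> tags to close. A minimal sketch, assuming the same markup as the recursive function above (every entry wrapped in its own <div>); the name renderDivs is made up.

```php
<?php
// Sketch: render a nested array as nested <div>s using the iterator's depth.
function renderDivs(array $data): string
{
    $iterator = new RecursiveIteratorIterator(
        new RecursiveArrayIterator($data),
        RecursiveIteratorIterator::SELF_FIRST
    );

    $html = '<div>';
    $open = 0; // number of sub-array <div>s currently left open

    foreach ($iterator as $key => $value) {
        // An element at depth d has exactly d sub-array ancestors, so any
        // extra open <div>s belong to sub-arrays we have already left.
        while ($open > $iterator->getDepth()) {
            $html .= '</div>';
            $open--;
        }

        $label = htmlspecialchars((string) $key);
        if (is_array($value)) {
            $html .= '<div><p>' . $label . '</p>'; // closed later
            $open++;
        } else {
            $html .= '<div><p>' . $label . ' is '
                . htmlspecialchars((string) $value) . '</p></div>';
        }
    }

    // Close whatever is still open, plus the outer wrapper.
    return $html . str_repeat('</div>', $open + 1);
}

$rendered = renderDivs(['a' => '1', 'b' => ['c' => '2']]);
echo $rendered, "\n";
```

The counter replaces the $last_element comparison from the question, which breaks as soon as two values are equal.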

Parsing html using php to an array

I have the below html
<p>text1</p>
<ul>
<li>list-a1</li>
<li>list-a2</li>
<li>list-a3</li>
</ul>
<p>text2</p>
<ul>
<li>list-b1</li>
<li>list-b2</li>
<li>list-b3</li>
</ul>
<p>text3</p>
Does anyone have an idea how to parse this HTML with PHP to get the output below, using a nested array?
The first level is for the "p" tags
and the second for the "ul" tags, because every "p" tag is followed by a "ul" tag.
Array
(
[0] => Array
(
[value] => text1
(
[il] => list-a1
[il] => list-a2
[il] => list-a3
)
)
[1] => Array
(
[value] => text2
(
[il] => list-b1
[il] => list-b2
[il] => list-b3
)
)
)
I can't use replace or remove all the tags, because I use:
foreach ($doc->getElementsByTagName('p') as $link)
{
$dont = $link->textContent;
if (strpos($dont, 'document.') === false) {
$links2[] = array(
'value' => $link->textContent, );
}
$er=0;
foreach ($doc->getElementsByTagName('ul') as $link)
{
$dont2 = $link->nodeValue;
//echo $dont2;
if (strpos($dont2, 'favorisContribuer') === false) {
$links3[]= array(
'il' => $link->nodeValue, );
}
You could use the DOMDocument class (http://php.net/manual/en/class.domdocument.php).
See the example below.
<?php
$html = '
<p>text1</p>
<ul>
<li>list-a1</li>
<li>list-a2</li>
<li>list-a3</li>
</ul>
<p>text2</p>
<ul>
<li>list-b1</li>
<li>list-b2</li>
<li>list-b3</li>
</ul>
<p>text3</p>
';
$doc = new DOMDocument();
$doc->loadHTML($html);
$textContent = $doc->textContent;
$textContent = trim(preg_replace('/\t+/', '<br>', $textContent));
echo '
<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
' . $textContent . '
</body>
</html>
';
?>
However, I would suggest using javascript to find the content and send it to php instead.
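To get closer to the array from the question, each <p> can be paired with the <li> items of the <ul> that directly follows it. This is a sketch under that assumption; the helper name and the 'li' key are my own choices, and a paragraph without a list (like text3) is kept with an empty list rather than dropped.

```php
<?php
// Hypothetical helper: pair each <p> with the <li> items of the <ul> right after it.
function paragraphsWithLists(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $result = [];

    foreach ($xpath->query('//p') as $p) {
        $items = [];
        // Take the first following sibling element, but only if it is a <ul>.
        foreach ($xpath->query('following-sibling::*[1][self::ul]/li', $p) as $li) {
            $items[] = trim($li->textContent);
        }
        $result[] = ['value' => trim($p->textContent), 'li' => $items];
    }

    return $result;
}

$sample = '
<p>text1</p>
<ul><li>list-a1</li><li>list-a2</li><li>list-a3</li></ul>
<p>text2</p>
<ul><li>list-b1</li><li>list-b2</li><li>list-b3</li></ul>
<p>text3</p>';

$parsed = paragraphsWithLists($sample);
print_r($parsed);
```

The list items are collected in a plain indexed array because a PHP array cannot hold the same key ([il]) three times, as the expected output in the question would require.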

Take an existing array and break it up by value

I have a SimplePie feed that outputs multiple feeds, one from each of the URLs loaded into an associative array.
This associative array is used to sort the arrays alphabetically by their values.
I'm trying to use that same value sort and keep all of the arrays with the same URL or value together, so that when the foreach loop runs I get one div per URL containing all the feeds from that URL for the day.
<?php
require_once('php/autoloader.php');
$feed = new SimplePie(); // Create a new instance of SimplePie
// Load the feeds
$urls = array(
'http://abcfamily.go.com/service/feed?id=774372' => 'abc',
'http://animal.discovery.com/news/news.rss' => 'animalplanet',
'http://www.insideaolvideo.com/rss.xml' => 'aolvideo',
'http://feeds.bbci.co.uk/news/world/rss.xml' => 'bbcwn',
'http://www.bing.com' => 'bing',
'http://www.bravotv.com' => 'bravo',
'http://www.cartoonnetwork.com' => 'cartoonnetwork',
'http://feeds.cbsnews.com/CBSNewsMain?format=xml' => 'cbsnews',
'http://www.clicker.com/' => 'clicker',
'http://feeds.feedburner.com/cnet/NnTv?tag=contentBody.1' => 'cnet',
'http://www.comedycentral.com/' => 'comedycentral',
'http://www.crackle.com/' => 'crackle',
'http://www.cwtv.com/feed/episodes/xml' => 'cw',
'http://disney.go.com/disneyxd/' => 'disneyxd',
'http://www.engadget.com/rss.xml' => 'engadget',
'http://syndication.eonline.com/syndication/feeds/rssfeeds/video/index.xml' => 'eonline',
'http://sports.espn.go.com/espn/rss/news' => 'espn',
'http://facebook.com' => 'facebook',
'http://flickr.com/espn/rss/news' => 'flickr',
'http://www.fxnetworks.com//home/tonight_rss.php' => 'fxnetworks',
'http://www.hgtv.com/' => 'hgtv',
'http://www.history.com/this-day-in-history/rss' => 'history',
'http://rss.hulu.com/HuluRecentlyAddedVideos?format=xml' => 'hulu',
'http://rss.imdb.com/daily/born/' => 'imdb',
'http://www.metacafe.com/' => 'metacafe',
'http://feeds.feedburner.com/Monkeyseecom-NewestVideos?format=xml' => 'monkeysee',
'http://pheedo.msnbc.msn.com/id/18424824/device/rss/' => 'msnbc',
'http://www.nationalgeographic.com/' => 'nationalgeographic',
'http://dvd.netflix.com/NewReleasesRSS' => 'netflix',
'http://feeds.nytimes.com/nyt/rss/HomePage' => 'newyorktimes',
'http://www.nick.com/' => 'nickelodeon',
'http://www.nickjr.com/' => 'nickjr',
'http://www.pandora.com/' => 'pandora',
'http://www.pbskids.com/' => 'pbskids',
'http://www.photobucket.com/' => 'photobucket',
'http://feeds.reuters.com/Reuters/worldNews' => 'reuters',
'http://www.revision3.com/' => 'revision3',
'http://www.tbs.com/' => 'tbs',
'http://www.theverge.com/rss/index.xml' => 'theverge',
'http://www.tntdrama.com/' => 'tnt',
'http://www.tvland.com/' => 'tvland',
'http://www.vimeo.com/' => 'vimeo',
'http://www.vudu.com/' => 'vudu',
'http://feeds.wired.com/wired/index?format=xml' => 'wired',
'http://www.xfinitytv.com/' => 'xfinitytv',
'http://www.youtube.com/topic/4qRk91tndwg/most-popular#feed' => 'youtube',
);
$feed->set_feed_url(array_keys($urls));
$feed->enable_cache(true);
$feed->set_cache_location('cache');
$feed->set_cache_duration(1800); // Set the cache time
$feed->set_item_limit(0);
$success = $feed->init(); // Initialize SimplePie
$feed->handle_content_type(); // Take care of the character encoding
?>
<?php require_once("inc/connection.php"); ?>
<?php require_once("inc/functions.php"); ?>
<?php include("inc/header.php"); ?>
<?php
// Sort it
$feed_items = array();
$items = $feed->get_items();
$urls = array_unique($urls);
foreach ($urls as $url => $image) {
$unset = array();
$feed_items[$url] = array();
foreach ($items as $i => $item) {
if ($item->get_feed()->feed_url == $url) {
$feed_items[$url][] = $item;
$unset[] = $i;
}
}
foreach ($unset as $i) {
unset($items[$i]);
}
}
foreach ($feed_items as $feed_url => $items) {
if (empty($items)) {
?>
<div class="item"><img src="images/boreds/<?php echo $urls[$feed_url] ?>.png"/><p>Visit <?php echo $urls[$feed_url] ?> now!</p></div>
<?php
continue;
}
$first_item = $items[0];
$feed = $first_item->get_feed();
?>
<?php
$feedCount = 0;
foreach ($items as $item ) {
$feedCount++;
?>
<div class="item"><strong id="amount"><?php echo $feedCount; ?></strong><img src="images/boreds/<?php echo $urls[$feed_url] ?>.png"/><p><?php echo $item->get_title(); ?></p></div>
<?php
}
}
?>
<?php require("inc/footer.php"); ?>
Maybe you can use this one:
$feed = new SimplePie();
// set_feed_url() takes a single URL or an array of URLs
$feed->set_feed_url(array('http://myfirstfeed', 'http://mysecondfeed'));
$feed->init();
foreach( $feed->get_items() as $k => $item ) {
echo "<div id='" . $k . "'>";
echo $item->get_permalink();
echo $title = $item->get_title();
echo $item->get_date('j M Y, g:i a');
echo $item->get_content();
echo "</div>";
}
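To get one div per URL rather than one per item, the items can be bucketed by their source first and rendered afterwards. The sketch below uses plain arrays in place of SimplePie items so that it runs on its own; groupItems and the sample data are invented. With real items, the key callback would return whatever identifies the feed, for example $item->get_feed()->feed_url as in the question.

```php
<?php
// Hypothetical helper: bucket items by a key computed from each item.
function groupItems(iterable $items, callable $keyOf): array
{
    $groups = [];
    foreach ($items as $item) {
        $groups[$keyOf($item)][] = $item;
    }
    return $groups;
}

// Plain arrays stand in for SimplePie items (made-up data).
$items = [
    ['feed' => 'http://example.com/a.rss', 'title' => 'A1'],
    ['feed' => 'http://example.com/b.rss', 'title' => 'B1'],
    ['feed' => 'http://example.com/a.rss', 'title' => 'A2'],
];

$groups = groupItems($items, function (array $item) {
    return $item['feed'];
});

// One <div> per feed URL, holding every item from that feed.
foreach ($groups as $feedUrl => $feedItems) {
    echo '<div class="item" data-url="' . htmlspecialchars($feedUrl) . '">';
    echo '<strong>' . count($feedItems) . '</strong>';
    foreach ($feedItems as $item) {
        echo '<p>' . htmlspecialchars($item['title']) . '</p>';
    }
    echo "</div>\n";
}
```

Grouping in one pass also avoids the nested loop over every URL and every item, and count($feedItems) gives the per-feed total directly.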
