turn HTML into a PHP array - php

I have a string that also contains HTML in a $html variable:
'Here is some text which I do not need to extract but then there are
<figure class="class-one">
<img src="/example.jpg" alt="example alt" class="some-image-class">
<figcaption>example caption</figcaption>
</figure>
And another one (and many more)
<figure class="class-one some-other-class">
<img src="/example2.jpg" alt="example2 alt">
</figure>'
I want to extract all <figure> elements and everything they contain, including their attributes and nested HTML elements, and put this into a PHP array so that I get something like:
$figures = [
0 => [
"class" => "class-one",
"img" => [
"src" => "/example.jpg",
"alt" => "example alt",
"class" => "some-image-class"
],
"figcaption" => "example caption"
],
1 => [
"class" => "class-one some-other-class",
"img" => [
"src" => "/example2.jpg",
"alt" => "example2 alt",
"class" => null
],
"figcaption" => null
]];
So far I have tried:
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$figures = array();
foreach ($figures as $figure) {
$figures['class'] = $figure->getAttribute('class');
// here I tried to create the whole array but I can't seem to get the values from the HTML
// also I'm not sure how to get all html-elements within <figure>
}

Here is the code that should get you where you want to be. I have added comments where I felt they would be helpful:
<?php
$htmlString = 'Here is some text which I do not need to extract but then there are <figure class="class-one"><img src="/example.jpg" alt="example alt" class="some-image-class"><figcaption>example caption</figcaption></figure>And another one (and many more)<figure class="class-one some-other-class"><img src="/example2.jpg" alt="example2 alt"></figure>';
//Create a new DOM document
$dom = new DOMDocument;
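//Parse the HTML. The @ suppresses the warnings libxml raises for HTML5 tags such as <figure>.
@$dom->loadHTML($htmlString);
//Create a new DOMXPath object for querying the document
$xp = new DOMXPath($dom);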
//Create empty figures array that will hold all of our parsed HTML data
$figures = array();
//Get all <figure> elements
$figureElements = $xp->query('//figure');
//Create number variable to keep track of our $figures array index
$figureCount = 0;
//Loop through each <figure> element
foreach ($figureElements as $figureElement) {
//Use relative paths (".//") so each query only searches inside the current <figure>; "//img" would search the whole document
$img = $xp->query('.//img', $figureElement)->item(0);
$figcaption = $xp->query('.//figcaption', $figureElement)->item(0);
$figures[$figureCount]["class"] = trim($figureElement->getAttribute('class'));
//getAttribute() returns an empty string for a missing attribute, so check with hasAttribute() to get null instead
foreach (["src", "alt", "class"] as $attribute) {
$figures[$figureCount]["img"][$attribute] = ($img && $img->hasAttribute($attribute)) ? $img->getAttribute($attribute) : null;
}
//Check that a <figcaption> element exists, otherwise set the value to null
$figures[$figureCount]["figcaption"] = $figcaption ? $figcaption->nodeValue : null;
//Increment our $figureCount so that we know we can create a new array index.
$figureCount++;
}
print_r($figures);
?>

$doc = new \DOMDocument();
$doc->loadHTML($html);
$figure = $doc->getElementsByTagName("figure"); // DOMNodeList Object
//Create an array to hold the values from each DOMElement
$figures = array();
$i= 0;
foreach($figure as $item) { // DOMElement Object
$figures[$i]['class']= $item->getAttribute('class');
//DOMElement::getElementsByTagName - returns the matching descendant elements; [0] is the first one (null if there is none)
$img = $item->getElementsByTagName('img')[0];
if($img){
//DOMElement::getAttribute — Returns value of attribute
$figures[$i]['img']['src'] = $img->getAttribute('src');
$figures[$i]['img']['alt'] = $img->getAttribute('alt');
$figures[$i]['img']['class'] = $img->getAttribute('class');
}
//textContent - used to get the text of the tag
if($item->getElementsByTagName('figcaption')[0]){
$figures[$i]['figcaption'] = $item->getElementsByTagName('figcaption')[0]->textContent;
}
$i++;
}
echo "<pre>";
print_r($figures);
echo "</pre>";

Related

Loop through elements and parse them with DOMDocument() in PHP

I have a list of items like this:
<div class="list">
<div class="ui_checkbox type hidden" data-categories="57 48 ">
<input id="attraction_type_119" type="checkbox" value="119"
<label for="attraction_type_119">Aquariums</label>
</div>
<div class="ui_checkbox type " data-categories="47 ">
<input id="attraction_type_120" type="checkbox" value="120"
<label for="attraction_type_120">Arènes et stades</label>
</div>
</div>
How can I loop through them with DOMDocument to get details like:
data-categories
input value
label text
This is what I tried:
$dom = new DOMDocument();
$dom->loadHTML($html);
$xp = new DOMXpath($dom);
$elements = $dom->getElementsByTagName('div');
$data = array();
foreach($elements as $node){
foreach($node->childNodes as $child) {
$data['data_categorie'] = $child->item(0)->getAttribute('data_categories');
$data['input_value'] = $child->item(0)->getAttribute('input_value');
$data['label_text'] = $child->item(0)->getAttribute('label_text');
}
}
But it doesn't work.
What am I missing here?
Thanks.
Setting multiple values in the loop like $data['data_categorie'] = ..., using the same keys of the single array $data = array();, overwrites the values on every iteration.
As you have multiple items, you could create a temporary array $temp = []; to store the values for the current iteration, and add it to the $data array once all of its values are stored.
As you are already using DOMXpath, you could get the child divs of the div with class="list" using an expression like //div[@class="list"]/div, then loop over their childNodes, check for nodeName input, and take that node's value plus the value of its next sibling, which is the label text:
$data = array();
$xp = new DOMXpath($dom);
$items = $xp->query('//div[@class="list"]/div');
foreach($items as $item) {
$temp["data_categorie"] = $item->getAttribute("data-categories");
foreach ($item->childNodes as $child) {
if ($child->nodeName === "input") {
$temp["input_value"] = $child->getAttribute("value");
$temp["label_text"] = $child->nextSibling->nodeValue;
}
}
$data[] = $temp;
}
print_r($data);
Output
Array
(
[0] => Array
(
[data_categorie] => 57 48
[input_value] => 119
[label_text] => Aquariums
)
[1] => Array
(
[data_categorie] => 47
[input_value] => 120
[label_text] => Arènes et stades
)
)
I used string() and evaluate() to get each value with a single query:
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXpath($dom);
$elements = $xpath->query('//div[contains(@class, "ui_checkbox")]');
$result = array();
foreach($elements as $node) {
$data = array();
$data['data_categorie'] = $xpath->evaluate('string(./@data-categories)', $node);
$data['input_value'] = $xpath->evaluate('string(./input/@value)', $node);
$data['label_text'] = $xpath->evaluate('string(./label/text())', $node);
//Collect the values for this item; otherwise $data is discarded on the next iteration
$result[] = $data;
}
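For reference, here is a self-contained sketch of the string()/evaluate() approach. It assumes the <input> tags are properly closed (the sample markup below is adapted for that, and the second label text is made up), and the variable name $rows is only illustrative. string() yields an empty string when nothing matches, so no null checks are needed.

```php
<?php
// Sample markup adapted from the question; the <input> tags are closed here (an assumption).
$html = '<div class="list">'
      . '<div class="ui_checkbox type hidden" data-categories="57 48 ">'
      . '<input id="attraction_type_119" type="checkbox" value="119">'
      . '<label for="attraction_type_119">Aquariums</label></div>'
      . '<div class="ui_checkbox type " data-categories="47 ">'
      . '<input id="attraction_type_120" type="checkbox" value="120">'
      . '<label for="attraction_type_120">Stadiums</label></div>'
      . '</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$rows = [];
foreach ($xpath->query('//div[contains(@class, "ui_checkbox")]') as $node) {
    // Each evaluate() call is relative to the current div and returns a plain PHP string.
    $rows[] = [
        'data_categorie' => trim($xpath->evaluate('string(./@data-categories)', $node)),
        'input_value'    => $xpath->evaluate('string(./input/@value)', $node),
        'label_text'     => $xpath->evaluate('string(./label)', $node),
    ];
}
print_r($rows);
```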

Array filter in PHP

I am using Simple HTML DOM to parse an HTML file.
I have a dynamic array called links2; it can be empty or have four or more elements, depending on the case.
<?php
include('simple_html_dom.php');
$url = 'http://www.example.com/';
$html = file_get_html($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
//////////////////////////////////////////////////////////////////////////////
foreach ($doc->getElementsByTagName('p') as $link)
{
$intro2 = $link->nodeValue;
$links2[] = array(
'value' => $link->textContent,
);
$su=count($links2);
}
$word = 'document.write(';
Assume that two elements of the links2 array contain $word. When I try to filter links2 by removing the elements that contain a match:
unset( $links2[array_search($word, $links2 )] );
print_r($links2);
The filter removes only one element, and array_diff doesn't solve the problem. Any suggestions?
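Solved by adding a condition that skips the unwanted elements while building the array:
foreach ($doc->getElementsByTagName('p') as $link)
{
$dont = $link->textContent;
//Only keep paragraphs that do not contain the word "document"
if (strpos($dont, 'document') === false) {
$links2[] = array(
'value' => $link->textContent,
);
}
}
$su=count($links2);
echo $su;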
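If the array has already been built, filtering it afterwards is another option. array_search() returns only the first matching key and compares whole elements, so a nested array such as ['value' => '...'] is never equal to the string $word; that is most likely why the unset() attempt removed a single, wrong element. A minimal sketch with array_filter() follows; the sample data is hypothetical and only mirrors the shape of $links2.

```php
<?php
// Hypothetical sample data in the same shape as $links2 from the question.
$links2 = [
    ['value' => 'First paragraph'],
    ['value' => 'document.write("ad one");'],
    ['value' => 'Second paragraph'],
    ['value' => 'before document.write("ad two"); after'],
];
$word = 'document.write(';

// Keep only the elements whose 'value' does not contain $word.
$filtered = array_filter($links2, function ($link) use ($word) {
    return strpos($link['value'], $word) === false;
});

// array_filter() preserves the original keys; array_values() reindexes from 0.
$filtered = array_values($filtered);

print_r($filtered);
```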

Insert XML element on the same line of deleted element

I have a php document that deletes an XML element (with child elements), based on the value of the attribute "id", and then creates a new element with the same child elements, but with different text added from a form input:
<?php
function ctr($myXML, $id) {
$xmlDoc = new DOMDocument();
$xmlDoc->load($myXML);
$xpath = new DOMXpath($xmlDoc);
$nodeList = $xpath->query('//noteboard[@id="'.$id.'"]');
if ($nodeList->length) {
$node = $nodeList->item(0);
$node->parentNode->removeChild($node);
}
$xmlDoc->save($myXML);
}
$xml = 'xml.xml'; // file
$to = $_POST['eAttr'];// the attribute value for "id"
ctr($xml,$to);
$target = "3";
$newline = "
<noteboard id='".$_POST['eId']."'>
<noteTitle>".$_POST['eTitle']."</noteTitle>
<noteMessage>".$_POST['eMessage']."</noteMessage>
<logo>".$_POST['eType']."</logo>
</noteboard>"; // HERE
$stats = file($xml, FILE_IGNORE_NEW_LINES);
$offset = array_search($target,$stats) +1;
array_splice($stats, $offset, 0, $newline);
file_put_contents($xml, join("\n", $stats));
?>
XML.xml
<note>
<noteboard id="title">
<noteTitle>Title Text Here</noteTitle>
<noteMessage>Text Here</noteMessage>
<logo>logo.jpg</logo>
</noteboard>
</note>
This works fine, but I would like the new XML content to go on the line where the old (deleted) element used to be, instead of being added at line 3 via $target. It is supposed to look as if the element is being 'edited', which it does not if the new element ends up on the wrong line.
The lines in an XML document are not really relevant; they are just formatting that makes the document easier for a human to read. Think of it as a tree of nodes. Not only the elements are nodes, but any content, like the XML declaration, attributes, and any text.
With that in mind you can think about your problem as replacing an element node.
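To see the tree view in practice, here is a small sketch that lists the direct children of the root element. The XML is a shortened copy of the file from the question, and the $children array is only there for illustration. The line breaks and indentation appear as text nodes next to the element node.

```php
<?php
// Sample XML modeled on the question's file (shortened).
$xml = "<note>\n  <noteboard id=\"title\">\n    <noteTitle>Title Text Here</noteTitle>\n  </noteboard>\n</note>";

$document = new DOMDocument();
$document->loadXML($xml);

// List the direct children of <note>: whitespace shows up as text nodes around the element.
$children = [];
foreach ($document->documentElement->childNodes as $child) {
    // json_encode() makes the whitespace inside each node value visible.
    $children[] = get_class($child) . ' ' . json_encode($child->nodeValue);
}
print_r($children);
```

Replacing the element node leaves those surrounding text nodes alone, which is why the new element ends up in the same place in the file.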
First, create the new noteboard element. This can be encapsulated into a function:
function createNote(DOMDocument $document, $id, array $data) {
$noteboard = $document->createElement('noteboard');
$noteboard->setAttribute('id', $id);
$noteboard
->appendChild($document->createElement('noteTitle'))
->appendChild($document->createTextNode($data['title']));
$noteboard
->appendChild($document->createElement('noteMessage'))
->appendChild($document->createTextNode($data['text']));
$noteboard
->appendChild($document->createElement('logo'))
->appendChild($document->createTextNode($data['logo']));
return $noteboard;
}
Call the function to create the new element node. I am using string literals here; you will have to replace them with the variables from your form.
$newNoteCard = createNote(
$document,
42,
[
'title' => 'New Title',
'text' => 'New Text',
'logo' => 'newlogo.svg',
]
);
Now that you have the new element, you can find the existing one and replace it:
foreach($xpath->evaluate('//noteboard[@id=3][1]') as $noteboard) {
$noteboard->parentNode->replaceChild($newNoteCard, $noteboard);
}
Complete example:
$document = new DOMDocument();
$document->formatOutput = true;
$document->preserveWhiteSpace = false;
$document->loadXml($xml);
$xpath = new DOMXpath($document);
function createNote(DOMDocument $document, $id, array $data) {
$noteboard = $document->createElement('noteboard');
$noteboard->setAttribute('id', $id);
$noteboard
->appendChild($document->createElement('noteTitle'))
->appendChild($document->createTextNode($data['title']));
$noteboard
->appendChild($document->createElement('noteMessage'))
->appendChild($document->createTextNode($data['text']));
$noteboard
->appendChild($document->createElement('logo'))
->appendChild($document->createTextNode($data['logo']));
return $noteboard;
}
$newNoteCard = createNote(
$document,
42,
[
'title' => 'New Title',
'text' => 'New Text',
'logo' => 'newlogo.svg',
]
);
foreach($xpath->evaluate('//noteboard[@id=3][1]') as $noteboard) {
$noteboard->parentNode->replaceChild($newNoteCard, $noteboard);
}
echo $document->saveXml();

Explode random unpredictable tags in an array

Below is a random, unpredictable set of tags wrapped inside a div tag. How can I explode the innerHTML of all child tags while preserving their order of occurrence?
Note: for img and iframe tags, only the URLs need to be extracted.
<div>
<p>para-1</p>
<p>para-2</p>
<p>
text-before-image
<img src="text-image-src"/>
text-after-image</p>
<p>
<iframe src="p-iframe-url"></iframe>
</p>
<iframe src="iframe-url"></iframe>
<h1>header-1</h1>
<img src="image-url"/>
<p>
<img src="p-image-url"/>
</p>
content not wrapped within any tags
<h2>header-2</h2>
<p>para-3</p>
<ul>
<li>list-item-1</li>
<li>list-item-2</li>
</ul>
<span>span-content</span>
content not wrapped within any tags
</div>
Expected array:
["para-1","para-2","text-before-image","text-image-src","text-after-image",
"p-iframe-url","iframe-url","header-1","image-url",
"p-image-url","content not wrapped within any tags","header-2","para-3",
"list-item-1","list-item-2","span-content","content not wrapped within any tags"]
Relevant code:
$dom = new DOMDocument();
@$dom->loadHTML( $content );
$tags = $dom->getElementsByTagName( 'p' );
// Get all the paragraph tags, to iterate its nodes.
$j = 0;
foreach ( $tags as $tag ) {
// get_inner_html() to preserve the node's text & tags
$con[ $j ] = $this->get_inner_html( $tag );
// Check if the Node has html content or not
if ( $con[ $j ] != strip_tags( $con[ $j ] ) ) {
// Check if the node contains html along with plain text with out any tags
if ( $tag->nodeValue != '' ) {
/*
* DOM to get the Image SRC of a node
*/
$domM = new DOMDocument();
/*
* Setting encoding type http://in1.php.net/domdocument.loadhtml#74777
* Set after initializing DOMDocument();
*/
$con[ $j ] = mb_convert_encoding( $con[ $j ], 'HTML-ENTITIES', "UTF-8" );
@$domM->loadHTML( $con[ $j ] );
$y = new DOMXPath( $domM );
foreach ( $y->query( "//img" ) as $node ) {
$con[ $j ] = "img=" . $node->getAttribute( "src" );
// Increment the array index to accommodate bad text and image tags.
$j++;
// Index incremented; fetch the node value to accommodate the text without any tags.
$con[ $j ] = $tag->nodeValue;
}
$domC = new DOMDocument();
@$domC->loadHTML( $con[ $j ] );
$z = new DOMXPath( $domC );
foreach ( $z->query( "//iframe" ) as $node ) {
$con[ $j ] = "vid=http:" . $node->getAttribute( "src" );
// Increment the array index to accommodate bad text and image tags.
$j++;
// Index incremented; fetch the node value to accommodate the text without any tags.
$con[ $j ] = $tag->nodeValue;
}
} else {
/*
* DOM to get the Image SRC of a node
*/
$domA = new DOMDocument();
@$domA->loadHTML( $con[ $j ] );
$x = new DOMXPath( $domA );
foreach ( $x->query( "//img" ) as $node ) {
$con[ $j ] = "img=" . $node->getAttribute( "src" );
}
if ( $con[ $j ] != strip_tags( $con[ $j ] ) ) {
foreach ( $x->query( "//iframe" ) as $node ) {
$con[ $j ] = "vid=http:" . $node->getAttribute( "src" );
}
}
}
}
// Increment the node index
$j++;
}
$this->content = $con;
A quick and easy way of extracting interesting pieces of information from a DOM document is to make use of XPath. Below is a basic example showing how to get the text content and attribute text from a div element.
<?php
// Pre-amble, scroll down to interesting stuff...
$html = '<div>
<p>para-1</p>
<p>para-2</p>
<p>
<iframe src="p-iframe-url"></iframe>
</p>
<iframe src="iframe-url"></iframe>
<h1>header-1</h1>
<img src="image-url"/>
<p>
<img src="p-image-url"/>
</p>
content not wrapped within any tags
<h2>header-2</h2>
<p>para-3</p>
<ul>
<li>list-item-1</li>
<li>list-item-2</li>
</ul>
<span>span-content</span>
content not wrapped within any tags
</div>';
$doc = new DOMDocument;
$doc->loadHTML($html);
$div = $doc->getElementsByTagName('div')->item(0);
// Interesting stuff:
// Use XPath to get all text nodes and attribute text
// $texts becomes a DOMNodeList filled with DOMText and DOMAttr objects
$xpath = new DOMXPath($doc);
$texts = $xpath->query('descendant-or-self::*/text()|descendant::*/@*', $div);
// You could include/exclude specific attributes by looking at their name
// e.g. multiple paths: .//@src|.//@href
// or whitelist: descendant::*/@*[name()="src" or name()="href"]
// or blacklist: descendant::*/@*[not(name()="ignore")]
// Build an array of the text held by the DOMText and DOMAttr objects
// skipping any boring whitespace
$results = array();
foreach ($texts as $text) {
$trimmed_text = trim($text->nodeValue);
if ($trimmed_text !== '') {
$results[] = $trimmed_text;
}
}
// Let's see what we have
var_dump($results);
Try a recursive approach! Add an empty array $parts to your class instance and a function extractSomething(DOMNode $source). Your function should process each case separately and then return. If the source is a
TextNode: push its text to $parts
Element with name img: push its src to $parts
other special cases
Element: for each TextNode or Element child, call extractSomething(child)
Now, when the call to extractSomething(yourRootDiv) returns, you will have the list in $this->parts.
Note that you have not defined what happens with <p> sometext1 <img src="ref" /> sometext2 </p>, but the outline above would add three elements ("sometext1", "ref" and "sometext2") for it.
This is just a rough outline of the solution. The point is that you need to process each node in the tree (possibly regardless of its position), and while walking the nodes in the right order, you build your array by transforming each node into the desired text. Recursion is the fastest to code, but you may alternatively try breadth-first traversal or walker tools.
Bottom line is that you have to accomplish two tasks: walk the nodes in a correct order, transform each to the desired result.
This is basically a rule of thumb for processing a tree/graph structure.
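The outline above could be sketched roughly like this. It is written as a plain function with a by-reference array instead of a class property, the name extractParts is made up for the example, and treating iframe like img is an assumption taken from the question's note about URLs.

```php
<?php
// Hypothetical helper sketching the recursive walk described above.
function extractParts(DOMNode $source, array &$parts): void
{
    if ($source instanceof DOMText) {
        // Text node: keep it unless it is only whitespace.
        $text = trim($source->nodeValue);
        if ($text !== '') {
            $parts[] = $text;
        }
        return;
    }
    if ($source instanceof DOMElement) {
        if (in_array($source->nodeName, ['img', 'iframe'], true)) {
            // Special case: keep only the URL.
            $parts[] = $source->getAttribute('src');
            return;
        }
        // Any other element: recurse into its children in document order.
        foreach ($source->childNodes as $child) {
            extractParts($child, $parts);
        }
    }
}

// Shortened sample markup based on the question.
$html = '<div><p>para-1</p><p>text-before-image<img src="text-image-src"/>text-after-image</p>'
      . '<iframe src="iframe-url"></iframe>loose text<ul><li>list-item-1</li></ul></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$parts = [];
extractParts($doc->getElementsByTagName('div')->item(0), $parts);
print_r($parts);
```

Because the children are visited in document order, text before and after an inline image keeps its position relative to the image URL.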
The simplest way is to use DOMDocument:
http://www.php.net/manual/en/domdocument.loadhtmlfile.php

How can I extract all img tag within an anchor tag?

I would like to extract all img tags that are within an anchor tag using the PHP DOM object.
I am trying it with the code below, but it gets all anchor tags, and their text is empty because they only contain an img tag.
function get_links($url) {
// Create a new DOM Document to hold our webpage structure
$xml = new DOMDocument();
// Load the url's contents into the DOM
@$xml->loadHTMLFile($url);
// Empty array to hold all links to return
$links = array();
//Loop through each <a> tag in the dom and add it to the link array
foreach($xml->getElementsByTagName('a') as $link)
{
$hrefval = '';
if(strpos($link->getAttribute('href'),'www') > 0)
{
//$links[] = array('url' => $link->getAttribute('href'), 'text' => $link->nodeValue);
$hrefval = '#URL#'.$link->getAttribute('href').'#TEXT#'.$link->nodeValue;
$links[$hrefval] = $hrefval;
}
else
{
//$links[] = array('url' => GetMainBaseFromURL($url).$link->getAttribute('href'), 'text' => $link->nodeValue);
$hrefval = '#URL#'.GetMainBaseFromURL($url).$link->getAttribute('href').'#TEXT#'.$link->nodeValue;
$links[$hrefval] = $hrefval;
}
}
foreach($xml->getElementsByTagName('img') as $link)
{
$srcval = '';
if(strpos($link->getAttribute('src'),'www') > 0)
{
//$links[] = array('src' => $link->getAttribute('src'), 'nodval' => $link->nodeValue);
$srcval = '#SRC#'.$link->getAttribute('src').'#NODEVAL#'.$link->nodeValue;
$links[$srcval] = $srcval;
}
else
{
//$links[] = array('src' => GetMainBaseFromURL($url).$link->getAttribute('src'), 'nodval' => $link->nodeValue);
$srcval = '#SRC#'.GetMainBaseFromURL($url).$link->getAttribute('src').'#NODEVAL#'.$link->nodeValue;
$links[$srcval] = $srcval;
}
}
//Return the links
//$links = unsetblankvalue($links);
return $links;
}
This returns all anchor tags and all img tags separately.
$xml = new DOMDocument;
libxml_use_internal_errors(true);
$xml->loadHTMLFile($url);
libxml_clear_errors();
libxml_use_internal_errors(false);
$xpath = new DOMXPath($xml);
foreach ($xpath->query('//a[contains(@href, "www")]/img') as $entry) {
var_dump($entry->getAttribute('src'));
}
The strpos() function is not used correctly in the code: it returns 0 when the string starts with 'www', and 0 > 0 is false.
Instead of using
if(strpos($link->getAttribute('href'),'www') > 0)
Use
if(strpos($link->getAttribute('href'),'www')!==false )
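A short sketch of the difference between the two checks; the href values and the $results array are made up for illustration.

```php
<?php
// Hypothetical href values for illustration.
$hrefs = ['www.example.com/a', 'http://www.example.com/b', '/relative/path'];

$results = [];
foreach ($hrefs as $href) {
    // strpos() returns the integer offset of the match, or false when there is no match.
    $pos = strpos($href, 'www');
    $results[$href] = [
        'loose'  => $pos > 0,        // misses a match at offset 0
        'strict' => $pos !== false,  // true for any match, including offset 0
    ];
}
var_export($results);
```

Only the first value differs between the two checks: the match is at offset 0, so the > 0 comparison treats it as not found.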
