How to parse this with Simple HTML DOM Parser in PHP

<div id="productDetails" class="tabContent active details">
<span>
<b>Case Size:</b>
</span>
44mm
<br>
<span>
<b>Case Thickness:</b>
</span>
13mm
<br>
<span>
<b>Water Resistant:</b>
</span>
5 ATM
<br>
<span>
<b>Brand:</b>
</span>
Fossil
<br>
<span>
<b>Warranty:</b>
</span>
11-year limited
<br>
<span>
<b>Origin:</b>
</span>
Imported
<br>
</div>
How can I get data like 44mm, Fossil, etc. with a DOM parser in PHP?
I can easily get the whole block with
$data=$html->find('div#productDetails',0)->innertext;
var_dump($data);
but I want to split it into meta_key and meta_value for my SQL table.
I can get the meta_key with
$meta_key=$html->find('div#productDetails span',0)->innertext;
but how do I get the meta value that belongs to it?

It's not that hard, really. PHP's DOMDocument class lets you parse the DOM, select all elements of interest, iterate over them, and read their contents:
$DOM = new DOMDocument();
$DOM->loadHTML($htmlString);
foreach ($DOM->getElementsByTagName('span') as $span)
{
// the label is the span's text; the value is the text node that follows the span
$value = $span->nextSibling ? trim($span->nextSibling->nodeValue) : '';
echo trim($span->textContent).' - '.$value."\n";
}
That seems to be what you're after, if I'm not mistaken. This is just off the top of my head, but I think this should output something like:
Case Size: - 44mm
Case Thickness: - 13mm
UPDATE:
Here's a tested solution, that returns the desired result, if I'm not mistaken:
$str = "<div id='productDetails' class='tabContent active details'>
<span>
<b>Case Size:</b>
</span>
44mm
<br>
<span>
<b>Case Thickness:</b>
</span>
13mm
<br>
<span>
<b>Water Resistant:</b>
</span>
5 ATM
<br>
<span>
<b>Brand:</b>
</span>
Fossil
<br>
<span>
<b>Warranty:</b>
</span>
11-year limited
<br>
<span>
<b>Origin:</b>
</span>
Imported
<br>
</div>";
$DOM = new DOMDocument();
$DOM->loadHTML($str);
$txt = implode('',explode("\n",$DOM->textContent));
preg_match_all('/([a-z0-9].*?\:).*?([0-9a-z]+)/im',$txt,$matches);
//or if you don't want to include the colon in your match:
preg_match_all('/([a-z0-9][^:]*).*?([0-9a-z]+)/im',$txt,$matches);
for($i = 0, $j = count($matches[1]);$i<$j;$i++)
{
$matches[1][$i] = preg_replace('/\s+/',' ',$matches[1][$i]);
$matches[2][$i] = preg_replace('/\s+/',' ',$matches[2][$i]);
}
$result = array_combine($matches[1],$matches[2]);
var_dump($result);
//result:
array(6) {
["Case Size:"]=> "44mm"
["Case Thickness:"]=> "13mm"
["Water Resistant:"]=> "5"
["ATM Brand:"]=> "Fossil"
["Warranty:"]=> "11"
["year limited Origin:"]=> "Imported"
}
To insert this in your DB:
// prepare the statement once, then execute it for every pair
$stmt = $pdo->prepare('INSERT INTO your_db.your_table (meta_key, meta_value) VALUES (:key, :value)');
foreach($result as $key => $value)
{
$stmt->execute(array('key' => $key, 'value' => $value));
}
Edit
To capture the "11-year limited" substring entirely, you'll need to edit the code above like so:
//replace $txt = implode('',explode("\n",$DOM->textContent));etc... by:
$txt = $DOM->textContent;//leave line-feeds
preg_match_all('/([a-z0-9][^:]*)[^a-z0-9]*([a-z0-9][^\n]+)/im',$txt,$matches);
for($i = 0, $j = count($matches[1]);$i<$j;$i++)
{
$matches[1][$i] = preg_replace('/\s+/',' ',$matches[1][$i]);
$matches[2][$i] = preg_replace('/\s+/',' ',$matches[2][$i]);
}
$matches[2] = array_map('trim',$matches[2]);//remove trailing spaces
$result = array_combine($matches[1],$matches[2]);
var_dump($result);
The output is:
array(6) {
["Case Size"]=> "44mm"
["Case Thickness"]=> "13mm"
["Water Resistant"]=> "5 ATM"
["Brand"]=> "Fossil"
["Warranty"]=> "11-year limited"
["Origin"]=> "Imported"
}
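If you would rather avoid regular expressions, the pairs can also be read straight from the DOM: each <span> holds a label, and the text node right after it holds the value. The following is only a minimal sketch, assuming the markup always follows that span-then-text pattern; the function name extractDetails is made up for illustration and the sample is shortened to two entries.

```php
<?php
// Hypothetical helper (not part of the answer above): pairs each <span> label
// with the text node that immediately follows it.
function extractDetails(string $html): array
{
    $dom = new DOMDocument();
    // Silence warnings about imperfect markup; loadHTML() is lenient anyway.
    @$dom->loadHTML($html);

    $details = [];
    foreach ($dom->getElementsByTagName('span') as $span) {
        // Label: the span's text, whitespace collapsed, trailing colon removed.
        $key = rtrim(trim(preg_replace('/\s+/', ' ', $span->textContent)), ':');
        // Value: the next sibling, but only if it is a text node.
        $next = $span->nextSibling;
        $value = ($next instanceof DOMText) ? trim($next->nodeValue) : '';
        if ($key !== '') {
            $details[$key] = $value;
        }
    }
    return $details;
}

$html = '<div id="productDetails">
<span>
<b>Water Resistant:</b>
</span>
5 ATM
<br>
<span>
<b>Warranty:</b>
</span>
11-year limited
<br>
</div>';

print_r(extractDetails($html));
```

Because the value is taken as a whole text node, multi-word values such as "5 ATM" and "11-year limited" stay intact without any special pattern.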

You can remove the span tags using the set_callback API.
Try this:
$url = "";
$html = new simple_html_dom();
$html->load_file($url);
$html->set_callback('my_callback');
$elem = $html->find('div[id=productDetails]');
$product_details = array();
$attrib = array( 1 => 'size', 2 => 'thickness', 3 => 'wr', 4 => 'brand', 5 => 'warranty', 6 => 'origin' );
$attrib_string = strip_tags($elem[0]->innertext);
$attrib_arr = explode(' ',$attrib_string); // hope this can help you for every product
// Remove Empty Values
$attrib_arr = array_filter($attrib_arr);
$i = 1;
foreach($attrib_arr as $temp)
{
$product_details[$attrib[$i]] = $temp;
$i++;
}
print_r($product_details);
// remove span tag inside div
function my_callback($element) {
if($element->tag == 'span'){ $element->outertext = ""; }
}

Related

Can I get the result of domxpath nested classes into an array with keys => value?

I'm getting some data from a webpage for clients and that works fine: it gets all the data in separate rows by exploding on \n, and I then map specific lines to array keys to fill form fields with. Like so for each needed value:
$lines = explode("\n", $html);
$data['vraagprijs'] = preg_replace("/[^0-9]/", "", $lines[5]);
However, the data I need may be on line 10 today, but might very well be on line 11 tomorrow. So I'd like to get the values into named arrays. A sample of the HTML on the URL is as follows:
<div class="item_list">
<span class="item first status">
<span class="itemName">Status</span>
<span class="itemValue">Sold</span>
</span>
<span class="item price">
<span class="itemName">Vraagprijs</span>
<span class="itemValue">389.000</span>
</span>
<span class="item condition">
<span class="itemName">Aanvaarding</span>
<span class="itemValue">In overleg</span>
</span>
...
</div>
This is my function model:
$tagName3 = 'div';
$attrName3 = 'class';
$attrValue3 = 'item_list';
$html = getShortTags($tagName3, $attrName3, $attrValue3, $url);
function getShortTags($tagName, $attrName, $attrValue, $url = "", $exclAttrValue = 'itemTitle') {
$dom = $this->getDom($url);
$html = '';
$domxpath = new \DOMXPath($dom);
$newDom = new \DOMDocument;
$newDom->formatOutput = true;
$filtered = $domxpath->query(" //" . $tagName . "[@" . $attrName . "='" . $attrValue . "']/descendant::text()[not(parent::span/@" . $attrName . "='" . $exclAttrValue . "')] ");
$i = 0;
while ($myItem = $filtered->item($i++)) {
$node = $newDom->importNode($myItem, true);
$newDom->appendChild($node);
}
$html = $newDom->saveHTML();
return $html;
}
What am I getting?
Status\nSold\nVraagprijs\n389.000\nIn overleg\n....
Desired output anything like:
$html = array("Status" => "Sold", "Vraagprijs" => "389.000", "Aanvaarding" => "In overleg", ...)
Is there a way to "loop" through the itemList and get each itemName and itemValue into an associative array?
If you're happy with what the getShortTags() method does (or if it's used elsewhere and so difficult to tweak), then you can process the return value.
This code first uses explode() to split the output by line, uses array_map() and trim() to remove any spaces etc., then passes the result through array_filter() to remove blank lines. This will leave the data in pairs, so an easy way is to use array_chunk() to extract the pairs and then foreach() over the pairs with the first as the key and the second as the value...
$html = getShortTags($tagName3, $attrName3, $attrValue3, $url);
$lines = array_filter(array_map("trim", explode(PHP_EOL, $html)));
$pairs = array_chunk($lines, 2);
$output = [];
foreach ( $pairs as $pair ) {
$output[$pair[0]] = $pair[1];
}
print_r($output);
with the sample data gives..
Array
(
[Status] => Sold
[Vraagprijs] => 389.000
[Aanvaarding] => In overleg
)
To do this directly on the document, without making any assumptions about the line layout, look up the base element and loop over its <span> children. For each one, find the descendants with the itemName and itemValue class attributes and read their text (if several values have no name, they will all end up under the same empty key, so only the last one survives):
$output = [];
$filtered = $domxpath->query("//div[@class='item_list']/span");
foreach ( $filtered as $myItem ) {
$name= $domxpath->evaluate("string(descendant::span[@class='itemName'])", $myItem);
$value= $domxpath->evaluate("string(descendant::span[@class='itemValue'])", $myItem);
$output[$name] = $value;
}
print_r($output);
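For reference, the same XPath technique can be tried on its own, without the surrounding class. This is a sketch only: it loads the sample markup from a string, and the function name itemListToArray is an assumption made for illustration.

```php
<?php
// Hypothetical helper: turns the item_list markup into a name => value array.
function itemListToArray(string $html): array
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    $output = [];
    foreach ($xpath->query("//div[@class='item_list']/span") as $item) {
        // Relative expressions are evaluated against the current outer <span>.
        $name  = trim($xpath->evaluate("string(span[@class='itemName'])", $item));
        $value = trim($xpath->evaluate("string(span[@class='itemValue'])", $item));
        if ($name !== '') {
            $output[$name] = $value;
        }
    }
    return $output;
}

// Sample markup taken from the question.
$html = '<div class="item_list">
<span class="item first status">
<span class="itemName">Status</span>
<span class="itemValue">Sold</span>
</span>
<span class="item price">
<span class="itemName">Vraagprijs</span>
<span class="itemValue">389.000</span>
</span>
<span class="item condition">
<span class="itemName">Aanvaarding</span>
<span class="itemValue">In overleg</span>
</span>
</div>';

print_r(itemListToArray($html));
```

Skipping items with an empty name is a design choice of this sketch; it avoids several unnamed values overwriting each other under one empty key.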

DOMDocument: problems with replaceChild()

I'm attempting to find the first <p> in a <div>:
<div class="embed-left">
<h4>Bookmarks</h4>
<p>Something goes here.</p>
<p>Read more...</p>
</div>
Which I've done.
Now, however, I need to replace the found <p> with a link wrapped in a <span>, which I build as a string and then pass to createElement() as $url:
$results_links = $this->data_migration->process_embed_find_links();
$dom = new DOMDocument();
foreach ($results_links as $notes):
$dom->loadHTML($notes['note']);
$x = $dom->getElementsByTagName('div')->length;
// Loop through the <div> elements found in the HTML...
for ($i = 0; $i < $x; $i++):
$parentNode = $dom->getElementsByTagName('div')->item($i);
// Here's a <h4> element.
$childNodeHeading = $dom->getElementsByTagName('div')->item($i)->childNodes->item(1);
// If the <h4> element is "Bookmarks"...
if ( $childNodeHeading->nodeValue == "Bookmarks" ):
// ... then grab the first <p> element.
$childNodeTitle = $dom->getElementsByTagName('div')->item($i)->childNodes->item(3);
// Create the appropriate <p> element.
$title = $dom->createElement('p', $childNodeTitle->nodeValue);
echo "<p>" . $title->nodeValue . "</p>";
// Find the `notes_links.from-asset` rows.
$results_bookmarks_links = $this->data_migration->process_embed_find_links_bookmarks_links(array(
'note_id' => $notes['note_id'],
// Send the first <p> tag in the <div> element.
'title' => htmlentities($childNodeTitle->nodeValue)
));
// Loop through the data (one row returned, but it's more neat to run it through a foreach() function)...
foreach ($results_bookmarks_links as $index => $link):
// Assuming there are values (which there has to be, by virtue of the fact that we found the <div> elements in the first place...
if ( isset($results_bookmarks_links) && ( count($results_bookmarks_links) > 0 ) ):
// Create the <span> element for the link item, according to Sina's design.
$span = '<span>[#' . $notes['note_id'] . ']</span>';
**$url = $dom->createElement('span', $span);**
**$parentNode->replaceChild(
$url,
$title
);**
endif;
endforeach;
endif;
endfor;
endforeach;
Which I've had no success with.
I'm unable to figure out either the parent element, or the proper parameters to use in the replaceChild() method.
I've emboldened the main bits that I'm having trouble with, if that helps.
The important thing is to replace the existing p with a newly-created p that contains the child nodes.
Here's an example, using XPath to select the nodes to be replaced:
<?php
$html = <<<END
<div class="embed-left">
<h4>Bookmarks</h4>
<p>Something goes here.</p>
<p>Read more...</p>
</div>
END;
$doc = new DOMDocument;
$doc->loadHTML($html, LIBXML_HTML_NOIMPLIED);
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//div[h4[text()="Bookmarks"]]/p[1]');
foreach ($nodes as $oldnode) {
$note = 'TODO'; // build `$note` somewhere
$link = $doc->createElement('a');
$link->setAttribute('href', '#');
$link->textContent = sprintf('[#%s]', $note);
$span = $doc->createElement('span');
$span->appendChild($link);
$newnode = $doc->createElement('p');
$newnode->appendChild($span);
$oldnode->parentNode->replaceChild($newnode, $oldnode);
}
print $doc->saveHTML($doc->documentElement);

PHP function to retrieve all images and their attributes from given html string and save to array and returns array

I need to write a PHP function that retrieves all image attributes (src, alt, height, width) from a given HTML string and stores them in a result array. The function should return that array for further processing, such as saving to a database or creating thumbnails.
I have written the following function, but I am not satisfied with it because I am not able to extract attributes other than src:
$url = '<ul> <li> <img src="http://www.sayidaty.net/sites/default/files/imagecache/645xauto/01/04/2015/1427896207_sliderfashionshoot.jpg"/> </li> <li> <img src="http://www.sayidaty.net/sites/default/files/imagecache/645xauto/01/04/2015/1427896207_1.jpg"/> </li> <li> <img src="http://www.sayidaty.net/sites/default/files/imagecache/645xauto/01/04/2015/1427896207_2.jpg"/> </li> <li> <img src="http://www.sayidaty.net/sites/default/files/imagecache/645xauto/01/04/2015/1427896207_5.jpg"/> </li> <li> <img src="http://www.sayidaty.net/sites/default/files/imagecache/645xauto/01/04/2015/1427896207_4.jpg"/> </li> <li> <img src="http://www.sayidaty.net/sites/default/files/imagecache/645xauto/01/04/2015/1427896207_3.jpg"/> </li> <li> <img src="http://www.sayidaty.net/sites/default/files/imagecache/645xauto/01/04/2015/1427896207_6.jpg"/> </li> </ul> ';
function getItemImages($content)
{
$dom = new domDocument;
$dom->loadHTML($content);
$dom->preserveWhiteSpace = false;
$elements = $dom->getElementsByTagName('img');
if($elements->length >= 1) {
$url = array();
$title = array();
foreach($elements as $element) {
$url[] = $element->getAttribute('src');
$title[] = $element->getAttribute('title');
}
return ($url);
}
}
Here is some inefficient code with lots of explode() calls:
$a = explode("<img",$html);
for($j = 0; $j < count($a) -1; $j++)
{
$codes = explode(">",$a[$j+1]);
$codes = $codes[0];
$codes = explode("=",$codes);
for($i = 0; $i < count($codes) -1; $i++)
{
$name = explode(" ",$codes[$i]);
$value = explode("\"",$codes[$i+1]);
}
}
$field = '<img src="public/image/sun.png" title="This is such a sunny day" />';
$attrs = simplexml_load_string($field);
var_dump($attrs);
The object you get back will be
object(SimpleXMLElement)#1 (1) {
["@attributes"]=>
array(2) {
["src"]=>
string(20) "public/image/sun.png"
["title"]=>
string(24) "This is such a sunny day"
}
}
and it will contain all the tag's attributes.
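Note that simplexml_load_string() needs well-formed XML, so it fails on an <img> tag that is not self-closed. A sketch of the same idea with DOMDocument, which tolerates ordinary HTML, might look like this; the function name getImageAttributes and the sample markup are assumptions for illustration.

```php
<?php
// Hypothetical helper: returns one associative array of attributes per <img>.
function getImageAttributes(string $html): array
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html);

    $images = [];
    foreach ($dom->getElementsByTagName('img') as $img) {
        $attrs = [];
        // $img->attributes is a DOMNamedNodeMap of DOMAttr nodes.
        foreach ($img->attributes as $attr) {
            $attrs[$attr->name] = $attr->value;
        }
        $images[] = $attrs;
    }
    return $images;
}

$html = '<ul><li><img src="a.jpg" alt="First" width="645"/></li>'
      . '<li><img src="b.jpg" title="Second"></li></ul>';

print_r(getImageAttributes($html));
```

Each entry contains only the attributes that are actually present on that tag, so callers should check with isset() before reading, for example, $image['height'].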

HTML DOM parser to extract href from span sibling

My HTML file contains a date and a link, each in a <span> tag, inside a table.
Can anyone help me find the link for a particular date?
<table>
<tbody>
<tr class="c0">
<td class="c11">
<td class="c8">
<ul class="c2 lst-kix_h6z8amo254ry-0 start">
<li class="c1">
<span>1st Apr 2014 - </span>
<span class="c6"><a class="c4" href="/link.html">View</a>
</span>
</li>
</ul>
</td>
</tr>
</td>
</table>
I want to retrieve the link for a particular date.
My code looks like this:
include('simple_html_dom.php');
$html = file_get_html('link.html');
//store the links in array
foreach($html->find('span') as $value)
{
//echo $value->plaintext . '<br />';
$date = $value->plaintext;
if (strpos($date,$compare_text)) {
//$linkeachday = $value->find('span[class=c1]')->href;
//$day_url[] = $value->href;
//$day_url = Array("text" => $value->plaintext);
$day_url = Array("text" => $date, "link" =>$linkeachday);
//echo $value->next_sibling (a);
}
}
or
$spans = $html->find('table',0)->find('li')->find('span');
echo $spans;
$num = null;
foreach($spans as $span){
if($span->plaintext == $compare_text){
$next_span = $span->next_sibling();
$num = $next_span->plaintext;
echo($num);
break;
}
}
echo($num);
You were on the right path with your last example...
I modified it a bit to get the following which basically gets all spans, then test if they have the searched text, and if so, it displays the content of their next sibling if there is any (check the in code comments):
$input = <<<_DATA_
<table>
<tbody>
<tr class="c0">
<td class="c11">
<td class="c8">
<ul class="c2 lst-kix_h6z8amo254ry-0 start">
<li class="c1">
<span>1st Apr 2013 - </span>
<span>1st Apr 2014 - </span>
<span class="c6">
<a class="c4" href="/link.html">View</a>
</span>
<span>1st Apr 2015 - </span>
</li>
</ul>
</td>
</td>
</tr>
</tbody>
</table>
_DATA_;
// Create a DOM object
$html = new simple_html_dom();
// Load HTML from a string
$html->load($input);
// Searched value
$searchDate = '1st Apr 2014';
// Find all the spans that are direct children of li, which is a descendant of table
$spans = $html->find('table li > span');
// Loop through all the spans
foreach ($spans as $span) {
// If the span starts with the searched text && has a following sibling
if ( strpos($span->plaintext, $searchDate) === 0 && $sibling = $span->next_sibling()) {
// Then, print its text content
echo $sibling->plaintext; // or ->innertext for raw content
// And stop (if only one result is needed)
break;
}
}
OUTPUT
View
For the string comparison, you may also use a regex, which is more tolerant of leading whitespace and letter case.
So in the code above, you add this to build your pattern:
$pattern = sprintf('~^\s*%s~i', preg_quote($searchDate, '~'));
And then use preg_match to test the match:
if ( preg_match($pattern, $span->plaintext) && $sibling = $span->next_sibling()) {
I don't know about simple HTML DOM but the built in PHP DOM library should suffice.
Say you have your date in a string like this...
$date = '1st Apr 2014';
You can easily find the corresponding link using an XPath expression. For example
$doc = new DOMDocument();
$doc->loadHTMLFile('link.html');
$xp = new DOMXpath($doc);
$query = sprintf('//span[starts-with(., "%s")]/following-sibling::span/a', $date);
$links = $xp->query($query);
if ($links->length) {
$href = $links->item(0)->getAttribute('href');
}
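The XPath approach above can be tried without a file on disk. The sketch below loads the markup from a string and wraps the lookup in a function; the name findLinkForDate and the sample markup are assumptions, and the date is embedded in the expression as-is, so it must not contain a double quote.

```php
<?php
// Hypothetical helper: returns the href that follows the span starting
// with $date, or null when no such span or link exists.
function findLinkForDate(string $html, string $date): ?string
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html);
    $xp = new DOMXPath($doc);
    // Assumption: $date contains no double quote, since it is placed inside "...".
    $query = sprintf(
        '//span[starts-with(normalize-space(.), "%s")]/following-sibling::span[1]/a',
        $date
    );
    $links = $xp->query($query);
    return $links->length ? $links->item(0)->getAttribute('href') : null;
}

$html = '<ul><li>
<span>1st Apr 2013 - </span>
<span class="c6"><a href="/2013.html">View</a></span>
<span>1st Apr 2014 - </span>
<span class="c6"><a href="/link.html">View</a></span>
</li></ul>';

var_dump(findLinkForDate($html, '1st Apr 2014'));
var_dump(findLinkForDate($html, '2nd May 2014'));
```

Restricting the match to following-sibling::span[1] keeps the result tied to the span directly after the date, which matters once several dates share the same list item.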
include('simple_html_dom.php');
$html = file_get_html('link.html');
$compare_text = "1st Apr 2013";
$tds = $html->find('table',1)->find('span');
$num = 0;
foreach($tds as $td){
if (strpos($td->plaintext, $compare_text) !== false){
$next_td = $td->next_sibling();
foreach($next_td->find('a') as $elm) {
$num = $elm->href;
}
//$day_url = array($day => array(daylink => $day, text => $td->plaintext, link => $num));
echo $td->plaintext. "<br />";
echo $num . "<br />";
}
}

How to store text in a span tag into a variable using PHP?

I just started learning PHP. I have a string called "address" which contains HTML that looks like this:
<div class="address_row">
<span class="address1">239 House</span>
<span class="address2">Street South</span>
</div>
I am wondering how to store "239 House" in a variable called "address1".
This is inspired by this question: PHP Parse HTML code
You can do this using http://php.net/manual/en/class.domelement.php
Here's the sample code:
$str = '<div class="address_row">
<span class="address1">239 House</span>
<span class="address2">Street South</span>
</div>';
$DOM = new DOMDocument;
$DOM->loadHTML($str);
// all span elements
$items = $DOM->getElementsByTagName('span');
$span_list = array();
for($i = 0; $i < $items->length; $i++) {
$item = $items->item($i);
$span_list[$item->getAttribute('class')] = $item->nodeValue;
}
extract($span_list);
echo $address1; // 239 House
echo $address2; // Street South
You could try it like this; XPath can help in your project:
<?php
$str = '<div class="address_row">
<span class="address1">239 House</span>
<span class="address2">Street South</span>
</div>';
$a = new DOMDocument;
$a->loadHTML($str);
$b = new DomXPath($a);//create an XPath object for the DOM
//Get <span class=address1>
$address1 = trim($b->query('//*[@class="address1"]')->item(0)->nodeValue);
//Get <span class=address2>
$address2 = trim($b->query('//*[@class="address2"]')->item(0)->nodeValue);
//show variables
echo 'address1: ', $address1, '<br>';
echo 'address2: ', $address2;
Test online in ideone.
Links:
http://en.wikipedia.org/wiki/XPath#External_links
http://php.net/manual/en/class.domdocument.php
http://php.net/manual/en/domxpath.query.php
You can also try new SimpleXMLElement($xmlString) together with its xpath() method:
http://php.net/manual/en/simplexmlelement.xpath.php
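A minimal sketch of that SimpleXMLElement variant, assuming the fragment is well-formed XML (which the sample from the question happens to be):

```php
<?php
$str = '<div class="address_row">
<span class="address1">239 House</span>
<span class="address2">Street South</span>
</div>';

// SimpleXMLElement parses XML, so this only works because the fragment is well-formed.
$xml = new SimpleXMLElement($str);

// xpath() returns an array of SimpleXMLElement objects; cast to string to get the text.
$address1 = (string) $xml->xpath('//span[@class="address1"]')[0];
$address2 = (string) $xml->xpath('//span[@class="address2"]')[0];

echo 'address1: ', $address1, "\n";
echo 'address2: ', $address2, "\n";
```

For HTML that is not valid XML (unclosed tags, bare ampersands), the DOMDocument approaches shown earlier are the safer choice.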
