HTML parsing with php - php

Can anyone help me parse this part of an HTML page? I use PHP with the DOM extension (DOMDocument).
I would like to get the Klassifikation and Schlagwörter values into one PHP string.
How is this done?
Thanks
<tr style="display:table-row;">
<td id="TREFWOORD" class="onOffLink"></td>
<td class="rec_lable"><div>
<span>Schlagwörter</span><span>: </span>
</div></td>
<td class="rec_title"><div>
<span>*</span><span><a class="
link_gen
" href="MAT=/NOMAT=T/REL?PPN=106189719">Recht</a></span><span>
</span><span><a href="http://"
target=""><img src="http://"
alt="Subject" title="Subject" class="img_link"></a></span><span> / </span>
<span><a class="
link_gen
" href="MAT=/NOMAT=T/CMD?
ACT=SRCHA&IKT=5040&TRM=Wo%CC%88rterbuch">Wörterbuch</a></span>
</div></td>
</tr>
<tr style="display:table-row;">
<td></td>
<td class="rec_lable"><div><span>Klassifikation: </span></div></td>
<td class="rec_title"><div>
<span>Basisklassifikation: </span><span><a class="
link_gen
" target=""><img
src="http://" alt="Subject"
title="Subject" class="img_link"></a></span>
</div></td>
</tr>
I tried this without success:
<?php
$url='http://...';
$easycurlcmd=sprintf("curl '%s' -o ./libbvhtml.txt", $url);
printf("Execute: CURL1 ".$easycurlcmd."\n");
exec($easycurlcmd);
$html=file_get_contents('./libbvhtml.txt');
$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$rec_lable = $xpath->query("//tr/*[contains(@class, 'rec_lable')]/div/span[1]");
echo $rec_lable->item(0)->nodeValue; // Schlagwörter
echo $rec_lable->item(1)->nodeValue; // Klassifikation
The cause turned out to be that curl must be run with the redirect option (-L) so that it follows redirects.
Thanks to all.
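For anyone who prefers to stay inside PHP rather than shelling out, here is a minimal sketch of the same fix using the curl extension. The helper name fetchHtmlFollowingRedirects and the redirect limit of 5 are assumptions for illustration, not part of the original code.

```php
<?php
// Hypothetical helper: download a page and follow HTTP redirects,
// roughly what `curl -L` does on the command line.
function fetchHtmlFollowingRedirects(string $url): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow Location: headers
        CURLOPT_MAXREDIRS      => 5,    // assumed limit, stops redirect loops
    ]);
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('Download failed: ' . curl_error($ch));
    }
    return $body;
}
```

The returned string can be passed straight to DOMDocument::loadHTML, so the temporary file is no longer needed.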

You need to use DOMDocument::loadHTML to parse the HTML and DOMXPath::query to search the DOM.
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$rec_lable = $xpath->query("//tr/*[contains(@class, 'rec_lable')]/div/span[1]");
echo $rec_lable->item(0)->nodeValue; // Schlagwörter
echo $rec_lable->item(1)->nodeValue; // Klassifikation
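To get both labels and their values into one string, as the question asks, something like the following might work. This is only a sketch: the sample markup is a trimmed-down version of the question's HTML, the meta tag is added so that libxml reads the input as UTF-8, and the names $parts and $result are made up for illustration.

```php
<?php
// Trimmed-down sample markup modeled on the question (an assumption).
$html = <<<'HTML'
<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body>
<table>
<tr>
  <td class="rec_lable"><div><span>Schlagwörter</span><span>: </span></div></td>
  <td class="rec_title"><div><span>*</span><span><a href="#">Recht</a></span><span> / </span><span><a href="#">Wörterbuch</a></span></div></td>
</tr>
<tr>
  <td class="rec_lable"><div><span>Klassifikation: </span></div></td>
  <td class="rec_title"><div><span>Basisklassifikation: </span></div></td>
</tr>
</table>
</body></html>
HTML;

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$parts = [];
// Visit every row that has a label cell.
foreach ($xpath->query("//tr[td[contains(@class, 'rec_lable')]]") as $row) {
    // string(...) yields the text content of the first matching cell.
    $label = $xpath->evaluate("string(td[contains(@class, 'rec_lable')])", $row);
    $value = $xpath->evaluate("string(td[contains(@class, 'rec_title')])", $row);
    // Collapse runs of whitespace left over from the markup.
    $parts[] = trim(preg_replace('/\s+/u', ' ', $label . ' ' . $value));
}

$result = implode('; ', $parts);
echo $result, "\n";
```

Querying per row keeps each label next to its own value, which avoids relying on item(0) and item(1) lining up.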

Related

php preg_match with get between two div value

How can I get these values from the text below?
Desired result:
Year: 2012
KM: 69.000
Color: Blue
Price: 29.900
$string = '<div class="classifiedSubtitle">Opel > Astra > 1.4 T Sport</div>
</td>
<td class="searchResultsAttributeValue">
2012</td>
<td class="searchResultsAttributeValue">
69.000</td>
<td class="searchResultsAttributeValue">
Blue</td>
<td class="searchResultsPriceValue">
<div> $ 29.900 </div></td>
<td class="searchResultsDateValue">
<span>21 Nov</span>
<br/>
<span>2016</span>
</td>
<td class="searchResultsLocationValue">
USA<br/>Texas</td>';
preg_match('#</div></td><td class=\"searchResultsAttributeValue\">(.*?)<\/td>#si', $string, $val);
Regex is not the best tool for this. You should use the DOM instead.
$dom = new DOMDocument();
$dom->loadHTML($string);
$xPath = new DOMXpath($dom);
$tdValue = $xPath->query('//td[@class="searchResultsAttributeValue"]')->item(0)->nodeValue;
This way you get the first td element with the class searchResultsAttributeValue. Of course, you should check that the element actually exists and add any other validation you need, but that is the general approach.
Hope this helps.
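To collect all four values rather than only the first cell, the same idea can be extended with a loop. The following is a sketch under a few assumptions: the snippet is wrapped in a table row so it parses cleanly, the attribute cells are assumed to always appear in the order year, KM, color, and $values and $price are illustrative names.

```php
<?php
// Sample markup based on the question, wrapped in <table><tr> (an assumption).
$string = <<<'HTML'
<table><tr>
<td><div class="classifiedSubtitle">Opel &gt; Astra &gt; 1.4 T Sport</div></td>
<td class="searchResultsAttributeValue">
2012</td>
<td class="searchResultsAttributeValue">
69.000</td>
<td class="searchResultsAttributeValue">
Blue</td>
<td class="searchResultsPriceValue">
<div> $ 29.900 </div></td>
</tr></table>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($string);
$xpath = new DOMXPath($dom);

// Every attribute cell, trimmed of the surrounding line breaks.
$values = [];
foreach ($xpath->query('//td[@class="searchResultsAttributeValue"]') as $td) {
    $values[] = trim($td->nodeValue);
}

// The price sits in a cell with a different class; check that it exists.
$priceNode = $xpath->query('//td[@class="searchResultsPriceValue"]')->item(0);
$price = $priceNode !== null ? trim($priceNode->nodeValue) : null;

// Assumes exactly three attribute cells in this order.
[$year, $km, $color] = $values;
echo "Year: $year\nKM: $km\nColor: $color\nPrice: $price\n";
```

Checking item(0) against null before reading nodeValue is the "verify that the element exists" step mentioned above.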

PHP DOM Parser Get Specific text by Class While Looping

I am working with PHP Simple HTML DOM Parser and I want a simple solution for my question.
<tr>
<td class="one">1</td>
<td class="two">2</td>
<td class="three">3</td>
</tr>
<tr>
<td class="one">10</td>
<td class="two">20</td>
<td class="three">30</td>
</tr>...
My HTML looks similar to the above,
and I am looping over the td elements like this:
foreach ($sample->find("td") as $ele)
{
if($ele->class == "one")
echo "ONE = ".$ele->plaintext;
if($ele->class == "two")
echo "TWO= ".$ele->plaintext;
}
Is there a simple way to get the plaintext of a particular class without the if conditions? I don't want a shorthand if either.
I am expecting something like this:
$ele->class->one
Take a look at this:
<?php
$html = "
<table>
<tr>
<td class='one'>1</td>
<td class='two'>2</td>
<td class='three'>3</td>
</tr>
<tr>
<td class='one'>10</td>
<td class='two'>20</td>
<td class='three'>30</td>
</tr>
</table>
";
// Your class name
$classeName = 'one';
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
// Get the results
$results = $xpath->query("//*[@class='" . $classeName . "']");
for($i=0; $i < $results->length; $i++) {
echo $review = $results->item($i)->nodeValue . "<br>";
}
?>
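If the goal is something close to $ele->class->one, one option is to build an array per row keyed by class name. This is a sketch using DOMDocument rather than Simple HTML DOM; the $rows and $cells names are made up for illustration, and it assumes each cell carries a single class.

```php
<?php
$html = "<table>
<tr><td class='one'>1</td><td class='two'>2</td><td class='three'>3</td></tr>
<tr><td class='one'>10</td><td class='two'>20</td><td class='three'>30</td></tr>
</table>";

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// One associative array per row: class name => cell text.
$rows = [];
foreach ($xpath->query('//tr') as $tr) {
    $cells = [];
    foreach ($xpath->query('td[@class]', $tr) as $td) {
        $cells[$td->getAttribute('class')] = trim($td->nodeValue);
    }
    $rows[] = $cells;
}

// Values can now be read by class name without any if conditions.
foreach ($rows as $cells) {
    echo "ONE = {$cells['one']}, TWO = {$cells['two']}\n";
}
```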

Extract text and image src with PHP DomDocument

I'm trying to extract the img src and the text of the td elements inside the div with id="Ajax", but my code does not extract the image. It just ignores the img src. How can I also extract the img src and add it to the array?
HTML:
<div id="Ajax">
<table cellpadding="1" cellspacing="0">
<tbody>
<tr id="comment_1">
<td>20:28</td>
<td class="color">
</td>
<td class="last_comment">
Text<br/>
</td>
</tr>
<tr id="comment_2">
<td>20:25</td>
<td class="color">
</td>
<td class="comment">
Text 2<br/>
</td>
</tr>
<tr id="comment_3">
<td>20:24</td>
<td class="color">
<img src="http://url.ext/img/image02.jpeg" alt="img alt 2"/>
</td>
<td class="comment">
Text 3<br/>
</td>
</tr>
<tr id="comment_4">
<td>20:23</td>
<td class="color">
<img src="http://url.ext/img/image01.jpeg" alt="img alt"/>
</td>
<td class="comment">
Text 4<br/>
</td>
</tr>
</tbody>
</table>
</div>
PHP:
$html = file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$contentArray = array();
$doc = $doc->getElementById('Ajax');
$text = $doc->getElementsByTagName ('td');
foreach ($text as $t)
{
$contentArray[] = $t->nodeValue;
}
print_r ($contentArray);
Thanks.
You're using $t->nodeValue to obtain the content of a node. An <img> tag is empty, so there is nothing to return. The easiest way to get the src attribute is XPath.
Example:
$html = file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$expression = "//div[@id='Ajax']//tr";
$nodes = $xpath->query($expression); // Get all rows (tr) in the div
$imgSrcExpression = ".//img/@src";
$firstTdExpression = "./td[1]";
foreach($nodes as $node){ // loop over each row
// select the first td node
$tdNodes = $xpath->query($firstTdExpression ,$node);
$tdVal = null;
if($tdNodes->length > 0){
$tdVal = $tdNodes->item(0)->nodeValue;
}
// select the src attribute of the img node
$imgNodes = $xpath->query($imgSrcExpression,$node);
$imgVal = null;
if($imgNodes ->length > 0){
$imgVal = $imgNodes->item(0)->nodeValue;
}
}
(Caution: Code may contain typos)
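The loop above only reads the values; to end up with the array the question asks for, each row could be stored as it is visited. Here is a self-contained sketch with a shortened copy of the question's markup. The array keys time, img and text are assumptions, and it assumes the comment text is always in the third cell.

```php
<?php
// Shortened sample of the question's markup (an assumption).
$html = <<<'HTML'
<div id="Ajax"><table><tbody>
<tr id="comment_1"><td>20:28</td><td class="color"></td><td class="last_comment">Text<br/></td></tr>
<tr id="comment_3"><td>20:24</td><td class="color"><img src="http://url.ext/img/image02.jpeg" alt="img alt 2"/></td><td class="comment">Text 3<br/></td></tr>
</tbody></table></div>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$contentArray = [];
foreach ($xpath->query("//div[@id='Ajax']//tr") as $row) {
    // The src attribute node, or null when the row has no image.
    $img = $xpath->query('.//img/@src', $row)->item(0);
    $contentArray[] = [
        'time' => trim($xpath->evaluate('string(td[1])', $row)),
        'img'  => $img !== null ? $img->nodeValue : null,
        'text' => trim($xpath->evaluate('string(td[3])', $row)),
    ];
}
print_r($contentArray);
```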

Extract all <a> from table with DOMdocument

I want to extract the multiple <a> tags from this html markup:
<table align="center" width="100%" cellpadding="0" cellspacing="0"><tr>
<td align="left" width="60%" valign="top">
<font size="5" color="#939390">title</font><br><font size="5" color="#939390">LINKS</font>
<a style="color:#000000;" title="title Link 1" href="/page/242808/1/44643.html"> <b style="background:#ff6633">Link 1</b></a>
<a style="color:#000000;" title="title Link 2" href="/page/242808/2/erewe.html"> <b style="background:#ff6633">Link 2</b></a>
</td>
</tr>
</table>
Here is my code:
$doc = new DOMDocument(); // create DOMDocument
libxml_use_internal_errors(true);
$doc->validateOnParse = true;
$doc->preserveWhiteSpace = false;
$doc->loadHTML($html); // load HTML you can add $html
$selector = new DOMXPath($doc);
$a = $selector->query('//table[1]//a')->item(0);
echo $doc->saveHTML($a);
This gets the first <a>, but what I want is to get all the <a> tags in the document.
To get more than one, you'll need to loop through the results instead of just printing the first one:
$a = $selector->query('//table[1]//a');
foreach($a as $current) {
echo $doc->saveHTML($current);
}
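If only the link targets and labels are needed rather than the full markup, the same loop can read them from each node. A small sketch, with a simplified copy of the question's table; the $links array and its keys are illustrative names.

```php
<?php
// Simplified copy of the question's markup (an assumption).
$html = <<<'HTML'
<table><tr><td>
<a title="title Link 1" href="/page/242808/1/44643.html"> <b>Link 1</b></a>
<a title="title Link 2" href="/page/242808/2/erewe.html"> <b>Link 2</b></a>
</td></tr></table>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$selector = new DOMXPath($doc);

$links = [];
foreach ($selector->query('//table[1]//a') as $a) {
    $links[] = [
        'href' => $a->getAttribute('href'), // attribute value
        'text' => trim($a->textContent),    // visible label without tags
    ];
}
print_r($links);
```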

DOMDocument Query

Hi, I am very new to DOMDocument. I'm still learning and want to use XPath queries with it. The HTML sometimes changes, so preg_match is not a good idea. I need to get some values from an HTML file. This is the part of the HTML I want to read. I would be happy if you could help me.
<?php
$doc = new DOMDocument();
@$doc->loadHTML('<table cellspacing="0" cellpadding="0" align="center" class="results">
<tr class="header" bgcolor="#0000FF">
<td>
</td>
<td>Name/AKAs</td>
<td>Age</td>
<td>Location</td>
<td>Possible Relatives</td>
</tr>
<tr>
<td>1.</td>
<td>
<a class="LN" href=""><b>Iron, Man E</b></a>
</td>
<td align="center">54</td>
<td>
Canada, AK<br />
California, AK<br />
</td>
<td>
</td>
<td>
View Details
</td>
</tr>
<tr><td>2.</td>
<td>
<a class="LN" href=""><b>Bat, Man E</b></a></td>
<td align="center">26</td>
<td>
Gotham, IA
<br /></td>
<td>
View Details</td></tr>
</table>');
$xpath = new DOMXPath($doc);
$xquery = '//a[@class="LN"]';
$links = $xpath->query($xquery);
foreach ($links as $el) {
echo strip_tags($doc->saveHTML($el)).'<br/>';
}
?>
How do I get the following output? I can only get Iron, Man E and Bat, Man E.
Iron, Man E | 54 | Canada, AK;California, AK
Bat, Man E | 26 | Gotham, IA
My answer is not about DOMDocument queries, but it can solve your problem easily.
There is a library named Simple HTML DOM that you can do a lot with.
Example:
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
The library's full documentation shows what else it can do.
Try this,
$xquery = '//a'; // you will get all anchor tags now
$links = $xpath->query($xquery);
foreach ($links as $el) {
echo strip_tags($doc->saveHTML($el)).'<br/>';
}
Try this to get each row on a single line:
$xpath = new DOMXPath($doc);
$xquery = '//tr[td[a]]';
$links = $xpath->query($xquery);
foreach ($links as $el) {
echo strip_tags($doc->saveHTML($el)).'<br/>';
}
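To get exactly the pipe-separated format from the question, each cell can be read separately instead of stripping tags from the whole row. This is a sketch that assumes the column order shown in the question (name in the second cell, age in the third, locations in the fourth); the sample table is shortened and $lines is an illustrative name.

```php
<?php
// Shortened copy of the question's table (an assumption).
$html = <<<'HTML'
<table class="results">
<tr class="header"><td></td><td>Name/AKAs</td><td>Age</td><td>Location</td></tr>
<tr><td>1.</td><td><a class="LN" href=""><b>Iron, Man E</b></a></td><td align="center">54</td><td>
Canada, AK<br />
California, AK<br />
</td></tr>
<tr><td>2.</td><td><a class="LN" href=""><b>Bat, Man E</b></a></td><td align="center">26</td><td>
Gotham, IA
<br /></td></tr>
</table>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$lines = [];
// Only rows that contain a name link, which skips the header row.
foreach ($xpath->query('//tr[td/a[@class="LN"]]') as $row) {
    $name = trim($xpath->evaluate('string(td/a[@class="LN"])', $row));
    $age  = trim($xpath->evaluate('string(td[3])', $row));

    // Each location is its own text node between the <br> tags.
    $locations = [];
    foreach ($xpath->query('td[4]/text()', $row) as $textNode) {
        $location = trim($textNode->nodeValue);
        if ($location !== '') {
            $locations[] = $location;
        }
    }
    $lines[] = $name . ' | ' . $age . ' | ' . implode(';', $locations);
}
echo implode("\n", $lines), "\n";
```

Reading the text nodes of the location cell one by one is what makes it possible to join multiple locations with a semicolon.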
