I am trying to write a PHP parser and I don't know why my code returns an empty array. I am using PHP Simple HTML DOM. I know my code isn't perfect, but it's only for testing.
I would appreciate any help.
public function getData() {
// get each URL loaded from urls.txt
foreach ($this->list_url as $i => $url) {
// create a DOM object from the HTML page
$this->html = file_get_html($url);
// find all elements with class="name", because every product has a name
$products = $this->html->find(".name");
foreach ($products as $number => $product) {
// get the href attribute of the product link
$href = $this->html->find("div.name a", $number)->attr['href'];
// create a DOM object from the product page
$html = file_get_html($href);
if($html && is_object($html) && isset($html->nodes)){
echo "TRUE - all goodly";
} else{
echo "FALSE - all badly";
}
// get all elements with class="description"
// $nodes is empty. Why? The page has content and a div.description.
$nodes = $html->find('.description');
if (count($nodes) > 0) {
$needle = "Производитель:";
foreach ($nodes as $short_description) {
if (stripos($short_description->plaintext, $needle) !== FALSE) {
echo "TRUE";
$this->data[] = $short_description->plaintext;
} else {
echo "FALSE";
}
}
} else {
$this->data[] = '';
}
$html->clear();
unset($html);
}
$this->html->clear();
unset($html);
}
return $this->data;
}
Hi. Inspect the element in your browser's developer tools, choose Copy > Copy selector, and pass that selector to the find method to get the object.
I thought I would write a simple function to visit all the nodes in a DOM tree. I wrote it, gave it a not-too-complex bit of XML to work on, but when I ran it I got only the top-level (DOMDocument) node.
Note that I am using PHP's Generator syntax:
http://php.net/manual/en/language.generators.syntax.php
Here's my function:
function DOMIterate($node)
{
yield $node;
if ($node->hasChildNodes())
{
foreach ($node->childNodes as $subnode) {
// if($subnode != null) {
DOMIterate($subnode);
// }
}
}
}
And the testcase code that is supposed to print the results:
$doc = new DOMDocument();
$doc->loadXML($input);
foreach (DOMIterate($doc) as $node) {
$type = $node->nodeType;
if ($type == XML_ELEMENT_NODE) {
$tag = $node->tagName;
echo "$tag\n";
}
else if ($type == XML_DOCUMENT_NODE) {
echo "document\n";
}
else if ($type == XML_TEXT_NODE) {
$text = $node->wholeText;
echo "text: $text\n";
} else {
$linenum = $node->getLineNo();
echo "unknown node type: $type at input line $linenum\n";
}
}
The input XML is the first 18 lines of
https://www.w3schools.com/xml/plant_catalog.xml
plus a closing
If you're using PHP7, you can try this:
<?php
$string = <<<EOS
<div level="1">
<div level="2">
<p level="3"></p>
<p level="3"></p>
</div>
<div level="2">
<span level="3"></span>
</div>
</div>
EOS;
$document = new DOMDocument();
$document->loadXML($string);
function DOMIterate($node)
{
yield $node;
if ($node->childNodes) {
foreach ($node->childNodes as $childNode) {
yield from DOMIterate($childNode);
}
}
}
foreach (DOMIterate($document) as $node) {
echo $node->nodeName . PHP_EOL;
}
Here's a working example - http://sandbox.onlinephpfunctions.com/code/ab4781870f8f988207da78b20093b00ea2e8023b
Keep in mind that you'll also get the text nodes that are contained within the tags.
Using yield in a function called from the generator doesn't return the value to the caller of the original generator. You need to use yield from to propagate the values back.
function DOMIterate($node)
{
yield $node;
if ($node->hasChildNodes())
{
foreach ($node->childNodes as $subnode) {
// if($subnode != null) {
yield from DOMIterate($subnode);
// }
}
}
}
This requires PHP 7. If you're using an earlier version, see Recursive generators in PHP
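On PHP 5.5 or 5.6, where generators exist but yield from does not, one common workaround is to loop over the inner generator and re-yield each value. A minimal sketch of that idea (the name DOMIterateLegacy and the sample XML are made up for illustration):

```php
<?php
// Recursive generator without "yield from": re-yield every value
// produced by the recursive call so it reaches the outer caller.
function DOMIterateLegacy($node)
{
    yield $node;
    if ($node->hasChildNodes()) {
        foreach ($node->childNodes as $child) {
            foreach (DOMIterateLegacy($child) as $descendant) {
                yield $descendant;
            }
        }
    }
}

$doc = new DOMDocument();
$doc->loadXML('<a><b><c/></b><d/></a>');

$names = array();
foreach (DOMIterateLegacy($doc) as $node) {
    $names[] = $node->nodeName;
}
echo implode(' ', $names), PHP_EOL; // prints: #document a b c d
```

Each level of recursion adds one more re-yield loop, so this is slower than yield from on deep trees, but the traversal order is the same.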
if (!is_null($elements)) {
$embeds = array();
foreach ($elements as $element) {
if (trim(strip_tags($element->innertext)) == $episode_term) {
$html2 = file_get_html($element->href);
$elements2 = $html2->find('#streamlinks .sideleft a');
if (!is_null($elements2)) {
foreach ($elements2 as $element) {
$html3 = file_get_html($element->href);
$iframe_element = $html3->find('.frame', 0);
if (!is_null($iframe_element)) {
$embed = $misc->buildEmbed($iframe_element->src);
if ($embed) {
$embeds[] = array(
"embed" => $embed,
"link" => $iframe_element->src,
"language" => "ENG",
);
}
}
}
}
}
}
return $embeds;
}
PHP Fatal error: Call to a member function find() on a non-object in
$elements2 = $html2->find('#streamlinks .sideleft a');
So it's confusing: what is causing this error to appear in my error log?
I'd try to output $element->href before you call file_get_html.
If file_get_html can't fetch the page, it returns false instead of an object, so $html2 is not an object and you can't call find on it.
Besides that, you could check whether $html2 is set after the file_get_html call and output an error if it isn't. I usually use something like this:
if($html2 == false || $html2 == NULL){
// no html found
}else{
// html found
}
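The same check can be wrapped in a small helper so that every find call is guarded. This is only a sketch: findOrEmpty is a hypothetical name, and FakeDom is a stand-in for the object file_get_html returns, so that the example runs without the library.

```php
<?php
// Hypothetical helper: return the matches, or an empty array when the
// page could not be loaded and we were handed false instead of an object.
function findOrEmpty($dom, $selector)
{
    if (!is_object($dom) || !method_exists($dom, 'find')) {
        return array();
    }
    $found = $dom->find($selector);
    return is_array($found) ? $found : array();
}

// Stand-in for a parsed page (not part of Simple HTML DOM).
class FakeDom
{
    public function find($selector)
    {
        return array('match for ' . $selector);
    }
}

$failed = findOrEmpty(false, '#streamlinks .sideleft a'); // load failed
$loaded = findOrEmpty(new FakeDom(), '.frame');           // load worked

echo count($failed), ' ', count($loaded), PHP_EOL; // prints: 0 1
```

With a helper like this, the inner foreach loops simply run zero times for a page that failed to download, instead of stopping the script with a fatal error.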
I have code that tries to extract the event SKU from a Robot Events page; here is an example. The code doesn't find any SKU on the page. The SKU is on line 411 of the page source, in a div with the class "product-sku". My code doesn't even find the div on the page and just downloads all the events. Here is my code:
<?php
require('simple_html_dom.php');
$html = new simple_html_dom();
if(!$events)
{
echo mysqli_error($con);
}
while($event = mysqli_fetch_row($events))
{
$htmldown = file_get_html($event[4]);
$html->load($htmldown);
echo "Downloaded";
foreach ($html->find('div[class=product-sku]') as $row) {
$sku = $row->plaintext;
echo $sku;
}
}
?>
Can anyone help me fix my code?
This code uses PHP's DOMDocument class. It works for the sample HTML below. Please try it.
// new dom object
$dom = new DOMDocument();
// HTML string
$html_string = '<html>
<body>
<div class="product-sku1" name="div_name">This is the div content product-sku1</div>
<div class="product-sku2" name="div_name">This is the div content product-sku2</div>
<div class="product-sku" name="div_name">This is the div content product-sku</div>
</body>
</html>';
// load the HTML (loadHTML() returns true or false, not the document)
$html = $dom->loadHTML($html_string);
// keep whitespace-only text nodes (this is the default)
$dom->preserveWhiteSpace = TRUE;
// get all div elements by tag name
$divs = $dom->getElementsByTagName('div');
// loop over all the divs
foreach ($divs as $div) {
if ($div->hasAttributes()) {
foreach ($div->attributes as $attribute){
if($attribute->name === 'class' && $attribute->value == 'product-sku'){
// print the div's class name and content
echo 'DIV Class Name: '.$attribute->value.PHP_EOL;
echo 'DIV Content: '.$div->nodeValue.PHP_EOL;
}
}
}
}
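As an alternative to looping over every attribute by hand, DOMXPath can select the divs by class directly. The sketch below uses made-up sample markup and a made-up SKU value; the concat/normalize-space expression is the usual XPath 1.0 way to match one class name among several.

```php
<?php
// Sample markup (made up for illustration); real pages will differ.
$html = '<html><body>'
      . '<div class="product-sku1">other</div>'
      . '<div class="box product-sku">RE-0001</div>'
      . '</body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Match divs whose class list contains exactly the token "product-sku".
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " product-sku ")]';

$skus = array();
foreach ($xpath->query($query) as $div) {
    $skus[] = trim($div->textContent);
}
print_r($skus);
```

Unlike the comparison $attribute->value == 'product-sku', this also matches a div that carries additional classes, and it does not match "product-sku1".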
I would use a regex (regular expression) to accomplish pulling skus out.
The regex:
preg_match('~<div class="product-sku"><b>Event Code:</b>(.*?)</div>~',$html,$matches);
See php regex docs.
New code:
<?php
if(!$events)
{
echo mysqli_error($con);
}
while($event = mysqli_fetch_row($events))
{
$htmldown = curl_init($event[4]);
curl_setopt($htmldown, CURLOPT_RETURNTRANSFER, true);
$html=curl_exec($htmldown);
curl_close($htmldown);
echo "Downloaded";
preg_match('~<div class="product-sku"><b>Event Code:</b>(.*?)</div>~',$html,$matches);
foreach ($matches as $row) {
echo $row;
}
}
?>
And actually in this case (using that webpage) being that there is only one sku...
instead of:
foreach ($matches as $row) {
echo $row;
}
You could just use: echo $matches[1]; (The index is 1 because $matches[0] holds the full text matched by the whole pattern, while $matches[1] holds only the first capture group, which here is the SKU.)
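To make the difference between $matches[0] and $matches[1] concrete, here is a small sketch that runs the same pattern against a hard-coded sample string. The sample markup and the SKU value are invented; the real page may be laid out differently.

```php
<?php
// Invented sample; the real page markup may differ.
$html = '<div class="product-sku"><b>Event Code:</b> RE-0001</div>';
$pattern = '~<div class="product-sku"><b>Event Code:</b>(.*?)</div>~';

// preg_match() returns 1 on a match, 0 on no match, false on error.
if (preg_match($pattern, $html, $matches) === 1) {
    $whole = $matches[0];     // everything the pattern matched, tags included
    $sku = trim($matches[1]); // only the capture group, without padding
    echo $sku, PHP_EOL;       // prints: RE-0001
} else {
    echo 'no SKU found', PHP_EOL;
}
```

Checking the return value first avoids an undefined-index notice on pages where the div is missing.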
try to use
require('simple_html_dom.php');
$html = new simple_html_dom();
if(!$events)
{
echo mysqli_error($con);
}
while($event = mysqli_fetch_row($events))
{
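$htmldown = file_get_html($event[4]); // assuming $event[4] is a URL, as in the question; str_get_html() would parse the URL text itself as HTML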
echo "Downloaded";
foreach ($htmldown->find('div[class=product-sku]') as $row) {
$sku = $row->plaintext;
echo $sku;
}
}
And if the class "product-sku" is only used on divs, you can use
$htmldown->find('.product-sku')
I want to get the contents of the body tag, split them into words, and put the words into an array. I am using PHP.
This is what I have done:
$content=file_get_contents($_REQUEST['url']);
$content=html_entity_decode($content);
$content = preg_replace("/&#?[a-z0-9]+;/i"," ",$content);
$dom = new DOMDocument;
#$dom->loadHTML($content);
$tags=$dom->getElementsByTagName('body');
foreach($tags as $h)
{
echo "<li>".$h->tagName;
getChilds2($h);
function getChilds2($node)
{
if($node->hasChildNodes())
{
foreach($node->childNodes as $c)
{
if($c->nodeType==3)
{
$nodeValue=$c->nodeValue;
$words=feature_node($c,$nodeValue,true);
if($words!=false)
{
$_ENV["words"][]=$words;
}
else if($c->tagName!="")
{
getChilds2($c);
}
}
}
}
else
{
return;
}
}
function feature_node($node,$content,$display)
{
if(strlen($content)<=0)
{
return;
}
$content=strtolower($content);
$content=mb_convert_encoding($content, 'UTF-8',
mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
$content= drop_script_tags($content);
$temp=$content;
$content=strip_punctuation($content);
$content=strip_symbols($content);
$content=strip_numbers($content);
$words_after_noise_removal=mb_split( ' +',$content);
$words_after_stop_words_removal=remove_stop_words($words_after_noise_removal);
if(count($words_after_stop_words_removal)==0)
return(false);
$i=0;
foreach($words_after_stop_words_removal as $w)
{
$words['word'][$i]=$w;
$i++;
}
for($i=0;$i<sizeof($words['word']);$i++)
{
$words['stemmed'][$i]= PorterStemmer::Stem($words['word'][$i],true)."<br/>";
}
return($words);
}
Here I have used some functions such as strip_punctuation, strip_symbols, strip_numbers, remove_stop_words and PorterStemmer to preprocess the page, and they work fine. But the contents never end up in the array, and print_r() or echo prints nothing. Can anyone help?
You don't have to iterate over the nodes.
$tags = $dom->getElementsByTagName('body');
will give you just one result in the DOMNodeList. So all you need to do to get the text is
$plainText = $tags->item(0)->nodeValue;
or
$plainText = $tags->item(0)->textContent;
To get the separate words into an array, you can then use str_word_count ("Return information about words used in a string") on the resulting $plainText.
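Put together, the two steps might look like the sketch below. The sample page is made up, and strtolower is added only to mirror the lower-casing in the question.

```php
<?php
// Made-up sample page for illustration.
$page = '<html><body><p>Hello world, parsing is fun.</p></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($page);
libxml_clear_errors();

// getElementsByTagName('body') yields a single node; take its text.
$body = $dom->getElementsByTagName('body')->item(0);
$plainText = $body->textContent;

// Format 1 makes str_word_count() return the words as an array.
$words = str_word_count(strtolower($plainText), 1);
print_r($words);
```

Two caveats: textContent joins the text of adjacent elements without inserting spaces, so words from neighbouring tags can run together, and str_word_count only treats letters, apostrophes and hyphens as word characters by default, so digits are dropped and non-ASCII letters depend on the locale.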
Is it possible to use a foreach loop to scrape multiple URLs from an array? I've been trying, but for some reason it only pulls from the first URL in the array and then shows the results.
include_once('../../simple_html_dom.php');
$link = array (
'http://www.amazon.com/dp/B0038JDEOO/',
'http://www.amazon.com/dp/B0038JDEM6/',
'http://www.amazon.com/dp/B004CYX17O/'
);
foreach ($link as $links) {
function scraping_IMDB($links) {
// create HTML DOM
$html = file_get_html($links);
$values = array();
foreach($html->find('input') as $element) {
$values[$element->id=='ASIN'] = $element->value; }
// get title
$ret['ASIN'] = end($values);
// get rating
$ret['Name'] = $html->find('h1[class="parseasinTitle"]', 0)->innertext;
$ret['Retail'] =$html->find('b[class="priceLarge"]', 0)->innertext;
// clean up memory
//$html->clear();
// unset($html);
return $ret;
}
// -----------------------------------------------------------------------------
// test it!
$ret = scraping_IMDB($links);
foreach($ret as $k=>$v)
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
Here is the code since the comment part didn't work. :) It's very dirty because I just edited one of the examples to play with it to see if I could get it to do what I wanted.
include_once('../../simple_html_dom.php');
function scraping_IMDB($links) {
// create HTML DOM
$html = file_get_html($links);
// What is this spaghetti code good for?
/*
$values = array();
foreach($html->find('input') as $element) {
$values[$element->id=='ASIN'] = $element->value;
}
// get title
$ret['ASIN'] = end($values);
*/
foreach($html->find('input') as $element) {
if($element->id == 'ASIN') {
$ret['ASIN'] = $element->value;
}
}
// Or you could use the following instead of the whole foreach loop above:
//
// $ret['ASIN'] = $html->find('input[id="ASIN"]', 0)->value;
//
// The 0 returns the first match. I had a look at Amazon's source
// code, and it contains 2 HTML tags with id='ASIN'. If the page followed
// the HTML specification, there would be only ONE element with a given id.
// get rating
$ret['Name'] = $html->find('h1[class="parseasinTitle"]', 0)->innertext;
$ret['Retail'] = $html->find('b[class="priceLarge"]', 0)->innertext;
// clean up memory
//$html->clear();
// unset($html);
return $ret;
}
// -----------------------------------------------------------------------------
// test it!
$links = array (
'http://www.amazon.com/dp/B0038JDEOO/',
'http://www.amazon.com/dp/B0038JDEM6/',
'http://www.amazon.com/dp/B004CYX17O/'
);
foreach ($links as $link) {
$ret = scraping_IMDB($link);
foreach($ret as $k=>$v) {
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
}
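This should do the trick.
I have renamed the array to 'links' instead of 'link'. It's an array of links, each element being one link, so foreach($link as $links) read backwards, and I changed it to foreach($links as $link). I also moved the function definition out of the loop: declaring scraping_IMDB inside the foreach makes PHP try to declare it again on the second iteration, which stops the script with a "Cannot redeclare" fatal error, and that is why only the first URL was scraped.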
I really need to ask this question, as the answer will clear up many more questions for everyone who reads this thread. What if you used articles, as the examples on the Simple HTML DOM site do?
$ret['Name'] = $html->find('h1[class="parseasinTitle"]', 0)->innertext;
$ret['Retail'] = $html->find('b[class="priceLarge"]', 0)->innertext;
return $ret;
}
$links = array (
'http://www.amazon.com/dp/B0038JDEOO/',
'http://www.amazon.com/dp/B0038JDEM6/',
'http://www.amazon.com/dp/B004CYX17O/'
);
foreach ($links as $link) {
$ret = scraping_IMDB($link);
foreach($ret as $k=>$v) {
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
}
What if it's $articles instead?
$articles[] = $item;
}
//print_r($articles);
$links = array (
'http://link1.com',
'http://link2.com',
'http://link3.com'
);
What would this part look like?
foreach ($links as $link) {
$ret = scraping_IMDB($link);
foreach($ret as $k=>$v) {
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
}
I've seen this multiple-links question all over Stack Overflow for the past 2 years, and I still cannot figure it out. It would be great to get a basic handle on how to do it in the style of the Simple HTML DOM examples.
Thanks.
This is my first time posting, so I'm sure I broke a bunch of rules and didn't format the code section right. I just really had to ask this question.
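One possible shape for the $articles variant is sketched below: a function builds and returns the $articles array for a single page, and the loop over $links merges the results. Everything here is an assumption made for illustration. scrapeArticles, the h2 selector and the fake fetcher are invented, and the built-in DOMDocument is swapped in for Simple HTML DOM so that the sketch runs on its own.

```php
<?php
// Hypothetical: collect the items of ONE page into an $articles array.
// $fetch is any callable that returns the page markup for a URL.
function scrapeArticles($url, callable $fetch)
{
    $articles = array();
    $markup = $fetch($url);
    if ($markup === false || $markup === '') {
        return $articles; // page could not be loaded
    }

    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($markup);
    libxml_clear_errors();

    // Assumed layout: every article title sits in an <h2>.
    foreach ($dom->getElementsByTagName('h2') as $heading) {
        $item = array('url' => $url, 'title' => trim($heading->textContent));
        $articles[] = $item;
    }
    return $articles;
}

// Fake fetcher so the sketch runs offline; real code could pass 'file_get_contents'.
$fakeFetch = function ($url) {
    return '<html><body><h2>Title from ' . $url . '</h2></body></html>';
};

$links = array('http://link1.com', 'http://link2.com', 'http://link3.com');

$all = array();
foreach ($links as $link) {
    foreach (scrapeArticles($link, $fakeFetch) as $article) {
        $all[] = $article;
    }
}
print_r($all);
```

The point carried over from the answer above is the structure: define the function once, outside the loop, have it return its array, and do the looping over links in the caller.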