Scrape HTML & count children using Simple HTML DOM - php

I'm trying to collect data from a website and want to count the number of elements inside another element. Targeting different DOM elements works fine, but for some reason the $count variable in the example below stays at "0". I'm probably missing something really silly, but I can't seem to find it.
The HTML on the website is as follows:
<div id="list_options">
<div class="list_mtgdef_option pointer">
<div class="list_mtgdef_foildesc shadow">
</div>
<div class="list_mtgdef_stock tooltip">
<div class="list_mtgdef_stock_left">
<span class="foil008469_1 block "></span>
<span class="foil008469_2 block transparency_25"></span>
</div>
<div class="list_mtgdef_stock_right">
<span class="008469_8 block "></span>
<span class="008469_7 block "></span>
<span class="008469_6 block "></span>
<span class="008469_5 block "></span>
<span class="008469_4 block "></span>
<span class="008469_3 block "></span>
<span class="008469_2 block "></span>
<span class="008469_1 block "></span>
</div>
</div>
</div>
</div>
And this is the PHP I'm using:
$array = array();
foreach($html->find('#list_options .list_mtgdef_option') as $element) {
$count = 0;
foreach($element->find('.list_mtgdef_stock', 0)->childNodes(1)->childNodes as $node) {
if(!($node instanceof \DomText))
$count++;
}
$row = array(
'stock' => strval($count)
);
array_push($array, $row);
}
echo json_encode($array);

You can just do:
count($element->find('.list_mtgdef_stock > *[2] > *'))
//=>8

Silly indeed: the last childNodes is missing its parentheses ():
$element->find('.list_mtgdef_stock', 0)->childNodes(1)->childNodes()

I solved it by counting the element's children:
$element = $element->find('.list_mtgdef_stock', 0);
count($element->children());
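For comparison, here is a minimal sketch of the same count done with PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM. The markup is a shortened, hypothetical copy of the HTML in the question (three spans instead of eight), and the XPath expressions are one possible way to address it, not the only one. Because `/*` selects element children only, no text-node filtering is needed.

```php
<?php
// Shortened stand-in for the page in the question (assumption: same class names).
$html = <<<'HTML'
<div id="list_options">
  <div class="list_mtgdef_option pointer">
    <div class="list_mtgdef_stock tooltip">
      <div class="list_mtgdef_stock_left">
        <span class="foil008469_1 block"></span>
        <span class="foil008469_2 block transparency_25"></span>
      </div>
      <div class="list_mtgdef_stock_right">
        <span class="008469_3 block"></span>
        <span class="008469_2 block"></span>
        <span class="008469_1 block"></span>
      </div>
    </div>
  </div>
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$rows = [];
$options = $xpath->query('//*[@id="list_options"]/div[contains(@class, "list_mtgdef_option")]');
foreach ($options as $option) {
    // count() inside evaluate() returns a float, so cast it to int.
    $count = (int) $xpath->evaluate(
        'count(.//div[contains(@class, "list_mtgdef_stock_right")]/*)',
        $option
    );
    $rows[] = ['stock' => (string) $count];
}

echo json_encode($rows);
```

The count is evaluated relative to each option element, so every option gets its own row in the JSON output.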


Xpath to select based on attribute value not working in PHP as expected

The code works well if the expression is "//div" or "//html". The moment I use "//*[@class='hit']", "//div[@class='hit']" or '//*[@class="hit"]', it does not select the element I need.
This is the code:
$xpath = "//div";
$data = file_get_contents("https://www.hachi.tech/searching?q=&hPP=144&idx=instant_product_price_asc&p=0&is_v=1");
d($data);
//d() is a custom function that works like var_dump
$doc = new DOMDocument();
$doc->loadHTML($data);
$xpatho = new DOMXpath($doc);
$elementsn = $xpatho->query($xpath);
d($xpath);
d($elementsn->length);
//d() is a custom function that works like var_dump
When I dumped $data, I got this:
https://justpaste.it/38v46
(the text is very long so I pasted in a separate link).
There is clearly a div element with class="hit" in the html (you can do a search). Search for:
<div class="hit" style="min-height:258px;">
I can only think of malformed HTML, in which case what can I do in general to check (and fix!) the HTML first before passing it for selection?
You can't access that div directly because it is not in the DOM; it is inside a script element and thus treated as text. It gets added to the DOM once the page loads, via this code:
a.addWidget(instantsearch.widgets.hits({container:"#hits",hitsPerPage: showperpage,templates:{item: document.querySelector("#hit-template").innerHTML ,empty: document.querySelector("#no-results-template").innerHTML }}));
You can find the contents of the script tag with this code:
$xpatho = new DOMXPath($doc); // $doc is the DOMDocument loaded in the question
$elementsn = $xpatho->query("//*[contains(text(), 'div class=\"hit\"')]");
var_dump($elementsn);
echo htmlspecialchars($elementsn->item(0)->nodeValue);
Output:
<div class="col-lg-3 col-md-3 col-sm-4 col-xs-6"> <div class="hit" style="min-height:258px;"> <div class="thumbnail"> <div class="stay-image"> <div class="space-overlay hide" id="{{item_id}}"> <a href="https://www.hachi.tech/product/{{{item_id}}}/{{{item_url_desc}}}"><img src="https://cdn.hachi.tech/assets/images/product_images_thumb/{{image}}" alt="{{item_desc}}" class="img-responsive"/> <div class="caption"> <h5 class="product-title"><a style="-webkit-line-clamp: 3;max-height: 50px;" href="https://www.hachi.tech/product/{{{item_id}}}/{{{item_url_desc}}}">{{{_highlightResult.item_desc.value}}} <div class="prod-rating" id="R{{{item_id}}}"> <span class="text-red mbr-price"><span class="text-red mbr-price hidden-xs">ValueClub {{{display_final_price}}} <br class="dp_rebate"><span class="dp_rebate text-red product-price">{{{rebate}}} <p> <span class="reg-price"> <span class="hidden-xs hidden-sm" style='color:black'>{{strikeoff}} <div class="positionBtns"> {{#color_display}} <a class="hidden-xs hidden-sm" id="{{item_id}}" href="https://www.hachi.tech/product/{{{item_id}}}/{{{item_url_desc}}}"><span class="colours-special">{{color_display}} {{/color_display}} {{#special_display}} <a class="hidden-xs hidden-sm" id="{{item_id}}" href="https://www.hachi.tech/product/{{{item_id}}}/{{{item_url_desc}}}"><span class="colours-special">{{special_display}} {{/special_display}} <span class="colours-special publish_inv_check hide " style="color: #EE1C24;" id="{{item_id}}">Sold Out
If you want to access that div directly as a DOM element, you will need a real browser engine, for example a headless browser driven through Selenium.
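If only the template markup is needed (not the rendered search results), one option is to pull the text out of the script element and parse that text as its own HTML fragment. The sketch below assumes the template lives in a script element with id "hit-template", as the querySelector call above suggests; the page content is a shortened, hypothetical stand-in. It also uses libxml_use_internal_errors(), which is the usual way to collect parser warnings about malformed HTML instead of having them printed.

```php
<?php
// Hypothetical, shortened stand-in for the downloaded page.
$page = <<<'HTML'
<html><body>
<div id="hits"></div>
<script type="text/html" id="hit-template">
<div class="hit" style="min-height:258px;"><h5 class="product-title">{{item_desc}}</h5></div>
</script>
</body></html>
HTML;

// Collect parser warnings internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($page);
$xpath = new DOMXPath($doc);

// The template is plain text inside the script element, so this matches nothing.
$direct = $xpath->query('//div[@class="hit"]')->length;

// Take the script's text content and parse it as a separate document.
$template = $xpath->query('//script[@id="hit-template"]')->item(0)->nodeValue;
$inner = new DOMDocument();
$inner->loadHTML($template);
$hits = (new DOMXPath($inner))->query('//div[@class="hit"]');

// Anything the parser complained about can be inspected with libxml_get_errors().
libxml_clear_errors();

echo "direct matches: $direct, matches inside template: {$hits->length}\n";
```

This only gives access to the unrendered template with its {{placeholders}}; the actual product data is still filled in by JavaScript in the browser.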

PHP database output not showing the correct way

I am making a sort of forum with subjects stored in my database. My website is built in PHP (PDO). The problem is that I can't get the subjects from the database to display correctly: one below the other, each in its own block.
<div class="col-md-6">
<div class="well dash-box">
<h2><span class="glyphicon glyphicon-list-alt" aria-hidden="true"></span> Stel jezelf voor</h2>
<h5> Laat wetn wie jij en je business zijn</h5>
</div>
</div>
<div class="col-md-6">
<div class="well dash-box">
<h2><span class="glyphicon glyphicon-list-alt" aria-hidden="true"></span> 12</h2>
<?php
$toppics = $app->get_topics();
$i = 0;
$j = 0;
foreach ($toppics as $topic) {
if($j >= 1) continue;
echo '' . $topic['onderwerp'] . '';
$j++;
}
?>
</div>
</div>
Function:
public function get_topics(){
$getTopic = $this->database->query("SELECT * FROM topics ORDER BY id DESC LIMIT 1");
$topics = $this->database->resultset();
return $topics;
}
I also want each record to appear in a new block. In the code, you can see I tried a few things, but either it isn't the right approach or it isn't working. I know the $j++ isn't good, but if I delete that and the LIMIT 1 clause, it still doesn't work.
UPDATE
If I use the code as given in the answer, it looks like this and not like two different blocks.
<?php
$toppics = $app->get_topics();
$i = 0;
foreach($toppics as $topic){
echo '<div>';
echo '<h3>'.$topic['onderwerp'].'</h3><br>';
echo '' .$topic['omschrijving'].'';
echo '</div><br>';
}
?>
First of all: Since you want to fetch multiple topics from the DB, you must remove the LIMIT 1 from the query and the if($j >= 1) continue; in the foreach loop, as they both are restricting your output to only 1 topic.
In your foreach loop for $toppics (correct spelling: topics ;P) you currently only echo an anchor tag (link), but what you want is (to use your words here) a 'block'. Whatever you want that block to look like, the place for defining that is within that foreach loop.
Now I don't know what elements, classes or stylings you use / want to use, so I will make an example of a block that consists of a title and below that the link:
//rename $topic keys to the names of your DB columns
foreach($toppics as $topic){
echo '<div>';
echo '<h3>'.$topic['title'].'</h3><br>';
echo ''.$topic['link_text'].'';
echo '</div><br>';
}
I know my solution will not look exactly like your given image, but it should get the point across how and where you can build your blocks.
I think this problem is easily solvable once you know the basics of HTML, so I would really recommend learning a bit more about HTML before you work on big projects.
Edit after question was edited:
As I mentioned in my answer, my solution will not look exactly like your given image because I don't know what elements, classes or stylings you use. Your remaining problem is now the usage of the correct html tags, classes and stylings.
It appears that the parent element of the generated divs is styled the way you want the single blocks to look like.
So what you could do is remove the parent element and use it as a replacement of the generated div, like so:
<div class="col-md-6">
<div class="well dash-box">
<h2><span class="glyphicon glyphicon-list-alt" aria-hidden="true"></span> Stel jezelf voor</h2>
<h5> Laat wetn wie jij en je business zijn</h5>
</div>
</div>
<div class="col-md-6">
<!--<div class="well dash-box">-->
<h2><span class="glyphicon glyphicon-list-alt" aria-hidden="true"></span> 12</h2>
<?php
$toppics = $app->get_topics();
$i = 0;
foreach($toppics as $topic){
echo '<div class="well dash-box">';
echo '<h3>'.$topic['onderwerp'].'</h3><br>';
echo '' .$topic['omschrijving'].'';
echo '</div><br>';
}
?>
<!--</div>-->
</div>
Side note: I do not agree with how you build your href attribute (#section1). When building these sections you would have to know the exact index from that previous foreach loop. Instead, use some attribute of the topic itself, maybe its ID, title, or description (like I did in the first code block). This way, when you are building the sections, you can easily know how to set the element's id attribute.
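A minimal sketch of the loop with one block per topic, assuming the rows look like the array below. The sample data, the function name render_topic_blocks and the "topic-" id prefix are made up for illustration; the column names onderwerp, omschrijving and id are taken from the question. Values are passed through htmlspecialchars() so that text from the database cannot break the markup.

```php
<?php
// Hypothetical rows standing in for the result of $app->get_topics().
$topics = [
    ['id' => 2, 'onderwerp' => 'Tips & tricks', 'omschrijving' => 'Share what works for you'],
    ['id' => 1, 'onderwerp' => 'Stel jezelf voor', 'omschrijving' => 'Introduce yourself here'],
];

// Builds one "well dash-box" block per topic and returns the HTML as a string.
function render_topic_blocks(array $topics): string
{
    $html = '';
    foreach ($topics as $topic) {
        $id = (int) $topic['id'];
        $title = htmlspecialchars($topic['onderwerp'], ENT_QUOTES, 'UTF-8');
        $text = htmlspecialchars($topic['omschrijving'], ENT_QUOTES, 'UTF-8');

        // The id is derived from the topic itself, not from a loop counter.
        $html .= '<div class="well dash-box" id="topic-' . $id . '">'
            . '<h3>' . $title . '</h3>'
            . '<p>' . $text . '</p>'
            . "</div>\n";
    }
    return $html;
}

echo render_topic_blocks($topics);
```

Returning a string instead of echoing inside the loop keeps the function easy to reuse in whichever column of the page should hold the blocks.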

phpQuery selecting div inside first li

I have an HTML page that I'm trying to read (I used htmlsql.class.php, but as it's too old and outdated, I have to use phpQuery).
The html markup is:
<ul class="small-block-grid-1 medium-block-grid-2 large-block-grid-3">
<li>
<div data-widget-type="epg.tvGuide.channel" data-view="epg.tvGuide.channel" id="widget-765574917197" class=" widget-epg_tvGuide_channel">
<div class="group-box">
<div class="group-header l-center" data-action="togglePreviousBroadcasts">
<span class="header-text">
<img src="logo.png" style="height: 40px" />
</span>
</div>
<div>
<div class="tvGuide-item is-past">
<span data-action="toggleEventMeta">
06:15 what a day
</span>
<div class="tvGuide-item-meta">
Some text.
<div>Näita rohkem</div>
</div>
</div>
<div class="tvGuide-item is-current">
<span data-action="toggleEventMeta">
06:15 what a day
</span>
<div class="tvGuide-item-meta">
Some text.
<div>Näita rohkem</div>
</div>
</div>
<div class="tvGuide-item">
<span data-action="toggleEventMeta">
06:15 what a day
</span>
<div class="tvGuide-item-meta">
Some text.
<div>Näita rohkem</div>
</div>
</div>
</div>
</div>
With the previous library it was fairly easy:
$wsql->select('li');
if (!$wsql->query('SELECT * FROM span')){
print "Query error: " . $wsql->error;
exit;
}
foreach($wsql->fetch_array() as $row){
But I could not read the class, so I need to know when the class is current and when it's not.
I'm new to phpQuery, and real-life examples are hard to find.
Can someone point me in the right direction?
I would like to get the "span" text and the item meta; I would also like to know when the div class is "is-past" or "is-current".
You can find information about phpQuery here: https://code.google.com/archive/p/phpquery/
I prefer the "one-file" version at the top of the downloads:
https://code.google.com/archive/p/phpquery/downloads
Simple examples based on your code:
// for loading files use phpQuery::newDocumentFileHTML();
// for plain strings use phpQuery::newDocument();
$document = phpQuery::newDocumentFileHTML('http://domain.com/yourFile.html');
$items = pq($document)->find('.tvGuide-item');
foreach($items as $item) {
if(pq($item)->hasClass('is-past') === true) {
// matching past items
}
if(pq($item)->hasClass('is-current') === true) {
// matching current items
}
// examples for finding elements and grabbing text/attributes
$span = pq($item)->find('span');
$text_in_span = pq($span)->text();
$meta = pq($item)->find('.tvGuide-item-meta');
$link_in_meta = pq($meta)->find('a');
$href_of_link_in_meta = pq($link_in_meta)->attr('href');
}
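If installing phpQuery is not an option, roughly the same extraction can be sketched with PHP's built-in DOMDocument and DOMXPath. This is an alternative to the phpQuery code above, not a description of how phpQuery works internally. The markup is a simplified, hypothetical version of the page in the question, and the 'other' state name is made up for items that are neither past nor current.

```php
<?php
// Simplified stand-in for the TV guide markup in the question.
$html = <<<'HTML'
<div>
  <div class="tvGuide-item is-past">
    <span data-action="toggleEventMeta">06:15 what a day</span>
    <div class="tvGuide-item-meta">Some text.</div>
  </div>
  <div class="tvGuide-item is-current">
    <span data-action="toggleEventMeta">07:00 morning news</span>
    <div class="tvGuide-item-meta">Other text.</div>
  </div>
  <div class="tvGuide-item">
    <span data-action="toggleEventMeta">08:00 later show</span>
    <div class="tvGuide-item-meta">More text.</div>
  </div>
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Match the exact class token so "tvGuide-item-meta" is not selected as well.
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " tvGuide-item ")]';

$items = [];
foreach ($xpath->query($query) as $item) {
    $classes = preg_split('/\s+/', trim($item->getAttribute('class')));

    if (in_array('is-past', $classes, true)) {
        $state = 'past';
    } elseif (in_array('is-current', $classes, true)) {
        $state = 'current';
    } else {
        $state = 'other';
    }

    $items[] = [
        'state' => $state,
        // string() returns the text of the first matching child element.
        'title' => trim($xpath->evaluate('string(span)', $item)),
        'meta'  => trim($xpath->evaluate('string(div[contains(@class, "tvGuide-item-meta")])', $item)),
    ];
}

print_r($items);
```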

Wrap multiple span tags within specific div using preg_replace

I have HTML code like this:
<li class="recipe-ingredient">
<span class="recipe-ingredient-quantity-unit">
<span data-original="" data-fraction="" data-normalized="0" class="recipe-ingredient-quantity recipe-ingredient-quantity"></span>
<span data-original="" class="recipe-ingredient-unit recipe-ingredient-unit"></span>
</span>
<span class="recipe-ingredient-name recipe-ingredient-name">water</span>
<span class="recipe-ingredient-notes recipe-ingredient-notes">For Kneading</span>
</li>
Using preg_replace, I want to wrap the first set of <span> elements in one <div> and the last two <span> elements in another <div>, so my final outcome would be:
<li class="recipe-ingredient">
<div class="ing-qt-unit">
<span class="recipe-ingredient-quantity-unit">
<span data-original="" data-fraction="" data-normalized="0" class="recipe-ingredient-quantity recipe-ingredient-quantity"></span>
<span data-original="" class="recipe-ingredient-unit recipe-ingredient-unit"></span>
</span>
</div>
<div class="ing-name-notes">
<span class="recipe-ingredient-name recipe-ingredient-name">water</span>
<span class="recipe-ingredient-notes recipe-ingredient-notes">For Kneading</span>
</div>
</li>
This first one should catch the first set:
<span class[^>]+>(?:[^<]+<[^>]+>[^<]*<[^>]+>)*[^<]*</span>
And this one should catch the second set:
<span class[^>]+>(?:[^<]*<[^>]+>[^<]+<[^>]+>)*[^<]*</span>
Use it like this to avoid escaping:
$re = "#<span class[^>]+>(?:[^<]*<[^>]+>[^<]+<[^>]+>)*[^<]*</span>#im";
The replacement should be something like "<div class="ing-name-notes">$0</div>".
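As a runnable illustration, here is a sketch that applies preg_replace to the sample markup from the question. Note that it does not use the two patterns above: it swaps in simpler patterns anchored on the class names, which is an assumption about what stays constant in the real markup. It relies on the structure being exactly as in the sample (two empty inner spans, name span directly followed by notes span); for anything less regular, a DOM parser is the safer tool.

```php
<?php
// The sample markup from the question.
$html = <<<'HTML'
<li class="recipe-ingredient">
<span class="recipe-ingredient-quantity-unit">
<span data-original="" data-fraction="" data-normalized="0" class="recipe-ingredient-quantity recipe-ingredient-quantity"></span>
<span data-original="" class="recipe-ingredient-unit recipe-ingredient-unit"></span>
</span>
<span class="recipe-ingredient-name recipe-ingredient-name">water</span>
<span class="recipe-ingredient-notes recipe-ingredient-notes">For Kneading</span>
</li>
HTML;

$patterns = [
    // Outer quantity/unit span: it ends where two closing tags follow each other.
    '#<span class="recipe-ingredient-quantity-unit">.*?</span>\s*</span>#s',
    // Name span immediately followed by the notes span.
    '#<span class="recipe-ingredient-name[^>]*>.*?</span>\s*<span class="recipe-ingredient-notes[^>]*>.*?</span>#s',
];

// $0 is the whole match, so each match is wrapped unchanged.
$replacements = [
    '<div class="ing-qt-unit">$0</div>',
    '<div class="ing-name-notes">$0</div>',
];

$wrapped = preg_replace($patterns, $replacements, $html);
echo $wrapped;
```

Using # as the delimiter avoids escaping the slashes in the closing tags, and the s modifier lets the dot match the newlines between the spans.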

How do I extract keyword from webpage using PHP DOM

Here is a sample of code I have extracted from a webpage...
<div class="user-details-narrow">
<div class="profileheadtitle">
<span class=" headline txtBlue size15">
Profession
</span>
</div>
<div class="profileheadcontent-narrow">
<span class="txtGrey size15">
administration
</span>
</div>
</div>
When displayed on the webpage it shows as "Profession administration". What I want to do is extract the profession, in this case "administration". However, it's not as simple as it might seem because this piece of code is repeated many times for various other questions, such as
<div class="user-details-narrow">
<div class="profileheadtitle">
<span class=" headline txtBlue size15">
Industry
</span>
</div>
<div class="profileheadcontent-narrow">
<span class="txtGrey size15">
banking
</span>
</div>
</div>
Any ideas on a good solution?
Please, do not use regular expressions for getting node values from a page.
PHP has a very nice class named DOMDocument. You can load a page into a DOMDocument and query it with XPath:
$dom = new DOMDocument;
$dom->loadHTMLFile("http://test.de/page.html");
$finder = new DOMXPath($dom);
// every element whose class attribute contains "size15", in document order
$spaner = $finder->query("//*[contains(@class, 'size15')]");
echo $spaner->item(0)->nodeValue . "/" . $spaner->item(1)->nodeValue;
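Since the same block repeats for every question, it can help to look a value up by its label instead of by position. The sketch below is one way to do that with DOMXPath; the helper name profile_value is hypothetical, the markup is condensed from the question, and the label is assumed not to contain double quotes because it is inserted into the XPath string as-is.

```php
<?php
// Condensed version of the repeated blocks from the question.
$html = <<<'HTML'
<div class="user-details-narrow">
  <div class="profileheadtitle"><span class=" headline txtBlue size15"> Profession </span></div>
  <div class="profileheadcontent-narrow"><span class="txtGrey size15"> administration </span></div>
</div>
<div class="user-details-narrow">
  <div class="profileheadtitle"><span class=" headline txtBlue size15"> Industry </span></div>
  <div class="profileheadcontent-narrow"><span class="txtGrey size15"> banking </span></div>
</div>
HTML;

// Returns the content text for the block whose title equals $label, or null.
// Assumption: $label contains no double quotes.
function profile_value(DOMXPath $xpath, string $label): ?string
{
    $query = sprintf(
        '//div[@class="user-details-narrow"]'
        . '[div[@class="profileheadtitle"]/span[normalize-space(.)="%s"]]'
        . '/div[@class="profileheadcontent-narrow"]/span',
        $label
    );
    $node = $xpath->query($query)->item(0);

    return $node === null ? null : trim($node->textContent);
}

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

echo profile_value($xpath, 'Profession'), "\n";
echo profile_value($xpath, 'Industry'), "\n";
```

normalize-space() strips the line breaks and indentation around the label, so the comparison works even when the span text is spread over several lines as in the original page.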
