How do I extract keyword from webpage using PHP DOM

Here is a sample of code I have extracted from a webpage...
<div class="user-details-narrow">
<div class="profileheadtitle">
<span class=" headline txtBlue size15">
Profession
</span>
</div>
<div class="profileheadcontent-narrow">
<span class="txtGrey size15">
administration
</span>
</div>
</div>
When displayed on the webpage it shows as "Profession administration". What I want to do is extract the profession, in this case "administration". However, it's not as simple as it might seem because this piece of code is repeated many times for various other questions, such as
<div class="user-details-narrow">
<div class="profileheadtitle">
<span class=" headline txtBlue size15">
Industry
</span>
</div>
<div class="profileheadcontent-narrow">
<span class="txtGrey size15">
banking
</span>
</div>
</div>
Any ideas on a good solution?

Please, do not use regular expressions for getting node values from a page.
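PHP has a very nice class named DOMDocument. You can load a page into a DOMDocument and query it with XPath:
libxml_use_internal_errors(true); // keep warnings about malformed HTML out of the output
$dom = new DOMDocument;
// loadHTMLFile() accepts a URL when allow_url_fopen is enabled
$dom->loadHTMLFile("http://test.de/page.html");
$finder = new DOMXPath($dom);
$spaner = $finder->query("//*[contains(@class, 'size15')]");
echo trim($spaner->item(0)->nodeValue) . "/" . trim($spaner->item(1)->nodeValue);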
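Because every field on the page uses the same markup, picking items by position is fragile. Here is a minimal sketch, assuming the structure shown in the question, that looks a value up by its label instead. The helper name profileValue and the inline sample markup are illustrative and not part of the original answer.

```php
<?php
// Sample markup condensed from the question; a real page would be loaded with loadHTMLFile().
$html = <<<'HTML'
<div class="user-details-narrow">
<div class="profileheadtitle"><span class=" headline txtBlue size15">
Profession
</span></div>
<div class="profileheadcontent-narrow"><span class="txtGrey size15">
administration
</span></div>
</div>
<div class="user-details-narrow">
<div class="profileheadtitle"><span class=" headline txtBlue size15">
Industry
</span></div>
<div class="profileheadcontent-narrow"><span class="txtGrey size15">
banking
</span></div>
</div>
HTML;

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Hypothetical helper: returns the value shown next to $label, or null if the label is absent.
// Assumes $label contains no single quote, because it is embedded in the XPath string.
function profileValue(DOMXPath $xpath, string $label): ?string
{
    $query = "//div[contains(@class, 'user-details-narrow')]"
        . "[div[contains(@class, 'profileheadtitle')]/span[normalize-space(.) = '$label']]"
        . "/div[contains(@class, 'profileheadcontent-narrow')]/span";
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

echo profileValue($xpath, 'Profession'), "\n";
echo profileValue($xpath, 'Industry'), "\n";
```

The predicate on the outer div ties each value to its own title, so the lookup keeps working if the fields appear in a different order.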

Related

Is there any way in php to select all classes that contain the same word

I would like to know if there is any way, in php, to match all classes with the same word,
Example:
<div class="classeby_class">
<div class="classos-nope">
<div class="row">
<div class="class-show"></div>
</div>
</div>
</div>
<div class="class-first-one">
<div class="container">
<div class="classes-show">
<div class="class"></div>
<div class="classing"></div>
</div>
</div>
</div>
In the example above I would like to match every div whose class contains the word "class", but not those whose class only contains a longer word such as "classes".
The match should be positive for
<div class="class-show">...</div>
<div class="class-first-one">...</div>
<div class="class">...</div>
<div class="class-first-one">...</div>
but negative for
<div class="classeby_class">...</div>
<div class="classes-show">...</div>
<div class="classing">...</div>
I am using php to display several different html pages.
Regex would not be an appropriate method, first because of several page breaks and second because of hosting limitations, so I'm trying to do this by parsing.
All the HTML code is stored on the server.
I can eliminate divs with a specific class using the example below.
$doc = new DOMDocument();
$doc->loadHTML($html); // $html holds the stored page markup
$xpath = new DOMXPath($doc);
$classtoremove = $xpath->query('//div[contains(@class,"class")]');
foreach($classtoremove as $classremoved){
$classremoved->parentNode->removeChild($classremoved);
}
echo $doc->saveHTML();
I know there are CSS selectors, but when I try to use them in PHP they don't work, possibly because I'm using XPath.
Example:
'[id*="class"],[class*="class"]'
Still, I think that would match more than I need.
Is there any way to get these values with XPath?
The intent is to completely remove the div or other tags that contain that word.

You could use a regex with word boundaries, \bclass\b, on the class attribute and call it from XPath via DOMXPath::registerPhpFunctions.
For example
$data = <<<DATA
<div class="classeby_class">
<div class="classos-nope">
<div class="row">
<div class="class-show"></div>
</div>
</div>
</div>
<div class="class-first-one">
<div class="container">
<div class="classes-show">
<div class="class"></div>
<div class="classing"></div>
</div>
</div>
</div>
DATA;
$doc = new DomDocument();
$doc->loadHTML($data);
$xpath = new DOMXPath($doc);
$xpath->registerNamespace("php", "http://php.net/xpath");
$xpath->registerPHPFunctions();
$classtoremove = $xpath->query("//div[1 = php:function('preg_match', '/\bclass\b/', string(@class))]");
foreach ($classtoremove as $a) {
var_dump($a->getAttribute("class"));
}
Output
string(10) "class-show"
string(15) "class-first-one"
string(5) "class"
See a PHP demo
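The example above only dumps the matching class names. A minimal sketch of the removal step the question asks for could look like the following; the shortened sample markup and the variable names are assumptions made for illustration.

```php
<?php
// Shortened sample markup (hypothetical) with matching and non-matching class names.
$html = <<<'HTML'
<div class="classeby_class">
  <div class="class-show">removed</div>
  <div class="classes-show">kept</div>
</div>
<div class="class-first-one">removed too</div>
HTML;

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
$xpath->registerNamespace('php', 'http://php.net/xpath');
$xpath->registerPhpFunctions('preg_match'); // expose only preg_match to XPath

// "\\b" in a double-quoted PHP string reaches the regex engine as \b (word boundary).
$matches = $xpath->query(
    "//div[1 = php:function('preg_match', '/\\bclass\\b/', string(@class))]"
);

// Copy the node list into an array first, then detach each matching div.
foreach (iterator_to_array($matches) as $node) {
    if ($node->parentNode !== null) {
        $node->parentNode->removeChild($node);
    }
}

$result = $doc->saveHTML();
echo $result;
```

Note that the underscore counts as a word character, so "classeby_class" is not matched by \bclass\b and its div stays in the output.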

Xpath to select based on attribute value not working in PHP as expected

The code works well if it is "//div" or "//html". The moment I use "//*[@class='hit']", "//div[@class='hit']" or '//*[@class="hit"]', it does not select the element I need.
This is the code:
$xpath = "//div";
$data = file_get_contents("https://www.hachi.tech/searching?q=&hPP=144&idx=instant_product_price_asc&p=0&is_v=1");
d($data);
//d() is a custom function that works like var_dump
$doc = new DOMDocument();
$doc->loadHTML($data);
$xpatho = new DOMXpath($doc);
$elementsn = $xpatho->query($xpath);
d($xpath);
d($elementsn->length);
//d() is a custom function that works like var_dump
When I dumped $data, I got this:
https://justpaste.it/38v46
(the text is very long so I pasted in a separate link).
There is clearly a div element with class="hit" in the html (you can do a search). Search for:
<div class="hit" style="min-height:258px;">
I can only think of malformed HTML, in which case what can I do in general to check (and fix!) the HTML first before passing it for selection?

You can't access that div directly because it is not in the DOM; it is inside a script element and thus treated as text. It gets added to the DOM once the page loads, via this code:
a.addWidget(instantsearch.widgets.hits({container:"#hits",hitsPerPage: showperpage,templates:{item: document.querySelector("#hit-template").innerHTML ,empty: document.querySelector("#no-results-template").innerHTML }}));
You can find the contents of the script tag with this code:
$xpatho = new DOMXPath($doc); // $doc is the DOMDocument from the question
$elementsn = $xpatho->query("//*[contains(text(), 'div class=\"hit\"')]");
var_dump($elementsn);
echo htmlspecialchars($elementsn->item(0)->nodeValue);
Output:
<div class="col-lg-3 col-md-3 col-sm-4 col-xs-6"> <div class="hit" style="min-height:258px;"> <div class="thumbnail"> <div class="stay-image"> <div class="space-overlay hide" id="{{item_id}}"> <a href="https://www.hachi.tech/product/{{{item_id}}}/{{{item_url_desc}}}"><img src="https://cdn.hachi.tech/assets/images/product_images_thumb/{{image}}" alt="{{item_desc}}" class="img-responsive"/> <div class="caption"> <h5 class="product-title"><a style="-webkit-line-clamp: 3;max-height: 50px;" href="https://www.hachi.tech/product/{{{item_id}}}/{{{item_url_desc}}}">{{{_highlightResult.item_desc.value}}} <div class="prod-rating" id="R{{{item_id}}}"> <span class="text-red mbr-price"><span class="text-red mbr-price hidden-xs">ValueClub {{{display_final_price}}} <br class="dp_rebate"><span class="dp_rebate text-red product-price">{{{rebate}}} <p> <span class="reg-price"> <span class="hidden-xs hidden-sm" style='color:black'>{{strikeoff}} <div class="positionBtns"> {{#color_display}} <a class="hidden-xs hidden-sm" id="{{item_id}}" href="https://www.hachi.tech/product/{{{item_id}}}/{{{item_url_desc}}}"><span class="colours-special">{{color_display}} {{/color_display}} {{#special_display}} <a class="hidden-xs hidden-sm" id="{{item_id}}" href="https://www.hachi.tech/product/{{{item_id}}}/{{{item_url_desc}}}"><span class="colours-special">{{special_display}} {{/special_display}} <span class="colours-special publish_inv_check hide " style="color: #EE1C24;" id="{{item_id}}">Sold Out
If you want to access that div as a rendered element, you will need a tool that executes the page's JavaScript, such as a headless browser driven by Selenium.
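If only the template markup is needed, rather than the rendered products, one option is to parse the script's text as a second document. This is a minimal sketch under assumptions: the id hit-template is taken from the querySelector call quoted above, while the sample page and variable names are made up for illustration.

```php
<?php
// Hypothetical stand-in for the downloaded page: the markup lives inside a script element.
$data = <<<'HTML'
<html><body>
<div id="hits"></div>
<script type="text/html" id="hit-template">
<div class="col-lg-3"><div class="hit" style="min-height:258px;"><h5 class="product-title">{{item_desc}}</h5></div></div>
</script>
</body></html>
HTML;

libxml_use_internal_errors(true); // the embedded tags trigger parser warnings; keep them quiet

$doc = new DOMDocument();
$doc->loadHTML($data);
$xpath = new DOMXPath($doc);

// In the outer document the template is plain text inside the script element.
$outerHits = $xpath->query("//div[@class='hit']");
$template = $xpath->query("//script[@id='hit-template']")->item(0);

// Parse that text as its own document so its elements become real DOM nodes.
$fragment = new DOMDocument();
$fragment->loadHTML($template->textContent);
$innerHits = (new DOMXPath($fragment))->query("//div[@class='hit']");

echo "outer matches: ", $outerHits->length, "\n";
echo "inner matches: ", $innerHits->length, "\n";
```

This recovers the template structure with its {{placeholders}} only; the actual product data is still filled in by JavaScript in the browser.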

How to get content text of div by simple html dom - php

I get the HTML below using Simple HTML DOM (file_get_html('http://example.com'))
<div id="ship" class="fe" data-feature-name="box" data-cel-widget="sox">
<div class="a-medium b-di">
<div id="mer-info" class="a-section a-spacing-mini">
Hello World
<span class="">
</span>
</div>
</div>
</div>
How can I get the "Hello World" text content?
I tried a lot of things, for example the calls below, but they gave me NULL
$html->find('div[id="mer-info"]',0);
$html->find("div#mer-info");
$html->find("div#mer-info")->plaintext;
$html->find('div[id="mer-info"]')->innertext;
and ...
But I got NULL still!

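You passed the index (0) as the second argument to find() only in the call that uses div[id="mer-info"] as the selector, which find() does not seem to recognize. Without an index, find() returns an array of elements, so reading ->plaintext or ->innertext from it gives NULL. Try the following: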
require 'simple_html_dom.php';
$html =<<<html
<div id="ship" class="fe" data-feature-name="box" data-cel-widget="sox">
<div class="a-medium b-di">
<div id="mer-info" class="a-section a-spacing-mini">
Hello World
<span class="">
</span>
</div>
</div>
</div>
html;
$dom = str_get_html($html);
$elem = $dom->find('#mer-info', 0);
print $elem->plaintext;
print "\n";
$elem = $dom->find('div#mer-info', 0);
print $elem->plaintext;
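For comparison, the same lookup can be sketched with PHP's built-in DOM extension, in case the Simple HTML DOM library is not available. This is an alternative under that assumption, not what the answer above uses.

```php
<?php
// Markup copied from the question.
$html = <<<'HTML'
<div id="ship" class="fe" data-feature-name="box" data-cel-widget="sox">
<div class="a-medium b-di">
<div id="mer-info" class="a-section a-spacing-mini">
Hello World
<span class="">
</span>
</div>
</div>
</div>
HTML;

libxml_use_internal_errors(true); // data-* attributes and fragments can trigger warnings
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// query() returns a node list; item(0) is null when nothing matched.
$node = $xpath->query("//div[@id='mer-info']")->item(0);
$text = $node !== null ? trim($node->textContent) : null;

echo $text, "\n";
```

textContent includes the text of child elements too, so trim() is used here to drop the whitespace left by the empty span.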

phpQuery selecting div inside first li

I have an HTML page that I'm trying to read (I used htmlsql.class.php, but since it is old and outdated I have to use phpQuery).
The html markup is:
<ul class="small-block-grid-1 medium-block-grid-2 large-block-grid-3">
<li>
<div data-widget-type="epg.tvGuide.channel" data-view="epg.tvGuide.channel" id="widget-765574917197" class=" widget-epg_tvGuide_channel">
<div class="group-box">
<div class="group-header l-center" data-action="togglePreviousBroadcasts">
<span class="header-text">
<img src="logo.png" style="height: 40px" />
</span>
</div>
<div>
<div class="tvGuide-item is-past">
<span data-action="toggleEventMeta">
06:15 what a day
</span>
<div class="tvGuide-item-meta">
Some text.
<div>Näita rohkem</div>
</div>
</div>
<div class="tvGuide-item is-current">
<span data-action="toggleEventMeta">
06:15 what a day
</span>
<div class="tvGuide-item-meta">
Some text.
<div>Näita rohkem</div>
</div>
</div>
<div class="tvGuide-item">
<span data-action="toggleEventMeta">
06:15 what a day
</span>
<div class="tvGuide-item-meta">
Some text.
<div>Näita rohkem</div>
</div>
</div>
</div>
</div>
With the previous library it was fairly easy:
$wsql->select('li');
if (!$wsql->query('SELECT * FROM span')){
print "Query error: " . $wsql->error;
exit;
}
foreach($wsql->fetch_array() as $row){
But I could not read the class, and I need to know when the class is current and when it is not.
As I'm new to phpQuery, real-life examples are hard to find.
Can someone point me in the right direction?
I would like to have the "span" text and the item meta; I would also like to know when the div class is "is-past" or "is-current".

You can find information about phpQuery here: https://code.google.com/archive/p/phpquery/
I prefer the "one-file" version at the top of the downloads:
https://code.google.com/archive/p/phpquery/downloads
Simple examples based on your code:
// for loading files use phpQuery::newDocumentFileHTML();
// for plain strings use phpQuery::newDocument();
$document = phpQuery::newDocumentFileHTML('http://domain.com/yourFile.html');
$items = pq($document)->find('.tvGuide-item');
foreach($items as $item) {
if(pq($item)->hasClass('is-past') === true) {
// matching past items
}
if(pq($item)->hasClass('is-current') === true) {
// matching current items
}
// examples for finding elements and grabbing text/attributes
$span = pq($item)->find('span');
$text_in_span = pq($span)->text();
$meta = pq($item)->find('.tvGuide-item-meta');
$link_in_meta = pq($meta)->find('a');
$href_of_link_in_meta = pq($link_in_meta)->attr('href');
}
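If phpQuery is not available, roughly the same result can be sketched with the built-in DOMDocument and DOMXPath classes. The sample markup below is shortened from the question (the later times are made up so the rows differ), and the state label "other" is an illustrative name.

```php
<?php
// Shortened sample based on the question's markup; the 07:00 and 08:00 entries are invented.
$html = <<<'HTML'
<div class="group-box">
<div class="tvGuide-item is-past"><span data-action="toggleEventMeta">06:15 what a day</span>
<div class="tvGuide-item-meta">Some text.</div></div>
<div class="tvGuide-item is-current"><span data-action="toggleEventMeta">07:00 what a day</span>
<div class="tvGuide-item-meta">Some text.</div></div>
<div class="tvGuide-item"><span data-action="toggleEventMeta">08:00 what a day</span>
<div class="tvGuide-item-meta">Some text.</div></div>
</div>
HTML;

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Match "tvGuide-item" as a whole class token, so "tvGuide-item-meta" is not selected.
$items = $xpath->query(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' tvGuide-item ')]"
);

$rows = [];
foreach ($items as $item) {
    $classes = preg_split('/\s+/', trim($item->getAttribute('class')));
    if (in_array('is-past', $classes, true)) {
        $state = 'past';
    } elseif (in_array('is-current', $classes, true)) {
        $state = 'current';
    } else {
        $state = 'other'; // neither is-past nor is-current
    }

    // Relative queries: the second argument limits the search to this item.
    $span = $xpath->query('./span', $item)->item(0);
    $meta = $xpath->query("./div[contains(@class, 'tvGuide-item-meta')]", $item)->item(0);

    $rows[] = [
        'state' => $state,
        'title' => $span !== null ? trim($span->textContent) : '',
        'meta'  => $meta !== null ? trim($meta->textContent) : '',
    ];
}

print_r($rows);
```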

Php to auto populate grids

I have the following html code:
<div class="media row-fluid">
<div class="span3">
<div class="widget">
<div class="well">
<div class="view">
<img src="img/demo/media/1.png" alt="" />
</div>
<div class="item-info">
Title 1
<p>Info.</p>
<p class="item-buttons">
<i class="icon-pencil"></i>
<i class="icon-trash"></i>
</p>
</div>
</div>
</div>
<div class="widget">
<div class="well">
<div class="view">
<img src="img/demo/media/2.png" alt="" />
</div>
<div class="item-info">
This is another title
<p>Some info and details go here.</p>
<p class="item-buttons">
<i class="icon-pencil"></i>
<i class="icon-trash"></i>
</p>
</div>
</div>
</div>
</div>
Which basically alternates between a span class with the widget class, and then the widget class without the span3 class.
What I wanted to know is whether there is a way to have PHP "echo" or populate the image details and the details under the "item-info" class. Would I need to use a foreach statement to get this done? I would be storing the information in a MySQL database, and while I can fill in the info one by one (repeatedly writing out the markup and echoing each image and item title), it's not practical when more than 15 different items need to be displayed. I'm not well versed in foreach statements, so I could definitely use some help with it.
If someone could help me structure a PHP script so that it automatically outputs the HTML based on the number of individual items in the database, that would be greatly appreciated!
I'm wondering if the html + php (not including the foreach) would look like this:
<div class="span3">
<div class="widget">
<div class="well">
<div class="view">
<img src="img/<? $file ?>" alt="" />
</div>
<div class="item-info">
<?$title?>
<p>Info.</p>
<p class="item-buttons">
<i class="icon-pencil"></i>
<i class="icon-trash"></i>
</p>
</div>
</div>
</div>
EDIT:
I wanted to add some more information. The items populated would be based on a type of subscription - which will be managed by a group id.
I was initially going to use <? (if $_SESSION['group_id']==1)>
echo <div class="item-info">
$title
<p>$info</p>
</div>
so that only the subscribed items would populate. But, I would need it to iterate through all the items for group1 table and list it. Currently I know that I can do
<? (if $_SESSION['group_id']==1)
while ($row=mysql_fetch_assoc($sqlItem))
{
$itemInfo = $row['info'];
$image = $row['image'];
$title = $row['title'];
$url = $row['url'];
};
>
$sqlItem for now can only be assigned one thing (manually - as in: $sqlItem = '123'), unless I iterate through which is what I'm trying to figure out.

I just read that mysql_fetch_assoc is deprecated as of PHP 5.5. The mysqli way looks better and easier, I think: http://php.net/manual/en/mysqli-stmt.fetch.php
In the example on that page, replace the printf with an echo of your HTML.
This will iterate through the rows in your database until there are no more matching records.
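The loop itself can be sketched as follows. The rows are a hard-coded array here so the example runs without a database; in practice they might come from a mysqli fetch loop. The function name renderWidget and the column names image, title and info are assumptions based on the question's snippet.

```php
<?php
// Hypothetical rows standing in for database results (e.g. from mysqli_result::fetch_assoc()).
$rows = [
    ['image' => '1.png', 'title' => 'Title 1', 'info' => 'Info.'],
    ['image' => '2.png', 'title' => 'This is another title', 'info' => 'Some info and details go here.'],
];

// Builds the markup for one item; values are escaped before being placed in the HTML.
function renderWidget(array $row): string
{
    $image = htmlspecialchars($row['image'], ENT_QUOTES, 'UTF-8');
    $title = htmlspecialchars($row['title'], ENT_QUOTES, 'UTF-8');
    $info  = htmlspecialchars($row['info'], ENT_QUOTES, 'UTF-8');

    return <<<HTML
<div class="widget">
<div class="well">
<div class="view">
<img src="img/demo/media/{$image}" alt="" />
</div>
<div class="item-info">
{$title}
<p>{$info}</p>
</div>
</div>
</div>

HTML;
}

// One widget per row, however many rows there are.
$output = '';
foreach ($rows as $row) {
    $output .= renderWidget($row);
}

echo '<div class="span3">', "\n", $output, '</div>', "\n";
```

With a real query, the foreach would be replaced by a while loop over the fetched rows, calling the same function for each one.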

Shouldn't a while loop be enough? It depends on the structure of your database and website (we didn't need so much HTML, I think; some more PHP, maybe). Hope this helps.
