Extract Data from HTML using PHP

Here is what I am looking for:
I have a link that returns some data in HTML format:
http://www.118.com/people-search.mvc...0&pageNumber=1
The data comes in the following format:
<div class="searchResult regular">
Bird John
56 Leathwaite Road
London
SW11 6RS
020 7228 5576
I want my PHP page to request the URL above and extract the data from the resulting HTML page, based on the tags above, as follows:
h2 = Name
address = Address
telephoneNumber = Phone Number
and display them in a table.
This is what I have so far. It works to an extent, but it only shows the raw content of the HTML page:
<?php
// Fetch a URL with cURL and return the response body as a string.
function get_content($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    // Return the body instead of printing it, so no output buffering is needed.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $string = curl_exec($ch);
    curl_close($ch);
    return $string;
}
// Fetch and print result pages 1 to 4.
for ($page = 1; $page <= 4; $page++) {
    $content = get_content("http://www.118.com/people-search.mvc?Supplied=true&Name=william&Location=Crabtree&pageSize=50&pageNumber=" . $page);
    echo $content;
}
?>

You need to use a DOM parser such as Simple HTML DOM or similar.
Then read the page into a DOM object and parse it using the appropriate selectors:
include('simple_html_dom.php');
$html = new simple_html_dom("http://www.118.com/people-search.mvc...0&pageNumber=1");
// The selector must be a quoted string; this one matches div elements with the searchResult class.
foreach ($html->find('div.searchResult') as $div) {
    // parse the div contents here to extract name, address, etc.
}
$html->clear();
unset($html);
For more info see the Simple HTML DOM documentation.
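The same idea can also be sketched with PHP's built-in DOM extension, with no third-party library. This is only an illustration: the function names are hypothetical, and the h2, address and class="telephoneNumber" markup inside each result block is an assumption based on the tag names mentioned in the question, not confirmed page structure.

```php
<?php
// Sketch only: the <h2>, <address> and class="telephoneNumber" markup
// inside each result block is an assumption, not confirmed page structure.
function extractResults(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // real-world HTML is rarely valid
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $rows = [];
    // Match every div whose class attribute contains the word "searchResult".
    $query = '//div[contains(concat(" ", normalize-space(@class), " "), " searchResult ")]';
    foreach ($xpath->query($query) as $div) {
        $rows[] = [
            'name'    => firstText($xpath, './/h2', $div),
            'address' => firstText($xpath, './/address', $div),
            'phone'   => firstText($xpath, './/*[contains(@class, "telephoneNumber")]', $div),
        ];
    }
    return $rows;
}

// Return the whitespace-normalised text of the first node matching $query, or ''.
function firstText(DOMXPath $xpath, string $query, DOMNode $context): string
{
    $node = $xpath->query($query, $context)->item(0);
    return $node ? trim(preg_replace('/\s+/', ' ', $node->textContent)) : '';
}

// Render the extracted rows as a simple HTML table.
function renderTable(array $rows): string
{
    $out = "<table>\n<tr><th>Name</th><th>Address</th><th>Phone Number</th></tr>\n";
    foreach ($rows as $row) {
        $out .= '<tr><td>' . htmlspecialchars($row['name']) . '</td><td>'
              . htmlspecialchars($row['address']) . '</td><td>'
              . htmlspecialchars($row['phone']) . "</td></tr>\n";
    }
    return $out . '</table>';
}
```

To use it, pass the string returned by the cURL call to extractResults() and echo renderTable() on the result. If the real markup differs, only the three XPath expressions need to change.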

Related

Is it possible to extract Dom Elements from htmlentities() function in php?

I appreciate the time you take to help me with my question.
I am trying to build an HTML parser for a link. I first use cURL to fetch the page, then I pass the result through htmlentities() so that it does not render on my page, which gives me a string. Then I use a DOM object to extract tags from it. I looked up different parsing approaches on Google and learned a little about them, then ran my script. The problem is that the string is stored as textContent and not as a real HTML document, so I would like to know how I can convert the htmlentities() string into a real DOM document and extract elements from it.
Here is my script:
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'https://www.usatoday.com/story/news/world/2021/02/17/dubai-princess-sheikha-latifa-says-she-hostage-after-flee-attempt/6778014002/?utm_source=feedblitz&utm_medium=FeedBlitzRss&utm_campaign=usatodaycomworld-topstories');
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($curl);
curl_close($curl);
$htmlentities = htmlentities($result);
// I added the code here
$htmlDom = new DOMDocument();
$htmlDom->loadHTML($htmlentities);
$htmlDom->preserveWhiteSpace = false;
$styles = $htmlDom->getElementsByTagName('style');
foreach ($styles as $style) {
$item = $style->getElementsByTagName('td');
//echo the values
echo '1: '.$item->item(0)->nodeValue.'<br />';
echo '2: '.$item->item(1)->nodeValue.'<br />';
echo '3: '.$item->item(2)->nodeValue;
}
EDIT:
This is what I added to the code next:
$htmlentities = htmlentities($result);
$htmlentities = str_replace("&quot;", '"', $htmlentities);
$htmlentities = str_replace("&#039;", "'", $htmlentities);
$htmlentities = str_replace("&lt;", "<", $htmlentities);
$htmlentities = str_replace("&gt;", ">", $htmlentities);
libxml_use_internal_errors(true);
$htmlDom = new DOMDocument();
$htmlDom->loadHTML($htmlentities);
libxml_clear_errors();
var_dump($htmlDom);
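A minimal sketch of one way around this, assuming the goal is simply to query the fetched page: give DOMDocument the raw markup and escape only when the source is printed. countTags() is a hypothetical helper and $result here is a small stand-in for the string returned by curl_exec().

```php
<?php
// Count elements with the given tag name in an HTML string.
function countTags(string $html, string $tag): int
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate imperfect markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    return $doc->getElementsByTagName($tag)->length;
}

// Hypothetical stand-in for the string returned by curl_exec().
$result = '<html><body><table><tr><td>one</td><td>two</td></tr></table></body></html>';

// Parsing the raw markup finds the real elements.
echo countTags($result, 'td'), "\n";

// After htmlentities() the markup is plain text, so no <td> elements exist.
echo countTags(htmlentities($result), 'td'), "\n";

// Escape only at output time, when the source should be shown rather than rendered.
echo htmlspecialchars($result), "\n";
```

If only the encoded string is available, html_entity_decode() reverses htmlentities() in one call, which avoids chaining str_replace() calls.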

Fatal error from time to time while processing html with PHP Simple HTML DOM Parser

I am extracting some metadata from the IKEA site based on the catalogue number, using PHP Simple HTML DOM Parser.
Number 30275861 and dozens of others that I tested work properly and give this link ($produkt variable) plus some data: http://www.ikea.com/pl/pl/catalog/products/30275861/?query=30275861 (pasted into a browser, the link opens a page with Kallax system furniture).
Number 69136138 gives the link ($produkt variable) http://www.ikea.com/pl/pl/catalog/products/S69136138/?query=69136138, which works when pasted into a browser (Besta TV furniture) but produces this error:
Fatal error: Call to a member function find() on boolean
The code that works in most cases looks like this:
<?php
include('simple_html_dom.php');
function clean($string) {
$string = str_replace(',', '.', $string);
return preg_replace('/[^A-Za-z0-9\-.]/', '', $string);
}
if(isset($_POST['produkt_id'])){
$produkt_id=str_replace('.', '', $_POST['produkt_id']);
$url="http://www.ikea.com/pl/pl/search/?query=".$produkt_id;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // Must be set to true so that PHP follows any "Location:" header
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$a = curl_exec($ch); // $a will contain all headers
$url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL); // Return the last effective URL
$produkt=(string)$url;
$html = file_get_html($produkt);
echo $produkt_id;
echo "<br>";
echo $produkt;
foreach($html->find('meta[name=partnumber]') as $e) echo $kod=$e->content;
foreach($html->find('link[rel=image_src]') as $e) echo $obrazek=$e->href;
foreach($html->find('meta[name=title]') as $e) echo $nazwa=$e->content;
foreach($html->find('meta[name=price]') as $e) echo $cena=floatval(clean($e->content));
}
?>

Why not wrap your foreach loops in a conditional so that they only run when $html actually holds a parsed document? The error message says find() was called on a boolean, which means file_get_html() returned false for that URL.
<?php
// ... some code above
$html = file_get_html($produkt);
// Inspect what file_get_html() actually returned.
var_dump($html);
// Only run the loops (and echo any data) if $html holds a parsed
// document rather than false.
if ($html) {
    echo $produkt_id;
    echo "<br>";
    echo $produkt;
    foreach ($html->find('meta[name=partnumber]') as $e) {
        echo $kod = $e->content;
    }
    foreach ($html->find('link[rel=image_src]') as $e) {
        echo $obrazek = $e->href;
    }
    foreach ($html->find('meta[name=title]') as $e) {
        echo $nazwa = $e->content;
    }
    foreach ($html->find('meta[name=price]') as $e) {
        echo $cena = floatval(clean($e->content));
    }
}
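As a hedged alternative, the same guarded lookup can be sketched with PHP's built-in DOM extension, applied to the body already downloaded with cURL. metaContent() is a hypothetical helper, and the sample markup is invented for illustration; the real product page may look different.

```php
<?php
// Return the content attribute of <meta name="$name">, or null when the
// page is empty, cannot be parsed, or has no such tag.
function metaContent(string $html, string $name): ?string
{
    if ($html === '') {
        return null;                     // nothing was downloaded
    }
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);    // tolerate imperfect markup
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    if (!$loaded) {
        return null;
    }
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if ($meta->getAttribute('name') === $name) {
            return $meta->getAttribute('content');
        }
    }
    return null;
}

// Hypothetical sample markup; the real page may differ.
$page = '<html><head><meta name="partnumber" content="30275861">'
      . '<meta name="price" content="199,00"></head><body></body></html>';

$kod  = metaContent($page, 'partnumber');
$cena = metaContent($page, 'price');
echo $kod ?? 'no part number', "\n";
echo $cena ?? 'no price', "\n";
```

Returning null instead of false or an object keeps the calling code to a single check per value, so a missing tag never leads to a method call on a non-object.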

PHP Simple HTML Dom parser returns 0

I use PHP Simple HTML DOM Parser to get some elements of a page. Unfortunately, the result I get is 0 or 1. I would like to get the innerHTML instead.
Here is my code:
include('simple_html_dom.php');
// We take the url we want to scrape
$URL = 'https://www.legifrance.gouv.fr/affichTexte.do?cidTexte=JORFTEXT000033011065&dateTexte=20160821';
// Curl init
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $URL);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$result = curl_exec ($ch);
curl_close($ch);
// We get the html
$html = new simple_html_dom();
$html->load($result);
// Find all article blocks
foreach($html->find('div.data') as $article) {
$item['title'] = $article->find('.titreSection', 0) ->plaintext;
$resultat[] = "<p>" + $item['title']."</p></br>";
}
include 'vue_scrap.php';
?>
Here is the code of my view:
foreach ($resultat as $result){
echo $result;
}
Thank you for your help.

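In fact, I simply made a mistake in this line:
$resultat[] = "<p>" + $item['title']."</p></br>";
In PHP, + is arithmetic addition, not string concatenation, so the strings were being converted to numbers. The correct version uses the . operator:
$resultat[] = "<p>".$item['title']."</p></br>";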
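A small sketch of the same line wrapped in a helper, assuming the title should be shown as plain text. paragraph() is a hypothetical name; it also escapes the title and uses a regular <br> tag.

```php
<?php
// Build one paragraph of output for a title. Uses "." for concatenation
// and escapes the text so it cannot break the surrounding markup.
function paragraph(string $title): string
{
    return '<p>' . htmlspecialchars(trim($title)) . '</p><br>';
}

echo paragraph(' Article 1 '), "\n";
```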

PHP DOM parser breaks the page and can't load page content

I have written a PHP parser that should extract the price from a span tag, but when I echo $html to see how the page loads, I get a broken page: only the header and footer load, not the content. The content seems to be loaded by external JavaScript. How can I load the HTML page with DOM so that the JavaScript also runs? I need the whole content to load so that I can get the divs and spans. This is my code:
<?php
require_once('simple_html_dom.php');
$url = 'http://oldnavy.gap.com/browse/product.do?cid=99570&vid=1&pid=714649002';
$dom = new domDocument('1.0', 'UTF-8');
$html = file_get_html($url);
echo $html;
if(is_object($html)){
foreach ( $html->find('span#priceText') as $data){
$raw_price = $data->innertext;
echo $raw_price;
}
}
?>

Alternative approach
The link you are actually looking for (in its minimal form) is this: http://oldnavy.gap.com/browse/productData.do?pid=714649
Now load it using cURL, set a value for the unknownShopperId cookie, explode the response into an array and pick out the price you need:
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_URL, "http://oldnavy.gap.com/browse/productData.do?pid=714649");
curl_setopt($ch, CURLOPT_HTTPHEADER, array("Cookie: unknownShopperId=E853DA3B2607DDAA5F2FE13CE8D32ACF"));
$result = curl_exec($ch);
$explode = explode(',', $result);
echo 'Original price: ' . $explode[92] . '<br/>' .
'New price: ' . $explode[93] . '<br/>' .
'Both prices: ' . $explode[13];
The result will be: '$14.94'
From now on, if you need another price, you must know the item's pid.
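Fixed positions such as $explode[92] stop working as soon as the response gains or loses a field. A hedged sketch of a less position-dependent variant, assuming only that prices appear in the response as dollar amounts like $14.94; extractPrices() is a hypothetical helper and the sample string is invented, since the real productData.do response is not reproduced here.

```php
<?php
// Return every dollar amount (such as $14.94) found in a response body.
function extractPrices(string $body): array
{
    preg_match_all('/\$\d+(?:\.\d{2})?/', $body, $matches);
    return $matches[0];
}

// Hypothetical sample; the real productData.do response is not reproduced here.
$sample = 'x,"$16.94","$14.94",y';
print_r(extractPrices($sample));
```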

Get element id by textcontent in php

I have a php code:
$url = "http://www.bbc.co.uk/";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch);
$doc = new DOMDocument();
$doc->validateOnParse = true;
@$doc->loadHTML($data);
// I want to get the element's id, and all I know is that the element contains the text "Business"
echo $doc->getElementById($id)->textContent;
Let's assume that there is an element on a page I want to keep track of. I don't know its id, just its text content at this time. I want to get the id so that I can read the text content of the same element next week or next month, whether or not the text has changed.

Have a look at this project:
http://code.google.com/p/phpquery/
With this you can use CSS3 selectors like "div:contains('foo')" to find elements containing a text.
Update: An example
The task: Find the elements containing "find me" inside "test.html":
<html>
<head></head>
<body>
<div>hello</div>
<div>find me!</div>
<div>and find me!</div>
<div>another one</div>
</body>
</html>
The PHP script:
<?php
include "phpQuery-onefile.php";
phpQuery::newDocumentFileXHTML('test.html');
$domNodes = pq('div:contains("find me")');
foreach($domNodes as $domNode) {
/** @var DOMNode $domNode */
echo $domNode->textContent . PHP_EOL;
}
The result of running it:
php test.php
find me!
and find me!
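For the original goal of recovering an element's id from its text, a similar lookup can be sketched with the built-in DOMDocument and DOMXPath classes. findIdsByText() is a hypothetical helper, and the id attributes in the sample markup are invented for illustration.

```php
<?php
// Return the id attributes of all elements that directly contain $needle
// in one of their own text nodes. Elements without an id are skipped.
function findIdsByText(string $html, string $needle): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate imperfect markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $ids = [];
    // Select the text nodes that are direct children of elements with an id.
    foreach ($xpath->query('//*[@id]/text()') as $textNode) {
        if (strpos($textNode->nodeValue, $needle) !== false) {
            $ids[] = $textNode->parentNode->getAttribute('id');
        }
    }
    return array_values(array_unique($ids));
}

// Hypothetical sample with invented ids.
$html = '<html><body><div id="a">hello</div><div id="b">find me!</div>'
      . '<div id="c">and find me!</div><div>find me too</div></body></html>';
print_r(findIdsByText($html, 'find me'));
```

Once an id is stored, a later run can call getElementById() with it, which keeps working even after the text changes, provided the page keeps the same id.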
