Here is what I am looking for:
I have a link that displays some data in HTML format:
http://www.118.com/people-search.mvc...0&pageNumber=1
The data comes in the following format:
<div class="searchResult regular">
Bird John
56 Leathwaite Road
London
SW11 6RS
020 7228 5576
I want my PHP page to request the above URL and extract the data from the resulting HTML page based on the following tags:
h2 = Name
address = Address
telephoneNumber = Phone Number
and display them in a table.
I have this so far. It works to an extent, but it only shows the HTML page as text:
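<?php
// Fetch a URL and return the response body as a string.
function get_content($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    // Return the body from curl_exec() instead of printing it,
    // which removes the need for output buffering.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $string = curl_exec($ch);
    curl_close($ch);
    return $string;
}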
$content = get_content("http://www.118.com/people-search.mvc?Supplied=true&Name=william&Location=Crabtree&pageSize=50&pageNumber=1");
echo $content;
$content = get_content("http://www.118.com/people-search.mvc?Supplied=true&Name=william&Location=Crabtree&pageSize=50&pageNumber=2");
echo $content;
$content = get_content("http://www.118.com/people-search.mvc?Supplied=true&Name=william&Location=Crabtree&pageSize=50&pageNumber=3");
echo $content;
$content = get_content("http://www.118.com/people-search.mvc?Supplied=true&Name=william&Location=Crabtree&pageSize=50&pageNumber=4");
echo $content;
?>
You need to use a DOM parser such as Simple HTML DOM or similar.
Then read the page into a DOM object and parse it using the appropriate selectors:
$html = file_get_html("http://www.118.com/people-search.mvc...0&pageNumber=1");
// Matches <div class="searchResult regular">
foreach ($html->find('div.searchResult') as $div) {
    // parse div contents here to extract name and address etc.
}
$html->clear();
unset($html);
For more info see the Simple HTML DOM documentation.
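The placeholder comment in the loop above could be filled in along these lines. This is only a minimal sketch: it uses PHP's built-in DOMDocument and DOMXPath rather than Simple HTML DOM, and the sample markup is an assumption based on the tag names given in the question (h2, address, telephoneNumber), so the real page may be structured differently. The names parseSearchResults and classTest are hypothetical helpers.

```php
<?php
// Hypothetical sample markup; the real results page may differ.
$sample = <<<HTML
<div class="searchResult regular">
  <h2>Bird John</h2>
  <div class="address">56 Leathwaite Road London SW11 6RS</div>
  <div class="telephoneNumber">020 7228 5576</div>
</div>
HTML;

// Build an XPath test that matches one class name in a space-separated class attribute.
function classTest(string $class): string
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' $class ')";
}

// Return one array per result div with the keys name, address and phone.
function parseSearchResults(string $html): array
{
    libxml_use_internal_errors(true); // tolerate imperfect real-world HTML
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $rows = [];
    foreach ($xpath->query('//div[' . classTest('searchResult') . ']') as $div) {
        $rows[] = [
            'name'    => trim($xpath->evaluate('string(.//h2)', $div)),
            'address' => trim($xpath->evaluate('string(.//*[' . classTest('address') . '])', $div)),
            'phone'   => trim($xpath->evaluate('string(.//*[' . classTest('telephoneNumber') . '])', $div)),
        ];
    }
    return $rows;
}

// Display the extracted rows as an HTML table.
echo "<table>\n<tr><th>Name</th><th>Address</th><th>Phone Number</th></tr>\n";
foreach (parseSearchResults($sample) as $row) {
    printf(
        "<tr><td>%s</td><td>%s</td><td>%s</td></tr>\n",
        htmlspecialchars($row['name']),
        htmlspecialchars($row['address']),
        htmlspecialchars($row['phone'])
    );
}
echo "</table>\n";
```

To use it with the fetching code from the question, pass the string returned by get_content() to parseSearchResults() instead of $sample.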
I appreciate the time you take to try and help me with my question.
I am trying to parse the HTML of a page fetched from a link. I use cURL to fetch the page, then run the result through htmlentities() so that it doesn't render on my page, which gives me a string. Then I use a DOM object to extract tags from it. I read up on different parsing methods and wrote my script, but the problem is that the string is stored as textContent rather than as a real HTML document. How can I convert the htmlentities() string into a real DOM document and extract elements from it?
the image of the var_dump is here
here is my script:
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'https://www.usatoday.com/story/news/world/2021/02/17/dubai-princess-sheikha-latifa-says-she-hostage-after-flee-attempt/6778014002/?utm_source=feedblitz&utm_medium=FeedBlitzRss&utm_campaign=usatodaycomworld-topstories');
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($curl);
curl_close($curl);
$htmlentities = htmlentities($result);
// I added the code here
$htmlDom = new DOMDocument();
$htmlDom->loadHTML($htmlentities);
$htmlDom->preserveWhiteSpace = false;
$styles = $htmlDom->getElementsByTagName('style');
foreach ($styles as $style) {
$item = $style->getElementsByTagName('td');
//echo the values
echo '1: '.$item->item(0)->nodeValue.'<br />';
echo '2: '.$item->item(1)->nodeValue.'<br />';
echo '3: '.$item->item(2)->nodeValue;
}
EDIT:
This is what I added to the code next:
$htmlentities = htmlentities($result);
$htmlentities = str_replace("&quot;", '"', $htmlentities);
$htmlentities = str_replace("&#039;", "'", $htmlentities);
$htmlentities = str_replace("&lt;", "<", $htmlentities);
$htmlentities = str_replace("&gt;", ">", $htmlentities);
libxml_use_internal_errors(true);
$htmlDom = new DOMDocument();
$htmlDom->loadHTML($htmlentities);
libxml_clear_errors();
var_dump($htmlDom);
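For what it's worth, htmlentities() is the step that turns the markup into plain text: once every < has become &lt;, DOMDocument sees no tags at all, only one long text node. A minimal sketch, assuming the goal is simply to query the fetched page: pass the raw cURL response to loadHTML() and keep htmlentities() for the moment you echo markup for display. The inline $result string below is a hypothetical stand-in for the cURL response.

```php
<?php
// Stand-in for the string returned by curl_exec(); hypothetical sample markup.
$result = '<html><head><style>td { color: red; }</style></head>'
        . '<body><table><tr><td>one</td><td>two</td><td>three</td></tr></table></body></html>';

libxml_use_internal_errors(true);   // real-world HTML is rarely valid
$htmlDom = new DOMDocument();
$htmlDom->loadHTML($result);        // raw HTML, not htmlentities($result)
libxml_clear_errors();

// Collect the text of every <td> in the document.
$cells = [];
foreach ($htmlDom->getElementsByTagName('td') as $td) {
    $cells[] = $td->nodeValue;
}
print_r($cells);

// If only an entity-encoded copy is available, decode it before parsing.
$encoded = htmlentities($result);
$decodedDom = new DOMDocument();
$decodedDom->loadHTML(html_entity_decode($encoded));
libxml_clear_errors();
echo $decodedDom->getElementsByTagName('td')->length, "\n";
```

Note that td elements live in the document body, so they are looked up on the document here rather than inside each style element as in the script above.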
I am extracting some metadata from the Ikea site based on catalogue number, using PHP Simple HTML DOM Parser.
Number 30275861 and dozens of others that I tested work properly. The result is this link ($produkt variable) and some data: http://www.ikea.com/pl/pl/catalog/products/30275861/?query=30275861 (pasted into a browser, the link opens the page for Kallax system furniture).
Number 69136138 gives the link ($produkt variable) http://www.ikea.com/pl/pl/catalog/products/S69136138/?query=69136138, which works when pasted into a browser (Besta TV furniture), but the script fails with this error:
Fatal error: Call to a member function find() on boolean
The code that works in most cases looks like this:
<?php
include('simple_html_dom.php');
function clean($string) {
$string = str_replace(',', '.', $string);
return preg_replace('/[^A-Za-z0-9\-.]/', '', $string);
}
if(isset($_POST['produkt_id'])){
$produkt_id=str_replace('.', '', $_POST['produkt_id']);
$url="http://www.ikea.com/pl/pl/search/?query=".$produkt_id;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // Must be set to true so that PHP follows any "Location:" header
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$a = curl_exec($ch); // $a will contain all headers
$url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL); // Return the last effective URL
$produkt=(string)$url;
$html = file_get_html($produkt);
echo $produkt_id;
echo "<br>";
echo $produkt;
foreach($html->find('meta[name=partnumber]') as $e) echo $kod=$e->content;
foreach($html->find('link[rel=image_src]') as $e) echo $obrazek=$e->href;
foreach($html->find('meta[name=title]') as $e) echo $nazwa=$e->content;
foreach($html->find('meta[name=price]') as $e) echo $cena=floatval(clean($e->content));
}
?>
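The error says find() was called on a boolean, which means file_get_html() returned false for that URL. Why not wrap your foreach loops in a conditional statement, so that the loops only run if $html actually contains a parsed document?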
<?php
// ... some code above
$html = file_get_html($produkt);
// Inspect the content of $html at this point to see what came back.
var_dump($html);
// Only run the loops (and echo any data) if $html has some content.
if ($html) {
    echo $produkt_id;
    echo "<br>";
    echo $produkt;
    foreach ($html->find('meta[name=partnumber]') as $e) {
        echo $kod = $e->content;
    }
    foreach ($html->find('link[rel=image_src]') as $e) {
        echo $obrazek = $e->href;
    }
    foreach ($html->find('meta[name=title]') as $e) {
        echo $nazwa = $e->content;
    }
    foreach ($html->find('meta[name=price]') as $e) {
        echo $cena = floatval(clean($e->content));
    }
}
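The same guard can be pushed into a small helper so that a failed fetch never reaches the parsing code. The following is a minimal sketch only: it swaps in PHP's built-in DOMDocument for Simple HTML DOM, the helper name metaContent is hypothetical, and the sample page and its values are made up for illustration.

```php
<?php
// Return the content attribute of <meta name="..."> or null when the page
// could not be fetched or the tag is missing. $html may be false on failure.
function metaContent($html, string $name): ?string
{
    if (!is_string($html) || $html === '') {
        return null; // fetch failed, nothing to parse
    }
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if ($meta->getAttribute('name') === $name) {
            return $meta->getAttribute('content');
        }
    }
    return null;
}

// Hypothetical sample page standing in for the fetched product page.
$page = '<html><head><meta name="partnumber" content="30275861">'
      . '<meta name="price" content="199,00"></head><body></body></html>';

var_dump(metaContent($page, 'partnumber'));  // value found on the page
var_dump(metaContent(false, 'partnumber'));  // failed fetch yields null
```

Since the script in the question already downloads the page with cURL, one option is to pass that response body to such a helper instead of fetching the URL a second time.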
I use PHP Simple HTML DOM Parser to get some elements of a page. Unfortunately, the result I get is 0 or 1. I would like to get the innerHTML instead.
Here is a photo of the dom:
And here is my code:
include('simple_html_dom.php');
// We take the url we want to scrape
$URL = 'https://www.legifrance.gouv.fr/affichTexte.do?cidTexte=JORFTEXT000033011065&dateTexte=20160821';
// Curl init
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $URL);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$result = curl_exec ($ch);
curl_close($ch);
// We get the html
$html = new simple_html_dom();
$html->load($result);
// Find all article blocks
foreach($html->find('div.data') as $article) {
$item['title'] = $article->find('.titreSection', 0) ->plaintext;
$resultat[] = "<p>" + $item['title']."</p></br>";
}
include 'vue_scrap.php';
?>
Here is the code of my view:
foreach ($resultat as $result){
echo $result;
}
Thank you for your help.
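In fact, I just made a mistake on this line:
$resultat[] = "<p>" + $item['title']."</p></br>";
In PHP, + is the arithmetic operator, so the strings were being converted to numbers. String concatenation uses the . operator, so the correct version is:
$resultat[] = "<p>".$item['title']."</p></br>";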
I have created a PHP parser that must extract the price from a span tag, but when I echo $html to see how the page loads, I get a broken page: only the header and footer load, not the content. The content seems to be loaded externally by JavaScript. How can I load the HTML page with DOM so that the JavaScript runs as well? I need the whole content to load so that I can get the divs and spans. This is my code:
<?php
require_once('simple_html_dom.php');
$url = 'http://oldnavy.gap.com/browse/product.do?cid=99570&vid=1&pid=714649002';
$dom = new domDocument('1.0', 'UTF-8');
$html = file_get_html($url);
echo $html;
if(is_object($html)){
foreach ( $html->find('span#priceText') as $data){
$raw_price = $data->innertext;
echo $raw_price;
}
}
?>
Alternative approach
The link you are actually looking for (in its minimal form) is this: http://oldnavy.gap.com/browse/productData.do?pid=714649
Now load that using cURL, set a value for the unknownShopperId cookie, explode the response into an array and pick out the price you need:
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_URL, "http://oldnavy.gap.com/browse/productData.do?pid=714649");
curl_setopt($ch, CURLOPT_HTTPHEADER, array("Cookie: unknownShopperId=E853DA3B2607DDAA5F2FE13CE8D32ACF"));
$result = curl_exec($ch);
$explode = explode(',', $result);
echo 'Original price: ' . $explode[92] . '<br/>' .
'New price: ' . $explode[93] . '<br/>' .
'Both prices: ' . $explode[13];
The result will be: '$14.94'
From now on, if you need another price you must know the item's pid.
I have this PHP code:
$url = "http://www.bbc.co.uk/";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch);
$doc = new DOMDocument();
$doc->validateOnParse = true;
@$doc->loadHTML($data);
// I want to get the element's id; all I know is that the element contains the text "Business"
echo $doc->getElementById($id)->textContent;
Let's assume that there is an element on a page that I want to keep track of. I don't know the id, just the text content at that time. I want to get the id so that I can get the text content of the same element next week or next month, whether or not the text content has changed.
Have a look at this project:
http://code.google.com/p/phpquery/
With this you can use CSS3 selectors like "div:contains('foo')" to find elements containing a text.
Update: An example
The task: Find the elements containing "find me" inside "test.html":
<html>
<head></head>
<body>
<div>hello</div>
<div>find me!</div>
<div>and find me!</div>
<div>another one</div>
</body>
</html>
The PHP script:
<?php
include "phpQuery-onefile.php";
phpQuery::newDocumentFileXHTML('test.html');
$domNodes = pq('div:contains("find me")');
foreach($domNodes as $domNode) {
/** @var DOMNode */
echo $domNode->textContent . PHP_EOL;
}
The result of running it:
php test.php
find me!
and find me!
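If adding a library is not an option, roughly the same lookup can be done with PHP's built-in DOMXPath, which also gets you the id the question asks about. This is a minimal sketch rather than phpQuery code: the sample page is made up, the function name findIdByText is hypothetical, and it only considers elements that actually have an id attribute.

```php
<?php
// Hypothetical sample page; on the real site the element and its id will differ.
$data = '<html><body>'
      . '<div id="nav-news">News</div>'
      . '<div id="nav-business">Business</div>'
      . '</body></html>';

// Return the id of the first element (in document order) that has an id
// attribute and a direct text node containing $needle, or null if none does.
function findIdByText(string $html, string $needle): ?string
{
    libxml_use_internal_errors(true); // tolerate imperfect real-world HTML
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Select the direct text nodes of every element that carries an id.
    foreach ($xpath->query('//*[@id]/text()') as $textNode) {
        if (strpos($textNode->nodeValue, $needle) !== false) {
            return $textNode->parentNode->getAttribute('id');
        }
    }
    return null;
}

$id = findIdByText($data, 'Business');
echo $id, PHP_EOL;
```

The returned id can then be stored and used on later runs to look the element up again, regardless of what its text has changed to.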