I want to make a news website, just like my other site that is already live on the Internet.
I don't have enough content of my own, so I want to pull content from another news website (of course, I have their permission to copy content from their site).
For this I used Simple HTML DOM, and it works very well on my other site, but when I apply the same script on the live site it doesn't work (it outputs nothing).
Here is the link I want to get the data from: hesspress.com.
Code:
<?php
require_once 'simple_html_dom.php';

$lien = 'http://www.hesspress.com';     // page to scrape
$html = new simple_html_dom();
$html->load_file($lien);                // fetch and parse the page
$k = $html->find('a', 0)->href;         // href of the first <a> tag
print_r($k);
This is what I would use; you can modify it for your own purposes.
It uses PHP's built-in DOM extension.
Code:
<?php
$url = 'http://acid3.acidtests.org';
$handle = curl_init($url);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($handle);
libxml_use_internal_errors(true); // Prevent HTML errors from displaying
$doc = new DOMDocument();
$doc->loadHTML($html);
$link = $doc->getElementsByTagName('a')->item(0);
echo 'Link text: ' . $link->nodeValue;
echo '<br />';
echo 'Link URL: ' . $link->getAttribute('href');
?>
You can't use that Simple HTML DOM code on your website as-is; you can do the same thing with PHP's built-in DOM extension instead: http://php.net/manual/en/book.dom.php
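For completeness, here is a minimal sketch applying that built-in DOM approach to the hesspress.com URL from the question. The cURL options and the redirect-following flag are assumptions; adjust them to your server's setup.
Code:
<?php
// Minimal sketch: grab the first link's href from hesspress.com with PHP's
// built-in DOM extension instead of Simple HTML DOM.
$handle = curl_init('http://www.hesspress.com');
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
curl_setopt($handle, CURLOPT_FOLLOWLOCATION, true); // follow redirects (e.g. to https)
$html = curl_exec($handle);
curl_close($handle);

if ($html === false) {
    die('Could not fetch the page.');
}

libxml_use_internal_errors(true); // silence warnings from imperfect markup
$doc = new DOMDocument();
$doc->loadHTML($html);

$link = $doc->getElementsByTagName('a')->item(0);
if ($link !== null) {
    echo $link->getAttribute('href');
}
?>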
Related
There are 6 images at this URL whose src attributes I want to get. My goal is to collect every image src with PHP, but only one image src comes back.
<?php
require_once('simple_html_dom/simple_html_dom.php');
$html = file_get_html('https://www.zara.com/tr/en/flatform-derby-shoes-with-reversible-fringe-p15318201.html?v1=5276035&v2=734142');
foreach ($html->find('img') as $element) {
    echo $element->src . '<br>';
}
?>
After looking at the Simple HTML DOM bug tracker, it seems they have some issues fetching attribute values that aren't real URLs.
Looking at the source of the page you're trying to fetch, only one image actually has a URL; the rest have inline images: src="data:image/png;base64,...".
I would suggest using PHP's own DOMDocument for this.
Here's a working solution (with comments):
<?php
// Get the HTML from the URL
$data = file_get_contents("https://www.zara.com/tr/en/flatform-derby-shoes-with-reversible-fringe-p15318201.html?v1=5276035&v2=734142");
$doc = new DOMDocument;
// DOMDocument throws a bunch of errors since the HTML isn't 100% valid
// (and for all HTML5-tags) but it will sort them out.
// Let's just tell it to fix it in silence.
libxml_use_internal_errors(true);
$doc->loadHTML($data);
libxml_clear_errors();
// Fetch all img-tags and get the 'src' attributes.
foreach ($doc->getElementsByTagName('img') as $img) {
    echo $img->getAttribute('src') . '<br />';
}
Demo: https://www.tehplayground.com/sh4yJ8CqIwypwkCa
Actually, those base64 strings are the images themselves, base64-encoded. As for the page you want to parse: although the images are base64-encoded, the <a> tags that are the parents of the images do contain the image URLs.
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_URL, "https://www.zara.com/tr/en/flatform-derby-shoes-with-reversible-fringe-p15318201.html?v1=5276035&v2=734142");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$data = curl_exec($ch);
curl_close($ch);
And now the data manipulation:
libxml_use_internal_errors(true);
$siteData = new DOMDocument();
$siteData->loadHTML($data);
$a = $siteData->getElementsByTagName("a"); // get the a tags
for ($i = 0; $i < $a->length; $i++) {
    if ($a->item($i)->getAttribute("class") == "_seoImg") { // _seoImg is the image link class
        echo $a->item($i)->getAttribute("href") . '<br/>';
    }
}
And the result is:
//static.zara.net/photos///2017/I/1/1/p/5318/201/040/3/w/560/5318201040_2_1_1.jpg?ts=1508311623896
//static.zara.net/photos///2017/I/1/1/p/5318/201/040/3/w/560/5318201040_1_1_1.jpg?ts=1508311816920
//static.zara.net/photos///2017/I/1/1/p/5318/201/040/3/w/560/5318201040_2_3_1.jpg?ts=1508311715728
//static.zara.net/photos///2017/I/1/1/p/5318/201/040/3/w/560/5318201040_2_10_1.jpg?ts=1508315639664
//static.zara.net/photos///2017/I/1/1/p/5318/201/040/3/w/560/5318201040_2_2_1.jpg?ts=1508311682567
I want the HTML code from the URL.
Actually, I want the following things from the data at one URL:
1. blog title
2. blog image
3. blog posted date
4. blog description or the actual blog text
I tried the code below, but with no success.
<?php
$c = curl_init('http://54.174.50.242/blog/');
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
//curl_setopt(... other options you want...)
$html = curl_exec($c);
if (curl_error($c))
die(curl_error($c));
// Get the status code
$status = curl_getinfo($c, CURLINFO_HTTP_CODE);
curl_close($c);
echo "Status :".$status; die;
?>
Please help me out in getting the necessary data from the URL (http://54.174.50.242/blog/).
Thanks in advance.
You are halfway there. Your cURL request is working, and the $html variable contains the blog page's source code. Now you need to extract the data you want from that HTML string. One way to do it is with the DOMDocument class.
Here is something you could start with:
$c = curl_init('http://54.174.50.242/blog/');
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($c);
$dom = new DOMDocument;
// disable errors on invalid html
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$list = $dom->getElementsByTagName('title');
$title = $list->length ? $list->item(0)->textContent : '';
// and so on ...
You can also simplify that by using the loadHTMLFile method of the DOMDocument class; that way you don't have to worry about the cURL boilerplate:
$dom = new DOMDocument;
// disable errors on invalid html
libxml_use_internal_errors(true);
$dom->loadHTMLFile('http://54.174.50.242/blog/');
$list = $dom->getElementsByTagName('title');
$title = $list->length ? $list->item(0)->textContent : '';
echo $title;
// and so on ...
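To go beyond the <title> tag and pull the blog title, image, posted date, and body text, DOMXPath is handy. The sketch below is only a starting point: the //article, h2, img, time and p selectors are assumptions and must be adjusted to the blog page's actual markup.
$xpath = new DOMXPath($dom);
// One iteration per post; every query below depends on the page's real markup.
foreach ($xpath->query('//article') as $post) {
    $title = $xpath->query('.//h2', $post)->item(0);
    $image = $xpath->query('.//img', $post)->item(0);
    $date  = $xpath->query('.//time', $post)->item(0);
    $text  = $xpath->query('.//p', $post)->item(0);

    echo $title ? trim($title->textContent) : '';
    echo '<br>';
    echo $image ? $image->getAttribute('src') : '';
    echo '<br>';
    echo $date ? trim($date->textContent) : '';
    echo '<br>';
    echo $text ? trim($text->textContent) : '';
    echo '<hr>';
}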
You should use the Simple HTML DOM parser and extract the HTML with something like:
$html = @file_get_html($url);
foreach ($html->find('article') as $element) {
    $title = $element->find('h2', 0)->plaintext;
    // ...
}
I am also using this; I hope it works for you.
I am trying to parse the text content from a given URL. Here is the code:
<?php
$url = 'http://stackoverflow.com/questions/12097352/how-can-i-parse-dynamic-content-from-a-web-page';
$content = file_get_contents($url);
echo $content; // This prints everything on the page, including images and markup
$text = escapeshellarg(strip_tags($content));
echo "<br />";
echo $text; // This still shows source code, not only the page's text content
?>
I want to get only the text written on the page, not the page's source code. Any ideas? I have already googled, but only the method above comes up everywhere.
You can use DOMDocument and DOMNode
$doc = new DOMDocument();
$doc->loadHTMLFile($url);
$xpath = new DOMXPath($doc);
foreach ($xpath->query("//script") as $script) {
    $script->parentNode->removeChild($script);
}
$textContent = $doc->textContent; //inherited from DOMNode
Instead of using xpath, you can also do:
$doc = new DOMDocument();
$doc->loadHTMLFile($url); // Load the HTML
foreach ($doc->getElementsByTagName('script') as $script) { // for all scripts
    $script->parentNode->removeChild($script); // remove the script and its content
                                               // so it will not appear in the text
}
$textContent = $doc->textContent; //inherited from DOMNode, get the text.
$content = strip_tags(file_get_contents($url));
This will remove the HTML tags coming from the page.
To remove HTML tags, use:
$text = strip_tags($text);
A simple cURL request will solve the issue. [TESTED]
<?php
$ch = curl_init("http://stackoverflow.com/questions/12097352/how-can-i-parse-dynamic-content-from-a-web-page");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); //Sorry forgot to add this
echo strip_tags(curl_exec($ch));
curl_close($ch);
?>
I have been told that the best way to parse HTML is through the DOM, like this:
<?php
$html = "<span>Text</span>";
$doc = new DOMDocument();
$doc->loadHTML($html);
$elements = $doc->getElementsByTagName("span");
foreach ($elements as $el) {
    echo $el->nodeValue . "\n";
}
?>
But in the above, the variable $html can't be a URL, or can it?
Wouldn't I have to use the function file_get_contents() to get the HTML of a page?
You have to use DOMDocument::loadHTMLFile to load HTML from a URL.
$doc = new DOMDocument();
$doc->loadHTMLFile($path);
DOMDocument::loadHTML parses a string of HTML.
$doc = new DOMDocument();
$doc->loadHTML(file_get_contents($path));
It can be, but it depends on allow_url_fopen being enabled in your PHP install. Basically all of the PHP file-based functions can accept a URL as a source (or destination). Whether such a URL makes sense is up to what you're trying to do.
e.g. doing file_put_contents('http://google.com') is not going to work, as you'd be attempting to do an HTTP upload to Google, and they're not going to allow you to replace their homepage...
but doing $dom->loadHTMLFile('http://google.com'); would work, and would pull Google's homepage into the DOM for processing.
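As a small illustration of that point, here is a sketch that checks allow_url_fopen before deciding how to fetch the page; the URL is just an example.
<?php
// Sketch: fetch a remote page with file_get_contents() when allow_url_fopen
// is enabled, otherwise fall back to cURL, then hand the HTML to DOMDocument.
$url = 'http://www.example.com/';

if (ini_get('allow_url_fopen')) {
    $html = file_get_contents($url);
} else {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);
    curl_close($ch);
}

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
echo $dom->getElementsByTagName('title')->item(0)->textContent;
?>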
If you're having trouble using the DOM, you could use cURL with a regular expression instead. For example:
$url = "http://www.davesdaily.com/";
$curl = curl_init();
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_URL, $url);
$input = curl_exec($curl);
$regexp = "<span class=comment>([^<]*)<\/span>";
if (preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        echo $match[1]; // the captured text between the span tags
    }
}
The script grabs the text between <span class=comment> and </span> and stores each match in the $matches array; $match[1] holds the captured text, so this should echo Entertainment.
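If you'd rather avoid the regex, the same spans can be pulled out with DOMDocument and DOMXPath. A sketch, reusing the $input variable from the cURL call above (the comment class name comes from that snippet):
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($input); // $input is the cURL response from above

$xpath = new DOMXPath($doc);
foreach ($xpath->query('//span[@class="comment"]') as $span) {
    echo $span->textContent . "\n";
}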
I am looking to display the HTML of another webpage inside my website.
Take this scenario:
I have a website that checks the availability of a hotel. But instead of hosting that hotel's images on my server, I simply cURL a specific page on the hotel's website that contains their images.
Can I grab anything from the HTML and display it on my website, using their HTML code but only the div(s) or images that I want to display?
I'm using this code, sourced from:
http://davidwalsh.name/download-urls-content-php-curl
For practice and argument's sake, let's try to display Google's logo from their homepage.
function get_data($url)
{
    $ch = curl_init();
    $timeout = 5;
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
$returned_content = get_data('http://www.google.com');
echo '<base href="http://www.google.com/" />';
echo $returned_content;
Thanks to @alex, I have started to play with DOMDocument from PHP's standard library. However, I have hit a snag.
function get_data($url)
{
    $ch = curl_init();
    $timeout = 5;
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
$url = "www.abc.net.au";
$html = get_data($url);
$dom = new DOMDocument;
@$dom->loadHTML($html); // suppress warnings from invalid markup
$logo = $dom->getElementById("abcLogo");
var_dump($logo);
This returns: object(DOMElement)[2]
How do I parse this further, or simply print/echo the contents of the div with that id?
Yes, run the resulting HTML through something like DOMDocument to extract the portions you require.
Once you have found a DOM element, it can be a bit tricky to get the HTML of the element itself (rather than just its contents).
You can get the XML value of a single element very easily with DOMDocument::saveXML:
echo $dom->saveXML($logo);
This may be good enough for you. I believe there is a change coming that will add this functionality to saveHTML as well.
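A quick sketch, assuming PHP 5.3.6 or later where saveHTML() does accept a node argument, using the $dom and $logo variables from above:
// PHP 5.3.6+: serialize just the element found by id (its outer HTML).
echo $dom->saveHTML($logo);

// If only the element's inner HTML is wanted, serialize its children instead.
$inner = '';
foreach ($logo->childNodes as $child) {
    $inner .= $dom->saveHTML($child);
}
echo $inner;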
echo $logo->nodeValue should work, because there can only be one element with a given id!