Why do I have an empty Array as result while fetching content? - php

<?php
$page = file_get_contents("https://www.google.com");
preg_match('#<div id="searchform" class="jhp big">(.*?)</div>#Uis', $page, $matches);
print_r($matches);
?>
I wrote the code above to grab a specific part of another web page (in this case Google). Unfortunately it is not working, and I'm not sure why, since the regular expression itself grabs everything inside the div.
Help would be appreciated!

According to the source of the page you pasted, no line with that structure exists. This is one of the reasons why parsing HTML with regular expressions is not recommended.
Using getElementById() seems to do what you are after:
<?php
$page = file_get_contents("https://www.google.com");
$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML triggers parser warnings otherwise
$doc->loadHTML($page);
$result = $doc->getElementById('searchform');
print_r($result);
?>
EDIT:
You could use the code below:
<?php
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'https://google.com');
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE); // without this, curl_exec() prints the page and returns TRUE
$page = curl_exec($curl);
curl_close($curl);
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($page);
// echo $page; // uncomment to inspect the raw HTML
$result = $doc->getElementById('searchform');
print_r($result);
?>
You might need to refer to this question, though, since you might need to change some cURL settings.

DOMXPath would be a better choice for you; here is an example.
<?php
$content = file_get_contents('https://www.google.com');
//gets rid of a few things that domdocument hates
$content = preg_replace("/&(?!(?:apos|quot|[gl]t|amp);|#)/", '&amp;', $content);
$doc = new DOMDocument();
libxml_use_internal_errors(true); // suppress warnings about malformed HTML
$doc->loadHTML($content);
$xpath = new DOMXPath($doc);
$items = $xpath->query('//div[@id="searchform"]');
foreach ($items as $item) {
    echo $doc->saveHTML($item); // print the matched div, tags included
}
?>

Related

How to get the HTML from a URL in PHP?

I want the HTML code from the URL.
Actually, I want the following things from the data at one URL:
1. blog title
2. blog image
3. blog posted date
4. blog description or actual blog text
I tried the code below, but with no success.
<?php
$c = curl_init('http://54.174.50.242/blog/');
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
//curl_setopt(... other options you want...)
$html = curl_exec($c);
if (curl_error($c))
die(curl_error($c));
// Get the status code
$status = curl_getinfo($c, CURLINFO_HTTP_CODE);
curl_close($c);
echo "Status :".$status; die;
?>
Please help me out to get the necessary data from the URL(http://54.174.50.242/blog/).
Thanks in advance.
You are halfway there. Your cURL request is working and the $html variable contains the blog page's source code. Now you need to extract the data you need from the HTML string. One way to do it is by using the DOMDocument class.
Here is something you could start with:
$c = curl_init('http://54.174.50.242/blog/');
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($c);
$dom = new DOMDocument;
// disable errors on invalid html
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$list = $dom->getElementsByTagName('title');
$title = $list->length ? $list->item(0)->textContent : '';
// and so on ...
You can also simplify that by using the loadHTMLFile method on the DOMDocument class; that way you don't have to worry about all the cURL boilerplate:
$dom = new DOMDocument;
// disable errors on invalid html
libxml_use_internal_errors(true);
$dom->loadHTMLFile('http://54.174.50.242/blog/');
$list = $dom->getElementsByTagName('title');
$title = $list->length ? $list->item(0)->textContent : '';
echo $title;
// and so on ...
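To get at the title, image, date, and text of each post, XPath queries over the same DOMDocument are the natural next step. Here is a minimal sketch; the article tag and the class names used in the queries are assumptions about the blog's markup, so inspect the page source and adjust them:
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTMLFile('http://54.174.50.242/blog/');
$xpath = new DOMXPath($dom);
// selectors below are hypothetical; match them to the actual markup
foreach ($xpath->query('//article') as $post) {
    $title = $xpath->evaluate('string(.//h2)', $post);
    $image = $xpath->evaluate('string(.//img/@src)', $post);
    $date  = $xpath->evaluate('string(.//*[contains(@class, "date")])', $post);
    $text  = $xpath->evaluate('string(.//*[contains(@class, "entry")])', $post);
    echo trim($title), "\n", $image, "\n", trim($date), "\n", trim($text), "\n\n";
}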
You should use the Simple HTML DOM parser and extract the HTML using something like:
$html = @file_get_html($url);
foreach ($html->find('article') as $article) {
    $title = $article->find('h2', 0)->plaintext;
    // ...
}
I am also using this; hope it works for you.
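Note that Simple HTML DOM is a third-party library, so file_get_html() only exists after you include simple_html_dom.php. A sketch of the full setup, with a guard in case the page or the (assumed) article/h2 markup isn't there:
include('simple_html_dom.php'); // from simplehtmldom.sourceforge.net
$html = @file_get_html('http://54.174.50.242/blog/');
if ($html) {
    foreach ($html->find('article') as $article) {
        $h2 = $article->find('h2', 0);
        if ($h2) {
            echo $h2->plaintext, "\n";
        }
    }
}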

Parsing only text content from a URL

I am trying to parse the text content from the URL given. Here is the code:
<?php
$url = 'http://stackoverflow.com/questions/12097352/how-can-i-parse-dynamic-content-from-a-web-page';
$content = file_get_contents($url);
echo $content; // This outputs everything on the page, including images and markup
$text = escapeshellarg(strip_tags($content));
echo "</br>";
echo $text; // This gives source code too, not only the text content of the page
?>
I want to get only the text displayed on the page, not the page source code. Any idea how to do this? I already googled, but only the method above turns up everywhere.
You can use DOMDocument and DOMNode
$doc = new DOMDocument();
libxml_use_internal_errors(true); // ignore warnings from malformed HTML
$doc->loadHTMLFile($url);
$xpath = new DOMXPath($doc);
foreach($xpath->query("//script") as $script) {
$script->parentNode->removeChild($script);
}
$textContent = $doc->textContent; //inherited from DOMNode
Instead of using xpath, you can also do:
$doc = new DOMDocument();
$doc->loadHTMLFile($url); // Load the HTML
foreach($doc->getElementsByTagName('script') as $script) { // for all scripts
$script->parentNode->removeChild($script); // remove script and content
// so it will not appear in text
}
$textContent = $doc->textContent; //inherited from DOMNode, get the text.
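Note that style and noscript blocks leak into textContent the same way scripts do, so you may want to strip those too and collapse the leftover whitespace. A small sketch of that extra cleanup (the whitespace handling here is one reasonable choice, not the only one):
$url = 'http://stackoverflow.com/questions/12097352/how-can-i-parse-dynamic-content-from-a-web-page';
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTMLFile($url);
foreach (array('script', 'style', 'noscript') as $tag) {
    // copy the live node list first, since removing nodes while iterating skips elements
    foreach (iterator_to_array($doc->getElementsByTagName($tag)) as $node) {
        $node->parentNode->removeChild($node);
    }
}
// collapse runs of whitespace into single spaces
echo trim(preg_replace('/\s+/', ' ', $doc->textContent));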
$content = strip_tags(file_get_contents($url));
This will remove the HTML tags coming from the page. (Note the order: fetch the page first, then strip the tags.)
To remove html tag use:
$text = strip_tags($text);
A simple cURL will solve the issue. [TESTED]
<?php
$ch = curl_init("http://stackoverflow.com/questions/12097352/how-can-i-parse-dynamic-content-from-a-web-page");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); //Sorry forgot to add this
echo strip_tags(curl_exec($ch));
curl_close($ch);
?>

parsing html through file_get_contents()

I have been told that the best way to parse HTML is through DOM, like this:
<?php
$html = "<span>Text</span>";
$doc = new DOMDocument();
$doc->loadHTML( $html);
$elements = $doc->getElementsByTagName("span");
foreach( $elements as $el)
{
echo $el->nodeValue . "\n";
}
?>
But in the above, the variable $html can't be a URL, or can it?
Wouldn't I have to use the function file_get_contents() to get the HTML of a page?
You have to use DOMDocument::loadHTMLFile to load HTML from a URL.
$doc = new DOMDocument();
$doc->loadHTMLFile($path);
DOMDocument::loadHTML parses a string of HTML.
$doc = new DOMDocument();
$doc->loadHTML(file_get_contents($path));
It can be, but it depends on allow_url_fopen being enabled in your PHP install. Basically, all of the PHP file-based functions can accept a URL as a source (or destination). Whether such a URL makes sense depends on what you're trying to do.
For example, doing file_put_contents('http://google.com', $data) is not going to work, as you'd be attempting an HTTP upload to Google, and they're not going to allow you to replace their homepage...
But doing $dom->loadHTMLFile('http://google.com'); would work, and would suck Google's homepage into DOM for processing (note that it's loadHTMLFile for URLs; loadHTML only parses a string you already have).
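If you're not sure whether your host has allow_url_fopen enabled, you can check at runtime and fall back to cURL; a minimal sketch:
if (ini_get('allow_url_fopen')) {
    $html = file_get_contents('http://google.com');
} else {
    // URL wrappers are disabled; fetch with cURL instead
    $c = curl_init('http://google.com');
    curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($c);
    curl_close($c);
}
$dom = new DOMDocument();
@$dom->loadHTML($html); // @ hides warnings from real-world markup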
If you're having trouble using DOM, you could use CURL to parse. For example:
$url = "http://www.davesdaily.com/";
$curl = curl_init();
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_URL, $url);
$input = curl_exec($curl);
$regexp = "<span class=comment>([^<]*)<\/span>";
if (preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        echo $match[1]; // the captured text between the tags
    }
}
The script grabs the text between <span class=comment> and </span>; each $match[1] holds one captured string. This should echo Entertainment.
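That said, and in line with the other answers here, the same extraction is more robust with DOM and XPath than with a regex. A sketch of the equivalent, assuming the spans really carry class="comment":
$dom = new DOMDocument();
@$dom->loadHTML($input); // $input as fetched by the cURL snippet above
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//span[@class="comment"]') as $span) {
    echo $span->textContent;
}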

using curl with simplehtmldom

Recently our hosting provider disabled allow_url_fopen, and it seems simplehtmldom needs it turned on. I saw a workaround for allow_url_fopen on this site, simplehtmldom.sourceforge.net...aq.htm#hosting: "Use curl to get the page, then call "str_get_dom" to create DOM object". But still no luck. Can you tell me if I did it properly, or am I missing something?
<?php
include('simple_html_dom.php');
include('phpQuery.php');
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'http://www.weather.bm/');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
$str = curl_exec($curl);
curl_close($curl);
$html= str_get_html($str);
?>
<?php
$element = $html->find("div");
$element[66]->class = "mwraping66";
foreach ($html->find('.mwraping66 img') as $e) {
    $doc = phpQuery::newDocumentHTML($e);
    $containers = pq('.mwraping66', $doc);
    foreach ($containers as $container) {
        $div = pq('img', $container);
        $div->eq(1)->removeAttr('style')->addClass('thumbnail')->html(pq('img', $div->eq(1))->removeAttr('height')->removeAttr('width')->removeAttr('alt'));
    }
    print $doc;
}
?>
<?php
$element = $html->find("div");
$element[31]->class = "mwraping31";
foreach ($html->find('.mwraping31') as $e) {
    echo $e->plaintext;
}
?>
compared to:
<?php
include('simple_html_dom.php');
include ('phpQuery.php');
// Create DOM from URL
$html = file_get_html('www.weather.bm/');
?>
<?php
$element = $html->find("div");
$element[66]->class = "mwraping66";
foreach($html->find('.mwraping66 img') as $e).....
Thank you for your help.
I know this is too late to answer this query, but I have found similar questions and answers in this forum; this is the link to that: Using simple html dom. I am not sure whether this will answer your query, because I am also new to DOM. Try using this modified simple_html_dom.php file (http://webarto.com/82/php-simple-html-dom-curl); it uses cURL instead of file_get_contents. This file is working for me, and its usage is the same as the original simple_html_dom.php.
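For reference, the essential shape of the FAQ workaround is just: fetch with cURL, then hand the string to str_get_html and check that parsing succeeded. A minimal sketch:
include('simple_html_dom.php');
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'http://www.weather.bm/');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
$str = curl_exec($curl);
curl_close($curl);
// build the DOM from the fetched string instead of letting the library fetch the URL
$html = str_get_html($str);
if ($html === false) {
    die('could not parse the fetched page');
}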

Get div and the correct close tag preg

Now, preg has always been a tool I like, but I can't figure out for the life of me whether what I want to do is possible, and how to do it is going over my head.
What I want is for preg_match to return a div's innerHTML. The problem is that the div I'm trying to read has more divs in it, and my pattern keeps closing on the first closing tag it finds.
Here is my actual code:
$scrape_address = "http://isohunt.com/torrent_details/133831593/98e034bd6382e0f4ecaa9fe2b5eac01614edc3c6?tab=summary";
$ch = curl_init($scrape_address);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, '1');
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, "");
$data = curl_exec($ch);
preg_match('% <div id="torrent_details">(.*)</div> %six', $data, $match);
print_r($match);
This has been updated with TomcatExodus's help.
Live at :: http://megatorrentz.com/beta/details.php?hash=98e034bd6382e0f4ecaa9fe2b5eac01614edc3c6
<?php
$scrape_address = "http://isohunt.com/torrent_details/133831593/98e034bd6382e0f4ecaa9fe2b5eac01614edc3c6?tab=summary";
$ch = curl_init($scrape_address);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, '1');
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, "");
$data = curl_exec($ch);
$domd = new DOMDocument();
libxml_use_internal_errors(true);
$domd->loadHTML($data);
libxml_use_internal_errors(false);
$div = $domd->getElementById("torrent_details");
if ($div) {
$dom2 = new DOMDocument();
$dom2->appendChild($dom2->importNode($div, true));
echo $dom2->saveHTML();
} else {
echo "Has no element with the given ID\n";
}
Using regular expressions often leads to problems when parsing markup documents.
XPath version - independent of the source layout. The only thing you need is a div with that id.
<?php
$dom = new DOMDocument();
@$dom->loadHTMLFile($url);
$xp = new DOMXPath($dom);
$result = $xp->query("//*[@id = 'torrent_details']");
if ($result->length) {
    $div = $result->item(0);
    $out = new DOMDocument();
    $out->appendChild($out->importNode($div, true));
    echo $out->saveHTML();
} else {
    echo "No such id";
}
?>
And this is the fix for Maerlyn's solution. It didn't work because getElementById() wants a DTD with the id attribute specified. I mean, you can always build a document with "apple" as the record id, so you need something that says "id" really is the id attribute for this tag.
<?php
$domd = new DOMDocument();
$domd->validateOnParse = true;
@$domd->loadHTML($data);
//this doesn't work as the DTD is not specified
//or the specified id attribute is not the attribute called "id"
//$div = $domd->getElementById("torrent_details");
/*
* workaround found here: https://fosswiki.liip.ch/display/BLOG/GetElementById+Pitfalls
* set the "id" attribute as the real id
*/
$elements = $domd->getElementsByTagName('div');
if (!is_null($elements)) {
foreach ($elements as $element) {
//try-catch needed because of elements with no id
try{
$element->setIdAttribute('id', true);
}catch(Exception $e){}
}
}
//now it works
$div = $domd->getElementById("torrent_details");
//Print its content or error
if ($div) {
$dom2 = new DOMDocument();
$dom2->appendChild($dom2->importNode($div, true));
echo $dom2->saveHTML();
} else {
echo "Has no element with the given ID\n";
}
?>
Both of the solutions work for me.
You can do this:
/<div id="torrent_details"[^>]*>(.*)<\/div>/is
Which would give you the largest possible innerHTML.
You cannot. I will not link to the famous question, because I dislike the pointless drivel on top. But still, regular expressions are unfit for matching nested structures.
You can use some trickery, but this is neither reliable, nor necessarily fast:
preg_match_all('#<div id="1">((<div>.*?</div>|.)*?)</div>#ims', $html, $matches);
Your regex had a problem due to the /x flag: under /x, literal whitespace in the pattern is ignored, so the space inside the opening <div id="..."> tag never matched. You also used the wrong assertion notation.
preg_match_all('% <div \s+ id="torrent_details">(?<innerHtml>.*)</div> %six', $html, $match);
echo $match['innerHtml'];
That one will work, but you should only need preg_match, not preg_match_all; if the page is written well, there should be only one instance of id="torrent_details" on it. (With preg_match_all, $match['innerHtml'] above is actually an array of captures rather than a string.)
I'm retracting my answer. This will not work properly. Use DOM for navigating the document.
Haha, did it with a bit of tampering. Thanks for the DOMDocument idea; I just had to use simplexml_import_dom:
$ch = curl_init($scrape_address);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, '1');
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, "");
$data = curl_exec($ch);
$doc = new DOMDocument();
$doc->strictErrorChecking = FALSE;
libxml_use_internal_errors(true); // collect parse errors instead of printing them
$doc->loadHTML($data);
$xml = simplexml_import_dom($doc);
print_r($xml->body->table->tr->td->table[2]->tr->td[0]->span[0]->div);
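One caveat on that last line: the chained ->body->table->tr->... path depends on the exact nesting of the page and will break as soon as the layout shifts. Querying for the div by id, as in the answers above, is more robust; a sketch, assuming the target keeps id="torrent_details":
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//div[@id="torrent_details"]');
if ($nodes->length) {
    echo $doc->saveHTML($nodes->item(0)); // print just that div
}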
