I am trying to learn how to scrape data from other websites, so I started by creating a small HTML file.
domhtml.php :
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<body>
<div id="mango">
This is the mango div. It has some text and a form too.
<form>
<input type="text" name="first_name" value="Yahoo" />
<input type="text" name="last_name" value="Bingo" />
</form>
<table class="inner">
<tr><td>Happy</td><td>Sky</td></tr>
</table>
</div>
<table id="data" class="outer">
<tr><td>Happy1</td><td>Sky</td></tr>
<tr><td>Happy2</td><td>Sky</td></tr>
<tr><td>Happy3</td><td>Sky</td></tr>
<tr><td>Happy4</td><td>Sky</td></tr>
<tr><td>Happy5</td><td>Sky</td></tr>
</table>
</body>
</html>
extract.php :
<?php
$ch = curl_init("http://192.168.0.198/projects/domhtml.php");
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, true);
$cl = curl_exec($ch);
$dom = new DOMDocument();
$dom->loadHTML($cl);
$dom->validate();
$title = $dom->getElementById("mango");
//var_dump($title);exit;
//$title = $dom->saveXML($title);
echo '<pre>';
print_r($title);
?>
But it returns output :
DOMElement Object
(
)
Why is it empty? What else needs to be done? I also tried the solution from "PHP Dom not retrieving element", but it returns the same result.
Edit :
OK, as you all suggested, I have done this:
$ch = curl_init("http://192.168.0.198/shopclues/domhtml.php");
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, true);
$cl = curl_exec($ch);
$dom = new DOMDocument();
$dom->loadHTML($cl);
$dom->validate();
$title = $dom->getElementById("data");
//var_dump($title);exit;
$title = $dom->saveXML($title);
echo '<pre>';
print_r($title);
So now it is printing :
Happy1 Sky
Happy2 Sky
Happy3 Sky
Happy4 Sky
Happy5 Sky
I want to know how many tr tags there are so that I can store the value of each tr in a variable. In other words, how can I loop over the rows and store their values?
Thanks in advance.
The element is not actually empty. The debug output of the DOM classes (what print_r() and var_dump() show) has been steadily improving:
http://codepad.viper-7.com/hw9UKg
Run the code in the snippet above under different versions of PHP and you'll see the difference between 5.3.3 and 5.4.33.
For the second part of your question, there are many ways to do what you want. I will show you one:
$dom = new DOMDocument();
// I used a different URL
$dom->loadHtmlFile("http://192.168.0.198/shopclues/domhtml.php");
$list = $dom->getElementById("data")->childNodes;
print_r($list->length); // outputs 5 for me.
$list is a DOMNodeList which implements Traversable so you can loop over it to get the values. For more information, check:
http://php.net/manual/en/class.domnodelist.php
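To make the loop concrete, here is a minimal sketch. It assumes the page has the same shape as the "data" table from the question and uses an inline HTML string instead of fetching the URL, so the $html and $rows variables are only for illustration. It asks for the tr elements by tag name rather than walking childNodes, which avoids having to filter out any non-element nodes.

```php
<?php
// Sample markup standing in for the fetched page (same shape as the "data" table).
$html = '<html><body><table id="data" class="outer">'
      . '<tr><td>Happy1</td><td>Sky</td></tr>'
      . '<tr><td>Happy2</td><td>Sky</td></tr>'
      . '</table></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$rows = [];
$table = $dom->getElementById('data');

// getElementsByTagName() returns only <tr> elements below the table.
foreach ($table->getElementsByTagName('tr') as $tr) {
    $cells = [];
    foreach ($tr->getElementsByTagName('td') as $td) {
        $cells[] = trim($td->textContent);
    }
    $rows[] = $cells;               // one array of cell values per row
}

echo count($rows), " rows\n";
print_r($rows);
```

After the loop, $rows holds one array per tr, so count($rows) gives the number of rows and $rows[0][0] is the first cell of the first row.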
For more complex queries, you may want to look into DOMXPath:
http://php.net/manual/en/class.domxpath.php
It would also be worth reading through the methods available on DOMDocument and DOMNode:
http://php.net/manual/en/class.domdocument.php
http://php.net/manual/en/class.domnode.php
Related
<?php
$page = file_get_contents("https://www.google.com");
preg_match('#<div id="searchform" class="jhp big">(.*?)</div>#Uis', $page, $matches);
print_r($matches);
?>
The code above is supposed to grab a specific part of another web page (in this case Google). Unfortunately it is not working, and I'm not sure why, since the regular expression should capture everything inside the div.
Help would be appreciated!
According to the source of the page you are fetching, there is no line with that exact structure. This is one of the reasons why parsing HTML with regular expressions is not recommended.
Using the getElementById() seems to do what you are after:
<?php
$page = file_get_contents("https://www.google.com");
$doc = new DOMDocument();
$doc->loadHTML($page);
$result = $doc->getElementById('searchform');
print_r($result);
?>
EDIT:
You could use the code below:
<?php
$curl = curl_init();
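curl_setopt($curl, CURLOPT_URL, 'https://google.com');
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, TRUE);
// Without this option curl_exec() prints the page and returns true instead of the HTML
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$page = curl_exec($curl);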
curl_close($curl);
$doc = new DOMDocument();
$doc->loadHTML($page);
echo($page);
$result = $doc->getElementById('searchform');
print_r($result);
?>
You might need to refer to this question though since you might need to change some settings.
DOMXPath would be a better choice for you. Here is an example:
<?php
$content = file_get_contents('https://www.google.com');
// escapes bare ampersands, which DOMDocument complains about
$content = preg_replace("/&(?!(?:apos|quot|[gl]t|amp);|#)/", '&amp;', $content);
$doc = new DOMDocument();
$doc->loadHTML($content);
$xpath = new DOMXPath($doc);
$item = $xpath->query('//div[@id="searchform"]');
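The query returns a DOMNodeList, so the matching div still has to be pulled out of it. A small sketch of that step, run against a made-up inline page rather than the live Google markup (the input element inside the div is an assumption for illustration):

```php
<?php
// Hypothetical stand-in for the downloaded page.
$content = '<html><body>'
         . '<div id="searchform" class="jhp big"><input name="q"></div>'
         . '</body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc->loadHTML($content);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$items = $xpath->query('//div[@id="searchform"]');

// Each match is a DOMElement; saveHTML($node) serializes just that node.
foreach ($items as $item) {
    echo $doc->saveHTML($item), "\n";
}
```

If nothing matches, $items->length is 0 and the loop simply does not run, which is easier to handle than a failed regular expression.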
I want the HTML code from the URL.
Actually I want following things from the data at one URL.
1. blog title
2. blog image
3. blog posted date
4. blog description or actual blog text
I tried below code but no success.
<?php
$c = curl_init('http://54.174.50.242/blog/');
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
//curl_setopt(... other options you want...)
$html = curl_exec($c);
if (curl_error($c))
die(curl_error($c));
// Get the status code
$status = curl_getinfo($c, CURLINFO_HTTP_CODE);
curl_close($c);
echo "Status :".$status; die;
?>
Please help me get the necessary data from the URL (http://54.174.50.242/blog/).
Thanks in advance.
You are halfway there. Your cURL request is working and the $html variable contains the source code of the blog page. Now you need to extract the data you need from that HTML string. One way to do it is with the DOMDocument class.
Here is something you could start with:
$c = curl_init('http://54.174.50.242/blog/');
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($c);
$dom = new DOMDocument;
// disable errors on invalid html
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$list = $dom->getElementsByTagName('title');
$title = $list->length ? $list->item(0)->textContent : '';
// and so on ...
You can also simplify that by using the loadHTMLFile method of the DOMDocument class, so you don't have to write the cURL boilerplate:
$dom = new DOMDocument;
// disable errors on invalid html
libxml_use_internal_errors(true);
$dom->loadHTMLFile('http://54.174.50.242/blog/');
$list = $dom->getElementsByTagName('title');
$title = $list->length ? $list->item(0)->textContent : '';
echo $title;
// and so on ...
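For the "and so on" part, here is a rough sketch of how the four fields could be collected per post. The real markup of the blog is not shown in the question, so the structure below (a div with class "post" containing an h2, an img, a span with class "date" and a div with class "body") is purely an assumption; the XPath expressions would need to be adapted to the actual page.

```php
<?php
// Hypothetical markup; the real blog page may be structured differently.
$html = <<<HTML
<html><body>
<div class="post">
  <h2>First post</h2>
  <img src="/img/first.jpg" alt="">
  <span class="date">2015-01-10</span>
  <div class="body">Text of the first post.</div>
</div>
<div class="post">
  <h2>Second post</h2>
  <img src="/img/second.jpg" alt="">
  <span class="date">2015-02-03</span>
  <div class="body">Text of the second post.</div>
</div>
</body></html>
HTML;

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // disable errors on invalid html
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$posts = [];

// One iteration per post; the inner expressions are evaluated relative to $post.
foreach ($xpath->query('//div[@class="post"]') as $post) {
    $posts[] = [
        'title' => trim($xpath->evaluate('string(.//h2)', $post)),
        'image' => $xpath->evaluate('string(.//img/@src)', $post),
        'date'  => trim($xpath->evaluate('string(.//span[@class="date"])', $post)),
        'text'  => trim($xpath->evaluate('string(.//div[@class="body"])', $post)),
    ];
}

print_r($posts);
```

Using evaluate() with string() returns an empty string when a field is missing, so a post without an image does not cause an error.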
You should use Simple HTML DOM Parser and extract the HTML like this:
$html = @file_get_html($url);
foreach ($html->find('article') as $element) {
    $title = $element->find('h2', 0)->plaintext;
    // ....
}
I am using this myself; I hope it works for you.
I am trying to capture the first instance of particular elements from an object. I have an object $doc and would like to get the values of the following.
id, url, alias, description and label i.e. specifically:
variable1 - Q95,
variable2 - //www.wikidata.org/wiki/Q95,
variable3 - Google Inc.,
variable4 - American multinational Internet and technology corporation,
variable5 - Google
I've made some progress getting the $jsonArr string however I'm not sure this is the best way to go, and if so I'm not sure how to progress anyway.
Please advise as to the best way to get these. Please see my code below:
<HTML>
<body>
<form method="post">
Search: <input type="text" name="q" value="Google"/>
<input type="submit" value="Submit">
</form>
<?php
if (isset($_POST['q'])) {
$search = $_POST['q'];
$errors = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTMLFile("https://www.wikidata.org/w/api.php?action=wbsearchentities&search=$search&format=json&language=en");
libxml_clear_errors();
libxml_use_internal_errors($errors);
var_dump($doc);
echo "<p>";
$jsonArr = $doc->documentElement->nodeValue;
$jsonArr = (string)$jsonArr;
echo $jsonArr;
}
?>
</body>
</HTML>
Since the response to your API request is JSON, not HTML or XML, it is more appropriate to use cURL or PHP's stream functions to perform the HTTP request. Even something as simple as file_get_contents will do.
For example, using cURL:
// Make the request
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://www.wikidata.org/w/api.php?action=wbsearchentities&search=google&format=json&language=en");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($ch);
curl_close($ch);
// Decode the string into an appropriate PHP type
$contents = json_decode($output);
// Navigate the object
$contents->search[0]->id; // "Q95"
$contents->search[0]->url; // "//www.wikidata.org/wiki/Q95"
$contents->search[0]->aliases[0]; // "Google Inc."
You can use var_dump to inspect the $contents and traverse it like you would any PHP object.
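To collect all five values from the question in one go, something like the sketch below could work. The firstEntity() helper is a made-up name, and the $sample string is a trimmed stand-in built from the values quoted in this thread, not a full API response. It decodes to an associative array and returns null when there are no results.

```php
<?php
// Hypothetical helper: returns the fields of the first search hit, or null.
function firstEntity(string $json): ?array
{
    $contents = json_decode($json, true);   // true = associative arrays
    if (!is_array($contents) || empty($contents['search'])) {
        return null;                         // invalid JSON or no results
    }

    $first = $contents['search'][0];

    return [
        'id'          => $first['id'] ?? null,
        'url'         => $first['url'] ?? null,
        'alias'       => $first['aliases'][0] ?? null,
        'description' => $first['description'] ?? null,
        'label'       => $first['label'] ?? null,
    ];
}

// Trimmed sample using the values mentioned in the question.
$sample = '{"search":[{"id":"Q95","url":"//www.wikidata.org/wiki/Q95",'
        . '"label":"Google","description":"American multinational Internet and technology corporation",'
        . '"aliases":["Google Inc."]}]}';

print_r(firstEntity($sample));
```

In real use, $sample would be replaced by the $output string returned from curl_exec().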
I have this PHP code:
$url = "http://www.bbc.co.uk/";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch);
$doc = new DOMDocument();
$doc->validateOnParse = true;
@$doc->loadHtml($data);
// I want to get the element's id; all I know is that the element contains the text "Business"
echo $doc->getElementById($id)->textContent;
Let's assume there is an element on a page that I want to keep track of. I don't know its id, just its text content at that time. I want to get the id so that I can read the text content of the same element next week or next month, whether or not the text has changed.
Have a look at this project:
http://code.google.com/p/phpquery/
With this you can use CSS3 selectors like "div:contains('foo')" to find elements containing a text.
Update: An example
The task: Find the elements containing "find me" inside "test.html":
<html>
<head></head>
<body>
<div>hello</div>
<div>find me!</div>
<div>and find me!</div>
<div>another one</div>
</body>
</html>
The PHP script:
<?php
include "phpQuery-onefile.php";
phpQuery::newDocumentFileXHTML('test.html');
$domNodes = pq('div:contains("find me")');
foreach($domNodes as $domNode) {
/** @var DOMNode $domNode */
echo $domNode->textContent . PHP_EOL;
}
The result of running it:
php test.php
find me!
and find me!
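If you would rather not add a library, the same idea can be sketched with the built-in DOMXPath class. The page below is a made-up stand-in for the fetched document and the ids in it are invented for illustration. The expression looks for elements that have an id attribute and whose own text contains "Business", then reads the id so it can be stored for later lookups.

```php
<?php
// Hypothetical stand-in for the fetched page; the ids are made up.
$data = '<html><body>'
      . '<div id="nav"><a id="biz-link" href="/business">Business</a></div>'
      . '<p id="intro">Welcome</p>'
      . '</body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc->loadHTML($data);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// text() restricts the match to the element's own text nodes,
// so ancestors such as the surrounding div are not returned.
$nodes = $xpath->query('//*[@id][text()[contains(., "Business")]]');

$ids = [];
foreach ($nodes as $node) {
    $ids[] = $node->getAttribute('id');
}
print_r($ids);

// Later runs can look the element up by the stored id.
if ($ids) {
    echo $doc->getElementById($ids[0])->textContent, PHP_EOL;
}
```

This only helps when the element actually carries an id attribute; if it does not, you would have to store some other locator, such as its position in the tree.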
I'm using the following function:
function callframe(){
$ch = curl_init("file.html");
curl_setopt($ch, CURLOPT_HEADER, 0);
echo curl_exec($ch);
curl_close($ch);
}
Then I call callframe() and the content appears on my PHP page.
Let's say this is the content of file.html:
<html>
<body>
[...]
<td class="bottombar" valign="middle" height="20" align="center" width="1%" nowrap>
[...]
Link
[...]
</body>
</html>
How can I delete the <td class="bottombar" valign="middle" height="20" align="center" width="1%" nowrap> line?
How can I delete one attribute, such as height, or change align from center to left?
How can I insert 'http://www.whatever.com/' before link.html in the href of my <a> tag?
Thanks for your help!
PS: You may ask why I don't change file.html directly. Well, then there would be no question.
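To get you started, instead of just echoing the result of curl_exec, store it first so you can work with it (this requires CURLOPT_RETURNTRANSFER to be set to true, otherwise curl_exec prints the page itself):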
$html = curl_exec($ch);
Now load it into a DOMDocument, which you can then use for parsing and making changes:
$dom = new DOMDocument();
$dom->loadHTML($html);
Now, for the first task (removing that line), it would look something like:
//
// rough example, not just copy-paste code
//
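$tds = $dom->getElementsByTagName('td'); // $tds = DOMNodeList
// Copy the live list into an array first: removing nodes while
// iterating the DOMNodeList itself can cause elements to be skipped.
foreach (iterator_to_array($tds) as $td) // $td = DOMNode
{
// validate this $td is the one you want to delete, then
// call something like:
$parent = $td->parentNode;
$parent->removeChild($td);
}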
Perform any other kinds of processing as well.
Then, finally call:
echo $dom->saveHTML();
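The other two tasks (dropping or changing an attribute, and prefixing the link) can be sketched the same way. The inline HTML below is a guess at what file.html looks like around the td, including an a tag with href="link.html" as described in the question, so treat the markup as an assumption.

```php
<?php
// Assumed shape of file.html around the cell in question.
$html = '<html><body><table><tr>'
      . '<td class="bottombar" valign="middle" height="20" align="center" width="1%" nowrap>'
      . '<a href="link.html">Link</a></td>'
      . '</tr></table></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom->loadHTML($html);
libxml_clear_errors();

// Remove one attribute and change another on the matching cell.
foreach ($dom->getElementsByTagName('td') as $td) {
    if ($td->getAttribute('class') !== 'bottombar') {
        continue;
    }
    $td->removeAttribute('height');
    $td->setAttribute('align', 'left');
}

// Prefix relative links; links that already start with a scheme are left alone.
foreach ($dom->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    if ($href !== '' && !preg_match('#^[a-z][a-z0-9+.-]*:#i', $href)) {
        $a->setAttribute('href', 'http://www.whatever.com/' . ltrim($href, '/'));
    }
}

echo $dom->saveHTML();
```

Changing attributes while looping is safe; only adding or removing the nodes themselves affects the live list.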
You can store the output in a variable and use string functions to make the changes:
function callframe(){
$ch = curl_init("file.html");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
$result = curl_exec($ch);
$result = str_replace("link.html","http://www.whatever.com/link.html", $result);
// other replacements as required
curl_close($ch);
// hand the modified HTML back so the caller can echo it
return $result;
}
This is how I did it, for example to change an option field (for a search string).
The code below takes the second entry of my option list and replaces it with what I wanted.
require('simple_html_dom.php');
$html = file_get_html('fileorurl');
$e = $html->find('option', 0)->next_sibling();
$e->outertext = '<option value="WTR">Tradition</option>';
Then:
echo $html;