Fetch the attributes using PHP crawler - php

I am trying to fetch the name,address and location from crawling of a website . Its a single page and dont want any other thing other than this. I am using the below code.
<?php
include 'simple_html_dom.php';
$html = "http://www.phunwa.com/phone/0191/2604233";
$dom = new DomDocument();
$dom->loadHtml($html);
$xpath = new DomXpath($dom);
$div = $xpath->query('//*[#class="address-tags"]')->item(0);
for($i=0; $i < $div->length; $i++ )
{
print "nodename=".$div->item( $i )->nodeName;
print "\t";
print "nodevalue : ".$div->item( $i )->nodeValue;
print "\r\n";
echo $link->getElementsByTagName("<p>");
}
?>
The website html source code is
<div class="address-tags">
<p><strong>Name:</strong> RAJ GOPAL SINGH</p>
<p><strong>Address:</strong> R/O BARNAI NETARKOTHIAN, P.O.MUTHI TEH.& DISTT.JAMMU,X, 181206</p>
<p><strong>Location:</strong> JAMMU, Jammu & Kashmir, India</p>
<p><strong>Other Numbers:</strong> 01912604233 | +911912604233 | +91-191-2604233</p>
Can somone please help me get the three attributes as output. Nothing is echop on the page as of now.
Thanks alot .

you need $dom->load($html); instead of $dom->loadHtml($html);. After doing this you wil; find your html is not well formed, so $xpath stay empty.
Maybe try something like:
$html = file_get_contents('http://www.phunwa.com/phone/0191/2604233');
$name = preg_replace('/(.*)(<p><strong>Name:<\/strong> )([^<]+)(<\/p>)(.*)/mis','$3',$html);
$address = preg_replace('/(.*)(<p><strong>Address:<\/strong> )([^<]+)(<\/p>)(.*)/mis','$3',$html);
$location = preg_replace('/(.*)(<p><strong>Location:<\/strong> )([^<]+)(<\/p>)(.*)/mis','$3',$html);
$othernumbers = preg_replace('/(.*)(<p><strong>Other Numbers:<\/strong> )(.*)/mis','$3',$html);
list($othernumbers,$trash)= preg_split('/<\/p>/mis',$othernumbers,0);
echo 'name: '.$name.'<br>address: '.$address.'<br>location: '.$location.'<br>other numbers: '.$othernumbers;
exit;

You should use the following for your XPath query:
//*[#class='address-tags']/p
so you're retrieving the actual paragraph nodes that are children of the 'address-tags' parent. Then you can use a loop on them:
$nodes = $xpath->query('//*[#class="address-tags"]/p');
for ($i = 0; $i < $nodes->length; $i++) {
echo $nodes->item($i)->nodeValue;
}
// or just
foreach($nodes as $node) {
echo $node->nodeValue;
}
Right now your code is properly fetching the first div that's found, but then you continue treating that div as if it was a DOMNodeList returned from an xpath query, which is incorrect. ->item() returns a DOMNode object, which does NOT have an ->item() method.

Related

webscrapinhg a webite filtering for divs with a certain classname. How to do that?

currently I´m tring to webscrape a site for football matches and I need to find out how to filter for divs with a specific name. Here is the code I already have. Thanks
include('simple_html_dom.php');
$day = 1; //temporär
$html = file_get_html('https://sport.sky.de/bundesliga-spielplan-ergebnisse-'.$day);
$list = $html -> find('div[class="sdc-site-fixres__match-cell sdc-site-fixres__match-cell--score"]', 0);
$list_array = $list -> find('div');
for($i = 0; $i < sizeof($list_array); $i++){
echo $list_array[$i]->plaintext;
echo "<br>";
}
You can use xpath. Here is the full documentation.
$day = 1; //temporär
$html = file_get_contents('https://sport.sky.de/bundesliga-spielplan-ergebnisse-'.$day);
$doc = DOMDocument::loadHTML($html);
$xpath = new DOMXPath($doc);
$query = $xpath->query('//div[#class="sdc-site-fixres__match-cell sdc-site-fixres__match-cell--score"]/div/span[2]');
foreach ($query as $item) {
/** #var DOMElement $item */
echo $item->nodeValue;
echo PHP_EOL;
}
Or you can benefit from symfony components for this purpose like DOM crawler or CSS selector

How to get image url by page in PHP

This is my code :
<form method="POST">
<input name="link">
<button type="submit">></button>
</form>
<title>GET IMAGE URL</title>
<?php
if (!isset($_POST['link'])) exit();
$link = $_POST['link'];
$parse = explode('.html', $link);
echo '<div id="pin" style="float:center"><textarea class="text" cols="110" rows="50">';
for ($i = 1; $i <=5; $i++)
{
if ($i > 1)
$link = "$parse[0]-$i.html";
$get = file_get_contents($link);
if (preg_match_all('/src="(.*?)"/', $get, $matches))
{
foreach ($matches[1] as $content)
echo $content."\r\n";
}
}
echo '</textarea>';
The page I'm trying to get the img src has 10 to 15 page,so I want my code to get all the img url until the end of the page. How can I do that without the loop?
If I use:
for ($i = 1; $i <=5; $i++)
this will get only 5 page img urls, but I want to make it get until the end. Then I don't need to edit the loop everytime I submit another URL with a different number of pages.
From this
this will get only 5 page img urls, but I want to make it get until the end. Then I don't need to edit the loop everytime I submit another URL with a different number of pages.
I could understand that your problem is with dynamic number of pages.Your urls have a next page link at the bottom
下一页
Identify it and get your images in while loop
<?php
// Link given in form
$link = "http://www.xiumm.org/photos/XiuRen-17305.html";
$parse = explode('.html', $link);
$i=1;
// Intialize a boolean
$nextPageFound = true;
while($nextPageFound) {
// Construct URL Every time when nextPageFound
if ($i == 1) {
$url = "$parse[0].html";
echo "First Page<br><br>";
} else {
$url = "$parse[0]-$i.html";
}
// Getting URL Contents
$get = file_get_contents($url);
if (preg_match_all('/src="(.*?)"/', $get, $matches))
{
// echoing contents
foreach ($matches[1] as $content)
echo $content."<br>";
}
// check nextPageBtn if available
if (strpos($get, '"nextPageBtn"') !== false) {
$nextPageFound = true;
// increment +1
$i++;
echo "<br>Page $i<br><br>";
} else {
$nextPageFound = false;
echo "THE END";
}
}
?>
You should use an HTML/XML parser, like DOMDocument, in combination with DOMXPath (xpath is query language to query (X)HTML data structures):
// create DOMDocument
$doc = new DOMDocument();
// load remote HTML file
$doc->loadHTMLFile( $link );
// create DOMXPath
$xpath = new DOMXPath( $doc );
// fetch all IMG elements that have a src attribute
$nodes = $xpath->query( '//img[#src]' );
// loop trough found IMG elements and echo their src attribute values
for( $i = 0; $i < $nodes->length; $i++ ) {
echo $nodes->item( $i )->getAttribute( 'src' ) . PHP_EOL;
}
Regarding the xpath query //div[contains(#class,'pic_box')]//#src, mentioned by #Enuma, in the comments:
The resulting DOMNodeList of that query will not contain DOMElement objects, but DOMAttr objects, because the query directly asks for attributes, not elements. Since DOMAttr represents an attribute and not an element, the method getAttribute() does not exist. To get the value of the attribute you have to use the property DOMAttr->value.
So, we have to slightly alter the relevant part of our example code from above to:
// loop trough found src attributes and echo their value
for( $i = 0; $i < $nodes->length; $i++ ) {
echo $nodes->item( $i )->value . PHP_EOL;
}
Putting it all together, our example code then becomes:
// create DOMDocument
$doc = new DOMDocument();
// load remote HTML file
$doc->loadHTMLFile( $link );
// create DOMXPath
$xpath = new DOMXPath( $doc );
// fetch all src attributes that are descendants of div.pic_box
$nodes = $xpath->query( '//div[contains(#class,'pic_box')]//#src' );
// loop trough found src attributes and echo their value
for( $i = 0; $i < $nodes->length; $i++ ) {
echo $nodes->item( $i )->value . PHP_EOL;
}
PS.: In order for DOMDocument to be able to load remote files, I believe some php config setting may be required to be set, which I don't know off the top of my head, right now. But since it already appeared to be working for #Enuma, it's not actually relevant now. Perhaps I'll look them up later.

PHP DOM Document not parsing / retrieving HTML

I wrote the following:
<?php
$str = 'http://stackoverflow.com';
$DOM = new DOMDocument;
$DOM->loadHTML($str);
//get all H1
$items = $DOM->getElementsByTagName('h1');
//display all H1 text
for ($i = 0; $i < $items->length; $i++)
{
echo $items->item($i)->nodeValue . "<br/>";
}
?>
And just wanted to simply retrieve all the H1 elements of stackoverflow, but can't get it working. Whenever I try filling in the variable $str manually (for example: <h1>hello</h1><div><h1>hello2</h1></div>) it is working. But whenever I try to parse content from another webpage it is not doing anything at all...
Help would be appericiated!
$str = 'http://stackoverflow.com';
$DOM = new DOMDocument;
$DOM->loadHTMLFile($str); // get html
echo $DOM->saveHTML(); echo html
$DOM->saveHTMLFile(FILE_NAME); save html to file

How to make crawling and extracting data in each pager links?

I want to extract all the attributes name="" of a website,
example html
<div class="link_row">
link
</div>
I have the following code:
<?php
$html = new DOMDocument();
#$html->loadHtmlFile('http://www.onedomain.com/plus?ca=11_c&o=1');
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query( "//div[#class='link_row']/a[#class='listing_container']/#name" );
foreach ($nodelist as $n){
echo $n->nodeValue."\n<br>";
}
?>
Result is:
7777
This code is working fine, but need not be limited to one pager number.
http://www.onedomain.com/plus?ca=11_c&o=1 pager attr is "o=1"
I would like once you finish with o=1, follow with o=2
to my variable defined $last=556 is equal http://www.onedomain.com/plus?ca=11_c&o=556
Could you help me?
What is the best way to do it?
Thanks
Use a for (or while) loop. I don't see $last in your provided code so I've statically set the max value plus one.
$html = new DOMDocument();
for($i =1; $i < 557; $i++) {
#$html->loadHtmlFile('http://www.onedomain.com/plus?ca=11_c&o=' . $i);
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query( "//div[#class='link_row']/a[#class='listing_container']/#name" );
foreach ($nodelist as $n){
echo $n->nodeValue."\n<br>";
}
}
Simpler example:
for($i =1; $i < 557; $i++) {
echo $i;
}
http://php.net/manual/en/control-structures.for.php

How to call UL class only once using domdocument php

I am using PHP Domdocument to load my html. In my HTML, I have class="smalllist" two times. But, I need to load the first class elements.
Now, My PHP Code is
$d = new DOMDocument();
$d->validateOnParse = true;
#$d->loadHTML($html);
$xpath = new DOMXPath($d);
$table = $xpath->query('//ul[#class="smalllist"]');
foreach ($table as $row) {
echo $row->getElementsByTagName('a')->item(0)->nodeValue."-";
echo $row->getElementsByTagName('a')->item(1)->nodeValue."\n";
}
which loads both the classes.
But, I need to load only one class with that name.
Please help me in this. Thanks in advance.
DOMXPath returns a DOMNodeList which has a item() method. see if this works
$table->item(0)->getElementsByTagName('a')->item(0)->nodeValue
edited (untested):
foreach($table->item(0)->getElementsByTagName('a') as $anchor){
echo $anchor->nodeValue . "\n";
}
You can put a break within the foreach loop to read only from the first class. Or, you can do foreach ($table->item(0) as $row) {...
Code:
$count = 0;
foreach($table->item(0)->getElementsByTagName('a') as $anchor){
echo $anchor->nodeValue . "\n";
if( ++$count > 2 ) {
break;
}
}
another way rather than using break (more than one way to skin a cat):
$anchors = $table->item(0)->getElementsByTagName('a');
for($i = 0; $i < 2; $i++){
echo $anchor->item($i)->nodeValue . "\n";
}
This is my final code:
$d = new DOMDocument();
$d->validateOnParse = true;
#$d->loadHTML($html);
$xpath = new DOMXPath($d);
$table = $xpath->query('//ul[#class="smalllist"]');
$count = 0;
foreach($table->item(0)->getElementsByTagName('a') as $anchor){
$data[$k][$arr1[$count]] = $anchor->nodeValue;
if( ++$count > 1 ) {
break;
}
}
Working fine.

Categories