How to crawl and extract data from each pager link? - php

I want to extract all the name="" attributes from a website.
Example HTML:
<div class="link_row">
<a class="listing_container" name="7777">link</a>
</div>
I have the following code:
<?php
$html = new DOMDocument();
@$html->loadHtmlFile('http://www.onedomain.com/plus?ca=11_c&o=1');
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query( "//div[@class='link_row']/a[@class='listing_container']/@name" );
foreach ($nodelist as $n){
echo $n->nodeValue."\n<br>";
}
?>
Result is:
7777
This code works fine, but it should not be limited to a single pager number.
In http://www.onedomain.com/plus?ca=11_c&o=1 the pager parameter is "o=1".
Once o=1 is finished, I would like to continue with o=2, and so on,
up to my variable $last=556, which corresponds to http://www.onedomain.com/plus?ca=11_c&o=556
Could you help me?
What is the best way to do it?
Thanks

Use a for (or while) loop. I don't see $last in your provided code so I've statically set the max value plus one.
$html = new DOMDocument();
for($i =1; $i < 557; $i++) {
@$html->loadHtmlFile('http://www.onedomain.com/plus?ca=11_c&o=' . $i);
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query( "//div[@class='link_row']/a[@class='listing_container']/@name" );
foreach ($nodelist as $n){
echo $n->nodeValue."\n<br>";
}
}
Simpler example:
for($i =1; $i < 557; $i++) {
echo $i;
}
http://php.net/manual/en/control-structures.for.php
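The same loop can be driven by the asker's $last variable instead of a hard-coded 557. Below is a minimal sketch, not a definitive implementation: the helper names pageUrl, extractNames and crawlPages are made up for illustration, and the fetch step is passed in as a callback so the loop can be tried on fake markup without touching the network.

```php
<?php
// Hypothetical helper: build the URL for pager value $page (URL pattern taken from the question).
function pageUrl(int $page): string
{
    return 'http://www.onedomain.com/plus?ca=11_c&o=' . $page;
}

// Hypothetical helper: return every name attribute matched by the question's XPath.
function extractNames(string $markup): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // real-world HTML is rarely valid
    $doc->loadHTML($markup);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $names = [];
    foreach ($xpath->query("//div[@class='link_row']/a[@class='listing_container']/@name") as $attr) {
        $names[] = $attr->nodeValue;
    }
    return $names;
}

// Walk pages 1..$last; $fetch maps a URL to its HTML, or false on failure.
function crawlPages(int $last, callable $fetch): array
{
    $all = [];
    for ($page = 1; $page <= $last; $page++) {
        $markup = $fetch(pageUrl($page));
        if ($markup === false || $markup === '') {
            continue;                   // skip pages that could not be loaded
        }
        $all = array_merge($all, extractNames($markup));
    }
    return $all;
}

// Stand-in fetcher returning made-up markup; for a real crawl pass 'file_get_contents'.
$fakeFetch = function (string $url): string {
    parse_str((string) parse_url($url, PHP_URL_QUERY), $query);
    return '<div class="link_row"><a class="listing_container" name="id' . $query['o'] . '">link</a></div>';
};

$last = 3;                              // the question uses $last = 556
print_r(crawlPages($last, $fakeFetch));
```

Using <= $last keeps the loop bound equal to the last page number, so there is no "max value plus one" to remember.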

Related

Web scraping a website, filtering for divs with a certain class name. How to do that?

Currently I'm trying to scrape a site for football matches, and I need to find out how to filter for divs with a specific class name. Here is the code I already have. Thanks
include('simple_html_dom.php');
$day = 1; // temporary
$html = file_get_html('https://sport.sky.de/bundesliga-spielplan-ergebnisse-'.$day);
$list = $html -> find('div[class="sdc-site-fixres__match-cell sdc-site-fixres__match-cell--score"]', 0);
$list_array = $list -> find('div');
for($i = 0; $i < sizeof($list_array); $i++){
echo $list_array[$i]->plaintext;
echo "<br>";
}
You can use xpath. Here is the full documentation.
$day = 1; // temporary
$html = file_get_contents('https://sport.sky.de/bundesliga-spielplan-ergebnisse-'.$day);
$doc = new DOMDocument();
libxml_use_internal_errors(true); // the page's HTML is unlikely to be perfectly valid
$doc->loadHTML($html); // loadHTML() must be called on an instance in PHP 8
$xpath = new DOMXPath($doc);
$query = $xpath->query('//div[@class="sdc-site-fixres__match-cell sdc-site-fixres__match-cell--score"]/div/span[2]');
foreach ($query as $item) {
/** @var DOMElement $item */
echo $item->nodeValue;
echo PHP_EOL;
}
Alternatively, you can use Symfony components such as DomCrawler or CssSelector for this purpose.
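One caveat with matching on the full class string: an XPath test like @class="a b" only matches when the attribute is exactly that text, in that order. A common XPath 1.0 workaround is to test for a single class token with contains(concat(...)). Here is a rough sketch on made-up markup (the sample HTML and the hasClass helper are assumptions for illustration, not the real Sky page):

```php
<?php
// Hypothetical helper: XPath predicate that is true when @class contains $class as a whole token.
function hasClass(string $class): string
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' $class ')";
}

// Made-up markup standing in for the real page; note the differing class order.
$markup = <<<HTML
<div class="sdc-site-fixres__match-cell--score sdc-site-fixres__match-cell"><span>2</span></div>
<div class="sdc-site-fixres__match-cell"><span>team</span></div>
<div class="sdc-site-fixres__match-cell sdc-site-fixres__match-cell--score"><span>1</span></div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($markup);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$scores = [];
foreach ($xpath->query('//div[' . hasClass('sdc-site-fixres__match-cell--score') . ']/span') as $span) {
    $scores[] = $span->nodeValue;
}
print_r($scores);   // the two score cells, regardless of class order
```

Padding both sides with spaces stops a shorter class name from matching as a prefix of a longer one.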

PHP Simple Dom HTML - Trouble parsing list of a hrefs

I'm trying to scrape all the a hrefs with an id starting with 'system' from this webpage: http://www.myfxbook.com/systems
Here is my code which I just can't seem to get to work. I've been fiddling around for hours now, looking at countless answered questions here.
include_once( 'simple_html_dom.php' );
$url2process = 'http://www.myfxbook.com/systems';
$html = file_get_html( $url2process );
$cnt = 0;
$parent_mark = $html->find('a[id^=system]');
$cntr = 0;
foreach( $parent_mark as $element) {
if( $cntr > 3 ) continue;
$cntr++;
$single_html = file_get_html( $element->href );
UPDATE1: Ok this is kind of working now, but it only seems to be using the very last a href on the page with the correct id. I need to process ALL these a hrefs with this ID, what am I missing here?
You could do it using DOMDocument like this:
$html = file_get_contents('http://www.myfxbook.com/systems');
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_use_internal_errors(false);
$links = $doc->getElementsByTagName('a');
$cnt = 0;
$cntr = 0;
foreach ($links as $link) {
if(preg_match('~^system~', $link->getAttribute('id'))) {
if( $cntr > 3 ) {
continue;
}
$cntr++;
$single_html = file_get_contents($link->getAttribute('href'));
if (empty($single_html)) {
echo 'EMPTY';
}
}
}
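If you prefer to let XPath do the id filtering instead of preg_match, starts-with() is part of XPath 1.0 and so works with DOMXPath. A small sketch on made-up markup (the ids and hrefs below are invented for illustration):

```php
<?php
// Made-up markup; the ids and hrefs are assumptions for illustration only.
$markup = '<a id="system1" href="/a">A</a><a id="other" href="/b">B</a><a id="system2" href="/c">C</a>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($markup);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$hrefs = [];
// Select only the anchors whose id begins with "system".
foreach ($xpath->query("//a[starts-with(@id, 'system')]") as $link) {
    $hrefs[] = $link->getAttribute('href');
}
print_r($hrefs);
```

Collecting the hrefs into an array first, and fetching each one afterwards, makes it easy to check that all matching links were found and not just the last one.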

how to scrape hindi text from web using php

I am trying to scrape data from a web page (URL in the code) that is in Hindi, but I am getting a response like this:
\u093f\u0938\
How do I decode this Unicode? Please suggest what to change in my PHP script.
This script works correctly with English text, and I have already scraped data with it, so what is going wrong with Hindi? I know this response is Devanagari Unicode, but how do I decode it?
I am new to PHP. Thanks in advance.
$i= 1;
for($i; $i < 6; $i++)
{
$html file_get_contents("http://www.jagran.com/jokes/child/jokes-1262211".$i.".html");
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$nodes = $dom->getElementsByTagName('p');
$item = array();
$articles = array();
foreach ($nodes as $node) {
$item['msg'] = (strlen($node->nodeValue) > 20 ? $node->nodeValue : '');
$item['cat_id'] = 1;
if($item['msg'] !="")
$articles[] = array_unique($item);
}
$articles = json_encode($articles);
print_r($articles);
}
If you are running PHP 5.4 or greater, pass the JSON_UNESCAPED_UNICODE flag when calling json_encode.
$i= 1;
for($i; $i < 6; $i++)
{
$html file_get_contents("http://www.jagran.com/jokes/child/jokes-1262211".$i.".html");
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$nodes = $dom->getElementsByTagName('p');
$item = array();
$articles = array();
foreach ($nodes as $node) {
$item['msg'] = (strlen($node->nodeValue) > 20 ? $node->nodeValue : '');
$item['cat_id'] = 1;
if($item['msg'] !="")
$articles[] = array_unique($item);
}
$articles = json_encode($articles, JSON_UNESCAPED_UNICODE);
//--------------------add-this---------------------^
print_r($articles);
}
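To see the effect of the flag in isolation, here is a tiny sketch that encodes the two characters from the question with and without it (the 'msg' key mirrors the question's array; the "\u{...}" string syntax needs PHP 7 or later):

```php
<?php
$text = "\u{093f}\u{0938}";   // the two Devanagari characters from the question

$escaped   = json_encode(['msg' => $text]);
$unescaped = json_encode(['msg' => $text], JSON_UNESCAPED_UNICODE);

echo $escaped, PHP_EOL;    // {"msg":"\u093f\u0938"}
echo $unescaped, PHP_EOL;  // {"msg":"िस"}
```

Both strings are valid JSON for the same data; the flag only changes how non-ASCII characters are written.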
I think PHPhil's answer is good and I upvoted it. Running the PHP part alone is not enough, though: the page also needs the right meta tag (see the code below) for the Devanagari text to display properly. I also wanted to correct the mistake with the missing "=". Unfortunately my edit was rejected, so I have to post a new answer with the code corrections.
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<?php
$i= 1;
for($i; $i < 6; $i++)
{
$html = file_get_contents("http://www.jagran.com/jokes/child/jokes-1262211".$i.".html");
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$nodes = $dom->getElementsByTagName('p');
$item = array();
$articles = array();
foreach ($nodes as $node) {
$item['msg'] = (strlen($node->nodeValue) > 20 ? $node->nodeValue : '');
$item['cat_id'] = 1;
if($item['msg'] !="")
$articles[] = array_unique($item);
}
$articles = json_encode($articles, JSON_UNESCAPED_UNICODE);
//--------------------add-this---------------------^
print_r($articles);
}
?>
</body>
</html>
You are very close. The characters you receive are ि and स.
First, you can search Google for each escape sequence to find the Devanagari character it stands for:
https://www.google.de/#q=%5Cu093f
https://www.google.de/#q=%5Cu0938
To show these characters in HTML, convert each escape from the \u0123 form to the &#x0123; form. See here:
<html>
<body>
<p>These are two chars in Devanagari: &#x093f;&#x0938;</p>
</body>
</html>
But since you want to scrape Hindi, you should start learning how to read and handle Unicode. The next question is how you want to process your result.
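For completeness, both conversions can be done in PHP itself. This is only a sketch: it assumes the input is the literal escaped text from the question, and the regex-based entity conversion handles plain four-digit escapes only (no surrogate pairs).

```php
<?php
$escaped = '\u093f\u0938';   // literal backslash sequences, as in the question's output

// Wrapping the text in quotes makes it a JSON string literal that json_decode can read.
$decoded = json_decode('"' . $escaped . '"');

// Rewrite each \uXXXX escape as a hexadecimal HTML entity.
$entities = preg_replace('/\\\\u([0-9a-fA-F]{4})/', '&#x$1;', $escaped);

echo $decoded, PHP_EOL;    // the two Devanagari characters as UTF-8
echo $entities, PHP_EOL;   // &#x093f;&#x0938;
```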

PHP XPath - query finds too many nodes

I'm trying to duplicate a row (with data-id='first') from a template three times and fill the placeholder field ({first}) with a value (0, 1, 2 in this case). My simple code is below. I don't understand why this line finds more than one node: $nodeList = $xpath->query("//*[text()[contains(.,'first')]]", $newNode); It finds every node that contains the text 'first', meaning both rows, the cloned one and the original one, so it replaces the text in both of them, while it should replace it only in the new one. Note that I'm passing the second parameter to $xpath->query, which should make the search relative to the node I just cloned.
Here's a fiddle: https://eval.in/170941
HTML:
<html>
<head>
<title>test</title>
</head>
<body>
<table>
<tr data-id="first">
<td>{first}</td>
</tr>
</table>
</body>
</html>
PHP:
<?php
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$element = $xpath->query("//*[@data-id='first']")->item(0);
$element->removeAttribute("data-id");
$parent = $element->parentNode;
for ($i = 0; $i < 3; $i++) {
$newNode = $element->cloneNode(true);
$parent->insertBefore($newNode, $element);
$nodeList = $xpath->query("//*[text()[contains(.,'first')]]", $newNode);
for($j = 0; $j < $nodeList->length; $j++) {
$n = $nodeList->item($j);
$n->nodeValue = preg_replace("{{first}}", $i, $n->nodeValue);
}
}
$parent->removeChild($element);
echo $dom->saveHTML();
As you can see, the result is a three-row table with the values 0, 0, 0, while the expected values are 0, 1, 2.
Starting an XPath location path with / means that it starts at the document root. So //* always matches any element node in the document, and the context argument has no effect.
Try:
$nodeList = $xpath->query(".//*[text()[contains(.,'first')]]", $newNode);
HINT: DOMXPath::query() only allows expressions that return a node list, while DOMXPath::evaluate() allows all expressions. Example: count(//*).
HINT: DOMNodeList objects are traversable, so you can use foreach to iterate over them.
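The difference between the absolute and the relative path, and the evaluate() hint, can be checked with a short script. This is an illustrative sketch on made-up markup (the two-row table and its ids are assumptions, not the asker's template):

```php
<?php
$markup = '<table><tr id="a"><td>{first}</td></tr><tr id="b"><td>{first}</td></tr></table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($markup);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Use the second row as the context node.
$row = $xpath->query("//tr[@id='b']")->item(0);

// Absolute path: the context node is ignored, so both cells match.
$absolute = $xpath->query("//td[contains(., 'first')]", $row)->length;

// Relative path: only descendants of $row match.
$relative = $xpath->query(".//td[contains(., 'first')]", $row)->length;

// evaluate() can return scalars such as a count.
$total = $xpath->evaluate('count(//td)');

echo "$absolute $relative $total", PHP_EOL;   // 2 1 2
```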
The problem you are having is that you are cloning the original node, but in your first pass you're altering the original node's content. Every pass after that is copying the already modified node, so there is no {first} to find.
One solution is to make a clone of the source element which you never insert into the document, and use that inside your loop.
Here's my fiddle: https://eval.in/171149
<?php
$html = '<html><head><title>test</title></head><body><table><tr data-id="first"><td>{first}</td></tr></table></body></html>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$element = $xpath->query("//*[@data-id='first']")->item(0);
$element->removeAttribute("data-id");
$parent = $element->parentNode;
$clonedNode = $element->cloneNode(true);
for ($i = 0; $i < 3; $i++) {
$newNode = $clonedNode->cloneNode(true);
$parent->insertBefore($newNode, $element);
$nodeList = $xpath->query("//*[text()[contains(.,'first')]]", $newNode);
for($j = 0; $j < $nodeList->length; $j++) {
$n = $nodeList->item($j);
$n->nodeValue = preg_replace("{{first}}", $i, $n->nodeValue);
}
}
$parent->removeChild($element);
echo $dom->saveHTML();

Fetch the attributes using PHP crawler

I am trying to fetch the name, address and location by crawling a website. It's a single page and I don't want anything other than these fields. I am using the code below.
<?php
include 'simple_html_dom.php';
$html = "http://www.phunwa.com/phone/0191/2604233";
$dom = new DomDocument();
$dom->loadHtml($html);
$xpath = new DomXpath($dom);
$div = $xpath->query('//*[@class="address-tags"]')->item(0);
for($i=0; $i < $div->length; $i++ )
{
print "nodename=".$div->item( $i )->nodeName;
print "\t";
print "nodevalue : ".$div->item( $i )->nodeValue;
print "\r\n";
echo $link->getElementsByTagName("<p>");
}
?>
The website html source code is
<div class="address-tags">
<p><strong>Name:</strong> RAJ GOPAL SINGH</p>
<p><strong>Address:</strong> R/O BARNAI NETARKOTHIAN, P.O.MUTHI TEH.& DISTT.JAMMU,X, 181206</p>
<p><strong>Location:</strong> JAMMU, Jammu & Kashmir, India</p>
<p><strong>Other Numbers:</strong> 01912604233 | +911912604233 | +91-191-2604233</p>
Can someone please help me get the three attributes as output? Nothing is echoed on the page as of now.
Thanks a lot.
You need $dom->load($html); instead of $dom->loadHtml($html);, because loadHtml() expects an HTML string, not a URL. After doing this you will find your HTML is not well formed, so $xpath stays empty.
Maybe try something like:
$html = file_get_contents('http://www.phunwa.com/phone/0191/2604233');
$name = preg_replace('/(.*)(<p><strong>Name:<\/strong> )([^<]+)(<\/p>)(.*)/mis','$3',$html);
$address = preg_replace('/(.*)(<p><strong>Address:<\/strong> )([^<]+)(<\/p>)(.*)/mis','$3',$html);
$location = preg_replace('/(.*)(<p><strong>Location:<\/strong> )([^<]+)(<\/p>)(.*)/mis','$3',$html);
$othernumbers = preg_replace('/(.*)(<p><strong>Other Numbers:<\/strong> )(.*)/mis','$3',$html);
list($othernumbers,$trash)= preg_split('/<\/p>/mis',$othernumbers,0);
echo 'name: '.$name.'<br>address: '.$address.'<br>location: '.$location.'<br>other numbers: '.$othernumbers;
exit;
You should use the following for your XPath query:
//*[@class='address-tags']/p
so you're retrieving the actual paragraph nodes that are children of the 'address-tags' parent. Then you can use a loop on them:
$nodes = $xpath->query('//*[@class="address-tags"]/p');
for ($i = 0; $i < $nodes->length; $i++) {
echo $nodes->item($i)->nodeValue;
}
// or just
foreach($nodes as $node) {
echo $node->nodeValue;
}
Right now your code is properly fetching the first div that's found, but then you continue treating that div as if it was a DOMNodeList returned from an xpath query, which is incorrect. ->item() returns a DOMNode object, which does NOT have an ->item() method.
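Building on that, here is a hedged sketch of how the paragraphs could be turned into label/value pairs. The markup is a shortened copy of the snippet in the question, with a closing </div> added and the ampersand written as &amp; so the sample is complete; the $fields array and the label-splitting approach are my own assumptions, not something from the original page.

```php
<?php
// Shortened sample based on the question's markup.
$markup = <<<HTML
<div class="address-tags">
<p><strong>Name:</strong> RAJ GOPAL SINGH</p>
<p><strong>Location:</strong> JAMMU, Jammu &amp; Kashmir, India</p>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($markup);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$fields = [];
foreach ($xpath->query('//*[@class="address-tags"]/p') as $p) {
    // Each $p is a single DOMElement; its <strong> child holds the label.
    $label = $xpath->evaluate('string(strong)', $p);
    // The value is whatever text follows the label inside the paragraph.
    $value = trim(substr($p->nodeValue, strlen($label)));
    $fields[rtrim($label, ':')] = $value;
}
print_r($fields);
```

Against the live page you would load the HTML with file_get_contents() or loadHTMLFile() first; the loop itself stays the same.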
