How to scrape Hindi text from the web using PHP

I am trying to scrape data from a web page (URL in the code below) that is in Hindi, but I am getting a response like this:
\u093f\u0938\
How do I decode this Unicode? Please suggest what to change in my PHP script.
This script works correctly with English text, and I have already scraped data with it, so what is happening with Hindi? I know this response is Devanagari Unicode, but how do I decode it?
I am new to PHP. Thanks in advance.
$i= 1;
for($i; $i < 6; $i++)
{
$html file_get_contents("http://www.jagran.com/jokes/child/jokes-1262211".$i.".html");
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$nodes = $dom->getElementsByTagName('p');
$item = array();
$articles = array();
foreach ($nodes as $node) {
$item['msg'] = (strlen($node->nodeValue) > 20 ? $node->nodeValue : '');
$item['cat_id'] = 1;
if($item['msg'] !="")
$articles[] = array_unique($item);
}
$articles = json_encode($articles);
print_r($articles);
}

If you are running PHP 5.4 or greater, pass the JSON_UNESCAPED_UNICODE flag as the second argument when calling json_encode.
$i= 1;
for($i; $i < 6; $i++)
{
$html file_get_contents("http://www.jagran.com/jokes/child/jokes-1262211".$i.".html");
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$nodes = $dom->getElementsByTagName('p');
$item = array();
$articles = array();
foreach ($nodes as $node) {
$item['msg'] = (strlen($node->nodeValue) > 20 ? $node->nodeValue : '');
$item['cat_id'] = 1;
if($item['msg'] !="")
$articles[] = array_unique($item);
}
$articles = json_encode($articles, JSON_UNESCAPED_UNICODE);
//--------------------add-this---------------------^
print_r($articles);
}
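To see what the flag changes, here is a minimal sketch that encodes the same two Devanagari characters with and without JSON_UNESCAPED_UNICODE. The sample string and variable names are made up for illustration, and the "\u{...}" string syntax assumes PHP 7.0 or later.

```php
<?php
// The two characters from the question, written with PHP 7+ code point escapes.
$text = "\u{093F}\u{0938}";

// Default behaviour: non-ASCII characters become \uXXXX escapes.
$escaped = json_encode(['msg' => $text]);

// With the flag: the UTF-8 characters are written out as-is.
$readable = json_encode(['msg' => $text], JSON_UNESCAPED_UNICODE);

echo $escaped, "\n";
echo $readable, "\n";

// Both forms are valid JSON and decode back to the same string.
$roundTrip = json_decode($escaped, true)['msg'];
var_dump($roundTrip === $text);
```

Both outputs are valid JSON, so a JSON consumer decodes them identically; the flag mainly matters when a person reads the raw output.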

I think PHPhil's answer is good and I upvoted it. However, running only the PHP part is not enough: you also need the right meta tag (see the code below) so the browser displays the Devanagari properly. I also wanted to correct the missing "=" in the file_get_contents assignment. Unfortunately my edit was rejected, so I am posting a new answer with the corrected code.
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<?php
$i= 1;
for($i; $i < 6; $i++)
{
$html = file_get_contents("http://www.jagran.com/jokes/child/jokes-1262211".$i.".html");
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$nodes = $dom->getElementsByTagName('p');
$item = array();
$articles = array();
foreach ($nodes as $node) {
$item['msg'] = (strlen($node->nodeValue) > 20 ? $node->nodeValue : '');
$item['cat_id'] = 1;
if($item['msg'] !="")
$articles[] = array_unique($item);
}
$articles = json_encode($articles, JSON_UNESCAPED_UNICODE);
//--------------------add-this---------------------^
print_r($articles);
}
?>
</body>
</html>

You are very close. The characters you received are: ि and स
First, you can google the escape codes to find the Devanagari characters they stand for:
https://www.google.de/#q=%5Cu093f
https://www.google.de/#q=%5Cu0938
If you want to show Unicode in HTML, you have to change the notation from \u0123 to &#x0123; (a numeric character reference). See here:
<html>
<body>
<p>These are two chars in Devanagari: &#x093f;&#x0938;</p>
</body>
</html>
But since you want to scrape Hindi, you should learn how to read and handle Unicode. The next question is how you want to process your result.
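The conversion described above can be sketched in a few lines. The helper name jsonEscapesToHtmlEntities is hypothetical, and the sketch assumes only four-digit \uXXXX escapes (it does not combine surrogate pairs). It also shows json_decode as another way to get real UTF-8 characters back.

```php
<?php
// Hypothetical helper: rewrite JSON-style \uXXXX escapes as HTML hex entities.
// Assumption: only four-digit escapes are handled; surrogate pairs are not combined.
function jsonEscapesToHtmlEntities(string $s): string
{
    return preg_replace('/\\\\u([0-9a-fA-F]{4})/', '&#x$1;', $s);
}

// Literal backslashes, as they appear in the scraped output.
$raw = '\u093f\u0938';

$entities = jsonEscapesToHtmlEntities($raw);
echo $entities, "\n"; // &#x093f;&#x0938;

// Alternative: let json_decode turn the escapes into real UTF-8 characters.
$utf8 = json_decode('"' . $raw . '"');
echo $utf8, "\n";
```

Entities are handy when the page encoding is out of your control; if you serve the page as UTF-8 anyway, the decoded string can be echoed directly.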

Related

How to get child nodes from an xml url?

I have this link: https://www.ncbi.nlm.nih.gov/gene/7128?report=xml&format=text. I am trying to write code that gets the Interactions and GeneOntology entries within Gene-commentary_heading from that link. My code only succeeds when there are 2 or 3 nodes, but in this case there are at least 6. Could someone help me?
Below is an example of the information I am looking for (it is too much to show in full, so this is just a part):
<Gene-commentary_heading>GeneOntology</Gene-commentary_heading>
<Gene-commentary_source>
<Other-source>
<Other-source_pre-text>Provided by</Other-source_pre-text>
<Other-source_anchor>GOA</Other-source_anchor>
<Other-source_url>http://www.ebi.ac.uk/GOA/</Other-source_url>
</Other-source>
</Gene-commentary_source>
<Gene-commentary_comment>
<Gene-commentary>
<Gene-commentary_type value="comment">254</Gene-commentary_type>
<Gene-commentary_label>Function</Gene-commentary_label>
<Gene-commentary_comment>
<Gene-commentary>
<Gene-commentary_type value="comment">254</Gene-commentary_type>
<Gene-commentary_source>
<Other-source>
<Other-source_src>
<Dbtag>
<Dbtag_db>GO</Dbtag_db>
<Dbtag_tag>
<Object-id>
<Object-id_id>3677</Object-id_id>
</Object-id>
</Dbtag_tag>
...
$url = "https://www.ncbi.nlm.nih.gov/gene/7128?report=xml&format=text";
$document_xml = new DOMDocument();
$document_xml->loadXML($url);
$elements = $url->getElementsByTagName('Gene-commentary_heading');
echo $elements;
foreach($element as $node) {
$GO = $node -> getElementsByTagName('GeneOntology');
$Int = $node->getElementsByTagName('Interactions');
}
My answer
$esearch_test = "https://www.ncbi.nlm.nih.gov/gene/7128?report=xml&format=text";
$result = file_get_contents($esearch_test);
// loadXML() is an instance method and expects the XML as a string,
// so pass the downloaded text straight in (no SimpleXML step needed).
$doc = new DOMDocument();
$doc->loadXML($result);
$c = 1;
foreach($doc->getElementsByTagName('Gene-commentary_heading') as $node) {
echo "$c: ".$node->textContent."\n";
$c++;
}
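To pick out only the GeneOntology and Interactions sections rather than listing every heading, one option is to match on the heading text and then work from its parent element. This is only a sketch: the inline XML is a made-up, cut-down stand-in for the real response, and it assumes each heading sits directly inside the Gene-commentary element it describes.

```php
<?php
// Made-up sample; the real document is far larger and more deeply nested.
$xml = <<<XML
<Root>
<Gene-commentary>
<Gene-commentary_heading>GeneOntology</Gene-commentary_heading>
<Gene-commentary_label>Function</Gene-commentary_label>
</Gene-commentary>
<Gene-commentary>
<Gene-commentary_heading>Interactions</Gene-commentary_heading>
<Gene-commentary_label>Example</Gene-commentary_label>
</Gene-commentary>
<Gene-commentary>
<Gene-commentary_heading>Pathways</Gene-commentary_heading>
</Gene-commentary>
</Root>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

$wanted = ['GeneOntology', 'Interactions'];
$found = [];

// '//' searches at any depth, so the nesting level does not matter.
foreach ($xpath->query('//Gene-commentary_heading') as $heading) {
    $name = trim($heading->textContent);
    if (in_array($name, $wanted, true)) {
        $found[$name] = $heading->parentNode; // the enclosing section
    }
}

foreach ($found as $name => $section) {
    $label = $section->getElementsByTagName('Gene-commentary_label')->item(0);
    echo $name, ': ', $label->textContent, "\n";
}
```

From each stored section element you can keep calling getElementsByTagName or run further relative XPath queries to reach the deeper nodes.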

How to make crawling and extracting data in each pager links?

I want to extract all the name="" attributes from a website.
Example HTML:
<div class="link_row">
<a class="listing_container" name="7777">link</a>
</div>
I have the following code:
<?php
$html = new DOMDocument();
@$html->loadHtmlFile('http://www.onedomain.com/plus?ca=11_c&o=1');
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query( "//div[@class='link_row']/a[@class='listing_container']/@name" );
foreach ($nodelist as $n){
echo $n->nodeValue."\n<br>";
}
?>
Result is:
7777
This code works fine, but it should not be limited to a single page number.
In http://www.onedomain.com/plus?ca=11_c&o=1 the pager parameter is "o=1".
Once it has finished with o=1, I would like it to continue with o=2,
up to my variable $last=556, which corresponds to http://www.onedomain.com/plus?ca=11_c&o=556
Could you help me?
What is the best way to do it?
Thanks
Use a for (or while) loop. I don't see $last in your provided code so I've statically set the max value plus one.
$html = new DOMDocument();
for($i =1; $i < 557; $i++) {
@$html->loadHtmlFile('http://www.onedomain.com/plus?ca=11_c&o=' . $i);
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query( "//div[@class='link_row']/a[@class='listing_container']/@name" );
foreach ($nodelist as $n){
echo $n->nodeValue."\n<br>";
}
}
Simpler example:
for($i =1; $i < 557; $i++) {
echo $i;
}
http://php.net/manual/en/control-structures.for.php
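If it helps, the loop can be split into two small pieces: one that builds the page URLs and one that extracts the names from a page. This is a sketch under assumptions: pageUrls and extractNames are hypothetical helper names, and the sample markup stands in for a downloaded page so the example runs without network access.

```php
<?php
// Hypothetical helper: build the pager URLs for o=1 .. o=$last.
function pageUrls(string $base, int $last): array
{
    $urls = [];
    for ($i = 1; $i <= $last; $i++) {
        $urls[] = $base . $i;
    }
    return $urls;
}

// Hypothetical helper: pull the name attributes out of one page of HTML.
function extractNames(string $html): array
{
    libxml_use_internal_errors(true); // tolerate sloppy real-world markup
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $names = [];
    $query = "//div[@class='link_row']/a[@class='listing_container']/@name";
    foreach ($xpath->query($query) as $attr) {
        $names[] = $attr->nodeValue;
    }
    return $names;
}

$urls = pageUrls('http://www.onedomain.com/plus?ca=11_c&o=', 3);
print_r($urls);

// Stand-in for a downloaded page; in real use this would be file_get_contents($url).
$sample = '<div class="link_row"><a class="listing_container" name="7777" href="#">link</a></div>';
$names = extractNames($sample);
print_r($names);
```

In real use you would loop over pageUrls(..., $last), download each URL, and merge the arrays that extractNames returns.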

Adding a <script> as a node value in Html Dom not working

Why is the script not working in the example below? (Not working, i.e. the script is not executing in the browser.)
$xpath = new DOMXpath($doc);
$nodes = $xpath->query( "//div[@class = 'ad_stream_hd']");
foreach( $nodes as $node) {
$node->nodeValue = '<script type="text/javascript" src="http://clkrev.com/adServe/banners?tid=SPORTVE158X21&size=158x21" ></script>';
}
The node value is just text; it's not like innerHTML, where you can specify a string with markup in it. With a document fragment you can get something close to that, but you're setting XML rather than HTML, so your HTML has to be valid XML.
$xpath = new DOMXpath($doc);
$nodes = $xpath->query( "//div[@class = 'ad_stream_hd']");
if ($nodes->length > 0){
$node = $nodes->item($nodes->length-1);
$fragment = $doc->createDocumentFragment();
// appendXML() parses XML, so the "&" in the URL must be written as "&amp;"
$fragment->appendXML('<script type="text/javascript" src="http://clkrev.com/adServe/banners?tid=SPORTVE158X21&amp;size=158x21" ></script>');
$node->appendChild($fragment);
}
Edit: right way to iterate DOMNodeList:
$nodes_length = $nodes->length;
for ($i=0; $i < $nodes_length; $i++) {
$nodes->item($i)->nodeValue = '<script type="text/javascript" src="http://clkrev.com/adServe/banners?tid=SPORTVE158X21&size=158x21" ></script>';
}
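Another option, sketched here, is to skip markup strings entirely and build the script element through the DOM API, so nothing has to be parsed or escaped by hand. The example.com URL and the one-div document are placeholders, not the original ad URL or page.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><body><div class="ad_stream_hd"></div></body></html>');

$xpath = new DOMXPath($doc);
foreach ($xpath->query("//div[@class='ad_stream_hd']") as $node) {
    // Create a real <script> element instead of assigning markup as text.
    $script = $doc->createElement('script');
    $script->setAttribute('type', 'text/javascript');
    // Placeholder URL; setAttribute() takes the raw value, "&" included.
    $script->setAttribute('src', 'https://example.com/banner.js?tid=1&size=158x21');
    $node->appendChild($script);
}

echo $doc->saveHTML();
```

Because the element is a real node, the browser receives an actual script tag rather than escaped text.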

Fetch the attributes using PHP crawler

I am trying to fetch the name, address and location by crawling a website. It is a single page and I don't want anything other than this. I am using the code below.
<?php
include 'simple_html_dom.php';
$html = "http://www.phunwa.com/phone/0191/2604233";
$dom = new DomDocument();
$dom->loadHtml($html);
$xpath = new DomXpath($dom);
$div = $xpath->query('//*[@class="address-tags"]')->item(0);
for($i=0; $i < $div->length; $i++ )
{
print "nodename=".$div->item( $i )->nodeName;
print "\t";
print "nodevalue : ".$div->item( $i )->nodeValue;
print "\r\n";
echo $link->getElementsByTagName("<p>");
}
?>
The website html source code is
<div class="address-tags">
<p><strong>Name:</strong> RAJ GOPAL SINGH</p>
<p><strong>Address:</strong> R/O BARNAI NETARKOTHIAN, P.O.MUTHI TEH.& DISTT.JAMMU,X, 181206</p>
<p><strong>Location:</strong> JAMMU, Jammu & Kashmir, India</p>
<p><strong>Other Numbers:</strong> 01912604233 | +911912604233 | +91-191-2604233</p>
Can someone please help me get the three attributes as output? Nothing is echoed on the page as of now.
Thanks a lot.
You need $dom->load($html); instead of $dom->loadHtml($html);. After doing this you will find that your HTML is not well formed, so $xpath stays empty.
Maybe try something like:
$html = file_get_contents('http://www.phunwa.com/phone/0191/2604233');
$name = preg_replace('/(.*)(<p><strong>Name:<\/strong> )([^<]+)(<\/p>)(.*)/mis','$3',$html);
$address = preg_replace('/(.*)(<p><strong>Address:<\/strong> )([^<]+)(<\/p>)(.*)/mis','$3',$html);
$location = preg_replace('/(.*)(<p><strong>Location:<\/strong> )([^<]+)(<\/p>)(.*)/mis','$3',$html);
$othernumbers = preg_replace('/(.*)(<p><strong>Other Numbers:<\/strong> )(.*)/mis','$3',$html);
list($othernumbers,$trash)= preg_split('/<\/p>/mis',$othernumbers,0);
echo 'name: '.$name.'<br>address: '.$address.'<br>location: '.$location.'<br>other numbers: '.$othernumbers;
exit;
You should use the following for your XPath query:
//*[@class='address-tags']/p
so you're retrieving the actual paragraph nodes that are children of the 'address-tags' parent. Then you can use a loop on them:
$nodes = $xpath->query('//*[@class="address-tags"]/p');
for ($i = 0; $i < $nodes->length; $i++) {
echo $nodes->item($i)->nodeValue;
}
// or just
foreach($nodes as $node) {
echo $node->nodeValue;
}
Right now your code is properly fetching the first div that's found, but then you continue treating that div as if it was a DOMNodeList returned from an xpath query, which is incorrect. ->item() returns a DOMNode object, which does NOT have an ->item() method.
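Putting that together, here is a sketch that turns each paragraph into a label/value pair. It runs against an inline copy of the markup from the question (with the address shortened and the div closed) instead of fetching the page, and the $fields array is just an illustrative name.

```php
<?php
// Inline stand-in for the downloaded page, abbreviated from the question.
$html = <<<HTML
<div class="address-tags">
<p><strong>Name:</strong> RAJ GOPAL SINGH</p>
<p><strong>Address:</strong> R/O BARNAI NETARKOTHIAN, 181206</p>
<p><strong>Location:</strong> JAMMU, Jammu &amp; Kashmir, India</p>
</div>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$fields = [];
foreach ($xpath->query('//*[@class="address-tags"]/p') as $p) {
    $strong = $p->getElementsByTagName('strong')->item(0);
    if ($strong === null) {
        continue; // paragraph without a label
    }
    // Label is the <strong> text without the colon; value is what follows it.
    $label = rtrim(trim($strong->textContent), ':');
    $value = trim(substr($p->textContent, strlen($strong->textContent)));
    $fields[$label] = $value;
}

print_r($fields);
```

After the loop, $fields['Name'], $fields['Address'] and $fields['Location'] hold the three values the question asks for.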

Replace Tag in HTML with DOMDocument

I'm trying to edit html tags with DOMDocument::loadHTML in php. The html data is a part of html and not the whole page. I followed what this page (PHP - DOMDocument - need to change/replace an existing HTML tag w/ a new one) says.
This should convert pre tags into div tags but it gives "Fatal error: Uncaught exception 'DOMException' with message 'Not Found Error'."
<?php
$contents = <<<STR
<pre>hi</pre>
<pre>hello</pre>
<pre>bye</pre>
STR;
$dom = new DOMDocument;
@$dom->loadHTML($contents);
foreach( $dom->getElementsByTagName("pre") as $nodePre ) {
$nodeDiv = $dom->createElement("div", $nodePre->nodeValue);
$dom->replaceChild($nodeDiv, $nodePre);
}
echo $dom->saveHTML();
?>
[Edit]
While I'm trying to iterate the node object backwards, I get this error, 'Notice: Trying to get property of non-object...'
<?php
$contents = <<<STR
<pre>hi</pre>
<pre>hello</pre>
<pre>bye</pre>
STR;
$dom = new DOMDocument;
@$dom->loadHTML($contents);
$domPre = $dom->getElementsByTagName('pre');
$length = $domPre->length;
for ($i = $length; $i > -1 ; $i--) {
$nodePre = $domPre->item($i);
echo $nodePre->nodeValue . '<br />';
// $nodeDiv = $dom->createElement("div", $nodePre->nodeValue);
// $dom->replaceChild($nodeDiv, $nodePre);
}
// echo $dom->saveHTML();
?>
[Edit]
Okay, solved. Since the code in the answer has some errors, I am posting the solution here. Thanks all.
Solution:
<?php
$contents = <<<STR
<pre>hi</pre>
<pre>hello</pre>
<pre>bye</pre>
STR;
$dom = new DOMDocument;
@$dom->loadHTML($contents);
$domPre = $dom->getElementsByTagName('pre');
$length = $domPre->length;
for ($i = $length - 1; $i > -1 ; $i--) {
$nodePre = $domPre->item($i);
$nodeDiv = $dom->createElement("div", $nodePre->nodeValue);
$nodePre->parentNode->replaceChild($nodeDiv, $nodePre);
}
echo $dom->saveHTML();
?>
The problem is the call to replaceChild(). Rather than
$dom->replaceChild($nodeDiv, $nodePre);
use
$nodePre->parentNode->replaceChild($nodeDiv, $nodePre);
update
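Here is working code. There seems to be an issue with replacing multiple nodes while iterating forward (more info here: http://php.net/manual/en/domnode.replacechild.php): the list returned by getElementsByTagName is live, so it shrinks as nodes are replaced. You'll have to use a reverse loop to replace the elements.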
$contents = <<<STR
<pre>hi</pre>
<pre>hello</pre>
<pre>bye</pre>
STR;
$dom = new DOMDocument;
@$dom->loadHTML($contents);
$elements = $dom->getElementsByTagName("pre");
for ($i = $elements->length - 1; $i >= 0; $i --) {
$nodePre = $elements->item($i);
$nodeDiv = $dom->createElement("div", $nodePre->nodeValue);
$nodePre->parentNode->replaceChild($nodeDiv, $nodePre);
}
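A variation on the same idea, offered as a sketch rather than a drop-in replacement: copy the live node list into an array first, then move the child nodes across instead of passing nodeValue to createElement. That keeps nested markup and avoids trouble with characters such as "&" in the text. The sample markup is made up.

```php
<?php
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML('<pre>hi</pre><pre>a &amp; <b>b</b></pre>');
libxml_clear_errors();

// Copy the live node list into a plain array so replacing nodes
// does not shift the items still waiting to be visited.
$pres = iterator_to_array($dom->getElementsByTagName('pre'));

foreach ($pres as $pre) {
    $div = $dom->createElement('div');
    // Move every child across, keeping nested markup and special characters.
    while ($pre->firstChild !== null) {
        $div->appendChild($pre->firstChild);
    }
    $pre->parentNode->replaceChild($div, $pre);
}

echo $dom->saveHTML();
```

With the array snapshot a plain forward foreach is safe, so no reverse loop is needed.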
Another way, with paquettg/php-html-parser (I didn't find a way to change the tag name, so I had to use a hack that re-binds $this):
use PHPHtmlParser\Dom;
use PHPHtmlParser\Dom\HtmlNode;
$dom = new Dom;
$dom->load($text);
/** @var HtmlNode[] $tags */
foreach($dom->find('pre') as $tag) {
$changeTag = function() {
$this->name = 'div';
};
$changeTag->call($tag->tag);
};
echo (string)$dom;
