I am creating a metasearch engine using the Yandex API. Yandex returns its results in XML format, so I need to traverse the XML response in order to get fields such as the URL, title and description.
The XML response from Yandex looks like this:
http://pastebin.com/kAVAVri9
This is how I have implemented it:
$dom5 = new DOMDocument();
if ($dom5->loadXML($site_results)) {
$results = $dom5->getElementsByTagName("response");
$results1 = $results->getElementsByTagName("results");
$results2 = $results1->getElementsByTagName("group");
$totals["yandex"] = 1000;
foreach ($results1 as $link) {
$url = $link->getElementsByTagName("doc")->item(2)->nodeValue;
;
$url = str_replace('http://', '', $url);
if (substr($url, -1, 1) == '/') {
$url = substr($url, 0, strlen($url) - 1);
}
$search_results[$i]["url"] = $url;
$title = $link->getElementsByTagName("doc")->item(4)->nodeValue;
$search_results[$i]["title"] = $title;
$test = $link->getElementsByTagName("doc");
$test1 = $test->getElementsByTagName("title");
$desc = $test1->getElementsByTagName("headline")->item(0)->nodeValue;
$search_results[$i]["desc"] = $desc;
$search_results[$i]["engine"] = 'yandex';
$search_results[$i]["position"] = $i + 1;
$i++;
}
}
I am new to PHP. Please forgive me if I have made a stupid mistake. I am unable to retrieve the results with my implementation. Please help me find the mistake and get the necessary fields from the XML response.
Thank you!
The method getElementsByTagName() returns a DOMNodeList:
$results = $dom5->getElementsByTagName("response");
The DOMNodeList does not have a method called getElementsByTagName(), but you call it:
$results1 = $results->getElementsByTagName("results");
This triggers the fatal error: whenever you call a method in PHP that does not exist on the object, you get a fatal error and your script stops.
Do not call undefined methods on an object and you should be fine.
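To make that concrete, here is a minimal sketch of the same traversal done with DOMDocument alone. It assumes the response nests `doc` elements (with `url`, `title` and `headline` children) inside `group` elements; the inline sample XML and the helper name `firstText()` are hypothetical and stand in for the real Yandex response.

```php
<?php
// Hypothetical sample; the real Yandex response carries more elements.
$site_results = <<<XML
<?xml version="1.0" encoding="utf-8"?>
<yandexsearch version="1.0">
  <response>
    <results>
      <grouping>
        <group>
          <doc>
            <url>http://www.example.com/</url>
            <title>Example title</title>
            <headline>Example headline</headline>
          </doc>
        </group>
      </grouping>
    </results>
  </response>
</yandexsearch>
XML;

// Hypothetical helper: trimmed text of the first <$tag> descendant, or ''.
function firstText(DOMElement $parent, string $tag): string
{
    // item(0) turns the DOMNodeList into a single node (or null).
    $node = $parent->getElementsByTagName($tag)->item(0);
    return $node === null ? '' : trim($node->textContent);
}

$search_results = [];
$dom = new DOMDocument();
if ($dom->loadXML($site_results)) {
    // Called on the document, getElementsByTagName() finds descendants at
    // any depth, so there is no need to step through response/results.
    foreach ($dom->getElementsByTagName('group') as $group) {
        $doc = $group->getElementsByTagName('doc')->item(0);
        if ($doc === null) {
            continue; // group without a doc element
        }
        // Strip the scheme and any trailing slashes from the URL.
        $url = preg_replace('~^https?://(.*?)/*$~', '$1', firstText($doc, 'url'));
        $search_results[] = [
            'url'      => $url,
            'title'    => firstText($doc, 'title'),
            'desc'     => firstText($doc, 'headline'),
            'engine'   => 'yandex',
            'position' => count($search_results) + 1,
        ];
    }
}
```

The key point is that methods such as getElementsByTagName() are only called on a DOMDocument or DOMElement, never on the DOMNodeList itself.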
Apart from these basics, I normally suggest SimpleXML for parsing such XML documents. However, this XML file is a little specific, so I suggest extending SimpleXML and adding the features you are likely to need, partly from regular expressions and partly from DOMDocument.
One concept you should know about when parsing these XML files is XPath. For example, to access the elements you had so many problems with above, you can write the path literally:
/*/response/results/grouping/group
In PHP with SimpleXML this looks like:
$url = 'http://pastebin.com/raw.php?i=kAVAVri9';
$xml = simplexml_load_file($url, 'MySimpleXML');
foreach ($xml->xpath('/*/response/results/grouping/group') as $link) {
# ... operate on $link
}
A larger example:
$url = 'http://pastebin.com/raw.php?i=kAVAVri9';
$url = '../data/yandex.xml';
$xml = simplexml_load_file($url, 'MySimpleXML');
foreach ($xml->xpath('/*/response/results/grouping/group') as $link) {
$url = $link->doc->url->str()->preg('~^https?://(.*?)/*$~u', '$1');
$title = $link->doc->title->text();
$headline = $link->doc->headline->text();
printf("<%s> %s\n%s\n\n", $url, $title, wordwrap($headline));
}
And its example output:
<www.facebook.com> " Facebook" - a social networking service
Allows users to find and communicate with friends, classmates and
colleagues, share thoughts, photos and videos, and join various groups.
<en.wikipedia.org/wiki/Facebook> Facebook - Wikipedia, the free encyclopedia
Facebook is a social networking service launched in February 2004, owned
and operated by Facebook, Inc. As of September 2012, Facebook has over one
billion active users, more than half of them using Facebook on a mobile
device.
<mashable.com/category/facebook> Facebook
...
The PHP code example above needs some more code to work because it extends from SimpleXML for the ease of use. This is done with the following code:
class MySimpleXML extends SimpleXMLElement
{
    // Text content of the element (including child elements), whitespace-normalized.
    public function text()
    {
        $string = null === $this[0] ? ''
            : (dom_import_simplexml($this)->textContent);
        return $this->str($string)->normalizeWS();
    }

    // Wrap a string (or this element's string value) in a MyString.
    public function str($string = null)
    {
        return new MyString($string ?: $this);
    }
}

class MyString
{
    private $string;

    public function __construct($string)
    {
        $this->string = $string;
    }

    // Regular-expression replace; returns a new MyString.
    public function preg($pattern, $replacement)
    {
        return new self(preg_replace($pattern, $replacement, $this));
    }

    // Collapse every run of whitespace into a single space.
    public function normalizeWS()
    {
        return $this->preg('~\s+~', ' ');
    }

    public function __toString()
    {
        return (string) $this->string;
    }
}
This might all be a little much for a beginning; check out the PHP manual for SimpleXML and the other functions used in the code example.
I have some problems with showing and formatting my JSON code.
I'm scraping an Aliexpress product page with Laravel and Goutte, and I have extracted the JSON that Aliexpress provides in its source code (you can check it: it starts with window.runParams near the end of the source).
After successfully extracting that data, I have problems formatting the JSON. As you can see, Aliexpress already returns the data as JSON, so I don't need to call json_encode or json_decode. I'm viewing my response in Postman, and I don't think Postman is corrupting the JSON, but when I open it in a JSON formatter it always reports "Invalid character" errors on multiple lines.
This is my full code; I'm just calling this class and returning its result in the controller:
use Goutte\Client;
use Symfony\Component\HttpClient\HttpClient;
class Scrapers
{
// This will be sent from another class, for now we are calling it from here
private $client;
public $url;
public $array = [];
public function __construct($url){
$this->url = $url;
$this->client = new Client(HttpClient::create(['timeout' => 60]));
$this->client->request('GET', $url)->filter('script')->each(
function ($node) {
array_push($this->array, $node->html());
});
}
public function getBetween($content,$start,$end){
$r = explode($start, $content);
if (isset($r[1])){
$r = explode($end, $r[1]);
return $r[0];
}
return '';
}
/**
* Return product title
*/
public function getTitle(){
$content = $this->array[16];
$start = "data:";
$end = "csrfToken:";
$output = $this->getBetween($content,$start,$end);
$remove_new_lines = str_replace(array("\n", "\r"), '', $output);
return $remove_new_lines;
}
For now I haven't stripped the ',' from the end of the data, but even if I delete it by hand, the formatter should accept the result as valid JSON (and it doesn't).
This is how the data is returned in Postman: https://justpaste.it/3qrtm
I tried multiple functions that remove tabs, newlines and everything else I found on the internet, but without success.
Any ideas how to fix that?
Try $node->text() instead of $node->html().
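Whichever method extracts the script body, it can help to check the fragment in PHP before returning it. The following is only a sketch: `$output` is a hypothetical stand-in for the text between "data:" and "csrfToken:", the key `titleModule` is made up for illustration, and `decodeFragment()` is not part of Goutte.

```php
<?php
// Hypothetical stand-in for the extracted fragment, including the
// trailing comma and surrounding whitespace mentioned in the question.
$output = "  {\"titleModule\": {\"subject\": \"Sample product\"}, \"price\": 9.99},\n  ";

// Hypothetical helper: clean the fragment and decode it into a PHP array.
function decodeFragment(string $fragment): array
{
    // Drop surrounding whitespace, then the trailing comma left over
    // from the enclosing JavaScript object literal.
    $json = rtrim(trim($fragment), ',');
    // JSON_THROW_ON_ERROR (PHP 7.3+) raises a JsonException describing the
    // failure instead of silently returning null.
    return json_decode($json, true, 512, JSON_THROW_ON_ERROR);
}

$data = decodeFragment($output);
```

If decoding throws, the exception message points at what is still wrong with the fragment. If it succeeds, returning the decoded array from the controller lets Laravel encode it again as clean JSON.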
I'm new to the Google Vision API client library.
I'm using the Vision API client library for PHP to detect text in images. This is my code:
<?php
require 'vendor/autoload.php';
use Google\Cloud\Vision\VisionClient;
function object_to_array($object) {
return (array) $object;
}
$vision = new VisionClient(
['keyFile' => json_decode(file_get_contents("smartcity-credentials.json"), true)]
);
$img = file_get_contents('img2.jpg');
$image=$vision->image($img,['DOCUMENT_TEXT_DETECTION']);
$result=$vision->annotate($image);
$res=object_to_array($result);
var_dump($res);
?>
All I need is to use the response as an array for further handling, but $result returns something like an array of objects or an object (sorry, I don't know much about OOP).
Although I convert $result to the array $res, if I use a foreach loop
foreach ($res as $key=>$value){
echo $value;
echo '<br>';
}
I get this
Catchable fatal error: Object of class Google\Cloud\Vision\Annotation\Document could not be converted to string
How do I get the value (the detected text) from the above response?
You should use the methods fullText() and text() to access the detected text, something like this:
$document = $annotation->fullText();
$text = $document->text();
See the specification of the classes Annotation and Document.
You cannot use $value of type Document as a string with echo.
Use print_r($annotation); to see what you actually get returned.
This document text detection example looks quite similar; notice the nested foreach loops there. Also see the documentation:
use Google\Cloud\Vision\VisionClient;
$vision = new VisionClient();
$imageResource = fopen(__DIR__.'/assets/the-constitution.jpg', 'r');
$image = $vision->image($imageResource, ['DOCUMENT_TEXT_DETECTION']);
$annotation = $vision->annotate($image);
$document = $annotation->fullText();
$info = $document->info();
$pages = $document->pages();
$text = $document->text();
Here are some more examples; in particular detect_document_text.php.
I am trying to check and validate the phone number from an HTML page.
I am using the following code to check the phone number:
<?php
class Validation {
public $default_filters = array(
'phone' => array(
'regex'=>'/^\(?(\d{3})\)?[-\. ]?(\d{3})[-\. ]?(\d{4})$/',
'message' => 'is not a valid US phone number format.'
)
);
public $filter_list = array();
function Validation($filters=false) {
if(is_array($filters)) {
$this->filters = $filters;
} else {
$this->filters = array();
}
}
function validate($filter,$value) {
if(in_array($filter,$this->filters)) {
if(in_array('default_filter',$this->filters[$filter])) {
$f = $this->default_filters[$this->filters[$filter]['default_filter']];
if(in_array('message',$this->filters[$filter])) {
$f['message'] = $this->filters[$filter]['message'];
}
} else {
$f = $this->filters[$filter];
}
} else {
$f = $this->default_filters[$filter];
}
if(!preg_match($f['regex'],$value)) {
$ret = array();
$ret[$filter] = $f['message'];
return $ret;
}
return true;
}
}
This code works fine for US phone number validation. But I do not understand how to pass a complete page to it in order to extract and check the valid phone numbers in an HTML page. Please help me understand what I can do to fulfill my requirement.
You want to look into cURL.
cURL is a computer software project providing a library and command-line tool for transferring data using various protocols.
You should fetch the page from your PHP script (use cURL or whatever you want).
Find the div / input containing the telephone number in the response cURL gives you. You can do that with a class like DOMXPath (it allows you to navigate through the DOM tree).
https://secure.php.net/manual/en/class.domxpath.php
Get the node values from the telephone input and pass them into your validator.
That is the way I would try it.
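The three steps above can be sketched roughly as follows. This is an illustration under assumptions, not a finished scraper: it assumes the numbers live in `<input type="tel">` elements, and the names `fetchPage()`, `extractPhoneInputs()` and `US_PHONE_REGEX` are hypothetical. The regular expression is the one from the question.

```php
<?php
// Regular expression copied from the question's Validation class.
const US_PHONE_REGEX = '/^\(?(\d{3})\)?[-\. ]?(\d{3})[-\. ]?(\d{4})$/';

// Step 1: fetch the page with cURL (requires the curl extension).
function fetchPage(string $url): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 10,
    ]);
    $html = curl_exec($ch);
    return $html === false ? '' : $html;
}

// Step 2: collect the value attribute of every <input type="tel">.
function extractPhoneInputs(string $html): array
{
    if ($html === '') {
        return [];
    }
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);
    $values = [];
    foreach ($xpath->query('//input[@type="tel"]/@value') as $attr) {
        $values[] = trim($attr->nodeValue);
    }
    return $values;
}

// Step 3: validate each value. Inline HTML is used here so the sketch
// runs without network access; fetchPage($url) would supply it otherwise.
$html = '<form><input type="tel" name="phone" value="(555) 123-4567">'
      . '<input type="tel" name="fax" value="12345"></form>';
$results = [];
foreach (extractPhoneInputs($html) as $value) {
    $results[$value] = preg_match(US_PHONE_REGEX, $value) === 1;
}
```

The XPath expression is the part that has to be adapted to the markup of the page being scraped.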
Your class does not really play any role in the task you want to accomplish because it's just some generic validation code that was never designed to become a scraper. However, the valuable part of it (the regular expression to determine what a US phone number is) is part of a public property so it can be reused in several ways (extend the class or call it from some other class):
public $default_filters = array(
'phone' => array(
'regex'=>'/^\(?(\d{3})\)?[-\. ]?(\d{3})[-\. ]?(\d{4})$/',
'message' => 'is not a valid US phone number format.'
)
);
E.g.:
// Scraper is a custom class written by you
$scraper = new Scraper('http://example.com');
$scraper->findByRegularExpression((new Validation())->default_filters['phone']['regex']);
Of course, this is assuming that you cannot touch the validator code.
I cannot really answer the overall question without either writing a long tutorial or the app itself but here's some quick code to get started:
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTMLFile('http://example.com');
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//text()') as $textNode) {
var_dump($textNode->nodeValue);
}
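Building on that loop, a possible next step is to search each text node for phone-like substrings. Because the validator's pattern is anchored with ^ and $, it only matches a string that is nothing but a phone number, so this sketch uses an unanchored variant of the same pattern. The variant, the function name `findPhoneNumbers()` and the sample HTML are assumptions made for illustration.

```php
<?php
// Hypothetical helper: return every US-style phone number found in the
// text nodes of an HTML string.
function findPhoneNumbers(string $html): array
{
    // Unanchored variant of the validator's pattern; the lookarounds stop
    // it from matching inside a longer run of digits.
    $pattern = '/(?<!\d)\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}(?!\d)/';

    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_use_internal_errors(false);

    $xpath = new DOMXPath($dom);
    $found = [];
    foreach ($xpath->query('//text()') as $textNode) {
        if (preg_match_all($pattern, $textNode->nodeValue, $matches)) {
            // $matches[0] holds the full matches for this text node.
            $found = array_merge($found, $matches[0]);
        }
    }
    return $found;
}

$html = '<html><body><p>Call us at (555) 123-4567</p>'
      . '<div>Fax: 555.987.6543.</div></body></html>';
$numbers = findPhoneNumbers($html);
```

A number that the markup splits across several elements would be missed by this approach, since each text node is searched on its own.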
I'm new to PHP and I have an issue I can't seem to fix or find a solution to.
I'm trying to create a helper function that will return an 'object' filled with information pulled from an XML file. The helper file, named functions.php, contains a getter function which returns a 'class' object filled with data from an SVN log.xml file.
Whenever I try to import this file using include 'functions.php'; none of the code after that line runs and the calling page is blank.
What am I doing wrong?
Here is what the functions.php helper function and class declaration look like:
<?php
$list_xml=simplexml_load_file("svn_list.xml");
$log_xml=simplexml_load_file("svn_log.xml");
class Entry{
var $revision;
var $date;
}
function getEntry($date){
$ret = new Entry;
foreach ($log_xml->logentry as $logentry){
if ($logentry->date == $date){
$ret->date = $logentry->date;
$ret->author = $logentry->author;
}
}
return $ret;
}
I'm not sure what the point is of having a helper function separate from the class; personally I'd combine the two. Something like this:
other-file.php
require './Entry.php';
$oLogEntry = Entry::create($date, 'svn_log.xml');
echo $oLogEntry->date;
echo $oLogEntry->revision;
Entry.php
class Entry
{
public $revision;
public $date;
public $author;
public static function create($date, $file) {
$ret = new Entry;
$xml = simplexml_load_file($file);
foreach($xml->logentry as $logentry) {
if($logentry->date == $date) {
$ret->date = $logentry->date;
$ret->author = $logentry->author;
$ret->revision = $logentry->revision;
}
}
return $ret;
}
}
EDIT
In light of the fact OP is new to PHP, I'll revise my suggestion completely. How about ditching the class altogether here? There's hardly any reason to use a class I can see at this point; let's take a look at using an array instead.
I might still move the simplexml_load_file call into the helper function, though. I would need to see the other operations to justify keeping it separate.
entry-helper.php
function getEntry($date, $file) {
$log_xml = simplexml_load_file($file);
$entry = array();
foreach($log_xml->logentry as $logentry) {
if($logentry->date == $date) {
$entry['date'] = $logentry->date;
$entry['author'] = $logentry->author;
$entry['revision'] = $logentry->revision;
}
}
return $entry;
}
other-file.php
require './entry-helper.php';
$aLogEntry = getEntry($date, 'svn_log.xml');
echo $aLogEntry['date'];
echo $aLogEntry['revision'];
EDIT
One final thought: since you're seemingly searching for a point of interest in the log and then copying out portions of that node, why not just search for the match and return that node? Here's what I mean (a return value of false indicates there was no log entry from that date):
function getEntry($date, $file) {
    $log_xml = simplexml_load_file($file);
    foreach($log_xml->logentry as $logentry) {
        if($logentry->date == $date) {
            return $logentry;
        }
    }
    return false;
}
Also, what happens if you have multiple log entries from the same date? This will only return a single entry for a given date.
I would suggest using XPath. There you can throw a single, concise XPath expression at this log XML and get back an array of objects for all the entries from a given date. What you're working on is a good starting point, but once you have the basics, I'd move to XPath for a clean final solution.
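A minimal sketch of that XPath approach, assuming the layout produced by `svn log --xml` (a `log` root with `logentry` children, where `revision` is an attribute). The sample data and the function name `getEntriesByDay()` are hypothetical.

```php
<?php
// Hypothetical sample in the shape of `svn log --xml` output.
$log = <<<XML
<?xml version="1.0"?>
<log>
  <logentry revision="41">
    <author>alice</author>
    <date>2013-01-05T10:00:00.000000Z</date>
    <msg>First change</msg>
  </logentry>
  <logentry revision="42">
    <author>bob</author>
    <date>2013-01-05T16:30:00.000000Z</date>
    <msg>Second change</msg>
  </logentry>
  <logentry revision="43">
    <author>alice</author>
    <date>2013-01-06T09:15:00.000000Z</date>
    <msg>Third change</msg>
  </logentry>
</log>
XML;

// Hypothetical helper: all log entries whose date starts with $day
// (YYYY-MM-DD). $day must not contain double quotes, because it is
// placed directly into the XPath expression.
function getEntriesByDay(SimpleXMLElement $log, string $day): array
{
    $found = $log->xpath(sprintf('/log/logentry[starts-with(date, "%s")]', $day));
    return $found ?: []; // xpath() does not return an array on failure
}

$entries = getEntriesByDay(simplexml_load_string($log), '2013-01-05');
foreach ($entries as $entry) {
    // revision is an attribute, so it is read with array syntax.
    printf("r%s by %s\n", $entry['revision'], $entry->author);
}
```

Unlike the loop versions above, this returns every entry from the given day rather than a single one.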
I am writing some code for an IRC bot written in PHP and running on the Linux CLI. I'm having a little trouble with my code to retrieve a website's title tag and display it using a DOMDocument NodeList. Basically, on websites with two or more title tags (and you would be surprised how many there actually are...) I want to process only the first title tag. As you can see from the code below (which works fine for processing one or more tags), there is a foreach block that iterates through each title tag.
public function onReceivedData($data) {
// loop through each message token
foreach ($data["message"] as $token) {
// if the token starts with www, add http file handle
if (strcmp(substr($token, 0, 4), "www.") == 0) {
$token = "http://" . $token;
}
// validate token as a URL
if (filter_var($token, FILTER_VALIDATE_URL)) {
// create timeout stream context
$theContext['http']['timeout'] = 3;
$context = stream_context_create($theContext);
// get contents of url
if ($file = file_get_contents($token, false, $context)) {
// instantiate a new DOMDocument object
$dom = new DOMDocument;
// load the html into the DOMDocument obj
@$dom->loadHTML($file);
// retrieve the title from the DOM node
// if assignment is valid then...
if ($title = $dom->getElementsByTagName("title")) {
// send a message to the channel
foreach ($title as $theTitle) {
$this->privmsg($data["target"], $theTitle->nodeValue);
}
}
} else {
// notify of failure
$this->privmsg($data["target"], "Site could not be reached");
}
}
}
}
What I'd prefer is to somehow limit it to processing only the first title tag. I'm aware that I can just wrap an if statement around it with a variable so it only echoes once, but I'm looking more at using a "for" statement to process a single iteration. However, when I do this, I can't access the title with $title->nodeValue; it says it's undefined, and only when I use foreach ($title as $theTitle) can I access the values. I've tried $title[0]->nodeValue and $title->nodeValue(0) to retrieve the first title from the list, but unfortunately to no avail. I'm a bit stumped and a quick Google search didn't turn up a lot.
Any help would be greatly appreciated! Cheers, and I'll keep looking too.
You can solve this with XPath:
$dom = new DOMDocument();
@$dom->loadHTML($file);
$xpath = new DOMXPath($dom);
$title = $xpath->query('//title')->item(0)->nodeValue;
Try something like this:
$title->item(0)->nodeValue;
http://www.php.net/manual/en/class.domnodelist.php
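Either approach fails with an error when the page has no title at all, because item(0) then returns null. A small sketch that guards against this; the function name `firstTitle()` is hypothetical.

```php
<?php
// Hypothetical helper: text of the first <title> element, or null when
// the document has none.
function firstTitle(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titles = $dom->getElementsByTagName('title');
    if ($titles->length === 0) {
        return null;
    }
    // item(0) is the first node in document order.
    return trim($titles->item(0)->nodeValue);
}

$title = firstTitle('<html><head><title>First</title><title>Second</title></head><body></body></html>');
$none  = firstTitle('<html><body><p>No title here</p></body></html>');
```

In the bot, the result could then be passed to privmsg() only when it is not null.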