PHP DOMDocument - trouble accessing list index

I am writing some code for an IRC bot written in PHP and running on the Linux CLI. I'm having a little trouble with my code to retrieve a website's title tag and display it using a DOMDocument NodeList. Basically, on websites with two or more title tags (and you would be surprised how many there actually are...) I want to process only the first title tag. As you can see from the code below (which works fine for one or more tags), there is a foreach block that iterates through each title tag.
public function onReceivedData($data) {
// loop through each message token
foreach ($data["message"] as $token) {
// if the token starts with www, add http file handle
if (strcmp(substr($token, 0, 4), "www.") == 0) {
$token = "http://" . $token;
}
// validate token as a URL
if (filter_var($token, FILTER_VALIDATE_URL)) {
// create timeout stream context
$theContext['http']['timeout'] = 3;
$context = stream_context_create($theContext);
// get contents of url
if ($file = file_get_contents($token, false, $context)) {
// instantiate a new DOMDocument object
$dom = new DOMDocument;
// load the html into the DOMDocument obj
@$dom->loadHTML($file);
// retrieve the title from the DOM node
// if assignment is valid then...
if ($title = $dom->getElementsByTagName("title")) {
// send a message to the channel
foreach ($title as $theTitle) {
$this->privmsg($data["target"], $theTitle->nodeValue);
}
}
} else {
// notify of failure
$this->privmsg($data["target"], "Site could not be reached");
}
}
}
}
What I'd prefer is to somehow limit it to processing only the first title tag. I'm aware that I can just wrap an if statement around it with a flag variable so it only echoes once, but I'd rather use a "for" statement to process a single iteration. However, when I do this, I can't access the title's value with $title->nodeValue; it says it's undefined, and only when I use foreach ($title as $theTitle) can I access the values. I've tried $title[0]->nodeValue and $title->nodeValue(0) to retrieve the first title from the list, but to no avail. I'm a bit stumped, and a quick Google search didn't turn up much.
Any help would be greatly appreciated! Cheers, and I'll keep looking too.

You can solve this with XPath:
$dom = new DOMDocument();
@$dom->loadHTML($file);
$xpath = new DOMXPath($dom);
$title = $xpath->query('//title')->item(0)->nodeValue;

Try something like this:
$title->item(0)->nodeValue;
http://www.php.net/manual/en/class.domnodelist.php
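Both answers rely on DOMNodeList::item(), which returns null when the index does not exist, so a page without any title tag needs a guard. A minimal sketch of that, assuming the HTML has already been downloaded into a string (the function name firstTitle is made up for illustration):

```php
<?php
// Hypothetical helper: returns the text of the first <title>, or null if there is none.
function firstTitle(string $html): ?string
{
    if ($html === '') {
        return null; // nothing to parse
    }

    $dom = new DOMDocument();

    // Collect parser warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // item(0) is the first match, or null when the list is empty.
    $node = $dom->getElementsByTagName('title')->item(0);

    return $node === null ? null : trim($node->nodeValue);
}
```

In the bot, the result could be passed to privmsg() only when it is not null.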

Related

Why using DOMDocument makes site load slower?

I'm using DOMDocument with XPath to load some data into my site from an external (fast) website.
Right now I use 4 URLs (please see below). I need to increase that to 8 URLs.
I have noticed that the more of those I add, the slower the site loads.
Is there any way to make the XPath loading faster?
Or maybe there is at least some way to load the data on website1 (a child website) and, once it has loaded, include the data in my main website.
Any tips would be appreciated.
<?php
$parent_title = get_the_title( $post->post_parent );
$html_string = file_get_contents('weburladresshere');
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html_string);
libxml_clear_errors();
$xpath = new DOMXpath($dom);
$values = array();
$row = $xpath->query('myquery');
foreach($row as $value) {
print($value->nodeValue);
}
?>
It's slow because you load external sites. Instead of loading them just in time, try to load them "in the background" via another PHP job and save them to a temporary file. Then you can load the HTML from your local temp file, which is faster than loading the remote $html_string via file_get_contents.
Extended answer
Here you can see a very lightweight example of how you could handle it.
function getPageContent($url) {
$filename = md5($url).'.tmp';
// implement your extended cache logic here
// for example: store it just for 60 seconds...
if(!file_exists($filename)) {
file_put_contents($filename, file_get_contents($url));
}
return file_get_contents($filename);
}
function businessLogic($url) {
$htmlContent = getPageContent($url);
// your business logic here
}
businessLogic($url);
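The "store it just for 60 seconds" idea from the comment above could be sketched as follows. This is only one possible approach: the function name getCachedContent, the $fetch callback and the use of the system temp directory are assumptions made for illustration, not part of the original answer.

```php
<?php
// Hypothetical time-limited file cache around a page download.
// $fetch is an injectable downloader (defaults to file_get_contents),
// which also makes the function easy to try without network access.
function getCachedContent(string $url, int $ttl = 60, ?callable $fetch = null, ?string $cacheDir = null): string
{
    $fetch = $fetch ?? 'file_get_contents';
    $cacheDir = $cacheDir ?? sys_get_temp_dir();
    $filename = $cacheDir . DIRECTORY_SEPARATOR . md5($url) . '.tmp';

    // PHP caches stat() results, so reset them before checking the file age.
    clearstatcache(true, $filename);
    $isFresh = is_file($filename) && (time() - filemtime($filename)) < $ttl;

    if ($isFresh) {
        return (string) file_get_contents($filename);
    }

    $content = $fetch($url);
    if ($content === false) {
        // Download failed: fall back to a stale copy if one exists.
        if (is_file($filename)) {
            return (string) file_get_contents($filename);
        }
        throw new RuntimeException("Could not fetch $url");
    }

    file_put_contents($filename, $content, LOCK_EX);

    return $content;
}
```

With eight URLs, only requests that hit an expired cache entry pay the download cost; a cron job calling the same function could keep the entries warm.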

How I can check and validate the phone number from an HTML page using php?

I am trying to check and validate the phone number from an HTML page.
I am using the following code to check the phone number:
<?php
class Validation {
public $default_filters = array(
'phone' => array(
'regex'=>'/^\(?(\d{3})\)?[-\. ]?(\d{3})[-\. ]?(\d{4})$/',
'message' => 'is not a valid US phone number format.'
)
);
public $filters = array();
function __construct($filters=false) {
if(is_array($filters)) {
$this->filters = $filters;
} else {
$this->filters = array();
}
}
function validate($filter,$value) {
if(isset($this->filters[$filter])) {
if(isset($this->filters[$filter]['default_filter'])) {
$f = $this->default_filters[$this->filters[$filter]['default_filter']];
if(isset($this->filters[$filter]['message'])) {
$f['message'] = $this->filters[$filter]['message'];
}
} else {
$f = $this->filters[$filter];
}
} else {
$f = $this->default_filters[$filter];
}
if(!preg_match($f['regex'],$value)) {
$ret = array();
$ret[$filter] = $f['message'];
return $ret;
}
return true;
}
}
This code works fine for US phone number validation. But I do not understand how to pass a complete page to it in order to extract and check the valid phone numbers from an HTML page. Kindly help me understand what I can do to fulfill my requirement.
You want to look into cURL.
cURL is a computer software project providing a library and command-line tool for transferring data using various protocols.
You should get the page from your PHP script (use cURL or whatever you want).
Find the div / input containing the telephone number in the response cURL gives you. You can do that with a library like DOMXPath (it allows you to navigate through the DOM tree).
https://secure.php.net/manual/en/class.domxpath.php
Get the node values from the telephone input and pass them into your validator.
That is the way I would try it.
Your class does not really play any role in the task you want to accomplish because it's just some generic validation code that was never designed to become a scraper. However, the valuable part of it (the regular expression to determine what a US phone number is) is part of a public property so it can be reused in several ways (extend the class or call it from some other class):
public $default_filters = array(
'phone' => array(
'regex'=>'/^\(?(\d{3})\)?[-\. ]?(\d{3})[-\. ]?(\d{4})$/',
'message' => 'is not a valid US phone number format.'
)
);
E.g.:
// Scraper is a custom class written by you
$scraper = new Scraper('http://example.com');
$scraper->findByRegularExpression((new Validation())->default_filters['phone']['regex']);
Of course, this is assuming that you cannot touch the validator code.
I cannot really answer the overall question without either writing a long tutorial or the app itself but here's some quick code to get started:
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTMLFile('http://example.com');
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//text()') as $textNode) {
var_dump($textNode->nodeValue);
}
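Building on the text-node loop above, a rough sketch of the extraction step might look like this. Note the assumptions: the function name extractUsPhoneNumbers is invented, the HTML is passed in as a non-empty string, and the pattern is a variant of the validator's regex with the ^ and $ anchors replaced by word boundaries so that it can find numbers inside longer text.

```php
<?php
// Hypothetical helper: returns the distinct US-style phone numbers found in the page text.
function extractUsPhoneNumbers(string $html): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Same shape as the validator's regex, but unanchored: \b keeps the
    // match from starting or ending in the middle of a longer digit run.
    $pattern = '/\(?\b(\d{3})\)?[-. ]?(\d{3})[-. ]?(\d{4})\b/';

    $numbers = array();
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//text()') as $textNode) {
        if (preg_match_all($pattern, $textNode->nodeValue, $matches)) {
            foreach ($matches[0] as $match) {
                $numbers[] = $match;
            }
        }
    }

    return array_values(array_unique($numbers));
}
```

Each returned string could then be passed to the existing validate('phone', ...) method if a strict format check is still wanted.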

Post xml to PHP for validation against xsd with XMLHttpRequest or Ajax?

I have a Flash application that posts XML to a PHP page which validates it against an XSD schema. I'm trying to do the same thing from an HTML page. I've tried XMLHttpRequest and jQuery's ajax call, but I keep running into the same issues: "document has no document" or the "access-control-allow-origin" header issue. I can fix one but not the other.
My PHP page looks like this:
function libxml_append_errors() {
global $returnXML, $errors;
$e = libxml_get_errors();
foreach ($e as $error) {
$en = $returnXML->createElement("error", trim($error->message));
$en->setAttribute('line',$error->line);
$errors->appendChild($en);
}
libxml_clear_errors();
}
libxml_use_internal_errors(true);
$xml = new DOMDocument();
$contents = file_get_contents('php://input');
$xml->loadXML($contents);
$returnXML = new DOMDocument("1.0", "utf-8");
$rootNode = $returnXML->createElement("result");
$returnXML->appendChild($rootNode);
$errors = $returnXML->createElement("errors");
if (!$xml->schemaValidate('muffin_dumplings.xsd'))
{
libxml_append_errors();
}
$rootNode->appendChild($errors);
echo $returnXML->saveXML();
Either way, I'm looking to get XML back with any validation errors, or a simple empty errors node, the same as I do with Flash.
If anyone cares, here is the answer: when you post to PHP using ajax, you need to send the right headers and pass a content type. Then it works.
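For the PHP side, the validation step can also be written without globals. The following is only a sketch under assumptions: the function name collectSchemaErrors is invented, and the schema is passed as a string (via schemaValidateSource) rather than as the file name used in the question.

```php
<?php
// Hypothetical helper: validates an XML string against an XSD string and
// returns a list of ['line' => int, 'message' => string] entries (empty if valid).
function collectSchemaErrors(string $xmlSource, string $xsdSource): array
{
    if ($xmlSource === '') {
        // An empty request body cannot be parsed at all.
        return array(array('line' => 0, 'message' => 'Empty request body'));
    }

    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = new DOMDocument();
    $ok = $doc->loadXML($xmlSource) && $doc->schemaValidateSource($xsdSource);

    $errors = array();
    if (!$ok) {
        foreach (libxml_get_errors() as $error) {
            $errors[] = array('line' => $error->line, 'message' => trim($error->message));
        }
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $errors;
}
```

In the request handler, the body would typically come from file_get_contents('php://input'); an empty body there is one plausible cause of the "empty document" style of error, for example when the ajax call is sent without a suitable content type. The cross-origin problem is a separate matter of the server sending an Access-Control-Allow-Origin header via header().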

I'm trying to scrape a specific div with an id on a page

I want to scrape the contents of a page, well really just a single div from that page, and display it to the user inside of a small div on a webpage. I just need a piece of info from a carfax page that needs user credentials so I can't post the exact code but I tried using google.com and have the same problem so the solution should cross over.
Right now I've tried this:
$webPage = file_get_contents('http://www.google.com');
$doc = new DOMDocument();
$doc->loadHTML($webPage);
$div = $doc->getElementById('lga');//this is the id to the div holding the image above the textbox
//echo $webPage;//this displays www.google.com minus the image. I imagine because of the file path
//var_dump($div);//this display "object(DOMElement)#2 (0) { }" and I'm not sure what that means
//echo $div;//this has a server error
I'm also looking at simple_html_dom.php trying to figure that out.
You can use this:
/**
* Downloads a web page from $url, selects the element by $id
* and returns its XML string representation.
*/
function getElementByIdAsString($url, $id, $pretty = true) {
$doc = new DOMDocument();
if(!@$doc->loadHTMLFile($url)) {
throw new Exception("Failed to load $url");
}
// Obtain the element
$element = $doc->getElementById($id);
if(!$element) {
throw new Exception("An element with id $id was not found");
}
if($pretty) {
$doc->formatOutput = true;
}
// Return the string representation of the element
return $doc->saveXML($element);
}
// call it:
echo getElementByIdAsString('http://www.google.com', 'lga');
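As for the attempts in the question: echo $div fails because a DOMElement cannot be converted to a string, and the empty-looking var_dump() output does not necessarily mean the element is empty. The node has to be serialized explicitly. A small sketch that works on an already-downloaded string and returns HTML rather than XML (the function name elementHtmlById is an assumption for illustration):

```php
<?php
// Hypothetical helper: returns the HTML markup of the element with the given id,
// or null if the document contains no such element.
function elementHtmlById(string $html, string $id): ?string
{
    if ($html === '') {
        return null;
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $element = $doc->getElementById($id);
    if ($element === null) {
        return null;
    }

    // saveHTML() with a node argument serializes only that subtree.
    return $doc->saveHTML($element);
}
```

Relative image paths inside the returned markup still point at the original site, which is likely why the image was missing when the whole page was echoed.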

Traversing the XML response from Yandex API using PHP

I am creating a metasearch engine using the Yandex API. Yandex returns results in XML format, so we need to traverse the XML response in order to get the different fields like URL, title, description etc.
The XML response by Yandex is as follows:
http://pastebin.com/kAVAVri9
This is how I have implemented it:
$dom5 = new DOMDocument();
if ($dom5->loadXML($site_results)) {
$results = $dom5->getElementsByTagName("response");
$results1 = $results->getElementsByTagName("results");
$results2 = $results1->getElementsByTagName("group");
$totals["yandex"] = 1000;
foreach ($results1 as $link) {
$url = $link->getElementsByTagName("doc")->item(2)->nodeValue;
;
$url = str_replace('http://', '', $url);
if (substr($url, -1, 1) == '/') {
$url = substr($url, 0, strlen($url) - 1);
}
$search_results[$i]["url"] = $url;
$title = $link->getElementsByTagName("doc")->item(4)->nodeValue;
$search_results[$i]["title"] = $title;
$test = $link->getElementsByTagName("doc");
$test1 = $test->getElementsByTagName("title");
$desc = $test1->getElementsByTagName("headline")->item(0)->nodeValue;
$search_results[$i]["desc"] = $desc;
$search_results[$i]["engine"] = 'yandex';
$search_results[$i]["position"] = $i + 1;
$i++;
}
}
I am new to PHP. Please forgive me if I have made some stupid mistake. I am unable to retrieve the results through my implementation. Please help me find the mistake and get the necessary fields from the XML response.
Thank you!
The method getElementsByTagName() returns a DOMNodeList:
$results = $dom5->getElementsByTagName("response");
The DOMNodeList does not have a method called getElementsByTagName(), but you call it:
$results1 = $results->getElementsByTagName("results");
Therefore the fatal error is triggered: whenever you call a method in PHP that does not exist on an object, you get a fatal error and your script stops working.
Do not call undefined object methods and you should be fine.
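If you want to stay with DOMDocument, the usual pattern is to iterate the DOMNodeList (or pick one node with item()) and call getElementsByTagName() on each DOMElement instead. A minimal sketch follows; the XML below is a made-up, heavily shortened stand-in for the Yandex response, and the helper name firstChildText is invented for illustration.

```php
<?php
// Hypothetical sample data, loosely modelled on the structure of the response.
$xmlString = <<<XML
<yandexsearch>
  <response>
    <results>
      <grouping>
        <group>
          <doc>
            <url>http://www.example.com/</url>
            <title>Example title</title>
            <headline>Example headline</headline>
          </doc>
        </group>
        <group>
          <doc>
            <url>https://example.org/page</url>
            <title>Second title</title>
          </doc>
        </group>
      </grouping>
    </results>
  </response>
</yandexsearch>
XML;

// Text of the first descendant with the given tag name, or '' if there is none.
function firstChildText(DOMElement $parent, string $tagName): string
{
    $node = $parent->getElementsByTagName($tagName)->item(0);
    return $node === null ? '' : trim($node->textContent);
}

$dom = new DOMDocument();
$dom->loadXML($xmlString);

$searchResults = array();
// Iterate the node list; each $group is a DOMElement, which does have getElementsByTagName().
foreach ($dom->getElementsByTagName('group') as $group) {
    $url = firstChildText($group, 'url');
    // Strip the scheme and any trailing slashes, as the question's code intended.
    $url = preg_replace('~^https?://(.*?)/*$~', '$1', $url);

    $searchResults[] = array(
        'url'      => $url,
        'title'    => firstChildText($group, 'title'),
        'desc'     => firstChildText($group, 'headline'),
        'engine'   => 'yandex',
        'position' => count($searchResults) + 1,
    );
}
```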
Apart from these basics, for parsing such XML documents I normally suggest SimpleXML. However, this XML file is a little specific, therefore I suggest extending SimpleXML and adding the features you will likely need, partly from regular expressions and partly from DOMDocument.
One concept you should know about when parsing these XML files is XPath. For example, to access the elements you had so many problems with above, you can write the path literally:
/*/response/results/grouping/group
In PHP with SimpleXML this looks like:
$url = 'http://pastebin.com/raw.php?i=kAVAVri9';
$xml = simplexml_load_file($url, 'MySimpleXML');
foreach ($xml->xpath('/*/response/results/grouping/group') as $link) {
# ... operate on $link
}
A larger example:
$url = 'http://pastebin.com/raw.php?i=kAVAVri9';
// $url = '../data/yandex.xml'; // alternatively, a local copy of the response
$xml = simplexml_load_file($url, 'MySimpleXML');
foreach ($xml->xpath('/*/response/results/grouping/group') as $link) {
$url = $link->doc->url->str()->preg('~^https?://(.*?)/*$~u', '$1');
$title = $link->doc->title->text();
$headline = $link->doc->headline->text();
printf("<%s> %s\n%s\n\n", $url, $title, wordwrap($headline));
}
And its exemplary output:
<www.facebook.com> " Facebook" - a social networking service
Allows users to find and communicate with friends, classmates and
colleagues, share thoughts, photos and videos, and join various groups.
<en.wikipedia.org/wiki/Facebook> Facebook - Wikipedia, the free encyclopedia
Facebook is a social networking service launched in February 2004, owned
and operated by Facebook, Inc. As of September 2012, Facebook has over one
billion active users, more than half of them using Facebook on a mobile
device.
<mashable.com/category/facebook> Facebook
...
The PHP code example above needs some more code to work because it extends from SimpleXML for the ease of use. This is done with the following code:
class MySimpleXML extends SimpleXMLElement
{
public function text()
{
$string = null === $this[0] ? ''
: (dom_import_simplexml($this)->textContent);
return $this->str($string)->normalizeWS();
}
public function str($string = null)
{
return new MyString($string ?: $this);
}
}
class MyString
{
private $string;
public function __construct($string)
{
$this->string = $string;
}
public function preg($pattern, $replacement)
{
return new self(preg_replace($pattern, $replacement, $this));
}
public function normalizeWS()
{
return $this->preg('~\s+~', ' ');
}
public function __toString()
{
return (string) $this->string;
}
}
This might all be a bit much for the beginning; check out the PHP manual for SimpleXML and the other functions used in the code example.
