Can I validate XHTML programmatically from a PHP script?

I would like to have a PHP function that checks whether a URL returns valid HTML and returns true or false.
Something like:
if (validate_page("/somefile.html")) { echo "This page validated!!"; }
I found TWINE, but it doesn't just give me true or false, and I also got an error running it on my system. http://twineproject.sourceforge.net/
I found this offline tool that looked promising. http://htmlhelp.com/tools/validator/offline/
I also found this thread about a Ruby gem, but it sounds problematic: How do I validate XHTML with nokogiri?

Tidy?
Validate: http://us.php.net/manual/en/function.tidy-diagnose.php
Repair: http://us.php.net/manual/en/tidy.repairstring.php

You can use the W3C validator's API. There is a PHP library available through PEAR which uses that API.
You can also install the validator on your local server, though you might not have sufficient permissions to do so if you are using shared hosting.
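If you would rather skip the library, a minimal sketch is to query the public validator endpoint directly and read the X-W3C-Validator-Status response header, which comes back as "Valid" or "Invalid" (assuming the public service is acceptable for your request volume; install it locally otherwise):
function validate_page($url) {
    // Ask the W3C validator to check the URL; the verdict is reported in a header
    $check = 'http://validator.w3.org/check?uri=' . urlencode($url);
    $headers = get_headers($check, 1); // 1 = return an associative array of headers
    return isset($headers['X-W3C-Validator-Status'])
        && $headers['X-W3C-Validator-Status'] === 'Valid';
}

if (validate_page('http://example.com/somefile.html')) { echo "This page validated!!"; }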

You could also try DOMDocument->validate() if you are using PHP 5 and if the document contains a DTD.
http://www.php.net/manual/en/domdocument.validate.php
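For example (a short sketch; validate() checks the document against the DTD declared in its DOCTYPE, so the file must actually contain one):
$doc = new DOMDocument;
$doc->load('somefile.xhtml'); // the file must declare a DTD in its DOCTYPE
var_dump($doc->validate());   // true if the document conforms to that DTD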

XHTML has to be valid XML. If you only want to check that, you could easily use SimpleXML, but if you also want to check for correct elements/attributes this won't help you (in that case, NullUserException's hint about the W3C validator API would be the best solution to choose).

libxml_use_internal_errors(true);
$doc = new DOMDocument;
$doc->loadHTMLFile($file); // load the file you want validated
var_dump(libxml_get_errors()); // an empty array means no parse errors
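Wrapped into the boolean helper the question asks for (a sketch along the same lines; note that libxml reports parse errors only, not full XHTML conformance):
function validate_page($file) {
    libxml_use_internal_errors(true); // collect errors instead of printing warnings
    libxml_clear_errors();            // discard errors from any earlier parse
    $doc = new DOMDocument;
    $doc->loadHTMLFile($file);
    return count(libxml_get_errors()) === 0;
}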


file_get_html() doesn't work [duplicate]

I used the following code to parse the HTML of another site, but it displays a fatal error:
$html=file_get_html('http://www.google.co.in');
Fatal error: Call to undefined function file_get_html()
Are you sure you have downloaded and included PHP Simple HTML DOM Parser?
You are calling a function that does not belong to PHP itself.
Download the simple_html_dom class and use the methods it includes however you like. It is really great, especially when you are working with email newsletters:
include_once('simple_html_dom.php');
$html = file_get_html('http://www.google.co.in');
As everyone has told you, you are seeing this error because you didn't download and include the simple_html_dom class after copy-pasting that third-party code.
Now you have two options. Option one is what all the other developers have suggested in their answers, along with mine.
Option two is to not use that third-party PHP class at all, and instead use PHP's built-in DOM classes to perform the same task. Those classes always ship with PHP, so this method is also more efficient, not to mention more secure.
Instead of file_get_html(), which is not a function defined by PHP itself, use:
$doc = new DOMDocument();
$doc->loadHTMLFile("filename.html");
echo $doc->saveHTML();
These functions are indeed defined by PHP itself; check them in the original PHP manual on php.net.
This puts the HTML into a DOM object which can be parsed by individual tags, attributes, etc. Here is an example of getting all the 'href' attributes and corresponding node values out of the 'a' tags. Very cool...
$tags = $doc->getElementsByTagName('a');
foreach ($tags as $tag) {
    echo $tag->getAttribute('href').' | '.$tag->nodeValue."\n";
}
It looks like you're looking for simplexml_load_file, which loads a file and puts it into a SimpleXML object.
Of course, if the file is not well-formed XML that will cause problems. Your other option is DOMDocument::loadHTMLFile, which is a good deal more forgiving of badly formed documents.
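A minimal sketch of that fallback (the URL is just the asker's example):
libxml_use_internal_errors(true);                      // keep parse warnings quiet
$xml = simplexml_load_file('http://www.google.co.in'); // strict: needs well-formed XML
if ($xml === false) {
    $doc = new DOMDocument();
    $doc->loadHTMLFile('http://www.google.co.in');     // tolerant of broken HTML
    $xml = simplexml_import_dom($doc);
}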
If you don't care about the XML structure and just want the data, you can use file_get_contents:
$html = file_get_contents('http://www.google.co.in');
This gets the raw HTML content of the page.
In simple words:
Download simple_html_dom.php (linked above).
Now add this line to your PHP file:
include_once('simple_html_dom.php');
and start your coding after that:
$html = file_get_html('http://www.google.co.in');
No error will be displayed.
Try file_get_contents.
http://www.php.net/manual/en/function.file-get-contents.php

Solr for PHP: getDigestedResponse not working

I've managed to install Solr for PHP on my Windows 7 64bit Machine using the plugin I found here:
downloads.php.net/pierre/
It was linked to on this site:
wiki.apache.org/solr/SolPHP
(links are not clickable because I'm a new user)
I've got everything up and running; searches and indexing are working, but only when I use the getRawResponse() method and parse the result with SimpleXML (http://de.php.net/manual/en/book.simplexml.php).
The getDigestedResponse() method, which is supposed to return a PHP object, just returns string(1) " ".
The method getResponse() (http://docs.php.net/manual/en/solrresponse.getresponse.php) just times out.
It wouldn't be that much of a problem, but some of the XML from the raw response doesn't seem to be valid, and when it is parsed with SimpleXML some of the attributes are missing; using regular expressions to get the needed data would be too much of a hassle.
Has anyone gotten this to work yet? Help is greatly appreciated!
It depends on how you are parsing the response. Try the code below, drop the PHP/PECL Solr libs, and go with cURL (e.g. request hostNameHere:8983/solr/select/?q=solr&start=0&rows=10&indent=on and send the resulting XML to the function below).
If you can access a resource (Solr) via a URL, then there is no need for an ancillary library to do what cURL can do:
function makeSimpleXML($xml) {
    $dom = new DOMDocument;
    // loadXML() returns false on a parse failure; checking $dom itself would never fail
    if (!$dom->loadXML($xml)) {
        // ErrorUtility::throwFatal("could not parse xml. please check the format", "XML Parsing Error");
        return false;
    }
    return simplexml_import_dom($dom);
}
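A usage sketch with cURL (the host, port, and query are the example values from above; adjust them to your setup):
$ch = curl_init('http://hostNameHere:8983/solr/select/?q=solr&start=0&rows=10&indent=on');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
$rawXml = curl_exec($ch);
curl_close($ch);

$response = makeSimpleXML($rawXml);
if ($response !== false) {
    // Solr's standard XML output nests documents under <result><doc>
    foreach ($response->result->doc as $doc) {
        print_r($doc);
    }
}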

PHP Tidy: tidy_setopt() alternative?

So I'm trying to tailor PHP's Tidy to my liking, but the problem is with the tidy_setopt() function.
I know Tidy is installed and working just fine, and the PHP docs say tidy_setopt() has been removed as of Tidy 2.0 (since the ob callback is working perfectly, I can safely assume I'm running Tidy 2.0+).
Here is the problem: there is no alternative function. I'm hoping there is a way around this so I can set up the ob handler's options the way I want without actually needing to edit a configuration file.
I'm sure my hosting will be willing to edit Tidy's configuration file if needed, but I'd rather not add to the barrage of support tickets I've been sending them for various reasons as it is.
If I need to create my own callback for output buffering I can do so (I see some possibly useful methods using the OO approach to tidy) but I'd rather have it as slim as possible.
Instead of using
tidy_setopt('indent', FALSE);
You should use
$config = array('indent' => FALSE);
$text = tidy_parse_string($text, $config, 'UTF8');
Also see the Output Control Functions manual for the "User defined callback function example".
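Putting the two together, here is a sketch of a user-defined output-buffer callback that runs Tidy with an inline configuration (the option names are standard Tidy options; no configuration file is needed):
function tidy_ob_handler($html) {
    $config = array(
        'indent'       => true,
        'output-xhtml' => true,
        'wrap'         => 200,
    );
    $tidy = tidy_parse_string($html, $config, 'utf8');
    $tidy->cleanRepair();          // apply the fixes Tidy suggests
    return tidy_get_output($tidy); // hand the cleaned markup back to the buffer
}

ob_start('tidy_ob_handler');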

passing an xml in nusoap

Good day,
I am having trouble passing an xml in nusoap.
sample:
I pass this xml
<test>123</test>
The nusoap response is
test123/test
The greater-than and less-than signs are removed.
This is my code for the server:
require_once('nusoap/nusoap.php');
$server = new nusoap_server; // Create server instance
$server->configureWSDL('demows', 'http://example.org/demo');
$server->register('myFunction',
    array("param" => "xsd:string"),  // input
    array("result" => "xsd:string"), // output
    'http://example.org/demo'
);
function myFunction($parameters) {
    return $parameters;
}
// Use the request to try to invoke the service
$HTTP_RAW_POST_DATA = isset($HTTP_RAW_POST_DATA) ? $HTTP_RAW_POST_DATA : '';
$server->service($HTTP_RAW_POST_DATA);
This is my code for the client:
require_once('nusoap/nusoap.php');
$client = new nusoap_client('http://localhost/nusoap/ws.php?wsdl', true);
$clientparam = '<test>123</test>';
$result = $client->call('myFunction',
array('param'=>$clientparam)
);
print_r($result);
*Note that the above code works on PHP version 5.3.0 but NOT on PHP version 5.2.0-8+etch13, which is the one our production server is using.
I've searched the net for known issues between the two versions but found none.
Any help is highly appreciated. TIA
Upgrade your libxml2 and rebuild PHP.
I don't know nusoap at all, but it sounds like your entities are being discarded.
It might be worth controlling the entities at either end, for instance by replacing '<' with &lt; and '>' with &gt;, either manually or using a function such as htmlentities().
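For instance, a sketch using the question's own payload (escape before sending, decode on the receiving side):
// Client side: escape the markup so the SOAP layer treats it as plain text
$clientparam = htmlspecialchars('<test>123</test>');

// Server side (inside myFunction): restore the original markup
$xml = htmlspecialchars_decode($parameters);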
Not sure if you're using a different version of nusoap than me, but I've been using the proxy, which seems to work. I also instantiate the client with soapclient rather than nusoap_client (I hadn't seen that before):
$client = new soapclient('http://localhost/nusoap/ws.php?wsdl', true);
$proxy = $client->getProxy();
$response = $proxy->call("myfunction", array('test' => 123));
Yes, and the answer is in the soapval class.
It's a little messy, but there is a simple example available.
In short: you have to wrap any non-generic type (i.e. a PHP array) in this class. These wrappers can of course be nested; that's not against the design.
If you want to pass an XML value within a SOAP message and you control both the server and the client (or at least you can instruct the client), why not base64-encode your XML? Then the parser will just see it as a normal string and not get confused.
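Along those lines, a sketch (again using the question's payload):
// Client side: hide the markup from the SOAP parser entirely
$clientparam = base64_encode('<test>123</test>');

// Server side (inside myFunction): recover the original XML
$xml = base64_decode($parameters);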

Scraping Library for PHP - phpQuery?

I'm looking for a PHP library that lets me scrape webpages and takes care of all the cookies and of prefilling forms with their default values, which is what annoys me the most.
I'm tired of having to match every single input element with XPath, and I would love it if something better existed. I've come across phpQuery, but the manual isn't very clear and I can't find out how to make POST requests.
Can someone help me? Thanks.
@Jonathan Fingland:
In the example provided by the manual for browserGet() we have:
require_once('phpQuery/phpQuery.php');
phpQuery::browserGet('http://google.com/', 'success1');
function success1($browser)
{
$browser->WebBrowser('success2')
->find('input[name=q]')->val('search phrase')
->parents('form')
->submit();
}
function success2($browser)
{
echo $browser;
}
I suppose all the other fields are scraped and sent back in the GET request. I want to do the same with the phpQuery::browserPost() method, but I don't know how. The form I'm trying to scrape has an input token, and I would love it if phpQuery were smart enough to scrape the token and just let me change the other fields (in this case username and password), submitting everything via POST.
PS: Rest assured, this is not going to be used for spamming.
See http://code.google.com/p/phpquery/wiki/Ajax and in particular:
phpQuery::post($url, $data, $callback, $type)
and
# data Object, String which defines the data parameter as being either an Object or a String. POST requests should be possible using query string format, e.g.:
$data = "username=Jon&password=123456";
$url = "http://www.mysite.com/login.php";
phpQuery::post($url, $data, $callback, $type)
as phpQuery is a jQuery port, the method signature is the same (the docs link directly to the jQuery site -- http://docs.jquery.com/Ajax/jQuery.post)
Edit
Two things:
There is also a phpQuery::browserPost function which might meet your needs better.
However, also note that the success2 callback is only called on the submit() or click() methods so you can fill in all of the form fields prior to that.
e.g.
require_once('phpQuery/phpQuery.php');
phpQuery::browserGet('http://www.mysite.com/login.php', 'success1');
function success1($browser) {
    $handle = $browser
        ->WebBrowser('success2');
    $handle
        ->find('input[name=username]')
        ->val('Jon');
    $handle
        ->find('input[name=password]')
        ->val('123456')        // no semicolon here, so the chain continues
        ->parents('form')
        ->submit();
}
function success2($browser) {
    print $browser;
}
(Note that this has not been tested, but should work)
I've used SimpleTest's ScriptableBrowser for such stuff in the past. It's part of the SimpleTest testing framework, but you can use it stand-alone.
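A rough sketch of what that looks like (the method names are SimpleTest's browser API; the URL and field names are placeholders):
require_once('simpletest/browser.php');

$browser = new SimpleBrowser();                   // keeps cookies across requests
$browser->get('http://www.mysite.com/login.php'); // hypothetical login page
$browser->setField('username', 'Jon');            // fill only the fields you care about
$browser->setField('password', '123456');
$browser->clickSubmit('Login');                   // other inputs keep their defaults

echo $browser->getContent();                      // the page after the POST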
I would use a dedicated library for parsing HTML files and a dedicated library for processing HTTP requests. Using the same library for both seems like a bad idea, IMO.
For processing HTTP requests, check out eg. Httpful, Unirest, Requests or Guzzle. Guzzle is especially popular these days, but in the end, whichever library works best for you is still a matter of personal taste.
For parsing HTML files I would recommend a library that I wrote myself: DOM-Query. It allows you to (1) load an HTML file and then (2) select or change parts of your HTML pretty much the same way you would with jQuery in a frontend app.
