Parsing Lucene/SOLR debug.explain.structured xml output in PHP

Parsing Lucene/SOLR debug.explain.structured xml output in PHP - php

The default "human readable" formatting of solr's debug-mode explain feature is completely useless. You can get some structured xml output by passing debug.explain.structured=true.
However the xml it generates isn't really usable either and I need to be able to use this debugging information elsewhere in my code.
Before I re-invent the wheel, I have two questions:
1) Does anyone know of an existing PHP class (or function) that will parse this xml and turn it into a useful object? (googling didn't turn up anything obvious)
2) For those familiar with SOLR's debug mode, is there a better way to approach this than parsing the debug.explain.structured xml?
(I'm Using SOLR 3.6)

I am doing this using the solr-php-client. I do use a regular expression to parse out a specific value, but it's easy to access the debug explanation.
As an example, here is how I extract the coord value from the debug explanation:
$client = new Apache_Solr_Service($hostname, $port, $url);
$response = $client->search($q, $offset, $limit, $parameters);
$debug = reset($response['debug']['explain']); // get the first explanation
if (preg_match('#([\d\.]+) = coord\((\d+)/(\d+)\)#m', $debug, $matches)) {
$coord = floatval($matches[1]);
$overlap = intval($matches[2]); // the number of matching keywords
$max_overlap = intval($matches[3]); // the total number of keywords
}

i am having the same problem and stared a github project for parsing the solr explain into an object structure. With this library it is possible to calculate the impact of a certain field from the explain output:
https://github.com/timoschmidt/php-solr-explain

Related

REST service consume with PHP 7

As I am new to this PHP world I require some help to consume rest request with php. What I have already achieved is to submit json thoug jquery($("#form").serialize) and read the same in php along with returning json value in the response of the server call.
But I am bit stuck when I have to read rest url parameter with php. For example in case of select by Id my client calls me with url mentioned below. Now I would like to retrieve 1 and fetch from the url. With Spring pathvariable it's easy but how I can parse this url in php.
/dummy/customer/1/fetch
Note - As of now I am not using any framework only raw php.
Any help would be appreciated. Thanks in advance.

There's a lot of different ways to do this. I'll show 2 ways.
Basic string functions
We can simply split the string on / and grab the 4th part.
$output = explode($input, '/')[3];
Regular expressions
This example of a regular expression doesn't just grab the 1, it also ensures that all the other parts in the url are what you expect. Regular expressions are harder but more versatile.
$matches = null;
$success = preg_match('#^/dummy/customer/([0-9]+)/fetch$#', $input, $matches);
if (!$success) {
throw new Exception('Unexpected format');
}
$output = $matches[1];

simplexml_load_string not parsing my XML string. Charset issue?

I'm using the following PHP code to read XML data from NOAA's tide reporting station API:
$rawxml = file_get_contents(
"http://opendap.co-ops.nos.noaa.gov/axis/webservices/activestations/"
."response.jsp?v=2&format=xml&Submit=Submit"
);
$rawxml = utf8_encode($rawxml);
$ob = simplexml_load_string($rawxml);
var_dump($ob);
Unfortunately, I end up with it displaying this:
object(SimpleXMLElement)#246 (0) { }
It looks to me like the XML is perfectly well-formed - why won't this parse? From looking at another question (Simplexml_load_string() fail to parse error) I got the idea that the header might be the problem - the http call does indeed return a charset value of "ISO-8859-1". But adding in the utf8_encode() call doesn't seem to do the trick.
What's especially confusing is that simplexml_load_string() doesn't actually fail - it returns a cheerful XML array, just with nothing in it!

You've been fooled (and had me fooled) by the oldest trick in the SimpleXML book: SimpleXML doesn't parse the whole document into a PHP object, it presents a PHP API to an internal structure. Functions like var_dump can't see this structure, so don't always give a useful idea of what's in the object.
The reason it looks "empty" is that it is listing the children of the root element which are in the default namespace - but there aren't any, they're all in the "soapenv:" namespace.
To access namespaced elements, you need to use the children() method, passing in the full namespace name (recommended) or its local prefix (simpler, but could be broken by changes in the way the file is generated the other end). To switch back to the "default namespace", use ->children(null).
So you could get the ID attribute of the first stationV2 element like this (live demo):
// Define constant for the namespace names, rather than relying on the prefix the remote service uses remaining stable
define('NS_SOAP', 'http://schemas.xmlsoap.org/soap/envelope/');
// Download the XML
$rawxml = file_get_contents("http://opendap.co-ops.nos.noaa.gov/axis/webservices/activestations/response.jsp?v=2&format=xml&Submit=Submit");
// Parse it
$ob = simplexml_load_string($rawxml);
// Use it!
echo $ob->children(NS_SOAP)->Body->children(null)->ActiveStationsV2->stationsV2->stationV2[0]['ID'];
I've written some debugging functions to use with SimpleXML which should be much less misleading than var_dump etc. Here's a live demo with your code and simplexml_dump.

Simple PHP script for downloading results

I need to download results from a website using a for loop to compile them.
(Note that it's an ASP request which displays a webpage with these parameters)
I wrote the following code to get me this:
<?php
for ($i=10; $i<500; $i++) {
$m = $i*10;
$dl = $query;
$text = file_get_contents($dl);
$doc = new DOMDocument('1.0');
$doc->loadHTML($text);
$aObj = $doc->find('Academic');
if (count($aObj) > 0)
{
echo "<h4>Found</h4>";
//Don't download this
}
else
{
echo "<h4>Not found</h4>";
//Download this
}
}
?>
But it returns several errors. Apparently it can't copy the ASPX file to the HTML DOM. How do I go about doing this? Also, how can I download/save the pages where the string 'Download' is not found?
I also think my method of finding 'Download' in the document is not working. What is the correct way to do this?

The website you're attempting to parse contains a lot of errors, therefore you wont be able to use the standard DOMDocument object. You can attempt to use a library such as SimpleHTMLDOM (http://simplehtmldom.sourceforge.net/) or phpQuery (https://code.google.com/p/phpquery/) and hope that those are good enough to parse the malformed document.
In case you just need some information it might be easier to use regular expressions and preg_match_all (http://www.php.net/manual/en/function.preg-match-all.php) to find every occurrence of 'Academic' for example.
Note, usually it is not very advisable to use regular expression when working with structured documents such as HTML since you wont be able to take advantage of the structure, but since those documents seem to contain over 300 errors and differ from each other it might be the only way.

parse special character xml file using simplexml

Apologies if there is an obvious answer (and I know there are about 1000 of these similar questions) - but I have spent two days trying to attack this without success. I cannot seem to crack why I get a null response...
Short background: the following works just fine
$xurl= new SimpleXMLElement('https://gptxsw.appspot.com/view/submissionList?formId=GP_v7&numEntries=1', NULL, TRUE);
$keyname = $xurl->idList->id[0];
echo $keyname;
this provides a response: a unique key like uuid:d0721391-6953-4d0b-b981-26e38f05d2e5
however I try a similar request (which ultimately would be based on first request) and get a failure. I've simplified code as follows...
$xdurl= new SimpleXMLElement('https://gptxsw.appspot.com/view/downloadSubmission?formId=GP_v7[#version=null%20and%20#uiVersion=null]/GP_v7[#key=uuid:d0721391-6953-4d0b-b981-26e38f05d2e5]', NULL, TRUE);
$keyname2 = $xdurl->data->GP_v7->SDD_ID_N[0];
echo $keyname2;
this provides null. And if I try something like
echo $xdurl->asXML();
I get an error response from the site (not from PHP).
Do I need to eject from SimpleXMLElement for the second request? I've read about using XPath and about defining the namespace, but I'm not sure that either would be required: the second file does have two namespaces but one of them isn't used and the other has no prefix for elements. Plus I have tried variations of those - enough to think that my problem/error is either more global in nature (or oversight due to inexperience).
For purposes of this request I have no control over the formatting of either XML file.

Here we go: SimpleXMLElement seems to re-escape (or incorrectly handle in some way) already url-escaped characters like white spaces. Try:
$xdurl= new SimpleXMLElement('https://gptxsw.appspot.com/view/downloadSubmission?formId=GP_v7[#version=null and #uiVersion=null]/GP_v7[#key=uuid:d0721391-6953-4d0b-b981-26e38f05d2e5]', NULL, TRUE);
$keyname2 = $xdurl->data->GP_v7->SDD_ID_N[0];
echo $keyname2;
and you should be fine.
(FYI: I debugged this by manually creating a local copy of the XML request result named "foo.xml" which worked perfectly.)

Thanks to #Matze for getting me on right track.
Issue is that URL has special characters that SimpleXMLElement cannot parse without help.
Solution: add urlencode() command like the following
$fixurl = urlencode('https://gptxsw.appspot.com/view/downloadSubmission?formId=GP_v7[#version=null and #uiVersion=null]/GP_v7[#key=uuid:d0721391-6953-4d0b-b981-26e38f05d2e5]');
$xdurl= new SimpleXMLElement($fixurl, NULL, TRUE);
$keyname2 = $xdurl->data->GP_v7->SDD_ID_N[0];
echo $keyname2;
this provided the answer (in this case 958)

Alternative to php preg_match to pull data from an external website?

I want to extrat the content of a specific div in an external webpage, the div looks like this:
<dt>Win rate</dt><dd><div>50%</div></dd>
My target is the "50%". I'm actually using this php code to extract the content:
function getvalue($parameter,$content){
preg_match($parameter, $content, $match);
return $match[1];
};
$parameter = '#<dt>Score</dt><dd><div>(.*)</div></dd>#';
$content = file_get_contents('https://somewebpage.com');
Everything works fine, the problem is that this method is taking too much time, especially if I've to use it several times with diferents $content.
I would like to know if there's a better (faster, simplier, etc.) way to acomplish the same function? Thx!

You may use DOMDocument::loadHTML and navigate your way to the given node.
$content = file_get_contents('https://somewebpage.com');
$doc = new DOMDocument();
$doc->loadHTML($content);
Now to get to the desired node, you may use method DOMDocument::getElementsByTagName, e.g.
$dds = $doc->getElementsByTagName('dd');
foreach($dds as $dd) {
// process each <dd> element here, extract inner div and its inner html...
}
Edit: I see a point #pebbl has made about DomDocument being slower. Indeed it is, however, parsing HTML with preg_match is a call for trouble; In that case, I'd also recommend looking at event-driven SAX XML parser. It is much more lightweight, faster and less memory intensive as it does not build a tree. You may take a look at XML_HTMLSax for such a parser.

There are basically three main things you can do to improve the speed of your code:
Off load the external page load to another time (i.e. use cron)
On a linux based server I would know what to suggest but seeing as you use Windows I'm not sure what the equivalent would be, but Cron for linux allows you to fire off scripts at certain schedule time offsets - in the background - so not using a browser. Basically I would recommend that you create a script who's sole purpose is to go and fetch the website pages at a particular time offset (depending on how frequently you need to update your data) and then write those webpages to files on your local system.
$listOfSites = array(
'http://www.something.com/page.htm',
'http://www.something-else.co.uk/index.php',
);
$dirToContainSites = getcwd() . '/sites';
foreach ( $listOfSites as $site ) {
$content = file_get_contents( $site );
/// i've just simply converted the URL into a filename here, there are
/// better ways of handling this, but this at least keeps things simple.
/// the following just converts any non letter or non number into an
/// underscore... so, http___www_something_com_page_htm
$file_name = preg_replace('/[^a-z0-9]/i','_', $site);
file_put_contents( $dirToContainSites . '/' . $file_name, $content );
}
Once you've created this script, you then need to set the server up to execute it as regularly as you need. Then you can modify your front-end script that displays the stats to read from local files, this would give a significant speed increase.
You can find out how to read files from a directory here:
http://uk.php.net/manual/en/function.dir.php
Or the simpler method (but prone to possible problems) is just to re-step your array of sites, convert the URLs to file names using the preg_replace above, and then check for the file's existence in the folder.
Cache the result of calculating your statistics
It's quite likely this being a stats page that you'll want to visit it quite frequently (not as frequent as a public page, but still). If the same page is visited more often than the cron-based script is executed then there is no reason to do all the calculation again. So basically all you have to do to cache your output is do something similar to the following:
$cachedVersion = getcwd() . '/cached/stats.html';
/// check to see if there is a cached version of this page
if ( file_exists($cachedVersion) ) {
/// if so, load it and echo it to the browser
echo file_get_contents($cachedVersion);
}
else {
/// start output buffering so we can catch what we send to the browser
ob_start();
/// DO YOUR STATS CALCULATION HERE AND ECHO IT TO THE BROWSER LIKE NORMAL
/// end output buffering and grab the contents so we now have a string
/// of the page we've just generated
$content = ob_get_contents(); ob_end_clean();
/// write the content to the cached file for next time
file_put_contents($cachedVersion, $content);
echo $content;
}
Once you start caching things you need to be aware of when you should delete or clear your cache - otherwise if you don't your stats output will never change. With regards to this situation, the best time to clear your cache is at the point you go and fetch the external web pages again. So you should add this line to the bottom of your "cron" script.
$cachedVersion = getcwd() . '/cached/stats.html';
unlink( $cachedVersion ); /// will delete the file
There are other speed improvements you could make to the caching system (you could even record the modified times of the external webpages and load only when they have been updated) but I've tried to keep things easy to explain.
Don't use a HTML Parser for this situation
Scanning a HTML file for one particular unique value does not require the use of a fully-blown or even lightweight HTML Parser. Using RegExp incorrectly seems to be one of those things that lots of start-up programmers fall into, and is a question that is always asked. This has led to lots of automatic knee-jerk reactions from more experience coders to automatically adhere to the following logic:
if ( $askedAboutUsingRegExpForHTML ) {
$automatically->orderTheSillyPersonToUse( $HTMLParser );
} else {
$soundAdvice = $think->about( $theSituation );
print $soundAdvice;
}
HTMLParsers should be used when the target within the markup is not so unique, or your pattern to match relies on such flimsy rules that it'll break the second an extra tag or character occurs. They should be used to make your code more reliable, not if you want to speed things up. Even parsers that do not build a tree of all the elements will still be using some form of string searching or regular expression notation, so unless the library-code you are using has been compiled in an extremely optimised manner, it will not beat well coded strpos/preg_match logic.
Considering I have not seen the HTML you are hoping to parse, I could be way off, but from what I've seen of your snippet it should be quite easy to find the value using a combination of strpos and preg_match. Obviously if your HTML is more complex and might have random multiple occurances of <dt>Win rate</dt><dd><div>50%</div></dd> it will cause problems - but even so - a HTMLParser would still have the same problem.
$offset = 0;
/// loop through the occurances of 'Win rate'
while ( ($p = stripos ($html, 'win rate', $offset)) !== FALSE ) {
/// grab out a snippet of the surrounding HTML to speed up the RegExp
$snippet = substr($html, $p, $p + 50 );
/// I've extended your RegExp to try and account for 'white space' that could
/// occur around the elements. The following wont take in to account any random
/// attributes that may appear, so if you find some pages aren't working - echo
/// out the $snippet var using something like "echo '<xmp>'.$snippet.'</xmp>';"
/// and that should show you what is appearing that is breaking the RegExp.
if ( preg_match('#^win\s+rate\s*</dt>\s*<dd>\s*<div>\s*([0-9]+%)\s*<#i', $snippet, $regs) ) {
/// once you are here your % value will be in $regs[1];
break; /// exit the while loop as we have found our 'Win rate'
}
/// reset our offset for the next loop
$offset = $p;
}
Gotchas to be aware of
If you are new to PHP, as you state in a comment above, then the above may seem rather complicated - which it is. What you are trying to do is quite complex, especially if you want to do it optimally and fast. However, if you follow throught the code I've given and research any bits that you aren't sure of / haven't heard of (php.net is your friend), it should give you a better understanding of a good way to achieve what you are doing.
Guessing ahead however, here are some of the problems you might face with the above:
File Permission errors - in order to be able to read and write files to and from the local operating system you will need to have the correct permissions to do so. If you find you can not write files to a particular directory it might be that the host you are using wont allow you to do so. If this is the case you can either contact them to ask about how to get write permission to a folder, or if that isn't possible you can easily change the code above to use a database instead.
I can't see my content - when using output buffering all the echo and print commands do not get sent to the browser, they instead get saved up in memory. PHP should automatically output all the stored content when the script exits, but if you use a command like ob_end_clean() this actually wipes the 'buffer' so all the content is erased. This can lead to confusing situations when you know you are echoing something.. but it just isn't appearing.
(Mini Disclaimer :) I've typed all the above manually so you may find there are PHP errors, if so, and they are baffling, just write them back here and StackOverflow can help you out)

Instead of trying to not use preg_match why not just trim your document contents down in size? for example, you could dump everything before <body and everything after </body>. then preg_match will be searching less content already.
Also, you could try to do each one of these processes as a pseudo separate thread, so that way they aren't happening one at a time.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Parsing Lucene/SOLR debug.explain.structured xml output in PHP - php

i am having the same problem and stared a github project for parsing the solr explain into an object structure. With this library it is possible to calculate the impact of a certain field from the explain output: https://github.com/timoschmidt/php-solr-explain

Related

REST service consume with PHP 7

simplexml_load_string not parsing my XML string. Charset issue?

Simple PHP script for downloading results

parse special character xml file using simplexml

Alternative to php preg_match to pull data from an external website?

Categories

Resources