So right now I have this code, which works great. It takes whatever the XPath query matches and prints it:
<?php
$parent_title = get_the_title( $post->post_parent );
$html_string = file_get_contents('http://www.weburladresshere.com');
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html_string);
libxml_clear_errors();
$xpath = new DOMXpath($dom);
$values = array();
$row = $xpath->query('myquery');
foreach($row as $value) {
print($value->nodeValue);
}
?>
I need to insert two things into the code (if possible):
To check if the content is longer than x characters, then don't print.
To check if the content contains http in the content, then don't print.
If both of the above are negative - take it and print it.
If one of them is positive - skip, and then check the secondquery on the same page:
$row = $xpath->query('secondquery');
If this also contains one of the above, then check the thirdquery (from the same page) and so on.
Until it matches.
Any help would be appreciated.
From what I understand from the question you want a way to continue to run queries on the DOMDocument and evaluate the following conditions.
If the string length of the nodeValue is below a threshold
If the string of nodeValue does not contain "http"
Logic conditions:
IF both of the above are true, then echo the value to the screen
IF either of them is false, then run the next subquery
Below is the code which uses 500 characters as the length. My example has 3 entries which have the following character counts: 294, 98, and 1305.
<?php
/**
* @param $xpath
* @param $xPathQueries
* @param int $iteration
*/
function doXpathQuery($xpath, $xPathQueries, $iteration = 0)
{
// Validate there's no more subquery to go through
if (!isset($xPathQueries[$iteration])) {
return;
}
$runNextIteration = false;
// Run the XPATH subquery
$rows = $xpath->query($xPathQueries[$iteration]);
foreach ($rows as $row) {
$value = trim($row->nodeValue);
$smallerThanLength = (strlen($value) < 500);
// Case insensitive search, might use "http://" for less false positives
$noHttpFound = (stristr($value, 'http') === FALSE);
// Is it smaller than length, and no http found?
if($smallerThanLength && $noHttpFound) {
echo $value;
} else {
// One of them isn't true so run the next query
$runNextIteration = true;
}
}
// Should we do the next query?
if ($runNextIteration) {
$iteration++;
doXpathQuery($xpath, $xPathQueries, $iteration);
}
}
// Commented out this next line because I'm not sure what it does in this context
// $parent_title = get_the_title( $post->post_parent );
// Get all the contents for the URL
$html_string = file_get_contents('https://theeasyapi.com');
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html_string);
libxml_clear_errors();
$xpath = new DOMXpath($dom);
// Container that will hold all the rows that match the criteria
$values = [];
// An array containing all of the XPATH queries you want to run
$xPathQueries = ['/html/body/div/section', '/html/body/div'];
doXpathQuery($xpath, $xPathQueries);
This runs the queries in $xPathQueries one after another, moving on to the next query as long as the current one produces a value that is 500 characters or longer or contains 'http'.
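A stricter variation on the same idea can be sketched with a plain loop instead of recursion. Note that it differs from the function above: it prints nothing from a query unless every match passes both checks. The sample markup, the class names and the helper name firstAcceptableText() are assumptions made purely for illustration.

```php
<?php
/**
 * Return the text of the first query whose every match is shorter than
 * $maxLength and free of "http", or null when no query qualifies.
 * (Hypothetical helper, not part of the original answer.)
 */
function firstAcceptableText(DOMXPath $xpath, array $queries, int $maxLength = 500): ?string
{
    foreach ($queries as $query) {
        $nodes = $xpath->query($query);
        if ($nodes === false || $nodes->length === 0) {
            continue; // invalid query or nothing matched: try the next one
        }
        $texts = [];
        foreach ($nodes as $node) {
            $text = trim($node->nodeValue);
            if (strlen($text) >= $maxLength || stripos($text, 'http') !== false) {
                continue 2; // one match failed a check: move on to the next query
            }
            $texts[] = $text;
        }
        return implode("\n", $texts);
    }
    return null;
}

// Hypothetical sample markup standing in for the fetched page.
$html = '<html><body>'
      . '<div class="a">See http://example.com for details</div>'
      . '<div class="b">A short, link-free summary.</div>'
      . '</body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$result = firstAcceptableText($xpath, ['//div[@class="a"]', '//div[@class="b"]']);
echo $result, PHP_EOL;
```

The first query is rejected because its text contains "http", so the second query's text is what gets returned.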
I have to process a huge XML file. I used DOMDocument to process it, but the amount of data returned is huge, so how can I choose a specific number of elements to display?
For example I want to display 5 elements.
My code:
<?php
$doc = new DOMDocument;
$doc->preserveWhiteSpace = false;
$doc->load('IPCCPC-epoxif-201905.xml'); //IPCCPC-epoxif-201905
$xpath = new DOMXPath($doc);
if(empty($_POST['search'])){
$txtSearch = 'A01B1/00';
}
else{
$txtSearch = $_POST['search'];
}
$titles = $xpath->query("Doc/Fld[@name='IC']/Prg/Sen[contains(text(),\"$txtSearch\")]");
foreach ($titles as $title)
{
// I want to display 5 results here.
}
Add an index to the loop, and break out when it hits the limit.
$limit = 5;
foreach ($titles as $i => $title) {
if ($i >= $limit) {
break;
}
// rest of code
}
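The limit can also be pushed into the XPath expression itself, so that the node list never holds more than the wanted number of entries. The following is a minimal sketch on a made-up miniature of the XML structure implied by the question's query; the sample data is an assumption. The whole document is still parsed either way, so this only trims the result set.

```php
<?php
// Hypothetical miniature of the XML structure implied by the query in the question.
$xml = '<Doc><Fld name="IC"><Prg>';
for ($n = 1; $n <= 8; $n++) {
    $xml .= "<Sen>A01B1/00 entry $n</Sen>";
}
$xml .= '</Prg></Fld></Doc>';

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

$limit = 5;
// Parenthesise the whole path so position() counts across the full result set
// rather than restarting within each parent element.
$query = "(/Doc/Fld[@name='IC']/Prg/Sen[contains(text(), 'A01B1/00')])[position() <= $limit]";
$titles = $xpath->query($query);

foreach ($titles as $title) {
    echo $title->nodeValue, PHP_EOL;
}
```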
I wrote a database seeder last year that scrapes a stats website. Upon revisiting my code, it no longer seems to be working and I am a bit stumped as to the reason. $html->find() is supposed to return an array of elements found, however it seems to only be finding the first table when used.
As per the documentation, I instead tried using find() and specifying each table's ID, however this also seems to fail.
$table_passing = $html->find('table[id=passing]');
Can anyone help me figure out what is wrong here? I am at a loss as to why neither of these methods is working when the page source clearly shows multiple tables with their IDs, so both approaches should work.
private function getTeamStats()
{
$url = 'http://www.pro-football-reference.com/years/2016/opp.htm';
$html = file_get_html($url);
$tables = $html->find('table');
$table_defense = $tables[0];
$table_passing = $tables[1];
$table_rushing = $tables[2];
//$table_passing = $html->find('table[id=passing]');
$teams = array();
# OVERALL DEFENSIVE STATISTICS #
foreach ($table_defense->find('tr') as $row)
{
$stats = $row->find('td');
// Ignore the lines that don't have ranks, these aren't teams
if (isset($stats[0]) && !empty($stats[0]->plaintext))
{
$name = $stats[1]->plaintext;
$rank = $stats[0]->plaintext;
$games = $stats[2]->plaintext;
$yards = $stats[4]->plaintext;
// Calculate the Yards Allowed per Game by dividing Total / Games
$tydag = $yards / $games;
$teams[$name]['rank'] = $rank;
$teams[$name]['games'] = $games;
$teams[$name]['tydag'] = $tydag;
}
}
# PASSING DEFENSIVE STATISTICS #
foreach ($table_passing->find('tr') as $row)
{
$stats = $row->find('td');
// Ignore the lines that don't have ranks, these aren't teams
if (isset($stats[0]) && !empty($stats[0]->plaintext))
{
$name = $stats[1]->plaintext;
$pass_rank = $stats[0]->plaintext;
$pass_yards = $stats[14]->plaintext;
$teams[$name]['pass_rank'] = $pass_rank;
$teams[$name]['paydag'] = $pass_yards;
}
}
# RUSHING DEFENSIVE STATISTICS #
foreach ($table_rushing->find('tr') as $row)
{
$stats = $row->find('td');
// Ignore the lines that don't have ranks, these aren't teams
if (isset($stats[0]) && !empty($stats[0]->plaintext))
{
$name = $stats[1]->plaintext;
$rush_rank = $stats[0]->plaintext;
$rush_yards = $stats[7]->plaintext;
$teams[$name]['rush_rank'] = $rush_rank;
$teams[$name]['ruydag'] = $rush_yards;
}
}
I never use simplexml or other derivatives, but when using an XPath query to match an attribute such as ID, one would usually prefix the attribute name with @ and quote the value, so for your case it might be
$table_passing = $html->find('table[@id="passing"]');
Using a standard DOMDocument and DOMXPath approach, I found that the actual table is "commented out" in the page source, so simply stripping the HTML comment markers with a string replacement lets the following work. This could easily be adapted to the original code.
$url='http://www.pro-football-reference.com/years/2016/opp.htm';
$html=file_get_contents( $url );
/* remove the html comments */
$html=str_replace( array('<!--','-->'), '', $html );
libxml_use_internal_errors( true );
$dom=new DOMDocument;
$dom->validateOnParse=false;
$dom->standalone=true;
$dom->strictErrorChecking=false;
$dom->recover=true;
$dom->formatOutput=false;
$dom->loadHTML( $html );
libxml_clear_errors();
$xp=new DOMXPath( $dom );
$tbl=$xp->query( '//table[@id="passing"]' );
foreach( $tbl as $n )echo $n->tagName.' > '.$n->getAttribute('id');
/* outputs */
table > passing
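If stripping every comment marker feels too blunt, since it also exposes any ordinary comments on the page, an alternative is to visit the comment nodes and parse only those that contain a table. This is a rough sketch on made-up markup, not the real page; the sample HTML and the variable names are assumptions.

```php
<?php
// Hypothetical markup imitating a page that ships a table inside an HTML comment.
$html = '<html><body>'
      . '<table id="defense"><tr><td>visible</td></tr></table>'
      . '<!-- <table id="passing"><tr><td>hidden</td></tr></table> -->'
      . '</body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

$found = [];
// Visit every comment node and parse only those that mention a table.
foreach ($xp->query('//comment()') as $comment) {
    if (stripos($comment->nodeValue, '<table') === false) {
        continue;
    }
    // The comment text is itself an HTML fragment, so load it separately.
    $inner = new DOMDocument();
    $inner->loadHTML($comment->nodeValue);
    $innerXp = new DOMXPath($inner);
    foreach ($innerXp->query('//table[@id]') as $table) {
        $found[] = $table->getAttribute('id');
    }
}
libxml_clear_errors();

print_r($found);
```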
I made a script using phpQuery. The script finds td's that contain a certain string:
$match = $dom->find('td:contains("'.$clubname.'")');
It worked well until now. The problem is that one club name is, for example, Austria Lustenau and a second club is called Lustenau. The selector matches both clubs, but it should only select Lustenau (the second result), so I need to find a td containing an exact match.
I know that phpQuery uses jQuery selectors, but not all of them. Is there a way to find an exact match using phpQuery?
Update: It is possible, see the answer from @pguardiario
Original answer. (at least an alternative):
No, unfortunately it is not possible with phpQuery. But it can be done easily with XPath.
Imagine you have the following HTML:
$html = <<<EOF
<html>
<head><title>test</title></head>
<body>
<table>
<tr>
<td>Hello</td>
<td>Hello World</td>
</tr>
</table>
</body>
</html>
EOF;
Use the following code to find exact matches with DOMXPath:
// create empty document
$document = new DOMDocument();
// load html
$document->loadHTML($html);
// create xpath selector
$selector = new DOMXPath($document);
// select all td nodes whose text is exactly 'Hello'
$results = $selector->query('//td[text()="Hello"]');
// output the results
foreach($results as $node) {
echo $node->nodeValue . PHP_EOL;
}
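One caveat with text()="…" is that it compares the raw text, so a cell with leading or trailing whitespace will not match. A possible refinement, sketched here on made-up cells, is to compare against normalize-space(.) instead; the sample markup is an assumption.

```php
<?php
// Hypothetical cells: the same club name with and without surrounding whitespace.
$html = '<table><tr>'
      . '<td>Austria Lustenau</td>'
      . '<td> Lustenau </td>'
      . '<td>Lustenau</td>'
      . '</tr></table>';

$document = new DOMDocument();
libxml_use_internal_errors(true);
$document->loadHTML($html);
libxml_clear_errors();
$selector = new DOMXPath($document);

// Exact comparison against the raw text node.
$strict = $selector->query('//td[text()="Lustenau"]');
// Comparison after trimming and collapsing whitespace in the cell's text.
$lenient = $selector->query('//td[normalize-space(.)="Lustenau"]');

echo $strict->length, ' ', $lenient->length, PHP_EOL;
```

Neither query matches "Austria Lustenau", which is the point of asking for an exact match rather than a substring match.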
However, if you really need a phpQuery solution, use something like this:
require_once 'phpQuery/phpQuery.php';
// the search string
$needle = 'Hello';
// create phpQuery document
$document = phpQuery::newDocument($html);
// get matches as you suggested
$matches = $document->find('td:contains("' . $needle . '")');
// empty array for final results
$results = array();
// iterate through matches and check if the search string
// is the same as the node value
foreach($matches as $node) {
if($node->nodeValue === $needle) {
// put to results
$results []= $node;
}
}
// output results
foreach($results as $result) {
echo $result->nodeValue . '<br/>';
}
It can be done with filter, but it's not pretty:
$dom->find('td')->filter(function($i, $node){return 'foo' == $node->nodeValue;});
But then, neither is switching back and forth between css and xpath
I've no experience of phpQuery but the jQuery would be something like this :
var clubname = 'whatever';
var $match = $("td").map(function(index, domElement) {
return ($(domElement).text() === clubname) ? domElement : null;
});
The phpQuery documentation indicates that ->map() is available and that it accepts a callback function in the same way as in jQuery.
I'm sure you will be able to perform the translation into phpQuery.
Edit
Here's my attempt based on 5 minutes' reading - probably rubbish but here goes :
$match = $dom->find("td")->map(function($index, $domElement) use ($clubname) {
return (pq($domElement)->text() == $clubname) ? $domElement : null;
});
Edit 2
Here's a demo of the jQuery version.
If phpQuery does what it says in its documentation, then (correctly translated from javascript) it should match the required element in the same way.
Edit 3
After some more reading about the phpQuery callback system, the following code stands a better chance of working:
function textFilter($i, $el, $text) {
return (pq($el)->text() == $text) ? $el : null;
}
$match = $dom->find("td")->map('textFilter', new CallbackParam, new CallbackParam, $clubname);
Note that ->map() is preferred over ->filter() as ->map() supports a simpler way to define parameter "places" (see Example 2 in the referenced page).
I know this question is old, but I've written a function based on the answer from hek2mgl:
<?php
// phpQuery contains function
/**
* phpQuery contains method, this will be able to locate nodes
* relative to the NEEDLE
* @param string $element_pattern
* @param string $needle
* @return array
*/
function contains($element_pattern, $needle) {
$needle = (string) $needle;
$needle = trim($needle);
$element_haystack_pattern = "{$element_pattern}:contains({$needle})";
$element_haystack_pattern = (string) $element_haystack_pattern;
$findResults = $this->find($element_haystack_pattern);
$possibleResults = array();
if($findResults && !empty($findResults)) {
foreach($findResults as $nodeIndex => $node) {
if($node->nodeValue !== $needle) {
continue;
}
$possibleResults[$nodeIndex] = $node;
}
}
return $possibleResults;
}
?>
Usage
<?php
$nodes = $document->contains("td.myClass", $clubname);
?>
So as the title says I want to get a value of this site : Xtremetop100 Conquer-Online
My server is called Zath-Co and right now we are on rank 11.
What I want is a script that tells me which rank we are at and how many ins and outs we have. The catch is that we move up and down the list, so the script should look us up by name rather than by rank, but I can't figure out how to do it.
I tried this script
<?php $lines = file('http://xtremetop100.com/conquer-online');
while ($line = array_shift($lines)) {
if (strpos($line, 'Zath-Co') !== false) break; }
print_r(explode(" ", $line)); ?>
But it is only showing the name of my server and the description.
How can I get this to work as I want or do I have to use something really different. (If yes then what to use, and a example would be great.)
It can also be fixed with the file()-function, as you tried yourself. You just have to look up the source code and find the starting-line of your "part". I found out (in the source-code), that you need 7 lines to get the rank, description and in/out data. Here is a tested example:
<?php
$lines = file('http://xtremetop100.com/conquer-online');
$CountLines = sizeof( $lines);
$arrHtml = array();
for( $i = 0; $i < $CountLines; $i++) {
if( strpos( $lines[$i], '/sitedetails-1132314895') !== false) {
//The seven lines taken here under is your section at the site
$arrHtml[] = $lines[$i++];
$arrHtml[] = $lines[$i++];
$arrHtml[] = $lines[$i++];
$arrHtml[] = $lines[$i++];
$arrHtml[] = $lines[$i++];
$arrHtml[] = $lines[$i++];
$arrHtml[] = $lines[$i++];
break;
}
}
//We simply strip all tags, so you just got the content.
$arrHtml = array_map("strip_tags", $arrHtml);
//Here we echo the data
echo implode('<br>',$arrHtml);
?>
You can fix the layout yourself by taking each element out of $arrHtml in a loop.
I suggest using SimpleXML and XPath. Here is a working example:
$html = file_get_contents('http://xtremetop100.com/conquer-online');
// suppress invalid markup warnings
libxml_use_internal_errors(true);
// Load the HTML into a DOMDocument, then import it into SimpleXML
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);
$xpath = '//span[@class="hd1" and ./a[contains(., "Zath-Co")]]/ancestor::tr/td[@class="number" or @class="stats1" or @class="stats"]';
$anchor = $xml->xpath($xpath);
// Clear invalid markup error buffer
libxml_clear_errors();
$rank = (string)$anchor[0]->b;
$in = (string)$anchor[1]->span;
$out = (string)$anchor[2]->span;
// Clear invalid markup error buffer
libxml_clear_errors();
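The same lookup by name can be sketched with DOMXPath alone by selecting the row first and then reading each cell relative to it, which avoids relying on the order of the matched cells. The markup below is a made-up stand-in modelled on the class names in the XPath above, so the real page may well differ.

```php
<?php
// Hypothetical row layout modelled on the class names used in the XPath above.
$html = '<table>'
      . '<tr><td class="number"><b>10</b></td>'
      . '<td><span class="hd1"><a href="#">Other</a></span></td>'
      . '<td class="stats1"><span>900</span></td><td class="stats"><span>800</span></td></tr>'
      . '<tr><td class="number"><b>11</b></td>'
      . '<td><span class="hd1"><a href="#">Zath-Co</a></span></td>'
      . '<td class="stats1"><span>500</span></td><td class="stats"><span>400</span></td></tr>'
      . '</table>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
$xp = new DOMXPath($doc);

// Find the row by server name first, then read cells relative to that row.
$rows = $xp->query('//tr[.//span[@class="hd1"]/a[contains(., "Zath-Co")]]');
$stats = null;
if ($rows->length > 0) {
    $row = $rows->item(0);
    $stats = [
        'rank' => trim($xp->evaluate('string(td[@class="number"])', $row)),
        'in'   => trim($xp->evaluate('string(td[@class="stats1"])', $row)),
        'out'  => trim($xp->evaluate('string(td[@class="stats"])', $row)),
    ];
}
print_r($stats);
```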
I'm having difficulty extracting a single node value from a nodelist.
My code takes an xml file which holds several fields, some containing text, file paths and full image names with extensions.
I run an XPath query over it, looking for the item node with a certain id. The matched node is then stored as $oldnode.
Now my problem is trying to extract a value from that $oldnode. I have tried to var_dump($oldnode) and print_r($oldnode) but it returns the following: "object(DOMElement)#8 (0) { } "
I'm guessing the $oldnode variable is an object, but how do I access it?
I am able to echo out the whole node list by using: echo $oldnode->nodeValue;
This displays all the nodes in the list.
Here is the code which handles the xml file. line 6 is the line in question...
$xpathexp = "//item[@id=". $updateID ."]";
$xpath = new DOMXpath($xml);
$nodelist = $xpath->query($xpathexp);
if((is_null($nodelist)) || (! is_numeric($nodelist))) {
$oldnode = $nodelist->item(0);
echo $oldnode->nodeValue;
//$imgUpload = strchr($oldnode->nodeValue, ' ');
//$imgUpload = strrchr($imgUpload, '/');
//explode('/',$imgUpload);
//$imgUpload = trim($imgUpload);
$newItem = new DomDocument;
$item_node = $newItem ->createElement('item');
//Create attribute on the node as well
$item_node ->setAttribute("id", $updateID);
$largeImageText = $newItem->createElement('largeImgText');
$largeImageText->appendChild( $newItem->createCDATASection($largeImgText));
$item_node->appendChild($largeImageText);
$urlANode = $newItem->createElement('urlA');
$urlANode->appendChild( $newItem->createCDATASection($urlA));
$item_node->appendChild($urlANode);
$largeImg = $newItem->createElement('largeImg');
$largeImg->appendChild( $newItem->createCDATASection($imgUpload));
$item_node->appendChild($largeImg);
$thumbnailTextNode = $newItem->createElement('thumbnailText');
$thumbnailTextNode->appendChild( $newItem->createCDATASection($thumbnailText));
$item_node->appendChild($thumbnailTextNode);
$urlB = $newItem->createElement('urlB');
$urlB->appendChild( $newItem->createCDATASection($urlA));
$item_node->appendChild($urlB);
$thumbnailImg = $newItem->createElement('thumbnailImg');
$thumbnailImg->appendChild( $newItem->createCDATASection(basename($_FILES['thumbnailImg']['name'])));
$item_node->appendChild($thumbnailImg);
$newItem->appendChild($item_node);
$newnode = $xml->importNode($newItem->documentElement, true);
// Replace
$oldnode->parentNode->replaceChild($newnode, $oldnode);
// Display
$xml->save($xmlFileData);
//header('Location: index.php?a=112&id=5');
Any help would be great.
Thanks
Wasn't it supposed to be echo $oldnode->firstChild->nodeValue;? I remember this because technically you need the value from the text node.. but I might be mistaken, it's been a while. You could give it a try?
After our discussion in the comments on this answer, I came up with this solution. I'm not sure if it can be done cleaner, perhaps. But it should work.
$nodelist = $xpath->query($xpathexp);
// Proceed only when the query succeeded and matched at least one node
if ($nodelist !== false && $nodelist->length > 0) {
$oldnode = $nodelist->item(0);
$largeImg = null;
$thumbnailImg = null;
foreach( $oldnode->childNodes as $node ) {
if( $node->nodeName == "largeImg" ) {
$largeImg = $node->nodeValue;
} else if( $node->nodeName == "thumbnailImg" ) {
$thumbnailImg = $node->nodeValue;
}
}
var_dump($largeImg);
var_dump($thumbnailImg);
}
You could also use getElementsByTagName on the $oldnode, then see if it found anything (and if a node was found, $oldnode->getElementsByTagName("thumbnailImg")->item(0)->nodeValue). That might be cleaner than looping through them.
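A minimal sketch of that getElementsByTagName approach follows, using a made-up XML document shaped like the item elements described in the question. The sample data and the helper name childText() are assumptions for illustration.

```php
<?php
/**
 * Return the text of the first descendant named $tag, or null if absent.
 * (Hypothetical helper, not part of the original code.)
 */
function childText(DOMElement $parent, string $tag): ?string
{
    $matches = $parent->getElementsByTagName($tag);
    return $matches->length > 0 ? $matches->item(0)->nodeValue : null;
}

// Hypothetical XML shaped like the <item> elements described in the question.
$source = '<items>'
        . '<item id="5">'
        . '<largeImg><![CDATA[images/large/photo.jpg]]></largeImg>'
        . '<thumbnailImg><![CDATA[thumb.jpg]]></thumbnailImg>'
        . '</item>'
        . '</items>';

$xml = new DOMDocument();
$xml->loadXML($source);
$xpath = new DOMXPath($xml);

$oldnode = $xpath->query('//item[@id="5"]')->item(0);

$largeImg = childText($oldnode, 'largeImg');
$thumbnailImg = childText($oldnode, 'thumbnailImg');
$missing = childText($oldnode, 'urlA'); // not present in the sample, so null

var_dump($largeImg, $thumbnailImg, $missing);
```

Checking the length before calling item(0) avoids reading nodeValue from null when the element is missing.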