Does anyone know how to modify this scraping/output code, which produces values like +40.07%, so that it ignores certain characters? My site already places a "+" in front of every value automatically. The scraped source also puts a + in front of positive values, so I end up with output like ++40.07% when I only want one +. What can be added to drop the scraped +?
// get sandpdailychange
function getSandpdailychange(){
$doc = new DOMDocument;
// We don't want to bother with white spaces
$doc->preserveWhiteSpace = false;
// Most HTML Developers are chimps and produce invalid markup...
$doc->strictErrorChecking = false;
$doc->recover = true;
$doc->loadHTMLFile('http://quotes.wsj.com/index/SPX');
$xpath = new DOMXPath($doc);
$query = "//span[@id='quote_change']";
$entries = $xpath->query($query);
foreach ($entries as $entry) {
$result = trim($entry->textContent);
$ret_ = explode(' ', $result);
//make sure every element in the array don't start or end with blank
foreach ($ret_ as $key=>$val){
$ret_[$key]=trim($val);
}
//delete the empty element and the element is blank "\n" "\r" "\t"
//I modify this line
$ret_ = array_values(array_filter($ret_, 'deleteBlankInArray'));
//write the first element to the cache file
file_put_contents(globalVars::$_cache_dir . "sandpdailychange",
$ret_[0]);
}
}
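One possible fix, sketched under the assumption that the scraped text always looks like `+40.07%` or `-0.35%`: strip a leading `+` before caching the value, so the `+` added by the site is the only one shown. The helper name `stripLeadingPlus` is made up for this example.

```php
<?php
// Hypothetical helper: remove any leading "+" signs from a scraped value.
function stripLeadingPlus(string $value): string
{
    // ltrim() with a character list strips only those characters from the left.
    return ltrim(trim($value), '+');
}

echo stripLeadingPlus('+40.07%'), "\n";
echo stripLeadingPlus('-0.35%'), "\n";
```

In getSandpdailychange() this would mean passing stripLeadingPlus($ret_[0]) to file_put_contents instead of $ret_[0]. Negative values keep their minus sign.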
So right now I have this code, which works great:
It takes whatever matches the XPath query and prints it.
<?php
$parent_title = get_the_title( $post->post_parent );
$html_string = file_get_contents('http://www.weburladresshere.com');
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html_string);
libxml_clear_errors();
$xpath = new DOMXpath($dom);
$values = array();
$row = $xpath->query('myquery');
foreach($row as $value) {
print($value->nodeValue);
}
?>
I need to insert two things into the code (if possible):
To check if the content is longer than x characters, then don't print.
To check if the content contains http in the content, then don't print.
If both of the above are negative - take it and print it.
If one of them is positive - skip, and then check the secondquery on the same page:
$row = $xpath->query('secondquery');
If this also contains one of the above, then check the thirdquery (from the same page) and so on.
Until it matches.
Any help would be appreciated.
From what I understand of the question, you want to keep running queries against the DOMDocument and evaluate the following conditions:
The string length of the nodeValue is below a threshold
The nodeValue does not contain "http"
Logic conditions:
IF both of the above are true, then echo the value to the screen
IF either of them is false, then run the next query
Below is the code which uses 500 characters as the length. My example has 3 entries which have the following character counts: 294, 98, and 1305.
<?php
/**
* @param $xpath
* @param $xPathQueries
* @param int $iteration
*/
function doXpathQuery($xpath, $xPathQueries, $iteration = 0)
{
// Validate there's no more subquery to go through
if (!isset($xPathQueries[$iteration])) {
return;
}
$runNextIteration = false;
// Run the XPATH subquery
$rows = $xpath->query($xPathQueries[$iteration]);
foreach ($rows as $row) {
$value = trim($row->nodeValue);
$smallerThanLength = (strlen($value) < 500);
// Case insensitive search, might use "http://" for less false positives
$noHttpFound = (stristr($value, 'http') === FALSE);
// Is it smaller than length, and no http found?
if($smallerThanLength && $noHttpFound) {
echo $value;
} else {
// One of them isn't true so run the next query
$runNextIteration = true;
}
}
// Should we do the next query?
if ($runNextIteration) {
$iteration++;
doXpathQuery($xpath, $xPathQueries, $iteration);
}
}
// Commented out this next line because I'm not sure what it does in this context
// $parent_title = get_the_title( $post->post_parent );
// Get all the contents for the URL
$html_string = file_get_contents('https://theeasyapi.com');
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html_string);
libxml_clear_errors();
$xpath = new DOMXpath($dom);
// Container that will hold all the rows that match the criteria
$values = [];
// An array containing all of the XPATH queries you want to run
$xPathQueries = ['/html/body/div/section', '/html/body/div'];
doXpathQuery($xpath, $xPathQueries);
This works through the queries in $xPathQueries in order, moving on to the next one as long as the current query produces a value whose string length is 500 or more or in which 'http' is found.
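If the goal is to stop at the first query whose results all pass both checks and get the values back instead of echoing them, a variation could look like the sketch below. It is not the code above: `isAcceptable` and `firstAcceptableValues` are hypothetical names, and the inline HTML is made up so the example runs without a network request.

```php
<?php
// Hypothetical helper: true when the text is short enough and contains no "http".
function isAcceptable(string $value, int $maxLength = 500): bool
{
    return strlen($value) < $maxLength && stripos($value, 'http') === false;
}

// Hypothetical helper: values of the first query whose results all pass the checks.
function firstAcceptableValues(DOMXPath $xpath, array $queries, int $maxLength = 500): array
{
    foreach ($queries as $query) {
        $values = [];
        foreach ($xpath->query($query) as $node) {
            $values[] = trim($node->nodeValue);
        }
        // A query that matches nothing does not count as a match.
        $allPass = $values !== [];
        foreach ($values as $value) {
            if (!isAcceptable($value, $maxLength)) {
                $allPass = false;
                break;
            }
        }
        if ($allPass) {
            return $values;
        }
    }
    return [];
}

// Made-up sample markup standing in for the downloaded page.
$dom = new DOMDocument();
$dom->loadHTML('<html><body><p class="a">see http://example.com</p><p class="b">short text</p></body></html>');
$xpath = new DOMXPath($dom);

print_r(firstAcceptableValues($xpath, ['//p[@class="a"]', '//p[@class="b"]']));
```

Here the first query is skipped because its text contains "http", and the second query's value is returned.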
I'm testing some data scraping. The value I'm scraping is a percentage, so I basically slapped on a
echo "%<br>";
At the end of the actual number output which is
echo $ret_[66];
However there's an issue where the percent is actually appearing before the number as well, which is not desirable. This is the output:
%
-0.02%
Whereas what I'm trying to get is just -0.02%
Clearly I'm doing something wrong with the PHP. I'd really appreciate any feedback/solutions. Thank you!
Full code:
<?php
error_reporting(E_ALL^E_NOTICE^E_WARNING);
include_once "global.php";
$doc = new DOMDocument;
// We don't want to bother with white spaces
$doc->preserveWhiteSpace = false;
$doc->strictErrorChecking = false;
$doc->recover = true;
$doc->loadHTMLFile('http://www.moneycontrol.com/markets/global-indices/');
$xpath = new DOMXPath($doc);
$query = "//div[@class='MT10']";
$entries = $xpath->query($query);
foreach ($entries as $entry) {
$result = trim($entry->textContent);
$ret_ = explode(' ', $result);
//make sure every element in the array don't start or end with blank
foreach ($ret_ as $key => $val){
$ret_[$key] = trim($val);
}
//delete the empty element and the element is blank "\n" "\r" "\t"
//I modify this line
$ret_ = array_values(array_filter($ret_, 'deleteBlankInArray'));
//echo element 66
echo $ret_[66];
echo "%<br>";
}
<?php
echo "%<br>";
?>
Putting it in a separate PHP block after the code does the same thing.
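Without the page markup the cause can't be pinned down. Two plausible ones: the loop prints "%" for matched divs that have no element 66 (the notice is hidden by the error_reporting line), or element 66 itself contains a stray "%" and a newline, since explode(' ', ...) only splits on spaces. A minimal sketch that copes with both by pulling out just the signed number; `extractPercent` is a hypothetical name and the sample strings are made up.

```php
<?php
// Hypothetical helper: pull the first signed decimal number out of a scraped token.
function extractPercent(?string $token): string
{
    // Return nothing when the token is missing or holds no number.
    if ($token === null || !preg_match('/[-+]?\d+(?:\.\d+)?/', $token, $match)) {
        return '';
    }
    return $match[0] . '%';
}

echo extractPercent("%\n-0.02"), "\n";
echo extractPercent(null), "\n";
```

Inside the loop this would be used as echo extractPercent($ret_[66] ?? null); in place of the two echo statements.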
Here is where I set up basic variables, such as creating the new DomDoc and such as well as loading some of the Tags. This all works fine at the moment.
<?php
if (isset($_GET['edit'])&& $_GET['edit']=='delete' && isset($_GET['id'])&&!empty($_GET['id'])){
$dom = new DomDocument();
$dom->preserveWhiteSpace = false;
$dom->load("data.xml");
$root = $dom->documentElement;
$record = $root->getElementsByTagName("data");
$ID=$root->getElementsByTagName("ID");
$nodetoremove = null;
//$namenode=$root->getElementsByTagName("own_name");
//$name="";
//$datenode=$root->getElementsByTagName("sign_in");
//$date="";
$newid=$_GET['id'];
foreach($ID as $node){
$pid =$node->textContent;
Here I check whether the ID matches the requested one; if it does, the following happens:
if ($pid == $newid)
{
$nodetoremove=$node->parentNode;
}
}
The issue is here. I want to go through the node I wish to delete ($nodetoremove) and select a specific element (sign_in), but I am unsure how to do so. Right now all I can do is go through and print all of the elements within $nodetoremove. Is there a way I can get the element I want from the XML this way?
//Prints all information within $nodetoremove
foreach ($nodetoremove->childNodes AS $item){
print $item->nodeName . "=" . $item->nodeValue . "<br>";
}
foreach ($nodetoremove as $node) {
}
//Sets $name to the first Child of $nodetoremove
$name=$nodetoremove->firstChild->nodeValue;
//Checks that the node to remove is not null; if it isn't, removes $nodetoremove
if($nodetoremove!=null){
$root->removeChild($nodetoremove);
?>
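One way to pick a single child element such as sign_in is to call getElementsByTagName on the record node itself rather than on the document. The sketch below assumes a layout for data.xml (the root element name `records` and the sample values are made up), and `getFieldForId` is a hypothetical helper name.

```php
<?php
// Made-up sample; the real layout of data.xml is an assumption.
$xml = '<records>'
     . '<data><ID>1</ID><own_name>Ann</own_name><sign_in>2020-01-01</sign_in></data>'
     . '<data><ID>2</ID><own_name>Bob</own_name><sign_in>2020-02-02</sign_in></data>'
     . '</records>';

$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadXML($xml);

// Hypothetical helper: text of the first <$tag> element inside the record with this ID.
function getFieldForId(DOMDocument $dom, string $id, string $tag): ?string
{
    foreach ($dom->getElementsByTagName('ID') as $idNode) {
        if ($idNode->textContent === $id) {
            // The parent of <ID> is the record that would be removed.
            $record = $idNode->parentNode;
            $field = $record->getElementsByTagName($tag)->item(0);
            return $field !== null ? $field->textContent : null;
        }
    }
    return null;
}

echo getFieldForId($dom, '2', 'sign_in'), "\n";
```

In the original code the equivalent call would be $nodetoremove->getElementsByTagName('sign_in')->item(0), made after checking that $nodetoremove is not null.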
So as the title says, I want to get a value from this site: Xtremetop100 Conquer-Online
My server is called Zath-Co and right now we are at rank 11.
What I want is a script that tells me which rank we are at and how many ins and outs we have. The catch is that we move up and down the list, so I want the script to look up the server by name rather than by rank, but I can't figure it out.
I tried this script
<?php $lines = file('http://xtremetop100.com/conquer-online');
while ($line = array_shift($lines)) {
if (strpos($line, 'Zath-Co') !== false) break; }
print_r(explode(" ", $line)); ?>
But it is only showing the name of my server and the description.
How can I get this to work the way I want, or do I have to use something completely different? (If so, what should I use? An example would be great.)
This can also be done with the file() function, as you tried yourself. You just have to look at the page source and find the starting line of your "part". Looking at the source, I found that you need 7 lines to get the rank, description and in/out data. Here is a tested example:
<?php
$lines = file('http://xtremetop100.com/conquer-online');
$CountLines = sizeof( $lines);
$arrHtml = array();
for( $i = 0; $i < $CountLines; $i++) {
if( strpos( $lines[$i], '/sitedetails-1132314895') !== false) {
//The seven lines taken below are your section of the site
$arrHtml[] = $lines[$i++];
$arrHtml[] = $lines[$i++];
$arrHtml[] = $lines[$i++];
$arrHtml[] = $lines[$i++];
$arrHtml[] = $lines[$i++];
$arrHtml[] = $lines[$i++];
$arrHtml[] = $lines[$i++];
break;
}
}
//We simply strip all tags, so you just got the content.
$arrHtml = array_map("strip_tags", $arrHtml);
//Here we echo the data
echo implode('<br>',$arrHtml);
?>
You can fix the layout yourself by taking each element out of $arrHtml in a loop.
I suggest using SimpleXML and XPath. Here is a working example:
$html = file_get_contents('http://xtremetop100.com/conquer-online');
// suppress invalid markup warnings
libxml_use_internal_errors(true);
// Create SimpleXML object
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);
$xpath = '//span[@class="hd1" and ./a[contains(., "Zath-Co")]]/ancestor::tr/td[@class="number" or @class="stats1" or @class="stats"]';
$anchor = $xml->xpath($xpath);
// Clear invalid markup error buffer
libxml_clear_errors();
$rank = (string)$anchor[0]->b;
$in = (string)$anchor[1]->span;
$out = (string)$anchor[2]->span;
From an HTML page I need to extract the value of v from every anchor link. Each anchor link is nested inside some five div tags:
<a href="/watch?v=value to be retrieved&list=blabla&feature=plpp_play_all">
Each v value has 11 characters. For now I am trying to read it character by character, like this:
<?php
$file=fopen("xx.html","r") or exit("Unable to open file!");
$d='v';
$dd='=';
$vd=array();
while (!feof($file))
{
$f=fgetc($file);
if($f==$d)
{
$ff=fgetc($file);
if ($ff==$dd)
{
$id='';
for($i=0;$i<=10;$i++)
{
$sData = fgetc($file);
$id=$id.$sData;
}
array_push($vd, $id);
That is, I read each character of the v value into the $sData variable and append it to $id, so that I get those 11 characters as a string ($id)…
The problem is that searching the entire HTML file for 'v=' and, when it is found, reading the 11 characters and pushing them into the array is slow; it takes a considerable amount of time. So please help me find a more efficient way to do this.
<?php
function substring(&$string,$start,$end)
{
$pos = strpos(">".$string,$start);
if(! $pos) return "";
$pos--;
$string = substr($string,$pos+strlen($start));
$posend = strpos($string,$end);
$toret = substr($string,0,$posend);
$string = substr($string,$posend);
return $toret;
}
$contents = @file_get_contents("xx.html");
$old="";
$videosArray=array();
while ($old <> $contents)
{
$old = $contents;
$v = substring($contents,"?v=","&");
if($v) $videosArray[] = $v;
}
//$videosArray is array of v's
?>
I would rather parse the HTML with SimpleXML and XPath:
// Get your page HTML string
$html = file_get_contents('xx.html');
// As per comment by Gordon to suppress invalid markup warnings
libxml_use_internal_errors(true);
// Create SimpleXML object
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);
// Find the <a> nodes whose href contains "v="
$anchors = $xml->xpath('//a[contains(@href, "v=")]');
foreach ($anchors as $a)
{
$href = (string)$a['href'];
$url = parse_url($href);
parse_str($url['query'], $params);
// $params['v'] contains what we need
$vd[] = $params['v']; // push into array
}
// Clear invalid markup error buffer
libxml_clear_errors();
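If a full DOM parse is more than is needed, a single regular expression over the file contents is another option. This is a different technique from the answers above, sketched under the assumption that each v value is exactly 11 characters long (as stated in the question) and contains only letters, digits, "_" and "-". The helper name `extractVideoIds` and the sample HTML are made up.

```php
<?php
// Hypothetical helper: collect the 11-character v parameters from watch links.
function extractVideoIds(string $html): array
{
    // "v=" must follow "?", "&" or the ";" of "&amp;", and the value must be
    // exactly 11 allowed characters (the lookahead rejects longer runs).
    preg_match_all('/[?;&]v=([A-Za-z0-9_-]{11})(?![A-Za-z0-9_-])/', $html, $matches);

    // Drop duplicates and reindex the result.
    return array_values(array_unique($matches[1]));
}

// Made-up sample markup standing in for xx.html.
$html = '<div><a href="/watch?v=abcdefghijk&list=blabla&feature=plpp_play_all">one</a></div>'
      . '<div><a href="/watch?list=x&amp;v=ABCDEFG_-12">two</a></div>';

print_r(extractVideoIds($html));
```

For the real file, the input would be file_get_contents('xx.html'). The regex does not check that the match sits inside an href attribute, so the XPath approach remains the stricter of the two.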