I'm testing some data scraping. The value I'm scraping is a percentage, so I simply appended
echo "%<br>";
after the statement that outputs the actual number, which is
echo $ret_[66];
However, the percent sign also appears on its own before the number, which is not desirable. This is the output:
%
-0.02%
Whereas what I'm trying to get is just -0.02%
Clearly I'm doing something wrong with the PHP. I'd really appreciate any feedback/solutions. Thank you!
Full code:
<?php
error_reporting(E_ALL^E_NOTICE^E_WARNING);
include_once "global.php";
$doc = new DOMDocument;
// We don't want to bother with white spaces
$doc->preserveWhiteSpace = false;
$doc->strictErrorChecking = false;
$doc->recover = true;
$doc->loadHTMLFile('http://www.moneycontrol.com/markets/global-indices/');
$xpath = new DOMXPath($doc);
$query = "//div[@class='MT10']";
$entries = $xpath->query($query);
foreach ($entries as $entry) {
$result = trim($entry->textContent);
$ret_ = explode(' ', $result);
//make sure no element in the array starts or ends with whitespace
foreach ($ret_ as $key => $val){
$ret_[$key] = trim($val);
}
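//delete the empty elements and the elements that are only "\n", "\r" or "\t"
//deleteBlankInArray is a callback defined elsewhere (presumably in global.php); pass its name as a string
$ret_ = array_values(array_filter($ret_, 'deleteBlankInArray'));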
//echo the last element
echo $ret_[66];
echo "%<br>";
}
<?php
echo "%<br>";
?>
Putting the echo in a separate, following PHP block does the same thing.
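One possible explanation, offered as a sketch rather than a confirmed diagnosis: if the XPath query matches more than one div, the loop body runs once per match, and a match whose token array has no index 66 prints nothing for the number but still prints the "%". The helper name formatPercent and the sample arrays below are hypothetical, not part of the original script.

```php
<?php
// Hypothetical helper: return "value%<br>" only when the wanted index exists.
function formatPercent(array $tokens, int $index): string
{
    // isset() is false when the index is missing, so nothing is emitted.
    if (!isset($tokens[$index]) || $tokens[$index] === '') {
        return '';
    }
    return $tokens[$index] . "%<br>";
}

// Made-up stand-ins for the token arrays of two matched divs.
$matches = [
    ['Name', 'Value'],            // short array: no index 2
    ['Index', 'Last', '-0.02'],   // index 2 holds the number
];

foreach ($matches as $tokens) {
    echo formatPercent($tokens, 2);   // only the second array produces output
}
```

In the original loop the same guard would wrap both echo statements, so the "%" is printed only when $ret_[66] is set.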
I am trying to add a comma and whitespace to some data I am scraping from a website. The data scrapes successfully, but the items are run together, and the comma and space I am trying to add only get added after the last item. Here is the code I currently have:
$html = curl_exec($ch);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$finder = new DomXPath($dom);
$class_ops = 'ipc-inline-list ';
$class_opp = 'ipc-inline ';
$node = $finder->query("//div[@class='$class_ops']//ul[@class='$class_opp']");
foreach ($node as $index => $t) {
if ($index == 3) {
$la = $t->textContent.", ";
}
}
echo $la;
Current Result
DoyleBrainDavid,
Expected Result
Doyle, Brain, David
I am using this code
$c1 = curl_init('https://stackoverflow.com/');
curl_setopt($c1, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($c1);
if (curl_error($c1))
die(curl_error($c1));
// Get the status code
$status = curl_getinfo($c1, CURLINFO_HTTP_CODE);
curl_close($c1);
preg_match_all('/<span(.*?)<\/span>/s', $html, $matches1);
foreach($matches1[0] as $k=>$v){
$enc = mb_detect_encoding($v);
$v = mb_convert_encoding($v, "UTF-8", $enc); // arguments are (string, to_encoding, from_encoding)
$match1[$k] = strip_tags ($v);
//$match1[$k] = preg_replace('/^[^A-Za-z0-9]+/', '', $match1[$k]);
}
var_dump($match1);
In your case you can replace the pattern like this:
preg_match_all('/<div class="ipc-inline-list">(.*?)<\/div>/s', $html, $matches1);
This returns an array of matches.
I hope this is helpful.
You want each li, not the ul as one block. Try:
$node = $finder->query("//div[@class='$class_ops']//ul[@class='$class_opp']/li");
Demo: https://3v4l.org/Mvfud
If that doesn't work the actual HTML content should be added to the question.
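A minimal, self-contained sketch of the per-li approach, assuming markup along the lines described in the question. The sample HTML, the class names without trailing spaces, and the helper name joinListItems are all assumptions for illustration; the real page may differ.

```php
<?php
// Hypothetical markup; the real page structure may differ.
$html = '<div class="ipc-inline-list"><ul class="ipc-inline">'
      . '<li>Doyle</li><li>Brain</li><li>David</li></ul></div>';

// Hypothetical helper: collect the text of every node matched by $query
// and join the pieces with ", ".
function joinListItems(string $html, string $query): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();

    $names = [];
    foreach ((new DOMXPath($dom))->query($query) as $li) {
        $names[] = trim($li->textContent);
    }
    return implode(', ', $names);
}

echo joinListItems($html, "//div[@class='ipc-inline-list']//ul[@class='ipc-inline']/li");
```

Collecting the items into an array and calling implode() avoids the trailing separator that appending ", " inside the loop would leave.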
I'm wondering what can be modified in this scraping/output code, which produces values like +40.07%, so that it ignores certain characters. My site already places a "+" in front of every value automatically. However, the source I scrape from also puts a + in front of positive values, so I get output like ++40.07 when I only want one +. Does anyone know what can be added to ignore the scraped +?
// get sandpdailychange
function getSandpdailychange(){
$doc = new DOMDocument;
// We don't want to bother with white spaces
$doc->preserveWhiteSpace = false;
// Most HTML Developers are chimps and produce invalid markup...
$doc->strictErrorChecking = false;
$doc->recover = true;
$doc->loadHTMLFile('http://quotes.wsj.com/index/SPX');
$xpath = new DOMXPath($doc);
$query = "//span[@id='quote_change']";
$entries = $xpath->query($query);
foreach ($entries as $entry) {
$result = trim($entry->textContent);
$ret_ = explode(' ', $result);
//make sure no element in the array starts or ends with whitespace
foreach ($ret_ as $key=>$val){
$ret_[$key]=trim($val);
}
//delete the empty elements and the elements that are only "\n", "\r" or "\t"
//deleteBlankInArray is a callback defined elsewhere; pass its name as a string
$ret_ = array_values(array_filter($ret_, 'deleteBlankInArray'));
//write the first element to the cache file
file_put_contents(globalVars::$_cache_dir . "sandpdailychange",
$ret_[0]);
}
}
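One way to drop the scraped sign, sketched under the assumption that the value arrives as a plain string such as "+40.07%". The helper name stripLeadingPlus is hypothetical; ltrim() with a character list is standard PHP.

```php
<?php
// Hypothetical helper: remove surrounding whitespace, then any leading "+",
// so the site can add its own sign afterwards.
function stripLeadingPlus(string $value): string
{
    return ltrim(trim($value), '+');
}

echo stripLeadingPlus('+40.07%'), "\n";   // 40.07%
echo stripLeadingPlus('-0.02%'), "\n";    // -0.02% (negative values are left alone)
```

In the function above, this could be applied to $ret_[0] just before the file_put_contents() call.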
Hi, I currently have a web scraping script that echoes 176 8 58. I want to capture this output in a variable and echo it in other places on the website.
I've captured it by doing this:
ob_start();
echo $node->nodeValue. "\n";
$thenumbers = ob_get_contents();
ob_end_clean();
but when I echo it out like this
<?php echo $thenumbers ?>
my output is then 176 8 58
On the website the numbers are in spans and are separated by "/". Do I need to do anything fancy? I'm fairly new to PHP, so let me know if it's something stupid!
I would really appreciate a bit of help.
(This is the web scraping script I'm using; I had to hide the website I'm scraping as it's in development.)
<?php
$teamlink = rwmb_meta( 'WEBSITE_HIDDEN' );
$arr = array( $teamlink );
foreach ($arr as &$value) {
$file = $DOCUMENT_ROOT. $value;
$doc = new DOMDocument();
$doc->loadHTMLFile($file);
$xpath = new DOMXpath($doc);
$elements = $xpath->query("//*[contains(@class, 'table')]/tr[3]/td[3]/span");
if (!is_null($elements)) {
foreach ($elements as $element) {
$nodes = $element->childNodes;
foreach ($nodes as $node) {
ob_start();
echo $node->nodeValue. "\n";
$win_loss = ob_get_contents();
ob_end_clean();
}
}
}
}
?>
P.S. I know the script works, as it's currently outputting standard text fine.
My apologies if I have completely misunderstood your question.
If you want to add a "/" between the numbers, where the spaces are, you could:
echo str_replace(' ','/',$thenumbers);
If you just want to show the last 3 digits (after removing the spaces from the string), you could:
echo substr(str_replace(' ','',$thenumbers),-3);
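To make the difference between the two suggestions concrete, here is a short sketch run against the sample value from the question. It assumes $thenumbers holds exactly "176 8 58" with no trailing newline (the original capture appends "\n", which trim() would remove).

```php
<?php
$thenumbers = "176 8 58";   // sample value taken from the question

// Replace each space with a slash.
echo str_replace(' ', '/', $thenumbers), "\n";              // 176/8/58

// Remove the spaces, then keep the last three characters of "176858".
echo substr(str_replace(' ', '', $thenumbers), -3), "\n";   // 858
```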
I'm having trouble extracting the integers between the brackets from this website.
Part of markup from the website:
<span class="b-label b-link-number" data-num="(322206)">Music & Video</span>
<span class="b-label b-link-number" data-num="(954218)">Toys, Hobbies & Games</span>
<span class="b-label b-link-number" data-num="(502981)">Kids, Baby & Maternity</span>
How do I extract the integers between the brackets?
Desired output:
322206
954218
502981
Should I use a regex, since they all have the same class name? (A regex that simply grabs whatever is between brackets won't do, since the source code has other unwanted elements inside brackets as well.)
Normally, this would be the way I use to extract information:
<?php
//header('Content-Type: text/html; charset=utf-8');
$grep = new DoMDocument();
@$grep->loadHTMLFile("http://global.rakuten.com/en/search/?tl=&k=");
$finder = new DomXPath($grep);
$class = "b-list-item";
$nodes = $finder->query("//*[contains(@class, '$class')]");
foreach ($nodes as $node) {
$span = $node->childNodes;
$search = array(0,1,2,3,4,5,6,7,8,9,'(',')');
$categories = str_replace($search, '', $span->item(0)->nodeValue);
echo '<br>' . '<font color="green">' . $categories . ' ' . '</font>' ;
}
?>
But since the data I want is inside the tag, how do I extract it?
Building on your current code, it's straightforward: just change $class to the class you want and use ->getAttribute() to get those data-num values:
$grep = new DoMDocument();
@$grep->loadHTMLFile("http://global.rakuten.com/en/search/?tl=&k=");
$finder = new DomXPath($grep);
$class = "b-link-number"; // change the span class
$nodes = $finder->query("//*[contains(@class, '$class')]"); // target those
$numbers = array();
foreach ($nodes as $node) { // for every found element
$link_num = $node->getAttribute('data-num'); // get the attribute `data-num`
$link_num = str_replace(['(', ')'], '', $link_num); // simply remove those parenthesis
$numbers[] = $link_num; // push it inside the container
}
echo '<pre>';
print_r($numbers);
<span[^>)()]*\((\d+)\)[^>]*>
Try this. Grab the capture. See the demo:
http://regex101.com/r/iM2wF9/10
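As a sketch of how the pattern above might be applied in PHP, the snippet below runs it with preg_match_all() over the three sample spans from the question. The helper name extractBracketedNumbers and the choice of "~" as the delimiter are assumptions for illustration.

```php
<?php
// Sample markup copied from the question.
$html = '<span class="b-label b-link-number" data-num="(322206)">Music & Video</span>'
      . '<span class="b-label b-link-number" data-num="(954218)">Toys, Hobbies & Games</span>'
      . '<span class="b-label b-link-number" data-num="(502981)">Kids, Baby & Maternity</span>';

// Hypothetical helper: return the digits captured inside "(...)" in each
// opening <span ...> tag.
function extractBracketedNumbers(string $html): array
{
    preg_match_all('~<span[^>)()]*\((\d+)\)[^>]*>~', $html, $m);
    return $m[1];   // first capture group: the digits only
}

echo implode("\n", extractBracketedNumbers($html)), "\n";
```

Because the pattern only looks inside the opening span tag, bracketed text elsewhere in the page is not picked up, though any span with a bracketed number in an attribute would match, whatever its class.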
From an HTML page I need to extract the values of v from all anchor links. Each anchor link is nested inside some 5 div tags:
<a href="/watch?v=value to be retrived&list=blabla&feature=plpp_play_all">
Each v value has 11 characters. For now I am trying to read the file character by character, like this:
<?php
$file=fopen("xx.html","r") or exit("Unable to open file!");
$d='v';
$dd='=';
$vd=array();
while (!feof($file))
{
$f=fgetc($file);
if($f==$d)
{
$ff=fgetc($file);
if ($ff==$dd)
{
$id=''; // reset the buffer for each match
for($i=0;$i<=10;$i++)
{
$sData = fgetc($file);
$id=$id.$sData;
}
array_push($vd, $id);
}
}
}
fclose($file);
?>
That is, I am reading each character of v into the sData variable and appending it to id so as to get those 11 characters as a string (id).
The problem is that searching the entire HTML file for 'v=' and, when it is found, reading the 11 characters and pushing them into the array is very slow; it takes a considerable amount of time. Please help me find a more efficient approach.
<?php
function substring(&$string,$start,$end)
{
$pos = strpos(">".$string,$start);
if(! $pos) return "";
$pos--;
$string = substr($string,$pos+strlen($start));
$posend = strpos($string,$end);
$toret = substr($string,0,$posend);
$string = substr($string,$posend);
return $toret;
}
$contents = @file_get_contents("xx.html");
$old="";
$videosArray=array();
while ($old <> $contents)
{
$old = $contents;
$v = substring($contents,"?v=","&");
if($v) $videosArray[] = $v;
}
//$videosArray is array of v's
?>
I would rather parse the HTML with SimpleXML and XPath:
// Get your page HTML string
$html = file_get_contents('xx.html');
// As per comment by Gordon to suppress invalid markup warnings
libxml_use_internal_errors(true);
// Create SimpleXML object
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);
// Find <a> nodes whose href contains "v="
$anchors = $xml->xpath('//a[contains(@href, "v=")]');
foreach ($anchors as $a)
{
$href = (string)$a['href'];
$url = parse_url($href);
parse_str($url['query'], $params);
// $params['v'] contains what we need
$vd[] = $params['v']; // push into array
}
// Clear invalid markup error buffer
libxml_clear_errors();