scrape html page with strange result

The scrape works, but strangely the result is ["-3°"].
I have tried many different things to get just -3°.
How can [" and "] show up if they are not in the code?
Can someone give me some direction on how to achieve this?
The code I am using is:
<?php
function scrape($url){
$output = file_get_contents($url);
return $output;
}
function fetchdata($data, $start, $end){
$data = stristr($data, $start); // Stripping all data from before $start
$data = substr($data, strlen($start)); // Stripping $start
$stop = stripos($data, $end); // Getting the position of the $end of the data to scrape
$data = substr($data, 0, $stop); // Stripping all data from after and including the $end of the data to scrape
return $data; // Returning the scraped data from the function
}
$page = scrape("https://weather.gc.ca/city/pages/bc-37_metric_e.html");
$result = fetchdata($page, "<p class=\"text-center mrgn-tp-md mrgn-bttm-sm lead\"><span class=\"wxo-metric-hide\">", "<abbr title=\"Celsius\">C</abbr>");
echo json_encode(array($result));
?>
Thanks in advance for your help!

You can use the DOMDocument to parse the HTML file.
$page = file_get_contents("https://weather.gc.ca/city/pages/bc-37_metric_e.html");
$doc = new DOMDocument();
libxml_use_internal_errors(true); // suppress warnings caused by imperfect real-world HTML
$doc->loadHTML($page);
libxml_use_internal_errors(false);
$paragraphs = $doc->getElementsByTagName('p');
foreach ($paragraphs as $p) {
    if ($p->getAttribute('class') == 'text-center mrgn-tp-md mrgn-bttm-sm lead') {
        foreach ($p->getElementsByTagName('span') as $attr) {
            if ($attr->getAttribute('class') == 'wxo-metric-hide') {
                foreach ($attr->getElementsByTagName('abbr') as $abbr) {
                    if ($abbr->getAttribute('title') == 'Celsius') {
                        echo trim($attr->nodeValue);
                    }
                }
            }
        }
    }
}
Output:
-3°C
This is assuming the classes and structure are consistent...
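As for where the [" and "] in the question come from: they are most likely added by json_encode(array($result)), which wraps the value in an array and serializes it as JSON. A minimal sketch, using a hard-coded stand-in for the scraped value (the real string may also contain an HTML entity for the degree sign):

```php
<?php
// Stand-in for the scraped value; hypothetical, not fetched from the site.
$result = "-3";

// Wrapping the string in an array and JSON-encoding it adds [" and "].
$encoded = json_encode(array($result));
echo $encoded, "\n";

// Echoing the string directly prints only the value.
echo $result, "\n";
```

So if only the bare value is wanted, echoing $result instead of the JSON-encoded array should be enough.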

Related

Generating pdf from XML response taking so much time

I have XML in my database and want to generate a PDF from it, but if the XML is too big, the request fails with a timeout error. I don't want to change my server timeout; instead, I would like to know how to speed up the PDF generation.
I am using the <pre> tag to get good formatting.
Below is the code that generates the PDF:
public function generateCreditReport($form, $path=false, $htmlPath=false){
$this->load->library('fpdf/fpdf');
$this->load->library('fpdi/fpdi');
if(!empty($path) && !file_exists($path)){
$data['form'] = $form;
$data['date'] = $data['xml'] = null;
if($data['form']->credit_report != ''){
try{
$data['xml'] = new SimpleXMLElement($data['form']->credit_report);
}catch(Exception $e){
}
$timeStatus = $this->admin->get_timestamp($data['form']->id, 'credit_pulled');
if(!empty($timeStatus)) {
$data['date'] = $timeStatus->date;
}
}
$pdf = new TCPDF();
$pdf->SetFont('Helvetica');
$pdf->SetFontSize(10);
$pdf->SetTextColor(0, 0, 0);
$pdf->SetProtection(array('print','modify'),"$form->loan_id","sefinance",0);
$pdf->AddPage();
$html = $this->load->view('admin/credit_report_pdf', $data,true);
$generatedReport = file_put_contents($htmlPath, $html , FILE_APPEND | LOCK_EX);
$pdf->WriteHTML($html);
if($path){
$pdf->Output($path, "F");
}else{
$pdf->Output();
}
}
}
The view file credit_report_pdf is as follows:
<h4>Report Results</h4>
<br>
<?php if(!empty($xml)){ ?>
<?php echo print_credit($xml->EfxReport->USPrintImage); ?>
<?php } ?>
I format the report with <pre> so that it looks good:
function print_credit($report) {
$split = " <div style='page-break-before: always;'></div>" . repeater(' ', 43);
$report = preg_replace('/\* \* \*[\s\S]+?USER REF./', $split, $report);
$output = "";
$output = '<pre>';
$length = strlen($report);
$count = $length / 81;
$position = 1;
for ($i = 0; $i < $count; $i++) {
if (strpos(substr($report, $position, 80), 'THIS FORM PRODUCED BY EQUIFAX') === FALSE) {
$output .= substr($report, $position, 80) . '<br>';
}
if ($i == 32) {
$output .= '</pre><pre>';
}
$position += 81;
}
$output .= '</pre>';
return $output;
}
Please help me improve the PDF generation time; it currently takes more than 10 minutes for a 10-page PDF.

Determining that a string is a valid HTML element

I'm having trouble getting this constraint matches function to match all HTML elements.
It must return true for any legitimate, properly-formed HTML element and return false for anything that is not a legitimate, properly-formed HTML element.
The following are things that did not work:
$dom = new \DOMDocument(); return $dom->loadHTML($value);
$dom = new \DOMDocument(); return $dom->loadHTML($value,LIBXML_HTML_NOIMPLIED);
Adding the flag LIBXML_NOENT to simplexml_load_string().
Adding the flag LIBXML_HTML_NOIMPLIED to simplexml_load_string().
Here is the current function:
function matches($value)
{
\libxml_use_internal_errors(true);
if (!\is_string($value) || empty($value)) {
return false;
}
$start = \strpos($value, '<');
$end = \strrpos($value, '>', $start);
$len = \strlen($value);
if ($end !== false) {
$value = \substr($value, $start);
} else {
$value = \substr($value, $start, $len - $start);
}
$value = \html_entity_decode($value);
$value = \str_replace('&', '', $value);
\libxml_clear_errors();
$xml = \simplexml_load_string($value);
return \count(\libxml_get_errors()) === 0;
}
The current version has two known problems:
<script>&</script>: Should fail but passes.
<a b="""></a>: Should pass but fails.

Website Scraping Using PHP

I have PHP code that extracts the product categories from this website: http://www.tradeindia.com/. So far I have managed to extract only the categories. How can I also extract the product numbers beside them, given that they are not inside an element with a class name?
My code:
<?php
//header('Content-Type: text/html; charset=utf-8');
$grep = new DOMDocument();
@$grep->loadHTMLFile("http://www.tradeindia.com/");
$finder = new DOMXPath($grep);
$class = "cate_menu";
$nodes = $finder->query("//*[contains(@class, '$class')]");
$total_L = 0;
foreach ($nodes as $node) {
    $span = $node->childNodes;
    echo '<br>' . $span->item(0)->nodeValue . ' : ';
}
?>
Source code from website:
<td align="left" style="padding-left:8px;color:blue"><a href=/Seller/Agriculture/ class="cate_menu" >Agriculture</a>(100892)</td>
<td align="left" style="padding-left:8px;color:blue"><a href=/Seller/Apparel-Fashion/ class="cate_menu" >Apparel & Fashion</a>(237902)</td>
<td align="left" style="padding-left:8px;color:blue"><a href=/Seller/Automobile/ class="cate_menu" >Automobile</a>(78614)</td>
I need the numbers between brackets.

I'm not an XPath guru, but I would first target that particular table using the text CATEGORIES as the needle, then get the rows relative to it and loop over the rows found.
Rough example:
$grep = new DOMDocument();
@$grep->loadHTMLFile("http://www.tradeindia.com/");
$finder = new DOMXPath($grep);
$products = array();
$nodes = $finder->query("
    //td[@class='showroom1'][contains(text(), 'CATEGORIES')]
    /parent::tr/parent::table/parent::td/parent::tr
    /following-sibling::tr
    /td[1]/table/tr/td/table/tr
");
if ($nodes->length > 0) {
    foreach ($nodes as $tr) {
        if ($finder->evaluate('count(./td/a)', $tr) > 0) {
            foreach ($finder->query('./td/a[@class="cate_menu"]', $tr) as $row) {
                $text = $row->nodeValue;
                // The count, e.g. "(100892)", is the text node right after the link
                $number = $finder->query('./following-sibling::text()', $row)->item(0)->nodeValue;
                $products[] = "$text $number";
            }
        }
    }
}
echo '<pre>';
print_r($products);
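A shorter variant of the same idea, as a sketch only: select each link with the class cate_menu directly and read the text node that immediately follows it. The HTML below is a trimmed copy of the rows quoted in the question, and the variable names are illustrative; the live page may be structured differently.

```php
<?php
// Trimmed sample rows from the question; a real run would load the full page instead.
$html = '<table><tr>'
    . '<td><a href="/Seller/Agriculture/" class="cate_menu">Agriculture</a>(100892)</td>'
    . '<td><a href="/Seller/Automobile/" class="cate_menu">Automobile</a>(78614)</td>'
    . '</tr></table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$products = array();
foreach ($xpath->query('//a[@class="cate_menu"]') as $link) {
    // First text node after the link, e.g. "(100892)"
    $textNode = $xpath->query('following-sibling::text()[1]', $link)->item(0);
    // Strip whitespace and the surrounding brackets
    $count = $textNode ? trim($textNode->nodeValue, " \t\n\r()") : '';
    $products[trim($link->nodeValue)] = $count;
}
print_r($products);
```

This relies on the exact class value cate_menu and on the count being the next sibling of the link, as in the quoted source.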

Since the number is between two brackets, this should be easy. You can use a function like this:
function get_string_between($string, $start, $end) {
$string = " ".$string;
$ini = strpos($string,$start);
if ($ini == 0) return "";
$ini += strlen($start);
$len = strpos($string,$end,$ini) - $ini;
return substr($string,$ini,$len);
}
$product = get_string_between($htmlline, "(", ")");
You will need to pass in each line of the table separately, though. You could loop through an array of strings containing the lines, with foreach($htmllines as $htmlline) or similar.
Hope this helps.
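The loop described above could look roughly like this. The $htmllines array is a hypothetical stand-in holding two of the lines quoted in the question, and $counts is an illustrative name:

```php
<?php
// Same helper as in the answer above.
function get_string_between($string, $start, $end) {
    $string = " " . $string;
    $ini = strpos($string, $start);
    if ($ini == 0) return "";
    $ini += strlen($start);
    $len = strpos($string, $end, $ini) - $ini;
    return substr($string, $ini, $len);
}

// Hypothetical input: one table cell per array entry.
$htmllines = array(
    '<td><a href=/Seller/Agriculture/ class="cate_menu" >Agriculture</a>(100892)</td>',
    '<td><a href=/Seller/Automobile/ class="cate_menu" >Automobile</a>(78614)</td>',
);

$counts = array();
foreach ($htmllines as $htmlline) {
    // Takes the text between the first "(" and the next ")"
    $counts[] = get_string_between($htmlline, "(", ")");
}
print_r($counts);
```

Note that this picks up the first "(" on the line, so it would give a wrong result if a bracket appeared earlier, for example inside a category name.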

Iterate through array, and run function on each value

I need to iterate through an array and apply the same edit to each value.
<?php
Function parseStatus($Input, $Start, $End){
$String = " " . $Input;
$Init = StrPos($String, $Start);
If($Init == 0){
Return '';
}
$Init += StrLen($Start);
$Length = StrPos($String, $End, $Init) - $Init;
Return SubStr($String, $Init, $Length);
}
Function getAllStatuses($Username){
$DOM = new DOMDocument();
$DOM->validateOnParse = True;
@$DOM->loadHtml(File_Get_Contents('http://lifestream.aol.com/stream/' . $Username));
$xPath = new DOMXPath($DOM);
$Stream = $DOM->getElementById('stream')->nodeValue; // return stream content for display name
$Nodes = $xPath->query('//div[@class="stream"]');
$Name = Explode(' ', Trim($Stream));
$User = $Name[0];
$Statuses = Array();
ForEach($Nodes as $Node){
ForEach($Node->getElementsByTagName('li') as $Key => $Tags){
$Statuses[] = $Tags->nodeValue;
}
}
ForEach($Statuses as $Status){
If(StrPos($Status, 'Services')){
Echo 'services is definitely in there';
$New = AIM::parseStatus($Status, $User, 'Services');
Echo $New;
Break;
}
}
?>
The issue is that $New only echoes the very first result. How do I get it to run through each value in the array and do the same thing?
Expected output:
[name as start] what i need [word Services]
Then on each value in the array, do the same thing so it'd be like:
what i need
again what i need but different string
etc.
Thanks for any help.

The Break; in your foreach loop is, well, breaking the loop.
Remove the Break; and it should work.
Have a read here:
http://www.php.net/break
break ends execution of the current for, foreach, while, do-while or switch structure.
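A minimal sketch of the loop without the break, collecting every match into an array. The helper is the question's parseStatus written as a plain function; the $user value and the sample statuses are made up for illustration:

```php
<?php
// The question's helper, written as a standalone function.
function parseStatus($input, $start, $end) {
    $string = ' ' . $input;
    $init = strpos($string, $start);
    if ($init == 0) {
        return '';
    }
    $init += strlen($start);
    $length = strpos($string, $end, $init) - $init;
    return substr($string, $init, $length);
}

// Hypothetical data standing in for the scraped statuses.
$user = 'alice';
$statuses = array(
    'alice went hiking Services',
    'no marker here',
    'alice is reading Services',
);

$parsed = array();
foreach ($statuses as $status) {
    // !== false also catches a match at position 0
    if (strpos($status, 'Services') !== false) {
        $parsed[] = trim(parseStatus($status, $user, 'Services'));
    }
    // No break here, so every status is examined.
}
print_r($parsed);
```

The strict !== false comparison is a small change from the question's code, which would skip a status that starts with the word Services.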

php substring occurrences between two strings in an html file

I have an HTML file as a source. It contains several instances of the following code:
<span itemprop="name">NAME</span>
where the NAME part is different each time.
How can I write PHP code that goes through the HTML, extracts all the names between "<span itemprop="name">" and "</span>", and puts them in an array?
I have tried this code, but it doesn't work:
$prev=$html;
for($i=0; $i<10; $i++){
$current = explode('<span itemprop="name">', $prev);
$cur = explode('</span>', $current[1]);
$names[] = $cur[0];
$prev = $current[2];
}
print_r($names);

A better way than the one you planned would probably be to use PHP's DOMDocument, Simple HTML DOM, or any other DOM parser.
Here is an example of working DOMDocument code:
$doc = new DOMDocument();
$doc->loadHTML('<html><body><span itemprop="name">1</span><span itemprop="name">2</span><span itemprop="name">3</span></body></html>');
$finder = new DOMXPath($doc);
$nodes = $finder->query("//*[contains(@itemprop, 'name')]");
foreach ($nodes as $node) {
    echo $node->nodeValue . '<br />';
}
Outputs:
1
2
3
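To collect the names into an array, as the question asks, the same approach could be sketched like this. The sample names and the exact-match XPath expression are assumptions for illustration; contains() as used above would also match attribute values such as "nickname".

```php
<?php
// Hypothetical sample document; a real run would load the source HTML file.
$html = '<html><body>'
    . '<span itemprop="name">Alice</span>'
    . '<p>unrelated</p>'
    . '<span itemprop="name">Bob</span>'
    . '</body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$names = array();
// Exact match on the attribute value rather than a substring match
foreach ($xpath->query('//span[@itemprop="name"]') as $span) {
    $names[] = trim($span->nodeValue);
}
print_r($names);
```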

I kinda feel bad for saying this... but you could use a regular expression
preg_match_all('/<span itemprop="name">(.*?)<\/span>/i', $html, $matches);
var_dump($matches); // results are stored in $matches; the captured names are in $matches[1]

This function will get us the "NAME"
function getbetween($content,$start,$end) {
$r = explode($start, $content);
if (isset($r[1])){
$r = explode($end, $r[1]);
return $r[0];
}
return '';
}
This function will replace only the first occurrence:
<?php
function str_replace_once($search, $replace, $subject) {
$firstChar = strpos($subject, $search);
if($firstChar !== false) {
$beforeStr = substr($subject,0,$firstChar);
$afterStr = substr($subject, $firstChar + strlen($search));
return $beforeStr.$replace.$afterStr;
} else {
return $subject;
}
}
?>
Now a loop:
$start = '<span itemprop="name">';
$end = '</span>';
// Compare with !== false so a match at position 0 is not treated as "not found"
while (strpos($content, $start) !== false) {
    $name = getbetween($content, $start, $end);
    $content = str_replace_once($start.$name.$end, '', $content);
    echo $name.'<br>';
}

use this function:
function get_string_between($string, $start, $end){
$string = ' ' . $string;
$ini = strpos($string, $start);
if ($ini == 0) return '';
$ini += strlen($start);
$len = strpos($string, $end, $ini) - $ini;
return substr($string, $ini, $len);
}
$fullstring = 'this is my [tag]dog[/tag]';
$parsed = get_string_between($fullstring, '[tag]', '[/tag]');
echo $parsed; // (result = dog)
