Website Scraping Using PHP - php

I have PHP code that extracts the product categories from this website: http://www.tradeindia.com/. So far I have managed to extract only the category names. How can I also extract the product count shown beside each category, given that it is not wrapped in an element with a class name?
My code:
<?php
//header('Content-Type: text/html; charset=utf-8');
$grep = new DOMDocument();
@$grep->loadHTMLFile("http://www.tradeindia.com/");
$finder = new DOMXPath($grep);
$class = "cate_menu";
$nodes = $finder->query("//*[contains(@class, '$class')]");
$total_L = 0;
foreach ($nodes as $node) {
    $span = $node->childNodes;
    echo '<br>' . $span->item(0)->nodeValue . ' : ';
}
?>
Source code from website:
<td align="left" style="padding-left:8px;color:blue"><a href=/Seller/Agriculture/ class="cate_menu" >Agriculture</a>(100892)</td>
<td align="left" style="padding-left:8px;color:blue"><a href=/Seller/Apparel-Fashion/ class="cate_menu" >Apparel & Fashion</a>(237902)</td>
<td align="left" style="padding-left:8px;color:blue"><a href=/Seller/Automobile/ class="cate_menu" >Automobile</a>(78614)</td>
I need the numbers between brackets.

I'm not an XPath guru, but I would first target that particular table using the text CATEGORIES as the needle, then fetch the rows below it and loop over the rows found.
Rough example:
$grep = new DOMDocument();
@$grep->loadHTMLFile("http://www.tradeindia.com/");
$finder = new DOMXPath($grep);
$products = array();
$nodes = $finder->query("
    //td[@class='showroom1'][contains(text(), 'CATEGORIES')]
    /parent::tr/parent::table/parent::td/parent::tr
    /following-sibling::tr
    /td[1]/table/tr/td/table/tr
");
if ($nodes->length > 0) {
    foreach ($nodes as $tr) {
        if ($finder->evaluate('count(./td/a)', $tr) > 0) {
            foreach ($finder->query('./td/a[@class="cate_menu"]', $tr) as $row) {
                $text = $row->nodeValue;
                // The count, e.g. "(100892)", is the text node right after the link.
                $number = $finder->query('./following-sibling::text()', $row)->item(0)->nodeValue;
                $products[] = "$text $number";
            }
        }
    }
}
echo '<pre>';
print_r($products);
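To try the following-sibling idea without hitting the live site, here is a minimal sketch that runs against the three rows quoted in the question. The wrapping <table><tr> and the function name extractCategoryCounts are my own additions for illustration, and it assumes the count always sits in the text node directly after the link.

```php
<?php
// Rows copied from the question; the <table><tr> wrapper is added only so the
// fragment parses as a table.
$html = <<<HTML
<table><tr>
<td align="left" style="padding-left:8px;color:blue"><a href=/Seller/Agriculture/ class="cate_menu" >Agriculture</a>(100892)</td>
<td align="left" style="padding-left:8px;color:blue"><a href=/Seller/Apparel-Fashion/ class="cate_menu" >Apparel & Fashion</a>(237902)</td>
<td align="left" style="padding-left:8px;color:blue"><a href=/Seller/Automobile/ class="cate_menu" >Automobile</a>(78614)</td>
</tr></table>
HTML;

// Hypothetical helper: maps each category name to its product count.
function extractCategoryCounts(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $counts = [];
    foreach ($xpath->query("//a[contains(@class, 'cate_menu')]") as $link) {
        // First text node after the link, expected to look like "(100892)".
        $textNode = $xpath->query('following-sibling::text()[1]', $link)->item(0);
        if ($textNode !== null && preg_match('/\((\d+)\)/', $textNode->nodeValue, $m)) {
            $counts[trim($link->nodeValue)] = (int) $m[1];
        }
    }
    return $counts;
}

print_r(extractCategoryCounts($html));
```

Using libxml_use_internal_errors() instead of the @ operator keeps the parser warnings available through libxml_get_errors() if you ever need to inspect them.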

Since the number is between two brackets, this should be easy. You can use a function like this;
function get_string_between($string, $start, $end) {
    $string = " ".$string;
    $ini = strpos($string,$start);
    if ($ini == 0) return "";
    $ini += strlen($start);
    $len = strpos($string,$end,$ini) - $ini;
    return substr($string,$ini,$len);
}
$product = get_string_between($htmlline, "(", ")");
You will need to feed in each line of the table separately, though. You could loop through an array of strings, one per line, with foreach ($htmllines as $htmlline) or similar.
Hope this helps.
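If you would rather stay with plain string matching, a single preg_match_all() call can pull the name and the bracketed number together. This is only a sketch: the pattern assumes the markup looks exactly like the rows quoted in the question, and the function name categoryCountsFromString is made up for illustration.

```php
<?php
// Two rows in the same shape as the question's sample markup.
$source = '<td><a href=/Seller/Agriculture/ class="cate_menu" >Agriculture</a>(100892)</td>' . "\n"
        . '<td><a href=/Seller/Automobile/ class="cate_menu" >Automobile</a>(78614)</td>';

// Hypothetical helper: returns [name => count] using a regular expression only.
function categoryCountsFromString(string $source): array
{
    // Group 1: link text. Group 2: digits inside the brackets right after </a>.
    $pattern = '~<a[^>]*class="cate_menu"[^>]*>([^<]+)</a>\((\d+)\)~';
    preg_match_all($pattern, $source, $matches, PREG_SET_ORDER);

    $counts = [];
    foreach ($matches as $match) {
        $counts[trim($match[1])] = (int) $match[2];
    }
    return $counts;
}

print_r(categoryCountsFromString($source));
```

A regular expression like this breaks as soon as the site changes its markup, so the DOM approach is usually the safer choice.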

Related

php to extract data from a website

I want to get all the <p> elements from the first joke, so I wrote this script:
<?php
$url = "http://sms.hindijokes.co";
$html = file_get_contents($url);
$doc = new DOMDocument;
$doc->strictErrorChecking = false;
$doc->recover = true;
@$doc->loadHTML("<html><body>" . $html . "</body></html>");
$xpath = new DOMXPath($doc);
$query1 = "//h2[@class='entry-title']/a";
$query2 = "//div[@class='entry-content']/p";
$entries1 = $xpath->query($query1);
$entries2 = $xpath->query($query2);
$var1 = $entries1->item(0)->textContent;
$var2 = $entries2->item(0)->textContent;
echo "$var1";
echo "<br>";
$f = 5;
for ($i = 0; $i < $f; $i++) {
    echo $entries2->item($i)->textContent . "\n";
}
?>
This time I knew that the first joke has five <p> elements, but for an automated script the count will sometimes be more or less than five, which would break the output.
You need only the first div's p elements, so your query would be:
$entries2 = $xpath->query("(//div[@class='entry-content'])[1]/p");
Now you can iterate all p elements with foreach() loop (extracting its html contents):
$innerHtml = '';
foreach ($entries2 as $entry) {
    $children = $entry->childNodes;
    foreach ($children as $child) {
        $innerHtml .= $child->ownerDocument->saveXML($child);
    }
}
$innerHtml = str_replace(["\r\n", "\r", "\n", "\t"], '', $innerHtml);
DOMXPath::query() returns a DOMNodeList object, so use its length property instead of a hard-coded count:
$f = $entries2->length;
Try it this way: the loop runs until item() returns null. Some jokes have multiple p tags, though, so it is better to select them by your own class/id.
$i = 0;
while (($p = $entries2->item($i)) !== null) {
    echo "<br>";
    echo $i . " " . $p->textContent;
    $i++;
}
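Putting the answers together, here is a small sketch that collects however many paragraphs the first entry-content div happens to contain. The HTML string is a made-up stand-in for the real page, and firstEntryParagraphs is a hypothetical helper name.

```php
<?php
// Made-up markup: two "jokes", the first with three paragraphs.
$html = '<html><body>'
      . '<div class="entry-content"><p>Line one</p><p>Line two</p><p>Line three</p></div>'
      . '<div class="entry-content"><p>Another joke</p></div>'
      . '</body></html>';

// Hypothetical helper: text of every <p> inside the first entry-content div.
function firstEntryParagraphs(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // keep parser warnings out of the output
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $paragraphs = [];
    // The parentheses make [1] pick the first matching div in the whole document.
    foreach ($xpath->query("(//div[@class='entry-content'])[1]/p") as $p) {
        $paragraphs[] = trim($p->textContent);
    }
    return $paragraphs;
}

print_r(firstEntryParagraphs($html));
```

Because foreach walks the whole DOMNodeList, there is no need to know the paragraph count in advance.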

scrape html page with strange result

The scrape works, but strangely the result is ["-3°"].
I have tried many different things to get just -3°.
How is it that [" and "] show up when they are not in the code?
Can someone give me some direction on how to achieve this?
The code I am using is:
<?php
function scrape($url){
    $output = file_get_contents($url);
    return $output;
}
function fetchdata($data, $start, $end){
    $data = stristr($data, $start); // Stripping all data from before $start
    $data = substr($data, strlen($start)); // Stripping $start
    $stop = stripos($data, $end); // Getting the position of the $end of the data to scrape
    $data = substr($data, 0, $stop); // Stripping all data from after and including the $end of the data to scrape
    return $data; // Returning the scraped data from the function
}
$page = scrape("https://weather.gc.ca/city/pages/bc-37_metric_e.html");
$result = fetchdata($page, "<p class=\"text-center mrgn-tp-md mrgn-bttm-sm lead\"><span class=\"wxo-metric-hide\">", "<abbr title=\"Celsius\">C</abbr>");
echo json_encode(array($result));
?>
Thanks in advance for your help!
You can use the DOMDocument to parse the HTML file.
$page = file_get_contents("https://weather.gc.ca/city/pages/bc-37_metric_e.html");
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($page);
libxml_use_internal_errors(false);
$paragraphs = $doc->getElementsByTagName('p');
foreach ($paragraphs as $p) {
    if ($p->getAttribute('class') == 'text-center mrgn-tp-md mrgn-bttm-sm lead') {
        foreach ($p->getElementsByTagName('span') as $attr) {
            if ($attr->getAttribute('class') == 'wxo-metric-hide') {
                foreach ($attr->getElementsByTagName('abbr') as $abbr) {
                    if ($abbr->getAttribute('title') == 'Celsius') {
                        echo trim($attr->nodeValue);
                    }
                }
            }
        }
    }
}
Output:
-3°C
This is assuming the classes and structure are consistent...
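As for where the [" and "] come from in the question's version: json_encode(array($result)) serializes a one-element PHP array as a JSON array, and JSON array syntax is exactly those brackets and quotes. A minimal sketch, assuming the scraped value is the string -3&deg; (a stand-in, I have not checked the live page):

```php
<?php
// Assumed stand-in for the value returned by fetchdata() in the question.
$result = '-3&deg;';

// Wrapping the value in an array and JSON-encoding it adds [ ] and quotes.
$asJsonArray = json_encode(array($result));

// Echoing the string itself prints just the value.
$asPlainText = $result;

echo $asJsonArray, "\n";   // ["-3&deg;"]
echo $asPlainText, "\n";   // -3&deg;
```

So if only the bare value is wanted, echo $result directly; keep json_encode() only when the consumer actually expects JSON.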

Iterate through array, and run function on each value

I need to iterate through an array and apply the same edit to each value.
<?php
Function parseStatus($Input, $Start, $End){
    $String = " " . $Input;
    $Init = StrPos($String, $Start);
    If($Init == 0){
        Return '';
    }
    $Init += StrLen($Start);
    $Length = StrPos($String, $End, $Init) - $Init;
    Return SubStr($String, $Init, $Length);
}
Function getAllStatuses($Username){
    $DOM = new DOMDocument();
    $DOM->validateOnParse = True;
    @$DOM->loadHtml(File_Get_Contents('http://lifestream.aol.com/stream/' . $Username));
    $xPath = new DOMXPath($DOM);
    $Stream = $DOM->getElementById('stream')->nodeValue; // return stream content for display name
    $Nodes = $xPath->query('//div[@class="stream"]');
    $Name = Explode(' ', Trim($Stream));
    $User = $Name[0];
    $Statuses = Array();
    ForEach($Nodes as $Node){
        ForEach($Node->getElementsByTagName('li') as $Key => $Tags){
            $Statuses[] = $Tags->nodeValue;
        }
    }
    ForEach($Statuses as $Status){
        If(StrPos($Status, 'Services')){
            Echo 'services is definitely in there';
            $New = AIM::parseStatus($Status, $User, 'Services');
            Echo $New;
            Break;
        }
    }
}
?>
The issue is that $New only echoes the very first match. How do I get it to run through each value in the array and do the same thing?
Expected output:
[name as start] what i need [word Services]
Then on each value in the array, do the same thing so it'd be like:
what i need
again what i need but different string
etc.
Thanks for any help.
The Break; in your foreach loop is, well, breaking the loop.
Remove the Break; and it should work.
Have a read here:
http://www.php.net/break
break ends execution of the current for, foreach, while, do-while or switch structure.
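To illustrate, here is a minimal sketch of the loop without the break, run on made-up data. The $statuses array, the user name alice, and the textBetween helper are all hypothetical stand-ins for the scraped values and parseStatus in the question.

```php
<?php
// Hypothetical helper: returns the text between $start and $end, or '' if either is missing.
function textBetween(string $input, string $start, string $end): string
{
    $from = strpos($input, $start);
    if ($from === false) {
        return '';
    }
    $from += strlen($start);
    $to = strpos($input, $end, $from);
    if ($to === false) {
        return '';
    }
    return trim(substr($input, $from, $to - $from));
}

// Made-up data standing in for the scraped <li> values.
$user = 'alice';
$statuses = [
    'alice is reading a book Services: AIM',
    'unrelated line',
    'alice went for a run Services: Twitter',
];

$parsed = [];
foreach ($statuses as $status) {
    if (strpos($status, 'Services') === false) {
        continue;   // skip this value, but keep looping over the rest
    }
    $parsed[] = textBetween($status, $user, 'Services');
}

print_r($parsed);
```

Note the strict === false comparison: a bare If(StrPos(...)) treats a match at position 0 as "not found".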

php substring occurrences between two strings in an html file

I have an HTML file as the source; it contains several instances of the following code:
<span itemprop="name">NAME</span>
where the NAME part is different each time.
How can I write PHP code that goes through the HTML, extracts all the names between "<span itemprop="name">" and "</span>", and puts them in an array?
I have tried this code, but it doesn't work:
$prev=$html;
for($i=0; $i<10; $i++){
    $current = explode('<span itemprop="name">', $prev);
    $cur = explode('</span>', $current[1]);
    $names[] = $cur[0];
    $prev = $current[2];
}
print_r($names);
A better way would probably be to use PHP's DOMDocument, a library such as Simple HTML DOM, or any other DOM parser rather than the approach you planned.
Here is an example of working DOMDocument code:
$doc = new DOMDocument();
$doc->loadHTML('<html><body><span itemprop="name">1</span><span itemprop="name">2</span><span itemprop="name">3</span></body></html>');
$finder = new DOMXPath($doc);
$nodes = $finder->query("//*[contains(@itemprop, 'name')]");
foreach ($nodes as $node)
{
    echo $node->nodeValue . '<br />';
}
Outputs:
1
2
3
I kind of feel bad for saying this, but you could use a regular expression:
preg_match_all('/<span itemprop="name">(.*?)<\/span>/i', $html, $matches);
var_dump($matches[1]); // the captured names are in $matches[1]
This function will get us the "NAME"
function getbetween($content,$start,$end) {
    $r = explode($start, $content);
    if (isset($r[1])){
        $r = explode($end, $r[1]);
        return $r[0];
    }
    return '';
}
This function will replace only the first occurrence:
<?php
function str_replace_once($search, $replace, $subject) {
    $firstChar = strpos($subject, $search);
    if($firstChar !== false) {
        $beforeStr = substr($subject,0,$firstChar);
        $afterStr = substr($subject, $firstChar + strlen($search));
        return $beforeStr.$replace.$afterStr;
    } else {
        return $subject;
    }
}
?>
Now loop until no match remains:
$start = '<span itemprop="name">';
$end = '</span>';
// Compare with !== false so a match at position 0 is not skipped.
while (strpos($content, $start) !== false) {
    $name = getbetween($content, $start, $end);
    $content = str_replace_once($start.$name.$end, '', $content);
    echo $name.'<br>';
}
Use this function:
function get_string_between($string, $start, $end){
    $string = ' ' . $string;
    $ini = strpos($string, $start);
    if ($ini == 0) return '';
    $ini += strlen($start);
    $len = strpos($string, $end, $ini) - $ini;
    return substr($string, $ini, $len);
}
$fullstring = 'this is my [tag]dog[/tag]';
$parsed = get_string_between($fullstring, '[tag]', '[/tag]');
echo $parsed; // (result = dog)
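get_string_between() returns only the first match, while the question asks for every name in an array. A minimal sketch of one way to extend the same idea, using the offset argument of strpos() to keep searching after each hit (allBetween is a made-up helper name, and the sample HTML is invented):

```php
<?php
// Hypothetical helper: every substring found between $start and $end, in order.
function allBetween(string $haystack, string $start, string $end): array
{
    $found = [];
    $offset = 0;
    while (($from = strpos($haystack, $start, $offset)) !== false) {
        $from += strlen($start);
        $to = strpos($haystack, $end, $from);
        if ($to === false) {
            break;   // opening marker without a closing one: stop searching
        }
        $found[] = substr($haystack, $from, $to - $from);
        $offset = $to + strlen($end);   // continue after this match
    }
    return $found;
}

// Invented sample in the shape described in the question.
$html = '<span itemprop="name">Alice</span><p>x</p><span itemprop="name">Bob</span>';
$names = allBetween($html, '<span itemprop="name">', '</span>');
print_r($names);
```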

Display most recent additions to XML file?

I'm trying to display the latest additions to this NVD XML file:
http://static.nvd.nist.gov/feeds/xml/cve/nvdcve-2.0-recent.xml
I can get all of them to list using the following code, but I'm only interested in displaying the most recent ten (from 2013 for the time being) and the XML file lists them in chronological order (starting in 2011).
<?php
$file= 'http://static.nvd.nist.gov/feeds/xml/cve/nvdcve-2.0-recent.xml';
$xml = file_get_contents($file);
$sxe = new SimpleXMLElement($xml);
$ns = $sxe->getNamespaces(true);
echo "<b>Latest Vulnerabilities:</b><p>";
foreach($sxe->entry as $entry)
{
    $vuln = $entry->children($ns['vuln']);
    $href = $vuln->references->reference->attributes()->href;
    echo "" . $vuln->{'cve-id'} . "<br>";
}
?>
Since $sxe->entry is a SimpleXMLElement rather than a plain PHP array, you cannot slice it with the array functions directly. Something like this should work for your needs:
$file= 'http://static.nvd.nist.gov/feeds/xml/cve/nvdcve-2.0-recent.xml';
$xml = file_get_contents($file);
$sxe = new SimpleXMLElement($xml);
$ns = $sxe->getNamespaces(true);
echo "<b>Latest Vulnerabilities:</b><p>";
$all = $sxe->entry;
$length = count($all);
// Start at the tenth-from-last entry (or at 0 if there are fewer than ten).
$offset_start = max(0, $length - 10);
for ($i = $offset_start; $i < $length; $i++)
{
    $entry = $all[$i];
    $vuln = $entry->children($ns['vuln']);
    $href = $vuln->references->reference->attributes()->href;
    echo "" . $vuln->{'cve-id'} . "<br>";
}
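Another option is to copy the IDs into a plain array first and then let array_slice() take the tail. The sketch below runs on a tiny made-up document rather than the real NVD feed; the namespace URI, the CVE numbers, and the function name latestCveIds are all placeholders.

```php
<?php
// Made-up, trimmed-down stand-in for the feed: entries in chronological order.
$xml = <<<XML
<nvd xmlns:vuln="http://example.org/vuln">
  <entry><vuln:cve-id>CVE-2011-0001</vuln:cve-id></entry>
  <entry><vuln:cve-id>CVE-2012-0002</vuln:cve-id></entry>
  <entry><vuln:cve-id>CVE-2013-0003</vuln:cve-id></entry>
  <entry><vuln:cve-id>CVE-2013-0004</vuln:cve-id></entry>
</nvd>
XML;

// Hypothetical helper: the last $limit CVE IDs, oldest first.
function latestCveIds(SimpleXMLElement $sxe, int $limit): array
{
    $ns = $sxe->getNamespaces(true);
    $ids = [];
    foreach ($sxe->entry as $entry) {
        // Cast to string so the array holds plain values, not SimpleXMLElement objects.
        $ids[] = (string) $entry->children($ns['vuln'])->{'cve-id'};
    }
    // A negative offset counts from the end of the array.
    return array_slice($ids, -$limit);
}

$latest = latestCveIds(new SimpleXMLElement($xml), 2);
print_r($latest);
```

Wrap the result in array_reverse() if the newest entry should be listed first.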
