get information from html table using Curl - php

i need to get some information about some plants and put it into mysql table.
My knowledge on Curl and DOM is quite null, but i've come to this:
set_time_limit(0);
include('simple_html_dom.php');
$ch = curl_init ("http://davesgarden.com/guides/pf/go/1501/");
curl_setopt($ch, CURLOPT_USERAGENT,"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.1) Gecko/2008070208 Firefox/3.0.1");
curl_setopt($ch, CURLOPT_HTTPHEADER, array("Accept-Language: es-es,en"));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_BINARYTRANSFER,1);
curl_setopt($ch, CURLOPT_TIMEOUT,0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
$data = curl_exec ($ch);
curl_close ($ch);
$html= str_get_html($data);
$e = $html->find("table", 8);
echo $e->innertext;
now, i'm really lost about how to move in from this point, can you please guide me?
Thanks!

This is a mess.
But at least it's a (somewhat) consistent mess.
If this is a one time extraction and not a rolling project, personally I'd use quick and dirty regex on this instead of simple_html_dom. You'll be there all day twiddling with the tags otherwise.
For example, this regex pulls out the majority of title/data pairs:
$pattern = "/<b>(.*?)</b>\s*<br>(.*?)</?(td|p)>/si";
You'll need to do some pre and post cleaning before it will get them all though.
I don't envy you having this task...

Your best bet will be to wrape this in php ;)
Yes, this is a ugly hack for a ugly html code.
<?php
ob_start();
system("
/usr/bin/env links -dump 'http://davesgarden.com/guides/pf/go/1501/' |
/usr/bin/env perl -lne 'm/((Family|Genus|Species):\s+\w+\s+\([\w-]+\))/ && \
print $1'
");
$out = ob_get_contents();
ob_end_clean();
print $out;
?>

Use Simple Html Dom and you would be able to access any element/element's content you wish. Their api is very straightforward.

you can try somthing like this.
<?php
$ch = curl_init ("http://www.digionline.ir/Allprovince/CategoryProducts/cat=10301");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$page = curl_exec($ch);
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($page);
libxml_clear_errors();
$xpath = new DOMXpath($dom);
$data = array();
// get all table rows and rows which are not headers
$table_rows = $xpath->query('//table[#id="tbl-all-product-view"]/tr[#class!="rowH"]');
foreach($table_rows as $row => $tr) {
foreach($tr->childNodes as $td) {
$data[$row][] = preg_replace('~[\r\n]+~', '', trim($td->nodeValue));
}
$data[$row] = array_values(array_filter($data[$row]));
}
echo '<pre>';
print_r($data);
?>

Related

PhP curl simple_dom_document request to get snow data from snowbird.com

Im using php, curl, and simple_dom_document to get snow data from snowbird.com. The problem is I cant seem to actually find the data I need. I am able to find the parent div and its name but I cant find the actually snow info div. Here is my code. Below my code ill past a small part of the output.
<?php
require('simple_html_dom.php');
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.snowbird.com/mountain-report/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
$content = curl_exec($ch);
curl_close($ch);
$html = new simple_html_dom();
$html->load($content);
$ret = $html->find('.horizSnowChartText');
$ret = serialize($ret);
$ret3 = new simple_html_dom();
$ret3->load($ret);
$es = $ret3->find('text');
$ret2 = $ret3->find('.total-inches');
print_r($ret2);
//print_r($es);
?>
And here is a picture of the output. You can see it skips the actual snow data and goes right to the inches mark ".
Do note that the html markup you're getting has multiple instances of .total-inches (multiple nodes with this class). If you want to explicitly get one, you can point to it directly using the second argument of ->find().
Example:
$ret2 = $html->find('.total-inches', 3);
// ^
If you want to check them all out, a simple foreach should suffice:
foreach($html->find('.current-conditions .snowfall-total .total-inches') as $in) {
echo $in , "\n";
}

DOM structure, get element by attribute name/value

I see a lot of answers on SO that pertain to the question but either there are slight differences that I couldn't overcome or maybe i just couldn't repeat the processes shown.
What I am trying to accomplish is to use CURL to get the HTML from a Google+ business page, iterate over the HTML and for each review of the business scrape the reviews HTML for display on that businesses non google+ webpage.
Every review shares this parent div structure:
<div class="ZWa nAa" guidedhelpid="userreviews"> .....
Thus i am trying to do a foreach loop based on finding and grabbing the div and innerhtml for each div with attribute: guidehelpid="userreviews"
I am succesfully getting the HTML back via CURL and can parse it when targeting a standard TAG name like "a" or if it had an ID, but iterating over the HTML using the PHP default parser when looking for a attribute name is problematic:
How can I take this successful code below and make it work like intended as shown in the second code which of course is wrong?
WORKING CODE (Finds,gets, echo's all "a" tags in $output)
$url = "https://plus.google.com/+Mcgowansac/about";
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$output = curl_exec($curl);
curl_close($curl);
$DOM = new DOMDocument;
#$DOM->loadHTML($output);
foreach($DOM->getElementsByTagName('a') as $link) {
# Show the <a href>
echo $link->getAttribute('href');
echo "<br />";}
THEORETICALLY NEEDED CODE: (Find every review by custom attribute in HTML and echo them)
$url = "https://plus.google.com/+Mcgowansac/about";
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$output = curl_exec($curl);
curl_close($curl);
$DOM = new DOMDocument;
#$DOM->loadHTML($output);
foreach($DOM->getElementsByTagName('div[guidehelpid=userreviews]') as $review) {
echo $review;
echo "<br />"; }
Any help i correcting this would be appreciated. I would prefer not to use "simple_html_dom" if I can accomplish this without it.
I suggest and you could use an DOMXpath in this case too. Example:
$url = "https://plus.google.com/+Mcgowansac/about";
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$output = curl_exec($curl);
curl_close($curl);
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($output);
libxml_clear_errors();
$xpath = new DOMXpath($dom);
$review = $xpath->query('//div[#guidedhelpid="userreviews"]');
if($review->length > 0) { // if it exists
echo $review->item(0)->nodeValue;
// echoes
// John DeRemer reviewed 3 months ago Last fall, we had a major issue with mold which required major ... and so on
}

<?php echo file_get_contents how to get content in a certain tag

<?php echo file_get_contents ("http://www.google.com/"); ?>
but I only want to get the contents of the tag in the url...how to do that...?
I need to echo the content between a tag....not the whole page
Refer this PHP manual and cURL which also help you.
You may also use user define function instead of file_get_contents():
function get_content($URL){
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $URL);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
echo get_content('http://example.com');
Hope, it will resolve your issue.
I think you want to extract content from a specific html tag in the file. For this you can use regular expressions. However view the following link to parse an HTML document file:
http://php.net/manual/en/class.domdocument.php
libxml_use_internal_errors(true);
$url = "http://stackoverflow.com/questions/15947331/php-echo-file-get-contents-how-to-get-content-in-a-certain-tag";
$dom = new DomDocument();
$dom->loadHTML(file_get_contents($url));
foreach($dom->getElementsByTagName('a') as $element) {
echo $element->nodeValue.'<br/>';
}
exit;
More info: http://www.php.net/manual/en/class.domdocument.php
There you can see how to select elements by id or class, how to get elements' attribute values etc.
Note: It's better to get content via cURL instead of get_file_contents. For example:
function file_get_contents_curl($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
Also note that on some websites you have to specify options like CURLOPT_USERAGENT etc., otherwise the content may not be returned.
Here are the other options: http://www.php.net/manual/en/function.curl-setopt.php

Simple html dom parser can`t parse all page

I need to get info from the center column of that site
(I need phone numbers exactly)
I`m using SimpleHTML dom parser, and was trying some curl method, but it always gives me html source without that central column !
I understood that using this code:
$html = file_get_html('http://vashmagazin.ua/cat/catalog/?rub=100&subrub=1');
$str = $html->Save();
echo $str;
I need to say can i do this or not today or i will loose this order.
Sorry for my bad english, thanks.
Pay attention on request headers and iconv for the charset conversion.
If you don't convert the string from windows-1251 in utf-8, preg_match will fail.
After conversion I used a simple regular expression to extract the phone numbers from the whole page.
<?php
$url = 'http://vashmagazin.ua/cat/catalog/?rub=100&subrub=1';
$ch = curl_init();
$request_headers = array
(
"Accept" => "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Charset" => "windows-1251,utf-8;q=0.7,*;q=0.3",
);
$header = array();
foreach ($request_headers as $key => $value)
$header[] = "{$key}: {$value}";
curl_setopt($ch, CURLOPT_HTTPGET, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.7');
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);
$html = iconv("windows-1251", "UTF-8", $html);
$matches = array();
$pattern = '/\([0-9]{3}\)[0-9]{3,}\-[0-9]+/us';
if (preg_match_all($pattern, $html, $matches))
{
var_dump($matches);
}
?>
The source code above is fully tested and fully working.
If you can't install the curl library try to replace the curl block with a file_get_contents($url).
To install curl on your operating system search on google, on Ubuntu use sudo apt-get install curl libcurl3 php5-curl and restart apache.

CURL to grab an XML file associated with this URL

I am trying to use CURL to grab an XML file associated with this URL, then i am trying to parse the xml file using DOMxPath.
There are no output errors at this point it is just not displaying anything, i tried to catch some errors but i was unable to figure it out, any direction would be amazing.
<?php
if (!function_exists('curl_init')){
die('Sorry cURL is not installed!');
}
function tideTime() {
$ch = curl_init("http://tidesandcurrents.noaa.gov/noaatidepredictions/NOAATidesFacade.jsp?datatype=XML&Stationid=8721138");
$fp = fopen("8721138.xml", "w");
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_exec($ch);
curl_close($ch);
fclose($fp);
$dom = new DOMDocument();
#$dom->loadHTML($ch);
$domx = new DOMXPath($dom);
$entries = $domx->evaluate("//time");
$arr = array();
foreach ($entries as $entry) {
$tide = $entry->nodeValue;
}
echo $tide;
}
?>
Youre trying to load the curl resource handle as the DOM which it is not. the curl functions either output directly or output to string.
$ch = curl_init("http://tidesandcurrents.noaa.gov/noaatidepredictions/NOAATidesFacade.jsp?datatype=XML&Stationid=8721138");
$fp = fopen("8721138.xml", "w");
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 0);
$data = curl_exec($ch);
curl_close($ch);
fclose($fp);
$dom = new DomDocument();
$dom->loadHTML($data);
// the rest of the code
it seems you try to catch some unavailable xpath, make sure you have ("//time"); in the xml file, are you sure that you grab is a xml file ? or you just put into xml ?
if we look at that page, it seems xml generated by javascript, look at the http://tidesandcurrents.noaa.gov/noaatidepredictions/NOAATidesFacade.jsp?datatype=XML&Stationid=8721138&text=datafiles%2F8721138%2F09122011%2F877%2F&imagename=images/8721138/09122011/877/8721138_2011-12-10.gif&bdate=20111209&timelength=daily&timeZone=2&dataUnits=1&interval=&edate=20111210&StationName=Ponce Inlet, Halifax River&Stationid_=8721138&state=FL&primary=Subordinate&datum=MLLW&timeUnits=2&ReferenceStationName=GOVERNMENT CUT, MIAMI HARBOR ENTRANCE&HeightOffsetLow=*1.00&HeightOffsetHigh=* 1.18&TimeOffsetLow=33&TimeOffsetHigh=5&pageview=dayly&print_download=true&Threshold=&thresholdvalue=
may be you can grab that

Categories