Scrape a statistic from YouTube using PHP - php

After struggling for 3 hours at trying to do this on my own, I have decided that it is either not possible or not possible for me to do on my own. My question is as follows:
How can I scrape the numbers in the attached image using PHP to echo them in a webpage?
Image URL: http://gyazo.com/6ee1784a87dcdfb8cdf37e753d82411c
Please help. I have tried almost everything, from using cURL, to using a regex, to trying an xPath. Nothing has worked the right way.
I only want the numbers by themselves in order for them to be isolated, assigned to a variable, and then echoed elsewhere on the page.
Update:
http://youtube.com/exonianetwork - The URL I am trying to scrape.
/html/body[@class='date-20121213 en_US ltr ytg-old-clearfix guide-feed-v2 site-left-aligned exp-new-site-width exp-watch7-comment-ui webkit webkit-537']/div[@id='body-container']/div[@id='page-container']/div[@id='page']/div[@id='content']/div[@id='branded-page-default-bg']/div[@id='branded-page-body-container']/div[@id='branded-page-body']/div[@class='channel-tab-content channel-layout-two-column selected blogger-template ']/div[@class='tab-content-body']/div[@class='secondary-pane']/div[@class='user-profile channel-module yt-uix-c3-module-container ']/div[@class='module-view profile-view-module']/ul[@class='section'][1]/li[@class='user-profile-item '][1]/span[@class='value']
The xPath I tried, which didn't work for some unknown reason. No exceptions or errors were thrown, and nothing was displayed.

Perhaps a simple XPath would be easier to manipulate and debug.
Here's a Short Self-Contained Correct Example (watch for the space at the end of the class name):
#!/usr/bin/env php
<?php
$url = "http://youtube.com/exonianetwork";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
curl_close($ch);
if (!$html)
{
    print "Failed to fetch page. Error handling goes here\n";
    exit(1); // nothing to parse, so stop here
}
$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses warnings about malformed markup
$xpath = new DOMXPath($dom);
// Note the trailing space inside 'user-profile-item '
$profile_items = $xpath->query("//li[@class='user-profile-item ']/span[@class='value']");
if ($profile_items->length === 0) {
    print "No values found\n";
} else {
    foreach ($profile_items as $profile_item) {
        printf("%s\n", $profile_item->textContent);
    }
}
?>
Execute:
% ./scrape.php
57
3,593
10,659,716
113,900
United Kingdom

If you are willing to try a regex again, this pattern should work:
!Network Videos:</span>\r\n +<span class=\"value\">([\d,]+).+Views:</span>\r\n +<span class=\"value\">([\d,]+).+Subscribers:</span>\r\n +<span class=\"value\">([\d,]+)!s
It captures the numbers with their embedded commas, which would then need to be stripped out. I'm not familiar with PHP, so I cannot give you more complete code.
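As a rough sketch of the comma-stripping step mentioned above: assuming the captured strings look like the sample output earlier in the thread (for example 10,659,716), removing the commas and casting gives a plain integer. The helper name toInt is made up for illustration and is not part of either answer.

```php
<?php
// Hypothetical helper: turn a captured string such as "10,659,716" into an integer.
function toInt(string $captured): int
{
    // Remove the thousands separators, then cast what is left.
    return (int) str_replace(',', '', trim($captured));
}

// Sample values copied from the output shown earlier in this thread.
$captured = ['57', '3,593', '10,659,716'];
$numbers  = array_map('toInt', $captured);

foreach ($numbers as $number) {
    echo $number, "\n";
}
```

Each value can then be assigned to its own variable and echoed elsewhere on the page.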

Related

PHP cURL web-scraper intermittently returns error "Recv failure: Connection was reset"

I've programmed a very basic web-scraping tool in PHP using cURL and DOM. I'm running it locally on a Windows 10 box using XAMPP (Apache & MySQL). It scrapes approximately 5 values on 400 pages (~2,000 values in total) on one specific website. The job typically completes in < 120 seconds, but intermittently (about once every 5 runs) it'll stop around the 60 second mark with the following error:
Recv failure: Connection was reset
Probably irrelevant, but all of my scraped data is being thrown into a MySQL table, and a separate .php file is styling the data and presenting it. This part is working fine. The error is being thrown by cURL. Here's my (very trimmed) code:
$html = file_get_html('http://IPAddressOfSiteIAmScraping/subpage/listofitems.html');
//Some code that creates my SQL table.
//Finds all subpages on the site - this part works like a charm.
foreach($html->find('a[href^=/subpage/]') as $uniqueItems){
//3 array variables defined here, which I didn't include in this example.
$path = $uniqueItems->href;
$url = 'http://IPAddressOfSiteIAmScraping' . $path;
//Here's the cURL part - I suspect this is the problem. I am an amateur!
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_URL, trim($url));
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0); //An attempt to fix it - didn't work.
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0); //An attempt to fix it - didn't work.
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 0);
curl_setopt($curl, CURLOPT_TIMEOUT, 1200); //Amount of time I let cURL execute for.
$page = curl_exec($curl);
//This is the part that throws up the connection reset error.
if(curl_errno($curl)) {
echo 'Scraping error: ' . curl_error($curl);
exit; }
curl_close($curl);
//Here we use DOM to begin collecting specific cURLed values we want in our SQL table.
$dom = new DOMDocument;
$dom->encoding = 'utf-8'; //Allows the DOM to display html entities for special characters like ö.
@$dom->loadHTML(utf8_decode($page)); //Loads the HTML of the cURLed page.
$xpath = new DOMXpath($dom); //Allows us to use Xpath values.
//Xpaths that I've set - this is for the SQL part. Probably irrelevant.
$header = $xpath->query('(//div[@id="wrapper"]//p)[@class="header"][1]');
$price = $xpath->query('//tr[@class="price_tr"]/td[2]');
$currency = $xpath->query('//tr[@class="price_tr"]/td[3]');
$league = $xpath->query('//td[@class="left-column"]/p[1]');
//Here we collect specifically the item name from the DOM.
foreach($header as $e) {
$temp = new DOMDocument();
$temp->appendChild($temp->importNode($e,TRUE));
$val = $temp->saveHTML();
$val = strip_tags($val); //Removes the <p> tag from the data that goes into SQL.
$val = mb_convert_encoding($val, 'html-entities', 'utf-8'); //Allows the HTML entity for special characters to be handled.
$val = html_entity_decode($val); //Converts HTML entities for special characters to the actual character value.
$final = mysqli_real_escape_string($conn, trim($val)); //Defense against SQL injection attacks by canceling out single apostrophes in item names.
$item['title'] = $final; //Here's the item name, ready for the SQL table.
}
//Here's a bunch of code where I write to my SQL table. Again, this part works great!
}
I am not opposed to switching to regex if I need to ditch DOM, but I did three days worth of lurking before I chose DOM over regex. I have spent a lot of time researching this problem, but everything I'm seeing says "Recv failure: Connection was reset by peer", which is not what I am getting. I'm really frustrated that I have to ask for help - I've been doing so great so far - just learning as I go. This is the first thing I've ever written in PHP.
TL;DR: I wrote a cURL web-scraper that works brilliantly only 80% of the time. 20% of the time, for an unknown reason, it errors out with "Recv failure: Connection was reset".
Hopefully someone can help me!! :) Thanks for reading even if you can't!
P.S. if you'd like to see my FULL code, it's at: http://pastebin.com/vf4s0d5L.
After researching this at length (I'd already been researching it for days before posting my question), I've caved in and accepted that this error is probably tied to the site I'm trying to scrape and therefore out of my control.
I did manage to work around it though, so I'll drop my workaround here...
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_URL, trim($url));
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 0);
curl_setopt($curl, CURLOPT_TIMEOUT, 1200); //Amount of time I let cURL execute for.
$page = curl_exec($curl);
if(curl_errno($curl)) {
    echo 'Scraping error: ' . curl_error($curl) . '<br>';
    echo 'Dropping table...<br>';
    $sql = "DROP TABLE table_item_info";
    if (!mysqli_query($conn, $sql)) {
        echo "Could not drop table: " . mysqli_error($conn);
    }
    mysqli_close($conn);
    echo "TABLE has been dropped. Restarting.<br>";
    goto start; //Jumps back to the "start:" label at the top of the file.
}
curl_close($curl);
Basically, what I've done is implemented error-checking. If the error comes up under curl_errno($curl), I assume it's the connection reset error. That being the case, I drop my SQL table and then jump back to the start of my script using "goto start". Then, at the top of my file I have "start:"
This fixed my problem! Now I don't need to worry about whether the connection was reset or not. My code is smart enough to determine that on its own and reset the script if that was the case.
Hope this helps!
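An alternative to restarting the whole script is to retry only the request that failed. The following is a minimal sketch of that idea, not the approach used above: the names retry and fetchPage, the attempt count, the delay, and the timeout values are all assumptions chosen for illustration.

```php
<?php
/**
 * Hypothetical helper: call $operation until it succeeds or $maxAttempts is reached.
 */
function retry(callable $operation, int $maxAttempts = 3, int $delayMs = 500)
{
    for ($attempt = 1; ; $attempt++) {
        try {
            return $operation();
        } catch (RuntimeException $e) {
            if ($attempt >= $maxAttempts) {
                throw $e; // give up and let the caller decide what to do
            }
            usleep($delayMs * 1000); // brief pause before trying again
        }
    }
}

/**
 * Fetch one page with cURL, turning any cURL error into an exception.
 */
function fetchPage(string $url): string
{
    $curl = curl_init(trim($url));
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10); // assumed value
    curl_setopt($curl, CURLOPT_TIMEOUT, 30);        // assumed value
    $page  = curl_exec($curl);
    $error = curl_errno($curl) ? curl_error($curl) : null;
    curl_close($curl);
    if ($page === false || $error !== null) {
        throw new RuntimeException('Scraping error: ' . $error);
    }
    return $page;
}

// Usage inside the scraping loop ($url as built in the question):
// $page = retry(function () use ($url) { return fetchPage($url); });
```

With this shape, a reset connection costs one repeated request instead of a dropped table and a full rerun. Whether a short pause is enough depends on why the remote site resets the connection.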

PHP file_get_contents error, wouldn't populate from an array?

I've been trying to write a simple script in PHP to pull data from an ISBN database site, and for some reason I've had nothing but issues using file_get_contents. I've managed to get something working now, but I would like to know why this wasn't working.
The code below would not populate $page with any data, so the preg_match calls that follow failed to find anything. If anyone knows what was stopping this, that would be great.
$links = array ('
http://www.isbndb.com/book/2009_cfa_exam_level_2_schweser_practice_exams_volume_2','
http://www.isbndb.com/book/uniform_investment_adviser_law_exam_series_65','
http://www.isbndb.com/book/waterworks_a02','
http://www.isbndb.com/book/winning_the_toughest_customer_the_essential_guide_to_selling','
http://www.isbndb.com/book/yale_daily_news_guide_to_fellowships_and_grants'
); // array of URLs
foreach ($links as $link)
{
$page = file_get_contents($link);
#print $page;
preg_match("#<h1 itemprop='name'>(.*?)</h1>#is",$page,$title);
preg_match("#<a itemprop='publisher' href='http://isbndb.com/publisher/(.*?)'>(.*?)</a>#is",$page,$publisher);
preg_match("#<span>ISBN10: <span itemprop='isbn'>(.*?)</span>#is",$page,$isbn10);
preg_match("#<span>ISBN13: <span itemprop='isbn'>(.*?)</span>#is",$page,$isbn13);
echo '<tr>
<td>'.$title[1].'</td>
<td>'.$publisher[2].'</td>
<td>'.$isbn10[1].'</td>
<td>'.$isbn13[1].'</td>
</tr>';
#exit();
}
My guess is that your URLs are wrong (not direct). The proper ones have no www. part: if you request any of them and inspect the returned headers, you'll see that you're redirected (HTTP 301) to another URL.
The best way to handle this, in my opinion, is to use cURL and set the CURLOPT_FOLLOWLOCATION and CURLOPT_MAXREDIRS options with curl_setopt.
Of course, you should trim your URLs beforehand, just to be sure that's not the problem.
Example here:
$curl = curl_init();
foreach ($links as $link) {
curl_setopt($curl, CURLOPT_URL, $link);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($curl, CURLOPT_MAXREDIRS, 5); // max 5 redirects
$result = curl_exec($curl);
if (! $result) {
continue; // if $result is empty or false - ignore and continue;
}
// do what you need to do here
}
curl_close($curl);
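On the trimming advice: as the array is written in the question, each string opens with a quote followed by a line break, so every URL would begin with a newline character. A small sketch of cleaning the list before the loop, using two of the question's URLs (the variable name $cleanLinks is made up here):

```php
<?php
// Two entries written the way the question writes them: the quote opens,
// then a line break, then the URL, so each string starts with whitespace.
$links = array ('
http://www.isbndb.com/book/waterworks_a02','
http://www.isbndb.com/book/yale_daily_news_guide_to_fellowships_and_grants'
);

// Strip leading and trailing whitespace from every entry.
$cleanLinks = array_map('trim', $links);

foreach ($cleanLinks as $link) {
    echo $link, "\n";
}
```

The loop in the example above could then iterate over $cleanLinks instead of $links.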

PHP search for website with specific words

I'm trying to monitor a new products page of a website with specific words. I already have a basic script that searches for a single word using file_get_contents(); however this is not effective.
Looking at the code they are in <td> tags within a <table>
How do I get PHP to search for the words no matter what order and get declaration they are in? e.g.
$searchTerm = "Orange Boots";
from:
<table>
<td>Boots (RED)</td>
</table>
<table>
<td>boots (ORANGE)</td>
</table>
<table>
<td>Shirt (GREEN)</td>
</table>
Returns a match.
Sorry if it's not clear, but I hope you understand.
You can do it like this:
$newcontent = str_replace('Boots', '<span class="Red">Boots</span>', $cont);
Then write CSS for the Red class to show the color you want (color: red;), and do the same for the rest.
A better approach, though, would be DOM and XPath.
If you're looking to make a quick and dirty search over that HTML block, you can try a simple regular expression with the preg_match_all() function. For example, you can try:
$html_block = file_get_contents(...);
$matches_found = preg_match_all('/(orange|boots|shirt)/i', $html_block, $matches);
$matches_found holds the number of full pattern matches (0 if there were none, or false on error). $matches is populated with the matched strings.
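The alternation above matches if any one of the words is present. If every word of the search term has to appear in the same cell, in any order and ignoring case, something along these lines may be closer to what the question asks. This is a sketch only: the helper name containsAllWords is invented, the cell texts are held inline rather than fetched, and stripos does substring matching, so "boots" would also match "bootstrap".

```php
<?php
// Hypothetical helper: true when every word of $searchTerm occurs in $text,
// in any order and ignoring case.
function containsAllWords(string $text, string $searchTerm): bool
{
    $words = preg_split('/\s+/', trim($searchTerm), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        if (stripos($text, $word) === false) {
            return false; // one missing word is enough to fail
        }
    }
    return count($words) > 0; // an empty search term matches nothing
}

// The three cells from the question, held inline instead of fetched.
$cells = ['Boots (RED)', 'boots (ORANGE)', 'Shirt (GREEN)'];
$searchTerm = 'Orange Boots';

$hits = array_values(array_filter($cells, function ($cell) use ($searchTerm) {
    return containsAllWords($cell, $searchTerm);
}));

print_r($hits);
```

In a real script the cell texts would come from the fetched page, for instance by querying the td elements with DOM and XPath as suggested above.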
Use cURL. It's much faster than file_get_contents(). Here's a starting point:
$target_url="http://www.w3schools.com/htmldom/dom_nodes.asp";
// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html= curl_exec($ch);
if (!$html) {exit;}
$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses warnings about malformed markup
$query = "(/html/body//tr)"; //this is where the search takes place
$xpath = new DOMXPath($dom);
$result = $xpath->query($query);
for ($i = 0; $i < $result->length; $i++) {
    $node = $result->item($i); // use the loop index, not always item 0
    echo "{$node->nodeName} - {$node->nodeValue}<br />";
}

Parsing XML with PHP?

This has been driving me insane for about the last hour. I'm trying to parse a bit of XML out of Last.fm's API, I've used about 35 different permutations of the code below, all of which have failed. I'm really bad at XML parsing, lol. Can anyone help me parse the first toptags>tag>name 'name' from this XML API in PHP? :(
http://ws.audioscrobbler.com/2.0/?method=track.getinfo&api_key=b25b959554ed76058ac220b7b2e0a026&artist=Owl+city&track=fireflies
Which in that case ^ would be 'electronic'
Right now, all I have is this
<?
$xmlstr = file_get_contents("http://ws.audioscrobbler.com/2.0/?method=track.getinfo&api_key=b25b959554ed76058ac220b7b2e0a026&artist=Owl+city&track=fireflies");
$genre = new SimpleXMLElement($xmlstr);
echo $genre->lfm->track->toptags->tag->name;
?>
Which returns with, blank. No errors either, which is what's incredibly annoying!
Thank You very Much :) :) :)
Any help greatly, and by greatly I mean really, really greatly appreciated! :)
The <tag> tag is an array, so you should loop through them with a foreach or similar construct. In your case, just grabbing the first would look like this:
<?php
$xmlstr = file_get_contents("http://ws.audioscrobbler.com/2.0/?method=track.getinfo&api_key=b25b959554ed76058ac220b7b2e0a026&artist=Owl+city&track=fireflies");
$genre = new SimpleXMLElement($xmlstr);
echo $genre->track->toptags->tag[0]->name;
Also note that the <lfm> tag is not needed.
UPDATE
I find it's much easier to grab exactly what I'm looking for in a SimpleXMLElement by using print_r(). It'll show you what's an array, what's a simple string, what's another SimpleXMLElement, etc.
Try using
$url = "http://ws.audioscrobbler.com/2.0/?method=track.getinfo&api_key=b25b959554ed76058ac220b7b2e0a026&artist=Owl+city&track=fireflies";
$xml = simplexml_load_file($url);
echo $xml->track->toptags->tag[0]->name;
Suggestion: insert a statement to echo $xmlstr, and make sure you are getting something back from the API.
You don't need to reference lfm. Actually, $genre already is lfm. Try this:
echo $genre->track->toptags->tag->name;
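To illustrate the point made in the answers above (the SimpleXMLElement already is the root lfm element, and tag repeats), here is a small offline sketch. The XML string is a trimmed-down, made-up stand-in for the API response; only the first tag name, electronic, comes from the question.

```php
<?php
// Made-up response in the same general shape as the API output described above.
$xmlstr = <<<XML
<lfm status="ok">
  <track>
    <name>Fireflies</name>
    <toptags>
      <tag><name>electronic</name></tag>
      <tag><name>pop</name></tag>
    </toptags>
  </track>
</lfm>
XML;

// The SimpleXMLElement *is* the root <lfm> element, so paths start at <track>.
$lfm = new SimpleXMLElement($xmlstr);

// First tag name, cast to a plain string.
$firstTag = (string) $lfm->track->toptags->tag[0]->name;
echo $firstTag, "\n";

// All tag names, by looping over the repeated <tag> elements.
$allTags = [];
foreach ($lfm->track->toptags->tag as $tag) {
    $allTags[] = (string) $tag->name;
}
print_r($allTags);
```

Casting to string is worth doing before storing or comparing the value, since the property access returns another SimpleXMLElement and not a string.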
If you want to read XML data, please follow these steps:
$xmlURL = "your xml url / file name goes here";
try {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $xmlURL);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
'Content-type: text/xml'
));
$content = curl_exec($ch);
$error = curl_error($ch);
curl_close($ch);
$obj = new SimpleXMLElement($content);
echo "<pre>";
var_dump($obj);
echo "</pre>";
}
catch(Exception $e){
var_dump($e);exit;
}
You will get a dump of the whole XML file as a SimpleXMLElement object.
Thanks.

Get div and the correct close tag preg

preg has always been a tool I like, but I can't figure out for the life of me whether what I want is possible, let alone how to do it.
What I want is for preg_match to return a div's innerHTML. The problem is that the div I'm trying to read has more divs inside it, and my pattern keeps closing on the first closing tag it finds.
Here is my actual code:
$scrape_address = "http://isohunt.com/torrent_details/133831593/98e034bd6382e0f4ecaa9fe2b5eac01614edc3c6?tab=summary";
$ch = curl_init($scrape_address);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, '1');
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, "");
$data = curl_exec($ch);
preg_match('% <div id="torrent_details">(.*)</div> %six', $data, $match);
print_r($match);
This has been updated with TomcatExodus's help.
Live at :: http://megatorrentz.com/beta/details.php?hash=98e034bd6382e0f4ecaa9fe2b5eac01614edc3c6
<?php
$scrape_address = "http://isohunt.com/torrent_details/133831593/98e034bd6382e0f4ecaa9fe2b5eac01614edc3c6?tab=summary";
$ch = curl_init($scrape_address);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, '1');
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, "");
$data = curl_exec($ch);
$domd = new DOMDocument();
libxml_use_internal_errors(true);
$domd->loadHTML($data);
libxml_use_internal_errors(false);
$div = $domd->getElementById("torrent_details");
if ($div) {
$dom2 = new DOMDocument();
$dom2->appendChild($dom2->importNode($div, true));
echo $dom2->saveHTML();
} else {
echo "Has no element with the given ID\n";
}
Using regular expressions often leads to problems when parsing markup documents.
XPath version - independent of the source layout. The only thing you need is a div with that id.
<?php
$dom = new DOMDocument();
@$dom->loadHTMLFile($url); // $url is the page address ($scrape_address in the question)
$xp = new domxpath($dom);
$result = $xp->query("//*[@id = 'torrent_details']");
$div=$result->item(0);
if($result->length){
$out =new DOMDocument();
$out->appendChild($out->importNode($div, true));
echo $out->saveHTML();
}else{
echo "No such id";
}
?>
And this is the fix for Maerlyn's solution. It didn't work because getElementById() wants a DTD with the id attribute specified. I mean, you can always build a document with "apple" as the record id, so you need something that says "id" is really the id for this tag.
<?php
$domd = new DOMDocument();
$domd->validateOnParse = true;
@$domd->loadHTML($data); // $data is the cURL result from the question
//this doesn't work as the DTD is not specified
//or the specified id attribute is not the attributed called "id"
//$div = $domd->getElementById("torrent_details");
/*
* workaround found here: https://fosswiki.liip.ch/display/BLOG/GetElementById+Pitfalls
* set the "id" attribute as the real id
*/
$elements = $domd->getElementsByTagName('div');
if (!is_null($elements)) {
foreach ($elements as $element) {
//try-catch needed because of elements with no id
try{
$element->setIdAttribute('id', true);
}catch(Exception $e){}
}
}
//now it works
$div = $domd->getElementById("torrent_details");
//Print its content or error
if ($div) {
$dom2 = new DOMDocument();
$dom2->appendChild($dom2->importNode($div, true));
echo $dom2->saveHTML();
} else {
echo "Has no element with the given ID\n";
}
?>
Both of the solutions work for me.
You can do this:
/<div[^>]*>(.*)<\/div>/i
Which would give you the largest possible innerHTML.
You cannot. I will not link to the famous question, because I dislike the pointless drivel on top. But still regular expressions are unfit to match nested structures.
You can use some trickery, but this is neither reliable, nor necessarily fast:
preg_match_all('#<div id="1">((<div>.*?</div>|.)*?)</div>#ims'
Your regex had a problem due to the /x flag not matching the opening div. And you used a wrong assertion notation.
preg_match_all('% <div \s+ id="torrent_details">(?<innerHtml>.*)</div> %six', $html, $match);
echo $match['innerHtml'];
That one will work, but you should only need preg_match, not preg_match_all: if the pages are written well, there should be only one instance of id="torrent_details" on the given page.
I'm retracting my answer. This will not work properly. Use DOM for navigating the document.
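For reference, a DOM-based way to get the inner HTML of a div that contains nested divs could look like the sketch below. The markup in $data is invented to stand in for the fetched page, and the XPath lookup on @id is used so that the result does not depend on getElementById() recognising the id attribute.

```php
<?php
// Made-up markup standing in for the fetched page: the target div contains nested divs.
$data = '<html><body><div id="other">skip</div>'
      . '<div id="torrent_details"><div class="a">one</div><div class="b">two <b>bold</b></div></div>'
      . '<div id="footer">skip</div></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$doc->loadHTML($data);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$div = $xpath->query("//div[@id='torrent_details']")->item(0);

$innerHtml = '';
if ($div !== null) {
    // innerHTML = the serialized children, without the wrapping <div> itself.
    foreach ($div->childNodes as $child) {
        $innerHtml .= $doc->saveHTML($child);
    }
}
echo $innerHtml, "\n";
```

Because the parser tracks nesting itself, the inner closing tags are never mistaken for the end of the outer div.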
Haha, did it with a bit of tampering. Thanks for the DOMDocument idea; I just used SimpleXML:
$ch = curl_init($scrape_address);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, '1');
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, "");
$data = curl_exec($ch);
$doc = new DOMDocument();
libxml_use_internal_errors(false);
$doc->strictErrorChecking = FALSE;
libxml_use_internal_errors(true);
$doc->loadHTML($data);
$xml = simplexml_import_dom($doc);
print_r($xml->body->table->tr->td->table[2]->tr->td[0]->span[0]->div);
