Programmatically fetching definition of a word - php

I am writing a social app where people will use tags to organize their articles. These tags are shared across the site, and each tag needs a description.
Is there any way to fetch that description programmatically from a resource like Wikipedia (say, the first paragraph of the article)?
The tags will typically be associated with brands, products, and services.

Yes, you can:
<?php
$contents = file_get_contents("http://en.wikipedia.org/wiki/PHP");
preg_match("/<p>(.*?)<\/p>/", $contents, $match);
echo $match[1];
?>
http://sandbox.phpcode.eu/g/45c56.php
EDIT: It looks like Wikipedia rejects requests that don't send a recognized browser user agent, so you'll have to do it with curl.
EDIT2: curl with a browser user agent:
<?php
$ch = curl_init("http://en.wikipedia.org/wiki/PHP");
$useragent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1";
curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$contents = curl_exec($ch);
preg_match("/<p>(.*?)<\/p>/", $contents, $match);
// Strip the HTML tags, then remove citation markers such as [1] or [12].
$match[1] = preg_replace("|\[[0-9]+\]|", "", strip_tags($match[1]));
echo $match[1];
?>
http://sandbox.phpcode.eu/g/ad578.php
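The regex in the snippets above only matches a bare <p> tag on a single line, so it can miss paragraphs that carry attributes or span several lines. The following is a minimal sketch of a more tolerant version, not a definitive implementation. The function name firstParagraph and the sample markup are hypothetical, and it assumes the page has already been downloaded into a string (for example with the curl code above).

```php
<?php
// Return the text of the first non-empty <p> element, or null if there is none.
// firstParagraph() is a hypothetical helper name used for illustration only.
function firstParagraph(string $html): ?string
{
    // "s" lets "." match newlines, "i" ignores tag case, [^>]* allows attributes.
    if (!preg_match_all('/<p\b[^>]*>(.*?)<\/p>/is', $html, $matches)) {
        return null;
    }
    foreach ($matches[1] as $inner) {
        // Drop nested tags and decode entities such as &amp;.
        $text = html_entity_decode(strip_tags($inner), ENT_QUOTES, 'UTF-8');
        // Remove citation markers such as [1] or [12].
        $text = trim(preg_replace('/\[\d+\]/', '', $text));
        if ($text !== '') {
            return $text;
        }
    }
    return null;
}

// Hypothetical sample markup: an empty paragraph followed by the real one.
$sample = '<p class="empty"></p>' . "\n"
        . '<p><b>PHP</b> is a scripting' . "\n" . 'language.[1][12]</p>';
var_dump(firstParagraph($sample));
```

Skipping empty paragraphs is a design choice: pages sometimes start with a placeholder <p> that contains no text.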

Related

Scrape site using Curl returning blank results

I'm trying to run a search on Amazon with a random keyword and then scrape roughly the first 10 results. The issue is that when I print the HTML result I get nothing; it's just blank. My code looks OK to me, and I have used curl in the past without ever coming across this. My code:
<?php
include_once("classes/simple_html_dom.php");
function get_random_keyword() {
$f_contents = file("keywords.txt");
return $f_contents[rand(0, count($f_contents) - 1)];
}
function getHtml($page) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $page);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5');
$html = curl_exec($ch);
print "html -> " . $html;
curl_close($ch);
return $html;
}
$html = getHtml("https://www.amazon.co.uk/s?k=" . get_random_keyword());
?>
Ideally I would have preferred to use the API, but from what I understand you need 3 sales before you are granted access. Can anyone see any issues? I'm not sure what else to check. Any help is appreciated.
Amazon is returning the response encoded in gzip. You need to decode it:
$html = getHtml("https://www.amazon.co.uk/s?k=" . get_random_keyword());
echo gzdecode($html);
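As a hedged sketch of the same idea: gzdecode() fails on a body that is not actually gzip-compressed, so it can help to check for the gzip magic bytes first. Alternatively, setting CURLOPT_ENCODING to an empty string asks curl to advertise every encoding it supports and to decode the response itself. The names decodeIfGzipped and getHtmlDecoded are hypothetical, and the code assumes the zlib and curl extensions are loaded.

```php
<?php
// Decode $body only if it starts with the gzip magic bytes 0x1f 0x8b.
// decodeIfGzipped() is a hypothetical helper name.
function decodeIfGzipped(string $body): string
{
    if (strncmp($body, "\x1f\x8b", 2) === 0) {
        $decoded = gzdecode($body);
        // Fall back to the raw body if decoding fails.
        return $decoded === false ? $body : $decoded;
    }
    return $body;
}

// Variant of getHtml() that lets curl handle the decompression itself.
function getHtmlDecoded(string $page): string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $page);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Empty string: send Accept-Encoding for all supported encodings and decode the reply.
    curl_setopt($ch, CURLOPT_ENCODING, '');
    $html = curl_exec($ch);
    return $html === false ? '' : $html;
}

// Local round trip, no network access needed.
echo decodeIfGzipped(gzencode('<html>ok</html>')), "\n";
echo decodeIfGzipped('already plain text'), "\n";
```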

file_get_html(); not working with Teleduino links

I am making a home automation project with Arduino, and I am using Teleduino to remotely control an LED as a test. I want to take the contents of this link and display them in a PHP page.
<!DOCTYPE html>
<html>
<body>
<?php
include 'simple_html_dom.php';
echo file_get_html('http://us01.proxy.teleduino.org/api/1.0/2560.php?k=202A57E66167ADBDC55A931D3144BE37&r=definePinMode&pin=7&mode=1');
?>
</body>
The problem is that the function does not return anything.
Is something wrong with my code?
Is there any other function I can use to send a request to a page and get that page in return?
I think you should use file_get_contents instead, but the server appears to be protecting its data from scraping, so curl is a better solution:
<?php
// echo file_get_contents('http://us01.proxy.teleduino.org/api/1.0/2560.php?k=202A57E66167ADBDC55A931D3144BE37&r=definePinMode&pin=7&mode=1');
// create curl resource
$ch = curl_init();
// set url
curl_setopt($ch, CURLOPT_URL, "http://us01.proxy.teleduino.org/api/1.0/2560.php?k=202A57E66167ADBDC55A931D3144BE37&r=definePinMode&pin=7&mode=1");
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
// $output contains the output string
$output = curl_exec($ch);
echo $output;
// close curl resource to free up system resources
curl_close($ch);
?>
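When a request "does not return anything", it usually helps to look at what curl itself reports before changing the parsing code. The sketch below is only an illustration under stated assumptions: fetchUrl is a hypothetical helper name, and the timeout values are arbitrary. It returns the body together with the HTTP status code and curl's error message so that an empty result can be diagnosed.

```php
<?php
// Fetch a URL and report what happened instead of failing silently.
// fetchUrl() is a hypothetical helper; the timeouts are arbitrary choices.
function fetchUrl(string $url): array
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    $body = curl_exec($ch);

    return [
        // curl_exec() returns false on failure; normalize that to null.
        'body'   => $body === false ? null : $body,
        // 0 means no HTTP response was received at all.
        'status' => curl_getinfo($ch, CURLINFO_HTTP_CODE),
        // Empty string when no error occurred.
        'error'  => curl_error($ch),
    ];
}
```

A call such as $result = fetchUrl($url); followed by a check of $result['error'] and $result['status'] shows whether the request failed at the network level, was rejected with an HTTP error code, or simply returned an empty body.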

copy particular div from Flipkart.com web scraping using Curl and Php

I want to copy the data contained in a particular element of a Flipkart product page and display it.
<table cellspacing="0" class="specTable">
///// contains /////
</table>
The number of tables varies: some pages have 10 tables with the same class, and some have more. How can I get the values from all of these tables?
I also want to get a specific specsValue. Is that possible as well?
<td class="specsKey">Brand</td><td class="specsValue">Apple</td>
Web page address: http://www.flipkart.com/apple-iphone-6/p/itme8ra5z7yx5c9j?pid=MOBEYHZ2JHVFHFBG
Sample code:
<?php
$url = "http://dl.flipkart.com/dl/apple-iphone-6/p/itme8ra5z7yx5c9j?pid=MOBEYHZ2JHVFHFBG";
$response = getPriceFromFlipkart($url);
echo json_encode($response);
/* Returns the response in JSON format */
function getPriceFromFlipkart($url) {
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 10.10; labnol;) ctrlq.org");
curl_setopt($curl, CURLOPT_FAILONERROR, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($curl);
curl_close($curl);
$regex = '/<meta itemprop="price" content="([^"]*)"/';
preg_match($regex, $html, $price);
$regex = '/<h1[^>]*>([^<]*)<\/h1>/';
preg_match($regex, $html, $title);
$regex = '/data-src="([^"]*)"/i';
preg_match($regex, $html, $image);
if ($price && $title && $image) {
$response = array("price" => $price[1], "title" => $title[1], "image" => $image[1]);
} else {
$response = array("status" => "404", "error" => "We could not find the product details on Flipkart $url");
}
return $response;
}
?>
Flipkart has since changed its interface, and you can now fetch the product price and other details through the Flipkart API.
I'm currently using their API as well.
However, I also want to fetch the product details with the curl code below. If anyone is doing the same without any problem, please share what else I have to add to fetch the product page content. When I debug this with curl_getinfo(), it reports 301 Moved Permanently with status code 0.
$curl_handle=curl_init();
curl_setopt($curl_handle,CURLOPT_URL,<flipkart_url>);
curl_setopt($curl_handle,CURLOPT_CONNECTTIMEOUT,100);
curl_setopt($curl_handle,CURLOPT_RETURNTRANSFER,1);
curl_setopt($curl_handle, CURLOPT_REFERER, 'http://www.flipkart.com/');
curl_setopt($curl_handle, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5');
$str = curl_exec($curl_handle);
$html = new simple_html_dom();
$html->load($str);
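For the original question about the specTable tables, one possible approach is sketched below. It swaps simple_html_dom for PHP's built-in DOMDocument and DOMXPath, which is a substitution and not what the posts above use. The function name extractSpecs is hypothetical, and the sample markup is taken from the question, so the real page structure may differ.

```php
<?php
// Collect every specsKey/specsValue pair from all tables with class "specTable".
// extractSpecs() is a hypothetical helper; it requires the DOM extension.
function extractSpecs(string $html): array
{
    $doc = new DOMDocument();
    // Real-world pages are rarely valid HTML, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match the class as a whole word, even when other classes are present.
    $query = '//table[contains(concat(" ", normalize-space(@class), " "), " specTable ")]'
           . '//td[contains(concat(" ", normalize-space(@class), " "), " specsKey ")]';

    $specs = [];
    foreach ($xpath->query($query) as $keyCell) {
        // The value is assumed to sit in the cell right after the key cell.
        $valueCell = $xpath->query('following-sibling::td[1]', $keyCell)->item(0);
        if ($valueCell !== null) {
            $specs[trim($keyCell->textContent)] = trim($valueCell->textContent);
        }
    }
    return $specs;
}

// Sample markup based on the snippet in the question.
$sample = '<table cellspacing="0" class="specTable"><tr>'
        . '<td class="specsKey">Brand</td><td class="specsValue">Apple</td>'
        . '</tr></table>';
$specs = extractSpecs($sample);
echo $specs['Brand'], "\n";
```

Because the query runs over the whole document, it picks up the rows of every matching table regardless of how many there are, and a single value can then be read by its key, as with $specs['Brand'].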

PHP curl simple_dom_document request to get snow data from snowbird.com

I'm using PHP, curl, and simple_dom_document to get snow data from snowbird.com. The problem is that I can't seem to find the data I need. I am able to find the parent div and its name, but I can't find the actual snow info div. Here is my code. Below the code I'll paste a small part of the output.
<?php
require('simple_html_dom.php');
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.snowbird.com/mountain-report/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
$content = curl_exec($ch);
curl_close($ch);
$html = new simple_html_dom();
$html->load($content);
$ret = $html->find('.horizSnowChartText');
$ret = serialize($ret);
$ret3 = new simple_html_dom();
$ret3->load($ret);
$es = $ret3->find('text');
$ret2 = $ret3->find('.total-inches');
print_r($ret2);
//print_r($es);
?>
And here is a picture of the output. You can see it skips the actual snow data and goes right to the inches mark ".
Do note that the html markup you're getting has multiple instances of .total-inches (multiple nodes with this class). If you want to explicitly get one, you can point to it directly using the second argument of ->find().
Example:
// The second argument is the zero-based index of the match to return.
$ret2 = $html->find('.total-inches', 3);
If you want to check them all out, a simple foreach should suffice:
foreach($html->find('.current-conditions .snowfall-total .total-inches') as $in) {
echo $in , "\n";
}

"Checking browser before accessing..." error when using Curl

I am trying to use curl to get the contents of a website. The error that I am getting is:
"Checking your browser before accessing roosterteeth.com"
I tried changing different options in curl but still had no luck. I have also tried using PHP Simple HTML DOM Parser, but once again no luck.
Below is my current code.
<?php
$divContents = array();
$userAgent = 'Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0';
$html = curl_init("http://roosterteeth.com/home.php");
curl_setopt($html, CURLOPT_RETURNTRANSFER, true);
curl_setopt($html, CURLOPT_BINARYTRANSFER, true);
curl_setopt($html, CURLOPT_USERAGENT, $userAgent);
curl_setopt($html, CURLOPT_SSL_VERIFYPEER, false);
$content = curl_exec($html);
foreach($content->find("div.streamIndividual") as $div) {
$divContents[] = $div->outertext; }
file_put_contents("cache.htm", implode(PHP_EOL, $divContents));
$hash = file_get_contents("pg_1_hash.htm");
$cache = file_get_contents("cache.htm");
if ($hash == ($pageHash = md5($test))) {
} else {
$fpa = fopen("pg_1.htm", "w");
fwrite($fpa, $cache);
fclose($fpa);
$fpb = fopen("pg_1_hash.htm", "w");
fwrite($fpb, $pageHash);
fclose($fpb);
}
?>
As it stands the code above shows a different error due to the find command not being able to get any content. The code below shows the error I get from the site.
<?php
$divContents = array();
$userAgent = 'Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0';
$html = curl_init("http://roosterteeth.com/home.php");
curl_setopt($html, CURLOPT_RETURNTRANSFER, true);
curl_setopt($html, CURLOPT_BINARYTRANSFER, true);
curl_setopt($html, CURLOPT_USERAGENT, $userAgent);
curl_setopt($html, CURLOPT_SSL_VERIFYPEER, false);
$content = curl_exec($html);
echo $content;
?>
My hunch is that the server thinks I am a bot (which I can't blame it for). I used curl to see if I could pretend to be a regular client and get past the check, but was unsuccessful. I hope someone can shed some light on this.
For a visual error click this link.
Thank you for your time :)
If the site you're trying to access uses WordPress, it definitely has security issues. This is a known malicious modification for WordPress that redirects users to other sites. So in this case the problem is not in your code.
