How to parse feed with php - php

Using Wikiepdia API link to get some basic informations about some world known characters.
Example : (About Dave Longaberger)
This would show as following
Now my question
I'd like to parse the xml to get such basic informations between <extract></extract> to show it.
Here is my idea but failed (I/O warning : failed to load external entity)
<?PHP
$url = 'http://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Dave Longaberger&format=xml&exintro=1';
$xml = simplexml_load_file($url);
// get extract
$text=$xml->pages[0]->extract;
// show title
echo $text;
?>
Another idea but also failed (failed to open stream: HTTP request failed!)
<?PHP
function get_url_contents($url){
$crl = curl_init();
$timeout = 5;
curl_setopt ($crl, CURLOPT_URL,$url);
curl_setopt ($crl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
$ret = curl_exec($crl);
curl_close($crl);
return $ret;
}
$url = "http://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Dave Longaberger&format=xml&exintro=1";
$text = file_get_contents($url);
echo $text;
?>
so any idea how to do it. ~ Thanks
Update (after added urlencode or rawurlencode still not working)
$name = "Dave Longaberger";
$name = urlencode($name);
$url = 'http://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles='.$name.'&format=xml&exintro=1';
$text = file_get_contents($url);
Also not working
$url = 'http://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Dave Longaberger&format=xml&exintro=1';
$url = urlencode($url);
$text = file_get_contents($url);
nor
$url = 'http://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles='.rawurlencode('Dave Longaberger').'&format=xml&exintro=1';
$text = file_get_contents($url);
Well so i really don't know looks like it is impossible by somehow.

Set the User Agent Header in your curl request, wikipedia replies with error 403 forbidden otherwise.
<?PHP
$url = "http://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Dave+Longaberger&format=xml&exintro=1";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1");
$xml = curl_exec($ch);
curl_close($ch);
echo $xml;
?>
Alternatively:
ini_set("user_agent","Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1");
$url = "http://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Dave+Longaberger&format=xml&exintro=1";
$xml = simplexml_load_file($url);
$extracts = $xml->xpath("/api/query/pages/page/extract");
var_dump($extracts);

Look at the note in this php man page
http://php.net/manual/en/function.file-get-contents.php
If you're opening a URI with special characters, such as spaces, you need to encode the URI with urlencode().

Related

Extract/Display JSON Wikipedia PHP

I am new to programming,
I need to extract the wikipedia content and put it into html.
//curl request returns json output via json_decode php function
function curl($url){
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5');
$result = curl_exec($ch);
curl_close($ch);
return $result;
}
$search = $_GET["search"];
if (empty($search)) {
//term param not passed in url
exit;
} else {
//create url to use in curl call
$term = str_replace(" ", "_", $search);
$url = "https://en.wikipedia.org/w/api.php?action=opensearch&search=".$search."&limit=1&namespace=0&format=jsonfm";
$json = curl($url);
$data = json_decode($json, true);
$data = $data['parse']['wikitext']['*'];
}
so I basically want to reprint a wiki page but with my styles and do not know how to do.
Any ideas, Thanks

.txt->Curl->Pregmetch->Mysql

1)Instead of http://webiste.com/filename i want to take the data from the .txt file which contains 20374 rows and each row contains different website
for example:
website1.com
website2.com
website3.com
etc.
2)parse them individually using curl command
3)get the needed data via preg_match
4)and final results i want to save to my mysql databse
bellow is the code that i am using at the moment, please advise the solution what needs to be added to achieve this goal ?
function curl($url)
{
$agent = "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4) Gecko/20030624 Netscape/7.1 (ax)";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_HTTPHEADER, array("Cookie: ddosdefend=1d4607e3ac67b865e6c7263260c34e888cae7c56"));
curl_setopt($ch, CURLOPT_USERAGENT, $agent); // The contents of the "User-Agent: " header to be used in a HTTP request.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // TRUE to return the transfer as a string of the return value of curl_exec() instead of outputting it out directly.
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // TRUE to follow any "Location: " header that the server sends as part of the HTTP header (note this is recursive, PHP will follow as many "Location: " headers that it is sent, unless CURLOPT_MAXREDIRS is set).
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$test = curl('http://webiste.com/filename');
preg_match('/<iframe class=\"metaframe rptss\" src="(.*?)"/', $test, $matches);
$test2 = $matches[1];

Passing Colon in curl function issue

I have this function:
function file_get_contents_curl($url) {
$ch = curl_init();
$ua = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)';
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_USERAGENT, $ua);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
Now I have a strange issue: I;m trying to search in Google
using the "inurl:example.com", like this:
$email = $_GET["email"];
$in = "inurl:example.com+" . $email;
$in = str_replace(' ','+',$in);
$url = 'http://www.google.com/search?hl=en&q='.$in;
and every time i get 0 results.
However, when i get rid from the "inurl:" and replace it with some other
condition, like this:
$in = "example.com+" . $email;
it works well and i see the results.
So something about the COLON ("inurl:") causes trouble for the file_get_contents_curl function.
I tried the same inurl: string in file_get_html function (instead of file_get_contents_curl)
$html = file_get_html($url);
with no problem at all.
Any ideas?
EDIT:
$in = "+inurl:pastebin.com";
$in = str_replace("%3A",":",$in);
(NOT) solved the problem.
EDIT:
Sometimes it works and sometimes not.
So it's something related to Google blocking/rate-limit
and stuff like that.
Thanks anyway.

php curl super long url malformed

I have a long url nested in a variable: $mp4, and trying to download it with curl but i'm getting malformed error. Please help me if you can, thank you in advance!
The below is what I have in my php script:
exec("curl -o $fnctid.mp4 \"$mp4\"");
Error message:
curl: (3) <url> malformed
Sample url to test download:
http://f26.stream.nixcdn.com/6f4df1d8c248cf149b846c24d32f1c35/514e0209/PreNCT5/22-TaylorSwift-2426783.mp4
The current url is returning 408 - Request Timeout if that is fixed you are you this simple code :
$url = 'http://f26.stream.nixcdn.com/6f4df1d8c248cf149b846c24d32f1c35/514e0209/PreNCT5/22-TaylorSwift-2426783.mp4';
$useragent = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.2 (KHTML, like Gecko) Chrome/5.0.342.3 Safari/533.2';
$file = __DIR__ . DIRECTORY_SEPARATOR . basename($url);
$fp = fopen($file, 'w+');
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_TIMEOUT, 320);
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
echo curl_exec($ch);
var_dump(curl_getinfo($ch)); // return request information
curl_close($ch);
fclose($fp);
This error can be resolved through using urlencode
$url = urlencode ( $url )
This function is convenient when encoding a string to be used in a query part of a URL, as a convenient way to pass variables.
Hope this solve answer

cURL error: Could not resolve host

A similar question has been posted at but i could not find the solution there
Curl error Could not resolve host: saved_report.xml; No data record of requested type"
<?php
$url="http://en.wikipedia.org/wiki/Pakistan";
$ch = curl_init(urlencode($url));
echo $ch;
// used to spoof that coming from a real browser so we don't get blocked by some sites
$useragent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1";
curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 4);
curl_setopt($ch, CURLOPT_TIMEOUT, 8);
curl_setopt($ch, CURLOPT_LOW_SPEED_TIME, 10);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$content = curl_exec($ch);
$info = curl_getinfo($ch);
if ($content === false || $info['http_code'] != 200) {
$content = "No cURL data returned for $url [". $info['http_code']. "]";
if (curl_error($ch))
$content .= "\n". curl_error($ch);
}
else {
// 'OK' status; format $output data if necessary here:
echo "...";
}
echo $content;
curl_close($ch);
?>
when i paste the same address in browser i am able to access the webpage. but when i run this script i get the error message. Can anyone please help me.
Thanks
Remove the urlencode call.
remove the urlencode($url) it should be:
$ch = curl_init($url);
Well.
If you remove urlencode() with instantiating your $ch-var, you go just fine. urlencode() is definitely wrong here.
Good:
$ch = curl_init($url);
Bad:
$ch = curl_init(urlencode($url));
$ch = curl_init($url);
instead of
$ch = curl_init(urlencode($url));

Categories