How to get a specified row using cURL in PHP - php

Hey guys, I use cURL to communicate with an external web server, but the response is HTML. I was able to convert it to JSON (more than 4000 rows), but I have no idea how to get the specific row that contains my result. Any ideas?
Here is my cURL code:
require_once('getJson.php');
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.reputationauthority.org/domain_lookup.php?ip=website.com&Submit.x=9&Submit.y=5&Submit=Search');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
curl_setopt($ch, CURLOPT_TIMEOUT, 5);
$data = curl_exec($ch);
$httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
$data = '<<<EOF'.$data.'EOF';
$json = new GetJson();
header("Content-Type: text/plain");
$res = json_encode($json->html_to_obj($data), JSON_PRETTY_PRINT);
$myArray = json_decode($res,true);
For getJson.php
class GetJson{
    function html_to_obj($html) {
        libxml_use_internal_errors(true);
        $dom = new DOMDocument();
        $dom->loadHTML($html);
        return $this->element_to_obj($dom->documentElement);
    }
    function element_to_obj($element) {
        if ($element->nodeType == XML_ELEMENT_NODE){
            $obj = array( "tag" => $element->tagName );
            foreach ($element->attributes as $attribute) {
                $obj[$attribute->name] = $attribute->value;
            }
            foreach ($element->childNodes as $subElement) {
                if ($subElement->nodeType == XML_TEXT_NODE) {
                    $obj["html"] = $subElement->wholeText;
                }
                else {
                    $obj["children"][] = $this->element_to_obj($subElement);
                }
            }
            return $obj;
        }
    }
}
My idea: instead of browsing the rows to reach line 2175 (doing something like $data['children'][2]['children'][7]['children'][3]['children'][1]['children'][1]['children'][0]['children'][1]['children'][0]['children'][1]['children'][2]['children'][0]['children'][0]['html'] does not seem like a good idea to me), I want to go directly to it.

If the HTML being returned has a consistent structure every time, and you just want one particular value from one part of it, you may be able to use regular expressions to parse the HTML and find the part you need. This is an alternative to trying to put the whole thing into an array. I have used this technique before to parse an HTML document and find a specific item. Here's a simple example. You will need to adapt it to your needs, since you haven't specified the exact nature of the data you're seeking, and you may need to go down several levels of parsing to find the right bit:
$data = curl_exec($ch);
// Split the output into an array that we can loop through line by line
$array = preg_split('/\n/', $data);
// For each line in the output
foreach ($array as $element)
{
    // See if the line contains a hyperlink
    if (preg_match("/<a href/", $element))
    {
        ...[do something here, e.g. store the data retrieved, or do more matching to find something within it]...
    }
}
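Alternatively, since getJson.php already loads the page into a DOMDocument, a DOMXPath query can jump straight to the node you care about instead of walking the nested children arrays. This is only a sketch: the XPath expression below is a placeholder and has to be adapted to the actual markup of the page you are scraping.
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($data); // $data is the raw HTML returned by curl_exec()
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);
// Placeholder query: grab the second cell of the table row whose first cell contains "Score".
$nodes = $xpath->query('//tr[td[1][contains(., "Score")]]/td[2]');
if ($nodes->length > 0) {
    echo trim($nodes->item(0)->textContent);
}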

Related

I need help avoiding duplicate code (copy pasting code twice)

I'm constantly trying to improve my programming skills; I've learned everything online so far. But I can't find a way to avoid duplicating code. Here's my code:
public function Curl($page, $check_top = 0, $pages = 1, $pagesources = array()){
    //$page is the URL
    //$check_top 0 = false 1 = true. When true it needs to check both false & true
    //$pages is the amount of pages it needs to check.
    $agent = "Mozilla/5.0 (Windows NT x.y; Win64; x64; rv:10.0) Gecko/20100101 Firefox/10.0";
    try{
        for($i = 0; $i < $pages; $i++){
            $count = $i * 25; //Page 1 starts at 0, page 2 at 25 etc..
            $ch = curl_init($page . "/?count=" . $count);
            curl_setopt($ch, CURLOPT_USERAGENT, $agent);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
            curl_setopt($ch, CURLOPT_TIMEOUT, 60);
            curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
            $pagesource = curl_exec($ch);
            $pagesources[] = $pagesource;
        }
        if($check_top == 1){
            for($i = 0; $i < $pages; $i++){
                $count = $i * 25;
                $ch = curl_init($page . "/top/?sort=top&t=all&count=" . $count);
                curl_setopt($ch, CURLOPT_USERAGENT, $agent);
                curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
                curl_setopt($ch, CURLOPT_TIMEOUT, 60);
                curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
                $pagesource = curl_exec($ch);
                $pagesources[] = $pagesource;
            }
        }
    } catch (Exception $e){
        echo $e->getMessage();
    }
    return $pagesources;
}
What I'm trying to do:
I want to get the HTML page sources for a specific page range (for example, pages 1 to 5). There are top pages and standard pages, and I want to get the sources of both across that page range. My code works fine, but obviously there must be a better way.
Here's a short example of how you can avoid duplicate code by writing functions and using them together.
class A
{
    public function methodA($paramA, $paramB, $paramC)
    {
        if ($paramA == 'A') {
            $result = $this->methodB($paramB);
        } else {
            $result = $this->methodB($paramC);
        }
        return $result;
    }
    public function methodB($paramA)
    {
        // do something with the given param and return the result
    }
}
$classA = new A();
$result = $classA->methodA('foo', 'bar', 'baz');
The code given above shows a simple class with two methods. As you declared your function Curl in your example as public, I guess you're using a class. The class in the example above is very basic: it calls methodB with different params inside the methodA method of the class.
What does this mean for you? You have to find out which parameters your helper function needs. Once you know which parameters it needs, just write another class method that executes the cURL calls with the given parameters. Simple as pie.
If you're new to using classes and methods with PHP, I suggest reading the documentation, where the basic functionality of classes, methods and members is described: http://php.net/manual/en/classobj.examples.php.
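Applied to the Curl() method from the question, the duplicated loop could be pulled into one private helper. This is only a sketch of that idea; the class name and the fetchPages() helper are made up for illustration:
class PageFetcher
{
    private $agent = "Mozilla/5.0 (Windows NT x.y; Win64; x64; rv:10.0) Gecko/20100101 Firefox/10.0";

    public function Curl($page, $check_top = 0, $pages = 1)
    {
        // Standard pages
        $pagesources = $this->fetchPages($page . "/?count=", $pages);
        // Top pages, only when requested
        if ($check_top == 1) {
            $pagesources = array_merge(
                $pagesources,
                $this->fetchPages($page . "/top/?sort=top&t=all&count=", $pages)
            );
        }
        return $pagesources;
    }

    // Hypothetical helper: runs the cURL loop once for a given base URL.
    private function fetchPages($baseUrl, $pages)
    {
        $pagesources = array();
        for ($i = 0; $i < $pages; $i++) {
            $count = $i * 25; // page 1 starts at 0, page 2 at 25, etc.
            $ch = curl_init($baseUrl . $count);
            curl_setopt($ch, CURLOPT_USERAGENT, $this->agent);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
            curl_setopt($ch, CURLOPT_TIMEOUT, 60);
            curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
            $pagesources[] = curl_exec($ch);
            curl_close($ch);
        }
        return $pagesources;
    }
}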

How to convert this Python code to PHP?

I want to convert the Python function below to a PHP function. If someone could help a little bit, I'd appreciate it.
P.S.: I know that for those who have mastered the process the question may seem simple and repetitive (there are several posts about converting functions on Stack Overflow); however, for beginners it is quite complicated.
def resolvertest(url):
    if not 'http://' in url:
        url = 'http://www.exemplo.com'+url
    log(url)
    link = abrir_url(url)
    match = re.compile('<iframe name="Font" ="" src="(.*?)"').findall(link)[0]
    req = urllib2.Request(match)
    req.add_header('User-Agent', 'Mozilla/5.0 (Linux; Android 4.4.2; Nexus 4 Build/KOT49H) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.114 Mobile Safari/537.36')
    response = urllib2.urlopen(req)
    link = response.read()
    response.close()
    url = re.compile(r'file: "(.+?)"').findall(link)[0]
    return url
I created a function, getcurl($url), that routes all URL requests through cURL, making it easier to read the pages and their contents.
get_redirect() works as a kind of loop that follows the sub-links found on each page until it reaches the final page; once there, if($link) is no longer true, and your regex file: "(.+?)" is executed, capturing the desired content.
The script is written in a simple way.
$url = "http://www.exemplo.com/content.html";
$file_contents = getcurl($url);
preg_match('/<iframe name="Font" ="" src="(.*?)"/', $file_contents, $match_url);
#$match = $match_url[1];
function get_redirect($link){
$file_contents = getcurl($link);
preg_match('/<a href="(.*?)"/', $file_contents, $match_url);
#$link = $match_url[1];
if($link){
return get_redirect($link);
}else {
preg_match('/file: "(.+?)"/',$file_contents, $match_content_url);
#$match_content_url = $match_content_url[1];
return $match_content_url;
}
}
function getcurl($url){
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$url = curl_exec($ch);
curl_close ($ch);
return $url;
}
$content = get_redirect($match);
echo $content;
From my limited Python knowledge I'd assume this does the same:
function resolvertest($url) {
    if (strpos($url, 'http://') === FALSE) {
        $url = 'http://www.exemplo.com' . $url;
    }
    echo $url; // or whatever log(url) does
    libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML(file_get_contents($url)); // fetch the page and parse its HTML
    libxml_use_internal_errors(false);
    $xpath = new DOMXPath($dom);
    $match = $xpath->evaluate('//iframe[@name="Font"]/@src')->item(0)->nodeValue;
    $ua = stream_context_create(['http' => ['user_agent' => 'blah']]);
    $link = file_get_contents($match, false, $ua);
    preg_match('~file: "(.+?)"~', $link, $matches);
    return $matches[1];
}
Note that I didn't use a Regular Expression to get the iframe src, but actually parsed the HTML and used XPath. Getting the final link does use a Regex, because it seems to match some JSON and not HTML. If so, you want to use json_decode instead for more reliable results.
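For instance, if the final response really is JSON with a file key, decoding it is more robust than the regex. A small sketch, under the assumption that the body is plain JSON with a top-level file key:
$data = json_decode($link, true);
if (json_last_error() === JSON_ERROR_NONE && isset($data['file'])) {
    return $data['file'];
}
// otherwise fall back to the preg_match() shown above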

Parse REST Response Using PHP / CURL

I'm trying the REST API here: https://www.semrush.com/api-analytics/ , specifically the Organic Results, but no matter what I've tried, I can't seem to manipulate the data. Can someone tell me how to do this? I've tried SimpleXML, JSON, and even breaking up the response via explode() but I must be missing something because all I can do is push the result to the beginning of an array and not actually break it up.
This is my current code:
$url = "http://api.semrush.com/?type=phrase_organic&key=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX&display_limit=10&export_columns=Dn,Ur&phrase=seo&database=us";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch,CURLOPT_USERAGENT,"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($ch);
curl_close($ch);
var_dump($result);
With the result being:
string 'Domain;Url
site-analyzer.com;https://www.site-analyzer.com/
woorank.com;https://www.woorank.com/
hubspot.com;http://blog.hubspot.com/blog/tabid/6307/bid/33164/6-SEO-Tools-to-Analyze-Your-Site-Like-Google-Does.aspx
seoworkers.com;http://www.seoworkers.com/tools/analyzer.html
seositecheckup.com;http://seositecheckup.com/
site-seo-analysis.com;http://www.site-seo-analysis.com/
webseoanalytics.com;http://www.webseoanalytics.com/free/seo-tools/web-seo-analysis.php
seocentro.com;http://www.seocentro.com/t'... (length=665)
Is there a simple way to break this up so I can manipulate or reformat the response?
You need to properly split on the newline characters in order to get to the CSV structure, then parse each line as CSV:
foreach(preg_split("/((\r?\n)|(\r\n?))/", $response) as $key=>$line){
    if ($key!=0) {
        list($domain,$url) = str_getcsv($line,';');
        print 'Domain: ' . $domain . ', URL: ' . $url . PHP_EOL;
    }
}
Using the sample response from https://www.semrush.com/api-analytics/#phrase_organic,
the above will output
Domain: wikipedia.org, URL: http://en.wikipedia.org/wiki/Search_engine_optimization
Domain: searchengineland.com, URL: http://searchengineland.com/guide/what-is-seo
Domain: moz.com, URL: http://moz.com/beginners-guide-to-seo
The if statement is there to filter out the first line, the csv header.
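If you want to keep the data around rather than print it, the same split can feed an array of rows keyed by the CSV header, so it works for any export_columns you request. A small sketch along the same lines:
$lines = preg_split("/((\r?\n)|(\r\n?))/", $response);
$header = str_getcsv(array_shift($lines), ';'); // e.g. ['Domain', 'Url']
$rows = [];
foreach ($lines as $line) {
    if ($line === '') continue;                 // skip trailing blank lines
    $rows[] = array_combine($header, str_getcsv($line, ';'));
}
print_r($rows); // [['Domain' => 'wikipedia.org', 'Url' => 'http://...'], ...]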
Well, we could explode by space " ", then by ;
$response = explode(" ", trim(str_replace("Domain;Url", "", $response)));
$readableResponse = [];
foreach($response as $r)
{
    $e = explode(";", $r);
    $readableResponse[$e[0]] = $e[1];
}
print_r($readableResponse);
I.e., live on phpsandbox:
[searchengineland.com] => http://searchengineland.com/guide/what-is-seo
[wikipedia.org] => https://en.wikipedia.org/wiki/Search_engine_optimization
....

How to get Wikipedia content section by section using Wikipedia API - PHP

Is there any better way to fetch the text contents of particular sections from Wikipedia? I have the code below to skip some sections, but the process is taking too long to fetch the data I'm looking for.
for($i=0;$i>10;$i++){
    if($i != 2 || $i != 4){
        $url = 'http://en.wikipedia.org/w/api.php?action=parse&page=ramanagara&format=json&prop=text&section='.$i;
        $ch = curl_init($url);
        curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt ($ch, CURLOPT_USERAGENT, "TestScript");
        $c = curl_exec($ch);
        $json = json_decode($c);
        $content = $json->{'parse'}->{'text'}->{'*'};
        print preg_replace('/<\/?a[^>]*>/','',$content);
    }
}
For starters, the loop condition $i>10 is false from the very first iteration, so the loop body never runs at all. Change it to $i<10, or if you need only a handful of sections, try:
foreach (array(1, 3, 5, 6, 7) as $i)
    //your code
Second, decoding JSON into an associative array like this:
$json = json_decode($c, true);
And referencing it like $json['parse']['text']['*'] is easier to work with, but that's up to you.
And third, you'll find that strip_tags() will likely function faster and more accurately than stripping tags with regular expressions.
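Putting those three suggestions together, the loop might look something like this; just a sketch, keeping the same page as the question and an arbitrary choice of sections:
foreach (array(1, 3, 5, 6, 7) as $i) {
    $url = 'http://en.wikipedia.org/w/api.php?action=parse&page=ramanagara&format=json&prop=text&section=' . $i;
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_USERAGENT, "TestScript");
    $c = curl_exec($ch);
    curl_close($ch);
    $json = json_decode($c, true);              // associative array
    $content = $json['parse']['text']['*'];
    print strip_tags($content, '<p><ul><li>');  // drop links and most markup, keep basic structure
}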

Caching JSON output in PHP

Got a slight bit of an issue. I've been playing with the Facebook and Twitter APIs and getting the JSON output of status search queries with no problem; however, I've read further and realised that I could end up being "rate limited", as quoted from the documentation.
I was wondering: is it easy to cache the JSON output each hour so that I can at least try to prevent this from happening? If so, how is it done? I tried a YouTube video, but it didn't really give much information, only how to write the contents of a directory listing to a cache.php file. It didn't point out whether this can be done with JSON output, and it certainly didn't say how to use a 60-minute interval or how to get the information back out of the cache file.
Any help or code would be very much appreciated, as there seem to be very few tutorials on this sort of thing.
Here's a simple function that adds caching to fetching some URL contents:
function getJson($url) {
    // cache files are created like cache/abcdef123456...
    $cacheFile = 'cache' . DIRECTORY_SEPARATOR . md5($url);
    if (file_exists($cacheFile)) {
        $fh = fopen($cacheFile, 'r');
        $size = filesize($cacheFile);
        $cacheTime = trim(fgets($fh));
        // if data was cached recently, return cached data
        if ($cacheTime > strtotime('-60 minutes')) {
            return fread($fh, $size);
        }
        // else delete cache file
        fclose($fh);
        unlink($cacheFile);
    }
    $json = /* get from Twitter as usual */;
    $fh = fopen($cacheFile, 'w');
    fwrite($fh, time() . "\n");
    fwrite($fh, $json);
    fclose($fh);
    return $json;
}
It uses the URL to identify cache files; a repeated request to the identical URL will be read from the cache the next time. It writes the timestamp into the first line of the cache file, and cached data older than an hour is discarded. It's just a simple example, and you'll probably want to customize it.
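As an illustration only (not part of the original answer), the "get from Twitter as usual" placeholder could be filled in with the same cURL pattern used elsewhere in this thread:
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$json = curl_exec($ch);
curl_close($ch);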
It's a good idea to use caching to avoid the rate limit. Here's some example code that shows how I did it for Google+ data in some PHP code I wrote recently.
private function getCache($key) {
    $cache_life = intval($this->instance['cache_life']); // minutes
    if ($cache_life <= 0) return null;
    // fully-qualified filename
    $fqfname = $this->getCacheFileName($key);
    if (file_exists($fqfname)) {
        if (filemtime($fqfname) > (time() - 60 * $cache_life)) {
            // The cache file is fresh.
            $fresh = file_get_contents($fqfname);
            $results = json_decode($fresh, true);
            return $results;
        }
        else {
            unlink($fqfname);
        }
    }
    return null;
}
private function putCache($key, $results) {
    $json = json_encode($results);
    $fqfname = $this->getCacheFileName($key);
    file_put_contents($fqfname, $json, LOCK_EX);
}
and to use it:
// $cacheKey is a value that is unique to the
// concatenation of all params. A string concatenation
// might work.
$results = $this->getCache($cacheKey);
if (!$results) {
    // cache miss; must call out
    $results = $this->getDataFromService(....);
    $this->putCache($cacheKey, $results);
}
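The getCacheFileName() helper isn't shown in the answer; purely as an assumption about what it might look like, it only needs to map a key to a stable path inside some cache directory (the instance['cache_dir'] setting below is invented for this sketch):
private function getCacheFileName($key) {
    // Hypothetical: 'cache_dir' is assumed to be configured on the instance.
    return $this->instance['cache_dir'] . DIRECTORY_SEPARATOR . md5($key) . '.json';
}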
I know this post is old, but it shows up in Google, so for everyone looking: I made this simple function that cURLs a JSON URL and caches the response in a file inside a specific folder. When the JSON is requested again, it re-fetches with cURL if 5 minutes have passed; if they haven't, it serves it from the file. It uses the filename's timestamp to track time. Enjoy!
function ccurl($url, $id){
    $path = "./private/cache/$id/";
    $files = scandir($path);
    $files = array_values(array_diff(scandir($path), array('.', '..')));
    if(count($files) > 1){
        foreach($files as $file){
            unlink($path.$file);
            $files = scandir($path);
            $files = array_values(array_diff(scandir($path), array('.', '..')));
        }
    }
    if(empty($files)){
        $c = curl_init();
        curl_setopt($c, CURLOPT_URL, $url);
        curl_setopt($c, CURLOPT_TIMEOUT, 15);
        curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($c, CURLOPT_USERAGENT,
            'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:17.0) Gecko/20100101 Firefox/17.0');
        $response = curl_exec($c);
        curl_close($c);
        $fp = file_put_contents($path.time().'.json', $response);
        return $response;
    }else {
        if(time() - str_replace('.json', '', $files[0]) > 300){
            unlink($path.$files[0]);
            $c = curl_init();
            curl_setopt($c, CURLOPT_URL, $url);
            curl_setopt($c, CURLOPT_TIMEOUT, 15);
            curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($c, CURLOPT_USERAGENT,
                'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:17.0) Gecko/20100101 Firefox/17.0');
            $response = curl_exec($c);
            curl_close($c);
            $fp = file_put_contents($path.time().'.json', $response);
            return $response;
        }else {
            return file_get_contents($path.$files[0]);
        }
    }
}
For usage, create a directory for all cached files (for me it's /private/cache), then create another directory inside it for the request cache, x for example. When calling the function it should look like this:
ccurl('json_url','x')
where x is the ID. If you have a question, please ask me ^_^ Also enjoy (I might update it later so it doesn't use a directory for the ID).
