Using PHP and cURL to update MediaWiki

I've been working on a PHP script to update MediaWiki entries; however, whenever I run it, it doesn't update the wiki at all and just returns the article page unedited.
I've included a section which logs into the wiki first, and I've successfully read information off the wiki, but I have not been able to update it.
Is there something I'm missing, or better yet, is there an existing PHP package which can be used to update a MediaWiki installation?
Thanks in advance.
Sample code follows:
function curl_post_page($site, $post) {
    $headers = array();
    $headers[] = 'Accept: image/gif, image/x-bitmap, image/jpeg, image/pjpeg';
    $headers[] = 'Connection: Keep-Alive';
    $headers[] = 'Content-type: application/x-www-form-urlencoded;charset=UTF-8';
    //var_dump($post);
    $cl = curl_init($site);
    curl_setopt($cl, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($cl, CURLOPT_HEADER, true);
    curl_setopt($cl, CURLOPT_VERBOSE, true);
    curl_setopt($cl, CURLOPT_FAILONERROR, true);
    curl_setopt($cl, CURLOPT_POST, true);
    curl_setopt($cl, CURLOPT_POSTFIELDS, $post);
    curl_setopt($cl, CURLOPT_CONNECTTIMEOUT, 30);
    curl_setopt($cl, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT)");
    curl_setopt($cl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($cl, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($cl, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($cl, CURLOPT_COOKIEFILE, "cookie.txt");
    curl_setopt($cl, CURLOPT_COOKIEJAR, "cookie.txt"); // write cookies back so the login session persists
    $result = curl_exec($cl);
    curl_close($cl);
    return $result;
}

Is the write API enabled ($wgEnableWriteAPI = true; in LocalSettings.php)? It is disabled by default in versions below 1.14.
Are you getting any errors back?

There are several client libraries available (known good revision; mirror at archive.org).
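If the write API is available, it is usually easier to go through api.php than to drive the index.php edit form, because every step returns a status you can check. A minimal sketch of a login-plus-edit round trip, assuming a wiki of that era (legacy action=login flow); the API URL, credentials, and page title are placeholders:
$api = 'http://example.com/w/api.php';

function api_post($url, array $fields) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($fields));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookie.txt');  // keep the session across calls
    curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookie.txt');
    $body = curl_exec($ch);
    curl_close($ch);
    return json_decode($body, true);
}

// Legacy two-step login: the first call returns NeedToken, the second confirms
$login = api_post($api, array('action' => 'login', 'format' => 'json',
    'lgname' => 'BotUser', 'lgpassword' => 'secret'));
$login = api_post($api, array('action' => 'login', 'format' => 'json',
    'lgname' => 'BotUser', 'lgpassword' => 'secret',
    'lgtoken' => $login['login']['token']));

// Fetch an edit token for the target page
$info = api_post($api, array('action' => 'query', 'format' => 'json',
    'prop' => 'info', 'intoken' => 'edit', 'titles' => 'Sandbox'));
$page = current($info['query']['pages']);

// Submit the edit; the JSON response says whether it succeeded
$edit = api_post($api, array('action' => 'edit', 'format' => 'json',
    'title' => 'Sandbox', 'text' => 'New page text',
    'summary' => 'automated edit', 'token' => $page['edittoken']));
var_dump($edit);
On newer wikis (1.27+) bot passwords or action=clientlogin replace this login flow, so check your version.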

Related

cURL returns the page with errors (JS, CSS, etc.)

I have seen some similar cases on Stack Overflow, but it seems I have errors in my PHP code and cannot display the page correctly. The page I am trying to get is a resource from Pentaho BI (version 7.1.0.0.12). I have tried many, many things, but nothing works.
First, I perform the authentication via 'Cookie-Based Authentication' (the method provided by Pentaho); see https://help.pentaho.com/Documentation/7.1/0R0/070/010/00A
To get the cookie, I perform an HTTP POST with PHP cURL. That works well; I am able to get the cookie from Pentaho.
Please check the code below:
$post_data['j_username'] = "suzy";     // Pentaho's Spring Security login expects j_username/j_password
$post_data['j_password'] = "password";
$post_items = array();
foreach ($post_data as $key => $value) {
    $post_items[] = $key . '=' . urlencode($value);
}
$post_string = implode('&', $post_items);
$curl_connection = curl_init('http://10.10.10.215:8080/pentaho/j_spring_security_check');
curl_setopt($curl_connection, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($curl_connection, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)");
curl_setopt($curl_connection, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl_connection, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl_connection, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl_connection, CURLOPT_HEADER, true); // include response headers so we can read Set-Cookie
curl_setopt($curl_connection, CURLOPT_POST, true);
curl_setopt($curl_connection, CURLOPT_POSTFIELDS, $post_string);
curl_setopt($curl_connection, CURLOPT_HTTPHEADER, array('Content-Type: application/x-www-form-urlencoded'));
$result = curl_exec($curl_connection);
// Parse the JSESSIONID out of the Set-Cookie response header
preg_match('/JSESSIONID=([^;]+)/', $result, $matches);
$cookie = $matches[1];
So the variable $cookie contains the JSESSIONID that I should use to access the resource from Pentaho.
Then I perform an HTTP GET with PHP cURL to fetch the page (resource) from Pentaho.
This is the part that doesn't work. What I can see is that PHP does connect to Pentaho with the URL and the cookie I previously obtained, and the call returns the whole page, but when I display the page in the browser it throws a lot of errors, as I said before (JS errors, CSS errors and more).
Please check the code below:
$url = "http://10.10.10.215:8080/pentaho/api/repos/%3Apublic%3ASteel%20Wheels%3ADashboards%3ACTools_dashboard.wcdf/generatedContent";
$ch = curl_init();
$headers[] = 'Accept: text/html, application/xml;q=0.9, application/xhtml+xml, image/png, image/jpeg, image/gif, image/x-xbitmap';
$headers[] = 'Connection: Keep-Alive';
$headers[] = 'Cache-Control: no-cache';
$headers[] = 'Pragma: no-cache';
$headers[] = 'Transfer-Encoding: chunked';
$headers[] = 'Accept-Language: nl-NL,nl;q=0.8,en_US;q=0.6,en;q=0.4';
$headers[] = 'Accept-Encoding: gzip, deflate';
$headers[] = 'Accept: text/plain,* /*;q=0.01';
$headers[] = 'Content-Type: text/html; charset=utf-8';
$headers[] = 'Cookie: JSESSIONID='.$sessionID[1];
$user_agent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)';
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');
curl_setopt($ch, CURLOPT_POST, false);
curl_setopt($ch, CURLOPT_POSTFIELDS, '');
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
curl_setopt($ch, CURLOPT_ENCODING, '');
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$response = curl_exec($ch);
echo $response;die();
I'd like to clarify: I am able to get the page, but with all of those errors I mentioned.
I have also tried to load this content into an IFRAME but couldn't manage it. Is there any way to do that?
Any information you can add is welcome, as is any code I can check.
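A hedged guess at the JS/CSS errors: when the dashboard HTML is echoed from your PHP script, its relative asset URLs resolve against your PHP server rather than the Pentaho host, so every script and stylesheet request fails. One sketch of a workaround is to inject a <base> tag so the browser resolves assets against Pentaho directly (this assumes the browser can reach the Pentaho host; assets may still require the session cookie):
// Sketch: make relative asset URLs resolve against the Pentaho host
$base = 'http://10.10.10.215:8080/pentaho/';
$response = preg_replace(
    '/<head([^>]*)>/i',
    '<head$1><base href="' . htmlspecialchars($base) . '">',
    $response,
    1 // only touch the first <head>
);
echo $response;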

cURL and file_get_contents blocked

I know this question has been dealt with on a few occasions, but none of the fixes seem to work with my particular problem.
I am trying to grab any page from http://www.lewmar.com but somehow they are managing to block all attempts. My latest script is as follows:
function curl_get_contents($url)
{
    $ch = curl_init();
    $browser_id = "User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0";
    $ip = $_SERVER["SERVER_ADDR"];
    curl_setopt($ch, CURLOPT_USERAGENT, $browser_id);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_REFERER, $ip);
    $headers = array();
    $headers[] = 'Cache-Control: max-age=0';
    $headers[] = 'Connection: keep-alive';
    $headers[] = 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8';
    $headers[] = 'Accept-Language: en-US,en;q=0.5';
    curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
$url = 'http://www.lewmar.com';
$contents = curl_get_contents($url);
echo strlen($contents);
I have tried to replicate most of the headers, and the site doesn't seem to check for JavaScript support, but I still can't get anything returned.
Does anyone have any idea how they might be recognizing cURL and blocking it?
Cheers
When you first visit that site it checks to see if you have a cookie. If you don't, it will send you one and send a redirect (to the same page). You haven't got anything in your code to store cookies so you end up going round in a circle. Curl gives up after 20 redirects. Solution: enable cookies!
curl_setopt($ch, CURLOPT_COOKIESESSION, true);
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookies.txt');  // file cURL writes cookies to when the handle closes
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookies.txt'); // file cURL reads cookies from at startup
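To confirm the redirect loop was the problem, you can check how many redirects cURL actually followed once cookies are enabled; a small sketch:
$data = curl_exec($ch);
if ($data === false) {
    echo 'cURL error: ' . curl_error($ch);
}
// With cookies enabled this should settle after a redirect or two
// instead of hitting cURL's limit of 20
echo 'Redirects followed: ' . curl_getinfo($ch, CURLINFO_REDIRECT_COUNT);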

file_get_contents()/curl getting unexpected page

I'm doing some scraping with PHP. I've been extracting data, including the link to the next relevant page, so the whole thing is automatic. The problem is that I seem to be getting a page which is slightly modified compared to what I get using that URL in my browser (e.g. the dates are different).
I've tried using curl and file_get_contents but both get the wrong page.
At the moment I am using:
$url = "http://www.example.com";
$ch = curl_init();
$timeout = 5;
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
url_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_HEADER, 0);
$temp = curl_exec($ch);
curl_close($ch);
What is going on here?
UPDATE:
I've tried imitating a browser using the following code but am still unsuccessful. I find this bizarre.
function get_url_contents($url) {
    $crl = curl_init();
    $timeout = 10;
    $header = array(
        'User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12',
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language: en-us,en;q=0.5',
        'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7',
        'Keep-Alive: 115',
        'Connection: keep-alive',
    );
    curl_setopt($crl, CURLOPT_HTTPHEADER, $header);
    curl_setopt($crl, CURLOPT_URL, $url);
    curl_setopt($crl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($crl, CURLOPT_ENCODING, 'gzip,deflate'); // request and transparently decode compressed responses
    curl_setopt($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($crl, CURLOPT_AUTOREFERER, FALSE);
    curl_setopt($crl, CURLOPT_FOLLOWLOCATION, FALSE);
    $ret = curl_exec($crl);
    curl_close($crl);
    return $ret;
}
Further update:
It seems that the site is using my location to serve different content. Is there a locale option?
It could be many things:
The server may render pages differently based on the cookies and headers sent.
The server may render pages differently based on existing pre-conditions and state on the server.
There may be a proxy in between that modifies the content based on the user agent; since you don't send a specific browser user agent, the proxy may be sending back different content.
These are just a few things that could happen! For the locale question, see the sketch below.
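cURL has no dedicated locale option, but you can hint your preferred language with an Accept-Language request header; a minimal sketch (note that if the site geolocates by IP address, a header alone will not help):
$ch = curl_init('http://www.example.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
    'Accept-Language: en-US,en;q=0.8', // ask for US-English content
));
$html = curl_exec($ch);
curl_close($ch);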

Issue downloading some files using PHP curl

I have a script that downloads PDF files after logging into another site. It has so far worked great for all sites, but I am now getting something strange with a new site I'm scraping: some of the downloaded files are 1 kB (i.e. it didn't work) while others work just fine. Using the download link in the browser opens the "do you want to save this file" window, and the file is correct there.
Here is my code (I include both the general cURL parameters used throughout the scrape and the final part where I try downloading the files):
//Initial connection to login page
$header[] = 'Host: www.domain.com';
$header[] = 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8';
$header[] = 'Accept-Language: en-US,en;q=0.5';
$header[] = 'Connection: keep-alive';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.domain.com/login');
curl_setopt($ch, CURLOPT_REFERER, 'https://www.domain.com');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0');
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieLocation);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieLocation);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
$webpage = curl_exec($ch);
//Then several operations to login, grab the list of links to PDF download files (...)
//Loop through the array containing the url of the file to download and save it to a folder (writable)
curl_setopt($ch, CURLOPT_POST, false);
foreach ($foundBills as $key => $bill) {
    curl_setopt($ch, CURLOPT_URL, $bill['url']);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    $pdfFile = curl_exec($ch);
    $randomFileName = rand_string(20); // generates a 20-char random string
    $newPDF = $userBillsRoot.$randomFileName.'.pdf';
    write_file($newPDF, $pdfFile, 'wb'); // using a CodeIgniter helper to save the file
}
The files are under 1 MB each. Any ideas? How can I see more detail about why it's not working (e.g. a timeout)? Thanks!
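One way to get more detail is to check each transfer's error state and response metadata with curl_error() and curl_getinfo() before writing the file; a minimal sketch, reusing the $ch handle inside the loop above:
$pdfFile = curl_exec($ch);
if ($pdfFile === false) {
    // Transport-level failure (timeout, DNS, SSL, ...)
    error_log('cURL error: ' . curl_error($ch));
    continue;
}
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
$type = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
if ($status != 200 || stripos((string) $type, 'pdf') === false) {
    // A tiny HTML error page would explain the 1 kB files
    error_log("Unexpected response: HTTP $status, type $type");
    continue;
}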

PHP Curl shows different page than in the browser

I am trying to scrape a list of bills from a website after logging into it via curl, but on one of the pages the content is not the same as in my browser (namely, instead of showing a list of bills it shows "Your bill history cannot be displayed"). I can correctly scrape other pages that are only available after login, so I'm quite puzzled as to why that page refuses to display the bill history when I use curl.
Here is my code:
//Load login page
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.domain.com/login');
curl_setopt($ch, CURLOPT_REFERER, 'https://www.domain.com');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; rv:20.0) Gecko/20100101 Firefox/20.0');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieLocation);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieLocation);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
$webpage = curl_exec($ch);
//Submit post to login page to authentify
$postVariables = 'emailAddress='.urlencode($username).
    '&password='.urlencode($password);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postVariables);
curl_setopt($ch, CURLOPT_URL, 'https://www.domain.com/login/POST.servlet');
curl_setopt($ch, CURLOPT_REFERER, 'https://www.domain.com/login');
$webpage = curl_exec($ch);
//Go to my account main page now that we are logged in
curl_setopt($ch, CURLOPT_POST, false);
curl_setopt($ch, CURLOPT_URL, 'https://www.domain.com/My_Account');
curl_setopt($ch, CURLOPT_REFERER, $target);
$webpage = curl_exec($ch); //shows the same content as in the browser
$accountNumber = return_between($webpage, 'id="accountID1">', '<', EXCL); //this is correctly found
//Go to bills page
curl_setopt($ch, CURLOPT_URL, 'https://www.domain.com/Bill_History/?accountnumber='.$accountNumber);
curl_setopt($ch, CURLOPT_REFERER, 'https://www.domain.com/My_Account');
$webpage = curl_exec($ch); //Not showing the same content as in the browser
The last curl_exec is the one that doesn't work properly.
I have checked the page's logic extensively and used Tamper Data to analyse what was going on: there doesn't seem to be any JavaScript/AJAX call that would pull the bill history separately, and no POST request; as far as I can see, the bill history should be displayed at page load.
Any ideas as to what I could try, or what the problem could be? The fact that it works on other pages is especially puzzling.
Thanks in advance!
EDIT: it still doesn't work, but I have found another page on their site where I can get what I need and where the content is displayed correctly, so I no longer need a solution.
You might add additional header fields that "real" browsers usually transmit:
$header[] = 'Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5';
$header[] = 'Connection: keep-alive';
$header[] = 'Keep-Alive: 300';
$header[] = 'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7';
$header[] = 'Accept-Language: en-us,en;q=0.5';
Just to name a few.
If you happen to use Firefox, get the handy "Live HTTP Headers" extension and check what headers your browser transmits when loading the relevant page. Then try to do the same.
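To compare what cURL actually sends against what the browser sends, you can also ask cURL to record its outgoing request headers; a small sketch:
curl_setopt($ch, CURLINFO_HEADER_OUT, true); // record the request headers cURL sends
$webpage = curl_exec($ch);
echo curl_getinfo($ch, CURLINFO_HEADER_OUT); // dump them for comparison with the browser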
