This is a real bang-head-against-the-wall situation: this pattern works perfectly in JavaScript, and I have no idea what to do.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://yugioh.wikia.com/wiki/List_of_Yu-Gi-Oh!_BAM_cards');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$chHtml = curl_exec($ch);
curl_close($ch);
$patt = '/<table class="wikitable sortable card-list">[\s\S]*?<\/table/im'; // this is the problem line
preg_match($patt, $chHtml, $matches);
If I make the quantifier greedy ([\s\S]* instead of [\s\S]*?) it works fine, but then it matches all the way to the last </table on the page.
There is nothing wrong with the pattern; the problem is that you need a larger backtrack limit than the default.
Explanation:
In regex problems like this, always check for errors using preg_last_error().
If you do that for the specific response from the site you mentioned, you will see that you are getting a PREG_BACKTRACK_LIMIT_ERROR. This is a resource problem, so smaller texts do not raise the error.
Solution:
To overcome this limit, raise it with the following at the start of your script:
ini_set('pcre.backtrack_limit', 10000000);
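For completeness, a minimal sketch of how the raised limit and the error check fit together (only standard PCRE functions; the pattern is the one from the question):
// Raise the backtracking limit before running the expensive match.
ini_set('pcre.backtrack_limit', 10000000);

$patt = '/<table class="wikitable sortable card-list">[\s\S]*?<\/table/im';

if (preg_match($patt, $chHtml, $matches) === false) {
    // If this prints true, the limit above is still not high enough.
    var_dump(preg_last_error() === PREG_BACKTRACK_LIMIT_ERROR);
}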
I have a solr query that has been working perfectly:
$ch = curl_init();
$ch_searchURL = "$base_url/$collection/select?q=$s&wt=json&indent=true";
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_URL, $ch_searchURL);
$rawData = curl_exec($ch);
$json = json_decode($rawData,true);
Initially, my $s variable was literally one thing: e.g. ?q=name:brian, but my user base wanted the ability to search multiple things at once, so I started to build that in:
?q=name:("brian"+OR+"mike"+OR+"james"+OR+"emma"+OR+"luke")
It then got to the point where they wanted to search 5,000 things at once, which caused this method of building the Solr GET query to fail because the literal URL length exceeded the maximum allowed length of ~2,000 characters. So I thought using a POST might work, which I accomplished by adding the following lines:
$ch_searchURL = "$base_url/$collection/select";
$multiline_q = "q=$s&wt=json&indent=true";
curl_setopt($ch, CURLOPT_POSTFIELDS, $multiline_q);
This seemed to allow me to search for around 500 items at a time (which would still, in GET world, have meant a URL length of around 4,000 characters), so better than the GET method, but once I go past that number of items the Solr query fails again.
Because I'm POSTing (maybe?), I don't get any error response from solr, so I don't know what's causing the query to fail, and I can't manually test the query in the browser because it's ~40,000 characters long and won't paste. If I do var_dump($rawData);, I see this:
string(238) " 05 " // or 04, or 08
I've used solr quite a bit with PHP & cURL, but always with the GET method. This is my first foray into using POST. Am I doing something wrong here? Am I just exceeding the actual amount of q options that I can ask solr to retrieve for me, regardless of the method?
Any light that anyone could shed on this would be helpful...
There is no limit on the Solr side - we regularly use Solr in a similar way.
You need to look at the settings for your servlet container (Tomcat, Jetty etc.) and increase the maximum POST size. Look up maxPostSize if you are using Tomcat and maxFormContentSize if you are using Jetty.
source : link
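As a side note on the "I don't get any error response" part: cURL itself can surface whether the request failed at the HTTP level. A rough sketch, reusing the handle from the question's code:
$rawData = curl_exec($ch);
if ($rawData === false) {
    // Transport-level failure (connection, timeout, ...)
    echo 'cURL error: ' . curl_error($ch);
} else {
    // A 4xx/5xx status here usually points at the container rejecting the POST body.
    echo 'HTTP status: ' . curl_getinfo($ch, CURLINFO_HTTP_CODE);
}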
I've been trying to pull survey data for a client from their Survey Monkey account, and it seems that the more data there is, the more likely illegal characters are introduced into the resulting JSON string.
Below is a sample of what is returned on a bad response, every response is different and even shorter requests some times fail leaving me at a miss.
{
"survey_id": "REDACTED",
"title": "REDACTED",
"date_modified": "2014-XX-18 17:59:00",
"num_responses": 0,
"date_created": "�2014-01-21 10:29:00",
"question_count": 102
}
I can't fathom why this is happening; the more parameters there are in the fields option, the more illegal characters are introduced. It isn't just invalid characters: sometimes random letters are thrown in as well, which prevents me from handling the data correctly.
I am using Laravel 4 with the third party Survey Monkey library by oori
https://github.com/oori/php-surveymonkey
Any help in tracking down the issue would be appreciated; the deadline is pretty tight, and if this can't be resolved I'll have to resort to asking the client to manually import CSV files, which isn't ideal and introduces possible user error.
On a side note, I don't see this issue cropping up when using the same parameters on the Survey Monkey console.
O/S: Windows 8.1 with WAMP Server
Code used to execute the request
$Surveys = SurveyMonkey::getSurveyList(array
(
'page_size' => 1000,
'fields' => array
(
'title', 'question_count', 'num_responses', 'date_created', 'date_modified'
)
));
The SurveyMonkey facade is a custom package used to integrate the original Survey Monkey library located here:
https://github.com/oori/php-surveymonkey/blob/master/SurveyMonkey.class.php
Raw PHP cURL request
$header = array('Content-Type: application/json','Authorization: Bearer REDACTED');
$post = json_encode(array(
'fields' => array(
'title', 'question_count', 'num_responses', 'date_created', 'date_modified'
)
));
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://api.surveymonkey.net/v2/surveys/get_survey_list?api_key=REDACTED");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_ENCODING, 'UTF-8');
curl_setopt($ch, CURLOPT_POSTFIELDS, $post);
$result = curl_exec($ch);
The above request returns the same troublesome characters; nothing else was used to get the response.
Using the following code:
echo "\n".mb_detect_encoding($result, 'UTF-8', true);
This shows the charset of the response. When the request succeeds and no illegal characters are present (there are still random characters in the wrong places), it confirms that the response is in fact UTF-8; when illegal characters are present, false is returned, so nothing is output. More often than not, false is returned.
Maybe I'm grossly oversimplifying the whole thing, and apologies if so, but I have had these funny little chars pop into results, too.
They were leading and trailing whitespace.
Can you trim the data on retrieval and see if it still happens?
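For instance, a rough sketch of trimming the raw response before decoding it and checking why decoding fails (json_last_error_msg() needs PHP 5.5+):
// Strip leading/trailing whitespace from the raw API response before decoding.
$clean = trim($result);
$data  = json_decode($clean, true);
if (json_last_error() !== JSON_ERROR_NONE) {
    // Reports the reason, e.g. malformed UTF-8 characters.
    echo 'JSON error: ' . json_last_error_msg();
}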
This question has two parts:
Part I - restriction?
I'm able to store data to my DB with this:
www.mysite.com/myscript.php?testdata=abc123
This works for a short string (e.g. 'abc123'), and the page echoes what was written to the DB; however, if the [testdata=] string is longer than 512 chars and I check the database, it shows a row has been added but it's blank, and my echo statement in the script doesn't display the input string.
N.B. I'm on a shared server and have emailed my host to see if it's a restriction.
Part II - best practice?
If I can get past the above hurdle, I want to use a string that's ~15k chars long, created in a desktop app that concatenates the [testdata=] string from various parameters. What's the best way to send a long string via PHP POST?
Thanks in advance for your help; I'm not too savvy with PHP.
Edit: Table config:
Edit2: Row anomaly with long string > 512 chars:
Edit3: here's my PHP script, if it helps:
<?php
include("connect.php");
$data = mysql_real_escape_string($_GET['testdata']); // escape the input before building the query
$result = mysql_query("INSERT INTO test (testdata) VALUES ('$data')");
if ($result) // Check result
{
    echo $data;
}
else echo "Error " . mysql_error();
mysql_close();
?>
POST is definitely the method you want to use, and your best bet with that will be with cURL. Something like this should work:
$ch = curl_init();
curl_setopt( $ch, CURLOPT_URL, "http://www.mysite.com/myscript.php" );
curl_setopt( $ch, CURLOPT_POST, TRUE );
curl_setopt( $ch, CURLOPT_POSTFIELDS, $my_really_long_string );
$data = curl_exec( $ch );
You'll need to modify the above to include additional cURL options as per your environment, but something like this is what you'd be looking for.
You'll want to make sure that your DB field is long enough to hold the really long string as well.
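One detail worth spelling out, as a sketch rather than a drop-in fix: if the long string is posted as a named field, the receiving script can read it from $_POST (the 'testdata' name is reused from the question):
// Sender: post the long string as a named field so PHP parses it into $_POST.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.mysite.com/myscript.php");
curl_setopt($ch, CURLOPT_POST, TRUE);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array('testdata' => $my_really_long_string)));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
$response = curl_exec($ch);

// Receiver (myscript.php): read the field from $_POST instead of $_GET.
$data = isset($_POST['testdata']) ? $_POST['testdata'] : '';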
Answer 1: Yes, URL length is restricted. See more:
What is the maximum possible length of a query string?
Answer 2: You can send your string as an ordinary POST variable ($_POST). Just check the settings for maximum input and execution values in php.ini.
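If you would rather check those limits from code than open php.ini, something like this shows the directives that most often cap large POST bodies (the values differ per host):
var_dump(
    ini_get('post_max_size'),      // maximum size of the whole POST body
    ini_get('max_input_vars'),     // maximum number of input variables parsed
    ini_get('max_execution_time')  // script runtime limit in seconds
);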
Is it possible to use file_get_contents() to download only a portion of a file? For example, if I'm downloading a text file that is 2 MB and I only want the first 5 bytes, is this possible?
Sure. The additional arguments allow you to specify a portion of the file. See example #3 on the manual page:
<?php
// Read 14 characters starting from the 21st character
$section = file_get_contents('./people.txt', NULL, NULL, 20, 14);
var_dump($section);
?>
Here, the last two arguments limit the amount of data returned to just the portion of interest.
Note: The offset argument is a little unpredictable with remote files, as stated also on the manual page:
Seeking (offset) is not supported with remote files. Attempting to seek on non-local files may work with small offsets, but this is unpredictable because it works on the buffered stream.
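The maxlen argument, on the other hand, is honoured for remote streams as well, so limiting the read to the first few bytes can look roughly like this (example.com is just a placeholder URL):
// Read at most 5 bytes from the start of the remote file (offset 0, maxlen 5).
$firstBytes = file_get_contents('http://example.com/textfile.txt', false, null, 0, 5);
var_dump($firstBytes);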
function ranger($url, $bytes){
    $headers = array(
        // The Range header is inclusive, so bytes=0-4 returns the first 5 bytes.
        "Range: bytes=0-".($bytes - 1)
    );
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_HTTPHEADER, $headers);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
return curl_exec($curl);
}
$url = "http://example.com/textfile.txt";
$raw = ranger($url, 5);
echo $raw;
Keep in mind that the Range header must be supported by the server. With file_get_contents I think it is impossible; even if it were possible, you should use cURL.
If I download a file from a website using:
$html = file_get_html($url);
Then how can I know the size, in kilobytes, of the HTML string? I want to know because I want to skip files over 100 Kb.
If you do file_get_contents, you've already gotten the whole file.
If you mean "skip processing", rather than "skip retrieval", you can just get the length of the string: strlen($html). For kilobytes, divide that by 1024.
This is imprecise because the string may contain UTF-8 characters that are more than one byte long, and very small files will actually occupy a full filesystem block rather than just their byte length, but it's probably good enough for the arbitrary-threshold cutoff you're looking for.
To skip fetching large files, you want to use the cURL library.
<?php
function get_content_length($url) {
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_NOBODY, 1);
$hraw=explode("\r\n",curl_exec($ch));
curl_close($ch);
$hdrs=array();
foreach($hraw as $hdr) {
    $a=explode(": ", trim($hdr), 2);
    if (count($a) == 2) { // skip the status line and blank lines
        $hdrs[$a[0]]=$a[1];
    }
}
return (isset($hdrs['Content-Length'])) ? $hdrs['Content-Length'] : FALSE;
}
$url="http://www.example.com/";
if (get_content_length($url) < 100000) {
$html = file_get_contents($url);
print "Yes.\n";
} else {
print "No.\n";
}
?>
There may be a more elegant way to pull this information out of curl, but this is what came to mind fastest. YMMV.
Note that setting the CURLOPT options this way makes curl use a "HEAD" rather than "GET" request, so we're not actually fetching this URL twice.
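On the "more elegant way" note: one sketch is to let curl report the length itself after the HEAD request via curl_getinfo() (the helper name below is just for illustration, and some servers omit Content-Length):
function get_content_length_via_info($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_NOBODY, 1); // HEAD request, no body transferred
    curl_exec($ch);
    $len = curl_getinfo($ch, CURLINFO_CONTENT_LENGTH_DOWNLOAD);
    curl_close($ch);
    return ($len >= 0) ? $len : FALSE; // -1 means no Content-Length header was sent
}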
The definition of what a string is differs between PHP and the intuitive meaning:
"Hällo" (mind the umlaut) looks like a 5-character string, but to PHP it is really a 6-byte array (assuming UTF-8). PHP doesn't have a notion of a string representing text; it just sees a sequence of bytes (the PHP euphemism is "binary safe").
So strlen("Hällo") will be 6 (UTF-8).
That said, if you want to skip above 100Kb you probably won't mind if it is 99.5k characters translating to 100k bytes.
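A quick illustration of the byte-versus-character difference (assuming the source file is saved as UTF-8 and the mbstring extension is available):
$s = "Hällo";                // 5 characters, 6 bytes in UTF-8 ("ä" takes 2 bytes)
echo strlen($s);             // 6, counts bytes
echo mb_strlen($s, 'UTF-8'); // 5, counts characters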
file_get_html returns an object to you; the information about how big the string is gets lost at that point. Get the string first, build the object later:
$html = file_get_contents($url);
echo strlen($html); // size in bytes
$html = str_get_html($html);
You can use mb_strlen($html, '8bit') to force byte semantics, so that 1 character = 1 byte.
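For example (requires the mbstring extension):
echo mb_strlen($html, '8bit'); // size in bytes, equivalent to strlen() here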