[Updated At Bottom]
Hi everyone.
Start With Short URLs:
Imagine that you've got a collection of 5 short urls (like http://bit.ly) in a php array, like this:
$shortUrlArray = array("http://bit.ly/123",
"http://bit.ly/123",
"http://bit.ly/123",
"http://bit.ly/123",
"http://bit.ly/123");
End with Final, Redirected URLs:
How can I get the final url of these short urls with php? Like this:
http://www.example.com/some-directory/some-page.html
http://www.example.com/some-directory/some-page.html
http://www.example.com/some-directory/some-page.html
http://www.example.com/some-directory/some-page.html
http://www.example.com/some-directory/some-page.html
I have one method (found online) that works well with a single url, but when looping over multiple urls, it only works with the final url in the array. For your reference, the method is this:
function get_web_page( $url )
{
$options = array(
CURLOPT_RETURNTRANSFER => true, // return web page
CURLOPT_HEADER => true, // return headers
CURLOPT_FOLLOWLOCATION => true, // follow redirects
CURLOPT_ENCODING => "", // handle all encodings
CURLOPT_USERAGENT => "spider", // who am i
CURLOPT_AUTOREFERER => true, // set referer on redirect
CURLOPT_CONNECTTIMEOUT => 120, // timeout on connect
CURLOPT_TIMEOUT => 120, // timeout on response
CURLOPT_MAXREDIRS => 10, // stop after 10 redirects
);
$ch = curl_init( $url );
curl_setopt_array( $ch, $options );
$content = curl_exec( $ch );
$err = curl_errno( $ch );
$errmsg = curl_error( $ch );
$header = curl_getinfo( $ch );
curl_close( $ch );
//$header['errno'] = $err;
//$header['errmsg'] = $errmsg;
//$header['content'] = $content;
print($header[0]);
return $header;
}
//Using the above method in a for loop
$finalURLs = array();
$lineCount = count($shortUrlArray);
for($i = 0; $i <= $lineCount; $i++){
$singleShortURL = $shortUrlArray[$i];
$myUrlInfo = get_web_page( $singleShortURL );
$rawURL = $myUrlInfo["url"];
array_push($finalURLs, $rawURL);
}
Close, but not enough
This method works, but only with a single URL; I can't use it in a for loop, which is what I want to do. When used in a for loop as in the example above, the first four elements come back unchanged and only the final element is converted into its final URL. This happens whether the array is 5 elements or 500 elements long.
Solution Sought:
Please give me a hint as to how you'd modify this method to work inside a for loop with a collection of URLs (rather than just one).
-OR-
If you know of code that is better suited for this task, please include it in your answer.
Thanks in advance.
Update:
After some further prodding I've found that the problem lies not in the above method (which, after all, seems to work fine in for loops) but possibly in encoding. When I hard-code an array of short URLs, the loop works fine. But when I pass in a block of newline-separated URLs from an HTML form using GET or POST, the above-mentioned problem ensues. Are the URLs somehow being changed into a format not compatible with the method when I submit the form?
New Update:
You guys, I've found that my problem was due to something unrelated to the above method. The URL encoding of my short URLs converted what I thought were plain newline characters (separating the URLs) into %0D%0A, which is a carriage return plus line feed. All short URLs except the final one in the collection had a "ghost" carriage return appended to the tail, making it impossible to retrieve the final URLs for those. I identified the ghost character, corrected my PHP explode, and all works fine now. Sorry and thanks.
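For anyone hitting the same issue, here is a minimal sketch of a more forgiving split, assuming the URLs arrive newline-separated in a hypothetical $_POST['short_urls'] field:
// Split on \r\n, \r or \n and trim each entry, so no stray %0D "ghost" character
// survives at the end of any URL. The field name is just an example.
$shortUrlArray = array_filter(array_map('trim', preg_split('/\r\n|\r|\n/', $_POST['short_urls'])));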
This may be of some help: How to put string in array, split by new line?
You would probably do something like this, assuming you're getting the URLs returned in POST:
$final_urls = array();
$short_urls = explode( chr(10), $_POST['short_urls'] ); //You can replace chr(10) with "\n" or "\r\n", depending on how you get your urls. And of course, change $_POST['short_urls'] to the source of your string.
foreach ( $short_urls as $short ) {
$final_urls[] = get_web_page( $short );
}
I get the following output, using var_dump($final_urls); and your bit.ly url:
http://codepad.org/8YhqlCo1
And my source: $_POST['short_urls'] = "http://bit.ly/123\nhttp://bit.ly/123\nhttp://bit.ly/123\nhttp://bit.ly/123";
I also got an error using your function: Notice: Undefined offset: 0 in /var/www/test.php on line 27. Line 27 is print($header[0]); I'm not sure what you wanted there...
Here's my test.php, if it will help: http://codepad.org/zI2wAOWL
I think you almost have it there. Try this:
$shortUrlArray = array("http://yhoo.it/2deaFR",
"http://bit.ly/900913",
"http://bit.ly/4m1AUx");
$finalURLs = array();
$lineCount = count($shortUrlArray);
for($i = 0; $i < $lineCount; $i++){
$singleShortURL = $shortUrlArray[$i];
$myUrlInfo = get_web_page( $singleShortURL );
$rawURL = $myUrlInfo["url"];
printf($rawURL."\n");
array_push($finalURLs, $rawURL);
}
I implemented a script that reads a plain text file with one shortened URL per line and writes out the corresponding redirect URL for each:
<?php
// input: textfile with one bitly shortened url per line
$plain_urls = file_get_contents('in.txt');
$bitly_urls = explode("\r\n", $plain_urls);
// output: where should we write
$w_out = fopen("out.csv", "a+") or die("Unable to open file!");
foreach($bitly_urls as $bitly_url) {
$c = curl_init($bitly_url);
curl_setopt($c, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36');
curl_setopt($c, CURLOPT_FOLLOWLOCATION, 0);
curl_setopt($c, CURLOPT_HEADER, 1);
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($c, CURLOPT_CONNECTTIMEOUT, 20);
// curl_setopt($c, CURLOPT_PROXY, 'localhost:9150');
// curl_setopt($c, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5);
$r = curl_exec($c);
// get the redirect url:
$redirect_url = curl_getinfo($c)['redirect_url'];
// write output as csv
$out = '"'.$bitly_url.'";"'.$redirect_url.'"'."\n";
fwrite($w_out, $out);
// free the curl handle before the next URL
curl_close($c);
}
fclose($w_out);
Have fun and enjoy!
pw
Related
I can't allow file_get_contents to run for more than 1 second; if that's not possible, I need to skip to the next loop iteration.
for ($i = 0; $i <=59; ++$i) {
$f=file_get_contents('http://example.com');
if(timeout<1 sec) - do something and loop next;
else skip file_get_contents(), do something else, and loop next;
}
Is it possible to make a function like this?
Actually I'm using curl_multi and I can't figure out how to set a timeout on a WHOLE curl_multi request.
If you are working with HTTP URLs only, you can do the following:
$ctx = stream_context_create(array(
'http' => array(
'timeout' => 1
)
));
for ($i = 0; $i <=59; $i++) {
file_get_contents("http://example.com/", 0, $ctx);
}
However, this is just the read timeout, meaning the time between two read operations (or the time before the first read operation). If the download rate is constant, there should not be such gaps in the download, and the download could still take even an hour.
If you want the whole download to take no more than a second, you can't use file_get_contents() anymore; I would encourage you to use cURL in this case. Like this:
for ($i = 0; $i < 59; $i++) {
// create a fresh curl resource each iteration (curl_close at the end of the loop invalidates it)
$ch = curl_init();
// set url
curl_setopt($ch, CURLOPT_URL, "example.com");
// set timeout
curl_setopt($ch, CURLOPT_TIMEOUT, 1);
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// $output contains the output string
$output = curl_exec($ch);
// close curl resource to free up system resources
curl_close($ch);
}
$ctx = stream_context_create(array(
'http' => array(
'timeout' => 1
)
)
);
file_get_contents("http://example.com/", 0, $ctx);
I have around 600k image URLs in different tables and am downloading all the images with the code below, and it is working fine. (I know FTP is the best option but somehow I can't use it.)
$queryRes = mysql_query("SELECT url FROM tablName LIMIT 50000"); // everytime I am using LIMIT
while ($row = mysql_fetch_object($queryRes)) {
$info = pathinfo($row->url);
$fileName = $info['filename'];
$fileExtension = $info['extension'];
try {
copy("http:".$row->url, "img/$fileName"."_".$row->id.".".$fileExtension);
} catch(Exception $e) {
echo "<br/>\n unable to copy '$fileName'. Error:$e";
}
}
Problems are:
After some time, say 10 minutes, the script gives a 503 error but still continues downloading the images. Why doesn't it stop copying?
And it does not download all the images; every time there is a difference of 100 to 150 images. So how can I trace which images were not downloaded?
I hope I have explained well.
First of all, copy will not throw any exception, so you are not doing any error handling; that's why your script continues to run.
Second, you should use file_get_contents or, even better, cURL.
For example, you could try this function (I know it opens and closes cURL every time; it's just an example I found here: https://stackoverflow.com/a/6307010/1164866):
function getimg($url) {
$headers[] = 'Accept: image/gif, image/x-bitmap, image/jpeg, image/pjpeg';
$headers[] = 'Connection: Keep-Alive';
$headers[] = 'Content-type: application/x-www-form-urlencoded;charset=UTF-8';
$user_agent = 'php';
$process = curl_init($url);
curl_setopt($process, CURLOPT_HTTPHEADER, $headers);
curl_setopt($process, CURLOPT_HEADER, 0);
curl_setopt($process, CURLOPT_USERAGENT, $user_agent);
curl_setopt($process, CURLOPT_TIMEOUT, 30);
curl_setopt($process, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($process, CURLOPT_FOLLOWLOCATION, 1);
$return = curl_exec($process);
curl_close($process);
return $return;
}
Or even better, try to do it with curl_multi_exec and get your files downloaded in parallel, which will be a lot faster.
Take a look here:
http://www.php.net/manual/en/function.curl-multi-exec.php
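A minimal sketch of what that could look like, assuming a small list of placeholder image URLs and arbitrary output file names:
// hypothetical list of image URLs to fetch in parallel
$urls = array('http://example.com/a.jpg', 'http://example.com/b.jpg');
$mh = curl_multi_init();
$handles = array();
foreach ($urls as $i => $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_multi_add_handle($mh, $ch);
    $handles[$i] = $ch;
}
// drive all transfers until every one has finished
$running = null;
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh); // wait for activity instead of busy-looping
} while ($running > 0);
foreach ($handles as $i => $ch) {
    if (curl_errno($ch) === 0) {
        file_put_contents("img/download_$i.jpg", curl_multi_getcontent($ch));
    }
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);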
edit:
To track which files failed to download, you need to do something like this:
$queryRes = mysql_query("select url from tablName limit 50000"); //everytime i am using limit
while($row = mysql_fetch_object($queryRes)) {
$info = pathinfo($row->url);
$fileName = $info['filename'];
$fileExtension = $info['extension'];
if (!@copy("http:".$row->url, "img/$fileName"."_".$row->id.".".$fileExtension)) {
$errors= error_get_last();
echo "COPY ERROR: ".$errors['type'];
echo "<br />\n".$errors['message'];
//you can add whatever code you want here: output to the console, log to a file, or call exit() to stop downloading...
}
}
more info: http://www.php.net/manual/es/function.copy.php#83955
I haven't used copy myself; I'd use file_get_contents, which works fine with remote servers.
edit:
It also returns false on failure, so...
if( false === file_get_contents(...) )
trigger_error(...);
I think 50000 is too large. Network operations are time-consuming; downloading an image might take over 100 ms (depending on your network conditions), so 50000 images, in the most stable case (without timeouts or other errors), might take 50000*100/1000/60 = 83 minutes, which is a really long time for a PHP script. If you run this script as CGI (not CLI), you normally only get 30 seconds by default (without set_time_limit). So I recommend making this script a cronjob and running it every 10 seconds to fetch about 50 URLs.
To make the script only fetch a few images each time, you must remember which ones have already been processed (successfully). For example, you can add a flag column to the url table: by default, flag = 1; if a URL is processed successfully, it becomes 2; otherwise it becomes 3, which means something went wrong with that URL. Each time, the script should only select the rows with flag = 1 (3 might also be included, but sometimes the URL is so wrong that retrying won't help).
The copy function is too simple; I recommend using cURL instead. It's more reliable, and you can get the exact network info for the download.
Here the code:
//only fetch 50 urls each time
$queryRes = mysql_query ( "select id, url from tablName where flag=1 limit 50" );
//just prefer absolute path
$imgDirPath = dirname ( __FILE__ ) . '/';
while ( $row = mysql_fetch_object ( $queryRes ) )
{
$info = pathinfo ( $row->url );
$fileName = $info ['filename'];
$fileExtension = $info ['extension'];
//url in the table is like //www.example.com???
$result = fetchUrl ( "http:" . $row->url,
$imgDirPath + "img/$fileName" . "_" . $row->id . "." . $fileExtension );
if ($result !== true)
{
echo "<br/>\n unable to copy '$fileName'. Error:$result";
//update flag to 3, finish this func yourself
set_row_flag ( 3, $row->id );
}
else
{
//update flag to 2
set_row_flag ( 2, $row->id );
}
}
function fetchUrl($url, $saveto)
{
$ch = curl_init ( $url );
curl_setopt ( $ch, CURLOPT_FOLLOWLOCATION, true );
curl_setopt ( $ch, CURLOPT_MAXREDIRS, 3 );
curl_setopt ( $ch, CURLOPT_HEADER, false );
curl_setopt ( $ch, CURLOPT_RETURNTRANSFER, true );
curl_setopt ( $ch, CURLOPT_CONNECTTIMEOUT, 7 );
curl_setopt ( $ch, CURLOPT_TIMEOUT, 60 );
$raw = curl_exec ( $ch );
$error = false;
if (curl_errno ( $ch ))
{
$error = curl_error ( $ch );
}
else
{
$httpCode = curl_getinfo ( $ch, CURLINFO_HTTP_CODE );
if ($httpCode != 200)
{
$error = 'HTTP code not 200: ' . $httpCode;
}
}
curl_close ( $ch );
if ($error)
{
return $error;
}
file_put_contents ( $saveto, $raw );
return true;
}
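set_row_flag() is left to finish above; a minimal sketch, assuming the same (deprecated) mysql_* connection and the tablName/flag/id columns used in the rest of this answer:
function set_row_flag($flag, $id)
{
    // cast both values, even though they come from our own code and the database
    $flag = (int) $flag;
    $id = (int) $id;
    mysql_query("UPDATE tablName SET flag = $flag WHERE id = $id");
}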
Strict checking of the mysql_fetch_object return value is IMO better, as many similar functions may return a non-boolean value that evaluates to false when checked loosely (e.g. via !=).
You do not fetch the id attribute in your query. Your code should not work as you wrote it.
You define no order of rows in the result. It is almost always desirable to have an explicit order.
The LIMIT clause leads to processing only a limited number of rows. If I get it correctly, you want to process all the URLs.
You are using a deprecated API to access MySQL. You should consider using a more modern one. See the database FAQ at PHP.net. I did not fix this one.
As already said multiple times, copy does not throw; it returns a success indicator.
Variable expansion was clumsy. This one is a purely cosmetic change, though.
To be sure the generated output gets to the user ASAP, use flush. When using output buffering (ob_start etc.), it needs to be handled too; see the sketch after this list.
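A minimal sketch of that last point, assuming output buffering may be active; the echoed message is just an example:
echo "copied: some-file.jpg\n"; // example progress message
if (ob_get_level() > 0) {
    ob_flush(); // push PHP's output buffer down to the output layer first
}
flush(); // then ask the web server to send it to the client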
With fixes applied, the code now looks like this:
$queryRes = mysql_query("SELECT id, url FROM tablName ORDER BY id");
while (($row = mysql_fetch_object($queryRes)) !== false) {
$info = pathinfo($row->url);
$fn = $info['filename'];
if (copy(
'http:' . $row->url,
"img/{$fn}_{$row->id}.{$info['extension']}"
)) {
echo "success: $fn\n";
} else {
echo "fail: $fn\n";
}
flush();
}
The issue #2 is solved by this. You will see which files were and were not copied. If the process (and its output) stops too early, then you know the id of the last processed row and you can query your DB for the higher ones (not processed). Another approach is adding a boolean column copied to tblName and updating it immediately after successfully copying the file. Then you may want to change the query in the code above to not include rows with copied = 1 already set.
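A minimal sketch of that copied-column approach, reusing the query from the fixed code above; the ALTER statement and column type are assumptions:
// run once (assumed schema change): ALTER TABLE tblName ADD copied TINYINT(1) NOT NULL DEFAULT 0
$queryRes = mysql_query("SELECT id, url FROM tblName WHERE copied = 0 ORDER BY id");
while (($row = mysql_fetch_object($queryRes)) !== false) {
    $info = pathinfo($row->url);
    if (copy('http:' . $row->url, "img/{$info['filename']}_{$row->id}.{$info['extension']}")) {
        // mark the row as done immediately so a restarted run skips it
        mysql_query("UPDATE tblName SET copied = 1 WHERE id = {$row->id}");
    }
    flush();
}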
The issue #1 is addressed in Long computation in php results in 503 error here on SO and 503 service unavailable when debugging PHP script in Zend Studio on SU. I would recommend splitting the large batch into smaller ones, launched at a fixed interval. Cron seems to be the best option to me. Is there any need to launch this huge batch from a browser? It will run for a very long time.
It is better handled batch-by-batch.
The actual script
Table structure
CREATE TABLE IF NOT EXISTS `images` (
`id` int(60) NOT NULL AUTO_INCREMENT,
`link` varchar(1024) NOT NULL,
`status` enum('not fetched','fetched') NOT NULL DEFAULT 'not fetched',
`timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`)
);
The script
<?php
// how many images to download in one go?
$limit = 100;
/* if set to true, the scraper reloads itself. Good for running on localhost without cron job support. Just keep the browser open and the script runs by itself ( javascript is needed) */
$reload = false;
// to prevent php timeout
set_time_limit(0);
// db connection ( you need pdo enabled)
try {
$host = 'localhost';
$dbname= 'mydbname';
$user = 'root';
$pass = '';
$DBH = new PDO("mysql:host=$host;dbname=$dbname", $user, $pass);
}
catch(PDOException $e) {
echo $e->getMessage();
}
$DBH->setAttribute( PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION );
// get n number of images that are not fetched
$query = $DBH->prepare("SELECT * FROM images WHERE status = 'not fetched' LIMIT {$limit}");
$query->execute();
$files = $query->fetchAll();
// if no result, don't run
if(empty($files)){
echo 'All files have been fetched!!!';
die();
}
// where to save the images?
$savepath = dirname(__FILE__).'/scrapped/';
// fetch 'em!
foreach($files as $file){
// get_url_content uses curl. Function defined later-on
$content = get_url_content($file['link']);
// get the file name from the url. You can use random name too.
$url_parts_array = explode('/' , $file['link']);
/* assuming the image url is something like http://abc.com/images/myimage.png, if we explode the string by /, the last element of the exploded array holds the filename */
$filename = $url_parts_array[count($url_parts_array) - 1];
// save fetched image
file_put_contents($savepath.$filename , $content);
// did the image save?
if(file_exists($savepath.$filename))
{
// yes? Okay, let's save the status
$query = $DBH->prepare("update images set status = 'fetched' WHERE id = ".$file['id']);
// output the name of the file that just got downloaded
echo $file['link']; echo '<br/>';
$query->execute();
}
}
// function definition get_url_content()
function get_url_content($url){
// ummm let's make our bot look like human
$agent= 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_URL,$url);
return curl_exec($ch);
}
//reload enabled? Reload!
if($reload)
echo '<script>location.reload(true);</script>';
503 is a fairly generic error, which in this case probably means something timed out. This could be your web server, a proxy somewhere along the way, or even PHP.
You need to identify which component is timing out. If it's PHP, you can use set_time_limit.
Another option might be to break the work up so that you only process one file per request, then redirect back to the same script to continue processing the rest. You would have to somehow maintain a list of which files have been processed between calls. Or process in order of database id, and pass the last used id to the script when you redirect.
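A minimal sketch of that redirect approach, reusing the table and columns from the question; the last_id query parameter is just an assumed name:
$lastId = isset($_GET['last_id']) ? (int) $_GET['last_id'] : 0;
$res = mysql_query("SELECT id, url FROM tablName WHERE id > $lastId ORDER BY id LIMIT 1");
if ($row = mysql_fetch_object($res)) {
    $info = pathinfo($row->url);
    @copy('http:' . $row->url, "img/{$info['filename']}_{$row->id}.{$info['extension']}");
    // hand off to a fresh request for the next row
    header('Location: ' . $_SERVER['PHP_SELF'] . '?last_id=' . $row->id);
    exit;
}
echo 'All images processed.';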
I am trying to build a list of all the city pages on ghix.com, which does not have such a complete directory. To do this I am using their 'city id' which is unique for each city, but does not follow any particular order.
I am using cURL and PHP to loop through the possible URLs to find those that match actual cities. Simple enough. See the code below, which produces a 500 Internal Server Error.
If it worked, the output should be a list of 'city ids' that do not match actual cities. If the URL matches an actual city, the page will not contain (0); if it does not match a city, it will contain (0).
I have looked this over and corrected it several times; what is causing the error?
<html>
<?php
for ($i = 1; ; $i <= 1000000; $i++) {
$url = "http://www.ghix.com/goto/dynamic/city?CityID=" . $i;
$term="(0)";
curl_setopt($ch, CURLOPT_URL, trim($url));
$html = curl_exec($ch);
if ($html !== FALSE && stristr($html, $term) !== FALSE) { // Found!
echo $i;
Echo "br/";
}
}
?>
</html>
UPDATE
a slightly different approach I tried, with the same effect...
<html>
<?php
for ($i = 1; $i <= 100; $i++) {
$url = "http://www.ghix.com/goto/dynamic/city?CityID=" . $i;
$term="(0)";
curl_setopt($ch, CURLOPT_URL, trim($url));
$html = curl_exec($ch);
if (strpos($ch,$term)) {
echo $url;
echo "<br>";
}
?>
</html>
In your first chunk of code, you have an extra ; in the for conditions. Next, you need to initialize cURL with $ch = curl_init(); near the beginning; that opens the handle $ch that you call on later. Finally, use != for the false condition in the if instead of the exclamation and double equals (!==). After these fixes, I'm not getting any 500 errors. After that, it's just a matter of collecting the data from the pages and putting it in the right places.
In the second chunk of code, you still need to initialize cURL. Then you need to put in another curly bracket at the end to close out the for loop. And then it's a matter of dealing with the output from cURL.
It seems you are getting more typo errors than anything. Watch your server logs and they will tell you more about what you are looking for. And for cURL, read up on the options that you can set in PHP on the PHP site. It's a good read. Good luck.
What I have found is that you can use the auto-complete JSON request to find all the city ids.
The JSON request URL is http://www.ghix.com/goto/dynamic/suggest?InputVarName=q&q=FRAGMENT&Type=json&SkipPrerequisites=1. Here FRAGMENT is the letters you type in the input box. Iteratively requesting that URL will reveal all the CityIDs you are looking for.
Be aware, they may have bot protection for such ajax queries.
I noticed their JSON is malformed. It can be fixed using the Services_JSON PEAR package.
require_once 'Services/JSON.php';
$Services_JSON = new Services_JSON();
$json = $Services_JSON->decode($jsons);
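A minimal sketch of iterating the suggest endpoint with that decoder; the fragment list, the sleep time and the shape of the decoded response are assumptions, so inspect a real response and adjust:
require_once 'Services/JSON.php';
$Services_JSON = new Services_JSON();
$city_ids = array();
foreach (range('a', 'z') as $fragment) {
    $url = 'http://www.ghix.com/goto/dynamic/suggest?InputVarName=q&q=' . $fragment
         . '&Type=json&SkipPrerequisites=1';
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $jsons = curl_exec($ch);
    curl_close($ch);
    $json = $Services_JSON->decode($jsons);
    // assumed structure: a list of suggestion objects carrying a CityID property
    if (is_array($json)) {
        foreach ($json as $suggestion) {
            if (isset($suggestion->CityID)) {
                $city_ids[$suggestion->CityID] = true;
            }
        }
    }
    usleep(500000); // they may have bot protection, so don't hammer the endpoint
}
print_r(array_keys($city_ids));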
Using dxtool/WebGet the following code seems to be working.
require_once("WebGet.php");
// common headers to make this request more real.
$headers = array(
"Accept-Charset" => "ISO-8859-1,utf-8;q=0.7,*;q=0.7",
"Accept-Encoding" => "gzip, deflate",
"Accept-Language" => "en-us,en;q=0.5",
"Connection" => "keep-alive",
"Referer" => "http://www.ghix.com/goto/dynamic/search",
"User-Agent" => "Mozilla/5.0 (Ubuntu; X11; Linux x86_64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1"
);
$w = new WebGet();
$w->cookieFile = dirname(__FILE__) . "/cookie.txt";
// do a page landing
$w->requestContent("http://www.ghix.com/goto/dynamic/search", array(), array(), $headers);
$city_ids = array();
for ($i = 1; $i <= 1000; $i++) {
$url = "http://www.ghix.com/goto/dynamic/city?CityID=" . $i;
$term = "> (0)<";
$w->requestContent($url, array(), array(), $headers);
// sleep some times to make it more like request from human
usleep(10000);
if (strpos($w->cachedContent, $term) !== FALSE){
$city_ids[] = $i;
echo $url, PHP_EOL;
}
}
print_r($city_ids);
I'm trying to convert this PHP cURL function to work with my Rails app. The piece of code is from an SMS payment gateway that needs to verify the POST parameters. Since I'm a big PHP noob I have no idea how to handle this problem.
$verify_url = 'http://smsgatewayadress';
$fields = '';
$d = array(
'merchant_ID' => $_POST['merchant_ID'],
'local_ID' => $_POST['local_ID'],
'total' => $_POST['total'],
'ipn_verify' => $_POST['ipn_verify'],
'timeout' => 10,
);
foreach ($d as $k => $v)
{
$fields .= $k . "=" . urlencode($v) . "&";
}
$fields = substr($fields, 0, strlen($fields)-1);
$ch = curl_init($verify_url); //this initialises a cURL session for $verify_url; the handle is stored in $ch
curl_setopt($ch, CURLOPT_POST, 1); //sets the delivery method as POST
curl_setopt($ch, CURLOPT_POSTFIELDS, $fields); //The data that is being sent via POST. From what I can see the cURL lib sends them as a string that is built in the foreach loop above
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); //This verifies if the target url sends a redirect header and if it does cURL follows that link
curl_setopt($ch, CURLOPT_HEADER, 0); //This omits the response headers from the output
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); //This makes curl_exec below return the response as a string instead of printing it
$result = curl_exec($ch); //It transfers the data via POST to the URL and returns the response
if ($result == true)
{
//confirmed
$can_download = true;
}
else
{
//failed
$can_download = false;
}
}
if (strpos($_SERVER['REQUEST_URI'], 'ipn.php'))
echo $can_download ? '1' : '0'; //we tell the sms server that we processed the request
I've googled for a cURL library counterpart in Rails and found a ton of options, but none that I could understand and use in the same way this script does.
If anyone could give me a hand with converting this script from php to ruby it would be greatly appreciated.
The most direct approach might be to use the Ruby curb library, which is the most straightforward wrapper for cURL. A lot of the options in Curl::Easy map directly to what you have here. A basis might be:
url = "http://smsgatewayadress/"
Curl::Easy.http_post(url,
Curl::PostField.content('merchant_ID', params[:merchant_ID]),
# ...
)
I am trying to load an XML file from a different domain name as a string. All I want is an array of the text within the <title></title> tags of the XML file, so I am thinking that since I am using PHP 4, the easiest way would be to use a regex to get them. Can someone explain how to load the XML as a string? Thanks!
You could use cURL like the example below. I should add that regex-based XML parsing is generally not a good idea, and you may be better off using a real parser, especially if it gets any more complicated.
You may also want to add some regex modifiers to make it work across multiple lines etc., but I assume the question is more about fetching the content into a string.
<?php
$curl = curl_init('http://www.example.com');
//make content be returned by curl_exec rather than being printed immediately
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($curl);
if ($result !== false) {
if (preg_match('|<title>(.*)</title>|i', $result, $matches)) {
echo "Title is '{$matches[1]}'";
} else {
//did not find the title
}
} else {
//request failed
die (curl_error($curl));
}
First use
file_get_contents('http://www.example.com/');
to get the file and store it in a variable, then parse the XML. The manual page is
http://php.net/manual/en/function.xml-parse.php
and there are examples in the comments.
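A minimal sketch of that approach; the feed URL is a placeholder, and the old-style callback functions are used so it also runs on PHP 4 (element names arrive uppercased by default):
$titles = array();
$inTitle = false;
function startElement($parser, $name, $attrs) {
    global $inTitle;
    if ($name === 'TITLE') { $inTitle = true; }
}
function endElement($parser, $name) {
    global $inTitle;
    if ($name === 'TITLE') { $inTitle = false; }
}
function characterData($parser, $data) {
    global $inTitle, $titles;
    if ($inTitle) { $titles[] = trim($data); }
}
$xml = file_get_contents('http://www.example.com/feed.xml'); // placeholder URL
$parser = xml_parser_create();
xml_set_element_handler($parser, 'startElement', 'endElement');
xml_set_character_data_handler($parser, 'characterData');
xml_parse($parser, $xml, true);
xml_parser_free($parser);
print_r($titles); // every <title> text found in the document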
If you're loading well-formed xml, skip the character-based parsing, and use the DOM functions:
$d = new DOMDocument;
$d->load("http://url/file.xml");
$titles = $d->getElementsByTagName('title');
if ($titles) {
echo $titles->item(0)->nodeValue;
}
If you can't use DOMDocument::load() due to how PHP is set up, then use cURL to grab the file and do:
$d = new DOMDocument;
$d->loadXML($grabbedfile);
...
I have this function as a snippet:
function getHTML($url) {
if($url == false || empty($url)) return false;
$options = array(
CURLOPT_URL => $url, // URL of the page
CURLOPT_RETURNTRANSFER => true, // return web page
CURLOPT_HEADER => false, // don't return headers
CURLOPT_FOLLOWLOCATION => true, // follow redirects
CURLOPT_ENCODING => "", // handle all encodings
CURLOPT_USERAGENT => "spider", // who am i
CURLOPT_AUTOREFERER => true, // set referer on redirect
CURLOPT_CONNECTTIMEOUT => 120, // timeout on connect
CURLOPT_TIMEOUT => 120, // timeout on response
CURLOPT_MAXREDIRS => 3, // stop after 3 redirects
);
$ch = curl_init( $url );
curl_setopt_array( $ch, $options );
$content = curl_exec( $ch );
$header = curl_getinfo( $ch );
curl_close( $ch );
//Ending all that cURL mess...
//Removing linebreaks,multiple whitespace and tabs for easier Regexing
$content = str_replace(array("\n", "\r", "\t", "\0", "\x0B"), '', $content);
$content = preg_replace('/\s\s+/', ' ', $content);
$this->profilehtml = $content;
return $content;
}
That returns the HTML with no line breaks, tabs, or multiple spaces, all on a single line.
So now you do this preg_match:
$html = getHTML($url);
preg_match('|<title>(.*)</title>|iUsm',$html,$matches);
and $matches[1] will have the info you need.