I'm working on an app that collects all the URLs from an array of sites and displays them as an array or as JSON.
I can do it with a for loop, but the problem is the execution time: when I tried 10 URLs, I got an error saying the maximum execution time was exceeded.
While searching I came across curl_multi.
I also found "Fast PHP CURL Multiple Requests: Retrieve the content of multiple URLs using CURL". I tried to add it to my code, but it didn't work because I don't know how to use the function.
I hope you can help me.
Thanks.
This is my sample code:
<?php
$urls = array(
    'http://site1.com/',
    'http://site2.com/',
    'http://site3.com/');

$mh = curl_multi_init();
foreach ($urls as $i => $url) {

    $urlContent = file_get_contents($url);

    $dom = new DOMDocument();
    @$dom->loadHTML($urlContent);
    $xpath = new DOMXPath($dom);
    $hrefs = $xpath->evaluate("/html/body//a");
    for ($i = 0; $i < $hrefs->length; $i++) {
        $href = $hrefs->item($i);
        $url = $href->getAttribute('href');
        $url = filter_var($url, FILTER_SANITIZE_URL);
        // validate url
        if (!filter_var($url, FILTER_VALIDATE_URL) === false) {
            echo ''.$url.'<br />';
        }
    }

    $conn[$i] = curl_init($url);
    $fp[$i] = fopen($g, "w");
    curl_setopt($conn[$i], CURLOPT_FILE, $fp[$i]);
    curl_setopt($conn[$i], CURLOPT_HEADER, 0);
    curl_setopt($conn[$i], CURLOPT_CONNECTTIMEOUT, 60);
    curl_multi_add_handle($mh, $conn[$i]);
}

do {
    $n = curl_multi_exec($mh, $active);
} while ($active);

foreach ($urls as $i => $url) {
    curl_multi_remove_handle($mh, $conn[$i]);
    curl_close($conn[$i]);
    fclose($fp[$i]);
}
curl_multi_close($mh);
?>
Here is a function that I put together that will properly utilize the curl_multi_init() function. It is more or less the same function that you will find on PHP.net with some minor tweaks. I have had great success with this.
function multi_thread_curl($urlArray, $optionArray, $nThreads) {

    //Group your urls into groups/threads.
    $curlArray = array_chunk($urlArray, $nThreads, $preserve_keys = true);

    //Iterate through each batch of urls.
    $ch = 'ch_';
    foreach ($curlArray as $threads) {

        //Create your cURL resources.
        foreach ($threads as $thread => $value) {
            ${$ch . $thread} = curl_init();
            curl_setopt_array(${$ch . $thread}, $optionArray); //Set your main curl options.
            curl_setopt(${$ch . $thread}, CURLOPT_URL, $value); //Set url.
        }

        //Create the multiple cURL handler.
        $mh = curl_multi_init();

        //Add the handles.
        foreach ($threads as $thread => $value) {
            curl_multi_add_handle($mh, ${$ch . $thread});
        }

        $active = null;

        //Execute the handles.
        do {
            $mrc = curl_multi_exec($mh, $active);
        } while ($mrc == CURLM_CALL_MULTI_PERFORM);

        while ($active && $mrc == CURLM_OK) {
            if (curl_multi_select($mh) != -1) {
                do {
                    $mrc = curl_multi_exec($mh, $active);
                } while ($mrc == CURLM_CALL_MULTI_PERFORM);
            }
        }

        //Get your data and close the handles.
        foreach ($threads as $thread => $value) {
            $results[$thread] = curl_multi_getcontent(${$ch . $thread});
            curl_multi_remove_handle($mh, ${$ch . $thread});
        }

        //Close the multi handle exec.
        curl_multi_close($mh);
    }

    return $results;
}
//Add whatever options here. The CURLOPT_URL is left out intentionally.
//It will be added in later from the url array.
$optionArray = array(
CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0',//Pick your user agent.
CURLOPT_RETURNTRANSFER => TRUE,
CURLOPT_TIMEOUT => 10
);
//Create an array of your urls.
$urlArray = array(
'http://site1.com/',
'http://site2.com/',
'http://site3.com/'
);
//Play around with this number and see what works best.
//This is how many urls it will try to do at one time.
$nThreads = 20;
//To use run the function.
$results = multi_thread_curl($urlArray, $optionArray, $nThreads);
Once this is complete, you will have an array containing all of the HTML from your list of websites. At this point I would loop through them and pull out all of the URLs, like so:
foreach ($results as $page) {
    $dom = new DOMDocument();
    @$dom->loadHTML($page);
    $xpath = new DOMXPath($dom);
    $hrefs = $xpath->evaluate("/html/body//a");
    for ($i = 0; $i < $hrefs->length; $i++) {
        $href = $hrefs->item($i);
        $url = $href->getAttribute('href');
        $url = filter_var($url, FILTER_SANITIZE_URL);
        // validate url
        if (!filter_var($url, FILTER_VALIDATE_URL) === false) {
            echo ''.$url.'<br />';
        }
    }
}
It is also worth keeping in the back of your head the ability to increase the run time of your script. This is done by:
ini_set('max_execution_time', 120);
If you're using a hosting service, you may be restricted to something in the ballpark of two minutes regardless of what you set your max execution time to. Just food for thought.
You can always allow more time, but you'll never know how much you actually need until you time it.
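For what it's worth, a quick way to time a run is a couple of microtime() calls around the fetch (a rough sketch using the function above):
// Rough timing sketch: measure how long the whole fetch takes.
$start = microtime(true);

$results = multi_thread_curl($urlArray, $optionArray, $nThreads);

$elapsed = microtime(true) - $start;
echo 'Fetched ' . count($results) . ' pages in ' . round($elapsed, 2) . " seconds\n";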
Hope it helps.
You may be using an endless loop - if not, you can increase the maximum execution time in php.ini or with:
ini_set('max_execution_time', 600); // 600 seconds = 10 minutes
This is what I achieved after working on the code. It works, but I'm not sure if this is the best answer. Kindly check my code.
<?php
$array = array(
    'https://www.google.com/',
    'https://www.google.com/',
    'https://www.google.com/',
    'https://www.google.com/',
    'https://www.google.com/',
    'https://www.google.com/',
    'https://www.google.com/',
    'https://www.google.com/',
    'https://www.google.com/',
    'https://www.google.com/'
);

print_r(getUrls($array));

function getUrls($array) {
    $arrUrl = array();
    $arrList = array();
    $url_count = count($array);
    $curl_array = array();
    $ch = curl_multi_init();

    foreach ($array as $count => $url) {
        $curl_array[$count] = curl_init($url);
        curl_setopt($curl_array[$count], CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($ch, $curl_array[$count]);
    }

    do {
        curl_multi_exec($ch, $exec);
        curl_multi_select($ch, 1);
    } while ($exec);

    foreach ($array as $count => $url) {
        $arrUrl = array();
        $urlContent = curl_multi_getcontent($curl_array[$count]);
        $dom = new DOMDocument();
        @$dom->loadHTML($urlContent);
        $xpath = new DOMXPath($dom);
        $hrefs = $xpath->evaluate("/html/body//a");
        for ($i = 0; $i < $hrefs->length; $i++) {
            $href = $hrefs->item($i);
            $url = $href->getAttribute('href');
            $url = filter_var($url, FILTER_SANITIZE_URL);
            // validate url
            if (filter_var($url, FILTER_VALIDATE_URL) !== false) {
                if (strpos($url, 'mailto') === false) {
                    $arrUrl[] = $url;
                }
            }
        }
        array_push($arrList, array_unique($arrUrl));
    }

    foreach ($array as $count => $url) {
        curl_multi_remove_handle($ch, $curl_array[$count]);
    }

    curl_multi_close($ch);

    foreach ($array as $count => $url) {
        curl_close($curl_array[$count]);
    }

    return $arrList;
}
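Since the original goal was to display the result as an array or JSON, here is a minimal usage sketch for the function above (the JSON output is just one option, not part of the original code):
// Hypothetical usage: return the collected links as JSON instead of print_r() output.
header('Content-Type: application/json');
echo json_encode(getUrls($array), JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES);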
First of all, I know the OP is asking about multi_curl, but I'm just adding another alternative in case the OP changes his mind. What I do here is split the URLs into many separate requests, so the CPU usage is not that big. If the OP still wants to use multi_curl, maybe the PHP masters here can give a better solution.
<?php
$num = preg_replace('/[^0-9]/', '', isset($_GET['num']) ? $_GET['num'] : '');
$num = $num === '' ? 0 : (int)$num;

$urls = array(
    'http://site1.com/',
    'http://site2.com/',
    'http://site3.com/');

if (isset($_GET['num']) && $_GET['num'] === 'done') {
    /* if all sites have been fetched, do something here */
}
elseif (!empty($urls[$num])) {
    /* do your single curl stuff here and store its data here */

    /* now redirect to the next url. don't use a header location redirect,
       it would end up in a "too many redirects" error in the browser */
    $next = !empty($urls[$num + 1]) ? $num + 1 : 'done';
    echo '<html>
    <head>
        <meta http-equiv="refresh" content="0;url=http://yourcodedomain.com/yourpath/yourcode.php?num=' . $next . '" />
    </head>
    <body>
        <p>Fetching: ' . ($num + 1) . ' / ' . count($urls) . '</p>
    </body>
    </html>';
}
else {
    /* throw an exception here */
}
?>
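What goes into the "single curl stuff" part is up to you; one possible sketch, assuming the results are accumulated in a JSON file between redirects (the file name and what gets stored are only placeholders):
/* Hypothetical per-request step: fetch the current URL and append the result
   to a JSON file, so the data survives the redirect-driven requests. */
$ch = curl_init($urls[$num]);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
$html = curl_exec($ch);
curl_close($ch);

$resultsFile = __DIR__ . '/results.json'; // placeholder location
$results = is_file($resultsFile) ? json_decode(file_get_contents($resultsFile), true) : array();
$results[$urls[$num]] = strlen((string)$html); // store whatever you actually need here
file_put_contents($resultsFile, json_encode($results));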
I had the same issue and solved it using usleep(). Try this and let me know:
do {
    usleep(10000);
    $n = curl_multi_exec($mh, $active);
} while ($active);
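If you want to avoid burning CPU in that loop, curl_multi_select() can be used to wait until there is activity instead of sleeping blindly; a sketch of that variant, using the same $mh and $active from the question:
do {
    $status = curl_multi_exec($mh, $active);
    if ($active) {
        // Block for up to one second until at least one transfer has activity.
        curl_multi_select($mh, 1.0);
    }
} while ($active && $status == CURLM_OK);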
Try this simplified version:
$urls = [
'https://en.wikipedia.org/',
'https://secure.php.net/',
];
set_time_limit(0);
libxml_use_internal_errors(true);
$hrefs = [];
foreach ($urls as $url) {
$html = file_get_contents($url);
$doc = new DOMDocument;
$doc->loadHTML($html);
foreach ($doc->getElementsByTagName('a') as $link) {
$href = filter_var($link->getAttribute('href'), FILTER_SANITIZE_URL);
if (filter_var($href, FILTER_VALIDATE_URL)) {
echo "<a href='{$href}'>{$href}</a><br/>\n";
}
}
}
Related question:
How do I clear loadHTMLFile() at the beginning of the foreach loop so that with each subsequent iteration the $response does not accumulate inside $document?
On each subsequent iteration, loadHTMLFile adds the content of $response to the content from the previous loop;
... I want to start with a "blank" $document on each pass.
Here is my code:
$responses = [
    "warsztaty" => "http://www.barlewiczki.pl/index.php/galeria/category/29-um-warsztaty-2018",
    "rozpoczecie_roku_szkolnego" => "http://www.barlewiczki.pl/index.php/galeria/category/36-rozpoczecie-roku-szkolnego-2018-2019"
];

foreach ($responses as $key => $response) {
    $document = new DOMDocument();
    $document->loadHTMLFile('');
    $document->loadHTMLFile($response);
    $xpath = new DOMXpath($document);
    $imgs = $xpath->query("//a[contains(@class, 'shadowbox-button')]");

    for ($i = 0; $i < $imgs->length; $i++) {
        $img = $imgs->item($i);
        $src = $img->getAttribute("href");
        // do something with $src
        $urls_to_image[] = 'http://www.barlewiczki.pl' . $src;
    }

    // Desired folder structure
    $my_save_dir = ('zdjecia_barlewiczki/' . $key . "/");
    // To create the nested structure, the $recursive parameter
    // to mkdir() must be specified.
    if (!mkdir($my_save_dir, 0777, true)) {
        $my_save_dir = mkdir('zdjecia_barlewiczki/');
    }

    foreach ($urls_to_image as $url) {
        $ch = curl_init($url);
        $filename = basename($url);
        $complete_save_loc = $my_save_dir . $filename;
        $fp = fopen($complete_save_loc, 'wb');
        curl_setopt($ch, CURLOPT_FILE, $fp);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_exec($ch);
        curl_close($ch);
        fclose($fp);
    }

    $document->saveHTML();
}
If I understand you correctly, try changing your XPath expression from
$imgs = $xpath->query("//a[contains(@class, 'shadowbox-button')]");
to
$imgs = $xpath->query("//div[@class='pg-box3']/a[contains(@class, 'shadowbox-button')]/@href");
At least for the first URL, this resulted in 20 URLs corresponding to the 20 images on the page.
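Note that with the /@href step the query returns attribute nodes rather than <a> elements, so the loop should read the node value instead of calling getAttribute(); a small sketch of how that could look:
// With a query ending in /@href, each item is a DOMAttr node whose value is the link itself.
$hrefs = $xpath->query("//div[@class='pg-box3']/a[contains(@class, 'shadowbox-button')]/@href");
foreach ($hrefs as $hrefNode) {
    $src = $hrefNode->nodeValue;
    $urls_to_image[] = 'http://www.barlewiczki.pl' . $src;
}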
I'm a newbie trying to code a crawler to gather some stats from a forum.
Here is my code:
<?php
$ch = curl_init();
$timeout = 0; // set to zero for no timeout
curl_setopt ($ch, CURLOPT_URL, 'http://m.jeuxvideo.com/forums/42-51-61913988-1-0-1-0-je-code-un-bot-pour-le-forom-je-vous-le-montre-en-action.htm');
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$file_contents = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($file_contents);
$xpath = new DOMXPath($dom);
$posts = $xpath->query("//div[@class='who-post']/a"); //$elements = $xpath->query("/html/body/div[@id='yourTagIdHere']");
$dates = $xpath->query("//div[@class='date-post']"); //$elements = $xpath->query("/html/body/div[@id='yourTagIdHere']");
$contents = $xpath->query("//div[@class='message text-enrichi-fmobile text-crop-fmobile']/p"); //$elements = $xpath->query("/html/body/div[@id='yourTagIdHere']");
$i = 0;
foreach ($posts as $post) {
$nodes = $post->childNodes;
foreach ($nodes as $node) {
$value = trim($node->nodeValue);
$tab[$i]['author'] = $value;
$i++;
}
}
$i = 0;
foreach ($dates as $date) {
$nodes = $date->childNodes;
foreach ($nodes as $node) {
$value = trim($node->nodeValue);
$tab[$i]['date'] = $value;
$i++;
}
}
$i = 0;
foreach ($contents as $content) {
$nodes = $content->childNodes;
foreach ($nodes as $node) {
$value = $node->nodeValue;
echo $value;
$tab[$i]['content'] = trim($value);
$i++;
}
}
?>
<h1>Participants</h1>
<pre>
<?php
print_r($tab);
?>
</pre>
As you can see, the code does not retrieve some content. For example, I'm trying to retrieve content from: http://m.jeuxvideo.com/forums/42-51-61913988-1-0-1-0-je-code-un-bot-pour-le-forom-je-vous-le-montre-en-action.htm
The second post is a picture, and my code does not handle it.
On the other hand, I guess I made some errors; I find my code ugly.
Can you help me, please?
You could simply select the posts first, then grab each piece of sub-data separately using:
DOMXPath::evaluate combined with normalize-space to retrieve plain text,
DOMXPath::query combined with DOMDocument::saveHTML to retrieve the message paragraphs.
Code:
$xpath = new DOMXPath($dom);
$postsElements = $xpath->query('//*[@class="post"]');
$posts = [];
foreach ($postsElements as $postElement) {
    $author = $xpath->evaluate('normalize-space(.//*[@class="who-post"])', $postElement);
    $date = $xpath->evaluate('normalize-space(.//*[@class="date-post"])', $postElement);
    $message = '';
    foreach ($xpath->query('.//*[contains(@class, "message")]/p', $postElement) as $messageParagraphElement) {
        $message .= $dom->saveHTML($messageParagraphElement);
    }
    $posts[] = (object) compact('author', 'date', 'message');
}
print_r($posts);
Unrelated note: scraping a website's HTML is not illegal in itself, but you should refrain from displaying their data on your own app/website without their consent. Also, this might break just about anytime if they decide to alter their HTML structure/CSS class names.
I have a problem with writing data from an array to a CSV file.
Array
(
    [link1] => HTTP Code
    [link2] => HTTP Code
    [link3] => HTTP Code
    [link4] => HTTP Code
)
I need to write the data to a CSV file in such a way that the links do not repeat.
Unfortunately, I don't know how to go through the links one by one (I work in a foreach loop), extract each of them, write it to the CSV, and at the same time check that it has not already appeared.
This is my code:
require('simple/simple_html_dom.php');
$xml = simplexml_load_file('https://www.gutscheinpony.de/sitemap.xml');
$fp = fopen('Links2.csv', 'w');
set_time_limit(0);
$links=[];
foreach ($xml->url as $link_url)
{
$url = $link_url->loc;
$data=file_get_html($url);
$data = strip_tags($data,"<a>");
$d = preg_split("/<\/a>/",$data);
foreach ( $d as $k=>$u ){
if( strpos($u, "<a href=") !== FALSE ){
$u = preg_replace("/.*<a\s+href=\"/sm","",$u);
$u = preg_replace("/\".*/","",$u);
if ( strpos($u, "http") !== FALSE) {
$ch = curl_init($u);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($ch);
$http_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
if(strpos($u, "https://www.gutscheinpony.de/") !== FALSE )
$u = substr($u, 28);
if($u == "/")
$u = $url;
}
$links[$u] = $http_code;
$wynik = array( array($u, $url , $http_code));
foreach ($wynik as $fields) {
fputcsv($fp, $fields);
}
}
}
}
curl_close($ch);
fclose($fp);
echo 'Send to CSV file successfully completed ... ';
I need to get every link from the .xml file, download the links found on each page, and determine their HTTP status. That part I have done; I just can't find the proper way to write the data to a CSV file.
I'm counting on your help.
The code below is essentially your code with a few modifications. There was also the observation that :// does not seem to be acceptable as part of PHP array keys.
<?php
require __DIR__ . '/simple/simple_html_dom.php';
$xml = simplexml_load_file('https://www.gutscheinpony.de/sitemap.xml');
$fp = fopen(__DIR__ . '/Links2.csv', 'w');
set_time_limit(0);
$links = [];
$status = false;
foreach ($xml->url as $link_url){
$url = $link_url->loc;
$data = file_get_html($url);
$data = strip_tags($data,"<a>");
$d = preg_split("/<\/a>/",$data);
foreach ( $d as $k=>$u ){
$http_code = 404;
if( strpos($u, "<a href=") !== FALSE ){
$u = preg_replace("/.*<a\s+href=\"/sm","",$u);
$u = preg_replace("/\".*/","",$u);
if ( strpos($u, "http") !== FALSE) {
// JUST GET THE CODE ON EACH ITERATION,
// OPENING THE STREAM & CLOSING IT AGAIN ON EACH ITERATION...
$http_code = getHttpCodeStatus($u);
if(strpos($u, "https://www.gutscheinpony.de/") !== FALSE ){
$u = substr($u, 28);
}
if($u == "/") {
$u = $url;
}
// THIS COULD BE A BUG... USING :// AS PART OF AN ARRAY KEY SEEMS NOT TO WORK
$links[str_replace("://", "_", $u)] = $http_code;
// RUN THE var_dump(), TO VIEW THE PROCESS AS IT PROGRESSES IF YOU WISH TO
var_dump($links);
$status = fputcsv($fp, array($u, $url , $http_code));
}
}
}
}
fclose($fp);
if($status) {
echo count($links) . ' entries were successfully processed and written to disk as a CSV File... ';
}else{
echo 'It seems like some entries were not successfully written to disk - at least the last entry... ';
}
function getHttpCodeStatus($u){
$ch = curl_init($u);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($ch);
$http_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
return $http_code;
}
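If only the status code matters, it may also be worth skipping the response body entirely. A sketch of a variant using CURLOPT_NOBODY (a HEAD-style request, which some servers treat differently from GET, so treat it as an option rather than a drop-in replacement):
// Variant of getHttpCodeStatus() that requests headers only (no body download).
function getHttpCodeStatusHead($u){
    $ch = curl_init($u);
    curl_setopt($ch, CURLOPT_NOBODY, true);          // HEAD-style request
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // report the final code after redirects
    curl_exec($ch);
    $http_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return $http_code;
}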
The following code scrapes a list of links from a given webpage and then feeds them into another script that scrapes the text from those links and places the data into a CSV document. The code runs perfectly on localhost (WampServer, PHP 5.5) but fails horribly when placed on the live domain.
You can check out the functionality of the script at http://miskai.tk/ANOFM/csv.php .
Also, file_get_html and cURL are both enabled on the server.
<?php
header('Content-Type: application/excel');
header('Content-Disposition: attachment; filename="Mehedinti.csv"');
include_once 'simple_html_dom.php';
include_once 'csv.php';
$urls = scrape_main_page();
function scraping($url) {
// create HTML DOM
$html = file_get_html($url);
// get article block
if ($html && is_object($html) && isset($html->nodes)) {
foreach ($html->find('/html/body/table') as $article) {
// get title
$item['titlu'] = trim($article->find('/tbody/tr[1]/td/div', 0)->plaintext);
// get body
$item['tr2'] = trim($article->find('/tbody/tr[2]/td[2]', 0)->plaintext);
$item['tr3'] = trim($article->find('/tbody/tr[3]/td[2]', 0)->plaintext);
$item['tr4'] = trim($article->find('/tbody/tr[4]/td[2]', 0)->plaintext);
$item['tr5'] = trim($article->find('/tbody/tr[5]/td[2]', 0)->plaintext);
$item['tr6'] = trim($article->find('/tbody/tr[6]/td[2]', 0)->plaintext);
$item['tr7'] = trim($article->find('/tbody/tr[7]/td[2]', 0)->plaintext);
$item['tr8'] = trim($article->find('/tbody/tr[8]/td[2]', 0)->plaintext);
$item['tr9'] = trim($article->find('/tbody/tr[9]/td[2]', 0)->plaintext);
$item['tr10'] = trim($article->find('/tbody/tr[10]/td[2]', 0)->plaintext);
$item['tr11'] = trim($article->find('/tbody/tr[11]/td[2]', 0)->plaintext);
$item['tr12'] = trim($article->find('/tbody/tr[12]/td/div', 0)->plaintext);
$ret[] = $item;
}
// clean up memory
$html->clear();
unset($html);
return $ret;}
}
$output = fopen("php://output", "w");
foreach ($urls as $url) {
$ret = scraping($url);
foreach($ret as $v){
fputcsv($output, $v);}
}
fclose($output);
exit();
second file
<?php
function get_contents($url) {
// We could just use file_get_contents but using curl makes it more future-proof (setting a timeout for example)
$ch = curl_init($url);
curl_setopt_array($ch, array(CURLOPT_RETURNTRANSFER => true,));
$content = curl_exec($ch);
curl_close($ch);
return $content;
}
function scrape_main_page() {
set_time_limit(300);
libxml_use_internal_errors(true); // Prevent DOMDocument from spraying errors onto the page and hide those errors internally ;)
$html = get_contents("http://lmvz.anofm.ro:8080/lmv/index2.jsp?judet=26");
$dom = new DOMDocument();
$dom->loadHTML($html);
die(var_dump($html));
$xpath = new DOMXPath($dom);
$results = $xpath->query("//table[@width=\"645\"]/tr");
$all = array();
//var_dump($results);
for($i = 1; $i < $results->length; $i++) {
$tr = $results->item($i);
$id = $tr->childNodes->item(0)->textContent;
$requesturl = "http://lmvz.anofm.ro:8080/lmv/detalii.jsp?UNIQUEJVID=" . urlencode($id) .
"&judet=26";
$details = scrape_detail_page($requesturl);
$newObj = new stdClass();
$newObj = $id;
$all[] = $newObj;
}
foreach($all as $xtr) {
$urls[] = "http://lmvz.anofm.ro:8080/lmv/detalii.jsp?UNIQUEJVID=" . $xtr .
"&judet=26";
}
return $urls;
}
scrape_main_page();
Yeah, the problem here is your php.ini configuration. Make sure the server supports cURL and fopen. If not, start your own Linux server.
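A quick way to verify what the server actually allows is a small throwaway script (just a sketch):
// Minimal environment check: cURL extension, URL fopen wrappers, and the time limit.
var_dump(function_exists('curl_init'));        // is the cURL extension loaded?
var_dump((bool) ini_get('allow_url_fopen'));   // is file_get_contents() on URLs allowed?
var_dump(ini_get('max_execution_time'));       // current execution time limit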
OK, so preg_match_all won't work against Yahoo.
I'm trying to run preg_match_all on the results I get from Yahoo using cURL's curl_multi_getcontent method.
I have succeeded in fetching the site and so on, but when I try to extract the links, nothing matches. When I use the regex in Notepad++ it works, but in PHP it apparently doesn't.
I'm currently using:
preg_match_all(
'#<span class="url" id="(.*?)">(.+?)</span>#si', $urlContents[2], $yahoo
);
Check the HTML at http://se.search.yahoo.com/search?p=random&toggle=1&cop=mss&ei=UTF-8&fr=yfp-t for example, and you will see that all links start with <span class="url" id="something random"> and end with </span>.
Could someone possibly help me with how I should retrieve this information?
I only need the actual link address of each result.
Entire PHP Script
public function multiSearch($question)
{
$sites['google'] = "http://www.google.com/search?q={$question}&gl=sv";
$sites['bing'] = "http://www.bing.com/search?q={$question}";
$sites['yahoo'] = "http://se.search.yahoo.com/search?p={$question}";
$urlHandler = array();
foreach($sites as $site)
{
$handler = curl_init();
curl_setopt($handler, CURLOPT_URL, $site);
curl_setopt($handler, CURLOPT_HEADER, 0);
curl_setopt($handler, CURLOPT_RETURNTRANSFER, 1);
array_push($urlHandler, $handler);
}
$multiHandler = curl_multi_init();
foreach($urlHandler as $key => $url)
{
curl_multi_add_handle($multiHandler, $url);
}
$running = null;
do
{
curl_multi_exec($multiHandler, $running);
}
while($running > 0);
$urlContents = array();
foreach($urlHandler as $key => $url)
{
$urlContents[$key] = curl_multi_getcontent($url);
}
foreach($urlHandler as $key => $url)
{
curl_multi_remove_handle($multiHandler, $url);
}
foreach($urlContents as $urlContent)
{
preg_match_all('/<li class="g">(.*?)<\/li>/si', $urlContent, $matches);
//$this->view_data['results'][] = "Random";
}
preg_match_all('#<cite>(.+?)</cite>#si', $urlContents[1], $googleLinks);
preg_match_all('#<span class="url" id="(.*)">(.+?)</span>#si', $urlContents[2], $yahoo);
var_dump($yahoo);
die();
$findHtml = array('/<cite>/', '/<\/cite>/', '/<b>/', '/<\/b>/', '/ /', '/"/', '/<strong>/', '/<\/strong>/');
$removeHtml = array('', '', '', '', '', '', '', '');
foreach($googleLinks as $links => $val)
{
foreach($val as $link)
$this->view_data['results'][] = preg_replace($findHtml, $removeHtml, $link);
break;
}
}
First off, you should not use regular expressions to process HTML. There are pretty good DOM parsers available for PHP. For example:
$d = new DOMDocument;
$d->loadHTML($s);
$x = new DOMXPath($d);
foreach ($x->query('//span[@class="url"]') as $node) {
    // process each node the way you wish
    // print the id for instance
    echo $node->getAttribute('id'), PHP_EOL;
}
Besides that, the expression should work except that id="(.*)" is greedy; that can be fixed with:
#<span class="url" id="(.*?)">(.+?)</span>#si
It's possible that there's more text after id="..." and the >; that would bring the expression to:
#<span class="url" id="(.*?)"[^>]*>(.+?)</span>#si