Preg_match_all not stopping where it should be - php

Update Yahoo error
OK, so I got it all working, but the preg_match_all won't work against Yahoo.
If you take a look at:
http://se.search.yahoo.com/search?p=random&toggle=1&cop=mss&ei=UTF-8&fr=yfp-t
then you can see that in their html, they have
<span class="url" id="something random"> the actual link </span>
But when I run preg_match_all on it, I don't get any results.
preg_match_all('#<span class="url" id="(.*)">(.+?)</span>#si', $urlContents[2], $yahoo);
Anyone got an idea?
End of update
I'm trying to run preg_match_all on the results I get from Google, which I fetch with cURL's curl_multi_getcontent.
I have managed to fetch the pages, but when I try to extract the result links, the pattern matches far too much.
I'm currently using:
preg_match_all('#<cite>(.+)</cite>#si', $urlContents[0], $links);
It starts matching where it should, but it doesn't stop at the first closing tag; it just keeps going.
Check the HTML at www.google.com/search?q=random, for example, and you will see that all links start with <cite> and end with </cite>.
Could someone help me with how I should retrieve this information?
I only need the actual link address to each result.
Update Entire PHP Script
public function multiSearch($question)
{
$sites['google'] = "http://www.google.com/search?q={$question}&gl=sv";
$sites['bing'] = "http://www.bing.com/search?q={$question}";
$sites['yahoo'] = "http://se.search.yahoo.com/search?p={$question}";
$urlHandler = array();
foreach($sites as $site)
{
$handler = curl_init();
curl_setopt($handler, CURLOPT_URL, $site);
curl_setopt($handler, CURLOPT_HEADER, 0);
curl_setopt($handler, CURLOPT_RETURNTRANSFER, 1);
array_push($urlHandler, $handler);
}
$multiHandler = curl_multi_init();
foreach($urlHandler as $key => $url)
{
curl_multi_add_handle($multiHandler, $url);
}
$running = null;
do
{
curl_multi_exec($multiHandler, $running);
}
while($running > 0);
$urlContents = array();
foreach($urlHandler as $key => $url)
{
$urlContents[$key] = curl_multi_getcontent($url);
}
foreach($urlHandler as $key => $url)
{
curl_multi_remove_handle($multiHandler, $url);
}
foreach($urlContents as $urlContent)
{
preg_match_all('/<li class="g">(.*?)<\/li>/si', $urlContent, $matches);
//$this->view_data['results'][] = "Random";
}
preg_match_all('#<div id="search"(.*)</ol></div>#i', $urlContents[0], $match);
preg_match_all('#<cite>(.+)</cite>#si', $urlContents[0], $links);
var_dump($links);
}

Run the regular expression in ungreedy mode (the U modifier):
preg_match_all('#<cite>(.+)</cite>#siU', $urlContents[0], $links);

As in Darhazer's answer, you can turn on ungreedy mode for the whole regex with the U pattern modifier, or make just that quantifier ungreedy (lazy) by following it with a ?:
preg_match_all('#<cite>(.+?)</cite>#si', ...
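To see the difference, here is a minimal sketch that runs the greedy, lazy, and U-modifier patterns against a small hand-written string. The $html value is a made-up stand-in for the fetched page, not real Google markup.

```php
<?php
// Hypothetical stand-in for the fetched page (assumption, not real Google HTML).
$html = '<cite>a.example/one</cite><p>x</p><cite>b.example/two</cite>';

// Greedy: .+ runs to the LAST </cite>, so everything collapses into one match.
preg_match_all('#<cite>(.+)</cite>#si', $html, $greedy);

// Lazy quantifier: .+? stops at the first </cite> it can.
preg_match_all('#<cite>(.+?)</cite>#si', $html, $lazy);

// U modifier: inverts greediness for the whole pattern, same result here.
preg_match_all('#<cite>(.+)</cite>#siU', $html, $ungreedy);

print_r($greedy[1]);
print_r($lazy[1]);
print_r($ungreedy[1]);
```

The greedy run returns a single capture that spans both results, while the other two return one capture per cite element.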

Related

Loop Preg_match until no more matches

How can I run preg_match until no more results are found?
I'm using cURL to log in to a page and then delete posts from it.
To delete those posts I need to preg_match the content and extract the IDs; if an ID is found, my script runs the delete command.
So, basically:
$pattern = '/(?<=list_id=).*?(?=&cmd=edit)/s';
preg_match($pattern, $LoginResult, $id); // This preg_match works: it gets the first result on the page (what I need). But I need a loop so that this runs over and over until nothing more is found.
$idpagina = $id[0];
In words, it should do something like:
If preg_match is true, run the delete command.
Loop the if until preg_match is false.
With this code I can find everything between list_id= and &cmd=edit. If the script finds something between these two strings, it needs to perform a cURL request to delete that ID:
//THIS IS WORKING
$paginadelete = "https://example/list/folder/0?list_id=".$idpagina."&cmd=delete&type=AD_DELETE";
curl_setopt($login, CURLOPT_URL, $paginadelete);
curl_setopt($login, CURLOPT_POST, 1);
curl_setopt($login, CURLOPT_FOLLOWLOCATION, 1);
$step1 = curl_exec($login);
echo $step1;
What this basically does (or should do) is:
1. Loop preg_match, and if preg_match is true go to #2
2. Run the delete cURL request
3. Return to #1 until preg_match finds nothing
But this script runs 3 cURL requests:
1. Log in
2. Go to the delete page (the one above)
3. Confirm the delete
So this loop should be between steps #2 and #3 until nothing more is found.
My #3 step (confirm delete) is this one:
$url = curl_getinfo($login, CURLINFO_EFFECTIVE_URL);
$url = parse_url($url, PHP_URL_PATH);
$url = substr($url, 9);
$url = "http://example.com/cmd/act/".$url;
$post_data = array(
'1' => 'delete',
'2' => '1',
'3' => '2',
'4' => '10',
'5' => '',
'6' => 'continue',
);
//traverse array and prepare data for posting (key1=value1)
foreach ( $post_data as $key => $value) {
$post_items[] = $key . '=' . $value;
}
//create the final string to be posted using implode()
$post_string = implode ('&', $post_items);
curl_setopt($login, CURLOPT_URL, $url);
curl_setopt($login, CURLOPT_POST, 1);
curl_setopt($login, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($login, CURLOPT_POSTFIELDS, $post_string);
$step2 = curl_exec($login);
//echo $step2;
////////////////////// EDIT
I was trying:
if (preg_match('/(?<=list_id=).*?(?=&cmd=edit)/s', $LoginResult, $id)){
}
else {
}
But this only works for the first result. After that, the script stops. I need to re-run the if until preg_match is false and then end in the else.
I thought about using do and while, but I don't know how, or whether it would work.
////////////////// EDIT 2
I'm now trying to use a goto until I get false, and then close the connection:
verification:
if (preg_match('/(?<=list_id=).*?(?=&cmd=edit)/s', $LoginResult, $id)){
[..........]
} else {
//close the connection
curl_close($login);
}
goto verification;
But it doesn't seem to work, lol.
Your question isn't clear, but I guess what you need is preg_match_all, i.e.:
preg_match_all('/((?<=list_id=).*?(?=&cmd=edit))/im', $html, $matches, PREG_PATTERN_ORDER);
for ($i = 0; $i < count($matches[1]); $i++) {
//here you can implement an if/else to check if the ID exist
echo $matches[1][$i];
}
http://php.net/manual/en/function.preg-match-all.php
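As a rough sketch of that approach, the snippet below collects every ID in one pass and builds the delete URL for each one. The $html string is a hypothetical stand-in for $LoginResult, and the cURL calls are left out so the example runs without a network connection.

```php
<?php
// Hypothetical page fragment standing in for $LoginResult (assumption).
$html = '<a href="/list?list_id=101&cmd=edit">A</a> <a href="/list?list_id=202&cmd=edit">B</a>';

// One call returns every ID; the lookarounds keep the delimiters out of the match.
preg_match_all('/(?<=list_id=).*?(?=&cmd=edit)/s', $html, $matches);

$deleteUrls = array();
foreach ($matches[0] as $id) {
    // Same URL shape as in the question; the real script would run its
    // delete and confirm requests here instead of just storing the URL.
    $deleteUrls[] = "https://example/list/folder/0?list_id=" . $id . "&cmd=delete&type=AD_DELETE";
}

print_r($deleteUrls);
```

Because all IDs are extracted up front, the loop ends on its own once the list is exhausted; no goto or repeated preg_match is needed.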
Based on your first edit, it seems that what you are trying to achieve is the following:
while(preg_match('/(?<=list_id=).*?(?=&cmd=edit)/s', $LoginResult, $id)) {
    // do stuff, then re-fetch the page into $LoginResult;
    // if $LoginResult never changes, the same ID matches forever
}
// do stuff after the preg_match is false
Edit:
Based on your description in the comments, maybe this code will satisfy your needs.
while(true) {
    $result = preg_match('/(?<=list_id=).*?(?=&cmd=edit)/s', $LoginResult, $id);
    if($result) {
        // Run Delete Curl, then reload the list page into $LoginResult
        // so the deleted ID is no longer matched on the next pass
    } else {
        curl_close($login);
        break;
    }
}
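If the page is not re-fetched between iterations, another way to keep a preg_match loop from matching the same ID forever is to advance the search offset after each match. This is only a sketch: $html is a made-up stand-in for $LoginResult, and the delete step is reduced to collecting the ID.

```php
<?php
// Hypothetical stand-in for $LoginResult (assumption).
$html = 'a?list_id=111&cmd=edit b?list_id=222&cmd=edit';
$pattern = '/(?<=list_id=).*?(?=&cmd=edit)/s';

$offset = 0;
$ids = array();

// PREG_OFFSET_CAPTURE makes $m[0] a pair: array(matched text, byte position).
// The fifth argument tells preg_match where to start searching.
while (preg_match($pattern, $html, $m, PREG_OFFSET_CAPTURE, $offset) === 1) {
    list($id, $pos) = $m[0];
    $ids[] = $id; // the real script would run the delete request here

    // Move past this match; max(1, ...) guards against an empty match
    // leaving the offset unchanged and looping forever.
    $offset = $pos + max(1, strlen($id));
}

print_r($ids);
```

The loop ends when preg_match returns 0, which is the "until nothing more is found" condition from the question.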

PHP file_get_contents error, wouldn't populate from an array?

I've been trying to write a simple PHP script to pull data from an ISBN database site, and for some reason I've had nothing but issues with file_get_contents. I've managed to get something working now, but I would like to know why this version wasn't working.
The code below does not populate $page with any information, so the preg_match calls that follow fail to extract anything. Does anyone know what was stopping it?
$links = array ('
http://www.isbndb.com/book/2009_cfa_exam_level_2_schweser_practice_exams_volume_2','
http://www.isbndb.com/book/uniform_investment_adviser_law_exam_series_65','
http://www.isbndb.com/book/waterworks_a02','
http://www.isbndb.com/book/winning_the_toughest_customer_the_essential_guide_to_selling','
http://www.isbndb.com/book/yale_daily_news_guide_to_fellowships_and_grants'
); // array of URLs
foreach ($links as $link)
{
$page = file_get_contents($link);
#print $page;
preg_match("#<h1 itemprop='name'>(.*?)</h1>#is",$page,$title);
preg_match("#<a itemprop='publisher' href='http://isbndb.com/publisher/(.*?)'>(.*?)</a>#is",$page,$publisher);
preg_match("#<span>ISBN10: <span itemprop='isbn'>(.*?)</span>#is",$page,$isbn10);
preg_match("#<span>ISBN13: <span itemprop='isbn'>(.*?)</span>#is",$page,$isbn13);
echo '<tr>
<td>'.$title[1].'</td>
<td>'.$publisher[2].'</td>
<td>'.$isbn10[1].'</td>
<td>'.$isbn13[1].'</td>
</tr>';
#exit();
}
My guess is that your URLs are wrong (not direct). The proper ones are without the www. part: if you request any of them and inspect the returned headers, you'll see that you're redirected (HTTP 301) to another URL.
The best way to handle this, in my opinion, is to use cURL with curl_setopt and the options CURLOPT_FOLLOWLOCATION and CURLOPT_MAXREDIRS.
Of course, you should also trim your URLs beforehand, just to be sure that isn't the problem.
Example here:
$curl = curl_init();
foreach ($links as $link) {
    // trim() removes the newline that starts each URL in the array above
    curl_setopt($curl, CURLOPT_URL, trim($link));
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($curl, CURLOPT_MAXREDIRS, 5); // max 5 redirects
    $result = curl_exec($curl);
    if (! $result) {
        continue; // if $result is empty or false - ignore and continue;
    }
    // do what you need to do here
}
curl_close($curl);
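The trimming point is easy to check without touching the network. In the question's array, each quoted string opens on one line and the URL starts on the next, so every element begins with a line break. Whether that is what actually broke file_get_contents is an assumption, but the sketch below shows the stray whitespace and how array_map with trim removes it. It uses two of the question's URLs.

```php
<?php
// Same layout as the question: the opening quote is followed by a line break,
// so each element starts with whitespace before "http".
$links = array('
http://www.isbndb.com/book/waterworks_a02','
http://www.isbndb.com/book/uniform_investment_adviser_law_exam_series_65'
);

// True when the raw element differs from its trimmed form.
$hasStrayWhitespace = ($links[0] !== trim($links[0]));
var_dump($hasStrayWhitespace);

// Clean every element in one go before requesting anything.
$clean = array_map('trim', $links);
print_r($clean);
```

Passing $clean instead of $links to the fetching loop rules the whitespace out as a cause before looking at redirects.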

Regex to find out specific part of an html page

I want a regex that finds the lines below in a block of code.
The part that I want to find:
-->Copy frame link\",\"url240\":\"http:\/\/cs534515v4.vk.me\/u163220668\/videos\/1c1b06aec9.240.mp4\",\"url360\":\"http:\/\/cs534515v4.vk.me\/u163220668\/videos\/1c1b06aec9.360.mp4\",\"jpg\"<--
This code forms part of an HTML page and I want to retrieve only the part shown. I am writing the code in PHP.
My complete code:
<?php
set_time_limit(0);
function get_content_of_url($url){
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
$content = curl_exec($ch);
curl_close($ch);
return $content;
}
$plyst = get_content_of_url("http://vk.com/video56612186_167113956");
preg_match('/link\\".*"jpg\\"/', $plyst , $matches);
var_dump($matches);
//preg_match('/http:\/\/[a-zA-Z0-9\\/-_.]+/', $matches[0][0], $id);
//start_script($id[0]);
?>
How about this.
$str = "video_get_current_url\":\"Copy frame link\",\"url240\":\"http:\\\/\\\/cs534515v4.vk.me\\\/u163220668\\\/videos\\\/1c1b06aec9.240.mp4\",\"url360\":\"http:\\\/\\\/cs534515v4.vk.me\\\/u163220668\\\/videos\\\/1c1b06aec9.360.mp4\",\"jpg\":\"http:\\\/\\\/cs534515.vk.me\\\/u163220668\\\/video\\\/l_8a5b0712.jpg\",\"ip_subm\":1,\"nologo";
preg_match('/\\"Copy\sframe.*"jpg\\"/is', $str, $matches);
var_dump($matches);
Output:
array(1) {
[0]=>
string(199) ""Copy frame link","url240":"http:\\/\\/cs534515v4.vk.me\\/u163220668\\/videos\\/1c1b06aec9.240.mp4","url360":"http:\\/\\/cs534515v4.vk.me\\/u163220668\\/videos\\/1c1b06aec9.360.mp4","jpg""
}
Edit:
And then, if you wanted to extract the video URLs from that:
preg_match_all('/(https?:.*?\.mp4)/', $matches[0], $id);
// Then echo out the URLs
foreach ($id[0] as $url) {
// the preg_replace strips out the backslashes.
echo preg_replace('/\\\\/', '', $url)."<br />";
}
Output:
http://cs534515v4.vk.me/u163220668/videos/1c1b06aec9.240.mp4
http://cs534515v4.vk.me/u163220668/videos/1c1b06aec9.360.mp4
Working example: http://sandbox.onlinephpfunctions.com/code/329106d990fe8927a7670b9448770643afbd0865
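The same two steps (find the escaped URLs, then unescape them) can be sketched on a smaller input. The $fragment string and the example.com host are made up for illustration; only the JSON-style `\/` escaping mirrors the page in the question. str_replace is used here instead of preg_replace because the thing being removed is a fixed string.

```php
<?php
// Made-up fragment with JSON-style escaped slashes (assumption: one backslash per slash).
// In a single-quoted PHP string, \/ stays as a literal backslash followed by a slash.
$fragment = '"url240":"http:\/\/example.com\/videos\/a.240.mp4","url360":"http:\/\/example.com\/videos\/a.360.mp4","jpg"';

// Lazy match from "http:" up to the nearest ".mp4".
preg_match_all('#https?:.*?\.mp4#', $fragment, $found);

// Turn every escaped "\/" back into "/".
$urls = array();
foreach ($found[0] as $escaped) {
    $urls[] = str_replace('\\/', '/', $escaped);
}

print_r($urls);
```

If the surrounding text is valid JSON, decoding it with json_decode and reading the url240 and url360 keys would avoid the regex entirely.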

Pull text from another website

Is it possible to pull text data from another domain (one I don't own) using PHP? If not, is there any other method? I've tried using iframes, but because my page is a mobile website they just don't look good. I'm trying to show a marine forecast for a specific area. Here is the link I'm trying to display.
Update:
This is what I ended up using. Maybe it will help someone else. However I felt there was more than one right answer to my question.
<?php
$ch = curl_init("http://forecast.weather.gov/MapClick.php?lat=29.26034686&lon=-91.46038359&unit=0&lg=english&FcstType=text&TextType=1");
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
$content = curl_exec($ch);
curl_close($ch);
echo $content;
?>
This works as I think you want it to, except that it depends on the weather site keeping the same format (and on the word "Outlook" being displayed).
<?php
//define the URL of the resource
$url = 'http://forecast.weather.gov/MapClick.php?lat=29.26034686&lon=-91.46038359&unit=0&lg=english&FcstType=text&TextType=1';
//function from http://stackoverflow.com/questions/5696412/get-substring-between-two-strings-php
function getInnerSubstring($string, $boundstring, $trimit=false)
{
    $res = false;
    // strpos() returns false when the boundary is missing, so compare strictly
    $bstart = strpos($string, $boundstring);
    if($bstart !== false)
    {
        // last occurrence of the boundary; must come after the first one
        $bend = strrpos($string, $boundstring);
        if($bend !== false && $bend > $bstart)
        {
            $res = substr($string, $bstart+strlen($boundstring), $bend-$bstart-strlen($boundstring));
        }
    }
    return $trimit ? trim($res) : $res;
}
//if the URL is reachable
if($source = file_get_contents($url))
{
$raw = strip_tags($source,'<hr>');
echo '<pre>'.substr(strstr(trim(getInnerSubstring($raw,"<hr>")),'Outlook'),7).'</pre>';
}
else{
echo 'Error';
}
?>
If you need any revisions, please comment.
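To see what the extraction does without fetching the forecast page, here is a small sketch of the same idea: take the text between the first and last occurrence of a boundary, then cut everything up to and including the word "Outlook". The function name between_first_and_last and the $raw sample are hypothetical and are not the real weather.gov markup.

```php
<?php
// Hypothetical helper: text between the first and last occurrence of $bound,
// or false when $bound does not occur twice.
function between_first_and_last($text, $bound)
{
    $start = strpos($text, $bound);
    $end = strrpos($text, $bound);
    if ($start === false || $end === false || $end <= $start) {
        return false;
    }
    $from = $start + strlen($bound);
    return substr($text, $from, $end - $from);
}

// Made-up sample in the shape the answer expects after strip_tags($source, '<hr>').
$raw = "header<hr>Outlook Winds 10 to 15 knots.<hr>footer";

$inner = between_first_and_last($raw, '<hr>');

// strstr() keeps the text from "Outlook" onward; substr(..., 7) drops that word.
$forecast = trim(substr(strstr($inner, 'Outlook'), 7));

echo $forecast;
```

Any change to the page layout, such as a third hr element or a missing "Outlook" heading, changes what this returns, which is the fragility the answer warns about.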
Try sending a User-Agent header as shown below. Then you can use SimpleXML to parse the contents and extract the text you want.
$opts = array(
'http'=>array(
'method'=>"GET",
'header'=>"User-agent: www.example.com"
)
);
$content = file_get_contents($url, false, stream_context_create($opts));
$xml = simplexml_load_string($content);
You may use cURL for that. Have a Look at http://www.php.net/manual/en/book.curl.php

How can I assign a preg_match_all result to a variable

Forgive me as I am a newbie programmer. How can I assign the resulting $matches (preg_match) value, with the first character stripped, to another variable ($funded) in php? You can see what I have below:
<?php
$content = file_get_contents("https://join.app.net");
//echo $content;
preg_match_all ("/<div class=\"stat-number\">([^`]*?)<\/div>/", $content, $matches);
//testing the array $matches
//echo sprintf('<pre>%s</pre>', print_r($matches, true));
$funded = $matches[0][1];
echo substr($funded, 1);
?>
Don't parse HTML with RegEx.
The best way is to use PHP DOM:
<?php
$handle = curl_init('https://join.app.net');
curl_setopt($handle, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
$raw = curl_exec($handle);
curl_close($handle);
$doc = new DOMDocument();
$doc->loadHTML($raw);
$elems = $doc->getElementsByTagName('div');
foreach($elems as $item) {
if($item->getAttribute('class') == 'stat-number')
if(strpos($item->textContent, '$') !== false) $funded = $item->textContent;
}
// Remove $ sign and ,
$funded = preg_replace('/[^0-9]/', '', $funded);
echo $funded;
?>
This returned 380950 at the time of posting.
I am not 100% sure, but it seems you are trying to get the current dollar amount of the funding?
And the character you want to strip out is a dollar sign?
If that is the case, why not just add the dollar sign to the regex outside the group, so it isn't captured?
/<div class=\"stat-number\">\$([^`]*?)<\/div>/
Because $ means end of line in a regex, you must escape it with a backslash.
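A minimal sketch of that idea follows, run against a made-up snippet rather than the live page. The $content value and its numbers are illustrative only, apart from 380,950, which echoes the figure quoted in the other answer. The pattern is written in a single-quoted string on purpose: inside a double-quoted PHP string, `\$` is itself an escape sequence that produces a bare $, so the regex engine would see an end-of-line anchor instead of a literal dollar sign.

```php
<?php
// Made-up markup in the shape of the page from the question (assumption).
$content = '<div class="stat-number">12,345</div><div class="stat-number">$380,950</div>';

// The literal \$ sits outside the capture group, so group 1 holds only the digits and commas.
// [^<]* stops at the closing tag, which keeps the capture inside one div.
if (preg_match('/<div class="stat-number">\$([^<]*)<\/div>/', $content, $matches)) {
    // Drop the thousands separators to get a plain number string.
    $funded = str_replace(',', '', $matches[1]);
    echo $funded;
}
```

The first div is skipped because its text does not start with a dollar sign, so only the funding figure is captured.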
