PHP regex: scraping images from a JavaScript object

I'm trying to scrape images from the markup of certain webpages. Each page has a slideshow whose image sources are stored in JavaScript objects on the page. I think I need to call file_get_contents("http://www.example.com/page/1") and then use preg_match_all() with an input phrase (e.g. "\"LargeUrl\":\"" or "\"Description\":\"") to get the string of characters that follows it, up to the next quotation mark.
var photos = {};
photos['photo-391094'] = {"LargeUrl": "http://www.example.org/images/1.png","Description":"blah blah balh"};
photos['photo-391095'] = {"LargeUrl": "http://www.example.org/images/2.png","Description":"blah blah balh"};
photos['photo-391096'] = {"LargeUrl": "http://www.example.org/images/3.png","Description":"blah blah balh"};
I have this code, but it returns the entire line after the input phrase. How can I modify it to capture whatever follows the input phrase, up to the next quotation mark? Or am I doing this all wrong, and is there a better way?
$page = file_get_contents("http://www.example.org/page/1");
$word = "\"LargeUrl\":\"";
if(preg_match_all("/(?<=$word)\S+/i", $page, $matches))
{
echo "<pre>";
print_r($matches);
echo "</pre>";
}
Ideally the function would return an array like the following if I input "\"LargeUrl\":\"":
$matches[0] = "http://www.example.org/images/1.png";
$matches[1] = "http://www.example.org/images/2.png";
$matches[2] = "http://www.example.org/images/3.png";

You can use parentheses to capture the parts you're interested in. A simple regex to do it is:
$word = '"LargeUrl":';
$pattern = $word . '\s*"([^"]+)"'; // \s* also matches when no space follows the colon
preg_match_all("/$pattern/", $page, $matches);
print_r($matches[1]);
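The same idea can be wrapped in a small reusable function. This is only a sketch: extractValues() is a hypothetical name, and it assumes the values are plain quoted strings with no escaped quotes inside them. preg_quote() keeps any regex metacharacters in the key from being interpreted.

```php
<?php
// Hypothetical helper: collect every string value stored under $key.
function extractValues(string $page, string $key): array
{
    // preg_quote() escapes regex metacharacters in the key and the "/" delimiter.
    $pattern = '/"' . preg_quote($key, '/') . '"\s*:\s*"([^"]*)"/';
    preg_match_all($pattern, $page, $matches);
    return $matches[1]; // only the captured values, not the full matches
}

// Sample markup copied from the question.
$page = <<<'JS'
var photos = {};
photos['photo-391094'] = {"LargeUrl": "http://www.example.org/images/1.png","Description":"blah blah balh"};
photos['photo-391095'] = {"LargeUrl": "http://www.example.org/images/2.png","Description":"blah blah balh"};
JS;

print_r(extractValues($page, 'LargeUrl'));
```

Calling extractValues($page, 'Description') would return the descriptions in the same way.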

There is definitely a regex that will match each image URL, but you could also, if it's easier for you, match the whole object and then json_decode() the matched string.
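A minimal sketch of that approach, assuming each object sits on its own line, ends with "};", and contains no nested braces (true of the sample in the question, but not guaranteed for a real page):

```php
<?php
// Sample markup copied from the question.
$page = <<<'JS'
var photos = {};
photos['photo-391094'] = {"LargeUrl": "http://www.example.org/images/1.png","Description":"blah blah balh"};
photos['photo-391095'] = {"LargeUrl": "http://www.example.org/images/2.png","Description":"blah blah balh"};
JS;

// Capture the {...} literal assigned to each photos['...'] entry.
preg_match_all('/photos\[\'[^\']+\'\]\s*=\s*(\{.*?\});/', $page, $matches);

$photos = [];
foreach ($matches[1] as $json) {
    $photo = json_decode($json, true); // true => associative array
    if (is_array($photo)) {            // skip anything that is not valid JSON
        $photos[] = $photo;
    }
}

foreach ($photos as $photo) {
    echo $photo['LargeUrl'], ' : ', $photo['Description'], "\n";
}
```

The advantage over a per-field regex is that every property of the object becomes available at once, and JSON escapes inside the values are decoded for you.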

Use the following code to get the result you need:
preg_match_all('/{"LargeUrl":(.*?)"(.*?)"/', $page, $result, PREG_PATTERN_ORDER);
for ($i = 0; $i < count($result[0]); $i++) {
echo "<pre>";
echo $result[2][$i];
echo "</pre>";
}

Related

PHP why doesn't preg_match work?

Hello I have the following code:
$headerOcc = substr_count($text, "[Image][");
if($headerOcc < '1'){
$text = $text;
}
for ($i=0; $i < $headerOcc; $i++) {
preg_match('/\[Image\]\[(.*)\]/', $text, $match);
$headerTitle = ($match[1]);
print($headerTitle);
}
I also have the following variable:
$text = "Hello world [Image][61][Image][62]Hello world";
I want to find out whether part of the text says [Image][ID]. The ID will be a number, and I want to extract it. How can I get $headerTitle to return the ID inside the square brackets? It currently gives me this when I print it:
61][Image][62
Which is not what I want. I want it to return:
61
62
What am I doing wrong and how can I fix it?
Use preg_match_all() and use \d+ to get the digits:
preg_match_all('/\[Image\]\[(\d+)\]/', $text, $matches);
Alternatively, you could match anything that is NOT a ], so ([^\]]+).
Then just loop over $matches[1] to echo each ID:
foreach($matches[1] as $headerTitle) {
print($headerTitle);
}
No need for the substr_count() or any of the other stuff.
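To see why the original pattern over-matched, here is a short comparison using the $text from the question. The greedy (.*) runs to the last ] in the string, while the negated class stops at the first one. The variable names $greedy and $all are just for illustration.

```php
<?php
$text = "Hello world [Image][61][Image][62]Hello world";

// Greedy: (.*) extends to the LAST "]" in the string.
preg_match('/\[Image\]\[(.*)\]/', $text, $greedy);
echo $greedy[1], "\n"; // 61][Image][62

// Negated class: [^\]]+ cannot cross a "]", so each ID is captured separately.
preg_match_all('/\[Image\]\[([^\]]+)\]/', $text, $all);
print_r($all[1]); // contains "61" and "62"
```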

preg_match() to check Image urls without [img] BB tags and return boolean value using PHP

In my text field I have images enclosed within [img] BB tags like
[img]http://i58.tinypic.com/i3yxar.jpg[/img]
and plain image URLs like
http://www.jonco48.com/blog/tongue1.jpg
I want preg_match to look for plain image URLs and return 1 if any are found, otherwise 0. How can I do this?
Thanks
With regex it is quite difficult to match a pattern only when something is absent around it, in this case the opening and closing img tags.
So I would count the URLs inside the tags, then count all the URLs, and compare the two counts:
$text = ""; // the text to check
$tagPattern = "/\[img\].+?\[\/img\]/";
preg_match_all($tagPattern, $text, $tagMatches);
$urlInTagCount = count($tagMatches[0]);
$plainPattern = "~https?://\S+\.(?:jpe?g|gif|png)(?:\?\S*)?(?=\s|$|\pP)~i";
preg_match_all($plainPattern, $text, $plainMatches);
$allUrlCount = count($plainMatches[0]);
return $allUrlCount > $urlInTagCount;
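A different route, not taken from the answers here, is to strip every [img]...[/img] pair first and then test whatever is left. The sketch below reuses the plain-URL pattern shown above; hasPlainImageUrl() is a hypothetical name, and it returns 1 or 0 as the question asks.

```php
<?php
// Hypothetical helper: 1 if $text has an image URL outside [img]...[/img], else 0.
function hasPlainImageUrl(string $text): int
{
    // Drop every [img]...[/img] pair so only untagged text remains.
    $withoutTags = preg_replace('#\[img\].+?\[/img\]#i', ' ', $text) ?? '';
    // preg_match() returns 1 on a match and 0 otherwise.
    return (int) preg_match(
        '~https?://\S+\.(?:jpe?g|gif|png)(?:\?\S*)?(?=\s|$|\pP)~i',
        $withoutTags
    );
}

echo hasPlainImageUrl('[img]http://i58.tinypic.com/i3yxar.jpg[/img]'), "\n"; // 0
echo hasPlainImageUrl('http://www.jonco48.com/blog/tongue1.jpg'), "\n";      // 1
```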
Using regex is really overkill for this if all you need to do is check whether or not there are [img][/img] tags around your string.
You can just as easily use some simple string functions:
function isBB($s){
// true when the string starts with [img] and ends with [/img]
return substr($s, 0, 5) === "[img]" && substr($s, -6) === "[/img]";
}
isBB('[img]http://i58.tinypic.com/i3yxar.jpg[/img]'); // true
isBB('http://www.jonco48.com/blog/tongue1.jpg'); // false
Here you have the REGEX : ~https?://\S+\.(?:jpe?g|gif|png)(?:\?\S*)?(?=\s|$|\pP)~i
In PHP :
preg_match('#\[img\](.+?)\[/img\]#', $your_text, $matches);
echo $matches[1];
The following should work as expected:
<?php
$str = '[img]http://i58.tinypic.com/i3yxar.jpg[/img]';
preg_match('#\[img\](.+?)\[/img\]#', $str, $matches);
echo $matches[1];

creating hyperlinks from php array elements

I have a hashtag system. (Note: $body is a variable holding a post that a user submits; the hashtags are in the posts.) I have tried to do this using regex, but I found this method equally efficient and a bit easier to follow.
<?php
$string = $body;
$htag = "#";
$arr = explode(" ", $string);
$arrc = count($arr);
$i = 0;
while($i < $arrc) {
if(substr($arr[$i], 0, 1) === $htag) {
$arr[$i] = "<a href = 'category.php?#=$arr[$i]'>".$arr[$i]."</a>";
}
$i++;
}
$string = implode(" ", $arr);
?>
Then, $string is echoed later in the page.
My problem is with how I link the hashtag to the category page using the PHP array element. On that page I want to read the word that was hashtagged and use a MySQL query to get the posts that contain that hashtag. However, when I echo $arr[$i], I get an error:
Undefined offset: 1 on the line where I assign this array element to another variable.
Is there any way I can complete this task in a better and more effective way?
Okay, so this can be greatly simplified using some regex and PHP's preg_replace function. @Doge was on the right track, but in my tests, \b didn't quite give the results that I think you want.
Basically, you can replace almost all of what you have with
$newText = preg_replace("/#(\w+)/", "<a href='category.php?tag=$1'>#$1</a>", $text);
As in…
$text = "This is a #hashtag within some #awesometext.";
$newText = preg_replace("/#(\w+)/", "<a href='category.php?tag=$1'>#$1</a>", $text);
echo $newText;
The result of this would be…
This is a <a href='category.php?tag=hashtag'>#hashtag</a> within some <a href='category.php?tag=awesometext'>#awesometext</a>.
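Note that the link in the question used ?#= as the query key. Everything after a # in a URL is the fragment, which the browser does not send to the server, so category.php never receives the tag; a named parameter such as tag avoids that. If you also want to escape the values before they go into the HTML, a callback variant could look like the sketch below. It assumes category.php reads the tag from $_GET['tag'].

```php
<?php
$text = "This is a #hashtag within some #awesometext.";

$newText = preg_replace_callback(
    '/#(\w+)/',
    function (array $m): string {
        // $m[0] is "#tag", $m[1] is "tag" without the "#".
        $href = 'category.php?tag=' . urlencode($m[1]);
        return "<a href='" . htmlspecialchars($href, ENT_QUOTES) . "'>"
             . htmlspecialchars($m[0], ENT_QUOTES) . '</a>';
    },
    $text
);

echo $newText;
```

For this input the output is the same as with preg_replace; the callback only matters once the surrounding text or the tags can contain characters that need encoding.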

PHP extract one part of a string

I have to extract the email from the following string:
$string = 'other_text_here to=<my.email@domain.fr> other_text_here <my.email@domain.fr> other_text_here';
The server sends me logs in this format. How can I get the email into a variable without the "to=<" and ">"?
Update: I've updated the question. It seems the email can appear several times in the string, and the regular expression won't work well with that.
You can try a more restrictive regex:
$string = 'other_text_here to=<my.email@domain.fr> other_text_here';
preg_match('/to=<([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})>/i', $string, $matches);
echo $matches[1];
A simple regular expression should be able to do it:
$string = 'other_text_here to=<my.email@domain.fr> other_text_here';
preg_match( "/\<(.*)\>/", $string, $r );
$email = $r[1];
When you echo $email, you get "my.email@domain.fr".
Try this:
<?php
$str = "The day is <tag> beautiful </tag> isn't it? ";
preg_match("'<tag>(.*?)</tag>'si", $str, $match);
$output = array_pop($match);
echo $output;
?>
output:
beautiful
A regular expression would be easy if you are certain that < and > aren't used anywhere else in the string:
if (preg_match_all('/<(.*?)>/', $string, $emails)) {
$emails = $emails[1]; // Keep only the captured group (the text between < and >)
}
// $emails is now an array of emails if any exist in the string
The parentheses tell it to capture for the $emails array. The .* picks up any characters and the ? tells it not to be greedy, so the match stops at the first > instead of running on to the last one.
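For the updated question, where the address can show up more than once, one option is to anchor on the to=< prefix and remove duplicates afterwards. This is a sketch under the assumption that only the to=<...> occurrences matter; $recipients is a name chosen for illustration.

```php
<?php
$string = 'other_text_here to=<my.email@domain.fr> other_text_here <my.email@domain.fr> other_text_here';

// [^<>\s]+ matches everything up to the closing ">" without crossing
// another bracket or any whitespace.
preg_match_all('/to=<([^<>\s]+)>/', $string, $m);

// Remove repeated addresses and reindex the result from 0.
$recipients = array_values(array_unique($m[1]));
print_r($recipients);
```

Dropping the to= prefix from the pattern would pick up every bracketed address instead, and array_unique() would still collapse the repeats.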

PHP Regex match string but exclude a certain word

This question has been asked multiple times, but I didn't find a working solution for my needs.
I've created a function to check for the URLs on the output of the Google Ajax API:
https://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=site%3Awww.bierdopje.com%2Fusers%2F+%22Gebruikersprofiel+van+%22+Stevo
I want to exclude the word "profile" from the output. So that if the string contains that word, skip the whole string.
This is the function I've created so far:
function getUrls($data)
{
$regex = '/https?\:\/\/www.bierdopje.com[^\" ]+/i';
preg_match_all($regex, $data, $matches);
return ($matches[0]);
}
$urls = getUrls($data);
$filteredurls = array_unique($urls);
I've created a sample to make clear what I mean exactly:
http://rubular.com/r/1U9YfxdQoU
In the sample you can see 4 strings selected from which I only need the upper 2 strings.
How can I accomplish this?
function getUrls($data)
{
$regex = '#"(https?://www\\.bierdopje\\.com[^"]*+(?<!/profile))"#';
return preg_match_all($regex, $data, $matches) ?
array_unique($matches[1]) : array();
}
$urls = getUrls($data);
Result: http://ideone.com/dblvpA
vs json_decode: http://ideone.com/O8ZixJ
But generally you should use json_decode.
Don't use regular expressions to parse JSON data. What you want to do is parse the JSON and loop over it to find the correct matching elements.
Sample code:
$input = file_get_contents('https://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=site%3Awww.bierdopje.com%2Fusers%2F+%22Gebruikersprofiel+van+%22+Stevo');
$parsed = json_decode($input);
$cnt = 0;
foreach($parsed->responseData->results as $response)
{
// Skip strings with 'profile' in there
if(strpos($response->url, 'profile') !== false)
continue;
echo "Result ".++$cnt."\n\n";
echo 'URL: '.$response->url."\n";
echo 'Shown: '.$response->visibleUrl."\n";
echo 'Cache: '.$response->cacheUrl."\n\n\n";
}
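The same filtering can be tried without a network call. The sketch below uses an inline, hypothetical response that contains only the fields the loop above reads (responseData, results, url); the two URLs are made up for illustration. It also checks that decoding succeeded before looping.

```php
<?php
// Hypothetical response, shaped like the fields the loop above reads.
$input = '{"responseData":{"results":['
       . '{"url":"http://www.bierdopje.com/users/Stevo"},'
       . '{"url":"http://www.bierdopje.com/users/Stevo/profile"}'
       . ']}}';

$parsed = json_decode($input, true); // true => associative arrays
if (!is_array($parsed) || !isset($parsed['responseData']['results'])) {
    throw new RuntimeException('Unexpected or invalid JSON');
}

$urls = [];
foreach ($parsed['responseData']['results'] as $result) {
    // Keep only URLs that do not contain the word "profile".
    if (strpos($result['url'], 'profile') === false) {
        $urls[] = $result['url'];
    }
}
$urls = array_values(array_unique($urls));
print_r($urls);
```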
