PHP Regex match string but exclude a certain word - php

This question has been asked multiple times, but I didn't find a working solution for my needs.
I've created a function to check for the URLs on the output of the Google Ajax API:
https://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=site%3Awww.bierdopje.com%2Fusers%2F+%22Gebruikersprofiel+van+%22+Stevo
I want to exclude the word "profile" from the output, so that any string containing that word is skipped entirely.
This is the function I've created so far:
function getUrls($data)
{
$regex = '/https?\:\/\/www.bierdopje.com[^\" ]+/i';
preg_match_all($regex, $data, $matches);
return ($matches[0]);
}
$urls = getUrls($data);
$filteredurls = array_unique($urls);
I've created a sample to make clear what I mean exactly:
http://rubular.com/r/1U9YfxdQoU
In the sample you can see 4 strings selected from which I only need the upper 2 strings.
How can I accomplish this?

function getUrls($data)
{
$regex = '#"(https?://www\\.bierdopje\\.com[^"]*+(?<!/profile))"#';
return preg_match_all($regex, $data, $matches) ?
array_unique($matches[1]) : array();
}
$urls = getUrls($data);
Result: http://ideone.com/dblvpA
vs json_decode: http://ideone.com/O8ZixJ
But generally you should use json_decode.
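Note that the lookbehind above only rejects URLs that end in /profile. If the word may appear anywhere in the URL, a negative lookahead checked at every character is one option. This is only a minimal sketch: the sample $data string and the function name getUrlsWithoutWord are made up for illustration and are not the real API response.

```php
<?php
// Hypothetical sample input; the real API response is not reproduced here.
$data = '{"a":"http://www.bierdopje.com/users/Stevo",'
      . '"b":"http://www.bierdopje.com/users/Stevo/profile",'
      . '"c":"http://www.bierdopje.com/users/Stevo/profile/edit"}';

function getUrlsWithoutWord($data, $word)
{
    // (?:(?!word)[^"])* consumes one character at a time, but only while
    // the excluded word does not start at the current position.
    $regex = '#"(https?://www\.bierdopje\.com(?:(?!'
           . preg_quote($word, '#')
           . ')[^"])*)"#i';

    return preg_match_all($regex, $data, $matches)
        ? array_values(array_unique($matches[1]))
        : array();
}

$urls = getUrlsWithoutWord($data, 'profile');
print_r($urls);
```

With this sample input only the first URL survives, because the other two contain the excluded word before the closing quote.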

Don't use regular expressions to parse JSON data. What you want to do is parse the JSON and loop over it to find the correct matching elements.
Sample code:
$input = file_get_contents('https://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=site%3Awww.bierdopje.com%2Fusers%2F+%22Gebruikersprofiel+van+%22+Stevo');
$parsed = json_decode($input);
$cnt = 0;
foreach($parsed->responseData->results as $response)
{
// Skip strings with 'profile' in there
if(strpos($response->url, 'profile') !== false)
continue;
echo "Result ".++$cnt."\n\n";
echo 'URL: '.$response->url."\n";
echo 'Shown: '.$response->visibleUrl."\n";
echo 'Cache: '.$response->cacheUrl."\n\n\n";
}
Sample on CodePad (since it doesn't support loading external files the string is inlined there)
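For anyone who wants to run this without network access, here is a rough self-contained variant. The JSON below is a trimmed, invented response that only assumes the responseData->results[]->url shape used in the loop above, and filterResultUrls is a hypothetical helper name.

```php
<?php
// Hypothetical, trimmed-down response; the values are made up for illustration.
$input = '{"responseData":{"results":['
    . '{"url":"http://www.bierdopje.com/users/Stevo"},'
    . '{"url":"http://www.bierdopje.com/users/Stevo/profile"},'
    . '{"url":"http://www.bierdopje.com/users/Stevo"}'
    . ']}}';

function filterResultUrls($json, $excluded)
{
    $parsed = json_decode($json);
    // json_decode returns null on malformed input, so bail out early.
    if ($parsed === null || !isset($parsed->responseData->results)) {
        return array();
    }

    $urls = array();
    foreach ($parsed->responseData->results as $result) {
        if (stripos($result->url, $excluded) !== false) {
            continue; // skip any URL containing the excluded word
        }
        $urls[] = $result->url;
    }

    // Drop duplicates and reindex from 0.
    return array_values(array_unique($urls));
}

$filtered = filterResultUrls($input, 'profile');
print_r($filtered);
```

Returning an array instead of echoing keeps the filtering step reusable; the deduplication mirrors the array_unique call in the question.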

Related

Replacing multiple href values using a regular expression in PHP

I know this question has more of a WordPress background to it, but I'm hoping it's just my lack of PHP knowledge that is the problem here.
I have a regex that looks for <a> tags such as: Find Out More, which are generated from content inputted by the user. Once it finds this content a foreach loop runs over the matched text and uses a WordPress function $postId = url_to_postid( $url ); to convert the URL into a PostID.
This is an example output: Find Out More.
This works fine as long as there is one link in each piece of matched text. However if there are two or more links, it sets every link to have the same href which is incorrect.
I'm sure this is something to do with how I've got my loop running. The code I'm using is below:
<?php
$string = get_field('sample_text_box');
$pattern = "/(?<=href=(\"|'))[^\"']+(?=(\"|'))/";
preg_match_all($pattern, $string, $matches);
$urls = $matches[0];
foreach($urls as $key => $url) {
$postId[$key] = url_to_postid( $url );
}
$newstring = preg_replace($pattern , $postId[$key] , $string);
echo $newstring;
?>
The line get_field('sample_text_box'); is an Advanced Custom Field function which "returns the value of the specified field". You can read about it here if it helps: https://www.advancedcustomfields.com/resources/get_field/
Thanks!
The problem with your code is that you are indeed replacing the pattern with a single value of $postId[$key], where $key is the last value assigned in foreach.
You also cannot pass $postId itself, because preg_replace does not accept an array of replacements when the pattern is a single string; an array replacement requires an array of patterns.
However, the problem is easily solved with preg_replace_callback function:
// Look behind for href= followed by either quote character, then capture the URL.
$pattern = '/(?<=href=["\'])([^"\']+)(?=["\'])/';
$new_string = preg_replace_callback($pattern, function ($matches) {
return isset($matches[1]) ? url_to_postid($matches[1]) : $matches[0];
}, $string);
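To try the callback approach outside WordPress, something like the following should work. It is a sketch under assumptions: lookupPostId is a hypothetical stand-in for url_to_postid(), and the URLs and IDs are invented.

```php
<?php
// Hypothetical stand-in for WordPress's url_to_postid(), so the sketch runs
// outside WordPress. The URL-to-ID map is invented for illustration.
function lookupPostId($url)
{
    $map = array(
        'http://example.com/about/'   => 12,
        'http://example.com/contact/' => 34,
    );
    return isset($map[$url]) ? $map[$url] : 0;
}

$string = '<a href="http://example.com/about/">About</a> and '
        . "<a href='http://example.com/contact/'>Contact</a>";

// Capture the quote character so the closing quote must match the opening one.
$pattern = '/href=(["\'])(.*?)\1/';

$newString = preg_replace_callback($pattern, function ($m) {
    // $m[1] is the quote character, $m[2] the original URL.
    return 'href=' . $m[1] . lookupPostId($m[2]) . $m[1];
}, $string);

echo $newString, "\n";
```

Because the callback runs once per match, each link gets its own ID rather than the last one computed in a loop.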

make a regex that matches a string which comprises a url and image extensions

I am bad at regex, having said that, I am trying to make a regex that will allow all the picture types of a url something like this:
$regex = '~mydomain.ide.com/polopoly_fs/([a-z0-9\\.\\_\\-]+(\\.gif|\\.png|\\.jpe?g))~i';
//include all formats
The purpose is to get a web's pictures with a function that could be like this:
/**
* Get the Images
* @param string $html
*/
function getImages($html) {
echo "Hello from getImages <br/>";
$matches = array();
$regex = '~http://mySite.ide.com/([a-z0-9\\.\\_\\-]+(\\.gif|\\.png|\\.jpe?g))~i'; //include all formats
preg_match_all($regex, $html, $matches);
foreach ($matches[1] as $img) {
saveImg($img);
}
}
But I need a regex which does it with a url with this structure:
http://mydomain.ide.com/polopoly_fs/1.573651!imageManager/3188890795.jpg
Can anybody help?
Thank you
Something like this should suffice.
/(.*\.com\/[a-z_-]+\/[0-9\.a-z!]+\/[a-z0-9_-]+\.(gif|jpg|jpeg|png))/i
Here's the PHP code
<?php
$s = "http://mydomain.ide.com/polopoly_fs/1.573651!imageManager/3188890795.jpg";
preg_match("/(.*\.com\/[a-z_-]+\/[0-9\.a-z!]+\/[a-z0-9_-]+\.(gif|jpg|jpeg|png))/i", $s, $arrMatches);
echo print_r($arrMatches[0], true);
Live Preview
Case insensitive match
Will only match .com tld, but can be adapted to match more
Ok now that's a little more clear I can help a little more.
If I understand correctly, you want to search a whole HTML document for fully qualified image URLs and save them, regardless of their names or where they are located.
function getImages($html) {
echo "Hello from getImages <br/>";
$matches = array();
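// [^\s"\'<>]* keeps a match from running across several URLs on one line
$regex = '~http://[^\s"\'<>]*/[a-z0-9._-]+(?:\.gif|\.png|\.jpe?g)~i'; //include all formats
preg_match_all($regex, $html, $matches);
// $matches[0] holds every full match
foreach ($matches[0] as $img) {
saveImg($img);
}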
}
You don't have to capture anything here; the match itself is enough. The only regex metacharacter that needs escaping is the . so that it matches a literal dot in .gif, .png, etc.
If I misunderstood, feel free to comment and I'll adapt
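Here is a small runnable sketch of the same idea that returns the URLs instead of saving them. The HTML fragment and the function name findImageUrls are assumptions for illustration; saveImg() from the question is left out because its definition is not shown.

```php
<?php
// Hypothetical HTML fragment; the host and paths mirror the question's example.
$html = '<img src="http://mydomain.ide.com/polopoly_fs/1.573651!imageManager/3188890795.jpg">'
      . '<img src="http://mydomain.ide.com/polopoly_fs/1.2!imageManager/logo.PNG">'
      . '<a href="http://mydomain.ide.com/index.html">home</a>';

function findImageUrls($html)
{
    // [^\s"\'<>]* keeps each match inside a single attribute value, so two
    // images on the same line are not merged into one match.
    $regex = '~https?://[^\s"\'<>]*/[a-z0-9._-]+\.(?:gif|png|jpe?g)~i';
    preg_match_all($regex, $html, $matches);

    return $matches[0]; // full matches only; there are no capture groups
}

$images = findImageUrls($html);
print_r($images);
```

The link to index.html is skipped because it does not end in one of the listed extensions, while the i flag lets logo.PNG through.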

regular expression word preceded by char

I want to grab a specific string only if a certain word is followed by a = sign.
Also, I want to get all the info after that = sign until a / is reached or the string ends.
Let's take into example:
somestring.bla/test=123/ohboy/item/item=capture
I want to get item=capture but not item alone.
I was thinking about using lookaheads but I'm not sure if this is the way to go. I appreciate any help as I'm trying to grasp more and more about regular expressions.
[^/=]*=[^/]*
will give you all the pairs that match your requirements.
So from your example it should return:
test=123
item=capture
Refiddle Demo
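In PHP the pattern above could be used roughly like this. The sketch adds two capture groups and uses + on the key so an empty key is not accepted; both changes are my own additions, not part of the pattern above.

```php
<?php
$string = 'somestring.bla/test=123/ohboy/item/item=capture';

// "#" is used as the delimiter so the slash does not need escaping.
// Group 1 is the key, group 2 the value up to the next "/" or end of string.
preg_match_all('#([^/=]+)=([^/]*)#', $string, $matches, PREG_SET_ORDER);

$pairs = array();
foreach ($matches as $m) {
    $pairs[$m[1]] = $m[2]; // a later duplicate key overwrites an earlier one
}

print_r($pairs);
```

The plain "item" segment is ignored because it is followed by a slash rather than an equals sign.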
If you want to capture item=capture, it is straightforward:
/item=[^\/]*/
If you want to also extract the value,
/item=([^\/]*)/
If you only want to match the value, then you need to use a look-behind.
/(?<=item=)[^\/]*/
EDIT: too many errors due to insomnia. Also, screw PHP and its failure to disregard separators in a character group as separators.
Here is a function I wrote some time ago. I modified it a little, and added the $keys argument so that you can specify valid keys:
function getKeyValue($string, Array $keys = null) {
$keys = (empty($keys) ? '[\w\d]+' : implode('|', $keys));
$pattern = "/(?<=\/|^)(?P<key>{$keys})\s*=\s*(?P<value>.+?)(?=\/|$)/";
preg_match_all($pattern, $string, $matches, PREG_SET_ORDER);
foreach ($matches as & $match) {
foreach ($match as $key => $value) {
if (is_int($key)) {
unset($match[$key]);
}
}
}
return $matches ?: FALSE;
}
Just throw in the string and the valid keys:
$string = 'somestring.bla/test=123/ohboy/item/item=capture';
$keys = array('test', 'item');
$keyValuePairs = getKeyValue($string, $keys);
var_dump($keyValuePairs);

php regex - Scraping images from javascript object

I'm trying to scrape images from the markup of certain webpages. These webpages all have a slideshow, and the image sources are contained in JavaScript objects on the page. I'm thinking I need to call file_get_contents("http://www.example.com/page/1"); and then have a preg_match_all() call into which I can input a phrase (i.e. "\"LargeUrl\":\"" or "\"Description\":\"") and get the string of characters up to the next quotation mark.
var photos = {};
photos['photo-391094'] = {"LargeUrl": "http://www.example.org/images/1.png","Description":"blah blah balh"};
photos['photo-391095'] = {"LargeUrl": "http://www.example.org/images/2.png","Description":"blah blah balh"};
photos['photo-391096'] = {"LargeUrl": "http://www.example.org/images/3.png","Description":"blah blah balh"};
I have this function, but it returns the entire line after the input phrase. How can I modify it to return whatever follows the input phrase up to the next quotation mark? Or am I doing this all wrong and there's a better way?
$page = file_get_contents("http://www.example.org/page/1");
$word = "\"LargeUrl\":\"";
if(preg_match_all("/(?<=$word)\S+/i", $page, $matches))
{
echo "<pre>";
print_r($matches);
echo "</pre>";
}
Ideally the function would return an array like the following if I input "\"LargeUrl\":\""
$matches[0] = "http://www.example.org/images/1.png";
$matches[1] = "http://www.example.org/images/2.png";
$matches[2] = "http://www.example.org/images/3.png";
You can use parentheses to capture the parts you're interested in. A simple regex to do it is
$word = '"LargeUrl":';
$pattern = "$word" . '\s*"([^"]+)"';
preg_match_all("/$pattern/", $page, $matches);
print_r($matches[1]);
There is definitely a regex that will match each image URL, but you could also, if its easier for you, match the whole object and then json_decode() the matched string
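The "match the whole object, then json_decode() it" idea could look something like this. It is a sketch that assumes each photos[...] assignment sits on one line and holds a flat object with no nested braces; extractPhotoField is a hypothetical helper name, and the page content is inlined from the question instead of being fetched.

```php
<?php
// Inlined copy of the snippet from the question, standing in for the fetched page.
$page = <<<'JS'
var photos = {};
photos['photo-391094'] = {"LargeUrl": "http://www.example.org/images/1.png","Description":"blah blah balh"};
photos['photo-391095'] = {"LargeUrl": "http://www.example.org/images/2.png","Description":"blah blah balh"};
JS;

function extractPhotoField($page, $field)
{
    // Grab each object literal assigned to photos[...]. The dot does not match
    // newlines, so every match stays on its own line.
    preg_match_all('/photos\[[^\]]+\]\s*=\s*(\{.*\});/', $page, $matches);

    $values = array();
    foreach ($matches[1] as $json) {
        $photo = json_decode($json, true);
        if (is_array($photo) && isset($photo[$field])) {
            $values[] = $photo[$field];
        }
    }
    return $values;
}

$largeUrls = extractPhotoField($page, 'LargeUrl');
print_r($largeUrls);
```

The same call with 'Description' returns the descriptions, and the JSON parser takes care of whitespace and escaped characters inside the values.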
I have a solution for you: use the following code and you will get the result you need.
preg_match_all('/{"LargeUrl":(.*?)"(.*?)"/', $page, $result, PREG_PATTERN_ORDER);
for ($i = 0; $i < count($result[0]); $i++) {
echo "<pre>";
echo $result[2][$i];
echo "</pre>";
}
Thanks......p2c

mb_eregi_replace multiple matches get them

$string = 'test check one two test3';
$result = mb_eregi_replace ( 'test|test2|test3' , '<$1>' ,$string ,'i');
echo $result;
This should deliver: <test> check one two <test3>
Is it possible to get, that test and test3 was found, without using another match function ?
You can use preg_replace_callback instead:
$string = 'test check one two test3';
$matches = array();
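// &$matches: capture by reference so the outer array is filled; longer alternatives go first so test3 is not matched as test
$result = preg_replace_callback('/test3|test2|test/i' , function($match) use (&$matches) {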
$matches[] = $match;
return '<'.$match[0].'>';
}, $string);
echo $result;
Here preg_replace_callback will call the passed callback function for each match of the pattern (note that its syntax differs from POSIX). In this case the callback function is an anonymous function that adds the match to the $matches array and returns the substitution string that the matches are to be replaced by.
Another approach would be to use preg_split to split the string at the matched delimiters while also capturing the delimiters:
$parts = preg_split('/(test3|test2|test)/i', $string, -1, PREG_SPLIT_DELIM_CAPTURE);
The result is an array of alternating non-matching and matching parts.
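As a rough illustration of the preg_split route, the matched words can be picked out of the alternating array like this. Note that PREG_SPLIT_DELIM_CAPTURE only returns delimiters that are wrapped in parentheses; the $found variable is just a name chosen for this sketch.

```php
<?php
$string = 'test check one two test3';

// The parentheses are what make PREG_SPLIT_DELIM_CAPTURE return the delimiters;
// longer alternatives come first so "test3" is not cut short at "test".
$parts = preg_split('/(test3|test2|test)/i', $string, -1, PREG_SPLIT_DELIM_CAPTURE);

$found = array();
foreach ($parts as $i => $part) {
    if ($i % 2 === 1) {   // odd positions hold the captured delimiters
        $found[] = $part;
    }
}

print_r($parts);
print_r($found);
```

The even positions hold the text between matches, including empty strings when a match sits at the very start or end of the input.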
As far as I know, eregi is deprecated.
You could do something like this:
<?php
$str = 'test check one two test3';
$to_match = array("test", "test2", "test3");
$rep = array();
foreach($to_match as $val){
$rep[$val] = "<$val>";
}
echo strtr($str, $rep);
?>
This too allows you to easily add more strings to replace.
Hi, the following function checks whether all of the given words are found in a string:
<?php
function searchword($string, $words)
{
$matchFound = count($words); // number of words that must be found
$tempMatch = 0;
foreach ( $words as $word )
{
preg_match('/'.preg_quote($word, '/').'/',$string,$matches);
//print_r($matches);
if(!empty($matches))
{
$tempMatch++;
}
}
if($tempMatch==$matchFound)
{
return "found";
}
else
{
return "notFound";
}
}
$string = "test check one two test3";
/*** an array of words to highlight ***/
$words = array('test', 'test3');
$string = searchword($string, $words);
echo $string;
?>
If your string is utf-8, you could use preg_replace instead
$string = 'test check one two test3';
$result = preg_replace('/(test|test2|test3)/ui' , '<$1>' ,$string);
echo $result;
Obviously, with this kind of data to match, the result will be suboptimal:
<test> check one two <test>3
You'll need a longer approach than a direct search and replace with regular expressions (surely if your patterns are prefixes of other patterns)
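One possible "longer approach" is to sort the words by length before building the alternation, so that a word is always tried before any of its prefixes. This is a sketch only; wrapWords and its by-reference $found parameter are hypothetical names introduced here.

```php
<?php
function wrapWords($string, array $words, array &$found = array())
{
    // Longest first, so "test3" is tried before its prefix "test".
    usort($words, function ($a, $b) {
        return strlen($b) - strlen($a);
    });

    // Quote each word so regex metacharacters in it are taken literally.
    $alternation = implode('|', array_map(function ($w) {
        return preg_quote($w, '/');
    }, $words));

    return preg_replace_callback('/' . $alternation . '/ui', function ($m) use (&$found) {
        $found[] = $m[0];          // remember what was matched
        return '<' . $m[0] . '>';  // and wrap it in angle brackets
    }, $string);
}

$found = array();
$result = wrapWords('test check one two test3', array('test', 'test2', 'test3'), $found);
echo $result, "\n";
print_r($found);
```

Sorting inside the function means callers can pass the words in any order and still collect the matches in $found.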
To begin with, the code you want to enhance does not seem to do what it is meant to (at least not on my computer). You can try something like this:
$string = 'test check one two test3';
$result = mb_eregi_replace('(test|test2|test3)', '<\1>', $string);
echo $result;
I've removed the i flag (which of course makes little sense here). You'd still need to make the expression greedy, though.
As for the original question, here's a little proof of concept:
function replace($match){
$GLOBALS['matches'][] = $match;
return "<$match>";
}
$string = 'test check one two test3';
$matches = array();
$result = mb_eregi_replace('(test|test2|test3)', 'replace(\'\1\')', $string, 'e');
var_dump($result, $matches);
Please note this code is horrible and potentially insecure: the e option evaluates the replacement as PHP code, and it has been deprecated since PHP 7.1. I'd honestly go with the preg_replace_callback() solution proposed by Gumbo.
