This script won't find Absolute Urls - php

The code below is supposed to scan a page for links and index them in the $links array, but for some reason they are not indexed.
I am starting to wonder whether my regex is wrong; how can I improve it? Or is it my file_get_contents call? Is it used correctly?
$links = Array();
$URL = 'http://www.theqlick.com'; // change it for urls to grab
// grabs the urls from URL
$file = file_get_contents($URL);
$abs_url = preg_match_all("'^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$^'", $file, $link);
if (!empty($abs_url)) {
$links[] = $abs_url;
}

In your preg_match_all you are saving into $link not $links.

preg_match_all Returns the number of full pattern matches (which might be zero), or FALSE if an error occurred (c) php.net
preg_match_all("'^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$^'", $file, $matches);
if (!empty($matches))
$links = $matches;

Your regex is wrong. You have a head anchor ^ at the end of the pattern, adjacent to a tail anchor $. I don't think the anchors are needed at all. Additionally, you are storing the matches in $link (no s). Plus, your pattern delimiter appears to be the ' character. Was that intentional? It happens to work, but I'm guessing you didn't intend it.
Try this:
$matchCount = preg_match_all("/(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?/", $file, $matches);
if ($matchCount)
{
// $matches[0] holds the full-pattern matches; the other entries hold the capture groups
foreach ($matches[0] as $match)
{
$links[] = $match;
}
}
Read up on PHP regular expressions.
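A minimal sketch of the idea above, assuming the goal is to collect absolute http/https URLs from href attributes. The helper name extractAbsoluteUrls, the simplified pattern, and the sample HTML string are made up for illustration; they are not the original poster's code.

```php
<?php
// Hypothetical helper: collect absolute http(s) URLs found in href attributes.
function extractAbsoluteUrls(string $html): array
{
    // '#' is the delimiter, so '/' needs no escaping; group 1 captures the URL.
    $pattern = '#href=["\'](https?://[^"\']+)["\']#i';
    if (!preg_match_all($pattern, $html, $matches)) {
        return []; // zero matches (0) or an error (false)
    }
    // $matches[0] holds the full matches, $matches[1] the captured URLs.
    return array_values(array_unique($matches[1]));
}

// Made-up sample input so the sketch runs without network access.
$html = '<a href="http://www.example.com/a">A</a> <a href="/relative">B</a> '
      . '<a href="https://example.org/b?x=1">C</a>';
$links = extractAbsoluteUrls($html);
print_r($links);
```

The relative link is skipped because the pattern requires a scheme. For real pages, a DOM parser is usually more robust than a regex.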

Related

How to check a specific url_path in preg_match

How can I check whether a specific path matches a pattern?
Example:
I have a path pattern with one or more unknown variables
$pathPattern = 'user/?/stats';
And let's say I received this path
$receivedPath = 'user/12/stats';
So, how can I check whether the received path matches my pattern?
I tried something like the code below, but it didn't work.
$pathPattern = 'user/?/stats';
$receivedPath = 'user/12/stats';
$pathPatternReg = str_replace('?','.*',$pathPattern);
echo preg_match('/$pathPatternReg/', $receivedPath);
Thank you.
The regex for an unknown user ID should be something like this: user\/[0-9]+\/stats
And it could be used like this (note that preg_match needs pattern delimiters):
if(preg_match("/user\/[0-9]+\/stats/",$variable)) { .... }
As stated by Tom, you have to escape the '/' characters with a '\' when '/' is also the delimiter.
You only want to capture one specific part of the path, the number in the center. Most regex engines provide this functionality in the form of round brackets (capture groups), like this:
$pattern = "/user\/([0-9]+)\/stats/";
Notice the round brackets around the [0-9]+ : it tells preg_match to store this part of the matched pattern in the $matches array.
So, your code could look like this:
$subject = "user/12/stats";
$pattern = "/user\/([0-9]+)\/stats/";
$matches = array();
if( preg_match($pattern, $subject, $matches) ){
// there was a match
// The $matches array now looks like this:
// { "user/12/stats", "12" }
// { <whole matched string>, <string in first parenthesis>, .... }
$user_id = $matches[1];
// ...
}
(not tested)
See also here: https://secure.php.net/manual/en/function.preg-match.php
Thank you both.
This is a solution:
$pathPattern = 'user/?/stats';
$receivedPath = 'user/12/stats';
$pathPatternReg = str_replace(['/','?'],['\/','.*'],$pathPattern);
$pattern = "/^$pathPatternReg/";
echo preg_match($pattern, $receivedPath);
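A slightly stricter sketch of the same idea, assuming each ? should stand for exactly one path segment. The helper name pathMatches is hypothetical. It uses preg_quote so that any other regex metacharacters in the pattern are escaped as well.

```php
<?php
// Hypothetical helper: treat each '?' in $pathPattern as one path segment.
function pathMatches(string $pathPattern, string $receivedPath): bool
{
    // Escape every regex metacharacter, including the '/' delimiter.
    $quoted = preg_quote($pathPattern, '/');
    // preg_quote turned '?' into '\?'; swap that for "one or more non-slash characters".
    $regex = '/^' . str_replace('\?', '[^\/]+', $quoted) . '$/';
    return preg_match($regex, $receivedPath) === 1;
}

var_dump(pathMatches('user/?/stats', 'user/12/stats'));   // bool(true)
var_dump(pathMatches('user/?/stats', 'user/12/3/stats')); // bool(false)
```

Anchoring with both ^ and $ keeps longer paths such as user/12/stats/extra from matching.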

PHP Regex match string but exclude a certain word

This question has been asked multiple times, but I didn't find a working solution for my needs.
I've created a function to check for the URLs on the output of the Google Ajax API:
https://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=site%3Awww.bierdopje.com%2Fusers%2F+%22Gebruikersprofiel+van+%22+Stevo
I want to exclude the word "profile" from the output, so that any string containing that word is skipped entirely.
This is the function I've created so far:
function getUrls($data)
{
$regex = '/https?\:\/\/www.bierdopje.com[^\" ]+/i';
preg_match_all($regex, $data, $matches);
return ($matches[0]);
}
$urls = getUrls($data);
$filteredurls = array_unique($urls);
I've created a sample to make clear what I mean exactly:
http://rubular.com/r/1U9YfxdQoU
In the sample you can see 4 strings selected from which I only need the upper 2 strings.
How can I accomplish this?
function getUrls($data)
{
$regex = '#"(https?://www\\.bierdopje\\.com[^"]*+(?<!/profile))"#';
return preg_match_all($regex, $data, $matches) ?
array_unique($matches[1]) : array();
}
$urls = getUrls($data);
Result: http://ideone.com/dblvpA
vs json_decode: http://ideone.com/O8ZixJ
But generally you should use json_decode.
Don't use regular expressions to parse JSON data. What you want to do is parse the JSON and loop over it to find the correct matching elements.
Sample code:
$input = file_get_contents('https://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=site%3Awww.bierdopje.com%2Fusers%2F+%22Gebruikersprofiel+van+%22+Stevo');
$parsed = json_decode($input);
$cnt = 0;
foreach($parsed->responseData->results as $response)
{
// Skip strings with 'profile' in there
if(strpos($response->url, 'profile') !== false)
continue;
echo "Result ".++$cnt."\n\n";
echo 'URL: '.$response->url."\n";
echo 'Shown: '.$response->visibleUrl."\n";
echo 'Cache: '.$response->cacheUrl."\n\n\n";
}
Sample on CodePad (since it doesn't support loading external files the string is inlined there)
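The same filtering can be sketched offline, assuming the response has the responseData->results shape used in the sample code above. The JSON string and its URLs are made up for illustration.

```php
<?php
// Made-up sample shaped like the responseData->results structure used above.
$input = '{"responseData":{"results":['
       . '{"url":"http://www.example.com/users/1"},'
       . '{"url":"http://www.example.com/users/1/profile"},'
       . '{"url":"http://www.example.com/users/2"}]}}';

$parsed = json_decode($input);
$urls = [];
if ($parsed !== null && isset($parsed->responseData->results)) {
    foreach ($parsed->responseData->results as $result) {
        // Skip any URL containing the word "profile".
        if (strpos($result->url, 'profile') === false) {
            $urls[] = $result->url;
        }
    }
}
$urls = array_values(array_unique($urls));
print_r($urls);
```

Checking the json_decode result before looping avoids warnings when the input is not valid JSON.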

PHP Regex Matching Image URLs

This is my Image Url PHP Code.
$GetImage = 'https://lh6.ggpht.com/hWXw7YRl9DpSMewd29xT9rvxcgnmGXeXSY9FTaPc3cbBCa-JO8yfwSynmD5C1DLglw=w124';
preg_match_all("/https://\w\w\d.\w+.com/[\w-]+=\w\d{2,3}/", $GetImage, $Result, PREG_SET_ORDER);
It's working for me, but I want to extract the part matched by "[\w-]+"; in other words, I want to extract the string "hWXw7YRl9DpSMewd29xT9rvxcgnmGXeXSY9FTaPc3cbBCa-JO8yfwSynmD5C1DLglw" from my image URL.
Can anybody please help me solve this problem?
Thanks
I feel it's overkill to try to match the entire URL using a regular expression. I suggest you parse the URL first using PHP's built-in function parse_url().
<?php
$str = 'https://lh6.ggpht.com/hWXw7YRl9DpSMewd29xT9rvxcgnmGXeXSY9FTaPc3cbBCa-JO8yfwSynmD5C1DLglw=w124';
// Parse the URL before applying a regex. Only get the path part. Use substring to remove the leading slash
$path = substr( parse_url( $str, PHP_URL_PATH ), 1 );
$pattern = '/([^=]+)/';
$matches = array();
if ( preg_match( $pattern, $path, $matches ) ) {
// Regex matched
$id = $matches[1];
// Outputs: string 'hWXw7YRl9DpSMewd29xT9rvxcgnmGXeXSY9FTaPc3cbBCa-JO8yfwSynmD5C1DLglw' (length=66)
var_dump( $id );
}
?>
Note that the snippet does not check the domain name. You can easily adjust the script to do so by not limiting the parse_url() function to only return the path, but also the other parts.
Try it like this:
$GetImage = 'https://lh6.ggpht.com/hWXw7YRl9DpSMewd29xT9rvxcgnmGXeXSY9FTaPc3cbBCa-JO8yfwSynmD5C1DLglw=w124';
// The capture group wraps only the ID, so it ends up in $match[0][1]
preg_match_all('#https://.*\.com/([\w-]+)=\w\d{2,3}#iU', $GetImage, $match, PREG_SET_ORDER);
print_r($match);

regex help with getting tag content in PHP

So I have this code:
function getTagContent($string, $tagname) {
$pattern = "/<$tagname.*?>(.*)<\/$tagname>/";
preg_match($pattern, $string, $matches);
print_r($matches);
}
and then I call
$url = "http://www.freakonomics.com/2008/09/24/wall-street-jokes-please/";
$html = file_get_contents($url);
getTagContent($html,"title");
But it shows no matches, even though the page source clearly contains a title tag.
What did I do wrong?
try DOM
$url = "http://www.freakonomics.com/2008/09/24/wall-street-jokes-please/";
$doc = new DOMDocument();
$dom = $doc->loadHTMLFile($url);
$items = $doc->getElementsByTagName('title');
for ($i = 0; $i < $items->length; $i++)
{
echo $items->item($i)->nodeValue . "\n";
}
The 'title' tag is not on the same line as its closing tag, so your preg_match doesn't find it.
In Perl, you can add a /s switch to make it slurp the whole input as though on one line: I forget whether preg_match will let you do so or not.
But this is just one of the reasons why parsing XML and variants with regexp is a bad idea.
Probably because the title is spread over multiple lines. You need to add the s modifier so that the dot also matches newlines.
$pattern = "/<$tagname.*?>(.*)<\/$tagname>/s";
Write your PHP function getTagContent like this:
function getTagContent($string, $tagname) {
$pattern = '/<'.$tagname.'[^>]*>(.*?)<\/'.$tagname.'>/is';
preg_match($pattern, $string, $matches);
print_r($matches);
}
It is important to use the non-greedy .*? to match the text between the start and end tags. Equally important are the flags s (DOTALL, so the dot matches newlines as well) and i (case-insensitive comparison).
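A small sketch showing the effect of the s flag on a title that spans several lines. The helper name getTagText and the sample HTML are assumptions for illustration; it returns the text instead of printing it and passes the tag name through preg_quote.

```php
<?php
// Hypothetical variant of getTagContent that returns the text instead of printing it.
function getTagText(string $html, string $tagname): ?string
{
    $tag = preg_quote($tagname, '/');
    // i: case-insensitive, s: let '.' match newlines, .*?: stop at the first closing tag.
    $pattern = '/<' . $tag . '\b[^>]*>(.*?)<\/' . $tag . '>/is';
    return preg_match($pattern, $html, $matches) ? trim($matches[1]) : null;
}

// Made-up sample with the title split across lines.
$html = "<html><head><TITLE>\n  Wall Street Jokes\n</TITLE></head></html>";
var_dump(getTagText($html, 'title'));
// Without the s modifier the same pattern finds nothing on this input.
var_dump(preg_match('/<title[^>]*>(.*?)<\/title>/i', $html));
```

The second call returns 0 because, without s, the dot cannot cross the newline inside the title.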

PHP - strip URL to get tag name

I need to strip a URL using PHP to add a class to a link if it matches.
The URL would look like this:
http://domain.com/tag/tagname/
How can I strip the URL so I'm only left with "tagname"?
So basically it takes out the final "/" and the start "http://domain.com/tag/"
For your URL
http://domain.com/tag/tagname/
The PHP function to get "tagname" is called basename():
echo basename('http://domain.com/tag/tagname/'); # tagname
Combine substring extraction with position finding after you take the last character off the string: use substr and pass in the index of the last '/' in your URL, assuming you remove the trailing '/' first.
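A minimal sketch of that substring approach, assuming the URL always ends with the tag name followed by an optional trailing slash:

```php
<?php
$url = 'http://domain.com/tag/tagname/';
// Drop the trailing slash, then take everything after the last remaining '/'.
$trimmed = rtrim($url, '/');
$tagname = substr($trimmed, strrpos($trimmed, '/') + 1);
echo $tagname, "\n"; // tagname
```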
As an alternative to the substring based answers, you could also use a regular expression, using preg_split to split the string:
<?php
$ptn = "/\//";
$str = "http://domain.com/tag/tagname/";
$result = preg_split($ptn, $str);
$tagname = $result[count($result)-2];
echo($tagname);
?>
(The reason for the -2 is that, due to the trailing /, the final element of the array will be an empty string.)
And as an alternative to that, you could also use preg_match_all:
<?php
$ptn = "/[a-z]+/";
$str = "http://domain.com/tag/tagname/";
preg_match_all($ptn, $str, $matches);
// $matches[0] is the list of full matches; take the last one
$tagname = $matches[0][count($matches[0])-1];
echo($tagname);
?>
Many thanks to all, this code works for me:
$ptn = "/\//";
$str = "http://domain.com/tag/tagname/";
$result = preg_split($ptn, $str);
$tagname = $result[count($result)-2];
echo($tagname);
