Regex to retrieve all URLs not containing specific string - php

I am using the Google Analytics PHP API, and trying to use it to retrieve the most popular links on my website.
It works, but it retrieves some duplicates because it also returns URLs containing query strings. So basically, I want to retrieve all the links that do not contain the string "?start=". I think this can be done via regex (Google Analytics accepts regex filters), but I don't know how.
Any ideas?
Thanks!

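You can use a negative lookahead assertion.
See http://www.perlmonks.org/?node_id=518444 and http://www.regular-expressions.info/lookaround.html
your_string(?!\?start=)
Note that this only rejects "?start=" when it immediately follows your_string. To reject it anywhere in the URL, anchor the lookahead at the start: ^(?!.*\?start=).*$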
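A minimal sketch of an anchored negative lookahead, assuming the filtering is done in PHP on page paths that have already been fetched. The sample paths are made up, and whether Google Analytics itself accepts lookaheads in its filters is not confirmed here.

```php
<?php
// Hypothetical page paths, standing in for what the API might return.
$paths = [
    '/articles',
    '/articles?start=10',
    '/about',
    '/news?page=2&start=5',   // contains "&start=", not "?start="
];

// ^(?!.*\?start=) : at the start of the string, fail if "?start="
// appears anywhere further along; otherwise match the whole string.
$pattern = '/^(?!.*\?start=).*$/';

// preg_grep keeps the entries that match; array_values reindexes from 0.
$filtered = array_values(preg_grep($pattern, $paths));

print_r($filtered);
```

Only the path containing the literal "?start=" is dropped. A path where start is not the first parameter ("&start=") still passes, so the pattern would need to be widened to [?&]start= if those should go too.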

Related

Best regex to extract the Facebook link

I am looking for the best regex for this task.
I search on Google and want to extract the Facebook links. Because Google has no search API that matches the actual Google results one-to-one, I can't use the API.
Instead I send a normal request to Google, fetch the HTML, and want to find all Facebook links without the Google parameters.
You can find examples in the regex debugger.
If possible, I want to see only these links.
Here are example strings to search:
`
/url?q=https://www.facebook.com/pageid/about&sa=U&ved=0ahUKEwi27NeDvfTTAhWBfywKHbuDDS4QjBAIHDAB&usg=AFQjCNH7T2JEP5DzGpiiwT_pMt2oGJ10ow
/url?q=https://www.facebook.com/pageid/%3Fpnref%3Dlhc&sa=U&ved=0ahUKEwiWv8S6vfTTAhUEBiwKHW04AH8Q_BcIyQQoATBu&usg=AFQjCNEZIUb1yqqYtzjPfDEVi4GPHDY5FQ
/url?q=https://www.facebook.com/pageid%3Fpnref%3Dlhc&sa=U&ved=0ahUKEwiWv8S6vfTTAhUEBiwKHW04AH8Q_BcIyQQoATBu&usg=AFQjCNEZIUb1yqqYtzjPfDEVi4GPHDY5FQ
/url?q=https://www.facebook.com/name-name-585606818284844/%3Fpnref%3Dlhc&sa=U&ved=0ahUKEwiWv8S6vfTTAhUEBiwKHW04AH8Q_BcIyQQoATBu&usg=AFQjCNEZIUb1yqqYtzjPfDEVi4GPHDY5FQ
/url?q=https://www.facebook.com/name-name-585606818284844%3Fpnref%3Dlhc&sa=U&ved=0ahUKEwiWv8S6vfTTAhUEBiwKHW04AH8Q_BcIyQQoATBu&usg=AFQjCNEZIUb1yqqYtzjPfDEVi4GPHDY5FQ`
This is my regex. It works, but not for all cases. Regex debugger:
https://regex101.com/r/LcYz8c/8
Something like:
"q=(https?://.*?facebook.com/)derName-/"
"q=(https?://.*?facebook.com/)derName(?:%[^%]*%..|[-/])?([^&]+)"
might be what you are looking for. From what I see in your example, it looks like you want:
everything from the http up to the first / after the domain. Then skip the derName, and then grab everything up to the next &. So this is going to use 2 capture groups. Hope that helps!
Try this:
q=(https:\/\/www\.facebook\.com.*?)&
https://regex101.com/r/LcYz8c/11
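A rough sketch of how this kind of pattern could be applied in PHP, under some assumptions: the sample strings are shortened versions of the ones in the question, and it is assumed that the encoded "?pnref=..." part should be dropped as well.

```php
<?php
// Shortened sample input; the real "ved" and "usg" values are longer.
$html = '/url?q=https://www.facebook.com/pageid/about&sa=U&ved=0ahUKE&usg=AFQj '
      . '/url?q=https://www.facebook.com/pageid/%3Fpnref%3Dlhc&sa=U&ved=0ahUKE&usg=AFQj';

// Capture everything after "q=" up to the first "&" (the Google parameters).
preg_match_all('~q=(https://www\.facebook\.com[^&]*)&~', $html, $m);

$links = [];
foreach ($m[1] as $raw) {
    $url = urldecode($raw);        // %3F becomes "?", %3D becomes "="
    $links[] = strtok($url, '?');  // keep only the part before the query string
}

print_r($links);
```

Using [^&]* instead of .*? keeps the capture from ever running past the first Google parameter.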

How to assign complicated regex to php variable

First question in a long while! I need to find any and all URLs in a string returned from a Facebook page request (I'm requesting the website of a page using the Graph API) and put the values into an array that I subsequently display in a DataTables JS table.
Anyhow, I'm having issues: when I build the JSON data for the datatable, it breaks in some cases:-
http://socialinsightlab.com/datatable_fpages.json
The issue is with the website field having erroneous characters / structure / white space etc in the field.
Anyhow I found the perfect regex to use to find all websites in the field (there can be more than one website listed in the return).
The regex is
(?i)\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
When I try to assign it to a PHP variable for use in preg_match_all, I can't: PHP won't accept the regex as a string, I guess because it contains quotes.
So my question is: how can I extract only the URLs found in the website field and assign them to a variable so I can add them to the datatable?
Here is an example of a call that fails:-
http://socialinsightlab.com/datatable_fpages.json
I need to be able to just return websites and nothing more.
Any ideas?
Thanks
Jonathan
This regex is specifically made as a solution to this problem:
(?:https?:\/\/|www)[^"\s]+
Live demo
If you don't want to deal with all this quote escaping, you can do the following:
Save the regex to a file, say, regex.txt.
Read this file into a variable and trim it: $regex = trim(file_get_contents("regex.txt"));
Use it with preg_match() etc. (the pattern still needs delimiters, e.g. "~" . $regex . "~").
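Another way to sidestep the quoting problem, as a sketch: PHP's nowdoc syntax (PHP 5.3+) takes the text literally, so neither kind of quote needs escaping. This is a different technique from the file approach above. The sample $website string is made up, and the script is assumed to be saved as UTF-8 because the pattern contains typographic quotes.

```php
<?php
// Nowdoc: everything up to the closing REGEX; line is taken literally.
$pattern = <<<'REGEX'
(?i)\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
REGEX;

// Every "/" in the pattern is already escaped, so "/" works as the delimiter.
// The "u" modifier makes PCRE treat pattern and subject as UTF-8.
$regex = '/' . $pattern . '/u';

// Hypothetical content of the "website" field.
$website = 'Visit http://example.com/page and www.test.org.';

preg_match_all($regex, $website, $matches);

// $matches[0] holds the full matches, i.e. the URLs themselves.
$urls = $matches[0];
print_r($urls);
```

The $urls array can then be passed through json_encode() when building the datatable data, which takes care of escaping for JSON.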

Extract URL containing /find/ from numerous URL's?

I'm really a major novice at RegEx and could do with some help.
I have a long string containing lots of URLs and other text, and one of the URLs has /find/ in it, i.e.:
1. http://www.example.com/not/index.html
2. http://www.example.com/sat/index.html
3. http://www.example.com/find/index.html
4. http://www.example.com/rat/mine.html
5. http://www.example.com/mat/find.html
What sort of RegEx would I use to return the URL that is number 3 in that list but not return me number 5 as well? I suppose basically what I'm looking for is a way of returning a whole word that contains a specific set of letters and / in order.
TIA
I would assume you want preg_match("%/find/%",$input); or similar.
EDIT: To get the full line, use the following; the matching line ends up in $matches[0]:
preg_match("%^.*?/find/.*$%m",$input,$matches);
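If only the URL itself is wanted rather than the whole line, a sketch along these lines might do. It assumes URLs start with http:// or https:// and contain no whitespace; the input is the list from the question.

```php
<?php
$input = "1. http://www.example.com/not/index.html\n"
       . "2. http://www.example.com/sat/index.html\n"
       . "3. http://www.example.com/find/index.html\n"
       . "4. http://www.example.com/rat/mine.html\n"
       . "5. http://www.example.com/mat/find.html\n";

// \S* matches the non-whitespace parts of the URL on either side of "/find/".
// Number 5 is skipped because it contains "/find." rather than "/find/".
preg_match_all('%\bhttps?://\S*/find/\S*%', $input, $matches);

$found = $matches[0];
print_r($found);
```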
I suggest using RegExr to generate regular expressions.
You can type in a sample list (like the one above) and use a palette to create a regex and test it in real time. The program is available both online and as a downloadable Adobe AIR package.
Unfortunately I cannot access their site now, so I'm attaching the AIR package of the downloadable version.
I really recommend it, since it helped a regex newbie like me design even the most complex patterns.
However, for your question, I think that just
\/find\/
works well if you want a yes/no result (i.e. whether or not it contains /find/); otherwise, to obtain the full line, use
.*\/find\/.*
In addition to Kolink's answer, in case you wanted to regex match the whole URI:
This is by no means an exhaustive regex for URIs, but this is a good starting point. I threw in a few options at key points, like .com, .net, and .org. In reality you'll have a fairly hard time matching URIs with regular expressions due to the lack of conformity, but you can come very close
The regex from the above link:
/(https?:\/\/)?(www\.)?([a-zA-Z0-9-_]+)\.(com|org|net)\/(find)\/([a-zA-Z0-9-_]+)\.(html|php|aspx)?/is

PHP: Get specific links with preg_match_all()

I want to extract specific links from a website.
The links look like this:
<a href="1494761,offer-mercedes-used.html">
The links are always the same - except the brandname (mercedes in this case).
This works fine so far but only delivers the first part of the link:
preg_match_all('/((\d{7}),offer-)/s',$inhalt,$results);
And this delivers the first link with the whole website :(
preg_match_all('/((\d{7}).*html)/s',$inhalt,$results);
Any ideas?
Note that I use preg_match_all() and not preg_match().
Thanks,
Chama
While .*? would do (= less greedy), in both cases you should specify a more precise pattern.
Here [\w.-]+ would do. But [^">]+ might also be feasible, if the HTML source is consistent (or you specifically wish to ignore other variations).
preg_match_all('/((\d{7}),offer-[\w.-]+)/s',$inhalt,$results);
Trying to parse xml/html with regex generally isn't a good idea, but if you're sure it will always be formatted well, this should return any links in the content.
/<a href="([^">]+)">/
This will more closely match only the example pattern you gave, but I'm not sure what variations you might have:
/<a href="([0-9]{7},offer-[a-z]+-used\.html)">/
// [7 numbers],offer-[at least one letter]-used.html
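As a quick sketch of the stricter pattern in use with preg_match_all(). The HTML in $inhalt is invented for illustration, with one link that should not match.

```php
<?php
// Made-up page content: two offer links and one unrelated link.
$inhalt = '<a href="1494761,offer-mercedes-used.html">Mercedes</a> '
        . '<a href="2494762,offer-bmw-used.html">BMW</a> '
        . '<a href="contact.html">Contact</a>';

// Group 1 captures just the href value: 7 digits, ",offer-", brand, "-used.html".
preg_match_all('/<a href="([0-9]{7},offer-[a-z]+-used\.html)">/', $inhalt, $results);

// $results[0] holds the full <a ...> matches, $results[1] only the links.
$links = $results[1];
print_r($links);
```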

Append a parameter to the end of a URL with PHP

I am struggling to do something which appears quite simple...
I use PHP cURL to scrape data and insert it into my website. cURL saves the data as a string in $data before it is output.
What I am trying to do is target all of the URLs contained within $data. The URLs sometimes contain a fixed-value parameter that I need to move to the end of the URL. The URLs look like this, where category=widgets can appear anywhere in the query string:
http://www.mysite.com/script.php?category=widgets&show=all&size=big
I need to move the parameter category=widgets to the end of all URL's, so they look like this:
http://www.mysite.com/script.php?show=all&size=big&category=widgets
I'm thinking that I can first remove all occurrences of category=widgets with str_replace; that's the easy bit.
The problem I have is appending category=widgets to the end of the URL. Because the URL is dynamic, perhaps preg_replace is more appropriate. I'm new to regular expressions, and it's giving me a headache.
Would appreciate your help. Thanks.
I'd recommend making use of the parse_url function, as this is liable to be considerably more robust in the long term than string manipulation.
As such, you could use parse_url to extract the various chunks and then assemble a new URL based on these as required.
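A minimal sketch of that idea, assuming PHP 7 or later. The function name moveParamToEnd is made up for illustration, and finding the URLs inside $data in the first place is left out.

```php
<?php
// Hypothetical helper: move one query parameter to the end of the URL.
function moveParamToEnd(string $url, string $name): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['query'])) {
        return $url;                      // malformed URL or no query string
    }

    parse_str($parts['query'], $params);  // query string -> associative array
    if (!array_key_exists($name, $params)) {
        return $url;                      // parameter not present, nothing to do
    }

    // Remove the parameter and add it back, which puts it last in the array.
    $value = $params[$name];
    unset($params[$name]);
    $params[$name] = $value;

    $base     = strtok($url, '?');        // everything before the query string
    $fragment = isset($parts['fragment']) ? '#' . $parts['fragment'] : '';

    return $base . '?' . http_build_query($params, '', '&') . $fragment;
}

$moved = moveParamToEnd(
    'http://www.mysite.com/script.php?category=widgets&show=all&size=big',
    'category'
);
echo $moved, "\n";
```

Note that http_build_query() re-encodes the values, so a URL with unusual characters in its query string may come back encoded slightly differently from the input.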
