I am looking for the best regex for the following task.
I search on Google and want to extract the Facebook links. Because Google has no search API that matches the regular Google results one-to-one, I can't use an API.
So I send a normal request to Google, take the HTML, and want to find all Facebook links without the Google parameters.
Examples can be found in the regex debugger.
If possible, I want to get only these links.
Here are example strings to search:
`
/url?q=https://www.facebook.com/pageid/about&sa=U&ved=0ahUKEwi27NeDvfTTAhWBfywKHbuDDS4QjBAIHDAB&usg=AFQjCNH7T2JEP5DzGpiiwT_pMt2oGJ10ow
/url?q=https://www.facebook.com/pageid/%3Fpnref%3Dlhc&sa=U&ved=0ahUKEwiWv8S6vfTTAhUEBiwKHW04AH8Q_BcIyQQoATBu&usg=AFQjCNEZIUb1yqqYtzjPfDEVi4GPHDY5FQ
/url?q=https://www.facebook.com/pageid%3Fpnref%3Dlhc&sa=U&ved=0ahUKEwiWv8S6vfTTAhUEBiwKHW04AH8Q_BcIyQQoATBu&usg=AFQjCNEZIUb1yqqYtzjPfDEVi4GPHDY5FQ
/url?q=https://www.facebook.com/name-name-585606818284844/%3Fpnref%3Dlhc&sa=U&ved=0ahUKEwiWv8S6vfTTAhUEBiwKHW04AH8Q_BcIyQQoATBu&usg=AFQjCNEZIUb1yqqYtzjPfDEVi4GPHDY5FQ
/url?q=https://www.facebook.com/name-name-585606818284844%3Fpnref%3Dlhc&sa=U&ved=0ahUKEwiWv8S6vfTTAhUEBiwKHW04AH8Q_BcIyQQoATBu&usg=AFQjCNEZIUb1yqqYtzjPfDEVi4GPHDY5FQ`
This is my regex. It works, but not for all cases. Regex debugger:
https://regex101.com/r/LcYz8c/8
Something like:
"q=(https?://.*?facebook.com/)derName-/"
"q=(https?://.*?facebook.com/)derName(?:%[^%]*%..|[-/])?([^&]+)"
might be what you are looking for. From your example, it looks like you want everything from the http up to the first / after the domain, then to skip the derName, and then to grab everything up to the next &. So this uses two capture groups. Hope that helps!
Try this:
q=(https:\/\/www.facebook.com.*?)&
https://regex101.com/r/LcYz8c/11
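As a rough illustration of the same idea outside the regex debugger, here is a minimal Python sketch. It assumes the links appear in the HTML as `/url?q=...` redirect paths like the example strings in the question, and it assumes that the percent-encoded `?pnref=...` part should be dropped as well (the question does not say so explicitly). The function name `facebook_links` is made up for this example.

```python
import re
from urllib.parse import unquote

# Capture the target URL inside Google's redirect path, stopping at the
# first "&" (start of Google's own parameters) or at a closing quote.
REDIRECT = re.compile(r'/url\?q=(https?://www\.facebook\.com/[^&"]*)')

def facebook_links(html):
    """Return the Facebook URLs found in Google redirect links."""
    links = []
    for raw in REDIRECT.findall(html):
        url = unquote(raw)          # "%3Fpnref%3Dlhc" becomes "?pnref=lhc"
        url = url.split("?", 1)[0]  # assumption: drop Facebook's own query string too
        links.append(url)
    return links
```

Decoding first and then cutting at `?` handles both the `pageid/%3Fpnref...` and the `pageid%3Fpnref...` variants without a separate alternation in the pattern.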
I have a text file of links obtained by scraping. I need a regular expression so I can extract these links from the file. The links share the same structure but differ in length, for example:
https://www.cnbc.com/2016/10/12/billionaire-richard-branson-learned-a-key-business-lesson-playing-tennis.html
and this:
https://www.cnbc.com/2016/10/12/hedge-fund-bonus-makeover.html
I can write a regex for the base domain, but the title part after it gives me a tough time. Mine is
[h][t][t][p][s]:\/\/[w][w][w].[c][n][b][c].[c][o][m]\/[2][0][1][5-8]
for https://www.cnbc.com/2016/10/11/
but I don't know how to extend it to cover the different words that follow in each link.
You can simplify your regex to something like this:
preg_match("/http.*:\/\/www\.cnbc\.com\/201[5-8].*/", $string, $match);
This matches addresses that start with http or https,
followed by www.cnbc.com and a year between 2015 and 2018.
See here how it works:
https://www.phpliveregex.com/p/o7p
You are overcomplicating things,
https?://\S+?cnbc\.com\S+
will probably do, see https://regex101.com/r/ci3O1I/1/ for a demo.
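For comparison, here is a slightly stricter variant as a Python sketch. It assumes every wanted link follows the `/YYYY/MM/DD/title.html` shape of the two examples in the question, with titles made of letters, digits, and hyphens; the names `CNBC_LINK` and `extract_cnbc_links` are made up for this example.

```python
import re

# Year 2015-2018, two-digit month and day, then a hyphenated title ending in .html
CNBC_LINK = re.compile(
    r"https?://www\.cnbc\.com/201[5-8]/\d{2}/\d{2}/[\w-]+\.html"
)

def extract_cnbc_links(text):
    """Return every CNBC article link found in the text, in order."""
    return CNBC_LINK.findall(text)
```

Because the title part is `[\w-]+`, the length of the title does not matter, which is the part the question was stuck on.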
I have a list of 1 million website URLs and a list of keywords. I want to use Google to search for these keywords on those websites one by one; if I find something, that means it is a valid URL for me.
I googled for a tool to do this and found two.
https://github.com/NikolaiT/GoogleScraper: after installing everything, I found that this scraper doesn't support "as_sitesearch" as a search parameter, so I cannot search by website.
The same goes for the second one: http://jaunt-api.com/jaunt-tutorial.htm
Is there a good tool for this?
I am the programmer of GoogleScraper. You can use the 'as_sitesearch' parameter when you use keyword files for your 1 million keywords.
Just run GoogleScraper like this:
GoogleScraper --mode selenium --keyword-file you-keyword.txt --proxy-file your-proxies
where the file you-keyword.txt looks like:
site:yourdomain.com some sneaky words
site:yourdomain2.com some other words
...
To view all help:
GoogleScraper --help
Cheers
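A keyword file in the format shown above could be generated with a few lines of Python. This is only a sketch: it assumes one `site:` query per line, as in the example, and the helper names `build_site_queries` and `write_keyword_file` are made up here, not part of GoogleScraper.

```python
def build_site_queries(domains, keywords):
    """Combine every domain with every keyword into a 'site:' query."""
    return [f"site:{domain} {keyword}" for domain in domains for keyword in keywords]

def write_keyword_file(path, queries):
    """Write one query per line, the layout shown in the answer above."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(queries) + "\n")
```

With 1 million domains the number of queries grows quickly (domains times keywords), so it may be worth writing the file in batches.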
I am using the Google Analytics PHP API to retrieve the most popular links on my website.
It works, but it returns some duplicates because it also retrieves URLs containing query strings. So basically, I want to retrieve all the links which do not contain the string "?start=". I think this can be done via regex (Google Analytics accepts regex filters), but I don't know how.
Any ideas?
Thanks!
You can use a negative lookahead assertion.
See http://www.perlmonks.org/?node_id=518444 and http://www.regular-expressions.info/lookaround.html
your_string(?!\?start=)
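Note that the pattern above only rejects "?start=" directly after your_string. To reject it anywhere in the URL, the lookahead can be anchored at the start of the string, as in this Python sketch. Whether the Google Analytics filter engine accepts lookaheads at all is not confirmed here; the Core Reporting API also documents a `!~` ("does not match regex") filter operator, which may be the simpler route. The names `NO_START` and `without_start_param` are made up for this example.

```python
import re

# From the start of the string, assert that "?start=" appears nowhere ahead,
# then accept the whole string.
NO_START = re.compile(r"^(?!.*\?start=).*$")

def without_start_param(urls):
    """Keep only the URLs that do not contain '?start='."""
    return [url for url in urls if NO_START.match(url)]
```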
(Programming Language: PHP v5.3)
I am working on a website where I search specific websites using the Google and Bing search APIs.
The Project:
A user can select a website to search from a drop-down list. We have an admin panel on this website. If the admin wants to add a new website to the drop-down list, he has to provide two sample URLs from the site as shown below.
When the form is submitted, code goes through the input and generates a regex that we later use for pattern matching. The regex is stored in the database for later use.
In a different form, the visiting user selects a website from the drop-down list and then enters the search "query" in a text box. We fetch the results as JSON using the search APIs (as mentioned above), with the following query syntax as the search string:
"site:website query"
(where we replace "website" with the website user chose for search and replace "query" with user's search query).
The Problem
Now we have to get the best-matching URL. The reason for doing a pattern match is that sometimes there are unwanted links in the search results. For example, let's say I search the website "www.example.com" for an article named "abcd". Search engines might return these two URLs:
1) www.example.com/articles/854/abcd
2) www.example.com/search/abcd
The first url is the one that I want. Now I have two issues to resolve.
1) I know that the code I wrote to build a regex pattern from sample URLs is never going to be perfect, considering that the admin adds websites on a regular basis. There can never be enough conditions to create a pattern for every website from the same code. Is there a better way to do this, or is regex my only option?
2) I am developing on a machine running Windows 7, where preg_match_all() returns results. But when I move the code to the server, which runs Linux, preg_match_all() does not return any results for the same parameters. I can't figure out why. Does anyone know why this is happening?
I have been working with web technologies for only the past few weeks, so I don't know if I have better options than regex. I would be very grateful if you could assist me or point me to resources where I can find a solution to my problems.
About question 1:
I can't quite grasp what you're trying to accomplish so I can't give any valid opinion.
Regarding question 2:
If both servers are running the same version of PHP, the regex library used ought to be the same. You can test this, however, by making a mock static file or string to test against the regex and see if the results are the same.
Since you're grabbing results from the search engines and then parsing them, the data retrieved might not be the same. Google/Bing change part of the data depending on the OS you use, and that might alter the preg results.
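The static-string test suggested above can be sketched like this, shown in Python rather than PHP. The pattern `ARTICLE_URL` is hypothetical (the regex generated by the admin form is not shown in the question); the two sample URLs are the ones from the question. Running the same fixed input on both machines separates a regex problem from a difference in the fetched data.

```python
import re

# Hypothetical pattern for "www.example.com/articles/<id>/<title>"
ARTICLE_URL = re.compile(r"^www\.example\.com/articles/\d+/[\w-]+$")

# Fixed input, so the result cannot depend on what the search API returns
SAMPLE_RESULTS = [
    "www.example.com/articles/854/abcd",
    "www.example.com/search/abcd",
]

def wanted_urls(urls, pattern=ARTICLE_URL):
    """Return only the URLs that match the stored pattern."""
    return [url for url in urls if pattern.match(url)]
```

If this check gives the same result on both systems, the difference most likely lies in the data returned by the search APIs, not in the regex engine.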
I want to extract specific links from a website.
The links look like this:
<a href="1494761,offer-mercedes-used.html">
The links are always the same, except for the brand name (mercedes in this case).
This works fine so far, but only returns the first part of the link:
preg_match_all('/((\d{7}),offer-)/s',$inhalt,$results);
And this returns the first link together with the rest of the whole page :(
preg_match_all('/((\d{7}).*html)/s',$inhalt,$results);
Any ideas?
Note that I use preg_match_all() and not preg_match().
Thanks,
Chama
While .*? would do (non-greedy), in both cases you should specify a more precise pattern.
Here [\w.-]+ would do. But [^">]+ might also be feasible if the HTML source is consistent (or if you specifically wish to ignore other variations).
preg_match_all('/((\d{7}),offer-[\w.-]+)/s',$inhalt,$results);
Trying to parse XML/HTML with regex generally isn't a good idea, but if you're sure it will always be well formatted, this should return any links in the content.
/<a href="([^">]+)">/
This will more closely match only the example pattern you gave, but I'm not sure what variations you might have:
/<a href="([0-9]{7},offer-[a-z]+-used\.html)">/
// [7 numbers],offer-[at least one letter]-used.html
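To see the stricter pattern at work, here is a small Python sketch equivalent to a preg_match_all() call with it. It assumes the links always have exactly the attribute layout from the question (`<a href="...">` with nothing else inside the tag); the names `OFFER_LINK` and `offer_links` are made up for this example.

```python
import re

# Seven digits, ",offer-", a lowercase brand name, then "-used.html"
OFFER_LINK = re.compile(r'<a href="(\d{7},offer-[a-z]+-used\.html)">')

def offer_links(html):
    """Return the href values of all offer links in the HTML."""
    return OFFER_LINK.findall(html)
```

Because the brand part is limited to `[a-z]+`, the match cannot run past the closing quote into the rest of the page, which is what happened with `.*html`.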