I have a list of 1 million website URLs and a list of keywords. I want to use Google to search for these keywords on those websites one by one; if I find something, that means the URL is valid for my purposes.
I Googled for a tool to do this and found two.
https://github.com/NikolaiT/GoogleScraper: after installing everything, I found that this scraper doesn't support "as_sitesearch" as a search parameter, so I cannot restrict the search to a specific website.
Same thing for the 2nd one: http://jaunt-api.com/jaunt-tutorial.htm
Is there any good tool to do that?
I am the author of GoogleScraper. You can get the 'as_sitesearch' behaviour by using keyword files for your 1 million keywords.
Just run GoogleScraper like this:
GoogleScraper --mode selenium --keyword-file your-keywords.txt --proxy-file your-proxies
where the file your-keywords.txt looks like:
site:yourdomain.com some sneaky words
site:yourdomain2.com some other words
...
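If it helps, here is a rough PHP sketch of how such a keyword file could be generated from your two lists. The input file names urls.txt and keywords.txt are placeholders of my own, not anything GoogleScraper requires:

<?php
// Sketch only: build the keyword file from a list of URLs and a list of keywords.
// urls.txt and keywords.txt are assumed to contain one entry per line.
$urls     = file('urls.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$keywords = file('keywords.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

$out = fopen('your-keywords.txt', 'w');
foreach ($urls as $url) {
    // Reduce a full URL to its bare domain for the site: operator.
    $domain = parse_url(trim($url), PHP_URL_HOST) ?: trim($url);
    foreach ($keywords as $keyword) {
        fwrite($out, "site:$domain $keyword\n");
    }
}
fclose($out);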
To view all help:
GoogleScraper --help
Cheers
I'm looking for the regex that covers the most cases.
I search on Google and want to extract the Facebook links. Because Google has no search API that returns exactly the same results as the normal Google search, I can't use the API.
So I send a normal request to Google, extract the HTML, and want to find all Facebook links without Google's tracking parameters.
You can find examples in the regex debugger linked below.
I only want to match these links, if possible.
Here are example strings to search:
/url?q=https://www.facebook.com/pageid/about&sa=U&ved=0ahUKEwi27NeDvfTTAhWBfywKHbuDDS4QjBAIHDAB&usg=AFQjCNH7T2JEP5DzGpiiwT_pMt2oGJ10ow
/url?q=https://www.facebook.com/pageid/%3Fpnref%3Dlhc&sa=U&ved=0ahUKEwiWv8S6vfTTAhUEBiwKHW04AH8Q_BcIyQQoATBu&usg=AFQjCNEZIUb1yqqYtzjPfDEVi4GPHDY5FQ
/url?q=https://www.facebook.com/pageid%3Fpnref%3Dlhc&sa=U&ved=0ahUKEwiWv8S6vfTTAhUEBiwKHW04AH8Q_BcIyQQoATBu&usg=AFQjCNEZIUb1yqqYtzjPfDEVi4GPHDY5FQ
/url?q=https://www.facebook.com/name-name-585606818284844/%3Fpnref%3Dlhc&sa=U&ved=0ahUKEwiWv8S6vfTTAhUEBiwKHW04AH8Q_BcIyQQoATBu&usg=AFQjCNEZIUb1yqqYtzjPfDEVi4GPHDY5FQ
/url?q=https://www.facebook.com/name-name-585606818284844%3Fpnref%3Dlhc&sa=U&ved=0ahUKEwiWv8S6vfTTAhUEBiwKHW04AH8Q_BcIyQQoATBu&usg=AFQjCNEZIUb1yqqYtzjPfDEVi4GPHDY5FQ
That's my regex; it works, but not for all of the variants above. Regex debugger:
https://regex101.com/r/LcYz8c/8
Something like:
"q=(https?://.*?facebook.com/)derName-/"
"q=(https?://.*?facebook.com/)derName(?:%[^%]*%..|[-/])?([^&]+)"
might be what you are looking for. From what I see in your examples, it looks like you want everything from the http up to the first / after the domain, then skip the derName, and then grab everything up to the next &. So this uses two capture groups. Hope that helps!
Try this:
q=(https:\/\/www.facebook.com.*?)&
https://regex101.com/r/LcYz8c/11
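For completeness, a small PHP sketch of applying a lightly adjusted version of that pattern (~ as delimiter, escaped dots) to the fetched result page. The file name google-results.html is just a placeholder for whatever HTML you already downloaded:

<?php
// Sketch only: $html is assumed to hold the raw Google results page.
$html = file_get_contents('google-results.html');

// Collect the Facebook targets of Google's /url?q=...& redirect links.
preg_match_all('~q=(https?://www\.facebook\.com.*?)&~', $html, $matches);

foreach ($matches[1] as $link) {
    // The redirect targets are percent-encoded (%3F, %3D, ...), so decode them.
    echo urldecode($link), PHP_EOL;
}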
I'm trying to query Wikipedia using the MediaWiki API with PHP (and cURL), in order to search by a specific term for images that are used in various articles. For example: search for 'panda', but get only images that are actually used somewhere, and be able to go to the articles that use them.
I am able to search for images generally using:
https://en.wikipedia.org/w/api.php?action=query&list=allimages&ailimit=100&aifrom=Panda&aiprop=url&format=xmlfm
and I know that basically this should show the usage:
https://commons.wikimedia.org/w/api.php?action=query&prop=images&list=imageusage&iutitle=File:MY_IMAGE_NAME&format=xmlfm
Trying the above does not give me the result I need: I can see a list of images, but I cannot tell if or where they are being used.
Can anyone assist?
list=imageusage does not show cross-wiki usage; you'll need prop=globalusage for that, which is conveniently also a prop module, so it can be folded into the first query by using allimages as a generator:
action=query&generator=allimages&gailimit=100&gaifrom=Panda&prop=globalusage
(Omitted prop=images since it does not seem to have any useful purpose.)
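Since the question mentions PHP and cURL, here is a rough sketch of sending that query; it assumes you run it against Commons, and the exact shape of the JSON response is worth double-checking against the live API:

<?php
// Sketch of the generator=allimages + prop=globalusage query suggested above.
$params = array(
    'action'    => 'query',
    'generator' => 'allimages',
    'gailimit'  => 100,
    'gaifrom'   => 'Panda',
    'prop'      => 'globalusage',
    'format'    => 'json',
);

$ch = curl_init('https://commons.wikimedia.org/w/api.php?' . http_build_query($params));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'ImageUsageDemo/1.0'); // the API asks for a descriptive user agent
$response = curl_exec($ch);
curl_close($ch);

$data = json_decode($response, true);
if (isset($data['query']['pages'])) {
    foreach ($data['query']['pages'] as $page) {
        // Each page is an image file; globalusage lists the pages that embed it.
        echo $page['title'], PHP_EOL;
        if (!empty($page['globalusage'])) {
            foreach ($page['globalusage'] as $usage) {
                echo '  used on: ', $usage['url'], PHP_EOL;
            }
        }
    }
}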
I'm working on a school project that shows public transport delays in a table on my website.
I'm crawling a Twitter profile using RStudio, PostgreSQL, PHP, and the command line.
I'm trying to filter tweets from a Twitter profile (a public transport profile that posts delays etc.): twitter.com/TotalOVNL
I'm using PHP with PostgreSQL. I already have the tweets in a table on my website via an SQL query (screenshot: http://prntscr.com/385squ). I would like to use a regex to filter the tweet text: I don't want to show the full text of every tweet, only the ones containing delays, and I want to split that text into a few columns. I've mocked up in Paint an example of the table I would like to have on my website (screenshot: http://prntscr.com/3866u6).
I know that I have to use a regex, but I don't know the syntax that well.
Would someone be able to help me?
There doesn't seem to be a pattern in the last two (pt number and delay).
I tried to find the best pattern matching possible, hope this helps. Look at the match groups in the bottom right to find the matches.
#(\w+)\s#(\w+)\sis op (\d+-\d+)\s(\d+:\d+) (\w+)lijn
Example on some of the tweets on regex101.
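To connect this to the PHP side of the question, a minimal sketch of applying that pattern to one tweet and splitting it into columns; the sample tweet and the column names are my own guesses, not the real @TotalOVNL format:

<?php
// Sketch only: a made-up tweet in the shape the pattern above expects.
$tweet = '#GVB #Amsterdam is op 14-04 08:30 tramlijn 5 vertraagd';

if (preg_match('/#(\w+)\s#(\w+)\sis op (\d+-\d+)\s(\d+:\d+) (\w+)lijn/', $tweet, $m)) {
    // Map the capture groups to (placeholder) table columns.
    $row = array(
        'operator' => $m[1], // GVB
        'place'    => $m[2], // Amsterdam
        'date'     => $m[3], // 14-04
        'time'     => $m[4], // 08:30
        'type'     => $m[5], // tram (from "tramlijn")
    );
    print_r($row);
}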
(Programming Language: PHP v5.3)
I am working on a website where I run searches on specific websites using the Google and Bing search APIs.
The Project:
A user can select a website to search from a drop-down list. We have an admin panel on this website. If the admin wants to add a new website to the drop-down list, he has to provide two sample URLs from the site as shown below.
On form submission, our code goes through the input and generates a regex that we later use for pattern matching. The regex is stored in the database for later use.
In a different form, the visiting user selects a website from the drop-down list and then enters the search "query" in a text box. We fetch results as JSON using the search APIs (mentioned above), with the following syntax as the search string:
"site:website query"
(where we replace "website" with the website the user chose and "query" with the user's search query).
The Problem
Now what we have to do is pick the best-matching URL. The reason for the pattern matching is that sometimes there are unwanted links in the search results. For example, let's say I search the website "www.example.com" for an article named "abcd". The search engines might return these two URLs:
1) www.example.com/articles/854/abcd
2) www.example.com/search/abcd
The first URL is the one I want. Now I have two issues to resolve.
1) I know that the code I wrote to build a regex pattern from sample URLs is never going to be perfect, considering that the admin adds websites on a regular basis; the same code can never check enough conditions to create a good pattern for every different website. Is there a better way to do this, or is regex my only option?
2) I am developing on a machine running Windows 7, where preg_match_all() returns results. But when I move the code to the server, which runs Linux, preg_match_all() does not return any results for the same parameters. I can't figure out why that is happening. Does anyone know?
I have been working with web technologies for only the past few weeks, so I don't know whether I have better options than regex. I would be very grateful if you could assist me or point me towards resources where I can find a solution to these problems.
About question 1:
I can't quite grasp what you're trying to accomplish so I can't give any valid opinion.
Regarding question 2:
If both servers are running the same version of PHP, the regex library used ought to be the same. You can test this, however, by making a mock static file or string to test against the regex and see if the results are the same.
Since you're grabbing results from the search engines and then parsing them, the data retrieved might not be the same on both machines. Google/Bing change part of the returned data depending on the OS you use, and that might alter the preg results.
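To make the static test suggested above concrete, a small sketch; the pattern and the sample lines are placeholders, not the regex stored in your database:

<?php
// Run the same fixed pattern against the same fixed string on both the Windows
// machine and the Linux server and compare the output.
$pattern = '~^www\.example\.com/articles/\d+/[^/]+$~m';
$subject = "www.example.com/articles/854/abcd\nwww.example.com/search/abcd";

$count = preg_match_all($pattern, $subject, $matches);
if ($count === false) {
    // If PCRE itself failed, this tells you why (e.g. a backtracking limit).
    var_dump(preg_last_error());
}
print_r($matches);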
I'm really a major novice at RegEx and could do with some help.
I have a long string containing lots of URLs and other text, and one of the URLs has /find/ in it, e.g.:
1. http://www.example.com/not/index.html
2. http://www.example.com/sat/index.html
3. http://www.example.com/find/index.html
4. http://www.example.com/rat/mine.html
5. http://www.example.com/mat/find.html
What sort of regex would I use to return URL number 3 in that list without also returning number 5? I suppose what I'm basically looking for is a way of returning a whole "word" that contains a specific sequence of letters and slashes in order.
TIA
I would assume you want preg_match("%/find/%",$input); or similar.
EDIT: To get the full line, use:
preg_match("%^.*?/find/.*$%m",$input);
I can suggest using RegExr to build regular expressions.
You can type in a sample list (like the one above) and use a palette to create a regex and test it in real time. The tool is available both online and as a downloadable Adobe AIR package.
Unfortunately I cannot access their site right now, so I'm attaching the AIR package of the downloadable version.
I really recommend this, since it helped a regex newbie like me design even the most complex patterns.
However, for your question, I think that just
\/find\/
works if you only want a yes/no result (i.e. whether or not the string contains /find/); otherwise, to obtain the full line, use
.*\/find\/.*
In addition to Kolink's answer, in case you wanted to regex match the whole URI:
This is by no means an exhaustive regex for URIs, but it is a good starting point. I threw in a few options at key points, like .com, .net, and .org. In reality you'll have a fairly hard time matching URIs with regular expressions due to the lack of conformity, but you can come very close.
The regex:
/(https?:\/\/)?(www\.)?([a-zA-Z0-9-_]+)\.(com|org|net)\/(find)\/([a-zA-Z0-9-_]+)\.(html|php|aspx)?/is
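A quick sketch of that regex in use against a shortened version of the list from the question, just to show that it skips the /mat/find.html case:

<?php
// Sketch only: the URI regex from above applied to a few of the sample URLs.
$regex = '/(https?:\/\/)?(www\.)?([a-zA-Z0-9-_]+)\.(com|org|net)\/(find)\/([a-zA-Z0-9-_]+)\.(html|php|aspx)?/is';
$subject = "http://www.example.com/not/index.html\n"
         . "http://www.example.com/find/index.html\n"
         . "http://www.example.com/mat/find.html";

if (preg_match($regex, $subject, $m)) {
    echo $m[0]; // http://www.example.com/find/index.html
}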