Regex not working with web crawler

I have this simple web crawler that returns all links (<a> tags) from the Google search result page. However, my preg_match_all call doesn't seem to return the links I want, which sit between two fixed strings. I believe my regex is correct, though; I've tested it on several other platforms.
foreach($html->find('a') as $element) {
preg_match_all("/url\?q=(.*?)&sa=U&ei=/", $element->href, $matches); //attempt to retrieve the actual link in between these strings
echo $element->href.'<br/>'; //prints out each of the links
}
print_r($matches);
Here are the links printed from the page I'm trying to extract the relevant links from, when I'm searching for someone named John Smith:
https://www.google.com/webhp?tab=ww
https://www.google.com/search?q=John+Smith&um=1&ie=UTF-8&hl=en&tbm=isch&source=og&sa=N&tab=wi
https://maps.google.com/maps?q=John+Smith&um=1&ie=UTF-8&hl=en&sa=N&tab=wl
https://play.google.com/?q=John+Smith&um=1&ie=UTF-8&hl=en&sa=N&tab=w8
https://www.youtube.com/results?q=John+Smith&um=1&ie=UTF-8&sa=N&tab=w1
https://news.google.com/nwshp?hl=en&tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
http://www.google.com/intl/en/options/
https://www.google.com/calendar?tab=wc
https://translate.google.com/?q=John+Smith&um=1&ie=UTF-8&hl=en&sa=N&tab=wT
http://www.google.com/mobile/?hl=en&tab=wD
https://www.google.com/search?q=John+Smith&um=1&ie=UTF-8&hl=en&tbo=u&tbm=bks&source=og&sa=N&tab=wp
https://wallet.google.com/manage/?tab=wa
https://www.google.com/search?q=John+Smith&um=1&ie=UTF-8&hl=en&tbo=u&tbm=shop&source=og&sa=N&tab=wf
https://www.blogger.com/?tab=wj
https://www.google.com/finance?q=John+Smith&um=1&ie=UTF-8&sa=N&tab=we
https://plus.google.com/photos?q=John+Smith&um=1&ie=UTF-8&sa=N&tab=wq
https://www.google.com/search?q=John+Smith&um=1&ie=UTF-8&hl=en&tbo=u&tbm=vid&source=og&sa=N&tab=wv
http://www.google.com/intl/en/options/
https://accounts.google.com/ServiceLogin?hl=en&continue=https://www.google.com/search%3Fq%3DJohn%2BSmith
http://www.google.com/preferences?hl=en
/preferences?hl=en
http://www.google.com/history/optout?hl=en
/webhp?hl=en
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnms&tbm=isch&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CAUQ_AU
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnms&tbm=vid&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CAYQ_AU
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnms&tbm=nws&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CAcQ_AU
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnms&tbm=shop&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CAgQ_AU
https://maps.google.com/maps?q=John+Smith&um=1&ie=UTF-8&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CAkQ_AU
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnms&tbm=bks&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CAoQ_AU
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnt&tbs=qdr:h&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CA8QpwU
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnt&tbs=qdr:d&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CA8QpwU
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnt&tbs=qdr:w&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CA8QpwU
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnt&tbs=qdr:m&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CA8QpwU
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnt&tbs=qdr:y&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CA8QpwU
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnt&tbs=li:1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CA8QpwU
/url?q=http://en.wikipedia.org/wiki/John_Smith_(explorer)&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CBQQFjAA&usg=AFQjCNFgBV3CPR5ydtty6z72kDKto_Ij7A
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:2n5isO4EbUAJ:http://en.wikipedia.org/wiki/John_Smith_(explorer)%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CBcQIDAA&usg=AFQjCNGxUvb-aHUJmV-p4VbGXmUJE1nPBw
/search?ie=UTF-8&q=related:en.wikipedia.org/wiki/John_Smith_(explorer)+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CBgQHzAA
/url?q=http://en.wikipedia.org/wiki/John_Smith_(explorer)%23Early_adventures&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CBoQ0gIoADAA&usg=AFQjCNFK7RzMUfQA5LZYUNaL2C_K0cEbjA
/url?q=http://en.wikipedia.org/wiki/John_Smith_(explorer)%23In_Jamestown&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CBsQ0gIoATAA&usg=AFQjCNF0pFVxwtdohofHr3bWQXJhk1XMcA
/url?q=http://en.wikipedia.org/wiki/John_Smith_(explorer)%23New_England&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CBwQ0gIoAjAA&usg=AFQjCNE4VqtjkQwsNzO_haCNSUi-3bgTsw
/url?q=http://en.wikipedia.org/wiki/John_Smith_(explorer)%23Death_and_burial&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CB0Q0gIoAzAA&usg=AFQjCNFAr4O8yWEK93_GyyN6_srpqLaljQ
/url?q=http://www.apva.org/history/jsmith.html&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CB8QFjAB&usg=AFQjCNEMx0-702N1edJVXxiS5ILRl651zw
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:iuJ7Uh7IOtgJ:http://www.apva.org/history/jsmith.html%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CCIQIDAB&usg=AFQjCNG_keb3HZAHUteBGMb3k5GTIeVr5w
/search?ie=UTF-8&q=related:www.apva.org/history/jsmith.html+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CCMQHzAB
/images?q=John+Smith&hl=en&sa=X&oi=image_result_group&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CCUQsAQ
/url?q=http://etc.usf.edu/clipart/200/269/smith_2.htm&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CCcQ9QEwAg&usg=AFQjCNF3B9TL94enKovOL1hlz-n0A4PXrA
/url?q=http://www.apva.org/history/jsmith.html&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CCkQ9QEwAw&usg=AFQjCNEMx0-702N1edJVXxiS5ILRl651zw
/url?q=http://www.biography.com/people/john-smith-9486928&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CCsQ9QEwBA&usg=AFQjCNEdM50NAIJCmLRDMG_Ruyox4gshPQ
/url?q=http://www.shmoop.com/jamestown/photo-john-smith.html&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CC0Q9QEwBQ&usg=AFQjCNFvEq7Cq3P6WdNIIHpNVVuQLTMhdQ
/url?q=http://www.wpclipart.com/American_History/settlement/John_Smith/Captain_John_Smith.png.html&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CC8Q9QEwBg&usg=AFQjCNGEWlYKoQUhODn-3jypeyaw4urAGw
/url?q=http://www.web-books.com/Classics/ON/B1/B1583/07MB1583.html&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CDEQ9QEwBw&usg=AFQjCNGSF2DNQHhwDTHz4ogVcLVhM5TiDQ
/url?q=http://www.biography.com/people/john-smith-9486928&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CDMQFjAI&usg=AFQjCNEdM50NAIJCmLRDMG_Ruyox4gshPQ
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:IJvKbJ_a540J:http://www.biography.com/people/john-smith-9486928%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CDYQIDAI&usg=AFQjCNHnW1ezRcv8sn_Jk3GBvECp-QOCTg
/search?ie=UTF-8&q=related:www.biography.com/people/john-smith-9486928+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CDcQHzAI
/url?q=http://johnsmithjohnsmith.com/&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CDkQFjAJ&usg=AFQjCNH9a_jF2woyDESMRrLneIIbbTeS4g
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:_KyTfWhQuFEJ:http://johnsmithjohnsmith.com/%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CDwQIDAJ&usg=AFQjCNGX37w0NUcEFa0t04-28gLhlMVfdA
/search?ie=UTF-8&q=related:johnsmithjohnsmith.com/+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CD0QHzAJ
/url?q=http://www.johnsmith.co.uk/&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CD8QFjAK&usg=AFQjCNHEhG7WRm1dP5c_0xqqH0P0U-9jUA
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:jPrP5TbGXhYJ:http://www.johnsmith.co.uk/%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CEIQIDAK&usg=AFQjCNFe-QSMSKMs8Z6mSu-oLraaeKYAug
/search?ie=UTF-8&q=related:www.johnsmith.co.uk/+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CEMQHzAK
/url?q=http://www.johnsmith.co.uk/uel&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CEUQ0gIoADAK&usg=AFQjCNEk2GkTaQvtpqaaYdztlWV7iVs0Jg
/url?q=http://www.johnsmith.co.uk/bedfordshire&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CEYQ0gIoATAK&usg=AFQjCNFcOIItpAW46XRn1BwGvuG7mertRA
/url?q=http://www.johnsmith.co.uk/aru&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CEcQ0gIoAjAK&usg=AFQjCNFq68oEVG7KAAu-Mbd0ScBFOMF4MA
/url?q=http://www.history.com/topics/john-smith&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CEkQFjAL&usg=AFQjCNGytp4P2oI3szUVSzJbJ1YdOWDldw
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:5hQtC90uVmYJ:http://www.history.com/topics/john-smith%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CEwQIDAL&usg=AFQjCNERGtQrhvZLOovq8W-Mp8AXeT_W1g
/search?ie=UTF-8&q=related:www.history.com/topics/john-smith+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CE0QHzAL
/url?q=http://johnsmithmusic.com/&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CE8QFjAM&usg=AFQjCNFlpAC8HDml6r5DpmAo4VviZ_GeMw
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:-T7dO31PjlkJ:http://johnsmithmusic.com/%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CFIQIDAM&usg=AFQjCNFFeePBNGGMWPaVS9j4_niZpMVyxA
/search?ie=UTF-8&q=related:johnsmithmusic.com/+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CFMQHzAM
/url?q=http://www.nps.gov/jame/historyculture/life-of-john-smith.htm&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CFUQFjAN&usg=AFQjCNHPmqp05pAUp2yk1R9aKPqohTmWpQ
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:Q_nfCPRpnwQJ:http://www.nps.gov/jame/historyculture/life-of-john-smith.htm%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CFgQIDAN&usg=AFQjCNHad3eFxSDuthM23n4FcusD5rY1uw
/search?ie=UTF-8&q=related:www.nps.gov/jame/historyculture/life-of-john-smith.htm+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CFkQHzAN
/url?q=http://www.enchantedlearning.com/explorers/page/s/smith.shtml&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CFsQFjAO&usg=AFQjCNEWo4pji9pBq89XmlprWg2okGHl5g
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:zs0buZvw9N8J:http://www.enchantedlearning.com/explorers/page/s/smith.shtml%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CF4QIDAO&usg=AFQjCNEu0cbayJymDVJ4IfbRc_NtrEtaPA
/search?ie=UTF-8&q=related:www.enchantedlearning.com/explorers/page/s/smith.shtml+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CF8QHzAO
/search?ie=UTF-8&q=john+smith+texture+pack&revid=1367094011&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CGIQ1QIoAA
/search?ie=UTF-8&q=john+smith+and+pocahontas&revid=1367094011&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CGMQ1QIoAQ
/search?ie=UTF-8&q=john+smith+actor&revid=1367094011&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CGQQ1QIoAg
/search?ie=UTF-8&q=john+smith+realty&revid=1367094011&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CGUQ1QIoAw
/search?ie=UTF-8&q=john+smith+doctor+who&revid=1367094011&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CGYQ1QIoBA
/search?ie=UTF-8&q=captain+john+smith&revid=1367094011&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CGcQ1QIoBQ
/search?ie=UTF-8&q=john+smith+wrestler&revid=1367094011&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CGgQ1QIoBg
/search?ie=UTF-8&q=john+smith+wrestling&revid=1367094011&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CGkQ1QIoBw
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=10&sa=N
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=20&sa=N
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=30&sa=N
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=40&sa=N
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=50&sa=N
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=60&sa=N
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=70&sa=N
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=80&sa=N
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=90&sa=N
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=10&sa=N
/advanced_search?q=John+Smith&ie=UTF-8&prmd=ivnsp
/support/websearch/bin/answer.py?answer=134479&hl=en
/tools/feedback/survey/html?productId=196&query=John+Smith&hl=en
/
/intl/en/ads
/services
/intl/en/policies/
/intl/en/about.html
array(0) { }

The problem with your code is that preg_match_all overwrites $matches on every call, so after the loop it only holds the result for the last link on the page (here /intl/en/about.html, which does not match).
A possible solution is to collect the matches inside the loop:
$result = array();
foreach($html->find('a') as $element) {
    // preg_match returns 1 when the pattern matches this href, 0 otherwise
    if(preg_match("/url\?q=(.*?)&sa=U&ei=/", $element->href, $matches)) {
        $result[] = $matches[1]; //push the captured link to $result
    }
}
print_r($result); //print result
Another way is to look for markers in the generated HTML page, for instance by converting the HTML document to XML and then analyzing it. The problem with this approach, however, is that Google can modify its page layout now and then, and you will then need to rewrite your algorithm.
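As a variation on the regex approach above, the target URL can also be read from the q query parameter, which avoids depending on the exact &sa=U&ei= suffix. This is only a minimal sketch: extractTarget is a hypothetical helper name, and the sample hrefs are copied from the link list in the question.

```php
<?php
// Hypothetical helper: pull the target URL out of a Google "/url?q=..." redirect link.
function extractTarget(string $href): ?string
{
    if (strpos($href, '/url?') !== 0) {
        return null; // not a result redirect link
    }
    $query = parse_url($href, PHP_URL_QUERY);
    if (!is_string($query)) {
        return null; // no query string could be parsed
    }
    parse_str($query, $params); // also URL-decodes the values
    return isset($params['q']) ? $params['q'] : null;
}

// Sample hrefs taken from the output shown in the question
$hrefs = array(
    '/webhp?hl=en',
    '/url?q=http://www.apva.org/history/jsmith.html&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CB8QFjAB&usg=AFQjCNEMx0-702N1edJVXxiS5ILRl651zw',
    '/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=10&sa=N',
);

$result = array();
foreach ($hrefs as $href) {
    $target = extractTarget($href);
    if ($target !== null) {
        $result[] = $target;
    }
}
print_r($result);
```

In the crawler loop, extractTarget($element->href) would take the place of the preg_match call.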

Related

How to use preg_match_all search if HTML source contains given URL?

I want to find all <a> tags whose href contains my URL in any HTML source.
I used this code:
preg_match_all("'<a.*?href=\"(http[s]*://[^>\"]*?)\"[^>]*?>(.*?)</a>'si", $target_source, $matches);
For example, I'm trying to find <a> tags whose href includes http://www.emrekadan.com
How can I do it?

I'd simply use PHP's DOM parser for this purpose. This may seem harder than regex, but it's actually a lot easier and is the correct way to parse HTML.
$url = 'WEBSITE_TO_SEARCH_FOR';
$searchstring = 'YOUR_SEARCH_STRING';
$dom = new DOMDocument();
@$dom->loadHTMLFile($url); // @ suppresses warnings, see the note below
$result = array();
foreach($dom->getElementsByTagName('a') as $link) {
    $href = $link->getAttribute('href');
    if(stripos($href, $searchstring) !== FALSE) {
        $result[] = $href;
    }
}
if(!empty($result)) print_r($result);
Explanation:
Loads the given URL using the loadHTMLFile() method
Finds all <a> tags and loops through them
Uses stripos() to case-insensitively check if the href contains the given search term
If it does, it's pushed into the $result array
Note: If an empty string is passed as the filename or an empty file is named, a warning will be generated. I've used @ to hide that message, but it's generally regarded as a bad practice. You can add additional checks to make sure the URL exists before trying to load it.

PHP - Extracting two values from a line

I'm a beginner with regular expressions and am working on a server where I cannot install anything (does using DOM methods require installing anything?).
I have a problem that I cannot solve with my current knowledge.
I would like to extract from the line below the album id and image url.
There are more lines and other url elements in the string (file), but the album ids and image urls I need are all in strings similar to the one below:
<img alt="/" src="http://img255.imageshack.us/img00/000/000001.png" height="133" width="113">
So in this case I would like to get '774' and 'http://img255.imageshack.us/img00/000/000001.png'
I've seen multiple examples of extracting just the url or one other element from a string, but I really need to keep these both together and store these in one record of the database.
Any help is really appreciated!

Since you are new to this, I'll explain that you can use PHP's HTML parser known as DOMDocument to extract what you need. You should not use a regular expression as they are inherently error prone when it comes to parsing HTML, and can easily result in many false positives.
To start, let's say you have your HTML:
$html = '<img alt="/" src="http://img255.imageshack.us/img00/000/000001.png" height="133" width="113">';
And now, we load that into DOMDocument:
$doc = new DOMDocument;
$doc->loadHTML( $html);
Now that we have the HTML loaded, it's time to find the elements that we need. Let's assume that you can encounter other <a> tags within your document, so we want to find those <a> tags that have an <img> tag as a direct child. Once we have checked that we have the correct nodes, we extract the information we need from them. So, let's have at it:
$results = array();
// Loop over all of the <a> tags in the document
foreach( $doc->getElementsByTagName( 'a') as $a) {
// If there are no children, continue on
if( !$a->hasChildNodes()) continue;
// Find the child <img> tag, if it exists
foreach( $a->childNodes as $child) {
if( $child->nodeType == XML_ELEMENT_NODE && $child->tagName == 'img') {
// Now we have the <a> tag in $a and the <img> tag in $child
// Get the information we need:
parse_str( parse_url( $a->getAttribute('href'), PHP_URL_QUERY), $a_params);
$results[] = array( $a_params['album'], $child->getAttribute('src'));
}
}
}
A print_r( $results); now leaves us with:
Array
(
[0] => Array
(
[0] => 774
[1] => http://img255.imageshack.us/img00/000/000001.png
)
)
Note that this omits basic error checking. One thing you can add is in the inner foreach loop, you can check to make sure you successfully parsed an album parameter from the <a>'s href attribute, like so:
if( isset( $a_params['album'])) {
$results[] = array( $a_params['album'], $child->getAttribute('src'));
}
Every function I've used in this can be found in the PHP documentation.
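The same selection can also be written with DOMXPath, which expresses "an <img> that is a direct child of an <a>" as a single query. This is a sketch under an assumption: the sample line in the question does not show the surrounding <a> tag, so the href below (including its album=774 parameter) is a made-up placeholder.

```php
<?php
// Hypothetical markup: the <a> tag and its href are assumed for illustration only
$html = '<a href="http://example.com/view.php?album=774">'
      . '<img alt="/" src="http://img255.imageshack.us/img00/000/000001.png" height="133" width="113">'
      . '</a>';

$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$results = array();
// Select every <img> element whose parent is an <a> element
foreach ($xpath->query('//a/img') as $img) {
    $query = parse_url($img->parentNode->getAttribute('href'), PHP_URL_QUERY);
    parse_str((string) $query, $params);
    // Only keep pairs where an album parameter was actually found
    if (isset($params['album'])) {
        $results[] = array($params['album'], $img->getAttribute('src'));
    }
}
print_r($results);
```

Keeping the album id and the image URL in one inner array makes it straightforward to store both in a single database record.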

If you've already narrowed it down to this line, then you can use a regex like the following:
$matches = array();
preg_match('#.+album=(\d+).+src="([^"]+)#', $yourHtmlLineHere, $matches);
Now if you run:
echo $matches[1];
echo " ";
echo $matches[2];
You'll get the following:
774 http://img255.imageshack.us/img00/000/000001.png

need help troubleshooting a function that searches for specific url links in an array of text

Hey guys, I'm building a script that tries to find specific links in Twitter text results.
The script checks whether the text contains a URL, then determines whether that URL is one of 6 specific URLs, and if it matches, adds the original text to a new array I've labeled $imgtweets.
The problem is that although I have about 4 text strings in the array, only one of them matches and is returned in the $imgtweets array. I'm having a hard time determining where I've made the mistake; any help would go a long way!
This is my code. I've had to make the array slightly smaller because I'm not allowed to post more than 2 hyperlinks at this point:
<?php
$tweets = array(
"Photo: therulesofagentleman: http://tumblr.com/xc52sgx6u7",
"http://mypict.me/iysEX So this is Karly. Karly say hello to the world. We've been at this a while when your fans (cont)",
"this is some test text that doesnt contain any links for testing purposes");
$imgtweets = array();
foreach($tweets as $tweet) {
preg_match_all("#(^|[\n ])([\w]+?://[\w]+[^ \"\n\r\t<]*)#ise", $tweet, $matches, PREG_PATTERN_ORDER);
$tweetlinks = $matches[0];
if (!empty($tweetlinks)){
foreach($tweetlinks as $key => $link) {
if (substr($link,0,14)=="http://lockerz" || substr($link,0,12)=="http://yfrog"
|| substr($link,0,14)=="http://twitpic" || substr($link,0,13)=="http://tumblr"
|| substr($link,0,13)=="http://mypict" || substr($link,0,14)=="http://instagr" )
{
array_push($imgtweets,"$tweet");
}
}
}
}
print_r($imgtweets);
?>

Basically, you can replace all your code with something like this (where $all_tweets is your array of tweets):
$hosts = "lockerz|yfrog|twitpic|etc";
$regexp = "~http://($hosts)~";
$img_tweets = preg_grep($regexp, $all_tweets);
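A runnable sketch of the preg_grep idea, using the sample tweets from the question and the host prefixes from its substr() checks. One likely reason the original loop misses links is that $matches[0] includes the whitespace captured by the first group, so substr($link, 0, 13) starts with a space whenever the URL is not at the very beginning of the tweet; matching the tweet text directly sidesteps that.

```php
<?php
// Sample tweets from the question (second one shortened)
$tweets = array(
    "Photo: therulesofagentleman: http://tumblr.com/xc52sgx6u7",
    "http://mypict.me/iysEX So this is Karly. Karly say hello to the world.",
    "this is some test text that doesnt contain any links for testing purposes",
);

// Host prefixes taken from the substr() comparisons in the question
$hosts = array('lockerz', 'yfrog', 'twitpic', 'tumblr', 'mypict', 'instagr');
$regexp = '~http://(' . implode('|', array_map('preg_quote', $hosts)) . ')~i';

// preg_grep keeps the original keys, so array_values renumbers the result
$imgtweets = array_values(preg_grep($regexp, $tweets));
print_r($imgtweets);
```

Because each tweet is tested once as a whole, a tweet containing several matching links is only added once.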

Rather than using substr you could use an array and str_replace
$urls = array(*****your urls*****);
foreach($tweets as $tweet)
{
str_replace($urls, '', $tweet, $count);
if ($count)
{
array_push($imgtweets,"$tweet");
}
}

Searching within a webpage

What would be the best way to write PHP code that searches a webpage for a number of words stored in a file? Is it best to store the page's source code in a file, or is there another way? Please help.

The best way is to use Google: site:example.com word1 OR word2 OR word3
Do you want to search in ONE PAGE, or in one website with MULTIPLE PAGES?
If it's only one page, I think you can store the HTML code in memory without problems.
If you know exactly what you are searching for, strpos for each word will probably be the fastest (stripos for case-insensitive matching). You can also define your own character class and use preg_match_all or something similar. Something like this will do:
<?php
$keywords = array("word1","word2","word3");
$doc = strip_tags(file_get_contents("http://www.example.com")); // remove tags to get only text
$doc = preg_replace('/\s+/', ' ',$doc); // collapse runs of whitespace
foreach($keywords as $word) {
    $pos = stripos($doc,$word);
    if($pos !== false) {
        $start = max(0, $pos-20); // avoid a negative offset when the match is near the start
        echo "match: ...".str_replace($word,"<em>$word</em>",substr($doc,$start,50))."... \n";
    }
}
?>
Something like the following may perform much faster with many keywords, because it indexes the page's words in a hash map and looks each keyword up in O(1) on average instead of scanning the whole text for every keyword:
<?php
setlocale(LC_ALL, "en_US.utf8");
$keywords = array("word1","word2","word3","word4");
$doc = file_get_contents("http://www.example.com");
$doc = preg_replace('!/\*.*?\*/!s', '', $doc); // remove /* ... */ comments
$doc = preg_replace('/<!--.*?-->/s', '', $doc); // remove HTML comments (non-greedy)
$doc = preg_replace('!<script.*?script>!is', '', $doc);
$doc = preg_replace('!<style.*?style>!is', '', $doc);
$doc = strip_tags($doc);
$doc = iconv('UTF-8', 'ASCII//TRANSLIT', $doc); // check if encoding is really utf8
$doc = strtolower($doc);
$doc = preg_replace('/[^0-9a-z\s]/','',$doc);
//$doc = preg_replace('{(.)\1+}','$1',$doc); remove duplicate chars ... possible step to add even more fuzzyness
$doc = preg_split("/\s+/",trim($doc));
// build the hash map: word => position of its first occurrence
$index = array();
foreach($doc as $i => $w) {
    if(!isset($index[$w])) $index[$w] = $i;
}
foreach($keywords as $word) {
    $word = strtolower(iconv('UTF-8', 'ASCII//TRANSLIT', $word));
    if(isset($index[$word])) {
        $key = $index[$word];
        echo "match: ";
        // print the matched word and up to five words after it
        for($i=$key;$i<=$key+5 && isset($doc[$i]);$i++) {
            echo $doc[$i]." ";
        }
        echo "\n";
    }
}
?>
This code is untested.
It would, however, be more elegant to dump the text nodes from a DOMDocument.
Simple searching is easy. If you want to search a whole website, the crawling logic is difficult.
I once wrote a backlink checker for a company that worked like a crawler.
My first advice is not to use recursion (scanning a page, following all its links, following all links on those pages, and so on until you reach a certain level).
Rather, do it like this:
Run a for loop once for each level you want to crawl.
Set up a site array with one entry (the start page).
Pass the array to a function that downloads every link, scans the page there and stores the links it finds in an array.
When it is done with all links, the function returns the new link list array.
In the for loop, update the array with the return value of the function and call the function again.
This way you avoid following nasty paths and instead crawl the website level by level.
Also store already visited links in an array so you can skip them, don't follow external links, check for weird URL parameters, etc.
For future use you can store documents in Lucene or Solr; there are classes that turn HTML pages into meaningful Lucene documents you can search within.
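The level-by-level idea described above can be sketched roughly as follows. The function name crawlLevels and the $fetchLinks callback are hypothetical; in a real crawler the callback would download the page and return the links found on it, while here a small in-memory "site" stands in for HTTP requests so the sketch runs on its own.

```php
<?php
// Hypothetical helper: crawl breadth-first, one level per loop iteration.
// $fetchLinks is any callable that takes a URL and returns the URLs found on that page.
function crawlLevels(array $startUrls, int $maxLevels, callable $fetchLinks): array
{
    $visited = array();      // url => level at which it was crawled
    $current = $startUrls;   // links to crawl on this level
    for ($level = 0; $level < $maxLevels && $current; $level++) {
        $next = array();
        foreach ($current as $url) {
            if (isset($visited[$url])) {
                continue; // skip links we have already crawled
            }
            $visited[$url] = $level;
            foreach ($fetchLinks($url) as $link) {
                if (!isset($visited[$link])) {
                    $next[$link] = true; // array keys de-duplicate the list
                }
            }
        }
        $current = array_keys($next); // the new link list becomes the next level
    }
    return $visited;
}

// Fake site used instead of real downloads: page => links on that page
$site = array(
    '/'  => array('/a', '/b'),
    '/a' => array('/b', '/c'),
    '/b' => array('/'),
    '/c' => array('/d'),
);
$fetch = function ($url) use ($site) {
    return isset($site[$url]) ? $site[$url] : array();
};

print_r(crawlLevels(array('/'), 2, $fetch));
```

Filtering out external links or suspicious URL parameters would fit inside the callback or just before a link is added to $next.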

How to write a PHP script to find the number of indexed pages in Google?

I need to find the number of indexed pages in Google for a specific domain name. How do we do that through a PHP script?
So,
foreach ($allresponseresults as $responseresult)
{
$result[] = array(
'url' => $responseresult['url'],
'title' => $responseresult['title'],
'abstract' => $responseresult['content'],
);
}
What do I add for the estimated number of results, and how do I do that?
I know it is (estimatedResultCount), but how do I add it? For example, I access the title as $result['title'], so how do I get the number and print it?
Thank you :)

I think it would be nicer to Google to use their RESTful Search API. See this URL for an example call:
http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=site:stackoverflow.com&filter=0
(You're interested in the estimatedResultCount value)
In PHP you can use file_get_contents to get the data and json_decode to parse it.
You can find documentation here:
http://code.google.com/apis/ajaxsearch/documentation/#fonje
Example
Warning: The following code does not have any kind of error checking on the response!
function getGoogleCount($domain) {
$content = file_get_contents('http://ajax.googleapis.com/ajax/services/' .
'search/web?v=1.0&filter=0&q=site:' . urlencode($domain));
$data = json_decode($content);
return intval($data->responseData->cursor->estimatedResultCount);
}
echo getGoogleCount('stackoverflow.com');

You'd load http://www.google.com/search?q=domaingoeshere.com with cURL and then parse the response, looking for the <p id="resultStats"> element that holds the result count.
You'd have the resulting HTML stored in a variable $html and then do something like:
$arr = explode('<p id="resultStats">', $html);
$bottom = $arr[1];
$middle = explode('</p>', $bottom); // $middle[0] now holds the text inside the element
Please note that this is untested and a very rough example. You'd be better off parsing the HTML with a dedicated parser or matching the line with regular expressions.
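As a rough illustration of the regular-expression route, the count can be pulled out of the stats text and converted to an integer. The markup below is an assumed sample, not a guaranteed page layout; the real element and wording may differ and change over time.

```php
<?php
// Sample markup assumed for illustration only
$html = '<div class="sd" id="resultStats">About 8,130 results</div>';

$count = null;
// Capture the number (with thousands separators) that precedes the word "results"
if (preg_match('~id="resultStats"[^>]*>[^<]*?([\d,.]+)\s+results~i', $html, $m)) {
    // Strip the separators before casting to an integer
    $count = (int) str_replace(array(',', '.'), '', $m[1]);
}
var_dump($count);
```

If the pattern does not match, $count stays null, which the caller can treat as "count not found".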

The estimatedResultCount value from the Google AJAX API doesn't give the right number.
And parsing the HTML result is not a good approach either, because Google blocks you after several searches.

Count the number of results for site:yourdomainhere.com - stackoverflow.com has about 830k

// This will give you the count what you see on search result on web page,
//this code will give you the HTML content from file_get_contents
header('Content-Type: text/plain');
$url = "https://www.google.com/search?q=your url";
$html = file_get_contents($url);
if (FALSE === $html) {
throw new Exception(sprintf('Failed to open HTTP URL "%s".', $url));
}
$arr = explode('<div class="sd" id="resultStats">', $html);
$bottom = $arr[1];
$middle = explode('</div>', $bottom);
echo $middle[0];
Output:
About 8,130 results
Case 2: you can also use the Google API, but its count is different:
https://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=ursitename&callback=processResults

https://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=site:google.com
cursor":{"resultCount":"111,000,000","
"estimatedResultCount":"111000000",
