PHP web crawler, check URL for path

I'm writing a simple web crawler to grab some links from a site.
I need to check the returned links so that I only collect the ones I want.
For example, here are a few links returned from http://www.polygon.com/:
[0] http://www.polygon.com/2015/5/15/8613113/destiny-queens-wrath-bounties-ether-key-guide#comments
[1] http://www.polygon.com/videos
[2] http://www.polygon.com/2015/5/15/8613113/destiny-queens-wrath-bounties-ether-key-guide
[3] http://www.polygon.com/features
Links 0 and 2 are the ones I want to grab; 1 and 3 I don't want. There's an obvious visual distinction between the links, so how would I compare them?
How would I check to make sure I don't return 1 and 3? Ideally I'd like to be able to input something so it could adapt to any site.
I was thinking I need to check that the link goes past /2015/ etc., but I'm pretty lost.
Here's the PHP code I'm using to grab links:
<?php
$source_url = 'http://www.polygon.com/';
$html = file_get_contents($source_url);
$dom = new DOMDocument;
@$dom->loadHTML($html); // @ suppresses warnings about malformed HTML
$links = $dom->getElementsByTagName('a');
foreach ($links as $link) {
    $input_url = $link->getAttribute('href');
    echo $input_url . "<br>";
}
?>

It looks like regular expressions would be helpful here.
You could say, for instance:
/* if $input_url contains a 4 digit year, slash, number(s), slash, number(s) */
if (preg_match("/\/20\d\d\/\d+\/\d+\/",$input_url)) {
echo $input_url . "<br>";
}
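To make the check adapt to other sites, the pattern can be passed in rather than hard-coded. The following is only a minimal sketch: `filter_links` is a hypothetical helper name, and the pattern shown is an assumption based on the four example links in the question. It matches against the path only, so a `#comments` fragment or a query string does not affect the result.

```php
<?php
// Hypothetical helper: keep only URLs whose path matches a caller-supplied regex.
function filter_links(array $urls, string $pathPattern): array
{
    $kept = array();
    foreach ($urls as $url) {
        // PHP_URL_PATH excludes the query string and the #fragment
        $path = parse_url($url, PHP_URL_PATH);
        if (is_string($path) && preg_match($pathPattern, $path) === 1) {
            $kept[] = $url;
        }
    }
    return $kept;
}

$urls = array(
    'http://www.polygon.com/2015/5/15/8613113/destiny-queens-wrath-bounties-ether-key-guide#comments',
    'http://www.polygon.com/videos',
    'http://www.polygon.com/2015/5/15/8613113/destiny-queens-wrath-bounties-ether-key-guide',
    'http://www.polygon.com/features',
);

// Pattern for this site: the path starts with /year/month/day/
$articles = filter_links($urls, '#^/20\d\d/\d+/\d+/#');
print_r($articles);
```

For a different site you would pass a different pattern, for example one matching `/articles/` or whatever prefix that site uses for the pages you want.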

Related

Array Numbering Issue

Why is this code able to fetch data from the first page below and insert it into a numbered array, while it fails to do the same for the second page?
http://nimishprabhu.com
https://www.fiverr.com/search/gigs?utf8=%E2%9C%93&source=guest-homepage&locale=en&search_in=everywhere&query=php
For the second page, the output shows arrays numbered like the following, which is not correct:
Array ( [0] => mailto:support@fiverr.com )
Array ( [0] => https://collector.fiverr.com/api/v1/collector/noScript.gif?appId=PXK3bezZfO
[1] => https://collector.fiverr.com/api/v1/collector/pxPixel.gif?appId=PXK3bezZfO )
Array ( [0] => One Small Step )
Code:
<?php
/*
2.
FINDING HTML ELEMENTS BASED ON THEIR TAG NAMES
Suppose you wanted to find each and every image on a webpage or say, each
and every hyperlink.
We will be using “find” function to extract this information from the
object. Doing it using Simple HTML DOM Parser :
*/
include('simple_html_dom.php');
$html = file_get_html('https://www.fiverr.com/search/gigs?utf8=%E2%9C%93&source=guest-homepage&locale=en&search_in=everywhere&query=php');
//to fetch all hyperlinks from a webpage
$links = array();
foreach($html->find('a') as $a) {
$links[] = $a->href;
}
print_r($links);
echo "<br />";
//to fetch all images from a webpage
$images = array();
foreach($html->find('img') as $img) {
$images[] = $img->src;
}
print_r($images);
echo "<br />";
//to find h1 headers from a webpage
$headlines = array();
foreach($html->find('h1') as $header) {
$headlines[] = $header->plaintext;
}
print_r($headlines);
echo "<br />";
?>
Any suggestions and code samples are welcome for my learning purposes.
I am a self-taught student.

The reason is that the page you are trying to download (fiverr.com) is JavaScript-based, with dynamically loaded content. This will not work with a plain PHP download, because PHP only sees the HTML that was sent by the server; it can't parse and run JavaScript. Because this is for learning purposes, you can simply try a different website.
However, if you want a working solution, you should look into Selenium. It's a browser automation tool: through its WebDriver interface it drives a real browser (which can also run headless) that does everything a normal browser does, including running JavaScript. That way you will be able to fully parse websites like fiverr.com.

Regex not working with web crawler

I have this simple web crawler that returns all links (<a> tags) from the Google search result page. However, my preg_match_all call doesn't seem to return the relevant links I want, which sit between two strings. I believe my regex is correct, though; I've tested it on several other platforms.
foreach($html->find('a') as $element) {
preg_match_all("/url\?q=(.*?)&sa=U&ei=/", $element->href, $matches); //attempt to retrieve the actual link in between these strings
echo $element->href.'<br/>'; //prints out each of the links
}
print_r($matches);
Here is the list of links from the page I am trying to retrieve the relevant links from, when I'm searching for someone named John Smith:
https://www.google.com/webhp?tab=ww
https://www.google.com/search?q=John+Smith&um=1&ie=UTF-8&hl=en&tbm=isch&source=og&sa=N&tab=wi
https://maps.google.com/maps?q=John+Smith&um=1&ie=UTF-8&hl=en&sa=N&tab=wl
https://play.google.com/?q=John+Smith&um=1&ie=UTF-8&hl=en&sa=N&tab=w8
https://www.youtube.com/results?q=John+Smith&um=1&ie=UTF-8&sa=N&tab=w1
https://news.google.com/nwshp?hl=en&tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
http://www.google.com/intl/en/options/
https://www.google.com/calendar?tab=wc
https://translate.google.com/?q=John+Smith&um=1&ie=UTF-8&hl=en&sa=N&tab=wT
http://www.google.com/mobile/?hl=en&tab=wD
https://www.google.com/search?q=John+Smith&um=1&ie=UTF-8&hl=en&tbo=u&tbm=bks&source=og&sa=N&tab=wp
https://wallet.google.com/manage/?tab=wa
https://www.google.com/search?q=John+Smith&um=1&ie=UTF-8&hl=en&tbo=u&tbm=shop&source=og&sa=N&tab=wf
https://www.blogger.com/?tab=wj
https://www.google.com/finance?q=John+Smith&um=1&ie=UTF-8&sa=N&tab=we
https://plus.google.com/photos?q=John+Smith&um=1&ie=UTF-8&sa=N&tab=wq
https://www.google.com/search?q=John+Smith&um=1&ie=UTF-8&hl=en&tbo=u&tbm=vid&source=og&sa=N&tab=wv
http://www.google.com/intl/en/options/
https://accounts.google.com/ServiceLogin?hl=en&continue=https://www.google.com/search%3Fq%3DJohn%2BSmith
http://www.google.com/preferences?hl=en
/preferences?hl=en
http://www.google.com/history/optout?hl=en
/webhp?hl=en
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnms&tbm=isch&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CAUQ_AU
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnms&tbm=vid&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CAYQ_AU
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnms&tbm=nws&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CAcQ_AU
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnms&tbm=shop&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CAgQ_AU
https://maps.google.com/maps?q=John+Smith&um=1&ie=UTF-8&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CAkQ_AU
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnms&tbm=bks&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CAoQ_AU
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnt&tbs=qdr:h&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CA8QpwU
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnt&tbs=qdr:d&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CA8QpwU
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnt&tbs=qdr:w&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CA8QpwU
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnt&tbs=qdr:m&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CA8QpwU
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnt&tbs=qdr:y&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CA8QpwU
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnt&tbs=li:1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CA8QpwU
/url?q=http://en.wikipedia.org/wiki/John_Smith_(explorer)&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CBQQFjAA&usg=AFQjCNFgBV3CPR5ydtty6z72kDKto_Ij7A
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:2n5isO4EbUAJ:http://en.wikipedia.org/wiki/John_Smith_(explorer)%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CBcQIDAA&usg=AFQjCNGxUvb-aHUJmV-p4VbGXmUJE1nPBw
/search?ie=UTF-8&q=related:en.wikipedia.org/wiki/John_Smith_(explorer)+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CBgQHzAA
/url?q=http://en.wikipedia.org/wiki/John_Smith_(explorer)%23Early_adventures&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CBoQ0gIoADAA&usg=AFQjCNFK7RzMUfQA5LZYUNaL2C_K0cEbjA
/url?q=http://en.wikipedia.org/wiki/John_Smith_(explorer)%23In_Jamestown&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CBsQ0gIoATAA&usg=AFQjCNF0pFVxwtdohofHr3bWQXJhk1XMcA
/url?q=http://en.wikipedia.org/wiki/John_Smith_(explorer)%23New_England&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CBwQ0gIoAjAA&usg=AFQjCNE4VqtjkQwsNzO_haCNSUi-3bgTsw
/url?q=http://en.wikipedia.org/wiki/John_Smith_(explorer)%23Death_and_burial&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CB0Q0gIoAzAA&usg=AFQjCNFAr4O8yWEK93_GyyN6_srpqLaljQ
/url?q=http://www.apva.org/history/jsmith.html&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CB8QFjAB&usg=AFQjCNEMx0-702N1edJVXxiS5ILRl651zw
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:iuJ7Uh7IOtgJ:http://www.apva.org/history/jsmith.html%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CCIQIDAB&usg=AFQjCNG_keb3HZAHUteBGMb3k5GTIeVr5w
/search?ie=UTF-8&q=related:www.apva.org/history/jsmith.html+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CCMQHzAB
/images?q=John+Smith&hl=en&sa=X&oi=image_result_group&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CCUQsAQ
/url?q=http://etc.usf.edu/clipart/200/269/smith_2.htm&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CCcQ9QEwAg&usg=AFQjCNF3B9TL94enKovOL1hlz-n0A4PXrA
/url?q=http://www.apva.org/history/jsmith.html&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CCkQ9QEwAw&usg=AFQjCNEMx0-702N1edJVXxiS5ILRl651zw
/url?q=http://www.biography.com/people/john-smith-9486928&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CCsQ9QEwBA&usg=AFQjCNEdM50NAIJCmLRDMG_Ruyox4gshPQ
/url?q=http://www.shmoop.com/jamestown/photo-john-smith.html&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CC0Q9QEwBQ&usg=AFQjCNFvEq7Cq3P6WdNIIHpNVVuQLTMhdQ
/url?q=http://www.wpclipart.com/American_History/settlement/John_Smith/Captain_John_Smith.png.html&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CC8Q9QEwBg&usg=AFQjCNGEWlYKoQUhODn-3jypeyaw4urAGw
/url?q=http://www.web-books.com/Classics/ON/B1/B1583/07MB1583.html&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CDEQ9QEwBw&usg=AFQjCNGSF2DNQHhwDTHz4ogVcLVhM5TiDQ
/url?q=http://www.biography.com/people/john-smith-9486928&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CDMQFjAI&usg=AFQjCNEdM50NAIJCmLRDMG_Ruyox4gshPQ
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:IJvKbJ_a540J:http://www.biography.com/people/john-smith-9486928%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CDYQIDAI&usg=AFQjCNHnW1ezRcv8sn_Jk3GBvECp-QOCTg
/search?ie=UTF-8&q=related:www.biography.com/people/john-smith-9486928+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CDcQHzAI
/url?q=http://johnsmithjohnsmith.com/&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CDkQFjAJ&usg=AFQjCNH9a_jF2woyDESMRrLneIIbbTeS4g
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:_KyTfWhQuFEJ:http://johnsmithjohnsmith.com/%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CDwQIDAJ&usg=AFQjCNGX37w0NUcEFa0t04-28gLhlMVfdA
/search?ie=UTF-8&q=related:johnsmithjohnsmith.com/+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CD0QHzAJ
/url?q=http://www.johnsmith.co.uk/&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CD8QFjAK&usg=AFQjCNHEhG7WRm1dP5c_0xqqH0P0U-9jUA
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:jPrP5TbGXhYJ:http://www.johnsmith.co.uk/%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CEIQIDAK&usg=AFQjCNFe-QSMSKMs8Z6mSu-oLraaeKYAug
/search?ie=UTF-8&q=related:www.johnsmith.co.uk/+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CEMQHzAK
/url?q=http://www.johnsmith.co.uk/uel&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CEUQ0gIoADAK&usg=AFQjCNEk2GkTaQvtpqaaYdztlWV7iVs0Jg
/url?q=http://www.johnsmith.co.uk/bedfordshire&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CEYQ0gIoATAK&usg=AFQjCNFcOIItpAW46XRn1BwGvuG7mertRA
/url?q=http://www.johnsmith.co.uk/aru&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CEcQ0gIoAjAK&usg=AFQjCNFq68oEVG7KAAu-Mbd0ScBFOMF4MA
/url?q=http://www.history.com/topics/john-smith&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CEkQFjAL&usg=AFQjCNGytp4P2oI3szUVSzJbJ1YdOWDldw
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:5hQtC90uVmYJ:http://www.history.com/topics/john-smith%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CEwQIDAL&usg=AFQjCNERGtQrhvZLOovq8W-Mp8AXeT_W1g
/search?ie=UTF-8&q=related:www.history.com/topics/john-smith+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CE0QHzAL
/url?q=http://johnsmithmusic.com/&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CE8QFjAM&usg=AFQjCNFlpAC8HDml6r5DpmAo4VviZ_GeMw
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:-T7dO31PjlkJ:http://johnsmithmusic.com/%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CFIQIDAM&usg=AFQjCNFFeePBNGGMWPaVS9j4_niZpMVyxA
/search?ie=UTF-8&q=related:johnsmithmusic.com/+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CFMQHzAM
/url?q=http://www.nps.gov/jame/historyculture/life-of-john-smith.htm&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CFUQFjAN&usg=AFQjCNHPmqp05pAUp2yk1R9aKPqohTmWpQ
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:Q_nfCPRpnwQJ:http://www.nps.gov/jame/historyculture/life-of-john-smith.htm%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CFgQIDAN&usg=AFQjCNHad3eFxSDuthM23n4FcusD5rY1uw
/search?ie=UTF-8&q=related:www.nps.gov/jame/historyculture/life-of-john-smith.htm+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CFkQHzAN
/url?q=http://www.enchantedlearning.com/explorers/page/s/smith.shtml&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CFsQFjAO&usg=AFQjCNEWo4pji9pBq89XmlprWg2okGHl5g
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:zs0buZvw9N8J:http://www.enchantedlearning.com/explorers/page/s/smith.shtml%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CF4QIDAO&usg=AFQjCNEu0cbayJymDVJ4IfbRc_NtrEtaPA
/search?ie=UTF-8&q=related:www.enchantedlearning.com/explorers/page/s/smith.shtml+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CF8QHzAO
/search?ie=UTF-8&q=john+smith+texture+pack&revid=1367094011&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CGIQ1QIoAA
/search?ie=UTF-8&q=john+smith+and+pocahontas&revid=1367094011&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CGMQ1QIoAQ
/search?ie=UTF-8&q=john+smith+actor&revid=1367094011&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CGQQ1QIoAg
/search?ie=UTF-8&q=john+smith+realty&revid=1367094011&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CGUQ1QIoAw
/search?ie=UTF-8&q=john+smith+doctor+who&revid=1367094011&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CGYQ1QIoBA
/search?ie=UTF-8&q=captain+john+smith&revid=1367094011&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CGcQ1QIoBQ
/search?ie=UTF-8&q=john+smith+wrestler&revid=1367094011&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CGgQ1QIoBg
/search?ie=UTF-8&q=john+smith+wrestling&revid=1367094011&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CGkQ1QIoBw
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=10&sa=N
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=20&sa=N
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=30&sa=N
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=40&sa=N
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=50&sa=N
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=60&sa=N
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=70&sa=N
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=80&sa=N
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=90&sa=N
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=10&sa=N
/advanced_search?q=John+Smith&ie=UTF-8&prmd=ivnsp
/support/websearch/bin/answer.py?answer=134479&hl=en
/tools/feedback/survey/html?productId=196&query=John+Smith&hl=en
/
/intl/en/ads
/services
/intl/en/policies/
/intl/en/about.html
array(0) { }

The problem with your code is that $matches is overwritten on every iteration, so after the loop it only holds the result for the last link, which does not match.
A possible solution:
$result = array();
foreach($html->find('a') as $element) {
    // preg_match returns 1 when the pattern matches
    if (preg_match("/url\?q=(.*?)&sa=U&ei=/", $element->href, $matches) === 1) {
        $result[] = $matches[1]; // push the captured link to $result
    }
}
print_r($result); // print result
Another way is, of course, to look for some kind of markers in the generated HTML page. You can do this, for instance, by converting the HTML document to XML and then analyzing it. The problem with this approach, however, is that Google can modify its page layout now and then, and you will then need to rewrite your algorithm.
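Since the real link sits in the `q` parameter of the `/url?` redirect, it can also be read with parse_url() and parse_str() instead of a regex. This is only a sketch: `google_target` is a hypothetical helper name, and it assumes the hrefs look like the ones listed in the question and a reasonably recent PHP version.

```php
<?php
// Hypothetical helper: return the target of a "/url?q=..." redirect href, or null.
function google_target(string $href): ?string
{
    // Only redirect links of the form /url?... carry a result target
    if (parse_url($href, PHP_URL_PATH) !== '/url') {
        return null;
    }
    $query = parse_url($href, PHP_URL_QUERY);
    if (!is_string($query)) {
        return null;
    }
    parse_str($query, $params); // also URL-decodes the values
    return isset($params['q']) && is_string($params['q']) ? $params['q'] : null;
}

// Two hrefs shortened from the listing in the question
$hrefs = array(
    '/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnms&tbm=isch',
    '/url?q=http://www.apva.org/history/jsmith.html&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CB8QFjAB',
);

$targets = array();
foreach ($hrefs as $href) {
    $target = google_target($href);
    if ($target !== null) {
        $targets[] = $target;
    }
}
print_r($targets);
```

This does not depend on the exact order of the `sa` and `ei` parameters, which the regex does.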

Pulling specific data from php file

Essentially, I have pulled the text from a URL and need a way to pull specific characters out of that text.
The line I need to pull from is:
<p align="center"><a href="http://sitexplosion.com/?rid=1256" target="_blank">http://sitexplosion.com/?rid=1256</a></p>
The text I need is the number 1256, i.e. everything after ?rid= and before " target="_blank">
That number will change and will be anywhere from 1 to 6 characters in length.
If something like this has been posted already, I apologize. I have been scouring the net for the last 3 hours trying to find an answer of some sort.
If you can show me how to pull those characters from that line, I have the rest going already.
Thanks in advance!

How about this one:-
$strout = "<p align='center'><a href='http://sitexplosion.com/?rid=1256' target='_blank'>http://sitexplosion.com/?rid=1256</a></p>";
$startsAt = strpos($strout, "?rid=") + strlen("?rid=");
$endsAt = strpos($strout, "'", $startsAt); // closing quote of the href attribute
$result = substr($strout, $startsAt, $endsAt - $startsAt);
echo $result;
Output:-
1256

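Why not use an HTML parser or DOMDocument to extract the links, then get each link's query params with parse_url()?
$html = '
<p align="center"><a href="http://sitexplosion.com/?rid=1256" target="_blank">http://sitexplosion.com/?rid=1256</a></p>
<p align="center"><a href="http://sitexplosion.com/?rid=123456" target="_blank">http://sitexplosion.com/?rid=123456</a></p>
<p align="center"><a href="http://sitexplosion.com/" target="_blank">http://sitexplosion.com/</a></p>
';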
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$link_ids = array();
foreach ($dom->getElementsByTagName('a') as $link)
{
if($query = parse_url($link->getAttribute('href'), PHP_URL_QUERY))
{
$link_ids[] = str_replace('rid=','',$query);
}
}
print_r($link_ids);
/*
Array
(
[0] => 1256
[1] => 123456
)
*/
hope it helps

Well this is much shorter
$string = strip_tags('<p align="center">http://sitexplosion.com/?rid=1256</p>');
echo str_replace('http://sitexplosion.com/?rid=','',$string);
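If the surrounding text is not always identical, a regular expression keyed on `rid=` is another option. A minimal sketch, where `extract_rid` is a hypothetical helper name and the 1 to 6 digit range is taken from the question:

```php
<?php
// Hypothetical helper: return the digits after "rid=", or null if there are none.
function extract_rid(string $text): ?string
{
    // [?&] anchors on a query parameter; \d{1,6} matches 1 to 6 digits
    if (preg_match('/[?&]rid=(\d{1,6})\b/', $text, $m) === 1) {
        return $m[1];
    }
    return null;
}

$line = '<p align="center"><a href="http://sitexplosion.com/?rid=1256" target="_blank">http://sitexplosion.com/?rid=1256</a></p>';
echo extract_rid($line), "\n";
```

Unlike the str_replace version, this does not need to know the host name in advance.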

PHP - Extracting two values from a line

I'm a beginner with regular expressions and am working on a server where I cannot install anything (does using DOM methods require installing anything?).
I have a problem that I cannot solve with my current knowledge.
I would like to extract from the line below the album id and image url.
There are more lines and other url elements in the string (file), but the album ids and image urls I need are all in strings similar to the one below:
<img alt="/" src="http://img255.imageshack.us/img00/000/000001.png" height="133" width="113">
So in this case I would like to get '774' and 'http://img255.imageshack.us/img00/000/000001.png'
I've seen multiple examples of extracting just the url or one other element from a string, but I really need to keep these both together and store these in one record of the database.
Any help is really appreciated!

Since you are new to this, I'll explain that you can use PHP's HTML parser known as DOMDocument to extract what you need. You should not use a regular expression as they are inherently error prone when it comes to parsing HTML, and can easily result in many false positives.
To start, let's say you have your HTML:
$html = '<img alt="/" src="http://img255.imageshack.us/img00/000/000001.png" height="133" width="113">';
And now, we load that into DOMDocument:
$doc = new DOMDocument;
$doc->loadHTML( $html);
Now that we have the HTML loaded, it's time to find the elements that we need. Let's assume that you can encounter other <a> tags within your document, so we want only those <a> tags that have an <img> tag as a direct child. Once we know we have the correct nodes, we extract the information we need from them. So, let's have at it:
$results = array();
// Loop over all of the <a> tags in the document
foreach( $doc->getElementsByTagName( 'a') as $a) {
// If there are no children, continue on
if( !$a->hasChildNodes()) continue;
// Find the child <img> tag, if it exists
foreach( $a->childNodes as $child) {
if( $child->nodeType == XML_ELEMENT_NODE && $child->tagName == 'img') {
// Now we have the <a> tag in $a and the <img> tag in $child
// Get the information we need:
parse_str( parse_url( $a->getAttribute('href'), PHP_URL_QUERY), $a_params);
$results[] = array( $a_params['album'], $child->getAttribute('src'));
}
}
}
A print_r( $results); now leaves us with:
Array
(
[0] => Array
(
[0] => 774
[1] => http://img255.imageshack.us/img00/000/000001.png
)
)
Note that this omits basic error checking. One thing you can add is in the inner foreach loop, you can check to make sure you successfully parsed an album parameter from the <a>'s href attribute, like so:
if( isset( $a_params['album'])) {
$results[] = array( $a_params['album'], $child->getAttribute('src'));
}
Every function I've used in this can be found in the PHP documentation.
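The same selection can be written more compactly with DOMXPath. This is only a sketch under a loud assumption: the `href` below (`view.php?album=774`) is a hypothetical stand-in, because the real link format is not shown in the question.

```php
<?php
// The href is a hypothetical stand-in; only the <img> attributes come from the question.
$html = '<a href="view.php?album=774"><img alt="/" src="http://img255.imageshack.us/img00/000/000001.png" height="133" width="113"></a>';

libxml_use_internal_errors(true); // silence warnings on imperfect HTML
$doc = new DOMDocument;
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
$results = array();
// //a/img selects <img> elements that are direct children of an <a>
foreach ($xpath->query('//a/img') as $img) {
    $query = parse_url($img->parentNode->getAttribute('href'), PHP_URL_QUERY);
    parse_str((string) $query, $params);
    if (isset($params['album'])) {
        // keep the album id and the image URL together in one record
        $results[] = array($params['album'], $img->getAttribute('src'));
    }
}
print_r($results);
```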

If you've already narrowed it down to this line, then you can use a regex like the following:
$matches = array();
preg_match('#.+album=(\d+).+src="([^"]+)#', $yourHtmlLineHere, $matches);
Now if you
echo $matches[1];
echo " ";
echo $matches[2];
You'll get the following:
774 http://img255.imageshack.us/img00/000/000001.png

Searching within a webpage

What would be the best way to write code in PHP that searches a webpage for a number of words stored in a file? Is it best to store the source code in a file, or is there another way? Please help.

The best way is to use Google: site:example.com word1 OR word2 OR word3
Do you want to search in ONE PAGE, or in one website with MULTIPLE PAGES?
If it's only one page, I think you can keep the HTML code in memory without problems.
If you know exactly what you are searching for, strpos for each word will probably be the fastest (stripos for case-insensitive). You can also define your own character class and use preg_match_all or something similar. Something like this will do:
<?php
$keywords = array("word1","word2","word3");
$doc = strip_tags(file_get_contents("http://www.example.com")); // remove tags to get only text
$doc = preg_replace('/\s+/', ' ', $doc); // collapse runs of whitespace
foreach($keywords as $word) {
    $pos = stripos($doc, $word);
    if($pos !== false) {
        // show some context; max() keeps the start offset from going negative
        $snippet = substr($doc, max(0, $pos - 20), 50);
        echo "match: ...".str_ireplace($word, "<em>$word</em>", $snippet)."... \n";
    }
}
?>
Something like the following, for example, will perform MUCH faster with many keywords: it splits the text into words once and then does O(1) hash lookups, so it doesn't need to scan the whole text for every keyword...
<?php
setlocale(LC_ALL, "en_US.utf8");
$keywords = array("word1","word2","word3","word4");
$doc = file_get_contents("http://www.example.com");
$doc = strtolower($doc);
$doc = preg_replace('!/\*.*?\*/!s', '', $doc); // strip /* ... */ comments
$doc = preg_replace('/<!--.*?-->/s', '', $doc); // strip HTML comments
$doc = preg_replace('!<script.*?script>!s', '', $doc);
$doc = preg_replace('!<style.*?style>!s', '', $doc);
$doc = strip_tags($doc);
$doc = preg_replace('/[^0-9a-z\s]/','',$doc);
$doc = iconv('UTF-8', 'ASCII//TRANSLIT', $doc); // check if encoding is really utf8
//$doc = preg_replace('{(.)\1+}','$1',$doc); remove duplicate chars ... possible step to add even more fuzzyness
$doc = preg_split("/\s+/",trim($doc));
$index = array_flip($doc); // word => position of its last occurrence
foreach($keywords as $word) {
    $word = strtolower($word);
    $word = iconv('UTF-8', 'ASCII//TRANSLIT', $word);
    if(isset($index[$word])) { // hash lookup instead of a linear array_search
        $key = $index[$word];
        echo "match: ";
        // print the matched word and up to five words after it
        for($i=$key; $i<=$key+5 && isset($doc[$i]); $i++) {
            echo $doc[$i]." ";
        }
    }
}
?>
this code is untested.
It would, however, be more elegant to dump the text nodes from a DOMDocument.
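A rough sketch of that text-node idea, assuming the page is available as a string; `visible_text` is a hypothetical helper name, and the XPath expression simply skips anything inside `<script>` or `<style>`:

```php
<?php
// Collect visible text from an HTML string, skipping <script> and <style> content.
function visible_text(string $html): string
{
    libxml_use_internal_errors(true); // tolerate sloppy real-world markup
    $doc = new DOMDocument;
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $parts = array();
    // Every text node that is not inside a <script> or <style> element
    foreach ($xpath->query('//text()[not(ancestor::script) and not(ancestor::style)]') as $node) {
        $text = trim($node->nodeValue);
        if ($text !== '') {
            $parts[] = $text;
        }
    }
    return implode(' ', $parts);
}

$sample = '<html><head><style>p { color: red; }</style></head>'
        . '<body><p>Hello <b>crawler</b></p><script>var x = 1;</script></body></html>';
echo visible_text($sample), "\n";
```

The returned string can then be searched with stripos or split into words as in the examples above, without the chain of preg_replace calls.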
Simple searching is easy. If you want to search a whole website, the crawling logic is the difficult part.
I once did a backlink checker for a company that worked like a crawler.
My first advice is not to use recursion (scanning a page, following all its links, then following all links on those pages, and so on until you reach a certain level).
Rather, do it like this:
Run a for loop once for each level you want to crawl.
Start with a site array containing one entry (the start page).
Pass the array to a function that downloads every link in it, scans each page and stores the links found there in a new array.
When done with all links, return the new link list array.
In the for loop, update the array with the function's return value and call the function again.
This way you avoid following nasty paths and instead crawl the website level by level.
Also store already visited links in an array so you can skip them, don't follow external links, check for weird URL parameters, etc.
For future use you can store the documents in Lucene or Solr; there are classes to turn HTML pages into meaningful Lucene documents that you can search within.
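The level-by-level loop described above could be sketched roughly as follows. The names `crawl_levels` and `$fetchLinks` are assumptions made for this example: the download-and-extract step is passed in as a callable, so the loop itself can be tried without touching the network.

```php
<?php
// Level-by-level crawl. $fetchLinks is a caller-supplied callable (an assumption of
// this sketch) that downloads one URL and returns the links found on that page.
function crawl_levels(string $startUrl, int $maxLevels, callable $fetchLinks): array
{
    $host = parse_url($startUrl, PHP_URL_HOST);
    $visited = array($startUrl => true); // URLs already queued or crawled
    $current = array($startUrl);         // URLs to process at this level

    for ($level = 0; $level < $maxLevels && $current; $level++) {
        $next = array();
        foreach ($current as $url) {
            foreach ($fetchLinks($url) as $link) {
                // Skip external links and anything seen before
                if (parse_url($link, PHP_URL_HOST) !== $host || isset($visited[$link])) {
                    continue;
                }
                $visited[$link] = true;
                $next[] = $link;
            }
        }
        $current = $next; // the next pass handles the newly found links
    }
    return array_keys($visited);
}

// Stand-in for a real downloader: a fixed link graph keyed by URL
$graph = array(
    'http://example.com/'  => array('http://example.com/a', 'http://other.test/x'),
    'http://example.com/a' => array('http://example.com/', 'http://example.com/b'),
    'http://example.com/b' => array('http://example.com/c'),
);
$fetch = function (string $url) use ($graph): array {
    return $graph[$url] ?? array();
};

print_r(crawl_levels('http://example.com/', 2, $fetch));
```

In a real crawler, `$fetch` would download the page (for example with file_get_contents and DOMDocument, as in the first question) and would also need to resolve relative links before returning them.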
