Array Numbering Issue - php

Why is this code able to fetch data from the following first page and insert them into an array by numbering the array, while it fails to do the same for the following second page:
http://nimishprabhu.com
https://www.fiverr.com/search/gigs?utf8=%E2%9C%93&source=guest-homepage&locale=en&search_in=everywhere&query=php
The page shows arrays numbered like the following, which is not correct:
Array ( [0] => mailto:support@fiverr.com )
Array ( [0] => https://collector.fiverr.com/api/v1/collector/noScript.gif?appId=PXK3bezZfO
[1] => https://collector.fiverr.com/api/v1/collector/pxPixel.gif?appId=PXK3bezZfO )
Array ( [0] => One Small Step )
Code:
<?php
/*
2.
FINDING HTML ELEMENTS BASED ON THEIR TAG NAMES
Suppose you wanted to find each and every image on a webpage or say, each
and every hyperlink.
We will be using “find” function to extract this information from the
object. Doing it using Simple HTML DOM Parser :
*/
include('simple_html_dom.php');
$html = file_get_html('https://www.fiverr.com/search/gigs?utf8=%E2%9C%93&source=guest-homepage&locale=en&search_in=everywhere&query=php');
//to fetch all hyperlinks from a webpage
$links = array();
foreach($html->find('a') as $a) {
$links[] = $a->href;
}
print_r($links);
echo "<br />";
//to fetch all images from a webpage
$images = array();
foreach($html->find('img') as $img) {
$images[] = $img->src;
}
print_r($images);
echo "<br />";
//to find h1 headers from a webpage
$headlines = array();
foreach($html->find('h1') as $header) {
$headlines[] = $header->plaintext;
}
print_r($headlines);
echo "<br />";
?>
Any suggestions and code samples welcome for my learning purpose.
I am a self study student.

The reason is that the second page (fiverr.com) builds most of its content with JavaScript after the page loads. PHP only sees the HTML that the server sends; it cannot parse or run JavaScript, so the links, images, and headings that the scripts add later never reach Simple HTML DOM. Because this is for learning purposes, you can simply try a different website.
However, if you want a working solution, you should look into Selenium. It is a browser automation tool that drives a real browser (which can run headless), so the page's JavaScript runs as it would for a normal visitor. Through its WebDriver you can then read the fully rendered page of sites like fiverr.com.
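To see the limitation concretely, here is a minimal sketch. It uses PHP's built-in DOMDocument rather than Simple HTML DOM so that it runs without extra files, and the HTML string is made up: it stands in for a server response whose extra link would only be created by a script in the browser.

```php
<?php
// Hypothetical server response: one static link, one link that only JavaScript would add.
$serverHtml = <<<'HTML'
<html><body>
<a href="/static-link">Static link</a>
<div id="results"></div>
<script>
var a = document.createElement('a');
a.href = '/gig/1';
document.getElementById('results').appendChild(a);
</script>
</body></html>
HTML;

$dom = new DOMDocument();
// Suppress warnings about imperfect markup.
@$dom->loadHTML($serverHtml);

$links = array();
foreach ($dom->getElementsByTagName('a') as $a) {
    $links[] = $a->getAttribute('href');
}

// Only the link present in the raw HTML is found; the script is never executed.
print_r($links);
```

The script-generated link never shows up because the parser treats the script body as plain text.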

Related

Get all HTML list element using Simple HTML Dom

Currently I am working on a project which requires me to parse some data from an alternative website, and I'm having some issues (note I am very new to PHP coding.)
Here's the code I am using below + the content it returns.
$dl = $html2->find('ol.tracklist',0);
print $dl = $dl->outertext;
The code above returns the data we are trying to get, but the output is extremely messy.
However, when I put this in a foreach, it only returns one of the a href attributes at a time.
foreach($html2->find('ol.tracklist') as $li)
{
$title = $li->find('a',0);
print $title;
}
What can I do so that it returns all of the a href elements from the example code above?
NOTE: I am using simple_html_dom.php for this.
Based on the markup, select the list items directly, then get the anchor inside each one:
foreach ($html2->find('ol.tracklist li') as $li) {
$anchor = $li->find('ul li a', 0);
echo $anchor->href; // and other attributes
}
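Since the original markup is not shown, the following sketch uses a made-up tracklist, and it swaps in PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM so that it runs on its own. The point is the same: iterate over every list item rather than over the single ol element.

```php
<?php
// Made-up markup standing in for the page's tracklist.
$markup = '<ol class="tracklist">'
    . '<li><a href="/track/1">First</a></li>'
    . '<li><a href="/track/2">Second</a></li>'
    . '<li><a href="/track/3">Third</a></li>'
    . '</ol>';

$dom = new DOMDocument();
@$dom->loadHTML($markup);
$xpath = new DOMXPath($dom);

$hrefs = array();
// Select the anchors inside every <li> of the tracklist, not just the first one.
foreach ($xpath->query('//ol[@class="tracklist"]/li//a') as $anchor) {
    $hrefs[] = $anchor->getAttribute('href');
}
print_r($hrefs);
```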

PHP Web Scraping And JSON or Array Output

I'm experimenting scraping Amazon with PHP but I don't know what I am doing wrong. The problem is that I can't access all the data I scraped. Here is my code:
<?php
$url = 'https://www.amazon.com/s/ref=nb_sb_ss_c_1_9?url=search-alias%3Daps&field-keywords=most+sold+items+on+amazon&sprefix=most+sold%2Caps%2C435&crid=348CE8G406XVG&rh=i%3Aaps%2Ck%3Amost+sold+items+on+amazon';
$html = file_get_html($url);
foreach ($html->find('h2[class=a-size-medium]') as $element) {
echo "<li>" .$element->plaintext."</li><br>";
}
?>
The foreach statement loops through and outputs the plain text, but I want to be able to pass the plain text to a variable or array. The problem is that if I do that and output the result, I only get the last string of the plain text array. I have done lots of research to find what I'm doing wrong but I can't find it. Any help will be appreciated. Here is what I'm trying to achieve:
<?php
$url = 'https://www.amazon.com/s/ref=nb_sb_ss_c_1_9?url=search-alias%3Daps&field-keywords=most+sold+items+on+amazon&sprefix=most+sold%2Caps%2C435&crid=348CE8G406XVG&rh=i%3Aaps%2Ck%3Amost+sold+items+on+amazon';
$hold = array();
$html = file_get_html($url);
foreach ($html->find('h2[class=a-size-medium]') as $element) {
$hold = $element->plaintext;
}
print_r($hold);
?>
The second code will output the last string of the plain text which is: "Rubbermaid LunchBlox Side Container Kit, 2-Pack, 1806176". I also tried achieving this by encoding and decoding the plain text but nothing changed. What am I doing wrong?
Instead of overwriting $hold with a string on each iteration, append each new element to the array:
$hold[] = $element->plaintext;
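The difference between the two assignments can be seen in a small standalone sketch. The titles are made up and stand in for the scraped plaintext values; json_encode is included only because the question also asks about JSON output.

```php
<?php
// Stand-in for the scraped titles; the values are made up.
$titles = array('First product', 'Second product', 'Third product');

$overwritten = array();
$collected = array();
foreach ($titles as $title) {
    $overwritten = $title;  // replaces the whole variable on each pass
    $collected[] = $title;  // appends a new element on each pass
}

var_dump($overwritten);        // a single string: only the last title survives
print_r($collected);           // an array holding all three titles
echo json_encode($collected);  // the same list as a JSON string
```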

PHP web crawler, check URL for path

I'm writing a simple web crawler to grab some links from a site.
I need to check the returned links to make sure I selectively collect what I want.
For example, here's a few links returned from http://www.polygon.com/
[0] http://www.polygon.com/2015/5/15/8613113/destiny-queens-wrath-bounties-ether-key-guide#comments
[1] http://www.polygon.com/videos
[2] http://www.polygon.com/2015/5/15/8613113/destiny-queens-wrath-bounties-ether-key-guide
[3] http://www.polygon.com/features
So links 0 and 2 are the ones I want to grab; 1 and 3 we don't want. There's an obvious visual distinction between the links, so how would I compare them?
How would I check to make sure I don't return 1 and 3? Ideally I'd like to be able to input something so it could adapt to any site.
I was thinking I need to check the link to make sure it's past /2015/ etc., but I'm pretty lost.
Here's the PHP code I'm using to grab links:
<?php
$source_url = 'http://www.polygon.com/';
$html = file_get_contents($source_url);
$dom = new DOMDocument;
@$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');
foreach ($links as $link) {
$input_url = $link->getAttribute('href');
echo $input_url . "<br>";
}
?>
It looks like regular expressions would be helpful here.
You could say, for instance:
/* if $input_url contains a 4 digit year, slash, number(s), slash, number(s) */
if (preg_match("/\/20\d\d\/\d+\/\d+\/",$input_url)) {
echo $input_url . "<br>";
}
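As a rough sketch of how this check could be wrapped up for reuse, the function below filters a list of URLs with a pattern passed as a parameter, so a different site could supply its own pattern. The name filterArticleLinks and the default pattern are illustrative assumptions; the sample URLs are the four from the question.

```php
<?php
// Hypothetical helper: keeps only the URLs that match $pattern.
// The default pattern looks for /<year 20xx>/<number>/<number>/ in the URL.
function filterArticleLinks(array $urls, $pattern = '~/20\d\d/\d+/\d+/~')
{
    $articles = array();
    foreach ($urls as $url) {
        if (preg_match($pattern, $url)) {
            $articles[] = $url;
        }
    }
    return $articles;
}

$links = array(
    'http://www.polygon.com/2015/5/15/8613113/destiny-queens-wrath-bounties-ether-key-guide#comments',
    'http://www.polygon.com/videos',
    'http://www.polygon.com/2015/5/15/8613113/destiny-queens-wrath-bounties-ether-key-guide',
    'http://www.polygon.com/features',
);

// Keeps links 0 and 2 and drops links 1 and 3.
print_r(filterArticleLinks($links));
```

Using ~ as the regex delimiter avoids having to escape every slash in the pattern.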

Regex not working with web crawler

I have this simple web crawler that returns all links (<a> tags) from the Google search result page. However, my preg_match_all call doesn't seem to be returning the relevant links I want, which sit between two strings. I believe my regex is correct though; I've tested it on several other platforms.
foreach($html->find('a') as $element) {
preg_match_all("/url\?q=(.*?)&sa=U&ei=/", $element->href, $matches); //attempt to retrieve the actual link in between these strings
echo $element->href.'<br/>'; //prints out each of the links
}
print_r($matches);
Here is what the list of links on the page looks like when I'm searching for someone named John Smith:
https://www.google.com/webhp?tab=ww
https://www.google.com/search?q=John+Smith&um=1&ie=UTF-8&hl=en&tbm=isch&source=og&sa=N&tab=wi
https://maps.google.com/maps?q=John+Smith&um=1&ie=UTF-8&hl=en&sa=N&tab=wl
https://play.google.com/?q=John+Smith&um=1&ie=UTF-8&hl=en&sa=N&tab=w8
https://www.youtube.com/results?q=John+Smith&um=1&ie=UTF-8&sa=N&tab=w1
https://news.google.com/nwshp?hl=en&tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
http://www.google.com/intl/en/options/
https://www.google.com/calendar?tab=wc
https://translate.google.com/?q=John+Smith&um=1&ie=UTF-8&hl=en&sa=N&tab=wT
http://www.google.com/mobile/?hl=en&tab=wD
https://www.google.com/search?q=John+Smith&um=1&ie=UTF-8&hl=en&tbo=u&tbm=bks&source=og&sa=N&tab=wp
https://wallet.google.com/manage/?tab=wa
https://www.google.com/search?q=John+Smith&um=1&ie=UTF-8&hl=en&tbo=u&tbm=shop&source=og&sa=N&tab=wf
https://www.blogger.com/?tab=wj
https://www.google.com/finance?q=John+Smith&um=1&ie=UTF-8&sa=N&tab=we
https://plus.google.com/photos?q=John+Smith&um=1&ie=UTF-8&sa=N&tab=wq
https://www.google.com/search?q=John+Smith&um=1&ie=UTF-8&hl=en&tbo=u&tbm=vid&source=og&sa=N&tab=wv
http://www.google.com/intl/en/options/
https://accounts.google.com/ServiceLogin?hl=en&continue=https://www.google.com/search%3Fq%3DJohn%2BSmith
http://www.google.com/preferences?hl=en
/preferences?hl=en
http://www.google.com/history/optout?hl=en
/webhp?hl=en
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnms&tbm=isch&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CAUQ_AU
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnms&tbm=vid&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CAYQ_AU
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnms&tbm=nws&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CAcQ_AU
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnms&tbm=shop&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CAgQ_AU
https://maps.google.com/maps?q=John+Smith&um=1&ie=UTF-8&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CAkQ_AU
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnms&tbm=bks&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CAoQ_AU
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnt&tbs=qdr:h&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CA8QpwU
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnt&tbs=qdr:d&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CA8QpwU
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnt&tbs=qdr:w&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CA8QpwU
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnt&tbs=qdr:m&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CA8QpwU
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnt&tbs=qdr:y&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CA8QpwU
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnt&tbs=li:1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CA8QpwU
/url?q=http://en.wikipedia.org/wiki/John_Smith_(explorer)&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CBQQFjAA&usg=AFQjCNFgBV3CPR5ydtty6z72kDKto_Ij7A
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:2n5isO4EbUAJ:http://en.wikipedia.org/wiki/John_Smith_(explorer)%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CBcQIDAA&usg=AFQjCNGxUvb-aHUJmV-p4VbGXmUJE1nPBw
/search?ie=UTF-8&q=related:en.wikipedia.org/wiki/John_Smith_(explorer)+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CBgQHzAA
/url?q=http://en.wikipedia.org/wiki/John_Smith_(explorer)%23Early_adventures&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CBoQ0gIoADAA&usg=AFQjCNFK7RzMUfQA5LZYUNaL2C_K0cEbjA
/url?q=http://en.wikipedia.org/wiki/John_Smith_(explorer)%23In_Jamestown&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CBsQ0gIoATAA&usg=AFQjCNF0pFVxwtdohofHr3bWQXJhk1XMcA
/url?q=http://en.wikipedia.org/wiki/John_Smith_(explorer)%23New_England&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CBwQ0gIoAjAA&usg=AFQjCNE4VqtjkQwsNzO_haCNSUi-3bgTsw
/url?q=http://en.wikipedia.org/wiki/John_Smith_(explorer)%23Death_and_burial&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CB0Q0gIoAzAA&usg=AFQjCNFAr4O8yWEK93_GyyN6_srpqLaljQ
/url?q=http://www.apva.org/history/jsmith.html&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CB8QFjAB&usg=AFQjCNEMx0-702N1edJVXxiS5ILRl651zw
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:iuJ7Uh7IOtgJ:http://www.apva.org/history/jsmith.html%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CCIQIDAB&usg=AFQjCNG_keb3HZAHUteBGMb3k5GTIeVr5w
/search?ie=UTF-8&q=related:www.apva.org/history/jsmith.html+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CCMQHzAB
/images?q=John+Smith&hl=en&sa=X&oi=image_result_group&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CCUQsAQ
/url?q=http://etc.usf.edu/clipart/200/269/smith_2.htm&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CCcQ9QEwAg&usg=AFQjCNF3B9TL94enKovOL1hlz-n0A4PXrA
/url?q=http://www.apva.org/history/jsmith.html&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CCkQ9QEwAw&usg=AFQjCNEMx0-702N1edJVXxiS5ILRl651zw
/url?q=http://www.biography.com/people/john-smith-9486928&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CCsQ9QEwBA&usg=AFQjCNEdM50NAIJCmLRDMG_Ruyox4gshPQ
/url?q=http://www.shmoop.com/jamestown/photo-john-smith.html&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CC0Q9QEwBQ&usg=AFQjCNFvEq7Cq3P6WdNIIHpNVVuQLTMhdQ
/url?q=http://www.wpclipart.com/American_History/settlement/John_Smith/Captain_John_Smith.png.html&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CC8Q9QEwBg&usg=AFQjCNGEWlYKoQUhODn-3jypeyaw4urAGw
/url?q=http://www.web-books.com/Classics/ON/B1/B1583/07MB1583.html&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CDEQ9QEwBw&usg=AFQjCNGSF2DNQHhwDTHz4ogVcLVhM5TiDQ
/url?q=http://www.biography.com/people/john-smith-9486928&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CDMQFjAI&usg=AFQjCNEdM50NAIJCmLRDMG_Ruyox4gshPQ
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:IJvKbJ_a540J:http://www.biography.com/people/john-smith-9486928%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CDYQIDAI&usg=AFQjCNHnW1ezRcv8sn_Jk3GBvECp-QOCTg
/search?ie=UTF-8&q=related:www.biography.com/people/john-smith-9486928+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CDcQHzAI
/url?q=http://johnsmithjohnsmith.com/&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CDkQFjAJ&usg=AFQjCNH9a_jF2woyDESMRrLneIIbbTeS4g
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:_KyTfWhQuFEJ:http://johnsmithjohnsmith.com/%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CDwQIDAJ&usg=AFQjCNGX37w0NUcEFa0t04-28gLhlMVfdA
/search?ie=UTF-8&q=related:johnsmithjohnsmith.com/+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CD0QHzAJ
/url?q=http://www.johnsmith.co.uk/&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CD8QFjAK&usg=AFQjCNHEhG7WRm1dP5c_0xqqH0P0U-9jUA
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:jPrP5TbGXhYJ:http://www.johnsmith.co.uk/%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CEIQIDAK&usg=AFQjCNFe-QSMSKMs8Z6mSu-oLraaeKYAug
/search?ie=UTF-8&q=related:www.johnsmith.co.uk/+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CEMQHzAK
/url?q=http://www.johnsmith.co.uk/uel&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CEUQ0gIoADAK&usg=AFQjCNEk2GkTaQvtpqaaYdztlWV7iVs0Jg
/url?q=http://www.johnsmith.co.uk/bedfordshire&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CEYQ0gIoATAK&usg=AFQjCNFcOIItpAW46XRn1BwGvuG7mertRA
/url?q=http://www.johnsmith.co.uk/aru&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CEcQ0gIoAjAK&usg=AFQjCNFq68oEVG7KAAu-Mbd0ScBFOMF4MA
/url?q=http://www.history.com/topics/john-smith&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CEkQFjAL&usg=AFQjCNGytp4P2oI3szUVSzJbJ1YdOWDldw
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:5hQtC90uVmYJ:http://www.history.com/topics/john-smith%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CEwQIDAL&usg=AFQjCNERGtQrhvZLOovq8W-Mp8AXeT_W1g
/search?ie=UTF-8&q=related:www.history.com/topics/john-smith+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CE0QHzAL
/url?q=http://johnsmithmusic.com/&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CE8QFjAM&usg=AFQjCNFlpAC8HDml6r5DpmAo4VviZ_GeMw
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:-T7dO31PjlkJ:http://johnsmithmusic.com/%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CFIQIDAM&usg=AFQjCNFFeePBNGGMWPaVS9j4_niZpMVyxA
/search?ie=UTF-8&q=related:johnsmithmusic.com/+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CFMQHzAM
/url?q=http://www.nps.gov/jame/historyculture/life-of-john-smith.htm&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CFUQFjAN&usg=AFQjCNHPmqp05pAUp2yk1R9aKPqohTmWpQ
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:Q_nfCPRpnwQJ:http://www.nps.gov/jame/historyculture/life-of-john-smith.htm%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CFgQIDAN&usg=AFQjCNHad3eFxSDuthM23n4FcusD5rY1uw
/search?ie=UTF-8&q=related:www.nps.gov/jame/historyculture/life-of-john-smith.htm+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CFkQHzAN
/url?q=http://www.enchantedlearning.com/explorers/page/s/smith.shtml&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CFsQFjAO&usg=AFQjCNEWo4pji9pBq89XmlprWg2okGHl5g
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:zs0buZvw9N8J:http://www.enchantedlearning.com/explorers/page/s/smith.shtml%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CF4QIDAO&usg=AFQjCNEu0cbayJymDVJ4IfbRc_NtrEtaPA
/search?ie=UTF-8&q=related:www.enchantedlearning.com/explorers/page/s/smith.shtml+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CF8QHzAO
/search?ie=UTF-8&q=john+smith+texture+pack&revid=1367094011&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CGIQ1QIoAA
/search?ie=UTF-8&q=john+smith+and+pocahontas&revid=1367094011&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CGMQ1QIoAQ
/search?ie=UTF-8&q=john+smith+actor&revid=1367094011&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CGQQ1QIoAg
/search?ie=UTF-8&q=john+smith+realty&revid=1367094011&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CGUQ1QIoAw
/search?ie=UTF-8&q=john+smith+doctor+who&revid=1367094011&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CGYQ1QIoBA
/search?ie=UTF-8&q=captain+john+smith&revid=1367094011&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CGcQ1QIoBQ
/search?ie=UTF-8&q=john+smith+wrestler&revid=1367094011&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CGgQ1QIoBg
/search?ie=UTF-8&q=john+smith+wrestling&revid=1367094011&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CGkQ1QIoBw
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=10&sa=N
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=20&sa=N
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=30&sa=N
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=40&sa=N
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=50&sa=N
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=60&sa=N
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=70&sa=N
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=80&sa=N
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=90&sa=N
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=10&sa=N
/advanced_search?q=John+Smith&ie=UTF-8&prmd=ivnsp
/support/websearch/bin/answer.py?answer=134479&hl=en
/tools/feedback/survey/html?productId=196&query=John+Smith&hl=en
/
/intl/en/ads
/services
/intl/en/policies/
/intl/en/about.html
array(0) { }
The problem with your code is that preg_match_all overwrites $matches on every iteration, so after the loop it only holds the result for the last link, which is not a result link.
A possible solution is to collect each match inside the loop:
$result = array();
foreach($html->find('a') as $element) {
//preg_match returns 1 when the pattern matches; $matches[1] is then the captured URL
if(preg_match("/url\?q=(.*?)&sa=U&ei=/", $element->href, $matches)) {
$result[] = $matches[1]; //push it to $result
}
}
print_r($result);//print result
Another way is, of course, to look for some kind of markers in the generated HTML page. You can do this, for instance, by converting the HTML document to XML and then analyzing it. The problem with this approach, however, is that Google can modify its page layout now and then, and you will then need to rewrite your algorithm.
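A further option, sketched below under some assumptions, is to skip the regular expression and read the q parameter from the query string with parse_str. The helper name extractResultUrl is hypothetical, and the sample hrefs are taken from the list in the question. This assumes the result links keep the /url?q=... shape shown there.

```php
<?php
// Hypothetical helper: returns the target of a "/url?q=..." result link, or null.
function extractResultUrl($href)
{
    $prefix = '/url?';
    if (strncmp($href, $prefix, strlen($prefix)) !== 0) {
        return null; // not a result link (navigation, settings, and so on)
    }
    // Parse the query string into an array; parse_str also URL-decodes the values.
    parse_str(substr($href, strlen($prefix)), $params);
    return isset($params['q']) ? $params['q'] : null;
}

$hrefs = array(
    '/webhp?hl=en',
    '/url?q=http://www.apva.org/history/jsmith.html&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CB8QFjAB&usg=AFQjCNEMx0-702N1edJVXxiS5ILRl651zw',
    '/url?q=http://en.wikipedia.org/wiki/John_Smith_(explorer)%23New_England&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CBwQ0gIoAjAA&usg=AFQjCNE4VqtjkQwsNzO_haCNSUi-3bgTsw',
);

$result = array();
foreach ($hrefs as $href) {
    $url = extractResultUrl($href);
    if ($url !== null) {
        $result[] = $url;
    }
}
print_r($result);
```

Because parse_str decodes the value, an encoded fragment such as %23New_England comes back as #New_England.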

PHP Tag system without database (plain text files)

I want to implement a tag system on my website. The website is made in PHP, but uses NO database (sql) system. It reads the files from plain text files and includes them.
The pages are in a file, if a page is requested that file is read, and if the page is in there the site returns it. If the page is not in there it gives an error (so no path traversal issues, I can let page "blablabla" go to "other-page.inc.php").
The page list is a big case statement, like this:
case "faq":
$s_inc_page= $s_contentdir . "static/faq.php";
$s_pagetitle="FAQ";
$s_pagetype="none";
break;
($s_pagetype is for the CSS theme.)
What I want is something like this:
case "article-about-cars":
$s_inc_page= $s_contentdir . "article/vehicles/about-cars.php";
$s_pagetitle="Article about Cars";
$s_pagetype="article";
$s_tags=array("car","mercedes","volvo","gmc");
break;
And a tag page which takes a tag as a GET variable, checks which cases have that tag in the $s_tags array, and then returns those cases.
Is this possible, or am I thinking in the wrong direction?
I would do this by keeping your page details in an array such as:
$pages['faq']['s_inc_page'] = $s_contentdir . "static/faq.php";
$pages['faq']['s_pagetitle'] = "FAQ";
$pages['faq']['s_pagetype'] = "none";
$pages['faq']['s_tags'] = array("car","mercedes","volvo","gmc");
You could then use a foreach loop to go through this array and pull out the items with matching tags:
$tag = "car";
foreach($pages as $page) {
if (in_array($tag, $page['s_tags'])) {
//do whatever you want to do with the matches
echo $page['s_pagetitle'];
}
}
It's possible, but you may need to think outside your current structure.
Something like this will work:
$pages = array(
"article-about-cars" => array ("car", "mercedes", "volvo"),
"article-about-planes" => array ("757", "747", "737")
); //an array containing page names and tags
$found_pages = array(); //start empty so the result is defined even when nothing matches
foreach ($pages as $key => $value) {
if (in_array($_GET['tag'], $value)) {
$found_pages[] = $key;
}
}
return $found_pages; //returns an array of pages that include the tag
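The two answers can be combined into one small sketch: a page registry that keeps titles and tags together, plus a lookup function for the tag page. The function name findPagesByTag and the sample entries are illustrative assumptions, not part of the original site.

```php
<?php
// Page registry: page name => details. Titles and tags are illustrative.
$pages = array(
    'faq' => array(
        's_pagetitle' => 'FAQ',
        's_tags'      => array(),
    ),
    'article-about-cars' => array(
        's_pagetitle' => 'Article about Cars',
        's_tags'      => array('car', 'mercedes', 'volvo', 'gmc'),
    ),
    'article-about-planes' => array(
        's_pagetitle' => 'Article about Planes',
        's_tags'      => array('757', '747', '737'),
    ),
);

// Hypothetical helper: returns the names of all pages that carry $tag.
function findPagesByTag(array $pages, $tag)
{
    $found = array();
    foreach ($pages as $name => $page) {
        // Strict mode so the tag must match as an exact string.
        if (in_array($tag, $page['s_tags'], true)) {
            $found[] = $name;
        }
    }
    return $found;
}

// On the tag page, the tag would come from $_GET['tag']; a literal is used here.
print_r(findPagesByTag($pages, 'volvo'));
```

Returning page names rather than whole entries keeps the result small; the caller can look up titles in $pages when rendering the tag page.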
