Targeting URLs with parameters - php

I want to grab the URL with highest pg value:
$html ='
';
I use this regex to locate the appropriate links:
preg_match_all('/<a.*href="\.\/\?pg=(\d+)".*>(?:.*)<\/a>/U', $html, $preg_matches);
Sometimes, the links include another parameter:
http://example.com/?pg=3&test=1
My question is, how do I adjust my regex so links with the added parameters are included as well?

Use a DOM parser to find the anchors.
Use parse_url to parse the urls and get the query value
use parse_str to get the query values
Example:
$dom = new DOMDocument;
$dom->loadHTML($html);
$html ='
';
$anchors = $dom->getElementsByTagName('a');
foreach ($anchors as $anchor) {
$url = $anchor->getAttribute('href');
$query = parse_url($url, PHP_URL_QUERY);
parse_str($query, $output);
$pg = $output['pg'];
//do something
}
Here's a helpful tutorial for PHP. http://htmlparsing.com/php.html
Also see here, why you should not use Regex for parsing html https://stackoverflow.com/a/1732454/81785

$html ='
';
preg_match_all('/<a[^>]+href=\"(.*?)\"[^>]*>(.*)?<\/a>/', $html, $out);
$result = null;
foreach ($out[1] as $link){
parse_str(parse_url($link, PHP_URL_QUERY), $atr);
$result[$link] = $atr['pg'];
}
print_r($result);
// "http://example.com/?pg=1" => "1"
// "http://example.com/?pg=2" => "2"
// "http://example.com/?pg=4&test=1" => "4"

Related

PHP Web Crawler doesn't crawl .php files

This is the simple webcrawler I was trying to build
<?php
$to_crawl = "http://samplewebsite.com/about.php";
function get_links($url)
{
$input = #file_get_contents($url);
$regexp = " <a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a> ";
preg_match_all("/$regexp/siU", $input, $matches);
$l = $matches[2];
foreach ($l as $link) {
echo $link."</br>";
}
}
get_links($to_crawl);
?>
When I try to run the script with the $to_crawl variable set to a url ending with a file name, e.g. "facebook.com/about", it works, but for some reason, it just echo's nothing when the link is ending with a '.php' filename. Can someone please help?
To get all links and their inner texts, you can use DOMDocument like this:
$dom = new DOMDocument;
#$dom->loadHTML($input); // Your input (HTML code)
$xp = new DOMXPath($dom);
$links = $xp->query('//a[#href]'); // XPath to get only <a> tags with a href attribute
$result = array();
foreach ($links as $link) {
$result[] = array($link->getAttribute("href"), $link->nodeValue);
}
print_r($result);
See IDEONE demo

How to find this url in the html with php?

I want to find a specific url in a html page and get it's part.
the url is in this page :
http://site1.com/games/arcade/139173-angry-birds-friends-1-7-0.html`
and is like
http://download.site2.org/?server=2&apkid=com.rovio.angrybirdsfriends&ver=1.7.0
I want 3 parts of it:
2
com.rovio.angrybirdsfriends
1.7.0
My code:
$html = file_get_contents("http://site1.com/games/name/139173-angry-birds-friends-1-7-0.html");
preg_match("/download(.*)/", $html, $results)
echo = $results[0];
Is this what you are looking for?
$url = 'http://download.site2.org/?server=2&apkid=com.rovio.angrybirdsfriends&ver=1.7.0';
$query = parse_url($url, PHP_URL_QUERY);
parse_str($query, $params);
echo $params['server'], PHP_EOL;
echo $params['apkid'], PHP_EOL;
echo $params['ver'], PHP_EOL;
Output:
2
com.rovio.angrybirdsfriends
1.7.0
Update
// Read HTML
$html = file_get_contents(
'http://getandroidapp.org/games/arcade/'
. '139173-angry-birds-friends-1-7-0.html'
);
// Turn HTML into a DOM document
$dom = new DOMDocument();
#$dom->loadHTML($html); // Mute warnings
// Find anchor ...
foreach ($dom->getElementsByTagName('a') as $link) {
$href = $link->getAttribute('href');
// ... having a query part that starts with 'server='
if (preg_match('#\?server=#', $href)) {
$url = $href;
// Parse query string from href
$query = parse_url($url, PHP_URL_QUERY);
parse_str($query, $params);
// Display values
echo $params['server'], PHP_EOL;
echo $params['apkid'], PHP_EOL;
echo $params['ver'], PHP_EOL;
// One is enough
break;
}
}
Output:
2
com.rovio.angrybirdsfriends
1.7.0
It is not totally fool-proof but maybe good enough in your case.

PHP Regex or DOMDocument for Matching & Removing URLs?

I'm trying to extract links from html page using DOM:
$html = file_get_contents('links.html');
$DOM = new DOMDocument();
$DOM->loadHTML($html);
$a = $DOM->getElementsByTagName('a');
foreach($a as $link){
//echo out the href attribute of the <A> tag.
echo $link->getAttribute('href').'<br/>';
}
Output:
http://dontwantthisdomain.com/dont-want-this-domain-name/
http://dontwantthisdomain2.com/also-dont-want-any-pages-from-this-domain/
http://dontwantthisdomain3.com/dont-want-any-pages-from-this-domain/
http://domain1.com/page-X-on-domain-com.html
http://dontwantthisdomain.com/dont-want-link-from-this-domain-name.html
http://dontwantthisdomain2.com/dont-want-any-pages-from-this-domain/
http://domain.com/page-XZ-on-domain-com.html
http://dontwantthisdomain.com/another-page-from-same-domain-that-i-dont-want-to-be-included/
http://dontwantthisdomain2.com/same-as-above/
http://domain3.com/page-XYZ-on-domain3-com.html
I would like to remove all results matching dontwantthisdomain.com, dontwantthisdomain2.com and dontwantthisdomain3.com so the output will looks like that:
http://domain1.com/page-X-on-domain-com.html
http://domain.com/page-XZ-on-domain-com.html
http://domain3.com/page-XYZ-on-domain3-com.html
Some people saying I should not use regex for html and others that it's ok. Could somebody point the best way how I can remove unwanted urls from my html file? :)
Maybe something like this:
function extract_domains($buffer, $whitelist) {
preg_match_all("#<a\s+.*?href=\"(.+?)\".*?>(.+?)</a>#i", $buffer, $matches);
$result = array();
foreach($matches[1] as $url) {
$url = urldecode($url);
$parts = #parse_url((string) $url);
if ($parts !== false && in_array($parts['host'], $whitelist)) {
$result[] = $parts['host'];
}
}
return $result;
}
$domains = extract_domains(file_get_contents("/path/to/html.htm"), array('stackoverflow.com', 'google.com', 'sub.example.com')));
It does a rough match on the all the <a> with href=, grabs what's between the quotes, then filters it based on your whitelist of domains.
None regex solution (without potential errors :-) :
$html='
http://dontwantthisdomain.com/dont-want-this-domain-name/
http://dontwantthisdomain2.com/also-dont-want-any-pages-from-this-domain/
http://dontwantthisdomain3.com/dont-want-any-pages-from-this-domain/
http://domain1.com/page-X-on-domain-com.html
http://dontwantthisdomain.com/dont-want-link-from-this-domain-name.html
http://dontwantthisdomain2.com/dont-want-any-pages-from-this-domain/
http://domain.com/page-XZ-on-domain-com.html
http://dontwantthisdomain.com/another-page-from-same-domain-that-i-dont-want-to-be-included/
http://dontwantthisdomain2.com/same-as-above/
http://domain3.com/page-XYZ-on-domain3-com.html
';
$html=explode("\n", $html);
$dontWant=array('dontwantthisdomain.com','dontwantthisdomain2.com','dontwantthisdomain3.com');
foreach ($html as $link) {
$ok=true;
foreach($dontWant as $notWanted) {
if (strpos($link, $notWanted)>0) {
$ok=false;
}
if (trim($link=='')) $ok=false;
}
if ($ok) $final_result[]=$link;
}
echo '<pre>';
print_r($final_result);
echo '</pre>';
outputs
Array
(
[0] => http://domain1.com/page-X-on-domain-com.html
[1] => http://domain.com/page-XZ-on-domain-com.html
[2] => http://domain3.com/page-XYZ-on-domain3-com.html
)

php regular expression for matching anchor tags

go to the source of this page : www.songs.pk/indian/7days.html
there will be only eight links which start with http://link1
for example : Tune Mera Naam Liya
i want a php regular expression which matches the
http://link1.songs.pk/song1.php?songid=2792
and
Tune Mera Naam Liya
Thanks.
You're better off using something like simplehtmldom to find all links, then find all links with the relevant HTML / href.
Parsing HTML with regex isn't always the best solution, and in your case I feel it will bring you only pain.
$href = 'some_href';
$inner_text = 'some text';
$desired_anchors = array();
$html = file_get_html ('your_file_or_url');
// Find all anchors, returns a array of element objects
foreach($html->find('a') as $anchor) {
if ($a->href = $href && $anchor->innertext == $inner_text) {
$desired_anchors[] = $anchor;
}
}
print_r($desired_anchors);
That should get you started.
Don't use a regex buddy! PHP has a better suited tool for this...
$dom = new DOMDocument;
$dom->loadHTML($str);
$matchedAnchors = array();
$anchors = $dom->getElementsByTagName('a');
$match = 'http://link1';
foreach($anchors as $anchor) {
if ($anchor->hasAttribute('href') AND substr($anchor->getAttribute('href'), 0, strlen($match)) == $match) {
$matchedAnchors[] = $anchor;
}
}
here you go
preg_match_all('~<a .*href="(http://link1\..*)".*>(.*)</a>~Ui',$str,$match,PREG_SET_ORDER);
print_r($match);

php associative arrays, regex, array

I currently have the following code :
$content = "
<name>Manufacturer</name><value>John Deere</value><name>Year</name><value>2001</value><name>Location</name><value>NSW</value><name>Hours</name><value>6320</value>";
I need to find a method to create and array as name=>value. E.g Manufacturer => John Deere.
Can anyone help me with a simple code snipped I tried some regex but doesn't even work to extract the names or values, e.g.:
$pattern = "/<name>Manufacturer<\/name><value>(.*)<\/value>/";
preg_match_all($pattern, $content, $matches);
$st_selval = $matches[1][0];
You don't want to use regex for this. Try out something like SimpleXML
EDIT
Well, why don't you start with this:
<?php
$content = "<root>" . $content . "</root>";
$xml = new SimpleXMLElement($c);
print_r($xml);
?>
EDIT 2
Despite the fact that some of the answers posted using regular expression MAY work, you should get in the habit of using the correct tool for the job and regular expressions are not the correct tool for parsing of XML.
I'm using your $content variable:
$preg1 = preg_match_all('#<name>([^<]+)#', $content, $name_arr);
$preg2 = preg_match_all('#<value>([^<]+)#', $content, $val_arr);
$array = array_combine($name_arr[1], $val_arr[1]);
This is rather simple, can be solved by regex. Should be:
$name = '<name>\s*([^<]+)</name>\s*';
$value = '<value>\s*([^<]+)</value>\s*';
$pattern = "|$name $value|";
preg_match_all($pattern, $content, $matches);
# create hash
$stuff = array_combine($matches[1], $matches[2]);
# display
var_dump($stuff);
Regards
rbo
First of all, never use regex to parse xml...
You could do this with an XPATH query...
First, wrap the content in a root tag to make the parser happy (if it doesn't already have it):
$content = '<root>' . $content . '</root>';
Then, load the document
$dom = new DomDocument();
$dom->loadXml($content);
Then, initialize the XPATH
$xpath = new DomXpath($dom);
Write your query:
$xpathQuery = '//name[text()="Manufacturer"]/follwing-sibling::value/text()';
Then, execute it:
$manufacturer = $xpath->evaluate($xpathQuery);
If I did the xpath right, it $manufacturer should be John Deere...
You can see the docs on DomXpath, a basic primer on XPath, and a bunch of XPath examples...
Edit: That won't work (PHP doesn't support that syntax (following-sibling). You could do this instead of the xpath query:
$xpathQuery = '//name[text()="Manufacturer"]';
$elements = $xpath->query($xpathQuery);
$manufacturer = $elements->item(0)->nextSibling->nodeValue;
I think this is what you're looking for:
<?php
$content = "<name>Manufacturer</name><value>John Deere</value><name>Year</name><value>2001</value><name>Location</name><value>NSW</value><name>Hours</name><value>6320</value>";
$pattern = "(\<name\>(\w*)\<\/name\>\<value\>(\w*)\<\/value\>)";
preg_match_all($pattern, $content, $matches);
$arr = array();
for ($i=0; $i<count($matches); $i++){
$arr[$matches[1][$i]] = $matches[2][$i];
}
/* This is an example on how to use it */
echo "Location: " . $arr["Location"] . "<br><br>";
/* This is the array */
print_r($arr);
?>
If your array has a lot of elements dont use the count() function in the for loop, calculate the value first and then use it as a constant.
I'll edit as my PHP is wrong, but here's some PHP (pseudo-)code to give some direction.
$pattern = '|<name>([^<]*)</name>\s*<value>([^<]*)</value>|'
preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);
for($i = 0; $i < count($matches); $i++) {
$arr[$matches[$i][1]] = $matches[$i][2];
}
$arr is the array you want to store the name/value pairs.
Using XMLReader:
$content = '<name>Manufacturer</name><value>John Deere</value><name>Year</name><value>2001</value><name>Location</name><value>NSW</value><name>Hours</name><value>6320</value>';
$content = '<content>' . $content . '</content>';
$output = array();
$reader = new XMLReader();
$reader->XML($content);
$currentKey = null;
$currentValue = null;
while ($reader->read()) {
switch ($reader->name) {
case 'name':
$reader->read();
$currentKey = $reader->value;
$reader->read();
break;
case 'value':
$reader->read();
$currentValue = $reader->value;
$reader->read();
break;
}
if (isset($currentKey) && isset($currentValue)) {
$output[$currentKey] = $currentValue;
$currentKey = null;
$currentValue = null;
}
}
print_r($output);
The output is:
Array
(
[Manufacturer] => John Deere
[Year] => 2001
[Location] => NSW
[Hours] => 6320
)

Categories