PHP Regex or DOMDocument for Matching & Removing URLs? - php

I'm trying to extract links from an HTML page using DOM:
$html = file_get_contents('links.html');
$DOM = new DOMDocument();
$DOM->loadHTML($html);
$a = $DOM->getElementsByTagName('a');
foreach($a as $link){
//echo out the href attribute of the <A> tag.
echo $link->getAttribute('href').'<br/>';
}
Output:
http://dontwantthisdomain.com/dont-want-this-domain-name/
http://dontwantthisdomain2.com/also-dont-want-any-pages-from-this-domain/
http://dontwantthisdomain3.com/dont-want-any-pages-from-this-domain/
http://domain1.com/page-X-on-domain-com.html
http://dontwantthisdomain.com/dont-want-link-from-this-domain-name.html
http://dontwantthisdomain2.com/dont-want-any-pages-from-this-domain/
http://domain.com/page-XZ-on-domain-com.html
http://dontwantthisdomain.com/another-page-from-same-domain-that-i-dont-want-to-be-included/
http://dontwantthisdomain2.com/same-as-above/
http://domain3.com/page-XYZ-on-domain3-com.html
I would like to remove all results matching dontwantthisdomain.com, dontwantthisdomain2.com and dontwantthisdomain3.com so that the output looks like this:
http://domain1.com/page-X-on-domain-com.html
http://domain.com/page-XZ-on-domain-com.html
http://domain3.com/page-XYZ-on-domain3-com.html
Some people say I should not use regex for HTML, and others say it's OK. Could somebody point out the best way to remove unwanted URLs from my HTML file? :)

Maybe something like this:
function extract_domains($buffer, $whitelist) {
preg_match_all("#<a\s+.*?href=\"(.+?)\".*?>(.+?)</a>#i", $buffer, $matches);
$result = array();
foreach($matches[1] as $url) {
$url = urldecode($url);
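// Suppress warnings on malformed URLs; parse_url() returns false for those.
$parts = @parse_url((string) $url);
// Relative URLs have no 'host' key, so check it before comparing.
if ($parts !== false && isset($parts['host']) && in_array($parts['host'], $whitelist)) {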
$result[] = $parts['host'];
}
}
return $result;
}
$domains = extract_domains(file_get_contents("/path/to/html.htm"), array('stackoverflow.com', 'google.com', 'sub.example.com'));
It does a rough match on all the <a> tags with href=, grabs what's between the quotes, then filters it based on your whitelist of domains.

Non-regex solution (without potential errors :-) ):
$html='
http://dontwantthisdomain.com/dont-want-this-domain-name/
http://dontwantthisdomain2.com/also-dont-want-any-pages-from-this-domain/
http://dontwantthisdomain3.com/dont-want-any-pages-from-this-domain/
http://domain1.com/page-X-on-domain-com.html
http://dontwantthisdomain.com/dont-want-link-from-this-domain-name.html
http://dontwantthisdomain2.com/dont-want-any-pages-from-this-domain/
http://domain.com/page-XZ-on-domain-com.html
http://dontwantthisdomain.com/another-page-from-same-domain-that-i-dont-want-to-be-included/
http://dontwantthisdomain2.com/same-as-above/
http://domain3.com/page-XYZ-on-domain3-com.html
';
$html=explode("\n", $html);
$dontWant=array('dontwantthisdomain.com','dontwantthisdomain2.com','dontwantthisdomain3.com');
$final_result=array(); // initialise so print_r() works even if nothing is kept
foreach ($html as $link) {
$ok=true;
foreach($dontWant as $notWanted) {
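// strpos() returns 0 for a match at the start, so compare against false.
if (strpos($link, $notWanted) !== false) {
$ok=false;
}
// Skip blank lines.
if (trim($link) == '') $ok=false;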
}
if ($ok) $final_result[]=$link;
}
echo '<pre>';
print_r($final_result);
echo '</pre>';
outputs
Array
(
[0] => http://domain1.com/page-X-on-domain-com.html
[1] => http://domain.com/page-XZ-on-domain-com.html
[2] => http://domain3.com/page-XYZ-on-domain3-com.html
)
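The two answers above can also be combined with the DOMDocument code from the question. The following is a minimal sketch, not a definitive implementation: the helper name filter_links() and the sample markup are assumptions made for illustration. It compares the host returned by parse_url() against the unwanted domains instead of searching the whole URL string.

```php
<?php
// Hypothetical helper: returns the hrefs whose host is NOT in $blockedHosts.
function filter_links(string $html, array $blockedHosts): array
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $kept = array();
    foreach ($dom->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        $host = parse_url($href, PHP_URL_HOST);
        // Drop the link only when it has a host and that host is blocked.
        if (is_string($host) && in_array(strtolower($host), $blockedHosts, true)) {
            continue;
        }
        $kept[] = $href;
    }
    return $kept;
}

// Assumed sample markup, shortened from the URLs in the question.
$html = '<a href="http://dontwantthisdomain.com/dont-want-this-domain-name/">a</a>'
      . '<a href="http://domain1.com/page-X-on-domain-com.html">b</a>';
$blocked = array('dontwantthisdomain.com', 'dontwantthisdomain2.com', 'dontwantthisdomain3.com');
print_r(filter_links($html, $blocked));
```

Because the comparison is an exact host match, a blocked name that merely appears in a path or query string does not remove the link. The flip side is that subdomains such as www.dontwantthisdomain.com would need to be listed separately.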

Related

Is there a way to match words to sentences inside a html <b> tag in PHP

So I have this code to extract the text inside <b> tags.
$source_url = "https://www.wordpress.com/";
$html = file_get_contents($source_url);
$dom = new DOMDocument;
@$dom->loadHTML($html);
$links = $dom->getElementsByTagName('b');
$words = "php";
echo "<pre>";
print_r($dom);
echo "</pre>";
I tried to put the text into an array using array_push and similar functions, but if I use in_array
I need to pass the whole sentence for it to return true, not just a single word.
So what I want exactly is:
if the sentence contains 'php', then return true.
Try This:
foreach($links as $link) {
$p = strtolower($link->nodeValue);
if (strpos($p, 'php') !== false) {
// do something
}
}
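For reference, here is a self-contained sketch of the same check. The function name bold_text_contains() and the sample markup are assumptions for illustration; stripos() is used so that the lowercase conversion is not needed.

```php
<?php
// Hypothetical helper: true if any <b> element's text contains $word,
// ignoring case.
function bold_text_contains(string $html, string $word): bool
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    foreach ($dom->getElementsByTagName('b') as $b) {
        // stripos() is the case-insensitive variant of strpos().
        if (stripos($b->nodeValue, $word) !== false) {
            return true;
        }
    }
    return false;
}

// Assumed sample markup.
var_dump(bold_text_contains('<p><b>Learn PHP today</b> and more</p>', 'php'));
```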

How to scrape between specific tags using file_get_contents

I am using file_get_contents to scrape a html page. I would like the scrape to be between <pre> and </pre> tags only. Any ideas how to achieve this? The code is as follows:
$html = file_get_contents('http://www.atletiek.co.za/.....htm');
$tags = explode(' ', $html);
foreach ($tags as $tag) {
// skip scripts
if (strpos($tag, 'script') !== false) {
continue;
}
// get text
$text = strip_tags(' ' . $tag);
// only if text present remember
if (trim($text) != '') $texts[] = $text;
}
print_r($texts);
You can use regex if it's enough for you.
$s = 'test <pre>this is simple</pre> test <pre class="tricky">this is' . "\n" . 'tricky</pre> test';
if (preg_match_all('#<pre(?: [^>]*)?>(.*?)</pre>#msi', $s, $m)) {
print_r($m[1]);
}
shows
Array
(
[0] => this is simple
[1] => this is
tricky
)
But please read this first: https://stackoverflow.com/a/1732454/437763
Maybe you need XPath: http://php.net/manual/en/domxpath.query.php
I solved it by adding all the tags and attributes I wanted to exclude. I used if (strpos($tag, 'script') !== false) { for all the tags I did not want to load. It worked for me because there were only about 5 or six others.
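The XPath suggestion above could look roughly like this. It is a sketch under assumptions: the helper name extract_pre_texts() is made up for illustration, and the input reuses the test string from the regex answer rather than the real page.

```php
<?php
// Hypothetical helper: returns the text content of every <pre> element.
function extract_pre_texts(string $html): array
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $texts = array();
    foreach ($xpath->query('//pre') as $pre) {
        // textContent drops any nested tags and keeps only the text.
        $texts[] = $pre->textContent;
    }
    return $texts;
}

$s = 'test <pre>this is simple</pre> test <pre class="tricky">this is' . "\n" . 'tricky</pre> test';
print_r(extract_pre_texts($s));
```

Unlike the regex, this does not depend on how the opening tag is written, so attributes or unusual spacing inside <pre ...> need no special handling.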

PHP Web Crawler doesn't crawl .php files

This is the simple webcrawler I was trying to build
<?php
$to_crawl = "http://samplewebsite.com/about.php";
function get_links($url)
{
$input = @file_get_contents($url);
$regexp = " <a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a> ";
preg_match_all("/$regexp/siU", $input, $matches);
$l = $matches[2];
foreach ($l as $link) {
echo $link."</br>";
}
}
get_links($to_crawl);
?>
When I run the script with the $to_crawl variable set to a URL that does not end in a file extension, e.g. "facebook.com/about", it works, but for some reason it echoes nothing when the link ends with a '.php' filename. Can someone please help?
To get all links and their inner texts, you can use DOMDocument like this:
$dom = new DOMDocument;
@$dom->loadHTML($input); // Your input (HTML code)
$xp = new DOMXPath($dom);
$links = $xp->query('//a[@href]'); // XPath to get only <a> tags with a href attribute
$result = array();
foreach ($links as $link) {
$result[] = array($link->getAttribute("href"), $link->nodeValue);
}
print_r($result);
See IDEONE demo

Parse html with regexp

I want to find all <h3> blocks in this example:
<h3>sdf</h3>
sdfsdf
<h3>sdf</h3>
32
<h2>fs</h2>
<h3>23sd</h3>
234
<h1>h1</h1>
(From one h3 to the next h3 or h2.) This regexp finds only the first h3 block:
~\<h3[^>]*\>[^>]+\<\/h3\>.+(?:\<h3|\<h2|\<h1)~is
I use the PHP function preg_match_all (quote from the docs: "After the first match is found, the subsequent searches are continued on from end of the last match.")
What do I have to modify in my regexp?
ps
<h3>1</h3>
1content
<h3>2</h3>
2content
<h2>h2</h2>
<h3>3</h3>
3content
<h1>h1</h1>
this content has to be parsed as:
[0] => <h3>1</h3>1content
[1] => <h3>2</h3>2content
[2] => <h3>3</h3>3content
with DOMDocument:
$dom = new DOMDocument();
@$dom->loadHTML($html);
$nodes = $dom->getElementsByTagName('body')->item(0)->childNodes;
$flag = false;
$results = array();
foreach ($nodes as $node) {
if ( $node->nodeType == XML_ELEMENT_NODE &&
preg_match('~^h(?:[12]|(3))$~i', $node->nodeName, $m) ):
if ($flag)
$results[] = $tmp;
if (isset($m[1])) {
$tmp = $dom->saveXML($node);
$flag = true;
} else
$flag = false;
elseif ($flag):
$tmp .= $dom->saveXML($node);
endif;
}
echo htmlspecialchars(print_r($results, true));
with regex:
preg_match_all('~<h3.*?(?=<h[123])~si', $html, $matches);
echo htmlspecialchars(print_r($matches[0], true));
You shouldn't use Regex to parse HTML if there is any nesting involved.
Regex
(<(h\d)>.*?<\/\2>)[\r\n]([^\r\n<]+)
Replacement
\1\3
or
$1$3
http://regex101.com/r/uQ3uC2
preg_match_all('/<h3>(.*?)<\/h3>/is', $stringHTML, $matches);

(PHP) Regex for finding specific href tag

I have an HTML document with n "a href" tags with different target URLs and different text between the tags.
For example:
<span ....>lorem ipsum</span>
<span ....>example</span>
example3
<img ...>test</img>
without a d as target url
As you can see, the target URLs switch between "d?, d., d/d?, d/d." and between the "a" tags there could be any type of HTML that is allowed by the W3C.
I need a regex which gives me all links that have one of these combinations in the target URL:
"d?, d., d/d?, d/d." and have "lorem" or "test" between the "a" tags in any position, including nested HTML tags.
My Regex so far:
href=[\"\']([^>]*?/[d]+[.|\?][^"]*?[\"\'][^>]*[/]?>.*?</a>)
I tried to include the lorem / test as followed:
href=[\"\']([^>]*?/[d]+[.|\?][^"]*?[\"\'][^>]*[/]?>(lorem|test)+</a>)
but this only works if I put a ".*?" before and after the (lorem|test), and that would be too greedy.
If there is an easier way with SimpleXML or any other DOM parser, please let me know. Otherwise I would appreciate any help with the regex.
Thanks!
Here you go:
$html = array
(
'<span ....>lorem ipsum</span>',
'<span ....>example</span>',
'example3',
'<img ...>test</img>',
'without a d as target url',
);
$html = implode("\n", $html);
$result = array();
$anchors = phXML($html, '//a[contains(., "lorem") or contains(., "test")]');
foreach ($anchors as $anchor)
{
if (preg_match('~d[.?]~', strval($anchor['href'])) > 0)
{
$result[] = strval($anchor['href']);
}
}
echo '<pre>';
print_r($result);
echo '</pre>';
Output:
Array
(
[0] => http://www.example.com/d?12345abc
[1] => http://www.example.com/d/d.1234
)
The phXML() function is based on my DOMDocument / SimpleXML wrapper, and goes as follows:
function phXML($xml, $xpath = null)
{
if (extension_loaded('libxml') === true)
{
libxml_use_internal_errors(true);
if ((extension_loaded('dom') === true) && (extension_loaded('SimpleXML') === true))
{
if (is_string($xml) === true)
{
$dom = new DOMDocument();
if (@$dom->loadHTML($xml) === true)
{
return phXML(@simplexml_import_dom($dom), $xpath);
}
}
else if ((is_object($xml) === true) && (strcmp('SimpleXMLElement', get_class($xml)) === 0))
{
if (isset($xpath) === true)
{
$xml = $xml->xpath($xpath);
}
return $xml;
}
}
}
return false;
}
I'm too lazy not to use this function right now, but I'm sure you can get rid of it if you need to.
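If the wrapper is not wanted, the same query can be sketched with plain DOMDocument and DOMXPath. This is an illustration under assumptions: the function name find_matching_hrefs() is made up, and the sample markup is invented around the two URLs shown in the output above, since it is not the asker's original document.

```php
<?php
// Hypothetical helper: hrefs of <a> elements whose text (including nested
// tags) contains "lorem" or "test" and whose URL contains "d?" or "d.".
function find_matching_hrefs(string $html): array
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // XPath contains() is case-sensitive, as in the phXML() call above.
    $anchors = $xpath->query('//a[@href][contains(., "lorem") or contains(., "test")]');

    $result = array();
    foreach ($anchors as $anchor) {
        $href = $anchor->getAttribute('href');
        if (preg_match('~d[.?]~', $href) === 1) {
            $result[] = $href;
        }
    }
    return $result;
}

// Assumed sample markup; only the first two links satisfy both conditions.
$html = '<a href="http://www.example.com/d?12345abc"><span>lorem ipsum</span></a>' . "\n"
      . '<a href="http://www.example.com/d/d.1234"><b>test</b></a>' . "\n"
      . '<a href="http://www.example.com/d?999">example</a>' . "\n"
      . '<a href="http://www.example.com/x/1234">lorem</a>';
print_r(find_matching_hrefs($html));
```

The pattern ~d[.?]~ is kept as loose as in the answer above, so it would also match a "d." inside a host name; anchoring it to the path would need a stricter expression.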
Here is a Regular Expression which works:
$search = '/<a\s[^>]*href=["\'](?:http:\/\/)?(?:[a-z0-9-]+(?:\.[a-z0-9-]+)*)\/(?:d\/)?d[?.].*?>.*?(?:lorem|test)+.*?<\/a>/i';
$matches = array();
preg_match_all($search, $html, $matches);
The only catch is that it relies on there being a newline character between each <a> tag. Otherwise it will match something like:
example3<img ...>test</img>
Use an HTML parser. There are lots of reasons that Regex is absolutely not the solution for parsing HTML.
There's a good list of them here:
Robust and Mature HTML Parser for PHP
Will print only first and fourth link because two conditions are met.
preg_match_all('#href="(.*?)"(.*?)>(.*?)</a>#is', $string, $matches);
$count = count($matches[0]);
unset($matches[0], $matches[2]);
for($i = 0; $i < $count; $i++){
if(
strpos($matches[1][$i], '/d') !== false
&&
preg_match('#(lorem|test)#is', $matches[3][$i]) == true
)
{
echo $matches[1][$i];
}
}
