php regex selecting url from html source


I'm new to Stack Overflow and I'm from South Korea.
I'm having difficulty with a regex in PHP.
I want to select all the URLs from user-submitted HTML source, with the following restrictions.
Select URLs EXCEPT:
1. URLs that are within tags.
For example, if the HTML source is like below,
<a href="http://aaa.com">http://aaa.com</a>
neither occurrence of http://aaa.com should be selected.
2. URLs right after " or =.
Here is my current regex:
/(?<![\"=])https?\:\/\/[^\"\s<>]+/i
But with this regex, I can't enforce the first rule.
I tried adding a negative lookahead to the end of my current regex, like this:
/(?<![\"=])https?\:\/\/[^<>\"\s]+(?!<\/a>)/i
It still selects the second URL in the a tag, as shown below.
http://aaa.co
We don't have a developer Q&A community like Stack Overflow in Korea, so I really hope someone can help with this simple-looking regex issue!

Seriously consider using PHP's DOMDocument class, which parses HTML reliably. Doing this with regular expressions is error-prone, more work, and slower.
The DOM works much as it does in the browser, and you can use getElementsByTagName to get all links.
I got your use case working with this DOM-based code (try it here: http://3v4l.org/5IFof):
<?php
$html = <<<HTML
<a href="http://aaa.com">http://aaa.com</a>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $link) {
    // print the href attribute of each <a> element
    var_dump($link->getAttribute('href'));
    // Output: string(14) "http://aaa.com"
}

Don't use regex. Use the DOM:
$html = '<a href="http://aaa.com">http://aaa.com</a>';
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $a) {
    if ($a->hasAttribute('href')) {
        echo $a->getAttribute('href');
    }
    // $a->nodeValue; // if you want the text inside the <a> tag
}

Seeing as you're not trying to extract URLs that are the href attribute of an a node, you'll want to start by getting the actual text content of the DOM. This can easily be done like so:
$dom = new DOMDocument;
$dom->loadHTML($htmlString);
$root = $dom->getElementsByTagName('body')->item(0); // get the outer tag; in a full document, this is body
$text = $root->textContent; // no tags, no attributes, just text
An alternative approach would be this:
$text = strip_tags($htmlString); // gets rid of markup
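Note that the text content still includes the visible text of links, so a URL used as link text would survive. A minimal sketch of one way to combine the pieces, assuming the goal is only URLs that appear as plain text outside a elements: remove the a elements first, then run a simple URL regex over what is left. The function name extractPlainTextUrls and the sample HTML are illustrative, not from the question.

```php
<?php
// Hypothetical helper (not from the thread): collect URLs from text outside <a> elements.
function extractPlainTextUrls(string $html): array
{
    if (trim($html) === '') {
        return [];
    }

    $dom = new DOMDocument;
    libxml_use_internal_errors(true);   // tolerate imperfect user-submitted HTML
    $dom->loadHTML($html);
    libxml_clear_errors();

    // Copy the live node list before removing nodes from the document.
    foreach (iterator_to_array($dom->getElementsByTagName('a')) as $a) {
        $a->parentNode->removeChild($a);
    }

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return [];
    }

    // Attribute values are not part of textContent, so no lookbehind is needed.
    preg_match_all('~https?://[^\s<>"]+~i', $body->textContent, $matches);
    return $matches[0];
}

$html = '<p>See http://bbb.com/page for details. '
      . '<a href="http://aaa.com">http://aaa.com</a> '
      . '<img src="http://ccc.com/x.png"></p>';

print_r(extractPlainTextUrls($html)); // only http://bbb.com/page is returned
```

Because the regex only ever sees text nodes, the href and src values never reach it, which is what the lookbehind in the question was trying to approximate.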

Related

How can I remove all <span> tags and their respective content, including other nested elements?

I've tried a few solutions, but they only remove the tags themselves, leaving the content and any other nested elements.
Regular Expression,
preg_replace('/<span\b[^>]*>(.*?)<\/span>/ig', '', $page->body);
Tried using HTML purifier also,
$purifier->set('Core.HiddenElements', array('span'));
$purifier->set('HTML.ForbiddenElements', array('span'));

Depending on your actual strings and the things you tried, you could use a regular expression (assuming your span tags are only span tags and are never nested).
A more appropriate solution, however, would be to use an HTML parser like DOMDocument.
You can use getElementsByTagName('span') to get all the span elements and remove each one from its parent node.
Then use saveHTML to get the HTML code back.
You will get something like this:
$doc = new DOMDocument;
libxml_use_internal_errors(true); // tolerate imperfect real-world HTML
$doc->loadHTML($yourpage); // $yourpage holds the HTML string
// getElementsByTagName() returns a live list, so copy it before removing nodes
$spans = iterator_to_array($doc->getElementsByTagName('span'));
foreach ($spans as $span) {
    // remove each span, together with everything nested inside it, from its own parent
    $span->parentNode->removeChild($span);
}
echo $doc->saveHTML();
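When the input is a fragment rather than a full page, serializing the whole document also emits the doctype, html, and body wrappers that the parser adds. A hedged sketch of one way around that, assuming the fragment is passed in as a string: remove the spans, then serialize only the children of body. The helper name removeSpans is made up for illustration.

```php
<?php
// Hypothetical helper: strip <span> elements (and their contents) from an HTML fragment.
function removeSpans(string $fragment): string
{
    $doc = new DOMDocument;
    libxml_use_internal_errors(true);   // tolerate imperfect markup
    $doc->loadHTML($fragment);
    libxml_clear_errors();

    // Copy the live node list before removing anything from the document.
    foreach (iterator_to_array($doc->getElementsByTagName('span')) as $span) {
        $span->parentNode->removeChild($span);
    }

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Serialize only the children of <body>, skipping the wrapper loadHTML() added.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo removeSpans('<p>Keep <span class="x">drop <b>nested</b></span>this.</p>');
```

Removing the node from its own parentNode is what makes nested content disappear with it, since the whole subtree is detached at once.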

How to regex scrape HTML and ignore whitespace and newlines in code?

I'm putting together a quick script to scrape a page for some results and I'm having trouble figuring out how to ignore white space and new lines in my regex.
For example, here's how the page may present a result in HTML:
<td class="things">
<div class="stuff">
<p>I need to capture this text.</p>
</div>
</td>
How would I change the following regex to ignore the spaces and new lines:
$regex = '/<td class="things"><div class="stuff"><p>(.*)<\/p><\/div><\/td>/i';
Any help would be appreciated. Help that also explains why you did something would be greatly appreciated!

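A word of caution first: you're playing with fire by using regex on HTML. That said, to answer your question, you can use this regex:
$regex = '#<td class="things">\s*<div class="stuff">\s*<p>(.*?)</p>\s*</div>\s*</td>#si';
Each \s* matches any run of whitespace, including newlines, between the tags. The s modifier lets . match newlines too, and the lazy (.*?) stops at the first closing </p>.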
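For illustration, here is how such a pattern might be applied with preg_match. The pattern here is unanchored and uses a lazy (.*?), so it can match in the middle of a page and stop at the first closing </p>. The variable $page and its sample content are assumptions made for the sketch.

```php
<?php
// Sample input: the target cell sits in the middle of a larger page.
$page = <<<HTML
<table><tr>
<td class="things">
    <div class="stuff">
        <p>I need to capture this text.</p>
    </div>
</td>
</tr></table>
HTML;

// \s* absorbs whitespace and newlines between tags; the s modifier lets "." span lines.
$regex = '#<td class="things">\s*<div class="stuff">\s*<p>(.*?)</p>\s*</div>\s*</td>#si';

if (preg_match($regex, $page, $match)) {
    echo $match[1], "\n"; // prints: I need to capture this text.
}
```

The pattern still breaks as soon as the attributes change order or gain extra classes, which is the fragility the DOM approach avoids.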
Update: Here is the DOM Parser based code to get what you want:
$html = <<< EOF
<td class="things">
<div class="stuff">
<p>I need to capture this text.</p>
</div>
</td>
EOF;
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your html
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query("//td[@class='things']/div[@class='stuff']/p");
for($i=0; $i < $nodelist->length; $i++) {
$node = $nodelist->item($i);
$val = $node->nodeValue;
echo "$val\n"; // prints: I need to capture this text.
}
And now please refrain from parsing HTML using regex in your code.

PHP Simple HTML DOM Parser will let you grab the content of a selected div or the contents of elements such as <p>, <h1>, <img>, etc.
That might be a quicker way to achieve what you're trying to do.

The solution is to not use regular expressions on HTML. See this great article on the subject: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
Bottom line is that HTML is not a regular language, so regular expressions are not a good fit. You have variations in white space, potentially unclosed tags (who is to say the HTML you are scraping is going to always be correct?), among other challenges.
Instead, use PHP's DomDocument, impress your friends, AND do it the right way every time:
// create a new DOMDocument
$doc = new DOMDocument();
// load the string into the DOM
$doc->loadHTML('<td class="things"><div class="stuff"><p>I need to capture this text.</p></div></td>');
// since we are working with HTML fragments here, remove <!DOCTYPE
$doc->removeChild($doc->firstChild);
// likewise remove <html><body></body></html>
$doc->replaceChild($doc->firstChild->firstChild->firstChild, $doc->firstChild);
$contents = array();
//Loop through each <p> tag in the dom and grab the contents
// if you need to use selectors or get more complex here, consult the documentation
foreach($doc->getElementsByTagName('p') as $paragraph) {
$contents[] = $paragraph->textContent;
}
print_r($contents);
Documentation
PHP's DomDocument - http://php.net/manual/en/class.domdocument.php
PHP's DomElement - http://www.php.net/manual/en/class.domelement.php
This PHP extension is regarded as "standard", and is usually already installed on most web servers -- no third-party scripts or libraries required. Enjoy!

RegExp: Finding all links on page w/ nofollow

I'm trying to write a regex that finds all links on a webpage with the rel="nofollow" attribute. Mind you, I'm a regex newbie, so please don't be too harsh on me :)
This is what I got so far:
$link = "/<a href=\"([^\"]*)\" rel=\"nofollow\">(.*)<\/a>/iU";
Obviously this is very flawed. Any link with any other attribute or styled a little differently (single quotes) won't be matched.

You should really use a DOM parser for this purpose, as any regex-based solution will be error-prone for this kind of HTML parsing. Consider code like this:
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your html
$xpath = new DOMXPath($doc);
// returns a list of all links with rel=nofollow
$nlist = $xpath->query("//a[@rel='nofollow']");
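An exact comparison on rel misses links whose rel attribute carries several values, such as rel="external nofollow". A possible variant, sketched here under the assumption that such links should count too, pads the attribute with spaces and tests for the whole token. The sample HTML and the $hrefs array are illustrative.

```php
<?php
$html = '<p><a href="http://a.example/" rel="nofollow">A</a> '
      . "<a rel='external nofollow' href='http://b.example/'>B</a> "
      . '<a href="http://c.example/">C</a></p>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // tolerate imperfect markup
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Pad rel with spaces so "nofollow" is matched as a whole token in a multi-value rel.
$query = "//a[contains(concat(' ', normalize-space(@rel), ' '), ' nofollow ')]";

$hrefs = [];
foreach ($xpath->query($query) as $a) {
    $hrefs[] = $a->getAttribute('href');
}
print_r($hrefs); // http://a.example/ and http://b.example/
```

Attribute order and quote style do not matter here, because the parser has already normalized them before the query runs.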

Try this:
$link = "/<(a)[^>]*rel\s*=\s*(['\"])nofollow\\2[^>]*>(.*?)<\/\\1>/i";

Remove javascript codes in parsing a webpage

When capturing the content of a webpage with cURL or file_get_contents, what is the easiest way to remove inline JavaScript code? I am thinking of a regex to remove everything between <script> tags, but regex is not a reliable method for this purpose.
Is there a better way to parse an HTML page (just removing the JavaScript code)? If regex is still the best option, what is the most reliable pattern to use?

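You can make use of DOMDocument and its removeChild() method. Something like the following should get you going.
<?php
$doc = new DOMDocument;
libxml_use_internal_errors(true); // suppress warnings from imperfect HTML
$doc->loadHTMLFile('index.html');
// copy the live node list first, because removing nodes while iterating it skips elements
$scripts = iterator_to_array($doc->getElementsByTagName('script'));
foreach ($scripts as $script) {
    // a script is rarely a direct child of <html>, so remove it from its own parent
    $script->parentNode->removeChild($script);
}
echo $doc->saveHTML();
?>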

str_replace within certain html tags only

I have an HTML page loaded into a PHP variable and am using str_replace to swap certain words for other words. The only problem is that if one of these words appears in an important piece of code, the whole thing falls to bits.
Is there any way to only apply the str_replace function to certain html tags? Particularly: p,h1,h2,h3,h4,h5
EDIT:
The bit of code that matters:
$yay = str_ireplace($find, $replace , $html);
cheers and thanks in advance for any answers.
EDIT - FURTHER CLARIFICATION:
$find and $replace are arrays containing words to be found and replaced (respectively). $html is the string containing all the html code.
A good example of it falling to bits would be finding and replacing a word that occurs in, e.g., the domain name. Say I wanted to replace the word 'hat' with 'cheese'. Any occurrence of an absolute path like
www.worldofhat.com/images/monkey.jpg
would be replaced with:
www.worldofcheese.com/images/monkey.jpg
So if the replacements could only occur in certain tags, this could be avoided.

Do not treat the HTML document as a mere string. As you already noticed, tags/elements (and how they are nested) have meaning in an HTML page, so you want a tool that knows what to make of an HTML document. That tool is DOM.
Here is an example. First, some HTML to work with:
$html = <<< HTML
<body>
<h1>Germany reached the semi finals!!!</h1>
<h2>Germany reached the semi finals!!!</h2>
<h3>Germany reached the semi finals!!!</h3>
<h4>Germany reached the semi finals!!!</h4>
<h5>Germany reached the semi finals!!!</h5>
<p>Fans in Germany are totally excited over their team's 4:0 win today</p>
</body>
HTML;
And here is the actual code you would need to make Argentina happy
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//*[self::h1 or self::h2 or self::p]');
foreach( $nodes as $node ) {
$node->nodeValue = str_replace('Germany', 'Argentina', $node->nodeValue);
}
echo $dom->saveHTML();
Just add the tags whose content you want to replace to the XPath query. An alternative to using XPath would be to use DOMDocument::getElementsByTagName, which you might know from JavaScript:
$nodes = $dom->getElementsByTagName('h1');
In fact, if you know it from JavaScript, you might know a lot more of it, because DOM is actually a language-agnostic API defined by the W3C and implemented in many languages. The advantage of XPath over getElementsByTagName is obviously that you can query multiple node types in one go. The drawback is that you have to know XPath :)
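One caveat with assigning to nodeValue on an element is that any child markup inside it (links, emphasis) is flattened to plain text. A sketch of a variant that replaces inside the individual text nodes instead, assuming nested markup should be preserved and attribute values such as URLs left untouched; the sample HTML is made up around the 'hat' to 'cheese' example from the question.

```php
<?php
$html = '<body><h1>Big hat sale</h1>'
      . '<p>Buy a hat at <a href="http://www.worldofhat.com/">our <em>hat</em> shop</a>.</p>'
      . '<div>hat</div></body>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // tolerate imperfect markup
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Select text nodes at any depth inside the chosen tags; attributes are never text nodes.
$textNodes = $xpath->query(
    '//*[self::p or self::h1 or self::h2 or self::h3 or self::h4 or self::h5]//text()'
);

foreach ($textNodes as $textNode) {
    $textNode->nodeValue = str_ireplace('hat', 'cheese', $textNode->nodeValue);
}

$result = $dom->saveHTML();
echo $result;
```

In this sample the href keeps its original domain and the div outside the selected tags is left alone. str_ireplace still matches substrings, so a word like "that" would also be changed; a word-boundary regex with preg_replace is one option if that matters.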
