I'm trying to write a RegEx which finds all links on a webpage with the rel="nofollow" attribute. Mind you, I'm a RegEx newbie, so please don't be too harsh on me :)
This is what I got so far:
$link = "/<a href=\"([^\"]*)\" rel=\"nofollow\">(.*)<\/a>/iU";
Obviously this is very flawed. Any link with any other attribute or styled a little differently (single quotes) won't be matched.
You should really use a DOM parser for this purpose, as any regex-based solution will be error-prone for this kind of HTML parsing. Consider code like this:
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your html
$xpath = new DOMXPath($doc);
// returns a list of all links with rel=nofollow
$nlist = $xpath->query("//a[@rel='nofollow']");
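The query hands back a DOMNodeList, so you still loop over it to pull out the attributes you need. A minimal sketch (the sample $html here is illustrative):

```php
<?php
$html = '<a href="/a" rel="nofollow">one</a> <a href="/b">two</a>';
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
// collect the href of every rel="nofollow" link
$hrefs = [];
foreach ($xpath->query("//a[@rel='nofollow']") as $a) {
    $hrefs[] = $a->getAttribute('href');
}
print_r($hrefs);
```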
Try this:
$link = "/<(a)[^>]*rel\s*=\s*(['\"])nofollow\\2[^>]*>(.*?)<\/\\1>/i";
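That pattern tolerates extra attributes and either quote style around nofollow; with preg_match_all, the third capture group holds the link text. A quick sketch with made-up sample markup:

```php
<?php
$link = "/<(a)[^>]*rel\s*=\s*(['\"])nofollow\\2[^>]*>(.*?)<\/\\1>/i";
$html = '<a href="/x" rel="nofollow">X</a> <a href=\'/y\' rel=\'nofollow\'>Y</a> <a href="/z">Z</a>';
// only the two nofollow anchors match; group 3 is the link text
preg_match_all($link, $html, $matches);
print_r($matches[3]);
```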
Related
I'm new to stackoverflow and from South Korea.
I'm having difficulties with regex in PHP.
I want to select all the urls from user submitted html source.
The restrictions I want to make are the following.
Select urls EXCEPT
urls that are within tags
for example if the html source is like below,
<a href="http://aaa.com">http://aaa.com</a>
Neither occurrence of http://aaa.com should be selected.
urls right after " or =
Here is my current regex stage.
/(?<![\"=])https?\:\/\/[^\"\s<>]+/i
but with this regex, I can't achieve the first rule.
I tried to add negative lookahead at the end of my current regex like
/(?<![\"=])https?\:\/\/[^<>\"\s]+(?!<\/a>)/i
It still chooses the second url in the a tag like below.
http://aaa.co
We don't have a developers' Q&A community like Stack Overflow in Korea, so I really hope someone can help with this simple-looking regex issue!
Seriously consider using PHP's DOMDocument class. It does reliable HTML parsing. Doing this with regular expressions is error prone, more work, and slower.
The DOM works just like in the browser and you can use getElementsByTagName to get all links.
I got your use case working with this code using the DOM (try it here: http://3v4l.org/5IFof):
<?php
$html = <<<HTML
<a href="http://aaa.com">http://aaa.com</a>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $link) {
    var_dump($link->getAttribute('href'));
    // Output: http://aaa.com
}
Don't use Regex. Use DOM
$html = '<a href="http://aaa.com">http://aaa.com</a>';
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $a) {
    if ($a->hasAttribute('href')) {
        echo $a->getAttribute('href');
    }
    // $a->nodeValue; // If you want the text in the <a> tag
}
Seeing as you're not trying to extract URLs that appear as the href attribute of an a node, you'll want to start by getting the actual text content of the DOM. That can easily be done like so:
$dom = new DOMDocument;
$dom->loadHTML($htmlString);
$root = $dom->getElementsByTagName('body')->item(0); // get the outer tag; in a full dom this is body
$text = $root->textContent; // no tags, no attributes, no nothing
An alternative approach would be this:
$text = strip_tags($htmlString); // gets rid of markup
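Note that both approaches keep the text inside a tags (the link text), which the question also wants excluded. A hedged sketch that removes the a nodes first and then applies the question's URL regex to what remains (sample markup is illustrative):

```php
<?php
$htmlString = 'Plain http://bbb.com here <a href="http://aaa.com">http://aaa.com</a>';
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($htmlString);
// remove every <a> node so both the href and the link text disappear
foreach (iterator_to_array($dom->getElementsByTagName('a')) as $a) {
    $a->parentNode->removeChild($a);
}
// now only URLs that were outside tags survive
preg_match_all('/https?:\/\/[^\s"<>]+/i', $dom->textContent, $m);
print_r($m[0]);
```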
How can I remove the elements I desire (for example, the search box) from the page using PHP?
Link : http://api.wordreference.com/38159/enes/hello
NOTE: in my site I first request what you see in the link above and paste it in a div, if this info is of any use
For searching an HTML tree for certain elements, which is what you seem to want, you can use DOMXPath, which searches the tree of an XML (or HTML) document.
Have a look at: http://www.php.net/manual/en/domxpath.query.php
Remember however that searching an XML tree is not the cheapest of operations, so use it with caution.
Writing XPath queries can be tricky - but Firebug can be used to easily form them ( in firebug, go to some element of desire and hover the "CSS path" firebug is showing - the tooltip contains the XPath for the same...)
Well, assuming that you, for example, want to get the content, try:
$word = "hello";
$doc = new DOMDocument;
// We don't want to bother with white spaces.
$doc->preserveWhiteSpace = false;
// preg_replace is used for security: everything except word characters and spaces is removed.
$url = "http://api.wordreference.com/38159/enes/" . preg_replace( "/[^\w ]/", "", $word );
$doc->load($url);
$xpath = new DOMXPath($doc);
// We start from the root element.
$entries = $xpath->query("//html/body/div/div[3]");
// do some stuff with the elements found...
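The "do some stuff" part is just a loop over the returned node list. A self-contained example with a small stand-in document (the markup here is illustrative, not the WordReference response):

```php
<?php
$doc = new DOMDocument;
libxml_use_internal_errors(true);
$doc->loadHTML('<html><body><div><div>one</div><div>two</div><div>three</div></div></body></html>');
$xpath = new DOMXPath($doc);
// same shape of query as above: the third inner div
$entries = $xpath->query('//html/body/div/div[3]');
foreach ($entries as $entry) {
    echo $entry->nodeValue, "\n";
}
```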
I'm putting together a quick script to scrape a page for some results and I'm having trouble figuring out how to ignore white space and new lines in my regex.
For example, here's how the page may present a result in HTML:
<td class="things">
<div class="stuff">
<p>I need to capture this text.</p>
</div>
</td>
How would I change the following regex to ignore the spaces and new lines:
$regex = '/<td class="things"><div class="stuff"><p>(.*)<\/p><\/div><\/td>/i';
Any help would be appreciated. Help that also explains why you did something would be greatly appreciated!
I hardly need to caution you that you're playing with fire by trying to use regex on HTML. Anyway, to answer your question, you can use this regex:
$regex='#^<td class="things">\s*<div class="stuff">\s*<p>(.*)</p>\s*</div>\s*</td>#si';
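For example, against the fragment from the question:

```php
<?php
$html = '<td class="things">
<div class="stuff">
<p>I need to capture this text.</p>
</div>
</td>';
$regex = '#^<td class="things">\s*<div class="stuff">\s*<p>(.*)</p>\s*</div>\s*</td>#si';
// \s* soaks up the newlines and indentation between the tags
preg_match($regex, $html, $m);
echo $m[1];
```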
Update: Here is the DOM Parser based code to get what you want:
$html = <<< EOF
<td class="things">
<div class="stuff">
<p>I need to capture this text.</p>
</div>
</td>
EOF;
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your html
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query("//td[@class='things']/div[@class='stuff']/p");
for ($i = 0; $i < $nodelist->length; $i++) {
    $node = $nodelist->item($i);
    $val = $node->nodeValue;
    echo "$val\n"; // prints: I need to capture this text.
}
And now please refrain from parsing HTML using regex in your code.
SimpleHTMLDomParser will let you grab the content of a selected div or the contents of elements such as <p> <h1> <img> etc.
That might be a quicker way to achieve what you're trying to do.
The solution is to not use regular expressions on HTML. See this great article on the subject: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
Bottom line is that HTML is not a regular language, so regular expressions are not a good fit. You have variations in white space, potentially unclosed tags (who is to say the HTML you are scraping is going to always be correct?), among other challenges.
Instead, use PHP's DomDocument, impress your friends, AND do it the right way every time:
// create a new DOMDocument
$doc = new DOMDocument();
// load the string into the DOM
$doc->loadHTML('<td class="things"><div class="stuff"><p>I need to capture this text.</p></div></td>');
// since we are working with HTML fragments here, remove <!DOCTYPE
$doc->removeChild($doc->firstChild);
// likewise remove <html><body></body></html>
$doc->replaceChild($doc->firstChild->firstChild->firstChild, $doc->firstChild);
$contents = array();
//Loop through each <p> tag in the dom and grab the contents
// if you need to use selectors or get more complex here, consult the documentation
foreach ($doc->getElementsByTagName('p') as $paragraph) {
    $contents[] = $paragraph->textContent;
}
print_r($contents);
Documentation
PHP's DomDocument - http://php.net/manual/en/class.domdocument.php
PHP's DomElement - http://www.php.net/manual/en/class.domelement.php
This PHP extension is regarded as "standard", and is usually already installed on most web servers -- no third-party scripts or libraries required. Enjoy!
Good day!
My regular expression is really bad and I would like to ask for help on my project.
I have contents that I crawled from other sites and I would like to get all anchor tags that have this string in them.
target="_blank"
How will I accomplish this? Any suggestion would be greatly appreciated.
Thanks
As mentioned in the comments, regular expressions are not the answer here.
Use DOM and XPath to achieve what you want
$doc = new DOMDocument;
$doc->loadHTMLFile('http://www.example.com/some-file.html');
$xpath = new DOMXPath($doc);
$anchors = $xpath->query('//a[@target="_blank"]');
$dom = new DOMDocument();
$dom->loadHtml($yourContent);
$xpath = new DOMXpath($dom);
$yourAnchors = $xpath->query('//a[@target="_blank"]');
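Either way, the query returns a DOMNodeList of the matching anchors; iterating it might look like this (the sample markup is illustrative):

```php
<?php
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML('<a href="/a" target="_blank">A</a> <a href="/b">B</a>');
$xpath = new DOMXPath($dom);
// only anchors carrying target="_blank" are returned
$hrefs = [];
foreach ($xpath->query('//a[@target="_blank"]') as $a) {
    $hrefs[] = $a->getAttribute('href');
}
print_r($hrefs);
```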
Agree with @quentin; however, you can use RegExr (http://gskinner.com/RegExr/). A basic regex for all anchor tags is <a.*href=["'](?<url>[^"]+[.\s]*)["'].*>(?<name>[^<]+[.\s]*)</a>
(http://weblogs.asp.net/palermo4/archive/2004/06/18/regex-pattern-for-anchor-tags-part-2.aspx)
When capturing the content of a webpage with cURL or file_get_contents, what is the easiest way to remove inline JavaScript code? I am thinking of a regex to remove everything between <script> tags, but regex is not a reliable method for this purpose.
Is there a better way to parse an HTML page (just removing the JavaScript code)? If regex is still the best option, what is the most reliable pattern to do so?
You can make use of DOMDocument and its removeChild() function. Something like the following should get you going.
<?php
$doc = new DOMDocument;
$doc->load('index.html');
$page = $doc->documentElement;
// collect the <script> elements and remove them
$scripts = $page->getElementsByTagName('script');
// iterate over a static copy: removing nodes from a live DOMNodeList skips elements,
// and the script's own parent (not the document element) must do the removal
foreach (iterator_to_array($scripts) as $script) {
    $script->parentNode->removeChild($script);
}
echo $doc->saveHTML();
?>