Remove javascript codes in parsing a webpage - php

When capturing the content of a webpage with cURL or file_get_contents, what is the easiest way to remove inline JavaScript code? I am thinking of using a regex to remove everything between <script> tags, but regex is not a reliable method for this purpose.
Is there a better way to parse an html page (just removing javascript codes)? If regex is still the best option, what is the most reliable command to do so?

You can make use of DOMDocument and its removeChild() function. Something like the following should get you going.
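<?php
$doc = new DOMDocument;
// loadHTMLFile() parses HTML; load() expects well-formed XML
$doc->loadHTMLFile('index.html');
// copy the live node list into an array so that removals don't skip nodes
$scripts = iterator_to_array($doc->getElementsByTagName('script'));
foreach ($scripts as $script) {
    // remove each script from its actual parent, wherever it sits in the tree
    $script->parentNode->removeChild($script);
}
echo $doc->saveHTML();
?>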
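If the page arrives as a string (from cURL or file_get_contents) rather than a file, the same idea can be sketched as below. This is only a minimal sketch: removeScripts() is a made-up name, and treating every attribute whose name starts with "on" as an inline handler is my assumption about what "inline JavaScript" should cover.

```php
<?php
// Hypothetical helper (not from the answer above): strips <script> elements
// and on* event-handler attributes from an HTML string.
function removeScripts(string $html): string
{
    $doc = new DOMDocument();
    // Real-world pages are rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // An XPath result is a snapshot, so removing nodes while looping is safe.
    foreach ($xpath->query('//script') as $script) {
        $script->parentNode->removeChild($script);
    }

    // Inline handlers such as onclick="..." are attributes, not <script> elements.
    foreach ($xpath->query('//@*[starts-with(name(), "on")]') as $attr) {
        $attr->ownerElement->removeAttributeNode($attr);
    }

    return $doc->saveHTML();
}

$html = '<html><body><p onclick="alert(1)">Hello</p><script>alert(2);</script></body></html>';
echo removeScripts($html);
```

Note that javascript: URLs in href attributes are not touched by this sketch.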

Related

How to conditionally remove content from a string?

I have a function that creates a preview of a post like this
<?php $pos=strpos($post->content, ' ', 280);
echo substr($post->content,0,$pos ); ?>
But it's possible that the very first thing in that post is a <style> block. How can I add some conditional logic to make sure my preview shows what comes after the style block?
If the only HTML content is a <style> tag, you could just simply use preg_replace:
echo preg_replace('#<style\b[^>]*>.*?</style>#is', '', $post->content);
However it is better (and more robust) to use DOMDocument (note that loadHTML will put a <body> tag around your post content and that is what we search for) to output just the text it contains:
$doc = new DOMDocument();
$doc->loadHTML($post->content);
echo $doc->getElementsByTagName('body')->item(0)->nodeValue . "\n";
For this sample input:
$post = (object)['content' => '<style>some random css</style>the text I really want'];
The output of both is
the text I really want
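Putting the DOM approach together with the truncation from the question could look roughly like this. It's a sketch only: previewText() and its $minLength parameter are names I made up, and it assumes plain ASCII content (loadHTML needs extra care with other encodings).

```php
<?php
// Hypothetical helper: plain-text preview of an HTML fragment, ignoring <style> blocks.
function previewText(string $html, int $minLength = 280): string
{
    $doc = new DOMDocument();
    // Suppress warnings about imperfect markup.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Drop every <style> element, wherever the parser placed it.
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//style') as $style) {
        $style->parentNode->removeChild($style);
    }

    $body = $doc->getElementsByTagName('body')->item(0);
    $text = $body ? trim($body->textContent) : '';

    // Cut at the first space at or after $minLength, as in the question.
    if (strlen($text) <= $minLength) {
        return $text;
    }
    $pos = strpos($text, ' ', $minLength);
    return $pos === false ? $text : substr($text, 0, $pos);
}

$post = (object)['content' => '<style>some random css</style>the text I really want'];
echo previewText($post->content), "\n";    // short text is returned whole
echo previewText($post->content, 8), "\n"; // cut at the first space from offset 8
```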
Taking a cue from the excellent comment of @deceze, here's one way to use the DOM with PHP to eliminate the style tags:
<?php
$_POST["content"] =
"<style>
color:blue;
</style>
The rain in Spain lies mainly in the plain ...";
$dom = new DOMDocument;
$dom->loadHTML($_POST["content"]);
$style_tags = $dom->getElementsByTagName('style');
foreach($style_tags as $style_tag) {
    $parent = $style_tag->parentNode;
    // swap the <style> element for an empty text node
    $parent->replaceChild($dom->createTextNode(''), $style_tag);
}
echo strip_tags($dom->saveHTML());
I also took guidance from a related discussion specifically looking at the officially accepted answer.
The advantage of manipulating the HTML through the DOM in PHP is that you don't even need a conditional to remove the STYLE tags. Also, you are working with HTML elements, so you don't have to bother with the intricacies of a regex. Note that each style tag is replaced by a text node containing an empty string.
Note, tags like HEAD and BODY are automatically inserted when the DOM object executes its saveHTML() method. So, in order to display only text content, the last line uses strip_tags() to remove all HTML tags.
Lastly, while the officially accepted answer is generally a viable alternative, it does not provide a complete solution for non-compliant HTML containing a STYLE tag after a BODY tag.
You have two options:
1. If there are no tags in your content, use strip_tags().
2. You could use a regex, e.g. with preg_match(). This is more complex, but there is always a suitable pattern.

php regex selecting url from html source

I'm new to stackoverflow and from South Korea.
I'm having difficulties with regex with php.
I want to select all the urls from user submitted html source.
The restrictions I want to make are following.
Select URLs EXCEPT:
1. URLs that are within <a> tags. For example, if the HTML source is like below,
<a href="http://aaa.com">http://aaa.com</a>
neither occurrence of http://aaa.com should be selected.
2. URLs right after " or =
Here is my current regex stage.
/(?<![\"=])https?\:\/\/[^\"\s<>]+/i
but with this regex, I can't achieve the first rule.
I tried to add negative lookahead at the end of my current regex like
/(?<![\"=])https?\:\/\/[^<>\"\s]+(?!<\/a>)/i
It still chooses the second url in the a tag like below.
http://aaa.co
We don't have a developers' Q&A community like Stack Overflow in Korea, so I really hope someone can help with this simple-looking regex issue!
Seriously consider using PHP's DOMDocument class. It does reliable HTML parsing. Doing this with regular expressions is error prone, more work, and slower.
The DOM works just like in the browser and you can use getElementsByTagName to get all links.
I got your use case working with this code using the DOM (try it here: http://3v4l.org/5IFof):
<?php
$html = <<<HTML
<a href="http://aaa.com">http://aaa.com</a>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $link) {
var_dump($link->getAttribute('href'));
// Output: http://aaa.com
}
Don't use Regex. Use DOM
$html = '<a href="http://aaa.com">http://aaa.com</a>';
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $a) {
if($a->hasAttribute('href')){
echo $a->getAttribute('href');
}
//$a->nodeValue; // If you want the text in <a> tag
}
Seeing as you're not trying to extract urls that are the href attribute of an a node, you'll want to start by getting the actual text content of the dom. This can be easily done like so:
$dom = new DOMDocument;
$dom->loadHTML($htmlString);
$root = $dom->getElementsByTagName('body')->item(0); // outer tag; for a full document this is <body>
$text = $root->textContent; // plain text only: no tags, no attributes
An alternative approach would be this:
$text = strip_tags($htmlString); // gets rid of markup
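To honour both restrictions from the question (skip URLs inside <a> tags and URLs inside attributes), one option is to run the regex over text nodes only. A rough sketch follows; findBareUrls() is a made-up name and the URL pattern is deliberately simplistic.

```php
<?php
// Hypothetical helper: returns URLs found in text nodes that are not inside an <a> element.
function findBareUrls(string $html): array
{
    $doc = new DOMDocument();
    // Suppress warnings about imperfect user-submitted markup.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    $xpath = new DOMXPath($doc);
    // Attribute values are never text nodes, so href="..." and src="..." are skipped automatically.
    foreach ($xpath->query('//text()[not(ancestor::a)]') as $textNode) {
        if (preg_match_all('~https?://[^\s<>"]+~i', $textNode->nodeValue, $m)) {
            $urls = array_merge($urls, $m[0]);
        }
    }
    return $urls;
}

$html = '<p>See http://bbb.com and <a href="http://aaa.com">http://aaa.com</a></p>';
print_r(findBareUrls($html)); // only http://bbb.com is reported
```

Text inside <script> or <style> elements also counts as text nodes here, so you may want to exclude those ancestors too.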

how would I screen scrape a page like this using file_get_contents and preg_match?

I have a page with many HTML lines like this:
<ul><li><a href='a_silly_link_that_changes_each_line.php'>the_content_i_need</a></li></ul>
Now as you can see, there's a link in that line, which unfortunately changes on each line.
So I need a way to scrape the content in that line, without letting the link get in the way.
I've also tried to scrape like this: .php'>(*.)</a></li></ul> but that's no good, as it returns a lot of unwanted content.
Also, because there are many lines on the page that I need to take the content from, could I just loop through them somehow?
I'm using preg_match and file_get_contents but am open to other suggestions. :)
From: PHP Parse HTML code
Use something like:
$str = '<ul><li><a src="test.html">linky</a></li></ul>';
$DOM = new DOMDocument;
$DOM->loadHTML($str);
$items = $DOM->getElementsByTagName('ul');
for($i =0;$i<$items->length;$i++){
$ul = $items->item($i);
$li=$ul->firstChild;
if($li->nodeName=='li' && $li->firstChild->nodeName=='a'){
//do something with $li->firstChild->nodeValue
}
}
In this case, $li->firstChild->nodeValue will be linky.
That should do it :)
Try using
$matches = array();
preg_match_all('~\\.php\'>(.*?)</a></li></ul>~', file_get_contents($filename), $matches, PREG_SET_ORDER);
This will match all such links in your file; with PREG_SET_ORDER, each entry of $matches holds the captured text at index 1. The lazy quantifier *? means "match zero or more characters, but as few as possible", so you won't be getting any unwanted content.
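For looping over many such lines, an XPath query saves the manual firstChild checks. A small sketch, where linkTexts() is a made-up name and the sample markup is invented to mirror the line in the question:

```php
<?php
// Hypothetical helper: collects the text of every <a> that is a child of <ul><li>.
function linkTexts(string $html): array
{
    $doc = new DOMDocument();
    // Suppress warnings about imperfect markup.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $texts = [];
    $xpath = new DOMXPath($doc);
    // The href value is irrelevant here, so it does not matter that it changes per line.
    foreach ($xpath->query('//ul/li/a') as $a) {
        $texts[] = trim($a->textContent);
    }
    return $texts;
}

$html = "<ul><li><a href='one.php'>first</a></li></ul>\n"
      . "<ul><li><a href='two.php'>second</a></li></ul>";
print_r(linkTexts($html));
```

For a real page, pass the result of file_get_contents() as $html.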

automatic link creation using php without breaking the html tags

I want to convert text links in my content page into active links using PHP. I tried every script I could find. They all work, but the problem is that they also convert links inside img src attributes. They convert links everywhere and break the HTML code.
I found a good script that does exactly what I want, but it is in JavaScript. It is called jquery-linkify.
You can find the script here:
http://github.com/maranomynet/linkify/
The trick in the script is that it converts text links without breaking the HTML code. I tried to convert the script into PHP but failed.
I can't use the script on my website because there are other scripts that conflict with jQuery.
Could anyone rewrite this script for PHP, or at least guide me how?
Thanks.
First, parse the text with an HTML parser, with something like DOMDocument::loadHTML. Note that poor HTML can be hard to parse, and depending on the parser, you might get slightly different output in the browser after running such a function.
PHP's DOMDocument isn't very flexible in that regard. You may have better luck by parsing with other tools. But if you are working with valid HTML (and you should try to, if it's within your control), none of that is a concern.
After parsing the text, you need to look at the text nodes for links and replace them. Using a regular expression is the simplest way.
Here's a sample script that does just that:
<?php
function linkify($text)
{
$re = "@\b(https?://)?(([0-9a-zA-Z_!~*'().&=+$%-]+:)?[0-9a-zA-Z_!~*'().&=+$%-]+\@)?(([0-9]{1,3}\.){3}[0-9]{1,3}|([0-9a-zA-Z_!~*'()-]+\.)*([0-9a-zA-Z][0-9a-zA-Z-]{0,61})?[0-9a-zA-Z]\.[a-zA-Z]{2,6})(:[0-9]{1,4})?((/[0-9a-zA-Z_!~*'().;?:\@&=+$,%#-]+)*/?)@";
preg_match_all($re, $text, $matches, PREG_OFFSET_CAPTURE);
$matches = $matches[0];
$i = count($matches);
while ($i--)
{
$url = $matches[$i][0];
if (!preg_match('#^https?://#', $url))
$url = 'http://'.$url;
$text = substr_replace($text, '<a href="'.$url.'">'.$matches[$i][0].'</a>', $matches[$i][1], strlen($matches[$i][0]));
}
return $text;
}
$dom = new DOMDocument();
$dom->loadHTML('<b>stackoverflow.com</b> test');
$xpath = new DOMXpath($dom);
foreach ($xpath->query('//text()') as $text)
{
$frag = $dom->createDocumentFragment();
$frag->appendXML(linkify($text->nodeValue));
$text->parentNode->replaceChild($frag, $text);
}
echo $dom->saveHTML();
?>
I did not come up with that regular expression, and I cannot vouch for its accuracy. I also did not test the script, except for this above case. However, this should be more than enough to get you going.
Output:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<b><a href="http://stackoverflow.com">stackoverflow.com</a></b>
test
</body>
</html>
Note that saveHTML() adds the surrounding tags. If that's a problem, you can strip them out with substr().
Use a HTML parser and only search for URLs within text nodes.
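As a rough illustration of that advice, here is a sketch that builds the <a> elements through the DOM instead of pasting markup into a fragment, and leaves attributes and existing links alone. linkifyHtml() is a made-up name, and the URL pattern is a simplistic assumption that only catches http:// and https:// URLs.

```php
<?php
// Hypothetical helper: wraps bare URLs in <a> elements, touching only text nodes outside links.
function linkifyHtml(string $html): string
{
    $doc = new DOMDocument();
    // Suppress warnings about imperfect markup.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//body//text()[not(ancestor::a)]') as $textNode) {
        // Split the text into alternating plain pieces and URLs.
        $parts = preg_split('~(https?://[^\s<>"]+)~i', $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // no URL in this text node
        }
        $frag = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                // Odd indexes hold the captured URLs.
                $a = $doc->createElement('a');
                $a->setAttribute('href', $part);
                $a->appendChild($doc->createTextNode($part));
                $frag->appendChild($a);
            } elseif ($part !== '') {
                $frag->appendChild($doc->createTextNode($part));
            }
        }
        $textNode->parentNode->replaceChild($frag, $textNode);
    }
    return $doc->saveHTML();
}

echo linkifyHtml('<p>Visit http://example.com now <img src="http://example.com/image.jpg"></p>');
```

Because the img src value is an attribute and not a text node, it is never seen by the regex.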
I think the trick is in keeping track of the single quotes (') and double quotes (") in your PHP code and nesting them correctly, so that you put ' inside " or vice versa.
For Example,
<?PHP
//old html tags
echo "<h1>Header1</h1>";
echo "<div>some text</div>";
//your added links
echo "<p><a href='link1.php'>Link1</a><br>";
echo "<a href='link1.php'>Link1</a></p>";
//old html tags
echo "<h1>Another Header</h1>";
echo "<div>some text</div>";
?>
I hope this helps you ..
$text = 'Any text ... link http://example123.com and image <img src="http://exaple.com/image.jpg" />';
$text = preg_replace('!([^\"])(http:\/\/(?:[\w\.]+))([^\"])!', '\\1<a href="\\2">\\2</a>\\3', $text);

Regex and PHP for extracting contents between tags with several line breaks

How can I extract the content between tags with several line breaks?
I'm new to regex and would like to know how to handle an unknown number of line breaks in my match.
Task: Extract content between <div class="test"> and the first closing </div> tag.
Original source:
<div class="test">optional text<br/>
content<br/>
<br/>
content<br/>
...
content<br/>Hyperlink</div></div></div>
I've worked out the below regex,
/<div class=\"test\">(.*?)<br\/>(.*?)<\/div>/
Just wonder how to match several line breaks using regex.
There is DOM for us but I am not familiar with that.
You should not parse (x)html with regular expressions. Use DOM.
I'm a beginner in xpath, but one like this should work:
//div[@class='test']
This selects all divs with the class 'test'. You will need to load your html into a DOMDocument object, then create a DOMXPath object relating to that, and call its query() method to get the results. It will return a DOMNodeList object.
Final code looks something like this:
$domd = new DOMDocument();
$domd->loadHTML($your_html_code);
$domx = new DOMXPath($domd);
$items = $domx->query("//div[@class='test']");
After this, your div is in $items->item(0).
This is untested code, but if I remember correctly, it should work.
Update, forgot that you need the content.
If you need the text content (no tags), you can simply call $items->item(0)->textContent. If you also need the tags, here's the equivalent of javascript's innerHTML for PHP DOM:
function innerHTML($node){
$doc = new DOMDocument();
foreach ($node->childNodes as $child)
$doc->appendChild($doc->importNode($child, true));
return $doc->saveHTML();
}
Call it with $items->item(0) as the parameter.
You could use preg_match_all('/<div class="test">(.*?)<\/div>/si', $html, $matches);. But remember that this will stop at the first closing </div> within the HTML. I.e., if the HTML looks like <div class="test">...aaa...<div>...bbb...</div>...ccc...</div> then you would get ...aaa...<div>...bbb... as the result in $matches...
So in the end using a DOM parser would indeed be a better solution.
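To show the difference on exactly that nested case, here is a small sketch using the DOM. firstTestDivHtml() is a made-up name, and it matches only an exact class="test" attribute, not elements with several classes.

```php
<?php
// Hypothetical helper: returns the inner HTML of the first <div class="test">, or null if none exists.
function firstTestDivHtml(string $html): ?string
{
    $doc = new DOMDocument();
    // Suppress warnings about imperfect markup.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $div = $xpath->query('//div[@class="test"]')->item(0);
    if ($div === null) {
        return null;
    }

    $inner = '';
    foreach ($div->childNodes as $child) {
        // saveHTML($node) serializes just that node, including nested elements.
        $inner .= $doc->saveHTML($child);
    }
    return $inner;
}

$html = '<div class="test">aaa<div>bbb</div>ccc</div>';
echo firstTestDivHtml($html);
```

Unlike the regex, the result keeps the nested div together with the ccc text that follows it, because the parser knows where the outer div really ends.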
