I need a regular expression for use in PHP that extracts the src attribute of every script tag.
I already have this regex, which I wrote to extract script src values, but I can't make it match only inside the head section:
/<script [^>]*src=["|\']([^"|\']+(\.js))/i
I hope someone will check and test this before posting a new regex that works.
/html/head/script/@src
Easy peasy. Obviously that's not a regex, it's XPath. Bad things tend to happen when you try to parse HTML with regular expressions. Fortunately, a more capable HTML parser comes with PHP's DOM extension, exposed through the DOMDocument methods loadHTML() and loadHTMLFile().
This lets you work with all the wonderful DOM methods as well as XPath for querying the document.
Example:
$html = <<<'HTML'
<html>
<head>
<script src="foo.js"></script>
<script src="bar.js"></script>
</head>
<body>
<script src="baz.js"></script>
</body>
</html>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('/html/head/script/@src') as $src) {
echo $src->value, "\n";
}
Output:
foo.js
bar.js
Related
I'm new to stackoverflow and from South Korea.
I'm having difficulties with regex with php.
I want to select all the urls from user submitted html source.
The restrictions I want to apply are the following.
Select URLs EXCEPT
URLs that are inside tags
for example if the html source is like below,
http://aaa.com
Neither of http://aaa.com should be selected.
URLs that come right after " or =
Here is my current regex.
/(?<![\"=])https?\:\/\/[^\"\s<>]+/i
but with this regex, I can't achieve the first rule.
I tried to add negative lookahead at the end of my current regex like
/(?<![\"=])https?\:\/\/[^<>\"\s]+(?!<\/a>)/i
It still chooses the second url in the a tag like below.
http://aaa.co
We don't have a developer Q&A community like Stack Overflow in Korea, so I really hope someone can help with this simple-looking regex issue!
Seriously consider using PHP's DOMDocument class. It does reliable HTML parsing. Doing this with regular expressions is error prone, more work, and slower.
The DOM works just like in the browser and you can use getElementsByTagName to get all links.
I got your use case working with this code using the DOM (try it here: http://3v4l.org/5IFof):
<?php
$html = <<<HTML
http://aaa.com
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $link) {
var_dump($link->getAttribute('href'));
// Output: http://aaa.com
}
Don't use Regex. Use DOM
$html = 'http://aaa.com';
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $a) {
if($a->hasAttribute('href')){
echo $a->getAttribute('href');
}
//$a->nodeValue; // If you want the text in <a> tag
}
Seeing as you're not trying to extract urls that are the href attribute of an a node, you'll want to start by getting the actual text content of the dom. This can be easily done like so:
$dom = new DOMDocument;
$dom->loadHTML($htmlString);
$root = $dom->getElementsByTagName('body')->item(0); // get the outer tag; for a full document, this is body
$text = $root->textContent; // no tags, no attributes, just text
An alternative approach would be this:
$text = strip_tags($htmlString); // gets rid of markup
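To put the two steps together, here is a rough sketch that pulls the text out of the DOM and then runs a plain URL regex over it. The input string is a made-up example (a link next to a bare URL), and the XPath filter that skips text inside a tags is my reading of the first rule in the question, so adjust it to your needs:

```php
<?php
// Hypothetical input: one URL inside a link, one in plain text.
$htmlString = '<p><a href="http://aaa.com">http://aaa.com</a> and http://bbb.com/page</p>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // keep warnings about sloppy markup quiet
$dom->loadHTML($htmlString);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$urls = [];

// Visit text nodes only, skipping anything nested inside an <a> element.
// Attribute values such as href are not text nodes, so they are never seen.
foreach ($xpath->query('//text()[not(ancestor::a)]') as $textNode) {
    if (preg_match_all('~https?://[^\s<>"]+~i', $textNode->nodeValue, $m)) {
        $urls = array_merge($urls, $m[0]);
    }
}

print_r($urls);
```

Because the regex only ever sees text nodes, it no longer needs lookbehinds for " or =.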
I'm putting together a quick script to scrape a page for some results and I'm having trouble figuring out how to ignore white space and new lines in my regex.
For example, here's how the page may present a result in HTML:
<td class="things">
<div class="stuff">
<p>I need to capture this text.</p>
</div>
</td>
How would I change the following regex to ignore the spaces and new lines:
$regex = '/<td class="things"><div class="stuff"><p>(.*)<\/p><\/div><\/td>/i';
Any help would be appreciated. Help that also explains why you did something would be greatly appreciated!
I should caution you that you're playing with fire by using regex on HTML. That said, to answer your question, you can use this regex:
$regex='#^<td class="things">\s*<div class="stuff">\s*<p>(.*)</p>\s*</div>\s*</td>#si';
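Since you asked for an explanation, here is a small annotated variant of that pattern run against the sample from the question. Note that this sketch drops the leading ^ anchor and uses a lazy (.*?) group; both are my own tweaks, not part of the regex above:

```php
<?php
$html = <<<EOF
<td class="things">
    <div class="stuff">
        <p>I need to capture this text.</p>
    </div>
</td>
EOF;

// \s*   matches zero or more whitespace characters, newlines included,
//       so it absorbs the indentation between the tags.
// s     modifier: "." also matches newlines, in case the text spans lines.
// i     modifier: case-insensitive matching.
// (.*?) is lazy, so it stops at the first </p> instead of the last one.
// #...# is used as the delimiter so the slashes in </p> need no escaping.
$regex = '#<td class="things">\s*<div class="stuff">\s*<p>(.*?)</p>\s*</div>\s*</td>#si';

if (preg_match($regex, $html, $matches)) {
    echo $matches[1], "\n";
}
```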
Update: Here is the DOM Parser based code to get what you want:
$html = <<< EOF
<td class="things">
<div class="stuff">
<p>I need to capture this text.</p>
</div>
</td>
EOF;
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your html
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query("//td[@class='things']/div[@class='stuff']/p");
for($i=0; $i < $nodelist->length; $i++) {
$node = $nodelist->item($i);
$val = $node->nodeValue;
echo "$val\n"; // prints: I need to capture this text.
}
And now please refrain from parsing HTML using regex in your code.
SimpleHTMLDomParser will let you grab the content of a selected div or the contents of elements such as <p> <h1> <img> etc.
That might be a quicker way to achieve what you're trying to do.
The solution is to not use regular expressions on HTML. See this great article on the subject: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
Bottom line is that HTML is not a regular language, so regular expressions are not a good fit. You have variations in white space, potentially unclosed tags (who is to say the HTML you are scraping is going to always be correct?), among other challenges.
Instead, use PHP's DOMDocument, impress your friends, AND do it the right way every time:
// create a new DOMDocument
$doc = new DOMDocument();
// load the string into the DOM
$doc->loadHTML('<td class="things"><div class="stuff"><p>I need to capture this text.</p></div></td>');
// since we are working with HTML fragments here, remove <!DOCTYPE
$doc->removeChild($doc->firstChild);
// likewise remove <html><body></body></html>
$doc->replaceChild($doc->firstChild->firstChild->firstChild, $doc->firstChild);
$contents = array();
//Loop through each <p> tag in the dom and grab the contents
// if you need to use selectors or get more complex here, consult the documentation
foreach($doc->getElementsByTagName('p') as $paragraph) {
$contents[] = $paragraph->textContent;
}
print_r($contents);
Documentation
PHP's DomDocument - http://php.net/manual/en/class.domdocument.php
PHP's DomElement - http://www.php.net/manual/en/class.domelement.php
This PHP extension is regarded as "standard", and is usually already installed on most web servers -- no third-party scripts or libraries required. Enjoy!
I'm a self-taught PHP programmer and I'm only now starting to grasp regex. I'm well aware of what it can do when done right, but it's something I still need to dive into, so maybe someone can help me and save me some hours of experimenting.
I have this string:
here is the <img src="http://www.somewhere.com/1.png" alt="some' /> and there is not a chance...
now, I need to preg_match this string and search for the a href tag that has an image in it, and replace it with the same tag with a small difference: after the title attribute inside the tag, I'll want to add a rel="here" attribute.
of course, it should ignore links (a href's) that don't have img tag inside.
First of all: never ever ever use regex for html!
You're much better off using an XML parser: create a DOMDocument, load your HTML, and then use XPath to get the node you want.
Something like this:
$str = 'here is the <img src="http://www.somewhere.com/1.png" alt="some" /> and there is not a chance...';
$doc = new DOMDocument();
$doc->loadHTML($str);
$xpath = new DOMXPath($doc);
$results = $xpath->query('//a/img');
foreach ($results as $result) {
// edit result node
}
echo $doc->saveHTML(); // saveHTML() returns the markup as a string, so print it
Ideally you should use an HTML (or XML) parser for this purpose. Here is an example using PHP's built-in DOM functions:
<?php
error_reporting(E_ALL);
$doc = new DOMDocument();
$doc->loadHTML('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html><body>
<p>here is the <img src="http://www.somewhere.com/1.png" alt="some" /> and there is not a chance...</p>
</body></html>');
$xpath = new DOMXPath($doc);
$result = $xpath->query('//a[img]');
foreach ($result as $r) {
$r->setAttribute('rel', $r->getAttribute('title')); // i am confused whether you want a hard-coded "here" or the value of the title
}
echo $doc->saveHTML();
Output
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html><body>
<p>here is the <img src="http://www.somewhere.com/1.png" alt="some"> and there is not a chance...</p>
</body></html>
Here are a couple of links that might help you with regex:
RegEx Tutorial
Email Samples of RegEx
I used the website in the last link extensively in my previous job. It is a great collection of regexes that you can also test against your specific case.
The first two links will help you learn more about the topic.
The following XPath works, but how can I get the line number at which it finds the p element with class 'blah'?
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html>
<head>
<meta http-equiv="Content-type" content="text/html; charset=utf-8">
</head>
<body>
<p class="blah">some text here</p>
</body>
</html>');
$xpath = new DOMXPath($doc);
$xpath_query = "//*[contains(@class, 'blah')]";
?>
XPath and DOM have no concept of a line number. They only see nodes, and the linkages between them.
The DOM object itself may have some internal metadata which can relate a node back to which line it was on in the source file, but you'd have to rummage around inside the object and the DOM source to find out. Doesn't seem to be anything mentioned at http://php.net/dom.
Alternatively, if the node you're looking at, and/or the surrounding HTML is fairly/totally unique in the document, you could search the raw html for the matching HTML text of the node and get a line number that way.
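One thing that may be worth trying before searching the raw HTML: DOMNode has a getLineNo() method (PHP 5.3 and later) that returns the line number libxml recorded for a node. The sketch below reuses the markup from the question; the $lineNumbers array is just a name I made up to collect the results, and I have not checked how reliable the reported lines are for badly broken markup:

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html>
<head>
<meta http-equiv="Content-type" content="text/html; charset=utf-8">
</head>
<body>
<p class="blah">some text here</p>
</body>
</html>');

$xpath = new DOMXPath($doc);
$lineNumbers = [];

foreach ($xpath->query("//*[contains(@class, 'blah')]") as $node) {
    // getLineNo() reports the source line on which the node was defined.
    $lineNumbers[] = $node->getLineNo();
    echo $node->nodeName, ' is on line ', $node->getLineNo(), "\n";
}
```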
I am trying to detect the text between 3 or 4 kinds of tags, using PHP, and I have no idea how.
I know that I am supposed to use regex, but that's too hard for me :X
If you can explain how to do it or give me an example of what I need, that would be great!
I am trying to detect the code between <script> tags, so <script type="text/javascript"> should be detected as well. If the tag is <script src="...">, the text between should not be detected (there shouldn't be any text between).
The same goes for style: for <style type="text/css">, the text between should be detected too.
I also want to detect the text inside a style="detect text here" attribute.
The last tag I want the text between is <?php ?> (php can also be in upper case, so the regex should not be case sensitive).
Thanks to anyone who helps!
Using regular expressions you could write something like:
<?php
$html = <<<EOF
<script type="text/javascript">
function xyz() { alert('some alert'); }
</script>
EOF;
preg_match('/<script.*>(.*)<\/script>/sU', $html, $matches);
var_dump($matches);
?>
Regular expressions aren't best suited for parsing HTML. For good reasons why, see the question Can you provide some examples of why it is hard to parse XML and HTML with a regex?
You'll have an easier time loading the HTML into the DOM XML classes, then you can perform XPath queries to extract the tags you want.
For example, try something like this to get all the <script> tags which don't have a src attribute...
$doc = new DOMDocument();
$doc->loadHTMLFile("myfile.html");
$xpath=new DOMXPath($doc);
//find script elements which don't have a src attribute
$scriptNodes=$xpath->query("//script[not(@src)]");
foreach ($scriptNodes as $scriptNode) {
//do something here...
}
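The same idea extends to the other things you listed. Here is a minimal sketch that collects inline script bodies, style element bodies, and style attribute values in one pass. The sample page and the $found array are invented for illustration, and this does not cover the <?php ?> blocks from the question:

```php
<?php
// Hypothetical sample page, used only for illustration.
$html = <<<'HTML'
<html>
<head>
<script src="external.js"></script>
<script type="text/javascript">function xyz() { alert('some alert'); }</script>
<style type="text/css">p { color: red; }</style>
</head>
<body>
<p style="font-weight: bold">Hello</p>
</body>
</html>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$found = ['script' => [], 'style' => [], 'style-attr' => []];

// Inline scripts only: script elements that have no src attribute.
foreach ($xpath->query('//script[not(@src)]') as $node) {
    $found['script'][] = trim($node->textContent);
}

// Contents of every <style> element.
foreach ($xpath->query('//style') as $node) {
    $found['style'][] = trim($node->textContent);
}

// Values of style="..." attributes on any element.
foreach ($xpath->query('//@style') as $attr) {
    $found['style-attr'][] = $attr->value;
}

print_r($found);
```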