PHP XPath and Line Numbers

The following XPath query works, but how can I get the line number at which it finds the p element with class 'blah'?
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html>
<head>
<meta http-equiv="Content-type" content="text/html; charset=utf-8">
</head>
<body>
<p class='blah'>some text here</p>
</body>
</html>');
$xpath = new DOMXPath($doc);
$xpath_query = "//*[contains(@class, 'blah')]";
?>

XPath and DOM have no concept of a line number. They only see nodes, and the linkages between them.
PHP's DOM extension does keep some metadata that relates a node back to the line it was on in the source, though: libxml records the line of each node it parses, and DOMNode::getLineNo() (PHP 5.3 and later, documented at http://php.net/dom) returns it.
Alternatively, if the node you're looking at, and/or the surrounding HTML is fairly/totally unique in the document, you could search the raw html for the matching HTML text of the node and get a line number that way.
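
As a minimal sketch of the getLineNo() route, the snippet below runs the XPath query from the question and collects the line number that libxml recorded for each match. The helper name findLineNumbers() is made up for illustration, and the reported values depend on what the underlying libxml parser stored, so treat them as a hint rather than a guarantee.

```php
<?php
// Hypothetical helper: return the recorded source line of every node matching $query.
function findLineNumbers(string $html, string $query): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $lines = [];
    foreach ($xpath->query($query) as $node) {
        // getLineNo() reports the line on which the parser saw the node.
        $lines[] = $node->getLineNo();
    }
    return $lines;
}

$html = "<html>\n<head>\n<title>t</title>\n</head>\n<body>\n"
      . "<p class='blah'>some text here</p>\n</body>\n</html>";
$lines = findLineNumbers($html, "//*[contains(@class, 'blah')]");
print_r($lines);
```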

Related

PHP Regular expression to find scripts in head

I need a regular expression for PHP that can extract the links (src attributes) of all script tags.
I already have this regex, which I wrote to extract script src values, but I'm unable to make it match only in the head section:
/<script [^>]*src=["|\']([^"|\']+(\.js))/i
I'm hoping someone will check and test this before sending a new regex that works.

/html/head/script/@src
Easy peasy. Obviously that's not a regex, it's XPath. Bad things tend to happen when you try to parse HTML with regular expressions. Fortunately, a more capable HTML parser comes with PHP's DOM extension, exposed by the loadHTML() and loadHTMLFile() methods.
This lets you work with all the wonderful DOM methods as well as XPath for querying the document.
Example:
$html = <<<'HTML'
<html>
<head>
<script src="foo.js"></script>
<script src="bar.js"></script>
</head>
<body>
<script src="baz.js"></script>
</body>
</html>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('/html/head/script/@src') as $src) {
echo $src->value, "\n";
}
Output:
foo.js
bar.js

Regular expression to get page title

There are lots of answers to this question, but not a single complete one:
Using one regular expression, how do you extract the page title from <title>Page title</title>?
There are several other ways title tags can be written, such as:
<TITLE>Page title</TITLE>
<title>
Page title</title>
<title>
Page title
</title>
<title lang="en-US">Page title</title>
...or any combination of above.
And it can be on its own line or in between other tags:
<head>
<title>Page title</title>
</head>
<head><title>Page title</title></head>
Thanks for help in advance.
UPDATE: So, the regex approach might not be the best solution to this. Which PHP-based HTML parser could handle all scenarios, whether the HTML is well formed or not?
UPDATE 2: sp00m's regex (https://stackoverflow.com/a/13510307/1844607) seems to be working in all cases. I'll get back to this if needed.

Use an HTML parser instead. But if you must use a regex:
<title[^>]*>(.*?)</title>

Use the DOMDocument class:
$doc = new DOMDocument();
$doc->loadHTML($html);
$titles = $doc->getElementsByTagName("title");
echo $titles->item(0)->nodeValue; // item() is a method, not an array
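
To see how the DOMDocument approach copes with the variants listed in the question, here is a small sketch. The function name extractTitle() and the sample strings are illustrative assumptions; it relies on the HTML parser lowercasing tag names and trims the surrounding whitespace afterwards.

```php
<?php
// Hypothetical helper: return the trimmed text of the first <title>, or null if there is none.
function extractTitle(string $html): ?string
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $title = $doc->getElementsByTagName('title')->item(0);
    return $title === null ? null : trim($title->textContent);
}

// Variants taken from the question.
$variants = [
    '<TITLE>Page title</TITLE>',
    "<title>\nPage title\n</title>",
    '<title lang="en-US">Page title</title>',
    '<head><title>Page title</title></head>',
];
foreach ($variants as $html) {
    var_dump(extractTitle($html));
}
```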

Use this regex:
<title>[\s\S]*?</title>

PHP DOMDocument->getElementByID adding Â in place of empty <span>

I'm using PHP's DOMDocument object to parse some HTML (fetched with cURL). When I get an element by ID and output it, any empty <span> </span> tags get an additional character and become <span>Â </span>.
The Code:
<?php
$document = new DOMDocument();
$document->validateOnParse = true;
$document->loadHTML( curl_exec($handle) );
curl_close($handle);
$element = $document->getElementById( __ELEMENT_ID__ );
echo $document->saveHTML();
echo $document->saveHTML($element);
?>
The $document->saveHTML() command behaves as expected and prints out the entire page. But, as I said above, the echo $document->saveHTML($element) command transforms empty <span> tags into <span>Â </span>.
This happens to all <span> </span> tags within $element.
What in this process (of getting the element by ID and outputting the element) is inserting this extra character? I could work around it, but I'm more interested in getting to the root cause.

I was able to fix the problem by setting the character encoding of the page. The page I was fetching did not have a defined character encoding, and my page was just a snippet without defined header info. When I added
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
The problem disappeared.
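
When the fetched markup cannot be edited at the source, one way to apply the same fix is to prepend the charset declaration before parsing. The sketch below assumes the input really is UTF-8; the helper name loadUtf8Html() and the sample snippet are made up for illustration.

```php
<?php
// Hypothetical helper: parse an HTML snippet as UTF-8 by prepending a charset declaration.
function loadUtf8Html(string $html): DOMDocument
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">' . $html
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $doc;
}

// "\xC2\xA0" is a non-breaking space encoded as UTF-8.
$snippet = "<div id=\"box\"><span>\xC2\xA0</span></div>";
$doc = loadUtf8Html($snippet);
$span = $doc->getElementsByTagName('span')->item(0);

// Print the bytes of the span's text to check that no extra character appeared.
echo bin2hex($span->textContent), "\n";
```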

Regex syntax question - trying to understand

I'm a self-taught PHP programmer and I'm only now starting to grasp regex. I'm well aware of what it can do when it is done right, but this is something I still need to dive into, so maybe someone can help me and save me some hours of experimenting.
I have this string:
here is the <img src="http://www.somewhere.com/1.png" alt="some' /> and there is not a chance...
Now I need to preg_match this string, search for the a href tag that has an image in it, and replace it with the same tag with one small difference: after the title attribute inside the tag, I want to add a rel="here" attribute.
Of course, it should ignore links (a hrefs) that don't have an img tag inside.

First of all: never ever ever use regex for html!
You're much better off using an XML parser: create a DOMDocument, load your HTML, and then use XPath to get the node you want.
Something like this:
$str = 'here is the <img src="http://www.somewhere.com/1.png" alt="some" /> and there is not a chance...';
$doc = new DOMDocument();
$doc->loadHTML($str);
$xpath = new DOMXPath($doc);
$results = $xpath->query('//a[img]'); // links that directly contain an image
foreach ($results as $result) {
// edit result node, e.g. $result->setAttribute('rel', 'here');
}
echo $doc->saveHTML();

Ideally you should use HTML (or XML) parser for this purpose. Here is an example using PHP built-in XML manipulation functions:
<?php
error_reporting(E_ALL);
$doc = new DOMDocument();
$doc->loadHTML('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html><body>
<p>here is the <img src="http://www.somewhere.com/1.png" alt="some" /> and there is not a chance...</p>
</body></html>');
$xpath = new DOMXPath($doc);
$result = $xpath->query('//a[img]');
foreach ($result as $r) {
$r->setAttribute('rel', $r->getAttribute('title')); // i am confused whether you want a hard-coded "here" or the value of the title
}
echo $doc->saveHTML();
Output
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html><body>
<p>here is the <img src="http://www.somewhere.com/1.png" alt="some"> and there is not a chance...</p>
</body></html>
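
The sample string as shown here contains no a element, so the output above is unchanged. As a sketch of what happens when a link is present, the input below is a made-up variant with one link that wraps an image and one plain link, and the hard-coded rel value "here" is taken from the question.

```php
<?php
// Hypothetical input: one link wraps an image, the other does not.
$html = '<p>here is the <a href="http://www.somewhere.com/" title="some">'
      . '<img src="http://www.somewhere.com/1.png" alt="some"></a>'
      . ' and a <a href="http://www.somewhere.com/plain">plain link</a></p>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// a[img] matches only links that have an img element as a direct child.
foreach ($xpath->query('//a[img]') as $link) {
    $link->setAttribute('rel', 'here');
}

echo $doc->saveHTML();
```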

Here are a couple of links that might help you with regex:
RegEx Tutorial
Email Samples of RegEx
I used the site in the last link extensively in my previous job. It is a great collection of regexes that you can also test against your specific case.
The first two links should help you learn more about the topic.

html to text with domdocument class

How can I get the source of an HTML page without the HTML tags?
For example:
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<meta http-equiv="content-language" content="hu"/>
<title>this is the page title</title>
<meta name="description" content="this is the description" />
<meta name="keywords" content="k1, k2, k3, k4" />
start the body content
<!-- <div>this is comment</div> -->
open
End now one noframes tag.
<noframes><span>text</span></noframes>
<select name="select" id="select"><option>ttttt</option></select>
<div class="robots-nocontent"><span>something</span></div>
<img src="url.png" alt="this is alt attribute" />
I need this result:
this is the page title this is the description k1, k2, k3, k4 start the body content this is title attribute open End now one noframes tag. text ttttt something this is alt attribute
I also need the title and alt attributes.
Any ideas?

You could do it with a regex.
$regex = '/<[^>]*>/';
would be a very simple start for removing anything with < and > around it. But in order to do this, you first have to pull the HTML into a string with file_get_contents() or some other function.
Addendum:
If you want individual attributes pulled as well, you're going to have to write a more complex regex to pull that text out. For instance:
$regex2 = '/\<.(?<=(title))(\=\").(?=\")/';
Would pull out (I think... I'm still learning RegEx) any text between < and title=", assuming you had no other matching expressions before title. Again, this would be a pretty complicated regex process.

This cannot be done in a fully automated way, because PHP cannot know which node attributes you want to keep and which to omit. You either have to write code that iterates over all attributes and text nodes, driven by a map that defines when to use a node's content, or you pick what you want with XPath one by one.
An alternative would be to use XMLReader. It allows you to iterate over the entire document and define callbacks for the element names. This way, you can define what to do with what element. See
http://www.ibm.com/developerworks/library/x-pullparsingphp.html
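
The "pick what you want with XPath" route can be sketched as follows. This is only one possible selection, assuming you want non-empty text nodes, alt and title attributes, and the content of the description and keywords meta tags; the function name htmlToText() and the sample markup are illustrative. Comments are skipped because text() does not match them, but script and style contents would be included unless you exclude them.

```php
<?php
// Hypothetical helper: gather visible text plus selected attribute values.
function htmlToText(string $html): string
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Union of: non-blank text nodes, alt/title attributes, selected meta content.
    $query = '//text()[normalize-space()]'
           . ' | //@alt | //@title'
           . ' | //meta[@name="description" or @name="keywords"]/@content';

    $parts = [];
    foreach ($xpath->query($query) as $node) {
        $text = trim($node->nodeValue);
        if ($text !== '') {
            $parts[] = $text;
        }
    }
    return implode(' ', $parts);
}

$html = '<html><head><title>this is the page title</title>'
      . '<meta name="description" content="this is the description">'
      . '</head><body><p>start the body content</p>'
      . '<!-- <div>this is comment</div> -->'
      . '<img src="url.png" alt="this is alt attribute"></body></html>';
$text = htmlToText($html);
echo $text, "\n";
```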

My solution is a bit more complicated, but it worked fine for me.
If you are sure that you have XHTML, you can simply consider the code as XML (but you have to put everything in a proper wrapping).
Then with XSLT you can define some basic templates that do what you need.
