BeautifulSoup and php/html files

I've been using BeautifulSoup to convert relative URLs in some ancient HTML files from an archival site to absolute URLs (mostly so they can be targeted better by .htaccess rules). This part I've got down: search for certain tags and their attributes, then use urllib.parse.urljoin (this is Python 3) to correct them. Fine.
However, there are also some .php files in this collection, from later years of this website. They mostly use 3-5 lines to include other .php files, then the rest is HTML, though there are some exceptions.
Problem: BeautifulSoup parsers try to interpret what's between <?php ?> tags. In fact, there appear to be cases where they throw out only the angle brackets but leave the question marks, behaviour which I hackishly addressed thus:
for c in soup.contents:
    c = str(c)  # previously a BeautifulSoup Tag
    # I don't need soup after this point, hence not reconstructing contents
    c = ('<' if c.startswith('?') else '') + c
    c = c + ('>' if c.endswith('?') else '')
But in any case, I noticed that whole <?php ?> tags were often mangled, in different ways depending on the parser. For example, the html5lib parser takes these lines:
<?
//echo "BEGIN PAGE: " . $_SESSION["i"]."<br>";
include ('util.php');
And interprets the tag as ending at the > that closes <br>.
What I'd prefer to happen is for the php tags to be left alone. (Obviously, in an ideal world a parser would read through them and work on any inner HTML, but that seems like asking for too much!)
Possible solutions
1. Skip .php files and only work with .html -- the work being done is not essential, just an optimization, so no great loss will ensue.
2. Find a BeautifulSoup parser not mentioned in the docs that handles these cases better.
3. Pre-parse the text myself, extract all <?php ?> blocks, and reinsert them after the work with BeautifulSoup is done, taking care to recall where they should fall (potentially very difficult if any of these thousands of files have <?php echo 'foobar' ?> in the middle of HTML lines, for example).
4. Similarly to the above, programmatically protect all <?php ?> tags from the parser, e.g. by inserting HTML comments around them, and then remove the protection after the soup.

To answer my own question... :)
I used solution #4: programmatically protect all <?php ?> tags from the parser by inserting HTML comments around them. The parser then skips interpreting whatever's inside a comment. Later, when using soup.prettify() or soup.contents, the output can just be given a straightforward replace <!--<? with <? and likewise for closing tags.
Note that this doesn't work for PHP tags used to generate dynamic content inside certain HTML tags, e.g.:
<a href= "<? echo foo_bar(); ?>" >
The current versions of html.parser, lxml, and html5lib all interpret this as a series of nonsense attributes of <a>, even when the PHP tags are enclosed in HTML comments. In such cases I manually extracted the tags with regex to solve my issue instead.
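A minimal sketch of the comment-wrapping idea, plus a placeholder variant that may suit the attribute case, using only the standard library. The names PHP_BLOCK, protect_php, unprotect_php, stash_php and restore_php are hypothetical, and the regex assumes every PHP block is closed by ?> and contains neither ?> inside a string nor -->; real files may need something more careful.

```python
import re

# Assumption: every block is closed by "?>" and no PHP string contains "?>".
PHP_BLOCK = re.compile(r"<\?.*?\?>", re.DOTALL)


def protect_php(text):
    """Wrap each PHP block in an HTML comment so the parser leaves it alone."""
    return PHP_BLOCK.sub(lambda m: "<!--" + m.group(0) + "-->", text)


def unprotect_php(text):
    """Undo protect_php on the serialized soup (e.g. str(soup))."""
    return text.replace("<!--<?", "<?").replace("?>-->", "?>")


def stash_php(text):
    """Swap each PHP block for an inert token; usable inside attribute values."""
    blocks = []

    def repl(match):
        blocks.append(match.group(0))
        return "@@PHP%d@@" % (len(blocks) - 1)

    return PHP_BLOCK.sub(repl, text), blocks


def restore_php(text, blocks):
    """Put the stashed PHP blocks back in place of their tokens."""
    return re.sub(r"@@PHP(\d+)@@", lambda m: blocks[int(m.group(1))], text)


source = '<? include("util.php"); ?>\n<a href="<? echo foo_bar(); ?>">x</a>'

protected = protect_php(source)
# ... parse `protected` with BeautifulSoup, fix the URLs, serialize ...
assert unprotect_php(protected) == source

stashed, blocks = stash_php(source)
assert restore_php(stashed, blocks) == source
```

With the placeholder variant, any URL-fixing step would have to skip attribute values that contain a token, since urljoin would otherwise treat the token as a relative path.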

Related

Alternative regex to get contents for a xml tag

I'm processing a XML file and I need to get all content inside <section> tags.
Right now I'm using this regex:
<?php preg_match_all('/<section[^>]*>(.*?)<\/section>/i', $myXmlString, $results);?>
The code inside the <section> tags is pretty complex. It includes math equations and stuff like that.
On my local machine the regex works perfectly.
It is php 5.3.10 over apache 2.2.22 (Ubuntu)
BUT in my staging server it doesn't work.
It is php 5.3.3 over apache 2.2.15 (Red Hat)
I would ask 2 questions:
Is there any issue with preg_match_all for php 5.3.3?
Is there a better way to express the regex?
--EDIT: VARIATIONS OF REGEX USED UNSUCCESSFULLY--
<?php preg_match_all('/<section[^>]*>(.*?)<\/section>/is', $myXmlString, $results);?>
<?php preg_match_all('/<section[^>]*>(.*?)<\/section>/ims', $myXmlString, $results);?>
<?php preg_match_all('#<section[^>]*>(.*?)<\/section>#ims', $myXmlString, $results);?>
<?php preg_match_all('#<section[^>]*>([^\00]*?)<\/section>#ims', $myXmlString, $results);?>
--EDIT: Why haven't I used a parser?
The XML consists of two <sections>. Each section groups n questions for an exam.
Each question can include math equations represented by its own XML. An equation may be something like this:
<inlineequation><m:math baseline="-16.5" display="inline" overflow="scroll"><m:mrow><m:mtable columnalign="left"><m:mtr><m:mtd><m:mrow><m:mo stretchy="true">[</m:mo><m:mrow><m:mtable columnalign="right"><m:mtr><m:mtd><m:mn>4</m:mn></m:mtd><m:mtd columnalign="right"><m:mrow><m:mo>-</m:mo><m:mn>9</m:mn></m:mrow></m:mtd><m:mtd columnalign="right"><m:mrow><m:mn>54</m:mn></m:mrow></m:mtd></m:mtr><m:mtr><m:mtd columnalign="right"><m:mrow><m:mo>−</m:mo><m:mn>28</m:mn></m:mrow></m:mtd><m:mtd columnalign="right"><m:mo>−</m:mo><m:mn>1</m:mn></m:mtd><m:mtd columnalign="right"><m:mo>−</m:mo><m:mn>14</m:mn></m:mtd></m:mtr></m:mtable></m:mrow><m:mo stretchy="true">]</m:mo></m:mrow></m:mtd></m:mtr></m:mtable></m:mrow></m:math></inlineequation>
I need that code to remain XML (no array) because I will pass that code as it is to a jQuery plugin which will render the equation (it will look like LaTeX equations).
If I parse the XML it will be really difficult to create the string for the equation again and locate it in the right place inside the question's statement.

Regex can be resource intensive.
Perhaps consider using xml_parse_into_struct:
<?php
$xmlp = xml_parser_create();
xml_parse_into_struct($xmlp, $myXmlString, $vals, $index);
xml_parser_free($xmlp);
print_r($vals);
?>
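On the concern that parsing would lose the equation markup: DOM-style parsers can usually serialize a subtree back to a string, so the XML does not have to be rebuilt by hand. The sketch below shows the idea in Python with xml.etree.ElementTree rather than PHP. The sample document and the inner_xml helper are made up for illustration, and namespace prefixes such as m: are left out to keep the sample short (they would need to be declared in the document).

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the exam document.
doc = (
    "<exam>"
    "<section><question>What is <inlineequation><mn>4</mn></inlineequation>?"
    "</question></section>"
    "<section><question>Two</question></section>"
    "</exam>"
)


def inner_xml(elem):
    """Return everything between elem's start and end tags as an XML string."""
    # tostring() also emits each child's tail text, so nothing is lost.
    children = "".join(ET.tostring(child, encoding="unicode") for child in elem)
    return (elem.text or "") + children


root = ET.fromstring(doc)
sections = [inner_xml(section) for section in root.iter("section")]
for section in sections:
    print(section)
```

Each entry in sections is still a plain XML string, so it could be handed to a rendering plugin unchanged.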

As others have said, don't use regex to parse XML. Having said that, let's answer your actual question:
Is it at all likely that your XML document contains line breaks? Do you realise that the . character will match everything except line-breaks unless you explicitly turn this feature on?
Try this:
<?php preg_match_all('/<section[^>]*>(.*?)<\/section>/si', $myXmlString, $results);?>
The extra s at the end, tells the regex engine to allow . to match line-breaks.
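The effect of that flag is easy to reproduce. Here is a small sketch in Python, where re.DOTALL plays the role of PCRE's s modifier; the sample string is invented for the demonstration.

```python
import re

# Invented sample: the first section spans several lines, the second does not.
xml = "<section id='1'>\n<q>one</q>\n</section><section id='2'><q>two</q></section>"
pattern = r"<section[^>]*>(.*?)</section>"

# Without DOTALL, "." stops at line breaks, so the multi-line section is missed.
without_s = re.findall(pattern, xml, re.IGNORECASE)

# With DOTALL (the "s" modifier), "." also matches "\n".
with_s = re.findall(pattern, xml, re.IGNORECASE | re.DOTALL)

print(without_s)  # ['<q>two</q>']
print(with_s)     # ['\n<q>one</q>\n', '<q>two</q>']
```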
Honestly though, a lot of people get too hung up on "not parsing XML with regex" without actually thinking about why it's a bad idea. Performance aside, it's essentially because there's no proper way of dealing with nested tags - there's more to it than that, but this is basically what it boils down to. XML documents are not regular so you can't use regular expressions to parse them.
HOWEVER! Sometimes the data that you want to get out of an XML document definitely IS regular. If you throw away the fact that you're dealing with XML for a moment and treat it as just a string of text - you can establish definite patterns that you ABSOLUTELY can use regex to pull out.
In your case, I'd say it's a safe bet that your <section> elements form a flat structure; there wouldn't be <section> tags nested inside other <section> tags, for example. In that case, if we forget the XML component and just think about the patterns you've got
Unmatched text
Pattern that denotes the start of a match
Matched text
Pattern that denotes the end of a match
Unmatched text
etc ...
This is absolutely regular and - save for some insane edge cases I wouldn't bother worrying about - it's pretty damned safe!

PHP block syntax conventions

Sorry if this is a completely newbie-sounding question. I'm new to the PHP syntax conventions, so I'm not entirely sure what I should be looking for.
The book I've got gives the following example as a conventional php block in html code.
<?php
//... some code ...
?>
I get that, but the confusing bit is that the example code I'm looking at from xampp (e.g. the CD collection source code) doesn't seem to follow the same convention.
Instead, the example code reads more like this.
<? include("langsettings.php"); ?>
<?
//... some code ...
?>
Are the two forms just equivalent for all intents and purposes or did I completely miss something crucial to an intro to php here?
Also, why doesn't PHP use closing tags (or does it, and have I just not seen them)? I guess I'm thinking of JavaScript with the closing tags, but I guess either way, they're codebases in and of themselves so it works. It just seems like HTML has symmetry at the core of its syntax, but PHP syntax sort of breaks from that symmetry, which is odd.
Thanks for your time.

The only difference between these is that the second requires the short_open_tag setting to be enabled (it is off by default in newer PHP versions).
<?php regular open tag.
<? Short open tag (disabled by default)
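To check whether a codebase depends on short_open_tag, the files can be scanned for a bare <? opener. A rough sketch in Python follows; the names SHORT_OPEN_TAG and short_tag_lines are hypothetical, and the pattern is a heuristic that would also flag a <? appearing inside a string or comment.

```python
import re

# "<?" not followed by "php", "=" or "xml". "<?=" is skipped because the short
# echo tag has not depended on short_open_tag since PHP 5.4.
SHORT_OPEN_TAG = re.compile(r"<\?(?!php\b|=|xml\b)", re.IGNORECASE)


def short_tag_lines(source):
    """Return the 1-based numbers of lines that contain a short open tag."""
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if SHORT_OPEN_TAG.search(line)
    ]


sample = (
    '<? include("langsettings.php"); ?>\n'
    '<?php echo "ok"; ?>\n'
    "<?= $title ?>\n"
    "<?\n"
    "// ... some code ...\n"
    "?>"
)
print(short_tag_lines(sample))  # [1, 4]
```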
Beyond this, the placement of something like <? include("langsettings.php"); ?> on its own line enclosed in its own pair of <? ?> is really a matter of style specific to the source you found it in. Different projects use very widely different conventions, and PHP books each tend to adopt their own convention.
PHP unfortunately doesn't have any real specific coding conventions such as you might find in languages like Ruby, Java, or Python, which is, in my unsolicited opinion, one of PHP's chief failings as well as one of its greatest flexibilities.
Now, as to whether or not short open tags are good practice for use in a modern PHP application is a separate issue entirely, which has been discussed at great length here.

The two forms are equivalent, but you will find that the shortcode can give you issues. I would stick with the regular tags:
<?php
and the block closed by
?>
Edit: The closing tag is optional, but only if you want everything after the opening tag to be interpreted as PHP until the end of the page.
Anything outside those blocks is interpreted as HTML, so you have to watch where you are opening and closing.
For example:
<body>
<h1> The Heading </h1>
<p>
<?php
echo "This is the Content";
?>
</p>
</body>
Will work, and output the php generated string into your paragraph tag.
PHP is similar to JavaScript in that its statements don't have 'open' and 'close' tags; instead, a semicolon marks the end of each PHP statement.
include "file1.php";
include "file2.php";
If you forget the semicolon, like so
include "file1.php"
include "file2.php";
That will generate an error.

The closing tag for a PHP block is ?>. The closing tag is not required, but it can be used if you want to interpret part of your page as PHP and other parts as literal HTML. People sometimes do this if they want to do some PHP processing at the beginning of the page, then write an ordinary static HTML page with just a few PHP variables echoed into it.
In other words, text that comes after a <?php tag and before a ?> tag is interpreted as PHP. If the closing tag is omitted, then all text between the opening php tag and the end of the page is interpreted as PHP.
One exception to this is that if you open a conditional statement inside a php block, then close the php block, ALL the following text on the page will be subject to that conditional, until you start a new php block and close the conditional statement. For example, if you run the script:
<?php
if(1==0) {
?>
<B>conditional HTML</B>
<?php
}
?>
the HTML between the two PHP blocks will not appear on the page.
Note that different PHP blocks are all part of the same script. Variables, functions, and classes defined in one block can be used by other blocks on that page, and so forth.

PHP starting tag is <?php and closing tag is ?>.
If there are short tags allowed on server you can use <? ?> syntax also.
You can read more about that in the official PHP documentation.
Best regards,
Tom.

Issues regarding short versus long open tags have already been covered.
I'll just mention one common gotcha in the question that hasn't been mentioned yet in these answers.
Compare the following:
<?php
/*
* Some comments here, (c) notice, etc.
*/
header("Content-type: text/html");
...
vs
<?php
/*
* Some comments here, (c) notice, etc.
*/
?>

<?php
header("Content-type: text/html");
...
The second one doesn't work.
Why?
There's a blank line of non-PHP code between the first block of PHP and the second. In a server environment that is not using output buffering, the blank line signals to PHP that the headers are all done, and anything from this point on is part of the HTML (or whatever) being sent to the browser.
Then, we try to send a header. Which produces:
Warning: Cannot modify header information - headers already sent
So ... be careful of your blank lines. A blank line INSIDE your PHP is fine. OUTSIDE your PHP, it may have nasty side-effects.

Regex to detect HTML tags with onclick or onload attributes is too greedy

I have the following regex used to check HTML code:
/<.+(onclick|onload)[^=>]*=[^>]+>/si
This regex is supposed to detect if there are tags with onclick or onload attributes somewhere in the HTML. It does so in most cases, however the ".+" part is a huge performance problem on big texts (and also source of some bugs as it's too greedy). I've tried to fix it and make it smarter but failed so far - "smarter" one misses some examples like this:
<img alt="<script>" src="http://someurl.com/image.jpg"; onload="alert(42)" width="1" height="1"/>
Now, I know I should not parse HTML with regexes and unmentionable horrors happen if I do. However, in this particular case I can not replace it with the proper code (e.g. real HTML parser). Is it still possible to fix this regex or there's no way to do it?

i would strongly recommend that you be researching alternatives to regex matching - the onclick/load js handler code may comprise arbitrary occurrences of > and < as relops or inside js comments. this applies to the code of other js handlers on the same element before or after the onclick/load handlers as well. the whole tag containing the match might be inside a html comment (though you might want to match these occurrences too or strip the html comments before).
however, having hinted to dire straits you appear to be aware of, the standard disclaimers against 'html regex matching' do not fully apply as you only need matches inside tags. try scanning for
on(click|load)[[:space:]]*=[[:space:]]*('[^']*'|"[^"]*")
and add some logic to search the text surrounding any matches for the enclosing tags. if you're brave, try this one:
]+(('[^']*'|">
<(([^'">]+(('[^']*'|"[^"]*")[^'">]+)*)|([^'">]+('[^']*'|"[^"]*"))+)on(click|load)[[:space:]]*=[[:space:]]*('[^']*'|"[^"]*")
it matches alternating sequences of text inside and outside of pairs of quotes between the tag opener < and the onclick/load-attribute. the outermost alternative caters for the special case of no whitespace between a closing quote and the onclick/load-attribute.
hope this helps
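The "scan inside tags, respecting quotes" idea can also be split into smaller steps. The following is only a sketch in Python (which uses \s instead of [[:space:]]): TAG, QUOTED, HANDLER and has_inline_handler are made-up names, and the tag pattern is a heuristic that ignores HTML comments, unquoted attribute values containing quotes, and other edge cases.

```python
import re

# A start tag: "<", a letter, then unquoted characters or whole quoted strings.
TAG = re.compile(r"""<[a-zA-Z](?:[^'">]|'[^']*'|"[^"]*")*>""")
# A quoted attribute value in either quoting style.
QUOTED = re.compile(r"'[^']*'|" r'"[^"]*"')
# An onclick/onload attribute name that is not the tail of a longer name.
HANDLER = re.compile(r"(?<![\w-])on(?:click|load)\s*=", re.IGNORECASE)


def has_inline_handler(html):
    """Return True if any start tag carries an onclick or onload attribute."""
    for tag in TAG.finditer(html):
        # Blank out quoted values so their contents cannot cause a false match.
        names_only = QUOTED.sub('""', tag.group(0))
        if HANDLER.search(names_only):
            return True
    return False


tricky = ('<img alt="<script>" src="http://someurl.com/image.jpg"; '
          'onload="alert(42)" width="1" height="1"/>')
print(has_inline_handler(tricky))                                    # True
print(has_inline_handler('<p title="onclick=1">text onclick=2</p>'))  # False
```

Because each alternative in TAG starts with a different character, the pattern avoids the open-ended .+ that made the original expression slow.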

Text Search - Highlighting the Search phrase

What is the best way to highlight the searched phrase within an HTML document?
I have the complete HTML document as a large string in a variable,
and I want to highlight the search term while ignoring text inside tags.
For example, if the user searches for "img", the img tag should be ignored, but
the phrase "img" within the text should be highlighted.

Don't use regex.
Because regex cannot parse HTML (or even come close), any attempt to mess around with matching words in an HTML string risks breaking words that appear in markup. A badly-implemented HTML regex hack can even leave you with HTML-injection vulnerabilities which an attacker may be able to leverage to do cross-site-scripting.
Instead you should parse the HTML and do the searches on the text content only.
If you can accept a solution that adds the highlighting from JavaScript on the client side, this is really easy because the browser will already have parsed the HTML into a bunch of DOM objects you can manipulate. See eg. this question for a client-side example.
If you have to do it with PHP that's a bit more tricky. The simple solution would be to use DOMDocument::loadHTML and then translate the findText function from the above example into PHP. At least the DOM methods used are standardised so they work the same.
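To illustrate "parse, then search the text content only", here is a sketch in Python using the standard library's html.parser instead of PHP's DOMDocument. The TextHighlighter class and highlight function are hypothetical names; the sketch re-emits tags as parsed, skips script and style content, drops processing instructions, and will not match a term that is split by an entity reference.

```python
import re
from html.parser import HTMLParser


class TextHighlighter(HTMLParser):
    """Re-emit the document, wrapping matches in text nodes with <em>."""

    def __init__(self, term):
        super().__init__(convert_charrefs=False)
        self.pattern = re.compile(re.escape(term), re.IGNORECASE)
        self.parts = []
        self.in_raw_text = False  # True while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self.in_raw_text = tag in ("script", "style")
        self.parts.append(self.get_starttag_text())  # original tag text

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.in_raw_text = False
        self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.in_raw_text:
            data = self.pattern.sub(lambda m: "<em>%s</em>" % m.group(0), data)
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)

    def handle_comment(self, data):
        self.parts.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.parts.append("<!%s>" % decl)


def highlight(html_text, term):
    parser = TextHighlighter(term)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)


page = '<p>An img tag: <img src="img.png" alt="img"> and the word img.</p>'
print(highlight(page, "img"))
```

Only the two occurrences of "img" in the text are wrapped; the tag name and attribute values pass through untouched.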

Edit: This was tagged as Java before, so this answer might not be applicable.
This is quick and dirty but it might work for you, or at least be a starting point
private String highlight(String search, String html) {
    // Pattern.quote keeps regex metacharacters in the search term literal
    String term = java.util.regex.Pattern.quote(search);
    return html.replaceAll("(>[^<]*)(" + term + ")([^>]*<)", "$1<em>$2</em>$3");
}
This requires testing, and I make no guarantees that it's correct, but the simplest way to explain it is that you ensure that your term sits between two tags and thus is not itself a tag or part of a tag attribute.

var highlight = function(what){
    var html = document.body.innerHTML,
        word = "(" + what + ")",
        match = new RegExp(word, "gi");
    html = html.replace(match, "<span style='background-color: red'>$1</span>");
    document.body.innerHTML = html;
};
highlight('ll');
This would highlight any occurrence of 'll'.
Be careful when calling highlight() with < or > or any tag name: it would replace those too, screwing up your markup. You might work around that by reading innerText instead of innerHTML, but that way you'll lose the markup information.
Best way probably is to implement a parser routine yourself.
Example: http://www.jsfiddle.net/DRtVn/

there is a free javascript library that might help you out -> http://scott.yang.id.au/code/se-hilite/

You must be using some server side language to render the search results on the webpage.
So the best way I can think of is to highlight the word while rendering it using the server side language itself,which may be php,java or any other language.
This way you would have only the result strings without html and without parsing overhead.

How to show a comparison of 2 html text blocks

I need to take two text blocks with html tags and render a comparison - merge the two text blocks and then highlight what was added or removed from one version to the next.
I have used the PEAR Text_Diff class to successfully render comparisons of plain text, but when I try to throw text with html tags in it, it gets UGLY. Because of the word and character-based compare algorithms the class uses, html tags get broken and I end up with ugly stuff like <p><span class="new"> </</span>p>. It slaughters the html.
Is there a way to generate a text comparison while preserving the original valid html markup?
Thanks for the help. I've been working on this for weeks :[
This is the best solution I could think of: find/replace each type of html tag with 1 special non-standard character like the apple logo (opt shift k), render the comparison with this kind of primitive markdown, then revert the non-standard characters back into tags. Any feedback?

Simple Diff, by Paul Butler, looks as though it's designed to do exactly what you need: http://github.com/paulgb/simplediff/blob/5bfe1d2a8f967c7901ace50f04ac2d9308ed3169/simplediff.php
Notice in his php code that there's an html wrapper: htmlDiff($old, $new)
(His blog post on it: http://paulbutler.org/archives/a-simple-diff-algorithm-in-php/)

The problem seems to be that your diff program should be treating existing HTML tags as atomic tokens rather than as individual characters.
If your engine has the ability to limit itself to working on word boundaries, see if you can override the function that determines word boundaries so it recognizes and treats HTML tags as a single "word".
You could also do as you are saying and create a lookup dictionary of distinct HTML tags that replaces each with a distinct unused Unicode value (I think there are some user-defined ranges you can use). However, if you do this, any changes to markup will be treated as if they were a change to the previous or following word, because the Unicode character will become part of that word to the tokenizer. Adding a space before and after each of your token Unicode characters would keep the HTML tag changes separate from the plain text changes.
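Treating tags as atomic tokens can be sketched with Python's difflib in place of PEAR Text_Diff. All names here (TOKEN, tokenize, mark, html_diff) are hypothetical. The sketch makes one simplifying choice: only text is wrapped in ins or del, tags from the new version are kept as they are, and tags that exist only in the old version are dropped, so the output follows the new version's markup.

```python
import difflib
import re

# A token is a whole tag, a run of non-space text, or a run of whitespace.
TOKEN = re.compile(r"<[^>]+>|[^<\s]+|\s+")


def tokenize(html):
    return TOKEN.findall(html)


def mark(tokens, name, keep_tags):
    """Wrap runs of text tokens in <name>; pass tags through or drop them."""
    out, run = [], []

    def flush():
        if run:
            out.append("<%s>%s</%s>" % (name, "".join(run), name))
            run.clear()

    for token in tokens:
        if token.startswith("<"):
            flush()
            if keep_tags:
                out.append(token)
        else:
            run.append(token)
    flush()
    return "".join(out)


def html_diff(old, new):
    a, b = tokenize(old), tokenize(new)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    out = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            out.append("".join(a[i1:i2]))
            continue
        if op in ("delete", "replace"):
            out.append(mark(a[i1:i2], "del", keep_tags=False))
        if op in ("insert", "replace"):
            out.append(mark(b[j1:j2], "ins", keep_tags=True))
    return "".join(out)


print(html_diff("<p>one two</p>", "<p>one</p>"))
print(html_diff("<p>Hello world</p>", "<p>Hello <b>brave</b> world</p>"))
```

Because a tag is never split, fragments like <p><span class="new"> </</span>p> cannot occur, although changes that only affect markup are not highlighted by this version.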

I'm surprised nobody has mentioned HTMLDiff, based on MediaWiki's Visual Diff. Give it a try; I was looking for something similar and found it pretty useful.

What about using an html tidier / formatter on each block first? This will create a standard "structure" which your diff might find easier to swallow

A copy of my own answer from here.
What about DaisyDiff (Java and PHP versions available)?
Following features are really nice:
Works with badly formed HTML that can be found "in the wild".
The diffing is more specialized in HTML than XML tree differs. Changing part of a text node will not cause the entire node to be changed.
In addition to the default visual diff, HTML source can be diffed coherently.
Provides easy to understand descriptions of the changes.
The default GUI allows easy browsing of the modifications through keyboard shortcuts and links.

Try running your HTML blocks through this function first:
htmlentities();
That should convert all of your "<"'s and ">"'s into their corresponding codes, perhaps fixing your problem.
//Example:
$html_1 = "<html><head></head><body>Something</body></html>";
$html_2 = "<html><head></head><body><p id='abc'>Something Else</p></body></html>";
//Below code taken from http://www.go4expert.com/forums/showthread.php?t=4189.
//Not sure if/how it works exactly
$diff = new Text_Diff(htmlentities($html_1), htmlentities($html_2));
$renderer = new Text_Diff_Renderer();
echo $renderer->render($diff);
