I am trying to index some content from a series of .html's that share the same format.
So I get a lot of lines like this: <a href="meh">[18] blah blah blah < a...
And the idea is to extract the number (18) and the text next to it (blah...). Furthermore, I know that every qualifying line will start with "> and end with either <a or </p. The issue stems from the need to keep all other htmHTML tags as part of the text (<i>, <u>, etc.).
So then I have something like this:
$docString = file_get_contents("http://whatever.com/some.htm");
$regex="/\">\ [(.*?)\ ] (<\/a>)(.) *?(<)/";
preg_match_all($regex,$docString,$match);
Let's look at $regex for a sec. Ignore it's spaces, I just put them here because else some characters disappear. I specify that it will start with ">. Then I do the numbers inside the [] thing. Then I single out the </a>. So far so good.
At the end, I do a (.)*?(<). This is the turning point. By leaving the last bit, (<) like that, The text will be interrupted when an underline or italics tag is found. However, if I put (<a|</p) the resulting array ends up empty. I've tried changing that to only (<a), but it seems that 2 characters mess up the whole ting.
What can I do? I've been struggling with this all day.
PHP Tidy is your friend. Don't use regexes.
Something like /">\[(.*)\](.*)(?:<(?:a|\/p))/ seems to work fine for given your example and description. Perhaps adding non-capturing subpatterns does it? Please provide a counterexample wherein this doesn't work for you.
Though I agree that RegEx isn't a parser, it sounds like what you're looking for is part of a regularly behaved string - which is exactly what RegEx is strong at.
As you've found, using a regex to parse HTML is not very easy. This is because HTML is not particularly regular.
I suggest using an XML parser such as PHP's DomDocument.
Create an object, then use the loadHTMLFile method to open the file. Extract your a tags with getElementsByTagName, and then extract the content as the NodeValue property.
It might look like
// Create a DomDocument object
$html = new DOMDocument();
// Load the url's contents into the DOM
$html->loadHTMLFile("http://whatever.com/some.htm");
// make an array to hold the text
$anchors = array();
//Loop through the a tags and store them in an array
foreach($html->getElementsByTagName('a') as $link) {
$anchors[] = $link->nodeValue;
}
One alternative to this style of XML/HTML parser is phpquery. The documentation on their page should do a good job of explaining how to extract the tags. If you know jQuery, the interface may seem more natural.
Related
I am programmatically cleaning up some basic grammar in comments and other user submitted content. Capitalizing I, the first letter of sentence, etc. The comments and content are mixed with HTML as users have some options in formatting their text.
This is actually proving to bit a bit more challenging than expected, especially to someone new to PHP and regex.
If there a function like ucfirst that will ignore html to help capitalize sentences?
Also, any links or tutorials on cleaning up text like this in html, would be appreciated. Please leave anything you feel would help in the comments. thanks!
EDIT:
Sample Text:
<div><p>i wuz walkin thru the PaRK and found <strong>ur dog</strong>. <br />i hoPe to get a reward.<br /> plz call or text 7zero4 8two8 49 sevenseven</div>
I need for it to be (ultimately)
<div><p>I was walking through the park and found <strong>your dog<strong>. <p>I hope to get a reward.</p><p> Please call or text (704) 828-4977.</p>
I know this is going a little farther than the intended question, but my thought was to do this incrementally. ucfirst() is just one of many functions I was using to do one small cleanup at a time per scan. Even if I had to run the text 100 times through the filter, this runs on a cron run when the site has no traffic. I wish there was a discussion forum where this could continue as obviously there would be some great ideas on continuing the approach. Any thoughts on how to approach this as an overall project by all means please leave a comment.
I guess in the spirit of the question itself. ucfirst then would not be the best function for this as it could not take an argument list of things to ignore. A flag IGNORE_HTML would be great!
Given this is a PHP question, then the DOM parser recommended below sounds like the best answer? Thoughts?
You can also add a CSS pseudo-element to your desired elements like this:
div:first-letter {
text-transform: uppercase;
}
But you will probably need to change the way, you print out your senteces ( if you are printing them all in one huge tag ), since CSS lacks the ability to detect the start of a new sentence inside a single tag :(
You should probably use a DOM parser (either the built-in one or for example this one, which is really easy to use).
Walk through all of the text nodes in your HTML and perform the clean-up with preg_replace_callback, ucfirst and a regular expression like this one:
'/(\s*)([^.?!]*)/'
This will match a string of whitespace, and then as many non-sentence-ending-punctuation characters as possible. The actual sentence (starting with a letter, unless your sentence starts with ", which complicates things a bit) will then be found in the first capturing group.
But from your question, I suppose you are already doing something like the latter and your code is just choking on HTML tags. Here is some example code to get all text nodes with the second DOM parser I linked:
require 'simple_html_dom.php';
$html = new simple_html_dom();
$html->load($fullHtmlStr);
foreach($html->find('text') as $textNode)
$textNode = cleanupFunction($textNode);
$cleanedHtmlStr = $html->save();
In html it will be very difficult to do, as you will be building some kind of html parser. My suggestion would be to cleanup the text before it is transformed into html, at the moment you pull it out of the database. Or even better, cleanup the database once.
This should do it:
function html_ucfirst($s) {
return preg_replace_callback('#^((<(.+?)>)*)(.*?)$#', function ($c) {
return $c[1].ucfirst(array_pop($c));
}, $s);
}
Converts
<b>foo</b> to <b>Foo</b>,
<div><p>test</p></div> to <div><p>Test</p></div>,
but also bar to Bar.
Edit: According to your detailed question, you probably want to apply this function to each sentence. You will have to parse the text first (e.g. splitting by periods).
I have this block of html:
<div>
<p>First, nested paragraph</p>
</div>
<p>First, non-nested paragraph.</p>
<p>Second paragraph.</p>
<p>Last paragraph.</p>
I'm trying to select the first, non-nested paragraph in that block. I'm using PHP's (perl style) preg_match to find it, but can't seem to figure out how to ignore the p tag contained within the div.
This is what I have so far, but it selects the contents of the first paragraph contained above.
/<p>(.+?)<\/p>/is
Thanks!
EDIT
Unfortunately, I don't have the luxury of a DOM Parser.
I completely appreciate the suggestions to not use RegEx to parse HTML, but that's not really helping my particular use case. I have a very controlled case where an internal application generated structured text. I'm trying to replace some text if it matches a certain pattern. This is a simplified case where I'm trying to ignore text nested within other text and HTML was the simplest case I could think of to explain. My actual case looks something a little more like this (But a lot more data and minified):
#[BILLINGCODE|12345|11|15|2001|15|26|50]#
[ITEM1|{{Escaped Description}}|1|1|4031|NONE|15]
#[{{Additional Details }}]#
[ITEM2|{{Escaped Description}}|3|1|7331|NONE|15]
[ITEM3|{{Escaped Description}}|1|1|9431|NONE|15]
[ITEM4|{{Escaped Description}}|1|1|5131|NONE|15]
I have to reformat a certain column of certain rows to a ton of rows similar to that. Helping my first question would help actual project.
Your regex won't work. Even if you had only non nested paragraph, your capturing parentheses would match First, non-nested ... Last paragraph..
Try:
<([^>]+)>([^<]*<(?!/?\1)[^<]*)*<\1>
and grab \2 if \1 is p.
But an HTML parser would do a better job of that imho.
How about something like this?
<p>([^<>]+)<\/p>(?=(<[^\/]|$))
Does a look-ahead to make sure it is not inside a closing tag; but can be at the end of a string. There is probably a better way to look for what is in the paragraph tags but you need to avoid being too greedy (a .+? will not suffice).
Use a two three step process. First, pray that everything is well formed. Second, First, remove everything that is nested.
s{<div>.*?</div>}{}g; # HTML example
s/#.*?#//g; # 2nd example
Then get your result. Everything that is left is now not nested.
$result = m{<p>(.*?)</p>}; # HTML example
$result = m{\[(.*?)\]}; # 2nd example
(this is Perl. Don't know how different it would look in PHP).
"You shouldn't use regex to parse HTML."
It is what everybody says but nobody really offers an example of how to actually do it, they just preach it. Well, thanks to some motivation from Levi Morrison I decided to read into DomDocument and figure out how to do it.
To everybody that says "Oh, it is too hard to learn the parser, I'll just use regex." Well, I've never done anything with DomDocument or XPath before and this took me 10 minutes. Go read the docs on DomDocument and parse HTML the way you're supposed to.
$myHtml = <<<MARKUP
<html>
<head>
<title>something</title></head>
<body>
<div>
<p>not valid</p>
</div>
<p>is valid</p>
<p>is not valid</p>
<p>is not valid either</p>
<div>
<p>definitely not valid</p>
</div>
</body>
</html>
MARKUP;
$DomDocument = new DOMDocument();
$DomDocument->loadHTML($myHtml);
$DomXPath = new DOMXPath($DomDocument);
$nodeList = $DomXPath->query('body/p');
$yourNode = $DomDocument->saveHtml($nodeList->item(0));
var_dump($yourNode)
// output '<p>is valid</p>'
You might want to have a look at this post about parsing HTML with Regex.
Because HTML is not a regular language (and Regular Expressions are), you can't pares out arbitrary chunks of HTML using Regex. Use an HTML parser, it'll get the job done considerably more smoothly than trying to hack together some regex.
I'm looking for a way to count html tags in a chunk of html using php. This may not be a full web page with a doctype body tags etc.
For example:
If I had something like this
$string = "
<div></div>
<div style='blah'></div>
<p>hello</p>
<p>its debbie mcgee
<p class='pants'>missing p above</p>
<div></div>";
I want to pass it to a function with a tag name such as
CheckHtml( $string, 'p' );
and I would like it to tell me the number of open <p> tags and the number of close p tags </p>. I don't want it to do anything fancy beyond that (no sneaky trying to fix it).
I have tried with string counts with start tags such as <p but it can too easily find things like and return wrong results.
I had a look as DOMDocument but it doesn't seem to count close tags and always expects <html> tags (although I could work around this).
Any suggestions on what to use.
To get a accurate count, you can't use string matching or regex because of the well-known problems of parsing HTML with regex
Nor can you use the output of a standard parser, because that's a DOM consisting of elements and all the information about the tags that were in the HTML has been discarded. End tags will be inferred even for valid HTML, and even some start tags (e.g. html, head, body, tbody) can be inferred. Moreover things like the adoption agency algorithm can result in there being more elements than there were tags in the HTML mark-up. For example <b><i></b>x</i> will result in there being two i elements in the DOM. At the same time, end tags that can't be matched with start tags are simply discarded, as indeed can start and end tags that appear in the wrong place. (e.g. <caption> not in <table> or <legend> not in <fieldset>)
The only way I can think you could do this in any way reliably is this:
There's an open source PHP library for parsing HTML called html5lib.
In there, there's a file called Tokenizer.php and at the end of that file there's a function called emitToken. At this point, the parser has done all the work of figuring out all the HTML weirdnesses¹, and the $token parameter contains all the information about what kind of token has been recognised, including start and end tags.
You could take the library and modify it so that it counts up the start and end tag tokens at that point, and then exposes those totals to your application code at the end of the parse process.
¹: That is, it's figured out the weirdnesses related to your counting problem. It hasn't begun to figure out the tree construction weirdnesses.
You can use substr_count() to return the number of times the needle substring occurs in the haystack $string.
$open_tag_count = substring_count( $string, '<p' );
$close_tag_count = substring_count( $string, '</p>' );
Be aware that '<param and <pre, so you may need to modify your search to handle two different specific cases:
$open_tag_count_without_attributes = substring_count( $string, '<p>' );
$open_tag_count_with_attributes = substring_count( $string, '<p ' );
$open_tag_count = $open_tag_count_without_attributes + $open_tag_count_with_attributes;
You may also wish to consider using [preg_match()][1]. Using a regular expression to parse HTML comes with a fairly substantial set of pitfalls, so use with caution.
substr_count seems like a good bet.
EDIT: You'll have to use preg_match then
I haven't tested, this but, for an idea..
function checkHTML($string,$htmlTag){
$openTags = preg_match('/<'.$htmlTag.'\b[^>]*>',$string);
$closeTags = preg_match('/<\/'.$htmlTag.'>/',$string);
return array($openTags, $closeTags);
}
$numberOfParagraphTags = checkHTML($string,'p');
echo('Open Tags:'.$numberOfParagraphTags[0].' Close Tags:'.$numberOfParagraphTags[1]);
For the chunk of HTML, try using the DomDocument PHP class instead of a string. Then you can use methods such as getElementsByTagName(); that will allow you to count the tags easier and more accurately. To load your string into a DomDocument, you could do something like this:
$doc = new DOMDocument();
$doc->loadHTML($string);
Then, to count your tags, do the following:
$tagList = $doc->getElementsByTagName($tag);
return $tagList.length;
I'm writing an application for my client that uses a WYSIWYG to allow employees to modify a letter template with certain variables that get parsed out to be information for the customer that the letter is written for.
The WYSIWYG generates HTML that I save to a SQL server database. I then use a PHP class to generate a PDF document with the template text.
Here's my issue. The PDF generation class can translate b,u,i HTML tags. That's it. This is mostly okay, except I need blockquote to be translated too. I figure the best solution would be to write a regex statement that is to take the contents of each blockquote HTML block, and replace each line within the block with five spaces. The trick is that some blockquotes might contain nested blockquotes (double indenting, and what not)
But unfortunately I have never been too well versed with regex, and I spent the last 1.5 hours experimenting with different patterns and got nothing working.
Here are the gotchyas:
String may or may not contain a blockquote block
String could contain multiple blockquotes
String could contain potentially any level of nesting of blockquotes blocks
We can rely on the HTML being properly formed
A sample input string would be look something like something like this:
Dear Charlie,<br><br>We are contacting you because blah blah blah blah.<br><br><br>To login, please use this information:<blockquote>Username: someUsername<br>Password: somePassword</blockquote><br><br>Thank you.
To simply the solution, I need to replace each HTML break inside each blockquote with 5 spaces and then the \n line break character.
You might want to check PHP Simple HTML DOM Parser out. You can use it to parse the input to an HTML DOM tree and use that.
~<blockquote>((?:[^<]*+(?:(?!<blockquote>)|(?R))*+)*+)</blockquote>~
You will need to run this regex recursively using preg_replace_callback:
const REGEX_BLOCKQUOTE = '~<blockquote>((?:[^<]*+(?:(?!<blockquote>)|(?R))*+)*+)</blockquote>~';
function blockquoteCallback($matches) {
return doIndent(preg_replace_callback(REGEX_BLOCKQUOTE, __FUNCTION__, $matches[1]));
}
$output = preg_replace_callback(REGEX_BLOCKQUOTE, 'blockQuoteCallback', $input);
My regex assumes, that there won't be any attributes on the blockquote or anywhere else.
(PS: I'll leave the "Use a DOM parser" comment to someone else.)
Regular expressions have a theory behind them, and even though the modern day's regular expresison engine provide can provide a 'Type - 2.5' level language , some things are still not doable. In your partiular case, nesting is not achievable easily.
A simple way way to explain this, is to say that regular expression can't keep a count ..
i.e. they can't count the nesting level...
what is you need is a limited CFG ( the paren-counting types ) ..
you need to somehow keep a count ..may be a stack or tree ...
I have a bunch of legacy documents that are HTML-like. As in, they look like HTML, but have additional made up tags that aren't a part of HTML
<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>
I need to parse these files. PHP is the only only tool available. The documents don't come close to being well formed XML.
My original thought was to use the loadHTML methods on PHPs DOMDocument. However, these methods choke on the make up HTML tags, and will refuse to parse the string/file.
$oDom = new DomDocument();
$oDom->loadHTML("<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>");
//gives us
DOMDocument::loadHTML() [function.loadHTML]: Tag pseud-template invalid in Entity, line: 1 occured in ....
The only solution I've been able to come up with is to pre-process the files with string replacement functions that will remove the invalid tags and replace them with a valid HTML tag (maybe a span with an id of the tag name).
Is there a more elegant solution? A way to let DOMDocument know about additional tags to consider as valid? Is there a different, robust HTML parsing class/object out there for PHP?
(if it's not obvious, I don't consider regular expressions a valid solution here)
Update: The information in the fake tags is part of the goal here, so something like Tidy isn't an option. Also, I'm after something that does the some level, if not all, of well-formedness cleanup for me, which is why I was looking the DomDocument's loadHTML method in the first place.
You can suppress warnings with libxml_use_internal_errors, while loading the document. Eg.:
libxml_use_internal_errors(true);
$doc = new DomDocument();
$doc->loadHTML("<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>");
libxml_use_internal_errors(false);
If, for some reason, you need access to the warnings, use libxml_get_errors
I wonder if passing the "bad" HTML through HTML Tidy might help as a first pass? Might be worth a look, if you can get the document to be well formed, maybe you could load it as a regular XML file with DomDocument.
#Twan
You don't need a DTD for DOMDocument to parse custom XML. Just use DOMDocument->load(), and as long as the XML is well-formed, it can read it.
Once you get the files to be well-formed, that's when you can start looking at XML parsers, before that you're S.O.L. Lok Alejo said, you could look at HTML TIDY, but it looks like that's specific to HTML, and I don't know how it would go with your custom elements.
I don't consider regular expressions a valid solution here
Until you've got well-formedness, that might be your only option. Once you get the documents to that stage, then you're in the clear with the DOM functions.
Take a look at the Parser in the PHP Fit port. The code is clean and was originally designed for loading the dirty HTML saved by Word. It's configured to pull tables out, but can easily be adapated.
You can see the source here:
http://gerd.exit0.net/pat/PHPFIT/PHPFIT-0.1.0/Parser.phps
The unit test will show you how to use it:
http://gerd.exit0.net/pat/PHPFIT/PHPFIT-0.1.0/test/parser.phps
My quick and dirty solution to this problem was to run a loop that matches my list of custom tags with a regular expression. The regexp doesn't catch tags that have another inner custom tag inside them.
When there is a match, a function to process that tag is called and returns the "processed HTML". If that custom tag was inside another custom tag than the parent becomes childless by the fact that actual HTML was inserted in place of the child, and it will be matched by the regexp and processed at the next iteration of the loop.
The loop ends when there are no childless custom tags to be matched. Overall it's iterative (a while loop) and not recursive.
#Alan Storm
Your comment on my other answer got me to thinking:
When you load an HTML file with DOMDocument, it appears to do some level of cleanup re: well well-formedness, BUT requires all your tags to be legit HTML tags. I'm looking for something that does the former, but not the later. (Alan Storm)
Run a regex (sorry!) over the tags, and when it finds one which isn't a valid HTML element, replace it with a valid element that you know doesn't exist in any of the documents (blink comes to mind...), and give it an attribute value with the name of the illegal element, so that you can switch it back afterwards. eg:
$code = str_replace("<pseudo-tag>", "<blink rel=\"pseudo-tag\">", $code);
// and then back again...
$code = preg_replace('<blink rel="(.*?)">', '<\1>', $code);
obviously that code won't work, but you get the general idea?