Regular Expression - get tables from html string in PHP - php

I am trying to wrap all tables inside my content with a special div container, to make them usable on mobile.
I can't wrap the tables before they are saved in the database of the custom CMS. I managed to get at the content before it's printed on the page, and I need to preg_replace all the tables there.
I do this to get all tables:
preg_match_all('/(<table[^>]*>(?:.|\n)*<\/table>)/', $aFile['sContent'], $aMatches);
The problem is to get the inner part (?:.|\n)* to match everything that is inside the tags, without matching the ending tag. Right now the expression matches everything, even the ending tag of the table...
Is there a way to exclude the match for the ending tag?

You need to perform a non-greedy match: /(<table[^>]*>(?:.|\n)*?<\/table>)/. Note the question mark after the *, which makes the quantifier stop at the first closing tag it can.
However, I would use a DOM parser for that:
$doc = new DOMDocument();
$doc->loadHTML($html);
$tables = $doc->getElementsByTagName('table');
foreach ($tables as $table) {
    $content = $doc->saveHTML($table);
}
While a DOM parser is already more convenient for extracting data from HTML documents, it is definitely the better solution if you are attempting to modify the HTML (as you described).
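For the wrapping task itself, a minimal sketch with DOMDocument could look like the following. The function name wrapTables, the class name table-wrapper, and the temporary outer <div> are assumptions made for illustration, and the fragment is assumed to be ASCII or entity-encoded, since loadHTML() does not treat input as UTF-8 by default.

```php
<?php
// Hypothetical helper: wraps every <table> in $html in <div class="$class">.
function wrapTables(string $html, string $class = 'table-wrapper'): string
{
    // Collect parser warnings instead of printing them for imperfect markup.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    // The temporary outer <div> gives the fragment a single root element.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the node list first, because the loop changes the tree.
    $tables = iterator_to_array($doc->getElementsByTagName('table'));
    foreach ($tables as $table) {
        $wrapper = $doc->createElement('div');
        $wrapper->setAttribute('class', $class);
        $table->parentNode->replaceChild($wrapper, $table);
        $wrapper->appendChild($table);
    }

    // Serialize only the children of the temporary outer <div>.
    $root = $doc->getElementsByTagName('body')->item(0)->firstChild;
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo wrapTables('<p>Intro</p><table><tr><td>1</td></tr></table>');
```

Because the tree is modified rather than the string, nested tables and attributes on the table tag need no special handling.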

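You could use a lookahead if you don't want the end tag to be part of the match. The quantifier still has to be non-greedy, otherwise the match runs up to the last </table> in the content:
preg_match_all('/(<table[^>]*>(?:.|\n)*?(?=<\/table>))/', $aFile['sContent'], $aMatches);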
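The difference between the greedy, non-greedy, and lookahead variants only shows up once the content holds more than one table. The sketch below uses a made-up two-table string and swaps (?:.|\n) for the equivalent dot with the s modifier; the variable names are illustrative.

```php
<?php
// Sample content with two tables (invented for this illustration).
$html = '<table><tr><td>a</td></tr></table><p>x</p><table><tr><td>b</td></tr></table>';

// Greedy: .* runs to the LAST </table>, so both tables end up in one match.
preg_match_all('/<table[^>]*>.*<\/table>/s', $html, $greedy);

// Non-greedy: .*? stops at the first </table> after each opening tag.
preg_match_all('/<table[^>]*>.*?<\/table>/s', $html, $lazy);

// Non-greedy with lookahead: same stop point, but </table> is left out of the match.
preg_match_all('/<table[^>]*>.*?(?=<\/table>)/s', $html, $inner);

printf("greedy: %d, non-greedy: %d, lookahead: %d\n",
    count($greedy[0]), count($lazy[0]), count($inner[0]));
```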

Related

PHP Regex, remove string from another one if expression is valid

There are many questions like this one, but I cannot find an exact answer, and I am unfamiliar with regular expressions.
PHP 7: I want to check whether $str contains HTML whose href refers to the URL "website.fr", like '*****'.
I used the pattern <\b[a>]\S*\<\/a> but it is not working.
Any help, please?
This regexp catches an a element with an href attribute that refers to a website.fr URL:
<a[^>]*\shref="([^"]*\.)?website\.fr([.\/][^"]*)?"
Explanation:
<a[^>]*: the beginning of an anchor tag
\shref=": ...followed by the opening of an href attribute
([^"]*\.)?: the URL may begin with anything except a quote, ending in a dot
website\.fr: your website
([.\/][^"]*)?: the URL may end with a dot or a slash followed by anything except a quote
This regexp may not cover all cases (for example, a URL containing a quote). Generally, parsing HTML with regexes is discouraged; it is better to use an XML parser.
In general, parsing HTML with regex is a bad idea (see this question). In PHP, you can use DOMDocument and DOMXPath to search for elements with specific attributes in an HTML document. Something like this, which searches for an <a> element somewhere in the HTML which has an href value containing the string 'website.fr/':
$html = '*****';
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
if (count($xpath->query("//a[contains(@href, 'website.fr/')]")))
    echo "found";
else
    echo "not found";
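A substring test on the href also accepts URLs that merely mention the domain somewhere, for example in a query string. If that matters, one possible refinement is to compare the parsed host instead. This is only a sketch: the helper name hasLinkToHost is hypothetical, and treating subdomains as matches is an assumption about what "refers to website.fr" should mean.

```php
<?php
// Hypothetical helper: true if any <a href> in $html points at $host or one of its subdomains.
function hasLinkToHost(string $html, string $host): bool
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $suffix = '.' . $host;
    foreach ($doc->getElementsByTagName('a') as $a) {
        $linkHost = parse_url($a->getAttribute('href'), PHP_URL_HOST);
        if (!is_string($linkHost)) {
            continue; // relative or malformed URL: no host to compare
        }
        $linkHost = strtolower($linkHost);
        if ($linkHost === $host || substr($linkHost, -strlen($suffix)) === $suffix) {
            return true;
        }
    }
    return false;
}

var_dump(hasLinkToHost('<p><a href="https://www.website.fr/page">x</a></p>', 'website.fr'));
var_dump(hasLinkToHost('<p><a href="https://example.com/?u=website.fr/">x</a></p>', 'website.fr'));
```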

How do I regex for a phrase up to a quote?

I am working on moving some blog-ish articles to a new third-party home, and need to replace some existing URLs with new ones. I cannot use XML, and am being forced to use a wrapper class that requires this search to happen in regex. I'm currently having trouble regex-ing for the URLs that exist in the HTML. For example if the HTML is:
<h1><a href="http://www.website.com/article/slug-that-has-undetermined-amount-of-hyphens/12345">Whatever</a></h1>
I need my regex to return:
http://www.website.com/article/slug-that-has-undetermined-amount-of-hyphens/12345
The beginning part of the URL never changes (the "http://www.website.com/article/" part). However, I have no clue what the slug phrases are going to be, but I do know they will contain an unknown number of hyphens between the words. The ID number at the end of the URL could be any integer.
There are multiple links of these types in each article, and there are also other types of URLs in the article that I want to be sure are ignored, so I can't just look for phrases starting with http inside of quotes.
FWIW: I'm working in php and am currently trying to use preg_match_all to return an array of the URLs needed
Here's my latest attempt:
$array_of_urls = [];
preg_match_all('/http:\/\/www\.website\.com\/article\/[^"]*/', $variable_with_html, $array_of_urls);
var_dump($array_of_urls);
And then I get nada dumped out. Any help appreciated!!!
We, StackOverflow volunteers, must insist on enjoying the stability of a dom parser rather than regex when parsing html data.
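Code:
$html = <<<HTML
<h1><a href="http://www.website.com/article/slug-that-has-undetermined-amount-of-hyphens/12345">Whatever</a></h1>
<p>Here is a url as plain text: http://www.website.com/article/sluggy-slug</p>
<div>Here is a qualifying link: <a href="http://www.website.com/article/slugger-sluggington-jr/666">Whatever</a></div>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$output = [];
foreach ($dom->getElementsByTagName('a') as $item) {
    $output[] = $item->getAttribute('href');
}
var_export($output);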
Output:
array (
0 => 'http://www.website.com/article/slug-that-has-undetermined-amount-of-hyphens/12345',
1 => 'http://www.website.com/article/slugger-sluggington-jr/666',
)
If for some crazy reason, the above doesn't work for your project and you MUST use regex, this should suffice:
~<a.*?href="\K[^"]+~i // using case-insensitive flag in case of all-caps syntax
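Since the question also asks to skip other URLs, the \K idea can be narrowed to the fixed article prefix. The following is a sketch under assumptions: slugs are taken to consist of letters, digits, and hyphens, the ID is taken to be digits only, and the sample HTML is invented for the demonstration.

```php
<?php
// Invented sample: one qualifying link, one plain-text URL, one non-article link.
$html = '<h1><a href="http://www.website.com/article/some-slug-here/12345">Whatever</a></h1>'
      . '<p>Plain text: http://www.website.com/article/sluggy-slug</p>'
      . '<a href="http://www.website.com/about">About</a>';

// \K drops the leading href=" from the reported match;
// the lookahead (?=") requires the closing quote without consuming it.
$pattern = '~href="\Khttp://www\.website\.com/article/[a-z0-9-]+/\d+(?=")~i';

preg_match_all($pattern, $html, $matches);
print_r($matches[0]);
```

Using ~ as the delimiter avoids having to escape every slash in the URL.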

Count start and end html tags

I'm looking for a way to count html tags in a chunk of html using php. This may not be a full web page with a doctype body tags etc.
For example:
If I had something like this
$string = "
<div></div>
<div style='blah'></div>
<p>hello</p>
<p>its debbie mcgee
<p class='pants'>missing p above</p>
<div></div>";
I want to pass it to a function with a tag name such as
CheckHtml( $string, 'p' );
and I would like it to tell me the number of open <p> tags and the number of close p tags </p>. I don't want it to do anything fancy beyond that (no sneaky trying to fix it).
I have tried string counts with start tags such as <p, but that can too easily find things like <pre> and return wrong results.
I had a look at DOMDocument, but it doesn't seem to count close tags and always expects <html> tags (although I could work around this).
Any suggestions on what to use?
To get an accurate count, you can't use string matching or regex, because of the well-known problems of parsing HTML with regex.
Nor can you use the output of a standard parser, because that's a DOM consisting of elements and all the information about the tags that were in the HTML has been discarded. End tags will be inferred even for valid HTML, and even some start tags (e.g. html, head, body, tbody) can be inferred. Moreover things like the adoption agency algorithm can result in there being more elements than there were tags in the HTML mark-up. For example <b><i></b>x</i> will result in there being two i elements in the DOM. At the same time, end tags that can't be matched with start tags are simply discarded, as indeed can start and end tags that appear in the wrong place. (e.g. <caption> not in <table> or <legend> not in <fieldset>)
The only way I can think you could do this in any way reliably is this:
There's an open source PHP library for parsing HTML called html5lib.
In there, there's a file called Tokenizer.php and at the end of that file there's a function called emitToken. At this point, the parser has done all the work of figuring out all the HTML weirdnesses¹, and the $token parameter contains all the information about what kind of token has been recognised, including start and end tags.
You could take the library and modify it so that it counts up the start and end tag tokens at that point, and then exposes those totals to your application code at the end of the parse process.
¹: That is, it's figured out the weirdnesses related to your counting problem. It hasn't begun to figure out the tree construction weirdnesses.
You can use substr_count() to return the number of times the needle substring occurs in the haystack $string.
$open_tag_count = substr_count( $string, '<p' );
$close_tag_count = substr_count( $string, '</p>' );
Be aware that '<p' also matches <param and <pre, so you may need to modify your search to handle two different specific cases:
$open_tag_count_without_attributes = substr_count( $string, '<p>' );
$open_tag_count_with_attributes = substr_count( $string, '<p ' );
$open_tag_count = $open_tag_count_without_attributes + $open_tag_count_with_attributes;
You may also wish to consider using preg_match(). Using a regular expression to parse HTML comes with a fairly substantial set of pitfalls, so use with caution.
substr_count seems like a good bet.
EDIT: You'll have to use preg_match_all then
I haven't tested this, but to give you an idea:
function checkHTML($string, $htmlTag) {
    // preg_match_all() returns the number of matches; preg_match() stops after the first one.
    $tag = preg_quote($htmlTag, '/');
    $openTags = preg_match_all('/<' . $tag . '\b[^>]*>/i', $string);
    $closeTags = preg_match_all('/<\/' . $tag . '\s*>/i', $string);
    return array($openTags, $closeTags);
}
$numberOfParagraphTags = checkHTML($string, 'p');
echo('Open Tags:'.$numberOfParagraphTags[0].' Close Tags:'.$numberOfParagraphTags[1]);
For the chunk of HTML, try using the DomDocument PHP class instead of a string. Then you can use methods such as getElementsByTagName(); that will allow you to count the tags easier and more accurately. To load your string into a DomDocument, you could do something like this:
$doc = new DOMDocument();
$doc->loadHTML($string);
Then, to count your tags, do the following:
$tagList = $doc->getElementsByTagName($tag);
return $tagList->length;
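Note that the DOM approach counts elements, not tags, which is the distinction the first answer draws. A small sketch makes the gap visible, using a shortened version of the sample string from the question; the variable names are illustrative, and the regex counts carry the usual caveats about regex on HTML (comments, attribute values containing >, and so on).

```php
<?php
// Shortened sample from the question: the second <p> is never closed.
$string = "<div></div><p>hello</p><p>its debbie mcgee<p class='pants'>missing p above</p><div></div>";

// Raw tag counts taken from the markup itself.
$openTags  = preg_match_all('/<p\b[^>]*>/i', $string);
$closeTags = preg_match_all('/<\/p\s*>/i', $string);

// Element count after the parser has repaired the markup.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($string);
libxml_clear_errors();
$elements = $doc->getElementsByTagName('p')->length;

printf("open tags: %d, close tags: %d, DOM elements: %d\n", $openTags, $closeTags, $elements);
```

The parser closes the dangling paragraph on its own, so the element count alone cannot reveal the missing end tag.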

Need help with regular expressions in PHP

I am trying to index some content from a series of .html's that share the same format.
So I get a lot of lines like this: <a href="meh">[18]</a> blah blah blah <a...
And the idea is to extract the number (18) and the text next to it (blah...). Furthermore, I know that every qualifying line will start with "> and end with either <a or </p. The issue stems from the need to keep all other HTML tags as part of the text (<i>, <u>, etc.).
So then I have something like this:
$docString = file_get_contents("http://whatever.com/some.htm");
$regex="/\">\ [(.*?)\ ] (<\/a>)(.) *?(<)/";
preg_match_all($regex,$docString,$match);
Let's look at $regex for a sec. Ignore its spaces; I only put them there because otherwise some characters disappear. I specify that it will start with ">. Then I do the numbers inside the [] thing. Then I single out the </a>. So far so good.
At the end, I do a (.)*?(<). This is the turning point. By leaving the last bit, (<), like that, the text will be interrupted when an underline or italics tag is found. However, if I put (<a|</p) the resulting array ends up empty. I've tried changing that to only (<a), but it seems that 2 characters mess up the whole thing.
What can I do? I've been struggling with this all day.
PHP Tidy is your friend. Don't use regexes.
Something like /">\[(.*)\](.*)(?:<(?:a|\/p))/ seems to work fine for your given example and description. Perhaps adding non-capturing subpatterns does it? Please provide a counterexample wherein this doesn't work for you.
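One likely reason the (<a|</p) attempt returned nothing is the unescaped slash: inside a pattern delimited by /, the slash in </p ends the pattern early, and PHP then rejects the rest as an unknown modifier. The pattern above avoids this by writing \/p. An alternative, sketched below on an invented sample line, is to pick a different delimiter such as ~ so that slashes need no escaping.

```php
<?php
// Invented sample line in the shape described in the question.
$line = '<p><a href="meh">[18]</a> blah <i>blah</i> blah <a href="next">[19]</a> more text</p>';

// With / as the delimiter, the bare slash in </p terminates the pattern:
// PHP warns about an unknown modifier and the call returns false.
$broken = @preg_match_all('/(<a|</p)/', $line, $unused);

// With ~ as the delimiter the same alternation works unescaped.
// The lazy (.*?) keeps <i>, <u>, etc. and stops before the next <a or </p.
$regex = '~">\[(\d+)\]</a>(.*?)(?=<a|</p)~s';
preg_match_all($regex, $line, $match, PREG_SET_ORDER);

foreach ($match as $m) {
    echo $m[1], ' => ', trim($m[2]), "\n";
}
```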
Though I agree that RegEx isn't a parser, it sounds like what you're looking for is part of a regularly behaved string - which is exactly what RegEx is strong at.
As you've found, using a regex to parse HTML is not very easy. This is because HTML is not particularly regular.
I suggest using an XML parser such as PHP's DomDocument.
Create an object, then use the loadHTMLFile method to open the file. Extract your a tags with getElementsByTagName, and then read each tag's content from its nodeValue property.
It might look like
// Create a DomDocument object
$html = new DOMDocument();
// Load the url's contents into the DOM
$html->loadHTMLFile("http://whatever.com/some.htm");
// make an array to hold the text
$anchors = array();
//Loop through the a tags and store them in an array
foreach ($html->getElementsByTagName('a') as $link) {
    $anchors[] = $link->nodeValue;
}
One alternative to this style of XML/HTML parser is phpquery. The documentation on their page should do a good job of explaining how to extract the tags. If you know jQuery, the interface may seem more natural.

/regexp?/ on HTML, but not in form [duplicate]

Possible Duplicate:
RegEx match open tags except XHTML self-contained tags
I need to do some regex replacement on HTML input, but I need to exclude some parts from filtering by other regexp.
(e.g. remove all <a> tags with specific href="example.com…, except the ones that are inside the <form> tag)
Is there any smart regex technique for this? Or do I have to find all forms using $regex1, then split the input to the smaller chunks, excluding the matched text blocks, and then run the $regex2 on all the chunks?
The NON-regexp way:
<?php
$html = '<html><body>a <b>bold</b> foz b c <form>l</form> a</body></html>';
$d = new DOMDocument();
$d->loadHTML($html);
$x = new DOMXPath($d);
$elements = $x->query('//a[not(ancestor::form) and @href="foo"]');
foreach ($elements as $elm) {
    // run if contents of <a> should be visible:
    while ($elm->firstChild) {
        $elm->parentNode->insertBefore($elm->firstChild, $elm);
    }
    // remove a
    $elm->parentNode->removeChild($elm);
}
var_dump($d->saveXML());
?>
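To see the XPath filter at work, here is the same technique on a small invented input that contains matching links both outside and inside a form. The sample markup and variable names are assumptions made for the demonstration.

```php
<?php
// Invented sample: href="foo" links outside and inside a <form>, plus an unrelated link.
$html = '<div><a href="foo">outside</a> text '
      . '<form><a href="foo">inside</a></form> '
      . '<a href="bar">other</a></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Select only the matching anchors that have no <form> ancestor.
$targets = $xpath->query('//a[not(ancestor::form) and @href="foo"]');
foreach ($targets as $a) {
    // Keep the link text by moving the children in front of the anchor...
    while ($a->firstChild) {
        $a->parentNode->insertBefore($a->firstChild, $a);
    }
    // ...then drop the now-empty anchor.
    $a->parentNode->removeChild($a);
}

$remaining = $doc->getElementsByTagName('a')->length;
echo "anchors left: $remaining\n";
```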
Why can't you just dump the html string you need into a DOM helper, then use getElementsByTagName('a') to grab all anchors and use getAttribute to get the href, removeChild to remove it?
This looks like PHP, right? http://htmlpurifier.org/
Any particular reason you would want to do that with regular expressions? It sounds like it would be fairly straightforward in JavaScript to spin through the DOM and do it that way.
In jQuery, for instance, it seems like you could do this in just a couple lines using its DOM selectors.
If forms can be nested, it is technically impossible.
If forms can not be nested, it is practically impossible. There is no function where you can use the same regex to
define an area where the matching should be done (i.e. outside form)
define things to be matched (i.e. elements)
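The two-step approach the question describes (find the forms, then filter only the chunks between them) can be sketched with preg_split(). This assumes forms are never nested, which the answer above identifies as the case where regex cannot work at all; the sample markup and the href value foo are invented.

```php
<?php
// Invented sample: the same link outside and inside a form.
$html = '<p><a href="foo">outside</a></p>'
      . '<form><a href="foo">inside</a></form>'
      . '<p><a href="bar">other</a></p>';

// Step 1: split on whole <form>...</form> blocks and keep them in the result.
// With one capture group, even indexes are text outside forms, odd indexes are forms.
$parts = preg_split('~(<form\b.*?</form>)~is', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

// Step 2: run the link-removing regex on the outside chunks only.
foreach ($parts as $i => $part) {
    if ($i % 2 === 0) {
        $parts[$i] = preg_replace('~<a\b[^>]*\bhref="foo"[^>]*>(.*?)</a>~is', '$1', $part);
    }
}

$result = implode('', $parts);
echo $result, "\n";
```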
