I am trying to separate and get a number out of a string that contains 2 similar HTML statements:
1 - <td class="center"><p class="texte">1914</p></td>
2 - <td class="center"><p class="texte">135.000</p></td>
So, I am looking for the number 135.000 and not the number 1914.
IMPORTANT: This is not US number notation. Here 135.000 means one hundred and thirty-five thousand.
I have tried things like ([1-9][0-9]{1,2}), but that will capture 191 out of statement 1 above, which is not intended.
Thanks
You are dealing with HTML, so parse it with an HTML parser first (XPath is your friend), then use the preg_match function to keep only the numbers in your desired format. Example:
$dom = new DOMDocument;
$dom->loadHTML($yourHtmlString);
$xp = new DOMXPath($dom);
// you need to register the function `preg_match` to use it in your xpath query
$xp->registerNamespace("php", "http://php.net/xpath");
$xp->registerPhpFunctions('preg_match');
// The xpath query
$targetNodeList = $xp->query('//td[@class="center"]/p[@class="texte"][php:functionString("preg_match", "~^[1-9][0-9]{0,2}(?:\.[0-9]{3})*$~", .) > 0]');
# - '//td[@class="center"]/p[@class="texte"]' describes the path in the DOM tree
# - the final [php:functionString(...) > 0] predicate checks the content format
foreach ($targetNodeList as $node) {
echo $node->nodeValue, PHP_EOL;
}
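If registering PHP functions inside XPath feels heavy, the same filtering can be done in PHP after a plain XPath query. This is only a minimal sketch: the table wrapper around the two cells and the variable names are assumptions, and the pattern here deliberately requires at least one ".ddd" thousands group, which is stricter than the pattern above.

```php
<?php
// Sample markup built from the two cells in the question; the wrapper is assumed.
$html = '<table><tr>'
      . '<td class="center"><p class="texte">1914</p></td>'
      . '<td class="center"><p class="texte">135.000</p></td>'
      . '</tr></table>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($dom);
$numbers = [];
foreach ($xp->query('//td[@class="center"]/p[@class="texte"]') as $p) {
    $text = trim($p->nodeValue);
    // Require at least one ".ddd" group so that a plain "1914" is rejected.
    if (preg_match('~^[1-9][0-9]{0,2}(?:\.[0-9]{3})+$~', $text)) {
        // Drop the thousands separators to get an integer.
        $numbers[] = (int) str_replace('.', '', $text);
    }
}
var_export($numbers);
```

Stripping the dots before the cast matters because PHP would otherwise read "135.000" as the float 135.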
Give this a shot :)
\s*[\d.]+(?=<)
Here is the link:
Regex Example
Related
I have the following regex to find anchor tag that has 'Kontakt' as the anchor text:
#<a.*href="[^"]*".*>Kontakt<\/a>#
Here is the string to find from:
<li class="item-133">Wissenswertes</li><li class="item-115"><a href="/team" >Team</a></li><li class="item-116 menu-parent"></span></li><li class="item-350"><a href="/kontakt" >Kontakt</a></li></ul>
So the result should be:
<a href="/kontakt" >Kontakt</a>
But the result I get is:
Wissenswertes</li><li class="item-115"><a href="/team" >Team</li><li class="item-116 menu-parent"></span></li><li class="item-350"><a href="/kontakt" >Kontakt
And here is my PHP code:
$pattern = '#<a.*href="[^"]*".*>Kontakt<\/a>#';
preg_match_all($pattern, $string, $matches);
You are using preg_match_all() so I assume you are willing to receive multiple qualifying anchor tags. Parsing valid HTML with a legitimate DOM parser will always be more stable and easier to read than the equivalent regex technique. It's just not a good idea to rely on regex for DOM parsing because regex is "DOM-unaware" -- it just matches things that look like HTML entities.
In the XPath query, search for <a> tags (existing at any depth in the document) which have the qualifying string as the whole text.
Code: (Demo)
$html = <<<HTML
<li class="item-133">Wissenswertes</li><li class="item-115"><a href="/team" >Team</a></li><li class="item-116 menu-parent"></span></li><li class="item-350"><a href="/kontakt" >Kontakt</a></li></ul>
HTML;
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$result = [];
foreach ($xpath->query('//a[text() = "Kontakt"]') as $a) {
$result[] = $dom->saveHtml($a);
}
var_export($result);
Output:
array (
0 => '<a href="/kontakt">Kontakt</a>',
)
Is it more concise to use regex? Yes, but it is also less reliable for general use.
You will notice that the DOMDocument also automatically cleans up the unnecessary spacing in your markup.
If you can trust your input will always have <a href in every anchor tag then try:
'#<a href="[^"]*"[^>]*>Kontakt<\/a>#';
// Instead of what you have:
'#<a.*href="[^"]*".*>Kontakt<\/a>#';
.* is the "wildcard" meta-character . and the "zero or more times" quantifier * together.
.* matches anything any number of times.
Try it https://regex101.com/r/qxnRZv/1
Your regex:
...a.*href...
is greedy, which means: "after a, match as many characters as possible before a href". That causes your regex to return multiple hrefs.
You can use the lazy-mode operator ? :
...a.*?href....
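which means "after a, match as few characters as possible before a href". Note that a lazy quantifier alone does not stop the match from starting at the first <a in the string, so restricting the wildcards to [^>]* is the more robust fix.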
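To see how the variants behave, here is a small comparison. It is a sketch on a shortened, hypothetical version of the menu markup (two list items with properly closed anchors), not the exact string from the question.

```php
<?php
// Shortened sample markup (assumed): two well-formed anchors.
$string = '<li><a href="/team" >Team</a></li><li><a href="/kontakt" >Kontakt</a></li>';

$patterns = [
    'greedy'  => '#<a.*href="[^"]*".*>Kontakt</a>#',         // pattern from the question
    'lazy'    => '#<a.*?href="[^"]*".*?>Kontakt</a>#',       // same with lazy quantifiers
    'negated' => '#<a\s[^>]*href="[^"]*"[^>]*>Kontakt</a>#', // wildcards cannot leave the tag
];

$results = [];
foreach ($patterns as $name => $pattern) {
    preg_match($pattern, $string, $m);
    $results[$name] = $m[0] ?? null;
}
var_export($results);
```

The greedy and lazy patterns both begin at the first <a and run through to Kontakt</a>, because the regex engine always prefers the leftmost possible start. Only the negated-class version is confined to a single tag.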
I have the following string stored in a database table that contains HTML I need to strip out before rendering on a web page (This is old content I had no control over).
<p>I am <30 years old and weight <12st</p>
When I have used strip_tags it is only showing I am.
I understand why strip_tags is doing that, so I need to replace the 2 instances of < with &lt;
I have found a regex that converts the first instance but not the 2nd, but I can't work out how to amend this to replace all instances.
/<([^>]*)(<|$)/
which results in I am currently <30 years old and less than
I have a demo here https://eval.in/1117956
It's a bad idea to try to parse HTML content with string functions, including regex functions (there are many topics on SO that explain why; search for them). HTML is too complicated for that.
The problem is that you have poorly formatted HTML over which you have no control.
There are two possible attitudes:
There's nothing to do: the data is corrupted, so the information is lost once and for all, and you can't retrieve something that has disappeared. This is a perfectly acceptable point of view.
Maybe you can find another source for the same data somewhere, or you can choose to print the poorly formatted HTML as it is.
You can try to repair it. In this case you have to ensure that all the document's problems are limited and can be solved (at least by hand).
In place of a direct string approach, you can use the PHP libxml implementation via DOMDocument. Even if the libxml parser will not give better results than strip_tags, it provides errors you can use to identify the kind of error and to find the problematic positions in the html string.
With your string, the libxml parser returns a recoverable error XML_ERR_NAME_REQUIRED with the code 68 on each problematic opening angle bracket. Errors can be seen using libxml_get_errors().
Example with your string:
$s = '<p>I am <30 years old and weight <12st</p>';
$libxmlErrorState = libxml_use_internal_errors(true);
function getLastErrorPos($code) {
$errors = array_filter(libxml_get_errors(), function ($e) use ($code) {
return $e->code === $code;
});
if ( !$errors )
return false;
$lastError = array_pop($errors);
return ['line' => $lastError->line - 1, 'column' => $lastError->column - 2 ];
}
define('XML_ERR_NAME_REQUIRED', 68); // xmlParseEntityRef: no name
$patternTemplate = '~(?:.*\R){%d}.{%d}\K<~A';
$dom = new DOMDocument;
$dom->loadHTML($s, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
while ( false !== $position = getLastErrorPos(XML_ERR_NAME_REQUIRED) ) {
libxml_clear_errors();
$pattern = vsprintf($patternTemplate, $position);
$s = preg_replace($pattern, '&lt;', $s, 1);
$dom = new DOMDocument;
$dom->loadHTML($s, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
}
echo $dom->saveHTML();
libxml_clear_errors();
libxml_use_internal_errors($libxmlErrorState);
demo
$patternTemplate is a formatted string (see sprintf in the php manual) in which the placeholders %d stand for respectively the number of lines before and the position from the start of the line. (0 and 8 here)
Pattern details: The goal of the pattern is to reach the angle bracket position from the start of the string.
~ # my favorite pattern delimiter
(?:
.* # all character until the end of the line
\R # the newline sequence
){0} # reach the desired line
.{8} # reach the desired column
\K # remove all on the left from the match result
< # the match result is only this character
~A # anchor the pattern at the start of the string
Another related question in which I used a similar technique: parse invalid XML manually
try this
$string = '<p>I am <30 years old and weight <12st</p>';
$html = preg_replace('/^\s*<[^>]+>\s*|\s*<\/[^>]+>\s*\z/', '', $string);// remove html tags
$final = preg_replace('/[^A-Za-z0-9 !@#$%^&*().]/u', '', $html); //remove special character
Live DEMO
A simple use of str_replace() would do it.
Replace the <p> and </p> with [p] and [/p]
replace the < with &lt;
put the p tags back i.e. Replace the [p] and [/p] with <p> and </p>
Code
<?php
$description = "<p>I am <30 years old and weight <12st</p>";
$d = str_replace(['[p]','[/p]'],['<p>','</p>'],
str_replace('<', '&lt;',
str_replace(['<p>','</p>'], ['[p]','[/p]'],
$description)));
echo $d;
RESULT
<p>I am &lt;30 years old and weight &lt;12st</p>
My guess is that here we might want to design a good right boundary to capture < in non-tags, maybe a simple expression similar to:
<(\s*[+-]?[0-9])
might work, since we should normally have numbers or signs right after <. [+-]?[0-9] would likely change, if we would have other instances after <.
Demo
Test
$re = '/<(\s*[+-]?[0-9])/m';
$str = '<p>I am <30 years old and weight <12st I am < 30 years old and weight < 12st I am <30 years old and weight < -12st I am < +30 years old and weight < 12st</p>';
$subst = '&lt;$1';
$result = preg_replace($re, $subst, $str);
echo $result;
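Once the stray angle brackets are escaped, strip_tags can be applied safely. The sketch below uses a slightly different heuristic from the one above, which is an assumption rather than a rule: any < that is not followed by a letter, "/" or "!" cannot start a tag, so it gets escaped. Text such as "a <b" would still be mistaken for a tag.

```php
<?php
$str = '<p>I am <30 years old and weight <12st</p>';

// Escape every "<" that cannot be the start of a tag, a closing tag or a comment.
$escaped = preg_replace('~<(?![a-zA-Z/!])~', '&lt;', $str);

// Real tags can now be removed without swallowing the text after a stray "<".
$plain = strip_tags($escaped);

echo $plain;
```

The entities stay encoded in $plain, which is what you want when the text is echoed into a web page.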
I've been using this regular expression (probably found on Stack Overflow a few years back) to convert mailto tags in PHP:
preg_match_all("/<a([ ]+)href=([\"']*)mailto:(([[:alnum:]._\-]+)@([[:alnum:]._\-]+\.[[:alnum:]._\-]+))([\"']*)([[:space:][:alnum:]=\"_]*)>([^<|@]*)(@?)([^<]*)<\/a>/i",$content,$matches);
I pass it $content = '<a href="mailto:name@domain.com">somename@domain.com</a>'
It returns these matched pieces:
0 <a href="mailto:name@domain.com">somename@domain.com</a>
1
2 "
3 name@domain.com
4 name
5 domain.com
6 "
7
8 somename
9 @
10 domain.com
Example usage: ucwords($matches[8][0])
My problem is, some links contain nested tags. Since the preg expression is looking for "<" to get pieces 8,9,10 and nested tags are throwing it off...
Example:
<a href="mailto:name@domain.com"><span><b>somename@domain.com</b></span></a>
I need to ignore the nested tags and just extract the "some name" piece:
match part 8 = <span><b>
match part 9 = somename
match part 10 = @
match part 11 = domain.com
match part 12 = </b></span>
I've tried to get it to work by tweaking ([^<|@]*)(@?)([^<]*) but I can't figure out the right syntax to match or ignore the nested tags.
You could just replace the whole match between the <a> tags with a .*?. Replace ([^<|@]*)(@?)([^<]*) with (.*?) and it will include everything within the <a> tag, including nested tags. You can remove the nested tags after that with strip_tags or another regex.
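A minimal sketch of that idea follows. The sample link is an assumption, and the pattern is a simplified stand-in for the one in the question rather than a drop-in replacement: it captures everything inside the anchor and leaves the cleanup to strip_tags.

```php
<?php
// Assumed sample: a mailto link whose label is wrapped in nested tags.
$content = '<a href="mailto:name@domain.com"><span><b>somename@domain.com</b></span></a>';

// Groups: 1 = quote, 2 = local part, 3 = domain, 4 = everything inside the anchor.
$pattern = '~<a\s+href=(["\'])mailto:([^"\'@]+)@([^"\']+)\1[^>]*>(.*?)</a>~is';

preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);

$found = [];
foreach ($matches as $m) {
    $label = strip_tags($m[4]);          // drop <span>, <b>, and other nested tags
    $found[] = [
        'address'    => $m[2] . '@' . $m[3],
        // Part of the label before "@", or the whole label if there is no "@".
        'label_name' => strstr($label, '@', true) ?: $label,
    ];
}
var_export($found);
```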
However, regular expressions are not very good at html nested tags. You are better off using something like DOMDocument, which is made exactly for parsing html. Something like:
<?php
$DOM = new DOMDocument();
$DOM->loadXML('<a href="mailto:name@domain.com"><span><b>somename@domain.com</b></span></a>');
$list = $DOM->getElementsByTagName('a');
foreach($list as $link){
$href = $link->getAttribute('href');
$text = $link->nodeValue;
//only match if href starts with mailto:
if(stripos($href, 'mailto:') === 0){
var_dump($href);
var_dump($text);
}
}
http://codepad.viper-7.com/SqDKgr
To only get access to the part within the link, try
[^>]*>([^>]+)@.*
What you need should be in the first group of the result.
You can try this pattern:
$pattern = '~\bhref\s*+=\s*+(["\'])mailto:\K(?<mail>(?<name>[^@]++)@(?<domain>.*?))\1[^>]*+>(?:\s*+</?(?!a\b)[^>]*+>\s*+)*+(?<content>[^<]++)~i';
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);
echo '<pre>' . print_r($matches, true) . '</pre>';
and you can access your data like that:
echo $matches[0]['name'];
Try this regex
/^(<.*>)(.*)(@)/
/^/ - Start of string
/(<.*>)/ - First match group: starts with <, then anything in between until it hits >
/(.*)(@)/ - Match anything up to the @ sign
I have the html document in a php $content. I can echo it, but I just need all the <a...> tags with class="pret" and after I get them I would need the non words (like a code i.e. d3852) from href attribute of <a> and the number (i.e. 2352.2345) from between <a> and </a>.
I have tried more examples from the www but I either get empty arrays or php errors.
A regex example that gives me an empty array (the <a> tag is in a table)
$pattern = "#<table\s.*?>.*?<a\s.*?class=[\"']pret[\"'].*?>(.*?)</a>.*?</table>#i";
preg_match_all($pattern, $content, $results);
print_r($results[1]);
Another example that gives just an error
$a=$content->getElementsByTagName(a);
Reason for the various errors: invalid HTML, non-UTF-8 characters.
Next I did this on another website, matched the contents in a single SQL table, and the result is a copied website with updated data from my country. No longer will I search the www for matching single results.
Let's hope you're trying to parse valid (at least valid enough) HTML document, you should use DOM for this:
// Simple example from php manual from comments
$xml = new DOMDocument();
$xml->loadHTMLFile($url);
$links = array();
foreach($xml->getElementsByTagName('a') as $link) {
$links[] = array('url' => $link->getAttribute('href'),
'text' => $link->nodeValue);
}
Note the use of loadHTMLFile rather than load (the HTML loaders are just more robust against errors). You may also set DOMDocument::recover (as suggested in a comment by hakre) so the parser will try to recover from errors.
Or you could use xPath (here's explanation of syntax):
$xpath = new DOMXpath($doc);
$elements = $xpath->query("//a[@class='pret']");
if (!is_null($elements)) {
    foreach ($elements as $element) {
        // use the loop variable $element here
        $links[] = array('url' => $element->getAttribute('href'),
                         'text' => $element->nodeValue);
    }
}
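Putting the XPath query together with the two values the question asks for might look like the sketch below. The sample markup, the href layout and the shape of the code (a letter followed by digits, such as d3852) are all assumptions, since the real page is not shown.

```php
<?php
// Hypothetical markup; the question does not show the real page.
$content = '<table><tr><td>'
         . '<a class="pret" href="/produs/d3852"><span>2352.2345 lei</span></a>'
         . '</td></tr></table>';

$doc = new DOMDocument;
libxml_use_internal_errors(true);   // tolerate invalid HTML without warnings
$doc->loadHTML($content);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$rows = [];
foreach ($xpath->query("//a[@class='pret']") as $a) {
    // First letter-plus-digits token in the href, e.g. "d3852" (assumed code shape).
    preg_match('~[a-z]\d+~i', $a->getAttribute('href'), $code);
    // First decimal number in the link text.
    preg_match('~\d+(?:\.\d+)?~', $a->nodeValue, $number);
    $rows[] = ['code' => $code[0] ?? null, 'number' => $number[0] ?? null];
}
var_export($rows);
```

Note that @class='pret' only matches when pret is the sole class on the element. For elements with several classes, contains(concat(' ', normalize-space(@class), ' '), ' pret ') is the usual XPath 1.0 workaround.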
And for case of invalid HTML you may use regexp like this:
$a1 = '\s*[^\'"=<>]+\s*=\s*"[^"]*"'; # Attribute with " - space tolerant
$a2 = "\s*[^'\"=<>]+\s*=\s*'[^']*'"; # Attribute with ' - space tolerant
$a3 = '\s*[^\'"=<>]+\s*=\s*[\w\d]*'; # Unescaped values - space tolerant
# [^'"=<>]* # Junk - I'm not inserting this to regexp but you may have to
$a = "(?:$a1|$a2|$a3)*"; # Any number of attributes
$class = 'class=([\'"])pret\\1'; # Using ?: carefully is crucial for \\1 to work
# otherwise you can use ["']
$reg = "~<a{$a}\s*{$class}{$a}\s*>(.*?)</a~si"; # delimiters and flags added so preg_match_all accepts it
And then just use preg_match_all. All the regexps are written off the top of my head, so you may have to debug them.
got the links like this
preg_match_all('/<a[^>]*class="pret">(.*?)<\\/a>/si', $content, $links);
print_r($links[0]);
and the result is
Array(
[0] => <span>3340.3570 word</span>..........)
so I need to get the first number inside href and the number between span
I have an html (sample.html) like this:
<html>
<head>
</head>
<body>
<div id="content">
<!--content-->
<p>some content</p>
<!--content-->
</div>
</body>
</html>
How do I get the content part between the two HTML comments '<!--content-->' using PHP? I want to get it, do some processing and put it back, so I need both to read and to write. Is it possible?
esafwan - you could use a regex expression to extract the content between the div (of a certain id).
I've done this for image tags before, so the same rules apply. i'll look out the code and update the message in a bit.
[update] try this:
<?php
function get_tag( $attr, $value, $xml ) {
$attr = preg_quote($attr);
$value = preg_quote($value);
$tag_regex = '/<div[^>]*'.$attr.'="'.$value.'">(.*?)<\\/div>/si';
preg_match($tag_regex,
$xml,
$matches);
return $matches[1];
}
$yourentirehtml = file_get_contents("test.html");
$extract = get_tag('id', 'content', $yourentirehtml);
echo $extract;
?>
or more simply:
preg_match("/<div[^>]*id=\"content\">(.*?)<\\/div>/si", $text, $match);
$content = $match[1];
jim
If this is a simple replacement that does not involve parsing the actual HTML document, you may use a regular expression or even just str_replace for this. But generally, it is not advisable to use regex for HTML, because HTML is not regular and coming up with reliable patterns can quickly become a nightmare.
The right way to parse HTML in PHP is to use a parsing library that actually knows how to make sense of HTML documents. Your best native bet would be DOM but PHP has a number of other native XML extensions you can use and there is also a number of third party libraries like phpQuery, Zend_Dom, QueryPath and FluentDom.
If you use the search function, you will see that this topic has been covered extensively and you should have no problems finding examples that show how to solve your question.
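For the comment-delimited case in the question, a DOM-based approach could look like this sketch. It assumes both markers are siblings inside the same parent element, as in sample.html; the inline markup stands in for the file contents.

```php
<?php
// Inline copy of sample.html from the question.
$html = <<<HTML
<html><head></head><body>
<div id="content">
<!--content-->
<p>some content</p>
<!--content-->
</div>
</body></html>
HTML;

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Select the comment nodes whose text is exactly "content".
$markers = $xpath->query('//comment()[. = "content"]');

$between = '';
if ($markers->length >= 2) {
    // Walk the siblings from the first marker up to (not including) the second.
    for ($node = $markers->item(0)->nextSibling;
         $node !== null && !$node->isSameNode($markers->item(1));
         $node = $node->nextSibling) {
        $between .= $dom->saveHTML($node);
    }
}
echo trim($between);
```

To put processed content back, the nodes visited in the loop can be modified or replaced in place before calling $dom->saveHTML() on the whole document.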
<?php
$content=file_get_contents("sample.html");
$comment=explode("<!--content-->",$content);
$comment=explode("<!--content-->",$comment[1]);
var_dump(strip_tags($comment[0]));
?>
Check this; it should work for you.
Problem is with nested divs
I found solution here
<?php // File: MatchAllDivMain.php
// Read html file to be processed into $data variable
$data = file_get_contents('test.html');
// Commented regex to extract contents from <div class="main">contents</div>
// where "contents" may contain nested <div>s.
// Regex uses PCRE's recursive (?1) sub expression syntax to recurs group 1
$pattern_long = '{ # recursive regex to capture contents of "main" DIV
<div\s+class="main"\s*> # match the "main" class DIV opening tag
( # capture "main" DIV contents into $1
(?: # non-cap group for nesting * quantifier
(?: (?!<div[^>]*>|</div>). )++ # possessively match all non-DIV tag chars
| # or
<div[^>]*>(?1)</div> # recursively match nested <div>xyz</div>
)* # loop however deep as necessary
) # end group 1 capture
</div> # match the "main" class DIV closing tag
}six'; // single-line (dot matches all), ignore case and free spacing modes ON
// short version of same regex
$pattern_short = '{<div\s+class="main"\s*>((?:(?:(?!<div[^>]*>|</div>).)++|<div[^>]*>(?1)</div>)*)</div>}si';
$matchcount = preg_match_all($pattern_long, $data, $matches);
// $matchcount = preg_match_all($pattern_short, $data, $matches);
echo("<pre>\n");
if ($matchcount > 0) {
echo("$matchcount matches found.\n");
// print_r($matches);
for($i = 0; $i < $matchcount; $i++) {
echo("\nMatch #" . ($i + 1) . ":\n");
echo($matches[1][$i]); // print 1st capture group for match number i
}
} else {
echo('No matches');
}
echo("\n</pre>");
?>
Have a look here for a code example that means you can load a HTML document into SimpleXML http://blog.charlvn.com/2009/03/html-in-php-simplexml.html
You can then treat it as a normal SimpleXML object.
EDIT: This will only work if you want the content in a tag (e.g. between <div> and </div>)
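One common way to get HTML into SimpleXML is to parse it with DOMDocument and convert the result with simplexml_import_dom. Whether the linked post does exactly this is an assumption; the sketch below only illustrates the general technique on a shortened version of sample.html.

```php
<?php
// Shortened sample document (assumed).
$html = '<html><body><div id="content"><p>some content</p></div></body></html>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // HTML is rarely valid XML; keep warnings quiet
$dom->loadHTML($html);
libxml_clear_errors();

// Convert the parsed DOM tree into a SimpleXMLElement rooted at <html>.
$xml = simplexml_import_dom($dom);

// SimpleXML supports XPath too; the result is an array of matching elements.
$divs = $xml->xpath('//div[@id="content"]');
$text = $divs ? (string) $divs[0]->p : '';

echo $text;
```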