Determining that a string is a valid HTML element - php

I'm having trouble getting this constraint's matches() function to correctly recognise all HTML elements.
It must return true for any legitimate, properly-formed HTML element and return false for anything that is not a legitimate, properly-formed HTML element.
The following are things that did not work:
$dom = new \DOMDocument(); return $dom->loadHTML($value);
$dom = new \DOMDocument(); return $dom->loadHTML($value,LIBXML_HTML_NOIMPLIED);
Adding the flag LIBXML_NOENT to simplexml_load_string().
Adding the flag LIBXML_HTML_NOIMPLIED to simplexml_load_string().
Here is the current function:
function matches($value)
{
\libxml_use_internal_errors(true);
if (!\is_string($value) || empty($value)) {
return false;
}
$start = \strpos($value, '<');
$end = \strrpos($value, '>', $start);
$len = \strlen($value);
if ($end !== false) {
$value = \substr($value, $start);
} else {
$value = \substr($value, $start, $len - $start);
}
$value = \html_entity_decode($value);
$value = \str_replace('&', '', $value);
\libxml_clear_errors();
$xml = \simplexml_load_string($value);
return \count(\libxml_get_errors()) === 0;
}
The current version has two known problems:
<script>&</script>: Should fail but passes.
<a b="""></a>: Should pass but fails.

Related

scrape html page with strange result

The scrape works, but the strange thing is that the result is ["-3°"].
I have tried many different things to get just -3°.
How is it that [" and "] show up when they are not in the code?
Can someone give me some direction on how to achieve this?
The code I am using is:
<?php
function scrape($url){
$output = file_get_contents($url);
return $output;
}
function fetchdata($data, $start, $end){
$data = stristr($data, $start); // Stripping all data from before $start
$data = substr($data, strlen($start)); // Stripping $start
$stop = stripos($data, $end); // Getting the position of the $end of the data to scrape
$data = substr($data, 0, $stop); // Stripping all data from after and including the $end of the data to scrape
return $data; // Returning the scraped data from the function
}
$page = scrape("https://weather.gc.ca/city/pages/bc-37_metric_e.html");
$result = fetchdata($page, "<p class=\"text-center mrgn-tp-md mrgn-bttm-sm lead\"><span class=\"wxo-metric-hide\">", "<abbr title=\"Celsius\">C</abbr>");
echo json_encode(array($result));
?>
Thanks in advance for your help!
You can use the DOMDocument to parse the HTML file.
$page = file_get_contents("https://weather.gc.ca/city/pages/bc-37_metric_e.html");
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($page);
libxml_use_internal_errors(false);
$paragraphs = $doc->getElementsByTagName('p');
foreach($paragraphs as $p){
if($p->getAttribute('class') == 'text-center mrgn-tp-md mrgn-bttm-sm lead') {
foreach($p->getElementsByTagName('span') as $attr) {
if($attr->getAttribute('class') == 'wxo-metric-hide') {
foreach($attr->getElementsByTagName('abbr') as $abbr) {
if($abbr->getAttribute('title') == 'Celsius') {
echo trim($attr->nodeValue);
}
}
}
}
}
}
Output:
-3°C
This is assuming the classes and structure are consistent...
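As for where the [" and "] come from: the original script prints json_encode(array($result)), and json_encode() serializes a one-element array as a JSON list, which adds the brackets and quotes. A minimal sketch, assuming the scraped value is the string -3&deg; (the exact markup on the page is an assumption):

```php
<?php
// Assumed scraped value; the real page content may differ.
$result = '-3&deg;';

// Wrapping the value in an array makes json_encode() emit a JSON list.
$wrapped = json_encode(array($result));

// Using the string directly keeps it free of brackets and quotes.
$plain = $result;

echo $wrapped, "\n"; // ["-3&deg;"]
echo $plain, "\n";   // -3&deg;
```

So if only the bare value is wanted, echoing $result directly (instead of JSON-encoding an array) should be enough.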

Iterate through array, and run function on each value

I need to iterate through an array and edit each value in the same way.
<?php
Function parseStatus($Input, $Start, $End){
$String = " " . $Input;
$Init = StrPos($String, $Start);
If($Init == 0){
Return '';
}
$Init += StrLen($Start);
$Length = StrPos($String, $End, $Init) - $Init;
Return SubStr($String, $Init, $Length);
}
Function getAllStatuses($Username){
$DOM = new DOMDocument();
$DOM->validateOnParse = True;
#$DOM->loadHtml(File_Get_Contents('http://lifestream.aol.com/stream/' . $Username));
$xPath = new DOMXPath($DOM);
$Stream = $DOM->getElementById('stream')->nodeValue; // return stream content for display name
$Nodes = $xPath->query('//div[@class="stream"]');
$Name = Explode(' ', Trim($Stream));
$User = $Name[0];
$Statuses = Array();
ForEach($Nodes as $Node){
ForEach($Node->getElementsByTagName('li') as $Key => $Tags){
$Statuses[] = $Tags->nodeValue;
}
}
ForEach($Statuses as $Status){
If(StrPos($Status, 'Services')){
Echo 'services is definitely in there';
$New = AIM::parseStatus($Status, $User, 'Services');
Echo $New;
Break;
}
}
?>
The issue is that $New only echoes the very first output. How do I get it to run through each value in the array and do the same thing?
Expected output:
[name as start] what i need [word Services]
Then on each value in the array, do the same thing so it'd be like:
what i need
again what i need but different string
etc.
Thanks for any help.
The Break; in your foreach loop is, well, breaking the loop.
Remove the Break; and it should work.
Have a read here:
http://www.php.net/break
break ends execution of the current for, foreach, while, do-while or switch structure.
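A minimal sketch of the loop with the Break; removed, so that every matching status is processed. The sample $Statuses values, the user name Bob, and the standalone parseStatus() (adapted from the question rather than the AIM:: class method) are assumptions for illustration:

```php
<?php
// Returns the text between $Start and $End, or '' if $Start is not found.
function parseStatus($Input, $Start, $End) {
    $Init = strpos($Input, $Start);
    if ($Init === false) {
        return '';
    }
    $Init += strlen($Start);
    $Length = strpos($Input, $End, $Init) - $Init;
    return substr($Input, $Init, $Length);
}

$User = 'Bob'; // hypothetical display name
$Statuses = array(
    'Bob is coding Services',
    'no keyword here',
    'Bob went home Services',
);

$Results = array();
foreach ($Statuses as $Status) {
    // Strict comparison, so a match at position 0 is not mistaken for "not found".
    if (strpos($Status, 'Services') !== false) {
        $Results[] = trim(parseStatus($Status, $User, 'Services'));
        // No break here: the loop continues with the next status.
    }
}

print_r($Results); // contains 'is coding' and 'went home'
```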

php substring occurrences between two strings in an html file

So I have an HTML file as a source; it contains several instances of the following code:
<span itemprop="name">NAME</span>
where the NAME part always changes to something different.
How can I write PHP code that goes through the HTML, extracts all the names between "<span itemprop="name">" and "</span>", and puts them in an array?
I have tried this code, but it doesn't work:
$prev=$html;
for($i=0; $i<10; $i++){
$current = explode('<span itemprop="name">', $prev);
$cur = explode('</span>', $current[1]);
$names[] = $cur[0];
$prev = $current[2];
}
print_r($names);
A better way would probably be to use PHP's DOMDocument, a library such as PHP Simple HTML DOM Parser, or any other DOM parser, rather than the approach you planned.
Here is an example of working DOMDocument code:
$doc = new DOMDocument();
$doc->loadHTML('<html><body><span itemprop="name">1</span><span itemprop="name">2</span><span itemprop="name">3</span></body></html>');
$finder = new DomXPath($doc);
$nodes = $finder->query("//*[contains(@itemprop, 'name')]");
foreach($nodes as $node)
{
echo $node->nodeValue . '<br />';
}
Outputs:
1
2
3
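Since the question asks for the names in an array, the same kind of query can collect the values instead of echoing them. A minimal sketch, assuming the markup from the question; the sample names and the extra non-matching span are made up:

```php
<?php
$html = '<html><body>'
      . '<span itemprop="name">Alice</span>'
      . '<span itemprop="name">Bob</span>'
      . '<span class="other">skip me</span>'
      . '</body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep warnings about imperfect HTML quiet
$doc->loadHTML($html);
libxml_clear_errors();

$finder = new DOMXPath($doc);
$names = array();
// Match only <span> elements whose itemprop attribute is exactly "name".
foreach ($finder->query('//span[@itemprop="name"]') as $node) {
    $names[] = trim($node->nodeValue);
}

print_r($names); // contains 'Alice' and 'Bob'
```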
I kind of feel bad for saying this, but you could use a regular expression:
preg_match_all('/<span itemprop="name">(.*?)<\/span>/i', $html, $matches);
var_dump($matches[1]); // the captured names are stored in $matches[1]
This function will get us the "NAME"
function getbetween($content,$start,$end) {
$r = explode($start, $content);
if (isset($r[1])){
$r = explode($end, $r[1]);
return $r[0];
}
return '';
}
This function will replace only the first occurrence
<?php
function str_replace_once($search, $replace, $subject) {
$firstChar = strpos($subject, $search);
if($firstChar !== false) {
$beforeStr = substr($subject,0,$firstChar);
$afterStr = substr($subject, $firstChar + strlen($search));
return $beforeStr.$replace.$afterStr;
} else {
return $subject;
}
}
?>
now a loop
$start = '<span itemprop="name">';
$end = '</span>';
while(strpos($content, $start) !== false) {
$name = getbetween($content, $start, $end);
$content = str_replace_once($start.$name.$end, '',$content);
echo $name.'<br>';
}
use this function:
function get_string_between($string, $start, $end){
$string = ' ' . $string;
$ini = strpos($string, $start);
if ($ini == 0) return '';
$ini += strlen($start);
$len = strpos($string, $end, $ini) - $ini;
return substr($string, $ini, $len);
}
$fullstring = 'this is my [tag]dog[/tag]';
$parsed = get_string_between($fullstring, '[tag]', '[/tag]');
echo $parsed; // (result = dog)

how to validate the number of opened and closed tags?

I thought of counting the matches of "/<[a-z0-9]+>/i" and then checking whether the same number of closing tags exists, i.e. "/<\/[a-z0-9]+>/i".
But I am not too sure. How would you count all opened tags and check that all the closing tags exist?
P.S. I don't need to check for attributes or for XML-style /> self-closing tags. I just need a count of plain, simple HTML tags.
Thanks
I wrote this handy function. I think it could be faster if I searched for both opening and closing tags with a single preg_match_all, but this way it is more readable:
<?php
//> Will count number of <[a-z]> tag and </[a-z]> tag (will also validate the order)
//> Note br should be in the form of <br /> for not causing problems
function validHTML($html,$checkOrder=true) {
preg_match_all( '#<([a-z]+)>#i' , $html, $start, PREG_OFFSET_CAPTURE );
preg_match_all( '#<\/([a-z]+)>#i' , $html, $end, PREG_OFFSET_CAPTURE );
$start = $start[1];
$end = $end[1];
if (count($start) != count($end) )
throw new Exception('Check numbers of tags');
if ($checkOrder) {
$is = 0;
foreach($end as $v){
if ($v[0] != $start[$is][0] || $v[1] < $start[$is][1] )
throw new Exception('End tag ['.$v[0].'] not opened');
$is++;
}
}
return true;
}
//> Usage::
try {
validHTML('<p>hello</p><li></li></p><p>');
} catch (Exception $e) {
echo $e->getMessage();
}
Note: if you need to catch h1 or any other tag containing digits, you need to add 0-9 to the character class in the preg patterns.
The proper way to validate HTML is using a HTML parser. Using Regexes to deal with HTML is very wrong - see RegEx match open tags except XHTML self-contained tags
My approach:
function checkHtml($html) {
$level = 0;
$map = [];
$length = strlen($html);
$open = false;
$tag = '';
for($i = 0; $i < $length; $i ++) {
$c = substr($html, $i, 1);
if($c == '<') {
$open = true;
$tag = '';
} else if($open && ($c == '>' || ord($c) == 32)) {
$open = false;
if(in_array($tag, ['br', 'br/', 'hr/', 'img/', 'hr', 'img'])) {
continue;
}
if(strpos($tag, '/') === 0) {
if(!isset($map[$tag.($level-1)])) {
return false;
}
$level --;
unset($map[$tag.$level]);
} else {
$map['/'.$tag.$level] = true;
$level ++;
}
} else if($open) {
$tag .= $c;
}
}
return $level == 0;
}
OK, one solution would be:
function open_tags($page) // $page is your html/xml/something content
{
    $arr = array(); // stack of currently open tag names
    $i = 0;
    while (($i = strpos($page, '<', $i)) !== false) // position where the tag starts
    {
        $end = strpos($page, '>', $i); // position where the tag ends
        if ($end === false)
            return false; // '<' without a matching '>'
        $tag = substr($page, $i + 1, $end - $i - 1);
        if (strpos($tag, '/') === 0) // if it's an end tag
        {
            // pop the last value pushed onto the stack and check that it matches this one
            if (array_pop($arr) != substr($tag, 1))
                return false;
        }
        else
        {
            array_push($arr, $tag); // push the new tag name onto the stack
        }
        $i = $end + 1; // continue after this tag
    }
    return $arr;
}
This returns the tags that are still open, in order, or false on error.
Edit:
function open_tags($page) // $page is your html/xml/something content
{
    $arr = array(); // stack of currently open tag names
    $i = 0;
    while (($i = strpos($page, '<', $i)) !== false) // position where the tag starts
    {
        $end = strpos($page, '>', $i); // position where the tag ends
        $next = strpos($page, '<', $i + 1); // position of the next '<', if any
        if ($end === false || ($next !== false && $next < $end))
            return false; // tag never closed, or another '<' appears before its '>'
        $tag = substr($page, $i + 1, $end - $i - 1);
        if (strpos($tag, '/') === 0) // if it's an end tag
        {
            // pop the last value pushed onto the stack and check that it matches this one
            if (array_pop($arr) != substr($tag, 1))
                return false;
        }
        else
        {
            array_push($arr, $tag); // push the new tag name onto the stack
        }
        $i = $end + 1; // continue after this tag
    }
    return $arr;
}

How to get tag content?

I'm making a script to get the content of other pages, and right now I'm working on a function that should get a tag's content, but I'm a bit stuck.
With the code below, this is what gets printed:
found a new tag of same kind inside tag...
nothing found...
1111
2222
<?php
function d($toprint)
{
echo $toprint."<br />";
}
function GetTagContents($source, $tag, $pos)
{
$startTagPos = strpos( $source, "<".$tag, $pos );
$startTagEndPos = strpos( $source, ">", $startTagPos )+1;
$endTagPos = strpos( $source, "</".$tag, $startTagEndPos);
$lastpos = $startTagPos+1;
while( $lastpos != False )
{
$newStartTagPos = strpos( $source, "<".$tag, $lastpos );
if( $newStartTagPos == False )
{
d("nothing found...");
$lastpos = False;
}
else if( $newStartTagPos > $endTagPos )
{
d("out of bounds...");
$lastpos = False;
}
else
{
d("found a new tag of same kind inside tag...");
$lastpos = $newStartTagPos+1;
$endTagPos = strpos( $source, "</".$tag, $newStartTagPos);
}
}
return substr($source, $startTagEndPos, $endTagPos-$startTagEndPos);
}
?>
<html>
<body>
<?php
d(GetTagContents('<div>1111<div>2222</div>3333</div>', "div", 0));
?>
</body>
</html>
Does anyone have any ideas?
Using PHP DOM:
$src = new DOMDocument('1.0', 'utf-8');
$src->formatOutput = true;
$src->preserveWhiteSpace = false;
$src->loadHTMLFile('path/to/file.html'); // load() expects XML; loadHTMLFile() parses HTML
$tagName = 'foo';
$element = $src->getElementsByTagName($tagName)->item(0);
var_dump($element->nodeValue);
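If the goal is the full inner content of the outer tag, including nested tags of the same kind, the DOM can serialize the child nodes for you. A minimal sketch, assuming the input string from the question; the variable names are illustrative:

```php
<?php
$doc = new DOMDocument();
libxml_use_internal_errors(true); // tolerate imperfect markup without warnings
$doc->loadHTML('<div>1111<div>2222</div>3333</div>');
libxml_clear_errors();

// The first <div> in document order is the outer one.
$outer = $doc->getElementsByTagName('div')->item(0);

// Concatenate the serialized child nodes to get the inner HTML.
$inner = '';
foreach ($outer->childNodes as $child) {
    $inner .= $doc->saveHTML($child);
}

echo $inner, "\n";
```

Because the parser tracks nesting itself, there is no need to search for the matching closing tag by hand.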
strpos will return 0 the first time, and 0 == false in PHP. The check you want is to compare the result with ===, which evaluates to true if both values are the same value and the same type. That is, 0 == false is true but 0 === false is not true.
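A short illustration of that difference, using the input string from the question:

```php
<?php
$source = '<div>1111<div>2222</div>3333</div>';

$pos     = strpos($source, '<div');  // int(0): found at the very start
$missing = strpos($source, '<span'); // bool(false): not found at all

var_dump($pos == false);      // bool(true): loose comparison treats 0 as false
var_dump($pos === false);     // bool(false): strict comparison tells them apart
var_dump($missing === false); // bool(true)
```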
You can use simplexml_load_string:
$xml = "<div>1111<div>2222</div>3333</div>";
$loadString = simplexml_load_string($xml);
foreach($loadString->children() as $name => $data) {
if($name == 'div') {
echo $data . "\n";
}
}
