Determining that a string is a valid HTML element

I'm having trouble getting this constraint's matches function to handle all HTML elements correctly.
It must return true for any legitimate, properly-formed HTML element and return false for anything that is not a legitimate, properly-formed HTML element.
The following are things that did not work:
$dom = new \DOMDocument(); return $dom->loadHTML($value);
$dom = new \DOMDocument(); return $dom->loadHTML($value,LIBXML_HTML_NOIMPLIED);
Adding the flag LIBXML_NOENT to simplexml_load_string().
Adding the flag LIBXML_HTML_NOIMPLIED to simplexml_load_string().
Here is the current function:
function matches($value)
{
    \libxml_use_internal_errors(true);
    if (!\is_string($value) || empty($value)) {
        return false;
    }
    $start = \strpos($value, '<');
    $end = \strrpos($value, '>', $start);
    $len = \strlen($value);
    if ($end !== false) {
        $value = \substr($value, $start);
    } else {
        $value = \substr($value, $start, $len - $start);
    }
    $value = \html_entity_decode($value);
    $value = \str_replace('&', '', $value);
    \libxml_clear_errors();
    $xml = \simplexml_load_string($value);
    return \count(\libxml_get_errors()) === 0;
}
The current version has two known problems:
<script>&</script>: Should fail but passes.
<a b="""></a>: Should pass but fails.

Related

scrape html page with strange result

The scrape works, but the strange thing is that the result is ["-3°"].
I tried many different things to get just -3°.
How is it that [" and "] show up if they are not in the code?
Can someone give me some direction on how to achieve this?
The code I am using is:
<?php
function scrape($url){
    $output = file_get_contents($url);
    return $output;
}
function fetchdata($data, $start, $end){
    $data = stristr($data, $start); // Stripping all data from before $start
    $data = substr($data, strlen($start)); // Stripping $start
    $stop = stripos($data, $end); // Getting the position of the $end of the data to scrape
    $data = substr($data, 0, $stop); // Stripping all data from after and including the $end of the data to scrape
    return $data; // Returning the scraped data from the function
}
$page = scrape("https://weather.gc.ca/city/pages/bc-37_metric_e.html");
$result = fetchdata($page, "<p class=\"text-center mrgn-tp-md mrgn-bttm-sm lead\"><span class=\"wxo-metric-hide\">", "<abbr title=\"Celsius\">C</abbr>");
echo json_encode(array($result));
?>
Thanks in advance for your help!

You can use the DOMDocument to parse the HTML file.
$page = file_get_contents("https://weather.gc.ca/city/pages/bc-37_metric_e.html");
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($page);
libxml_use_internal_errors(false);
$paragraphs = $doc->getElementsByTagName('p');
foreach ($paragraphs as $p) {
    if ($p->getAttribute('class') == 'text-center mrgn-tp-md mrgn-bttm-sm lead') {
        foreach ($p->getElementsByTagName('span') as $attr) {
            if ($attr->getAttribute('class') == 'wxo-metric-hide') {
                foreach ($attr->getElementsByTagName('abbr') as $abbr) {
                    if ($abbr->getAttribute('title') == 'Celsius') {
                        echo trim($attr->nodeValue);
                    }
                }
            }
        }
    }
}
Output:
-3°C
This is assuming the classes and structure are consistent...
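As for where the [" and "] in the question come from: json_encode(array($result)) serializes a one-element array, so the output is JSON array syntax wrapped around the string. A minimal sketch of the difference, using a hard-coded stand-in value rather than the live page (the value of $result here is an assumption for illustration):

```php
<?php
// Stand-in for the scraped value; the real code gets it from fetchdata().
$result = '-3';

// Encoding an array yields JSON array syntax: brackets plus a quoted string.
$asJson = json_encode(array($result));

// Using the string itself yields just the value.
$asText = trim($result);

echo $asJson, "\n"; // ["-3"]
echo $asText, "\n"; // -3
```

If the caller really needs JSON, the brackets are correct and the consumer should decode them; if plain text is wanted, echo the string directly.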

Iterate through array, and run function on each value

I need to iterate through an array and apply the same edit to each value.
<?php
Function parseStatus($Input, $Start, $End){
    $String = " " . $Input;
    $Init = StrPos($String, $Start);
    If($Init == 0){
        Return '';
    }
    $Init += StrLen($Start);
    $Length = StrPos($String, $End, $Init) - $Init;
    Return SubStr($String, $Init, $Length);
}
Function getAllStatuses($Username){
    $DOM = new DOMDocument();
    $DOM->validateOnParse = True;
    #$DOM->loadHtml(File_Get_Contents('http://lifestream.aol.com/stream/' . $Username));
    $xPath = new DOMXPath($DOM);
    $Stream = $DOM->getElementById('stream')->nodeValue; // return stream content for display name
    $Nodes = $xPath->query('//div[@class="stream"]');
    $Name = Explode(' ', Trim($Stream));
    $User = $Name[0];
    $Statuses = Array();
    ForEach($Nodes as $Node){
        ForEach($Node->getElementsByTagName('li') as $Key => $Tags){
            $Statuses[] = $Tags->nodeValue;
        }
    }
    ForEach($Statuses as $Status){
        If(StrPos($Status, 'Services')){
            Echo 'services is definitely in there';
            $New = AIM::parseStatus($Status, $User, 'Services');
            Echo $New;
            Break;
        }
    }
}
?>
The issue is that $New is only echoed for the very first match. How do I get this to run on each value in the array and do the same thing?
Expected output:
[name as start] what i need [word Services]
Then on each value in the array, do the same thing so it'd be like:
what i need
again what i need but different string
etc.
Thanks for any help.

The Break; in your foreach loop is, well, breaking the loop.
Remove the Break; and it should work.
Have a read here:
http://www.php.net/break
break ends execution of the current for, foreach, while, do-while or switch structure.
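A minimal sketch of the loop without Break, so that every matching value is processed. The sample strings in $statuses are hypothetical stand-ins for the scraped statuses, and continue is used to skip non-matching values while the loop keeps going:

```php
<?php
// Hypothetical sample data standing in for the scraped statuses.
$statuses = array(
    'Bob first update Services',
    'no marker here',
    'Bob second update Services',
);

$found = array();
foreach ($statuses as $status) {
    // strpos() can return 0 for a match at the start, so compare with false.
    if (strpos($status, 'Services') === false) {
        continue; // skip this value but keep looping
    }
    $found[] = $status; // every match is collected, not just the first
}

print_r($found);
```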

php substring occurrences between two strings in an html file

I have an HTML file as the source. It contains several instances of the following code:
<span itemprop="name">NAME</span>
where the NAME part is different each time.
How can I write PHP code that goes through the HTML, extracts all the names between "<span itemprop="name">" and "</span>", and puts them in an array?
I have tried this code, but it doesn't work:
$prev = $html;
for ($i = 0; $i < 10; $i++) {
    $current = explode('<span itemprop="name">', $prev);
    $cur = explode('</span>', $current[1]);
    $names[] = $cur[0];
    $prev = $current[2];
}
print_r($names);

A better way would probably be to use PHP's DOMDocument, a library such as PHP Simple HTML DOM Parser, or any other DOM parser, rather than the approach you planned.
Here is an example of working DOMDocument code:
$doc = new DOMDocument();
$doc->loadHTML('<html><body><span itemprop="name">1</span><span itemprop="name">2</span><span itemprop="name">3</span></body></html>');
$finder = new DomXPath($doc);
$nodes = $finder->query("//*[contains(@itemprop, 'name')]");
foreach ($nodes as $node) {
    echo $node->nodeValue . '<br />';
}
Outputs:
1
2
3
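Since the question asks for the names in an array, here is a minimal sketch of the same DOMXPath idea that collects the values instead of echoing them. The sample markup and the names Alice and Bob are made up for illustration; the query uses an exact attribute match on span elements rather than contains():

```php
<?php
// Sample markup; in practice $html would be the contents of the source file.
$html = '<html><body><span itemprop="name">Alice</span>'
      . '<span itemprop="name">Bob</span></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$names = array();
// Exact attribute match, limited to span elements.
foreach ($xpath->query('//span[@itemprop="name"]') as $node) {
    $names[] = trim($node->nodeValue);
}

print_r($names);
```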

I kinda feel bad for saying this... but you could use a regular expression:
preg_match_all('/<span itemprop="name">(.*?)<\/span>/i', $html, $matches);
var_dump($matches); // results are stored in $matches; the captured names are in $matches[1]

This function will get us the "NAME":
function getbetween($content, $start, $end) {
    $r = explode($start, $content);
    if (isset($r[1])) {
        $r = explode($end, $r[1]);
        return $r[0];
    }
    return '';
}
This function will replace only the first occurrence:
<?php
function str_replace_once($search, $replace, $subject) {
    $firstChar = strpos($subject, $search);
    if ($firstChar !== false) {
        $beforeStr = substr($subject, 0, $firstChar);
        $afterStr = substr($subject, $firstChar + strlen($search));
        return $beforeStr.$replace.$afterStr;
    } else {
        return $subject;
    }
}
?>
Now a loop:
$start = '<span itemprop="name">';
$end = '</span>';
// Compare with false so a match at position 0 is not skipped.
while (strpos($content, $start) !== false) {
    $name = getbetween($content, $start, $end);
    $content = str_replace_once($start.$name.$end, '', $content);
    echo $name.'<br>';
}

Use this function:
function get_string_between($string, $start, $end){
    $string = ' ' . $string;
    $ini = strpos($string, $start);
    if ($ini == 0) return '';
    $ini += strlen($start);
    $len = strpos($string, $end, $ini) - $ini;
    return substr($string, $ini, $len);
}
$fullstring = 'this is my [tag]dog[/tag]';
$parsed = get_string_between($fullstring, '[tag]', '[/tag]');
echo $parsed; // (result = dog)

how to validate the number of opened and closed tags?

I thought of counting the matches of "/<[a-z0-9]+>/i" with preg_match_all and then checking that the same number of closing tags exists, i.e. "/<\/[a-z0-9]+>/i".
But I am not too sure. How would you count all opened tags and check that each one has a closing tag?
P.S. I don't need to check attributes or XML-style /> self-closing tags. I just need a count of plain, simple HTML tags.
Thanks

I wrote this handy function. I think it could be faster if I searched for both opening and closing tags in one preg_match_all, but this way it's more readable:
<?php
//> Will count number of <[a-z]> tag and </[a-z]> tag (will also validate the order)
//> Note br should be in the form of <br /> for not causing problems
function validHTML($html, $checkOrder = true) {
    preg_match_all('#<([a-z]+)>#i', $html, $start, PREG_OFFSET_CAPTURE);
    preg_match_all('#<\/([a-z]+)>#i', $html, $end, PREG_OFFSET_CAPTURE);
    $start = $start[1];
    $end = $end[1];
    if (count($start) != count($end))
        throw new Exception('Check numbers of tags');
    if ($checkOrder) {
        $is = 0;
        foreach ($end as $v) {
            if ($v[0] != $start[$is][0] || $v[1] < $start[$is][1])
                throw new Exception('End tag ['.$v[0].'] not opened');
            $is++;
        }
    }
    return true;
}
//> Usage::
try {
    validHTML('<p>hello</p><li></li></p><p>');
} catch (Exception $e) {
    echo $e->getMessage();
}
Note: if you need to match h1 or any other tag containing digits, add 0-9 to the character class in both patterns.

The proper way to validate HTML is using a HTML parser. Using Regexes to deal with HTML is very wrong - see RegEx match open tags except XHTML self-contained tags

My case
function checkHtml($html) {
    $level = 0;
    $map = [];
    $length = strlen($html);
    $open = false;
    $tag = '';
    for ($i = 0; $i < $length; $i++) {
        $c = substr($html, $i, 1);
        if ($c == '<') {
            $open = true;
            $tag = '';
        } else if ($open && ($c == '>' || ord($c) == 32)) {
            $open = false;
            if (in_array($tag, ['br', 'br/', 'hr/', 'img/', 'hr', 'img'])) {
                continue;
            }
            if (strpos($tag, '/') === 0) {
                if (!isset($map[$tag.($level-1)])) {
                    return false;
                }
                $level--;
                unset($map[$tag.$level]);
            } else {
                $map['/'.$tag.$level] = true;
                $level++;
            }
        } else if ($open) {
            $tag .= $c;
        }
    }
    return $level == 0;
}

OK, one solution would be:
function open_tags($page) // $page: your html/xml/something content
{
    $arr = array(); // stack of the names of currently open tags
    $i = 0;
    while (($i = strpos($page, '<', $i)) !== false) // position of the start of the tag
    {
        $end = strpos($page, '>', $i); // position of the end of the tag
        if ($end === false)
            return false; // tag is never terminated
        $tag = substr($page, $i + 1, $end - $i - 1);
        if (strpos($tag, '/') === 0) // if it's an end tag
        {
            // pop the last value pushed onto the stack, and check if it's the same as this one
            if (array_pop($arr) !== substr($tag, 1))
                return false;
        }
        else
        {
            array_push($arr, $tag); // push the new tag name onto the stack
        }
        $i = $end + 1; // continue after this tag
    }
    return $arr;
}
This returns the tags that are still open, in order, or false on error.
edit:
function open_tags($page) // $page: your html/xml/something content
{
    $arr = array(); // stack of the names of currently open tags
    $i = 0;
    while (($i = strpos($page, '<', $i)) !== false) // position of the start of the tag
    {
        $end = strpos($page, '>', $i); // position of the end of the tag
        if ($end === false)
            return false; // tag is never terminated
        $next = strpos($page, '<', $i + 1);
        if ($next !== false && $next < $end)
            return false; // another tag starts before this one is closed
        $tag = substr($page, $i + 1, $end - $i - 1);
        if (strpos($tag, '/') === 0) // if it's an end tag
        {
            // pop the last value pushed onto the stack, and check if it's the same as this one
            if (array_pop($arr) !== substr($tag, 1))
                return false;
        }
        else
        {
            array_push($arr, $tag); // push the new tag name onto the stack
        }
        $i = $end + 1; // continue after this tag
    }
    return $arr;
}

How to get tag content?

I'm making a script to get other pages' content, and right now I'm working on a function that should get a tag's content, but I'm a bit stuck. With the code below, this is printed:
found a new tag of same kind inside tag...
nothing found...
1111
2222
<?php
function d($toprint)
{
    echo $toprint."<br />";
}
function GetTagContents($source, $tag, $pos)
{
    $startTagPos = strpos( $source, "<".$tag, $pos );
    $startTagEndPos = strpos( $source, ">", $startTagPos )+1;
    $endTagPos = strpos( $source, "</".$tag, $startTagEndPos);
    $lastpos = $startTagPos+1;
    while( $lastpos != False )
    {
        $newStartTagPos = strpos( $source, "<".$tag, $lastpos );
        if( $newStartTagPos == False )
        {
            d("nothing found...");
            $lastpos = False;
        }
        else if( $newStartTagPos > $endTagPos )
        {
            d("out of bounds...");
            $lastpos = False;
        }
        else
        {
            d("found a new tag of same kind inside tag...");
            $lastpos = $newStartTagPos+1;
            $endTagPos = strpos( $source, "</".$tag, $newStartTagPos);
        }
    }
    return substr($source, $startTagEndPos, $endTagPos-$startTagEndPos);
}
?>
<html>
<body>
<?php
d(GetTagContents('<div>1111<div>2222</div>3333</div>', "div", 0));
?>
</body>
</html>
Does anyone have any ideas?

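Using PHP DOM:
$src = new DOMDocument('1.0', 'utf-8');
$src->formatOutput = true;
$src->preserveWhiteSpace = false;
libxml_use_internal_errors(true); // real-world HTML often triggers parser warnings
$src->loadHTMLFile('path/to/file.html'); // load() expects well-formed XML
$tagName = 'foo';
$element = $src->getElementsByTagName($tagName)->item(0);
var_dump($element->nodeValue);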
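For nested tags like the <div> example in the question, a DOM parser avoids tracking offsets by hand. The following is a minimal sketch, not a drop-in replacement for GetTagContents: getInnerHtml is a hypothetical helper name, and it only looks at the first element with the given tag name. It serializes each child node with DOMDocument::saveHTML so the nested markup is kept:

```php
<?php
// Hypothetical helper: returns the inner HTML of the first <$tag> element,
// or null if the document has no such element.
function getInnerHtml($source, $tag)
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $doc->loadHTML($source);
    libxml_clear_errors();

    $element = $doc->getElementsByTagName($tag)->item(0);
    if ($element === null) {
        return null;
    }

    $inner = '';
    foreach ($element->childNodes as $child) {
        // saveHTML($node) serializes just that node, markup included.
        $inner .= $doc->saveHTML($child);
    }
    return $inner;
}

echo getInnerHtml('<div>1111<div>2222</div>3333</div>', 'div'), "\n";
```

Unlike nodeValue, which returns only the text, this keeps the inner <div> tags in the result.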

strpos will return 0 the first time, and 0 == false in PHP. The check you want is to compare the result with ===, which evaluates to true if both values are the same value and the same type. That is, 0 == false is true but 0 === false is not true.
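A short sketch of that difference, using a stand-in $source string (an assumption for illustration, modelled on the question's input):

```php
<?php
$source = '<div>1111</div>';

// The match is at the very start of the string, so $pos is the integer 0.
$pos = strpos($source, '<div');

var_dump($pos == false);  // bool(true): loose comparison treats 0 like false
var_dump($pos === false); // bool(false): 0 is an int, not the boolean false

// Reliable "not found" check: strpos() returns the boolean false here.
$missing = strpos($source, '<span');
var_dump($missing === false); // bool(true)
```

In the question's loop, this means conditions such as $newStartTagPos == False should be written with ===, otherwise a tag found at offset 0 is treated as not found.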

You can use simplexml_load_string, provided the input is well-formed XML:
$xml = "<div>1111<div>2222</div>3333</div>";
$loadString = simplexml_load_string($xml);
foreach ($loadString->children() as $name => $data) {
    if ($name === 'div') {
        echo $data . "\n";
    }
}
