Extracting specific text from HTML texts

I am not very familiar with regex. I am trying to obtain the results described at the bottom. Here is what I have done so far (note that $page contains tab characters):
$page = "<div class=\"title-container\">
<h1>Text here<span> /Sub-text/</span> </h1>
</div>";
// TITLE
preg_match_all ('/<h1>(.*)<\/h1>/U', $page, $out);
$hutitle = preg_replace("#<span>(.*)<\/span>\s#", "", $out[1][0]);
$entitle = preg_replace("'(.*)<span> /'", "", $out[1][0]);
I would like to get this:
$hutitle = "Text here";
$entitle = "Sub-text"; // without the HTML and the "/"

I'd suggest using DOM with trim() instead of regex. Here is working code for your specific case:
$page = "<div class=\"title-container\">\n <h1>Text here<span> /Sub-text/</span> </h1>\n </div>";
$dom = new DOMDocument;
$dom->loadHTML($page);
$hs = $dom->getElementsByTagName('h1');
foreach ($hs as $h) {
$enttitlenodes = $h->getElementsByTagName('span');
if ($enttitlenodes->length > 0 && $enttitlenodes->item(0)->tagName == 'span')
{
$entitle = trim($enttitlenodes->item(0)->nodeValue, " /");
echo $entitle . "\n";
$h->removeChild($enttitlenodes->item(0));
}
// trim() drops the whitespace left behind where the span used to be
$hutitle = trim($h->nodeValue);
echo $hutitle;
}

Try this pattern:
<h1>(.*?)<span> /(.*?)/</span>
Capture groups $1 and $2 hold the results you expect.
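To show how that pattern might be wired into PHP, here is a minimal sketch using preg_match(). The '#' delimiter, the 's' modifier and the extra \s* around the slashes are my own additions, not part of the answer above. In PHP the captured groups arrive in the match array ($m[1] and $m[2] here) rather than as $1 and $2.

```php
<?php
// Sample markup from the question (with a tab before the <h1>).
$page = "<div class=\"title-container\">\n\t<h1>Text here<span> /Sub-text/</span> </h1>\n</div>";

$hutitle = null;
$entitle = null;

// '#' is the delimiter, so the slashes inside the pattern need no escaping.
// The 's' modifier lets '.' match newlines in case the heading spans lines.
if (preg_match('#<h1>(.*?)<span>\s*/(.*?)/\s*</span>#s', $page, $m)) {
    $hutitle = trim($m[1]); // text before the span
    $entitle = trim($m[2]); // text between the two slashes
}

echo $hutitle, "\n", $entitle, "\n";
```

This only holds up while the markup keeps exactly this shape; for anything less predictable the DOM approach is the safer choice.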

Related

Change 'href' value of a link using PHP and DOM

I would like to change all links in an HTML variable to random ones. Here is my code but something prevents links from being changed:
<?php
$jobTemplateDetails = 'Click!
Click!';
////////////////////// CHANGE ALL LINKS
$linkDom = new DOMDocument;
@$linkDom->loadHTML($jobTemplateDetails);
$allLinks = $linkDom->getElementsByTagName('a');
foreach ($allLinks as $rawLink) {
$longLink = $rawLink->getAttribute('href');
$str = 'abcdefghijklmnopqrstuvwxyz';
$randomChar1 = $str[mt_rand(0, strlen($str)-1)];
$randomChar2 = $str[mt_rand(0, strlen($str)-1)];
$randomChar3 = $str[mt_rand(0, strlen($str)-1)];
$randomChar4 = $str[mt_rand(0, strlen($str)-1)];
$shortURL = mt_rand(1, 9).$randomChar1.mt_rand(1, 9).$randomChar2.$randomChar3.$randomChar4;
$rawLink->setAttribute('href', $shortURL);
}
echo $jobTemplateDetails;

When you echo $jobTemplateDetails; you only print the original input string, not the DOMDocument you manipulated.
Change that to:
echo $linkDom->saveHTML();
///OUTPUT:
Click!
Click!
a fiddle: https://3v4l.org/KuCic
See also the PHP documentation for DOMDocument::saveHTML().
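For reference, a self-contained sketch of the whole round trip: load, rewrite every href, then serialize the document instead of echoing the input string. The example.com links are placeholders I made up, since the question's original anchors are not shown, and str_shuffle() stands in for the question's random-character logic.

```php
<?php
// Hypothetical input; the real anchors from the question are unknown.
$jobTemplateDetails = '<p><a href="http://example.com/one">Click!</a> '
    . '<a href="http://example.com/two">Click!</a></p>';

$linkDom = new DOMDocument();
libxml_use_internal_errors(true);          // collect parser warnings quietly
$linkDom->loadHTML($jobTemplateDetails);
libxml_clear_errors();

foreach ($linkDom->getElementsByTagName('a') as $link) {
    // Six random lowercase letters as a stand-in short URL.
    $link->setAttribute('href', substr(str_shuffle('abcdefghijklmnopqrstuvwxyz'), 0, 6));
}

// loadHTML() wraps the fragment in <html><body>; serialize only the
// children of <body> to get the fragment back.
$body = $linkDom->getElementsByTagName('body')->item(0);
$result = '';
foreach ($body->childNodes as $child) {
    $result .= $linkDom->saveHTML($child);
}

echo $result, "\n";
```

Serializing the body's children rather than calling saveHTML() with no argument avoids the doctype and html/body wrapper in the output.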

PHP: Test for string of text inside html tags from file_get_contents string

I need to perform a series of tests on a URL. The first test is a word count. I have that working perfectly, and the code is below:
if (isset($_GET['article_url'])){
$title = 'This is an example title';
$str = @file_get_contents($_GET['article_url']);
$test1 = str_word_count(strip_tags(strtolower($str)));
if($test1 === FALSE) { $test1 = 0; }
if ($test1 > '550') {
echo '<div><i class="fa fa-check-square-o" style="color:green"></i> This article has '.$test1.' words.';
} else {
echo '<div><i class="fa fa-times-circle-o" style="color:red"></i> This article has '.$test1.' words. You are required to have a minimum of 500 words.';
}
}
Next I need to get all h1 and h2 tags from $str and test whether any of them contains the text in $title, echoing yes if so and no if not. I am not really sure how to go about doing this.
I am looking for a pure PHP way of doing this, without installing PHP libraries or third-party functions.

Please try the code below.
if (isset($_GET['article_url'])){
$title = 'This is an example title';
$str = @file_get_contents($_GET['article_url']);
$document = new DOMDocument();
$document->loadHTML($str);
$tags = array ('h1', 'h2');
$texts = array ();
foreach($tags as $tag)
{
//Fetch all the tags with text from the dom matched with passed tags
$elementList = $document->getElementsByTagName($tag);
foreach($elementList as $element)
{
//Store text in array from dom for tags
$texts[] = strtolower($element->textContent);
}
}
//Check whether the title exactly matches one of the collected heading texts
if(in_array(strtolower($title),$texts)){
echo "yes";
}else{
echo "no";
}
}
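Note that in_array() only succeeds when a heading equals the title exactly. If "contain" is meant literally, a substring test may be closer to what the question asks for. The following is a sketch of that variant; the function name headingsContain() and the sample markup are hypothetical, and stripos() is used for a case-insensitive search.

```php
<?php
// Hypothetical helper: true when any <h1> or <h2> contains $title.
function headingsContain(string $html, string $title): bool
{
    $document = new DOMDocument();
    libxml_use_internal_errors(true);   // real-world pages are rarely valid HTML
    $document->loadHTML($html);
    libxml_clear_errors();

    foreach (['h1', 'h2'] as $tag) {
        foreach ($document->getElementsByTagName($tag) as $element) {
            // stripos() is a case-insensitive substring search
            if (stripos($element->textContent, $title) !== false) {
                return true;
            }
        }
    }
    return false;
}

// Made-up sample page for illustration.
$sample = '<html><body><h1>News: This is an example title, revisited</h1>'
    . '<h2>Other</h2></body></html>';

echo headingsContain($sample, 'This is an example title') ? 'yes' : 'no';
```

Only the built-in DOM extension is used, so this stays within the "pure PHP" constraint from the question.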

Remove all attributes from PHP string but keep basic markdown tags [duplicate]

How can I use php to strip all/any attributes from a tag, say a paragraph tag?
<p class="one" otherrandomattribute="two"> to <p>

Although there are better ways, you could actually strip attributes from HTML tags with a regular expression:
<?php
function stripArgumentFromTags( $htmlString ) {
$regEx = '/([^<]*<\s*[a-z](?:[0-9]|[a-z]{0,9}))(?:(?:\s*[a-z\-]{2,14}\s*=\s*(?:"[^"]*"|\'[^\']*\'))*)(\s*\/?>[^<]*)/i'; // match any start tag
$chunks = preg_split($regEx, $htmlString, -1, PREG_SPLIT_DELIM_CAPTURE);
$chunkCount = count($chunks);
$strippedString = '';
for ($n = 1; $n < $chunkCount; $n++) {
$strippedString .= $chunks[$n];
}
return $strippedString;
}
?>
The above could probably be written in fewer characters, but it does the job (quick and dirty).
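One of the "better ways" is probably to let the built-in DOM extension parse the markup and then remove the attributes node by node. The sketch below assumes the input is an HTML fragment; the function name stripAttributesDom() and its $allowed parameter are made up for illustration.

```php
<?php
// Hypothetical helper: removes every attribute except those listed in $allowed.
function stripAttributesDom(string $html, array $allowed = []): string
{
    if (trim($html) === '') {
        return '';
    }

    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate sloppy markup
    $dom->loadHTML($html);
    libxml_clear_errors();

    foreach ($dom->getElementsByTagName('*') as $element) {
        // Collect the names first: the attribute map is live and
        // must not be modified while it is being iterated.
        $names = [];
        foreach ($element->attributes as $attr) {
            $names[] = $attr->nodeName;
        }
        foreach ($names as $name) {
            if (!in_array(strtolower($name), $allowed, true)) {
                $element->removeAttribute($name);
            }
        }
    }

    // loadHTML() adds <html><body>; return only the original fragment.
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo stripAttributesDom('<p class="one" otherrandomattribute="two">text</p>'), "\n";
echo stripAttributesDom('<p class="one" otherrandomattribute="two">text</p>', ['class']), "\n";
```

Because the parser repairs the markup on load, unclosed tags in the input come back closed, which a regex-based approach cannot offer.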

Strip attributes using SimpleXML (standard in PHP 5):
<?php
// define allowable tags
$allowable_tags = '<p><a><img><ul><ol><li><table><thead><tbody><tr><th><td>';
// define allowable attributes
$allowable_atts = array('href','src','alt');
// strip collector
$strip_arr = array();
// load XHTML with SimpleXML
$data_sxml = simplexml_load_string('<root>'. $data_str .'</root>', 'SimpleXMLElement', LIBXML_NOERROR | LIBXML_NOXMLDECL);
if ($data_sxml ) {
// loop all elements with an attribute
foreach ($data_sxml->xpath('descendant::*[@*]') as $tag) {
// loop attributes
foreach ($tag->attributes() as $name=>$value) {
// check for allowable attributes
if (!in_array($name, $allowable_atts)) {
// set attribute value to empty string
$tag->attributes()->$name = '';
// collect attribute patterns to be stripped
$strip_arr[$name] = '/ '. $name .'=""/';
}
}
}
}
// strip unallowed attributes and root tag
$data_str = strip_tags(preg_replace($strip_arr,array(''),$data_sxml->asXML()), $allowable_tags);
?>

Here is one function that will let you strip all attributes except ones you want:
function stripAttributes($s, $allowedattr = array()) {
if (preg_match_all("/<[^>]*\\s([^>]*)\\/*>/msiU", $s, $res, PREG_SET_ORDER)) {
foreach ($res as $r) {
$tag = $r[0];
$attrs = array();
preg_match_all("/\\s.*=(['\"]).*\\1/msiU", " " . $r[1], $split, PREG_SET_ORDER);
foreach ($split as $spl) {
$attrs[] = $spl[0];
}
$newattrs = array();
foreach ($attrs as $a) {
$tmp = explode("=", $a);
if (trim($a) != "" && (!isset($tmp[1]) || (trim($tmp[0]) != "" && !in_array(strtolower(trim($tmp[0])), $allowedattr)))) {
} else {
$newattrs[] = $a;
}
}
$attrs = implode(" ", $newattrs);
$rpl = str_replace($r[1], $attrs, $tag);
$s = str_replace($tag, $rpl, $s);
}
}
return $s;
}
For example:
echo stripAttributes('<p class="one" otherrandomattribute="two">');
Or, if you want to keep the "class" attribute, for example:
echo stripAttributes('<p class="one" otherrandomattribute="two">', array('class'));
Or, suppose you compose a message with CKEditor and want to send it to an inbox: you can pass the submitted text through the function and assign the result to $message before sending. Note that stripAttributes() strips the unnecessary attributes from the HTML tags. I tried it and it works fine; only the formatting I had added, such as bold, remained.
$message = stripAttributes($_POST['message']);
You can also echo $message; for a preview.

I honestly think that the only sane way to do this is to use a tag and attribute whitelist with the HTML Purifier library. Example script here:
<html><body>
<?php
require_once '../includes/htmlpurifier-4.5.0-lite/library/HTMLPurifier/Bootstrap.php';
spl_autoload_register(array('HTMLPurifier_Bootstrap', 'autoload'));
$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.Allowed', 'p,b,a[href],i,br,img[src]');
$config->set('URI.Base', 'http://www.example.com');
$config->set('URI.MakeAbsolute', true);
$purifier = new HTMLPurifier($config);
$dirty_html = "
<a href=\"http://www.google.de\">broken a href link</a
fnord
<x>y</z>
<b>c</p>
<script>alert(\"foo!\");</script>
Anzahl besuchter Seiten
<img src=\"www.example.com/bla.gif\" />
<a href=\"http://www.google.de\">missing end tag
ende
";
$clean_html = $purifier->purify($dirty_html);
print "<h1>dirty</h1>";
print "<pre>" . htmlentities($dirty_html) . "</pre>";
print "<h1>clean</h1>";
print "<pre>" . htmlentities($clean_html) . "</pre>";
?>
</body></html>
This yields the following clean, standards-conforming HTML fragment:
broken a href linkfnord
y
<b>c
<a>Anzahl besuchter Seiten</a>
<img src="http://www.example.com/www.example.com/bla.gif" alt="bla.gif" /><a href="http://www.google.de">missing end tag
ende
</a></b>
In your case the whitelist would be:
$config->set('HTML.Allowed', 'p');

HTML Purifier is one of the better tools for sanitizing HTML with PHP.

You might also look into HTML Purifier. True, it's quite bloated, and might not fit your needs if this specific example is your only concern, but it offers more or less 'bulletproof' purification of potentially hostile HTML. You can also choose to allow or disallow certain attributes (it's highly configurable).
http://htmlpurifier.org/

Strip tag with class in PHP

So I need to strip the span tags of class tip.
So that would be <span class="tip"> and the corresponding </span>, and everything inside it...
I suspect a regular expression is needed but I terribly suck at this.
Laugh...
<?php
$string = 'April 15, 2003';
$pattern = '/(\w+) (\d+), (\d+)/i';
$replacement = '${1}1,$3';
echo preg_replace($pattern, $replacement, $string);
?>
Gives no error... But
<?php
$str = preg_replace('<span class="tip">.+</span>', "", '<span class="rss-title"></span><span class="rss-link">linkylink</span><span class="rss-id"></span><span class="rss-content"></span><span class=\"rss-newpost\"></span>');
echo $str;
?>
Gives me the error:
Warning: preg_replace() [function.preg-replace]: Unknown modifier '.' in <A FILE> on line 4
previously, the error was at the ); in the 2nd line, but now.... >.>

This is the "proper" method (adapted from this answer).
Input:
<?php
$str = '<div>lol wut <span class="tip">remove!</span><span>don\'t remove!</span></div>';
?>
Code:
<?php
function recurse(&$doc, &$parent) {
if (!$parent->hasChildNodes())
return;
for ($i = 0; $i < $parent->childNodes->length; ) {
$elm = $parent->childNodes->item($i);
if ($elm->nodeName == "span") {
// getNamedItem() returns null when the span has no class attribute
$classAttr = $elm->attributes->getNamedItem("class");
if (!is_null($classAttr) && $classAttr->nodeValue == "tip") {
$parent->removeChild($elm);
continue;
}
}
recurse($doc, $elm);
$i++;
}
}
// Load in the DOM (remembering that XML requires one root node)
$doc = new DOMDocument();
$doc->loadXML("<document>" . $str . "</document>");
// Iterate the DOM
recurse($doc, $doc->documentElement);
// Output the result
foreach ($doc->childNodes->item(0)->childNodes as $node) {
echo $doc->saveXML($node);
}
?>
Output:
<div>lol wut <span>don't remove!</span></div>
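The same result can probably be reached with less code through DOMXPath. This is an alternative sketch, not the method from the linked answer. The XPath expression is my own and also matches spans with several classes, such as class="tip big".

```php
<?php
$str = '<div>lol wut <span class="tip">remove!</span><span>don\'t remove!</span></div>';

$doc = new DOMDocument();
// As above, wrap the fragment in a single root element so it parses as XML.
$doc->loadXML('<document>' . $str . '</document>');

$xpath = new DOMXPath($doc);
// Select spans whose class list contains the token "tip".
$query = '//span[contains(concat(" ", normalize-space(@class), " "), " tip ")]';

// Copy the result into an array before removing nodes from the tree.
foreach (iterator_to_array($xpath->query($query)) as $span) {
    $span->parentNode->removeChild($span);
}

// Serialize the children of the wrapper, skipping the wrapper itself.
$result = '';
foreach ($doc->documentElement->childNodes as $node) {
    $result .= $doc->saveXML($node);
}

echo $result, "\n";
```

Like the recursive version, this relies on loadXML(), so the input has to be well-formed; for tag soup, loadHTML() would be the parser to swap in.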

A simple regular expression like:
<span class="tip">.+</span>
won't work reliably. If another span is opened and closed inside the tip span, a lazy match (.+?) ends at the inner </span> rather than the one that closes the tip span, while the greedy .+ shown here runs on to the last </span> in the string. DOM-based tools like the one linked in the comments will provide a more reliable answer.
As per my comment below, you also need to add pattern delimiters when working with regular expressions in PHP:
<?php
$str = preg_replace('/<span class="tip">.+<\/span>/', "", '<span class="rss-title"></span><span class="rss-link">linkylink</span><span class="rss-id"></span><span class="rss-content"></span><span class=\"rss-newpost\"></span>');
echo $str;
?>
may be moderately more successful. Please take a look at the documentation page for the function in question.

Now without regexp, and without heavy XML parsing:
$html = ' ... <span class="tip"> hello <span id="x"> man </span> </span> ... ';
$tag = '<span class="tip">';
$tag_close = '</span>';
$tag_familly = '<span';
$tag_len = strlen($tag);
$p1 = -1;
$p2 = 0;
while ( ($p2!==false) && (($p1=strpos($html, $tag, $p1+1))!==false) ) {
// the tag is found, now we will search for its corresponding closing tag
$level = 1;
$p2 = $p1;
$continue = true;
while ($continue) {
$p2 = strpos($html, $tag_close, $p2+1);
if ($p2===false) {
// error in the html contents, the analysis cannot continue
echo "ERROR in html contents";
$continue = false;
$p2 = false; // will stop the loop
} else {
$level = $level -1;
$x = substr($html, $p1+$tag_len, $p2-$p1-$tag_len);
$n = substr_count($x, $tag_familly);
if ($level+$n<=0) $continue = false;
}
}
if ($p2!==false) {
// delete the pair of tags, the farthest one first
$html = substr_replace($html, '', $p2, strlen($tag_close));
$html = substr_replace($html, '', $p1, $tag_len);
}
}
