I'm trying to parse an HTML page and access some of its tags. I parse the header tags (h1, h2, h3, etc.) and display the result indented according to each tag's level. Now I want to save the resulting data (the indented table of contents) into an array, along with the tag names. Please help me sort out this problem.
Here is my PHP code. I'm using the Simple HTML DOM parser.
include ("simple_html_dom.php");
session_start();
error_reporting(0);
$string = file_get_contents('test.php');
$tags = array(0 => '<h1', 1 => '<h2', 2 => '<h3', 3 => '<h4', 4 => '<h5', 5 => '<h6');
function parser($html, $needles = array()){
$positions = array();
foreach ($needles as $needle){
$lastPos = 0;
while (($lastPos = strpos($html, $needle, $lastPos))!== false)
{
$positions[] = $lastPos;
$lastPos = $lastPos + strlen($needle);
}
unset($needles[0]);
if(count($positions) > 0){
break;
}
}
if(count($positions) > 0){
for ($i = 0; $i < count($positions); $i++) {
?>
<div class="<?php echo $i; ?>" style="padding-left: 20px; font-size: 14px;">
<?php
if($i < count($positions)-1){
$temp = explode('</', substr($html, $positions[$i]+4));
$pos = strpos($temp[0], '>');
echo substr($temp[0], $pos);
parser(substr($html, $positions[$i]+4, $positions[$i+1]-$positions[$i]-4), $needles);
} else {
$temp = explode('</', substr($html, $positions[$i]+4));
$pos = strpos($temp[0], '>');
echo substr($temp[0], $pos+1);
parser(substr($html, $positions[$i]+4), $needles);
}
?>
</div>
<?php
}
} else {
// not found any position of a tag
}
}
parser($string, $tags);
If you wanted to do it using SimpleXML and XPath, there is a shorter and much more readable version you could try...
$xml = new SimpleXMLElement($string);
$tags = $xml->xpath("//h1 | //h2 | //h3 | //h4");
$data = [];
foreach ( $tags as $tag ) {
$elementData['name'] = $tag->getName();
$elementData['content'] = (string)$tag;
$data[] = $elementData;
}
print_r($data);
You can see the pattern in the XPath - it combines any of the elements you need. The use of // means to find at any level and then the name of the element you want to find. These are combined using |, which is the 'or' operator. This could easily be expanded using the same type of expression to build a full set of tags you need.
The program then loops over the elements found and builds an array for each element in turn, taking the name and content and adding them to the $data array.
Update:
If your file isn't well-formed XML, you may have to use DOMDocument and loadHTML. It's only a slight difference, but it is more tolerant of errors...
$string = file_get_contents("links.html");
$xml = new DOMDocument();
libxml_use_internal_errors(true); // pass true, otherwise parse warnings are still emitted
$xml->loadHTML($string);
$xp = new DOMXPath($xml);
$tags = $xp->query("//h1 | //h2 | //h3 | //h4");
$data = [];
foreach ( $tags as $tag ) {
$elementData['name'] = $tag->tagName;
$elementData['content'] = $tag->nodeValue;
$data[] = $elementData;
}
print_r($data);
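To keep the indentation level as well as the tag name, the heading number can be stored next to each entry. The following is only a sketch under assumptions: the helper name collectHeadings() and the sample markup are made up for illustration, and the query is extended to h5 and h6 to cover the tags from the question.

```php
<?php
// Hypothetical helper: collects h1-h6 headings together with their level.
function collectHeadings(string $html): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $query = '//h1 | //h2 | //h3 | //h4 | //h5 | //h6';

    $toc = [];
    foreach ($xpath->query($query) as $node) {
        $toc[] = [
            'name'    => $node->nodeName,                  // e.g. "h2"
            'level'   => (int) substr($node->nodeName, 1), // 2 for "h2"
            'content' => trim($node->textContent),
        ];
    }
    return $toc;
}

// Made-up sample input.
$html = '<html><body><h1>Intro</h1><h2>Setup</h2><h3>Install</h3><h2>Usage</h2></body></html>';
$toc = collectHeadings($html);

// Print the table of contents, indenting two spaces per level.
foreach ($toc as $entry) {
    echo str_repeat('  ', $entry['level'] - 1), $entry['name'], ': ', $entry['content'], PHP_EOL;
}
```

The 'level' value is what drives the indentation when printing, so the array itself stays flat and easy to loop over.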
I need to sort some strings and match them with links. This is what I do:
$name_link = $dom->find('div[class=link] strong');
Returns array [0]-[5] containing strings such as NowDownload.eu
$code_link = $dom->find('div[class=link] code');
Returns links that match the names from 0-5, as in link [0] belongs to name [0]
I do not know the order in which they are returned: NowDownload.eu could be $code_link[4] or $code_link[3], but the name array will match it in order.
Now I need $code_link[4] (let's say it's NowDownload.eu) to become $link1 every time,
so I do this:
$i = 0;
while (!empty($code_link[$i]))
SortLinks($name_link, $code_link, $i); // pass all links and names to function, and counter
$i++;
}
function SortLinks($name_link, $code_link, &$i) { // counter is passed by reference since it has to increase after the function
$string = $name_link[$i]->plaintext; // name_link is saved as string
$string = serialize($string); // They are returned in a odd format, not searcheble unless i serialize
if (strpos($string, 'NowDownload.eu')) { // if string contains NowDownload.eu
$link1 = $code_link[$i]->plaintext;
$link1 = html_entity_decode($link1);
return $link1; // return link1
}
elseif (strpos($string, 'Fileswap')) {
$link2 = $code_link[$i]->plaintext;
$link2 = html_entity_decode($link2);
return $link2;
}
elseif (strpos($string, 'Mirrorcreator')) {
$link3 = $code_link[$i]->plaintext;
$link3 = html_entity_decode($link3);
return $link3;
}
elseif (strpos($string, 'Uploaded')) {
$link4 = $code_link[$i]->plaintext;
$link4 = html_entity_decode($link4);
return $link4;
}
elseif (strpos($string, 'Ziddu')) {
$link5 = $code_link[$i]->plaintext;
$link5 = html_entity_decode($link5);
return $link5;
}
elseif (strpos($string, 'ZippyShare')) {
$link6 = $code_link[$i]->plaintext;
$link6 = html_entity_decode($link6);
return $link6;
}
}
echo $link1 . '<br>';
echo $link2 . '<br>';
echo $link3 . '<br>';
echo $link4 . '<br>';
echo $link5 . '<br>';
echo $link6 . '<br>';
die();
I know that it finds the link; I tested it before. But I wanted to make it a function, and it broke. Is my logic faulty, or is there an issue with the way I pass the variables/arrays?
I don't know why you pass $i by reference, since you only read it. You could return an array containing the named links and use it like so:
$all_links = SortLinks($name_link,$code_link);
echo $all_links['link1'].'<br/>';
echo $all_links['link2'].'<br/>';
You will have to put your loop inside the function, not outside.
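A minimal sketch of that idea, with some assumptions: the function below takes plain strings rather than simple_html_dom nodes (so the ->plaintext calls would happen before it is called), the sample names and URLs are invented, and the lookup table is one possible way to replace the if/elseif chain. Comparing stripos() against false also handles a match at position 0, which a bare strpos() test treats as "not found".

```php
<?php
// Hypothetical rewrite: the loop lives inside the function and an array is returned.
function sortLinks(array $names, array $links): array
{
    // Host name fragment => key in the returned array.
    $hosts = [
        'NowDownload.eu' => 'link1',
        'Fileswap'       => 'link2',
        'Mirrorcreator'  => 'link3',
        'Uploaded'       => 'link4',
        'Ziddu'          => 'link5',
        'ZippyShare'     => 'link6',
    ];

    $sorted = [];
    foreach ($names as $i => $name) {
        foreach ($hosts as $needle => $key) {
            // !== false matters: a match at position 0 is still a match.
            if (stripos($name, $needle) !== false && isset($links[$i])) {
                $sorted[$key] = html_entity_decode($links[$i]);
                break;
            }
        }
    }
    return $sorted;
}

// Made-up sample data; with simple_html_dom these would come from ->plaintext.
$names = ['Ziddu', 'NowDownload.eu'];
$links = ['http://example.com/z?a=1&amp;b=2', 'http://example.com/n'];

$all_links = sortLinks($names, $links);
echo $all_links['link1'], PHP_EOL;
echo $all_links['link5'], PHP_EOL;
```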
I want to paginate the following filtered results from xml file:
<?php
//load up the XML file as the variable $xml (which is now an array)
$xml = simplexml_load_file('inventory.xml');
//create the function xmlfilter, and tell it which arguments it will be handling
function xmlfilter ($xml, $color, $weight, $maxprice)
{
$res = array();
foreach ($xml->widget as $w)
{
//initially keep all elements in the array by setting keep to 1
$keep = 1;
//now start checking to see if these variables have been set
if ($color!='')
{
//if color has been set, and the element's color does not match, don't keep this element
if ((string)$w->color != $color) $keep = 0;
}
//if the max weight has been set, ensure the elements weight is less, or don't keep this element
if ($weight)
{
if ((int)$w->weight > $weight) $keep = 0;
}
//same goes for max price
if ($maxprice)
{
if ((int)$w->price > $maxprice) $keep = 0;
}
if ($keep) $res[] = $w;
}
return $res;
}
//check to see if the form was submitted (the url will have '?sub=Submit' at the end)
if (isset($_GET['sub']))
{
//$color will equal whatever value was chosen in the form (url will show '?color=Blue')
$color = isset($_GET['color'])? $_GET['color'] : '';
//same goes for these fellas
$weight = $_GET['weight'];
$price = $_GET['price'];
//now pass all the variables through the filter and create a new array called $filtered
$filtered = xmlfilter($xml ,$color, $weight, $price);
//finally, echo out each element from $filtered, along with its properties in neat little spans
foreach ($filtered as $widget) {
echo "<div class='widget'>";
echo "<span class='name'>" . $widget->name . "</span>";
echo "<span class='color'>" . $widget->color . "</span>";
echo "<span class='weight'>" . $widget->weight . "</span>";
echo "<span class='price'>" . $widget->price . "</span>";
echo "</div>";
}
}
Where $xml->widget represents the following xml:
<hotels xmlns="">
<hotels>
<hotel>
<noofrooms>10</noofrooms>
<website></website>
<imageref>oias-sunset-2.jpg|villas-agios-nikolaos-1.jpg|villas-agios-nikolaos-24.jpg|villas-agios-nikolaos-41.jpg</imageref>
<descr>blah blah blah</descr>
<hotelid>119</hotelid>
</hotel>
</hotels>
</hotels>
Any good ideas?
Honestly, if you're already using XML and want pagination, use XSL. It lets you format the results and paginate them with ease. PHP has a built-in XSL transformer (the XSLTProcessor class from the XSL extension).
See http://www.codeproject.com/Articles/11277/Pagination-using-XSL for a decent example.
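If XSL feels like too much, a different and simpler technique is to paginate the filtered array directly in PHP with array_slice(). This is a sketch, not the XSL approach described above: the paginate() helper, the page size, and the stand-in data are all assumptions for illustration.

```php
<?php
// Hypothetical helper: returns one page of an already filtered result array.
function paginate(array $items, int $page, int $perPage = 10): array
{
    $total = count($items);
    $pages = max(1, (int) ceil($total / $perPage));
    $page  = min(max(1, $page), $pages); // clamp to a valid page number

    return [
        'page'  => $page,
        'pages' => $pages,
        'items' => array_slice($items, ($page - 1) * $perPage, $perPage),
    ];
}

// Stand-in for the array returned by xmlfilter().
$filtered = range(1, 25);

// In a real page the number would come from the query string, e.g. (int) $_GET['page'].
$result = paginate($filtered, 3, 10);
print_r($result['items']);
```

The 'pages' value can then be used to print "previous" and "next" links that carry the filter parameters along in the URL.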
So I need to strip the span tags of class tip.
So that would be <span class="tip"> and the corresponding </span>, and everything inside it...
I suspect a regular expression is needed but I terribly suck at this.
Laugh...
<?php
$string = 'April 15, 2003';
$pattern = '/(\w+) (\d+), (\d+)/i';
$replacement = '${1}1,$3';
echo preg_replace($pattern, $replacement, $string);
?>
Gives no error... But
<?php
$str = preg_replace('<span class="tip">.+</span>', "", '<span class="rss-title"></span><span class="rss-link">linkylink</span><span class="rss-id"></span><span class="rss-content"></span><span class=\"rss-newpost\"></span>');
echo $str;
?>
Gives me the error:
Warning: preg_replace() [function.preg-replace]: Unknown modifier '.' in <A FILE> on line 4
previously, the error was at the ); in the 2nd line, but now.... >.>
This is the "proper" method (adapted from this answer).
Input:
<?php
$str = '<div>lol wut <span class="tip">remove!</span><span>don\'t remove!</span></div>';
?>
Code:
<?php
// Objects are passed by handle, so no reference parameters are needed
function recurse($doc, $parent) {
if (!$parent->hasChildNodes())
return;
for ($i = 0; $i < $parent->childNodes->length; ) {
$elm = $parent->childNodes->item($i);
// getAttribute() returns "" when the attribute is absent, so spans without a class are safe
if ($elm instanceof DOMElement && $elm->nodeName == "span" && $elm->getAttribute("class") == "tip") {
$parent->removeChild($elm);
continue; // the next sibling has moved into position $i
}
recurse($doc, $elm);
$i++;
}
}
// Load in the DOM (remembering that XML requires one root node)
$doc = new DOMDocument();
$doc->loadXML("<document>" . $str . "</document>");
// Iterate the DOM
recurse($doc, $doc->documentElement);
// Output the result
foreach ($doc->childNodes->item(0)->childNodes as $node) {
echo $doc->saveXML($node);
}
?>
Output:
<div>lol wut <span>don't remove!</span></div>
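For comparison, the same removal can be sketched with DOMXPath instead of a hand-written recursion. This is an alternative under the same assumptions as above (the input is wrapped in a made-up <document> root so it parses as XML), and the query only matches a class attribute that is exactly "tip".

```php
<?php
$str = '<div>lol wut <span class="tip">remove!</span><span>don\'t remove!</span></div>';

$doc = new DOMDocument();
$doc->loadXML('<document>' . $str . '</document>');

$xpath = new DOMXPath($doc);

// Copy the matches into an array first, then remove them one by one.
$tips = iterator_to_array($xpath->query('//span[@class="tip"]'));
foreach ($tips as $tip) {
    $tip->parentNode->removeChild($tip);
}

// Serialize everything below the wrapper element.
$result = '';
foreach ($doc->documentElement->childNodes as $node) {
    $result .= $doc->saveXML($node);
}
echo $result;
```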
A simple regular expression like:
<span class="tip">.+</span>
Won't work. The issue is that if another span is opened and closed inside the tip span, the match ends at the wrong </span>: a lazy .+? stops at the nested span's closing tag rather than the tip's, and the greedy .+ runs on to the last </span> in the input. DOM-based tools like the one linked in the comments will give a more reliable answer.
As per my comment below, you need to add pattern delimiters when working with regular expressions in PHP.
<?php
$str = preg_replace('#<span class="tip">.+</span>#', "", '<span class="rss-title"></span><span class="rss-link">linkylink</span><span class="rss-id"></span><span class="rss-content"></span><span class=\"rss-newpost\"></span>');
echo $str;
?>
may be moderately more successful. Please take a look at the documentation page for the function in question.
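To make the delimiter point concrete, here is a small sketch with made-up input strings. It uses # as the delimiter so the / in </span> needs no escaping, and it also shows the nesting limitation mentioned above.

```php
<?php
// Made-up input with a single, non-nested tip span.
$input = '<p>keep <span class="tip">drop me</span> this</p>';

// "#" delimits the pattern; ".*?" is lazy; the "s" modifier lets "." match newlines.
$output = preg_replace('#<span class="tip">.*?</span>#s', '', $input);
echo $output, PHP_EOL;

// With a nested span the lazy match stops at the inner </span>,
// leaving the tail of the tip span behind.
$nested = '<span class="tip">a <span>b</span> c</span>';
$broken = preg_replace('#<span class="tip">.*?</span>#s', '', $nested);
echo $broken, PHP_EOL;
```

Any non-alphanumeric, non-backslash, non-whitespace character can serve as the delimiter; / and ~ are other common choices.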
Now without regexp, and without heavy XML parsing:
$html = ' ... <span class="tip"> hello <span id="x"> man </span> </span> ... ';
$tag = '<span class="tip">';
$tag_close = '</span>';
$tag_familly = '<span';
$tag_len = strlen($tag);
$p1 = -1;
$p2 = 0;
while ( ($p2!==false) && (($p1=strpos($html, $tag, $p1+1))!==false) ) {
// the tag is found, now we will search for its corresponding closing tag
$level = 1;
$p2 = $p1;
$continue = true;
while ($continue) {
$p2 = strpos($html, $tag_close, $p2+1);
if ($p2===false) {
// error in the html contents, the analysis cannot continue
echo "ERROR in html contents";
$continue = false;
$p2 = false; // will stop the loop
} else {
$level = $level -1;
$x = substr($html, $p1+$tag_len, $p2-$p1-$tag_len);
$n = substr_count($x, $tag_familly);
if ($level+$n<=0) $continue = false;
}
}
if ($p2!==false) {
// delete the pair of tags, the farthest first
$html = substr_replace($html, '', $p2, strlen($tag_close));
$html = substr_replace($html, '', $p1, $tag_len);
}
}
The function below is designed to apply rel="nofollow" attributes to all external links and no internal links unless the path matches a predefined root URL defined as $my_folder below.
So given the variables...
$my_folder = 'http://localhost/mytest/go/';
$blog_url = 'http://localhost/mytest';
And the content...
internal
internal cloaked link
external
The end result, after replacement should be...
internal
internal cloaked link
external
Notice that the first link is not altered, since it's an internal link.
The link on the second line is also an internal link, but since it matches our $my_folder string, it gets the nofollow too.
The third link is the easiest: since it does not match $blog_url, it's obviously an external link.
However, in the script below, ALL of my links are getting nofollow. How can I fix the script to do what I want?
function save_rseo_nofollow($content) {
$my_folder = $rseo['nofollow_folder'];
$blog_url = get_bloginfo('url');
preg_match_all('~<a.*>~isU',$content["post_content"],$matches);
for ( $i = 0; $i <= sizeof($matches[0]); $i++){
if ( !preg_match( '~nofollow~is',$matches[0][$i])
&& (preg_match('~' . $my_folder . '~', $matches[0][$i])
|| !preg_match( '~'.$blog_url.'~',$matches[0][$i]))){
$result = trim($matches[0][$i],">");
$result .= ' rel="nofollow">';
$content["post_content"] = str_replace($matches[0][$i], $result, $content["post_content"]);
}
}
return $content;
}
Here is the DOMDocument solution...
$str = 'internal
internal cloaked link
external
external
external
external
';
$dom = new DOMDocument();
$dom->preserveWhiteSpace = FALSE; // property names are case-sensitive
$dom->loadHTML($str);
$a = $dom->getElementsByTagName('a');
$host = strtok($_SERVER['HTTP_HOST'], ':');
foreach($a as $anchor) {
$href = $anchor->attributes->getNamedItem('href')->nodeValue;
if (preg_match('/^https?:\/\/' . preg_quote($host, '/') . '/', $href)) {
continue;
}
$noFollowRel = 'nofollow';
$oldRelAtt = $anchor->attributes->getNamedItem('rel');
if ($oldRelAtt == NULL) {
$newRel = $noFollowRel;
} else {
$oldRel = $oldRelAtt->nodeValue;
$oldRel = explode(' ', $oldRel);
if (in_array($noFollowRel, $oldRel)) {
continue;
}
$oldRel[] = $noFollowRel;
$newRel = implode(' ', $oldRel); // separator first; the reversed order was removed in PHP 8
}
$newRelAtt = $dom->createAttribute('rel');
$noFollowNode = $dom->createTextNode($newRel);
$newRelAtt->appendChild($noFollowNode);
$anchor->appendChild($newRelAtt);
}
var_dump($dom->saveHTML());
Output
string(509) "<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
internal
internal cloaked link
external
external
external
external
</body></html>
"
Try to make it more readable first, and only afterwards make your if rules more complex:
function save_rseo_nofollow($content) {
$content["post_content"] =
preg_replace_callback('~<(a\s[^>]+)>~isU', "cb2", $content["post_content"]);
return $content;
}
function cb2($match) {
list($original, $tag) = $match; // regex match groups
$my_folder = "/hostgator"; // re-add quirky config here
$blog_url = "http://localhost/";
if (strpos($tag, "nofollow")) {
return $original;
}
elseif (strpos($tag, $blog_url) && (!$my_folder || !strpos($tag, $my_folder))) {
return $original;
}
else {
return "<$tag rel='nofollow'>";
}
}
Gives following output:
[post_content] =>
internal
<a href="http://localhost/mytest/go/hostgator" rel=nofollow>internal cloaked link</a>
<a href="http://cnn.com" rel=nofollow>external</a>
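The problem in your original code is probably $rseo, which isn't declared anywhere inside the function. $my_folder therefore ends up empty, the pattern built from it is just '~~', and an empty pattern matches every tag, so every link passes the test and gets nofollow.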
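The effect of an undeclared $rseo can be checked in isolation. In this sketch $my_folder is set to null by hand to stand in for the failed lookup, and the sample tag is invented.

```php
<?php
// Stand-in for $rseo['nofollow_folder'] when $rseo does not exist in the function scope.
$my_folder = null;

// Same construction as in the question: this yields the empty pattern "~~".
$pattern = '~' . $my_folder . '~';

// Made-up internal link.
$tag = '<a href="http://localhost/mytest/">';

// An empty pattern matches any subject, so this is 1 for every tag.
$matchesFolder = preg_match($pattern, $tag);
var_dump($pattern, $matchesFolder);
```

Passing the folder into the function as a parameter, or reading it from a properly scoped configuration value, avoids the empty pattern.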
Try this one (PHP 5.3+). It can skip a selected address and leaves a manually set rel parameter alone.
The code:
function nofollow($html, $skip = null) {
return preg_replace_callback(
"#(<a[^>]+?)>#is", function ($mach) use ($skip) {
return (
!($skip && strpos($mach[1], $skip) !== false) &&
strpos($mach[1], 'rel=') === false
) ? $mach[1] . ' rel="nofollow">' : $mach[0];
},
$html
);
}
Examples:
echo nofollow('something');
// stays the same because it already contains a rel parameter
echo nofollow('something');
// adds the rel="nofollow" parameter to the anchor
echo nofollow('something', 'localhost');
// skips this link as an internal link
Using regular expressions to do this job properly would be quite complicated. It would be easier to use an actual parser, such as the one from the DOM extension. DOM isn't very beginner-friendly, so what you can do is load the HTML with DOM then run the modifications with SimpleXML. They're backed by the same library, so it's easy to use one with the other.
Here's what it can look like:
$my_folder = 'http://localhost/mytest/go/';
$blog_url = 'http://localhost/mytest';
$html = '<html><body>
internal
internal cloaked link
external
</body></html>';
$dom = new DOMDocument;
$dom->loadHTML($html);
$sxe = simplexml_import_dom($dom);
// grab all <a> nodes with an href attribute
foreach ($sxe->xpath('//a[@href]') as $a)
{
if (substr($a['href'], 0, strlen($blog_url)) === $blog_url
&& substr($a['href'], 0, strlen($my_folder)) !== $my_folder)
{
// skip all links that start with the URL in $blog_url, as long as they
// don't start with the URL from $my_folder;
continue;
}
if (empty($a['rel']))
{
$a['rel'] = 'nofollow';
}
else
{
$a['rel'] .= ' nofollow';
}
}
$new_html = $dom->saveHTML();
echo $new_html;
As you can see, it's really short and simple. Depending on your needs, you may want to use preg_match() in place of the substr() comparisons, for example:
// change the regexp to your own rules, here we match everything under
// "http://localhost/mytest/" as long as it's not followed by "go"
if (preg_match('#^http://localhost/mytest/(?!go)#', $a['href']))
{
continue;
}
Note
I missed the last code block in the OP when I first read the question. The code I posted (and basically any solution based on DOM) is better suited to processing a whole page than an HTML block. Otherwise, DOM will attempt to "fix" your HTML and may add a <body> tag, a DOCTYPE, etc.
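One way to work around those added wrappers, sketched here with a made-up fragment, is to let DOM add them and then serialize only the children of the generated <body>. Exact whitespace between block elements in the output may differ from the input.

```php
<?php
// Made-up HTML fragment without <html> or <body>.
$fragment = '<p>Hello <a href="http://example.com/">world</a></p><p>Bye</p>';

$dom = new DOMDocument();
$dom->loadHTML($fragment);

// loadHTML() wraps the fragment in <html><body>; take only what is inside <body>.
$body = $dom->getElementsByTagName('body')->item(0);

$inner = '';
foreach ($body->childNodes as $child) {
    $inner .= $dom->saveHTML($child);
}
echo $inner;
```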
Thanks @alex for your nice solution. However, I had a problem with Japanese text, which I fixed as follows. This code can also skip multiple domains via the $whiteList array.
public function addRelNoFollow($html, $whiteList = [])
{
$dom = new \DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
$a = $dom->getElementsByTagName('a');
/** @var \DOMElement $anchor */
foreach ($a as $anchor) {
$href = $anchor->attributes->getNamedItem('href')->nodeValue;
$domain = parse_url($href, PHP_URL_HOST);
// Skip whiteList domains
if (in_array($domain, $whiteList, true)) {
continue;
}
// Check & get existing rel attribute values
$noFollow = 'nofollow';
$rel = $anchor->attributes->getNamedItem('rel');
if ($rel) {
$values = explode(' ', $rel->nodeValue);
if (in_array($noFollow, $values, true)) {
continue;
}
$values[] = $noFollow;
$newValue = implode(' ', $values); // separator first; the reversed order was removed in PHP 8
} else {
$newValue = $noFollow;
}
// Create new rel attribute
$rel = $dom->createAttribute('rel');
$node = $dom->createTextNode($newValue);
$rel->appendChild($node);
$anchor->appendChild($rel);
}
// There is a problem with saveHTML() and saveXML(), both of them do not work correctly in Unix.
// They do not save UTF-8 characters correctly when used in Unix, but they work in Windows.
// So we need to do as follows. @see https://stackoverflow.com/a/20675396/1710782
return $dom->saveHTML($dom->documentElement);
}
<?
$str='internal
internal cloaked link
external';
function test($x){
if (preg_match('#localhost/mytest/(?!go/)#i',$x[0])>0) return $x[0];
return 'rel="nofollow" '.$x[0];
}
echo preg_replace_callback('/href=[\'"][^\'"]+/i', 'test', $str);
?>
Here is another solution, which has a whitelist option and can add a target="_blank" attribute.
It also checks whether a rel attribute already exists before adding a new one.
function Add_Nofollow_Attr($Content, $Whitelist = [], $Add_Target_Blank = true)
{
$Whitelist[] = $_SERVER['HTTP_HOST'];
foreach ($Whitelist as $Key => $Link)
{
$Host = preg_replace('#^https?://#', '', $Link);
$Host = "https?://". preg_quote($Host, '/');
$Whitelist[$Key] = $Host;
}
if(preg_match_all("/<a .*?>/", $Content, $matches, PREG_SET_ORDER))
{
foreach ($matches as $Anchor_Tag)
{
$IS_Rel_Exist = $IS_Follow_Exist = $IS_Target_Blank_Exist = $Is_Valid_Tag = false;
if(preg_match_all("/(\w+)\s*=\s*['|\"](.*?)['|\"]/",$Anchor_Tag[0],$All_matches2))
{
foreach ($All_matches2[1] as $Key => $Attr_Name)
{
if($Attr_Name == 'href')
{
$Is_Valid_Tag = true;
$Url = $All_matches2[2][$Key];
// bypass #.. or internal links like "/"
if(preg_match('/^\s*[#|\/].*/', $Url))
{
continue 2;
}
foreach ($Whitelist as $Link)
{
if (preg_match("#$Link#", $Url)) {
continue 3;
}
}
}
else if($Attr_Name == 'rel')
{
$IS_Rel_Exist = true;
$Rel = $All_matches2[2][$Key];
preg_match("/[n|d]ofollow/", $Rel, $match, PREG_OFFSET_CAPTURE);
if( count($match) > 0 )
{
$IS_Follow_Exist = true;
}
else
{
$New_Rel = 'rel="'. $Rel . ' nofollow"';
}
}
else if($Attr_Name == 'target')
{
$IS_Target_Blank_Exist = true;
}
}
}
$New_Anchor_Tag = $Anchor_Tag;
if(!$IS_Rel_Exist)
{
$New_Anchor_Tag = str_replace(">",' rel="nofollow">',$Anchor_Tag);
}
else if(!$IS_Follow_Exist)
{
$New_Anchor_Tag = preg_replace("/rel=[\"|'].*?[\"|']/",$New_Rel,$Anchor_Tag);
}
if($Add_Target_Blank && !$IS_Target_Blank_Exist)
{
$New_Anchor_Tag = str_replace(">",' target="_blank">',$New_Anchor_Tag);
}
$Content = str_replace($Anchor_Tag,$New_Anchor_Tag,$Content);
}
}
return $Content;
}
To use it:
$Page_Content = 'internal
internal
google
example
stackoverflow';
$Whitelist = ["http://yoursite.com","http://localhost"];
echo Add_Nofollow_Attr($Page_Content,$Whitelist,true);
WordPress solution:
function replace__method($match) {
list($original, $tag) = $match; // regex match groups
$my_folder = "/articles"; // re-add quirky config here
$blog_url = 'https://'.$_SERVER['SERVER_NAME'];
if (strpos($tag, "nofollow")) {
return $original;
}
elseif (strpos($tag, $blog_url) && (!$my_folder || !strpos($tag, $my_folder))) {
return $original;
}
else {
return "<$tag rel='nofollow'>";
}
}
add_filter( 'the_content', 'add_nofollow_to_external_links', 1 );
function add_nofollow_to_external_links( $content ) {
$content = preg_replace_callback('~<(a\s[^>]+)>~isU', "replace__method", $content);
return $content;
}
A good script that adds nofollow automatically and keeps the other attributes:
function nofollow(string $html, ?string $baseUrl = null) {
return preg_replace_callback(
'#<a([^>]*)>(.+)</a>#isU', function ($mach) use ($baseUrl) {
list ($a, $attr, $text) = $mach;
if (preg_match('#href=["\']([^"\']*)["\']#', $attr, $url)) {
$url = $url[1];
if (is_null($baseUrl) || !str_starts_with($url, $baseUrl)) {
if (preg_match('#rel=["\']([^"\']*)["\']#', $attr, $rel)) {
$relAttr = $rel[0];
$rel = $rel[1];
}
$rel = 'rel="' . ($rel ? (strpos($rel, 'nofollow') !== false ? $rel : $rel . ' nofollow') : 'nofollow') . '"';
$attr = isset($relAttr) ? str_replace($relAttr, $rel, $attr) : $attr . ' ' . $rel;
$a = '<a ' . $attr . '>' . $text . '</a>';
}
}
return $a;
},
$html
);
}