simple html dom unable to handle forward slash in find(id) - php

find() here is a function of the simple_html_dom library that should return DOM node elements when given an id or class.
$urlFetched->find("#".$id) never returns anything when $id is "fk-list-MP3-Players-/-IPods". I am guessing the problem is how simple_html_dom handles the forward slash, because the other ids and URLs (snipped) work fine.
What can I do? My program is almost complete and depends on simple_html_dom.
Thanks
The code:
$urlAndIds = array(
    array(
        "http://www.flipkart.com/audio",
        array('fk-list-Home-Audio', htmlentities("fk-list-MP3-Players-/-IPods"), 'fk-list-Accessories'),
        array('ALL', 'AllBrands')
    )
);
foreach ($urlAndIds as $uAI) {
    // Fetch the page first so a failed download is reported before parsing
    $url = file_get_contents($uAI[0]);
    if ($url === false) {
        echo 'page ' . $uAI[0] . " not found" . "<br>" . "<br>";
    } else {
        $urlFetched = str_get_html($url);
        foreach ($uAI[1] as $id) {
            $idFound = $urlFetched->find("#" . $id);
            if (!$idFound) {
                echo 'In page ' . $uAI[0] . ' -id not found- ' . $id . "<br>";
            }
        }
    }
}

The slash is being interpreted as part of the XPath expression, so it's looking for a child element named -IPods. There is no XPath "quote" type function either. I'm not sure whether adding a backslash would work, but it may be easier to just use a normal attribute selector on id: [@id='fk-list-MP3-Players-/-IPods']
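If swapping parsers for this one lookup is acceptable, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath instead of simple_html_dom. Inside an XPath string literal the slash is just an ordinary character. The sample markup below is made up for illustration and is not the real page.

```php
<?php
// Made-up markup standing in for the fetched page.
$html = '<div id="fk-list-Home-Audio">a</div>'
      . '<div id="fk-list-MP3-Players-/-IPods">b</div>';

libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$id = 'fk-list-MP3-Players-/-IPods';

// The id is placed inside a quoted XPath literal, so "/" is not a path step.
// Assumption: the id itself contains no single quote.
$nodes = $xpath->query("//*[@id='" . $id . "']");

echo $nodes->length, "\n";
```

An id that can contain quotes would need extra handling before being put into the expression.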


Considering escaped quotes in an "all characters except" type regex [duplicate]

I need to do some in-attribute JavaScript replacement to add custom JavaScript to the attribute. To be specific, to add a JS confirm() function around all of it. A rather hacky thing, but a thing I have to do regardless.
Here is the HTML tag I need to replace.
<input type='submit' id='gform_submit_button_4' class='gform_button button' value='Send' onclick='/* Lots of JS */' onkeypress='/* Lots of JS */' />
I have succeeded in doing it with the following PHP code.
$new_submit_html = $submit_html;
// __() is WordPress's function for internationalized text
$confirm_text = __("It will not be possible to modify your responses anymore if you continue.\\n\\nAre you sure you want to continue?", 'bonestheme');
$new_js_start = 'if( window.confirm("' . $confirm_text . '") ) { ';
$new_js_end = ' } else { event.preventDefault(); }';
$new_submit_html = preg_replace_callback( "/(onclick|onkeypress)(=')([^']*)(')/", function( $matches ) use( $new_js_start, $new_js_end ) {
$return_val = $matches[1] . $matches[2] . $new_js_start . $matches[3] . $new_js_end . $matches[4];
// (Other irrelevant manipulations)
return $return_val;
}, $new_submit_html );
return $new_submit_html;
This works like a charm right now, because the JavaScript where I wrote "Lots of JS" just so happens not to contain \' -- escaped single quotes -- which it could definitely contain.
I've seen this question, which would allow me to match the apostrophe unless it's escaped, but I'm not sure how to reverse it to match anything but an unescaped apostrophe. I imagine the solution will include lookbehinds, but I'm not sure how to proceed in this exact case.
I would use DOMDocument to do this as it won't care about the actual contents of the attribute as long as they are already valid:
function wrap_js($js) {
$confirm_text = "It will not be possible to modify your responses anymore if you continue.\\n\\nAre you sure you want to continue?";
$new_js_start = 'if( window.confirm("' . $confirm_text . '") ) { ';
$new_js_end = ' } else { event.preventDefault(); }';
return $new_js_start . $js . $new_js_end;
}
$html = "<input type='submit' id='gform_submit_button_4' class='gform_button button' value='Envoyer' onclick='/* Lots of JS */' onkeypress='/* Lots of JS */' />";
$doc = new DOMDocument();
$doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($doc);
foreach ($xpath->query("//input[@type='submit']") as $submit_input) {
foreach (['onclick', 'onkeypress'] as $attribute) {
if (($js = $submit_input->getAttribute($attribute)) != '') {
$submit_input->setAttribute($attribute, wrap_js($js));
}
}
}
echo $doc->saveHTML();
Output:
<input type="submit"
id="gform_submit_button_4"
class="gform_button button"
value="Envoyer"
onclick='if( window.confirm("It will not be possible to modify your responses anymore if you continue.\n\nAre you sure you want to continue?") ) { /* Lots of JS */ } else { event.preventDefault(); }'
onkeypress='if( window.confirm("It will not be possible to modify your responses anymore if you continue.\n\nAre you sure you want to continue?") ) { /* Lots of JS */ } else { event.preventDefault(); }'
>
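If a regex is still preferred, the usual way to say "anything except an unescaped apostrophe" needs no lookbehind: match either a character that is neither a quote nor a backslash, or a backslash followed by any one character. The sketch below assumes, as the question does, that \' inside the attribute should be treated as an escaped quote. The helper name wrap_handlers and the sample input are made up for illustration.

```php
<?php
// Hypothetical helper: wraps inline handlers, tolerating backslash-escaped quotes.
function wrap_handlers(string $html, string $start, string $end): string
{
    // (?:[^'\\]|\\.)* = runs of characters that are neither ' nor \,
    // or a backslash followed by any single character (e.g. \').
    $pattern = "/(onclick|onkeypress)(=')((?:[^'\\\\]|\\\\.)*)(')/s";

    return preg_replace_callback($pattern, function (array $m) use ($start, $end) {
        return $m[1] . $m[2] . $start . $m[3] . $end . $m[4];
    }, $html);
}

$input = "<input onclick='say(\\'hi\\'); go();' />";
echo wrap_handlers($input, 'if (ok()) { ', ' }'), "\n";
```

The callback builds the replacement by concatenation, so backslashes in the captured JavaScript are never reinterpreted as replacement references.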

Decode multiple xml tags inside using PHP

I'm looking for a 'smart way' of decoding multiple XML tags inside a string, i have the following function:
function b($params) {
$xmldata = '<?xml version="1.0" encoding="UTF-8" ?><root>' . html_entity_decode($params['data']) . '</root>';
$lang = ucfirst(strtolower($params['lang']));
if (simplexml_load_string($xmldata) === FALSE) {
return $params['data'];
} else {
$langxmlobj = new SimpleXMLElement($xmldata);
if ($langxmlobj -> $lang) {
return $langxmlobj -> $lang;
} else {
return $params['data'];
}
}
}
And trying out
$params['data'] = '<French>Service DNS</French><English>DNS Service</English> - <French>DNS Gratuit</French><English>Free DNS</English>';
$params['lang'] = 'French';
$a = b($params);
print_r($a);
But outputs:
Service DNS
I want it to output the content of every matching tag, so the result should be:
Service DNS - DNS Gratuit
I'm pulling my hair out. Any quick help or directions would be appreciated.
Edit: Refined the requirements.
It seems I wasn't clear enough, so here is another example.
If I have the following string as input:
The <French>Chat</French><English>Cat</English> is very happy to stay on stackoverflow
because it makes him <French>Heureux</French><English>Happy</English> to know that it
is the best <French>Endroit</French><English>Place</English> to find good people with
good <French>Réponses</French><English>Answers</English>.
So if i'd run function with 'French' it will return :
The Chat is very happy to stay on stackoverflow
because it makes him Heureux to know that it
is the best Endroit to find good people with
good Réponses.
And with 'English' :
The Cat is very happy to stay on stackoverflow
because it makes him Happy to know that it
is the best Place to find good people with
good Answers.
Hope it's more clear now.
Basically, I would first match each language section, such as:
<French>Chat</French><English>Cat</English>
with this:
"#(<($defLangs)>.*?</\\2>)+#i"
Then extract the string for the requested language in a callback.
If you have PHP 5.3+, then:
function transLang($str, $lang, $defLangs = 'French|English')
{
return preg_replace_callback ( "#(<($defLangs)>.*?</\\2>)+#i",
function ($matches) use($lang)
{
preg_match ( "/<$lang>(.*?)<\/$lang>/i", $matches [0], $longSec );
return $longSec [1];
}, $str );
}
echo transLang ( $str, 'French' ), "\n", transLang ( $str, 'English' );
If not, it is a little more complicated:
class LangHelper
{
private $lang;
function __construct($lang)
{
$this->lang = $lang;
}
public function callback($matches)
{
$lang = $this->lang;
preg_match ( "/<$lang>(.*?)<\/$lang>/i", $matches [0], $subMatches );
return $subMatches [1];
}
}
function transLang($str, $lang, $defLangs = 'French|English')
{
$langHelper = new LangHelper ( $lang );
return preg_replace_callback ( "#(<($defLangs)>.*?</\\2>)+#i",
array (
$langHelper,
'callback'
), $str );
}
echo transLang ( $str, 'French' ), "\n", transLang ( $str, 'English' );
If I understand you correctly you would like to remove all "language" tags, but keep the contents of the provided language.
The DOM is a tree of nodes. Tags are element nodes; the text is stored in text nodes. XPath lets you select nodes using expressions. So take all the child nodes of the language elements you want to keep and copy them just before the language node. Then remove all language nodes. This works even if the language elements contain other element nodes, like an <em>.
function replaceLanguageTags($fragment, $language) {
$dom = new DOMDocument();
$dom->loadXml(
'<?xml version="1.0" encoding="UTF-8" ?><content>'.$fragment.'</content>'
);
// get an xpath object
$xpath = new DOMXpath($dom);
// fetch all nodes with the language you like to keep
$nodes = $xpath->evaluate('//'.$language);
foreach ($nodes as $node) {
// copy all the child nodes to just before the found node
foreach ($node->childNodes as $childNode) {
$node->parentNode->insertBefore($childNode->cloneNode(TRUE), $node);
}
// remove the found node
$node->parentNode->removeChild($node);
}
// select all language nodes
$tags = array('English', 'French');
$nodes = $xpath->evaluate('//'.implode('|//', $tags));
foreach ($nodes as $node) {
// remove them
$node->parentNode->removeChild($node);
}
$result = '';
// we do not need the root node, so save all its children
foreach ($dom->documentElement->childNodes as $node) {
$result .= $dom->saveXml($node);
}
return $result;
}
$xml = <<<'XML'
The <French>Chat</French><English>Cat</English> is very happy to stay on stackoverflow
because it makes him <French>Heureux</French><English>Happy</English> to know that it
is the best <French>Endroit</French><English>Place</English> to find good people with
good <French>Réponses</French><English>Answers</English>.
XML;
var_dump(replaceLanguageTags($xml, 'English'));
var_dump(replaceLanguageTags($xml, 'French'));
Output:
string(146) "The Cat is very happy to stay on stackoverflow
because it makes him Happy to know that it
is the best Place to find good people with
good Answers."
string(153) "The Chat is very happy to stay on stackoverflow
because it makes him Heureux to know that it
is the best Endroit to find good people with
good Réponses."
What version of PHP are you on? I don't know what else could be different, but I copied & pasted your code and got the following output:
SimpleXMLElement Object
(
[0] => Service DNS
[1] => DNS Gratuit
)
Just to be sure, this is the code I copied from above:
<?php
function b($params) {
$xmldata = '<?xml version="1.0" encoding="UTF-8" ?><root>' . html_entity_decode($params['data']) . '</root>';
$lang = ucfirst(strtolower($params['lang']));
if (simplexml_load_string($xmldata) === FALSE) {
return $params['data'];
} else {
$langxmlobj = new SimpleXMLElement($xmldata);
if ($langxmlobj -> $lang) {
return $langxmlobj -> $lang;
} else {
return $params['data'];
}
}
}
$params['data'] = '<French>Service DNS</French><English>DNS Service</English> - <French>DNS Gratuit</French><English>Free DNS</English>';
$params['lang'] = 'French';
$a = b($params);
print_r($a);
Here's my suggestion. It should be fast and it is simple. You just need to strip the tags of the desired language and then remove any other tags along with their content.
The downside is that if you wish to use any other tags than the language one, you have to make sure that the opening one is different from the closing (e.g. <p >Lorem</p> instead of <p>Lorem</p>). On the other hand this allows you to add as many languages as you want, without keeping a list of them. You need to know only the default one (or just throw and catch exception) when the asked language is missing.
function only_lang($lang, $text) {
static $infinite_loop;
$result = str_replace("<$lang>", '', $text, $num_matches_open);
$result = str_replace("</$lang>", '', $result, $num_matches_close);
// Check if the text is malformed. Good place to throw an error
if($num_matches_open != $num_matches_close) {
//throw new Exception('Opening and closing tags does not match', 1);
return $text;
}
// Check if this language is present at all.
// Otherwise fallback to default language or throw an error
if( ! $num_matches_open) {
//throw new Exception('No such language', 2);
// Prevent infinite loop if even the default language is missing
if($infinite_loop) return $text;
$infinite_loop = __FUNCTION__;
return $infinite_loop('English', $text);
}
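// Strip any other language and return the result.
// The lazy .*? stops at the first matching closing tag, so two tags of the
// same language on one line do not swallow the text between them.
return preg_replace('!<([^>]+)>.*?</\\1>!s', '', $result);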
}
I got a simple one using regex. Useful, if the input only contains <lang>...</lang> tags.
function to_lang($lang="", $str="") {
return strip_tags(preg_replace('~<(\w+(?<!'.$lang.'))>.*</\1>~Us',"",$str));
}
echo to_lang("English","The happy <French>Chat</French><English>Cat</English>");
Removes each <tag>...</tag>, that is not the specified one in $lang. If there could be spaces/specials inside the <tag-name> e.g. <French-1> replace \w with [^/>].
Search pattern explained a bit
1.) <(\w+(?<!'.$lang.'))
< followed by one or more Word characters,
not matching $lang (using a negative lookbehind)
and capturing the <tag_name>
2.) .* followed by anything (ungreedy: modifier U, dot matches newlines: modifier s)
3.) </\1> until the captured tag is closed
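A variant of the same idea, sketched with a negative lookahead on the whole tag name and preg_quote() around the language. As far as I can tell, the lookbehind version also spares any tag whose name merely ends with $lang (for example <OldEnglish> when asking for English), which this variant avoids. The function name to_lang_strict is made up for illustration.

```php
<?php
// Hypothetical variant of to_lang(): drops every <tag>...</tag> pair whose
// name is not exactly $lang, then strips the remaining language tags.
function to_lang_strict(string $lang, string $str): string
{
    $quoted = preg_quote($lang, '~');
    // (?!Lang>) rejects only the exact tag name; .*? is lazy; s lets . match newlines.
    $stripped = preg_replace('~<(?!' . $quoted . '>)(\w+)>.*?</\1>~s', '', $str);

    return strip_tags($stripped);
}

echo to_lang_strict('English', 'The happy <French>Chat</French><English>Cat</English>'), "\n";
echo to_lang_strict('French', 'The happy <French>Chat</French><English>Cat</English>'), "\n";
```

Like the original, this is only suitable when the input contains nothing but language tags.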

simple_html_dom not returning <h1> elements?

I'm testing a parser using SIMPLE_HTML_DOM and while parsing
the returned HTML DOM from this URL: HERE
It is not finding the H1 elements...
I tried returning all the div's with success.
I'm using a simple request for diagnosing this problem:
foreach($html->find('H1') as $value) { echo "<br />F: ".htmlspecialchars($value); }
While looking at the source code I realized that:
h1 is upper case -> H1 - but the SIMPLE_HTML... is handling that:
//PaperG - If lowercase is set, do a case insensitive test of the value of the selector.
if ($lowercase) {
$check = $this->match($exp, strtolower($val), strtolower($nodeKeyValue));
} else {
$check = $this->match($exp, $val, $nodeKeyValue);
}
if (is_object($debugObject)) {$debugObject->debugLog(2, "after match: " . ($check ? "true" : "false"));}
Can anybody help me understand what is going on here?
Try This
$oHtml = str_get_html($html);
foreach($oHtml->find('h1') as $element)
{
echo $element->innertext;
}
You can also use a regular expression. The following function returns an array holding the inner text of all h1 tags, followed by their count:
function getH1($yourhtml)
{
$h1tags = preg_match_all("/(<h1.*>)(\w.*)(<\/h1>)/isxmU", $yourhtml, $patterns);
$res = array();
array_push($res, $patterns[2]);
array_push($res, count($patterns[2]));
return $res;
}
Found it, but I can't fully explain it!
I tested with other markup that includes H1 (upper case) and it worked.
While playing with the SIMPLE_HTML_DOM code I commented out the script-related "remove_noise" calls, and now it works perfectly. I think this website has invalid HTML, so the noise remover strips too much and does not stop at the closing script tags:
// $this->remove_noise("'<\s*script[^>]*[^/]>(.*?)<\s*/\s*script\s*>'is");
// $this->remove_noise("'<\s*script\s*>(.*?)<\s*/\s*script\s*>'is");
Thank you all for your help.
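For comparison, here is a minimal sketch that reads the headings with PHP's built-in DOMDocument instead of patching the library. This is not what the poster did. The parser behind loadHTML() lower-cases tag names, so H1 and h1 are found alike. The sample markup is made up.

```php
<?php
// Made-up sample: an upper-case H1, a script block, and a lower-case h1.
$html = '<H1>Title</H1><script>var a = 1;</script><h1>Second</h1>';

libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$headings = [];
foreach ($doc->getElementsByTagName('h1') as $h1) {
    $headings[] = trim($h1->textContent);
}

print_r($headings);
```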

PHP script that counts the number of outgoing links on a page and ignores the rel="nofollow" ones

I am a php newb but I am pretty sure this will be hard to accomplish and very server consuming. But I want to ask, get the opinion of much smarter users than myself.
Here is what I am trying to do:
I have a list of URL's, an array of URL's actually.
For each URL, I want to count the outgoing links - which DO NOT HAVE REL="nofollow" attribute - on that page.
So in a way, I'm afraid I'll have to make php load the page and preg match using regular expressions all the links?
Would this work if I'd had lets say 1000 links?
Here is what I am thinking, putting it in code:
$homepage = file_get_contents('http://www.site.com/');
$homepage = htmlentities($homepage);
// Do a preg_match for http:// and count the number of appearances:
$urls = preg_match();
// Do a preg_match for rel="nofollow" and count the nr of appearances:
$nofollow = preg_match();
// Do a preg_match for the number of "domain.com" appearances so we can subtract the website's internal links:
$internal_links = preg_match();
// Subtract and get the final result:
$result = $urls - $nofollow - $internal_links;
Hope you can help, and if the idea is right maybe you can help me with the preg_match functions.
You can use PHP's DOMDocument class to parse the HTML and parse_url to parse the URLs:
$url = 'http://stackoverflow.com/';
$pUrl = parse_url($url);
// Load the HTML into a DOMDocument
$doc = new DOMDocument;
@$doc->loadHTMLFile($url);
// Look for all the 'a' elements
$links = $doc->getElementsByTagName('a');
$numLinks = 0;
foreach ($links as $link) {
// Exclude if not a link or has 'nofollow'
preg_match_all('/\S+/', strtolower($link->getAttribute('rel')), $rel);
if (!$link->hasAttribute('href') || in_array('nofollow', $rel[0])) {
continue;
}
// Exclude if internal link
$href = $link->getAttribute('href');
if (substr($href, 0, 2) === '//') {
// Deal with protocol relative URLs as found on Wikipedia
$href = $pUrl['scheme'] . ':' . $href;
}
$pHref = @parse_url($href);
if (!$pHref || !isset($pHref['host']) ||
strtolower($pHref['host']) === strtolower($pUrl['host'])
) {
continue;
}
// Increment counter otherwise
echo 'URL: ' . $link->getAttribute('href') . "\n";
$numLinks++;
}
echo "Count: $numLinks\n";
You can use SimpleHTMLDOM:
// Create DOM from URL or file
$html = file_get_html('http://www.site.com/');
// Find all links
foreach($html->find('a[href][rel!=nofollow]') as $element) {
echo $element->href . '<br>';
}
As I'm not sure that SimpleHTMLDOM supports a :not selector and [rel!=nofollow] might only return a tags with a rel attribute present (and not ones where it isn't present), you may have to:
foreach($html->find('a[href][!rel][rel!=nofollow]') as $element)
Note the added [!rel]. Or, do it manually instead of with a CSS attribute selector:
// Find all links
foreach($html->find('a[href]') as $element) {
if (strtolower($element->rel) != 'nofollow') {
echo $element->href . '<br>';
}
}
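One caveat with the manual check: rel is a space-separated list of tokens, so a value such as "nofollow noopener" would slip past a plain string comparison. A small sketch of a token-based test follows. The helper name has_nofollow is made up for illustration.

```php
<?php
// Hypothetical helper: true if the rel attribute value contains the nofollow token.
function has_nofollow(string $rel): bool
{
    // Split on any whitespace and compare tokens case-insensitively.
    $tokens = preg_split('/\s+/', strtolower(trim($rel)), -1, PREG_SPLIT_NO_EMPTY);

    return in_array('nofollow', $tokens, true);
}

var_dump(has_nofollow('NoFollow noopener'));
var_dump(has_nofollow('external'));
var_dump(has_nofollow(''));
```

In the loop above, the condition would then read if (!has_nofollow((string) $element->rel)).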

php parser: Determine if string found via regex is inside an anchor tag

Edited. I know HTML should not be parsed with regex. I am asking for help. How can I find an arbitrary string in a mix of tags and text and then determine if it is inside an anchor?
I have an interactive glossary in my WordPress site. Part of its functionality is searching the content of a post for a glossary term (a text string). If found, the term is wrapped in a link to a custom taxonomy entry that contains the definition.
I like how it works, but one hitch is that if the term is already part of a link, the glossary parser hijacks the current link by inserting a link within the link. The parser is purely regex based; there is no DOM parsing. I know that HTML should not be parsed with regex, but currently the function just searches for a specific text string and does not try to do anything with tags at all.
Is there a relatively fast (in terms of processing) and reliable way to check whether the found string is inside an anchor tag? Obviously this would not always be the case, as the word could be inside any tag. When it is inside an anchor, the glossary parser should not add a link. I suspect this feature needs a DOM parser, but I'm unsure where to go from here.
The parser:
function glossary_parse($content){
//Run the glossary parser
if (((!is_page() && get_option('glossaryOnlySingle') == 0) OR
(!is_page() && get_option('glossaryOnlySingle') == 1 && is_single()) OR
(is_page() && get_option('glossaryOnPages') == 1))){
$glossary_index = get_children(array(
'post_type' => 'glossary',
'post_status' => 'publish',
));
$current_title = get_the_title();
if ($glossary_index){
$timestamp = time();
foreach($glossary_index as $glossary_item){
$timestamp++;
$glossary_title = $glossary_item->post_title;
if ($current_title == $glossary_title) {
continue;
}
$glossary_search = '/\b'.$glossary_title.'s*?\b(?=([^"]*"[^"]*")*[^"]*$)/i';
$glossary_replace = '<a'.$timestamp.'>$0</a'.$timestamp.'>';
if (get_option('glossaryFirstOnly') == 1) {
$content_temp = preg_replace($glossary_search, $glossary_replace, $content, 1);
}
else {
$content_temp = preg_replace($glossary_search, $glossary_replace, $content);
}
$content_temp = rtrim($content_temp);
$link_search = '/<a'.$timestamp.'>('.$glossary_item->post_title.'[A-Za-z]*?)<\/a'.$timestamp.'>/i';
if (get_option('glossaryTooltip') == 1) {
$link_replace = '<a class="glossaryLink" href="' . get_permalink($glossary_item) . '" title="Glossary: '. $glossary_title . '" onmouseover="tooltip.show(\'' . addslashes($glossary_item->post_excerpt) . '\');" onmouseout="tooltip.hide();">$1</a>';
}
else {
$link_replace = '<a class="glossaryLink" href="' . get_permalink($glossary_item) . '" title="Glossary: '. $glossary_title . '">$1</a>';
}
$content_temp = preg_replace($link_search, $link_replace, $content_temp);
$content = $content_temp;
}
}
}
return $content;
}
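One possible direction, sketched with PHP's built-in DOMDocument: select only the text nodes that have no <a> ancestor and wrap the term there. This is a minimal sketch, not a drop-in replacement for the parser above. The function name, the sample content, and the /glossary/cat URL are made up, only the first occurrence in each text node is linked, and the fragment is assumed to have its non-ASCII characters entity-encoded.

```php
<?php
// Hypothetical helper: links the first occurrence of $term in every text node
// that is not inside an <a> element.
function link_term_outside_anchors(string $html, string $term, string $href): string
{
    libxml_use_internal_errors(true);   // tolerate imperfect markup
    $doc = new DOMDocument();
    // A wrapper <div> gives the fragment a single root element.
    $doc->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $pattern = '/\b' . preg_quote($term, '/') . '\b/iu';

    // Copy the node list first, because the loop modifies the tree.
    foreach (iterator_to_array($xpath->query('//text()[not(ancestor::a)]')) as $text) {
        if (!preg_match($pattern, $text->nodeValue, $m, PREG_OFFSET_CAPTURE)) {
            continue;
        }
        // preg offsets are in bytes; splitText() counts characters.
        $start  = preg_match_all('/./su', substr($text->nodeValue, 0, $m[0][1]));
        $length = preg_match_all('/./su', $m[0][0]);

        $match = $text->splitText($start);   // node that begins at the term
        $match->splitText($length);          // cut off whatever follows the term

        $link = $doc->createElement('a');
        $link->setAttribute('href', $href);
        $match->parentNode->replaceChild($link, $match);
        $link->appendChild($match);
    }

    // Serialize the children of the wrapper, not the wrapper itself.
    $out = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo link_term_outside_anchors('A cat and <a href="/x">a cat link</a>.', 'cat', '/glossary/cat'), "\n";
```

Because the existing link's text node has an <a> ancestor, it is never selected, so no nested link can be created.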
