PHP and Simple DOM HTML Parser - Replace identical text string - php

I'm using the Simple DOM html parser php script in what seems to be a simple way, here's my code:
include('simple_html_dom.php');
$html = file_get_html($_SERVER['DOCUMENT_ROOT']."/wp-content/themes/genesis-sample-develop/cache-reports/atudem.html");
$snow_depth_min = $html->find('td', 115);
$snow_depth_max = $html->find('td', 116);
$snow_type = $html->find('td', 117);
The problem is with $snow_type. Sometimes the parsed text string is 'polvo' and sometimes it is 'polvo-dura'. I'm trying to replace 'polvo' with 'powder', and 'polvo-dura' with 'powder/packed'. If I do something like
if ($snow_type->innertext=='polvo-dura') {
$snow_type->innertext('powder');
}
or
$snow_type = str_replace("polvo", "powder", $snow_type);
$snow_type = str_replace("polvo-dura", "powder/packed", $snow_type);
it ends up with results like 'powder-dura' and weird things like that.
Obviously I'm new to PHP, so please have some patience with me ;) I would also like to understand why this happens and why a possible solution would work.
Thanks in advance

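// Compare against the exact values and write the translation back to the element.
if ($snow_type->innertext == 'polvo-dura') {
    $snow_type->innertext = 'powder/packed';
} else if ($snow_type->innertext == 'polvo') {
    $snow_type->innertext = 'powder';
}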

Provisional solution, using indexed arrays with preg_replace() :
$patterns = array();
$patterns[0] = '/-/';
$patterns[1] = '/polvo/';
$patterns[2] = '/dura/';
$replacements = array();
$replacements[0] = '/';
$replacements[1] = 'powder';
$replacements[2] = 'packed';
$snow_type_spanish_english = preg_replace($patterns, $replacements, $snow_type);
I have serious concerns about how it would work in real-world long complex texts, but for short-type data such as 'snow type' with values like 'a', 'b', 'a/b' or 'b/a', this can be just fine.
It would be great if someone comes up with a better solution. I've been searching all over the Internet for days and haven't found any specific solutions for text values that start with the same word, like 'powder' and 'powder-packed' for example.
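The 'powder-dura' result comes from the order of the str_replace() calls: the first call already turns 'polvo-dura' into 'powder-dura', so the second call finds nothing left to match. A minimal sketch of one way around this uses strtr() with an array, which tries the longest keys first and does not re-scan text it has already replaced. The function name translate_snow_type is made up for illustration, and the input is assumed to be the cell text (for example $snow_type->innertext) rather than the element object itself:

```php
<?php
// Hypothetical helper name; the mapping uses the values from the question.
function translate_snow_type(string $spanish): string
{
    // strtr() with an array tries the longest keys first and never
    // re-scans replaced text, so 'polvo-dura' wins over 'polvo'.
    return strtr($spanish, [
        'polvo-dura' => 'powder/packed',
        'polvo'      => 'powder',
    ]);
}

echo translate_snow_type('polvo'), "\n";
echo translate_snow_type('polvo-dura'), "\n";
```

With the two str_replace() calls, swapping their order (longest search string first) would also avoid the problem, but strtr() removes the dependency on ordering altogether.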

Related

Improve this regex to prevent preg_replace to throw a PREG_BACKTRACK_LIMIT_ERROR

I want to remove all the script tags from an HTML page, except those containing the word foo or bar.
So I came up with this statement:
$content = preg_replace('#<script((?!foo|bar).)*?</script>#is', '', $content);
echo "Last error: " . preg_last_error();
This works on smaller pages. But now I have a page with 30 big script-tags and it doesn't work.
The error I get is: PREG_BACKTRACK_LIMIT_ERROR
So I think I need to improve my regex to prevent this error, because this statement works:
$content = preg_replace('#<script.*?</script>#is', '', $content);
But this statement is removing all the script-tags, while I want to keep some of them.
There are solutions that involve increasing pcre.backtrack_limit, but I don't want to go that route. There should be a better solution imho.
The thing is that I don't know how to fix this, because the issue is with the regex as far as I can see.
Could you guide me to make the regex better so this error won't occur?
I would strongly suggest not using regular expressions here, but making use of DOM parsing instead, which is often more appropriate in this kind of scenario:
$doc = new \DOMDocument();
$doc->loadHTML($html, LIBXML_HTML_NODEFDTD);
$xpath = new \DOMXPath($doc);
foreach ($xpath->query('//script[contains(text(), \'foo\') or contains(text(), \'bar\')]') as $script_tag) {
$script_tag->parentNode->removeChild($script_tag);
}
echo $doc->saveHTML();
If you have more words, you can build your xpath query from an array instead:
$blacklist = ['foo', 'bar', 'apple', 'cold'];
$query = '//script[' . join(' or ', array_map(function($banword) {
return "contains(text(), '$banword')";
}, $blacklist)) . ']';
foreach ($xpath->query($query) as $script_tag) {
$script_tag->parentNode->removeChild($script_tag);
}
Demo: https://3v4l.org/dHGDt
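Note that the question asks to keep the scripts that mention foo or bar and remove the rest, whereas the queries above select the scripts that do contain those words. A sketch of the inverted selection follows. The function name remove_scripts_except is hypothetical, the words are assumed not to contain a single quote (which would break the XPath string), and contains(., ...) only looks at the script body, not at attributes such as src:

```php
<?php
// Hypothetical helper: removes every <script> whose body mentions none of $keep_words.
function remove_scripts_except(string $html, array $keep_words): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate real-world markup
    $doc->loadHTML($html, LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $conditions = array_map(function ($word) {
        return "contains(., '$word')";
    }, $keep_words);
    // With no words to keep, every script is selected.
    $query = $conditions
        ? '//script[not(' . implode(' or ', $conditions) . ')]'
        : '//script';

    $xpath = new DOMXPath($doc);
    // Copy the node list into an array before removing nodes from the tree.
    foreach (iterator_to_array($xpath->query($query)) as $script_tag) {
        $script_tag->parentNode->removeChild($script_tag);
    }
    return $doc->saveHTML();
}

$page = '<html><body><script>var foo = 1;</script>'
      . '<script>var x = 2;</script><p>hi</p></body></html>';
echo remove_scripts_except($page, ['foo', 'bar']);
```

Because the DOM parser does the matching, there is no backtracking involved, so the size of the script blocks should not trigger PREG_BACKTRACK_LIMIT_ERROR.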

unicode chars with wikipedia search in PHP

I pass a PHP string to wikipedia search page in order to retrieve part of the definition.
Everything works fine, except for Unicode characters, which appear in the \u... form. Here is an example to make this clearer. As you can see, the phonetic transcription of the name is not readable:
Henrik Ibsen, Henrik Ibsen \u02c8h\u025bn\u027eik \u02c8ips\u0259n
(Skien, 20 marzo 1828 - Oslo, 23 maggio 1906) è stato uno scrittore,
drammaturgo, poeta e regista teatrale norvegese.
The code I use to get the snippet from Wikipedia is this:
$word = $_GET["word"];
$html = file_get_contents('https://it.wikipedia.org/w/api.php?action=opensearch&search='.$word);
$utf8html = html_entity_decode(preg_replace("/U\+([0-9A-F]{4})/", "&#x\\1;", $html), ENT_NOQUOTES, 'UTF-8');
The last line of my code does not solve the problem.
Do you know how to get a clean text which is entirely readable?
The output of the Wikipedia search API is JSON. Don't try to scrape bits out of it and parse string literal escapes yourself, that way madness lies. Just use a readily available JSON parser.
Also, you need to URL-escape the word when you add it into a query string, otherwise any searches for words with URL-special characters in will fail.
In summary:
$word = $_GET['word'];
$url = 'https://it.wikipedia.org/w/api.php?action=opensearch&search='.urlencode($word);
$response = json_decode(file_get_contents($url));
$matching_titles_array = $response[1];
$matching_summaries_array = $response[2];
$matching_urls = $response[3];
...etc...
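To see why the JSON parser makes the \u... sequences disappear, here is a minimal sketch that runs json_decode() on a hard-coded string shaped like an opensearch response. The sample payload is written by hand for illustration and is not fetched from Wikipedia:

```php
<?php
// Hand-written sample payload; a real one would come from file_get_contents($url).
// Single quotes keep the \u escapes literal in the PHP source.
$json = '["ibsen",["Henrik Ibsen"],'
      . '["Henrik Ibsen \u02c8h\u025bn\u027eik \u02c8ips\u0259n"],'
      . '["https://it.wikipedia.org/wiki/Henrik_Ibsen"]]';

$response = json_decode($json);
if (!is_array($response)) {
    // json_decode() returns null when the input is not valid JSON.
    throw new RuntimeException('Unexpected response: ' . json_last_error_msg());
}

// The decoder has already turned every \uXXXX escape into UTF-8 text.
$summary = $response[2][0];
echo $summary, "\n";
```

The decoded string is UTF-8, so the page that prints it needs to declare that encoding (for example with a Content-Type header that specifies charset=utf-8) for the phonetic characters to display correctly.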
You have some errors in your regex string; try using:
<?php
$str = "Henrik Ibsen, Henrik Ibsen \u02c8h\u025bn\u027eik \u02c8ips\u0259n(Skien, 20 marzo 1828 - Oslo, 23 maggio 1906) è stato uno scrittore, drammaturgo, poeta e regista teatrale norvegese.";
$utf8html = preg_replace('#\\\U([0-9A-F]{4})#i', "&#x\\1;", $str); // the trailing ";" terminates each entity
echo $utf8html;
Well, the answer posted by bobince is certainly more effective than my previous procedure, which aimed at scraping and pruning bit by bit what I needed. Just to show you how I was doing it, here is my previous code:
$html = file_get_contents('https://it.wikipedia.org/w/api.php?action=opensearch&search='.$s);
$decoded = preg_replace('#\\\U([0-9A-F]{4})#i', "&#x\\1", $html);
$par = array("[", "]");
$def_no_par = str_replace($par, "", $decoded);
$def_no_vir = str_replace("\"\",", "", $def_no_par);
$def_cap = str_replace("\",", "\",<br>", $def_no_vir);
$def_pulita = str_replace("\"", "", $def_cap);
$def_clean = str_replace(".,", ".", $def_pulita);
$definizione = str_replace("$s,", "", $def_clean);
$out = str_replace("\\", "\"", $definizione);
As you can see, removing parts of the output to make it more readable was quite tiresome (and not completely successful).
Using the JSON approach makes everything more linear. Here is my new workaround:
$search = 'https://it.wikipedia.org/w/api.php?action=opensearch&search='.urlencode($s);
$response = json_decode(file_get_contents($search));
$matching_titles_array = $response[1];
$matching_summaries_array = $response[2];
$matching_urls = $response[3];
echo '<h3><div align="center"><font color=" #A3A375">'.$titolo.'</font></div></h3><br><br>';
foreach($response[1] as $t) {
echo '<font color="#5C85D6"><b>'.$t.'</b></font><br><br>';
}
foreach($response[2] as $s) {
echo $s.'<br><br>';
}
foreach($response[3] as $l) {
$link = preg_replace('!(((f|ht)tp(s)?://)[-a-zA-Zа-яА-Я()0-9#:%_+.~#?&;//=]+)!i', '$1', $l);
echo $link.'<br><br>';
}
The advantage is that now I can manipulate the arrays as I wish.
You can see it in action here:

PHP replace words to links except images

My code is
$words = array();
$links = array();
$result = mysql_query("SELECT `keyword`, `link` FROM `articles` where `link`!='".$act."' ")
or die(mysql_error());
$i = 0;
while($row = mysql_fetch_array( $result ))
{
if (!empty($row['keyword']))
{
$words[$i] = '/(?<!(src="|alt="))'.$row['keyword'].'/i';
$links[$i] = ''.$row['keyword'].'';
$i++;
}
}
$text = preg_replace($words, $links, $text);
I want to replace Hello with Guys except img src and alt.
From
Say Hello my dear <img src="say-hello-my-dear.jpg" alt="say hello my dear" />
I want
Say Guys my dear <img src="say-hello-my-dear.jpg" alt="say hello my dear" />
The current code, replaces only when my keyword has only 1 word.
EDIT: the previously suggested correction was not relevant.
Still:
I would suggest you not to use any regex but only str_replace in your case if you have a performance constraint.
You should also replace the legacy mysql_* functions, which were deprecated in PHP 5.5 and removed in PHP 7.0, with mysqli or PDO: http://php.net/manual/en/function.mysql-fetch-array.php
EDIT: I can't believe it took me that long to understand that you're trying to parse big chunks of HTML with regular expressions.
Read the answer to this question:
RegEx match open tags except XHTML self-contained tags
Edit: I updated the code to work better.
I'm not sure exactly what the issue is, but looking at your code I wouldn't be surprised if the negative lookbehind fails for multi-word strings where the keyword is not the first word after src=" or alt=". It might be possible to beef up the regex, but IMHO a complicated regex would be too brittle for your HTML parsing needs. I'd recommend doing some basic HTML parsing yourself and doing a simple string replace in the right places.
Here's some basic code. There is certainly a much better solution than this, but I'm not going to spend too much time on this. Probably, rather than inserting html in a text node, you should create a new html a element with the right attributes. Then you wouldn't have to decode it. But this would be my basic approach.
$text = "Lorem ipsum <img src=\"lorem ipsum\" alt=\"dolor sit amet\" /> dolor sit amet";
$result = array(
array('keyword' => 'lorem', 'link' => 'http://www.google.com'),
array('keyword' => 'ipsum', 'link' => 'http://www.bing.com'),
array('keyword' => 'dolor sit', 'link' => 'http://www.yahoo.com'),
);
$doc = new DOMDocument();
$doc->loadHTML($text);
$xpath = new DOMXPath($doc);
foreach($result as $row) {
if (!empty($row['keyword'])) {
$search = $row['keyword'];
$replace = '<a href="'.$row['link'].'">'.$row['keyword'].'</a>';
$text_nodes = $xpath->evaluate('//text()');
foreach($text_nodes as $text_node) {
$text_node->nodeValue = str_ireplace($search, $replace, $text_node->nodeValue);
}
}
}
echo html_entity_decode($doc->saveHTML());
The $result data structure is meant to mimic the rows returned by your mysql_fetch_array() loop. The //text() query selects every text node in the document, so attribute values such as src and alt are never touched. I hope this helps you.
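The answer mentions that creating a real a element would be cleaner than inserting HTML into a text node and decoding it afterwards. A sketch of that variant follows, under some assumptions: the function name link_keywords is made up, keywords and text are assumed to be ASCII (stripos() counts bytes while splitText() counts characters), and text that is already inside a link is skipped:

```php
<?php
// Hypothetical helper: wraps each keyword found in text nodes in a real <a> element.
function link_keywords(string $html, array $rows): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup
    $doc->loadHTML($html, LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    $xpath = new DOMXPath($doc);

    foreach ($rows as $row) {
        $keyword = $row['keyword'];
        if ($keyword === '') {
            continue;
        }
        // Snapshot the text nodes first, because the loop below edits the tree.
        $nodes = iterator_to_array($xpath->query('//text()[not(ancestor::a)]'));
        foreach ($nodes as $node) {
            while (($pos = stripos($node->nodeValue, $keyword)) !== false) {
                $match = $node->splitText($pos);             // starts at the keyword
                $rest = $match->splitText(strlen($keyword)); // text after the keyword
                $a = $doc->createElement('a');
                $a->setAttribute('href', $row['link']);
                $match->parentNode->replaceChild($a, $match);
                $a->appendChild($match);
                $node = $rest;                               // keep scanning the tail
            }
        }
    }

    // Return only the contents of <body>, which libxml adds around fragments.
    $body = $doc->getElementsByTagName('body')->item(0);
    $out = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $doc->saveHTML($child);
        }
    }
    return $out;
}

$text = 'Say Hello my dear <img src="say-hello-my-dear.jpg" alt="say hello my dear" />';
echo link_keywords($text, [
    ['keyword' => 'hello', 'link' => 'http://www.google.com'],
]);
```

Since only text nodes are touched, the src and alt attributes stay as they are, multi-word keywords need no special handling, and the html_entity_decode() step is no longer needed.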

bbCode in an another

I made a custom function that acts like bbCode, using preg_replace and regex. The only problem is that if I use more than one bbCode tag at once, only one of them works.
[align=center][img]myimagelink[/img][/align]
If I enter this line, then the image appears BUT the [align=center]image[/align] also. How can I avoid this problem?
$patterns[2] = '#\[align=(.*)\](.*)\[\/align\]#si';
$patterns[9] = '#\[img\](.*\.jpg)\[\/img\]#si';
$replacements[2] = '<table align=\1><tr><td align=\1>\2</td></tr></table>';//ALIGN
$replacements[9] = '<img src=\"$1\"/>';//image
Changing the .* expressions to non-greedy (.*?) will work for you.
Example:
$in = '[align=center][img]myimagelink[/img][/align]';
$patterns = array(
'~\[align=(left|right|center)\](.*?)\[/align\]~' => '<div style="text-align: $1">$2</div>',
'~\[img](.*?)\[/img\]~' => '<img src="$1" />',
);
$rep = preg_replace(array_keys($patterns), $patterns, $in);
echo htmlspecialchars($rep);
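To make the difference concrete, the following sketch runs the align pattern from the question against its own input, once with greedy and once with non-greedy groups, and prints what each group captures:

```php
<?php
$in = '[align=center][img]myimagelink[/img][/align]';

// Greedy: the first (.*) runs on to the last "]" that still lets the rest match.
preg_match('#\[align=(.*)\](.*)\[/align\]#si', $in, $greedy);

// Non-greedy: each (.*?) stops at the first point where the rest can match.
preg_match('#\[align=(.*?)\](.*?)\[/align\]#si', $in, $lazy);

var_dump($greedy[1], $lazy[1], $lazy[2]);
```

The greedy version captures center][img]myimagelink[/img as the alignment value and leaves nothing for the content, which is why the [img] tag is never seen as a tag of its own. The non-greedy version captures center and [img]myimagelink[/img] separately.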
Rather than reinventing the wheel I recommend using an existing javascript library.
I believe StackOverflow uses Prettify to format user input.
As @nickb stated, your patterns are greedy. (.*) grabs everything. Try changing it to (.*?).
Treat all tags as singles, not pairs:
$patterns[2] = '#\[align=(.*?)\]#si'; // non-greedy, so it stops at the first "]"
$patterns[3] = '#\[\/align\]#si';
$patterns[9] = '#\[img\](.*?\.jpg)\[\/img\]#si';
$replacements[2] = '<div align="$1">';//ALIGN
$replacements[3] = '</div>';//ALIGN
$replacements[9] = '<img src="$1"/>';//image

PHP Coding Restrict Preg_replace function from multiple tags

I have a great little script that will search a file and replace a list of words with their matching replacement word. I have also found a way to prevent preg_replace from replacing those words if they appear in anchor tags, img tags, or really any one tag I specify. I would like to create an OR statement to be able to specify multiple tags. To be clear, I would like to prevent preg_replace from replacing words that not only appear in an anchor tag, but any that appear in an anchor,link,embed,object,img, or span tag. I tried using the '|' OR operator at various places in the code with no success.
<?php
$data = 'somefile.html';
$data = file_get_contents($data);
$search = array ("/(?!(?:[^<]+>|[^>]+<\/a>))\b(red)\b/is","/(?!(?:[^<]+>|[^>]+<\/a>))\b(white)\b/is","/(?!(?:[^<]+>|[^>]+<\/a>))\b(blue)\b/is");
$replace = array ('Apple','Potato','Boysenberry');
echo preg_replace($search, $replace, $data);
print $data;
?>
Looking at the first search term, which basically says to search for "red" but not inside an anchor tag:
"/(?!(?:[^<]+>|[^>]+<\/a>))\b(red)\b/is"
I am trying to figure out how I can somehow add <\/link>,<\/embed>,<\/object>,<\/img> to this search so that preg_replace doesn't replace 'red' in any of those tags either.
Something like this?:
<?php
$file = 'somefile.html';
$data = file_get_contents($file);
print "Before:\n$data\n";
$from_to = array("red"=>"Apple",
"white"=>"Potato",
"blue"=>"Boysenberry");
$tags_to_avoid = array("a", "span", "object", "img", "embed");
$patterns = array();
$replacements = array();
foreach ($from_to as $from=>$to) {
$patterns[] = "/(?!(?:[^<]*>|[^>]+<\/(".implode("|",$tags_to_avoid).")>))\b".preg_quote($from)."\b/is";
$replacements[] = $to;
}
$data = preg_replace($patterns, $replacements, $data);
print "After:\n$data\n";
?>
Result:
Before:
red
<span class="blue">red</span>
blue<div class="blue">white</div>
<div class="blue">red</div>
After:
red
<span class="blue">red</span>
Boysenberry<div class="blue">Potato</div>
<div class="blue">Apple</div>
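As an alternative to extending the lookahead, here is a sketch that parses the markup and only rewrites text nodes that are not inside one of the tags to avoid. The function name replace_words_outside_tags is hypothetical, the tag names are assumed to be plain element names, and the output is a full document because DOMDocument adds html and body wrappers:

```php
<?php
// Hypothetical helper: replaces whole words in text nodes outside the listed tags.
function replace_words_outside_tags(string $html, array $from_to, array $tags_to_avoid): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup
    $doc->loadHTML($html, LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // One whole-word, case-insensitive pattern per search term.
    $patterns = [];
    foreach (array_keys($from_to) as $from) {
        $patterns[] = '/\b' . preg_quote($from, '/') . '\b/i';
    }

    // Select text nodes that have none of the listed tags as an ancestor.
    $skip = array_map(function ($tag) {
        return "ancestor::$tag";
    }, $tags_to_avoid);
    $query = '//text()' . ($skip ? '[not(' . implode(' or ', $skip) . ')]' : '');

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query($query) as $text_node) {
        $text_node->nodeValue = preg_replace(
            $patterns,
            array_values($from_to),
            $text_node->nodeValue
        );
    }
    return $doc->saveHTML();
}

$data = "red\n<span class=\"blue\">red</span>\n"
      . "blue<div class=\"blue\">white</div>\n<div class=\"blue\">red</div>";
echo replace_words_outside_tags(
    $data,
    ['red' => 'Apple', 'white' => 'Potato', 'blue' => 'Boysenberry'],
    ['a', 'span', 'object', 'img', 'embed']
);
```

Attribute values such as class="blue" are never candidates here, because only text nodes are visited, and adding another tag to avoid is a matter of adding its name to the array.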
