PHP str_replace scraped content with wild card? - php

I'm looking for a solution to strip some HTML from a scraped HTML page. The page has some repetitive data I would like to delete so I tried with preg_replace() to delete the variable data.
Data I want to strip:
Producent:<td class="datatable__body__item" data-title="Producent">Example
Groep:<td class="datatable__body__item" data-title="Produkt groep">Example1
Type:<td class="datatable__body__item" data-title="Produkt type">Example2
....
...
Must be like this afterwards:
Producent:Example
Groep:Example1
Type:Example2
So a big piece is the same except the word within the data-title piece. How could I delete this piece of data?
I tried a few things like this one:
$pattern = '/<td class=\"datatable__body__item\"(.*?)>/';
$tech_specs = str_replace($pattern,"", $tech_specs);
But that didn't work. Is there any solution to this?

Just use a wildcard:
$newstr = preg_replace('/<td class="datatable__body__item" data-title=".*?">/', '', $str);
.*? means match anything but don't be greedy

Assuming that the string looked like this:
$string = 'Producent:<td class="datatable__body__item" data-title="Producent">Example';
You could get the beginning and the end of the string with this:
preg_match('/^(\w+:).*\>(\w+)/', $string, $matches);
echo implode([$matches[1], $matches[2]]);
Which, in this case, will throw Producent:Example. So, then you could add this output to another variable/array you intend to use.
OR, since you mentioned replacing:
$string = preg_replace('/^(\w+:).*\>(\w+)/', '$1$2', $string);
But then again, checking as it would probably come in a variable number of lines:
$string = 'Producent:<td class="datatable__body__item" data-title="Producent">Example
Groep:<td class="datatable__body__item" data-title="Produkt groep">Example1
Type:<td class="datatable__body__item" data-title="Produkt type">Example2';
$stringRows = explode(PHP_EOL, $string);
$pattern = '/^(\w+:).*\>(\w+)/';
$replacement = '$1$2';
foreach ($stringRows as &$stringRow) {
$stringRow = preg_replace($pattern, $replacement, $stringRow);
}
$string = implode(PHP_EOL, $stringRows);
Which will then output the string like you expect.
Explaining my regex:
the first group catches the first word until the two dots :, then another group to catch the last word. I had previously specified anchors for both ends, but when breaking each line this wouldn't work as expected, so I kept only the beginning.
^(\w+:) => the word in the beginning of the string until two dots appear
.*\> => everything else until smaller symbol appears (escaped by slash)
(\w+) => the word after the smaller than symbol

Well maybe my question wasn't that good written. I had a table which I needed to scrape from a website. I needed the info in the table, but had to cleanup some parts as mentioned. The solution I finally made was this one and it works. It still has a little work to do with manual replacements but that is because of the stupid " they use for inch. ;-)
Solution:
\\ find the table in the sourcecode
foreach($techdata->find('table') as $table){
\\ filter out the rows
foreach($table->find('tr') as $row){
\\ take the innertext using simplehtmldom
$tech_specs = $row->innertext;
\\ strip some 'garbage'
$tech_specs = str_replace(" \t\t\t\t\t\t\t\t\t\t\t<td class=\"datatable__body__item\">","", $tech_specs);
\\ find the first word of the string so I can use it
$spec1 = explode('</td>', $tech_specs)[0];
\\ use the found string to strip down the rest of the table
$tech_specs = str_replace("<td class=\"datatable__body__item\" data-title=\"" . $spec1 . "\">",":", $tech_specs);
\\ manual correction because of the " used
$tech_specs = str_replace("<td class=\"datatable__body__item\" data-title=\"tbv Montage benodigde 19\">",":", $tech_specs);
\\ manual correction because of the " used
$tech_specs = str_replace("<td class=\"datatable__body__item\" data-title=\"19\">",":", $tech_specs);
\\ strip some 'garbage'
$tech_specs = str_replace("\t\t\t\t\t\t\t\t\t\t","\n", $tech_specs);
$tech_specs = str_replace("</td>","", $tech_specs);
$tech_specs = str_replace(" ","", $tech_specs);
\\ put the clean row in an array ready for usage
$specs[] = $tech_specs;
}
}

Related

Preg replace callback validation

So I need to re-write some old code that I found on a library.
$text = preg_replace("/(<\/?)(\w+)([^>]*>)/e",
"'\\1'.strtolower('\\2').'\\3'", $text);
$text = preg_replace("/<br[ \/]*>\s*/","\n",$text);
$text = preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", "\n",
$text);
And for the first one I have tried like this:
$text = preg_replace_callback(
"/(<\/?)(\w+)([^>]*>)/",
function($subs) {
return strtolower($subs[0]);
},
$text);
I'm a bit confused b/c I don't understand this part: "'\\1'.strtolower('\\2').'\\3'" so I'm not sure what should I replace it with.
As far as I understand the first line looks for tags, and makes them lowercase in case I have data like
<B>FOO</B>
Can you guys help me out here with a clarification, and If my code is done properly?
The $subs is an array that contains the whole value in the first item and captured texts in the subsequent items. So, Group 1 is in $subs[1], Group 2 value is in $subs[2], etc. The $subs[0] contains the whole match value, and you applied strtolower to it, but the original code left the Group 3 value (captured with ([^>]*>) that may also contain uppercase letters) intact.
Use
$text = preg_replace_callback("~(</?)(\w+)([^>]*>)~", function($subs) {
return $subs[1] . strtolower($subs[2]) . $subs[3];
}, $text);
See the PHP demo.

php preg_replace pattern - replace text between commas

I have a string of words in an array, and I am using preg_replace to make each word into a link. Currently my code works, and each word is transformed into a link.
Here is my code:
$keywords = "shoes,hats,blue curtains,red curtains,tables,kitchen tables";
$template = '%1$s';
$newkeys = preg_replace("/(?!(?:[^<]+>|[^>]+<\/a>))\b([a-z]+)\b/is", sprintf($template, "\\1"), $keywords);
Now, the only problem is that when I want 2 or 3 words to be a single link. For example, I have a keyword "blue curtains". The script would create a link for the word "blue" and "curtains" separately. I have the keywords separated by commas, and I would like the preg_replace to only replace the text between the commas.
I've tried playing around with the pattern, but I just can't figure out what the pattern would be.
Just to clarify, currently the output looks as follows:
shoes,hats,blue curtains,red curtains,tables,kitchen tables
While I want to achieve the following output:
shoes,hats,blue curtains,red curtains,tables,kitchen tables
A little bit change in preg_replace code and your job will done :-
$keywords = "shoes,hats,blue curtains,red curtains,tables,kitchen tables";
$template = '%1$s';
$newkeys = preg_replace("/(?!(?:[^<]+>|[^>]+<\/a>))\b([a-z ' ']+)\b/is", sprintf($template, "\\1"), $keywords);
OR
$newkeys = preg_replace("/(?!(?:[^<]+>|[^>]+<\/a>))\b([a-z' ']+)\b/is", sprintf($template, "\\1"), $keywords);
echo $newkeys;
Output:- http://prntscr.com/77tkyb
Note:- I just added an white-space in your preg_replace. And you can easily get where it is. I hope i am clear.
Matching white-space along with words is missing there in preg_replace and i added that only.

Regex to add line breaks before and after a string?

The following code removes comments, line breaks, and extra space from HTML and PHP files, but a problem I have is when the original file has <<<EOT; in it. What regex rule would I use to add a linebreak before and after <<<EOT; from $pre6?
//a bit messy, but this is the core of the program. removes whitespaces, line breaks, and comments. sometimes makes EOT error.
$pre1 = preg_replace('#<!--[^\[<>].*?(?<!!)-->#s', '', preg_replace('~>\s+<~', '><', trim(preg_replace('/\s\s+/', ' ', php_strip_whitespace(stripslashes(htmlspecialchars($uploadfile)))))));
$pre2 = preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", "\n", $pre1);
$pre3 = str_replace(array("\r\n", "\r"), "\n", $pre2);
$pre4 = explode("\r\n", $pre3);
$pre5 = array();
foreach ($pre4 as $i => $line) {
if(!empty($line))
$pre5[] = trim($line);
}
$pre6 = implode($pre5);
echo $pre6;
To match <<<EOT, you could use <{3}[A-Z]{3}, or several other patterns, depending on how strictly you want to match that exact text.
Oh, I see what you're after now. I'm not great with PHP, but in regular expressions, you can capture a named group and then refer to that group in a replacement operation. You could use the following to capture <<<EOT into a group named Capture:
(?<Capture><{3}[A-Z]{3})
I think in PHP you can refer to it using something like:
$regs['Capture']
So maybe you're after a replacement parameter value of something like:
"\r\n".$regs['Capture']."\r\n"
...if $regs was the parameter passed to the replace operation.

regex for breadcrumb in php

I am currently building breadcrumb. It works for example for
http://localhost/researchportal/proposal/
<?php
$url_comp = explode('/',substr($url,1,-1));
$end = count($url_comp);
print_r($url_comp);
foreach($url_comp as $breadcrumb) {
$landing="http://localhost/";
$surl .= $breadcrumb.'/';
if(--$end)
echo '
<a href='.$landing.''.$surl.'>'.$breadcrumb.'</a>ยป';
else
echo '
'.$breadcrumb.'';
};?>
But when I typed in http://localhost////researchportal////proposal//////////
All the formatting was gone as it confuses my code.
I need to have the site path in an array like ([1]->researchportal, [2]->proposal)
regardless of how many slashes I put.
So can $url_comp = explode('/',substr($url,1,-1)); be turned into a regular expression to get my desired output?
You don't need regex. Look at htmlentities() and stripslashes() in the PHP manual. A regex will return a boolean value of whatever it says, and won't really help you achieve what you are trying to do. All the regex can let you do is say if the string matches the regex do something. If you put in a regex requiring at least 2 characters between each slash, then any time anyone puts more than one consecutive slash in there, the if statement will stop.
http://ca3.php.net/manual/en/function.stripslashes.php
http://ca3.php.net/manual/en/function.htmlentities.php
Found this on the php manual.
It uses simple str_replace statements, modifying this should achieve exactly what your post was asking.
<?
function stripslashes2($string) {
$string = str_replace("\\\"", "\"", $string);
$string = str_replace("\\'", "'", $string);
$string = str_replace("\\\\", "\\", $string);
return $string;
}
?>

Regular Expressions: how to do "option split" replaces

those reqular expressions drive me crazy. I'm stuck with this one:
test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not
Task:
Remove all [[ and ]] and if there is an option split choose the later one so output should be:
test1:link test2:silver test3:out1insideout2 test4:this|not
I came up with (PHP)
$text = preg_replace("/\\[\\[|\\]\\]/",'',$text); // remove [[ or ]]
this works for part1 of the task. but before that I think I should do the option split, my best solution:
$text = preg_replace("/\\[\\[(.*\|)(.*?)\\]\\]/",'$2',$text);
Result:
test1:silver test3:[[out1[[inside]]out2]] this|not
I'm stuck. may someone with some free minutes help me? Thanks!
I think the easiest way to do this would be multiple passes. Use a regular expression like:
\[\[(?:[^\[\]]*\|)?([^\[\]]+)\]\]
This will replace option strings to give you the last option from the group. If you run it repeatedly until it no longer matches, you should get the right result (the first pass will replace [[out1[[inside]]out2]] with [[out1insideout2]] and the second will ditch the brackets.
Edit 1: By way of explanation,
\[\[ # Opening [[
(?: # A non-matching group (we don't want this bit)
[^\[\]] # Non-bracket characters
* # Zero or more of anything but [
\| # A literal '|' character representing the end of the discarded options
)? # This group is optional: if there is only one option, it won't be present
( # The group we're actually interested in ($1)
[^\[\]] # All the non-bracket characters
+ # Must be at least one
) # End of $1
\]\] # End of the grouping.
Edit 2: Changed expression to ignore ']' as well as '[' (it works a bit better like that).
Edit 3: There is no need to know the number of nested brackets as you can do something like:
$oldtext = "";
$newtext = $text;
while ($newtext != $oldtext)
{
$oldtext = $newtext;
$newtext = preg_replace(regexp,replace,$oldtext);
}
$text = $newtext;
Basically, this keeps running the regular expression replace until the output is the same as the input.
Note that I don't know PHP, so there are probably syntax errors in the above.
This is impossible to do in one regular expression since you want to keep content in multiple "hierarchies" of the content. It would be possible otherwise, using a recursive regular expression.
Anyways, here's the simplest, most greedy regular expression I can think of. It should only replace if the content matches your exact requirements.
You will need to escape all backslashes when putting it into a string (\ becomes \\.)
\[\[((?:[^][|]+|(?!\[\[|]])[^|])++\|?)*]]
As others have already explained, you use this with multiple passes. Keep looping while there are matches, performing replacement (only keeping match group 1.)
Difference from other regular expressions here is that it will allow you to have single brackets in the content, without breaking:
test1:[[link]] test2:[[gold|si[lv]er]]
test3:[[out1[[in[si]de]]out2]] test4:this|not
becomes
test1:[[link]] test2:si[lv]er
test3:out1in[si]deout2 test4:this|not
Why try to do it all in one go. Remove the [[]] first and then deal with options, do it in two lines of code.
When trying to get something going favour clarity and simplicity.
Seems like you have all the pieces.
Why not just simply remove any brackets that are left?
$str = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not';
$str = preg_replace('/\\[\\[(?:[^|\\]]+\\|)+([^\\]]+)\\]\\]/', '$1', $str);
$str = str_replace(array('[', ']'), '', $str);
Well, I didn't stick to just regex, because I'm of a mind that trying to do stuff like this with one big regex leads you to the old joke about "Now you have two problems". However, give something like this a shot:
$str = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not'; $reg = '/(.*?):(.*?)( |$)/';
preg_match_all($reg, $str, $m);
foreach($m[2] as $pos => $match) {
if (strpos($match, '|') !== FALSE && strpos($match, '[[') !== FALSE ) {
$opt = explode('|', $match); $match = $opt[count($opt)-1];
}
$m[2][$pos] = str_replace(array('[', ']'),'', $match );
}
foreach($m[1] as $k=>$v) $result[$k] = $v.':'.$m[2][$k];
This is C# using only using non-escaped strings, hence you will have to double the backslashes in other languages.
String input = "test1:[[link]] " +
"test2:[[gold|silver]] " +
"test3:[[out1[[inside]]out2]] " +
"test4:this|not";
String step1 = Regex.Replace(input, #"\[\[([^|]+)\|([^\]]+)\]\]", #"[[$2]]");
String step2 = Regex.Replace(step1, #"\[\[|\]\]", String.Empty);
// Prints "test1:silver test3:out1insideout2 test4:this|not"
Console.WriteLine(step2);
$str = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not';
$s = preg_split("/\s+/",$str);
foreach ($s as $k=>$v){
$v = preg_replace("/\[\[|\]\]/","",$v);
$j = explode(":",$v);
$j[1]=preg_replace("/.*\|/","",$j[1]);
print implode(":",$j)."\n";
}

Categories