How to remove a long word from a string - php

If a user types a really long string it doesn't move onto a 2nd line and will break a page on my site. How do I take that string and remove it completely if it's not a URL?

Why would you want to remove what the user wrote? Instead, wrap it to a new line - there is a function in PHP to do that, called wordwrap
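As a rough sketch of that suggestion (the sample input and the width of 30 are assumptions chosen for illustration), wordwrap can force-break an over-long word when its fourth argument is true:

```php
<?php
// Hypothetical user input: a short prefix followed by one 70-character "word".
$comment = 'Look: ' . str_repeat('a', 70);

// Wrap at 30 characters, break with a newline, and cut words that are
// longer than the width (fourth argument set to true).
$wrapped = wordwrap($comment, 30, "\n", true);

echo $wrapped, PHP_EOL;
```

Any whitespace gives the browser a place to wrap, so a newline or a plain space both work as the break string in HTML output.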

Do you really want to remove the word, or do you just want to prevent it from making your page layout too wide? If the latter is more what you want, consider using CSS to manage the overflow.
For instance:
div {
overflow:hidden;
}
will hide any content that exceeds the div boundary.
Here's more info on CSS overflow:
http://www.w3schools.com/css/pr_pos_overflow.asp

// remove words of 30 or more chars
$str = preg_replace('/\S{30,}/', '', $str);
edit: updated per Tim P's suggestion, \S matches any non-space char (the same as [^\s])
Also here is a better way incorporating ehdv's suggestion to use wordwrap:
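//This will break up the long words with spaces so they don't stretch layouts.
//(preg_replace_callback replaces the /e modifier, which was removed in PHP 7.)
$str = preg_replace_callback('/\S{30,}/', function ($m) { return wordwrap($m[0], 30, ' ', true); }, $str);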

What if it is a really long URL? At any rate why not just match the text to a valid URL, and only accept those? Check out some php-regex info on URLs and see how they work. The Regular Expressions Cookbook has a good chapter on URL matching, as well.
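To tie this back to the original question, here is a minimal sketch that removes long words unless they look like a URL. The function name removeLongWords and the sample input are hypothetical; the URL check uses PHP's built-in filter_var with FILTER_VALIDATE_URL, which requires a scheme, so a bare www.example.com would still be removed.

```php
<?php
// Hypothetical helper: drop any run of $minLength or more non-space
// characters unless PHP's built-in validator accepts it as a URL.
function removeLongWords(string $text, int $minLength = 30): string
{
    return preg_replace_callback('/\S{' . $minLength . ',}/', function ($m) {
        // Keep the token if it validates as a URL, otherwise remove it.
        return filter_var($m[0], FILTER_VALIDATE_URL) !== false ? $m[0] : '';
    }, $text);
}

$input = 'see http://example.com/a/very/long/path/to/some/page and '
       . str_repeat('x', 40) . ' ok';
$cleaned = removeLongWords($input);

echo $cleaned, PHP_EOL;
```

Note that removing a word leaves the spaces on either side of it in place.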

@Rob take care when using regex; watch out for performance.

Related

PHP regex conditional get content and link from HTML anchor tag

I am trying to get all the anchor tags from a given HTML document whose content is more than 30 characters long. For example, given this HTML:
<td><a href="anything">Content is more then 30 chars........</a>
<a href="anything">another link</a>
</td>
I have written this regex:
preg_match_all("/<a href=\"(.*)\"[^>]*>([a-zA-Z0-9]{30,999})<\\/[a-zA-Z]+>/si", $match[0], $posts);
where 30 sets the minimum length of the anchor tag content, but unfortunately this is not working.
Can anyone point out what I have done wrong?
Thanks
Note: I am trying to fetch the URLs from this page:
This Link
Would something as simple as
<a.*?>.{30,}?</a>
not suffice? The above looks for anchor tags, with their content being 30 characters or more. It does not attempt to validate the href attribute or any other attributes of the link. It can be altered if these are required.
This is translated into preg_match_all as (thanks to @php_nub_qq)
preg_match_all("#<a.*?>.{30,}?</a>#", $match[0],$posts);
The URL you have linked contains letters, numbers, and non-alphanumeric characters in the url string. As you have little control over the source, it might be best to generalise the case like above rather than attempt to white list on a per character basis.
Try this:
preg_match_all("/<a href=\"(.*)\"[^>]*>([a-z\d\s]{30,})<\\/[a-z]+>/si", $match[0],$posts);
Since you have the i case-insensitive modifier, you don't need both a-z and A-Z in your classes. And if you're just setting a minimum length of the content, you don't need to specify a maximum of 999; {30,} means 30 or more.
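For illustration, here is a small self-contained sketch run against the sample HTML from the question. It uses a variation on the patterns above, not either answer's exact regex: [^>]* for the attributes and [^<]{30,} for the content, which is an assumption that the link text contains no nested tags.

```php
<?php
// Sample HTML from the question (with href spelled out).
$html = '<td><a href="anything">Content is more then 30 chars........</a>' . "\n"
      . '<a href="anything">another link</a>' . "\n"
      . '</td>';

// Match anchors whose text is at least 30 characters long.
// Assumes the link text itself contains no "<" (no nested tags).
preg_match_all('#<a\b[^>]*>([^<]{30,})</a>#i', $html, $posts);

// $posts[1] holds the captured link texts.
foreach ($posts[1] as $content) {
    echo $content, PHP_EOL;
}
```

Only the first link should be captured here, because "another link" is shorter than 30 characters.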

PHP Regex URL parsing issues preg_replace

I have a custom markup parsing function that has been working very well for many years. I recently discovered a bug that I hadn't noticed before and I haven't been able to fix it. If anyone can help me with this, that'd be awesome. I have a custom-built forum and text-based MMORPG, and every input is sanitized and parsed for BBCode-like markup. It also parses out URLs and turns them into proper links that go to an exit page with a disclaimer that you're leaving the site. The issue I'm having is that when a user posts multiple URLs in a text box (let's say \n delimited), it only converts every other URL into a link. Here's the parser for URLs:
$markup = preg_replace("/(^|[^=\"\/])\b((\w+:\/\/|www\.)[^\s<]+)" . "((\W+|\b)([\s<]|$))/ei", '"$1".shortURL("$2")."$4"', $markup);
As you can see, it calls a PHP function, but that's not the issue here. The entire text block is passed into this preg_replace at once rather than line by line or by any other means.
If there's a simpler way of writing this preg_replace, please let me know
If you can figure out why this is only parsing every other URL, that's my ultimate goal here
Example INPUT:
http://skylnk.co/tRRTnb
http://skylnk.co/hkIJBT
http://skylnk.co/vUMGQo
http://skylnk.co/USOLfW
http://skylnk.co/BPlaJl
http://skylnk.co/tqcPbL
http://skylnk.co/jJTjRs
http://skylnk.co/itmhJs
http://skylnk.co/llUBAR
http://skylnk.co/XDJZxD
Example OUTPUT:
http://skylnk.co/tRRTnb
<br>http://skylnk.co/hkIJBT
<br>http://skylnk.co/vUMGQo
<br>http://skylnk.co/USOLfW
<br>http://skylnk.co/BPlaJl
<br>http://skylnk.co/tqcPbL
<br>http://skylnk.co/jJTjRs
<br>http://skylnk.co/itmhJs
<br>http://skylnk.co/llUBAR
<br>http://skylnk.co/XDJZxD
<br>
The e flag in preg_replace is deprecated (as of PHP 5.5, and removed in PHP 7). You can use preg_replace_callback to get the same functionality.
The i flag is mostly redundant here, since \w already matches both upper case and lower case and there is no backreference in your pattern; it only affects the literal www.
I set the m flag, which makes ^ and $ match at the beginning and end of each line, rather than only at the beginning and end of the entire string. This should fix your weird problem of matching every other line.
I also made some of the groups non-capturing (?:pattern), since the bigger capturing groups have already captured the text.
The code below is not tested. I only tested the regex in a regex tester.
$markup = preg_replace_callback(
"/(^|[^=\"\/])\b((?:\w+:\/\/|www\.)[^\s<]+)((?:\W+|\b)(?:[\s<]|$))/m",
function ($m) {
return "$m[1]".shortURL($m[2])."$m[3]";
},
$markup
);
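The following sketch compares the pattern with and without the m flag on a few of the URLs from the question. The shortURL() function here is a hypothetical stand-in, since the asker's real helper is not shown. Without m, each match consumes the trailing newline, so the next URL has neither a start-of-string nor a preceding character left to satisfy the first group, and it is skipped.

```php
<?php
// Hypothetical stand-in for the asker's shortURL() helper.
function shortURL($url)
{
    return '<a href="' . $url . '">' . $url . '</a>';
}

$markup = "http://skylnk.co/tRRTnb\nhttp://skylnk.co/hkIJBT\nhttp://skylnk.co/vUMGQo";

$callback = function ($m) {
    return $m[1] . shortURL($m[2]) . $m[3];
};

// Same pattern body as in the answer, without delimiters or flags.
$base = '(^|[^="\/])\b((?:\w+:\/\/|www\.)[^\s<]+)((?:\W+|\b)(?:[\s<]|$))';

$withoutM = preg_replace_callback('/' . $base . '/', $callback, $markup);
$withM    = preg_replace_callback('/' . $base . '/m', $callback, $markup);

echo substr_count($withoutM, '<a href='), ' of 3 linked without m', PHP_EOL;
echo substr_count($withM, '<a href='), ' of 3 linked with m', PHP_EOL;
```

On this input the version without m should link only the first and third URLs, while the version with m should link all three.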

Html2pdf doesn't support word-break:break-all css

Hi everybody, I am using html2pdf and it does not seem to support the word-break:break-all CSS property. Any ideas?
Example:
<td style="width:30%;word-break:break-all ;">
testtestetstetstetstetstettstetstetstetstetstetstetstetstetstetstets
</td>
The cell in the output PDF is wider than 30%; it stretches to the full length of the string.
Output PDF: testtestetstetstetstetstettstetstetstetstetstetstetstetstetstetstets
I want this output:
testtestetstetstetstetstettstets tetstetstetstetstetstetstetstets
Well, that's complicated. Your teststring is too long, but it's not composed of multiple words. That means that word-break won't work, because there aren't any words to break on. Obviously, this might well just be an example, in which case it might be that html2pdf just doesn't support relative widths and word-break, so you could try having an absolute width and word-break.
That said, here's something I know that will work: wordwrap in PHP. So, instead of echo $yourvar; you could use echo wordwrap($yourvar, 75, "\n", true) instead, which will always cut the string, even if it's just one long string. It takes a little fiddling to get the number of characters to match up with the width that you're looking for, but it will work.
<?php
$foo = str_repeat('test',12);
echo wordwrap($foo, 20, '<br />', true);
Output:
testtesttesttesttest
testtesttesttesttest
testtest
Try this:
<td style="width:30%; word-wrap:break-word;">
testtestetstetstetstetstettstetstetstetstetstetstetstetstetstetstets
</td>
Use word-wrap, not word-break.
If you want long strings to wrap consistently within a boundary container, I think you should be able to accomplish this by inserting zero-width space characters (&#8203; or \xe2\x80\x8b) between every letter of the original string. This will have the effect of wrapping as if every character were its own word, but without displaying the spaces to the end user. This may cause you trouble with text searches or indexing on the final product, but it should accomplish the task reliably from an aesthetic perspective.
Thus:
testtestetstetstetstetstettstetstetstetstetstetstetstetstetstetstets
Becomes
t​e​s​t​t​e​s​t​e​t​s​t​e​t​s​t​e​t​s​t​e​t​s​t​e​t​t​s​t​e​t​s​t​e​t​s​t​e​t​s​t​e​t​s​t​e​t​s​t​e​t​s​t​e​t​s​t​e​t​s​t​e​t​s​t​e​t​s
(which displays: "t​e​s​t​t​e​s​t​e​t​s​t​e​t​s​t​e​t​s​t​e​t​s​t​e​t​t​s​t​e​t​s​t​e​t​s​t​e​t​s​t​e​t​s​t​e​t​s​t​e​t​s​t​e​t​s​t​e​t​s​t​e​t​s​t​e​t​s")
So if you wrap it it will wrap exactly to the bounds of its container. Here's a fiddle of it as an example.
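Just write a PHP script to loop through the string and insert the space:
$string = "testtestetstetstetstetstettstetstetstetstetstetstetstetstetstetstets";
$zwsp = "\xe2\x80\x8b"; // UTF-8 encoding of the zero-width space
$new_string = "";
$length = strlen($string); // byte length, so this assumes single-byte (ASCII) text
for ($i = 0; $i < $length; $i++) {
    $new_string .= $string[$i];
    // if it is a space, the next letter is a space, or it is the last letter, there's no reason to add a break character
    if ($i + 1 < $length && $string[$i] != ' ' && $string[$i + 1] != ' ') {
        $new_string .= $zwsp;
    }
}
echo $new_string;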
This is a particularly nice solution, because unlike wordwrap(), it automatically adjusts for non-fixed-width fonts (which is basically 99% of fonts that are actually used).
Again, if you need the resulting PDF to be searchable, this is not a good approach, but it will make it look the way you want it to.
In your test the word break will not work, because word breaking only works between the words of a sentence. So you can use a sentence made of multiple words and then try again with the word breaker.
Just use the substr function in your code.
Here is an example. First put your output in a variable.
$get_value = "testtestetstetstetstetstettstetstetstet";
// The third argument of substr is a length, not an end position
$first = substr($get_value, 0, 4);
$second = substr($get_value, 4, 4);
and so on.
You can use "\r\n" to print a newline character. Make sure to put it in double quotes. If your string is in a variable, you need to use a word count function and append this string. You can also use PHP_EOL to avoid platform dependency.
html2pdf does not support this word-break:break-all css
Ref: http://www.yaronet.com/en/posts.php?sl=&h=0&s=151321#0
You may use this method.
<?php
$get_value = "testtestetstetstetstetstettstetstetstet";
// The third argument of substr is a length, not an end position
$first = substr($get_value, 0, 4);
$second = substr($get_value, 4, 4);
$third = substr($get_value, 8, 4);
?>
I want to add a little of my own experience with HTML2PDF and tables.
I used this solution to generate a PDF containing a table filled with a delivery confirmation (a list of products). Such a list may contain up to a thousand products (rows).
I encountered a problem with formatting and long strings in cells. The first problem was that the table got too wide even when I set the table's width to 100% and set the widths of the header (<th>) columns (HTML2PDF does not support <colgroup>, so I couldn't define them globally); some columns ended up outside the visible area. I used wordwrap() with <br /> as the separator to break up the long strings, which looked like it was working. Unfortunately, it turned out that if there is such a long string in the first and last rows, an empty page is added before and after the whole table. Not a serious problem, but it doesn't look nice either. The final solution (for tables whose width could exceed the visible area) was to:
set fixed widths for the table and each row in pixels
for A4 paper I use a total width of 550 px with default margins, but you'd have to play around a little to distribute the width between the columns
in wordwrap, use a plain space or a zero-width space (&#8203; / \xe2\x80\x8b) as the delimiter
For small tables that you'd like to spread for 100% of visible area width it is OK to use width expressed in %.
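A minimal sketch of the last step, assuming a hypothetical cell value and a 150px column: wordwrap is given the zero-width space as its break string, so the PDF renderer gets an invisible break opportunity every 20 characters.

```php
<?php
// UTF-8 encoding of the zero-width space.
$zwsp = "\xe2\x80\x8b";

// Hypothetical cell content: one 48-character string with no spaces.
$cell = str_repeat('test', 12);

// Insert a zero-width space every 20 characters (true = cut long words).
$wrapped = wordwrap($cell, 20, $zwsp, true);

// The fixed pixel width is an assumption for illustration.
echo '<td style="width:150px;">' . $wrapped . '</td>', PHP_EOL;
```

The width passed to wordwrap still has to be tuned by hand against the column width and font.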
I think this function is a limping solution.
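function String2PDFString($string, $Length)
{
    $NewString = ""; // initialise so the first append does not raise a notice
    // Split on spaces and force-break any word longer than $Length
    foreach (explode(" ", $string) as $Line)
    {
        if (strlen($Line) > $Length)
            $Line = wordwrap($Line, $Length, " ", true);
        // Always separate words with a space, including the wrapped ones
        $NewString .= " " . $Line;
    }
    return ltrim($NewString);
}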

Complex PHP/Perl regular expression for emoticons

I've checked google for help on this subject but all the answers keep overlooking a fatal flaw in the replacement method.
Essentially I have a set of emoticons such as :) LocK :eek and so on and need to replace them with image tags. The problem I'm having is identifying that a particular emoticon is not part of a word and is alone on a line. For example, on our site we allow 'quick links', which are not included in the smiley replacement and which take the format go:forum, user:Username and so on. Pretty much all the answers I've read don't allow for this possibility and as such break these links (i.e. go<img src="image.gif" />orum). I've tried experimenting with different ways around this, checking for the start of the line, spaces/newline characters and so on, but I've not had much luck.
Any help with this problem would be greatly appreciated. Oh also I'm using PHP 5 and the preg_% functions.
Thanks,
Rupert S.
Edit 18/04/2011:
Thanks for your help peeps :) I have created the final regex and thought I'd share it with everyone. I had a couple of problems with special space characters, including newline, but it's now working like a dream. The final regex is:
(?<=\s|\A|\n|\r|\t|\v|\<br \/\>|\<br\>)(:S)(?=\s|\Z|$|\n|\r|\t|\v|\<br \/\>|\<br\>)
To complete the comment into an answer: The simplest workaround would be to assert that the emoticons are always surrounded by whitespace.
(?<=\s|^)[<:-}]+(?=\s|$)
The \s covers normal spaces and line breaks. Just to be safe ^ and $ cover occurrences at the start or very end of the text subject. The assertions themselves do not match, so can be ignored in the replacement string/callback.
If you want to do all the replace in one single preg_replace, try this:
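$images = array(':)' => '<img src="smile.gif"/>', ':eek' => '<img src="eek.gif"/>');
// preg_replace_callback replaces the /e modifier, which was removed in PHP 7
$output = preg_replace_callback('/(?<=^|\s)(:\)|:eek)(?=$|\s)/',
    function ($m) use ($images) { return isset($images[$m[1]]) ? $images[$m[1]] : $m[1]; },
    $input);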
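As a sketch of how the whitespace assertions protect the quick links, the example below builds the alternation from a map of emoticons. The function name replaceEmoticons, the image file names and the sample text are all hypothetical, and the type hints assume PHP 7 or later.

```php
<?php
// Hypothetical helper: replace emoticons only when they stand alone,
// i.e. surrounded by whitespace or the start/end of the text.
function replaceEmoticons(string $text, array $images): string
{
    // Escape each emoticon so characters like ")" are treated literally.
    $alternatives = array_map(function ($code) {
        return preg_quote($code, '/');
    }, array_keys($images));

    $pattern = '/(?<=^|\s)(?:' . implode('|', $alternatives) . ')(?=$|\s)/';

    return preg_replace_callback($pattern, function ($m) use ($images) {
        return '<img src="' . $images[$m[0]] . '" />';
    }, $text);
}

// Hypothetical emoticon-to-image map.
$images = [':)' => 'smile.gif', ':eek' => 'eek.gif', ':S' => 'confused.gif'];

$result = replaceEmoticons('user:Simon says :S and :) but not go:forum', $images);
echo $result, PHP_EOL;
```

The :S inside user:Simon is left alone because it is preceded by a letter rather than whitespace.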

preg_replace() help in PHP

Consider this string
hello awesome <a href="" rel="external" title="so awesome is cool"> stuff stuff
What regex could I use to match any occurrence of awesome which doesn't appear within the title attribute of the anchor?
So far, this is what I've come up with (sadly, it doesn't work):
/[^."]*(awesome)[^."]*/i
Edit
I took Alan M's advice and used a regex to capture every word and send it to a callback. Thanks Alan M for your advice. Here is my final code.
$plantDetails = end($this->_model->getPlantById($plantId));
$botany = new Botany_Model();
$this->_botanyWords = $botany->getArray();
foreach($plantDetails as $key=>$detail) {
$detail = preg_replace_callback('/\b[a-z]+\b/iU', array($this, '_processBotanyWords'), $detail);
$plantDetails[$key] = $detail;
}
And the _processBotanyWords()...
private function _processBotanyWords($match) {
$botanyWords = $this->_botanyWords;
$word = $match[0];
if (array_key_exists($word, $botanyWords)) {
return '' . $word . '';
} else {
return $word;
}
}
Hope this will help someone else some day! Thanks again for all your answers.
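As a self-contained illustration of the word-by-word callback approach described in this edit, here is a minimal sketch. The keyword list, the link format and the function name linkKeywords are hypothetical, not the asker's actual code. Because the text is processed in a single pass, the word inside the generated title attribute is never scanned again.

```php
<?php
// Hypothetical keyword list: lower-case word => link target.
$keywords = ['awesome' => '/glossary/awesome'];

// Hypothetical helper: visit every word once and link the known ones.
function linkKeywords(string $text, array $keywords): string
{
    return preg_replace_callback('/\b[A-Za-z]+\b/', function ($m) use ($keywords) {
        $key = strtolower($m[0]);
        if (isset($keywords[$key])) {
            // The replacement text is not re-scanned by this same call.
            return '<a href="' . $keywords[$key] . '" title="' . $m[0] . ' is cool">'
                 . $m[0] . '</a>';
        }
        return $m[0];
    }, $text);
}

$linked = linkKeywords('hello awesome stuff', $keywords);
echo $linked, PHP_EOL;
```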
This subject comes up pretty much every day here and basically the issue is this: you shouldn't be using regular expressions to parse or alter HTML (or XML). That's what HTML/XML parsers are for. The above problem is just one of the issues you'll face. You may get something that mostly works but there'll still be corner cases where it doesn't.
Just use an HTML parser.
Assuming this is related to the question you posted and deleted a little while ago (that was you, wasn't it?), it's your fundamental approach that's wrong. You said you were generating these HTML links yourself by replacing words from a list of keywords. The trouble is that keywords farther down the list sometimes appear in the generated title attributes and get replaced by mistake, and now you're trying to fix the mistakes.
The underlying problem is that you're replacing each keyword using a separate call to preg_replace, effectively processing the entire text over and over again. What you should do is process the text once, matching every single word and looking it up in your list of keywords; if it's on the list, replace it. I'm not set up to write/test PHP code, but you probably want to use preg_replace_callback:
$text = preg_replace_callback('/\b[A-Za-z]+\b/', "the_callback", $text);
"the_callback" is the name of a function that looks up the word and, if it's in the list, generates the appropriate link; otherwise it returns the matched word. It may sound inefficient, processing every word like this, but in fact it's a great deal more efficient than your original approach.
Sure, using a parsing library is the industrial-strength solution, but we all have times where we just want to write something in 10 seconds and be done. Next time you want to process the meaty text of a page, ignoring tags, try running your input through strip_tags first. This way you will get only the plain, visible text and your regex powers will again reign supreme.
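A quick sketch of that idea, using the string from the question: strip_tags drops the anchor tag together with its title attribute, so only the visible occurrence is left to match. This only helps when reading or counting text; the tags are gone from the result, so it is not a way to replace words in place.

```php
<?php
// The string from the question.
$html = 'hello awesome <a href="" rel="external" title="so awesome is cool"> stuff stuff';

// Remove all tags (and with them their attributes).
$visible = strip_tags($html);

// Count occurrences of the word in the visible text only.
$count = preg_match_all('/\bawesome\b/i', $visible, $matches);

echo $count, PHP_EOL;
```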
This is so horrible I hesitate to post it, but if you want a quick hack, reverse the problem--instead of finding the stuff that isn't X, find the stuff that IS, change it, do the thing and change it back.
This is assuming you're trying to change awesome (to "wonderful"). If you're doing something else, adjust accordingly.
$string = 'Awesome is the man who <b>awesome</b> does and awesome is.';
$string = preg_replace('#(title\s*=\s*\"[^"]*?)awesome#is', "$1PIGDOG", $string);
$string = preg_replace('#awesome#is', 'wonderful', $string);
$string = preg_replace('#pigdog#is', 'awesome', $string);
Don't vote me down. I know it's a hack.
