Imagine a page title string in any given language (English, Arabic, Japanese, etc.) containing several words in UTF-8. Example:
$stringRAW = "Blues & μπλουζ Bliss's ブルース Schön";
Now this needs to be converted into something that's a valid portion of that page's URL:
$stringURL = "blues-μπλουζ-bliss-ブルース-schön"
just check out this link
This works on my server too!
Q1. What characters are allowed in a valid URL these days? I remember having seen whole Arabic strings sitting in the browser's address bar, and I tested it on my Apache 2 and it all worked fine.
I guess it must become: $stringURL = "blues-blows-bliss-black"
Q2. What existing PHP functions do you know that encode/convert these UTF-8 strings correctly for a URL, stripping them of any invalid characters?
I guess that at least:
1. spaces should be converted into dashes -
2. invalid characters should be deleted (which are they? # and '&'?)
3. all letters should be converted to lower case (or are capital letters valid in URLs?)
Thanks: your suggestions are much appreciated!
This is the solution I use:
$text = 'Nevalidní Český text';
$text = preg_replace('/[^\\pL0-9]+/u', '-', $text);
$text = trim($text, "-");
$text = iconv("utf-8", "us-ascii//TRANSLIT", $text);
$text = preg_replace('/[^-a-z0-9]+/i', '', $text);
Capitals in URLs are not a problem, but if you want the text to be lowercase then simply add $text = strtolower($text); at the end :-).
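The steps above can be wrapped into a single function. This is a minimal sketch rather than a definitive implementation: the name slug_translit is made up for illustration, and the fallback for a failing iconv() call is an assumption of this sketch, since the result of //TRANSLIT depends on the iconv library and locale of the server.

```php
<?php
// Hypothetical helper that wraps the steps from the answer above.
function slug_translit(string $text): string
{
    // Collapse every run of characters that are not letters or digits into one dash.
    $text = preg_replace('/[^\pL0-9]+/u', '-', $text);
    $text = trim($text, '-');

    // Transliterate to ASCII. The output depends on the iconv implementation and
    // locale; if the call fails, keep the untransliterated text (assumption).
    $ascii = @iconv('UTF-8', 'US-ASCII//TRANSLIT', $text);
    if ($ascii !== false) {
        $text = $ascii;
    }

    // Lowercase, then drop whatever is still outside a-z, 0-9 and the dash.
    $text = strtolower($text);
    return preg_replace('/[^-a-z0-9]+/', '', $text);
}

echo slug_translit('Hello,  World! 2024'), "\n";
```

For plain ASCII input the transliteration step changes nothing; for accented input, check the result on your own server before relying on it.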
I would use:
$stringURL = str_replace(' ', '-', $stringURL); // Converts spaces to dashes
$stringURL = urlencode($stringURL);
$stringURL = preg_replace('~[^a-z-]~', '', str_replace(' ', '-', strtolower($stringRAW))); // lowercase first and allow the dash, otherwise the dashes are stripped again
Check this method: http://www.whatstyle.net/articles/52/generate_unique_slugs_in_cakephp
Pick the title of your web page:
$title = "mytitle#$3%#$5345";
Simply urlencode it:
$url = urlencode($title);
You don't need to worry about the small details, but to identify the request reliably it's best to put a unique ID prefix in the URL, such as /389894/sdojfsodjf. During routing you can then use the ID 389894 to look up the topic sdojfsodjf.
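The routing step could look roughly like this. It is only a sketch: parse_topic_path is a hypothetical name, and the /ID/slug path layout is assumed from the example above.

```php
<?php
// Hypothetical route parser for paths shaped like /389894/some-title.
function parse_topic_path(string $path): ?array
{
    // The numeric ID identifies the record; the slug is only decoration.
    if (preg_match('#^/(\d+)/([^/]*)$#', $path, $m) !== 1) {
        return null; // not a topic URL
    }
    return ['id' => (int) $m[1], 'slug' => rawurldecode($m[2])];
}

var_dump(parse_topic_path('/389894/sdojfsodjf'));
```

Because the lookup uses only the ID, the title part can change later without breaking old links.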
Here is a short & handy one that does the trick for me
$title = trim(strtolower($title)); // lower string, removes white spaces and linebreaks at the start/end
$title = preg_replace('#[^a-z0-9\s-]#', '', $title); // remove all unwanted chars
$title = preg_replace('#[\s-]+#','-', $title); // replace white spaces and - with - (otherwise you end up with ---)
And of course you need to handle umlauts, currency signs and so forth, depending on the possible input.
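One way to handle umlauts and currency signs is a replacement map applied before the two regular expressions. The sketch below assumes the mbstring extension is available; the function name slug_with_map and the contents of the map are illustrative assumptions, so extend the map for the input you actually expect.

```php
<?php
// Hypothetical helper; the replacement map is an illustrative assumption.
function slug_with_map(string $title): string
{
    $map = ['ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue', 'ß' => 'ss', '€' => 'euro'];

    // mb_strtolower() also lowercases non-ASCII letters such as Ö.
    $title = mb_strtolower(trim($title), 'UTF-8');
    $title = strtr($title, $map);

    // Same two steps as above: drop unwanted chars, then collapse whitespace and dashes.
    $title = preg_replace('#[^a-z0-9\s-]#', '', $title);
    return trim(preg_replace('#[\s-]+#', '-', $title), '-');
}

echo slug_with_map('Schöne Grüße für 5 €'), "\n";
```

Characters that are neither in the map nor in a-z0-9 are still dropped silently.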
I have the videosearchXL script (a YouTube CMS PHP script). In that script, the URLs are not search-engine friendly and are full of clutter like %20, double and even triple dashes, (, %7 and some other characters. I know nothing about PHP, but looking at my code I can tell that the following code is responsible for generating the URLs.
<? include ('siteconfig.php');
function cano($s){
$s = str_replace(" ", "-", $s);
$s = strip_tags($s);
$s = strtolower($s);
return $s;
}
$link = $_GET['link'];
$link = explode("/", $link);
{include("apis/youtube/video.php");}
?>
So please help me to remove these spaces and unusual characters. I also used the .htaccess method but that did not work for me.
I'm looking for a solution to strip some HTML from a scraped HTML page. The page has some repetitive data I would like to delete so I tried with preg_replace() to delete the variable data.
Data I want to strip:
Producent:<td class="datatable__body__item" data-title="Producent">Example
Groep:<td class="datatable__body__item" data-title="Produkt groep">Example1
Type:<td class="datatable__body__item" data-title="Produkt type">Example2
....
...
Must be like this afterwards:
Producent:Example
Groep:Example1
Type:Example2
So a big piece is the same except the word within the data-title piece. How could I delete this piece of data?
I tried a few things like this one:
$pattern = '/<td class=\"datatable__body__item\"(.*?)>/';
$tech_specs = str_replace($pattern,"", $tech_specs);
But that didn't work. Is there any solution to this?
Just use a wildcard:
$newstr = preg_replace('/<td class="datatable__body__item" data-title=".*?">/', '', $str);
.*? means match anything but don't be greedy
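Applied to the sample rows from the question, the replacement could be sketched as follows. Note that this variant uses [^"]* instead of .*? (my substitution, not the answer's pattern), so the match cannot run past the closing quote of the attribute.

```php
<?php
// Sample rows copied from the question.
$str = <<<'HTML'
Producent:<td class="datatable__body__item" data-title="Producent">Example
Groep:<td class="datatable__body__item" data-title="Produkt groep">Example1
Type:<td class="datatable__body__item" data-title="Produkt type">Example2
HTML;

// Remove every opening <td ...> tag regardless of its data-title value.
$clean = preg_replace('/<td class="datatable__body__item" data-title="[^"]*">/', '', $str);

echo $clean, "\n";
```

preg_replace() replaces all occurrences, so no loop over the rows is needed.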
Assuming that the string looked like this:
$string = 'Producent:<td class="datatable__body__item" data-title="Producent">Example';
You could get the beginning and the end of the string with this:
preg_match('/^(\w+:).*\>(\w+)/', $string, $matches);
echo implode([$matches[1], $matches[2]]);
Which, in this case, will output Producent:Example. You could then add this output to another variable/array you intend to use.
OR, since you mentioned replacing:
$string = preg_replace('/^(\w+:).*\>(\w+)/', '$1$2', $string);
But since the input will probably span a variable number of lines, process it row by row:
$string = 'Producent:<td class="datatable__body__item" data-title="Producent">Example
Groep:<td class="datatable__body__item" data-title="Produkt groep">Example1
Type:<td class="datatable__body__item" data-title="Produkt type">Example2';
$stringRows = explode(PHP_EOL, $string);
$pattern = '/^(\w+:).*\>(\w+)/';
$replacement = '$1$2';
foreach ($stringRows as &$stringRow) {
$stringRow = preg_replace($pattern, $replacement, $stringRow);
}
$string = implode(PHP_EOL, $stringRows);
Which will then output the string like you expect.
Explaining my regex:
The first group captures the first word up to the colon (:), then another group captures the last word. I had previously specified anchors for both ends, but when processing each line separately this wouldn't work as expected, so I kept only the beginning anchor.
^(\w+:) => the word at the beginning of the string, up to and including the colon
.*\> => everything else up to the last greater-than sign (escaped with a backslash)
(\w+) => the word after the greater-than sign
Well, maybe my question wasn't written that well. I had a table that I needed to scrape from a website. I needed the info in the table but had to clean up some parts, as mentioned. The solution I finally came up with is this one, and it works. It still needs a few manual replacements, but that is because of the stupid " they use for inch. ;-)
Solution:
// find the table in the source code
foreach($techdata->find('table') as $table){
// filter out the rows
foreach($table->find('tr') as $row){
// take the innertext using simplehtmldom
$tech_specs = $row->innertext;
// strip some 'garbage'
$tech_specs = str_replace(" \t\t\t\t\t\t\t\t\t\t\t<td class=\"datatable__body__item\">","", $tech_specs);
// find the first word of the string so I can use it
$spec1 = explode('</td>', $tech_specs)[0];
// use the found string to strip down the rest of the table
$tech_specs = str_replace("<td class=\"datatable__body__item\" data-title=\"" . $spec1 . "\">",":", $tech_specs);
// manual correction because of the " used
$tech_specs = str_replace("<td class=\"datatable__body__item\" data-title=\"tbv Montage benodigde 19\">",":", $tech_specs);
// manual correction because of the " used
$tech_specs = str_replace("<td class=\"datatable__body__item\" data-title=\"19\">",":", $tech_specs);
// strip some 'garbage'
$tech_specs = str_replace("\t\t\t\t\t\t\t\t\t\t","\n", $tech_specs);
$tech_specs = str_replace("</td>","", $tech_specs);
$tech_specs = str_replace(" ","", $tech_specs);
// put the clean row in an array ready for usage
$specs[] = $tech_specs;
}
}
I have some content stored in a variable, and it looks like this:
$content = "This is a test content and the content of the url is http://www.test.com. The is a second sentence.";
Now my code is
$pos = strpos($content, '.');
$firstsentence = substr($content, 0, $pos);
The above code doesn't work as the string already contains a url having dots.
How can I get the first sentence considering the fact that a string contains a hyperlink?
Please share other scenarios of text. This works fine for your example:
$sentences = 'This is a test content and the content of the url is http://www.test.com. The is a second sentence.';
preg_match('/(http|https):(.*?)com/', $sentences, $match);
$sentences = preg_replace('/(http|https):(.*?)com/', '', $sentences);
$pos = strpos($sentences, '.');
$firstsentence = substr($sentences, 0, $pos) . $match[0] . '.';
//This is a test content and the content of the url is http://www.test.com.
In general, I think you're going to also have to look for <sentence-end-punct>"<whitespace>, "<sentence-end-punct><whitespace>, and <sentence-end-punct><whitespace> (where <whitespace> includes the end of a line). Is this very general English text, not especially under your control, or is the grammar very limited? For non-English text, there can be additional rules, such as putting spaces between punctuation and quotes.
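A rough sketch of that rule: split on whitespace that directly follows sentence-ending punctuation, optionally followed by a closing quote. The function name first_sentence is hypothetical, and the pattern is an assumption that covers only the cases listed above; abbreviations such as "Dr. Smith" would still be split incorrectly.

```php
<?php
// Hypothetical helper: returns the text up to the first sentence boundary.
function first_sentence(string $text): string
{
    // Boundary = whitespace preceded by . ! or ?, optionally with a closing quote.
    // The dots inside a URL are followed by letters, not whitespace, so they do not match.
    $parts = preg_split('/(?<=[.!?]|[.!?]["\'])\s+/u', trim($text), 2);
    return $parts[0];
}

$content = "This is a test content and the content of the url is http://www.test.com. The is a second sentence.";
echo first_sentence($content), "\n";
```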
Add: What are you trying to accomplish here? Do you really need to pull the text apart into individual sentences, or are you just trying to create a "teaser"? In the latter case, just cut off the text at a complete word before some number of characters, and add an ellipsis (...).
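The teaser variant might be sketched like this, assuming the mbstring extension is available. The function name teaser and the default limit of 80 characters are arbitrary choices for illustration.

```php
<?php
// Hypothetical helper: cut at the last complete word within $maxChars characters.
function teaser(string $text, int $maxChars = 80): string
{
    if (mb_strlen($text, 'UTF-8') <= $maxChars) {
        return $text; // short enough, nothing to cut
    }
    // Look one character past the limit so a word ending exactly at the limit is kept.
    $window = mb_substr($text, 0, $maxChars + 1, 'UTF-8');
    $lastSpace = mb_strrpos($window, ' ', 0, 'UTF-8');
    $cut = $lastSpace === false
        ? mb_substr($text, 0, $maxChars, 'UTF-8')   // no space at all: hard cut
        : mb_substr($text, 0, $lastSpace, 'UTF-8'); // cut before the partial word
    return rtrim($cut) . '...';
}

echo teaser('The quick brown fox jumps over the lazy dog', 10), "\n";
```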
Let's assume that I have the following string:
إصلاح إصلاح
and I want to convert it to an SEO-friendly URL, removing slashes and special characters, with the following function calls:
$title = trim(strtolower($str));
$title = preg_replace('#[^a-z0-9\s-]#',null, $title);
$title = preg_replace('#[\s-]+#','-', $title);
In English it works fine and gives correct results, but in Arabic it gives the following result:
15731589160415751581-15731589160415751581
Thanks in advance
I'd suggest urlencode() with unique post id, like
/blog/12345-<?= urlencode('إصلاح إصلاح') ?>
This is still an unsolved problem. What you would basically have to do is transliterate any given character (whether Arabic, Chinese, Japanese, or anything else) into a Latin transcription and then apply the URI generation methods to it.
There is some basic(!) support for this in iconv; have a look at http://ch.php.net/manual/de/function.iconv.php. You have to use iconv('UTF-8', 'ISO-8859-1//TRANSLIT//IGNORE', $text), but as I said, support is limited.
If I were you I would just remove spaces and such and then call urlencode() on it:
$url = urlencode(mb_ereg_replace('\s+', '-', $url));
I'm using mb_ereg_replace() because it is Unicode-aware and thus replaces Unicode whitespace as well.
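The Unicode script property for Arabic letters is \p{Arabic}. Change the preg_replace() call that strips unwanted characters to the following (the u modifier makes the pattern and subject be treated as UTF-8):
$title = preg_replace('#[^\p{Arabic}\s-]#u', '', $title);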
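Put together with the question's other two steps, a version that keeps both Arabic and Latin titles could be sketched as below. The function name slug_arabic is hypothetical, keeping a-z0-9 and combining marks (\p{M}) in the class is my own assumption, and mb_strtolower() requires the mbstring extension.

```php
<?php
// Hypothetical helper: keeps Arabic letters, ASCII letters and digits.
function slug_arabic(string $str): string
{
    $title = mb_strtolower(trim($str), 'UTF-8');
    // \p{Arabic} matches characters of the Arabic script, \p{M} keeps combining marks;
    // the u modifier makes PCRE treat pattern and subject as UTF-8.
    $title = preg_replace('#[^\p{Arabic}\p{M}a-z0-9\s-]+#u', '', $title);
    // Collapse whitespace and dashes into a single dash.
    $title = preg_replace('#[\s-]+#u', '-', $title);
    return trim($title, '-');
}

echo slug_arabic('إصلاح إصلاح'), "\n";
```

Browsers percent-encode such a path when sending the request, so the server-side lookup should compare against the decoded value.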
Try this function. I always use it and it works perfectly!
function SafeUrl3($str) {
$friendlyURL = htmlentities($str, ENT_COMPAT, "UTF-8", false) ;
$friendlyURL = preg_replace ( "/[^أ-يa-zA-Z0-9_.-]/u", "-", $friendlyURL ) ;
$friendlyURL = html_entity_decode($friendlyURL,ENT_COMPAT, "UTF-8") ;
$friendlyURL = trim($friendlyURL, '-') ;
return $friendlyURL ;
}
I've built a custom CMS that does the usual things: post management, content management, contact management, etc.
In the post management section, I would like to extract the "Title" field and convert this into a URL-ready form.
Example: New post is created titled "3 Ways to Win in Real Estate & in Life". I want this to run through a PHP script that turns it into "3_ways_to_win_in_real_estate_&_in_life".
Anyone have a script for this, or would url_encode() do all of this for me?
Make use of existing code that you can reuse in your own projects.
The Kohana 3 framework has a solution for you. Below is one based on the URL::title() method from Kohana 3:
function title($title, $separator = '-') {
// Remove all characters that are not the separator, letters, numbers, or whitespace
$title = preg_replace('![^' . preg_quote($separator) . '\pL\pN\s]+!u', '', strtolower($title));
// Replace all separator characters and whitespace by a single separator
$title = preg_replace('![' . preg_quote($separator) . '\s]+!u', $separator, $title);
// Trim separators from the beginning and end
return trim($title, $separator);
}
function cleanURL($string)
{
$url = str_replace("'", '', $string);
$url = str_replace('%20', ' ', $url);
$url = preg_replace('~[^\\pL0-9_]+~u', '-', $url); // substitutes anything but letters, numbers and '_' with separator
$url = trim($url, "-");
$url = iconv("utf-8", "us-ascii//TRANSLIT", $url); // you may opt for your own custom character map for encoding.
$url = strtolower($url);
$url = preg_replace('~[^-a-z0-9_]+~', '', $url); // keep only letters, numbers, '_' and separator
return $url;
}
// echo cleanURL("Shelly's%20Greatest%20Poem%20(2008)"); // shellys-greatest-poem-2008
from here. You can write your own or possibly find one to replace things like & with and, and so on.
Also note that this function uses dashes, not underscores. The preferred way to create clean URLs is with dashes, not underscores.
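Replacing symbols such as & with words can be done with a small map before the cleanup. This is only a sketch: the function name slug_with_words and the entries of the map are assumptions for illustration, and it handles ASCII titles only.

```php
<?php
// Hypothetical helper; the word map is an illustrative assumption.
function slug_with_words(string $title): string
{
    $words = ['&' => ' and ', '@' => ' at ', '%' => ' percent '];

    // Spell out the symbols first, then lowercase.
    $title = strtolower(strtr($title, $words));

    // Collapse anything that is not a-z or 0-9 into a single dash.
    return trim(preg_replace('/[^a-z0-9]+/', '-', $title), '-');
}

echo slug_with_words('3 Ways to Win in Real Estate & in Life'), "\n";
```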
This is basic, but works.
static public function slugify($text)
{
// replace all non letters or digits by -
$text = preg_replace('/\W+/', '-', $text);
// trim and lowercase
$text = strtolower(trim($text, '-'));
return $text;
}
From here:
http://www.symfony-project.org/jobeet/1_4/Doctrine/en/05
"Anyone have a script for this, or would url_encode() do all of this for me?"
Have you tried using urlencode() (the function is spelled without an underscore) to do this for you? A quick test script would have revealed that much, as would functions-online.com's urlencode() tester.
$str = '3 Ways to Win in Real Estate & in Life';
echo urlencode( $str );
// 3+Ways+to+Win+in+Real+Estate+%26+in+Life
You could use a simple preg_replace() and simply replace anything that is not a letter or digit with either an underscore or a dash.
echo preg_replace( '/[^\d\w]+/' , '_' , $str );
// 3_Ways_to_Win_in_Real_Estate_in_Life
echo preg_replace( '/[^\d\w]+/' , '-' , $str );
// 3-Ways-to-Win-in-Real-Estate-in-Life
Just use a dash of str_replace to turn the spaces into underscores, and a sprinkle of urlencode to catch the rest.
Edit: I missed the strtolower part, but I think you had a handle on that.
This is of course just a basic way to go about it. If you want to imitate exactly how WordPress turns text into a URL, have a look at its code; it's open source and available for you to study.