My code is:
$rawhtml = file_get_contents( "site url" );
$pat= '/((http|ftp|https):\/\/[\w#$&+,\/:;=?#.-]+)[^\w#$&+,\/:;=?#.-]/i';
preg_match_all($pat,$rawhtml,$matches1);
foreach($matches1[1] as $plinks)
{
$links_array[]=$plinks;
}
After testing several situations I noted that the function had some "leaks". The link gets broken if there is whitespace.
For example I have this text URL in a variable:
$rawhtml = " http://www.filesonic.com/file/2185085531/TEST Voice 640-461 Test Cert Guide.epub
"
The result should be one link by line:
http://www.filesonic.com/file/2185085481/TEST Voice (640)+461 Test Cert Guide.pdf
but the result is
http://www.filesonic.com/file/2185085531/TEST
Sometimes extracted links also contains , or ' or " at the end. How to get rid of these?
how to get rid of those commas,quotes or double quotes from the extracted links
One could use (?<![,'"]) to exclude something at the end. But your problem is that you simply shouldn't use the trailing character class:
[^\w#$&+,\/:;=?#.-]
That's what matches " and '.
As a hackish workaround to the other problem, the first character class could be augmented with a space.
[\w#$&+,\/:;=?#. -]+
▵
As said, that's probably not a good solution and might lead to other mismatches.
Related
I have a category named like this:
$name = 'Construction / Real Estate';
Those are two different categories, and I am displaying results from database
for each of them. But I before that I have to send a user to url just for that category.
Here is the problem, if I did something like this.
echo "<a href='site.com/category/{$name}'> $name </a>";
The URL will become
site.com/cateogry/Construction%20/%20Real%20Estate
I am trying to remove the %20 and make them / So, I did str_replace('%20', '/', $name);
But that will become something like this:
site.com/cateogry/Construction///Real/Estate
^ ^ and ^ those are the problems.
Since it is one word, I want it to appear as Construction/RealEstate only.
I could do this by using at-least 10 lines of codes, but I was hoping if there is a regex, and simple php way to fix it.
You have a string for human consumption, and based on that string you want to create a URL.
To avoid any characters messing up your HTML, or get abuses as XSS attack, you need to escape the human readable string in the context of HTML using htmlspecialchars():
$name = 'Construction / Real Estate';
echo "<h1>".htmlspecialchars($name)."</h1>;
If that name should go into a URL, it must also be escaped:
$url = "site.com/category/".rawurlencode($name);
If any URL should go into HTML, it must be escaped for HTML:
echo "<a href='".htmlspecialchars($url)."'>";
Now the problem with slashes in URLs is that they are most likely not accepted as a regular character even if they are escaped in the URL. And any space character also does not fit into a URL nicely, although they work.
And then there is that black magic of search engine optimization.
For whatever reason, you should convert your category string before you inject it as part of the URL. Do that BEFORE you encode it.
As a general rule, lowercase characters are better, spaces should be dashes instead, and the slash probably should be a dash too:
$urlname = strtr(mb_strtolower($name), array(" " => "-", "/" => "-"));
And then again:
$url = "site.com/category/".rawurlencode($urlname);
echo "<a href='".htmlspecialchars($url)."'>";
In fact, using htmlspecialchars() is not really enough. The escaping of output that goes into an HTML attribute differs from output as the elements content. If you have a look at the escaper class from Zend Framework 2, you realize that the whole thing of escaping a HTML attribute value is a lot more complicated
No, there is nothing you can do to make it easier. The only chance is to use a function that does everything you need to make things easier for you, but you still need to apply the correct escaping everywhere.
You can use a simple solution like this:
$s = "site.com/cateogry/Construction%20/%20Real%20Estate";
$s = str_replace('%20', '', $s);
echo $s; // site.com/cateogry/Construction/RealEstate
Perhaps, you want to use urldecode() and remove the whitespace afterwards?
Something I have noticed on the StackOverflow website:
If you visit the URL of a question on StackOverflow.com:
"https://stackoverflow.com/questions/10721603"
The website adds the name of the question to the end of the URL, so it turns into:
"https://stackoverflow.com/questions/10721603/grid-background-image-using-imagebrush"
This is great, I understand that this makes the URL more meaningful and is probably good as a technique for SEO.
What I wanted to Achieve after seeing this Implementation on StackOverflow
I wish to implement the same thing with my website. I am happy using a header() 301 redirect in order to achieve this, but I am attempting to come up with a tight script that will do the trick.
My Code so Far
Please see it working by clicking here
// Set the title of the page article (This could be from the database). Trimming any spaces either side
$original_name = trim(' How to get file creation & modification date/times in Python with-dash?');
// Replace any characters that are not A-Za-z0-9 or a dash with a space
$replace_strange_characters = preg_replace('/[^\da-z-]/i', " ", $original_name);
// Replace any spaces (or multiple spaces) with a single dash to make it URL friendly
$replace_spaces = preg_replace("/([ ]{1,})/", "-", $replace_strange_characters);
// Remove any trailing slashes
$removed_dashes = preg_replace("/^([\-]{0,})|([\-]{2,})|([\-]{0,})$/", "", $replace_spaces);
// Show the finished name on the screen
print_r($removed_dashes);
The Problem
I have created this code and it works fine by the looks of things, it makes the string URL friendly and readable to the human eye. However, it I would like to see if it is possible to simplify or "tightened it up" a bit... as I feel my code is probably over complicated.
It is not so much that I want it put onto one line, because I could do that by nesting the functions into one another, but I feel that there might be an overall simpler way of achieving it - I am looking for ideas.
In summary, the code achieves the following:
Removes any "strange" characters and replaces them with a space
Replaces any spaces with a dash to make it URL friendly
Returns a string without any spaces, with words separated with dashes and has no trailing spaces or dashes
String is readable (Doesn't contain percentage signs and + symbols like simply using urlencode()
Thanks for your help!
Potential Solutions
I found out whilst writing this that article, that I am looking for what is known as a URL 'slug' and they are indeed useful for SEO.
I found this library on Google code which appears to work well in the first instance.
There is also a notable question on this on SO which can be found here, which has other examples.
I tried to play with preg like you did. However it gets more and more complicated when you start looking at foreign languages.
What I ended up doing was simply trimming the title, and using urlencode
$url_slug = urlencode($title);
Also I had to add those:
$title = str_replace('/','',$title); //Apache doesn't like this character even encoded
$title = str_replace('\\','',$title); //Apache doesn't like this character even encoded
There are also 3rd party libraries such as: http://cubiq.org/the-perfect-php-clean-url-generator
Indeed, you can do that:
$original_name = ' How to get file creation & modification date/times in Python with-dash?';
$result = preg_replace('~[^a-z0-9]++~i', '-', $original_name);
$result = trim($result, '-');
To deal with other alphabets you can use this pattern instead:
~\P{Xan}++~u
or
~[^\pL\pN]++~u
I am very confused about the following:
echo("<a href='http://".urlencode("www.test.com/test.php?x=1&y=2")."'>test</a><br>");
echo("<a href='http://"."www.test.com/test.php?x=1&y=2"."'>test</a>");
The first link gets a trailing slash added (that's causing me problems)
The second link does not.
Can anyone help me to understand why.
Clearly it appears to be something to do with urlencode, but I can't find out what.
Thanks
c
You should not be using urlencode() to echo URLs, unless they contain some non standard characters.
The example provided doesn't contain anything unusual.
Example
$query = 'hello how are you?';
echo 'http://example.com/?q=' . urlencode($query);
// Ouputs http://example.com/?q=hello+how+are+you%3F
See I used it because the $query variable may contain spaces, question marks, etc. I can not use the question mark because it denotes the start of a query string, e.g. index.php?page=1.
In fact, that example would be better off just being output rather than echo'd.
Also, when I tried your example code, I did not get a traling slash, in fact I got
<a href='http://www.test.com%2Ftest.php%3Fx%3D1%26y%3D2'>test</a>
string urlencode ( string $str )
This function is convenient when
encoding a string to be used in a
query part of a URL, as a convenient
way to pass variables to the next
page.
Your urlencode is not used properly in your case.
Plus, echo don't usually come with () it should be echo "<a href='http [...]</a>";
You should use urlencode() for parameters only! Example:
echo 'http://example.com/index.php?some_link='.urlencode('some value containing special chars like whitespace');
You can use this to pass URLs, etc. to your URL.
OK, I need to scan many HTML / XHTML documents to see if a particular file has been embedded with SWFObject. If it's the case, I need to replace the call to something else.
So far I have extracted the <script> contents where the calls can be made. Now I need to scan this string to check if the call is there and if it's there I need to replace it.
I know this is a bit odd, but the content comes from a third party which we don't have control on.
Since the call can be made in many different syntax, I will need a regular expression to find and replace the calls.
OK imagine the following scenario:
I'm searching if the file test.swf is embedded with SWFObject in the file.
The <script> content look like this:
alert('test.swf');
//some other random stuff here
swfobject.embedSWF("test.swf",
"The alternative content can screw the regexp with );", "300", "120",
"9.0.0", false, flashvars, params, attributes);
Now I would like to replace swfobject.embedSWF (and all parameters) to something else.
Is there a not too horrible way to do this? Don't forget that the call can be on one or many lines, that the parameters can be wrapped with single quotes (') or double quotes ("), that whitespace can be all around...
EDIT: OK since catching all kind of JS syntax is a bit overkill I will simplify the requirement:
The regular expression can assume only the following
The call is always on the same line
It always start with swfobject.embedSWF (case sensitive)
Is then followed (or not) by whitespaces and then a (
Is then followed (or not) by whitespaces and then a " or a ' (either one but one of the 2 is required)
Is then followed by the filename
Is then followed by " or ' (if we can ensure that it's the same char that in 4 good if not too bad)
Is then followed (or not) by whitespaces and then a ,
Is then followed by anything
Is then followed by ) then any whitespaces (or not) then ; then an end of line.
It should be much simpler to parse this way (I guess).
EDIT 2: I've cooked a solution. I think I'm close but it's not working, Anyone can help? 0 should match but it's not...
<?php
$myFilename = 'test.swf';
$testCases = array();
$testCases[] = 'swfobject.embedSWF("test.swf", "The alternative content can screw the regexp with );", "300", "120", "9.0.0", false, flashvars, params, attributes);';
foreach ($testCases as $i => $currTest)
{
$currResult = preg_match('/\s*swfobject\.embedSWF\s*\(\s*(["\'])(' . preg_quote($myFilename) . ')[^"\']+\1\s*,[\s\S]+?\)\s*;\s*$/', $currTest);
if ($currResult === false || $currResult < 1)
echo $i, ' Not matching', PHP_EOL;
else
echo $i, ' Matching', PHP_EOL;
}
?>
Well, somebody had the time to write a basic javascript parser in PHP. I'd give the tokenizer a try (possibly using an HTML parser to first find the <script> nodes).
In regards of your EDIT2...
I'm not the best with regular expressions but you can try:
$currResult = preg_match('/\s*swfobject\.embedSWF\s*\(\s*(["\'])(' . preg_quote($myFilename) . ')\1\s*,[\s\S]+?\)\s*;\s*$/', $currTest);
Seems to work OK for me.
Use 'grep' or similar on the command line to get a list of files that contain the .swf/script/object strings you need. That'll whittle down the number of files you need to process.
Then, use a PHP script to slurp each of those files into the DOM parser of your choice and do the replacing/fixing-up there.
I have made one form in which there is rich text editor. and i m trying to store the data to database.
now i have mainly two problem..
1) As soon as the string which contents "#"(basically when i try to change the color of the font) character, then it does not store characters after "#". and it also not store "#" character also.
2) although i had tried....in javascript
html.replace("\"","'");
but it does not replace the double quotes to single quotes.
We'll need to see some code. My feeling is you're missing some essential escaping step somewhere. In particular:
As soon as the string which contents "#"(basically when i try to change the color of the font) character
Implies to me that you might be sticking strings together into a URL like this:
var url= '/something.php?content='+html;
Naturally if the html contains a # symbol, you've got problems, because in:
http://www.example.com/something.php?content=<div style="color:#123456">
the # begins a fragment identifier called #123456">, like when you put #section on the end of a URL to go to the anchor called section in the HTML file. Fragment identifiers are purely client-side and are not sent to the server, which would see:
http://www.example.com/something.php?content=<div style="color:
However this is far from the only problem with the above. Space, < and = are simly invalid in URLs, and other characters like & will also mess up parameter parsing. To encode an arbitrary string into a query parameter you must use encodeURIComponent:
var url= '/something.php?content='+encodeURIComponent(html);
which will replace # with %35 and similarly for the other out-of-band characters.
However if this is indeed what you're doing, you should in any case you should not be storing anything to the database in response to a GET request, nor relying on a GET to pass potentially-large content. Use a POST request instead.
It seems that you are doing something very strange with your database code. Can you show the actual code you use for storing the string to database?
# - character is a common way to create a comment. That is everything starting from # to end of line is discarded. However if your code to store to database is correct, that should not matter.
Javascript is not the correct place to handle quote character conversions. The right place for that is on server side.
As you have requested....
I try to replay you... I try to mention exact what I had done...
1) on the client side on the html form page I had written like this..
html = html.trim(); // in html, the data of the rich text editor will come.
document.RTEDemo.action = "submit.php?method='"+ html.replace("\"","'") + "'";
\\ i had done replace bcz i think that was some problem with double quotes.
now on submit.php , my browser url is like this...
http://localhost/nc/submit.php?method='This is very simple recipe.<br><strong style='background-color: #111111; color: #80ff00; font-size: 20px;">To make Bread Buttor you will need</strong><br><br><blockquote><ol><li>bread</li><li>buttor</li></ol></li></blockquote><span style="background-color: #00ff80;">GOOD.</span><br><br><br><blockquote><br></blockquote><br>'
2) on submit.php ........I just write simply this
echo "METHOD : ".$_GET['method'] . "<br><br>";
$method = $_GET['method'];
now my answer of upper part is like this...
METHOD : 'This is very simple recipe.
now i want to store the full detail of URL....but its only storing...
This is very simple recipe.