Python 3 normalize URL

Alright, so apparently Python 3 is pretty ridiculous when it comes to urllib.
So, I have a URL formatted like so,
http_request = "http://localhost/system/index.php/index_file/store?cid={0}&cname={1}&fname={2}&fdir='{3}'"\
.format(client_id, client_name, each[1], each[2])
where each[1] and each[2] are the file names and file directories, respectively.
So a generated result of http_request through print() would give something like this,
http://localhost/system/index.php/index_file/store? \
cid=90823&cname=John Smith&fname=Sample Document.doc& \
fdir='C:\Users\williamyang\Desktop\Files\90823 Michelle Moore\Sample Document.doc'
(The purpose of the lone backslash is just so it fits here better. The actual code doesn't have lone backslashes at the end of each line.)
And that works perfectly fine if I enter that URL into a browser. The PHP app received all the indices through $_GET, then off to MySQL, no problems.
But if I let Python do it,
PHP tells me the indices $_GET['fname'] and $_GET['fdir'] do not exist! What madness. Okay, then,
I tried everything from urllib.parse, urllib encoding and decoding, http_request.replace('\\', '/'), and many others.
None of which worked.
My prof once told me that Python does funny things when it comes to character encoding.
Here is how I send my URL, before all the crazy and useless urllib.parse experiments:
import urllib.request

def getResponseCode(url):
    conn = urllib.request.urlopen(url)
    return conn.read()
Where url = http_request
How can I go about solving this?
PHP says $_GET['fname'] and $_GET['fdir'] do not exist, but when I paste the auto-generated http_request into a browser, everything is fine.

URLs are not supposed to contain spaces. Your browser automatically percent-encodes URLs, replacing characters that shouldn't be in a URL with something like %20 or +, following the rules of URL escaping. Python won't do this automatically; most likely, the convenience introduces ambiguities that matter for programming but don't bother the average web user. The Python tools for URL escaping are quote and quote_plus, which live in urllib.parse in Python 3 (in Python 2 they were urllib.quote and urllib.quote_plus); for query-string values you probably want quote_plus. Pass each query value (the client name, file name, and file directory) through urllib.parse.quote_plus before putting it into the URL, and you should be good to go.
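A minimal Python 3 sketch of that advice. It assumes the endpoint and parameter names from the question; build_store_url is a hypothetical helper, the sample path is shortened, and the single quotes the question wrapped around fdir are dropped. urllib.parse.urlencode applies quote_plus to every value, so spaces, colons, and backslashes are all escaped.

```python
from urllib.parse import urlencode

# Endpoint taken from the question; the helper name is hypothetical.
BASE_URL = "http://localhost/system/index.php/index_file/store"


def build_store_url(client_id, client_name, file_name, file_dir):
    """Return the request URL with every query value percent-encoded."""
    params = {
        "cid": client_id,
        "cname": client_name,
        "fname": file_name,
        "fdir": file_dir,
    }
    # urlencode() runs quote_plus() on each value by default.
    return BASE_URL + "?" + urlencode(params)


url = build_store_url(
    90823,
    "John Smith",
    "Sample Document.doc",
    r"C:\Users\williamyang\Desktop\Files\Sample Document.doc",
)
print(url)
```

The resulting string can be passed to urllib.request.urlopen as-is; on the PHP side, $_GET should then hold the decoded values.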

Solution for Python 2:
How can I normalize a URL in python
Solution for Python 3:
My wonky solution:
Right after reading the directories from os.walk(), do var.replace(" ", "_").
On the PHP end:
$var = str_replace('_', ' ', $_GET['var']);

Related

What is the point of rawurldecode() and urldecode() when the browser apparently does it automatically?

I can't tell you how many hours of my life I've wasted on these kinds of idiotic errors.
I'm basically constructing a URL such as: 'https://example.com/?test=' . urlencode('meow+foo#gmail.com');
Then, I display it from the URL, like this: echo urldecode($_GET['test']);
And then it shows: meow foo#gmail.com.
Ugh.
If I instead do this: echo $_GET['test'];
I get: meow+foo#gmail.com.
(Naturally, echoing a GET variable like that is insanity, so I would of course do htmlspecialchars around it in reality. But that's not the point I'm making here.)
So, since browsers (or something) are clearly doing this "translation" or "decoding" automatically, doing it again messes it up by removing certain characters, in this case the "+" (plus). Which leads me to believe that I'm not supposed to use urldecode/rawurldecode at all.
But then why do they exist?

So when would one ever want to use them?
I recently had a case where we added triggers to an S3 bucket which were being picked up by a Lambda function and sent via an HTTP request to an API endpoint.
If the path of the file on S3 was multiword, it would replace the space with a +, at which point it would break our code because technically the path is incorrect.
Once you run it through urldecode it becomes a valid path because as per the docs:
Decodes any %## encoding in the given string. Plus symbols ('+') are decoded to a space character.
That would be a valid use case for this function as no browser is involved. Just background processes/requests.
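A small Python sketch of both situations, using urllib.parse.quote_plus and unquote_plus as stand-ins for PHP's urlencode and urldecode (an assumption for illustration; the two pairs follow the same plus-for-space convention). The S3-style key is a made-up example.

```python
from urllib.parse import quote_plus, unquote_plus

original = "meow+foo#gmail.com"

encoded = quote_plus(original)              # what goes into the query string
decoded_once = unquote_plus(encoded)        # what the request parser hands back
decoded_twice = unquote_plus(decoded_once)  # the extra decode that eats the "+"

print(encoded)
print(decoded_once)
print(decoded_twice)

# A background job that receives a still-encoded value must decode it itself.
raw_key = "reports/annual+summary+2019.pdf"  # hypothetical object key
print(unquote_plus(raw_key))
```

Decoding once restores the original; decoding a second time turns the literal plus into a space, which is the effect described in the question.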

PHP preg_match on own computer doesn't work

I have this code:
$success = preg_match('/(.+(駅前)?駅) (\(([^線]+線)\) )?((([^線 ]+) )?(\d+[分時])?)/u', $m, $matches);
Example input text is
大正駅 (JR大阪環状線) ﾊﾞｽ 20分
This regex works on https://regex101.com/ and the code works on http://sandbox.onlinephpfunctions.com/. However, when I run the PHP code on my own computer, it never gives me a match. $matches is an empty array, and $success is 0. Yes, the exact same code. I have verified that the regex is correct (using first link) and that the code itself works (using second link). However, it still refuses to work on my own PC.
OS is Arch Linux, running PHP 7.3.11, system locale is ja_JP.UTF-8 (which I don't think matters, but just in case)
Does anyone see anything wrong with the code?

So I was able to find the problem.
First, I tried just the one-liner commented by Nick (3v4l.org/o4ADM) on my PC, and it works. (Of course it should. PHP can't be broken.)
So I figured that it must be the data I'm feeding preg_match that is broken.
Normal prints and echos were in vain; $m always looks how it should. Then I considered AD7six's comment,
Check that the bytes for 駅 etc. are actually the same
so I looked carefully to check that the characters are all Japanese and no Chinese variants are there. And it's all Japanese, it's fine.
So what could it be?
I tried using PHP's file_put_contents to dump the variable to a file, then typed the same text manually with my Japanese keyboard and saved it to another file. I opened Meld (a diff tool) and compared the two texts and voila: the spaces in the text use a different code point than the usual half-width space (U+0020). It uses U+00A0 instead, which is a "no-break space", apparently. What the heck.
Fortunately, a simple $m = str_replace("\u{00A0}", " ", $m) did the trick.
Thanks to everyone for leading me to the right answer!
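The same check can be done without a diff tool. This is a rough Python sketch, not the poster's method: find_odd_whitespace and normalize_spaces are hypothetical helpers, and the sample string is the input from the question with its spaces assumed to be U+00A0.

```python
import unicodedata


def find_odd_whitespace(text):
    """List (index, code point, Unicode name) for each non-ASCII space character."""
    hits = []
    for index, char in enumerate(text):
        is_space = char.isspace() or unicodedata.category(char) == "Zs"
        if ord(char) > 0x7F and is_space:
            hits.append((index, f"U+{ord(char):04X}", unicodedata.name(char, "UNKNOWN")))
    return hits


def normalize_spaces(text):
    """Replace no-break spaces with ordinary spaces, as the str_replace fix does."""
    return text.replace("\u00a0", " ")


sample = "大正駅\u00a0(JR大阪環状線)\u00a0ﾊﾞｽ\u00a020分"
for hit in find_odd_whitespace(sample):
    print(hit)
print(find_odd_whitespace(normalize_spaces(sample)))
```

Printing the code points makes look-alike characters visible even when the rendered text appears identical.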

Using ampersand in pretty URL breaks URL

I have seen plenty of people having this problem, and it seems the only way to stop Apache treating an encoded ampersand as a URL ampersand is to use the mod_rewrite B flag: RewriteRule ^(.*)$ index.php?path=$1 [L,QSA,B].
However, this isn't available in earlier versions of Apache and has to be installed, which is also not supported by some hosting companies.
I have found a solution that works well for us. We have a URL of /search/results/Takeaway+Foods/Inverchorachan,+Argyll+&+Bute+
This obviously breaks the URL at the &, giving us /search/results/Takeaway+Foods/Inverchorachan,+Argyll which then gives a 404 error as there is no such page.
The URL is held in the $_GET['url'] array. If it finds an &, then it splits the array for each ampersand.
The following code pieces the URL back together by traversing the $_GET array for each piece.
I would like to know if this has any hidden problems that I may not be aware of.
The code:
$newurl = "";
foreach ($_GET as $key => $pcs) {
    if ($newurl == "")
        $newurl = $pcs;
    else
        $newurl .= "& " . rtrim($key, "_");
}
//echo $newurl;exit;
if ($newurl != '') $url = $newurl;
I am trimming the underscore from the piece as apache added this. Not sure why but any help on this would be great.

You said in a comment:
We want the URL to show the ampersand so substituting with other characters is not an option.
Short answer: Don't do it.
Seriously, don't use ampersands this way in URLs, even if it looks pretty. Ampersands have a special meaning in a URL, and trying to override that meaning because it looks nice is a very bad idea.
Most web-based software (including Apache, PHP and all browsers) makes assumptions about what an ampersand means in a URL, which you will find very hard to work around.
In particular, you will utterly confuse Google and other search engines if you've got arbitrary ampersands in the URL, so it will completely destroy your SEO rank.
If you must have an ampersand in the string, use urlencoding to turn it into a URL-friendly %26. This won't look good in the user's URL string, but it will work as intended.
If that's not acceptable, then substitute something different for ampersands; maybe the word "and", or a character like an underscore, or perhaps just remove it from the string without a replacement.
All of these are common practice. Trying to force the URL to have an actual ampersand character in it is not common practice, and for very good reason.
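As a rough illustration of the %26 option, here is a Python sketch that percent-encodes each path segment before the URL is assembled. encode_segment is a hypothetical helper, and how the server-side rewrite treats the value after decoding is outside this sketch.

```python
from urllib.parse import quote, unquote


def encode_segment(segment):
    """Percent-encode one path segment, including any "/" and "&" inside it."""
    # safe="" overrides quote()'s default of leaving "/" unescaped.
    return quote(segment, safe="")


place = "Inverchorachan, Argyll & Bute"
path = "/search/results/" + encode_segment("Takeaway Foods") + "/" + encode_segment(place)
print(path)

# Decoding the last segment recovers the original text.
print(unquote(path.rsplit("/", 1)[1]))
```

The ampersand travels as %26, so nothing downstream can mistake it for a query-string separator.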

Take a look at urlencode.
You can also replace the "&" char with something that doesn't break the URI and won't be interpreted by Apache, like the "|" char.

We have had this fix in place for two weeks now so I believe that this has solved the issue. I hope this will help someone with a similar issue as I searched for weeks for a solution outside of an apache upgrade to include the B flag. Our users can now type in Bed & Breakfast and we can then serve the appropriate page.
Here is the fix in PHP.
$newurl = "";
foreach ($_GET as $key => $pcs)
{
    if ($newurl == "")
        $newurl = $pcs;
    else
        $newurl .= "& " . rtrim($key, "_");
}
if ($newurl != '') $url = $newurl;

Why is rawurlencode() in PHP adding additional escape characters to ampersands?

I think I'm missing something obvious here but it is driving me crazy and I can't figure it out. I'm developing a WordPress plugin and part of it needs to take the WordPress post title and send that to a RESTful web service to do something else. So of course I want to rawurlencode() the post title since who knows what text might be in there. However, for some reason the output I'm getting has extra escape characters and I have no idea where they are coming from (and it's causing problems with the web service I'm calling obviously).
My code is fairly straightforward:
$topic = get_the_title($post_id);
$curl_post_fields = 'name=' . rawurlencode( $topic );
Yet when I print the output of those two strings I get:
topic=a & b
name=a%20%26%23038%3B%20b
Whereas I would expect the URL encoded string to be
name=a%20%26%20b
I have no idea where that extra %23038%3B could be coming from. If I'm reading the encoding on that correctly it translates to #038; but I still don't know where it's coming from.

There seems to be an HTML encoding step in between as well: instead of &, &#038; is in the encoded string. Probably because & has to be escaped in HTML, and the get_the_title function escapes it using htmlspecialchars or something like that.
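The effect is easy to reproduce. The Python sketch below assumes the title arrives with the ampersand already written as the numeric entity &#038; and undoes the HTML escaping before URL-encoding; encode_title is a hypothetical helper. In PHP, the counterpart of html.unescape would be html_entity_decode().

```python
import html
from urllib.parse import quote


def encode_title(title):
    """Undo HTML entity escaping, then percent-encode the plain text."""
    plain = html.unescape(title)
    return quote(plain, safe="")


title = "a &#038; b"  # assumed form of the title as returned to the plugin

print(quote(title, safe=""))  # entity encoded as-is: the unexpected output
print(encode_title(title))    # entity decoded first: the expected output
```

The first line printed matches the string from the question, which supports the HTML-escaping explanation.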

I had some problems with that when I used an older PHP version.

Scrape a price off a website

I'm trying to scrape a price from a web page using PHP and Regexes. The price will be in the format £123.12 or $123.12 (i.e., pounds or dollars).
I'm loading up the contents using libcurl. The output of which is then going into preg_match_all. So it looks a bit like this:
$contents = curl_exec($curl);
preg_match_all('/(?:\$|£)[0-9]+(?:\.[0-9]{2})?/', $contents, $matches);
So far so simple. The problem is, PHP isn't matching anything at all - even when there are prices on the page. I've narrowed it down to there being a problem with the '£' character - PHP doesn't seem to like it.
I think this might be a charset issue. But whatever I do, I can't seem to get PHP to match it! Anyone have any ideas?
(Edit: I should note if I try using the Regex Test Tool using the same regex and page content, it works fine)

Have you tried using \ in front of £?
preg_match_all('/(\$|\£)[0-9]+(\.[0-9]{2})/', $contents, $matches);
I tried this expression in .NET with \£ and it works. I just edited it and removed some ":".
See my comment on this post about the possibility of cURL giving you a badly encoded response.

Maybe the pound sign has been replaced by its HTML entity? I think you should try your regexp with some sort of regex-testing program (i.e., match it against fixed text locally).
I'd change my regexp like this: '/(?:\$|£)\d+(?:\.\d{2})?/'

This should work for simple values.
'#(?:\$|\£|\€)(\d+(?:\.\d+)?)#'
This will not work with thousands separators, as in 234,343 and 34,454.45.
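To test the charset theory from the question, here is a Python sketch that decodes the downloaded bytes with an explicit charset before matching. find_prices and the sample page are made up for illustration; a real page's charset would have to come from its headers or meta tag.

```python
import re

# Same idea as the patterns above: a currency sign, digits, optional pence/cents.
PRICE_RE = re.compile(r"[$£]\d+(?:\.\d{2})?")


def find_prices(raw_bytes, charset="utf-8"):
    """Decode the response body with the given charset, then match prices."""
    text = raw_bytes.decode(charset, errors="replace")
    return PRICE_RE.findall(text)


page = "<p>Now £123.12, was $150</p>"  # made-up page content
utf8_bytes = page.encode("utf-8")
latin1_bytes = page.encode("iso-8859-1")

print(find_prices(utf8_bytes))
print(find_prices(latin1_bytes, charset="iso-8859-1"))
print(find_prices(latin1_bytes))  # wrong charset: the pound price is missed
```

When the bytes and the assumed charset disagree, the dollar price still matches but the pound price does not, which is the symptom the question describes.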
