One of our partner websites is sending through a tracking field on our link and the query string looks like this:
?tracking=value%u200B
When "value" gets looked up in our DB via PHP PDO, it kills the query (fatal error). I'd have thought prepared statements would cope with this but I guess not!
How can I pick up any codes like this during the initial hit on my website to keep the strings clean?
This has to be better than simply asking them to fix the URL, in case other sites do the same.
There are many ways to do it. If you know your value's pattern, you can use preg_replace to strip the unwanted characters. If you know the junk always starts with % and your actual value never contains %, you can do something like this:
You should also check with your partner website so they don't send you anything undesired.
$pos = strpos($_GET['tracking'], '%');
// Keep everything before the first '%'; strpos() returns false when no '%' is present
echo $pos === false ? $_GET['tracking'] : substr($_GET['tracking'], 0, $pos);
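If you can't rely on that, a slightly more general cleanup is to strip the non-standard %uXXXX escapes before the lookup. A minimal sketch; the whitelist on the last line assumes tracking codes are alphanumeric with dashes/underscores, which you'd need to confirm:

$raw = isset($_GET['tracking']) ? $_GET['tracking'] : '';
// Drop non-standard %uXXXX escapes such as %u200B
$clean = preg_replace('/%u[0-9A-Fa-f]{4}/', '', $raw);
// Whitelist whatever is left (assumed pattern for tracking codes)
$clean = preg_replace('/[^A-Za-z0-9_-]/', '', $clean);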
At the risk of getting redirected to this answer (yes, I read it and spent the last 5 minutes laughing out loud at it), allow me to explain this issue, which is just one in a list of many.
My employer asked me to review a site written in PHP, using Smarty for templates and MySQL as the DBMS. It currently runs very slowly, taking up to 2 minutes to load completely (with an entirely white screen the whole time, no less).
Profiling the code with Xdebug, I found a single preg_replace call that takes around 30 seconds to complete; it goes through all the HTML code and replaces each URL found with its SEO-friendly version. The moment it completes, it outputs all of the code to the browser. (As I said before, that's not the only issue; the code is rather old, and it shows. But I'll focus on this one for the question.)
Digging further into the code, I found that it builds 1702 patterns and their corresponding replacements (two equally sized arrays), which would certainly account for the time it takes.
Code goes like this:
// This is just a call to a MySQL query which gets the relevant SEO-friendly URLs:
$seourls_data = $oSeoShared->getSeourls();
$url_masks = array();
$seourls = array();
foreach ($seourls_data as $seourl_data)
{
    if ($seourl_data["url"])
    {
        $url_masks[] = "/([\"'\>\s]{1})".$site.str_replace("/", "\/", $seourl_data["url"])."([\#|\"'\s]{1})/";
        $seourls[] = "$1".MAINSITE_URL.$seourl_data["seourl"]."$2";
    }
}

// After filling both $url_masks and $seourls arrays, the HTML is parsed:
$html_seo = preg_replace($url_masks, $seourls, $html);
// After it completes, $html_seo is simply echo'ed to the browser.
Now, I know the obvious answer to the problem is: don't parse HTML with a regexp. But then, how do I solve this particular issue? My first attempt would probably be:
1. Load the (hopefully well-formed) HTML into a DOMDocument and get the href attribute of each a tag (a rough sketch follows this list).
2. Go through each node, replacing the URL found with its appropriate match (which would probably mean using the previous regexps anyway, but on much shorter strings).
3. ???
4. Profit?
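Something like this, roughly (the exact-URL lookup table is an assumption on my part; $site, $seourls_data and MAINSITE_URL come from the code above):

$dom = new DOMDocument();
@$dom->loadHTML($html); // suppress warnings from not-quite-well-formed markup
// Build a plain-URL => SEO-URL map once, for O(1) lookups instead of 1702 regexps
$map = array();
foreach ($seourls_data as $seourl_data) {
    if ($seourl_data["url"]) {
        $map[$site . $seourl_data["url"]] = MAINSITE_URL . $seourl_data["seourl"];
    }
}
foreach ($dom->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    if (isset($map[$href])) {
        $a->setAttribute('href', $map[$href]);
    }
}
$html_seo = $dom->saveHTML();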
But I think that's most likely not the right way to solve the issue.
Any ideas or suggestions?
Thanks.
As your goal is to be SEO-friendly, using a canonical tag in the target pages would tell the search engines to use your SEO-friendly URLs, so you don't need to replace them in your code at all...
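For instance, each target page could emit its canonical URL from the template (borrowing MAINSITE_URL and the seourl field from the question's code):

echo '<link rel="canonical" href="' . MAINSITE_URL . $seourl_data["seourl"] . '" />';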
Oops, that's really tough; it was a bad strategy from the beginning. Anyway, that's not your fault.
I have two suggestions:
1. Create a caching layer with Smarty, so the first request still generates the HTML in 2 minutes, but every later request serves it from a static cached copy (a sketch follows this list).
2. Don't put off what should have been done earlier: fix the system. Create a database migration that stores the SEO URLs in a good format, or generate them from titles or whatever. On my system I generate SEO links in this format:
www.whatever.com/jobs/722/drupal-php-developer
where 722 is the ID I parse from the URL to fetch the right page content, and drupal-php-developer is the title of the post.
3. (Which is not a suggestion.) Tell your client the project is not well engineered (if you truly believe so) and needs restructuring to boost performance.
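A minimal sketch of suggestion 1, assuming Smarty 3 and a hypothetical template named page.tpl:

$smarty->setCaching(Smarty::CACHING_LIFETIME_CURRENT);
$smarty->setCacheLifetime(3600); // serve the cached copy for an hour
if (!$smarty->isCached('page.tpl')) {
    // Only do the expensive work (the 1702 regexps included) on a cache miss
    $smarty->assign('content', build_page_html()); // build_page_html() is hypothetical
}
$smarty->display('page.tpl');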
run
I have a website which allows users to submit photos of wildlife. Once a photo is uploaded, they can identify the species in it, for example "Polar bear".
This triggers me to fetch information about that species from Wikipedia, using the name as the search term:
$query = "http://en.wikipedia.org/w/api.php?action=query&rvprop=content&format=json&titles=" . $query;
$pages = file_get_contents($query);
Such a query returns one of the following:
An array of pageids, which I can then query for that page's content
Nothing, because there simply isn't any match
A REDIRECT result, which allows me to resolve the page with the proper name
The problem I have has to do with casing. For example, the search term "Milky stork" returns nothing, not even a redirect, while "Milky Stork" works. Uppercasing each word in the query is not a solution either, since some pages are in lowercase and then the uppercased query fails. There's no consistency.
I'm looking for a way to make this more robust. A query shouldn't fail because of casing that can't even be predicted on the user's side.
Does anyone know of a solution for this? Other than trying every possible combination of casings?
Note: Some may suggest to use dbpedia instead, but this is no solution for my total needs.
Unfortunately, there is no easy solution; see http://www.mediawiki.org/wiki/API:Opensearch#Note_on_case_sensitivity
You can instead try opensearch to find the appropriate casing (if the normal query returns nothing usable):
http://en.wikipedia.org/w/api.php?action=opensearch&search=milky+stork&namespace=0&suggest=
will give you
["milky stork",["Milky Stork"]]
I think trying every possible combination is a viable solution, since the API lets you ask for several titles in one request, separated by |. So your query might look like:
http://en.wikipedia.org/w/api.php?action=query&rvprop=content&format=json&titles=Milky stork|Milky Stork
Note that the first letter is not case-sensitive on Wikipedia.
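Building those variants could look like this (just a sketch; two or three variants usually suffice given the first-letter rule):

$term = "milky stork";
$variants = array_unique(array($term, ucwords($term), ucfirst(strtolower($term))));
$url = "http://en.wikipedia.org/w/api.php?action=query&rvprop=content&format=json&titles="
     . urlencode(implode('|', $variants));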
I'm really unsure if this is even possible, but we control an interface that has XML posted to it via HTTP POST in the form www.url.com/script.php?xml=<xmlgoeshere>. The XML is supposed to be URL-encoded before it's passed to us, and we decode and parse it.
Except I have one client who just refuses to URL-encode their payload, which works fine until the XML contains an ampersand, at which point everything after it is parsed as the end of the xml variable.
www.url.com/script.php?xml=<xmlstart...foo&bar.../>
The end result is that the XML lands in the xml variable via POST/GET as normal, and then I lose half of the incoming content because of the ampersand.
Now, I know that's expected and proper behavior. My question is: is it possible to capture the &bar.../> segment, so that when we hit this known error I can crowbar the request into working anyway? I know this is non-ideal, but I'm at my wit's end dealing with the outside party.
UPDATE
OK, so I was totally confused. After grabbing the server variables as mentioned below, it turns out I'm not getting the query string, because the request they're submitting has:
[CONTENT_TYPE] => application/x-www-form-urlencoded
[QUERY_STRING] =>
That being the case, is the above behavior still to be expected? Is there a way to get the raw form input in this case? Thanks to the posters below for their help.
You'd be hard pressed to do it, if it's even possible, because the fragments of a query string take the format foo=bar, with the & character acting as the separator. This means you'd get an unpredictable $_GET variable whose key is everything between the & and the next = (assuming there even is one), and whose value runs from that = to the next & or the end of the string.
It might be possible to parse the $_GET array in some way to recover the lost meaning, but it would never be all that reliable. You might have more luck trying to parse $_SERVER['QUERY_STRING'], but that's not guaranteed to succeed either, and it would be a hell of a lot of effort for a problem that can be avoided just by the client using the API properly.
And for me, that's the main point. If your client refuses to use your API in the way you tell them to use it, then it's ultimately their problem if it doesn't work, not yours. Of course you should accommodate your clients to a reasonable standard, but that doesn't mean bending over backwards for them just because they refuse to accommodate your needs or technical standards that have been laid down for the good of everyone.
If the only parameter you use is xml=, it's always at the front, and there are no other parameters, you can do something like this (with simplexml standing in for the original pseudocode's well-formedness checks):

if (count($_GET) > 1 || empty($_GET['xml']) || @simplexml_load_string($_GET['xml']) === false) {
    // $_GET was mangled by the unencoded ampersand; recover from the raw query string
    $xml = substr($_SERVER['QUERY_STRING'], 4); // strip the leading "xml="
    if (@simplexml_load_string($xml) === false) {
        die('Unable to recover well-formed XML'); // really fail
    }
}
However, you should tell the client to fix their code, especially since it's so easy for them to comply with the standard! You might still run into trouble if the XML contains a ? or a #: PHP or the web server may get confused about where the query string starts (messing up your $_SERVER['QUERY_STRING']), and either PHP, the client's code, or an intermediary proxy or web server may get confused by the #, because that usually marks the beginning of a fragment.
E.g., Something like this might be impossible to transmit reliably in a query parameter:
<root><link href="http://example.org/?querystring#fragment"/></root>
So tell them to fix their code. It's almost certainly incredibly easy for them to do so!
UPDATE
There's some confusion about whether this is a GET or a POST. If they send a POST with an x-www-form-urlencoded body, you can substitute file_get_contents('php://input') for $_SERVER['QUERY_STRING'] in the code above.
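In other words (a two-line sketch; the variable names are mine):

// With an x-www-form-urlencoded POST body, php://input holds the raw payload
$raw = file_get_contents('php://input'); // e.g. "xml=<xmlstart...foo&bar.../>"
$xml = substr($raw, 4);                  // strip the leading "xml="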
Yes, it's possible, using $_SERVER["QUERY_STRING"].
For your URL www.url.com/script.php?xml=<xmlstart...foo&bar.../>, $_SERVER["QUERY_STRING"] should contain xml=<xmlstart...foo&bar.../>.
The following code should extract the xml data.
$pos = strpos($_SERVER["QUERY_STRING"], 'xml=');
$xml = "";
if ($pos !== false) {
    // Everything after "xml=" is treated as the raw XML payload
    $xml = substr($_SERVER["QUERY_STRING"], $pos + strlen("xml="));
}
The problem here is that the query string will be parsed at the & and = characters. If you know where the = will fall after the "bar" key, you may be able to capture the rest of the string as its value. But if you hit more &s, you'll need to know the full content of the incoming message body to reassemble it.
I have a webpage where the user inputs data into a textarea, which is then processed and displayed with some JavaScript. For example, if the user types _Hello_ *World*, it would produce something like:
<underline>Hello</underline> <b>World</b>
Or something like that; the details aren't important. Now the user can "save" the page, turning the URL into something like site.com/page#_Hello_%20*World*, and share that link with others.
My question is: Is this the best way to do this? Is there a limit on a url that I should be worried about? Should I do something like what jsfiddle does?
I would prefer not to: with the full text in the hash, the site works offline, and since the nature of the site is to be used offline, a jsfiddle-like ID would force the user to cache the page first before they could use it.
What's the best way to do this?
EDIT: OK, the example I gave is nothing like what I'm actually doing. I'm not cloning Markdown or using underline or b tags; I just wanted to illustrate the idea.
Instead of trying to save stuff in the URL, you should use the approach that is common in pastebins: store the data server-side and provide the user with a URL containing a unique string that identifies the stored document, something like http://foo.bar/g4jg64.
From the URL you get state or identifiers, not the data itself.
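A minimal sketch of the save step (a PDO handle is assumed; the table, columns, and ID scheme are all hypothetical):

// Store the document server-side and hand back a short identifier
$id = substr(bin2hex(random_bytes(4)), 0, 6); // e.g. "g4jg64"
$stmt = $pdo->prepare('INSERT INTO documents (id, body) VALUES (?, ?)');
$stmt->execute(array($id, $_POST['body']));
echo "http://foo.bar/" . $id; // the URL the user shares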
URLs are typically limited to around 2KB in practice, but there is no officially designated limit; it's browser-dependent.
Other than that, make sure you properly URL-encode what you're putting up there and you're fine, although I certainly would not want to deal with obnoxiously long URLs. I might also suggest avoiding tags such as <underline> (not a real HTML element) and <b>; presentational markup like that has long been discouraged in favor of CSS.
Use the JavaScript function encodeURIComponent to make the hash URL-safe:
encodeURIComponent('_Hello_ *World*');
I've been trying to figure out how to let apostrophes survive URIs.
I'm building a site that allows users to create photo albums. I have a link that, when clicked, loads and displays all the contents of a certain album. I'm using CodeIgniter, so this page is called this way:
http://www.fourthdraft.com/index.php/admin/manageAlbumContents/dan's/91
admin = controller
manageAlbumContents = function
dan's (album name) = variable
As you may know, CodeIgniter does not allow apostrophes (') in URIs. My problems are:
If I htmlspecialchars/htmlentities the album name, it becomes &#xx; and those new characters are not allowed either.
If I urlencode it, it becomes %xx. The percent sign is allowed, but CodeIgniter urldecodes the URI before processing, so it just reverts back to an apostrophe.
I've tried making my own preg_replace mapping (' => '~apos~'), but I find it inefficient and tedious: too many lines to run, and with the website 80% done the strings I'd have to replace are everywhere.
I've also considered base64_encode. It takes more space but it does the job. Then again, the encoded version can contain '=', which is also disallowed.
As much as possible I do not want to just add the apostrophe to the allowed characters list in CodeIgniter's config file. I believe it's excluded for a reason. At the same time, I'm running out of options.
The reason for wanting to allow apostrophes is that in this context they're bound to be used. For example, what if someone names an album 'dan's birthday party'? It's bound to happen, and I'm pretty sure my users would complain. Even if I manage to convince them otherwise, what would I replace the apostrophe with? 'dan_s birthday party' looks wrong. Also, if Facebook can do it, I should too; at the very least, if Facebook did it, there must be a way.
If you have any suggestions, fire away. Otherwise, I'm wondering if it's OK (and safe) to just allow the apostrophe in the permitted URI characters. I know apostrophes are VERY dangerous for MySQL, which I use a lot, but CodeIgniter's query bindings automatically escape characters; I'm wondering if that would suffice and keep me safe.
Otherwise, please give me a good idea. I'm drained.
I like to believe that the days of mysql_query("SELECT * FROM table WHERE x={$_GET['val']}") are over. That being said, an apostrophe in the URI is fine with any decent database library as long as you use parameter binding. So go ahead, allow it, and use urlencode.
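A minimal sketch with CodeIgniter's query bindings (the table and column names are hypothetical):

// The ? placeholder is bound and escaped by the driver, so the quote in "dan's" is just data
$album = urldecode($this->uri->segment(3)); // e.g. "dan's"
$query = $this->db->query("SELECT * FROM albums WHERE name = ?", array($album));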