We've developed an IRC bot in PHP which (among many other functions) will respond with the page title of any URL a user sends to the channel. The problem I'm having is that when someone posts the URL of an image or a file, the bot tries to retrieve that file or image.
I'm trying to determine the best way to solve this. Should I filter the URL inputs and regex them for all possible file types? That seems daunting and exhaustive, to say the least. If anyone caught on to it, they could simply put a huge file somewhere with a senseless extension, post that URL in the channel, and time the bot out.
I feel like I'm missing a cURL option that would make it simply ignore file retrievals that aren't ASCII in nature. Any advice or suggestions?
One idea would be to do a HEAD request first and only download the body if the Content-Type is text/html. Or you could read just the first 1000 bytes (or some other small amount) and check whether the title is there; if it isn't, assume it's something other than HTML.
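A minimal sketch of both approaches (the function names isTitleFetchSafe and fetchFirstBytes are mine, for illustration only):

// Sketch 1: HEAD request, so only the headers are transferred.
function isTitleFetchSafe($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);          // HEAD, no body
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);            // never hang the bot
    curl_exec($ch);
    $type   = (string) curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
    $length = curl_getinfo($ch, CURLINFO_CONTENT_LENGTH_DOWNLOAD);
    curl_close($ch);
    // Both headers can be spoofed, so treat this as a sanity check only.
    // $length is -1 when the server doesn't announce a Content-Length.
    return stripos($type, 'text/html') === 0
        && ($length < 0 || $length <= 1024 * 1024);
}

// Sketch 2: download at most ~1000 bytes, then abort. Returning a
// number different from strlen($chunk) makes cURL stop the transfer.
function fetchFirstBytes($url, $limit = 1000)
{
    $buffer = '';
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    curl_setopt($ch, CURLOPT_WRITEFUNCTION,
        function ($ch, $chunk) use (&$buffer, $limit) {
            $buffer .= $chunk;
            return strlen($buffer) >= $limit ? 0 : strlen($chunk);
        });
    curl_exec($ch);   // aborting triggers a write error; that's expected
    curl_close($ch);
    return $buffer;   // search this for a <title> tag
}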
Related
I am concerned about the safety of fetching content from an unknown URL in PHP.
We will basically use cURL to fetch HTML content from a user-provided URL and look for Open Graph meta tags, to show the links as content cards.
Because the URL is provided by the user, I am worried about the possibility of getting malicious code in the process.
I have another question: does curl_exec actually download the full file to the server? If so, is it possible for viruses or malware to be downloaded when using cURL?
Using cURL is similar to using fopen() and fread() to fetch content from a file. Safe or not depends on what you do with the fetched content.
From your description, your server works as a kind of intermediary that extracts specific subcontent from the fetched HTML. Even if the fetched content contains malicious code, your server never executes it, so no harm will come to your server. Additionally, because your server only extracts specific subcontent (Open Graph meta tags, as you say), everything else in the fetched content is ignored, which means your users are automatically protected as well.
Thus, in my opinion, there is no need to worry. Of course, this relies on the assumption that the content-extraction process is sound; someone should take a look at it and confirm that.
does curl_exec actually download the full file to the server?
It depends on what you mean by "full file". If you mean the entire HTML content, then yes. If you mean everything including the CSS and JS files that the fetched HTML refers to, then no.
is it possible that viruses or malware be downloaded when using curl?
The answer is yes. The fetched HTML content may contain malicious code; however, as long as you don't execute it, no harm will come to you. Again, I'm assuming that your content-extraction process is sound.
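To make "sound extraction" concrete, here is a sketch of how the Open Graph tags could be pulled out with DOMDocument and escaped before display (extractOpenGraph is an illustrative name, not part of any library):

function extractOpenGraph($html)
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // real-world HTML is rarely valid
    $doc->loadHTML($html);
    libxml_clear_errors();

    $tags = array();
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $property = $meta->getAttribute('property');
        if (strpos($property, 'og:') === 0) {
            // Escape on the way out so nothing from the remote page can
            // ever be rendered as markup in your own content cards.
            $tags[$property] = htmlspecialchars(
                $meta->getAttribute('content'), ENT_QUOTES, 'UTF-8');
        }
    }
    return $tags;
}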
The short answer is that file_get_contents is safe for retrieving data, and so is cURL. It is up to you what you do with that data.
A few guidelines:
1. Never run eval() on that data.
2. Don't save it to your database without filtering.
3. Don't even use file_get_contents or cURL. Use get_meta_tags instead:
array get_meta_tags ( string $filename [, bool $use_include_path = false ] )
// Example
$tags = get_meta_tags('http://www.example.com/');
You will get all the meta tags parsed and filtered into an array.
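For example, assuming the page contained <meta name="author" content="name"> and <meta name="description" content="a php manual">, you would get something like this (note that, as far as I know, get_meta_tags only picks up tags with a name attribute, so Open Graph property tags will not show up here):

// Hypothetical output for the page described above
print_r($tags);
// Array
// (
//     [author] => name
//     [description] => a php manual
// )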
You can use httpclient.class instead of file_get_contents or cURL, because it connects to the page through a socket. After downloading the data, you can extract the metadata using preg_match.
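For instance, a quick preg_match sketch for pulling a title out of downloaded HTML (a regex is fragile compared to a real parser; $html is assumed to hold the fetched markup):

if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
    $title = html_entity_decode(trim($m[1]), ENT_QUOTES, 'UTF-8');
}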
Expanding on the answer by Ray Radin.
Tips on precautionary measures
He is correct that if you use a sound process to search the fetched resource, there should be no problem fetching whatever URL is provided. Some examples here are:
Don't store the file in a public-facing directory on your webserver; that exposes you to it being executed.
Don't store it in a database unfiltered; this might lead to a second-order SQL injection attack.
In general, don't store anything from the resource you are requesting. If you have to, use a specific whitelist of what you are searching for.
Check the header information
Even though there is no foolproof way of validating what a specific URL points to, there are ways to make your life easier and prevent some potential issues. For example, a URL might point to a large binary or a large image file.
Make a HEAD request first to get the header information, then look at the Content-Type and Content-Length headers to see whether the content is a plain-text HTML file. You should not fully trust these headers, since they can be spoofed, but checking them will at least make sure that non-malicious content doesn't crash your script. Requesting image files is presumably something you don't want users to do.
Guzzle
I recommend using Guzzle to do your requests, since in my opinion it provides some functionality that should make this easier.
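A sketch of the header check with a recent Guzzle (assumes guzzlehttp/guzzle installed via Composer; the 1 MB cutoff is an arbitrary choice of mine):

require 'vendor/autoload.php';

use GuzzleHttp\Client;

$url      = 'http://example.com/';
$client   = new Client(array('timeout' => 5));
$response = $client->head($url);

$type   = $response->getHeaderLine('Content-Type');
$length = (int) $response->getHeaderLine('Content-Length');

if (stripos($type, 'text/html') === 0 && $length <= 1024 * 1024) {
    // Headers look sane; fetch the body and search it.
    $html = (string) $client->get($url)->getBody();
}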
It is safe, but you will need to do a proper data check before using it, as you should with any data input anyway.
OK, at work I use a particular system to look up part numbers for products. It's accessed in the browser, can only be reached from company machines, and I have a logon for it. Once logged in, I type in the part number and it prints a list into a rich-text field with the part number, serial number, description, and some other bits of info. It doesn't have the ability to search for multiple part numbers, so I literally have to type in the first, wait for the result, then the second, etc. What I'm looking to do is write some code that will loop through a text file and print part of the result into the file next to each part number. That kind of code I'm used to.
My problem, however, is that I don't know what source code or framework this company-owned system runs on. If I view the source I can't see a JS file or anything similar where I would expect the script to live, so I assume it's server-side. If I watch the requests, I can see the parameters being passed, but I don't know how I could recreate this in code and obtain a result. It would be nice if it spat out some JSON, but I don't think it's that easy :-)
Any pointers to get me going and areas I should look at?
Thoughts appreciated.
You can view the response headers in the browser's network tab; if it's PHP, by default it adds an X-Powered-By header that you'll recognize.
But how will you deploy your server-side code if you don't have access to the server? And if you do have access, why do you need to guess the language like this?
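If the goal is only to query the system from a script of your own, though, you don't need to know the server-side language at all: you can replay the requests you saw in the network tab. A rough cURL sketch (every URL and field name below is hypothetical; copy the real ones from the observed requests):

// Log in once to get a session cookie, then replay the lookup
// request for each part number. All names below are placeholders.
$cookieJar = tempnam(sys_get_temp_dir(), 'cookies');

function request($url, array $post, $cookieJar)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($post));
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieJar);   // save cookies
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieJar);  // send them back
    $body = curl_exec($ch);
    curl_close($ch);
    return (string) $body;
}

// 1. Log in, using the field names from the observed login request.
request('https://intranet.example/login',
    array('username' => 'me', 'password' => 'secret'), $cookieJar);

// 2. Replay the lookup for each part number in the text file.
foreach (file('parts.txt', FILE_IGNORE_NEW_LINES) as $partNo) {
    $html = request('https://intranet.example/lookup',
        array('partno' => $partNo), $cookieJar);
    // Parse $html here and write the result next to $partNo.
}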
I'm working on a script that indexes and downloads a whole website from a user-submitted URL.
For example, when a user submits a domain like http://example.com, I copy all the links on the index page, download the pages behind those links, extract their links in turn, and start again from the first.
I do this part with cURL and regular expressions to download the pages and extract the links.
However, some shady websites generate fake URLs: for example, http://example.com?page=12 has links to http://example.com?page=12&id=10, http://example.com?page=13, and so on.
This creates a loop and the script can never finish downloading the site.
Is there any way to detect these kinds of pages?
P.S.: I think Google, Yahoo, and other search engines face this kind of problem too, but their databases stay clean and their searches don't show this kind of data.
Some pages may use GET variables and be perfectly valid (as you've mentioned here, ?page=12 and ?page=13 may both be acceptable), so what I believe you're actually looking for is a unique page.
It's not possible, however, to detect these straight from their URL. ?page=12 may point to exactly the same thing as ?page=12&id=1, or it may not. The only way to detect one of these is to download it, compare the download to the pages you've already got, and as a result find out whether it really is one you haven't seen yet. If you have seen it before, don't crawl its links.
Minor side note: make sure you block websites from a different domain, otherwise you may accidentally start crawling the whole web :)
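A sketch of that idea (crawl() and extractLinks() are hypothetical stand-ins for your existing cURL and regex code): hash every body you download and skip link extraction when the same content reappears under a new URL.

$seenUrls    = array();  // exact URLs already fetched
$seenContent = array();  // md5 of every distinct body seen so far
$queue       = array('http://example.com/');

while ($url = array_shift($queue)) {
    if (isset($seenUrls[$url])) {
        continue;                      // already fetched this exact URL
    }
    $seenUrls[$url] = true;

    $body = crawl($url);               // your existing cURL fetch
    $hash = md5($body);
    if (isset($seenContent[$hash])) {
        continue;                      // same content under a new URL: don't recurse
    }
    $seenContent[$hash] = true;

    foreach (extractLinks($body) as $link) {
        // Stay on the original domain so you don't crawl the whole web.
        if (parse_url($link, PHP_URL_HOST) === parse_url($url, PHP_URL_HOST)) {
            $queue[] = $link;
        }
    }
}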
I wanted to make a little audio captcha in PHP, so I needed to convert text to speech, but I have two restrictions:
First, it should be a PHP solution. Creating an MP3/OGG would be fine; it could be inserted and played with audio tags, etc.
Second, I need to install it on a server with only FTP access. So I can't use standard applications that PHP would talk to.
So, I have already investigated some solutions:
jQuery's Jtalk can read text aloud, but it's impractical here, as JavaScript is always visible in the source, so the captcha would appear in plain text in the source code.
Google has an API to speak text aloud, too. However, you need to make a call to an external URL with the text as part of the URL, so listening to the outgoing requests will reveal the captcha as well.
I tried to combine my own audio files using PHP. I have read in some posts here that many players simply support an echo file_get_contents('audio1.ogg').file_get_contents('audio2.ogg'); solution. However, using the plugin in Firefox, only the first file is played. Downloading and playing in VLC reveals both audio files. I'm also not really happy with this one, even if it would work, as one could just associate the OGG source with the letter and recognise the captcha by slicing the audio source.
I also thought of loading all letters in audio tags and playing them as needed, but that would again reveal the captcha in the page's source code.
Lastly, I heard of "flite", which promised to be able to do all these things, but I think I was a little mistaken and it needs to be installed directly on the server rather than just putting a few files on an FTP.
So, does anybody know how to make a text-to-speech solution with only FTP access and without contacting other websites with the text as part of the URL?
Regards,
Julian
So, I have made a solution combining JavaScript and PHP which is pleasing to my taste and could be modified for additional security (like adding noise or having something other than one letter per sound file).
It works like this: you set up a sounds folder, protected via .htaccess, only allowing a captcha.php script to get files. There is one file per letter you want to play.
The script can access the captcha via the session, a database, or a protected file, and keeps a pointer to the position that is currently being read. Every time it is visited, it returns the audio of the next letter. This could be done by e.g.
echo file_get_contents('sounds/'.$_SESSION["curaudio"].'.ogg');
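A slightly fuller sketch of what captcha.php could look like under this scheme (the session keys and folder layout are illustrative):

session_start();

$text = isset($_SESSION['captcha_text']) ? $_SESSION['captcha_text'] : '';
$pos  = isset($_SESSION['captcha_pos'])  ? $_SESSION['captcha_pos']  : 0;

if ($pos >= strlen($text)) {
    header('HTTP/1.1 404 Not Found');   // signals the player we're done
    exit;
}

$_SESSION['captcha_pos'] = $pos + 1;    // advance to the next letter

header('Content-Type: audio/ogg');
header('Cache-Control: no-store');      // the same URL returns new audio
readfile('sounds/' . basename(strtolower($text[$pos])) . '.ogg');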
Then you only need to insert the audio element into your HTML:
<audio hidden id="Sound_captcha">
Your browser does not support the audio element.
</audio>
And use JavaScript to switch to the next letter. For that, set the src attribute of the audio element to the address of your captcha.php file. Remember to add a value to prevent caching:
"captcha.php?"+(new Date()).getTime()
You can call the play() function of the audio element to play the file.
Switching to the next letter requires either staying at a fixed amount of time per file (very insecure) or using the ended event of the audio element.
Of course, your PHP script should at the end also signal when the captcha has been read completely (e.g. via a separate AJAX request, or by having the sound script return a status instead of audio on every other access, or by telling you at the beginning how many reloads you need...).
That is actually all you need for a basic player, which would also need to be modified to prevent easy bot access... however, in my opinion, this is at least as secure as a standard text captcha and removes a great barrier for people with vision problems.
I have a member area where users can add their domains, which are then displayed on their profile page. Now I want to add a verification process, just like Google Webmaster Tools has, where they need to upload a certain file and so on.
Please tell me, what's the best way to do this?
Thanks :)
Generate a token for each domain (SHA-1 of the domain or similar) and store it in your DB or what have you.
Generate a text file containing the token on user request.
Ask the user to inform you when the file is in place, or poll every now and then to check the URL. This can easily be done with file_get_contents in PHP if fopen wrappers are enabled.
The token is then compared to the token in your DB, to make sure it isn't just some random file present at a random domain.
It could be a good idea to re-check at some interval that the file is still there, to keep someone who sells the domain from remaining in control.
It's not really black art: we can assume the user has access to the domain once a specific request that proves access can be fulfilled. There's no real way to fool the system short of DNS magic or gaining entry to the webserver running on the domain, both of which are out of your control anyway.
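A sketch of that polling step (the secret, file name, and function name are all illustrative):

function verifyDomain($domain, $secret)
{
    $expected = sha1($domain . $secret);   // same token you gave the user
    $url      = 'http://' . $domain . '/verification.txt';

    // Needs allow_url_fopen; swap in cURL if fopen wrappers are off.
    $served = @file_get_contents($url);

    return $served !== false && trim($served) === $expected;
}

// Re-run this on a schedule so a sold domain loses its verified status.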
Not sure if that's the best way, but I think Google does something like this:
get the user's domain name (e.g. "http://example.com")
generate a unique code and store it in the DB
tell the user where to upload the code (e.g. to something like "/verification.txt")
after confirmation, make an HTTP request for the code ("http://example.com/verification.txt") from your own server to the user's server
compare the code you received to the code in the DB
You may want to consistently generate the same code for the same domain.
This question is convoluted. I think you need to spell out what you are looking for a little better.
EDIT #1:
Generate an MD5 hash and give it to the user; tell them to put it on their domain and provide a URL to where it is. This could be in a txt file or anything.
Then read that file and check whether the MD5 string exists in it.
Actually, I would come up with something slightly different than a single MD5. Maybe three of them, so that you reduce the chance they find the hash on some other domain and then give you that URL.
This can still be spoofed unless you nail down constraints, like: it has to be a text file, the file must contain only the MD5... etc.
Right now I can type in an MD5, but it doesn't mean I control this website:
md5("i fooled you") = "0afb2d659b709f8ad499f4b87d9162f0"
But if I handed you the URL to this answer, your system might accidentally think I have admin rights here.
I recommend creating a file and making them upload the file and give you the URL to it. But even that won't necessarily work, because there are many sites where you can just upload something.
Maybe require a PHP file that can execute? That's kind of a security flaw in itself: I don't know that I would upload just anyone's PHP file, and typically, if you don't have admin access, nobody will let you upload a PHP file that runs.
You might want to create a PHP call-home script, but that's going to be bad; people wouldn't use it.
Another way it could be done is:
Get the domain name.
Generate a random code/string.
Store this in your database.
Make a meta tag with the random code in the content attribute.
Use file_get_contents on the index page of the website.
Then search the page for the meta tag with the code stored in the database.
Use an if statement to handle success or failure.
The meta tag should look like this:
<meta name="site-verification" content="1010101010101010101010101010101010101010" />
Actually, just creating an MD5 string for the domain name and letting the site owner put that in a meta tag so you can check for it would already work fine...