I would like to write a WordPress plugin that parses every image source and checks whether it is a broken link.
My idea is:
Select all post and page images from MySQL using a regex
Request each image URL and inspect the response header (404, 403, etc.)
Print a report
Since I don't need to actually download the binary files, how do cURL, fopen, and fsockopen compare in performance?
Which one is the worst to use?
And one more question: which method can run requests in parallel (multi-threaded)?
The cost of opening a connection to the remote server makes the performance of the library a fairly moot point. In other words, it isn't worth worrying about the relative performance of these functions.
A better option would be to use whatever function allows you to make HEAD requests (which only return the HTTP headers). While you can do it with fsockopen (I don't know about fopen), it is a lot of work when cURL already has code written to send the request and parse the response.
For an example of how to do a HEAD request using cURL, see this answer.
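For reference, a minimal sketch of such a HEAD check with cURL might look like this (the function name and URL are just illustrative):

// Illustrative sketch: get the HTTP status code of a URL via a HEAD request.
function getStatusCode($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);         // send HEAD instead of GET
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // don't echo the (empty) body
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // don't hang on dead hosts
    curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);  // 0 on connection failure
    curl_close($ch);
    return $code;
}

// Example: 200 means OK; 404 or 403 would indicate a broken image link.
echo getStatusCode('http://www.example.com/image.jpg');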
And one more question: which method can run requests in parallel (multi-threaded)?
PHP doesn't have native threads, but the curl_multi_* functions let you run many requests concurrently in a single PHP process.
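A rough sketch of checking several URLs in parallel with curl_multi (the URLs are assumed examples):

// Sketch: parallel HEAD checks with curl_multi; error handling omitted.
$urls = array('http://example.com/a.jpg', 'http://example.com/b.jpg'); // assumed URLs
$mh = curl_multi_init();
$handles = array();
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);         // HEAD request only
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}
$running = null;
do {
    curl_multi_exec($mh, $running); // drive all transfers
    curl_multi_select($mh);         // wait for activity on any handle
} while ($running > 0);
foreach ($handles as $url => $ch) {
    echo $url, ' => ', curl_getinfo($ch, CURLINFO_HTTP_CODE), "\n";
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);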
I am concerned about the safety of fetching content from an unknown URL in PHP.
We will basically use cURL to fetch HTML content from a user-provided URL and look for Open Graph meta tags, to show the links as content cards.
Because the URL is provided by the user, I am worried about the possibility of fetching malicious code in the process.
I have another question: does curl_exec actually download the full file to the server? If so, is it possible for viruses or malware to be downloaded when using cURL?
Using cURL is similar to using fopen() and fread() to fetch content from a file.
Safe or not depends on what you're doing with the fetched content.
From your description, your server works as some kind of intermediary that extracts specific subcontent from a fetched HTML content.
Even if the fetched content contains malicious code, your server never executes it, so no harm will come to your server.
Additionally, because your server only extracts specific subcontent (Open Graph meta tags, as you say),
everything else that is not what you're looking for in the fetched content is ignored,
which means your users are automatically protected.
Thus, in my opinion, there is no need to worry.
Of course, this relies on the assumption that the content extraction process is sound.
Someone should take a look at it and confirm it.
does curl_exec actually download the full file to the server?
It depends on what you mean by "full file".
If you mean "the entire HTML content", then yes.
If you mean "including all the CSS and JS files that the feched HTML content may refer to", then no.
is it possible that viruses or malware be downloaded when using curl?
The answer is yes.
The fetched HTML content may contain malicious code; however, if you don't execute it, no harm will come to you.
Again, I'm assuming that your content extraction process is sound.
Short answer: file_get_contents is safe for retrieving data, and so is cURL. It is up to you what you do with that data.
A few guidelines:
1. Never run eval() on that data.
2. Don't save it to the database without filtering.
3. You don't even need file_get_contents or cURL here.
Use get_meta_tags instead:
array get_meta_tags ( string $filename [, bool $use_include_path = false ] )
// Example
$tags = get_meta_tags('http://www.example.com/');
You will get all the meta tags parsed and filtered into an array. (Note that get_meta_tags only parses meta tags with a name attribute, so Open Graph tags, which use property, may need a different approach.)
You can use an HTTP client class such as httpclient.class instead of file_get_contents or cURL, because it connects to the page through a socket. After downloading the data you can extract the meta data using preg_match.
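For Open Graph tags specifically, a rough regex-based sketch could look like the following ($html is assumed to hold the downloaded page source; regexes are brittle against arbitrary HTML, so treat this only as a starting point):

// Hypothetical example: extract Open Graph meta tags from fetched HTML.
// Note: this assumes property appears before content; real pages may vary.
$pattern = '/<meta\s+property=["\']og:([^"\']+)["\']\s+content=["\']([^"\']*)["\']/i';
if (preg_match_all($pattern, $html, $matches)) {
    $og = array_combine($matches[1], $matches[2]);
    // e.g. $og['title'], $og['image'], $og['description']
}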
Expanding on the answer made by Ray Radin.
Tips on precautionary measures
He is correct that if you use a sound process to search the fetched resource, there should be no problem fetching whatever URL is provided. Some examples are:
Don't store the file in a public-facing directory on your webserver; that would expose you to the risk of it being executed.
Don't store it in a database; this might lead to a second-order SQL injection attack.
In general, don't store anything from the resource you are requesting. If you have to, use a specific whitelist of what you are searching for.
Check the header information
Even though there is no foolproof way of validating what a given URL points to, there are ways you can make your life easier and prevent some potential issues.
For example, a URL might point to a large binary, a large image file, or something similar.
Make a HEAD request first to get the header information, then look at the Content-Type and Content-Length headers to see whether the content is a plain-text HTML file.
You should not blindly trust these, since they can be spoofed, but checking them will at least make sure that even non-malicious content won't crash your script. Requesting image files is presumably something you don't want users to do.
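A minimal sketch of such a pre-check with cURL (the 1 MB size limit is an arbitrary choice for this example):

// Sketch: HEAD request to vet a user-supplied URL before fetching it.
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 5);
curl_exec($ch);
$type   = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);            // e.g. "text/html; charset=utf-8"
$length = curl_getinfo($ch, CURLINFO_CONTENT_LENGTH_DOWNLOAD); // -1 if unknown
curl_close($ch);

$looksLikeHtml = (strpos((string) $type, 'text/html') === 0);
$smallEnough   = ($length >= 0 && $length < 1048576); // arbitrary 1 MB cap
if ($looksLikeHtml && $smallEnough) {
    // proceed with the real GET request
}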
Guzzle
I recommend using Guzzle to make your requests, since in my opinion it provides some functionality that should make this easier.
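For instance, a rough equivalent of the header pre-check with Guzzle might look like this (assuming Guzzle is installed via Composer):

// Rough sketch using Guzzle (assumes: composer require guzzlehttp/guzzle).
require 'vendor/autoload.php';

$client = new GuzzleHttp\Client([
    'timeout'     => 5,
    'http_errors' => false, // don't throw exceptions on 4xx/5xx responses
]);
$response = $client->request('HEAD', $url);            // headers only, no body
$type = $response->getHeaderLine('Content-Type');
if (strpos($type, 'text/html') === 0) {
    $html = (string) $client->request('GET', $url)->getBody();
    // ...extract the Open Graph tags from $html...
}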
It is safe, but you will need to do a proper data check before using it, as you should with any data input anyway.
My web application needs to parse remote resources from multiple remote servers. The problem is that the output of those remote servers is long and arrives in stages. I therefore need a piece of code implementing the following logic:
Populate array $links_array with a set of links.
Some code here....
For $i in count($links_array)
    $results_array[$i] = {what has been output so far, without waiting for the full response}
Some code here....
The answer must not use extensions (except cURL) and must work on PHP 5.3.
Thanks a lot for your help.
It seems like you need an asynchronous way of downloading the links. You could look into the Guzzle library (https://blog.madewithlove.be/post/concurrent-http-requests/). Another option is to use multi cURL: http://php.net/manual/en/function.curl-multi-init.php
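Since the question allows only cURL on PHP 5.3, here is a rough sketch using curl_multi together with a write callback; the callback receives each chunk as it arrives, so $results_array holds whatever has been output so far without waiting for the full responses (variable names follow the pseudocode above, and the URLs are placeholders):

// Sketch: concurrent downloads with curl_multi on PHP 5.3 (cURL only).
$links_array = array('http://example.com/a', 'http://example.com/b'); // assumed links
$results_array = array();

$mh = curl_multi_init();
$handles = array();
foreach ($links_array as $i => $url) {
    $results_array[$i] = '';
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_WRITEFUNCTION, function ($ch, $chunk) use (&$results_array, $i) {
        $results_array[$i] .= $chunk;   // partial output accumulates here
        return strlen($chunk);          // tell cURL the chunk was consumed
    });
    curl_multi_add_handle($mh, $ch);
    $handles[] = $ch;
}

$running = null;
do {
    curl_multi_exec($mh, $running);
    // $results_array already holds whatever has been output until now
    curl_multi_select($mh);
} while ($running > 0);

foreach ($handles as $ch) {
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);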
I'm trying to parse data from http://skytech.si/
I looked around a bit and found out that the site uses http://skytech.si/skytechsys/data.php?c=tabela to show data. When I open this file in my browser I get nothing. Is the file protected so that it can only run from the server side, or something?
Is there any way to get data from it? If I could get HTML data (perhaps in a table?) I would probably know how to parse it.
If not, would it be still possible to parse website and how?
I had a look at the requests made:
http://skytech.si/skytechsys/?c=graf&l=bf0b3c12e9b2c2d65bd5ae8925886b57
http://skytech.si/skytechsys/?c=tabela
Forbidden
You don't have permission to access /skytechsys/ on this server.
This website doesn't allow 'outside' GET requests. You could try fetching the data with file_get_contents, but I don't think you will be able to get specific data tables (aside from those on the home page) due to the AJAX requests that need to be made. I believe /data? is the controller that handles data, and it is not exposed via an API.
When you open this URL in your browser you send a GET request. The data at this address is only returned for a POST request with parameters as follows: c:tabela, l:undefined, x:undefined. Analyze the headers next time, and look at the Network log if you are using Chrome/Chromium.
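A minimal cURL sketch of replaying that POST request, with the parameter values taken from the observation above (whether the endpoint actually accepts them from outside is untested):

// Sketch: replaying the observed POST request with cURL (untested assumption).
$ch = curl_init('http://skytech.si/skytechsys/data.php');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array(
    'c' => 'tabela',
    'l' => 'undefined',
    'x' => 'undefined',
)));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);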
If that website does not expose an API, it is not recommended to parse the data, as their HTML structure is prone to change.
See:
http://php.net/manual/en/function.file-get-contents.php
And then you can interpret it with an HTML-parsing engine or with a regular expression (not recommended).
I'm fetching pages with cURL in PHP. Everything works fine, but some parts of the page are calculated with JavaScript a fraction after the page is loaded. cURL already sends the page's source back to my PHP script before the JavaScript calculations are done, thus resulting in wrong end results. The calculations on the site are fetched via AJAX, so I can't reproduce them in an easy way. Also, I have no access to the target page's code, so I can't tweak the target page to fit my (cURL) fetching needs.
Is there any way I can tell cURL to wait until all dynamic traffic is finished? It might be tricky, due to some JavaScript that keeps sending data back to another domain, which might result in long hangs. But at least I could then test whether I get the correct results back.
The developer toolbar in Safari indicates the page is done in about 1.57s. Maybe I can statically tell cURL to wait for 2 seconds too?
I wonder what the possibilities are :)
cURL does not execute any JavaScript or download any files referenced in the document, so cURL is not the solution to your problem.
You'll have to use a browser on the server side, tell it to load the page, wait for X seconds and then ask it to give you the HTML.
Look at: http://phantomjs.org/ (you'll need to use node.js, I'm not aware of any PHP solutions).
With Peter's advice and some research, I have found a solution. It's late, but I hope someone finds it helpful.
All you need to do is request the AJAX call directly. First, load the page that you want in Chrome, go to the Network tab, and filter by XHR.
Now find the AJAX call that you want. Check the response to verify it.
Right-click on the name of the AJAX call and select Copy -> "Copy as cURL (bash)".
Go to https://reqbin.com/curl, paste the cURL command and click Run. Check the response content.
If it's what you want, then move to the next step.
Still in the reqbin window, click Generate code and choose the language you want it translated into, and you will get the desired code. Now integrate it into your code however you want.
Some tips: if a test run on your own server returns a 400 error or nothing at all, set POSTFIELDS to empty. If it returns 301 Moved Permanently, check whether your URL should be https or not.
Not knowing a lot about the page you are retrieving or the calculations you want to include, it could be an option to cURL straight to the URL serving those AJAX requests. Use something like Firebug to inspect the AJAX calls being made on your target page, and you can figure out the URL and any parameters passed. If you do need the full web page, maybe you can cURL both the web page and the AJAX URL and combine the two in your PHP code, but then it starts to get messy.
There is one quite tricky way to achieve this using PHP. If you really want it to work in PHP, you could set up Codeception in conjunction with Selenium and use the Chrome browser WebDriver in headless mode.
Here are some general steps to have it working.
Make sure you have Codeception in your PHP project:
https://codeception.com
Download chrome webdriver:
https://chromedriver.chromium.org/downloads
Download selenium:
https://www.seleniumhq.org/download/
Configure it accordingly, following the documentation of the Codeception framework.
Write a Codeception test, where you can use expressions like $I->wait(5) to wait 5 seconds, or $I->waitForJS('js expression here') to wait for a JS script to complete on the page.
Run the test written in the previous step using the command php vendor/bin/codecept run path/to/test
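A rough sketch of such a test using the WebDriver module (the page path, wait time, and output file are placeholder assumptions):

<?php
// Sketch of a Codeception acceptance test using the WebDriver module.
class RenderedPageCest
{
    public function grabRenderedHtml(AcceptanceTester $I)
    {
        $I->amOnPage('/page-with-ajax-calculations'); // placeholder path
        $I->wait(5);                                  // give the AJAX calls time to finish
        $html = $I->grabPageSource();                 // HTML after JavaScript has run
        file_put_contents('rendered.html', $html);    // placeholder output file
    }
}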
Part of my site requires users to input URLs, but if they type the URL incorrectly, or input a non-existent one on purpose, I end up with a bad record in my database.
E.g. in Chrome, if there isn't anything at a URL, you get the error message "Oops! Google Chrome could not find fdsafadsfadsf.com". (This is the case I'm referring to.)
This could be solved by checking the URL to see if there is anything at it. The only method I can think of is loading the external URL in a PHP file and then parsing its content, but I hope there is a way that doesn't put unneeded strain on my server.
What other ways exist to check if there is anything at a particular URL?
I would just make a HEAD request. This will work with most servers and avoids downloading the entire page, so it is very efficient.
http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html
All you have to do is parse the status code returned. If it is 200, then you're good.
Example implementation with cURL here: http://icfun.blogspot.com/2008/07/php-get-server-response-header-by.html
You can use PHP's get_headers($url), which will return false in case there isn't an answer.
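A small sketch of that approach; note that get_headers() issues a GET by default, and on older PHP versions you can switch it to HEAD via the default stream context:

// Sketch: checking a URL with get_headers(); the default stream context
// is set so a HEAD request is sent and the body is never downloaded.
stream_context_set_default(array('http' => array('method' => 'HEAD')));
$headers = @get_headers('http://www.example.com/');
if ($headers === false) {
    echo "No answer from the server\n";
} else {
    echo $headers[0], "\n"; // e.g. "HTTP/1.1 200 OK"
}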
If you're willing to include a tiny Flash embed, you can do a cross-domain AJAX call from the client to see if anything useful is at the destination. This would avoid any server involvement at all.
http://jimbojw.com/wiki/index.php?title=Introduction_to_Cross-Domain_Ajax
I would use cURL to do this; that way you can specify a timeout on it.
See the comments on: http://php.net/manual/en/function.get-headers.php
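For example, a minimal check with explicit timeouts might look like this (the timeout values are arbitrary, and $url is assumed to hold the user-supplied URL):

// Sketch: URL existence check with cURL and explicit timeouts.
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_NOBODY, true);          // HEAD request, no body
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 3);     // give up connecting after 3s
curl_setopt($ch, CURLOPT_TIMEOUT, 5);            // give up entirely after 5s
curl_exec($ch);
$exists = (curl_getinfo($ch, CURLINFO_HTTP_CODE) === 200);
curl_close($ch);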