I have built a robot which basically crawls websites starting at the root: it parses the page, saves all the internal links, then moves to the first page in the link list, parses it, and so on.
I know it's not really optimal to run this sort of bot in PHP, but as it's the only language I happen to know well enough, that's the one I chose.
I came across all sorts of issues: pages returning 404, pages being redirected, pages which are not parsable (a weird case of a few pages that return only a few words when parsed but return the entire expected body when you send an HTTP GET request), etc.
Anyway, I reckon I have made the robot so it can get through 99.5% of the pages it parses, but there are still some pages that are not parsable, and at that point my bot crashes (about 1 page out of 400 makes the bot crash; by crashing I mean I get a fatal error and the code just stops).
Now my question is: how can I prevent this from happening? I am not asking how to fix a bug I can't even debug (most of the time they are timeouts, so not very easy to debug); I'd like to know how to handle those errors. Is there a way to refresh the page in case a certain type of error occurs? Is there a way to get around those timeout fatal errors?
I cannot see the point of showing you any code, although I will if you feel the need to check a certain part of it.
Thank you
The simplest way I can think of is to use a try {} catch () {} block.
http://www.php.net/manual/en/language.exceptions.php
You put the parsing code into the try block, and if an exception is thrown, feed in some default values and go on to the next link.
If you are getting fatal errors (which I think you can't catch with try), then you could also try splitting each download/parsing step into a separate PHP file that is called via curl with the URL it needs to look up. This kind of poor man's parallelization will incur a lot of overhead and is probably not how PHP "should" be used, but it should work. You'll also need to store the results in a database or text file.
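A minimal sketch of the try/catch approach, assuming PHP 7 or later, where many errors that used to be fatal are thrown as Error objects and can be caught through Throwable. The names crawlAll and $parsePage are hypothetical and not from the original post. A script that exceeds max_execution_time still stops with an uncatchable fatal error, so timeouts are probably better enforced on the HTTP request itself (for example with cURL's CURLOPT_TIMEOUT option).

```php
<?php
/**
 * Visit every URL in $queue, skipping pages whose parsing fails.
 * $parsePage is a hypothetical callable: it takes a URL and returns
 * the list of internal links found there, or throws on failure.
 */
function crawlAll(array $queue, callable $parsePage, int $maxPages = 1000): array
{
    $visited = [];
    $failed  = [];

    while ($queue !== [] && count($visited) + count($failed) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url]) || isset($failed[$url])) {
            continue; // already handled, do not fetch it twice
        }
        try {
            $links = $parsePage($url);
            $visited[$url] = true;
            foreach ($links as $link) {
                $queue[] = $link;
            }
        } catch (Throwable $e) {
            // Record the failure and move on instead of crashing.
            $failed[$url] = $e->getMessage();
        }
    }

    return ['visited' => array_keys($visited), 'failed' => $failed];
}
```

The failed list can be retried later or inspected by hand, which also gives you something to debug from.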
I've been fortunate enough to be a CF dev for pretty much my entire IT career without having to take on using another development language so I have a knowledge hole I'd like to ask others to help me with.
I've built an API and I want to describe to others how to invoke it. It needs to be invoked first thing during a request before any generated content is sent back to the user. One of the possible outcomes of the API call is that the incoming user request could be aborted so that there's no error message but also no generated content. Just a blank screen. Sending back the blank screen with no generated page code is critical.
I can tell someone using CF that it needs to be called at the beginning of the Request scope or OnRequest scope, but I'm at a loss as to how to get across the same arrangement for someone using other languages/frameworks like PHP, ASP.NET, Node.js, WordPress, etc.
So, for example, for a CF based site I'd say something like: "The synchronous API call needs to be made early in the Request or OnRequest scope and BEFORE any generated page content is returned to the user". What I'm looking for is how to describe that same thing but for users of those other languages/frameworks.
Odd question but Google has been zero help (or perhaps I just don't know how to search for something like this). Any advice/guidance would be most appreciated.
Thanks in advance!
Isn't the answer to your question simply to tell them "It needs to be invoked first thing during a request before any generated content is sent back to the user"? (I copied and pasted that from your question.)
That's it. That is absolutely clear.
That's all you need to do.
Don't worry about how they need to do that in their language of choice; especially given the very nature of your question, you won't know how. It's their job to write the code to consume your API, not yours.
At most you could give them some usage pseudo-code along the lines of:
    // at the beginning of the response handler
    apiResult = apiObj.makeRequest(args, here)
    if (apiResult.youCanComeIn == false) {
        // handle it with a 403 or something appropriate
        // stop
    }
    // they're allowed in, so rest of processing here
Obviously, any API request must return a specific response. You probably need to define, at the level of your API, both the expected value and the value returned for each error. Any developer will then understand what to output when the API response contains an error.
You probably mean something like: "the request must be processed on the server side; in case of an error, generate an empty page on the client side", etc.
It's hard to recommend anything. Maybe server-side rendering (SSR).
The functions I have written throw exceptions if they can't do their job. For the production environment I thought of redirecting the exception to a nice-looking error page. So I'm thinking of setting the exception handler with set_exception_handler at the beginning of every script. How does the error page know which error occurred? I thought of putting an error code into the URL, like header("Location: error.php?code=1234"). During the development phase I just would not set the exception handler, so every exception would be printed on the PHP default error screen, Uncaught Exception: ..., with all the useful information.
I have read Exceptions in PHP - Try/Catch or set_exception_handler? but don't know how to write a front controller script, and I also think this may be too much effort.
I'm a PHP beginner who would like to handle errors the right way, but I'm just not sure if I'm doing it right or wrong. Do you think it's OK to do it as described above, or do you have other suggestions?
Thank you!
Don't redirect. In your exception handler function just output the error page at that point (or include a PHP file which includes the error page HTML). You also want to set an appropriate status code (using the PHP header function).
Edit: Why not to redirect:
You want to return an appropriate HTTP status code on your error pages (usually 404, 403, 400, 500 or 503 depending on the cause), so that search engine robots know not to index the error, crawlers can identify broken pages, browsers know not to cache the page and so on. If you redirect, you are returning a 301/302 HTTP status code and not one of the error ones.
You want users to be able to refresh the page with the error on, in case it was a temporary glitch. If you redirect them to another URL, however many times they refresh they will always see your error page (since that's the page they're on).
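A minimal sketch of outputting the error page in place rather than redirecting. The function buildErrorResponse and the $debug flag are hypothetical names for illustration, and the 500 status is only a default; a real handler would pick 404, 403 and so on depending on the exception.

```php
<?php
/**
 * Build the status code and HTML body for an uncaught exception.
 * $debug is a hypothetical flag: true in development, false in production.
 */
function buildErrorResponse(Throwable $e, bool $debug): array
{
    $status = 500; // generic server error; choose 404, 403, ... where appropriate
    $detail = $debug
        ? htmlspecialchars(get_class($e) . ': ' . $e->getMessage(), ENT_QUOTES, 'UTF-8')
        : 'Something went wrong. Please try again later.';

    $body = "<!DOCTYPE html>\n<html><head><title>Error</title></head>"
          . "<body><h1>Error</h1><p>{$detail}</p></body></html>";

    return [$status, $body];
}

// Last-resort handler: send the error page on the same URL, no redirect.
set_exception_handler(function (Throwable $e): void {
    [$status, $body] = buildErrorResponse($e, false);
    if (!headers_sent()) {
        http_response_code($status);
    }
    echo $body;
});
```

Because the URL does not change, a refresh retries the original request, and the status code tells crawlers and browsers that this was an error.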
Well, to tell you the truth, I think it's too early for you to worry about this kind of thing.
For now just focus on mastering OOP, because later on you WILL (and probably will have to) use an MVC framework, which does all the error/exception catching for you. Take a look at symfony: in the development environment it shows you the exception and stack trace, but in the production environment it spits out a nice, customizable error page.
What I mean is: don't reinvent the wheel; see how others solved similar problems, and preferably use their solutions (but remember to understand them as well).
"Do you think it's ok doing it like above described or do you have other suggestions?"
No, I don't think this is an OK solution. You don't want to apply a golden hammer to every problem that arises. Some exceptions should be handled differently than others. Furthermore, a particular exception may need to be handled differently in one part of the code than in another. I would suggest letting set_exception_handler act as a last resort, handling those exceptions that for some reason were not properly caught, and going forward writing try/catch blocks to handle things more granularly.
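As a rough illustration of handling things more granularly, the sketch below catches one expected exception type locally and lets everything else bubble up. RecordNotFoundException, loadUserName and greeting are all hypothetical names made up for this example.

```php
<?php
/** Hypothetical domain exception, used only for this illustration. */
class RecordNotFoundException extends RuntimeException {}

/** Hypothetical lookup that throws when the record is missing. */
function loadUserName(array $users, int $id): string
{
    if (!isset($users[$id])) {
        throw new RecordNotFoundException("No user with id {$id}");
    }
    return $users[$id];
}

/** Handle the expected failure locally; let anything else bubble up. */
function greeting(array $users, int $id): string
{
    try {
        return 'Hello, ' . loadUserName($users, $id);
    } catch (RecordNotFoundException $e) {
        // Expected case: recover with a sensible default.
        return 'Hello, guest';
    }
    // Other exception types are not caught here, so they would reach
    // whatever was registered with set_exception_handler.
}
```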
I have a website that's written in PHP and makes heavy use of JS. One of my clients has a very strange error: the site is empty and nothing is displayed. I cannot reproduce the error even though I use the same browser and the same OS, and have much the same add-ons, firewall and antivirus.
So I would like to catch every PHP and JS error or warning and put it in the error log (ideally in a database). Is there any ready-made, simple solution to accomplish this? I address this question to experienced web developers.
Or is there any way to dump all the data about the user session when the error occurs that is easy for a non-technical user to do? I see it this way: when the user gets this error, he clicks something (for example in an extension) and this sends all the session and error information to me so I can figure out what is going on. Do you know any solution of this kind?
mplungjan's idea is good.
I would also ask the client to view the source of the page and send that to me to make sure it looks OK.
Your web server (e.g. apache) should keep a log file of every single PHP request and tell you whether errors occurred.
I don't know if there is a way to report javascript errors back to your server. If you were able to catch the error and send an AJAX request in your error handler to get logged on your server, that would work. But I think that some javascript errors (like syntax errors?) can not be caught with catch. I would ask the client to open the javascript console (or whatever it is called in his browser) and tell me all the errors he sees. You should eliminate all the errors eventually, and a good strategy to do that would be to focus on the first error that occurred.
I would run the page through a w3c validator to see if it is valid HTML/CSS.
Also, you should try the universal technique of simplifying the code down to the simplest possible thing that should work but doesn't work. That will either let you find the problem or produce something that is so small and simple that you can post it to Stack Overflow.
You need to distinguish between two types of errors: client-side and server-side.
A blank page can be either, but I would think this is most likely server-side.
For server-side errors you can log every error, and even add your own information such as the session, by registering your own error handler. You can then log errors into the database and append the session and request information, as well as a backtrace. This will enable you to obtain more information.
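A minimal sketch of such a handler, using set_error_handler. The function buildErrorRecord and the $errorLog array are hypothetical; the array only stands in for a database table. Note that set_error_handler does not see fatal errors; those can be picked up in a function registered with register_shutdown_function by calling error_get_last().

```php
<?php
/**
 * Build one log record from a PHP error plus request/session context.
 * Where the record is stored (database, file) is left open here.
 */
function buildErrorRecord(int $severity, string $message, string $file, int $line): array
{
    return [
        'time'      => date('c'),
        'severity'  => $severity,
        'message'   => $message,
        'file'      => $file,
        'line'      => $line,
        'uri'       => $_SERVER['REQUEST_URI'] ?? null,
        'session'   => $_SESSION ?? [],
        'backtrace' => debug_backtrace(DEBUG_BACKTRACE_IGNORE_ARGS),
    ];
}

$errorLog = []; // stand-in for a database table

set_error_handler(function (int $severity, string $message, string $file, int $line) use (&$errorLog): bool {
    $errorLog[] = buildErrorRecord($severity, $message, $file, $line);
    return true; // skip PHP's built-in handler for this error
});
```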
For client side, David Grayson's answer has a suggestion.
I've worked on a CMS which would use Smarty to build the content pages as PHP files, then save them to disc so all subsequent views of the same page could bypass the generation phase, keeping DB load and page loading times down. These pages would be completely standalone and not have to run in the context of another script.
The problem was the case where a user first visited a page that wasn't cached: they'd still have to be shown the generated content. I was hoping I could save my generated file, then include() it, but filesystem latency meant that this wasn't an option.
The only solution I could find was using eval() to run the generated string after it was generated and saved to disc. While this works, it's not nice to have to debug in, so I'd be very interested in finding an alternative.
Is there some method I could use other than eval in the above case?
Given your scenario, I do not think there is an alternative.
As for the debugging part, you could always write it to disc and include it during development to test and fix it that way, and then, once you have the bugs worked out, switch it over to eval.
Not knowing your system, I won't second-guess you, since you know it better than I do, but it seems like a lot of effort, especially since the above scenario will only happen once per page... ever. Is it really worth it to display the initial page through eval for that one instance, and why could you not be the initial user who generates the pages?
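The development/production switch suggested above could be sketched roughly as follows. cacheAndRender and its $useInclude flag are hypothetical names; the sketch assumes the generated page is a string that may mix HTML with PHP open and close tags, as a Smarty-compiled file would.

```php
<?php
/**
 * Save a generated PHP page to disc and return its output for the
 * current request. With $useInclude = true (development) the saved
 * file is included, so errors point at a real file and line; with
 * false (production) the string is run through eval() instead.
 */
function cacheAndRender(string $generated, string $cachePath, bool $useInclude = false): string
{
    // LOCK_EX keeps two requests from writing the file at the same time.
    file_put_contents($cachePath, $generated, LOCK_EX);

    ob_start();
    try {
        if ($useInclude) {
            include $cachePath;
        } else {
            // Prefixing a closing tag makes eval() start in HTML mode,
            // the same way include would treat the file.
            eval('?>' . $generated);
        }
        return ob_get_contents();
    } finally {
        ob_end_clean(); // always discard the buffer, even on errors
    }
}
```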
What do you do when you detect that your GET request is broken or is passing the wrong types of data? Say you have a forum-page.php?forum=3 that lists all the topics related to forum 3.
What would be a good way to deal with the absence of the "forum" variable? What if, instead of an integer, you got a string? How would you respond to such a wrong request?
Spit out an error telling why you refused the request
If forum-page.php is called without the "forum" variable simply redirect to a default page, something like forum-page.php?forum=1. The same thing for a wrongly typed forum variable.
Redirect to some other page. Something like the forum/board index?
Other options?
Would really love to read your opinions about this.
I typically return a 400 (Bad Request) with a status description explaining why (e.g. "forum parameter is required"). Not sure if PHP allows this (ASP.NET does), but you could then map a 400 to a custom page that displays the error in a way that makes sense for your application.
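For what it's worth, PHP can do this with http_response_code() (available since PHP 5.4). Below is a minimal sketch; checkForumParam and forumPage are hypothetical names, and treating only positive integers as valid forum ids is an assumption.

```php
<?php
/**
 * Validate the "forum" query parameter.
 * Returns [statusCode, payload]: the forum id on success,
 * or a human-readable reason on failure.
 */
function checkForumParam(array $query): array
{
    if (!isset($query['forum'])) {
        return [400, 'forum parameter is required'];
    }
    $id = filter_var($query['forum'], FILTER_VALIDATE_INT, ['options' => ['min_range' => 1]]);
    if ($id === false) {
        return [400, 'forum parameter must be a positive integer'];
    }
    return [200, $id];
}

/** Return the page body, setting a 400 status for bad requests. */
function forumPage(array $query): string
{
    [$status, $payload] = checkForumParam($query);
    if ($status !== 200) {
        http_response_code($status);
        return 'Bad request: ' . htmlspecialchars($payload, ENT_QUOTES, 'UTF-8');
    }
    return "Topics for forum {$payload}";
}
```

In forum-page.php this would be used as echo forumPage($_GET); before any other output is sent.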
It depends quite a bit on each page and its GET requests. Most pages like the one you used as an example can fail gracefully, but others which have required variables missing may need to throw a 400 (Bad Request) or a 404 (Not Found). A 404 is actually quite necessary because there may be a bad link being spidered by a search engine or being passed around through the internets, so you'd want to stop this behavior.
My view is to try the following:
For wrong/missing required variables, throw a 400 or a 404 (depending on your app). However, for a 400, I would fail gracefully to the default page (forum-page.php) and show the error in an error box at the top of the page.
For wrong non-essential variables that may be mistyped, fail gracefully to the default page.
For wrong non-essential variables that are completely the wrong format or object type, throw 404's since they may be attempts at subverting the security of your app.
Ultimately, the really important thing is to never try to "guess" the wrong/missing variables and fill them in for the user (in most cases). I've come across many webapps where this behavior was misused by hackers to trick the webapp into simulating a vulnerability.