I recently made some major changes to an ecommerce website, including the URL structure. The URL to view a product is rewritten by .htaccess and contains a short product description; changing that description does not affect which page is shown.
example: www.Example.com/staticFolder/non-deterministic-product-details/MODEL#.html
Now in the error log file I am seeing bingbot requesting pages like example.com/non-deterministic-product-details
Our sitemaps don't link to this page and I am not able to find any bad links on our pages. Has anyone else had problems with bingbot doing this? I found another question, "Bingbot causing 404 errors", that was locked for being random. Is it more likely that I am doing something wrong? Should I avoid using pseudo directories in my .htaccess?
-Thanks
There's nothing requiring that spiders stick only to link-crawling. It's entirely possible it's guessing URLs which are similar to known ones in the hope that it'll find something.
At any rate, I wouldn't worry about it unless you know it's following a bad link. It's quite normal to get lots of requests for non-existent pages.
We are using Joomla for a public site and we are getting a ton of soft 404 errors that all look similar to
/?option=com_k2&view=itemlist&task=user&id=xxx
where xxx = some numeric id. Obviously this is some sort of spamming, but how do I turn it off in Joomla/K2?
I'm not particularly Joomla oriented but this seems a task I should be able to accomplish if I can get an idea of "where" to fix the code. The page shows a warning instead of an error
Warning
JUser: :_load: Unable to load user with ID: 35414
so it seems the "page" is actually there but with no content. I'm guessing some internal handler is spitting this dynamic content out but I want to return 404 in this case. Any ideas would be appreciated.
I'm trying to understand the nature of your problem. What do you mean by "Soft 404" errors? Do you have 404 errors or not?
On my K2 websites, I sometimes have "visitors" who try to find holes in K2. I then see many, many accesses of the same page. These visitors try to post comments or something else on the articles or user profiles.
Is this a similar thing that is happening on your site?
Is there a (Joomla) user on your site with ID 35414? If not, you can be pretty sure that someone is trying to hack your site.
Is the URL always requested by the same user? You can find this in the log files of the Apache server. In such cases I add a "deny from" statement to my .htaccess file.
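To check whether the requests come from one client, the access log can be tallied by address. The following is only a rough sketch: it assumes the common/combined Apache log format, and the sample lines, the addresses, and the `count_clients` helper are made up for illustration.

```python
import re
from collections import Counter

# Assumes the common/combined log format: client address first,
# then the quoted request line ("GET /path HTTP/1.1").
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(?:\S+) (\S+)[^"]*"')

def count_clients(lines, needle):
    """Count requests per client address for URLs containing `needle`."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and needle in match.group(2):
            hits[match.group(1)] += 1
    return hits

# Hypothetical log lines (documentation-range addresses), not real data.
SAMPLE_LOG = [
    '203.0.113.7 - - [01/Jan/2024:10:00:00 +0000] "GET /?option=com_k2&view=itemlist&task=user&id=35414 HTTP/1.1" 200 512',
    '203.0.113.7 - - [01/Jan/2024:10:00:02 +0000] "GET /?option=com_k2&view=itemlist&task=user&id=35415 HTTP/1.1" 200 512',
    '198.51.100.4 - - [01/Jan/2024:10:00:05 +0000] "GET /?option=com_k2&view=itemlist&task=user&id=12 HTTP/1.1" 200 512',
    '198.51.100.4 - - [01/Jan/2024:10:00:09 +0000] "GET /index.html HTTP/1.1" 200 2048',
]

hits = count_clients(SAMPLE_LOG, "option=com_k2")
top_client = hits.most_common(1)
```

If one address dominates the tally, that is the one to consider for the "deny from" statement.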
If you seriously suspect a K2-related issue, I would recommend posting it in the K2 forum. That forum is quite knowledgeable, and JoomlaWorks makes a serious effort to provide good K2 customer support.
I'm trying to get to the bottom of an issue Moz's crawler got stuck on. The easy problem we need to fix is that we have duplicates of the same page i.e.:
/capabilities/
/capabilities/index.html
That problem is occurring for a handful of directories. But we also have an issue that seems to be creating an infinite loop of accessible pages, just for this subdirectory:
/customer_service.html/
/customer_service.html/contact/index.html
/customer_service.html/contact_us/contact_form.php
/customer_service.html/contact/contact/contact/contact/index.html
/customer_service.html/contact/contact/contact_us/contact_form.php
/customer_service.html/contact_us/contact/contact/contact/index.html
/customer_service.html/contact_us/contact/contact/contact_us/contact_form.php
/customer_service.html/contact/contact_us/contact/contact_us/contact_us/contact/index.html
And on and on and on... I think it stopped crawling just because it reached 24,000 pages. All these pages actually work. Really there only need to be two pages: one for customer service FAQs, and one for contacting the company.
I'm a marketer, not a developer, so all I know is that this is an issue. I'm wondering whether we can fix this using htaccess, or if there is another problem. It seems to me like all these other pages need to be eliminated, not just redirected. Thanks.
edit: added more examples for illustrative and comic purposes
There are two things to do.
One is, like you say, not to allow these URLs to redirect to the main page. Show what you have in your .htaccess file and I will look at how you could change it.
On the other hand, it is not sufficient to address the symptom. You have to heal the sickness. Here it means that you have some incorrect links on your site. Most probably these are relative URLs that are missing the initial slash (contact instead of /contact).
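How a link without the leading slash can produce the ever-growing paths from the question can be sketched with Python's standard `urljoin`, which resolves links the way a browser or crawler does. The host `example.com`, the link targets, and the `follow` helper are assumptions for illustration; the real links on the site may differ.

```python
from urllib.parse import urljoin

def follow(start, href, hops):
    """Resolve the same link `hops` times, each time from the page just reached."""
    urls = []
    current = start
    for _ in range(hops):
        current = urljoin(current, href)
        urls.append(current)
    return urls

START = "http://example.com/customer_service.html/"

# Relative link (no leading slash): every hop nests one level deeper.
relative = follow(START, "contact/index.html", 3)

# Root-relative link (leading slash): every hop lands on the same URL.
absolute = follow(START, "/contact/index.html", 3)

for url in relative + absolute:
    print(url)
```

The relative link grows by one `contact/` per hop, while the root-relative link always resolves to `/contact/index.html`, so a crawler has nothing new to follow.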
Googlebot couldn't crawl this URL because it points to a non-existent page. Generally, 404s don't harm your site's performance in search, but you can use them to help improve the user experience.
This error occurs for the following URLs.
How can I solve it?
Check which pages link to these URLs. Maybe your website's domain had a previous owner who had a website, and there are still inbound links pointing to that old site. That is something you can't control; if that's the case, you can have the server show your start page for these missing URLs. Do this in your .htaccess file:
ErrorDocument 404 /index.html
Some things to check that may produce malformed URLs and are under your control:
your paging code in search results and/or categories of products/services/content
your sitemap
I also had a similar experience. On one of my websites I had this scheme:
example.com?category=1
example.com?category=2
example.com?category=3
and in Webmaster Tools I was getting random strings:
example.com?category=xxcCzxvsd
In my analytics, nobody (except Googlebot) ever visited example.com?category=xxcCzxvsd. I couldn't find the source of this, so there is a strong chance the problem is on Google's side.
I have a Joomla based community site and with search engine friendly URLs activated in the backend my profiles are located under mysite.com/community/profile/user/"username"
I need the htaccess file to do nothing unless a URL containing "community/profile/user" is found. If that string is found then it should change the link to mysite.com/"username" but in reality be showing the page mysite.com/community/profile/user/"username"
I think this would be a rewrite rule instead of a redirect, but I barely know what I'm talking about.
Can someone please tell me what code I must place in my .htaccess file in order to change this? I believe .htaccess would be the best way to do what I need, but if you have another idea I'm glad to hear it.
First be sure you understand .htaccess's role.
It is only read when an incoming request is made, so it will not change URLs generated by Joomla.
You can however allow urls like mysite.com/eddie to actually pull content from mysite.com/blah/blah/eddie
http://httpd.apache.org/docs/current/rewrite/remapping.html
If you are looking to "train" your users, you can add a step before that to redirect the URL as well. This gets very tricky, though: if you're not careful, you can get caught in a loop.
user clicks mysite.com/blah/blah/eddie
apache redirects to mysite.com/eddie
(browser makes a second request, user sees the URL change)
apache sees mysite.com/eddie and loads the underlying mysite.com/blah/blah/eddie
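The flow above can be modelled in a few lines of Python. This is a sketch of the decision logic only, not Apache configuration; the two patterns and the `handle` function are hypothetical, and the long path is the one from the question.

```python
import re

# Hypothetical patterns: the long profile URL and the short "vanity" URL.
LONG = re.compile(r"^/community/profile/user/([^/]+)/?$")
SHORT = re.compile(r"^/([^/]+)/?$")

def handle(path):
    """Decide what to do with a path exactly as the browser requested it."""
    match = LONG.match(path)
    if match:
        # Visible step: the browser is told to request the short URL instead.
        return ("redirect", "/" + match.group(1))
    match = SHORT.match(path)
    if match:
        # Invisible step: content comes from the long path, URL bar unchanged.
        # Simplification: a real rule would have to skip existing files and
        # folders, otherwise /index.php or /community would match here too.
        return ("serve", "/community/profile/user/" + match.group(1))
    return ("pass", path)

first = handle("/community/profile/user/eddie")
second = handle(first[1])
print(first, second)
```

The loop mentioned above appears if the "serve" target is run through the redirect check again, because it matches the long pattern. The redirect should therefore only be applied to the path the browser originally asked for.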
An easier solution might be to tweak the Joomla community code to generate the short URLs (mysite/eddie) and use Apache to make a call directly to the plugin (mysite/components/communit/index.php?user=eddie).
Someone had changed my .htaccess, and I have removed that.
But I still have phantom pages like this:
http://www.biztalk-training.com/?puqr=usoe
I don't have any 404.php, 404.shtml, or 404.html pages.
I checked CPanel for redirects on 404, and it looked empty (but would have created a 404.shtml if I filled it in).
If I type something like this into the browser, I get a 404:
http://biztalk-training.com/anything.html
I'm looking for what to kill, remove or fix to get rid of the phantom page. I'm a developer (other platforms) with moderate familiarity with PHP and CPanel sites. I'm used to seeing domainname.com/progname.php?parm=test and I know how that works. But I don't know how the ?puqr=usoe is producing content on my site. There are other similar pages, discovered by doing a site: search on Google.
Thanks,
Neal Walters
Have you checked your index page? Under normal circumstances, http://www.example.com/?foo=bar will pass the query string (?foo=bar) to the index of example.com and will not produce a 404.
If these malcontents got write access to your server - and it sounds like they did - they could have easily modified your index page.
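The point about the query string going to the index page can be checked by splitting the URLs into their parts. This small sketch uses Python's standard `urllib.parse`; the `describe` helper is just an illustrative name.

```python
from urllib.parse import urlsplit, parse_qs

def describe(url):
    """Return the path the server looks up and the parsed query parameters."""
    parts = urlsplit(url)
    return parts.path, parse_qs(parts.query)

# Path is "/", so the server runs the index page and hands it foo=bar.
index_request = describe("http://www.example.com/?foo=bar")

# Path is "/anything.html", a file that must exist or the server answers 404.
file_request = describe("http://example.com/anything.html")

print(index_request)
print(file_request)
```

A URL of the ?puqr=usoe kind has the same shape as the first case: the path is just "/", so whatever the index page does with its parameters decides what is shown.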