Some pages aren't indexed with Sphider unless added manually - php

I recently installed Sphider onto my site and it was simple to do so and indexing the pages was very simple, however I ran into a small issue.
I have a lot (seriously, loads) of pages on my site and a lot of them weren't indexed. I have a page that takes a .csv file and builds a table using a foreach loop in PHP; the first column is a hyperlink to each item, which has its own dedicated page. My issue is that Sphider does not index these individual pages; it only indexes the table page. I'm in a right two and eight because I have no idea why these pages are not indexed.
I checked to see if I had any, but I didn't, and I even set Sphider to index a random one of the individual pages from the table and it appeared in the search. I'd do this with all the pages, but I keep adding new pages every time we get a new item, so I would get inundated with things to add to the index list.
My question is this: is there some solution where I can have a script that adds each URL to Sphider's database, seeing as that seems to make them appear? Or am I being a complete div and missing something really obvious here, perhaps something that goes wrong because of the .csv PHP table?
I would really appreciate your help because I am completely confused.
Thanks, Carty
PS: What's the standard for including a tl;dr? Is that just for Redditors? :P

I had a similar problem when I first started using Sphider Search: when I would try to spider a folder on my website, e.g. www.mysite.com/myfolder, which contained 900 different HTML pages, it would only spider / list in the database one link, which was www.mysite.com/myfolder.
I figured out that Sphider won't spider a whole directory if it has an 'index.html', 'home.html' or 'index.php' file in said folder.
So I temporarily deleted my index.html file, successfully spidered all 900 HTML files, then re-uploaded my index.html.
If index and home HTML files are not the cause, it might be that your spidering link depth settings are not high enough.
Lastly, Sphider respects the rel="nofollow" attribute in tags, so it won't index those links either.
Hope this helps.

If your page contains fewer than 3 words, Sphider is not able to index it by default. You have to change this in
/sphider/settings/conf.php
as per your requirement, for example:
$min_words_per_page = 0;

Related

For Google crawling purposes: Single PHP pull-page, or individual pages for each different item?

I am creating a site and want to have individual pages for each row in a database table. The information on each page is fairly useful and comprehensive, and it would be really nice if Google could index them.
My initial thought was to just create a single PHP template page and pull the correct information for whatever the user is looking at, but my fear is that search engines won't be able to index all of the pages.
My second thought was to batch-create/automate the process of creating the individual pages as html files (for the 2000+ rows in the table), because then I would be guaranteed that they'd be crawled. However, if I ever needed to make a change to the design, I'd have to re-process them all. Kind of a pain...
My final consideration was to just pick a page on my site and list all of the possible PHP pages in a hidden div, but I wasn't sure if search engines can index from that. I assume they just pull from the HTML, so they'd be able to find it, right?
Any suggestions? I would love it if I could just create a single page that populates based on what the user clicks, but I want the pages to be indexed.
Search engines can index dynamic pages so using one PHP file to create thousands of unique product pages will be fine for SEO. After all, each page/product will have a unique URL and will be seen as a unique page as a result. All you need to do is link to your product pages within your website and/or submit an XML sitemap so you can be sure they are found and indexed.
By linking your pages, I literally mean link to your product pages. Search engines find new content primarily by following links, so if you want your product pages to be found you need to link to them. Using form-based search is not a good way to do it, as search engines generally don't play too well with forms. But there are lots of ways to make links to your pages, including HTML sitemaps and product category pages which can then link to the products in each category. Really, any way you can get a link to your product pages is a good way to help ensure they are found by the search engines.
You don't have to post links in an invisible DIV!
Just create the page and have parameterized content fetching.
You can include the pages in the XML sitemap and submit it to Google, or you can include your page URLs in an HTML sitemap too.
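As a very rough sketch of that parameterized approach, assuming a MySQL table named items accessed via PDO (the table, column, file names and credentials here are placeholders, not anything from the question):
<?php
// item.php?id=123 - one template that renders a unique, indexable page per database row.
// Assumes an `items` table with `id`, `name` and `description` columns; adjust to your schema.
$pdo = new PDO('mysql:host=localhost;dbname=mysite', 'dbuser', 'dbpass');

$stmt = $pdo->prepare('SELECT name, description FROM items WHERE id = ?');
$stmt->execute(array(isset($_GET['id']) ? (int) $_GET['id'] : 0));
$item = $stmt->fetch(PDO::FETCH_ASSOC);

if (!$item) {
    header('HTTP/1.0 404 Not Found'); // unknown item id
    exit;
}
?>
<html>
  <head><title><?php echo htmlspecialchars($item['name']); ?></title></head>
  <body>
    <h1><?php echo htmlspecialchars($item['name']); ?></h1>
    <p><?php echo htmlspecialchars($item['description']); ?></p>
  </body>
</html>
As long as each item is reachable through a crawlable link (e.g. item.php?id=123 from a category page or sitemap), it gets its own URL and is treated as its own page.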

Looking for more efficient way to serve numerous link redirects?

I'm wondering if there's a more efficient way to serve a large number of link redirects. For background: my site receives tens of thousands of users a day, and we "deep link" to a large number of individual product pages on affiliate websites.
To "cloak" the affiliate links and keep them all in one place, I'm currently serving all our affiliate links from a single PHP file, e.g. a user clicks on mysite.com/go.php?id=1 and is taken to the page on the merchant's site, appended with our affiliate cookie, where you can buy the product. The code I'm using is as follows:
<?php
// go.php - map affiliate link IDs to destination URLs
$path = array(
    '1' => 'http://affiliatelink1.com',
    '2' => 'http://affiliatelink2.com',
    '3' => 'http://affiliatelink3.com',
);

// Redirect if the requested ID exists, then stop execution
if (isset($_GET['id']) && array_key_exists($_GET['id'], $path)) {
    header('Location: ' . $path[$_GET['id']]);
    exit;
}
?>
The problem I'm having is that we link to lots of unique products every day and the php file now contains 11K+ links and is growing daily. I've already noticed it takes ages to simply download and edit the file via FTP, as it is nearly 2MB in size, and the links don't work on our site while the file is being uploaded. I also don't know if it's good for the server to serve that many links through a single php file - I haven't noticed any slowdowns yet, but can certainly see that happening.
So I'm looking for another option. I was thinking of simply starting a new .php file (e.g. go2.php) to house more links, since go.php is so large, but that seems inefficient. Should I be using a database for this instead? I'm running WordPress too, so I'm concerned about using MySQL too much, and simply doing it in PHP seems faster, but again, I'm not sure.
My other option is to find a way to dynamically create these affiliate links, i.e. create another PHP file that will take a product's URL and append our affiliate code to it, eliminating the need for me to manually update a php file with all these links, however I'm not sure about the impact on the server if we're serving nearly 100K clicks a day through something like this.
Any thoughts? Is the method I'm using spelling certain death for our server, or should I keep things as is for performance? Would doing this with a database or dynamically put more load on the server than the simple php file I'm using now? Any help/advice would be greatly appreciated!
What I would do is the following:
Change the URL format to have the product name in it for SEO purposes, such as something like "my_new_product/1"
Then use mod_rewrite to map that URL to a page with a query string, such as:
RewriteRule ^([a-zA-Z0-9_-]*)/([0-9]*)$ index.php?id=$2 [L]
Then create a database table containing the following fields:
id (autonumber, unique id)
url (the url to redirect to)
description (the text to make the url on your site)
Then, you can build a simple CRUD thing to keep those up to date easily and let your pages serve up the list of links from the DB.
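A minimal sketch of what that lookup-and-redirect script could look like, assuming the table above is called redirects and is accessed via PDO (the table, column names and connection details are illustrative, not from the question):
<?php
// index.php - look up the requested ID in the database and redirect to the stored URL.
// Assumes a `redirects` table with `id` and `url` columns; adjust names and credentials.
$pdo = new PDO('mysql:host=localhost;dbname=mysite', 'dbuser', 'dbpass');

$id = isset($_GET['id']) ? (int) $_GET['id'] : 0;

$stmt = $pdo->prepare('SELECT url FROM redirects WHERE id = ?');
$stmt->execute(array($id));
$url = $stmt->fetchColumn();

if ($url) {
    header('Location: ' . $url);      // send the visitor to the affiliate URL
} else {
    header('HTTP/1.0 404 Not Found'); // unknown id
}
exit;
?>
A single indexed lookup like this stays fast well past 11K rows, and adding or editing a link becomes a database update instead of re-uploading a 2MB PHP file.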

I have an iframe whose content I need Google to index. Is this possible?

I have a classifieds website.
The index.html has a form:
<form action="php_page" target="iframe" etc...>
The iframe displays the results, and the php_page builds the results for the iframe. Basically the php_page builds a table containing the results from a mysql db, and outputs it.
My problem is that this doesn't get indexed by Google.
How can I solve this?
The reason I used an iframe in the first place was to avoid reloading the page when hitting submit.
Ajax couldn't be used due to various reasons I won't go into here.
Any ideas what to do?
Thanks
UPDATE:
I have a sitemap with URLs to all the classifieds as well, but I don't think this guarantees that Google will spider those URLs.
Trying to make the Google spider crawl the results of a search form is not really the right approach.
Assuming you want google.com users to find your classified ads by searching Google, the best approach is to create a set of static HTML pages from the ads and link to them (not invisibly) from elsewhere on your site - probably best from the home page, but such a link can be in a footer or somewhere else unobtrusive.
They can also be linked to from your XML sitemap (you do have a sitemap XML file, don't you?)
Note: the <iframe> doesn't really come into this. Or Ajax.
There is no way to make any webspider fill out and submit forms.
Workaround: Every night, create a dump of the database and save the HTML to a file. Create a link from index.html to that file. Use CSS classes to make the link invisible. This way, Google will pick it up but users won't see it.
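A rough sketch of that nightly dump, assuming a MySQL table named ads read via PDO (the table, columns, output file name and the view_ad.php URL are all illustrative):
<?php
// dump_ads.php - run nightly (e.g. from cron) to write all classifieds into one static HTML file.
// Assumes an `ads` table with `id` and `title` columns; adjust to your schema.
$pdo = new PDO('mysql:host=localhost;dbname=classifieds', 'dbuser', 'dbpass');

$html = "<html><body><ul>\n";
foreach ($pdo->query('SELECT id, title FROM ads') as $row) {
    $html .= '<li><a href="view_ad.php?id=' . (int) $row['id'] . '">'
           . htmlspecialchars($row['title']) . "</a></li>\n";
}
$html .= "</ul></body></html>\n";

file_put_contents('all_ads.html', $html);
?>
The generated file can then be linked from index.html so the spider has a plain HTML path to every ad.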

Dynamic links should be included in sitemap?

I have a live cricket scores website, on which I control the news section dynamically.
I have my own custom-built CMS written in PHP, where an admin adds news to the web portal.
When I generate the sitemap, the dynamically created pages aren't added to it.
Is this good practice, or do we need to add the dynamically created links to the sitemap?
If yes, can you please share how we can add dynamic links?
One more observation I have made: whatever news is added gets cached in Google within 4 hours.
Please share your thoughts, thanks in advance
If the pages are important, then you should add them to the site map so they can be indexed for future reference. However, if the pages are going to disappear after the match, then I wouldn't put them on the site map as they may get indexed then disappear, which may have a negative impact on your search engine rankings.
You can add these dynamic pages to a sitemap in a couple of ways:
1. Whenever a new dynamic page is created, re-create your sitemap. Do this by looking through the database for the pages which are currently valid and writing them out into an XML sitemap file (a rough sketch of this appears after this answer).
2. When a new page is created, read the current XML sitemap and insert a new entry into the relevant place.
I would say the easiest option is option 1, as you can quickly and easily build a sitemap without having to read what you already have. That option also means that when you remove one of the dynamic pages, it is dropped from the sitemap when it is re-built, without the need to read through what you have, find the entry and remove it.
Google Code has a number of different options for you, some of which you can download and run, while others look like they need implementing within your own code.
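For illustration, a minimal version of option 1 might look like the following, assuming a news table with slug and updated_at columns read via PDO (all names and URLs here are placeholders for your own CMS schema):
<?php
// build_sitemap.php - regenerate sitemap.xml from whatever pages are currently in the database.
// Assumes a `news` table with `slug` and `updated_at` columns; adjust to your CMS.
$pdo = new PDO('mysql:host=localhost;dbname=cricket', 'dbuser', 'dbpass');

$xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
$xml .= '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";

foreach ($pdo->query('SELECT slug, updated_at FROM news') as $row) {
    $xml .= "  <url>\n";
    $xml .= '    <loc>http://www.example.com/news/' . htmlspecialchars($row['slug']) . "</loc>\n";
    $xml .= '    <lastmod>' . date('Y-m-d', strtotime($row['updated_at'])) . "</lastmod>\n";
    $xml .= "  </url>\n";
}

$xml .= "</urlset>\n";
file_put_contents('sitemap.xml', $xml);
?>
Run it from cron or hook it into the CMS "publish" action so the sitemap is rebuilt whenever news is added or removed.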
Yes, if these pages' content needs to be referenced by search engines, of course they have to be in the sitemap.
I have worked on a lot of e-business websites, and of course almost 99% of the pages were dynamically generated: almost 1000 product pages versus the 3 static sales-conditions and legal pages.
So the sitemap itself was dynamic and regenerated every 15 minutes (to avoid dumping the whole product base and running thousands of queries each time the sitemap is called).
You can use a separate script to do this: I would have one static template part for the static pages, and another embedding the dynamically generated URLs.
It would be easier if your CMS already embeds a URL management (or routing) system.

Removing Uploaded Files from Google when item Expires

We're using the Google CSE (Custom Search Engine) paid service to index content on our website. The site is built mostly of PHP pages that are assembled with include files, but there are some dynamic pages that pull info from a database into a single page template (new releases, for example). The issue we have is that I can set an expiration date on the content in the database, so say "id=2" will bring up a "This content is expired" notice. However, if ID 2 had an uploaded PDF attached to it, the PDF file remains in the search index.
I know I could write a cleanup script and have cron run it: it would look at the DB, find expired content, check whether any uploaded files were attached, and either rename or remove them. But there has to be a better solution (I hope).
Please let me know if you have encountered this in the past, and what you suggest.
Thanks,
D.
There's unfortunately no way to give you a straight answer at this time: we have no knowledge of how your PDFs are "attached" to your pages or how your DB is structured.
The best solution would be to create a robots.txt file that blocks the URLs for the particular PDF files that you want to remove. Google will drop them from the index on its next pass (usually in about an hour).
http://www.robotstxt.org/
What we ended up doing was tying a check script to the upload script: once the current upload completed, old files were unlinked and their DB records deleted.
For us this works because it's kind of an "add one / remove one" situation, where we want a set number of items to appear in a rolling order.
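For anyone doing something similar, here is a rough sketch of that kind of cleanup, assuming an items table with expires_at and pdf_path columns accessed via PDO (all names and credentials are illustrative, not from the original setup):
<?php
// cleanup_expired.php - delete uploaded files for expired items and remove their records.
// Assumes an `items` table with `id`, `expires_at` and `pdf_path` columns; adjust to your schema.
$pdo = new PDO('mysql:host=localhost;dbname=mysite', 'dbuser', 'dbpass');

$del = $pdo->prepare('DELETE FROM items WHERE id = ?');

foreach ($pdo->query('SELECT id, pdf_path FROM items WHERE expires_at < NOW()') as $row) {
    if ($row['pdf_path'] && file_exists($row['pdf_path'])) {
        unlink($row['pdf_path']);        // remove the uploaded PDF
    }
    $del->execute(array($row['id']));    // remove the expired record
}
?>
Whether it runs from cron or piggybacks on the upload script, the effect is the same: once the file is gone and its URL returns a 404, the search engine will drop it after it re-crawls.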
