I have somewhere in the region of 60,000 URLs that I want to submit to Google. Given the restriction of 10,000 URLs per file, I'm going to need to make a sitemap index and link to at least 6 sitemap files from that index.
I don't know what the most efficient way of doing this is. My first idea was to create placeholder sitemap files (e.g. sm1.xml, sm2.xml, etc.), take the TOP 10,000 rows from my DB, run a foreach over the data to generate the links, then increment the file index after every 10,000 rows and write the next 10,000 into the next file. The problem is that the data in the DB is constantly growing, so next month I could have 70,000 URLs, which would mean creating yet another placeholder file.
So with this in mind, I'd like to create the individual sitemap files dynamically but I don't know how.
Some ideas that might help you on your way to building a sitemap generator in your project:
Get the URLs from your route.php file.
Get the classes/methods using PHP's Reflection classes.
Get the data from the database or a text file.
Loop through each data set as you described above and create indexed files for them (see the sketch after this list).
Use a cron job to submit your files to the search engines via ping.
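A minimal sketch of that loop, assuming a PDO connection, a hypothetical ads table with slug and updated_at columns, and 10,000 URLs per file: stream the rows, start a new sitemap file whenever the current one is full, and finish by writing an index that references however many files were actually produced, so no placeholder files are needed.

<?php
// build_sitemaps.php - rebuild chunked sitemaps plus the index (sketch).
// Table/column names and the base URL are examples; adjust to your schema.
$pdo   = new PDO('mysql:host=localhost;dbname=mysite', 'user', 'pass');
$base  = 'http://www.yourdomain.com';
$limit = 10000;          // URLs per sitemap file
$file  = 0;              // current sitemap file index
$count = 0;              // URLs written so far
$xml   = null;

// Open sm<N>.xml and write the <urlset> header.
function openSitemap(int $index): XMLWriter
{
    $xml = new XMLWriter();
    $xml->openUri(sprintf('sm%d.xml', $index));
    $xml->startDocument('1.0', 'UTF-8');
    $xml->startElement('urlset');
    $xml->writeAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');
    return $xml;
}

// Close </urlset> and flush the current file to disk.
function closeSitemap(XMLWriter $xml): void
{
    $xml->endElement();
    $xml->endDocument();
    $xml->flush();
}

foreach ($pdo->query('SELECT slug, updated_at FROM ads ORDER BY id') as $row) {
    if ($count % $limit === 0) {                 // time to start a new file
        if ($xml !== null) {
            closeSitemap($xml);
        }
        $xml = openSitemap(++$file);
    }
    $xml->startElement('url');
    $xml->writeElement('loc', $base . '/' . $row['slug']);
    $xml->writeElement('lastmod', date('c', strtotime($row['updated_at'])));
    $xml->endElement();
    $count++;
}
if ($xml !== null) {
    closeSitemap($xml);
}

// Write the index pointing at however many files were created.
$idx = new XMLWriter();
$idx->openUri('sitemap_index.xml');
$idx->startDocument('1.0', 'UTF-8');
$idx->startElement('sitemapindex');
$idx->writeAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');
for ($i = 1; $i <= $file; $i++) {
    $idx->startElement('sitemap');
    $idx->writeElement('loc', sprintf('%s/sm%d.xml', $base, $i));
    $idx->writeElement('lastmod', date('c'));
    $idx->endElement();
}
$idx->endElement();
$idx->endDocument();
$idx->flush();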
Use the ping service provided by these search engines.
You should probably only ping the services at the end of each day, or every second day;
don't ping them every time a new row is created!
Google Ping
http://www.google.com/webmasters/sitemaps/ping?sitemap=http://www.yourdomain.com/sitemap.xml
MSN
http://www.bing.com/webmaster/ping.aspx?siteMap=http://www.yourdomain.com/sitemap.xml
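A short sketch of that nightly ping (the sitemap URL is an example); point a cron entry such as 0 0 * * * php /path/to/ping.php at it:

<?php
// ping.php - notify the search engines about the sitemap index once a day (sketch).
$sitemap = urlencode('http://www.yourdomain.com/sitemap_index.xml');

$endpoints = [
    'http://www.google.com/webmasters/sitemaps/ping?sitemap=' . $sitemap,
    'http://www.bing.com/webmaster/ping.aspx?siteMap=' . $sitemap,
];

foreach ($endpoints as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    echo $url . ' => HTTP ' . $status . PHP_EOL;   // 200 means the ping was received
}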
Related
I have a PHP script which builds a sitemap (an XML file according to the standard sitemap structure).
My question is about improving it. As you know, a website gets new posts daily, and a post may be edited several times per hour/day/month or whenever. I have two strategies to handle that:
Writing a new PHP script which parses the XML file, finds the relevant node and modifies it when a post is edited, and adds a new node when a new post is added (it would need to count the existing nodes before inserting a new one, since a sitemap file can hold at most 50,000 URLs).
Executing my current PHP script on a fixed daily schedule (i.e. every night at midnight) using a cron job. That means rebuilding it from scratch every time (in effect building a new sitemap every night).
OK, which strategy is more efficient and worthwhile? Which one is the standard approach?
Modifying an XML file has its dangers. One reason is that you need to compare the old and new state and work out the actions (replace, insert, delete). This is complex and the possibility of errors is high. Another problem is that sitemaps can be large, and loading them into memory for modification might not be possible.
I suggest you generate the XML sitemap in a cronjob. Do not overwrite the current sitemap directly, but copy/link it into place after it is completed. This avoids having no sitemap at all if there is an error.
If you want to manage the URLs incrementally, do so in an SQL table and treat the XML sitemap as an export of that table.
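A sketch of the "generate, then swap" part (paths are examples, and buildSitemap() stands in for whatever generation code you already have): write to a temporary file and only replace the live sitemap once generation has finished cleanly.

<?php
// regenerate.php - run from cron; never leaves a half-written sitemap in place (sketch).
$tmp  = '/var/www/site/sitemap.xml.tmp';
$live = '/var/www/site/sitemap.xml';

// buildSitemap() is a placeholder for your existing generator; have it return
// false (or throw) if anything goes wrong while writing $tmp.
if (buildSitemap($tmp)) {
    rename($tmp, $live);   // atomic on the same filesystem: readers see the old or new file, never a partial one
} else {
    @unlink($tmp);         // keep the previous sitemap if generation failed
}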
This depends on how busy your website is.
If you have a small website where content changes happen on a weekly or monthly basis, you can simply create an XML and HTML sitemap by script whenever new content is available and upload it to your webspace.
If you have a website with many pages and an almost daily update frequency, such as a blog, it is quite handy if you can automatically generate a new sitemap anytime new content is ready.
If you are using a CMS then you have a wide range of plugins that could update it incrementally. Or you could just make your script do it.
I'm new to CloudSearch and my question might not be clear, so I will try to explain my problem.
We have a back office where a lot of people run searches, and from time to time our database goes down because of queries that take more than 30s to execute, so we decided to use CloudSearch since we already use some other Amazon web services.
So I created a search domain, I created the index according to the values we search on in our current database, and I indexed all our events (the results of what people search for) from our test database (~42,000 rows).
My problem is that each event has multiple media items (.jpg, .gif and .mp4) in our database (and we are migrating from v3 to v4, so there are two media databases and we need to know the event version to know where we should search: the old or the new database). So my question: can I return some media information with CloudSearch, or will I still need to use a MySQL query?
Right now we return the last media item added to the database (so it can change a lot while the event is running) and the total number of media items for this event (which can change really often too).
What I think might work:
I could add the two fields to my event index (number of media items + URL of the last media item) and create a batch file to add/update the event data EACH time we add a new media item to the database. The problem is that we can send 1 batch every 10s and a maximum of 10,000 batches/day, so if we have 50 events running at the same time it could be a big problem...
Same idea as before, but using a CRON job to create a batch file with all the latest data every hour, for example. The problem is that the search results won't be right until the next batch... and the max batch size is 5 MB, so it should be okay, but if we have a lot of new data to add it could be a small problem.
The current idea is to do a MySQL query using each event ID we get from the CloudSearch results and return that information, but it seems kind of silly to still use MySQL if we're switching to CloudSearch...
I saw the documentation for "Using Dynamic Fields in Amazon CloudSearch" but I don't think it does what I want to achieve... maybe I misunderstand something, but if someone can help me understand how to do this the best way, I would be thankful.
Can I return some media information with CloudSearch, or will I still need to use a MySQL query?
If you are asking whether you can store .mp4, .jpg, etc. media files in CloudSearch, the answer is no. You can store text, numbers, dates, and latlong coordinates (or arrays of any of those, except latlong).
I think the conventional way to handle media is to index a URL/path to the media as a text field.
Reference: AWS Cloudsearch Documentation - Configuring Index Fields
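To make that concrete for the original question: the document batch uploaded to the search domain could carry the media info alongside the event (the field names media_count, last_media_url and event_version are made up and would have to exist as index fields in your domain); in PHP the batch is just an array you json_encode() and send to the domain's document endpoint, e.g. with curl or the AWS SDK's CloudSearchDomainClient::uploadDocuments().

<?php
// Build a CloudSearch document batch that also carries the media info (sketch).
// Field names are examples and must match index fields configured in your domain.
$batch = [
    [
        'type'   => 'add',
        'id'     => 'event_12345',
        'fields' => [
            'title'          => 'Some event title',
            'event_version'  => 4,          // v3 or v4, so you know which media database to query
            'media_count'    => 17,         // total number of media items for this event
            'last_media_url' => 'https://cdn.example.com/media/12345/latest.jpg',
        ],
    ],
];

$json = json_encode($batch);
// POST $json with Content-Type: application/json to the domain's document endpoint,
// or queue it on disk for the hourly cron upload described in the question.
file_put_contents('batch-' . time() . '.json', $json);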
I have ~280,000 files that will need to be searched through, and the proper file returned and opened. The file names are exact matches of the expected search terms.
The search terms will be taken by an input box using PHP. What is the best way to accomplish this so that searches do not take a large amount of time?
Thanks!
I suspect the file system itself will struggle with 280,000 files in one directory.
An approach I've taken in the past is to put those files in subdirectories based upon the initial letters of the filename e.g.
1/100000.txt
1/100001.txt
...
9/900000.txt
etc. You can subdivide further using the second character, and so on.
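Since the file names are exact matches of the search terms, the lookup can then be a pure path computation rather than a search; a sketch, assuming the files live under a hypothetical /var/data/files root and are sharded on the first two characters:

<?php
// Map a search term directly to its sharded path (sketch).
function shardedPath(string $term, string $root = '/var/data/files'): string
{
    $term   = basename(strtolower(trim($term)));   // strip any path components from user input
    $first  = substr($term, 0, 1);
    $second = strlen($term) > 1 ? substr($term, 1, 1) : '_';
    return sprintf('%s/%s/%s/%s.txt', $root, $first, $second, $term);
}

$path = shardedPath($_POST['q'] ?? '');
if (is_file($path)) {
    readfile($path);        // exact match: return the file directly
} else {
    echo 'Not found';
}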
It's good you added mysql to your tags. Ideally I would have a cron task that indexes the directories into a MySQL table and use that to do the actual search; a relational lookup is faster than iterating the file system. You could run the task daily or hourly depending on how often your files change, or use something like Guard to monitor the file system for changes and make the appropriate updates.
See: https://github.com/guard/guard
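A sketch of that indexing task (table and column names are examples); the search itself then becomes a single indexed lookup instead of a file system walk:

<?php
// index_files.php - run from cron to (re)index the file tree into MySQL (sketch).
$pdo = new PDO('mysql:host=localhost;dbname=search', 'user', 'pass');
$pdo->exec('CREATE TABLE IF NOT EXISTS file_index (
    term VARCHAR(255) PRIMARY KEY,
    path VARCHAR(512) NOT NULL
)');

$insert = $pdo->prepare('REPLACE INTO file_index (term, path) VALUES (?, ?)');
$files  = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator('/var/data/files', FilesystemIterator::SKIP_DOTS)
);
foreach ($files as $file) {
    if ($file->isFile()) {
        // the file name (minus extension) is the exact search term
        $insert->execute([$file->getBasename('.' . $file->getExtension()), $file->getPathname()]);
    }
}

// At search time, the PHP input box handler only needs:
$lookup = $pdo->prepare('SELECT path FROM file_index WHERE term = ?');
$lookup->execute([trim($_POST['q'] ?? '')]);
$path = $lookup->fetchColumn();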
I need to make a script which auto-increments an xml sitemap (for use by search engines) every time a new ad is created on my site (classifieds site using php and mysql).
I have got stuck on how to auto-increment the XML sitemap. Each sitemap can contain a maximum of 50,000 URL records.
Besides, whenever a user deletes their ad (for example after selling the item), I need that URL to be deleted from the sitemap as well.
I already have a script which generates XML sitemaps from my database, BUT it overwrites the sitemaps and regenerates everything every time a user posts an ad.
Is it even possible to edit an xml file with PHP at this level?
For example, if I could read how many lines there are in an xml file, I would know where to set the limit (50000) and create a new one.
Also, if I could read xml files and search for lines, I could also delete ads.
But is that possible?
Code snippets or what methods to use is appreciated!
Thanks
You could simply use SimpleXML to open the sitemap and then do the following:
Iterate the elements
If you find the element, update it (url, last changed, etc.)
If you don't find it, append it.
This would of course have to be modified a bit for the multiple-sitemap situation. Furthermore, you could use some XPath to search your files. Notice, however, that doing this kind of XML work can be quite slow.
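A rough sketch of that find-or-append approach with SimpleXML and XPath (file name and URL are examples; the sitemap namespace has to be registered before the XPath query works):

<?php
// Find-or-append a <url> entry with SimpleXML + XPath (sketch).
$loc = 'http://www.example.com/ads/1234';          // URL of the new or edited ad
$ns  = 'http://www.sitemaps.org/schemas/sitemap/0.9';

$sitemap = simplexml_load_file('sitemap.xml');
$sitemap->registerXPathNamespace('sm', $ns);

$hits = $sitemap->xpath(sprintf('//sm:url[sm:loc="%s"]', $loc));
if ($hits) {
    $hits[0]->lastmod = date('c');                 // already listed: just bump lastmod
} elseif (count($sitemap->xpath('//sm:url')) < 50000) {
    $url = $sitemap->addChild('url', null, $ns);   // room left: append a new entry
    $url->addChild('loc', $loc, $ns);
    $url->addChild('lastmod', date('c'), $ns);
} else {
    // file is full: start the next sitemap file and reference it from the index
}
$sitemap->asXML('sitemap.xml');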
I therefore think you should consider the possibility of regenerating your entire sitemap at regular intervals (say every 12 or 24 hours), because the search engines will be fetching your sitemap very rarely.
Considering the overhead of adding to or deleting from this file each time an ad is added/deleted, I'd stick with your existing script (which rebuilds the sitemap from scratch) and set it to run once every night, at say midnight. You won't be losing out, as the search engines won't fetch your sitemap more than once a day at most.
You have a forum (vBulletin) that has a bunch of images - how easy would it be to have a page that visits a thread, steps through each page and forwards the images to the user (via AJAX or whatever)? I'm not asking about filtering (that's easy, of course).
doable in a day? :)
I have a site that uses codeigniter as well - would it be even simpler using it?
Assuming this is to be carried out on the server, curl + regexp are your friends... and yes, doable in a day...
There are also some open-source HTML parsers that might make this cleaner.
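For instance, the core of the curl + regexp version is only a fetch and a pattern match (the thread URL is a placeholder):

<?php
// Fetch one thread page and pull out the image URLs (sketch).
$ch = curl_init('http://forum.example.com/showthread.php?t=123');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);

preg_match_all('/<img[^>]+src=["\']([^"\']+)["\']/i', $html, $matches);
print_r(array_unique($matches[1]));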
It depends on where your scraping script runs.
If it runs on the same server as the forum software, you might want to access the database directly and check for image links there. I'm not familiar with vBulletin, but it probably offers a plugin API that allows for high-level database access. That would simplify querying all posts in a thread.
If, however, your script runs on a different machine (or, in other words, is unrelated to the forum software), it would have to act as a http client. It could fetch all pages of a thread (either automatically by searching for a NEXT link in a page or manually by having all pages specified as parameters) and search the html source code for image tags (<img .../>).
Then a regular expression could be used to extract the image urls. Finally, the script could use these image urls to construct another page displaying all these images, or it could download them and create a package.
In the second case the script actually acts as a "spider", so it should respect things like robots.txt or meta tags.
When doing this, make sure to rate-limit your fetching. You don't want to overload the forum server by requesting many pages per second. Simplest way to do this is probably just to sleep for X seconds between each fetch.
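A sketch of that client-side approach (the thread URL and page count are placeholders): fetch each page of the thread, parse out the img tags with DOMDocument, and sleep between requests.

<?php
// Walk every page of a thread, collect image URLs, and rate-limit the fetching (sketch).
$threadUrl = 'http://forum.example.com/showthread.php?t=123';   // placeholder
$pages     = 5;                                                 // or detect it from the NEXT link
$delay     = 2;                                                 // seconds between requests
$images    = [];

for ($page = 1; $page <= $pages; $page++) {
    $html = file_get_contents($threadUrl . '&page=' . $page);
    if ($html === false) {
        continue;                                               // skip pages that failed to load
    }

    $doc = new DOMDocument();
    @$doc->loadHTML($html);                                     // @ silences warnings on sloppy forum HTML
    foreach ($doc->getElementsByTagName('img') as $img) {
        $images[] = $img->getAttribute('src');
    }

    sleep($delay);                                              // be polite to the forum server
}

print_r(array_unique($images));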
Yes, doable in a day.
Since you already have a working CI setup, I would use it.
I would use the following approach:
1) Make a model in CI capable of:
logging in to vBulletin (images are often added as attachments and you need to be logged in before you can download them), using something like Snoopy
collecting the URL of the "last page" button using preg_match(), parsing the URL with parse_url() and parse_str(), and generating links from page 1 to the last page
collecting the HTML from all generated links, still using Snoopy
finding all images in the HTML using preg_match_all()
downloading all images, still using Snoopy
moving each downloaded image from a tmp directory into another directory, renaming it imagename_01, imagename_02, etc. if the same image name already exists
saving the image name and exact byte size in a DB table, so you can avoid downloading the same image more than once (a sketch of this step follows at the end of the answer)
2) Make a method in a controller that collects all images
3) Set up a cronjob that collects images at regular intervals. wget -O /tmp/useless.html http://localhost/imageminer/collect should do nicely.
4) Write the code that outputs pretty HTML for the end user, using the DB table to get the images.
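A sketch of the rename-and-dedup part of step 1 (the images table and its columns are examples): skip the file if an image with the same name and byte size is already recorded, otherwise move it out of tmp under an incremented suffix.

<?php
// Move a downloaded image out of tmp, renaming on name clashes and skipping exact duplicates (sketch).
function storeImage(PDO $pdo, string $tmpPath, string $destDir): ?string
{
    $name = pathinfo($tmpPath, PATHINFO_FILENAME);
    $ext  = pathinfo($tmpPath, PATHINFO_EXTENSION);
    $size = filesize($tmpPath);

    // Same name and same byte size: we already have this exact image, so skip it.
    $check = $pdo->prepare('SELECT COUNT(*) FROM images WHERE name = ? AND bytes = ?');
    $check->execute([$name, $size]);
    if ($check->fetchColumn() > 0) {
        unlink($tmpPath);
        return null;
    }

    // Same name but different content: append _01, _02, ... until the name is free.
    $target = sprintf('%s/%s.%s', $destDir, $name, $ext);
    for ($i = 1; file_exists($target); $i++) {
        $target = sprintf('%s/%s_%02d.%s', $destDir, $name, $i, $ext);
    }
    rename($tmpPath, $target);

    $pdo->prepare('INSERT INTO images (name, bytes, path) VALUES (?, ?, ?)')
        ->execute([$name, $size, $target]);
    return $target;
}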