How should I modify a sitemap? - PHP

I have a PHP script which builds a sitemap (an XML file according to the standard sitemap structure).
My question is about improving it. As you know, a website gets new posts daily. Also, a post may be edited several times per hour/day/month or whenever. I have two strategies to handle that:
Making a new PHP script which parses that XML file, finds the relevant node and modifies it when a post is edited, and adds a new node when a new post is added (it needs to count the number of all nodes before inserting a new one, since a sitemap file can contain at most 50,000 URLs).
Executing my current PHP script on a fixed daily schedule (e.g. every night at midnight) using a cron job. That means rebuilding it from scratch every time (actually building a new sitemap every night).
OK, which strategy is more optimal and profitable? Which one is the standard approach?

Modifying an XML file has its dangers. One reason is that you need to compare and compile actions (replace, insert, delete). This is complex and the possibility of errors is high. Another problem is that sitemaps can be large, and loading them into memory for modifications might not be possible.
I suggest you generate the XML sitemap in a cron job. Do not overwrite the current sitemap directly, but copy/link it into place after it is completed. This avoids having no sitemap at all if there is an error.
If you want to manage the URLs incrementally, do so in an SQL table and treat the XML sitemap as an export of that table.
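A minimal sketch of that cron approach, assuming a hypothetical get_all_urls() helper that returns loc/lastmod pairs from your database; the new sitemap is written to a temporary file first and only renamed over the live one once it is complete:

<?php
// generate_sitemap.php - run nightly via cron.
// get_all_urls() is a placeholder for whatever pulls the URLs out of your database.

$tmp  = '/var/www/example.com/sitemap.xml.tmp';
$live = '/var/www/example.com/sitemap.xml';

$xml = new XMLWriter();
$xml->openURI($tmp);
$xml->startDocument('1.0', 'UTF-8');
$xml->startElement('urlset');
$xml->writeAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');

foreach (get_all_urls() as $row) {
    $xml->startElement('url');
    $xml->writeElement('loc', $row['loc']);
    $xml->writeElement('lastmod', $row['lastmod']);
    $xml->endElement();
}

$xml->endElement();   // </urlset>
$xml->endDocument();
$xml->flush();

// Swap the finished file into place; the old sitemap stays available until this point.
rename($tmp, $live);

XMLWriter streams the output to disk as it goes, so even a 50,000-URL sitemap never has to be held in memory.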

This depends on how busy your website is.
If you have a small website where content changes happen on a weekly or monthly basis, you can simply create an XML and HTML sitemap by script any time new content is available and upload it to your webspace.
If you have a website with many pages and an almost daily update frequency, such as a blog, it is quite handy if you can automatically generate a new sitemap any time new content is ready.
If you are using a CMS, then you have a wide range of plugins that could update it incrementally. Or you could just make your script do it.

Related

Creating dynamic sitemaps with Codeigniter

I have somewhere in the region of 60,000 URLs that I want to submit to Google. Given the restriction of 10,000 URLs per file, I'm going to need to make a sitemap index and link to at least 6 sitemap files in that index.
I don't know what the most efficient way of doing this is. My idea was to go to my DB, take the TOP 10,000 rows, run my foreach on the data and generate my links. My first idea was to create placeholder sitemap files (e.g. sm1.xml, sm2.xml, etc.) and after each 10,000 rows increment the file index and insert the next 10,000 into the next file. The problem is that the data in the DB is always being added to, so next month I could have 70,000 URLs, meaning I'd have to create another placeholder file.
So with this in mind, I'd like to create the individual sitemap files dynamically but I don't know how.
Some ideas that might help you on your way to building a sitemap generator in your project:
Get the URLs from your routes.php file.
Get the classes/methods using the Reflection classes.
Get the data from the database or a text file.
Loop through each data set like you stated above and create indexed files for them (see the sketch after the ping URLs below).
Use a cron job to submit your files via ping.
Use the ping service provided by these search engines.
You should maybe only ping the services at the end of each day or every second day;
don't ping them every time a new row is created!
Google Ping
http://www.google.com/webmasters/sitemaps/ping?sitemap=http://www.yourdomain.com/sitemap.xml
MSN
http://www.bing.com/webmaster/ping.aspx?siteMap=http://www.yourdomain.com/sitemap.xml
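A rough sketch of the chunking step mentioned above, assuming a $pdo connection and a pages table holding the URLs (both names are made up); since the number of files follows the row count, no placeholder files are needed:

<?php
// Sketch only: split the URLs into sitemap files of 10,000 entries each,
// then write a sitemap index that points at every generated file.
// The PDO credentials and the pages/url column names are assumptions.

$pdo   = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$stmt  = $pdo->query('SELECT url FROM pages ORDER BY id');
$base  = 'http://www.yourdomain.com/';
$chunk = 10000;
$files = array();
$i = 0;
$rows = array();

while (($row = $stmt->fetch(PDO::FETCH_ASSOC)) !== false) {
    $rows[] = $row['url'];
    if (count($rows) === $chunk) {
        $files[] = write_sitemap(++$i, $rows);
        $rows = array();
    }
}
if ($rows) {
    $files[] = write_sitemap(++$i, $rows);   // remainder
}

function write_sitemap($index, array $urls) {
    $xml = new SimpleXMLElement('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"/>');
    foreach ($urls as $u) {
        $node = $xml->addChild('url');
        $node->addChild('loc', htmlspecialchars($u));
    }
    $file = "sm{$index}.xml";
    $xml->asXML($file);
    return $file;
}

// Sitemap index referencing each generated file.
$idx = new SimpleXMLElement('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"/>');
foreach ($files as $f) {
    $sm = $idx->addChild('sitemap');
    $sm->addChild('loc', $base . $f);
    $sm->addChild('lastmod', date('c'));
}
$idx->asXML('sitemap_index.xml');

Only sitemap_index.xml needs to be submitted/pinged; the individual sm*.xml files are discovered through it.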

WordPress plugin: XML files or database table... or both?

OK guys, so I am 50% of the way through creating a "content manager" plugin for WordPress (mainly for the internal benefit of the company I work for) that can create custom post types, taxonomies and meta boxes with a pretty interface.
At the moment I'm using XML files created through PHP to parse and hold the data relating to "post types", "taxonomies" and "meta boxes". The main reason I began down the XML road was so I could allow users to export to an XML file and import it on another WordPress install. Simple.
Although now I'm not sure: is it too server-heavy to have the plugin recursing through directories every time to init the post types, taxonomies and meta boxes? Would I be better served to create 3 DB tables and, when I need to import or export, simply build the XML from there?
Would love to hear your opinions!
I would go with the database solution. When the XML file grows in size, the parsing will take more and more time, as the whole file is read every time.
In a database, you can select only the values you need and don't have to parse the whole document every time.
Also, realizing an XML import/export from the values stored in the database shouldn't be that much of a problem (see the sketch below).
But if you have very tiny XML files (less than 100 chars, say) and they don't grow much, you'll have to decide whether it's worth the time to change to a database.
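For the export side, something along these lines would do; the table and column names here are hypothetical, just to illustrate the shape of it:

<?php
// Rough sketch: dump the stored post type definitions to XML for export.
// The {$wpdb->prefix}cm_post_types table and its columns are made up.

global $wpdb;
$rows = $wpdb->get_results("SELECT slug, label, args FROM {$wpdb->prefix}cm_post_types", ARRAY_A);

$xml = new SimpleXMLElement('<post_types/>');
foreach ($rows as $row) {
    $node = $xml->addChild('post_type');
    $node->addChild('slug', htmlspecialchars($row['slug']));
    $node->addChild('label', htmlspecialchars($row['label']));
    $node->addChild('args', htmlspecialchars($row['args']));
}

header('Content-Type: text/xml');
echo $xml->asXML();

Import is the reverse: load the uploaded file with simplexml_load_file() and insert each <post_type> element back into the table.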

Can you get a specific XML value without loading the full file?

I recently wrote a PHP plugin to interface with my phpBB installation which will take my users' Steam IDs, convert them into the community IDs that Steam uses on their website, grab the XML file for that community ID, get the value of avatarFull (which contains the link to the full avatar), download it via cURL, resize it, and set it as the user's new avatar.
In effect it is syncing my forum's avatars with Steam's avatars (Steam is a gaming community/platform and I run a gaming clan). My issue is that whenever I am reading the value from the XML file it takes around a second for each user, as it loads the entire XML file before searching for the variable, and this causes the entire script to take a very long time to complete.
Ideally I want to have my script run several times a day to check each avatarFull value from Steam and check to see if it has changed (and download the file if it has), but it currently takes just too long for me to tie up everything to wait on it.
Is there any way to have the server serve up just the xml value that I am looking for without loading the entire thing?
Here is how I am calling the value currently:
$xml = @simplexml_load_file("http://steamcommunity.com/profiles/".$steamid."?xml=1");
$avatarlink = $xml->avatarFull;
And here is an example xml file: XML file
The file isn't big. Parsing it doesn't take much time. Your second is wasted mostly on network communication.
Since there is no way around this, you must implement a cache. Schedule a script that will run on your server every hour or so, looking for changes. This script will take a lot of time - at least a second for every user; several seconds if the picture has to be downloaded.
When it has the latest picture, it will store it in some predefined location on your server. The scripts that serve your webpage will use this location instead of communicating with Steam. That way they will work instantly, and the pictures will be at most 1 hour out-of-date.
Added: Here's an idea to complement this: have your visitors perform AJAX requests to Steam and check via JavaScript whether the picture has changed. Do this only for pictures that they're actually viewing. If a picture has changed, you can immediately replace the outdated copy in their browser and also notify your server, which can then download the updated picture right away. Perhaps you won't even need to schedule anything yourself.
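A rough sketch of such an hourly cron script; get_steam_ids() and the cache directory are placeholders for whatever your forum actually uses:

<?php
// Hourly cron sketch: refresh locally cached Steam avatars.
// get_steam_ids() is a hypothetical helper returning all community IDs to sync.

foreach (get_steam_ids() as $steamid) {
    $xml = @simplexml_load_file("http://steamcommunity.com/profiles/{$steamid}?xml=1");
    if ($xml === false || empty($xml->avatarFull)) {
        continue;   // profile unreachable or private, keep the old picture
    }

    $remote = (string) $xml->avatarFull;
    $local  = "/var/www/forum/images/avatars/steam_{$steamid}.jpg";

    // Only download when there is no cached copy yet or it is older than an hour.
    if (!file_exists($local) || filemtime($local) < time() - 3600) {
        $data = @file_get_contents($remote);
        if ($data !== false) {
            file_put_contents($local, $data);   // resizing would happen here as well
        }
    }
}

The forum pages then point at the local copy, so page loads never wait on Steam.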
You have to read the whole stream to get to the data you need, but it doesn't have to be kept in memory.
If I were doing this with Java, I'd use a SAX parser instead of a DOM parser. I could handle the few values I was interested in and not keep a large DOM in memory. See if there's something equivalent for you with PHP.
SimpleXML is a DOM parser. It will load and parse the entire document into memory before you can work with it. If you do not want that, use XMLReader, which allows you to process the XML while you are reading it from a stream, e.g. you could exit processing once the avatar has been fetched (sketched below).
But as other people have already pointed out elsewhere on this page, with a file as small as the one shown, this is more likely a network latency issue than an XML issue.
Also see Best XML Parser for PHP
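A rough sketch of the XMLReader approach, with $steamid being the community ID you already compute in the plugin:

<?php
// Stream the profile XML and stop as soon as <avatarFull> has been read.
$reader = new XMLReader();
$reader->open("http://steamcommunity.com/profiles/{$steamid}?xml=1");

$avatarlink = null;
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'avatarFull') {
        $avatarlink = $reader->readString();   // text/CDATA content of the element
        break;                                  // skip the rest of the document
    }
}
$reader->close();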
That file looks small enough; it shouldn't take that long to parse. It probably takes that long because of some sort of network problem rather than the slowness of parsing.
If the network is your issue, then no amount of trickery will help you :(.
If it isn't the network, then you could try a regex match on the input. That will probably be marginally faster.
Try this expression:
/<avatarFull><!\[CDATA\[(.*?)\]\]><\/avatarFull>/
and read the link from the first capture group.
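A minimal usage sketch, assuming the profile XML has already been fetched into $contents:

// e.g. $contents = file_get_contents("http://steamcommunity.com/profiles/{$steamid}?xml=1");
if (preg_match('/<avatarFull><!\[CDATA\[(.*?)\]\]><\/avatarFull>/', $contents, $m)) {
    $avatarlink = $m[1];   // first capture group holds the image URL
}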
You could try the SAX way of parsing (http://php.net/manual/en/book.xml.php), but as I said, since the file is small I doubt it will really make a difference.
You can take advantage of caching the results of simplexml_load_file() somewhere like memcached or the filesystem. Here is a typical workflow (a filesystem-based sketch follows the list):
check if the XML file was processed during the last N seconds
return the cached processing results on success
on a cache miss, get the results from SimpleXML
process them
resize the images
store the results in the cache
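A filesystem-based sketch of that workflow; the cache location and the one-hour TTL are arbitrary choices:

<?php
// Return the avatar link for a user, hitting Steam at most once per $ttl seconds.
function get_avatar_link($steamid, $ttl = 3600) {
    $cacheFile = sys_get_temp_dir() . "/steam_avatar_{$steamid}.txt";

    // 1-2. Processed within the last $ttl seconds? Return the cached result.
    if (file_exists($cacheFile) && filemtime($cacheFile) > time() - $ttl) {
        return file_get_contents($cacheFile);
    }

    // 3. Cache miss: fall back to SimpleXML.
    $xml = @simplexml_load_file("http://steamcommunity.com/profiles/{$steamid}?xml=1");
    if ($xml === false) {
        return false;
    }

    // 4-6. Process the result (resizing would go here) and store it in the cache.
    $avatarlink = (string) $xml->avatarFull;
    file_put_contents($cacheFile, $avatarlink);
    return $avatarlink;
}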

How would the conversion of a custom CMS using a text-file-based database to Drupal be tackled?

Just today I've started using Drupal for a site I'm designing/developing. For my own site http://jwm-art.net I wrote a user-unfriendly CMS in PHP. My brief experience with Drupal is making me want to convert from the CMS I wrote, a CMS whose sole method (other than comments) of automatically publishing content is logging in via SSH and using nano to create a plain text file in a format like so*:
head<<END_HEAD
title = Audio
keywords= open,source,audio,sequencing,sampling,synthesis
descr = Music, noise, and audio, created by James W. Morris.
parent = home
END_HEAD
main<<END_MAIN
text<<END_TEXT
Digital music, noise, and audio made exclusively with
#=xlink=http://www.linux-sound.org#:Linux Audio Software#_=#.
END_TEXT
image=gfb#--#;Accompanying image for penonpaper-c#right
ilink=audio_2008
br=
ilink=audio_2007
br=
ilink=audio_2006
END_MAIN
info=text<<END_TEXT
I've been making PC based music since the early nineties -
fortunately most of it only exists as tape recordings.
END_TEXT
( http://jwm-art.net/dark.php?p=audio - There's just over 400 pages on there. )
*The journal-entry form, which takes some of the work out of it, has mysteriously broken. And it still required SSH access to copy the file to the main dat dir and to check I had actually remembered the format correctly and the code hadn't mis-formatted anything (which it always does).
I don't want to drop all the old content (just some), but how much work would be involved in converting it, factoring in that I've been using Drupal for a day, have not written any PHP for a couple of years, and have zero knowledge of SQL?
How would I map the abstraction in the text file above so that a user can select these elements in the page-publishing mechanism to create a page?
How might a team of developers tackle this? How do-able is it for one guy in his spare time?
You would parse the text with PHP and use the Drupal API to save it as a node object.
http://api.drupal.org/api/function/node_save
See this similar issue, programmatically creating Drupal nodes:
recipe for adding Drupal node records
Drupal 5: CCK fields in custom content type
Essentially, you create the $node object and assign values. node_save($node) will do the rest of the work for you; it is a Drupal function that creates the content record and lets other modules add data if need be.
You could also employ XML RPC services, if that's possible on your setup.
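A minimal sketch of that, in the Drupal 6 style of the API; parse_legacy_page() stands in for whatever you write to parse your text-file format:

<?php
// Turn one parsed text file into a Drupal node.
// parse_legacy_page() is a hypothetical parser for the head/main format shown above.

$page = parse_legacy_page('audio');   // returns title, body, keywords, parent, ...

$node = new stdClass();
$node->type   = 'page';               // or a custom content type with CCK fields
$node->title  = $page['title'];
$node->body   = $page['body'];
$node->uid    = 1;
$node->status = 1;                    // published
node_save($node);                     // Drupal assigns the nid and writes the records

Run that in a loop over the 400-odd files; the keywords and parent fields would most naturally map onto taxonomy terms and the menu system, respectively.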
Since you have not written any PHP for a long time, and you are probably in a hurry, I suggest this approach:
Download and install this Drupal module: http://drupal.org/project/node_import
This module imports data (nodes, users, taxonomy entries, etc.) into Drupal from CSV files.
Read its documentation and spend some time learning how to use it.
Convert your blog into CSV files. Unfortunately, I cannot help you much on this, because your blog entries have a complex structure. I think writing code that converts them into CSV files would take about the same time as creating the CSV files manually.
Use the Node Import module to import the data into your new website.
Of course, some issues will remain that you have to handle manually, like creating menus etc.

Edit XML file (sitemap) with PHP [duplicate]

This question already has answers here:
How do you parse and process HTML/XML in PHP?
I need to make a script which auto-increments an XML sitemap (for use by search engines) every time a new ad is created on my site (a classifieds site using PHP and MySQL).
I have got stuck on how to auto-increment the XML sitemap. Each sitemap can contain a maximum of 50,000 URL records.
Besides, whenever a user deletes their ad (for example after selling the item), I need that URL inside the sitemap to be deleted as well.
I already have a script which generates XML sitemaps from my database, BUT it overwrites the XML sitemaps and recreates everything every time a user posts an ad.
Is it even possible to edit an XML file with PHP at this level?
For example, if I could read how many entries there are in an XML file, I would know where to set the limit (50,000) and create a new one.
Also, if I could read XML files and search for entries, I could also delete ads.
But is that possible?
Code snippets or suggestions on what methods to use are appreciated!
Thanks
You could simply use SimpleXML to open the sitemap and then do the following (a sketch follows the list):
Iterate over the elements.
If you find the element, update it (URL, last changed, etc.).
If you don't find it, append it.
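A minimal sketch of that loop for a single sitemap file; the ad URL is made up:

<?php
$sitemap = simplexml_load_file('sitemap.xml');
$adUrl   = 'http://www.example.com/ads/1234';   // URL of the ad that was just posted
$found   = false;

foreach ($sitemap->url as $url) {
    if ((string) $url->loc === $adUrl) {
        $url->lastmod = date('c');               // update the existing entry
        $found = true;
        break;
    }
}

if (!$found && count($sitemap->url) < 50000) {
    $entry = $sitemap->addChild('url');          // append a new entry
    $entry->addChild('loc', $adUrl);
    $entry->addChild('lastmod', date('c'));
}

$sitemap->asXML('sitemap.xml');

Deleting a sold ad's entry works the same way, except that you would remove the matching node inside the loop with unset($url[0]) instead of updating it.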
This would of course have to be modified a bit for the multiple-sitemap situation. Furthermore, you could use some XPath to search your files. Notice, however, that doing this kind of XML work can be quite slow.
I therefore think you should consider the possibility of regenerating your entire sitemap at regular intervals (say every 12 or 24 hours), because the search engines will be fetching your sitemap very rarely.
Considering the overhead of adding to or deleting from this file each time an ad is added/deleted, I'd stick with your existing script (which rebuilds the sitemap from scratch) and set it to run once every night, say at midnight. You won't be losing out, as the search engines won't fetch your sitemap more than once a day at most.
