I need to make a script which auto-increments an xml sitemap (for use by search engines) every time a new ad is created on my site (classifieds site using php and mysql).
I have got stuck at how to auto-increment the XML sitemap. Each sitemap can contain a maximum of 50,000 URLs.
Besides, whenever a user deletes their ad (for example after selling the item), I need this URL inside the sitemap to get deleted also.
I already have a script which generates XML sitemaps from my database, BUT it overwrites the sitemaps and recreates everything every time a user posts an ad.
Is it even possible to edit an xml file with PHP at this level?
For example, if I could read how many lines there are in an xml file, I would know where to set the limit (50000) and create a new one.
Also, if I could read xml files and search for lines, I could also delete ads.
But is that possible?
Code snippets or pointers to which methods to use are appreciated!
Thanks
You could simply use SimpleXML to open the sitemap and then do the following:
Iterate the elements
If you find the element, update it (url, last changed, etc.)
If you don't find it, append it.
Would of course have to be modified a bit for the multiple-sitemap situations. Furthermore you could use some XPath to search your files. Notice, however, that doing this kind of XML work can be quite slow.
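A minimal sketch of that iterate / update / append idea with SimpleXML; the sitemap path, the ad URL and $lastmod are made-up example values:

<?php
// Sketch of the iterate / update / append approach with SimpleXML.
$ns      = 'http://www.sitemaps.org/schemas/sitemap/0.9';
$loc     = 'http://www.example.com/ads/1234';
$lastmod = date('Y-m-d');

$sitemap = simplexml_load_file('sitemap.xml');

$found = false;
foreach ($sitemap->url as $url) {
    if ((string) $url->loc === $loc) {
        $url->lastmod = $lastmod; // update the existing entry
        $found = true;
        break;
    }
}

if (!$found && count($sitemap->url) < 50000) {
    // Append a new <url> entry (values containing & must be pre-escaped).
    $url = $sitemap->addChild('url', null, $ns);
    $url->addChild('loc', $loc, $ns);
    $url->addChild('lastmod', $lastmod, $ns);
}

$sitemap->asXML('sitemap.xml');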
I therefore think you should consider the possibility of regenerating your entire sitemap at regular intervals (say every 12 or 24 hours), because the search engines will be fetching your sitemap very rarely.
Considering the overhead of adding to or deleting from this file each time an ad is added/deleted, I'd stick with your existing script (which rebuilds the sitemap from scratch) and set it to run once every night, at say midnight. You won't be losing out, as the search engines won't fetch your sitemap more than once a day at most.
Related
I have a PHP script which builds a sitemap (an XML file according to the standard sitemap structure).
My question is about improving it. As you know, a website gets new posts daily, and a post may be edited several times per hour/day/month or whenever. I have two strategies to handle that:
Making a new PHP script which parses that XML file, finds the relevant node and modifies it when a post is edited, and adds a new node when a new post is added (it needs to count all the nodes before inserting a new one, since a sitemap file can hold at most 50,000 URLs).
Executing my current PHP script on a fixed schedule (e.g. every night at midnight) using a cron job. That means rebuilding it from scratch every time (actually building a new sitemap every night).
OK, which strategy is more efficient? Which one is the standard approach?
Modifying an XML file has its dangers. One reason is that you need to compare the old and new state and work out the actions (replace, insert, delete). This is complex and the possibility of errors is high. Another problem is that sitemaps can be large, and loading them into memory for modification might not be possible.
I suggest you generate the XML sitemap in a cron job. Do not overwrite the current sitemap directly, but copy/link it into place after it is completed. This avoids having no sitemap at all if there is an error.
If you like to manage the URLs incrementally do so in an SQL table, treat the XML sitemap as an export of this table.
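A rough sketch of such a cron job, using XMLWriter so the whole sitemap never has to be held in memory, and swapping the file in only after it was written successfully. The connection details and the ads table/columns are assumptions:

<?php
// Sketch: regenerate the sitemap from the database, writing to a temporary
// file first and only replacing the live sitemap once writing succeeded.
$pdo = new PDO('mysql:host=localhost;dbname=classifieds', 'user', 'pass');

$tmp = 'sitemap.xml.tmp';
$writer = new XMLWriter();
$writer->openURI($tmp);
$writer->startDocument('1.0', 'UTF-8');
$writer->startElement('urlset');
$writer->writeAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');

foreach ($pdo->query('SELECT url, updated_at FROM ads') as $row) {
    $writer->startElement('url');
    $writer->writeElement('loc', $row['url']);
    $writer->writeElement('lastmod', date('Y-m-d', strtotime($row['updated_at'])));
    $writer->endElement();
}

$writer->endElement(); // urlset
$writer->endDocument();
$writer->flush();

// Atomically swap in the new sitemap only after it was fully written.
rename($tmp, 'sitemap.xml');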
This depends on how busy your website is.
If you have a small website where content changes happen either on a weekly- or monthly-basis, you can simply create an XML- and HTML-sitemap by script, any time new content is available and upload it to your webspace.
If you have a website with many pages and an almost daily update frequency, such as a blog, it is quite handy if you can automatically generate a new sitemap anytime new content is ready.
If you are using a CMS then you have a wide range of plugins that could update it incrementally. Or you could just make your script do it.
I need to create weekly texts using the same template. Being the lazy programmer I am I wanted to automate most of it by just creating a Google Form where I can input the data. By then running a PHP script I want to parse the new entry and put it into an automatically created new document.
I have created the template with placeholders such as <DATE> or <NEWMEMBERCOUNT> that I later want to replace by the values entered using the Google Form.
For this I have already utilized the packages google/apiclient and asimlqt/php-google-spreadsheet-client to read the form results (which are stored in a spreadsheet) and duplicate the template doc for each entry.
I'm almost finished and just need to replace the placeholders by their corresponding values, but I can't seem to find a way to do that. Specifically I need to read the content of the document, perform some transformations on it (i.e. replacing the placeholders) and save it with this transformed text.
I should have thought about this before starting to program it..
Is it possible for me to edit documents at all, using just PHP? If so, how could I go about it? Any guidance is appreciated!
You can't edit in situ, but you can download, edit, upload. Is this a classic mail merge, i.e. take a spreadsheet containing (rows of) data and apply a template to those rows, resulting in an output file for each row?
If so, simples...
Download the spreadsheet
Download the template
For each spreadsheet row
replace the placeholders with data
insert a new file to drive
That can all be done with the Drive API from PHP
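A rough sketch of that loop, assuming the current google/apiclient (Google\Client / Google\Service\Drive); the credentials path, template ID, $rows array and placeholder names are made up, and authentication setup is left out:

<?php
// Sketch: download the template as plain text, replace the placeholders per
// spreadsheet row, and insert each result as a new Google Doc via the Drive API.
require 'vendor/autoload.php';

$client = new Google\Client();
$client->setAuthConfig('credentials.json');
$client->addScope(Google\Service\Drive::DRIVE);
$drive = new Google\Service\Drive($client);

$templateId = 'YOUR_TEMPLATE_DOC_ID';

// 1. Download the template as plain text.
$response = $drive->files->export($templateId, 'text/plain', ['alt' => 'media']);
$template = (string) $response->getBody();

foreach ($rows as $row) {
    // 2. Replace the placeholders with this row's data.
    $text = str_replace(
        ['<DATE>', '<NEWMEMBERCOUNT>'],
        [$row['date'], $row['newMemberCount']],
        $template
    );

    // 3. Insert a new file into Drive, converted back into a Google Doc.
    $meta = new Google\Service\Drive\DriveFile([
        'name'     => 'Weekly text ' . $row['date'],
        'mimeType' => 'application/vnd.google-apps.document',
    ]);
    $drive->files->create($meta, [
        'data'       => $text,
        'mimeType'   => 'text/plain',
        'uploadType' => 'multipart',
    ]);
}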
This is not possible with anything except Google Apps Script; see https://developers.google.com/apps-script/reference/document/document-app
You can use Apps Script to create a "ContentService" endpoint and call it from your PHP. Beware of limited quotas if you plan to make many daily calls.
More information about building such a content service is covered in other Stack Overflow questions that ask about that specifically.
I have somewhere in the region of 60,000 URLs that I want to submit to Google. Given the restriction of 10,000 URLs per file, I'm going to need to make a sitemap index and link to at least 6 sitemap files from that index.
I don't know what the most efficient way of doing this is. My idea was to go to my DB, take the TOP 10000 rows, run my foreach on the data and generate my links. My first idea was to create placeholder sitemap files (eg. sm1.xml, sm2.xml, etc.) and after each 10,000 rows increment the file index and insert the next 10,000 into the next file. The problem is that the data in the DB is always being added to, so next month I could have 70,000 URLs - meaning I'd have to create another placeholder file.
So with this in mind, I'd like to create the individual sitemap files dynamically but I don't know how.
Some ideas that might help you on your way to building a sitemap generator for your project:
get the URLs from your route.php file
get the classes/methods using the Reflection classes
get the data from the database or a text file
Loop through each data set like you stated above and create the numbered sitemap files for them (see the sketch after the ping URLs below).
use a cron job to ping the search engines so your files get indexed.
Use the ping services provided by these search engines.
You should maybe only ping the services at the end of each day or every second day;
don't ping them every time a new row is created!
Google Ping
http://www.google.com/webmasters/sitemaps/ping?sitemap=http://www.yourdomain.com/sitemap.xml
MSN
http://www.bing.com/webmaster/ping.aspx?siteMap=http://www.yourdomain.com/sitemap.xml
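A rough sketch of generating the numbered sitemap files and the index dynamically; the database table, columns, base URL and the 10,000-per-file limit are taken from the question or assumed:

<?php
// Sketch: split all URLs into sitemap files of at most 10,000 entries each
// and write a sitemap index that points to them.
$pdo     = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$base    = 'http://www.yourdomain.com/';
$perFile = 10000;

$urls  = $pdo->query('SELECT url FROM pages ORDER BY id')->fetchAll(PDO::FETCH_COLUMN);
$files = [];

foreach (array_chunk($urls, $perFile) as $i => $chunk) {
    $name = 'sm' . ($i + 1) . '.xml';
    $xml  = new XMLWriter();
    $xml->openURI($name);
    $xml->startDocument('1.0', 'UTF-8');
    $xml->startElement('urlset');
    $xml->writeAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');
    foreach ($chunk as $url) {
        $xml->startElement('url');
        $xml->writeElement('loc', $url);
        $xml->endElement();
    }
    $xml->endElement();
    $xml->endDocument();
    $xml->flush();
    $files[] = $name;
}

// Write the sitemap index that references every generated file.
$idx = new XMLWriter();
$idx->openURI('sitemap_index.xml');
$idx->startDocument('1.0', 'UTF-8');
$idx->startElement('sitemapindex');
$idx->writeAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');
foreach ($files as $name) {
    $idx->startElement('sitemap');
    $idx->writeElement('loc', $base . $name);
    $idx->writeElement('lastmod', date('Y-m-d'));
    $idx->endElement();
}
$idx->endElement();
$idx->endDocument();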
I am developing a website that will display the most recent item from a RSS feed. However, each time a user accesses the website, I'd like for the page to display cached data. This will make the page display much quicker since I plan on caching 50+ RSS feeds.
My question is, how do I cache an RSS feed, but make sure it updates in the background every 4 hours or so?
Thanks in advance.
Create a cache folder to store all the RSS feeds.
When the page is loaded, check to see if the cached file exists; if it doesn't, download the feed and process it.
If the file exists and the result of filemtime($cached_file) + (60 * 60 * 4) is greater than time(), it means that it has been less than 4 hours since the RSS feed was fetched, so display the page as normal. If that is not the case, redownload the file and display it.
There are many tutorials about for parsing RSS feeds in PHP. I prefer using PHP's DOM extension, but there are so many different ways you can do it.
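A minimal sketch of that check; the cache directory and feed URL are placeholders:

<?php
// Sketch of the filemtime()-based cache described above.
function get_feed(string $feedUrl, int $maxAge = 60 * 60 * 4): string
{
    $cacheFile = __DIR__ . '/cache/' . md5($feedUrl) . '.xml';

    // Serve the cached copy if it exists and is younger than $maxAge seconds.
    if (file_exists($cacheFile) && filemtime($cacheFile) + $maxAge > time()) {
        return file_get_contents($cacheFile);
    }

    // Otherwise (re)download the feed and refresh the cache.
    $xml = file_get_contents($feedUrl);
    if ($xml !== false) {
        file_put_contents($cacheFile, $xml);
        return $xml;
    }

    // Download failed: fall back to a stale cached copy if we have one.
    return file_exists($cacheFile) ? file_get_contents($cacheFile) : '';
}

$rss = simplexml_load_string(get_feed('http://example.com/feed.rss'));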
I created a simple PHP class to tackle this issue. Since I'm dealing with a variety of sources, it can handle whatever you throw at it (xml, json, etc). You give it a local filename (for storage purposes), the external feed, and an expires time. It begins by checking for the local file. If it exists and hasn't expired, it returns the contents. If it has expired, it attempts to grab the remote file. If there's an issue with the remote file, it will fall-back to the cached file.
Blog post here: http://weedygarden.net/2012/04/simple-feed-caching-with-php/
Code here: https://github.com/erunyon/FeedCache
I recently wrote a PHP plugin to interface with my phpBB installation which will take my users' Steam IDs, convert them into the community ids that Steam uses on their website, grab the xml file for that community id, get the value of avatarFull (which contains the link to the full avatar), download it via curl, resize it, and set it as the user's new avatar.
In effect it is syncing my forum's avatars with Steam's avatars (Steam is a gaming community/platform and I run a gaming clan). My issue is that whenever I am reading the value from the xml file it takes around a second for each user as it loads the entire xml file before searching for the variable and this causes the entire script to take a very long time to complete.
Ideally I want to have my script run several times a day to check each avatarFull value from Steam and check to see if it has changed (and download the file if it has), but it currently takes just too long for me to tie up everything to wait on it.
Is there any way to have the server serve up just the xml value that I am looking for without loading the entire thing?
Here is how I am calling the value currently:
$xml = @simplexml_load_file("http://steamcommunity.com/profiles/".$steamid."?xml=1");
$avatarlink = $xml->avatarFull;
And here is an example xml file: XML file
The file isn't big and parsing it doesn't take much time. Your second is spent mostly on network communication.
Since there is no way around this, you must implement a cache. Schedule a script that will run on your server every hour or so, looking for changes. This script will take a lot of time - at least a second for every user; several seconds if the picture has to be downloaded.
When it has the latest picture, it will store it in some predefined location on your server. The scripts that serve your webpage will use this location instead of communicating with Steam. That way they will work instantly, and the pictures will be at most 1 hour out-of-date.
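A rough sketch of such a scheduled script; the users table, its columns and the avatar directory are assumptions:

<?php
// Sketch: for every user, fetch avatarFull from Steam and download the
// picture only when it changed since the last run.
$pdo = new PDO('mysql:host=localhost;dbname=forum', 'user', 'pass');

foreach ($pdo->query('SELECT user_id, steamid64, avatar_url FROM users') as $row) {
    $xml = @simplexml_load_file('http://steamcommunity.com/profiles/' . $row['steamid64'] . '?xml=1');
    if ($xml === false) {
        continue; // profile unreachable, keep the cached avatar
    }

    $avatar = (string) $xml->avatarFull;
    if ($avatar === '' || $avatar === $row['avatar_url']) {
        continue; // nothing changed since the last run
    }

    // Download the new picture and remember its URL for the next comparison.
    file_put_contents('/path/to/avatars/' . $row['user_id'] . '.jpg', file_get_contents($avatar));
    $stmt = $pdo->prepare('UPDATE users SET avatar_url = ? WHERE user_id = ?');
    $stmt->execute([$avatar, $row['user_id']]);
}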
Added: Here's an idea to complement this: Have your visitors perform AJAX requests to Steam and check if the picture has changed via JavaScript. Do this only for pictures that they're actually viewing. If it has, then you can immediately replace the outdated picture in their browser. Also you can notify your server who can then download the updated picture immediately. Perhaps you won't even need to schedule anything yourself.
You have to read the whole stream to get to the data you need, but it doesn't have to be kept in memory.
If I were doing this with Java, I'd use a SAX parser instead of a DOM parser. I could handle the few values I was interested in and not keep a large DOM in memory. See if there's something equivalent for you with PHP.
SimpleXml is a DOM parser. It will load and parse the entire document into memory before you can work with it. If you do not want that, use XMLReader which will allow you to process the XML while you are reading it from a stream, e.g. you could exit processing once the avatar was fetched.
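A small sketch of that streaming approach with XMLReader, stopping as soon as the element has been read ($steamid is assumed to be defined already):

<?php
// Sketch: stream the profile XML and stop at <avatarFull>,
// instead of building the whole document in memory first.
$reader = new XMLReader();
$reader->open('http://steamcommunity.com/profiles/' . $steamid . '?xml=1');

$avatarlink = null;
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'avatarFull') {
        $avatarlink = trim($reader->readString()); // readString() also resolves the CDATA content
        break; // we have what we need, stop parsing
    }
}
$reader->close();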
But like other people already pointed out elsewhere on this page, with a file as small as shown, this is likely rather a network latency issue than an XML issue.
Also see Best XML Parser for PHP
That file looks small enough; it shouldn't take that long to parse. It probably takes that long because of some sort of network problem rather than the slowness of parsing.
If the network is your issue then no amount of trickery will help you :(.
If it isn't the network, then you could try a regex match on the input. That will probably be marginally faster.
Try this expression:
/<avatarFull><!\[CDATA\[(.*?)\]\]><\/avatarFull>/
and read the link from the first group match.
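For example, a quick sketch ($steamid is assumed to be defined already):

<?php
// Sketch: pull avatarFull straight out of the raw response with a regex.
$response = file_get_contents('http://steamcommunity.com/profiles/' . $steamid . '?xml=1');

if (preg_match('/<avatarFull><!\[CDATA\[(.*?)\]\]><\/avatarFull>/', $response, $m)) {
    $avatarlink = $m[1]; // first capture group holds the link
}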
You could try the SAX way of parsing (http://php.net/manual/en/book.xml.php), but as I said, since the file is small I doubt it will really make a difference.
You can take advantage of caching the results of simplexml_load_file() somewhere like Memcached or the filesystem. Here is a typical workflow:
check if XML file was processed during last N seconds
return processing results on success
on failure get results from simplexml
process them
resize images
store results in cache
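A rough sketch of that workflow with Memcached; the key name, TTL and server address are arbitrary, and the image-resizing step is left out ($steamid is assumed to be defined already):

<?php
// Sketch: cache the parsed Steam profile data in Memcached so the
// profile XML is fetched at most once every N seconds per user.
$cacheTtl = 3600; // one hour
$cacheKey = 'steam_avatar_' . $steamid;

$memcached = new Memcached();
$memcached->addServer('127.0.0.1', 11211);

$avatarlink = $memcached->get($cacheKey);
if ($avatarlink === false) {
    // Cache miss: fetch and parse the profile, then store the result.
    $xml = @simplexml_load_file('http://steamcommunity.com/profiles/' . $steamid . '?xml=1');
    $avatarlink = $xml ? (string) $xml->avatarFull : '';
    $memcached->set($cacheKey, $avatarlink, $cacheTtl);
}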