Appending data to xml file using php - php

I'm new to working with XML files and I'm stuck on an issue.
I have a mysql query which fetches url data nearly 5000 rows (1 row contains 1 url).
so i've implemented a cron which fetches 1000 rows at time from mysql with pagination. i need to do some validations on the urls and should append the valid urls in an xml file.
Here is my code
public function urlcheck()
{
$xFile = $this->base_path."sitemap/path/urls.xml";
$page = 0;
$cache_key = 'valid_urls';
$page = $this->cache->redis->get($cache_key);
if(!$page){
$page=0;
}
$xFile = simplexml_load_file($xFile);
$this->load->model('productnew/productnew_es6_m');
$urls= $this->db->query("SELECT url FROM product_data where `active` = 1 limit ".$page.",1000")->result();
$dom = new DOMDocument('1.0','UTF-8');
$dom->formatOutput = true;
$root = $dom->createElement('urlset');
$root->setAttribute('xsi:schemaLocation', 'http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd');
$root->setAttribute('xmlns:xsi', 'http://www.w3.org/2001/XMLSchema-instance');
$root->setAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');
$dom->appendChild($root);
foreach($urls as $val)
{
// validations here
$url = $dom->createElement('url');
$root->appendChild($url);
$lastmod = $dom->createElement('lastmod', date("Y-m-d"));
$url->appendChild($lastmod);
$page++;
}
$dom->saveXML();
$dom->save($xFile) or die('XML Create Error');
if(sizeof($urls) == 0){
$page = 0;
}
print_r($page);
$this->cache->redis->save($cache_key, $page, 432000);
// echo '<xmp>'. $dom->saveXML() .'</xmp>';
// $dom->saveXML();
// $dom->save($xFile) or die('XML Create Error');
}
After my first cron execution, 300 valid urls out of 1000 are saved to the xml file.
Now let's say that in my second cron execution I have 200 valid urls out of 1000.
My expected result is to append these 200 to the existing xml file so that it contains 500 valid urls in total, and the xml file should be refreshed after all 5000 urls have been processed, as mentioned above.
But after executing the cron every time, old url data is being replaced with latest once.
How do I save the url values without overwriting the existing XML?
Thank you in Advance!

As per the comment above you are opening the file with one api (SimpleXML) but saving a new document with DOMDocument - thus overwriting previous work. Without SimpleXML perhaps you can try like this - though it is untested.
public function urlcheck(){
$file=$this->base_path."sitemap/path/urls.xml";
$cache_key='valid_urls';
$page=$this->cache->redis->get($cache_key);
if(!$page)$page=0;
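$dom=new DOMDocument('1.0','UTF-8');
$dom->formatOutput = true;
# load the existing file so that new nodes are appended instead of replacing it
if( file_exists( $file ) )$dom->load( $file );
$col=$dom->getElementsByTagName('urlset');
# getElementsByTagName always returns a DOMNodeList object, so test its length
if( $col->length > 0 )$root=$col->item(0);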
else{
$root=$dom->createElement('urlset');
$dom->appendChild( $root );
$root->setAttribute('xsi:schemaLocation','http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd');
$root->setAttribute('xmlns:xsi','http://www.w3.org/2001/XMLSchema-instance');
$root->setAttribute('xmlns','http://www.sitemaps.org/schemas/sitemap/0.9');
}
# does a `page` node exist - if so use the value as the $page variable
$col=$dom->getElementsByTagName('page');
if( $col->length > 0 )$page=intval( $col->item(0)->nodeValue );
$this->load->model('productnew/productnew_es6_m');
$urls=$this->db->query("SELECT `url` FROM `product_data` where `active` = 1 limit ".$page.",1000")->result();
foreach( $urls as $val ){
$url = $dom->createElement('url');
$root->appendChild($url);
$lastmod = $dom->createElement('lastmod', date("Y-m-d"));
$url->appendChild($lastmod);
$page++;
}
$node=$dom->createElement( 'page', $page );
$root->insertBefore( $node, $root->firstChild );
if( empty( $urls ) )$page=0;
$dom->save( $file );
$this->cache->redis->save( $cache_key, $page, 432000 );
}

Appending to the document looks fine, but you never open the file you want to append to from disk. Therefore on each run you start with 0 urls in the XML and append to the empty root node.
But after executing the cron every time, old url data is being replaced with latest once.
This is exactly the behaviour you would get if you never load the XML file in the first place and only write it.
So the question is really how to open an XML file; by your description the appending already works.
Let's review, by reversing the introduction sentences of your question:
I need to do some validations on the urls and should append the valid urls in an xml file.
so i've implemented a cron which fetches 1000 rows at time from mysql with pagination.
I have a mysql query which fetches url data nearly 5000 rows (1 row contains 1 url).
On pages 2-5 the file holding the previous 1000-url batches is already on disk, so you need to open it and append to it. On page 1, however, you want to start a fresh file; if an old file were still on disk and you appended to it, you would be adding to the results of an earlier cycle.
So it looks like you have written the code only for when you're on the first page - to create a new document (and append to it).
And despite your question, appending does work, you write it yourself:
old url data is being replaced with latest once.
The only thing that does not work is to open the file on page 2 - 5.
So let's rephrase the question: How to open an XML file?
But first of all, the variable $page is not meant to stand for page as in page 1 - 5 above. It's just a variable with a questionable name and $page stands for the number of URLs processed so far in the cycle and not for the page in the pagination.
Regardless of its name, I'll use it for its value for this answer.
So now let's open the existing document for appending when $page is not 0:
...
$dom = new DOMDocument('1.0','UTF-8');
$dom->formatOutput = true;
if ($page !== 0) {
$dom->load(dom_import_simplexml($xFile)->ownerDocument->documentURI);
}
$col=$dom->getElementsByTagName('urlset');
...
Only on the first run (where $page === 0) is the file created anew, and in that case that is fine.
In any other case $page is not 0 and the file is opened from disk.
I've left the other parts of your code alone so that this example only introduces this 3-line if-clause.
The documentation for the load($file) function is available in the PHP docs, just in case you missed it so far:
https://www.php.net/manual/en/domdocument.load.php
Try not to re-use the same variable names if you want to come up to speed. Here I had to take a whole SimpleXMLElement and import it into DOM only to obtain the original XML file path to open the document, because the path was no longer available as a plain string even though it once was in the variable $xFile. But that is just a side note.
And as you're already using Redis, you perhaps may want to queue the URLs into it and process from there, then you'll likely not need the database paging. See Lists of the Redis Data-Types.
You can then also put the good URLs in there in a second list.
With two lists you can even constantly check the progress in Redis directly.
And when finally done, you can write the whole file at once in one transaction out of the good URLs in Redis.
If you want to throw some more (minimal) tech on it, take a look at Beanstalkd.
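The load-then-append idea from the answers above can be put together in one small helper. This is only a sketch under stated assumptions: the function name append_urls, the $reset flag and the added loc child are illustrative and not part of the original code, and the validation step is left out.

```php
<?php
const SITEMAP_NS = 'http://www.sitemaps.org/schemas/sitemap/0.9';

/**
 * Append URLs to a sitemap file. The file is created when it is missing
 * or when $reset is true (start of a new cycle).
 * Returns the total number of <url> elements after saving.
 * Hypothetical helper: name and parameters are assumptions.
 */
function append_urls(string $file, array $urls, bool $reset = false): int
{
    $dom = new DOMDocument('1.0', 'UTF-8');
    $dom->preserveWhiteSpace = false; // lets formatOutput re-indent a loaded file
    $dom->formatOutput = true;

    if (!$reset && is_file($file) && $dom->load($file)) {
        // Existing document: keep everything that is already in it.
        $root = $dom->documentElement;
    } else {
        // First run of a cycle: start with an empty <urlset>.
        $root = $dom->createElementNS(SITEMAP_NS, 'urlset');
        $dom->appendChild($root);
    }

    foreach ($urls as $loc) {
        $url = $dom->createElementNS(SITEMAP_NS, 'url');
        $locNode = $dom->createElementNS(SITEMAP_NS, 'loc');
        // A text node escapes characters such as & in query strings.
        $locNode->appendChild($dom->createTextNode($loc));
        $url->appendChild($locNode);
        $url->appendChild($dom->createElementNS(SITEMAP_NS, 'lastmod', date('Y-m-d')));
        $root->appendChild($url);
    }

    if ($dom->save($file) === false) {
        throw new RuntimeException("Could not write $file");
    }
    return $dom->getElementsByTagNameNS(SITEMAP_NS, 'url')->length;
}
```

In the cron from the question this could be called as append_urls($file, $validUrls, $page === 0), so that only the first batch of a cycle replaces the file and later batches extend it.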

Related

Sending url parameters through file_get_contents returns nothing

I am trying to scrape a website in order to get the latitude and longitude for counties in the US (there are 3306, which is why I am trying to do it through code and not manually).
I am using the code below
function GetLatitude($countyName,$stateShortName){
//Create DOM from url
$page = file_get_contents("https://www.mapdevelopers.com/geocode_tool.php?$countyName,$stateShortName");
$doc = new DOMDocument();
$doc->loadHTML($page);
$node = $doc->getElementById("display_lat");
var_dump($doc);
}
GetLatitude("Guilford County","NC");
This returns nothing but if I change the url to get without the parameters like "https://www.mapdevelopers.com/geocode_tool.php" then I can see that $doc now has some information in it but that is not useful because the value I need (latitude) is dependent upon the parameters passed into the url.
How do I solve this issue?
EDIT:
Based on the suggestion to encode the parameters I changed my code to this and now the document contains information but appears as though it is ignoring the parameters
<?
function GetLatitude($countyName,$stateShortName){
$countyName = urlencode($countyName);
$stateShortName = urlencode($stateShortName);
//Create DOM from url
$page = file_get_contents("https://www.mapdevelopers.com/geocode_tool.php?address=$countyName,$stateShortName");
$doc = new DOMDocument();
$doc->loadHTML($page);
$node = $doc->getElementById("display_lat");
var_dump($doc);
}
GetLatitude("Clarke County","AL");
?>
Your issue is that the latitude information isn't present on page load; JavaScript puts it there afterwards.
You're going to have a hard time scraping a page that relies on JS from PHP without something in the middle. Maybe retry this project with something like Puppeteer or PhantomJS so you can run your script against a real browser.
Searching the page shows that it makes an AJAX request to https://www.mapdevelopers.com/data.php
Sending a POST or GET request to that URL will give you the response you are looking for.
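A POST request of that kind can be sent from PHP with a stream context. The sketch below shows only the mechanics; the helper names are made up for illustration, and the field name "address" is a guess, since the answer does not say which parameters data.php expects. Check the real request in the browser's network tab first.

```php
<?php
/** Build stream-context options for a form-encoded POST request. */
function post_context_options(array $fields): array
{
    return [
        'http' => [
            'method'  => 'POST',
            'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
            'content' => http_build_query($fields), // URL-encodes names and values
            'timeout' => 10,
        ],
    ];
}

/** POST $fields to $url; returns the response body, or false on failure. */
function http_post(string $url, array $fields)
{
    $context = stream_context_create(post_context_options($fields));
    return file_get_contents($url, false, $context);
}

// Hypothetical usage; the "address" field name is an assumption:
// $body = http_post('https://www.mapdevelopers.com/data.php', ['address' => 'Guilford County, NC']);
```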

Fetch data from site Page by Page & go through sub links

URL : http://www.sayuri.co.jp/used-cars
Example : http://www.sayuri.co.jp/used-cars/B37753-Toyota-Wish-japanese-used-cars
Hey guys, I need some help with one of my personal projects. I've already written the code to fetch data from each single car url (example) and post it on my site.
Now i need to go through the main url : sayuri.co.jp/used-cars , and :
1) Make an array / list / nodes of all the urls for all the single cars in it , then run my internal code for each one to fetch data , then move on to the next one
I already have the code to save each url into a log file when completed (I don't think it will be necessary if it goes link by link without starting from the top, but it will ensure no repetition).
2) When all links are done for the page , it should move to the next page and do the same thing until the end ( there are 5-6 pages max )
I've been stuck on this part since last night and would really appreciate any help . Thanks
My code to get data from the main url :
$content = file_get_contents('http://www.sayuri.co.jp/used-cars/');
// echo $content;
and
$dom = new DOMDocument;
$dom->loadHTML($content);
//echo $dom;
I'm guessing you already know this since you say you've gotten data from the car entries themselves, but a good point to start is by dissecting the page's DOM and seeing if there are any elements you can use to jump around quickly. Most browsers have page inspection tools to help with this.
In this case, <div id="content"> serves nicely. You'll note it contains a collection of tables with the required links and a <div> that contains the text telling us how many pages there are.
Disclaimer: it's been years since I've done PHP and I have not tested this, so it is probably neither correct nor optimal, but it should get you started. You'll need to tie the functions together (what's the fun in me doing it?) to achieve what you want, but these should grab the data required.
You'll be working with the DOM on each page, so a convenience to grab the DOMDocument:
function get_page_document($index) {
$content = file_get_contents("http://www.sayuri.co.jp/used-cars/page:{$index}");
$document = new DOMDocument;
$document->loadHTML($content);
return $document;
}
You need to know how many pages there are in total in order to iterate over them, so grab it:
function get_page_count($document) {
$content = $document->getElementById('content');
$count_div = $content->childNodes->item($content->childNodes->length - 4);
$count_text = $count_div->firstChild->textContent;
if (preg_match('/Page \d+ of (\d+)/', $count_text, $matches) === 1) {
return $matches[1];
}
return -1;
}
It's a bit ugly, but the links are available inside each <table> in the contents container. Rip 'em out and push them in an array. If you use the link itself as the key, there is no concern for duplicates as they'll just rewrite over the same key-value.
function get_page_links($document) {
$content = $document->getElementById('content');
$tables = $content->getElementsByTagName('table');
$links = array();
foreach ($tables as $table) {
if ($table->getAttribute('class') === 'itemlist-table') {
// table > tbody > tr > td > a
$link = $table->firstChild->firstChild->firstChild->firstChild->getAttribute('href');
// No duplicates because they just overwrite the same entry.
$links[$link] = "http://www.sayuri.co.jp{$link}";
}
}
return $links;
}
Perhaps also obvious, but these will break if this site changes their formatting. You'd be better off asking if they have a REST API or some such available for long term use, though I'm guessing you don't care as much if it's just a personal project for tinkering.
Hope it helps prod you in the right direction.
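The firstChild chain in get_page_links depends on the exact markup, including whitespace text nodes between tags. An XPath query is a more tolerant way to express the same idea. This is a sketch, not the answerer's method: the function name extract_item_links is made up, the class name itemlist-table is taken from the answer above, and unlike the original it collects every link inside a matching table rather than only the first one.

```php
<?php
/** Collect de-duplicated absolute links found inside tables of class "itemlist-table". */
function extract_item_links(string $html, string $base): array
{
    $document = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($document);
    // Match the class as a whole word, so "itemlist-table extra" still qualifies.
    $query = '//table[contains(concat(" ", normalize-space(@class), " "), " itemlist-table ")]//a[@href]';

    $links = [];
    foreach ($xpath->query($query) as $a) {
        $href = $a->getAttribute('href');
        $links[$href] = $base . $href; // keyed by href, so repeats collapse
    }
    return $links;
}
```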

Loading one file from XML root based on child value [duplicate]

This question already has answers here:
Use XPath with PHP's SimpleXML to find nodes containing a String
(4 answers)
Closed 9 years ago.
I have an xml feed which I am extracting using PHP, i have the code written to find the values I need and display correctly on the page.
XML code is:
<Agents>
<Agent>
<id></id>
<description></description>
<name></name>
</Agent>
</Agents>
PHP Code
<?php
$url = "urlgoeshere";
$xml = simplexml_load_file($url);
for ($html = "", $i = 0; $i < 10; $i++)
{
$id = $xml->Agent[$i]->id;
$name = $xml->Agent[$i]->name;
$description = $xml->Agent[$i]->description;
$html .= "<h1>$id</h1><h2>$name</h2><p>$description</p>";
}
echo $html;
This is set to load 10 agents, which works fine, but I want to change this and load only one specific Agent based on its id.
So for example if an agent has an id of 1200 on the xml field I want to find that and load only that one Agent but can't seem to work out an easy way to do this.
Just use an if condition with a continue
$idToFind = 1200;
for ($i = 0; $i < 10; $i++) {
$id = $xml->Agent[$i]->id;
if ($id != $idToFind)
continue;
else {
$name = $xml->Agent[$i]->name;
$description = $xml->Agent[$i]->description;
$html .="<h1>$id</h1><h2>$name</h2><p>$description</p>";
}
}
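As the linked duplicate suggests, an XPath query can pick the matching Agent directly, without assuming it is among the first ten entries. A minimal sketch, assuming the Agents/Agent/id structure shown in the question; the function name find_agent and the sample data are illustrative only.

```php
<?php
/** Return the first <Agent> whose <id> equals $id, or null when there is none. */
function find_agent(SimpleXMLElement $xml, int $id): ?SimpleXMLElement
{
    // $id is typed as int, so it is safe to interpolate into the expression.
    $matches = $xml->xpath("/Agents/Agent[id = $id]");
    return $matches ? $matches[0] : null;
}

// Sample data in the shape given in the question (values are made up).
$xml = simplexml_load_string(
    '<Agents>'
    . '<Agent><id>7</id><description>First</description><name>Ann</name></Agent>'
    . '<Agent><id>1200</id><description>Second</description><name>Bob</name></Agent>'
    . '</Agents>'
);

$agent = find_agent($xml, 1200);
if ($agent !== null) {
    echo "<h1>{$agent->id}</h1><h2>{$agent->name}</h2><p>{$agent->description}</p>";
}
```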
You have two options. Either you filter client-side (in your code) or you filter server-side.
Server side
If you request the XML file e.g. from a RESTful service you might want to pass a parameter directly to your request.
Instead of requesting example.com/agents.xml you can maybe request example.com/agents/1.xml or something like that. In that case you have to check the API you request the XML file from. The advantage of this type of filtering is that you load a smaller XML file, with less data and less traffic.
Client side
If you are unable to filter the data on the server side, you need to check it in your PHP code. The simplest option would be to add an if statement in your loop. And since you are talking about 1200 agents it might be the easiest as well. In case you have to load more entries, or speed and efficiency are required for your application, you might want to use another XML parser. The SimpleXML class loads the whole file into memory. I have written a relatively efficient way to parse an XML file with XMLReader, which streams the document and requires less memory. Feel free to edit the example to fit your needs.
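The XMLReader example referred to above is not included here, so the following is a stand-in sketch rather than the answerer's code. It streams through the file one Agent at a time and stops at the first match; the function name find_agent_streaming and the Agents/Agent/id layout are assumptions based on the question.

```php
<?php
/** Stream through $file and return the first <Agent> with the given id, or null. */
function find_agent_streaming(string $file, string $id): ?SimpleXMLElement
{
    $reader = new XMLReader();
    if (!$reader->open($file)) {
        return null;
    }

    // Advance to the first <Agent> element.
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'Agent') {
            break;
        }
    }

    $found = null;
    while ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'Agent') {
        // Only the current <Agent> subtree is expanded into memory.
        $agent = new SimpleXMLElement($reader->readOuterXml());
        if ((string) $agent->id === $id) {
            $found = $agent;
            break;
        }
        // Skip this subtree and jump to the next <Agent>, if there is one.
        if (!$reader->next('Agent')) {
            break;
        }
    }

    $reader->close();
    return $found;
}
```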

Generating cache file for Twitter rss feed

I'm working on a site with a simple php-generated twitter box with user timeline tweets pulled from the user_timeline rss feed, and cached to a local file to cut down on loads, and as backup for when twitter goes down. I based the caching on this: http://snipplr.com/view/8156/twitter-cache/. It all seemed to be working well yesterday, but today I discovered the cache file was blank. Deleting it then loading again generated a fresh file.
The code I'm using is below. I've edited it to try to get it to work with what I was already using to display the feed and probably messed something crucial up.
The changes I made are the following (and I strongly believe that any of these could be the cause):
- Revised the time difference code (the linked example seemed to use a custom function that wasn't included in the code)
- Removed the "serialize" function from the "fwrites". This is purely because I couldn't figure out how to unserialize once I loaded it in the display code. I truthfully don't understand the role that serialize plays or how it works, so I'm sure I should have kept it in. If that's the case I just need to understand where/how to deserialize in the second part of the code so that it can be parsed.
- Removed the $rss variable in favor of just loading up the cache file in my original tweet display code.
So, here are the relevant parts of the code I used:
<?php
$feedURL = "http://twitter.com/statuses/user_timeline/#######.rss";
// START CACHING
$cache_file = dirname(__FILE__).'/cache/twitter_cache.rss';
// Start with the cache
if(file_exists($cache_file)){
$mtime = (strtotime("now") - filemtime($cache_file));
if($mtime > 600) {
$cache_rss = file_get_contents('http://twitter.com/statuses/user_timeline/75168146.rss');
$cache_static = fopen($cache_file, 'wb');
fwrite($cache_static, $cache_rss);
fclose($cache_static);
}
echo "<!-- twitter cache generated ".date('Y-m-d h:i:s', filemtime($cache_file))." -->";
}
else {
$cache_rss = file_get_contents('http://twitter.com/statuses/user_timeline/#######.rss');
$cache_static = fopen($cache_file, 'wb');
fwrite($cache_static, $cache_rss);
fclose($cache_static);
}
//END CACHING
//START DISPLAY
$doc = new DOMDocument();
$doc->load($cache_file);
$arrFeeds = array();
foreach ($doc->getElementsByTagName('item') as $node) {
$itemRSS = array (
'title' => $node->getElementsByTagName('title')->item(0)->nodeValue,
'date' => $node->getElementsByTagName('pubDate')->item(0)->nodeValue
);
array_push($arrFeeds, $itemRSS);
}
// the rest of the formatting and display code....
}
?>
ETA 6/17 Nobody can help…?
I'm thinking it has something to do with writing a blank cache file over a good one when twitter is down, because otherwise I imagine that this should be happening every ten minutes when the cache file is overwritten again, but it doesn't happen that frequently.
I made the following change to the part where it checks how old the file is to overwrite it:
$cache_rss = file_get_contents('http://twitter.com/statuses/user_timeline/75168146.rss');
if($mtime > 600 && $cache_rss != ''){
$cache_static = fopen($cache_file, 'wb');
fwrite($cache_static, $cache_rss);
fclose($cache_static);
}
…so now, it will only write the file if it's over ten minutes old and there's actual content retrieved from the rss page. Do you think this will work?
Yes your code is problematic, because whatever Twitter sends you, you write it.
You should test the file you get from Twitter like this:
if (($mtime > 600) && ($cache_rss = file_get_contents($feedURL)))
{
file_put_contents($cache_file, $cache_rss);
}
file_get_contents() returns false if there is an error, so check the result before caching new content.
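The same check can be wrapped in a small helper that also avoids leaving a half-written cache file behind. This is a sketch under assumptions: the function name refresh_cache and the callable parameter are illustrative, and writing to a temporary file before renaming is an extra precaution that neither the question nor the answer uses.

```php
<?php
/**
 * Refresh $cacheFile from $fetch() when it is missing or older than $maxAge seconds.
 * The old cache is kept whenever the fetch fails or returns an empty body.
 * Returns true only when a new cache file was written.
 */
function refresh_cache(string $cacheFile, callable $fetch, int $maxAge = 600): bool
{
    $fresh = is_file($cacheFile) && (time() - filemtime($cacheFile)) <= $maxAge;
    if ($fresh) {
        return false;
    }

    $body = $fetch();
    if ($body === false || trim($body) === '') {
        return false; // keep whatever is already cached
    }

    // Write to a temporary file first, then rename it into place,
    // so readers do not see a partially written cache.
    $tmp = $cacheFile . '.tmp';
    if (file_put_contents($tmp, $body, LOCK_EX) === false) {
        return false;
    }
    return rename($tmp, $cacheFile);
}

// Possible usage with the variables from the question:
// refresh_cache($cache_file, fn () => @file_get_contents($feedURL));
```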

Updating the XML file using PHP script

I'm making an interface-website to update a concert-list on a band-website.
The list is stored as an XML file and has this structure :
I already wrote a script that enables me to add a new gig to the list, this was relatively easy...
Now I want to write a script that enables me to edit a certain gig in the list.
Every Gig is Unique because of the first attribute : "id" .
I want to use this reference to edit the other attributes in that Node.
My PHP is very poor, so I hope someone could put me on the good foot here...
My PHP script :
Well i dunno what your XML structure looks like but:
<gig id="someid">
<venue></venue>
<day></day>
<month></month>
<year></year>
</gig>
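$xml = new SimpleXMLElement('gig.xml', 0, true); // true: the first argument is a file path
$matches = $xml->xpath('//gig[@id="'.$_POST['id'].'"]'); // xpath() returns an array of matches
if ($matches) {
$gig = $matches[0];
$gig->venue = $_POST['venue'];
$gig->month = $_POST['month'];
// etc..
}
$xml->asXML('gig.xml'); // save back to file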
now if instead all these data points are attributes you can use $gig->attributes()->venue to access it.
There is no need for the loop really unless you are doing multiple updates with one post - you can get at any specific record via an XPath query. SimpleXML is also a lot lighter and a lot easier to use for this type of thing than DOMDocument - especially as you aren't using the features of DOMDocument.
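Since the id comes from $_POST, splicing it straight into the XPath expression lets a crafted value change the query. One way around that is to compare the id in PHP instead. The sketch below assumes the gig structure guessed at in this answer (a gig element with an id attribute and child elements); the function name update_gig and the gigs wrapper element in the usage example are made up for illustration.

```php
<?php
/**
 * Set child elements of the <gig> with the given id.
 * Returns the updated XML as a string, or null if no gig has that id.
 */
function update_gig(string $xmlSource, string $id, array $fields): ?string
{
    $xml = new SimpleXMLElement($xmlSource);
    foreach ($xml->xpath('//gig[@id]') as $gig) {
        // Compare in PHP rather than putting user input into the XPath expression.
        if ((string) $gig['id'] !== $id) {
            continue;
        }
        foreach ($fields as $name => $value) {
            // Creates the child if it is missing, replaces its text otherwise.
            $gig->{$name} = $value;
        }
        return $xml->asXML();
    }
    return null;
}

// Example with made-up data:
$source = '<gigs><gig id="a1"><venue>Old</venue></gig><gig id="b2"><venue>Keep</venue></gig></gigs>';
$updated = update_gig($source, 'a1', ['venue' => 'New Hall', 'month' => '07']);
```

With a file on disk, the result could be written back with file_put_contents('gig.xml', $updated) after checking that it is not null.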
You'll want to load the xml file in a domdocument with
<?
$xml = new DOMDocument();
$xml->load("xmlfile.xml");
//find the tags that you want to update
$tags = $xml->getElementsByTagName("GIG");
//find the tag with the id you want to update
foreach ($tags as $tag) {
if($tag->getAttribute("id") == $id) { //found the tag, now update the attribute
$tag->setAttribute("[attributeName]", "[attributeValue]");
}
}
//save the xml
$xml->save("xmlfile.xml"); // save() needs the target filename
?>
code is untested, but it's a general idea
