Scraping content from a file

I have a file that contains many of these
<sync start="14400">
<p class="ENCC">
Removed
</p>
</sync>
and I would like to turn them into this format
<p begin="00:00:33.3" end="00:00:35.8">Removed</p>
I would like to get the data inside start="" and the data inside the <p> element, and loop through until I have all of them on the page.
I have been trying to do this for a few hours now but could do with a pointer in the right direction. Any help or guidance would be greatly appreciated. Thank you
Edit: also, please ignore the start/begin formatting; I already have the code to do that.

If what you're after is a simple way to parse XML, look into phpQuery (very accessible if you're used to jQuery). The code would look something like (untested):
$start_values = array ();
$content_values = array ();
$doc = phpQuery::newDocumentXML ($xml);
foreach (pq ('sync') as $node)
{
    $start_values[] = pq ($node)->attr ('start');
    $content_values[] = pq ($node)->find ('p')->html ();
}
$start_values would then be an array holding the value of each start attribute, and $content_values would be an array holding the content of each <p> element.
UPDATED
I noticed that I didn't take the p-node under sync into consideration earlier. The find ('p') part should take care of that.
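If you'd rather not pull in an extra library, the same extraction can be sketched with PHP's built-in DOMDocument. This is only a sketch: it assumes the file is a run of well-formed <sync> blocks like the one in the question, and extract_sync_entries and the sample input are made up for illustration.

```php
<?php
// Sample input in the question's format (assumed to be well-formed XML fragments).
$input = <<<XML
<sync start="14400">
<p class="ENCC">
First line
</p>
</sync>
<sync start="16900">
<p class="ENCC">
Second line
</p>
</sync>
XML;

// Hypothetical helper: returns one ['start' => ..., 'text' => ...] pair per <sync> block.
function extract_sync_entries(string $fragments): array
{
    $doc = new DOMDocument();
    // Wrap everything in a single root so several <sync> blocks form one document.
    $doc->loadXML('<root>' . $fragments . '</root>');

    $entries = [];
    foreach ($doc->getElementsByTagName('sync') as $sync) {
        $p = $sync->getElementsByTagName('p')->item(0);
        $entries[] = [
            'start' => $sync->getAttribute('start'),
            'text'  => $p !== null ? trim($p->textContent) : '',
        ];
    }
    return $entries;
}

$entries = extract_sync_entries($input);
print_r($entries);
```

Each entry keeps the raw start value, so the start/begin conversion you already have can be applied to it afterwards.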

Related

PHP Decoding JSON and Accessing Array Value based on dynamic variable

I am trying to use a weather API for a project and I am trying to build a simple weather app that will check the forecast and display it. The code looks like this:
$url = "http://api.openweathermap.org/data/2.5/weather?zip=30553&appid={APIKEY}";
$data = file_get_contents($url);
$decodeJSON = json_decode($data);
$testvar = "weather[0]->main";
echo $decodeJSON->$testvar;
Ideally I would like to be able to change $testvar to point at different values returned by the API. I know it can't be done by simply combining "$decodeJSON->" and "$testvar", but is there any way to achieve something close to what I have above that basically just assembles the two?
Appreciate all the help in advance and any feedback on the code is greatly appreciated!
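One way to get close to this is to stop treating the path as PHP syntax and walk it one key at a time. A minimal sketch, assuming the response is decoded to arrays (json_decode with true as the second argument) and the path is written with dots; get_by_path is a hypothetical helper, and the JSON below is a made-up sample rather than a real API response:

```php
<?php
// Hypothetical helper: follow a dot-separated path through a decoded JSON array.
function get_by_path(array $data, string $path)
{
    $current = $data;
    foreach (explode('.', $path) as $key) {
        if (!is_array($current) || !array_key_exists($key, $current)) {
            return null; // the path does not exist in this response
        }
        $current = $current[$key];
    }
    return $current;
}

// Made-up sample shaped like the response in the question.
$json = '{"weather":[{"main":"Clouds","description":"overcast"}],"main":{"temp":280.5}}';
$decoded = json_decode($json, true);

$testvar = 'weather.0.main';
echo get_by_path($decoded, $testvar), "\n"; // prints "Clouds"
```

Changing $testvar to 'main.temp' would then reach a different value without touching the lookup code.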

Fetch data from site Page by Page & go through sub links

URL : http://www.sayuri.co.jp/used-cars
Example : http://www.sayuri.co.jp/used-cars/B37753-Toyota-Wish-japanese-used-cars
Hey guys, I need some help with one of my personal projects. I've already written the code to fetch data from each single car URL (see the example) and post it on my site.
Now I need to go through the main URL, sayuri.co.jp/used-cars, and:
1) Make an array / list / nodes of all the URLs for the single cars on it, then run my internal code on each one to fetch its data, then move on to the next one.
I already have the code to save each URL into a log file when completed (I don't think it will be necessary if it goes link by link without starting from the top, but it will ensure no repetition).
2) When all links on the page are done, it should move to the next page and do the same thing until the end (there are 5-6 pages max).
I've been stuck on this part since last night and would really appreciate any help. Thanks
My code to get data from the main url :
$content = file_get_contents('http://www.sayuri.co.jp/used-cars/');
// echo $content;
and
$dom = new DOMDocument;
$dom->loadHTML($content);
//echo $dom;

I'm guessing you already know this since you say you've gotten data from the car entries themselves, but a good point to start is by dissecting the page's DOM and seeing if there are any elements you can use to jump around quickly. Most browsers have page inspection tools to help with this.
In this case, <div id="content"> serves nicely. You'll note it contains a collection of tables with the required links and a <div> that contains the text telling us how many pages there are.
Disclaimer: it's been years since I've done PHP and I have not tested this, so it is probably neither correct nor optimal, but it should get you started. You'll need to tie the functions together (what's the fun in me doing it?) to achieve what you want, but these should grab the data required.
You'll be working with the DOM on each page, so here is a convenience function to grab the DOMDocument:
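function get_page_document($index) {
    $content = file_get_contents("http://www.sayuri.co.jp/used-cars/page:{$index}");
    $document = new DOMDocument;
    // Real-world HTML is rarely valid; keep libxml's parse warnings out of the output.
    libxml_use_internal_errors(true);
    $document->loadHTML($content);
    return $document;
}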
You need to know how many pages there are in total in order to iterate over them, so grab it:
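function get_page_count($document) {
    $content = $document->getElementById('content');
    // The page counter sits in the fourth child node from the end of the container.
    $count_div = $content->childNodes->item($content->childNodes->length - 4);
    $count_text = $count_div->firstChild->textContent;
    if (preg_match('/Page \d+ of (\d+)/', $count_text, $matches) === 1) {
        // preg_match captures strings, so cast before using it as a loop bound.
        return (int) $matches[1];
    }
    return -1;
}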
It's a bit ugly, but the links are available inside each <table> in the contents container. Rip 'em out and push them in an array. If you use the link itself as the key, there is no concern for duplicates as they'll just rewrite over the same key-value.
function get_page_links($document) {
    $content = $document->getElementById('content');
    $tables = $content->getElementsByTagName('table');
    $links = array();
    foreach ($tables as $table) {
        if ($table->getAttribute('class') === 'itemlist-table') {
            // Take the first <a> inside the table. Chaining firstChild can land on
            // whitespace text nodes, which have no getAttribute().
            $anchor = $table->getElementsByTagName('a')->item(0);
            if ($anchor === null) {
                continue;
            }
            $link = $anchor->getAttribute('href');
            // No duplicates because they just overwrite the same entry.
            $links[$link] = "http://www.sayuri.co.jp{$link}";
        }
    }
    return $links;
}
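The same link collection can also be written as a single XPath query, which avoids stepping through the tables by hand. A rough sketch, run against made-up markup that only imitates the structure described above (the real page may differ); get_links_with_xpath is a hypothetical name:

```php
<?php
// Made-up markup that imitates the container and tables described above.
$html = <<<HTML
<div id="content">
  <table class="itemlist-table"><tr><td><a href="/used-cars/B1-Car-One">One</a></td></tr></table>
  <table class="itemlist-table"><tr><td><a href="/used-cars/B2-Car-Two">Two</a></td></tr></table>
  <table class="itemlist-table"><tr><td><a href="/used-cars/B1-Car-One">One again</a></td></tr></table>
  <table class="other"><tr><td><a href="/about">About</a></td></tr></table>
</div>
HTML;

// Hypothetical helper: collect the car links as href => absolute URL.
function get_links_with_xpath(DOMDocument $document, string $base): array
{
    $xpath = new DOMXPath($document);
    $query = '//div[@id="content"]//table[@class="itemlist-table"]//a[@href]';
    $links = [];
    foreach ($xpath->query($query) as $anchor) {
        $href = $anchor->getAttribute('href');
        $links[$href] = $base . $href; // keyed by href, so repeats collapse
    }
    return $links;
}

libxml_use_internal_errors(true); // suppress warnings from imperfect HTML
$document = new DOMDocument();
$document->loadHTML($html);
$links = get_links_with_xpath($document, 'http://www.sayuri.co.jp');
print_r(array_values($links));
```

Note that the class test is an exact match, so a table carrying additional classes would be skipped.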
Perhaps also obvious, but these will break if this site changes their formatting. You'd be better off asking if they have a REST API or some such available for long term use, though I'm guessing you don't care as much if it's just a personal project for tinkering.
Hope it helps prod you in the right direction.

How to get post URL out of the Blogger API in PHP

In short, I am pulling the feed from my Blogger blog using the Zend API in PHP. I need to get the URL that links to a post on Blogger. What is the order of functions I need to call to get that URL?
Right now I am pulling the data using:
$query = new Zend_Gdata_Query('http://www.blogger.com/feeds/MYID/posts/default');
$query->setParam('max-results', "1");
$feed = $gdClient->getFeed($query);
$newestPost = $feed->entry[0];
I can not for the life of me figure out where I have to go from here to get the URL. I can successfully get the Post title using: $newestPost->getTitle() and I can get the body by using $newestPost->getContent()->getText(). I have tried a lot of function calls, even ones in the documentation and most of them error out. I have printed out the entire object to look through it and I can find the data I want (so I know it is there) but the object is too complex to be able to just look at and see what I have to do to get to that data.
If anyone can help me or at least point me to a good explanation of how that Object is organized and how to get to each sub object within it, that would be greatly appreciated.
EDIT: Never mind I figured it out.

You are almost there. Once you have your feed entry, all you really need to do is access the link element inside it. I like pretty URLs, so I went with the alternate link rather than the self link in the Atom feed.
$link = $entry->link[4]->href;
where $entry is the entry that you are setting from the feed.
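A fixed index like 4 depends on how many link elements the entry happens to contain, so it may be safer to pick the link by its rel attribute. The sketch below shows the idea with SimpleXML on a made-up Atom entry rather than with Zend_Gdata itself; find_link_by_rel is a hypothetical helper, and the URLs are placeholders:

```php
<?php
// Made-up Atom entry with a "self" link and an "alternate" link.
$atom = <<<XML
<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Example post</title>
  <link rel="self" href="http://www.blogger.com/feeds/123/posts/default/456"/>
  <link rel="alternate" type="text/html" href="http://example.blogspot.com/2010/01/example-post.html"/>
</entry>
XML;

// Hypothetical helper: return the href of the first <link> with the given rel.
function find_link_by_rel(SimpleXMLElement $entry, string $rel): ?string
{
    foreach ($entry->link as $link) {
        if ((string) $link['rel'] === $rel) {
            return (string) $link['href'];
        }
    }
    return null; // no link with that rel
}

$entry = new SimpleXMLElement($atom);
echo find_link_by_rel($entry, 'alternate'), "\n";
```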

The solution is:
$query = new Zend_Gdata_Query('http://www.blogger.com/feeds/MyID/posts/default');
$query->setParam('max-results', "1");
$feed = $gdClient->getFeed($query);
$newestPost = $feed->entry[0];
$body = $newestPost->getContent()->getText();
$body now contains the post contents of the latest post (or entry[0]) from the feed. This is just the contents of the body of the post, not the title or any other data or formatting.

PHPFlickr script...could this be cleaner/leaner?

OK, here's my dilemma:
I've read all over about how many people want to display a set of images from Flickr using PHPFlickr, but lament that the API for PhotoSets does not return individual photo descriptions. Some have tried to set up their PHP so it pulls the description for each photo as the script assembles the gallery on the page. However, that method has proven slow and inefficient.
I caught an idea elsewhere of creating a string of separated values with the photo ID and the description. I'd store it in the MySQL database and then call upon it when I have my script assemble the gallery on the page. I'd use explode to create an array of the photo ID and its description, then call on that to fill in the gaps...thus fewer API calls and a faster page.
So in the back-end admin, I have a form where I set up the information for the gallery, and I hand it a Set ID. The script would then go through and make this string of separated values ("|~|" as the separator). Here's what I came up with:
include("phpFlickr.php");
$f = new phpFlickr("< api >");
$descArray = "";
// This will create an Array of Photo ID from the Set ID.
// $setFeed is the set ID brought in from the form.
$photos = $f->photosets_getPhotos($setFeed);
foreach ($photos['photoset']['photo'] as $photo) {
    $returnDesc = array();
    $photoID = $photo['id'];
    $rsp = $f->photos_getInfo($photoID);
    foreach ($rsp as $pic) {
        $returnDesc[] = htmlspecialchars($pic['description'], ENT_QUOTES);
    }
    $descArray .= $photoID."|~|".$returnDesc[0]."|~|";
}
The string $descArray would then be placed in the MySQL string that puts it into the database with other information brought in from the form.
My first question is was I correct in using a second foreach loop to get those descriptions? I tried following other examples all over the net that didn't use that, but they never worked. When I brought on the second foreach, then it worked. Should I have done something else?
I noticed the data returned would be two entries. One being the description, and the other just an "o"...hence the array $returnDesc so I could just get the one string I wanted and not the other.
Second question is if I made this too complicated or not. I like to try to learn to write cleaner/leaner code, and was looking for opinions.
Suggestions on improvement are welcome. Thank you in advance.

I'm not 100% sure, as I've only browsed the source for phpFlickr and looked at the Flickr API docs for the getInfo() call. But let me have a go anyway :)
First off, it looks like you shouldn't need that loop, like you mention. What does the output of print_r($rsp); look like? It could be that $rsp is an array with 1 element, in which case you could ditch the inner loop and replace it with something like $pic = $rsp[0]; $desc = $pic['description'];
Also, I'd create a new "description" column in your database table (one that has the photo id as the primary key), and store the description in there on its own. Parsing db fields like that is a bit of a nightmare. Lastly, you might want to tell htmlspecialchars explicitly to work in UTF-8; before PHP 5.4 that was not the default. The third parameter is the character encoding.
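To illustrate the encoding point, here is a small sketch of htmlspecialchars with the encoding passed explicitly. The description text is made up; in the script from the question it would be the value coming back from Flickr.

```php
<?php
// Made-up description containing an accented character, quotes and markup.
$description = "Caf\u{e9} & \"bar\" <b>live</b>";

// ENT_QUOTES escapes both quote styles; 'UTF-8' states the input encoding explicitly.
$escaped = htmlspecialchars($description, ENT_QUOTES, 'UTF-8');

echo $escaped, "\n";
```

The accented character passes through unchanged, while &, the double quotes and the angle brackets are turned into entities.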
edit: doesn't phpFlickr have its own caching system? Why not use that and make the cache size massive? Seems like you might be re-inventing the wheel here... maybe all you need to do is increase the cache size, and make a getDescription function:
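// The phpFlickr instance is passed in, because a variable defined outside
// the function is not visible inside it.
function getDescription ($phpFlickr, $id)
{
    $rsp = $phpFlickr->photos_getInfo ($id);
    // Assumes $rsp is an array with one element, as guessed above; check print_r($rsp).
    $pic = $rsp[0];
    return $pic['description'];
}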

Updating the XML file using PHP script

I'm making an interface-website to update a concert-list on a band-website.
The list is stored as an XML file and has this structure :
I already wrote a script that enables me to add a new gig to the list; this was relatively easy...
Now I want to write a script that enables me to edit a certain gig in the list.
Every gig is unique because of its first attribute : "id" .
I want to use this reference to edit the other attributes in that node.
My PHP is very poor, so I hope someone can get me off on the right foot here...
My PHP script :

Well, I don't know what your XML structure looks like, but assuming something like this:
<gig id="someid">
<venue></venue>
<day></day>
<month></month>
<year></year>
</gig>
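$xml = new SimpleXMLElement('gig.xml', 0, true);
// xpath() returns an array of matches, so take the first one.
// Validate $_POST['id'] before building the query from it.
$matches = $xml->xpath('//gig[@id="'.$_POST['id'].'"]');
$gig = $matches[0];
$gig->venue = $_POST['venue'];
$gig->month = $_POST['month'];
// etc..
$xml->asXML('gig.xml'); // save back to file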
If all these data points are attributes instead, you can use $gig->attributes()->venue to access them.
There is really no need for the loop unless you are doing multiple updates with one post; you can get at any specific record via an XPath query. SimpleXML is also a lot lighter and a lot easier to use for this type of thing than DOMDocument, especially as you aren't using any DOMDocument-specific features.
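Put together as something you can run without a form or a file, the idea might look like the sketch below. The gig list, its element names and update_gig are all assumptions for illustration (the real structure from the question isn't shown), and it works on an XML string; with a file you would load it as above and finish with asXML('gig.xml').

```php
<?php
// Made-up gig list; element names follow the structure guessed above.
$source = <<<XML
<gigs>
  <gig id="1"><venue>Old Venue</venue><day>12</day><month>05</month><year>2010</year></gig>
  <gig id="2"><venue>Other Venue</venue><day>20</day><month>06</month><year>2010</year></gig>
</gigs>
XML;

// Hypothetical helper: overwrite the given child elements of the gig with this id.
function update_gig(SimpleXMLElement $xml, string $id, array $fields): bool
{
    // Only digits are accepted here, so the id can go into the XPath safely.
    if (!ctype_digit($id)) {
        return false;
    }
    $matches = $xml->xpath('//gig[@id="' . $id . '"]');
    if (empty($matches)) {
        return false; // no gig with that id
    }
    $gig = $matches[0];
    foreach ($fields as $name => $value) {
        $gig->{$name} = $value; // replaces the text of the existing child element
    }
    return true;
}

$xml = new SimpleXMLElement($source);
$updated = update_gig($xml, '1', ['venue' => 'New Venue', 'month' => '07']);
echo $xml->asXML();
```

Restricting the id to digits is a choice made for this sketch; if your ids are not numeric, you would need a different check before putting them into the query.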

You'll want to load the XML file into a DOMDocument:
<?php
$xml = new DOMDocument();
$xml->load("xmlfile.xml");
//find the tags that you want to update
$tags = $xml->getElementsByTagName("GIG");
//find the tag with the id you want to update ($id comes from your form)
foreach ($tags as $tag) {
    if ($tag->getAttribute("id") == $id) { //found the tag, now update the attribute
        $tag->setAttribute("[attributeName]", "[attributeValue]");
    }
}
//save the xml; save() needs the target filename
$xml->save("xmlfile.xml");
?>
The code is untested, but it's the general idea.
