Extracting Data From Remote XML & Creating Wordpress Post - php

I've been searching for some time to figure out how to extract data from a remote XML file and then create a post automatically with the parsed XML data. I have figured out the functions to create a post using cURL/PHP but I'm not sure how to pull data from an XML file, put that data into strings and then apply those strings to a newly created post. Also dupe protection would be nice.
If anybody knows a good starting point for me to learn from, or has already written something that could help, that would be great. Thanks, guys.

PHP has a wide variety of XML parsing functions. The most popular around here is the DOM. You can use DOM functions to find the specific XML tags you're interested in and retrieve their data.
Unfortunately you did not provide an example of the XML you're trying to work with, otherwise I'd post an example tailored to it; the sketch below assumes a made-up structure instead.
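A minimal sketch, assuming a hypothetical feed shaped like <items><item><title>...</title><content>...</content></item></items>, run inside WordPress so wp_insert_post() and get_page_by_title() are available:

    <?php
    $doc = new DOMDocument();
    $doc->load('http://example.com/feed.xml'); // hypothetical URL; remote loading needs allow_url_fopen

    foreach ($doc->getElementsByTagName('item') as $item) {
        $title   = $item->getElementsByTagName('title')->item(0)->nodeValue;
        $content = $item->getElementsByTagName('content')->item(0)->nodeValue;

        // Crude dupe protection: skip titles that already exist as posts.
        if (get_page_by_title($title, OBJECT, 'post')) {
            continue;
        }

        wp_insert_post(array(
            'post_title'   => $title,
            'post_content' => $content,
            'post_status'  => 'publish',
        ));
    }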

If you have to stay with the XML format, you could use PHP with something like the approach here. If you can change the format to basic CSV text, you could try the WordPress plugin here.
Also, PHP has a function for CSV files called fgetcsv, so I would say: get the information you need from your file,
pass it to a variable, and then use wp_insert_post to create a post. Put it all in a while or foreach loop and it should work fine (or try the plugin first); see the sketch below.
As for duplicate content, perhaps you could pass the information into an array and then use array_unique to drop any duplicates (just off the top of my head; there's probably a better way or function out there).
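A minimal sketch of that loop, assuming a hypothetical posts.csv with two columns (title, content) and a WordPress context where wp_insert_post() is loaded; the seen-titles array is a rough stand-in for the array_unique idea:

    <?php
    $handle = fopen('posts.csv', 'r'); // hypothetical file name
    $seen   = array();                 // crude duplicate protection

    while (($row = fgetcsv($handle)) !== false) {
        list($title, $content) = $row;

        // Skip titles already inserted during this run.
        if (in_array($title, $seen)) {
            continue;
        }
        $seen[] = $title;

        wp_insert_post(array(
            'post_title'   => $title,
            'post_content' => $content,
            'post_status'  => 'publish',
        ));
    }

    fclose($handle);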

Related

extract data from site and put into a file

I've got this project where the client has lost their database, hence I have to look at their current (live) site and retrieve the information... The problem is that there is too much data to copy and insert into the database by hand, which is taking a lot of time... Could you suggest some code which could help me?
You can use the DOMDocument library for PHP and write automated scripts to retrieve the data, after identifying where your information sits in the page by its tags.
http://www.php.net/manual/en/book.dom.php
The library is very robust and supports XPath queries.
http://www.w3schools.com/xpath/xpath_examples.asp
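A minimal sketch of that approach, with a hypothetical URL and XPath query you'd adapt to where the data sits in your pages:

    <?php
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // real-world HTML is rarely valid XML
    $doc->loadHTML(file_get_contents('http://example.com/products.html'));
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);

    // e.g. grab every table cell with class "product-name"
    foreach ($xpath->query('//td[@class="product-name"]') as $node) {
        echo trim($node->textContent), "\n";
    }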
If the pages are all very similar in structure, you could try to use regular expressions or an HTML parser (Tidy) to filter out the relevant data.
I did a similar thing for a customer who had 200+ handwritten product pages with images, titles and text. The source seemed to have been copy-pasted from the previous page each time, and had evolved into a few different flavors. It worked great after some tweaking.

Anyone able to read ning's json export files using PHP

I have a client's JSON files that he got from the Ning exporter. I'm trying to load the data into PHP, but it seems like the JSON isn't properly formatted or something, so PHP is not able to parse it. I also used another PHP class to do it, but that did not work either. Below is the content of one of the files:
([{"id":"2492571:Note:75","contributorName":"16szgsc36qg2k","title":"Notes Home","description":"Welcome! To view all notes.","createdDate":"2008-11-14T08:44:58.821Z","updatedDate":"2008-11-14T08:44:58.821Z"}])
Help appreciated!
The parens at the beginning and end are not valid in JSON. It should parse after stripping those.
The JSON file from the Ning exporter is not properly formatted. For some reason some commas are missing, so you get a '}{' pattern instead of '},{', and the first and last characters are not valid.
You can write a small routine to pre-parse the file and fix those problems (and others that might appear), or you can take a look at the code of this WordPress plugin http://wordpress.org/extend/plugins/import-from-ning/ and copy the routine that fixes the JSON file. A sketch of such a routine is below.
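A minimal sketch of the pre-parsing idea, covering just the two problems mentioned above; the exact repairs needed will vary between export files, and the function and file names are hypothetical:

    <?php
    function fix_ning_json($raw) {
        $json = trim($raw);

        // Drop the invalid wrapping parentheses if present.
        if (substr($json, 0, 1) === '(' && substr($json, -1) === ')') {
            $json = substr($json, 1, -1);
        }

        // Re-insert the missing commas between adjacent objects.
        // (Naive: would also touch a literal '}{' inside a string value.)
        return str_replace('}{', '},{', $json);
    }

    $data = json_decode(fix_ning_json(file_get_contents('export.json')), true);
    if ($data === null) {
        die('Still invalid: ' . json_last_error_msg());
    }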
If you'd like to move your Ning data to another platform, you could consider Discourse. There is already an importer for it.
If you don't want to use Discourse, you can still use the (Ruby) importer source code to see how to parse the JSON file.

How can I save content from another website to my database?

I want to dynamically pull content from a soccer live-score website into my database.
I also want to do this daily, from a single page on that website (the soccer matches for that day).
If you can help me only with the connection and retrieval of data from that webpage, I will manage the rest.
website: http://soccerstand.com/
language: php/java - mysql
Thank you!
You can use php's file function to get the data. You just pass it a URL and it returns the content as an array of lines from the file. You can also use file_get_contents to get the content as one big string.
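A minimal sketch of both functions; note they only accept URLs when allow_url_fopen is enabled in php.ini:

    <?php
    $lines = file('http://soccerstand.com/');              // array of lines
    $page  = file_get_contents('http://soccerstand.com/'); // one big string

    echo count($lines), " lines, ", strlen($page), " bytes\n";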
Ethical questions about scraping other sites' data aside:
With PHP you can do an "open" call on a website as long as you're set up correctly. See this page for more details on that and examples: http://www.php.net/manual/en/wrappers.http.php
From there you have the content of the web page and it's a matter of breaking it up. Off the top of my head, I'd use regular expressions or an HTML parser to break apart the HTML, and then loop through the child elements and parse the data into your database calls to save the data.
There are a lot of resources for parsing HTML on the web and it's simply a matter of choosing the one that will work best for you.
Keep in mind you'll need to monitor the site for changes, because if they change elements, or their classes/ids you might need to change your parsing structure as well.
Using cURL you can get the content of the page, and then with a regex you can pull out what you want; for example:
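A minimal sketch, with a hypothetical pattern you'd adapt to the markup of the page you're scraping:

    <?php
    $ch = curl_init('http://soccerstand.com/');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);
    curl_close($ch);

    // e.g. pull out everything between <td class="score"> ... </td>
    if (preg_match_all('#<td class="score">(.*?)</td>#s', $html, $matches)) {
        print_r($matches[1]);
    }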
There is an easy way: http://www.jonasjohn.de/lab/htmlsql.htm

How to extract content from other websites automatically?

I want to extract specific data from a website, across its pages...
I don't want to get the entire contents of a given page; I only need some portion (maybe the data inside a table or a content_div), and I want to do it repeatedly across all the pages of the website.
How can I do that?
Use cURL to retrieve the content and XPath to select the individual elements.
Be aware of copyright though.
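A minimal sketch of that combination; the URL and the element id are hypothetical placeholders:

    <?php
    $ch = curl_init('http://example.com/page.html');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);
    curl_close($ch);

    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate messy real-world HTML
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $div = $xpath->query('//div[@id="content_div"]')->item(0);

    echo $div ? $doc->saveHTML($div) : "not found\n";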
"extracting content from other websites" is called screen scraping or web scraping.
Simple HTML DOM Parser is the easiest way (that I know of) of doing it.
You need a PHP crawler. The key is to use string manipulation functions such as strstr, strpos and substr, along the lines of the sketch below.
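A minimal sketch of the string-function approach: cut out whatever sits between two known markers (the markers, URL and helper name are hypothetical placeholders):

    <?php
    function extract_between($haystack, $start, $end) {
        $from = strpos($haystack, $start);
        if ($from === false) {
            return null;
        }
        $from += strlen($start);

        $to = strpos($haystack, $end, $from);
        return $to === false ? null : substr($haystack, $from, $to - $from);
    }

    $html  = file_get_contents('http://example.com/page.html');
    $table = extract_between($html, '<table id="scores">', '</table>');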
There are ways to do this. Just for fun I created a Windows app that went through my account on a well-known social network, looked in the right places and logged the information into an XML file. This information would then be imported elsewhere. However, this sort of application can be used for motives I don't agree with, so I never uploaded it.
I would recommend using RSS feeds to extract content.
I think you need to implement something like a spider. You can make an XMLHTTP request, get the content, and then parse it.

Simplehtmldom - curl, loops, arrays?

Please forgive what is most likely a stupid question. I've successfully managed to follow the simplehtmldom examples and get data that I want off one webpage.
I want to be able to set the function to go through all the HTML pages in a directory and extract the data. I've googled and googled, but now I'm confused, as in my ignorance I'd thought I could (in some way) use PHP to form an array of the filenames in the directory, but I'm struggling with this.
Also, it seems that a lot of the examples I've seen use cURL. Please can someone tell me how it should be done? There are a significant number of files. I've tried concatenating them, but this only works through an HTML editor; using cat doesn't work.
You probably want to use glob('some/directory/*.html'); (manual page) to get a list of all the files as an array. Then iterate over that and use the DOM stuff for each filename.
You only need curl if you're pulling the HTML from another web server, if these are stored on your web server you want glob().
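A minimal sketch of that: glob() builds the filename array, and each file goes through Simple HTML DOM (assumes simple_html_dom.php is available; the title extraction is just a placeholder for your own logic):

    <?php
    include 'simple_html_dom.php';

    foreach (glob('some/directory/*.html') as $filename) {
        $html = file_get_html($filename); // works on local files too

        // e.g. pull the page title out of each file
        $title = $html->find('title', 0);
        echo $filename, ': ', $title ? $title->plaintext : '(no title)', "\n";

        $html->clear(); // free memory between files
    }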
Assuming the parser you're talking about is working OK, you should build a simple web spider: look at all the links in a page, build a list of "links to scan", and then scan each of those pages...
You should take care of circular references though.
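A minimal sketch of that idea, with a queue of links to scan and a visited set as the circular-reference guard (start URL hypothetical; it does no URL resolution or domain filtering):

    <?php
    include 'simple_html_dom.php';

    $queue   = array('http://example.com/');
    $visited = array();

    while ($queue) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue; // circular-reference guard
        }
        $visited[$url] = true;

        $html = file_get_html($url);
        if (!$html) {
            continue;
        }

        // ... extract whatever data you need from $html here ...

        foreach ($html->find('a') as $a) {
            $queue[] = $a->href; // naive: assumes absolute links
        }
        $html->clear();
    }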
