Please forgive what is most likely a stupid question. I've successfully managed to follow the simplehtmldom examples and get the data I want off one webpage.
I want to set the function up to go through all the HTML pages in a directory and extract the data. I've googled and googled, but now I'm confused: in my ignorance I thought I could (in some way) use PHP to build an array of the filenames in the directory, but I'm struggling with this.
Also, a lot of the examples I've seen use cURL. Can someone please tell me how it should be done? There are a significant number of files. I've tried concatenating them, but that only works when I do it through an HTML editor; using cat doesn't work.
You probably want to use glob('some/directory/*.html'); (manual page) to get a list of all the files as an array. Then iterate over that and use the DOM stuff for each filename.
You only need cURL if you're pulling the HTML from another web server; if the files are stored on your own server, glob() is what you want.
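Something along these lines would do it (just a sketch, assuming simple_html_dom.php is available, the saved pages live in a pages/ directory, and 'td.price' stands in for whatever selector you already use on a single page):

include 'simple_html_dom.php';

$results = array();

foreach (glob('pages/*.html') as $filename) {
    // file_get_html() is the Simple HTML DOM helper for local files or URLs
    $html = file_get_html($filename);
    if (!$html) {
        continue; // skip unreadable or empty files
    }

    // replace 'td.price' with the selector you already use for one page
    foreach ($html->find('td.price') as $element) {
        $results[$filename][] = trim($element->plaintext);
    }

    $html->clear(); // free memory before moving on to the next file
    unset($html);
}

print_r($results);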
Assuming the parser you're talking about is working OK, you could build a simple web spider: look at all the links in a page, build a list of "links to scan", and then scan each of those pages...
You should take care of circular references, though.
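In rough outline it could look like this (just a sketch, again assuming simple_html_dom.php; scrape_page() stands in for your own extraction code):

include 'simple_html_dom.php';

$queue   = array('http://example.com/start.html'); // pages still to scan
$visited = array();                                 // guards against circular references

while ($queue) {
    $url = array_shift($queue);
    if (isset($visited[$url])) {
        continue; // already scanned this page
    }
    $visited[$url] = true;

    $html = file_get_html($url);
    if (!$html) {
        continue;
    }

    scrape_page($html); // your existing extraction logic goes here

    // collect every link on the page and queue the ones we haven't seen
    // note: relative links would need to be resolved against $url first
    foreach ($html->find('a') as $a) {
        $href = $a->href;
        if ($href && !isset($visited[$href])) {
            $queue[] = $href;
        }
    }

    $html->clear();
}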
I need to extract the "Toner Cartridges" levels from This site and "send" them to the one I'm working on. I'm guessing I can use GET or something similar, but I'm new to this so I don't know how it could be done.
Then the information needs to be run through an if/else sequence which checks for 4 possible states: 100%->50%, 50%->25%, 25%->5%, 5%->0%.
I have the if/else written down, but I can't seem to find any good way of extracting the information from the index.php file.
EDIT: I just need someone to point me in the right direction.
To read the page you can use file_get_contents
$page = file_get_contents("http://example.com");
But in order to make the function work with URLs, allow_url_fopen must be set to true in your PHP configuration file (php.ini).
Then you can use a regular expression to filter the text and get data.
The PHP function for running a regular expression is preg_match.
Example:
$host = 'www.php.net';
preg_match('/[^.]+\.[^.]+$/', $host, $matches);
echo "domain name is: {$matches[0]}\n";
will output
domain name is: php.net
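Applied to your case, the idea would be something like the sketch below. The URL and the pattern are placeholders, since the actual markup of the printer's status page will determine the regular expression you need; the if/else mirrors the four ranges you listed.

// sketch only: the URL and the pattern are assumptions, adjust them to the
// real markup of the printer status page
$page = file_get_contents('http://printer.example.com/index.php');

// hypothetical pattern: grab the first percentage after the words "Toner Cartridges"
if (preg_match('/Toner Cartridges.*?(\d{1,3})\s*%/s', $page, $matches)) {
    $level = (int) $matches[1];

    if ($level > 50) {
        $status = '100%-50%';
    } elseif ($level > 25) {
        $status = '50%-25%';
    } elseif ($level > 5) {
        $status = '25%-5%';
    } else {
        $status = '5%-0%';
    }

    echo "Toner level: {$level}% ({$status})\n";
}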
I imagine you are reading from a printer status page. In which case, to give yourself the flexibility to use sessions and logins, I would look into cURL. The nice thing about cURL is that you can use the PHP library in code, but you can also test it at the command line rather quickly.
Once you are retrieving the HTML contents, I would look into using an XML parser, like SimpleXML or DOMDocument. Either one will get you to the information you need. SimpleXML is a little easier to use for people new to traversing XML (it is at once like, and very unlike, jQuery).
That said, you could also hack your way to the data just as quickly (if you are only just jumping in) with regular expressions (it really is that easy once you get the hang of them).
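A minimal fetch with cURL might look like this (the URL is a placeholder for your printer's status page; the parsing afterwards is where DOMDocument or SimpleXML come in):

// minimal cURL fetch; the URL is a placeholder
$ch = curl_init('http://printer.example.com/index.php');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
curl_close($ch);

// hand the markup to DOMDocument; loadHTML() copes with real-world, non-XML HTML
$doc = new DOMDocument();
libxml_use_internal_errors(true); // silence warnings about sloppy markup
$doc->loadHTML($html);
libxml_clear_errors();

// from here you can walk the DOM (or wrap it in DOMXPath) to reach the toner cell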
Best of luck!
I am trying to get data from a site and be able to manipulate it to display it on my own site.
The site contains a table with ticks and is updated every few hours.
Here's an example: http://www.astynomia.gr/traffic-athens.php
This data is there for everyone to use, and I will mention them on my own site just to be sure.
I've read something about PHP's cURL, but I have no idea if this is the way to go.
Any pointers/tutorials, or code anyone could provide so I can start somewhere would be very helpful.
Also any pointers on how I can get informed as soon as the site is updated?
If you want to crawl the page, use something like Simple HTML DOM Parser for PHP. That'll serve your purpose.
First, your web host/localhost should have the php_curl extension enabled.
To start with, you should read a bit here. If you want to jump in directly, there is a simple function here: Why I can't get website content using CURL. You just have to change the values of the variables $url and $timeout.
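Put together, the flow could look something like this sketch - the XPath query is only a guess and will need adjusting once you inspect the markup of traffic-athens.php:

// sketch of the overall flow: fetch with cURL, parse with DOMDocument/DOMXPath
$ch = curl_init('http://www.astynomia.gr/traffic-athens.php');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
$html = curl_exec($ch);
curl_close($ch);

$doc = new DOMDocument();
libxml_use_internal_errors(true); // the page is unlikely to be valid XHTML
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// grab every row of the first table; adjust the query once you inspect the page
foreach ($xpath->query('//table[1]//tr') as $row) {
    $cells = array();
    foreach ($row->getElementsByTagName('td') as $td) {
        $cells[] = trim($td->textContent);
    }
    if ($cells) {
        print_r($cells);
    }
}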
Lastly, to get the updated data every 2 hours you will have to run the script as a cron job. Please refer to this post:
PHP - good cronjob/crontab/cron tutorial or book
I am using a JavaScript JFlot graph to monitor a list of .rrd files I have on my server. I hard-coded all the .rrd file href links into the JavaScript code of the graph's drop-down menu. I then realized it was bad practice to hard-code them in.
Here comes the problem.
I want to write some code that, each time I open the graph, would check the server (or the page) for all the .rrd files, extract them as href links, and place them inside 'something' that the JavaScript drop-down menu for the graph could read.
From my research there are two solutions for this: 1. client-side, via jQuery and Ajax (I am not doing any cross-domain coding); 2. server-side, via PHP and JSON.
I figured server-side was the better option, on the understanding that it's easier and that the code won't need to download anything, even if that would only take a few seconds.
I just haven't been able to figure out the solution to this problem yet. I hope I have explained it well; any advice or code is welcome, including any good practices to abide by, and whether to go with option 1 (extracting the links from the page) or option 2 (collecting the data from the server).
Thank you for your time and consideration; I truly appreciate it.
As long as all the files are in one directory, you can generate the list pretty easily using PHP's glob function:
$files = glob($storage_dir . DIRECTORY_SEPARATOR . '*.rrd');
It returns an array where each entry is a file name. You can easily convert this to JSON and send it back using:
$response = json_encode($files);
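Put together as a small (hypothetical) list-rrd.php endpoint, where $storage_dir is an assumption you'd point at your own directory:

// hypothetical endpoint: returns the .rrd file names as JSON
// so the graph's drop-down menu can be filled on page load
$storage_dir = '/var/lib/rrd'; // assumption: adjust to where your .rrd files live

$files = glob($storage_dir . DIRECTORY_SEPARATOR . '*.rrd');
$names = array_map('basename', $files); // usually only the file names matter client-side

header('Content-Type: application/json');
echo json_encode($names);

On the JavaScript side, a single Ajax call to that script (for example with jQuery's $.getJSON()) would give you the array to populate the drop-down.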
I've been searching for some time to figure out how to extract data from a remote XML file and then create a post automatically with the parsed XML data. I have figured out the functions to create a post using cURL/PHP but I'm not sure how to pull data from an XML file, put that data into strings and then apply those strings to a newly created post. Also dupe protection would be nice.
If anybody knows a good starting point for me to learn from, or has written something already that could provide useful assistance, that would be great. Thanks, guys.
PHP has a wide variety of XML parsing functions. The most popular around here is the DOM. You can use DOM functions to find the specific XML tags you're interested in and retrieve their data.
Unfortunately you did not provide an example of the XML you're trying to work with, otherwise I'd post a brief example.
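Just to give you the general shape, here is a sketch against a made-up structure (an <items> element containing <item> elements with <title> and <description>); swap in your real element names:

// generic sketch against an invented structure; swap the tag names for yours
$xml = new DOMDocument();
$xml->load('http://example.com/feed.xml'); // the remote XML file

foreach ($xml->getElementsByTagName('item') as $item) {
    $title = $item->getElementsByTagName('title')->item(0)->nodeValue;
    $body  = $item->getElementsByTagName('description')->item(0)->nodeValue;

    // $title and $body are now plain strings you can feed into your
    // post-creation code
}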
If you have to stay with the XML format using PHP, you could use something like this here. If you can change the format to basic CSV text, you could try using the WordPress plugin here.
PHP also has a function for CSV files called fgetcsv, so I would say get the information you need from your file.
Pass it to a variable and then use wp_insert_post to create a post. Put it all in a while or foreach loop and it should work fine - or try the plugin first.
As for duplicate content, perhaps you could collect the information in an array and then use array_unique to remove any duplicates (just off the top of my head; there's probably a better way or function out there).
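A rough sketch of that loop (assuming a simple two-column CSV of title,content and that this runs inside WordPress, so wp_insert_post() exists):

// rough sketch: posts.csv is assumed to hold title,content pairs
$seen   = array();
$handle = fopen('posts.csv', 'r');

while (($row = fgetcsv($handle)) !== false) {
    list($title, $content) = $row;

    // crude duplicate protection: skip titles we have already inserted
    if (in_array($title, $seen)) {
        continue;
    }
    $seen[] = $title;

    wp_insert_post(array(
        'post_title'   => $title,
        'post_content' => $content,
        'post_status'  => 'publish',
    ));
}

fclose($handle);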
I've got this project where the client has lost their database, so I have to look at their current (live) site and retrieve the information... The problem is that there is too much data to copy and insert into the database by hand, which is taking a lot of time... Could you suggest some code which could help me?
You can use PHP's DOMDocument library and write automated scripts to retrieve the data, once you have identified which tags the information sits in on the page.
http://www.php.net/manual/en/book.dom.php
The library is very robust and supports XPath.
http://www.w3schools.com/xpath/xpath_examples.asp
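A small sketch of the idea (the URL and the XPath expression are placeholders; adjust the query once you know which tags surround your data):

// sketch: the URL and the XPath query are placeholders
$doc = new DOMDocument();
libxml_use_internal_errors(true);  // the live site is probably not valid XHTML
$doc->loadHTMLFile('http://example.com/products/page1.html');
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// e.g. every cell with class "product-name"
foreach ($xpath->query('//td[@class="product-name"]') as $node) {
    echo trim($node->textContent), "\n"; // or build an INSERT statement here
}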
If the pages are all very similar in structure, you could try using regular expressions or an HTML parser (Tidy) to filter out the relevant data.
I did a similar thing for a customer who had 200+ handwritten product pages with images, titles and text. The source seemed to have been copy-pasted from the last page, and had evolved into a few different flavors. It worked great after some tweaking.
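For the regular-expression route, something like the pattern below illustrates the idea - it assumes each product sits in a block like <h2>Title</h2> followed by an <img>, which you would adapt to the real markup:

// illustration only: the pattern assumes <h2>Title</h2> ... <img src="..."> blocks
$html = file_get_contents('product-page.html');

preg_match_all('#<h2>(.*?)</h2>.*?<img[^>]+src="([^"]+)"#s', $html, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    $title = trim(strip_tags($m[1]));
    $image = $m[2];
    // insert $title / $image into the database here
    echo "$title => $image\n";
}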