PHP scrape data from website

PHP scrape data from website - php

I am new to programming. So I choose to build a webpage by using Wordpress. But I am trying to gather weather data from other sites, I could not find a fitting plugin for scraping the data, and decided to give it a try and put something together myself.
But with my limited understanding of programming is giving me issues. With a little inspirations from the web I have put this together:
$html = file_get_contents('http://www.frederikshavnhavn.dk/scripts/weatherwindow.php?langid=2'); //get the html returned from the following url
$poke_doc = new DOMDocument();
libxml_use_internal_errors(false); //disable libxml errors
if(!empty($html)){ //if any html is actually returned
$poke_doc->loadHTML($html);
libxml_clear_errors(); //remove errors for yucky html
$poke_xpath = new DOMXPath($poke_doc);
//get all the spans's with an id
$poke_type = $poke_xpath->query("//span[#class='weathstattype']");
$poke_text = $poke_xpath->query("//span[#class='weathstattext']");
foreach($poke_text as $text){
foreach($poke_type as $type){
echo $type->nodeValue;
echo $text->nodeValue . "</br>";
continue 2;
}
break;
}
}
Being that this is all new to me, and I am really trying to get this to work for me, hoping for a better understanding of the code behind the works.
What I am trying to achieve is a formatted list with the data.
1. value $type $text
2. value $type $text
Right now it is giving me a lot of trouble.
when I use the continue 2 it does not change the value $type, but when I just use continue statement it changes $type but not $text. How can I make it change both values each time?
Thanks for your help.

try adding this method:
function get_inner_html( $node ) {
$innerHTML= '';
$children = $node->childNodes;
foreach ($children as $child) {
$innerHTML .= $child->ownerDocument->saveXML( $child );
}
return $innerHTML;
}
then replace the foreach with this:
foreach($poke_text as $text){
//echo $type ->nodeValue . "</n>";
echo get_inner_html($text ).'<br>';
}
foreach($poke_type as $type){
//echo $text ->nodeValue;
echo get_inner_html($type ).'<br>';
}
produces this:
197Â° (Syd)
5.7 Â°C Stigende
4.8 m/s Stigende
5.4 m/s Stigende
-6 cm Faldende 1004 hPa Vindretning Lufttemperatur Middel vindhastighed Max vindhastighed Vandstand Lufttryk

Buddy in your code your foreach loops (in last) you use $type as $text and $text as $type.. I run the code and just change the variables as they should be its working fine..
$html = file_get_contents('http://www.frederikshavnhavn.dk/scripts/weatherwindow.php?langid=2'); //get the html returned from the following url
$poke_doc = new DOMDocument();
libxml_use_internal_errors(false); //disable libxml errors
if(!empty($html)){ //if any html is actually returned
$poke_doc->loadHTML($html);
libxml_clear_errors(); //remove errors for yucky html
$poke_xpath = new DOMXPath($poke_doc);
//get all the spans's with an id
$poke_type = $poke_xpath->query("//span[#class='weathstattype']");
$poke_text = $poke_xpath->query("//span[#class='weathstattext']");
foreach($poke_text as $text){
echo $text->nodeValue;
}
foreach($poke_type as $type){
echo $type->nodeValue;
}
}
And this the out that I got from your code (by changing the variables in loop)
196Â° (Syd) 5.6 Â°C 4.1 m/s 5 m/s -6 cm 1004 hPa Vindretning Lufttemperatur Middel vindhastighed Max vindhastighed Vandstand Lufttryk
Now You have your data I think you can manage how to sort them out...

Related

Get text from script output

everyone, I've been using this code for quite a long time
<?php
$url = 'http://www.smn.gov.ar/mensajes/index.php?observacion=metar&operacion=consultar&87582=on&87641=on&87750=on&87765=on&87222=on&87761=on&87860=on&87395=on&87344=on&87166=on&87904=on&87571=on&87347=on&87803=on&87576=on&87162=on&87532=on&87497=on&87097=on&87046=on&87548=on&87217=on&87506=on&87692=on&87418=on&87574=on&87715=on&87374=on&87289=on&87852=on&87178=on&87896=on&87823=on&87270=on&87155=on&87453=on&87925=on&87934=on&87480=on&87047=on&87553=on&87311=on&87909=on&87436=on&87509=on&87912=on&87623=on&87444=on&87129=on&87371=on&87645=on&87022=on&87127=on&87828=on&87121=on&87938=on&87791=on&87448=on';
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTMLFile($url);
libxml_clear_errors();
$xpath = new DOMXpath($dom);
// search for td's containing METAR
$metars = $xpath->query('//td[contains(text(), "METAR SA")]');
if($metars->length <= 0) {
echo 'no metars found';
exit;
}
$data = array();
foreach($metars as $metar) {
$data[] = $metar->nodeValue;
}
echo '<pre>';
print_r($data);
Well, this was working fine, until the program in charge to read the output was updated and now it needs a clear output.
At the momment I'm getting this:
http://ar.ivao.aero/weather/metar.php
But the program needs it like this:
SABE 161600Z 02006KT 9999 FEW030 24/18 Q1009 =
SAZA 161600Z 18011KT CAVOK 24/08 Q1010 =
SAZB 161700Z 27012KT CAVOK 21/09 Q1011 =
I don't thought maybe using another script like a file_get_content() could be useful but again its going to show the infromation I don't want to.
I also tried replacing print_r() by var_dump() but its the same
Any ideas?
There is anyway to get this informatin in a simple txt file?
Regards,

You need to filter out some data. Try to find out what's common in the info you need to output. For instance, all the required info from your raw print_r data seems to beging with METAR. So
echo '<pre>';
foreach($metars as $metar) {
if(substr($metar->nodeValue, 0, 5) === "METAR") {
echo str_replace("METAR ", "", $metar->nodeValue) . PHP_EOL;
}
}
That removes any lines like Aeropuerto FORMOSA from the output.

Retrieve data from html page using xpath and php

I know there are similar question, but, trying to study PHP I met this error and I want understand why this occurs.
<?php
$url = 'http://aice.anie.it/quotazione-lme-rame/';
echo "hello!\r\n";
$html = new DOMDocument();
#$html->loadHTML($url);
$xpath = new DOMXPath($html);
$nodelist = $xpath->query(".//*[#id='table33']/tbody/tr[2]/td[3]/b");
foreach ($nodelist as $n) {
echo $n->nodeValue . "\n";
}
?>
this prints just "hello!". I want to print the value extracted with the xpath, but the last echo doesn't do anything.

You have some errors in your code :
You try to get the table from the url http://aice.anie.it/quotazione-lme-rame/, but it's actually in an iframe located at http://www.aiceweb.it/it/frame_rame.asp, so get the iframe url directly.
You use the function loadHTML(), which load an HTML string. What you need is the loadHTMLFile function, which takes the link of an HTML document as a parameter (See http://www.php.net/manual/fr/domdocument.loadhtmlfile.php)
You assume there is a tbody element on the page but there is no one. So remove that from your query filter.
Working code :
$url = 'http://www.aiceweb.it/it/frame_rame.asp';
echo "hello!\r\n";
$html = new DOMDocument();
#$html->loadHTMLFile($url);
$xpath = new DOMXPath($html);
$nodelist = $xpath->query(".//*[#id='table33']/tr[2]/td[3]/b");
foreach ($nodelist as $n) {
echo $n->nodeValue . "\n";
}

HTML Table parsing using DOM

I Have a HTML Table
My Parsing Code is
$src = new DOMDocument('1.0', 'utf-8');
$src->formatOutput = true;
$src->preserveWhiteSpace = false;
#$src->loadHTML($result);
$xpath = new DOMXPath($src);
$data=$xpath->query('//td[ contains (#class, "bodytext1") ]');
foreach($data as $datas)
{
echo $datas->nodeValue."<br />";
}
$values=$xpath->query('//tr[ contains (#bgcolor, "f3fafe") ]');
foreach($values as $value)
{
echo $value->nodeValue."<br />";
}
$values1=$xpath->query('//tr[ contains (#bgcolor, "def0fa") ]');
foreach($values1 as $value1)
{
echo $value1->nodeValue."<br />";
}
to be printed, and I want them to be repeated along with other lines as shown above in output i need.
and I want this whole thing in a array so that i can insert it in the database
Can anyone please guide me or give me any hint so that I can do this

This should get you started.
$src = new DOMDocument('1.0', 'utf-8');
$src->formatOutput = true;
$src->preserveWhiteSpace = false;
$src->loadHTML($result);
$xpath = new DOMXPath($src);
// get header data
$data=$xpath->query('//table[1]//td');
$htno = trim(explode(":",$data->item(0)->nodeValue)[1]);
$name = trim(explode(":",$data->item(1)->nodeValue)[1]);
$fatherName=trim(explode(":",$data->item(2)->nodeValue)[1]);
// rows from 2nd table
$values1=$xpath->query('//table[2]//tr');
$header = true; // flag to track whether we've read the header row.
foreach($values1 as $value1)
{
if (!$header) {
$rowdata = str_replace("\r\n"," ",$value1->nodeValue);
echo $htno," ",$name," ",$fatherName," ",$rowdata,"\n";
}
$header = false;
}
Note:
The $header flag is a quick fix. A better Xpath query might eliminate the need for it.
the str_replace near the bottom is ugly but expedient. You might want to play with the xpath query to see if you can improve it.
Output is not formatted for HTML - lines are delimited by \n
I got a warning on one line where it contained &, so I changed it to AND. You might have to preprocess your tables to eliminate those somehow.

you could use third party's dll,such as "Html Agility Pack". a tool which is professional to convert html into xml.

change variable with GET method

I have a page test.php in which I have a list of names:
name1: 992345
name2: 332345
name3: 558645
name4: 434544
In another page test1.php?id=name2 and the result should be:
332345
I've tried this PHP code:
<?php
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTMLFile("/test.php");
$xpath = new DOMXpath($doc);
$elements = $xpath->query("//*#".$_GET["id"]."");
if (!is_null($elements)) {
foreach ($elements as $element) {
$nodes = $element->childNodes;
foreach ($nodes as $node) {
echo $node->nodeValue. "\n";
}
}
}
?>
I need to be able to change the name with GET PHP method in test1.pdp?id=name4
The result should be different now.
434544
is there another way, becose mine won't work?

Here is another way to do it.
<?php
libxml_use_internal_errors(true);
/* file function reads your text file into an array. */
$doc = file("test.php");
$id = $_GET["id"];
/* Show your array. You can remove this part after you
* are sure your text file is read correct.*/
echo "Seeking id: $id<br>";
echo "Elements:<pre>";
print_r($doc);
echo "</pre>";
/* this part is searching for the get variable. */
if (!is_null($doc)) {
foreach ($doc as $line) {
if(strpos($line,$id) !== false){
$search = $id.": ";
$replace = '';
echo str_replace($search, $replace, $line);
}
}
} else {
echo "No elements.";
}
?>

There is a completely different way to do this, using PHP combined with JavaScript (not sure if that's what you're after and if it can work with your app, but I'm going to write it). You can change your test.php to read the GET parameter (it can be POST as well, you'll see), and according to that, output only the desired value, probably from the associative array you have hard-coded in there. The JavaScript approach will be different and it would involve making a single AJAX call instead of DOM traversing using PHP.
So, in short: AJAX call to test.php, which then output the desired value based on the GET or POST parameter.
jQuery AJAX here; native JS tutorial here.
Just let me know if this won't work for your app, and I'll delete my answer.

Regex Replacement Dependent On Class

I have the following code that replaces all tags on a page and adds the nCode image resizer to it. The code is as follows:
function ncode_the_content($content) {
return preg_replace("/<img([^`|>]*)>/im", "<img onload=\"NcodeImageResizer.createOn(this);\"$1>", $content); }
}
What I need to do is make it so that if an image has the class of "noresize" it doesn't do the preg_match.
I have only managed to get it so that if there is the "noresize" class anywhere on the page it stops resizing all images instead of just the one with the correct class.
Any suggestions?
UPDATE:
Am I even remotely in the right ballpark with this?
function ncode_the_content($content) {
//Load the HTML page
$html = file_get_contents($content);
//Parse it. Here we use loadHTML as a static method
//to parse the HTML and create the DOM object in one go.
#$dom = DOMDocument::loadHTML($html);
//Init the XPath object
$xpath = new DOMXpath($dom);
//Query the DOM
$linksnoresize = $xpath->query( 'img[#class = "noresize"]' );
$links = $xpath->query( 'img[]' );
//Display the results as in the previous example
foreach($links as $link){
echo $link->getAttribute('onload'), 'NcodeImageResizer.createOn(this);';
}
foreach($linksnoresize as $link){
echo $link->getAttribute('onload'), '';
}
}

Here's some untested code:
$dom = DOMDocument::loadHTML($content);
$images = $dom->getElementsByTagName("img");
foreach ($images as $image) {
if (!strstr($image->getAttribute("class"), "noresize")) {
$image->setAttribute("onload", "NcodeImageResizer.createOn(this);");
}
}
But, if it were me, I would eschew any such inline event handler and instead just find the appropriate elements with Javascript.

I ended up just using pure CSS and adding a around the images I didn't want to be resized. Forced the width and height of that div back to auto and then removed the warning message that was displayed above them. Seems to work fine. Thanks for your help :)

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

PHP scrape data from website - php

Related

Get text from script output

Retrieve data from html page using xpath and php

HTML Table parsing using DOM

change variable with GET method

Regex Replacement Dependent On Class

Categories

Resources