preg_match_all to show multiple results

I'm trying to get these messages (METAR) from a page and show all of them in another PHP file, without the styles and extra info.
At the moment I'm using this code:
<?php
$options = array('http' => array(
'method' => 'GET',
));
$config= stream_context_create($options);
$config_final=file_get_contents('http://www.smn.gov.ar/mensajes/index.php?observacion=metar&operacion=consultar&87582=on&87641=on&87750=on&87765=on&87222=on&87761=on&87860=on&87395=on&87344=on&87166=on&87904=on&87571=on&87347=on&87803=on&87576=on&87162=on&87532=on&87497=on&87097=on&87046=on&87548=on&87217=on&87506=on&87692=on&87418=on&87574=on&87715=on&87374=on&87289=on&87852=on&87178=on&87896=on&87823=on&87270=on&87155=on&87453=on&87925=on&87934=on&87480=on&87047=on&87553=on&87311=on&87909=on&87436=on&87509=on&87912=on&87623=on&87444=on&87129=on&87371=on&87645=on&87022=on&87127=on&87828=on&87121=on&87938=on&87791=on&87448=on',false, $config);
preg_match_all("|<td width=\"100%\">METAR (.*)</td>|sU", $config_final, $tiempo);
echo $tiempo[1][0];
?>
With that code I only get the first METAR. What I need is to see all of them on separate lines, one result per line.
Any ideas?
Thanks in advance.

I suggest you use PHP's built-in HTML parser for this, in particular DOMDocument, and use DOMXPath to search for the cells you need.
Example:
$url = 'http://www.smn.gov.ar/mensajes/index.php?observacion=metar&operacion=consultar&87582=on&87641=on&87750=on&87765=on&87222=on&87761=on&87860=on&87395=on&87344=on&87166=on&87904=on&87571=on&87347=on&87803=on&87576=on&87162=on&87532=on&87497=on&87097=on&87046=on&87548=on&87217=on&87506=on&87692=on&87418=on&87574=on&87715=on&87374=on&87289=on&87852=on&87178=on&87896=on&87823=on&87270=on&87155=on&87453=on&87925=on&87934=on&87480=on&87047=on&87553=on&87311=on&87909=on&87436=on&87509=on&87912=on&87623=on&87444=on&87129=on&87371=on&87645=on&87022=on&87127=on&87828=on&87121=on&87938=on&87791=on&87448=on';
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTMLFile($url);
libxml_clear_errors();
$xpath = new DOMXpath($dom);
// search for td elements whose text contains "METAR"
$metars = $xpath->query('//td[contains(text(), "METAR")]');
if($metars->length <= 0) {
echo 'no metars found';
exit;
}
$data = array();
foreach($metars as $metar) {
$data[] = $metar->nodeValue;
}
echo '<pre>';
print_r($data);
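
If you would rather keep the regular expression from the question, note that only one METAR shows up because the code echoes just `$tiempo[1][0]`; `preg_match_all` has already collected every match in `$tiempo[1]`. A minimal sketch, assuming the cells look like the hypothetical `$html` sample below (the METAR strings are made up for illustration and stand in for the downloaded page):

```php
<?php
// Hypothetical sample standing in for the page fetched with file_get_contents().
$html = '<table>'
      . '<tr><td width="100%">METAR SABE 121200Z 09005KT 9999 FEW030 22/15 Q1015</td></tr>'
      . '<tr><td width="100%">METAR SAEZ 121200Z 10008KT CAVOK 23/14 Q1014</td></tr>'
      . '</table>';

// Same pattern as in the question: U makes (.*) lazy, s lets . match newlines.
preg_match_all('|<td width="100%">METAR (.*)</td>|sU', $html, $tiempo);

// $tiempo[1] holds the first capture group of every match, not only the first match.
$lines = array_map('trim', $tiempo[1]);

// One METAR per line; join with "<br>" instead of PHP_EOL when the output is HTML.
echo implode(PHP_EOL, $lines), PHP_EOL;
```

The same loop works on `$data` from the DOMXPath example, since both end up as a flat array of strings.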

Related

Webscraping with Goutte and Guzzle

I have the following method from my controller that gets the data from the site:
$goutteClient = new Client();
$guzzleClient = new GuzzleClient([
'timeout' => 60,
]);
$goutteClient->setClient($guzzleClient);
$crawler = $goutteClient->request('GET', 'https://html.duckduckgo.com/html/?q=Laravel');
$crawler->filter('.result__title .result__a')->each(function ($node) {
dump($node->text());
});
The code above gives me the titles of the search results. I also want the link of each corresponding result, which resides in the class result__extras__url.
How do I get the link and the title in one pass? Or do I have to run another filter for that?

Try to inspect the attributes of the nodes. Once you get the href attribute, parse it to get the URL.
$crawler->filter('.result__title .result__a')->each(function ($node) {
$parts = parse_url(urldecode($node->attr('href')));
parse_str($parts['query'], $params);
$url = $params['uddg']; // DDG links go through a redirect; the actual URL is in the "uddg" query parameter.
$title = $node->text();
});
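
To see what the `parse_url` / `parse_str` step does without running the crawler, here is a small standalone sketch. The `$href` value and the helper name `extractTargetUrl` are assumptions for illustration; the href only imitates the redirect-link shape described above and was not captured from a live page.

```php
<?php
// Hypothetical href in the shape of a DuckDuckGo redirect link (assumed for illustration).
$href = '//duckduckgo.com/l/?uddg=https%3A%2F%2Flaravel.com%2Fdocs&rut=abc123';

// Hypothetical helper: returns the real target URL, or null when there is none.
function extractTargetUrl(string $href): ?string
{
    // Pull out only the query string; null or false means there is nothing to parse.
    $query = parse_url($href, PHP_URL_QUERY);
    if (!is_string($query)) {
        return null;
    }

    // parse_str() URL-decodes the values while splitting the query into an array.
    parse_str($query, $params);

    return isset($params['uddg']) && is_string($params['uddg']) ? $params['uddg'] : null;
}

$url = extractTargetUrl($href);
echo $url ?? 'no target URL found', PHP_EOL;
```

Returning null instead of indexing `$params['uddg']` directly avoids a warning when a result link does not go through the redirect.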

For parsing, I usually do the following:
$doc = new DOMDocument();
// The DomCrawler Crawler has no getBody() method; html() returns the markup of its first node.
$doc->loadHTML($crawler->html());
From then on, you can use the getElementsByTagName functions on your DOMDocument.
For example:
$rows = $doc->getElementsByTagName('tr');
foreach ($rows as $row) {
$cols = $row->getElementsByTagName('td');
$value = trim($cols->item(0)->nodeValue);
}
You can find more information in
https://www.php.net/manual/en/class.domdocument.php
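
A self-contained sketch of that row loop, run against a hypothetical in-memory table rather than a crawled page. It also skips rows without any td cell (such as a header row made of th cells), where `item(0)` would otherwise return null:

```php
<?php
// Hypothetical markup used only for illustration.
$html = '<table><tr><th>Name</th><th>Price</th></tr>'
      . '<tr><td> Widget </td><td>10</td></tr>'
      . '<tr><td>Gadget</td><td>25</td></tr></table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

$values = [];
foreach ($doc->getElementsByTagName('tr') as $row) {
    $cols = $row->getElementsByTagName('td');
    if ($cols->length === 0) {
        continue; // no td cells in this row, e.g. a header row
    }
    // First cell of the row, with surrounding whitespace removed.
    $values[] = trim($cols->item(0)->nodeValue);
}

print_r($values);
```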

PHP extract first occurrence of link in source code

I'm trying to extract the first occurrence of a link that starts like this
https://encrypted-tbn3.gstatic.com/images?...
from the source code of a page. The link starts and ends with a ". Here is what I've got so far:
$search_query = $array[0]['Name'];
$search_query = urlencode($search_query);
$context = stream_context_create(array('http' => array('header' => 'User-Agent: Mozilla compatible')));
$response = file_get_contents( "https://www.google.com/search?q=$search_query&tbm=isch", false, $context);
$html = str_get_html($response);
$url = explode('"',strstr($html, 'https://encrypted-tbn3.gstatic.com/images?'[0]))
However the output of $url is not the link I try to extract, but something very different. I have added an image.
Could anyone explain the output to me and how I would get the desired link? Thanks

It seems that you're using PHP Simple HTML DOM Parser.
I normally use DOMDocument, which is one of PHP's built-in classes.
Here's a working example of what you need:
$search_query = $array[0]['Name'];
$search_query = urlencode($search_query);
$context = stream_context_create(array('http' => array('header' => 'User-Agent: Mozilla compatible')));
$response = file_get_contents( "https://www.google.com/search?q=$search_query&tbm=isch", false, $context);
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($response);
foreach ($dom->getElementsByTagName('img') as $item) {
$img_src = $item->getAttribute('src');
if (strpos($img_src, 'https://encrypted') !== false) {
print $img_src."\n";
}
}
Output:
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSumjp6e37O_86nc36mlktuWpbFuCI4nkkkocoBCYW3qCOicqdu_KEK-MY
https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcR_ttK8NlBgui_JndBj349UxZx0kHn0Z-Essswci-_5UQCmUOruY1PNl3M
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSydaTpSDw2mvU2JRBGEYUOstTUl4R1VhRevv1Sdinf0fxRvU26l3pTuqo
...

$url_beginning = 'https://encrypted-tbn3.gstatic.com/images?';
if(preg_match('/\"(https\:\/\/encrypted\-tbn3\.gstatic\.com\/images\?.+?)\"/ui',$html, $matches))
$url = $matches[1];
else
$url = '';
Try preg_match; a regular expression is better suited to this than explode and strstr.
This example assumes that the URL in your HTML is enclosed in double quotes.
Update:
A slightly tuned version that works for any URL beginning:
$url_beginning = 'https://encrypted-tbn3.gstatic.com/images?';
$url_beginning = preg_replace('/([^а-яА-Яa-zA-Z0-9_#%\s])/ui', '\\\\$1', $url_beginning);
if(preg_match('/\"('.$url_beginning.'.+?)\"/ui',$html, $matches))
$url = $matches[1];
else
$url = '';
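
As an alternative to hand-escaping the prefix with a character class, PHP's `preg_quote` does the escaping for you. The sketch below swaps that in; the `$html` fragment and the `EXAMPLE1` / `EXAMPLE2` image ids are placeholders, not real output from the search page.

```php
<?php
// Hypothetical page fragment used only for illustration.
$html = '<img src="https://encrypted-tbn3.gstatic.com/images?q=tbn:EXAMPLE1">'
      . '<img src="https://encrypted-tbn3.gstatic.com/images?q=tbn:EXAMPLE2">';

$urlBeginning = 'https://encrypted-tbn3.gstatic.com/images?';

// preg_quote() escapes regex metacharacters; the second argument also escapes the / delimiter.
// [^"]* then consumes everything up to the closing double quote.
$pattern = '/"(' . preg_quote($urlBeginning, '/') . '[^"]*)"/i';

// preg_match() stops at the first match, which is the first occurrence in the source.
$url = preg_match($pattern, $html, $matches) ? $matches[1] : '';

echo $url, PHP_EOL;
```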

PHP creating multiple DOMDocuments in a loop issue

I have a list of items to be appended to a base URL, and I am trying to retrieve the HTML from each of the generated URLs in a loop. However, I am encountering an error and have been struggling to fix it.
Current code:
($items is just an array of strings)
$output = "";
foreach($items as $item) {
$url = $baseUrl . $item;
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTMLFile($url);
$output = $output . json_encode($dom->saveHTML());
}
echo $output;
Can anyone tell me why I can't load multiple HTML documents like this?
Annoyingly, I'm not getting any PHP error logs, and the AJAX XHR response text is not providing any useful info: it just returns a section of the first HTML page loaded as the 'error'. It seems to load the first item in the array and then fail.

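You were almost there. Append with .= and give json_encode() a flag that tolerates invalid UTF-8. Note that JSON_ERROR_UTF8 is an error code reported by json_last_error(), not an option flag; JSON_INVALID_UTF8_SUBSTITUTE (PHP 7.2+) is the option that substitutes invalid byte sequences instead of failing:
$output = "";
libxml_use_internal_errors(true);
foreach($items as $item) {
$url = $baseUrl . $item;
$dom = new DOMDocument();
$dom->loadHTMLFile($url);
// Replace invalid UTF-8 bytes rather than letting json_encode() return false
$output .= json_encode($dom->saveHTML(), JSON_INVALID_UTF8_SUBSTITUTE);
}
echo $output;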
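
One thing to keep in mind with this loop: concatenating several `json_encode()` results gives a string that is not itself valid JSON, which may be why the client side reports an error. A hedged sketch of one way around that is to collect the pages in an array and encode once. The `$pages` array is hypothetical and stands in for what `loadHTMLFile()` would fetch, and `JSON_INVALID_UTF8_SUBSTITUTE` assumes PHP 7.2 or later.

```php
<?php
// Hypothetical pages standing in for the documents fetched from $baseUrl . $item.
$pages = [
    'first'  => '<html><body><p>One</p></body></html>',
    'second' => '<html><body><p>Two</p></body></html>',
];

libxml_use_internal_errors(true); // collect parser warnings instead of printing them

$output = [];
foreach ($pages as $item => $html) {
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $output[$item] = $dom->saveHTML(); // one entry per item, keyed by item name
}
libxml_clear_errors();

// Encode the whole array once so the response is a single valid JSON object.
$json = json_encode($output, JSON_INVALID_UTF8_SUBSTITUTE);
if ($json === false) {
    // Surface the reason instead of sending an empty response.
    $json = json_encode(['error' => json_last_error_msg()]);
}

echo $json, PHP_EOL;
```

On the client, a single `JSON.parse` of the response then yields an object with one HTML string per item.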

Get source of group of links and save each one to certain file

I am trying to get the source of a group of links and save the source of each link to a separate file.
$urls = array(
url1 => 'http://site1.com'
url2 => 'http://site2.com'
url3 => 'http://site3.com'
);
}
$files = array(
file1 => 'file1.html'
file2 => 'file2.html'
file3 => 'file3.html'
);
foreach ($urls as $url) {
ob_start();
$html = file_get_contents($url);
$doc = new DOMDocument(); // create DOMDocument
libxml_use_internal_errors(true);
$doc->loadHTML($html); // load HTML you can add $html
echo $doc->saveHTML();
$page = ob_get_contents();
ob_end_flush();
}
foreach ($files as $file) {
$fp = fopen("$file","w");
fwrite($fp,$page);
fclose($fp);
}
At this point I am stuck and it's not working.

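You need to read the URLs and write the files in the same loop. Note that $files[$i] only works when both arrays use the same keys, for example two plain lists; with the url1/file1 keys from the question the lookup would not find a match.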
foreach ($urls as $i => $url) {
file_put_contents($files[$i], file_get_contents($url));
}
There's no need to use DOMDocument, unless you really need to parse the HTML instead of just saving the source. And there's definitely no reason to use the ob_XXX functions, just assign the results directly to variables or pass them to functions.
And as design advice, when you have related data like the URLs and filenames, don't put them in separate arrays. Put them in a single, 2-dimensional array:
$data = array(array('url' => 'http://site1.com',
'file' => 'file1.html'),
array('url' => 'http://site2.com',
'file' => 'file2.html'),
array('url' => 'http://site3.com',
'file' => 'file3.html'));
This keeps all the related items close together, and you never have to worry about the two arrays getting out of sync as you update things.
Then you can loop over this one array:
foreach ($data as $datum) {
$html = file_get_contents($datum['url']);
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
// Do stuff with $doc
$page = $doc->saveHTML($doc);
file_put_contents($datum['file'], $page);
}

PHP DOMDocument how to get element?

I am trying to read a website's content, but I have a problem: I want to get images and links as the elements themselves, not just the element content. That is, I want the entire element.
How can I do this?
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.link.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($ch);
$dom = new DOMDocument;
@$dom->loadHTML($output); // @ suppresses parser warnings; a leading # would comment the call out
$items = $dom->getElementsByTagName('a');
for($i = 0; $i < $items->length; $i++) {
echo $items->item($i)->nodeValue . "<br />";
}
curl_close($ch);
?>

You appear to be asking for the serialized HTML of a DOMElement? E.g. you want a string containing the whole anchor markup, not only the link text? (Please make your question clearer.)
$url = 'http://example.com';
$dom = new DOMDocument();
$dom->loadHTMLFile($url);
$anchors = $dom->getElementsByTagName('a');
foreach ($anchors as $a) {
// Best solution, but only works with PHP >= 5.3.6
$htmlstring = $dom->saveHTML($a);
// Otherwise you need to serialize to XML and then fix the self-closing elements
$htmlstring = saveHTMLFragment($a);
echo $htmlstring, "\n";
}
function saveHTMLFragment(DOMElement $e) {
$selfclosingelements = array('></area>', '></base>', '></basefont>',
'></br>', '></col>', '></frame>', '></hr>', '></img>', '></input>',
'></isindex>', '></link>', '></meta>', '></param>', '></source>',
);
// This is not 100% reliable because it may output namespace declarations.
// But otherwise it is extra-paranoid to work down to at least PHP 5.1
$html = $e->ownerDocument->saveXML($e, LIBXML_NOEMPTYTAG);
// in case any empty elements are expanded, collapse them again:
$html = str_ireplace($selfclosingelements, '>', $html);
return $html;
}
However, note that what you are doing is dangerous because it could potentially mix encodings. It is better to have your output as another DOMDocument and use importNode() to copy the nodes you want. Alternatively, use an XSL stylesheet.
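
The `importNode()` route could look roughly like the sketch below. The source markup and the wrapping `div` are assumptions made for illustration; the point is only that each anchor is deep-copied into a second document, which then serializes the result in one consistent encoding.

```php
<?php
// Hypothetical source document; in practice this would come from loadHTMLFile().
$source = new DOMDocument();
libxml_use_internal_errors(true);
$source->loadHTML('<html><body><a href="/one">One</a><p>text</p><a href="/two">Two</a></body></html>');
libxml_clear_errors();

// Separate output document that will own the copied nodes.
$target = new DOMDocument('1.0', 'UTF-8');
$list = $target->appendChild($target->createElement('div'));

foreach ($source->getElementsByTagName('a') as $a) {
    // importNode(..., true) makes a deep copy that belongs to $target;
    // the node in $source is left untouched.
    $list->appendChild($target->importNode($a, true));
}

// Serialize only the container, so the anchors come out as complete elements.
$htmlOut = $target->saveHTML($list);
echo $htmlOut, PHP_EOL;
```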

I'm assuming you just copy-pasted some example code and didn't bother trying to learn how it actually works...
Anyway, the ->nodeValue part takes the element and returns the text content (because the element has a single text node child - if it had anything else, I don't know what nodeValue would give).
So, just remove the ->nodeValue and you have your element.
