PHP preg_match displaying Array ( ) - php

I'm a newbie in PHP and I'm trying to scrape just the weather summary data from a forecast site. I tried everything I found, but it only returns Array ( )
<?php
$contents=file_get_contents("http://www.weather-forecast.com/locations/Podgorica/forecasts/latest");
preg_match('/ <a class="forecast-magellan-target" name="forecast-part-0"><div data-magellan-destination="forecast-part-0"><\/div><\/a><p class="summary"> (.*?) <\/span><\/p> /is',$contents, $matches);
print_r($matches);
?>

You have extra spaces in the pattern. Outside of extended (x) mode every space in a regex is literal, so please try this:
preg_match('/<a class="forecast-magellan-target" name="forecast-part-0"><div data-magellan-destination="forecast-part-0"><\/div><\/a><p class="summary">(.*?)<\/span><\/p>/is',$contents, $matches);
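To see why the stray spaces matter, here is a minimal sketch. The $html fragment is a made-up stand-in for the page (an assumption, not the real markup of the forecast site); it only illustrates that a literal space in the pattern must be matched by a literal space in the input.

```php
<?php
// Hypothetical fragment standing in for the downloaded page.
$html = '<p class="summary">Mostly dry. Warm.</span></p>';

// The spaces around the group are literal: they must appear in the input.
$withSpaces = preg_match('/<p class="summary"> (.*?) <\/span><\/p>/is', $html, $m1);

// Without the stray spaces the same fragment matches.
$withoutSpaces = preg_match('/<p class="summary">(.*?)<\/span><\/p>/is', $html, $m2);

var_dump($withSpaces);    // int(0)
var_dump($withoutSpaces); // int(1)
echo $m2[1], "\n";        // Mostly dry. Warm.
```

Checking the return value of preg_match before reading $matches makes an empty Array ( ) easier to diagnose.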

Related

Parsing PDF tables into csv with php

I need to convert a PDF file with tables into CSV, so I used "PDFPARSER" to parse the entire text, and then preg_match_all to search for the patterns of each table so I can create an array from each table of the PDF.
The structure of the following PDF is:
When I parse I get this
ECO-698 Acondicionador Frio-Calor ECO-CHI-522 Chimenea eléctrica con patas
I figured out how to match all the ECO-XXXXX codes with preg_match_all, but I don't know how to match all the descriptions
This is what is working for ECO-XXXXXX
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('publication.pdf');
$text = $pdf->getText();
echo $text;
$pattern = '/ECO-[.-^*-]{3,}| ECO-[.-^*-]{4,}\s\b[NMB]\b|ECO-[.-^*-]{4,}\sUP| ECO-[.-^*-]{3,}\sUP\s[B-N-M]{1}| ECO-[.-^*-]{3,}\sRX/' ;
preg_match_all($pattern, $text, $array);
echo "<hr>";
print_r($array);
I get
Array ( [0] => Array ( [0] => ECO-698 [1] => ECO-CHI-522 [2]
You may try this regex:
(ECO[^\s]+)\s+(.*?)(?=ECO|\z)
As per the input string, group1 contains the ECO Block and group 2 contains the descriptions.
Explanation:
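(ECO[^\s]+) captures the full ECO block until it reaches whitespace.
\s+ matches one or more whitespace characters.
(.*?)(?=ECO|\z) Here (.*?) matches the description and (?=ECO|\z) is a positive lookahead for the next ECO or the end of the string (\z). Because (.*?) runs right up to the next ECO, the captured description keeps its trailing space.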
Regex101
Source Code (Run here):
$re = '/(ECO[^\s]+)\s+(.*?)(?=ECO|\z)/m';
$str = 'ECO-698 Acondicionador Frio-Calor ECO-CHI-522 Chimenea eléctrica con patas';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
$val=1;
foreach ($matches as $value)
{
echo "\n\nRow no:".$val++;
echo "\ncol 1:".$value[1]."\ncol 2:".$value[2];
}
Update, as per the comment (the longer suffixes such as UP B are listed before UP so that they are tried first):
((?:ECO-(?!DE)[^\s]+)(?: (?:RX|UP B|UP N|UP M|UP|B|N|M))?)\s+(.*?)(?=(?:ECO-(?!DE))|\z)
Regex 101 updated
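As a rough sketch of how the code/description pairs could be collected into rows for a CSV, the snippet below uses a compact variant of the updated pattern. The "ECO-777 UP B Estufa" entry is an invented sample (the question only shows two products), and the suffix alternation is written with the longer forms first, which is an assumption about what the asker needs.

```php
<?php
// The first and last items come from the question; "ECO-777 UP B Estufa" is hypothetical.
$text = 'ECO-698 Acondicionador Frio-Calor ECO-777 UP B Estufa ECO-CHI-522 Chimenea eléctrica con patas';

// Group 1: code plus optional suffix (longer alternatives first).
// Group 2: description, up to the next code or the end of the string.
$re = '/((?:ECO-(?!DE)[^\s]+)(?: (?:RX|UP [BNM]|UP|[BNM]))?)\s+(.*?)(?=ECO-(?!DE)|\z)/s';

preg_match_all($re, $text, $matches, PREG_SET_ORDER);

$rows = [];
foreach ($matches as $m) {
    // trim() drops the space left in front of the next code.
    $rows[] = [$m[1], trim($m[2])];
}

foreach ($rows as $row) {
    echo implode(';', $row), "\n";
}
```

Each element of $rows is a two-column record, so it can be handed to whatever CSV writer the project already uses.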

how to php preg_match_all find exact words

I am trying to get the exact word 846002 from HTML code by using preg_match_all
php code :
<?php
$sText =file_get_contents('test.html', true);
preg_match_all('#download.php/.[^\s.,"\']+#i', $sText, $aMatches);
print_r($aMatches[0][0]);
?>
test.html
<tr>
<td class=ac>
<img src="/dl3.png" alt="Download" title="Download">
</td>
</tr>
output:
download.php/829685/Dark
but i want to output only,
829685
Just add the slash in the character class and capture in group 1:
preg_match_all('#download.php/([^\s.,"\'/]+)#i', $sText, $aMatches);
The value you're looking for is in $aMatches[1][0]
You need to create a capture group using ( and )
preg_match_all ( '#download.php/([^/]+)#i', $sText, $aMatches );
print_r ( $aMatches[1][0] );
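A small self-contained sketch of the capture-group approach follows. The link inside $sText is hypothetical, since the anchor tag from test.html is not shown in the question; the dot in download\.php is also escaped here so that it only matches a literal dot.

```php
<?php
// Hypothetical markup: the real link from test.html is not shown in the question.
$sText = '<td class=ac><a href="/download.php/829685/Dark"><img src="/dl3.png" alt="Download"></a></td>';

// Group 1 stops at the next slash, whitespace, or quote.
preg_match_all('#download\.php/([^/\s"\']+)#i', $sText, $aMatches);

// $aMatches[0] holds the full matches, $aMatches[1] the contents of group 1.
$id = $aMatches[1][0] ?? null; // null when nothing matched

echo $id, "\n"; // 829685
```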

Using RegEx to Capture All Links & In Between Text From A String

<Link to: http://www.someurl(.+)> maybe some text here(.*) <Link: www.someotherurl(.+)> maybe even more text(.*)
Given that this is all on one line, how can I match or better yet extract all full urls and text? ie. for this example I wish to extract:
http://www.someurl(.+) . maybe some text here(.*) . www.someotherurl(.+) . maybe even more text(.*)
Basically, <Link.*:.* would start each link capture and > would end it. Then all text after the first capture would be captured as well up until zero or more occurrences of the next link capture.
I have tried:
preg_match_all('/<Link.*?:.*?(https|http|www)(.+?)>(.*?)/', $v1, $m4);
but I need a way to capture the text after the closing >. The problem is that there may or may not be another link after the first one (of course there could also be no links to begin with!).
$string = "<Link to: http://www.someurl(.+)> maybe some text here(.*) <Link: www.someotherurl(.+)> maybe even more text(.*)";
$string = preg_split('~<link(?: to)?:\s*([^>]+)>~i',$string,-1,PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY);
echo "<pre>";
print_r($string);
output:
Array
(
[0] => http://www.someurl(.+)
[1] => maybe some text here(.*)
[2] => www.someotherurl(.+)
[3] => maybe even more text(.*)
)
You can use this pattern:
preg_match_all('~<link\b[^:]*:\s*\K(?<link>[^\s>]++)[^>]*>\s*(?<text>[^<]++)~i',
$txt, $matches, PREG_SET_ORDER);
foreach($matches as $match) {
printf("<br/>link: %s\n<br/>text: %s", $match['link'], $match['text']);
}

How to get the content in a div using php

In my application I am trying to get the number of pages indexed by Google, and I found out that the number is available in the following div
<div id="resultStats"> About 1,960,000 results (0.38 seconds) </div>
Now my question is how to extract the number from the above div in a web page
Never use regexp to parse HTML. (See: RegEx match open tags except XHTML self-contained tags)
Use an HTML parser, like Simple HTML DOM (http://simplehtmldom.sourceforge.net/)
You can then use CSS selectors to pick the element:
$html = file_get_html('http://www.google.com/');
$divContent = $html->find('div#resultStats', 0)->plaintext;
$matches = array();
preg_match('/([0-9,]+)/', $divContent, $matches);
echo $matches[1];
Outputs: "1,960,000"
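If adding a third-party library is not an option, the same idea can be sketched with PHP's bundled DOM extension (DOMDocument and DOMXPath). This is an alternative to the Simple HTML DOM call above, not what the answer itself uses, and the markup is the sample div from the question rather than a live page.

```php
<?php
// Sample markup taken from the question; a real page would be fetched first.
$html = '<div id="resultStats"> About 1,960,000 results (0.38 seconds) </div>';

$doc = new DOMDocument();
// Collect libxml warnings internally instead of printing them;
// real-world HTML is rarely well-formed.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$node  = $xpath->query('//div[@id="resultStats"]')->item(0);

$count = null;
if ($node !== null && preg_match('/[0-9][0-9,]*/', $node->textContent, $m)) {
    // Strip the thousands separators to get a plain integer.
    $count = (int) str_replace(',', '', $m[0]);
}

var_dump($count); // int(1960000)
```

The parser locates the element, and the regex is only applied to the short text inside it.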
$str = '<div id="resultStats"> About 1,960,000 results (0.38 seconds) </div> ';
$matches = array();
preg_match('/<div id="resultStats"> About ([0-9,]+?) results[^<]+<\/div>/', $str, $matches);
print_r($matches);
Output:
Array (
[0] => <div id="resultStats"> About 1,960,000 results (0.38 seconds) </div>
[1] => 1,960,000
)
This is a simple regex with a subpattern:
([0-9,]+?) - matches the digits 0-9 and the , character, one or more times, non-greedily.
[^<]+ - matches any character except <, one or more times.
echo $matches[1]; - will print the number you want.
You can use a regex (preg_match) for that
$your_div_string = '<div id="resultStats"> About 1,960,000 results (0.38 seconds) </div>';
preg_match('/<div.*>(.*)<\/div>/i', $your_div_string, $result);
print_r( $result );
output will be
Array (
[0] => <div id="resultStats"> About 1,960,000 results (0.38 seconds) </div>
[1] => About 1,960,000 results (0.38 seconds)
)
In this way you can get the content inside the div.
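One caveat with the greedy .* in that pattern: on input containing more than one div it can swallow more than intended. A minimal sketch, using an invented two-div string, compares it with a tighter variant ([^>]* for the attributes and a lazy (.*?) for the content):

```php
<?php
// Hypothetical input with two divs on one line.
$s = '<div id="a">first</div><div id="b">second</div>';

// Greedy: the first .* runs as far as it can, so group 1 ends up in the last div.
preg_match('/<div.*>(.*)<\/div>/i', $s, $greedy);

// Tighter: attributes cannot cross a ">", and the content stops at the first </div>.
preg_match('/<div[^>]*>(.*?)<\/div>/i', $s, $lazy);

echo $greedy[1], "\n"; // second
echo $lazy[1], "\n";   // first
```

Neither pattern handles nested divs, which is one reason an HTML parser is the safer choice for anything beyond a single known fragment.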

preg_match_all to scrape found word between html tags

I have the following piece of code which should match the provided string to $contents. $contents variable has a web page contents stored through file_get_contents() function:
if (preg_match('~<p style="margin-top: 40px; " class="head">GENE:<b>(.*?)</b>~iU', $contents, $match)){
$found_match = $match[1];
}
The original string on the said webpage looks like this:
<p style="margin-top: 40px; " class="head">GENE:<b>TSPAN6</b>
I would like to match and store the string 'TSPAN6' found on the web page through (.*?) into $match[1]. However, the matching does not seem to work. Any ideas?
Unfortunately, your suggestion did not work.
After some hours of looking through the HTML code I realized that the page simply has a blank space right after the colon, which my regex lacked. As such, the code snippet now looks like this:
$pattern1 = '#GENE: <b>(.*)</b>#i';
preg_match($pattern1, $contents, $match1);
if (isset($match1[1]))
{
$found_flag = $match1[1];
}
Try this:
preg_match( '#GENE:<b>([^<]+)</b>#si', $contents, $match );
$found_match = ( isset($match[1]) ? $match[1] : false );
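Since the mismatch came down to a single space, a pattern that tolerates optional whitespace may be more robust. The sketch below is one possible variant, not the posted answer: it uses \s* after the colon, and $contents is a fragment modeled on the question rather than the real page.

```php
<?php
// Fragment modeled on the question, including the space after the colon
// that the page turned out to contain.
$contents = '<p style="margin-top: 40px; " class="head">GENE: <b>TSPAN6</b>';

// \s* accepts zero or more whitespace characters,
// so both "GENE:<b>" and "GENE: <b>" match.
$found_match = preg_match('#GENE:\s*<b>([^<]+)</b>#i', $contents, $match)
    ? $match[1]
    : false;

var_dump($found_match); // string(6) "TSPAN6"
```

Note that the flags go after the closing delimiter; anything placed before it is treated as part of the pattern.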
