Extracting data form website using php

Extracting data form website using php - php

I have the following website:
http://stationmeteo.meteorologic.net/metar/your-metar.php?icao=LFRS&day=070308
I want to extract data from it.
I tried using file_get_contents and some regular expressions, but something is not working.
this is the code I tried:
$content=file_get_contents('http://stationmeteo.meteorologic.net/metar/your-metar.php? icao=LFMN&day=010513');
preg_match('/00\:30 07\/03\/2008(.+)01\:30 07\/03\/2008/',$content,$m);
echo $m[0];
echo $m[1];
It's giving me undefined offset 0 and 1.
If I copy the content of the web page directly to $content instead of using file_get_contents, it works fine.
What am I missing?

The problem is that .+ matches any characters except newlines, and there is a newline character in the text you're trying to match.
Try
preg_match('~00:30 07/03/2008(.+)01:30 07/03/2008~s',$content,$m);
(using ~ as a delimiter so you don't have to escape all those slashes, by the way)
The next question is: Why don't I get this problem when copying the contents of the webpage directly into $content? Well, all whitespace is normalized to a single space when a webpage is rendered, turning the \n that's present in the page's source code (press Ctrl-U to see it) into a simple space. And .+ matches that space.

Related

Replace spaces in all URLs with %20 using Regex

I have a large block of HTML that contains multiples URLs with spaces in them. How do I used Regex to replace any space that occurs in a URL, with a '%20'. The good thing is that all of the URLs end with '.pdf'.
Looking for something I could run in BBedit/Text Wrangler, or even PHP.
Example: http://www.site-name.com/dir/file name here.pdf
Need to return: http://www.site-name.com/dir/file%20name%20here.pdf

Instead of Regex you could use could use urlencode in PHP to achieve this which escapes the url for you. Similar to encodeURI in JavaScript.

I was faced with exactly the same problem. I solved it with this:
$text = preg_replace("/http(.*) (.*)\.pdf/U", "http$1%20$2.pdf", $text);
This looks for a space between http and pdf and then replaces the space with %20.
If your URLs have multiple spaces, then simply run the code over and over until all the spaces are gone:
while(preg_match("/http(.*) (.*)\.pdf/U", $text))
{
$text = preg_replace("/http(.*) (.*)\.pdf/U", "http$1%20$2.pdf", $text);
echo('testing testing');
}
However, I've found this will overwrite text if there are two or more URLs on the same line. I haven't found a solution for this yet.

PHP Detect 1 or more unicode spaces only

I have searched in vain to find a fix for this issue. I have an editable field in a web page that contains a user entered space. When I copy the space and enter it into a program called IVI32 which I guess you would call a Unicode text program, I get the following info.
The space character is defined as FFFE2000. I need to detect when this field has one or more of these spaces and nothing else. I have tried the following with preg_match:
'/\s+/u'
'/^[0 :-]+$/ '
'/\A\s*\z/'
Nothing works and I am completely stumped. Any help from some Unicode experts out there will be greatly appreciated.

There was an error in my code which was preventing anything from working properly (the product of no time off!). Here's what works for anyone else who might want to detect if an element contains only whitespace that cannot be eliminated by php trim();
if(!preg_match('/\\s/', $test_string)):-do something-
if(!preg_match('/\s+/u', $test_string)):-do something-
if(!preg_match('/[\pZ\pC]+/u', $test_string)):-do something-
For anyone who is interested the space is pasted immediately after the end of this sentence.

Would this work?
preg_match('/^ +$/', $subject);
Match a single or more spaces? Because \s will also match nonbreaking spaces, tabs, and newlines.

Have a try with:
/\p{WhiteSpace}/

PHP Regex URL parsing issues preg_replace

I have a custom markup parsing function that has been working very well for many years. I recently discovered a bug that I hadn't noticed before and I haven't been able to fix it. If anyone can help me with this that'd be awesome. So I have a custom built forum and text based MMORPG and every input is sanitized and parsed for bbcode like markup. It'll also parse out URL's and make them into legit links that go to an exit page with a disclaimer that you're leaving the site... So the issue that I'm having is that when I user posts multiple URL's in a text box (let's say \n delimited) it'll only convert every other URL into a link. Here's the parser for URL's:
$markup = preg_replace("/(^|[^=\"\/])\b((\w+:\/\/|www\.)[^\s<]+)" . "((\W+|\b)([\s<]|$))/ei", '"$1".shortURL("$2")."$4"', $markup);
As you can see it calls a PHP function, but that's not the issue here. Then entire text block is passed into this preg_replace at the same time rather than line by line or any other means.
If there's a simpler way of writing this preg_replace, please let me know
If you can figure out why this is only parsing every other URL, that's my ultimate goal here
Example INPUT:
http://skylnk.co/tRRTnb
http://skylnk.co/hkIJBT
http://skylnk.co/vUMGQo
http://skylnk.co/USOLfW
http://skylnk.co/BPlaJl
http://skylnk.co/tqcPbL
http://skylnk.co/jJTjRs
http://skylnk.co/itmhJs
http://skylnk.co/llUBAR
http://skylnk.co/XDJZxD
Example OUTPUT:
http://skylnk.co/tRRTnb
<br>http://skylnk.co/hkIJBT
<br>http://skylnk.co/vUMGQo
<br>http://skylnk.co/USOLfW
<br>http://skylnk.co/BPlaJl
<br>http://skylnk.co/tqcPbL
<br>http://skylnk.co/jJTjRs
<br>http://skylnk.co/itmhJs
<br>http://skylnk.co/llUBAR
<br>http://skylnk.co/XDJZxD
<br>

e flag in preg_replace is deprecated. You can use preg_replace_callback to access the same functionality.
i flag is useless here, since \w already matches both upper case and lower case, and there is no backreference in your pattern.
I set m flag, which makes the ^ and $ matches the beginning and the end of a line, rather than the beginning and the end of the entire string. This should fix your weird problem of matching every other line.
I also make some of the groups non-capturing (?:pattern) - since the bigger capturing groups have captured the text already.
The code below is not tested. I only tested the regex on regex tester.
preg_replace_callback(
"/(^|[^=\"\/])\b((?:\w+:\/\/|www\.)[^\s<]+)((?:\W+|\b)(?:[\s<]|$))/m",
function ($m) {
return "$m[1]".shortURL($m[2])."$m[3]";
},
$markup
);

Regex to conditionally replace Twitter hashtags with hyperlinks

I'm writing a small PHP script to grab the latest half dozen Twitter status updates from a user feed and format them for display on a webpage. As part of this I need a regex replace to rewrite hashtags as hyperlinks to search.twitter.com. Initially I tried to use:
<?php
$strTweet = preg_replace('/(^|\s)#(\w+)/', '\1#\2', $strTweet);
?>
(taken from https://gist.github.com/445729)
In the course of testing I discovered that #test is converted into a link on the Twitter website, however #123 is not. After a bit of checking on the internet and playing around with various tags I came to the conclusion that a hashtag must contain alphabetic characters or an underscore in it somewhere to constitute a link; tags with only numeric characters are ignored (presumably to stop things like "Good presentation Bob, slide #3 was my favourite!" from being linked). This makes the above code incorrect, as it will happily convert #123 into a link.
I've not done much regex in a while, so in my rustyness I came up with the following PHP solution:
<?php
$test = 'This is a test tweet to see if #123 and #4 are not encoded but #test, #l33t and #8oo8s are.';
// Get all hashtags out into an array
if (preg_match_all('/(^|\s)(#\w+)/', $test, $arrHashtags) > 0) {
foreach ($arrHashtags[2] as $strHashtag) {
// Check each tag to see if there are letters or an underscore in there somewhere
if (preg_match('/#\d*[a-z_]+/i', $strHashtag)) {
$test = str_replace($strHashtag, ''.$strHashtag.'', $test);
}
}
}
echo $test;
?>
It works; but it seems fairly long-winded for what it does. My question is, is there a single preg_replace similar to the one I got from gist.github that will conditionally rewrite hashtags into hyperlinks ONLY if they DO NOT contain just numbers?

(^|\s)#(\w*[a-zA-Z_]+\w*)
PHP
$strTweet = preg_replace('/(^|\s)#(\w*[a-zA-Z_]+\w*)/', '\1#\2', $strTweet);
This regular expression says a # followed by 0 or more characters [a-zA-Z0-9_], followed by an alphabetic character or an underscore (1 or more), followed by 0 or more word characters.
http://rubular.com/r/opNX6qC4sG <- test it here.

It's actually better to search for characters that aren't allowed in a hashtag otherwise tags like "#Trentemøller" wont work.
The following works well for me...
preg_match('/([ ,.]+)/', $string, $matches);

I have devised this: /(^|\s)#([[:alnum:]])+/gi

I found Gazlers answer to work, although the regex added a blank space at the beginning of the hashtag, so I removed the first part:
(^|\s)
This works perfectly for me now:
#(\w*[a-zA-Z_0-9]+\w*)
Example here: http://rubular.com/r/dS2QYZP45n

PHP: How to prevent unwanted line breaks

I'm using PHP to create some basic HTML. The tags are always the same, but the actual links/titles correspond to PHP variables:
$string = '<p style="..."><strong><i>'.$title[$i].'</i></strong>
<br>';
echo $string;
fwrite($outfile, $string);
The resultant html, both as echoed (when I view the page source) and in the simple txt file I'm writing to, reads as follows:
<p style="..."><a href="http://www.example.com
"><strong><i>Example Title
</i></strong></a></p>
<br>
While this works, it's not exactly what I want. It looks like PHP is adding a line break every time I interrupt the string to insert a variable. Is there a way to prevent this behavior?

Whilst it won't affect your HTML page at all with the line breaks (unless you are using pre or text-wrap: pre), you should be able to call trim() on those variables to remove newlines.
To find out if your variable has a newline at front or back, try this regex
var_dump(preg_match('/^\n|\n$/', $variable));
(I think you have to use single quotes so PHP doesn't turn your \n into a literal newline in the string).

My guess is your variables are to blame. You might try cleaning them up with trim: http://us2.php.net/trim.

The line breaks show up because of multi-byte encoding, I believe. Try:
$newstring = mb_substr($string_w_line_break,[start],[length],'UTF-8');
That worked for me when strange line breaks showed up after parsing html.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Extracting data form website using php - php

Related

Replace spaces in all URLs with %20 using Regex

PHP Detect 1 or more unicode spaces only

PHP Regex URL parsing issues preg_replace

Regex to conditionally replace Twitter hashtags with hyperlinks

PHP: How to prevent unwanted line breaks

Categories

Resources