Scrape a price off a website

Scrape a price off a website - php

I'm trying to scrape a price from a web page using PHP and Regexes. The price will be in the format £123.12 or $123.12 (i.e., pounds or dollars).
I'm loading up the contents using libcurl. The output of which is then going into preg_match_all. So it looks a bit like this:
$contents = curl_exec($curl);
preg_match_all('/(?:\$|£)[0-9]+(?:\.[0-9]{2})?/', $contents, $matches);
So far so simple. The problem is, PHP isn't matching anything at all - even when there are prices on the page. I've narrowed it down to there being a problem with the '£' character - PHP doesn't seem to like it.
I think this might be a charset issue. But whatever I do, I can't seem to get PHP to match it! Anyone have any ideas?
(Edit: I should note if I try using the Regex Test Tool using the same regex and page content, it works fine)

Have you try to use \ in front of £
preg_match_all('/(\$|\£)[0-9]+(\.[0-9]{2})/', $contents, $matches);
I have try this expression with .Net with \£ and it works. I just edited it and removed some ":".
(source: clip2net.com)
Read my comment about the possibility of Curl giving you bad encoding (comment of this post).

maybe pound has it's html entity replacement? i think you should try your regexp with some sort of couching program (i.e. match it against fixed text locally).
i'd change my regexp like this: '/(?:\$|£)\d+(?:\.\d{2})?/'

This should work for simple values.
'#(?:\$|\£|\€)(\d+(?:\.\d+)?)#'
This will not work with thousand separator like 234,343 and 34,454.45.

Related

PHP preg_match on own computer doesn't work

I have this code:
$success = preg_match('/(.+(駅前)?駅) (\(([^線]+線)\) )?((([^線 ]+) )?(\d+[分時])?)/u', $m, $matches);
Example input text is
大正駅 (JR大阪環状線) ﾊﾞｽ 20分
This regex works on https://regex101.com/ and the code works on http://sandbox.onlinephpfunctions.com/. However, when I run the PHP code on my own computer, it never gives me a match. $matches is an empty array, and $success is 0. Yes, the exact same code. I have verified that the regex is correct (using first link) and that the code itself works (using second link). However, it still refuses to work on my own PC.
OS is Arch Linux, running PHP 7.3.11, system locale is ja_JP.UTF-8 (which I don't think matters, but just in case)
Does anyone see anything wrong with the code?

So I was able to find the problem.
First, I tried just the one-liner commented by Nick (3v4l.org/o4ADM) on my PC, and it works. (Of course it should. PHP can't be broken.)
So I figured out that it's the data I'm feeding preg_match that should be broken.
Normal prints and echos were in vain--$m is always how it should be. Then I considered AD7six's comment,
Check that the bytes for 駅 etc. are actually the same
so I looked carefully to check that the characters are all Japanese and no Chinese variants are there. And it's all Japanese, it's fine.
So what could it be?
I tried using PHP's file_put_contents to dump the variable to a file, and then typing the same text with my Japanese keyboard manually and saving them to another file. I opened Meld (a diff tool) and compared the two text and voila--the spaces on the text use a different codepoint than the usual half-width space (0x20). It uses 0xA0 instead, which is a "no-break space", apparently. What the heck.
Fortunately, a simple $m = str_replace("\u{00A0}", " ", $m) did the trick.
Thanks to everyone for leading me to the right answer!

PHP: How would I remove parts of a string between 2 chunks of characters without removing too much?

This problem is driving me nuts. Let's say I have a string:
This is a &start;pretty bad&end; string that I want to &start;somehow&end; display differently
I want to be able to remove the &start; and &end; parts as well as everything in between so it says:
This is a string that I want to display differently
I tried using preg_replace with a regular expression but it took off too much, ie:
This is a display differently
The question is: how do I remove the stuff just between sets of &start; and &end; pairs and make sure that it doesn't remove anything between any &end; and &start; segments?
Keep in mind, I'm working with hundreds of strings that are very different to each other so I'm looking for a flexible solution that'll work with all of them.
Thanks in advance for any help with this.
Edit: Replaced dollar signs with ampersands. Oops!

Try this regex /\&start;(.+?)\$end;/g
It looks like it works as desired: https://regex101.com/r/MW5nom/2

I quickly tried it on chrome console using JS, tried converting it into PHP:
"This is a &start;pretty bad$end; string that I want to &start;somehow$end; display differently".replace(/\&start;(.+?)\$end;/g, "")

Settings that could influence PHP str_replace behaviour

I am currently working on a replacement tool that will dynamically replace certain strings (including html) in a website using a smarty outputfilter.
For the replacement to take place, I am using PHP's str_ireplace method, which reads the code that is supposed to be replaced and the replacement code from a database, and then pass the result to the smarty output (using an output filter), in a similar way as the below.
$tpl_source = str_ireplace($replacements['sourceHTML'], $replacements['replacementHTML'], $tpl_source);
The problem is, that although it works great on my dev server, once uploaded to the live server replacements occasionally fail. The same replacements work just fine on my dev version though. After some examinations and googling there was not much I could find out regarding this issue. So my question is, what could influence str_replace's behavour?
Thanks
Edit with replacement example:
$htmlsource = file_get_contents('somefile.html');
$newstr = str_replace('Some text', 'sometext', $htmlsource); // the text to be replaced does exist in the html source
fails to replace. After some checking, it looks like the combination of "> creates a problem. But just the combination of it. If I try to change only (") it works, if I try to change only (>) it works.

It might be that special chars like umlauts do not display on the live server correctly and so str_replace() would fail, if there are specialchars inside the string you want to replace.

Is the input string identical on both systems? Have you verified this? Are you sure?
Things to check:
Are the HTML attributes in the same order?
Are the attribute values using the same kind quote marks? (eg <a href='#'> vs <a href="#">)
Is there any other stray HTML code getting in there?
Is the entity encoding the same? (eg vs   - same character; different HTML)
Is the character-set the same? (eg utf-8 vs ISO 8859-1: Accented characters will be encoded differently)
Any of these things will affect the result and produce the failures you're describing.

This was a trikcy problem, and it ended up having nothing to do with the str_replace method itself;
We are using smarty as a tamplating system. The str_replace method was used by a smarty ouput filter in order to change the html in some ocassions, just before it was delivered to the user.
Here is the Smarty outputfilter Code:
function smarty_outputfilter_replace($tpl_source, &$smarty)
{
$replacements = Content::getReplacementsForPage();
if (is_array($replacements))
{
foreach ($replacements as $replacementData)
{
$tpl_source = str_replace($replacementData['sourcecode'], $replacementData['replacementcode'], $tpl_source);
}
}
return ($tpl_source);
}
So this code failed now and then for now apparent reason... until I realized that the HTML code in the smarty template was being manipulated by an Apache filter.
This resulted into the source code in the browser (which we were using as the code to be replaced by something else) not being identical to the template code (which smarty was trying to modify). Result? str_replace failed :)

PHP wordpress string formatting error

I have a bit of PHP where I want to store a URL in a string.
The code itself seems fine, but for some reason, when I use the characters $sectionId=, it causes problems, in fact, it alters $sectionId= and changes it to §ionId=.
If I misspell it to $secionId then it works fine.
The full url SHOULD be:
http://url.com/file.php?appKey=$appkey&storeId=$storeid&sectionId=$sectionid&v=3
but when I do an echo $myURL; on it, it gives me:
http://url.com/file.php?appKey=$appkey&storeId=$storeid§ionId=$sectionid&v=3
Notice the §ionId= instead of $sectionId=.
Can anyone help me with this? It seems like basic PHP, but I don't understand why it just doesnt like those 4 or 5 characters in a row!!
Thanks.

Are you echoing it right to HTML? Well, some over-helpful browsers will do character conversions without being asked explicitly to with a semicolon; all you need to do is run it through htmlentities or replace all &s with & and it will display correctly.

PHP regex for filtering out urls from specific domains for use in a vBulletin plug-in

I'm trying to put together a plug-in for vBulletin to filter out links to filesharing sites. But, as I'm sure you often hear, I'm a newb to php let alone regexes.
Basically, I'm trying to put together a regex and use a preg_replace to find any urls that are from these domains and replace the entire link with a message that they aren't allowed. I'd want it to find the link whether it's hyperlinked, posted as plain text, or enclosed in [CODE] bb tags.
As for regex, I would need it to find URLS with the following, I think:
Starts with http or an anchor tag. I believe that the URLS in [CODE] tags could be processed the same as the plain text URLS and it's fine if the replacement ends up inside the [CODE] tag afterward.
Could contain any number of any characters before the domain/word
Has the domain somewhere in the middle
Could contain any number of any characters after the domain
Ends with a number of extentions such as (html|htm|rar|zip|001) or in a closing anchor tag.
I have a feeling that it's numbers 2 and 4 that are tripping me up (if not much more). I found a similar question on here and tried to pick apart the code a bit (even though I didn't really understand it). I now have this which I thought might work, but it doesn't:
<?php
$filterthese = array('domain1', 'domain2', 'domain3');
$replacement = 'LINKS HAVE BEEN FILTERED MESSAGE';
$regex = array('!^http+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*(html|htm|rar|zip|001)$!',
'!^<a+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*</a>$!');
$this->post['message'] = preg_replace($regex, $replacement, $this->post['message']);
?>
I have a feeling that I'm way off base here, and I admit that I don't fully understand php let alone regexes. I'm open to any suggestions on how to do this better, how to just make it work, or links to RTM (though I've read up a bit and I'm going to continue).
Thanks.

You can use parse_url on the URLs and look into the hashmap it returns. That allows you to filter for domains or even finer-grained control.

I think you can avoid the overhead of this in using the filter_var built-in function.
You may use this feature since PHP 5.2.0.
$good_url = filter_var( filter_var( $raw_url, FILTER_SANITIZE_URL), FILTER_VALIDATE_URL);

Hmm, my first guess: You put $filterthese directly inside a single-quoted string. That single quotes don't allow for variable substitution. Also, the $filterthese is an array, that should first be joined:
var $filterthese = implode("|", $filterthese);
Maybe I'm way off, because I don't know anything about vBulletin plugins and their embedded magic, but that points seem worth a check to me.
Edit: OK, on re-checking your provided source, I think the regexp line should read like this:
$regex = '!(?#
possible "a" tag [start]: )(<a[^>]+href=["\']?)?(?#
offending link: )https?://(?#
possible subdomains: )(([a-z0-9-]+\.)*\.)?(?#
domains to block: )('.implode("|", $filterthese).')(?#
possible path: )(/[^ "\'>]*)?(?#
possible "a" tag [end]: )(["\']?[^>]*>)?!';

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Scrape a price off a website - php

maybe pound has it's html entity replacement? i think you should try your regexp with some sort of couching program (i.e. match it against fixed text locally). i'd change my regexp like this: '/(?:\$|£)\d+(?:\.\d{2})?/'

This should work for simple values. '#(?:\$|\£|\€)(\d+(?:\.\d+)?)#' This will not work with thousand separator like 234,343 and 34,454.45.

Related

PHP preg_match on own computer doesn't work

PHP: How would I remove parts of a string between 2 chunks of characters without removing too much?

Settings that could influence PHP str_replace behaviour

PHP wordpress string formatting error

PHP regex for filtering out urls from specific domains for use in a vBulletin plug-in

Categories

Resources