I am planning to move one of my scrapers to Python. I am comfortable using preg_match and preg_match_all in PHP. I am not finding a suitable function in Python similar to preg_match. Could anyone please help me in doing so?
For example, if I want to get the content between <a class="title" and </a>, I use the following function in PHP:
preg_match_all('/a class="title"(.*?)<\/a>/si',$input,$output);
Whereas in Python I am not able to figure out a similar function.
You are looking for Python's re module.
Take a look at re.findall and re.search.
And since you mention that you are trying to parse HTML, use an HTML parser for that. There are a couple of options available in Python, such as lxml or BeautifulSoup.
Take a look at this: Why you should not parse HTML with regex
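To illustrate the parser route, here is a minimal sketch that uses the standard library's html.parser instead of lxml or BeautifulSoup (a deliberate swap so the example has no third-party dependency). The class name TitleLinkParser and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect the text inside every <a class="title"> element."""

    def __init__(self):
        super().__init__()
        self.titles = []      # finished link texts
        self._buffer = None   # text chunks while inside a matching <a>

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of pairs.
        if tag == "a" and dict(attrs).get("class") == "title":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            self.titles.append("".join(self._buffer).strip())
            self._buffer = None


# Hypothetical sample input
sample = ('<p><a class="title" href="/one">First\npost</a> '
          '<a href="/x">skip</a> '
          '<A CLASS="title" href="/two">Second</A></p>')

parser = TitleLinkParser()
parser.feed(sample)
parser.close()
titles = parser.titles
print(titles)
```

This only handles an exact class="title" attribute; an element with several classes would need a split on whitespace, which lxml or BeautifulSoup selectors handle for you.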
I think you need something like this:
import re

output = re.search(r'a class="title"(.*?)</a>', input, flags=re.IGNORECASE)
if output is not None:
    output = output.group(0)  # group(0) is the whole match, group(1) the captured part
print(output)
You can add (?s) at the start of the regex to enable DOTALL mode, so that . also matches newlines (the equivalent of the /s modifier in PHP):
output = re.search(r'(?s)a class="title"(.*?)</a>', input, flags=re.IGNORECASE)
if output is not None:
    output = output.group(0)
print(output)
You might be interested in reading about Python Regular Expression Operations
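For the preg_match_all call in the question, the closest match in the re module is probably re.findall (or re.finditer when the full match is also needed). A small sketch, with made-up sample input; re.DOTALL and re.IGNORECASE stand in for the /s and /i modifiers:

```python
import re

# Hypothetical sample input
html = '''<a class="title" href="/one">First
post</a>
<A CLASS="title" href="/two">Second</A>'''

# Same pattern as the PHP example; flags replace the /si modifiers
pattern = re.compile(r'a class="title"(.*?)</a>', re.DOTALL | re.IGNORECASE)

# findall returns only the captured group when the pattern has exactly one
captures = pattern.findall(html)

# finditer yields match objects, so the whole match is available as group(0)
full_matches = [m.group(0) for m in pattern.finditer(html)]

print(captures)
print(full_matches)
```

Unlike preg_match_all, which fills one array with both the full matches and the groups, Python keeps the two apart, so pick findall or finditer depending on which you need.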
I want to remove all the tags (but keep img, sub, sup) and styles from my code, and also remove all the HTML entities (but keep &amp; (&) and &copy; (©)) using a regex, but I don't know how to do it. Can anyone guide me?
Thank you in advance.
For PHP take a look at the php docs at http://php.net/manual/de/function.preg-replace.php
For Regexp in general check http://www.regular-expressions.info/
For live testing take a look at https://regex101.com/, it is a really good testing tool and explains the whole expression piece by piece.
As you want to keep your styles, strip_tags won't take you the whole way.
Try filter_var
You can see more filters on:
http://php.net/manual/en/filter.filters.sanitize.php
echo filter_var('coco <p>©&</p>',FILTER_SANITIZE_STRING);
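As a rough illustration of the regex idea from the question (remove every tag except img, sub and sup), here is a sketch in Python; the same pattern should carry over to preg_replace. The names ALLOWED and strip_tags_except are invented for the example, and it makes no attempt to handle comments, script blocks, or a > inside an attribute value.

```python
import re

# Tags that should survive; everything else is removed
ALLOWED = ("img", "sub", "sup")

# "<" or "</", then a tag name that is NOT one of the allowed names,
# then everything up to the closing ">"
TAG_RE = re.compile(
    r'</?(?!(?:%s)\b)[a-zA-Z][^>]*>' % "|".join(ALLOWED),
    re.IGNORECASE,
)


def strip_tags_except(html):
    """Remove all tags except the allowed ones; text content is kept."""
    return TAG_RE.sub("", html)


cleaned = strip_tags_except(
    '<p style="color:red">H<sub>2</sub>O <b>and</b> <img src="a.png"> &copy;</p>'
)
print(cleaned)
```

Attributes on the kept tags (including any style attribute) are left untouched here, so removing those would need a second pass.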
I have an array with strings to translate ($translation), and I want to use it to translate the output buffer. However, it should not replace within html tags. I have tried using php DOM, but this is too slow and probably too complex for what I want to do.
I currently use this code, but this of course also translates between tags.
$output = ob_get_clean();
foreach ($translation as $original => $translated) {
    $output = str_replace($original, utf8_encode($translated), $output);
}
I guess I should use a regular expression to replace not within HTML tags, but I can't seem to find the correct expression to do this. Can anyone help? Thanks.
Aside from opinions on the original idea:
I would not use a regexp for that, for performance reasons. You could use strpos($html,'<') and strpos($html,'>') in combination with substr to extract the text string by string.
But if somebody (including you) ever has to change the results at another point, then I suggest you go the extra mile and implement a 'proper' translation.
My recommendation:
look into gettext
filter out the strings like mentioned above and generate a .mo -file
encapsulate the strings between the tags with the gettext-functions (like here)
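The "only replace outside of tags" idea mentioned earlier in this answer can be sketched as follows. This is a Python illustration rather than PHP, and it splits with a regex instead of strpos/substr, so treat it as a demonstration of the principle only; translate_outside_tags and the sample dictionary are hypothetical, and a > inside an attribute value would confuse it.

```python
import re


def translate_outside_tags(html, translation):
    """Apply the replacements to text segments only, leaving tags as they are."""
    # Splitting on a capturing group keeps the tags in the result:
    # even indexes hold text, odd indexes hold tags.
    parts = re.split(r'(<[^>]*>)', html)
    for i in range(0, len(parts), 2):
        for original, translated in translation.items():
            parts[i] = parts[i].replace(original, translated)
    return "".join(parts)


translated = translate_outside_tags(
    '<a title="Home" href="/">Home</a> page',
    {"Home": "Start", "page": "Seite"},
)
print(translated)
```

The title attribute keeps its original "Home" because it sits inside a tag, while the visible link text is translated.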
I'm new to php and am trying to write a regular expression using preg_match to extract the href value that I get from my http get.
The response looks like this:
{"_links":{"http://a.b.co/documents":{"href":"/docs"}}}
I want to extract only the href value and pass it to my next api... i.e. /docs.
Can anyone please tell me how to extract this?
I've been using http://www.solmetra.com/scripts/regex/index.php to test my regex.. and had no luck since last one day :(
please any help would be appreciated.
Thanks,
DR
No need for a regex.
Use json_decode() and then access the href property.
For example:
$data = json_decode('{"_links":{"http://a.b.co/documents":{"href":"/docs"}}}', true);
echo $data['_links']['http://a.b.co/documents']['href'];
Note: I'd encourage you to clean up your JSON if possible. Particularly the keys.
Don't use regex, use json_decode(). JSON is an excellent example of a context-free grammar that you shouldn't even try to parse with regex.
Here's PHP.NET's reference on using json_decode() for just this sort of thing.
Just like HTML parsing, I would recommend not using a REGEX but rather a json parser then reading the value. Check out json_encode and json_decode functions in php.
That said if you just need the href value then here is a regex to do just that on the example you gave
preg_match('/"href":"([^"]+)"/', $string, $matches);
$href = $matches[1]; // this is the href
Regex is only the right tool when you know exactly what you want and exactly the format it will be in. Often json and HTML from other parties can't be exactly predicted. There are also examples of certain legal HTML and json which can't properly be parsed with regex so in general use a specialized parser for them.
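For comparison, the two approaches from the answers above look like this side by side, sketched in Python (json.loads playing the role of json_decode). The response string is the one from the question; the variable names are made up.

```python
import json
import re

response = '{"_links":{"http://a.b.co/documents":{"href":"/docs"}}}'

# Parser route: decode, then index into the nested structure
data = json.loads(response)
href = data["_links"]["http://a.b.co/documents"]["href"]

# If the link key is not known in advance, walk over all links instead
hrefs = [link["href"] for link in data["_links"].values() if "href" in link]

# Regex route: works on this exact shape, but breaks on extra whitespace,
# escaped quotes, or a second "href" key elsewhere in the document
m = re.search(r'"href":"([^"]+)"', response)
href_via_regex = m.group(1) if m else None

print(href, hrefs, href_via_regex)
```

The parser version keeps working if the server starts pretty-printing the JSON, whereas the regex would stop matching as soon as a space appears after the colon.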
I have used file_get_contents() to basically get the source code of a site into a single string variable.
The source contains many rows that looks like this:
<td align="center">12345</td>
(and a lot of rows that don't look like that). I want to extract all the idnumbers (12345 above) and put them in an array. How can I do that? I assume I want to use some kind of regular expressions and then use the preg_match_all() function, but I'm not sure how...
Don't mess with regular expressions. Get the variable and let a DOM library do the mundane tasks for you. Take a look at: http://sourceforge.net/projects/simplehtmldom/
Then you can traverse your HTML like a tree and extract stuff. If you really want to get funky, read up on XPath.
Try this:
preg_match_all('/<td align="center">([0-9]+)<\/td>/', $str, $matches);
$values = $matches[1]; // array of all captured id numbers
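The capture-group approach for the rows shown in the question can be checked quickly with a small Python sketch (re.findall standing in for preg_match_all). The sample rows are invented, and the pattern assumes the cells are written exactly as in the question, with no extra attributes or whitespace.

```python
import re

# Hypothetical page source with matching and non-matching cells
source = '''<tr><td align="center">12345</td><td>name</td></tr>
<tr><td align="center">67890</td><td align="center">n/a</td></tr>'''

# Only cells whose entire content is digits are captured
id_numbers = re.findall(r'<td align="center">(\d+)</td>', source)
print(id_numbers)
```

If the markup varies at all (different attribute order, nested tags), the DOM approach from the other answer is the safer choice.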
The input: we get some plain text as an input string and we have to highlight all URLs in it with <a href={url}>{url}</a>.
For some time I've used a regex taken from http://flanders.co.nz/2009/11/08/a-good-url-regular-expression-repost/, which I modified several times, but it's built for another purpose: to check whether the whole input string is a URL or not.
So, what regex do you use for this kind of task?
UPD: it would be nice if answers were related to php :-[
Take a look at a couple of modules available on CPAN:
URI::Find
URI::Find::Schemeless
where the latter is a little more forgiving. The regular expressions are available in the source code (the latter's, for example).
For example:
#! /usr/bin/perl
use warnings;
use strict;
use URI::Find::Schemeless;
my $text = "http://stackoverflow.com/users/251311/zerkms is swell!\n";
URI::Find::Schemeless
  ->new(sub { qq[<a href="$_[0]">$_[1]</a>] })
  ->find(\$text);
print $text;
Output:
<a href="http://stackoverflow.com/users/251311/zerkms">http://stackoverflow.com/users/251311/zerkms</a> is swell!
For Perl, I usually use one of the modules defining common regex, Regexp::Common::URI::*. You might find a good regexp for you in the sources of those modules.
http://search.cpan.org/search?query=Regexp%3A%3ACommon%3A%3AURI&mode=module
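If a full URI-finding module is not available, a deliberately simple pattern can cover the common case of http/https links in plain text. The following Python sketch shows the idea; URL_RE and linkify are made-up names, and the pattern is a simplification that ignores schemeless URLs and many edge cases the CPAN modules handle.

```python
import html
import re

# Simplistic assumption: a URL starts with http:// or https:// and runs
# until whitespace, a quote, or an angle bracket.
URL_RE = re.compile(r'https?://[^\s<>"]+')


def linkify(text):
    """Escape plain text for HTML and wrap each URL in an <a> element."""
    parts = []
    last = 0
    for m in URL_RE.finditer(text):
        # Trailing punctuation usually belongs to the sentence, not the URL
        url = m.group(0).rstrip('.,;:!?)')
        end = m.start() + len(url)
        parts.append(html.escape(text[last:m.start()]))
        parts.append('<a href="%s">%s</a>' % (html.escape(url), html.escape(url)))
        last = end
    parts.append(html.escape(text[last:]))
    return "".join(parts)


linked = linkify("See http://stackoverflow.com/users/251311/zerkms, it is swell!")
print(linked)
```

Escaping the surrounding text matters here because the input is plain text: any < or & in it would otherwise end up as markup in the output.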