Extract data PHP string - php

I have used file_get_contents() to basically get the source code of a site into a single string variable.
The source contains many rows that look like this:
<td align="center">12345</td>
(and a lot of rows that don't look like that). I want to extract all the ID numbers (12345 above) and put them in an array. How can I do that? I assume I want to use some kind of regular expression with the preg_match_all() function, but I'm not sure how...

Don't mess with regular expressions. Get the variable and let a DOM library do the mundane tasks for you. Take a look at: http://sourceforge.net/projects/simplehtmldom/
Then you can traverse your HTML like a tree and extract stuff. If you really want to get funky, read up on XPath.
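A minimal sketch of the DOM/XPath route, using PHP's bundled DOMDocument and DOMXPath classes rather than the simplehtmldom library linked above. The sample markup and the assumption that every wanted cell is a purely numeric <td align="center"> are illustrative, not taken from the real page:

```php
<?php
// Sample markup standing in for the string returned by file_get_contents().
$html = '<table><tr><td align="center">12345</td><td>name</td></tr>'
      . '<tr><td align="center">678</td><td align="left">x</td></tr></table>';

$doc = new DOMDocument();
// Real-world pages are rarely valid HTML, so collect libxml warnings quietly.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$ids = [];
// Select every <td> whose align attribute is exactly "center".
foreach ($xpath->query('//td[@align="center"]') as $cell) {
    $text = trim($cell->textContent);
    // Keep only purely numeric cells (assumed here to be the ID numbers).
    if (ctype_digit($text)) {
        $ids[] = $text;
    }
}

print_r($ids);
```

The XPath expression does the filtering that the regex would otherwise have to do, and it keeps working if the attribute order or whitespace inside the tag changes.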

Try this:
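// preg_match_all() finds every row; the parentheses capture only the digits.
preg_match_all('/<td align="center">([0-9]+)<\/td>/', $str, $matches);
$values = $matches[1]; // array of the captured ID numbers, e.g. "12345"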

Related

Using php regex to translate output buffer, but not within HTML tags

I have an array with strings to translate ($translation), and I want to use it to translate the output buffer. However, it should not replace within html tags. I have tried using php DOM, but this is too slow and probably too complex for what I want to do.
I currently use this code, but this of course also translates between tags.
$output = ob_get_clean();
foreach($translation as $original => $translated) {
$output = str_replace($original,utf8_encode($translated),$output);
}
I guess I should use a regular expression to replace not within HTML tags, but I can't seem to find the correct expression to do this. Can anyone help? Thanks.
Aside from opinions on the original idea:
I would not use a regexp for that, for performance reasons. You could use strpos($html,'<') and strpos($html,'>') in combination with substr() to extract the text string by string.
But if somebody (including you) ever has to change the results at another point, then I suggest you go the extra mile and implement a 'proper' translation.
My recommendation:
look into gettext
filter out the strings as mentioned above and generate a .mo file
encapsulate the strings between the tags with the gettext functions (like here)
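The strpos()/substr() idea above could be sketched roughly as follows. The function name translate_outside_tags is hypothetical, and strtr() is swapped in for the str_replace() loop from the question so that already translated text is not translated a second time. The sketch assumes that no attribute value contains a literal > and it does not treat script, style, or comment content specially:

```php
<?php
// Hypothetical helper: applies $translation only to text outside <...> tags.
function translate_outside_tags(string $html, array $translation): string
{
    $out = '';
    $pos = 0;
    $len = strlen($html);
    while ($pos < $len) {
        $open = strpos($html, '<', $pos);
        if ($open === false) {
            // No more tags: the rest is plain text.
            $out .= strtr(substr($html, $pos), $translation);
            break;
        }
        // Translate the text that precedes the tag.
        $out .= strtr(substr($html, $pos, $open - $pos), $translation);
        $close = strpos($html, '>', $open);
        if ($close === false) {
            // Unterminated tag: copy the remainder untouched.
            $out .= substr($html, $open);
            break;
        }
        // Copy the tag itself verbatim, attributes included.
        $out .= substr($html, $open, $close - $open + 1);
        $pos = $close + 1;
    }
    return $out;
}

// Sample data standing in for the output buffer and the $translation array.
$translation = ['title' => 'titel', 'Hello' => 'Hallo'];
echo translate_outside_tags('<p title="x">Hello title</p>', $translation);
```

In this sample the attribute name title inside the tag is left alone, while the same word in the text content is translated.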

Regular expression to extract json response in php

I'm new to php and am trying to write a regular expression using preg_match to extract the href value that I get from my http get.
The response looks like this:
{"_links":{"http://a.b.co/documents":{"href":"/docs"}}}
I want to extract only the href value and pass it to my next api... i.e. /docs.
Can anyone please tell me how to extract this?
I've been using http://www.solmetra.com/scripts/regex/index.php to test my regex, and have had no luck for the past day :(
Any help would be appreciated.
Thanks,
DR
No need for a regex.
Use json_decode() and then access the href property.
For example:
$data = json_decode('{"_links":{"http://a.b.co/documents":{"href":"/docs"}}}', true);
echo $data['_links']['http://a.b.co/documents']['href'];
Note: I'd encourage you to clean up your JSON if possible. Particularly the keys.
Don't use regex, use json_decode(). JSON is an excellent example of a context-free grammar that you shouldn't even try to parse with regex.
Here's PHP.NET's reference on using json_decode() for just this sort of thing.
Just like HTML parsing, I would recommend not using a REGEX but rather a json parser then reading the value. Check out json_encode and json_decode functions in php.
That said if you just need the href value then here is a regex to do just that on the example you gave
preg_match('/"href":"([^"]+)"/',$string,$matches);
$matches[1];// this is the href
Regex is only the right tool when you know exactly what you want and exactly the format it will be in. Often json and HTML from other parties can't be exactly predicted. There are also examples of certain legal HTML and json which can't properly be parsed with regex so in general use a specialized parser for them.

Counterpart to PHP’s preg_match in Python

I am planning to move one of my scrapers to Python. I am comfortable using preg_match and preg_match_all in PHP. I am not finding a suitable function in Python similar to preg_match. Could anyone please help me in doing so?
For example, if I want to get the content between <a class="title" and </a>, I use the following function in PHP:
preg_match_all('/a class="title"(.*?)<\/a>/si',$input,$output);
Whereas in Python I am not able to figure out a similar function.
You are looking for Python's re module.
Take a look at re.findall and re.search.
And since you mention that you are trying to parse HTML, use an HTML parser for that. There are a couple of options available in Python, such as lxml or BeautifulSoup.
Take a look at this: Why you should not parse html with regex
I think you need something like this:
import re

# Raw string, so the pattern needs no escaping; "/" is not special in Python.
output = re.search(r'a class="title"(.*?)</a>', input, flags=re.IGNORECASE)
if output is not None:
    output = output.group(0)
    print(output)
You can add (?s) at the start of the regex so that . also matches newlines (DOTALL mode, the counterpart of PHP's s modifier):
output = re.search(r'(?s)a class="title"(.*?)</a>', input, flags=re.IGNORECASE)
if output is not None:
    output = output.group(0)
    print(output)
You might be interested in reading about Python Regular Expression Operations

PHP Extract Text from Webpage

Is it possible to do something with PHP where I can set up a connection to a URL like http://en.wikipedia.org/wiki/Wiki and extract any words that contain a prefix like "Exa" and "ins" such that the resulting PHP page will print out all the words that it found. For example with "Exa", the word "Example" would be printed out each time it found an instance of "Example". Same thing for words that start with "ins".
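$data = strip_tags(file_get_contents($url));
// \b anchors each prefix at the start of a word; \w* consumes the rest of the word.
preg_match_all('/\b(?:Exa|ins)\w*/', $data, $matches);
// $matches[0] holds every full match, one entry per occurrence.
foreach ($matches[0] as $word) {
    echo "Match: '" . $word . "'\r\n";
}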
Probably something like this, though I'm not so sure about the regex, I haven't tested it yet...
Edit: I changed it, it should work now... (\B => \b and strip_tags to prevent HTML-classes from being matched).
I don't have a full answer with example to give you, but yes, you should be able to read the whole page into a string variable and then do normal string operations on it. It will read in all the HTML, so you will probably need to do a lot of regex to eliminate tags if you don't want them.
Read the page into a string using file_get_contents. Use one of the various string functions to examine the page.
Yes, this is possible. A potential approach would be to:
Use something like fopen (if allow_url_fopen is enabled; failing that, use cURL) to grab the external web page content.
Remove the (presumably not required) HTML tags via strip_tags.
Use strtok to tokenise and iterate over the remaining content, checking for whatever conditions you require.
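The last two steps above might look like the following sketch. The helper name words_with_prefix and the delimiter set are assumptions chosen for illustration, and a literal string stands in for the fetched page so that the example runs without network access:

```php
<?php
// Hypothetical helper: returns every word in $html that starts with one of $prefixes.
function words_with_prefix(string $html, array $prefixes): array
{
    $text = strip_tags($html);
    $found = [];
    // Split on whitespace and common punctuation (an assumed delimiter set).
    $delims = " \t\r\n.,;:!?()[]\"'";
    for ($tok = strtok($text, $delims); $tok !== false; $tok = strtok($delims)) {
        foreach ($prefixes as $prefix) {
            // Case-sensitive prefix check, so "ins" does not match "Inside".
            if (strncmp($tok, $prefix, strlen($prefix)) === 0) {
                $found[] = $tok;
                break;
            }
        }
    }
    return $found;
}

// A literal string stands in for the content fetched via fopen or cURL.
$html = '<p class="instance">Example one, an <b>instance</b>. Examples inside.</p>';
print_r(words_with_prefix($html, ['Exa', 'ins']));
```

Because strip_tags() runs first, the class attribute value "instance" inside the tag is never seen by the tokenizer; only words in the visible text are reported.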

PHP regex templating - find all occurrences of {{var}}

I need some help with creating a regex for my php script. Basically, I have an associative array containing my data, and I want to use preg_replace to replace some place-holders with real data. The input would be something like this:
<td>{{address}}</td><td>{{fixDate}}</td><td>{{measureDate}}</td><td>{{builder}}</td>
I don't want to use str_replace, because the array may hold many more items than I need.
If I understand correctly, preg_replace is able to take the text that it finds from the regex, and replace it with the value of that key in the array, e.g.
<td>{{address}}</td>
get replaced with the value of $replace['address']. Is this true, or did I misread the php docs?
If it is true, could someone please help show me a regex that will parse this for me (would appreciate it if you also explain how it works, since I am not very good with regexes yet).
Many thanks.
Use preg_replace_callback(). It's incredibly useful for this kind of thing.
$replace_values = array(
    'test' => 'test two',
);
$result = preg_replace_callback('!\{\{(\w+)\}\}!', 'replace_value', $input);
function replace_value($matches) {
    global $replace_values;
    return $replace_values[$matches[1]];
}
Basically this says find all occurrences of {{...}} containing word characters and replace that value with the value from a lookup table (being the global $replace_values).
For well-formed HTML/XML parsing, consider using the Document Object Model (DOM) in conjunction with XPath. It's much more fun to use than regexes for that sort of thing.
To not have to use global variables and gracefully handle missing keys you can use
function render($template, $vars) {
    return \preg_replace_callback("!{{\s*(?P<key>[a-zA-Z0-9_-]+?)\s*}}!", function($match) use($vars){
        return isset($vars[$match["key"]]) ? $vars[$match["key"]] : $match[0];
    }, $template);
}
