PHP multiline preg_replace to extract portion of an HTML document

I am trying to parse an HTTP response (an HTML document) to extract portions of it, but I am unable to get the desired results. Here is what I have so far:
<?php
// a sample of HTTP document that I am trying to parse
$http_response = <<<'EOT'
<dl><dt>Server Version: Apache</dt>
<dt>Server Built: Apr 4 2010 17:19:54
</dt></dl><hr /><dl>
<dt>Current Time: Wednesday, 10-Oct-2012 06:14:05 MST</dt>
</dl>
I do not need anything below this, including this line itself
......
EOT;
echo $http_response;
echo '********************';
$count = -1;
$a = preg_replace("/(Server Version)([\s\S]*?)(MST)/", "$1$2$3", $http_response, -1, $count);
echo "<br> count: $count" . '<br>';
echo $a;
I still see the string "I do not need ..." in the output. I do not need that string. What am I doing wrong?
How do I easily remove all other HTML tags as well?
Thanks for your help.
-Amit

You are matching everything from Server Version until MST. And only the part that is matched will later be modified by preg_replace. Everything not covered by the regex remains untouched.
So to replace the string part before your first anchor, and the text following, you also must match them first.
$a = preg_replace("/^.*(Server Version)(.*?)(MST).*$/s", "$1$2$3", $http_response);
See the ^.* and .*$. Both will be matched, but aren't mentioned in the replacement pattern; so they get dropped.
Of course, it might also be simpler to just use preg_match() in such cases.
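The preg_match() alternative could look roughly like this. It is a minimal sketch that reuses the sample input from the question; the variable name $extracted is an assumption made for illustration.

```php
<?php
// Sample input taken from the question.
$http_response = <<<'EOT'
<dl><dt>Server Version: Apache</dt>
<dt>Server Built: Apr 4 2010 17:19:54
</dt></dl><hr /><dl>
<dt>Current Time: Wednesday, 10-Oct-2012 06:14:05 MST</dt>
</dl>
I do not need anything below this, including this line itself
EOT;

// The s modifier lets . match newlines, so the lazy .*? can span several lines.
if (preg_match('/Server Version.*?MST/s', $http_response, $m)) {
    $extracted = $m[0]; // only the matched portion, nothing before or after it
} else {
    $extracted = '';    // no match found
}
echo $extracted;
```

Because preg_match() returns only what matched, there is no need to consume and discard the surrounding text.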

You need to match the other characters before and after your regex as well, like:
/.*?(Server Version)([\s\S]*?)(MST).*/s
The 's' flag makes the dot match newlines too, so the leading .*? and the trailing .* can consume the text on the surrounding lines. You'll need it.
To remove html tags, use strip_tags.
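A short sketch of the strip_tags() step, assuming the fragment has already been extracted by one of the patterns above. The whitespace cleanup afterwards is an optional extra, not part of the original answer.

```php
<?php
// Fragment as it might look after the extraction step (from the question's sample).
$extracted = "Server Version: Apache</dt>\n"
           . "<dt>Server Built: Apr 4 2010 17:19:54\n"
           . "</dt></dl><hr /><dl>\n"
           . "<dt>Current Time: Wednesday, 10-Oct-2012 06:14:05 MST";

// strip_tags() removes the HTML tags and keeps the text between them.
$text = strip_tags($extracted);

// Optional: collapse the leftover line breaks and spaces into single spaces.
$text = trim(preg_replace('/\s+/', ' ', $text));

echo $text;
```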

Related

How can I print array values in an echo string?

I extract some HTML information from a website using file_get_contents and I want to print it with echo.
How can I do that?
$page = file_get_contents('*******');
preg_match("/<span class=\"a-text-strike\".*span>/", $page, $precio_antes);
preg_match("/<span id=\"priceblock_ourprice\".*span>/", $page, $precio_ahora);
preg_match("/<td class=\"a-span12 a-color-price a-size-base\".*td>/", $page, $precio_descuento);
echo "Antes: ".$precio_antes. "Ahora: " .$precio_ahora." (-" .$precio_descuento. "%)";

You should change the regular expressions a bit because you will not always get the expected results:
Make the .* lazy, otherwise it will skip some closing </span> tags and go right to the last one: .*?;
Make the . also match newlines, otherwise matches will fail to find anything if the opening and closing tags are not on the same line: use the s pattern modifier;
As tags can be nested, make sure to test on the ending tag, including the slash: /span> instead of just span>;
The result in the third argument of preg_match is an array, so you need to take the element of your interest: [0]
Adapted Code:
preg_match("/<span\s+class=\"a-text-strike\".*?\/span>/si", $page, $precio_antes);
preg_match("/<span\s+id=\"priceblock_ourprice\".*?\/span>/si", $page, $precio_ahora);
preg_match("/<td\s+class=\"a-span12\s+a-color-price\s+a-size-base\".*?\/td>/si", $page, $precio_descuento);
echo "Antes: {$precio_antes[0]} Ahora: {$precio_ahora[0]} (-{$precio_descuento[0]}%)";
Parse the DOM
Using regular expressions for parsing HTML is not advised in general. There will always be cases where it goes wrong because the provider changed the order of attributes, or replaced the double quotes by single quotes, ... etc. You really should have a look at DOMDocument to parse HTML.
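As a rough illustration of the DOMDocument approach, the sketch below looks up two of the elements with XPath. The sample markup in $page is hypothetical (the real page structure is not shown in the question), and the exact-match class test is a simplifying assumption.

```php
<?php
// Hypothetical sample markup; the structure of the real page is an assumption.
$page = '<div><span class="a-text-strike">$20.00</span>'
      . '<span id="priceblock_ourprice">$15.00</span></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely valid; keep warnings quiet
$doc->loadHTML($page);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// item(0) returns null when nothing matched, so guard before reading textContent.
$antes = $xpath->query('//span[@class="a-text-strike"]')->item(0);
$ahora = $xpath->query('//span[@id="priceblock_ourprice"]')->item(0);

$precio_antes = $antes ? trim($antes->textContent) : '';
$precio_ahora = $ahora ? trim($ahora->textContent) : '';

echo "Antes: $precio_antes Ahora: $precio_ahora";
```

Unlike the regex version, this does not depend on attribute order or on which kind of quotes the page uses.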

$precio_descuento is an array, so use
$precio_descuento[0] instead of $precio_descuento,
or implode('', $precio_descuento).
PS: the same applies to the other variables.

Remove all non php data from string

I want to be able to remove all non php data from a string / file.
Now this preg_replace line works perfectly:
preg_replace('/\?>.*\<?/', '', $src); // Remove all non php data
BUT... problem is that it works only for the first match and not for all of the string/file...
Small tweak needed here ;)

It would be simpler the other way round:
preg_match_all('~<\?.+?\?>~s', $src, $m);
$php = implode('', $m[0]);
Matching non-php blocks is much trickier, because they can also occur before the first php block and after the last one: blah <? php ?> blah.
Also note that no regex solution can handle <?'s inside php strings, as in:
<? echo "hi ?>"; ?>
You have to use the tokenizer (token_get_all()) to parse this correctly.
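A minimal sketch of the tokenizer approach, assuming the tokenizer extension is available (it is enabled by default). The sample uses the full <?php tag, because whether a bare <? is recognized depends on the short_open_tag setting.

```php
<?php
// Sample source with a "?>" inside a PHP string, which defeats the regex approach.
$src = 'before <?php echo "hi ?>"; ?> after';

$php = '';
foreach (token_get_all($src) as $token) {
    if (is_array($token)) {
        // Array tokens are [id, text, line]; T_INLINE_HTML is the non-PHP data.
        if ($token[0] === T_INLINE_HTML) {
            continue;
        }
        $php .= $token[1];
    } else {
        // Single-character tokens such as ";" are returned as plain strings.
        $php .= $token;
    }
}

echo $php;
```

The tokenizer knows that the ?> inside the quoted string does not end the PHP block, so only "before " and " after" are dropped.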

PHP regex to conditionally replace the first occurrence of a string

I need to do some cleanup on strings that look like this:
$author_name = '<a href="http://en.wikipedia.org/wiki/Robert_Jones_Burdette>Robert Jones Burdette </a>';
Notice the href tag doesn't have closing quotes - I'm using the DOMParser on a large table of these to extract the text, and it borks on this.
I would like to look at the string in $author_name;
IF the first > does NOT have a " before it, replace it with "> to close the tag correctly. If it is okay, just skip and do the next step. Be sure not to replace the second > at all.
Using php regex, I haven't been able to find a working solution - I could chop up the whole thing and check its parts, but that would be slow and I think there must be a regex that can do what I want.
TIA

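What you can do is find the first closing angle bracket, with or without a double quote (") before it, and replace it with ">. Anchoring the pattern at the start of the string (with a limit of 1 as a safeguard) ensures that later > characters are left alone:
$author_name = preg_replace('/^([^>]*?)"?>/', '$1">', $author_name, 1);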
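To check the behaviour on both the broken and the already-correct form, here is a small sketch. The helper name close_first_tag_quote() is hypothetical, and the pattern is an anchored variant that stops at the first > and replaces at most once; it assumes the string starts with the opening tag, as in the question.

```php
<?php
// Hypothetical helper for illustration; the name and signature are assumptions.
function close_first_tag_quote(string $html): string
{
    // ^([^>]*?)  everything before the first ">" (lazy, so an existing quote is left to "?)
    // "?>        an optional closing quote followed by the first ">"
    // The limit of 1 guarantees that later ">" characters are never touched.
    return preg_replace('/^([^>]*?)"?>/', '$1">', $html, 1);
}

$broken = '<a href="http://en.wikipedia.org/wiki/Robert_Jones_Burdette>Robert Jones Burdette </a>';
$fine   = '<a href="http://en.wikipedia.org/wiki/Robert_Jones_Burdette">Robert Jones Burdette </a>';

$fixed     = close_first_tag_quote($broken);
$unchanged = close_first_tag_quote($fine);

echo $fixed, "\n", $unchanged, "\n";
```

Both calls yield the same, correctly quoted tag, so the function can be applied to every row without checking first.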

http://www.barattalo.it/html-fixer/
Download that, then include it in your php.
The rest is quite easy:
$dirty_html = ".....bad html here......";
$a = new HtmlFixer();
$clean_html = $a->getFixedHtml($dirty_html);
It's common for people to want to use regular expressions, but you must remember that HTML is not regular.

Email string parsing using regex

I am trying to do a complicated (to me) regex on a multi-line snip from an e-mail. I have tried hard, with no luck. I am trying to get rid of anything from "On " through " wrote:"
Would be nice if you can also check to see if it contains the word "AcmeCompany", so it doesn't check for everything "On " "wrote:"
So far, I have this: /On(.*)AcmeCompany(.*)/im but it does not work...
say hello, world!
On Tue, Jun 7, 2011 at 6:18 AM, AcmeCompany <
24a95f49f7ce573fds2d+c#AcmeCompany.com> wrote:
Thank you for the responses, but it seems like there's another problem.
EDIT: I found out that this works: /On[\s\S]+?AcmeCompany[\s\S]+?wrote:/m, but it seems to fail when the e-mail contents have word "On".
say hello, world!
On a plane!
On Tue, Jun 7, 2011 at 6:18 AM, AcmeCompany <
24a95f49f7ce573fds2d+c#AcmeCompany.com> wrote:
EDIT2: Every mail client is different... Gmail tends to do it in 2 lines, the mail app on the iPhone does it in 1 line, so it doesn't always follow a strict format.
One thing is for sure: the beginning always uses "On " and the end is always " wrote:". It also contains a hash and AcmeCompany, which I can also use to verify.

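For the new requirement I am adding another reply. Hope you won't mind.
Can you try something like this? Requiring a day name after "On " skips lines such as "On a plane!":
/On\s(Mon|Tue|Wed|Thu|Fri|Sat|Sun)[\s\S]+?AcmeCompany[\s\S]+?wrote:/
Here is another attempt. Without the s modifier, . does not match newlines, so this requires "On" and "AcmeCompany" to be on the same line:
/On.+?AcmeCompany[\s\S]+?wrote:/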
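A quick sketch of the day-name variant applied to the two-line sample from the question. The day list including Sun and the trailing "Thank you" line are assumptions added so the example is complete.

```php
<?php
// Sample mail body based on the question, including the tricky "On a plane!" line.
$mail = "say hello, world!\n"
      . "On a plane!\n"
      . "On Tue, Jun 7, 2011 at 6:18 AM, AcmeCompany <\n"
      . "24a95f49f7ce573fds2d+c#AcmeCompany.com> wrote:\n"
      . "Thank you";

// "On " must be followed by a day name, so "On a plane!" cannot start a match.
// [\s\S]+? lazily crosses line breaks until AcmeCompany, then until "wrote:".
$pattern = '/On\s(Mon|Tue|Wed|Thu|Fri|Sat|Sun)[\s\S]+?AcmeCompany[\s\S]+?wrote:/';

$clean = preg_replace($pattern, '', $mail);
echo $clean;
```

This still relies on the client writing an English three-letter day name right after "On ", which may not hold for every mail client.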

Hope this helps:
/On[\s\S]+?AcmeCompany[\s\S]+?wrote:/
The regular expression above first matches On, then lazily matches any characters, including newlines ([\s\S] covers both whitespace and non-whitespace), until it finds AcmeCompany. It then does the same again until it finds wrote:

This will work when the whole attribution is on a single line (without the s modifier, . does not match newlines):
On.*AcmeCompany.*
Maybe offtopic but...
If you want to learn regex you should try Expresso

To get the string before On Tue,Jun...:
$str = explode ('On', $yourstring);
$oldstr = array_pop($str); //Remove the last value of the $str array
echo trim( implode('On',$str) ); //Trim the string to remove any unnecessary line breaks
To find if the hidden message contains AcmeCompany:
if( strstr ( $oldstr , 'AcmeCompany' ) ) {
echo "I found AcmeCompany!";
} else {
echo "I didn't find AcmeCompany!";
}
Hope my answer is useful, even though I didn't use regex.

Try this: /On.*AcmeCompany <$[^:]+:/im. The m modifier is important, as it lets $ match at the end of each line rather than only at the end of the string.

Highlight words from a search string

I wrote a little search script for a client. It works and words get highlighted, BUT...
Imagine this situation:
search term: test
found result: Hello this is a test
In this example both 'test' in the href part and between the <a> tags get highlighted, breaking the link.
How could I prevent this?
Edit:
So this is what I need: A regex replace function that replaces all matched search strings EXCEPT the ones that are located inside a href attribute

You cannot reliably parse XML (or HTML) with regular expressions. :( If you want a dirty regex solution that still works in many cases, you may try this regex.
">[^<]*?(test)"
First you look for a tag's closing angle bracket and then you make sure that no other tag is opened in between.
Ideally you want to parse HTML and replace only the textual parts of it.

Got it!
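$body = $row['body'];
// Escape regex metacharacters in the search term; the lookahead rejects matches inside a tag
$pattern = "/".preg_quote($search_string, "/")."(?!([^<]+)?>)/i";
// $0 is the matched text, so the original upper/lower case is preserved
$replacement = "<strong class='highlite'>$0</strong>";
$altered_body = preg_replace($pattern, $replacement, $body);
print($altered_body);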
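The lookahead trick can be tried in isolation with the sketch below. The function name highlight_outside_tags() and the sample link markup are assumptions (the original markup of the search result is not shown in the question).

```php
<?php
// Hypothetical helper for illustration; the name and signature are assumptions.
function highlight_outside_tags(string $body, string $search): string
{
    // (?!([^<]+)?>) fails when a ">" follows before any "<", i.e. when the
    // match sits inside a tag such as an href attribute.
    $pattern = '/' . preg_quote($search, '/') . '(?!([^<]+)?>)/i';
    // $0 re-inserts the matched text, so the original casing is kept.
    return preg_replace($pattern, "<strong class='highlite'>$0</strong>", $body);
}

// Assumed sample: the search term appears both in the href and in the link text.
$body = '<a href="test.html">Hello this is a test</a>';
$result = highlight_outside_tags($body, 'test');
echo $result;
```

Only the occurrence in the link text is wrapped. A literal > in the text itself would still confuse the lookahead, which is one reason parsing the HTML and touching only text nodes is the more robust route.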

We Keep Coding
