How to extract a portion of a string in PHP

I am using preg_match() to extract part of a string.
$str = "<aa>Let's find the stuff qwe in between <id>12345</id> these two previous brackets</h>";
$do = preg_match("/qwe(.*)12345/", $str, $matches);
This works fine and gives the following result:
$matches[0] = "qwe in between <id>12345"
$matches[1] = " in between <id>"
But when I use the same logic to extract from the following string, it fails:
<text>
<src><![CDATA[<TEXTFORMAT LEADING="2"><P ALIGN="LEFT"><FONT FACE="Arial" SIZE="36" COLOR="#999999" LETTERSPACING="0" KERNING="0">r1 text 1 </FONT></P></TEXTFORMAT>]]></src>
<width>45%</width>
<height>12%</height>
<left>30.416666666666668%</left>
<top>3.0416666666666665%</top>
<begin>2s</begin>
<dur>10s</dur>
<transIn>fadeIn</transIn>
<transOut>fadeOut</transOut>
<id>E2159292994B083ACA7ABC7799BBEF3F7198FFA2</id>
</text>
I want to extract the string from
r1text1
to
</id>
The regular expression I currently have is:
preg_match('/r1text1(.*)</id\>/', $metadata, $matches);
where $metadata is the string above.
$matches comes back empty. How do I do this?
Thanks in advance.

If you want to extract the text, you will probably want to use preg_match. The following might work:
preg_match('#<P[^>]*><FONT[^>]*>(.*</id>)#s', $metadata, $matches);
Whatever gets matched in the parentheses can be found later in the $matches array. In this case that is everything between a <P> tag followed by a <FONT> tag and </id>, including the latter. The s modifier lets the dot match the newlines in between.
The regex above is untested but might give you a general idea of how to do it. Adapt it if your needs are a bit different :)

Even though I don't know why you would match the regex against an incomplete XML fragment (starting within a <![CDATA[ and ending right before the closing XML tag </id>), you do have three obvious problems with your regex:
As Amri said: you have to escape the / character in the closing XML tag because you use / as the pattern delimiter. By the way, you don't have to escape the > character. That gives you: '/r1text1(.*)<\/id>/'. Alternatively, you can change the pattern delimiter, to # for example: '#r1text1(.*)</id>#' (I will use the first pattern to further develop the expression).
As Rich Adams already said: the text in your example data is "r1_text_1" (_ is a space character) but you match against '/r1text1(.*)<\/id>/'. You have to include the spaces in your regex or allow for an uncertain number of spaces, such as '/r1(?:\s*)text(?:\s*)1(.*)<\/id>/' (the ?: is the syntax for non-capturing subpatterns).
The . (dot) in your regex does not match newlines by default. You have to add the s (PCRE_DOTALL) pattern modifier to let the . (dot) match newlines as well: '/r1(?:\s*)text(?:\s*)1(.*)<\/id>/s'
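The three fixes above can be put together in a short sketch. The $metadata value here is a shortened, hypothetical stand-in for the XML in the question, and the variable names $found and $foundWithoutS are only for illustration:

```php
<?php
// Hypothetical, shortened stand-in for the asker's $metadata (three lines).
$metadata = <<<'XML'
<src><![CDATA[<FONT FACE="Arial">r1 text 1 </FONT>]]></src>
<dur>10s</dur>
<id>E2159292994B083ACA7ABC7799BBEF3F7198FFA2</id>
XML;

// Escaped delimiter, optional whitespace, and the s (PCRE_DOTALL) modifier.
$pattern = '/r1\s*text\s*1(.*)<\/id>/s';
$found = preg_match($pattern, $metadata, $matches);

// Same pattern without s: the dot stops at the first newline,
// so </id> on a later line is never reached.
$foundWithoutS = preg_match('/r1\s*text\s*1(.*)<\/id>/', $metadata);

var_dump($found);         // int(1)
var_dump($foundWithoutS); // int(0)
echo $matches[1], "\n";   // everything between "r1 text 1" and the closing </id>
```

Note that the captured text still contains the leftover tags between the two markers, so it may need further cleanup depending on what you want to keep.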

You probably need to parse your string/file and extract the value between the FONT tags, then insert the value into the id tag.
Try searching for PHP parsing.

Try this:
preg_match('/r1text1(.*)<\/id\>/', $metadata, $matches);
You are using / as the pattern delimiter, but your pattern also contains a / in </id>. You can use \ as the escape character.

In the sample you have "r1 text 1 ", yet your regular expression has "r1text1". The regular expression doesn't match because there are spaces in the string you are trying to match it against. You should include the spaces in the regular expression.

Related

Using preg_replace to reformat bbcode

I am migrating a forum from phpBB to native PHP and I need to parse some BBCode tags that have a uid inside. This is the code to convert them into regular BBCode without the uid:
$regex = "#\[quote:(.*)=(.*)\](.+)\[/quote:(.+)\]#isU";
$text = "outside sample
[quote:c1891a7ad3]
text with link https://www.facebook.com/groups/35688476100/?fref=ts [/quote:c1891a7ad3]
outside text
[quote:c1891a7ad3="Budi"]
written by me , - budi
[/quote:c1891a7ad3]";
preg_replace($regex,"[quote=$2]$3[\quote]",$text);
but the result is not the expected one:
"outside sample
[quote:c1891a7ad3]
text with link https://www.facebook.com/groups/35688476100/?fref=ts [/quote:c1891a7ad3]
outside text
[quote="Budi"]
written by me , - budi
[\quote]"
How should the regex be modified to yield expected result?
You have a mismatch between the pattern and the actual string you test against. In the pattern, you have / in [/quote] and in the string, you have \ ([\quote:c1891a7ad3]).
So, if your actual string in fact has /, all you need to fix is the (.*) part as the dot matches any character (including ]) and thus can overmatch even with lazy matching.
So, use
$regex = "#\[quote:([^]]*)=([^]]*)\](.+)\[/quote:([^]]+)\]#isU";
In this regex, I am using a negated character class [^]]* that matches 0 or more characters other than ]. It makes sure we only match text inside [...]. In the original pattern, the first (.*) matches c1891a7ad3] followed by a line break and text with link https://www.facebook.com/groups/35688476100/?fref, so we need to restrict this somehow.
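As a rough check of the negated character class, the sketch below runs the corrected pattern against the sample text from the question. It assumes the closing tags really use a forward slash, and it writes [/quote] in the replacement, which is an assumption about the intended output rather than what the question's code used:

```php
<?php
$text = "outside sample\n"
      . "[quote:c1891a7ad3]\n"
      . "text with link https://www.facebook.com/groups/35688476100/?fref=ts [/quote:c1891a7ad3]\n"
      . "outside text\n"
      . "[quote:c1891a7ad3=\"Budi\"]\n"
      . "written by me , - budi\n"
      . "[/quote:c1891a7ad3]";

// [^]]* cannot cross a closing bracket, so the uid and the author
// must both sit inside the same [quote:...] tag.
$regex = '#\[quote:([^]]*)=([^]]*)\](.+)\[/quote:([^]]+)\]#isU';

// Assumed replacement: keep the author ($2) and the body ($3), drop the uid.
$result = preg_replace($regex, '[quote=$2]$3[/quote]', $text);

echo $result, "\n";
```

Only the second quote block carries an = sign inside its opening tag, so only that block is rewritten; the first block is left as it was.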

How to write a regular expression to match the following pattern in PHP?

There is a website from which I would like to get every string matching the pattern <td> (any content) </td>.
So I wrote this:
preg_match("/<td>.*</td>/", $web , $matches);
die(var_dump($matches));
That returns null. How can I fix this? Thanks for helping.
You are just not escaping the delimiter properly, I guess.
Also use groups to capture what you need.
<td>(.*)<\/td>
should do. Don't forget to match globally if you want ALL the td's (preg_match_all in PHP).
Usually parsing HTML with regex is not a good idea, try to use DOM parsers instead.
Example -> http://simplehtmldom.sourceforge.net/
Test the above regex with
$web = file_get_contents('http://www.w3schools.com/html/html_tables.asp' );
preg_match_all("/<td>(.*)<\/td>/", $web , $matches);
print_r( $matches);
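For the DOM parser route mentioned above, a minimal sketch using PHP's bundled DOMDocument class could look like this. It assumes the dom extension is enabled (it usually is), and the $web markup is a hypothetical stand-in for the downloaded page:

```php
<?php
// Hypothetical markup standing in for the downloaded page.
$web = '<table><tr><td>alpha</td><td><b>beta</b></td></tr></table>';

// Collect parser warnings internally instead of printing them;
// real-world HTML is rarely perfectly valid.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($web);
libxml_clear_errors();

// Gather the text content of every <td> element.
$cells = [];
foreach ($doc->getElementsByTagName('td') as $td) {
    $cells[] = $td->textContent;
}

print_r($cells); // two entries: alpha and beta
```

Unlike the regex, this does not care about attributes on the td tag or about nested markup inside the cell.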
Lazy Quantifier, Different Delimiter
You need .*? rather than .*, otherwise you can overshoot the closing </td>. Also, your / delimiter needed to be escaped when it appeared in </td>. We can replace it with another one that doesn't need escaping.
Do this:
$regex = '~<td>.*?</td>~';
preg_match_all($regex, $web, $matches);
print_r($matches[0]);
Explanation
The ~ is just an esthetic tweak: you can use any delimiter you like around your regex pattern, and in general ~ is more versatile than /, which needs to be escaped more often, for instance in </td>.
The star quantifier in .*? is made "lazy" by the ? so that the dot only matches as many characters as needed to allow the next token to match (shortest match). Without the ?, the .* first matches the whole string, then backtracks only as far as needed to allow the next token to match (longest match).
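The difference between the greedy and the lazy quantifier is easiest to see side by side. This is a small sketch with a made-up one-line table row as input; a capture group is added so that only the cell content is shown:

```php
<?php
// Hypothetical one-line table row standing in for the downloaded page.
$web = '<tr><td>alpha</td><td>beta</td></tr>';

// Greedy: .* runs to the end of the line, then backtracks to the LAST </td>.
preg_match_all('~<td>(.*)</td>~', $web, $greedy);

// Lazy: .*? stops at the FIRST </td> it can reach.
preg_match_all('~<td>(.*?)</td>~', $web, $lazy);

print_r($greedy[1]); // one element: alpha</td><td>beta
print_r($lazy[1]);   // two elements: alpha, beta
```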

Regular expression for contents within <td> and </td>

I need to find a regular expression to use for finding the content within <td> and </td> tags for use in PHP. I have tried...
preg_split("<td>([^\"]*)</td>", $table[0]);
But that gives me the PHP error...
Warning: preg_split(): Unknown modifier '(' in C:\xampp\htdocs\.....
Can anyone tell me what I am doing wrong?
Try this:
preg_match("/<td>([^\"]*)<\/td>/", $table[0], $matches);
But, as a general rule, please, do not try to parse HTML with regexes... :-)
Keep in mind that you need to do some extra work to make sure that the * between <td> and </td> in your regular expression doesn't slurp up entire lines of <td>some text</td>. That's because * is pretty greedy.
To toggle off the greediness of *, you can put a ? after it - this tells it to grab only up to the first occurrence of whatever comes after the *. So, the regular expression you're looking for is something like:
/<td>(.*?)<\/td>/
Remember, since the regular expression starts and ends with a /, you have to be careful about any / that is inside your regular expression - they have to be escaped. Hence, the \/.
From your regular expression, it looks like you're also trying to exclude any " character that might be between a <td> and </td> - is that correct? If that were the case, you would change the regular expression to use the following:
/<td>([^\"]*?)<\/td>/
But, assuming you don't want to exclude the " character in your matches, your PHP code could look like this, using preg_match_all instead of preg_match.
preg_match_all("/<td>(.*?)<\/td>/", $str, $matches);
print_r($matches);
What you're looking for is in $matches[1].
Use preg_match instead of preg_split
preg_match("|<td>([^<]*)</td>|", $table[0], $m);
print_r($m);
First of all, you forgot to wrap the regex in delimiters. Also, you shouldn't specify the closing td tag in the regex.
Try the following code. Assuming $table[0] contains the HTML between the <table> and </table> tags, it extracts any content (including HTML) from the cells of the table:
$a_result = array_map(
function($v) { return preg_replace('/<\/td\s*>/i', '', $v); },
array_slice(preg_split('/<td[^>]*>/i', $table[0]), 1)
);
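The "Unknown modifier '('" warning from the question follows from the missing delimiters: PHP accepts < and > as a bracket-style delimiter pair, so it reads the regex as just td and tries to interpret everything after the > as modifiers. A small sketch of the failing and the working call, using a made-up two-cell input:

```php
<?php
// The question's pattern: < and > are taken as delimiters, the regex
// becomes "td", and "(" is parsed as an (unknown) modifier.
// The @ only hides the warning for this demonstration.
$broken = @preg_split("<td>([^\"]*)</td>", '<td>a</td>');
var_dump($broken); // bool(false)

// With explicit | delimiters the whole expression is the pattern.
preg_match_all('|<td>([^<]*)</td>|', '<td>a</td><td>b</td>', $m);
print_r($m[1]); // two elements: a, b
```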

Replacing a string using preg_match

I'm having trouble using preg_match to find and replace a string. The string of interest is:
<span style="font-size:0.6em">EXPIRATION DATE: 04/30/2011</span>
I need to target and replace the date, "04/30/2011", with a different date. Can someone throw me a bone and give me the regular expression to match this pattern using preg_match in PHP? I also need it to match in such a way that it only replaces up to the first closing span and not closing span tags later in the code, e.g.:
<span style="font-size:0.6em">EXPIRATION DATE: 04/30/2011</span><span class="hello"></span>
I'm not versed in regex, and although I've spent the last hour trying to learn enough to make this work, I'm utterly failing. Thanks so much!
EDIT: As you can see this has gotten me exhausted. I did mean preg_replace, not preg_match.
If you're after a replacement, consider using preg_replace(), something like
preg_replace('#(\d{2})/(\d{2})/(\d{4})#', '<new date>', $string);
How about this:
$toBeFoundPattern = '/([0-9][0-9])\/([0-9][0-9])\/([0-9][0-9][0-9][0-9])/';
$toBeReplacedPattern = '$2.$1.$3';
$inString = '<span style="font-size:0.6em">EXPIRATION DATE: 04/30/2011</span>';
// Will convert from US date format 04/30/2011 to european format 30.04.2011
echo preg_replace( $toBeFoundPattern, $toBeReplacedPattern, $inString );
and prints
<span style="font-size:0.6em">EXPIRATION DATE: 30.04.2011</span>
Patterns always begin and end with identical so-called delimiter characters. Often the character / is used.
$1 references the string matched by the first ([0-9][0-9]), $2 the string matched by the second ([0-9][0-9]), and $3 the four digits matched by the last (...).
[...] matches a single character, which is one of those listed inside the brackets. E.g. [a-z] matches any lower case letter.
To use the special character / inside a pattern, you need to escape it with \ to make it a literal slash character.
Update: Using {..}, as pointed out in another answer, is shorthand for repeated patterns.
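The same replacement written with the {n} shorthand and a # delimiter might look like the sketch below. The replacement date 12/31/2012 is made up for the example, and the fourth argument of preg_replace (the limit) is used to touch only the first date found:

```php
<?php
$inString = '<span style="font-size:0.6em">EXPIRATION DATE: 04/30/2011</span><span class="hello"></span>';

// # as the delimiter means the / inside the date needs no escaping.
$pattern = '#(\d{2})/(\d{2})/(\d{4})#';

// Swap month and day: 04/30/2011 becomes 30.04.2011.
$swapped = preg_replace($pattern, '$2.$1.$3', $inString);

// Replace the date outright; the limit of 1 stops after the first match.
// '12/31/2012' is a made-up replacement date.
$replaced = preg_replace($pattern, '12/31/2012', $inString, 1);

echo $swapped, "\n";
echo $replaced, "\n";
```

Because the pattern matches only the date itself, the surrounding span tags are never part of the match, so later closing span tags are not affected.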
Regex should be:
(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.](19|20)\d\d
If you want to only match one instance, this is OK. For multiple instances, use preg_match_all instead. Taken from http://www.regular-expressions.info/regexbuddy/datemmddyyyy.html.
Edit: are you looking to just search and replace inside a PHP script or do you want to do some javascript live replacement?
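To try the stricter month/day/year pattern above in PHP, it needs delimiters. Since the pattern itself contains a literal /, the sketch below assumes ~ as the delimiter; the variable names are only for illustration:

```php
<?php
// ~ as the delimiter, because the pattern contains a literal /.
$datePattern = '~(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.](19|20)\d\d~';

$html = '<span style="font-size:0.6em">EXPIRATION DATE: 04/30/2011</span>';

if (preg_match($datePattern, $html, $m)) {
    echo $m[0], "\n"; // 04/30/2011
    echo $m[1], "\n"; // 04 (the month)
}

// A month of 13 is rejected by the (0[1-9]|1[012]) alternative.
$badMonth = preg_match($datePattern, 'EXPIRATION DATE: 13/30/2011');
var_dump($badMonth); // int(0)
```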

regular expression to strip attributes and values from html tags

Hi guys, I'm very new to regex; can you help me with this?
I have a string like "<input attribute='value' >", where attribute='value' could be anything, and I want to use preg_replace to get just <input />.
How do I specify a wildcard to replace any number of any characters in a string?
Like this? preg_replace("/<input.*>/",$replacement,$string);
Many thanks
What you have:
.*
will match "any character, and as many as possible".
What you mean is
[^>]+
which translates to "any character that's not a >, and there must be at least one",
or alternatively,
.*?
which means
"any character, but only enough to make this rule work".
BUT DON'T.
Parsing HTML with regexps is bad.
Use any of the existing HTML parsers, DOM libraries, anything, just NOT NAÏVE REGEX.
For example:
<foo attr=">">
Will get grabbed wrongly by regex as
'<foo attr=" ' with following text of '">'
Which will lead you to this regex:
<[a-zA-Z]+( [a-zA-Z]+=['"][^"']*['"])*> etc etc
at which point you'll discover this lovely gem:
<foo attr="'>\'\"">
and your head will explode.
( the syntax highlighter verifies my point, and incorrectly matches thinking i've ended the tag. )
Some people were close... but not 100%:
This:
preg_replace("/<input[^>]*>/", $replacement, $string);
should be this:
preg_replace("/<input[^>]*?>/", $replacement, $string);
You don't want that to be a greedy match.
preg_replace("/<input[^>]*>/", $replacement, $string);
// [^>] means "any character except the greater than symbol / right tag bracket"
This is really basic stuff, you should catch up with some reading. :-)
If I understand the question correctly, you have the code:
preg_replace("/<input.*>/",$replacement,$string);
and you want us to tell you what you should use for $replacement to delete what was matched by .*
You have to go about this the other way around. Use capturing groups to capture what you want to keep, and reinsert that into the replacement. E.g.:
preg_replace("/(<input).*(>)/","$1$2",$string);
Of course, you don't really need capturing groups here, as you're only reinserting literal text. But the above shows the technique, in case you want to do this in a situation where the tag can vary. This is a better solution:
preg_replace("/<input [^>]*>/","<input />",$string);
The negated character class is more specific than the dot. This regex will work if there are two HTML tags in the string. Your original regex won't.
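The two-tag case is a quick way to see the difference. The input string below is made up for the example:

```php
<?php
// Hypothetical input containing two tags on one line.
$string = "<input type='text' name='a' > and <input attribute='value' >";

// Greedy dot: runs from the first <input to the LAST > on the line.
$greedy = preg_replace('/<input.*>/', '<input />', $string);

// Negated class: each match stops at the first > after <input.
$perTag = preg_replace('/<input [^>]*>/', '<input />', $string);

echo $greedy, "\n"; // <input />
echo $perTag, "\n"; // <input /> and <input />
```

As noted earlier in the thread, a > inside an attribute value would still break the negated-class version, which is where a real HTML parser becomes the safer choice.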
