What's the right pattern for this hidden input - php

I have this field returned by curl_exec:
<input name="NUMBER_R" type="hidden" value="1500000">
1500000 is a random number and may change; the other attributes are constant.
I tried:
preg_match ('/<input name="NUMBER_R" type="hidden" value="([^"]*)" \/>/', $result, $number)
and also:
preg_match ('/<input name=\'NUMBER_R\' type=\'hidden\' value=\'(\\d+)\'>/ims', $result, $number)
but no luck...
Here is the full code:
$result=curl_exec($cid);
curl_close($cid);
$number = array();
if (preg_match ('REGEX', $result, $number))
{
echo $number[1];
}
EDIT 1:
Sorry, I forgot [1] in echo $number[1];
Also 1500000 is a random number and may change

Description
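This regex will find the input tag that has the attributes name="number_r" and type="hidden" in any order, then pull out the contents of its value attribute. It requires the value text to be all digits. Because number_r is lowercase in the pattern, use it with the case-insensitive flag (as in the example below) so that it also matches NUMBER_R.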
<input\b\s+(?=[^>]*name=(["'])number_r\1)(?=[^>]*type=(["'])hidden\2)[^>]*value=(["'])(\d+)\3[^>]*>
<input\b\s+ consume the open bracket and the tag name, ensuring there is a word break followed by white space
(?=[^>]*name=(["'])number_r\1) look ahead to ensure this tag includes the correct name attribute
(?=[^>]*type=(["'])hidden\2) look ahead to ensure this tag also includes the type attribute
[^>]*value= move the cursor forward until we find the value= attribute
(["']) capture the open quote
(\d+) capture the substring and require it to be all digits
\3 match the correct close quote. This can be omitted, as you've already captured the desired substring.
[^>]*> match the rest of the characters in the tag. This can be omitted, as you've already captured the desired substring.
Groups
Group 0 gets the entire input tag
Group 1 gets the open quote for name, which is back referenced to ensure the correct close quote is matched
Group 2 gets the open quote for type, which is back referenced to ensure the correct close quote is matched
Group 3 gets the open quote for value, which is back referenced to ensure the correct close quote is matched
Group 4 gets the value in the attribute named value
PHP Code Example:
<?php
$sourcestring='<input name="NUMBER_R" type="hidden" value="1500000">';
preg_match('/<input\b\s+(?=[^>]*name=(["\'])number_r\1)(?=[^>]*type=(["\'])hidden\2)[^>]*value=(["\'])(\d+)\3[^>]*>/im',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>
$matches Array:
(
[0] => <input name="NUMBER_R" type="hidden" value="1500000">
[1] => "
[2] => "
[3] => "
[4] => 1500000
)
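For the exact markup in the question, a much simpler pattern may be enough. The first attempt fails because it requires " />" at the end of the tag, and the second because it expects single quotes around the attribute values. The following is only a minimal sketch, assuming the attribute order and the double quotes never change; $result here is a stand-in for the curl_exec output, and $m is just an illustrative variable name.

```php
<?php
// Stand-in for the string returned by curl_exec() (assumption for illustration).
$result = '<form><input name="NUMBER_R" type="hidden" value="1500000"></form>';

$number = null;
// \s*\/?> accepts both ">" and " />" as the end of the tag.
if (preg_match('/<input name="NUMBER_R" type="hidden" value="(\d+)"\s*\/?>/', $result, $m)) {
    $number = $m[1]; // group 1 holds the digits inside value="..."
}

echo $number, "\n";
```

If the attributes can be reordered or quoted differently, the lookahead version above is the safer choice.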

Try using DOM and XPath to get that.
$xml = new DOMDocument;
$xml->loadXML('<input name="NUMBER_R" type="hidden" value="1500000" />');
$xpath = new DOMXPath($xml);
// traverse all results
foreach ($xpath->query('//input[@name="NUMBER_R"]') as $rowNode) {
var_dump($rowNode->getAttribute('value'));
}
Tested: http://codepad.viper-7.com/8dwu9f
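Note that loadXML expects well-formed XML, and the tag in the question has no closing slash. For real HTML coming back from curl, loadHTML is probably the better fit. A rough sketch, assuming the DOM extension is available; $html is a made-up stand-in for the fetched page:

```php
<?php
// Stand-in for the HTML returned by curl_exec() (assumption for illustration).
$html = '<html><body><form><input name="NUMBER_R" type="hidden" value="1500000"></form></body></html>';

// Collect parser warnings instead of printing them; real pages are rarely clean.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$value = null;
// Take the first hidden input named NUMBER_R, if there is one.
foreach ($xpath->query('//input[@name="NUMBER_R"][@type="hidden"]') as $node) {
    $value = $node->getAttribute('value');
    break;
}

echo $value, "\n";
```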

Related

Extract string in brackets when there are other brackets embedded in quotes

I want to extract this bracketed part from a string:
[list items='["one","two"]' ok="no" b="c"]
I am using the following preg_match call:
preg_match('~\[([a-zA-Z0-9_]+)[ ]+([a-zA-Z0-9]+=[^\[]+)\]~s', $string,$match)
But I have trouble with the brackets that appear within quotes.
I have two files
theme.html
[list items=""one","[x]tw"'o"" ok="no" b="c""/]
#book
[button text="t'"extB1" name="ok"'" /]
Asdfz " s wr aw3r '
[button text="t"'extB2" name="no"'" /]
file.php
$string=file_get_contents('theme.html');
for (;;) {
if (!preg_match('~\[([a-zA-Z0-9_]+)[ ]+([a-zA-Z0-9]+=[^\[]+)\]~s', $string,$match)) {
exit;
}
$string=str_replace($match[0], '', $string);
echo "<pre><br>";
print_r($match);
echo "<br></pre>";
}
and this is output:
<pre><br>Array
(
[0] => [button text="textB1" name="ok"]
[1] => button
[2] => text="textB1" name="ok"
)
<br></pre>
<pre><br>Array
(
[0] => [button text="textB2" name="no"]
[1] => button
[2] => text="textB2" name="no"
)
<br></pre>
As you can see the output does not include
[list items='["one","two"]' ok="no" b="c"]
I know the problem is caused by the embedded square brackets, but I don't know how I can correct the code to ignore them.
You could use this variation of your preg_match call:
if (!preg_match('~\[(\w+)\s+(\w+=(?:\'[^\']*\'|[^\[])+?)\]~s', $string, $match))
With \'[^\']*\' it detects an opening quote and grabs all characters up to the next quote, without stopping at an opening bracket. Only if that cannot be matched does it fall back to the part you had: [^\[]. I added a ? after the +, to make the repetition non-greedy, so that it does not run past the closing ].
Note also that [a-zA-Z0-9_] can be shortened to \w, and [ ]+ can be written as \s+, which also allows other white space; I believe that is OK.
See it run on eval.in.
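In case the link goes stale, here is a small self-contained check of that pattern against the string from the question. It is just a sketch; the variable names are only for illustration.

```php
<?php
// The bracketed string from the question, with brackets inside single quotes.
$string = '[list items=\'["one","two"]\' ok="no" b="c"]';

// Quoted runs are consumed whole; anything else may not contain "[".
$pattern = '~\[(\w+)\s+(\w+=(?:\'[^\']*\'|[^\[])+?)\]~s';

$match = [];
preg_match($pattern, $string, $match);

print_r($match);
```

Here $match[1] holds the tag name (list) and $match[2] the whole attribute string, including the quoted brackets.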
Alternative: match complete lines only
If the quotes can appear anywhere without guarantee that closing brackets appear within quotes, then the above will not work.
Instead we could require that the match must span a complete line in the text:
if (!preg_match('~^\s*\[(\w+)\s+(\w+=.*?)\]\s*$~sm', $string, $match))
See it run on eval.in.
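A quick sketch of the line-based variant, using preg_match_all instead of the loop. The template text below is hypothetical (one tag per line, with a stray ] inside a double-quoted value), and $found is only an illustrative name.

```php
<?php
// Hypothetical template text: one bracketed tag per line.
$string = "[list items='[\"one\",\"two\"]' ok=\"no\"]\n"
        . "#book\n"
        . "[button text=\"ok]\" name=\"b1\"]\n";

// ^ and $ anchor on line boundaries because of the m modifier.
preg_match_all('~^\s*\[(\w+)\s+(\w+=.*?)\]\s*$~sm', $string, $matches, PREG_SET_ORDER);

$found = [];
foreach ($matches as $m) {
    $found[$m[1]] = $m[2]; // tag name => raw attribute string
}

print_r($found);
```

The closing ] is only accepted when nothing but white space follows it on the line, which is why the ] inside text="ok]" does not end the match early.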

PHP preg_split Input by <br>, <br/>, <p> into Separate Paragraphs

I am using cURL to fetch a page with very ill-formed markup. There is a particular snippet of the page I am trying to parse into paragraphs. This input snippet may be divided by <p> and </p> or separated by one or more <br> or <br/> tags. In cases where there are two <br> tags one after another, I don't want those to become two separate paragraphs.
The code I'm currently using to parse and display it is:
$paragraphs = preg_split('/(<\s*p\s*\/?>)|(<\s*br\s*\/?>)|(\s\s+)|(<\s*\/p\s*\/?>)/', $article, -1, PREG_SPLIT_NO_EMPTY);
$paragraphcount = count($paragraphs);
for($x = 1; $x <= $paragraphcount; $x++ )
{
echo "<p>".$paragraphs[$x-1]."</p>";
}
However, this is not working as expected. Some different inputs/outputs are as follows:
Input 1: first part </p> <p> second part </p> <p> third part </p> <p> fourth part <br/>
Output 1: <p>first part </p><p> </p><p>second part </p><p> </p><p> third part </p><p> </p><p>fourth part</p><p> </p>
My code is parsing the input into paragraphs; however, it's also adding extra paragraphs containing only a space.
Any help would be appreciated.
Input is UTF-8 if it makes a difference.
Here is a solution with preg_replace:
$article = "first part </p> <p> second part </p> <p> third part </p>
<p> fourth part <br/> <br> fifth part";
$healed = substr(
preg_replace('/(\s*<(\/?p|br)\s*\/?>\s*)+/u', "</p><p>", "<p>$article<p>"),
4, -3);
It first wraps the string in <p> and then replaces (repetitions of) the variants of breaks by </p><p>, to finally remove the starting </p> and ending <p>. Note that this does not produce an (intermediate) array, but the final string.
echo $healed;
outputs:
<p>first part</p><p>second part</p><p>third part</p><p>fourth part</p><p>fifth part</p>
Note that you need the u modifier at the end of the regular expression to get UTF-8 support.
If on the other hand you need the paragraphs in an array, then preg_split is better suited (using the same regular expression):
$paragraphs = preg_split('/(\s*<(\/?p|br)\s*\/?>\s*)+/u',
$article, -1, PREG_SPLIT_NO_EMPTY);
If you then write:
foreach ($paragraphs as $paragraph) {
echo "$paragraph\n";
}
You get:
first part
second part
third part
fourth part
fifth part
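If you want the array and the wrapped markup, the two can be combined with implode. A minimal sketch using the same sample input as above; $html is just an illustrative name:

```php
<?php
$article = "first part </p> <p> second part </p> <p> third part </p>\n"
         . "<p> fourth part <br/> <br> fifth part";

// Split on any run of <p>, </p>, <br> or <br/> tags plus surrounding white space.
$paragraphs = preg_split('/(\s*<(\/?p|br)\s*\/?>\s*)+/u', $article, -1, PREG_SPLIT_NO_EMPTY);

// Wrap each paragraph; produce nothing at all when no text was found.
$html = $paragraphs ? '<p>' . implode('</p><p>', $paragraphs) . '</p>' : '';

echo $html, "\n";
```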
print_r(preg_split('/((<\s*p\s*\/?>\s*)|(<\s*br\s*\/?>\s*)|(\s\s+)|(<\s*\/p\s*\/?>\s*))+/', $article, -1, PREG_SPLIT_NO_EMPTY));
result:
Array
(
[0] => first part
[1] => second part
[2] => third part
[3] => fourth part
)

Using RegEx to Capture All Links & In Between Text From A String

<Link to: http://www.someurl(.+)> maybe some text here(.*) <Link: www.someotherurl(.+)> maybe even more text(.*)
Given that this is all on one line, how can I match or better yet extract all full urls and text? ie. for this example I wish to extract:
http://www.someurl(.+) . maybe some text here(.*) . www.someotherurl(.+) . maybe even more text(.*)
Basically, <Link.*:.* would start each link capture and > would end it. Then all text after the first capture would be captured as well up until zero or more occurrences of the next link capture.
I have tried:
preg_match_all('/<Link.*?:.*?(https|http|www)(.+?)>(.*?)/', $v1, $m4);
but I need a way to capture the text after the closing >. The problem is that there may or may not be another link after the first one (of course there could also be no links to begin with!).
$string = "<Link to: http://www.someurl(.+)> maybe some text here(.*) <Link: www.someotherurl(.+)> maybe even more text(.*)";
$string = preg_split('~<link(?: to)?:\s*([^>]+)>~i',$string,-1,PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY);
echo "<pre>";
print_r($string);
output:
Array
(
[0] => http://www.someurl(.+)
[1] => maybe some text here(.*)
[2] => www.someotherurl(.+)
[3] => maybe even more text(.*)
)
You can use this pattern:
preg_match_all('~<link\b[^:]*:\s*\K(?<link>[^\s>]++)[^>]*>\s*(?<text>[^<]++)~i',
$txt, $matches, PREG_SET_ORDER);
foreach($matches as $match) {
printf("<br/>link: %s\n<br/>text: %s", $match['link'], $match['text']);
}
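Since the question mentions that a link may have no text after it (and that there may be no links at all), here is a hedged variation where the trailing text is optional. The pattern and the $pairs structure are my own sketch, not something from the answers above.

```php
<?php
$string = '<Link to: http://www.someurl(.+)> maybe some text here(.*) '
        . '<Link: www.someotherurl(.+)> maybe even more text(.*)';

// Group 1: the URL after the colon; group 2: any text up to the next "<" (may be empty).
preg_match_all('~<link[^:>]*:\s*([^\s>]+)[^>]*>([^<]*)~i', $string, $matches, PREG_SET_ORDER);

$pairs = [];
foreach ($matches as $m) {
    $pairs[] = ['link' => $m[1], 'text' => trim($m[2])];
}

print_r($pairs);
```

With no links in the input, preg_match_all simply finds nothing and $pairs stays an empty array.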

preg_replace regex, split string to array

I have a string from which I need to split some values into an array. What would be the best approach?
String can look like this:
<span class="17">118</span><span style="display: inline">.</span><span style="display:none"></span>
or
125<span class="17">25</span>354
The rules are:
The string can start with a number, followed by a span or a div
The string can start with a span or a div
The string can end with a number
The string can end with a /span or a /div
The divs/spans can have a style/class
What I need is to separate the string so that I get the elements individually, such as:
0 => 123
1 => <span class="potato">123</span>
2 => <span style="color: black">123</span>
I have tried some custom regex, but regex is not my strong suit:
$pattern = "/<div.(.*?)<\/div>|<span.(.*?)<\/span>/";
// I know it won't detect a number value prior to the div; that's also an issue, even if it worked
I cannot use simple_html_dom; it has to be done with regex.
Splitting the string between every >< might work, but ">(.*?)<" inserts after the < for some reason?
You might get better performance if you just load this string into the DOM and then parse it manually, programming your logic like:
var el = document.createElement( 'div' );
el.innerHTML = '125<span class="17">25</span>354';
// test your first element (125) index=0 (you can make for loop)
if(el.childNodes[0].nodeType == 3) alert('this is number first, validate it');
else if(el.childNodes[0].nodeType == 1) alert('this is span or div, test it');
// you can test for div or span with el.childNodes[0].nodeName
// store first element to your array
// then continue, test el.childNodes[next one, index=1 (span)...]
// then continue, test el.childNodes[next one, index=2 (354)...]
Since you already know what you are looking for, it can be as simple as that.
Try /(<(span|div)[^>]*>)*([^<]*)(<\/(span|div)>)*/
The regex says something like: there can be a span or div or nothing, then there has to be something, then a /span or /div or nothing, and that whole statement can match zero or more times.
Here is an example:
<?php
$pattern = "/(<(span|div)[^>]*>)*([^<]*)(<\/(span|div)>)*/";
$txt = '<span class="17">118</span><span style="display: inline">.</span><span style="display:none"></span>';
preg_match_all($pattern, $txt,$foo);
print_r($foo[0]);
$txt = '125<span class="17">25</span>354';
preg_match_all($pattern, $txt,$foo);
print_r($foo[0]);
?>
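Another way to get exactly the pieces listed in the question is to match either a whole span/div element or a run of plain text. This is only a sketch and assumes the elements are never nested inside an element of the same kind; the variable names are made up for the example.

```php
<?php
// Either a complete <span>...</span> / <div>...</div>, or text containing no "<".
$pattern = '~<(span|div)\b[^>]*>.*?</\1>|[^<]+~s';

$txt1 = '125<span class="17">25</span>354';
preg_match_all($pattern, $txt1, $m1);
$parts1 = $m1[0];

$txt2 = '<span class="17">118</span><span style="display: inline">.</span><span style="display:none"></span>';
preg_match_all($pattern, $txt2, $m2);
$parts2 = $m2[0];

print_r($parts1);
print_r($parts2);
```

The back reference \1 makes sure a span is closed by /span and a div by /div.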

Regex to replace reg trademark

I need some help with regex:
I have some HTML output and I need to wrap every registered trademark symbol (®) in <sup></sup>.
I cannot insert the <sup> tag in title and alt attributes, and obviously I don't need to wrap symbols that are already superscripted.
The following regex matches text that is not part of a HTML tag:
(?<=^|>)[^><]+?(?=<|$)
An example of what I'm looking for:
$original = '<div>asd® asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>';
The filtered string should output:
<div>asd<sup>®</sup> asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>
thanks a lot for your time!!!
Well, here is a simple way, if you agree to the following limitation:
Those symbols that are already processed have the </sup> right after the ®
echo preg_replace('#®(?!\s*</sup>|[^<]*>)#','<sup>®</sup>', $s);
The logic behind it is:
we replace only those ® which are not followed by </sup> and...
which are not followed by a > symbol without an opening < symbol before it
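To see both conditions at work, here is the same call run against the example string from the question (a small sketch; $s and $out are just illustrative names):

```php
<?php
$s = '<div>asd® asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>';

// Skip a ® that is directly followed by </sup>, and a ® that sits inside a tag
// (i.e. a ">" is reached before any "<").
$out = preg_replace('#®(?!\s*</sup>|[^<]*>)#', '<sup>®</sup>', $s);

echo $out, "\n";
```

Only the first ® is wrapped; the one already inside <sup> and the one in the alt attribute are left alone.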
I would really use an HTML parser in place of regular expressions, since HTML is not regular and will present more edge cases than you can dream of (ignoring your contextual limitations that you've identified above).
You don't say what technology you're using. If you post that up, someone can undoubtedly recommend the appropriate parser.
Regex is not enough for what you want. First you must write code to identify when content is a value of an attribute or a text node of an element. Then you must loop through all that content and use some replace method. I am not sure what it is in PHP, but in JavaScript it would look something like:
content[i] = content[i].replace(/®/g, "<sup>®</sup>");
I agree with Brian that regular expressions are not a good way to parse HTML, but if you must use regular expressions, you could try splitting the string into tokens and then running your regexp on each token.
I'm using preg_split to split the string on HTML tags, as well as on the sequence <sup>®</sup>. This leaves tokens that are either tags, an already superscripted ®, or plain text. Then for each plain-text token, ® can be replaced with <sup>®</sup>:
$regex = '/(<sup>®<\/sup>|<.*?>)/i';
$original = '<div>asd® asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>';
// we need to capture the tags so that the string can be rebuilt
$tokens = preg_split($regex, $original, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
/* $tokens => Array
(
[0] => <div>
[1] => asd® asdasd. asd
[2] => <sup>®</sup>
[3] => asd
[4] => <img alt="qwe®qwe" />
[5] => </div>
)
*/
foreach ($tokens as &$token)
{
if ($token[0] == "<") continue; // Skip tokens that are tags
$token = str_replace('®', '<sup>®</sup>', $token);
}
unset($token); // break the reference left over from the loop
$tokens = join("", $tokens); // reassemble the string
// $tokens => "<div>asd<sup>®</sup> asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>"
Note that this is a naive approach, and if the output isn't formatted as expected it might not parse the way you'd like (again, regular expressions are not good for HTML parsing ;) )
