I want to improve my code but I do not know how to write the regexp.
I want to get all non-XHTML attributes in a tag.
So after the preg match I want to get :
array(
0 => "required",
1 => "autocomplete"
);
$balise = '<input id="myId" class="myClassA myClassB myClassC" required autocomplete/>';
I currently use this: preg_match_all("/(?<=\s)[\w]+(?=[\s\/>])/i", $balise, $attributs);
But with this regexp I get:
array(
0 => "myClassB",
1 => "required",
2 => "autocomplete"
);
I do not want to get myClassB...
Can anyone help me write my regex?
Thx
You can add the negative look-ahead (?![^=]*?") to make sure the next " doesn't precede the next =, that way you're getting only words that aren't within a quoted value. Single-quote the string so that the " in the regex won't terminate it.
preg_match_all('/(?<=\s)[\w]+(?=[\s\/>])(?![^=]*?")/i', $balise, $attributs);
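A minimal sketch of the look-ahead in action, assuming the tag is held in a single-quoted PHP string (the variable names are the ones from the question):

```php
<?php
// The tag from the question, quoted so that it is a valid PHP string.
$balise = '<input id="myId" class="myClassA myClassB myClassC" required autocomplete/>';

// A word preceded by whitespace, followed by whitespace, "/" or ">",
// and with no double quote reachable ahead without first crossing an "=".
preg_match_all('/(?<=\s)[\w]+(?=[\s\/>])(?![^=]*?")/i', $balise, $attributs);

print_r($attributs[0]);
```

The value-less attributes end up in $attributs[0]. Note that the look-ahead relies on attribute values not containing = themselves: a value such as class="x w y=z" would let w slip through.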
I have a parish website that I am maintaining. The site has a parish registration form on it. Yesterday someone submitted the form with spam. The submitter supplied an inappropriate web address in one of the fields.
I'm fairly confident this was not a bot form submission, as I use a reCAPTCHA and a honeypot to fend off bots.
What I'm trying to figure out is how on the processing page to look at all the text entry fields and scrub URLs.
Since the language is PHP:
function scrubURL(field){
if($_POST[field] contains **SomeURL**){
$field = str_replace(***SomeURL***, "", $_POST[field])
} else{
$field = $_POST[field];
}
return $field;
}
I'm just not sure how to check the field to see if it contains a URL.
I'm planning to scrub URLs by calling:
$first = scrubURL($first);
$last = scrubURL($last);
$address = scrubURL($address);
I will then use $first, $last & $address in the mail that gets sent to the parish office.
This function will recognize URLs and replace them with empty strings. Just realize that lots of things, such as wx.yz, look like valid URLs.
function scrubURL($field)
{
//return preg_replace('#((https?://)?([-\\w]+\\.[-\\w\\.]+)+\\w(:\\d+)?(/([-\\w/_\\.]*(\\?\\S+)?)?)*)(?:[?&][^?$]+=[^?&]*)*#i', '', $_POST[$field]);
return preg_replace("#((https?://|ftp://|www\.|[^\s:=]+@www\.).*?[a-z_\/0-9\-\#=&])(?=(\.|,|;|\?|\!)?(\"|'|«|»|\[|\s|\r|\n|$))#iS", '', $_POST[$field]);
}
The parameter, $field, has to be a string key, such as "email", corresponding to $_POST["email"].
<?php
$_POST = [
'email' => 'something www.badsite.com?site=21&action=redirect else',
];
function scrubURL($field)
{
return preg_replace('#((https?://)?([-\\w]+\\.[-\\w\\.]+)+\\w(:\\d+)?(/([-\\w/_\\.]*(\\?\\S+)?)?)*)(?:[?&]\S+=\S*)*#i', '', $_POST[$field]);
}
echo scrubURL('email');
Prints:
something else
Regex is an easy way to evaluate fields for possible URL markers. Something like the following would remove much of it (though, given how many shapes URLs can come in, not everything):
$_POST = [
'first' => 'actualname',
'last' => 'something http://url.com/visit-me',
'middle' => 'hello www.foobar.com spammer',
'other' => 'visit https://spammery.us/ham/spam spamming',
'more' => 'spam.tld',
];
// Iterates all $_POST fields, editing the $_POST array in place
foreach($_POST as $key => &$val) {
$val = scrubURL($val);
}
unset($val); // break the reference left over from the loop
function scrubURL($data)
{
/* Removes anything that:
- starts with http(s):
- starts with www.
- has a domain extension (2-5 characters)
... ending the match with the first space following the match.
*/
$data = preg_replace('#\b
(
https?:
|
www\.
|
[^\s]+\.[a-z]{2,5}
)
[^\s]*#x', '-', $data);
return $data;
}
print_r($_POST);
Be aware that the last condition, which looks for any TLD (.abc), and there are lots of them, may result in some false positives and misses:
"sentence.Poor punctuation" would be safe, because we're only matching [a-z]. However, spam.Com would also pass! Use [a-zA-Z] to match both cases, or add the "i" modifier to the regex.
"my acc.no is 12345" would be removed (potential spammer accountants from Norway?!)
The above process would give you the following filtered data:
Array (
[first] => actualname
[last] => something -
[middle] => hello - spammer
[other] => visit - spamming
[more] => -
)
The regex can definitely be further refined. ^_^
N.B. You may also want to sanitize the incoming data with e.g. strip_tags and htmlspecialchars to ensure the website is sending reasonably safe data to your parish.
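As a rough sketch of how the N.B. above could be combined with the URL scrubbing: the helper name cleanField is hypothetical, and the input is assumed to be UTF-8.

```php
<?php
// Hypothetical helper (not from the answer): strips tags, scrubs URL-like
// tokens case-insensitively, then escapes what is left for HTML output.
function cleanField(string $data): string
{
    $data = strip_tags($data);
    // Same three alternatives as in scrubURL; the "i" modifier makes
    // "spam.Com" match too, and "x" lets the pattern contain spaces.
    // The trailing [^\s]* (zero or more) also lets a bare two-letter TLD match.
    $data = preg_replace('#\b (?: https?: | www\. | [^\s]+\.[a-z]{2,5} ) [^\s]*#xi', '-', $data);
    return htmlspecialchars($data, ENT_QUOTES, 'UTF-8');
}

echo cleanField('<b>Bob</b> visit spam.Com & friends'), PHP_EOL;
```

Escaping last keeps the regex working on the raw text; whether escaping belongs here or at the point of output depends on how the mail is built.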
I have an index.php file with an array that holds messages. Is there a way to display the text with a new line instead of the <br> tag in PHP, so that I can also store it in a database?
The code:
$array = array(
array(
'id' => 1,
'message' => 'I\'m reading Harry Potter!',
),
array(
'id' => 2,
'message' => 'Ok. I just got a notification that you sent me a pin on Pinterest.<br>Will you come to school tomorrow?',
)
);
For example:
Ok. I just got a notification that you sent me a pin on Pinterest.
Will you come to school tomorrow?
The newline character is \n. Simply replace <br> with \n and you will have the result you're looking for (see: PHP - how to create a newline character?).
Note!
PHP does not process escape sequences within single quotes: '\n' is not processed as a newline character, while "\n" is (see: What is the difference between single-quoted and double-quoted strings in PHP?).
Depending on your platform, you may want to be more specific about which newline sequence you choose (see: \r\n, \r and \n what is the difference between them?).
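A small sketch of the replacement described above, using the message from the question; the variable names $message and $plain are only illustrative:

```php
<?php
$message = 'Ok. I just got a notification that you sent me a pin on Pinterest.<br>Will you come to school tomorrow?';

// Double quotes are required for the replacement: "\n" is a newline,
// while '\n' would be the two characters backslash and n.
$plain = str_replace('<br>', "\n", $message);

echo $plain, PHP_EOL;
var_dump(strlen('\n')); // int(2): backslash + n
var_dump(strlen("\n")); // int(1): a single newline character
```

If the stored text may also contain <br/> or <br />, a preg_replace with a pattern such as '#<br\s*/?>#i' would cover those variants as well.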
$array = array(
array(
'id' => 1,
'message' => "I'm reading Harry Potter!"
),
array(
'id' => 2,
'message' => "Ok. I just got a notification that you sent me a pin on Pinterest.\nWill you come to school tomorrow?"
)
);
echo "<pre>";
print_r($array);
exit;
This will show you the raw formatting of your string.
To then convert newlines to <br> tags for display on a webpage, pass that string to nl2br():
<?php echo nl2br($array[1]['message']); ?>
Is there a way that instead of <br> tag I can display the text with a new line in PHP
Yes you can easily do this with CSS
white-space: pre;
https://www.w3schools.com/cssref/pr_text_white-space.asp
Back in the day I used to do the whole "replace" thing, then I got bored of it. Now I just use CSS.
The pre setting will preserve whitespace much like the <pre> tag does. The only thing you have to watch for is indentation in the source code:
<p style="white-space:pre;">
<?php echo $something; ?>
</p>
The extra whitespace around the PHP tag will be added to the output. Instead, do this:
<p style="white-space:pre;"><?php echo $something; ?></p>
You can close the PHP tag, output the <br> as plain HTML, and reopen the tag after the break.
For example -
<?php echo 'Ok. I just got a notification that you sent me a pin on Pinterest.'; ?>
<br>
<?php echo 'Will you come to school tomorrow?'; ?>
I need a regular expression pattern that matches all characters, including whitespace, without capturing them into a variable in PHP.
<li class="xyz" data-name="abc">
<span id="XXX">some words</span>
<div data-attribute="values">
<a class="klm" href="http://example.com/blabla">somethings</a>
</div>
<div class="xyz sub" data-name="abc-sub"><img src="/images/any_image.jpg" class="qqwwee"></div>
</li><!--repeating li tags-->
I wrote a pattern:
preg_match_all('#<li((?s).*?)<div((?s).*?)href="((?s).*?)"((?s).*?)</li>#', $subject, $matches);
This works well but I don't want to get four variables. I just want to get
http://example.com/blabla
Also, can anyone tell me why this does not work?
preg_match_all('#<li[[?s].*?]<div[[?s].*?]href="((?s).*?)"[[?s].*?]</li>#', $subject, $matches);
Using (?:) will allow grouping but make those groups not captured, for example, the following:
#<li(?:(?s).*?)<div(?:(?s).*?)href="((?s).*?)"(?:(?s).*?)</li>#
Will output:
array (
0 =>
array (
0 => '<li class="xyz" data-name="abc">
<span id="XXX">some words</span>
<div data-attribute="values">
<a class="klm" href="http://example.com/blabla">somethings</a>
</div>
<div class="xyz sub" data-name="abc-sub"><img src="/images/any_image.jpg" class="qqwwee"></div>
</li>',
),
1 =>
array (
0 => 'http://example.com/blabla',
),
)
All of your matches will be contained in $matches[1], so iterate through that.
Don't use RegExps to parse HTML
Read this famous answer on StackOverflow.
HTML is not a regular language, so it cannot be reliably processed with a RegExp. Instead, use a proper (and robust) HTML parser.
Also note that data mining (analysis) != web-scraping (data collection).
If you don't want a regexp group to store the "captured" data, use a non-capturing group:
(?:some-complex-regexp-here)
In your case, the following may work:
(?s)<li.*?<div.*?href="([^"]*?)".*?</li>
But seriously, don't use regexps for this; regexps are fragile. Use an XPath expression like /li//div//a//@href instead.
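The parser-based approach could look roughly like this. It is a sketch that assumes PHP's bundled DOM extension is available and uses //li//div//a/@href (anchored anywhere in the document) as the query; the sample markup is shortened from the question.

```php
<?php
// Shortened sample markup; $subject is the variable name used in the question.
$subject = '<li class="xyz" data-name="abc">
<span id="XXX">some words</span>
<div data-attribute="values">
<a class="klm" href="http://example.com/blabla">somethings</a>
</div>
</li>';

$doc = new DOMDocument();
// Collect libxml warnings about imperfect HTML instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($subject);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$hrefs = [];
// href attributes of <a> elements nested in a <div> inside an <li>.
foreach ($xpath->query('//li//div//a/@href') as $attr) {
    $hrefs[] = $attr->nodeValue;
}

print_r($hrefs);
```

With repeating li elements, $hrefs simply receives one entry per matching link, so no capture groups have to be counted.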
I have an HTML page which has a very large and very complex chunk of JSON in a script tag.
I want to extract the JSON so that I can decode it in a php script.
The JSON looks something like:
<script type="text/javascript">
var user_list_data_obj = (
({
... truncated ...
})
);
... some more js ...
</script>
The script tags can't be used in the pattern, because there's other JS between them, and there's nothing to make them unique anyway.
I believe I need to match against the variable name, and the first occurrence of '}));' but my attempts to match that have failed.
What I've got so far is:
$pattern = '/var user_list_data_obj = \(\s\(({.*})\)\s\);/';
Which returns nothing.
What am I doing wrong in that pattern? I know it's difficult to match anything that has opening and closing delimiters, like JSON, with a regex, but it should be possible in this case, no?
EDIT:
I'm trying to get the entire "user_list_data_obj" object parsed into my php script. But really, the bits I'm interested in are the several "columns :[] " arrays, so if it's easier to get those out separately, it might make sense to do that.
The columns[] arrays look something like
columns : [
{ display_value : '<input type="checkbox" name="user" value="username">'},
{ display_value : 'username', sort_value : 'username'},
{ display_value : 'username', sort_value : 'username'},
{ display_value : 'Enabled', sort_value : '1' },
{ display_value : '<img class="" src="/enabled.gif">', sort_value : '1' },
{ display_value : '<img class="" src="/enabled.gif">', sort_value : '1' },
{ display_value : '<img class="" src="/enabled.gif">', sort_value : '1' }
],
I was able to match the entire json object with the following
/user_list_data_obj\s*=\s*\(\s*\({(.*?)}\)\s*\);/
But in actuality, I ended up using preg_match_all to match each columns[] array in the json by using:
/columns\s*:\s*\[.*?\],/s
The closest I can get is
preg_match('/var user_list_data_obj = \(\s+\(({.*})\)\s+\);/s', $html, $matches);
The s modifier allows . to match newlines.
This is imperfect as it makes assumptions about the structure: namely that the JSON you need literally starts with
( /* some space */
({
and ends with
}) /* some space */
);
If you can't make those assumptions, a less specific regex will likely match other parts of the script. Also, if }) ); appears again later in the script, the greedy .* will extend the match up to that last occurrence. Using {.*?} won't work because there can be many nested object literals in the string you want to capture.
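To illustrate the effect of the s modifier, here is a sketch on a hypothetical, shortened stand-in for the page. It assumes the embedded object is strict JSON; the columns arrays shown in the question use unquoted keys and single quotes, which json_decode would reject.

```php
<?php
// Hypothetical, shortened stand-in for the page source in the question.
$html = <<<'HTML'
<script type="text/javascript">
var user_list_data_obj = (
    ({
        "users": [{"name": "alice"}, {"name": "bob"}]
    })
);
var other = 1;
</script>
HTML;

// Without "s", "." stops at newlines, so the multi-line object is not matched.
$withoutS = preg_match('/var user_list_data_obj = \(\s+\(({.*})\)\s+\);/', $html, $m1);
// With "s", "." also matches newlines and the capture spans the whole object.
$withS = preg_match('/var user_list_data_obj = \(\s+\(({.*})\)\s+\);/s', $html, $m2);

var_dump($withoutS, $withS);

// The captured group is the text from the opening { to the closing }.
$data = json_decode($m2[1], true);
echo $data['users'][1]['name'], PHP_EOL;
```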
I know I can refer in replacement to dynamic parts of the term in regex in PHP:
preg_replace('/(test1)(test2)(test3)/',"$3$2$1",$string);
(Somehow like this, I don't know if this is correct, but its not what I am looking for)
I want that in the regex, like:
preg_match_all("~<(.*)>.*</$1>~",$string,$matches);
The first part between the "<" and ">" is dynamic (so every tag existing in HTML, and even custom XML tags, can be found), and I want to refer to it again in the same regex.
But it doesn't work for me. Is this even possible?
I have a server with PHP 5.3
/edit:
My final goal is this:
I have an HTML page with, e.g., the following source code:
<html>
<head>
<title>Titel</title>
</head>
<body>
<div>
<p>
p-test<br />
br-test
</p>
<div>
<p>
div-p-test
</p>
</div>
</div>
</body>
</html>
And after processing it should look like
$htmlArr = array(
'html' => array(
'head' => array('title' => 'Titel'),
'body' => array(
'div0' => array(
'p0' => 'p-test<br />br-test',
'div1' => array(
'p1' => 'div-p-test'
)
)
)
));
Placeholders in the replacement string use the $1 syntax. In the regex itself they are called backreferences and use the syntax \1 (a backslash followed by the group number).
http://www.regular-expressions.info/brackets.html
So in your case:
preg_match_all("~<(.*?)>.*?</\\1>~",$string,$matches);
The backslash is doubled here because, in PHP strings, the backslash escapes itself. (This matters in particular for double-quoted strings, where a single "\1" would be read as an octal escape and turned into the control character with code 1.)
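A short sketch of the backreference on a made-up input string (the sample in $string is an assumption, not taken from the question):

```php
<?php
$string = '<p>first</p><b>bold</b><i>mismatch</b>';

// "\\1" in the PHP string becomes \1 in the regex,
// meaning "the same text that group 1 matched".
preg_match_all("~<(.*?)>.*?</\\1>~", $string, $matches);

// Only tags whose closing name equals the opening name are found.
print_r($matches[1]);

// For comparison: "\1" in double quotes is the single byte chr(1),
// not a backreference.
var_dump("\1" === chr(1));
```

Keep in mind that with nested tags of the same name, such as the nested div in the question, the lazy .*? stops at the first closing tag, so this pattern alone will not build the nested array.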