I am trying to extract a specific JavaScript object from a page containing the usual HTML markup.
I have tried to use regex but i don't seem to be able to get it to parse the HTML correctly when the HTML contains a line break.
An example can be seen here: https://regex101.com/r/b8zN8u/2
The HTML i am trying to extract looks like this:
<script>
DATA.tracking.user = {
age: "19",
name: "John doe"
}
</script>
Using the following regex: DATA.tracking.user=(.*?)}
<?php
$re = '/DATA.tracking.user = (.*?)\}/m';
$str = '<script>
DATA.tracking.user = { age: "19", name: "John doe" }
</script>';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
If i parse DATA.tracking.user = { age: "19", name: "John doe" } without any linebreaks, Then it works fine but if i try to parse:
DATA.tracking.user = {
age: "19",
name: "John doe"
}
It does not like dealing with the line breaks.
Any help would be greatly appreciated.
Thanks.
You will need to specify whitespaces (\s) in your pattern in order to parse the javascript code containing linebreaks.
For example, if you use the following code:
<?php
$re = '/DATA.tracking.user = \{\s*.*\s*.*\s*\}/';
$str = '<script>
DATA.tracking.user = {
age: "19",
name: "John doe"
}
</script>';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
print_r($matches[0]);
?>
You will get the following output:
Array
(
[0] => DATA.tracking.user = {
age: "19",
name: "John doe"
}
)
The simple solution to your problem is to use the s pattern modifier to command the . (any character) to also match newline characters -- which it does not by default.
And you should:
escape your literal dots.
write the \{ outside of your capture group.
omit the m pattern modifier because you aren't using anchors.
...BUT...
If this was my task and I was going to be processing the data from the extracted string, I would probably start breaking up the components at extraction-time with the power of \G.
Code: (Demo) (Pattern Demo)
$htmls[] = <<<HTML
DATA.tracking.user = { age: "19", name: "John doe", int: 55 } // This works
HTML;
$htmls[] = <<<HTML
DATA.tracking.user = {
age: "20",
name: "Jane Doe",
int: 49
} // This does not works
HTML;
foreach ($htmls as $html) {
var_export(preg_match_all('~(?:\G(?!^),|DATA\.tracking\.user = \{)\s+([^:]+): (\d+|"[^"]*")~', $html, $out, PREG_SET_ORDER) ? $out : []);
echo "\n --- \n";
}
Output:
array (
0 =>
array (
0 => 'DATA.tracking.user = { age: "19"',
1 => 'age',
2 => '"19"',
),
1 =>
array (
0 => ', name: "John doe"',
1 => 'name',
2 => '"John doe"',
),
2 =>
array (
0 => ', int: 55',
1 => 'int',
2 => '55',
),
)
---
array (
0 =>
array (
0 => 'DATA.tracking.user = {
age: "20"',
1 => 'age',
2 => '"20"',
),
1 =>
array (
0 => ',
name: "Jane Doe"',
1 => 'name',
2 => '"Jane Doe"',
),
2 =>
array (
0 => ',
int: 49',
1 => 'int',
2 => '49',
),
)
---
Now you can simply iterate the matches and work with [1] (the keys) and [2] (the values). This is a basic solution, that can be further tailored to suit your project data. Admittedly, this doesn't account for values that contain an escaped double-quote. Adding this feature would be no trouble. Accounting for more complex value types may be more of a challenge.
You need to add the 's' modifier to the end of your regex - otherwise, "." does not include newlines. See this:
s (PCRE_DOTALL)
If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, newlines are excluded. This modifier is equivalent to Perl's /s modifier. A negative class such as [^a] always matches a newline character, independent of the setting of this modifier.
So basically change your regex to be:
'/DATA.tracking.user = (.*?)\}/ms'
Also, you should quote your other dots (otherwise you will match "DATAYtrackingzZuser". So...
'/DATA\.tracking\.user = (.*?)\}/ms'
I'd also add in the open curly bracket and not enforce the single space around the equal sign, so:
'/DATA\.tracking\.user\s*=\s*\{(.*?)\}/ms'
Since you seem to be scraping/reading the page anyway (so you have a local copy), you can simply replace all the newline characters in the HTML page with whitespace characters, then it should work perfectly without even changing your script.
Refer to this for the ascii values:
https://www.techonthenet.com/ascii/chart.php
Related
I have a php function that splits product names from their color name in woocommerce.
The full string is generally of this form "product name - product color", like for example:
"Boxer Welbar - ligth grey" splits into "Boxer Welbar" and "light grey"
"Longjohn Gari - marine stripe" splits into "Longjohn Gari" and "marine stripe"
But in some cases it can be "Tee-shirt - product color"...and in this case the split doesn't work as I want, because the "-" in Tee-shirt is detected.
How to circumvent this problem? Should I use a "lookahead" statement in the regexp?
function product_name_split($prod_name) {
$currenttitle = strip_tags($prod_name);
$splitted = preg_split("/–|[\p{Pd}\xAD]|(–)/", $currenttitle);
return $splitted;
}
I'd go for a negative lookahead.
Something like this:
-(?!.*-)
that means to search for a - not followed by any other -
This works if in the color name there will never be a -
What about counting space characters that surround a dash?
For example:
function product_name_split($prod_name) {
$currenttitle = strip_tags($prod_name);
$splitted = preg_split("/\s(–|[\p{Pd}\xAD]|(–))\s/", $currenttitle);
return $splitted;
}
This automatically trims spaces from split parts as well.
If you have - as delimiter (note the spaces around the dash), you may simply use explode(...). If not, use
\s*-(?=[^-]+$)\s*
or
\w+-\w+(*SKIP)(*FAIL)|-
with preg_split(), see the demos on regex101.com (#2)
In PHP this could be:
<?php
$strings = ["Tee-shirt - product color", "Boxer Welbar - ligth grey", "Longjohn Gari - marine stripe"];
foreach ($strings as $string) {
print_r(explode(" - ", $string));
}
foreach ($strings as $string) {
print_r(preg_split("~\s*-(?=[^-]+$)\s*~", $string));
}
?>
Both approaches will yield
Array
(
[0] => Tee-shirt
[1] => product color
)
Array
(
[0] => Boxer Welbar
[1] => ligth grey
)
Array
(
[0] => Longjohn Gari
[1] => marine stripe
)
To collect the splitted items, use array_map(...):
$splitted = array_map( function($item) {return preg_split("~\s*-(?=[^-]+$)\s*~", $item); }, $strings);
Your sample inputs convey that the neighboring whitespace around the delimiting hyphen/dash is just as critical as the hyphen/dash itself.
I recommend doing all of the html and special entity decoding before executing your regex -- that's what these other functions are built for and it will make your regex pattern much simpler to read and maintain.
\p{Pd} will match any hyphen/dash. Reinforce the business logic in the code by declaring a maximum of 2 elements to be generated by the split.
As a general rule, I discourage declaring single-use variables.
Code: (Demo)
function product_name_split($prod_name) {
return preg_split(
"/ \p{Pd} /u",
strip_tags(
html_entity_decode(
$prod_name
)
),
2
);
}
$tests = [
'Tee-shirt - product color',
'Boxer Welbar - ligth grey',
'Longjohn Gari - marine stripe',
'En dash – green',
'Entity – blue',
];
foreach ($tests as $test) {
echo var_export(product_name_split($test, true)) . "\n";
}
Output:
array (
0 => 'Tee-shirt',
1 => 'product color',
)
array (
0 => 'Boxer Welbar',
1 => 'ligth grey',
)
array (
0 => 'Longjohn Gari',
1 => 'marine stripe',
)
array (
0 => 'En dash',
1 => 'green',
)
array (
0 => 'Entity',
1 => 'blue',
)
As usual, there are several options for this, this is one of them
explode — Split a string by a string
end — Set the internal pointer of an array to its last element
$currenttitle = 'Tee-shirt - product color';
$array = explode( '-', $currenttitle );
echo end( $array );
I have a PHP array that looks like this..
Array
(
[0] => post: 746
[1] => post: 2
[2] => post: 84
)
I am trying to remove the post: from each item in the array and return one that looks like this...
Array
(
[0] => 746
[1] => 2
[2] => 84
)
I have attempted to use preg_replace like this...
$array = preg_replace('/^post: *([0-9]+)/', $array );
print_r($array);
But this is not working for me, how should I be doing this?
You've missed the second argument of preg_replace function, which is with what should replace the match, also your regex has small problem, here is the fixed version:
preg_replace('/^post:\s*([0-9]+)$/', '$1', $array );
Demo: https://3v4l.org/64fO6
You don't have a pattern for the replacement, or a empty string place holder.
mixed preg_replace ( mixed $pattern , mixed $replacement , mixed $subject)
Is what you are trying to do (there are other args, but they are optional).
$array = preg_replace('/post: /', '', $array );
Should do it.
<?php
$array=array("post: 746",
"post: 2",
"post: 84");
$array = preg_replace('/^post: /', '', $array );
print_r($array);
?>
Array
(
[0] => 746
[1] => 2
[2] => 84
)
You could do this without using a regex using array_map and substr to check the prefix and return the string without the prefix:
$items = [
"post: 674",
"post: 2",
"post: 84",
];
$result = array_map(function($x){
$prefix = "post: ";
if (substr($x, 0, strlen($prefix)) == $prefix) {
return substr($x, strlen($prefix));
}
return $x;
}, $items);
print_r($result);
Result:
Array
(
[0] => 674
[1] => 2
[2] => 84
)
There are many ways to do this that don't involve regular expressions, which are really not needed for breaking up a simple string like this.
For example:
<?php
$input = Array( 'post: 746', 'post: 2', 'post: 84');
$output = array_map(function ($n) {
$o = explode(': ', $n);
return (int)$o[1];
}, $input);
var_dump($output);
And here's another one that is probably even faster:
<?php
$input = Array( 'post: 746', 'post: 2', 'post: 84');
$output = array_map(function ($n) {
return (int)substr($n, strpos($n, ':')+1);
}, $input);
var_dump($output);
If you don't need integers in the output just remove the cast to int.
Or just use str_replace, which in many cases like this is a drop in replacement for preg_replace.
<?php
$input = Array( 'post: 746', 'post: 2', 'post: 84');
$output = str_replace('post: ', '', $input);
var_dump($output);
You can use array_map() to iterate the array then strip out any non-digital characters via filter_var() with FILTER_SANITIZE_NUMBER_INT or trim() with a "character mask" containing the six undesired characters.
You can also let preg_replace() do the iterating for you. Using preg_replace() offers the most brief syntax, but regular expressions are often slower than non-preg_ techniques and it may be overkill for your seemingly simple task.
Codes: (Demo)
$array = ["post: 746", "post: 2", "post: 84"];
// remove all non-integer characters
var_export(array_map(function($v){return filter_var($v, FILTER_SANITIZE_NUMBER_INT);}, $array));
// only necessary if you have elements with non-"post: " AND non-integer substrings
var_export(preg_replace('~^post: ~', '', $array));
// I shuffled the character mask to prove order doesn't matter
var_export(array_map(function($v){return trim($v, ': opst');}, $array));
Output: (from each technique is the same)
array (
0 => '746',
1 => '2',
2 => '84',
)
p.s. If anyone is going to entertain the idea of using explode() to create an array of each element then store the second element of the array as the new desired string (and I wouldn't go to such trouble) be sure to:
split on or : (colon, space) or even post: (post, colon, space) because splitting on : (colon only) forces you to tidy up the second element's leading space and
use explode()'s 3rd parameter (limit) and set it to 2 because logically, you don't need more than two elements
I have this as an input to my command line interface as parameters to the executable:
-Parameter1=1234 -Parameter2=38518 -param3 "Test \"escaped\"" -param4 10 -param5 0 -param6 "TT" -param7 "Seven" -param8 "secret" "-SuperParam9=4857?--SuperParam10=123"
What I want to is to get all of the parameters in a key-value / associative array with PHP like this:
$result = [
'Parameter1' => '1234',
'Parameter2' => '1234',
'param3' => 'Test \"escaped\"',
'param4' => '10',
'param5' => '0',
'param6' => 'TT',
'param7' => 'Seven',
'param8' => 'secret',
'SuperParam9' => '4857',
'SuperParam10' => '123',
];
The problem here lies at the following:
parameter's prefix can be - or --
parameter's glue (value assignment operator) can be either an = sign or a whitespace ' '
some parameters may be inside a quote block and can also have different, both separators and glues and prefixes, ie. a ? mark for the separator.
So far, since I'm really bad with RegEx, and still learning it, is this:
/(-[a-zA-Z]+)/gui
With which I can get all the parameters starting with an -...
I can go to manually explode the entire thing and parse it manually, but there are way too many contingencies to think about.
You can try this that uses the branch reset feature (?|...|...) to deal with the different possible formats of the values:
$str = '-Parameter1=1234 -Parameter2=38518 -param3 "Test \"escaped\"" -param4 10 -param5 0 -param6 "TT" -param7 "Seven" -param8 "secret" "-SuperParam9=4857?--SuperParam10=123"';
$pattern = '~ --?(?<key> [^= ]+ ) [ =]
(?|
" (?<value> [^\\\\"]*+ (?s:\\\\.[^\\\\"]*)*+ ) "
|
([^ ?"]*)
)~x';
preg_match_all ($pattern, $str, $matches);
$result = array_combine($matches['key'], $matches['value']);
print_r($result);
demo
In a branch reset group, the capture groups have the same number or the same name in each branch of the alternation.
This means that (?<value> [^\\\\"]*+ (?s:\\\\.[^\\\\"]*)*+ ) is (obviously) the value named capture, but that ([^ ?"]*) is also the value named capture.
You could use
--?
(?P<key>\w+)
(?|
=(?P<value>[^-\s?"]+)
|
\h+"(?P<value>.*?)(?<!\\)"
|
\h+(?P<value>\H+)
)
See a demo on regex101.com.
Which in PHP would be:
<?php
$data = <<<DATA
-Parameter1=1234 -Parameter2=38518 -param3 "Test \"escaped\"" -param4 10 -param5 0 -param6 "TT" -param7 "Seven" -param8 "secret" "-SuperParam9=4857?--SuperParam10=123"
DATA;
$regex = '~
--?
(?P<key>\w+)
(?|
=(?P<value>[^-\s?"]+)
|
\h+"(?P<value>.*?)(?<!\\\\)"
|
\h+(?P<value>\H+)
)~x';
if (preg_match_all($regex, $data, $matches)) {
$result = array_combine($matches['key'], $matches['value']);
print_r($result);
}
?>
This yields
Array
(
[Parameter1] => 1234
[Parameter2] => 38518
[param3] => Test \"escaped\"
[param4] => 10
[param5] => 0
[param6] => TT
[param7] => Seven
[param8] => secret
[SuperParam9] => 4857
[SuperParam10] => 123
)
In php when user saves a text, I need to split string as
"genesis1:3-16" ==> "genesis", "1", "3", "16"
"revelation2:3-5" ==> "revelation", "2", "3", "5"
The conditions are there will be no white spaces between all characters I need to split according to symbol ":", "-", and character. the numbers can go up to only '999' 3 digits.
$sample = "genesis1:3-16";
//magic happens....
$book = ""; // genesis
$chapter = ""; // 1
$start_verse = ""; // 3
$end_verse = ""; //16
I have limited knowledge of reg expression and can't figure out using only strpos and substr...
Thank you in advance
I think this regex would accomplish what you are after:
([a-z]+)(\d{1,3}):(\d{1,3})-(\d{1,3})
Demo (with explanation of what each part does): https://regex101.com/r/uP4gW6/1
PHP Usage:
preg_match('~([a-z]+)(\d{1,3}):(\d{1,3})-(\d{1,3})~', 'genesis1:3-16', $data);
print_r($data);
Output:
Array
(
[0] => genesis1:3-16
[1] => genesis
[2] => 1
[3] => 3
[4] => 16
)
With preg_match the 0 index is the found content. The subsequent indexes are each captured group.
If you have a fixed set of names the book could be you could replace [a-z]+ with that list seperated by |, for example revelation|genesis|othername.
$parts = array();
preg_match('/^(.*?)\s*(\d+):(\d+)-(\d+)$/', $sample, $parts);
$book = $parts[1];
$chapter = $parts[2];
$startVerse = $parts[3];
$endVerse = $parts[4];
you could use this simple pattern
([a-zA-Z]+|\d{1,3})
Demo
I have the following regular expression in javascript and i would like to have the exact same functionality (or similar) in php:
// -=> REGEXP - match "x bed" , "x or y bed":
var subject = query;
var myregexp1 = /(\d+) bed|(\d+) or (\d+) bed/img;
var match = myregexp1.exec(subject);
while (match != null){
if (match[1]) { "X => " + match[1]; }
else{ "X => " + match[2] + " AND Y => " + match[3]}
match = myregexp1.exec(subject);
}
This code searches a string for a pattern matching "x beds" or "x or y beds".
When a match is located, variable x and variable y are required for further processing.
QUESTION:
How do you construct this code snippet in php?
Any assistance appreciated guys...
You can use the regex unchanged. The PCRE syntax supports everything that Javascript does. Except the /g flag which isn't used in PHP. Instead you have preg_match_all which returns an array of results:
preg_match_all('/(\d+) bed|(\d+) or (\d+) bed/im', $subject, $matches,
PREG_SET_ORDER);
foreach ($matches as $match) {
PREG_SET_ORDER is the other trick here, and will keep the $match array similar to how you'd get it in Javascript.
I've found RosettaCode to be useful when answering these kinds of questions.
It shows how to do the same thing in various languages. Regex is just one example; they also have file io, sorting, all kinds of basic stuff.
You can use preg_match_all( $pattern, $subject, &$matches, $flags, $offset ), to run a regular expression over a string and then store all the matches to an array.
After running the regexp, all the matches can be found in the array you passed as third argument. You can then iterate trough these matches using foreach.
Without setting $flags, your array will have a structure like this:
$array[0] => array ( // An array of all strings that matched (e.g. "5 beds" or "8 or 9 beds" )
0 => "5 beds",
1 => "8 or 9 beds"
);
$array[1] => array ( // An array containing all the values between brackets (e.g. "8", or "9" )
0 => "5",
1 => "8",
2 => "9"
);
This behaviour isn't exactly the same, and I personally don't like it that much. To change the behaviour to a more "JavaScript-like"-one, set $flags to PREG_SET_ORDER. Your array will now have the same structure as in JavaScript.
$array[0] => array(
0 => "5 beds", // the full match
1 => "5", // the first value between brackets
);
$array[1] => array(
0 => "8 or 9 beds",
1 => "8",
2 => "9"
);