Catching different syntax with Regular Expression - php

I'll have code embedded in HTML, it will look something like this:
<div id="someDiv">
{:
HTMLObject
id: form
background: blue
font: large
fields [
username: usr
password: pwd
]
foo: bar
:}
</div>
I am trying to write a regular expression that will take these HTMLObjects and break them into manageable arrays. I already have the regexp that will do the lines such as
id: form
However, I'm having trouble with making the regexp also match ones like
fields [
username: usr
password: pwd
]
Here is the function I have that performs these tasks:
function parseHTMLObjects($html) {
$details = preg_replace('/[{:]([^}]+):}/i', '$1', $html);
$details = trim(str_replace('HTMLObject', '', $details));
$dynamPattern = '/([^\[]+)\[([^\]]+)]/';
$dynamMatch = preg_match_all($dynamPattern, $details, $dynamMatches);
print_r($dynamMatches); // nothing is shown here
$findMatch = preg_match_all('/([^:]+):([^\n]+)/', $details, $matches);
$obs = array();
foreach($matches[0] as $o) {
$tmp = trim($o);
echo $tmp . "\n";
}
}
When I pass an HTML string like I demonstrated at the beginning of the page, the $findMatch regexp works fine, but nothing gets stored in the dynams one. Am I going about this in the wrong way?
Basically all I need is each object stored in an array, so from the sample HTML string above, this would be an ideal array:
Array() {
[0] => id: form
[1] => background: blue
[2] => font: large
[3] => fields [
username: usr
password: pwd
]
[4] => foo: bar
}
I have all the sorting and manipulation handled beyond that point, but like I said, I'm having trouble getting the same regexp that handles the colon style objects to also handle the bracket style objects.
If I need to use a different regexp and store the results in a different array that is fine too.
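One possible way to catch both styles in a single pass is an alternation that tries the bracket form first and falls back to the colon form. The following is only a minimal sketch: the function name splitHtmlObject is made up for illustration, it assumes bracket groups are never nested, and it has only been reasoned through against the sample shown above.

```php
<?php
// Hypothetical helper (not from the original post): returns every entry of the
// first {: ... :} block as one string, keeping bracketed groups together.
function splitHtmlObject(string $html): array
{
    // Grab the text between the literal "{:" and ":}" delimiters.
    if (!preg_match('/\{:(.*?):\}/s', $html, $block)) {
        return [];
    }

    // Alternative 1: "name [ ... ]", which may span several lines.
    // Alternative 2: "name: value" on a single line.
    // A bare line such as "HTMLObject" matches neither and is skipped.
    $pattern = '/^[ \t]*\w+[ \t]*\[[^\]]*\]|^[ \t]*\w+[ \t]*:[^\r\n]*/m';
    preg_match_all($pattern, $block[1], $matches);

    return array_map('trim', $matches[0]);
}
```

Because the bracket alternative is tried first, the lines inside fields [ ... ] are consumed as part of that entry and are not reported again as separate colon-style entries.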

It would be easily done with some black sorcery called YAML or JSON, using these syntaxes:
YAML
{:
HTMLObject:
id: form
background: blue
font: large
fields: [
username: usr,
password: pwd
]
foo: bar
:}
JSON
{:
{
"HTMLObject":{
"id": "form",
"background": "blue",
"font": "large",
"fields": [
{"username": "usr"},
{"password": "pwd"}
],
"foo": "bar"
}
}
:}
Bu-bu-but why? 'Cuz JSON is natively parsed by PHP (json_decode), and YAML parsers are available as an extension or library. No dirty RegExps.

I can't comment on posts yet, but looking for functions that convert an existing notation into PHP arrays is definitely the way forward. json_decode is one, though your data is starting life as something else.
Regex can be very tricky with complicated data, which is often complex because it has some underlying structure that is better interpreted with other tools.
PS: if you do use json_decode in PHP at any point, don't get caught out by the second parameter. It needs to be set to true to get an associative array; otherwise you get an object.
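To illustrate the point about the second parameter, here is a small sketch. The JSON string is a shortened version of the example above, and the variable names are only for illustration.

```php
<?php
$json = '{"HTMLObject": {"id": "form", "fields": [{"username": "usr"}, {"password": "pwd"}], "foo": "bar"}}';

// Without the second argument, JSON objects become stdClass instances.
$asObject = json_decode($json);
$idFromObject = $asObject->HTMLObject->id;

// With true as the second argument, JSON objects become associative arrays.
$asArray = json_decode($json, true);
$idFromArray = $asArray['HTMLObject']['id'];
$firstField = $asArray['HTMLObject']['fields'][0]['username'];
```

Both calls return null if the string is not valid JSON, so it is worth checking the result before using it.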

Related

Parse error raw string and select a error type efficiently

So I've the following situation: A process is piping huge amounts of data into a PHP script I'm using to parse and then store some info in a DB.
The input data is a multiline string, but what really matters for me is to find particular key words and then say that the input data is an error of the type 1 to n.
I've an array like this:
$errors = [
1 => [
"error 4011",
"clp-no",
],
2 => [
"error 4012",
"clp-nf",
"0010x100"
],
];
The idea is to state what key is the error - the array keys are the error numbers. Currently I've this piece of code to take care of the situation:
$errorId = 25; // Default error for undetected / others
foreach ($errors as $ierrorId => $matches) {
foreach ($matches as $match) {
if (mb_stripos($raw, $match) !== false) {
$errorId = $ierrorId;
break 2;
}
}
}
This code works fine. However, when I look at resource usage, there seems to be a bottleneck when the processes dump information to it (usually around 10 or 20 strings to be processed by running that 20 times).
What is the recommended way to do what I'm trying to accomplish with the minimum resource usage?
The output of the code is the ID of the error. The ID of the error is one of the numeric keys of the $errors array.
This code basically groups possible messages that are in the reality the same error and then gives me a unique error ID.
Thank you.
Example of the $raw input this parses:
[0]: error 4011 processing request
.
No input data: clp-nf
.
This is an automated message from the PTD daemon
=> Error: 0010x111.
And some others, the bottom line is: The format can change and I can't rely on position and stuff, it must try to find one of the strings on the array and then return the array key. For instance the second message will output 2 because clp-nf can be found on the second position of the array.
I did a little benchmarking with different functions to find text strings.
mb_stripos (case insensitive)
mb_strpos (case sensitive)
mb_strpos and strtolower on both the string to be searched and the error strings
I also tried the nested array structure you posted above, and a flat error list with the keys being the error strings and the values being the error number. Here are the results I got from running 20,000 reps on a set of sample strings (all returned the same set of errors) with an array of six different error strings, including one error string with non-ASCII characters:
[nested_stripos] => 178.60633707047s
[nested_strpos] => 19.614015340805s
[nested_strpos_with_strtolower] => 25.815476417542s
[flat_stripos] => 177.30470108986s
[flat_strpos] => 18.139512062073s
[flat_strpos_with_strtolower] => 24.32790517807s
As you can see, using mb_stripos is very slow in comparison with mb_strpos. If you don't know what case the errors will be in, it is much quicker to convert everything to lowercase than to use mb_stripos. Using a flat list is marginally faster than the nested arrays over 20,000 reps but there is unlikely to be a noticeable difference unless your raw input is large.
I didn't test preg_match as it is fairly well known that regular expressions are slower than string matching.
If you start recording the error string that was matched (e.g. error 4012 or 0010x100), you can build up frequency tables and order your error strings so that the most frequently occurring ones are checked first.
This was the fastest function:
# $lines: an array of text strings, including several with non-ASCII characters
# $err_types: data structure holding the error messages to search for (see below)
function flat_strpos($err_types, $lines){
$list = array();
foreach ($lines as $l) {
$err = 25;
foreach ($err_types['flat'] as $e => $no) {
if (mb_strpos($l, $e) !== false) {
$err = $no;
break;
}
}
$list[] = "ERR $err; content: " . mb_substr($l, 0, 100);
}
return $list;
}
And for reference, the $err_type data structure:
$err_types = [
'flat' => [
'error 4011' => 1,
'clp-no' => 1,
'error 4012' => 2,
'clp-nf' => 2,
'0010x100' => 2,
'颜色' => 3
],
'nested' => [
1 => [
'error 4011',
'clp-no'
],
2 => [
'error 4012',
'clp-nf',
'0010x100'
],
3 => [
'颜色'
]
]
];
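The two suggestions above (lowercase once and use mb_strpos, then reorder the error strings by how often they match) could be combined roughly as follows. This is a sketch under assumptions: the function names classify and reorderByHits are invented for illustration, the needles are assumed to be stored in lowercase already, and the mbstring extension is assumed to be loaded.

```php
<?php
// Hypothetical helper: returns the error ID for one raw message and records
// which needle matched in $hits (needle => number of matches so far).
function classify(string $raw, array $needles, array &$hits, int $default = 25): int
{
    $haystack = mb_strtolower($raw); // lowercase once per message, not per needle

    foreach ($needles as $needle => $errorId) {
        // Cast because PHP turns purely numeric string keys into integers.
        if (mb_strpos($haystack, (string) $needle) !== false) {
            $hits[$needle] = ($hits[$needle] ?? 0) + 1;
            return $errorId;
        }
    }

    return $default;
}

// Hypothetical helper: moves the most frequently matched needles to the front
// so that later calls to classify() tend to stop earlier.
function reorderByHits(array $needles, array $hits): array
{
    uksort($needles, fn($a, $b) => ($hits[$b] ?? 0) <=> ($hits[$a] ?? 0));
    return $needles;
}
```

Note that reordering changes which needle wins when a message contains needles belonging to different error IDs, so this only makes sense if that ambiguity is acceptable.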

How to get text in array between all <span> tag from HTML?

I want to fetch the text between all <span> </span> tags in my HTML as an array. I have tried this code, but it returns only one occurrence:
preg_match('/<span>(.+?)<\/span>/is', $row['tbl_highlighted_icon_content'], $matches);
echo $matches[1];
My HTML:
<span>The wish to</span> be unfairly treated is a compromise attempt that would COMBINE attack <span>and innocen</span>ce. Who can combine the wholly incompatible, and make a unity of what can NEVER j<span>oin? Walk </span>you the gentle way,
My code returns only one occurrence of the span tag, but I want to get the text from every span tag in the HTML as a PHP array.
You need to switch to the preg_match_all function.
Code
$row['tbl_highlighted_icon_content'] = '<span>The wish to</span> be unfairly treated is a compromise attempt that would COMBINE attack <span>and innocen</span>ce. Who can combine the wholly incompatible, and make a unity of what can NEVER j<span>oin? Walk </span>you the gentle way,';
preg_match_all('/<span>.*?<\/span>/is', $row['tbl_highlighted_icon_content'], $matches);
var_dump($matches);
As you can see, the array is now correctly populated, so you can echo all your matches.
Use preg_match_all(). It works the same way, but it returns all the occurrences in the $matches array.
http://php.net/manual/en/function.preg-match-all.php
Here is code to get all span values in an array:
$str = "<span>The wish to</span> be unfairly treated is a compromise
attempt that would COMBINE attack <span>and innocen</span>ce.
Who can combine the wholly incompatible, and make a unity
of what can NEVER j<span>oin? Walk </span>you the gentle way,";
preg_match_all("/<span>(.+?)<\/span>/is", $str, $matches);
echo "<pre>";
print_r($matches);
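Your output will be:
Array
(
[0] => Array
(
[0] => <span>The wish to</span>
[1] => <span>and innocen</span>
[2] => <span>oin? Walk </span>
)
[1] => Array
(
[0] => The wish to
[1] => and innocen
[2] => oin? Walk
)
)
Index 0 holds the full matches including the <span> tags (a browser renders those tags, so both lists look identical there). Index 1 holds only the captured text, so use index 1 if you want the text without the tags.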
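As a small illustration of the two indexes, the sketch below runs the same pattern on a shortened version of the sample text. The variable names $withTags and $textOnly are only for illustration.

```php
<?php
$html = '<span>The wish to</span> be unfairly treated <span>and innocen</span>ce, '
      . 'what can NEVER j<span>oin? Walk </span>you the gentle way,';

preg_match_all('/<span>(.+?)<\/span>/is', $html, $matches);

// $matches[0]: every full match, tags included.
$withTags = $matches[0];

// $matches[1]: only what the first capture group (.+?) matched.
$textOnly = $matches[1];

foreach ($textOnly as $text) {
    echo trim($text), "\n";
}
```

Whitespace inside the tags is kept as-is in the captured text, which is why trim() is applied when printing.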
If you don't mind using a third-party component, I'd like to show you Symfony's DomCrawler component. It's a very simple way to parse HTML/XHTML/XML files and navigate through the nodes.
You can even use CSS Selectors. Your code would be something like:
use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler($html);
$spans = $crawler->filter("span");
echo $spans->eq(1)->text(); // text of the second <span>
You don't even need to have a full HTML/XML document, if you assign only the <span>...</span> part of your code, it'll work fine.

why do javascript libraries using json choose a [ { } , { } ] structure

I've been using a few javascript libraries, and I noticed that most of them take input in this form: [{"foo": "bar", "12": "true"}]
According to json.org:
So we are sending an object in an array.
So I have a two part question:
Part 1:
Why not just send an object or an array, which would seem simpler?
Part2:
What is the best way to create such JSON with PHP?
Here is a working method, but I found it a bit ugly as it does not work out of the box with multi-dimensional arrays:
<?php
$object[0] = array("foo" => "bar", 12 => true);
$encoded_object = json_encode($object);
?>
output:
{"1": {"foo": "bar", "12": "true"}}
<?php $encoded = json_encode(array_values($object)); ?>
output:
[{"foo": "bar", "12": "true"}]
Because that's the logical way to pass multiple objects. It's probably made to facilitate this:
[{"foo" : "bar", "12" : "true"}, {"foo" : "baz", "12" : "false"}]
Use the same logical structure in PHP:
echo json_encode(array(array("foo" => "bar", "12" => "true")));
An array is used as a convenient way to support multiple parameters. The first parameter, in this case, is an object.
Question one:
JSON is just a way of representing an object and/or an array as a string. If you have your array or object as a string, it is much easier to send it around to different places, like to the client's browser. Different languages handle arrays and objects in different ways. So if you had an array in PHP, for example, you can't send it to the client directly because it is in a format that only PHP understands. This is why JSON is useful. It is a method of converting an array or object to a string that lots of different languages understand.
Question two:
To output an array of objects like the above, you could do this:
<?php
//create a test array of objects
$testarray = array();
$testarray[] = json_decode('{"type":"apple", "number":4, "price":5}');
$testarray[] = json_decode('{"type":"orange", "number":3, "price":8}');
$testarray[] = json_decode('{"type":"banana", "number":8, "price":3}');
$testarray[] = json_decode('{"type":"coconut", "number":2, "price":9}');
$arraycount = count($testarray);
echo("[");
$i = 1;
foreach($testarray as $object)
{
echo(json_encode($object));
if($i !== $arraycount)
{
echo(",");
}
$i += 1;
}
echo("]");
?>
This will take an array of objects and loop over them. Before we loop over them, we output the opening square bracket.
Then, for each object in the array, we output it encoded to JSON plus a comma after each one. We count the number of iterations so that we don't output a comma at the end.
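For comparison, json_encode can also be handed the whole array at once, in which case it takes care of the brackets and commas itself. The sketch below uses made-up sample data. It also shows the behaviour behind the question: a PHP array whose keys are exactly 0, 1, 2, ... is encoded as a JSON array, while any other keys produce a JSON object.

```php
<?php
// A list of associative arrays: keys 0 and 1 are sequential.
$items = [
    ['type' => 'apple', 'number' => 4, 'price' => 5],
    ['type' => 'orange', 'number' => 3, 'price' => 8],
];
$asJsonArray = json_encode($items);

// Keys that do not start at 0 make json_encode emit an object instead.
$gappy = [1 => ['foo' => 'bar']];
$asJsonObject = json_encode($gappy);

// array_values() reindexes from 0, which turns it back into a JSON array.
$reindexed = json_encode(array_values($gappy));

echo $asJsonArray, "\n", $asJsonObject, "\n", $reindexed, "\n";
```

This also works for nested arrays, since the same rule is applied at every level.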

PHP Simple CSS string parser

I need to parse some CSS code like:
color: black;
font-family:"Courier New";
background:url('test.png');
color: red;
--crap;
Into:
array (
'color'=>'red',
'font-family'=>'"Courier New"',
'background'=>'url(\'test.png\')',
'--crap'=>''
)
I need to do this via PHP. I can see this done easily via regexp (well, easy to those that know it, unlike myself :-) ).
I need the resulting array to be "normalized", there should not be any trailing spaces between tokens, even if they were in the source.
Valueless css tokens should be included in the array as a key only. (see --crap)
Quotes (and values in general) should remain as is, except for extra formatting (spaces, tabs); easily removed via trim() or via the relevant regexp switch.
Please note that at this point, I specifically do not need a full CSS parser, i.e., there is no need to parse blocks ( {...} ) or selectors ( a.myclass#myid ).
Oh, and considering I'll be putting this stuff in an array, it is perfectly ok if the last items ( color:red; ) completely override the original items ( color:black; ).
Here's a simple version:
$a = array();
preg_match_all('/^\s*([^:]+)(:\s*(.+))?;\s*$/m', $css, $matches, PREG_SET_ORDER);
foreach ($matches as $match)
$a[$match[1]] = isset($match[3]) ? $match[3] : null;
Sample output:
array(4) {
["color"]=>
string(3) "red"
["font-family"]=>
string(13) ""Courier New""
["background"]=>
string(15) "url('test.png')"
["--crap"]=>
NULL
}
Not tested with anything except your source data, so I'm sure it has flaws. Might be enough to get you started.
I found this few weeks back and looks interesting.
http://websvn.atrc.utoronto.ca/wsvn/filedetails.php?repname=atutor&path=/trunk/docs/include/classes/cssparser.php
Example:
$Parser = new cssparser();
$Results = $Parser->ParseStr('color: black;font-family:"CourierNew";background:url(\'test.png\');color: red;--crap;');
Why not take a look at CSSTidy?
You can try:
$result = array();
if (preg_match_all('/\s*([-\w]+)\s*:?\s*(.*?)\s*;/m', $input, $m)) {
    var_dump($m); // inspect the raw matches
    // $m[1] contains all the properties
    // $m[2] contains their respective values.
    for ($i = 0; $i < count($m[1]); $i++) {
        $result[$m[1][$i]] = $m[2][$i];
    }
}
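For declarations as simple as the ones in the question, a version without regular expressions is also possible. This is only a sketch: the function name parseDeclarations is invented for illustration, and it assumes that no value contains a semicolon (for example inside a quoted string or a data: URL), which would break the split.

```php
<?php
// Hypothetical helper: turns "prop: value; prop2: value2;" into an array.
function parseDeclarations(string $css): array
{
    $result = [];

    foreach (explode(';', $css) as $declaration) {
        $declaration = trim($declaration);
        if ($declaration === '') {
            continue; // skip the empty piece after the last semicolon
        }

        // Split on the first colon only, so colons inside the value survive.
        $parts = explode(':', $declaration, 2);
        $property = trim($parts[0]);
        $value = isset($parts[1]) ? trim($parts[1]) : '';

        // Later declarations overwrite earlier ones with the same property.
        $result[$property] = $value;
    }

    return $result;
}
```

Valueless tokens such as --crap end up as keys with an empty string as value, matching the array requested in the question.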

Mirror SQL's LIKE functionality for a PHP array?

At the minute I have a page with an AJAX script that searches a database with LIKE '%search term from input box%'.
I need to modify it so that instead of searching the database it searches an array (that has been constructed from two tables - I can't use a JOIN because there's a bit more to it than that).
How do I go about creating a fuzzy search function in PHP that will return all the possible matches from the array?
You want preg_grep,
e.g.
$arr = array("tom jones", "tom smith", "bob jones", "jon smith");
$results = preg_grep("/jones/",$arr);
$results will now contain two elements, "tom jones" and "bob jones"
You could just loop over the array and use strpos to find the matching elements:
foreach( $arr as $value ) {
if ( strpos($value, 'searchterm') !== FALSE ) {
// Match
}
}
You could use a regular expression for more advanced searching, but strpos will be faster if you are just trying to do a simple LIKE '%term%' type search.
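Both approaches can be wrapped into small functions. The sketch below is an illustration only: the function names likeSearch and likeSearchRegex are made up, and the search is made case-insensitive on the assumption that this is what the existing LIKE query effectively does, which depends on the database collation.

```php
<?php
// Hypothetical helper: plain substring search, similar to LIKE '%term%'.
// Original array keys are preserved, as with preg_grep.
function likeSearch(array $items, string $term): array
{
    return array_filter($items, fn($value) => stripos($value, $term) !== false);
}

// Hypothetical helper: the preg_grep variant. preg_quote() escapes characters
// in the user's input that would otherwise be treated as regex syntax.
function likeSearchRegex(array $items, string $term): array
{
    return preg_grep('/' . preg_quote($term, '/') . '/i', $items);
}

$arr = array("tom jones", "tom smith", "bob jones", "jon smith");
print_r(likeSearch($arr, 'JONES'));
```

If the result should be reindexed from 0 (for example before passing it to json_encode), array_values() can be applied to it.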
