php extract 99 from string regex - php

I have a string:
tomato='36'/></carrot
From this I am trying to extract 36 using regex. I am using:
"/tomato='(.*)'\/>/"
This extracts the beginning ok, but not the rest. Any ideas how to fix this?

You should specialize your regex in order to only match numeral characters:
"/tomato='(\d+)'\/>/"

Here are a few tools that can help with constructing regular expressions: https://stackoverflow.com/questions/89718/is-there-anything-like-regexbuddy-in-the-open-source-world
In your case you might want to match digits with \d+ instead.
Depending on the actual use case, it might be simpler to skip regexps and use a DOM parser, which simplifies the attribute extraction:
pq($xml)->find("recipe")->attr("tomato");

Try matching up to the first quote, and then grabbing everything that isn't a quote character:
/tomato='([^']*)'/
This method works well if you have no clue about what will be between the quotes. However, it fails if the final quote is missing, or if the input uses double quotes instead of single quotes.
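If both quote styles can show up, one possible variation is to capture the opening quote and require the same character to close the value. This is only a sketch: the helper name attributeValue is made up for illustration and is not part of the answer above.

```php
<?php
// Hypothetical helper (not from the answers above): accepts either quote style.
function attributeValue(string $input, string $name): ?string
{
    // (["']) captures the opening quote; \1 requires the same character to close it.
    $pattern = '/' . preg_quote($name, '/') . '=(["\'])(.*?)\1/';

    // $m[2] is the text between the quotes; null signals "no match".
    return preg_match($pattern, $input, $m) ? $m[2] : null;
}

var_dump(attributeValue("tomato='36'/></carrot", 'tomato'));
```

A missing closing quote still produces no match, so the caller has to handle the null case.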

Given the very specific example you've provided, and the specific data you want to extract, it's fairly simple; your regex can ignore everything except the numeric characters:
"/\d+/"
If the input string could vary, and you specifically want to find the value of the tomato attribute, then:
"/tomato='(\d+)'/"
There's unlikely to be any real need to be matching the rest of the string - in fact, it's more likely to cause problems, given how variable XML can be.
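For what it's worth, here is a minimal sketch of the second pattern used with preg_match. The variable names are only illustrative, and the input is the fragment from the question.

```php
<?php
// Sample input taken from the question.
$input = "tomato='36'/></carrot";

// \d+ only matches digits, so the capture group cannot run past the closing quote.
if (preg_match("/tomato='(\d+)'/", $input, $matches)) {
    // $matches[0] is the whole match, $matches[1] the first capture group.
    $value = (int) $matches[1];
    echo $value, PHP_EOL;
}
```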
But the question is, what exactly are you trying to do here? It looks very much like you're trying to parse an HTML/XML stream, but how did you end up with just this odd chunk? Did you do explode(' ',$xml);?
You may find a more scalable and manageable way of extracting data from an XML stream would be to use a DOM parser instead. Regex can work, but HTML/XML tends to have sufficient variation in formatting that you end up with some really really horrible regex strings if you want to be certain of getting the data you want; a DOM parser tends to be much more reliable in this respect.
May I suggest you investigate PHP's built-in DOM parser: http://www.php.net/dom
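A rough sketch of what that could look like with DOMDocument, assuming the fragment comes from a well-formed document. The surrounding XML here is invented for illustration, and the element name "recipe" is borrowed from the phpQuery snippet in another answer.

```php
<?php
// Hypothetical well-formed document; only the tomato attribute comes from the question.
$xml = "<carrot><recipe tomato='36'/></carrot>";

$doc = new DOMDocument();
$doc->loadXML($xml);

// Take the first <recipe> element and read its attribute.
$recipe = $doc->getElementsByTagName('recipe')->item(0);
$tomato = $recipe !== null ? $recipe->getAttribute('tomato') : null;

echo $tomato, PHP_EOL;
```

The parser does not care whether the attribute uses single or double quotes, which is the kind of variation that tends to break a regex.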
Hope that helps.

Related

What's the best approach to find words from a set of words in a string?

I must detect the presence of some words (even polyrematic, like in "bag of words") in a user-submitted string.
I need to find the exact word, not part of it, so the strstr/strpos/stripos family is not an option for me.
My current approach (PHP/PCRE regex) is the following:
\b(first word|second word|many other words)\b
Is there any other better approach? Am I missing something important?
Words are about 1500.
Any help is appreciated
A regular expression the way you're demonstrating will work. It may be challenging to maintain if the list of words grows long or changes.
The method you're using will work in the event that you need to look for phrases with spaces and the list doesn't grow much.
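One way to keep such a pattern maintainable might be to build it from an array instead of writing the alternation by hand. The word list below is a made-up stand-in for the real one.

```php
<?php
// Hypothetical word list; in practice this would hold the ~1500 entries.
$words = ['bag of words', 'first word', 'bag'];

// Longer entries first, so "bag of words" is tried before "bag".
usort($words, fn($a, $b) => strlen($b) <=> strlen($a));

// preg_quote escapes regex metacharacters inside each entry.
$alternatives = array_map(fn($w) => preg_quote($w, '/'), $words);
$pattern = '/\b(?:' . implode('|', $alternatives) . ')\b/iu';

preg_match_all($pattern, 'A bag of words is not a handbag.', $matches);
print_r($matches[0]);
```

The \b anchors keep "handbag" from matching "bag", which is the whole-word behaviour the question asks for.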
If there are no spaces in the words you're looking for, you could split the input string on space characters (\s+, see https://www.php.net/manual/en/function.preg-split.php ), then check to see if any of those words are in a Set (https://www.php.net/manual/en/class.ds-set.php) made up of the words you're looking for. This will be a bit more code, but less regex maintenance, so ymmv based on your application.
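A sketch of that split-and-look-up idea, assuming single-word entries only. It uses plain array keys instead of Ds\Set so that it runs without the Ds extension; the helper name, the word list, and the punctuation trimming are my own assumptions.

```php
<?php
// Hypothetical single-word list; array keys give constant-time lookups.
$wanted = array_flip(['tomato', 'carrot']);

function findWords(string $input, array $wanted): array
{
    // Split on whitespace and drop empty pieces.
    $tokens = preg_split('/\s+/', strtolower($input), -1, PREG_SPLIT_NO_EMPTY);
    $found = [];
    foreach ($tokens as $token) {
        // Strip leading and trailing punctuation so "carrot," still matches.
        $token = trim($token, ".,;:!?\"'()");
        if (isset($wanted[$token])) {
            $found[$token] = true;
        }
    }
    return array_keys($found);
}

print_r(findWords('I like Carrot, not carrots.', $wanted));
```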
If the set has spaces, consider using a trie instead. Wiktor Stribiżew suggests: https://github.com/sters/php-regexp-trie

PHP Regex Match all string in other file

I have created a regex that is meant to extract all strings from PHP files.
For example, given "abc.php", I want to extract every string inside it (including the surrounding " or ' quotes).
I wrote my own regex, but it misses some strings and overmatches others.
Note: my intention is the same as in this post -> PHP: Regex to match the following string samples
But agent-j's answer in that thread also fails to match some strings.
Basically, this is my regex:
/[\"|\'][^.\/\"](.*?)[^,\\][\"|\']/
I also tried agent-j's regex, but it has a problem matching strings that span multiple lines.
His regex:
(['"])((?:\\\1|(?!\1).)+)\1
The easiest way I have ever found to regex match any logic to an entire file has been to use
$something_better = explode("'", $something);
This way you get an array of data that is more easily evaluated. I use this concept every time I want to guarantee I can make the match exactly how I want every time.
What I would do here is to explode and extract the info between single quotes, matching what I wanted from them, and then implode on the single quote. Since you also want the double quotes, you can then explode and repeat the process for double quotes.
In my experience it is very hard to regex all your problems away in one simple statement. It's better to take it in smaller pieces if you can. There will be less room for error.
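As a rough illustration of the explode idea: after splitting on the quote character, every odd-numbered piece is text that sat between two quotes. The sample source is invented, and this sketch ignores escaped quotes and quotes of the other kind.

```php
<?php
// Hypothetical source text to scan.
$source = "echo 'one'; echo 'two';";

// After splitting on the quote, odd-numbered pieces are the ones that sat between quotes.
$pieces = explode("'", $source);
$strings = [];
foreach ($pieces as $index => $piece) {
    if ($index % 2 === 1) {
        $strings[] = $piece;
    }
}

print_r($strings);
```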
It looks like nobody solved my problem.
I solved it myself with the help of a friend.
This is the regex I was looking for:
/([\"\'])(?:(?=(\\\\?))\\2.)*?\\1/s
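A small sketch of that pattern in use. It assumes the version shown above is how it appears inside a double-quoted PHP string, so the same regex is rewritten here for a single-quoted string; the sample input is made up.

```php
<?php
// The same regex as above, escaped for a single-quoted PHP string (an assumption about the original).
$pattern = '/(["\'])(?:(?=(\\\\?))\2.)*?\1/s';

// Hypothetical sample with an escaped quote inside a double-quoted string.
$source = '$a = "he said \"hi\""; $b = \'x\';';

// $matches[0] holds each full string literal, including its surrounding quotes.
preg_match_all($pattern, $source, $matches);
print_r($matches[0]);
```

The lookahead captures an optional backslash, so a backslash and the character after it are always consumed as a pair and an escaped quote cannot end the match early.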

Is it possible to write a regex which checks if a string (javascript & php code) is minified?

Is it possible to write a regular expression which checks if a string (some code) is minified?
Many PHP/JS obfuscators remove white space chars (among other things).
So, the final minified code sometimes looks like this:
PHP:
$a=array();if(is_array($a)){echo'ok';}
JS:
a=[];if(typeof(a)=='object'&&(a instanceof Array)){alert('ok')}
In both cases there are no space chars before and after "{", "}", ";", etc. There are also some other patterns which can help. I am not expecting a high-accuracy regex; I just need one which checks whether at least 100 chars of a string look like minified code.
Thanks in advance.
PURPOSES: web malware scanner
I think a minifier will strip all newline characters, although there might possibly be one at the end of the file still if the minified code was pasted back in a text editor. Something like this will probably be fairly accurate:
/^[^\n\r]+(\r\n?|\n)?$/
That just tests that there are no newline characters in the whole thing except for possibly one at the end. So no guarantees, but I think it will work well on any longish block of code.
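Wrapped in a function, the check could look something like this. The function name is made up for illustration; the pattern is the one given above.

```php
<?php
// Hypothetical helper name; the pattern is the one from the answer.
function isSingleLine(string $code): bool
{
    // One run of non-newline characters, optionally followed by a single trailing line break.
    return preg_match('/^[^\n\r]+(\r\n?|\n)?$/', $code) === 1;
}

var_dump(isSingleLine("a=1;b=2;\n"));
var_dump(isSingleLine("a=1;\nb=2;"));
```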
The short answer is "no", regex cannot do this.
Your best bet will probably be to do a statistical analysis of the source files, and compare against some known heuristics. For instance, by comparing the variable names against those often found in minimized code. A minimized file probably has a lot of one-character variable names, for instance... and won't have two-character variable names until all the one-character variable names are exhausted... etc.
Another option would be simply to run the source file through a minimizer, and see if the output is sufficiently different from the input. If not, it was probably already minimized.
But I have to agree with sg3s's final sentence: If you can explain why you need this, we can probably provide more useful answers to your actual needs.
No. The syntax and intent of the code don't change, and some people who are very familiar with PHP and/or JS write simple functions on one line without any whitespace at all (me :s).
What you could do is count all the whitespace characters in a string though this would also be unreliable since for some stuff you simply need whitespace, like x instanceof y heh. Also not all code is minified and cramped into a single row (see jQuery UI) so you can't really count on that either....
Maybe you can explain why you need to know this and we can try and find an alternative?
You can't tell whether it was minified or just written like that by hand (this probably only applies to smaller scripts). But you can check that it doesn't contain unnecessary whitespace.
Take a look at an open source obfuscator/minifier and see what rules it uses to remove the whitespace. Validating whether those rules were applied should work; if the regex gets too complex, a simple parser might be needed.
Just make sure that string literals like a="if ( b )" are excluded.
Run it through a parser for that particular language (even a prettifier might work fine) and modify it to count the number of unused characters. Use the percentage of unused chars vs. number of chars in documents as a test for minification. I don't think you can do this accurately with regex, although counting whitespace vs. document content might be okay.
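The cruder "whitespace vs. document content" idea might be sketched like this. Both function names and the 5% threshold are assumptions and would need tuning against real samples. The sketch also counts whitespace inside string literals, which the answers above warn about.

```php
<?php
// Hypothetical helper: share of whitespace characters in a piece of code.
function whitespaceRatio(string $code): float
{
    $length = strlen($code);
    if ($length === 0) {
        return 0.0;
    }
    // preg_match_all returns the number of matches.
    return preg_match_all('/\s/', $code) / $length;
}

// Hypothetical threshold; the 100-character minimum comes from the question.
function looksMinified(string $code, float $threshold = 0.05): bool
{
    return strlen($code) >= 100 && whitespaceRatio($code) < $threshold;
}
```

This is a heuristic only: hand-written one-liners will pass it, and minified code with long string literals may not.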

Best way to parse a text document

I'm trying to parse a plain text document in PHP but have no idea how to do it correctly.
I want to separate each word, assign them an ID and save the result in JSON format.
Sample text:
"Hello, how are you (today)"
This is what I'm doing at the moment:
$document_array = explode(' ', $document_text);
json_encode($document_array);
The resulting JSON is
[["Hello,"],["how"],["are"],["you"],["(today)"]]
How do I ensure that spaces are kept in-place and that symbols are not included along with the words...
[["Hello"],[", "],["how"],[" "],["are"],[" "],["you"],[" ("],["today"],[")"]]
I’m sure some sort of regex is required... but have no idea what kind of pattern to apply to deal with all cases... Any suggestions guys?
This is actually a really complex problem, and one that's subject to a fair amount of academic research. It sounds so simple (just split on whitespace! with maybe a few rules for punctuation...) but you quickly run into issues. Is "didn't" one word or two? What about hyphenated words? Some might be one word, some might be two. What about multiple successive punctuation characters? Possessives versus quotes? etc etc. Even determining the end of a sentence is non-trivial. (It's just a full stop right?!)
This problem is one of tokenisation and a topic that search engines take very seriously. To be honest you should really look at finding a tokeniser in your language of choice.
Maybe this?
array_filter(preg_split('/\b/', $document_text))
The array_filter call removes the empty values at the first and/or last index of the resulting array, which appear if your string starts or ends with a word boundary (\b, see: http://php.net/manual/en/regexp.reference.escape.php).
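One caveat with array_filter is that it keeps the original keys, so json_encode may emit an object instead of a list once index 0 has been removed, and it also drops a token such as "0". A possible alternative, sketched with the sample text from the question, is to let preg_split skip the empty pieces itself:

```php
<?php
$document_text = 'Hello, how are you (today)';

// PREG_SPLIT_NO_EMPTY drops empty pieces and keeps the keys sequential,
// so json_encode produces a plain JSON array.
$tokens = preg_split('/\b/', $document_text, -1, PREG_SPLIT_NO_EMPTY);

echo json_encode($tokens), PHP_EOL;
```

This still treats "didn't" as three tokens, so the tokenisation caveats from the other answer apply.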

Best Way to parse an iCalendar string in php

I am trying to write a class that can parse an iCalendar file and am hitting some brick walls. Each line can be in the format:
PARAMETER[;PARAM_PROPERTY..]:VALUE[,VALUE2..]
It's pretty easy to parse with either a bunch of splits or regexes until you find out that values can contain backslash-escaped commas, and that they can also be wrapped in double quotes, which makes life hard. For example:
PARAMETER:"my , cool, value",value\,2,value3
In this example you are meant to pull out the three values:
my , cool, value
value,2
value3
Which makes it a little more difficult.
Suggestions?
Go through the file char by char and split the values manually, whenever you have a quotation mark you enter "quotation mode" where you won't split at commas and when the closing quotation mark comes you leave it.
For the backslash-escaped commas: if you read in a backslash, you also read the next character and decide what to do with it then.
Of course that's not extremely efficient, but you can't use regular expressions for this. I mean you can, but since I believe that there also can be escaped quotation marks this is going to be really messy.
If you want to give it a try though:
Let's start by matching a quotation mark followed by characters that are not quotation marks: "[^"]*"
To overcome the problem of escaped characters you can use negative lookbehinds: (?<!\\)"[^"]*(?<!\\)"
Now it will break if escaped quotation marks are in the value. Maybe this works (I haven't tested it): (?<!\\)"(?:[^"\\]|\\.)*"
So you see it gets messy very fast, which is why I would suggest reading it in character by character.
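The character-by-character approach described above could be sketched roughly as follows. The function name is made up, it only handles the VALUE part of a line, and it treats every backslash escape as "take the next character literally", which is a simplification (the iCalendar spec also defines \n as a newline, for instance).

```php
<?php
// Hypothetical helper: splits the VALUE part of a line on unquoted, unescaped commas.
function splitValues(string $value): array
{
    $values = [];
    $current = '';
    $inQuotes = false;
    $length = strlen($value);

    for ($i = 0; $i < $length; $i++) {
        $char = $value[$i];

        if ($char === '\\' && $i + 1 < $length) {
            // Escaped character: skip the backslash and take the next character literally.
            $current .= $value[++$i];
        } elseif ($char === '"') {
            // Enter or leave "quotation mode"; the quote itself is not kept.
            $inQuotes = !$inQuotes;
        } elseif ($char === ',' && !$inQuotes) {
            // An unquoted comma ends the current value.
            $values[] = $current;
            $current = '';
        } else {
            $current .= $char;
        }
    }
    $values[] = $current;

    return $values;
}

print_r(splitValues('"my , cool, value",value\,2,value3'));
```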
I had the same problems. I found it a bit hard to turn 'any' iCalendar file into a usable PHP object/array structure, so instead I've been trying to convert iCalendar to xCal.
This is my implementation:
http://code.google.com/p/sabredav/source/browse/branches/caldav/lib/Sabre/CalDAV/ICalendarToXML.php
I must say that this script is not fully tested, but it might be enough to get you started.
Have you tried pulling something out of http://phpicalendar.net/ ?
Is this the project you're thinking of? I'm the author :) The first usable version (v0.1.0) should be ready in about a month. It is capable of working with about 85% of the iCalendar spec right now, but recurring events are really tough. I'm working on them right now. Once those are complete, the library will be fully capable of doing anything in the spec.
qCal Google Code Homepage
Enjoy!
