PHP Regex Match all string in other file - php

I have written a regex to extract all strings from PHP files.
For example, given "abc.php", I want to extract every string literal in it (including the surrounding quotes, " or ').
I wrote my own regex, but it misses some strings and overmatches others.
Note: my goal is the same as in this post -> PHP: Regex to match the following string samples
But agent-j's answer in that thread also fails to match some strings.
This is my regex:
/[\"|\'][^.\/\"](.*?)[^,\\][\"|\']/
Here is the problem in a picture..
I also tried agent-j's regex, but it has a problem matching strings that span multiple lines.
His regex:
(['"])((?:\\\1|(?!\1).)+)\1
Problem with this regex

The easiest way I have ever found to regex match any logic to an entire file has been to use
$something_better = explode("'", $something);
This way you get an array of data that is more easily evaluated. I use this concept every time I want to guarantee I can make the match exactly how I want every time.
What I would do here is to explode and extract the info between single quotes, matching what I wanted from them, and then implode on the single quote. Since you also want the double quotes, you can then explode and repeat the process for double quotes.
In my experience it is very hard to regex all your problems away in one simple statement. It's better to take it in smaller pieces if you can. There will be less room for error.
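A rough sketch of the explode approach described above, assuming the quotes are balanced and never escaped. The sample input and the $inside variable are made up for illustration. After splitting on the single quote, the pieces at odd indexes are the ones that sat between a pair of quotes.

```php
<?php
// Hypothetical sample input; assumes quotes are balanced and never escaped.
$something = "\$a = 'foo'; \$b = 'bar';";

// Split on the single-quote character.
$something_better = explode("'", $something);

// Pieces at odd indexes were between a pair of single quotes.
$inside = [];
foreach ($something_better as $index => $piece) {
    if ($index % 2 === 1) {
        $inside[] = $piece;
    }
}

print_r($inside);
```

An escaped quote such as 'it\'s' would throw the odd/even counting off, which is the main limit of this approach.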

It looks like nobody solved my problem.
I solved it myself with the help of a friend.
This is the regex I was looking for:
/([\"\'])(?:(?=(\\\\?))\\2.)*?\\1/s
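As a minimal sketch of how this pattern might be applied, the snippet below runs it over a small sample with preg_match_all. The $source text and variable names are invented for illustration; in practice the input would presumably come from something like file_get_contents('abc.php').

```php
<?php
// The pattern from the post, placed in a single-quoted PHP string.
$pattern = '/([\"\'])(?:(?=(\\\\?))\\2.)*?\\1/s';

// Hypothetical file contents: a plain string, one with an escaped quote,
// and one that spans two lines.
$source = <<<'SRC'
$a = "hello";
$b = 'it\'s';
$c = "multi
line";
SRC;

// $matches[0] holds each full match, including the surrounding quotes.
preg_match_all($pattern, $source, $matches);
print_r($matches[0]);
```

The /s modifier lets the dot match newlines, which is what allows the multi-line string to match. The pattern knows nothing about PHP comments, so an apostrophe inside a comment can still start a false match.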

Related

Regular Expression in Serialized Data

I am looking to do a database search on serialized data. I am currently using Symfony2 as my framework, making pdo_mysql calls using Doctrine 2. What I would like to do is create a query that uses REGEXP to find data within a certain part of the array. The data I am trying to search within looks like this: -
a:1:{s:8:"bedrooms";a:5:{i:0;i:1;i:1;i:2;i:2;i:3;i:3;i:4;i:4;s:2:"5+";}}
So let's say I am looking for a record that has 3 bedrooms, then I would want it to find: -
i:2;i:3
The query I have come up with so far is: -
SELECT * FROM table WHERE field_name REGEXP '.*"bedrooms"; a:[0-9]+:{i:[0-9]+;i:3;}.*';
However this doesn't work. Can someone help me find a fix around this please? I think it's down to the way the regular expression is written.
Also, it's worth noting that there are other arrays stored in the field, such as credit limits and other data.
Thank you in advance.
I believe you can do it with the help of negated character class [^{}] that matches any character but a { and }:
.*"bedrooms";a:[0-9]+:[{][^{}]*i:[0-9]+;i:3[^{}]*[}]
See the regex demo
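To sanity-check the pattern outside the database, here is a small sketch that runs it through PHP's preg_match. This is only an approximation: MySQL's REGEXP dialect is not identical to PCRE, and the two sample records are made up with serialize() for illustration.

```php
<?php
// Pattern from the answer above, wrapped in PCRE delimiters for preg_match.
$pattern = '/.*"bedrooms";a:[0-9]+:[{][^{}]*i:[0-9]+;i:3[^{}]*[}]/';

// Hypothetical records, built with serialize() so the format is exact.
$withThree    = serialize(['bedrooms' => [1, 2, 3, 4, '5+']]);
$withoutThree = serialize(['bedrooms' => [1, 2]]);

// preg_match returns 1 on a match and 0 otherwise.
var_dump(preg_match($pattern, $withThree) === 1);
var_dump(preg_match($pattern, $withoutThree) === 1);
```

Because [^{}] cannot cross a brace, the search stays inside the "bedrooms" array and does not leak into other arrays stored in the same field.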
I see at least 2 mistakes and one improvement you can make.
First, drop the blank space after "bedrooms"; in the regex.
Second, you should escape the curly braces, like \{ and \}, since the regex engine does not treat them as literals.
If you are interested in a specific chunk of the string, you must specify it as a group and say what kind of characters surround it, like
"bedrooms";a:[0-9]+:\{.*(i:[0-9];i:3).*\}
In this case I'm looking for i:*;i:3, where * is any digit.

replace all backslashes when they are escaped multiple times

I made a mistake in an app I'm developing, and it turns out that some of the text has double quotes that were escaped multiple times, something like this
We will begin by constructing the front and sides of the bar. We
will first create frames for both the front and sides using
2\\\\\\\\" x 4\\\\\\\\"s and will then secure particle
board panels over the frames. The bar in this project is
42\\\\\\\\" tall by 60\\\\\\\\" wide by
40\\\\\\\\" deep. You will be drilling several screws in this
project. To make this process easier- first drill pilot holes for the
screws in your frame using your Dremel Rotary Tool and a 150
1/8\\\\\\\\" Drill Bit.
I'm looking for a way to do an update in the DB that removes all those slashes quickly. So far my thinking is to use a regular expression, something like
foreach ($records as $record) {
$record->description = preg_replace('/some pattern here/', '', $record->description);
$record->save();
}
However, I'm having a hard time finding the correct pattern, so maybe someone can help with this. If there is an easier or better way to update these records, I'd really appreciate any help!
You should almost never be storing backslash-escaped content in the database. To clean these up you probably want something like:
preg_replace( "/\\\\{2,}/", "", $record->description );
That will replace sequences of 2 or more backslashes with the empty string. If necessary you can make it more specific -- so that it would only match those sequences of backslashes followed by a double-quote character:
preg_replace( '/\\\\{2,}"/', '"', $record->description );
That's if you want to do it via PHP. Your database engine may have a built-in regular expression replace function that would allow you to perform the update just using a query, and would likely be higher performance, if that matters. The pattern would likely be the same.
I just noticed your MySQL tag. Apparently MySQL does not have a built-in regex replace function. See How to do a regular expression replace in MySQL?
My suggestion is to get rid of the backslash escapes completely and then escape as necessary on output, unless there's a good reason to store the content escaped. You could of course alter the replacement string to replace those sequences with a single backslash or whatever you want.
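A short sketch of the quote-specific variant applied to one line of the sample text. The helper name strip_escaped_quotes is made up for illustration, and the nowdoc is used only so the backslashes in the sample stay literal.

```php
<?php
// Hypothetical helper: collapse any run of 2+ backslashes that precedes
// a double quote into just the double quote.
function strip_escaped_quotes(string $text): string
{
    return preg_replace('/\\\\{2,}"/', '"', $text);
}

// A nowdoc keeps every backslash exactly as typed (8 before each quote).
$description = <<<'TXT'
2\\\\\\\\" x 4\\\\\\\\"s
TXT;

echo strip_escaped_quotes($description), "\n";
```

Inside the loop from the question, this would be called once per record before saving.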
preg_replace("/([\\\\]){2,}/", "\\\\", $record->description);
should do the trick.

Single regular expression that extracts a number from two different url formats?

I am trying to create a single regular expression that I can use to extract the number from two different urls in a PHP function. The format of these urls are:
/t/2121/title/
and
/top2121.html
I am bad at regular expressions and have already tried the following and many variants of it:
#^/t/(\d+?)/|/top(\d+?)\.html/#i
This is not doing anything, and I am still at a complete loss after reading many sites and tutorials on regular expressions. Is there a regular expression I could create that would allow me to extract the number regardless of the URL format entered?
Regex to extract only the digits while also checking if url matches accepted formats:
#^\/t(?:\/(\d+)\/[a-z_-]+\/?|op(\d+)\.html)$#i
(edit: this one captures in 2 groups)
Explained demo here: http://regex101.com/r/dO5dI4
Variant #2: captures in the same group
#^\/t(?|\/(\d+)\/[a-z_-]+\/?$|op(\d+)\.html$)#i
Explained demo here: http://regex101.com/r/cG9vC3
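A minimal sketch of how variant #2 might be used from PHP. The function name extract_id is an assumption for illustration. Because of the branch-reset group (?|...), the number lands in group 1 whichever URL format matched.

```php
<?php
// Hypothetical wrapper around variant #2 (branch-reset group).
function extract_id(string $url): ?int
{
    $pattern = '#^\/t(?|\/(\d+)\/[a-z_-]+\/?$|op(\d+)\.html$)#i';

    // Both alternatives store their digits in capture group 1.
    if (preg_match($pattern, $url, $m) === 1) {
        return (int) $m[1];
    }

    return null; // URL is in neither accepted format
}

var_dump(extract_id('/t/2121/title/'));
var_dump(extract_id('/top2121.html'));
var_dump(extract_id('/other/2121'));
```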
If you just want the first digits after t, regardless of the / in between, something like this might work: #t/?(\d+)#i
edit:
example: http://codepad.viper-7.com/0z3ee0
I was able to get this regexp to match both types of url formats:
#^/(?:(?:t/)|(?:top))(\d+)(?:(?:\.html)|(?:/))#i
If anyone has a more efficient way of performing the same regexp, I would love to hear it.
If you have either one of these URLs, you could use this expression. The number will be in the second capture group:
#^/t(op|/)(\d+)(\.html|/.*)#i
Are there ever going to be numbers in the URL that you don't care about? If not, you can keep this simple by just capturing the numbers and ignoring the rest:
#(\d+)#

Best way to parse a text document

I'm trying to parse a plain text document in PHP but have no idea how to do it correctly.
I want to separate each word, assign them an ID and save the result in JSON format.
Sample text:
"Hello, how are you (today)"
This is what I'm doing at the moment:
$document_array = explode(' ', $document_text);
json_encode($document_array);
The resulting JSON is
[["Hello,"],["how"],["are"],["you"],["(today)"]]
How do I ensure that spaces are kept in-place and that symbols are not included along with the words...
[["Hello"],[", "],["how"],[" "],["are"],[" "],["you"],[" ("],["today"],[")"]]
I’m sure some sort of regex is required... but have no idea what kind of pattern to apply to deal with all cases... Any suggestions guys?
This is actually a really complex problem, and one that's subject to a fair amount of academic research. It sounds so simple (just split on whitespace! with maybe a few rules for punctuation...) but you quickly run into issues. Is "didn't" one word or two? What about hyphenated words? Some might be one word, some might be two. What about multiple successive punctuation characters? Possessives versus quotes? etc etc. Even determining the end of a sentence is non-trivial. (It's just a full stop right?!)
This problem is one of tokenisation and a topic that search engines take very seriously. To be honest you should really look at finding a tokeniser in your language of choice.
Maybe this?
array_filter(preg_split('/\b/', $document_text))
The array_filter call removes the empty values at the first and/or last index of the resulting array, which appear if your string starts or ends at a word boundary (\b, see: http://php.net/manual/en/regexp.reference.escape.php). Note that array_filter keeps the original keys and also drops any piece that is exactly "0".
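A small sketch of the word-boundary split on the sample text. It swaps array_filter for the PREG_SPLIT_NO_EMPTY flag, which is a substitution on my part: the flag drops the empty pieces while keeping the keys sequential, so json_encode emits a plain array and each index can serve as the token ID.

```php
<?php
$document_text = 'Hello, how are you (today)';

// Split at every word boundary; PREG_SPLIT_NO_EMPTY discards the empty
// piece produced by the boundary at the very start of the string.
$tokens = preg_split('/\b/', $document_text, -1, PREG_SPLIT_NO_EMPTY);

// The array index doubles as a simple token ID.
echo json_encode($tokens), "\n";
```

This handles the sample, but the tokenisation caveats raised above (contractions, hyphens, non-ASCII text) still apply.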

php extract 99 from string regex

I have a string:
tomato='36'/></carrot
From this I am trying to extract 36 using regex. I am using:
"/tomato='(.*)'\/>/"
This extracts the beginning ok, but not the rest. Any ideas how to fix this?
You should make your regex more specific so that it only matches digits:
"/tomato='(\d+)'\/>/"
Here are a few tools that can help with constructing regular expressions: https://stackoverflow.com/questions/89718/is-there-anything-like-regexbuddy-in-the-open-source-world
In your case you might rather match digits with \d+.
Depending on the actual use case, it might be simpler not to use regexps at all but a DOM parser, which simplifies the attribute extraction:
pq($xml)->find("recipe")->attr("tomato");
Try matching up to the first quote, and then grabbing everything that isn't a quote character:
/tomato='([^']*)'/
This method works well if you have no clue about what will be between the quotes, however it fails if the final quote is missing, or if you use double quotes instead of single quotes.
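To illustrate why the negated character class helps, here is a small comparison against the greedy pattern from the question. The second element in $input is invented so that there is more than one '/> in the string, which is an assumption about what the real input looks like.

```php
<?php
// Hypothetical input with a second quoted attribute further along.
$input = "tomato='36'/><pea size='5'/>";

// Greedy .* runs to the LAST '/> in the string, so it overshoots.
preg_match("/tomato='(.*)'\/>/", $input, $greedy);

// [^']* cannot cross a quote, so it stops at the first closing quote.
preg_match("/tomato='([^']*)'/", $input, $negated);

var_dump($greedy[1]);
var_dump($negated[1]);
```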
Given the very specific example you've provided, and the specific data you want to extract, it's fairly simple; your regex can ignore everything except the numeric characters:
"/\d+/"
If the input string could vary, and you specifically want to find the value of the tomato attribute, then:
"/tomato='(\d+)'/"
There's unlikely to be any real need to be matching the rest of the string - in fact, it's more likely to cause problems, given how variable XML can be.
But the question is, what exactly are you trying to do here? It looks very much like you're trying to parse an HTML/XML stream, but how did you end up with just this odd chunk? Did you do explode(' ',$xml);?
You may find a more scalable and manageable way of extracting data from an XML stream would be to use a DOM parser instead. Regex can work, but HTML/XML tends to have sufficient variation in formatting that you end up with some really really horrible regex strings if you want to be certain of getting the data you want; a DOM parser tends to be much more reliable in this respect.
May I suggest you investigate PHP's built-in DOM parser: http://www.php.net/dom
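A minimal sketch of the DOM approach using PHP's built-in DOMDocument. The surrounding XML is made up for illustration (the question only shows a fragment), with the element name recipe borrowed from the earlier answer.

```php
<?php
// Hypothetical well-formed document containing the attribute.
$xml = "<recipe tomato='36'><carrot/></recipe>";

$doc = new DOMDocument();
$doc->loadXML($xml);

// Read the attribute straight off the root element; no regex involved.
$tomato = $doc->documentElement->getAttribute('tomato');

var_dump($tomato);
```

The parser copes with either quote style and with attributes in any order, which is where hand-written patterns tend to break.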
Hope that helps.
