User regular expression to match filename and slice it in pieces - php

I have the following string:
./../../../project-files/files/RE-filename.docx
And I want to receive an array with two elements: RE-filename and docx. By the way, the RE-filename section could contain a German umlaut (äöü).
I use the following pattern with preg_match_all():
\/(?!.+\/)|\\(?!.+\\)(.+)-(?!.+-)(.+)\.(?!.*\.)(.+)
However, it doesnt do what I expect. It works when I don't use |\\(?!.+\\) but I am scared that the script will run on a different system that uses \ to separate the directories, that's why I wanted to test for both.
Any ideas? Help is much appreciated!

Related

PHP: How would I remove parts of a string between 2 chunks of characters without removing too much?

This problem is driving me nuts. Let's say I have a string:
This is a &start;pretty bad&end; string that I want to &start;somehow&end; display differently
I want to be able to remove the &start; and &end; parts as well as everything in between so it says:
This is a string that I want to display differently
I tried using preg_replace with a regular expression but it took off too much, ie:
This is a display differently
The question is: how do I remove the stuff just between sets of &start; and &end; pairs and make sure that it doesn't remove anything between any &end; and &start; segments?
Keep in mind, I'm working with hundreds of strings that are very different to each other so I'm looking for a flexible solution that'll work with all of them.
Thanks in advance for any help with this.
Edit: Replaced dollar signs with ampersands. Oops!
Try this regex /\&start;(.+?)\$end;/g
It looks like it works as desired: https://regex101.com/r/MW5nom/2
I quickly tried it on chrome console using JS, tried converting it into PHP:
"This is a &start;pretty bad$end; string that I want to &start;somehow$end; display differently".replace(/\&start;(.+?)\$end;/g, "")

regex with preg_match anything and all line breaks

i am not too good with regex and can't seem to find the answer
I am writing a class file to check data type and "partially/best possible sanitise" any submitted data as well as performing some other functions too. This is working on all data types (i.e emails, url's phone numbers, int/signed/un-signed, words, passwords, various date formats, basic HTML, etc)
i am having problems with trying to match "anything"* (this is the one data type i dont really need to check, but for consistency, i need it to run through the preg_match, but always want it to return true).
when i say "anything" i want it to match any text, number, symbols AND Line Breaks. It is the line break i am having problems with
i am using :
define('REG_TEXT', '/^(.*)$/');
preg_match(REG_TEXT, $data)
this works fine on the first paragraph, but wont match past any line beaks so returns false
an example of what i want this to match (return true) would be:
this is a test match on anything 345 +_)(*&^%$£"!<br><html> <?php echo this i PHP; ?>
and match this too on a new line
and match all this line too
and anything else at all
i am not worried about any code in-putted into the data at this point as other areas of my class are dealing with this (before this stage!).
basically i am after a regex that will match/return true on absolutely anything.
(i dont want to change to preg_match_all as this will break other aspects of the class or require me to add additional code that will be a partial repeat of code that i dont think is needed)
any advice would be greatly welcomed!
thanks
Jon
Use:
'/^(.*)$/ms'
You need the m and s modifiers here. http://php.net/manual/en/reference.pcre.pattern.modifiers.php

Can't use OR( | ) in php Regular expression

I'm a newbie here. I'm facing a weird problem in using regex in PHP.
$result = "some very long long string with different kind of links";
$regex='/<.*?href.*?="(.*?net.*?)"/'; //this is the regex rule
preg_match_all($regex,$result,$parts);
Here in this code I'm trying to get the links from the result string. But it will provide me only those links which contains .net. But I also want to get those links which have .com. For this I tried this code
$regex='/<.*?href.*?="(.*?net|com.*?)"/';
But it shows nothing.
SOrry for my bad English.
Thanks in advance.
Update 1 :
now i'm using this
$regex='/<.*?href.*?="(.*?)"/';
this rule grab all the links from the string. But this is not perfect. Because it also grabs other substrings like "javascript".
The | character applies to everything within the capturing group, so (.*?net|com.*?) will match either .*?net or com.*?, I think what you want is (.*?(net|com).*?).
If you do not want the extra capturing group, you can use (.*?(?:net|com).*?).
You could also use (.*?net.*?|.*?com.*?), but this is not recommended because of the unnecessary repetition.
Your regex gets interpreted as .*?net or com.*?. You'll want (.*?(net|com).*?).
Try this:
$regex='/<.*?href.*?="(.*?\.(?:net|com)\b.*?)"/i';
or better:
$regex='/<a .*?href\s*+=\s*+"\K.*?\.(?:net|com)\b[^"]*+/i';
<.*?href
is a problem. This will match from the first < on the current line to the first href, regardless of whether they belong to the same tag.
Generally, it's unwise to try and parse HTML with regexes; if you absolutely insist on doing that, at least be a bit more specific (but still not perfect):
$regex='/<[^<>]*href[^<>=]*="(?:[^"]*(net|com)[^"]*)"/';

Fetch All URLs from a Page using Regex

Original format:
<a href="http://www.example.com/t434234.html" ...>
1. I need to fetch all URLs of this format:
http://www.example.com/t[ANY CHARACTER].html
ANY CHARACTER is where value changes from URL to another. The rest are fixed.
Here is my attempt:
preg_match("#http:\/\/www\.aqarcity\.com\/t[a-zA-Z0-9_]\.html#", $page, $urls);
I get empty results. I don't know where i went wrong...
The problem appears to be that [a-zA-Z0-9_] will only match exactly one character. If you want to match zero or more characters, use [a-zA-Z0-9_]*. For one or more, use [a-zA-Z0-9_]+. For exactly six characters, use [a-zA-Z0-9_]{6}. For e.g. one to six characters, use [a-zA-Z0-9_]{1,6}.
Also note that, since you're using # as the delimiter, you don't need to escape the / characters. As far as I know this will not make your code misbehave, but it'll be easier to read if you remove the backslashes before the slashes.
Finally, please realize that regular expressions are a rather dangerous way to work with HTML. In this case, you may pick up matching URLs from comments, Javascript code, and other things that aren't links. It is literally impossible to correctly parse HTML with unaugmented regular expressions—they don't have the expressive power necessary to do so. I don't know what sorts of HTML parsers are available for PHP, but you may want to look into them.

Apply a list of regular expressions in PHP

I have a long list of regular expressions in an ignore.txt, and another long list in an include.txt file. What would be the quickest way to apply these using PHP against data contained in a sample.html file such that any matches found in include are captured, but then anything matching in ignore.txt is excluded?
If your include.txt and ignore.txt files are setup so that they're only regular expressions, and there's one expression per line, you can use PHP's file() function. That will load the files into an array where each individual line is an element in the array. You can use file_get_contents() to load the sample.html file in as a string.
preg_match() or preg_match_all() do not actually take arrays as input, like preg_replace() does. You will need to loop over your array of expressions using something like foreach and applying an individual call to one of the matching functions to get your results.
I think preg_match_all() will suit your needs best, because it sounds like you're wanting to pull all of the matches out of the entire file, not just the first one. Once you have your full list of matches, then you'd apply your filter using the data from ignore.txt in a similar manner.
the quickest way would be to let the shell do the job
$result = `cat sample.html | egrep -f include.txt | egrep -vf ignore.txt`;

Categories