regex/preg_replace to extract the part number (substring) - php

I'm not very comfortable with RegEx.
The Use Case
I use three variables, namely $url, $pattern and $replacement and intend to use them as follows:
$url = $node->attr("href");
$resource = ExtractResourceWithoutHtmlExtension($url); // This is just to abstract the stripping of the prepended path and the cutting of `.html` (see Edit 2 & 3 below).
$pattern = ...
$replacement = '${1}'; // Not very sure of this value
$partno = preg_replace($pattern, $replacement, $resource);
echo '"'.$partno.'";"'.$node->attr("title").'";"'.$url.'"'."\n";
The Part number and Resource scheme mapping (string)
most of the time
35000-0295 => designation-of-the-products-as-slug-35000-0295
27021-0012 => designation-of-the-products-as-slug-27021-0012
or rarely
38811 => designation-of-the-products-as-slug-38811
last but not the least (edge case => nothing to extract)
In case of non availability of Part number, the Resource substring would be simply
designation-of-the-products-as-slug
I still prefer a RegEx solution because the length of the numbers within the segments that make up the Part number might vary.
The Question
What should I assign to $pattern and $replacement?
Edit 1 (for reference)
The substring designation-of-the-products-as-slug are mutable and path/to/ could be of any arbitrary depth.
Edit 2 (for reference)
On second thought I realise that there is no need to use RegEx for the whole URL path: http://path/to/ could be stripped off using parse_url, explode and array_pop. I have edited my post accordingly.
Edit 3 (for reference)
The complexity could also be reduced by cutting off the immutable trailing substring .html. Cf. #bloodyKnuckles's comment below. Post edited accordingly.

To start with I'd use a combination of parse_url and pathinfo to strip off extraneous bits from the string, then use preg_filter with a regex like /.*?(\d+[\d-]*)$/ to grab the last chunk of digits plus optional following hyphens and digits.
Example:
$urls = [
"http://example.com/path/to/designation-of-the-products-as-slug-35000-0295.extension",
"http://example.com/path/to/designation-of-the-products-as-slug-35000.html",
"http://example.com/path/to/designation-of-the-products-as-slug.ext?foo=bar.baz"
];
$regex = '/.*?(\d+[\d-]*)$/';
foreach ($urls as $url) {
$resource = pathinfo(parse_url($url, PHP_URL_PATH), PATHINFO_FILENAME);
echo preg_filter($regex, '$1', $resource), "\n";
}
Output:
35000-0295
35000
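If an explicit null for the "nothing to extract" case is preferable to relying on preg_filter's return value, the same idea can be sketched with preg_match. The helper name extract_part_number and the slightly stricter pattern (digit groups joined by single hyphens) are assumptions made for illustration; they are not part of the answer above.

```php
<?php
// Hypothetical helper: returns the trailing part number of a URL's file name,
// or null when the slug does not end in digits.
function extract_part_number(string $url): ?string
{
    // Strip scheme, host, query string, directories and extension.
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return null;
    }
    $resource = pathinfo($path, PATHINFO_FILENAME);

    // One or more digit groups separated by single hyphens, anchored at the end.
    if (preg_match('/\d+(?:-\d+)*$/', $resource, $m) === 1) {
        return $m[0];
    }
    return null;
}

var_dump(extract_part_number('http://example.com/path/to/designation-of-the-products-as-slug-35000-0295.html'));
var_dump(extract_part_number('http://example.com/path/to/designation-of-the-products-as-slug.html'));
```

Note that a slug which happens to end in digits without being a part number (for example a title ending in "top-10") would still be reported as one; that ambiguity is inherent in the naming scheme.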


Replace (add) words case sensitive from arrays

I am new to php and especially to regex.
My goal is to automatically enrich texts with hints for "keywords" that are listed in arrays.
This is what I have so far:
$pattern = array("/\bexplanations\b/i",
"/\btarget\b/i",
"/\bhints\b/i",
"/\bhint\b/i",
);
$replacement = array("explanations <i>(Erklärungen)</i>",
"target <i>Ziel</i>",
"hints <i>Hinsweise</i>",
"hint <i>Hinweis</i>",
);
$string = "Target is to add some explanations (hints) from an array to
this text. I am thankful for every hint.";
echo preg_replace($pattern, $replacement, $string);
returns:
target <i>Ziel</i> is to add some explanations <i>(Erklärungen)</i> (hints <i>Hinsweise</i>) from an array to this text. I am thankful for every hint <i>Hinweis</i>
1) In general, I wonder whether there are more elegant solutions (possibly without replacing the original word)?
At a later stage the arrays will contain more than 1000 items and will come from MariaDB.
2) How can I make sure that a word like "Target" keeps its original case in the output
(without doubling the length of my arrays)?
Sorry for my English and many thanks in advance.
If you plan to increase the size of your array, and the text may be fairly long, processing the whole text once per word isn't a viable approach. With a large array, building one giant alternation of all the words isn't viable either.
But if you store all the translations in an associative array and split the text on word-boundaries, you can do it in one pass:
// Translation array with all keys lowercase
$trans = [ 'explanations' => 'Erklärungen',
'target' => 'Ziel',
'hints' => 'Hinsweise',
'hint' => 'Hinweis'
];
$parts = preg_split('~\b~', $text);
$partsLength = count($parts);
// All words are in the odd indexes
for ($i=1; $i<$partsLength; $i+=2) {
$lcWord = strtolower($parts[$i]);
if (isset($trans[$lcWord]))
$parts[$i] .= ' <i>(' . $trans[$lcWord] . ')</i>';
}
$result = implode('', $parts);
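For reference, here is the same single-pass idea wrapped into a self-contained function with a sample input. The function name annotate_keywords and the shortened translation table are assumptions for illustration only.

```php
<?php
// Hypothetical wrapper around the single-pass approach; keys must be lowercase.
function annotate_keywords(string $text, array $trans): string
{
    // Splitting on \b yields alternating non-word / word pieces,
    // so the words sit at the odd indexes.
    $parts = preg_split('~\b~', $text);
    for ($i = 1, $n = count($parts); $i < $n; $i += 2) {
        $lcWord = strtolower($parts[$i]);
        if (isset($trans[$lcWord])) {
            // Append the hint; the original word keeps its case.
            $parts[$i] .= ' <i>(' . $trans[$lcWord] . ')</i>';
        }
    }
    return implode('', $parts);
}

$trans = ['target' => 'Ziel', 'hint' => 'Hinweis'];
echo annotate_keywords('Target reached, thanks for every hint.', $trans), "\n";
```

Because the lookup is a hash access per word, the cost grows with the length of the text rather than with the number of keywords.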
The limitation here is that you can't use a key that contains a word boundary (for instance, if you want to translate a whole multi-word expression). To handle this case, you can use preg_match_all in place of preg_split and build a pattern that tests these special cases first, something like:
preg_match_all('~mushroom pie\b|\w+|\W*~iS', $text, $m);
$parts = &$m[0];
$partsLength = count($parts);
$i = 1 ^ preg_match('~^\w~', $parts[0]);
for (; $i<$partsLength; $i+=2) {
...
(if you have a lot of exceptions (too many) other strategies are possible.)
Enclose the search words in parentheses in the regex patterns and use backreferences in the replacements.
See this PHP demo:
$pattern = array("/\b(explanations)\b/i", "/\b(target)\b/i", "/\b(hints)\b/i", "/\b(hint)\b/i", );
$replacement = array('$1 <i>(Erklärungen)</i>', '$1 <i>Ziel</i>', '$1 <i>Hinsweise</i>', '$1 <i>Hinweis</i>', );
$string = "Target is to add some explanations (hints) from an array to this text. I am thankful for every hint.";
echo preg_replace($pattern, $replacement, $string);
That way, the replacement reuses the word exactly as it appears in the text, preserving its case.
Note that it is very important to list the patterns in descending order of length, with longer patterns before shorter ones (first Targets, then Target, etc.).

PHP Regular Expression to Match Function Name and Parameters with string like Needle(needle|needle)

I am filtering database results with a query string that looks like this:
attribute=operator(value|optional value)
I'll use
$_GET['attribute'];
to get the value.
I believe the right approach is using regex to get matches on the rest.
The preferred output would be
print_r($matches);
array(
1 => operator
2 => value
3 => optional value
)
The operator will always be one word and consist of letters: like(), between(), in().
The values can be many different things including letters, numbers, spaces, commas, quotation marks, etc...
I was asked where my code was failing and didn't include much code because of how poorly it worked. Based on the accepted answer, I was able to whip up a regex that almost works.
EDIT 1
$pattern = "^([^\|(]+)\(([^\|()]+)(\|*)([^\|()]*)";
Edit 2
$pattern = "^([^\|(]+)\(([^\|()]+)(\|*)([^\|()]*)"; // I thought this would work.
Edit 3
$pattern = "^([^\|(]+)\(([^\|()]+)(\|+)?([^\|()]+)?"; // this does work!
Edit 4
$pattern = "^([^\|(]+)\(([^\|()]+)(?:\|)?([^\|()]+)?"; // this gets rid of the middle matching group.
The only remaining problem is when the 2nd optional parameter does not exist, there is still an empty $matches array.
This script, with the input "operator(value|optional value)", returns the array you expect:
<?php
$attribute = $_GET['attribute'];
$result = preg_match("/^([\w ]+)\(([\w ]+)\|([\w ]*)\)$/", $attribute, $matches);
print($matches[1] . "\n");
print($matches[2] . "\n");
print($matches[3] . "\n");
?>
This assumes your "values" match [\w ] regexp (all word characters plus space), and that the | you specify is a literal |...
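To also cover the case where the second value is missing, the "|optional value" part can be wrapped in an optional non-capturing group. The following is only a sketch under a few assumptions: the helper name parse_filter is made up, the operator is assumed to be letters only, and values are assumed not to contain "|", "(" or ")".

```php
<?php
// Hypothetical helper: splits "operator(value|optional value)" into its parts.
function parse_filter(string $input): ?array
{
    // Operator: letters only. Values: anything except "|", "(" and ")".
    // The "|second value" part sits in an optional non-capturing group.
    $pattern = '/^([a-z]+)\(([^|()]+)(?:\|([^|()]+))?\)$/i';
    if (preg_match($pattern, $input, $matches) !== 1) {
        return null; // input does not follow the expected shape
    }
    // Drop the full match; keep operator, value and (if present) second value.
    return array_slice($matches, 1);
}

print_r(parse_filter('between(10|20)'));
print_r(parse_filter('like(some value)'));
```

When the optional group does not take part in the match, preg_match leaves that trailing group out of $matches, so the second call returns only two elements instead of an empty third one.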

Regex to match specific string not enclosed by another, different specific string

I need a regex to match a string not enclosed by another different, specific string. For instance, in the following situation it would split the content into two groups: 1) The content before the second {Switch} and 2) The content after the second {Switch}. It wouldn't match the first {Switch} because it is enclosed by {my_string}'s. The string will always look like shown below (i.e. {my_string}any content here{/my_string})
Some more
{my_string}
Random content
{Switch} //This {Switch} may or may not be here, but should be ignored if it is present
More random content
{/my_string}
Content here too
{Switch}
More content
So far I've gotten what is below which I know isn't very close at all:
(.*?)\{Switch\}(.*?)
I'm just not sure how to use the [^] (not operator) with a specific string versus different characters.
It really seems you're trying to use a regular expression to parse a grammar - something that regular expressions are really bad at doing. You might be better off writing a parser to break down your string into the tokens that build it, and then processing that tree.
Perhaps something like http://drupal.org/project/grammar_parser might help.
Try this simple function, find_content():
function find_content($doc) {
$temp = $doc;
preg_match_all('~{my_string}.*?{/my_string}~is', $temp, $x);
$i = 0;
while (isset($x[0][$i])) {
$temp = str_replace($x[0][$i], "{REPL:$i}", $temp);
$i++;
}
$res = explode('{Switch}', $temp);
foreach ($res as &$part)
foreach($x[0] as $id=>$content)
$part = str_replace("{REPL:$id}", $content, $part);
return $res;
}
Use it this way
$content_parts = find_content($doc); // $doc is your input document
print_r($content_parts);
Output (your example)
Array
(
[0] => Some more
{my_string}
Random content
{Switch} //This {Switch} may or may not be here, but should be ignored if it is present
More random content
{/my_string}
Content here too
[1] =>
More content
)
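A regex-only alternative to the placeholder approach, offered as a sketch rather than a tested drop-in: PCRE's backtracking control verbs (*SKIP)(*F) let preg_split step over each {my_string}...{/my_string} block and split only on a {Switch} outside of it. The function name split_on_switch is an assumption, and the pattern assumes the blocks are not nested.

```php
<?php
// Sketch: split on {Switch} only when it is outside {my_string}...{/my_string}.
function split_on_switch(string $doc): array
{
    // The first alternative consumes a whole protected block, then (*SKIP)(*F)
    // discards that match and resumes scanning after the block.
    // The second alternative is the delimiter we actually split on.
    $pattern = '~\{my_string\}.*?\{/my_string\}(*SKIP)(*F)|\{Switch\}~s';
    return preg_split($pattern, $doc);
}

$doc = "Some more {my_string}Random {Switch} content{/my_string} Content here too {Switch} More content";
print_r(split_on_switch($doc));
```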
You can try positive lookahead and lookbehind assertions (http://www.regular-expressions.info/lookaround.html)
It might look something like this:
$content = 'string of text before some random content switch text some more random content string of text after';
$before = preg_quote('string of text before', '/');
$switch = preg_quote('switch text', '/');
$after = preg_quote('string of text after', '/');
// The first group is lazy so that it stops at the switch text, which must be present here
if( preg_match('/(?<=' . $before . ')(.*?)' . $switch . '(.*)(?=' . $after . ')/', $content, $matches) ) {
// $matches[1] == ' some random content '
// $matches[2] == ' some more random content '
}
$regex = '/(?:(?!\{my_string\})(.*?))(\{Switch\})(?:(.*?)(?!\{my_string\}))/';
/* if "my_string" and "Switch" aren't wrapped by "{" and "}" just remove "\{" and "\}" */
$yourNewString = preg_replace($regex,"$1",$yourOriginalString);
This might work. I can't test it now, but I'll update later!
I don't know if this is what you're looking for, but to negate more than one character, the regex syntax is:
(?!yourString)
and it is called "negative lookahead assertion".
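As a minimal illustration of the syntax on its own (the strings foo and bar are placeholders, not taken from the question), a negative lookahead matches a position only when the given text does not follow:

```php
<?php
// Match "foo" only when it is NOT immediately followed by "bar".
preg_match_all('/foo(?!bar)/', 'foobar foobaz foo', $m, PREG_OFFSET_CAPTURE);

// Each entry of $m[0] is [matched text, byte offset]; keep just the offsets.
$offsets = array_column($m[0], 1);
print_r($offsets);
```

The first "foo" (offset 0) is rejected because "bar" follows it; the other two are accepted.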
/Edit:
This should work and return true:
$stringMatchesYourRulesBoolean = preg_match('~(.*?)('.$my_string.')(.*?)(?<!'.$my_string.') ?('.$switch.') ?(?!'.$my_string.')(.*?)('.$my_string.')(.*?)~',$yourString);
Have a look at PHP PEG. It is a little parser written in PHP. You can write your own grammar and parse it. It's going to be very simple in your case.
The grammar syntax and the way of parsing is all explained in the README.md
Extracts from the readme:
token* - Token is optionally repeated
token+ - Token is repeated at least once
token? - Token is optionally present
Tokens may be :
- bare-words, which are recursive matchers - references to token rules defined elsewhere in the grammar,
- literals, surrounded by `"` or `'` quote pairs. No escaping support is provided in literals.
- regexs, surrounded by `/` pairs.
- expressions - single words (match \w+)
Sample grammar: (file EqualRepeat.peg.inc)
class EqualRepeat extends Packrat {
/* Any number of a followed by the same number of b and the same number of c characters
* aabbcc - good
* aaabbbccc - good
* aabbc - bad
* aabbacc - bad
*/
/*Parser:Grammar1
A: "a" A? "b"
B: "b" B? "c"
T: !"b"
X: &(A !"b") "a"+ B !("a" | "b" | "c")
*/
}

how to make a string lowercase without changing url

I'm using mb_strtolower to make a string lowercase, but sometimes the text contains URLs with uppercase characters. When I use mb_strtolower, the URLs are of course changed as well and stop working.
How can I convert a string to lowercase without changing the URLs?
Since you have not posted your string, this can be only generally answered.
Whenever you use a function on a string to make it lower-case, the whole string will be made lower-case. String functions are aware of strings only, they are not aware of the contents written within these strings specifically.
In your scenario you do not want to lowercase the whole string I assume. You want to lowercase only parts of that string, other parts, the URLs, should not be changed in their case.
To do so, you must first parse your string into these two different parts, let's call them text and URLs. Then you need to apply the lowercase function only on the parts of type text. After that you need to combine all parts together again in their original order.
If the content of the string is semantically simple, you can split the string at spaces. Then you can check each part, if it begins with http:// or https:// (is_url()?) and if not, perform the lowercase operation:
$text = 'your content http://link.me/now! might differ';
$fragments = explode(' ', $text);
foreach($fragments as &$fragment) {
if (is_not_url($fragment))
$fragment = strtolower($fragment); // or mb_strtolower
}
unset($fragment); // remove reference
$lowercase = implode(' ', $fragments);
For this code to work, you need to define the is_not_url() function. Additionally, the original text must be simple enough that rudimentary parsing on the space separator is sufficient.
Hopefully this example helps you get along with coding and understanding your problem.
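One possible definition of is_not_url(), shown as a sketch: it only checks for a leading http:// or https://, which is an assumption about what counts as a URL here. The wrapper name lowercase_except_urls is likewise made up for the example.

```php
<?php
// Hypothetical helper: true unless the fragment starts with http:// or https://
function is_not_url(string $fragment): bool
{
    return preg_match('~^https?://~i', $fragment) !== 1;
}

function lowercase_except_urls(string $text): string
{
    $fragments = explode(' ', $text);
    foreach ($fragments as &$fragment) {
        if (is_not_url($fragment)) {
            // Swap in mb_strtolower() for multibyte text.
            $fragment = strtolower($fragment);
        }
    }
    unset($fragment); // remove reference
    return implode(' ', $fragments);
}

echo lowercase_except_urls('Your CONTENT http://Link.me/Now! MIGHT differ'), "\n";
```

A URL that is glued to other characters without a space in front of it (for example inside parentheses) would not be recognised by this check and would be lowercased.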
Here you go, iterative, but as fine as possible.
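function strtolower_sensitive ( $input ) {
// Match URLs up to (not including) the first delimiter that follows them
$regexp = "#((http|https|ftp)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)#i";
$hist = array();
if(preg_match_all($regexp, $input, $matches, PREG_SET_ORDER)) {
foreach ( $matches as $i => $match ) {
// Swap each URL for a lowercase placeholder that strtolower() leaves intact
$n = "{sxxx" . ($i + 1) . "}";
$input = str_replace( $match[1], $n, $input );
$hist[] = array($match[1], $n);
}
}
// Lowercase everything, then put the original URLs back
$input = strtolower($input);
foreach ( $hist as $h ) {
$input = str_replace ( $h[1], $h[0], $input );
}
return $input;
}
$input is your string; the function's return value is your answer. =)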

Php is stripping one letter "g" from my rtrim function but not other chars

I'm trying to trim some youtube URLs that I am reading in from a playlist. The first 3 work fine and all their URLs either end in caps or numbers but this one that ends in a lower case g is getting trimmed one character shorter than the rest.
for ($z=0; $z <= 3; $z++)
{
$ythref2 = rtrim($tubeArray["feed"]["entry"][$z]["link"][0]["href"], '&feature=youtube_gdata');
The URL is http://www.youtube.com/watch?v=CuE88oVCVjg&feature=youtube_gdata .. and it should get trimmed down to .. http://www.youtube.com/watch?v=CuE88oVCVjg but instead it is coming out as http://www.youtube.com/watch?v=CuE88oVCVj.
I think it may be the ampersand symbol but I am not sure.
The second argument to rtrim is a list of characters to remove, not a string to remove.
You might want to use str_replace, or use parse_url and parse_str to get arrays of the components of the URL and the components of the query string, like "v".
Untested example code:
$youtube_url = 'http://www.youtube.com/watch?v=CuE88oVCVjg&feature=youtube_gdata';
$url_bits = parse_url($youtube_url);
$query_string = array();
parse_str($url_bits['query'], $query_string);
$video_identifier = $query_string['v']; // "CuE88oVCVjg"
$rebuilt_url = 'http://www.youtube.com/watch?v=' . $video_identifier;
No, it's the g in the second argument. rtrim() does not remove a string from the end, it removes any characters given in the second argument. Use preg_replace() or substr() instead.
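The difference can be reproduced in a few lines. The helper strip_suffix below is a hypothetical name for a substr()-based variant that removes the suffix only when the string really ends with it.

```php
<?php
// rtrim() treats its second argument as a set of characters, so the trailing
// "g" of the video id is stripped too because "g" occurs in that set.
$url = 'http://www.youtube.com/watch?v=CuE88oVCVjg&feature=youtube_gdata';
$suffix = '&feature=youtube_gdata';
$trimmed = rtrim($url, $suffix);

// Hypothetical helper: remove $suffix only when $s really ends with it.
function strip_suffix(string $s, string $suffix): string
{
    $len = strlen($suffix);
    if ($len > 0 && substr($s, -$len) === $suffix) {
        return substr($s, 0, -$len);
    }
    return $s;
}

echo $trimmed, "\n";                     // ends in ...CuE88oVCVj
echo strip_suffix($url, $suffix), "\n";  // ends in ...CuE88oVCVjg
```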
