I've parsing a xml file that sometimes has the value <avg_cpc>some number</avg_cpc> sometime don't.
my regex look like this:
<is_adult>(.*?)</is_adult>.*?<trademark_probability>(.*?)</trademark_probability>.*?<total_extensions_used>(.*?)</total_extensions_used> **here comes the <avg_cpc>some number</avg_cpc>** .*?</appraisal>
how can I make this regex match items that don't have cpc value ?
I've tried (<avg_cpc>.*?</avg_cpc>)? without luck.
Thanks !
Please use a real XML parser for PHP, instead of regular expressions. This will make everything much easier, not to mention less error-prone.
I would guess it's because you're not escaping your slashes, try this:
<is_adult>(.*?)<\/is_adult>.*?<trademark_probability>(.*?)<\/trademark_probability>.*?<total_extensions_used>(.*?)<\/total_extensions_used>(<avg_cpc>.*?<\/avg_cpc>)?.*?<\/appraisal>
I would also use [^<]+ instead of .*? if possible.
Related
I've got a problem with regexp function, preg_replace(), in PHP.
I want to get viewstate from html's input, but it doesn't work properly.
This code:
$viewstate = preg_replace('/^(.*)(<input\s+id="__VIEWSTATE"\s+type="hidden"\s+value=")(.*[^"])("\s+name="__VIEWSTATE">)(.*)$/u','^\${3}$',$html);
Returns this:
%0D%0A%0D%0A%3C%21DOCTYPE+html+PUBLIC+%22-%2F%2FW3C%2F%2FDTD+XHTML+1.0+Transitional%2F%2FEN%22+%22http%3A%2F%2Fwww.w3.org%2FTR%2Fxhtml1%2FDTD%2Fxhtml1-transitional.dtd%22%3E%0D%0A%0D%0A%3Chtml+xmlns%3D%22http%3A%2F%2Fwww.w3.org%2F1999%2Fxhtml%22+%3E%0D%0A%3Chead%3E%3Ctitle%3E%0D%0A%09Strava.cz%0D%0A%3C%2Ftitle%3E%3Clink+rel%3D%22shortcut+icon%22+href%3D%22..%2FGrafika%2Ffavicon.ico%22+type%3D%22image%2Fx-icon%22+%2F%3E%3Clink+rel%3D%22stylesheet%22+type%3D%22text%2Fcss%22+media%3D%22screen%22+href%3D%22..%2FStyly%2FZaklad.css%22+%2F%3E%0D%0A++++%3Cstyle+type%3D%22text%2Fcss%22%3E%0D%0A++++++++.style1%0D%0A++++++++%7B%0D%0A++++++++++++width%3A+47px%3B%0D%0A++++++++%7D%0D%0A++++++++.style2%0D%0A++++++++%7B%0D%0A++++++++++++width%3A+64px%3B%0D%0A++++++++%7D%0D%0A++++%3C%2Fstyle%3E%0D%0A%0D%0A%3Cscript+type%3D%22text%2Fjavascript%22%3E%0D%0A%0D%0A++var+_gaq+%3D+_gaq+%7C%7C+%5B%5D%3B%0D%0A++_gaq.push%28%5B
EDIT: Sorry, I left this question for a long time. Finally I used DOMDocument.
To be sure i'd split this match into two phases:
Find the relevant input element
Get the value
Because you cannot be certain what the attributes order in the element will be.
if(preg_match('/<input[^>]+name="__VIEWSTATE"[^>]*>/i', $input, $match))
$value = preg_replace('/.*value="([^"]*)".*/i', '$1', $match[0]);
And, of course, always consider DOM and DOMXpath over regex for parsing html/xml.
You should only capture when you're planning on using the data. So most () are obsolete in that regexp pattern. Not a cause for failure but I thought I'd mention it.
Instead of using [^"] to mark that you don't want that character you could use the non-greedy modifier - ?. This makes sure the pattern is matching as little as it can. Since you have name="__VIEWSTATE" following the value this should be safe.
Let's put this in practice and simplify the pattern some. This works as you want:
'/.*<input\s+id="__VIEWSTATE"\s+type="hidden"\s+value="(.+?)"\s+name="__VIEWSTATE">.*/'
I would strongly recommend checking out an alternative to regexp for DOM operations. This makes certain your code works also if the attributes changes order. Plus it's so much nicer to work with.
The main mistake was the use of funciton preg_replace, witch returns the subject - neither the matched pattern nor the replacement. Thank you for your ideas and for the recommendation of DOMDocument. m93a
http://www.php.net/manual/en/function.preg-replace.php#refsect1-function.preg-replace-returnvalues
I'm looking for a regex that will scan a document to match a function call, and return the value of the first parameter (a string literal) only.
The function call could look like any of the following:
MyFunction("MyStringArg");
MyFunction("MyStringArg", true);
MyFunction("MyStringArg", true, true);
I'm currently using:
$pattern = '/Use\s*\(\s*"(.*?)\"\s*\)\s*;/';
This pattern will only match the first form, however.
Thanks in advance for your help!
Update
I was able to solve my problem with:
$pattern = '/Use\s*\(\s*"(.*?)\"/';
Thanks Justin!
~Scott
If you only care about the value of the first parameter, you can just chop off the end of the regex:
$pattern = '/Use\s*\(\s*"(.*?)\"/';
However, you should understand that this (or any pure-regex solution for this problem) will not be perfect, and there will be some possible cases it handles incorrectly. In this case, you'll get false positives, and escaped quotes (\") will break it.
You can ignore escaped quotes by complicating it a bit:
$pattern = '/Use\s*\(\s*"(.*?)(?!<(?:\\\\)*\\)\"/';
This ignores " characters inside the quoted string if they have an odd number of backslashes in front of them.
However, the false-postives issue can't be helped without introducing false-negatives, and vice versa. This is because PHP is an irregular language, so it can't be parsed with "pure" regex, and even modern regex engines that allow recursion are going to need some pretty complex code to do a really thorough job at this.
All I'm saying is, if you're planning a one-off job to quickly scrape through some PHP you wrote yourself, regex is probably fine. If you're looking for something robust and open-ended that will do this on arbitrary PHP code, you need some kind of reflection or PHP parser.
This might be slightly simpler, though will only work if you have double quotes and not single quotes:
$pattern = /Use\s*[^\"]*\"([^\"]*)\"/
Let's assume I do preg_replace as follows:
preg_replace ("/<my_tag>(.*)<\/my_tag>/U", "<my_new_tag>$1</my_new_tag>", $sourse);
That works but I do also want to grab the attribute of the my_tag - how would I do it with this:
<my_tag my_attribute_that_know_the_name_of="some_value">tra-la-la</my_tag>
You don't use regex. You use a real parser, because this stuff cannot be parsed with regular expressions. You'll never know if you've got all the corner cases quite right and then your regex has turned into a giant bloated monster and you'll wish you'd just taken fredley's advice and used a real parser.
For a humourous take, see this famous post.
preg_replace('#<my_tag\b([^>]*)>(.*?)</my_tag>#',
'<my_new_tag$1>$2</my_new_tag>', $source)
The ([^>]*) captures anything after the tag name and before the closing >. Of course, > is legal inside HTML attribute values, so watch out for that (but I've never seen it in the wild). The \b prevents matches of tag names that happen to start with my_tag, preventing bogus matches like this:
<my_tag_xyz>ooga-booga</my_tag_xyz><my_tag>tra-la-la</my_tag>
But that will still break on <my_tag> elements wrapped in other <my_tag> elements, yielding results like this:
<my_tag><my_tag>tra-la-la</my_tag>
If you know you'll never need to match tags with other tags inside them, you can replace the (.*?) with ([^<>]++).
I get pretty tired of the glib "don't do that" answers too, but as you can see, there are good reasons behind them--I could come up with this many more without having to consult any references. When you ask "How do I do this?" with no background or qualification, we have no idea how much of this you already know.
Forget regex's, use this instead:
http://simplehtmldom.sourceforge.net/
I'm having a lot of difficulty matching an image url with spaces.
I need to make this
http://site.com/site.com/files/images/img 2 (5).jpg
into a div like this:
.replace(/(http:\/\/([^\s]+\.(jpg|png|gif)))/ig, "<div style=\"background: url($1)\"></div>")
Here's the thread about that:
regex matching image url with spaces
Now I've decided to first make the spaces into entities so that the above regex will work.
But I'm really having a lot of difficulty doing so.
Something like this:
.replace(/http:\/\/(.*)\/([^\<\>?:;]*?) ([^\<\>?:;]*)(\.(jpe?g|png|gif))/ig, "http://$1/$2%20$3$4")
Replaces one space, but all the rest are still spaces.
I need to write a regex that says, make all spaces between http:// and an image extension (png|jpg|gif) into %20.
At this point, frankly not sure if it's even possible. Any help is appreciated, thanks.
Trying Paolo's escape:
.escape(/http:\/\/(.*)\/([^\<\>?:;]*?) ([^\<\>?:;]*)(\.(jpe?g|png|gif))/)
Another way I can do this is to escape serverside in PHP, and in PHP I can directly mess with the file name without having to match it in regex.
But as far as I know something like htmlentities do not apply to spaces. Any hints in this direction would be great as well.
Try the escape function:
>>> escape("test you");
test%20you
If you want to control the replacement character but don't want to use a regular expression, a simple...
$destName = str_replace(' ', '-', $sourceName);
..would probably be the more efficient solution.
Lets say you have the string variable urlWithSpaces which is set to a URL which contains spaces.
Simply go:
urlWithoutSpaces = escape(urlWithSpaces);
What about urlencode() - that may do what you want.
On the JS side you should be using encodeURI(), and escape() only as a fallback. The reason to use encodeURI() is that it uses UTF-8 for encoding, while escape() uses ISO Latin. Same problems applies for decoding.
encodeURI = encodeURI || escape;
alert(encodeURI('image name.png'));
I need to parse and return the tagname and the attributes in our PHP code files:
<ct:tagname attr="attr1" attr="attr2">
For this purpose the following regular expression has been constructed:
(\<ct:([^\s\>]*)([^\>]*)\>)
This expression works as expected but it breaks when the following code is parsed
<ct:form/input type="attr1" value="$item->field">
The original regular expression breaks because of the > character in the $item->field. I would need to construct a regular expression that ignores the -> or => but not the single >.
I am open to any suggestions... Thanks for your help in advance.
Try this:
<ct:([^\s\>]*)((?:\s+\w+\s*=\s*(?:"[^"]*"|'[^']*')\s*)*)>
But if that’s XML, use should better use a XML parser.
You could try using negative lookbehind like that:
(\<ct:([^\s\>]*)(.*?)(?<!-|=)\>)
Matches :
<ct:tagname attr="attr1" attr="attr2">
<ct:form/input type="attr1" value="$item->field">
Not sure that it the best suited solution for your case, but that respects the constraints.
In general, any parsing problem rapidly runs into language constructs that are context-free but not regular. It may be a better[1] solution to write a context-free parser, ignoring everything except the elements you're interested in.
[1] "better" as seen from a viewpoint of Being The Right Thing, not necessarily a return on investment one.
I think what you want to do is not recognize the -> and =>, but ignore everything between pairs of quotes.
I think it can be done by inserting ((
("[^"]*")*
)) at the opportune place.
My suggestion is to match to the attributes in the same expression.
\<ct:([^\s\>]*)((([a-x0-9]+)=\"([^\"]*)\")*)\>
edit: removed part about > not being valid xml in attribute values.