How to get text after a string using a regular expression in PHP?

I have the following string:
<p><b>Born:</b>333<br></p>
I am trying to extract the text 333 with this pattern:
<b>Born:<\/b>(.)*<br>
But it does not work.

The . matches any single character (except a newline by default), and * repeats the preceding element zero or more times. Parentheses define a capture group to output.
You've used (.)*, which repeats a one-character group, so the group keeps only the last character it matched (the regex from your post outputs 3). To capture the whole run 333, put the repetition inside a single group: (.*?).
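To make the difference concrete, here is a minimal sketch that runs both patterns against the string from the question. The variable names $repeated and $whole are only illustrative.

```php
<?php
$str = "<p><b>Born:</b>333<br></p>";

// (.)* repeats a one-character group; the group keeps only its last iteration.
preg_match('/<b>Born:<\/b>(.)*<br>/', $str, $repeated);

// (.*?) puts the repetition inside the group, so the group holds the whole run.
preg_match('/<b>Born:<\/b>(.*?)<br>/', $str, $whole);

echo $repeated[1], "\n"; // 3
echo $whole[1], "\n";    // 333
```

Both patterns match the same overall text; only the content of group 1 differs.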

Use this regular expression instead,
/<b>Born:<\/b>(.*?)<br>/
Here's an example,
$reg = "/<b>Born:<\/b>(.*?)<br>/";
$str = "<p><b>Born:</b>333<br></p>";
$matches = array();
preg_match($reg, $str, $matches);
echo $matches[1]; // 333

You could try something like this:
<?php
$string = "<p><b>Born:</b>333<br></p>";
$extract = preg_replace("#(<p>.*?<\/b>)(.*?)(<br.+>)#", "$2", $string);
var_dump($extract); //<== DISPLAYS::: string '333' (length=3)

You should avoid parsing HTML with regex, since it is bad practice: HTML has too many traps, you don't take advantage of the HTML structure, and the string approach stops working when the HTML isn't well formed. The way to go is to use a tool designed to parse HTML. The DOMDocument/DOMXPath combo can build a DOM tree and query it using the XPath language:
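$str = "<p><b>Born:</b> 333<br></p>";
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($str); // loadHTML() is an instance method, not a static one
$xp = new DOMXPath($dom);
$result = $xp->evaluate('string(//b[.="Born:"]/following-sibling::text()[1])');
libxml_clear_errors();
echo trim($result);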

Related

Matching a substring (an apostrophe) in a given word using regex

I have a server application which looks up where the stress is in Russian words. The end user writes a word жажда. The server downloads a page from another server which contains the stresses indicated with apostrophes for each case/declension like this жа'жда. I need to find that word in the downloaded page.
In Russian the stress is always written after a vowel. I've been using so far a regex that is a grouping of all possible combinations (жа'жда|жажда'). Is there a more elegant solution using just a regex pattern instead of making a PHP script which creates all these combinations?
EDIT:
I have a word жажда
The downloaded page contains the string жа'жда (notice the apostrophe; I do not know beforehand where in the word it is).
I want to match the word with apostrophe (жа'жда).
P.S.: So far I have a PHP script creating the string (жа'жда|жажда') used in regex (apostrophe is only after vowels) which matches it. My goal is to get rid of this script and use just regex in case it's possible.
If I understand your question,
have these options (d'isorder|di'sorder|dis'order|diso'rder|disor'der|disord'er|disorde'r|disorder') and one of these is in the downloaded page and I need to find out which one it is
this may suit your needs:
<pre>
<?php
$s = "d'isorder|di'sorder|dis'order|diso'rder|disor'der|disord'er|disorde'r|disorder'|disorde'";
$s = explode("|",$s);
print_r($s);
$matches = preg_grep("#[aeiou]'#", $s);
print_r($matches);
running example: https://eval.in/207282
Uhm... Is this ok with you?
<?php
function find_stresses($word, $haystack) {
$pattern = preg_replace('/[aeiou]/', '\0\'?', $word);
$pattern = "/\b$pattern\b/";
// word = 'disorder', pattern = "/\bdi'?so'?rde'?r\b/"
preg_match_all($pattern, $haystack, $matches);
return $matches[0];
}
$hay = "something diso'rder somethingelse";
find_stresses('disorder', $hay);
// => array(diso'rder)
You didn't specify whether there can be more than one match. If not, you could use preg_match instead of preg_match_all (faster). A word can have more than one stressed form, though: in Italian, for example, we have àncora and ancòra :P
Obviously, if you use preg_match, $matches[0] will be a string instead of an array.
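The function above uses Latin vowels and no u modifier, so it would need adapting for the Russian words in the question. Here is a sketch of one way to do that, assuming the downloaded page is UTF-8 and that the ten Russian vowel letters are the only ones that can carry a stress mark. The name find_stressed_forms is hypothetical. It uses lookarounds instead of \b, because \b would drop an apostrophe that follows a final vowel.

```php
<?php
// Hypothetical helper: returns every occurrence of $word in $haystack,
// allowing an optional apostrophe after each vowel.
function find_stressed_forms(string $word, string $haystack): array
{
    // Russian vowels, both cases (assumed to be the only stress carriers).
    $vowels = 'аеёиоуыэюяАЕЁИОУЫЭЮЯ';
    // Escape the word first, then let each vowel be followed by an optional '.
    $pattern = preg_replace("/[$vowels]/u", "$0'?", preg_quote($word, '/'));
    // Lookarounds instead of \b so a trailing apostrophe stays in the match.
    preg_match_all("/(?<![\\p{L}'])$pattern(?![\\p{L}'])/u", $haystack, $matches);
    return $matches[0];
}

print_r(find_stressed_forms('жажда', "текст жа'жда текст"));
```

The result also contains unstressed occurrences of the word, if the page has any. They can be filtered out afterwards with preg_grep("#'#", ...) as in the other answers.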
Based on your code and the requirements that no function is called and that disorder (without an apostrophe) is excluded, I think this is what you want. I have added a test vector.
<pre>
<?php
// test code
$downloadedPage = "
there is some disorde'r
there is some disord'er in the example
there is some di'sorder in the example
there also' is some order in the example
there is some disorder in the example
there is some dso'rder in the example
";
$word = 'disorder';
preg_match_all("#".preg_replace("#[aeiou]#", "$0'?", $word)."#iu"
, $downloadedPage
, $result
);
print_r($result);
$result = preg_grep("#'#"
, $result[0]
);
print_r($result);
// the code you need
$word = 'also';
preg_match("#".preg_replace("#[aeiou]#", "$0'?", $word)."#iu"
, $downloadedPage
, $result
);
print_r($result);
$result = preg_grep("#'#"
, $result
);
print_r($result);
Working demo: https://eval.in/207312

PHP:preg_replace function

$text = "
<tag>
<html>
HTML
</html>
</tag>
";
I want to replace all the text present inside the tags with htmlspecialchars(). I tried this:
$regex = '/<tag>(.*?)<\/tag>/s';
$code = preg_replace($regex,htmlspecialchars($regex),$text);
But it doesn't work.
The output I get is the htmlspecialchars of the regex pattern itself. I want the replacement to be the htmlspecialchars of the data that matches the pattern.
What should I do?
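You're replacing the match with the escaped pattern itself: htmlspecialchars($regex) runs once, before preg_replace is even called. You could use back-references with the e flag, but that modifier is deprecated (and removed in PHP 7), so preg_replace_callback is the way to go:
$code = preg_replace_callback($regex, 'replaceCallback', $text);
This passes the matches to the callback and uses its return value as the replacement. The matches arrive as an array (the full match first, then the groups), so htmlspecialchars cannot be used as the callback directly. Define the callback as either: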
function replaceCallback($matches)
{
if (is_array($matches))
{
$matches = implode ('', array_slice($matches, 1));//first element is full string
}
return htmlspecialchars($matches);
}
Or, if your PHP version permits it:
$code = preg_replace_callback($regex, function($matches)
{
$return = '';
for ($i=1, $j = count($matches); $i<$j;$i++)
{//loop like this, skips first index, and allows for any number of groups
$return .= htmlspecialchars($matches[$i]);
}
return $return;
}, $text);
Try any of the above until you find something that works. Incidentally, if all you want is to remove <tag> and </tag>, why not go for the much faster:
echo htmlspecialchars(str_replace(array('<tag>','</tag>'), '', $text));
That's just keeping it simple, and it'll almost certainly be faster, too.
If you want to isolate the actual contents as defined by your pattern, you could use preg_match($regex,$text,$hits);. This will give you an array of hits: $hits[0] contains the whole matched string, and the bits that were between the parentheses in the pattern start at $hits[1]. You can then manipulate these matches, possibly using htmlspecialchars, and combine them again into $code.
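The isolate, transform and recombine steps can also be done in one pass with preg_replace_callback. The question does not say whether the <tag> wrappers should survive, so the following is only a sketch that assumes they should:

```php
<?php
$text = "<tag>\n<html>\nHTML\n</html>\n</tag>";
$regex = '/<tag>(.*?)<\/tag>/s';

// The callback gets the match array: $m[0] is the full match, $m[1] the group.
$code = preg_replace_callback($regex, function (array $m): string {
    return '<tag>' . htmlspecialchars($m[1]) . '</tag>';
}, $text);

echo $code;
```

Only the text between the tags is escaped. Dropping the two string literals in the callback removes the wrappers instead.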

Get data out of string

I am going to parse a log file and I wonder how I can convert such a string:
[5189192e][game]: kill killer='0:Tee' victim='1:nameless tee' weapon=5 special=0
into some kind of array:
$log['5189192e']['game']['killer'] = '0:Tee';
$log['5189192e']['game']['victim'] = '1:nameless tee';
$log['5189192e']['game']['weapon'] = '5';
$log['5189192e']['game']['special'] = '0';
The best way is to use the function preg_match_all() and regular expressions.
For example, to get 5189192e you can use the expression
/[0-9]{7}e/
This matches seven digits followed by the letter e. To accept any lowercase letters at the end, change it to
/[0-9]{7}[a-z]+/
It is almost the same, but matches one or more letters at the end.
A more advanced example with subpatterns that captures all the details:
<?php
$matches = array();
preg_match_all('/\[[0-9]{7}e\]\[game\]: kill killer=\'([0-9]+):([a-zA-Z]+)\' victim=\'([0-9]+):([a-zA-Z ]+)\' weapon=([0-9]+) special=([0-9]+)/', $str, $matches);
print_r($matches);
?>
$str is the string to be parsed.
$matches contains all the captured data, such as killer id, weapon, name, etc.
Using the function preg_match_all() and a regex, you can generate an array, which you then just have to organize into your multi-dimensional array.
Here's the code:
$log_string = "[5189192e][game]: kill killer='0:Tee' victim='1:nameless tee' weapon=5 special=0";
preg_match_all("/^\[([0-9a-z]*)\]\[([a-z]*)\]: kill (.*)='(.*)' (.*)='(.*)' (.*)=([0-9]*) (.*)=([0-9]*)$/", $log_string, $result);
$log[$result[1][0]][$result[2][0]][$result[3][0]] = $result[4][0];
$log[$result[1][0]][$result[2][0]][$result[5][0]] = $result[6][0];
$log[$result[1][0]][$result[2][0]][$result[7][0]] = $result[8][0];
$log[$result[1][0]][$result[2][0]][$result[9][0]] = $result[10][0];
// $log is your formatted array
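The pattern above is tied to exactly four fields in a fixed order. If other log lines carry a different set of key=value pairs, a two-step approach may be easier to maintain: match the header once, then collect the pairs in a loop. This is only a sketch. The function name parse_log_line is hypothetical, and it assumes hexadecimal ids and values that are either single-quoted or free of spaces.

```php
<?php
// Hypothetical helper, not from the question: parses one log line into
// $log[id][category][key] = value for any number of key=value pairs.
function parse_log_line(string $line): array
{
    $log = [];
    // Header: [id][category]: action, then the rest of the line.
    if (!preg_match('/^\[([0-9a-f]+)\]\[(\w+)\]: (\w+) (.*)$/', $line, $head)) {
        return $log;
    }
    // Pairs: key='quoted value' or key=bare_value.
    preg_match_all("/(\w+)=(?:'([^']*)'|(\S+))/", $head[4], $pairs, PREG_SET_ORDER);
    foreach ($pairs as $p) {
        // Group 3 exists only when the bare-value branch matched.
        $log[$head[1]][$head[2]][$p[1]] = isset($p[3]) ? $p[3] : $p[2];
    }
    return $log;
}

print_r(parse_log_line("[5189192e][game]: kill killer='0:Tee' victim='1:nameless tee' weapon=5 special=0"));
```

The action word (kill) is captured in $head[3] but not stored, to match the array layout requested in the question.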
You definitely need a regex. See the PHP manual entry for the relevant function, along with a regex syntax reference.

Regular Expression to get contents of anchor tag InnerHTML in php

I need to retrieve anchor tag innerHTML using RegExp in php. Consider I have a syntax like
<div class="detailsGray"><span class="detailEmail">examples#mail.com</span></div>
I tried to get it with
preg_match_all('/class=\"fontLink"\>.*\<\/a\>/', $raw, $matches);
but it is not working. I only need to retrieve examples#mail.com using RegExp and preg_match_all(). Thanks
Use a parser. Luckily, PHP has one!
$html = '<div class="detailsGray"><span class="detailEmail">examples#mail.com</span></div>';
echo retrieve_node_text($html, "//a[@class='fontLink']");
// -----------------------------------------------
function retrieve_node_text($html_fragment, $xpath) {
$fragment = new DOMDocument();
$fragment->loadHTML($html_fragment);
if ($fragment) {
$xp = new DOMXPath($fragment);
$result = $xp->query($xpath);
if ($result->length == 1) {
return $result->item(0)->textContent;
}
}
return FALSE;
}
returns:
examples#mail.com
Looking at it, the regex is a bit of a mess. Try:
'/class=\"fontLink\">.*?<\/a>/'
As far as I know, there is nothing special about < and > in regex, so they don't need escaping.
You don't want .*, as that will go straight to the end of the line and then start working backwards. .*? takes one character at a time, only as long as what follows doesn't yet match </a>.
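The difference only shows when a line contains more than one closing </a>. A small sketch with a made-up input ($raw below is hypothetical, not the asker's page):

```php
<?php
// Hypothetical input: two anchors on one line (not from the question).
$raw = '<a class="fontLink">first</a> and <a class="fontLink">second</a>';

preg_match('/class="fontLink">(.*)<\/a>/', $raw, $greedy);
preg_match('/class="fontLink">(.*?)<\/a>/', $raw, $lazy);

echo $greedy[1], "\n"; // first</a> and <a class="fontLink">second
echo $lazy[1], "\n";   // first
```

The greedy group runs to the last </a> on the line, while the lazy one stops at the first.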
What is your input? If it's raw data from the web, a regexp is not a reliable way to do that. It would be better to load your DOM as a tree.
You need positive lookahead and lookbehind, so your pattern will be like this:
(?<=class=\"fontLink\"\>).*(?=\<\/a\>)
I think your approach was good enough. This is my solution:
preg_match('/class=\"fontLink"\>(.*)\<\/a\>/', $raw, $matches);
$parsedEmail = $matches[1];
Just add parentheses around the parts that you want, so they can be captured on their own.
If you only want to match one occurrence, use preg_match() instead of preg_match_all().

Supposedly valid regular expression doesn't return any data in PHP

I am using the following code:
<?php
$stock = $_GET[s]; //returns stock ticker symbol eg GOOG or YHOO
$first = $stock[0];
$url = "http://biz.yahoo.com/research/earncal/".$first."/".$stock.".html";
$data = file_get_contents($url);
$r_header = '/Prev. Week(.+?)Next Week/';
$r_date = '/\<b\>(.+?)\<\/b\>/';
preg_match($r_header,$data,$header);
preg_match($r_date, $header[1], $date);
echo $date[1];
?>
I've checked the regular expressions here and they appear to be valid. If I check just $url or $data they come out correctly and if I print $data and check the source the code that I'm looking for to use in the regex is in there. If you're interested in checking anything, an example of a proper URL would be http://biz.yahoo.com/research/earncal/g/goog.html
I've tried everything I could think of, including both var_dump($header) and var_dump($date), both of which return empty arrays.
I have been able to create other regular expressions that works. For instance, the following correctly returns "Earnings":
$r_header = '/Company (.+?) Calendar/';
preg_match($r_header,$data,$header);
echo $header[1];
I am going nuts trying to figure out why this isn't working. Any help would be awesome. Thanks.
Your regex doesn't allow for the line breaks in the HTML. Try:
$r_header = '/Prev\. Week((?s:.*))Next Week/';
The s tells . (match any) to match newline characters as well.
The problem is that the HTML has newlines in it, which you need to account for with the s regex modifier, as below:
<?php
$stock = "goog";//$_GET[s]; //returns stock ticker symbol eg GOOG or YHOO
$first = $stock[0];
$url = "http://biz.yahoo.com/research/earncal/".$first."/".$stock.".html";
$data = file_get_contents($url);
$r_header = '/Prev. Week(.+?)Next Week/s';
$r_date = '/\<b\>(.+?)\<\/b\>/s';
preg_match($r_header,$data,$header);
preg_match($r_date, $header[1], $date);
var_dump($header);
?>
Dot does not match newlines by default. Use /your-regex/s
$r_header should probably be /Prev\. Week(.+?)Next Week/s
FYI: You don't need to escape < and > in a regex.
You want to add the s (PCRE_DOTALL) modifier. By default . doesn't match newline, and I see the page has them between the two parts you look for.
Side note: although they don't hurt (except readability), you don't need a backslash before < and >.
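The effect of the modifier can be checked without downloading anything. In this sketch, $data is a made-up stand-in for the page (the date in it is invented), with newlines between the two markers:

```php
<?php
// Hypothetical stand-in for the downloaded page; the date is invented.
$data = "Prev. Week\n<b>Jan 20</b>\nNext Week";

// Without s, the dot stops at the first newline, so nothing matches.
$without = preg_match('/Prev\. Week(.+?)Next Week/', $data, $m1);
// With s (PCRE_DOTALL), the dot also matches newlines.
$with = preg_match('/Prev\. Week(.+?)Next Week/s', $data, $m2);

var_dump($without); // int(0)
var_dump($with);    // int(1)

// < and > need no backslash.
preg_match('/<b>(.+?)<\/b>/', $m2[1], $date);
echo $date[1], "\n"; // Jan 20
```

Checking the return value of preg_match before reading $header[1] would also have made the original failure visible, since the second call was being handed an undefined value.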
I think this is because you're applying the values to the regex as if it's plain text. However, it's HTML. For example, your regex should be modified to parse:
Prev. Week ...
Not to parse regular plain text like: "Prev. Week ...."
