I have a bunch of strings that may or may not have a substring similar to the following:
<a class="tag" href="http://www.yahoo.com/5"> blah blah ...</a>
I'm trying to retrieve the '5' at the end of the link (it isn't necessarily a one-digit number; it can be huge). The string will vary, though: the text before and after the link will always be different. The only parts that stay the same are the <a class="tag" href="http://www.yahoo.com/ prefix and the closing </a>.
You can do it using preg_match_all and the regular expression <a class="tag" href="http:\/\/(.*)\/(\d+)">. The number ends up in the second capture group.
Give parse_url() a try. Should be easy from there.
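A minimal sketch of the parse_url() route, assuming the href value has already been pulled out of the markup (parse_url() works on a URL, not on HTML). The $href value is the sample from the question; the variable names are only illustrative.

```php
<?php
// Sample href value taken from the question.
$href = 'http://www.yahoo.com/5';

// With PHP_URL_PATH, parse_url() returns only the path component, here "/5".
$path = parse_url($href, PHP_URL_PATH);

// Drop the leading slash to get the trailing number as a string.
$id = ltrim($path, '/');

echo $id, "\n";
```

The result is a string, so a very large number is not truncated to the platform's integer size.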
As you only need to retrieve the 5, it's pretty straightforward:
$r = preg_match_all('~\/(\d+)"~', $subject, $matches);
The number is then in the first capture group, $matches[1].
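To show where the captured digits end up, here is a small sketch. The $subject string is made up for illustration, and the pattern is anchored on the fixed prefix from the question rather than on any slash followed by digits, which is an assumption about how strict the match should be.

```php
<?php
// Hypothetical input with two matching links and arbitrary text around them.
$subject = 'before <a class="tag" href="http://www.yahoo.com/5"> blah blah ...</a> after '
         . '<a class="tag" href="http://www.yahoo.com/12345"> more ...</a>';

// The only capture group is the run of digits at the end of the URL.
preg_match_all('~<a class="tag" href="http://www\.yahoo\.com/(\d+)">~', $subject, $matches);

// $matches[0] holds the full matches, $matches[1] the captured numbers.
print_r($matches[1]);
```

If a string has no such link, preg_match_all() returns 0 and $matches[1] is an empty array.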
If you need more information, such as the link text, I would suggest using an HTML parser instead:
require('Net/URL2.php');
$doc = new DOMDocument();
$doc->loadHTML('<a class="tag" href="http://www.yahoo.com/5"> blah blah ...</a>');
foreach ($doc->getElementsByTagName('a') as $link)
{
$url = new Net_URL2($link->getAttribute('href'));
if ($url->getHost() === 'www.yahoo.com') {
$path = $url->getPath();
printf("%s (from %s)\n", basename($path), $url);
}
}
Example Output:
5 (from http://www.yahoo.com/5)
I would go with "basename":
// prints passwd
print basename("/etc/passwd");
And to get the link you could use:
$xml = simplexml_load_string( '<a class="tag" href="http://www.yahoo.com/5"> blah blah ...</a>' );
$attr = $xml->attributes();
print $attr['href'];
And finally: If you don't know the whole structure of the string, use this:
$dom = new DOMDocument;
$dom->loadHTML( '<a class="tag" href="http://www.yahoo.com/5"> blah blah ...</a>asasasa<a class="tag" href="http://www.yahoo.com/6"> blah blah ...</a>' );
$nodes = $dom->getElementsByTagName('a');
foreach ($nodes as $node) {
print $node->getAttribute('href');
print basename( $node->getAttribute('href') );
}
This will also cope with invalid HTML code.
Related
My code is given below:
$text = "<div class='title'>Title</div><div class='content'>This is title</div>";
$words = array('Title');
$words = join("|", $words);
$matches = array();
if ( preg_match('/' . $words . '/i', $text, $matches) ){
echo "Words matched: <br/>";
print_r($matches);
}
else{
echo "Not match";
}
The problem is that the code above finds "Title", but I don't want to print that; I want to print "This is title", and I don't understand how to get it by finding the title.
The title is a keyword that will not change, but the value I want to print is dynamic and changes every time, so I can't search for the value itself. How can I do it?
Don't use regex for parsing HTML. Use a DOM Parser instead. In this case, you can use an XPath expression to get the element by class name:
$text = "<div class='title'>Title</div>
<div class='content'>This is title</div>";
$dom = new DOMDocument;
$dom->loadHTML($text);
$xpath = new DOMXPath($dom);
$title = $xpath->query('//*[@class="content"]')->item(0)->nodeValue;
echo $title;
Output:
This is title
This should get you started. If the title is in a different position, you can modify the expression accordingly to retrieve it.
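As one possible way to modify the expression, the sketch below locates the div that holds the fixed keyword and then takes the div that follows it, instead of relying on the "content" class. That the value always sits in the next sibling div is an assumption based on the sample markup.

```php
<?php
$text = "<div class='title'>Title</div><div class='content'>This is title</div>";

$dom = new DOMDocument;
$dom->loadHTML($text);
$xpath = new DOMXPath($dom);

// Find the div whose text is the fixed keyword, then take the next sibling div.
$query = '//div[@class="title"][normalize-space()="Title"]/following-sibling::div[1]';
$node = $xpath->query($query)->item(0);

// item(0) is null when nothing matched, so guard before reading the value.
$value = $node !== null ? $node->nodeValue : null;
echo $value, "\n";
```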
I am trying to extract some strings from the source code of a web page which looks like this :
<p class="someclass">
String1<br />
String2<br />
String3<br />
</p>
I'm pretty sure those strings are the only things that end with a single line break (<br />). Everything else ends with two or more line breaks. I tried using this:
preg_match_all('~(.*?)<br />{1}~', $source, $matches);
But it doesn't work like it's supposed to. It returns some other text too along with those strings.
DOMDocument and XPath to the rescue.
$html = <<<EOM
<p class="someclass">
String1<br />
String2<br />
String3<br />
</p>
EOM;
$doc = new DOMDocument;
$doc->loadHTML($html);
$xp = new DOMXPath($doc);
foreach ($xp->query('//p[contains(concat(" ", @class, " "), " someclass ")]') as $node) {
echo $node->textContent;
}
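If the strings are needed one by one rather than as a single block of text, a sketch along these lines may help. It selects the text nodes between the <br /> elements and trims them; the $strings array is a name introduced here for illustration, and the simpler exact-match class test is used for brevity.

```php
<?php
$html = <<<EOM
<p class="someclass">
String1<br />
String2<br />
String3<br />
</p>
EOM;

$doc = new DOMDocument;
$doc->loadHTML($html);
$xp = new DOMXPath($doc);

$strings = array();
// text() selects the text nodes that sit between the <br> elements.
foreach ($xp->query('//p[@class="someclass"]/text()') as $textNode) {
    $value = trim($textNode->nodeValue);
    // Skip nodes that contain nothing but whitespace.
    if ($value !== '') {
        $strings[] = $value;
    }
}
print_r($strings);
```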
I wouldn't recommend using a regular expression to get the values. Instead, use PHP's built in HTML parser like this:
$dom = new DOMDocument();
$dom->loadHTML($source);
$xpath = new DOMXPath($dom);
$elements = $xpath->query('//p[@class="someclass"]');
$text = array(); // to hold the strings
if (!is_null($elements)) {
foreach ($elements as $element) {
$text[] = strip_tags($element->nodeValue);
}
}
print_r($text); // print out all the strings
This is tested and working. You can read more about PHP's DOMDocument class here: http://www.php.net/manual/en/book.dom.php
Here's a demonstration: http://phpfiddle.org/lite/code/0nv-hd6 (click 'Run')
Try this:
preg_match_all('~^(.*?)<br />$~m', $source, $matches);
Should work. Please try it
preg_match_all("/([^<>]*?)<br\s*\/?>/", $source, $matches);
or if your strings may contain some HTML code, use this one:
preg_match_all("/(.*?)<br\s*\/?>\\n/", $source, $matches);
I need to find the highest number in a string like this:
Example
<div id='pages'>
<a href='pages.php?start=0&end=20'>Page 1</a>
<a href='pages.php?start=20&end=40'>Page 2</a>
<a href='pages.php?start=40&end=60'>Page 3</a>
<a href='pages.php?start=60&end=80'>Page 4</a>
<a href='pages.php?start=80&end=89'>Page 5</a>
</div>
In this example, I should get 89, because it's the highest number on "end" value.
I think I should use regex, but I don't know how :(
Any help would be very appreciated!
You shouldn't be doing this with a regex. In fact, I don't even know how you would. You should be using an HTML parser, parsing out the end parameter from each <a> tag's href attribute with parse_str(), and then finding the max() of them, like this:
$doc = new DOMDocument;
$doc->loadHTML( $str); // All & should be encoded as &amp;
$xpath = new DOMXPath( $doc);
$end_vals = array();
foreach( $xpath->query( '//div[@id="pages"]/a') as $a) {
parse_str( $a->getAttribute( 'href'), $params);
$end_vals[] = $params['end'];
}
echo max( $end_vals);
The above will print 89.
Note that this assumes your HTML entities are properly escaped, otherwise DOMDocument will issue a warning.
One optimization you can do is instead of keeping an array of end values, just compare the max value seen with the current value. However this will only be useful if the number of <a> tags grows larger.
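A sketch of that running-maximum variant, under the same assumptions as the code above. The sample markup is a shortened copy of the question's input with & written as &amp;, and $max is a name introduced here for illustration.

```php
<?php
// Shortened sample from the question, with & encoded as &amp;.
$str = "<div id='pages'>"
     . "<a href='pages.php?start=0&amp;end=20'>Page 1</a>"
     . "<a href='pages.php?start=20&amp;end=40'>Page 2</a>"
     . "<a href='pages.php?start=80&amp;end=89'>Page 5</a>"
     . "</div>";

$doc = new DOMDocument;
$doc->loadHTML($str);
$xpath = new DOMXPath($doc);

$max = null; // highest end value seen so far
foreach ($xpath->query('//div[@id="pages"]/a') as $a) {
    parse_str($a->getAttribute('href'), $params);
    if (!isset($params['end'])) {
        continue; // link without an end parameter
    }
    $end = (int) $params['end'];
    if ($max === null || $end > $max) {
        $max = $end;
    }
}
echo $max, "\n";
```

Only one integer is kept in memory, whatever the number of links.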
Edit: As DaveRandom points out, if we can make the assumption that the <a> tag that holds the highest end value is the last <a> tag in this list, simply due to how paginated links are presented, then we don't need to iterate or keep a list of other end values, as shown in the following example.
$doc = new DOMDocument;
$doc->loadHTML( $str);
$xpath = new DOMXPath( $doc);
parse_str( $xpath->evaluate( 'string(//div[@id="pages"]/a[last()]/@href)'), $params);
echo $params['end'];
To find the highest number in the entire string, regardless of position, you can use
preg_split — Split string by a regular expression
max — Find highest value
Example:
echo max(preg_split('/\D+/', $html, -1, PREG_SPLIT_NO_EMPTY)); // prints 89
This works by splitting the string by anything that is not a number, leaving you with an array containing all the numbers in the string and then fetching the highest number from that array.
First extract all the numbers from the links, then apply the max function:
$str = "<div id='pages'>
<a href='pages.php?start=0&end=20'>Page 1</a>
<a href='pages.php?start=20&end=40'>Page 2</a>
<a href='pages.php?start=40&end=60'>Page 3</a>
<a href='pages.php?start=60&end=80'>Page 4</a>
<a href='pages.php?start=80&end=89'>Page 5</a>
</div>";
if(preg_match_all("/href=['][^']+end=([0-9]+)[']/i", $str, $matches))
{
$maxVal = max($matches[1]);
echo $maxVal;
}
function getHighest($html) {
$my_document = new DOMDocument();
$my_document->loadHTML($html);
$nodes = $my_document->getElementsByTagName('a');
$numbers = array();
foreach ($nodes as $node) {
if (preg_match('/\d+$/', $node->getAttribute('href'), $match) == 1) {
$numbers[] = intval($match[0]);
}
}
return max($numbers);
}
The following situation:
$text = "This is some <span class='classname'>example</span> text i'm writing to
demonstrate the <span class='classname otherclass'>problem</span> of this.<br />";
preg_match_all("|<[^>/]*(classname)(.+)>(.*)</[^>]+>|U", $text, $matches, PREG_PATTERN_ORDER);
I need an array ($matches) where one field holds "<span class='classname'>example</span>" and another holds "example".
But what I get is one field with "<span class='classname'>example</span>" and one with "classname".
It should also contain the values for the other matches, of course.
How can I get the right values?
You would be better off with a DOM parser, however this question is more to do with how capturing works in Regexes in general.
The reason you are getting classname as a match is because you are capturing it by putting () around it. They are completely unnecessary so you can just remove them. Similarly, you don't need them around .+ since you don't want to capture that.
If you had some group that you had to enclose in () as grouping rather than capturing, start the group with ?: and it won't be captured.
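To illustrate the difference, here is a sketch using the text from the question. The pattern is a simplified stand-in for the original one (it only handles single-quoted class attributes on span elements, which is an assumption): the optional extra class names are grouped with (?:...) so they are not captured, and the only capturing group is the element's text.

```php
<?php
$text = "This is some <span class='classname'>example</span> text i'm writing to "
      . "demonstrate the <span class='classname otherclass'>problem</span> of this.<br />";

// (?: [^']*)? groups any additional class names without capturing them,
// so group 1 is the text between the opening and closing tags.
$pattern = "~<span class='classname(?: [^']*)?'>(.*?)</span>~";

preg_match_all($pattern, $text, $matches, PREG_PATTERN_ORDER);

// $matches[0]: the full elements, $matches[1]: their inner text.
print_r($matches);
```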
The safe/easy way:
$text = 'blah blah blah';
$dom = new DOMDocument();
$dom->loadHTML($text);
$xp = new DOMXPath($dom);
$nodes = $xp->query("//span[@class='classname']");
foreach($nodes as $node) {
$innertext = $node->nodeValue;
$html = $dom->saveHTML($node); // outer HTML (PHP 5.3.6+); see also http://stackoverflow.com/questions/2087103/innerhtml-in-phps-domdocument
}
Let's say we have a string ($text)
I will help you out, if <b>you see this message and never forget</b> blah blah blah
I want to take text from "<b>" to "</b>" into a new string($text2)
How can this be done?
I appreciate any help I can get. Thanks!
Edit:
I want to take a code like this.
<embed type="application/x-shockwave-flash"></embed>
If you only want the first match and do not want to match something like <b class=">, the following will work:
UPDATED for comment:
$text = "I will help you out, if <b>you see this message and never forget</b> blah blah blah";
$matches = array();
preg_match('#<b>.*?</b>#s', $text, $matches);
if ($matches) {
$text2 = $matches[0];
// Do something with $text2
}
else {
// The string wasn't found, so do something else.
}
But for something more complex, you really should parse it as DOM per Marc B.'s comment.
Use this bad mofo: http://fr2.php.net/domdocument
$dom = new DOMDocument();
$dom->loadHTML($text);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//b');
Here you can either loop through each one, or if you know there is only one, just grab the value.
$text1 = $nodes->item(0)->nodeValue;
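For the looping case, a sketch that collects both the text and the markup of every <b> element. The $found array and its keys are names made up for this example; passing a node to saveHTML() needs PHP 5.3.6 or later.

```php
<?php
$text = "I will help you out, if <b>you see this message and never forget</b> blah blah blah";

$dom = new DOMDocument();
$dom->loadHTML($text);
$xpath = new DOMXPath($dom);

$found = array();
foreach ($xpath->query('//b') as $node) {
    $found[] = array(
        'text' => $node->nodeValue,      // text only
        'html' => $dom->saveHTML($node), // element including its tags
    );
}
print_r($found);
```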
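strip_tags($text, '<b>');
keeps the <b> and </b> tags while stripping every other tag. Note that it does not extract the text between <b> and </b>: the surrounding text stays in the result, so use one of the approaches above if you need only the bold part.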
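A quick check of what strip_tags() returns when '<b>' is passed as the allowed tag, using a made-up input string. As far as the documented behavior goes, the allowed tags are kept and all other tags are removed, while the text around them is left in place.

```php
<?php
// Hypothetical input mixing <b> with other tags.
$text = "<p>I will help you out, if <b>you see this</b> <i>blah</i></p>";

// Second argument: tags that should survive the stripping.
$kept = strip_tags($text, '<b>');

echo $kept, "\n";
```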