How to extract an exact value with PHP preg_match_all - php

I am trying to extract the exact number 846002 from HTML code using preg_match_all.
PHP code:
<?php
$sText =file_get_contents('test.html', true);
preg_match_all('#download.php/.[^\s.,"\']+#i', $sText, $aMatches);
print_r($aMatches[0][0]);
?>
test.html
<tr>
<td class=ac>
<img src="/dl3.png" alt="Download" title="Download">
</td>
</tr>
output:
download.php/829685/Dark
But I want to output only:
829685

Just add the slash to the character class and capture the value in group 1:
preg_match_all('#download.php/([^\s.,"\'/]+)#i', $sText, $aMatches);
The value you're looking for is in $aMatches[1][0].

You need to create a capturing group with ( and ):
preg_match_all ( '#download.php/([^/]+)#i', $sText, $aMatches );
print_r ( $aMatches[1][0] );
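The capture-group approach from the answers above can be tried on a small string. This is a minimal sketch: the anchor tag in $sText is a hypothetical stand-in, because the actual link in test.html is not shown in the question. The dot is escaped here so that it only matches a literal ".".

```php
<?php
// Hypothetical sample input; the real link from test.html is not shown in the question.
$sText = '<a href="download.php/829685/Dark">Download</a>';

// \. matches a literal dot; the group captures everything up to the next
// slash, quote, or whitespace character.
if (preg_match_all('#download\.php/([^/"\'\s]+)#i', $sText, $aMatches)) {
    // Index 0 holds the full matches, index 1 holds the first capture group.
    echo $aMatches[1][0], "\n"; // 829685
}
```

$aMatches[0][0] still contains the full match (download.php/829685), so only index 1 needs to be read.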

Related

PHP preg_match displaying Array ( )

I'm a newbie in PHP and I'm trying to scrape just the weather summary from a forecast site. I tried everything I found, but it only returns Array ( ).
<?php
$contents=file_get_contents("http://www.weather-forecast.com/locations/Podgorica/forecasts/latest");
preg_match('/ <a class="forecast-magellan-target" name="forecast-part-0"><div data-magellan-destination="forecast-part-0"><\/div><\/a><p class="summary"> (.*?) <\/span><\/p> /is',$contents, $matches);
print_r($matches);
?>
You have extra spaces in the pattern. Please try this:
preg_match('/<a class="forecast-magellan-target" name="forecast-part-0"><div data-magellan-destination="forecast-part-0"><\/div><\/a><p class="summary">(.*?)<\/span><\/p>/is',$contents, $matches);
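To see why the stray spaces matter, here is a small sketch. The $contents string is a hypothetical, shortened stand-in for the forecast page, not its real markup. Without the x modifier, a space in a pattern is a literal character that must appear in the input.

```php
<?php
// Hypothetical, shortened stand-in for the forecast page markup.
$contents = '<p class="summary"><span class="phrase">Mostly dry.</span></p>';

// The spaces just inside the delimiters are literal and must exist in the input.
$withSpaces = preg_match('/ <p class="summary">(.*?)<\/span><\/p> /is', $contents, $m1);

// The same pattern without the stray spaces.
$withoutSpaces = preg_match('/<p class="summary">(.*?)<\/span><\/p>/is', $contents, $m2);

var_dump($withSpaces);         // int(0)
var_dump($withoutSpaces);      // int(1)
echo strip_tags($m2[1]), "\n"; // Mostly dry.
```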

php preg_match_all and preg_replace.

[caption id="attachment_1342" align="alignleft" width="300" caption="Cheers... "Forward" diversifying innovation to secure first place. "][/caption] A group of 35 students from...
I'm reading this data from an API. I want the text to start with "A group of 35 students from...". Please help me remove the caption tag. This is what I tried:
echo "<table>";
echo "<td>".$obj[0]['title']."</td>";
echo "<td>".$obj[0]['content']."</td>";
echo "</table>";
$html = $obj[0]['content'];
preg_match_all('/<caption>(.*?)<\/caption>/s', $html, $matches);
preg_replace('',$matches, $obj[0]['content']);
Any help would be appreciated.
$pattern = "/\[caption (.*?)\](.*?)\[\/caption\]/i";
$removed = preg_replace($pattern, "", $html);
echo preg_replace("#\[caption.*\[/caption\]#u", "", $str);
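A runnable sketch of the preg_replace approach, using the sample string from the question. The variable name $clean and the trim() call are additions for illustration; the lazy .*? keeps the match from running past the first [/caption].

```php
<?php
// Sample content as shown in the question.
$html = '[caption id="attachment_1342" align="alignleft" width="300" caption="Cheers... "Forward" diversifying innovation to secure first place. "][/caption] A group of 35 students from...';

// Remove everything from "[caption" up to the first "[/caption]".
// The s modifier lets . match newlines in case the shortcode spans lines.
$clean = trim(preg_replace('/\[caption\b.*?\[\/caption\]/is', '', $html));

echo $clean, "\n"; // A group of 35 students from...
```

preg_replace returns the new string and leaves $html untouched, so the result has to be assigned or echoed.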
In the snippet from the question, the regex search pattern is incorrect: there is no <caption> in the input, only <caption id=...>.
Second, the preg_replace call as written doesn't serve any purpose. preg_replace expects three arguments: first the regex pattern to search for, second the replacement string, and third the input string.
The following snippet, using preg_match, will work.
<?php
//The input string from API
$inputString = '<caption id="attachment_1342" align="alignleft" width="300" caption="Cheers... "Forward" diversifying innovation to secure first place. "></caption> A group of 35 students from';
//Search Regex
$pattern = '/<caption(.*?)<\/caption>(.*?)$/';
//preg_match searches inputString for a match to the regular expression given in pattern
//The matches are placed in the third argument.
preg_match($pattern, $inputString, $matches);
//$matches[0] is the whole match, $matches[1] is everything between "<caption" and "</caption>", $matches[2] is the text after the caption.
echo $matches[2];
// var_dump($matches);
?>
If you still want to use preg_match_all for some reason, the following snippet is a modification of the one in the question:
<?php
//Sample Object for test
$obj = array(
array(
'title' => 'test',
'content' => '<caption id="attachment_1342" align="alignleft" width="300" caption="Cheers... "Forward" diversifying innovation to secure first place. "></caption> A group of 35 students from'
)
);
echo "<table border='1'>";
echo "<td>".$obj[0]['title']."</td>";
echo "<td>".$obj[0]['content']."</td>";
echo "</table>";
$html = $obj[0]['content'];
//preg_match_all will put the caption tag in first match
preg_match_all('/<caption(.*?)<\/caption>/s', $html, $matches2);
//var_dump($matches2);
//use replace to remove the chunk from content
$obj[0]['content'] = str_replace($matches2[0], '', $obj[0]['content']);
//var_dump($obj);
?>
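Thank you, guys. I used the explode function to do this:
$html = $obj[0]['content'];
$code = explode("[/caption]", $html);
// Print the text after the closing tag only when it exists and is non-empty.
if (isset($code[1]) && $code[1] != '') {
echo $code[1];
}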

preg_replace doesn't work as expected

I have a problem with preg_replace. I need it to replace <td class="td_supltrid_3" width="11%"><p> 4A8</p> with only 4A8. When I use this pattern:
'/\<td class\=\"td_supltrid_3\" width\=\"11%\"\>\<p\> ...\<\/p\>/'
it doesn't find it. However, when I use preg_match, it finds the searched expression without any problem. Can you tell me where the problem is? The whole code:
preg_replace('/\<td class\=\"td_supltrid_3\" width\=\"11%\"\>\<p\> (...)\<\/p\>/', '$1', $str)
You need to change (...) to (.*?), which will grab everything up to the trailing </p>
<?php echo preg_replace('/<td class\=\"td_supltrid_3\" width\=\"11%\"><p>(.*?)<\/p>/', '$1', '<td class="td_supltrid_3" width="11%"><p> 4A8</p>'); ?>
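One more thing worth checking, as an assumption about what went wrong since the surrounding code is not shown: preg_replace does not modify its input, it returns the new string. A minimal sketch, where $result is a hypothetical variable name and \s* is added to drop the leading space:

```php
<?php
$str = '<td class="td_supltrid_3" width="11%"><p> 4A8</p>';

// preg_replace returns the result; $str itself is left unchanged.
$result = preg_replace(
    '/<td class="td_supltrid_3" width="11%"><p>\s*(.*?)<\/p>/',
    '$1',
    $str
);

echo $result, "\n"; // 4A8
```

If the return value is discarded, as in the one-line call from the question, the replacement appears to have had no effect.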

Unusual behaviour of regex

My Setup:
index.php:
<?php
$page = file_get_contents('a.html');
$arr = array();
preg_match('/<td class=\"myclass\">(.*)\<\/td>/s',$page,$arr);
print_r($arr);
?>
a.html:
...other content
<td class="myclass">
THE
CONTENT
</td>
other content...
Output:
Array
(
[0] => Array
(
)
)
If I change the line 4 of index.php to:
preg_match('/<td class=\"myclass\">(.*)\<\/t/s',$page,$arr);
The output is:
Array
(
[0] => <td class="myclass">
THE
CONTENT
</t
[1] =>
THE
CONTENT
)
I can't make out what's wrong. Please help me match the content between <td class="myclass"> and </td>.
Your code appears to work. I edited the regex to use a different delimiter to get a clearer view. You may want to use the ungreedy modifier in case there is more than one myclass TD in your HTML.
I have not been able to reproduce the "array of array" behaviour you describe unless I manipulate the code to add an error (see the bottom of this answer).
<?php
$page = <<<PAGE
...other content
<td class="myclass">
THE
CONTENT
</td>
other content...
PAGE;
preg_match('#<td class="myclass">(.*)</td>#s',$page,$arr);
print_r($arr);
?>
returns, as expected:
Array
(
[0] => <td class="myclass">
THE
CONTENT
</td>
[1] =>
THE
CONTENT
)
The code below is similar to yours but has been modified to cause an identical error. Doesn't seem likely you did this, though. The regexp is modified in order to not match, and the resulting empty array is stored into $arr[0] instead of $arr.
preg_match('#<td class="myclass">(.*)</ td>#s',$page,$arr[0]);
Returns the same error you observe:
Array
(
[0] => Array
(
)
)
I can duplicate the same behaviour you observe (works with </t, does not work with </td>) if I use your regexp, but modify the HTML to have </t d>. I still need to write to $arr[0] instead of $arr if I also want to get an identical output.
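The point about greediness with more than one myclass TD can be illustrated with a short sketch. The two-cell $page string is a hypothetical input; the lazy quantifier .*? is used here as an alternative to the U modifier.

```php
<?php
// Hypothetical input with two cells of the same class.
$page = '<td class="myclass">A</td><td class="myclass">B</td>';

// Greedy: (.*) runs to the LAST </td> in the input.
preg_match('#<td class="myclass">(.*)</td>#s', $page, $greedy);

// Lazy: (.*?) stops at the FIRST </td>.
preg_match('#<td class="myclass">(.*?)</td>#s', $page, $lazy);

echo $greedy[1], "\n"; // A</td><td class="myclass">B
echo $lazy[1], "\n";   // A
```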
Note that the third parameter of preg_match receives the matches: element 0 contains the full match, and the following elements contain the captured groups.
http://ca3.php.net/manual/en/function.preg-match.php
If matches is provided, then it is filled with the results of search. $matches[0] will contain the text that matched the full pattern, $matches[1] will have the text that matched the first captured parenthesized subpattern, and so on.
This code
preg_match('/<td class=\"myclass\">(.*)\<\/t/s',$page,$arr);
When applied on
...other content
<td class="myclass">
THE
CONTENT
</td>
other content...
Will return the match in $arr[0] and the result of (.*) in $arr[1]. This result is correct: There is your content in [1]
Array
(
[0] => <td class="myclass">
THE
CONTENT
</t
[1] =>
THE
CONTENT
)
Example two
<?php
header('Content-Type: text/plain');
$page = 'A B C D E F';
$arr = array();
preg_match('/C (D) E/', $page, $arr);
print_r($arr);
Example output
Array
(
[0] => C D E // This is the string found
[1] => D // this is what I wanted to look for and extracted out of [0], the matched parenthesis
)
Your regex seems correct. Isn't the syntax of preg_match as follows?
preg_match('/<td class=\"myclass\">(.*)\<\/td>/s',$page,$arr);
The | in the regex represents or

Extract content from each first TD in a Table

I've got some HTML that looks like this:
<tr class="row-even">
<td align="center">abcde</td>
<td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>
<tr class="row-odd">
<td align="center">efgh</td>
<td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>
<tr class="row-even">
<td align="center">ijkl</td>
<td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>
And I need to retrieve the values, abcde, efgh, and ijkl
This is the regex I'm currently using:
preg_match_all('/(<tr class="row-even">|<tr class="row-odd">)<td align="center">(.*)<\/td><\/tr>/xs', $html, $matches);
Yes, I'm not very good at them. As with most of my regex attempts, this is not working. Can anyone tell me why?
Also, I know about html/xml parsers, but it would require a significant code revisit to make that happen. So that's for later. We need to stick with regex for now.
EDIT: To clarify, I need the values between the first <td align="center"></td> tag after either <tr class="row-even"> or <tr class="row-odd">
~<tr class="row-(even|odd)">\s*<td align="center">(.*?)</td>~m
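Notice the use of \s*, which allows whitespace (including the newline) between the two tags. The m modifier is harmless but has no effect here, because the pattern contains no ^ or $ anchors.
Also, you can make the first group non-capturing via ?:, i.e. (?:even|odd), as you're probably not interested in the class attribute :)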
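Put together as a runnable sketch, with the non-capturing group applied so the cell text lands in group 1. The $html string below is the first two rows from the question.

```php
<?php
$html = <<<HTML
<tr class="row-even">
<td align="center">abcde</td>
<td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>
<tr class="row-odd">
<td align="center">efgh</td>
<td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>
HTML;

// \s* absorbs the newline between <tr ...> and the first <td ...>.
preg_match_all(
    '~<tr class="row-(?:even|odd)">\s*<td align="center">(.*?)</td>~',
    $html,
    $matches
);

print_r($matches[1]); // abcde, efgh
```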
Try this:
preg_match_all('/(?:<tr class="row-even">|<tr class="row-odd">).<td align="center">(.*?)<\/td>/s', $html, $matches);
Changes made:
You had not accounted for the newline between the tags.
You don't need the x modifier, as it makes the regex ignore the literal spaces in the pattern.
The matching is made non-greedy by using .*? in place of .*.
Actually, you don't need a big change in your codebase. Fetching text nodes always works the same way with DOM and XPath. All that changes is the XPath expression, so you could wrap the DOM code in a function that replaces your preg_match_all call. That would be just a tiny change, e.g.
include_once "dom.php";
$matches = dom_match_all('//tr/td[1]', $html);
where dom.php just contains:
// dom.php
function dom_match_all($query, $html, array $matches = array()) {
$dom = new DOMDocument;
libxml_use_internal_errors(TRUE);
$dom->loadHTML($html);
libxml_clear_errors();
$xPath = new DOMXPath($dom);
foreach( $xPath->query($query) as $node ) {
$matches[] = $node->nodeValue;
}
return $matches;
}
and would return
Array
(
[0] => abcde
[1] => efgh
[2] => ijkl
)
But if you want a Regex, use a Regex. I am just giving ideas.
This is what I came up with
<td align="center">([^<]+)</td>
I'll explain. One of the challenges here is that what's between the tags could be either the text you're looking for or an <img> tag. In the regex, [^<]+ says to match one or more characters that are not the < character. That's great, because it means the <img> cell won't match, and the group will only match up to the closing </td> tag.
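A quick sketch of that behaviour, using one row from the question as input. The cell holding only an image is skipped because its content starts with <, which [^<]+ cannot match.

```php
<?php
$html = '<tr class="row-even">
<td align="center">abcde</td>
<td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>';

// [^<]+ needs at least one non-"<" character, so the image-only cell fails to match.
preg_match_all('~<td align="center">([^<]+)</td>~', $html, $matches);

print_r($matches[1]); // only abcde
```

Note that this matches every text-only cell, not just the first one in each row, so it relies on the other cells containing markup.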
Disclaimer: Using regexps to parse HTML is dangerous.
To get the inner HTML of the first TD in each TR, use this regexp:
/<tr[^>]*>\s*<td[^>]*>(.+?)<\/td>/si
This is just a quick and dirty regex to meet your needs. It could easily be cleaned up and optimized, but it's a start.
<tr[^>]+>[^\n]*\n #Match the opening <tr> tag
\s*<td[^>]+>([^<]+)[^\n]+\n #Group the wanted data
[^\n]+\n #Match next line
</tr> #Match closing tag
Here is an alternative way, which may be more robust:
deluserconfirm.html\?user=([^"]+)
