Regular expression to extract the onclick value from a string - php

Hi, I am trying to get an exact value out of a JavaScript onclick attribute.
Here is my example link:
onclick="omniture('Touchpad_8.0.7.2.ZIP','NP-N150P');downloadFile('http://xxx.com/downloadfile/ContentsFile.aspx?CDSite=UNI_CO&CttFileID=3017288&CDCttType=DR&ModelType=N&ModelName=NP-N150P&VPath=DR/201105/20110509115437867/Touchpad_8.0.7.2.ZIP','ZIP');return false;"
Lan o red inalambrica BROADCOM - 5.100.82.95 - onclick="omniture('WLAN_Broadcom_5.100.82.95.ZIP','NP-N150P');downloadFile('http://xxx.com/downloadfile/ContentsFile.aspx?CDSite=UNI_CO&CttFileID=3017290&CDCttType=DR&ModelType=N&ModelName=NP-N150P&VPath=DR/201108/20110817201634927/WLAN_Broadcom_5.100.82.95.ZIP','ZIP');return false;"
here is what I am trying:
preg_match_all(
"~onclick\s*=\s*([\"\'])(.*?)\\1~si", $d_l, $match);
$link = $match[0][0];
I am getting the full onclick attribute, not the exact value. I want to get just the link as output:
(
http://xxx.com/downloadfile/ContentsFile.aspx?CDSite=UNI_CO&CttFileID=3017290&CDCttType=DR&ModelType=N&ModelName=NP-N150P&VPath=DR/201108/20110817201634927/WLAN_Broadcom_5.100.82.95.ZIP)
Can anyone help, please?

An example of how you can do this properly:
<pre><?php
$html = <<<LOD
<html><head></head><body>
<table>
<thead></thead>
<tbody id="tbodyDR">
<tr><td>bidule
bidule
</td></tr>
<tr><td>truc
truc
</td></tr>
<tr><td>bidule
machin
</td></tr>
</tbody>
</body></html>
LOD;
$doc = new DOMDocument();
//@$doc->loadHTMLFile('http://example.com/list.html');
@$doc->loadHTML($html);
$links = $doc->getElementById('tbodyDR')->getElementsByTagName("a");
foreach($links as $link) {
$onclickAttr = $link->getAttribute('onclick');
if( preg_match("~downloadFile\('\K[^']++~", $onclickAttr, $match) )
$result[] = $match[0];
}
print_r($result);
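For reference, here is a minimal self-contained sketch of the same DOM-plus-regex idea. The anchors, the link text and the shortened URLs are made up for illustration (the sample above does not show the real markup), so treat it as an assumption about what the page looks like rather than the actual page.

```php
<?php
// Hypothetical markup: the anchors and the shortened URLs are assumptions for illustration.
$html = <<<HTML
<table><tbody id="tbodyDR">
<tr><td><a href="#" onclick="omniture('a.ZIP','NP-N150P');downloadFile('http://xxx.com/files/a.ZIP','ZIP');return false;">a</a></td></tr>
<tr><td><a href="#" onclick="omniture('b.ZIP','NP-N150P');downloadFile('http://xxx.com/files/b.ZIP','ZIP');return false;">b</a></td></tr>
</tbody></table>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$result = [];
foreach ($doc->getElementsByTagName('a') as $link) {
    $onclick = $link->getAttribute('onclick');
    // \K drops "downloadFile('" from the reported match, leaving only the URL
    if (preg_match("~downloadFile\('\K[^']+~", $onclick, $m)) {
        $result[] = $m[0];
    }
}
print_r($result);
```

The DOM parser isolates the attribute value, so the regex only has to deal with the small JavaScript snippet and not with the surrounding HTML.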

$match[0][$i-1] is the whole $i-th match, $match[1][$i-1] corresponds to the first submatch in the $i-th match, etc.
To get just the links, try this:
preg_match_all(
"~onclick\s*=\s*([\"\']).*?downloadFile\(([\"'])(.*?)\\2.*?\).*?\\1~si",
$d_l, $match
);
foreach ($match[3] as $link)
echo $link, "<br>\n";
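To make the layout of the result array concrete, here is a small sketch with a shortened, made-up input in the same shape as the question's attribute. The pattern is a simplified variant that only looks at the downloadFile(...) call, so here the URL ends up in group 2.

```php
<?php
// Shortened, made-up input in the same shape as the question's onclick attributes.
$d_l = <<<'TXT'
onclick="omniture('a.ZIP','X');downloadFile('http://xxx.com/a.ZIP','ZIP');return false;"
onclick="omniture('b.ZIP','X');downloadFile('http://xxx.com/b.ZIP','ZIP');return false;"
TXT;

// Group 1 is the opening quote, group 2 is everything up to the same quote again.
preg_match_all("~downloadFile\((['\"])(.*?)\\1~s", $d_l, $match);

echo $match[0][0], "\n"; // whole first match, including "downloadFile('"
echo $match[2][0], "\n"; // group 2 of the first match: the first URL
echo $match[2][1], "\n"; // group 2 of the second match: the second URL
```

With the default PREG_PATTERN_ORDER flag the first index is the group number and the second index is the match number, which is why $match[0][0] returns more than the link.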

Related

Match multiple results single line php regex

I would like to match multiple results on a single-line string, but I am only able to get the last iteration instead of the result I expected.
For example I have this string : <ul><li>test1</li><li>test2</li>test3</li></ul>
I would like to get :
test1
test2
test3
As result but I only get "test3"
I used this regex <ul>(<li><a.*>(.*)<\/a><\/li>)*<\/ul> on : https://regex101.com/ but I don't know what I did wrong.
Use a parser instead:
<?php
$html = <<<DATA
<ul>
<li>test1</li>
<li>test2</li>
<li>test3</li>
</ul>
DATA;
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DomXPath($dom);
$links = $xpath->query("//li/a");
foreach ($links as $link) {
echo $link->textContent;
}
?>
This sets up the DOM and uses an xpath expression to get the element(s).
Try like this:
(?<=(<a href="#">))([\s\S]| |w\[0-9]| )+?(?=(<\/a>))
or
(?<=(">))([\s\S]| |w\[0-9]| )+?(?=(<\/a>))
or
(?<=(<a href="#">))(.)+?(?=(<\/a>))
link with example:
https://regex101.com/r/MHnxxh/1
or
https://regex101.com/r/MHnxxh/2
<?php
$str = '
<ul>
<li>test1</li>
<li>test2</li>
<li>test3</li>
</ul>
';
preg_match_all('/(?<=(#">))([\s\S]| |w\[0-9]| )+?(?=(<\/a>))/', $str, $matches);
// display array if need
echo "<pre>";
print_r($matches);
// display list
foreach ($matches[0] as $key => $value) {
echo $value ."\r\n";
}
?>
preg_match_all('~#">[a-z]\w+</a>~', $str,
$out, PREG_PATTERN_ORDER);
This is the regex pattern; try it:
(#">[a-z]\w+</a>)
It picks out the text strings, although each match still includes the leading #"> and the trailing </a>.
You can also use preg_replace:
$test = '<ul><li>test1</li><li>test2</li>test3</li></ul>';
echo preg_replace('/<[^>]*>/', ' ', $test);
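As for why the original pattern returns only "test3": a capture group inside a repeated group is overwritten on every iteration, so only the last value survives. A rough sketch, assuming each item is wrapped in an anchor like <a href="#"> (that markup is a guess based on the regex in the question):

```php
<?php
// Assumed input: the anchors are a guess based on the regex in the question.
$str = '<ul><li><a href="#">test1</a></li><li><a href="#">test2</a></li><li><a href="#">test3</a></li></ul>';

// One match over the whole list: group 1 is overwritten on each repetition.
preg_match('~<ul>(?:<li><a[^>]*>(.*?)</a></li>)*</ul>~', $str, $once);
echo $once[1], "\n"; // only the last item is left

// Matching one <li> at a time keeps every value.
preg_match_all('~<li><a[^>]*>(.*?)</a></li>~', $str, $all);
print_r($all[1]);
```

So the fix on the regex side is to drop the outer <ul>...</ul> wrapper and let preg_match_all do the repetition.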

How do I extract links with a specific domain name using PHP and Regex?

I am trying to extract urls that contain www.domain.com from a database column that contains HTML. The regex has to filter out www2.domain.com instances and external urls like www.domainxyz.com. It should only search for properly coded anchor links.
Here is what I have so far:
<?php
$content = '<html>
<title>Random Website</title>
<body>
Click here for foobar
Another site is http://www.domain.com
Test 1
Test 2
<Strong>NOT A LINK</strong>
</body>
</html>';
$regex = "((https?)\:\/\/)?";
$regex .= "([a-z0-9-.]*)\.([a-z]{2,4})";
$regex .= "(\/([a-z0-9+\$_-]\.?)+)*\/?";
$regex .= "(\?[a-z+&\$_.-][a-z0-9;:#&%=+\/\$_.-]*)?";
$regex .= "(#[a-z_.-][a-z0-9+\$_.-]*)?";
$regex .= "([www\.domain\.com])";
$matches = array(); //create array
$pattern = "/$regex/";
preg_match_all($pattern, $content, $matches);
print_r(array_values(array_unique($matches[0])));
echo "<br><br>";
echo implode("<br>", array_values(array_unique($matches[0])));
?>
I am looking for this to find and output only http://www.domain.com/test.
How can I modify my Regex to accomplish this?
Here is a much safer way to extract the a href attribute values containing www.domain.com where the key is the XPath '//a[contains(@href, "www.domain.com")]':
$html = "YOUR_HTML_STRING"; // Your HTML string
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$arr = array();
$links = $xpath->query('//a[contains(@href, "www.domain.com")]');
foreach($links as $link) {
array_push($arr, $link->getAttribute("href"));
}
print_r($arr);
See IDEONE demo, result:
Array
(
[0] => http://www.domain.com/test
)
As you see, you can use the DOMDocument and DOMXPath with a string, too.
The code is self-explanatory, the XPath expression just means find all <a> tags that have a href attribute containing www.domain.com.
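One caveat: contains() looks at the whole attribute value, so a URL that merely mentions www.domain.com in its query string would also pass. If that matters, a possible variation is to compare the parsed host instead. The sample markup below is made up; only the host names come from the question.

```php
<?php
// Made-up sample markup; only the host names come from the question.
$html = '<p><a href="http://www.domain.com/test">Test 1</a>
<a href="http://www2.domain.com/test">Test 2</a>
<a href="http://www.domainxyz.com/">Test 3</a>
<a href="http://example.org/?ref=www.domain.com">Test 4</a></p>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$arr = [];
foreach ($xpath->query('//a[@href]') as $link) {
    $href = $link->getAttribute('href');
    // Compare the parsed host rather than searching the whole URL
    if (parse_url($href, PHP_URL_HOST) === 'www.domain.com') {
        $arr[] = $href;
    }
}
print_r($arr);
```

This keeps the XPath simple and moves the domain check into PHP, where an exact host comparison is easy to read.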

Extract value from href tag in table using php

I have a table with a td like the one below. I want to extract "abl", the value of the symbol parameter in the link's href attribute.
<td>
Ace Bank Limited
</td>
I can simply extract Ace Bank Limited using $td->nodeValue; but how can I extract abl using php only?
try with DOM
$html = '<td>Ace Bank Limited</td>';
$dom = new DOMDocument;
@$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $tag) {
$anchor = $tag->getAttribute('href');
$text = explode('=', $anchor);
echo $text[1]; // abl
}
or using preg_match
preg_match('/=([^\"]+)/', $html, $matches);
echo $matches[1]; // abl
Try with Regex:- preg_match('/symbol=([^"]+)/', $table_data, $matched)
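If the href can carry more than one parameter, splitting on "=" gets fragile. A sketch of an alternative that lets PHP parse the query string; the href value here is hypothetical, since the question only says the link has a symbol parameter:

```php
<?php
// The href below is hypothetical; the question only says it carries a "symbol" parameter.
$html = '<td><a href="company.php?symbol=abl">Ace Bank Limited</a></td>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // a bare <td> fragment may trigger parser warnings
$dom->loadHTML($html);
libxml_clear_errors();

$symbols = [];
foreach ($dom->getElementsByTagName('a') as $tag) {
    // Pull out the part after "?" and let parse_str split it into key/value pairs
    $query = parse_url($tag->getAttribute('href'), PHP_URL_QUERY);
    parse_str((string) $query, $params);
    if (isset($params['symbol'])) {
        $symbols[] = $params['symbol'];
    }
}
print_r($symbols);
```

parse_str also URL-decodes the values, which a plain explode or regex would not do.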

Get last Element (<a>) tag content from html

I have a string with some HTML. In the HTML is a list of anchors (<a> tags) and I would like to get the last of those anchors.
<div id="breadcrumbs">
Home
Suppliers
This One i needed
<span class="currentpage">Amrapali</span>
</div>
Make use of DOMDocument Class.
<?php
$html='<div id="breadcrumbs">
Home
Suppliers
This One i needed
<span class="currentpage">Amrapali</span>
</div>';
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $tag) {
$arr[]=$tag->nodeValue;
}
echo $yourval = array_pop($arr); //"prints" This One i needed
You can match the one <a> tag that is not followed by any other <a> tag, using a negative lookahead:
(?s)<a(?!.*<a).+</a>
and the code:
preg_match("#(?s)<a(?!.*<a).+</a>#", $html, $result);
print_r($result);
Output:
Array
(
[0] => This One i needed
)
Try this
<?php
$string = '<div id="breadcrumbs">
Home
Suppliers
This One i needed
<span class="currentpage">Amrapali</span>
</div>';
//$matches[0] will have all the <a> tags
preg_match_all("/<a.+>.+<\/a>/i", $string, $matches);
//Now we remove the <a> tags and store the tag content into an array called $result
foreach($matches[0] as $key => $value){
$find = array("/<a\shref=\".+\">/", "/<\/a>/");
$replace = array("", "");
$result[] = preg_replace($find, $replace, $value);
}
//Make the last item in the $result array become the first
$result = array_reverse($result);
$last_item = $result[0];
echo $last_item;
?>
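Another option, sketched here with XPath, is to ask for the last anchor directly instead of collecting all of them. The href values are placeholders, since the original markup is not shown in full:

```php
<?php
// Hypothetical breadcrumb markup; the href values are placeholders.
$html = '<div id="breadcrumbs">
<a href="/">Home</a>
<a href="/suppliers">Suppliers</a>
<a href="/suppliers/one">This One i needed</a>
<span class="currentpage">Amrapali</span>
</div>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// a[last()] selects the final <a> child of the breadcrumbs div
$last = $xpath->query('//div[@id="breadcrumbs"]/a[last()]')->item(0);
$text = $last ? $last->nodeValue : null;
echo $text, "\n";
```

Scoping the query to the breadcrumbs div means other links on the page cannot be picked up by accident.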

Using php preg to find url and replace it with a second url

I've got a large number of webpages stored in an MySQL database.
Most of these pages contain at least one (and occasionally two) entries like this...
<a href="http://first-url-which-always-ends-with-a-slash/">
<img src="http://second-different-url-which-always-ends-with.jpg" />
</a>
I'd like to just set up a little PHP loop to go through all the entries, replacing the first url with a copy of the second url for that entry.
How can I use preg to:
find the second url from the image tag
replace the first url in the a tag, with a copy of the second url
Is this possible?
see this url
PHP preg match / replace?
see also:- http://php.net/manual/en/function.preg-replace.php
$qp = qp($html);
foreach ($qp->find("img") as $img) {
$img->attr("title", $img->attr("alt"));
}
print $qp->writeHTML();
Though it might be feasible in this simple case to resort to a regex:
preg_replace('#(<img\s[^>]*)(\balt=)("[^"]+")#', '$1$2$3 title=$3', $h);
(It would make more sense to use preg_replace_callback to ensure no title= attribute is present yet.)
You can do the following:
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->formatOutput = true;
$source = "<a href=\"http://first-url-which-always-ends-with-a-slash/\">
<img src=\"http://second-different-url-which-always-ends-with.jpg\" />
</a>";
$dom->loadHTML($source);
$tags = $dom->getElementsByTagName('a');
foreach ($tags as $tag) {
// only look at images nested inside this link
foreach ($tag->getElementsByTagName('img') as $img) {
// copy the image url into the link, as the question asks
$tag->setAttribute('href', $img->getAttribute('src'));
echo $tag->getAttribute('href');
}
}
Thanks for the suggestions, I can see how they are better than using preg.
Even so, I finally solved my own question like this...
$result = mysql_query($select);
while ($frow = mysql_fetch_array($result)) {
$page_content = $frow['page_content'];
preg_match("#<img\s+src\s*=\s*([\"']+http://[^\"']*\.jpg[\"']+)#i", $page_content, $matches1);
print_r($matches1);
$imageURL = $matches1[1] ;
preg_match("#<a\s+(?:[^\"'>]+|\"[^\"]*\"|'[^']*')*href\s*=\s*(\"http://[^\"]+/\"|'http://[^']+/')#i", $page_content, $matches2);
print_r( $matches2 );
$linkURL = $matches2[1] ;
$finalpage=str_replace($linkURL, $imageURL, $page_content) ;
}
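For what it's worth, the two preg_match calls plus str_replace can probably be folded into a single preg_replace. This is only a sketch and assumes double-quoted attributes and an img that directly follows the opening a tag, as in the example from the question:

```php
<?php
// Sample entry copied from the question.
$page_content = '<a href="http://first-url-which-always-ends-with-a-slash/">
<img src="http://second-different-url-which-always-ends-with.jpg" />
</a>';

// Group 1: everything up to the href value, group 2: from the closing quote
// of href through the img src, group 3: the src url itself.
$pattern = '#(<a\s[^>]*href=")[^"]*("[^>]*>\s*<img\s[^>]*src="([^"]+)")#i';

// Rebuild the tag pair with the src url (group 3) in place of the old href.
$finalpage = preg_replace($pattern, '${1}${3}${2}', $page_content);
echo $finalpage, "\n";
```

Because the replacement is anchored to each a/img pair, pages with two such entries get each link updated with its own image url.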
