Parse HTML links with a regular expression - php

I have the following code:
$regex='|<a.*?href="(.*?)"|'; //PARSE FOR LINKS
preg_match_all($regex,$result,$parts);
$links=$parts[1];
foreach($links as $link){
echo $link."<br>";
}
Its output is the following:
/watch/b4se39an
/watch/b4se39an
/bscsystem
/watch/ifuyzwfw
/watch/ifuyzwfw
/?sort=v
/?sort=c
/?sort=l
/watch/xk4mvavj
/watch/2h7b53vx
/watch/d7bt47xb
/watch/yh953b17
/watch/tj3z6ki2
/watch/sd4vraxi
/watch/f2rnthuh
/watch/ey6z8hxa
/watch/ybgxgay1
/watch/3iaqyrm1
/help/feedback
How can I use a regular expression to extract only the /watch/..... strings?

Modify your regex to include the restriction on /watch/:
$regex = '|<a.*?href="(/watch/.*?)"|';
A simple test script can show that it's working:
$tests = array( "/watch/something", "/bscsystem");
$regex = '|<a.*?href="(/watch/.*?)"|';
foreach( $tests as $test) {
// Build a sample anchor tag around each test path
$link = '<a href="' . $test . '">link</a>';
if( preg_match( $regex, $link))
echo $test . ' matched.<br />';
}
This will produce:
/watch/something matched.
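As an alternative to the regex, here is a minimal sketch that collects the same links with DOMDocument and a prefix check. The function name watch_links and the sample markup are assumptions made for illustration; duplicates are dropped because the output in the question lists some links twice.

```php
<?php
// Hypothetical helper (not from the thread): returns the unique hrefs that start with /watch/.
function watch_links(string $html): array
{
    libxml_use_internal_errors(true);   // keep warnings about sloppy markup quiet
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $links = array();
    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        // strpos() === 0 means the href begins with the prefix
        if (strpos($href, '/watch/') === 0) {
            $links[] = $href;
        }
    }
    // array_unique() keeps the first occurrence; array_values() reindexes from 0
    return array_values(array_unique($links));
}

// Made-up sample markup
$result = '<a href="/watch/b4se39an">a</a><a href="/watch/b4se39an">b</a>'
        . '<a href="/bscsystem">c</a><a href="/watch/ifuyzwfw">d</a>';
print_r(watch_links($result));
```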

Related

PHP add html tag to first word

Trying to add HTML tags around the first word of each new line in a WooCommerce short description and validate that the file exists. If it exists, it will output a link.
I tried this:
$string = $short_description;
$keys = array('a1', 'a2', 'a3');
$patterns = array();
foreach($keys as $key)
$patterns[] = '/\b('.$key.')\b/i';
echo preg_replace($patterns, '$0', $string);
$url = preg_replace($patterns, 'https://www.example.com/media/' .$product->get_sku(). '/' .$product->get_sku(). '$0.pdf', $string);
$handle = @fopen($url,'r');
if($handle !== false){ ?>
<?php echo preg_replace($patterns, '<li>$0</li>', $string);?>
<?php } else {?>
<?php echo preg_replace($patterns, '<li>$0</li>', $string);?>
<?php } ?>
This is as close as I could get. The limitation is that you need to list every word that needs to be changed (200+ in total), and the $url validation is not working, as it echoes the $string as well.
So how can I either get the $url correct, or is there a better way to wrap HTML tags around the first word of each new line?
Got it working with:
$s = strip_tags($short_description, '<br>');
$rows = explode( "\n", $s );
foreach( $rows as $r ){
echo preg_replace('/^([^ ]*)/', '$1', $r);
}
and then using is_readable() to decide whether to output the link or not.
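The per-line replacement above can be sketched as a small function. The name wrap_first_words and the strong tag are assumptions for illustration, standing in for whatever markup is actually wanted.

```php
<?php
// Hypothetical helper (not from the thread): wraps the first word of every line in a tag.
function wrap_first_words(string $text, string $tag = 'strong'): string
{
    $out = array();
    // \R matches any line-break sequence (\n, \r\n, \r)
    foreach (preg_split('/\R/', $text) as $line) {
        // ^(\S+) captures the leading run of non-whitespace characters, if there is one;
        // empty lines and lines starting with whitespace are left unchanged
        $out[] = preg_replace('/^(\S+)/', '<' . $tag . '>$1</' . $tag . '>', $line, 1);
    }
    return implode("\n", $out);
}

echo wrap_first_words("a1 manual\na2 datasheet");
```

Unlike the keyword list in the question, this does not need to know the 200+ words in advance, because it keys on position rather than content.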

Preg_replace replace dashes with spaces between tags

I have some HTML code and would like to replace dashes with spaces, but only between specific tags.
function getTextBetweenTags($string, $tagname) {
$pattern = "/<$tagname ?.*>(\d*)[-*](\d*)<\/$tagname>/";
$replace = " ";
$string = preg_replace($pattern, $replace, $string);
}
CODE EXAMPLE:
<div class="xxx">
start
World
Fantastic-yyy-zz
peter-hey
</div>
RESULT: only the values inside the tag matter, so 'peter-hey', which is outside it, keeps its dash.
<div class="xxx">
start
World
Fantastic yyy zz
peter-hey
</div>
You DO NOT need regular expressions for this task:
$contents = '<div class="xxx">
start
World
Fantastic-yyy-zz
peter-hey
</div>';
$doc = new DOMDocument();
$doc->loadXML($contents);
$tagName = 'a';
$tags = $doc->getElementsByTagName($tagName);
foreach ($tags as $tag) {
$newValue = str_replace('-', ' ', $tag->nodeValue);
$tag->nodeValue = $newValue;
}
echo $doc->saveHTML();
Demo: http://ideone.com/rI6k8b
@zerkms thank you for your help and patience. I tried almost exactly what you suggested, but it shows a warning and doesn't make any change.
Warning: DOMDocument::loadXML(): Extra content at the end of the document in Entity
CODE:
function process(&$vars) {
$theme = get_theme();
if ($vars['elts']['#xxx'] == 'main') {
$vars['bread'] = $theme->page['bread'];
/*add code*/
$doc = new DOMDocument();
$doc->loadXML($vars['bread']);
$tagName = 'a';
$tags = $doc->getElementsByTagName($tagName);
foreach ($tags as $tag) {
$newValue = str_replace('-', ' ', $tag->nodeValue);
$tag->nodeValue = $newValue;
}
echo $doc->saveHTML();
/*end add code*/
}
}
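The "Extra content at the end of the document" warning is typical of loadXML() being given something that is not a single well-formed XML document, for example a fragment with more than one top-level node. A minimal sketch of the same idea using loadHTML(), which tolerates fragments, follows. The function name, the wrapper div and its id are assumptions made for this example.

```php
<?php
// Hypothetical helper (not from the thread): replaces dashes with spaces in the
// text of every <a> element of an HTML fragment and leaves everything else alone.
function dashes_to_spaces_in_links(string $fragment): string
{
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $doc = new DOMDocument();
    // The wrapper div (an assumption of this sketch) gives one known container
    // to read the result back from. loadHTML() assumes ISO-8859-1 unless the
    // markup declares another charset.
    $doc->loadHTML('<div id="gr-wrap">' . $fragment . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Only text nodes below <a> are changed, so attributes such as href keep their dashes
    foreach ($xpath->query('//a//text()') as $text) {
        $text->nodeValue = str_replace('-', ' ', $text->nodeValue);
    }

    // Serialize the wrapper's children, without the wrapper itself
    $out = '';
    $wrapper = $xpath->query('//div[@id="gr-wrap"]')->item(0);
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo dashes_to_spaces_in_links('<a href="/x-y">Fantastic-yyy-zz</a> / peter-hey');
```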
@zerkms, I'm accepting your answer because it is correct. I also found some other interesting approaches:
CODE TO FIND INFO
$tagname = 'a';
$pattern = "/<$tagname ?.*>(.*)\-+(.*)<\/$tagname>/";
$matches = "";
preg_match($pattern, $contents, $matches);
CODE TO CHANGE : As I only have a piece of code, I really don't need to check the tag is 'a'.
$pattern = "/>(.*)\-+(.*)\-+(.*)</";
$replace = ">$1 $2 $3<";
$res = preg_replace($pattern, $replace, $contents);
//$contents is my string with the code.
Hope it really helps someone.

PHP Regex or DOMDocument for Matching & Removing URLs?

I'm trying to extract links from html page using DOM:
$html = file_get_contents('links.html');
$DOM = new DOMDocument();
$DOM->loadHTML($html);
$a = $DOM->getElementsByTagName('a');
foreach($a as $link){
//echo out the href attribute of the <A> tag.
echo $link->getAttribute('href').'<br/>';
}
Output:
http://dontwantthisdomain.com/dont-want-this-domain-name/
http://dontwantthisdomain2.com/also-dont-want-any-pages-from-this-domain/
http://dontwantthisdomain3.com/dont-want-any-pages-from-this-domain/
http://domain1.com/page-X-on-domain-com.html
http://dontwantthisdomain.com/dont-want-link-from-this-domain-name.html
http://dontwantthisdomain2.com/dont-want-any-pages-from-this-domain/
http://domain.com/page-XZ-on-domain-com.html
http://dontwantthisdomain.com/another-page-from-same-domain-that-i-dont-want-to-be-included/
http://dontwantthisdomain2.com/same-as-above/
http://domain3.com/page-XYZ-on-domain3-com.html
I would like to remove all results matching dontwantthisdomain.com, dontwantthisdomain2.com and dontwantthisdomain3.com so the output will look like this:
http://domain1.com/page-X-on-domain-com.html
http://domain.com/page-XZ-on-domain-com.html
http://domain3.com/page-XYZ-on-domain3-com.html
Some people say I should not use regex for HTML and others say it's OK. Could somebody point out the best way to remove unwanted URLs from my HTML file? :)
Maybe something like this:
function extract_domains($buffer, $whitelist) {
preg_match_all("#<a\s+.*?href=\"(.+?)\".*?>(.+?)</a>#i", $buffer, $matches);
$result = array();
foreach($matches[1] as $url) {
$url = urldecode($url);
$parts = @parse_url((string) $url);
if ($parts !== false && isset($parts['host']) && in_array($parts['host'], $whitelist)) {
$result[] = $parts['host'];
}
}
return $result;
}
$domains = extract_domains(file_get_contents("/path/to/html.htm"), array('stackoverflow.com', 'google.com', 'sub.example.com'));
It does a rough match on all the <a> tags with href=, grabs what's between the quotes, then filters it based on your whitelist of domains.
Non-regex solution (without potential errors :-) :
$html='
http://dontwantthisdomain.com/dont-want-this-domain-name/
http://dontwantthisdomain2.com/also-dont-want-any-pages-from-this-domain/
http://dontwantthisdomain3.com/dont-want-any-pages-from-this-domain/
http://domain1.com/page-X-on-domain-com.html
http://dontwantthisdomain.com/dont-want-link-from-this-domain-name.html
http://dontwantthisdomain2.com/dont-want-any-pages-from-this-domain/
http://domain.com/page-XZ-on-domain-com.html
http://dontwantthisdomain.com/another-page-from-same-domain-that-i-dont-want-to-be-included/
http://dontwantthisdomain2.com/same-as-above/
http://domain3.com/page-XYZ-on-domain3-com.html
';
$html=explode("\n", $html);
$dontWant=array('dontwantthisdomain.com','dontwantthisdomain2.com','dontwantthisdomain3.com');
$final_result=array();
foreach ($html as $link) {
$link=trim($link);
// Skip blank lines
$ok=($link!=='');
foreach($dontWant as $notWanted) {
// strpos() returns false when nothing is found; position 0 is a valid hit
if (strpos($link, $notWanted)!==false) {
$ok=false;
}
}
if ($ok) $final_result[]=$link;
}
echo '<pre>';
print_r($final_result);
echo '</pre>';
outputs
Array
(
[0] => http://domain1.com/page-X-on-domain-com.html
[1] => http://domain.com/page-XZ-on-domain-com.html
[2] => http://domain3.com/page-XYZ-on-domain3-com.html
)
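The two ideas above (DOM for extraction, a domain list for filtering) can be combined. The sketch below keeps the DOM loop from the question and compares the host returned by parse_url() against a block list, which avoids substring false positives. The function name filter_links and the sample markup are assumptions for illustration.

```php
<?php
// Hypothetical helper (not from the thread): returns hrefs whose host is not blocked.
function filter_links(string $html, array $blockedHosts): array
{
    libxml_use_internal_errors(true);   // keep warnings about sloppy markup quiet
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $kept = array();
    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        // With PHP_URL_HOST, parse_url() returns the host string, null when the
        // URL has no host (for example a relative link), or false when malformed
        $host = parse_url($href, PHP_URL_HOST);
        // Links without a host are dropped in this sketch
        if (is_string($host) && !in_array(strtolower($host), $blockedHosts, true)) {
            $kept[] = $href;
        }
    }
    return $kept;
}

// Made-up sample markup using URLs from the question
$html = '<a href="http://dontwantthisdomain.com/dont-want-this-domain-name/">a</a>'
      . '<a href="http://domain1.com/page-X-on-domain-com.html">b</a>'
      . '<a href="http://dontwantthisdomain2.com/same-as-above/">c</a>';
print_r(filter_links($html, array('dontwantthisdomain.com', 'dontwantthisdomain2.com', 'dontwantthisdomain3.com')));
```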

Better stripping method php regex

Please help me strip the following more efficiently.
a href="/mv/test-1-2-3-4.vFIsdfuIHq4gpAnc.html"
The site I visit has a few of those; I only need everything between the two periods:
vFIsdfuIHq4gpAnc
I would like to keep my current regex-based code. Please help me tune up the following preg_match_all line:
preg_match_all("(./(.*?).html)", $sp, $content);
Any kind of help I get on this is greatly appreciated and thank you in advance!
Here is my complete code
$dp = "http://www.cnn.com";
$sp = @file_get_contents($dp);
if ($sp === FALSE) {
echo("<P>Error: unable to read the URL $dp. Process aborted.</P>");
exit();
}
preg_match_all("(./(.*?).html)", $sp, $content);
foreach($content[1] as $surl) {
$nctid = str_replace("mv/","",$surl);
$nctid = str_replace("/","",$nctid);
echo $nctid,'<br /><br /><br />';
}
The above is what I have been working on.
It's pretty okay, really. It's just that you don't want to match .*?, you want to match multiple characters that aren't a full stop, so you can use [^.]+ instead.
$sp = 'a href="/mv/test-1-2-3-4.vFIsdfuIHq4gpAnc.html"';
preg_match_all( '/\.([^.]+)\.html/', $sp, $content );
var_dump( $content[1] );
The result that is printed:
array(1) {
[0]=>
string(16) "vFIsdfuIHq4gpAnc"
}
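To see why the negated class matters once a page contains several links, the sketch below runs a lazy-wildcard pattern and the negated-class pattern on the same input. The two-link sample string is made up for illustration, and the second href is invented.

```php
<?php
// Made-up sample with two links; the second href is invented for illustration
$sp = '<a href="/mv/test-1-2-3-4.vFIsdfuIHq4gpAnc.html">one</a> '
    . '<a href="/mv/other-page.Ab12Cd34.html">two</a>';

// Lazy wildcard, as in the question: captures everything after the first "/"
preg_match_all('~./(.*?)\.html~', $sp, $lazy);

// Negated class: captures only the dot-free run right before ".html"
preg_match_all('/\.([^.]+)\.html/', $sp, $tight);

var_dump($lazy[1][0]);   // still carries the "mv/test-1-2-3-4." prefix
print_r($tight[1]);      // only the two ids
```

With the negated class there is nothing left to clean up with str_replace() afterwards.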
Here's an example of how to loop through all links:
<?php
$url = 'http://www.cnn.com';
$dom = new DOMDocument( );
@$dom->loadHTMLFile( $url );
$links = $dom->getElementsByTagName( 'a' );
foreach( $links as $link ) {
$href = $link->attributes->getNamedItem( 'href' );
if( $href !== null ) {
if( preg_match( '~mv/.*?([^.]+)\.html~', $href->nodeValue, $matches ) ) {
echo "Link-id found: " . $matches[1] . "\n";
}
}
}
You can use explode():
$string = 'a href="/mv/test-1-2-3-4.vFIsdfuIHq4gpAnc.html"';
if(stripos($string, '/mv/') !== false){
$dots = explode('.', $string);
echo $dots[(count($dots)-2)];
}
How about using explode?
$exploded = explode('.', $sp);
$content = $exploded[1]; // string: "vFIsdfuIHq4gpAnc"
Even simpler:
$sp="/mv/test-1-2-3-4.vFIsdfuIHq4gpAnc.html";
$regex = '/\.(?P<value>.*)\./';
preg_match_all($regex, $sp, $content);
echo nl2br(print_r($content["value"], 1));

php regular expression to match string if NOT in an HTML tag

I'm trying to solve this bug in Drupal's Hashtags module: http://drupal.org/node/1718154
I've got this function that matches every word in my text that is prefixed by "#", like #tag:
function hashtags_get_tags($text) {
$tags_list = array();
$pattern = "/#[0-9A-Za-z_]+/";
preg_match_all($pattern, $text, $tags_list);
$result = implode(',', $tags_list[0]);
return $result;
}
I need to ignore internal links in pages (anchors whose href starts with #), or, more generally, any word prefixed by # that appears inside an HTML tag (so preceded by < and followed by >).
Any idea how I can achieve this?
Can you strip the tags before matching (using the strip_tags function)?
function hashtags_get_tags($text) {
$text = strip_tags($text);
$tags_list = array();
$pattern = "/#[0-9A-Za-z_]+/";
preg_match_all($pattern, $text, $tags_list);
$result = implode(',', $tags_list[0]);
return $result;
}
A regular expression is going to be tricky if you want to only match hashtags that are not inside an HTML tag.
You could throw out the tags beforehand using preg_replace:
function hashtags_get_tags($text) {
$tags_list = array();
$pattern = "/#[0-9A-Za-z_]+/";
$text=preg_replace("/<[^>]*>/","",$text);
preg_match_all($pattern, $text, $tags_list);
$result = implode(',', $tags_list[0]);
return $result;
}
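If stripping the tags first is not an option, one commonly used regex trick is a negative lookahead that rejects a candidate when the next angle bracket after it is a closing ">". The sketch below is only that, a sketch: the function name is an assumption, it presumes reasonably well-formed markup, and a literal ">" in plain text after a hashtag (as in "#a > b") would cause that hashtag to be skipped.

```php
<?php
// Hypothetical variant of hashtags_get_tags() (not from the thread)
function hashtags_outside_tags(string $text): string
{
    // (?![^<>]*>) fails when only non-bracket characters separate the candidate
    // from a ">", which is the case when the candidate sits inside a tag
    $pattern = '/#[0-9A-Za-z_]+(?![^<>]*>)/';
    preg_match_all($pattern, $text, $tags_list);
    return implode(',', $tags_list[0]);
}

echo hashtags_outside_tags('<a href="#top">link</a> hello #tag and <b>#other</b>');
// prints: #tag,#other
```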
I made this function using PHP DOM.
It returns all links that do not have # in the href.
If you want it to only remove internal hash links (those whose href starts with #), replace this line:
if(strpos($link->getAttribute('href'), '#') === false) {
with this:
if(strpos($link->getAttribute('href'), '#') !== 0) {
This is the function:
function no_hashtags($text) {
$doc = new DOMDocument();
$doc->loadHTML($text);
$links = $doc->getElementsByTagName('a');
$nohashes = array();
foreach($links as $link) {
if(strpos($link->getAttribute('href'), '#') === false) {
$temp = new DOMDocument();
$elem = $temp->importNode($link->cloneNode(true), true);
$temp->appendChild($elem);
$nohashes[] = $temp->saveHTML();
}
}
// return $nohashes;
return implode('', $nohashes);
// return implode(',', $nohashes);
}
