find url in html code - php

I want to find a URL in HTML code with PHP or JS.
E.g. I have this text:
<description>
<![CDATA[<p>
<img" src="http://2010.pcnews.am/images/stories/2011/internet/chinese-computer-user-smoke.jpg" border="0" align="left" "/>
Երեկ Պեկինի ինտերնետ-սրճարաններից մեկում մահացել է 33-ամյա մի չինացի, ով 27 օր շարունակ անցկացրել էր համակարգչի առաջ: Հաղորդում է չինական «Ցյանլունվան» պարբերականը:</p>
<p>Աշխատանք չունեցող չինացին մեկ ամիս շարունակ չի լքել ինտերնետ-սրճարանը ՝ այդ ամբողջ ընթացքում սնվելով արագ պատրաստվող մակարոնով:</p>
<p />
Նույնիսկ ամանորյա տոները նա անցկացրել է համակարգչի առաջ. Պեկինի բնակիչները նշում են Նոր տարին Լուսնային օրացույցով՝ փետրվարի 3-8-ը: Մահվան պատճառները չեն հաղորդվում:
]]>
</description>
I want to extract only "http://2010.pcnews.am/images/stories/2011/internet/chinese-computer-user-smoke.jpg".
Thanks in advance.

This is a rather complicated task and while regex may seem easier, it is far too problematic. The following code will go through an XML file (called some.xml, but you’ll obviously need to change that) and gather the image sources into an array, $images.
$images = array();
$doc = new DOMDocument();
$doc->load('some.xml');
$descriptions = $doc->getElementsByTagName("description");
foreach ($descriptions as $description) {
foreach($description->childNodes as $child) {
if ($child->nodeType == XML_CDATA_SECTION_NODE) {
$html = new DOMDocument();
@$html->loadHTML($child->textContent);
$imgs = $html->getElementsByTagName('img');
foreach($imgs as $img) {
$images[] = $img->getAttribute('src');
}
}
}
}
I tested it against the XML you supplied and got the following result:
Array
(
[0] => http://2010.pcnews.am/images/stories/2011/internet/chinese-computer-user-smoke.jpg
)
I put it into an array in case there is more than one description with images.

You can use javascript or jQuery to get the image's src attribute.
document.getElementsByTagName("img")[x].src

Use a regex to capture the content between src=" and the next "

In php could be done like this:
<?php
$txt = 'text here <img src="http://domain.com/something.png" border="0" align="left" "/> more
test and <em>html</em> around here
<p> thats it </p>';
preg_match('/src="([^"]*)"/', $txt, $matches);
var_dump($matches[1]);
?>

Regular expressions are brittle for text parsing and do not take advantage of the document's inherent structure. Using RegEx to find stuff in a marked up document is generally a poor practice.
Use PHP's built in DOMNode and DOMXPath instead.
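As a minimal sketch of that suggestion (the sample markup and variable names below are made up for illustration, not taken from the question's feed), DOMXPath can select the src attributes directly:

```php
<?php
// Hypothetical sample input, modelled loosely on the markup in the question.
$html = '<p><img src="http://example.com/a.jpg" border="0" align="left" />Some text</p>'
      . '<p><img src="/images/b.png" alt="b"></p>';

// Collect libxml warnings internally instead of emitting them for loose HTML.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// The XPath expression selects the src attribute node of every img element.
$xpath = new DOMXPath($dom);
$srcs = [];
foreach ($xpath->query('//img/@src') as $attr) {
    $srcs[] = $attr->nodeValue;
}

print_r($srcs);
```

Querying the attribute nodes avoids a separate getAttribute() call per element; images without a src attribute are simply not returned.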

Related

How can I replace multiple img elements with plain text?

I want to create an output text filter that replaces all the <img> elements in the DOM with the text "no images allowed".
I.e.: If the user creates this HTML markup:
<p><img src="/image.jpg" /></p>
the following HTML is rendered:
<p>no images allowed</p>
Please note that I cannot use preg_replace. The question is simplified, and I need to parse the DOM to find which images to disallow.
Thanks to this answer, I found that getElementsByTagName() returns a "live" iterator, so two steps are needed. This is what I have:
foreach ($elements as $element) {
$domArray[] = $element;
$src= $element->getAttribute('src');
$frag= $dom->createElement('p');
$frag->nodeValue = 'no images allowed';
$element->parentNode->appendChild($frag);
}
// loop through the array and delete each node
$nodes = iterator_to_array($dom->getElementsByTagName('img'));
foreach ($nodes as $node) {
$node->parentNode->removeChild($node);
}
$newtext = $dom->saveHTML();
It almost does what I want, but I get this:
<p><p>no images allowed</p></p>
I would fetch the elements with xpath, then replace with newly created text nodes.
$xp = new DOMXPath($dom);
$elements = $xp->query('//img');
foreach ($elements as $element) {
$frag= $dom->createTextNode('no images allowed');
$element->parentNode->insertBefore($frag, $element);
$element->parentNode->removeChild($element);
}
echo $dom->saveHtml();
Demo here: http://codepad.org/w9uj0ez9
To remove HTML self-enclosed img tag you may use a simple regular expression:
<?php
function no_images_allowed($text) {
return preg_replace('/<img[^>]*>/', 'no images allowed', $text);
}
print no_images_allowed('<p><img src="/image.jpg" /></p>');
It is simpler and should be much more efficient: you do not need to traverse every DOM element, you just process plain text.
The regex in the example above will only work for a self-closing or unclosed img tag:
<img src="..."/>
<img src="...">
Please note that it will not work for example with:
<img src="..."></img>
<IMG SRC="..."/>
<img src="...">invalid content</img>
If you want to include every possible case (even invalid ones) then proposed regex should be modified.
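One possible modification is sketched below. The function name no_images_allowed_ci is hypothetical and this is not the answerer's code; it only covers the three cases listed above and still breaks if an attribute value contains a ">" character.

```php
<?php
// Hypothetical variant of no_images_allowed(), shown for illustration only.
function no_images_allowed_ci($text) {
    // \b            : do not match longer tag names such as <imgfoo>
    // [^>]*>        : the rest of the opening tag
    // (?:[^<]*</img>)? : optionally swallow plain text up to a closing </img>
    // i flag        : match <IMG ...> as well as <img ...>
    return preg_replace('~<img\b[^>]*>(?:[^<]*</img>)?~i', 'no images allowed', $text);
}

print no_images_allowed_ci('<p><IMG SRC="/image.jpg"></img></p>');
// prints <p>no images allowed</p>
```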

How to change src attribute with specific string in php

I am using Simple HTML DOM for this. There are some img tags whose src I want to change to a specific string: I want to change URLs that contain http://localhost.com in the given text to https://i0.wp.com/localhost.com
Example:
$data='<p><img class="alignnone wp-image-36109 size-full" src="https://localhost.com/wp-content/uploads/2014/10/WjhzQlNRaDJrYUUx_o_using-freedom-for-unlimited-in-app-purchases-android-.jpg" alt="WjhzQlNRaDJrYUUx_o_using-freedom-for-unlimited-in-app-purchases-android-" width="480" height="360"/></p>
';
I have used the code below to search for https://localhost.com, but how can I change it?
$html->find('img[src^=https://localhost.com/]');
Result from simple html dom:
It gives me the value I searched for, but I want to replace that value as described above.
I have also tried this regex:
echo preg_replace("/(<img.*src=)[\"'](.*)[\"']/m",'\1"https://i0.wp.com/\2\"',$data);
but it gives me output like this:
<p><img class="alignnone wp-image-36109 size-full" src="https://i0.wp.com/https://localhost.com/wp-content/uploads/2014/10/WjhzQlNRaDJrYUUx_o_using-freedom-for-unlimited-in-app-purchases-android-.jpg" alt="WjhzQlNRaDJrYUUx_o_using-freedom-for-unlimited-in-app-purchases-android-" width="480" height="360\"/></p>
All the src is with https://i0.wp.com/ but in regex, I want to get this result:
Result from regex:
https://i0.wp.com/https://i0.wp.com/localhost.com/wp-content/uploads/2016/03/sd.png?resize=300%2C300
Want to get result:
https://i0.wp.com/i0.wp.com/localhost.com/wp-content/uploads/2016/03/sd.png?resize=300%2C300
Can someone give me a clue how to get this? A regex-based answer would be especially helpful. I hope the problem is clear; please comment below if you need more information.
If anyone wants to downvote this question, please leave a comment explaining why. I am not an expert PHP developer, and I am learning from my mistakes.
There you go (this uses the far superior DOMDocument library with xpath and regular expressions):
<?php
$data='<p><img class="alignnone wp-image-36109 size-full" src="https://localhost.com/wp-content/uploads/2014/10/WjhzQlNRaDJrYUUx_o_using-freedom-for-unlimited-in-app-purchases-android-.jpg" alt="WjhzQlNRaDJrYUUx_o_using-freedom-for-unlimited-in-app-purchases-android-" width="480" height="360"/></p>';
$dom = new DOMDocument();
$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
# filters images
$needle = 'https://localhost.com';
$images = $xpath->query("//img[starts-with(@src, '$needle')]");
# split on positive lookahead
$regex = '~(?=localhost\.com)~';
foreach ($images as $image) {
$parts = preg_split($regex, $image->getAttribute("src"));
$newtarget = $parts[0] . "i0.wp.com/i0.wp.com/" . $parts[1];
$image->setAttribute("src", $newtarget);
}
# just to show the result
echo $dom->saveHTML();
?>
And see a demo on ideone.com.

Extract two string in html code

I have a HTML table which has the following structure:
<tr>
<td class='tablesortcolumn'>atest</td>
<td >Kunde</td>
<td >email@example.com</td>
<td align="right"><img src="images/iconedit.gif" border="0"/> <img src="images/pixel.gif" width="2" height="1" border="0"/> <img src="images/icontrash.gif" border="0"/></td>
</tr>
There are hundreds of these tr blocks.
I want to extract atest and email@example.com
I tried the following:
$document = new DOMDocument();
$document->loadHTML($data);
$selector = new DOMXPath($document);
$elements = $selector->query("//*[contains(@class, 'tablesortcolumn')]");
foreach($elements as $element) {
$text = $element->nodeValue;
print($text);
print('<br>');
}
Extracting atest is no problem, because I can get the element with the tablesortcolumn class. How can I get the email address?
I cannot simply use //table/tr/td/a because there are other elements on the website which are structured like this. So I need to get it by choosing an empty href tag. I already tried //table/tr/td/a[contains(@href, '')] but it returns the same as //table/tr/td/a
Does anyone have an idea how to solve this?
Can you try running an XPath that matches text containing the string @? It seems unlikely that this would be used for anything else.
So something like this might work:
//*[text()[contains(.,'@')]]
The following XPath expression does exactly what you want
//*[@class = 'tablesortcolumn' or contains(text(),'@')]
using the input document you have shown will yield (individual results separated by -------------):
<td class="tablesortcolumn">atest</td>
-----------------------
email@example.com
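A minimal PHP sketch of running such an expression is shown here. It assumes the email is the plain text of its table cell and restricts the query to td elements; the trimmed-down $data row and the $values array are illustrative, not part of the original page.

```php
<?php
// Simplified version of the row from the question (assumption: the email
// is the direct text content of its cell).
$data = "<table><tr>
<td class='tablesortcolumn'>atest</td>
<td>Kunde</td>
<td>email@example.com</td>
</tr></table>";

libxml_use_internal_errors(true);
$document = new DOMDocument();
$document->loadHTML($data);
libxml_clear_errors();

// Match cells that either carry the class or whose first text node contains "@".
$selector = new DOMXPath($document);
$elements = $selector->query("//td[@class = 'tablesortcolumn' or contains(text(), '@')]");

$values = [];
foreach ($elements as $element) {
    $values[] = trim($element->nodeValue);
}
print_r($values);
```

Note that contains(text(), '@') only inspects the first text node of each element, so an address nested inside a child element of the cell would need a different expression, such as contains(., '@').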
If you are looking for an email field, you could use a regex. Here is an article that could be useful.
EDIT
According to Nisse Engström, I will put the interesting part of the article here in case the blog goes down. Thanks for the advice.
// Suppress XML parsing errors (this is needed to parse Wikipedia's XHTML)
libxml_use_internal_errors(true);
// Load the PHP Wikipedia article
$domDoc = new DOMDocument();
$domDoc->load('http://en.wikipedia.org/wiki/PHP');
// Create XPath object and register the XHTML namespace
$xPath = new DOMXPath($domDoc);
$xPath->registerNamespace('html', 'http://www.w3.org/1999/xhtml');
// Register the PHP namespace if you want to call PHP functions
$xPath->registerNamespace('php', 'http://php.net/xpath');
// Register preg_match to be available in XPath queries
//
// You can also pass an array to register multiple functions, or call
// registerPhpFunctions() with no parameters to register all PHP functions
$xPath->registerPhpFunctions('preg_match');
// Find all external links in the article
$regex = '#^http://[^/]+(?<!wikipedia.org)/#';
$links = $xPath->query("//html:a[ php:functionString('preg_match', '$regex', @href) > 0 ]");
// Print out matched entries
echo "Found " . (int) $links->length . " external links\n\n";
foreach($links as $linkDom) { /* @var $linkDom DOMElement */
$link = simplexml_import_dom($linkDom);
$desc = (string) $link;
$href = (string) $link['href'];
echo " - ";
if ($desc && $desc != $href) {
echo "$desc: ";
}
echo "$href\n";
}
If you are using Chrome, you can test your XPath queries in the console, like this :
$x("//*[contains(@class, 'tablesortcolumn')]")

PHP Regex to find images with specific src attribute

I have a variable with HTML source, and I need to find the images within it that have a specific src attribute.
For example my image:
<img src="/path/img1.svg">
I have tried the code below, but it doesn't work. Any suggestions?
$hmtl = '<div> some stuff <img src="/path/img1.svg"/> </div><div>other stuff</div>';
preg_match_all('/<img src="/path/img1.svg"[^>]+>/i',$v, $images);
You should make use of DOMDocument Class, not regular expressions when it comes to parsing HTML.
<?php
$html='<img src="/path/img1.svg">';
$dom = new DOMDocument;
@$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('img') as $tag) {
echo $tag->getAttribute('src'); //"prints" /path/img1.svg
}
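To pick out only the images with one particular src, as the question asks, an XPath attribute filter can be added. The following is a rough sketch; the second image, the $wanted variable, and the $matches array are assumptions made for the example.

```php
<?php
// Sample input based on the question, plus a second image that should not match.
$html = '<div> some stuff <img src="/path/img1.svg"/> </div>'
      . '<div>other stuff <img src="/path/img2.svg"/></div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Assumption: $wanted contains no single quote, otherwise the XPath string breaks.
$wanted = '/path/img1.svg';

$xpath = new DOMXPath($dom);
$matches = [];
foreach ($xpath->query("//img[@src='$wanted']") as $img) {
    // Serialize just the matching element back to markup.
    $matches[] = $dom->saveHTML($img);
}
print_r($matches);
```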

how to match specific text link with php regex

I'm looking for a regular expression in PHP that matches anchors with a specific target="_parent" attribute. I would like to get anchors with text like:
preg_match_all('Text here', subject, matches, PREG_SET_ORDER);
HTML:
<a href="http://" target="_parent">
<FONT style="font-size:10pt" color=#000000 face="Tahoma">
<DIV><B>Text</B> - Text </DIV>
</FONT>
</a>
</DIV>
To be honest, the best way would be not to use a regular expression at all. Otherwise, you are going to be missing out on all kinds of different links, especially if you don't know that the links are always going to have the same way of being generated.
The best way is to use an XML parser.
<?php
$html = '<a href="http://" target="_parent">Text here</a>';
function extractTags($html) {
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html); // because dom will complain about badly formatted html
$sxe = simplexml_import_dom($dom);
$nodes = $sxe->xpath("//a[@target='_parent']");
$anchors = array();
foreach($nodes as $node) {
$anchor = trim((string)dom_import_simplexml($node)->textContent);
$attribs = $node->attributes();
$anchors[$anchor] = (string)$attribs->href;
}
return $anchors;
}
print_r(extractTags($html));
This will output:
Array (
[Text here] => http://
)
Even using it on your example:
$html = '<a href="http://" target="_parent">
<FONT style="font-size:10pt" color=#000000 face="Tahoma">
<DIV><B>Text</B> - Text </DIV>
</FONT>
</a>
</DIV>
';
print_r(extractTags($html));
will output:
Array (
[Text - Text] => http://
)
If you feel that the HTML is still not clean enough to be used with DOMDocument, then I would recommend using a project such as HTMLPurifier (see http://htmlpurifier.org/) to first clean the HTML up completely (and remove unneeded HTML) and use the output from that to load into DOMDocument.
You should be using the DOMDocument class instead of regex. You will get a lot of false positives if you handle HTML with regex.
<?php
$html='<a href="http://" target="_parent">Text here</a>';
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $tag) {
if ($tag->getAttribute('target') === '_parent') {
echo $tag->nodeValue;
}
}
OUTPUT :
Text here
