problem with string manipulation function retrieving URL

I built a simple scraper to collect links from another website.
My problem now is extracting just the link itself rather than the whole element.
<a onclick="javascript:_gaq.push(['_trackEvent','outbound-article','namobile.naughtyamerica.com']);" href="http://www.wwww.com/track/MTA3ODQxLjEyLjQwLjQwLjAuMC4wLjAuMA/freeporn3/lisa_ann6/7535/"><img class="aligncenter size-full" title="Lisa Ann" src="http://www.www.com/upload/source/mfhm/lisawill/lisawillhor_gmna_big_img3.jpg" alt="Lisa Ann" width="313" height="223" /></a>
Here is the image and its link. I need only the link, stored in a variable like this:
$url = "http://www.wwww.com/track/MTA3ODQxLjEyLjQwLjQwLjAuMC4wLjAuMA/freeporn3/lisa_ann6/7535/";
That's it, thank you.

Use QueryPath, Simple HTML DOM Parser, or another PHP library for navigating a DOM document.

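If you are familiar with CSS selectors, you can use the phpQuery library and its attr method.
<?php
// $html holds the scraped markup. Load it first; pq() then queries the loaded document.
phpQuery::newDocumentHTML($html);
echo pq('a')->attr('href');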

$html = <<< EOF
<a onclick="javascript:_gaq.push(['_trackEvent','outbound-article','namobile.naughtyamerica.com']);" href="http://www.wwww.com/track/MTA3ODQxLjEyLjQwLjQwLjAuMC4wLjAuMA/freeporn3/lisa_ann6/7535/"><img class="aligncenter size-full" title="Lisa Ann" src="http://www.www.com/upload/source/mfhm/lisawill/lisawillhor_gmna_big_img3.jpg" alt="Lisa Ann" width="313" height="223" /></a>
EOF;
preg_match_all('/<a onclick.*?href="(.*?)"/im', $html, $url, PREG_PATTERN_ORDER);
$url = $url[1][0];
echo $url; // prints the value of the link's href attribute
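
For comparison with the regex above, here is a minimal sketch of the same extraction using PHP's bundled DOM extension, which is the kind of approach the library suggestions point toward. The sample markup and the example.com URLs are placeholders, not the asker's data, and the XPath assumes the image is a direct child of the link.

```php
<?php
// Hypothetical input: a link wrapping an image, similar in shape to the question's markup.
$html = '<a onclick="track();" href="http://example.com/track/abc/123/">'
      . '<img src="http://example.com/img/photo.jpg" alt="Photo" /></a>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // scraped HTML is rarely valid; buffer warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Select the href attribute of every <a> that directly contains an <img>.
$nodes = $xpath->query('//a[img]/@href');
$url = $nodes->length > 0 ? $nodes->item(0)->nodeValue : null;

echo $url, "\n";
```

Unlike the regex, this does not depend on onclick appearing before href or on the quoting style of the attributes.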

Related

PHP Remove html link tag from a string where the hypertext in new line

I have the following string:
$linkString="The Following is a link to google <a class='links' href='http://google.com'>
http://google.com
</a>
";
In this string, the link text of the HTML link sits on a new line. I want to remove, and possibly replace, the whole link (its HTML tag and the link text), so I tried the following:
<?php
$linkString="The Following is a link to google <a class='links' href='http://google.com'>
http://google.com
</a>
";
//Remove link tag:
echo preg_replace('/<[^>]*>/','',$linkString);
However, the above example prints out:
The Following is a link to google
http://google.com
This is an online DEMO: http://codepad.org/whw81bwa
I want a regex that can remove the entire link (the tag and its link text).

Instead of using regex, make effective use of DOM to do this for you.
$doc = new DOMDocument;
@$doc->loadHTML($html); // load the HTML data; @ suppresses parser warnings
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//a') as $tag) {
$tag->parentNode->removeChild($tag);
}
echo $doc->saveHTML();

The following regex solves the issue. The s modifier is needed so that . also matches the newlines inside the link:
/<a\b[^>]*>.*?<\/a>/is
So,
<?php
$linkString="The Following is a link to google <a class='links' href='http://google.com'>
http://google.com
</a>
";
// Replace the whole link (tag and link text):
echo preg_replace('/<a\b[^>]*>.*?<\/a>/is','A Hidden Link',$linkString);

How to add rel="lightbox" only to links which are including an image and not the rest of the links? [duplicate]

Let's say that (in PHP) we have a variable which contains text, links and images:
<a href="myimage_big"><img src="myimage.jpg" alt="pic" title="pic" border="0" /></a>
and we want to add rel="lightbox" to every a href tag, as follows:
<a rel="lightbox" href="myimage_big"><img src="myimage.jpg" alt="pic" title="pic" border="0" /></a>
If the variable is named, say, $mydata, then str_replace solves the problem like this:
$mydata = str_replace('<a ', '<a rel="lightbox" ', $mydata);
So far so good, but what about the remaining a href links that do not contain an image, for example:
<a href="link1.php">link_no1</a>
<a href="link2.php">link_no2</a>
With the str_replace code, these text-only links also get a rel="lightbox" attribute, which is incorrect and not what I want:
<a rel="lightbox" href="link1.php">link_no1</a>
So how can we apply rel="lightbox" only to the links that contain an image, and leave it off the links that do not?

Does this regex solve the problem?
$str = '<a href="myimage_big"><img src="myimage.jpg" /></a>';
$str = preg_replace('~<a(?=[^>]+>\s*<img)~','<a rel="lightbox"',$str);
echo htmlspecialchars($str);
It uses a lookahead to check whether the <a ...> tag is followed by <img.

If you prefer to use a regular expression ...
$html = preg_replace('/(?<=<a)(?=[^>]*>[^<]*<img)/', ' rel="lightbox"', $html);
Although, I would consider using DOM and XPath ...
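// Sample markup; the href values are illustrative.
$doc = new DOMDocument;
$doc->loadHTML('
<a href="myimage_big"><img src="myimage.jpg" alt="pic" title="pic" border="0" /></a>
<a href="link1.php">link_no1</a>
<a href="link2.php">link_no2</a>
<a href="image1_big"><img src="image1.jpg"></a>
');
$xpath = new DOMXPath($doc);
// Select only the links that contain an image.
$links = $xpath->query('//a[.//img]');
foreach($links as $link) {
$link->setAttribute('rel', 'lightbox');
}
echo $doc->saveHTML();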

How to get img tag value inside a specific div and specific anchor tag using regular expression

I am new to regular expressions. I have tried a lot to get the src value of an image tag inside an anchor tag.
This is my HTML:
<div class="smallSku" id="ctl00_ContentPlaceHolder1_smallImages">
<a title="" name="http://www.playg.in/productImages/med/PNC000051_PNC000051.jpg" href="http://www.playg.in/productImages/lrg/PNC000051_PNC000051.jpg" onclick="return showPic(this)" onmouseover="return showPic(this)">
<img border="0" alt="" src="http://www.playg.in/productImages/thmb/PNC000051_PNC000051.jpg"></a> <a title="PNC000051_PNC000051_1.jpg" name="http://www.playg.in/productImages/med/PNC000051_PNC000051_1.jpg" href="http://www.playg.in/productImages/lrg/PNC000051_PNC000051_1.jpg" onclick="return showPic(this)" onmouseover="return showPic(this)">
<img border="0" alt="PNC000051_PNC000051_1.jpg" src="http://www.playg.in/productImages/thmb/PNC000051_PNC000051_1.jpg"></a>
</div>
I want to return only the src value of each image tag. I tried a matching pattern in preg_match_all(), and the pattern was
"#<div[\s\S]class="smallSku"[\s\S]id="ctl00_ContentPlaceHolder1_smallImages"\><a title=\"\" name="[\w\W]" href="[\w\W]" onclick=\"[\w\W]" onmouseover="[\w\W]"\><img[\s\S]src="(.*)"[\s\S]></a><\/div>#"
Please help. I have tried many times, and I also tried the approach from the question "Match image tag not nested in an anchor tag using regular expression".

Regular expressions are not the right tool for parsing HTML. See this FAQ: How to parse and process HTML/XML?
Here is an example on how to get the src property using your example:
$doc = new DOMDocument();
$doc->loadHTML($your_html_string);
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//div[@class="smallSku"]/a/img/@src') as $attr) {
$src = $attr->value;
print $src;
}

Try this, sunith:
$content = file_get_contents('your url');
// The s modifier lets . match newlines; .*? stops at the first closing </div>.
preg_match_all("|<div class='items'>.*?</div>|s", $content, $arr, PREG_PATTERN_ORDER);
preg_match_all("/src='([^']+)'/", $arr[0][0], $arrr, PREG_PATTERN_ORDER);
echo '<pre>';
print_r($arrr);

find image with specific src using preg_replace

I have some text with images within it. I want to replace specific images within the text with something else.
For example, the text contains a YouTube thumbnail image that I want to replace with the actual video embed:
<img class="mceItem" src="http://img.youtube.com/vi/1MsVzAkmds0/default.jpg" alt="1MsVzAkmds0">
and replace it with the youtube Iframe code:
<iframe title="'.$id.'" class="youtube-player" type="text/html" width="576" height="400" src="http://www.youtube.com/embed/'.$id.'" frameborder="0"></iframe>
my function looks like this:
function replacelink($link) {
$find= ("/<img src=[^>]+\>/i");
$replace = youtube("\\2");
return preg_replace($find,$replace);
}
What do I need to change in the regex to do the above?

Your regex is looking for <img src=, but there is a class attribute between img and src. Using $find= '/<img.*src=[^>]+>/i'; corrects the problem; however, this illustrates why you shouldn’t use regex to parse HTML.
You wrote:
I have some text with images within it.
If the text you’re referring to is actually HTML, then there are better alternatives to using regex for this.
Update
I believe this is what you’re looking for.
<?php
function replacelink($text) {
$replace = '<iframe title="$2" class="youtube-player" type="text/html" width="576" height="400" src="http://www.youtube.com/embed/$2" frameborder="0"></iframe>';
$find = '/(<img.*?alt="([\da-z]+)".*?>)/i';
return preg_replace($find, $replace, $text);
}
$imagestr = '<img class="mceItem" src="http://img.youtube.com/vi/1MsVzAkmds0/default.jpg" alt="1MsVzAkmds0">';
echo replacelink($imagestr);
?>
There’s no need for a separate youtube() function.
Nothing extra is needed to replace more than one image: preg_replace() already replaces every match in the text.

The following regex would match all the images with a specific URL. I am not sure if this is what you wanted.
<img [^>]*?src="url"[^>]*?>
The previous answer would fail if there were more than one image.
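
To make the "specific src" idea concrete, here is a hedged sketch that combines it with the iframe replacement from the question. The function name replace_youtube_thumbnails is made up for illustration, and the pattern assumes the src is double-quoted and follows the img.youtube.com/vi/<id>/ shape shown in the question; images with any other src are left alone.

```php
<?php
// Replace every YouTube thumbnail <img> with an embed <iframe>.
// Assumption: src is double-quoted and looks like http://img.youtube.com/vi/<id>/...
function replace_youtube_thumbnails(string $text): string
{
    $pattern = '~<img\b[^>]*\bsrc="https?://img\.youtube\.com/vi/([\w-]+)/[^"]*"[^>]*>~i';

    return preg_replace_callback($pattern, function (array $m): string {
        // $m[1] is the video id captured from the thumbnail URL.
        $id = htmlspecialchars($m[1], ENT_QUOTES);
        return '<iframe title="' . $id . '" class="youtube-player" type="text/html" '
             . 'width="576" height="400" src="http://www.youtube.com/embed/' . $id . '" '
             . 'frameborder="0"></iframe>';
    }, $text);
}

$text = 'Intro <img class="mceItem" src="http://img.youtube.com/vi/1MsVzAkmds0/default.jpg" alt="1MsVzAkmds0"> '
      . 'and <img src="other.jpg" alt="keep me">';
echo replace_youtube_thumbnails($text), "\n";
```

Taking the id from the src rather than from alt means the replacement still works when the alt text is missing or different.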

Using regular expressions to extract the first image source from html codes?

I would like to know how this can be achieved.
Assume: That there's a lot of html code containing tables, divs, images, etc.
Problem: How can I get matches of all occurrences? Moreover, to be specific, how can I get the img tag source (src = ?)?
example:
<img src="http://example.com/g.jpg" alt="" />
How can I print out http://example.com/g.jpg in this case. I want to assume that there are also other tags in the html code as i mentioned, and possibly more than one image. Would it be possible to have an array of all images sources in html code?
I know this can be achieved way or another with regular expressions, but I can't get the hang of it.
Any help is greatly appreciated.

While regular expressions can be good for a large variety of tasks, I find it usually falls short when parsing HTML DOM. The problem with HTML is that the structure of your document is so variable that it is hard to accurately (and by accurately I mean 100% success rate with no false positive) extract a tag.
What I recommend you do is use a DOM parser such as SimpleHTML and use it as such:
function get_first_image($html) {
require_once('SimpleHTML.class.php');
$post_html = str_get_html($html);
$first_img = $post_html->find('img', 0);
if($first_img !== null) {
return $first_img->src;
}
return null;
}
Some may think this is overkill, but in the end, it will be easier to maintain and also allows for more extensibility. For example, using the DOM parser, I can also get the alt attribute.
A regular expression could be devised to achieve the same goal but would be limited in such way that it would force the alt attribute to be after the src or the opposite, and to overcome this limitation would add more complexity to the regular expression.
Also, consider the following. To properly match an <img> tag using regular expressions and to get only the src attribute (captured in group 2), you need the following regular expression:
<\s*?img\s+[^>]*?\s*src\s*=\s*(["'])((\\?+.)*?)\1[^>]*?>
And then again, the above can fail if:
The attribute or tag name is in capital and the i modifier is not used.
Quotes are not used around the src attribute.
An attribute other than src uses the > character somewhere in its value.
Some other reason I have not foreseen.
So again, simply don't use regular expressions to parse a dom document.
EDIT: If you want all the images:
function get_images($html){
require_once('SimpleHTML.class.php');
$post_dom = str_get_html($html);
$img_tags = $post_dom->find('img');
$images = array();
foreach($img_tags as $image) {
$images[] = $image->src;
}
return $images;
}
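
If installing a third-party parser is not an option, the same result can be sketched with PHP's bundled DOM extension. This is only an illustration: the function name get_image_sources and the sample markup are assumptions, chosen to exercise the failure cases listed above (uppercase tag names, unquoted attributes, different quote styles).

```php
<?php
// Collect the src of every <img>, using only PHP's bundled DOM extension.
function get_image_sources(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate invalid markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);        // restore the previous setting

    $sources = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        if ($img->hasAttribute('src')) {
            $sources[] = $img->getAttribute('src');
        }
    }
    return $sources;
}

// Hypothetical sample: uppercase tag, unquoted attribute values, single quotes.
$html = '<div><IMG SRC="a.jpg"><img alt=x src=b.png><p><img src=\'c.gif\' /></p></div>';
print_r(get_image_sources($html));
```

The first image is simply the first element of the returned array, if there is one.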

Use this; it is more efficient:
preg_match_all('/<img [^>]*src=["|\']([^"|\']+)/i', $html, $matches);
foreach ($matches[1] as $key=>$value) {
echo $value."<br>";
}
Example:
$html = '
<ul>
<li><a target="_new" href="http://www.manfromuranus.com">Man from Uranus</a></li>
<li><a target="_new" href="http://www.thevichygovernment.com/">The Vichy Government</a></li>
<li><a target="_new" href="http://www.cambridgepoetry.org/">Cambridge Poetry</a></li>
<img width="190" height="197" border="0" align="right" alt="upload.jpg" title="upload.jpg" class="noborder" src="value1.jpg" />
<li>Electronaut Records</li>
<img width="190" height="197" border="0" align="right" alt="upload.jpg" title="upload.jpg" class="noborder" src="value2.jpg" />
<li><a target="_new" href="http://www.catseye-crew.com">Catseye Productions</a></li>
<img width="190" height="197" border="0" align="right" alt="upload.jpg" title="upload.jpg" class="noborder" src="value3.jpg" />
</ul>
<img width="190" height="197" border="0" align="right" alt="upload.jpg" title="upload.jpg" class="noborder" src="res/upload.jpg" />
<li><a target="_new" href="http://www.manfromuranus.com">Man from Uranus</a></li>
<li><a target="_new" href="http://www.thevichygovernment.com/">The Vichy Government</a></li>
<li><a target="_new" href="http://www.cambridgepoetry.org/">Cambridge Poetry</a></li>
<img width="190" height="197" border="0" align="right" alt="upload.jpg" title="upload.jpg" class="noborder" src="value4.jpg" />
<li>Electronaut Records</li>
<img src="value5.jpg" />
<li><a target="_new" href="http://www.catseye-crew.com">Catseye Productions</a></li>
<img width="190" height="197" border="0" align="right" alt="upload.jpg" title="upload.jpg" class="noborder" src="value6.jpg" />
';
preg_match_all('/<img [^>]*src=["|\']([^"|\']+)/i', $html, $matches);
foreach ($matches[1] as $key=>$value) {
echo $value."<br>";
}
Output:
value1.jpg
value2.jpg
value3.jpg
res/upload.jpg
value4.jpg
value5.jpg
value6.jpg

This works for me:
preg_match('#<img.+src="(.*)".*>#Uims', $html, $matches);
$src = $matches[1];

I assume every src= value has double quotes around the URL:
<img[^>]+src=\"([^\"]+)\"
The other answers posted here make other assumptions about your code.

I agree with Andrew Moore. Using the DOM is much, much better. The HTML DOM images collection will return to you a reference to all image objects.
Let's say in your header you have,
<script type="text/javascript">
function getFirstImageSource()
{
var img = document.images[0].src;
return img;
}
</script>
and then in your body you have,
<script type="text/javascript">
alert(getFirstImageSource());
</script>
This will return the 1st image source. You can also loop through them along the lines of, (in head section)
function getAllImageSources()
{
var returnString = "";
for (var i = 0; i < document.images.length; i++)
{
returnString += document.images[i].src + "\n"
}
return returnString;
}
(in body)
<script type="text/javascript">
alert(getAllImageSources());
</script>
If you're using JavaScript to do this, remember that you can't run your function looping through the images collection in your header. In other words, you can't do something like this,
<script type="text/javascript">
function getFirstImageSource()
{
var img = document.images[0].src;
return img;
}
alert(getFirstImageSource()); // bad: called while the header is still being processed
</script>
because this won't work. The body has not been parsed yet when a script in the header runs, so document.images is still empty and the call fails.
Hopefully this can help in some way. If possible, I'd make use of the DOM. You'll find that a good deal of your work is already done for you.

I don't know if you MUST use regex to get your results. If not, you could try out simpleXML and XPath, which would be much more reliable for your goal:
First, import the HTML into a DOM Document Object. If you get errors, turn errors off for this part and be sure to turn them back on afterward:
$dom = new DOMDocument();
$dom -> loadHTMLFile("filename.html");
Next, import the DOM into a simpleXML object, like so:
$xml = simplexml_import_dom($dom);
Now you can use a few methods to get all of your image elements (and their attributes) into an array. XPath is the one I prefer, because I've had better luck with traversing the DOM with it:
$images = $xml -> xpath('//img/@src');
This variable can now be treated like an array of your image URLs:
foreach($images as $image) {
// Double quotes are required here so that $image is interpolated.
echo "<img src=\"$image\" /><br />\n";
}
Presto, all of your images, none of the fat.
Here's the non-annotated version of the above:
$dom = new DOMDocument();
$dom -> loadHTMLFile("filename.html");
$xml = simplexml_import_dom($dom);
$images = $xml -> xpath('//img/@src');
foreach($images as $image) {
echo "<img src=\"$image\" /><br />\n";
}
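
The advice above about turning errors off while loading and back on afterward can be sketched with libxml_use_internal_errors(). This is one possible way to do it, not necessarily what the answer had in mind; the sample markup with a stray closing tag is made up to trigger a parser warning.

```php
<?php
// Hypothetical markup with a stray closing tag that makes libxml report an error.
$html = '<html><body><img src="one.jpg"></span><img src="two.jpg"></body></html>';

$previous = libxml_use_internal_errors(true);  // errors off: buffer them instead of printing
$dom = new DOMDocument();
$dom->loadHTML($html);
$problems = libxml_get_errors();               // the suppressed errors, if you want to inspect them
libxml_clear_errors();
libxml_use_internal_errors($previous);         // errors back on: restore the prior setting

$xml = simplexml_import_dom($dom);
// Each XPath result is an attribute node; casting it to string yields the URL.
$sources = array_map('strval', $xml->xpath('//img/@src'));

print_r($sources);
```

Restoring the saved value instead of hard-coding false keeps the snippet from changing error handling for the rest of the script.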

I really think you cannot predict all the cases with one regular expression.
The best way is to use the DOM with the PHP 5 DOMDocument class and XPath. It's the cleanest way to do what you want.
$dom = new DOMDocument();
$dom->loadHTML( $htmlContent );
$xml = simplexml_import_dom($dom);
$images = $xml -> xpath('//img/@src');

You can try this:
preg_match_all("/<img\s[^>]*src=\"([^\"]+)\"/i", $html, $matches);
// $matches[1] holds the captured src values; $matches[0] holds the full matches.
foreach ($matches[1] as $key=>$value) {
echo $key . ", " . $value . "<br>";
}

Since you're not worrying about validating the HTML, you might try calling strip_tags() on the text first, passing '<img>' as the second argument so that everything except the img tags is cleared out.
Then you can search for an expression like
"/<img .+? \/>/i"
Only the / needs a backslash here, because it is also the pattern delimiter; < and > have no special meaning in a regex.
.+? insists that there be one or more of any character inside the img tag, stopping at the first " />".
You can capture part of the expression by putting parentheses around it, e.g. (.+?) captures the middle part of the img tag.
When you decide which part of the middle you specifically wish to capture, you can replace the (.+?) with something more specific.
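
As a small illustration of narrowing a capture group, the sketch below first captures everything inside the tag and then only the src value. The sample markup is invented, and the second pattern assumes the src is double-quoted.

```php
<?php
$html = '<p>Text</p><img class="thumb" src="pic.png" alt="A picture" />';

// Broad capture: everything between "<img " and the closing ">".
preg_match('/<img\s+([^>]+)>/i', $html, $broad);

// Narrower capture: only the double-quoted src value.
preg_match('/<img\s[^>]*\bsrc="([^"]*)"/i', $html, $narrow);

echo $broad[1], "\n";   // the full attribute text of the tag
echo $narrow[1], "\n";  // just the image URL
```

In both cases index 0 of the result array holds the whole match and index 1 holds the first parenthesized group.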

<?php
/* PHP Simple HTML DOM Parser # http://simplehtmldom.sourceforge.net */
require_once('simple_html_dom.php');
$html = file_get_html('http://example.com');
$image = $html->find('img')[0]->src;
echo "<img src='{$image}'/>"; // BOOM!
PHP Simple HTML DOM Parser will do the job in a few lines of code.
