preg_match to get div contents - php

I am trying to get the contents of a div named: <img id="hplogo-img" src="thelinkiwant"/>
I have this code, which isn't working; it just echoes 'Array':
<?php
include_once('simple_html_dom.php');
$html = file_get_html($url);
preg_match('/<img id= \'hplogo-img\'>(.*)<\/div>/s',$html,$matches);
echo $matches;
?>
If it's possible to do this with straight PHP, that would be preferred. Any ideas why I can't get the link from the div?

Why not use the method the parser provides?
$ret = $html->find('img[id=hplogo-img]', 0); // first matching element
echo $ret->src;

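$matches is an array, which is why echo just prints 'Array'.
Try using
print_r($matches);
You should see the array's contents :)
$matches[0] holds the full match and $matches[1] holds the text captured by the first group, which should be what you're looking for. So use:
echo $matches[1];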
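Since the question asks for "straight PHP", here is a minimal sketch using the built-in DOMDocument and DOMXPath classes instead of a regex. The sample markup is made up to stand in for the fetched page; only the id and src come from the question.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = '<div id="hplogo"><img id="hplogo-img" src="thelinkiwant"/></div>';

libxml_use_internal_errors(true); // keep warnings about imperfect HTML quiet
$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
// Select the src attribute of the img whose id is "hplogo-img".
$nodes = $xpath->query('//img[@id="hplogo-img"]/@src');
$src = $nodes->length > 0 ? $nodes->item(0)->nodeValue : null;

echo $src, "\n";
```

If no element with that id exists, $src stays null, so it is worth checking before using it.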

Related

How to extract a single link in a webpage using PHP?

I'm looking for a solution to extract only one URL from a specific webpage using PHP.
Here's a simple example of what I need:
I have a URL with many links (https://apkpure.com/mi-home/com.xiaomi.smarthome/download?from=details)
I want to scrape the link under the anchor click here from the current page.
Then the code must return this result https://download.apkpure.com/b/XAPK/Y29tLnhpYW9taS5zbWFydGhvbWVfNjMwNjdfYWU1M2FmOWU?_fn=TWkgSG9tZV92NS44LjdfYXBrcHVyZS5jb20ueGFwaw&as=4c5e64f6f957edac834f3631fe4e09715f2e35f6&ai=-1070628217&at=1596863870&_sa=ai%2Cat&k=24cb20f95fbf333deb01c145ce7b982b5f30d87e&_p=Y29tLnhpYW9taS5zbWFydGhvbWU&c=1%7CLIFESTYLE%7CZGV2PVhpYW9taSUyMEluYy4mdD14YXBrJnM9MTI5OTAzMTM4JnZuPTUuOC43JnZjPTYzMDY3.
I tried this:
$sourceURL="https://apkpure.com/mi-home/com.xiaomi.smarthome/download?from=details";
$htmlSource=htmlentities(file_get_contents($sourceURL));
echo strip_tags($htmlSource, "<a>");
I get a result with all the links, including the one I need.
I need your help to extract the href value of the link I want.
Thanks in advance.
If you look at the required URL, you can see a pattern: every Click Here URL starts with https://download.apkpure.com, so we can use a regex to find it.
preg_match_all fills $match with every string that matches our regex. $match[0] holds the full matches, and implode joins them into a single string.
Here is the complete working code:
$sourceURL="https://apkpure.com/mi-home/com.xiaomi.smarthome/download?from=details";
$content=file_get_contents($sourceURL);
$content = strip_tags($content,"<a>");
preg_match_all('#\bhttps?://download\.apkpure\.com[^,\s()<>]+(?:\([\w\d]+\)|([^,[:punct:]\s]|/))#', $content, $match);
echo implode(', ', $match[0]);
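To see what preg_match_all actually puts into $match, here is a small sketch that runs offline. The markup and the shortened URL are invented for illustration, and the pattern is a simplified stand-in (an assumption, not the exact regex above).

```php
<?php
// Hypothetical markup; the real page and its URLs will differ.
$content = '<a href="https://apkpure.com/about">About</a> '
         . '<a id="download_link" href="https://download.apkpure.com/b/XAPK/abc123?_fn=demo&k=1">click here</a>';

// Simplified pattern: the download host, then everything up to the
// next quote, whitespace or angle bracket.
$count = preg_match_all('#https?://download\.apkpure\.com[^"\'\s<>]+#', $content, $match);

// preg_match_all returns the number of matches; $match[0] lists the full matches.
$link = $count > 0 ? $match[0][0] : null;
echo $link, "\n";
```

On a real page the query string is often entity-encoded (&amp; instead of &), in which case html_entity_decode() on the result may be needed.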
The most elegant way is to use a DOM parser:
Iterate through the anchors.
Check whether the anchor's ID is 'download_link' (the ID of the 'click here' link).
Extract the href attribute value.
$html = file_get_contents('https://apkpure.com/mi-home/com.xiaomi.smarthome/download?from=details');
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$href = '';
foreach($doc->getElementsByTagName('a') as $item) {
    if($item->getAttribute('id') == 'download_link') {
        $href = $item->getAttribute('href');
        break;
    }
}
echo $href;
https://download.apkpure.com/b/XAPK/Y29tLnhpYW9taS5zbWFydGhvbWVfNjMwNjdfYWU1M2FmOWU?_fn=TWkgSG9tZV92NS44LjdfYXBrcHVyZS5jb20ueGFwaw&as=6a7de2cb660007a32e4b3d61a0d3c41e5f2e7102&ai=1946881098&at=1596878986&_sa=ai%2Cat&k=9e912b1007d50d2be9af8e78bcdea86c5f31138a&_p=Y29tLnhpYW9taS5zbWFydGhvbWU&c=1%7CLIFESTYLE%7CZGV2PVhpYW9taSUyMEluYy4mdD14YXBrJnM9MTI5OTAzMTM4JnZuPTUuOC43JnZjPTYzMDY3

Replace an element with Dom Document PHP

I load an HTML page with PHP DOMDocument:
$doc = new DOMDocument();
@$doc->loadHTMLFile($url);
I search my page for all "a" elements, and if they meet my condition, I need to replace, for example, <a href="...">My link is beautiful</a> with just My link is beautiful
Here is my loop:
$liens = $div->getElementsByTagName('a');
foreach($liens as $lien){
    if($lien->hasAttribute('href')){
        if (preg_match("/metz2/i", $lien->getAttribute('href'))) {
            //HERE I NEED TO REPLACE </a>
        }
        $cpt++;
    }
}
Do you have any ideas? Suggestions? Thanks :)
Every time I need to manage the DOM with PHP, I use a library called PHP Simple HTML DOM Parser.
It's very easy to use, something like this might work for you:
// Create DOM from URL or file
$html = file_get_html('http://www.page.com/');
// Find all links
foreach($html->find('a') as $element) {
    // Do your custom logic here if you need it. This example takes the inner contents of the a-tag and puts them in place of the whole tag.
    $inner = $element->innertext;
    $element->outertext = $inner; // outertext is assigned as a property, not called
}
//To echo modified html again:
echo $html;
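This could be done with preg_replace as well:
$sText = '<a href="http://stackoverflow.com">Stackoverflow</a>'; // sample input; the href is only illustrative
$sText = preg_replace( '/<a\b[^>]*>(.*?)<\/a>/', '$1', $sText ); // non-greedy, so several links on one line are handled separately
echo $sText;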
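Since the question uses DOMDocument, the replacement can probably be done there too, without an extra library. Here is a minimal sketch, assuming made-up sample markup and the "metz2" condition from the question: each matching link is swapped for a text node holding its text.

```php
<?php
// Hypothetical input; "metz2" in the href is the condition from the question.
$html = '<div><a href="http://example.com/metz2/page">My link is beautiful</a> and '
      . '<a href="http://example.com/other">this one stays</a></div>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);

// getElementsByTagName returns a live list, so copy it before changing the tree.
$links = iterator_to_array($doc->getElementsByTagName('a'));
foreach ($links as $link) {
    if (preg_match('/metz2/i', $link->getAttribute('href'))) {
        // Swap the whole <a> element for a text node holding its text.
        $text = $doc->createTextNode($link->textContent);
        $link->parentNode->replaceChild($text, $link);
    }
}

// Serialize only the div we started with.
$result = $doc->saveHTML($doc->getElementsByTagName('div')->item(0));
echo $result, "\n";
```

Note that textContent drops any markup nested inside the link (for example a bold tag), so this keeps only the plain text.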

Strip directory structure in HTML

I have a PHP application that reads in a bit of HTML. In this HTML there may be an img tag. What I want to do is strip the directory structure from the src of the image tag e.g.
<img src="dir1/dir2/dir3/image1.jpg">
to
<img src="image1.jpg">
Anyone have any pointers?
Thanks,
Mark
As a suggestion, rather than parsing a whole document with regex, you may be better off using something like the SimpleXML class to traverse the HTML. That way you can find the img tags and their src attributes and change them easily. After that, you can explode the src string on the "/" delimiter and use the last element of the resulting array as the new src attribute.
PHP.net's SimpleXML Manual: http://php.net/manual/en/book.simplexml.php
This is a tutorial on how to change all links in an HTML document: Scraping Links From HTML.
With a slight modification of the example, this could do it:
<?php
require('FluentDOM/FluentDOM.php');
$html = '<img src="dir1/dir2/dir3/image1.jpg">';
$fd = FluentDOM($html, 'html')->find('//img[@src]')->each(
    function ($node) {
        $item = FluentDOM($node);
        // Overwrite src with just the file name.
        $item->attr('src', basename($item->attr('src')));
    }
);
$fd->contentType = 'xml';
header('Content-type: text/xml');
echo $fd;
?>
If you want to try this with regexp this could work:
$subject = "dir1/dir2/dir3/image1.jpg";
$pattern = '/^.*\//';
$result = preg_replace($pattern, '', $subject);
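The two ideas above (walk the img tags with a parser, then keep only the last path segment) can be combined using only built-in classes. This is a rough sketch with DOMDocument and basename(), on made-up sample markup:

```php
<?php
// Hypothetical snippet of the HTML the application reads in.
$html = '<p>Intro <img src="dir1/dir2/dir3/image1.jpg" alt="demo"> text</p>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);

foreach ($doc->getElementsByTagName('img') as $img) {
    // basename() keeps only the part after the last slash.
    $img->setAttribute('src', basename($img->getAttribute('src')));
}

// Serialize only the paragraph we started with.
$result = $doc->saveHTML($doc->getElementsByTagName('p')->item(0));
echo $result, "\n";
```

If a src carries a query string, basename() keeps it as part of the file name, so that case would need extra handling.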

Matching SRC attribute of IMG tag using preg_match

I'm attempting to run preg_match to extract the SRC attribute from the first IMG tag in an article (in this case, stored in $row->introtext).
preg_match('/\< *[img][^\>]*[src] *= *[\"\']{0,1}([^\"\']*)/i', $row->introtext, $matches);
Instead of getting something like
images/stories/otakuzoku1.jpg
from
<img src="images/stories/otakuzoku1.jpg" border="0" alt="Inside Otakuzoku's store" />
I get just
0
The regex should be right, but I can't tell why it appears to be matching the border attribute and not the src attribute.
Alternatively, if you've had the patience to read this far without skipping straight to the reply field and typing 'use a HTML/XML parser', can a good tutorial for one be recommended as I'm having trouble finding one at all that's applicable to PHP 4.
PHP 4.4.7
Your expression is incorrect. Try:
preg_match('/< *img[^>]*src *= *["\']?([^"\']*)/i', $row->introtext, $matches);
Note the removal of brackets around img and src and some other cleanups.
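As a quick check, here is the corrected expression run against the sample tag from the question. The variable $introtext is a stand-in for $row->introtext, and the leading paragraph is added only to show that the pattern skips other tags.

```php
<?php
// Stand-in for $row->introtext.
$introtext = '<p>Lead</p><img src="images/stories/otakuzoku1.jpg" border="0" alt="Inside Otakuzoku\'s store" />';

$src = null;
if (preg_match('/< *img[^>]*src *= *["\']?([^"\']*)/i', $introtext, $matches)) {
    // $matches[0] is the whole matched text; $matches[1] is the captured src value.
    $src = $matches[1];
}
echo $src, "\n";
```

preg_match returns 0 when nothing matches, so checking its return value avoids reading an empty $matches.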
Here's a way to do it with built-in functions (php >= 4):
$parser = xml_parser_create();
xml_parse_into_struct($parser, $html, $values);
foreach ($values as $key => $val) {
    if ($val['tag'] == 'IMG') {
        $first_src = $val['attributes']['SRC'];
        break;
    }
}
echo $first_src; // images/stories/otakuzoku1.jpg
If you need to use preg_match() itself, try this (the src value ends up in $matches[2]):
preg_match('/(?<!_)src=([\'"])?(.*?)\\1/',$content, $matches);
Try:
include ("htmlparser.inc"); // from: http://php-html.sourceforge.net/
$html = 'bla <img src="images/stories/otakuzoku1.jpg" border="0" alt="Inside Otakuzoku\'s store" /> noise <img src="das" /> foo';
$parser = new HtmlParser($html);
while($parser->parse()) {
    if($parser->iNodeName == 'img') {
        echo $parser->iNodeAttributes['src'];
        break;
    }
}
which will produce:
images/stories/otakuzoku1.jpg
It should work with PHP 4.x.
The regex I used was much simpler. My code assumes that the string being passed to it contains exactly one img tag with no other markup:
$pattern = '/src="([^"]*)"/';
See my answer here for more info: How to extract img src, title and alt from html using php?
This task should be executed by a DOM parser because regex is DOM-ignorant.
Code:
$row = (object)['introtext' => '<div>test</div><img src="source1"><p>text</p><img src="source2"><br>'];
$dom = new DOMDocument();
$dom->loadHTML($row->introtext);
echo $dom->getElementsByTagName('img')->item(0)->getAttribute('src');
Output:
source1
This says:
Parse the whole html string
Isolate all of the img tags
Isolate the first img tag
Isolate its src attribute value
Clean, appropriate, easy to read and manage.

Syntax regex to extract the source of an image

Ahoy there!
I can't work out which syntax I should use to extract the source of an image: just the web address, without the src= or the quotes.
Here is my piece of code:
function get_all_images_src() {
$content = get_the_content();
preg_match_all('|src="(.*?)"|i', $content, $matches, PREG_SET_ORDER);
foreach($matches as $path) {
echo $path[0];
}
}
When I use it, this gets printed:
src="http://project.bechade.fr/wp-content/uploads/2009/09/mer-300x225.jpg"
And I wish to get only this:
http://project.bechade.fr/wp-content/uploads/2009/09/mer-300x225.jpg
Any idea?
Thanks for your help.
Not exactly an answer to your question, but when parsing html, consider using a proper html parser:
foreach($html->find('img') as $element) {
echo $element->src . '<br />';
}
See: http://simplehtmldom.sourceforge.net/
Use $path[1] instead of $path[0]:
echo $path[1];
$path[0] is the full string matched. $path[1] is the first grouping.
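To make the difference concrete, here is a small sketch of what PREG_SET_ORDER produces. The $content string is a made-up stand-in for get_the_content().

```php
<?php
// Hypothetical content standing in for get_the_content().
$content = '<img src="http://example.com/a.jpg"> <img src="http://example.com/b.jpg">';

preg_match_all('|src="(.*?)"|i', $content, $matches, PREG_SET_ORDER);

$sources = [];
foreach ($matches as $path) {
    // $path[0] is the full match, including src= and the quotes.
    // $path[1] is the first capture group: only the address.
    $sources[] = $path[1];
}
print_r($sources);
```

With PREG_SET_ORDER each element of $matches is one complete match set, which is why the loop variable is itself an array.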
You could explode the string using " as a delimiter; the second item in the resulting array would then be the right string:
$array = explode('"',$full_src);
$bit_you_want = $array[1];
Reworking your original function, it would be:
function get_all_images_src() {
$content = get_the_content();
preg_match_all('|src="(.*?)"|i', $content, $matches, PREG_SET_ORDER);
    foreach($matches as $path) {
        // $path is an array; $path[0] is the full match, e.g. src="..."
        $src = explode('"', $path[0]);
        echo $src[1];
    }
}
Thanks to ithcy for the right answer.
I guess I took too long to respond, because he deleted it and I don't know where his answer has gone...
So here is the copy I received by mail:
'|src="(.*?)"|i' makes no sense as a regex. Try '|src="([^"]+)"|i' instead. (Which still isn't the most robust solution but is better than what you've got.)
Also, what everyone else said. You want $path[1], NOT $path[0]. You're already extracting all the src attributes into $matches[]. That has nothing to do with $path[0]. If you're not getting all of the src attributes in the text, there is a problem somewhere else in your code.
One more thing - you should use a real HTML parser for this, because img tags are not the only tags with src attributes. If you're using this code on raw HTML source, it's going to match not just <img> tags but every other tag that carries a src attribute.
— ithcy
I did everything he told me to do, including using an HTML parser as suggested by Bart (2nd answer).
It works like a charm! Thank you, mate...
