How to webscrape HTML using PHP dom inside links

I have a problem regarding HTML webscraping.
<div class="mbs fwb">
<a href="/groups/291064327770896/" data-hovercard="/ajax/hovercard/group.php?id=291064327770896" aria-owns="js_0" aria-haspopup="true" aria-describedby="js_1" id="js_2">
NCR Business Startups </a>
</div>
<div class="mbs fwb" >
<a href="/groups/Analystamit/" data-hovercard="/ajax/hovercard/group.php?id=158649140871478" aria-owns="js_0" aria-haspopup="true" aria-describedby="js_1" id="js_2">
Risk Professionals </a>
</div>
I need to scrape the data-hovercard attribute inside each anchor tag.
Below is the code I used:
include('simple_html_dom.php');
$html = file_get_html('http://sampleurl.com/taki.html');
foreach($html->find('div[class="mbs fwb"]') as $desc11)
foreach($desc11->find('a') as $desc12)
echo $desc12->data-hovercard . '<br>';
It is not working. The result I am getting:
0
0
I want a result like this:
/ajax/hovercard/group.php?id=291064327770896
/ajax/hovercard/group.php?id=158649140871478

Use a regular expression with a pattern like /data-hovercard="([^"]*)"/i and pass it to preg_match_all() (PHP has no g flag).
The first capture group will then contain every value of that attribute. You might need to remove newlines from your source text, just for good housekeeping.
Hope this helps.
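A minimal sketch of the regex approach described above, assuming the markup from the question is already in a string (the variable names $html and $hovercards are illustrative, and a few attributes are trimmed for brevity):

```php
<?php
// Sample markup adapted from the question; in practice this would be the fetched page.
$html = <<<'HTML'
<div class="mbs fwb">
<a href="/groups/291064327770896/" data-hovercard="/ajax/hovercard/group.php?id=291064327770896" id="js_2">
NCR Business Startups </a>
</div>
<div class="mbs fwb" >
<a href="/groups/Analystamit/" data-hovercard="/ajax/hovercard/group.php?id=158649140871478" id="js_2">
Risk Professionals </a>
</div>
HTML;

// preg_match_all() finds every occurrence; $matches[1] holds capture group 1.
preg_match_all('/data-hovercard="([^"]*)"/i', $html, $matches);
$hovercards = $matches[1];

foreach ($hovercards as $value) {
    echo $value, PHP_EOL;
}
```

This only works while the attribute is written with double quotes, so it is best treated as a quick fix rather than a general HTML parser.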

You can do this using the built-in SimpleXMLElement class and an XPath query:
$xml = new SimpleXMLElement('http://foo.bar/baz.html', 0, true);
$anchors = $xml->xpath('//div[@class="mbs fwb"]/a');
foreach ($anchors as $a) {
echo $a['data-hovercard'], PHP_EOL;
}
Output, assuming baz.html is a valid HTML file containing the divs
from the question:
/ajax/hovercard/group.php?id=291064327770896
/ajax/hovercard/group.php?id=158649140871478
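SimpleXMLElement expects well-formed XML, which real-world HTML often is not. As a hedged alternative, the same XPath query can be run through DOMDocument, whose loadHTML() is more forgiving. The sketch below assumes the page source is already in $html (an illustrative name); it is not the only way to do this.

```php
<?php
// Sample markup adapted from the question.
$html = <<<'HTML'
<html><body>
<div class="mbs fwb">
<a href="/groups/291064327770896/" data-hovercard="/ajax/hovercard/group.php?id=291064327770896">NCR Business Startups </a>
</div>
<div class="mbs fwb" >
<a href="/groups/Analystamit/" data-hovercard="/ajax/hovercard/group.php?id=158649140871478">Risk Professionals </a>
</div>
</body></html>
HTML;

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$hovercards = [];
foreach ($xpath->query('//div[@class="mbs fwb"]/a') as $a) {
    // getAttribute() takes the attribute name as a string, so the hyphen is no problem.
    $hovercards[] = $a->getAttribute('data-hovercard');
}

echo implode(PHP_EOL, $hovercards), PHP_EOL;
```

As for the 0 in the question: PHP parses $desc12->data-hovercard as a subtraction ($desc12->data minus hovercard), not as a property lookup. A hyphenated property name has to be written with brace syntax, for example $desc12->{'data-hovercard'}, which should also work with simple_html_dom.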

Related

Edit iframe content using PHP, and preg_replace()

I need to load some 3rd party widget onto my website. The only way they distribute it is by means of clumsy old <iframe>.
I don't have much choice so what I do is get an iframe html code, using a proxy page on my website like so:
$iframe = file_get_contents('http://example.com/page_with_iframe_html.php');
Then I have to remove some specific parts in iframe like this:
$iframe = preg_replace('~<div class="someclass">[\s\S]*<\/div>~ix', '', $iframe);
In this way I intend to remove the unwanted section. And in the end i simply output the iframe like so:
echo ($iframe);
The iframe gets output alright, however the unwanted section is still there. The regex itself was tested using regex101, but it doesn't work.

Try it this way; I hope it helps. Here I use sample HTML and remove the div with the given class name: first I load the document, then I query for the node with XPath and remove it from its parent.
<?php
ini_set('display_errors', 1);
//sample HTML content
$string1='<html>'
. '<body>'
. '<div>This is div 1</div>'
. '<div class="someclass"> <span class="hot-line-text"> hotline: </span> <a id="hot-line-tel" class="hot-line-link" href="tel:0000" target="_parent"> <button class="hot-line-button"></button> <span class="hot-line-number">0000</span> </a> </div>'
. '</body>'
. '</html>';
$object= new DOMDocument();
$object->loadHTML($string1);
$xpathObj= new DOMXPath($object);
$result=$xpathObj->query('//div[@class="someclass"]');
foreach($result as $node)
{
$node->parentNode->removeChild($node);
}
echo $object->saveHTML();
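One likely reason the original preg_replace() call had no effect is the x (extended) modifier: with it, unescaped whitespace in the pattern is ignored, so the space between div and class is never matched. The sketch below illustrates this with a made-up sample string ($iframe here is illustrative, not the real widget markup):

```php
<?php
$iframe = '<p>keep</p><div class="someclass">unwanted</div>';

// With x, the literal space in the pattern is ignored, so the pattern
// effectively looks for '<divclass="someclass">' and does not match.
$withX = preg_match('~<div class="someclass">[\s\S]*</div>~ix', $iframe);

// Without x, the space is matched literally and the pattern matches.
$withoutX = preg_match('~<div class="someclass">[\s\S]*</div>~i', $iframe);

var_dump($withX, $withoutX); // int(0), then int(1)
```

Even without x, the greedy [\s\S]* runs to the last </div> in the document, which is one more reason to prefer the DOM approach above.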

Modifying DOM to style sequential headings

Let's start with this html in my database table:
<section id="love">
<h2 class="h2Article">III. Love</h2>
<div class="divArticle">
This is what the display looks like after I run it through a DOM script:
<section id="love"><h2 class="h2Article" id="a3" data-toggle="collapse" data-target="#b3">III. Love</h2>
<div class="divArticle collapse in article" id="b3">
And this is what I would like it to look like:
<section id="love"><h2 class="h2Article" id="a3" data-toggle="collapse" data-target="#b3">
<span class="Article label label-primary">
<i class="only-collapsed fa fa-chevron-down"></i>
<i class="only-expanded fa fa-remove"></i> III. Love</span></h2>
<div class="divArticle collapse in article" id="b3">
In other words, the DOM script has given it the necessary function, correctly numbering each id sequentially. All that's missing is the styling:
<span class="Article"><span class="label label-primary"><i class="only-collapsed fa fa-chevron-down"></i><i class="only-expanded fa fa-remove"></i> III. Love</span></span>
Can anyone tell me how to add that styling? The titles will change, of course (e.g. III. Love, IV. Hate, etc.). I posted my DOM script below:
$i = 1; // initialize counter
$dom = new DOMDocument;
@$dom->loadHTML($Content); // load the markup (@ suppresses parser warnings)
$sections = $dom->getElementsByTagName('section'); // get all section tags
foreach($sections as $section) { // for each section tag
// get div inside each section
foreach($section->getElementsByTagName('h2') as $h2) {
if($h2->getAttribute('class') == 'h2Article') { // if this h2 has class h2Article
$h2->setAttribute('id', 'a' . $i); // set id for h2 tag
$h2->setAttribute('data-target', '#b' . $i);
}
}
foreach($section->getElementsByTagName('div') as $div) {
if($div->getAttribute('class') == 'divArticle') { // if this div has class divArticle
$div->setAttribute('id', 'b' . $i); // set id for div tag
}
if($div->getAttribute('class') == 'divClose') { // if this div has class divClose
$div->setAttribute('data-target', '#b' . $i); // set data-target for div tag
}
}
$i++; // increment counter
}
// back to string again, get all contents inside body
$Content = '';
foreach($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
$Content .= $dom->saveHTML($child); // convert to string and append to the container
}
$Content = str_replace('data-target', 'data-toggle="collapse" data-target', $Content);
$Content = str_replace('<div class="divArticle', '<div class="divArticle collapse in article', $Content);

Since a DOMDocument object is being used in this case, its createElement() method can be used to add HTML.
See http://php.net/manual/en/domdocument.createelement.php
Borrowing the example from that documentation page:
<?php
$dom = new DOMDocument('1.0', 'utf-8');
$element = $dom->createElement('test', 'This is the root element!');
// We insert the new element as root (child of the document)
$dom->appendChild($element);
echo $dom->saveXML();
?>
will output
<?xml version="1.0" encoding="utf-8"?>
<test>This is the root element!</test>
Without the DOM object, you would normally add HTML in PHP in one of the following ways.
1.
echo "<div>this method is often used for shorter pieces of HTML</div>";
2.
?> <div> You can also escape out of HTML and then "turn" PHP back on like this </div> <?php
The first method uses the echo command to output a string of HTML. The second method uses the ?> escape tag to tell the computer to start treating everything as HTML until it sees another opening <?php PHP tag.
So normally in a PHP file you can add HTML like so.
?>
<span class="Article">
<span class="label label-primary">
<i class="only-collapsed fa fa-chevron-down"></i>
<i class="only-expanded fa fa-remove"></i>
III. Love
</span>
</span>
<?php
But since in this case we're trying to edit content coming from inside of the database we're not able to do this.
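Returning to the DOM approach, here is a rough sketch of how createElement() could wrap each heading's title in the styled span from the question. It assumes the markup is in $Content as in the asker's script; the sample string and the $styled variable are illustrative, and the span and icon classes are copied from the desired output.

```php
<?php
// Illustrative sample; in the question this comes from the database.
$Content = '<section id="love"><h2 class="h2Article">III. Love</h2>'
         . '<div class="divArticle">Text</div></section>';

libxml_use_internal_errors(true); // <section> may trigger parser warnings
$dom = new DOMDocument();
$dom->loadHTML($Content);
libxml_clear_errors();

foreach ($dom->getElementsByTagName('h2') as $h2) {
    if ($h2->getAttribute('class') !== 'h2Article') {
        continue;
    }
    $title = trim($h2->textContent);

    // Build <span class="..."><i></i><i></i> Title</span>
    $span = $dom->createElement('span');
    $span->setAttribute('class', 'Article label label-primary');
    foreach (['only-collapsed fa fa-chevron-down', 'only-expanded fa fa-remove'] as $iconClass) {
        $icon = $dom->createElement('i');
        $icon->setAttribute('class', $iconClass);
        $span->appendChild($icon);
    }
    $span->appendChild($dom->createTextNode(' ' . $title));

    // Replace the heading's existing children with the new span.
    while ($h2->firstChild) {
        $h2->removeChild($h2->firstChild);
    }
    $h2->appendChild($span);
}

$styled = $dom->saveHTML($dom->getElementsByTagName('h2')->item(0));
echo $styled, PHP_EOL;
```

In the asker's script this loop body could sit inside the existing h2 loop, so the ids and the styling are applied in one pass.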

Well, I guess the obvious solution is to just wrap the title in something that can be modified with a simple str_replace...
<h2><span class="Answer">III. Love</span></h2>
Or even this...
<h2>[]III. Love[]</h2>
Kind of Mickey Mouse, but it gets the job done. I just dislike having to write out or paste all of that code into every heading in every article; I prefer to automate it as much as possible.

How to scrape img src value of each li tag

<ul class="vehicle__gallery cf">
<li><img src="AETV19098412_2a.jpg"></li>
<li><img src="AETV19098412_3a.jpg"></li>
<li><img src="AETV19098412_4a.jpg"></li>
</ul>
and my preg match syntax is as below:
preg_match_all('/<ul class="vehicle__gallery cf">.*?<li>.*?<a(.*?)href="(.*?)"(.*?)>(.*?)<\/a>.*?<\/li>.*?<\/ul>/s', $html_image,$posts, PREG_SET_ORDER);

Please don't use regular expressions to parse HTML. PHP has a fine DOM implementation: you can loadHTML() the markup and query() it with XPath expressions such as //ul/li//img/@src to retrieve what you're after, or import it as a SimpleXML object if you prefer that toolset.
Example:
$html = <<<HTML
<ul class="vehicle__gallery cf">
<li><img src="AETV19098412_2a.jpg"></li>
<li><img src="AETV19098412_3a.jpg"></li>
<li><img src="AETV19098412_4a.jpg"></li>
</ul>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXpath($dom);
$imgs = $xpath->query("//ul/li//img/@src");
foreach ($imgs as $img) {
echo $img->nodeValue . "\n";
}
Output:
AETV19098412_2a.jpg
AETV19098412_3a.jpg
AETV19098412_4a.jpg

Don't use regex to parse HTML; it won't work reliably:
<li> tags don't always have a closing tag, and neither do <img> tags.
A tag can have any number of attributes.
Attribute values aren't always wrapped in double quotes.
Use an HTML parser such as Simple HTML DOM Parser instead.
I won't even attempt to come up with a regex for this, because at some point it would fail.

If you give your img tags a class or something, for example:
<img class="gallery_item" src="AETV19098412_2a.jpg">
<img class="gallery_item" src="AETV19098412_3a.jpg">
you can do it more easily:
preg_match_all('/<img class="gallery_item" src="([^"]*)">/', $html_image, $matches);
However, this is still very hacky: if you ever add a CSS class or another HTML attribute, or otherwise modify your markup, the pattern may stop matching.
This solution is anything but clean. You should consider using jQuery or a form, as stated in my earlier comment; that would make your life a lot easier, and the code would not break because of minor HTML changes that might come up any day.

Another approach is to use JavaScript (jQuery):
var imgArr = [];
$("ul.vehicle__gallery li img").each(function () {
  imgArr.push($(this).attr('src'));
});

PHP or Javascript: Simply Remove and Replace HTML Code

I have this code on my page, but the link has different names and ids:
<div class="myclass">
<a href="http://www.example.com/?vstid=00575000&veranstaltung=http://www.example.com/page.html">
Example Text</a>
</div>
How can I remove the link and replace the block with this:
<div class="myclass">Sorry no link</div>
With PHP or Javascript? I tried it with str.replace
Thank you!

I assume you mean dynamically? You won't be able to do this with PHP after the page has loaded, because PHP runs on the server and has nothing to do with the HTML once it has been sent to the browser.
See http://www.tizag.com/javascriptT/javascript-innerHTML.php for the JavaScript approach.
Or you could use jQuery, which is nicer than writing cross-browser-compatible JavaScript by hand:
$('.myclass').html('Sorry...');

If the page is still on the server before you need to make the replacement, do this:
<?php if (allowed_to_see_link()) { ?>
<div class="myclass">
<a href="http://www.example.com/?vstid=00575000&veranstaltung=http://www.example.com/page.html">
Example Text</a>
</div>
<?php } else { ?>
non-link-text
<?php } ?>
You will also need to write the allowed_to_see_link() function yourself.

You might want to clarify what you are trying to do. If that is your own file, you can simply open it in an editor and remove those portions. If you want to modify HTML with PHP, you can use the native DOM extension:
$dom = new DOMDocument;
$dom->loadHTML($htmlString);
$xPath = new DOMXPath($dom);
foreach( $xPath->query('//div[@class="myclass"]/a') as $link) {
$link->parentNode->replaceChild(new DOMText('Sorry no link'), $link);
}
echo $dom->saveHTML();
The above code would replace any direct <a> element children of any <div> elements that have a class attribute of myclass with the Textnode "Sorry no link".

php: how can I work with html as xml ? how do i find specific nodes and get the text inside these nodes?

Let's say I have the following web page:
<html>
<body>
<div class="transform">
<span>1</span>
</div>
<div class="transform">
<span>2</span>
</div>
<div class="transform">
<span>3</span>
</div>
</body>
</html>
I would like to find all div elements that have the class transform and fetch the text in each of them.
I know I can do that easily with regular expressions, but I would like to know how to do it without them, by parsing the XML and finding the nodes I need.
Update:
I know that in this example I could just iterate through all the divs, but this is only an example to illustrate what I need.
In this example I need to query for divs that have the attribute class="transform".
Thanks!

You could use SimpleXML; see the example below:
$string = "<?xml version='1.0'?>
<html>
<body>
<div class='transform'>
<span>1</span>
</div>
<div>
<span>2</span>
</div>
<div class='transform'>
<span>3</span>
</div>
</body>
</html>";
$xml = simplexml_load_string($string);
$result = $xml->xpath("//div[@class = 'transform']");
foreach($result as $node) {
echo "span " . $node->span . "<br />";
}
Updated it with xpath...

You can use xpath to address the items. For that particular query, you'd use:
div[contains(concat(" ",@class," "), concat(" ","transform"," "))]
Full PHP example:
<?php
$document = new DomDocument();
$document->loadHtml($html);
$xpath = new DomXPath($document);
foreach ($xpath->query('//div[contains(concat(" ",@class," "), concat(" ","transform"," "))]') as $div) {
var_dump($div);
}
If you know CSS, here's a handy CSS-selector to XPath-expression mapping: http://plasmasturm.org/log/444/ -- You can find the above example listed there, as well as other common queries.
If you use it a lot, you might find my csslib library handy. It offers a wrapper csslib_DomCssQuery, which is similar to DomXPath, but using CSS-selectors instead.
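To illustrate why the concat()/contains() form is used rather than a plain @class comparison, here is a small self-contained sketch. The sample markup and the $texts variable are made up for the illustration, and normalize-space() is an optional extra that tolerates stray whitespace in the attribute.

```php
<?php
// Illustrative markup: an exact match, a multi-class match, and a near miss.
$html = '<html><body>'
      . '<div class="transform"><span>1</span></div>'
      . '<div class="box transform"><span>2</span></div>'
      . '<div class="transformed"><span>3</span></div>'
      . '</body></html>';

libxml_use_internal_errors(true);
$document = new DOMDocument();
$document->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($document);

// Pad the class list with spaces so " transform " only matches a whole class name.
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " transform ")]';

$texts = [];
foreach ($xpath->query($query) as $div) {
    $texts[] = trim($div->textContent);
}

print_r($texts); // contains "1" and "2", but not "3"
```

A plain //div[@class="transform"] would miss the second div, and a bare contains(@class, "transform") would wrongly include the third.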

OK, what I wanted can be easily achieved using XPath in PHP.
Example:
http://ditio.net/2008/12/01/php-xpath-tutorial-advanced-xml-part-1/
