I wrote the code below to get all unique links from a URL:
include_once ('simple_html_dom.php');
$html = file_get_html('http://www.example.com');
foreach($html->find('a') as $element){
$input = array($element->href = $element->href . '<br />');
print_r(array_unique($input));}
but I really can't understand why it shows the duplicated links too!
Is there a problem with using array_unique together with Simple HTML DOM?
There's another thing that I guess is related to the problem: when you run this, all of the extracted links end up under a single key, like this:
array(key => all values)
Is there anyone who can solve this?
I believe you want it more like this:
$temp = array();
foreach($html->find('a') as $element) {
$temp[] = $element->href;
}
echo '<pre>' . print_r(array_unique($temp), true) . '</pre>';
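The loop in the question builds a fresh one-element array on every pass, so array_unique only ever sees one link at a time and has nothing to deduplicate. A minimal sketch of the difference, assuming a hard-coded list of hrefs (the $hrefs values are made up for illustration) in place of $html->find('a'):

```php
<?php
// Hypothetical hrefs standing in for the values of $element->href.
$hrefs = ['/a', '/b', '/a', '/c', '/b'];

// Collect everything first, then deduplicate once after the loop.
$temp = [];
foreach ($hrefs as $href) {
    $temp[] = $href;
}

// array_unique keeps the first occurrence of each value and preserves its key;
// array_values reindexes the result from 0.
$unique = array_values(array_unique($temp));

print_r($unique);
```

The array_values call is optional; it only matters if you need consecutive keys afterwards.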
Related
I am building a web scraper using the Simple HTML DOM parser. However, I ran into some issues figuring out how to store the HTML elements of a web page as objects. I would like to take an input URL and turn all the HTML elements (tags, divs, fields, etc.) into objects that get printed on a page. I have written some code that works when I type in a URL, but the output is not what I am trying to achieve. My code so far is below, and I am looking for a way to achieve this.
I have tried finding all images and links as well as creating a DOM object. I can't seem to figure out how to convert these elements into objects that I can use to learn more about a website, and possibly store that data into a database.
<?php
require('simple_html_dom.php');
// Create DOM from URL or file
$url = $_POST["url"];
$html = file_get_html($url);
echo $html;
// Find all images
$element = new simple_html_dom();
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
$element = new simple_html_dom();
foreach($html->find('a') as $element)
echo $element->href . '<br>';
// Create a DOM object
$html = new simple_html_dom();
// Load HTML from a URL
$html->load_file($url);
echo $html;
?>
I am expecting an output of objects, but I am instead getting an actual output of images and links on a web page.
<?php
require('simple_html_dom.php');
// Create DOM from URL or file
// $url = $_POST["url"];
$url = 'Your-Url'; // Your url: 'www.example.com'
$html = file_get_html($url);
// Find all images
$images = []; //create empty images array
foreach($html->find('img') as $element){
$images[] = $element->src . '<br>'; //Store the found elements in the images array
}
echo '<pre>Output $images: '; var_dump($images); echo '</pre>'; //An output from the images array
// Find all links
$links = []; //create empty links array
foreach($html->find('a') as $element){
$links[] = $element->href . '<br>'; //Store the found elements in the links array
}
echo '<pre>Output $links: '; var_dump($links); echo '</pre>'; //An output from the links array
The echo statements display the arrays filled with the src and href values of the img and a tags on your page.
<span class="contact-seller-name">Enda</span>
Now I want to echo 'Enda', the text inside this span tag, using PHP.
Here's my PHP code:
$url="http://website.example.com";
$html = file_get_html( $url );
$value = $html->find('span.contact-seller-name');
echo $value->innertext;
From their documentation it looks like find returns an array of found values matching filter parameters:
From:
http://simplehtmldom.sourceforge.net/
Code:
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
They also provide another example for getting a specific element:
$html->find('div[id=hello]', 0)->innertext = 'foo';
So my guess would be that something like this will get you what you want:
$value = $html->find('span.contact-seller-name', 0);
echo $value->innertext;
By adding the 0 as a parameter it returns the first found instance of that filter.
Take a look at their API here:
http://simplehtmldom.sourceforge.net/manual_api.htm
It describes what the find method returns (an array of element objects, or a single element object if the second parameter is given).
Then using any of the provided methods for the element object you can get the desired text.
Full working example tested on a live site:
$url = "http://fleeceandthankyou.org/";
$html = file_get_html($url);
$value = $html->find('span.givecamp-header-wide', 0);
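// find() with an index returns null when nothing matches, and reading a
// property of null raises a notice or warning rather than an exception,
// so check for null instead of using try/catch
if ($value !== null)
{
echo $value->innertext;
}
else
{
echo "Couldn't find the element";
}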
I'm parsing HTML from a page to get a list of the outgoing links. I want to split them in two: the ones with a rel="nofollow" / rel="nofollow me" / rel="me nofollow" attribute and the ones without those values.
At the moment I'm using the code below, parsed with the PHP Simple HTML DOM Parser:
$html = file_get_html("$url");
foreach($html->find('a') as $element) {
echo $element->href; // THE LINK
}
but I'm not quite sure how to implement it. Any ideas?
Try using something like this:
$html = file_get_html("$url");
// Creating array for storing links
$arrayLinks = array(
"nofollow" => array(),
"others" => array()
);
foreach($html->find('a') as $element) {
// Search for the "nofollow" expression case-insensitively (i flag)
if(preg_match('#nofollow#i', $element->rel)) {
$arrayLinks["nofollow"][] = $element->href;
}
else {
$arrayLinks["others"][] = $element->href;
}
}
// Display the array
echo "<pre>";
print_r($arrayLinks);
echo "</pre>";
Do a regexp on $element->rel I guess
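A plain substring regexp on rel would also match values that merely contain the letters "nofollow". Since rel holds a space-separated list of tokens, one alternative is to split it and compare whole tokens. A minimal sketch, where has_nofollow is a hypothetical helper name and the sample strings stand in for $element->rel:

```php
<?php
// Hypothetical helper: true when the rel value contains the token "nofollow".
function has_nofollow($rel)
{
    // Cast to string so a missing attribute (false or null) becomes '',
    // then split on whitespace and drop empty pieces.
    $tokens = preg_split('/\s+/', strtolower(trim((string) $rel)), -1, PREG_SPLIT_NO_EMPTY);
    return in_array('nofollow', $tokens, true);
}

var_dump(has_nofollow('nofollow'));
var_dump(has_nofollow('me NoFollow'));
var_dump(has_nofollow('follow'));
var_dump(has_nofollow(false));
```

Inside the loop this would be used as if (has_nofollow($element->rel)) in place of the preg_match call.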
I'm trying to learn the simple_html_dom syntax, but I'm not having much luck. Could someone show me an example based on this:
<div id="container">
<span>Apples</span>
<span>Oranges</span>
<span>Bananas</span>
</div>
I just want to return the values Apples, Oranges and Bananas.
Can I simply use the PHP simple_html_dom class, or will I also have to use xcode, curl, etc.?
UPDATE:
I was able to get this to work, but I'm not convinced it's the most efficient way of getting what I need:
foreach ($html->find('div[id=cont]') as $div);
foreach($div->find('span') as $element)
echo $element->innertext . '<br>';
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
Your suggestion is correct:
foreach ($html->find('div[id=cont]') as $div);
foreach($div->find('span') as $element)
echo $element->innertext . '<br>';
More simply:
foreach($html->find('div#container span') as $element)
echo $element->innerText();
That means any span that descends from a div with id: container
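For comparison, the same descendant lookup can be sketched without simple_html_dom at all. This is a swapped-in technique using PHP's bundled DOM extension (DOMDocument and DOMXPath), not the library discussed above, and it assumes the markup is available as a string:

```php
<?php
// Sample markup from the question, inlined as a string.
$markup = '<div id="container"><span>Apples</span><span>Oranges</span><span>Bananas</span></div>';

$doc = new DOMDocument();
$doc->loadHTML($markup);

$xpath = new DOMXPath($doc);
$values = [];
// Select every span that is a descendant of the div with id "container".
foreach ($xpath->query('//div[@id="container"]//span') as $span) {
    $values[] = $span->textContent;
}

print_r($values);
```

The XPath expression plays the same role as the 'div#container span' selector.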
I am trying to build a simple PHP crawler. For this purpose I am getting the contents of a web page using
http://simplehtmldom.sourceforge.net/
After getting the page data, I process the page as below:
include('simplehtmldom/simple_html_dom.php');
$html = file_get_html('http://www.mypage.com');
foreach($html->find('a') as $e)
echo $e->href . '<br>';
This works perfectly and prints all links on that page.
I only want to get URLs like
/view.php?view=open&id=
I have written a function for this purpose:
function starts_text_with($s, $prefix){
return strpos($s, $prefix) === 0;
}
and I use this function as follows:
include('simplehtmldom/simple_html_dom.php');
$html = file_get_html('http://www.mypage.com');
foreach($html->find('a') as $e) {
if (starts_text_with($e->href, "/view.php?view=open&id=")))
echo $e->href . '<br>';
}
but nothing is returned.
I hope you understand what I need.
I need to print only the URLs that match that criterion.
Thanks
include('simplehtmldom/simple_html_dom.php');
$html = file_get_html('http://www.mypage.com');
foreach($html->find('a') as $e) {
if (preg_match('#view\.php\?view=open&id=#', $e->href))
echo $e->href . '<br>';
}
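Try this.
See the preg_match documentation; note that the pattern needs delimiters and comes first, followed by the string to search.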
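Because the text being searched for is a fixed string, a regular expression is not strictly needed. A minimal sketch using strpos as a "contains" check, assuming made-up hrefs in place of $e->href. A strict prefix check (=== 0) would skip the absolute URL in this sample, while the contains check keeps it:

```php
<?php
// Hypothetical hrefs standing in for the $e->href values.
$hrefs = [
    '/view.php?view=open&id=12',
    '/about.php',
    'http://www.mypage.com/view.php?view=open&id=7',
];

$needle = 'view.php?view=open&id=';
$matches = [];
foreach ($hrefs as $href) {
    // strpos returns false when the needle is absent and 0 when it is at the
    // very start, so the comparison must be the strict !== false.
    if (strpos($href, $needle) !== false) {
        $matches[] = $href;
    }
}

print_r($matches);
```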