How to remove duplicate values from DOM output?

How to remove duplicate values from DOM output? - php

I have implemented a simple code around HTML DOM script. The output I am receiving has many duplicate values. How to remove these values? I tried array_unique, but it doesn't work. Any pointers?
Below is the code;
$path = 'https://www.example.com';
$html = new simple_html_dom();
$url = $html->load_file($path);
if (!empty($url))
$html->load_file($url);
$result = array();
$result1 = array_unique($result);
foreach($html->find('a') as $a){
$href = $a->href;
if (strpos($href, '://')==false)
{
$result1[] = $href;
echo $href;
echo'<br />';
}
}

Two problems.
1. You are using array_unique too early.
2. You are echo-ing all matches. (While also adding them to an unused array.)
Here's the relevant part of your code, fixed:
$result = array();
// THIS WILL ONLY FILTER AN EMPTY ARRAY == DO NOTHING:
// $result1 = array_unique($result);
foreach($html->find('a') as $a){
$href = $a->href;
// Assuming you want to match this; then use !== false
if (strpos($href, '://') !== false) {
$result[] = $href;
// If you output, you can't filter for unique values. Ditch this:
// echo $href;
// echo'<br />';
}
}
$result = array_unique($result);
var_dump($result);
Then, do whatever you wish with the unique values, including implode('<br>', $result); if your desired output is unique links separated by line breaks.

Related

Check every URL in string to remove links of certain sites

I want to remove URLs of certain sites within a string
I used this:
<?php
$URLContent = '<p>Google</p><p>AnotherSite</p>';
$LinksToRemove = array('google.com', 'yahoo.com', 'msn.com');
$LinksToCheck = in_array('google.com' , $LinksToRemove);
if (strpos($URLContent, $LinksToCheck) !== 0) {
$URLContent = preg_replace('#<a.*?>([^>]*)</a>#i', '$1', $URLContent);
}
echo $URLContent;
?>
In this example, I want to remove URLs of google.com, yahoo.com and msn.com websites only if any of them found in string $URLContent, but keep any other links.
The result of the previous code is:
<p>Google</p><p>AnotherSite</p>
but I want it to be:
<p>Google</p><p>AnotherSite</p>

One solution would be to explode your $URLContent and compare for each value in $LinksToCheck.
It could be like this :
<?php
$URLContent = '<p>Google</p><p>AnotherSite</p>';
$urlList = explode('</p>', $URLContent);
$LinksToRemove = array('google.com', 'yahoo.com', 'msn.com');
$urlFormat = [];
foreach ($urlList as $url) {
foreach ($LinksToRemove as $link) {
if (str_contains($url, $link)) {
$url = '<p>' . ucfirst(str_replace('.com', '', $link)) . '</p>';
break;
}
}
$urlFormat[] = $url;
}
$result = implode('', $urlFormat);

simple_html_dom - read html page, two arrays

This is my entire code
// include the scrapper
include('simple_html_dom.php');
// connect the page for scrapping
$html = file_get_html('http://www.niagarafallsreview.ca/news/local');
// make empty arrays
$headlines = array();
$links = array();
// look for 'h' headings on page
foreach($html->find('h1') as $header) {
$headlines[] = $header->plaintext;
}
// look for 'a' links that start with 'http://www.niagarafallsreview.ca/2016/04/'
foreach($html->find('a[href^="http://www.niagarafallsreview.ca/2016/04/"]') as $link) {
$links[] = $link->href;
}
// trim the headlines because one on top and bottom were not needed
$output = array_slice($headlines, 1, -1);
// for each header output a nice list of the headers
foreach ($output as $headers){
echo "< a href='#'>$headers</a>" . "<br />";
}
// make sure the links are unique and no doubles are found
$result = array_unique($links);
// for each link output it in a nice list
foreach ($result as $linkk){
echo "<a href='$linkk'>$linkk</a>" . "<br />";
}
this code will produce the headings in a nice list, and will also produce a nice list of the links.
My problem is that i need to combine them, i would like the $header to be the text of the href, and the link in the href to be the $linkk
like this..
< a href ='$linkk'>$headers</a>
I dont know how to do this as i have two foreach statements. I tried to combine them but i was unsuccessful.
Any help will be greatly appreciated.
Thanks.

Try this:
// include the scrapper
include('simple_html_dom.php');
// connect the page for scrapping
$html = file_get_html('http://www.niagarafallsreview.ca/news/local');
// make empty arrays
$headlines = array();
$links = array();
// look for 'h' headings on page
foreach($html->find('h1') as $header) {
$headlines[] = $header->plaintext;
}
// look for 'a' links that start with 'http://www.niagarafallsreview.ca/2016/04/'
foreach($html->find('a[href^="http://www.niagarafallsreview.ca/2016/04/"]') as $link) {
$links[] = $link->href;
}
// trim the headlines because one on top and bottom were not needed
$output = array_slice($headlines, 1, -1);
// make sure the links are unique and no doubles are found
$result = array_unique($links);
// for each link output it in a nice list
foreach ($result as $i=>$linkk) {
$headline = isset($output[$i]) ? $output[$i] : '(empty)';
echo "<a href='$linkk'>$headline</a>" . "<br />";
}

Here is the foreach you are looking for:
foreach($output as $i=>$headers) {
$linkk = $result[$i];
echo "< a href='$linkk'>$headers</a>" . "<br />";
}
This assumes the arrays have the same length and also the correct order.

PHP Regex or DOMDocument for Matching & Removing URLs?

I'm trying to extract links from html page using DOM:
$html = file_get_contents('links.html');
$DOM = new DOMDocument();
$DOM->loadHTML($html);
$a = $DOM->getElementsByTagName('a');
foreach($a as $link){
//echo out the href attribute of the <A> tag.
echo $link->getAttribute('href').'<br/>';
}
Output:
http://dontwantthisdomain.com/dont-want-this-domain-name/
http://dontwantthisdomain2.com/also-dont-want-any-pages-from-this-domain/
http://dontwantthisdomain3.com/dont-want-any-pages-from-this-domain/
http://domain1.com/page-X-on-domain-com.html
http://dontwantthisdomain.com/dont-want-link-from-this-domain-name.html
http://dontwantthisdomain2.com/dont-want-any-pages-from-this-domain/
http://domain.com/page-XZ-on-domain-com.html
http://dontwantthisdomain.com/another-page-from-same-domain-that-i-dont-want-to-be-included/
http://dontwantthisdomain2.com/same-as-above/
http://domain3.com/page-XYZ-on-domain3-com.html
I would like to remove all results matching dontwantthisdomain.com, dontwantthisdomain2.com and dontwantthisdomain3.com so the output will looks like that:
http://domain1.com/page-X-on-domain-com.html
http://domain.com/page-XZ-on-domain-com.html
http://domain3.com/page-XYZ-on-domain3-com.html
Some people saying I should not use regex for html and others that it's ok. Could somebody point the best way how I can remove unwanted urls from my html file? :)

Maybe something like this:
function extract_domains($buffer, $whitelist) {
preg_match_all("#<a\s+.*?href=\"(.+?)\".*?>(.+?)</a>#i", $buffer, $matches);
$result = array();
foreach($matches[1] as $url) {
$url = urldecode($url);
$parts = #parse_url((string) $url);
if ($parts !== false && in_array($parts['host'], $whitelist)) {
$result[] = $parts['host'];
}
}
return $result;
}
$domains = extract_domains(file_get_contents("/path/to/html.htm"), array('stackoverflow.com', 'google.com', 'sub.example.com')));
It does a rough match on the all the <a> with href=, grabs what's between the quotes, then filters it based on your whitelist of domains.

None regex solution (without potential errors :-) :
$html='
http://dontwantthisdomain.com/dont-want-this-domain-name/
http://dontwantthisdomain2.com/also-dont-want-any-pages-from-this-domain/
http://dontwantthisdomain3.com/dont-want-any-pages-from-this-domain/
http://domain1.com/page-X-on-domain-com.html
http://dontwantthisdomain.com/dont-want-link-from-this-domain-name.html
http://dontwantthisdomain2.com/dont-want-any-pages-from-this-domain/
http://domain.com/page-XZ-on-domain-com.html
http://dontwantthisdomain.com/another-page-from-same-domain-that-i-dont-want-to-be-included/
http://dontwantthisdomain2.com/same-as-above/
http://domain3.com/page-XYZ-on-domain3-com.html
';
$html=explode("\n", $html);
$dontWant=array('dontwantthisdomain.com','dontwantthisdomain2.com','dontwantthisdomain3.com');
foreach ($html as $link) {
$ok=true;
foreach($dontWant as $notWanted) {
if (strpos($link, $notWanted)>0) {
$ok=false;
}
if (trim($link=='')) $ok=false;
}
if ($ok) $final_result[]=$link;
}
echo '<pre>';
print_r($final_result);
echo '</pre>';
outputs
Array
(
[0] => http://domain1.com/page-X-on-domain-com.html
[1] => http://domain.com/page-XZ-on-domain-com.html
[2] => http://domain3.com/page-XYZ-on-domain3-com.html
)

Return in array

I have these php lines:
<?php
$start_text = '<username="';
$end_text = '" userid=';
$source = file_get_contents('http://mysites/users.xml');
$start_pos = strpos($source, $start_text) + strlen($start_text);
$end_pos = strpos($source, $end_text) - $start_pos;
$found_text = substr($source, $start_pos, $end_pos);
echo $found_text;
?>
I want to see just the names from entire file, but it shows me just the first name. I want to see all names.
I think it is something like: foreach ($found_text as $username).... but here I am stuck.
Update from OP post, below:
<?php
$xml = simplexml_load_file("users.xml");
foreach ($xml->children() as $child)
{
foreach($child->attributes() as $a => $b)
{
echo $a,'="',$b,"\"</br>";
}
foreach ($child->children() as $child2)
{
foreach($child2->attributes() as $c => $d)
{
echo "<font color='red'>".$c,'="',$d,"\"</font></br>";
}
}
}
?>
with this code, i receive all details about my users, but from all these details i want to see just 2 or 3
Now i see :
name="xxx"
type="default"
can_accept="true"
can_cancel="false"
image="avatars/trophy.png"
title="starter"
........etc
Another details from the same user "Red color(defined on script)"
reward_value="200"
reward_qty="1"
expiration_date="12/07/2012"
.....etc
what i want to see?
i.e first line from first column "name="xxx" & expiration_date="12/07/2012" from second column

You will need to repeat the loop, using the 3rd parameter, offset, of the strpos function. That way, you can look for a new name each time.
Something like this (untested)
<?php
$start_text = '<username="';
$end_text = '" userid=';
$source = file_get_contents('http://mysites/users.xml');
$offset = 0;
while (false !== ($start_pos = strpos($source, $start_text, $offset)))
{
$start_pos += strlen($start_text);
$end_pos = strpos($source, $end_text, $offset);
$offset = $end_pos;
$text_length = $end_pos - $start_pos;
$found_text = substr($source, $start_pos, $text_length);
echo $found_text;
}
?>

You should either use XMLReader or DOM or SimpleXML to read XML files. If you don't see the necessity, try the following regular expressions approach to retrieve all usernames:
<?php
$xml = '<xml><username="hello" userid="123" /> <something /> <username="foobar" userid="333" /></xml>';
if (preg_match_all('#<username="(?<name>[^"]+)"#', $xml, $matches, PREG_PATTERN_ORDER)) {
var_dump($matches['name']);
} else {
echo 'no <username="" found';
}

PHP Simple DOM Parser to Scrape From Multiple URLs

Is it possible to use a foreach loop to scrape multiple URL's from an array? I've been trying but for some reason it will only pull from the first URL in the array and the show the results.
include_once('../../simple_html_dom.php');
$link = array (
'http://www.amazon.com/dp/B0038JDEOO/',
'http://www.amazon.com/dp/B0038JDEM6/',
'http://www.amazon.com/dp/B004CYX17O/'
);
foreach ($link as $links) {
function scraping_IMDB($links) {
// create HTML DOM
$html = file_get_html($links);
$values = array();
foreach($html->find('input') as $element) {
$values[$element->id=='ASIN'] = $element->value; }
// get title
$ret['ASIN'] = end($values);
// get rating
$ret['Name'] = $html->find('h1[class="parseasinTitle"]', 0)->innertext;
$ret['Retail'] =$html->find('b[class="priceLarge"]', 0)->innertext;
// clean up memory
//$html->clear();
// unset($html);
return $ret;
}
// -----------------------------------------------------------------------------
// test it!
$ret = scraping_IMDB($links);
foreach($ret as $k=>$v)
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
Here is the code since the comment part didn't work. :) It's very dirty because I just edited one of the examples to play with it to see if I could get it to do what I wanted.

include_once('../../simple_html_dom.php');
function scraping_IMDB($links) {
// create HTML DOM
$html = file_get_html($links);
// What is this spaghetti code good for?
/*
$values = array();
foreach($html->find('input') as $element) {
$values[$element->id=='ASIN'] = $element->value;
}
// get title
$ret['ASIN'] = end($values);
*/
foreach($html->find('input') as $element) {
if($element->id == 'ASIN') {
$ret['ASIN'] = $element->value;
}
}
// Our you could use the following instead of the whole foreach loop above
//
// $ret['ASIN'] = $html->find('input[id="ASIN"]', 0)->value;
//
// if the 0 means, return first found or something similar,
// I just had a look at Amazons source code, and it contains
// 2 HTML tags with id='ASIN'. If they were following html-regulations
// then there should only be ONE element with a specific id.
// get rating
$ret['Name'] = $html->find('h1[class="parseasinTitle"]', 0)->innertext;
$ret['Retail'] = $html->find('b[class="priceLarge"]', 0)->innertext;
// clean up memory
//$html->clear();
// unset($html);
return $ret;
}
// -----------------------------------------------------------------------------
// test it!
$links = array (
'http://www.amazon.com/dp/B0038JDEOO/',
'http://www.amazon.com/dp/B0038JDEM6/',
'http://www.amazon.com/dp/B004CYX17O/'
);
foreach ($links as $link) {
$ret = scraping_IMDB($link);
foreach($ret as $k=>$v) {
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
}
This should do the trick
I have renamed the array to 'links' instead of 'link'. It's an array of links, containing link(s), therefore, foreach($link as $links) seemed wrong, and I changed it to foreach($links as $link)

I really need to ask this question as it will answer way more questions after the world reads this thread. What if ... you used articles like the simple html dom site.
$ret['Name'] = $html->find('h1[class="parseasinTitle"]', 0)->innertext;
$ret['Retail'] = $html->find('b[class="priceLarge"]', 0)->innertext;
return $ret;
}
$links = array (
'http://www.amazon.com/dp/B0038JDEOO/',
'http://www.amazon.com/dp/B0038JDEM6/',
'http://www.amazon.com/dp/B004CYX17O/'
);
foreach ($links as $link) {
$ret = scraping_IMDB($link);
foreach($ret as $k=>$v) {
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
}
what if its $articles?
$articles[] = $item;
}
//print_r($articles);
$links = array (
'http://link1.com',
'http://link2.com',
'http://link3.com'
);
what would this area look like?
foreach ($links as $link) {
$ret = scraping_IMDB($link);
foreach($ret as $k=>$v) {
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
}
Ive seen this multiple links all over stackoverflow for past 2 years, and I still cannot figure it out. Would be great to get the basic handle on it to how the simple html dom examples are.
thx.
First time postin im sure I broke a bunch of rules and didnt do the code section right. I just had to ask this question badly.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

How to remove duplicate values from DOM output? - php

Related

Check every URL in string to remove links of certain sites

simple_html_dom - read html page, two arrays

PHP Regex or DOMDocument for Matching & Removing URLs?

Return in array

PHP Simple DOM Parser to Scrape From Multiple URLs

Categories

Resources