Adjust code Retrieve the DOM from a given URL - php

Hello,
I'm using the following code to retrieve the DOM from a URL, find all "A" tags and print their HREFs.
My output currently contains "A" links I don't want. The output is here:
http://trend.remal.com/parsing.php
Some elements are duplicated.
I need to reduce the output to only the "A" tags whose HREF contains https://twitter.com/$namehere
As you can see, there are two kinds of URLs; I need only the Twitter URLs, without duplicates.
Any tips on how to adjust the code?
<?php
include('simple_html_dom.php');
$html = file_get_html('http://tweepar.com/sa/1/');
foreach($html->find('a') as $e)
echo $e->href . '<br>';
?>

$urls = array();
foreach ( $html->find('a') as $e )
{
// If it's a twitter link
if ( strpos($e->href, '://twitter.com/') !== false )
{
// and we don't have it in the array yet
if ( ! in_array($e->href, $urls) )
{
// add it to our array
$urls[] = $e->href;
}
}
}
echo implode('<br>', $urls);
Here are some references from the PHP docs:
strpos
in_array
implode
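As a variation on the loop above, here is a minimal sketch that checks the host with parse_url instead of searching the whole string, and removes repeats with array_unique. The $hrefs sample and the is_twitter_link helper are assumptions made for illustration; in the real script the values would come from $e->href.

```php
<?php
// Hypothetical sample of href values, standing in for $e->href from the parser.
$hrefs = array(
    'https://twitter.com/alice',
    'http://tweepar.com/sa/1/',
    'https://twitter.com/bob',
    'https://twitter.com/alice',
);

// Illustrative helper (not part of simple_html_dom): true when the host is twitter.com.
function is_twitter_link($href)
{
    $host = parse_url($href, PHP_URL_HOST);
    return is_string($host) && strtolower($host) === 'twitter.com';
}

// array_filter keeps the matching links, array_unique drops repeats,
// array_values renumbers the keys from 0.
$twitterUrls = array_values(array_unique(array_filter($hrefs, 'is_twitter_link')));

echo implode("\n", $twitterUrls), "\n";
```

Comparing the host rather than using strpos avoids matching a Twitter address that only appears inside another site's query string.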

Related

Pass all results from a foreach loop to a new variable

I have used the following little bit of code to find all links on a page (home.php) and echoed them as URLs. It works fine, but how do I pass the results to a new variable? If I create a new variable:
$myvariable ="$element->href";
This only echos the last result of many.
// Create DOM from URL or file
$html = file_get_html('http://www.somewebsite.xxx/include/home.php');
foreach($html->find('a') as $element)
echo $element->href . '<br>';
Concatenate with a String Operator:
$myvar = '';
foreach($html->find('a') as $element) {
$myvar .= $element->href . '<br>';
}
Or use an Array:
$myvar = array(); // start from an empty array
foreach($html->find('a') as $element) {
$myvar[] = $element->href; // removed <br> for implode, you can add it back
}
// if you want the array as one string
$myvar = implode('<br>', $myvar);
Use an array:
// Create DOM from URL or file
$html = file_get_html('http://www.somewebsite.xxx/include/home.php');
$urls = array();
foreach($html->find('a') as $element) {
$urls[] = $element->href;
}
print_r($urls);
You could use an array to hold the values of all the links from the page in question. In the end, the array is the variable you are looking for. Here's how:
<?php
//USE THE HTML DOM PARSER TO PARSE ALL THE HTML DATA ON THE PAGE: $page
$page = 'http://www.somewebsite.xxx/include/home.php';
$html = file_get_html($page);
// LOOPING THROUGH THE DOM ELEMENTS SELECT ONLY THE <a> TAGS
// AND BUNDLE THEM INTO AN ARRAY...
// THE ARRAY NOW FORMS THE VARIABLE YOU HAD EXPECTED TO CREATE..
$arrAnchors = array(); // INITIALIZE $arrAnchors TO AN EMPTY ARRAY...
foreach($html->find('a') as $element) {
// PUSH ALL THE ANCHOR'S HREF ATTRIBUTES (URLs) INTO THE $arrAnchors ARRAY
$arrAnchors[] = $element->href . '<br>';
}
// NOW TRY TO DUMP THE CONTENT OF YOUR $arrAnchors....
var_dump($arrAnchors); // DISPLAYS A NUMERICALLY INDEXED ARRAY OF LINKS ON THE PAGE: $page
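To show why a plain assignment keeps only the last link while an array keeps them all, here is a small self-contained sketch. The $elements array is a stand-in I made up for the objects returned by $html->find('a'), so it runs without the parser or a network connection.

```php
<?php
// Stand-ins for the elements returned by $html->find('a'); each has an href property.
$elements = array(
    (object) array('href' => '/home.php'),
    (object) array('href' => '/about.php'),
    (object) array('href' => '/contact.php'),
);

$last = '';
$urls = array();
foreach ($elements as $element) {
    $last   = $element->href; // overwritten on every pass, so only the final link survives
    $urls[] = $element->href; // appended on every pass, so every link is kept
}

// The array is the "new variable"; join it only when a single string is needed.
$asString = implode('<br>', $urls);

echo $last, "\n";
echo $asString, "\n";
```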

PHP Regex or DOMDocument for Matching & Removing URLs?

I'm trying to extract links from html page using DOM:
$html = file_get_contents('links.html');
$DOM = new DOMDocument();
$DOM->loadHTML($html);
$a = $DOM->getElementsByTagName('a');
foreach($a as $link){
//echo out the href attribute of the <A> tag.
echo $link->getAttribute('href').'<br/>';
}
Output:
http://dontwantthisdomain.com/dont-want-this-domain-name/
http://dontwantthisdomain2.com/also-dont-want-any-pages-from-this-domain/
http://dontwantthisdomain3.com/dont-want-any-pages-from-this-domain/
http://domain1.com/page-X-on-domain-com.html
http://dontwantthisdomain.com/dont-want-link-from-this-domain-name.html
http://dontwantthisdomain2.com/dont-want-any-pages-from-this-domain/
http://domain.com/page-XZ-on-domain-com.html
http://dontwantthisdomain.com/another-page-from-same-domain-that-i-dont-want-to-be-included/
http://dontwantthisdomain2.com/same-as-above/
http://domain3.com/page-XYZ-on-domain3-com.html
I would like to remove all results matching dontwantthisdomain.com, dontwantthisdomain2.com and dontwantthisdomain3.com so the output will look like this:
http://domain1.com/page-X-on-domain-com.html
http://domain.com/page-XZ-on-domain-com.html
http://domain3.com/page-XYZ-on-domain3-com.html
Some people say I should not use regex for HTML and others say it's OK. Could somebody point out the best way to remove unwanted URLs from my HTML file? :)
Maybe something like this:
function extract_domains($buffer, $whitelist) {
preg_match_all("#<a\s+.*?href=\"(.+?)\".*?>(.+?)</a>#i", $buffer, $matches);
$result = array();
foreach($matches[1] as $url) {
$url = urldecode($url);
$parts = @parse_url((string) $url);
// parse_url returns false on badly malformed URLs and may omit 'host'
if (is_array($parts) && isset($parts['host']) && in_array($parts['host'], $whitelist)) {
$result[] = $parts['host'];
}
}
return $result;
}
$domains = extract_domains(file_get_contents("/path/to/html.htm"), array('stackoverflow.com', 'google.com', 'sub.example.com'));
It does a rough match on all the <a> tags with href=, grabs what's between the quotes, then filters it based on your whitelist of domains.
Non-regex solution (without potential errors :-) :
$html='
http://dontwantthisdomain.com/dont-want-this-domain-name/
http://dontwantthisdomain2.com/also-dont-want-any-pages-from-this-domain/
http://dontwantthisdomain3.com/dont-want-any-pages-from-this-domain/
http://domain1.com/page-X-on-domain-com.html
http://dontwantthisdomain.com/dont-want-link-from-this-domain-name.html
http://dontwantthisdomain2.com/dont-want-any-pages-from-this-domain/
http://domain.com/page-XZ-on-domain-com.html
http://dontwantthisdomain.com/another-page-from-same-domain-that-i-dont-want-to-be-included/
http://dontwantthisdomain2.com/same-as-above/
http://domain3.com/page-XYZ-on-domain3-com.html
';
$html=explode("\n", $html);
$dontWant=array('dontwantthisdomain.com','dontwantthisdomain2.com','dontwantthisdomain3.com');
$final_result=array();
foreach ($html as $link) {
$ok=true;
foreach($dontWant as $notWanted) {
// !== false also catches a match at position 0
if (strpos($link, $notWanted) !== false) {
$ok=false;
}
// skip empty lines
if (trim($link)=='') $ok=false;
}
if ($ok) $final_result[]=$link;
}
echo '<pre>';
print_r($final_result);
echo '</pre>';
outputs
Array
(
[0] => http://domain1.com/page-X-on-domain-com.html
[1] => http://domain.com/page-XZ-on-domain-com.html
[2] => http://domain3.com/page-XYZ-on-domain3-com.html
)
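The same filtering can also be sketched by comparing hosts with parse_url rather than searching for substrings. The filter_links helper and the shortened $links sample below are assumptions for illustration only; note that this version also drops any line that has no host at all, such as a relative link.

```php
<?php
// Hosts to drop (taken from the question) and a shortened sample of links.
$dontWant = array('dontwantthisdomain.com', 'dontwantthisdomain2.com', 'dontwantthisdomain3.com');
$links = array(
    'http://dontwantthisdomain.com/dont-want-this-domain-name/',
    'http://domain1.com/page-X-on-domain-com.html',
    'http://dontwantthisdomain2.com/same-as-above/',
    'http://domain3.com/page-XYZ-on-domain3-com.html',
);

// Hypothetical helper: keep a link only if it has a host and that host is not blocked.
function filter_links(array $links, array $blockedHosts)
{
    $kept = array();
    foreach ($links as $link) {
        $link = trim($link);
        $host = parse_url($link, PHP_URL_HOST);
        if (is_string($host) && !in_array(strtolower($host), $blockedHosts, true)) {
            $kept[] = $link;
        }
    }
    return $kept;
}

print_r(filter_links($links, $dontWant));
```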

Adjust PHP code Retrieve the DOM from a given URL

Hello,
I'm using the following code to retrieve the DOM from a URL, find all "A" tags and print their HREFs.
My output currently contains "A" links I don't want. The output is here:
http://trend.remal.com/parsing.php
I need to reduce the output to only the name after http://twitter.com/namehere
so that the output prints a list of "namehere" values.
include('simple_html_dom.php');
// Retrieve the DOM from a given URL
$html = file_get_html('http://tweepar.com/sa/1/');
$urls = array();
foreach ( $html->find('a') as $e )
{
// If it's a twitter link
if ( strpos($e->href, '://twitter.com/') !== false )
{
// and we don't have it in the array yet
if ( ! in_array($e->href, $urls) ) // needle first, then the array to search
{
// add it to our array
$urls[] = $e->href;
}
}
}
echo implode('<br>', $urls);
echo $e->href . '<br>';
Instead of simply using $urls[] = $e->href, use a regex to match the username:
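// escape the dot, stop at the next "/", "?" or "#", and only store a name when the pattern matched
if ( preg_match('~twitter\.com/([^/?#]+)~', $e->href, $matches) )
{
$urls[] = $matches[1];
}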
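If you would rather avoid a regex here, a possible sketch is to take the first path segment from parse_url. The twitter_username helper and the sample links are made up for illustration; they are not part of simple_html_dom.

```php
<?php
// Hypothetical helper: return the first path segment of a twitter.com URL, or null.
function twitter_username($href)
{
    $parts = parse_url($href);
    if (!is_array($parts) || !isset($parts['host'], $parts['path'])) {
        return null;
    }
    if (strtolower($parts['host']) !== 'twitter.com') {
        return null;
    }
    $segments = explode('/', trim($parts['path'], '/'));
    return $segments[0] !== '' ? $segments[0] : null;
}

// Sample hrefs standing in for $e->href values.
$sample = array(
    'https://twitter.com/alice',
    'http://twitter.com/bob/',
    'https://twitter.com/alice?lang=en',
    'http://tweepar.com/sa/1/',
);

$names = array();
foreach ($sample as $href) {
    $name = twitter_username($href);
    // Skip non-Twitter links and names we already have.
    if ($name !== null && !in_array($name, $names, true)) {
        $names[] = $name;
    }
}

echo implode("\n", $names), "\n";
```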

PHP Simple DOM Parser to Scrape From Multiple URLs

Is it possible to use a foreach loop to scrape multiple URLs from an array? I've been trying, but for some reason it will only pull from the first URL in the array and then show the results.
include_once('../../simple_html_dom.php');
$link = array (
'http://www.amazon.com/dp/B0038JDEOO/',
'http://www.amazon.com/dp/B0038JDEM6/',
'http://www.amazon.com/dp/B004CYX17O/'
);
foreach ($link as $links) {
function scraping_IMDB($links) {
// create HTML DOM
$html = file_get_html($links);
$values = array();
foreach($html->find('input') as $element) {
$values[$element->id=='ASIN'] = $element->value; }
// get title
$ret['ASIN'] = end($values);
// get rating
$ret['Name'] = $html->find('h1[class="parseasinTitle"]', 0)->innertext;
$ret['Retail'] =$html->find('b[class="priceLarge"]', 0)->innertext;
// clean up memory
//$html->clear();
// unset($html);
return $ret;
}
// -----------------------------------------------------------------------------
// test it!
$ret = scraping_IMDB($links);
foreach($ret as $k=>$v)
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
Here is the code since the comment part didn't work. :) It's very dirty because I just edited one of the examples to play with it to see if I could get it to do what I wanted.
include_once('../../simple_html_dom.php');
function scraping_IMDB($links) {
// create HTML DOM
$html = file_get_html($links);
// What is this spaghetti code good for?
/*
$values = array();
foreach($html->find('input') as $element) {
$values[$element->id=='ASIN'] = $element->value;
}
// get title
$ret['ASIN'] = end($values);
*/
foreach($html->find('input') as $element) {
if($element->id == 'ASIN') {
$ret['ASIN'] = $element->value;
}
}
// Or you could use the following instead of the whole foreach loop above
//
// $ret['ASIN'] = $html->find('input[id="ASIN"]', 0)->value;
//
// if the 0 means, return first found or something similar,
// I just had a look at Amazons source code, and it contains
// 2 HTML tags with id='ASIN'. If they were following html-regulations
// then there should only be ONE element with a specific id.
// get rating
$ret['Name'] = $html->find('h1[class="parseasinTitle"]', 0)->innertext;
$ret['Retail'] = $html->find('b[class="priceLarge"]', 0)->innertext;
// clean up memory
//$html->clear();
// unset($html);
return $ret;
}
// -----------------------------------------------------------------------------
// test it!
$links = array (
'http://www.amazon.com/dp/B0038JDEOO/',
'http://www.amazon.com/dp/B0038JDEM6/',
'http://www.amazon.com/dp/B004CYX17O/'
);
foreach ($links as $link) {
$ret = scraping_IMDB($link);
foreach($ret as $k=>$v) {
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
}
This should do the trick
I have renamed the array to 'links' instead of 'link'. It's an array of links, containing link(s), therefore, foreach($link as $links) seemed wrong, and I changed it to foreach($links as $link)
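The likely reason the original version stopped after the first URL is that the function was declared inside the loop body, so PHP tried to declare it again on the second pass and halted with a "Cannot redeclare" fatal error. A stripped-down sketch of the working pattern follows; scrape_product is a hypothetical stand-in that does no network access and just pulls the product ID out of the URL.

```php
<?php
// Hypothetical stand-in for the real scraper: no network access,
// it only returns the last path segment of the URL as the ASIN.
function scrape_product($url)
{
    $path = trim((string) parse_url($url, PHP_URL_PATH), '/');
    $segments = explode('/', $path);
    return array('ASIN' => end($segments), 'URL' => $url);
}

$links = array(
    'http://www.amazon.com/dp/B0038JDEOO/',
    'http://www.amazon.com/dp/B0038JDEM6/',
    'http://www.amazon.com/dp/B004CYX17O/',
);

// The function is declared once above; the loop only calls it.
$results = array();
foreach ($links as $link) {
    $results[] = scrape_product($link);
}

foreach ($results as $row) {
    echo $row['ASIN'], "\n";
}
```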
I really need to ask this question as it will answer way more questions after the world reads this thread. What if ... you used articles like the simple html dom site.
$ret['Name'] = $html->find('h1[class="parseasinTitle"]', 0)->innertext;
$ret['Retail'] = $html->find('b[class="priceLarge"]', 0)->innertext;
return $ret;
}
$links = array (
'http://www.amazon.com/dp/B0038JDEOO/',
'http://www.amazon.com/dp/B0038JDEM6/',
'http://www.amazon.com/dp/B004CYX17O/'
);
foreach ($links as $link) {
$ret = scraping_IMDB($link);
foreach($ret as $k=>$v) {
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
}
what if its $articles?
$articles[] = $item;
}
//print_r($articles);
$links = array (
'http://link1.com',
'http://link2.com',
'http://link3.com'
);
what would this area look like?
foreach ($links as $link) {
$ret = scraping_IMDB($link);
foreach($ret as $k=>$v) {
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
}
I've seen this multiple-links question all over Stack Overflow for the past 2 years, and I still cannot figure it out. It would be great to get a basic handle on it, in the style of the simple html dom examples.
Thanks.
This is my first time posting; I'm sure I broke a bunch of rules and didn't format the code section right. I just badly had to ask this question.

Need help with preg_replace interpreting {variables} with parameters

I want to replace
{youtube}Video_ID_Here{/youtube}
with the embed code for a youtube video.
So far I have
preg_replace('/{youtube}(.*){\/youtube}/iU',...)
and it works just fine.
But now I'd like to be able to interpret parameters like height, width, etc. Could I have one regex for this, whether it does or doesn't have parameters? It should be able to interpret all of these:
{youtube height="200px" width="150px" color1="#eee" color2="rgba(0,0,0,0.5)"}Video_ID_Here{/youtube}
{youtube height="200px"}Video_ID_Here{/youtube}
{youtube}Video_ID_Here{/youtube}
{youtube width="150px" showborder="1"}Video_ID_Here{/youtube}
Try this:
function createEmbed($videoID, $params)
{
// $videoID contains the videoID between {youtube}...{/youtube}
// $params is an array of key value pairs such as height => 200px
return 'HTML...'; // embed code
}
if (preg_match_all('/\{youtube(.*?)\}(.+?)\{\/youtube\}/', $string, $matches)) {
foreach ($matches[0] as $index => $youtubeTag) {
$params = array();
// break out the attributes
if (preg_match_all('/\s([a-z0-9]+)="([^\s]+?)"/', $matches[1][$index], $rawParams)) {
for ($x = 0; $x < count($rawParams[0]); $x++) {
$params[$rawParams[1][$x]] = $rawParams[2][$x];
}
}
// replace {youtube}...{/youtube} with embed code
$string = str_replace($youtubeTag, createEmbed($matches[2][$index], $params), $string);
}
}
This code matches the {youtube}...{/youtube} tags first and then splits out the attributes into an array, passing both them (as key/value pairs) and the video ID to a function. Just fill in the function definition to make it validate the params you want to support and build up the appropriate HTML code.
You probably want to use preg_replace_callback, as the replacing can get quite convoluted otherwise.
preg_replace_callback('/{youtube(.*)}(.*){\/youtube}/iU',...)
And in your callback, check $match[1] for something like the /(width|showborder|height|color1)="([^"]+)"/i pattern. A simple preg_match_all inside a preg_replace_callback keeps all portions nice & tidy and above all legible.
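A rough sketch of that callback approach might look like this. The render_youtube_tag name, the default sizes and the placeholder div markup are all assumptions for illustration; the real embed code is up to you.

```php
<?php
// Hypothetical callback: turn one {youtube ...}ID{/youtube} match into placeholder markup.
function render_youtube_tag(array $match)
{
    // $match[1] holds the raw attribute text, $match[2] the video ID.
    $params = array();
    $pattern = '/(width|height|showborder|color1|color2)="([^"]+)"/i';
    if (preg_match_all($pattern, $match[1], $pairs, PREG_SET_ORDER)) {
        foreach ($pairs as $pair) {
            $params[strtolower($pair[1])] = $pair[2];
        }
    }
    $width  = isset($params['width']) ? $params['width'] : '425px';   // assumed default
    $height = isset($params['height']) ? $params['height'] : '350px'; // assumed default

    // Placeholder output only; every value is escaped before it is printed.
    return sprintf(
        '<div class="youtube" data-id="%s" style="width:%s;height:%s"></div>',
        htmlspecialchars($match[2], ENT_QUOTES),
        htmlspecialchars($width, ENT_QUOTES),
        htmlspecialchars($height, ENT_QUOTES)
    );
}

$text = 'Intro {youtube height="200px"}Video_ID_Here{/youtube} outro';
$out = preg_replace_callback('/\{youtube(.*?)\}(.*?)\{\/youtube\}/is', 'render_youtube_tag', $text);

echo $out, "\n";
```

The same pattern handles tags with and without parameters, because the attribute group is allowed to be empty.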
I would do it something like this:
preg_match_all("/{youtube(.*?)}(.*?){\/youtube}/is", $content, $matches);
for($i=0;$i<count($matches[0]);$i++)
{
$params = $matches[1][$i];
$youtubeurl = $matches[2][$i];
$paramsout = array();
if(preg_match("/height\s*=\s*('|\")([0-9]+px)('|\")/i", $params, $match))
{
$paramsout[] = "height=\"{$match[2]}\"";
}
//process others
//setup new code
$tagcode = "<object ..." . implode(" ", $paramsout) ."... >"; //I don't know what the code is to display a youtube video
//replace original tag
$content = str_replace($matches[0][$i], $tagcode, $content);
}
You could just look for params after "{youtube" and before "}", but you open yourself up to XSS problems. The best way would be to look for a specific set of parameters and verify them. Don't allow things like < and > to be passed inside your tags, as someone could put do_something_nasty(); or something.
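One way to sketch that verification step is a whitelist of parameter names, each with a strict pattern for its value. The validate_youtube_params name and the specific rules below are assumptions for illustration; adjust them to the parameters you actually support.

```php
<?php
// Hypothetical validator: keep only known parameters whose values match a strict pattern.
function validate_youtube_params(array $params)
{
    // Assumed rules for illustration; \z anchors at the very end of the string.
    $rules = array(
        'height'     => '/^[0-9]{1,4}px\z/',
        'width'      => '/^[0-9]{1,4}px\z/',
        'showborder' => '/^[01]\z/',
        'color1'     => '/^#[0-9a-f]{3}(?:[0-9a-f]{3})?\z/i',
    );
    $clean = array();
    foreach ($params as $name => $value) {
        $name = strtolower($name);
        if (isset($rules[$name]) && preg_match($rules[$name], $value) === 1) {
            $clean[$name] = $value;
        }
    }
    return $clean;
}

$clean = validate_youtube_params(array(
    'height'  => '200px',
    'width'   => '150px"><script>alert(1)</script>', // rejected: fails the width pattern
    'color1'  => '#eee',
    'onclick' => 'steal()',                          // rejected: not a known parameter
));

print_r($clean);
```

Anything that is not on the list, or does not match its pattern, is silently dropped, so markup characters never reach the generated HTML.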
I'd not use regex at all, since they are notoriously bad at parsing markup.
Since your input format is so close to HTML/XML in the first place, I'd rely on that
$tests = array(
'{youtube height="200px" width="150px" color1="#eee" color2="rgba(0,0,0,0.5)"}Video_ID_Here{/youtube}'
, '{youtube height="200px"}Video_ID_Here{/youtube}'
, '{youtube}Video_ID_Here{/youtube}'
, '{youtube width="150px" showborder="1"}Video_ID_Here{/youtube}'
, '{YOUTUBE width="150px" showborder="1"}Video_ID_Here{/youtube}' // deliberately invalid
);
echo '<pre>';
foreach ( $tests as $test )
{
try {
$youtube = SimpleXMLYoutubeElement::fromUserInput( $test );
print_r( $youtube );
}
catch ( Exception $e )
{
echo $e->getMessage() . PHP_EOL;
}
}
echo '</pre>';
class SimpleXMLYoutubeElement extends SimpleXMLElement
{
public static function fromUserInput( $code )
{
$xml = @simplexml_load_string(
str_replace( array( '{', '}' ), array( '<', '>' ), strip_tags( $code ) ), __CLASS__
);
if ( !$xml || 'youtube' != $xml->getName() )
{
throw new Exception( 'Invalid youtube element' );
}
return $xml;
}
public function toEmbedCode()
{
// write code to convert this to proper embed code
}
}
