I'm stuck on a particular task. As you can see, I'm extracting hrefs and titles from a webpage, and I need to put this information into a file. But how can this array be printed in a format like this: href1 : title1, href2 : title2, and so on?
<?php
$searched = file_get_contents('http://technologijos.lt');
$xml = new DOMDocument();
@$xml->loadHTML($searched);
foreach($xml->getElementsByTagName('a') as $lnk)
{
$links[] = array(
'href' => $lnk->getAttribute('href'),
'title' => $lnk->getAttribute('title')
);
}
echo '<pre>'; print_r($links); echo '</pre>';
?>
Why not create the array directly in a way that is usable afterwards?
<?php
$searched = file_get_contents('http://technologijos.lt');
$xml = new DOMDocument();
@$xml->loadHTML($searched);
$links = [];
foreach($xml->getElementsByTagName('a') as $lnk) {
    $links[] = sprintf(
        '%s : %s',
        $lnk->getAttribute('href'),
        $lnk->getAttribute('title')
    );
}
var_dump(implode(', ', $links));
Obviously the same can be done by using a second loop to iterate over the links array if it is created as shown in your example.
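For completeness, a minimal sketch of that second-loop variant, reusing the $links array of href/title pairs built in the question and writing the result to a hypothetical links.txt file:
// Build "href : title" pairs from the array of href/title entries
// and write them to a file (the file name is just an example).
$lines = array();
foreach ($links as $link) {
    $lines[] = $link['href'] . ' : ' . $link['title'];
}
file_put_contents('links.txt', implode(', ', $lines));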
I am using Simple HTML DOM to parse an HTML file.
I have a dynamic array called links2; it can be empty or have 4 or more elements, depending on the case.
<?php
include('simple_html_dom.php');
$url = 'http://www.example.com/';
$html = file_get_html($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
//////////////////////////////////////////////////////////////////////////////
foreach ($doc->getElementsByTagName('p') as $link)
{
$intro2 = $link->nodeValue;
$links2[] = array(
'value' => $link->textContent,
);
$su=count($links2);
}
$word = 'document.write(';
Assuming that two of the elements in the links2 array contain $word, when I try to filter the array by removing the elements that contain matches:
unset( $links2[array_search($word, $links2 )] );
print_r($links2);
the filter removes only one element, and array_diff doesn't solve the problem. Any suggestions?
Solved by adding an exclusion check in the loop:
foreach ($doc->getElementsByTagName('p') as $link)
{
    $dont = $link->textContent;
    if (strpos($dont, 'document') === false) {
        $links2[] = array(
            'value' => $link->textContent,
        );
    }
    $su = count($links2);
}
echo $su;
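For reference, the same cleanup can also be done in one pass after the loop with array_filter; this is only a sketch, assuming the $links2 structure and the $word marker used above:
// Drop every entry whose 'value' contains the marker string, then reindex.
$links2 = array_filter($links2, function ($entry) use ($word) {
    return strpos($entry['value'], $word) === false;
});
$links2 = array_values($links2);
print_r($links2);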
I am trying to extract coupon codes and, if a code is present, get the corresponding title too, but I am unable to do so.
In the code below I am able to extract the coupon codes correctly, but how do I get the corresponding title extracted too? As you can see in the link, some titles don't have coupon codes...
<?php
$html = file_get_contents('http://www.grabon.in/abof-coupons/'); //get the html returned from the following url
$mydoc = new DOMDocument();
libxml_use_internal_errors(TRUE); //disable libxml errors
if(!empty($html)){ //if any html is actually returned
$mydoc->loadHTML($html);
libxml_clear_errors(); //remove errors for yucky html
$my_xpath = new DOMXPath($mydoc);
//get all the codes
$my_code = $my_xpath->query('//*[@class="coupon-click"]//a//small');
if($my_code->length > 0){
foreach($my_code as $row){
$my_row = $my_xpath->query('//*[@class="h3_click"]');
echo $row->nodeValue . "<br/>";
}
}
}
?>
Thanks fusion3k, the code works perfectly, but when I tried your code on a different URL as below, I get the error Notice: Trying to get property of non-object
<?php
$html = file_get_contents('http://official.deals/ebay-coupons?coupon-id=1055981&h=ed68f1b2a5b28471ecf9584734d65742&utm_source=coupon_page&utm_medium=deal_reveal&utm_campaign=od_deal_click#ebay1055981'); //get the html returned from the following url
$mydoc = new DOMDocument();
libxml_use_internal_errors(TRUE); //disable libxml errors
if(empty($html)) die("EMPTY HTML");
$mydoc->loadHTML($html);
libxml_clear_errors(); //remove errors for yucky html
$my_xpath = new DOMXPath($mydoc);
//////////////////////////////////////////////////////
$result = array();
$nodes = $my_xpath->query( '//div[@data-rowtype="1"]' );
foreach( $nodes as $node )
{
$title = $my_xpath->query( 'div[@class="cop-head"]/h4', $node )->item(0)->nodeValue;
$found = $my_xpath->query( 'div[@class="cop-head"]/div/input/value', $node );
$coupon = ( $found->length ) ? $found->item(0)->nodeValue : '' ;
$result[] = compact( 'title', 'coupon' );
}
echo '<pre>';
print_r($result);
echo '</pre>';
?>
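The notice usually means that one of the item(0) calls found no matching node on that page, so nodeValue is read from null. A defensive variant of the loop (only a sketch, reusing the query strings above, which are themselves assumptions about that page's markup) checks the result length first:
foreach( $nodes as $node )
{
    // Guard both lookups so ->nodeValue is never called on a missing node.
    $found_title = $my_xpath->query( 'div[@class="cop-head"]/h4', $node );
    $title = ( $found_title->length ) ? $found_title->item(0)->nodeValue : '';
    // If the code sits in the input's value attribute, the path would need
    // to end in input/@value instead of input/value.
    $found = $my_xpath->query( 'div[@class="cop-head"]/div/input/value', $node );
    $coupon = ( $found->length ) ? $found->item(0)->nodeValue : '';
    $result[] = compact( 'title', 'coupon' );
}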
If you also want to retrieve the boxes without a coupon, you have to proceed in a different way: retrieve all the boxes and, for each box, check whether a coupon code exists.
Init an array to store results:
$result = array();
Search for boxes ( <li> nodes with class “coupon-list-item ” ):
$nodes = $my_xpath->query( '//li[@class="coupon-list-item "]' );
# ↑ pay attention!
Then analyze each node through a foreach loop:
foreach( $nodes as $node )
{
Match titles:
$title = $my_xpath->query( 'div/b[@class="h3_click"]', $node )->item(0)->nodeValue;
# No starting slashes: the pattern is evaluated relative to $node
# The second, optional xpath->query parameter defines the search context
Then search for the coupon, if it exists:
$found = $my_xpath->query( 'div[@class="coupon-actions"]/div/a/small', $node );
$coupon = ( $found->length ) ? $found->item(0)->nodeValue : '' ;
At the end, you can add a sub-array to $result using the fabulous, underrated compact function:
$result[] = compact( 'title', 'coupon' );
}
If you want, you can also add the related coupons in a similar way:
$nodes = $my_xpath->query( '//div[@class="related-coupons"]/*/div[@class="col-sm-8"]' );
foreach( $nodes as $node )
{
$title = $my_xpath->query( 'div/div[@class="coupon-title"]', $node )->item(0)->nodeValue;
$found = $my_xpath->query( 'div/div[@class="coupon-click"]/a/small', $node );
$coupon = ( $found->length ) ? $found->item(0)->nodeValue : '' ;
$result[] = compact( 'title', 'coupon' );
}
At the end, $result looks like this:
Array
(
[0] => Array
(
[title] => Upto 80% OFF + Extra Rs. 500 OFF On Rs 1495 - All Users
[coupon] => ABOFBMF500C
)
(...)
[14] => Array
(
[title] => Fresh Arrivals on Women & Men Collection
[coupon] =>
)
(...)
)
phpFiddle demo
I'd like to get all the articles from a webpage, as well as all the pictures for each article.
I decided to use PHP Simple HTML DOM Parser and I used the following code:
<?php
include("simple_html_dom.php");
$sitesToCheck = array(
array(
'url' => 'http://googleblog.blogspot.ru/',
'search_element' => 'h2.title a',
'get_element' => 'div.post-content'
),
array(
// 'url' => '', // Site address with a list of of articles
// 'search_element' => '', // Link of Article on the site
// 'get_element' => '' // desired content
)
);
$s = microtime(true);
foreach($sitesToCheck as $site)
{
$html = file_get_html($site['url']);
foreach($html->find($site['search_element']) as $link)
{
$content = '';
$savePath = 'cachedPages/'.md5($site['url']).'/';
$fileName = md5($link->href);
if ( ! file_exists($savePath.$fileName))
{
$post_for_scan = file_get_html($link->href);
foreach($post_for_scan->find($site["get_element"]) as $element)
{
$content .= $element->plaintext . PHP_EOL;
}
if ( ! file_exists($savePath) && ! mkdir($savePath, 0755, true))
{
die('Unable to create directory ...');
}
file_put_contents($savePath.$fileName, $content);
}
}
}
$e = microtime(true);
echo $e-$s;
Even when I try to get only the articles, without the pictures, I get this response from the server:
"Maximum execution time of 120 seconds exceeded"
What am I doing wrong? Is there any other way to get all the articles and all the pictures for each article from a specific webpage?
I had similar problems with that lib. Use PHP's DOMDocument instead:
$doc = new DOMDocument;
$doc->loadHTML($html);
$links = $doc->getElementsByTagName('a');
foreach ($links as $link) {
doSomethingWith($link->getAttribute('href'), $link->nodeValue);
}
See http://www.php.net/manual/en/domdocument.getelementsbytagname.php
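For illustration, here is a rough sketch of the article loop from the question rewritten with DOMDocument and DOMXPath. The XPath expressions stand in for the 'h2.title a' and 'div.post-content' selectors (they assume the class attributes match exactly), and the cachedPages layout mirrors the question's:
<?php
$url      = 'http://googleblog.blogspot.ru/';
$savePath = 'cachedPages/' . md5($url) . '/';

$doc = new DOMDocument();
@$doc->loadHTML(file_get_contents($url)); // suppress warnings from messy HTML
$xpath = new DOMXPath($doc);

// For elements with several classes, a contains() test on @class would be needed.
foreach ($xpath->query('//h2[@class="title"]/a') as $link) {
    $href     = $link->getAttribute('href');
    $fileName = md5($href);
    if (file_exists($savePath . $fileName)) {
        continue; // already cached, skip the slow network round trip
    }

    $post = new DOMDocument();
    @$post->loadHTML(file_get_contents($href));
    $postXpath = new DOMXPath($post);

    $content = '';
    foreach ($postXpath->query('//div[@class="post-content"]') as $element) {
        $content .= $element->textContent . PHP_EOL;
    }

    if (!file_exists($savePath) && !mkdir($savePath, 0755, true)) {
        die('Unable to create directory ...');
    }
    file_put_contents($savePath . $fileName, $content);
}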
I am not sure why my inner loop's data is added to the outer loop's data.
XML I am parsing - http://pastebin.com/vGc5NhXr
Code I am using -
<?php
$dom = new DomDocument;
$dom->preserveWhiteSpace = FALSE;
$dom->load('course/Golf/imsmanifest.xml');
// get the resources element
$organization = $dom->getElementsByTagName( "item" );
echo '<ul>';
foreach( $organization as $organizationItem )
{
$unitTitle = $organizationItem->getElementsByTagName("title");
$unitName = $unitTitle->item(0)->nodeValue;
echo '<li>',$unitName,'</li>';
echo '<ul>';
$item1 = $organizationItem->getElementsByTagName( "item" );
foreach( $item1 as $myitem ) {
$title = $myitem->getElementsByTagName("title");
$author = $title->item(0)->nodeValue;
echo '<li>',$author,'</li>';
}
echo '</ul>';
}
echo '</ul>';
Generated output - http://codepad.org/J2vP71rd
Expected Output - http://codepad.org/uzUtehgT
Let me know what I am doing wrong with the foreach loop.
Because the item elements are nested. $dom->getElementsByTagName( "item" ) gets all the item elements, including those that lie within another item. That's not what you want.
I'd suggest using XPath for this kind of job.
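A minimal sketch of that XPath approach, assuming the manifest nests <item> elements directly inside <organization> and inside each other, and that it declares no default namespace (if it does, register it on the DOMXPath object and prefix every step):
<?php
$dom = new DOMDocument();
$dom->preserveWhiteSpace = FALSE;
$dom->load('course/Golf/imsmanifest.xml');
$xpath = new DOMXPath($dom);

echo '<ul>';
// Select only the top-level items, then query child items relative to each
// one, so nested titles are no longer counted twice.
foreach ($xpath->query('//organization/item') as $unit) {
    // 'title' with no leading slash is evaluated relative to $unit
    echo '<li>', $xpath->query('title', $unit)->item(0)->nodeValue, '</li>';
    echo '<ul>';
    foreach ($xpath->query('item', $unit) as $child) {
        echo '<li>', $xpath->query('title', $child)->item(0)->nodeValue, '</li>';
    }
    echo '</ul>';
}
echo '</ul>';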
I am indexing web pages. The code scans a given web page for links and for that page's title. The links and titles are stored in two different arrays. I would like to create a multidimensional array that combines the links with the individual titles of the links. I have the code, I just don't know how to put it together.
require_once('simplehtmldom_1_5/simple_html_dom.php');
require_once('url_to_absolute/url_to_absolute.php');
//links
$links = Array();
$URL = 'http://www.youtube.com'; // change it for urls to grab
// grabs the urls from URL
$file = file_get_html($URL);
foreach ($file->find('a') as $theelement) {
$links[] = url_to_absolute($URL, $theelement->href);
}
print_r($links);
//titles
$titles = Array();
$str = file_get_contents($URL);
$titles[] = preg_match_all( "/\<title\>(.*)\<\/title\>/", $str, $title );
print_r($title[1]);
You should be able to do this; assuming there are the same number of links as there are titles, they will correspond to the same array keys.
$newArray = array();
foreach ($links as $key=>$val)
{
$newArray[$key]['link'] = $val;
$newArray[$key]['title'] = $titles[$key];
}
It is not clear what you want.
Anyway, here is how I would rewrite your code in a more organized way:
require_once('simplehtmldom_1_5/simple_html_dom.php');
require_once('url_to_absolute/url_to_absolute.php');
$info = array();
$urls = array(
'http://www.youtube.com',
'http://www.google.com.br'
);
foreach ($urls as $url)
{
$str = file_get_contents($url);
$html = str_get_html($str);
$title = strval($html->find('title', 0)->plaintext);
$links = array();
foreach($html->find('a') as $anchor)
{
$links[] = url_to_absolute($url, strval($anchor->href));
}
$links = array_unique($links);
$info[$url] = array(
'title' => $title,
'links' => $links
);
}
print_r($info);