PHP Simple Dom HTML - Trouble parsing list of a hrefs

I'm trying to scrape all the a hrefs with an id starting with 'system' from this webpage: http://www.myfxbook.com/systems
Here is my code, which I just can't seem to get to work. I've been fiddling with it for hours and have looked at countless answered questions here.
include_once( 'simple_html_dom.php' );
$url2process = 'http://www.myfxbook.com/systems';
$html = file_get_html( $url2process );
$cnt = 0;
$parent_mark = $html->find('a[id^=system]');
$cntr = 0;
foreach( $parent_mark as $element) {
    if( $cntr > 3 ) continue;
    $cntr++;
    $single_html = file_get_html( $element->href );
    // ... (the rest of the loop body was not included in the question)
}
UPDATE1: OK, this is kind of working now, but it only seems to use the very last a href on the page with the correct id. I need to process ALL the a hrefs with this id. What am I missing here?

You could do it using DOMDocument like this:
$html = file_get_contents('http://www.myfxbook.com/systems');
$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_use_internal_errors(false);
$links = $doc->getElementsByTagName('a');
$cnt = 0;
$cntr = 0;
foreach ($links as $link) {
    // keep only anchors whose id starts with "system"
    if (preg_match('~^system~', $link->getAttribute('id'))) {
        if ($cntr > 3) {
            break;   // stop after the first four matching links
        }
        $cntr++;
        $single_html = file_get_contents($link->getAttribute('href'));
        if (empty($single_html)) {
            echo 'EMPTY';
        }
    }
}
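
The id filter in the answer above can also be pushed into an XPath query instead of a regular expression. This is only a minimal sketch: the inline markup and the limit of four links are made up for illustration, and the real page would be fetched with file_get_contents() as shown in the answer.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = '<html><body>
<a id="system1" href="/a">A</a>
<a id="other" href="/x">X</a>
<a id="system2" href="/b">B</a>
<a id="system3" href="/c">C</a>
</body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$hrefs = [];
// starts-with() keeps only anchors whose id begins with "system"
foreach ($xpath->query('//a[starts-with(@id, "system")]') as $link) {
    if (count($hrefs) >= 4) {
        break;                      // stop after the first four matches
    }
    $hrefs[] = $link->getAttribute('href');
}
print_r($hrefs);
```

Collecting the href values first and fetching them in a second step makes it easier to check that every matching link was found before any request is made.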

Related

Web scraping a website, filtering for divs with a certain class name. How to do that?

I'm currently trying to scrape a site for football matches, and I need to find out how to filter for divs with a specific class name. Here is the code I already have. Thanks
include('simple_html_dom.php');
$day = 1; // temporary
$html = file_get_html('https://sport.sky.de/bundesliga-spielplan-ergebnisse-'.$day);
$list = $html -> find('div[class="sdc-site-fixres__match-cell sdc-site-fixres__match-cell--score"]', 0);
$list_array = $list -> find('div');
for($i = 0; $i < sizeof($list_array); $i++){
    echo $list_array[$i]->plaintext;
    echo "<br>";
}

You can use XPath. Here is the full documentation.
$day = 1; // temporary
$html = file_get_contents('https://sport.sky.de/bundesliga-spielplan-ergebnisse-'.$day);
$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);            // loadHTML() is an instance method, not a static one
$xpath = new DOMXPath($doc);
$query = $xpath->query('//div[@class="sdc-site-fixres__match-cell sdc-site-fixres__match-cell--score"]/div/span[2]');
foreach ($query as $item) {
    /** @var DOMElement $item */
    echo $item->nodeValue;
    echo PHP_EOL;
}
Alternatively, you can use Symfony components such as DomCrawler or CssSelector for this purpose.
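
Comparing @class against the full attribute string only matches when the classes appear in exactly that order and spacing. A common XPath 1.0 idiom matches a single class token instead. The sketch below uses made-up markup with the class names from the question; the /div/span[2] path is copied from the answer and is an assumption about the real page structure.

```php
<?php
// Hypothetical markup; only the class names come from the question.
$html = '<div class="sdc-site-fixres__match-cell sdc-site-fixres__match-cell--score">
  <div><span>x</span><span>2</span></div>
  <div><span>y</span><span>1</span></div>
</div>
<div class="sdc-site-fixres__match-cell"><div><span>a</span><span>9</span></div></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Pad the class attribute with spaces so a whole token can be matched,
// regardless of which other classes the element carries.
$query = '//div[contains(concat(" ", normalize-space(@class), " "), '
       . '" sdc-site-fixres__match-cell--score ")]/div/span[2]';

$scores = [];
foreach ($xpath->query($query) as $item) {
    $scores[] = trim($item->nodeValue);
}
print_r($scores);
```

The second div in the sample lacks the --score class, so its span is skipped.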

Getting link tag via DOMDocument

I convert an atom feed into RSS using atom2rss.xsl. Works fine.
Then, using DOMDocument, I try to get the post title and URL:
$feed = new DOMDocument();
$feed->loadHTML('<?xml encoding="utf-8" ?>' . $html);
if (!empty($feed) && is_object($feed) ) {
    foreach ($feed->getElementsByTagName("item") as $item){
        echo 'url: '. $item->getElementsByTagName("link")->item(0)->nodeValue;
        echo 'title'. $item->getElementsByTagName("title")->item(0)->nodeValue;
    }
    return;
}
But the post URL is empty.
See this eval which contains HTML. What am I doing wrong? I suspect I am not getting the link tag properly via $item->getElementsByTagName("link")->item(0)->nodeValue.

I think the problem is that there are several <link> elements in each item, and the one (I think) you're interested in is the one with rel="self" as an attribute. The quickest way (without messing around with XPath) is to loop over each <link> element, check for the right rel value, and then take the href attribute from it:
if (!empty($feed) && is_object($feed) ) {
    foreach ($feed->getElementsByTagName("item") as $item){
        $url = "";
        // Look for the 'right' link tag and extract URL from that
        foreach ( $item->getElementsByTagName("link") as $link ) {
            if ( $link->getAttribute("rel") == "self" ) {
                $url = $link->getAttribute("href");
                break;
            }
        }
        echo 'url: '. $url;
        echo 'title'. $item->getElementsByTagName("title")->item(0)->nodeValue;
    }
    return;
}
which gives...
url: https://www.blogger.com/feeds/2984353310628523257/posts/default/1947782625877709813titleExtraordinary Genius - Cp274
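
One likely contributing factor is that the HTML parser treats <link> as an empty element, so any text between <link> and </link> ends up outside the element and nodeValue is empty. If the converted feed is well-formed XML, loading it with loadXML() avoids this. The sketch below assumes a plain RSS 2.0 structure; the feed content and example.com URLs are made up.

```php
<?php
// Hypothetical RSS snippet standing in for the converted feed.
$rss = '<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"><channel>
<item><title>First post</title><link>https://example.com/first</link></item>
<item><title>Second post</title><link>https://example.com/second</link></item>
</channel></rss>';

$feed = new DOMDocument();
$feed->loadXML($rss);   // XML parser: <link> keeps its text content

$posts = [];
foreach ($feed->getElementsByTagName('item') as $item) {
    $posts[] = [
        'url'   => $item->getElementsByTagName('link')->item(0)->nodeValue,
        'title' => $item->getElementsByTagName('title')->item(0)->nodeValue,
    ];
}
print_r($posts);
```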

function get_links($link)
{
    $ret = array();
    $dom = new DOMDocument();
    @$dom->loadHTML(file_get_contents($link));
    $dom->preserveWhiteSpace = false;
    $links = $dom->getElementsByTagName('a');
    foreach ($links as $tag){
        // map each href to the value of the anchor's first child node
        $ret[$tag->getAttribute('href')] = $tag->childNodes->item(0)->nodeValue;
    }
    return $ret;
}
print_r(get_links('http://www.google.com'));
Or you can use DOMXPath:
$html = file_get_contents('http://www.google.com');
$dom = new DOMDocument();
@$dom->loadHTML($html);
// take all links
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    echo $url . "\n";
}

How to crawl and extract data from each pager link?

I want to extract all the name="" attributes from a website.
Example HTML:
<div class="link_row">
<a class="listing_container" name="7777">link</a>
</div>
I have the following code:
<?php
$html = new DOMDocument();
@$html->loadHTMLFile('http://www.onedomain.com/plus?ca=11_c&o=1');
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query( "//div[@class='link_row']/a[@class='listing_container']/@name" );
foreach ($nodelist as $n){
    echo $n->nodeValue."\n<br>";
}
?>
Result is:
7777
This code works fine, but it should not be limited to a single page number.
In http://www.onedomain.com/plus?ca=11_c&o=1 the pager parameter is "o=1".
Once it has finished with o=1, I would like it to continue with o=2,
and so on up to my defined variable $last=556, which corresponds to http://www.onedomain.com/plus?ca=11_c&o=556
Could you help me?
What is the best way to do it?
Thanks

Use a for (or while) loop. I don't see $last in your provided code so I've statically set the max value plus one.
$html = new DOMDocument();
for($i =1; $i < 557; $i++) {
    @$html->loadHTMLFile('http://www.onedomain.com/plus?ca=11_c&o=' . $i);
    $xpath = new DOMXPath( $html );
    $nodelist = $xpath->query( "//div[@class='link_row']/a[@class='listing_container']/@name" );
    foreach ($nodelist as $n){
        echo $n->nodeValue."\n<br>";
    }
}
Simpler example:
for($i =1; $i < 557; $i++) {
    echo $i;
}
http://php.net/manual/en/control-structures.for.php
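
The same loop can be tried offline before pointing it at the real site. In this sketch, fetchPage() is a hypothetical stand-in that returns made-up markup for a given page number, and $last is set to 3 instead of the 556 from the question; in real use the body of the loop would load the URL with loadHTMLFile() as in the answer.

```php
<?php
// Hypothetical stand-in for the remote site: returns the HTML of page $page.
function fetchPage(int $page): string
{
    return '<div class="link_row"><a class="listing_container" name="'
         . (7000 + $page) . '">link</a></div>';
}

$last  = 3;   // the question uses $last = 556
$names = [];
for ($i = 1; $i <= $last; $i++) {
    $doc = new DOMDocument();           // fresh document for every page
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $doc->loadHTML(fetchPage($i));
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $query = "//div[@class='link_row']/a[@class='listing_container']/@name";
    foreach ($xpath->query($query) as $n) {
        $names[] = $n->nodeValue;       // value of the name attribute
    }
}
print_r($names);
```

Using `$i <= $last` keeps the upper bound in one variable, so there is no need to hard-code the maximum plus one.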

How to call UL class only once using domdocument php

I am using PHP DOMDocument to load my HTML. In my HTML, class="smalllist" appears twice, but I only need the elements of the first one.
My PHP code is currently:
$d = new DOMDocument();
$d->validateOnParse = true;
@$d->loadHTML($html);
$xpath = new DOMXPath($d);
$table = $xpath->query('//ul[@class="smalllist"]');
foreach ($table as $row) {
    echo $row->getElementsByTagName('a')->item(0)->nodeValue."-";
    echo $row->getElementsByTagName('a')->item(1)->nodeValue."\n";
}
This loads both lists with that class,
but I need only the first one.
Please help me with this. Thanks in advance.

DOMXPath::query() returns a DOMNodeList, which has an item() method. See if this works:
$table->item(0)->getElementsByTagName('a')->item(0)->nodeValue
Edited (untested):
foreach($table->item(0)->getElementsByTagName('a') as $anchor){
    echo $anchor->nodeValue . "\n";
}

You can put a break within the foreach loop to read only from the first list. Or you can loop over the anchors of $table->item(0) directly and stop after the first two.
Code:
$count = 0;
foreach($table->item(0)->getElementsByTagName('a') as $anchor){
    echo $anchor->nodeValue . "\n";
    if( ++$count >= 2 ) {   // the question reads only the first two anchors
        break;
    }
}

Another way, rather than using break (more than one way to skin a cat):
$anchors = $table->item(0)->getElementsByTagName('a');
for($i = 0; $i < 2; $i++){
    echo $anchors->item($i)->nodeValue . "\n";
}

This is my final code:
$d = new DOMDocument();
$d->validateOnParse = true;
@$d->loadHTML($html);
$xpath = new DOMXPath($d);
$table = $xpath->query('//ul[@class="smalllist"]');
$count = 0;
foreach($table->item(0)->getElementsByTagName('a') as $anchor){
    $data[$k][$arr1[$count]] = $anchor->nodeValue;
    if( ++$count > 1 ) {
        break;
    }
}
Working fine.
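
The restriction to the first list can also be written into the XPath expression itself, so no item(0) call or break is needed. A minimal sketch with made-up markup:

```php
<?php
// Hypothetical markup with two lists that share the same class.
$html = '<ul class="smalllist"><li><a href="#">One</a></li><li><a href="#">Two</a></li></ul>
<ul class="smalllist"><li><a href="#">Three</a></li><li><a href="#">Four</a></li></ul>';

$d = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$d->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($d);
$texts = [];
// The parentheses make [1] select the first matching <ul> in the whole document.
foreach ($xpath->query('(//ul[@class="smalllist"])[1]//a') as $anchor) {
    $texts[] = $anchor->nodeValue;
}
print_r($texts);
```

Without the parentheses, `//ul[@class="smalllist"][1]` would select every such list that is the first one among its siblings, which is not necessarily the same thing.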

Extracting certain portions of HTML from within PHP

OK, so I'm writing an application in PHP to check whether all the links on my sites are valid, so I can update them if I have to.
I ran into a problem. I've tried to use the SimpleXml and DOMDocument objects to extract the tags, but when I run the app against a sample site I usually get a ton of errors if I use SimpleXml.
So is there a way to scan the HTML document for href attributes that's pretty much as simple as using SimpleXml?
<?php
// what I want to do is get a similar effect to the code described below:
foreach($html->html->body->a as $link)
{
    // store the $link into a file
    foreach($link->attributes() as $attribute=>$value)
    {
        //procedure to place the href value into a file
    }
}
?>
So basically I'm looking for a way to perform the above operation. The thing is, I'm currently confused about how I should treat the string that contains the HTML code...
Just to be clear, I'm using the following primitive way of getting the HTML file:
<?php
$target = "http://www.targeturl.com";
$file_handle = fopen($target, "r");
$a = "";
while (!feof($file_handle)) $a .= fgets($file_handle, 4096);
fclose($file_handle);
?>
Any info would be useful, as would alternatives in other languages where this problem is solved more elegantly (Python, C or C++).

You can use DOMDocument::loadHTML.
Here's a bunch of code we use for an HTML parsing tool we wrote.
$target = "http://www.targeturl.com";
$result = file_get_contents($target);
$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
@$dom->loadHTML($result);
// getTags() returns an array of HTML fragments; join them so extractLink() gets a string
$links = extractLink(implode("\n", getTags($dom, 'a')));

// Returns the first match for capture group $argument (1 = href, 2 = link text)
function extractLink( $html, $argument = 1 ) {
    $href_regex_pattern = '/<a[^>]*?href=[\'"](.*?)[\'"][^>]*?>(.*?)<\/a>/si';
    preg_match_all($href_regex_pattern, $html, $matches);
    if (count($matches)) {
        if (is_array($matches[$argument]) && count($matches[$argument])) {
            return $matches[$argument][0];
        }
        return $matches[1];
    }
    return false; // assumption: the original else branch is cut off at this point
}

// Returns the outer HTML of every $tagName element, or only the one at index $element
function getTags( $dom, $tagName, $element = false, $children = false ) {
    $html = array(); // collect the fragments in an array
    $domxpath = new DOMXPath($dom);
    $children = ($children) ? "/".$children : '';
    $filtered = $domxpath->query("//$tagName" . $children);
    $i = 0;
    while( $myItem = $filtered->item($i++) ){
        // copy the node into its own document so saveHTML() yields just that element
        $newDom = new DOMDocument;
        $newDom->formatOutput = true;
        $node = $newDom->importNode( $myItem, true );
        $newDom->appendChild($node);
        $html[] = $newDom->saveHTML();
    }
    if ($element !== false && isset($html[$element])) {
        return $html[$element];
    }
    return $html;
}
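
For a link checker, the href values can also be read straight from the DOM, without the regular-expression step. This is a minimal sketch, not the tool described above: extractHrefs() is a hypothetical helper name and the sample markup is made up.

```php
<?php
// Hypothetical helper: returns the unique href values of all <a> elements.
function extractHrefs(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();

    $hrefs = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) { // skip named anchors without a target
            $hrefs[] = $a->getAttribute('href');
        }
    }
    // drop duplicates and reindex from 0
    return array_values(array_unique($hrefs));
}

$sample = '<p><a href="http://example.com/a">A</a> <a name="top">no href</a> '
        . '<a href="/b">B</a> <a href="/b">B again</a></p>';
$hrefs = extractHrefs($sample);
print_r($hrefs);
```

Relative values such as /b would still need to be resolved against the page URL before they can be requested.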

You could just use strpos($html, 'href=') and then parse the URL. You could also search for <a or .php
