I know how to use XPath to scrape and echo text from another website by targeting things like a div id or class, using the code below. But I don't know how to do it under more precise conditions, for example when the bit of text I want has no unique identifier such as a div id.
The code below prints the scraped data.
$doc = new DOMDocument;
// We don't want to bother with white spaces
$doc->preserveWhiteSpace = false;
// Most HTML Developers are chimps and produce invalid markup...
$doc->strictErrorChecking = false;
$doc->recover = true;
$doc->loadHTMLFile('http://www.nbcnews.com/business');
$xpath = new DOMXPath($doc);
$query = "//div[@class='market']";
$entries = $xpath->query($query);
foreach ($entries as $entry) {
echo trim($entry->textContent); // use `trim` to eliminate spaces
}
In the source code below, for example, I want to pull the value "21,271.97". But there's no unique tag for it, no div id. Is it possible to pull this data by identifying a keyword in the <p> that never changes, for example "DJIA All Time"?
<p>DJIA All Time, Record-High Close: <font color="#0000FF">June 9,
2017</font>
(<font color="#FF0000"><b bgcolor="#FFFFCC"><font face="Verdana, Arial,
Helvetica, sans-serif" size="2">21,271.97</font></b></font>)</p>
I'm wondering if I could replace $query = "//div[@class='market']"; with something along the lines of
$query = "//p['DJIA all time']";
Could this be possible?
I also wonder whether a loop with something like $query = "//p[='DJIA']";
could work, though I don't know how to use that exactly.
Thanks!!
It would be good to have a play with an online XPath tester. I use https://www.freeformatter.com/xpath-tester.html#ad-output
$query = "//p[contains(text(),'DJIA')]";
Although if you use the page you're after, I've found that the value seems to be the first record for...
$query = "//span[contains(@class,'market_price')]";
But the idea is the same in both cases: contains(source, value) matches the nodes whose source contains value. In the first case text() is the text of the node; the second looks for the specific class name in the class attribute.
Try the XPath expression below:
//p[contains(text(), "DJIA All Time")]//b/font
For the link provided (http://www.nbcnews.com/business), you can get the required text with
//span[text()="DJIA"]/following-sibling::span[@class="market_item market_price"]
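To see the first expression in action without hitting the live site, here is a minimal sketch that runs it against the snippet pasted in the question. The inline $html string stands in for the real page, and libxml_use_internal_errors() is my own choice for silencing warnings about the old-style markup; the live page may well be structured differently.

```php
<?php
// Sample markup copied from the question; the live page may differ.
$html = <<<HTML
<p>DJIA All Time, Record-High Close: <font color="#0000FF">June 9,
2017</font>
(<font color="#FF0000"><b bgcolor="#FFFFCC"><font face="Verdana, Arial,
Helvetica, sans-serif" size="2">21,271.97</font></b></font>)</p>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Find the <p> whose first text node mentions the fixed keyword,
// then walk down to the <font> nested inside <b>.
$nodes = $xpath->query('//p[contains(text(), "DJIA All Time")]//b/font');

// query() always returns a list, so check it before reading the first item.
$value = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
echo $value, PHP_EOL;
```

Note that the match is case-sensitive, so "DJIA all time" would not match this paragraph.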
Trying to get good at PHP web scraping. I've been doing some tests and have nailed scraping and echoing information from one site to another, but I'm unable to also include the original links from the source code, which is what I'd ideally like to do. Any thoughts on how to accomplish this with what I've got thus far? (I'm very new to PHP, btw.)
This is the PHP code:
// news
$doc = new DOMDocument;
// We don't want to bother with white spaces
$doc->preserveWhiteSpace = false;
// Most HTML Developers are chimps and produce invalid markup...
$doc->strictErrorChecking = false;
$doc->recover = true;
$doc->loadHTMLFile('https://www.usatoday.com/');
$xpath = new DOMXPath($doc);
$query = "//ul[@class='hfwmm-list hfwmm-4uphp-list hfwmm-light-list']";
$entries = $xpath->query($query);
foreach ($entries as $entry) {
echo trim($entry->textContent); // use `trim` to eliminate spaces
}
That code is spitting out this: NBA Cavs win record-breaking Game 4 behind Irving's 40 Entertain This Watch: 'Black Panther' trailer unleashes a fearsome king News Police: London Bridge terrorists planned more bloodshed How Trump is highlighting divisions amo..........
Now what I'd really like is to have those as working links, as they were in the original page. This is what the source code for this information looked like:
<div class="partner-heroflip-ad partner-placement ui-flip-panel size-xxs"></div></div><p class="hfwmm-tertiary-
list-title hfwmm-light-tertiary-list-title">TOP STORIES</p><ul class="hfwmm-
list hfwmm-4uphp-list hfwmm-light-list"
data-track-prefix="flex4uphphero"><li class="hfwmm-item hfwmm-secondary-item
hfwmm-item-2 sports-theme-bg hfwmm-first-secondary-item hfwmm-4uphp-
secondary-item"
data-asset-position="1"
data-asset-id="102694848"
><a class="js-asset-link hfwmm-list-link hfwmm-light-list-link hfwmm-image-
link hfwmm-secondary-link
href="/story/sports/nba/2017/06/10/kyrie-irving-lebron-james-cavs-win-game-
4/102694848/"
data-track-display-type="thumb"
data-ht="flex4uphpherostack1"
data-asset-id="102694848"
><span class="hfwmm-image-gradient hfwmm-secondary-image-gradient"></span>
<span class="js-asset-section theme-bg-ssts-label hfwmm-ssts-label-top-left
hfwmm-ssts-label-secondary sports-theme-bg">NBA</span><img
src="https://www.gannett-cdn.com/-
mm-/cd17823b265aa373c83094fc75525710f645ec90/c=0-178-4072-
81338209183-USP-NBA-FINALS-GOLDEN-STATE-WARRIORS-AT-CLEVELAND-91573076.JPG"
class="hfwmm-image hfwmm-secondary-image js-asset-image placeholder-hide"
alt="Kyrie Irving reacts after making a basket against the"
data-id="102695338"
data-crop="16_9"
width="239"
height="135" /><span class="hfwmm-secondary-hed-wrap hfwmm-secondary-text-
hed-wrap"><span class="hfwmm-text-hed-icon js-asset-disposable"></span><span
title="Cavs win record-breaking Game 4 behind Irving's 40"
class="js-asset-headline hfwmm-list-hed hfwmm-secondary-hed placeholder-
hide">
Cavs win record-breaking Game 4 behind Irving's 40
hfwmm-item-3 life-theme-bg hfwmm-4uphp-secondary-item"
data-asset-position="2"
For sanity, the href above is href="/story/sports/nba/2017/06/10/kyrie-irving-lebron-james-cavs-win-game-4/102694848/"
Any thoughts on how this might be accomplished in this test scenario would be hugely helpful. Thank you very much. -wilson
You need to output the element as a string; you're just extracting the text of the element (not the same thing with XML). The element may be <a>some text</a>, but its text is simply some text.
To output the tags, use...
$query = "//ul[@class='hfwmm-list hfwmm-4uphp-list hfwmm-light-list']//a";
$entries = $xpath->query($query);
foreach ($entries as $entry) {
$newdoc = new DOMDocument();
$cloned = $entry->cloneNode(TRUE);
$newdoc->appendChild($newdoc->importNode($cloned,TRUE));
echo $newdoc->saveHTML();
//echo trim((string)$entry); // use `trim` to eliminate spaces
}
Also note that I've added //a on the end of the XPath expression to limit the selection to links in the segment you were fetching. This may or may not be what you want, but look at the results and check it out.
Edit:
To manipulate the href in the <a>, use something like...
foreach ($entries as $entry) {
$oldHref = (string)$entry->getAttribute("href");
$entry->setAttribute("href", "http://someserver.com".$oldHref);
$newdoc = new DOMDocument();
$cloned = $entry->cloneNode(TRUE);
$newdoc->appendChild($newdoc->importNode($cloned,TRUE));
echo $newdoc->saveHTML();
}
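As an alternative to cloning each node into a fresh document, DOMDocument::saveHTML() accepts a node argument and returns just that element's markup. Here is a rough sketch of the same idea; the inline $html, the 'stories' class, and the $base value are made-up stand-ins for the real page, so adjust them to whatever you are actually scraping.

```php
<?php
// Hypothetical sample markup standing in for the scraped page.
$html = '<ul class="stories"><li><a class="story-link" '
      . 'href="/story/sports/example/">Example headline</a></li></ul>';
$base = 'https://www.usatoday.com';   // site the relative links belong to

$doc = new DOMDocument();
libxml_use_internal_errors(true);     // keep parser warnings quiet
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$links = [];
foreach ($xpath->query("//ul[@class='stories']//a") as $a) {
    if (!$a instanceof DOMElement) {
        continue;
    }
    $href = $a->getAttribute('href');
    // Prefix site-relative links ("/story/...") but leave "//host/..." alone.
    if (strpos($href, '/') === 0 && strpos($href, '//') !== 0) {
        $a->setAttribute('href', $base . $href);
    }
    // Serialise just this element, tags and attributes included.
    $links[] = $doc->saveHTML($a);
}
echo implode("\n", $links), PHP_EOL;
```

Passing the node to saveHTML() keeps the loop short, but cloning into a new document as shown above works just as well.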
I am using explode to manipulate information I am scraping from a website. I am trying to eliminate something specific from the string so that it will return what I want and also add the rest of the items to the array.
$pageArray = explode('<td class="player-label"><a href="/nfl/players/antonio-brown.php?type=overall&week=draft">', $fantasyPros);
I would like to skip the antonio-brown section and use a regular expression or whatever is best to replace it so that it will not look for a specific name but every name on the list and add them to my array. Do you have any suggestions on what I should use here? I appreciate any assistance.
Seems like a parser job to me with appropriate xpath functions, e.g. not().
Consider the following code:
<?php
$data = <<<DATA
<td class="player-label">
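<a href="/nfl/players/antonio-brown.php?type=overall&week=draft">Some brown link here</a>
<a href="/nfl/players/some-other-player.php?type=overall&week=draft">Some green link here</a>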
</td>
DATA;
$dom = new DOMDocument();
$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$green_links = $xpath->query("//a[not(contains(@href, 'antonio-brown'))]");
foreach ($green_links as $link) {
// do sth. useful here
}
?>
This selects every link whose href does not contain antonio-brown; the loop body is where you would print or store each one.
You can easily adjust this to td or any other element.
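If the goal is simply every player on the list, you may not need to exclude anything: selecting the links inside each player-label cell gives you all of them. This is only a sketch; the table markup and the second player are invented for illustration, and I'm assuming $fantasyPros holds the page HTML as in the question.

```php
<?php
// Made-up sample standing in for the scraped page; "Example Player" is hypothetical.
$fantasyPros = <<<HTML
<table>
<tr><td class="player-label"><a href="/nfl/players/antonio-brown.php?type=overall&amp;week=draft">Antonio Brown</a></td></tr>
<tr><td class="player-label"><a href="/nfl/players/example-player.php?type=overall&amp;week=draft">Example Player</a></td></tr>
</table>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // real-world pages rarely parse cleanly
$dom->loadHTML($fantasyPros);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$players = [];
// Every <a> directly inside a player-label cell, whatever the player's name.
foreach ($xpath->query("//td[@class='player-label']/a") as $a) {
    $players[] = [
        'name' => trim($a->textContent),
        'href' => $a->getAttribute('href'),
    ];
}
print_r($players);
```

This replaces the explode() call entirely, so there is no per-player string to match.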
I know there are plenty of these, and that my PHP may be riddled with errors (I'm new to PHP), but I can't figure this one out... it's citing the "function must be a string" error on line 11:
<?php
$dom=new DOMDocument();
$dom->formatOutput=true;
$dom->load('test.xml');
$searchItems=$dom->getElementsByTagName('title');
$voteItems=$dom->getElementsByTagName('vote');
for ($i=0; $i < $searchItems->length; $i++){
$value = $voteItems($i)->textContent;
$value++;
$name=$searchItems($i);
$xpath = new DOMXPath($dom);
$resulted = $xpath->query('div[@class="active"]');
$active= $div->$resulted->query('div[@class="content"]');
$names=$active->getAttribute('song');
preg_match($names, "i");
if(preg_match($names, $name )){
$voteItems($i)->length->textContent=$value;
$vote = voteItems($i)->length->textContent;
$results=array($name, $vote);
$txt=($name." scored: ".$vote." votes");
echo $txt;
}
}
?>
What this is doing, just in case there's a WAYYY better way to do this, is checking an XML sheet I have that looks like this:
<playlist>
<song>
<source>imgs/Beck.jpg</source>
<title>Modern Guilt</title>
<artist>Beck</artist>
<vote>0</vote>
<plays>y</plays>
</song>
</playlist>
It checks for the title, and vote for each song, and stores those.
It then for each gets the value of the vote tag, adds one, then searches the HTML for the <div> with the class="active", and then searches that div with active and finds the inner div that has the class="content", then returns that div's "song" attribute.
I then have a preg match to check whether the name of the current "i" in the loop matches the string from the "song" attribute.
If it's a match, that "i"'s value will +1 to its value. It will then store that name value, and the new vote value as an array for me to check.
After that I'd like to save the XML sheet's new change. (otherwise I would've just kept this in JavaScript)
Any tips, hints, and help would be greatly appreciated! As a newbie to PHP I love to learn more!
$value = $voteItems($i)->textContent;
This is incorrect. You're trying to take $voteItems and use it as the name of a function.
PHP does allow calling a function through a variable:
function foo() {}
$name = "foo";
$name(); // calls foo()
However, that can only work if $name is a string. Here, $voteItems is a DOMNodeList, hence the error message.
From looking at the documentation, it seems like you meant to write:
$value = $voteItems->item($i)->textContent;
// ^^^^^^^^^^
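Putting that together, here is a minimal sketch of the vote-counting part using item($i) throughout. I've swapped the HTML lookup for a plain $activeTitle variable (a placeholder for however you end up reading the "song" attribute) and loaded the XML from a string so the example is self-contained.

```php
<?php
// Sample playlist from the question, inlined instead of loading test.xml.
$xml = <<<XML
<playlist>
  <song>
    <source>imgs/Beck.jpg</source>
    <title>Modern Guilt</title>
    <artist>Beck</artist>
    <vote>0</vote>
    <plays>y</plays>
  </song>
</playlist>
XML;

$activeTitle = 'Modern Guilt';   // hypothetical: the song currently marked active

$dom = new DOMDocument();
$dom->formatOutput = true;
$dom->loadXML($xml);

$titles = $dom->getElementsByTagName('title');
$votes  = $dom->getElementsByTagName('vote');

for ($i = 0; $i < $titles->length; $i++) {
    // item($i) fetches the i-th node from the DOMNodeList.
    if ($titles->item($i)->textContent === $activeTitle) {
        $vote = $votes->item($i);
        // Replace the element's text with the incremented count.
        $vote->nodeValue = (string) ((int) $vote->textContent + 1);
        echo $activeTitle . ' scored: ' . $vote->textContent . " votes\n";
    }
}

echo $dom->saveXML();
```

To persist the change you would call $dom->save('test.xml') instead of echoing. This relies on every song having exactly one title and one vote, so the two lists line up by index.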
I need to access table cell values via DOM / PHP. The web page is loaded into $myHTML. I have identified the XPath as :
//*[@id="main-content-inner"]/div[2]/div[1]/div/div/table/tbody/tr/td[1]
I want to get the text of the value in the cell as follows:
$dom = new DOMDocument();
$dom->loadHTML($myHTML);
$xpath = new DOMXPath($dom);
$myValue = $xpath->query('//*[@id="main-content-inner"]/div[2]/div[1]/div/div/table/tbody/tr/td[1]');
echo $myValue->nodeValue;
But I am getting an "Undefined Property: DOMNodeList::$nodeValue" error. How do I retrieve the value of this table cell? I have tried various techniques from stackoverflow with no luck.
DOMXPath::query() returns a DOMNodeList, even if there's only one match.
If you know for sure you have a match there, you can use
echo $myValue->item(0)->nodeValue;
But if you want to be bulletproof, you'd better check the length first, e.g.
if ($myValue->length > 0) {
echo $myValue->item(0)->nodeValue;
} else {
//No such cell. What now?
}
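One more thing that may be worth checking if the list comes back empty: XPaths copied from browser dev tools often include tbody, which browsers insert even when the page source has none, and as far as I know DOMDocument does not add it. A small sketch with made-up markup (the real page will differ) that uses a looser path and the length check:

```php
<?php
// Hypothetical page fragment; note there is no <tbody> in the source.
$myHTML = '<div id="main-content-inner"><table>'
        . '<tr><td>42</td><td>other</td></tr>'
        . '</table></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // silence warnings from imperfect markup
$dom->loadHTML($myHTML);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// "//table//tr" matches rows whether or not a tbody sits in between.
$cells = $xpath->query('//*[@id="main-content-inner"]//table//tr/td[1]');

// query() returns a DOMNodeList, so take the first item only if it exists.
$myValue = $cells->length > 0 ? trim($cells->item(0)->nodeValue) : null;
echo $myValue, PHP_EOL;
```

The looser path trades precision for robustness, so tighten it again if the container holds more than one table.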
Can you please help me with the correct syntax to use when you want to check the innerHTML/nodeValue of an element?
I have no problem with the Name; however, the Age is within a plain div element. What is the correct syntax to use in place of "NOT SURE WHAT TO PUT HERE" below?
$html is a page from the internet.
The person's name is in a span like:
<span class="fullname">John Smith</span>
The person's age is in a div like:
<div>Age: 28</div>
I have the following PHP:
<?php
$dom = new DomDocument();
@$dom->loadHTML($html);
$finder = new DOMXPath($dom);
//Full Name
$findName = "fullname";
$queryName = $finder->query("//span[contains(@class, '$findName')]");
$name = $queryName->item(0)->nodeValue;
//Age
$findAge = "Age: ";
$queryAge = $finder->query("//div[NOT SURE WHAT TO PUT HERE]");
$age = substr($queryAge->item(0)->nodeValue, 5);
?>
Try
$queryAge = $finder->query("//div[starts-with(., '$findAge')]");
I've had limited success with starts-with() due to whitespace so you may have to resort to
$queryAge = $finder->query("//div[contains(., '$findAge')]");
If there's a chance of finding false positives (i.e. other divs with "Age: " in them), you might be able to avoid that by using a more specific path (if known), i.e.
$queryAge = $finder->query("//div[@id='something']//div[contains(., '$findAge')]");
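If whitespace is what trips up starts-with(), wrapping the node in normalize-space() may help, since it trims leading and trailing whitespace before the comparison. A minimal sketch, assuming markup like the question's; the extra line breaks inside the div are my own addition to show the effect.

```php
<?php
// Sample markup based on the question, with stray whitespace added around the age.
$html = '<span class="fullname">John Smith</span><div>
    Age: 28
</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // alternative to the @ suppression operator
$dom->loadHTML($html);
libxml_clear_errors();

$finder = new DOMXPath($dom);

// Full name
$queryName = $finder->query("//span[contains(@class, 'fullname')]");
$name = $queryName->length > 0 ? trim($queryName->item(0)->nodeValue) : null;

// Age: normalize-space() strips the surrounding whitespace before starts-with() runs.
$prefix = 'Age:';
$queryAge = $finder->query("//div[starts-with(normalize-space(.), '$prefix')]");
$age = null;
if ($queryAge->length > 0) {
    $text = trim($queryAge->item(0)->nodeValue);      // "Age: 28"
    $age  = (int) trim(substr($text, strlen($prefix)));
}

echo $name . ' is ' . $age . PHP_EOL;
```

Cutting at strlen($prefix) rather than a hard-coded 5 keeps the substr() in step with whatever label you search for.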