Trying to get good at PHP web scraping. I've been doing some tests and have nailed scraping and echoing information from one site to another, but I'm unable to also include the original links from the source code, which is what I'd ideally like to do. Any thoughts on how to accomplish this with what I've got thus far? (I'm very new to PHP, btw.)
this is the php code:
// news
$doc = new DOMDocument;
// We don't want to bother with white spaces
$doc->preserveWhiteSpace = false;
// Most HTML Developers are chimps and produce invalid markup...
$doc->strictErrorChecking = false;
$doc->recover = true;
$doc->loadHTMLFile('https://www.usatoday.com/');
$xpath = new DOMXPath($doc);
$query = "//ul[@class='hfwmm-list hfwmm-4uphp-list hfwmm-light-list']";
$entries = $xpath->query($query);
foreach ($entries as $entry) {
echo trim($entry->textContent); // use `trim` to eliminate spaces
}
that code is spitting out this: NBA Cavs win record-breaking Game 4 behind Irving's 40 Entertain This Watch: 'Black Panther' trailer unleashes a fearsome king News Police: London Bridge terrorists planned more bloodshed How Trump is highlighting divisions amo..........
Now what I'd really like is to have those as working links, as they were in the original page. This is what the source code for this information looked like:
<div class="partner-heroflip-ad partner-placement ui-flip-panel size-xxs"></div></div><p class="hfwmm-tertiary-
list-title hfwmm-light-tertiary-list-title">TOP STORIES</p><ul class="hfwmm-
list hfwmm-4uphp-list hfwmm-light-list"
data-track-prefix="flex4uphphero"><li class="hfwmm-item hfwmm-secondary-item
hfwmm-item-2 sports-theme-bg hfwmm-first-secondary-item hfwmm-4uphp-
secondary-item"
data-asset-position="1"
data-asset-id="102694848"
><a class="js-asset-link hfwmm-list-link hfwmm-light-list-link hfwmm-image-
link hfwmm-secondary-link
href="/story/sports/nba/2017/06/10/kyrie-irving-lebron-james-cavs-win-game-
4/102694848/"
data-track-display-type="thumb"
data-ht="flex4uphpherostack1"
data-asset-id="102694848"
><span class="hfwmm-image-gradient hfwmm-secondary-image-gradient"></span>
<span class="js-asset-section theme-bg-ssts-label hfwmm-ssts-label-top-left
hfwmm-ssts-label-secondary sports-theme-bg">NBA</span><img
src="https://www.gannett-cdn.com/-
mm-/cd17823b265aa373c83094fc75525710f645ec90/c=0-178-4072-
81338209183-USP-NBA-FINALS-GOLDEN-STATE-WARRIORS-AT-CLEVELAND-91573076.JPG"
class="hfwmm-image hfwmm-secondary-image js-asset-image placeholder-hide"
alt="Kyrie Irving reacts after making a basket against the"
data-id="102695338"
data-crop="16_9"
width="239"
height="135" /><span class="hfwmm-secondary-hed-wrap hfwmm-secondary-text-
hed-wrap"><span class="hfwmm-text-hed-icon js-asset-disposable"></span><span
title="Cavs win record-breaking Game 4 behind Irving's 40"
class="js-asset-headline hfwmm-list-hed hfwmm-secondary-hed placeholder-
hide">
Cavs win record-breaking Game 4 behind Irving's 40
hfwmm-item-3 life-theme-bg hfwmm-4uphp-secondary-item"
data-asset-position="2"
For sanity, the href above is href="/story/sports/nba/2017/06/10/kyrie-irving-lebron-james-cavs-win-game-
4/102694848/"
Any thoughts on how this might be accomplished in this test scenario, would be hugely helpful. Thank you very much. -wilson
You need to output the element as a string; you're just extracting the text of the element (not the same thing with XML). The element may be <a>some text</a>, but its text is simply some text.
To output the tags, use...
$query = "//ul[@class='hfwmm-list hfwmm-4uphp-list hfwmm-light-list']//a";
$entries = $xpath->query($query);
foreach ($entries as $entry) {
$newdoc = new DOMDocument();
$cloned = $entry->cloneNode(TRUE);
$newdoc->appendChild($newdoc->importNode($cloned,TRUE));
echo $newdoc->saveHTML();
//echo trim((string)$entry); // use `trim` to eliminate spaces
}
Also note that I've added //a on the end of the XPath expression to limit the selection to links in the segment you were fetching. This may or may not be what you want, but look at the results and check it out.
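To make the difference between an element's text and its markup concrete, here is a minimal sketch. The sample HTML and the class name "stories" are made up for illustration; they are not the real USA Today markup. It also relies on saveHTML() accepting a node argument, which should be available in any reasonably recent PHP.

```php
<?php
// Sample markup standing in for the scraped list (illustrative only)
$html = '<ul class="stories"><li><a href="/story/1/">First <span>headline</span></a></li></ul>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Select the link inside the list, as the //a suffix does in the answer above
$link = $xpath->query("//ul[@class='stories']//a")->item(0);

$text   = trim($link->textContent); // text only, all tags dropped
$markup = $doc->saveHTML($link);    // the element serialized together with its tags

echo $text, "\n";
echo $markup, "\n";
```

Passing the node straight to saveHTML() avoids building a second DOMDocument for each link, although the clone-and-import approach works as well.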
Edit:
To manipulate the href in the <a> tag, use something like...
foreach ($entries as $entry) {
$oldHref = (string)$entry->getAttribute("href");
$entry->setAttribute("href", "http://someserver.com".$oldHref);
$newdoc = new DOMDocument();
$cloned = $entry->cloneNode(TRUE);
$newdoc->appendChild($newdoc->importNode($cloned,TRUE));
echo $newdoc->saveHTML();
}
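If some links on the page are already absolute, prefixing every href would break them. A hedged sketch of one way to handle that follows: absolutizeLinks() is a hypothetical helper name, the sample links are invented, and it only treats an href starting with a single "/" as site-relative.

```php
<?php
// Hypothetical helper: prefix site-relative hrefs with a base URL and
// return each link serialized as HTML.
function absolutizeLinks(string $html, string $base): array
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings about imperfect markup
    $out = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        // A single leading slash is site-relative; "//host/..." is protocol-relative
        if (strncmp($href, '/', 1) === 0 && strncmp($href, '//', 2) !== 0) {
            $a->setAttribute('href', rtrim($base, '/') . $href);
        }
        $out[] = $doc->saveHTML($a);
    }
    return $out;
}

$links = absolutizeLinks(
    '<a href="/story/sports/">Sports</a><a href="https://example.com/x">Other</a>',
    'https://www.usatoday.com/'
);
print_r($links);
```

Paths without a leading slash (relative to the current page) are left alone here; resolving those properly needs the URL of the page being scraped.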
I know how to xpath and echo text off another website via tags like div id, class ,etc, using the below code. But, I don't know how to do it under more precise conditions, for example when trying to scrape and echo a bit of text that has no unique tag identifier like a div.
This below code spits out scraped data.
$doc = new DOMDocument;
// We don't want to bother with white spaces
$doc->preserveWhiteSpace = false;
// Most HTML Developers are chimps and produce invalid markup...
$doc->strictErrorChecking = false;
$doc->recover = true;
$doc->loadHTMLFile('http://www.nbcnews.com/business');
$xpath = new DOMXPath($doc);
$query = "//div[@class='market']";
$entries = $xpath->query($query);
foreach ($entries as $entry) {
echo trim($entry->textContent); // use `trim` to eliminate spaces
}
In the source code below, for example, I want to pull the value "21,271.97". But there's no unique tag for this, no div id. Is it possible to pull this data by identifying a keyword in the <p> that never changes, for example "DJIA all time"?
<p>DJIA All Time, Record-High Close: <font color="#0000FF">June 9,
2017</font>
(<font color="#FF0000"><b bgcolor="#FFFFCC"><font face="Verdana, Arial,
Helvetica, sans-serif" size="2">21,271.97</font></b></font>)</p>
Wondering if I could possibly replace $query = "//div[@class='market']"; with something along the lines of
$query = "//p['DJIA all time']";
Could this be possible?
I also wonder if using a loop with something like $query = "//p[='DJIA']";?
could work, though I don't know how to use that exactly.
Thanks!!
It would be good to have a play with an online XPath tester - I use https://www.freeformatter.com/xpath-tester.html#ad-output
$query = "//p[contains(text(),'DJIA')]";
Although if you use the page you're after, I've found that the value seems to be the first record for...
$query = "//span[contains(@class,'market_price')]";
But the idea is the same in both cases: contains(source, value) matches a set of nodes. In the first case text() is the text of the node; the second looks for a specific class name within the class attribute.
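As a quick check of the contains() approach, here is a small sketch that runs against the snippet from the question rather than the live page. Two things to keep in mind: XPath string matching is case-sensitive, so 'DJIA all time' would not match 'DJIA All Time', and in XPath 1.0 contains(text(), ...) only looks at the first text node of the <p>.

```php
<?php
// The <p> from the question, copied into a string for testing
$html = <<<'HTML'
<p>DJIA All Time, Record-High Close: <font color="#0000FF">June 9,
2017</font>
(<font color="#FF0000"><b bgcolor="#FFFFCC"><font face="Verdana, Arial,
Helvetica, sans-serif" size="2">21,271.97</font></b></font>)</p>
HTML;

$doc = new DOMDocument();
@$doc->loadHTML($html); // suppress warnings about imperfect markup
$xpath = new DOMXPath($doc);

// Find the <p> whose first text node mentions the label, then the <b> inside it
$nodes = $xpath->query("//p[contains(text(), 'DJIA All Time')]//b");
$value = $nodes->length ? trim($nodes->item(0)->textContent) : null;

echo $value, "\n";
```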
Try to use below XPath expression:
//p[contains(text(), "DJIA All Time")]//b/font
Considering provided link (http://www.nbcnews.com/business) you can get required text with
//span[text()="DJIA"]/following-sibling::span[@class="market_item market_price"]
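To see how the following-sibling axis behaves, here is a sketch on hypothetical markup. The real page structure is not shown in this thread, so the span layout, the market_name class and the placeholder prices below are all assumptions. Note that without a position predicate the expression matches every later sibling with that class, so [1] is added to keep only the nearest one.

```php
<?php
// Hypothetical markup with placeholder prices (not real data)
$html = '<div class="market">'
      . '<span class="market_item market_name">DJIA</span>'
      . '<span class="market_item market_price">111.11</span>'
      . '<span class="market_item market_name">NASDAQ</span>'
      . '<span class="market_item market_price">222.22</span>'
      . '</div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$base = '//span[text()="DJIA"]/following-sibling::span[@class="market_item market_price"]';

$all   = $xpath->query($base);         // every matching sibling after "DJIA"
$first = $xpath->query($base . '[1]'); // only the nearest one

echo $all->length, "\n";
echo $first->item(0)->textContent, "\n";
```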
I'm trying to scrape data from a website. I'm stuck on ratings.
They have something like this:
<div class="rating-static rating-10 margin-top-none margin-bottom-sm"></div>
<div class="rating-static rating-13 margin-top-none margin-bottom-sm"></div>
<div class="rating-static rating-46 margin-top-none margin-bottom-sm"></div>
Where rating-10 is actually one star, rating-13 two stars in my case, rating-46 will be five stars in my script.
Rating range can be from 0-50.
My plan is to create a switch: if I get a class in the range 1-10, I will know that it is one star, 11-20 two stars, and so on.
Any idea, any help will be appreciated.
Try this
<?php
$data = '<div class="rating-static rating-10 margin-top-none margin-bottom-sm"></div>';
$dom = new DOMDocument;
$dom->loadHTML($data);
$xpath = new DomXpath($dom);
$div = $dom->getElementsByTagName('div')[0];
$div_style = $div->getAttribute('class');
$final_data = explode(" ",$div_style);
echo $final_data[1];
?>
this will give you expected output.
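Going one step further, the "rating-N" token can be turned into a star count without a switch. This is only a sketch of the mapping described in the question (1-10 is one star, 11-20 two stars, and so on); starsFromClass() is a hypothetical helper name, and it assumes the number always follows "rating-".

```php
<?php
// Sketch: map a "rating-N" class token (N from 0 to 50) to a star count,
// 10 points per star, rounding up.
function starsFromClass(string $class): ?int
{
    // Look for a rating-<digits> token anywhere in the class list;
    // "rating-static" is skipped because it has no digits.
    if (!preg_match('/(?:^|\s)rating-(\d+)(?:\s|$)/', $class, $m)) {
        return null;
    }
    return (int) ceil((int) $m[1] / 10);
}

$dom = new DOMDocument();
$dom->loadHTML('<div class="rating-static rating-13 margin-top-none margin-bottom-sm"></div>');
$div = $dom->getElementsByTagName('div')->item(0);

echo starsFromClass($div->getAttribute('class')), "\n"; // prints 2
```

Matching the token with a regular expression means the result does not depend on rating-N being the second class in the attribute.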
I had a similar project; this should be the way to do it if you want to parse the whole HTML site
$dom = new DOMDocument();
$dom->loadHTML($html); // The HTML Source of the website
foreach ($dom->getElementsByTagName('div') as $node){
if($node->getAttribute("class") == "rating-static"){
$array = explode(" ", $node->getAttribute("class"));
$ratingArray = explode("-", $array[1]); // $array[1] is rating-10
//$ratingArray[1] would be 10
// do whatever you like with the information
}
}
You may have to change the if part to a strpos check. I haven't tested this script, but I think getAttribute("class") returns all classes, in which case the if statement would be:
if(strpos($node->getAttribute("class"), "rating-static") !== false)
FYI, try using QueryPath for future parsing needs. It's just a wrapper around the PHP DOM parser and works really, really well.
I've recently been playing with DOMXPath in PHP and had success with it. To get more experience, I've been grabbing certain elements from different sites. I am having trouble getting the weather marker off of this website: http://www.theweathernetwork.com/weather/cape0005
Specifically I want
//*[@id='theTemperature']
Here is what I have
$url = file_get_contents('http://www.theweathernetwork.com/weather/cape0005');
$dom = new DOMDocument();
@$dom->loadHTML($url);
$xpath = new DOMXPath($dom);
$tags = $xpath->query("//*[@id='theTemperature']");
foreach ($tags as $tag){
echo $tag->nodeValue;
}
Is there something I am doing wrong here? I am able to produce actual results on other tags on the page but specifically not this one.
Thanks in advance.
You might want to improve your DOMDocument debugging skills, here some hints (Demo):
<?php
header('Content-Type: text/plain;');
$url = file_get_contents('http://www.theweathernetwork.com/weather/cape0005');
$dom = new DOMDocument();
@$dom->loadHTML($url);
$xpath = new DOMXPath($dom);
$tags = $xpath->query("//*[@id='theTemperature']");
foreach ($tags as $i => $tag){
echo $i, ': ', var_dump($tag->nodeValue), ' HTML: ', $dom->saveHTML($tag), "\n";
}
Output the number of the found node, I do it here with $i in the foreach.
var_dump the ->nodeValue, it helps to show what exactly it is.
Output the HTML by making use of the saveHTML function which shows a better picture.
The actual output:
0: string(0) ""
HTML: <p id="theTemperature"></p>
You can easily spot that the element is empty, so the temperature must go in from somewhere else, e.g. via javascript. Check the Network tools of your browser.
What happens is straightforward: the page contains an empty id="theTemperature" element, which is a placeholder to be populated with JavaScript. file_get_contents() just downloads the page without executing JavaScript, so the element remains empty. Try loading the page in a browser with JavaScript disabled to see it yourself.
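The effect can be reproduced offline. In this sketch the inline script and the value "5" are invented for illustration; the point is only that DOMDocument parses the script as text and never runs it, so the placeholder stays empty.

```php
<?php
// An empty placeholder plus a script that would fill it in a browser
$html = '<p id="theTemperature"></p>'
      . '<script>document.getElementById("theTemperature").textContent = "5";</script>';

$dom = new DOMDocument();
@$dom->loadHTML($html); // suppress warnings about imperfect markup
$xpath = new DOMXPath($dom);

$node = $xpath->query("//*[@id='theTemperature']")->item(0);

// The element is found, but it has no content because no JavaScript ran
var_dump($node->nodeValue); // string(0) ""
```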
The element you're trying to select is indeed empty. The page loads the temperature into that id through ajax. Specifically this script:
http://www.theweathernetwork.com/common/js/master/citypage_ajax.js?cb=201301231338
but when you do a file_get_contents those scripts obviously don't get resolved. I'd go with guido's solution of using the RSS
I have actually seen this question quite a bit here, but none of them are exactly what I want... Let's say I have the following phrase:
Line 1 - This is a TEST phrase.
Line 2 - This is a <img src="TEST" /> image.
Line 3 - This is a TEST link.
Okay, simple right? I am trying the following code:
$linkPin = '#(\b)TEST(\b)(?![^<]*>)#i';
$linkRpl = '$1<a href="newurl">TEST</a>$2';
$html = preg_replace($linkPin, $linkRpl, $html);
As you can see, it takes the word TEST, and replaces it with a link to test. The regular expression I am using right now works good to avoid replacing the TEST in line 2, it also avoids replacing the TEST in the href of line 3. However, it still replaces the text encapsulated within the tag on line 3 and I end up with:
Line 1 - This is a <a href="newurl">TEST</a> phrase.
Line 2 - This is a <img src="TEST" /> image.
Line 3 - This is a <a href="newurl">TEST</a> link.
This I do not want as it creates bad code in line 3. I want to not only ignore matches inside of a tag, but also encapsulated by them. (remember to keep note of the /> in line 2)
Honestly, I'd do this with DomDocument and Xpath:
//First, create a simple html string around the text.
$html = '<html><body><div id="#content">'.$text.'</div></body></html>';
$dom = new DomDocument();
$dom->loadHtml($html);
$xpath = new DomXpath($dom);
$query = '//*[not(name() = "a") and contains(., "TEST")]';
$nodes = $xpath->query($query);
//Force it to an array to break the reference so iterating works properly
$nodes = iterator_to_array($nodes);
$replaceNode = function ($node) {
    // Escape the text first so characters such as & survive appendXML()
    $text = htmlspecialchars($node->wholeText);
    $text = str_replace('TEST', '<a href="newurl">TEST</a>', $text);
    $fragment = $node->ownerDocument->createDocumentFragment();
    $fragment->appendXML($text);
    $node->parentNode->replaceChild($fragment, $node);
};
foreach ($nodes as $node) {
    if ($node instanceof DomText) {
        $replaceNode($node);
    } else {
        // Copy the live child list before replacing nodes in it
        foreach (iterator_to_array($node->childNodes) as $child) {
            if ($child instanceof DomText) {
                $replaceNode($child);
            }
        }
    }
}
This should work for you, since it ignores all text inside of <a> elements, and only replaces the text directly inside of the matching tags.
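A variation on the same idea is to select the text nodes directly and exclude any that have an <a> ancestor. The sketch below is self-contained so it can be tried as is; linkifyOutsideAnchors() is a hypothetical name, the sample text and the "newurl" target are illustrative, and the word is assumed not to contain double quotes since it is placed straight into the XPath expression.

```php
<?php
// Sketch: link a word in text nodes that are not already inside an <a>.
function linkifyOutsideAnchors(string $text, string $word, string $url): string
{
    $dom = new DOMDocument();
    // Wrap the fragment so it has a single known container element
    @$dom->loadHTML('<html><body><div id="content">' . $text . '</div></body></html>');
    $xpath = new DOMXPath($dom);

    // Text nodes containing the word, skipping any that sit inside an <a>
    $query = sprintf('//text()[contains(., "%s") and not(ancestor::a)]', $word);

    foreach (iterator_to_array($xpath->query($query)) as $node) {
        // Escape first so the text is valid XML for appendXML()
        $escaped = htmlspecialchars($node->nodeValue);
        $link = '<a href="' . htmlspecialchars($url) . '">' . $word . '</a>';
        $fragment = $dom->createDocumentFragment();
        $fragment->appendXML(str_replace($word, $link, $escaped));
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialize only the children of the wrapper <div>
    $out = '';
    foreach ($xpath->query('//div[@id="content"]')->item(0)->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$out = linkifyOutsideAnchors(
    'This is a TEST phrase. <img src="TEST" /> <a href="x/TEST">TEST</a>',
    'TEST',
    'newurl'
);
echo $out, "\n";
```

Because attributes are never text nodes, the src and href values are left alone without any lookahead tricks.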
Okay... I think I came up with a better solution...
$noMatch = '(</a>|</h\d+>)';
$linkUrl = 'http://www.test.com/test/'.$link['page_slug'];
$linkPin = '#(?!(?:[^<]+>|[^>]+'.$noMatch.'))\b'.preg_quote($link['page_name']).'\b#i';
$linkRpl = '<a href="'.$linkUrl.'">'.$link['page_name'].'</a>';
$page['HTML'] = preg_replace($linkPin, $linkRpl, $page['HTML']);
With this code, it won't process any text within <a> tags and <h#> tags. I figure, any new exclusions I want to add, simply need to be added to $noMatch.
Am I wrong in this method?
Note: The input HTML is trusted; it is not user defined!
I'll highlight what I need with an example.
Given the following HTML:
<p>
Welcome to <a href="http://google.com/">Google.com</a>!<br>
Please, <a href="enjoy.html">enjoy</a> your stay!
</p>
I'd like to convert it to:
Welcome to Google.com[1]
Please, enjoy[2] your stay!
[1] http://google.com/
[2] %request-uri%/enjoy.html <- note, request uri is something I define
for relative paths
I'd like to be able to customize it.
Edit: On a further note, I'd better explain myself and my reasons
We have an automated templating system (with stylesheets!) for emails and, as part of the system, I'd like to generate multipart emails, i.e. emails which contain both HTML and TEXT.
The system only provides HTML.
I need to convert this HTML to text meaningfully, eg, I'd like to somehow retain any links and images, perhaps in the format I specified above.
You could use the DOM to do the following:
$doc = new DOMDocument();
$doc->loadHTML('…');
$anchors = array();
foreach ($doc->getElementsByTagName('a') as $anchor) {
if ($anchor->hasAttribute('href')) {
$href = $anchor->getAttribute('href');
if (!isset($anchors[$href])) {
$anchors[$href] = count($anchors) + 1;
}
$index = $anchors[$href];
$anchor->parentNode->replaceChild($doc->createElement('a', $anchor->nodeValue." [$index]"), $anchor);
}
}
$html = strip_tags($doc->saveHTML());
$html = preg_replace('/^[\t ]+|[\t ]+$/m', '', $html);
foreach ($anchors as $href => $index) {
$html .= "\n[$index] $href";
}
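For reference, here is a self-contained sketch along the same lines that also resolves relative links against a base URI, as the question asks. htmlToTextWithLinks() and the $baseUri parameter are hypothetical names standing in for the "request uri" from the question, the sample input assumes the two phrases in the example were links, and line breaks come only from newlines already present in the HTML (a <br> alone adds none).

```php
<?php
// Sketch: replace anchors with numbered markers and append the link list.
function htmlToTextWithLinks(string $html, string $baseUri): string
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings about imperfect markup
    $links = [];

    // Snapshot the live node list before modifying the tree
    foreach (iterator_to_array($doc->getElementsByTagName('a')) as $a) {
        if (!$a->hasAttribute('href')) {
            continue;
        }
        $href = $a->getAttribute('href');
        // Treat anything without a scheme as relative to $baseUri
        if (!preg_match('#^[a-z][a-z0-9+.-]*:#i', $href)) {
            $href = rtrim($baseUri, '/') . '/' . ltrim($href, '/');
        }
        if (!isset($links[$href])) {
            $links[$href] = count($links) + 1;
        }
        // A plain text node avoids entity problems with createElement()
        $marker = $doc->createTextNode($a->textContent . '[' . $links[$href] . ']');
        $a->parentNode->replaceChild($marker, $a);
    }

    $text = trim($doc->documentElement->textContent);
    $text = preg_replace('/^[\t ]+|[\t ]+$/m', '', $text);

    $notes = [];
    foreach ($links as $href => $index) {
        $notes[] = "[$index] $href";
    }
    return $notes ? $text . "\n\n" . implode("\n", $notes) : $text;
}

$out = htmlToTextWithLinks(
    "<p>\nWelcome to <a href=\"http://google.com/\">Google.com</a>!<br>\n"
    . "Please, <a href=\"enjoy.html\">enjoy</a> your stay!\n</p>",
    'http://example.com/mail'
);
echo $out, "\n";
```

Images could be handled the same way by replacing each <img> with a marker built from its alt and src attributes.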