I'm trying to scrape data from a website, and I'm stuck on the ratings.
They have something like this:
<div class="rating-static rating-10 margin-top-none margin-bottom-sm"></div>
<div class="rating-static rating-13 margin-top-none margin-bottom-sm"></div>
<div class="rating-static rating-46 margin-top-none margin-bottom-sm"></div>
Here rating-10 is one star, rating-13 is two stars, and rating-46 is five stars in my script.
The rating value can range from 0 to 50.
My plan is to create a switch: if the class value is in the range 1-10, I know it is one star, 11-20 is two stars, and so on.
Any idea, any help will be appreciated.
Try this:
<?php
$data = '<div class="rating-static rating-10 margin-top-none margin-bottom-sm"></div>';
$dom = new DOMDocument;
$dom->loadHTML($data);
// item(0) is the first <div> in the document
$div = $dom->getElementsByTagName('div')->item(0);
$div_class = $div->getAttribute('class');
// split the class attribute into its individual class names
$final_data = explode(" ", $div_class);
echo $final_data[1];
?>
This prints rating-10, the second class on the div.
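The switch described in the question can probably be replaced by a little arithmetic. The following is a minimal sketch, assuming the mapping stated there (1-10 is one star, 11-20 is two stars, and so on); `ratingToStars` is a hypothetical helper name, not part of any library.

```php
<?php
// Hypothetical helper: map a class string such as "rating-static rating-13 ..." to 0-5 stars.
function ratingToStars(string $class): ?int
{
    // Match the numeric class only; "rating-static" is skipped because it has no digits.
    if (!preg_match('/(?:^|\s)rating-(\d+)(?:\s|$)/', $class, $m)) {
        return null; // no numeric rating class present
    }
    $value = max(0, min(50, (int) $m[1])); // clamp to the 0-50 range from the question
    return (int) ceil($value / 10);        // 0 => 0, 1-10 => 1, 11-20 => 2, ... 41-50 => 5
}

$classes = [
    'rating-static rating-10 margin-top-none margin-bottom-sm',
    'rating-static rating-13 margin-top-none margin-bottom-sm',
    'rating-static rating-46 margin-top-none margin-bottom-sm',
];
foreach ($classes as $class) {
    echo $class, ' => ', ratingToStars($class), " star(s)\n";
}
```

Rounding up with ceil() keeps the boundaries where the question puts them, so 10 is still one star and 11 is already two.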
I had a similar project. This should be the way to do it if you want to parse the whole HTML page:
$dom = new DOMDocument();
$dom->loadHTML($html); // The HTML Source of the website
foreach ($dom->getElementsByTagName('div') as $node){
if($node->getAttribute("class") == "rating-static"){
$array = explode(" ", $node->getAttribute("class"));
$ratingArray = explode("-", $array[1]); // $array[1] is rating-10
//$ratingArray[1] would be 10
// do whatever you like with the information
}
}
I haven't tested this script, but getAttribute("class") returns the whole class string (all classes), so the equality check above will not match a div that has several classes. In that case, change the if to a strpos check:
if(strpos($node->getAttribute("class"), "rating-static") !== false)
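One caveat with strpos is that it would also match a class such as rating-static-legend. A possible alternative, sketched here under the assumption that DOMXPath is acceptable, is to match the class as a whole token. The sample markup and the rating-static-legend class are made up for illustration.

```php
<?php
// Made-up markup: two rating divs plus one div that merely starts with "rating-static".
$html = '<div class="rating-static rating-10 margin-top-none"></div>'
      . '<div class="rating-static rating-46 margin-bottom-sm"></div>'
      . '<div class="rating-static-legend"></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Whole-token match on the class list, so "rating-static-legend" is not selected.
$nodes = $xpath->query(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' rating-static ')]"
);

$ratings = [];
foreach ($nodes as $node) {
    // Look at every class name and keep the numeric part of "rating-NN".
    foreach (preg_split('/\s+/', trim($node->getAttribute('class'))) as $token) {
        if (preg_match('/^rating-(\d+)$/', $token, $m)) {
            $ratings[] = (int) $m[1];
        }
    }
}
print_r($ratings);
```

Searching all class names instead of relying on index 1 also keeps the code working if the site reorders its classes.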
FYI, try using QueryPath for future parsing needs. It's just a wrapper around the PHP DOM parser and works really well.
I am trying to scrape http://spys.one/free-proxy-list/ but I only want to get the "Proxy by ip:port" column.
I checked the website and there are 3 tables.
Anyone can help me out?
<?php
require "scrapper/simple_html_dom.php";
$html=file_get_html("http://spys.one/free-proxy-list/");
$html=new simple_html_dom($html);
$rows = array();
$table = $html->find('table',3);
var_dump($table);
Try the below script. It should fetch you only the required items and nothing else:
<?php
include 'simple_html_dom.php';
$url = "http://spys.one/free-proxy-list/";
$html = file_get_html($url);
foreach($html->find("table[width='65%'] tr[onmouseover]") as $file) {
$data = $file->find('td', 0)->plaintext;
echo $data . "<br/>";
}
?>
It produces output like:
176.94.2.84
178.150.141.93
124.16.84.208
196.53.99.7
31.146.161.238
I really don't know what your Simple HTML DOM library does. Anyway, nowadays PHP has everything on board that you need for parsing specific DOM elements. Just use PHP's own DOMXPath class for querying DOM elements.
Here's a short example for getting the first column of a table.
$dom = new \DOMDocument();
// loadHTML() expects markup, so use loadHTMLFile() to load from a URL
$dom->loadHTMLFile('https://your.url.goes.here');
$xpath = new \DOMXPath($dom);
// query the first td with class "value" in the table with class "attributes"
$elements = $xpath->query('(//table[@class="attributes"]//td[@class="value"])[1]');
// iterate through all found td elements
foreach ($elements as $element) {
echo $element->nodeValue;
}
This is only one possible example. It does not solve your exact problem with http://spys.one/free-proxy-list/, but it shows how you can get the first column of a specific table. All that remains is to find the right query for the table you want in the DOM of that site. Because the site uses a fairly complex table-based layout from ages ago, and the table you want to parse has no unique id or other distinguishing attribute, you will have to work that query out yourself.
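To make the "first column" idea concrete, here is a small self-contained sketch. The table markup and the proxies class are invented stand-ins for the downloaded page, not the real structure of the site, so the query would need adapting.

```php
<?php
// Made-up markup standing in for the downloaded page.
$html = '<table class="proxies">'
      . '<tr><td>176.94.2.84</td><td>HTTP</td></tr>'
      . '<tr><td>178.150.141.93</td><td>SOCKS5</td></tr>'
      . '</table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// td[1] is the first cell of each row, i.e. the first column of the table.
$firstColumn = [];
foreach ($xpath->query('//table[@class="proxies"]//tr/td[1]') as $cell) {
    $firstColumn[] = trim($cell->nodeValue);
}
echo implode("\n", $firstColumn), "\n";
```

Note that td[1] is applied per row, whereas wrapping the whole expression in parentheses and appending [1] returns only the first matching cell in the document.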
I'm trying to get good at PHP web scraping. I've been doing some tests and have nailed scraping and echoing information from one site to another, but I'm unable to also include the original links from the source code, which is what I'd ideally like to do. Any thoughts on how to accomplish this with what I've got thus far? (I'm very new to PHP, by the way.)
This is the PHP code:
// news
$doc = new DOMDocument;
// We don't want to bother with white spaces
$doc->preserveWhiteSpace = false;
// Most HTML Developers are chimps and produce invalid markup...
$doc->strictErrorChecking = false;
$doc->recover = true;
$doc->loadHTMLFile('https://www.usatoday.com/');
$xpath = new DOMXPath($doc);
$query = "//ul[@class='hfwmm-list hfwmm-4uphp-list hfwmm-light-list']";
$entries = $xpath->query($query);
foreach ($entries as $entry) {
echo trim($entry->textContent); // use `trim` to eliminate spaces
}
that code is spitting out this: NBA Cavs win record-breaking Game 4 behind Irving's 40 Entertain This Watch: 'Black Panther' trailer unleashes a fearsome king News Police: London Bridge terrorists planned more bloodshed How Trump is highlighting divisions amo..........
What I'd really like is to have those as working links, as they were in the original page. This is what the source code for this information looked like:
<div class="partner-heroflip-ad partner-placement ui-flip-panel size-xxs"></div></div><p class="hfwmm-tertiary-
list-title hfwmm-light-tertiary-list-title">TOP STORIES</p><ul class="hfwmm-
list hfwmm-4uphp-list hfwmm-light-list"
data-track-prefix="flex4uphphero"><li class="hfwmm-item hfwmm-secondary-item
hfwmm-item-2 sports-theme-bg hfwmm-first-secondary-item hfwmm-4uphp-
secondary-item"
data-asset-position="1"
data-asset-id="102694848"
><a class="js-asset-link hfwmm-list-link hfwmm-light-list-link hfwmm-image-
link hfwmm-secondary-link
href="/story/sports/nba/2017/06/10/kyrie-irving-lebron-james-cavs-win-game-
4/102694848/"
data-track-display-type="thumb"
data-ht="flex4uphpherostack1"
data-asset-id="102694848"
><span class="hfwmm-image-gradient hfwmm-secondary-image-gradient"></span>
<span class="js-asset-section theme-bg-ssts-label hfwmm-ssts-label-top-left
hfwmm-ssts-label-secondary sports-theme-bg">NBA</span><img
src="https://www.gannett-cdn.com/-
mm-/cd17823b265aa373c83094fc75525710f645ec90/c=0-178-4072-
81338209183-USP-NBA-FINALS-GOLDEN-STATE-WARRIORS-AT-CLEVELAND-91573076.JPG"
class="hfwmm-image hfwmm-secondary-image js-asset-image placeholder-hide"
alt="Kyrie Irving reacts after making a basket against the"
data-id="102695338"
data-crop="16_9"
width="239"
height="135" /><span class="hfwmm-secondary-hed-wrap hfwmm-secondary-text-
hed-wrap"><span class="hfwmm-text-hed-icon js-asset-disposable"></span><span
title="Cavs win record-breaking Game 4 behind Irving's 40"
class="js-asset-headline hfwmm-list-hed hfwmm-secondary-hed placeholder-
hide">
Cavs win record-breaking Game 4 behind Irving's 40
hfwmm-item-3 life-theme-bg hfwmm-4uphp-secondary-item"
data-asset-position="2"
For sanity, the href above is href="/story/sports/nba/2017/06/10/kyrie-irving-lebron-james-cavs-win-game-4/102694848/"
Any thoughts on how this might be accomplished in this test scenario, would be hugely helpful. Thank you very much. -wilson
You need to output the element as a string; you're just extracting the text of the element, which is not the same thing in XML. The element may be <a>some text</a>, whereas its text is simply some text.
To output the tags, use...
$query = "//ul[@class='hfwmm-list hfwmm-4uphp-list hfwmm-light-list']//a";
$entries = $xpath->query($query);
foreach ($entries as $entry) {
$newdoc = new DOMDocument();
$cloned = $entry->cloneNode(TRUE);
$newdoc->appendChild($newdoc->importNode($cloned,TRUE));
echo $newdoc->saveHTML();
//echo trim((string)$entry); // use `trim` to eliminate spaces
}
Also note that I've added //a to the end of the XPath expression to limit the selection to the links in the segment you were fetching. This may or may not be what you want, so look at the results and check.
Edit:
To manipulate the href in the <a> element, use something like...
foreach ($entries as $entry) {
$oldHref = (string)$entry->getAttribute("href");
$entry->setAttribute("href", "http://someserver.com".$oldHref);
$newdoc = new DOMDocument();
$cloned = $entry->cloneNode(TRUE);
$newdoc->appendChild($newdoc->importNode($cloned,TRUE));
echo $newdoc->saveHTML();
}
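As a variation on the code above, DOMDocument::saveHTML() also accepts a node argument, which should make the clone-and-import step unnecessary. This is a sketch on made-up markup: the news class, the story paths and the $base value are assumptions for illustration.

```php
<?php
// Made-up markup loosely modelled on the list in the question.
$html = '<ul class="news"><li><a href="/story/sports/1/">Cavs win</a></li>'
      . '<li><a href="/story/life/2/">Black Panther trailer</a></li></ul>';
$base = 'https://www.usatoday.com'; // assumed base URL for the relative links

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$links = [];
foreach ($xpath->query("//ul[@class='news']//a") as $a) {
    $href = $a->getAttribute('href');
    // Only rewrite root-relative paths ("/story/..."), not "//host/..." or absolute URLs.
    if (strpos($href, '/') === 0 && strpos($href, '//') !== 0) {
        $a->setAttribute('href', $base . $href);
    }
    $links[] = $doc->saveHTML($a); // serialize just this node, tags included
}
echo implode("\n", $links), "\n";
```

Each entry in $links is the full anchor markup, so echoing it on another page gives a working link.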
I am using explode to manipulate information I am scraping from a website. I am trying to eliminate something specific from the string so that it will return what I want and also add the rest of the items to the array.
$pageArray = explode('<td class="player-label"><a href="/nfl/players/antonio-brown.php?type=overall&week=draft">', $fantasyPros);
I would like to skip the antonio-brown section and use a regular expression or whatever is best to replace it so that it will not look for a specific name but every name on the list and add them to my array. Do you have any suggestions on what I should use here? I appreciate any assistance.
Seems like a parser job to me with appropriate xpath functions, e.g. not().
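Consider the following code:
<?php
// Sample markup: the first href comes from the question, the second is a made-up placeholder.
$data = <<<DATA
<td class="player-label">
<a href="/nfl/players/antonio-brown.php?type=overall&amp;week=draft">Some brown link here</a>
<a href="/nfl/players/another-player.php?type=overall&amp;week=draft">Some green link here</a>
</td>
DATA;
$dom = new DOMDocument();
$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
// select every link whose href does not contain "antonio-brown"
$green_links = $xpath->query("//a[not(contains(@href, 'antonio-brown'))]");
foreach ($green_links as $link) {
    // do sth. useful here
}
?>
This selects every link whose href does not contain antonio-brown.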
You can easily adjust this to td or any other element.
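If the goal is to collect every player rather than to exclude one, the same approach can be turned around. The sketch below assumes each player cell looks like the one quoted in the question; the second row, its example-player href and the array layout are invented for illustration.

```php
<?php
// Made-up markup; only the antonio-brown href comes from the question.
$fantasyPros = '<table><tr>'
    . '<td class="player-label"><a href="/nfl/players/antonio-brown.php?type=overall&amp;week=draft">Antonio Brown</a></td>'
    . '<td class="player-label"><a href="/nfl/players/example-player.php?type=overall&amp;week=draft">Example Player</a></td>'
    . '</tr></table>';

$dom = new DOMDocument();
$dom->loadHTML($fantasyPros);
$xpath = new DOMXPath($dom);

$players = [];
foreach ($xpath->query('//td[@class="player-label"]/a') as $link) {
    // Pull the slug between /players/ and .php out of each href.
    if (preg_match('~/players/([^/.]+)\.php~', $link->getAttribute('href'), $m)) {
        $players[$m[1]] = trim($link->nodeValue); // slug => link text
    }
}
print_r($players);
```

This avoids explode() on a hard-coded name entirely, since the XPath query already isolates one link per player.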
I've recently been playing with DOMXPath in PHP and have had success with it. To get more experience, I've been grabbing certain elements from different sites. I am having trouble getting the weather marker from this website: http://www.theweathernetwork.com/weather/cape0005
Specifically I want
//*[@id='theTemperature']
Here is what I have
$url = file_get_contents('http://www.theweathernetwork.com/weather/cape0005');
$dom = new DOMDocument();
@$dom->loadHTML($url);
$xpath = new DOMXPath($dom);
$tags = $xpath->query("//*[@id='theTemperature']");
foreach ($tags as $tag){
echo $tag->nodeValue;
}
Is there something I am doing wrong here? I am able to produce actual results on other tags on the page but specifically not this one.
Thanks in advance.
You might want to improve your DOMDocument debugging skills. Here are some hints:
<?php
header('Content-Type: text/plain;');
$url = file_get_contents('http://www.theweathernetwork.com/weather/cape0005');
$dom = new DOMDocument();
@$dom->loadHTML($url);
$xpath = new DOMXPath($dom);
$tags = $xpath->query("//*[@id='theTemperature']");
foreach ($tags as $i => $tag){
echo $i, ': ', var_dump($tag->nodeValue), ' HTML: ', $dom->saveHTML($tag), "\n";
}
Output the number of the found node, I do it here with $i in the foreach.
var_dump the ->nodeValue, it helps to show what exactly it is.
Output the HTML by making use of the saveHTML function which shows a better picture.
The actual output:
0: string(0) ""
HTML: <p id="theTemperature"></p>
You can easily spot that the element is empty, so the temperature must go in from somewhere else, e.g. via javascript. Check the Network tools of your browser.
What happens is straightforward: the page contains an empty id="theTemperature" element, which is a placeholder to be populated by JavaScript. file_get_contents() just downloads the page without executing any JavaScript, so the element remains empty. Try loading the page in a browser with JavaScript disabled to see it for yourself.
The element you're trying to select is indeed empty. The page loads the temperature into that id through ajax. Specifically this script:
http://www.theweathernetwork.com/common/js/master/citypage_ajax.js?cb=201301231338
However, when you call file_get_contents(), those scripts are obviously not executed. I'd go with guido's solution of using the RSS feed.
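The point made in these answers can be checked offline. The sketch below uses invented static markup in place of the downloaded page and reports which elements already carry text; the static id and the $report array are assumptions for illustration. It also uses libxml_use_internal_errors() as an alternative to the @ operator.

```php
<?php
// Made-up static markup mimicking what file_get_contents() would return.
$html = '<html><body><p id="theTemperature"></p><p id="static">Static text</p></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // alternative to suppressing warnings with @
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$report = [];
foreach (['theTemperature', 'static'] as $id) {
    $node = $xpath->query("//*[@id='$id']")->item(0);
    // An element that exists but has no text is likely filled in later by JavaScript.
    $report[$id] = trim($node->nodeValue) === ''
        ? 'empty (probably filled by JavaScript)'
        : 'has static content';
    echo $id, ': ', $report[$id], "\n";
}
```

When an element shows up as empty like this, the data has to come from whatever request the page's scripts make, or from another source such as a feed.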
Is it possible to get a certain class name through cURL?
For instance, I just want to get the 123 in this <div class="123 photo"></div>.
Is that possible?
Thanks for your responses!
Edit:
If this is what's seen on a website:
<div class="444 photo">...</div>
<div class="123 photo">...</div>
<div class="141 photo">...</div>
etc...
I'm trying to get all the numbers from these class attributes and put them in an array.
cURL is only half the solution. Its job is simply to retrieve the content. Once you have that content, then you can do string or structure manipulation. There are some text functions you could use, but it seems like you're looking for something specific among this content, so you may need something more robust.
Therefore, for this HTML content, I'd suggest researching DOMDocument, as it will structure your content into an XML-like hierarchy, but is more forgiving of the looser nature of HTML markup.
$ch = curl_init();
// [Snip cURL setup functions]
$content = curl_exec($ch);
$dom = new DOMDocument();
@$dom->loadHTML($content); // We use @ here to suppress a bunch of parsing errors that we shouldn't need to care about too much.
$divs = $dom->getElementsByTagName('div');
foreach ($divs as $div) {
if (strpos($div->getAttribute('class'), 'photo') !== false) {
// Now we know that our current $div is a photo
$targetValue = explode(' ', $div->getAttribute('class'));
// $targetValue will now be an array with each class definition.
// What is done with this information was not part of the question,
// so the rest is an exercise to the poster.
}
}
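For completeness, the remaining step of collecting the numbers might look like the sketch below. It runs on the sample markup from the question instead of a cURL response, and the extra sidebar div is made up to show that non-photo divs are skipped.

```php
<?php
// Sample markup from the question, plus one made-up non-photo div.
$content = '<div class="444 photo">...</div><div class="123 photo">...</div>'
         . '<div class="141 photo">...</div><div class="sidebar">...</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of using @
$dom->loadHTML($content);
libxml_clear_errors();

$numbers = [];
foreach ($dom->getElementsByTagName('div') as $div) {
    $classes = preg_split('/\s+/', trim($div->getAttribute('class')));
    if (!in_array('photo', $classes, true)) {
        continue; // not a photo div
    }
    foreach ($classes as $class) {
        if (ctype_digit($class)) {
            $numbers[] = $class; // kept as a string, e.g. "123"
        }
    }
}
print_r($numbers);
```

Checking for photo as a whole class name, instead of with strpos, avoids matching classes such as photos or photo-frame.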