cURL + PHP to get a class name, possible? - php
Is it possible to get a certain class name through cURL?
For instance, I just want to get the 123 in this <div class="123 photo"></div>.
Is that possible?
Thanks for your responses!
Edit:
If this is what's seen on a website:
<div class="444 photo">...</div>
<div class="123 photo">...</div>
<div class="141 photo">...</div>
etc...
I'm trying to get all the numbers from these classes and put them in an array.
cURL is only half the solution. Its job is simply to retrieve the content. Once you have that content, then you can do string or structure manipulation. There are some text functions you could use, but it seems like you're looking for something specific among this content, so you may need something more robust.
Therefore, for this HTML content, I'd suggest researching DOMDocument, as it will structure your content into an XML-like hierarchy, but is more forgiving of the looser nature of HTML markup.
$ch = curl_init();
// [Snip cURL setup functions]
$content = curl_exec($ch);
$dom = new DOMDocument();
@$dom->loadHTML($content); // We use @ here to suppress a bunch of parsing errors that we shouldn't need to care about too much.
$divs = $dom->getElementsByTagName('div');
foreach ($divs as $div) {
if (strpos($div->getAttribute('class'), 'photo') !== false) {
// Now we know that our current $div is a photo
$targetValue = explode(' ', $div->getAttribute('class'));
// $targetValue will now be an array with each class definition.
// What is done with this information was not part of the question,
// so the rest is an exercise to the poster.
}
}
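To finish the exercise from the edit in the question (collecting every number into an array), here is a minimal sketch. It assumes the numeric token always sits in the same class attribute as photo; the helper name photoIds and the sample markup are made up for illustration, and in practice $sample would be the $content returned by curl_exec().

```php
<?php
// Hypothetical helper: collect the numeric class token from every "photo" div.
function photoIds(string $html): array
{
    $dom = new DOMDocument();
    // Collect libxml warnings quietly instead of using the @ operator.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $ids = [];
    foreach ($dom->getElementsByTagName('div') as $div) {
        // Split the class attribute into individual class names.
        $classes = preg_split('/\s+/', trim($div->getAttribute('class')), -1, PREG_SPLIT_NO_EMPTY);
        if (!in_array('photo', $classes, true)) {
            continue; // not a photo div
        }
        foreach ($classes as $class) {
            if (ctype_digit($class)) {
                $ids[] = $class; // keep the purely numeric class name
            }
        }
    }
    return $ids;
}

// Sample markup mirroring the question (assumed, not fetched from a real site).
$sample = '<div class="444 photo">a</div><div class="123 photo">b</div><div class="141 photo">c</div>';
$ids = photoIds($sample);
print_r($ids);
```

Matching whole class names with in_array() avoids the false positives a plain strpos() check for 'photo' could give on a class such as photography.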
Related
php regex to add class to images without class
I'm looking for a PHP regex that checks whether an image has no class and, if so, adds "img-responsive" as that image's class. Thank you.
Instead of looking to implement a regular expression, make effective use of DOM instead.
$doc = new DOMDocument;
$doc->loadHTML($html); // load the HTML data
$imgs = $doc->getElementsByTagName('img');
foreach ($imgs as $img) {
    if (!$img->hasAttribute('class'))
        $img->setAttribute('class', 'img-responsive');
}
I would be tempted to do this in jQuery. That offers all the functionality you need in a few lines.
$(document).ready(function(){
    $('img').not('img[class]').each(function(e){
        $(this).addClass('img-responsive');
    });
});
If you have the output in PHP then an HTML parser is the way to do it. Regular expressions will always fail, in the end. If you don't want to use a parser, but you have the HTML code, you can try to do it with plain and simple PHP code:
function addClassToImagesWithout($html) // this function does what you want, given well-formed html
{
    // cut into parts where the images are
    $parts = explode('<img', $html);
    foreach ($parts as $key => $part) {
        // the first part is the text before any image, so leave it alone
        if ($key === 0) continue;
        // split at the end of tags, the image args are in the first bit
        $bits = explode('>', $part);
        // does it not contain a class
        if (strpos($bits[0], 'class=') === FALSE) {
            // insert the class
            $bits[0] .= " class='img-responsive'";
        }
        // recombine the bits
        $parts[$key] = implode('>', $bits);
    }
    // recombine the parts and return the html
    return implode('<img', $parts);
}
This code is untested and far from perfect, but it shows that regular expressions are not needed. You will have to add in some code to catch exceptions. I must stress that this code, just like regular expressions, will ultimately fail when, for instance, you have something like id='classroom', title='we are a class apart' or similar. To do a better job you should use a parser: http://htmlparsing.com/php.html
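As a sketch of the parser route recommended above, the XPath expression //img[not(@class)] selects only images that have no class attribute, so attribute values such as title='we are a class apart' cannot confuse it. The function name addDefaultImgClass and the sample markup are assumptions for illustration, not part of any answer.

```php
<?php
// Hypothetical helper: add class="img-responsive" to every <img> that has no class attribute.
function addDefaultImgClass(string $html): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//img[not(@class)]') as $img) {
        $img->setAttribute('class', 'img-responsive');
    }

    // loadHTML() wraps fragments in <html><body>; return only the body's children.
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

// Sample input (assumed): one image without a class, one with a class and a tricky title.
$result = addDefaultImgClass('<p><img src="a.png"><img class="logo" src="b.png" title="we are a class apart"></p>');
echo $result, "\n";
```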
Simple HTML DOM Parser - find class with random number
I'm trying to scrape data from a website, and I'm stuck on the ratings. They have something like this:
<div class="rating-static rating-10 margin-top-none margin-bottom-sm"></div>
<div class="rating-static rating-13 margin-top-none margin-bottom-sm"></div>
<div class="rating-static rating-46 margin-top-none margin-bottom-sm"></div>
Here rating-10 is actually one star, rating-13 is two stars in my case, and rating-46 will be five stars in my script. The rating can range from 0 to 50. My plan is to create a switch: if the class number is in the range 1-10 I will know that it is one star, 11-20 is two stars, and so on. Any idea or help will be appreciated.
Try this <?php $data = '<div class="rating-static rating-10 margin-top-none margin-bottom-sm"></div>'; $dom = new DOMDocument; $dom->loadHTML($data); $xpath = new DomXpath($dom); $div = $dom->getElementsByTagName('div')[0]; $div_style = $div->getAttribute('class'); $final_data = explode(" ",$div_style); echo $final_data[1]; ?> this will give you expected output.
I had a similar project; this should be the way to do it if you want to parse the whole HTML site:
$dom = new DOMDocument();
$dom->loadHTML($html); // The HTML source of the website
foreach ($dom->getElementsByTagName('div') as $node) {
    if ($node->getAttribute("class") == "rating-static") {
        $array = explode(" ", $node->getAttribute("class"));
        $ratingArray = explode("-", $array[1]); // $array[1] is rating-10
        // $ratingArray[1] would be 10
        // do whatever you like with the information
    }
}
You may have to change the if part to a strpos check. I haven't tested this script, but I think that getAttribute("class") returns all classes, in which case the equality check never matches. This would be the if statement then:
if (strpos($node->getAttribute("class"), "rating-static") !== false)
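The switch the question plans (1-10 is one star, 11-20 two stars, and so on) can also be expressed as a rounded-up division. The sketch below assumes the rating-N class is always present as a whole class name; the helper name starsFromClass is hypothetical.

```php
<?php
// Hypothetical helper: turn a class attribute such as
// "rating-static rating-13 margin-top-none" into a star count.
function starsFromClass(string $classAttr): ?int
{
    // Match a whole class name of the form rating-<digits>; "rating-static" does not match.
    if (!preg_match('/(?:^|\s)rating-(\d+)(?:\s|$)/', $classAttr, $m)) {
        return null; // no rating-N class present
    }
    $rating = (int) $m[1];
    // 0 => 0 stars, 1-10 => 1 star, 11-20 => 2 stars, ..., 41-50 => 5 stars.
    return (int) ceil($rating / 10);
}

$examples = [
    'rating-static rating-10 margin-top-none margin-bottom-sm',
    'rating-static rating-13 margin-top-none margin-bottom-sm',
    'rating-static rating-46 margin-top-none margin-bottom-sm',
];
foreach ($examples as $classAttr) {
    echo $classAttr, ' => ', starsFromClass($classAttr), " star(s)\n";
}
```

The function takes the string returned by $node->getAttribute("class") in either answer above.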
FYI try using Querypath for future parsing needs. Its just a wrapper around PHP DOM parser and works really really well.
Remove tags with Simple HTML DOM parser [duplicate]
I would like to use Simple HTML DOM to remove all images in an article so I can easily create a small snippet of text for a news ticker but I haven't figured out how to remove elements with it. Basically I would do Get content as HTML string Remove all image tags from content Limit content to x words Output. Any help?
There are no dedicated methods for removing elements. You just find all the img elements and then do $e->outertext = '';
When you only delete the outer text you delete the HTML content itself, but if you perform another find on the same elements they will still appear in the result. The reason is that the Simple HTML DOM object still has its internal structure for the element, only without its actual content. What you need to do in order to really delete the element is simply reload the HTML as a string into the same variable. This way the object will be recreated without the deleted content, and the Simple HTML DOM object will be built without it. Here is an example function:
public function removeNode($selector)
{
    foreach ($this->find($selector) as $node) {
        $node->outertext = '';
    }
    $this->load($this->save());
}
Put this function inside the simple_html_dom class and you're good.
I think you have some difficulties because you forgot to save(dump the internal DOM tree back into string). Try this: $html = file_get_html("http://example.com"); foreach($html ->find('img') as $item) { $item->outertext = ''; } $html->save(); echo $html;
I could not figure out where to put the function so I just put the following directly in my code: $html->load($html->save()); It basically locks changes made in the for loop back into the html per above.
The supposed solutions are quite expensive and practically unusable in a big loop or other kind of repetition. I prefer to use "soft deletes":
foreach ($html->find('somecondition') as $item) {
    if (somecheck)
        $item->setAttribute('softDelete', true); // <= set marker to check in further code
    $item->outertext = '';
    foreach ($foo as $bar) {
        if (!$bar->getAttribute('softDelete')) {
            // do something
        }
    }
}
This is working for me: foreach($html->find('element') as $element){ $element = NULL; }
Adding new answer since removeNode is definitely a better way of removing it: $html->removeNode('img'); This method probably was not available when accepted answer was marked. You do not need to loop the html to find each one, this will remove them.
Use outerhtml instead of outertext <div id='your_div'>the contents of your div</div> $your_div->outertext = ''; echo $your_div // echoes <div id='your_div'></div> $your_div->outerhtml= ''; echo $your_div // echoes nothing
Try this: $dom = new Dom(); $dom->loadStr($text); foreach ($dom->find('element') as $element) { $element->delete(); }
This works now: $element->remove(); You can see the documentation for the method here.
Below I remove the HEADER and all SCRIPT nodes of the incoming URL by using two different forms of the find() function. Remove the second parameter to return an array of all matching nodes, then just loop through the nodes.
$clean_html = file_get_html($url);
// Find and remove 1st instance of node.
$node = $clean_html->find('header', 0);
$node->remove();
// Find and remove all instances of node.
$nodes = $clean_html->find('script');
foreach ($nodes as $node) {
    $node->remove();
}
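For comparison, the workflow from the question (get the HTML, drop the images, limit to x words) can be sketched with PHP's built-in DOMDocument rather than Simple HTML DOM. This swaps in a different library, so it is not a drop-in replacement for the answers above; the helper names removeImages and firstWords and the sample article are assumptions.

```php
<?php
// Hypothetical helper: return the markup with every <img> element removed.
function removeImages(string $html): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the live node list first; removing nodes while iterating it can skip elements.
    foreach (iterator_to_array($dom->getElementsByTagName('img')) as $img) {
        $img->parentNode->removeChild($img);
    }

    // loadHTML() wraps fragments in <html><body>; return only the body's children.
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

// Hypothetical helper: strip the remaining tags and keep the first $maxWords words.
function firstWords(string $html, int $maxWords): string
{
    $words = preg_split('/\s+/', trim(strip_tags($html)), -1, PREG_SPLIT_NO_EMPTY);
    return implode(' ', array_slice($words, 0, $maxWords));
}

$article = '<p>Breaking <img src="a.png" alt="x"> news from the city today</p>';
$clean = removeImages($article);
$snippet = firstWords($clean, 4);
echo $snippet, "\n";
```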
PHP XPath query returns nothing
I've recently been playing with DOMXPath in PHP and had success with it. To get more experience, I've been grabbing certain elements from different sites. I am having trouble getting the weather marker off of this website: http://www.theweathernetwork.com/weather/cape0005. Specifically I want //*[@id='theTemperature']. Here is what I have:
$url = file_get_contents('http://www.theweathernetwork.com/weather/cape0005');
$dom = new DOMDocument();
@$dom->loadHTML($url);
$xpath = new DOMXPath($dom);
$tags = $xpath->query("//*[@id='theTemperature']");
foreach ($tags as $tag){
    echo $tag->nodeValue;
}
Is there something I am doing wrong here? I am able to produce actual results on other tags on the page but specifically not this one. Thanks in advance.
You might want to improve your DOMDocument debugging skills; here are some hints (Demo):
<?php
header('Content-Type: text/plain;');
$url = file_get_contents('http://www.theweathernetwork.com/weather/cape0005');
$dom = new DOMDocument();
@$dom->loadHTML($url);
$xpath = new DOMXPath($dom);
$tags = $xpath->query("//*[@id='theTemperature']");
foreach ($tags as $i => $tag){
    echo $i, ': ', var_dump($tag->nodeValue), ' HTML: ', $dom->saveHTML($tag), "\n";
}
Output the number of the found node; I do it here with $i in the foreach. var_dump the ->nodeValue; it helps to show what exactly it is. Output the HTML by making use of the saveHTML function, which shows a better picture. The actual output:
0: string(0) "" HTML: <p id="theTemperature"></p>
You can easily spot that the element is empty, so the temperature must go in from somewhere else, e.g. via JavaScript. Check the Network tools of your browser.
What happens is straightforward: the page contains an empty id="theTemperature" element, which is a placeholder to be populated with JavaScript. file_get_contents() will just download the page without executing JavaScript, so the element remains empty. Try loading the page in the browser with JavaScript disabled to see it yourself.
The element you're trying to select is indeed empty. The page loads the temperature into that id through ajax. Specifically this script: http://www.theweathernetwork.com/common/js/master/citypage_ajax.js?cb=201301231338 but when you do a file_get_contents those scripts obviously don't get resolved. I'd go with guido's solution of using the RSS
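The diagnosis in these answers can be reproduced without touching the network. The sketch below uses a small made-up page (the markup and the "Sample City" text are assumptions) to show that the query does match, and that the result looks empty only because the placeholder element has no content in the static HTML.

```php
<?php
// Made-up static HTML: the temperature element is an empty placeholder,
// as it would be before any JavaScript runs.
$html = '<html><body><p id="theTemperature"></p><p id="city">Sample City</p></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$nodes = $xpath->query("//*[@id='theTemperature']");
$found = $nodes->length;                                   // how many elements matched
$value = $found > 0 ? $nodes->item(0)->nodeValue : null;   // text inside the first match

if ($found === 0) {
    echo "No element matched: check the XPath expression.\n";
} elseif ($value === '') {
    echo "Element matched but is empty: the content is probably filled in by JavaScript.\n";
} else {
    echo "Value: ", $value, "\n";
}
```

Checking the match count separately from the node value distinguishes a wrong query from a correct query that hits an empty element.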
Parsing XML with PHP (simplexml)
Firstly, may I point out that I am a newcomer to all things PHP so apologies if anything here is unclear and I'm afraid the more layman the response the better. I've been having real trouble parsing an xml file in to php to then populate an HTML table for my website. At the moment, I have been able to get the full xml feed in to a string which I can then echo and view and all seems well. I then thought I would be able to use simplexml to pick out specific elements and print their content but have been unable to do this. The xml feed will be constantly changing (structure remaining the same) and is in compressed format. From various sources I've identified the following commands to get my feed in to the right format within a string although I am still unable to print specific elements. I've tried every combination without any luck and suspect I may be barking up the wrong tree. Could someone please point me in the right direction?! $file = fopen("compress.zlib://$url", 'r'); $xmlstr = file_get_contents($url); $xml = new SimpleXMLElement($url,null,true); foreach($xml as $name) { echo "{$name->awCat}\r\n"; } Many, many thanks in advance, Chris PS The actual feed
Since no one followed my close vote, I think I can just as well put my own comments as an answer. First of all, SimpleXml can load URIs directly and it can do so with stream wrappers, so your three calls in the beginning can be shortened to the following (note that you are not using $file at all):
$merchantProductFeed = new SimpleXMLElement("compress.zlib://$url", null, TRUE);
To get the values you can either use the implicit SimpleXml API and drill down to the wanted elements (as shown multiple times elsewhere on the site):
foreach ($merchantProductFeed->merchant->prod as $prod) {
    echo $prod->cat->awCat, PHP_EOL;
}
or you can use an XPath query to get at the wanted elements directly:
$xml = new SimpleXMLElement("compress.zlib://$url", null, TRUE);
foreach ($xml->xpath('/merchantProductFeed/merchant/prod/cat/awCat') as $awCat) {
    echo $awCat, PHP_EOL;
}
Live Demo
Note that fetching all $awCat elements from the source XML is rather pointless though, because all of them have "Bodycare & Fitness" for value. Of course you can also mix XPath and the implicit API and just fetch the prod elements and then drill down to their various children. Using XPath should be somewhat faster than iterating over the SimpleXmlElement object graph, though it should be noted that the difference is negligible (read 0.000x vs 0.000y) for your feed. Still, if you plan to do more XML work, it pays off to familiarize yourself with XPath, because it's quite powerful. Think of it as SQL for XML. For additional examples see A simple program to CRUD node and node values of xml file and PHP Manual - SimpleXml Basic Examples
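Both access styles described in this answer can be tried on an inline document. The XML below is a hand-written stand-in that only mirrors the element names mentioned in the answer (merchantProductFeed, merchant, prod, cat, awCat); the real feed's full structure is not reproduced here.

```php
<?php
// Hand-written sample mirroring the element names from the answer (not the real feed).
$sample = <<<XML
<merchantProductFeed>
  <merchant>
    <prod><cat><awCat>Bodycare &amp; Fitness</awCat></cat></prod>
    <prod><cat><awCat>Bodycare &amp; Fitness</awCat></cat></prod>
  </merchant>
</merchantProductFeed>
XML;

$xml = new SimpleXMLElement($sample);

// 1) Implicit API: drill down through properties.
$viaProperties = [];
foreach ($xml->merchant->prod as $prod) {
    $viaProperties[] = (string) $prod->cat->awCat;
}

// 2) XPath: select the wanted elements directly.
$viaXpath = array_map('strval', $xml->xpath('/merchantProductFeed/merchant/prod/cat/awCat'));

print_r($viaProperties);
print_r($viaXpath);
```

Casting each element to a string matters when the values are stored or compared, because the loop variables are SimpleXMLElement objects rather than plain strings.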
Try this... $url = "http://datafeed.api.productserve.com/datafeed/download/apikey/58bc4442611e03a13eca07d83607f851/cid/97,98,142,144,146,129,595,539,147,149,613,626,135,163,168,159,169,161,167,170,137,171,548,174,183,178,179,175,172,623,139,614,189,194,141,205,198,206,203,208,199,204,201,61,62,72,73,71,74,75,76,77,78,79,63,80,82,64,83,84,85,65,86,87,88,90,89,91,67,92,94,33,54,53,57,58,52,603,60,56,66,128,130,133,212,207,209,210,211,68,69,213,216,217,218,219,220,221,223,70,224,225,226,227,228,229,4,5,10,11,537,13,19,15,14,18,6,551,20,21,22,23,24,25,26,7,30,29,32,619,34,8,35,618,40,38,42,43,9,45,46,651,47,49,50,634,230,231,538,235,550,240,239,241,556,245,244,242,521,576,575,577,579,281,283,554,285,555,303,304,286,282,287,288,173,193,637,639,640,642,643,644,641,650,177,379,648,181,645,384,387,646,598,611,391,393,647,395,631,602,570,600,405,187,411,412,413,414,415,416,649,418,419,420,99,100,101,107,110,111,113,114,115,116,118,121,122,127,581,624,123,594,125,421,604,599,422,530,434,532,428,474,475,476,477,423,608,437,438,440,441,442,444,446,447,607,424,451,448,453,449,452,450,425,455,457,459,460,456,458,426,616,463,464,465,466,467,427,625,597,473,469,617,470,429,430,615,483,484,485,487,488,529,596,431,432,489,490,361,633,362,366,367,368,371,369,363,372,373,374,377,375,536,535,364,378,380,381,365,383,385,386,390,392,394,396,397,399,402,404,406,407,540,542,544,546,547,246,558,247,252,559,255,248,256,265,259,632,260,261,262,557,249,266,267,268,269,612,251,277,250,272,270,271,273,561,560,347,348,354,350,352,349,355,356,357,358,359,360,586,590,592,588,591,589,328,629,330,338,493,635,495,507,563,564,567,569,568/mid/2891/columns/merchant_id,merchant_name,aw_product_id,merchant_product_id,product_name,description,category_id,category_name,merchant_category,aw_deep_link,aw_image_url,search_price,delivery_cost,merchant_deep_link,merchant_image_url/format/xml/compression/gzip/"; $zd = gzopen($url, "r"); $data = gzread($zd, 1000000); gzclose($zd); if ($data !== false) { $xml = 
simplexml_load_string($data); foreach ($xml->merchant->prod as $pr) { echo $pr->cat->awCat . "<br>"; } }
<?php
$xmlstr = file_get_contents("compress.zlib://$url");
$xml = simplexml_load_string($xmlstr);
// you can traverse the xml tree however you want
foreach ($xml->merchant->prod as $line) {
    // $line->cat->awCat -> you can use this
}
more information here
Use print_r($xml) to see the structure of the parsed XML feed. Then it becomes obvious how you would traverse it:
foreach ($xml->merchant->prod as $prod) {
    print $prod->pId;
    print $prod->text->name;
    print $prod->cat->awCat; # <-- which is what you wanted
    print $prod->price->buynow;
}
$url = 'your url here'; $f = gzopen ($url, 'r'); $xml = new SimpleXMLElement (fread ($f, 1000000)); foreach($xml->xpath ('//prod') as $name) { echo (string) $name->cat->awCatId, "\r\n"; }