Website Scraping using cURL and Regex - php

I am trying to scrape the categories using cURL and regex, but the code I have only extracts one of the categories (Arts, Antiques & Collectibles).
This is the code I have:
<?php
$curl = curl_init('http://www.lelong.com.my/Auc/List/BrowseAll.asp');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$page = curl_exec($curl);
if(curl_errno($curl)) // check for execution errors
{
echo 'Scraper error: ' . curl_error($curl);
exit;
}
curl_close($curl);
$regex = '/<span class=CatLevel1>(.*?)<\/a>/s';
if ( preg_match($regex, $page, $list) )
echo $list[0]. "<br>";
else
print "Not found";
?>
Can anyone help me correct this code so that it extracts all the categories (without the numbers)? I've been stuck on this for a long time.
Thanks!
Sample output:
Arts, Antiques & Collectibles
B2B & Industrial Products
Baby
etc....

Here is working code using the DOMDocument and DOMXPath classes:
$grep = new DOMDocument();
// @ suppresses the warnings libxml raises on malformed real-world HTML
@$grep->loadHTMLFile("http://www.lelong.com.my/Auc/List/BrowseAll.asp");
$finder = new DOMXPath($grep);
$class = "CatLevel1";
$nodes = $finder->query("//*[contains(@class, '$class')]");
foreach ($nodes as $node) {
    $span = $node->childNodes;
    echo $span->item(0)->nodeValue."<br>";
}
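The same DOMXPath idea can be tried offline first. The sketch below is only an illustration: the inline `$html` string is hypothetical markup standing in for the real page (the live HTML may be structured differently), and it assumes each category name sits in an `<a>` inside a `CatLevel1` span, with the count outside the link.

```php
<?php
// Hypothetical sample markup; the real page may differ.
$html = '<div>'
      . '<span class="CatLevel1"><a href="/a">Arts, Antiques &amp; Collectibles</a> (120)</span>'
      . '<span class="CatLevel1"><a href="/b">B2B &amp; Industrial Products</a> (45)</span>'
      . '<span class="CatLevel1"><a href="/c">Baby</a> (9)</span>'
      . '</div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$categories = [];
// Select only the link inside each span, so the trailing count is left out.
foreach ($xpath->query("//span[contains(@class, 'CatLevel1')]/a") as $a) {
    $categories[] = trim($a->textContent);
}

echo implode("\n", $categories), "\n";
```

Selecting the `<a>` element rather than the span is what drops the numbers here; if the counts live inside the link on the real page, they would need to be stripped separately.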

I read the comment on your question suggesting a different approach, and the alternative answer is probably better suited to this job. But if you still want to do it this way, you need a global search (preg_match_all()) so it doesn't stop at the first match, and then a loop to print the contents of the array where the results are saved. I haven't used cURL and can't test this, and PHP isn't my strong suit, but the code should be something like:
if (preg_match_all($regex, $page, $list)) {
    $i = 0;
    // $list[1] holds the text captured by the first group for every match
    while (isset($list[1][$i])) {
        echo $list[1][$i] . "<br>";
        $i++;
    }
} else {
    print "Not found";
}
Sorry for any mistakes in the code.
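For reference, a minimal self-contained sketch of the preg_match_all() approach, run against a hypothetical two-category string rather than the live page. It assumes the captured fragment still contains the opening `<a ...>` tag, so strip_tags() and html_entity_decode() are used to leave only the name.

```php
<?php
// Hypothetical sample input; the real page markup may differ.
$page = '<span class=CatLevel1><a href="/a">Arts, Antiques &amp; Collectibles</a> (120)</span>'
      . '<span class=CatLevel1><a href="/b">Baby</a> (9)</span>';

$regex = '/<span class=CatLevel1>(.*?)<\/a>/s';
$names = [];

if (preg_match_all($regex, $page, $list)) {
    foreach ($list[1] as $fragment) {
        // Remove the leftover <a ...> tag, then turn &amp; back into &.
        $names[] = html_entity_decode(trim(strip_tags($fragment)));
    }
} else {
    print "Not found";
}

print_r($names);
```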

Related

Add space between textContent data scraped from website using PHP DOM

I am trying to add a comma and a space between some data I am scraping from a website. The data scrapes successfully, but the items run together, and the comma and space I am trying to add only get added after the last item. Here is the code I currently have:
$html = curl_exec($ch);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$finder = new DomXPath($dom);
$class_ops = 'ipc-inline-list ';
$class_opp = 'ipc-inline ';
$node = $finder->query("//div[@class='$class_ops']//ul[@class='$class_opp']");
foreach ($node as $index => $t) {
if ($index == 3) {
$la = $t->textContent.", ";
}
}
echo $la;
Current Result
DoyleBrainDavid,
Expected Result
Doyle, Brain, David
I am using this code
$c1 = curl_init('https://stackoverflow.com/');
curl_setopt($c1, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($c1);
if (curl_error($c1))
die(curl_error($c1));
// Get the status code
$status = curl_getinfo($c1, CURLINFO_HTTP_CODE);
curl_close($c1);
preg_match_all('/<span(.*?)<\/span>/s', $html, $matches1);
foreach($matches1[0] as $k=>$v){
$enc = mb_detect_encoding($v);
// mb_convert_encoding() takes the target encoding first, then the source encoding
$v = mb_convert_encoding($v, "UTF-8", $enc);
$match1[$k] = strip_tags ($v);
//$match1[$k] = preg_replace('/^[^A-Za-z0-9]+/', '', $match1[$k]);
}
var_dump($match1);
In your case you can replace the pattern like this:
preg_match_all('/<div class="ipc-inline-list">(.*?)<\/div>/s', $html, $matches1);
This returns an array of matches.
I hope this is helpful.
You want each li, not the ul as one block. Try:
$node = $finder->query("//div[@class='$class_ops']//ul[@class='$class_opp']/li");
Demo: https://3v4l.org/Mvfud
If that doesn't work the actual HTML content should be added to the question.
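To illustrate the per-`li` approach end to end, here is a small sketch on hypothetical markup (the real page's HTML and exact class values are not shown in the question, so the class names below are assumptions). Each `li` is collected into an array and joined with implode(), which avoids a trailing comma.

```php
<?php
// Hypothetical sample markup; class values on the real page may differ.
$html = '<div class="ipc-inline-list"><ul class="ipc-inline">'
      . '<li>Doyle</li><li>Brain</li><li>David</li>'
      . '</ul></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom->loadHTML($html);
libxml_clear_errors();

$finder = new DOMXPath($dom);
$names = [];
foreach ($finder->query("//div[@class='ipc-inline-list']//ul[@class='ipc-inline']/li") as $li) {
    $names[] = trim($li->textContent);
}

// Join once at the end instead of appending ", " inside the loop.
$la = implode(', ', $names);
echo $la, "\n";
```

Note that `@class='...'` is an exact string comparison, so a trailing space in the class value matters; `contains(@class, '...')` is a looser alternative.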

PHP won't print from external API

I studied programming at school for only a short time. That was a while ago and I'm rusty. I've been trying to re-learn everything by myself, and something is bothering me. I'm trying to print a specific field from an external API, but nothing I try seems to work. I don't really know what to google to find the answer I'm looking for. Anyway, here's my code:
<?php
$url = 'http://apis.is/flight?language=en&type=departures';
$json = file_get_contents($url);
$results = json_decode($json, TRUE);
for ($x = 0; $x < count($results); $x++) {
echo $results[$x]['results']['flightNumber']."<br/>";
}
?>
If you debug your code (it is worth learning how), you will see that $results has one key, results, which you can iterate over with a simple foreach:
foreach ($results['results'] as $item) {
    echo $item['flightNumber'];
}
You are accessing the data returned from the API in the wrong order. Do this instead:
<?php
$url = 'http://apis.is/flight?language=en&type=departures';
$json = file_get_contents($url);
$results = json_decode($json, TRUE);
// To loop through an array, use foreach instead of for
// It is easier to use
foreach($results['results'] as $result){
echo $result['flightNumber'].'<br />';
}
?>
<?php
$url = 'http://apis.is/flight?language=en&type=departures';
$json = file_get_contents($url);
$results = json_decode($json, TRUE);
foreach ($results['results'] as $res) {
echo $res['flightNumber']."<br/>";
}
?>
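The structure the answers rely on can be checked without calling the API. The sketch below uses a hypothetical JSON string (the flight numbers are made up, and the real response contains more fields); it only assumes, as the answers do, that the flights sit in a list under the top-level `results` key.

```php
<?php
// Hypothetical payload with the shape the answers assume.
$json = '{"results":[{"flightNumber":"FI520"},{"flightNumber":"WW810"}]}';

$results = json_decode($json, true);   // true => associative arrays
$flightNumbers = [];

// Guard against a failed request or unexpected shape before looping.
if (is_array($results) && isset($results['results'])) {
    foreach ($results['results'] as $flight) {
        $flightNumbers[] = $flight['flightNumber'];
    }
}

echo implode("<br/>", $flightNumbers);
```

Running print_r($results) on the real response is a quick way to confirm where the list actually lives.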

Parsing HTML Table Data from XML with PHP

I am somewhat new with PHP, but can't really wrap my head around what I am doing wrong here given my situation.
Problem: I am trying to get the href of a certain HTML element within a string of characters inside an XML object/element via Reddit (if you visit this page, it would be the actual link of the video - not the reddit link but the external youtube link or whatever - nothing else).
Here is my code so far (code updated):
Update: Loop-mania! Got all of the hrefs, but am now trying to store them inside a global array to access a random one outside of this function.
function getXMLFeed() {
echo "<h2>Reddit Items</h2><hr><br><br>";
//$feedURL = file_get_contents('https://www.reddit.com/r/videos/.xml?limit=200');
$feedURL = 'https://www.reddit.com/r/videos/.xml?limit=200';
$xml = simplexml_load_file($feedURL);
//define each xml entry from reddit as an item
foreach ($xml -> entry as $item ) {
foreach ($item -> content as $content) {
$newContent = (string)$content;
$html = str_get_html($newContent);
foreach($html->find('table') as $table) {
$links = $table->find('span', '0');
//echo $links;
foreach($links->find('a') as $link) {
echo $link->href;
}
}
}
}
}
XML Code:
http://pasted.co/0bcf49e8
I've also included JSON if it can be done this way; I just preferred XML:
http://pasted.co/f02180db
That is pretty much all of the code. Though, here is another piece I tried to use with DOMDocument (scrapped it).
foreach ($item -> content as $content) {
$dom = new DOMDocument();
$dom -> loadHTML($content);
$xpath = new DOMXPath($dom);
$classname = "/html/body/table[1]/tbody/tr/td[2]/span[1]/a";
foreach ($dom->getElementsByTagName('table') as $node) {
echo $dom->saveHtml($node), PHP_EOL;
//$originalURL = $node->getAttribute('href');
}
//$html = $dom->saveHTML();
}
I can parse the table fine, but when it comes to getting certain element's values (nothing has an ID or class), I can only seem to get ALL anchor tags or ALL table rows, etc.
Can anyone point me in the right direction? Let me know if there is anything else I can add here. Thanks!
Added HTML:
I am specifically trying to extract <span>[link]</span> from each table/item.
http://pastebin.com/QXa2i6qz
The following code extracts the YouTube link from each entry's content.
function extract_youtube_link($xml) {
$entries = $xml['entry'];
$videos = [];
foreach($entries as $entry) {
$content = html_entity_decode($entry['content']);
preg_match_all('/<span><a href="(.*)">\[link\]/', $content, $matches);
if(!empty($matches[1][0])) {
$videos[] = array(
'entry_title' => $entry['title'],
'author' => preg_replace('/\/(.*)\//', '', $entry['author']['name']),
'author_reddit_url' => $entry['author']['uri'],
'video_url' => $matches[1][0]
);
}
}
return $videos;
}
$xml = simplexml_load_file('reddit.xml');
$xml = json_decode(json_encode($xml), true);
$videos = extract_youtube_link($xml);
foreach($videos as $video) {
echo "<p>Entry Title: {$video['entry_title']}</p>";
echo "<p>Author: {$video['author']}</p>";
echo "<p>Author URL: {$video['author_reddit_url']}</p>";
echo "<p>Video URL: {$video['video_url']}</p>";
echo "<br><br>";
}
The code returns a multidimensional array in which each element contains entry_title, author, author_reddit_url and video_url. Hope it helps!
If you're looking for a specific element, you don't need to parse the whole thing. One way could be to use the DOMXPath class and query the XML directly. The documentation should guide you through:
http://php.net/manual/es/class.domxpath.php .
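As a rough sketch of the DOMXPath suggestion, the snippet below picks out the anchor whose text is `[link]` from one entry's HTML. The `$content` string is hypothetical and only loosely modelled on the description in the question; the URLs in it are made up.

```php
<?php
// Hypothetical content of one feed entry; the real markup may differ.
$content = '<table><tr>'
         . '<td><a href="https://www.reddit.com/r/videos/x"><img src="t.jpg"></a></td>'
         . '<td>submitted by <a href="https://www.reddit.com/user/someone">/u/someone</a><br/>'
         . '<span><a href="https://www.youtube.com/watch?v=abc123">[link]</a></span> '
         . '<span><a href="https://www.reddit.com/r/videos/x">[comments]</a></span></td>'
         . '</tr></table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // silence warnings from sloppy HTML
$dom->loadHTML($content);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$links = [];
// Match on the anchor text, since nothing in the markup has an id or class.
foreach ($xpath->query('//span/a[normalize-space(.)="[link]"]') as $a) {
    $links[] = $a->getAttribute('href');
}

print_r($links);
```

Collecting the hrefs into an array returned from the function, rather than echoing inside nested loops, also makes it easy to pick a random one later with array_rand().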

Extract content of specific div preserving only certain elements

I need to extract only the textual part of the webpage, preserving all and only the <p>, <h2>, <h3>, <h4> and <blockquote> elements.
Using DOMXPath with $div = $xpath->query('//div[@class="story-inner"]'); returns lots of unwanted page elements inside the text div, such as pictures, ad blocks and other custom markup.
On the other hand, the following code:
$items = $doc->getElementsByTagName('p');
for ($i = 0; $i < $items->length; $i++) {
    echo $items->item($i)->nodeValue . "<p>";
}
gives a nice, clean result very close to what I want, but with the <h2>, <h3>, <h4> and <blockquote> elements missing.
Is there a DOM way of (1) selecting only the desired page elements and extracting a clean result, or (2) efficiently cleaning up the output obtained with $div = $xpath->query('//div[@class="story-inner"]');?
You could use an OR inside your XPath query in this case. Just chain the tag names with it to select only the desired elements.
$url = "http://www.example.com/russian/international/2015/02/150218_ukraine_debaltseve_fighting";
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$html = curl_exec($curl);
curl_close($curl);
$doc = new DOMDocument();
$html = mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8");
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$tags = array('p', 'h2');
$children_needed = implode(' or ', array_map(function($tag){ return sprintf('name()="%s"', $tag); }, $tags));
$query = "//div[@class='story-body__inner']//*[$children_needed]";
$div_children = $xpath->query($query);
if($div_children->length > 0) {
foreach($div_children as $child) {
echo $doc->saveHTML($child);
}
}
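A variant of the same OR idea, written with the XPath `self::` axis and covering all five tags from the question. This is a sketch on hypothetical inline markup, not the real article page; the class name is taken from the answer above.

```php
<?php
// Hypothetical sample markup with wanted and unwanted elements mixed together.
$html = '<div class="story-body__inner">'
      . '<h2>Title</h2><img src="x.jpg"><p>First</p>'
      . '<div class="ad">Ad</div><blockquote>Quote</blockquote><h3>Sub</h3>'
      . '</div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$query = "//div[@class='story-body__inner']"
       . "//*[self::p or self::h2 or self::h3 or self::h4 or self::blockquote]";

$kept = [];
foreach ($xpath->query($query) as $child) {
    $kept[] = $child->nodeName;          // record which tags were selected
    echo $doc->saveHTML($child), "\n";   // print the element with its markup
}
```

One caveat: because `//*` matches descendants at any depth, a `<p>` nested inside a `<blockquote>` would be printed twice, once on its own and once as part of its parent.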
If I understood your question correctly, is this what you are asking for?
$output1=preg_match('/^.*<tagName>(.*)<\/tagName>/', $value,$match1);
Match on the tag names and capture the data between them using preg_match.

variable echoing out the last of 3 numbers

Hi, I've currently got a web scraping script whose output echoes 176 8 58. I want to pack this output into a variable and echo it in other places on the website.
I've packed it up by doing this:
ob_start();
echo $node->nodeValue. "\n";
$thenumbers = ob_get_contents();
ob_end_clean();
but when I echo it out like this
<?php echo $thenumbers ?>
my output is then 176 8 58
On the website the numbers are in spans and are separated by "/". Do I need to do anything fancy? I'm kind of new to PHP, so let me know if it's something stupid!
I would really appreciate a bit of help.
(This is the web scraping script I'm using; I had to hide the website I'm scraping as it's in development.)
<?php
$teamlink = rwmb_meta( 'WEBSITE_HIDDEN' );
$arr = array( $teamlink );
foreach ($arr as &$value) {
$file = $DOCUMENT_ROOT. $value;
$doc = new DOMDocument();
$doc->loadHTMLFile($file);
$xpath = new DOMXpath($doc);
$elements = $xpath->query("//*[contains(@class, 'table')]/tr[3]/td[3]/span");
if (!is_null($elements)) {
foreach ($elements as $element) {
$nodes = $element->childNodes;
foreach ($nodes as $node) {
ob_start();
echo $node->nodeValue. "\n";
$win_loss = ob_get_contents();
ob_end_clean();
}
}
}
}
?>
P.S. I know the script works, as it's currently outputting standard text fine.
My apologies if I have completely misunderstood your question.
If you want to add a "/" between the numbers, where the spaces are, you could:
echo str_replace(' ','/',$thenumbers);
If you just want to show the last 3 digits (cleaning out the spaces from the string), you could:
echo substr(str_replace(' ','',$thenumbers),-3);
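Another possibility, judging only from the posted loop: the variable is reassigned on every iteration, so it may end up holding just the last value. A sketch that collects every span into an array and joins them afterwards, without output buffering, is shown below. The table markup is hypothetical, since the real page was hidden in the question.

```php
<?php
// Hypothetical sample markup matching the XPath used in the question.
$html = '<table class="table">'
      . '<tr><td>a</td></tr>'
      . '<tr><td>b</td></tr>'
      . '<tr><td>x</td><td>y</td><td><span>176</span>/<span>8</span>/<span>58</span></td></tr>'
      . '</table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$values = [];
foreach ($xpath->query("//*[contains(@class, 'table')]/tr[3]/td[3]/span") as $span) {
    $values[] = trim($span->textContent);   // append instead of overwriting
}

// One string that can be echoed anywhere else on the page.
$thenumbers = implode('/', $values);
echo $thenumbers, "\n";
```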
