For my code I need to get a specific URL from a webpage. I use Simple HTML DOM Parser, and so far I have managed to get all the links with the class .linkify from the page. That gives me 18 different links, but I only need the second one, stored in a variable.
Here's my code:
$html = file_get_html("http://saucenao.com/search.php?db=999&url=http://simg4.gelbooru.com//images/4f/3d/$file");
foreach($html->find('a.linkify') as $element)
echo $element->href . '<br>';
How do I make it so it only generates the second link?
Thanks!
You can select which one you want by modifying
$html->find('a.linkify')
to
$html->find('a.linkify', 1)
To fetch the second one, pass 1 as the second argument. The index is zero-based (0 = first match, 1 = second match).
To get the href in a single line, do the following:
$html->find('a.linkify', 1)->href
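If Simple HTML DOM is not available, the same "take the second match" idea can be sketched with PHP's built-in DOM extension. This is only an illustration under stated assumptions: the markup and the example.com URLs below are made up and stand in for the fetched page.

```php
<?php
// Sample markup standing in for the fetched page (an assumption, not the real site).
$markup = '<a class="linkify" href="http://example.com/first">1</a>'
        . '<a class="other" href="http://example.com/skip">x</a>'
        . '<a class="linkify extra" href="http://example.com/second">2</a>';

$doc = new DOMDocument();
// Suppress the warnings that malformed real-world HTML would otherwise raise.
@$doc->loadHTML($markup);
$xpath = new DOMXPath($doc);

// Match anchors whose class list contains the token "linkify".
$links = $xpath->query(
    '//a[contains(concat(" ", normalize-space(@class), " "), " linkify ")]'
);

// item() is zero-based, so index 1 is the second match; it returns null if absent.
$second = $links->item(1);
$secondHref = $second !== null ? $second->getAttribute('href') : null;

echo $secondHref, "\n";
```

As with find('a.linkify', 1), the index counts from 0, and checking for null avoids an error when the page has fewer than two matching links.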
Related
I am trying to retrieve one specific element from HTML code using QueryPath. It occurs twice, but I only want the first one.
Searching for the object DOES work, but it returns me two elements.
I was trying to add a pseudo-class-selector to my search, but that didn't work.
This is the HTML-element that occurs twice in the code:
<span class="aui-suffix"> of 5 </span>
And this is how I am searching for it:
$arrURL = "URL...";
$html = htmlqp( $arrURL );
$pageAsString = $html->find('span.aui-suffix');
echo $pageAsString->text();
The output is "of 5 of 5 ", which is both elements printed right after each other.
How can I modify my search to get me only "of 5 "?
Try:
$pageAsString = $html->find('span.aui-suffix:eq(0)');
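For comparison, here is a rough sketch of the same "first match only" selection using PHP's built-in DOMXPath instead of QueryPath. The surrounding div elements are an assumption added for the demo; only the span comes from the question.

```php
<?php
// Two copies of the span from the question, wrapped in hypothetical divs.
$markup = '<div><span class="aui-suffix"> of 5 </span></div>'
        . '<div><span class="aui-suffix"> of 5 </span></div>';

$doc = new DOMDocument();
@$doc->loadHTML($markup);
$xpath = new DOMXPath($doc);

// Without a position filter, both spans match.
$all = $xpath->query('//span[@class="aui-suffix"]');

// The parentheses make [1] apply to the whole result set,
// so only the first match in document order is kept (XPath counts from 1).
$first = $xpath->query('(//span[@class="aui-suffix"])[1]')->item(0);
$firstText = $first !== null ? $first->textContent : '';

echo trim($firstText), "\n";
```

Note the different conventions: XPath positions start at 1, while item() on the returned node list starts at 0.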
I'm trying to get the title from a link that I have already identified; the only part missing is retrieving the title from that same link.
To find the right link I use this code:
$html->find('a[href=http://mylink.se]');
But I also want the title from this link. How would I do that?
Assuming you're using PHP Simple HTML DOM parser, which is not clear in your question, you could do
$link = $html->find('a[href=http://mylink.se]', 0); //As the OP pointed out in comments, you need to select the first element
$title = $link->title;
http://simplehtmldom.sourceforge.net/manual.htm
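If you would rather not depend on Simple HTML DOM, reading the attribute can be sketched with PHP's built-in DOM classes. The title text and link text in the sample markup are invented for the example.

```php
<?php
// Hypothetical markup; only the href value comes from the question.
$markup = '<p><a href="http://mylink.se" title="My link title">Visit</a></p>';

$doc = new DOMDocument();
@$doc->loadHTML($markup);
$xpath = new DOMXPath($doc);

// Select the first anchor whose href matches exactly.
$link = $xpath->query('//a[@href="http://mylink.se"]')->item(0);

// getAttribute() returns an empty string if the attribute is missing.
$title = $link !== null ? $link->getAttribute('title') : null;

echo $title, "\n";
```

The null check matters here for the same reason the index 0 matters above: the query returns a list, not a single element.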
I am trying to extract urls from a large number of google search results. Getting them from the source code is proving to be quite challenging as the delimiters are not clear and not all of the urls are in the code. Is there a tool that can extract urls from a certain area of an image? If so that may be a better solution.
Any help would be much appreciated.
Try using the JSON/Atom Custom Search API instead: http://code.google.com/apis/customsearch/v1/overview.html. It gives you 100 api calls per day, something you can increase to 10000 per day, if you pay.
Use this excellent lib: http://simplehtmldom.sourceforge.net/manual.htm
// Grab the source code
$html = file_get_html('http://www.google.com/');
// Find all anchors; returns an array of element objects
$ret = $html->find('a');
// Get an attribute from the first anchor (for a non-value attribute such as checked or selected, this returns true or false)
$value = $ret[0]->href;
Edit:
All "natural" search URLs seem to be in the #res div. With Simple HTML DOM, find the first #res element, then all the links inside it. I don't remember the exact syntax, but it should be something like this:
$ret = $html->find('div[id=res]', 0)->find('a');
or maybe
$html->find('div[id=res] a');
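The "links inside one container" idea can also be sketched with PHP's built-in DOMXPath. The markup below is a made-up stand-in for a results page; whether the real page still uses a #res div is an assumption carried over from the answer above.

```php
<?php
// Hypothetical page: one navigation link outside #res, two result links inside it.
$markup = '<div id="nav"><a href="/prefs">Settings</a></div>'
        . '<div id="res">'
        . '<a href="http://example.com/a">A</a>'
        . '<p><a href="http://example.org/b">B</a></p>'
        . '</div>';

$doc = new DOMDocument();
@$doc->loadHTML($markup);
$xpath = new DOMXPath($doc);

// Collect the href of every anchor nested at any depth inside div#res.
$urls = [];
foreach ($xpath->query('//div[@id="res"]//a[@href]') as $anchor) {
    $urls[] = $anchor->getAttribute('href');
}

print_r($urls);
```

The double slash after the div matches descendants at any depth, which corresponds to the descendant selector 'div[id=res] a'.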
How can I screen scrape a website using cURL and show the data within a specific div?
Download the page using cURL (there are a lot of examples in the documentation). Then use a DOM parser, for example Simple HTML DOM or PHP's DOM, to extract the value from the div element.
After downloading with cURL use XPath to select the div and extract the content.
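A minimal sketch of that two-step approach, assuming the target div can be identified by its id. The function names fetchPage and extractDivText are made up for this example, and the timeout value is arbitrary.

```php
<?php
/**
 * Download a page with cURL. Returns the body, or null on failure.
 * (Hypothetical helper name.)
 */
function fetchPage(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // give up after 10 seconds
    $body = curl_exec($ch);
    return is_string($body) ? $body : null;
}

/**
 * Return the trimmed text of the first div with the given id, or null if
 * there is none. Assumes $id contains no double quotes. (Hypothetical helper name.)
 */
function extractDivText(string $html, string $id): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings from malformed markup
    $xpath = new DOMXPath($doc);
    $div = $xpath->query(sprintf('//div[@id="%s"]', $id))->item(0);
    return $div !== null ? trim($div->textContent) : null;
}
```

Usage would look roughly like extractDivText(fetchPage('http://www.example.com/') ?? '', 'content'), where both the URL and the id 'content' are placeholders for your own page and div.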
A possible alternative.
# We will store the web page in a string variable.
var string page
# Read the page into the string variable.
cat "http://www.abczyx.com/path/to/page.ext" > $page
# Output the portion in the third (3rd) instance of "<div...</div>"
stex -r -c "^<div&</div\>^3" $page
This code is in biterscripting. I am using the 3 as sample to extract 3rd div. If you want to extract the div that has say string "ABC", then use this command syntax.
stex -r -c "^<div&ABC&</div\>^" $page
Take a look at this script http://www.biterscripting.com/helppages/SS_ExtractTable.html . It shows how to extract an element (div, table, frame, etc.) when the elements are nested.
Fetch the website content using a cURL GET request. There's a code sample on the curl_exec manual page.
Use a regular expression to search for the data you need. There's a code sample on the preg_match manual page, but you'll need to read up on regular expressions to be able to build the pattern you need. As Yacoby mentioned (I hadn't thought of it), a better idea may be to examine the DOM of the HTML page using PHP's SimpleXML or DOM parser.
Output the information you've found from the regex/parser in the HTML of your page (within the required div.)
I am trying to create a simple alert app for some friends.
Basically I want to be able to extract the data "price" and "stock availability" from a webpage like the following two:
http://www.sparkfun.com/commerce/product_info.php?products_id=5
http://www.sparkfun.com/commerce/product_info.php?products_id=9279
I have made the alert part (via e-mail and SMS), but now I want to be able to get the quantity and price out of the webpages (those two or any other ones) so that I can compare the price and quantity available and alert us to make an order if a product is between some thresholds.
I have tried some regex (found in some tutorials, but I am way too n00b for this) and haven't managed to get it working. Any good tips or examples?
$content = file_get_contents('http://www.sparkfun.com/commerce/product_info.php?products_id=9279');
preg_match('#<tr><th>(.*)</th> <td><b>price</b></td></tr>#', $content, $match);
$price = $match[1];
preg_match('#<input type="hidden" name="quantity_on_hand" value="(.*?)">#', $content, $match);
$in_stock = $match[1];
echo "Price: $price - Availability: $in_stock\n";
It's called screen scraping, in case you need to google for it.
I would suggest that you use a dom parser and xpath expressions instead. Feed the HTML through HtmlTidy first, to ensure that it's valid markup.
For example:
$html = file_get_contents("http://www.example.com");
$html = tidy_repair_string($html);
$doc = new DomDocument();
$doc->loadHtml($html);
$xpath = new DomXPath($doc);
// Now query the document:
foreach ($xpath->query('//table[@class="pricing"]//th') as $node) {
echo $node->nodeValue, "\n";
}
Whatever you do: don't use regular expressions to parse HTML, or bad things will happen. Use a parser instead.
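As a rough illustration of the parser route, here is how the hidden quantity_on_hand field targeted by the regex above could be read with DOMXPath instead. The sample form and the value 42 are invented; the real page's markup may differ.

```php
<?php
// Hypothetical fragment modelled on the hidden input the regex looks for.
$markup = '<form>'
        . '<input type="hidden" name="quantity_on_hand" value="42">'
        . '</form>';

$doc = new DOMDocument();
@$doc->loadHTML($markup);
$xpath = new DOMXPath($doc);

// Find the input by its name attribute, regardless of attribute order or spacing.
$input = $xpath->query('//input[@name="quantity_on_hand"]')->item(0);

// Cast to int so the value can be compared against a threshold later.
$inStock = $input !== null ? (int) $input->getAttribute('value') : null;

var_dump($inStock);
```

Unlike the regex, this does not break if the site reorders the attributes or adds whitespace inside the tag.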
First, this question is asking for too many details. Second, extracting data from a website might not be legitimate. That said, here are some hints:
Use Firebug or the Chrome/Safari Inspector to explore the HTML content and find the pattern around the information you are interested in.
Test your regex to see whether it matches. You may need to do this several times (multi-pass parsing/extraction).
Write a client with cURL or, even simpler, use file_get_contents (note that some hosts disable loading URLs with file_get_contents).
Personally, I would rather use Tidy to convert the page to valid XHTML and then use XPath to extract the data, instead of regex. Why? Because XHTML is not a regular language, and XPath is very flexible. You can also learn XSLT to transform the result.
Good luck!
You are probably best off loading the HTML code into a DOM parser like this one and searching for the "pricing" table. However, any kind of scraping you do can break whenever they change their page layout, and is probably illegal without their consent.
The best way, though, would be to talk to the people who run the site, and see whether they have alternative, more reliable forms of data delivery (Web services, RSS, or database exports come to mind).
This is the simplest method I found to extract data from a website. I saw that all my data sits inside <h3> tags only, so I wrote this.
<?php
include('simple_html_dom.php');
// Put the URL of the page you want to scrape in $page
$page = 'http://facebook4free.com/category/facebookstatus/amazing-facebook-status/';
$html = new simple_html_dom();
// Load the webpage into $html for further operations
$html->load_file($page);
// Collect all matching elements
$links = array();
// find('h3') fetches only the <h3> elements. Change the selector as per your requirement.
foreach($html->find('h3') as $element)
{
$links[] = $element;
}
reset($links);
//$out will be having each of HTML element content you searching for, within that web page
foreach ($links as $out)
{
echo $out;
}
?>