Scraping data from various online stores - php

First off, I should mention that I have permission from the stores to scrape this data, so legality isn't an issue here!
I'm trying to scrape information from various online stores and store it in a database once every hour.
Example site: http://www.uptherestore.com/department/accessories
I've tried a PHP scrape like this:
<?php
$file_string = file_get_contents('http://www.uptherestore.com/department/accessories');
preg_match('/<div class="view view-uc-products view-id-uc_products view-display-id-page_3 storeview view-dom-id-1">(.*)<\/div>/i', $file_string, $title);
$title_out = $title[1];
?>
<p><strong>Accessories:</strong> <?php echo $title_out; ?></p>
but it's giving me errors like this:
[14-Feb-2013 07:39:49 UTC] PHP Warning: DOMDocument::loadHTML() [<a href='domdocument.loadhtml'>domdocument.loadhtml</a>]: htmlCheckEncoding: encoder error in Entity, line: 7 in scraping.php on line 5
Full error from log file is here: http://pastebin.com/W2Bhkc0s
Even if I do manage to scrape from that site, it will only return the first page of results (when I need all pages). My current solution would be to:
Use jQuery to check how many elements are in the pager at the bottom of the page
Run a loop that scrapes each of those pages
But this is not ideal: as you can see, at the bottom of the page there are pages 1...9, but if you click "last" there are actually 11 pages of content. In short, what's the best method to scrape data from sites like this? As mentioned, the store owners have all given me permission to use their content, but they're not particularly technically minded and cannot give me access to their servers or put any code in their servers' .htaccess to allow requests from my website.

Paging is simple: you just find the link that says "next" and follow it until it's not there anymore. Unless you're comfortable with XPath, you'll want a good HTML parser library (phpQuery, simple-html-dom). Be prepared to spend a good deal of time figuring out the right way to do it, and above all, don't listen to anyone who tells you to use regex.
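For illustration, a minimal sketch of that "follow the next link" loop using DOMDocument and DOMXPath (the XPath query for the pager link is an assumption; inspect the store's actual markup and adjust it):

<?php
// Sketch: follow "next" links until there are none left.
// The XPath query is an assumption about the pager markup.
$url = 'http://www.uptherestore.com/department/accessories';

while ($url) {
    $doc = new DOMDocument();
    @$doc->loadHTML(file_get_contents($url)); // @ silences warnings from sloppy real-world HTML

    // ... extract and store this page's products here ...

    $xpath = new DOMXPath($doc);
    $next  = $xpath->query('//a[contains(@class, "pager-next")]')->item(0);
    // Note: a relative href would need to be resolved against the site's base URL.
    $url   = $next ? $next->getAttribute('href') : null;
}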

First of all, your error message does not seem to fit your PHP code: the PHP is trying to parse the HTML using a regex (wrong!), while the error message suggests you are using an HTML parser (DOMDocument) to parse the HTML (the right way).
What you would need to do is the following:
1. Get the HTML from a product page (like you are doing now);
2. Check against your database whether that page has already been parsed (see the next point);
3. Use an HTML parser to get the info you need from that page and store everything in a database, including the link to the product page (or another identifying property of that page) and some sort of timestamp, so that you know what you have done already;
4. Use an HTML parser to get all product links in the HTML;
5. Go to 1 for every product link you have found.
You probably need to build in some logic to make sure your script does not enter a never-ending loop or run for too long, but that is basically it; no browser/JavaScript/AJAX is required until you actually want to see the results of the operations in your browser.
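A rough sketch of that loop (parse_product() and save_product() are hypothetical helpers, and the product-link XPath class is a made-up placeholder, not working code for any particular store):

<?php
// Sketch of the crawl described above, with a visited set and a time cap.
$queue   = array('http://www.example-store.com/department/accessories');
$visited = array();
$started = time();

while ($queue && time() - $started < 300) {       // hard cap: stop after 5 minutes
    $url = array_shift($queue);
    if (isset($visited[$url])) {
        continue;                                  // step 2: already done, skip it
    }
    $visited[$url] = true;

    $doc = new DOMDocument();
    @$doc->loadHTML(file_get_contents($url));      // step 1

    save_product(parse_product($doc), $url);       // step 3: hypothetical helpers

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//a[@class="product-link"]/@href') as $href) {
        $queue[] = $href->nodeValue;               // steps 4-5: queue further links
    }
}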

Use cURL and regex to filter out what you need. Searching for cURL on the php.net site will give you all the information you need.

Related

get web page content using php script and save it in my db

I want to get the content of the 1st and 2nd pages of https://www.goodreads.com/search?utf8=%E2%9C%93&q=%D8%B3%D9%8A%D8%A7%D8%B3%D8%A9&search_type=books
and then store it in my database and make the list searchable. After googling, I found that I can do it like this:
$url = "https://www.goodreads.com/search?utf8=%E2%9C%93&q=%D8%B3%D9%8A%D8%A7%D8%B3%D8%A9&search_type=books";

function get_links($url) {
    $input = file_get_contents($url);
    echo $input;
}

get_links($url);
My problem is: how can I get the 2nd page's content as well, and how can I store these books in my database so that the list is searchable?
The answer is not that easy...
Options
Getting the pages (Not recommended)
To get a later page you can send the "page argument" in your request:
e.g.:
https://www.goodreads.com/search?page=2&q=%D8%B3%D9%8A%D8%A7%D8%B3%D8%A9&search_type=books&tab=books&utf8=%E2%9C%93
But to get the elements into a nice structure you need to parse the HTML you get, which is really hard.
Use the API (recommended)
At https://www.goodreads.com/api/index you can find the documentation for the Goodreads API, which returns XML as a response and is easily parseable.
Parse XML in PHP
If you use the API you can parse the XML-response with SimpleXML.
See example 1 & 2: http://php.net/manual/en/simplexml.examples-basic.php
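For illustration, parsing a response with SimpleXML could look like this (the element names below are invented; check the actual response structure against the API docs):

<?php
// Hypothetical XML structure -- consult the API documentation for the
// real element names before relying on this.
$xml = simplexml_load_string($apiResponse);

foreach ($xml->search->results->work as $work) {
    echo $work->best_book->title, "\n";
}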
Saving to database
If you are a beginner you might read some tutorials about how to use MySQL with PHP and PDO. But you may also have a look at RedBeanPHP, which is a really easy to use ORM for databases.
For getting the 2nd page you will have to append ?page=2 to the url.
In case you need other pages as well, you can use a for loop; if you don't know how to use a for loop, google "php for loop".
For the database, google "php mysql insert".
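Putting those hints together, a hedged sketch (the DSN, credentials and the books_raw table are made-up assumptions; parsing the HTML into individual books is a separate problem):

<?php
// Sketch: fetch two result pages and store the raw HTML with PDO.
$pdo  = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO books_raw (page, html) VALUES (?, ?)');

$base = 'https://www.goodreads.com/search?utf8=%E2%9C%93&q=%D8%B3%D9%8A%D8%A7%D8%B3%D8%A9&search_type=books';

for ($page = 1; $page <= 2; $page++) {
    $html = file_get_contents($base . '&page=' . $page);
    $stmt->execute(array($page, $html));
}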

Running preg_replace on html code taking too long

At the risk of getting redirected to this answer (yes, I read it and spent the last 5 minutes laughing out loud at it), allow me to explain this issue, which is just one in a list of many.
My employer asked me to review a site written in PHP, using Smarty for templates and MySQL as the DBMS. It's currently running very slowly, taking up to 2 minutes (with an entirely white screen through it all, no less) to load completely.
Profiling the code with Xdebug, I found a single preg_replace call that takes around 30 seconds to complete; it goes through all the HTML code and replaces each URL found with its SEO-friendly version. The moment it completes, it outputs all of the code to the browser. (As I said before, that's not the only issue; the code is rather old, and it shows, but I'll focus on this one for this question.)
Digging further into the code, I found that it runs through 1702 patterns with their corresponding replacements (both matches and replacements in equally-sized arrays), which would certainly account for the time it takes.
Code goes like this:
// This is just a call to a MySQL query which gets the relevant SEO-friendly URLs:
$seourls_data = $oSeoShared->getSeourls();

$url_masks = array();
$seourls = array();
foreach ($seourls_data as $seourl_data) {
    if ($seourl_data["url"]) {
        $url_masks[] = "/([\"'\>\s]{1})".$site.str_replace("/", "\/", $seourl_data["url"])."([\#|\"'\s]{1})/";
        $seourls[] = "$1".MAINSITE_URL.$seourl_data["seourl"]."$2";
    }
}

// After filling both $url_masks and $seourls arrays, the HTML is parsed:
$html_seo = preg_replace($url_masks, $seourls, $html);
// After it completes, $html_seo is simply echo'ed to the browser.
Now, I know the obvious answer to the problem is: don't parse HTML with a regexp. But then, how to solve this particular issue? My first attempt would probably be:
Load the (hopefully well-formed) HTML into a DOMDocument and get each href attribute of each a tag;
Go through each node, replacing each URL found with its appropriate match (which would probably mean using the previous regexps anyway, but on a much-reduced-size string);
???
Profit?
but I think it's most likely not the right way to solve the issue.
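Still, for reference, here's roughly what that DOM-based pass could look like ($seo_map is an assumed old-url => seo-url lookup array built once from the $seourls_data rows):

<?php
// Sketch: rewrite hrefs via the DOM instead of regexing the whole page.
// $seo_map is a hypothetical array mapping old URLs to SEO URLs, so each
// href becomes one array lookup instead of 1702 regex passes over the page.
$doc = new DOMDocument();
@$doc->loadHTML($html);

foreach ($doc->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    if (isset($seo_map[$href])) {
        $a->setAttribute('href', MAINSITE_URL . $seo_map[$href]);
    }
}

$html_seo = $doc->saveHTML();

Even if some URLs still needed the old regexes, running them against individual attribute values rather than the full document should cut the time drastically.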
Any ideas or suggestions?
Thanks.
As your goal is to be SEO-friendly, using a canonical tag in the target pages would tell the search engines to use your SEO-friendly URLs, so you don't need to replace them in your code...
Oops, that's really tough; a bad strategy from the beginning, though that's not your fault.
I have a few suggestions:
1. Create a caching layer with Smarty, so the first HTML still takes 2 minutes to generate, but subsequent requests are served from a static resource.
2. Don't postpone what should have been done early: fix the system. Create a database migration that stores the SEO URLs in a good format, or generate them from titles or whatever. On my system I generate SEO links in this format: www.whatever.com/jobs/722/drupal-php-developer, where I use 722 as the id (parsing the URL to get the right page content) and drupal-php-developer is the title of the post.
3. (Not really a suggestion:) tell your client the project is not well engineered (if you truly believe so) and needs a restructuring to boost performance.

PHP - curl or simplexml_load_file?

I'm trying to parse information from fonefinder.net. I was trying to use simplexml_load_file, but couldn't get the page to load successfully.
Now, I'm looking into Curl. But I'm not sure if this will work either.
I basically just want to take the html from the fonefinder page, and parse it to get phone carrier and city.
Is that possible? How?
SimpleXML will only work if the HTML is formatted correctly, and that is rarely the case ;)
You could do a simple cURL call to fetch the data, and the easiest thing would probably be to use a regular expression to get the information you need.
A concrete solution is not easy to supply with nothing more to go on, but that's the idea.
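A basic cURL fetch looks like this (the URL here is just the site's front page; adjust it to the actual lookup page you need):

<?php
// Minimal cURL GET; returns the page body as a string.
$ch = curl_init('http://www.fonefinder.net/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of echoing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
$html = curl_exec($ch);
curl_close($ch);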
I recommend using:
http://www.de.php.net/manual/en/function.file-get-contents.php to get the document
http://www.php.net/manual/en/domdocument.loadhtml.php to load it
http://www.php.net/manual/en/class.domxpath.php to get the information from it
Or use the search function here; this question has been asked over and over, for example: PHP: Fetch content from a html page using xpath()
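Roughly like this (the XPath expression is a placeholder; inspect fonefinder's real markup to find where carrier and city actually live):

<?php
// Sketch: fetch the page, load it into the DOM, query it with XPath.
$html = file_get_contents('http://www.fonefinder.net/');

$doc = new DOMDocument();
@$doc->loadHTML($html);   // real-world HTML is rarely well-formed

$xpath = new DOMXPath($doc);
foreach ($xpath->query('//td') as $cell) {   // placeholder query
    echo trim($cell->textContent), "\n";
}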

How to do search without using a scripting language, by using jQuery?

I have an html site with around 100 html files, and I want to develop a search engine for it: if the user types any word and hits search, I want to display the related contents containing that keyword. Is it possible to do this without using any server-side scripting, implementing it with jQuery or JavaScript? Please let me know if you have any ideas!
Thanks in advance.
Possible? Yes. You can download all the files via AJAX, save their contents in an array of strings, and search the array.
The performance however would be dreadful. If you need full text search, then for any decent performance you will need a database and a special fulltext search engine.
Three options:
Series of Ajax indexing requests: very slow, not recommended.
Use a DB to store key terms/page references and perform a fulltext search.
Utilise off-the-shelf functionality, such as that offered by Google.
The only way this can work is if you have a list of all the pages on the page you are searching from. So you could do this:
pages = new Array("page1.htm","page2.htm"...)
and so on. The problem with that is that to search for the results, the browser would need to do a GET request for every page:
for (var i in pages)
$.get(pages[i], function (result) { searchThisPage(result) });
Doing that 100 times would mean a long wait for the user. Another way I can think of is to have all the content served in an array:
pages = {
    "index" : "Some content for the index",
    "second_page" : "Some content for the second page",
    etc...
}
Then each page could reference this one script to get all the content, include the content for itself in its own content section, and use the rest for searching. If you have a lot of data, this would be a lot to load in one go when the user first arrives at your site.
The final option I can think of is to use the Google search API: http://code.google.com/apis/customsearch/v1/overview.html
Quite simply: no.
Client-side javascript runs in the client's browser. The client does not have any way to know about the contents of the documents within your domain. If you want to do a search, you'll need to do it server-side and then return the appropriate HTML to the client.
The only way to technically do this client-side would be to send the client all the data about all of the documents, and then get them to do the searching via some JS function. And that's ridiculously inefficient, such that there is no excuse for getting them to do so when it's easier, lighter-weight and more efficient to simply maintain a search database on the server (likely through some nicely-packaged third party library) and use that.
some useful resources
http://johnmc.co/llum/how-to-build-search-into-your-site-with-jquery-and-yahoo/
http://tutorialzine.com/2010/09/google-powered-site-search-ajax-jquery/
http://plugins.jquery.com/project/gss
If your site is allowing search engine indexing, then fcalderan's approach is definitely the simplest.
If not, it is possible to generate a text file that serves as an index of the HTML files. This would probably be only rudimentarily successful, but it is possible. You could use something like the keywording in Toby Segaran's book to build a JSON text file. Then, use jQuery to load the text file, find the instances of the keywords, de-duplicate the resulting filenames and display the results.

Intelligently grab first paragraph/starting text

I'd like to have a script where I can input a URL and it will intelligently grab the first paragraph of the article... I'm not sure where to begin other than just pulling text from within <p> tags. Do you know of any tips/tutorials on how to do this kind of thing?
update
For further clarification, I'm building a section of my site where users can submit links like on Facebook, it'll grab an image from their site as well as text to go with the link. I'm using PHP and trying to determine the best method of doing this.
I say "intelligently" because I'd like to try to get content on that page that's important, not just the first paragraph, but the first paragraph of the most important content.
If the page you want to grab is foreign, or even if it is local but you don't know its structure in advance, I'd say the best way to achieve this is with the PHP DOM functions.
function get_first_paragraph($url)
{
    $page = file_get_contents($url);

    $doc = new DOMDocument();
    @$doc->loadHTML($page); // @ silences warnings on malformed real-world HTML

    /* Gets all the paragraphs */
    $p = $doc->getElementsByTagName('p');

    /* Extracts the first one (note: item(), not items()) */
    $p = $p->item(0);

    /* Returns the paragraph's content */
    return $p->textContent;
}
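Used like this (the URL is just a placeholder):

echo get_first_paragraph('http://www.example.com/some-article');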
Short answer: you can't.
In order to have a PHP script "intelligently" fetch the "most important" content from a page, the script would have to understand the content on the page. PHP is no natural language processor, nor is this a trivial area of study. There might be some NLP toolkits for PHP, but I still doubt it would be easy then.
A solution that can be achieved with reasonable effort would be to fetch the entire page with an HTML parser and then look for elements with class names or ids commonly found in blog engines. You could also parse for hAtom microformats, or look for meta tags within the document and other more clearly defined information.
I wrote a Python script a while ago to extract a web page's main article content. It uses a heuristic to scan all text nodes in a document and group together nodes at similar depths, and then assume the largest grouping is the main article.
Of course, this method has its limitations, and no method will work on 100% of web pages. This is just one approach, and there are many other ways you might accomplish it. You may also want to look at similar past questions on this subject.
