Displaying content from different PHP apps in one single PHP app - php

I am creating a PHP app that will display some classifieds/listings based on user location. For example:
Our classifieds from Chicago:
Classified 1
Classified 2
Classified 3
Now, I also want to display "classifieds" from some other classified sites on my own page. Like this:
More Classifieds from Chicago (courtesy of XYZ.com)
Classified 1
Classified 2
Classified 3
Classified 4
More Classifieds from Chicago (courtesy of ABC.com)
Classified 1
Classified 2
Classified 3
This way, users can see classifieds hosted on my server as well as classifieds from other popular classified sites.
Is this possible? Note that 1) there are no RSS feeds available for importing these classifieds; and 2) if possible, I'd like to show these lists in widget format, i.e. display an iframe/widget box (not sure what the technical term is) and show all the external classifieds in that box.
See a rough mockup here: http://i.imgur.com/O19MR.jpg
I was thinking I could load the other classified sites into iframes, but then I'd get the whole site (including their header/footer, logo, etc.). I just want the relevant "classifieds" section from their site.

You want to look into doing some screen scraping through a spider-and-parser setup. You can use cURL or file_get_contents to bring in the web page, then use regular expressions and string operators to filter out the data you want, then build a page to display it. This is an overly simplified version of the full answer, but if I gave you the hundreds of lines of code to complete this, that would be cheating!
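A minimal sketch of that approach with cURL, just to show the shape of it (the URL and the "classified-item" class are placeholders; every real site will need its own pattern):
fetch_example.php:
$ch = curl_init('http://www.example-classifieds.com/chicago'); // placeholder URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
curl_close($ch);

if ($html === false) {
    die('Could not fetch the remote page.');
}

// Very naive extraction: grab the contents of every hypothetical
// <div class="classified-item"> block. Real markup will differ.
preg_match_all('/<div class="classified-item">(.*?)<\/div>/s', $html, $matches);
foreach ($matches[1] as $item) {
    echo strip_tags($item) . "<br>\n";
}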

Given the lack of API or feed, the only thing I can think of is to have to pull the relevant URLs and scrape the data from them. It should be pretty simple with a mix of file_get_contents and DOMDocument to parse the data, as long as the markup is tidy.
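For example, a rough sketch along those lines (the URL and the //div[@class="listing"] selector are made up; inspect the real site's markup for the right XPath):
parse_example.php:
$html = file_get_contents('http://www.example-classifieds.com/chicago'); // placeholder URL

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // real-world HTML is rarely valid; silence the warnings
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Hypothetical selector for one classified's title link; adjust to the actual markup.
foreach ($xpath->query('//div[@class="listing"]/h3/a') as $link) {
    echo $link->textContent . ' - ' . $link->getAttribute('href') . "<br>\n";
}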

The best option I can think of is to set up an asynchronous web crawler that fetches the data from those sites.
You could set it up to crawl every day at 00:00 and store the content in your database, something like:
external_classified
id
site_source
city_id
extra_data
After that you could get it from your PHP app with no problems.
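A rough sketch of that nightly job (the PDO credentials are placeholders, and fetch_classifieds() is a hypothetical helper wrapping the fetch/parse code from the other answers):
crawler.php:
// Run from cron, e.g. 0 0 * * * php /path/to/crawler.php
$pdo = new PDO('mysql:host=localhost;dbname=myapp', 'user', 'pass'); // placeholder credentials

$stmt = $pdo->prepare(
    'INSERT INTO external_classified (site_source, city_id, extra_data)
     VALUES (:site, :city, :data)'
);

$cityId = 1; // e.g. Chicago
foreach (array('XYZ.com', 'ABC.com') as $site) {
    // fetch_classifieds() is assumed: it scrapes one source and returns an
    // array of classifieds for the given city.
    foreach (fetch_classifieds($site, $cityId) as $classified) {
        $stmt->execute(array(
            ':site' => $site,
            ':city' => $cityId,
            ':data' => json_encode($classified),
        ));
    }
}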
EDIT: Note that the solution I'm suggesting is asynchronous! The other answers use a synchronous approach to get the data; I think it's a waste of time to fetch the same classifieds over and over again. Although, to be fair, those solutions are simpler to implement.

Related

Search On Website using php variable - DOM Parsing

I want to search a website programmatically using PHP, the same way we search a website manually: enter a query in the search box, press search, and the results come out.
Suppose I want to search this website by product names or model numbers that are stored in my CSV file.
If the product number or model number matches the website's data, then the results page should be displayed.
I looked at the questions below but was not able to implement them:
Creating a 'robot' to fill form with some pages in
Autofill a form of another website and send it
Please let me know how we can do this in PHP.
Thanks
You want to create a “crawler” for websites.
There are some things to consider first:
Your code will never be generic. Each site has its own structure and you cannot assume anything (example: Craigslist "encodes" emails with a simple method).
You need to select an objective (emails? item information? links?)
PHP is by far one of the worst languages to do that.
I'd suggest using C# and the library called Html Agility Pack. It allows you to parse HTML pages as XML documents (so you can use XPath expressions and more to retrieve information).
It surely can be done in PHP, but I think it will take at least 10x the time in PHP compared to C#.
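That said, since the question asks for PHP: the usual trick is to POST the search form the way a browser would and then parse the response. A rough sketch with cURL (the form's action URL and the 'q' field name are guesses; inspect the real form's HTML to find the actual ones):
search_example.php:
$ch = curl_init('http://www.example.com/search');   // placeholder form action URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array(
    'q' => 'MODEL-1234',   // a model number read from your CSV file
)));
$resultPage = curl_exec($ch);
curl_close($ch);

// Then parse $resultPage (e.g. with DOMDocument/DOMXPath) to check whether
// the product shows up in the results.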

Is it better to try for one mega screen scraper or split it into separate scrapers for different sites?

I will explain my situation.
Our Social Media Manager (yay) suddenly wants something to scrape a list of about 40 websites for information about our company; for example, there are a lot of review sites in the list.
(I have read a ton of tutorials and SO questions but still) My questions are:
Is it possible to build a generic scraper that will work across all of these sites or do I need a separate scraper for each site?
I think I understand how to parse an individual web page, but how do you handle a site structured like review-website.com/company-name, where that page shows titles and snippets of reviews that then link to the actual full-page reviews?
I.e. crawling and scraping multiple pages on multiple sites. Some are 'easier' than others because they have dedicated pages like the URLs mentioned above, but some are forums etc. with no particular structure that just happen to mention our company name, so I don't know how to get relevant information from those.
Is the time spent creating this justified, given that the Social Media Manager could just search these sites manually himself? Especially considering that an HTML change on any of the sites could end up breaking the scraper?
I really don't think this is a good idea, yet my Line Manager seems to think it will take a morning's worth of work to write a scraper for all of these sites, and I have no idea how to do it!
UPDATE
Thank you very much for the answers so far. I also thought I'd provide a list of the sites, just to clarify what I think is an extreme task:
Facebook - www.facebook.com
Social Mention - www.socialmention.com
Youtube - www.youtube.com
Qype - www.qype.co.uk
Money Saving Expert - www.moneysavingexpert.co.uk
Review Centre - www.reviewcentre.com
Dooyoo - www.dooyoo.co.uk
Yelp - www.yelp.co.uk
Ciao - www.ciao.co.uk
All in London - www.allinlondon.co.uk
Touch Local - www.touchlocal.com
Tipped - www.tipped.co.uk
What Clinic - www.whatclinic.com
Wahanda - www.wahanda.com
Up My Street - www.upmystreet.com
Lasik Eyes - www.lasik-eyes.co.uk/
Lasik Eyes (Forum) - forums.lasik-eyes.co.uk/default.asp
Laser Eye Surgery - www.laser-eye-surgery-review.com/
Treatment Saver - www.treatmentsaver.com/lasereyesurgery
Eye Surgery Compare - www.eyesurgerycompare.co.uk/best-uk-laser-eye-surgery-clinics
The Good Surgeon Guide - www.thegoodsurgeonguide.co.uk/
Private Health - www.privatehealth.co.uk/hospitaltreatment/find-a-treatment/laser-eye-surgery/
Laser Eye Surgery Wiki - www.lasereyesurgerywiki.co.uk
PC Advisor - www.pcadvisor.co.uk/forums/2/consumerwatch/
Scoot - www.scoot.co.uk
Cosmetic Surgery Reviews - www.cosmetic-surgery-reviews.co.uk
Lasik Reviews - www.lasikreviews.co.uk
Laser Eye Surgery Costs - www.lasereyesurgerycosts.co.uk
Who Calls Me - www.whocallsme.com
Treatment Adviser - www.treatmentadviser.com/
Complaints Board - http://www.complaintsboard.com
Toluna - http://uk.toluna.com/
Mums Net - http://www.mumsnet.com
Boards.ie - http://www.boards.ie
AV Forums - http://www.avforums.com
Magic Mum - http://www.magicmum.com
That really depends on what sort of websites and data you face.
Option 1: DOM / XPATH based
If you need to parse tables and very detailed things, you need to parse each site with a separate algorithm. One way would be to parse each specific site into a DOM representation and address each value via XPath. This will take some time and is affected by structure changes; if you have to scrape each of these sites this way, it will cost you more than a morning.
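As a rough illustration of that option in PHP: one scraper routine shared across sites, with a site-specific XPath for each (the expressions below are invented placeholders and will break whenever the markup changes):
scrape_option1.php:
$siteConfig = array(
    'reviewcentre.com' => '//div[@class="review"]//p[@class="review_text"]',
    'qype.co.uk'       => '//li[contains(@class, "review-item")]//blockquote',
);

function scrapeReviews($url, $xpathExpression) {
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);          // tolerate messy real-world HTML
    $doc->loadHTML(file_get_contents($url));
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $reviews = array();
    foreach ($xpath->query($xpathExpression) as $node) {
        $reviews[] = trim($node->textContent);
    }
    return $reviews;                            // one string per matched review
}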
Option 2: Density based
However, if you need to parse something like a blog article and you only want to extract the article's text, there are pretty good density-based algorithms which work across HTML structure changes. One of those is described here: https://www2.cs.kuleuven.be/cwis/research/liir/publication_files/978AriasEtAl2009.pdf
An implementation is provided here: http://apoc.sixserv.org/code/ce_density.rb
You would have to port it to PHP. For blogs and news sites this is a really effective approach.
Option 3: Pragmatic
If you do not care about layout and structure and only want the data, you might download the contents and simply strip the tags. However, this will leave a lot of noise in the resulting text.
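A minimal sketch of that pragmatic route (placeholder URL; expect navigation and footer text mixed into the output):
strip_example.php:
$html = file_get_contents('http://www.example.com/some-review-page'); // placeholder URL
$text = strip_tags($html);                   // throw away all markup
$text = preg_replace('/\s+/', ' ', $text);   // collapse whitespace; plenty of noise remains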
Update
After your update, I would follow these steps in order:
Check which pages you are not allowed to scrape. On this list there are certainly some whose terms will forbid scraping.
You will need much more time than a day. I would discuss this, and the legal issues, with the project lead.
Choose one option per site.
I would create a scraper for each site but build a library with common functionality (e.g. opening a page, converting to DOM, reporting errors, storing results, etc.).
Try to avoid regular expressions when scraping; a small change will break the scraper. Use the website's DOM structure instead (XPath). Much more reliable.
Tell your boss it is going to take quite a bit of time.
Good luck.

Create and populate landing pages dynamically using feed

I am trying to create multiple landing pages populated dynamically with data from a feed.
My initial thought was to create a generic PHP page as a template that can be used to create other pages dynamically and populate them with data from a feed. For instance, the generic page could be called landing.php; then populate that page, and other pages created on the fly, with data from a feed depending on an ID, keyword or certain string in the URL. E.g. http://www.example.com/landing.php?page=cars or http://www.example.com/landing.php?page=bikes will show content that is only about cars or bikes, as the case may be.
My question is: how feasible is this approach, and is there a better way to create multiple dynamic pages populated with feed data depending on the URL query string or some sort of ID?
Many thanks for your help in advance.
I use this quite extensively. For example, where I work, we often have education oriented landing pages, but target each landing page to different types of visitors. A good example may be arts oriented schools looking for a diverse array of potential students who may be interested in a variety of programs for any number of reasons.
Well, who likes 3D modelling? Creative types (generic lander => ?type=generic) from all sorts of social circles. Also, probably gamers (gamer-centric lander => ?type=gamer). And so on.
I apply that variable to the body's class, which can be used to completely reorganize the layout. Then, I select different images for key parts of the layout based on that variable as well. The entire site changes. Different fonts can be loaded, different layout, different content.
I keep this organized via extensive includes. This sounds ugly, but it's not if you stick to a convention. You have to know the limitations of your foundation html, and you can't make too many exceptions. Sure, you could output extra crap based on if the type was gamer or generic, but you're going down the road to a product that should probably be contained in its own landing page if it needs to be that different.
I have a few landing pages which can be switched between several contents and styles (5 or 6 'themes'), but the primary purpose of keeping them grouped within the same url is only to maintain focus on the fact that that's where a certain type of traffic goes to in order to convert for this specific thing. Overlapping the purpose of these landing pages is a terrible idea.
Anyway, dream up a great template, outline a rigid convention for development, keep each theme very separate, and go to town on it. I find doing it right saves a load of time, but be careful - Doing it wrong can cost a lot of time too.
Have a look at .htaccess URL rewriting. Then your users (and Google) can use a URL like domain.com/landing/cars, but on your server the script will be executed as if someone had entered domain.com/landing.php?page=cars.
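A minimal sketch of that rewrite (assuming Apache with mod_rewrite enabled; adjust the pattern to your own URL scheme):
.htaccess:
RewriteEngine On
# Internally map /landing/cars to landing.php?page=cars; the visible URL stays clean.
RewriteRule ^landing/([a-z0-9-]+)/?$ landing.php?page=$1 [L,QSA]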
If you use feed content to populate the pages, you should use some kind of caching to ensure that you do NOT reload the whole feed on every request/page reload.
Checking the feeds every 1 to 5 minutes should be enough and the very structure of feeds allows you to identify new items easily.
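A very small sketch of that caching idea, using a file cache with a 5-minute lifetime (the paths and feed URL are placeholders):
feed_cache.php:
$cacheFile = __DIR__ . '/cache/feed.xml';        // placeholder cache location
$feedUrl   = 'http://www.example.com/feed.xml';  // placeholder feed
$ttl       = 300;                                // refresh at most every 5 minutes

if (!file_exists($cacheFile) || time() - filemtime($cacheFile) > $ttl) {
    // Cache is missing or stale: fetch the feed once and store it locally.
    file_put_contents($cacheFile, file_get_contents($feedUrl));
}
$feed = simplexml_load_file($cacheFile);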
About URL rewrite: http://www.workingwith.me.uk/articles/scripting/mod_rewrite
A nice template engine for generating pages from feeds is PHPTAL (http://phptal.org).
You can load the feed as XML and use it directly in your template.
test.xml:
<foo><bar>baz!!!</bar></foo>
template.html:
<html><head /><body> ${xml/foo/bar}</body></html>
sample.php:
require_once 'PHPTAL.php';              // PHPTAL's standard entry point
$xml = simplexml_load_file('test.xml'); // load the feed as SimpleXML
$tal = new PHPTAL('template.html');
$tal->xml = $xml;                       // exposed to the template as ${xml}
echo $tal->execute();
And it does support loops and conditional elements.
If you don't need real-time data, then you can do this in a few parts:
A script which pulls data from your RSS feeds and stores it somewhere (an SQL DB?), scheduled by something like cron. It could also tag the entries into categories.
A template in PHP that takes the URL arguments, fetches the requested data, and displays it for the user. Really quite easy to do with PHP; probably a good project to learn from as well, if you are that way inclined.
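A bare-bones sketch of that second part (the whitelist, table and column names are made up; the table is whatever your cron script fills):
landing.php:
$allowed = array('cars', 'bikes');                 // pages you actually support
$page = isset($_GET['page']) ? $_GET['page'] : '';

if (!in_array($page, $allowed, true)) {
    header('HTTP/1.0 404 Not Found');
    exit('Unknown landing page');
}

// Hypothetical table filled by the cron script described above.
$pdo  = new PDO('mysql:host=localhost;dbname=myapp', 'user', 'pass'); // placeholder credentials
$stmt = $pdo->prepare('SELECT title, body FROM feed_items WHERE category = ? ORDER BY published_at DESC');
$stmt->execute(array($page));

foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $item) {
    echo '<h2>' . htmlspecialchars($item['title']) . '</h2>';
    echo '<p>'  . htmlspecialchars($item['body'])  . '</p>';
}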

Is it possible to parse a web page from the client side for a large number of words, and if so, how?

I have a list of keywords, about 25,000 of them. I would like people who add a certain <script> tag on their web page to have these keywords transformed into links. What would be the best way to go about achieving this?
I have tried the simple javascript approach (an array with lots of elements and regexping/replacing each) and it obviously slows down the browser.
I could always process the content server-side if there was a way, from the client, to send the page's content to a cross-domain server script (I'm partial to PHP but it could be anything) but I don't know of any way to do this.
Any other working solution is also welcome.
I would have the remote site add a JavaScript file and use Ajax to connect to your site to get a list of only specific terms. Which terms?
Categories: If this is for advertising (where this concept has been done a lot), let them specify what category their site falls into and group your terms into those categories. Then only send those groups of terms. It's in their best interest to choose the right categories, because the more links they have, the more income they can generate.
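One common way to serve those per-category terms cross-domain is a small JSONP-style endpoint on your server that the remote script calls; a hedged sketch (the categories and terms are placeholders, and in practice they would come from your keyword database):
terms.php:
// Called from the remote page, e.g. terms.php?category=gamer&callback=handleTerms
header('Content-Type: application/javascript');

$category = isset($_GET['category']) ? $_GET['category'] : 'generic';
$callback = isset($_GET['callback']) ? $_GET['callback'] : 'handleTerms';
$callback = preg_replace('/[^A-Za-z0-9_]/', '', $callback);  // keep the callback name safe

$termsByCategory = array(
    'gamer'   => array('3d modelling', 'game design'),
    'generic' => array('art school', 'portfolio'),
);
$terms = isset($termsByCategory[$category]) ? $termsByCategory[$category] : array();

// JSONP: wrap the JSON in the callback so a plain <script> include works across domains.
echo $callback . '(' . json_encode($terms) . ');';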
Indexing: If that won't work, then the first time someone loads the page you could index a copy of it on your server, matching all the words on their page against the terms you have; for any subsequent loads you have a list of terms to send them based on what their page contains. Ideally, after that, you would have a background process that re-indexes their pages, say once a day or every few days, to catch any updates. You could also use the script to get a hash of the page contents and, if it has changed at all, update your indexed copy.
I'm sure there are other methods; which is best is really just preference. Try looking at a few other advertising-link sites/scripts and see how they do it.

Find duplicate content using MySQL and PHP

I am facing a problem on developing my web app, here is the description:
This web app (still in alpha) is based on user-generated content (usually short articles, although their length can become quite large, about a quarter of a screen). Every user submits at least 10 of these articles, so the number should grow pretty fast. By nature, about 10% of the articles will be duplicates, so I need an algorithm to detect them.
I have come up with the following steps:
On submission, compute the length of the text and store it in a separate table (article_id, length). The problem is that the articles are encoded using the PHP special_entities() function, and users post content with slight modifications (someone will miss a comma or an accent, or even skip some words).
Then retrieve all the entries from the database whose length is within new_post_length +/- 5% (should I use another threshold, keeping in mind the human factor in article submission?).
Fetch the first 3 keywords and compare them against the articles fetched in step 2.
Having a final array with the most probable matches, compare the new entry using PHP's levenshtein() function.
This process must be executed on article submission, not via cron. However, I suspect it will create a heavy load on the server.
Could you provide any idea please?
Thank you!
Mike
Text similarity/plagiarism/duplicate detection is a big topic. There are so many algorithms and solutions.
Levenshtein will not work in your case. You can only use it on small texts (due to its "complexity" it would kill your CPU).
Some projects use the "adaptive local alignment of keywords" (you will find info on that on Google).
Also, you can check this (check the 3 links in the answer, very instructive):
Cosine similarity vs Hamming distance
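To make the cosine-similarity idea concrete, a naive word-frequency sketch in PHP (the 0.85 threshold is an arbitrary starting point you would need to tune):
cosine_example.php:
function cosineSimilarity($textA, $textB) {
    // Build word-frequency vectors with a very naive tokenisation.
    $a = array_count_values(str_word_count(strtolower($textA), 1));
    $b = array_count_values(str_word_count(strtolower($textB), 1));

    $dot = 0;
    foreach ($a as $word => $count) {
        if (isset($b[$word])) {
            $dot += $count * $b[$word];
        }
    }

    $normA = 0;
    foreach ($a as $count) { $normA += $count * $count; }
    $normB = 0;
    foreach ($b as $count) { $normB += $count * $count; }

    return ($normA && $normB) ? $dot / (sqrt($normA) * sqrt($normB)) : 0.0;
}

// Flag as a probable duplicate above some tuned threshold, e.g. 0.85.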
Hope this will help.
I'd like to point out that git, the version control system, has excellent algorithms for detecting duplicate or near-duplicate content. When you make a commit, it will show you the files modified (regardless of rename), and what percentage changed.
It's open source, and largely written in small, focused C programs. Perhaps there is something you could use.
You could design your app to reduce the load by not having to check text strings and keywords against all other posts in the same category. What if you had the users submit the third-party content they are referencing as URLs? See Tumblr's implementation -- basically there is a free-form text field so each user can comment and create their own narrative portion of the post content, but then there are also formatted fields depending on the type of reference the user is adding (video, image, link, quote, etc.). An improvement on Tumblr would be letting the user add as many/few types of formatted content as they want in any given post.
Then you are only checking against known types like a url or embed video code. Combine that with rexem's suggestion to force users to classify by category or genre of some kind, and you'll have a much smaller scope to search for duplicates.
Also if you can give each user some way of posting to their own "stream" then it doesn't matter if many people duplicate the same content. Give people some way to vote up from the individual streams to a main "front page" level stream so the community can regulate when they see duplicate items. Instead of a vote up/down like Digg or Reddit, you could add a way for people to merge/append posts to related posts (letting them sort and manage the content as an activity on your app rather than making it an issue of behind the scenes processing).
