I think this is a really challenging one!
I run the website for my local football league, www.rdyfl.co.uk, and include JavaScript snippets from the F.A.'s Full-Time system, where we generate our fixtures, to link in tables, fixtures, recent results, and so on.
For another feature I want to add to the site I need to scrape the 'Upcoming Fixtures' for each age group and division, but when I examine the source I have two problems.
The fixtures content is generated by JavaScript, so I need to see the generated source and not just the static source.
When I view the generated source using Firefox, the team names are actually further JavaScript links rather than the names themselves.
I basically want to download the fixtures on a regular basis and write them to a MySQL database.
I have asked the F.A. and they have no other options available for accessing the data.
Having never coded a scraper before, can anyone point me to a simple solution, or does anyone fancy the challenge?
This question was asked a long time ago, but I noticed it was active today 🤷.
You should be able to scrape the website using a headless browser such as Puppeteer. Using Puppeteer you are able to access a URL and execute JavaScript or interact with the website as you would with an ordinary browser. Parsing the output DOM and storing it should then be relatively straightforward.
There are plenty of articles on this topic using Puppeteer.
The latest version of OutWit Hub does a pretty good job on dynamic content. The source OutWit scrapes to extract links, images, documents, tables and text is the updated DOM, so you can certainly build a job to grab what you need.
Custom scrapers are still applied to the static source in version 1.0.3, but version 1.1.x (still in beta) will offer a choice between the static source and the dynamically modified DOM.
Scraping content produced by JavaScript is challenging. AFAIK you will need to do this with AJAX. Hopefully the content has some CSS selectors you can grab with jQuery, or at least some IDs. Do you have IDs or classes that you can target?
The question is how to get the source code behind AJAX calls; this content isn't crawled. For example, how do I crawl the pictures on a link like this? http://www.tiendeo.nl/Catalogi/amsterdam/16558&subori=web_sliders&buscar=Boni&sw=1366
If you inspect the element, it shows the right code in the middle where the pictures are. But how do I crawl this? If you click to the next page, the source will contain other images. How do I get the source for all of the images?
If I understand your question correctly (How do I crawl information loaded into the page by AJAX calls?), the answer is that you either need some sort of JavaScript-aware crawler, or you need to inspect the JavaScript to figure out what resources are being polled to load the content you're interested in. From PHP you should be able to send curl GET requests to these URLs and receive the same responses the site's JavaScript uses to render the entries.
The latter option has some rewards--namely that you will most likely be able to get simple, easy-to-use JSON responses to your requests.
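For example, once you have found the endpoint the page's JavaScript polls (the URL and field names below are made up for illustration), a plain curl request from PHP is usually enough:

    <?php
    // Hypothetical endpoint found via the browser's network tab; substitute the
    // real URL and parameters you discover for the site in question.
    $url = 'https://example.com/api/listings?page=1';

    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,           // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,           // follow redirects
        CURLOPT_USERAGENT      => 'Mozilla/5.0',  // some endpoints refuse requests without a UA
    ]);
    $body = curl_exec($ch);
    curl_close($ch);

    // Many of these endpoints return JSON, which is far easier to handle than scraped HTML.
    $data = json_decode($body, true);
    foreach (($data['items'] ?? []) as $item) {
        echo $item['title'], PHP_EOL;
    }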
As with most web scraping efforts, it tends to be the case that some content providers won't appreciate your interest in their data (especially if you're collecting it in ways that put undue strain on their systems or resources). Keep in mind that they may take steps (technical or legal) to stop you if they notice/mind.
Addendum:
If you're hoping to crawl a variety of similar sites without needing to look through the source to find the resources they're using (let's say for the sake of argument that you're just trying to naively scrape all images over a certain size from several sites selling the same sort of items), you'd need the former option--some sort of JavaScript-aware scraper. I don't know if such a thing exists, but it wouldn't surprise me.
Our company allows its clients to view reports via our website. The pages are php based and the data is collected from MySQL. These reports were written a long time ago and include inline css. The pages themselves look fine, but the print version is lacking. I want to take the reports and create visually appealing "printable" pages that contain our branding.
I have found three solutions so far.
Media Print Stylesheets
This is the easiest method, but it does not give me complete layout control. I want landscape mode and need to control where the page breaks occur, so this method has been eliminated from my list of possible solutions. The reports are built by looping through PHP data, so while I can always put a page break after a particular element, I can't stop the page from breaking before it gets to the next set of data.
TCPDF/FPDF
From what I have seen, these classes will give me all of the control I need to customise a PDF. The challenge is that this appears to be a little more advanced than my programming skills allow, and all of the inline CSS contained within the HTML tables may throw off the formatting.
FDF
I am leaning towards this method, if I understand it correctly. First I would create a PDF form and define all of the fields to be populated by the MySQL data. Then I would create an FDF file that populates the form template with the data from the database. It seems easier to me to create a visually pleasing form as a PDF and then populate it using this method, rather than create the entire PDF from scratch using method 2.
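To make that concrete, here is roughly the kind of FDF file I imagine generating; the field names and template filename are just placeholders that would have to match the fields defined in the PDF form:

    <?php
    // Values would normally come from the MySQL query; hard-coded here for illustration.
    $fields = [
        'client_name' => 'Acme Ltd',
        'report_date' => '2013-05-01',
        'total'       => '1,234.56',
    ];

    // Build the /Fields entries of the FDF body.
    // Note: real values containing ( ) or \ characters need escaping.
    $fdfFields = '';
    foreach ($fields as $name => $value) {
        $fdfFields .= "<< /T ($name) /V ($value) >>\n";
    }

    // Minimal FDF document pointing at the PDF form template (placeholder filename).
    $fdf = "%FDF-1.2\n"
         . "1 0 obj\n<< /FDF << /Fields [\n$fdfFields] /F (report-template.pdf) >> >>\nendobj\n"
         . "trailer\n<< /Root 1 0 R >>\n"
         . "%%EOF";

    file_put_contents('report.fdf', $fdf);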
Does it sound like I am on the right track? Are any of these methods "easier" than the other?
Any help is greatly appreciated.
TCPDF gives the most control over each page, which is what I am looking for. It is extremely sensitive when writing HTML, but that is the only downside I have found so far.
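A minimal sketch of how that looks in practice (the HTML string is just a stand-in for whatever your report loop already produces, and the paths are placeholders):

    <?php
    require_once 'tcpdf/tcpdf.php';

    // Landscape A4 in millimetres; constructor arguments are orientation, unit, format.
    $pdf = new TCPDF('L', 'mm', 'A4');
    $pdf->SetFont('helvetica', '', 10);
    $pdf->AddPage();

    // Keep the HTML simple, since TCPDF is fussy about what it will render.
    $html = '<h1>Monthly Report</h1>
             <table border="1" cellpadding="4">
               <tr><td>Item</td><td>Total</td></tr>
               <tr><td>Widgets</td><td>1,234.56</td></tr>
             </table>';
    $pdf->writeHTML($html, true, false, true, false, '');

    // Start a new page exactly where you want the break, rather than where TCPDF decides.
    $pdf->AddPage();

    $pdf->Output('report.pdf', 'I'); // 'I' sends the PDF inline to the browser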
There's this excellent answer on SO already.
If you're looking for easy, my money is on mPDF. I found it to be the easiest, essentially an out-of-the-box solution (often with zero server configuration needed).
I think you should try out wkhtmltopdf.
https://code.google.com/p/wkhtmltopdf/
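wkhtmltopdf is a command-line tool, so from PHP you would typically just shell out to it; a rough sketch (the report URL and output path are placeholders, and it assumes the wkhtmltopdf binary is installed and on the PATH):

    <?php
    // Render an existing report URL to a landscape PDF.
    $source = escapeshellarg('https://example.com/reports/view.php?id=42');
    $target = escapeshellarg('/tmp/report-42.pdf');

    shell_exec("wkhtmltopdf --orientation Landscape $source $target");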
As for the TCPDF/FPDF pagination issue, see this other question for the solution provided and use the approach in it to sort yours out.
TCPDF / FPDF - Page break issue
Just found this other solution as well and think you'll find it useful:
Convert HTML + CSS to PDF with PHP?
For me personally, FPDF works great for fetching data from my database, feeding it into the FPDF class, and dynamically creating PDFs for customers.
I see some people want to write HTML/CSS to create PDFs, but you will always have differences, as the browser parses the HTML/CSS differently than a PDF library does. Using FPDF's built-in methods, I have been able to get exactly what I wanted and haven't seen any issues (yet).
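For what it's worth, the pattern looks roughly like this; the connection details, table, and column names are obviously placeholders for your own schema:

    <?php
    require 'fpdf.php';

    // Placeholder connection details and schema; substitute your own.
    $db = new mysqli('localhost', 'user', 'pass', 'reports');

    $pdf = new FPDF('L', 'mm', 'A4'); // landscape
    $pdf->AddPage();
    $pdf->SetFont('Arial', 'B', 14);
    $pdf->Cell(0, 10, 'Customer Report', 0, 1, 'C');

    $pdf->SetFont('Arial', '', 10);
    $result = $db->query('SELECT name, total FROM invoices ORDER BY name');
    while ($row = $result->fetch_assoc()) {
        // One row per record: two bordered cells, then move to the next line.
        $pdf->Cell(120, 8, $row['name'], 1);
        $pdf->Cell(40, 8, number_format($row['total'], 2), 1, 1, 'R');
    }

    $pdf->Output(); // send the generated PDF to the browser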
I have a website running with Joomla!, and I'm using PhocaGallery, which is a component used for managing a photo gallery.
In my articles, I simply put a tag, for example:
{phocagallery view=category|categoryid=2|limitcount=2}
This tag simply displays 2 images from the category whose ID is 2.
I'm developing my own application for this website, and when I load an article in it, only the tag is displayed, without the images, which is expected.
I have the long PHP code for this plugin.
I would like to know how to:
1. Detect the tag when the article is loaded in the application
2. Call the PHP script in the application to fetch the right images
3. Display the images
The problem, I think, is that the PHP code may reference folders on the website, which the application can't reach...
Do you think it's possible?
Detect the tag: use a regular expression to parse out the parameters you need (see the sketch below).
Invoking the plugin is pretty straightforward; just make sure you make all the necessary resources available, i.e. most likely you will want to load the whole Joomla framework. Since you're just querying a single table, you might be better off doing it on your own with MySQL; it will save you a lot of time both in development and at runtime.
Just output the resulting image markup.
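For step 1, something along these lines should pull the parameters out of the tag; the pattern assumes the exact {phocagallery ...} format shown above:

    <?php
    $article = 'Intro text {phocagallery view=category|categoryid=2|limitcount=2} more text.';

    // Match the {phocagallery ...} tag and capture everything between the braces.
    if (preg_match('/\{phocagallery\s+([^}]+)\}/i', $article, $m)) {
        $params = [];
        // Parameters are separated by "|" and written as key=value.
        foreach (explode('|', $m[1]) as $pair) {
            $parts = explode('=', $pair, 2);
            if (count($parts) === 2) {
                $params[trim($parts[0])] = trim($parts[1]);
            }
        }
        print_r($params);
        // Array ( [view] => category [categoryid] => 2 [limitcount] => 2 )
    }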
I just need some clarity here on whether this concept is possible or whether I have misunderstood what crawlers are capable of.
Say I have a list of 100 websites/blogs, and every day my program (I am assuming it's a crawler of some sort) will crawl through them, and if there is a match for some specified phrases like "miami heat" or "lebron james", it will proceed to download that page, convert it to a PDF with full text/images, and save that PDF.
So my questions are:
This type of thing is possible, right? Please note that I don't want just text snippets; I am hoping to get the entire page as if it were printed out on a piece of paper.
These types of programs are called crawlers, right?
I am planning to build on code from http://phpcrawl.cuab.de/about.html
This is perfectly possible. Use phpcrawl to crawl the web pages, then use wkhtmltopdf to convert the HTML to a PDF as-is.
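A stripped-down sketch of that idea, leaving phpcrawl out for brevity and just fetching each URL directly (the site list, phrases, and output directory are placeholders):

    <?php
    // Placeholder list of sites and phrases; in practice this would be your ~100 blogs.
    $sites   = ['https://example-blog-1.com/', 'https://example-blog-2.com/'];
    $phrases = ['miami heat', 'lebron james'];

    foreach ($sites as $url) {
        $html = @file_get_contents($url);
        if ($html === false) {
            continue; // skip sites that are down or unreachable
        }

        foreach ($phrases as $phrase) {
            // Case-insensitive check for the phrase anywhere in the page.
            if (stripos($html, $phrase) !== false) {
                // Render the full page (text + images) to a PDF with wkhtmltopdf,
                // assuming the binary is installed on the machine running the script.
                $target = '/var/pdfs/' . md5($url . date('Y-m-d')) . '.pdf';
                shell_exec('wkhtmltopdf ' . escapeshellarg($url) . ' ' . escapeshellarg($target));
                break; // one PDF per matching page is enough
            }
        }
    }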
Yes, it is possible. By using the wkhtmltopdf tool you can convert a web page as-is; it's desktop-based software, so you can install it on your machine.
Yes, crawlers.
It's a perfect tool for building what you want to build.
Yes, it is possible.
You could call it a crawler or a scraper, as you are scraping data off the websites.
Rendering the website to a PDF will probably be the hardest part; there are web services that can do this for you.
For Example
http://pdfmyurl.com/
(I have no affiliation and I have never used them; it was just the first site in the Google results when I checked)
We download images to our computers when we open new webpages. For example, if a webpage has an image (image.jpg), our computer downloads it while we are browsing that page.
Some webpages use AJAX methods. For example, you don't see an image in the page's source code, yet your computer downloads one, because if you click a link on that page, AJAX will show that image...
Let me show an example:
<div id="ajax_will_load_image_here"></div>
Okay, how can PHP curl see (or download) that image? curl can't see the image when I try to use the preg_match function, yet the image is actually there. I want to download that image using PHP curl. Any advice?
If I understand the question correctly, there is no convenient way of doing that.
Your crawler/spider would have to parse the website and evaluate JavaScript.
There are libraries for that but support is very limited.
There are, however, methods where an actual browser is used to evaluate the page (without displaying it, but setting proper environment variables like resolution etc.).
Then the generated source, including JavaScript DOM modifications, is available.
This is for example how the google search previews are generated.
But if you require user interaction it gets pretty specific and complicated.
I am sorry to disappoint you, but using curl and preg_match the old-school way we did when JavaScript was not yet so common won't work.
However, for most legitimate use cases that approach is more than sufficient, and websites today are more and more designed to work without JavaScript, especially the content meant for crawling. It is a must for search engine optimization, and which website doesn't want that?