How to write this crawler in php?


I need to create a PHP script.
The idea is very simple:
When I send the link of a blog post to this PHP script, the web page is crawled, and the first image and the page title are saved on my server.
Which PHP functions should I use for this crawler?

Use PHP Simple HTML DOM Parser
// Create DOM from URL
$html = file_get_html('http://www.example.com/');
// Find all images
$images = array();
foreach ($html->find('img') as $element) {
    $images[] = $element->src;
}
The $images array now holds the image links found on the given web page. You can then store the image you want in a database.
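For comparison, here is a minimal sketch of the same idea using PHP's built-in DOMDocument instead of a third-party parser. It pulls both the page title and the first image src, which is what the question asks for. The helper name extractTitleAndFirstImage() and the inline sample markup are assumptions made for illustration.

```php
<?php
// Hypothetical helper (not a library function): returns the page title and
// the src of the first <img> found in an HTML string.
function extractTitleAndFirstImage(string $html): array
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titleNode = $doc->getElementsByTagName('title')->item(0);
    $imgNode   = $doc->getElementsByTagName('img')->item(0);

    return [
        'title' => $titleNode ? trim($titleNode->textContent) : null,
        'image' => $imgNode ? $imgNode->getAttribute('src') : null,
    ];
}

// Inline sample markup so the sketch runs without network access.
$sample = '<html><head><title>My post</title></head>'
        . '<body><img src="/img/a.png"><img src="/img/b.png"></body></html>';
$result = extractTitleAndFirstImage($sample);
echo $result['title'], "\n";
echo $result['image'], "\n";
```

With a live page you would fetch the markup first (for example with file_get_contents($url)), resolve a relative src against the page URL, and then save the image bytes with file_put_contents().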

HTML parser: HTMLSQL
Features: it can load an external HTML file or an HTTP or FTP link and parse the content.

Well, you'll have to use quite a few functions :)
But I'm going to assume that you're asking specifically about finding the image, and say that you should use a DOM parser like Simple HTML DOM Parser, then curl to grab the src of the first img element.

I would use file_get_contents() and a regular expression to extract the first image tag's src attribute.
cURL or an HTML parser seems like overkill in this case, but you are welcome to check them out.
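A rough sketch of the regular-expression approach might look like the following. The helper name firstImageSrc() is hypothetical, and the pattern is deliberately simple: it only handles quoted src values, and it would also match an attribute such as data-src, so treat it as a starting point rather than a robust parser.

```php
<?php
// Hypothetical helper: return the src of the first <img> tag, or null if none is found.
function firstImageSrc(string $html): ?string
{
    // Matches src="..." or src='...' inside the first <img ...> tag, case-insensitively.
    $pattern = '/<img\b[^>]*?\bsrc\s*=\s*(["\'])(.*?)\1/is';
    if (preg_match($pattern, $html, $matches) === 1) {
        // Decode entities such as &amp; that may appear inside attribute values.
        return html_entity_decode($matches[2]);
    }
    return null;
}

// Inline sample; in practice $html would come from file_get_contents($url).
$html = '<p>Intro</p><img class="hero" src="https://example.com/a.jpg" alt=""><img src="b.jpg">';
echo firstImageSrc($html), "\n";
```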

Related

Simple HTML DOM can't see all hrefs

I'm trying to retrieve the YouTube link from a certain site, but the Simple HTML DOM parser can't find the links I'm looking for.
$new_html = file_get_html("https://www.bia2.com/video/Amir-Shamloo/Delam-Tange/");
foreach ($new_html->find('href') as $youtube) {
    echo $youtube;
}
It should find the link https://www.youtube.com/watch?v=vJ2aNG0aJPU.
Does someone know what the problem is here?

That particular link is inserted by JavaScript, through onYouTubeIframeAPIReady("vJ2aNG0aJPU"), during the onload event.
SimpleHtmlDom (or any other PHP-based HTML parser, for that matter) will not execute any JavaScript. These parsers only parse the markup returned by the web server.
You'd need a scraper capable of executing JavaScript before you can scrape the link. Alternatively, you can match the argument to that function and assemble the link yourself.
On a side note: $new_html->find('href') will try to find any elements named "href", which is not what you want. To get all elements that have an href attribute, use the selector *[href] instead.
On another side note: SimpleHtmlDom is a crap library. Consider your options:
How do you parse and process HTML/XML in PHP?
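Both suggestions above can be sketched with built-in PHP functions. The first helper matches the argument passed to onYouTubeIframeAPIReady() in the raw page source and assembles the watch URL; the second collects every href attribute with DOMXPath, the built-in counterpart to the *[href] selector. The helper names and the inline sample markup are assumptions, and the real page's markup may differ.

```php
<?php
// Hypothetical helper: find the ID passed to onYouTubeIframeAPIReady("...")
// in the raw page source and build the watch URL from it.
function youtubeLinkFromSource(string $html): ?string
{
    $pattern = '/onYouTubeIframeAPIReady\(\s*["\']([A-Za-z0-9_-]+)["\']\s*\)/';
    if (preg_match($pattern, $html, $m) === 1) {
        return 'https://www.youtube.com/watch?v=' . $m[1];
    }
    return null;
}

// Hypothetical helper: collect the href attribute of every element that has one.
function allHrefs(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings from sloppy markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ((new DOMXPath($doc))->query('//*[@href]') as $node) {
        $hrefs[] = $node->getAttribute('href');
    }
    return $hrefs;
}

// Inline stand-ins for the fetched page.
$source = '<body onload=\'onYouTubeIframeAPIReady("vJ2aNG0aJPU")\'></body>';
echo youtubeLinkFromSource($source), "\n";

$links = allHrefs('<p><a href="/one">1</a> <a href="/two">2</a> <span>no link</span></p>');
echo implode(', ', $links), "\n";
```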

PHP file_get_contents() issue

With PHP's file_get_contents() I want only the post and its image, but it gets the whole page. (I know there are other ways to do this.)
Example:
$homepage = file_get_contents('http://www.bdnews24.com/details.php?cid=2&id=221107&hb=5', true);
echo $homepage;
It shows the full page. Is there any way to show only the post with cid=2&id=221107&hb=5?
Thanks a lot.

Use PHP's DOMDocument to parse the page. You can filter it further if you wish, but this is the general idea.
$url = 'http://www.bdnews24.com/details.php?cid=2&id=221107&hb=5';
// Create new DOMDocument
$doc = new DOMDocument();
$doc->loadHTMLFile($url);
// Get the post
$post = $doc->getElementById('opage_mid_left');
var_dump($post);
Update:
Unless the image is a requirement, I'd use the printer-friendly version: http://www.bdnews24.com/pdetails.php?id=221107, it's much cleaner.
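As a hedged variation on the snippet above, the sketch below suppresses the warnings that libxml tends to emit for real-world markup and returns the selected element as an HTML string rather than dumping the node object. The helper name fragmentById() and the inline sample page are assumptions; only the id opage_mid_left comes from the answer.

```php
<?php
// Hypothetical helper: return the outer HTML of the element with the given id, or null.
function fragmentById(string $html, string $id): ?string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings from sloppy markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $node = $doc->getElementById($id);
    // saveHTML() with a node argument serializes just that subtree.
    return $node ? $doc->saveHTML($node) : null;
}

// Inline sample so the sketch runs offline; a live page would be fetched first.
$page = '<html><body><div id="menu">Nav</div>'
      . '<div id="opage_mid_left"><h1>Headline</h1><p>Story text</p></div></body></html>';
$fragment = fragmentById($page, 'opage_mid_left');
echo $fragment, "\n";
```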

You will need to parse the resulting HTML using a DOM parser to get the HTML of only the part you want. I like PHP Simple HTML DOM Parser, but as Paul pointed out, PHP also has its own.

You can extract the
<div id="page">
//POST AND IMAGE EXIST HERE
</div>
part from the fetched contents using a regex and output it on your page...

Creating a personalization engine with php

I am new to PHP, and I want to create a PHP engine that changes the content of a web page using data stored in MySQL (for example, ordering the navigation links on a page by click count, highest first). I am not sure how PHP would read the HTML file, change elements in it, and output the HTML with the changes. Is this possible?

I am not quite sure why you would want to generate the html, read it, change it and then output it. It seems to be a lot easier to just generate it the way you want to in the first place.
"I am not sure how PHP will read the HTML file and change the elements in the HTML file and also output the HTML file with the changes. Is this possible?"
You could use file_get_contents:
$html = file_get_contents($url);
Then use an HTML parser like Simple HTML DOM Parser, change what you want, and output the result.
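To illustrate the first point (generating the markup in the desired order to begin with), here is a minimal sketch that sorts navigation links by click count before rendering them. The helper name renderNav() and the row fields label, url, and clicks are assumptions; in a real application the rows would come from a MySQL query.

```php
<?php
// Hypothetical helper: render navigation links ordered by click count, highest first.
function renderNav(array $links): string
{
    // Sort by the 'clicks' field in descending order.
    usort($links, fn(array $a, array $b): int => $b['clicks'] <=> $a['clicks']);

    $items = '';
    foreach ($links as $link) {
        $items .= sprintf(
            '<li><a href="%s">%s</a></li>',
            htmlspecialchars($link['url']),
            htmlspecialchars($link['label'])
        );
    }
    return '<ul>' . $items . '</ul>';
}

// Stand-in for rows fetched from the database; the column names are assumptions.
$rows = [
    ['label' => 'Home',    'url' => '/',        'clicks' => 12],
    ['label' => 'Blog',    'url' => '/blog',    'clicks' => 40],
    ['label' => 'Contact', 'url' => '/contact', 'clicks' => 3],
];
$nav = renderNav($rows);
echo $nav, "\n";
```

Because the order is decided before any HTML exists, there is no need to parse and rewrite a finished page.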

If you want to modify the HTML structure, use ganon, an HTML DOM parser for PHP:
include('path/ganon.php');
// Parse the Google Code website into a DOM
$html = file_get_dom('http://code.google.com/');
foreach ($html('p[class]') as $element) {
    echo $element->class, "<br>\n";
}

Getting the first URL of an image search result with google image API in PHP

Do you know of a PHP script (a class would be nice) that gets the URL of the first image result from a Google image search API? Thanks.
Example:
<?php echo(geturl("searchterm")) ?>

I found a solution for getting the first image from a Google Images result using Simple HTML DOM, as Sarfraz suggested.
Please check the code below. It currently works fine for me.
$search_keyword = str_replace(' ', '+', $search_keyword);
$newhtml = file_get_html("https://www.google.com/search?q=" . $search_keyword . "&tbm=isch");
$result_image_source = $newhtml->find('img', 0)->src;
echo '<img src="' . $result_image_source . '">';

You should be able to do that easily with Simple HTML DOM.
Note: See the examples on their site for more information.
A HTML DOM parser written in PHP5+ let you manipulate HTML in a very easy way!
Find tags on an HTML page with selectors just like jQuery.

How to get specific content with PHP and DOM Document?

I have a URL I want to grab. I only want a short piece of content from it. The content in question is in a div that has an ID of sample.
<div id="sample">
Content
</div>
I can grab the file like so:
$url= file_get_contents('http://www.example.com/');
But how do I select just that sample div?
Any ideas?

I'd recommend using the PHP Simple HTML DOM Parser.
Then you can do:
$html = file_get_html('http://www.example.com/');
$sample = $html->find('div[id=sample]', 0);
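Since the question mentions DOM Document, here is a hedged sketch of the same selection with PHP's built-in DOMDocument and DOMXPath, with no third-party library. The helper name innerTextOfDiv() is hypothetical, and the sketch assumes the id value contains no double quotes, since it is interpolated directly into the XPath expression.

```php
<?php
// Hypothetical helper: return the trimmed text of <div id="..."> or null if absent.
function innerTextOfDiv(string $html, string $id): ?string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings from sloppy markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Assumption: $id contains no double quotes (it is placed inside the expression as-is).
    $node = $xpath->query('//div[@id="' . $id . '"]')->item(0);
    return $node ? trim($node->textContent) : null;
}

// Inline sample mirroring the markup in the question; a live page would be fetched first.
$html = "<html><body><div id=\"other\">Skip</div><div id=\"sample\">\nContent\n</div></body></html>";
echo innerTextOfDiv($html, 'sample'), "\n";
```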

I would recommend something like Simple HTML DOM, although if you are very sure of the format, you may wish to look at using regex to extract the data you want.

A while ago, I released an open source library named PHPPowertools/DOM-Query, which allows you to (1) load an HTML file and then (2) select or change parts of your HTML much like you'd do it with jQuery.
Using that library, here's how you'd select the sample div for your example:
use \PowerTools\DOM_Query;
// Get file content
$htmlcode = file_get_contents('http://www.example.com/');
// Create a new DOM_Query object
$H = new DOM_Query($htmlcode);
// Find the elements that match selector "div#sample"
$s = $H->select('div#sample');
