I'd like to have a script where I can input a URL and it will intelligently grab the first paragraph of the article... I'm not sure where to begin other than just pulling text from within <p> tags. Do you know of any tips/tutorials on how to do this kind of thing?
Update:
For further clarification, I'm building a section of my site where users can submit links, like on Facebook: it'll grab an image from the linked site as well as text to go with the link. I'm using PHP and trying to determine the best method of doing this.
I say "intelligently" because I'd like to try to get content on that page that's important, not just the first paragraph, but the first paragraph of the most important content.
If the page you want to grab is foreign, or even local but with a structure you don't know in advance, I'd say the best way to achieve this is with the PHP DOM functions.
function get_first_paragraph($url)
{
    $page = file_get_contents($url);
    $doc = new DOMDocument();
    /* Suppress warnings about real-world malformed markup */
    @$doc->loadHTML($page);
    /* Gets all the paragraphs */
    $p = $doc->getElementsByTagName('p');
    /* Extracts the first one */
    $p = $p->item(0);
    /* Returns the paragraph's content */
    return $p ? $p->textContent : '';
}
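Usage would then be as simple as, for example:

echo get_first_paragraph('http://www.example.com/some-article.html');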
Short answer: you can't.
In order to have a PHP script "intelligently" fetch the "most important" content from a page, the script would have to understand the content on the page. PHP is not a natural language processor, nor is this a trivial area of study. There might be some NLP toolkits for PHP, but even then I doubt it would be easy.
A solution that can be achieved with reasonable effort would be to fetch the entire page with an HTML parser and then look for elements with certain class names or ids commonly found in blog engines. You could also parse for hAtom microformats, or look for meta tags within the document and other more clearly defined information.
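A hedged sketch of that idea in PHP, using DOMXPath; the class/id list below is just a guess at names commonly used by blog engines and would need tuning per site:

function guess_summary($html)
{
    $doc = new DOMDocument();
    // Suppress warnings about real-world malformed markup
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    // Containers commonly used by blog engines/CMSs (purely heuristic)
    $candidates = $xpath->query(
        '//*[@id="content" or @id="main"'
        . ' or contains(@class, "entry-content")'
        . ' or contains(@class, "post")'
        . ' or contains(@class, "article")]//p'
    );
    if ($candidates->length > 0) {
        return trim($candidates->item(0)->textContent);
    }

    // Fall back to the meta description, if the page defines one
    $meta = $xpath->query('//meta[@name="description"]/@content');
    return $meta->length > 0 ? trim($meta->item(0)->nodeValue) : '';
}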
I wrote a Python script a while ago to extract a web page's main article content. It uses a heuristic to scan all text nodes in a document and group together nodes at similar depths, and then assume the largest grouping is the main article.
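For what it's worth, a rough PHP approximation of that heuristic might look like this (assuming the depth bucket holding the most text is the main article):

function guess_main_content($html)
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $groups = array(); // DOM depth => accumulated text at that depth
    foreach ($xpath->query('//body//text()') as $textNode) {
        $text = trim($textNode->textContent);
        if ($text === '') {
            continue;
        }
        // Depth = number of ancestors between this text node and the document
        $depth = 0;
        for ($n = $textNode; $n->parentNode !== null; $n = $n->parentNode) {
            $depth++;
        }
        if (!isset($groups[$depth])) {
            $groups[$depth] = '';
        }
        $groups[$depth] .= ' ' . $text;
    }

    if (empty($groups)) {
        return '';
    }
    // Assume the depth bucket with the most text is the main article
    uasort($groups, function ($a, $b) { return strlen($b) - strlen($a); });
    return trim(reset($groups));
}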
Of course, this method has its limitations, and no method will work on 100% of web pages. This is just one approach, and there are many other ways you might accomplish it. You may also want to look at similar past questions on this subject.
At the risk of getting redirected to this answer (yes, I read it and spent the last 5 minutes laughing out loud at it), allow me to explain this issue, which is just one in a list of many.
My employer asked me to review a site written in PHP, using Smarty for templates and MySQL as the DBMS. It's currently running very slowly, taking up to 2 minutes (with an entirely white screen the whole time, no less) to load completely.
Profiling the code with xdebug, I found a single preg_replace call that takes around 30 seconds to complete. It goes through all the HTML code and replaces each URL found with its SEO-friendly version, and only once it completes does it output the code to the browser. (As I said before, that's not the only issue; the code is rather old, and it shows, but I'll focus on this one for this question.)
Digging further into the code, I found that it currently runs through 1702 patterns with their corresponding replacements (matches and replacements in equally-sized arrays), which would certainly account for the time it takes.
Code goes like this:
//This is just a call to a MySQL query which gets the relevant SEO-friendly URLs:
$seourls_data = $oSeoShared->getSeourls();

$url_masks = array();
$seourls = array();
foreach ($seourls_data as $seourl_data)
{
    if ($seourl_data["url"])
    {
        $url_masks[] = "/([\"'\>\s]{1})".$site.str_replace("/", "\/", $seourl_data["url"])."([\#|\"'\s]{1})/";
        $seourls[] = "$1".MAINSITE_URL.$seourl_data["seourl"]."$2";
    }
}

//After filling both $url_masks and $seourls arrays, the HTML is parsed:
$html_seo = preg_replace($url_masks, $seourls, $html);
//After it completes, $html_seo is simply echo'ed to the browser.
Now, I know the obvious answer to the problem is: don't parse HTML with a regexp. But then, how would you solve this particular issue? My first attempt would probably be:
1. Load the (hopefully well-formed) HTML into a DOMDocument and get the href attribute of each a tag.
2. Go through each node, replacing the URL found with its appropriate match (which would probably mean using the previous regexps anyway, but on a much smaller string); see the sketch below.
3. ???
4. Profit?
But I think that's most likely not the right way to solve the issue.
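For reference, a rough sketch of steps 1-2, reusing getSeourls() and MAINSITE_URL from above, and assuming the mapping can be keyed by the plain URL so each href becomes a single array lookup instead of 1702 regex passes:

// Build a plain lookup table once: original URL => SEO-friendly URL
$map = array();
foreach ($oSeoShared->getSeourls() as $seourl_data) {
    if ($seourl_data["url"]) {
        $map[$site . $seourl_data["url"]] = MAINSITE_URL . $seourl_data["seourl"];
    }
}

$doc = new DOMDocument();
// Note: loadHTML() may wrap fragments in <html>/<body> tags of its own
@$doc->loadHTML($html);

// Each href becomes a single O(1) array lookup instead of a regex pass
foreach ($doc->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    if (isset($map[$href])) {
        $a->setAttribute('href', $map[$href]);
    }
}

$html_seo = $doc->saveHTML();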
Any ideas or suggestions?
Thanks.
As your goal is to be SEO-friendly, using the canonical tag in the target pages would tell the search engines to use your SEO-friendly URLs, so you don't need to replace them in your code...
Oops, that's really tough; a bad strategy from the beginning. Anyway, that's not your fault.
I have two suggestions:
1. Create a caching layer with Smarty, so the first request still takes 2 minutes to generate the HTML, but subsequent requests are served from a static cached copy.
2. Don't leave for later what should have been done earlier: fix the system. Create a database migration that stores the SEO URLs in a good format, or generate them from the titles (or whatever). On my system I generate SEO links in this format:
www.whatever.com/jobs/722/drupal-php-developer
where 722 is the ID (the URL is parsed to fetch the right page content) and drupal-php-developer is the title of the post.
3. (Not really a suggestion.) Tell your client the project is not well engineered (if you truly believe so) and needs restructuring to boost performance.
Run.
I am trying to make myself a homepage, for my personal use only, and what I want to do is display pieces of information from different websites that change a few times a day, i.e. news, weather and such. I want to have my favorite information always in sight without the need to visit many pages. As many websites don't load within an iframe (which was the first thing I tried), I figured PHP might be able to help me.
So what I need to do to is to get the contents of a DIV and place it within my page with PHP.
The DIV on the source page is generated on the server, but it always has the same ID.
example:
<div id="nowbox">
<a href="http://www.seznam.cz/jsTitleExecute?id=91&h=19331020">
<img width="135" height="77" src="http://seznam.cz/favicons/title//009/91-JrAEVc.jpg" alt="" /></a>
<div class="cont"> <ul> <li>
<strong>Sledujte dnes od 20.00 koncert Tata Bojs</strong>
<p>Nenechte si ujít tradiční benefiční koncert kapely Tata Bojs. Sledujte představení na Seznam.cz</p> </li> </ul>
</div>
</div>
so the ID of the DIV is "nowbox" and I need to copy all that is within it and put it in my page.
So far I was only able to use this
$contents = file_get_contents("http://seznam.cz");
and view all contents of the page but I have no idea how to strip everything and leave only the needed DIV.
I am not very experienced in PHP so I would be very grateful for any help, the easier to understand the better.
EDIT:
Thanks for the answers. Basically I just wanted to get the code I posted as an example into a variable so I could echo it somewhere on my page. The problem is that the code changes, as does the rest of the website, and only some things remain the same, e.g. the DIV ID.
Definitely NOT the most elegant solution (even I know that, but as the website is for my purposes only it shouldn't matter), but the one I managed to get to work: I got the whole page with:
$contents = file_get_contents("http://seznam.cz");
and then counted the number of characters to a specific unique position in the code with strpos, plus/minus a static number of characters that I could count manually. Then I split the string into arrays and discarded the parts I didn't need, so the beginning of the wanted code sat at the beginning of a string, and I used the same method to cut the string off after the code ended.
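For reference, that hack boils down to something like the following (the marker string and length are made up, and the approach breaks as soon as the page's markup changes):

$contents = file_get_contents("http://seznam.cz");

// Locate a unique string near the block, then take a hand-counted number of characters
$start  = strpos($contents, '<div id="nowbox">');
$length = 1500; // made-up, manually counted length of the block
$nowbox = substr($contents, $start, $length);

echo $nowbox;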
If you want to do this server-side, I suggest you use phpQuery:
require('phpQuery/phpQuery.php');
$doc = phpQuery::newDocumentFileXHTML('http://seznam.cz');
$html = pq('#nowbox')->htmlOuter();
I'm not sure I fully understand what you want to achieve and why, but you can do this through either the server or the client. First, the client-side way:
Well, what you're asking for is to extract parts of the DOM. Using JavaScript + jQuery on the client side you can achieve this very rapidly, simply by calling something like $("#target").load("/mypage #nowbox").
This could be achieved on the server side as well, using any PHP DOM manipulation library: either the one bundled with PHP (DOMDocument) or one of the easier-to-use libraries such as simplehtmldom (which is a bit memory-leaky).
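For instance, with the bundled DOMDocument it might look roughly like this (saveHTML() with a node argument needs PHP 5.3.6+):

$doc = new DOMDocument();
// Suppress warnings caused by the page's imperfect markup
@$doc->loadHTML(file_get_contents('http://seznam.cz'));

$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//div[@id="nowbox"]');

if ($nodes->length > 0) {
    echo $doc->saveHTML($nodes->item(0)); // outer HTML of the div
}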
So there you have it: options for both the client and server side. Pick whichever suits your needs best.
Please note that CSS rules will not be applied with either method, as the CSS won't be loaded into your DOM.
Good Luck!
I want to get the URL of an image (where it's stored) from the description of an RSS feed, so that I can then put that URL inside a variable.
I know that to get the link of an RSS feed post I have to use $feed->channel->item->link;, where $feed is $feed = simplexml_load_file("link_of_the_feed");.
But what if I want to get the image URL of the post? Do I have to use something like $feed->channel->item->image;?
I really don't know. Maybe an RSS parser like MagPie RSS, which I tried without results?
Thanks in advance.
If the image node is at the top level of the item node, then yes. If it's deeper than the item node, you'll have to traverse it accordingly. It would be helpful if you posted your XML.
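For instance, if the feed happens to expose the image as a standard enclosure element on each item, you can read its url attribute directly (adjust the path to your feed's actual structure):

$feed = simplexml_load_file('link_of_the_feed');

foreach ($feed->channel->item as $item) {
    if (isset($item->enclosure)) {
        $imageUrl = (string) $item->enclosure['url']; // attributes are read with array syntax
    }
}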
EDIT: you can also check out my answer here on how to parse through an XML file with PHP.
You're on the right track! But it all depends on the format that the RSS Feed is set up in.
The item node actually contains a whole bunch of different fields, of which link is only one. Take a look here for information on the other fields that the item node contains.
Now, if the RSS feed points directly to the image file, then you can just use item->link. More likely, however, the link points to a blog post or something that has the image embedded in it. In that case, you can process $feed->channel->item->description to find what you need. The description node contains an escaped HTML summary of the post, and from there you can use a regular expression to find the source of the image. Also remember: you may need to decode the description with htmlspecialchars_decode() before processing it with regular expressions; in my experience, descriptions often arrive with their special characters escaped.
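A rough sketch of that description-based approach (the regex is deliberately simple and assumes the embedded HTML contains an img tag):

$description = (string) $feed->channel->item->description;
$html = htmlspecialchars_decode($description);

// Grab the src of the first <img> tag embedded in the description
if (preg_match('/<img[^>]+src=["\']([^"\']+)["\']/i', $html, $matches)) {
    $imageUrl = $matches[1];
}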
I know that's a lot of information, but once you get started it's really not as hard as it sounds. Good luck!
I am trying to create multiple landing pages populated dynamically with data from a feed.
My initial thought was to create a generic PHP page as a template that can be used to create other pages dynamically and populate them with data from a feed. For instance, the generic page could be called landing.php, and it (and other pages created on the fly) would be populated with feed data depending on an id, keyword or certain string in the URL. E.g. http://www.example.com/landing.php?page=cars or http://www.example.com/landing.php?page=bikes would show content only about cars or bikes, as the case may be.
My question is: how feasible is this approach, and is there a better way to create multiple dynamic pages populated with data from a feed depending on the URL query string or some sort of id?
Many thanks for your help in advance.
I use this quite extensively. For example, where I work, we often have education oriented landing pages, but target each landing page to different types of visitors. A good example may be arts oriented schools looking for a diverse array of potential students who may be interested in a variety of programs for any number of reasons.
Well, who likes 3D modelling? Creative types (generic lander => ?type=generic) from all sorts of social circles. Also, probably gamers (gamer-centric lander => ?type=gamer). And so on.
I apply that variable to the body's class, which can be used to completely reorganize the layout. Then, I select different images for key parts of the layout based on that variable as well. The entire site changes. Different fonts can be loaded, different layout, different content.
I keep this organized via extensive includes. This sounds ugly, but it's not if you stick to a convention. You have to know the limitations of your foundation html, and you can't make too many exceptions. Sure, you could output extra crap based on if the type was gamer or generic, but you're going down the road to a product that should probably be contained in its own landing page if it needs to be that different.
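As a hypothetical sketch of that setup (the ?type whitelist and include names are made up):

<?php
// Whitelist the ?type parameter, then use it for a body class and theme-specific includes
$allowed = array('generic', 'gamer', 'artist');
$type = (isset($_GET['type']) && in_array($_GET['type'], $allowed, true))
      ? $_GET['type']
      : 'generic';
?>
<body class="lander lander-<?php echo $type; ?>">
    <?php include "includes/hero-{$type}.php"; ?>
    <?php include "includes/content-{$type}.php"; ?>
</body>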
I have a few landing pages which can be switched between several contents and styles (5 or 6 'themes'), but the primary purpose of keeping them grouped within the same url is only to maintain focus on the fact that that's where a certain type of traffic goes to in order to convert for this specific thing. Overlapping the purpose of these landing pages is a terrible idea.
Anyway, dream up a great template, outline a rigid convention for development, keep each theme very separate, and go to town on it. I find doing it right saves a load of time, but be careful - Doing it wrong can cost a lot of time too.
Have a look at htaccess URL rewriting. Then your users (and Google) can use a URL like domain.com/landing/cars, but on your server the script will be executed as if someone had entered domain.com/landing.php?page=cars.
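For example, a rule along these lines in your .htaccess (assuming landing.php sits in the web root and mod_rewrite is enabled):

RewriteEngine On
RewriteRule ^landing/([a-zA-Z0-9-]+)/?$ landing.php?page=$1 [L,QSA]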
If you use feed content to populate the pages, you should use some kind of caching to ensure that you do NOT re-download all the feeds on every request/page load.
Checking the feeds every 1 to 5 minutes should be enough, and the very structure of feeds allows you to identify new items easily.
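A minimal sketch of such a cache, using a temp file and a made-up 5 minute TTL:

function get_feed_cached($url, $ttl = 300)
{
    $cacheFile = sys_get_temp_dir() . '/feed_' . md5($url) . '.xml';

    // Re-download only when the cached copy is missing or older than $ttl seconds
    if (!file_exists($cacheFile) || filemtime($cacheFile) < time() - $ttl) {
        $xml = file_get_contents($url);
        if ($xml !== false) {
            file_put_contents($cacheFile, $xml);
        }
    }

    return file_exists($cacheFile) ? simplexml_load_file($cacheFile) : false;
}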
About URL rewrite: http://www.workingwith.me.uk/articles/scripting/mod_rewrite
A nice template engine for generating pages from feeds is PHPTAL (http://phptal.org).
You can load the feed as XML and use it directly in your template.
test.xml:
<foo><bar>baz!!!</bar></foo>
template.html:
<html><head /><body> ${xml/foo/bar}</body></html>
sample.php:
require_once 'PHPTAL.php';  // PHPTAL's entry point

$xml = simplexml_load_file('test.xml');
$tal = new PHPTAL('template.html');
$tal->xml = $xml;           // exposed to the template as ${xml/...}
echo $tal->execute();
And it does support loops and conditional elements.
If you don't need real-time data, you can do this in a few parts:
A script, run by something like cron, which pulls data from your RSS feeds and stores it somewhere (an SQL DB?). It could also tag the entries into categories.
A PHP template that takes the URL arguments, fetches the requested data and displays it for the user (see the sketch below). This is really quite easy to do with PHP, and probably a good learning project as well if you are that way inclined.
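A rough sketch of that second part, assuming the cron job has already stored items in a hypothetical feed_items table with a category column:

// landing.php; the table/column names are made up and connection details are omitted
$page = isset($_GET['page']) ? $_GET['page'] : 'cars';

$pdo  = new PDO('mysql:host=localhost;dbname=landing', 'user', 'pass');
$stmt = $pdo->prepare(
    'SELECT title, summary, link FROM feed_items
     WHERE category = ? ORDER BY published DESC LIMIT 10'
);
$stmt->execute(array($page));

foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $item) {
    echo '<h2><a href="' . htmlspecialchars($item['link']) . '">'
       . htmlspecialchars($item['title']) . '</a></h2>';
    echo '<p>' . htmlspecialchars($item['summary']) . '</p>';
}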
I have an HTML site with around 100 HTML files. I want to build a search engine for it: if the user types a word and hits search, I want to display the related content containing that keyword. Is it possible to do this without using any server-side scripting? And is it possible to implement it using jQuery or JavaScript? Please let me know if you have any ideas!
Thanks in advance.
Possible? Yes. You can download all the files via AJAX, save their contents in an array of strings, and search the array.
The performance however would be dreadful. If you need full text search, then for any decent performance you will need a database and a special fulltext search engine.
Three approaches:
A series of Ajax indexing requests: very slow, not recommended.
Use a DB to store key terms/page references and perform a full-text search.
Utilise off-the-shelf functionality, such as that offered by Google.
The only way this can work is if you have a list of all the pages on the page you are searching from. So you could do this:
pages = new Array("page1.htm","page2.htm"...)
and so on. The problem with that is that to search for the results, the browser would need to do a GET request for every page:
for (var i in pages)
    $.get(pages[i], function (result) { searchThisPage(result) });
Doing that 100 times would mean a long wait for the user. Another way I can think of is to have all the content served in an array:
pages = {
    "index" : "Some content for the index",
    "second_page" : "Some content for the second page",
    etc...
}
Then each page could reference this one script to get all the content, include the content for itself in its own content section, and use the rest for searching. If you have a lot of data, this would be a lot to load in one go when the user first arrives at your site.
The final option I can think of is to use the Google search API: http://code.google.com/apis/customsearch/v1/overview.html
Quite simply - no.
Client-side javascript runs in the client's browser. The client does not have any way to know about the contents of the documents within your domain. If you want to do a search, you'll need to do it server-side and then return the appropriate HTML to the client.
The only way to technically do this client-side would be to send the client all the data about all of the documents, and then get them to do the searching via some JS function. And that's ridiculously inefficient, such that there is no excuse for getting them to do so when it's easier, lighter-weight and more efficient to simply maintain a search database on the server (likely through some nicely-packaged third party library) and use that.
Some useful resources:
http://johnmc.co/llum/how-to-build-search-into-your-site-with-jquery-and-yahoo/
http://tutorialzine.com/2010/09/google-powered-site-search-ajax-jquery/
http://plugins.jquery.com/project/gss
If your site allows search engine indexing, then fcalderan's approach is definitely the simplest.
If not, it is possible to generate a text file that serves as an index of the HTML files. This would probably be only rudimentarily successful, but it is possible. You could use something like the keywording approach in Toby Segaran's book to build a JSON text file. Then use jQuery to load the text file, find the instances of the keywords, de-duplicate the resulting filenames and display the results.