How to get sentences from the website html

Hello, I want to extract all sentences from an HTML document. How can I do that? There are several conditions to handle: first we need to strip the tags, then we need to identify sentences, which may end with ., ? or !. Email addresses and website addresses may also contain a ., so those must not be treated as sentence ends. How do we write a script like this?

It's called programming ;). Start by dividing the task into simpler sub-tasks and implement those. For example, in your case, I'd design the program like this:
Download and parse the HTML document
Extract all text content (pay special attention to <script> and <style> elements)
Merge the text content to one long string
Solve the problem of finding sentences in a string (likely, just parse until you find a stop character in ".!?" and then start a new sentence)
Discard false positives (Like empty sentences, number-only sentences etc.)
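
The steps above can be sketched in PHP roughly as follows. This is only an illustration under some assumptions: extractSentences is a hypothetical helper name, the DOM extension is assumed to be available, and the split rule (a stop character followed by whitespace) is a refinement that is not part of the list above. It keeps email and web addresses intact, but abbreviations such as "e.g." will still be split incorrectly.

```php
<?php
// Hypothetical helper: returns the sentences found in an HTML string.
function extractSentences(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Collect every text node that is not inside <script> or <style>.
    $xpath = new DOMXPath($doc);
    $pieces = [];
    foreach ($xpath->query('//text()[not(ancestor::script) and not(ancestor::style)]') as $node) {
        $pieces[] = $node->nodeValue;
    }

    // Merge the text into one long string and normalise whitespace.
    $text = trim(preg_replace('/\s+/', ' ', implode(' ', $pieces)));

    // Start a new sentence after ., ! or ? only when whitespace follows,
    // so "joe@example.com" and "3.14" are not cut in half.
    $candidates = preg_split('/(?<=[.!?])\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    // Discard false positives: keep only candidates that contain a letter.
    return array_values(array_filter(
        $candidates,
        fn (string $s): bool => preg_match('/\pL/u', $s) === 1
    ));
}
```

Calling extractSentences($html) on a downloaded page (for example one fetched with file_get_contents) returns a plain array of strings that you can filter further.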

First you should strip certain tags that are inline formatting elements, like:
I <b>strongly</b> agree.
But you should leave in block-level elements, like DIV and P, because they are even stronger delimiters than . ? and !
Then you have to process the content in these block level elements. Typically there are navigation links with one word, you might want to filter them out later, so it is not the right choice to strip away the block structure of the document.
At this point you can safely use the regex pattern to identify blocks:
>([^<]+)<
When you have your blocks you can filter out the short ones (navigation elements) and split the big ones (paragraphs of text) using your sentence delimiters.
There are interesting questions about when a full stop signals the end of a sentence and when it is just a decimal point, but I leave that to you. :)
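
A rough sketch of this block-based approach, under a few assumptions: sentencesFromBlocks is a hypothetical name, the list of inline tags is only an example and not complete, and the 20-character threshold for "short" blocks is an arbitrary guess you would tune for your pages.

```php
<?php
// Hypothetical helper: block-based sentence extraction with plain regexes.
function sentencesFromBlocks(string $html, int $minLength = 20): array
{
    // Strip inline formatting tags only; block-level tags stay as delimiters.
    $html = preg_replace('#</?(?:b|i|u|em|strong|span|a)\b[^>]*>#i', '', $html);

    // Every run of text between two tags is treated as one block.
    preg_match_all('/>([^<]+)</', $html, $matches);

    $sentences = [];
    foreach ($matches[1] as $block) {
        $block = trim(html_entity_decode($block));
        if (strlen($block) < $minLength) {
            continue; // probably a navigation link or a label
        }
        // Split after ., ! or ? only when whitespace follows, which leaves
        // decimal points such as "3.14" alone.
        foreach (preg_split('/(?<=[.!?])\s+/', $block, -1, PREG_SPLIT_NO_EMPTY) as $sentence) {
            $sentences[] = $sentence;
        }
    }
    return $sentences;
}
```

Requiring whitespace after the full stop is one simple answer to the decimal-point question; it is a heuristic, not a complete solution.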

Related

PHP Data Scraping

I would like to scrape some data from a website with PHP, using
preg_match("/ /is", $contents, $matches);
The website I am trying to get data from looks like this
https://www.spareroom.co.uk/flatshare/?search_id=592135669&
I would like to scrape the line that says;
Showing 1-17 of 17 results
I want to use (.*?) to get the total number of properties (in this case 17) for a website to show this information separately.
How can I use preg_match when the data I am scraping varies according to the amount of properties available?
I look forward to any assistance.
David

Going by the example page it looks like this line appears once on a page. If it appears multiple times you may want preg_match_all to return multiple results. Another tricky thing about doing this is changes that get made to a web page from time to time. So here is a solution that will work right now, but you can also tweak things to account for changes in the web page (something I can't tell from a single example):
preg_match( "#<.*?>\s*(\d+)\s*<.*?>\s+results#i", $page, $results );
So I use the i flag to make it case insensitive. That way if they capitalize "results" or something it won't break.
<.*?>
Keep in mind that you are going to be getting the HTML code which has tags you can't see from the front. In this case there are strong tags around the total. But maybe they will change this to a different tag in the future? So I just used open/close angle brackets with wildcards for the contents. Oh and the question mark is so it's not greedy and stops at the closest angle bracket.
\s*
This looks for 0 or more spaces. Right now there is a single space between the strong tag and the total. What if they remove that space or add more? This should cover you in both cases.
(\d+)
The parentheses are what capture content into the $results array. Inside them the pattern asks for 1 or more digits, so only numbers.
\s*
Like earlier, looking for 0 or more space characters.
<.*?>
This is to match the closing strong tag but takes account that they could later use a different closing tag.
\s+results
This is looking for one or more spaces before the word results. We know there has to be at least one, but they could make changes in the future that will put more spaces in there (even though the webpage will only display one).
$results will have two elements. The first one will be the entire matched phrase, and the second will contain just the captured part (between the parentheses).
There are a million variations you can do to account for variations in the HTML, but this is one that maybe can get you started and you could tweak.
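
To see the pattern in action without hitting the live site, here is a small sketch. The $page string is an assumed sample of the markup (strong tags around the numbers, as described above); the real page may differ.

```php
<?php
// Assumed sample markup; in practice $page would hold the downloaded HTML.
$page = '<p>Showing <strong>1-17</strong> of <strong>17</strong> results</p>';

if (preg_match('#<.*?>\s*(\d+)\s*<.*?>\s+results#i', $page, $results)) {
    // $results[0] is the whole match, $results[1] the captured digits.
    $total = (int) $results[1];
    echo "Total properties: $total\n";
} else {
    echo "Pattern not found; the markup may have changed.\n";
}
```

Checking the return value of preg_match before reading $results avoids notices on pages where the line is missing.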

preg_replace function for multiple matches in PHP

I am trying to get the base64 code for the inserted image string using the Summernote JS editor.
I manage to get this
preg_replace('#<img\ssrc="data:image/([\w]+);base64,([a-zA-Z0-9+/-_=]+)"\sstyle="width:\s[0-9]+px;">#',"IMAGE REMOVED FROM CHROME\r\n", $content,-1, $counter);
as well as a few variations, because the position of the style, data-filename and src attributes keeps changing between browsers.
1) Is there an easier way to do this?
For example, if I have all the components of the img tag (src, style and data-filename), can I just match the string? I can create all the variations, but if I were to run $content = preg_replace 10 times for 10 different variations, isn't that extremely slow just to find one match? And it becomes increasingly slower if my $content is extremely long.
2) I need to pull out the base64 string to save it as an image. How can I use the regex above to help me fopen, fwrite, fclose?
Thanks in advance.

You might use something like this:
<img(\s+src="data:image/([\w]+);base64,([\w+/-_=]+)")?(\s+style="width:\s*([\d]+)px;")?\s*>
You can look at this working example.
This way, it covers the absence of src and/or style attributes.
It's up to you to eventually refine it to:
change the 2 attribute groups to non-capturing, depending on how you find them useful or not
also make their contents partially optional, or even optionally richer (e.g. style might include other properties)
Some additional points:
to cover every case, I replaced \s with \s+ between attributes and, conversely, with \s* inside style
I simplified the [a-zA-Z0-9+/-_=]+ and [0-9]+ expressions, which become [\w+/-_=]+ and [\d]+
I added capturing parentheses around the latter (from your question, it seems you need to get the width when available)
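
For the second part of the question (pulling out the base64 data), one possible sketch is to run the pattern above through preg_replace_callback, so that the replacement and the extraction happen in a single pass. The sample $content, the $images array layout and the replacement text are assumptions for illustration only.

```php
<?php
$pattern = '#<img(\s+src="data:image/([\w]+);base64,([\w+/-_=]+)")?(\s+style="width:\s*([\d]+)px;")?\s*>#';

// Assumed sample content; "aGVsbG8=" is just base64 for the text "hello".
$content = 'before <img src="data:image/png;base64,aGVsbG8=" style="width: 25px;"> after';

$images = [];
$cleaned = preg_replace_callback(
    $pattern,
    function (array $m) use (&$images): string {
        // Group 3 is the base64 payload; it is absent when src did not match.
        if (($m[3] ?? '') === '') {
            return $m[0]; // leave tags without inline data untouched
        }
        $bytes = base64_decode($m[3], true);
        if ($bytes === false) {
            return $m[0]; // not valid base64 after all
        }
        $images[] = ['type' => $m[2], 'width' => $m[5] ?? null, 'bytes' => $bytes];
        return "IMAGE REMOVED\r\n";
    },
    $content
);

// Each decoded image could then be written out, for example with
// file_put_contents("image_$i." . $image['type'], $image['bytes']);
```

file_put_contents wraps the fopen, fwrite, fclose sequence in one call; the file name shown in the comment is only a placeholder.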

php character limits (trim an html paragraph)

We have our own blog system and the post data is stored as raw HTML, so when it's called from the db we can just echo it and it's formatted completely; no need for BB codes in our situation. Our issue now is that our blog posts are sometimes too long and need to be trimmed.
The problem is that our data contains HTML, mostly <font>, <span>, <p>, <b>, and other styling tags. I made a PHP function that trims the characters, but it doesn't take the HTML tags into account. If the trim function trims the post it should not cut through tags, because that messes up the whole page. The function needs to be able to close the HTML tags if they're trimmed. Is there a function out there that can do this? Or a function I could start from and build on?

There's a good example here of truncating text while preserving HTML tags.

There is strip_tags which gets rid of all HTML tags but other than that there isn't much.
This is not an easy thing by the way, you have to actually parse the HTML to find out which tags are left open - that's the most robust approach anyway. Also, don't use a regular expression.

The right solution is to not store display information in your database layer.
Failing that, you could use CSS overflow properties: print the whole post, and then have the display layer handle sizing it to fit. This mitigates the problem of having formatting information in your database by putting the resizing (a display issue, not a content issue) into the display layer as well.
Failing that, you could parse the HTML and "round up" or "round down" to the nearest tag boundary, then insert the tag-close characters necessary to finish the block you were in.
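
A minimal sketch of the parse-and-close idea, assuming the DOM extension is available. truncateHtml and trimNode are hypothetical names, the limit counts text characters only (tags are free), and non-ASCII input would additionally need an encoding hint for loadHTML. Because the result is serialised from a DOM tree, every element that is still open at the cut point gets its closing tag automatically.

```php
<?php
// Hypothetical helper: keep at most $limit characters of text, tags intact.
function truncateHtml(string $html, int $limit): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html); // the parser wraps the fragment in <html><body>
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    $remaining = $limit;
    trimNode($body, $remaining);

    // Serialise the children of <body>, so the wrapper itself is not output.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// Walks the tree in document order, shortening or removing nodes once the
// character budget is used up.
function trimNode(DOMNode $node, int &$remaining): void
{
    // Copy the child list first: removing from the live list skips nodes.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($remaining <= 0) {
            $node->removeChild($child);
        } elseif ($child instanceof DOMText) {
            $length = $child->length;
            if ($length > $remaining) {
                $child->deleteData($remaining, $length - $remaining);
                $remaining = 0;
            } else {
                $remaining -= $length;
            }
        } else {
            trimNode($child, $remaining);
        }
    }
}
```

For example, truncateHtml('<p>Hello <b>wonderful</b> world</p>', 9) cuts inside the bold word and still closes both the b and the p element.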
Another option is to iframe the content.

I know this isn't the best way to do it programmatically, but have you considered manually specifying where the cut should be? Adding something like and cutting it there manually would allow you to control where the cut happened, regardless of the number of characters before it. For example, you could always put that below the first paragraph.
Admittedly, you lose the ability to just have it happen automatically, but I bring it up in case that doesn't matter as much to you.

PHP: Regex replace while ignoring content between html tags

I'm looking for a regular expressions string that can find a word or regex string NOT between html tags.
Say I want to replace (alpha|beta) in: the first two letters in the greek alphabet are alpha and <b>beta</b>
I only want it to replace alpha, because beta is between <> tags. So ignore (<(.*?)>(.*?)<\/(.*?)>)
:)

I didn't test the logic used in this page - http://www.phpro.org/examples/Get-Text-Between-Tags.html But I can confirm the logical point made at the top of the page in big bold letters that says you shouldn't do what you're trying to do with regex.
Html is not uniform and edge cases will always bite you in the rear if you use regular expressions to handle the content of those tags in any real world situation. So unless your markup is extremely simplistic, uniform, 100% accurate, only contains html (not css, javascript or garbage) then your best bet is a dom parser library.
And really, many DOM parser libraries have problems too, but you'll be miles ahead of the regex counterparts. The best way to get the text content of tags is to render the HTML in a browser and access the innerText property of the given DOM node (or have a human copy and paste the contents out manually) - but that isn't always an option :D

It's maybe the 'wrong' way, but it works: when I need to do something similar, I first do a preg_replace_callback to find what I don't want to match and encode it with something like base64.
Then I can happily run an ordinary preg_replace on the result, knowing that it has no chance of matching the strings I want to ignore. Then unscramble using the same pattern in preg_replace_callback, this time sending the matches to be base64 decoded.
I often do this when automatically adding keyword or glossary links or tooltips to a text - I scramble the HTML tags themselves so that I don't try to create a link or a tooltip within the title of an anchor tag or somewhere equally ridiculous, for example.
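
A small sketch of this scramble, replace, unscramble routine. replaceOutsideTags is a hypothetical name, and the \x01 and \x02 marker bytes are an assumption (any characters that never occur in your text would do). One caveat: base64 output consists of letters and digits, so a very short search pattern could in principle match inside a scrambled token; the markers make the tokens easy to find again but do not rule that out.

```php
<?php
// Hypothetical helper: apply a regex replacement to text outside HTML tags.
function replaceOutsideTags(string $html, string $pattern, string $replacement): string
{
    // Step 1: hide every tag behind a base64 token wrapped in marker bytes.
    $scrambled = preg_replace_callback(
        '/<[^>]+>/',
        fn (array $m): string => "\x01" . base64_encode($m[0]) . "\x02",
        $html
    );

    // Step 2: run the ordinary replacement on what is left visible.
    $replaced = preg_replace($pattern, $replacement, $scrambled);

    // Step 3: decode the tokens back into the original tags.
    return preg_replace_callback(
        '/\x01([A-Za-z0-9+\/=]*)\x02/',
        fn (array $m): string => base64_decode($m[1]),
        $replaced
    );
}

echo replaceOutsideTags('see <a title="alpha">alpha</a>', '/alpha/', 'ALPHA'), "\n";
```

This version protects only the tags themselves, as described above. To also skip the text between an opening and closing tag, the pattern in step 1 would have to match whole elements instead.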

Regex to parse links containing specific words

Taking this thread a step further, can someone tell me what the difference is between these two regular expressions? They both seem to accomplish the same thing: pulling a link out of html.
Expression 1:
'/(https?://)?(www.)?([a-zA-Z0-9_%]*)\b.[a-z]{2,4}(.[a-z]{2})?((/[a-zA-Z0-9_%])+)?(.[a-z])?/'
Expression 2:
'/<a.*?href\s*=\s*["\']([^"\']+)[^>]*>.*?<\/a>/si'
Which one would be better to use? And how could I modify one of those expressions to match only links that contain certain words, and to ignore any matches that do not contain those words?
Thanks.

The difference is that expression 1 looks for valid and full URIs, following the specification. So you get all full urls that are somewhere inside of the code. This is not really related to getting all links, because it doesn't match relative urls that are very often used, and it gets every url, not only the ones that are link targets.
The second looks for a tags and gets the content of the href attribute, so this one will get you every link. Except for one error* in that expression, it is quite safe to use and will work well enough to get you every link: it accounts for enough of the variations that can appear, such as whitespace or other attributes.
*However there is one error in that expression, as it does not look for the closing quote of the href attribute, you should add that or you might match weird things:
/<a.*?href\s*=\s*["\']([^"\'>]+)["\'][^>]*>.*?<\/a>/si
edit in response to the comment:
To look for word inside of the link url, use:
/<a.*?href\s*=\s*["\']([^"\'>]*word[^"\'>]*)["\'][^>]*>.*?<\/a>/si
To look for word inside of the link text, use:
/<a.*?href\s*=\s*["\']([^"\'>]+)["\'][^>]*>.*?word.*?<\/a>/si
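
As a quick sketch of how the URL variant could be used from PHP: linksWithWordInUrl is a hypothetical helper name, and preg_quote is added so that a search word containing regex metacharacters is taken literally.

```php
<?php
// Hypothetical helper: href values of links whose URL contains $word.
function linksWithWordInUrl(string $html, string $word): array
{
    $quoted = preg_quote($word, '/');
    $pattern = '/<a.*?href\s*=\s*["\']([^"\'>]*' . $quoted
        . '[^"\'>]*)["\'][^>]*>.*?<\/a>/si';

    preg_match_all($pattern, $html, $matches);
    return $matches[1]; // group 1 holds the href values
}

$html = '<a href="/news/today">News</a> <a href="/about">About</a>';
print_r(linksWithWordInUrl($html, 'news'));
```

The same wrapper works for the link-text variant by moving the quoted word to the position shown in the second pattern.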

In the majority of cases I'd strongly recommend using an HTML parser (such as this one) to get these links. Using regular expressions to parse HTML is going to be problematic since HTML isn't regular and you'll have no end of edge cases to consider.
See here for more info.

/<a.*?href\s*=\s*["']([^"']+)[^>]*>.*?<\/a>/si
You have to be very careful with .*, even in the non-greedy form. . easily matches more than you bargained for, especially in dotall mode. For example:
<a name="foo">anchor</a>
...
Matches from the start of the first <a to the end of the second.
Not to mention cases like:
<a href="a"></a >
or:
<a href="a'b>c">
or:
<a data-href="a" title="b>c" href="realhref">
or:
<!-- <a href="notreallyalink"> -->
and many many more fun edge cases. You can try to refine your regex to catch more possibilities, but you'll never get them all, because HTML cannot be parsed with regex (tell your friends)!
HTML+regex is a fool's game. Do yourself a favour. Use an HTML parser.
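
For comparison, a parser-based sketch using PHP's bundled DOM extension (assumed to be enabled). linksContaining is a hypothetical name; it returns the href of every link whose URL or link text contains the given word, and it is not tripped up by the comment, data-href and "b>c" cases listed above.

```php
<?php
// Hypothetical helper: hrefs of <a> elements whose URL or text contains $word.
function linksContaining(string $html, string $word): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        if (!$a->hasAttribute('href')) {
            continue; // named anchors such as <a name="foo"> are not links
        }
        $href = $a->getAttribute('href');
        if (stripos($href, $word) !== false || stripos($a->textContent, $word) !== false) {
            $links[] = $href;
        }
    }
    return $links;
}

$html = '<a name="foo">anchor</a> <!-- <a href="notreallyalink"> --> '
    . '<a data-href="a" title="b>c" href="realhref">real</a>';
print_r(linksContaining($html, 'real'));
```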

At a brief glance the first one is rubbish but seems to be trying to match a link as text, the second one is matching a html element.
