PHP Recognize paragraphs in Rich Text - php

I've a rich text editor for news messages.
The frontend shows one paragraph and the user can read the full message once the user clicks "read more".
However, this recognition is currently done by <div></div> tags, while the editor works with <br> tags (two for a paragraph).
My current regex is:
"/<div>([^`]*?)<\/div>/is"
How can I extend this to also recognize two <br> tags right after each other? (Note that those br tags might contain attributes.)

As discussed above, beware that using regex to parse HTML, especially for "complex" problems, is generally a bad idea. The following is not a perfect solution, but may be good enough for the simple requirements you've given above:
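/(?<=<div>).*?(?=<\/div>)|<br>\s*<br>\K.*?(?=<div>|<br>\s*<br>)/is
The (?<=...) and (?=...) are lookbehinds/lookaheads, i.e. they assert that those sections of the pattern are present, but are not included in the match result. PCRE (which PHP's preg_* functions use) rejects an unbounded quantifier such as \s* inside a lookbehind, so the <br>\s*<br> prefix is matched normally and then dropped from the result with \K.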
I have also used \s* to help catch scenarios where the user types something like:
<br> <br>
Or:
<br>
<br>
...But as I say, this is still not a perfect solution. If you find the pattern gets too complex, then seriously consider using an XML parser instead. (Or, how about just letting the user enter new lines, and converting these into paragraphs for them? ... Or even, just use an existing WYSIHTML5 library, or a markdown library?)
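For the "show only the first paragraph" use case, splitting may be simpler than matching. The following is a minimal sketch, not a complete solution: firstParagraph() is a hypothetical helper name, and it assumes that a paragraph ends either at </div> or at two consecutive br tags (with or without attributes).

```php
<?php
// Hypothetical helper: returns the first paragraph of the editor output.
// Assumption: a paragraph ends at </div> or at two consecutive <br> tags.
function firstParagraph(string $html): string
{
    // <br>, <br /> or <br class="x">, optional whitespace, then another <br ...>
    $doubleBr = '<br\b[^>]*>\s*<br\b[^>]*>';

    // Split at the first paragraph boundary only (limit = 2).
    $parts = preg_split('~' . $doubleBr . '|</div>~i', $html, 2);

    // Remove a leading <div ...> wrapper if there is one.
    return trim(preg_replace('~^\s*<div\b[^>]*>~i', '', $parts[0]));
}
```

A single br inside a paragraph is left alone, because the pattern only splits where two of them follow each other.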

Detecting dofollow backlinks using regular expression

The objective of this regular expression is to find whether a web page contains backlink(s) to a given domain and whether all of them have a rel="nofollow" attribute on the a tag. It should be True if they all do, and False if any of them does not contain rel="nofollow".
From any web page I want to check whether anything like this is present:
<a ... href="http://www.mysite.com/xyz...." ... >
Additionally, there must not be a "rel=nofollow" attribute in all such links found.
The domain www.mysite.com is known, and I want to check for it even within comments or wherever else it is present in the page.
I could do the above myself, but I'm not able to think of an optimized way to do it using a single pattern.
One unoptimized way to do it is to find all occurrences of a tags with href="mysite.com" and see whether even a single match does not contain rel=nofollow.
Is there any smart & single line way of making a regular expression pattern?
PS: I don't want to parse the DOM, since there is a risk of missing a backlink due to a parsing error, and Google's DOM parser could be different. I want human attention only on those pages whose links can cause a backlink penalty from search engines. If a link within a comment is flagged as a backlink and takes away some human attention, no problem. But links from, say, a porn site must be caught at any cost. Finally, I want to prepare a list of spam links which I can submit in Google Webmaster's Disavow tool. This exercise is a must for every webmaster about once a month for every site. And I can't afford this kind of paid service: www.linkdetox.com
Usually, parsing HTML with regex is a bad idea (here's the famous reason why). You risk weird bugs as regex aren't able to fully parse HTML.
However, if your input is "safe" (i.e. not changing a lot, or you're prepared for weird errors), and to answer your question: once you're on the a tag, you can use something like this to catch links with the href you want and without rel="nofollow":
#<a\s+(?![^>]*rel\s*=\s*(['"])\s*nofollow\s*\1)[^>]*href\s*=\s*(["'])http://www.mysite.com[][\w-.~:/%?##!$&'()*+,;=]*\2[^>]*>
<a\s+ # start of the a tag followed by at least a space
(?! # negative look-ahead: if there isn't...
[^>]* # anything except tag closing bracket
rel\s*=\s* # 'rel=', with spaces allowed
(['"]) # capture the opening quote
\s*nofollow\s* # nofollow
\1 # closing quote is the same as captured opening one
) # end of negative look ahead
[^>]* # anything but a closing tag
href\s*=\s* #
(["']) # capture opening quote
http://www.mysite.com # the fixed part of your url
[][\w-.~:%/?##!$&'()*+,;=]* # url-allowed characters
\2 # closing quote
[^>]*> # "checks" that the tag is ending
Demo: http://regex101.com/r/hC8lV9
Disclaimer
This isn't meant to check whether your input is well-formed or not, this assumes it is well formed. This won't account for stuff like escaped > or escaped quotes, and you very probably will need to adapt it to your needs. Basically, no regex will give a complete answer.
If you need to deal with varied input or with potentially malformed HTML, a parser will do a much safer and better job than regex.
However I'm putting this one here to give you an idea of what can be done on this subject, since in very strict and narrowly defined context regex can actually be a relevant solution.
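As a rough PHP sketch of the same negative-lookahead idea: findDofollowLinks() is a hypothetical name, and the pattern is a simplified variant of the one above, not a copy. It assumes quoted attribute values, accepts both http and https, treats rel values such as "external nofollow" as nofollow, and allows anything up to the closing quote after the host.

```php
<?php
// Hypothetical helper: returns the opening <a ...> tags that link to $domain
// and do not carry nofollow in a quoted rel attribute.
function findDofollowLinks(string $html, string $domain): array
{
    $host = preg_quote($domain, '~');   // escape the dots in the host name

    $pattern = '~<a\s+'
        // negative lookahead: no rel="... nofollow ..." anywhere in this tag
        . '(?![^>]*rel\s*=\s*([\'"])[^\'"]*\bnofollow\b[^\'"]*\1)'
        // href pointing at the given host, closed by the same quote
        . '[^>]*href\s*=\s*([\'"])https?://' . $host . '[^\'"]*\2'
        . '[^>]*>~i';

    preg_match_all($pattern, $html, $matches);
    return $matches[0];
}
```

An empty result means no dofollow link to the domain was found by this pattern; it does not prove that none exists, for the reasons given in the disclaimer.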
First of all, do not use regular expressions for parsing the DOM of a web page. PHP has its own Document Object Model implementation, which does the whole job. Just have a look at http://de1.php.net/manual/en/class.domdocument.php and http://de1.php.net/manual/en/class.domxpath.php.
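A minimal sketch of that DOM route, assuming the dom extension is enabled. findDofollowLinksDom() is a hypothetical name, and comparing the host with an exact string match is an assumption that may be too strict for your case.

```php
<?php
// Hypothetical helper: returns the href of every link to $domain
// whose rel attribute does not contain the token "nofollow".
function findDofollowLinksDom(string $html, string $domain): array
{
    // Collect libxml warnings about imperfect markup instead of printing them.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $found = [];
    foreach ($xpath->query('//a[@href]') as $a) {
        $href = $a->getAttribute('href');
        // rel is a space-separated token list, e.g. "external nofollow"
        $rel = preg_split('/\s+/', strtolower($a->getAttribute('rel')), -1, PREG_SPLIT_NO_EMPTY);
        if (parse_url($href, PHP_URL_HOST) === $domain && !in_array('nofollow', $rel, true)) {
            $found[] = $href;
        }
    }
    return $found;
}
```

Unlike the regex approach, this will not see links inside HTML comments, which the question explicitly wanted to catch.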
Regular Expression
<a(?=[^>]*?rel=nofollow)(?=[^>]*?href="http:\/\/www\.mysite\.com\/.*?")[^>]*?>
How it works
It uses positive lookaheads to check that the tag contains both the rel=nofollow attribute and an href pointing at www.mysite.com.
Online demo: http://regex101.com/r/pX0yF5

Help Implement Tags in PHP

In my recent PHP project, I need to implement Tags (searchable) separated by comma (similar to this site or something like in WordPress). What is the smart way to detect and remove unnecessary characters or tags? Putting the XSS concern aside, first of all I need to clean and extract only text if user inputs HTML(or other tags) instead of the plain text.
For example:
If user inputs <b>sdfasdf</b>, sdfsdfsdf, <sdfsdfsdf
It should strip out all the unnecessary characters and tags and only plain text should be saved in database.
I have tried it in WordPress and it is very smart to figure out this plus automatically extracts text only.
My question:
Is there an open source library available for this task which I can integrate into my project? I have done some homework regarding this, but *htmlentities(), strip_tags(), HTML Purifier* etc. don't seem suitable for this task. Or do I need to build my own library combined with these?
Can somebody guide me on this?
Thanks!
In addition to removing "complete" tags (markup language elements) such as found in <b>sdfasdf</b>, sdfsdfsdf, you can also remove "forbidden" characters such as "<", ">", and "&" (using preg_replace and the like), and collapse multiple spaces into a single space (also using preg_replace).
Remember, they're used only as tags (keywords), so it's acceptable here to use a somewhat restricted character set. In Stack Overflow, for instance, only letters, numbers, and hyphens are allowed in tags.
I would look at this the other way around. What input is legal? Which characters are allowed in tag names? Once those questions are answered, I would build a server-side whitelist of legal characters using regex, state the rules in the UI, and simply reject input that does not comply.
Massaging invalid input into valid input is rarely a good idea.
Characters allowed in tags are usually alphanumeric + dashes and underscores. Some sites also allow spaces.
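The cleaning approach from the first answer could look roughly like this. It is only a sketch: cleanTags() is a hypothetical name, and the allowed character set (ASCII letters, digits, space, underscore, hyphen) and the lowercasing are assumptions you would adjust to your own rules.

```php
<?php
// Hypothetical helper: turns raw comma-separated input into a list of clean tag names.
function cleanTags(string $input): array
{
    // Remove complete markup such as <b> and </b>; a stray "<" is handled below.
    $text = preg_replace('/<[^>]*>/', '', $input);

    $tags = [];
    foreach (explode(',', $text) as $raw) {
        // Whitelist: drop everything except letters, digits, space, _ and -
        $tag = preg_replace('/[^A-Za-z0-9 _-]+/', '', $raw);
        // Collapse runs of whitespace and trim the ends.
        $tag = trim(preg_replace('/\s+/', ' ', $tag));
        if ($tag !== '') {
            $tags[] = strtolower($tag);
        }
    }
    // Remove duplicates and reindex.
    return array_values(array_unique($tags));
}
```

For the rejecting variant suggested in the second answer, the same character class can be used with preg_match to validate each tag instead of silently rewriting it.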

PHP: Regex replace while ignoring content between html tags

I'm looking for a regular expressions string that can find a word or regex string NOT between html tags.
Say I want to replace (alpha|beta) in: the first two letters in the greek alphabet are alpha and <b>beta</b>
I only want it to replace alpha, because beta is between <> tags. So ignore (<(.*?)>(.*?)<\/(.*?)>)
:)
I didn't test the logic used in this page - http://www.phpro.org/examples/Get-Text-Between-Tags.html But I can confirm the logical point made at the top of the page in big bold letters that says you shouldn't do what you're trying to do with regex.
HTML is not uniform, and edge cases will always bite you in the rear if you use regular expressions to handle the content of those tags in any real-world situation. So unless your markup is extremely simplistic, uniform, 100% accurate, and only contains HTML (not CSS, JavaScript or garbage), your best bet is a DOM parser library.
And really, many DOM parser libraries have problems too, but you'll be miles ahead of the regex counterparts. The best way to get the text content of tags is to render the HTML in a browser and access the innerText property of the given DOM node (or have a human copy and paste the contents out manually) - but that isn't always an option :D
It's maybe the 'wrong' way, but it works: when I need to do something similar, I first do a preg_replace_callback to find what I don't want to match and encode it with something like base64.
Then I can happily run an ordinary preg_replace on the result, knowing that it has no chance of matching the strings I want to ignore. Then unscramble using the same pattern in preg_replace_callback, this time sending the matches to be base64 decoded.
I often do this when automatically adding keyword or glossary links or tooltips to a text - I scramble the HTML tags themselves so that I don't try to create a link or a tooltip within the title of an anchor tag or somewhere equally ridiculous, for example.
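A minimal sketch of this scramble/replace/unscramble idea. replaceOutsideTags() is a hypothetical name, and the \x01 and \x02 marker bytes are an assumption: they must not occur in the input. Note that the base64 text consists of letters, digits, "+", "/" and "=", so a pattern that can match such characters (a short word, or \d+) could still hit the scrambled part by accident.

```php
<?php
// Hypothetical helper: applies $pattern => $replacement only outside of HTML
// elements and tags, by hiding them behind base64 placeholders first.
function replaceOutsideTags(string $html, string $pattern, string $replacement): string
{
    // 1. Scramble whole elements such as <b>beta</b>, and any remaining lone tags.
    //    (Nested elements with the same name are not handled correctly.)
    $hide = '~<(\w+)\b[^>]*>.*?</\1>|<[^>]*>~s';
    $hidden = preg_replace_callback($hide, function (array $m): string {
        return "\x01" . base64_encode($m[0]) . "\x02";
    }, $html);

    // 2. Run the ordinary replacement on what is left.
    $replaced = preg_replace($pattern, $replacement, $hidden);

    // 3. Unscramble the placeholders back into the original markup.
    return preg_replace_callback('/\x01([^\x02]*)\x02/', function (array $m): string {
        return base64_decode($m[1]);
    }, $replaced);
}
```

To protect only the tags themselves (for instance title attributes) and still replace inside element content, as described above for glossary links, the $hide pattern can be reduced to '~<[^>]*>~'.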

How to get sentences from the website html

Hello, I want to extract all sentences from an HTML document. How can I do that? There are several conditions: first we need to strip the tags, then we need to identify sentences, which may end with . or ? or !. Email addresses and website addresses may also have a . in them. How do we write a script like this?
It's called programming ;). Start by dividing the task in simpler sub-tasks and implement those. For example, in your case, I'd design the program like this:
1. Download and parse the HTML document
2. Extract all text content (pay special attention to <script> and <style> elements)
3. Merge the text content to one long string
4. Solve the problem of finding sentences in a string (likely, just parse until you find a stop character in ".!?" and then start a new sentence)
5. Discard false positives (like empty sentences, number-only sentences etc.)
First you should strip certain tags which are inline formatting elements, like:
I <b>strongly</b> agree.
But you should leave in block-level elements like DIV and P, because they are even stronger delimiters than . ? and !
Then you have to process the content in these block level elements. Typically there are navigation links with one word, you might want to filter them out later, so it is not the right choice to strip away the block structure of the document.
At this point you can safely use the regex pattern to identify blocks:
>([^<]+)<
When you have your blocks, you can filter out the short ones (navigation elements) and split the big ones (paragraphs of text) using your sentence delimiters.
There are interesting questions about when a full stop character signals the end of a sentence and when it is just a decimal point, but I leave that to you. :)
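Put together, the steps could be sketched like this. Everything here is an assumption made for illustration: the name extractSentences(), the list of block-level tags, the three-word minimum, and the rule that a sentence ends at . ! or ? only when whitespace follows (which leaves a.b@example.com and 3.5 intact, but will still split after abbreviations such as "e.g.").

```php
<?php
// Hypothetical helper: naive sentence extraction from an HTML string.
function extractSentences(string $html, int $minWords = 3): array
{
    // Drop <script> and <style> elements together with their contents.
    $html = preg_replace('~<(script|style)\b[^>]*>.*?</\1>~is', '', $html);
    // Block-level tags are hard delimiters: turn them into newlines.
    $html = preg_replace('~</?(?:p|div|li|h[1-6]|br|td|tr)\b[^>]*>~i', "\n", $html);
    // Strip the remaining (inline) tags and decode entities.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');

    $sentences = [];
    foreach (explode("\n", $text) as $block) {
        $block = trim(preg_replace('/\s+/', ' ', $block));
        // Split after . ! or ? when followed by whitespace.
        foreach (preg_split('/(?<=[.!?])\s+/', $block, -1, PREG_SPLIT_NO_EMPTY) as $sentence) {
            // Discard short fragments such as one-word navigation links.
            if (str_word_count($sentence) >= $minWords) {
                $sentences[] = $sentence;
            }
        }
    }
    return $sentences;
}
```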

Regex to parse links containing specific words

Taking this thread a step further, can someone tell me what the difference is between these two regular expressions? They both seem to accomplish the same thing: pulling a link out of html.
Expression 1:
'/(https?://)?(www.)?([a-zA-Z0-9_%]*)\b.[a-z]{2,4}(.[a-z]{2})?((/[a-zA-Z0-9_%])+)?(.[a-z])?/'
Expression 2:
'/<a.*?href\s*=\s*["\']([^"\']+)[^>]*>.*?<\/a>/si'
Which one would be better to use? And how could I modify one of those expressions to match only links that contain certain words, and to ignore any matches that do not contain those words?
Thanks.
The difference is that expression 1 looks for valid and full URIs, following the specification. So you get all full urls that are somewhere inside of the code. This is not really related to getting all links, because it doesn't match relative urls that are very often used, and it gets every url, not only the ones that are link targets.
The second looks for a tags and gets the content of the href attribute. So this one will get you every link. Except for one error* in that expression, it is quite safe to use and it will work well enough to get you every link – it accounts for enough of the variations that can appear, such as whitespace or other attributes.
*However there is one error in that expression, as it does not look for the closing quote of the href attribute, you should add that or you might match weird things:
/<a.*?href\s*=\s*["\']([^"\'>]+)["\'][^>]*>.*?<\/a>/si
edit in response to the comment:
To look for word inside of the link url, use:
/<a.*?href\s*=\s*["\']([^"\'>]*word[^"\'>]*)["\'][^>]*>.*?<\/a>/si
To look for word inside of the link text, use:
/<a.*?href\s*=\s*["\']([^"\'>]+)["\'][^>]*>.*?word.*?<\/a>/si
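In PHP, the URL variant could be wrapped up roughly as follows. linksContaining() is a hypothetical name, and the pattern is a slightly tightened variant rather than the exact expression above: it uses [^>]*? instead of .*? so the match cannot run past the end of the opening tag, requires the closing quote to equal the opening one, and only matches the opening tag.

```php
<?php
// Hypothetical helper: returns the href values of <a> tags whose URL contains $word.
function linksContaining(string $html, string $word): array
{
    // Escape regex metacharacters in the search word.
    $w = preg_quote($word, '/');

    $pattern = '/<a\s[^>]*?href\s*=\s*(["\'])([^"\'>]*' . $w . '[^"\'>]*)\1[^>]*>/i';

    preg_match_all($pattern, $html, $m);
    return $m[2];   // group 2 holds the URL, group 1 the quote character
}
```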
In the majority of cases I'd strongly recommend using an HTML parser (such as this one) to get these links. Using regular expressions to parse HTML is going to be problematic since HTML isn't regular and you'll have no end of edge cases to consider.
See here for more info.
/<a.*?href\s*=\s*["']([^"']+)[^>]*>.*?<\/a>/si
You have to be very careful with .*, even in the non-greedy form. . easily matches more than you bargained for, especially in dotall mode. For example:
<a name="foo">anchor</a>
...
Matches from the start of the first <a to the end of the second.
Not to mention cases like:
<a href="a"></a >
or:
<a href="a'b>c">
or:
<a data-href="a" title="b>c" href="realhref">
or:
<!-- <a href="notreallyalink"> -->
and many many more fun edge cases. You can try to refine your regex to catch more possibilities, but you'll never get them all, because HTML cannot be parsed with regex (tell your friends)!
HTML+regex is a fool's game. Do yourself a favour. Use an HTML parser.
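The over-matching is easy to reproduce. In this small check the input string is made up for illustration (a named anchor followed later by a real link); the pattern is the one quoted above.

```php
<?php
$pattern = '/<a.*?href\s*=\s*["\']([^"\']+)[^>]*>.*?<\/a>/si';

// Hypothetical input: a named anchor, some text, then a real link.
$html = '<a name="foo">anchor</a> some text <a href="bar">link</a>';

preg_match($pattern, $html, $m);

// The single match starts at the first <a and ends at the second </a>,
// so the "link" it reports spans both elements.
$overmatched = ($m[0] === $html);
```

Here $overmatched is true and $m[1] is "bar", even though the match begins at an anchor that has no href at all.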
At a brief glance the first one is rubbish but seems to be trying to match a link as text, the second one is matching a html element.
