Cleaning text scraped from webpage with php & regex


I have been building a function that reads the title text found between a webpage's <title></title> tags. I am using the following regex to grab the title text from the HTML page:
if(preg_match('#<title>([^<]+)</title>#simU', $this->html, $m1))
$this->title = trim($m1[1]);
I am using the following to encode the value for the mysql insert statement:
mysql_real_escape_string(rawurldecode($this->title))
So that leaves me with a database full of titles that have HTML entities (&nbsp; etc.) and
foreign characters, such as in
Dating S.o.sÂ |Â Gluten-free, Dairy-free, Sugar-free Recipes And Lifestyle Tips
The goal is to decode, remove, and clean the titles so that they look as close to perfect English as possible.
I have constructed a function that uses the following two regexes to remove HTML entities and limit junk, respectively. While not ideal (because it removes the HTML entities rather than preserving them), it's the closest to clean that I've got.
$string = preg_replace("/&#?[a-z0-9]+;/i","",$string);
//remove all non-normal chars
$string = preg_replace('/[^a-zA-Z0-9-\s\'\!\,\|\(\)\.\*\&\#\/\:]/', '', $string);
But the non-English chars still exist.
Would anyone be able to offer help as to:
1. The best way to save these title strings to the DB while trying to preserve the English intent (punctuation, apostrophes, etc.)?
2. How to convert or eliminate the strange chars as shown in my example title above?
Thanks much for your help!

For point 1, PHP has an html_entity_decode() function that you can use to turn HTML entities into "regular" characters.

Check out http://www.php.net/manual/en/function.html-entity-decode.php for #1
And http://php.net/manual/en/function.mb-convert-encoding.php for #2
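
A minimal sketch of how the two functions suggested above might be combined. The helper name cleanTitle() and its $sourceCharset parameter are hypothetical, not from the original post, and the sketch assumes the caller knows the page's charset. Stray characters like the "Â" in the example title typically appear when the UTF-8 bytes of a non-breaking space are read as a single-byte encoding, so the idea is to get everything into UTF-8 first, decode the entities instead of deleting them, and only then tidy the whitespace. mb_convert_encoding() needs the mbstring extension.

```php
<?php
// Hypothetical helper (not from the original post): normalise a scraped <title> to clean UTF-8 text.
function cleanTitle(string $raw, string $sourceCharset = 'UTF-8'): string
{
    // Assumption: the caller knows the page's charset, e.g. from its Content-Type header.
    if (strcasecmp($sourceCharset, 'UTF-8') !== 0) {
        $raw = mb_convert_encoding($raw, 'UTF-8', $sourceCharset);
    }

    // Decode named and numeric entities (&nbsp;, &amp;, &#39;, ...) into UTF-8 characters.
    $text = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Turn non-breaking spaces (U+00A0) into ordinary spaces, then collapse whitespace runs.
    $text = str_replace("\u{00A0}", ' ', $text);
    $text = preg_replace('/\s+/u', ' ', $text) ?? $text;

    return trim($text);
}

echo cleanTitle('Dating S.o.s&nbsp;|&nbsp;Gluten-free, Dairy-free &amp; Sugar-free'), "\n";
```

The cleaned string is plain text, so it would still need to be bound as a query parameter when stored and escaped with htmlspecialchars() when echoed back into a page.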

Related

Split html document while maintaining inner tags

I'm working on an e-shop. At some point in my code I have to show attributes and descriptions for many products on a single page. Attributes are a table, and a description can contain simple text and table, li, br tags, etc. These are stored in the database as HTML-encoded strings. So in my PHP file I load them from the DB and decode them like this:
$description=html_entity_decode($description_from_db, ENT_QUOTES, 'UTF-8');
$attributes=html_entity_decode($attributes_from_db, ENT_QUOTES, 'UTF-8');
Later on I just do echo $description; and they are shown properly.
All this HAS TO BE PRINTABLE, and here comes the challenge.
When the attributes table and the description are long enough, they exceed the printable page height and get cut in half, which looks really ugly. What I want to do is split the $description and $attributes strings and echo them with page breaks between the pieces where necessary. The problem is that this must be done with respect to the tags inside these strings. I can't, for example, break the string in the middle of a tr tag.
Is there a way to break these strings while keeping the HTML elements they contain intact? I'm thinking it must be possible, since HTML editors show a warning when a tag has been left unclosed.

You can set CSS page-break properties on the elements in your HTML, and the browser will then choose where to break the page when printing:
https://css-tricks.com/almanac/properties/p/page-break
http://davidwalsh.name/css-page-breaks
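
As a rough illustration of that suggestion, each decoded fragment could be wrapped in a container that asks the browser not to split it across printed pages. The helper name printableBlock() and the sample markup are hypothetical; page-break-inside and page-break-before are the CSS properties described in the links above.

```php
<?php
// Hypothetical helper: wrap a fragment so the browser tries not to split it across printed pages.
function printableBlock(string $html, bool $breakBefore = false): string
{
    $style = 'page-break-inside: avoid;';
    if ($breakBefore) {
        // Force this block to start on a fresh page.
        $style .= ' page-break-before: always;';
    }
    return '<div style="' . $style . '">' . $html . '</div>';
}

// Sample values standing in for the decoded database strings.
$attributes  = '<table><tr><td>Colour</td><td>Red</td></tr></table>';
$description = '<p>A long product description.</p>';

echo printableBlock($attributes), "\n", printableBlock($description), "\n";
```

A block taller than one page still has to break somewhere, so for long tables it may work better to apply the same property to the rows in a print stylesheet rather than to the whole table.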

Truncate HTML content to specified character/word count, while preserving tags

I recently had the need to truncate post content that contains HTML (for a post excerpt/summary, etc.). This is usually done by manually entering an excerpt for the post, but for this specific project, I need to do it automatically.
I tried to create a simple method which just takes a character count and sub-strings the content. However, this does not work all the time as it may truncate the content within an HTML tag/attribute.
eg:
<?php
function truncateText($string, $chars) { return substr($string, 0, $chars); }
$content = "<div><p>some content</p><a href='http://google.com'>Let's go to google</a></div>";
echo truncateText($content, 40); // prints "<div><p>some content</p><a href='http://"
As you can see, it returns broken HTML, which will not render properly. How can I truncate the content yet retain the HTML tags?

Your approach yields many problems. Do you want to truncate at 40 characters, then add as many closing tags as needed until everything is closed? Or do you prefer to truncate at 40 and trim back as much as needed to make the tags work? Do the tags count toward the 40 characters, or are they ignored when counting? There are many problems with this, as you can see. However, there's an alternative commonly used for summaries:
Delete the tags and truncate the text. The summary is normally just a small extract of text, a paragraph, with simple formatting. You don't want lists here, and in most cases stripping a link or two is okay.
However, if you really want to go down that road, I'd recommend reading the HTML tags properly with a DOM parser, but to know how to do that you will first need to answer the questions I asked above.

If you don't care whether formatting is removed from your text, then just send the string through the PHP function strip_tags() before you do anything else. See the PHP manual for details.
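
A minimal sketch of the "strip the tags, then truncate" approach described in these answers. The helper name plainSummary() and its parameters are hypothetical; it assumes UTF-8 input and the mbstring extension, and it cuts back to the last whole word, which is a design choice rather than something the answers require.

```php
<?php
// Hypothetical helper: build a plain-text excerpt of at most $maxChars characters.
function plainSummary(string $html, int $maxChars, string $ellipsis = '...'): string
{
    // Put a space before each tag so "</p><a>" does not glue two words together.
    $spaced = str_replace('<', ' <', $html);

    // Remove the tags, then decode entities so "&amp;" counts as one character.
    $text = html_entity_decode(strip_tags($spaced), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    $text = trim(preg_replace('/\s+/u', ' ', $text) ?? $text);

    if (mb_strlen($text, 'UTF-8') <= $maxChars) {
        return $text;
    }

    // Cut at the limit, then back up to the last space to avoid a half word.
    $cut = mb_substr($text, 0, $maxChars, 'UTF-8');
    $lastSpace = mb_strrpos($cut, ' ', 0, 'UTF-8');
    if ($lastSpace !== false) {
        $cut = mb_substr($cut, 0, $lastSpace, 'UTF-8');
    }
    return $cut . $ellipsis;
}

$content = "<div><p>some content</p><a href='http://google.com'>Let's go to google</a></div>";
echo plainSummary($content, 20), "\n";
```

The result is plain text, so it should be passed through htmlspecialchars() when it is echoed into a page.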

php change specific characters to html tags

I have some text data stored in my database, and I want it to be displayed in a specific way. I store the data following the Wikipedia standard, for example:
==title==
some ''data''
And I want this data to be translated to <b>, <h2>, <i>, etc.
Is there any function/parser to easily achieve this?

There are numerous libraries for parsing and rendering MediaWiki markup, including some written in PHP. See http://www.mediawiki.org/wiki/Alternative_parsers

PHP's str_replace($search, $replace, $subject) function can take arrays as parameters, and replaces each occurrence of $search's elements in $subject with the respective element of $replace.
See http://php.net/manual/en/function.str-replace.php
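
For the two markers in the question, a small sketch might look like the following. Note that it swaps str_replace() for preg_replace(), which accepts arrays in the same way, because paired markers such as ''...'' need a different opening and closing tag. The function name miniWikiToHtml() and the three rules are illustrative assumptions only; for real MediaWiki markup one of the parser libraries mentioned above is the safer route.

```php
<?php
// Hypothetical helper covering only ==headings==, '''bold''' and ''italic''.
function miniWikiToHtml(string $text): string
{
    // Escape any raw HTML first; ENT_NOQUOTES leaves the '' markers untouched.
    $safe = htmlspecialchars($text, ENT_NOQUOTES, 'UTF-8');

    // Order matters: bold (three quotes) must be handled before italic (two quotes).
    $rules = [
        '/^==[ \t]*(.+?)[ \t]*==[ \t]*$/m' => '<h2>$1</h2>',
        "/'''(.+?)'''/"                    => '<b>$1</b>',
        "/''(.+?)''/"                      => '<i>$1</i>',
    ];

    return preg_replace(array_keys($rules), array_values($rules), $safe);
}

echo miniWikiToHtml("==title==\nsome ''data''"), "\n";
```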

PHP MySql display returned content without html tags being stripped

I have a column in SQL with the 'Text' datatype, and inside I have content within HTML tags, e.g. somecontent: onetwo... I want the MySQL query result echoed without the HTML tags being stripped (is it possible this is done by PHP for security reasons?), so that the HTML renders. At the moment it just dumps out a paragraph, which looks awful! It should be noted that security is not much of a concern, as this is a project that is not going to be exposed publicly.
Cheers folks
Nick

MySQL wouldn't strip tags from text - it couldn't care less what the text is. PHP also wouldn't strip tags, unless somewhere in your code you do a strip_tags() or equivalent.
If you want to force the browser to display the tags in the retrieved data, you can run the string through htmlspecialchars(), which converts HTML metacharacters (<, >, ", &, etc.) to their character entity equivalents (&lt;, &gt;, etc.).
Or you can force the entire page to be rendered as plain text by doing
header('Content-type: text/plain');
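
To make the difference concrete, here is a short sketch; the variable names and the sample value are made up for illustration and stand in for a row fetched from the database.

```php
<?php
// Hypothetical value standing in for a TEXT column read from MySQL.
$fromDb = '<p>one</p><p>two</p>';

// Echoed as-is, the browser renders the markup (only appropriate for trusted content).
$rendered = $fromDb;

// Escaped, the browser shows the tags literally instead of rendering them.
$literal = htmlspecialchars($fromDb, ENT_QUOTES, 'UTF-8');

echo $rendered, "\n", $literal, "\n";
```

If the page shows literal tags when rendered markup was expected, the data was probably escaped somewhere on its way into the database or before the echo.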

Database content isn't stripped by PHP unless you explicitly tell it to.
Are you sure your tags haven't been stripped before they were inserted?
Alternatively try stripslashes();

Use the htmlspecialchars_decode() function. It works fine for me.
Usage guideline:
Your text with code:
<?php
$text = "
<html>
<head>
<title>Page Title</title>
</head>
<body>
Reference site about Lorem Ipsum, giving information on its origins, as well as a random Lipsum generator.
</body>
</html> ";
// OR, to read the text from the database instead:
$text = $row['columnname']; // Here define your database table column name.
echo htmlspecialchars_decode($text);
?>
Output:
Reference site about Lorem Ipsum, giving information on its origins, as well as a random Lipsum generator.

this works for me:
Ex:
<?=htmlspecialchars_decode(htmlspecialchars('I <b>love</b> you'))?>
Your browser will output:
I love you

Use the snippet below:
echo(strip_tags($your_string));

PHP: HTML markup problem while displaying trimmed HTML markups

I am using a rich text box control to post some data on one page,
and I am saving the data to my DB table with the HTML markup, e.g.: This is <b >my bold </b > text
I am displaying the first 50 characters of this column on another page. If I save a sentence of more than 50 characters with the bold tag applied, then when I trim it on the other page (to take the first 50 characters) I lose the closing b tag (</b>), so the bold gets applied to the rest of the content on that page.
How can I solve this? How can I check which open tags are not closed? Is there an easy way to do this in PHP? Is there a function to remove all the HTML tags/markup and give me the sentence as plain text?

http://php.net/strip_tags
The strip_tags() function will remove any tags you might have.

Yes
$textWithoutTags = strip_tags($html);

I generally use HTML::Truncate for this. Of course, being a Perl module, you won't be able to use it directly in your PHP - but the source code does show a working approach (which is to use an HTML parser).
An alternative approach might be to truncate as you are doing at the moment, and then try to fix the result using Tidy.
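
For readers who want to keep the markup, here is a simplified sketch of tag-aware truncation. It is not a port of HTML::Truncate and it uses a regex tokenizer with a stack of open tags rather than a real HTML parser, so it only suits simple, well-formed markup. The function name truncateHtml() and the list of void tags are assumptions made for the example; UTF-8 input and the mbstring extension are assumed too.

```php
<?php
// Hypothetical helper: keep the first $limit text characters of $html and close any tags left open.
function truncateHtml(string $html, int $limit): string
{
    $voidTags = ['br', 'hr', 'img', 'input', 'link', 'meta'];

    // Split into tags and text runs, keeping the tags as tokens.
    $tokens = preg_split('/(<[^>]+>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $out = '';
    $open = [];          // stack of currently open tag names
    $remaining = $limit; // text characters still allowed

    foreach ($tokens as $token) {
        if ($token[0] === '<' && substr($token, -1) === '>') {
            // Tags are copied through unchanged and never count toward the limit.
            if (preg_match('#^<(/?)([a-zA-Z][a-zA-Z0-9]*)#', $token, $m)) {
                $name = strtolower($m[2]);
                if ($m[1] === '/') {
                    if (end($open) === $name) {
                        array_pop($open);
                    }
                } elseif (!in_array($name, $voidTags, true) && substr($token, -2) !== '/>') {
                    $open[] = $name;
                }
            }
            $out .= $token;
            continue;
        }

        // Text run: count decoded characters so "&amp;" is one character, not five.
        $text = html_entity_decode($token, ENT_QUOTES | ENT_HTML5, 'UTF-8');
        $length = mb_strlen($text, 'UTF-8');
        if ($length < $remaining) {
            $out .= $token;
            $remaining -= $length;
            continue;
        }

        // Limit reached: keep what fits, re-encode it, and stop reading tokens.
        $out .= htmlspecialchars(mb_substr($text, 0, $remaining, 'UTF-8'), ENT_NOQUOTES, 'UTF-8');
        break;
    }

    // Close whatever is still open, innermost tag first.
    while ($open) {
        $out .= '</' . array_pop($open) . '>';
    }
    return $out;
}

echo truncateHtml('This is <b>my bold text that keeps going</b> and more', 15), "\n";
```

Because only text characters are counted, the returned string can be longer than $limit once the tags are included.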

If you want the HTML tags to remain, but be closed properly, see PHP: Truncate HTML, ignoring tags. Otherwise, read on:
strip_tags will remove HTML tags, but not HTML entities (such as &amp;), which could still cause problems if truncated.
To handle entities as well, one can use html_entity_decode to decode entities after stripping tags, then trim, and finally reencode the entities with htmlspecialchars:
$text = "1 &lt; 2\n";
print $text;
print htmlspecialchars(substr(html_entity_decode(strip_tags($text), ENT_QUOTES), 0, 3));
(Note use of ENT_QUOTES to actually convert all entities.)
Result:
1 &lt; 2
1 &lt;
Footnote: The above only works for entities that can be decoded to ISO-8859-1. If you need support for international characters, you should already be working with UTF-8 encoded strings, and simply need to specify that in the call to html_entity_decode.
