How do you fix sentence spacing on extracted plain text from HTML?

I'm pulling articles from specific URLs and splitting them into sentences, but the extracted body text seems to randomly drop the whitespace between some sentences, resulting in:
Jane went to the store.She bought a dog. The dog was very friendly.It had no teeth.
Some of my text contains stock symbols such as (AZ.GAN), so I can't simply insert a space after every period that has no adjacent whitespace.
Jane bought several shares of (TY.JPN). She lost all her cash money."Arg!" She cried.
Doing that to the example above would break the stock symbol.
Does anyone know the cause of this? I have tried several HTML and DOM parsers. I use Simple_DOM to grab the plaintext, but I get the same result if I do it manually or with any other parsing engine.

Unfortunately I don't have an approach for your specific question, but is it possible that the missing space between sentences is actually a linebreak (e.g. \n) that your text viewer (whatever it is) isn't showing you?
Perhaps try something like this, just to make sure:
var articleContent = ... // get content
articleContent = articleContent.replace(/\n/g, ' NEW LINE ');

Try doing:
$str = trim(preg_replace('~([(].+?[.])\s(.+?[)])~', '$1$2', str_replace('.', '. ', $str)));
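An alternative to adding a space after every period is to insert one only where a lowercase letter plus sentence-ending punctuation is followed directly by a capital letter (optionally behind an opening double quote). This is just a rough sketch. It assumes stock symbols are all uppercase, as in the examples in the question, and the function name fix_sentence_spacing is made up for illustration:

```php
<?php
// Hypothetical helper: inserts a missing space between sentences.
// Assumption: tickers such as (AZ.GAN) are all uppercase, so the
// "lowercase letter before the punctuation" rule leaves them alone.
function fix_sentence_spacing(string $text): string
{
    // (?<=[a-z][.!?])  a lowercase letter followed by ., ! or ? sits just before
    // (?="?[A-Z])      an optional opening quote and a capital letter follow
    // The match is zero-width, so the replacement only inserts a space.
    return preg_replace('~(?<=[a-z][.!?])(?="?[A-Z])~', ' ', $text);
}

echo fix_sentence_spacing('Jane went to the store.She bought a dog.'), PHP_EOL;
```

It will still get cases like mixed-case symbols or sentences that start with a digit wrong, so it may need tuning for real articles.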

Related

How to change every html tag into space

I have stored some HTML data in a database through the Summernote plugin.
In the database it looks like this:
<p><span id="job_summary" class="summary"><ul><li>
<b class="jobtitle"><font size="+1">Analyst/Junior Analyst- Outbound calling process</font></b>
Here is how I show it:
echo text_cut(strip_tags(html_entity_decode($ro)),300);
Now I want to show this data as plain text on my page. I tried using strip_tags, but the result looks messy. Here is how it looks after strip_tags:
knowledgeMust be reliable in terms of attendance and timingExhibit
It joined the words together, so now I want every HTML tag to be converted into a space. How can I achieve this?

Try this:
$spaceString = str_replace('<', ' <', html_entity_decode($ro));
echo strip_tags($spaceString);

Here's something you can try. If you replace the tags with a space and then replace multiple spaces with one space, it should give you the desired results.
First, use something like this to replace the tags:
$string = preg_replace('~<.*?>~', ' ', $string);
That leaves a run of spaces wherever several tags sat next to each other.
Next, you can look for multiple spaces in a row and consolidate them:
$string = preg_replace('~ +~', ' ', $string);
That will give you this:
Analyst/Junior Analyst- Outbound calling process
You won't really be able to see it, but there is a line above it containing a blank space, and a blank space before the string as well. So, depending on what you want the result to look like, you can use \s+ instead of [SPACE]+, which also collapses the line breaks.
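The two steps can be combined into one small function. This is only a sketch: the name html_to_spaced_text is made up, and it assumes the stored value may be entity-encoded, so it decodes before stripping, as the code in the question does:

```php
<?php
// Hypothetical helper: turns every tag into a space, then tidies whitespace.
function html_to_spaced_text(string $html): string
{
    // Decode entities first so encoded tags (&lt;p&gt;) become real tags.
    $decoded = html_entity_decode($html, ENT_QUOTES, 'UTF-8');

    // Replace each tag with a single space so neighbouring words stay apart.
    $spaced = preg_replace('~<[^>]*>~', ' ', $decoded);

    // Collapse runs of whitespace (spaces, tabs, newlines) and trim the ends.
    return trim(preg_replace('~\s+~', ' ', $spaced));
}

echo html_to_spaced_text('<ul><li>knowledge</li><li>Must be reliable</li></ul>'), PHP_EOL;
```

You can still pass the result through your own text_cut() afterwards to limit the length.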

Is there a typo in this str_replace code? / Am I reading it correctly?

Here is a line of code from a PHP file. Specifically, it is from zstore.php, which is a file included as part of the "Zazzle Store Builder" toolset from Zazzle.com.
The set of files allows someone like me, who has products for sale on Zazzle, to massage that data into a nicer "storefront" which I can set up my way instead of being confined by the CMS structure of Zazzle.com, where they understandably want to keep the monkeys (uhmmm... users like myself) from causing too much mayhem.
So... here is the code:
$keywords = str_replace(" ",",",str_replace(",","",$keywords));
Two questions:
Am I understanding what it does and
Is there an extra single or double quote in the string that does not need to be there?
Here is what I think the line of code is saying:
Take the string of characters that the user inputs (dance diva) and assign it to the variable called
$keywords
then run the following function on that character string
= str_replace
(" ","," <<< look for spaces. If you find a space, replace it with a comma
,str_replace(",","" <<< this is the bit I don't understand or which may have a typo
I THINK that it is saying "if you find commas, leave them alone", but I'm not certain.
,$keywords)); <<< then put the edited string of characters back into the variable called $keywords.
What led me to look at this was that I was inputting the following:
dance,diva which is what I THOUGHT the script wanted from me, based on the commented text in the README.txt file:
// Search terms. Comma separated keywords you can use to select products for your store
So...
Am I understanding what this line of code is supposed to do?
Assuming I am, and I'm pretty sure that the first half works as I've described, that brings me to my second question:
Why isn't the second bit working? Is there a typo?
To review:
dance diva produces results
dance,diva does not
Both SHOULD work.
Thanks in advance for your help. I have a lot of HTML experience and computer experience but PHP is new to me.

$keywords = str_replace(" ",",",str_replace(",","",$keywords));
You can split it into two steps:
$temp = str_replace(",","",$keywords);
$keywords = str_replace(" ",",",$temp);
The first step replaces all commas with an empty string, i.e. it removes all commas. The second step then replaces all spaces with commas.
For "dance diva" there are no commas, so the first step does nothing, then the second replaces the space and the result is "dance,diva".
For "dance,diva" the first step removes the comma, so you get "dancediva", and there is no space left for the second step to replace. That is your result.
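If you want both kinds of input to work, one option is to split on commas and whitespace together and then join the pieces with commas. This is only a sketch: normalize_keywords is a made-up name, and whether the rest of the Zazzle script is happy with the result is an assumption you would have to check:

```php
<?php
// Hypothetical replacement for the nested str_replace() line.
function normalize_keywords(string $keywords): string
{
    // Split on any run of commas and/or whitespace, dropping empty pieces.
    $terms = preg_split('~[\s,]+~', $keywords, -1, PREG_SPLIT_NO_EMPTY);

    // Join the individual terms back together with single commas.
    return implode(',', $terms);
}

echo normalize_keywords('dance diva'), PHP_EOL;  // dance,diva
echo normalize_keywords('dance,diva'), PHP_EOL;  // dance,diva
echo normalize_keywords('dance, diva'), PHP_EOL; // dance,diva
```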

PHP rtrim not removing trailing \n

I need to trim any trailing \n from strings.
I used rtrim but for some reason it's not working. The string remains the same with or without rtrim. It's driving me crazy.
This is the code:
$strippedDescription = rtrim($strippedDescription);
where $strippedDescription is:
The owner of Hill House is Scott Croyle, senior vice president of design at HTC. At two bedrooms, 2 1/2 baths and a study, the home is just large enough to share with his wife and son. Its modest scale allowed Bernstein to emphasize quality materials over quantity of space.
"It's almost a negative value in that (tech) community," said Bernstein of over-the-top homes. "There's a real emphasis on not seeking a mansion right away."\n\n
EDIT
OK, so the issue is that $strippedDescription is being read from an RSS feed and stored in our database. It's the article content. This content will later be displayed on an iPhone through an app.
The iPhone programmer said that we need to replace the "<br />" and "</p>" with "\n" so the iPhone will correctly recognize the new line. However, this isn't happening. The \n are displayed as part of the article.
This is the code preceding the above part (where $itemDescription is the article content with all html tags):
$strippedDescription = $itemDescription;
$strippedDescription = str_replace('</p>', '\n', $itemDescription);
$strippedDescription = str_replace('<br/>', '\n', $itemDescription);
$strippedDescription = str_replace('<br />', '\n', $itemDescription);
$strippedDescription = strip_tags($strippedDescription);
$strippedDescription = rtrim($strippedDescription);
EDIT
Ok I replaced the '\n' with "\n" (double quotes) and that seems to have solved the problem.
Thank you Alex and Sergi and the rest for pointing me in the right direction.
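For reference, here is a sketch of what the corrected conversion could look like. Besides the quotes, note that each str_replace() call in the question starts again from $itemDescription, so only the last replacement survives; passing an array of search strings avoids that. The function name and the extra '<br>' variant are my own additions:

```php
<?php
// Hypothetical helper: converts paragraph/line-break tags to real newlines.
function html_breaks_to_newlines(string $html): string
{
    // "\n" in double quotes is a real newline character; '\n' in single
    // quotes would be the two characters backslash and n.
    $text = str_replace(['</p>', '<br/>', '<br />', '<br>'], "\n", $html);

    // Remove the remaining tags, then trim trailing whitespace and newlines.
    return rtrim(strip_tags($text));
}

echo html_breaks_to_newlines('<p>First line.<br />Second line.</p><p>Third.</p>'), PHP_EOL;
```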

It looks like you have the literal characters \n at the end. trim() won't remove these, as they're not whitespace.
Looks like something like this would work...
$str = preg_replace('/(\\\n)+\z/', '', $str);

There are several problems here.
The first is that the \n in your string is NOT a single newline character but two characters, a backslash followed by n, which is why rtrim is not working.
So you have to search for a STRING and not a single character. That's also why double quotes should be used if you want "\n" to mean an actual newline.
The second problem is that, as you can see in the question, there are two \n at the end, and there might be more.
What I'd do is manually write a function that does the following:
Reverse the string.
Loop while the first 2 characters == "n\".
Remove those two characters.
End loop.
Reverse the string back.
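The same loop can be written without reversing the string by checking the last two characters directly. A minimal sketch, with a made-up function name:

```php
<?php
// Hypothetical helper: removes any number of trailing literal "\n"
// sequences (backslash + n), which rtrim() does not treat as whitespace.
function strip_trailing_literal_newlines(string $text): string
{
    // '\n' in single quotes is two characters: a backslash and the letter n.
    while (substr($text, -2) === '\n') {
        $text = substr($text, 0, -2);
    }
    return $text;
}

echo strip_trailing_literal_newlines('a mansion right away."\n\n'), PHP_EOL;
```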

White spaces are lost when echoing under php

I've got the following issue with PHP and PostgreSQL.
In a table I added the following value; note the spaces.
Things: 10 POLI
When I read this out with PHP it will become
Things 10 POLI
My simplified code (for an ideal world without errors) is:
$query = "SELECT stuff, thing, planets FROM 42 WHERE answer = '-'";
$result = pg_query($connection, $query);
$resultTable = pg_fetch_all($result);
Then with
echo "Things: $result[stuff]";
My question is, which step eliminates all the white spaces? And how to get these spaces back? I know that most people want to remove them, I want to keep them.

That is not a PHP issue but an HTML issue, because when you output with echo you are in fact generating HTML code.
The HTML specification defines that multiple consecutive spaces get rendered as only one space.
If you want to avoid this, wrap a <pre> tag around the string:
echo "<pre>Things: $result[stuff]</pre>";

That's because the browser collapses consecutive spaces into one. You can use this code to convert the spaces to &nbsp; (a non-breaking space, which the browser does not collapse):
$str = str_replace(' ', '&nbsp;', $origText);
Alternatively, wrap your text in a <pre> tag if that suits your requirements.
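Another option, if you don't want the monospace look of <pre>, is the CSS white-space property. A small sketch, assuming the value is plain text coming from the database (the function name is made up):

```php
<?php
// Hypothetical helper: outputs a value so the browser keeps its spaces.
function preserve_spaces_for_html(string $value): string
{
    // Escape the value so it is safe to place inside HTML.
    $escaped = htmlspecialchars($value, ENT_QUOTES, 'UTF-8');

    // white-space: pre-wrap tells the browser not to collapse spaces.
    return '<span style="white-space: pre-wrap">' . $escaped . '</span>';
}

echo 'Things: ', preserve_spaces_for_html('10   POLI'), PHP_EOL;
```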

Removing Break Lines

I've asked this question before but I didn't seem to get the right answer. I've got a problem with new lines in text. Javascript and jQuery don't like things like this:
alert('text
text');
When I pull information from a database table that has a line break in it, JS and jQuery can't parse it correctly. I've been told to use nl2br(), but that doesn't work when someone uses 'shift+enter' or 'enter' when typing text into a message (which is where I get this problem). I still end up with separate lines when using it. It seems to correctly apply the BR tag after the line break, but it still leaves the break there.
Can anyone provide some help here? I get the message data with jQuery and send it off to PHP file to storage, so I'd like to fix the problem there.
This wouldn't be a problem normally, but I want to pull all of a users messages when they first load up their inbox and then display it to them via jQuery when they select a certain message.

You could use a regexp to replace newlines with <br /> elements:
alert('<?php echo preg_replace("/[\n\r\f]+/", "<br />", $text); ?>');
The m modifier is not needed here: it only changes where ^ and $ match, and the character class already matches line breaks anywhere in the string.
edit: sorry, didn't realise you actually wanted <br /> elements, not spaces. updated answer accordingly.
edit2: like @LainIwakura, I made a mistake in my regexp, partly due to the previous edit. My new regexp only replaces LF/CR/FF characters, not any whitespace character (\s). Note there are a bunch of Unicode line-break characters that I haven't covered... if you need to deal with these, you might want to read up on the regexp syntax for Unicode.

Edit: Okay after much tripping over myself I believe you want this:
$str = preg_replace('/\n+/', '<br />', $str);
And with that I'm going to bed...too late to be answering questions.

I usually use json_encode() to format string for use in JavaScript, as it does everything that's necessary for making JS-valid value.
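A short sketch of that approach. The helper name js_string_literal is made up, and the JSON_HEX_* flags are an extra precaution (my assumption, not part of the answer) for when the output lands inside a <script> block or an HTML attribute:

```php
<?php
// Hypothetical helper: turns a PHP string into a JavaScript string literal.
function js_string_literal(string $text): string
{
    // json_encode() adds the surrounding double quotes and escapes
    // newlines, quotes and backslashes; the HEX flags also escape
    // <, >, &, ' and " as \uXXXX sequences.
    return json_encode(
        $text,
        JSON_HEX_TAG | JSON_HEX_APOS | JSON_HEX_AMP | JSON_HEX_QUOT | JSON_THROW_ON_ERROR
    );
}

$message = "text\ntext"; // contains a real line break
echo 'alert(' . js_string_literal($message) . ');', PHP_EOL;
// prints: alert("text\ntext");
```

Note that there are no quotes around the PHP output in the alert() call, because json_encode() supplies them.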
