Remove white space from scraped text

$url = 'MyUrl';
$contents = file_get_contents($url);
function scrape_between($data, $start, $end){
$data = stristr($data, $start);
$data = substr($data, strlen($start));
$stop = stripos($data, $end);
$data = substr($data, 0, $stop);
return $data;
}
$svetaines_turinys = trim(scrape_between($contents, "<table border=\"0\" cellspacing=\"0\">", "</table>"));
$fp = fopen("autogidas.php", "w+");
fwrite ($fp, "$svetaines_turinys");
fclose ($fp);
$fh = fopen("autogidas.php", 'r') or die("cannot open file");
while(! feof($fh)) {
$visa_data1 = fgets($fh);
$visa_data = trim($visa_data1);
$pavadinimas = trim(scrape_between($visa_data, "<span class=\"ttitle2\">", "</span>"));
$metai = trim(scrape_between($visa_data, "<span class=\"ttitle1\">", "</span>"));
$kaina = trim(scrape_between($visa_data, "<span class=\"ttitle1\" style='float: left;'>", "<br /><span class=\"grey\">"));
echo "$pavadinimas<br> $metai <br> $kaina . <br><br>";
}
fclose($fh);
The output itself is correct, but it contains a lot of extra white space. I tried to use trim(), but it didn't solve the problem.

You could use a regular expression for this. Something like the following collapses every run of white space (spaces, tabs, newlines) into a single space:
$metai = preg_replace('/\s+/', ' ',scrape_between($visa_data, "<span class=\"ttitle1\">", "</span>"));
Do the same for every variable that has the problem.
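As a minimal sketch of that suggestion, the clean-up can be wrapped in a small helper so it does not have to be repeated for every variable. The name normalize_whitespace is an assumption for illustration and is not part of the original code; the outer trim() removes the single space that can remain at either end after the replacement.

```php
<?php
// Hypothetical helper: collapse every run of white space into one space
// and strip what is left at the start and end of the string.
function normalize_whitespace(string $text): string
{
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Sample input standing in for a scraped table cell.
$raw = "  Audi   A4 \n\t 2005 ";
echo normalize_whitespace($raw);
```

With this in place, each scraped field could be passed through the helper, for example normalize_whitespace(scrape_between(...)).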

If you mean that you want to collapse multiple spaces and leave just a single space, you could use str_replace() like this:
function scrape_between($data, $start, $end){
$data = stristr($data, $start);
$data = substr($data, strlen($start));
$stop = stripos($data, $end);
$data = substr($data, 0, $stop);
// replace each pair of spaces with one space; a single pass only halves longer runs
return str_replace('  ', ' ', $data);
}

Related

how to replace a string in a stream for very large files

How can I replace a string in a file that cannot be fully loaded into memory?
I can read it a few bytes at a time, but how can I be sure I didn't read into the middle of my phrase?
I think I should save the last strlen(phrase) bytes and run the replacement on last + current.
This is my WIP:
function stream_str_replace(string $search, string $replace, $handle, int $length, &$count = null)
{
// assure $handle is a resource
if (!is_resource($handle)) {
throw new UnexpectedValueException('handle must be a valid stream resource');
}
// assure $handle is a stream resource
if (($resourceType = get_resource_type($handle)) !== 'stream') {
throw new UnexpectedValueException('handle must be a valid stream resource, but is a "' . $resourceType . '"');
}
$sLength = strlen($search);
$lastInSLength = '';
while (!feof($handle)) {
$str = fread($handle, $length - $sLength - 1);
$batchCount = 0;
$res = str_replace($search, $replace, $lastInSLength . $str, $batchCount);
if ($batchCount) {
$count += $batchCount;
fseek($handle, -($length - 1));
fwrite($handle, $res); // this does not seem to work as I intend it to
}
$lastInSLength = substr($str, -$sLength);
}
}
$fh = fopen('sample.txt', 'r+');
stream_str_replace('consectetur', 'foo', $fh, 50, $count);
fclose($fh);
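The idea of carrying the last bytes over to the next read can be sketched as follows. This is only an illustration under stated assumptions: it writes to a second stream instead of rewriting the file in place (an in-place write cannot work when the replacement has a different length), and the function name stream_replace_copy and its parameters are hypothetical. It holds back strlen($search) - 1 bytes of each chunk so that a match split across two reads is still found, and it assumes $chunkSize is at least 1.

```php
<?php
// Hypothetical helper, not from the original post: copies $in to $out and
// replaces every occurrence of $search, even across chunk boundaries.
function stream_replace_copy(string $search, string $replace, $in, $out, int $chunkSize = 8192): int
{
    if ($search === '') {
        throw new InvalidArgumentException('search must not be empty');
    }
    $len   = strlen($search);
    $keep  = $len - 1; // a partial match at the chunk edge is at most this long
    $carry = '';       // unwritten tail of the previous chunk
    $count = 0;

    while (!feof($in)) {
        $chunk = fread($in, $chunkSize);
        if ($chunk === false) {
            break;
        }
        $raw    = $carry . $chunk;
        $offset = 0;
        // Write everything up to each match, followed by the replacement.
        while (($pos = strpos($raw, $search, $offset)) !== false) {
            fwrite($out, substr($raw, $offset, $pos - $offset) . $replace);
            $offset = $pos + $len;
            $count++;
        }
        // Hold back the last $keep bytes: they may start a match that
        // only completes in the next chunk.
        $rest  = substr($raw, $offset);
        $cut   = max(0, strlen($rest) - $keep);
        fwrite($out, substr($rest, 0, $cut));
        $carry = substr($rest, $cut);
    }
    fwrite($out, $carry); // no more input, so the tail can no longer match
    return $count;
}

// Usage with in-memory streams and a deliberately tiny chunk size.
$in = fopen('php://memory', 'w+');
fwrite($in, 'lorem consectetur ipsum consectetur');
rewind($in);
$out = fopen('php://memory', 'w+');
$replaced = stream_replace_copy('consectetur', 'foo', $in, $out, 7);
rewind($out);
$result = stream_get_contents($out);
```

For a real file, $out could be a temporary file that is renamed over the original once the copy has finished.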

PHP how to remove strings from a file

I have a file named test.txt and I want to remove the lines that are shorter than 30 characters. Lines that start with a capitalized word and end with a period or question mark should not be deleted.
For example content of test.txt file is:
text 1
text 2
text 3
Long text.
text 4
Long text 2?
After filtering, the result should be:
Long text.
Long text 2?
<?php
# create and load the HTML
include('simple_html_dom.php');
$tekst = file_get_html('http://www.naszawiedza.pl/')->plaintext;
foreach ($tekst as $key=>&$value) {
if (strlen($value) > 60) {
unset($yourArray[$key]);
}
}
echo $tekst;
// period
$string = $tekst;
$substr = '.';
$attachment = "\r\n";
//$position = strpos($string, 'a');
$newstring = str_replace($substr, $substr.$attachment, $string);
// bca+++def a+++bcdef
// question mark
$string = $tekst;
$substr = '?';
$attachment = "\r\n";
//$position = strpos($string, 'a');
$newstring = str_replace($substr, $substr.$attachment, $string);
// bca+++def a+++bcdef
// double space
$string = $tekst;
$substr = '\r\n\r\n';
$attachment = "\r\n";
//$position = strpos($string, 'a');
$newstring = str_replace($substr, $substr.$attachment, $string);
// bca+++def a+++bcdef
// exclamation mark
$string = $tekst;
$substr = '!';
$attachment = "\r\n";
//$position = strpos($string, 'a');
$newstring = str_replace($substr, $substr.$attachment, $string);
// bca+++def a+++bcdef
// tab
$string = $tekst;
$substr = "\t";
$attachment = "\r\n";
//$position = strpos($string, 'a');
$newstring = str_replace($substr, $substr.$attachment, $string);
// bca+++def a+++bcdef
echo $newstring;
// the $dane variable holds the data that will be saved
// it could also come from a form, e.g. $dane = $_POST['dane'];
$dane = $newstring;
// assign the file name to $file
$file = "testy.txt";
// file handle, opened for appending
$fp = fopen($file, "a");
// lock the file for writing
flock($fp, 2);
// write the data to the file
fwrite($fp, $dane);
// unlock the file
flock($fp, 3);
// close the file
fclose($fp);
// remove empty lines
$plik = "testy.txt";
// read
$bufor = array();
$fd = fopen($plik, "r");
while (!feof ($fd))
{
$linia = fgets($fd, 1024);
if(strlen(trim($linia)))
{
$bufor[] = $linia;
}
}
fclose($fd);
// write
$fdw = fopen($plik, "w");
foreach($bufor as $wiersz)
{
fwrite($fdw, $wiersz);
}
fclose($fdw);

Here is some sample code that does this.
test.txt contents:
text 1
text 2
Long text.
text3
Long text 2?
Line with 30 characters ending with a question mark?
text4
<?php
$file = fopen("test.txt", "r");
$i = 0;
$string = "";
while(!feof($file))
{
// get the line
$line = trim(fgets($file));
// check whether the line has more than 30 characters
if(strlen($line) > 30)
{
// get the ASCII value of the first character
$value = ord(substr($line, 0, 1));
// get the last character
$last = substr($line, -1);
// keep the line only if it starts with A-Z and ends with '.' or '?'
if((($value >= 65 && $value <= 90) && ($last == '.' || $last == '?')))
{
$string .= $line."\n";
}
}
}
fclose($file);
// put the processed content back into the file
file_put_contents("test.txt", trim($string));
?>
Output after executing the code:
Line with 30 characters ending with a question mark?
I hope this helps.
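The same rule can also be written more compactly with a regular expression. The following is a minimal sketch, assuming the same criteria as the code above (longer than 30 characters, first character A-Z, last character a period or question mark); the function name keep_sentence_lines and the $minLength parameter are made up for this example.

```php
<?php
// Hypothetical helper: keep only lines that are longer than $minLength
// characters, start with A-Z and end with '.' or '?'.
function keep_sentence_lines(array $lines, int $minLength = 30): array
{
    $kept = [];
    foreach ($lines as $line) {
        $line = trim($line);
        if (strlen($line) > $minLength && preg_match('/^[A-Z].*[.?]$/', $line) === 1) {
            $kept[] = $line;
        }
    }
    return $kept;
}

// For a real file, file('test.txt') would supply the array of lines.
$lines = [
    'text 1',
    'Long text.',
    'Line with 30 characters ending with a question mark?',
    'text4',
];
echo implode("\n", keep_sentence_lines($lines));
```

Note that strlen() counts bytes, so lines with multibyte characters would need mb_strlen() instead.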

Here is the full program code. It gives me an empty file. The file should contain only sentences that are longer than 30 characters, start with a capital letter, and end with a question mark or a period.
<?php
# create and load the HTML
include('simple_html_dom.php');
$tekst = file_get_html('http://www.naszawiedza.pl/')->plaintext;
foreach ($tekst as $key=>&$value) {
if (strlen($value) > 60) {
unset($yourArray[$key]);
}
}
echo $tekst;
// period
$string = $tekst;
$substr = '.';
$attachment = "\r\n";
//$position = strpos($string, 'a');
$newstring = str_replace($substr, $substr.$attachment, $string);
// bca+++def a+++bcdef
// question mark
$string = $tekst;
$substr = '?';
$attachment = "\r\n";
//$position = strpos($string, 'a');
$newstring = str_replace($substr, $substr.$attachment, $string);
// bca+++def a+++bcdef
// double space
$string = $tekst;
$substr = '\r\n\r\n';
$attachment = "\r\n";
//$position = strpos($string, 'a');
$newstring = str_replace($substr, $substr.$attachment, $string);
// bca+++def a+++bcdef
// exclamation mark
$string = $tekst;
$substr = '!';
$attachment = "\r\n";
//$position = strpos($string, 'a');
$newstring = str_replace($substr, $substr.$attachment, $string);
// bca+++def a+++bcdef
// tab
$string = $tekst;
$substr = "\t";
$attachment = "\r\n";
//$position = strpos($string, 'a');
$newstring = str_replace($substr, $substr.$attachment, $string);
// bca+++def a+++bcdef
echo $newstring;
// the $dane variable holds the data that will be saved
// it could also come from a form, e.g. $dane = $_POST['dane'];
$dane = $newstring;
// assign the file name to $file
$file = "testy.txt";
// file handle, opened for appending
$fp = fopen($file, "a");
// lock the file for writing
flock($fp, 2);
// write the data to the file
fwrite($fp, $dane);
// unlock the file
flock($fp, 3);
// close the file
fclose($fp);
// remove empty lines
$plik = "testy.txt";
// read
$bufor = array();
$fd = fopen($plik, "r");
while (!feof ($fd))
{
$linia = fgets($fd, 1024);
if(strlen(trim($linia)))
{
$bufor[] = $linia;
}
}
fclose($fd);
// write
$fdw = fopen($plik, "w");
foreach($bufor as $wiersz)
{
fwrite($fdw, $wiersz);
}
fclose($fdw);
$file = fopen("testy.txt", "r");
$i = 0;
$string = "";
while(!feof($file))
{
// get the line
$line = trim(fgets($file));
// check whether the line has more than 30 characters
if(strlen($line) > 30)
{
// get the ASCII value of the first character
$value = ord(substr($line, 0, 1));
// get the last character
$last = substr($line, -1);
// keep the line only if it starts with A-Z and ends with '.' or '?'
if((($value >= 65 && $value <= 90) && ($last == '.' || $last == '?')))
{
$string .= $line."\n";
}
}
}
fclose($file);
// put the processed content back into the file
file_put_contents("testy.txt", trim($string));
?>

How to read specific line text with interval using PHP

I want to read lines from a text file in groups of 4, with 4 lines shown per page.
If I load domain.com/pages/page2.php,
the output should be lines 5, 6, 7, 8.
If I load domain.com/pages/page3.php,
the output should be lines 9, 10, 11, 12.
My code:
$file1 = basename($_SERVER["SCRIPT_FILENAME"], '.php') ;
$file1 = preg_replace("/.+?(\\d+).*/", "$1", $file1);
$file2 = ($file1 - 1);
$file3 = ($file2 *4);
$file4 = ($file3 + 3 );
function retrieveText($file, $init, $end, $sulfix = '')
{
$i = 1;
$output = '';
$handle = fopen($file, 'r');
while (false === feof($handle) && $i <= $end) {
$data = fgets($handle);
if ($i >= $init) {
$output .= $data . $sulfix;
}
$i++;
}
fclose($handle);
return $output;
}
echo retrieveText('file.txt', $file3, $file4, '<br>');
It does not work: lines are missing.

First of all, I'd advise you to drop the regex that extracts the page number from the script name and use the following URL format instead:
domain.com/pages?page=1
domain.com/pages?page=2
domain.com/pages?page=3
And so on. You then use $_GET['page'] to retrieve the page number.
The way I'd go about it is to load all the lines of the text into an array and use the array_slice() function. Something along these lines should do:
function retrieveText($file, $page, $per_page, $suffix)
{
$content = file_get_contents($file);
$array = explode(PHP_EOL, $content);
$start = --$page * $per_page;
$lines = array_slice($array, $start, $per_page);
$output = '';
foreach ($lines as $line) {
$output .= $line . $suffix;
}
return $output;
}
You would then call the function like this:
// fall back to the first page when no (or an invalid) page number is given
$page = isset($_GET['page']) ? max(1, (int) $_GET['page']) : 1;
echo retrieveText('file.txt', $page, 4, '<br>');
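The offset arithmetic behind this is easy to get wrong by one, so here is a minimal sketch that isolates it. The function name page_lines is an assumption for illustration, and $perPage is assumed to be at least 1; pages are 1-based, so page 2 with 4 lines per page starts at array index 4, which is line 5.

```php
<?php
// Hypothetical helper: return the lines that belong to a 1-based page number.
function page_lines(array $lines, int $page, int $perPage = 4): array
{
    $page  = max(1, $page);            // treat 0 or negative input as page 1
    $start = ($page - 1) * $perPage;   // index of the first line on that page
    return array_slice($lines, $start, $perPage);
}

// file('file.txt', FILE_IGNORE_NEW_LINES) would give the same kind of array
// for a real file; generated lines keep the example self-contained.
$lines = array_map(function ($n) {
    return "line $n";
}, range(1, 12));

echo implode('<br>', page_lines($lines, 2));
```

A page number past the end of the file simply yields an empty array, which the caller can use to show a "no such page" message.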

How to remove [img] bbcode with php

I have a string
$content = "your image is [img]url to image.png[/img] now you can use it";
With a PHP script I want to get:
$content = "your image is now you can use it";

$content = "your image is [img]url to image.png[/img] now you can use it";
echo preg_replace("/\[img\](.+?)\[\/img\]/i", '', $content);
Output:
your image is now you can use it
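Removing only the tag leaves the spaces on both sides of it in place, so two spaces end up next to each other. A minimal sketch, assuming that is unwanted, is to collapse the leftover white space in a second step; the function name strip_img_bbcode is hypothetical, and the s modifier is added so that a tag spanning several lines is matched as well.

```php
<?php
// Hypothetical helper: remove every [img]...[/img] pair, then collapse the
// white space that the removal leaves behind.
function strip_img_bbcode(string $text): string
{
    $without = preg_replace('/\[img\].*?\[\/img\]/is', '', $text);
    return trim(preg_replace('/\s{2,}/', ' ', $without));
}

$content = "your image is [img]url to image.png[/img] now you can use it";
echo strip_img_bbcode($content);
```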

If the string contains only a single [img]...[/img] pair, you can use a combination of substr() and strpos():
$first = substr($content, 0, strpos($content, '[img]'));
$end = substr($content, strpos($content, '[/img]') + 6);
$content = $first . $end;
If there can be multiple instances within the same string, you'll need to put it in a loop:
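$openImg = strpos($content, '[img]');
while ($openImg !== false) {
    // look for the closing tag only after the opening one
    $closeImg = strpos($content, '[/img]', $openImg);
    if ($closeImg === false) {
        break; // unbalanced tag: leave the rest of the string untouched
    }
    $content = substr($content, 0, $openImg) . substr($content, $closeImg + 6);
    $openImg = strpos($content, '[img]');
}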

php substring occurances between two strings in an html file

I have an HTML file as a source. It contains several instances of the following code:
<span itemprop="name">NAME</span>
where the NAME part is different every time.
How can I write PHP code that goes through the HTML, extracts all the names between <span itemprop="name"> and </span>, and puts them in an array?
I have tried this code, but it doesn't work:
$prev=$html;
for($i=0; $i<10; $i++){
$current = explode('<span itemprop="name">', $prev);
$cur = explode('</span>', $current[1]);
$names[] = $cur[0];
$prev = $current[2];
}
print_r($names);

A better way would probably be to use PHP's DOMDocument, Simple HTML DOM, or another DOM parser rather than the approach you planned.
Here is an example of working DOMDocument code:
$doc = new DOMDocument();
$doc->loadHTML('<html><body><span itemprop="name">1</span><span itemprop="name">2</span><span itemprop="name">3</span></body></html>');
$finder = new DomXPath($doc);
$nodes = $finder->query("//*[contains(@itemprop, 'name')]");
foreach($nodes as $node)
{
echo $node->nodeValue . '<br />';
}
Outputs:
1
2
3
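Since the question asks for the names in an array, the same approach can be sketched as a function that returns them instead of printing them. The function name extract_itemprop_names is an assumption for illustration. It matches the attribute exactly with @itemprop="name" rather than with contains(), and it silences libxml warnings because real-world HTML is often not well-formed; it also assumes $html is not an empty string, since loadHTML() rejects empty input.

```php
<?php
// Hypothetical helper: collect the text of every <span itemprop="name">.
function extract_itemprop_names(string $html): array
{
    $previous = libxml_use_internal_errors(true); // collect parse warnings quietly
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $names = [];
    foreach ($xpath->query('//span[@itemprop="name"]') as $node) {
        $names[] = trim($node->textContent);
    }
    return $names;
}

$html = '<html><body><span itemprop="name">Alice</span>'
      . '<p>other</p><span itemprop="name">Bob</span></body></html>';
print_r(extract_itemprop_names($html));
```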

I kind of feel bad for saying this, but you could use a regular expression:
preg_match_all('/<span itemprop="name">(.*?)<\/span>/i', $html, $matches);
var_dump($matches[1]); // the captured names are stored in $matches[1]

This function gets us the "NAME":
function getbetween($content,$start,$end) {
$r = explode($start, $content);
if (isset($r[1])){
$r = explode($end, $r[1]);
return $r[0];
}
return '';
}
This function replaces only the first occurrence:
<?php
function str_replace_once($search, $replace, $subject) {
$firstChar = strpos($subject, $search);
if($firstChar !== false) {
$beforeStr = substr($subject,0,$firstChar);
$afterStr = substr($subject, $firstChar + strlen($search));
return $beforeStr.$replace.$afterStr;
} else {
return $subject;
}
}
?>
Now a loop:
$start = '<span itemprop="name">';
$end = '</span>';
// compare with false explicitly: a match at position 0 would otherwise end the loop
while (strpos($content, $start) !== false) {
$name = getbetween($content, $start, $end);
$content = str_replace_once($start.$name.$end, '',$content);
echo $name.'<br>';
}

Use this function:
function get_string_between($string, $start, $end){
$string = ' ' . $string;
$ini = strpos($string, $start);
if ($ini == 0) return '';
$ini += strlen($start);
$len = strpos($string, $end, $ini) - $ini;
return substr($string, $ini, $len);
}
$fullstring = 'this is my [tag]dog[/tag]';
$parsed = get_string_between($fullstring, '[tag]', '[/tag]');
echo $parsed; // (result = dog)
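That function returns only the first match. For the question above, which needs every occurrence, the same strpos()/substr() technique can be extended with a moving offset. This is a minimal sketch; the name get_all_between is an assumption and not part of the original answer.

```php
<?php
// Hypothetical helper: return every substring found between $start and $end.
function get_all_between(string $haystack, string $start, string $end): array
{
    if ($start === '' || $end === '') {
        return []; // empty delimiters would never advance the search
    }
    $found  = [];
    $offset = 0;
    while (($from = strpos($haystack, $start, $offset)) !== false) {
        $from += strlen($start);
        $to = strpos($haystack, $end, $from);
        if ($to === false) {
            break; // opening delimiter without a closing one
        }
        $found[] = substr($haystack, $from, $to - $from);
        $offset  = $to + strlen($end); // continue after this match
    }
    return $found;
}

$html = '<span itemprop="name">Alice</span> and <span itemprop="name">Bob</span>';
print_r(get_all_between($html, '<span itemprop="name">', '</span>'));
```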
