Extract part of string matching pattern - php

I would like to scan a large piece of text using PHP and find all matches for a pattern, but then also 2 lines above the match and 2 lines below.
My text looks like this, but with some extra unnecessary text above and below this sample:
1
Description text
123.456.12
10.00
10.00
3
Different Description text
234.567.89
10.00
30.00
#Some footer text that is not needed and will change for each text file#
15
More description text
564.238.02
4.00
60.00
15
More description text
564.238.02
4.00
60.00
#Some footer text that is not needed and will change for each text file#
15
More description text
564.238.02
4.00
60.00
15
More description text
564.238.02
4.00
60.00
Using PHP, I am looking to match each number in bold (always same format - 3 numbers, dot, 3 numbers, dot, 2 numbers) but then also return the previous 2 lines and the next 2 lines and hopefully return an array so that I can use:
$contents[$i]["qty"] = "1";
$contents[$i]["description"] = "Description text";
$contents[$i]["price"] = "10.00";
$contents[$i]["total"] = "10.00";
etc...
Is this possible and would I use regex? Any help or advice would be greatly appreciated!
Thanks
ANSWERED BY vzwick
This is my final code that I used:
$items_array = array();
$counter = 0;
if (preg_match_all('/(\d+)\n\n(\w.*)\n\n(\d{3}\.\d{3}\.\d{2})\n\n(\d.*)\n\n(\d.*)/', $text_file, $matches)) {
$items_string = $matches[0];
foreach ($items_string as $value){
$item = explode("\n\n", $value);
$items_array[$counter]["qty"] = $item[0];
$items_array[$counter]["description"] = $item[1];
$items_array[$counter]["number"] = $item[2];
$items_array[$counter]["price"] = $item[3];
$items_array[$counter]["total"] = $item[4];
$counter++;
}
}
else
{
die("No matching patterns found");
}
print_r($items_array);

$filename = "yourfile.txt";
$fp = @fopen($filename, "r");
if (!$fp) die('Could not open file ' . $filename);
$i = 0; // element counter
$n = 0; // inner element counter
$field_names = array('qty', 'description', 'some_number', 'price', 'total');
$result_arr = array();
while (($line = fgets($fp)) !== false) {
$result_arr[$i][$field_names[$n]] = trim($line);
$n++;
if ($n % count($field_names) == 0) {
$i++;
$n = 0;
}
}
fclose($fp);
print_r($result_arr);
Edit: Well, regex then.
$filename = "yourfile.txt";
$file_contents = @file_get_contents($filename);
if (!$file_contents) die("Could not open file " . $filename . " or empty file");
if (preg_match_all('/(\d+)\n\n(\w.*)\n\n(\d{3}\.\d{3}\.\d{2})\n\n(\d.*)\n\n(\d.*)/', $file_contents, $matches)) {
print_r($matches[0]);
// do your matching to field names from here ..
}
else
{
die("No matching patterns found");
}

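(.+)\n+(.+)\n+(\d{3}\.\d{3}\.\d{2})\n+(.+)\n+(.+)
It might be necessary to replace \n with \r\n. Make sure the regex runs in a mode where "." does not match the newline character. Note that the quantifier sits inside the parentheses: (.+) captures the whole line, whereas (.)+ would capture only its last character.
To reference groups by name, use a named capturing group: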
(?P<name>regex)
example of named capturing groups.
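As a rough sketch of how named groups could produce the array the question asks for: the sample text below assumes single newlines between fields (the accepted code above uses \n\n, so adjust to the real file), and the group names are simply the field names from the question.

```php
<?php
// Sample input; the single-newline layout is an assumption about the real file.
$text = "header\n1\nDescription text\n123.456.12\n10.00\n10.00\n"
      . "3\nDifferent Description text\n234.567.89\n10.00\n30.00\n#footer#";

// \R matches any newline sequence; the m flag lets ^ match at each line start.
$pattern = '/^(?P<qty>\d+)\R(?P<description>.+)\R'
         . '(?P<number>\d{3}\.\d{3}\.\d{2})\R(?P<price>[\d.]+)\R(?P<total>[\d.]+)/m';

$items = array();
if (preg_match_all($pattern, $text, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        $items[] = array(
            'qty'         => $m['qty'],
            'description' => trim($m['description']), // trim() drops a stray \r from CRLF files
            'number'      => $m['number'],
            'price'       => $m['price'],
            'total'       => $m['total'],
        );
    }
}
print_r($items);
```

With PREG_SET_ORDER each match arrives as one array, so no explode() step is needed afterwards.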

You could load the file into an array and then use array_slice to cut it into blocks of 5 lines each.
<?php
$file = file("myfile", FILE_IGNORE_NEW_LINES); // drop the trailing newline from each line
$finalArray = array();
for($i = 0; $i < sizeof($file); $i = $i+5)
{
$finalArray[] = array_slice($file, $i, 5);
}
print_r($finalArray);
?>
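A variation on the same idea, sketched with array_chunk and array_combine so each block gets the field names from the question. It assumes the input holds nothing but complete 5-line records (no header or footer lines), which is not true of the sample in the question as posted; the $lines array below stands in for the result of file().

```php
<?php
// Hypothetical input: what file('myfile', FILE_IGNORE_NEW_LINES) might return.
$lines = array(
    '1', 'Description text', '123.456.12', '10.00', '10.00',
    '3', 'Different Description text', '234.567.89', '10.00', '30.00',
);
$fields = array('qty', 'description', 'number', 'price', 'total');

$records = array();
foreach (array_chunk($lines, count($fields)) as $chunk) {
    // Skip an incomplete trailing chunk instead of letting array_combine fail.
    if (count($chunk) === count($fields)) {
        $records[] = array_combine($fields, $chunk);
    }
}
print_r($records);
```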

Related

Searching a file for multiple strings and output the data

How can I search a .tsv file for multiple matches to a string and export them to a database?
What I'm trying to do is search a large file called mdata.tsv (1.5m lines) for a string given to it from an array. Afterwards output matching columns data.
The current code is what I've gotten stuck at:
<?php
$file = fopen("mdata.tsv","r"); //open file
$movies = glob('./uploads/Videos/*/*/*/*.mp4', GLOB_BRACE); //Find all the movies
$movID = array(); //Array for movies IDs
//Get XML and add the IDs to $movID()
foreach ($movies as $movie){
$pos = strrpos($movie, '/');
$xml = simplexml_load_file((substr($movie, 0, $pos + 1) .'movie.xml'));
array_push($movID, $xml->id);
}
//Loop through the TSV rows and search for the $tmdbID then print out the movies category.
foreach ($movID as $tmdbID) {
while(($row = fgetcsv($file, 0, "\t")) !== FALSE) {
fseek($file,0);
$myString = $row[0];
$b = strstr( $myString, $tmdbID );
//Dump out the row for the sake of clarity.
//var_dump($row);
$myString = $row[0];
if ($b == $tmdbID){
echo 'Match ' . $row[0] .' '. $row[8];
} // Displays movie ID and category
}
}
fclose($file);
?>
Example of tsv file:
tt0043936 movie The Lawton Story The Lawton Story 0 1949 \N \N Drama,Family
tt0043937 short The Prize Pest The Prize Pest 0 1951 \N 7 Animation,Comedy,Family
tt0043938 movie The Prowler The Prowler 0 1951 \N 92 Drama,Film-Noir,Thriller
tt0043939 movie Przhevalsky Przhevalsky 0 1952 \N \N Biography,Drama
It looks as though you can simplify this code by using in_array() instead of the nested loops to see whether the ID on the current line is in the list of required IDs. The one change needed for this to work is that you must store strings in the $movID array.
$file = fopen("mdata.tsv","r"); //open file
$movies = glob('./uploads/Videos/*/*/*/*.mp4', GLOB_BRACE); //Find all the movies
$movID = array(); //Array for movies IDs
//Get XML and add the IDs to $movID()
foreach ($movies as $movie){
$pos = strrpos($movie, '/');
$xml = simplexml_load_file((substr($movie, 0, $pos + 1) .'movie.xml'));
// Store ID as string
$movID[] = (string) $xml->id;
}
while(($row = fgetcsv($file, 0, "\t")) !== FALSE) {
if ( in_array($row[0], $movID) ){
echo 'Match ' . $row[0] .' '. $row[8];
} // Displays movie ID and category
}
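With 1.5m lines, in_array() scans the whole ID list for every row. One possible refinement, sketched here under assumptions, is to flip the ID list into array keys and test with isset(). The in-memory stream stands in for mdata.tsv, the IDs are made up from the sample rows, and fgets plus explode is used in place of fgetcsv on the assumption that the file contains no quoted fields.

```php
<?php
$movID  = array('tt0043936', 'tt0043938'); // hypothetical IDs read from the XML files
$wanted = array_flip($movID);              // IDs become keys, so lookups are hash-based

// Stand-in for fopen("mdata.tsv", "r"), filled with the sample rows.
$fh = fopen('php://memory', 'w+');
fwrite($fh, "tt0043936\tmovie\tThe Lawton Story\tThe Lawton Story\t0\t1949\t\\N\t\\N\tDrama,Family\n");
fwrite($fh, "tt0043937\tshort\tThe Prize Pest\tThe Prize Pest\t0\t1951\t\\N\t7\tAnimation,Comedy,Family\n");
fwrite($fh, "tt0043938\tmovie\tThe Prowler\tThe Prowler\t0\t1951\t\\N\t92\tDrama,Film-Noir,Thriller\n");
rewind($fh);

$found = array();
while (($line = fgets($fh)) !== false) {
    $row = explode("\t", rtrim($line, "\r\n"));
    // Column 0 is the ID, column 8 the category list.
    if (isset($row[8], $wanted[$row[0]])) {
        $found[$row[0]] = $row[8];
    }
}
fclose($fh);
print_r($found);
```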

Place content in between paragraphs without images

I am using the following code to place some ad code inside my content.
<?php
$content = apply_filters('the_content', $post->post_content);
$content = explode (' ', $content);
$halfway_mark = ceil(count($content) / 2);
$first_half_content = implode(' ', array_slice($content, 0, $halfway_mark));
$second_half_content = implode(' ', array_slice($content, $halfway_mark));
echo $first_half_content.'...';
echo ' YOUR ADS CODE';
echo $second_half_content;
?>
How can I modify this so that the two paragraphs (top and bottom) enclosing the ad code do not contain images? If the top or bottom paragraph has an image, the code should try the next two paragraphs.
Example: Correct Implementation on the right.
preg_replace version
This code steps through every paragraph ignoring those that contain image tags. The $pcount variable is incremented for every paragraph found without an image, if an image is encountered however, $pcount is reset to zero. Once $pcount reaches the point where it would hit two, the advert markup is inserted just before that paragraph. This should leave the advert markup between two safe paragraphs. The advert markup variable is then nullified so only one advert is inserted.
The following code is just for set up and could be modified to split the content differently, you could also modify the regular expression that is used — just in case you are using double BRs or something else to delimit your paragraphs.
/// set our advert content
$advert = '<marquee>BUY THIS STUFF!!</marquee>' . "\n\n";
/// calculate mid point
$mpoint = floor(strlen($content) / 2);
/// modify back to the start of a paragraph
$mpoint = strripos($content, '<p', -$mpoint);
/// split html so we only work on second half
$first = substr($content, 0, $mpoint);
$second = substr($content, $mpoint);
$pcount = 0;
$regexp = '/<p>.+?<\/p>/si';
The rest is the bulk of the code that runs the replacement. This could be modified to insert more than one advert, or to support more involved image checking.
$content = $first . preg_replace_callback($regexp, function($matches){
global $pcount, $advert;
if ( !$advert ) {
$return = $matches[0];
}
else if ( stripos($matches[0], '<img ') !== FALSE ) {
$return = $matches[0];
$pcount = 0;
}
else if ( $pcount === 1 ) {
$return = $advert . $matches[0];
$advert = '';
}
else {
$return = $matches[0];
$pcount++;
}
return $return;
}, $second);
After this code has been executed the $content variable will contain the enhanced HTML.
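If you would rather avoid global variables, the same callback logic can be wrapped in a function that passes its state by reference through use. This is only a sketch: insert_advert() is a hypothetical helper name, it works on whatever HTML fragment you hand it (for example $second), and like the regular expression above it only matches p tags without attributes.

```php
<?php
// Hypothetical helper: inserts $advert before the second of two consecutive
// image-free paragraphs, using closure state instead of globals.
function insert_advert($html, $advert) {
    $pcount = 0;
    return preg_replace_callback('/<p>.+?<\/p>/si', function ($m) use (&$pcount, &$advert) {
        if ($advert === '') {
            return $m[0];               // advert already placed
        }
        if (stripos($m[0], '<img ') !== false) {
            $pcount = 0;                // an image resets the run of safe paragraphs
            return $m[0];
        }
        if ($pcount === 1) {
            $out = $advert . $m[0];     // previous paragraph was safe too
            $advert = '';
            return $out;
        }
        $pcount++;
        return $m[0];
    }, $html);
}

$html = '<p>one</p><p><img src="a.png"></p><p>two</p><p>three</p>';
echo insert_advert($html, '<div class="ad">AD</div>');
```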
PHP versions prior to 5.3
As your chosen testing area does not support PHP 5.3, and so does not support anonymous functions, you need to use a slightly modified and less succinct version; that makes use of a named function instead.
Also, to support content that may not leave room for the advert in its second half, I have modified $mpoint so that it is calculated to be 80% from the end. This has the effect of including more in the $second part, but it also means your adverts will generally be placed higher up in the markup. No fallback has been implemented, because your question does not mention what should happen in the event of failure.
$advert = '<marquee>BUY THIS STUFF!!</marquee>' . "\n\n";
$mpoint = floor(strlen($content) * 0.8);
$mpoint = strripos($content, '<p', -$mpoint);
$first = substr($content, 0, $mpoint);
$second = substr($content, $mpoint);
$pcount = 0;
$regexp = '/<p>.+?<\/p>/si';
function replacement_callback($matches){
global $pcount, $advert;
if ( !$advert ) {
$return = $matches[0];
}
else if ( stripos($matches[0], '<img ') !== FALSE ) {
$return = $matches[0];
$pcount = 0;
}
else if ( $pcount === 1 ) {
$return = $advert . $matches[0];
$advert = '';
}
else {
$return = $matches[0];
$pcount++;
}
return $return;
}
echo $first . preg_replace_callback($regexp, 'replacement_callback', $second);
You could try this:
<?php
$ad_code = 'SOME SCRIPT HERE';
// Your code.
$content = apply_filters('the_content', $post->post_content);
// Split the content at the <p> tags.
$content = explode ('<p>', $content);
// Find the mid of the article.
$content_length = count($content);
$content_mid = floor($content_length / 2);
// Save no image p's index.
$last_no_image_p_index = NULL;
// Loop beginning from the mid of the article to search for images.
for ($i = $content_mid; $i < $content_length; $i++) {
// If we do not find an image, let it go down.
if (stripos($content[$i], '<img') === FALSE) {
// In case we already have a last no image p, we check
// if it was the one right before this one, so we have
// two p tags with no images in there.
if ($last_no_image_p_index === ($i - 1)) {
// We break here.
break;
}
else {
$last_no_image_p_index = $i;
}
}
}
// If no image-free p tag was found, we use the last one.
if (is_null($last_no_image_p_index)) {
$last_no_image_p_index = ($content_length - 1);
}
// Insert the ad code with a trailing </p>, so the implode below rebuilds the markup correctly.
// array_splice() modifies $content in place; array_slice() cannot insert elements.
array_splice($content, $last_no_image_p_index, 0, $ad_code . '</p>');
$content = implode('<p>', $content);
?>
It will try to find a place for the ad from the mid of your article and if none is found the ad is put to the end.
Regards
func0der
I think this will work:
First explode the paragraphs, then you have to loop it and check if you find img inside them.
If you find it inside, you try the next.
Think of this as pseudo-code, since it's not tested. You will have to write a loop too; see the comments in the code :) Sorry if it contains bugs, it was written in Notepad.
<?php
$i = 0; // counter
$arrBoolImg = array(); // array for the paragraph booleans
$content = apply_filters('the_content', $post->post_content);
$contents = str_replace ('<p>', '<explode><p>', $content); // here we add a custom tag, so we can explode
$contents = explode ('<explode>', $contents); // then explode it, so we can iterate the paragraphs
// fill array with boolean array returned
$arrBoolImg = hasImages($contents);
$halfway_mark = ceil(count($contents) / 2);
/*
TODO (by you):
---
When you have $arrBoolImg filled, you can iterate through it.
You then simply loop from the middle of the array $contents (plural), that is exploded from above.
The starting point for your loop is the middle, the upper bound is the +2 or whatever :-)
Then you simply insert your magic.. And then glue it back together, as you did before.
I think this will work. even though the code may have some bugs, since I wrote it in Notepad.
*/
function hasImages($contents) {
/*
This function loops through the $contents array and checks if they have images in them
The return value, is an array with boolean values, so one can iterate through it.
*/
$arrRet = array(); // array for the paragraph booleans
$i = 0; // local counter; the global $i is not visible inside the function
if (count($contents)>=1) {
foreach ($contents as $v) { // iterate the content
if (strpos($v, '<img') === false) { // did not find img
$arrRet[$i] = false;
}
else { // found img
$arrRet[$i] = true;
}
$i++;
} // end for each loop
return $arrRet;
} // end if count
} // end hasImages func
?>
[This is just an idea, I don't have enough reputation to comment...]
After calling @Olavxxx's method and filling your boolean array you could just loop through that array in an alternating manner starting in the middle: Let's assume your array is 8 entries long. Calculating the middle using your method you get 4. So you check the combination of values 4 + 3, if that doesn't work, you check 4 + 5, after that 3 + 2, ...
So your loop looks somewhat like
$middle = ceil(count($content) / 2);
$i = 1;
while ($i <= $middle) {
$j = $middle + pow(-1, $i) * $i; // pow(), because ^ is bitwise XOR in PHP
$k = $j + 1;
if (!$hasImagesArray[$j] && !$hasImagesArray[$k])
break; // position found
$i++;
}
It's up to you to implement further constraints to make sure the ad is not shown too far up or down in the article...
Please note that you also need to handle special cases, such as arrays that are too short, to avoid reading array offsets that do not exist.
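A bounds-checked sketch of that outward search might look like the following. find_ad_slot() is a hypothetical name, and the input is assumed to be the boolean array from hasImages() above, where true means the paragraph contains an image.

```php
<?php
// Returns the index $j closest to the middle such that paragraphs $j and
// $j + 1 are both image-free, or null if no such pair exists.
function find_ad_slot(array $hasImages) {
    $n = count($hasImages);
    $middle = (int) floor($n / 2);
    for ($offset = 0; $offset < $n; $offset++) {
        // Try one step towards the start, then one step towards the end.
        foreach (array($middle - $offset, $middle + $offset) as $j) {
            if ($j >= 0 && $j + 1 < $n && !$hasImages[$j] && !$hasImages[$j + 1]) {
                return $j;
            }
        }
    }
    return null;
}

$flags = array(false, false, true, true, true, false, false, false);
var_dump(find_ad_slot($flags));
```

The advert would then go between paragraphs $j and $j + 1; a null result is the case where you need a fallback position.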

Finding all matches of string, and also returning line number of match

I have a string variable which contains some text (shown below). The text has line breaks in it as shown. I would like to search the text for a given string, and return the number of matches per line number. For instance, searching for "keyword" would return 1 match on line 3 and 2 matches on line 5.
I have tried using strstr(). It does a good job finding the first match, and giving me the remaining text, so I can do it again and again until there are no matches. Problem is I do not know how to determine which line number the match occurred on.
Hello,
This is some text.
And a keyword.
Some more text.
Another keyword! And another keyword.
Goodby.
Why not split the text on line-feeds and loop, use the index + 1 as a line number:
$txtParts = explode("\n",$txt);
for ($i=0, $length = count($txtParts);$i<$length;$i++)
{
$tmp = strstr($txtParts[$i],'keyword');
if ($tmp)
{
echo 'Line '.($i +1).': '.$tmp;
}
}
Tested, and working. Just a quick tip: since you're looking for matches in a text (sentences, upper- and lower-case etc.), perhaps stristr (case-insensitive) would be better? An example with foreach and stristr:
$txtParts = explode("\n",$txt);
foreach ($txtParts as $number => $line)
{
$tmp = stristr($line,'keyword');
if ($tmp)
{
echo 'Line '.($number + 1).': '.$tmp;
}
}
With this code you can have all data in one array (Linenumber and position numbers)
<?php
$string = "Hello,
This is some text.
And a keyword.
Some more text.
Another keyword! And another keyword.
Goodby.";
$expl = explode("\n", $string);
$linenumber = 1; // first linenumber
$allpos = array();
foreach ($expl as $str) {
$i = 0;
$toFind = "keyword";
$start = 0;
while (($pos = strpos($str, $toFind, $start)) !== false) { // strict check: a match at position 0 must not end the loop
//echo $toFind. " " . $pos;
$start = $pos+1;
$allpos[$linenumber][$i] = $pos;
$i++;
}
$linenumber++; // linenumber goes one up
}
foreach ($allpos as $linenumber => $position) {
echo "Linenumber: " . $linenumber . "<br/>";
foreach ($position as $pos) {
echo "On position: " .$pos . "<br/>";
}
echo "<br/>";
}
Angelo's answer definitely provides more functionality and is probably the best answer, but the following is simple and seems to work. I will continue to play with all solutions.
function findMatches($text,$phrase)
{
$list=array();
$lines=explode("\n", $text);
foreach($lines AS $line_number=>$line)
{
str_replace($phrase,$phrase,$line,$count);
if($count)
{
$list[]='Found '.$count.' match(s) on line '.($line_number+1);
}
}
return $list;
}
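If only the number of matches per line is needed, substr_count() does the counting directly, without the str_replace() side effect. A minimal sketch, assuming a non-empty, case-sensitive search phrase; count_matches_per_line() is a hypothetical name.

```php
<?php
// Returns an array of line number => number of occurrences of $phrase.
function count_matches_per_line($text, $phrase) {
    $counts = array();
    foreach (preg_split('/\R/', $text) as $index => $line) {
        $count = substr_count($line, $phrase); // case-sensitive, non-overlapping
        if ($count > 0) {
            $counts[$index + 1] = $count;      // line numbers start at 1
        }
    }
    return $counts;
}

$text = "Hello,\nThis is some text.\nAnd a keyword.\nSome more text.\n"
      . "Another keyword! And another keyword.\nGoodby.";
print_r(count_matches_per_line($text, 'keyword'));
```

For the sample text this reports one match on line 3 and two on line 5, as the question expects.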

Parsing text files to get some contains

I have some text files. for example: file1.txt and file2.txt.
The content of file1.txt is: Walk word1 in the rain
Walking in the rain is one of the most beautiful word2 experiences.
There are some conditions :
If there are word1 AND word2, I wanna get the text between those 2 words as $between so I will get in the rain
Walking in the rain is one of the most beautiful. And also I wanna get the text after word2 as $content so I will get experiences
If there are only word1 OR word2 (eg = Walk in the rain
Walking in the rain is one of the most beautiful word1 experiences.) Then $between ='' and $content is all of texts-> Walk in the rain
Walking in the rain is one of the most beautiful word1 experiences.
If word2 in front of word1 for example : Walk in word2 the rain
Walking in the rain is one of the most word1 beautiful word1 experiences. then $between = '' and $content is all of the text.
here's my code :
//to get and open the text files
$txt = glob($savePath.'*.txt');
foreach ($txt as $file => $files) {
$handle = fopen($files, "r") or die ('can not open file');
$ori_content = file_get_contents($files);
//count the words of text, to reach until the last word
$words = preg_split('/\s+/',$ori_content ,-1,PREG_SPLIT_NO_EMPTY);
$count = count ($words);
$word1 ='word1';
$word2 ='word2';
if (stripos($ori_content, $word1) && stripos($ori_content, $word2)){
$between = substr($ori_content, stripos($ori_content, $word1)+ strlen($word1), stripos($ori_content, $word2) - stripos($ori_content, $word1)- strlen($word1));
$content = substr($ori_content, stripos($ori_content, $word2)+strlen($word2), stripos($ori_content, $ori_content[$count+1]) - stripos($ori_content,$word2));
}
else
$content = $ori_content;
$q0 = mysql_query("INSERT INTO tb VALUES('','$files','$content','$between')") or die(mysql_error());
but my code still cannot handle for :
the condition number 2(above), I get the result $between = experiences, it should be $between=''
the condition number 3(above). I get the result $between = the rain
Walking in the rain is one of the most word1 beautiful word1 experiences, it should be $between=''
If I get $between in file1.txt, but not in file2.txt, in table between in database, for data file2.txt it should be null in the column between. but it doesn't null, it filled by the between of other text files
I cannot reach the last word.
please help me.. thanks in advance ! :)
I think you're just missing one statement:
...
}
else {
$between = '';
$content = $ori_content;
}
You're probably using this in a loop, so you get the values of the previous loop if you're not explicitly setting $between to an empty string :)
Edit
You also forgot to compare the positions:
if (stripos($ori_content, $word1) && stripos($ori_content, $word2)){
Should be:
$pos1 = stripos($ori_content, $word1);
$pos2 = stripos($ori_content, $word2);
if (false !== $pos1 && false !== $pos2 && $pos1 < $pos2) {
Edit 2
Another thing: your SQL is prone to injection, and you can't properly use the NULL value this way. You could use this kind of construct, but it is preferable to use PDO or mysqli.
$sql_between = is_null($between) ? 'NULL' : "'" . mysql_real_escape_string($between) . "'";
// apply the same treatment for `$files`, etc.
...
mysql_query("INSERT INTO tb VALUES('', $sql_files, $sql_content, $sql_between)");
In this manner you can set $between to null and have it properly get sent to MySQL.
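For comparison, here is roughly what the PDO route could look like. Everything about the connection and schema is an assumption: an in-memory SQLite database (which needs the pdo_sqlite extension) stands in for your MySQL server, and the column names are invented because the question does not show the table definition.

```php
<?php
// Assumption: in-memory SQLite instead of MySQL, with made-up column names.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE tb (id INTEGER PRIMARY KEY, file TEXT, content TEXT, between_text TEXT)');

$files   = 'file2.txt';
$content = "It's all the text"; // the quote needs no manual escaping
$between = null;                // no word1 ... word2 span in this file

$stmt = $pdo->prepare(
    'INSERT INTO tb (file, content, between_text) VALUES (:file, :content, :between)'
);
$stmt->bindValue(':file', $files);
$stmt->bindValue(':content', $content);
// Bind a real SQL NULL when there is nothing between the two words.
$stmt->bindValue(':between', $between, $between === null ? PDO::PARAM_NULL : PDO::PARAM_STR);
$stmt->execute();

$row = $pdo->query('SELECT file, content, between_text FROM tb')->fetch(PDO::FETCH_ASSOC);
var_dump($row['between_text']);
```

Because the values travel separately from the SQL text, no escaping function is needed and NULL stays NULL.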
I've wrapped the parser logic into a function parse_content.
$txt = glob($savePath.'*.txt');
foreach ($txt as $file => $files) {
$handle = fopen($files, "r") or die ('can not open file');
$ori_content = file_get_contents($files);
$word1 ='word1';
$word2 ='word2';
$result = parse_content($word1, $word2, $ori_content);
extract($result);
$q0 = mysql_query("INSERT INTO tb VALUES('','$files','$content','$between')") or die(mysql_error());
}
function parse_content($word1, $word2, $input) {
$between = '';
$content = '';
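$w1 = stripos($input, $word1);
$w2 = stripos($input, $word2);
// Compare against false explicitly: a match at position 0 is falsy
if($w1 !== false && $w2 !== false) {
if($w2 < $w1) {
// Case 3
$content = $input;
} else {
// Case 1
// preg_quote() escapes regex metacharacters in the words; the s flag lets
// "." span line breaks and the i flag matches stripos()'s case-insensitivity
$reg_between = '/' . preg_quote($word1, '/') . '(.*?)' . preg_quote($word2, '/') . '/si';
$reg_content = '/' . preg_quote($word2, '/') . '(.*)$/si';
preg_match($reg_between, $input, $match);
$between = trim($match[1]);
preg_match($reg_content, $input, $match);
$content = trim($match[1]);
}
} else if($w1 !== false || $w2 !== false) {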
// Case 2
$content = $input;
} else {
// Case 4
$content = $input;
}
return compact('between', 'content');
}

Intelligently removing excess indention from a string

I'm trying to remove some excessive indentation from a string, in this case SQL, so it can be put into a log file. So I need to find the smallest amount of indentation (i.e. tabs) and remove it from the front of each line, but the following code ends up printing out exactly the same string. Any ideas?
In other words, I want to take the following (NOTE: StackOverflow editor converted my tabs to spaces, in the code, a tab simulates 4 spaces, but it really is a \t character)
SELECT
blah
FROM
table
WHERE
id=1
and convert it to
SELECT
blah
FROM
table
WHERE
id=1
here's the code I tried and fails
$sql = '
SELECT
blah
FROM
table
WHERE
id=1
';
// it's most likely indented SQL, remove any indentation
$lines = explode("\n", $sql);
$space_count = array();
foreach ( $lines as $line )
{
preg_match('/^(\t+)/', $line, $matches);
$space_count[] = strlen($matches[0]);
}
$min_tab_count = min($space_count);
$place = 0;
foreach ( $lines as $line )
{
$lines[$place] = preg_replace('/^\t{'. $min_tab_count .'}/', '', $line);
$place++;
}
$sql = implode("\n", $lines);
print '<pre>'. $sql .'</pre>';
It seems the problem was
strlen($matches[0])
returns 0 and 1 for the first and last line, which isn't the 3 I actually wanted as the minimum, so a quick hack was to
trim the SQL
skip counting the length if it's less than 2
Not the most elegant solution, but it'll always work because tabs are usually in the 4+ count in this code. Here's the fixed code:
$sql = '
SELECT
blah
FROM
table
WHERE
id=1
';
// it's most likely indented SQL, remove any indentation
$lines = explode("\n", $sql);
$space_count = array();
foreach ( $lines as $line )
{
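// Only count lines that actually start with more than one tab;
// checking preg_match()'s result avoids reading $matches[0] when nothing matched
if ( preg_match('/^(\t+)/', $line, $matches) && strlen($matches[0]) > 1 )
{
$space_count[] = strlen($matches[0]);
}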
}
$min_tab_count = min($space_count);
$place = 0;
foreach ( $lines as $line )
{
$lines[$place] = preg_replace('/^\t{'. $min_tab_count .'}/', '', $line);
$place++;
}
$sql = implode("\n", $lines);
print $sql;
private function cleanIndentation($str) {
$content = '';
foreach(preg_split("/((\r?\n)|(\r\n?))/", trim($str)) as $line) {
$content .= " " . trim($line) . PHP_EOL;
}
return $content;
}
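Note that the function above strips all leading whitespace, so nested indentation is lost. A sketch that keeps the relative indentation, as the question asks, could measure the smallest tab count over non-blank lines and remove only that much. remove_common_indent() is a hypothetical name, and the sketch assumes the indentation consists of tab characters only.

```php
<?php
// Removes the indentation shared by all non-blank lines (tabs only).
function remove_common_indent($text) {
    $lines = preg_split('/\R/', $text);
    $min = null;
    foreach ($lines as $line) {
        if (trim($line) === '') {
            continue;                    // blank lines do not count towards the minimum
        }
        $indent = strspn($line, "\t");   // number of leading tabs
        if ($min === null || $indent < $min) {
            $min = $indent;
        }
    }
    $min = ($min === null) ? 0 : $min;
    foreach ($lines as $i => $line) {
        // Whitespace-only lines become empty; others lose the shared prefix.
        $lines[$i] = (trim($line) === '') ? '' : substr($line, $min);
    }
    return implode("\n", $lines);
}

$sql = "\n\t\t\tSELECT\n\t\t\t\tblah\n\t\t\tFROM\n\t\t\t\ttable\n\t\t";
echo remove_common_indent($sql);
```

Skipping blank lines when computing the minimum sidesteps the problem described earlier, where the first and last lines reported 0 and 1 tabs.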
