Parsing XML and TXT files in PHP

I have a Text.xml file containing some text with bibliographic references in it. It looks like this:
Text.xml
<p>…blabla S.King (1987). Bla bla bla J.Doe (2001) blabla bla J.Martin (1995) blabla…</p>
And I have a Reference.txt file with a list of bibliographic references and an ID number for each reference. It looks like this:
Reference.txt
b1#S.King (1987)
b2#J.Doe (2001)
b3#J.Martin (1995)
I would like to find every bibliographic reference from Reference.txt in Text.xml and wrap it in a tag carrying its ID. The goal is a TextWithReference.xml file, which should look like this:
TextWithReference.xml
<p>…blabla <ref type="biblio" target="b1">S.King (1987)</ref>. Bla bla bla <ref type="biblio" target="b2">J.Doe (2001)</ref> blabla bla <ref type="biblio" target="b3">J.Martin (1995)</ref> blabla…</p>
To do this, I use a php file.
Search&Replace.php
<?php
$handle = fopen("Reference.txt","r");
while(!feof($handle))
{
$ligne = fgets($handle,1024);
$tabRef[] = $ligne;
}
fclose($handle);
$handleXML = fopen("Text.xml","r");
$fp = fopen("TextWithReference.xml", "w");
while(!feof($handleXML))
{
$ligneXML = fgets($handleXML,2048);
for($i=0;$i<sizeof($tabRef);$i++)
{
$tabSearch = explode('/#/',$tabRef[$i]);
$xmlID = $tabSearch[0];
$searchString = trim($tabSearch[1]);
if(preg_match('/$searchString/',$ligneXML))
{
$ligneXML = preg_replace('/($searchString)/','/<ref type=\"biblio\" target=\"#$xmlID\">\\0</ref>/',$ligneXML);
}
}
fwrite($fp, $ligneXML);
}
fclose($handleXML);
fclose($fp);
?>
The problem is that this PHP script just copies Text.xml to TextWithReference.xml without identifying the bibliographic references and without adding the tags…
Many thanks for your help!

There are a number of problems with your code.
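The search strings contain characters that are special in regular expressions, such as parentheses. You need to escape these if you want to match them literally. The preg_quote function does this; pass the regex delimiter as its second argument so that it is escaped as well.
Your file-reading loops are not correct. while (!feof()) is not the right way to read through a file, because the EOF flag isn't set until after a read has hit the end of the file, so the loop body runs one extra time with fgets() returning false. The usual way to write this is while ($ligne = fgets($handle)); comparing the result against false explicitly, as in while (($ligne = fgets($handle)) !== false), is safer because it does not stop on a final line that contains only "0".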
You have single quotes around the strings where you're trying to substitute $searchString and $xmlID. Variables are only substituted inside double quotes. See What is the difference between single-quoted and double-quoted strings in PHP?
You don't need to put / delimiters around the replacement string in preg_replace.
It's inefficient to explode, trim and escape the lines from the Reference.txt every time you're processing a line in Text.xml. Do it once when you're reading Reference.txt.
In the replacement string, use $0 to insert the matched text from the source. \0 also works, but the PHP manual prefers the $n form.
You don't need parentheses around the search string in the regexp, since you're not using the $1 capture group in the replacement. And since it's around the whole regexp, it's the same as $0.
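The quoting and escaping points above can be checked in isolation. The following is a minimal sketch; the sample line and the three result variables are made up for illustration and are not part of the original script:

```php
<?php
$searchString = 'S.King (1987)';
$line = 'blabla S.King (1987). Bla bla bla';

// Single quotes: no interpolation, so the pattern is the literal text
// "$searchString" ("$" followed by letters), which cannot match the line.
$singleQuoted = preg_match('/$searchString/', $line);

// Double quotes interpolate, but the unescaped parentheses form a capture
// group, so the pattern looks for "S.King 1987" instead of "S.King (1987)".
$unescaped = preg_match("/$searchString/", $line);

// preg_quote() escapes ".", "(" and ")"; the second argument also escapes
// the "/" delimiter if it occurs in the search string.
$escaped = preg_match('/' . preg_quote($searchString, '/') . '/', $line);

var_dump($singleQuoted, $unescaped, $escaped);
```

Only the last call should report a match, which is why the original script copied the file unchanged.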
Here's the working rewrite:
<?php
$handle = fopen("Reference.txt", "r");
$tabRef = array();
while (($ligne = fgets($handle, 1024)) !== false) {
    $ligne = trim($ligne);
    if ($ligne === '') {
        continue; // skip blank lines instead of stopping at the first one
    }
    list($xmlID, $searchString) = explode('#', $ligne, 2);
    // Escape regex metacharacters, including the "/" delimiter
    $tabRef[] = array($xmlID, preg_quote($searchString, '/'));
}
fclose($handle);
$handleXML = fopen("Text.xml", "r");
$fp = fopen("TextWithReference.xml", "w");
while (($ligneXML = fgets($handleXML, 2048)) !== false) {
    foreach ($tabRef as $tabSearch) {
        $xmlID = $tabSearch[0];
        $searchString = $tabSearch[1];
        if (preg_match("/$searchString/", $ligneXML)) {
            $ligneXML = preg_replace("/$searchString/", "<ref type=\"biblio\" target=\"#$xmlID\">$0</ref>", $ligneXML);
        }
    }
    fwrite($fp, $ligneXML);
}
fclose($handleXML);
fclose($fp);
?>
Another improvement takes advantage of the ability to use arrays as the search and replacement arguments to preg_replace, instead of using a loop. When reading Reference.txt, create the regexp and replacement strings there, and put each into its own array.
<?php
$handle = fopen("Reference.txt", "r");
$search = array();
$replacement = array();
while (($ligne = fgets($handle, 1024)) !== false) {
    $ligne = trim($ligne);
    if ($ligne === '') {
        continue; // skip blank lines
    }
    list($xmlID, $searchString) = explode('#', $ligne, 2);
    $search[] = "/" . preg_quote($searchString, '/') . "/"; // escape the "/" delimiter too
    $replacement[] = "<ref type=\"biblio\" target=\"#$xmlID\">$0</ref>";
}
fclose($handle);
$handleXML = fopen("Text.xml", "r");
$fp = fopen("TextWithReference.xml", "w");
while (($ligneXML = fgets($handleXML, 2048)) !== false) {
    $ligneXML = preg_replace($search, $replacement, $ligneXML);
    fwrite($fp, $ligneXML);
}
fclose($handleXML);
fclose($fp);
?>
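To try the array form on the sample data without touching any files, the same logic can be wrapped in a function that works on strings. This is only a sketch: tagReferences() is a hypothetical helper name, and it assumes the IDs contain no "$" or backslash characters, since those are special in a replacement string.

```php
<?php
/**
 * Hypothetical helper: wraps every reference listed in $refLines
 * (one "id#reference text" entry per element) found in $text in a <ref> tag.
 */
function tagReferences(string $text, array $refLines): string
{
    $search = array();
    $replacement = array();
    foreach ($refLines as $ligne) {
        $ligne = trim($ligne);
        if ($ligne === '' || strpos($ligne, '#') === false) {
            continue; // ignore blank or malformed lines
        }
        list($xmlID, $searchString) = explode('#', $ligne, 2);
        $search[] = '/' . preg_quote($searchString, '/') . '/';
        $replacement[] = '<ref type="biblio" target="#' . $xmlID . '">$0</ref>';
    }
    // Each pattern is applied in turn to the result of the previous one.
    return preg_replace($search, $replacement, $text);
}

$refs = array('b1#S.King (1987)', 'b2#J.Doe (2001)', 'b3#J.Martin (1995)');
echo tagReferences('<p>blabla S.King (1987). Bla bla J.Doe (2001) bla J.Martin (1995) bla</p>', $refs), PHP_EOL;
```

Because the patterns run one after another, a reference text that also appears inside an already inserted tag would be wrapped again; that does not happen with the sample data.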

Related

How to find particular pattern using regex?

I have some .php files in a directory that calls a user defined function:
_l($string);
A different string is passed to that function each time, and all the strings are static (i.e. none of them comes from user input).
Now I want a script that lists all the strings, from all the files in that directory, that are passed to _l($string);
I had tried:
$fp = fopen($file, "r");
while(!feof($fp)) {
$content .= fgets($fp, filesize($file));
if(preg_match_all('/(_l\(\'.*?\'\);)/', fgets($fp, filesize($file)), $matches)){
foreach ($matches as $key => $value) {
array_push($text, $value[0]);
}
}
}
I get some strings, but not every string in the files; some are not matched by the regex. What do I need to change to get all of them?
It is easier to read the whole file at once and match strings in either double (") or single (') quotes as the _l() argument:
$string = file_get_contents($file);
preg_match_all('/_l\([\'"](.*?)[\'"]\);/', $string, $matches);
$text = $matches[1];
If needed you can add some optional spaces before and after the ( and before the ):
'/_l\s*\(\s*[\'"](.*?)[\'"]\s*\);/'
Also, if the function can be used in a loop or if or something where it's not terminated by a semi-colon ; then remove it from the pattern.
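As a variation on the patterns above, a backreference can require the closing quote to be the same character as the opening one, so an apostrophe inside a double-quoted string does not end the match early. This is a sketch under assumptions: extractLStrings() is a hypothetical name, the trailing semicolon is left out of the pattern, and escaped quotes inside the literals are not handled.

```php
<?php
/**
 * Hypothetical helper: returns the string literals passed to _l() in $source.
 */
function extractLStrings(string $source): array
{
    // ([\'"]) captures the opening quote; \1 requires the same closing quote.
    preg_match_all('/_l\s*\(\s*([\'"])(.*?)\1\s*\)/s', $source, $matches);
    return $matches[2];
}

$sample = "<?php echo _l('Hello'); if (_l(\"It's late\")) { _l( 'Bye' ); }";
print_r(extractLStrings($sample));
```

To cover a whole directory, the function could be called on file_get_contents() of each file and the results merged.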

remove line where multiple characters are present

I am reading file with file_get_contents.
Some lines can have multiple "=" chars and I want to remove these lines.
I tried
str_replace("=", "", $content);
but this removes every occurrence of "=" instead of removing the lines that contain them.
Any idea please?
UPDATE: my content from file looks like this:
something
apple is =greee= maybe red
sugar is white
sky is =blue
Without seeing an example of your file/strings, it's a bit tricky to advise, but the basic principle I would work to would be something like this:
$FileName = "PathToFile";
$FileData = file_get_contents($FileName);
$FileDataLines = explode("\r\n", $FileData); // split into lines; use the separator your file actually uses ("\n", "\r\n", etc.)
$FindChar = "="; // the character you want to find
$Results = array(); // lines to keep
foreach($FileDataLines as $FileDataLine){
    $NoOfChar = substr_count($FileDataLine, $FindChar); // number of occurrences of the character in the line
    if($NoOfChar <= 1){ // keep the line if the character appears fewer than two times
        $Results[] = $FileDataLine;
    }
}
# print the results
print_r($Results);
# build a new file
$NewFileName = "YourNewFile";
$NewFileData = implode("\r\n", $Results);
file_put_contents($NewFileName, $NewFileData);
Hope it helps
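The same idea can be written as a function that does not depend on the file's line-ending style. This is a sketch, not the answerer's code: removeLinesWithRepeatedChar() is a hypothetical name, and the output is normalised to "\n" line endings.

```php
<?php
/**
 * Hypothetical helper: drops every line in which $char occurs more than once.
 */
function removeLinesWithRepeatedChar(string $content, string $char = '='): string
{
    // \R matches "\n", "\r\n" and "\r", so the line-ending style does not matter.
    $lines = preg_split('/\R/', $content);
    $kept = array_filter($lines, function ($line) use ($char) {
        return substr_count($line, $char) <= 1;
    });
    return implode("\n", $kept);
}

$content = "something\napple is =greee= maybe red\nsugar is white\nsky is =blue";
echo removeLinesWithRepeatedChar($content), PHP_EOL;
```

With the sample from the question, only the "apple" line contains two "=" characters, so it is the only one dropped.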

REGEX: replacing everything between two strings, but not the start/end strings

I am not the best with RegEx... I have some PHP code:
$pattern = '/activeAds = \[(.*?)\]/si';
$modData = preg_replace($pattern,'TEST',$data);
So I have a JavaScript file, and it declares an array:
var activeAds = [];
I need to populate the array with my string or, if the array already has a string inside it, replace that string with mine (in this case "TEST").
Right now, my regex replaces everything, including the start and end strings; I need to replace only what's between them.
I'm left with:
var TEST;
TIA
You could capture what's before and what's after the part you want replacing:
$pattern = '/(activeAds = \[).*?(\])/si';
After capturing these parts, you can keep them and replace the part in the middle:
$modData = preg_replace($pattern, '\1TEST\2', $data);
There are many ways you could do this, mine is below:
$data = array("activeAds = testing123");
$pattern = "/activeAds\s?=\s?(.*)/";
$result = preg_replace($pattern,"activeAds = TEST", $data);
var_dump($result);
Edit: Forgot to mention that the \s? here allow for an optional space.
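The capture-group approach can also be written with preg_replace_callback, which sidesteps a pitfall of '\1TEST\2': if the new content starts with a digit or contains "$" or a backslash, it can be misread as part of a backreference. This is a sketch; setActiveAds() is a hypothetical name, and it assumes the array content itself contains no "]".

```php
<?php
/**
 * Hypothetical helper: replaces whatever sits between "activeAds = [" and "]".
 */
function setActiveAds(string $js, string $newContent): string
{
    return preg_replace_callback(
        '/(activeAds\s*=\s*\[).*?(\])/s',
        function ($m) use ($newContent) {
            // Rebuild the match from the two captured delimiters and the new content.
            return $m[1] . $newContent . $m[2];
        },
        $js
    );
}

echo setActiveAds('var activeAds = [];', '"TEST"'), PHP_EOL;
```

The start and end strings are preserved exactly as they were found, including any spacing around the "=".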

PHP extract data from text file and write to another file

This is my first post on the internet for some assistance with coding so please bear with me!
I have been finding open code on the internet for a few years and modding it to do what I want but I seem to have come up against a wall with this one that I am sure is very simple. If you would please be able to help me it would be very much appreciated.
I have the following page:
<?php
$text = $_REQUEST['message'];
$f = file_get_contents("all.txt");
$f = explode(", ", $f);
function modFile($pos, $tothis, $inthis)
{
foreach($inthis as $pos => $a){
}
$newarr = implode("\r\n", $inthis);
$fh = fopen("example.txt", "w");
fwrite($fh, $newarr);
fclose($fh);
}
modFile(4, '', $f);
I have a file (all.txt) with the following:
11111111111, 22222222222, 33333333333, 44444444444
That I wish to display like this:
11111111111
22222222222
33333333333
44444444444
and to add a space then some text after each number where the text is the same on each line:
11111111111 text here
22222222222 text here
33333333333 text here
44444444444 text here
I have an html form that passes the custom text to be appended to each line.
I need to keep the file all.txt intact then save the newly formatted file with a different name.
I have tried putting variables into the implode where I currently have the "\r\n" but this does not work.
Any help very much appreciated.
Thanks in advance!
A few notes about your code: You are passing $pos to the function but it will get overwritten in the foreach. Also the foreach is empty, so what's it good for? And I don't see you use $text anywhere either.
To achieve your desired output, try this instead:
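file_put_contents(
    '/path/to/new.txt',
    preg_replace(
        '/\D+/',
        ' some text' . PHP_EOL,
        file_get_contents('all.txt')
    )
);
The pattern \D+ matches each run of non-digit characters (here the ", " separators and any trailing newline) and replaces it with " some text" followed by a newline. Note that the last number only gets the text if the file ends with a non-digit character such as a newline.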
A somewhat more complicated version achieving the same would be:
file_put_contents(
'/path/to/new.txt',
implode(PHP_EOL, array_map(
function ($number) {
$message = filter_var(
$_POST['message'], FILTER_SANITIZE_SPECIAL_CHARS
);
return sprintf('%s %s', trim($number), $message);
},
array_filter(str_getcsv(file_get_contents('/path/to/all.txt')))
)
));
This will (from the inside out):
Load the content of all.txt and parse it as CSV string into an array. Each array element corresponds to a number.
Each of these numbers is appended with the message content from the POST superglobal (you don't want to use REQUEST).
The resulting array is then concatenated back into a single string where the concatenating character is a newline.
The resulting string is written to the new file.
In case the above is too hard to follow, here is a version using temp vars and no lambda:
$allTxtContent = file_get_contents('/path/to/all.txt');
$numbers = array_filter(str_getcsv($allTxtContent));
$message = filter_var($_POST['message'], FILTER_SANITIZE_SPECIAL_CHARS);
$numbersWithMessage = array();
foreach ($numbers as $number) {
$numbersWithMessage[] = sprintf('%s %s', trim($number), $message);
}
$newString = implode(PHP_EOL, $numbersWithMessage);
file_put_contents('/path/to/new.txt', $newString);
It does the same thing.
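For experimenting without files or form input, the transformation itself can be isolated in a small function. This is a sketch under assumptions: appendToEach() is a hypothetical name, and the message is passed in as a plain string rather than read from $_POST.

```php
<?php
/**
 * Hypothetical helper: turns "1, 2, 3" into one "number message" line per number.
 */
function appendToEach(string $csv, string $message): string
{
    // Split on commas with optional surrounding whitespace and drop empty pieces.
    $numbers = preg_split('/\s*,\s*/', trim($csv), -1, PREG_SPLIT_NO_EMPTY);
    $lines = array_map(function ($number) use ($message) {
        return $number . ' ' . $message;
    }, $numbers);
    return implode(PHP_EOL, $lines);
}

echo appendToEach('11111111111, 22222222222, 33333333333, 44444444444', 'text here'), PHP_EOL;
```

The result could then be written with file_put_contents() to a new file name, which leaves all.txt untouched.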
Your foreach() closing brace is in the wrong place, so the loop never does the work of building the new file. Here:
$text = $_REQUEST['message'];
$f = file_get_contents("all.txt");
$f = explode(", ", $f);
function modFile($pos, $tothis, $inthis, $text){
$fh = fopen("example.txt", "w");
foreach($inthis as $pos => $a){
$newarr = $a." ".$text."\r\n";
fwrite($fh, $newarr);
}
fclose($fh);
}
modFile(4, "", $f, $text);
This formats the new file the way you want. Your original code, however, never passed the message text ($_REQUEST['message']) that you want to append; you can either add it as a parameter to modFile() or apply it inside the foreach() loop while it runs.
EDIT* Just updated the whole code, should be now what you aimed for. If it does, please mark the answer as accepted.

editing values stored in each subarray of an array

I am using the following code which lets me navigate to a particular array line, and subarray line and change its value.
What i need to do however, is change the first column of all rows to BLANK or NULL, or clear them out.
How can i change the code below to accomplish this?
<?php
$row = $_GET['row'];
$nfv = $_GET['value'];
$col = $_GET['col'];
$data = file_get_contents("temp.php");
$csvpre = explode("###", $data);
$i = 0;
$j = 0;
if (isset($csvpre[$row]))
{
$target_row = $csvpre[$row];
$info = explode("%%", $target_row);
if (isset($info[$col]))
{
$info[$col] = $nfv;
}
$csvpre[$row] = implode("%%", $info);
}
$save = implode("###", $csvpre);
$fh = fopen("temp.php", 'w') or die("can't open file");
fwrite($fh, $save);
fclose($fh);
?>
Use foreach or array_map to perform the same action on all elements of an array.
In this case, something roughly along these lines?
foreach($rows as &$row) {
$row[0] = NULL;
}
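Applied to the data format in the question ("###" between rows, "%%" between columns), the loop could look like the sketch below. blankFirstColumn() is a hypothetical name, and the function works on a string so it can be tried without temp.php.

```php
<?php
/**
 * Hypothetical helper: empties the first column of every row in data that
 * uses "###" between rows and "%%" between columns.
 */
function blankFirstColumn(string $data): string
{
    $rows = explode('###', $data);
    foreach ($rows as &$row) {
        $cols = explode('%%', $row);
        $cols[0] = '';
        $row = implode('%%', $cols);
    }
    unset($row); // break the reference left over from the loop
    return implode('###', $rows);
}

echo blankFirstColumn('a%%b%%c###d%%e%%f'), PHP_EOL;
```

The returned string can be written back to the file with the same fopen/fwrite/fclose calls as in the question.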
I don't have a ready answer for you but I would recommend checking out CakePHP's Set class. It does things like this very well and (in some methods) supports XPath. Hopefully you can find the code you need there.
Depending on the size of that file, this could be much more efficient than looping through:
$data = file_get_contents("temp.php"); //data = blah%%blah%%blah%%blah%%###blah%%blah%%blah
$data = preg_replace( "/^(.*?)(?=%%)/", "", $data ); //Blank the first column of the first row
$data = preg_replace( "/(###)(.*?)(?=%%)/", "\\1", $data ); //Blank the first column of every other row
After that, write it back to the file as you did above.
This would need to be adjusted to allow for escape characters if your columns allow %% to appear consecutively within them, but other than that, this should work.
If you expect this csv file to get REALLY large, you should start thinking of looping through the file line by line rather than reading it completely into memory using file_get_contents. I would point you to fgetcsv, but I don't believe it is possible to get each csv line by any delimiter other than newline (unless you are willing to replace your ### separator with \r\n). If you end up going this way, the answer totally changes :P
For more information on Regex (specifically positive lookaheads) see Regex Tutorial - Lookahead and Lookbehind Zero-Width Assertions (also a great site for regex in general)
