I want to build a scraper that parses transcripts from the Leveson Inquiry, which are plain text in the following format:
1 Thursday, 2 February 2012
2 (10.00 am)
3 LORD JUSTICE LEVESON: Good morning.
4 MR BARR: Good morning, sir. We're going to start today
5 with witnesses from the mobile phone companies,
6 Mr Blendis from Everything Everywhere, Mr Hughes from
7 Vodafone and Mr Gorham from Telefonica.
8 LORD JUSTICE LEVESON: Very good.
9 MR BARR: We're going to listen to them all together, sir.
10 Can I ask that the gentlemen are sworn in, please.
11 MR JAMES BLENDIS (affirmed)
12 MR ADRIAN GORHAM (sworn)
13 MR MARK HUGHES (sworn)
14 Questions by MR BARR
15 MR BARR: Can I start, please, Mr Hughes, with you. Could
16 you tell us the position that you hold and a little bit
17 about your professional background, please?
18 MR HUGHES: Yes, sure. I'm currently head of fraud risk and
19 security for Vodafone UK. I have been in that position
20 since August 2011 and I've worked in the fraud risk and
21 security department in Vodafone since October 2006.
22 Q. Mr Gorham, if I could ask you the same question, please.
23 MR GORHAM: I'm the head of fraud and security for
24 Telefonica O2, I've been in that role for ten years and
25 have been in the industry for 13.
1
(Full example)
Ultimately I want to build an XML file structured as follows:
<hearing date="2012-02-02" time="10:00">
<quote speaker="Lord Justice Leveson" page="1" line="3">Good morning.</quote>
<quote speaker="Mr Barr" page="1" line="4">Good morning, sir. We're going to start today with witnesses from the mobile phone companies, Mr Blendis from Everything Everywhere, Mr Hughes from Vodafone and Mr Gorham from Telefonica.</quote>
<quote speaker="Lord Justice Leveson" page="1" line="8">Very good.</quote>
[... and on ...]
</hearing>
...Any help?
(Also note that "MR BARR:" changes into simply "Q." at a certain point.)
Many thanks!
This is generally a very hard problem, and is way out of scope for StackOverflow. That said, if I had to do this I'd take an iterative approach:
1. Identify regularities in the text layout and devise a candidate grammar.
2. Write a parser using the grammar; the parser would be quite strict and discard (with error messages) anything that didn't match.
3. Run it on the entire text.
4. Examine the output and mismatches, revise the grammar, identify special cases.
5. Go back to step 3.
As to the details of those steps, only you can decide if you're getting out what you want. Also, any solution is going to require manual intervention, either beforehand or afterwards, to clean up low-frequency inconsistencies.
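The loop described above can be sketched roughly as follows. This is only an illustration: the rule names, the patterns in $grammar and the helper classifyLines() are assumptions made up for this example, not a finished grammar for the transcripts.

```php
<?php
// Hypothetical candidate grammar: one named pattern per line type.
// Both the names and the patterns are illustrative assumptions.
$grammar = array(
    'time'     => '/^\s*\d+\s+\((\d{1,2}\.\d{2} [ap]m)\)\s*$/',
    'speaker'  => '/^\s*\d+\s+([A-Z][A-Z ]+):\s+(.+)$/',
    'question' => '/^\s*\d+\s+([QA])\.\s+(.+)$/',
);

/**
 * Strictly classify each line; anything that matches no rule is
 * reported instead of being guessed at.
 */
function classifyLines(array $lines, array $grammar)
{
    $matched = array();
    $mismatches = array();
    foreach ($lines as $nr => $line) {
        foreach ($grammar as $type => $pattern) {
            if (preg_match($pattern, $line)) {
                $matched[] = array($type, $line);
                continue 2; // first matching rule wins
            }
        }
        $mismatches[$nr + 1] = $line; // 1-based position for error messages
    }
    return array($matched, $mismatches);
}

$sample = array(
    '2 (10.00 am)',
    '3 LORD JUSTICE LEVESON: Good morning.',
    '5 with witnesses from the mobile phone companies,',
    '22 Q. Mr Gorham, if I could ask you the same question, please.',
);
list($matched, $mismatches) = classifyLines($sample, $grammar);
foreach ($mismatches as $nr => $line) {
    echo "No rule for line {$nr}: {$line}\n";
}
```

Here the continuation line is reported as a mismatch, which is the cue to revise the grammar (for example, by adding a rule for follow-up text) and run it again.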
Let me start by saying this is not a foolproof script; there might well be something I forgot or overlooked, but it is a proof of concept for you to improve and expand upon, or just to get an idea from.
There are enough regularities in the text layout for us to work with. The script splits the transcript into an array of lines and matches those lines against a few patterns, in an attempt to identify the regularities and determine the role of each line.
Example Script:
<?php
/*
Proof of Concept : Transcript to XML by Robjong
? :
- action on date change (what to do when the date changes?)
- what to do with lines like "MR MARK HUGHES (sworn)" (make it a note?!)
- what to do with lines like "Questions by MR BARR" (make it a note?!)
- detect events/notes in quotes better? (e.g: MR BLENDIS: (Nods head).)
Notes :
- desperately needs error checking/handling!!!! (for now it just got in the way)
- it might well be that the configuration of PHP will cause file_get_contents to fail,
try curl or download it manually and read the local file
- if you are running PHP < 5.2.4, change the \h in the pattern to \s or [\t ]
*/
# basic usage
// get the transcript as plain text
$txt = file_get_contents( 'http://www.levesoninquiry.org.uk/wp-content/uploads/2012/02/Transcript-of-Morning-Hearing-2-February-2012.txt' );
// convert transcript to XML
$xml = transcriptToXML_beta( $txt );
// we have the transcript as XML, now what?
file_put_contents( 'transcript.xml', $xml ); // let's write it to a file
echo $xml;
function transcriptToXML_beta( $string ) { // beta is just to emphasize lack of thorough testing
$lines = explode( "\n", $string ); // split text into an array of lines
if( !is_array( $lines ) ) { // the provided string was not multiline
return false;
}
// these vars will hold the data we need to build our XML
$date = ''; // the date of the transcript
$time = ''; // transcript time
$page = 1; // this will hold the current page number
$linenr = ''; // this will hold the line nr
$speaker = ''; // the name of the speaker
$text = ''; // transcribed text attributed to the speaker
$new = false; // will be true if a new item has been matched
$event = ''; // this will hold notes that are in a quote but need to be stored separately (events)
$xml = ''; // this will be the XML string
$i = 0; // count the lines to display actual line number for debugging
foreach( $lines as $line ) { // loop over lines
$i++;
if( !preg_match( "/[[:graph:]]/", $line ) ) { // line is empty, does not contain printable characters....
continue; // ....so we skip to the next one
}
if( preg_match( "/^\h*\d+\h+(?P<date>[a-z]+,\h+\d+\h+[a-z]+\h\d{4})\s*$/i", $line, $match ) ) { # it looks like a date
$date = $match['date']; // set date
$speaker = ''; // reset vars
$text = '';
continue;// no need to handle this line any further
} elseif( preg_match( "/^\h*\d+\h+([A-Z]+(?:\s+[A-Z]+){0,4}\h+\(.*?\)|(?i:questions\h+by)[A-Z\h]+)\s*$/", $line, $match ) ) { # (queued) event, uppercase text followed by text between parentheses
$event .= " <event page=\"{$page}\" line=\"{$linenr}\">{$match[1]}</event>\n"; // add entry to queue, to be added after current quote
continue;// no need to handle this line any further
} elseif( preg_match( "/^\h*(\d*)\h*\(\h*(?P<time>\d{1,2}[:.]\d{1,2}\h*[ap]m)\)\s*$/i", $line, $match ) ) { # seems we have a time entry
$time = $match['time']; // set time
$xml .= " <time page=\"{$page}\" line=\"{$match[1]}\">" . strtoupper( str_replace( '.', ':', $match['time'] ) ) . "</time>\n"; // add time as entry
$speaker = ''; // reset vars
$text = '';
continue;// no need to handle this line any further
} elseif( preg_match( "/^\h*(\d+)\s*$/", $line, $match ) ) { # line has just one or more digits, we assume its a pagenr
if( $match[1] <= $page ) { // if the number is not higher than the current page number, ignore it; this avoids triggering errors for 'empty lines' that only have a line number
continue;
}
$page = (int) $match[1] + 1; // set pagenr, add one because the nr is at the bottom of the page
continue;// no need to handle this line any further
} elseif( preg_match( "/^\h*\d+\s+\(([[:print:]]+)\)\s*$/", $line, $match ) && !$speaker ) { # note, text is between parentheses
$xml .= " <event page=\"{$page}\" line=\"{$linenr}\">{$match[1]}</event>\n"; // add entry as note
continue;// no need to handle this line any further
} elseif( preg_match( "/^\h*\d+\h+([A-Z\h]+\(.*?\))\s*$/", $line, $match ) && !$speaker ) { # note, uppercase text followed by text between parentheses, only if not in quote (the capture group is needed for $match[1] below)
$xml .= " <event type=\"note\" speaker=\"\" page=\"{$page}\" line=\"{$linenr}\">{$match[1]}</event>\n"; // add entry as note
continue;// no need to handle this line any further
} elseif( preg_match("/^\h*(?P<linenr>\d+)\h+(?P<speaker>(?:\h[A-Z]+(?:\h[A-Z]+){0,4}))[:.]\h*(?P<text>[[:print:]]+?)\s*$/", $line, $match ) ) { # new speaker entry
if( $new && $speaker ) { // if we have one open we need to add it first
$xml .= " <entry type=\"quote\" speaker=\"{$speaker}\" page=\"{$page}\" line=\"{$linenr}\">$text</entry>\n"; // add quote
$new = false; // reset
if( $event ) { // if we have a queued note we need to add that too
$xml .= $event; // add entry to XML string
$event = ''; // clear the queue
}
}
$speaker = trim( $match['speaker'] ); // assign match to speaker var
$linenr = (int) $match['linenr']; // assign line number
$text = trim( $match['text'] ); // assign text
$new = true; // set new match bool
} elseif( preg_match( "/^\h*(?P<linenr>\d+)\h+(?P<text>[[:print:]]+?)\s*$/", $line, $match ) ) { # follow up text
$text .= ' ' . trim( $match['text'] ); // append text
} else { # unknown line (add check for linenr only lines or remove $match[1] <= $page from the pagenr match conditional)
// not sure what kind of line this is... add it as a separate 'type'?!
trigger_error( "Unable to parse line {$i} \"{$line}\"" ); // throw exception / trigger error
continue; // no need to handle this line any further
}
if( !$new && $speaker ) {
$xml .= " <entry type=\"quote\" speaker=\"{$speaker}\" page=\"{$page}\" line=\"{$linenr}\">$text</entry>\n";
$speaker = ''; // reset vars
$text = '';
$new = false;
if( $event ) { // if we have a queued note we need to add that too
$xml .= $event; // add entry to XML string
$event = ''; // clear the queue
}
}
}
// if we have a match open we need to handle it, this might happen because we do not assign the match in the same iteration as we matched it
if( $new ) {
$xml .= " <entry type=\"quote\" speaker=\"{$speaker}\" page=\"{$page}\" line=\"{$linenr}\">$text</entry>\n";
}
if( !trim( $xml ) ) { // no text found so $xml is still an empty string
return false;
}
$date = new DateTime( $date ); // instantiate datetime with the time from the transcript
$date = date( 'Y-m-d', $date->getTimestamp() ); // format date
// now we need to wrap the nodes in a root node
$xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<hearing date=\"{$date}\">\n{$xml}</hearing>\n";
return $xml; // return the XML
}
?>
I will update the comments and script later today
Output Sample:
<hearing date="2012-02-02">
<time page="1" line="2">10:00 AM</time>
<entry type="quote" speaker="LORD JUSTICE LEVESON" page="1" line="3">Good morning.</entry>
<entry type="quote" speaker="MR BARR" page="1" line="4">Good morning, sir. We're going to start today with witnesses from the mobile phone companies, Mr Blendis from Everything Everywhere, Mr Hughes from Vodafone and Mr Gorham from Telefonica.</entry>
<entry type="quote" speaker="LORD JUSTICE LEVESON" page="1" line="8">Very good.</entry>
<entry type="quote" speaker="MR BARR" page="1" line="9">We're going to listen to them all together, sir. Can I ask that the gentlemen are sworn in, please.</entry>
<event page="1" line="9">MR JAMES BLENDIS (affirmed)</event>
<event page="1" line="9">MR ADRIAN GORHAM (sworn)</event>
<event page="1" line="9">MR MARK HUGHES (sworn)</event>
<event page="1" line="9">Questions by MR BARR</event>
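One thing the script does not do is escape the transcript text before putting it into the XML, so a stray & or < in a quote would produce a malformed document. A minimal sketch of how an entry could be built with escaping is shown here; buildEntry() is a hypothetical helper that is not part of the script above, and the sample text is made up.

```php
<?php
/**
 * Build one <entry> element with attribute values and text escaped for XML.
 * (buildEntry is a hypothetical helper name, not part of the script above.)
 */
function buildEntry($speaker, $page, $line, $text)
{
    $flags = ENT_QUOTES | ENT_XML1; // ENT_XML1 needs PHP >= 5.4
    return sprintf(
        " <entry type=\"quote\" speaker=\"%s\" page=\"%d\" line=\"%d\">%s</entry>\n",
        htmlspecialchars($speaker, $flags, 'UTF-8'),
        $page,
        $line,
        htmlspecialchars($text, $flags, 'UTF-8')
    );
}

// Made-up text containing characters that are special in XML.
$entry = buildEntry('MR BARR', 1, 4, 'A & B <aside>');
echo $entry;
```

The same call could replace the three places in the script where an entry string is assembled by hand.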
b.t.w. just out of curiosity, what is it you need this for?
Related
I want to get specific content of a website into an array.
I have approximately 20 pages to fetch content from and output in other ways I like. Only the port changes between them (it is not always 27015; it may be 27016 and so on).
This is just one: SOURCE-URL of Content
For now, I use this PHP code to fetch the game icon "cs.png", but the icon file name varies in length, so a fixed length isn't the best way, is it? :-/
$srvip = '148.251.78.214';
$srvlist = array('27015');
foreach ($srvlist as $srvport) {
$source = file_get_contents('http://www.gametracker.com/server_info/'.$srvip.':'.$srvport.'/');
$content = array(
"icon" => substr($source, strpos($source, 'game_icons64')+13, 6),
);
echo $content[icon];
}
Thanks for helping; it has been a while since I last worked with PHP :P
You just need to look for the first " that comes after the game_icons64 and read up to there.
$srvip = '148.251.78.214';
$srvlist = array('27015');
foreach ($srvlist as $srvport) {
$source = file_get_contents('http://www.gametracker.com/server_info/'.$srvip.':'.$srvport.'/');
// find the position right after game_icons64/
$first_occurance = strpos($source, 'game_icons64')+13;
// find the first occurrence of " after game_icons64, where the src ends for the img
$second_occurance = strpos($source, '"', $first_occurance);
$content = array(
// take a substring starting at the end of game_icons64/ and ending just before the src attribute ends
"icon" => substr($source, $first_occurance, $second_occurance-$first_occurance),
);
echo $content['icon'];
}
Also, you had an error because you used [icon] and not ['icon']
Edit to match the second request involving multiple strings
$srvip = '148.251.78.214';
$srvlist = array('27015');
$content_strings = array( );
// the first 2 items are the string you are looking for in your first occurrence and how many chars to skip from that position
// the third is what should be the first char after the string you are looking for, so the first char that will not be copied
// the last item is how you want your array / program to register the string you are reading
$content_strings[] = array('game_icons64', 13, '"', 'icon');
// to add more items to your search, just copy paste the line above and change whatever you need from it
foreach ($srvlist as $srvport) {
$source = file_get_contents('http://www.gametracker.com/server_info/'.$srvip.':'.$srvport.'/');
$content = array();
foreach($content_strings as $k=>$v)
{
$first_occurance = strpos($source, $v[0])+$v[1];
$second_occurance = strpos($source, $v[2], $first_occurance);
$content[$v[3]] = substr($source, $first_occurance, $second_occurance-$first_occurance);
}
print_r($content);
}
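Note that strpos() returns false when the marker is missing (for instance if the download failed or the page layout changed), and the snippets above would then silently read from the wrong position. A small sketch of a guarded version follows; extractBetween() is a hypothetical helper name and the markup is a made-up stand-in for the downloaded page.

```php
<?php
/**
 * Return the text between $startMarker and the next $endMarker,
 * or null when either marker is missing.
 * (extractBetween is a hypothetical helper name.)
 */
function extractBetween($source, $startMarker, $endMarker)
{
    $start = strpos($source, $startMarker);
    if ($start === false) {
        return null; // start marker not found
    }
    $start += strlen($startMarker);
    $end = strpos($source, $endMarker, $start);
    if ($end === false) {
        return null; // end marker not found
    }
    return substr($source, $start, $end - $start);
}

// Made-up markup standing in for the downloaded page.
$html = '<img src="/images/game_icons64/cs.png" alt="icon">';
echo extractBetween($html, 'game_icons64/', '"'), "\n"; // cs.png
```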
I am creating a Bible search. The trouble with Bible searches is that people often enter different kinds of searches, and I need to split them up accordingly. So I figured the best way to start out would be to remove all spaces, and work through the string from there. Different types of searches could be:
Genesis 1:1 - Genesis Chapter 1, Verse 1
1 Kings 2:5 - 1 Kings Chapter 2, Verse 5
Job 3 - Job Chapter 3
Romans 8:1-7 - Romans Chapter 8 Verses 1 to 7
1 John 5:6-11 - 1 John Chapter 5 Verses 6 - 11.
I am not too fazed by the different types of searches, but if anyone can find a simpler way to do this, or knows of a great way to do it, then please tell me how!
Thanks
The easiest thing to do here is to write a regular expression to capture the text, then parse out the captures to see what you got. To start, let's assume you have your test bench:
$tests = array(
'Genesis 1:1' => 'Genesis Chapter 1, Verse 1',
'1 Kings 2:5' => '1 Kings Chapter 2, Verse 5',
'Job 3' => 'Job Chapter 3',
'Romans 8:1-7' => 'Romans Chapter 8, Verses 1 to 7',
'1 John 5:6-11' => '1 John Chapter 5, Verses 6 to 11'
);
So, you have, from left to right:
A book name, optionally prefixed with a number
A chapter number
A verse number, optional, optionally followed by a range.
So, we can write a regex to match all of those cases:
$regex = '/((?:\d+\s)?\w+)\s+(\d+)(?::(\d+(?:-\d+)?))?/';
And now see what we get back from the regex:
foreach( $tests as $test => $answer) {
// Match the regex against the test case
preg_match( $regex, $test, $match);
// Ignore the first entry, the 2nd and 3rd entries hold the book and chapter
list( , $book, $chapter) = array_map( 'trim', $match);
$output = "$book Chapter $chapter";
// If the fourth match exists, we have a verse entry
if( isset( $match[3])) {
// If there is no dash, it's a single verse
if( strpos( $match[3], '-') === false) {
$output .= ", Verse " . $match[3];
} else {
// Otherwise it's a range of verses
list( $start, $end) = explode( '-', $match[3]);
$output .= ", Verses $start to $end";
}
}
// Here $output matches the value in $answer from our test cases
echo $answer . "\n" . $output . "\n\n";
}
You can see it working in this demo.
I think I understand what you are asking here. You want to devise an algorithm that extracts information (e.g. book name, chapter, verse or verses).
This looks to me like a job for pattern matching (e.g. regular expressions), because you could then define patterns, extract data for all scenarios that make sense and work from there.
There are actually quite a few variants that could exist, so perhaps you should also take a look at natural language processing. Fuzzy string matching on names could provide better results (e.g. when people misspell book names).
Best of luck
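As a rough illustration of the fuzzy matching idea, PHP's built-in levenshtein() can be used to find the closest known book name. The short $books list, the helper name closestBook() and the threshold of 2 edits are assumptions for this sketch only.

```php
<?php
// A few book names only; a real list would contain all of them.
$books = array('Genesis', 'Exodus', 'Job', 'Romans', 'Revelation');

/**
 * Return the known book name closest to $input by edit distance,
 * or null if nothing is within $maxDistance edits.
 */
function closestBook($input, array $books, $maxDistance = 2)
{
    $best = null;
    $bestDistance = $maxDistance + 1;
    foreach ($books as $book) {
        $distance = levenshtein(strtolower($input), strtolower($book));
        if ($distance < $bestDistance) {
            $best = $book;
            $bestDistance = $distance;
        }
    }
    return $best;
}

echo closestBook('Genisis', $books), "\n"; // Genesis
var_dump(closestBook('xyz', $books));      // NULL
```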
Try out something based on preg_match_all, like:
$ php -a
Interactive shell
php > $s = '1 kings 2:4 and 1 sam 4-5';
php > preg_match_all("/(\\d*|[^\\d ]*| *)/", $s, $parts);
php > print_r($parts);
Okay. Well, I am not too sure about regular expressions and I haven't studied them yet, so I am stuck with the more procedural approach. I have made the following (which is still a huge improvement on the code I wrote 5 years ago, which was what I was aiming to achieve). It seems to work flawlessly:
You need this function first of all:
function varType($str) {
if(is_numeric($str)) {return false;}
if(is_string($str)) {return true;}
}
$bible = array("BookNumber" => "", "Book" => "", "Chapter" => "", "StartVerse" => "", "EndVerse" => "");
$pos = 1; // 1 - Book Number
// 2 - Book
// 3 - Chapter
// 4 - ':' or 'v'
// 5 - StartVerse
// 6 - is a dash for spanning verses '-'
// 7 - EndVerse
$scan = ""; $compile = array();
//Divide into character type groups. ($collapse is assumed to hold the search string with all spaces removed.)
for($x=0;$x<=(strlen($collapse)-1);$x++)
{ if($x>=1) {if(varType($collapse[$x]) != varType($collapse[$x-1])) {array_push($compile,$scan);$scan = "";}}
$scan .= $collapse[$x];
if($x==strlen($collapse)-1) {array_push($compile,$scan);}
}
//If the first element is not a number, then it is not a numbered book (AKA 1 John, 2 Kings), So move the position forward.
if(varType($compile[0])) {$pos=2;}
foreach($compile as $val)
{ if(!varType($val))
{ switch($pos)
{ case 1: $bible['BookNumber'] = $val; break;
case 3: $bible['Chapter'] = $val; break;
case 5: $bible['StartVerse'] = $val; break;
case 7: $bible['EndVerse'] = $val; break;
}
} else {switch($pos)
{ case 2: $bible['Book'] = $val; break;
case 4: //Colon or 'v'
case 6: break; //Dash for verse spanning.
}}
$pos++;
}
This will give you an array called $bible at the end that holds all the data needed to run a query on an SQL database or whatever else you might want it for. Hope this helps others.
I know this is crazy talk, but why not just have a form with 4 fields so they can specify:
Book
Chapter
Starting Verse
Ending Verse [optional]
I have to create an automatic weather system including rain, snow, clouds, fog and sun.
Depending on the season, I need to set a percentage for each type of weather; the forecast will be updated 3 or 4 times a day.
Example:
Winter | Rain: 30% Snow: 30% Sunny: 10% Cloudy: 10%, Fog: 20%
I do not know how to implement a random condition based on percentages. Some help?
Many thanks and sorry for my bad English.
Well, you can use:
$location = 'Rome';
$document = file_get_contents(str_replace(" ", "+", "http://api.wunderground.com/auto/wui/geo/WXCurrentObXML/index.xml?query=".$location));
$xml = new SimpleXMLElement($document);
echo "$location: ".$xml->temp_c."° C";
Just take a look on the XML and see what data you have available.
EDIT
I didn't understand what the OP wanted the first time. Basically, it's even easier.
$weather = mt_rand(0, 99); // 100 equally likely values, 0 to 99
$season = 'winter';
$output = ''; // stays empty if no branch below matches
switch ($season) {
    case 'winter':
        if ($weather < 30) {
            $output = 'Rainy';
        } else if ($weather >= 30 && $weather < 60) {
            $output = 'Snowy';
        }
        // goes on with the same idea of testing the value of $weather
        break;
    // other seasons
}
echo $output;
What I suggest, though, is to keep your values in arrays (for example the seasons), as well as the chances of having one type of weather or another.
array(
    'winter' => array(
        30 => 'Rainy',
        60 => 'Snowy',
        // ... the other chances of weather
    ),
    'spring' => array(
        // ...
    ),
    // ... and so on
)
Use mt_rand(0, 99) to get a random value and the array above to determine the weather.
Please let me know if this works for you.
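For reference, here is one way the array idea could look as a small function. It is a sketch under assumptions: the weights are the winter percentages from the question, and pickWeighted() with its optional $roll parameter is a hypothetical helper (the parameter only exists so the result can be reproduced).

```php
<?php
// Percentages per season, taken from the example in the question.
$chances = array(
    'winter' => array('Rain' => 30, 'Snow' => 30, 'Sunny' => 10, 'Cloudy' => 10, 'Fog' => 20),
);

/**
 * Pick one key from $weights with probability proportional to its weight.
 * $roll can be passed in to make the result reproducible.
 */
function pickWeighted(array $weights, $roll = null)
{
    $total = array_sum($weights);
    if ($total < 1) {
        return null; // nothing to pick from
    }
    if ($roll === null) {
        $roll = mt_rand(1, $total); // 1..$total inclusive
    }
    foreach ($weights as $outcome => $weight) {
        if ($roll <= $weight) {
            return $outcome;
        }
        $roll -= $weight; // move on to the next slice
    }
    return null; // only reached if $roll was out of range
}

echo pickWeighted($chances['winter']), "\n";
```

Because the function sums the weights itself, the numbers do not have to add up to exactly 100.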
Great answer by Claudiu, but if you want to show the temperature in Fahrenheit (F), that is possible too. Example below:
<?php
$location = 'Washington';
$document = file_get_contents(str_replace(" ", "+", "http://api.wunderground.com/auto/wui/geo/WXCurrentObXML/index.xml?query=" . $location));
$xml = new SimpleXMLElement($document);
echo $xml->temp_f . "° F";
?>
I have a string which contains the text of an article. This is sprinkled with BBCodes (between square brackets). I need to be able to grab the first say, 200 characters of an article without cutting it off in the middle of a bbcode. So I need an index where it is safe to cut it off. This will give me the article summary.
The summary must be minimum 200 characters but can be longer to 'escape' out of a bbcode. (this length value will actually be a parameter to a function).
It must not give me a point inside a stand alone bbcode (see the pipe) like so: [lis|t].
It must not give me a point between a start and end bbcode like so: [url="http://www.google.com"]Go To Goo|gle[/url].
It must not give me a point inside either the start or end bbcode or in-between them, in the above example.
It should give me the "safe" index which is after 200 and is not cutting off any BBCodes.
Hope this makes sense. I have been struggling with this for a while. My regex skills are only moderate. Thanks for any help!
First off, I would suggest considering what you will do with a post that is entirely wrapped in BBcodes, as is often true in the case of a font tag. In other words, a solution to the problem as stated will easily lead to 'summaries' containing the entire article. It may be more valuable to identify which tags are still open and append the necessary BBcodes to close them. Of course in cases of a link, it will require additional work to ensure you don't break it.
Well, the obvious easy answer is to present your "summary" without any bbcode-driven markup at all (regex below taken from here)
$summary = substr( preg_replace( '|[[\/\!]*?[^\[\]]*?]|si', '', $article ), 0, 200 );
However, to do the job you explicitly describe is going to require more than just a regex. A lexer/parser would do the trick, but that's a moderately complicated topic. I'll see if I can come up with something.
EDIT
Here's a pretty ghetto version of a lexer, but for this example it works. This converts an input string into bbcode tokens.
<?php
class SimpleBBCodeLexer
{
protected
$tokens = array()
, $patterns = array(
self::TOKEN_OPEN_TAG => "/\\[[a-z].*?\\]/"
, self::TOKEN_CLOSE_TAG => "/\\[\\/[a-z].*?\\]/"
);
const TOKEN_TEXT = 'TEXT';
const TOKEN_OPEN_TAG = 'OPEN_TAG';
const TOKEN_CLOSE_TAG = 'CLOSE_TAG';
public function __construct( $input )
{
for ( $i = 0, $l = strlen( $input ); $i < $l; $i++ )
{
$this->processChar( $input[$i] ); // square brackets: the {} string offset syntax was removed in PHP 8
}
$this->processChar();
}
protected function processChar( $char=null )
{
static $tokenFragment = '';
$tokenFragment = $this->processTokenFragment( $tokenFragment );
if ( is_null( $char ) )
{
$this->addToken( $tokenFragment );
} else {
$tokenFragment .= $char;
}
}
protected function processTokenFragment( $tokenFragment )
{
foreach ( $this->patterns as $type => $pattern )
{
if ( preg_match( $pattern, $tokenFragment, $matches ) )
{
if ( $matches[0] != $tokenFragment )
{
$this->addToken( substr( $tokenFragment, 0, -( strlen( $matches[0] ) ) ) );
}
$this->addToken( $matches[0], $type );
return '';
}
}
return $tokenFragment;
}
protected function addToken( $token, $type=self::TOKEN_TEXT )
{
$this->tokens[] = array( $type => $token );
}
public function getTokens()
{
return $this->tokens;
}
}
$l = new SimpleBBCodeLexer( 'some [b]sample[/b] bbcode that [i] should [url="http://www.google.com"]support[/url] what [/i] you need.' );
echo '<pre>';
print_r( $l->getTokens() );
echo '</pre>';
The next step would be to create a parser that loops over these tokens and takes action as it encounters each type. Maybe I'll have time to make it later...
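A possible shape for that parser step is sketched below. It assumes the token format produced by getTokens() above (each token is array( TYPE => text )), well-formed and properly nested BBCode, and that every opening tag has a closing tag, so stand-alone tags are not handled. The function name summarize() is made up for this example.

```php
<?php
/**
 * Build a summary from lexer tokens: copy tokens until at least
 * $minLength characters of plain text have been seen, then keep going
 * until every opened tag has been closed again.
 */
function summarize(array $tokens, $minLength = 200)
{
    $summary = '';
    $textLength = 0; // visible characters so far, tags excluded
    $depth = 0;      // number of currently open tags
    foreach ($tokens as $token) {
        $type = key($token);
        $text = current($token);
        if ($type === 'TEXT') {
            if ($depth === 0 && $textLength + strlen($text) >= $minLength) {
                // no tag is open, so it is safe to cut inside this text run
                return $summary . substr($text, 0, $minLength - $textLength);
            }
            $textLength += strlen($text);
        } elseif ($type === 'OPEN_TAG') {
            $depth++;
        } elseif ($type === 'CLOSE_TAG') {
            $depth--;
        }
        $summary .= $text;
        if ($depth === 0 && $textLength >= $minLength) {
            return $summary; // minimum reached and all tags are closed
        }
    }
    return $summary; // input shorter than the requested minimum
}

// Tokens written out by hand in the lexer's output format.
$tokens = array(
    array('TEXT' => 'some '),
    array('OPEN_TAG' => '[b]'),
    array('TEXT' => 'sample'),
    array('CLOSE_TAG' => '[/b]'),
    array('TEXT' => ' bbcode text'),
);
echo summarize($tokens, 8), "\n"; // some [b]sample[/b]
```

With a minimum of 8 the cut would fall inside the [b] tag, so the summary runs on until [/b] has been emitted.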
This does not sound like a job for (only) regex.
"Plain programming" logic is a better option:
grab a character other than a '[', increase a counter;
if you encounter an opening tag, keep advancing until you reach the closing tag (don't increase the counter!);
stop grabbing text when your counter has reached 200.
Here is a start. I don't have access to PHP at the moment, so you might need some tweaking to get it to run. Also, this will not ensure that tags are closed (i.e. the string could have [url] without [/url]). Also, if a string is invalid (i.e. not all square brackets are matched) it might not return what you want.
function getIndex($str, $minLen = 200)
{
//on short input, return the whole string
if(strlen($str) <= $minLen)
return strlen($str);
//get first minLen characters
$substr = substr($str, 0, $minLen);
//does it have a '[' that is not closed?
if(preg_match('/\[[^\]]*$/', $substr))
{
//find the next ']', if there is one
$pos = strpos($str, ']', $minLen);
//now, make the substr go all the way to that ']'
if($pos !== false)
$substr = substr($str, 0, $pos+1);
}
//now, it may be better to return $substr, but you specifically
//asked for the index, which is the length of this substring.
return strlen($substr);
}
I wrote this function, which should do just what you want. It counts n characters (except those in tags) and then closes any tags that still need to be closed. Example use is included in the code. The code is in Python, but should be really easy to port to other languages, such as PHP.
def limit(input, length):
    """Splits a text after (length) characters, preserving bbcode"""
    stack = []
    counter = 0
    output = ""
    tag = ""
    insideTag = 0  # 0 = Outside tag, 1 = Opening tag, 2 = Closing tag, 3 = Opening tag, parameters section
    for i in input:
        if counter >= length:  # If we have reached the max length (add " and i == ' '") to not make it split in a word
            break
        elif i == '[':  # If we have reached a tag
            insideTag = 1
        elif i == '/':  # If we reach a slash...
            if insideTag == 1:  # And we are in an opening tag
                insideTag = 2
        elif i == '=':  # If we have reached the parameters
            if insideTag >= 1:  # If we actually are in a tag
                insideTag = 3
        elif i == ']':  # If we have reached the closing of a tag
            if insideTag == 2:  # If we are in a closing tag
                stack.pop()  # Pop the last tag, we closed it
            elif insideTag >= 1:  # If we are in a tag, parameters or not
                stack.append(tag)  # Add current tag to the tag-stack
            if insideTag >= 0:  # If we are in some type of tag
                insideTag = 0
                tag = ""
        elif insideTag == 0:  # If we are not in a tag
            counter += 1
        elif insideTag <= 2:  # If we are in a tag and not among the parameters
            tag += i
        output += i
    while len(stack) > 0:
        output += '[/' + stack.pop() + ']'  # Add the remaining tags
    return output

cutText = limit('[font]This should be easy:[img]yippee.png[/img][i][u][url="http://www.stackoverflow.com"]Check out this site[/url][/u]Should be cut here somewhere [/i][/font]', 60)
print(cutText)
I've never really done much with parsing text in PHP (or any language). I've got this text:
1 (2) ,Yes,5823,"Some Name
801-555-5555",EXEC,,"Mar 16, 2009",0.00,
1 (3) ,,4821,Somebody Else,MBR,,"Mar 11, 2009",,0.00
2 (1) ,,5634,Another Guy,ASSOC,,"Mar 15, 2009",,0.00
You can see the first line has a break. I need it to be:
1 (2) ,Yes,5823,"Some Name 801-555-5555",EXEC,,"Mar 16, 2009",0.00,
1 (3) ,,4821,Somebody Else,MBR,,"Mar 11, 2009",,0.00
2 (1) ,,5634,Another Guy,ASSOC,,"Mar 15, 2009",,0.00
I was thinking of using a regular expression to find \n within quotes (or after a quote, since that wouldn't create false matches) and then replacing it with nothing using PHP's preg_replace(). I'm currently researching regex since I don't know any of it, so I may figure this out on my own (that's always best), but no doubt a solution to a current problem of mine will help me get a handle on it even sooner.
Thanks so much. If I could, I'd drop a bounty on this immediately.
Thanks!
If the text has that fixed format, maybe you won't need regex at all: just scan each line for double quotes and, if a quote is left open, keep joining lines until you find the closing one.
Problems may arise if there can be escaped quotes, single quotes delimiting the strings, etc., but as long as there is nothing of that kind, you should be fine.
I don't know PHP, so here is some pseudocode:
open = False
for line in lines do
    nquotes = line.count("\"")
    if not open then
        if nquotes == 1 then
            open = True
            write(line)
        else #we assume nquotes == 2
            writeln(line)
        end
    else
        if nquotes == 0 then
            write(line)
        else #we assume nquotes == 1
            open = False
            writeln(line)
        end
    end
end
Here's essentially fortran's answer in PHP
<pre>
<?php
$data = <<<DATA
1 (2) ,Yes,5823,"Some Name
801-555-5555",EXEC,,"Mar 16, 2009",0.00,
1 (3) ,,4821,Somebody Else,MBR,,"Mar 11, 2009",,0.00
2 (1) ,,5634,Another Guy,ASSOC,,"Mar 15, 2009",,0.00
DATA;
echo $data, '<hr>';
$lines = preg_split( "/\r\n?|\n/", $data );
$filtered = "";
$open = false;
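foreach ( $lines as $line )
{
    // an odd number of quotes toggles the "inside a quoted field" state
    if ( substr_count( $line, '"' ) & 1 )
    {
        $open = !$open;
    }
    // while a quoted field is still open, join with a space instead of a newline
    $filtered .= $line . ( $open ? ' ' : "\n" );
}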
echo $filtered;
?>
</pre>
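Since the data looks like CSV, another option is to let PHP's CSV reader deal with the quoting: fgetcsv() already understands quoted fields that span several lines. The sketch below assumes comma-separated fields enclosed in double quotes, as in the sample, and uses an in-memory stream in place of a real file.

```php
<?php
// Two records from the sample; the first has a line break inside a quoted field.
$data = "1 (2) ,Yes,5823,\"Some Name\n801-555-5555\",EXEC,,\"Mar 16, 2009\",0.00,\n"
      . "1 (3) ,,4821,Somebody Else,MBR,,\"Mar 11, 2009\",,0.00\n";

// Put the text in an in-memory stream so fgetcsv() can read it record by record.
$handle = fopen('php://memory', 'r+');
fwrite($handle, $data);
rewind($handle);

$rows = array();
while (($fields = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
    if ($fields === array(null)) {
        continue; // blank line
    }
    // collapse line breaks inside a field into a single space
    foreach ($fields as $i => $field) {
        $fields[$i] = preg_replace('/\s*\R\s*/', ' ', $field);
    }
    $rows[] = $fields;
}
fclose($handle);

echo $rows[0][3], "\n"; // Some Name 801-555-5555
```

This yields the fields already split, which may save a second parsing step later.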