Creating a CSV file from an HTML page

I have extracted records from a database and stored them on an HTML page with only text. Each record is stored in a <p> paragraph field and separated by a line break <br /> and a line <hr>.
For example:
Company Name<br/>
555-555-555<br />
Address Line 1<br />
Address Line 2<br />
Website: www.example.com<br />
I just need to place these records into a CSV file. I used fputcsv in combination with array() and file_get_contents(), but it read the entire source code of the web page into the .csv file, and a lot of data was missing as well. There are multiple records stored in the same format: each record block, as seen above, is followed by an <hr> tag. I want to read the company name into the Name column, the phone number into the Phone column, the addresses into the Address column and the website into the Website column, as shown below.
http://i.stack.imgur.com/00Gxw.png
How can I do this?
Snippet of the HTML:
1 Stop Signs<br />
480-961-7446<br />
500 N. 56th Street<br />
Chandler, AZ 85226<br />
<br />
Website: www.1stopsigns.com<br />
<br />
</p><br /><hr><br />
It's spaced like this in the source of the HTML.

Assuming that your data follows a pattern where every record is separated by a <hr> tag and every field within is separated by a <br /> then you should be able to split out the data.
There are loads of ways to do this, but a naive way that might work using explode() might be something like:
// open a file pointer to the CSV
$fp = fopen('records.csv', 'w');
// first, split each record into a separate array element
$records = explode('<hr>', $str);
// then iterate over this array
foreach ($records as $record) {
    // strip tags and trim enclosing whitespace
    $stripped = trim(strip_tags($record));
    // split on any line ending, trim each field and drop the blank lines
    $fields = array_values(array_filter(array_map('trim', preg_split('/\R/', $stripped)), 'strlen'));
    // skip blocks that are not complete records (e.g. whatever follows the last <hr>)
    if (count($fields) < 5) {
        continue;
    }
    // create row
    $row = array(
        $fields[0], // name
        $fields[1], // phone
        sprintf('%s, %s', $fields[2], $fields[3]), // address
        $fields[4], // web
    );
    // write cleaned array of fields to csv
    fputcsv($fp, $row);
}
// done
fclose($fp);
Where $str is the page data you are parsing. Hope this helps.
EDIT
Didn't notice the specific field requirements originally. Updated the example.

Assuming the HTML shown above is well formed, I would approach this problem in two phases.
First, clean up the HTML so that the information is easier to export or manage. Keep the items you want to save and delete those you know you will not need.
$html = preg_replace("|\s{2,}|si"," ",$html); // collapse runs of unnecessary whitespace
$html = preg_replace("|\n{2,}|si","\n",$html); // collapse consecutive line breaks into one
$html = preg_replace("|<br />|si","##",$html); // replace each <br /> tag with a ## marker
Then you'll have cleaner HTML to work with, similar to this:
1 Stop Signs##
480-961-7446##
500 N. 56th Street##
Chandler, AZ 85226##
Website: www.1stopsigns.com##
##
</p>##<hr>##
Second, you can now explode the text into fields, or implode the fields into a comma-separated string to form a CSV line.
// the fields to work with end up in the array called $csv_parts
$csv_parts = explode("##",$html);
// imploding gives you CSV-like text similar to 1 Stop Signs,480-961-7446,..
$csv = implode(",",$csv_parts);
Now you have two ways to work with the HTML: extracting the individual fields or exporting the CSV.
Hope this helps or gives you an idea for developing what you need.
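One caveat with a plain implode(): a field that itself contains a comma, such as "Chandler, AZ 85226", would be split across two columns. Below is a minimal sketch of letting fputcsv() handle the quoting instead. It assumes the fields have already been extracted into an array; the sample values and the php://memory stream are only there for illustration.

```php
<?php
// Sample fields (assumed to be already extracted and trimmed).
$fields = array(
    '1 Stop Signs',
    '480-961-7446',
    '500 N. 56th Street, Chandler, AZ 85226', // contains commas
    'www.1stopsigns.com',
);

// Write to an in-memory stream so the resulting CSV line can be inspected.
$fp = fopen('php://memory', 'w+');
fputcsv($fp, $fields, ',', '"', '\\'); // fputcsv quotes fields where needed
rewind($fp);
$line = stream_get_contents($fp);
fclose($fp);

echo $line;
```

In real use the stream would be a file opened with fopen('records.csv', 'w'), with one fputcsv() call per record.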

By far the easiest way would be to simply take the block, drop everything from the <hr> tag forward then split the string as a string array on the <br /> tags.
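A rough sketch of that idea, assuming a single record block is already in a string ($block and its sample content are taken from the snippet in the question; the variable names are made up for illustration):

```php
<?php
// One record block, as it appears in the question's HTML source.
$block = "1 Stop Signs<br />\n480-961-7446<br />\n500 N. 56th Street<br />\n"
       . "Chandler, AZ 85226<br />\n<br />\nWebsite: www.1stopsigns.com<br />\n"
       . "<br />\n</p><br /><hr><br />\n";

// Drop everything from the <hr> tag forward.
$cut = strpos($block, '<hr>');
if ($cut !== false) {
    $block = substr($block, 0, $cut);
}

// Split on the <br /> tags, then clean each piece and skip the empty ones.
$fields = array();
foreach (explode('<br />', $block) as $piece) {
    $piece = trim(strip_tags($piece));
    if ($piece !== '') {
        $fields[] = $piece;
    }
}

print_r($fields);
```

For a page with many records, the same steps would run once per block after splitting the page on <hr>.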

Related

Formatting string to display as list using php

I have the contents of a textarea being stored in a PHP string after it is submitted by the user. I am hoping to be able to tweak the formatting of the contents of that string, such that it will be displayable as a list when it is echoed. In other words, I would need to insert UL and /UL at the beginning and end, respectively, and LI and /LI at the beginning and end of each line.
Before I mess with my code, I was wondering if anyone knows whether this is even possible. Are carriage returns sent via textarea submit? Any help/comments would be much appreciated.
[EDIT]
I have defined some variables to give myself all the necessary HTML stuff. The 'repertoire' variable is the original string containing text sent from user input.
$repertoire = ($_POST['repertoire']);
$list_start = '<UL>';
$list_end = '</UL>';
$list_start_line = '<LI>';
$list_end_line = '</LI>';
The following is an example of what would be submitted by the user, and therefore, what would constitute the original $repertoire string:
Luciano Berio - Circles
Mike Svoboda - Piangero la sorte mia
Nicholas von Ritter-Zahony - New Piece
Stefano Gervasoni - Due Poesie Francesi di Rilke
So we would at least need the following:
$repertoire_formatted = substr_replace($list_start, $repertoire, $list_end);
...but I don't know how to substitute <LI> for line breaks; also, I cannot know in advance the length of the string or of each line.

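You can use a regex to select every line and wrap it in <li></li>. Excluding \r as well as \n keeps the carriage returns that a textarea submits out of the list items:
$html = preg_replace("/([^\r\n]+)/", "<li>$1</li>", $repertoire);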
$html = "<ul>\n$html</ul>";
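As an alternative sketch (not the approach above, but a loop over the lines), the text can be split on any line ending and each line escaped before it is wrapped, which also protects against HTML typed into the textarea. The sample $repertoire value stands in for $_POST['repertoire'] and is only an assumption for illustration.

```php
<?php
// Sample input; a textarea normally submits lines separated by "\r\n".
$repertoire = "Luciano Berio - Circles\r\n"
            . "Mike Svoboda - Piangero la sorte mia\r\n"
            . "\r\n"
            . "Stefano Gervasoni - Due Poesie Francesi di Rilke";

$items = '';
// \R matches any line ending (\r\n, \n or \r).
foreach (preg_split('/\R/', $repertoire) as $line) {
    $line = trim($line);
    if ($line === '') {
        continue; // skip blank lines
    }
    // Escape the text so user input cannot inject markup.
    $items .= '<li>' . htmlspecialchars($line) . "</li>\n";
}

$html = "<ul>\n" . $items . "</ul>";
echo $html;
```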

Get the feed value from particular string in php

I have the below feed value
<item>
<description><strong>Contact Number:</strong> +91-00-000-000<br /><br /><strong>Rate:</strong> xx.xx<br /><br /><strong>Fees and Comments:<br /></strong><ul><li>$0 fees</li><li>Indicative Exchange Rate</li></description>
</item>
Now I want to get the contact number, the rate, and the fees and comments as separate values.
How can I get these values?

Description
You should probably read this with a parsing engine. However, if your use case is this simple, then this regex will:
capture each of the fields
allow the fields to appear in any order
^(?=.*?Contact\sNumber:<\/strong>([^<]*))(?=.*?Rate:<\/strong>([^<]*))(?=.*?Fees\sand\sComments:.*?<li>([^<]*)<.*?<li>([^<]*)<)
Live Example: http://www.rubular.com/r/j0aStij3L8
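A minimal sketch of how this pattern might be used from PHP. The ~ delimiter (so the slashes need no escaping), the s modifier and the variable names are choices made for this example; $feed is the description element from the question.

```php
<?php
$feed = '<description><strong>Contact Number:</strong> +91-00-000-000<br /><br />'
      . '<strong>Rate:</strong> xx.xx<br /><br /><strong>Fees and Comments:<br /></strong>'
      . '<ul><li>$0 fees</li><li>Indicative Exchange Rate</li></description>';

// Each lookahead scans from the start of the string and captures one field,
// which is why the fields may appear in any order.
$pattern = '~^(?=.*?Contact\sNumber:</strong>([^<]*))'
         . '(?=.*?Rate:</strong>([^<]*))'
         . '(?=.*?Fees\sand\sComments:.*?<li>([^<]*)<.*?<li>([^<]*)<)~s';

if (preg_match($pattern, $feed, $m)) {
    $contact = trim($m[1]);
    $rate    = trim($m[2]);
    $fees    = array($m[3], $m[4]);
    echo $contact, "\n", $rate, "\n", implode('; ', $fees), "\n";
}
```

Note that the pattern expects exactly two list items under "Fees and Comments"; a feed with a different number would need a different approach for that part.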

It kind of depends on what reliable patterns there are to the rest of your feed (or future feeds). It doesn't look like an XML parser is going to work here as the example doesn't look like well formed XML.
A good way to start is to use explode to split the string into an array of strings; <br /> looks like a good delimiter to split on. So this would look like:
$split_feed = explode("<br />",$feed);
where $feed is your feed input in the question, and $split_feed will be your output array.
Then, from that split feed, you can use strpos (or stripos) to test for keys in each string to determine which field it refers to, and str_replace to strip the key from the key/value string, leaving the value.
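A small sketch of those steps for the two simple fields. The $labels map and the use of strip_tags() and substr() (rather than a replace) are assumptions made for this example; the fees list spans several tags and would need separate handling.

```php
<?php
$feed = '<strong>Contact Number:</strong> +91-00-000-000<br /><br />'
      . '<strong>Rate:</strong> xx.xx<br /><br /><strong>Fees and Comments:<br /></strong>'
      . '<ul><li>$0 fees</li><li>Indicative Exchange Rate</li>';

// Hypothetical map of result keys to the labels used in the feed.
$labels = array('contact' => 'Contact Number:', 'rate' => 'Rate:');
$values = array();

foreach (explode('<br />', $feed) as $part) {
    $text = trim(strip_tags($part)); // e.g. "Rate: xx.xx"
    foreach ($labels as $key => $label) {
        // stripos() === 0 means the text starts with this label.
        if (stripos($text, $label) === 0) {
            $values[$key] = trim(substr($text, strlen($label)));
        }
    }
}

print_r($values);
```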

I think this is what you want:
<?php
$value = '<strong>Contact Number:</strong> +91-00-000-000<br /><br />
<strong>Rate:</strong> xx.xx<br /><br />
<strong>Fees and Comments:<br /></strong><ul><li>$0 fees</li>
<li>Indicative Exchange Rate</li>';
$steps = explode('<br /><br />', $value);
$step_2_for_contact_number = explode('</strong>', $steps[0]);
$contact_number = $step_2_for_contact_number[1];
$step_for_rate = explode('</strong>', $steps[1]);
$rate = $step_for_rate[1];
$feed_n_comment_s_1 = explode('</li>', $steps[2]);
$feed_n_comment_s_2 = explode('<li>', $feed_n_comment_s_1[0]);
$feed_n_comment = $feed_n_comment_s_2[1];
echo $contact_number;
echo "<br/>";
echo $rate;
echo "<br/>";
echo $feed_n_comment;
?>

You can also have a look at this pattern: (uses named groups)
(?<key>[a-zA-Z\d\s]+)(?=\:).*?\>(?<value>[^<]+)

get data between inverted commas and more

I have had a problem for a few days now.
I'm trying to get some changing data out of a string. The string is something like this:
<docdata>
<!-- News Identifier -->
<doc-id id-string ="YBY15349" />
<!-- Date of issue -->
<date.issue norm ="2012-09-22 19:52" />
<!-- Date of release -->
<date.release norm ="2012-09-22 19:52" />
</docdata>
What I need is only the date inside the quotes, "2012-09-22 19:52". The string is stored in some kind of XML, which is malformed, by the way, so I can't use a normal XML parser. I already load/read the file to change the charset:
$fname = $file;
$fhandle = fopen($fname,"r");
$content = fread($fhandle,filesize($fname));
str_replace("<?xml version=\"1.0\" encoding=\"UTF-8\"?>", "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>", $content);
etc..
This works like a charm, but I can't use it to get the value out of the string.
I tried preg_match_all, but I can't get it right.
Is there a simple way to search for this value
<date.issue norm ="2012-09-22 19:52" />
and get just the date in a variable?
Thanks in advance, and sorry for my English.

From the PHP documentation:
file_get_contents() is the preferred way to read the contents of a file into a string. It will use memory mapping techniques if supported by your OS to enhance performance.
Consequently, your code would become:
$content = file_get_contents($file);
$content = str_replace("<?xml version=\"1.0\" encoding=\"UTF-8\"?>", "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>", $content);
preg_match_all('/date\.issue norm ="([^"]+)" /', $content, $date);
The default behavior is to store the parenthesized matches in the array $date[1]. Therefore, you might loop through $date[1][0], $date[1][1], and so on.

A regular expression to match the following:
<date.issue norm ="2012-09-22 19:52" />
Would be:
/<date\.issue\s*norm\s*="([^"]*)"/
In code:
preg_match_all('/<date\.issue\s*norm\s*="([^"]*)"/', $content, $matches);
// $matches[1] contains all the dates
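If only the single issue date is needed, preg_match() with the same pattern is enough. A minimal sketch, where the sample $content is copied from the question and $issueDate is a name chosen for this example:

```php
<?php
$content = <<<'XML'
<docdata>
<!-- Date of issue -->
<date.issue norm ="2012-09-22 19:52" />
<!-- Date of release -->
<date.release norm ="2012-09-22 19:52" />
</docdata>
XML;

// preg_match() stops at the first match; $match[1] holds the captured date.
if (preg_match('/<date\.issue\s*norm\s*="([^"]*)"/', $content, $match)) {
    $issueDate = $match[1];
    echo $issueDate, "\n";
}
```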

Instead of using
fopen($filename)
use
$filename = '/path/to/file.xml';
$filearray = file($filename); // pulls the whole file into an array of lines
$searchstr = 'date.issue';
foreach($filearray as $line) {
if(stristr($line,$searchstr)) {
$linearray = explode('"',$line);
// your date should be $linearray[1];
echo $linearray[1]."\n"; // to test your output
// rest of your code here
}
}
This way you search the whole file for your search string, and the malformed XML shouldn't be a problem.

PHP/HTML - Multiple page screen scrape, export to .txt with commas between dates and values

I am attempting to scrape the web page (see code) - as well as those pages going back in time (you can see the date '20110509' in the page itself) - for simple numerical strings. I can't seem to figure out through much trial and error (I'm new to programming) how to parse the specific data in the table that I want. I have been trying to use simple PHP/HTML without curl or other such things. Is this possible? I think my main issue is using the delimiters that are necessary to get the data from the source code.
What I'd like is for the program to start at the very first page it can, say for example '20050101', and scan through each page till the current date, grabbing the specific data for example, the "latest close" (column), "closing arm" (row), and have that value for the corresponding date exported to a single .txt file, with the date being separated from the value with a comma. Each time the program is run, the date/value should be appended to the existing text file.
I am aware many lines of the code below are junk, it's part of my learning process.
<html>
<title>HTML with PHP</title>
<body>
<?php
$rawdata = file_get_contents('http://online.wsj.com/mdc/public/page/2_3021-tradingdiary2-20110509.html?mod=mdc_pastcalendar');
//$data = substr(' ', $data);
//$begindate = '20050101';
//$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
//if (preg_match(' <td class="text"> ' , $data , $content)) {
//$content = str_replace($newlines
echo $rawdata;
///file_put_contents( 'NYSETRIN.html' , $content , FILE_APPEND);
?>
<b>some more html</b>
<?php
?>
</body>
</html>

All right so let's do this. We're going to first load the data into an HTML parser, then create an XPath parser out of it. XPath will help us navigate around the HTML easily. So:
$date = "20110509";
$data = file_get_contents("http://online.wsj.com/mdc/public/page/2_3021-tradingdiary2-{$date}.html?mod=mdc_pastcalendar");
$doc = new DOMDocument();
@$doc->loadHTML($data);
$xpath = new DOMXpath($doc);
Now then we need to grab some data. First off let's get all the data tables. Looking at the source, these tables are indicated by a class of mdcTable:
$result = $xpath->query("//table[@class='mdcTable']");
echo "Tables found: {$result->length}\n";
So far:
$ php test.php
Tables found: 5
Okay so we have the tables. Now we need to get specific column. So let's use the latest close column you mentioned:
$result = $xpath->query("//table[@class='mdcTable']/*/td[contains(.,'Latest close')]");
foreach($result as $td) {
echo "Column contains: {$td->nodeValue}\n";
}
The result so far:
$ php test.php
Column contains: Latest close
Column contains: Latest close
Column contains: Latest close
... etc ...
Now we need the column index for getting the specific column for the specific row. We do this by counting all of the previous sibling elements, then adding one. This is because element index selectors are 1 indexed, not 0 indexed:
$result = $xpath->query("//table[@class='mdcTable']/*/td[contains(.,'Latest close')]");
$column_position = $xpath->query('preceding-sibling::*', $result->item(0))->length + 1;
echo "Position is: $column_position\n";
Result is:
$ php test.php
Position is: 2
Now we need to get our specific row:
$data_row = $xpath->query("//table[@class='mdcTable']/*/td[starts-with(.,'Closing Arms')]");
echo "Returned {$data_row->length} row(s)\n";
Here we use starts-with, since the row label has a utf-8 symbol in it. This makes it easier. Result so far:
$ php test.php
Returned 4 row(s)
Now we need to use the column index to get the data we want:
$data_row = $xpath->query("//table[@class='mdcTable']/*/td[starts-with(.,'Closing Arms')]/../*[$column_position]");
foreach($data_row as $row) {
echo "{$date},{$row->nodeValue}\n";
}
Result is:
$ php test.php
20110509,1.26
20110509,1.40
20110509,0.32
20110509,1.01
Which can now be written to a file. Now, we don't have the markets these apply to, so let's go ahead and grab those:
$headings = array();
$market_headings = $xpath->query("//table[@class='mdcTable']/*/td[@class='colhead'][1]");
foreach($market_headings as $market_heading) {
$headings[] = $market_heading->nodeValue;
}
Now we can use a counter to reference which market we're on:
$data_row = $xpath->query("//table[@class='mdcTable']/*/td[starts-with(.,'Closing Arms')]/../*[$column_position]");
$i = 0;
foreach($data_row as $row) {
echo "{$date},{$headings[$i]},{$row->nodeValue}\n";
$i++;
}
The output being:
$ php test.php
20110509,NYSE,1.26
20110509,Nasdaq,1.40
20110509,NYSE Amex,0.32
20110509,NYSE Arca,1.01
Now for your part:
This can be made into a function that takes a date
You'll need code to write out the file. Check out the filesystem functions for hints
This can be made extendible to use different columns and different rows
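The first two points might be sketched roughly as follows. This is only an outline under stated assumptions: extractClosingArms() is a hypothetical helper name, the inline $html is a tiny stand-in for a downloaded page (the real page has more tables and columns), and a temporary file stands in for the real output file.

```php
<?php
// Hypothetical helper: returns "date,value" lines for one page's HTML.
function extractClosingArms(string $html, string $date): array
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings from sloppy markup
    $xpath = new DOMXPath($doc);

    // Locate the "Latest close" heading and work out its column position.
    $label = $xpath->query("//table[@class='mdcTable']//td[contains(.,'Latest close')]")->item(0);
    if ($label === null) {
        return array();
    }
    $column = $xpath->query('preceding-sibling::*', $label)->length + 1;

    // Take that column from every "Closing Arms" row.
    $lines = array();
    $cells = $xpath->query("//table[@class='mdcTable']//td[starts-with(.,'Closing Arms')]/../*[$column]");
    foreach ($cells as $cell) {
        $lines[] = $date . ',' . trim($cell->nodeValue);
    }
    return $lines;
}

// Stand-in for file_get_contents() on the real page.
$html = '<table class="mdcTable">'
      . '<tr><td class="colhead">NYSE</td><td>Latest close</td><td>Previous close</td></tr>'
      . '<tr><td>Closing Arms</td><td>1.26</td><td>0.98</td></tr>'
      . '</table>';

$lines = extractClosingArms($html, '20110509');

// FILE_APPEND adds to the existing file instead of overwriting it.
$outFile = tempnam(sys_get_temp_dir(), 'arms');
file_put_contents($outFile, implode("\n", $lines) . "\n", FILE_APPEND);
```

Looping over a range of dates would then mean calling the helper once per date with that day's page.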

I'd recommend using the HTML Agility Pack; it's an HTML parser which is very handy for finding particular content within an HTML document.

Insert string between two markers

I have a requirement to insert a string between two markers.
Initially I get a string (from a file stored on the server) between #DATA# and #END# using:
function getStringBetweenStrings($string,$start,$end){
$startsAt=strpos($string,$start)+strlen($start);
$endsAt=strpos($string,$end, $startsAt);
return substr($string,$startsAt,$endsAt-$startsAt);
}
I do some processing and based on the details of the string, query for some records. If there are records I need to be able to append them at the end of the string and then re-insert the string between #DATA# and #END# within the file on the server.
How can I best achieve this?
Is it possible to insert a record at a time in the file before #END# or is it best to manipulate the string on the server and just re-insert over the existing string in the file on the server?
Example of Data:
AGENT_REF^ADDRESS_1^ADDRESS_2^ADDRESS_3^ADDRESS_4^TOWN^POSTCODE1^POSTCODE2^SUMMARY^DESCRIPTION^BRANCH_ID^STATUS_ID^BEDROOMS^PRICE^PROP_SUB_ID^CREATE_DATE^UPDATE_DATE^DISPLAY_ADDRESS^PUBLISHED_FLAG^LET_RENT_FREQUENCY^TRANS_TYPE_ID^NEW_HOME_FLAG^MEDIA_IMAGE_00^MEDIA_IMAGE_TEXT_00^MEDIA_IMAGE_01^MEDIA_IMAGE_TEXT_01^~
#DATA#
//Property records would appear here and match the string above, each field separated with ^ and terminating with ~
//Once the end of data has been reached, it will be fully terminated with:
#END#
When I check for new properties, I do the following:
Get all existing properties between #DATA# and #END#
Get the IDs of the properties and query for new properties which don't match these IDs
I then need to re-insert the new properties before #END# but after the last property in the file.
The structure of the file is a Rightmove BLM file.

Just do an str_replace() of the old data with the new:
$str = str_replace('#DATA#'.$oldstr.'#END#', '#DATA#'.$newstr.'#END#', $str);
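The str_replace() call only works if $oldstr matches the file contents exactly. As an alternative sketch (a different technique from the one above), new rows can be spliced in directly before the #END# marker with strrpos() and substr_replace(). The sample file contents and row values below are invented for illustration.

```php
<?php
// Invented sample file contents and new rows.
$str     = "HEADER^~\n#DATA#\nA1^Old Road^~\n#END#\n";
$newRows = "A2^New Street^~\n";

// Find the last #END# marker and insert the new rows just before it.
$pos = strrpos($str, '#END#');
if ($pos !== false) {
    // A length of 0 means nothing is overwritten, the text is only inserted.
    $str = substr_replace($str, $newRows, $pos, 0);
}

echo $str;
```

The modified string would then be written back with file_put_contents().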

I would extract the data in 3 steps:
1) Extract the data from the file:
<?php
preg_match("/#DATA#(.+)#END#/s", $string, $data);
?>
2) Extract each row of data:
<?php
preg_match_all("/((?:.+\^){2,})~/", $data[1], $rows, PREG_PATTERN_ORDER);
// The rows with data will be stored in $rows[1]
?>
3) Manipulate the data in each row or add new rows:
<?php
// Add
// Add a new row to the end of the array of rows (each stored row ends with '^')
$rows[1][] = implode('^', $newRowArray) . '^';
// Use
// Creates an array with all the data from row 0
$rowData = preg_split("/\^/", $rows[1][0], -1, PREG_SPLIT_NO_EMPTY);
// Save the changes
// $newData should be all the rows together (with the '~' at the end of each row)
// $string is the original string with all the information
// '${1}' and '${2}' are the backreferences; "\1" in double quotes would be a control character
$file = preg_replace("/(#DATA#\r?\n).+(\r?\n#END#)/s", '${1}' . $newData . '${2}', $string);
?>
I hope this can help you in your problem.
