I need to print the results of HTML parsing into a PHP array. I am stuck at the very last part.
library(XML)
url ='http://www.brainyquote.com/quotes/authors/j/john_kenneth_galbraith.html'
page <- htmlParse(url)
quote <- xpathSApply(page,
"//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]",
xmlValue)
quote = quote[nchar(quote) > 50] # this removes all non-quotes based on string length
quote = quote[1:(length(quote)-2)] # this drops the last two entries
out = paste(quote, collaspe= "', ") # how to get the ' at the front of the quote
write(out, "quote.txt")
The resulting text has each quote followed by ', (apostrophe, comma) at the end. I need to put an ' (apostrophe) at the beginning as well and have no idea how to do it. I tried converting from R to JSON, but that does not work for the simple PHP array I use, which is structured like this:
<?php
$quotes = array('quote goes here', 'quote goes here', 'final quote');
$rand = rand( 0, count($quotes)-1 );
echo $quotes[$rand];
?>
I do not really use PHP, but it runs on everything, so I wrote this random quote maker in simple terms. I could rewrite it in JavaScript and use a JSON array, but then I would need to write it in JavaScript.
You can simplify the extraction somewhat (maybe this takes care of your apostrophe etc ?)
library(XML)
url ='http://www.brainyquote.com/quotes/authors/j/john_kenneth_galbraith.html'
page <- htmlParse(url)
out <- sapply(page['//span[@class="bqQuoteLink"]/a'], xmlValue)
write(out, "quote.txt")
> head(paste0("'", out, "'"))
[1] "'The modern conservative is engaged in one of man's oldest exercises in moral philosophy; that is, the search for a superior moral justification for selfishness.'"
[2] "'Economics is extremely useful as a form of employment for economists.'"
[3] "'All of the great leaders have had one characteristic in common: it was the willingness to confront unequivocally the major anxiety of their people in their time. This, and not much else, is the essence of leadership.'"
[4] "'In economics, the majority is always wrong.'"
[5] "'Faced with the choice between changing one's mind and proving that there is no need to do so, almost everyone gets busy on the proof.'"
[6] "'Under capitalism, man exploits man. Under communism, it's just the opposite.'"
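On the PHP side, one way to sidestep the quoting problem entirely is to read quote.txt line by line instead of pasting a literal array. This is only a sketch: it assumes write() produced one quote per line, and loadQuotes() and randomQuote() are hypothetical helper names. Apostrophes inside a quote (as in "man's") then need no escaping.

```php
<?php
// Hypothetical helper: read one quote per line from the file written by R.
function loadQuotes(string $path): array
{
    if (!is_readable($path)) {
        return [];
    }
    // Strip the trailing newline from each line and skip blank lines.
    $lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    return $lines === false ? [] : $lines;
}

// Hypothetical helper: pick one entry at random, or null for an empty list.
function randomQuote(array $quotes): ?string
{
    return $quotes ? $quotes[array_rand($quotes)] : null;
}

$quotes = loadQuotes('quote.txt');
echo htmlspecialchars(randomQuote($quotes) ?? 'No quotes found.');
```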
PHP Mysql CodeIgniter Converting characters to symbols in very bizarre circumstances
Application Built on CodeIgniter.
Has been running for over a year. No problems.
Client fills in a form about a customer.
A simple trim($_POST['notes']) captures textarea form field text and saves to MySQL
no error reported in PHP or JavaScript
The other day I noticed that some text the client had entered had the brackets "()" replaced with the equivalent entities "&#40;&#41;".
I think... "That's strange... I don't recall any reason why those characters would have been replaced like that.!"
I take a look ... and a day later... here is my madness revealed:
The text in question is verbatim "
Always run credit card on file (we do not charge this customer for pick-up or return)
"
No matter what I did or changed on the code side.. I could not prevent the PHP... OR Javascript... Or MySQL... OR alien beings... - or whoever the heck is doing it - from converting the "()" in the text to "&#40;&#41;". And I tried many things like cleaning the string in all ways known to man or god. Capturing the string previous to sending just before saving to the database. And the conversion would always take place just before the save to MySQL. I tried posting in different forms and fields... Same thing every time... could not stop the magic conversion to "&#40;&#41;".
What in the name of batman is in this magical text that is causing this to happen?? is it magic pixie dust sprinkled on to godaddy server it is running on??? 0_o
.......
Being the genius that I am 0_0 I decide to remove one word from the paragraph at a time.
Magically... as all the creatures of the forest gathered around - as I finally got to the word "file" in the paragraph, and removed it !!! Like magic - the "()" stay as "()" and are NOT converted to "&#40;&#41;"?!?!???!?!? :\ How come?? I simply removed the word "file" from the text... How could this change anything?? What is the word "file" causing to change with how the string is saved or converted??
OK - So I tested this out on any and every form field in the app. Every single time, in any field, if you type the word "file" followed by a "(" it will convert the first "(" to "&#40;" and the very next ")" to "&#41;".
So.. if the string is:
"file ( any number of characters or text ) any other text or characters"
On post, it will be converted mysteriously to:
"file &#40; any number of characters or text &#41; any other text or characters"
Remove the word "file" from the string, and you get:
"( any number of characters or text ) any other text or characters"
The alien beings return the abducted "()"
Anyone have a clue what the heck could be going on here?
What is causing this?
Is the word "file" a keyword that is tripping some sort of security measure? Interpreting it as "file()"???
I dunno :\
It's the strangest thing I ever saw... Except for that time I walked in on Mom and Dad 0_o
Any help would be greatly appreciated, and I will buy you a beer for sure :)
The very large headed, - (way too much power for such tender egos) -, Noo-Noos here at stack have paused this question as "Off topic" LOL... honest to God these guys are so silly.
So - in an effort to placate the stack-gestapo - I will attempt to edit this question so that it is... "on topic"??? 0_o ... anything for you oh so "King" Stack Guys O_O - too bad you would never have the wit to ever notice such a bug... maybe some day. ;)
Sample code:
<textarea name="notes">Always run credit card on file (we do not charge this customer for pick-up or return) blah blah</textarea>
<?php
if(isset($_POST['notes'])){
$this->db->where("ID = ".$_POST['ID']);
$this->db->update('OWNER', $_POST['notes']);
}
?>
Resulting MySQL storage:
"Always run credit card on file &#40;we do not charge this customer for pick-up or return&#41; blah blah"
InnoDB - Type text utf8_general_ci
I am not looking for a way to prevent it, or clean it... I am clearly asking "What causes it"
/*
* Sanitize naughty scripting elements
*
* Similar to above, only instead of looking for
* tags it looks for PHP and JavaScript commands
* that are disallowed. Rather than removing the
* code, it simply converts the parenthesis to entities
* rendering the code un-executable.
*
* For example: eval('some code')
* Becomes: eval&#40;'some code'&#41;
*/
$str = preg_replace('#(alert|cmd|passthru|eval|exec|expression|system|fopen|fsockopen|file|file_get_contents|readfile|unlink)(\s*)\((.*?)\)#si', "\\1\\2&#40;\\3&#41;", $str);
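This is part of XSS Clean (system/core/Security.php). Because "file" is in the list of disallowed function names, any text where "file" is followed by parentheses gets those parentheses converted to entities.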
If you want the filter to run automatically every time it encounters POST or COOKIE data you can enable it by opening your application/config/config.php file and setting this:
$config['global_xss_filtering'] = TRUE;
https://www.codeigniter.com/user_guide/libraries/security.html
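To see the effect in isolation, the following sketch applies the same pattern and replacement outside CodeIgniter. The function name neuterCalls() is made up for illustration, and the sample strings are shortened versions of the text from the question.

```php
<?php
// Hypothetical wrapper around the single xss_clean() step quoted above.
function neuterCalls(string $str): string
{
    return preg_replace(
        '#(alert|cmd|passthru|eval|exec|expression|system|fopen|fsockopen|file|file_get_contents|readfile|unlink)(\s*)\((.*?)\)#si',
        '\1\2&#40;\3&#41;', // keep the keyword and spacing, swap ( ) for entities
        $str
    );
}

// "file" followed by "(" matches the pattern, so the parentheses are converted.
echo neuterCalls('Always run credit card on file (we do not charge) blah'), "\n";
// Without the word "file" nothing matches and the text is left alone.
echo neuterCalls('Always run credit card (we do not charge) blah'), "\n";
```

Note that the pattern has no word boundary in front of the keyword, so a word that merely ends in "file" (for example "profile (…)") would presumably be affected as well.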
Try something like this:
$this->db->set('OWNER', $_POST['notes'],FALSE);
$this->db->where('ID ', $_POST['ID']);
$this->db->update('table_name');
I think the problem is in your server. If you are using WAMP, check whether you missed installing some XHTML-related arguments. This is just my idea, based on my experience with CodeIgniter. Please respond if you would like more advice.
Use utf8 encoding to store these values.
To avoid injections use mysql_real_escape_string() (or prepared statements).
To protect from XSS use htmlspecialchars.
However, I am not sure what the issue is in your case.
Try using some other SQL keywords in the string to check whether the same thing happens.
Try replacing the &#40; and the &#41; with ( and ) using str_replace.
If you are storing &#40; and &#41; in your database, then you should try replacing them on output; if not, try replacing them before input.
I'm not sure if this would work, but you could try inserting a slash in or before the word 'file':
fi\le ( any number of characters or text ) any other text or characters
Ok, I'm using SimpleXML to parse RSS feeds, and as many feeds contain embedded html, I'd like to be able to isolate any image addresses contained in embedded html. Sounds like an easy enough task, but I'm running into an issue with parsing the data from the SimpleXMLElement objects. Here's the relevant code.
for($i = 0; $i < count($articles); $i++) {
foreach($articles[$i] as $feedDeet) {
$str = (string)$feedDeet;
$result = strpos($str, '"');
if($result === false) {
echo 'There are apparently no quotes in this string: '.$str;
}
$explodedString = explode('"', $str);
echo "<br>";
if($explodedString[0] == $str) {
echo 'ExplodedString is equal to str. Apparently, once again, the string contains no quotes.';
}
echo "<hr>";
}
}
In this situation, $articles is an array of SimpleXMLElement objects each representing an RSS article, and containing many child SimpleXMLElement objects representing properties and details of that article. Basically, I'd like to iterate through those properties one by one, cast them as strings, and then explode the strings using any quotes as delimiters (because any image addresses would be contained inside of quotes). I would then parse through the exploded array and search for any strings that appear to be an image address. However, neither explode() nor strpos() is behaving as I would expect it to. To give an example of what I mean, one of the outputs of the above code is as follows:
There are apparently no quotes in this string: <p style="text-align: center;"><img class="alignnone size-full wp-image-243922" alt="gold iPhone Shop Le Monde" src="http://media.idownloadblog.com/wp-content/uploads/2013/08/gold-iPhone-Shop-Le-Monde.jpg" width="593" height="515" /></p> <p>Folks still holding out hope that the gold iPhone rumors aren’t true may want to brace themselves, the speculation has just been confirmed by the Wall Street Journal-owned blog AllThingsD. And given the site’s near perfect (perfect?) track record with predicting future Apple plans, and corroborating evidence, we’d say Apple is indeed going for the gold…(...)<br/>Read the rest of AllThingsD confirms gold iPhone coming</p> <hr /> <p><small> "AllThingsD confirms gold iPhone coming" is an article by iDownloadBlog.com. <br/>Make sure to follow us on Twitter, Facebook, and Google+. </small></p>
ExplodedString is equal to str. Apparently, once again, the string contains no quotes.
Sorry if that was a little hard to read, it's copied verbatim from the output.
As you can see, there are clearly quotes in the string in question, yet, strpos is returning false, meaning that the specified string could not be found, and explode is returning an array with the original string inside, signifying that the specified delimiter could not be found. What is going on here? I've been stumped by this for hours, and I feel like I'm losing my mind.
Thanks!
The mistake you've made here is that your debug output is an HTML page, so the messages you print are being interpreted as HTML by your browser. To see their actual contents, you either need to view the page source, or use <pre> tags to preserve whitespace, and htmlspecialchars() to add a layer of HTML escaping: echo '<pre>' . htmlspecialchars($str) . '</pre>';
If the output in the browser looks like <p style="text-align: center;">, then clearly the input is already escaped with HTML entities, and probably actually looks like &lt;p style=&quot;text-align: center;&quot;&gt;. Although that &quot; looks like " once rendered, it is not the same string, so strpos() won't find it.
In order to undo this extra layer of escaping, you could run html_entity_decode() on the string before processing it.
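A minimal sketch of that idea, assuming the feed text really is entity-escaped as described. The sample string, the example.com URL and the extractImageUrls() helper are made up for illustration.

```php
<?php
// Hypothetical helper: decode the extra escaping layer, then pull out img src values.
function extractImageUrls(string $escapedHtml): array
{
    // Turn &lt; &quot; etc. back into real markup characters.
    $html = html_entity_decode($escapedHtml, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    preg_match_all('/<img[^>]+src="([^"]+)"/i', $html, $matches);
    return $matches[1];
}

$raw = '&lt;p&gt;&lt;img alt=&quot;gold iPhone&quot; src=&quot;http://example.com/gold.jpg&quot; /&gt;&lt;/p&gt;';

var_dump(strpos($raw, '"'));       // bool(false): there is no literal quote yet
print_r(extractImageUrls($raw));   // the decoded string does contain them
```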
In a project I am building I would like to use markdown as follows
*text* = <em>text</em>
**text** = <strong>text</strong>
***text*** = <strong><em>text</em></strong>
As those are the only three markdown formats I require, I would like to remain lightweight and avoid importing the entire PHP markdown library as that would introduce features I do not require and create issues.
So I have been trying to build some simple regex replaces. Using preg_replace I run:
'/(\*\*\*)(.*?)\1/' to '<strong><em>\2</em></strong>'
'/(\*\*)(.*?)\1/' to '<strong>\2</strong>'
'/(\*)(.*?)\1/' to '<em>\2</em>',
And this works great! em, bold, and the combo all work fine...
But if the user makes a mistake or enters too many stars, everything breaks.
i.e.
****hello**** = <strong><em><em>hello</em></strong></em>
*****hello***** = <strong><em><strong>hello</em></strong></strong>
******hello****** = <strong><em></em></strong>hello<strong><em></em></strong>
etc
When ideally it would create
****hello**** = *<strong><em>hello</em></strong>*
*****hello***** = **<strong><em>hello</em></strong>**
******hello****** = ***<strong><em>hello</em></strong>***
etc
Ignoring the un-required stars (so it would become clear to the user they made a mistake, and more importantly, the rendered HTML remains valid).
I presume there must be some way to modify my regex to do this, but I cannot for the life of me work it out, even after a whole day trying!
I would also be happy with the result of
******hello****** = <strong><em>hello</em></strong>
So please, can anybody help me?
Also please consider uneven stars. In this case the below scenario would be ideal.
***hello* = **<em>hello</em>
And the time when a star should be part of the body and not detected, such as if a user inputs:
'terms and conditions may apply*'
or
'I give the film 5* out of 10'
Many many thanks
Try a different capturing pattern (match anything except * one or more times):
'/(\*\*\*)([^*]+)\1/'
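Building on the [^*]+ idea, here is one possible sketch that handles all three formats in a single pass with preg_replace_callback(), so the replacement output is never re-scanned by a later pattern. This is a variation rather than the exact approach above, and renderStars() is a hypothetical name.

```php
<?php
// Hypothetical helper: convert *x*, **x** and ***x*** in one pass.
function renderStars(string $text): string
{
    $tags = [
        1 => ['<em>', '</em>'],
        2 => ['<strong>', '</strong>'],
        3 => ['<strong><em>', '</em></strong>'],
    ];

    return preg_replace_callback(
        // One to three stars, a star-free body, then the same run of stars again.
        '/(\*{1,3})([^*]+)\1/',
        function (array $m) use ($tags) {
            [$open, $close] = $tags[strlen($m[1])];
            return $open . $m[2] . $close;
        },
        $text
    );
}

echo renderStars('****hello****'), "\n"; // *<strong><em>hello</em></strong>*
echo renderStars('***hello*'), "\n";     // **<em>hello</em>
echo renderStars('I give the film 5* out of 10'), "\n"; // unchanged
```

Surplus stars are left in place as literal characters, and a lone star is untouched. One known limitation: two unrelated stars in the same text, such as "5* and 6*", would still be read as emphasis around the text between them.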
I wanted a particular implementation, such that the user provide a block of text like:
"Requirements
- Working knowledge, on LAMP Environment using Linux, Apache 2,
MySQL 5 and PHP 5,
- Knowledge of Web 2.0 Standards
- Comfortable with JSON
- Hands on Experience on working with Frameworks, Zend, OOPs
- Cross Browser Javascripting, JQuery etc.
- Knowledge of Version Control Software such as sub-version will be
preferable."
What I want to do is automatically select relevant keywords and create tags/keywords, hence for the above piece of text, relevant tags should be: mysql, php, json, jquery, version control, oop, web2.0, javascript
How can I go about doing it in PHP/Javascript etc? A headstart would be really helpful.
A very naive method is to remove common stopwords from the text, leaving you with more meaningful words like 'Standards', 'JSON', etc. You will still get a lot of noise however, so you may consider a service like OpenCalais which can do a rather sophisticated analysis of your text.
Update:
Okay, the link in my previous answer pointed to implementations, but you asked for one so a simple one is here:
function stopWords($text, $stopwords) {
    // Trim line breaks and spaces from the stopwords and lowercase them
    $stopwords = array_map(function($x){return trim(strtolower($x));}, $stopwords);
    // Replace digits and all non-word chars with a comma
    $pattern = '/[0-9\W]/';
    $text = preg_replace($pattern, ',', $text);
    // Create an array from $text
    $text_array = explode(",", $text);
    // Trim whitespace and lowercase the words in $text
    $text_array = array_map(function($x){return trim(strtolower($x));}, $text_array);
    // Collect every term that is not a stopword
    $keywords = array();
    foreach ($text_array as $term) {
        if (!in_array($term, $stopwords)) {
            $keywords[] = $term;
        }
    }
    // Drop the empty strings left behind by consecutive delimiters (keys are preserved)
    return array_filter($keywords);
}
$stopwords = file('stop_words.txt');
$text = "Requirements - Working knowledge, on LAMP Environment using Linux, Apache 2, MySQL 5 and PHP 5, - Knowledge of Web 2.0 Standards - Comfortable with JSON - Hands on Experience on working with Frameworks, Zend, OOPs - Cross Browser Javascripting, JQuery etc. - Knowledge of Version Control Software such as sub-version will be preferable.";
print_r(stopWords($text, $stopwords));
You can see this, and the contents of stop_words.txt, in this Gist.
Running the above on your example text produces the following array:
Array
(
[0] => requirements
[4] => linux
[6] => apache
[10] => mysql
[13] => php
[25] => json
[28] => frameworks
[30] => zend
[34] => browser
[35] => javascripting
[37] => jquery
[38] => etc
[42] => software
[43] => preferable
)
So, like I said, this is somewhat naive and could use more optimization (plus it's slow) but it does pull out the more relevant keywords from your text. You would need to do some fine tuning on the stop words as well. Capturing terms like Web 2.0 will be very difficult, so again I think you would be better off using a serious service like OpenCalais which can understand a text and return a list of entities and references. DocumentCloud relies on this very service to gather information from documents.
Also, for client side implementation you could do pretty much the same thing with JavaScript, and probably much cleaner (although it could be slow for the client.)
I did a quick review of these this morning and, to my surprise, the one which performed best with my test phrase was written in PHP:
http://code.fivefilters.org/term-extraction
demo: http://fivefilters.org/term-extraction/
What looked like the most professional one performed abysmally: viewer.opencalais.com
Others that were OK were (not sure what language they're written in)
www.nactem.ac.uk/software/termine/#form
www.alchemyapi.com/api/keyword/
This is not easy to do because it requires some type of fuzzy logic. You should use the Yahoo Term extractor YQL
Check it out: link
Depending on whether you want to show the client keywords/tags or whether you want to extract the keywords / tags from the block of text then do further computation with them.
If you only need to show them then clientside handling is fine. If you need them for further computation then use serverside handling for it.
I can recommend a JavaScript client-side implementation if you can supply some more details. If you want to generically "know" the keywords, then some kind of clever solution is necessary.
If you have a list of keywords then you can use regular expressions to extract the data
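As a rough sketch of the known-keyword approach: the tag list and the matchKnownTags() helper below are assumptions for illustration, not part of any library. Only a leading word boundary is used, so that "oop" still matches "OOPs".

```php
<?php
// Hypothetical helper: return the tags from $tags that occur in $text.
function matchKnownTags(string $text, array $tags): array
{
    $found = [];
    foreach ($tags as $tag) {
        // preg_quote() escapes regex metacharacters in the tag; /i ignores case.
        if (preg_match('/\b' . preg_quote($tag, '/') . '/i', $text)) {
            $found[] = $tag;
        }
    }
    return $found;
}

$text = 'Working knowledge on LAMP Environment using Linux, Apache 2, MySQL 5 and PHP 5, '
      . 'Comfortable with JSON, Zend, OOPs, JQuery, Version Control Software';
$tags = ['mysql', 'php', 'json', 'jquery', 'version control', 'oop', 'python'];

print_r(matchKnownTags($text, $tags));
```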
I'm looking for a good example of using Regular Expressions in PHP to "reverse engineer" a form letter (with a known format, of course) that has been pasted into a multiline textbox and sent to a script for processing.
So, for example, let's assume this is the original plain-text input (taken from a USDA press release):
WASHINGTON, April 5, 2010 - North
American Bison Co-Op, a New Rockford,
N.D., establishment is recalling
approximately 25,000 pounds of whole
beef heads containing tongues that may
not have had the tonsils completely
removed, which is not compliant with
regulations that require the removal
of tonsils from cattle of all ages,
the U.S. Department of Agriculture's
Food Safety and Inspection Service
(FSIS) announced today.
For clarity, the fields that are variables are highlighted below:
[pr_city=]WASHINGTON, [pr_date=]April 5, 2010 - [corp_name=]North
American Bison Co-Op, a [corp_city=]New Rockford,
[corp_state=]N.D., establishment is recalling
approximately [amount=]25,000 pounds of [product=]whole
beef heads containing tongues that may
not have had the tonsils completely
removed, which is not compliant with
regulations that require [reason=]the removal
of tonsils from cattle of all ages,
the U.S. Department of Agriculture's
Food Safety and Inspection Service
(FSIS) announced today.
How could I efficiently extract the contents of the
pr_city
pr_date
corp_name
corp_city
corp_state
amount
product
reason
fields from my example?
Any help would be appreciated, thanks.
Well, a regex that works on your example could look like this (line breaks introduced to keep this beast legible, need to be removed prior to use):
/^(?P<pr_city>[^,]+), (?P<pr_date>[^-]+) - (?P<corp_name>.*?), a
(?P<corp_city>[^,]+), (?P<corp_state>[^,]+), establishment is
recalling approximately (?P<amount>.*?) of (?P<product>.*?),
which is not compliant with regulations that require (?P<reason>.*?),
the U\.S\. Department of Agriculture\'s Food Safety and Inspection
Service \(FSIS\) announced today\.$/
So, in PHP you could do
if (preg_match('/^(?P<pr_city>[^,]+), (?P<pr_date>[^-]+) - (?P<corp_name>.*?), a (?P<corp_city>[^,]+), (?P<corp_state>[^,]+), establishment is recalling approximately (?P<amount>.*?) of (?P<product>.*?), which is not compliant with regulations that require (?P<reason>.*?), the U\.S\. Department of Agriculture\'s Food Safety and Inspection Service \(FSIS\) announced today\.$/', $subject, $regs)) {
$prcity = $regs['pr_city'];
$prdate = $regs['pr_date'];
// ... etc.
} else {
$result = "";
}
This assumes a couple of things, for instance that there are no line breaks, and that the input is the entire string (and not a larger string from which this part has to be extracted from). I've tried to make assumptions about legal values that make some sense, but there is the very real chance that other inputs could break this. So some more test cases are probably needed.
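To relax the "no line breaks" assumption, one option is to collapse all whitespace before matching. The sketch below does that with the same pattern; extractRecallFields() is a hypothetical helper name, and the pattern is only known to work for the sample press release.

```php
<?php
// Hypothetical helper: returns the named fields, or null if the text does not match.
function extractRecallFields(string $raw): ?array
{
    // Collapse hard-wrapped lines and repeated spaces into single spaces.
    $subject = preg_replace('/\s+/', ' ', trim($raw));

    $pattern = '/^(?P<pr_city>[^,]+), (?P<pr_date>[^-]+) - (?P<corp_name>.*?), a '
        . '(?P<corp_city>[^,]+), (?P<corp_state>[^,]+), establishment is recalling '
        . 'approximately (?P<amount>.*?) of (?P<product>.*?), which is not compliant '
        . 'with regulations that require (?P<reason>.*?), the U\.S\. Department of '
        . 'Agriculture\'s Food Safety and Inspection Service \(FSIS\) announced today\.$/';

    if (!preg_match($pattern, $subject, $m)) {
        return null;
    }
    // preg_match returns each named group twice (by name and by number); keep the names.
    return array_filter($m, 'is_string', ARRAY_FILTER_USE_KEY);
}
```

Calling extractRecallFields() on the pasted textbox contents should then give an array keyed by pr_city, pr_date, corp_name and so on, or null when the letter deviates from the template.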
If the surrounding text is constant, then something like this partial regex could do the trick:
preg_match('/^(.*?), (.*?) - (.*?), a (.*?), (.*?), establishment is recalling approximately (.*?), which is not compliant with regulations that require (.*?), the U\.S\. Department of Agriculture\'s Food Safety and Inspection Service \(FSIS\) announced today\./', $text, $matches);
// $matches[1] is 'WASHINGTON'
// $matches[2] is 'April 5, 2010'
// $matches[3] is ... etc...
If the surrounding text changes, then you're going to end up with a ton of false matches, no matches, etc... Essentially you'd need an AI to parse/understand PR releases.
Edit: Please disregard this crazy answer, as the other two are better. I should probably delete it, but I'm keeping it up for reference.
I have a crazy idea that just might work: build an XML string from the input by adding markups, then parse it. It might look something like this (completely untested) code:
$xml = preg_replace('/([^,]*), ([^-]*)- ...etc.../', '<pr_city>\1</pr_city><pr_date>\2</pr_date> ...etc...', $text);
Parsing the XML afterwards is a needlessly complicated process that is best left to the PHP documentation: http://www.php.net/manual/en/function.xml-parse.php .
You could also consider converting it to JSON with this method, then using json_decode() to parse it. In any case, you have to think about what happens when " marks and > symbols appear in the input.
It might be easier to just match and remove one piece of the text at a time.