This is my document:
{
"_id": {
"$oid": "58e9a13999447d65550f4dd6"
},
"prices": {
"20170409": 701.09,
"20170408": 700.07
},
"stock": {
"20170409": 0,
"20170408": 0
}
}
I append a lot of fields to the document objects (prices, stock), but over time they end up being huge, and that's just one document. I have 200k documents, and each has prices and stock objects.
I'm wondering if there's any way I could keep those objects to 30 fields max, that is, with older entries purged on reaching the limit?
Take a look here! You can do it with a combination of $push and $slice. The example at the link sorts the array too; I don't know if that is what you need, but I think it will give you a good start.
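For illustration, a minimal sketch using the PHP MongoDB library (mongodb/mongo-php-library). It assumes you restructure prices into an array of date/value entries, since $slice only caps arrays, not objects keyed by date; the database and collection names are made up:
<?php
// Minimal sketch: push a new price entry and keep only the 30 most recent.
// Assumes "prices" is an array of { date, value } documents.
require 'vendor/autoload.php';
$collection = (new MongoDB\Client('mongodb://localhost:27017'))->mydb->items;
$collection->updateOne(
    ['_id' => new MongoDB\BSON\ObjectId('58e9a13999447d65550f4dd6')],
    ['$push' => [
        'prices' => [
            '$each'  => [['date' => '20170410', 'value' => 702.15]],
            '$slice' => -30, // keep only the 30 most recent entries
        ],
    ]]
);
?>
The same $each + $slice combination would apply to the stock array.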
I have this very simple PHP call to the Alpha Vantage API to fill a table (or list) with NASDAQ stock prices:
<?php
function get_price($commodity = "")
{
$url = 'https://www.alphavantage.co/query?function=TIME_SERIES_DAILY_ADJUSTED&symbol=' . $commodity . '&outputsize=full&apikey=myKey';
$obj = json_decode(file_get_contents($url), true);
$date = $obj['Meta Data']['3. Last Refreshed'];
$result = $obj['Time Series (Daily)']['2018-03-23']['4. close'];
$rd_result = round($result, 2);
echo $result;
}
?>
<?php get_price("XOM");
get_price("AAPL");
get_price("MSFT");
get_price("CVX");
get_price("CAT");
get_price("BA");
?>
And it works, but it is just so freaking slow. It can take over 30 seconds to load, while the JSON file from Alpha Vantage loads in a fraction of a second.
Does anyone know where I am going wrong?
This is what I did when the API took time to reply; my solution is written in C#, but the logic would be the same.
string[] AlphaVantageApiKey = { "RK*********", "B2***********", "4FD*********QN", "7S3Z*********FRX", "U************I3" };
int ApiKeyValue = 0;
foreach (var stock in listOfStocks)
{
DataTable dtResult = DataRetrival.GetIntradayStockFeedForSelectedStockAs(stock.Symbol.Trim().ToUpper(), ApiKeyValue);
ApiKeyValue = (ApiKeyValue == 4) ? 0 : ApiKeyValue + 1;
}
I use 5 to 6 different API keys when I'm querying data, and I loop through them, one per call, thereby reducing the load on any particular token.
I observed that this improved my performance a lot. It takes me less than 1 minute to get intraday data for 50 stocks.
Another way you can improve your performance is to use
outputsize=compact
compact returns only the latest 100 data points in the time series.
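For illustration, here is the question's function with only that parameter changed (everything else, including the myKey placeholder, is as in the question; the function name is just for this sketch):
<?php
// Same call as in the question, with outputsize changed to "compact"
// so the API returns only the latest 100 data points instead of the full history.
function get_price_compact($commodity = "")
{
    $url = 'https://www.alphavantage.co/query?function=TIME_SERIES_DAILY_ADJUSTED'
         . '&symbol=' . $commodity
         . '&outputsize=compact'  // was: outputsize=full
         . '&apikey=myKey';
    $obj = json_decode(file_get_contents($url), true);
    return $obj['Time Series (Daily)'] ?? null;
}
?>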
UPDATE: Batch Stock Quotes
You might want to consider using this type of query as well. Multiple stock quotes all in one call.
Also, using the full output size is grabbing data from the past 20 years, if applicable. Take that out of your query and have the API use its condensed output default.
EDIT: As per the above, you should make changes to your query. But it could also be an issue with your server. I tested this for a use case I am working on, and it takes me a few seconds to get the data, although I am only pulling it for one stock symbol on a page at a time.
Try increasing your memory limit if things are too slow for your liking.
<?php
ini_set('memory_limit','500M'); // or your desired limit
?>
Also, if you have shared hosting, that might be the problem. However, I do not know enough about your server to answer that fully.
I have a large XLSX file, approximately 24 MB in size. It takes too much time even when I only have to read the first row. If Spout reads rows one by one, why does it take so long even if I only need the first row?
Following is the complete code:
require_once 'src/Spout/Autoloader/autoload.php';
$file_path = $_SERVER["DOCUMENT_ROOT"].'spout'.'/'.'testdata.xlsx';
use Box\Spout\Reader\ReaderFactory;
use Box\Spout\Common\Type;
libxml_disable_entity_loader(false);
try {
// Excel file location
$reader = ReaderFactory::create(Type::XLSX); //set Type file xlsx
$reader->open($file_path); //open the file
$i = 0;
/**
* Sheets iterator, in case there are multiple sheets
**/
foreach ($reader->getSheetIterator() as $sheet) {
//Rows iterator
foreach ($sheet->getRowIterator() as $row) {
echo $i."<hr>";
if($i==0) // if first row
{
print_r($row);
exit; // exit after reading first row
}
++$i;
}
exit;
}
echo "Total Rows : " . $i;
$reader->close();
echo "Peak memory:", (memory_get_peak_usage(true) / 1024 / 1024), " MB";
}
catch (Exception $e) {
echo $e->getMessage();
exit;
}
Can you help me understand the reason for this issue? How can I do it faster?
You can find test xlsx file at http://www.mediafire.com/file/y369njsaeeah1ip/testdata.xlsx
The Excel file contains the following:
number of rows : 999991
number of columns : 4 (i.e. MPN, CATEGORY, MFG, Description)
file size : approx. 24 MB; it does not contain any formatting.
There are 2 ways to store cell data in XLSX files:
the simplest one is to keep the cell values with the cell structure (i.e. cell "A1" contains "foo", "B1" contains "bar").
the other way is to keep track of the different values used in the spreadsheet and add a layer of indirection that helps remove duplicates: this translates to 2 files, one describing the structure (i.e. cell "A1" contains the value referenced by ID1, "B1" => ID2, "C1" => ID1) and one describing the values (ID1 => "foo", ID2 => "bar").
Method 2 optimizes the size of the file, since a value used N times will be stored only once (but referenced N times). However, to read these values, you now need to read 2 files and have the mapping ready when you read the structure. Basically, in order to read the first row, you read the structure to get the cells (A1, B1, C1) and then you need to resolve the values using the IDs.
The inline method is more straightforward, since everything is stored in the same place, so you can read the structure and the values at the same time. No need for a mapping table.
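If you want to check which method a given file uses, remember that an XLSX is just a ZIP archive; a small sketch (reusing the file name from the question) that looks for the shared-strings part:
<?php
// Method 2 shows up as an (often large) xl/sharedStrings.xml entry in the archive.
$zip = new ZipArchive();
if ($zip->open('testdata.xlsx') === true) {
    $index = $zip->locateName('xl/sharedStrings.xml');
    if ($index !== false) {
        $stat = $zip->statIndex($index);
        echo "Shared strings file found, uncompressed size: " . $stat['size'] . " bytes\n";
    } else {
        echo "No shared strings file: values are stored inline\n";
    }
    $zip->close();
}
?>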
Now back to your problem! The file you're trying to read is most likely using method 2 (a file describing the spreadsheet's structure + a file containing all the values). When Spout initializes the reader, it processes the file containing the values so that the mapping is ready whenever we start reading rows.
This processing can take a long time if there are a lot of values. Below a certain threshold (which depends on the amount of memory available), Spout loads the mapping [ID => value] into memory, which is pretty fast. However, if there are too many values, Spout decides that everything may not fit into memory and caches chunks of the mapping on disk. This process is definitely time-consuming...
So this is what's happening in your case. Hopefully that makes more sense now.
Eventually the threshold will be moved higher, as Spout is currently ultra defensive to avoid out of memory issues.
I'm passing some coordinates to Mongo to do a geo search. It works fine if the edges don't intersect, but when two lines intersect (for instance, in a figure eight), it gives a "Loop is not valid" error. Is there any way to find the intersections and split all these loops up?
Note there could be many.
EDIT: I added the sample query and error. Note that I understand why it's happening; I'm just wondering if there is some known way to split those loops up into separate polygons (some algorithm, or within Mongo).
Query:
db.items.find({
"address.location": {
"$geoWithin": {
"$geometry": {
"type": "Polygon",
"coordinates": [[
[-97.209091, 49.905691],
[-97.206345, 49.918072],
[-97.178879, 49.919399],
[-97.165146, 49.907903],
[-97.164459, 49.892865],
[-97.180939, 49.889326],
[-97.197418, 49.895077],
[-97.200165, 49.902596],
[-97.203598, 49.919399],
[-97.216644, 49.928682],
[-97.244797, 49.927356],
[-97.255096, 49.913209],
[-97.209091, 49.905691]
]]
}
}
}
});
Error:
Error: error: {
"waitedMS" : NumberLong(0),
"ok" : 0,
"errmsg" : "Loop is not valid: [
[ -97.209091, 49.905691 ]
[ -97.206345, 49.918072 ],
[ -97.17887899999999, 49.919399 ],
[ -97.16514599999999, 49.907903 ],
[ -97.16445899999999, 49.892865 ],
[ -97.180939, 49.889326 ],
[ -97.197418, 49.895077 ],
[ -97.200165, 49.902596 ],
[ -97.203598, 49.919399 ],
[ -97.216644, 49.928682 ],
[ -97.24479700000001, 49.927356 ],
[ -97.25509599999999, 49.913209 ],
[ -97.209091, 49.905691 ]
]
Edges 1 and 7 cross.
Edge locations in degrees: [-97.2063450, 49.9180720]-[-97.1788790, 49.9193990]
and [-97.2001650, 49.9025960]-[-97.2035980, 49.9193990]
",
"code" : 2
}
UPDATE
I've added an image of a brute force approach.
Polygon Slicing
Basically, it does a look-ahead on intersections.
If it finds one, it swaps out the points so that it stays within a loop.
It would add the cut-off point as a "starting point" in some queue.
When a look-ahead goes around and finds its own starting point, we have a loop.
Then keep going through the "starting point" queue until it's empty.
The new set of polygons should contain all separate loops (in theory).
There are some issues with this, though; it can get quite expensive going through all these loops. Say a max of 50 points would be about 1275 operations.
Also, dealing with wrap-around at the 0/180 degree coordinates could be a challenge.
Anyway, I didn't want to spend a whole day on this; I could even live with a solution that does NOT deal with the wrap-around condition.
I'm hoping there is already a nice algorithm for this somewhere that I can just drop in (it probably has some fancy technical term).
It would also be great if there were a more efficient approach than the brute-force look-ahead.
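For reference, the core of the brute-force check I have in mind is just a pairwise edge-intersection test; a rough sketch in PHP (helper names are made up for illustration):
<?php
// Detects which edges of the ring cross each other, nothing more.
function crossZ($o, $a, $b) {
    // z-component of (a - o) x (b - o); its sign gives the turn direction
    return ($a[0] - $o[0]) * ($b[1] - $o[1]) - ($a[1] - $o[1]) * ($b[0] - $o[0]);
}
function segmentsCross($p1, $p2, $p3, $p4) {
    // proper intersection test (ignores collinear/touching edge cases)
    return ((crossZ($p3, $p4, $p1) > 0) !== (crossZ($p3, $p4, $p2) > 0))
        && ((crossZ($p1, $p2, $p3) > 0) !== (crossZ($p1, $p2, $p4) > 0));
}
// $ring is the polygon's coordinate list, with the last point equal to the first
function findCrossingEdges(array $ring) {
    $crossings = [];
    $n = count($ring) - 1; // number of edges
    for ($i = 0; $i < $n; $i++) {
        for ($j = $i + 2; $j < $n; $j++) {
            if ($i === 0 && $j === $n - 1) continue; // these edges share the closing point
            if (segmentsCross($ring[$i], $ring[$i + 1], $ring[$j], $ring[$j + 1])) {
                $crossings[] = [$i, $j]; // e.g. [1, 7] for the polygon above
            }
        }
    }
    return $crossings;
}
?>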
It is because two of your coordinates share the same value (49.919399), which creates an anomaly in the polygon shape: [-97.1788790, 49.9193990] and [-97.2035980, 49.9193990]. In your code, remove or change either of the duplicate coordinates:
"coordinates": [[
[-97.209091, 49.905691],
[-97.206345, 49.918072],
[-97.178879, 49.919399], // this line
[-97.165146, 49.907903],
[-97.164459, 49.892865],
[-97.180939, 49.889326],
[-97.197418, 49.895077],
[-97.200165, 49.902596],
[-97.203598, 49.919399], // and this one
[-97.216644, 49.928682],
[-97.244797, 49.927356],
[-97.255096, 49.913209],
[-97.209091, 49.905691]
]]
As I mentioned in the comments, a better tool for querying spatial data would be PostGIS.
For example, PostGIS has ST_IsValidReason() to find the problem with a polygon and ST_MakeValid() to fix it, but if that is not an option, then I would create a service available to your PHP script using the Shapely Python library: https://github.com/Toblerity/Shapely
http://toblerity.org/shapely/manual.html
The first premise of Shapely is that Python programmers should be able
to perform PostGIS type geometry operations outside of an RDBMS
Shapely is certainly a popular tool, and you should be able to get further help with it on Stack Overflow or gis.stackexchange.com from more experienced users.
I think the problem you are trying to solve is not as trivial as it sounds.
So I would perform the following steps:
1. Find out how to do it with Shapely.
similar questions:
Splitting self-intersecting polygon only returned one polygon in shapely in Python
2. Create a simple PHP service which passes the query details to a Python script.
3. Shapely will not prevent the creation of an invalid polygon, but exceptions will be raised when it is operated on. So basically, on such an exception, I would call the script from step 1.
If this is just a one-off query, I would just use QGIS ( import the points as CSV, create a new shapefile layer of type polygon, and use the Numerical Vertex Edit plugin ).
update
I think the polygon should be fixed by the tool that created it, so in this case that would be the user. Maybe you can fix the problem by looking at it from a different angle: if the shape comes from the user and is drawn on a Google Map in your app, maybe enforce no intersections during drawing.
I also found this: Drawing a Polygon ( sorting points and creating a polygon with no intersection )
I am trying to parse XML files to store data in a database. I have written the code below in PHP, and I can successfully run it.
But the problem is that it requires around 8 minutes to read a complete file (which is around 30 MB), and I have to parse around 100 files each hour.
So, obviously, my current code is of no use to me. Can anybody advise a better solution? Or should I switch to another coding language?
What I gather from the net is that I can do it with Perl/Python or something called XSLT (which I am not so sure about, frankly).
$xml = new XMLReader();
$xml->open($file);
while ($xml->name === 'node1'){
$node = new SimpleXMLElement($xml->readOuterXML());
foreach($node->node2 as $node2){
//READ
}
$xml->next('node1');
}
$xml->close();
Here's an example of a script I used to parse the WURFL XML database found here.
I used the ElementTree module for Python and wrote out a JavaScript array - although you can easily modify my script to write a CSV instead (just change the final 3 lines).
import xml.etree.ElementTree as ET
tree = ET.parse('C:/Users/Me/Documents/wurfl.xml')
root = tree.getroot()
dicto = {} #to store the data
for device in root.iter("device"): #parse out the device objects
dicto[device.get("id")] = [0, 0, 0, 0] #set up a list to store the needed variables
for child in device: #iterate through each device
if child.get("id") == "product_info": #find the product_info id
for grand in child:
if grand.get("name") == "model_name": #and the model_name id
dicto[device.get("id")][0] = grand.get("value")
dicto[device.get("id")][3] +=1
elif child.get("id") == "display": #and the display id
for grand in child:
if grand.get("name") == "physical_screen_height":
dicto[device.get("id")][1] = grand.get("value")
dicto[device.get("id")][3] +=1
elif grand.get("name") == "physical_screen_width":
dicto[device.get("id")][2] = grand.get("value")
dicto[device.get("id")][3] +=1
if not dicto[device.get("id")][3] == 3: #make sure I had enough
#otherwise it's an incomplete dataset
del dicto[device.get("id")]
arrays = []
for key in dicto.keys(): #sort this all into another list
arrays.append(key)
arrays.sort() #and sort it alphabetically
with open('C:/Users/Me/Documents/wurfl1.js', 'w') as new: #now to write it out
for item in arrays:
new.write('{\n id:"'+item+'",\n Product_Info:"'+dicto[item][0]+'",\n Height:"'+dicto[item][1]+'",\n Width:"'+dicto[item][2]+'"\n},\n')
Just counted this as I ran it again - took about 3 seconds.
In Perl you could use XML::Twig, which is designed to process huge XML files (bigger than can fit in memory)
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
my $file= shift @ARGV;
XML::Twig->new( twig_handlers => { 'node1/node2' => \&read_node })
->parsefile( $file);
sub read_node
{ my( $twig, $node2)= @_;
# your code, the whole node2 string is $node2->sprint
$twig->purge; # if you want to reduce memory footprint
}
You can find more info about XML::Twig at xmltwig.org
In the case of Python, I would recommend using lxml.
As you are having performance problems, I would recommend iterating through your XML and processing things part by part; this saves a lot of memory and is likely to be much faster.
I am reading a 10 MB XML file within 3 seconds on an old server; your situation might be different.
About iterating with lxml: http://lxml.de/tutorial.html#tree-iteration
Review this line of code:
$node = new SimpleXMLElement($xml->readOuterXML());
The documentation for readOuterXML() has a comment that it sometimes attempts to resolve namespaces, etc. Anyway, this is where I would suspect a big performance problem.
Consider using readInnerXML() if you can.
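If that is not enough, another common pattern is to expand() the current node into the DOM and import it, instead of re-parsing the string that readOuterXML() returns. A rough sketch, assuming the same node1/node2 layout as in the question:
<?php
// expand() hands the current node to the DOM directly, so each node does not
// need to be serialized and re-parsed by SimpleXMLElement.
$xml = new XMLReader();
$xml->open($file);
$doc = new DOMDocument();
while ($xml->read() && $xml->name !== 'node1'); // move to the first node1
while ($xml->name === 'node1') {
    $node = simplexml_import_dom($doc->importNode($xml->expand(), true));
    foreach ($node->node2 as $node2) {
        //READ
    }
    $xml->next('node1');
}
$xml->close();
?>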
This is going to be a nice little brainbender I think. It is a real life problem, and I am stuck trying to figure out how to implement it. I don't expect it to be a problem for years, and at that point it will be one of those "nice problems to have".
So, I have documents in my search engine index. The documents can have a number of fields, however, each field size must be limited to only 100kb.
I would like to store the IDs of the particular sites which have access to this document. The site ID count is low, so it is never going to get up into the extremely high numbers.
So, for example, this document here can be accessed by sites which have an ID of 7 or 10.
Document: {
docId: "1239"
text: "Some Cool Document",
access: "7 10"
}
Now, because the "access" field is limited to 100kb, that means that if you were to take consecutive IDs, only 18917 unique IDs could be stored.
Reference:
http://codepad.viper-7.com/Qn4N0K
<?php
$ids = range(1,18917);
$ids = implode(" ", $ids);
echo mb_strlen($ids, '8bit') / 1024 . "kb";
?>
// Output
99.9951171875kb
In my application, a particular site, with site ID 7, tries to search, and it will have access to "Some Cool Document".
So now, my question would be, is there any way, that I could some how fit more IDs into that field?
I've thought about proper encoding, and applying something like a Huffman Tree, but seeing as each document has different IDs, it would be impossible to apply a single encoding set to every document.
Perhaps I could use something like tokenized Roman numerals?
Anyway, I'm open to ideas.
I should add that I want to keep all IDs in the same field for as long as possible. Searching over a second field will have a considerable performance hit, so I will only switch to using a second access2 field when I have milked the access field for as long as possible.
Edit:
Convert to Hex
<?php
function hexify(&$item){
$item = dechex($item);
}
$ids = range(1,21353);
array_walk( $ids, "hexify");
$ids = implode(" ", $ids);
echo mb_strlen($ids, '8bit') / 1024 . "kb";
?>
This increases the capacity to 21353 consecutive IDs.
So that is up about 12.8%.
Important Caveat
I think the fact that my fields can only store UTF encoded characters makes it next to impossible to get anything more out of it.
Where did 18917 come from? 100kb is a big number.
You have 100,000 or so bytes. Each byte can hold a value up to 255 if you store it as a raw number.
If you encode as hex, you get 16^100,000 possible values, which is a very large number, and that's just hex encoding.
What about base64? You pack 3 bytes into a 4-byte space (a little loss), but you get 64 possible values per character. So 64^100,000. That's a big number.
You won't have any problems. Just do a simple hex encoding.
EDIT:
TL;DR
Let's say you use base64: each character can take one of 64 values instead of the 10 a decimal digit gives you, so you could fit roughly 6.4 times more data in the same spot. No compression needed.
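For example, a rough sketch of that idea in PHP, reusing the 1..18917 range from the question:
<?php
// Pack each ID into 3 raw bytes (enough for IDs up to 16,777,215), then
// base64-encode so the field stays plain UTF text. 4 base64 characters per ID
// means roughly 25,600 IDs per 100kb, versus 18,917 for space-separated decimal.
$ids = range(1, 18917);
$packed = '';
foreach ($ids as $id) {
    $packed .= substr(pack('N', $id), 1); // drop the high byte, keep the low 3 bytes
}
$encoded = base64_encode($packed);
echo mb_strlen($encoded, '8bit') / 1024 . "kb"; // ~73.9kb for the same 18917 IDs
?>
Decoding is the reverse: base64_decode() the field, then unpack('N', "\0" . $chunk) for each 3-byte chunk.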
How about using data compression?
$ids = range(1,18917);
$ids = implode(" ", $ids);
$ids = gzencode($ids);
echo mb_strlen($ids, '8bit') / 1024 . "kb"; // 41.435546875kb