How to add non-unique elements to existing array in Firestore? - python

I am using Firestore to store time series data pulled from a sensor. I push the data with Python, using the Firebase-Admin package for credential verification. I chose to store this data in arrays, where a given index corresponds to the same observation across the different fields. Is there a way to add non-unique elements to an array, or can arrays only store unique elements? If so, what data structure would you suggest for storing time series data?
I am trying to append observations to an existing array in Firestore, but ArrayUnion only adds an element if it is not already present in the array. When I execute the second chunk of code (which updates the existing arrays), only unique values are saved.
# Initialize arrays and push to Firestore
import datetime

import firebase_admin
from firebase_admin import credentials
from firebase_admin import firestore

cred = credentials.Certificate('path_to_certificate')
firebase_admin.initialize_app(cred)
db = firestore.client()

# One list per field; a given index is one observation across all lists
pred_v_arr = []
meas_v_arr = []
c_arr = []
soc_arr = []
cell_1_arr = []
cell_2_arr = []
cell_3_arr = []
exec_time_arr = []
curr_time_arr = []

# A single observation
pred_volt = 12562.70
meas_volt = 12362.70
current = 0.0
soc = 0.0
cell_1_volt = 4.32
cell_2_volt = 4.4
cell_3_volt = 4.23
exec_time = 0.4
curr_time = datetime.datetime.now()

pred_v_arr.append(pred_volt)
meas_v_arr.append(meas_volt)
c_arr.append(current)
soc_arr.append(soc)
cell_1_arr.append(cell_1_volt)
cell_2_arr.append(cell_2_volt)
cell_3_arr.append(cell_3_volt)
exec_time_arr.append(exec_time)
curr_time_arr.append(curr_time)

try:
    push_data = {
        u'time': curr_time_arr,
        u'vpred': pred_v_arr,
        u'vmeas': meas_v_arr,
        u'current': c_arr,
        u'soc': soc_arr,
        u'cell1': cell_1_arr,
        u'cell2': cell_2_arr,
        u'cell3': cell_3_arr,
        u'exectime': exec_time_arr
    }
    db.collection(u'battery1').document(u'day1').set(push_data)
except Exception as e:
    print(e)

# Add a new observation to the different arrays
db.collection(u'battery1').document(u'day1').update({'time': firestore.ArrayUnion(curr_time_arr)})
db.collection(u'battery1').document(u'day1').update({'vpred': firestore.ArrayUnion(pred_v_arr)})
db.collection(u'battery1').document(u'day1').update({'vmeas': firestore.ArrayUnion(meas_v_arr)})
db.collection(u'battery1').document(u'day1').update({'current': firestore.ArrayUnion(c_arr)})
db.collection(u'battery1').document(u'day1').update({'cell1': firestore.ArrayUnion(cell_1_arr)})
db.collection(u'battery1').document(u'day1').update({'cell2': firestore.ArrayUnion(cell_2_arr)})
db.collection(u'battery1').document(u'day1').update({'cell3': firestore.ArrayUnion(cell_3_arr)})
db.collection(u'battery1').document(u'day1').update({'exectime': firestore.ArrayUnion(exec_time_arr)})
db.collection(u'battery1').document(u'day1').update({'soc': firestore.ArrayUnion(soc_arr)})
In the screenshot above you can see that there are 8 elements in the "time" field (since every call to datetime.now() produces a unique timestamp), while all the other fields have only saved the unique data points that were sent (exectime and soc end up with only two data points after 8 calls to ArrayUnion).

When you use firestore.ArrayUnion, that operator's job is literally to ensure each value is present at most once in the array.
If you want to allow non-unique values, don't use firestore.ArrayUnion; instead, append the elements to the array yourself. This does require that you read the entire document (and therefore the array) first, add the element locally, and then write the result back.
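A minimal sketch of that read-modify-write approach, reusing the db client and the per-observation variables from the question (the three fields shown are just examples; the same pattern applies to the rest):
doc_ref = db.collection(u'battery1').document(u'day1')

# Read the current document and its arrays
data = doc_ref.get().to_dict() or {}

# Append the new (possibly duplicate) observations locally
data.setdefault('time', []).append(curr_time)
data.setdefault('soc', []).append(soc)
data.setdefault('exectime', []).append(exec_time)
# ... same for the remaining fields ...

# Write the complete arrays back
doc_ref.update({
    'time': data['time'],
    'soc': data['soc'],
    'exectime': data['exectime'],
})
If several writers can touch the same document at once, you would want to wrap this read-modify-write in a Firestore transaction so that concurrent updates don't overwrite each other.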

Related

python cut between partitioned column results

I use the code below in Spark (Scala) to get the partitioned columns.
scala> val part_cols= spark.sql(" describe extended work.quality_stat ").select("col_name").as[String].collect()
part_cols: Array[String] = Array(x_bar, p1, p5, p50, p90, p95, p99, x_id, y_id, # Partition Information, # col_name, x_id, y_id, "", # Detailed Table Information, Database, Table, Owner, Created Time, Last Access, Created By, Type, Provider, Table Properties, Location, Serde Library, InputFormat, OutputFormat, Storage Properties, Partition Provider)
scala> part_cols.takeWhile( x => x.length() != 0 ).reverse.takeWhile( x => x != "# col_name" )
res20: Array[String] = Array(x_id, y_id)
and I need to get similar output in Python. I'm struggling to replicate the same array operations in Python to get [y_id, x_id].
Below is what I tried.
>>> part_cols=spark.sql(" describe extended work.quality_stat ").select("col_name").collect()
Is this possible using Python?
part_cols in the question is an array of rows. So the first step is to convert it into an array of strings.
part_cols = spark.sql(...).select("col_name").collect()
part_cols = [row['col_name'] for row in part_cols]
Now the start and end of the part of the list that you are interested in can be calculated with
start_index = part_cols.index("# col_name") + 1
end_index = part_cols.index('', start_index)
Finally, a slice can be extracted from the list with these two values as start and end:
part_cols[start_index:end_index]
This slice will contain the values
['x_id', 'y_id']
If the output really should be reversed, the slice
part_cols[end_index-1:start_index-1:-1]
will contain the values
['y_id', 'x_id']
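Putting the pieces together, here is a small self-contained sketch; the Spark call is replaced with a hard-coded list (a shortened stand-in for the describe extended output) so it runs on its own:
# Stand-in for [row['col_name'] for row in part_cols]
part_cols = ['x_bar', 'p1', 'p5', '# Partition Information', '# col_name',
             'x_id', 'y_id', '']

start_index = part_cols.index('# col_name') + 1   # first partition column
end_index = part_cols.index('', start_index)      # the empty string ends the block

print(part_cols[start_index:end_index])           # ['x_id', 'y_id']
print(part_cols[end_index-1:start_index-1:-1])    # ['y_id', 'x_id']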

import all rows from dataset using SODA API Python

I'm trying to import the following dataset and store it in a pandas dataframe: https://data.nasa.gov/Space-Science/Meteorite-Landings/gh4g-9sfh/data
I use the following code:
import pandas as pd
import requests

r = requests.get('https://data.nasa.gov/resource/gh4g-9sfh.json')
meteor_data = r.json()
df = pd.DataFrame(meteor_data)
print(df.shape)
The resulting dataframe only has 1000 rows. I need it to have all 45,716 rows. How do I do this?
Check out the docs on the $limit parameter
The $limit parameter controls the total number of rows returned, and
it defaults to 1,000 records per request.
Note: The maximum value for $limit is 50,000 records, and if you
exceed that limit you'll get a 400 Bad Request response.
So you're just getting the default number of records back.
You will not be able to get more than 50,000 records in a single API call; for larger datasets that takes multiple calls, using $limit together with $offset.
Try:
https://data.nasa.gov/resource/gh4g-9sfh.json?$limit=50000
See Why am I limited to 1,000 rows on SODA API when I have an App Key
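For datasets larger than the cap, a rough sketch of paging through all the rows with requests, using $limit and $offset (the page size here is just the maximum allowed value):
import pandas as pd
import requests

url = 'https://data.nasa.gov/resource/gh4g-9sfh.json'
page_size = 50000      # maximum allowed value of $limit
records = []
offset = 0
while True:
    # For deterministic paging the SODA docs also suggest passing an $order parameter
    batch = requests.get(url, params={'$limit': page_size, '$offset': offset}).json()
    records.extend(batch)
    if len(batch) < page_size:
        break
    offset += page_size

df = pd.DataFrame(records)
print(df.shape)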
Do it like this and set the limit:
import pandas as pd
from sodapy import Socrata

# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password:
client = Socrata("data.nasa.gov", None)

# Example authenticated client (needed for non-public datasets):
# client = Socrata("data.nasa.gov",
#                  "MyAppToken",
#                  username="user@example.com",
#                  password="AFakePassword")

# First 2000 results, returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get("gh4g-9sfh", limit=2000)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)
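Since this particular dataset has about 45,716 rows, which is under the 50,000 cap on $limit, bumping the limit in the call above is enough to pull everything in one request, e.g.:
results = client.get("gh4g-9sfh", limit=50000)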

Get results in an Earth Engine python script

I'm trying to get the NDVI mean for every polygon in a feature collection with the Earth Engine Python API.
I think I succeeded in getting the result (a feature collection in a feature collection), but then I don't know how to get data out of it.
The data I want are the feature IDs and the NDVI mean for each feature.
import datetime
import ee

ee.Initialize()

# Feature collection
fc = ee.FeatureCollection("ft:1s57dkY_Sg_E_COTe3sy1tIR_U-5Gw-BQNwHh4Xel")
fc_filtered = fc.filter(ee.Filter.equals('NUM_DECS', 1))

# Image collection
Sentinel_collection1 = ee.ImageCollection('COPERNICUS/S2').filterBounds(fc_filtered)
Sentinel_collection2 = Sentinel_collection1.filterDate(datetime.datetime(2017, 1, 1),
                                                       datetime.datetime(2017, 8, 1))

# NDVI function to use with ee map
def NDVIcalc(image):
    red = image.select('B4')
    nir = image.select('B8')
    ndvi = nir.subtract(red).divide(nir.add(red)).rename('NDVI')
    # NDVI mean calculation with reduceRegions
    MeansFeatures = ndvi.reduceRegions(reducer=ee.Reducer.mean(),
                                       collection=fc_filtered,
                                       scale=10)
    return MeansFeatures

# Result whose information (feature IDs and NDVI mean) I don't know how to extract
result = Sentinel_collection2.map(NDVIcalc)
If the result is small, you can pull it into Python using result.getInfo(). That will give you a Python dictionary containing a list of FeatureCollections (which are themselves more dictionaries). However, if the results are large or the polygons cover large regions, you'll have to Export the collection instead.
That said, there are probably some other things you'll want to do first:
1) You might want to flatten() the collection, so it's not a collection of nested collections. It'll be easier to handle that way.
2) You might want to add a date to each result so you know what time the result came from. You can do that with a map on the result, inside your NDVIcalc function:
return MeansFeatures.map(lambda f: f.set('date', image.date().format()))
3) If what you really want is a time-series of NDVI over time for each polygon (most common), then restructuring your code to map over polygons first will be easier:
Sentinel_collection = (ee.ImageCollection('COPERNICUS/S2')
                       .filterBounds(fc_filtered)
                       .filterDate(ee.Date('2017-01-01'), ee.Date('2017-08-01')))

def GetSeries(feature):
    def NDVIcalc(img):
        red = img.select('B4')
        nir = img.select('B8')
        ndvi = nir.subtract(red).divide(nir.add(red)).rename(['NDVI'])
        return (feature
                .set(ndvi.reduceRegion(ee.Reducer.mean(), feature.geometry(), 10))
                .set('date', img.date().format("YYYYMMdd")))

    series = Sentinel_collection.map(NDVIcalc)

    # Get the time-series of values as two lists.
    values = series.reduceColumns(ee.Reducer.toList(2), ['date', 'NDVI']).get('list')
    return feature.set(ee.Dictionary(ee.List(values).flatten()))

result = fc_filtered.map(GetSeries)
print(result.getInfo())
4) And finally, if you're going to try to Export the result, you're likely to run into an issue where the columns of the exported table are selected from whatever columns the first feature has, so it's good to provide a "header" feature that has all columns (times), that you can merge() with the result as the first feature:
# Get all possible dates.
dates = ee.List(Sentinel_collection
                .map(lambda img: ee.Feature(None, {'date': img.date().format("YYYYMMdd")}))
                .aggregate_array('date'))

# Make a default value for every date.
header = ee.Feature(None, ee.Dictionary.fromLists(dates, ee.List.repeat(-1, dates.size())))
output = ee.FeatureCollection([header]).merge(result)

ee.batch.Export.table.toDrive(...)
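For reference, a minimal sketch of actually submitting that export; the description and file format are just placeholder choices:
task = ee.batch.Export.table.toDrive(
    collection=output,
    description='ndvi_time_series',
    fileFormat='CSV')
task.start()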

Storing Datetime in a matrix to be used to define points of interest (Python)

I have a bunch of CSV files that contain rows of dates corresponding to data, with column headers. Using pandas, I have been able to import the CSV files. Now, I made a CSV file that labels the points of interest by datetime, and I have used pandas to import this file as well. I need to store the start time and end time in a matrix/array/something that I can call later to parse my data, which is labeled with these dates. Currently, using pd.to_datetime I have been able to convert the strings in my CSVs to datetime, but I have no idea how to store this. This is my third day using Python, so I apologize for the newbie question. I am a relatively advanced user of Matlab. I will provide my code, but I will not be able to provide the data in question as it is not owned by me. Thanks guys!
NUMBER_OF_CLASSES = 4
SUBSPACE_DIMENSION = 3

from datetime import datetime
import pandas as pd
import pandas_datareader.data as web
import numpy as np
import matplotlib.pyplot as plt
import scipy.io as sio

PeriodList = pd.read_csv('IP_List.csv')
PeriodList = PeriodList.as_matrix()

# Pdata format:
# Pdata{hull, engine, 1}(:)   - datetime array of hull and engine P data
# Pdata{hull, engine, 2}(:,:) - parametric data corresponding to timestamps in datetime array
# Pdata{hull, engine, 3}(:)   - array of parametric channel labels
Pdata_1 = pd.read_csv('LPD-17_1A.csv')

[list_m, list_n] = PeriodList.shape
Pdata_1 = Pdata_1.as_matrix()

startdatetime = []
enddatetime = []

# Up to line 27 done on MatLab script
for d in range(0, list_m):
    Hull = PeriodList[d, 0]
    Engine = PeriodList[d, 1]
    startdatetime[d] = pd.to_datetime(PeriodList[d, 2])
    enddatetime[d] = pd.to_datetime(PeriodList[d, 3])
    #startdatetime = pd.to_datetime(PeriodList[d,2])
Instead of iterating through the dataframe, you can store the start and end dates in a new dataframe, convert the columns to datetimes, and then access the data with the iloc method:
dates = PeriodList[['START','END']]
dates['START'] = pd.to_datetime(dates['START'])
dates['END'] = pd.to_datetime(dates['END'])
# You can access the dates based on index using iloc
dates.iloc[3]
# If you want just the start date you can use the column name
dates.iloc[3]['START']
In case you want to store it under an existing data structure, you can use a dictionary with the index as keys and the dataframe rows as values:
start_end = dict(zip(dates.index, dates.values))
If you are looking for the difference between the end date and start date, you can simply subtract the columns, i.e.
dates['Difference'] = dates['END']-dates['START']
I suggest you go through the pandas documentation for more info about accessing the data.
Edit:
You can also use dictionaries in your code, i.e.
startdatetime = {}
enddatetime = {}

# Up to line 27 done on MatLab script
for d in range(0, list_m):
    Hull = PeriodList[d, 0]
    Engine = PeriodList[d, 1]
    startdatetime[d] = pd.to_datetime(PeriodList[d, 2])
    enddatetime[d] = pd.to_datetime(PeriodList[d, 3])
Hope this helps
Figured out a solution: pre-allocate lists of empty strings, so the loop can store a value at each index on every iteration. Since each placeholder is an empty string, there will not be a "cannot convert to float" error. Thanks for the help @Bharath Shetty
Code:
PeriodList = pd.read_csv('IP_List.csv')
PeriodList = PeriodList.as_matrix()

# Pdata format:
# Pdata{hull, engine, 1}(:)   - datetime array of hull and engine P data
# Pdata{hull, engine, 2}(:,:) - parametric data corresponding to timestamps in datetime array
# Pdata{hull, engine, 3}(:)   - array of parametric channel labels
Pdata_1 = pd.read_csv('LPD-17_1A.csv')

[list_m, list_n] = PeriodList.shape
#Pdata_1 = Pdata_1.as_matrix()

startdatetime = ['' for x in range(list_m)]
enddatetime = ['' for x in range(list_m)]

# Up to line 27 done on MatLab script
for d in range(0, list_m):
    Hull = PeriodList[d, 0]
    Engine = PeriodList[d, 1]
    startdatetime[d] = pd.to_datetime(PeriodList[d, 2])
    enddatetime[d] = pd.to_datetime(PeriodList[d, 3])
    #startdatetime = pd.to_datetime(PeriodList[d,2])

Trouble with passing bson objectid to numpy recarray in python 3

I am working on machine translating some text that is stored in a MongoDB database. I am trying to pull the data from the database and then store it in a numpy recarray. However, I keep getting errors when I try to save the ObjectId field to the recarray, despite the different type conversions and such that I have read about. Here is my code. Any suggestions would help.
# Pull the records from the DB into a result set
db_results_records_to_translate = \
    db_connector.db_fetch_untranslated_records_from_db(
        article_collection, rec_number)

# Create an empty numpy recarray to store the data
data_table_for_translation = np.zeros([db_results_records_to_translate.count(), 6],
                                      dtype=[('_id', np.str),
                                             ('article_raw_text', np.str),
                                             ('article_raw_date', np.str),
                                             ('translated', np.bool),
                                             ('translated_text', np.str),
                                             ('translated_date', np.str)])

# Write record data to the recarray
for index, r in enumerate(db_results_records_to_translate):
    data_table_for_translation[index, 0] = str(r['_id'])  # Line with errors!!!
    data_table_for_translation[index, 1] = r['article_raw_text']
    data_table_for_translation[index, 2] = r['article_raw_date']
    data_table_for_translation[index, 3] = r['translated']
So after running this code, I get an error TypeError: expected an object with a buffer interface.
Now I have tried to convert the objectid from bson to string using the str(ObjectId) function as referenced in the documentation, but no luck.
Any suggestions?
NOTE: I noticed that this error happens even for the non-id columns too, so even straight text has an issue.
There are errors in the definition of the array, including the dtype, and errors in indexing fields during the iteration.
This clip illustrates the changes I think you need to make to get this assignment to work:
import numpy as np

# fake data - a list of tuples
db_results_records_to_translate = [('12', 'raw text', 'raw date')]

# Create an empty numpy recarray to store the data
data_table_for_translation = np.zeros([1, ],
                                      dtype=[('_id', 'U10'),
                                             ('article_raw_text', 'U10'),
                                             ('article_raw_date', 'U10')])
# string dtype has to include a length
# I'm using unicode here (Python 3); 'S10' would do just as well (in py2)

# Write record data to the structured array
for index, r in enumerate(db_results_records_to_translate):
    data_table_for_translation[index]['_id'] = str(r[0])
    data_table_for_translation[index]['article_raw_text'] = r[1]
    data_table_for_translation[index]['article_raw_date'] = r[2]

print(db_results_records_to_translate)
Note that I index the 'fields' by name, not number. data_table... is a 1d array with n fields, not a 2d array with n columns. I'm indexing r by number because my mock data is a tuple, not the db named fields.
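As a quick check of that layout, fields can then be read back by name across the whole 1d array, or per record (this is plain NumPy structured-array indexing, not specific to the snippet above):
print(data_table_for_translation['_id'])                     # all ids
print(data_table_for_translation[0]['article_raw_text'])     # one record's text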
