How to efficiently store a pandas Series in PostgreSQL? - python

I am trying to store a pandas Series column with around 3000 rows in a new table in PostgreSQL. This column is part of an Excel file holding the time series data recorded by a number of different sensors. My Excel file looks like this:
dateTime sensor1 sensor2 ...sensor-n
2021-06-12 00:00:01 0.0,,,,
2021-06-13 00:00:03 0.0,1.0,,,
2021-06-14 00:00:05 0.0,,,,
...
If I store the sensor name and the pandas Series for each sensor, this will give me redundancy. Do you have any idea how I can store a pandas Series efficiently for different sensors in PostgreSQL? Please help, I am new to PostgreSQL. Thank you.
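One common way to avoid that redundancy is to store the readings in a single "long" table with one row per (timestamp, sensor, value), so empty cells never reach the database. A minimal sketch, assuming SQLAlchemy and a local PostgreSQL instance; the connection string, file name, and table name are placeholders:

import pandas as pd
from sqlalchemy import create_engine

# Placeholder credentials; replace with your own database details.
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/sensors")

# columns: dateTime, sensor1, sensor2, ..., sensor-n
df = pd.read_excel("sensors.xlsx")

# Reshape from wide (one column per sensor) to long (one row per reading),
# dropping the empty cells so only actual measurements are stored.
long_df = (
    df.melt(id_vars="dateTime", var_name="sensor", value_name="value")
      .dropna(subset=["value"])
)

# Append everything into one table; an index on (sensor, dateTime) can be
# added in PostgreSQL afterwards to keep per-sensor queries fast.
long_df.to_sql("sensor_readings", engine, if_exists="append", index=False)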

Related

Sorting dataframe rows by Day of Date wise

I have made my dataframe, but I want to sort it date-wise. For example, I want the data for 02.01.2016 to come just after 01.01.2016.
df_data_2311 = df_data_231.groupby('Date').agg({'Wind Offshore in [MW]': ['sum']})
df_data_2311 = pd.DataFrame(df_data_2311)
After running this, I got the below output. This dataframe has 2192 rows.
           Wind Offshore in [MW]
                             sum
Date
01.01.2016               5249.75
01.01.2017              12941.75
01.01.2018              19020.00
01.01.2019              13723.00
01.01.2020              17246.25
...                          ...
31.12.2017              21322.50
31.12.2018              13951.75
31.12.2019              21457.25
31.12.2020              16491.25
31.12.2021              35683.25
Kindly let me know how I would sort this data chronologically by date.
You can use the sort_values function in pandas.
df_data_2311.sort_values(by=["Date"])
However, in order to sort by the Date column you will need to call reset_index() on your grouped dataframe and then convert the date values to datetime using pd.to_datetime:
df_data_2311 = df_data_231.groupby('Date').agg({'Wind Offshore in [MW]': ['sum']}).reset_index()
df_data_2311["Date"] = pandas.to_datetime(df_data_2311["Date"], format="%d.%m.%Y")
df_data_2311 = df_data_2311.sort_values(by=["Date"])
I recommend reviewing the pandas docs.
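As a self-contained illustration of why the to_datetime conversion matters (the sample values below are invented for the example):

import pandas as pd

df = pd.DataFrame({
    "Date": ["02.01.2016", "01.06.2016", "01.01.2017"],
    "Wind Offshore in [MW]": [7000.0, 9000.0, 12941.75],
})

# Sorting the raw strings orders by day-of-month first (01.01.2017 before
# 02.01.2016), which is the behaviour seen in the question.
print(df.sort_values(by=["Date"]))

# After parsing with the day-first format, the sort is chronological.
df["Date"] = pd.to_datetime(df["Date"], format="%d.%m.%Y")
print(df.sort_values(by=["Date"]))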

Create django model instances from dataframe without server timeout

I am uploading a zipped file that contains a single CSV file. I unzip it on the server, load it into a dataframe, and then create Django objects from it.
This works fine until the dataframe becomes too large: with a big enough dataset the Django server shuts down. I assume this happens because every iteration of my loop increases memory usage, and when there is no memory left the process is killed. I am working with big datasets and want my code to work no matter how big the dataframe gets.
Imagine I have a df like this:
cola fanta sprite libella
2018-01-01 00:00:00 0 12 12 34
2018-01-01 01:00:00 1 34 23 23
2018-01-01 02:00:00 2 4 2 4
2018-01-01 03:00:00 3 24 2 4
2018-01-01 04:00:00 4 2 2 5
Imagine that there could be up to 1000 brand columns and more than half a million rows. Further imagine I have a model that saves this data in a JSONB field: every column becomes one Django object, with the column name as the name and the timestamps combined with the column's values as the JSONB field.
e.g.: name=fanta, json_data={ "2018-01-01 00:00:00": 12, "2018-01-01 01:00:00": 34 }
My code to unzip, load to df and then create a django instance is this:
df = pd.read_csv(
    file_decompressed.open(name_file),
    index_col=0,
)

for column_name in df:
    Drink.objects.create(
        name=column_name,
        json_data=df[column_name].to_dict(),
    )
As I said, this works, but my loop breaks after creating about 15 objects. Searching the internet I found that bulk_create could make this more efficient, but I have custom signals implemented, so that is not really a solution. I also thought about using to_sql, but since I have to restructure the data of the dataframe I don't think it will work. Maybe not use pandas at all? Maybe chunk the dataframe somehow?
I am looking for a solution that works independently of the number of columns; the number of rows is at most half a million. I also tried a while loop, but the same problem occurs.
Any ideas how I can make this work? Help is very much appreciated.
My dummy model could be as simple as:
class Juice(models.Model):
    name = models.CharField(...)
    json_data = models.JSONField(...)
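One way to keep memory bounded, sketched here under assumptions rather than tested against the project above, is to load only a small batch of brand columns at a time with usecols, so the full dataframe never sits in memory at once. The file path, batch size, and import path are placeholders, and the per-object create() is kept so the custom signals still fire:

import pandas as pd
from django.db import transaction

from myapp.models import Drink  # the model used in the loop above; app path is a placeholder

CSV_PATH = "drinks.csv"  # hypothetical path to the already-unzipped CSV
BATCH = 25               # number of brand columns loaded per pass; tune to the memory budget

# Read only the header row to learn the column names.
all_columns = pd.read_csv(CSV_PATH, nrows=0).columns.tolist()
index_column, brand_columns = all_columns[0], all_columns[1:]

for start in range(0, len(brand_columns), BATCH):
    cols = brand_columns[start:start + BATCH]
    # Load just the timestamp column plus a small batch of brand columns.
    chunk = pd.read_csv(CSV_PATH, usecols=[index_column, *cols], index_col=index_column)
    with transaction.atomic():
        for column_name in cols:
            Drink.objects.create(
                name=column_name,
                json_data=chunk[column_name].to_dict(),
            )
    del chunk  # release the slice before loading the next batch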

Grouping by week and product and summing over IDs in pandas

I have a pandas dataframe containing, amongst others, the columns Product_ID, Producttype and a Timestamp. It looks roughly like this:
df
ID    Product  Time
C561  PX       2017-01-01 00:00:00
T801  PT       2017-01-01 00:00:01
I already converted the Time column into the datetime format.
Now I would like to sum up the number of different IDs per Product in a particular week.
I already tried a for loop:
for data['Time'] in range(start='1/1/2017', end='8/1/2017'):
    data.groupby('Product')['ID'].sum()
But range requires an integer.
I also thought about using pd.Grouper with freq="1W" but then I don't know how to combine it with both Product and ID.
Any help is greatly appreciated!
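A sketch of the pd.Grouper idea mentioned above, counting the distinct IDs per Product and calendar week; the sample rows are invented to keep the snippet self-contained:

import pandas as pd

df = pd.DataFrame({
    "ID": ["C561", "T801", "C561", "A100"],
    "Product": ["PX", "PT", "PX", "PX"],
    "Time": pd.to_datetime([
        "2017-01-01 00:00:00",
        "2017-01-01 00:00:01",
        "2017-01-03 10:00:00",
        "2017-01-09 08:30:00",
    ]),
})

# Group by product and calendar week, then count the distinct IDs per group.
weekly_counts = (
    df.groupby(["Product", pd.Grouper(key="Time", freq="W")])["ID"]
      .nunique()
      .reset_index(name="unique_ids")
)
print(weekly_counts)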

Pandas does not retain frequency format when exporting to excel

I have a monthly dataframe, and after resampling to annual data I used pandas' to_period('M') to have the index shown in monthly format. That works fine in the notebook, but when I export to Excel the index appears in datetime format there.
How can I retain the format when exporting to Excel?
Data sample in Jupyter Notebook:
2014 1463 146.416667 1110.877414 197.230546 199.230546
Data sample in excel:
2014-01-01 00:00:00 1463 146.416667 1110.877414 197.230546 199.230546
@Arthur Gouveia: Thanks for your response!
I tried changing the data type to string and it worked, but is there any better solution?
if isinstance(TNA_BB2_a.index, pd.PeriodIndex):
    TNA_BB2_a.index = TNA_BB2_a.index.strftime('%Y')
if isinstance(tna_n_m_BB.index, pd.PeriodIndex):
    tna_n_m_BB.index = tna_n_m_BB.index.strftime('%Y-%m')
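A slightly more compact variant of the same workaround, converting the PeriodIndex to plain strings just before writing; the copy and the output file name are only illustrative:

import pandas as pd

annual = TNA_BB2_a.copy()  # the dataframe from the snippet above
if isinstance(annual.index, pd.PeriodIndex):
    # Excel has no Period type, so write the index as text instead of letting
    # it be coerced back into a full timestamp.
    annual.index = annual.index.astype(str)
annual.to_excel("annual_data.xlsx")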

Passing Quandl query to pandas dataframe

I'm using the Quandl database service API and its Python support to download stock financial data.
Right now I'm using the free SF0 database, which provides yearly operational financial data.
For example, this query passes the last 6-8 years of data for the stock "CRM" to the dataframe:
df=quandl.get('SF0/CRM_REVENUE_MRY')
df
Out[29]:
Value
Date
2010-01-31 1.305583e+09
2011-01-31 1.657139e+09
2012-01-31 2.266539e+09
2013-01-31 3.050195e+09
2014-01-31 4.071003e+09
2015-01-31 5.373586e+09
2016-01-31 6.667216e+09
What I want to do is to loop over a list of about 50 stocks and also grab 6-8 other columns from this database, using different query codes appended to the SF0/CRM_ part of the query.
qcolumns = ['REVUSD_MRY',
'GP_MRY',
'INVCAP_MRY',
'DEBT_MRY',
'NETINC_MRY',
'RETEARN_MRY',
'SHARESWADIL_MRY',
'SHARESWA_MRY',
'COR_MRY',
'FCF_MRY',
'DEBTUSD_MRY',
'EBITDAUSD_MRY',
'SGNA_MRY',
'NCFO_MRY',
'RND_MRY']
So, I think I need to:
a) run the query for each column and in each case append to the dataframe,
b) add column names to the dataframe,
c) create a dataframe for each stock (should this be a panel or a list of dataframes? Apologies, I'm new to pandas and dataframes and still on my learning curve),
d) write to CSV.
Could you suggest an approach or point me in the right direction?
This code works to do two queries (two columns of data, both date indexed), renames the columns, and then concatenates them.
df=quandl.get('SF0/CRM_REVENUE_MRY')
df = df.rename(columns={'Value': 'REVENUE_MRY'})
dfnext=quandl.get('SF0/CRM_NETINC_MRY')
dfnext = dfnext.rename(columns={'Value': 'CRM_NETINC_MRY'})
frames = [df, dfnext]
dfcombine = pd.concat([df, dfnext], axis=1) # now question is how to add stock tag "CRM" to frame
dfcombine
Out[39]:
REVENUE_MRY CRM_NETINC_MRY
Date
2010-01-31 1.305583e+09 80719000.0
2011-01-31 1.657139e+09 64474000.0
2012-01-31 2.266539e+09 -11572000.0
2013-01-31 3.050195e+09 -270445000.0
2014-01-31 4.071003e+09 -232175000.0
2015-01-31 5.373586e+09 -262688000.0
2016-01-31 6.667216e+09 -47426000.0
I can add a loop to this to get all the columns (there are around 15), but how do I tag each frame with its stock? Use a key? Use a 3D panel? Thanks for helping a struggling Python programmer!
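A sketch of one way to extend the snippet above to many tickers and indicator codes, tagging each stock's frame with an outer index level (pandas Panels are deprecated) and writing everything to one CSV; the ticker list is an example and only a subset of the qcolumns codes is shown:

import pandas as pd
import quandl

tickers = ["CRM", "MSFT", "ORCL"]               # example list; replace with your ~50 stocks
codes = ["REVUSD_MRY", "NETINC_MRY", "GP_MRY"]  # subset of the qcolumns list above

frames = {}
for ticker in tickers:
    columns = []
    for code in codes:
        # Each call returns a single 'Value' column indexed by Date, as in the question.
        data = quandl.get(f"SF0/{ticker}_{code}")
        columns.append(data.rename(columns={"Value": code}))
    # One dataframe per stock, with one column per indicator code.
    frames[ticker] = pd.concat(columns, axis=1)

# The dict keys become an outer "Ticker" level, tagging each frame with its stock.
combined = pd.concat(frames, names=["Ticker", "Date"])
combined.to_csv("fundamentals.csv")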
