Combining three dataframes with different timestamps using duration match - python

I have three dataframes with different timestamps and frequencies. I want to combine them into one dataframe.
The first dataframe collects sunlight measurements, as given below:
df1 =
index light_data
05/01/2019 06:54:00.000 10
05/01/2019 06:55:00.000 20
05/01/2019 06:56:00.000 30
05/01/2019 06:57:00.000 40
05/01/2019 06:59:00.000 50
05/01/2019 07:01:00.000 60
05/01/2019 07:03:00.000 70
05/01/2019 07:04:00.000 80
05/01/2019 07:06:00.000 90
The second dataframe collects solar power from unit-A:
df2 =
index P1
05/01/2019 06:54:24.000 100
05/01/2019 06:59:32.000 200
05/01/2019 07:04:56.000 300
The third dataframe collects solar power from unit-B:
df3 =
index P2
05/01/2019 06:56:45.000 400
05/01/2019 07:01:21.000 500
05/01/2019 07:06:34.000 600
The three dataframes above are measurements coming from the field, each with its own timestamps. Now I want to combine all three into one dataframe with a single timestamp index.
df1 data occurs every minute.
df2 and df3 occur every five minutes, at different times.
Combine the three dataframes using the df2 timestamps, stripped of their seconds, as the reference index.
Finally, I want the output to look something like this:
df_combine =
combine_index P1 light_data1 P2 light_data2
05/01/2019 06:54:00 100 10 400 30
05/01/2019 06:59:00 200 50 500 60
05/01/2019 07:04:00 300 80 600 90
# Note: combine_index is df2 index with no seconds
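For reproducibility, the three frames can be built like this (a sketch; the dates are assumed to be 1 May 2019, as in the answer output below):
import pandas as pd

# Reconstruction of the example frames from the tables above.
df1 = pd.DataFrame(
    {'light_data': [10, 20, 30, 40, 50, 60, 70, 80, 90]},
    index=pd.to_datetime(['2019-05-01 06:54', '2019-05-01 06:55',
                          '2019-05-01 06:56', '2019-05-01 06:57',
                          '2019-05-01 06:59', '2019-05-01 07:01',
                          '2019-05-01 07:03', '2019-05-01 07:04',
                          '2019-05-01 07:06']))
df2 = pd.DataFrame(
    {'P1': [100, 200, 300]},
    index=pd.to_datetime(['2019-05-01 06:54:24', '2019-05-01 06:59:32',
                          '2019-05-01 07:04:56']))
df3 = pd.DataFrame(
    {'P2': [400, 500, 600]},
    index=pd.to_datetime(['2019-05-01 06:56:45', '2019-05-01 07:01:21',
                          '2019-05-01 07:06:34']))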

Nice question. I am using reindex with method='nearest' as method 1:
df1['row'] = df1.index                           # keep df1's timestamps for later
s1 = df1.reindex(df2.index, method='nearest')    # light reading nearest each df2 time
s2 = df1.reindex(df3.index, method='nearest')    # light reading nearest each df3 time
s1 = s1.join(df2).set_index('row')               # re-key on the matched df1 minute
s2 = s2.join(df3).set_index('row')
pd.concat([s1, s2.reindex(s1.index, method='nearest')], axis=1)
Out[67]:
light_data P1 light_data P2
row
2019-05-01 06:54:00 10 100 40 400
2019-05-01 06:59:00 50 200 60 500
2019-05-01 07:04:00 80 300 90 600
Or, replacing the last line with merge_asof:
pd.merge_asof(s1, s2, left_index=True, right_index=True, direction='nearest')
Out[81]:
light_data_x P1 light_data_y P2
row
2019-05-01 06:54:00 10 100 40 400
2019-05-01 06:59:00 50 200 40 400
2019-05-01 07:04:00 80 300 90 600
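The same idea can also be applied directly to the raw frames, flooring the df2 timestamps to whole minutes as the question asks. A sketch, assuming df1, df2 and df3 as built above (merge_asof needs sorted indexes, which they have):
import pandas as pd

# Attach the nearest light reading to each power reading.
t1 = pd.merge_asof(df2, df1.rename(columns={'light_data': 'light_data1'}),
                   left_index=True, right_index=True, direction='nearest')
t2 = pd.merge_asof(df3, df1.rename(columns={'light_data': 'light_data2'}),
                   left_index=True, right_index=True, direction='nearest')
# Use the df2 timestamps, with the seconds dropped, as the reference index.
t1.index = t1.index.floor('min')
t2.index = t2.index.floor('min')
pd.merge_asof(t1, t2, left_index=True, right_index=True, direction='nearest')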
Make it extendable:
df1['row'] = df1.index
l = []
for i, x in enumerate([df2, df3]):
    s1 = df1.reindex(x.index, method='nearest')
    if i == 0:
        l.append(s1.join(x).set_index('row').add_suffix(x.columns[0][-1]))
    else:
        l.append(s1.join(x).set_index('row')
                   .reindex(l[0].index, method='nearest')
                   .add_suffix(x.columns[0][-1]))
pd.concat(l, axis=1)

Related

Calculate Multiple Column Growth in Python Dataframe

The data I used looks like this:
data
Subject 2000_X1 2000_X2 2001_X1 2001_X2 2002_X1 2002_X2
1 100 50 120 45 110 50
2 95 40 100 45 105 50
3 110 45 100 45 110 40
I want to calculate each variable's growth for each year, so the result will look like this:
Subject 2001_X1_gro 2001_X2_gro 2002_X1_gro 2002_X2_gro
1 0.2 -0.1 -0.08333 0.11111
2 0.052632 0.125 0.05 0.11111
3 -0.09091 0 0.1 -0.11111
I already did it manually for each variable and each year, with code like this:
data['2001_X1_gro'] = (data['2001_X1'] - data['2000_X1']) / data['2000_X1']
data['2002_X1_gro'] = (data['2002_X1'] - data['2001_X1']) / data['2001_X1']
data['2001_X2_gro'] = (data['2001_X2'] - data['2000_X2']) / data['2000_X2']
data['2002_X2_gro'] = (data['2002_X2'] - data['2001_X2']) / data['2001_X2']
Is there a way to do it more efficiently, especially if I have more years and/or more variables?
import pandas as pd
df = pd.read_csv('data.txt', sep=',', header=0)
Input
Subject 2000_X1 2000_X2 2001_X1 2001_X2 2002_X1 2002_X2
0 1 100 50 120 45 110 50
1 2 95 40 100 45 105 50
2 3 110 45 100 45 110 40
Next, a loop is created and the columns are filled:
qqq = '_gro'
new_name = ''
year = ''
for i in range(1, len(df.columns) - 2):
    year = str(int(df.columns[i][:4]) + 1) + df.columns[i][4:]
    new_name = year + qqq
    df[new_name] = (df[year] - df[df.columns[i]]) / df[df.columns[i]]
print(df)
Output
Subject 2000_X1 2000_X2 2001_X1 2001_X2 2002_X1 2002_X2 2001_X1_gro \
0 1 100 50 120 45 110 50 0.200000
1 2 95 40 100 45 105 50 0.052632
2 3 110 45 100 45 110 40 -0.090909
2001_X2_gro 2002_X1_gro 2002_X2_gro
0 -0.100 -0.083333 0.111111
1 0.125 0.050000 0.111111
2 0.000 0.100000 -0.111111
In the loop, the year is extracted from the column name, converted to int, and 1 is added to it. The value is converted back to a string and the '_Xn' suffix is re-attached. A new_name variable is built from it by appending the string '_gro'. A column with that name is created and filled with the calculated values.
If you want to compute growth over, for example, three years, then you need to add not 1 but 3. This assumes your data is ordered by year. Also note that the loop does not go through all the columns: for i in range(1, len(df.columns) - 2) skips the Subject column and stops two columns short of the end, so you need to know where to stop it.
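For many years and variables, a vectorised alternative is to move the year into a column level and let pct_change do the work. A sketch, with the frame reconstructed from the question's table:
import pandas as pd

df = pd.DataFrame({
    'Subject': [1, 2, 3],
    '2000_X1': [100, 95, 110], '2000_X2': [50, 40, 45],
    '2001_X1': [120, 100, 100], '2001_X2': [45, 45, 45],
    '2002_X1': [110, 105, 110], '2002_X2': [50, 50, 40],
}).set_index('Subject')

# Split 'year_variable' labels into a (variable, year) MultiIndex.
df.columns = pd.MultiIndex.from_tuples(
    [(c.split('_')[1], int(c.split('_')[0])) for c in df.columns],
    names=['variable', 'year'])

# One row per (Subject, variable), one column per year; growth between
# consecutive years, dropping the first year (no prior year to compare).
long = df.stack('variable').sort_index(axis=1)
growth = long.pct_change(axis=1).iloc[:, 1:]

# Back to wide form with '2001_X1_gro'-style column names.
out = growth.unstack('variable')
out.columns = [f'{year}_{var}_gro' for year, var in out.columns]
print(out)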

Pandas multiindex into columns

I have a Dataframe that looks similar to this:
height length
cat max 30 50
mean 20 40
min 10 30
dog max 70 100
mean 50 90
min 30 60
and want to turn it into
height_max height_mean height_min length_max length_mean length_min
cat 30 20 10 50 40 30
dog 70 50 30 100 90 60
The column names themselves aren't important, so numbered columns would also be fine.
You can unstack and rework the column index:
df2 = df.unstack(1)                      # move the inner index level into the columns
df2.columns = df2.columns.map('_'.join)  # flatten ('height', 'max') -> 'height_max'
output:
height_max height_mean height_min length_max length_mean length_min
cat 30 20 10 50 40 30
dog 70 50 30 100 90 60
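For reference, a runnable sketch of the same two steps, with the frame reconstructed from the question:
import pandas as pd

# Reconstruction of the example frame.
df = pd.DataFrame(
    {'height': [30, 20, 10, 70, 50, 30],
     'length': [50, 40, 30, 100, 90, 60]},
    index=pd.MultiIndex.from_product([['cat', 'dog'],
                                      ['max', 'mean', 'min']]))

df2 = df.unstack(1)                      # move the stat level into the columns
df2.columns = df2.columns.map('_'.join)  # ('height', 'max') -> 'height_max'
print(df2)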

Back Filling Dataframe

I have a dataframe with 3 columns. Something like this:
Data Initial_Amount Current
31-01-2018
28-02-2018
31-03-2018
30-04-2018 100 100
31-05-2018 100 90
30-06-2018 100 80
I would like to populate the prior rows with the Initial Amount as such:
Data Initial_Amount Current
31-01-2018 100 100
28-02-2018 100 100
31-03-2018 100 100
30-04-2018 100 100
31-05-2018 100 90
30-06-2018 100 80
So:
find the first non-empty row with Initial_Amount populated;
use that to backfill the Initial_Amount values back to the starting date;
if it is the first row and Current is empty, copy Initial_Amount, otherwise copy the prior balance.
Regards,
Pandas fillna with fill method 'bfill' (which uses the next valid observation to fill the gap) should do what you're looking for:
In [13]: df.fillna(method='bfill')
Out[13]:
Data Initial_Amount Current
0 31-01-2018 100.0 100.0
1 28-02-2018 100.0 100.0
2 31-03-2018 100.0 100.0
3 30-04-2018 100.0 100.0
4 31-05-2018 100.0 90.0
5 30-06-2018 100.0 80.0
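Note that recent pandas releases deprecate fillna(method=...); bfill() is the equivalent spelling. A sketch with the frame rebuilt from the question:
import pandas as pd

df = pd.DataFrame({
    'Data': ['31-01-2018', '28-02-2018', '31-03-2018',
             '30-04-2018', '31-05-2018', '30-06-2018'],
    'Initial_Amount': [None, None, None, 100, 100, 100],
    'Current': [None, None, None, 100, 90, 80],
})

# bfill() fills each gap with the next valid observation, like
# fillna(method='bfill') above.
print(df.bfill())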

Pandas timeseries bins and indexing

I have some experimental data collected from a number of samples at set time intervals, in a dataframe organised like so:
Studynumber Time Concentration
1 20 80
1 40 60
1 60 40
2 15 95
2 44 70
2 65 30
Although the time intervals are supposed to be fixed, there is some variation in the data based on when they were actually collected. I want to create bins of the Time column, calculate an 'average' concentration, and then compare the difference between actual concentration and average concentration for each studynumber, at each time.
To do this, I created a column called 'roundtime', then used a groupby to calculate the mean:
data['roundtime']=data['Time'].round(decimals=-1)
meanconc = data.groupby('roundtime')['Concentration'].mean()
This gives a pandas series of the mean concentrations, with roundtime as the index. Then I want to get this back into the main frame to calculate the difference between each actual concentration and the mean concentration:
data['meanconcentration']=meanconc.loc[data['roundtime']].reset_index()['Concentration']
This works for the first 60 or so values, but then returns NaN for each entry, I think because the index of data is longer than the index of meanconcentration.
On the one hand, this looks like an indexing issue - equally, it could be that I'm just approaching this the wrong way. So my question is: a) can this method work? and b) is there another/better way of doing it? All advice welcome!
Use transform to add a column from a groupby aggregation; this will create a Series with its index aligned to the original df, so you can assign it back correctly:
In [4]:
df['meanconcentration'] = df.groupby('roundtime')['Concentration'].transform('mean')
df
Out[4]:
Studynumber Time Concentration roundtime meanconcentration
0 1 20 80 20 87.5
1 1 40 60 40 65.0
2 1 60 40 60 35.0
3 2 15 95 20 87.5
4 2 44 70 40 65.0
5 2 65 30 60 35.0
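The per-study deviation the question ultimately asks for is then a plain column subtraction. A sketch with the example data:
import pandas as pd

df = pd.DataFrame({'Studynumber': [1, 1, 1, 2, 2, 2],
                   'Time': [20, 40, 60, 15, 44, 65],
                   'Concentration': [80, 60, 40, 95, 70, 30]})

df['roundtime'] = df['Time'].round(decimals=-1)
df['meanconcentration'] = (df.groupby('roundtime')['Concentration']
                             .transform('mean'))
# Difference between each actual reading and its bin average:
df['deviation'] = df['Concentration'] - df['meanconcentration']
print(df)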

pandas subtracting two grouped dataframes of different size

I have two dataframes:
my stock solutions (df1):
pH salt_conc
5.5 0 23596.0
200 19167.0
400 17052.5
6.0 0 37008.5
200 27652.0
400 30385.5
6.5 0 43752.5
200 41146.0
400 39965.0
and my measurements after I did something (df2):
pH salt_conc id
5.5 0 8 20953.0
11 24858.0
200 3 20022.5
400 13 17691.0
20 18774.0
6.0 0 14 38639.0
200 1 37223.5
2 36597.0
7 37039.0
10 37088.5
15 35968.5
16 36344.5
17 34894.0
18 36388.5
400 9 33386.0
6.5 0 4 41401.5
12 44933.5
200 5 43074.5
400 6 42210.5
19 41332.5
I would like to normalize each measurement in the second dataframe (df2) by the corresponding stock solution from which I took the sample.
Any suggestions?
Figured it out with the help of this post:
SO: Binary operation broadcasting across multiindex
I had to reset the index of both grouped dataframes and set it again.
df_initial = df_initial.reset_index().set_index(['pH','salt_conc'])
df_second = df_second.reset_index().set_index(['pH','salt_conc'])
Now I can do any calculation I want.
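A minimal sketch of that normalisation step; 'signal' is a hypothetical name for the value column, which the question leaves unnamed:
import pandas as pd

# Stock solutions, keyed on (pH, salt_conc); a subset of the data above.
df_initial = pd.DataFrame(
    {'signal': [23596.0, 19167.0, 17052.5]},
    index=pd.MultiIndex.from_product([[5.5], [0, 200, 400]],
                                     names=['pH', 'salt_conc']))

# Measurements, with several samples (id) per stock solution.
df_second = pd.DataFrame(
    {'id': [8, 11, 3, 13, 20],
     'signal': [20953.0, 24858.0, 20022.5, 17691.0, 18774.0]},
    index=pd.MultiIndex.from_tuples(
        [(5.5, 0), (5.5, 0), (5.5, 200), (5.5, 400), (5.5, 400)],
        names=['pH', 'salt_conc']))

# Division aligns on the shared (pH, salt_conc) index, so each measurement
# is divided by its matching stock value, even with repeated keys.
normalized = df_second['signal'] / df_initial['signal']
print(normalized)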
