How to add interpolated values to pandas DataFrame? - python

I have the following pandas DataFrame called df:
timestamp param_1 param_2
0.000 -0.027655 0.0
0.25 -0.034012 0.0
0.50 -0.040369 0.0
0.75 -0.046725 0.0
1.00 -0.050023 0.0
1.25 -0.011015 0.0
1.50 -0.041366 0.0
1.75 -0.056723 0.0
2.00 -0.013081 0.0
Now I need to add two new columns created from the following lists:
timestamp_new = [0.5, 1.0, 1.5, 2.0]
param_3 = [10.0, 25.0, 15.0, 22.0]
The problem is that timestamp_new has a coarser granularity. Thus, I need to linearly interpolate param_3 (defined at the timestamp_new points) onto the finer timestamp grid of df.
Expected result (note that I filled in the param_3 values by hand just to show the expected format):
timestamp param_1 param_2 param_3
0.000 -0.027655 0.0 8.0
0.25 -0.034012 0.0 9.0
0.50 -0.040369 0.0 10.0
0.75 -0.046725 0.0 20.0
1.00 -0.050023 0.0 25.0
1.25 -0.011015 0.0 18.0
1.50 -0.041366 0.0 15.0
1.75 -0.056723 0.0 17.0
2.00 -0.013081 0.0 22.0
Is there any way to do it?

Let's try reindex() followed by interpolate():
ref_df = pd.Series(param_3, index=timestamp_new)
new_vals = (ref_df.reindex(df['timestamp'])
                  .interpolate('index')
                  .bfill()  # fill the first few NaNs
                  .ffill()  # fill the last few NaNs
           )
df['param_3'] = df['timestamp'].map(new_vals)
Output:
timestamp param_1 param_2 param_3
0 0.00 -0.027655 0.0 10.0
1 0.25 -0.034012 0.0 10.0
2 0.50 -0.040369 0.0 10.0
3 0.75 -0.046725 0.0 17.5
4 1.00 -0.050023 0.0 25.0
5 1.25 -0.011015 0.0 20.0
6 1.50 -0.041366 0.0 15.0
7 1.75 -0.056723 0.0 18.5
8 2.00 -0.013081 0.0 22.0
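For comparison, a NumPy-based sketch (not part of the answer above): np.interp does the linear interpolation and clamps points outside [0.5, 2.0] to the edge values, which reproduces the bfill/ffill step:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "timestamp": [0.0, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0],
    "param_1": [-0.027655, -0.034012, -0.040369, -0.046725, -0.050023,
                -0.011015, -0.041366, -0.056723, -0.013081],
    "param_2": 0.0,
})
timestamp_new = [0.5, 1.0, 1.5, 2.0]
param_3 = [10.0, 25.0, 15.0, 22.0]

# np.interp linearly interpolates onto df["timestamp"]; timestamps
# before 0.5 / after 2.0 are clamped to the first / last param_3 value
df["param_3"] = np.interp(df["timestamp"], timestamp_new, param_3)
```

This gives the same column as the reindex/interpolate approach, e.g. 17.5 at timestamp 0.75 and 20.0 at 1.25.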

Related

How to write data sequentially based on a key into another csv file

I'm trying to group my dataset by a column. Here is what I have tried so far:
grouped_df = df.groupby(["sender"])
for key, item in grouped_df:
    a_group = grouped_df[["rcvTime","pos_x","pos_y","pos_z","spd_x","spd_y","spd_z","acl_x","acl_y","acl_z","hed_x","hed_y","hed_z"]].get_group(key)
    print(a_group, "\n")
The output looks like the following:
for key=15
rcvTime pos_x pos_y pos_z spd_x spd_y spd_z acl_x acl_y acl_z \
0 25207.0 136.07 1118.46 0.0 0.00 0.00 0.0 0.00 0.00 0.0
1 25208.0 136.19 1117.14 0.0 0.22 -2.31 0.0 0.14 -1.48 0.0
3 25209.0 136.69 1113.79 0.0 0.39 -4.18 0.0 0.15 -1.64 0.0
5 25210.0 133.77 1108.01 0.0 0.58 -6.17 0.0 0.16 -1.76 0.0
7 25211.0 134.37 1100.75 0.0 0.76 -8.14 0.0 0.18 -1.93 0.0
for key=22
rcvTime pos_x pos_y pos_z spd_x spd_y spd_z acl_x acl_y acl_z \
2 25208.81 152.66 904.56 0.0 0.06 -0.75 0.0 0.18 -2.43 0.0
4 25209.81 152.98 902.59 0.0 0.22 -2.91 0.0 0.12 -1.68 0.0
6 25210.81 153.25 898.68 0.0 0.37 -4.65 0.0 0.11 -1.35 0.0
8 25211.81 153.82 893.00 0.0 0.65 -6.67 0.0 0.25 -2.54 0.0
for key=31
rcvTime pos_x pos_y pos_z spd_x spd_y spd_z acl_x acl_y acl_z \
25211.93 122.87 892.12 0.0 5.63 0.32 0.0 -1.57 -0.09 0.0
25212.93 127.24 892.36 0.0 3.30 0.19 0.0 -1.52 -0.09 0.0
25213.93 129.69 892.49 0.0 1.67 0.10 0.0 -1.54 -0.09 0.0
25214.93 130.79 892.55 0.0 0.71 0.04 0.0 -0.50 -0.03 0.0
Now what I need is, for each key, to lay its rows out sequentially on a single line and write them to a new csv file. For example, for key=31 the data should look like the following:
rcvTime,pos_x,pos_y,pos_z,spd_x,spd_y,spd_z,acl_x,acl_y,acl_z,rcvTime,pos_x,pos_y,pos_z,spd_x,spd_y,spd_z,acl_x,acl_y,acl_z,rcvTime,pos_x,pos_y,pos_z,spd_x,spd_y,spd_z,acl_x,acl_y,acl_z
25211.93,122.87,892.12,0.0,5.63,0.32,0.0,-1.57,-0.09,25212.93,127.24,892.36,0.0,3.30,0.19,0.0,-1.52,-0.09,0.0,25213.93,129.69,892.49,0.0,1.67,0.10,0.0,-1.54,-0.09,0.0,....
Each subsequent key should then be written on the next line of that csv file. I'd appreciate it if anyone could help me with this.
You can do something like this:
df = pd.DataFrame({"col1": [4, 5, 6], "col2": [9, 8, 7]})
df_new = pd.DataFrame()
for idx, row in df.iterrows():
    # take the single row as a 1-row frame and reset its index to 0
    df_temp = df.loc[idx:idx].reset_index(drop=True)
    # concatenating along axis=1 places each row side by side
    df_new = pd.concat([df_new, df_temp], axis=1)
df_new.to_csv('test.csv', index=False)
and the output csv looks like this:
col1,col2,col1,col2,col1,col2
4,9,5,8,6,7
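Applied back to the grouped question above, the same idea can be written per group. The following is a rough sketch on a hypothetical miniature of the data (only sender, rcvTime, pos_x), flattening each group's rows into one csv line:

```python
import pandas as pd

# hypothetical miniature of the asker's data: two senders, two of the columns
df = pd.DataFrame({
    "sender": [15, 15, 22],
    "rcvTime": [25207.0, 25208.0, 25208.81],
    "pos_x": [136.07, 136.19, 152.66],
})

rows = []
for key, group in df.groupby("sender"):
    # drop the key column, then flatten the group's rows into one long list
    rows.append(group.drop(columns="sender").to_numpy().ravel().tolist())

# ragged rows are padded with NaN so the frame stays rectangular;
# passing no path makes to_csv return the text instead of writing a file
csv_text = pd.DataFrame(rows).to_csv(index=False, header=False)
```

Each sender ends up on its own line, with its rows concatenated left to right.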

How to split a DataFrame on each different value in a column?

Below is an example DataFrame.
0 1 2 3 4
0 0.0 13.00 4.50 30.0 0.0,13.0
1 0.0 13.00 4.75 30.0 0.0,13.0
2 0.0 13.00 5.00 30.0 0.0,13.0
3 0.0 13.00 5.25 30.0 0.0,13.0
4 0.0 13.00 5.50 30.0 0.0,13.0
5 0.0 13.00 5.75 0.0 0.0,13.0
6 0.0 13.00 6.00 30.0 0.0,13.0
7 1.0 13.25 0.00 30.0 0.0,13.25
8 1.0 13.25 0.25 0.0 0.0,13.25
9 1.0 13.25 0.50 30.0 0.0,13.25
10 1.0 13.25 0.75 30.0 0.0,13.25
11 2.0 13.25 1.00 30.0 0.0,13.25
12 2.0 13.25 1.25 30.0 0.0,13.25
13 2.0 13.25 1.50 30.0 0.0,13.25
14 2.0 13.25 1.75 30.0 0.0,13.25
15 2.0 13.25 2.00 30.0 0.0,13.25
16 2.0 13.25 2.25 30.0 0.0,13.25
I want to split this into new dataframes when the row in column 0 changes.
0 1 2 3 4
0 0.0 13.00 4.50 30.0 0.0,13.0
1 0.0 13.00 4.75 30.0 0.0,13.0
2 0.0 13.00 5.00 30.0 0.0,13.0
3 0.0 13.00 5.25 30.0 0.0,13.0
4 0.0 13.00 5.50 30.0 0.0,13.0
5 0.0 13.00 5.75 0.0 0.0,13.0
6 0.0 13.00 6.00 30.0 0.0,13.0
7 1.0 13.25 0.00 30.0 0.0,13.25
8 1.0 13.25 0.25 0.0 0.0,13.25
9 1.0 13.25 0.50 30.0 0.0,13.25
10 1.0 13.25 0.75 30.0 0.0,13.25
11 2.0 13.25 1.00 30.0 0.0,13.25
12 2.0 13.25 1.25 30.0 0.0,13.25
13 2.0 13.25 1.50 30.0 0.0,13.25
14 2.0 13.25 1.75 30.0 0.0,13.25
15 2.0 13.25 2.00 30.0 0.0,13.25
16 2.0 13.25 2.25 30.0 0.0,13.25
I've tried adapting the following solutions without any luck so far:
Split array at value in numpy
Split a large pandas dataframe
Looks like you want to groupby the first column. You could create a dictionary from the groupby object, with the groupby keys as the dictionary keys:
out = dict(tuple(df.groupby(0)))
Alternatively, we could build a list from the groupby object. This is more useful when we want positional indexing rather than indexing by the grouping key:
out = [sub_df for _, sub_df in df.groupby(0)]
We could then index the dict based on the grouping key, or the list based on the group's position:
print(out[0])
0 1 2 3 4
0 0.0 13.0 4.50 30.0 0.0,13.0
1 0.0 13.0 4.75 30.0 0.0,13.0
2 0.0 13.0 5.00 30.0 0.0,13.0
3 0.0 13.0 5.25 30.0 0.0,13.0
4 0.0 13.0 5.50 30.0 0.0,13.0
5 0.0 13.0 5.75 0.0 0.0,13.0
6 0.0 13.0 6.00 30.0 0.0,13.0
Based on
I want to split this into new dataframes when the row in column 0 changes.
If you only want to start a new group when the value in column 0 changes, you can try:
d=dict([*df.groupby(df['0'].ne(df['0'].shift()).cumsum())])
print(d[1])
print(d[2])
0 1 2 3 4
0 0.0 13.0 4.50 30.0 0.0,13.0
1 0.0 13.0 4.75 30.0 0.0,13.0
2 0.0 13.0 5.00 30.0 0.0,13.0
3 0.0 13.0 5.25 30.0 0.0,13.0
4 0.0 13.0 5.50 30.0 0.0,13.0
5 0.0 13.0 5.75 0.0 0.0,13.0
6 0.0 13.0 6.00 30.0 0.0,13.0
0 1 2 3 4
7 1.0 13.25 0.00 30.0 0.0,13.25
8 1.0 13.25 0.25 0.0 0.0,13.25
9 1.0 13.25 0.50 30.0 0.0,13.25
10 1.0 13.25 0.75 30.0 0.0,13.25
I will use GroupBy.__iter__:
d = dict(df.groupby(df['0'].diff().ne(0).cumsum()).__iter__())
#d = dict(df.groupby(df[0].diff().ne(0).cumsum()).__iter__())
Note that if the same value reappears non-consecutively, separate groups will be created; with a plain groupby(0), those rows would all fall into the same group.
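To see the difference between the two keys, here is a minimal sketch with hypothetical data where a value reappears later:

```python
import pandas as pd

df = pd.DataFrame({0: [0.0, 0.0, 1.0, 1.0, 0.0]})

# grouping on the column value: both runs of 0.0 land in ONE group
by_value = [g for _, g in df.groupby(0)]

# grouping on runs of consecutive equal values: the runs stay separate
runs = df[0].ne(df[0].shift()).cumsum()
by_run = [g for _, g in df.groupby(runs)]

assert len(by_value) == 2  # groups: 0.0 and 1.0
assert len(by_run) == 3    # runs: 0.0, 1.0, 0.0
```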

Can't Re-Order Columns Data

My DataFrame's columns are not in sequential order. len(df.columns) shows the data has 3586 columns. How can I re-order the columns into numeric sequence?
ID V1 V10 V100 V1000 V1001 V1002 ... V990 V991 V992 V993 V994
A 1 9.0 2.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
B 1 1.2 0.1 3.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
C 2 8.6 8.0 2.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0
D 3 0.0 2.0 0.0 0.0 0.0 0.0 3.0 0.0 0.0 0.0 0.0
E 4 7.8 6.6 3.0 0.0 0.0 0.0 4.0 0.0 0.0 0.0 0.0
I used df = df.reindex(sorted(df.columns), axis=1) (based on this question: Re-ordering columns in pandas dataframe based on column name), but it's still not working, because the sort is lexicographic (V10 comes before V2).
Thank you
First get all columns that don't match the pattern V + number by filtering with str.contains; then sort the remaining columns, obtained via Index.difference, by their numeric part; finally concatenate the two lists and pass them to DataFrame.reindex. This puts all non-matching columns first, followed by the V + number columns in numeric order:
L1 = df.columns[~df.columns.str.contains(r'^V\d+$')].tolist()
L2 = sorted(df.columns.difference(L1), key=lambda x: int(x[1:]))
df = df.reindex(L1 + L2, axis=1)
print(df)
ID V1 V10 V100 V990 V991 V992 V993 V994 V1000 V1001 V1002
A 1 9.0 2.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
B 1 1.2 0.1 3.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
C 2 8.6 8.0 2.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
D 3 0.0 2.0 0.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
E 4 7.8 6.6 3.0 4.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
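On newer pandas (1.1+), sort_index also accepts a key callable, which can express the same natural sort in one step. A sketch on a tiny hypothetical frame:

```python
import pandas as pd

# tiny hypothetical frame with out-of-order "V" columns
df = pd.DataFrame([[1, 9.0, 2.9, 0.0]], columns=["ID", "V1", "V10", "V2"])

def natural_key(cols: pd.Index) -> pd.Index:
    # non-matching names (like "ID") get -1 so they sort first
    return pd.Index([int(c[1:]) if c.startswith("V") and c[1:].isdigit() else -1
                     for c in cols])

df = df.sort_index(axis=1, key=natural_key)
```

After sorting, the columns read ID, V1, V2, V10 rather than the lexicographic ID, V1, V10, V2.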

python script breaks output lines

If I run
import numpy as np
import pandas as pd
import sys
df = pd.read_csv(sys.argv[1]) # argv[0] is the script path; argv[1] is the input csv
description = df.groupby(['option','subcase']).describe()
totals = df.groupby('option').describe().set_index(np.array(['total'] * df['option'].nunique()), append=True)
description = pd.concat([description, totals]).sort_index()  # DataFrame.append was removed in pandas 2.0
print(description)
on .csv
option,subcase,cost,time
A,sub1,4,3
A,sub1,2,0
A,sub2,3,8
A,sub2,1,2
B,sub1,13,0
B,sub1,11,0
B,sub2,5,2
B,sub2,3,4
I get an output like this:
cost time \
count mean std min 25% 50% 75% max count
option subcase
A sub1 2.0 3.0 1.414214 2.0 2.50 3.0 3.50 4.0 2.0
sub2 2.0 2.0 1.414214 1.0 1.50 2.0 2.50 3.0 2.0
total 4.0 2.5 1.290994 1.0 1.75 2.5 3.25 4.0 4.0
B sub1 2.0 12.0 1.414214 11.0 11.50 12.0 12.50 13.0 2.0
sub2 2.0 4.0 1.414214 3.0 3.50 4.0 4.50 5.0 2.0
total 4.0 8.0 4.760952 3.0 4.50 8.0 11.50 13.0 4.0
mean std min 25% 50% 75% max
option subcase
A sub1 1.50 2.121320 0.0 0.75 1.5 2.25 3.0
sub2 5.00 4.242641 2.0 3.50 5.0 6.50 8.0
total 3.25 3.403430 0.0 1.50 2.5 4.25 8.0
B sub1 0.00 0.000000 0.0 0.00 0.0 0.00 0.0
sub2 3.00 1.414214 2.0 2.50 3.0 3.50 4.0
total 1.50 1.914854 0.0 0.00 1.0 2.50 4.0
This is annoying, especially if you want to save it as a .csv instead of displaying it in a console.
(e.g. python myscript.py my.csv > my.summary)
How do I stop this linebreak from happening?
Add pd.set_option to disable the line wrapping:
pd.set_option('expand_frame_repr', False)
print(description)
cost time
count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max
option subcase
A sub1 2.0 3.0 1.414214 2.0 2.50 3.0 3.50 4.0 2.0 1.50 2.121320 0.0 0.75 1.5 2.25 3.0
sub2 2.0 2.0 1.414214 1.0 1.50 2.0 2.50 3.0 2.0 5.00 4.242641 2.0 3.50 5.0 6.50 8.0
total 4.0 2.5 1.290994 1.0 1.75 2.5 3.25 4.0 4.0 3.25 3.403430 0.0 1.50 2.5 4.25 8.0
B sub1 2.0 12.0 1.414214 11.0 11.50 12.0 12.50 13.0 2.0 0.00 0.000000 0.0 0.00 0.0 0.00 0.0
sub2 2.0 4.0 1.414214 3.0 3.50 4.0 4.50 5.0 2.0 3.00 1.414214 2.0 2.50 3.0 3.50 4.0
total 4.0 8.0 4.760952 3.0 4.50 8.0 11.50 13.0 4.0 1.50 1.914854 0.0 0.00 1.0 2.50 4.0
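Also worth noting: the wrapping only affects the console repr. to_csv never wraps lines regardless of display options, so for the python myscript.py my.csv > my.summary use case you could emit CSV directly (a sketch on a small hypothetical frame):

```python
import pandas as pd

df = pd.DataFrame({"option": ["A", "A", "B"], "cost": [4, 2, 13]})
description = df.groupby("option").describe()

# passing no path makes to_csv return the text; writing it to stdout
# (or a file) sidesteps the display-width wrapping entirely
csv_text = description.to_csv()
print(csv_text)
```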

What is the effective way to have a pivot-table having pandas dataset columns as its rows?

Let's take as an example the following dataset:
make address all 3d our over length_total y
0 0.0 0.64 0.64 0.0 0.32 0.0 278 1
1 0.21 0.28 0.5 0.0 0.14 0.28 1028 1
2 0.06 0.0 0.71 0.0 1.23 0.19 2259 1
3 0.15 0.0 0.46 0.1 0.61 0.0 1257 1
4 0.06 0.12 0.77 0.0 0.19 0.32 749 1
5 0.0 0.0 0.0 0.0 0.0 0.0 21 1
6 0.0 0.0 0.25 0.0 0.38 0.25 184 1
7 0.0 0.69 0.34 0.0 0.34 0.0 261 1
8 0.0 0.0 0.0 0.0 0.9 0.0 25 1
9 0.0 0.0 1.42 0.0 0.71 0.35 205 1
10 0.0 0.0 0.0 0.0 0.0 0.0 23 0
11 0.48 0.0 0.0 0.0 0.48 0.0 37 0
12 0.12 0.0 0.25 0.0 0.0 0.0 491 0
13 0.08 0.08 0.25 0.2 0.0 0.25 807 0
14 0.0 0.0 0.0 0.0 0.0 0.0 38 0
15 0.24 0.0 0.12 0.0 0.0 0.12 227 0
16 0.0 0.0 0.0 0.0 0.75 0.0 77 0
17 0.1 0.0 0.21 0.0 0.0 0.0 571 0
18 0.51 0.0 0.0 0.0 0.0 0.0 74 0
19 0.3 0.0 0.15 0.0 0.0 0.15 155 0
I want to build a pivot table from the previous dataset, in which the columns (make, address, all, 3d, our, over, length_total) become rows holding their mean values, grouped by the column y. The following table is the expected result:
y
1 0
make 0.048 0.183
address 0.173 0.008
all 0.509 0.098
3d 0.01 0.02
our 0.482 0.123
over 0.139 0.052
length_total 626.7 250
Is it possible to get the desired result with the pivot_table method of a pandas DataFrame? If so, how?
Is there a more efficient way to do this?
Some people like using stack or unstack, but I prefer good ol' pd.melt to "flatten" or "unpivot" a frame:
>>> df_m = pd.melt(df, id_vars="y")
>>> df_m.pivot_table(index="variable", columns="y")
value
y 0 1
variable
3d 0.020 0.010
address 0.008 0.173
all 0.098 0.509
length_total 250.000 626.700
make 0.183 0.048
our 0.123 0.482
over 0.052 0.139
(If you want to preserve the original column order as the new row order, you can index into the result with .loc, something like df2.loc[df.columns].dropna(), where df2 is the pivoted frame above.)
Melting does the flattening, and preserves y as a column, putting the old column names as a new column called "variable" (which can be changed if you like):
>>> pd.melt(df, id_vars="y").head()
y variable value
0 1 make 0.00
1 1 make 0.21
2 1 make 0.06
3 1 make 0.15
4 1 make 0.06
After that we can call pivot_table as we would ordinarily.
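The same table can also be produced without melting, using a plain groupby. A sketch on a hypothetical miniature of the dataset:

```python
import pandas as pd

# hypothetical miniature of the dataset: two feature columns plus y
df = pd.DataFrame({
    "make": [0.0, 0.21, 0.0, 0.48],
    "all": [0.64, 0.5, 0.0, 0.0],
    "y": [1, 1, 0, 0],
})

# average every column per value of y, then transpose so the original
# columns become rows, matching the pivot_table output above
result = df.groupby("y").mean().T
```

Whether this is clearer than melt + pivot_table is mostly a matter of taste; both produce one row per original column and one column per y value.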
