group list of dataframes to another dataframe row by row - python

frames is a list of dataframes, each with the following structure and dimensions:
[ participant activity t phone_accel_x phone_accel_y \
0 1600 D 241598773279024 0.565436 1.049568
1 1600 D 241598823633028 0.502723 1.029012
2 1600 D 241598873987032 0.470794 1.002914
3 1600 D 241598924341036 0.490821 1.003417
4 1600 D 241598974695040 0.487980 1.033217
.. ... ... ... ... ...
195 1600 D 241608592309883 0.677391 0.918443
196 1600 D 241608642663887 0.673493 0.913030
197 1600 D 241608693017891 0.674655 0.913004
198 1600 D 241608743371894 0.679319 0.914433
199 1600 D 241608793725898 0.676576 0.913901
phone_accel_z phone_gyro_x phone_gyro_y phone_gyro_z
0 1.248711 -0.017212 -0.006581 -0.080116
1 1.197390 -0.121311 0.050491 -0.109368
2 1.224439 -0.324749 -0.007777 -0.148947
3 1.234429 -0.290535 -0.105310 -0.151757
4 1.223829 -0.100016 -0.093174 -0.112706
.. ... ... ... ...
195 1.250941 -0.008502 0.028063 0.019072
196 1.260808 -0.004811 0.027223 0.024403
197 1.266306 0.000024 0.022763 0.023875
198 1.258972 0.003954 0.012599 0.021185
199 1.259517 -0.006841 0.007218 0.012923
[200 rows x 9 columns],
participant activity t phone_accel_x phone_accel_y \
50 1600 D 241601290979316 0.667534 0.907823
51 1600 D 241601341333320 0.659705 0.917594
52 1600 D 241601391687324 0.650291 0.908096
53 1600 D 241601442041328 0.641641 0.901728
54 1600 D 241601492395332 0.659827 0.899954
.. ... ... ... ... ...
245 1600 D 241611110023497 0.673400 0.913214
246 1600 D 241611160377501 0.677467 0.912210
247 1600 D 241611210731505 0.681255 0.905807
248 1600 D 241611261085509 0.670614 0.904358
249 1600 D 241611311439513 0.668775 0.909658
phone_accel_z phone_gyro_x phone_gyro_y phone_gyro_z
50 1.277606 -0.031145 -0.012867 -0.057229
51 1.272129 -0.039413 0.005489 -0.044188
52 1.290153 -0.056169 0.004972 -0.065202
53 1.274855 -0.044967 -0.010766 -0.078963
54 1.290040 -0.046148 -0.010928 -0.075745
.. ... ... ... ...
245 1.246544 -0.006509 0.009480 0.009705
246 1.250491 -0.012193 0.010935 0.008721
247 1.256942 -0.006915 0.017657 0.008312
248 1.264303 -0.007985 0.019806 0.001612
249 1.265652 0.002644 0.007558 0.004734
[200 rows x 9 columns], etc
All the dataframes have the same dimensions, 200 rows x 9 columns, and len(frames) is 91999. I want to create a new dataframe that contains the values of all 200 rows of every dataframe in one row, but only for the columns phone_accel_x, phone_accel_y, phone_accel_z, phone_gyro_x, phone_gyro_y, phone_gyro_z and activity. The values of each dataframe will be added as a new row, so the new dataframe will have dimensions 91999 rows x 1201 columns (200 x 6 + 1).
sensors_frames = []
for i in range(len(frames)):
    t = frames[i][['phone_accel_x', 'phone_accel_y', 'phone_accel_z',
                   'phone_gyro_x', 'phone_gyro_y', 'phone_gyro_z', 'activity']].values
    sensors_frames.append(t)
I am trying something like this, but I am having difficulties stacking the values of each column into a single row and continuing on a new row for the next dataframe. The list sensors_frames will be converted to a dataframe afterwards.
Any ideas on how to make it happen with the pandas library?
Thanks in advance.

Here is a pandas-based solution. I timed it on my machine and it took around 2 minutes to run.
import pandas as pd

# simulate input data
df = pd.DataFrame(
    {
        "participant": [1600] * 200,
        "activity": ["D"] * 200,
        "t": [241598773279024] * 200,
        "phone_accel_x": [1.049568] * 200,
        "phone_accel_y": [1.049568] * 200,
        "phone_accel_z": [1.248711] * 200,
        "phone_gyro_x": [-0.017212] * 200,
        "phone_gyro_y": [-0.006581] * 200,
        "phone_gyro_z": [-0.080116] * 200,
    }
)
columns = [
    "phone_accel_x",
    "phone_accel_y",
    "phone_accel_z",
    "phone_gyro_x",
    "phone_gyro_y",
    "phone_gyro_z",
]
frames = [df] * 91_999

# suggested solution
df_frames = pd.concat(frames, axis=0, ignore_index=True)
df_frames["step"] = df_frames.index // 200
reshaped = df_frames.groupby(["step", "activity"]).apply(
    lambda grp: pd.DataFrame(
        grp[columns].values.reshape(1, -1),
        columns=[f"{col}_{i}" for i in range(200) for col in columns],
    )
).reset_index()
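For 91,999 frames the per-group apply above can be slow. Since every frame is exactly 200 rows, a faster alternative (a sketch, assuming each frame carries a single activity label throughout, as in the sample data) is to skip groupby entirely and stack the NumPy arrays directly:
import numpy as np

# stack the six sensor columns of every frame into a (91999, 200, 6) array,
# then flatten each frame's block into a single row of 1200 values
stacked = np.stack([f[columns].to_numpy() for f in frames])
flat = stacked.reshape(len(frames), -1)
out = pd.DataFrame(
    flat,
    columns=[f"{col}_{i}" for i in range(200) for col in columns],
)
# assumes each frame has one activity label in all of its 200 rows
out["activity"] = [f["activity"].iat[0] for f in frames]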

Related

How do I compare the two data frames and get the values?

The x data frame is information about departure and arrival, and the y data frame is latitude and longitude data for each location.
I am trying to calculate the distance between the origin and destination using the latitude and longitude data of the start and end points (e.g., start_x, start_y, end_x, end_y).
How can I connect x and y to bring the latitude data that fits each code into the x data frame?
The notation is somewhat confusing, but I kept the question's notation.
One way to do this would be to merge your dataframes into a new one, like so.
Dummy dataframes:
import pandas as pd

x = [300, 500, 300, 600, 700]
y = [400, 400, 700, 700, 400]
code = [300, 400, 500, 600, 700]
start = [100, 101, 102, 103, 104]
end = [110, 111, 112, 113, 114]
x = {"x": x, "y": y}
y = {"code": code, "start": start, "end": end}
x = pd.DataFrame(x)
y = pd.DataFrame(y)
This gives:
x
     x    y
0  300  400
1  500  400
2  300  700
3  600  700
4  700  400
y
   code  start  end
0   300    100  110
1   400    101  111
2   500    102  112
3   600    103  113
4   700    104  114
Solution:
df = pd.merge(x, y, left_on="x", right_on="code").drop("code", axis=1)
df
     x    y  start  end
0  300  400    100  110
1  300  700    100  110
2  500  400    102  112
3  600  700    103  113
4  700  400    104  114
df = df.merge(y, left_on="y", right_on="code").drop("code", axis=1)
df
     x    y  start_x  end_x  start_y  end_y
0  300  400      100    110      101    111
1  500  400      102    112      101    111
2  700  400      104    114      101    111
3  300  700      100    110      104    114
4  600  700      103    113      104    114
Quick explanation:
The line df = pd.merge(...) creates the new dataframe by merging the left one (x) on its "x" column with the right one (y) on its "code" column. The second line, df = df.merge(...), takes the existing df as the left frame and uses its "y" column to merge against the "code" column of the y dataframe.
The .drop("code", axis=1) is used to drop the unwanted "code" column resulting from each merge.
The _x and _y suffixes are added automatically when merging dataframes that share column names. To control them, pass the suffixes=... option to the second merge (the one where the same-named columns collide). In this case the defaults work fine, so there is no need to bother with this as long as you use x as the left and y as the right dataframe.
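For completeness, continuing from the dummy dataframes above, here is the second merge with the suffixes spelled out, plus a hedged sketch of the distance calculation the question is ultimately after. It assumes a plain Euclidean distance and that the start_*/end_* columns pair up as the question describes; real latitude/longitude would call for a haversine formula instead.
import numpy as np

# explicit suffixes on the second merge (equivalent to the default "_x"/"_y")
df = pd.merge(x, y, left_on="x", right_on="code").drop("code", axis=1)
df = df.merge(y, left_on="y", right_on="code", suffixes=("_x", "_y")).drop("code", axis=1)

# hypothetical final step: straight-line distance between the merged coordinates
df["distance"] = np.sqrt((df["end_x"] - df["start_x"]) ** 2
                         + (df["end_y"] - df["start_y"]) ** 2)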

How to apply a function to all the columns in a data frame and take output in the form of dataframe in python

I have two functions that do some calculations and give me results. For now, I am able to apply them to one column and get the result in the form of a dataframe.
I need to know how I can apply the function to all the columns in the dataframe and get the results in the form of a dataframe as well.
Say I have a data frame as below, and I need to apply the function to each column in the data frame and get a dataframe with the corresponding results for all the columns.
A B C D E F
1456 6744 9876 374 65413 1456
654 2314 674654 2156 872 6744
875 653 36541 345 4963 9876
6875 7401 3654 465 3547 374
78654 8662 35 6987 6874 65413
658 94512 687 489 8756 5854
Results
A B C D E F
2110 9058 684530 2530 66285 8200
1529 2967 711195 2501 5835 16620
7750 8054 40195 810 8510 10250
85529 16063 3689 7452 10421 65787
Here is a simple example.
df
A B C D
0 10 11 12 13
1 20 21 22 23
2 30 31 32 33
3 40 41 42 43
# Assume your user-defined function is
def mul(x, y):
    return x * y
which will multiply the values.
Let's say you want to multiply the first column 'A' by 3:
df['A'].apply(lambda x: mul(x,3))
0 30
1 60
2 90
3 120
Now, if you want to apply the mul function to all columns of the dataframe and create a new dataframe with the results:
df1 = df.applymap(lambda x: mul(x, 3))
df1
A B C D
0 30 33 36 39
1 60 63 66 69
2 90 93 96 99
3 120 123 126 129
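Note that applymap has since been renamed: pandas 2.1 deprecated DataFrame.applymap in favour of DataFrame.map. On a recent pandas the same result would be:
# pandas >= 2.1: applymap was renamed to DataFrame.map
df1 = df.map(lambda x: mul(x, 3))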
A pd.DataFrame object also has its own apply method.
From the example given in its documentation:
>>> import numpy as np
>>> df = pd.DataFrame([[4, 9],] * 3, columns=['A', 'B'])
>>> df
A B
0 4 9
1 4 9
2 4 9
>>> df.apply(np.sqrt)
A B
0 2.0 3.0
1 2.0 3.0
2 2.0 3.0
Conclusion: you should be able to apply your function to the whole dataframe.
It looks like this is what you are trying to do in your output:
df = pd.DataFrame(
    [[1456, 6744, 9876, 374, 65413, 1456],
     [654, 2314, 674654, 2156, 872, 6744],
     [875, 653, 36541, 345, 4963, 9876],
     [6875, 7401, 3654, 465, 3547, 374],
     [78654, 8662, 35, 6987, 6874, 65413],
     [658, 94512, 687, 489, 8756, 5854]],
    columns=list('ABCDEF'))

def fn(col):
    return col[:-2].values + col[1:-1].values
Apply the function as mentioned in the previous answers:
>>> df.apply(fn)
A B C D E F
0 2110 9058 684530 2530 66285 8200
1 1529 2967 711195 2501 5835 16620
2 7750 8054 40195 810 8510 10250
3 85529 16063 3689 7452 10421 65787
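For what it's worth, the same sliding sum can also be written without apply at all; a sketch using positional slicing on the same df:
# vectorized equivalent: add rows 0..n-3 to rows 1..n-2, position by position
res = df.iloc[:-2].reset_index(drop=True) + df.iloc[1:-1].reset_index(drop=True)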

How to group by and aggregate on multiple columns in pandas

I have the following dataframe in pandas:
ID Balance ATM_drawings Value
1 100 50 345
1 150 33 233
2 100 100 333
2 100 100 234
I want the data in this desired format:
ID Balance_mean Balance_sum ATM_Drawings_mean ATM_drawings_sum
1 75 250 41.5 83
2 200 100 200 100
I am using the following command to do it in pandas:
df1 = df[['Balance', 'ATM_drawings']].groupby('ID', as_index=False).agg(['mean', 'sum']).reset_index()
But it does not give what I intended to get.
You can use a dictionary to specify aggregation functions for each series:
d = {'Balance': ['mean', 'sum'], 'ATM_drawings': ['mean', 'sum']}
res = df.groupby('ID').agg(d)
# flatten MultiIndex columns
res.columns = ['_'.join(col) for col in res.columns.values]
print(res)
Balance_mean Balance_sum ATM_drawings_mean ATM_drawings_sum
ID
1 125 250 41.5 83
2 100 200 100.0 200
Or you can define d via dict.fromkeys:
d = dict.fromkeys(('Balance', 'ATM_drawings'), ['mean', 'sum'])
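On pandas 0.25 or later, named aggregation produces the flat column names directly, with no MultiIndex flattening step. A sketch on the question's df:
# named aggregation: flat output columns, no manual '_'.join needed
res = df.groupby('ID').agg(
    Balance_mean=('Balance', 'mean'),
    Balance_sum=('Balance', 'sum'),
    ATM_drawings_mean=('ATM_drawings', 'mean'),
    ATM_drawings_sum=('ATM_drawings', 'sum'),
).reset_index()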
Not sure how to achieve this using agg, but you could reuse the `groupby` object to avoid having to do the operation multiple times, and then use transformations:
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 2, 2],
    "Balance": [100, 150, 100, 100],
    "ATM_drawings": [50, 33, 100, 100],
    "Value": [345, 233, 333, 234]
})
gb = df.groupby("ID")
df["Balance_mean"] = gb["Balance"].transform("mean")
df["Balance_sum"] = gb["Balance"].transform("sum")
df["ATM_drawings_mean"] = gb["ATM_drawings"].transform("mean")
df["ATM_drawings_sum"] = gb["ATM_drawings"].transform("sum")
print(df)
Which yields:
   ID  Balance  ATM_drawings  Value  Balance_mean  Balance_sum  ATM_drawings_mean  ATM_drawings_sum
0   1      100            50    345         125.0          250               41.5                83
1   1      150            33    233         125.0          250               41.5                83
2   2      100           100    333         100.0          200              100.0               200
3   2      100           100    234         100.0          200              100.0               200
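The four transform calls could also be generated in a loop, which scales better if more columns or statistics are added later:
# same result, generated programmatically from the gb object above
for col in ("Balance", "ATM_drawings"):
    for stat in ("mean", "sum"):
        df[f"{col}_{stat}"] = gb[col].transform(stat)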

Format Pandas Pivot Table

I ran into a problem formatting a pivot table created by pandas.
I made a matrix table between two columns (A, B) from my source data, using pandas.pivot_table with A as the columns and B as the index.
>> df = PD.read_excel("data.xls")
>> table = PD.pivot_table(df, index=["B"],
                          values='Count', columns=["A"], aggfunc=[NUM.sum],
                          fill_value=0, margins=True, dropna=True)
>> table
It returns as:
sum
A 1 2 3 All
B
1 23 52 0 75
2 16 35 12 65
3 56 0 0 56
All 95 87 12 196
And I hope to have a format like this:
A All_B
1 2 3
1 23 52 0 75
B 2 16 35 12 65
3 56 0 0 56
All_A 95 87 12 196
How should I do this? Thanks very much in advance.
The table returned by pd.pivot_table is very convenient to work on (it has a single-level index and columns) and normally does NOT require any further format manipulation. But if you insist on changing the format to the one you mentioned in the post, then you need to construct a multi-level index/column using pd.MultiIndex. Here is an example of how to do it.
Before manipulation,
import pandas as pd
import numpy as np
np.random.seed(0)
a = np.random.randint(1, 4, 100)
b = np.random.randint(1, 4, 100)
df = pd.DataFrame(dict(A=a,B=b,Val=np.random.randint(1,100,100)))
table = pd.pivot_table(df, index='A', columns='B', values='Val', aggfunc=sum, fill_value=0, margins=True)
print(table)
B 1 2 3 All
A
1 454 649 770 1873
2 628 576 467 1671
3 376 247 481 1104
All 1458 1472 1718 4648
After:
multi_level_column = pd.MultiIndex.from_arrays([['A', 'A', 'A', 'All_B'], [1,2,3,'']])
multi_level_index = pd.MultiIndex.from_arrays([['B', 'B', 'B', 'All_A'], [1,2,3,'']])
table.index = multi_level_index
table.columns = multi_level_column
print(table)
A All_B
1 2 3
B 1 454 649 770 1873
2 628 576 467 1671
3 376 247 481 1104
All_A 1458 1472 1718 4648
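If all you need is to relabel the margins, without adding the extra 'A'/'B' header level, a plain rename is a lighter-weight option; a sketch:
# rename just the margin row and column on the original pivot table
table2 = table.rename(index={'All': 'All_A'}, columns={'All': 'All_B'})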

Pandas - GroupBy and then Merge on original table

I'm trying to write a function to aggregate and perform various stats calculations on a dataframe in pandas and then merge it with the original dataframe; however, I'm running into issues. This is the equivalent code in SQL:
SELECT EID,
       PCODE,
       SUM(PVALUE) AS PVALUE,
       SUM(SQRT(SC*EXP(SC-1))) AS SC,
       SUM(SI) AS SI,
       SUM(EE) AS EE
INTO foo_bar_grp
FROM foo_bar
GROUP BY EID, PCODE
And then join on the original table:
SELECT *
FROM foo_bar_grp
INNER JOIN foo_bar
    ON foo_bar.EID = foo_bar_grp.EID
    AND foo_bar.PCODE = foo_bar_grp.PCODE
Here are the steps: Loading the data
IN:>>
from pandas import DataFrame

pol_dict = {'PID': [1, 1, 2, 2],
            'EID': [123, 123, 123, 123],
            'PCODE': ['GU', 'GR', 'GU', 'GR'],
            'PVALUE': [100, 50, 150, 300],
            'SI': [400, 40, 140, 140],
            'SC': [230, 23, 213, 213],
            'EE': [10000, 10000, 2000, 30000],
            }
pol_df = DataFrame(pol_dict)
pol_df
OUT:>>
EID EE PCODE PID PVALUE SC SI
0 123 10000 GU 1 100 230 400
1 123 10000 GR 1 50 23 40
2 123 2000 GU 2 150 213 140
3 123 30000 GR 2 300 213 140
Step 2: Calculating and Grouping on the data:
My pandas code is as follows:
import numpy as np

# create aggregation dataframe
poagg_df = pol_df
del poagg_df['PID']
po_grouped_df = poagg_df.groupby(['EID', 'PCODE'])

# generate acc level aggregate
acc_df = po_grouped_df.agg({
    'PVALUE': np.sum,
    'SI': lambda x: np.sqrt(np.sum(x * np.exp(x - 1))),
    'SC': np.sum,
    'EE': np.sum
})
This works fine until I want to join on the original table:
IN:>>
po_account_df = pd.merge(acc_df, pol_df, on=['EID', 'PCODE'], how='inner', suffixes=('_Acc', '_Po'))
OUT:>>
KeyError: u'no item named EID'
For some reason, the grouped dataframe can't join back to the original table. I've looked at ways of converting the groupby columns back into actual columns, but that doesn't seem to work.
Please note, the end goal is to be able to find the percentage for each column (PVALUE, SI, SC, EE), i.e.:
pol_acc_df['PVALUE_PCT'] = np.round(pol_acc_df.PVALUE_Po/pol_acc_df.PVALUE_Acc,4)
Thanks!
By default, groupby output has the grouping columns as indices, not columns, which is why the merge is failing.
There are a couple of different ways to handle it; probably the easiest is using the as_index parameter when you define the groupby object.
po_grouped_df = poagg_df.groupby(['EID','PCODE'], as_index=False)
Then, your merge should work as expected.
In [356]: pd.merge(acc_df, pol_df, on=['EID','PCODE'], how='inner',suffixes=('_Acc','_Po'))
Out[356]:
EID PCODE SC_Acc EE_Acc SI_Acc PVALUE_Acc EE_Po PVALUE_Po \
0 123 GR 236 40000 1.805222e+31 350 10000 50
1 123 GR 236 40000 1.805222e+31 350 30000 300
2 123 GU 443 12000 8.765549e+87 250 10000 100
3 123 GU 443 12000 8.765549e+87 250 2000 150
SC_Po SI_Po
0 23 40
1 213 140
2 230 400
3 213 140
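Equivalently, if you keep the default groupby, you can promote the group keys back into columns with reset_index before merging; a sketch of the same aggregation:
# alternative: reset the MultiIndex into columns, then merge as before
acc_df = poagg_df.groupby(['EID', 'PCODE']).agg({
    'PVALUE': 'sum',
    'SI': lambda x: np.sqrt(np.sum(x * np.exp(x - 1))),
    'SC': 'sum',
    'EE': 'sum',
}).reset_index()
po_account_df = pd.merge(acc_df, pol_df, on=['EID', 'PCODE'],
                         how='inner', suffixes=('_Acc', '_Po'))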
From the pandas docs:
Transformation: perform some group-specific computations and return a like-indexed object
Unfortunately, transform works series by series, so you wouldn't be able to perform multiple functions on multiple columns as you did with agg, but transform does allow you to skip the merge:
po_grouped_df = pol_df.groupby(['EID','PCODE'])
pol_df['sum_pval'] = po_grouped_df['PVALUE'].transform(sum)
pol_df['func_si'] = po_grouped_df['SI'].transform(lambda x: np.sqrt(np.sum(x * np.exp(x-1))))
pol_df['sum_sc'] = po_grouped_df['SC'].transform(sum)
pol_df['sum_ee'] = po_grouped_df['EE'].transform(sum)
pol_df
Results in:
   PID  EID PCODE  PVALUE   SI   SC     EE  sum_pval       func_si  sum_sc  sum_ee
0    1  123    GU     100  400  230  10000       250  8.765549e+87     443   12000
1    1  123    GR      50   40   23  10000       350  1.805222e+31     236   40000
2    2  123    GU     150  140  213   2000       250  8.765549e+87     443   12000
3    2  123    GR     300  140  213  30000       350  1.805222e+31     236   40000
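From here, the percentages the question asks for fall out directly; a sketch reusing the transformed columns (PVALUE_PCT is the question's own target name):
# per-row share of each group's total PVALUE
pol_df['PVALUE_PCT'] = np.round(pol_df['PVALUE'] / pol_df['sum_pval'], 4)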
For more info, check out this SO answer.
