I am working on a dataset that is stored in the following dataframe:
#print(old_df)
col1 col2 col3
0 1 10 1.5
1 1 11 2.5
2 1 12 5.6
3 2 10 7.8
4 2 24 2.1
5 3 10 3.2
6 4 10 22.1
7 4 11 1.3
8 4 89 0.5
9 4 91 3.3
I am trying to generate another data frame that has the selected col1 values as the index, the selected col2 values as the columns, and the respective col3 values assigned to the cells.
Eg:
selected_col1 = [1,2]
selected_col2 = [10,11,24]
The new data frame should look like this:
#print(selected_df)
10 11 24
1 1.5 2.5 NaN
2 7.8 NaN 2.1
I have tried the following method:
selected_col1 = [1,2]
selected_col2 = [10,11,24]
selected_df = pd.DataFrame(index=selected_col1, columns=selected_col2)
for col1_value in selected_col1:
    for col2_value in selected_col2:
        qry = 'col1 == {} & col2 == {}'.format(col1_value, col2_value)
        col3_value = old_df.query(qry).col3.values
        if len(col3_value) > 0:
            selected_df.at[col1_value, col2_value] = col3_value[0]
But because my dataframe has around 20 million rows, this brute-force method takes a long time. Is there a better way?
First filter the rows by membership with Series.isin on both columns, chained by & for bitwise AND, and then use DataFrame.pivot:
df = df[df['col1'].isin(selected_col1) & df['col2'].isin(selected_col2)]
df = df.pivot(index='col1', columns='col2', values='col3')
print (df)
col2 10 11 24
col1
1 1.5 2.5 NaN
2 7.8 NaN 2.1
If duplicated col1/col2 pairs are possible after the filtering, use DataFrame.pivot_table instead:
df = df.pivot_table(index='col1',columns='col2',values='col3', aggfunc='mean')
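For instance, here is a minimal sketch (with made-up duplicated data, not taken from the question) showing how the 'mean' aggregation averages the duplicated pair:
import pandas as pd

# hypothetical data with a duplicated (col1, col2) pair
dup_df = pd.DataFrame({'col1': [1, 1, 2],
                       'col2': [10, 10, 10],
                       'col3': [1.5, 2.5, 7.8]})
print(dup_df.pivot_table(index='col1', columns='col2', values='col3', aggfunc='mean'))
#col2   10
#col1
#1     2.0
#2     7.8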
EDIT:
If you use | for bitwise OR, you get a different output:
df = df[df['col1'].isin(selected_col1) | df['col2'].isin(selected_col2)]
df = df.pivot(index='col1', columns='col2', values='col3')
print (df)
col2 10 11 12 24
col1
1 1.5 2.5 5.6 NaN
2 7.8 NaN NaN 2.1
3 3.2 NaN NaN NaN
4 22.1 1.3 NaN NaN
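Putting it together, here is a minimal end-to-end sketch of the filter-then-pivot approach, with old_df rebuilt from the sample in the question:
import pandas as pd

old_df = pd.DataFrame({'col1': [1, 1, 1, 2, 2, 3, 4, 4, 4, 4],
                       'col2': [10, 11, 12, 10, 24, 10, 10, 11, 89, 91],
                       'col3': [1.5, 2.5, 5.6, 7.8, 2.1, 3.2, 22.1, 1.3, 0.5, 3.3]})

selected_col1 = [1, 2]
selected_col2 = [10, 11, 24]

mask = old_df['col1'].isin(selected_col1) & old_df['col2'].isin(selected_col2)
selected_df = old_df[mask].pivot(index='col1', columns='col2', values='col3')
print(selected_df)
#col2   10   11   24
#col1
#1     1.5  2.5  NaN
#2     7.8  NaN  2.1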
Edited for clarity:
I have a dataframe in the following format
i col1 col2 col3
0 00:00:00,1 10 1.7
1 00:00:00,2 10 1.5
2 00:00:00,3 50 4.6
3 00:00:00,4 30 3.4
4 00:00:00,5 20 5.6
5 00:00:00,6 50 1.8
6 00:00:00,9 20 1.9
...
I am trying to sort it like this:
i col1 col2 col3
0 00:00:00,1 10 1.7
1 00:00:00,2 10 1.5
4 00:00:00,5 20 5.6
6 00:00:00,9 20 1.9
3 00:00:00,4 30 3.4
2 00:00:00,3 50 4.6
5 00:00:00,6 50 1.8
...
I've tried df = df.sort_values(by=['col1', 'col2']), which only sorts on col1.
I understand that it may have something to do with the values being 'strings', but I can't seem to find a workaround for it.
df = df.sort_values(by=['col2', 'col1'])
gave the desired result.
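Note that this works here because the zero-padded 'HH:MM:SS,f' strings happen to sort lexicographically in the same order as the times they represent. As a hedged sketch (not part of the original answer), if the strings ever need to be compared as real times, they can be converted to timedeltas for the sort; the helper column name _t below is hypothetical:
import pandas as pd

# assumption: col1 holds 'HH:MM:SS,fraction' strings; '_t' is a made-up helper column
df['_t'] = pd.to_timedelta(df['col1'].str.replace(',', '.', regex=False))
df = df.sort_values(by=['col2', '_t']).drop(columns='_t')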
If you need to sort each column independently, use Series.sort_values in DataFrame.apply:
c = ['col1','col2']
df[c] = df[c].apply(lambda x: x.sort_values().to_numpy())
#alternative
df[c] = df[c].apply(lambda x: x.sort_values().tolist())
print (df)
i col1 col2
0 0 00:00:00,1 10
1 1 00:00:01,5 20
2 2 00:00:10,0 30
3 3 00:01:00,1 40
4 5 01:00:00,0 50
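Note that sorting each column independently like this breaks the row-wise pairing between col1 and col2, so it only makes sense when the rows are not meant to stay together as records.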
I have a data frame like this for example:
col1 col2
0 A 3
1 B 4
2 A NaN
3 B 5
4 A 5
5 A NaN
6 B NaN
.
.
.
47 B 8
48 A 9
49 B NaN
50 A NaN
When I try df.groupby(['col1'], sort=False).agg({'col2':'last'}).reset_index() it gives me this output:
col1 col2
0 A NaN
1 B NaN
I want to get the last non-NaN value after groupby and agg. The desired output is below:
col1 col2
0 A 9
1 B 8
For me your solution works well, provided the NaN entries are real missing values.
Here is an alternative:
df = df.dropna(subset=['col2']).drop_duplicates('col1', keep='last')
If the NaNs are strings, first convert them to missing values:
import numpy as np

df['col2'] = df['col2'].replace('NaN', np.nan)
df.groupby(['col1'], sort=False).agg({'col2':'last'}).reset_index()
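A self-contained sketch of that second case, assuming the 'NaN' entries really are strings (the small frame below is made up to mirror the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'B', 'A', 'B', 'A', 'B'],
                   'col2': [3, 4, 5, 8, 9, 'NaN']})
df['col2'] = df['col2'].replace('NaN', np.nan)
print(df.groupby(['col1'], sort=False).agg({'col2': 'last'}).reset_index())
#  col1 col2
#0    A    9
#1    B    8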
I have a data frame with more than 100 columns. I need to lag 60 of them, and I know the names of the columns that need to be lagged. Is there a way to lag them in a batch or in just a few lines?
Say I have a dataframe like below:
col1 col2 col3 col4 col5 col6 ... col100
1 2 3 4 5 6 8
3 9 15 19 21 23 31
The only way I know is to do it one by one, i.e. run df['col1_lag'] = df['col1'].shift(1) for each column.
That seems like too much for so many columns. Is there a better way to do this? Thanks in advance.
Use shift with add_suffix for the new DataFrame and join it to the original:
df1 = df.join(df.shift().add_suffix('_lag'))
#alternative
#df1 = pd.concat([df, df.shift().add_suffix('_lag')], axis=1)
print (df1)
col1 col2 col3 col4 col5 col6 col100 col1_lag col2_lag col3_lag \
0 1 2 3 4 5 6 8 NaN NaN NaN
1 3 9 15 19 21 23 31 1.0 2.0 3.0
col4_lag col5_lag col6_lag col100_lag
0 NaN NaN NaN NaN
1 4.0 5.0 6.0 8.0
If you want to lag only some columns, you can filter them with a list:
cols = ['col1','col3','col5']
df2 = df.join(df[cols].shift().add_suffix('_lag'))
print (df2)
col1 col2 col3 col4 col5 col6 col100 col1_lag col3_lag col5_lag
0 1 2 3 4 5 6 8 NaN NaN NaN
1 3 9 15 19 21 23 31 1.0 3.0 5.0
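Since the question says the 60 column names are already known, cols can simply be set to that list. If the names happen to follow a 'colN' pattern (an assumption, not something stated in the question), the list can also be built programmatically:
# hypothetical: only if the 60 names really are col1 ... col60
cols = ['col{}'.format(i) for i in range(1, 61)]
df2 = df.join(df[cols].shift().add_suffix('_lag'))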
I currently have a dataframe which looks like this:
col1 col2 col3
1 2 3
2 3 NaN
3 4 NaN
2 NaN NaN
0 2 NaN
What I want to do is apply some condition to the column values and return the final result in a new column.
The condition is to assign values based on this order of priority, where 2 is the first priority: [2, 1, 3, 0, 4].
I tried to define a function to append the final results but wasn't really getting anywhere... any thoughts?
The desired outcome would look something like:
col1 col2 col3 col4
1 2 3 2
2 3 NaN 2
3 4 NaN 3
2 NaN NaN 2
0 2 NaN 2
where col4 is the new column created.
Thanks
First you may want to get rid of the NaNs:
df = df.fillna(5)
and then apply a function to every row to find your value:
def func(x, l=[2, 1, 3, 0, 4, 5]):
    for j in l:
        if j in x:
            return j

df['new'] = df.apply(lambda x: func(list(x)), axis=1)
Output:
col1 col2 col3 new
0 1 2 3 2
1 2 3 5 2
2 3 4 5 3
3 2 5 5 2
4 0 2 5 2
Maybe a little late, but here is a similar approach:
import numpy as np

def f(x):
    for i in [2, 1, 3, 0, 4]:
        if i in x.tolist():
            return i
    return np.nan

df["col4"] = df.apply(f, axis=1)
and the output:
col1 col2 col3 col4
0 1 2.0 3.0 2
1 2 3.0 NaN 2
2 3 4.0 NaN 3
3 2 NaN NaN 2
4 0 2.0 NaN 2
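As a hedged aside (not from either answer above), the same priority rule can also be written without a per-row apply, using numpy.select; this sketch assumes the frame has exactly col1, col2 and col3 as in the example:
import numpy as np

priority = [2, 1, 3, 0, 4]
conditions = [df[['col1', 'col2', 'col3']].eq(p).any(axis=1) for p in priority]
df['col4'] = np.select(conditions, priority, default=np.nan)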
I have a dataframe as below:
idx col1 col2 col3
0 1.1 A 100
1 1.1 A 100
2 1.1 A 100
3 2.6 B 100
4 2.5 B 100
5 3.4 B 100
6 2.6 B 100
I want to update col3 with percentage values that depend on the size of each (col1, col2) group, i.e. every row with (1.1, A) should get 33.33 in col3 because that group has three rows.
Desired output:
idx col1 col2 col3
0 1.1 A 33.33
1 1.1 A 33.33
2 1.1 A 33.33
3 2.6 B 50
4 2.5 B 100
5 3.4 B 100
6 2.6 B 50
I think you need groupby with transform('size'):
df['col3'] = 100 / df.groupby(['col1', 'col2'])['col3'].transform('size')
print(df)
col1 col2 col3
idx
0 1.1 A 33.333333
1 1.1 A 33.333333
2 1.1 A 33.333333
3 2.6 B 50.000000
4 2.5 B 100.000000
5 3.4 B 100.000000
6 2.6 B 50.000000
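A self-contained sketch reproducing this, with the frame rebuilt from the question's sample:
import pandas as pd

df = pd.DataFrame({'col1': [1.1, 1.1, 1.1, 2.6, 2.5, 3.4, 2.6],
                   'col2': list('AAABBBB'),
                   'col3': 100})
df['col3'] = 100 / df.groupby(['col1', 'col2'])['col3'].transform('size')
print(df)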