I am working on a dataset that is stored in the following dataframe:
#print(old_df)
col1 col2 col3
0 1 10 1.5
1 1 11 2.5
2 1 12 5.6
3 2 10 7.8
4 2 24 2.1
5 3 10 3.2
6 4 10 22.1
7 4 11 1.3
8 4 89 0.5
9 4 91 3.3
I am trying to generate another data frame that has the selected col1 values as the index, the selected col2 values as the columns, and the respective col3 values assigned to the cells.
Eg:
selected_col1 = [1,2]
selected_col2 = [10,11,24]
The new data frame should look like this:
#print(selected_df)
10 11 24
1 1.5 2.5 NaN
2 7.8 NaN 2.1
I have tried the following method:
selected_col1 = [1,2]
selected_col2 = [10,11,24]
selected_df = pd.DataFrame(index=selected_col1, columns=selected_col2)
for col1_value in selected_col1:
    for col2_value in selected_col2:
        qry = 'col1 == {} & col2 == {}'.format(col1_value, col2_value)
        col3_value = old_df.query(qry).col3.values
        if len(col3_value) > 0:
            selected_df.at[col1_value, col2_value] = col3_value[0]
But because my dataframe has around 20 million rows, this brute-force method takes a long time. Is there a better way?
First filter the rows by membership with Series.isin on both columns, chained by & for bitwise AND, and then use DataFrame.pivot:
df = df[df['col1'].isin(selected_col1) & df['col2'].isin(selected_col2)]
df = df.pivot(index='col1', columns='col2', values='col3')
print (df)
col2 10 11 24
col1
1 1.5 2.5 NaN
2 7.8 NaN 2.1
If duplicated col1/col2 pairs are possible after the filtering, use DataFrame.pivot_table instead:
df = df.pivot_table(index='col1',columns='col2',values='col3', aggfunc='mean')
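For instance, here is a minimal sketch (with made-up duplicated data, not taken from the question) showing how the 'mean' aggregation averages the duplicated pair:
import pandas as pd

# hypothetical data with a duplicated (col1, col2) pair
dup_df = pd.DataFrame({'col1': [1, 1, 2],
                       'col2': [10, 10, 10],
                       'col3': [1.5, 2.5, 7.8]})
print(dup_df.pivot_table(index='col1', columns='col2', values='col3', aggfunc='mean'))
#col2   10
#col1
#1     2.0
#2     7.8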
EDIT:
If you use | for bitwise OR, you get a different output:
df = df[df['col1'].isin(selected_col1) | df['col2'].isin(selected_col2)]
df = df.pivot(index='col1', columns='col2', values='col3')
print (df)
col2 10 11 12 24
col1
1 1.5 2.5 5.6 NaN
2 7.8 NaN NaN 2.1
3 3.2 NaN NaN NaN
4 22.1 1.3 NaN NaN
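Putting it together, here is a minimal end-to-end sketch of the filter-then-pivot approach, with old_df rebuilt from the sample in the question:
import pandas as pd

old_df = pd.DataFrame({'col1': [1, 1, 1, 2, 2, 3, 4, 4, 4, 4],
                       'col2': [10, 11, 12, 10, 24, 10, 10, 11, 89, 91],
                       'col3': [1.5, 2.5, 5.6, 7.8, 2.1, 3.2, 22.1, 1.3, 0.5, 3.3]})

selected_col1 = [1, 2]
selected_col2 = [10, 11, 24]

mask = old_df['col1'].isin(selected_col1) & old_df['col2'].isin(selected_col2)
selected_df = old_df[mask].pivot(index='col1', columns='col2', values='col3')
print(selected_df)
#col2   10   11   24
#col1
#1     1.5  2.5  NaN
#2     7.8  NaN  2.1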
Edited for clarity:
I have a dataframe in the following format
i col1 col2 col3
0 00:00:00,1 10 1.7
1 00:00:00,2 10 1.5
2 00:00:00,3 50 4.6
3 00:00:00,4 30 3.4
4 00:00:00,5 20 5.6
5 00:00:00,6 50 1.8
6 00:00:00,9 20 1.9
...
I am trying to sort it like this:
i col1 col2 col3
0 00:00:00,1 10 1.7
1 00:00:00,2 10 1.5
4 00:00:00,5 20 5.6
6 00:00:00,9 20 1.9
3 00:00:00,4 30 3.4
2 00:00:00,3 50 4.6
5 00:00:00,6 50 1.8
...
I've tried df = df.sort_values(by=['col1', 'col2']), which only sorts on col1.
I understand that it may have something to do with the values being 'strings', but I can't seem to find a workaround for it.
df = df.sort_values(by=['col2', 'col1'])
gave the desired result.
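Note that this works here because the zero-padded 'HH:MM:SS,f' strings happen to sort lexicographically in the same order as the times they represent. As a hedged sketch (not part of the original answer), if the strings ever need to be compared as real times, they can be converted to timedeltas for the sort; the helper column name _t below is hypothetical:
import pandas as pd

# assumption: col1 holds 'HH:MM:SS,fraction' strings; '_t' is a made-up helper column
df['_t'] = pd.to_timedelta(df['col1'].str.replace(',', '.', regex=False))
df = df.sort_values(by=['col2', '_t']).drop(columns='_t')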
If you need to sort each column independently, use Series.sort_values in DataFrame.apply:
c = ['col1','col2']
df[c] = df[c].apply(lambda x: x.sort_values().to_numpy())
#alternative
df[c] = df[c].apply(lambda x: x.sort_values().tolist())
print (df)
i col1 col2
0 0 00:00:00,1 10
1 1 00:00:01,5 20
2 2 00:00:10,0 30
3 3 00:01:00,1 40
4 5 01:00:00,0 50
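Note that sorting each column independently like this breaks the row-wise pairing between col1 and col2, so it only makes sense when the rows are not meant to stay together as records.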
I have a data frame like this for example:
col1 col2
0 A 3
1 B 4
2 A NaN
3 B 5
4 A 5
5 A NaN
6 B NaN
.
.
.
47 B 8
48 A 9
49 B NaN
50 A NaN
When I try df.groupby(['col1'], sort=False).agg({'col2':'last'}).reset_index() it gives me this output:
col1 col2
0 A NaN
1 B NaN
I want to get the last non-NaN value after groupby and agg. The desired output is below:
col1 col2
0 A 9
1 B 8
For me your solution works well, provided the NaN entries are real missing values.
Here is an alternative:
df = df.dropna(subset=['col2']).drop_duplicates('col1', keep='last')
If the NaNs are strings, first convert them to missing values:
import numpy as np

df['col2'] = df['col2'].replace('NaN', np.nan)
df.groupby(['col1'], sort=False).agg({'col2':'last'}).reset_index()
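A self-contained sketch of that second case, assuming the 'NaN' entries really are strings (the small frame below is made up to mirror the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'B', 'A', 'B', 'A', 'B'],
                   'col2': [3, 4, 5, 8, 9, 'NaN']})
df['col2'] = df['col2'].replace('NaN', np.nan)
print(df.groupby(['col1'], sort=False).agg({'col2': 'last'}).reset_index())
#  col1 col2
#0    A    9
#1    B    8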
I have a data frame with more than 100 columns. I need to lag 60 of them, and I know the names of the columns that need to be lagged. Is there a way to lag them in a batch or in just a few lines?
Say I have a dataframe like below:
col1 col2 col3 col4 col5 col6 ... col100
1 2 3 4 5 6 8
3 9 15 19 21 23 31
The only way I know is to do it one by one, i.e. run df['col1_lag'] = df['col1'].shift(1) for each column.
That seems like too much for so many columns. Is there a better way to do this? Thanks in advance.
Use shift with add_suffix for the new DataFrame and join it to the original:
df1 = df.join(df.shift().add_suffix('_lag'))
#alternative
#df1 = pd.concat([df, df.shift().add_suffix('_lag')], axis=1)
print (df1)
col1 col2 col3 col4 col5 col6 col100 col1_lag col2_lag col3_lag \
0 1 2 3 4 5 6 8 NaN NaN NaN
1 3 9 15 19 21 23 31 1.0 2.0 3.0
col4_lag col5_lag col6_lag col100_lag
0 NaN NaN NaN NaN
1 4.0 5.0 6.0 8.0
If you want to lag only some columns, you can filter them with a list:
cols = ['col1','col3','col5']
df2 = df.join(df[cols].shift().add_suffix('_lag'))
print (df2)
col1 col2 col3 col4 col5 col6 col100 col1_lag col3_lag col5_lag
0 1 2 3 4 5 6 8 NaN NaN NaN
1 3 9 15 19 21 23 31 1.0 3.0 5.0
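Since the question says the 60 column names are already known, cols can simply be set to that list. If the names happen to follow a 'colN' pattern (an assumption, not something stated in the question), the list can also be built programmatically:
# hypothetical: only if the 60 names really are col1 ... col60
cols = ['col{}'.format(i) for i in range(1, 61)]
df2 = df.join(df[cols].shift().add_suffix('_lag'))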
I currently have a dataframe which looks like this:
col1 col2 col3
1 2 3
2 3 NaN
3 4 NaN
2 NaN NaN
0 2 NaN
What I want to do is apply some condition to the column values and return the final result in a new column.
The condition is to assign values based on this order of priority, where 2 is the first priority: [2, 1, 3, 0, 4].
I tried to define a function to append the final results but wasn't really getting anywhere... any thoughts?
The desired outcome would look something like:
col1 col2 col3 col4
1 2 3 2
2 3 NaN 2
3 4 NaN 3
2 NaN NaN 2
0 2 NaN 2
where col4 is the new column created.
Thanks
First you may want to get rid of the NaNs:
df = df.fillna(5)
and then apply a function to every row to find your value:
def func(x, l=[2, 1, 3, 0, 4, 5]):
    for j in l:
        if j in x:
            return j

df['new'] = df.apply(lambda x: func(list(x)), axis=1)
Output:
col1 col2 col3 new
0 1 2 3 2
1 2 3 5 2
2 3 4 5 3
3 2 5 5 2
4 0 2 5 2
Maybe a little late, but here is a similar approach:
import numpy as np

def f(x):
    for i in [2, 1, 3, 0, 4]:
        if i in x.tolist():
            return i
    return np.nan

df["col4"] = df.apply(f, axis=1)
and the output:
col1 col2 col3 col4
0 1 2.0 3.0 2
1 2 3.0 NaN 2
2 3 4.0 NaN 3
3 2 NaN NaN 2
4 0 2.0 NaN 2
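As a hedged aside (not from either answer above), the same priority rule can also be written without a per-row apply, using numpy.select; this sketch assumes the frame has exactly col1, col2 and col3 as in the example:
import numpy as np

priority = [2, 1, 3, 0, 4]
conditions = [df[['col1', 'col2', 'col3']].eq(p).any(axis=1) for p in priority]
df['col4'] = np.select(conditions, priority, default=np.nan)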
I have a dataframe as below:
idx col1 col2 col3
0 1.1 A 100
1 1.1 A 100
2 1.1 A 100
3 2.6 B 100
4 2.5 B 100
5 3.4 B 100
6 2.6 B 100
I want to update col3 with percentage values that depend on the size of each (col1, col2) group, i.e. every row with (1.1, A) should get 33.33 in col3 because that group has three rows.
Desired output:
idx col1 col2 col3
0 1.1 A 33.33
1 1.1 A 33.33
2 1.1 A 33.33
3 2.6 B 50
4 2.5 B 100
5 3.4 B 100
6 2.6 B 50
I think you need groupby with transform('size'):
df['col3'] = 100 / df.groupby(['col1', 'col2'])['col3'].transform('size')
print(df)
col1 col2 col3
idx
0 1.1 A 33.333333
1 1.1 A 33.333333
2 1.1 A 33.333333
3 2.6 B 50.000000
4 2.5 B 100.000000
5 3.4 B 100.000000
6 2.6 B 50.000000
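A self-contained sketch reproducing this, with the frame rebuilt from the question's sample:
import pandas as pd

df = pd.DataFrame({'col1': [1.1, 1.1, 1.1, 2.6, 2.5, 3.4, 2.6],
                   'col2': list('AAABBBB'),
                   'col3': 100})
df['col3'] = 100 / df.groupby(['col1', 'col2'])['col3'].transform('size')
print(df)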