Let's assume I have the following dataframe:
import pandas as pd
d = {'col1': [1, 2,3,4], 'col2': [4, 2, 1, 3], 'col3': [1,0,1,1], 'outcome': [1,0,1,0]}
df = pd.DataFrame(data=d)
I want this dataframe sorted by col1 and col2 on the minimum value. The order of the indexes should be 2, 0, 1, 3.
I tried this with df.sort_values(by=['col2', 'col1']), but then it sorts on one column first and only then on the other. Is there any way to order by the minimum of the two columns?
Using numpy.lexsort:
import numpy as np
order = np.lexsort(np.sort(df[['col1', 'col2']])[:, ::-1].T)
out = df.iloc[order]
Output:
col1 col2 col3 outcome
2 3 1 1 1
0 1 4 1 1
1 2 2 0 0
3 4 3 1 0
Note that you can easily handle any number of columns:
df.iloc[np.lexsort(np.sort(df[['col1', 'col2', 'col3']])[:, ::-1].T)]
col1 col2 col3 outcome
1 2 2 0 0
2 3 1 1 1
0 1 4 1 1
3 4 3 1 0
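To make the key order explicit, here is a minimal sketch of the same idea for the two-column case. The names row_min and row_max are only illustrative and are not part of the answer above. It relies on np.lexsort treating the last key as the primary one:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [4, 2, 1, 3],
                   'col3': [1, 0, 1, 1], 'outcome': [1, 0, 1, 0]})

# Per-row minimum (primary key) and maximum (tie-breaker) of the two columns.
row_min = df[['col1', 'col2']].min(axis=1).to_numpy()
row_max = df[['col1', 'col2']].max(axis=1).to_numpy()

# np.lexsort sorts by the LAST key first, so the tie-breaker is listed first.
order = np.lexsort((row_max, row_min))
out = df.iloc[order]
print(out.index.tolist())
```

Rows 0 and 2 share the minimum 1, so the maximum (4 versus 3) decides which one comes first.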
One way (not the most efficient):
idx = df[['col2', 'col1']].apply(lambda x: tuple(sorted(x)), axis=1).sort_values().index
Output:
>>> df.loc[idx]
col1 col2 col3 outcome
2 3 1 1 1
0 1 4 1 1
1 2 2 0 0
3 4 3 1 0
>>> idx
Int64Index([2, 0, 1, 3], dtype='int64')
You can decorate-sort-undecorate, where the decoration is the minimal and the other (i.e., maximal) value per row:
cols = ["col1", "col2"]
(df.assign(_min=df[cols].min(axis=1), _other=df[cols].max(axis=1))
.sort_values(["_min", "_other"])
.drop(columns=["_min", "_other"]))
to get
col1 col2 col3 outcome
2 3 1 1 1
0 1 4 1 1
1 2 2 0 0
3 4 3 1 0
I would compute min(col1, col2) as a new column and then sort by it:
import pandas as pd
d = {'col1': [1, 2,3,4], 'col2': [4, 2, 1, 3], 'col3': [1,0,1,1], 'outcome': [1,0,1,0]}
df = pd.DataFrame(data=d)
df['colmin'] = df[['col1','col2']].min(axis=1) # compute min
df = df.sort_values(by='colmin').drop(columns='colmin') # sort then drop min
print(df)
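This gives the output below. Note that rows 0 and 2 tie on the minimum and the tie is not broken by the other column, so the index order is 0, 2, 1, 3 rather than the requested 2, 0, 1, 3: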
col1 col2 col3 outcome
0 1 4 1 1
2 3 1 1 1
1 2 2 0 0
3 4 3 1 0
Related
I have a df as follows:
Col1 Col2
0 [7306914, 7306915]
1 [7295911, 7295912]
2 [7324496]
3 [7294109, 7294110]
4 [7313713]
The second column is a list.
What I would like is to create a new column that contains the number of elements in each list.
Expected Output:
Col1 Col2 Col3
0 [7306914, 7306915] 2
1 [7295911, 7295912] 2
2 [7324496] 1
3 [7294109, 7294110] 2
4 [7313713] 1
Use Series.str.len, which also works on list values. It is more concise than apply, although on an object column both still loop over the elements under the hood:
df = pd.DataFrame([{'Col1': 0, 'Col2': [7306914, 7306915]}, {'Col1': 1, 'Col2': [7295911, 7295912]}, {'Col1': 2, 'Col2': [7324496]}, {'Col1': 3, 'Col2': [7294109, 7294110]}, {'Col1': 4, 'Col2': [7313713]}])
df['Col3'] = df['Col2'].str.len()
print(df)
[out]
Col1 Col2 Col3
0 0 [7306914, 7306915] 2
1 1 [7295911, 7295912] 2
2 2 [7324496] 1
3 3 [7294109, 7294110] 2
4 4 [7313713] 1
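As a rough sketch (not a benchmark), the three common spellings can be checked against each other on the question's data. The Series with_missing is a made-up example to show one practical difference: str.len returns NaN for a missing value, whereas calling len on None would raise a TypeError.

```python
import pandas as pd

df = pd.DataFrame({
    'Col1': range(5),
    'Col2': [[7306914, 7306915], [7295911, 7295912], [7324496],
             [7294109, 7294110], [7313713]],
})

# Three equivalent ways to count the elements of each list.
via_str = df['Col2'].str.len()
via_apply = df['Col2'].apply(len)
via_map = df['Col2'].map(len)

df['Col3'] = via_str
print(df)

# Hypothetical column with a missing entry: str.len yields NaN there.
with_missing = pd.Series([[1, 2], None])
lengths = with_missing.str.len()
print(lengths)
```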
Try this:
df_tmp = pd.DataFrame({'col1':[[1,2,3], [1,2]]}).reset_index()
In [360]:
df_tmp.head()
Out[360]:
index col1
0 0 [1, 2, 3]
1 1 [1, 2]
In [364]:
df_tmp['len'] = df_tmp.apply(lambda x: len(x['col1']), axis=1)
In [365]:
df_tmp
Out[365]:
index col1 len
0 0 [1, 2, 3] 3
1 1 [1, 2] 2
apply works for this, though a row-wise apply (axis=1) is generally slower than applying len to the column directly.
Use Series.apply() like this:
df['Col3'] = df['Col2'].apply(len)
Hope this helps.
I am calculating unique values, per row. However I want to exclude the value 0 and then calculate uniques
d = {'col1': [1, 2, 3], 'col2': [3, 4, 0], 'col3': [0, 4, 0],}
df = pd.DataFrame(data=d)
df
col1 col2 col3
0 1 3 0
1 2 4 4
2 3 0 0
Expected output
col1 col2 col3 uniques
0 1 3 0 2
1 2 4 4 2
2 3 0 0 1
I tried df.nunique(axis=1), but this includes all values, zeros too.
To do this you can simply replace the zeros with NaN values, which nunique ignores by default.
import pandas as pd
import numpy as np
d = {'col1': [1, 2, 3], 'col2': [3, 4, 0], 'col3': [0, 4, 0]}
df = pd.DataFrame(data=d)
df['uniques'] = df.replace(0, np.nan).nunique(axis=1)  # np.NaN was removed in NumPy 2.0
Try this:
def func(x):
    s = set(x)
    s.discard(0)
    return len(s)
df['uniq'] = df.apply(func, axis=1)
A slightly more concise version without using replace:
df['unique'] = df[df!=0].nunique(axis=1)
df
Output:
col1 col2 col3 unique
0 1 3 0 2
1 2 4 4 2
2 3 0 0 1
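One thing to watch with any of these: if the result column already exists when the line is run again, it gets counted as well. A minimal sketch that avoids this by naming the input columns explicitly (value_cols is just an illustrative name) could look like this:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [3, 4, 0], 'col3': [0, 4, 0]})

# Only these columns take part in the count, so re-running is safe.
value_cols = ['col1', 'col2', 'col3']

# mask() turns every zero into NaN; nunique() skips NaN by default.
df['uniques'] = df[value_cols].mask(df[value_cols].eq(0)).nunique(axis=1)
print(df)
```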
I want to do a groupby on column 1 then get the sum of values from column 2, conditional on the value in column 3, which are then divided by the total sum in column 2, still grouped by column 1.
An example is given below:
d = {'col1': [1, 2, 1, 2], 'col2': [3, 4, 2, 7], 'col3': [1, 1, 0, 0]}
df = pd.DataFrame(data=d)
col1 col2 col3
0 1 3 1
1 2 4 1
2 1 2 0
3 2 7 0
I want to create a new column, col4. For this column I group by col1 and then take the sum of the col2 values where col3 is 1, divided by the total grouped sum of col2, so that I end up with the following result. (I put it in fractions to make it easier to follow the calculations.)
col1 col2 col3 col4
0 1 3 1 3/5
1 2 4 1 4/11
2 1 2 0 3/5
3 2 7 0 4/11
I tried the following, but this does not work unfortunately:
df.col4 = df.groupby(['col1']).transform(lambda x: np.where(x.col3 == 1, x.col2, 0).sum()) / df.groupby(['col1']).col2.transform('sum')
Edit | Extended example
I extended the example as the solution provided by Wen only covered the above simple example.
d = {'col1': [1, 2, 1, 2, 1, 2], 'col2': [3, 4, 2, 7, 6, 8], 'col3': [1, 1, 0, 0, 1, 0]}
df = pd.DataFrame(data=d)
col1 col2 col3
0 1 3 1
1 2 4 1
2 1 2 0
3 2 7 0
4 1 6 1
5 2 8 0
Edit | Possible solution
I found a possible solution. I would like to do it in a cleaner way, but this is readable and pretty simple. Any alternatives that combine these two lines of code are still appreciated, of course.
df['col4'] = np.where(df.col3 == 1, df.col2, 0)
df['col4'] = df.groupby(['col1']).col4.transform('sum') / df.groupby(['col1']).col2.transform('sum')
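One possible way to avoid the temporary col4 assignment is sketched below on the extended example. The names flagged_total and grouped_total are only illustrative. It uses Series.where to zero out the rows where col3 is not 1 before the grouped sum:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 1, 2, 1, 2],
                   'col2': [3, 4, 2, 7, 6, 8],
                   'col3': [1, 1, 0, 0, 1, 0]})

# Total of col2 per col1 group, broadcast back to every row.
grouped_total = df.groupby('col1')['col2'].transform('sum')

# Same, but rows with col3 != 1 contribute 0 to the sum.
flagged_total = (df['col2'].where(df['col3'].eq(1), 0)
                 .groupby(df['col1'])
                 .transform('sum'))

df['col4'] = flagged_total / grouped_total
print(df)
```

For the extended example this gives 9/11 for every row in group 1 and 4/19 for every row in group 2.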
You may need to correct your expected output. You can then filter on col3 and use map:
df.col1.map(df.loc[df.col3==1,].set_index('col1').col2)/df.groupby(['col1']).col2.transform('sum')
Out[566]:
0 0.600000
1 0.363636
2 0.600000
3 0.363636
dtype: float64
simple :)
d = {'col1': [1, 2, 1, 2], 'col2': [3, 4, 2, 7], 'col3': [1, 1, 0, 0]}
df = pd.DataFrame(data=d)
df['col4'] = 0.0
def con(data):
    part_a = sum(data[data['col3'] == 1]['col2'])
    part_b = sum(data['col2'])
    data.col4 = part_a/part_b
    return data
df.groupby('col1').apply(con)
Output
col1 col2 col3 col4
0 1 3 1 0.600000
1 2 4 1 0.363636
2 1 2 0 0.600000
3 2 7 0 0.363636
If I have a dataframe df_i and I want to split it into sub-dataframes based on unique values of 'Cycle Number'
I use:
dfs = {k: df_i[df_i['Cycle Number'] == k] for k in df_i['Cycle Number'].unique()}
Assuming the 'Cycle Number' ranges from 1 to 50 and in each cycle, I have steps ranging from 1 to 15, how do I split each data frame into 15 further data frames?
I am presuming something of this type would work:
for i in range(1,51):
    dsfs = {k: dfs[i][dfs[i]['Step Number'] == k] for k in dfs[i]['Step Number'].unique()}
But this leaves me with only the 15 data frames for cycle number 50, not the ones before.
If I want to access a sub-dataframe in the 20th Cycle with step number 10, is there a way of generating the subdata frame such that I can access it using something like dfs[20][10]?
A simple parallel:
Step Number Cycle Number Desired Access
1 1 dfs[1][1]
2 1 dfs[1][2]
3 1 dfs[1][3]
4 1 dfs[1][4]
5 1 dfs[1][5]
1 2 dfs[2][1]
2 2 dfs[2][2]
3 2 dfs[2][3]
4 2 dfs[2][4]
5 2 dfs[2][5]
1 3 dfs[3][1]
2 3 dfs[3][2]
3 3 dfs[3][3]
4 3 dfs[3][4]
5 3 dfs[3][5]
1 4 dfs[4][1]
2 4 dfs[4][2]
3 4 dfs[4][3]
4 4 dfs[4][4]
5 4 dfs[4][5]
You can use tuple keys instead and utilize groupby. Here's a minimal example:
df = pd.DataFrame([[0, 1, 2], [0, 1, 3], [1, 2, 4], [1, 2, 5], [1, 3, 6], [1, 3, 7]],
columns=['col1', 'col2', 'col3'])
dfs = dict(tuple(df.groupby(['col1', 'col2'])))
for k, v in dfs.items():
print(k)
print(v)
(0, 1)
col1 col2 col3
0 0 1 2
1 0 1 3
(1, 2)
col1 col2 col3
2 1 2 4
3 1 2 5
(1, 3)
col1 col2 col3
4 1 3 6
5 1 3 7
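If the nested dfs[cycle][step] access from the question is preferred over tuple keys, a nested dict comprehension over two groupby calls is one way to get it. This is only a sketch on made-up data: the Value column and its numbers are invented for illustration.

```python
import pandas as pd

# Small stand-in for df_i; 'Value' is a hypothetical payload column.
df_i = pd.DataFrame({'Cycle Number': [1, 1, 1, 2, 2, 2],
                     'Step Number': [1, 1, 2, 1, 2, 2],
                     'Value': [10, 11, 12, 13, 14, 15]})

# Outer dict keyed by cycle, inner dict keyed by step.
dfs = {
    cycle: {step: step_df for step, step_df in cycle_df.groupby('Step Number')}
    for cycle, cycle_df in df_i.groupby('Cycle Number')
}

print(dfs[1][2])
```

With the real data, dfs[20][10] would then be the sub-dataframe for cycle 20, step 10, provided that combination exists.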
The pivot code:
result = pandas.pivot_table(result, values=['value'], index=['index'], columns=['columns'], fill_value=0)
The result:
value value value
columns col1 col2 col3
index
idx1 14 1 1
idx2 2 0 1
idx3 6 0 0
I tried:
result.columns = result.columns.get_level_values(1)
Then I got this:
columns col1 col2 col3
index
idx1 14 1 1
idx2 2 0 1
idx3 6 0 0
Actually what I would like is this one:
index col1 col2 col3
idx1 14 1 1
idx2 2 0 1
idx3 6 0 0
Is there any way to achieve this? Help is really appreciated. Thank you in advance.
You need to remove the index name with rename_axis (new in pandas 0.18.0):
df = df.rename_axis(None)
If you also need to remove the columns name, use:
df = df.rename_axis(None, axis=1)
If you use an older version of pandas, use:
df.columns.name = None
df.index.name = None
Sample (if you remove the [] from pivot_table, you also remove the MultiIndex from the columns):
print (result)
index columns value
0 1 Toys 5
1 2 Toys 6
2 2 Cars 7
3 1 Toys 2
4 1 Cars 9
print (pd.pivot_table(result, index='index',columns='columns',values='value', fill_value=0)
.rename_axis(None)
.rename_axis(None, axis=1))
Cars Toys
1 9 3.5
2 7 6.0
If you use [], you get:
result = (pd.pivot_table(result, values=['value'], index=['index'], columns=['columns'], fill_value=0)
          .rename_axis(None)
          .rename_axis((None,None), axis=1))
print (result)
value
Cars Toys
1 9 3.5
2 7 6.0
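For reference, here is a self-contained sketch of the scalar-argument variant with labels similar to those in the question. The long-format frame long_df is made up for illustration and is not the asker's actual data:

```python
import pandas as pd

# Hypothetical long-format input with the question's column names.
long_df = pd.DataFrame({'index': ['idx1', 'idx1', 'idx2', 'idx3'],
                        'columns': ['col1', 'col2', 'col1', 'col1'],
                        'value': [14, 1, 2, 6]})

# Scalar arguments (no []) give flat columns; rename_axis(None) drops
# the 'index' and 'columns' labels from the two axes.
wide = (pd.pivot_table(long_df, index='index', columns='columns',
                       values='value', fill_value=0)
        .rename_axis(None)
        .rename_axis(None, axis=1))
print(wide)
```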
Consider this dataframe:
results = pd.DataFrame(
[
[14, 1, 1],
[2, 0, 1],
[6, 0, 0]
],
pd.Index(['idx1', 'idx2', 'idx3'], name='index'),
pd.MultiIndex.from_product([['value'], ['col1', 'col2', 'col3']], names=[None, 'columns'])
)
print(results)
value
columns col1 col2 col3
index
idx1 14 1 1
idx2 2 0 1
idx3 6 0 0
Then all you need is:
print(results.value.rename_axis(None, axis=1)) # <---- Solution
col1 col2 col3
index
idx1 14 1 1
idx2 2 0 1
idx3 6 0 0