Pandas. Cannot sort by multiple columns - python

Edited for clarity:
I have a dataframe in the following format
i col1 col2 col3
0 00:00:00,1 10 1.7
1 00:00:00,2 10 1.5
2 00:00:00,3 50 4.6
3 00:00:00,4 30 3.4
4 00:00:00,5 20 5.6
5 00:00:00,6 50 1.8
6 00:00:00,9 20 1.9
...
That I'm trying to sort like this (by col2, then by col1 within each col2 group):
i col1 col2 col3
0 00:00:00,1 10 1.7
1 00:00:00,2 10 1.5
4 00:00:00,5 20 5.6
6 00:00:00,9 20 1.9
3 00:00:00,4 30 3.4
2 00:00:00,3 50 4.6
5 00:00:00,6 50 1.8
...
I've tried df = df.sort_values(by=['col1', 'col2']), which appears to sort only by col1.
I understand that it may have something to do with the values being strings, but I can't seem to find a workaround for it.

df.sort_values(by=['col2', 'col1'])
gave the desired result: the first name in by is the primary sort key, and later names only break ties within it.
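A minimal sketch (with a hypothetical three-row frame) showing that the order of names in by sets the priority:

import pandas as pd

# hypothetical sample mirroring the question's data
df = pd.DataFrame({'col1': ['00:00:00,5', '00:00:00,1', '00:00:00,9'],
                   'col2': [20, 10, 20]})
# col2 is the primary key; col1 breaks ties within equal col2 values
print(df.sort_values(by=['col2', 'col1']))
#          col1  col2
# 1  00:00:00,1    10
# 0  00:00:00,5    20
# 2  00:00:00,9    20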

If you need to sort each column independently, use Series.sort_values inside DataFrame.apply:
c = ['col1','col2']
df[c] = df[c].apply(lambda x: x.sort_values().to_numpy())
#alternative
df[c] = df[c].apply(lambda x: x.sort_values().tolist())
print(df)
i col1 col2
0 0 00:00:00,1 10
1 1 00:00:01,5 20
2 2 00:00:10,0 30
3 3 00:01:00,1 40
4 5 01:00:00,0 50
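An equivalent alternative, assuming the same columns as above, is numpy.sort, which sorts each column of the extracted array independently:

import numpy as np

# axis=0 sorts down the rows, one column at a time
df[c] = np.sort(df[c].to_numpy(), axis=0)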

Related

Multiplying Dataframes with Different Dimensions Pandas: Same Number of Columns, But Different Number of Rows

I have two Dataframes.
df1 with a shape of (1, 3),
df2 with a shape of (10, 3).
df1 looks like this:
col0 col1 col2
0 0.3 0.14 0.34
df2 looks like this:
col0 col1 col2
0 5 10 15
1 36 30 39
2 42 21 44
3 49 37 34
4 19 14 50
5 28 27 48
6 19 28 45
7 4 7 8
8 31 4 33
9 3 23 43
I would like to multiply df2 by df1 along the column axis; i.e. col0 of df2 by col0 of df1, col1 of df2 by col1 of df1, and col2 of df2 by col2 of df1.
The result I seek:
col0 col1 col2
0 1.5 1.4 5.1
1 10.8 4.2 13.26
2 12.6 2.94 14.96
3 14.7 5.18 11.56
4 5.7 1.96 17
5 8.4 3.78 16.32
6 5.7 3.92 15.3
7 1.2 0.98 2.72
8 9.3 0.56 11.22
9 0.9 3.22 14.62
Here is my unsuccessful attempt:
columns = df1.columns
product = df2.multiply(df1[columns], axis=columns)
It throws a "Length mismatch" ValueError.
What can be done to make it work? I searched through the forums, but I could not find an answer which matches my exact requirements.
Convert to ndarrays and multiply - they broadcast correctly because df1 has a single row:
vals1 = df1.to_numpy()
vals2 = df2.to_numpy()
result = vals1 * vals2
Or
df2 * df1.to_numpy()
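To stay in pandas with label alignment instead, a sketch using DataFrame.mul with the single row of df1 as a Series (assuming both frames share the same column names):

# df1.iloc[0] is a Series indexed by column name; axis='columns' aligns it with df2's columns
result = df2.mul(df1.iloc[0], axis='columns')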
Here is a long way of resolving the challenge:
import numpy as np
import pandas as pd

columns = df1.columns
index = df2.index
length = len(df2)
df1_array = df1.to_numpy()
# repeat df1's single row so its shape matches df2
df1_tiled = np.tile(df1_array, (length, 1))
df1_tiled_frame = pd.DataFrame(df1_tiled, columns=columns, index=index)
product = df2.multiply(df1_tiled_frame, axis="columns")
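As a sanity check (assuming the frames above), the tiled result should match the broadcast result exactly:

# both approaches should produce identical frames
pd.testing.assert_frame_equal(product, df2 * df1.to_numpy())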

How to generate dataframe using column values in some other dataframe

I am working on a dataset which is in the following dataframe.
#print(old_df)
col1 col2 col3
0 1 10 1.5
1 1 11 2.5
2 1 12 5.6
3 2 10 7.8
4 2 24 2.1
5 3 10 3.2
6 4 10 22.1
7 4 11 1.3
8 4 89 0.5
9 4 91 3.3
I am trying to generate another dataframe which uses selected col1 values as the index and selected col2 values as the columns, with the respective col3 value in each cell.
E.g.:
selected_col1 = [1,2]
selected_col2 = [10,11,24]
The new data frame should look like this:
#print(selected_df)
10 11 24
1 1.5 2.5 NaN
2 7.8 NaN 2.1
I have tried the following method:
selected_col1 = [1,2]
selected_col2 = [10,11,24]
selected_df = pd.DataFrame(index=selected_col1, columns=selected_col2)
for col1_value in selected_col1:
    for col2_value in selected_col2:
        qry = 'col1 == {} & col2 == {}'.format(col1_value, col2_value)
        col3_value = old_df.query(qry).col3.values
        if len(col3_value) > 0:
            selected_df.at[col1_value, col2_value] = col3_value[0]
But because my dataframe has around 20 million rows, this brute-force method takes a long time. Is there a better way?
First filter the rows by membership with Series.isin on both columns, chained with & for bitwise AND, and then use DataFrame.pivot:
df = df[df['col1'].isin(selected_col1) & df['col2'].isin(selected_col2)]
df = df.pivot(index='col1', columns='col2', values='col3')
print (df)
col2 10 11 24
col1
1 1.5 2.5 NaN
2 7.8 NaN 2.1
If duplicated col1/col2 pairs are possible after filtering, use DataFrame.pivot_table instead:
df = df.pivot_table(index='col1', columns='col2', values='col3', aggfunc='mean')
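One caveat worth a sketch: pivot only emits the values that survive the filter, so to guarantee exactly the selected rows and columns (with NaN for absent pairs), reindex the result, assuming selected_col1 and selected_col2 as above:

# force the output to the requested shape, filling missing pairs with NaN
df = df.reindex(index=selected_col1, columns=selected_col2)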
EDIT:
If you use | for bitwise OR instead, you get different output:
df = df[df['col1'].isin(selected_col1) | df['col2'].isin(selected_col2)]
df = df.pivot(index='col1', columns='col2', values='col3')
print (df)
col2 10 11 12 24
col1
1 1.5 2.5 5.6 NaN
2 7.8 NaN NaN 2.1
3 3.2 NaN NaN NaN
4 22.1 1.3 NaN NaN

Pandas merge two df

I have two DataFrames
df1 has following form
ID col1 col2
0 1 2 10
1 3 1 21
and df2 looks like this
ID field1 field2
0 1 4 1
1 1 3 3
2 3 5 4
3 3 9 5
4 1 2 0
I want to concatenate both DataFrames so that I have only one line per ID, like this:
ID col1 col2 field1_1 field2_1 field1_2 field2_2 field1_3 field2_3
0 1 2 10 4 1 3 3 2 0
1 3 1 21 5 4 9 5
I have tried merging and pivoting the data with df.pivot(index=df1.index, columns='ID'), but because the number of rows per ID varies, I get a ValueError:
ValueError: all arrays must be same length
Without over-formatting, we want to merge and add a MultiIndex level that counts the occurrences of each 'ID'.
df = df1.merge(df2)
cc = df.groupby('ID').cumcount()
df.set_index(['ID', 'col1', 'col2', cc]).unstack()
field1 field2
0 1 2 0 1 2
ID col1 col2
1 2 10 4.0 3.0 2.0 1.0 3.0 0.0
3 1 21 5.0 9.0 NaN 4.0 5.0 NaN
We can nail down the formatting with:
df = df1.merge(df2)
cc = df.groupby('ID').cumcount() + 1
d1 = df.set_index(['ID', 'col1', 'col2', cc]).unstack().sort_index(axis=1, level=1)
d1.columns = d1.columns.to_series().map('{0[0]}_{0[1]}'.format)
d1.reset_index()
ID col1 col2 field1_1 field2_1 field1_2 field2_2 field1_3 field2_3
0 1 2 10 4.0 1.0 3.0 3.0 2.0 0.0
1 3 1 21 5.0 4.0 9.0 5.0 NaN NaN
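A sketch of an equivalent way to flatten the MultiIndex columns, using an f-string comprehension instead of Series.map:

# each column label is a (field, counter) tuple after unstack
d1.columns = [f'{field}_{num}' for field, num in d1.columns]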

calculate percentage values depending on size group in dataframe - pandas

I have a dataframe as below:
idx col1 col2 col3
0 1.1 A 100
1 1.1 A 100
2 1.1 A 100
3 2.6 B 100
4 2.5 B 100
5 3.4 B 100
6 2.6 B 100
I want to update col3 with percentage values depending on the group size of (col1, col2); i.e., each of the three rows with (1.1, A) should get 33.33 in col3.
Desired output:
idx col1 col2 col3
0 1.1 A 33.33
1 1.1 A 33.33
2 1.1 A 33.33
3 2.6 B 50
4 2.5 B 100
5 3.4 B 100
6 2.6 B 50
I think you need groupby with transform('size'):
df['col3'] = 100 / df.groupby(['col1', 'col2'])['col3'].transform('size')
print(df)
col1 col2 col3
idx
0 1.1 A 33.333333
1 1.1 A 33.333333
2 1.1 A 33.333333
3 2.6 B 50.000000
4 2.5 B 100.000000
5 3.4 B 100.000000
6 2.6 B 50.000000
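A minimal runnable sketch of the same idea, with the sample data typed in (a hypothetical reconstruction of the frame above):

import pandas as pd

df = pd.DataFrame({'col1': [1.1, 1.1, 1.1, 2.6, 2.5, 3.4, 2.6],
                   'col2': list('AAABBBB'),
                   'col3': [100] * 7})
# each row gets 100 divided by the size of its (col1, col2) group
df['col3'] = 100 / df.groupby(['col1', 'col2'])['col3'].transform('size')
print(df)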

Pandas Dataframe split in to sessions

This is an extension to my question.
To make it simpler, let's suppose I have a pandas dataframe as follows.
df = pd.DataFrame([[1.1, 1.1, 2.5, 2.6, 2.5, 3.4,2.6,2.6,3.4], list('AAABBBBAB'), [1.1, 1.7, 2.5, 2.6, 3.3, 3.8,4.0,4.2,4.3]]).T
df.columns = ['col1', 'col2','col3']
dataframe :
col1 col2 col3
0 1.1 A 1.1
1 1.1 A 1.7
2 2.5 A 2.5
3 2.6 B 2.6
4 2.5 B 3.3
5 3.4 B 3.8
6 2.6 B 4
7 2.6 A 4.2
8 3.4 B 4.3
I want to group this based on some conditions. The logic depends on the col1/col2 values and the cumulative difference of col3:
Go to col1 and find other occurrences of the same value.
In my case the first value of col1 is 1.1, and the same value appears again in the second row.
Then check the col2 value. If they are the same, get the cumulative difference of col3.
If the cumulative difference is greater than 0.5, mark this as a new session.
If the col1 values are the same but the col2 values are different, mark them as a new session.
expected output:
col1 col2 col3 session
0 1.1 A 1.1 0
1 1.1 A 1.7 1
2 2.5 A 2.5 2
3 2.6 B 2.6 4
4 2.5 B 3.3 3
5 3.4 B 3.8 7
6 2.6 B 4 5
7 2.6 A 4.2 6
8 3.4 B 4.3 7
As in the excellent answer you linked to ;) first create the session number:
In [11]: g = df.groupby(['col1', 'col2'])
In [12]: df['session_number'] = g['col3'].apply(lambda s: (s - s.shift(1) > 0.5).fillna(0).cumsum(skipna=False))
Then I think you want to set_index with these columns; this may be enough for many use cases (though it might be worth doing a sort):
In [13]: df1 = df.set_index(['col1', 'col2', 'session_number'])
In [14]: df1
Out[14]:
col3
col1 col2 session_number
1.1 A 0 1.1
1 1.7
2.5 A 0 2.5
2.6 B 0 2.6
2.5 B 0 3.3
3.4 B 0 3.8
2.6 B 1 4
A 0 4.2
3.4 B 0 4.3
If you really want, you can grab out the session number:
In [15]: g1 = df.groupby(['col1', 'col2', 'session_number']) # I think there is a slightly neater way, but I forget..
In [16]: df1['session'] = g1.apply(lambda x: 1).cumsum() # could -1 here if it matters
In [17]: df1
Out[17]:
col3 session
col1 col2 session_number
1.1 A 0 1.1 1
1 1.7 2
2.5 A 0 2.5 3
2.6 B 0 2.6 6
2.5 B 0 3.3 4
3.4 B 0 3.8 8
2.6 B 1 4 7
A 0 4.2 5
3.4 B 0 4.3 8
If you want this in columns (as in your question), then reset_index; afterwards you could delete the session column:
In [18]: df1.reset_index()
Out[18]:
col1 col2 session_number col3 session
0 1.1 A 0 1.1 1
1 1.1 A 1 1.7 2
2 2.5 A 0 2.5 3
3 2.6 B 0 2.6 6
4 2.5 B 0 3.3 4
5 3.4 B 0 3.8 8
6 2.6 B 1 4 7
7 2.6 A 0 4.2 5
8 3.4 B 0 4.3 8
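In recent pandas versions, a sketch of a more direct way to get the global session id is GroupBy.ngroup (note the numbering may come out in a different order than the apply/cumsum approach above):

# one consecutive integer per (col1, col2, session_number) group
df['session'] = df.groupby(['col1', 'col2', 'session_number']).ngroup() + 1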
