My output looks like this:
binnedb Proba-A Proba-B Esperance-A Esperance-B
0 (0.0101, 0.0202] 0.547826 0.539130 0.007817 0.007693
1 (0.0302, 0.0402] 0.547826 0.539130 0.005963 0.005854
2 (0.0201, 0.0302] 0.547826 0.539130 0.008360 0.008227
What I would like to do is to sort the df in ascending order based on the binnedb column (which will then also be in ascending order). Please let me know if you don't understand the question. This is what I tried so far:
df.sort_values(by=['binnedb'], ascending=False)
But it does not work... thanks!
Since it is an interval-type column, you can use .left to get the left edge of each interval and sort based on it:
df['sortkey']=df.binnedb.map(lambda x : x.left)
df=df.sort_values('sortkey')
Interval columns are actually categorical columns which follow a specific ordering. If "binnedb" is a categorical column, you can access its category codes and use argsort:
df = df.iloc[df['binnedb'].cat.codes.argsort()]
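To see the .left approach end to end, here is a minimal self-contained sketch; the toy values and column names are invented for illustration, only "binnedb" comes from the question:

```python
import pandas as pd

# Toy frame with an interval column like "binnedb" in the question
df = pd.DataFrame({"value": [0.015, 0.035, 0.025],
                   "Proba-A": [0.547826] * 3})
df["binnedb"] = pd.cut(df["value"], bins=[0.0101, 0.0201, 0.0302, 0.0402])

# Sort on the left edge of each interval, then drop the helper column
df["sortkey"] = df["binnedb"].map(lambda x: x.left)
df = df.sort_values("sortkey").drop(columns="sortkey")
print(df)
```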
I have a dataframe with the columns "OfferID", "SiteID" and "CategoryID", which should represent an online ad on a website. I then want to add a new column called "NPS" for the net promoter score. The values should be assigned randomly between 1 and 10, but where the OfferID, the SiteID and the CategoryID are the same, they need to have the same value for the NPS. I thought of using a dictionary where the NPS is the key and the tuples of different IDs are the values, but I haven't found a good way to do this.
Are there any recommendations?
Thanks in advance.
Alina
The easiest would be to first remove all duplicates; you can do this using:
uniques = df[['OfferID', 'SiteID', 'CategoryID']].drop_duplicates(keep="first")
Afterwards, you can do something like this (note that the random values are not guaranteed to be unique):
import random
uniques['NPS'] = [random.randint(1, 10) for x in uniques.index]
And then:
df = df.merge(uniques, on=['OfferID', 'SiteID', 'CategoryID'], how='left')
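The drop_duplicates + merge recipe can be sketched on a toy table (the ID values below are invented; the seed is only there to make the run repeatable):

```python
import random
import pandas as pd

random.seed(0)

# Toy ad table: rows 0, 1 and 4 share the same ID triple
df = pd.DataFrame({
    "OfferID":    [1, 1, 2, 2, 1],
    "SiteID":     [10, 10, 20, 20, 10],
    "CategoryID": [5, 5, 6, 6, 5],
})

# One row per unique ID combination
uniques = df[["OfferID", "SiteID", "CategoryID"]].drop_duplicates(keep="first")

# Random NPS between 1 and 10, as in the question
uniques["NPS"] = [random.randint(1, 10) for _ in range(len(uniques))]

# Broadcast the NPS back: identical ID triples get identical scores
df = df.merge(uniques, on=["OfferID", "SiteID", "CategoryID"], how="left")
print(df)
```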
I have a data frame in which one column 'F' has values from 0 to 100 and a second column 'E' has values from 0 to 500. I want to create a matrix in which frequencies fall within ranges in both 'F' and 'E'. For example, I want to know the frequency in range 20 to 30 for 'F' and range 400 to 500 for 'E'.
What I expect to have is the following matrix:
matrix of ranges
I have tried to group ranges using pd.cut() and groupby() but I don't know how to join data.
I really appreciate your help in creating the matrix with pandas.
You can use the cut function to create the bin "tag/name" for each column. After that, you can pivot the data frame.
df['rows'] = pd.cut(df['F'], 5)
df['cols'] = pd.cut(df['E'], 5)
df = df.groupby(['rows', 'cols']).size().reset_index(name='counts')  # frequency per bin pair
df = df.pivot(columns='cols', index='rows', values='counts')
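For the pure frequency case, pd.crosstab collapses the groupby + pivot steps into one call. A minimal sketch on invented uniform data (the seed and sizes are arbitrary):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"F": rng.uniform(0, 100, 500),
                   "E": rng.uniform(0, 500, 500)})

rows = pd.cut(df["F"], 5)   # bin F into 5 ranges
cols = pd.cut(df["E"], 5)   # bin E into 5 ranges

# Frequency matrix: count of rows falling in each (F-range, E-range) cell
matrix = pd.crosstab(rows, cols)
print(matrix)
```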
So this is the way I found to create the matrix, which was obviously inspired by @usher's answer. I know it's more convoluted but I wanted to share it. Thanks again @usher.
E=df.E
F=df.F
bins_E = pd.cut(E, bins=int((max(E) - min(E)) / 100))
bins_F = pd.cut(F, bins=int((max(F) - min(F)) / 10))
bins_EF=bins_E.to_frame().join(bins_F)
freq_EF=bins_EF.groupby(['E', 'F']).size().reset_index(name="counts")
Mat_FE = freq_EF.pivot(columns='E', index='F')
I have a bunch of dataframes with one categorical column defining Sex (M/F). I want to assign the integer 1 to Male and 2 to Female. I have the following code, but it codes them to 0 and 1 instead:
df4["Sex"] = df4["Sex"].astype('category')
df4.dtypes
df4["Sex_cat"] = df4["Sex"].cat.codes
df4.head()
But I need specifically for M to be 1 and F to be 2. Is there a simple way to assign specific integers to categories?
IIUC:
df4['Sex'] = df4['Sex'].map({'M':1,'F':2})
And now:
print(df4)
Would give the desired result.
If you need to impose a specific ordering, you can use pd.Categorical:
c = pd.Categorical(df4["Sex"], categories=['M','F'], ordered=True)
This ensures "M" is given the smallest code, "F" the next, and so on. You can then just access the codes and add 1.
df4['Sex_cat'] = c.codes + 1
It is better to use pd.Categorical than astype('category') if you want finer control over which categories are assigned which codes.
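A runnable sketch of the pd.Categorical route on a four-row toy column:

```python
import pandas as pd

df4 = pd.DataFrame({"Sex": ["M", "F", "F", "M"]})

# Fix the category order so M -> code 0, F -> code 1, then shift to 1/2
c = pd.Categorical(df4["Sex"], categories=["M", "F"], ordered=True)
df4["Sex_cat"] = c.codes + 1
print(df4)
```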
You can also use lambda with apply:
df4['Sex_cat'] = df4['Sex'].apply(lambda x: 1 if x == 'M' else 2)
I have this Pandas DataFrame and I have to convert some of the items into coordinates (meaning they have to be floats), but the conversion includes the indexes. So I tried to set the index to the first column of the DataFrame, but it doesn't work. I wonder if it has anything to do with the fact that it is a slice of the whole DataFrame, only the "Latitude" and "Longitude" section.
df = df_volc.iloc(axis = 0)[0:, 3:5]
df.set_index("hello", inplace = True, drop = True)
df
and I get a really long error, but this is the last part of it:
KeyError: '34.50'
if I don't do the set_index part I get:
Latitude Longitude
0 34.50 131.60
1 -23.30 -67.62
2 14.50 -90.88
I just wanna know if it's possible to get rid of the indexes or set them.
The parameter you need to pass to the set_index() function is keys: a column label or a list of column labels/arrays. In your scenario, it seems "hello" is not a column name.
I just wanna know if it's possible to get rid of the indexes or set them.
It is possible to replace the 0, 1, 2 index with something else, though it doesn't sound like it's necessary for your end goal:
to convert some of the items into [...] floats
To achieve this, you could overwrite the existing values using astype():
df['Latitude'] = df['Latitude'].astype('float')
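A small self-contained sketch of the astype() conversion, using string coordinates shaped like the question's output (values copied from the printed frame above):

```python
import pandas as pd

# String coordinates like the question's Latitude/Longitude columns
df = pd.DataFrame({"Latitude": ["34.50", "-23.30", "14.50"],
                   "Longitude": ["131.60", "-67.62", "-90.88"]})

# Overwrite the string columns with float versions
df["Latitude"] = df["Latitude"].astype("float")
df["Longitude"] = df["Longitude"].astype("float")
print(df.dtypes)
```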
I'd like to sort the following pandas data frame by the result of df['user_id'].value_counts().
import pandas as pd
n = 100
df = pd.DataFrame(index=pd.Index(range(1, n+1), name='gridimage_id'))
df['user_id'] = 2
df['has_term'] = True
df.iloc[:10, 0] = 1
The sort should be stable, meaning that whilst user 2's rows would come before user 1's rows, the user 2's rows and user 1's rows would be in the original order.
I was thinking about using df.groupby, merging df['user_id'].value_counts() with the data frame, and also converting df['user_id'] to ordered categorical data. However, none of these approaches seemed particularly elegant.
Thanks in advance for any help!
transform and argsort
Use kind='mergesort' for stability
df.iloc[df.groupby('user_id').user_id.transform('size').argsort(kind='mergesort')]
factorize, bincount, and argsort
Use kind='mergesort' for stability
import numpy as np

i, r = pd.factorize(df['user_id'])
a = np.argsort(np.bincount(i)[i], kind='mergesort')
df.iloc[a]
Response to Comments
Thank you @piRSquared. Is it possible to reverse the sort order, though? value_counts is in descending order. In the example, user 2 has 90 rows and user 1 has 10 rows. I'd like user 2's rows to come first. Unfortunately, Series.argsort ignores the order kwarg. – Iain Dillingham
Quick and Dirty
Make the counts negative
df.iloc[df.groupby('user_id').user_id.transform('size').mul(-1).argsort(kind='mergesort')]
Or
i, r = pd.factorize(df['user_id'])
a = np.argsort(-np.bincount(i)[i], kind='mergesort')
df.iloc[a]