Pandas: Sort by sum of 2 columns


I have a DataFrame:
COL1 COL2
1 1
3 1
1 3
I need to sort by COL1 + COL2.
The key argument of sort_values(...), as in key=lambda col: f(col), lets you sort by a transformed column, but in this case I need to sort on the basis of 2 columns. It would be nice to pass a key function that takes 2 or more columns, but I don't know whether such an option exists.
So, how can I sort its rows by sum COL1 + COL2?
Thank you for your time!

Assuming a unique index, you can also use the key parameter of sort_values, which takes a callable that is applied to the by column. Here the callable adds the other column to it:
df.sort_values(by='COL1', key=df['COL2'].add)
We can even generalize to any number of columns using sort_index:
df.sort_index(key=df.sum(1).get)
Output:
   COL1  COL2
0     1     1
2     1     3
1     3     2
Used input:
import pandas as pd
data = {"COL1": [1, 3, 1], "COL2": [1, 2, 3]}
df = pd.DataFrame(data)

This does the trick:
data = {"Column 1": [1, 3, 1], "Column 2": [1, 2, 3]}
df = pd.DataFrame(data)
sorted_indices = (df["Column 1"] + df["Column 2"]).sort_values().index
df.loc[sorted_indices, :]
I created a Series holding the sum of both columns, sorted it, took its sorted index, and used that index to select the DataFrame rows in the new order.
(I changed the data a little so you can see the sorting in action. Using the data you provided, you wouldn't have been able to see the sorted data as it would have been the same as the original one.)
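As a minimal sketch of the same idea generalized to any subset of columns (the cols list is a name made up for this illustration), you can sum the chosen columns row-wise, sort that Series, and select the DataFrame rows in the resulting order. Passing kind="stable" keeps rows with equal sums in their original order:

```python
import pandas as pd

df = pd.DataFrame({"COL1": [1, 3, 1], "COL2": [1, 2, 3]})

# Hypothetical list of the columns whose row-wise sum defines the order.
cols = ["COL1", "COL2"]

# Row-wise sum of the chosen columns, sorted with a stable algorithm
# so that rows with equal sums keep their original relative order.
order = df[cols].sum(axis=1).sort_values(kind="stable").index

# Select the rows in the sorted order; the original index labels are kept.
sorted_df = df.loc[order]
print(sorted_df)
```

If you want a fresh 0..n-1 index afterwards, sorted_df.reset_index(drop=True) does that.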

Related

Pandas groupby does not set index correctly

I'm a bit confused by the behaviour of the pandas groupby function:
df = pd.DataFrame({"row_id":[1,2,3], "group": [1,2,2], "col1":[1,100,2], "col2":[2,200,2]})
for i, e in df.groupby("group", as_index=True):
    print(e.index)
Here I would expect that I will get the "group" column as a new index. However the print returns:
Int64Index([0], dtype='int64')
Int64Index([1, 2], dtype='int64')
So the "old" index is kept, and the column "group" is still in place as a separate column.
Shouldn't the result look like this:
Int64Index([1], dtype='int64')
Int64Index([2, 2], dtype='int64')
I don't understand the logic, especially as as_index=False doesn't change anything.
P.S. I am using pandas 1.3.5.

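as_index only changes the format of aggregated output. It has no effect when you iterate over the groups, where each sub-frame keeps its original index. From the documentation: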
For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output.
>>> df.groupby('group', as_index=False).first()
   group  row_id  col1  col2
0      1       1     1     2
1      2       2   100   200
>>> df.groupby('group', as_index=True).first()
       row_id  col1  col2
group
1           1     1     2
2           2   100   200
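To see the difference, here is a small sketch using the DataFrame from the question. Iterating over the groupby object yields sub-frames that keep the original row index regardless of as_index. If you want the group label as the index of each piece, one option (an assumption about what the question is after) is to call set_index on each sub-frame:

```python
import pandas as pd

df = pd.DataFrame({"row_id": [1, 2, 3], "group": [1, 2, 2],
                   "col1": [1, 100, 2], "col2": [2, 200, 2]})

# Iterating yields (key, sub-frame) pairs; each sub-frame keeps
# the row labels it had in the original DataFrame.
original_indices = {int(key): sub.index.tolist()
                    for key, sub in df.groupby("group")}

# Setting the index explicitly moves "group" out of the columns
# and into the index of each piece.
relabelled = {int(key): sub.set_index("group").index.tolist()
              for key, sub in df.groupby("group")}

print(original_indices)
print(relabelled)
```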

Opposite of factorize function (Map numeric to categorical values)

I am searching for a way to map some numeric columns to categorical features.
All columns are categorical in nature but are represented as integers. However, I need them to be strings.
e.g.
col1 col2 col3 -> col1new col2new col3new
0 1 1 -> "0" "1" "1"
2 2 3 -> "2" "2" "3"
1 3 2 -> "1" "3" "2"
It does not matter what kind of String the new column contains as long as all distinct values from the original data set map to the same new String value.
Any ideas?
I have a NumPy representation of my data right now, but any pandas solution would also be helpful.
Thanks a lot!

You can use the applymap method (deprecated since pandas 2.1 in favor of DataFrame.map, which does the same thing). Consider the following example:
df = pd.DataFrame({'col1': [0, 2, 1], 'col2': [1, 2, 3], 'col3': [1, 3, 2]})
df.applymap(str)
   col1  col2  col3
0     0     1     1
1     2     2     3
2     1     3     2
You can convert all elements of col1, col2, and col3 to str using the following command:
df = df.applymap(str)

You can convert the element type of each column with the Series.apply function that pandas offers.
import numpy as np
import pandas as pd
frame = pd.DataFrame(np.random.randint(0, 90, size=(5000000, 3)), columns=['col1', 'col2', 'col3'])
In a new DataFrame you can then define the columns and their values:
updated_frame = pd.DataFrame()
updated_frame['col1new'] = frame['col1'].apply(str)
updated_frame['col2new'] = frame['col2'].apply(str)
updated_frame['col3new'] = frame['col3'].apply(str)

You could use the .astype method. If you want to replace all the current columns with a string version, you could do the following (where df is your DataFrame):
df = df.astype(str)
If you want to add the string columns as new ones:
df = df.assign(**{f"{col}new": df[col].astype(str) for col in df.columns})
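A quick sketch, assuming the integer columns from the question, to check that astype(str) yields real Python strings and that the mapping is reversible (so distinct values stay distinct):

```python
import pandas as pd

df = pd.DataFrame({"col1": [0, 2, 1], "col2": [1, 2, 3], "col3": [1, 3, 2]})

# Convert every column to strings in one call.
as_text = df.astype(str)

# Check that each cell is now a Python str.
all_str = all(isinstance(value, str) for value in as_text.to_numpy().ravel())

# Converting back to integers recovers the original values,
# which shows that no two distinct values were merged.
restored = as_text.astype("int64")

print(as_text)
print(all_str)
```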

Python Pandas find value in dataframe regardless of column

Is there a simple way to check for a value within a DataFrame when it could be in any of several columns? Either by using iterrows and searching each row for the value to find which column it is in, or by checking the DataFrame as a whole and getting its position, like iat coordinates.

import pandas as pd
d = {'id': [1, 2, 3], 'col2': [3, 4, 5], 'col3': [8,3,9]}
df = pd.DataFrame(data=d)
df = df.set_index('id')
df
Sample Data
    col2  col3
id
1      3     8
2      4     3
3      5     9
Find 3
df.isin([3]).any()
Output Column:
col2 True
col3 True
dtype: bool
Want more details? Here you go:
df[df.isin([3])].stack().index.tolist()
Co-ordinates output:
[(1, 'col2'), (2, 'col3')]

You can compare the whole DataFrame against the value to get a Boolean DataFrame marking every cell equal to var1, then keep the rows in which any column matches:
df[df.eq(var1).any(axis=1)]
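If you also want the positions, the following sketch combines the Boolean mask with numpy.where to recover (index label, column name) pairs. The variable names (target, mask, coords) are chosen for this illustration only:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "col2": [3, 4, 5], "col3": [8, 3, 9]})
df = df.set_index("id")

target = 3  # the value to look for

# Boolean DataFrame: True wherever a cell equals the target.
mask = df.eq(target)

# Rows that contain the target in at least one column.
rows_with_value = df[mask.any(axis=1)]

# Positional row/column numbers of the matches, translated to labels.
row_pos, col_pos = np.where(mask.to_numpy())
coords = [(df.index[r], df.columns[c]) for r, c in zip(row_pos, col_pos)]

print(rows_with_value)
print(coords)
```

The positional pairs (row_pos, col_pos) can be passed to df.iat directly, while coords holds the label-based equivalent for df.at.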

Python - count number of elements that are equal between two columns of two dataframes

I have two dataframes: df1, df2
that contain two columns, col1 and col2. I would like to calculate the number of elements in column col1 of df1 that are equal to col2 of df2. How can I do that?

You can use Series.isin and sum the resulting Boolean mask, df1.col1.isin(df2.col2).sum():
df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6]})
df2 = pd.DataFrame({'col2': [1, 3, 5, 7]})
nb_common_elements = df1.col1.isin(df2.col2).sum()
assert nb_common_elements == 3
Be cautious depending on your use case because:
df1 = pd.DataFrame({'col1': [1, 1, 1, 2, 7]})
df1.col1.isin(df2.col2).sum()
This returns 4 and not 2, because every 1 in df1.col1 is present in df2.col2 and each occurrence is counted. If that's not the expected behaviour, you could drop duplicates from df1.col1 before testing the intersection size:
df1.col1.drop_duplicates().isin(df2.col2).sum()
Which in this example would return 2.
To better understand why this is happening, you can have a look at what .isin returns:
df1['isin df2.col2'] = df1.col1.isin(df2.col2)
Which gives:
   col1  isin df2.col2
0     1           True
1     1           True
2     1           True
3     2          False
4     7           True
Now .sum() adds up the booleans from column isin df2.col2 (a total of 4 True).

I assume you're using pandas.
One way is to use pd.merge, matching col1 of df1 against col2 of df2, and take the length of the result.
len(pd.merge(df1, df2, left_on="col1", right_on="col2"))
Pandas does an inner merge by default.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
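Since the two columns have different names here, a sketch of the merge approach would use left_on and right_on rather than on. Note that an inner merge counts matching row pairs, so duplicates on both sides multiply. The set intersection below counts distinct common values instead:

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 1, 1, 2, 7]})
df2 = pd.DataFrame({"col2": [1, 3, 5, 7]})

# Inner merge (the default): one output row per matching pair of rows.
merged = pd.merge(df1, df2, left_on="col1", right_on="col2")
n_matching_rows = len(merged)

# Number of distinct values that occur in both columns.
n_distinct_common = len(set(df1["col1"]) & set(df2["col2"]))

print(n_matching_rows, n_distinct_common)
```

Here df2.col2 has no duplicates, so the merge length agrees with the isin count from the other answer.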

append dictionary to data frame

I have a function, which returns a dictionary like this:
{'truth': 185.179993, 'day1': 197.22307753038834, 'day2': 197.26118010160317, 'day3': 197.19846975345905, 'day4': 197.1490578795196, 'day5': 197.37179265011116}
I am trying to append this dictionary to a dataframe like so:
output = pd.DataFrame()
output.append(dictionary, ignore_index=True)
print(output.head())
Unfortunately, the printing of the dataframe results in an empty dataframe. Any ideas?

append returns a new DataFrame instead of modifying output in place, so you need to assign the result back:
output = pd.DataFrame()
output = output.append(dictionary, ignore_index=True)
print(output.head())

The previous answer (user alex, answered Aug 9 2018 at 20:09) now triggers a warning saying that DataFrame.append is deprecated. The method was deprecated in pandas 1.4 and removed in pandas 2.0.
A way to do it is to transform the dictionary into a DataFrame and then concatenate the DataFrames:
output = pd.DataFrame()
df_dictionary = pd.DataFrame([dictionary])
output = pd.concat([output, df_dictionary], ignore_index=True)
print(output.head())

I always do it this way because the syntax is less confusing to me, although I believe the concat method is the recommended one.
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df
   col1  col2
0     1     3
1     2     4
>>> d = {'col1': 5, 'col2': 6}
>>> df.loc[len(df)] = d
>>> df
   col1  col2
0     1     3
1     2     4
2     5     6
Note that iloc won't work this way, because it cannot enlarge the DataFrame.
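If the function is called many times, a pattern that avoids growing the DataFrame row by row is to collect the dictionaries in a list and build the DataFrame once at the end. In this sketch make_result is a hypothetical stand-in for the function from the question, and its values are made up:

```python
import pandas as pd


def make_result(offset):
    """Hypothetical stand-in for the function that returns one dictionary per call."""
    return {"truth": 185.0 + offset, "day1": 197.0 + offset}


# Collect one dictionary per call in a plain Python list.
rows = [make_result(i) for i in range(3)]

# Build the DataFrame in a single step; dictionary keys become column names.
output = pd.DataFrame(rows)
print(output.head())
```

Each dictionary becomes one row, and the index is a default 0..n-1 range.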
