I would like to avoid iterrows here.
From my dataframe I want to create a new column "unique": each distinct pair of values in columns "a" and "b" gets a label "uniqueN", and every later occurrence of that exact "a", "b" pair gets the same "uniqueN" label.
In this case:
"1", "3" (the first row) is the first unique "a", "b" pair, so it gets the value "unique1"; the seventh row is also "1", "3", so it gets "unique1" as well.
"2", "2" (the second row) is the next unique "a", "b" pair, so it gets "unique2"; the eighth row is also "2", "2", so it also gets "unique2".
"3", "1" (the third row) is the next unique pair, so it gets "unique3"; no other row in the df is "3", "1", so that value won't repeat.
And so on.
I have working code that uses loops, but this is not the pandas way. Can anyone suggest how I can do this using pandas functions?
Expected Output (my code works, but it's not using pandas methods)
a b unique
0 1 3 unique1
1 2 2 unique2
2 3 1 unique3
3 4 2 unique4
4 3 3 unique5
5 4 2 unique4
6 1 3 unique1
7 2 2 unique2
Code
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4, 3, 4, 1, 2], 'b': [3, 2, 1, 2, 3, 2, 3, 2]})
c = 1
seen = {}
for i, j in df.iterrows():
    j = tuple(j)
    if j not in seen:
        seen[j] = 'unique' + str(c)
        c += 1
for key, value in seen.items():
    df.loc[(df.a == key[0]) & (df.b == key[1]), 'unique'] = value
Use groupby ngroup with sort=False so that groups are numbered in order of first appearance, add 1 so the group numbers start at one, then convert to string with astype so the prefix "unique" can be prepended to the number:
df['unique'] = 'unique' + \
    df.groupby(['a', 'b'], sort=False).ngroup().add(1).astype(str)
Or with map and format instead of converting and concatenating:
df['unique'] = (
    df.groupby(['a', 'b'], sort=False).ngroup()
    .add(1)
    .map('unique{}'.format)
)
df:
a b unique
0 1 3 unique1
1 2 2 unique2
2 3 1 unique3
3 4 2 unique4
4 3 3 unique5
5 4 2 unique4
6 1 3 unique1
7 2 2 unique2
Setup:
import pandas as pd
df = pd.DataFrame({
'a': [1, 2, 3, 4, 3, 4, 1, 2], 'b': [3, 2, 1, 2, 3, 2, 3, 2]
})
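To see what sort=False changes, here is a small sketch (the variable names by_appearance and by_sorted_key are mine, not part of the answer) that compares the group numbers with and without it on the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 3, 4, 1, 2], 'b': [3, 2, 1, 2, 3, 2, 3, 2]
})

# sort=False: groups are numbered in the order their (a, b) pair first appears
by_appearance = df.groupby(['a', 'b'], sort=False).ngroup()

# default sort=True: groups are numbered by the sorted order of the (a, b) keys
by_sorted_key = df.groupby(['a', 'b']).ngroup()

print(by_appearance.tolist())  # [0, 1, 2, 3, 4, 3, 0, 1]
print(by_sorted_key.tolist())  # [0, 1, 2, 4, 3, 4, 0, 1]
```

With the default sorting, (4, 2) in the fourth row would get a higher number than (3, 3) in the fifth row, which is not the numbering the question asks for.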
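I came up with a slightly different solution. I'll add this for posterity, but the groupby answer is superior. Note that it labels each pair with the index of its first occurrence rather than with "uniqueN".
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4, 3, 4, 1, 2], 'b': [3, 2, 1, 2, 3, 2, 3, 2]})
print(df)
# keep the first occurrence of each (a, b) pair; copy so the assignment below does not write to a slice of df
df1 = df[~df.duplicated()].copy()
print(df1)
# label each pair with the index of its first occurrence
df1['unique'] = df1.index
print(df1)
# the left merge on the shared columns a and b copies the label to every matching row
df2 = df.merge(df1, how='left')
print(df2)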
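If the labels should follow the "uniqueN" format from the question, the same drop-duplicates-then-merge idea can be adapted roughly like this. It is only a sketch; the names first and result are my own:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 3, 4, 1, 2], 'b': [3, 2, 1, 2, 3, 2, 3, 2]})

# one row per distinct (a, b) pair, in order of first appearance
first = df.drop_duplicates().copy()

# number the pairs starting at 1 and add the prefix
first['unique'] = ['unique{}'.format(n) for n in range(1, len(first) + 1)]

# a left merge keeps the row order of df and copies the label to every match
result = df.merge(first, on=['a', 'b'], how='left')
print(result)
```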
I'm aiming to compute a groupby count of values, but only considering rows where Item and Item2 are different. The following achieves this, but it drops a Time group entirely if none of its rows differ. If a group has one or more rows but Item and Item2 are identical in all of them, I'm hoping to return 0.
import pandas as pd
df = pd.DataFrame({
'Time' : [1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,4,4,4],
'Item' : ['A','A','A','A','A','A','A','B','B','B','B','B','B','B','A','B','B','B'],
'Item2' : ['B','A','A','A','B','B','B','A','A','B','A','B','B','B','A','B','A','A'],
'Value' : [5, 6, 6, 5, 5, 6, 5, 6, 3, 1, 4, 6, 7, 4, 5, 1, 2, 3],
})
df1 = df[df['Item'] != df['Item2']].groupby(['Time']).size().reset_index(name='count')
Intended Output:
Time count
0 1 4
1 2 3
2 3 0
3 4 2
Edit 2:
df = pd.DataFrame({
'Time' : ['1','1','1','1','1','1','1','2','2','2','2','2','2','2','3','4','4','4'],
'Item' : ['A','A','A','A','A','A','A','B','B','B','B','B','B','B','A','B','B','B'],
'Item2' : ['B','A','A','A','B','B','B','A','A','B','A','B','B','B','A','B','A','A'],
'Value' : [2, 6, 6, 5, 3, 3, 4, 6, 5, 1, 4, 6, 7, 4, 5, 1, 2, 3],
})
df1 = (df.assign(new = df['Item'] != df['Item2'])
.groupby('Time')['new']
.mean()
.reset_index(name='avg')
)
Intended Output:
Time avg
0 1 3.0
1 2 5.0
2 3 0.0
3 4 2.5
The idea is not to filter, but to count the True values per group with sum; here the Series df['Time'] is passed to groupby:
df1 = (df['Item'] != df['Item2']).groupby(df['Time']).sum().reset_index(name='count')
print (df1)
Time count
0 1 4
1 2 3
2 3 0
3 4 2
Another similar solution is to create a new helper column and aggregate it:
df1 = (df.assign(new = df['Item'] != df['Item2'])
.groupby('Time')['new']
.sum()
.reset_index(name='count'))
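To make the difference concrete, this sketch runs both approaches on the data from the question (the names mask, filtered and summed are mine). Filtering first removes every row of Time 3, so that group never reaches groupby, while summing the boolean mask keeps it with a count of 0:

```python
import pandas as pd

df = pd.DataFrame({
    'Time': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 4, 4, 4],
    'Item': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'A', 'B', 'B', 'B'],
    'Item2': ['B', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'B', 'A', 'B', 'B', 'B', 'A', 'B', 'A', 'A'],
    'Value': [5, 6, 6, 5, 5, 6, 5, 6, 3, 1, 4, 6, 7, 4, 5, 1, 2, 3],
})

# True where the two item columns differ
mask = df['Item'] != df['Item2']

# filter, then count: Time 3 has no True rows, so it disappears
filtered = df[mask].groupby('Time').size()

# count the True values per group: every Time value is kept
summed = mask.groupby(df['Time']).sum()

print(filtered.index.tolist())  # [1, 2, 4]
print(summed.tolist())          # [4, 3, 0, 2]
```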
EDIT: You can replace the non-matching values with missing values using Series.where, and then replace the resulting missing group means with 0 using fillna:
df1 = (df.assign(new = df['Value'].where(df['Item'] != df['Item2']))
.groupby('Time')['new']
.mean()
.fillna(0)
.reset_index(name='avg')
)
print (df1)
Time avg
0 1 3.0
1 2 5.0
2 3 0.0
3 4 2.5
An alternative is to use Series.reindex with the unique values of the original Time column:
df1 = (df[df['Item'] != df['Item2']]
.groupby(['Time'])['Value']
.mean()
.reindex(df['Time'].unique(), fill_value=0)
.reset_index(name='avg'))
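A step-by-step sketch of the Series.where variant on the Edit 2 data may help show why it works (the intermediate names mask, masked and avg are mine). The mean skips missing values, so only the rows where the items differ contribute, and a group with no such rows yields NaN, which fillna turns into 0:

```python
import pandas as pd

df = pd.DataFrame({
    'Time': ['1', '1', '1', '1', '1', '1', '1', '2', '2', '2', '2', '2', '2', '2', '3', '4', '4', '4'],
    'Item': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'A', 'B', 'B', 'B'],
    'Item2': ['B', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'B', 'A', 'B', 'B', 'B', 'A', 'B', 'A', 'A'],
    'Value': [2, 6, 6, 5, 3, 3, 4, 6, 5, 1, 4, 6, 7, 4, 5, 1, 2, 3],
})

# True where the two item columns differ
mask = df['Item'] != df['Item2']

# keep Value where the mask is True, NaN elsewhere
masked = df['Value'].where(mask)

# mean ignores NaN; a group that is all NaN gives NaN, replaced by 0
avg = masked.groupby(df['Time']).mean().fillna(0)

print(avg.tolist())  # [3.0, 5.0, 0.0, 2.5]
```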
Have a look at pivot tables in pandas:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Time' : [1,1,1,1,1,1,1,2,2,2,2,2,2,2,3],
'Item' : ['A','A','A','A','A','A','A','B','B','B','B','B','B','B','A'],
'Item2' : ['B','A','A','A','B','B','B','A','A','B','A','B','B','B','A'],
'Value' : [5, 6, 6, 5, 5, 6, 5, 6, 3, 1, 4, 6, 7, 4, 5],
})
# this gives you just the rows where there is a difference
df2 = df[df['Item'] != df['Item2']]
# then count the remaining rows for each Time
pd.pivot_table(df2,index='Time',aggfunc='count')
This gives you the table
Item Item2 Value
Time
1 4 4 4
2 3 3 3
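Note that Time 3 is missing from this table because all of its rows were filtered out before the pivot. One possible way to bring it back with a count of 0, sketched here under the assumption that counting only the Value column is enough, is to reindex the result on the unique Time values of the original frame (the name counts is mine):

```python
import pandas as pd

df = pd.DataFrame({
    'Time': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3],
    'Item': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'A'],
    'Item2': ['B', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'B', 'A', 'B', 'B', 'B', 'A'],
    'Value': [5, 6, 6, 5, 5, 6, 5, 6, 3, 1, 4, 6, 7, 4, 5],
})

# rows where the two item columns differ
df2 = df[df['Item'] != df['Item2']]

# count per Time, then restore the Time values that were filtered away
counts = (
    pd.pivot_table(df2, index='Time', values='Value', aggfunc='count')
    .reindex(df['Time'].unique(), fill_value=0)
)
print(counts)
```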
I'm trying to write a function that takes a pandas DataFrame as an argument and at some point concatenates this dataframe with another.
For example:
def concat(df):
    df = pd.concat((df, pd.DataFrame({'E': [1, 1, 1]})), axis=1)
I would like this function to modify in place the input df but I can't find how to achieve this. When I do
...
print(df)
concat(df)
print(df)
The dataframe df is identical before and after the function call.
Note: I don't want to do df['E'] = [1, 1, 1] because I don't know how many columns will be added to df. So I want to use pd.concat(), if possible...
This will edit the original DataFrame inplace and give the desired output as long as the new data contains the same number of rows as the original, and there are no conflicting column names.
It's the same idea as your df['E'] = [1, 1, 1] suggestion, except it will work for an arbitrary number of columns.
I don't think there is a way to achieve this using pd.concat, as it doesn't have an inplace parameter as some Pandas functions do.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'C': [10, 20, 30], 'D': [40, 50, 60]})
df[df2.columns] = df2
Results (df):
A B C D
0 1 4 10 40
1 2 5 20 50
2 3 6 30 60
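Wrapped in a function, the same assignment could look like the sketch below. The helper name add_columns_inplace is hypothetical. The function in the question had no visible effect because df = pd.concat(...) only rebinds the local name df to a new object, whereas item assignment mutates the object the caller passed in:

```python
import pandas as pd

def add_columns_inplace(df, extra):
    """Hypothetical helper: add the columns of `extra` to `df` in place.

    Assumes `extra` has the same index as `df` and no conflicting column names.
    """
    # item assignment changes the caller's DataFrame; nothing is returned
    df[list(extra.columns)] = extra

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
add_columns_inplace(df, pd.DataFrame({'E': [1, 1, 1], 'F': [7, 8, 9]}))
print(df)
```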
I was going through this link: Return top N largest values per group using pandas
and found multiple ways to find the topN values per group.
However, I prefer the dictionary method with the agg function and would like to know whether an equivalent of the dictionary method is possible for the following problem.
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
'B': [1, 1, 2, 2, 1],
'C': [10, 20, 30, 40, 50],
'D': ['X', 'Y', 'X', 'Y', 'Y']})
print(df)
A B C D
0 1 1 10 X
1 1 1 20 Y
2 1 2 30 X
3 2 2 40 Y
4 2 1 50 Y
I can do this:
df1 = df.groupby(['A'])['C'].nlargest(2).droplevel(-1).reset_index()
print(df1)
A C
0 1 30
1 1 20
2 2 50
3 2 40
# also this
df1 = df.sort_values('C', ascending=False).groupby('A', sort=False).head(2)
print(df1)
# also this
df.set_index('C').groupby('A')['B'].nlargest(2).reset_index()
Required
df.groupby('A',as_index=False).agg(
    {'C': lambda ser: ser.nlargest(2) # something like this
    })
Is it possible to use the dictionary here?
If you want a dictionary that maps each value of A to its top 2 values from C,
you can run:
df.groupby(['A'])['C'].apply(
    lambda x: x.nlargest(2).tolist()).to_dict()
For your DataFrame, the result is:
{1: [30, 20], 2: [50, 40]}
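If the number of values or the columns vary, the same expression can be wrapped in a small function. This is only a sketch; the name top_n_per_group and its parameters are my own and not a pandas API:

```python
import pandas as pd

def top_n_per_group(df, key, col, n):
    """Hypothetical helper: map each value of `key` to its n largest `col` values."""
    return df.groupby(key)[col].apply(lambda x: x.nlargest(n).tolist()).to_dict()

df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': [1, 1, 2, 2, 1],
                   'C': [10, 20, 30, 40, 50],
                   'D': ['X', 'Y', 'X', 'Y', 'Y']})

result = top_n_per_group(df, 'A', 'C', 2)
print(result)  # {1: [30, 20], 2: [50, 40]}
```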
I want to select rows from a dataframe based on values in the index combined with values in a specific column:
df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [0, 20, 30], [40, 20, 30]],
index=[4, 5, 6, 7], columns=['A', 'B', 'C'])
A B C
4 0 2 3
5 0 4 1
6 0 20 30
7 40 20 30
with
df.loc[df['A'] == 0, 'C'] = 99
I can select all rows with column A = 0 and replace the value in column C with 99, but how can I select all rows with column A = 0 and the index < 6 (I want to combine selection on the index with selection on the column)?
You can use multiple conditions in your loc statement:
df.loc[(df.index < 6) & (df.A == 0), 'C'] = 99
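Applied to the frame from the question, that looks as follows. Each condition needs its own parentheses because & binds more tightly than comparison operators such as < and ==:

```python
import pandas as pd

df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [0, 20, 30], [40, 20, 30]],
                  index=[4, 5, 6, 7], columns=['A', 'B', 'C'])

# rows 4 and 5 satisfy both conditions; row 6 has A == 0 but its index is not < 6
df.loc[(df.index < 6) & (df['A'] == 0), 'C'] = 99

print(df['C'].tolist())  # [99, 99, 30, 30]
```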