How do I combine these two columns? Pandas - Python

I have two columns: one is Buyer ID and one is Seller ID. I'm simply trying to find out which combination of the two appears most often.
def twoCptyFreq(df, col1, col2):
    cols = [col1, col2]
    # Join the two IDs into a single key like "1234+7651"
    df['TwoCptys'] = df[cols].astype(str).apply('+'.join, axis=1)
    return df

newdf = twoCptyFreq(tradedf, 'BuyerID', 'SellerID')
I get the results I want, but sometimes I get 1234+7651 and 7651+1234, i.e. the same two counterparties, and I need to aggregate these together. How do I write this into my function to allow for cases where the buyer and seller may be switched?

You can sort the values in the lambda function with sorted:
df['TwoCptys'] = df[cols].astype(str).apply(lambda x: '+'.join(sorted(x)), axis=1)
Or convert the columns to a 2D array and sort each row with np.sort:
df['TwoCptys'] = (pd.DataFrame(np.sort(df[cols].values, axis=1))
                    .astype(str).apply('+'.join, axis=1))
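Either way, once the key is order-insensitive, value_counts shows which pair appears most often (a short sketch building on the column created above):
df['TwoCptys'].value_counts().head()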

df=pd.DataFrame({'A':[1,1,1],'B':[2,3,2],'C':[9,9,9]})
df['combination']=df['A'].astype(str) + '+' + df['B'].astype(str)
df['combination'].value_counts()
Out[]:
1+2    2
1+3    1
Name: combination, dtype: int64
# This shows the combination df['A'] == 1 and df['B'] == 2 has the most occurrences
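Note that this example keys on A and B in their given order. If swapped pairs should count together, one alternative (a sketch, not part of the answer above) is a frozenset key:
# frozenset ignores element order, so (1, 2) and (2, 1) collapse together
# (a pair whose two values are equal collapses to a one-element set)
pairs = df[['A', 'B']].apply(frozenset, axis=1)
pairs.value_counts()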

Related

Dataframe - Remove similar rows based on two columns

I have the following dataset (screenshot: a table listing the correlation of pairs of columns):
If you look at rows 3 and 42, you will find they are the same; only the column position is different, which does not affect the correlation. I want to remove row 42. But this dataset has many such rows with mirrored values. I need a general algorithm to remove these duplicates and keep only unique pairs.
As the correlation_Value seems to be the same, the operation should be commutative, so whatever the value, you just have to focus on the first two columns. Sort each pair into a tuple and remove duplicates:
# You could also use 'frozenset' instead of a sorted tuple as the key
key = df[['source_column', 'destination_column']] \
    .apply(lambda x: tuple(sorted(x)), axis='columns')
out = df.loc[~key.duplicated()]
>>> out
  source_column destination_column  correlation_Value
0             A                  B                  1
2             C                  E                  2
3             D                  F                  4
You could try a self-join. Without a code example it's hard to answer, but maybe something like this:
df.merge(df, left_on="source_column", right_on="destination_column")
You can follow that up with a call to drop_duplicates.
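To make this concrete, here is a vectorised variant of the sorted-pair key, sketched with np.sort (assuming pandas is imported as pd and the column names from the question):
import numpy as np

# Sort each source/destination pair columnwise, then drop rows whose
# sorted pair has already been seen
key = pd.DataFrame(
    np.sort(df[['source_column', 'destination_column']].values, axis=1),
    index=df.index,
)
out = df.loc[~key.duplicated()]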

How to iterate and calculate over a pandas multi-index dataframe

I have a pandas multi-index dataframe:
>>> df
                     0         1
first second
A     one     0.991026  0.734800
      two     0.582370  0.720825
B     one     0.795826 -1.155040
      two     0.013736 -0.591926
C     one    -0.538078  0.291372
      two     1.605806  1.103283
D     one    -0.617655 -1.438617
      two     1.495949 -0.936198
I'm trying to find an efficient way to divide each number in column 0 by the maximum number in column 1 that shares the same group under index "first", and make this into a third column. Is there a simple, efficient method for doing something like this that doesn't require multiple for loops?
Use Series.div with max to get the maximal value per first index level:
print(df[1].max(level=0))
first
A 0.734800
B -0.591926
C 1.103283
D -0.936198
Name: 1, dtype: float64
df['new'] = df[0].div(df[1].max(level=0))
print(df)
                     0         1       new
first second
A     one     0.991026  0.734800  1.348702
      two     0.582370  0.720825  0.792556
B     one     0.795826 -1.155040 -1.344469
      two     0.013736 -0.591926 -0.023206
C     one    -0.538078  0.291372 -0.487706
      two     1.605806  1.103283  1.455480
D     one    -0.617655 -1.438617  0.659748
      two     1.495949 -0.936198 -1.597898
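Note that the level argument to aggregations like max was deprecated in pandas 1.3 and removed in 2.0; on current versions the same result comes from grouping on the index level, e.g. with transform so the division aligns row by row (a sketch):
# Broadcast the per-group maximum of column 1 back to the full MultiIndex
df['new'] = df[0] / df[1].groupby(level='first').transform('max')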

Convert categorical column into specific integers

I have a bunch of dataframes with one categorical column defining Sex (M/F). I want to assign the integer 1 to Male and 2 to Female. I have the following code, but it cat-codes them to 0 and 1 instead:
df4["Sex"] = df4["Sex"].astype('category')
df4.dtypes
df4["Sex_cat"] = df4["Sex"].cat.codes
df4.head()
But I specifically need M to be 1 and F to be 2. Is there a simple way to assign specific integers to categories?
IIUC:
df4['Sex'] = df4['Sex'].map({'M':1,'F':2})
And now:
print(df4)
would give the desired result.
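For example, on a small hypothetical frame (the values are made up for illustration):
import pandas as pd

df4 = pd.DataFrame({'Sex': ['M', 'F', 'F', 'M']})
df4['Sex_cat'] = df4['Sex'].map({'M': 1, 'F': 2})
print(df4)
#   Sex  Sex_cat
# 0   M        1
# 1   F        2
# 2   F        2
# 3   M        1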
If you need to impose a specific ordering, you can use pd.Categorical:
c = pd.Categorical(df["Sex"], categories=['M','F'], ordered=True)
This ensures "M" is given the smallest value, "F" the next, and so on. You can then just access codes and add 1.
df['Sex_cat'] = c.codes + 1
It is better to use pd.Categorical than astype('category') if you want finer control over which codes are assigned to which categories.
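A quick self-contained sketch of this pattern on a made-up frame:
import pandas as pd

df = pd.DataFrame({'Sex': ['M', 'F', 'M']})
# Explicit category order: 'M' gets code 0, 'F' gets code 1
c = pd.Categorical(df['Sex'], categories=['M', 'F'], ordered=True)
df['Sex_cat'] = c.codes + 1  # shift to 1-based: M -> 1, F -> 2
print(df)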
You can also use a lambda with apply (note that the else branch maps anything that is not 'M', including missing values, to 2):
df4['Sex'] = df4['Sex'].apply(lambda x: 1 if x == 'M' else 2)

Value counts for specific items in a DataFrame

I have a dataframe (df) of messages that looks similar to the following:
From               To
person1@gmail.com  stranger1@gmail.com
person2@gmail.com  stranger1@gmail.com, stranger2@gmail.com
person3@gmail.com  person1@gmail.com, stranger2@gmail.com
I want to count the number of times each email appears, restricted to a specific list. My list is:
lst = ['person1@gmail.com', 'stranger2@gmail.com', 'person3@gmail.com']
I'm hoping to receive a dataframe/series/dictionary with a result like this:
list_item            Total_Count
person1@gmail.com    2
stranger2@gmail.com  2
person3@gmail.com    1
I've tried several different things but haven't succeeded. I thought I could try something like the for loop below (it returns a SyntaxError), but I cannot figure out the right way to write it.
for To, From in zip(df.To, df.From):
    for item in lst:
        if To,From contains item in emails:
            Count(item)
Should this type of task be accomplished with a for loop, or are there out-of-the-box pandas methods that could solve this more easily?
stack-based
Split your To column, stack everything and then do a value_counts:
v = pd.concat([df.From, df.To.str.split(', ', expand=True)], axis=1).stack()
v[v.isin(lst)].value_counts()
stranger2@gmail.com    2
person1@gmail.com      2
person3@gmail.com      1
dtype: int64
melt
Another option is to use melt:
v = (df.set_index('From')
       .To.str.split(', ', expand=True)
       .reset_index()
       .melt()['value'])
v[v.isin(lst)].value_counts()
stranger2@gmail.com    2
person1@gmail.com      2
person3@gmail.com      1
Name: value, dtype: int64
Note that set_index + str.split + reset_index is synonymous with the pd.concat([...]) construction above.
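Both approaches can be checked end to end on a frame reconstructed from the question (a self-contained sketch):
import pandas as pd

df = pd.DataFrame({
    'From': ['person1@gmail.com', 'person2@gmail.com', 'person3@gmail.com'],
    'To': ['stranger1@gmail.com',
           'stranger1@gmail.com, stranger2@gmail.com',
           'person1@gmail.com, stranger2@gmail.com'],
})
lst = ['person1@gmail.com', 'stranger2@gmail.com', 'person3@gmail.com']

# stack-based approach: split To, line everything up in one column, filter, count
v = pd.concat([df.From, df.To.str.split(', ', expand=True)], axis=1).stack()
print(v[v.isin(lst)].value_counts())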

How to get the mean of a subset of rows after using groupby?

I want to get the average of a particular subset of rows in one particular column in my dataframe.
I can use
df['C'].iloc[2:9].mean()
to get the mean of just the particular rows I want from my original DataFrame, but my problem is that I want to perform this operation after using the groupby operation.
I am building on
df.groupby(["A", "B"])['C'].mean()
whereby 11 values are returned in 'C' once I group by columns A and B, and I get the average of those 11 values. I actually only want the average of the 3rd through 9th values, though, so ideally what I would want to do is
df.groupby(["A", "B"])['C'].iloc[2:9].mean()
This would return those 11 values from column C for every group of A and B and then find the mean of the 3rd through 9th values, but I know I can't do this. The error suggests using the apply method, but I can't seem to figure it out.
Any help would be appreciated.
You can use the agg function after the groupby, then subset within each group and take the mean:
# A dummy data frame to demonstrate
df = pd.DataFrame({'A': ['a']*22, 'B': ['b1']*11 + ['b2']*11, 'C': list(range(11))*2})
df.groupby(['A', 'B'])['C'].agg(lambda g: g.iloc[2:9].mean())
# A  B
# a  b1    5.0
#    b2    5.0
# Name: C, dtype: float64
Try this variant:
for key, grp in df.groupby(["A", "B"]):
    print(grp['C'].iloc[2:9].mean())
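Since the error message points to apply, the same slice also works through GroupBy.apply (a sketch equivalent to the agg version above):
# Slice each group's Series, then reduce it to its mean
df.groupby(['A', 'B'])['C'].apply(lambda g: g.iloc[2:9].mean())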
