Match values in dataframe rows - python

I have a dataframe (df) that looks like:
name  type    cost
a     apples     1
b     apples     2
c     orange     1
d     banana     4
e     orange     6
Apart from using two for loops, is there a way to loop through the rows and compare each name and type against every other row, keeping a pair only where the name is not compared with itself (a vs a), the type is the same (apples vs apples), and the reversed pair has not already been seen (if a vs b is in the output, b vs a should not be)? The output should look like:
name1  name2  status
a      b      0
c      e      0
where the first two elements are the names for which the criteria match and the third element is always 0.
I have tried to do this with two for loops (see below) but can't get it to reject, say, b vs a if we already have a vs b.
def pairListCreator(staticData):
    for x, row1 in df.iterrows():
        name1 = row1['name']
        type1 = row1['type']
        for y, row2 in df.iterrows():
            name2 = row2['name']
            type2 = row2['type']
            if name1 != name2 and type1 == type2:
                pairList = name1, name2, 0
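For what it's worth, itertools.combinations sidesteps the reversed-pair problem entirely, because it yields each unordered pair of rows exactly once. A minimal sketch against the sample data (with the type for row c normalized to 'orange', as the expected output implies):
from itertools import combinations

import pandas as pd

df = pd.DataFrame({'name': list('abcde'),
                   'type': ['apples', 'apples', 'orange', 'banana', 'orange'],
                   'cost': [1, 2, 1, 4, 6]})

# combinations() yields each unordered pair of rows exactly once,
# so once (a, b) has been emitted, (b, a) can never appear
pairList = [(r1['name'], r2['name'], 0)
            for (_, r1), (_, r2) in combinations(df.iterrows(), 2)
            if r1['type'] == r2['type']]
print(pairList)  # [('a', 'b', 0), ('c', 'e', 0)]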

Something like this:
import pandas as pd

# Data
data = [['a', 'apples', 1],
        ['b', 'apples', 2],
        ['c', 'orange', 1],
        ['d', 'banana', 4],
        ['e', 'orange', 6]]

# Create DataFrame
df = pd.DataFrame(data, columns=['name', 'type', 'cost'])
df.set_index('name', inplace=True)

# Print DataFrame
print(df)

# Count number of rows
nr_of_rows = df.shape[0]

# Compare each row with every later row only, so each unordered pair
# is visited once and (b, a) can never follow (a, b)
res_col_nam = ['name1', 'name2', 'status']
rows = []
for i in range(nr_of_rows):
    x = df.iloc[i]
    for j in range(i + 1, nr_of_rows):
        y = df.iloc[j]
        if x['type'] == y['type']:
            rows.append([x.name, y.name, 0])

# Build the result frame in one go (appending row by row inside the
# loop is slow, and DataFrame.append was removed in pandas 2.0)
result = pd.DataFrame(rows, columns=res_col_nam)

# Print result
print('result:')
print(result)
Output:
       type  cost
name
a    apples     1
b    apples     2
c    orange     1
d    banana     4
e    orange     6
result:
  name1 name2  status
0     a     b       0
1     c     e       0

You can use a self join on column type first, then sort the values in the name columns per row with apply(sorted).
Then remove rows where the two names are equal by boolean indexing, drop duplicate pairs with drop_duplicates, and add the new column status with assign:
df = pd.merge(df, df, on='type', suffixes=('1', '2'))
names = ['name1', 'name2']
df[names] = df[names].apply(sorted, axis=1)
df = (df[df.name1 != df.name2].drop_duplicates(subset=names)[names]
        .assign(status=0)
        .reset_index(drop=True))
print (df)
  name1 name2  status
0     a     b       0
1     c     e       0
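One hedged caveat on the apply(sorted, axis=1) step: in more recent pandas versions it returns a single Series of lists rather than two sorted columns, so the assignment can misbehave; passing result_type='expand', or sorting the underlying array with NumPy, avoids this. A sketch of the NumPy variant, reusing the names list from above:
import numpy as np

# sort each (name1, name2) pair across the row so that, e.g.,
# ('b', 'a') becomes ('a', 'b'); np.sort works on the raw 2-D array
df[names] = np.sort(df[names].to_numpy(), axis=1)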

Related

A list and a dataframe mapping to get a column value in python pandas

I have a dataframe with words as the index and a corresponding sentiment score in another column. I also have a second dataframe with a column whose rows each hold a list of words (a token list). I want to find the average sentiment score for each list. This has to be done for a huge number of rows, so efficiency is important.
One method I have in mind is given below:
import pandas as pd

a = [['a', 'b', 'c'], ['hi', 'this', 'is', 'a', 'sample']]
df = pd.DataFrame()
df['tokens'] = a
'''
df
                      tokens
0                  [a, b, c]
1  [hi, this, is, a, sample]
'''
def find_score(tokenlist, ref_df):
    # ref_df contains two cols, 'tokens' and 'sentiment_score'
    temp_df = pd.DataFrame()
    temp_df['tokens'] = tokenlist
    # this should return the mean score for the whole list
    return temp_df.merge(ref_df, on='tokens', how='inner')['sentiment_score'].mean()

# each input to find_score is one token list (Series.apply has no axis
# argument, and args must be a tuple)
df['score'] = df['tokens'].apply(find_score, args=(ref_df,))
Is there any more efficient way to do it without creating a dataframe for each list?
You can create a dictionary for the mapping from the reference dataframe ref_df and then use .map() on each token list in each row of dataframe df, as follows:
import numpy as np

ref_dict = dict(zip(ref_df['tokens'], ref_df['sentiment_score']))
df['score'] = df['tokens'].map(lambda x: np.mean([ref_dict[y] for y in x if y in ref_dict]))
Demo
Test Data Construction
a = [['a', 'b', 'c'], ['hi', 'this', 'is', 'a', 'sample']]
df = pd.DataFrame()
df['tokens'] = a
ref_df = pd.DataFrame({'tokens': ['a', 'b', 'c', 'd', 'hi', 'this', 'is', 'sample', 'example'],
                       'sentiment_score': [1, 2, 3, 4, 11, 12, 13, 14, 15]})
print(df)
                      tokens
0                  [a, b, c]
1  [hi, this, is, a, sample]
print(ref_df)
    tokens  sentiment_score
0        a                1
1        b                2
2        c                3
3        d                4
4       hi               11
5     this               12
6       is               13
7   sample               14
8  example               15
Run New Code
ref_dict = dict(zip(ref_df['tokens'], ref_df['sentiment_score']))
df['score'] = df['tokens'].map(lambda x: np.mean([ref_dict[y] for y in x if y in ref_dict]))
Output
print(df)
                      tokens  score
0                  [a, b, c]    2.0
1  [hi, this, is, a, sample]   10.2
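One edge case worth guarding against, as a hedged addition: if none of a row's tokens appear in ref_dict, the list comprehension is empty and np.mean([]) returns nan while emitting a RuntimeWarning. A small variant of the same idea with an explicit guard:
import numpy as np

def safe_score(tokens):
    # keep only the tokens that actually have a reference score
    scores = [ref_dict[t] for t in tokens if t in ref_dict]
    # avoid np.mean([]) -> nan plus a RuntimeWarning when nothing matched
    return np.mean(scores) if scores else np.nan

df['score'] = df['tokens'].map(safe_score)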
Let's try explode, merge, and agg:
import pandas as pd

a = [['a', 'b', 'c'], ['hi', 'this', 'is', 'a', 'sample']]
df = pd.DataFrame()
df['tokens'] = a
ref_df = pd.DataFrame({'sentiment_score': {'a': 1, 'b': 2,
                                           'c': 3, 'hi': 4,
                                           'this': 5, 'is': 6,
                                           'sample': 7}})

# Explode tokens into rows (preserve the original index)
new_df = df.explode('tokens').reset_index()

# Merge the sentiment scores
new_df = new_df.merge(ref_df, left_on='tokens',
                      right_index=True,
                      how='inner')

# Group by the original index, agg tokens back to lists and take the mean
new_df = new_df.groupby('index') \
    .agg({'tokens': list, 'sentiment_score': 'mean'}) \
    .reset_index(drop=True)

print(new_df)
Output:
                      tokens  sentiment_score
0                  [a, b, c]              2.0
1  [a, hi, this, is, sample]              4.6
After Explode:
   index  tokens
0      0       a
1      0       b
2      0       c
3      1      hi
4      1    this
5      1      is
6      1       a
7      1  sample
After Merge:
   index  tokens  sentiment_score
0      0       a                1
1      1       a                1
2      0       b                2
3      0       c                3
4      1      hi                4
5      1    this                5
6      1      is                6
7      1  sample                7
(The one-liner)
new_df = df.explode('tokens') \
    .reset_index() \
    .merge(ref_df, left_on='tokens',
           right_index=True,
           how='inner') \
    .groupby('index') \
    .agg({'tokens': list, 'sentiment_score': 'mean'}) \
    .reset_index(drop=True)
If the order of the tokens in the list matters, the scores can be calculated and merged back to the original df instead of using list aggregation (selecting the sentiment_score column before the mean avoids trying to average the string tokens column):
mean_scores = df.explode('tokens') \
    .reset_index() \
    .merge(ref_df, left_on='tokens',
           right_index=True,
           how='inner') \
    .groupby('index')['sentiment_score'].mean() \
    .reset_index(drop=True)
new_df = df.merge(mean_scores,
                  left_index=True,
                  right_index=True)
print(new_df)
Output:
                      tokens  sentiment_score
0                  [a, b, c]              2.0
1  [hi, this, is, a, sample]              4.6

Keep pair of row data in pandas [duplicate]

I am stuck with a seemingly easy problem: dropping unique rows in a pandas dataframe. Basically, the opposite of drop_duplicates().
Let's say this is my data:
     A  B  C
0  foo  0  A
1  foo  1  A
2  foo  1  B
3  bar  1  A
I would like to drop the rows where the combination of A and B is unique, i.e. I would like to keep only rows 1 and 2.
I tried the following:
# Load Dataframe
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
uniques = df[['A', 'B']].drop_duplicates()
duplicates = df[~df.index.isin(uniques.index)]
But I only get row 2, as 0, 1, and 3 are in the uniques!
Solutions for selecting all duplicated rows:
You can use duplicated with subset and the parameter keep=False to select all duplicates:
df = df[df.duplicated(subset=['A','B'], keep=False)]
print (df)
     A  B  C
1  foo  1  A
2  foo  1  B
Solution with transform:
df = df[df.groupby(['A', 'B'])['A'].transform('size') > 1]
print (df)
     A  B  C
1  foo  1  A
2  foo  1  B
Slightly modified solutions for selecting all unique rows:
#invert the boolean mask by ~
df = df[~df.duplicated(subset=['A','B'], keep=False)]
print (df)
     A  B  C
0  foo  0  A
3  bar  1  A
df = df[df.groupby(['A', 'B'])['A'].transform('size') == 1]
print (df)
     A  B  C
0  foo  0  A
3  bar  1  A
I came up with a solution using groupby:
grouped = df.groupby(['A', 'B']).size().reset_index().rename(columns={0: 'count'})
uniques = grouped[grouped['count'] == 1]
# compare on the (A, B) keys rather than on the positional index of the
# grouped frame, which no longer lines up with df's index
is_unique = df.set_index(['A', 'B']).index.isin(uniques.set_index(['A', 'B']).index)
duplicates = df[~is_unique]
Duplicates now has the proper result:
     A  B  C
1  foo  1  A
2  foo  1  B
Also, my original attempt in the question can be fixed by simply adding keep=False in the drop_duplicates method:
# Load Dataframe
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
uniques = df[['A', 'B']].drop_duplicates(keep=False)
duplicates = df[~df.index.isin(uniques.index)]
Please see @jezrael's answer; I think it is the safest, as I am using pandas indexes here.
df1 = df.drop_duplicates(['A', 'B'],keep=False)
df1 = pd.concat([df, df1])
df1 = df1.drop_duplicates(keep=False)
This technique is more suitable when you have two datasets dfX and dfY with millions of records. You may first concatenate dfX and dfY and follow the same steps.
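A minimal sketch of that two-dataset variant, with small hypothetical frames dfX and dfY standing in for the large ones; shared (A, B) key columns are assumed:
import pandas as pd

dfX = pd.DataFrame({'A': ['foo', 'bar'], 'B': [0, 1]})
dfY = pd.DataFrame({'A': ['foo', 'baz'], 'B': [0, 2]})

# stack both frames, then keep only the rows whose (A, B) combination
# appears more than once, i.e. the rows present in both datasets
both = pd.concat([dfX, dfY])
common = both[both.duplicated(subset=['A', 'B'], keep=False)].drop_duplicates()
print(common)
#      A  B
# 0  foo  0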

Adding conditional prefixes to column names

So I have a dataframe with some weird suffixes, like _a or _b, that map to certain codes. I was wondering how you would add a prefix depending on the suffix, and remove the suffix, to get something easier to understand.
i.e.:
red_a blue_a green_b
....
....
to
A red A blue B green
....
....
I tried
for col in df.columns:
    if col.endswith('_a'):
        batch_match[col].replace('_a', '')
        batch_match[col].add_prefix('A ')
    else:
        batch_match[col].add_prefix('B ')
But it returns a df of NaN.
You can use pandas.DataFrame.rename with a custom mapper:
df = pd.DataFrame(
    {"red_a": ['a', 'b', 'c'], "blue_a": [1, 2, 3], 'green_b': ['x', 'y', 'z']}
)

def renamer(col):
    if any(col.endswith(suffix) for suffix in ['_a', '_b']):
        prefix = col[-1]  # use the last char as the prefix
        return prefix.upper() + " " + col[:-2]  # add prefix and strip the last 2 chars
    else:
        return col

df = df.rename(mapper=renamer, axis='columns')
print(df)
#   A blue B green A red
# 0      1      x     a
# 1      2      y     b
# 2      3      z     c
What I would do:
df.columns = df.columns.str.split('_').map(lambda x: '{} {}'.format(x[1].upper(), x[0]))
df
Out[512]:
  A red  A blue B green
0     a       1       x
1     b       2       y
2     c       3       z
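Both answers assume every column really ends in a one-letter suffix; a column without an underscore would make the split-based one-liner raise an IndexError. A slightly defensive variant, offered only as a sketch:
def relabel(col):
    # only rewrite names matching the '<name>_<letter>' pattern;
    # anything else passes through untouched
    name, sep, suffix = col.rpartition('_')
    if sep and len(suffix) == 1:
        return suffix.upper() + ' ' + name
    return col

df.columns = [relabel(c) for c in df.columns]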

pandas convert grouped rows into columns

I have a dataframe such as:
label  column1
a      1
a      2
b      6
b      4
I would like to make a dataframe with a new column, with the opposite value from column1 where the labels match. Such as:
label  column1  column2
a      1        2
a      2        1
b      6        4
b      4        6
I know this is probably very simple to do with a groupby command but I've been searching and can't find anything.
The following uses groupby and apply and seems to work okay:
import numpy as np
import pandas as pd

x = pd.DataFrame({'label': ['a', 'a', 'b', 'b'],
                  'column1': [1, 2, 6, 4]})
y = x.groupby('label').apply(
    lambda g: g.assign(column2=np.asarray(g.column1[::-1])))
y = y.reset_index(drop=True)  # optional: drop the weird hierarchical index
print(y)
you can try the code block below:
# create the DataFrame
df = pd.DataFrame({'label': ['a', 'a', 'b', 'b'],
                   'column1': [1, 2, 6, 4]})

# group by label
a = df.groupby('label').first().reset_index()
b = df.groupby('label').last().reset_index()

# concat those groups to create column2
df2 = (pd.concat([b, a])
         .sort_values(by='label')
         .rename(columns={'column1': 'column2'})
         .reset_index(drop=True))

# merge with the original DataFrame on the row index (merge does not
# accept `on` together with left_index/right_index)
df = df.merge(df2.drop('label', axis=1),
              left_index=True, right_index=True)[['label', 'column1', 'column2']]
Hope this helps
Assuming there are only pairs of labels, you could use the following as well:
# Create dataframe
df = pd.DataFrame(data={'label': ['a', 'a', 'b', 'b'],
                        'column1': [1, 2, 6, 4]})

# iterate over the dataframe, identify the matching label and the opposite value
for index, row in df.iterrows():
    newvalue = int(df[(df.label == row.label) & (df.column1 != row.column1)].column1.values[0])
    # set the value on the new column (set_value was removed in pandas 1.0)
    df.at[index, 'column2'] = newvalue

df.head()
You can use groupby with apply and create a new Series with the values in reverse order:
df['column2'] = df.groupby('label')["column1"] \
    .apply(lambda x: pd.Series(x[::-1].values)).reset_index(drop=True)
print (df)
   column1 label  column2
0        1     a        2
1        2     a        1
2        6     b        4
3        4     b        6
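For completeness, a transform-based sketch of the same idea; this assumes, as above, that the rows within each label group should simply be reversed:
# reverse column1 within each label group; transform keeps the
# original row alignment, so no index juggling is needed
df['column2'] = (df.groupby('label')['column1']
                   .transform(lambda s: s.to_numpy()[::-1]))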

How to create a frequency table in pandas python

If I have data like
Col1
A
B
A
B
A
C
I need output like
Col_value Count
A 3
B 2
C 1
I need col_value and count to be the column names, so I can access them like a['col_value'].
Use value_counts:
df = pd.value_counts(df.Col1).to_frame().reset_index()
df
  index  Col1
0     A     3
1     B     2
2     C     1
then rename your columns if needed:
df.columns = ['Col_value', 'Count']
df
  Col_value  Count
0         A      3
1         B      2
2         C      1
Another solution is groupby with aggregating size:
df = (df.groupby('Col1')
        .size()
        .reset_index(name='Count')
        .rename(columns={'Col1': 'Col_value'}))
print (df)
  Col_value  Count
0         A      3
1         B      2
2         C      1
Use pd.crosstab as another alternative:
import pandas as pd
help(pd.crosstab)
Help on function crosstab in module pandas.core.reshape.pivot:
crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name='All', dropna=True, normalize=False)
Example:
df_freq = pd.crosstab(df['Col1'], columns='count')
df_freq.head()
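The crosstab result is indexed by Col1 with a single count column, so flattening it into the requested two-column layout could look like this sketch:
df_freq = (pd.crosstab(df['Col1'], columns='count')
             .reset_index()
             .rename(columns={'Col1': 'Col_value', 'count': 'Count'})
             .rename_axis(None, axis=1))  # drop the leftover 'col_0' columns name
print(df_freq)
#   Col_value  Count
# 0         A      3
# 1         B      2
# 2         C      1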
def frequencyTable(alist):
    '''
    list -> chart
    Returns None. Side effect is printing two columns showing each number that
    is in the list, and then a column indicating how many times it was in the list.
    Example:
    >>> frequencyTable([1, 3, 3, 2])
    ITEM  FREQUENCY
    1     1
    2     1
    3     2
    '''
    countdict = {}
    for item in alist:
        if item in countdict:
            countdict[item] = countdict[item] + 1
        else:
            countdict[item] = 1
    itemlist = list(countdict.keys())
    itemlist.sort()
    print("ITEM", "FREQUENCY")
    for item in itemlist:
        print(item, " ", countdict[item])
    return None
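As a hedged closing note, on newer pandas versions the rename step from the first answer can be folded into one chain; rename_axis names the index before reset_index turns it into a column:
freq = (df['Col1'].value_counts()
          .rename_axis('Col_value')
          .reset_index(name='Count'))
print(freq)
#   Col_value  Count
# 0         A      3
# 1         B      2
# 2         C      1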
