My googling has failed me; I think my main issue is that I'm unsure how to phrase the question (sorry about the vague title). I'm trying to find the total number of times each pair of people votes the same way. Below you will see an example of how the data looks and the output I am looking for. I have a working solution, but it's very slow (see bottom), and I was wondering if there's a better way to go about this.
This is how the data is shaped
----------------------------------
event person vote
1 a y
1 b n
1 c nv
1 d nv
1 e y
2 a n
2 b nv
2 c y
2 d n
2 e n
----------------------------------
This is the output I'm looking for
----------------------------------
Person a b c d e
a 2 0 0 1 2
b 0 2 0 0 0
c 0 0 2 1 0
d 1 0 1 2 1
e 2 0 0 1 2
----------------------------------
Working Code
df = df.pivot(index='event', columns='person', values='vote')
frame = pd.DataFrame(columns=df.columns, index=df.columns)
for person1, value in frame.iterrows():
    for person2 in frame:
        count = 0
        for i, row in df.iterrows():
            person1_votes = row[person1]
            person2_votes = row[person2]
            if person1_votes == person2_votes:
                count += 1
        frame.at[person1, person2] = count
Try looking at your problem in a different way:
df = df.assign(key=1)
mergedf = df.merge(df, on=['event', 'key'])
mergedf['equal'] = mergedf['vote_x'].eq(mergedf['vote_y'])
output = mergedf.groupby(['person_x', 'person_y'])['equal'].sum().unstack()
output
Out[1241]:
person_y a b c d e
person_x
a 2.0 0.0 0.0 1.0 2.0
b 0.0 2.0 0.0 0.0 0.0
c 0.0 0.0 2.0 1.0 0.0
d 1.0 0.0 1.0 2.0 1.0
e 2.0 0.0 0.0 1.0 2.0
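The counts display as floats after unstack; if you want integers like the desired output, a small follow-up (a sketch, assuming every pair is present after the cross join) is:
output = output.fillna(0).astype(int)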
@Wen-Ben has already answered your question. It is based on the concept of finding all possible pairs of people and counting those having the same vote. Finding all pairs is a Cartesian product (cross join). You may read the great post from @cs95 on Cartesian product (CROSS JOIN) with pandas.
In your problem, you count same votes per event, so it is a cross join per event. Therefore, you don't need to add a helper key column as in @cs95's post. You may cross join directly on the column event. After the cross join, filter out the pair-wise person<->person rows having the same vote using query. Finally, use crosstab to count those pairs.
Below is my solution:
df_match = df.merge(df, on='event').query('vote_x == vote_y')
pd.crosstab(index=df_match.person_x, columns=df_match.person_y)
Out[1463]:
person_y a b c d e
person_x
a 2 0 0 1 2
b 0 2 0 0 0
c 0 0 2 1 0
d 1 0 1 2 1
e 2 0 0 1 2
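For reference, a minimal end-to-end sketch that rebuilds the sample data and runs the cross-join approach (the DataFrame construction below is an assumption based on the table at the top):
import pandas as pd

df = pd.DataFrame({
    'event': [1] * 5 + [2] * 5,
    'person': ['a', 'b', 'c', 'd', 'e'] * 2,
    'vote': ['y', 'n', 'nv', 'nv', 'y', 'n', 'nv', 'y', 'n', 'n'],
})

# Cross join per event, keep pairs with matching votes, then count them
df_match = df.merge(df, on='event').query('vote_x == vote_y')
print(pd.crosstab(index=df_match.person_x, columns=df_match.person_y))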
I have a pandas DataFrame with columns 'Expected' and 'Actual' that show a product (A, B, C, or D) for each record:
ID  Expected  Actual
1   A         B
2   A         A
3   C         B
4   B         D
5   C         D
6   A         A
7   B         B
8   A         D
I want to get a count from both columns for each unique value found in either column (the two columns don't share all the same products). The result should look like this:
Value  Expected  Actual
A      4         2
B      2         3
C      2         0
D      0         3
Thank you for all your help
You can use apply and value_counts:
df = pd.DataFrame({'Expected':['A','A','C','B','C','A','B','A'],'Actual':['B','A','B','D','D','A','B','D']})
df.apply(pd.Series.value_counts).fillna(0)
output:
Expected Actual
A 4.0 2.0
B 2.0 3.0
C 2.0 0.0
D 0.0 3.0
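If you want integer counts as in the desired output, a small follow-up (assuming the same df) is to cast after filling:
df.apply(pd.Series.value_counts).fillna(0).astype(int)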
I would do it the following way:
import pandas as pd
df = pd.DataFrame({'Expected':['A','A','C','B','C','A','B','A'],'Actual':['B','A','B','D','D','A','B','D']})
ecnt = df['Expected'].value_counts()
acnt = df['Actual'].value_counts()
known = sorted(set(df['Expected']).union(df['Actual']))
cntdf = pd.DataFrame({'Value':known,'Expected':[ecnt.get(k,0) for k in known],'Actual':[acnt.get(k,0) for k in known]})
print(cntdf)
output
Value Expected Actual
0 A 4 2
1 B 2 3
2 C 2 0
3 D 0 3
Explanation: the main idea here is to compute separate value counts for the Expected column and the Actual column. If you would rather have Value as the index of your pandas.DataFrame, you can do:
...
cntdf = pd.DataFrame([acnt,ecnt]).T.fillna(0)
print(cntdf)
output
Actual Expected
D 3.0 0.0
B 3.0 2.0
A 2.0 4.0
C 0.0 2.0
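If you prefer the rows in A-D order with Expected before Actual, one way (a sketch, assuming a pandas version where value_counts keeps the column name, as in the output above) is:
cntdf = pd.DataFrame([acnt, ecnt]).T.fillna(0).astype(int)
cntdf = cntdf.sort_index()[['Expected', 'Actual']]
print(cntdf)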
Example df:
id home away winner loser winner_score loser_score
0 A B A B 20 10
1 C D D C 20 5
2 A D D A 30 0
My goal is to build a win/loss/for/against sort of table. End result:
Team  W  L  Total for  Total against
A     1  1  20         40
B     0  1  10         20
C     0  1  5          20
D     2  0  50         5
I can use groupby('winner') / groupby('loser') to get the wins and losses, but I am unable to group properly to get the summed scores (and means, but that's pretty similar). My main issue is that when a team has never won a match, I end up with NaNs.
My method now is along the lines of:
by_winner = df.groupby('winner').sum()
by_loser = df.groupby('loser').sum()
overall_table['score_for'] = by_winner.score + by_loser.score
I'm also not sure how to even phrase the question, but I would like to be able to run these stats from a concept of one row = one match. I don't know how to group by the winner and loser so that I get summed results for all teams at once.
Let's try:
counts = pd.get_dummies(df[['winner', 'loser']].stack()).sum(level=1).T
winner_score = (df.groupby('winner')[['winner_score', 'loser_score']].sum()
                  .rename(columns={'winner_score': 'for', 'loser_score': 'against'}))
loser_score = (df.groupby('loser')[['winner_score', 'loser_score']].sum()
                 .rename(columns={'winner_score': 'against', 'loser_score': 'for'}))
pd.concat((counts, winner_score.add(loser_score, fill_value=0)), axis=1)
Output:
winner loser against for
A 1 1 40.0 20.0
B 0 1 20.0 10.0
C 0 1 20.0 5.0
D 2 0 5.0 50.0
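To match the requested column names and order, a possible follow-up (a sketch, assuming the concatenated result above is stored in result) is:
result = pd.concat((counts, winner_score.add(loser_score, fill_value=0)), axis=1)
result = result.rename(columns={'winner': 'W', 'loser': 'L',
                                'for': 'Total for', 'against': 'Total against'})
result = result[['W', 'L', 'Total for', 'Total against']]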
I'm working with a survey related to income. I have my data like this:
form Survey1 Survey2 Country
0 1 1 1 1
1 2 1 2 5
2 3 2 2 4
3 4 2 1 1
4 5 2 2 4
I want to group by the answer and by the Country. For example, if Survey2 refers to the number of cars the respondent owns, I want to know the number of people that own one car in a certain country.
The expected output is as follows:
Country Survey1_1 Survey1_2 Survey2_1 Survey2_2
0 1 1 1 2 0
1 4 0 2 0 2
2 5 1 0 0 1
Here I added '_#' where # is the answer to count.
So far I've written code to find the different answers for each column and to count the responses for a given answer, say 1, but I haven't found a way to count the answers for a specific country.
number_unic = df.head().iloc[:, j + ci].nunique()  # count unique answers
vals_unic = list(df.iloc[:, column].unique())  # unique answers
for i in range(len(vals_unic)):
    names = str(df.columns[j + ci] + '_' + str(vals_unic[i]))  # name of the new column
    count = (df.iloc[:, j + ci] == vals_unic[i]).sum()  # count the values equal to this unique answer
    df.insert(len(df.columns.values), names, count)  # insert the new column
I would do this with a pivot_table:
In [11]: df.pivot_table(["Survey1", "Survey2"], ["Country"], df.groupby("Country").cumcount())
Out[11]:
Survey1 Survey2
0 1 0 1
Country
1 1.0 2.0 1.0 1.0
4 2.0 2.0 2.0 2.0
5 1.0 NaN 2.0 NaN
To get the output you wanted you could do something like:
In [21]: res = df.pivot_table(["Survey1", "Survey2"], ["Country"], df.groupby("Country").cumcount())
In [22]: res.columns = [s + "_" + str(n + 1) for s, n in res.columns.values]
In [23]: res
Out[23]:
Survey1_1 Survey1_2 Survey2_1 Survey2_2
Country
1 1.0 2.0 1.0 1.0
4 2.0 2.0 2.0 2.0
5 1.0 NaN 2.0 NaN
But, generally it's better to use the MultiIndex here...
To count the number of each response, you can use this somewhat more complicated groupby and value_counts:
In [31]: df1 = df.set_index("Country")[["Survey1", "Survey2"]] # more columns work fine here
In [32]: df1.unstack().groupby(level=[0, 1]).value_counts().unstack(level=0, fill_value=0).unstack(fill_value=0)
Out[32]:
Survey1 Survey2
1 2 1 2
Country
1 1 1 2 0
4 0 2 0 2
5 1 0 0 1
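As in the pivot_table example, you can flatten the MultiIndex columns of this result to get the exact requested layout (a sketch, assuming the counting result above is stored in counts):
counts = df1.unstack().groupby(level=[0, 1]).value_counts().unstack(level=0, fill_value=0).unstack(fill_value=0)
counts.columns = [s + "_" + str(n) for s, n in counts.columns]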
I am trying to create a loop, or a more efficient process, that can count values in a pandas df. At the moment I'm selecting the value I want to perform the function on.
So for the df below, I'm trying to determine two counts.
1) ['u'] returns the count of the same values remaining in ['Code', 'Area'], i.e. how many more times the same combination occurs.
2) ['On'] returns the number of values currently occurring in ['Area']. It achieves this by parsing through the df to see if those values occur again, so it essentially looks into the future.
import pandas as pd
d = ({
'Code' : ['A','A','A','A','B','A','B','A','A','A'],
'Area' : ['Home','Work','Shops','Park','Cafe','Home','Cafe','Work','Home','Park'],
})
df = pd.DataFrame(data=d)
#Select value
df1 = df[df.Code == 'A'].copy()
df1['u'] = df1[::-1].groupby('Area').Area.cumcount()
ids = [1]
seen = set([df1.iloc[0].Area])
dec = False
for val, u in zip(df1.Area[1:], df1.u[1:]):
    ids.append(ids[-1] + (val not in seen) - dec)
    seen.add(val)
    dec = u == 0
df1['On'] = ids
df1 = df1.reindex(df.index).fillna(df1)
The problem is that I want to run this script on all values in Code instead of selecting one at a time. For instance, to do the same thing for Code 'B', I would have to change the selection to df2 = df[df.Code == 'B'].copy() and run the script again.
If I have numerous values in Code this becomes very inefficient. I need a loop that finds all unique values in 'Code'. Ideally, the script would look like:
df1 = df[df.Code == 'All unique values'].copy()
Intended Output:
Code Area u On
0 A Home 2.0 1.0
1 A Work 1.0 2.0
2 A Shops 0.0 3.0
3 A Park 1.0 3.0
4 B Cafe 1.0 1.0
5 A Home 1.0 3.0
6 B Cafe 0.0 1.0
7 A Work 0.0 3.0
8 A Home 0.0 2.0
9 A Park 0.0 1.0
I find your "On" logic very confusing. That said, I think I can reproduce it:
df["u"] = df.groupby(["Code", "Area"]).cumcount(ascending=False)
df["nunique"] = pd.get_dummies(df.Area).groupby(df.Code).cummax().sum(axis=1)
df["On"] = (df["nunique"] -
(df["u"] == 0).groupby(df.Code).cumsum().groupby(df.Code).shift().fillna(0)
which gives me
In [212]: df
Out[212]:
Code Area u nunique On
0 A Home 2 1 1.0
1 A Work 1 2 2.0
2 A Shops 0 3 3.0
3 A Park 1 4 3.0
4 B Cafe 1 1 1.0
5 A Home 1 4 3.0
6 B Cafe 0 1 1.0
7 A Work 0 4 3.0
8 A Home 0 4 2.0
9 A Park 0 4 1.0
In this, u is the number of matching (Code, Area) pairs after that row. nunique is the number of unique Area values seen so far in that Code.
On is the number of unique Areas seen so far, except that once we "run out" of an Area -- once it's not used any more -- we start subtracting it from nunique.
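If you would rather keep your original loop logic and simply run it once per code, a straightforward wrapper (a sketch of your own algorithm applied per group, starting from the original df before the helper columns above were added) is:
def add_counts(g):
    g = g.copy()
    g['u'] = g[::-1].groupby('Area').Area.cumcount()
    ids = [1]
    seen = {g.iloc[0].Area}
    dec = False
    for val, u in zip(g.Area[1:], g.u[1:]):
        ids.append(ids[-1] + (val not in seen) - dec)
        seen.add(val)
        dec = u == 0
    g['On'] = ids
    return g

out = pd.concat([add_counts(df[df.Code == c]) for c in df.Code.unique()]).sort_index()
Note the values stay integers here, rather than the floats produced by your reindex/fillna step.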
Using GroupBy with size and cumcount, you can construct your u series.
Your logic for On isn't clear: this requires clarification.
g = df.groupby(['Code', 'Area'])
df['u'] = g['Code'].transform('size') - (g.cumcount() + 1)
print(df)
Code Area u
0 A Home 2
1 A Home 1
2 B Shops 1
3 A Park 1
4 B Cafe 1
5 B Shops 0
6 A Home 0
7 B Cafe 0
8 A Work 0
9 A Park 0
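Note that the size-minus-cumcount construction for u above should be equivalent to the reversed cumcount used in the previous answer:
df['u'] = df.groupby(['Code', 'Area']).cumcount(ascending=False)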
What I need is to normalize the rating column below by the following process:
Group by user field id.
Find mean rating for each user.
Locate each user's review tip and subtract the user's mean rating.
I have this data frame:
user rating
review_id
a 1 5
b 2 3
c 1 3
d 1 4
e 3 4
f 2 2
...
I then calculate the mean for each user:
>>> data.groupby('user').rating.mean()
user
1 4
2 2.5
3 4
I need the final result to be:
user rating
review_id
a 1 1
b 2 0.5
c 1 -1
d 1 0
e 3 0
f 2 -0.5
...
How can dataframes provide this kind of functionality efficiently?
You can do this by using a groupby().transform(), see http://pandas.pydata.org/pandas-docs/stable/groupby.html#transformation
In this case, grouping by 'user', and then for each group subtract the mean of that group (the function you supply to transform is applied to each group, but the result keeps the original index):
In [7]: data.groupby('user').transform(lambda x: x - x.mean())
Out[7]:
rating
review_id
a 1.0
b 0.5
c -1.0
d 0.0
e 0.0
f -0.5
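To write the normalized values back into the frame, a common pattern (a sketch assuming the same data DataFrame) is to subtract the transformed group mean directly:
data['rating'] = data['rating'] - data.groupby('user')['rating'].transform('mean')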