What I need is to normalize the rating column below by the following process:
Group by the user id field.
Find the mean rating for each user.
For each user's review, subtract the user's mean rating from its rating.
I have this data frame:
user rating
review_id
a 1 5
b 2 3
c 1 3
d 1 4
e 3 4
f 2 2
...
I then calculate the mean for each user:
>>> data.groupby('user').rating.mean()
user
1 4
2 2.5
3 4
I need the final result to be:
user rating
review_id
a 1 1
b 2 0.5
c 1 -1
d 1 0
e 3 0
f 2 -0.5
...
How can pandas DataFrames provide this kind of functionality efficiently?
You can do this by using groupby().transform(); see http://pandas.pydata.org/pandas-docs/stable/groupby.html#transformation
In this case, group by 'user' and then, for each group, subtract the mean of that group (the function you supply to transform is applied to each group, but the result keeps the original index):
In [7]: data.groupby('user').transform(lambda x: x - x.mean())
Out[7]:
rating
review_id
a 1.0
b 0.5
c -1.0
d 0.0
e 0.0
f -0.5
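If you want to keep the user column and write the result back onto the original frame, a minimal sketch of the same idea (the rating_normalized column name is my own, not part of the answer above) is to broadcast each user's mean with transform('mean') and subtract it:

# Sketch: transform('mean') returns the per-user mean aligned to the original index,
# so the subtraction is a plain column-wise operation.
data['rating_normalized'] = data['rating'] - data.groupby('user')['rating'].transform('mean')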
Related
I have a pandas DataFrame with columns 'Expected' and 'Actual' that show a product (A, B, C, or D) for each record:
ID  Expected  Actual
1   A         B
2   A         A
3   C         B
4   B         D
5   C         D
6   A         A
7   B         B
8   A         D
I want to get a count from both columns for each unique value found in either column (the two columns don't share all the same products). So the result should look like this:
Value  Expected  Actual
A      4         2
B      2         3
C      2         0
D      0         3
Thank you for all your help
You can use apply and value_counts:
df = pd.DataFrame({'Expected':['A','A','C','B','C','A','B','A'],'Actual':['B','A','B','D','D','A','B','D']})
df.apply(pd.Series.value_counts).fillna(0)
output:
Expected Actual
A 4.0 2.0
B 2.0 3.0
C 2.0 0.0
D 0.0 3.0
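Note that fillna(0) leaves the counts as floats; if you prefer integers, a small follow-up sketch is to cast the result back:

# Sketch: cast the filled counts back to integers for a cleaner display.
df.apply(pd.Series.value_counts).fillna(0).astype(int)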
I would do it the following way:
import pandas as pd
df = pd.DataFrame({'Expected':['A','A','C','B','C','A','B','A'],'Actual':['B','A','B','D','D','A','B','D']})
ecnt = df['Expected'].value_counts()
acnt = df['Actual'].value_counts()
known = sorted(set(df['Expected']).union(df['Actual']))
cntdf = pd.DataFrame({'Value':known,'Expected':[ecnt.get(k,0) for k in known],'Actual':[acnt.get(k,0) for k in known]})
print(cntdf)
output
Value Expected Actual
0 A 4 2
1 B 2 3
2 C 2 0
3 D 0 3
Explanation: the main idea here is to keep separate value counts for the Expected column and the Actual column. If you would rather have Value as the index of your pandas.DataFrame, you can do:
...
cntdf = pd.DataFrame([acnt,ecnt]).T.fillna(0)
print(cntdf)
output
Actual Expected
D 3.0 0.0
B 3.0 2.0
A 2.0 4.0
C 0.0 2.0
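If you want this variant to match the column order and sorted index of the first table, a small cosmetic follow-up (not part of the original answer) is:

# Sketch: reorder the columns, sort the index alphabetically, and cast back to int.
print(cntdf[['Expected', 'Actual']].sort_index().astype(int))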
For the DF below, in the Value column, Product 3 (i.e., 100) and Product 4 (i.e., 98) have amounts that are outliers. I want to
group by ['Class']
obtain the mean of the [Value] excluding the outlier amount
replace the outlier amount with the mean calculated in step 2.
Any suggestions on how to structure the code would be greatly appreciated. I have code that works for the sample table, but I have a feeling it might not work when I implement it in the real solution.
Product,Class,Value
0 1 A 5
1 2 A 4
2 3 A 100
3 4 B 98
4 5 B 20
5 6 B 25
My code implementation:
# Establish the condition to remove the outlier rows from the DF
stds = 1.0
filtered_df = df[~df.groupby('Class')['Value'].transform(lambda x: abs((x-x.mean()) / x.std()) > stds)]
Output:
Product Class Value
0 1 A 5
1 2 A 4
4 5 B 20
5 6 B 25
# compute mean of each class without the outliers
class_means = filtered_df[['Class', 'Value']].groupby(['Class'])['Value'].mean()
Output:
Class
A 4.5
B 22.5
#extract rows in DF that are outliers and fail the test
outlier_df = df[df.groupby('Class')['Value'].transform(lambda x: abs((x-x.mean()) / x.std()) > stds)]
outlier_df
Output:
Product Class Value
2 3 A 100
3 4 B 98
#replace outlier values with computed means grouped by class
# note: this relies on the outlier rows lining up one-to-one, in order, with class_means' index
import numpy as np
outlier_df['Value'] = np.where(outlier_df.Class == class_means.index, class_means, outlier_df.Value)
outlier_df
Output:
Product Class Value
2 3 A 4.5
3 4 B 22.5
#recombine cleaned dataframes
df_cleaned = pd.concat([filtered_df,outlier_df], axis=0 )
df_cleaned
Output:
Product Class Value
0 1 A 5.0
1 2 A 4.0
4 5 B 20.0
5 6 B 25.0
2 3 A 4.5
3 4 B 22.5
Proceed as follows:
Start from your code:
stds = 1.0
Save your lambda function under a variable:
isOutlier = lambda x: abs((x - x.mean()) / x.std()) > stds
Define the following function, to be applied to each group:
def newValue(grp):
    val = grp.Value
    outl = isOutlier(val)
    return val.mask(outl, val[~outl].mean())
Generate new Value column:
df.Value = df.groupby('Class', group_keys=False).apply(newValue)
The result is:
Product Class Value
0 1 A 5.0
1 2 A 4.0
2 3 A 4.5
3 4 B 22.5
4 5 B 20.0
5 6 B 25.0
You don't even lose the original row order.
Edit
Or you can incorporate the content of your lambda function into newValue (since you don't call it anywhere else):
def newValue(grp):
    val = grp.Value
    outl = abs((val - val.mean()) / val.std()) > stds
    return val.mask(outl, val[~outl].mean())
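An equivalent one-pass sketch (my own rewording of the same idea, assuming the same stds threshold of 1.0) keeps the logic inside a single groupby on the Value column via transform:

# Sketch: transform applies the function per group and returns a Series aligned
# to the original index, so it can be assigned straight back to df['Value'].
def replace_outliers(val, stds=1.0):
    outl = abs((val - val.mean()) / val.std()) > stds
    return val.mask(outl, val[~outl].mean())

df['Value'] = df.groupby('Class')['Value'].transform(replace_outliers)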
I have a DataFrame that looks like this:
df = pd.DataFrame({'ID':['A','B','A','C','C'], 'value':[2,4,9,1,3.5]})
df
ID value
0 A 2.0
1 B 4.0
2 A 9.0
3 C 1.0
4 C 3.5
What I need to do is go through the ID column and, for each unique value, multiply the corresponding rows in the value column according to a reference that I have.
For example, if I have the following reference:
if A multiply by 10
if B multiply by 3
if C multiply by 2
Then the desired output would be:
df
ID value
0 A 2.0*10
1 B 4.0*3
2 A 9.0*10
3 C 1.0*2
4 C 3.5*2
Thanks in advance.
Use Series.map with a dictionary to build a Series of multipliers, and multiply the value column by it:
d = {'A':10, 'B':3,'C':2}
df['value'] = df['value'].mul(df['ID'].map(d))
print (df)
ID value
0 A 20.0
1 B 12.0
2 A 90.0
3 C 2.0
4 C 7.0
Detail:
print (df['ID'].map(d))
0 10
1 3
2 10
3 2
4 2
Name: ID, dtype: int64
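If some IDs might be missing from the dictionary, map returns NaN for them; a hedged variant (not part of the original answer) is to fill those with 1 so the corresponding values are left unchanged:

# Sketch: unmapped IDs get a multiplier of 1 instead of NaN.
df['value'] = df['value'].mul(df['ID'].map(d).fillna(1))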
My googling has failed me; I think my main issue is that I'm unsure how to phrase the question (sorry about the crappy title). I'm trying to find the total number of times each pair of people votes the same way. Below you will see an example of how the data looks and the output I'm looking for. I have a working solution, but it's very slow (see bottom), and I was wondering if there's a better way to go about this.
This is how the data is shaped
----------------------------------
event person vote
1 a y
1 b n
1 c nv
1 d nv
1 e y
2 a n
2 b nv
2 c y
2 d n
2 e n
----------------------------------
This is the output I'm looking for
----------------------------------
Person a b c d e
a 2 0 0 1 2
b 0 2 0 0 0
c 0 0 2 1 0
d 1 0 1 2 1
e 2 0 0 1 2
----------------------------------
Working Code
df = df.pivot(index='event', columns='person', values='vote')
frame = pd.DataFrame(columns=df.columns, index=df.columns)
for person1, value in frame.iterrows():
    for person2 in frame:
        count = 0
        for i, row in df.iterrows():
            person1_votes = row[person1]
            person2_votes = row[person2]
            if person1_votes == person2_votes:
                count += 1
        frame.at[person1, person2] = count
Try looking at your problem in a different way:
df=df.assign(key=1)
mergedf=df.merge(df,on=['event','key'])
mergedf['equal']=mergedf['vote_x'].eq(mergedf['vote_y'])
output=mergedf.groupby(['person_x','person_y'])['equal'].sum().unstack()
output
Out[1241]:
person_y a b c d e
person_x
a 2.0 0.0 0.0 1.0 2.0
b 0.0 2.0 0.0 0.0 0.0
c 0.0 0.0 2.0 1.0 0.0
d 1.0 0.0 1.0 2.0 1.0
e 2.0 0.0 0.0 1.0 2.0
@Wen-Ben already answered your question. It is based on the concept of finding all pair-wise combinations of persons and counting those having the same vote. Finding all pair-wise combinations is a cartesian product (cross join). You may read the great post from @cs95 on cartesian product (CROSS JOIN) with pandas.
In your problem, you count same votes per event, so it is a cross join per event. Therefore, you don't need to add a helper key column as in @cs95's post. You may cross join directly on column event. After the cross join, keep only those person<->person pairs having the same vote using query. Finally, use crosstab to count those pairs.
Below is my solution:
df_match = df.merge(df, on='event').query('vote_x == vote_y')
pd.crosstab(index=df_match.person_x, columns=df_match.person_y)
Out[1463]:
person_y a b c d e
person_x
a 2 0 0 1 2
b 0 2 0 0 0
c 0 0 2 1 0
d 1 0 1 2 1
e 2 0 0 1 2
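If the merged frame becomes too large (it grows quadratically with the number of people per event), a vectorised sketch of my own (not from either answer above) is to compare the event-by-person pivot against each vote value and sum the resulting agreement matrices:

# Sketch: for each vote value v, (pivot == v) marks who cast that vote per event;
# the dot product counts, per person pair, the events where both cast v.
pivot = df.pivot(index='event', columns='person', values='vote')
agree = sum(
    (pivot == v).astype(int).T.dot((pivot == v).astype(int))
    for v in pivot.stack().unique()
)
print(agree)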
I'm working with a survey related to income. I have my data like this:
form Survey1 Survey2 Country
0 1 1 1 1
1 2 1 2 5
2 3 2 2 4
3 4 2 1 1
4 5 2 2 4
I want to group by the answer and by the Country. For example, suppose Survey2 refers to the number of cars the respondent owns; then I want to know the number of people who own one car in a certain country.
The expected output is as follows:
Country Survey1_1 Survey1_2 Survey2_1 Survey2_2
0 1 1 1 2 0
1 4 0 2 0 2
2 5 1 0 0 1
Here I added '_#' where # is the answer to count.
So far I've written code to find the different answers for each column and to count the answers equal to, let's say, 1, but I haven't found a way to count the answers for a specific country.
number_unic = df.iloc[:, j+ci].nunique()       # count unique answers
vals_unic = list(df.iloc[:, j+ci].unique())    # unique answers
for i in range(len(vals_unic)):
    name = str(df.columns[j+ci]) + '_' + str(vals_unic[i])   # name of the new column
    count = (df.iloc[:, j+ci] == vals_unic[i]).sum()         # count the values equal to this answer
    df.insert(len(df.columns.values), name, count)           # insert the new column
I would do this with a pivot_table:
In [11]: df.pivot_table(["Survey1", "Survey2"], ["Country"], df.groupby("Country").cumcount())
Out[11]:
Survey1 Survey2
0 1 0 1
Country
1 1.0 2.0 1.0 1.0
4 2.0 2.0 2.0 2.0
5 1.0 NaN 2.0 NaN
To get the output you wanted you could do something like:
In [21]: res = df.pivot_table(["Survey1", "Survey2"], ["Country"], df.groupby("Country").cumcount())
In [22]: res.columns = [s + "_" + str(n + 1) for s, n in res.columns.values]
In [23]: res
Out[23]:
Survey1_1 Survey1_2 Survey2_1 Survey2_2
Country
1 1.0 2.0 1.0 1.0
4 2.0 2.0 2.0 2.0
5 1.0 NaN 2.0 NaN
But, generally it's better to use the MultiIndex here...
To count the number of each response you can do this somewhat more complicated groupby and value_counts:
In [31]: df1 = df.set_index("Country")[["Survey1", "Survey2"]] # more columns work fine here
In [32]: df1.unstack().groupby(level=[0, 1]).value_counts().unstack(level=0, fill_value=0).unstack(fill_value=0)
Out[32]:
Survey1 Survey2
1 2 1 2
Country
1 1 1 2 0
4 0 2 0 2
5 1 0 0 1
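A possibly simpler sketch (my own suggestion, not part of the answer above) is to build one crosstab of Country against each survey column and concatenate them side by side; the flattened column names below mirror the '_#' naming from the question:

# Sketch: one Country-by-answer crosstab per survey, glued together along the columns.
surveys = ["Survey1", "Survey2"]
counts = pd.concat({s: pd.crosstab(df["Country"], df[s]) for s in surveys}, axis=1)
counts.columns = [f"{s}_{a}" for s, a in counts.columns]
print(counts)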