Imagine I've got this pandas DataFrame:
Class Val
0 A 1
1 B 1
2 B 1
3 B 1
4 B 0
I want to compute the mean of Val grouped by Class, but taking the statistical significance of the values into account: if B has many values equal to 1, its result should outrank the mean of A, because A has only one observation.
Use:
import pandas as pd
df = pd.DataFrame({'Class': ['A', 'B', 'B', 'B', 'B'], 'Val': [1, 1, 1, 1, 0]})
print(df.groupby('Class').agg(['mean', 'count']))
You will have to expand on how you decide which to use, but this provides you with the basic info you need to do that.
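One possible way to combine the two columns, sketched here under the assumption that Val only ever holds 0 or 1, is the lower bound of the Wilson score interval. The choice of z = 1.96 (roughly 95% confidence) and the column name wilson_lower are illustrative assumptions, not something the question specifies.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Class': ['A', 'B', 'B', 'B', 'B'], 'Val': [1, 1, 1, 1, 0]})

# Per-class mean (share of 1s) and number of observations.
stats = df.groupby('Class')['Val'].agg(['mean', 'count'])

z = 1.96  # assumed: roughly a 95% confidence level
p, n = stats['mean'], stats['count']

# Lower bound of the Wilson score interval; only meaningful for 0/1 values.
centre = p + z**2 / (2 * n)
margin = z * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
stats['wilson_lower'] = (centre - margin) / (1 + z**2 / n)

print(stats.sort_values('wilson_lower', ascending=False))
```

With this score, B ranks above A even though the raw mean of A is higher, because a single observation gives very little evidence.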
How can I impute missing values, or values equal to 0, with the average of the two nearest non-zero values in pandas?
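One possibility would be to replace the 0 values with NaN in the column in question only, and then average the results of .ffill() and .bfill() on that column:
import pandas as pd
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd', 'e'], 'b': [1, 2, 0, 0, 5]})
# turn zeros into NaN in column 'b' only, leaving column 'a' untouched
df['b'] = df['b'].mask(df['b'] == 0)
# average the previous and the next non-missing value
df['b'] = (df['b'].ffill() + df['b'].bfill()) * 0.5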
a b
0 a 1.0
1 b 2.0
2 c 3.5
3 d 3.5
4 e 5.0
I have a very large dataframe with over 2000 columns. I am trying to count the number of unique values for each column and filter out the columns with unique values below a certain number. Here is an example:
import pandas as pd
df = pd.DataFrame({'A': ('a', 'b', 'c', 'd', 'e', 'a', 'a'), 'B': (1, 1, 2, 1, 3, 3, 1)})
df.nunique()
A 5
B 3
dtype: int64
So let's say I want to filter out column B, which has fewer than 5 unique values, and return a df without column B.
Thanks-
Pass a boolean mask over the columns to .loc:
df = df.loc[:, df.nunique() >= 5]
A
0 a
1 b
2 c
3 d
4 e
5 a
6 a
Others may have a more pythonic way. Try this out to see if it works.
x = df.nunique()
df[list(x[x>=5].index)]
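For a frame with thousands of columns it may be convenient to wrap the filter in a small helper. The sketch below is only an illustration; the function name drop_low_cardinality and its min_unique parameter are made up for this example.

```python
import pandas as pd


def drop_low_cardinality(frame, min_unique):
    """Keep only the columns of frame with at least min_unique distinct values."""
    counts = frame.nunique()                   # distinct values per column
    return frame.loc[:, counts >= min_unique]  # boolean mask over the columns


df = pd.DataFrame({'A': ('a', 'b', 'c', 'd', 'e', 'a', 'a'),
                   'B': (1, 1, 2, 1, 3, 3, 1)})
filtered = drop_low_cardinality(df, min_unique=5)
print(filtered.columns.tolist())
```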
I am struggling to figure out how to develop a square matrix given a format like
a a 0
a b 3
a c 4
a d 12
b a 3
b b 0
b c 2
...
To something like:
a b c d e
a 0 3 4 12 ...
b 3 0 2 7 ...
c 4 3 0 .. .
d 12 ...
e . ..
in pandas. I developed a method which I think works, but it takes forever to run because it iterates through each column and row for every value, starting from the beginning each time, using for loops. I feel like I'm definitely reinventing the wheel here. It also isn't realistic for my dataset given how many columns and rows there are. Is there something similar to R's cast function in Python which can do this significantly faster?
You could use df.pivot:
import pandas as pd
df = pd.DataFrame([['a', 'a', 0],
['a', 'b', 3],
['a', 'c', 4],
['a', 'd', 12],
['b', 'a', 3],
['b', 'b', 0],
['b', 'c', 2]], columns=['X','Y','Z'])
print(df.pivot(index='X', columns='Y', values='Z'))
yields
Y a b c d
X
a 0.0 3.0 4.0 12.0
b 3.0 0.0 2.0 NaN
Here, index='X' tells df.pivot to use the column labeled 'X' as the index, and columns='Y' tells it to use the column labeled 'Y' as the column index.
See the docs for more on pivot and other reshaping methods.
Alternatively, you could use pd.crosstab:
print(pd.crosstab(index=df.iloc[:,0], columns=df.iloc[:,1],
values=df.iloc[:,2], aggfunc='sum'))
Unlike df.pivot, which expects each (X, Y) pair to be unique, pd.crosstab
(with aggfunc='sum') will aggregate duplicate pairs by summing the associated
values. Although there are no duplicate pairs in your posted example, specifying
how duplicates are supposed to be aggregated is required when the values
parameter is used.
Also, whereas df.pivot is passed column labels, pd.crosstab is passed
array-likes (such as whole columns of df). df.iloc[:, i] is the ith column
of df.
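A third option that takes column labels like df.pivot but aggregates duplicates like pd.crosstab is df.pivot_table. The sketch below uses a made-up frame in which the pair ('a', 'b') appears twice, and it assumes that missing pairs should be shown as 0 via fill_value.

```python
import pandas as pd

# Illustrative data: the ('a', 'b') pair is duplicated on purpose.
df = pd.DataFrame([['a', 'a', 0],
                   ['a', 'b', 3],
                   ['a', 'b', 5],
                   ['b', 'a', 3]], columns=['X', 'Y', 'Z'])

# Duplicate (X, Y) pairs are summed; pairs that never occur are filled with 0.
square = df.pivot_table(index='X', columns='Y', values='Z',
                        aggfunc='sum', fill_value=0)
print(square)
```

On this frame df.pivot would raise an error because of the duplicate pair, whereas pivot_table sums the two values.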
I'm trying to set a number of different values in a pandas DataFrame all to the same value. I thought I understood boolean indexing for pandas, but I haven't found any resources on this specific error.
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
mask = df.isin([1, 3, 12, 'a'])
df[mask] = 30
Traceback (most recent call last):
...
TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value
Above, I want to replace all of the True entries in the mask with the value 30.
I could do df.replace instead, but masking feels a bit more efficient and intuitive here. Can someone explain the error, and provide an efficient way to set all of the values?
Unfortunately, you can't use the boolean mask on mixed dtypes for this. You can use the pandas where method to set the values instead:
In [59]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
mask = df.isin([1, 3, 12, 'a'])
df = df.where(mask, other=30)
df
Out[59]:
A B
0 1 a
1 30 30
2 3 30
Note that the above will fail if you pass inplace=True to the where method, so df.where(mask, other=30, inplace=True) will raise:
TypeError: Cannot do inplace boolean setting on mixed-types with a non
np.nan value
EDIT
OK, after a little misunderstanding: you can still use where by just inverting the mask:
In [2]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
mask = df.isin([1, 3, 12, 'a'])
df.where(~mask, other=30)
Out[2]:
A B
0 30 30
1 2 b
2 30 f
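As a sketch of an alternative spelling, DataFrame.mask replaces the entries where the condition is True, so the mask does not need to be inverted. Like the where call above, this returns a new frame rather than modifying df in place.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
mask = df.isin([1, 3, 12, 'a'])

# mask() replaces values where the condition is True (where() replaces where it is False).
result = df.mask(mask, other=30)
print(result)
```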
If you want to use different columns to create your mask, you need to call the values property of the dataframe.
Example
Let's say we want to replace values in A_1 and A_2 according to a mask built from B_1 and B_2. For example, replace with 999 those values in A that correspond to nulls in B.
The original dataframe:
A_1 A_2 B_1 B_2
0 1 4 y n
1 2 5 n NaN
2 3 6 NaN NaN
The desired dataframe
A_1 A_2 B_1 B_2
0 1 4 y n
1 2 999 n NaN
2 999 999 NaN NaN
The code:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'A_1': [1, 2, 3],
'A_2': [4, 5, 6],
'B_1': ['y', 'n', np.nan],
'B_2': ['n', np.nan, np.nan]})
_mask = df[['B_1', 'B_2']].notnull().values
df[['A_1', 'A_2']] = df[['A_1','A_2']].where(_mask, other=999)
A_1 A_2
0 1 4
1 2 999
2 999 999
I'm not 100% sure but I suspect the error message relates to the fact that there is not identical treatment of missing data across different dtypes. Only float has NaN, but integers can be automatically converted to floats so it's not a problem there. But it appears mixing number dtypes and object dtypes does not work so easily...
Regardless of that, you could get around it pretty easily with np.where:
df[:] = np.where( mask, 30, df )
A B
0 30 30
1 2 b
2 30 f
pandas uses NaN to mark invalid or missing data, and NaN can be used across types. Since your DataFrame has mixed int and string data types, it will not accept an in-place assignment of a single value (other than NaN), as this would create a mixed type (int and str) in B.
@JohnE's method using np.where creates a new DataFrame in which the type of column B is object, not string as in the initial example.
I usually use value_counts() to get the number of occurrences of a value. However, I deal now with large database-tables (cannot load it fully into RAM) and query the data in fractions of 1 month.
Is there a way to store the result of value_counts() and merge it with / add it to the next results?
I want to count the number of user actions. Assume the following structure of
user-activity logs:
# month 1
id userId actionType
1 1 a
2 1 c
3 2 a
4 3 a
5 3 b
# month 2
id userId actionType
6 1 b
7 1 b
8 2 a
9 3 c
Using value_counts() on those produces:
# month 1
userId
1 2
2 1
3 2
# month 2
userId
1 2
2 1
3 1
Expected output:
# month 1+2
userId
1 4
2 2
3 3
So far, I have only found a method using groupby and sum:
# count users actions and remember them in new column
df1['count'] = df1.groupby(['userId'], sort=False)['id'].transform('count')
# delete not necessary columns
df1 = df1[['userId', 'count']]
# delete not necessary rows
df1 = df1.drop_duplicates(subset=['userId'])
# repeat
df2['count'] = df2.groupby(['userId'], sort=False)['id'].transform('count')
df2 = df2[['userId', 'count']]
df2 = df2.drop_duplicates(subset=['userId'])
# merge and sum up
print(pd.concat([df1, df2]).groupby(['userId'], sort=False).sum())
What is the pythonic / pandas way of merging the information of several Series (and DataFrames) efficiently?
Let me suggest "add" and specify a fill value of 0. This has an advantage over the previously suggested answer in that it will work when the two Dataframes have non-identical sets of unique keys.
# Create frames
df1 = pd.DataFrame(
{'User_id': ['a', 'a', 'b', 'c', 'c', 'd'], 'a': [1, 1, 2, 3, 3, 5]})
df2 = pd.DataFrame(
{'User_id': ['a', 'a', 'b', 'b', 'c', 'c', 'c'], 'a': [1, 1, 2, 2, 3, 3, 4]})
Now add the two sets of value_counts(). The fill_value argument handles any NaN values that would otherwise arise; in this example, that is the 'd' that appears in df1 but not in df2.
a = df1.User_id.value_counts()
b = df2.User_id.value_counts()
a.add(b,fill_value=0)
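The same idea extends to any number of monthly chunks by keeping a running total. The sketch below rebuilds the two months from the question as small frames; the names month1, month2 and total are illustrative, and in practice each chunk would come from a database query.

```python
import pandas as pd

month1 = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                       'userId': [1, 1, 2, 3, 3],
                       'actionType': list('acaab')})
month2 = pd.DataFrame({'id': [6, 7, 8, 9],
                       'userId': [1, 1, 2, 3],
                       'actionType': list('bbac')})

# Start from an empty Series and add each chunk's counts to the running total.
total = pd.Series(dtype='float64')
for chunk in (month1, month2):
    total = total.add(chunk['userId'].value_counts(), fill_value=0)

# add() with fill_value produces floats, so cast back to integers at the end.
total = total.astype(int).sort_index()
print(total)
```

Only the running total and one chunk need to be in memory at a time, which fits the case where the full table cannot be loaded into RAM.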
You can sum the series generated by the value_counts method directly:
#create frames
df= pd.DataFrame({'User_id': ['a','a','b','c','c'],'a':[1,1,2,3,3]})
df1= pd.DataFrame({'User_id': ['a','a','b','b','c','c','c'],'a':[1,1,2,2,3,3,4]})
sum the series:
df.User_id.value_counts() + df1.User_id.value_counts()
output:
a 4
b 3
c 5
dtype: int64
This is known as "Split-Apply-Combine". It can be done in one line, using a lambda function as follows.
1️⃣ paste this into your code:
df['total_for_this_label'] = df.groupby('label', as_index=False)['label'].transform(lambda x: x.count())
2️⃣ replace 3x label with the name of the column whose values you are counting (case-sensitive)
3️⃣ print df.head() to check it's worked correctly