Pandas - Calculate expected frequency table - python

Consider the following dataframe:
data = [[1, 2, 3, 4], [4, 3, 2, 1]]
df = pd.DataFrame(data, columns = ['A', 'B', 'C', 'D'])
What would be the most efficient way to generate an expected frequency table, i.e. for each cell compute (row total * column total) / (grand total)?
So that the final dataframe is:
data = [[2.5, 2.5, 2.5, 2.5], [2.5, 2.5, 2.5, 2.5]]
df = pd.DataFrame(data, columns = ['A', 'B', 'C', 'D'])

You can use the underlying numpy array and broadcasting:
a = df.values
pd.DataFrame((a.sum(0) * a.sum(1)[:, None]) / a.sum(),
             columns=df.columns, index=df.index)
output:
A B C D
0 2.5 2.5 2.5 2.5
1 2.5 2.5 2.5 2.5

df.apply(lambda ss: ss.map(lambda x: ss.sum()), axis=1) * df.sum() / df.sum().sum()
out:
A B C D
0 2.5 2.5 2.5 2.5
1 2.5 2.5 2.5 2.5
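If SciPy is available, the same table can also be produced with scipy.stats.contingency.expected_freq; a small alternative sketch, not part of the original answers:
from scipy.stats.contingency import expected_freq

# expected_freq returns (row total * column total) / grand total for each cell
pd.DataFrame(expected_freq(df.values), columns=df.columns, index=df.index)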

Related

duplicate index in a list and calculate mean by index

Input: a list of dataframes
df1 = pd.DataFrame({'N': [1.2, 1.4, 3.3]}, index=[1, 2, 3])
df2 = pd.DataFrame({'N': [2.2, 1.8, 4.3]}, index=[1, 2, 4])
df3 = pd.DataFrame({'N': [2.5, 6.4, 4.9]}, index=[3, 5, 7])
df_list = []
for df in (df1, df2, df3):
    df_list.append(df)
The index values [1, 2, 3] are duplicated across the dataframes, and I want the average of the duplicated entries in the output.
Output: a dataframe with the corresponding index
1 (1.2+2.2)/2
2 (1.4+1.8)/2
3 (3.3+2.5)/2
4 4.3
5 6.4
7 4.9
So how do I group by the duplicate index values across the list and output the averages as a dataframe? Directly concatenating the dataframes is not an option for me.
I would first concatenate all the data into a single DataFrame. Note that the values will automatically be aligned by index. Then you can get the means easily:
df1 = pd.DataFrame({'N': [1.2, 1.4, 3.3]}, index=[1, 2, 3])
df2 = pd.DataFrame({'N': [2.2, 1.8, 4.3]}, index=[1, 2, 4])
df3 = pd.DataFrame({'N': [2.5, 6.4, 4.9]}, index=[3, 5, 7])
df_list = [df1, df2, df3]
df = pd.concat(df_list, axis=1)
df.columns = ['N1', 'N2', 'N3']
print(df.mean(axis=1))
1 1.7
2 1.6
3 2.9
4 4.3
5 6.4
7 4.9
dtype: float64
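An alternative sketch, still relying on concat but stacking the frames vertically and grouping on the index, which avoids renaming columns by hand:
# stack all frames, then average the rows that share an index label
print(pd.concat(df_list).groupby(level=0).mean())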

multiply 2 columns in 2 dfs if they match the column name

I have 2 dataframes with some shared column names.
I tried this; it worked only when the column names in the national df are not repeated.
out = {}
for col in national.columns:
    for col2 in F.columns:
        if col == col2:
            out[col] = national[col].values * F[col2].values
When I used the same code on a df that has repeated column names, I got the error 'shapes (26,33) and (1,26) not aligned: 33 (dim 1) != 1 (dim 0)', because the second df has 33 columns with the same name, and each of them needs to be multiplied elementwise with a single column from the first df.
The following code does not work either, as there are repeated column names in urban.columns.
[np.matrix(urban[col].values) * np.matrix(F[col2].values) for col in urban.columns for col2 in F.columns if col == col2]
Reproducible code:
df1 = pd.DataFrame({
    'Col1': [1, 2, 1, 2, 3],
    'Col2': [2, 4, 2, 4, 6],
    'Col3': [7, 4, 2, 8, 6]})
df2 = pd.DataFrame({
    'Col1': [1.5, 2.0, 3.0, 5.0, 10.0],
    'Col2': [1.0, 0.0, 4.0, 5.0, 7.0]})
Hopefully the working example below helps. Please provide a minimal reproducible example in your question, with input code and desired output, like I have provided, and please see how to ask a good pandas question.
df1 = pd.DataFrame({
    'Product': ['AA', 'AA', 'BB', 'BB', 'BB'],
    'Col1': [1, 2, 1, 2, 3],
    'Col2': [2, 4, 2, 4, 6]})
print(df1)
df2 = pd.DataFrame({
    'FX Rate': [1.5, 2.0, 3.0, 5.0, 10.0]})
print(df2)
df1 = df1.reset_index(drop=True)
df2 = df2.reset_index(drop=True)
for col in ['Col1', 'Col2']:
    df1[col] = df1[col] * df2['FX Rate']
df1
(df1)
Product Col1 Col2
0 AA 1 2
1 AA 2 4
2 BB 1 2
3 BB 2 4
4 BB 3 6
(df2)
FX Rate
0 1.5
1 2.0
2 3.0
3 5.0
4 10.0
Out[1]:
Product Col1 Col2
0 AA 1.5 3.0
1 AA 4.0 8.0
2 BB 3.0 6.0
3 BB 10.0 20.0
4 BB 30.0 60.0
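As a side note, the per-column loop above can be written as one vectorized step; a hedged rewrite of the same example, not part of the original answer:
# multiply both value columns by the 'FX Rate' series, aligning on the row index
df1[['Col1', 'Col2']] = df1[['Col1', 'Col2']].mul(df2['FX Rate'], axis=0)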
You can't multiply two DataFrames if they have different shapes, but if you want to multiply them anyway, use a transpose:
out = {}
for col in national.columns:
    for col2 in F.columns:
        if col == col2:
            out[col] = national[col].values * F[col2].T.values
You can get the common columns of the 2 dataframes, multiply those common columns with simple elementwise multiplication, and then join the column(s) that appear only in df1 back onto the result, as follows:
common_cols = df1.columns.intersection(df2.columns)
df1_only_cols = df1.columns.difference(common_cols)
df1_out = df1[df1_only_cols].join(df1[common_cols] * df2[common_cols])
df1 = df1_out.reindex_like(df1)
Demo
df1 = pd.DataFrame({
    'Product': ['AA', 'AA', 'BB', 'BB', 'BB'],
    'Col1': [1, 2, 1, 2, 3],
    'Col2': [2, 4, 2, 4, 6],
    'Col3': [7, 4, 2, 8, 6]})
df2 = pd.DataFrame({
    'Col1': [1.5, 2.0, 3.0, 5.0, 10.0],
    'Col2': [1.0, 0.0, 4.0, 5.0, 7.0]})
common_cols = df1.columns.intersection(df2.columns)
df1_only_cols = df1.columns.difference(common_cols)
df1_out = df1[df1_only_cols].join(df1[common_cols] * df2[common_cols])
df1 = df1_out.reindex_like(df1)
print(df1)
Product Col1 Col2 Col3
0 AA 1.5 2.0 7
1 AA 4.0 0.0 4
2 BB 3.0 8.0 2
3 BB 10.0 20.0 8
4 BB 30.0 42.0 6
A friend of mine sent this solution, which works just as I wanted.
out = urban.copy()
for col in urban.columns:
    for col2 in F.columns:
        if col == col2:
            out.loc[:, col] = urban.loc[:, [col]].values * F.loc[:, [col2]].values
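If both frames really do carry repeated column labels, one hedged fallback (assuming urban and F have identical column labels in identical order and the same number of rows) is to multiply the underlying arrays so pandas never attempts label-based alignment:
# elementwise multiply the raw arrays, then restore the original labels
out = pd.DataFrame(urban.to_numpy() * F.to_numpy(),
                   index=urban.index, columns=urban.columns)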

The result of dataframe.mean() is incorrect

I am working in Python 2.7. I have a data frame, and I want to get the average of the column called 'c', but only for the rows where the value in another column equals some given value.
When I execute the code, the mean is unexpected, but when I calculate the median instead, the result is correct.
Why is the output of the mean incorrect?
The code is the following:
df = pd.DataFrame(
    np.array([['A', 1, 2, 3], ['A', 4, 5, np.nan], ['A', 7, 8, 9],
              ['B', 3, 2, np.nan], ['B', 5, 6, np.nan], ['B', 5, 6, np.nan]]),
    columns=['a', 'b', 'c', 'd']
)
df
mean1 = df[df.a == 'A'].c.mean()
mean2 = df[df.a == 'B'].c.mean()
median1 = df[df.a == 'A'].c.median()
median2 = df[df.a == 'B'].c.median()
The output:
df
Out[1]:
a b c d
0 A 1 2 3
1 A 4 5 nan
2 A 7 8 9
3 B 3 2 nan
4 B 5 6 nan
5 B 5 6 nan
mean1
Out[2]: 86.0
mean2
Out[3]: 88.66666666666667
median1
Out[4]: 5.0
median2
Out[5]: 6.0
It is obvious that the output of the mean is incorrect.
Thanks.
Pandas is doing string concatenation for the "sum" when calculating the mean; this is plain to see from your example frame.
>>> df[df.a == 'B'].c
3 2
4 6
5 6
Name: c, dtype: object
>>> '2' + '6' + '6'
'266'
>>> 266 / 3
88.66666666666667
If you look at the dtypes for your DataFrame, you'll notice that all of them are object, even though no single Series contains mixed types. This is due to the declaration of your numpy array. Arrays are not meant to contain heterogeneous types, so the array is coerced to a single common dtype, and every column of the resulting DataFrame ends up as object. You can avoid this behavior by passing the constructor a list instead, which can hold differing dtypes with no issues.
df = pd.DataFrame(
    [['A', 1, 2, 3], ['A', 4, 5, np.nan], ['A', 7, 8, 9],
     ['B', 3, 2, np.nan], ['B', 5, 6, np.nan], ['B', 5, 6, np.nan]],
    columns=['a', 'b', 'c', 'd']
)
df[df.a == 'B'].c.mean()
4.666666666666667
In [17]: df.dtypes
Out[17]:
a object
b int64
c int64
d float64
dtype: object
I still can't imagine that this behavior is intended, so I believe it's worth opening an issue report on the pandas development page, but in general, you shouldn't be using object dtype Series for numeric calculations.
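If the DataFrame has already been built with object dtype, as in the question, one hedged fix is to convert the numeric columns before aggregating (column names taken from the example):
# parse the string-typed columns back into numbers; 'nan' strings become real NaN
df[['b', 'c', 'd']] = df[['b', 'c', 'd']].apply(pd.to_numeric)
df[df.a == 'B'].c.mean()  # 4.666666666666667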

How to get number of rows in a grouped-by category in pandas

In pandas, I have (app_categ_events is a dataframe):
> print(app_categ_events.label_id.unique().shape)
> print(app_categ_events.category.unique().shape)
Out:
(492,)
(458,)
I want to look at the categories that have more than one label_id (because I thought there was supposed to be a one-to-one mapping).
In R data.table, I can do:
app_categ_events[, count_rows := .N, by = list(category, label_id)]
# (or something of that sort...)
print(app_categ_events[count_rows > 1])
What’s the best way of doing that in pandas?
We use transform to create the 'count_rows' column after grouping by 'category' and 'label_id':
app_categ_events['count_rows'] = app_categ_events.groupby(
    ['category', 'label_id'])['label_id'].transform('count')
print(app_categ_events)
# category label_id count_rows
#0 a 1 2
#1 a 1 2
#2 b 2 1
#3 b 3 1
Now, the equivalent of the data.table filter shown in the OP's post would be
print(app_categ_events[app_categ_events.count_rows>1])
# category label_id count_rows
#0 a 1 2
#1 a 1 2
data
import pandas as pd
app_categ_events = pd.DataFrame({'category': ['a', 'a', 'b', 'b'], 'label_id': [1, 1, 2, 3]})
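An equivalent way to get the rows whose ('category', 'label_id') pair occurs more than once, without creating the helper column; a short hedged sketch using duplicated:
# keep=False marks every member of a duplicated pair, not just the repeats
dup_mask = app_categ_events.duplicated(subset=['category', 'label_id'], keep=False)
print(app_categ_events[dup_mask])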
You can use groupby filtering to return the desired results.
df = pd.DataFrame({'label_id': [1, 1, 2, 3],
                   'category': ['a', 'b', 'b', 'c']})
df.groupby(['category']).filter(lambda group: len(group) > 1)
category label_id
1 b 1
2 b 2
Given:
app_categ_events = pd.DataFrame({'category': ['a', 'a', 'b', 'b'],
                                 'label_id': [1, 1, 2, 3]})
Solution:
# identify categories with greater than 1 number of related label_id's
cat_mask = app_categ_events.groupby('category')['label_id'].nunique().gt(1)
cats = cat_mask[cat_mask]
# filter data
app_categ_events[app_categ_events.category.isin(cats.index)]
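A related hedged sketch that targets the question's stated goal directly (categories mapped to more than one distinct label_id), combining the filter and nunique ideas from the answers above:
# keep only the groups whose label_id values are not all identical
app_categ_events.groupby('category').filter(lambda g: g['label_id'].nunique() > 1)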

Remove columns that have 'N' number of NA values in it - python

Suppose I use df.isnull().sum() and get a count of the NA values in each column of the df dataframe. I want to remove every column that has K or more NA values.
For example,
df = pd.DataFrame({'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
                   'B': [0, np.nan, np.nan, 0, 0, 0],
                   'C': [0, 0, 0, 0, 0, 0.0],
                   'D': [5, 5, np.nan, np.nan, 5.6, 6.8],
                   'E': [0, np.nan, np.nan, np.nan, np.nan, np.nan]})
df.isnull().sum()
A 1
B 2
C 0
D 2
E 5
dtype: int64
Suppose I want to remove columns that have 2 or more NA values. How would I approach this problem? My output should be:
df.columns
A,C
Can anybody help me in doing this?
Thanks
Call dropna and pass axis=1 to drop column-wise, and pass thresh=len(df)-K+1. What thresh does is set the minimum number of non-NaN values a column must keep, so a column is dropped once it has K or more NaN values; here K=2, so thresh=len(df)-1:
In [22]:
df.dropna(axis=1, thresh=len(df)-1)
Out[22]:
A C
0 1.0 0
1 2.1 0
2 NaN 0
3 4.7 0
4 5.6 0
5 6.8 0
If you just want the columns:
In [23]:
df.dropna(axis=1, thresh=len(df)-1).columns
Out[23]:
Index(['A', 'C'], dtype='object')
Or simply mask the counts output against the columns:
In [28]:
df.columns[df.isnull().sum() <2]
Out[28]:
Index(['A', 'C'], dtype='object')
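To generalize either approach to an arbitrary threshold, a hedged sketch where K is a placeholder variable, not something from the question:
K = 2  # drop columns with K or more NaN values
df.loc[:, df.isnull().sum() < K]           # boolean-mask version
df.dropna(axis=1, thresh=len(df) - K + 1)  # dropna version: keep columns with at least len(df)-K+1 non-NaN values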
Could do something like:
df = df.reindex(columns=[x for x in df.columns.values if df[x].isnull().sum() < threshold])
Which just builds a list of columns that match your requirement (fewer than threshold nulls), and then uses that list to reindex the dataframe. So if you set threshold to 1:
threshold = 1
df = pd.DataFrame({'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
                   'B': [0, np.nan, np.nan, 0, 0, 0],
                   'C': [0, 0, 0, 0, 0, 0.0],
                   'D': [5, 5, np.nan, np.nan, 5.6, 6.8],
                   'E': ['NA', 'NA', 'NA', 'NA', 'NA', 'NA']})
df = df.reindex(columns=[x for x in df.columns.values if df[x].isnull().sum() < threshold])
df.count()
Will yield:
C 6
E 6
dtype: int64
The dropna() function has a thresh argument that allows you to give the number of non-NaN values you require, so this would give you your desired output:
df.dropna(axis=1,thresh=5).count()
A 5
C 6
E 6
If you wanted just C & E, you'd have to change thresh to 6 in this case.
