duplicate index in a list and calculate mean by index

Input: a list of dataframes
df1 = pd.DataFrame({'N': [1.2, 1.4, 3.3]}, index=[1, 2, 3])
df2 = pd.DataFrame({'N': [2.2, 1.8, 4.3]}, index=[1, 2, 4])
df3 = pd.DataFrame({'N': [2.5, 6.4, 4.9]}, index=[3, 5, 7])
df_list = []
for df in (df1, df2, df3):
    df_list.append(df)
The index values 1, 2 and 3 each appear in more than one dataframe, and I want their averages in the output.
Output: a dataframe with the corresponding index
1 (1.2+2.2)/2
2 (1.4+1.8)/2
3 (3.3+2.5)/2
4 4.3
5 6.4
7 4.9
So how can I group by the duplicate index values across a list of dataframes and output the averages into a dataframe? Directly concatenating the dataframes is not an option for me.

I would first concatenate all the data into a single DataFrame. Note that the values will automatically be aligned by index. Then you can get the means easily:
import pandas as pd
df1 = pd.DataFrame({'N': [1.2, 1.4, 3.3]}, index=[1, 2, 3])
df2 = pd.DataFrame({'N': [2.2, 1.8, 4.3]}, index=[1, 2, 4])
df3 = pd.DataFrame({'N': [2.5, 6.4, 4.9]}, index=[3, 5, 7])
df_list = [df1, df2, df3]
df = pd.concat(df_list, axis=1)
df.columns = ['N1', 'N2', 'N3']
print(df.mean(axis=1))
1 1.7
2 1.6
3 2.9
4 4.3
5 6.4
7 4.9
dtype: float64
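Since the question says concatenation is not an option, here is a minimal sketch of one way to get the same means without `pd.concat`. It is not the method of the answer above. It assumes every frame has the same single column `N`, and it relies on `DataFrame.add` with `fill_value=0` to align on the index. The names `total`, `count` and `mean` are only illustrative.

```python
from functools import reduce

import pandas as pd

df1 = pd.DataFrame({'N': [1.2, 1.4, 3.3]}, index=[1, 2, 3])
df2 = pd.DataFrame({'N': [2.2, 1.8, 4.3]}, index=[1, 2, 4])
df3 = pd.DataFrame({'N': [2.5, 6.4, 4.9]}, index=[3, 5, 7])
df_list = [df1, df2, df3]

# Running sum per index label; a label missing from one frame counts as 0.
total = reduce(lambda acc, df: acc.add(df, fill_value=0), df_list)

# How many frames contributed a value for each index label.
indicators = [df.notna().astype(int) for df in df_list]
count = reduce(lambda acc, df: acc.add(df, fill_value=0), indicators)

# Element-wise mean over the frames that actually contain each label.
mean = total / count
print(mean)
```

Labels that occur in only one frame are divided by 1, so their values pass through unchanged.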

Related

Creating new column by mapping to dictionary (with string contain match)

I am trying to create the column Factor in df1 based on the lookup table df2. However, the Code columns used for the mapping do not match exactly: df2 contains only part of each Code string.
import pandas as pd
df1 = pd.DataFrame({
    'Date':['2021-01-01', '2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02', '2021-01-02', '2021-01-02', '2021-01-03'],
    'Ratings':[9.0, 8.0, 5.0, 3.0, 2, 3, 6, 5],
    'Code':['R:EST 5R', 'R:EKG EK', 'R:EKG EK', 'R:EST 5R', 'R:EKGP', 'R:EST 5R', 'R:OID_P', 'R:OID_P']})
df2 = pd.DataFrame({
    'Code':['R:EST', 'R:EKG', 'R:OID'],
    'Factor':[1, 1.3, 0.9]})
So far I haven't been able to map the dataframes correctly, because the Code values are not exactly the same. The Code column does not necessarily start with "R:".
df1['Factor'] = df1['Code'].map(df2.set_index('Code')['Factor'])
This is what the preferred output would look like:
df3 = pd.DataFrame({
    'Date':['2021-01-01', '2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02', '2021-01-02', '2021-01-02', '2021-01-03'],
    'Ratings':[9.0, 8.0, 5.0, 3.0, 2, 3, 6, 5],
    'Code':['R:EST 5R', 'R:EKG EK', 'R:EKG EK', 'R:EST 5R', 'R:EKGP', 'R:EST 5R', 'R:OID_P', 'R:OID_P'],
    'Factor':[1, 1.3, 1.3, 1, 1.3, 1, 0.9, 0.9]})
Thanks a lot!

>>> df1['Code'].str[:5].map(df2.set_index('Code')['Factor'])
0 1.0
1 1.3
2 1.3
3 1.0
4 1.3
5 1.0
6 0.9
7 0.9
Name: Code, dtype: float64
>>> (df2.Code
...     .apply(lambda x: df1.Code.str.contains(x))
...     .T
...     .idxmax(axis=1)
...     .apply(lambda x: df2.Factor.iloc[x])
... )
0 1.0
1 1.3
2 1.3
3 1.0
4 1.3
5 1.0
6 0.9
7 0.9
dtype: float64
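The first snippet above slices a fixed five characters, which only works while every code in df2 is exactly that long. As a hedged alternative, here is a sketch that builds a regular expression from the df2 codes and extracts whichever one occurs in each df1 code. It assumes no df2 code is a prefix of another (otherwise the order of the alternatives matters), and the names `pattern` and `matched` are only illustrative.

```python
import re

import pandas as pd

df1 = pd.DataFrame({
    'Code': ['R:EST 5R', 'R:EKG EK', 'R:EKG EK', 'R:EST 5R',
             'R:EKGP', 'R:EST 5R', 'R:OID_P', 'R:OID_P']})
df2 = pd.DataFrame({
    'Code': ['R:EST', 'R:EKG', 'R:OID'],
    'Factor': [1, 1.3, 0.9]})

# One capture group listing every known code, escaped for use in a regex.
pattern = '(' + '|'.join(re.escape(code) for code in df2['Code']) + ')'

# For each row, pull out the known code contained in the longer string
# (NaN where none of the codes occurs).
matched = df1['Code'].str.extract(pattern, expand=False)

# Look up the factor for the extracted code.
df1['Factor'] = matched.map(df2.set_index('Code')['Factor'])
print(df1)
```

Because the pattern is not anchored, this also covers codes that do not start with "R:".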

Pandas - Calculate expected frequency table

Consider the following dataframe:
data = [[1, 2, 3, 4], [4, 3, 2, 1]]
df = pd.DataFrame(data, columns = ['A', 'B', 'C', 'D'])
What would be the most efficient way to generate an expected frequency table? That is, for each cell, compute (row total * column total) / (grand total).
So that the final dataframe is:
data = [[2.5, 2.5, 2.5, 2.5], [2.5, 2.5, 2.5, 2.5]]
df = pd.DataFrame(data, columns = ['A', 'B', 'C', 'D'])

You can use the underlying numpy array and broadcasting:
a = df.values
pd.DataFrame((a.sum(0) * a.sum(1)[:, None]) / a.sum(),
             columns=df.columns, index=df.index)
Output:
     A    B    C    D
0  2.5  2.5  2.5  2.5
1  2.5  2.5  2.5  2.5

df.apply(lambda ss:ss.map(lambda x:ss.sum()),axis=1)*df.sum()/df.sum().sum()
Output:
     A    B    C    D
0  2.5  2.5  2.5  2.5
1  2.5  2.5  2.5  2.5
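The formula in the question is the outer product of the row totals and the column totals, divided by the grand total. A minimal sketch that spells this out with `np.outer` follows. It is an equivalent formulation rather than either answer's exact code, and the names `row_totals`, `col_totals` and `expected` are only illustrative.

```python
import numpy as np
import pandas as pd

data = [[1, 2, 3, 4], [4, 3, 2, 1]]
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'])

row_totals = df.sum(axis=1).to_numpy()  # one total per row
col_totals = df.sum(axis=0).to_numpy()  # one total per column
grand_total = row_totals.sum()

# expected[i, j] = row_totals[i] * col_totals[j] / grand_total
expected = pd.DataFrame(
    np.outer(row_totals, col_totals) / grand_total,
    index=df.index,
    columns=df.columns,
)
print(expected)
```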

multiply 2 columns in 2 dfs if they match the column name

I have two dataframes that share some column names.
I tried this, but it only worked when the column names in the national df are not repeated.
out = {}
for col in national.columns:
    for col2 in F.columns:
        if col == col2:
            out[col] = national[col].values * F[col2].values
I tried to use the same code on a df in which several columns have the same name, but I got the following error: 'shapes (26,33) and (1,26) not aligned: 33 (dim 1) != 1 (dim 0)'. This is because the second df has 33 columns with the same name, and they need to be multiplied elementwise with one column of the first df.
This code does not work, as there are repeated column names in urban.columns.
[np.matrix(urban[col].values) * np.matrix(F[col2].values) for col in urban.columns for col2 in F.columns if col == col2]
Reproducible code
df1 = pd.DataFrame({
    'Col1': [1, 2, 1, 2, 3],
    'Col2': [2, 4, 2, 4, 6],
    'Col2': [7, 4, 2, 8, 6]})
df2 = pd.DataFrame({
    'Col1': [1.5, 2.0, 3.0, 5.0, 10.0],
    'Col2': [1, 0.0, 4.0, 5.0, 7.0]})

Hopefully the working example below helps. Please provide a minimal reproducible example in your question, with input code and desired output, as I have done here. Please see how to ask a good pandas question:
df1 = pd.DataFrame({
    'Product': ['AA', 'AA', 'BB', 'BB', 'BB'],
    'Col1': [1, 2, 1, 2, 3],
    'Col2': [2, 4, 2, 4, 6]})
print(df1)
df2 = pd.DataFrame({
    'FX Rate': [1.5, 2.0, 3.0, 5.0, 10.0]})
print(df2)
df1 = df1.reset_index(drop=True)
df2 = df2.reset_index(drop=True)
for col in ['Col1', 'Col2']:
    df1[col] = df1[col] * df2['FX Rate']
df1
(df1)
  Product  Col1  Col2
0      AA     1     2
1      AA     2     4
2      BB     1     2
3      BB     2     4
4      BB     3     6
(df2)
   FX Rate
0      1.5
1      2.0
2      3.0
3      5.0
4     10.0
Out[1]:
  Product  Col1  Col2
0      AA   1.5   3.0
1      AA   4.0   8.0
2      BB   3.0   6.0
3      BB  10.0  20.0
4      BB  30.0  60.0

You can't multiply two DataFrames if they have different shapes, but if you want to multiply them anyway, use transpose:
out = {}
for col in national.columns:
    for col2 in F.columns:
        if col == col2:
            out[col] = national[col].values * F[col2].T.values

You can get the columns common to both dataframes and multiply those columns directly. Then join the columns that exist only in df1 back to the multiplication result, as follows:
common_cols = df1.columns.intersection(df2.columns)
df1_only_cols = df1.columns.difference(common_cols)
df1_out = df1[df1_only_cols].join(df1[common_cols] * df2[common_cols])
df1 = df1_out.reindex_like(df1)
Demo
df1 = pd.DataFrame({
    'Product': ['AA', 'AA', 'BB', 'BB', 'BB'],
    'Col1': [1, 2, 1, 2, 3],
    'Col2': [2, 4, 2, 4, 6],
    'Col3': [7, 4, 2, 8, 6]})
df2 = pd.DataFrame({
    'Col1': [1.5, 2.0, 3.0, 5.0, 10.0],
    'Col2': [1, 0.0, 4.0, 5.0, 7.0]})
common_cols = df1.columns.intersection(df2.columns)
df1_only_cols = df1.columns.difference(common_cols)
df1_out = df1[df1_only_cols].join(df1[common_cols] * df2[common_cols])
df1 = df1_out.reindex_like(df1)
print(df1)
  Product  Col1  Col2  Col3
0      AA   1.5   2.0     7
1      AA   4.0   0.0     4
2      BB   3.0   8.0     2
3      BB  10.0  20.0     8
4      BB  30.0  42.0     6

A friend of mine sent this solution, which works just as I wanted.
out = urban.copy()
for col in urban.columns:
    for col2 in F.columns:
        if col == col2:
            out.loc[:, col] = urban.loc[:, [col]].values * F.loc[:, [col2]].values
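The loop above works because every same-named column in urban is multiplied by the single matching column in F. A loop-free sketch of the same idea follows, under two stated assumptions: the column names in F are unique, and every column name in urban also exists in F. The small `urban` and `F` frames here are made-up stand-ins, not the asker's data.

```python
import pandas as pd

# Hypothetical stand-in data: 'Col2' appears twice in urban, once in F.
urban = pd.DataFrame([[1, 2, 7], [2, 4, 4], [1, 2, 2]],
                     columns=['Col1', 'Col2', 'Col2'])
F = pd.DataFrame({'Col1': [1.5, 2.0, 3.0],
                  'Col2': [1.0, 0.0, 4.0]})

# Selecting F by urban's column labels repeats F's column once for
# every same-named column in urban, giving an array of matching shape.
factors = F[urban.columns].to_numpy()

# Plain element-wise product on the underlying arrays.
out = pd.DataFrame(urban.to_numpy() * factors,
                   index=urban.index, columns=urban.columns)
print(out)
```

If some urban columns have no counterpart in F, the selection raises a KeyError, so those columns would need to be set aside first.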

Keep the last n real values of uneven rows in a dataframe?

I am collecting heart rate values over the course of time. Each subject varies in the length of time that data was collected. I would like to make a table of the last 2 seconds of collected data.
import pandas as pd
import numpy as np
# example data
example_s = [["4/20/21 4:20", 302, 0, 0, 1, 2, 3],
             ["2/17/21 9:20", 135, 1, 1.4, 8, 10, np.nan, np.nan],
             ["2/17/21 9:20", 111, 5, 5, 1, np.nan, np.nan, np.nan, np.nan]]
example_s_table = pd.DataFrame(example_s, columns=['Date_Time', 'CID', 0, 1, 2, 3, 4, 5, 6])
desired_outcome = [["4/20/21 4:20", 302, 1, 2, 3],
                   ["2/17/21 9:20", 135, 1.4, 8, 10],
                   ["2/17/21 9:20", 111, 5, 5, 1]]
desired_outcome_table = pd.DataFrame(desired_outcome, columns=['Date_Time', 'CID', "Second 1", "Second 2", "Second 3"])
I can see how to collect a single instance of the data from the example shown here, but would like to know how to quickly add multiple values to my table:
desired_outcome_table["Last Second"]=example_s_table.iloc[:,1:].ffill(axis=1).iloc[:, -1]
Python Dataframe Get Value of Last Non Null Column for Each Row

Try:
df = example_s_table.copy()
df = df.set_index(['Date_Time', 'CID'])
df_out = df.mask(df.eq(0))\
           .apply(lambda x: pd.Series(x.dropna().tail(3).values), axis=1)\
           .rename(columns = lambda x: f'Second {x+1}')
df_out['Last Second'] = df_out['Second 3']
print(df_out.reset_index())
Output:
      Date_Time  CID  Second 1  Second 2  Second 3  Last Second
0  4/20/21 4:20  302       1.0       2.0       3.0          3.0
1  2/17/21 9:20  135       1.4       8.0      10.0         10.0
2  2/17/21 9:20  111       5.0       5.0       1.0          1.0
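The core of the answer above is: per row, drop the missing values and keep the tail. A minimal sketch that wraps this in a helper for any n follows. The helper name `last_n_values` and the frame `wide` are hypothetical, and unlike the answer this version does not mask zeros first. Rows with fewer than n real values are left-padded with NaN.

```python
import numpy as np
import pandas as pd

def last_n_values(row, n):
    """Return the last n non-missing values of a row as a Series."""
    values = row.dropna().to_numpy()[-n:]
    padded = np.full(n, np.nan)
    if len(values):
        # Right-align the values so short rows get leading NaN.
        padded[n - len(values):] = values
    return pd.Series(padded)

# Readings only, indexed by CID (same numbers as the question's example).
wide = pd.DataFrame(
    [[0, 0, 1, 2, 3, np.nan, np.nan],
     [1, 1.4, 8, 10, np.nan, np.nan, np.nan],
     [5, 5, 1, np.nan, np.nan, np.nan, np.nan]],
    index=[302, 135, 111])

n = 3
last3 = wide.apply(lambda row: last_n_values(row, n), axis=1)
last3.columns = [f"Second {i + 1}" for i in range(n)]
print(last3)
```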

Remove columns that have 'N' number of NA values in it - python

Suppose I use df.isnull().sum() and get a count of the NA values in each column of the dataframe df. I want to remove every column whose NA count is above 'K'.
For example,
df = pd.DataFrame({'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
                   'B': [0, np.nan, np.nan, 0, 0, 0],
                   'C': [0, 0, 0, 0, 0, 0.0],
                   'D': [5, 5, np.nan, np.nan, 5.6, 6.8],
                   'E': [0,np.nan,np.nan,np.nan,np.nan,np.nan],})
df.isnull().sum()
A 1
B 2
C 0
D 2
E 5
dtype: int64
Suppose I want to remove columns that have 2 or more NA values. How would I approach this problem? My output should be:
df.columns
A,C
Can anybody help me with this?
Thanks

Call dropna with axis=1 to drop column-wise and pass thresh=len(df)-K. thresh sets the minimum number of non-NaN values a column needs in order to be kept, which here is the number of rows minus K, so columns with more than K NaN values are dropped. For the example, K is 1:
In [22]:
df.dropna(axis=1, thresh=len(df)-1)
Out[22]:
     A  C
0  1.0  0
1  2.1  0
2  NaN  0
3  4.7  0
4  5.6  0
5  6.8  0
If you just want the columns:
In [23]:
df.dropna(axis=1, thresh=len(df)-1).columns
Out[23]:
Index(['A', 'C'], dtype='object')
Or simply mask the counts output against the columns:
In [28]:
df.columns[df.isnull().sum() <2]
Out[28]:
Index(['A', 'C'], dtype='object')
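To make the relation between the two approaches explicit, here is a small sketch that runs both on the question's data and compares them. It assumes K means the largest number of NaN values a column may have and still be kept. The names `kept_by_mask` and `kept_by_thresh` are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
                   'B': [0, np.nan, np.nan, 0, 0, 0],
                   'C': [0, 0, 0, 0, 0, 0.0],
                   'D': [5, 5, np.nan, np.nan, 5.6, 6.8],
                   'E': [0, np.nan, np.nan, np.nan, np.nan, np.nan]})

K = 1  # assumed meaning: keep columns with at most K NaN values

# Approach 1: boolean mask built from the per-column NaN counts.
na_counts = df.isna().sum()
kept_by_mask = df.loc[:, na_counts <= K]

# Approach 2: dropna keeps columns with at least `thresh` non-NaN values.
kept_by_thresh = df.dropna(axis=1, thresh=len(df) - K)

print(list(kept_by_mask.columns))
print(list(kept_by_thresh.columns))
```

"At most K NaN values" and "at least len(df) - K non-NaN values" are the same condition, which is why both selections agree.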

Could do something like:
df = df.reindex(columns=[x for x in df.columns.values if df[x].isnull().sum() < threshold])
Which just builds a list of columns that match your requirement (fewer than threshold nulls), and then uses that list to reindex the dataframe. So if you set threshold to 1:
threshold = 1
df = pd.DataFrame({'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
                   'B': [0, np.nan, np.nan, 0, 0, 0],
                   'C': [0, 0, 0, 0, 0, 0.0],
                   'D': [5, 5, np.nan, np.nan, 5.6, 6.8],
                   'E': ['NA', 'NA', 'NA', 'NA', 'NA', 'NA'],})
df = df.reindex(columns=[x for x in df.columns.values if df[x].isnull().sum() < threshold])
df.count()
Will yield:
C 6
E 6
dtype: int64

The dropna() function has a thresh argument that allows you to give the number of non-NaN values you require, so this would give you your desired output:
df.dropna(axis=1,thresh=5).count()
A 5
C 6
E 6
If you wanted just C & E, you'd have to change thresh to 6 in this case.
