Creating a new column by mapping to a dictionary (with partial string match) - python

I am trying to create the column Factor in df1 based on the lookup table df2. However, the Code columns used for mapping are not exactly the same: df2 contains only the beginning of each Code string.
import pandas as pd

df1 = pd.DataFrame({
    'Date': ['2021-01-01', '2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02', '2021-01-02', '2021-01-02', '2021-01-03'],
    'Ratings': [9.0, 8.0, 5.0, 3.0, 2, 3, 6, 5],
    'Code': ['R:EST 5R', 'R:EKG EK', 'R:EKG EK', 'R:EST 5R', 'R:EKGP', 'R:EST 5R', 'R:OID_P', 'R:OID_P']})
df2 = pd.DataFrame({
    'Code': ['R:EST', 'R:EKG', 'R:OID'],
    'Factor': [1, 1.3, 0.9]})
So far I haven't been able to map the two data frames correctly, because the Code values are not exactly the same; the codes also do not necessarily start with "R:".
df1['Factor'] = df1['Code'].map(df2.set_index('Code')['Factor'])
This is what the preferred output would look like:
df3 = pd.DataFrame({
    'Date': ['2021-01-01', '2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02', '2021-01-02', '2021-01-02', '2021-01-03'],
    'Ratings': [9.0, 8.0, 5.0, 3.0, 2, 3, 6, 5],
    'Code': ['R:EST 5R', 'R:EKG EK', 'R:EKG EK', 'R:EST 5R', 'R:EKGP', 'R:EST 5R', 'R:OID_P', 'R:OID_P'],
    'Factor': [1, 1.3, 1.3, 1, 1.3, 1, 0.9, 0.9]})
Thanks a lot!

If the matching prefix is always the first five characters of Code, slice it off and map:
>>> df1['Code'].str[:5].map(df2.set_index('Code')['Factor'])
0 1.0
1 1.3
2 1.3
3 1.0
4 1.3
5 1.0
6 0.9
7 0.9
Name: Code, dtype: float64
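
The slice works here because every code in df2 is exactly five characters long. If the prefixes could vary in length, a more general sketch builds a regex from df2's codes and extracts whichever one appears (this assumes each Code contains at most one of the dictionary prefixes):

import re

pattern = '(' + '|'.join(map(re.escape, df2['Code'])) + ')'   # '(R:EST|R:EKG|R:OID)'
df1['Factor'] = (df1['Code']
                 .str.extract(pattern, expand=False)          # pull out the matching prefix
                 .map(df2.set_index('Code')['Factor']))       # then map it to its factor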
Alternatively, test each df2 code for containment in every df1 code:
>>> (df2.Code
...     .apply(lambda x: df1.Code.str.contains(x))
...     .T
...     .idxmax(axis=1)
...     .apply(lambda x: df2.Factor.iloc[x])
... )
0 1.0
1 1.3
2 1.3
3 1.0
4 1.3
5 1.0
6 0.9
7 0.9
dtype: float64
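
One caveat with the contains/idxmax route: for a row that matches none of the codes, idxmax of an all-False row returns the first label, so the row would silently receive the first dictionary factor. A guarded sketch (same assumptions as above) leaves such rows as NaN instead:

matches = df2.Code.apply(lambda x: df1.Code.str.contains(x)).T   # rows of df1 x codes of df2
factors = (matches.idxmax(axis=1)        # position of the first matching code
           .map(df2.Factor)              # look up its factor
           .where(matches.any(axis=1)))  # NaN where nothing matched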


duplicate index in a list and calculate mean by index

Input: a list of dataframes.
df1 = pd.DataFrame({'N': [1.2, 1.4, 3.3]}, index=[1, 2, 3])
df2 = pd.DataFrame({'N': [2.2, 1.8, 4.3]}, index=[1, 2, 4])
df3 = pd.DataFrame({'N': [2.5, 6.4, 4.9]}, index=[3, 5, 7])
df_list = []
for df in (df1, df2, df3):
    df_list.append(df)
The index values [1, 2, 3] are duplicated across the frames, and I want the output to contain their averages.
Output: a dataframe with the corresponding index:
1 (1.2+2.2)/2
2 (1.4+1.8)/2
3 (3.3+2.5)/2
4 4.3
5 6.4
7 4.9
So how do I group by the duplicate indices across the list and output the averages as a dataframe? Directly concatenating the dataframes is not an option for me.
I would first concatenate all the data into a single DataFrame. Note that the values will automatically be aligned by index. Then you can get the means easily:
df1 = pd.DataFrame({'N': [1.2, 1.4, 3.3]}, index=[1, 2, 3])
df2 = pd.DataFrame({'N': [2.2, 1.8, 4.3]}, index=[1, 2, 4])
df3 = pd.DataFrame({'N': [2.5, 6.4, 4.9]}, index=[3, 5, 7])
df_list = [df1, df2, df3]
df = pd.concat(df_list, axis=1)
df.columns = ['N1', 'N2', 'N3']
print(df.mean(axis=1))
1 1.7
2 1.6
3 2.9
4 4.3
5 6.4
7 4.9
dtype: float64
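
The question rules out direct concatenation, but if stacking the rows is acceptable, a shorter sketch groups on the index labels and needs no column renaming:

df_all = pd.concat(df_list)              # rows stacked; duplicate labels are kept
print(df_all.groupby(level=0).mean())    # mean of N per index label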

multiply 2 columns in 2 dfs if they match the column name

I have 2 dfs with some column names in common.
I tried the following; it worked only when the column names in the national df are not repeated.
out = {}
for col in national.columns:
    for col2 in F.columns:
        if col == col2:
            out[col] = national[col].values * F[col2].values
When I used the same code on a df that repeats some column names, I got the error 'shapes (26,33) and (1,26) not aligned: 33 (dim 1) != 1 (dim 0)'. This is because the second df has 33 columns with the same name, and they all need to be multiplied elementwise by the single matching column of the first df.
This code does not work either, as there are repeated column names in urban.columns.
[np.matrix(urban[col].values) * np.matrix(F[col2].values) for col in urban.columns for col2 in F.columns if col == col2]
Reproducible code:
df1 = pd.DataFrame({
    'Col1': [1, 2, 1, 2, 3],
    'Col2': [2, 4, 2, 4, 6],
    'Col3': [7, 4, 2, 8, 6]})
df2 = pd.DataFrame({
    'Col1': [1.5, 2.0, 3.0, 5.0, 10.0],
    'Col2': [1.0, 0.0, 4.0, 5.0, 7.0]})
Hopefully the working example below helps. Please provide a minimal reproducible example in your question, with input code and desired output like I have provided; see also how to ask a good pandas question.
df1 = pd.DataFrame({
    'Product': ['AA', 'AA', 'BB', 'BB', 'BB'],
    'Col1': [1, 2, 1, 2, 3],
    'Col2': [2, 4, 2, 4, 6]})
print(df1)
df2 = pd.DataFrame({
    'FX Rate': [1.5, 2.0, 3.0, 5.0, 10.0]})
print(df2)
# multiplication aligns on the index, so make both indices identical first
df1 = df1.reset_index(drop=True)
df2 = df2.reset_index(drop=True)
for col in ['Col1', 'Col2']:
    df1[col] = df1[col] * df2['FX Rate']
df1
df1:
  Product  Col1  Col2
0      AA     1     2
1      AA     2     4
2      BB     1     2
3      BB     2     4
4      BB     3     6

df2:
   FX Rate
0      1.5
1      2.0
2      3.0
3      5.0
4     10.0

Out[1]:
  Product  Col1  Col2
0      AA   1.5   3.0
1      AA   4.0   8.0
2      BB   3.0   6.0
3      BB  10.0  20.0
4      BB  30.0  60.0
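
The reset_index calls are not cosmetic: df1[col] * df2['FX Rate'] aligns on index labels, not positions. A minimal sketch of what misaligned labels would produce:

s1 = pd.Series([1, 2], index=[0, 1])
s2 = pd.Series([10, 20], index=[1, 2])
print(s1 * s2)   # 0 NaN, 1 20.0, 2 NaN -- only the shared label 1 multiplies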
You can't multiply two DataFrames if they have different shapes, but if you want to multiply them anyway, use a transpose:
out = {}
for col in national.columns:
    for col2 in F.columns:
        if col == col2:
            out[col] = national[col].values * F[col2].T.values
You can get the common columns of the two dataframes, multiply those columns directly, and then join the columns that exist only in df1 back onto the result, as follows:
common_cols = df1.columns.intersection(df2.columns)
df1_only_cols = df1.columns.difference(common_cols)
df1_out = df1[df1_only_cols].join(df1[common_cols] * df2[common_cols])
df1 = df1_out.reindex_like(df1)
Demo
df1 = pd.DataFrame({
    'Product': ['AA', 'AA', 'BB', 'BB', 'BB'],
    'Col1': [1, 2, 1, 2, 3],
    'Col2': [2, 4, 2, 4, 6],
    'Col3': [7, 4, 2, 8, 6]})
df2 = pd.DataFrame({
    'Col1': [1.5, 2.0, 3.0, 5.0, 10.0],
    'Col2': [1.0, 0.0, 4.0, 5.0, 7.0]})
common_cols = df1.columns.intersection(df2.columns)
df1_only_cols = df1.columns.difference(common_cols)
df1_out = df1[df1_only_cols].join(df1[common_cols] * df2[common_cols])
df1 = df1_out.reindex_like(df1)
print(df1)
  Product  Col1  Col2  Col3
0      AA   1.5   2.0     7
1      AA   4.0   0.0     4
2      BB   3.0   8.0     2
3      BB  10.0  20.0     8
4      BB  30.0  42.0     6
A friend of mine sent this solution, which works just as I wanted.
out = urban.copy()
for col in urban.columns:
    for col2 in F.columns:
        if col == col2:
            out.loc[:, col] = urban.loc[:, [col]].values * F.loc[:, [col2]].values
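
The same idea can be written without revisiting each duplicate column separately. A sketch, assuming (as in the question) F has exactly one column per name while urban may repeat names:

out = urban.copy()
for name in F.columns:
    dup = urban.columns == name          # boolean mask covering all duplicates of this name
    if dup.any():
        # (n, k) block of duplicates times an (n, 1) column broadcasts elementwise
        out.loc[:, dup] = urban.loc[:, dup].to_numpy() * F[[name]].to_numpy()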

Keep the last n real values of uneven rows in a dataframe?

I am collecting heart rate values over the course of time. Each subject varies in the length of time that data was collected. I would like to make a table of the last 2 seconds of collected data.
import pandas as pd
import numpy as np
# example data
example_s = [["4/20/21 4:20", 302, 0, 0, 1, 2, 3],
             ["2/17/21 9:20", 135, 1, 1.4, 8, 10, np.NaN, np.NaN],
             ["2/17/21 9:20", 111, 5, 5, 1, np.NaN, np.NaN, np.NaN, np.NaN]]
example_s_table = pd.DataFrame(example_s, columns=['Date_Time', 'CID', 0, 1, 2, 3, 4, 5, 6])
desired_outcome = [["4/20/21 4:20", 302, 1, 2, 3],
                   ["2/17/21 9:20", 135, 1.4, 8, 10],
                   ["2/17/21 9:20", 111, 5, 5, 1]]
desired_outcome_table = pd.DataFrame(desired_outcome, columns=['Date_Time', 'CID', 'Second 1', 'Second 2', 'Second 3'])
I can see how to collect a single instance of the data from the example shown here, but would like to know how to quickly add multiple values to my table:
desired_outcome_table["Last Second"]=example_s_table.iloc[:,1:].ffill(axis=1).iloc[:, -1]
Python Dataframe Get Value of Last Non Null Column for Each Row
Try:
df = example_s_table.copy()
df = df.set_index(['Date_Time', 'CID'])
df_out = df.mask(df.eq(0))\
           .apply(lambda x: pd.Series(x.dropna().tail(3).values), axis=1)\
           .rename(columns=lambda x: f'Second {x+1}')
df_out['Last Second'] = df_out['Second 3']
print(df_out.reset_index())
Output:
      Date_Time  CID  Second 1  Second 2  Second 3  Last Second
0  4/20/21 4:20  302       1.0       2.0       3.0          3.0
1  2/17/21 9:20  135       1.4       8.0      10.0         10.0
2  2/17/21 9:20  111       5.0       5.0       1.0          1.0
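
For wider tables, the row-wise apply can be replaced by a vectorized sketch: a stable argsort pushes the NaN slots to the front of each row while keeping the real values in their original order. This reuses the indexed df from above and assumes every row has at least three non-missing, non-zero readings:

import numpy as np

vals = df.mask(df.eq(0)).to_numpy(dtype=float)
order = np.argsort(~np.isnan(vals), axis=1, kind='stable')   # NaNs sort first
justified = np.take_along_axis(vals, order, axis=1)          # real values right-aligned
last3 = pd.DataFrame(justified[:, -3:], index=df.index,
                     columns=['Second 1', 'Second 2', 'Second 3'])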

Index a Pandas DataFrame/Series by Series of indices

I am trying to find a way to create a pandas Series which is based on values within another DataFrame. A simplified example would be:
df_idx = pd.DataFrame([0, 2, 2, 3, 1, 3])
df_lookup = pd.DataFrame([10.0, 20.0, 30.0, 40.0])
where I wish to generate a new pandas series of values drawn from df_lookup based on the indices in df_idx, i.e.:
df_target = pd.DataFrame([10.0, 30.0, 30.0, 40.0, 20.0, 40.0])
Clearly, it is desirable to do this without looping for speed.
Any help greatly appreciated.
This is what reindex is for:
df_idx = pd.DataFrame([0, 2, 2, 3, 1, 3])
df_lookup = pd.DataFrame([10.0, 20.0, 30.0, 40.0])
df_lookup.reindex(df_idx[0])
Output:
       0
0
0   10.0
2   30.0
2   30.0
3   40.0
1   20.0
3   40.0
This is precisely the use case for iloc:
import pandas as pd
df = pd.DataFrame([10.0, 20.0, 30.0, 40.0])
idx_lst = pd.Series([0, 2, 2, 3, 1, 3])
res = df.iloc[idx_lst]
See here for more on indexing by position.
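
Note that reindex aligns by label while iloc is purely positional; the two coincide here only because the lookup frame has the default RangeIndex. On the raw arrays, the same positional lookup is (a sketch):

res = pd.Series(df[0].to_numpy()[idx_lst.to_numpy()])
print(res)   # 10.0, 30.0, 30.0, 40.0, 20.0, 40.0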

Remove columns that have 'N' or more NA values - python

Suppose I use df.isnull().sum() to get a count of the NA values in each column of the dataframe df. I want to remove every column that has K or more NA values.
For example:
df = pd.DataFrame({'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
                   'B': [0, np.nan, np.nan, 0, 0, 0],
                   'C': [0, 0, 0, 0, 0, 0.0],
                   'D': [5, 5, np.nan, np.nan, 5.6, 6.8],
                   'E': [0, np.nan, np.nan, np.nan, np.nan, np.nan]})
df.isnull().sum()
A 1
B 2
C 0
D 2
E 5
dtype: int64
Suppose I want to remove columns that have 2 or more NA values. How would I approach this problem? My output should be:
df.columns
A,C
Can anybody help me in doing this?
Thanks
Call dropna and pass axis=1 to drop column-wise, with thresh=len(df)-K+1. thresh sets the minimum number of non-NaN values a column needs in order to survive, i.e. the number of rows minus the at most K-1 NaNs it may contain. Here K=2 and len(df)=6, so thresh is 5:
In [22]: df.dropna(axis=1, thresh=len(df)-1)
Out[22]:
     A  C
0  1.0  0
1  2.1  0
2  NaN  0
3  4.7  0
4  5.6  0
5  6.8  0
If you just want the columns:
In [23]: df.dropna(axis=1, thresh=len(df)-1).columns
Out[23]:
Index(['A', 'C'], dtype='object')
Or simply mask the counts output against the columns:
In [28]: df.columns[df.isnull().sum() < 2]
Out[28]:
Index(['A', 'C'], dtype='object')
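
If you want the filtered frame rather than just the surviving labels, the same mask can index the columns directly (a sketch with K = 2):

K = 2
df.loc[:, df.isnull().sum() < K]   # keeps columns A and C along with their data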
Could do something like:
df = df.reindex(columns=[x for x in df.columns.values if df[x].isnull().sum() < threshold])
Which just builds a list of columns that match your requirement (fewer than threshold nulls), and then uses that list to reindex the dataframe. So if you set threshold to 1:
threshold = 1
df = pd.DataFrame({'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
                   'B': [0, np.nan, np.nan, 0, 0, 0],
                   'C': [0, 0, 0, 0, 0, 0.0],
                   'D': [5, 5, np.nan, np.nan, 5.6, 6.8],
                   'E': ['NA', 'NA', 'NA', 'NA', 'NA', 'NA']})
df = df.reindex(columns=[x for x in df.columns.values if df[x].isnull().sum() < threshold])
df.count()
Will yield (note that the string 'NA' is not a null value to pandas, which is why column E survives here):
C 6
E 6
dtype: int64
The dropna() function has a thresh argument that lets you state the number of non-NaN values you require, so (using the same df as above, where E holds the string 'NA') this gives your desired output:
df.dropna(axis=1, thresh=5).count()
A 5
C 6
E 6
If you wanted just C & E, you'd have to change thresh to 6 in this case.
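
Going back to the question's original df (where column E holds real NaNs), the thresh arithmetic generalizes to any K: a column survives when it has at least len(df) - K + 1 non-NaN values, i.e. fewer than K NaNs. A sketch:

K = 2
df.dropna(axis=1, thresh=len(df) - K + 1)   # drops B, D and E; keeps A and C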
