How to combine two different length dataframes with datetime index - python

I have two dataframes like
a = pd.DataFrame(
    {
        'Date': ['01-01-1990', '01-01-1991', '01-01-1993'],
        'A': [1, 2, 3]
    }
)
a = a.set_index('Date')
------------------------------------
A
Date
01-01-1990 1
01-01-1991 2
01-01-1993 3
and another one
b = pd.DataFrame(
    {
        'Date': ['01-01-1990', '01-01-1992', '01-01-1993', '01-01-1994'],
        'B': [4, 6, 7, 8]
    }
)
b = b.set_index('Date')
-------------------------------
B
Date
01-01-1990 4
01-01-1992 6
01-01-1993 7
01-01-1994 8
Notice that the two dataframes have different lengths (a has 3 rows, b has 4) and their dates differ, e.g. '01-01-1992' only appears in b.
The issue is that when I concat these dataframes, I get the result below:
pd.concat([a,b], sort=True)
------------------------------
A B
Date
01-01-1990 1.0 NaN
01-01-1991 2.0 NaN
01-01-1993 3.0 NaN
01-01-1990 NaN 4.0
01-01-1992 NaN 6.0
01-01-1993 NaN 7.0
01-01-1994 NaN 8.0
Here dates such as 01-01-1990 are repeated, and there are NaN entries. How can I get rid of the repeated dates so each date appears once, like this:
A B
Date
01-01-1990 1.0 4.0
01-01-1991 2.0 NaN
01-01-1992 NaN 6.0
01-01-1993 3.0 7.0
01-01-1994 NaN 8.0

concat concatenates along rows (axis=0) by default. Specify axis=1 so it concatenates along columns (joining on the index):
pd.concat([a, b], axis=1)
A B
01-01-1990 1.0 4.0
01-01-1991 2.0 NaN
01-01-1993 3.0 7.0
01-01-1992 NaN 6.0
01-01-1994 NaN 8.0

Or join:
a.join(b, how='outer')
Or merge:
a.merge(b, right_index=True, left_index=True, how='outer')
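Note that the axis=1 output above keeps the union of the two indexes in encounter order; to get the rows sorted by date as in the desired output, a sort_index() call (a small addition, not part of the answer above) finishes the job:

```python
import pandas as pd

# The two frames from the question, already indexed by 'Date'
a = pd.DataFrame({'A': [1, 2, 3]},
                 index=pd.Index(['01-01-1990', '01-01-1991', '01-01-1993'],
                                name='Date'))
b = pd.DataFrame({'B': [4, 6, 7, 8]},
                 index=pd.Index(['01-01-1990', '01-01-1992', '01-01-1993', '01-01-1994'],
                                name='Date'))

# Align on the index along columns (outer join), then sort by date
out = pd.concat([a, b], axis=1).sort_index()
print(out)
```

With a real datetime index (via pd.to_datetime) the sort is chronological rather than lexicographic, which matters for formats like MM-DD-YYYY.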

Related

How to merge 3 dataframes' column with two criteria in python

I tried to merge 3 columns from 3 dataframes based on 2 conditions. For example I have the 3 dataframes below called df_a, df_b and df_c
df_a:
df_b:
df_c:
I want to merge the column Results_b from df_b to df_a if they are the same company and in the same period. Also I would like to remove the column of factor a and factor b.
I tried df_merged = pd.merge(df_a, df_b, on=['Company name', 'Period'], how='left') for merging df_a and df_b and it works, but I am not sure how to only merge the column of Results_a and Results_b instead of merging all columns.
Lastly, I would also like to merge the column Results_c from df_c if they are the same company and in the same period. However, df_c data are based on each quarter (or every 3 months) and df_a and df_b are based on every month, so for the months which is not in df_c, I would like the data to be the same from previous available data. I am not so sure how to deal with it.
This is the outcome that I would like to see:
It would be really appreciated if someone can help me!! Thanks a lot
For reproducing the dataframes:
df_a = pd.DataFrame({
    'Company name': ['A','B','C','A','B','C','A','B','C','A','B','D'],
    'Period': ['2019-01-31','2019-01-31','2019-01-31','2019-02-28','2019-02-28','2019-02-28','2019-03-31','2019-03-31','2019-03-31','2019-04-30','2019-04-30','2019-04-30'],
    'factor a': [37,41,64,52,97,10,55,47,52,61,59,70],
    'Results_a': [1,4,2,3,4,1,2,3,3,1,2,4]
})
# b
df_b = pd.DataFrame({
    'Company name': ['A','B','C','A','B','A','D','B','C'],
    'Period': ['2019-01-31','2019-01-31','2019-01-31','2019-02-28','2019-02-28','2019-03-31','2019-03-31','2019-04-30','2019-04-30'],
    'factor b': [55,34,28,17,95,98,61,14,87],
    'Results_b': [2,3,1,4,2,1,4,1,4]
})
# c
df_c = pd.DataFrame({
    'Company name': ['A','B','C','A','D'],
    'Period': ['2019-01-31','2019-01-31','2019-01-31','2019-04-30','2019-04-30'],
    'factor c': [27,63,18,23,89],
    'Results_c': [2,1,3,4,1],
})
You can also use merge, then update the values of column "Results_c" in a loop where required:
# Merge the data sets
from functools import reduce

data_frames = [df_a, df_b, df_c]
df_result = reduce(lambda left, right: pd.merge(left, right,
                                                on=['Company name', 'Period'],
                                                how='left'),
                   data_frames)
df_result = df_result[[col for col in df_result.columns if "factor" not in col]]

# Update Results_c values if they are all NaN for the month
lst_period = sorted(df_result["Period"].unique())
for i in range(len(lst_period)):
    df_temp = df_result[df_result["Period"] == lst_period[i]]
    if df_temp["Results_c"].isna().sum() == 3:  # edit this number to your company count; 3 here because of A, B, C
        lst_val = df_result[df_result["Period"] == lst_period[i - 1]]["Results_c"]
        df_result.loc[df_result["Period"] == lst_period[i], "Results_c"] = list(lst_val)
Hope this Helps...
Company name Period Results_a Results_b Results_c
0 A 2019-01-31 1 2.0 2.0
1 B 2019-01-31 4 3.0 1.0
2 C 2019-01-31 2 1.0 3.0
3 A 2019-02-28 3 4.0 2.0
4 B 2019-02-28 4 2.0 1.0
5 C 2019-02-28 1 NaN 3.0
6 A 2019-03-31 2 1.0 2.0
7 B 2019-03-31 3 NaN 1.0
8 C 2019-03-31 3 NaN 3.0
9 A 2019-04-30 1 NaN 4.0
10 B 2019-04-30 2 1.0 NaN
11 D 2019-04-30 4 NaN 1.0
Use concat:
dfs = [df_a, df_b, df_c]
out = pd.concat([df.set_index(['Company name', 'Period'])
                   .filter(like='Results_')
                 for df in dfs],
                axis=1).reset_index()
NB. to mimic the left merge, you can add .dropna(subset=['Results_a']).
output:
Company name Period Results_a Results_b Results_c
0 A 2019-01-31 1.0 2.0 2.0
1 B 2019-01-31 4.0 3.0 1.0
2 C 2019-01-31 2.0 1.0 3.0
3 A 2019-02-28 3.0 4.0 NaN
4 B 2019-02-28 4.0 2.0 NaN
5 C 2019-02-28 1.0 NaN NaN
6 A 2019-03-31 2.0 1.0 NaN
7 B 2019-03-31 3.0 NaN NaN
8 C 2019-03-31 3.0 NaN NaN
9 A 2019-04-30 1.0 NaN 4.0
10 B 2019-04-30 2.0 1.0 NaN
11 D 2019-04-30 4.0 NaN 1.0
12 D 2019-03-31 NaN 4.0 NaN
13 C 2019-04-30 NaN 4.0 NaN
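For the quarterly Results_c carry-forward, a per-company groupby + ffill is an alternative to the period-counting loop above. A sketch on the question's data (factor columns omitted for brevity); note it fills per company rather than per period, so it also fills B at 2019-04-30, where the loop above left NaN:

```python
from functools import reduce
import pandas as pd

df_a = pd.DataFrame({
    'Company name': ['A','B','C','A','B','C','A','B','C','A','B','D'],
    'Period': ['2019-01-31','2019-01-31','2019-01-31','2019-02-28','2019-02-28','2019-02-28',
               '2019-03-31','2019-03-31','2019-03-31','2019-04-30','2019-04-30','2019-04-30'],
    'Results_a': [1,4,2,3,4,1,2,3,3,1,2,4]
})
df_b = pd.DataFrame({
    'Company name': ['A','B','C','A','B','A','D','B','C'],
    'Period': ['2019-01-31','2019-01-31','2019-01-31','2019-02-28','2019-02-28',
               '2019-03-31','2019-03-31','2019-04-30','2019-04-30'],
    'Results_b': [2,3,1,4,2,1,4,1,4]
})
df_c = pd.DataFrame({
    'Company name': ['A','B','C','A','D'],
    'Period': ['2019-01-31','2019-01-31','2019-01-31','2019-04-30','2019-04-30'],
    'Results_c': [2,1,3,4,1]
})

# Left-merge the Results_* columns onto df_a
out = reduce(lambda l, r: pd.merge(l, r, on=['Company name', 'Period'], how='left'),
             [df_a, df_b, df_c])

# Forward-fill Results_c within each company, in period order
out = out.sort_values(['Company name', 'Period'])
out['Results_c'] = out.groupby('Company name')['Results_c'].ffill()
out = out.sort_index()  # restore df_a's row order
```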

Conditional pairwise calculations in pandas

For example, I have 2 dfs:
df1
ID,col1,col2
1,5,9
2,6,3
3,7,2
4,8,5
and another df is
df2
ID,col1,col2
1,11,9
2,12,7
3,13,2
I want to calculate the pairwise subtraction of col1 between df2 and df1. I am using scipy.spatial.distance.cdist with a function subtract_:
from scipy.spatial.distance import cdist

def subtract_(a, b):
    return abs(a - b)

d1_s = df1[['col1']]
d2_s = df2[['col1']]
dist = cdist(d1_s, d2_s, metric=subtract_)
dist_df = pd.DataFrame(dist, columns=d2_s.values.ravel())
print(dist_df)
11 12 13
6.0 7.0 8.0
5.0 6.0 7.0
4.0 5.0 6.0
3.0 4.0 5.0
Now I want to check these new columns, named 11, 12 and 13: wherever a value in this new dataframe is less than 5, I want to do a further calculation, like this.
For example, in column '11' the value less than 5 is 4, at row 3. In that case I want to take 'col2' of df1 at row 3 (here the value 2) and subtract from it df2's 'col2' at row 1 (because column '11' came from the value at row 1 of df2).
My for loop is so complex for this. It would be great, if there would be some easier way in pandas.
Any help, suggestions would be great.
The expected new dataframe is this
0,1,2
Nan,Nan,Nan
Nan,Nan,Nan
(2-9)=-7,Nan,Nan
(5-9)=-4,(5-7)=-2,Nan
Similar to Ben's answer, but with np.where:
pd.DataFrame(np.where(dist_df < 5, df1.col2.values[:, None] - df2.col2.values, np.nan),
             index=dist_df.index,
             columns=dist_df.columns)
Output:
11 12 13
0 NaN NaN NaN
1 NaN NaN NaN
2 -7.0 NaN NaN
3 -4.0 -2.0 NaN
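The cdist step is plain broadcasting, so the whole pipeline can be done with numpy alone; a self-contained sketch of the same logic, no scipy needed:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3, 4], 'col1': [5, 6, 7, 8], 'col2': [9, 3, 2, 5]})
df2 = pd.DataFrame({'ID': [1, 2, 3], 'col1': [11, 12, 13], 'col2': [9, 7, 2]})

# |col1_i - col1_j| for every pair: (4, 1) against (1, 3) broadcasts to (4, 3)
dist = np.abs(df1['col1'].to_numpy()[:, None] - df2['col1'].to_numpy())
dist_df = pd.DataFrame(dist, columns=df2['col1'])

# Where a distance is < 5, take df1.col2 - df2.col2 for that pair, else NaN
out = pd.DataFrame(np.where(dist < 5,
                            df1['col2'].to_numpy()[:, None] - df2['col2'].to_numpy(),
                            np.nan),
                   columns=df2['col1'])
```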
In your case, using numpy with mask (dist_df is the distance frame built above):
dist_df.mask(dist_df < 5, dist_df - (df1.col2.values[:, None] + df2.col2.values))
Out[115]:
11 12 13
0 6.0 7.0 8.0
1 5.0 6.0 7.0
2 -7.0 5.0 6.0
3 -11.0 -8.0 5.0
Update
Newdf = (dist_df - (-df1.col2.values[:, None] + df2.col2.values) - dist_df).where(dist_df < 5)
Out[148]:
11 12 13
0 NaN NaN NaN
1 NaN NaN NaN
2 -7.0 NaN NaN
3 -4.0 -2.0 NaN

pandas:numeric columns fillna with mean and character columns fillna with mode

I know how to select all the numeric columns and fillna with mean,but how to make numeric columns fillna with mean and character columns fillna with mode?
Use select_dtypes to get the numeric columns and fill them with the mean, then get the non-numeric columns with difference and fill them with the mode, join the two fill Series together with append, and finally call fillna.
Notice (thanks #jpp):
mode can return multiple values, so select the first one with iloc[0].
df = pd.DataFrame({
    'A': list('ebcded'),
    'B': [np.nan, np.nan, 4, 5, 5, 4],
    'C': [7, np.nan, 9, 4, 2, 3],
    'D': [1, 3, 5, np.nan, 1, 0],
    'F': list('aaabbb')
})
df.loc[[0, 1], 'F'] = np.nan
df.loc[[2, 1], 'A'] = np.nan
print (df)
A B C D F
0 e NaN 7.0 1.0 NaN
1 NaN NaN NaN 3.0 NaN
2 NaN 4.0 9.0 5.0 a
3 d 5.0 4.0 NaN b
4 e 5.0 2.0 1.0 b
5 d 4.0 3.0 0.0 b
a = df.select_dtypes(np.number).mean()
b = df[df.columns.difference(a.index)].mode().iloc[0]
#alternative
#b = df.select_dtypes(object).mode().iloc[0]
print (df[df.columns.difference(a.index)].mode())
A F
0 d b
1 e NaN
df = df.fillna(a.append(b))
print (df)
A B C D F
0 e 4.5 7.0 1.0 b
1 d 4.5 5.0 3.0 b
2 d 4.0 9.0 5.0 a
3 d 5.0 4.0 2.0 b
4 e 5.0 2.0 1.0 b
5 d 4.0 3.0 0.0 b
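One caveat for current pandas: Series.append was removed in pandas 2.0, so on recent versions the same combined fill values can be built with pd.concat (same logic as the answer above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['e', np.nan, np.nan, 'd', 'e', 'd'],
    'B': [np.nan, np.nan, 4, 5, 5, 4],
    'C': [7, np.nan, 9, 4, 2, 3],
    'D': [1, 3, 5, np.nan, 1, 0],
    'F': [np.nan, np.nan, 'a', 'b', 'b', 'b'],
})

num_means = df.select_dtypes(np.number).mean()   # per-column means for numeric columns
obj_modes = df.select_dtypes(object).mode().iloc[0]  # first mode for object columns

# One Series of fill values: column label -> mean (numeric) or mode (object)
df_filled = df.fillna(pd.concat([num_means, obj_modes]))
```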

Convert non-numbers in dataframe to NaN (numpy)?

How to convert non-numbers in DataFrame to NaN (numpy)? For example, here is a DataFrame:
a b
--------
10 ...
4 5
... 6
How to convert it to:
a b
--------
10 NaN
4 5
NaN 6
IIUC you can just do
df = df.apply(lambda x: pd.to_numeric(x, errors='coerce') )
This will force the non-numeric values to NaN; note that the presence of NaN changes the dtype to float, since NaN cannot be represented by int.
In [6]:
df = df.apply(pd.to_numeric, errors='coerce')
df
Out[6]:
a b
0 10.0 NaN
1 4.0 5.0
2 NaN 6.0
The lambda isn't necessary but it's more readable IMO
You can also stack and then unstack the dataframe:
pd.to_numeric(df.stack(), errors='coerce').unstack()
a b
0 10.0 NaN
1 4.0 5.0
2 NaN 6.0
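If you want to keep integer values despite the NaNs, the nullable Int64 dtype (available since pandas 0.24) avoids the float upcast:

```python
import pandas as pd

df = pd.DataFrame({'a': ['10', '4', '...'], 'b': ['...', '5', '6']})

# Coerce non-numbers to missing, then use the nullable integer dtype
out = df.apply(pd.to_numeric, errors='coerce').astype('Int64')
```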

Delete rows in dataframe based on column values

I need to rid myself of all rows with a null value in column C. Here is the code:
infile="C:\****"
df=pd.read_csv(infile)
A B C D
1 1 NaN 3
2 3 7 NaN
4 5 NaN 8
5 NaN 4 9
NaN 1 2 NaN
There are two basic methods I have attempted.
method 1:
source: How to drop rows of Pandas DataFrame whose value in certain columns is NaN
df.dropna()
The result is an empty dataframe, which makes sense because there is a NaN value in every row.
df.dropna(subset=[3])
For this method I tried to play around with the subset value using both column index number and column name. The dataframe is still empty.
method 2:
source: Deleting DataFrame row in Pandas based on column value
df = df[df.C.notnull()]
Still results in an empty dataframe!
What am I doing wrong?
df = pd.DataFrame([[1, 1, np.nan, 3],
                   [2, 3, 7, np.nan],
                   [4, 5, np.nan, 8],
                   [5, np.nan, 4, 9],
                   [np.nan, 1, 2, np.nan]],
                  columns=['A', 'B', 'C', 'D'])
df = df[df['C'].notnull()]
df
Just a proof that your method 2 works properly (at least with pandas 0.18.0):
In [100]: df
Out[100]:
A B C D
0 1.0 1.0 NaN 3.0
1 2.0 3.0 7.0 NaN
2 4.0 5.0 NaN 8.0
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
In [101]: df.dropna(subset=['C'])
Out[101]:
A B C D
1 2.0 3.0 7.0 NaN
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
In [102]: df[df.C.notnull()]
Out[102]:
A B C D
1 2.0 3.0 7.0 NaN
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
In [103]: df = df[df.C.notnull()]
In [104]: df
Out[104]:
A B C D
1 2.0 3.0 7.0 NaN
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
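One likely culprit for df.dropna(subset=[3]) failing: subset expects column labels (like 'C'), not integer positions. A minimal sketch showing that both working forms agree on the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1, np.nan, 3],
                   [2, 3, 7, np.nan],
                   [4, 5, np.nan, 8],
                   [5, np.nan, 4, 9],
                   [np.nan, 1, 2, np.nan]],
                  columns=['A', 'B', 'C', 'D'])

# subset takes column labels; rows 0 and 2 (NaN in C) are dropped
kept = df.dropna(subset=['C'])

# Boolean-mask equivalent
kept2 = df[df['C'].notnull()]
```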
