I am trying to merge columns from three dataframes based on two conditions. I have three dataframes called df_a, df_b and df_c (code to reproduce them is at the end of this question).
I want to merge the column Results_b from df_b into df_a where the company and the period match. I would also like to drop the columns factor a and factor b.
I tried df_merged = pd.merge(df_a, df_b, on=['Company name', 'Period'], how='left') for merging df_a and df_b, and it works, but I am not sure how to keep only the Results_a and Results_b columns instead of merging all columns.
Lastly, I would also like to merge the column Results_c from df_c where the company and the period match. However, df_c is quarterly (every 3 months) while df_a and df_b are monthly, so for the months that are missing from df_c I would like to carry the previous available value forward. I am not sure how to deal with this.
This is the outcome that I would like to see:
It would be really appreciated if someone could help me! Thanks a lot.
To reproduce the dataframes:
import pandas as pd

# a
df_a = pd.DataFrame({
    'Company name': ['A','B','C','A','B','C','A','B','C','A','B','D'],
    'Period': ['2019-01-31','2019-01-31','2019-01-31','2019-02-28','2019-02-28','2019-02-28','2019-03-31','2019-03-31','2019-03-31','2019-04-30','2019-04-30','2019-04-30'],
    'factor a': [37,41,64,52,97,10,55,47,52,61,59,70],
    'Results_a': [1,4,2,3,4,1,2,3,3,1,2,4]
})
# b
df_b = pd.DataFrame({
    'Company name': ['A','B','C','A','B','A','D','B','C'],
    'Period': ['2019-01-31','2019-01-31','2019-01-31','2019-02-28','2019-02-28','2019-03-31','2019-03-31','2019-04-30','2019-04-30'],
    'factor b': [55,34,28,17,95,98,61,14,87],
    'Results_b': [2,3,1,4,2,1,4,1,4]
})
# c
df_c = pd.DataFrame({
    'Company name': ['A','B','C','A','D'],
    'Period': ['2019-01-31','2019-01-31','2019-01-31','2019-04-30','2019-04-30'],
    'factor c': [27,63,18,23,89],
    'Results_c': [2,1,3,4,1],
})
You can also use merge and then update the values of the "Results_c" column in a loop, if required:
from functools import reduce

# Merge the data sets on company and period
data_frames = [df_a, df_b, df_c]
df_result = reduce(lambda left, right: pd.merge(left, right, on=['Company name', 'Period'],
                                                how='left'), data_frames)
# Drop the factor columns
df_result = df_result[[col for col in df_result.columns if "factor" not in col]]

# Update Results_c values if they are NaN for the whole month
lst_period = sorted(df_result["Period"].unique())
for i in range(len(lst_period)):
    df_temp = df_result[df_result["Period"] == lst_period[i]]
    if df_temp["Results_c"].isna().sum() == 3:  # edit this number to match your company count; 3 here because of A, B, C
        lst_val = df_result[df_result["Period"] == lst_period[i - 1]]["Results_c"]
        df_result.loc[df_result["Period"] == lst_period[i], "Results_c"] = list(lst_val)
Hope this helps. The output:
Company name Period Results_a Results_b Results_c
0 A 2019-01-31 1 2.0 2.0
1 B 2019-01-31 4 3.0 1.0
2 C 2019-01-31 2 1.0 3.0
3 A 2019-02-28 3 4.0 2.0
4 B 2019-02-28 4 2.0 1.0
5 C 2019-02-28 1 NaN 3.0
6 A 2019-03-31 2 1.0 2.0
7 B 2019-03-31 3 NaN 1.0
8 C 2019-03-31 3 NaN 3.0
9 A 2019-04-30 1 NaN 4.0
10 B 2019-04-30 2 1.0 NaN
11 D 2019-04-30 4 NaN 1.0
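For reference, pandas also ships merge_asof, which does this kind of "latest available value" join directly instead of a loop; a minimal sketch (not part of the answer above), assuming Period is first converted to real datetimes:

# keep only the needed columns, then merge the monthly frames
a = df_a.drop(columns='factor a')
b = df_b[['Company name', 'Period', 'Results_b']]
m = pd.merge(a, b, on=['Company name', 'Period'], how='left')
m['Period'] = pd.to_datetime(m['Period'])

c = df_c[['Company name', 'Period', 'Results_c']].copy()
c['Period'] = pd.to_datetime(c['Period'])

# for each row, take the latest quarterly Results_c at or before that Period
out = pd.merge_asof(m.sort_values('Period'), c.sort_values('Period'),
                    on='Period', by='Company name', direction='backward')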
Use concat:
dfs = [df_a, df_b, df_c]
out = pd.concat([df.set_index(['Company name', 'Period'])
.filter(like='Results_')
for df in dfs],
axis=1).reset_index()
NB: to mimic the left merge, you can add .dropna(subset='Results_a').
Output:
Company name Period Results_a Results_b Results_c
0 A 2019-01-31 1.0 2.0 2.0
1 B 2019-01-31 4.0 3.0 1.0
2 C 2019-01-31 2.0 1.0 3.0
3 A 2019-02-28 3.0 4.0 NaN
4 B 2019-02-28 4.0 2.0 NaN
5 C 2019-02-28 1.0 NaN NaN
6 A 2019-03-31 2.0 1.0 NaN
7 B 2019-03-31 3.0 NaN NaN
8 C 2019-03-31 3.0 NaN NaN
9 A 2019-04-30 1.0 NaN 4.0
10 B 2019-04-30 2.0 1.0 NaN
11 D 2019-04-30 4.0 NaN 1.0
12 D 2019-03-31 NaN 4.0 NaN
13 C 2019-04-30 NaN 4.0 NaN
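Neither output above carries the quarterly Results_c forward for the in-between months; a minimal follow-up sketch, assuming the ISO-formatted Period strings sort chronologically (they do):

# forward-fill Results_c within each company, in chronological order
out = out.sort_values(['Company name', 'Period'])
out['Results_c'] = out.groupby('Company name')['Results_c'].ffill()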
In a pandas dataframe, I want to transpose and group datetime columns into rows.
Like this (there are about 12 date columns):
Category Type 11/2021 12/2021
0 A 1 0.0 20
1 A 2 NaN 13
2 B 1 5.0 7
3 B 2 20.0 4
to one like this:
Date Category Type1 Type2
0 2021-11 A 0 NaN
1 2021-11 B 5 20.0
2 2021-12 A 20 13.0
3 2021-12 B 7 4.0
I thought about using pivot tables, but I wasn't able to make it work.
You could do:
(df.melt(['Category', 'Type'], var_name='Date')
   .pivot(index=['Date', 'Category'], columns='Type')
   .reset_index())
Date Category value
Type 1 2
0 11/2021 A 0.0 NaN
1 11/2021 B 5.0 20.0
2 12/2021 A 20.0 13.0
3 12/2021 B 7.0 4.0
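The result above has MultiIndex columns such as ('value', 1); a small sketch of one way to flatten them into the Type1/Type2 headers from the question (this rename step is my addition, not part of the original answer):

res = (df.melt(['Category', 'Type'], var_name='Date')
         .pivot(index=['Date', 'Category'], columns='Type')
         .reset_index())
# flatten ('value', 1) -> 'Type1'; the plain index columns keep their names
res.columns = [f'Type{b}' if a == 'value' else a for a, b in res.columns]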
To be a little cleaner you could use janitor:
import janitor
(df.pivot_longer(['Category', 'Type'], names_to='Date', values_to='type')
   .pivot_wider(['Date', 'Category'], names_from='Type', names_sep=''))
Date Category type1 type2
0 11/2021 A 0.0 NaN
1 11/2021 B 5.0 20.0
2 12/2021 A 20.0 13.0
3 12/2021 B 7.0 4.0
Another solution:
x = (
df.set_index(["Category", "Type"])
.stack()
.unstack("Type")
.add_prefix("Type")
.reset_index()
)
x = x.rename(columns={"level_1": "Date"})
x.columns.name = None
print(x)
Prints:
Category Date Type1 Type2
0 A 11/2021 0.0 NaN
1 A 12/2021 20.0 13.0
2 B 11/2021 5.0 20.0
3 B 12/2021 7.0 4.0
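All three solutions leave Date as strings such as 11/2021; since the question mentions datetime columns, a small conversion sketch (using x from the last solution; pd.to_datetime and to_period are standard pandas):

# parse '11/2021'-style strings into monthly periods so they sort chronologically
x['Date'] = pd.to_datetime(x['Date'], format='%m/%Y').dt.to_period('M')
x = x.sort_values(['Date', 'Category']).reset_index(drop=True)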
I have a dictionary of the form:
data = {'A': [(1,2),(3,4),(5,6),(7,8),(8,9)],
        'B': [(3,4),(4,5),(5,6),(6,7)],
        'C': [(10,11),(12,13)]}
I create a DataFrame by:
df = pd.DataFrame({k: pd.Series(v) for k, v in data.items()})
which in turn becomes:
A B C
(1,2) (3,4) (10,11)
(3,4) (4,5) (12,13)
(5,6) (5,6) NaN
(7,8) (6,7) NaN
(8,9) NaN NaN
Is there a way to go from the dataframe above to the one below:
A B C
one two one two one two
1 2 3 4 10 11
3 4 4 5 12 13
5 6 5 6 NaN NaN
7 8 6 7 NaN NaN
8 9 NaN NaN NaN NaN
You can use a list comprehension with the DataFrame constructor, converting each column to a list of tuples via dropna + values + tolist, and then concat:
cols = ['A','B','C']
# drop the NaN padding in each column before splitting the tuples
L = [pd.DataFrame(df[x].dropna().values.tolist(), columns=['one','two']) for x in cols]
df = pd.concat(L, axis=1, keys=cols)
print (df)
     A        B          C
   one two  one  two   one   two
0    1   2  3.0  4.0  10.0  11.0
1    3   4  4.0  5.0  12.0  13.0
2    5   6  5.0  6.0   NaN   NaN
3    7   8  6.0  7.0   NaN   NaN
4    8   9  NaN  NaN   NaN   NaN
EDIT:
Similar solution with a dict comprehension; the integer values are converted to floats because the type of NaN is float too.
data = {'A':[(1,2),(3,4),(5,6),(7,8),(8,9)],
'B':[(3,4),(4,5),(5,6),(6,7)],
'C':[(10,11),(12,13)]}
cols = ['A','B','C']
d = {k: pd.DataFrame(v, columns=['one','two']) for k,v in data.items()}
df = pd.concat(d, axis=1)
print (df)
A B C
one two one two one two
0 1 2 3.0 4.0 10.0 11.0
1 3 4 4.0 5.0 12.0 13.0
2 5 6 5.0 6.0 NaN NaN
3 7 8 6.0 7.0 NaN NaN
4 8 9 NaN NaN NaN NaN
EDIT:
To multiply by one column, it is possible to use slicers:
s = df[('A', 'one')]
print (s)
0 1
1 3
2 5
3 7
4 8
Name: (A, one), dtype: int64
df.loc(axis=1)[:, 'one'] = df.loc(axis=1)[:, 'one'].mul(s, axis=0)
print (df)
A B C
one two one two one two
0 1.0 2 3.0 4.0 10.0 11.0
1 9.0 4 12.0 5.0 36.0 13.0
2 25.0 6 25.0 6.0 NaN NaN
3 49.0 8 42.0 7.0 NaN NaN
4 64.0 9 NaN NaN NaN NaN
Another solution:
idx = pd.IndexSlice
df.loc[:, idx[:, 'one']] = df.loc[:, idx[:, 'one']].mul(s, axis=0)
print (df)
A B C
one two one two one two
0 1.0 2 3.0 4.0 10.0 11.0
1 9.0 4 12.0 5.0 36.0 13.0
2 25.0 6 25.0 6.0 NaN NaN
3 49.0 8 42.0 7.0 NaN NaN
4 64.0 9 NaN NaN NaN NaN
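The same slicers also work for plain selection; a short illustration (my addition):

idx = pd.IndexSlice
ones = df.loc[:, idx[:, 'one']]            # every ('X', 'one') column, MultiIndex kept
ones_flat = df.xs('one', axis=1, level=1)  # same columns with the 'one' level dropped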
I have a number of similar dataframes where I would like to standardize the NaNs across all of them. For instance, if a NaN exists in df1.loc[0,'a'], then ALL other dataframes should be set to NaN at the same index location.
I am aware that I could combine the dataframes into one big multiindexed dataframe, but sometimes I find it easier to work with a group of dataframes of the same structure.
Here is an example:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.reshape(np.arange(12), (4,3)), columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.reshape(np.arange(12), (4,3)), columns=['a', 'b', 'c'])
df3 = pd.DataFrame(np.reshape(np.arange(12), (4,3)), columns=['a', 'b', 'c'])
df1.loc[3,'a'] = np.nan
df2.loc[1,'b'] = np.nan
df3.loc[0,'c'] = np.nan
print(df1)
print(' ')
print(df2)
print(' ')
print(df3)
Output:
a b c
0 0.0 1 2
1 3.0 4 5
2 6.0 7 8
3 NaN 10 11
a b c
0 0 1.0 2
1 3 NaN 5
2 6 7.0 8
3 9 10.0 11
a b c
0 0 1 NaN
1 3 4 5.0
2 6 7 8.0
3 9 10 11.0
However, I would like df1, df2 and df3 to have NaNs in the same locations:
print(df1)
a b c
0 0.0 1.0 NaN
1 3.0 NaN 5.0
2 6.0 7.0 8.0
3 NaN 10.0 11.0
Using the answer provided by piRSquared, I was able to extend it for dataframes of different sizes. Here is the function:
def set_nans_over_every_df(df_list):
    # Find the union of index and column values across the dfs
    complete_index = sorted(set([idx for df in df_list for idx in df.index]))
    complete_columns = sorted(set([col for df in df_list for col in df.columns]))
    # Ensure that every df has the same indexes and columns
    df_list = [df.reindex(index=complete_index, columns=complete_columns) for df in df_list]
    # Find the nans in each df and set nans in every other df at the same location
    mask = np.isnan(np.stack([df.values for df in df_list])).any(0)
    df_list = [df.mask(mask) for df in df_list]
    return df_list
And an example using differently sized dataframes:
df1 = pd.DataFrame(np.reshape(np.arange(15), (5,3)), index=[0,1,2,3,4], columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.reshape(np.arange(12), (4,3)), index=[0,1,2,3], columns=['a', 'b', 'c'])
df3 = pd.DataFrame(np.reshape(np.arange(16), (4,4)), index=[0,1,2,3], columns=['a', 'b', 'c', 'd'])
df1.loc[3,'a'] = np.nan
df2.loc[1,'b'] = np.nan
df3.loc[0,'c'] = np.nan
df1, df2, df3 = set_nans_over_every_df([df1, df2, df3])
print(df1)
a b c d
0 0.0 1.0 NaN NaN
1 3.0 NaN 5.0 NaN
2 6.0 7.0 8.0 NaN
3 NaN 10.0 11.0 NaN
4 NaN NaN NaN NaN
I'd set up a mask in numpy, then use this mask in the pd.DataFrame.mask method:
mask = np.isnan(np.stack([d.values for d in [df1, df2, df3]])).any(0)
print(df1.mask(mask))
a b c
0 0.0 1.0 NaN
1 3.0 NaN 5.0
2 6.0 7.0 8.0
3 NaN 10.0 11.0
print(df2.mask(mask))
a b c
0 0.0 1.0 NaN
1 3.0 NaN 5.0
2 6.0 7.0 8.0
3 NaN 10.0 11.0
print(df3.mask(mask))
a b c
0 0.0 1.0 NaN
1 3.0 NaN 5.0
2 6.0 7.0 8.0
3 NaN 10.0 11.0
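A minimal sketch to apply the same mask to any number of frames in one pass (my addition; it assumes, as above, numeric frames of identical shape and index):

dfs = [df1, df2, df3]
# True wherever any frame has NaN at that position
mask = np.isnan(np.stack([d.values for d in dfs])).any(0)
df1, df2, df3 = (d.mask(mask) for d in dfs)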
You can create a mask and then apply it to all dataframes:
mask = df1.notnull() & df2.notnull() & df3.notnull()
print (mask)
a b c
0 True True False
1 True False True
2 True True True
3 False True True
You can also build the mask dynamically with reduce:
import functools
masks = [df1.notnull(),df2.notnull(),df3.notnull()]
mask = functools.reduce(lambda x,y: x & y, masks)
print (mask)
a b c
0 True True False
1 True False True
2 True True True
3 False True True
print (df1[mask])
a b c
0 0.0 1.0 NaN
1 3.0 NaN 5.0
2 6.0 7.0 8.0
3 NaN 10.0 11.0
print (df2[mask])
a b c
0 0.0 1.0 NaN
1 3.0 NaN 5.0
2 6.0 7.0 8.0
3 NaN 10.0 11.0
print (df3[mask])
a b c
0 0.0 1.0 NaN
1 3.0 NaN 5.0
2 6.0 7.0 8.0
3 NaN 10.0 11.0
Assuming that all your DFs are of the same shape and have the same indexes:
In [196]: df2[df1.isnull()] = df3[df1.isnull()] = np.nan
In [197]: df1[df3.isnull()] = df2[df3.isnull()] = np.nan
In [198]: df1[df2.isnull()] = df3[df2.isnull()] = np.nan
In [199]: df1
Out[199]:
a b c
0 0.0 1.0 NaN
1 3.0 NaN 5.0
2 6.0 7.0 8.0
3 NaN 10.0 11.0
In [200]: df2
Out[200]:
a b c
0 0.0 1.0 NaN
1 3.0 NaN 5.0
2 6.0 7.0 8.0
3 NaN 10.0 11.0
In [201]: df3
Out[201]:
a b c
0 0.0 1.0 NaN
1 3.0 NaN 5.0
2 6.0 7.0 8.0
3 NaN 10.0 11.0
One simple method is to add the DataFrames together and multiply the result by 0; adding this all-NaN/zero DataFrame back to each of the others then propagates every NaN (this assumes the columns are numeric):
df_zero = (df1 + df2 + df3) * 0
df1 + df_zero
df2 + df_zero
df3 + df_zero
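Note the sums need to be assigned back for the frames to actually change; a usage sketch (the trick relies on arithmetic propagating NaN, hence the numeric-columns assumption):

df_zero = (df1 + df2 + df3) * 0          # NaN wherever any frame had NaN, 0.0 elsewhere
df1, df2, df3 = df1 + df_zero, df2 + df_zero, df3 + df_zero
# all three frames now share the same NaN pattern
assert df1.isna().equals(df2.isna()) and df2.isna().equals(df3.isna())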
I need to rid myself of all rows with a null value in column C. Here is the code:
infile="C:\****"
df=pd.read_csv(infile)
A B C D
1 1 NaN 3
2 3 7 NaN
4 5 NaN 8
5 NaN 4 9
NaN 1 2 NaN
There are two basic methods I have attempted.
method 1:
source: How to drop rows of Pandas DataFrame whose value in certain columns is NaN
df.dropna()
The result is an empty dataframe, which makes sense because there is a NaN value in every row.
df.dropna(subset=[3])
For this method I tried to play around with the subset value using both column index number and column name. The dataframe is still empty.
method 2:
source: Deleting DataFrame row in Pandas based on column value
df = df[df.C.notnull()]
Still results in an empty dataframe!
What am I doing wrong?
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,1,np.nan,3],[2,3,7,np.nan],[4,5,np.nan,8],[5,np.nan,4,9],[np.nan,1,2,np.nan]], columns=['A','B','C','D'])
df = df[df['C'].notnull()]
df
It's just a proof that your method 2 works properly (at least with pandas 0.18.0):
In [100]: df
Out[100]:
A B C D
0 1.0 1.0 NaN 3.0
1 2.0 3.0 7.0 NaN
2 4.0 5.0 NaN 8.0
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
In [101]: df.dropna(subset=['C'])
Out[101]:
A B C D
1 2.0 3.0 7.0 NaN
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
In [102]: df[df.C.notnull()]
Out[102]:
A B C D
1 2.0 3.0 7.0 NaN
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
In [103]: df = df[df.C.notnull()]
In [104]: df
Out[104]:
A B C D
1 2.0 3.0 7.0 NaN
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
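Given that both methods work on the sample, the empty result most likely means the real CSV is not parsed the way the printed sample suggests; a few hedged diagnostics to run on the loaded frame:

df = pd.read_csv(infile)
print(df.columns.tolist())            # stray whitespace in header names is a common culprit
print(df.dtypes)                      # column C parsed as object may hide non-numeric junk
print(df['C'].isna().sum(), len(df))  # if every value in C is NaN, the parse itself is off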