I have two dataframes loaded from a .csv file. One contains numeric values, the other the dates (month-year) on which those numeric values occurred; the dates and values map to each other position by position. I would like to combine/merge these dataframes so that the dates become the columns and the values the rows. However, as you can see, although the dates are ordered from left to right, they don't all start in the same month.
import pandas as pd
df1 = pd.DataFrame(
[
[1, 2, pd.NA, pd.NA, pd.NA],
[2, 3, 4, pd.NA, pd.NA],
[4, 5, 6, pd.NA, pd.NA],
[5, 6, 12, 14, 15]
]
)
df2 = pd.DataFrame(
[
["2021-01", "2021-02", pd.NA, pd.NA, pd.NA],
["2021-02", "2021-03", "2021-04", pd.NA, pd.NA],
["2022-03", "2022-04", "2022-05", pd.NA, pd.NA],
["2021-04", "2021-05", "2021-06", "2021-07", "2021-08"]
]
)
df1
df2
Although I managed to create the combined dataframe, df1 and df2 contain ~300k rows and the approach I came up with is rather slow. Is there a more efficient way of achieving the same result?
# Build a {row: {date: value}} mapping, skipping NA values
q = {z: {x: y for x, y in zip(df2.values[z], df1.values[z]) if not pd.isna(y)} for z in range(len(df2))}
df = pd.DataFrame.from_dict(q, orient='index')
# Sort the columns chronologically
idx = pd.to_datetime(df.columns, errors='coerce', format='%Y-%m').argsort()
df.iloc[:, idx]
df3 (result)
You can stack, concat and pivot:
(pd.concat([df1.stack(), df2.stack()], axis=1)
.reset_index(level=0)
.pivot(index='level_0', columns=1, values=0)
.rename_axis(index=None, columns=None)
)
Alternative with unstack:
(pd.concat([df1.stack(), df2.stack()], axis=1)
.droplevel(1).set_index(1, append=True)
[0].unstack(1)
.rename_axis(columns=None)
)
output:
2021-01 2021-02 2021-03 2021-04 2021-05 2021-06 2021-07 2021-08 2022-03 2022-04 2022-05
0 1 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN 2 3 4 NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN 4 5 6
3 NaN NaN NaN 5 6 12 14 15 NaN NaN NaN
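Note that pivot sorts the resulting columns lexically, which is already chronological here because the labels are zero-padded YYYY-MM strings. If your labels were not lexically sortable, you could reapply the datetime sort from the question to the pivoted result; a minimal sketch (the variable name out is only for illustration):
# Same pivot as above, assigned to a variable so the columns can be reordered.
out = (pd.concat([df1.stack(), df2.stack()], axis=1)
       .reset_index(level=0)
       .pivot(index='level_0', columns=1, values=0)
       .rename_axis(index=None, columns=None))
# Reorder the columns chronologically (only needed if lexical order differs).
order = pd.to_datetime(out.columns, format='%Y-%m').argsort()
out = out.iloc[:, order]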
Use concat with the keys parameter, so that after DataFrame.stack you can convert the MultiIndex to columns and reshape with DataFrame.pivot:
df = (pd.concat([df1, df2], axis=1, keys=['a','b'])
.stack()
.reset_index()
.pivot(index='level_0', columns='b', values='a'))
print (df)
b 2021-01 2021-02 2021-03 2021-04 2021-05 2021-06 2021-07 2021-08 \
level_0
0 1 2 NaN NaN NaN NaN NaN NaN
1 NaN 2 3 4 NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN 5 6 12 14 15
b 2022-03 2022-04 2022-05
level_0
0 NaN NaN NaN
1 NaN NaN NaN
2 4 5 6
3 NaN NaN NaN
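To see why this works, it can help to inspect the intermediate frame (a quick sketch; the name stacked is only for illustration): after concat with keys and stack, each surviving (row, position) pair holds the value under 'a' and its date under 'b', so pivoting on 'b' with values from 'a' spreads the values across date columns.
stacked = (pd.concat([df1, df2], axis=1, keys=['a', 'b'])
           .stack()
           .reset_index())
print(stacked.head())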
Related
I have a data frame that looks like this:
Col1          Cl2           C3
12/31/2018    9/30/2018     11/30/2018
1/31/2019     10/31/2018    4/30/2019
2/28/2019     11/30/2018    11/30/2020
And I am hoping to have this rearranged based on the row values so it turns to:
Col1          Cl2           C3
NaN           9/30/2018     NaN
NaN           10/31/2018    NaN
NaN           11/30/2018    11/30/2018
12/31/2018    NaN           NaN
1/31/2019     NaN           NaN
2/28/2019     NaN           NaN
NaN           NaN           4/30/2019
NaN           NaN           11/30/2020
From the above, note that values end up on the same row only when they share the same date; otherwise we fill with NaN. I was also hoping this idea could work for any number of columns, any number of rows, and any column names (i.e. a generic solution).
If it helps:
import numpy as np
import pandas as pd
# Create the pandas DataFrame
df1 = pd.DataFrame(['2018-12-31','2019-01-31','2019-02-28'], columns = ['Col1'])
df2 = pd.DataFrame(['2018-09-30','2018-10-31','2018-11-30'], columns = ['Cl2'])
df3 = pd.DataFrame(['2018-11-30','2019-04-30','2020-11-30'], columns = ['C3'])
data = {'Col1': [np.nan,np.nan,np.nan,'2018-12-31','2019-01-31','2019-02-28',np.nan,np.nan],
'Cl2': ['2018-09-30','2018-10-31','2018-11-30',np.nan,np.nan,np.nan,np.nan,np.nan],
'C3': [np.nan,np.nan,'2018-11-30',np.nan,np.nan,np.nan,'2019-04-30','2020-11-30']}
desired_df = pd.DataFrame(data)
desired_df
Note: This is somewhat similar to a question that I previously posted here
You can set the column as the index, then add a dummy column:
for df in [df1, df2, df3]:
    df.set_index(df.columns[0], inplace=True)
    df[df.index.name] = 1
print(df1)
Col1
Col1
2018-12-31 1
2019-01-31 1
2019-02-28 1
Then concatenate all the transformed dataframes and sort the index:
df = pd.concat([df1, df2, df3], axis=1).sort_index()
print(df)
Col1 Cl2 C3
2018-09-30 NaN 1.0 NaN
2018-10-31 NaN 1.0 NaN
2018-11-30 NaN 1.0 1.0
2018-12-31 1.0 NaN NaN
2019-01-31 1.0 NaN NaN
2019-02-28 1.0 NaN NaN
2019-04-30 NaN NaN 1.0
2020-11-30 NaN NaN 1.0
Finally, replace each 1 with the corresponding index value:
df = df.apply(lambda col: col.mask(col.eq(1), df.index), axis=0).reset_index(drop=True)
print(df)
Col1 Cl2 C3
0 NaN 2018-09-30 NaN
1 NaN 2018-10-31 NaN
2 NaN 2018-11-30 2018-11-30
3 2018-12-31 NaN NaN
4 2019-01-31 NaN NaN
5 2019-02-28 NaN NaN
6 NaN NaN 2019-04-30
7 NaN NaN 2020-11-30
With fewer lines:
df = pd.concat([df.set_index(df.columns[0]).assign(**{f'{df.columns[0]}': 1}) for df in [df1, df2, df3]], axis=1).sort_index()
df = df.apply(lambda col: col.mask(col.eq(1), df.index), axis=0).reset_index(drop=True)
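If you need this to be generic over any number of single-column dataframes (as the question asks), the same steps can be wrapped in a small helper; a sketch, where align_by_date is a hypothetical name and df1, df2, df3 are the freshly created frames (not the ones mutated by the in-place loop above):
def align_by_date(frames):
    # Turn each single-column frame into an indicator column indexed by its dates.
    parts = []
    for f in frames:
        name = f.columns[0]
        parts.append(f.set_index(name).assign(**{name: 1}))
    # Align on the union of all dates, then swap the 1s back for the dates.
    combined = pd.concat(parts, axis=1).sort_index()
    return (combined.apply(lambda col: col.mask(col.eq(1), combined.index))
                    .reset_index(drop=True))

result = align_by_date([df1, df2, df3])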
Below is my DataFrame, in which I want to create a column based on the other columns:
test = pd.DataFrame({"Year_2017": [np.nan, np.nan, np.nan, 4],
                     "Year_2018": [np.nan, np.nan, 3, np.nan],
                     "Year_2019": [np.nan, 2, np.nan, np.nan],
                     "Year_2020": [1, np.nan, np.nan, np.nan]})
Year_2017 Year_2018 Year_2019 Year_2020
0 NaN NaN NaN 1
1 NaN NaN 2 NaN
2 NaN 3 NaN NaN
3 4 NaN NaN NaN
The aim is to create a new column that takes the value from whichever column is notna().
Below is what I tried, without success:
test['Final'] = np.where(test.Year_2017.isna(), test.Year_2018,
                np.where(test.Year_2018.isna(), test.Year_2019,
                np.where(test.Year_2019.isna(), test.Year_2020, test.Year_2019)))
Year_2017 Year_2018 Year_2019 Year_2020 Final
0 NaN NaN NaN 1 NaN
1 NaN NaN 2 NaN NaN
2 NaN 3 NaN NaN 3
3 4 NaN NaN NaN NaN
The expected output:
Year_2017 Year_2018 Year_2019 Year_2020 Final
0 NaN NaN NaN 1 1
1 NaN NaN 2 NaN 2
2 NaN 3 NaN NaN 3
3 4 NaN NaN NaN 4
You can forward- or back-fill the missing values and then select the last or first column:
test['Final'] = test.ffill(axis=1).iloc[:, -1]
test['Final'] = test.bfill(axis=1).iloc[:, 0]
If there is only one non-missing value per row and the values are numeric, use:
test['Final'] = test.min(axis=1)
test['Final'] = test.max(axis=1)
test['Final'] = test.mean(axis=1)
test['Final'] = test.sum(axis=1, min_count=1)
If you only have a single non-NA value per row, you can use:
test['Final'] = test.max(axis=1)
(or other aggregators)
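Not from either answer above, but the same single-non-NA-per-row assumption also allows a stack-based sketch, since stack drops NaN by default (run on the original test frame, before any Final column has been added):
# Each row keeps exactly one value after stacking; drop the column level and align by row index.
test['Final'] = test.stack().droplevel(1)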
How can I use this line on a pandas DataFrame to drop the columns whose missing rate is over 90%?
This line shows every column and its missing rate:
percentage = (LoanStats_securev1_2018Q1.isnull().sum()/LoanStats_securev1_2018Q1.isnull().count()*100).sort_values(ascending = False)
Someone familiar with pandas please kindly help.
You can use dropna with a threshold:
newdf = df.dropna(axis=1, thresh=len(df)*0.1)
axis=1 targets columns, and thresh is the minimum number of non-NA values a column needs in order to be kept, so requiring at least 10% non-NA values drops the columns that are more than 90% missing.
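A quick toy check of that relationship (the column names here are made up): a column is kept only if at least 10% of its values are non-NA, i.e. it is dropped when more than 90% are missing.
import numpy as np
import pandas as pd

toy = pd.DataFrame({'mostly_full': range(10), 'all_missing': [np.nan] * 10})
print(toy.dropna(axis=1, thresh=len(toy) * 0.1).columns.tolist())  # ['mostly_full']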
I think you need boolean indexing with the mean of a boolean mask:
df = df.loc[:, df.isnull().mean() < .9]
Sample:
np.random.seed(2018)
df = pd.DataFrame(np.random.randn(20,3), columns=list('ABC'))
df.iloc[3:8,0] = np.nan
df.iloc[:-1,1] = np.nan
df.iloc[1:,2] = np.nan
print (df)
A B C
0 -0.276768 NaN 2.148399
1 -1.279487 NaN NaN
2 -0.142790 NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 -0.172797 NaN NaN
9 -1.604543 NaN NaN
10 -0.276501 NaN NaN
11 0.704780 NaN NaN
12 0.138125 NaN NaN
13 1.072796 NaN NaN
14 -0.803375 NaN NaN
15 0.047084 NaN NaN
16 -0.013434 NaN NaN
17 -1.580231 NaN NaN
18 -0.851835 NaN NaN
19 -0.148534 0.133759 NaN
print(df.isnull().mean())
A 0.25
B 0.95
C 0.95
dtype: float64
df = df.loc[:, df.isnull().mean() < .9]
print (df)
A
0 -0.276768
1 -1.279487
2 -0.142790
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 -0.172797
9 -1.604543
10 -0.276501
11 0.704780
12 0.138125
13 1.072796
14 -0.803375
15 0.047084
16 -0.013434
17 -1.580231
18 -0.851835
19 -0.148534
I have a dataframe that looks like the following. There are >=1 consecutive rows where y_l is populated and y_h is NaN and vice versa.
When there is more than one consecutive populated row between the NaNs, we only want to keep the one with the lowest y_l or the highest y_h.
e.g. in the df below, of the three consecutive rows where y_l is populated we would keep only the second one (y_l = 95) and discard the other two.
What would be a smart way to implement that?
df = pd.DataFrame({'y_l': [np.nan, 97, 95, 98, np.nan], 'y_h': [90, np.nan, np.nan, np.nan, 95]}, columns=['y_l', 'y_h'])
>>> df
y_l y_h
0 NaN 90.0
1 97.0 NaN
2 95.0 NaN
3 98.0 NaN
4 NaN 95
Desired result:
y_l y_h
0 NaN 90.0
1 95.0 NaN
2 NaN 95
You need to create a new column or Series that distinguishes each run of consecutive rows, then use groupby and aggregate with agg; finally, restore the column order with reindex:
a = df['y_l'].isnull()
b = a.ne(a.shift()).cumsum()
df = (df.groupby(b, as_index=False)
.agg({'y_l':'min', 'y_h':'max'})
.reindex(columns=['y_l','y_h']))
print (df)
y_l y_h
0 NaN 90.0
1 95.0 NaN
2 NaN 95.0
Detail:
print (b)
0 1
1 2
2 2
3 2
4 3
Name: y_l, dtype: int32
What if you had more columns?
for example
df = pd.DataFrame({'A': [np.nan, 15, 20, 25, np.nan], 'y_l': [np.nan, 97, 95, 98, np.nan], 'y_h': [90, np.nan, np.nan, np.nan, 95]}, columns=['A', 'y_l', 'y_h'])
>>> df
A y_l y_h
0 NaN NaN 90.0
1 15.0 97.0 NaN
2 20.0 95.0 NaN
3 25.0 98.0 NaN
4 NaN NaN 95.0
How could you keep the values in column A after filtering out the irrelevant rows as below?
A y_l y_h
0 NaN NaN 90.0
1 20.0 95.0 NaN
2 NaN NaN 95.0
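This follow-up isn't answered above, but one way to extend the same grouping trick while preserving extra columns like A is to select whole rows instead of aggregating column by column; a sketch under the same assumptions (within each run, keep the row with the lowest y_l, or the highest y_h):
a = df['y_l'].isnull()
b = a.ne(a.shift()).cumsum()

def pick(g):
    # In runs where y_l is populated keep its minimum row, otherwise keep the max y_h row.
    return g['y_l'].idxmin() if g['y_l'].notna().any() else g['y_h'].idxmax()

keep = df.groupby(b).apply(pick)
result = df.loc[keep].reset_index(drop=True)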
The issue below was produced with Python 2.7.11 and pandas 0.17.1.
When grouping a categorical column with both a period and date column, unexpected rows appear in the grouping. Is this a Pandas bug, or could it be something else?
df = pd.DataFrame({'date': pd.date_range('2015-12-29', '2016-1-3'),
'val1': [1] * 6,
'val2': range(6),
'cat1': ['a', 'b', 'c'] * 2,
'cat2': ['A', 'B', 'C'] * 2})
df['cat1'] = df.cat1.astype('category')
df['month'] = [d.to_period('M') for d in df.date]
>>> df
cat1 cat2 date val1 val2 month
0 a A 2015-12-29 1 0 2015-12
1 b B 2015-12-30 1 1 2015-12
2 c C 2015-12-31 1 2 2015-12
3 a A 2016-01-01 1 3 2016-01
4 b B 2016-01-02 1 4 2016-01
5 c C 2016-01-03 1 5 2016-01
Grouping the month and date with a regular series (e.g. cat2) works as expected:
>>> df.groupby(['month', 'date', 'cat2']).sum().unstack()
val1 val2
cat2 A B C A B C
month date
2015-12 2015-12-29 1 NaN NaN 0 NaN NaN
2015-12-30 NaN 1 NaN NaN 1 NaN
2015-12-31 NaN NaN 1 NaN NaN 2
2016-01 2016-01-01 1 NaN NaN 3 NaN NaN
2016-01-02 NaN 1 NaN NaN 4 NaN
2016-01-03 NaN NaN 1 NaN NaN 5
But grouping on a categorical produces unexpected results. You'll notice in the index that the extra dates do not correspond to the grouped month.
>>> df.groupby(['month', 'date', 'cat1']).sum().unstack()
val1 val2
cat1 a b c a b c
month date
2015-12 2015-12-29 1 NaN NaN 0 NaN NaN
2015-12-30 NaN 1 NaN NaN 1 NaN
2015-12-31 NaN NaN 1 NaN NaN 2
2016-01-01 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2016-01-02 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2016-01-03 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2016-01 2015-12-29 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2015-12-30 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2015-12-31 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2016-01-01 1 NaN NaN 3 NaN NaN
2016-01-02 NaN 1 NaN NaN 4 NaN
2016-01-03 NaN NaN 1 NaN NaN 5
Grouping the categorical by month periods or dates works fine, but not when both are combined as in the example above.
>>> df.groupby(['month', 'cat1']).sum().unstack()
val1 val2
cat1 a b c a b c
month
2015-12 1 1 1 0 1 2
2016-01 1 1 1 3 4 5
>>> df.groupby(['date', 'cat1']).sum().unstack()
val1 val2
cat1 a b c a b c
date
2015-12-29 1 NaN NaN 0 NaN NaN
2015-12-30 NaN 1 NaN NaN 1 NaN
2015-12-31 NaN NaN 1 NaN NaN 2
2016-01-01 1 NaN NaN 3 NaN NaN
2016-01-02 NaN 1 NaN NaN 4 NaN
2016-01-03 NaN NaN 1 NaN NaN 5
EDIT
This behavior originated in the 0.15.0 update. Prior to that, this was the output:
>>> df.groupby(['month', 'date', 'cat1']).sum().unstack()
val1 val2
cat1 a b c a b c
month date
2015-12 2015-12-29 1 NaN NaN 0 NaN NaN
2015-12-30 NaN 1 NaN NaN 1 NaN
2015-12-31 NaN NaN 1 NaN NaN 2
2016-01 2016-01-01 1 NaN NaN 3 NaN NaN
2016-01-02 NaN 1 NaN NaN 4 NaN
2016-01-03 NaN NaN 1 NaN NaN 5
By design in pandas, grouping on a categorical always includes the full set of categories, even if there is no data for a category; see, e.g., the doc example here.
You can either not use a categorical, or add a .dropna(how='all') after your grouping step.
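For illustration, a sketch of that workaround (in the pandas version from the question the extraneous rows are all-NaN after unstack, so they can simply be dropped); on newer pandas (0.23+) you can instead pass observed=True to groupby so unobserved category combinations are never created:
out = df.groupby(['month', 'date', 'cat1']).sum().unstack().dropna(how='all')

# Newer pandas: avoid expanding the categorical grouper to unobserved combinations at all.
out = df.groupby(['month', 'date', 'cat1'], observed=True).sum().unstack()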