The data looks like this:
df1 = 456089.0 456091.0 456093.0
5428709.0 1.0 1.0 NaN
5428711.0 1.0 1.0 NaN
5428713.0 NaN NaN 1.0
df2 = 456093.0 456095.0 456097.0
5428711.0 2.0 NaN NaN
5428713.0 NaN 2.0 NaN
5428715.0 NaN NaN 2.0
I would like to have this output:
df3 = 456089.0 456091.0 456093.0 456095.0 456097.0
5428709.0 1.0 1.0 NaN NaN NaN
5428711.0 1.0 1.0 2.0 NaN NaN
5428713.0 NaN NaN 1.0 2.0 NaN
5428715.0 NaN NaN NaN NaN 2.0
I tried several combinations of pd.merge, DataFrame.join and pd.concat, but nothing worked the way I wanted, since I need to combine the data by both index and column.
Does anyone have an idea how to do this? Thanks in advance!
Let us try sum with concat:
out = pd.concat([df1,df2]).sum(axis=1,level=0,min_count=1).sum(axis=0,level=0,min_count=1)
Out[150]:
456089.0 456091.0 456093.0 456095.0 456097.0
5428709.0 1.0 1.0 NaN NaN NaN
5428711.0 1.0 1.0 2.0 NaN NaN
5428713.0 NaN NaN 1.0 2.0 NaN
5428715.0 NaN NaN NaN NaN 2.0
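Note that the level argument of sum is deprecated in newer pandas versions. A minimal equivalent sketch (my assumption, not part of the original answer) collapses the duplicate index labels with groupby instead:
# Sketch for newer pandas: concat stacks the rows and outer-joins the columns,
# groupby(level=0) collapses duplicate index labels,
# min_count=1 keeps NaN where every input is NaN
out = pd.concat([df1, df2]).groupby(level=0).sum(min_count=1)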
I have a pandas dataframe as below. I want to rearrange the columns in my dataframe based on the numeric sequence, separately for the XX_ and YY_ columns.
import numpy as np
import pandas as pd
import math
import sys
import re
data = [[np.nan, 2, 5, np.nan, np.nan, 1],
        [np.nan, np.nan, 2, np.nan, np.nan, np.nan],
        [np.nan, 3, np.nan, np.nan, np.nan, np.nan],
        [1, np.nan, np.nan, np.nan, np.nan, 1],
        [np.nan, 2, np.nan, np.nan, 2, np.nan],
        [np.nan, np.nan, np.nan, 2, np.nan, 5]]
df = pd.DataFrame(data,columns=['XX_4','XX_2','XX_3','YY_4','YY_2','YY_3'])
df
My output dataframe should look like:
XX_2 XX_3 XX_4 YY_2 YY_3 YY_4
0 2.0 5.0 NaN NaN 1.0 NaN
1 NaN 2.0 NaN NaN NaN NaN
2 3.0 NaN NaN NaN NaN NaN
3 NaN NaN 1.0 NaN 1.0 NaN
4 2.0 NaN NaN 2.0 NaN NaN
5 NaN NaN 2.0 NaN 5.0 2.0
Since this is a small dataframe, I can manually rearrange the columns. Is there any way of doing it based on _2, _3 suffix?
IIUC, we can use a function based on Jeff Atwood's article on sorting alphanumeric strings, as implemented by Mark Byers:
https://stackoverflow.com/a/2669120/9375102
import re

def sorted_nicely(l):
    """Sort the given iterable in the way that humans expect."""
    convert = lambda text: int(text) if text.isdigit() else text
    alphanum_key = lambda key: [convert(c) for c in re.split('([0-9]+)', key)]
    return sorted(l, key=alphanum_key)
df = pd.DataFrame(data,columns=['XX_9','XX_10','XX_3','YY_9','YY_10','YY_3'])
data = df.columns.tolist()
print(df[sorted_nicely(data)])
XX_3 XX_9 XX_10 YY_3 YY_9 YY_10
0 5.0 NaN 2.0 1.0 NaN NaN
1 2.0 NaN NaN NaN NaN NaN
2 NaN NaN 3.0 NaN NaN NaN
3 NaN 1.0 NaN 1.0 NaN NaN
4 NaN NaN 2.0 NaN NaN 2.0
5 NaN NaN NaN 5.0 2.0 NaN
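For the specific PREFIX_NUMBER naming in the question, a shorter alternative (a sketch that assumes every column matches that pattern) is to sort on a (prefix, numeric suffix) tuple:
# Assumes every column looks like 'XX_10': a prefix, an underscore, an integer
cols = sorted(df.columns, key=lambda c: (c.rsplit('_', 1)[0], int(c.rsplit('_', 1)[1])))
print(df[cols])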
I have a Pandas dataframe that I want to forward fill HORIZONTALLY, but I don't want to forward fill past the last entry in each row. This is time-series pricing data on products, some of which have been discontinued, so I don't want the last value recorded to be forward-filled to the present.
FWDFILL.apply(lambda series: series.loc[:series.last_valid_index()].ffill())
^ The code above does what I want, but it does it VERTICALLY. It could maybe help people as a starting point.
>>> print(FWDFILL)
     1    2    3    4    5
1  1.0  NaN  NaN  2.0  NaN
2  NaN  1.0  NaN  5.0  NaN
3  NaN  3.0  1.0  NaN  NaN
4  NaN  NaN  NaN  NaN  NaN
5  NaN  5.0  NaN  NaN  1.0
Desired Output:
     1    2    3    4    5
1  1.0  1.0  1.0  2.0  NaN
2  NaN  1.0  1.0  5.0  NaN
3  NaN  3.0  1.0  NaN  NaN
4  NaN  NaN  NaN  NaN  NaN
5  NaN  5.0  5.0  5.0  1.0
IIUC, you need to apply with axis=1, so you are applying to dataframe rows instead of dataframe columns.
df.apply(lambda x: x[:x.last_valid_index()].ffill(), axis=1)
Output:
1 2 3 4 5
0
1 1.0 1.0 1.0 2.0 NaN
2 NaN 1.0 1.0 5.0 NaN
3 NaN 3.0 1.0 NaN NaN
4 NaN NaN NaN NaN NaN
5 NaN 5.0 5.0 5.0 1.0
Usage of bfill and ffill
s1 = df.ffill(axis=1)  # carry values forward along each row
s2 = df.bfill(axis=1)  # notnull marks cells at or before the last valid entry in the row
df = df.mask(s1.notnull() & s2.notnull(), s1)  # fill only between the first and last valid values
df
Out[222]:
1 2 3 4 5
1 1.0 1.0 1.0 2.0 NaN
2 NaN 1.0 1.0 5.0 NaN
3 NaN 3.0 1.0 NaN NaN
4 NaN NaN NaN NaN NaN
5 NaN 5.0 5.0 5.0 1.0
Or just using interpolate: with limit_area='inside', only NaNs surrounded by valid values get filled, so notnull() on the interpolated result marks exactly the positions to fill:
df.mask(df.interpolate(axis=1,limit_area='inside').notnull(),df.ffill(1))
Out[226]:
1 2 3 4 5
1 1.0 1.0 1.0 2.0 NaN
2 NaN 1.0 1.0 5.0 NaN
3 NaN 3.0 1.0 NaN NaN
4 NaN NaN NaN NaN NaN
5 NaN 5.0 5.0 5.0 1.0
You can use numpy to find the last valid index in each row and mask your ffill. This allows you to use the vectorized ffill and then a vectorized mask.
u = df.values
m = (~np.isnan(u)).cumsum(1).argmax(1)  # column position of the last valid entry per row
df.ffill(axis=1).mask(np.arange(df.shape[1]) > m[:, None])  # NaN out everything past it
0 1 2 3 4
0 1.0 1.0 1.0 2.0 NaN
1 NaN 1.0 1.0 5.0 NaN
2 NaN 3.0 1.0 NaN NaN
3 NaN NaN NaN NaN NaN
4 NaN 5.0 5.0 5.0 1.0
Info
>>> np.arange(df.shape[1]) > m[:, None]
array([[False, False, False, False, True],
[False, False, False, False, True],
[False, False, False, True, True],
[False, True, True, True, True],
[False, False, False, False, False]])
A small modification to the solution from Most efficient way to forward-fill NaN values in numpy array solves it here -
def ffillrows_stoplast(arr):
    # Identical to the earlier forward-filling solution
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    idx_acc = np.maximum.accumulate(idx, axis=1)
    out = arr[np.arange(idx.shape[0])[:, None], idx_acc]
    # Perform a flipped index accumulation to get the trailing-NaN mask and
    # accordingly assign NaNs there
    out[np.maximum.accumulate(idx[:, ::-1], axis=1)[:, ::-1] == 0] = np.nan
    return out
Sample run -
In [121]: df
Out[121]:
A B C D E
1 1.0 NaN NaN 2.0 NaN
2 NaN 1.0 NaN 5.0 NaN
3 NaN 3.0 1.0 NaN NaN
4 NaN NaN NaN NaN NaN
5 NaN 5.0 NaN NaN 1.0
In [122]: out = ffillrows_stoplast(df.to_numpy())
In [123]: pd.DataFrame(out,columns=df.columns,index=df.index)
Out[123]:
A B C D E
1 1.0 1.0 1.0 2.0 NaN
2 NaN 1.0 1.0 5.0 NaN
3 NaN 3.0 1.0 NaN NaN
4 NaN NaN NaN NaN NaN
5 NaN 5.0 5.0 5.0 1.0
I thought of using where on the ffill result to flip back to NaN the entries that bfill ignored:
df.ffill(1).where(df.bfill(1).notna())
Out[1623]:
a b c d e
1 1.0 1.0 1.0 2.0 NaN
2 NaN 1.0 1.0 5.0 NaN
3 NaN 3.0 1.0 NaN NaN
4 NaN NaN NaN NaN NaN
5 NaN 5.0 5.0 5.0 1.0
I have a dataframe with dummy variables for daily weather-type observations.
date high_wind thunder snow smoke
0 2050-10-23 1.0 NaN NaN NaN
1 2050-10-24 1.0 1.0 NaN NaN
2 2050-10-25 NaN NaN NaN NaN
3 2050-10-26 NaN NaN NaN 1.0
4 2050-10-27 NaN NaN NaN 1.0
5 2050-10-28 NaN NaN NaN 1.0
6 2050-10-29 1.0 NaN NaN NaN
7 2050-10-30 NaN 1.0 NaN NaN
8 2050-10-31 NaN 1.0 NaN NaN
9 2050-11-01 1.0 1.0 NaN NaN
10 2050-11-02 1.0 1.0 NaN NaN
11 2050-11-03 1.0 1.0 NaN NaN
12 2050-11-04 1.0 NaN NaN NaN
13 2050-11-05 1.0 NaN NaN NaN
14 2050-11-06 NaN NaN NaN NaN
15 2050-11-07 NaN 1.0 NaN NaN
16 2050-11-08 NaN NaN NaN NaN
17 2050-11-09 NaN NaN 1.0 NaN
18 2050-11-10 NaN NaN NaN NaN
19 2050-11-11 NaN NaN 1.0 NaN
20 2050-11-12 NaN NaN 1.0 NaN
21 2050-11-13 NaN NaN NaN NaN
For those of you playing along at home, copy the above and then:
import pandas as pd
df = pd.read_clipboard()
df.date = df.date.apply(pd.to_datetime)
df.set_index('date', inplace=True)
I want to visualize this dataframe with the date on the x axis and each weather type category on the y axis. Here's what I've tried so far:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
labels = df.columns.tolist()
# unsatisfying loop to give categories some y separation
for i, col in enumerate(df.columns):
    ax.scatter(x=df[col].index, y=(df[col] + i))  # add a little to each
ax.set_yticklabels(labels)
ax.set_xlim(df.index.min(), df.index.max())
fig.autofmt_xdate()
Which gives me a scatter plot whose y tick labels don't line up with the categories.
Questions:
How do I get the y labels aligned properly?
Is there a better way to structure the data to make plotting easier?
This aligns your y labels:
ax.set_yticks(range(1, len(df.columns) + 1))
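Putting it together with the loop from the question (a sketch; since each dummy value is 1.0 and the loop adds i, category i ends up at y = 1 + i):
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
labels = df.columns.tolist()
for i, col in enumerate(df.columns):
    ax.scatter(x=df[col].index, y=(df[col] + i))  # category i plots at y = 1 + i
ax.set_yticks(range(1, len(df.columns) + 1))      # one tick per category
ax.set_yticklabels(labels)
ax.set_xlim(df.index.min(), df.index.max())
fig.autofmt_xdate()
plt.show()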
For certain columns of df, 80% of the column is NaN.
What's the simplest code to drop such columns?
You can use isnull with mean to get each column's NaN fraction, then remove columns by boolean indexing with loc (because we remove columns). The condition also needs inverting: keeping columns where the fraction is < .8 removes all columns that are >= 0.8 NaN:
df = df.loc[:, df.isnull().mean() < .8]
Sample:
np.random.seed(100)
df = pd.DataFrame(np.random.random((100,5)), columns=list('ABCDE'))
df.loc[:80, 'A'] = np.nan
df.loc[:5, 'C'] = np.nan
df.loc[20:, 'D'] = np.nan
print (df.isnull().mean())
A 0.81
B 0.00
C 0.06
D 0.80
E 0.00
dtype: float64
df = df.loc[:, df.isnull().mean() < .8]
print (df.head())
B C E
0 0.278369 NaN 0.004719
1 0.670749 NaN 0.575093
2 0.209202 NaN 0.219697
3 0.811683 NaN 0.274074
4 0.940030 NaN 0.175410
If you want to remove columns by a minimum count of non-NaN values, dropna works nicely with the thresh parameter and axis=1 for removing columns:
np.random.seed(1997)
df = pd.DataFrame(np.random.choice([np.nan,1], p=(0.8,0.2),size=(10,10)))
print (df)
0 1 2 3 4 5 6 7 8 9
0 NaN NaN NaN 1.0 1.0 NaN NaN NaN NaN NaN
1 1.0 NaN 1.0 NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN 1.0 1.0 NaN NaN NaN
3 NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN 1.0 NaN NaN NaN 1.0
5 NaN NaN NaN 1.0 1.0 NaN NaN 1.0 NaN 1.0
6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN
9 1.0 NaN NaN NaN 1.0 NaN NaN 1.0 NaN NaN
df1 = df.dropna(thresh=2, axis=1)
print (df1)
0 3 4 5 7 9
0 NaN 1.0 1.0 NaN NaN NaN
1 1.0 NaN NaN NaN NaN NaN
2 NaN NaN NaN 1.0 NaN NaN
3 NaN NaN 1.0 NaN NaN NaN
4 NaN NaN NaN 1.0 NaN 1.0
5 NaN 1.0 1.0 NaN 1.0 1.0
6 NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN 1.0 NaN
9 1.0 NaN 1.0 NaN 1.0 NaN
EDIT: For non-Boolean data
Total number of NaN entries in a column must be less than 80% of total entries:
df = df.loc[:, df.isnull().sum() < 0.8*df.shape[0]]
df.dropna(thresh=int((100 - percent_NA_cols_required) * (len(df.columns) / 100)), inplace=True)
Basically dropna takes the number (int) of non-NA values a row must have in order to be kept; rows with fewer are removed.
You can use the pandas dropna. For example:
df.dropna(axis=1, thresh = int(0.2*df.shape[0]), inplace=True)
Notice that we used 0.2, which is 1 - 0.8, since thresh refers to the number of non-NA values required to keep a column.
As suggested in comments, if you use sum() on a boolean test, you can get the number of occurrences.
Code:
def get_nan_cols(df, nan_percent=0.8):
    threshold = len(df.index) * nan_percent
    return [c for c in df.columns if sum(df[c].isnull()) >= threshold]
Used as:
df.drop(columns=get_nan_cols(df, 0.8), inplace=True)
I have this kind of pandas DataFrame for each user in a large database.
Each row is a period [start_date, end_date], but sometimes two consecutive rows are in fact the same period: the end_date is equal to the following start_date (red underlining in the screenshot). Sometimes periods even overlap by more than one date.
I would like to get the "real" periods by combining the rows which correspond to the same period.
What I have tried
def split_range(name):
    df_user = de_201512_echant[de_201512_echant.name == name]
    # -- Create a date_range covering [min_start_date, max_start_date]
    t_date = pd.DataFrame(index=pd.date_range("2005-01-01", "2015-12-12").date)
    for row in range(0, df_user.shape[0]):
        start_date = df_user.iloc[row].start_date
        end_date = df_user.iloc[row].end_date
        if (pd.isnull(start_date) == False) and (pd.isnull(end_date) == False):
            t = pd.DataFrame(index=pd.date_range(start_date, end_date))
            t["period_%s" % (row)] = 1
            t_date = pd.merge(t_date, t, right_index=True, left_index=True, how="left")
        else:
            pass
    return t_date
which yields a DataFrame where each column is a period (1 if in the range, NaN if not):
t_date
Out[29]:
period_0 period_1 period_2 period_3 period_4 period_5 \
2005-01-01 NaN NaN NaN NaN NaN NaN
2005-01-02 NaN NaN NaN NaN NaN NaN
2005-01-03 NaN NaN NaN NaN NaN NaN
2005-01-04 NaN NaN NaN NaN NaN NaN
2005-01-05 NaN NaN NaN NaN NaN NaN
2005-01-06 NaN NaN NaN NaN NaN NaN
2005-01-07 NaN NaN NaN NaN NaN NaN
2005-01-08 NaN NaN NaN NaN NaN NaN
2005-01-09 NaN NaN NaN NaN NaN NaN
2005-01-10 NaN NaN NaN NaN NaN NaN
2005-01-11 NaN NaN NaN NaN NaN NaN
Then if I sum all the columns (periods) I get almost exactly what I want:
full_spell = t_date.sum(axis=1)
full_spell.loc[full_spell == 1]
Out[31]:
2005-11-14 1.0
2005-11-15 1.0
2005-11-16 1.0
2005-11-17 1.0
2005-11-18 1.0
2005-11-19 1.0
2005-11-20 1.0
2005-11-21 1.0
2005-11-22 1.0
2005-11-23 1.0
2005-11-24 1.0
2005-11-25 1.0
2005-11-26 1.0
2005-11-27 1.0
2005-11-28 1.0
2005-11-29 1.0
2005-11-30 1.0
2006-01-16 1.0
2006-01-17 1.0
2006-01-18 1.0
2006-01-19 1.0
2006-01-20 1.0
2006-01-21 1.0
2006-01-22 1.0
2006-01-23 1.0
2006-01-24 1.0
2006-01-25 1.0
2006-01-26 1.0
2006-01-27 1.0
2006-01-28 1.0
2015-07-06 1.0
2015-07-07 1.0
2015-07-08 1.0
2015-07-09 1.0
2015-07-10 1.0
2015-07-11 1.0
2015-07-12 1.0
2015-07-13 1.0
2015-07-14 1.0
2015-07-15 1.0
2015-07-16 1.0
2015-07-17 1.0
2015-07-18 1.0
2015-07-19 1.0
2015-08-02 1.0
2015-08-03 1.0
2015-08-04 1.0
2015-08-05 1.0
2015-08-06 1.0
2015-08-07 1.0
2015-08-08 1.0
2015-08-09 1.0
2015-08-10 1.0
2015-08-11 1.0
2015-08-12 1.0
2015-08-13 1.0
2015-08-14 1.0
2015-08-15 1.0
2015-08-16 1.0
2015-08-17 1.0
dtype: float64
But I could not find a way to slice this sparse datetime index into contiguous time ranges to finally get my desired output: the original dataframe containing the "real" periods of time.
It might not be the most efficient way to do this, so if you have alternatives, do not hesitate!
I found a much more efficient way to do this by using apply:
def get_range(row):
    '''Returns a DataFrame containing the day range from a "start_date"
    to an "end_date".'''
    start_date = row["start_date"]
    end_date = row["end_date"]
    period = pd.date_range(start_date, end_date, freq="1D")
    return pd.DataFrame(period, columns=["days_in_period"])

# -- Apply get_range() row-wise to the initial df and stack the results
t_all = pd.concat(df.apply(get_range, axis=1).tolist())

# -- Drop overlapping dates
t_all.drop_duplicates(inplace=True)
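If the goal is the merged [start_date, end_date] rows themselves rather than a day-level index, a single pass over the sorted periods can merge overlapping or touching intervals directly (a sketch, assuming the dataframe has start_date and end_date columns):
# Sort by start date, then fold each period into the previous one when they touch or overlap
df = df.sort_values("start_date")
merged = []
for start, end in zip(df["start_date"], df["end_date"]):
    if merged and start <= merged[-1][1]:
        merged[-1][1] = max(merged[-1][1], end)  # extend the current period
    else:
        merged.append([start, end])              # open a new period
real_periods = pd.DataFrame(merged, columns=["start_date", "end_date"])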