Merging dataframes on two columns alternative solution - python

I have been trying, without any luck, to find an alternative (possibly more elegant) solution for the following code. Here it is:
import os
import pandas as pd

os.chdir(os.getcwd())

df1 = pd.DataFrame({'Month': [1]*6 + [13]*6,
                    'Temp': [0, 1, 2, 3, 4, 5]*2,
                    'Place': [12, 53, 6, 11, 9, 10, 0, 0, 0, 0, 0, 0],
                    'Place2': [1, 0, 23, 14, 9, 8, 0, 0, 0, 0, 0, 0],
                    'Place3': [2, 64, 24, 66, 14, 21, 0, 0, 0, 0, 0, 0]})
df2 = pd.DataFrame({'Month': [13] * 6,
                    'Temp': [0, 1, 2, 3, 4, 5],
                    'Place': [1, 22, 333, 444, 55, 6]})

# Here the merge creates new columns "Place_x" and "Place_y".
# I want to avoid this if possible.
df_merge = pd.merge(df1, df2, how='left',
                    left_on=['Temp', 'Month'],
                    right_on=['Temp', 'Month'])
df_merge.fillna(0, inplace=True)
add_not_nan = lambda x: x['Place_x'] if pd.isnull(x['Place_y']) else x['Place_y']
df_merge['Place'] = df_merge.apply(add_not_nan, axis=1)
df_merge.drop(['Place_x', 'Place_y'], axis=1, inplace=True)
print(df_merge)
What I am trying to accomplish is to merge the two dataframes based on the "Month" and "Temp" columns, while keeping 0s for missing values. I would like to know if there is any way to merge the dataframes without creating the _x and _y columns (basically, a way to skip the creation and deletion of those columns).
Inputs:
First dataframe
Month Temp Place Place2 Place3
0 1 0 12 1 2
1 1 1 53 0 64
2 1 2 6 23 24
3 1 3 11 14 66
4 1 4 9 9 14
5 1 5 10 8 21
6 13 0 0 0 0
7 13 1 0 0 0
8 13 2 0 0 0
9 13 3 0 0 0
10 13 4 0 0 0
11 13 5 0 0 0
Second dataframe
Month Temp Place
0 13 0 1
1 13 1 22
2 13 2 333
3 13 3 444
4 13 4 55
5 13 5 6
Outputs:
After merge
Month Temp Place_x Place2 Place3 Place_y
0 1 0 12 1 2 NaN
1 1 1 53 0 64 NaN
2 1 2 6 23 24 NaN
3 1 3 11 14 66 NaN
4 1 4 9 9 14 NaN
5 1 5 10 8 21 NaN
6 13 0 0 0 0 1.0
7 13 1 0 0 0 22.0
8 13 2 0 0 0 333.0
9 13 3 0 0 0 444.0
10 13 4 0 0 0 55.0
11 13 5 0 0 0 6.0
Final (desired)
Month Temp Place2 Place3 Place
0 1 0 1 2 0.0
1 1 1 0 64 0.0
2 1 2 23 24 0.0
3 1 3 14 66 0.0
4 1 4 9 14 0.0
5 1 5 8 21 0.0
6 13 0 0 0 1.0
7 13 1 0 0 22.0
8 13 2 0 0 333.0
9 13 3 0 0 444.0
10 13 4 0 0 55.0
11 13 5 0 0 6.0

It seems you don't need the Place column from df1; you can just drop it before merging:
(df1.drop('Place', axis=1)
    .merge(df2, how='left', on=['Temp', 'Month'])
    .fillna({'Place': 0}))
# Month Temp Place2 Place3 Place
#0 1 0 1 2 0.0
#1 1 1 0 64 0.0
#2 1 2 23 24 0.0
#3 1 3 14 66 0.0
#4 1 4 9 14 0.0
#5 1 5 8 21 0.0
#6 13 0 0 0 1.0
#7 13 1 0 0 22.0
#8 13 2 0 0 333.0
#9 13 3 0 0 444.0
#10 13 4 0 0 55.0
#11 13 5 0 0 6.0

If you don't know in advance how many such columns there are, and you always want to keep the second dataframe's version of each overlapping non-key column, you can mark the first dataframe's copies using the suffixes parameter of pd.merge, then drop the marked columns with pandas.DataFrame.filter (the regex .*(?<!###)$ keeps only column names that do not end with the marker):
df1.merge(df2,
          how='left',
          left_on=['Temp', 'Month'],
          right_on=['Temp', 'Month'],
          suffixes=('###', '')).fillna(0).filter(regex='.*(?<!###)$')
OUTPUT:
Month Temp Place2 Place3 Place
0 1 0 1 2 0.0
1 1 1 0 64 0.0
2 1 2 23 24 0.0
3 1 3 14 66 0.0
4 1 4 9 14 0.0
5 1 5 8 21 0.0
6 13 0 0 0 1.0
7 13 1 0 0 22.0
8 13 2 0 0 333.0
9 13 3 0 0 444.0
10 13 4 0 0 55.0
11 13 5 0 0 6.0
Alternatively, you can filter out the overlapping columns up front, before merging, by checking whether each of the first dataframe's columns exists in the second dataframe:
cols = [col for col in df1.columns if col in ('Temp', 'Month') or col not in df2.columns]
df1[cols].merge(df2, how='left',
                left_on=['Temp', 'Month'],
                right_on=['Temp', 'Month']).fillna(0)
Month Temp Place2 Place3 Place
0 1 0 1 2 0.0
1 1 1 0 64 0.0
2 1 2 23 24 0.0
3 1 3 14 66 0.0
4 1 4 9 14 0.0
5 1 5 8 21 0.0
6 13 0 0 0 1.0
7 13 1 0 0 22.0
8 13 2 0 0 333.0
9 13 3 0 0 444.0
10 13 4 0 0 55.0
11 13 5 0 0 6.0
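For completeness, here is a sketch of the same idea written with DataFrame.join instead of merge, under the same assumption that df1's Place column is disposable:
out = (df1.drop(columns='Place')
          .join(df2.set_index(['Month', 'Temp']), on=['Month', 'Temp'])
          .fillna({'Place': 0}))
print(out)
Because join matches df1's key columns against df2's index, there is no column collision to suffix away in the first place.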

Related

Casting a value based on a trigger in pandas

I would like to create a new column every time a 1 appears in the 'Signal' column; each new column should carry the corresponding value from the 'Value' column (please see the expected output below).
Initial data:
Index  Value  Signal
0      3      0
1      8      0
2      8      0
3      7      1
4      9      0
5      10     0
6      14     1
7      10     0
8      10     0
9      4      1
10     10     0
11     10     0
Expected Output:
Index  Value  Signal  New_Col_1  New_Col_2  New_Col_3
0      3      0       0          0          0
1      8      0       0          0          0
2      8      0       0          0          0
3      7      1       7          0          0
4      9      0       7          0          0
5      10     0       7          0          0
6      14     1       7          14         0
7      10     0       7          14         0
8      10     0       7          14         0
9      4      1       7          14         4
10     10     0       7          14         4
11     10     0       7          14         4
What would be a way to do it?
You can use a pivot:
out = df.join(df
              # keep only the values where Signal is 1
              # and get Signal's cumsum
              .assign(val=df['Value'].where(df['Signal'].eq(1)),
                      col=df['Signal'].cumsum())
              # pivot the cumsummed Signal to columns
              .pivot(index='Index', columns='col', values='val')
              # ensure column 0 is absent (using loc to avoid a KeyError)
              .loc[:, 1:]
              # forward fill the values
              .ffill()
              # rename the columns
              .add_prefix('New_Col_')
              )
output:
Index Value Signal New_Col_1 New_Col_2 New_Col_3
0 0 3 0 NaN NaN NaN
1 1 8 0 NaN NaN NaN
2 2 8 0 NaN NaN NaN
3 3 7 1 7.0 NaN NaN
4 4 9 0 7.0 NaN NaN
5 5 10 0 7.0 NaN NaN
6 6 14 1 7.0 14.0 NaN
7 7 10 0 7.0 14.0 NaN
8 8 10 0 7.0 14.0 NaN
9 9 4 1 7.0 14.0 4.0
10 10 10 0 7.0 14.0 4.0
11 11 10 0 7.0 14.0 4.0
# create the new column names by incrementing a counter at each signal
df['new_col'] = 'new_col_' + df['Signal'].cumsum().astype(str)
# rows having no signal: give them a placeholder name
df['new_col'] = df['new_col'].mask(df['Signal'] == 0, '0')
# pivot table
df2 = (df.pivot(index=['Index', 'Signal', 'Value'], columns='new_col', values='Value')
         .reset_index()
         .ffill().fillna(0)                           # forward fill, then fill remaining NaN with 0
         .drop(columns=['0', 'Index'])                # drop the extra columns
         .rename_axis(columns={'new_col': 'Index'})   # rename the columns axis
         .astype(int))                                # cast to int, removing decimals
df2
Index Signal Value new_col_1 new_col_2 new_col_3
0 0 3 0 0 0
1 0 8 0 0 0
2 0 8 0 0 0
3 1 7 7 0 0
4 0 9 7 0 0
5 0 10 7 0 0
6 1 14 7 14 0
7 0 10 7 14 0
8 0 10 7 14 0
9 1 4 7 14 4
10 0 10 7 14 4
11 0 10 7 14 4
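For reference, here is a loop-based sketch (not from either answer above) that builds each column directly from the trigger positions; it reproduces the expected output with integer zeros instead of NaN:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Index': range(12),
                   'Value': [3, 8, 8, 7, 9, 10, 14, 10, 10, 4, 10, 10],
                   'Signal': [0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0]})

# row positions where Signal is 1, in order of appearance
signal_rows = np.flatnonzero(df['Signal'].to_numpy() == 1)
for k, i in enumerate(signal_rows, start=1):
    # 0 before the trigger row, the triggered value from that row onward
    df[f'New_Col_{k}'] = np.where(np.arange(len(df)) >= i, df['Value'].iloc[i], 0)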

Percentage of events before and after a sequence of zeros in pandas rows

I have a dataframe like the following:
ID 0 1 2 3 4 5 6 7 8 ... 81 82 83 84 85 86 87 88 89 90 total
-----------------------------------------------------------------------------------------------------
0 A 2 21 0 18 3 0 0 0 2 ... 0 0 0 0 0 0 0 0 0 0 156
1 B 0 20 12 2 0 8 14 23 0 ... 0 0 0 0 0 0 0 0 0 0 231
2 C 0 38 19 3 1 3 3 7 1 ... 0 0 0 0 0 0 0 0 0 0 78
3 D 3 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 5
and I want to know the % of events (the numbers in the cells) before and after the first sequence of zeros of length n appears in each row. This problem started as another question found here: Length of first sequence of zeros of given size after certain column in pandas dataframe, and I am trying to modify the code to do what I need, but I keep getting errors and can't seem to find the right way. This is what I have tried:
def func(row, n):
    """Returns the number of events before the
    first sequence of 0s of length n is found.
    """
    idx = np.arange(0, 91)
    a = row[idx]
    b = (a != 0).cumsum()
    c = b[a == 0]
    d = c.groupby(c).count()
    # in case there is no sequence of 0s with length n
    try:
        e = c[c >= d.index[d >= n][0]]
        f = str(e.index[0])
    except IndexError:
        e = [90]
        f = str(e[0])
    idx_sliced = np.arange(0, int(f) + 1)
    a = row[idx_sliced]
    if int(f) + n > 90:
        perc_before = 100
    else:
        perc_before = a.cumsum().tail(1).values[0] / row['total']
    return perc_before
As is, the error I get is:
---> perc_before = a.cumsum().tail(1).values[0]/row['total']
TypeError: ('must be str, not int', 'occurred at index 0')
Finally, I would apply this function to a dataframe and return a new column with the % of events before the first sequence of n 0s in each row, like this:
ID 0 1 2 3 4 5 6 7 8 ... 81 82 83 84 85 86 87 88 89 90 total %_before
---------------------------------------------------------------------------------------------------------------
0 A 2 21 0 18 3 0 0 0 2 ... 0 0 0 0 0 0 0 0 0 0 156 43
1 B 0 20 12 2 0 8 14 23 0 ... 0 0 0 0 0 0 0 0 0 0 231 21
2 C 0 38 19 3 1 3 3 7 1 ... 0 0 0 0 0 0 0 0 0 0 78 90
3 D 3 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 5 100
If trying to solve this, you can test by using this sample input:
a = pd.Series([1, 1, 13, 0, 0, 0, 4, 0, 0, 0, 0, 0, 12, 1, 1])
b = pd.Series([1, 1, 13, 0, 0, 0, 4, 12, 1, 12, 3, 0, 0, 5, 1])
c = pd.Series([1, 1, 13, 0, 0, 0, 4, 12, 2, 0, 5, 0, 5, 1, 1])
d = pd.Series([1, 1, 13, 0, 0, 0, 4, 12, 1, 12, 4, 50, 0, 0, 1])
e = pd.Series([1, 1, 13, 0, 0, 0, 4, 12, 0, 0, 0, 54, 0, 1, 1])
df = pd.DataFrame({'0': a, '1': b, '2': c, '3': d, '4': e})
df = df.transpose()
Give this a try:
def percent_before(row, n, ncols):
    """Return the percentage of activities that happen before
    the first sequence of at least `n` consecutive 0s.
    """
    start_index, i, size = 0, 0, 0
    for i in range(ncols):
        if row[i] == 0:
            # increase the size of the island
            size += 1
        elif size >= n:
            # found the island we want
            break
        else:
            # start a new island
            # row[start_index] is always non-zero
            start_index = i
            size = 0
    if size < n:
        # didn't find the island we want
        return 1
    else:
        # get the sum of activities that happen
        # before the island
        idx = np.arange(0, start_index + 1).astype(str)
        return row.loc[idx].sum() / row['total']

# note: this assumes string column labels and a 'total' column
# holding the row sums (e.g. df['total'] = df.sum(axis=1))
df['percent_before'] = df.apply(percent_before, n=3, ncols=15, axis=1)
Result:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 total percent_before
0 1 1 13 0 0 0 4 0 0 0 0 0 12 1 1 33 0.454545
1 1 1 13 0 0 0 4 12 1 12 3 0 0 5 1 53 0.283019
2 1 1 13 0 0 0 4 12 2 0 5 0 5 1 1 45 0.333333
3 1 1 13 0 0 0 4 12 1 12 4 50 0 0 1 99 0.151515
4 1 1 13 0 0 0 4 12 0 0 0 54 0 1 1 87 0.172414
For the full frame, call apply with ncols=91.
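That is, presumably something like:
df['percent_before'] = df.apply(percent_before, n=3, ncols=91, axis=1)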
Another possible solution:
def get_vals(df, n):
    df, out = df.T, []
    for col in df.columns:
        diff_to_previous = df[col] != df[col].shift(1)
        g = df.groupby(diff_to_previous.cumsum())[col].agg(['idxmin', 'size'])
        vals = df.loc[g.loc[g['size'] >= n, 'idxmin'].values, col]
        if len(vals):
            out.append(df.loc[np.arange(0, vals[vals == 0].index[0]), col].sum() / df[col].sum())
        else:
            out.append(1.0)
    return out

df['percent_before'] = get_vals(df, n=3)
print(df)
Prints:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 percent_before
0 1 1 13 0 0 0 4 0 0 0 0 0 12 1 1 0.454545
1 1 1 13 0 0 0 4 12 1 12 3 0 0 5 1 0.283019
2 1 1 13 0 0 0 4 12 2 0 5 0 5 1 1 0.333333
3 1 1 13 0 0 0 4 12 1 12 4 50 0 0 1 0.151515
4 1 1 13 0 0 0 4 12 0 0 0 54 0 1 1 0.172414
As one of the comments on the previous question was about speed, I guess you can try to vectorize the problem. I used this dataframe to try it (slightly different from your original input):
ID 0 1 2 3 4 5 6 7 8 total
0 A 2 21 0 18 3 0 0 0 2 46
1 B 0 0 12 2 0 8 14 23 0 59
2 C 0 38 19 3 1 3 3 7 1 75
3 D 3 0 0 1 0 0 0 0 0 4
Now the idea is to chain commands: create a mask of where the data is not equal to 0, take the cumsum along the column axis, and look for where the diff over n columns is equal to 0. To flag everything from the first such window onward, use cummax so that all the columns after it (row-wise) are True. Mask the original dataframe with the opposite of this mask, sum along the columns, and divide by total. For example, with n=2:
n = 2
df['%_before'] = df[~(df.ne(0).cumsum(axis=1).diff(n, axis=1)[range(9)]
                      .eq(0).cummax(axis=1))].sum(axis=1) / df.total
print(df)
ID 0 1 2 3 4 5 6 7 8 total %_before
0 A 2 21 0 18 3 0 0 0 2 46 0.956522
1 B 0 0 12 2 0 8 14 23 0 59 0.000000
2 C 0 38 19 3 1 3 3 7 1 75 1.000000
3 D 3 0 0 1 0 0 0 0 0 4 0.750000
In your case, you need to change range(9) to range(91) to cover all your columns.
You can do this using the rolling method.
For your example input, with a zero-sequence length of 5, we can use:
df.rolling(window=5, axis=1).apply(lambda x: np.sum(x))
The output would look like:
     0    1    2    3     4     5     6     7     8     9    10    11    12    13    14
0  NaN  NaN  NaN  NaN  15.0  14.0  17.0   4.0   4.0   4.0   4.0   0.0  12.0  13.0  14.0
1  NaN  NaN  NaN  NaN  15.0  14.0  17.0  16.0  17.0  29.0  32.0  28.0  16.0  20.0   9.0
2  NaN  NaN  NaN  NaN  15.0  14.0  17.0  16.0  18.0  18.0  23.0  19.0  12.0  11.0  12.0
3  NaN  NaN  NaN  NaN  15.0  14.0  17.0  16.0  17.0  29.0  33.0  79.0  67.0  66.0  55.0
4  NaN  NaN  NaN  NaN  15.0  14.0  17.0  16.0  16.0  16.0  16.0  66.0  54.0  55.0  56.0
Looking at the output, it is easy to see that in the first row, column 11 holds 0, which means that the 5 positions starting at position 7 are all zeros.
Since no other row of this rolling output contains a 0, none of the other rows has 5 contiguous zeros.
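To push that observation a step further, here is a hedged sketch, assuming the sample frame above (integer column labels 0 through 14) and a pandas version that still accepts rolling(..., axis=1), that turns the rolling sums into the percentage of events before the first island of n zeros:
import numpy as np

n = 5
win = df.rolling(window=n, axis=1).sum()   # the same rolling sums as above
is_zero = win.eq(0)
has_island = is_zero.any(axis=1)           # rows that contain n contiguous zeros
end = is_zero.idxmax(axis=1)               # column where the first zero window ends
                                           # (meaningless where has_island is False)

# events strictly before the island live in columns 0 .. end - n (inclusive)
before = [df.loc[r, :end[r] - n].sum() for r in df.index]
df['percent_before'] = np.where(has_island, np.array(before) / df.sum(axis=1), 1.0)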

Python Group BY Cumsum

I have this DataFrame:
Value Month
0 1
1 2
8 3
11 4
12 5
17 6
0 7
0 8
0 9
0 10
1 11
2 12
7 1
3 2
1 3
0 4
0 5
And I want to create a new variable "Cumsum" like this:
Value Month Cumsum
0 1 0
1 2 1
8 3 9
11 4 20
12 5 32
17 6
0 7
0 8 ...
0 9
0 10
1 11
2 12
7 1 7
3 2 10
1 3 11
0 4 11
0 5 11
Sorry if my code is not clean; I failed to include my dataframe properly...
My problem is that I do not have only 12 lines (one line per month) but many more.
On the other hand, I know that my table is tidy, and I want the cumulative amount up to the 12th month, restarting whenever month 1 appears again.
Thank you for your help.
Try:
df['Cumsum'] = df.groupby((df.Month == 1).cumsum())['Value'].cumsum()
print(df)
Value Month Cumsum
0 0 1 0
1 1 2 1
2 8 3 9
3 11 4 20
4 12 5 32
5 17 6 49
6 0 7 49
7 0 8 49
8 0 9 49
9 0 10 49
10 1 11 50
11 2 12 52
12 7 1 7
13 3 2 10
14 1 3 11
15 0 4 11
16 0 5 11
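The grouping key in that one-liner is simply a counter that increases each time Month returns to 1, so every block of months shares one label; a quick way to see it:
print((df.Month == 1).cumsum().tolist())
# [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]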
code:
df = pd.DataFrame({'value': [0, 1, 8, 11, 12, 17, 0, 0, 0, 0, 1, 2, 7, 3, 1, 0, 0],
                   'month': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5]})
temp = int(len(df) / 12)
# note: this assumes each year occupies exactly 12 consecutive rows
for i in range(temp + 1):
    start = i * 12
    if i < temp:
        end = (i + 1) * 12 - 1
        df.loc[start:end, 'cumsum'] = df.loc[start:end, 'value'].cumsum()
    else:
        df.loc[start:, 'cumsum'] = df.loc[start:, 'value'].cumsum()
print(df)
output:
value month cumsum
0 0 1 0.0
1 1 2 1.0
2 8 3 9.0
3 11 4 20.0
4 12 5 32.0
5 17 6 49.0
6 0 7 49.0
7 0 8 49.0
8 0 9 49.0
9 0 10 49.0
10 1 11 50.0
11 2 12 52.0
12 7 1 7.0
13 3 2 10.0
14 1 3 11.0
15 0 4 11.0
16 0 5 11.0

How to replace column values with 1 and zero?

I have a column in my dataframe which has string values, as shown in fig 1.
What I wanted to do is replace every NaN value with 0 and everything else with 1 (whatever the other field is, string or int).
I tried this:
func_lambda = lambda x: 1 if any(dataframe['Colum'].values != 0) else 0
But it replaces the whole column with 1.
This is my df.head:
datacol.head(20)
Out[77]:
0 nan
1 4500856427
2 4003363
3 nan
4 16-4989
5 nan
6 nan
7 WVE-78686557032
8 nan
9 4501581113
10 D4-SC-0232737-1/G1023716
11 nan
12 nan
13 4502549104
14 nan
15 nan
16 nan
17 IT008297
18 15\036628
19 299011667
Name: Customer_PO_Number, dtype: object
Check this:
import pandas as pd

df = pd.DataFrame({"Customer_PO_Number":
                   ['nan', '4500856427', '4003363', 'nan', '16 - 4989', 'nan', 'nan',
                    'WVE - 78686557032', 'nan', '4501581113', 'D4 - SC - 0232737 - 1 / G1023716',
                    'nan', 'nan', '4502549104', 'nan', 'nan', 'nan', 'IT008297',
                    '15\03662', '8', '299011667']})
df.replace('nan', 0, inplace=True)  # replace the string 'nan' with 0
df[df != 0] = 1                     # replace everything else with 1
print(df)
It will give you output like this:
Customer_PO_Number
0 0
1 1
2 1
3 0
4 1
5 0
6 0
7 1
8 0
9 1
10 1
11 0
12 0
13 1
14 0
15 0
16 0
17 1
18 1
19 1
20 1
Hope it will help you! :)
You can use a boolean test and cast the result as integer:
(df['Customer_PO_Number'] != 'nan').astype(int)
Output:
0     0
1     1
2     1
3     0
4     1
5     0
6     0
7     1
8     0
9     1
10    1
11    0
12    0
13    1
14    0
15    0
16    0
17    1
18    1
19    1
20    1
Name: Customer_PO_Number, dtype: int32
If the 'nan' values are really np.nan, then you can use notnull:
df['Customer_PO_Number'].notnull().astype(int)
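If the column can contain either the literal string 'nan' or a real np.nan, here is a hedged sketch that handles both (the flag name is just illustrative):
import numpy as np

# normalize the string 'nan' to a real missing value, then flag:
# 1 where a PO number is present, 0 where it is missing
df['flag'] = (df['Customer_PO_Number']
              .replace('nan', np.nan)
              .notna()
              .astype(int))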

Pandas dataframe cumulative sum of column except last zero values

I want to do a cumulative sum on a pandas dataframe without carrying the sum over into the trailing zero rows. For example, given a dataframe:
A B
1 1 2
2 5 0
3 10 0
4 10 1
5 0 1
6 5 2
7 0 0
8 0 0
9 0 0
cumulative sum of index 1 to 6 only:
A B
1 1 2
2 6 2
3 16 2
4 26 3
5 26 4
6 31 6
7 0 0
8 0 0
9 0 0
If you do not want cumsum applied to the trailing rows that are 0 in all columns:
Check whether each row contains any non-zero value, shift the mask down one row, and take its cumulative sum. The counter stops increasing on the trailing all-zero block, so those rows all equal the maximum; compare with the maximum and filter:
a = df.ne(0).any(axis=1).shift().cumsum()
m = a != a.max()
df[m] = df[m].cumsum()
print(df)
A B
1 1 2
2 6 2
3 16 2
4 26 3
5 26 4
6 31 6
7 0 0
8 0 0
9 0 0
A similar solution if you want to process each column separately: simply omit the any:
print(df)
A B
1 1 2
2 5 0
3 10 0
4 10 1
5 0 1
6 5 0
7 0 0
8 0 0
9 0 0
a = df.ne(0).shift().cumsum()
m = a != a.max()
df[m] = df[m].cumsum()
print(df)
A B
1 1 2
2 6 2
3 16 2
4 26 3
5 26 4
6 31 0
7 0 0
8 0 0
9 0 0
Use
In [262]: s = df.ne(0).all(1)
In [263]: l = s[s].index[-1]
In [264]: df[:l] = df.cumsum()
In [265]: df
Out[265]:
A B
1 1 2
2 6 2
3 16 2
4 26 3
5 26 4
6 31 6
7 0 0
8 0 0
9 0 0
I will use last_valid_index
# find the last non-zero row label in each column
v = df.replace(0, np.nan).apply(lambda x: x.last_valid_index())
# keep only the cells at or above that label, cumsum, then restore the zeros
df[pd.DataFrame(df.index.values <= v.values[:, None],
                columns=df.index, index=df.columns).T].cumsum().fillna(0)
Out[890]:
A B
1 1.0 2.0
2 6.0 2.0
3 16.0 2.0
4 26.0 3.0
5 26.0 4.0
6 31.0 6.0
7 0.0 0.0
8 0.0 0.0
9 0.0 0.0
To skip all rows after the first all-zero row, get the first index (over rows) where both df["A"] and df["B"] are 0 using idxmax(0). Note that this cuts at the first all-zero row even if non-zero rows follow later:
>>> m = ((df["A"]==0) & (df["B"]==0)).idxmax(0)
>>> df[:m] = df[:m].cumsum()
>>> df
A B
0 1 2
1 6 2
2 16 2
3 26 3
4 26 4
5 31 6
6 0 0
7 0 0
8 0 0
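For what it's worth, one more sketch in the spirit of the answers above: a reversed cummax marks exactly the trailing all-zero rows, so everything before them can be cumsummed in place:
# True for every row up to and including the last row with any non-zero value
mask = df.ne(0).any(axis=1)[::-1].cummax()[::-1]
df[mask] = df[mask].cumsum()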
