Python pandas: How to match data between two dataframes

The first dataframe (df1) is similar to this:
Result          A      B      C
2021-12-31  False   True   True
2022-01-01  False  False   True
2022-01-02  False   True  False
2022-01-03   True  False   True
df2 is an updated version of df1; the dates are new and columns may have been added. It looks similar to this:
Result          A      B      C     D
2022-01-04  False  False   True  True
2022-01-05   True  False   True  True
2022-01-06  False   True  False  True
2022-01-07  False  False   True  True
I want to combine the two dataframes, but I don't know how to do it.
I want to get a result similar to the following:
Result          A      B      C     D
2021-12-31  False   True   True   NaN
2022-01-01  False  False   True   NaN
2022-01-02  False   True  False   NaN
2022-01-03   True  False   True   NaN
2022-01-04  False  False   True  True
2022-01-05   True  False   True  True
2022-01-06  False   True  False  True
2022-01-07  False  False   True  True
Thank you very much!

Use pd.concat while ignoring the indexes:
df_new = pd.concat([df1, df2], ignore_index=True)
Any missing values will be NaN.
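For reference, here is a minimal, self-contained sketch of that approach. It assumes the dates live in a regular 'Result' column (not the index), as in the tables above; the 'D' column that only exists in df2 is filled with NaN for the df1 rows.
import pandas as pd

# Toy versions of df1 and df2 (only two rows each for brevity)
df1 = pd.DataFrame({'Result': ['2021-12-31', '2022-01-01'],
                    'A': [False, False],
                    'B': [True, False],
                    'C': [True, True]})
df2 = pd.DataFrame({'Result': ['2022-01-04', '2022-01-05'],
                    'A': [False, True],
                    'B': [False, False],
                    'C': [True, True],
                    'D': [True, True]})

# Stack the rows; columns missing from df1 (here 'D') become NaN
df_new = pd.concat([df1, df2], ignore_index=True)
print(df_new)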

Related

Pandas and Numpy consecutive non-NaN values

I'm trying to use np.where to count consecutive non-NaN values longer than a certain length as shown below:
e.g. if there are 3 or more consecutive non-NaN values, return True.
Would appreciate any help!
value  consecutive
  nan        False
  nan        False
    1        False
    1        False
  nan        False
    4         True
    2         True
    3         True
  nan        False
  nan        False
    1         True
    3         True
    3         True
    5         True
The idea is to create group labels by taking the cumulative sum of the missing-value mask, then map each label to its run length with Series.map and Series.value_counts, using only the non-NaN rows selected by the inverted mask ~m:
#convert values to numeric
df['value'] = df['value'].astype(float)
m = df['value'].isna()
s = m.cumsum()
N = 3
df['new'] = s.map(s[~m].value_counts()).ge(N) & ~m
print (df)
value consecutive new
0 NaN False False
1 NaN False False
2 1.0 False False
3 1.0 False False
4 NaN False False
5 4.0 True True
6 2.0 True True
7 3.0 True True
8 NaN False False
9 NaN False False
10 1.0 True True
11 3.0 True True
12 3.0 True True
13 5.0 True True
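To make the trick easier to follow, here is an illustrative breakdown on a short made-up series (not the question's data): the cumulative sum of the NaN mask labels each run of non-NaN values, value_counts gives each run's length, and mapping that length back onto the rows keeps only runs of at least N values.
import numpy as np
import pandas as pd

values = pd.Series([np.nan, 1, 2, 3, np.nan, 4, 5], dtype=float)

m = values.isna()                 # True where the value is missing
s = m.cumsum()                    # run label: 1 1 1 1 2 2 2
run_len = s[~m].value_counts()    # label 1 -> 3 non-NaN values, label 2 -> 2
N = 3
new = s.map(run_len).ge(N) & ~m   # True only inside runs of >= N non-NaN values

print(pd.DataFrame({'value': values, 'label': s, 'new': new}))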

pandas groupby column and check if group meets multiple conditions

I have a DataFrame that looks like the following:
X Y Date are_equal
0 50.0 10.0 2018-08-19 False
1 NaN 10.0 2018-08-19 False
2 NaN 50.0 2018-08-19 True
3 10.0 NaN 2018-08-21 False
4 1.0 NaN 2018-08-19 False
5 NaN 10.0 2018-08-22 False
6 10.0 NaN 2018-08-21 False
The are_equal column indicates that a value in Y is in X for the same date (in this case 50.0).
I am trying to group by date and find whether X contains a specific value (say 1.0) for a date that contains are_equal True.
My approach was to use df.iterrows() and get the row at next index after meeting the condition df['are_equal'] == True. However, the rows aren't necessarily ordered.
How can I group by Date and check if a date contains True in are_equal and 1.0 in column X for that same date?
The output I'm trying to achieve is a new Boolean column that looks like this:
contains_specific_value
0 False
1 False
2 False
3 False
4 True
5 False
6 False
Let us use apply; it can handle more conditions but is slow. You can also check the transform solution below.
df['New']=df.groupby('Date').apply(lambda x : (x['X']==1)&x['are_equal'].any()).reset_index(level=0,drop=True)
df
Out[101]:
X Y Date are_equal New
0 50.0 10.0 2018-08-19 False False
1 NaN 10.0 2018-08-19 False False
2 NaN 50.0 2018-08-19 True False
3 10.0 NaN 2018-08-21 False False
4 1.0 NaN 2018-08-19 False True
5 NaN 10.0 2018-08-22 False False
6 10.0 NaN 2018-08-21 False False
Or use transform:
df['X'].eq(1) & (df.groupby('Date')['are_equal'].transform('any'))
Out[102]:
0 False
1 False
2 False
3 False
4 True
5 False
6 False
dtype: bool
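For completeness, here is a self-contained sketch of the transform idea using the question's sample data; it builds the contains_specific_value column in one pass.
import numpy as np
import pandas as pd

df = pd.DataFrame({'X': [50.0, np.nan, np.nan, 10.0, 1.0, np.nan, 10.0],
                   'Y': [10.0, 10.0, 50.0, np.nan, np.nan, 10.0, np.nan],
                   'Date': ['2018-08-19', '2018-08-19', '2018-08-19',
                            '2018-08-21', '2018-08-19', '2018-08-22',
                            '2018-08-21'],
                   'are_equal': [False, False, True, False, False, False, False]})

# A row qualifies when X equals the target value (1.0 here) and its Date
# group contains at least one are_equal == True
df['contains_specific_value'] = (df['X'].eq(1)
                                 & df.groupby('Date')['are_equal'].transform('any'))
print(df['contains_specific_value'])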

How to conditionally drop rows in pandas

I have the following dataframe:
True_False cum_val
Date
2018-01-02 False NaN
2018-01-03 False 0.006399
2018-01-04 False 0.010427
2018-01-05 False 0.017461
2018-01-08 False 0.019124
2018-01-09 False 0.020426
2018-01-10 False 0.019314
2018-01-11 False 0.026348
2018-01-12 False 0.033098
2018-01-16 False 0.029573
2018-01-17 False 0.038988
2018-01-18 False 0.037372
2018-01-19 False 0.041757
2018-01-22 False 0.049824
2018-01-23 False 0.051998
2018-01-24 False 0.051438
2018-01-25 False 0.052041
2018-01-26 False 0.063882
2018-01-29 False 0.057150
2018-01-30 True -0.010899
2018-01-31 True -0.010410
2018-02-01 True -0.011058
2018-02-02 True -0.032266
2018-02-05 True -0.073246
2018-02-06 True -0.055805
2018-02-07 True -0.060806
2018-02-08 True -0.098343
2018-02-09 True -0.083407
2018-02-12 False 0.013915
2018-02-13 False 0.016528
2018-02-14 False 0.029930
2018-02-15 False 0.041999
2018-02-16 False 0.042373
2018-02-20 False 0.036531
2018-02-21 False 0.031035
2018-03-06 False 0.013671
How can I drop the rows starting from the second True value and continue dropping through the second False value?
Such as for example:
True_False cum_val
Date
2020-01-21 False 0.022808
2020-01-22 False 0.023097
2020-01-23 True 0.001141
2020-01-24 True -0.007901 # <- Start drop here since this is the second True
2020-01-27 True -0.023632
2020-01-28 False -0.013578
2020-01-29 False -0.000867 #< - End Drop Here Since this is the second False
2020-01-30 False 0.003134
Edit 1:
I would like to add 1 more condition on the new df:
2020-01-22 0.000289 False
2020-01-23 0.001141 True
2020-01-27 -0.015731 True # <- Start Drop Here
2020-01-28 0.010054 True
2020-01-29 -0.000867 False
2020-01-30 0.003134 True #<-End drop here
2020-02-03 0.007255 True
As you have mentioned in the comment: [True, True, True, False, True]
In this case it would still start the drop at the second True value, but would stop right after the first False, even though the value after it has toggled back to True. If the next value is still True, keep dropping until the value after a False.
Let's try using where with ffill and parameter limit=2 then boolean filtering:
df[~(df['True_False'].where(df['True_False']).ffill(limit=2).cumsum() > 1)]
Output:
| | Date | True_False | cum_val |
|----|------------|--------------|-----------|
| 0 | 2020-01-21 | False | 1 |
| 1 | 2020-01-22 | False | 2 |
| 2 | 2020-01-23 | True | 3 |
| 7 | 2020-01-28 | False | 8 |
Details:
First, convert the False values to np.nan using where.
Next, fill the first two np.nan values after the last True using ffill(limit=2).
Now, use cumsum to count the consecutive True values and select those greater than 1.
Finally, negate the mask to keep the rows up to the first True and from the third False onward.
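Here is the same chain spelled out step by step, as a sketch on the toy True_False pattern from the example data above:
import pandas as pd

s = pd.Series([False, False, True, True, True, False, False, False])

kept_true = s.where(s)              # False -> NaN, True stays True
filled = kept_true.ffill(limit=2)   # carry True over at most two following rows
counter = filled.cumsum()           # running count of True (NaN stays NaN)
keep = ~(counter > 1)               # drop everything from the second True on

print(pd.DataFrame({'True_False': s, 'filled': filled,
                    'counter': counter, 'keep': keep}))
print(s[keep])                      # rows 0, 1, 2 and 7 survive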
Here's what I tried.
The data I created is:
Date True_False cum_val
0 2020-01-21 False 1
1 2020-01-22 False 2
2 2020-01-23 True 3
3 2020-01-24 True 4
4 2020-01-25 True 5
5 2020-01-26 False 6
6 2020-01-27 False 7
7 2020-01-28 False 8
true_count = 0
false_count = 0
drop_continue = False
for index, row in df.iterrows():
    if row['True_False'] is True and drop_continue is False:
        true_count += 1
        if true_count == 2:
            drop_continue = True
            df.drop(index, inplace=True)
            true_count = 0
            continue
    if drop_continue is True:
        if row['True_False'] is True:
            df.drop(index, inplace=True)
        if row['True_False'] is False:
            false_count += 1
            if false_count < 2:
                df.drop(index, inplace=True)
            else:
                drop_continue = False
                false_count = 0
Output
Date True_False cum_val
0 2020-01-21 False 1
1 2020-01-22 False 2
2 2020-01-23 True 3
6 2020-01-27 False 7
7 2020-01-28 False 8
You could use Series.shift and Series.bfill:
df = df[~df['True_False'].shift().bfill()]
print(df)
Date True_False cum_val
0 2020-01-21 False 0.022808
1 2020-01-22 False 0.023097
2 2020-01-23 True 0.001141
6 2020-01-29 False -0.000867
7 2020-01-30 False 0.003134
You can do:
#mark start of the area you want to drop
df["dropit"]=np.where(df["True_False"] & df["True_False"].shift(1) & np.logical_not(df["True_False"].shift(2)), "start", None)
#mark the end of the drop area
df["dropit"]=np.where(np.logical_not(df["True_False"].shift(1)) & df["True_False"].shift(2), "end", df["dropit"])
#indicate gaps between the different drop areas:
df.loc[df["dropit"].shift().eq("end")&df["dropit"].ne("start"), "dropit"]="keep"
#forward fill
df["dropit"]=df["dropit"].ffill()
#drop marked drop areas and drop "dropit" column
df=df.drop(df.loc[df["dropit"].isin(["start", "end"])].index, axis=0).drop("dropit", axis=1)
Outputs:
True_False cum_val
Date
2018-01-02 False NaN
2018-01-03 False 0.006399
2018-01-04 False 0.010427
2018-01-05 False 0.017461
2018-01-08 False 0.019124
2018-01-09 False 0.020426
2018-01-10 False 0.019314
2018-01-11 False 0.026348
2018-01-12 False 0.033098
2018-01-16 False 0.029573
2018-01-17 False 0.038988
2018-01-18 False 0.037372
2018-01-19 False 0.041757
2018-01-22 False 0.049824
2018-01-23 False 0.051998
2018-01-24 False 0.051438
2018-01-25 False 0.052041
2018-01-26 False 0.063882
2018-01-29 False 0.057150
2018-01-30 True -0.010899
2018-02-14 False 0.029930
2018-02-15 False 0.041999
2018-02-16 False 0.042373
2018-02-20 False 0.036531
2018-02-21 False 0.031035
2018-03-06 False 0.013671

Applying values to a DataFrame without using a for-loop

I'm looking for a faster method of applying values to a column in a DataFrame. The value is based on two True and False values in the first and second column. This is my current solution:
df['result'] = df.check1.astype(int)
for i in range(len(df)):
    if df.result[i] != 1:
        df.result[i] = df.result.shift(1)[i] + df.check2[i].astype(int)
Which yields this result:
check1 check2 result
0 True False 1
1 False False 1
2 False False 1
3 False False 1
4 False False 1
5 False False 1
6 False True 2
7 False False 2
8 False True 3
9 False False 3
10 False True 4
11 False False 4
12 False True 5
13 False False 5
14 False True 6
15 False False 6
16 False True 7
17 False False 7
18 False False 7
19 False False 7
20 False True 8
21 False False 8
22 False True 9
23 True False 1
24 False False 1
So the third column needs to be a number based on the value in the row above it.
If check1 is True the number needs to go back to 1. If check2 is True, 1 needs to be added to the number. Otherwise the number stays the same.
The current code works, but it's taking too long as I need to apply this to a DataFrame with approx. 70,000 rows. I'm pretty sure it can be improved (I'm guessing using the apply function, but I'm not sure). Any ideas?
Use pandas.DataFrame.groupby.cumsum:
import pandas as pd
df['result'] = df.groupby(df['check1'].cumsum())[['check1', 'check2']].cumsum().sum(1)
Or @Dan's suggestion:
df['result'] = df.groupby(df['check1'].cumsum())['check2'].cumsum().add(1)
Output:
check1 check2 result
0 True False 1.0
1 False False 1.0
2 False False 1.0
3 False False 1.0
4 False False 1.0
5 False False 1.0
6 False True 2.0
7 False False 2.0
8 False True 3.0
9 False False 3.0
10 False True 4.0
11 False False 4.0
12 False True 5.0
13 False False 5.0
14 False True 6.0
15 False False 6.0
16 False True 7.0
17 False False 7.0
18 False False 7.0
19 False False 7.0
20 False True 8.0
21 False False 8.0
22 False True 9.0
23 True False 1.0
24 False False 1.0
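To see why the grouping works, here is a short sketch on made-up check1/check2 data: check1.cumsum() produces a group id that increases every time check1 is True (i.e. every time the counter should reset), and within each group the running count of check2 plus one is exactly the desired number.
import pandas as pd

df = pd.DataFrame({'check1': [True, False, False, False, True, False],
                   'check2': [False, False, True, False, False, True]})

group_id = df['check1'].cumsum()                           # 1 1 1 1 2 2
df['result'] = df.groupby(group_id)['check2'].cumsum().add(1)
print(df)                                                  # result: 1 1 2 2 1 2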
You want to iterate a dataframe using the value of the preceding row. In that case, the most efficient way is to directly iterate the underlying numpy arrays:
df = pd.read_fwf(io.StringIO(t))
df['result'] = df.check1.astype(int)
res = df['result'].values
c1 = df['check1'].values
c2 = df['check2'].values
old = -1
for i in range(len(df)):
    if res[i] != 1:
        res[i] = old + int(c2[i])
    old = res[i]
This works fine because numpy arrays are mutable types, so the changes are reflected in the dataframe.
Timeit says that this is twice as fast as @Chris's original solution, and still 1.5 times faster after @Dan's improvement.

Python Pandas Boolean Dataframe Where Dataframe Equals False - Returns 0 instead of False?

If I have a Dataframe with True/False values only like this:
df_mask = pd.DataFrame({'AAA': [True] * 4,
                        'BBB': [False] * 4,
                        'CCC': [True, False, True, False]}); print(df_mask)
AAA BBB CCC
0 True False True
1 True False False
2 True False True
3 True False False
Then try to print where the values in the dataframe is equivalent to False like so:
print(df_mask[df_mask == False])
print(df_mask.where(df_mask == False))
My question is about column CCC. Column BBB shows False (as I expect) but why is index 1 and 3 in column CCC equal to 0 instead of False?
AAA BBB CCC
0 NaN False NaN
1 NaN False 0
2 NaN False NaN
3 NaN False 0
AAA BBB CCC
0 NaN False NaN
1 NaN False 0
2 NaN False NaN
3 NaN False 0
Why doesn't it return a dataframe that looks like this?
AAA BBB CCC
0 NaN False NaN
1 NaN False False
2 NaN False NaN
3 NaN False False
Not entirely sure why, but if you're looking for a quick fix to convert it back to bools you can do the following:
>>> df_bool = df_mask.where(df_mask == False).astype(bool)
>>> df_bool
AAA BBB CCC
0 True False True
1 True False False
2 True False True
3 True False False
This is because the returned dataframe has a different dtype: it's no longer a dataframe of bools. Since NaN cannot be stored in a boolean column, any column that receives NaN is upcast to float64, which is why False shows up as 0.
>>> df2 = df_mask.where(df_mask == False)
>>> df2.dtypes
AAA float64
BBB bool
CCC float64
dtype: object
This even occurs if you force it to a bool dtype from the get-go:
>>> df_mask = pd.DataFrame({'AAA': [True] * 4,
... 'BBB': [False]*4,
... 'CCC': [True, False, True, False]}, dtype=bool); print(df_mask)
AAA BBB CCC
0 True False True
1 True False False
2 True False True
3 True False False
>>> df2 = df_mask.where(df_mask == False)
>>> df2
AAA BBB CCC
0 NaN False NaN
1 NaN False 0
2 NaN False NaN
3 NaN False 0
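A tiny sketch (using nothing beyond the question's data) makes the upcast visible: as soon as where introduces NaN into a boolean column, that column is converted to float64, which is why False prints as 0.
import pandas as pd

col = pd.Series([True, False, True, False], name='CCC')
print(col.dtype)           # bool

col_where = col.where(col == False)
print(col_where.dtype)     # float64, because NaN cannot live in a bool column
print(col_where.tolist())  # [nan, 0.0, nan, 0.0]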
If you're explicitly worried about memory, you can also just return a reference, but be careful unless you're deliberately ignoring the old reference (in which case it shouldn't matter):
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.astype.html
http://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html
