Injecting values of a DataFrame into the NaNs of a Series with a different length - python

How can I inject the values of a dataframe into the NaNs of a series with a different length, like these:
series:
0      NaN
1      3.0
2      NaN
3      3.0
4      NaN
...
886    NaN
887    2.0
888    NaN
889    3.0
890    NaN
891    2.0
892    3.0
893    1.0
dataframe:
            0
0    0.468979
1    0.470546
2    0.458234
3    0.427878
4    0.494763
..        ...
682  0.458234
683  0.501460
684  0.458234
685  0.494949
686  0.427878
I need something that injects the values of the dataframe into the NaNs of the series, like below:
0 0.468979 <- row0 of dataframe
1 3.0
2 0.470546 <- row1 of dataframe
3 3.0
4 0.458234 <- row2 of dataframe
...
886 0.458234 <- row684 of dataframe
887 2.0
888 0.494949 <- row685 of dataframe
889 3.0
890 0.427878 <- row686 of dataframe
891 2.0
892 3.0
893 1.0
Actually, I can get the above result by this code:
j = 0
for i, c in enumerate(series):
    if np.isnan(c):
        series[i] = dataframe[0][j]
        j += 1
but it makes SettingWithCopyWarning.
How can I inject values of dataframe and complement NaN of series without the warning?
According to the previous question: How to remove nan value while combining two column in Panda Data frame?
I tried the below; however, it doesn't work well due to the different lengths:
series = series.fillna(dataframe[0])
0 0.468979 <- row0 of dataframe
1 3.000000
2 0.458234 <- row2 of dataframe
3 3.000000
4 0.494763 <- row4 of dataframe
...
886 NaN
887 2.000000
888 NaN
889 3.000000
890 NaN
891 2.000000
892 3.000000
893 1.000000
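A vectorized alternative, sketched here with toy data standing in for the real series and dataframe (it assumes, as in the question, that the number of NaNs in the series equals the number of dataframe rows): select the NaN positions with a boolean mask and assign through .loc on the original Series, which avoids both the loop and the SettingWithCopyWarning.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the question's data (hypothetical values): the
# number of NaNs in the series equals the number of dataframe rows.
series = pd.Series([np.nan, 3.0, np.nan, 3.0, np.nan])
dataframe = pd.DataFrame({0: [0.468979, 0.470546, 0.458234]})

# Assign the dataframe's column to the NaN positions, in order.
# .loc on the original Series avoids the SettingWithCopyWarning.
series.loc[series.isna()] = dataframe[0].to_numpy()
print(series.tolist())  # [0.468979, 3.0, 0.470546, 3.0, 0.458234]
```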

Related

Update dataframe via for loop

The code below has to update test_df dataframe, which is currently filled with NaNs.
Each 'dig' value (always an integer) has corresponding 'top', 'bottom', 'left' and 'right' values, and the slice of the dataframe corresponding to the top:bottom, left:right range for each 'dig' needs to be filled with that 'dig' value.
For example, if dig=9, top=2, bottom=4, left=1 and right=5, all the NaNs within the range 2:4, 1:5 need to be replaced with 9s.
The following code reports no errors, however, no NaNs are being updated.
for index, row in letters_df.iterrows():
    dig = str(row[0])
    top = int(height) - int(row[2])
    bottom = int(height) - int(row[4])
    left = int(row[1])
    right = int(row[3])
    test_df.iloc[top:bottom, left:right] = dig
test_df:
0 1 2 3 4 5 6 ... 633 634 635 636 637 638 639
0 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN
letters_df:
0 1 2 3 4 5 dig_unique_letters
0 T 36 364 51 388 0 0
1 h 36 364 55 388 0 1
2 i 57 364 71 388 0 2
3 s 76 364 96 388 0 3
4 i 109 364 112 388 0 2
The problem I see is that in letters_df the value in column 4 is higher than the value in column 2. That means that when you do top = int(height) - int(row[2]) and bottom = int(height) - int(row[4]), the value you get for top will be bigger than the value you get for bottom. So when you index with .iloc[top:bottom], there are no rows in the slice. Maybe you should swap top and bottom.
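To illustrate the suggestion, here is a sketch with a tiny made-up grid and one hypothetical letters_df row (the question says dig is always an integer, so an int is used directly here); ordering top and bottom with min/max guarantees a non-empty .iloc slice:

```python
import numpy as np
import pandas as pd

# Toy stand-ins (hypothetical values following the question's layout):
# a 5x5 grid of NaNs and one row of coordinates for a single dig.
test_df = pd.DataFrame(np.nan, index=range(5), columns=range(5))
letters_df = pd.DataFrame([[9, 1, 4, 4, 2]])  # dig, left, y1, right, y2
height = 5

for _, row in letters_df.iterrows():
    dig = int(row[0])
    top = int(height) - int(row[2])     # 5 - 4 = 1
    bottom = int(height) - int(row[4])  # 5 - 2 = 3
    # Ensure top < bottom, otherwise .iloc[top:bottom] selects no rows.
    top, bottom = min(top, bottom), max(top, bottom)
    left, right = int(row[1]), int(row[3])
    test_df.iloc[top:bottom, left:right] = dig

# Rows 1-2, columns 1-3 are now 9; everything else is still NaN.
```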

Calculation does not work on pandas dataframe

I am working with a dataframe such as below.
df.head()
Out[20]:
Date Price Open High ... Vol. Change % A Day % OC %
0 2016-04-25 9577.5 9650.0 9685.0 ... 306230.0 -0.83 1.79 -0.75
1 2016-04-26 9660.0 9567.5 9695.0 ... 389490.0 0.86 1.52 0.97
2 2016-04-27 9627.5 9660.0 9682.5 ... 277940.0 -0.34 1.02 -0.34
3 2016-04-28 9595.0 9625.0 9667.5 ... 75120.0 -0.34 1.36 -0.31
4 2016-04-29 9532.5 9567.5 9597.5 ... 138340.0 -0.65 0.73 -0.37
I sliced it with some conditions. As a result I got a list of sliced indices con_down_success whose length is 96.
Also, I made a list such as,
con_down_success_D1 = [x+1 for x in con_down_success]
What I want to do is below.
df.iloc[con_down_success_D1,:].Low/df.iloc[con_down_success,:].Price
This code is supposed to produce a calculated series, but too many of its values are NaN, like below.
(df.iloc[con_down_success_D1,:].Low/df.iloc[con_down_success,:].Price).tail(12)
Out[26]:
778 0.995716
779 NaN
787 NaN
788 NaN
794 NaN
795 NaN
821 NaN
822 NaN
827 NaN
828 NaN
830 NaN
831 NaN
Both series contain actual numbers, not NaN or NA. For example, this works fine:
df.iloc[831,:].Low/df.iloc[830,:].Price
Out[18]: 0.9968354430379747
Could you tell me how to handle the dataframe to show what I want?
Thanks in advance.
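The NaNs most likely come from index alignment: dividing two pandas Series aligns them by index label, and since con_down_success_D1 is shifted by one, the labels almost never match. A sketch with toy data (standing in for the real df) that divides by position instead, by stripping the index from one operand:

```python
import pandas as pd

# Toy stand-in for the question's df, with only the columns used here.
df = pd.DataFrame({'Price': [100.0, 102.0, 101.0, 98.0],
                   'Low':   [ 99.0, 100.0,  97.0, 96.0]})

con_down_success = [0, 2]
con_down_success_D1 = [x + 1 for x in con_down_success]

# Dividing two Series aligns by index label: the labels {1, 3} and
# {0, 2} never match, so every result is NaN.
aligned = df.iloc[con_down_success_D1].Low / df.iloc[con_down_success].Price

# Dividing by position instead: strip the index from one operand.
by_position = (df.iloc[con_down_success_D1].Low
               / df.iloc[con_down_success].Price.to_numpy())
print(by_position.tolist())  # [1.0, 0.9504950495049505]
```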

Pandas - duration where parameter is "1"

I am new to python and pandas and I am trying to solve this problem:
I have a dataset that looks something like this:
timestamp par_1 par_2
1486873206867 0 0
1486873207039 NaN 0
1486873207185 0 NaN
1486873207506 1 0
1486873207518 NaN NaN
1486873207831 1 0
1486873208148 0 NaN
1486873208469 0 1
1486873208479 1 NaN
1486873208793 1 NaN
1486873208959 NaN 1
1486873209111 1 NaN
1486873209918 NaN 0
1486873210075 0 NaN
I want to know the total duration of the event "1" for each parameter. (Parameters can only be NaN, 1 or 0)
I have already tried
df['duration_par_1'] = df.groupby(['par_1'])['timestamp'].apply(lambda x: x.max() - x.min())
but for further processing, I only need the duration of the event "1" to be in new columns and then that duration needs to be in every row of the new column so that it looks like this:
timestamp par_1 par_2 duration_par_1 duration_par2
1486873206867 0 0 2238 1449
1486873207039 NaN 0 2238 1449
1486873207185 0 NaN 2238 1449
1486873207506 1 0 2238 1449
1486873207518 NaN NaN 2238 1449
1486873207831 1 0 2238 1449
1486873208148 0 NaN 2238 1449
1486873208469 0 1 2238 1449
1486873208479 1 NaN 2238 1449
1486873208793 1 NaN 2238 1449
1486873208959 NaN 1 2238 1449
1486873209111 1 NaN 2238 1449
1486873209918 NaN 0 2238 1449
1486873210075 0 NaN 2238 1449
Thanks in advance!
I believe you need to multiply the values of the par columns by the differences of consecutive timestamps, since the data contains no values other than 0, 1 and NaN:
d = df['timestamp'].diff()
df1 = df.filter(like='par')
#if need duration by some value e.g. by `0`
#df1 = df.filter(like='par').eq(0).astype(int)
s = df1.mul(d, axis=0).sum().astype(int).add_prefix('duration_')
df = df.assign(**s)
print (df)
timestamp par_1 par_2 duration_par_1 duration_par_2
0 1486873206867 0.0 0.0 1110 487
1 1486873207039 NaN 0.0 1110 487
2 1486873207185 0.0 NaN 1110 487
3 1486873207506 1.0 0.0 1110 487
4 1486873207518 NaN NaN 1110 487
5 1486873207831 1.0 0.0 1110 487
6 1486873208148 0.0 NaN 1110 487
7 1486873208469 0.0 1.0 1110 487
8 1486873208479 1.0 NaN 1110 487
9 1486873208793 1.0 NaN 1110 487
10 1486873208959 NaN 1.0 1110 487
11 1486873209111 1.0 NaN 1110 487
12 1486873209918 NaN 0.0 1110 487
13 1486873210075 0.0 NaN 1110 487
Explanation:
First get difference of timestamp column:
print (df['timestamp'].diff())
0 NaN
1 172.0
2 146.0
3 321.0
4 12.0
5 313.0
6 317.0
7 321.0
8 10.0
9 314.0
10 166.0
11 152.0
12 807.0
13 157.0
Name: timestamp, dtype: float64
Select all columns with string par by filter:
print (df.filter(like='par'))
par_1 par_2
0 0.0 0.0
1 NaN 0.0
2 0.0 NaN
3 1.0 0.0
4 NaN NaN
5 1.0 0.0
6 0.0 NaN
7 0.0 1.0
8 1.0 NaN
9 1.0 NaN
10 NaN 1.0
11 1.0 NaN
12 NaN 0.0
13 0.0 NaN
Multiply the filtered columns by d with mul:
print (df1.mul(d, axis=0))
par_1 par_2
0 NaN NaN
1 0.0 0.0
2 0.0 0.0
3 321.0 0.0
4 0.0 0.0
5 313.0 0.0
6 0.0 0.0
7 0.0 321.0
8 10.0 0.0
9 314.0 0.0
10 0.0 166.0
11 152.0 0.0
12 0.0 0.0
13 0.0 0.0
And sum values:
print (df1.mul(d, axis=0).sum())
par_1 1110.0
par_2 487.0
dtype: float64
Convert to integers and change index by add_prefix:
print (df1.mul(d, axis=0).sum().astype(int).add_prefix('duration_'))
duration_par_1 1110
duration_par_2 487
dtype: int32
Finally, create the new columns with assign.
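The steps above can be combined into a self-contained sketch, using a shortened made-up version of the question's data:

```python
import numpy as np
import pandas as pd

# Shortened toy version of the question's data.
df = pd.DataFrame({
    'timestamp': [100, 272, 418, 739, 751],
    'par_1': [0, np.nan, 0, 1, np.nan],
    'par_2': [0, 0, np.nan, 0, np.nan],
})

d = df['timestamp'].diff()   # gap since the previous sample
df1 = df.filter(like='par')  # all par_* columns
# Weight each gap by the par value (1 counts the gap, 0 does not,
# NaN is skipped by sum), then total per column.
s = df1.mul(d, axis=0).sum().astype(int).add_prefix('duration_')
df = df.assign(**s)
print(s.to_dict())  # {'duration_par_1': 321, 'duration_par_2': 0}
```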

Pandas merge_asof tolerance must be integer

I have searched around but could not find the answer I was looking for. I have two dataframes: one has fairly discrete integer values in column A (df2), the other does not (df1). I would like to merge the two such that where column A is within 1, the values in columns C and D get merged once, and NaN otherwise.
df1=
A B
0 30.00 -52.382420
1 33.14 -50.392513
2 36.28 -53.699646
3 39.42 -49.228439
.. ... ...
497 1590.58 -77.646561
498 1593.72 -77.049423
499 1596.86 -77.711639
500 1600.00 -78.092979
df2=
A C D
0 0.009 NaN NaN
1 0.036 NaN NaN
2 0.100 NaN NaN
3 10.000 12.4 0.29
4 30.000 12.82 0.307
.. ... ... ...
315 15000.000 NaN 7.65
316 16000.000 NaN 7.72
317 17000.000 NaN 8.36
318 18000.000 NaN 8.35
I would like the output to be
merged=
A B C D
0 30.00 -52.382420 12.82 0.29
1 33.14 -50.392513 NaN NaN
2 36.28 -53.699646 NaN NaN
3 39.42 -49.228439 NaN NaN
.. ... ... ... ...
497 1590.58 -77.646561 NaN NaN
498 1593.72 -77.049423 NaN NaN
499 1596.86 -77.711639 NaN NaN
500 1600.00 -78.092979 28.51 2.5
I tried:
merged = pd.merge_asof(df1, df2, left_on='A', tolerance=1, direction='nearest')
Which gives me a MergeError: key must be integer or timestamp.
So far the only way I've been able to successfully merge the dataframes is with:
merged = pd.merge_asof(df1, df2, on='A')
But this takes whatever value was close enough in columns C and D and fills in the NaN values.
For anyone else facing a similar problem: the type of tolerance must be compatible with the column the merge is performed on, and with an integer tolerance that column must be an integer. In my case this meant converting column A to an int.
df1['A Int'] = df1['A'].astype(int)
df2['A Int'] = df2['A'].astype(int)
merged = pd.merge_asof(df1, df2, on='A Int', direction='nearest', tolerance=1)
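As an aside, newer pandas versions also accept a float tolerance when the merge key itself is a float, which avoids the int conversion; a minimal sketch with made-up numbers (not the question's data):

```python
import pandas as pd

# Toy stand-ins for df1/df2 with float keys in column A.
df1 = pd.DataFrame({'A': [30.0, 33.14, 36.28], 'B': [-52.4, -50.4, -53.7]})
df2 = pd.DataFrame({'A': [10.0, 30.0, 37.0], 'C': [12.4, 12.82, 13.0]})

# With float keys, a float tolerance is accepted; rows with no match
# within 1.0 of their A value get NaN in C.
merged = pd.merge_asof(df1.sort_values('A'), df2.sort_values('A'),
                       on='A', direction='nearest', tolerance=1.0)
print(merged['C'].tolist())  # [12.82, nan, 13.0]
```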

Why is this concatenation of float values in a pandas dataframe giving NaN output?

I have a bunch of pandas dataframes with float values. I want to concatenate them using pandas.
df1 =
hapX_Sp_Sum
contig pos F1_2ms04h_PI
0 2 16229767 726 3.5
1 2 16229783 726 3.5
2 2 16229880 726 2.0
3 2 16229891 726 2.0
4 2 16229982 726 0.0
5 2 16229992 726 0.0
df2 =
hapX_My_Sum
contig pos F1_2ms04h_PI
0 2 16229767 726 0.0
1 2 16229783 726 0.0
2 2 16229880 726 0.0
3 2 16229891 726 0.0
4 2 16229982 726 0.0
5 2 16229992 726 0.0
I concatenate them as:
frames = [df1, df2]
merged_df = pd.concat(frames, axis = 1)
The output I am getting:
hapX_My_Sum hapX_Sp_Sum
contig pos F1_2ms04h_PI
0 2 16229767 726 0.0 NaN
1 2 16229783 726 0.0 NaN
2 2 16229880 726 0.0 NaN
3 2 16229891 726 0.0 NaN
4 2 16229982 726 0.0 NaN
5 2 16229992 726 0.0 NaN
The values in each column are floats, so why am I running into this NaN problem? I generated these dataframes by taking a pandas sum of float values, which should make every value in the column a float. This is weird, any idea?
Thanks,
This looks normal to me: with axis=1 you concatenate along the columns, and the rows are aligned by index, so hapX_Sp_Sum is of course empty for the rows that only exist in the other dataframe. If you print more lines you'll find non-empty values there (but NaNs for the other column this time).
I suspect what you really want to do is
merged_df = pd.concat(frames, axis = 0)
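The alignment behaviour can be reproduced with a small sketch (toy frames standing in for df1/df2): when the indexes don't overlap, axis=1 fills with NaN, while resetting the index makes the rows line up by position.

```python
import pandas as pd

# Toy stand-ins: two single-column frames whose indexes do not
# overlap, mimicking the question's setup.
df1 = pd.DataFrame({'hapX_Sp_Sum': [3.5, 2.0]}, index=[0, 1])
df2 = pd.DataFrame({'hapX_My_Sum': [0.0, 0.0]}, index=[2, 3])

# axis=1 aligns on the index: non-overlapping labels produce NaN.
side = pd.concat([df1, df2], axis=1)
print(side.isna().sum().sum())   # 4 NaN cells

# Resetting the index first lines the rows up position by position.
fixed = pd.concat([d.reset_index(drop=True) for d in (df1, df2)], axis=1)
print(fixed.isna().sum().sum())  # 0 NaN cells
```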
