I am working with a dataframe like the one below.
df.head()
Out[20]:
Date Price Open High ... Vol. Change % A Day % OC %
0 2016-04-25 9577.5 9650.0 9685.0 ... 306230.0 -0.83 1.79 -0.75
1 2016-04-26 9660.0 9567.5 9695.0 ... 389490.0 0.86 1.52 0.97
2 2016-04-27 9627.5 9660.0 9682.5 ... 277940.0 -0.34 1.02 -0.34
3 2016-04-28 9595.0 9625.0 9667.5 ... 75120.0 -0.34 1.36 -0.31
4 2016-04-29 9532.5 9567.5 9597.5 ... 138340.0 -0.65 0.73 -0.37
I sliced it with some conditions and, as a result, got a list of row indices, con_down_success, whose length is 96.
Also, I made a list such as,
con_down_success_D1 = [x+1 for x in con_down_success]
What I want to do is the following.
df.iloc[con_down_success_D1,:].Low/df.iloc[con_down_success,:].Price
This code is supposed to produce a calculated series, but too many of the values come out as NaN, as shown below.
(df.iloc[con_down_success_D1,:].Low/df.iloc[con_down_success,:].Price).tail(12)
Out[26]:
778 0.995716
779 NaN
787 NaN
788 NaN
794 NaN
795 NaN
821 NaN
822 NaN
827 NaN
828 NaN
830 NaN
831 NaN
Both series contain actual numbers, not NaN or NA. For example, the following works fine.
df.iloc[831,:].Low/df.iloc[830,:].Price
Out[18]: 0.9968354430379747
Could you tell me how to handle the dataframe to show what I want?
Thanks in advance.
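For context (this note is an addition, not part of the original question): when two Series are divided, pandas first aligns them on their index labels. Because con_down_success_D1 holds x+1 for every x in con_down_success, the two slices share almost no labels, so nearly every pair produces NaN; the lone non-NaN above can only appear because label 778 happens to be in both lists. A minimal sketch of one common workaround, resetting both indices so the division happens positionally:
import pandas as pd

# toy frame and index lists standing in for the question's objects
df = pd.DataFrame({"Price": [100.0, 101.0, 99.0, 98.0],
                   "Low":   [ 99.0, 100.5, 97.0, 96.5]})
con_down_success = [0, 2]
con_down_success_D1 = [x + 1 for x in con_down_success]

# reset_index(drop=True) discards the labels, so the division pairs
# the i-th numerator with the i-th denominator instead of aligning labels
num = df.iloc[con_down_success_D1]["Low"].reset_index(drop=True)
den = df.iloc[con_down_success]["Price"].reset_index(drop=True)
ratio = num / den
print(ratio)  # 0: 100.5/100.0, 1: 96.5/99.0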
How can I inject the values of a dataframe into the NaN positions of a series when the two have different lengths, like these:
series:
0      NaN
1      3.0
2      NaN
3      3.0
4      NaN
      ...
886    NaN
887    2.0
888    NaN
889    3.0
890    NaN
891    2.0
892    3.0
893    1.0
dataframe:
            0
0    0.468979
1    0.470546
2    0.458234
3    0.427878
4    0.494763
..        ...
682  0.458234
683  0.501460
684  0.458234
685  0.494949
686  0.427878
I need to find something that can inject the values of the dataframe into the NaN slots of the series, like below:
0 0.468979 <- row0 of dataframe
1 3.0
2 0.470546 <- row1 of dataframe
3 3.0
4 0.458234 <- row2 of dataframe
...
886 0.458234 <- row684 of dataframe
887 2.0
888 0.494949 <- row685 of dataframe
889 3.0
890 0.427878 <- row686 of dataframe
891 2.0
892 3.0
893 1.0
Actually, I can get the above result by this code:
import numpy as np

j = 0
for i, c in enumerate(series):
    if np.isnan(c):
        series[i] = dataframe[0][j]
        j += 1
but it raises a SettingWithCopyWarning.
How can I inject the dataframe values into the NaN positions of the series without the warning?
Following the previous question (How to remove nan value while combining two column in Panda Data frame?), I tried the code below; however, it doesn't work well because of the different lengths:
series = series.fillna(dataframe[0])
0 0.468979 <- row0 of dataframe
1 3.000000
2 0.458234 <- row2 of dataframe
3 3.000000
4 0.494763 <- row4 of dataframe
...
886 NaN
887 2.000000
888 NaN
889 3.000000
890 NaN
891 2.000000
892 3.000000
893 1.000000
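For reference (an addition, not from the original post): one way to avoid both the length mismatch and, usually, the warning is to assign through a boolean mask, feeding the dataframe column in as a plain array so no index alignment happens. A minimal sketch, assuming the number of NaNs in the series equals the number of dataframe rows, as in the example above:
import numpy as np
import pandas as pd

# toy inputs mirroring the shapes in the question
series = pd.Series([np.nan, 3.0, np.nan, 3.0, np.nan])
dataframe = pd.DataFrame({0: [0.468979, 0.470546, 0.458234]})

# a boolean mask selects the NaN slots; .to_numpy() strips the index,
# so the values are written in order rather than aligned by label
mask = series.isna()
series.loc[mask] = dataframe[0].to_numpy()[: mask.sum()]
print(series)
If series is itself a slice of another dataframe, calling series = series.copy() first should sidestep the SettingWithCopyWarning.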
First question here, and a long one: there are a couple of things I am struggling with regarding merging and formatting my dataframes. I have some half-working solutions, but I am unsure whether they are the best possible given what I want.
Here are the standard formats of the dataframes I am merging with pandas.
df1 =
RT %Area RRT
0 4.83 5.257 0.509
1 6.76 0.424 0.712
2 7.27 0.495 0.766
3 7.70 0.257 0.811
4 7.79 0.122 0.821
5 9.49 92.763 1.000
6 11.40 0.681 1.201
df2=
RT %Area RRT
0 4.83 0.731 0.508
1 6.74 1.243 0.709
2 7.28 0.109 0.766
3 7.71 0.287 0.812
4 7.79 0.177 0.820
5 9.50 95.824 1.000
6 11.31 0.348 1.191
7 11.40 1.166 1.200
8 12.09 0.113 1.273
df3 = ...
Currently I am using a reduce operation on pd.merge_ordered(), as below, to merge my dataframes (three or more). This roughly yields what I want and came from a previous question (pandas three-way joining multiple dataframes on columns). I am merging on RRT, and I want entries with the same RRT values placed on the same row; if an RRT value is unique to one dataset, I want NaN for the missing data from the other datasets.
import os
from functools import reduce
import pandas as pd

# The for loop I use to generate the list of formatted dataframes prior to merging
dfs = []
for entry in os.scandir(directory):
    if entry.path.endswith(".csv") and entry.is_file():
        entry = pd.read_csv(entry.path, header=None)
        # Block of formatting code removed
        dfs.append(entry.round(2))

dfs = [df1ar, df2ar, df3ar]
df_final = reduce(lambda left, right: pd.merge_ordered(left, right, on='RRT'), dfs)
cols = ['RRT', 'RT_x', '%Area_x', 'RT_y', '%Area_y', 'RT', '%Area']
df_final = df_final[cols]
print(df_final)
RRT RT_x %Area_x RT_y %Area_y RT %Area
0 0.508 NaN NaN 4.83 0.731 NaN NaN
1 0.509 4.83 5.257 NaN NaN 4.83 5.257
2 0.709 NaN NaN 6.74 1.243 NaN NaN
3 0.712 6.76 0.424 NaN NaN 6.76 0.424
4 0.766 7.27 0.495 7.28 0.109 7.27 0.495
5 0.811 7.70 0.257 NaN NaN 7.70 0.257
6 0.812 NaN NaN 7.71 0.287 NaN NaN
7 0.820 NaN NaN 7.79 0.177 NaN NaN
8 0.821 7.79 0.122 NaN NaN 7.79 0.122
9 1.000 9.49 92.763 9.50 95.824 9.49 92.763
10 1.191 NaN NaN 11.31 0.348 NaN NaN
11 1.200 NaN NaN 11.40 1.166 NaN NaN
12 1.201 11.40 0.681 NaN NaN 11.40 0.681
13 1.273 NaN NaN 12.09 0.113 NaN NaN
This works, but:
Can I insert a MultiIndex based on the filename of the dataframe the data came from and place it above the corresponding columns? Like the suffix option, but tied back to the filename, and for more than two sets of data. Is this better done prior to merging, and if so, how do I do it? (I've included the for loop I use to create the list of tables prior to merging.)
Is this reduced merge_ordered the simplest way of doing this?
Can I do a similar merge with pd.merge_asof() and use the tolerance value to fine-tune the merging based on the similarity between the RRT values? That is, can it be done without cutting off data from the longer dataframes?
I've tried the above and searched for answers, but I'm struggling to find the most efficient way to do everything I want.
concat = pd.concat(dfs, axis=1, keys=['A','B','C'])
concat_final = concat.round(3)
print(concat_final)
A B C
RT %Area RRT RT %Area RRT RT %Area RRT
0 4.83 5.257 0.509 4.83 0.731 0.508 4.83 5.257 0.509
1 6.76 0.424 0.712 6.74 1.243 0.709 6.76 0.424 0.712
2 7.27 0.495 0.766 7.28 0.109 0.766 7.27 0.495 0.766
3 7.70 0.257 0.811 7.71 0.287 0.812 7.70 0.257 0.811
4 7.79 0.122 0.821 7.79 0.177 0.820 7.79 0.122 0.821
5 9.49 92.763 1.000 9.50 95.824 1.000 9.49 92.763 1.000
6 11.40 0.681 1.201 11.31 0.348 1.191 11.40 0.681 1.201
7 NaN NaN NaN 11.40 1.166 1.200 NaN NaN NaN
8 NaN NaN NaN 12.09 0.113 1.273 NaN NaN NaN
I have also tried this, and I get the MultiIndex denoting which file each block came from (A, B, C are just placeholders). However, it has obviously not merged on the RRT value as I want.
Can I apply an operation to change this into a similar format to the pd.merge_ordered() format above? Would groupby() work?
Thanks!
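Editorial note, not from the original post: one possible way to get both behaviours at once (merging on RRT plus a per-file MultiIndex) is to index each frame by RRT before concatenating; pd.concat along axis=1 outer-joins on the index, so equal RRT values share a row and unique ones get NaN. A minimal sketch, with hypothetical filenames as keys:
import pandas as pd

# two small frames taken from the question's data
df1 = pd.DataFrame({"RT": [4.83, 6.76], "%Area": [5.257, 0.424], "RRT": [0.509, 0.712]})
df2 = pd.DataFrame({"RT": [4.83, 6.74], "%Area": [0.731, 1.243], "RRT": [0.508, 0.709]})

names = ["run1.csv", "run2.csv"]  # hypothetical filenames
# concatenating along columns aligns on the RRT index, placing equal
# RRT values on the same row and filling the gaps with NaN
merged = pd.concat(
    [d.set_index("RRT") for d in (df1, df2)], axis=1, keys=names
).sort_index()
print(merged)
The keys could come from entry.path in the loop above instead of a hard-coded list.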
I have got a file which looks like this
Times Code505 Code200 Code404
1543714067 855 86123 1840
1543714077 869 87327 1857
1543714087 882 88522 1883
1543714097 890 89764 1901
1543714107 904 90735 1924
1543714117 914 91963 1956
except it has a lot more data than this.
What I want to do is plot a graph that looks like this
When I plot my graph, I get something more like this
What I am doing to produce the second graph is
data['Times'] = pd.to_datetime(data['Times'], unit = 's')
data.set_index(['Times'],inplace=True)
data.plot()
I know I am missing something to make my graph look like a time series, but I am unsure what I have to pass to pandas to get the graph to look right.
I am collecting the data for a total of an hour, and every 10 seconds I collect a record that looks like this:
1543714067 855 86123 1840
>>> df
Times Code505 Code200 Code404
0 1543714067 855 86123 1840
1 1543714077 869 87327 1857
2 1543714087 882 88522 1883
3 1543714097 890 89764 1901
4 1543714107 904 90735 1924
5 1543714117 914 91963 1956
>>>
This will calculate the RPS (requests per second) based on twenty-second intervals:
Shift the data up two rows and subtract the original DataFrame:
>>> df.shift(-2)
Times Code505 Code200 Code404
0 1.543714e+09 882.0 88522.0 1883.0
1 1.543714e+09 890.0 89764.0 1901.0
2 1.543714e+09 904.0 90735.0 1924.0
3 1.543714e+09 914.0 91963.0 1956.0
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
>>>
>>> deltas = df.shift(-2) - df
>>> deltas
Times Code505 Code200 Code404
0 20.0 27.0 2399.0 43.0
1 20.0 21.0 2437.0 44.0
2 20.0 22.0 2213.0 41.0
3 20.0 24.0 2199.0 55.0
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
>>>
Divide the deltas by twenty, then reestablish the times.
>>> rates = deltas / 20
>>> rates
Times Code505 Code200 Code404
0 1.0 1.35 119.95 2.15
1 1.0 1.05 121.85 2.20
2 1.0 1.10 110.65 2.05
3 1.0 1.20 109.95 2.75
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
>>> rates['Times'] = df['Times']
>>> rates
Times Code505 Code200 Code404
0 1543714067 1.35 119.95 2.15
1 1543714077 1.05 121.85 2.20
2 1543714087 1.10 110.65 2.05
3 1543714097 1.20 109.95 2.75
4 1543714107 NaN NaN NaN
5 1543714117 NaN NaN NaN
>>>
You can preserve the timestamps throughout the process if you make it the index first.
>>> df
Times Code505 Code200 Code404
0 1543714067 855 86123 1840
1 1543714077 869 87327 1857
2 1543714087 882 88522 1883
3 1543714097 890 89764 1901
4 1543714107 904 90735 1924
5 1543714117 914 91963 1956
>>> df = df.set_index('Times')
>>> df
Code505 Code200 Code404
Times
1543714067 855 86123 1840
1543714077 869 87327 1857
1543714087 882 88522 1883
1543714097 890 89764 1901
1543714107 904 90735 1924
1543714117 914 91963 1956
>>>
>>> deltas = df.shift(-2) - df
>>> rates = deltas / 20
>>> rates
Code505 Code200 Code404
Times
1543714067 1.35 119.95 2.15
1543714077 1.05 121.85 2.20
1543714087 1.10 110.65 2.05
1543714097 1.20 109.95 2.75
1543714107 NaN NaN NaN
1543714117 NaN NaN NaN
>>>
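To tie this back to the plotting part of the question (a suggested final step, my addition rather than something shown above): once the rates are computed on the Times index, converting that index from epoch seconds to datetimes should give a proper time-series plot:
import pandas as pd
import matplotlib.pyplot as plt

# 'rates' is the frame computed above, indexed by epoch-second Times
rates.index = pd.to_datetime(rates.index, unit="s")
rates.plot()
plt.show()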
I have searched around but could not find the answer I was looking for. I have two dataframes: one has fairly discrete integer values in column A (df2); the other does not (df1). I would like to merge the two such that where the column A values are within 1 of each other, the values in columns C and D get merged once, and NaN otherwise.
df1=
A B
0 30.00 -52.382420
1 33.14 -50.392513
2 36.28 -53.699646
3 39.42 -49.228439
.. ... ...
497 1590.58 -77.646561
498 1593.72 -77.049423
499 1596.86 -77.711639
500 1600.00 -78.092979
df2=
A C D
0 0.009 NaN NaN
1 0.036 NaN NaN
2 0.100 NaN NaN
3 10.000 12.4 0.29
4 30.000 12.82 0.307
.. ... ... ...
315 15000.000 NaN 7.65
316 16000.000 NaN 7.72
317 17000.000 NaN 8.36
318 18000.000 NaN 8.35
I would like the output to be
merged=
A B C D
0 30.00 -52.382420 12.82 0.29
1 33.14 -50.392513 NaN NaN
2 36.28 -53.699646 NaN NaN
3 39.42 -49.228439 NaN NaN
.. ... ... ... ...
497 1590.58 -77.646561 NaN NaN
498 1593.72 -77.049423 NaN NaN
499 1596.86 -77.711639 NaN NaN
500 1600.00 -78.092979 28.51 2.5
I tried:
merged = pd.merge_asof(df1, df2, left_on='A', tolerance=1, direction='nearest')
which gives me a MergeError: key must be integer or timestamp.
So far the only way I've been able to successfully merge the dataframes is with:
merged = pd.merge_asof(df1, df2, on='A')
But this takes whatever value in columns C and D happened to be closest, filling rows that I want left as NaN.
For anyone else facing a similar problem, the column that the merge is performed on must be an integer. In my case this meant having to change column A to an int.
df1['A Int'] = df1['A'].astype(int)
df2['A Int'] = df2['A'].astype(int)
merged = pd.merge_asof(df1, df2, on='A Int', direction='nearest', tolerance=1)
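A possible alternative worth testing (an assumption on my part, and likely version-dependent): merge_asof can work with float keys, but the tolerance must then match the key dtype, so passing a float tolerance instead of an int may avoid the integer cast entirely:
# possibly version-dependent: with a float key column, the tolerance
# must itself be a float (e.g. 1.0 rather than 1); merge_asof also
# requires both key columns to be sorted
merged = pd.merge_asof(
    df1.sort_values("A"),
    df2.sort_values("A"),
    on="A",
    direction="nearest",
    tolerance=1.0,
)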
I have a pandas DataFrame of statistics for NBA games. Here's a sample of the data for away teams:
away_team away_efg away_drb away_score
date
2000-10-31 19:00:00 Los Angeles Clippers 0.522 74.4 94
2000-10-31 19:00:00 Milwaukee Bucks 0.434 63.0 93
2000-10-31 19:30:00 Minnesota Timberwolves 0.523 73.8 106
2000-10-31 19:30:00 Charlotte Hornets 0.605 77.1 106
2000-10-31 19:30:00 Seattle SuperSonics 0.429 73.1 88
There are many more numeric columns other than the away_score column, and also analogous columns for the home team.
What I would like is, for each row, to replace the numeric columns (other than score) with the mean of the previous three observations, partitioned by team. I can almost get what I want by doing the following:
home_df.groupby("team").apply(lambda x: x.rolling(window=3).mean())
This returns, for example,
>>> home_avg[home_avg["team"]=="Utah Jazz"].head()
3par ast blk drb efg ftr orb
0 NaN NaN NaN NaN NaN NaN NaN
50 NaN NaN NaN NaN NaN NaN NaN
81 0.146667 71.600000 9.4 74.666667 0.512000 0.347667 25.833333
Taking this, along with
>>> home_df[home_df["team"]=="Utah Jazz"].head()
3par ast blk drb efg ftr orb stl team tov trb
0 0.118 76.7 7.1 64.7 0.535 0.365 25.6 11.5 Utah Jazz 10.8 42.9
50 0.100 63.9 9.1 80.5 0.536 0.414 27.6 2.2 Utah Jazz 20.2 58.6
81 0.222 74.2 12.0 78.8 0.465 0.264 24.3 7.3 Utah Jazz 13.9 50.0
122 0.119 81.8 11.3 75.0 0.515 0.642 25.0 12.2 Utah Jazz 21.8 52.5
135 0.129 76.7 17.8 75.9 0.650 0.400 37.9 5.7 Utah Jazz 18.8 62.7
demonstrates that it is including the current row in the calculation of the mean. I want to avoid this. More specifically, the desired output for row 81 would be all NaNs (because there haven't been three games yet), and the entry in the 3par column for row 122 would be .146667 (the average of the values in that column for rows 0, 50, and 81).
So, my question is, how can I exclude the current row in the rolling mean calculation?
You can use shift here, which moves the values by a given number of rows, so that your rolling window uses the last three values and excludes the current one:
import numpy as np
import pandas as pd

# create a dummy data frame with numeric values
df = pd.DataFrame({"numeric_col": np.random.randint(0, 100, size=5)})
print(df)
numeric_col
0 66
1 60
2 74
3 41
4 83
df["mean"] = df["numeric_col"].shift(1).rolling(window=3).mean()
print(df)
numeric_col mean
0 66 NaN
1 60 NaN
2 74 NaN
3 41 66.666667
4 83 58.333333
Accordingly, change your apply function to lambda x: x.shift(1).rolling(window=3).mean() to make it work in your specific example.
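As a small self-contained check (my sketch, using transform rather than apply so the result stays aligned with the original rows), this reproduces the 0.146667 the question expects for row 122:
import pandas as pd

# hypothetical mini-frame mirroring the Utah Jazz rows from the question
home_df = pd.DataFrame({
    "team": ["Utah Jazz"] * 4,
    "3par": [0.118, 0.100, 0.222, 0.119],
}, index=[0, 50, 81, 122])

# shift(1) drops the current row out of the window before averaging
home_avg = home_df.groupby("team")["3par"].transform(
    lambda x: x.shift(1).rolling(window=3).mean()
)
print(home_avg)  # row 122 -> 0.146667, the mean of rows 0, 50 and 81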