Pandas - Identify Last Row by Date - python

I'm trying to accomplish two things in my Pandas dataframe:
Create new column Last Row ('Yes' or 'No') based on new DateCompleted
Capture the next transaction on the current row, unless it's a new DateCompleted (in which case mark as Null).
Original Dataset
DateCompleted TranNumber Sales
0 1/1/17 10:15AM 3133 130.31
1 1/1/17 11:21AM 3531 103.12
2 1/1/17 12:31PM 3652 99.23
3 1/2/17 9:31AM 3689 83.22
4 1/2/17 10:31AM 3701 29.93
5 1/3/17 8:30AM 3709 31.31
Desired Output
DateCompleted TranNumber Sales NextTranSales LastRow
0 1/1/17 10:15AM 3133 130.31 103.12 No
1 1/1/17 11:21AM 3531 103.12 99.23 No
2 1/1/17 12:31PM 3652 99.23 NaN Yes
3 1/2/17 9:31AM 3689 83.22 29.93 No
4 1/2/17 10:31AM 3701 29.93 NaN Yes
5 1/3/17 8:30AM 3709 31.31 ... No
I can get the NextTranSales based on:
df['NextTranSales'] = df.Sales.shift(-1)
But I'm having trouble determining the last row in the DateCompleted group and marking NextTranSales as Null if it is the last row.
Thanks for your help!

If your data frame has been sorted by the DateCompleted column, then you might just need groupby.shift:
date = pd.to_datetime(df.DateCompleted).dt.date
df["NextTranSales"] = df.groupby(date).Sales.shift(-1)
If you need the LastRow column, you can find the index of the last row in each group with groupby and then assign 'Yes' to those rows:
last_row_index = df.groupby(date, as_index=False).apply(lambda g: g.index[-1])
df["LastRow"] = "No"
df.loc[last_row_index, "LastRow"] = "Yes"
df
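As an aside (not part of either answer), if the frame is sorted by DateCompleted, the last row of each date group can also be marked without groupby.apply by using duplicated on the date key; a minimal sketch:
import numpy as np

# every row except the last occurrence of each date is flagged True by keep='last'
df['LastRow'] = np.where(date.duplicated(keep='last'), 'No', 'Yes')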

NOTE: This depends on Sales being free of NaN values. If it has any NaN, we will get erroneous last-row determinations, because I'm leveraging the fact that the shifted column leaves a NaN in the last position of each group.
d = df.DateCompleted.dt.date
m = {True: 'Yes', False: 'No'}
s = df.groupby(d).Sales.shift(-1)
df = df.assign(NextTranSales=s).assign(LastRow=s.isnull().map(m))
print(df)
DateCompleted TranNumber Sales NextTranSales LastRow
0 2017-01-01 10:15:00 3133 130.31 103.12 No
1 2017-01-01 11:21:00 3531 103.12 99.23 No
2 2017-01-01 12:31:00 3652 99.23 NaN Yes
3 2017-01-02 09:31:00 3689 83.22 29.93 No
4 2017-01-02 10:31:00 3701 29.93 NaN Yes
5 2017-01-03 08:30:00 3709 31.31 NaN Yes
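To make that caveat concrete, here is a tiny hypothetical illustration: a NaN in Sales in the middle of a group makes the previous row's shifted value NaN as well, so that row would be mislabeled 'Yes':
import numpy as np

demo = pd.DataFrame({'g': ['a', 'a', 'a'], 'Sales': [1.0, np.nan, 3.0]})
# shift(-1) within the group gives [NaN, 3.0, NaN]: row 0 now looks like a last row
print(demo.groupby('g').Sales.shift(-1).isnull())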
We can drop the no-NaN restriction with this:
d = df.DateCompleted.dt.date
m = {True: 'Yes', False: 'No'}
s = df.groupby(d).Sales.shift(-1)
l = pd.Series(
    'Yes', df.groupby(d).tail(1).index
).reindex(df.index, fill_value='No')
df.assign(NextTranSales=s).assign(LastRow=l)
DateCompleted TranNumber Sales NextTranSales LastRow
0 2017-01-01 10:15:00 3133 130.31 103.12 No
1 2017-01-01 11:21:00 3531 103.12 99.23 No
2 2017-01-01 12:31:00 3652 99.23 NaN Yes
3 2017-01-02 09:31:00 3689 83.22 29.93 No
4 2017-01-02 10:31:00 3701 29.93 NaN Yes
5 2017-01-03 08:30:00 3709 31.31 NaN Yes

Related

Why is pandas str.replace returning NaN?

I am trying to remove the comma separator from values in a dataframe in Pandas to enable me to convert them to integers. I have been using the following method:
df_orders['qty'] = df_orders['qty'].str.replace(',','')
However, this seems to be returning NaN values for some numbers which did not originally contain ',' in their values. I have included a sample of my input data and current output below:
Input:
date sku qty
556603 2020-10-25 A 6
590904 2020-10-21 A 5
595307 2020-10-20 A 31
602678 2020-10-19 A 11
615022 2020-10-18 A 2
641077 2020-10-16 A 1
650203 2020-10-15 A 3
655363 2020-10-14 A 18
667919 2020-10-13 A 5
674990 2020-10-12 A 2
703901 2020-10-09 A 1
715411 2020-10-08 A 1
721557 2020-10-07 A 31
740515 2020-10-06 A 49
752670 2020-10-05 A 4
808426 2020-09-28 A 2
848057 2020-09-23 A 1
865751 2020-09-21 A 2
886630 2020-09-18 A 3
901095 2020-09-16 A 47
938648 2020-09-10 A 2
969909 2020-09-07 A 3
1021548 2020-08-31 A 2
1032254 2020-08-30 A 8
1077443 2020-08-25 A 5
1089670 2020-08-24 A 24
1098843 2020-08-23 A 16
1102025 2020-08-22 A 23
1179347 2020-08-12 A 1
1305700 2020-07-29 A 1
1316343 2020-07-28 A 1
1399930 2020-07-19 A 1
1451864 2020-07-15 A 1
1463195 2020-07-14 A 15
2129080 2020-05-19 A 1
2143468 2020-05-18 A 1
Current Output:
date sku qty
556603 2020-10-25 A 6
590904 2020-10-21 A 5
595307 2020-10-20 A 31
602678 2020-10-19 A 11
615022 2020-10-18 A 2
641077 2020-10-16 A 1
650203 2020-10-15 A 3
655363 2020-10-14 A NaN
667919 2020-10-13 A NaN
674990 2020-10-12 A NaN
703901 2020-10-09 A NaN
715411 2020-10-08 A NaN
721557 2020-10-07 A NaN
740515 2020-10-06 A NaN
752670 2020-10-05 A NaN
808426 2020-09-28 A 2
848057 2020-09-23 A 1
865751 2020-09-21 A 2
886630 2020-09-18 A 3
901095 2020-09-16 A 47
938648 2020-09-10 A NaN
969909 2020-09-07 A NaN
1021548 2020-08-31 A NaN
1032254 2020-08-30 A NaN
1077443 2020-08-25 A NaN
1089670 2020-08-24 A NaN
1098843 2020-08-23 A NaN
1102025 2020-08-22 A NaN
1179347 2020-08-12 A NaN
1305700 2020-07-29 A NaN
1316343 2020-07-28 A 1
1399930 2020-07-19 A 1
1451864 2020-07-15 A 1
1463195 2020-07-14 A 15
2129080 2020-05-19 A 1
2143468 2020-05-18 A 1
I have had a look around but can't seem to find what is causing this error.
I was able to reproduce your issue:
# toy df
df
qty
0 1
1 2,
2 3
df['qty'].str.replace(',', '')
0 NaN
1 2
2 NaN
Name: qty, dtype: object
I created df by doing this:
df = pd.DataFrame({'qty': [1, '2,', 3]})
In other words, your column has mixed data types: some values are integers while others are strings. When you apply .str methods to a mixed-type column, non-string values are converted to NaN to indicate that it doesn't make sense to run a string method on an int.
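A quick way to confirm the mixed types (a diagnostic sketch, not part of the original answer) is to count the Python types present in the column:
# counts each Python type in the column; seeing both int and str confirms the diagnosis
print(df['qty'].map(type).value_counts())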
You may fix this by converting the entire column to string, then back to int:
df['qty'].astype(str).str.replace(',', '').astype(int)
Or if you want something a little more robust, try:
df['qty'] = pd.to_numeric(
    df['qty'].astype(str).str.extract(r'(\d+)', expand=False), errors='coerce')
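On the toy frame above, plus a hypothetical junk value, the robust version keeps the digits where possible and coerces the rest to NaN:
demo = pd.DataFrame({'qty': [1, '2,', 'n/a']})   # 'n/a' is a made-up bad value
print(pd.to_numeric(demo['qty'].astype(str).str.extract(r'(\d+)', expand=False),
                    errors='coerce'))
# 0    1.0
# 1    2.0
# 2    NaN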

compare dates within a dataframe and assign a value to another variable

I have two dataframes (df and df1) like as shown below
import pandas as pd
import numpy as np
from datetime import timedelta

df = pd.DataFrame({'person_id': [101, 101, 101, 101, 202, 202, 202],
                   'start_date': ['5/7/2013 09:27:00 AM', '09/08/2013 11:21:00 AM', '06/06/2014 08:00:00 AM', '06/06/2014 05:00:00 AM', '12/11/2011 10:00:00 AM', '13/10/2012 12:00:00 AM', '13/12/2012 11:45:00 AM']})
df.start_date = pd.to_datetime(df.start_date)
df['end_date'] = df.start_date + timedelta(days=5)
df['enc_id'] = ['ABC1', 'ABC2', 'ABC3', 'ABC4', 'DEF1', 'DEF2', 'DEF3']

df1 = pd.DataFrame({'person_id': [101, 101, 101, 101, 101, 101, 101, 202, 202, 202, 202, 202, 202, 202, 202],
                    'date_1': ['07/07/2013 11:20:00 AM', '05/07/2013 02:30:00 PM', '06/07/2013 02:40:00 PM', '08/06/2014 12:00:00 AM', '11/06/2014 12:00:00 AM', '02/03/2013 12:30:00 PM', '13/06/2014 12:00:00 AM', '12/11/2011 12:00:00 AM', '13/10/2012 07:00:00 AM', '13/12/2015 12:00:00 AM', '13/12/2012 12:00:00 AM', '13/12/2012 06:30:00 PM', '13/07/2011 10:00:00 AM', '18/12/2012 10:00:00 AM', '19/12/2013 11:00:00 AM']})
df1['date_1'] = pd.to_datetime(df1['date_1'])
df1['within_id'] = ['ABC', 'ABC', 'ABC', 'ABC', 'ABC', 'ABC', 'ABC', 'DEF', 'DEF', 'DEF', 'DEF', 'DEF', 'DEF', 'DEF', np.nan]
What I would like to do is
a) Pick each person from df1 who doesn't have NA in the 'within_id' column and check whether their date_1 falls between (df.start_date - 1) and (df.end_date + 1) of the same person in df, for the same within_id or enc_id.
ex: for subject = 101 and within_id = ABC, date_1 is 7/7/2013; check whether it falls between 4/7/2013 (df.start_date - 1) and 11/7/2013 (df.end_date + 1).
Since the first-row comparison itself gives us the result, we don't have to compare date_1 with the rest of the records in df for subject 101. If not, we keep scanning until we find the interval within which date_1 falls.
b) If date interval found, then assign the corresponding enc_id from df to the within_id in df1
c) If not then assign, "Out of Range"
I tried the below
t1 = df.groupby('person_id').apply(pd.DataFrame.sort_values, 'start_date')
t2 = df1.groupby('person_id').apply(pd.DataFrame.sort_values, 'date_1')
t3= pd.concat([t1, t2], axis=1)
t3['within_id'] = np.where((t3['date_1'] >= t3['start_date'] && t3['person_id'] == t3['person_id_x'] && t3['date_2'] >= t3['end_date']),enc_id]
I expect my output (also see the 14th row at the bottom of my screenshot) to be as shown below. As I intend to apply the solution to big data (4-5 million records and perhaps 5000-6000 unique person_ids), any efficient and elegant solution would be helpful.
14 202 2012-12-13 11:00:00 NA
Let's do:
d = df1.merge(df.assign(within_id=df['enc_id'].str[:3]),
              on=['person_id', 'within_id'], how='left', indicator=True)
m = d['date_1'].between(d['start_date'] - pd.Timedelta(days=1),
                        d['end_date'] + pd.Timedelta(days=1))
d = df1.merge(d[m | d['_merge'].ne('both')], on=['person_id', 'date_1'], how='left')
d['within_id'] = d['enc_id'].fillna('out of range').mask(d['_merge'].eq('left_only'))
d = d[df1.columns]
Details:
Left merge the dataframe df1 with df on person_id and within_id:
print(d)
person_id date_1 within_id start_date end_date enc_id _merge
0 101 2013-07-07 11:20:00 ABC 2013-05-07 09:27:00 2013-05-12 09:27:00 ABC1 both
1 101 2013-07-07 11:20:00 ABC 2013-09-08 11:21:00 2013-09-13 11:21:00 ABC2 both
2 101 2013-07-07 11:20:00 ABC 2014-06-06 08:00:00 2014-06-11 08:00:00 ABC3 both
3 101 2013-07-07 11:20:00 ABC 2014-06-06 05:00:00 2014-06-11 10:00:00 DEF1 both
....
47 202 2012-12-18 10:00:00 DEF 2012-10-13 00:00:00 2012-10-18 00:00:00 DEF2 both
48 202 2012-12-18 10:00:00 DEF 2012-12-13 11:45:00 2012-12-18 11:45:00 DEF3 both
49 202 2013-12-19 11:00:00 NaN NaT NaT NaN left_only
Create a boolean mask m to represent the condition where date_1 is between df.start_date - 1 days and df.end_date + 1 days:
print(m)
0 False
1 False
2 False
3 False
...
47 False
48 True
49 False
dtype: bool
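Note that Series.between is inclusive on both ends by default, so a date_1 falling exactly on start_date - 1 day or end_date + 1 day also counts as a match; a quick check:
# both endpoints are included by default
print(pd.Series([1, 2, 3]).between(1, 3))   # all True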
Again left merge the dataframe df1 with the dataframe filtered using mask m on columns person_id and date_1:
print(d)
person_id date_1 within_id_x within_id_y start_date end_date enc_id _merge
0 101 2013-07-07 11:20:00 ABC NaN NaT NaT NaN NaN
1 101 2013-05-07 14:30:00 ABC ABC 2013-05-07 09:27:00 2013-05-12 09:27:00 ABC1 both
2 101 2013-06-07 14:40:00 ABC NaN NaT NaT NaN NaN
3 101 2014-08-06 00:00:00 ABC NaN NaT NaT NaN NaN
4 101 2014-11-06 00:00:00 ABC NaN NaT NaT NaN NaN
5 101 2013-02-03 12:30:00 ABC NaN NaT NaT NaN NaN
6 101 2014-06-13 00:00:00 ABC NaN NaT NaT NaN NaN
7 202 2011-12-11 00:00:00 DEF DEF 2011-12-11 10:00:00 2011-12-16 10:00:00 DEF1 both
8 202 2012-10-13 07:00:00 DEF DEF 2012-10-13 00:00:00 2012-10-18 00:00:00 DEF2 both
9 202 2015-12-13 00:00:00 DEF NaN NaT NaT NaN NaN
10 202 2012-12-13 00:00:00 DEF DEF 2012-12-13 11:45:00 2012-12-18 11:45:00 DEF3 both
11 202 2012-12-13 18:30:00 DEF DEF 2012-12-13 11:45:00 2012-12-18 11:45:00 DEF3 both
12 202 2011-07-13 10:00:00 DEF NaN NaT NaT NaN NaN
13 202 2012-12-18 10:00:00 DEF DEF 2012-12-13 11:45:00 2012-12-18 11:45:00 DEF3 both
14 202 2013-12-19 11:00:00 NaN NaN NaT NaT NaN left_only
Populate the within_id column from enc_id and, using Series.fillna, fill the NaN values (excluding the ones that didn't match anything in df) with out of range; finally filter the columns to get the result:
print(d)
person_id date_1 within_id
0 101 2013-07-07 11:20:00 out of range
1 101 2013-05-07 14:30:00 ABC1
2 101 2013-06-07 14:40:00 out of range
3 101 2014-08-06 00:00:00 out of range
4 101 2014-11-06 00:00:00 out of range
5 101 2013-02-03 12:30:00 out of range
6 101 2014-06-13 00:00:00 out of range
7 202 2011-12-11 00:00:00 DEF1
8 202 2012-10-13 07:00:00 DEF2
9 202 2015-12-13 00:00:00 out of range
10 202 2012-12-13 00:00:00 DEF3
11 202 2012-12-13 18:30:00 DEF3
12 202 2011-07-13 10:00:00 out of range
13 202 2012-12-18 10:00:00 DEF3
14 202 2013-12-19 11:00:00 NaN
I used df and df1 as provided above.
The basic approach is to iterate over df1 and extract the matching values of enc_id.
I added a 'rule' column, to show how each value got populated.
Unfortunately, I was not able to reproduce the expected results. Perhaps the general approach will be useful.
df1['rule'] = 0

for t in df1.itertuples():
    person = (t.person_id == df.person_id)
    b = (t.date_1 >= df.start_date) & (t.date_2 <= df.end_date)
    c = (t.date_1 >= df.start_date) & (t.date_2 >= df.end_date)
    d = (t.date_1 <= df.start_date) & (t.date_2 <= df.end_date)
    e = (t.date_1 <= df.start_date) & (t.date_2 <= df.start_date)  # start_date at BOTH ends
    if (m := person & b).any():
        df1.at[t.Index, 'within_id'] = df.loc[m, 'enc_id'].values[0]
        df1.at[t.Index, 'rule'] += 1
    elif (m := person & c).any():
        df1.at[t.Index, 'within_id'] = df.loc[m, 'enc_id'].values[0]
        df1.at[t.Index, 'rule'] += 10
    elif (m := person & d).any():
        df1.at[t.Index, 'within_id'] = df.loc[m, 'enc_id'].values[0]
        df1.at[t.Index, 'rule'] += 100
    elif (m := person & e).any():
        df1.at[t.Index, 'within_id'] = 'out of range'
        df1.at[t.Index, 'rule'] += 1_000
    else:
        df1.at[t.Index, 'within_id'] = 'impossible!'
        df1.at[t.Index, 'rule'] += 10_000

df1['within_id'] = df1['within_id'].astype('Int64')
The results are:
print(df1)
person_id date_1 date_2 within_id rule
0 11 1961-12-30 00:00:00 1962-01-01 00:00:00 11345678901 1
1 11 1962-01-30 00:00:00 1962-02-01 00:00:00 11345678902 1
2 12 1962-02-28 00:00:00 1962-03-02 00:00:00 34567892101 100
3 12 1989-07-29 00:00:00 1989-07-31 00:00:00 34567892101 1
4 12 1989-09-03 00:00:00 1989-09-05 00:00:00 34567892101 10
5 12 1989-10-02 00:00:00 1989-10-04 00:00:00 34567892103 1
6 12 1989-10-01 00:00:00 1989-10-03 00:00:00 34567892103 1
7 13 1999-03-29 00:00:00 1999-03-31 00:00:00 56432718901 1
8 13 1999-04-20 00:00:00 1999-04-22 00:00:00 56432718901 10
9 13 1999-06-02 00:00:00 1999-06-04 00:00:00 56432718904 1
10 13 1999-06-03 00:00:00 1999-06-05 00:00:00 56432718904 1
11 13 1999-07-29 00:00:00 1999-07-31 00:00:00 56432718905 1
12 14 2002-02-03 10:00:00 2002-02-05 10:00:00 24680135791 1
13 14 2002-02-03 10:00:00 2002-02-05 10:00:00 24680135791 1
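One portability note on the loop above: the assignment expressions (:=) require Python 3.8 or later. On older versions each branch can be rewritten with a plain assignment; the first branch is sketched here:
# equivalent to the first if-branch without the walrus operator
m = person & b
if m.any():
    df1.at[t.Index, 'within_id'] = df.loc[m, 'enc_id'].values[0]
    df1.at[t.Index, 'rule'] += 1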

Assistance needed in python pandas to reduce lines of code and cycle time

I have a DF where I am calculating the EMI value and filling it into fields:
account Total Start Date End Date EMI
211829 107000 05/19/17 01/22/19 5350
320563 175000 08/04/17 10/30/18 12500
648336 246000 02/26/17 08/25/19 8482.7586206897
109996 175000 11/23/17 11/27/19 7291.6666666667
121213 317000 09/07/17 04/12/18 45285.7142857143
Then, based on the date range, I create new fields like Jan17, Feb17, Mar17, etc. and fill them with the code below.
jant17 = pd.to_datetime('2017-01-01')
febt17 = pd.to_datetime('2017-02-01')
mart17 = pd.to_datetime('2017-03-01')
jan17 = pd.to_datetime('2017-01-31')
feb17 = pd.to_datetime('2017-02-28')
mar17 = pd.to_datetime('2017-03-31')
df.ix[(df['Start Date'] <= jan17) & (df['End Date'] >= jant17) , 'Jan17'] = df['EMI']
But the drawback is that when I have to forecast until 2019 or 2020, there are too many lines of code to write, and when there is any update I need to modify too many lines. To reduce the lines of code I tried an alternate method using a for loop, but the code started taking very long to execute.
monthend = {'Jan17': pd.to_datetime('2017-01-31'),
            'Feb17': pd.to_datetime('2017-02-28'),
            'Mar17': pd.to_datetime('2017-03-31')}
monthbeg = {'Jant17': pd.to_datetime('2017-01-01'),
            'Febt17': pd.to_datetime('2017-02-01'),
            'Mart17': pd.to_datetime('2017-03-01')}
for mend in monthend.values():
    for mbeg in monthbeg.values():
        for coln in colnames:
            df.ix[(df['Start Date'] <= mend) & (df['End Date'] >= mbeg), coln] = df['EMI']
This greatly reduced the number of lines of code but increased the execution time from 3-4 minutes to over an hour. Is there a better way to code this with fewer lines and less processing time?
I think you can create a helper df with start dates, end dates and column names, then loop over its rows and create the new columns of the original df:
dates = pd.DataFrame({'start': pd.date_range('2017-01-01', freq='MS', periods=10),
                      'end': pd.date_range('2017-01-01', freq='M', periods=10)})
dates['names'] = dates.start.dt.strftime('%b%y')
print (dates)
end start names
0 2017-01-31 2017-01-01 Jan17
1 2017-02-28 2017-02-01 Feb17
2 2017-03-31 2017-03-01 Mar17
3 2017-04-30 2017-04-01 Apr17
4 2017-05-31 2017-05-01 May17
5 2017-06-30 2017-06-01 Jun17
6 2017-07-31 2017-07-01 Jul17
7 2017-08-31 2017-08-01 Aug17
8 2017-09-30 2017-09-01 Sep17
9 2017-10-31 2017-10-01 Oct17
#if necessary convert to datetimes
df['Start Date'] = pd.to_datetime(df['Start Date'])
df['End Date'] = pd.to_datetime(df['End Date'])
def f(x):
    df.loc[(df['Start Date'] <= x.start) & (df['End Date'] >= x.end), x.names] = df['EMI']

dates.apply(f, axis=1)
print (df)
account Total Start Date End Date EMI Jan17 Feb17 \
0 211829 107000 2017-05-19 2019-01-22 5350.000000 NaN NaN
1 320563 175000 2017-08-04 2018-10-30 12500.000000 NaN NaN
2 648336 246000 2017-02-26 2019-08-25 8482.758621 NaN NaN
3 109996 175000 2017-11-23 2019-11-27 7291.666667 NaN NaN
4 121213 317000 2017-09-07 2018-04-12 45285.714286 NaN NaN
Mar17 Apr17 May17 Jun17 Jul17 \
0 NaN NaN NaN 5350.000000 5350.000000
1 NaN NaN NaN NaN NaN
2 8482.758621 8482.758621 8482.758621 8482.758621 8482.758621
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
Aug17 Sep17 Oct17
0 5350.000000 5350.000000 5350.000000
1 NaN 12500.000000 12500.000000
2 8482.758621 8482.758621 8482.758621
3 NaN NaN NaN
4 NaN NaN 45285.714286
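If the row-wise apply is still too slow on millions of rows, a fully vectorized sketch of the same fill (not from the answer above; it assumes the dates helper frame and datetime-typed Start Date/End Date columns) builds all month columns at once with numpy broadcasting:
import numpy as np

# broadcast every account against every month: rows x months boolean mask
mask = ((df['Start Date'].to_numpy()[:, None] <= dates['start'].to_numpy()) &
        (df['End Date'].to_numpy()[:, None] >= dates['end'].to_numpy()))
monthly = pd.DataFrame(np.where(mask, df['EMI'].to_numpy()[:, None], np.nan),
                       index=df.index, columns=dates['names'])
df = pd.concat([df, monthly], axis=1)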

dataframe merge with missing data

I have 2 dataframes:
df.head()
Out[2]:
Unnamed: 0 Symbol Date Close
0 4061 A 2016-01-13 36.515889
1 4062 A 2016-01-14 36.351784
2 4063 A 2016-01-15 36.351784
3 4064 A 2016-01-19 36.590483
4 4065 A 2016-01-20 35.934062
and
dfw.head()
Out[3]:
Symbol Weight
0 A (0.000002)
1 AA 0.000112
2 AAC (0.000004)
3 AAL 0.000006
4 AAMC 0.000002
ISSUE:
Not every symbol in df will have a weight in dfw. If it does not, I want to drop it from my new dataframe (all dates of it). If the symbol is in dfw, I want to merge the weight into df so that each row has symbol, date, close and weight. I have tried the following but get NaN values. I am also not sure how to remove all symbols with no weights, even if the merge were successful.
dfall = df.merge(dfw, on='Symbol', how='left')
dfall.head()
Out[14]:
Unnamed: 0 Symbol Date Close Weight
0 4061 A 2016-01-13 36.515889 NaN
1 4062 A 2016-01-14 36.351784 NaN
2 4063 A 2016-01-15 36.351784 NaN
3 4064 A 2016-01-19 36.590483 NaN
4 4065 A 2016-01-20 35.934062 NaN
df_all = df[df.Symbol.isin(dfw.Symbol.unique())].merge(dfw, how='left', on='Symbol')
I am not sure why you are getting NaN values. Perhaps you have spaces in your symbols? You can clean them via dfw['Symbol'] = dfw.Symbol.str.strip(). You would need to do the same for df.
>>> df_all
Unnamed: 0 Symbol Date Close Weight
0 4061 A 2016-01-13 36.515889 (0.000002)
1 4062 A 2016-01-14 36.351784 (0.000002)
2 4063 A 2016-01-15 36.351784 (0.000002)
3 4064 A 2016-01-19 36.590483 (0.000002)
4 4065 A 2016-01-20 35.934062 (0.000002)
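For what it's worth, once both Symbol columns are cleaned, an inner merge does the filtering and the join in one step (a sketch, not part of the answer above):
# strip stray spaces, then keep only symbols present in dfw and attach their Weight
df['Symbol'] = df.Symbol.str.strip()
dfw['Symbol'] = dfw.Symbol.str.strip()
df_all = df.merge(dfw, on='Symbol', how='inner')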

Python Pandas - Cannot Merge Multiple DataFrame returning NaN

I am trying to merge multiple CSV files into one large dataframe. I want to merge them with respect to the Date column, although some CSV files have missing dates and will require a blank or NA to be recorded.
Searching around led me to believe that pandas in python would be a viable solution.
My code is as follows:
import pandas as pd
AvgPrice = pd.read_csv('csv/BAVERAGE-USD-Bitcoin24hPrice.csv', index_col=False)
AvgPrice = AvgPrice.iloc[:,(0,1)]
AvgPrice.columns.values[1] = 'Price'
TransVol = pd.read_csv('csv/BCHAIN-ETRAV-BitcoinEstimatedTransactionVolume.csv', index_col=False)
TransVol.columns.values[1] = 'TransactionVolume'
TotalBTC = pd.read_csv('csv/BCHAIN-TOTBC-TotalBitcoins.csv', index_col=False)
TotalBTC.columns.values[1] = 'TotalBTC'
USDExchVol = pd.read_csv('csv/BCHAIN-TRVOU-BitcoinUSDExchangeTradeVolume.csv', index_col=False)
USDExchVol.columns.values[1] = 'USDExchange Volume'
df1 = pd.merge(TransVol, AvgPrice, on='Date', how='outer')
df2 = pd.merge(USDExchVol, TotalBTC, on='Date', how='outer')
df_test = pd.merge(AvgPrice, TransVol, on='Date', how='outer')
CSV files are located here: https://drive.google.com/folderview?id=0B8xdmDmZgtJbVkhCcjZkZUhaajg&usp=sharing
Results of df_test:
Date Price TransactionVolume
0 2016-05-10 459.30 NaN
1 2016-05-09 462.49 NaN
2 2016-05-08 461.85 NaN
3 2016-05-07 460.86 NaN
4 2016-05-06 453.51 NaN
5 2016-05-05 449.31 NaN
Whereas df1 seems to be fine:
Date TransactionVolume Price
0 2016-05-10 275352.0 459.30
1 2016-05-09 256585.0 462.49
2 2016-05-08 152045.0 461.85
3 2016-05-07 245115.0 460.86
4 2016-05-06 264882.0 453.51
5 2016-05-05 273005.0 449.31
I have no idea why df2 and df_test have the right-most column filled with NaN. This is preventing me from merging both df1 and df2 to make one large DataFrame.
Any help would be greatly appreciated as I've spent hours with no success.
You have to add the parameters names and usecols to read_csv, and then it works nicely:
import pandas as pd
AvgPrice = pd.read_csv('csv/BAVERAGE-USD-Bitcoin24hPrice.csv',
                       index_col=False,
                       parse_dates=['Date'],
                       usecols=[0, 1],
                       header=0,
                       names=['Date', 'Price'])
TransVol = pd.read_csv('csv/BCHAIN-ETRAV-BitcoinEstimatedTransactionVolume.csv',
                       index_col=False,
                       parse_dates=['Date'],
                       header=0,
                       names=['Date', 'TransactionVolume'])
TotalBTC = pd.read_csv('csv/BCHAIN-TOTBC-TotalBitcoins.csv',
                       index_col=False,
                       parse_dates=['Date'],
                       header=0,
                       names=['Date', 'TotalBTC'])
USDExchVol = pd.read_csv('csv/BCHAIN-TRVOU-BitcoinUSDExchangeTradeVolume.csv',
                         index_col=False,
                         parse_dates=['Date'],
                         header=0,
                         names=['Date', 'USDExchange Volume'])
df1 = pd.merge(TransVol, AvgPrice, on='Date', how='outer')
df2 = pd.merge(USDExchVol, TotalBTC, on='Date', how='outer')
df_test = pd.merge(AvgPrice, TransVol, on='Date', how='outer')
print (df1.head())
print (df2.head())
print (df_test.head())
Date TransactionVolume Price
0 2016-05-10 275352.0 459.30
1 2016-05-09 256585.0 462.49
2 2016-05-08 152045.0 461.85
3 2016-05-07 245115.0 460.86
4 2016-05-06 264882.0 453.51
Date USDExchange Volume TotalBTC
0 2016-05-10 2.158373e+06 15529625.0
1 2016-05-09 1.438420e+06 15525825.0
2 2016-05-08 6.679933e+05 15521275.0
3 2016-05-07 1.825475e+06 15517400.0
4 2016-05-06 1.908048e+06 15513525.0
Date Price TransactionVolume
0 2016-05-10 459.30 275352.0
1 2016-05-09 462.49 256585.0
2 2016-05-08 461.85 152045.0
3 2016-05-07 460.86 245115.0
4 2016-05-06 453.51 264882.0
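With the column names fixed at read time, the intermediate frames can then be combined into the single large dataframe the question is after; a sketch:
# outer-join the two merged frames on Date; dates missing from any source come through as NaN
df_all = pd.merge(df1, df2, on='Date', how='outer')
print(df_all.head())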
EDIT by comment:
I think you can convert the Date column to a monthly period with to_period and then use groupby with mean:
print (df1.Date.dt.to_period('M'))
0 2016-05
1 2016-05
2 2016-05
3 2016-05
4 2016-05
5 2016-05
6 2016-05
7 2016-05
...
...
print (df1.groupby( df1.Date.dt.to_period('M') ).mean() )
TransactionVolume Price
Date
2011-05 1.605518e+05 7.272273
2011-06 1.739163e+05 17.914583
2011-07 6.647129e+04 14.100645
2011-08 1.050460e+05 10.089677
2011-09 9.562243e+04 5.933667
2011-10 9.120232e+04 3.638065
2011-11 8.927442e+05 2.690333
2011-12 1.092328e+06 3.463871
2012-01 1.168704e+05 6.105161
2012-02 1.465859e+05 5.115517
...
...
If order is important, add parameter sort=False:
print (df1.groupby( df1.Date.dt.to_period('M') , sort=False).mean() )
TransactionVolume Price
Date
2016-05 2.511146e+05 454.544000
2016-04 2.747255e+05 435.102333
2016-03 3.142206e+05 418.208710
2016-02 3.402811e+05 404.091379
2016-01 2.548778e+05 412.671935
2015-12 3.857985e+05 423.402903
2015-11 4.290366e+05 349.200333
2015-10 3.134802e+05 266.007097
2015-09 2.572308e+05 235.310345
2015-08 2.737384e+05 253.951613
...
...
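An alternative sketch for the monthly means (not from the answer above) is to resample on a DatetimeIndex; note that this always returns the months in chronological order:
# same monthly averages via resample; 'M' buckets by calendar month-end
monthly_means = df1.set_index('Date').resample('M').mean()
print(monthly_means.head())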
There is a subtle bug here: you're renaming the columns by directly assigning to the underlying column array in each df:
AvgPrice.columns.values[1] = 'Price'
If you try TransVol.info() it raises a KeyError on TransactionVolume.
If instead you use rename, then it works:
In [35]:
AvgPrice = pd.read_csv(r'c:\data\BAVERAGE-USD-Bitcoin24hPrice.csv', index_col=False)
AvgPrice = AvgPrice.iloc[:, (0, 1)]
AvgPrice.rename(columns={'24h Average': 'Price'}, inplace=True)

TransVol = pd.read_csv(r'c:\data\BCHAIN-ETRAV-BitcoinEstimatedTransactionVolume.csv', index_col=False)
TransVol.rename(columns={'Value': 'TransactionVolume'}, inplace=True)

TotalBTC = pd.read_csv(r'c:\data\BCHAIN-TOTBC-TotalBitcoins.csv', index_col=False)
TotalBTC.rename(columns={'Value': 'TotalBTC'}, inplace=True)

USDExchVol = pd.read_csv(r'c:\data\BCHAIN-TRVOU-BitcoinUSDExchangeTradeVolume.csv', index_col=False)
USDExchVol.rename(columns={'Value': 'USDExchange Volume'}, inplace=True)

df1 = pd.merge(TransVol, AvgPrice, on='Date', how='outer')
df2 = pd.merge(USDExchVol, TotalBTC, on='Date', how='outer')

df_test = pd.merge(AvgPrice, TransVol, on='Date', how='outer')
df_test
Out[35]:
Date Price TransactionVolume
0 2016-05-10 459.30 275352.0
1 2016-05-09 462.49 256585.0
2 2016-05-08 461.85 152045.0
3 2016-05-07 460.86 245115.0
4 2016-05-06 453.51 264882.0
5 2016-05-05 449.31 273005.0
6 2016-05-04 449.32 370911.0
7 2016-05-03 447.93 252534.0
8 2016-05-02 448.00 249926.0
9 2016-05-01 452.87 170791.0
10 2016-04-30 454.88 190470.0
11 2016-04-29 451.88 278893.0
12 2016-04-28 445.80 329924.0
13 2016-04-27 461.92 335750.0
14 2016-04-26 465.91 344162.0
15 2016-04-25 460.32 307790.0
16 2016-04-24 455.53 188499.0
17 2016-04-23 449.13 203792.0
18 2016-04-22 447.73 291487.0
19 2016-04-21 445.28 316159.0
20 2016-04-20 438.98 302380.0
21 2016-04-19 432.35 275994.0
22 2016-04-18 429.76 245313.0
23 2016-04-17 431.93 186607.0
24 2016-04-16 432.86 200628.0
25 2016-04-15 429.06 281389.0
26 2016-04-14 426.21 274524.0
27 2016-04-13 425.50 309995.0
28 2016-04-12 426.15 341372.0
29 2016-04-11 422.91 264357.0
... ... ... ...
1798 2011-05-18 7.14 80290.0
1799 2011-05-17 7.52 138205.0
1800 2011-05-16 7.77 62341.0
1801 2011-05-15 6.74 272130.0
1802 2011-05-14 7.86 656162.0
1803 2011-05-13 7.48 324020.0
1804 2011-05-12 5.83 101674.0
1805 2011-05-11 5.35 114243.0
1806 2011-05-10 4.74 104592.0
1807 2015-09-03 NaN 256023.0
1808 2015-02-03 NaN 213538.0
1809 2015-01-07 NaN 256344.0
1810 2014-11-21 NaN 161082.0
1811 2014-10-17 NaN 142251.0
1812 2014-09-28 NaN 92933.0
1813 2014-09-09 NaN 111317.0
1814 2014-08-05 NaN 136298.0
1815 2014-08-03 NaN 49181.0
1816 2014-08-01 NaN 166173.0
1817 2014-06-03 NaN 124768.0
1818 2014-06-02 NaN 87513.0
1819 2014-05-09 NaN 80315.0
1820 2013-10-27 NaN 107717.0
1821 2013-09-17 NaN 137920.0
1822 2011-06-25 NaN 110463.0
1823 2011-06-24 NaN 106146.0
1824 2011-06-23 NaN 475995.0
1825 2011-06-22 NaN 122507.0
1826 2011-06-21 NaN 114264.0
1827 2011-06-20 NaN 836861.0
[1828 rows x 3 columns]
