dataframe merge with missing data - python

I have 2 dataframes:
df.head()
Out[2]:
Unnamed: 0 Symbol Date Close
0 4061 A 2016-01-13 36.515889
1 4062 A 2016-01-14 36.351784
2 4063 A 2016-01-15 36.351784
3 4064 A 2016-01-19 36.590483
4 4065 A 2016-01-20 35.934062
and
dfw.head()
Out[3]:
Symbol Weight
0 A (0.000002)
1 AA 0.000112
2 AAC (0.000004)
3 AAL 0.000006
4 AAMC 0.000002
ISSUE:
Not every symbol in df will have a weight in dfw. If it does not, I want to drop that symbol from my new dataframe (all of its dates). If the symbol is in dfw, I want to merge the weight into df so that each row has Symbol, Date, Close and Weight. I have tried the following but get NaN values. I am also not sure how to remove all symbols with no weights, even if the merge were successful.
dfall = df.merge(dfw, on='Symbol', how='left')
dfall.head()
Out[14]:
Unnamed: 0 Symbol Date Close Weight
0 4061 A 2016-01-13 36.515889 NaN
1 4062 A 2016-01-14 36.351784 NaN
2 4063 A 2016-01-15 36.351784 NaN
3 4064 A 2016-01-19 36.590483 NaN
4 4065 A 2016-01-20 35.934062 NaN

df_all = df[df.Symbol.isin(dfw.Symbol.unique())].merge(dfw, how='left', on='Symbol')
I am not sure why you are getting NaN values. Perhaps you have spaces in your symbols? You can clean them via:
dfw['Symbol'] = dfw.Symbol.str.strip()
You would need to do the same for df.
>>> df_all
Unnamed: 0 Symbol Date Close Weight
0 4061 A 2016-01-13 36.515889 (0.000002)
1 4062 A 2016-01-14 36.351784 (0.000002)
2 4063 A 2016-01-15 36.351784 (0.000002)
3 4064 A 2016-01-19 36.590483 (0.000002)
4 4065 A 2016-01-20 35.934062 (0.000002)
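For what it's worth, the isin filter plus left merge above can be collapsed into a single step, since an inner join only keeps symbols present in both frames. A minimal sketch of that equivalent approach:

# inner join drops symbols that have no weight in dfw
dfall = df.merge(dfw, on='Symbol', how='inner')
# optional: drop the leftover CSV index column shown in the output above
dfall = dfall.drop(columns=['Unnamed: 0'])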

Related

Why is pandas str.replace returning NaN?

I am trying to remove the comma separator from values in a dataframe in Pandas so that I can convert them to integers. I have been using the following method:
df_orders['qty'] = df_orders['qty'].str.replace(',','')
However, this seems to be returning NaN values for some numbers which did not originally contain ',' in their values. I have included a sample of my input data and current output below:
Input:
date sku qty
556603 2020-10-25 A 6
590904 2020-10-21 A 5
595307 2020-10-20 A 31
602678 2020-10-19 A 11
615022 2020-10-18 A 2
641077 2020-10-16 A 1
650203 2020-10-15 A 3
655363 2020-10-14 A 18
667919 2020-10-13 A 5
674990 2020-10-12 A 2
703901 2020-10-09 A 1
715411 2020-10-08 A 1
721557 2020-10-07 A 31
740515 2020-10-06 A 49
752670 2020-10-05 A 4
808426 2020-09-28 A 2
848057 2020-09-23 A 1
865751 2020-09-21 A 2
886630 2020-09-18 A 3
901095 2020-09-16 A 47
938648 2020-09-10 A 2
969909 2020-09-07 A 3
1021548 2020-08-31 A 2
1032254 2020-08-30 A 8
1077443 2020-08-25 A 5
1089670 2020-08-24 A 24
1098843 2020-08-23 A 16
1102025 2020-08-22 A 23
1179347 2020-08-12 A 1
1305700 2020-07-29 A 1
1316343 2020-07-28 A 1
1399930 2020-07-19 A 1
1451864 2020-07-15 A 1
1463195 2020-07-14 A 15
2129080 2020-05-19 A 1
2143468 2020-05-18 A 1
Current Output:
date sku qty
556603 2020-10-25 A 6
590904 2020-10-21 A 5
595307 2020-10-20 A 31
602678 2020-10-19 A 11
615022 2020-10-18 A 2
641077 2020-10-16 A 1
650203 2020-10-15 A 3
655363 2020-10-14 A NaN
667919 2020-10-13 A NaN
674990 2020-10-12 A NaN
703901 2020-10-09 A NaN
715411 2020-10-08 A NaN
721557 2020-10-07 A NaN
740515 2020-10-06 A NaN
752670 2020-10-05 A NaN
808426 2020-09-28 A 2
848057 2020-09-23 A 1
865751 2020-09-21 A 2
886630 2020-09-18 A 3
901095 2020-09-16 A 47
938648 2020-09-10 A NaN
969909 2020-09-07 A NaN
1021548 2020-08-31 A NaN
1032254 2020-08-30 A NaN
1077443 2020-08-25 A NaN
1089670 2020-08-24 A NaN
1098843 2020-08-23 A NaN
1102025 2020-08-22 A NaN
1179347 2020-08-12 A NaN
1305700 2020-07-29 A NaN
1316343 2020-07-28 A 1
1399930 2020-07-19 A 1
1451864 2020-07-15 A 1
1463195 2020-07-14 A 15
2129080 2020-05-19 A 1
2143468 2020-05-18 A 1
I have had a look around but can't seem to find what is causing this error.
I was able to reproduce your issue:
# toy df
df
qty
0 1
1 2,
2 3
df['qty'].str.replace(',', '')
0 NaN
1 2
2 NaN
Name: qty, dtype: object
I created df by doing this:
df = pd.DataFrame({'qty': [1, '2,', 3]})
In other words, your column has mixed data types - some values are integers while others are strings. When you apply .str methods to mixed types, the non-string values are converted to NaN to signal "it doesn't make sense to run a str method on an int".
You may fix this by converting the entire column to string, then back to int:
df['qty'].astype(str).str.replace(',', '').astype(int)
Or if you want something a little more robust, try
df['qty'] = pd.to_numeric(
    df['qty'].astype(str).str.extract(r'(\d+)', expand=False),
    errors='coerce')
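If you want to confirm the mixed-dtype diagnosis on your own frame, a quick check (a sketch using the toy df from above) is to count the Python types actually stored in the column:

import pandas as pd

df = pd.DataFrame({'qty': [1, '2,', 3]})
# counts how many values of each Python type the column holds;
# here it should report two ints and one str
print(df['qty'].map(type).value_counts())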

How to copy paste values from another dataset conditional on a column

I have df1
Id Data Group_Id
0 1 A 1
1 2 B 2
2 3 B 3
...
100 4 A 101
101 5 A 102
...
and df2
Timestamp Group_Id
2012-01-01 00:00:05.523 1
2013-07-01 00:00:10.757 2
2014-01-12 00:00:15.507 3
...
2016-03-05 00:00:05.743 101
2017-12-24 00:00:10.407 102
...
I want to match the two datasets on Group_Id, then copy only the date part of Timestamp in df2 into a new column in df1 based on the corresponding Group_Id, naming the column day1.
Then I want to add six more columns next to day1, named day2, ..., day7, holding the next six days after day1. So it looks like:
Id Data Group_Id day1 day2 day3 ... day7
0 1 A 1 2012-01-01 2012-01-02 2012-01-03 ...
1 2 B 2 2013-07-01 2013-07-02 2013-07-03 ...
2 3 B 3 2014-01-12 2014-01-13 2014-01-14 ...
...
100 4 A 101 2016-03-05 2016-03-06 2016-03-07 ...
101 5 A 102 2017-12-24 2017-12-25 2017-12-26 ...
...
Thanks.
First we need a merge here:
df1 = df1.merge(df2, how='left')
# seven consecutive dates per row, starting at the merged Timestamp
s = pd.DataFrame([pd.date_range(x, periods=7, freq='D') for x in df1.Timestamp],
                 index=df1.index)
s.columns += 1  # shift 0..6 to 1..7 so the prefix yields Day1..Day7
df1.join(s.add_prefix('Day'))
Another approach here: it basically just merges the dfs, grabs the date from the timestamp and makes six new columns, adding a day each time:
import pandas as pd

df1 = pd.read_csv('df1.csv')
df2 = pd.read_csv('df2.csv')

df3 = df1.merge(df2, on='Group_Id')
df3['Timestamp'] = pd.to_datetime(df3['Timestamp'])  # only necessary if not already a timestamp
df3['day1'] = df3['Timestamp'].dt.date
for i in range(1, 7):
    df3['day' + str(i + 1)] = df3['day1'] + pd.Timedelta(i, unit='d')
output:
Id Data Group_Id Timestamp day1 day2 day3 day4 day5 day6 day7
0 1 A 1 2012-01-01 00:00:05.523 2012-01-01 2012-01-02 2012-01-03 2012-01-04 2012-01-05 2012-01-06 2012-01-07
1 2 B 2 2013-07-01 00:00:10.757 2013-07-01 2013-07-02 2013-07-03 2013-07-04 2013-07-05 2013-07-06 2013-07-07
2 3 B 3 2014-01-12 00:00:15.507 2014-01-12 2014-01-13 2014-01-14 2014-01-15 2014-01-16 2014-01-17 2014-01-18
3 4 A 101 2016-03-05 00:00:05.743 2016-03-05 2016-03-06 2016-03-07 2016-03-08 2016-03-09 2016-03-10 2016-03-11
4 5 A 102 2017-12-24 00:00:10.407 2017-12-24 2017-12-25 2017-12-26 2017-12-27 2017-12-28 2017-12-29 2017-12-30
Note that I copied your data frame into a csv and only had the five entries, so the index is not the same as in your example (i.e. 100, 101).
You can delete the Timestamp column if it is not needed.
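If you would rather avoid the explicit loop, the six extra columns can also be built in a single assign call. A sketch under the same assumptions as the code above (df3 already has the day1 column):

# build day2..day7 in one assign instead of a loop
extra = {'day' + str(i + 1): df3['day1'] + pd.Timedelta(i, unit='d') for i in range(1, 7)}
df3 = df3.assign(**extra)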

Pandas - Identify Last Row by Date

I'm trying to accomplish two things in my Pandas dataframe:
Create a new column LastRow ('Yes' or 'No') that flags the last transaction of each DateCompleted.
Capture the next transaction's sales on the current row, unless the next row is a new DateCompleted (in which case mark it as null).
Original Dataset
DateCompleted TranNumber Sales
0 1/1/17 10:15AM 3133 130.31
1 1/1/17 11:21AM 3531 103.12
2 1/1/17 12:31PM 3652 99.23
3 1/2/17 9:31AM 3689 83.22
4 1/2/17 10:31AM 3701 29.93
5 1/3/17 8:30AM 3709 31.31
Desired Output
DateCompleted TranNumber Sales NextTranSales LastRow
0 1/1/17 10:15AM 3133 130.31 103.12 No
1 1/1/17 11:21AM 3531 103.12 99.23 No
2 1/1/17 12:31PM 3652 99.23 NaN Yes
3 1/2/17 9:31AM 3689 83.22 29.93 No
4 1/2/17 10:31AM 3701 29.93 NaN Yes
5 1/3/17 8:30AM 3709 31.31 ... No
I can get the NextTranSales based on:
df['NextTranSales'] = df.Sales.shift(-1)
But I'm having trouble determining the last row in the DateCompleted group and marking NextTranSales as Null if it is the last row.
Thanks for your help!
If your data frame has been sorted by the DateCompleted column, then you might just need groupby.shift:
date = pd.to_datetime(df.DateCompleted).dt.date
df["NextTranSales"] = df.groupby(date).Sales.shift(-1)
If you need the LastRow column, you can find out the last row index with groupby and then assign yes to the rows:
last_row_index = df.groupby(date, as_index=False).apply(lambda g: g.index[-1])
df["LastRow"] = "No"
df.loc[last_row_index, "LastRow"] = "Yes"
df
NOTE: This depends on Sales being free of NaN. If it has any NaN, we will get erroneous determinations of the last row. This happens because I'm leveraging the convenience that the shifted column leaves a NaN in the last position of each group.
d = df.DateCompleted.dt.date
m = {True: 'Yes', False: 'No'}
s = df.groupby(d).Sales.shift(-1)
df = df.assign(NextTranSales=s).assign(LastRow=s.isnull().map(m))
print(df)
DateCompleted TranNumber Sales NextTranSales LastRow
0 2017-01-01 10:15:00 3133 130.31 103.12 No
1 2017-01-01 11:21:00 3531 103.12 99.23 No
2 2017-01-01 12:31:00 3652 99.23 NaN Yes
3 2017-01-02 09:31:00 3689 83.22 29.93 No
4 2017-01-02 10:31:00 3701 29.93 NaN Yes
5 2017-01-03 08:30:00 3709 31.31 NaN Yes
We can be free of the no-NaN restriction with this:
d = df.DateCompleted.dt.date
m = {True: 'Yes', False: 'No'}
s = df.groupby(d).Sales.shift(-1)
l = pd.Series('Yes', df.groupby(d).tail(1).index).reindex(df.index, fill_value='No')
df.assign(NextTranSales=s).assign(LastRow=l)
DateCompleted TranNumber Sales NextTranSales LastRow
0 2017-01-01 10:15:00 3133 130.31 103.12 No
1 2017-01-01 11:21:00 3531 103.12 99.23 No
2 2017-01-01 12:31:00 3652 99.23 NaN Yes
3 2017-01-02 09:31:00 3689 83.22 29.93 No
4 2017-01-02 10:31:00 3701 29.93 NaN Yes
5 2017-01-03 08:30:00 3709 31.31 NaN Yes
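For completeness, duplicated offers another NaN-safe way to flag the last row of each date, assuming as above that the frame is sorted by DateCompleted (a sketch, with numpy imported separately):

import numpy as np

d = pd.to_datetime(df.DateCompleted).dt.date
# keep='last' marks every row except the final one of each date as a duplicate
df['LastRow'] = np.where(d.duplicated(keep='last'), 'No', 'Yes')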

removing rows with any column containing NaN, NaTs, and nans

Currently I have data as below:
df_all.head()
Out[2]:
Unnamed: 0 Symbol Date Close Weight
0 4061 A 2016-01-13 36.515889 (0.000002)
1 4062 AA 2016-01-14 36.351784 0.000112
2 4063 AAC 2016-01-15 36.351784 (0.000004)
3 4064 AAL 2016-01-19 36.590483 0.000006
4 4065 AAMC 2016-01-20 35.934062 0.000002
df_all.tail()
Out[3]:
Unnamed: 0 Symbol Date Close Weight
1252498 26950320 nan NaT 9.84 NaN
1252499 26950321 nan NaT 10.26 NaN
1252500 26950322 nan NaT 9.99 NaN
1252501 26950323 nan NaT 9.11 NaN
1252502 26950324 nan NaT 9.18 NaN
df_all.dtypes
Out[4]:
Unnamed: 0 int64
Symbol object
Date datetime64[ns]
Close float64
Weight object
dtype: object
As can be seen, I am getting values of nan in Symbol, NaT in Date and NaN in Weight.
MY GOAL: I want to remove any row in which ANY column contains nan, NaT or NaN, and have a new df_clean as the result.
I don't seem to be able to apply the appropriate filter. I am not sure if I have to convert the datatypes first (although I tried this as well).
You can use
df_clean = df_all.replace({'nan': None})
df_clean = df_clean[~pd.isnull(df_clean).any(axis=1)]
The replace turns the string 'nan' into a real missing value, and isnull recognizes both NaN and NaT as "null" values. Note the mask has to be computed on the replaced frame, otherwise the 'nan' strings slip through.
Since the symbol 'nan' is a string, it is not caught by dropna() or isnull(). You need to cast the symbol 'nan' as np.nan first.
Try this:
import numpy as np

df["Symbol"] = np.where(df["Symbol"] == 'nan', np.nan, df["Symbol"])
df = df.dropna()
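Either way, a quick sanity check that the cleaning worked (a minimal sketch against the df_all frame from the question):

df_clean = df_all.replace({'nan': None}).dropna()
# every remaining row should be fully populated
assert not df_clean.isnull().any(axis=1).any()
print(df_clean.tail())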

Python Pandas - Cannot Merge Multiple DataFrame returning NaN

I am trying to merge multiple CSV files into one large dataframe. I want to merge them on the Date column, although some CSV files have missing dates and will require a blank or NA to be recorded.
Searching around led me to believe that pandas in python would be a viable solution.
My code is as follows:
import pandas as pd
AvgPrice = pd.read_csv('csv/BAVERAGE-USD-Bitcoin24hPrice.csv', index_col=False)
AvgPrice = AvgPrice.iloc[:,(0,1)]
AvgPrice.columns.values[1] = 'Price'
TransVol = pd.read_csv('csv/BCHAIN-ETRAV-BitcoinEstimatedTransactionVolume.csv', index_col=False)
TransVol.columns.values[1] = 'TransactionVolume'
TotalBTC = pd.read_csv('csv/BCHAIN-TOTBC-TotalBitcoins.csv', index_col=False)
TotalBTC.columns.values[1] = 'TotalBTC'
USDExchVol = pd.read_csv('csv/BCHAIN-TRVOU-BitcoinUSDExchangeTradeVolume.csv', index_col=False)
USDExchVol.columns.values[1] = 'USDExchange Volume'
df1 = pd.merge(TransVol, AvgPrice, on='Date', how='outer')
df2 = pd.merge(USDExchVol, TotalBTC, on='Date', how='outer')
df_test = pd.merge(AvgPrice, TransVol, on='Date', how='outer')
CSV files are located here: https://drive.google.com/folderview?id=0B8xdmDmZgtJbVkhCcjZkZUhaajg&usp=sharing
Results of df_test:
Date Price TransactionVolume
0 2016-05-10 459.30 NaN
1 2016-05-09 462.49 NaN
2 2016-05-08 461.85 NaN
3 2016-05-07 460.86 NaN
4 2016-05-06 453.51 NaN
5 2016-05-05 449.31 NaN
Whereas df1 seems to be fine:
Date TransactionVolume Price
0 2016-05-10 275352.0 459.30
1 2016-05-09 256585.0 462.49
2 2016-05-08 152045.0 461.85
3 2016-05-07 245115.0 460.86
4 2016-05-06 264882.0 453.51
5 2016-05-05 273005.0 449.31
I have no idea why df2 and df_test have the rightmost column filled with NaN. This is preventing me from merging df1 and df2 into one large DataFrame.
Any help would be greatly appreciated as I've spent hours with no success.
You have to add the parameters names and usecols to read_csv, and then it works nicely:
import pandas as pd

AvgPrice = pd.read_csv('csv/BAVERAGE-USD-Bitcoin24hPrice.csv',
                       index_col=False,
                       parse_dates=['Date'],
                       usecols=[0, 1],
                       header=0,
                       names=['Date', 'Price'])
TransVol = pd.read_csv('csv/BCHAIN-ETRAV-BitcoinEstimatedTransactionVolume.csv',
                       index_col=False,
                       parse_dates=['Date'],
                       header=0,
                       names=['Date', 'TransactionVolume'])
TotalBTC = pd.read_csv('csv/BCHAIN-TOTBC-TotalBitcoins.csv',
                       index_col=False,
                       parse_dates=['Date'],
                       header=0,
                       names=['Date', 'TotalBTC'])
USDExchVol = pd.read_csv('csv/BCHAIN-TRVOU-BitcoinUSDExchangeTradeVolume.csv',
                         index_col=False,
                         parse_dates=['Date'],
                         header=0,
                         names=['Date', 'USDExchange Volume'])
df1 = pd.merge(TransVol, AvgPrice, on='Date', how='outer')
df2 = pd.merge(USDExchVol, TotalBTC, on='Date', how='outer')
df_test = pd.merge(AvgPrice, TransVol, on='Date', how='outer')
print (df1.head())
print (df2.head())
print (df_test.head())
Date TransactionVolume Price
0 2016-05-10 275352.0 459.30
1 2016-05-09 256585.0 462.49
2 2016-05-08 152045.0 461.85
3 2016-05-07 245115.0 460.86
4 2016-05-06 264882.0 453.51
Date USDExchange Volume TotalBTC
0 2016-05-10 2.158373e+06 15529625.0
1 2016-05-09 1.438420e+06 15525825.0
2 2016-05-08 6.679933e+05 15521275.0
3 2016-05-07 1.825475e+06 15517400.0
4 2016-05-06 1.908048e+06 15513525.0
Date Price TransactionVolume
0 2016-05-10 459.30 275352.0
1 2016-05-09 462.49 256585.0
2 2016-05-08 461.85 152045.0
3 2016-05-07 460.86 245115.0
4 2016-05-06 453.51 264882.0
EDIT by comment:
I think you can convert the Date column to monthly periods with to_period and then use groupby with mean:
print (df1.Date.dt.to_period('M'))
0 2016-05
1 2016-05
2 2016-05
3 2016-05
4 2016-05
5 2016-05
6 2016-05
7 2016-05
...
print (df1.groupby( df1.Date.dt.to_period('M') ).mean() )
TransactionVolume Price
Date
2011-05 1.605518e+05 7.272273
2011-06 1.739163e+05 17.914583
2011-07 6.647129e+04 14.100645
2011-08 1.050460e+05 10.089677
2011-09 9.562243e+04 5.933667
2011-10 9.120232e+04 3.638065
2011-11 8.927442e+05 2.690333
2011-12 1.092328e+06 3.463871
2012-01 1.168704e+05 6.105161
2012-02 1.465859e+05 5.115517
...
If order is important, add parameter sort=False:
print (df1.groupby( df1.Date.dt.to_period('M') , sort=False).mean() )
TransactionVolume Price
Date
2016-05 2.511146e+05 454.544000
2016-04 2.747255e+05 435.102333
2016-03 3.142206e+05 418.208710
2016-02 3.402811e+05 404.091379
2016-01 2.548778e+05 412.671935
2015-12 3.857985e+05 423.402903
2015-11 4.290366e+05 349.200333
2015-10 3.134802e+05 266.007097
2015-09 2.572308e+05 235.310345
2015-08 2.737384e+05 253.951613
...
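As an aside (a hedged alternative, not from the original answer), a similar monthly mean can be had by resampling on a Date index; note that resample labels each bucket with the month-end timestamp rather than a period:

monthly = df1.set_index('Date').resample('M').mean()
print(monthly.head())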
There is a subtle bug here: you're renaming the columns by assigning directly to the underlying column values array in each df:
AvgPrice.columns.values[1] = 'Price'
If you try TransVol.info() it raises a KeyError on TransactionVolume.
If instead you use rename then it works:
In [35]:
AvgPrice = pd.read_csv(r'c:\data\BAVERAGE-USD-Bitcoin24hPrice.csv', index_col=False)
AvgPrice = AvgPrice.iloc[:, (0, 1)]
AvgPrice.rename(columns={'24h Average': 'Price'}, inplace=True)

TransVol = pd.read_csv(r'c:\data\BCHAIN-ETRAV-BitcoinEstimatedTransactionVolume.csv', index_col=False)
TransVol.rename(columns={'Value': 'TransactionVolume'}, inplace=True)

TotalBTC = pd.read_csv(r'c:\data\BCHAIN-TOTBC-TotalBitcoins.csv', index_col=False)
TotalBTC.rename(columns={'Value': 'TotalBTC'}, inplace=True)

USDExchVol = pd.read_csv(r'c:\data\BCHAIN-TRVOU-BitcoinUSDExchangeTradeVolume.csv', index_col=False)
USDExchVol.rename(columns={'Value': 'USDExchange Volume'}, inplace=True)

df1 = pd.merge(TransVol, AvgPrice, on='Date', how='outer')
df2 = pd.merge(USDExchVol, TotalBTC, on='Date', how='outer')

df_test = pd.merge(AvgPrice, TransVol, on='Date', how='outer')
df_test
Out[35]:
Date Price TransactionVolume
0 2016-05-10 459.30 275352.0
1 2016-05-09 462.49 256585.0
2 2016-05-08 461.85 152045.0
3 2016-05-07 460.86 245115.0
4 2016-05-06 453.51 264882.0
5 2016-05-05 449.31 273005.0
6 2016-05-04 449.32 370911.0
7 2016-05-03 447.93 252534.0
8 2016-05-02 448.00 249926.0
9 2016-05-01 452.87 170791.0
10 2016-04-30 454.88 190470.0
11 2016-04-29 451.88 278893.0
12 2016-04-28 445.80 329924.0
13 2016-04-27 461.92 335750.0
14 2016-04-26 465.91 344162.0
15 2016-04-25 460.32 307790.0
16 2016-04-24 455.53 188499.0
17 2016-04-23 449.13 203792.0
18 2016-04-22 447.73 291487.0
19 2016-04-21 445.28 316159.0
20 2016-04-20 438.98 302380.0
21 2016-04-19 432.35 275994.0
22 2016-04-18 429.76 245313.0
23 2016-04-17 431.93 186607.0
24 2016-04-16 432.86 200628.0
25 2016-04-15 429.06 281389.0
26 2016-04-14 426.21 274524.0
27 2016-04-13 425.50 309995.0
28 2016-04-12 426.15 341372.0
29 2016-04-11 422.91 264357.0
... ... ... ...
1798 2011-05-18 7.14 80290.0
1799 2011-05-17 7.52 138205.0
1800 2011-05-16 7.77 62341.0
1801 2011-05-15 6.74 272130.0
1802 2011-05-14 7.86 656162.0
1803 2011-05-13 7.48 324020.0
1804 2011-05-12 5.83 101674.0
1805 2011-05-11 5.35 114243.0
1806 2011-05-10 4.74 104592.0
1807 2015-09-03 NaN 256023.0
1808 2015-02-03 NaN 213538.0
1809 2015-01-07 NaN 256344.0
1810 2014-11-21 NaN 161082.0
1811 2014-10-17 NaN 142251.0
1812 2014-09-28 NaN 92933.0
1813 2014-09-09 NaN 111317.0
1814 2014-08-05 NaN 136298.0
1815 2014-08-03 NaN 49181.0
1816 2014-08-01 NaN 166173.0
1817 2014-06-03 NaN 124768.0
1818 2014-06-02 NaN 87513.0
1819 2014-05-09 NaN 80315.0
1820 2013-10-27 NaN 107717.0
1821 2013-09-17 NaN 137920.0
1822 2011-06-25 NaN 110463.0
1823 2011-06-24 NaN 106146.0
1824 2011-06-23 NaN 475995.0
1825 2011-06-22 NaN 122507.0
1826 2011-06-21 NaN 114264.0
1827 2011-06-20 NaN 836861.0
[1828 rows x 3 columns]
