Subtracting value from column gives NaN only - python

I have a CSV file with multiple columns and I want to subtract the values of columns X31-X27, Y31-Y27 and Z31-Z27 within the same dataframe, but the subtraction only gives me NaN values.
Here are the values of the CSV file:
It gives me the result shown in the picture.
Help me figure out this problem.
import pandas as pd
import os
import numpy as np
df27 = pd.read_csv('D:27.txt', names=['No27','X27','Y27','Z27','Date27','Time27'], sep='\s+')
df28 = pd.read_csv('D:28.txt', names=['No28','X28','Y28','Z28','Date28','Time28'], sep='\s+')
df29 = pd.read_csv('D:29.txt', names=['No29','X29','Y29','Z29','Date29','Time29'], sep='\s+')
df30 = pd.read_csv('D:30.txt', names=['No30','X30','Y30','Z30','Date30','Time30'], sep='\s+')
df31 = pd.read_csv('D:31.txt', names=['No31','X31','Y31','Z31','Date31','Time31'], sep='\s+')
total=pd.concat([df27,df28,df29,df30,df31], axis=1)
total.to_csv('merge27-31.csv', index = False)
print(total)
df2731 = pd.read_csv('C:\\Users\\finalmerge27-31.csv')
df2731.reset_index(inplace=True)
print(df2731)
df227 = df2731[['X31', 'Y31', 'Z31']] - df2731[['X27', 'Y27', 'Z27']]
print(df227)

The direct subtraction gives all NaN because pandas aligns the two slices on their column labels, and 'X31' vs 'X27' (and so on) never match; subtract the underlying arrays instead so the operation is done by position.
# input data
df = pd.DataFrame({'x27': [-1458.88, 181.78, 1911.84, 3739.3, 5358.19],
                   'y27': [-5885.8, -5878.1, -5786.5, -5735.7, -5545.6],
                   'z27': [1102, 4139, 4616, 4108, 1123],
                   'x31': [-1458, 181, 1911, np.nan, 5358],
                   'y31': [-5885, -5878, -5786, np.nan, -5554],
                   'z31': [1102, 4138, 4616, np.nan, 1123]})
df
        x27     y27   z27     x31     y31     z31
0  -1458.88 -5885.8  1102 -1458.0 -5885.0  1102.0
1    181.78 -5878.1  4139   181.0 -5878.0  4138.0
2   1911.84 -5786.5  4616  1911.0 -5786.0  4616.0
3   3739.30 -5735.7  4108     NaN     NaN     NaN
4   5358.19 -5545.6  1123  5358.0 -5554.0  1123.0
df1 = df[['x31', 'y31', 'z31']]
df2 = df[['x27', 'y27', 'z27']]
pd.DataFrame(df1.values - df2.values).rename(columns={0: 'x31-x27', 1: 'y31-y27', 2: 'z31-z27'})
Out:
   x31-x27  y31-y27  z31-z27
0     0.88      0.8      0.0
1    -0.78      0.1     -1.0
2    -0.84      0.5      0.0
3      NaN      NaN      NaN
4    -0.19     -8.4      0.0
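Applied to the merged frame from the question, the same idea looks like this; a minimal sketch, assuming the merged CSV really contains the X27/Y27/Z27 and X31/Y31/Z31 columns used above:
import pandas as pd

df2731 = pd.read_csv('C:\\Users\\finalmerge27-31.csv')

# subtract the raw arrays so pandas does not try to align the mismatched column labels
diff = pd.DataFrame(
    df2731[['X31', 'Y31', 'Z31']].to_numpy() - df2731[['X27', 'Y27', 'Z27']].to_numpy(),
    columns=['X31-X27', 'Y31-Y27', 'Z31-Z27'],
    index=df2731.index,
)
print(diff)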

Related

How to fill empty data with zeros?

After going through some previous answers, I found that I could use this code to fill the missing values of df1[1], which range from 340 to 515:
with open('contactasortedtest.dat', 'r') as f:
    text = [line.split() for line in f]

def replace_missing(df1, Ids):
    missing = np.setdiff1d(Ids, df1[1])
    print(missing)
    if len(missing) > 0:
        missing_df = pd.DataFrame(data=np.zeros((len(missing), 4)))
        missing_df[1] = missing
        missing_df[2].replace(0, df1[2].iloc[1], inplace=True)
        df1 = pd.concat([df1, missing_df])
    return df1

Ids = np.arange(340.0, 515.0)
final_df = df1.groupby(df1[2], as_index=True).apply(replace_missing, Ids).reset_index(drop=True)
final_df
Through troubleshooting I found that missing = np.setdiff1d(Ids, df1[1]) does not work as expected; it returns the whole Ids array instead. I found many answers on this, but I couldn't work it out. Any help would be appreciated.
Sample data I used,
12 340.0 1.0 0.0
2 491.0 1.0 35.8
13 492.0 1.0 81.4
4 493.0 1.0 0.0
7 495.0 1.0 0.2
0 496.0 1.0 90.3
11 509.0 1.0 2.3
6 513.0 1.0 4.3
8 515.0 1.0 0.1
Thank you !
You can use df['x'].fillna(0) to fill the NaN values in a column with zeros.
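A minimal sketch of both steps (the toy frame and the 344.0 end point stand in for the question's data; note that np.arange(340.0, 515.0) stops at 514.0, so the end point should be 516.0 to include 515.0):
import numpy as np
import pandas as pd

# toy frame shaped like the question's data: column 1 holds the ids
df1 = pd.DataFrame({1: [340.0, 342.0, 343.0],
                    2: [1.0, np.nan, 1.0],
                    3: [0.0, 35.8, np.nan]})

# fill NaN cells in one column with zeros
df1[3] = df1[3].fillna(0)

# add all-zero rows for ids that are missing entirely
ids = np.arange(340.0, 344.0)
df1 = df1.set_index(1).reindex(ids, fill_value=0).reset_index()
print(df1)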

Setting row 0 of a dataframe to be the header

I have a dataframe (df) that looks like:
0 1 2 3 \
0 date BBG.XASX.ABP.S_price BBG.XASX.ABP.S_pos BBG.XASX.ABP.S_trade
1 2017-09-11 2.8303586 0.0 0.0
2 2017-09-12 2.8135189 98570.0 98570.0
3 2017-09-13 2.7829274 98570.0 0.0
4 2017-09-14 2.7928042 98570.0 0.0
4 5
0 BBG.XASX.ABP.S_cost BBG.XASX.ABP.S_pnl_pre_cost
1 -0.0 0.0
2 -37.439355326355 0.0
3 -0.0 -3015.4041549999965
4 -0.0 973.5561759999837
and has a df.columns of:
Int64Index([ 0, 1, 2, 3, 4, 5], dtype='int64')
How can I amend the dataframe so that column 0 is the header row? So the dataframe would look like:
date BBG.XASX.ABP.S_price BBG.XASX.ABP.S_pos BBG.XASX.ABP.S_trade
0 2017-09-11 2.8303586 0.0 0.0
1 2017-09-12 2.8135189 98570.0 98570.0
2 2017-09-13 2.7829274 98570.0 0.0
3 2017-09-14 2.7928042 98570.0 0.0
BBG.XASX.ABP.S_cost BBG.XASX.ABP.S_pnl_pre_cost
0 -0.0 0.0
1 -37.439355326355 0.0
2 -0.0 -3015.4041549999965
3 -0.0 973.5561759999837
and the df.columns would look like:
[date,BBG.XASX.ABP.S_price,BBG.XASX.ABP.S_pos,BBG.XASX.ABP.S_trade,BBG.XASX.ABP.S_cost,BBG.XASX.ABP.S_pnl_pre_cost]
The code to create the dataframe (as it stands) is below:
for subdirname in glob.iglob('C:/Users/stacey/WorkDocs/tradeopt/'+filename+'//BBG*/tradeopt.is-pnl*.lzma', recursive=True):
    a = pd.DataFrame(numpy.zeros((0,27)))  # data is 35 columns
    row = 0
    with lzma.open(subdirname, mode='rt') as file:
        print(subdirname)
        for line in file:
            items = line.split(",")
            a.loc[row] = items
            row = row + 1
    #a.columns = a.iloc[0]
    print(a.columns)
    print(a.head())
Create a list of lists and pass it to the DataFrame constructor: all lists except the first via out[1:], with the column names taken from out[0]:
out = []
with lzma.open(subdirname, mode='rt') as file:
    print(subdirname)
    for line in file:
        items = line.split(",")
        out.append(items)

a = pd.DataFrame(out[1:], columns=out[0])
I didn't test this, but it should probably work:
with lzma.open(subdirname, mode='rt') as file:
    df = pd.read_csv(file, sep=',', header=0)
This approach assumes that your file is formatted like a CSV.
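If the dataframe has already been built, the first row can also be promoted to the header directly, which is what the commented-out a.columns = a.iloc[0] line in the question was aiming at; a minimal sketch on a toy frame:
import pandas as pd

# toy frame where the real header sits in row 0
a = pd.DataFrame([['date', 'BBG.XASX.ABP.S_price'],
                  ['2017-09-11', '2.8303586'],
                  ['2017-09-12', '2.8135189']])

a.columns = a.iloc[0]                  # use row 0 as the header
a = a.iloc[1:].reset_index(drop=True)  # drop the old header row and renumber
print(a)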

How to exactly compute percentage change with NaN in DataFrame for each day?

I would like to compute a daily percentage change for this DataFrame (frame_):
import pandas as pd
import numpy as np
data_ = {
    'A': [1, np.NaN, 2, 1, 1, 2],
    'B': [1, 2, 3, 1, np.NaN, 1],
    'C': [1, 2, np.NaN, 1, 1, 2],
}
dates_ = [
    '06/01/2018', '05/01/2018', '04/01/2018', '03/01/2018', '02/01/2018', '01/01/2018'
]
frame_ = pd.DataFrame(data_, index=dates_, columns=['A', 'B', 'C'])
The issue is that I get a DataFrame with this method:
returns_ = frame_.pct_change(periods=1, fill_method='pad')
dates,A,B,C
06/01/2018,,,
05/01/2018,,1.0,1.0
04/01/2018,1.0,0.5,
03/01/2018,-0.5,-0.6666666666666667,-0.5
02/01/2018,0.0,,0.0
01/01/2018,1.0,0.0,1.0
This is not what I am looking for, and the dropna() method doesn't give me the result I seek either. I would like a percentage change computed for each day that has a value, and NaN for each day whose value is missing. For example, for column A I would like to see:
dates,A
06/01/2018,1
05/01/2018,
04/01/2018,1.0
03/01/2018,-0.5
02/01/2018,0.0
01/01/2018,1.0
Many thanks in advance
This is one way, a bit brute-force.
import pandas as pd
import numpy as np

data_ = {
    'A': [1, np.NaN, 2, 1, 1, 2],
    'B': [1, 2, 3, 1, np.NaN, 1],
    'C': [1, 2, np.NaN, 1, 1, 2],
}
dates_ = [
    '06/01/2018', '05/01/2018', '04/01/2018', '03/01/2018', '02/01/2018', '01/01/2018'
]
frame_ = pd.DataFrame(data_, index=dates_, columns=['A', 'B', 'C'])
frame_ = pd.concat([frame_, pd.DataFrame(columns=['dA', 'dB', 'dC'])])

for col in ['A', 'B', 'C']:
    frame_['d'+col] = frame_[col].pct_change()
    frame_.loc[pd.notnull(frame_[col]) & pd.isnull(frame_['d'+col]), 'd'+col] = frame_[col]

#                 A    B    C   dA        dB   dC
# 06/01/2018    1.0  1.0  1.0  1.0  1.000000  1.0
# 05/01/2018    NaN  2.0  2.0  NaN  1.000000  1.0
# 04/01/2018    2.0  3.0  NaN  1.0  0.500000  NaN
# 03/01/2018    1.0  1.0  1.0 -0.5 -0.666667 -0.5
# 02/01/2018    1.0  NaN  1.0  0.0       NaN  0.0
# 01/01/2018    2.0  1.0  2.0  1.0  0.000000  1.0
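The same idea can also be written without the explicit column loop; a vectorized sketch of the approach above, not taken from the original answer:
import pandas as pd
import numpy as np

data_ = {'A': [1, np.nan, 2, 1, 1, 2],
         'B': [1, 2, 3, 1, np.nan, 1],
         'C': [1, 2, np.nan, 1, 1, 2]}
dates_ = ['06/01/2018', '05/01/2018', '04/01/2018',
          '03/01/2018', '02/01/2018', '01/01/2018']
frame_ = pd.DataFrame(data_, index=dates_)

returns_ = frame_.pct_change()
# where the raw value exists but pct_change produced NaN (the first valid
# observation of a column), fall back to the raw value, as the loop above does
returns_ = returns_.mask(frame_.notna() & returns_.isna(), frame_)
print(returns_)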

Tiling in groupby on dataframe

I have a data frame that contains returns, size and sedols for a couple of dates.
My goal is to identify the top and bottom values for a certain condition per date, i.e I want the top decile largest size entries and the bottom decile smallest size entries for each date and flag them in a new column by 'xx' and 'yy'.
I am confused how to apply the tiling while grouping as well as creating a new column, here is what I already have.
import pandas as pd
import numpy as np
import datetime as dt
from random import choice
from string import ascii_uppercase
def create_dummy_data(start_date, days, entries_pday):
    date_sequence_lst = [dt.datetime.strptime(start_date, '%Y-%m-%d') +
                         dt.timedelta(days=x) for x in range(0, days)]
    date_sequence_lst = date_sequence_lst * entries_pday
    returns_lst = [round(np.random.uniform(low=-0.10, high=0.20), 2) for _ in range(entries_pday*days)]
    size_lst = [round(np.random.uniform(low=10.00, high=10000.00), 0) for _ in range(entries_pday*days)]
    rdm_sedol_lst = [(''.join(choice(ascii_uppercase) for i in range(7))) for x in range(entries_pday)]
    rdm_sedol_lst = rdm_sedol_lst * days
    dates_returns_df = pd.DataFrame({'Date': date_sequence_lst, 'Sedols': rdm_sedol_lst,
                                     'Returns': returns_lst, 'Size': size_lst})
    dates_returns_df = dates_returns_df.sort_values('Date', ascending=True)
    dates_returns_df = dates_returns_df.reset_index(drop=True)
    return dates_returns_df

def order_df_by(df_in, column_name):
    df_out = df_in.sort_values(['Date', column_name], ascending=[True, False])
    return df_out

def get_ntile(df_in, ntile):
    df_in['Tiled'] = df_in.groupby(['Date'])['Size'].transform(lambda x: pd.qcut(x, ntile))
    return df_in

if __name__ == "__main__":
    # create dummy returns
    data_df = create_dummy_data('2001-01-01', 31, 10)
    # sort by attribute
    data_sorted_df = order_df_by(data_df, 'Size')
    # ntile data per date
    data_ntiled = get_ntile(data_sorted_df, 10)
    for key, item in data_ntiled:
        print(data_ntiled.get_group(key))
So far I would expect deciled results based on 'Size' for each date; the next step would be to filter only for deciles 1 and 10 and flag those entries with 'xx' and 'yy' respectively.
Thanks
Consider using transform with the pandas.qcut method, with labels 1 through ntile, to build a decile column, then conditionally set the flag with np.where on the decile values:
...
def get_ntile(df_in, ntile):
    df_in['Tiled'] = df_in.groupby(['Date'])['Size'].transform(
        lambda x: pd.qcut(x, ntile, labels=list(range(1, ntile+1))))
    return df_in

if __name__ == "__main__":
    # create dummy returns
    data_df = create_dummy_data('2001-01-01', 31, 10)
    # sort by attribute
    data_sorted_df = order_df_by(data_df, 'Size')
    # ntile data per date
    data_ntiled = get_ntile(data_sorted_df, 10)

    data_ntiled['flag'] = np.where(data_ntiled['Tiled'] == 1.0, 'YY',
                                   np.where(data_ntiled['Tiled'] == 10.0, 'XX', np.nan))

    print(data_ntiled.reset_index(drop=True).head(15))
# Date Returns Sedols Size Tiled flag
# 0 2001-01-01 -0.03 TEEADVJ 8942.0 10.0 XX
# 1 2001-01-01 -0.03 PDBWGBJ 7142.0 9.0 nan
# 2 2001-01-01 0.03 QNVVPIC 6995.0 8.0 nan
# 3 2001-01-01 0.04 NTKEAKB 6871.0 7.0 nan
# 4 2001-01-01 0.20 ZVVCLSJ 6541.0 6.0 nan
# 5 2001-01-01 0.12 IJKXLIF 5131.0 5.0 nan
# 6 2001-01-01 0.14 HVPDRIU 4490.0 4.0 nan
# 7 2001-01-01 -0.08 XNOGFET 3397.0 3.0 nan
# 8 2001-01-01 -0.06 JOARYWC 2582.0 2.0 nan
# 9 2001-01-01 0.12 FVKBQGU 723.0 1.0 YY
# 10 2001-01-02 0.03 ZVVCLSJ 9291.0 10.0 XX
# 11 2001-01-02 0.14 HVPDRIU 8875.0 9.0 nan
# 12 2001-01-02 0.08 PDBWGBJ 7496.0 8.0 nan
# 13 2001-01-02 0.02 FVKBQGU 7307.0 7.0 nan
# 14 2001-01-02 -0.01 QNVVPIC 7159.0 6.0 nan
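As a small variant of the flag step (a sketch, not part of the original answer): np.select avoids nesting np.where calls and lets you use an empty string instead of the literal string 'nan' that np.where produces in the output above:
import numpy as np
import pandas as pd

# tiny stand-in for the data_ntiled frame built above
data_ntiled = pd.DataFrame({'Tiled': [10, 9, 2, 1]})

conditions = [data_ntiled['Tiled'] == 1, data_ntiled['Tiled'] == 10]
data_ntiled['flag'] = np.select(conditions, ['YY', 'XX'], default='')
print(data_ntiled)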

Summary statistics on Large csv file using python pandas

Let's say I have a 10 GB CSV file and I want to get the summary statistics of the file using the DataFrame describe method.
In this case I first need to create a DataFrame for all 10 GB of the CSV data.
text_csv=Pandas.read_csv("target.csv")
df=Pandas.DataFrame(text_csv)
df.describe()
Does this mean all 10 GB will get loaded into memory to calculate the statistics?
Yes, I think you are right. And you can omit df = Pandas.DataFrame(text_csv), because the output of read_csv is already a DataFrame:
import pandas as pd

df = pd.read_csv("target.csv")
print(df.describe())
Or you can use dask:
import dask.dataframe as dd

df = dd.read_csv('target.csv')
print(df.describe().compute())
You can use the chunksize parameter of read_csv, but then you get a TextFileReader, not a DataFrame, so you need concat:
import pandas as pd
import io
temp=u"""a;b
1;525
1;526
1;533
2;527
2;528
2;532
3;519
3;534
3;535
4;530
5;529
5;531
6;520
6;521
6;524"""
# after testing, replace io.StringIO(temp) with the filename
# chunksize=2 is only for testing
tp = pd.read_csv(io.StringIO(temp), sep=";", chunksize=2)
print(tp)
<pandas.io.parsers.TextFileReader object at 0x000000001995ADA0>
df = pd.concat(tp, ignore_index=True)
print(df.describe())
a b
count 15.000000 15.000000
mean 3.333333 527.600000
std 1.877181 5.082182
min 1.000000 519.000000
25% 2.000000 524.500000
50% 3.000000 528.000000
75% 5.000000 531.500000
max 6.000000 535.000000
You can also convert each chunk of the TextFileReader to a DataFrame and describe it, but aggregating this output can be difficult:
import pandas as pd
import io
temp=u"""a;b
1;525
1;526
1;533
2;527
2;528
2;532
3;519
3;534
3;535
4;530
5;529
5;531
6;520
6;521
6;524"""
# after testing, replace io.StringIO(temp) with the filename
tp = pd.read_csv(io.StringIO(temp), sep=";", chunksize=2)
print(tp)

dfs = []
for t in tp:
    df = pd.DataFrame(t)
    df1 = df.describe()
    dfs.append(df1.T)

df2 = pd.concat(dfs)
print(df2)
count mean std min 25% 50% 75% max
a 2 1.0 0.000000 1 1.00 1.0 1.00 1
b 2 525.5 0.707107 525 525.25 525.5 525.75 526
a 2 1.5 0.707107 1 1.25 1.5 1.75 2
b 2 530.0 4.242641 527 528.50 530.0 531.50 533
a 2 2.0 0.000000 2 2.00 2.0 2.00 2
b 2 530.0 2.828427 528 529.00 530.0 531.00 532
a 2 3.0 0.000000 3 3.00 3.0 3.00 3
b 2 526.5 10.606602 519 522.75 526.5 530.25 534
a 2 3.5 0.707107 3 3.25 3.5 3.75 4
b 2 532.5 3.535534 530 531.25 532.5 533.75 535
a 2 5.0 0.000000 5 5.00 5.0 5.00 5
b 2 530.0 1.414214 529 529.50 530.0 530.50 531
a 2 6.0 0.000000 6 6.00 6.0 6.00 6
b 2 520.5 0.707107 520 520.25 520.5 520.75 521
a 1 6.0 NaN 6 6.00 6.0 6.00 6
b 1 524.0 NaN 524 524.00 524.0 524.00 524
There seems to be no file-size limitation for the pandas.read_csv method.
According to @fickludd's and @Sebastian Raschka's answers in Large, persistent DataFrame in pandas, you can use iterator=True and chunksize=xxx to load the giant CSV file chunk by chunk and calculate the statistics you want:
import pandas as pd

df = pd.read_csv('some_data.csv', iterator=True, chunksize=1000)  # gives a TextFileReader, iterable in chunks of 1000 rows
for chunk in df:
    partial_desc = chunk.describe()
    # collect each chunk's partial description here
Then aggregate all the partial describe info yourself.
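For instance, a rough sketch of aggregating basic statistics yourself across chunks ('target.csv' is the file from the question, the chunk size is arbitrary, and exact percentiles cannot be combined this way; they need an extra pass or an approximation):
import numpy as np
import pandas as pd

reader = pd.read_csv('target.csv', chunksize=100000)

count = total = total_sq = col_min = col_max = None
for chunk in reader:
    num = chunk.select_dtypes('number')
    if count is None:
        count, total, total_sq = num.count(), num.sum(), (num ** 2).sum()
        col_min, col_max = num.min(), num.max()
    else:
        count += num.count()
        total += num.sum()
        total_sq += (num ** 2).sum()
        col_min = np.minimum(col_min, num.min())
        col_max = np.maximum(col_max, num.max())

mean = total / count
# sample standard deviation recovered from the running sums (ddof=1, like describe())
std = np.sqrt((total_sq - count * mean ** 2) / (count - 1))
summary = pd.DataFrame({'count': count, 'mean': mean, 'std': std,
                        'min': col_min, 'max': col_max})
print(summary)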
