Setting row 0 of a dataframe to be the header - python

I have a dataframe (df) that looks like:
            0                     1                   2                     3  \
0        date  BBG.XASX.ABP.S_price  BBG.XASX.ABP.S_pos  BBG.XASX.ABP.S_trade
1  2017-09-11             2.8303586                 0.0                   0.0
2  2017-09-12             2.8135189             98570.0               98570.0
3  2017-09-13             2.7829274             98570.0                   0.0
4  2017-09-14             2.7928042             98570.0                   0.0

                     4                            5
0  BBG.XASX.ABP.S_cost  BBG.XASX.ABP.S_pnl_pre_cost
1                 -0.0                          0.0
2     -37.439355326355                          0.0
3                 -0.0          -3015.4041549999965
4                 -0.0            973.5561759999837
and has df.columns set to:
Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')
How can I amend the dataframe so that column 0 is the header row? So the dataframe would look like:
         date BBG.XASX.ABP.S_price BBG.XASX.ABP.S_pos BBG.XASX.ABP.S_trade
0  2017-09-11            2.8303586                0.0                  0.0
1  2017-09-12            2.8135189            98570.0              98570.0
2  2017-09-13            2.7829274            98570.0                  0.0
3  2017-09-14            2.7928042            98570.0                  0.0

  BBG.XASX.ABP.S_cost BBG.XASX.ABP.S_pnl_pre_cost
0                -0.0                         0.0
1    -37.439355326355                         0.0
2                -0.0         -3015.4041549999965
3                -0.0           973.5561759999837
and df.columns would look like:
[date,BBG.XASX.ABP.S_price,BBG.XASX.ABP.S_pos,BBG.XASX.ABP.S_trade,BBG.XASX.ABP.S_cost,BBG.XASX.ABP.S_pnl_pre_cost]
The code to create the dataframe (as it stands) is below:
for subdirname in glob.iglob('C:/Users/stacey/WorkDocs/tradeopt/'+filename+'//BBG*/tradeopt.is-pnl*.lzma', recursive=True):
    a = pd.DataFrame(numpy.zeros((0,27)))  # data is 35 columns
    row = 0
    with lzma.open(subdirname, mode='rt') as file:
        print(subdirname)
        for line in file:
            items = line.split(",")
            a.loc[row] = items
            row = row + 1
    #a.columns = a.iloc[0]
    print(a.columns)
    print(a.head())

Create a list of lists and pass it to the DataFrame constructor: all rows except the first via out[1:], with column names taken from out[0]:
out = []
with lzma.open(subdirname, mode='rt') as file:
    print(subdirname)
    for line in file:
        items = line.split(",")
        out.append(items)

a = pd.DataFrame(out[1:], columns=out[0])

I didn't test this, but it should probably work:
with lzma.open(subdirname, mode='rt') as file:
    df = pd.read_csv(file, sep=',', header=0)
This approach assumes that your file is formatted like a CSV.
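Alternatively, you can promote row 0 to the header of the already-built dataframe, along the lines of the commented-out line in the question. A minimal sketch, assuming a is the dataframe produced by the loop above:

a.columns = a.iloc[0]                  # use row 0 as the header
a = a.iloc[1:].reset_index(drop=True)  # drop the old header row and renumber from 0
a.columns.name = None                  # clear the leftover name on the column index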

Related

Subtracting value from column gives NaN only

I have a multi-column CSV file and I want to subtract the column values X31-X27, Y31-Y27, Z31-Z27 within the same dataframe, but the subtraction gives me only NaN values.
The values of the CSV file and the result I get were shown as images in the original post.
Help me to figure out this problem.
import pandas as pd
import os
import numpy as np

df27 = pd.read_csv('D:27.txt', names=['No27','X27','Y27','Z27','Date27','Time27'], sep=r'\s+')
df28 = pd.read_csv('D:28.txt', names=['No28','X28','Y28','Z28','Date28','Time28'], sep=r'\s+')
df29 = pd.read_csv('D:29.txt', names=['No29','X29','Y29','Z29','Date29','Time29'], sep=r'\s+')
df30 = pd.read_csv('D:30.txt', names=['No30','X30','Y30','Z30','Date30','Time30'], sep=r'\s+')
df31 = pd.read_csv('D:31.txt', names=['No31','X31','Y31','Z31','Date31','Time31'], sep=r'\s+')

total = pd.concat([df27, df28, df29, df30, df31], axis=1)
total.to_csv('merge27-31.csv', index=False)
print(total)

df2731 = pd.read_csv('C:\\Users\\finalmerge27-31.csv')
df2731.reset_index(inplace=True)
print(df2731)

df227 = df2731[['X31', 'Y31', 'Z31']] - df2731[['X27', 'Y27', 'Z27']]
print(df227)
# input data
df = pd.DataFrame({'x27': [-1458.88, 181.78, 1911.84, 3739.3, 5358.19],
                   'y27': [-5885.8, -5878.1, -5786.5, -5735.7, -5545.6],
                   'z27': [1102, 4139, 4616, 4108, 1123],
                   'x31': [-1458, 181, 1911, np.nan, 5358],
                   'y31': [-5885, -5878, -5786, np.nan, -5554],
                   'z31': [1102, 4138, 4616, np.nan, 1123]})
df
x27 y27 z27 x31 y31 z31
0 -1458.88 -5885.8 1102 -1458.0 -5885.0 1102.0
1 181.78 -5878.1 4139 181.0 -5878.0 4138.0
2 1911.84 -5786.5 4616 1911.0 -5786.0 4616.0
3 3739.30 -5735.7 4108 NaN NaN NaN
4 5358.19 -5545.6 1123 5358.0 -5554.0 1123.0
# df1 and df2 are the two three-column subsets (swap them to get X31 - X27):
df1 = df[['x27', 'y27', 'z27']]
df2 = df[['x31', 'y31', 'z31']]
# subtracting the .values arrays bypasses pandas' column-label alignment
pd.DataFrame(df1.values - df2.values).rename(columns={0: 'x27-x31', 1: 'y27-y31', 2: 'z27-z31'})
Out:
   x27-x31  y27-y31  z27-z31
0    -0.88     -0.8      0.0
1     0.78     -0.1      1.0
2     0.84     -0.5      0.0
3      NaN      NaN      NaN
4     0.19      8.4      0.0
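An equivalent way to avoid the alignment NaNs while staying in pandas is to subtract a bare NumPy array, so no label matching happens. A sketch, assuming the same df as above:

# .sub() against an ndarray subtracts positionally, not by column label
diff = df[['x31', 'y31', 'z31']].sub(df[['x27', 'y27', 'z27']].to_numpy())
diff.columns = ['x31-x27', 'y31-y27', 'z31-z27']  # label the differences explicitly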

How to fill empty data with zeros?

After going through some previous answers I found that I could use this code to fill the missing values of df1[0], which range from 340 to 515:
with open('contactasortedtest.dat', 'r') as f:
    text = [line.split() for line in f]

def replace_missing(df1, Ids):
    missing = np.setdiff1d(Ids, df1[1])
    print(missing)
    if len(missing) > 0:
        missing_df = pd.DataFrame(data=np.zeros((len(missing), 4)))
        missing_df[1] = missing
        missing_df[2].replace(0, df1[2].iloc[1], inplace=True)
        df1 = pd.concat([df1, missing_df])
    return df1

Ids = np.arange(340.0, 515.0)
final_df = df1.groupby(df1[2], as_index=True).apply(replace_missing, Ids).reset_index(drop=True)
final_df
Through troubleshooting I found that missing = np.setdiff1d(Ids, df1[1]) does not work as expected; rather, it returns the whole array. I found many answers on this, but I couldn't work it out. Any help would be appreciated.
Sample data I used:
12 340.0 1.0 0.0
2 491.0 1.0 35.8
13 492.0 1.0 81.4
4 493.0 1.0 0.0
7 495.0 1.0 0.2
0 496.0 1.0 90.3
11 509.0 1.0 2.3
6 513.0 1.0 4.3
8 515.0 1.0 0.1
Thank you!
You can use df['x'].fillna(0) to fill the NaN values in a column with zeros.
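As for why np.setdiff1d(Ids, df1[1]) returns the whole Ids array: a common cause is a dtype mismatch, e.g. column 1 holding strings while Ids holds floats, so nothing ever matches. A minimal sketch, assuming that is the cause here:

# coerce column 1 to numeric before taking the set difference
missing = np.setdiff1d(Ids, pd.to_numeric(df1[1], errors='coerce'))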

transpose data based on multiple tables with the help of pandas

Looking for some help with the below. I have two big CSV files and need to get the data based on a few conditions. Here is my sample data file:
a1,a2,v1,v2,v3,v4
1,12,12.99,0.0,34.33,0
1,13,12.99,0.0,34.33,0
1,145,12.99,0.0,34.33,0
2,15,12.99,0.0,34.33,0
2,169,12.99,0.0,34.33,0
3,164,12.99,0.0,34.33,0
3,147,12.99,0.0,34.33,0
1,174,12.99,0.0,34.33,0
2,148,12.99,0.0,34.33,0
4,154,12.99,0.0,34.33,0
The second file maps a1 values to v* column names:
a1,k1
1,v1
2,v2
3,v3
4,v4
The values under a1 and k1 are to be matched, and if all values in a v* column are zero, the corresponding a1 rows are to be dropped from the final CSV file. Expected output:
a1,a2,v1,v2,v3,v4
1,12,12.99,0.0,34.33,0
1,13,12.99,0.0,34.33,0
1,145,12.99,0.0,34.33,0
3,164,12.99,0.0,34.33,0
3,147,12.99,0.0,34.33,0
1,174,12.99,0.0,34.33,0
The v2 and v4 columns are all zeroes, so the rows with a1 values 2 and 4 are dropped.
Thanks in advance.
IIUC:
# Find v* columns where all values are 0
vx_idx = df1.filter(regex=r'^v\d+').eq(0).all().loc[lambda x: x].index
# Find a1 values whose k1 maps to those v* columns
a1_val = df2.loc[df2['k1'].isin(vx_idx), 'a1'].tolist()
# Filter those a1 values out of the final dataframe
out = df1[~df1['a1'].isin(a1_val)]
Output:
>>> out
a1 a2 v1 v2 v3 v4
0 1 12 12.99 0.0 34.33 0
1 1 13 12.99 0.0 34.33 0
2 1 145 12.99 0.0 34.33 0
5 3 164 12.99 0.0 34.33 0
6 3 147 12.99 0.0 34.33 0
7 1 174 12.99 0.0 34.33 0
>>> print(out.to_csv(index=False))
a1,a2,v1,v2,v3,v4
1,12,12.99,0.0,34.33,0
1,13,12.99,0.0,34.33,0
1,145,12.99,0.0,34.33,0
3,164,12.99,0.0,34.33,0
3,147,12.99,0.0,34.33,0
1,174,12.99,0.0,34.33,0
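For a self-contained run, both frames can be built directly from the sample data in the question. A sketch; the names df1 and df2 are taken from the answer above:

import io
import pandas as pd

data1 = """a1,a2,v1,v2,v3,v4
1,12,12.99,0.0,34.33,0
1,13,12.99,0.0,34.33,0
1,145,12.99,0.0,34.33,0
2,15,12.99,0.0,34.33,0
2,169,12.99,0.0,34.33,0
3,164,12.99,0.0,34.33,0
3,147,12.99,0.0,34.33,0
1,174,12.99,0.0,34.33,0
2,148,12.99,0.0,34.33,0
4,154,12.99,0.0,34.33,0"""
data2 = """a1,k1
1,v1
2,v2
3,v3
4,v4"""

df1 = pd.read_csv(io.StringIO(data1))  # the main data file
df2 = pd.read_csv(io.StringIO(data2))  # the a1 -> v* lookup file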

Create new columns based on previous columns with multiplication

I want to create a set of columns where each new column is the previous column times 1.5, rolling forward until year 2020. I tried to use previous and current as below, but it didn't work as expected. How can I make it work?
df = pd.DataFrame({
    'us2000': [5, 3, 6, 9, 2, 4],
}); df

a = []
for i in range(1, 21):
    a.append("us202" + str(i))

for previous, current in zip(a, a[1:]):
    df[current] = df[previous] * 1.5
IIUC, you can fix your code with the loop below. (The original built names us2021 through us20220 and never included us2000, so the zip paired the wrong columns.)
a = []
for i in range(0, 21):
    a.append(f'us20{i:02}')

for previous, current in zip(a, a[1:]):
    df[current] = df[previous] * 1.5
Another, vectorized approach with numpy would be:
df2 = pd.DataFrame(df['us2000'].to_numpy()[:, None] * 1.5**np.arange(21),
                   columns=[f'us20{i:02}' for i in range(21)])
output:
us2000 us2001 us2002 us2003 us2004 us2005 us2006 us2007 ...
0 5 7.5 11.25 16.875 25.3125 37.96875 56.953125 85.429688
1 3 4.5 6.75 10.125 15.1875 22.78125 34.171875 51.257812
2 6 9.0 13.50 20.250 30.3750 45.56250 68.343750 102.515625
3 9 13.5 20.25 30.375 45.5625 68.34375 102.515625 153.773438
4 2 3.0 4.50 6.750 10.1250 15.18750 22.781250 34.171875
5 4 6.0 9.00 13.500 20.2500 30.37500 45.562500 68.343750
Try:
for i in range(1, 21):
    df[f"us{2000 + i}"] = df[f"us{2000 + i - 1}"].mul(1.5)
>>> df
us2000 us2001 us2002 ... us2018 us2019 us2020
0 5 7.5 11.25 ... 7389.45940 11084.18910 16626.283650
1 3 4.5 6.75 ... 4433.67564 6650.51346 9975.770190
2 6 9.0 13.50 ... 8867.35128 13301.02692 19951.540380
3 9 13.5 20.25 ... 13301.02692 19951.54038 29927.310571
4 2 3.0 4.50 ... 2955.78376 4433.67564 6650.513460
5 4 6.0 9.00 ... 5911.56752 8867.35128 13301.026920
[6 rows x 21 columns]
pd.DataFrame(df.to_numpy() * [1.5**i for i in range(0, 21)]) \
    .rename(columns=lambda x: str(x).rjust(2, '0')).add_prefix("us20")
Output:
us2000 us2001 us2002 ... us2018 us2019 us2020
0 5 7.5 11.25 ... 7389.45940 11084.18910 16626.283650
1 3 4.5 6.75 ... 4433.67564 6650.51346 9975.770190
2 6 9.0 13.50 ... 8867.35128 13301.02692 19951.540380
3 9 13.5 20.25 ... 13301.02692 19951.54038 29927.310571
4 2 3.0 4.50 ... 2955.78376 4433.67564 6650.513460
5 4 6.0 9.00 ... 5911.56752 8867.35128 13301.026920
[6 rows x 21 columns]
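A quick sanity check that the fixed loop and the vectorized approach agree (a sketch, assuming df from the fixed loop and df2 from the numpy answer above):

import pandas as pd

# raises AssertionError if the two frames differ; dtypes may differ (int vs float)
pd.testing.assert_frame_equal(df, df2, check_dtype=False)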

pandas adding column values in a loop without using iloc

I want to add up (and ideally take the mean of) several column values starting at my index i:
investmentlength = list(range(1, 13, 1))
returns = list()
for i in range(0, len(stocks2)):
    if stocks2['Startpoint'][i] == 1:
        nextmonth = nextmonth + stocks2['RET'][i+1] + stocks2['RET'][i+2] + stocks2['RET'][i+3] + ...
        counter += 1
Is there a way to give the beginning index, the end index, and perhaps a step size, and then sum it all in one command instead of copy-pasting to death? I want to go through all the different investment lengths and put the average returns in the empty list.
SHRCD EXCHCD SICCD PRC VOL RET SHROUT \
DATE PERMNO
1970-08-31 10559.0 10.0 1.0 5311.0 35.000 1692.0 0.030657 12048.0
12626.0 10.0 1.0 5411.0 46.250 926.0 0.088235 6624.0
12749.0 11.0 1.0 5331.0 45.500 5632.0 0.126173 34685.0
13100.0 11.0 1.0 5311.0 22.000 1759.0 0.171242 15107.0
13653.0 10.0 1.0 5311.0 13.125 141.0 0.220930 1337.0
13936.0 11.0 1.0 2331.0 11.500 270.0 -0.053061 3942.0
14322.0 11.0 1.0 5311.0 64.750 6934.0 0.024409 154187.0
16969.0 10.0 1.0 5311.0 42.875 1069.0 0.186851 13828.0
17072.0 10.0 1.0 5311.0 14.750 777.0 0.026087 5415.0
17304.0 10.0 1.0 5311.0 24.875 1939.0 0.058511 8150.0
MV XRET IB ... PE2 \
DATE PERMNO ...
1970-08-31 10559.0 421680.000 0.025357 NaN ... 13.852692
12626.0 306360.000 0.082935 NaN ... 13.145312
12749.0 1578167.500 0.120873 NaN ... 25.970466
13100.0 332354.000 0.165942 NaN ... 9.990711
13653.0 17548.125 0.215630 NaN ... 6.273570
13936.0 45333.000 -0.058361 NaN ... 6.473123
14322.0 9983608.250 0.019109 NaN ... 22.204047
16969.0 592875.500 0.181551 NaN ... 11.948061
17072.0 79871.250 0.020787 NaN ... 8.845526
17304.0 202731.250 0.053211 NaN ... 8.641655
lagPE1 lagPE2 lagMV lagSEQ QUINTILE1 \
DATE PERMNO
1970-08-31 10559.0 13.852692 13.852692 412644.000 264.686 4
12626.0 13.145312 13.145312 281520.000 164.151 4
12749.0 25.970466 25.970466 1404742.500 367.519 5
13100.0 9.990711 9.990711 288921.375 414.820 3
13653.0 6.273570 6.273570 14372.750 24.958 1
13936.0 6.473123 6.473123 48289.500 76.986 1
14322.0 22.204047 22.204047 9790874.500 3439.802 5
16969.0 11.948061 11.948061 499536.500 NaN 4
17072.0 8.845526 8.845526 77840.625 NaN 3
17304.0 8.641655 8.641655 191525.000 307.721 3
QUINTILE2 avgvol avg Startpoint
DATE PERMNO
1970-08-31 10559.0 4 9229.057592 1697.2 0
12626.0 4 3654.367470 894.4 0
12749.0 5 188206.566860 5828.6 0
13100.0 3 94127.319048 3477.2 0
13653.0 1 816.393162 268.8 0
13936.0 1 71547.050633 553.2 0
14322.0 5 195702.521519 6308.8 0
16969.0 4 3670.297872 2002.0 0
17072.0 3 3774.083333 3867.8 0
17304.0 3 12622.112903 1679.4 0
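One way to avoid the hand-written chain of stocks2['RET'][i+1] + ... terms, sketched under the assumption that stocks2 and the loop index i are as in the question and n is the investment length in months:

n = 12
# the n RET values following row i, selected positionally in one slice
window = stocks2['RET'].iloc[i + 1 : i + 1 + n]
nextmonth = window.sum()   # replaces the long chain of additions
avg = window.mean()        # or take the mean directly

# vectorized alternative: forward-looking n-month sum for every row at once
forward_sum = stocks2['RET'].rolling(n).sum().shift(-n)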
