Pandas - Local store a variable (multiplier) - python

Another question :) I need to know how to store a variable for reference at the beginning of my script. In this case, I need to store an FX conversion rate that I want to be able to adjust at the top of the script. I also want to store a directory on my PC where the script's output will be saved, which is prone to changing each month.
For reference, I have created the following example.
import pandas as pd
FX_rate = {'AUD':[0.71442],'NZD':[0.68476]}
Dir = r'C:\Users\Admin\Desktop\December\Monthly_Output.csv'
df = {'AU_SALES':[1000,2500,750,6800,1000],'NZ_SALES':[500,2200,430,100,6670]}
df1 = pd.DataFrame(df)
# ISSUE HERE - Convert sales using the FX_rate dictionary
df_USDAUD = df1['AU_SALES'] * FX_rate['AUD']
df_USDNZD = df1['NZ_SALES'] * FX_rate['NZD']
df_converted = df_USDAUD.append(df_USDNZD)
# Save output in folder, using Dir directory
df_converted.to_csv(Dir)
If I were to run this script I would get an error telling me that the number of values in df['AU_SALES'] (5) and the number of values in FX_rate['AUD'] (1) do not match.
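For what it's worth, the mismatch can be reproduced in isolation; a minimal sketch (names here are illustrative):

```python
import pandas as pd

sales = pd.Series([1000, 2500, 750, 6800, 1000])
rate = [0.71442]  # a one-element list, as in FX_rate['AUD']

# "sales * rate" raises the length-mismatch error described above, because
# pandas aligns list-likes element by element instead of broadcasting them.
# Indexing the scalar out of the list broadcasts cleanly over all five rows:
converted = sales * rate[0]
```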

I do not know what your expected output should be, but I think this should work: use multiply with the star operator *.
import pandas as pd
FX_rate = {'AUD':[0.71442],'NZD':[0.68476]}
Dir = r'C:\Users\Admin\Desktop\December\Monthly_Output.csv'
df = {'AU_SALES':[1000,2500,750,6800,1000],'NZ_SALES':[500,2200,430,100,6670]}
df1 = pd.DataFrame(df)
# use multiply with the star operator
df_USDAUD = df1['AU_SALES'].multiply(*FX_rate['AUD'])
df_USDNZD = df1['NZ_SALES'].multiply(*FX_rate['NZD'])
df_converted = df_USDAUD.append(df_USDNZD)
print(df_converted)
0 714.4200
1 1786.0500
2 535.8150
3 4858.0560
4 714.4200
0 342.3800
1 1506.4720
2 294.4468
3 68.4760
4 4567.3492
dtype: float64
Or you can create a function
# create a function
def myFun(df, aud, nzd, Dir):
    df_USDAUD = df['AU_SALES'] * aud
    df_USDNZD = df['NZ_SALES'] * nzd
    df_converted = df_USDAUD.append(df_USDNZD)
    df_converted.to_csv(Dir)
    return df_converted
Dir = r'C:\Users\Admin\Desktop\December\Monthly_Output.csv'
df = {'AU_SALES':[1000,2500,750,6800,1000],'NZ_SALES':[500,2200,430,100,6670]}
df1 = pd.DataFrame(df)
myFun(df1, 0.71442, 0.68476, Dir)
Or just do not store the numbers in lists inside the dict: FX_rate = {'AUD':0.71442,'NZD':0.68476}
FX_rate = {'AUD':0.71442,'NZD':0.68476}
Dir = r'C:\Users\Admin\Desktop\December\Monthly_Output.csv'
df = {'AU_SALES':[1000,2500,750,6800,1000],'NZ_SALES':[500,2200,430,100,6670]}
df1 = pd.DataFrame(df)
# with scalar rates, * broadcasts directly over each column
df_USDAUD = df1['AU_SALES'] * FX_rate['AUD']
df_USDNZD = df1['NZ_SALES'] * FX_rate['NZD']
df_converted = df_USDAUD.append(df_USDNZD)
print(df_converted)
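One caveat for current pandas: Series.append (used in the snippets above) was deprecated in 1.4 and removed in 2.0, so on recent versions the scalar-rate variant would be written with pd.concat instead; a sketch:

```python
import pandas as pd

FX_rate = {'AUD': 0.71442, 'NZD': 0.68476}
df1 = pd.DataFrame({'AU_SALES': [1000, 2500, 750, 6800, 1000],
                    'NZ_SALES': [500, 2200, 430, 100, 6670]})

# scalar rates broadcast over the whole column
df_USDAUD = df1['AU_SALES'] * FX_rate['AUD']
df_USDNZD = df1['NZ_SALES'] * FX_rate['NZD']

# pd.concat stacks the two Series, just as append did
df_converted = pd.concat([df_USDAUD, df_USDNZD])
```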

Related

Equivalent of pd.Series.str.slice() and pd.Series.apply() in cuDF

I want to convert the following code (which runs in pandas) to code that runs in cuDF.
Sample data from .head() of the Series being manipulated is plugged into the original code in the third code cell down, so it should be copy/paste runnable.
Original code in pandas
# both are float columns now
# rawcensustractandblock
s_rawcensustractandblock = df_train['rawcensustractandblock'].apply(lambda x: str(x))
# adjust/set new tract number
df_train['census_tractnumber'] = s_rawcensustractandblock.str.slice(4,11)
# adjust block number
df_train['block_number'] = s_rawcensustractandblock.str.slice(start=11)
df_train['block_number'] = df_train['block_number'].apply(lambda x: x[:4]+'.'+x[4:]+'0' )
df_train['block_number'] = df_train['block_number'].apply(lambda x: int(round(float(x),0)) )
df_train['block_number'] = df_train['block_number'].apply(lambda x: str(x).ljust(4,'0') )
Data being manipulated
# series of values from df_train['rawcensustractandblock'].head()
data = pd.Series([60371066.461001, 60590524.222024, 60374638.00300401,
60372963.002002, 60590423.381006])
Code adjusted to start with this sample data
Here's how the code looks when using the above provided data instead of the entire dataframe.
Based on the errors encountered when trying to convert, the issue is at the Series level, so converting the cell below to execute in cuDF should solve the problem.
import pandas as pd
# series of values from df_train['rawcensustractandblock'].head()
data = pd.Series([60371066.461001, 60590524.222024, 60374638.00300401,
60372963.002002, 60590423.381006])
# how the first line looks using the series
s_rawcensustractandblock = data.apply(lambda x: str(x))
# adjust/set new tract number
census_tractnumber = s_rawcensustractandblock.str.slice(4,11)
# adjust block number
block_number = s_rawcensustractandblock.str.slice(start=11)
block_number = block_number.apply(lambda x: x[:4]+'.'+x[4:]+'0' )
block_number = block_number.apply(lambda x: int(round(float(x),0)) )
block_number = block_number.apply(lambda x: str(x).ljust(4,'0') )
Expected changes (output)
df_train['census_tractnumber'].head()
# out
0 1066.46
1 0524.22
2 4638.00
3 2963.00
4 0423.38
Name: census_tractnumber, dtype: object
df_train['block_number'].head()
0 1001
1 2024
2 3004
3 2002
4 1006
Name: block_number, dtype: object
You can use cuDF string methods (via nvStrings) for almost everything you're trying to do. You will lose some precision converting these floats to strings in cuDF (though it may not matter in your example above), so for this example I've simply converted beforehand. If possible, I would recommend initially creating the rawcensustractandblock as a string column rather than a float column.
import cudf
import pandas as pd

# the sample values from the question, as a pandas Series
pd_data = pd.Series([60371066.461001, 60590524.222024, 60374638.00300401,
                     60372963.002002, 60590423.381006])

gdata = cudf.from_pandas(pd_data.astype('str'))

tractnumber = gdata.str.slice(4, 11)
blocknumber = gdata.str.slice(11)
blocknumber = blocknumber.str.slice(0, 4).str.cat(blocknumber.str.slice(4), '.')
blocknumber = blocknumber.astype('float').round(0).astype('int')
blocknumber = blocknumber.astype('str').str.ljust(4, '0')
tractnumber
0 1066.46
1 0524.22
2 4638.00
3 2963.00
4 0423.38
dtype: object
blocknumber
0 1001
1 2024
2 3004
3 2002
4 1006
dtype: object
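For quick verification without a GPU, the same chain of string operations runs in plain pandas (whose .str API the cuDF calls above mirror); a sketch:

```python
import pandas as pd

pd_data = pd.Series([60371066.461001, 60590524.222024, 60374638.00300401,
                     60372963.002002, 60590423.381006])
s = pd_data.astype('str')

tractnumber = s.str.slice(4, 11)
blocknumber = s.str.slice(11)
# re-insert the decimal point after the 4th character, as in the original lambda
blocknumber = blocknumber.str.slice(0, 4).str.cat(blocknumber.str.slice(4), sep='.')
blocknumber = blocknumber.astype('float').round(0).astype('int')
blocknumber = blocknumber.astype('str').str.ljust(4, '0')
```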
for loop solution
pandas (original code)
import pandas as pd
# data from df_train.rawcensustractandblock.head()
pd_data = pd.Series([60371066.461001, 60590524.222024, 60374638.00300401,
60372963.002002, 60590423.381006])
# using series instead of dataframe
pd_raw_block = pd_data.apply(lambda x: str(x))
# adjust/set new tract number
pd_tractnumber = pd_raw_block.str.slice(4,11)
# set/adjust block number
pd_block_number = pd_raw_block.str.slice(11)
pd_block_number = pd_block_number.apply(lambda x: x[:4]+'.'+x[4:]+'0')
pd_block_number = pd_block_number.apply(lambda x: int(round(float(x),0)))
pd_block_number = pd_block_number.apply(lambda x: str(x).ljust(4,'0'))
# print(list(pd_tractnumber))
# print(list(pd_block_number))
cuDF (solution code)
import cudf
# data from df_train.rawcensustractandblock.head()
cudf_data = cudf.Series([60371066.461001, 60590524.222024, 60374638.00300401,
60372963.002002, 60590423.381006])
# using series instead of dataframe
cudf_tractnumber = cudf_data.values_to_string()
# adjust/set new tract number
for i in range(len(cudf_tractnumber)):
    funct = slice(4, 11)
    cudf_tractnumber[i] = cudf_tractnumber[i][funct]
# using series instead of dataframe
cudf_block_number = cudf_data.values_to_string()
# set/adjust block number
for i in range(len(cudf_block_number)):
    funct = slice(11, None)
    cudf_block_number[i] = cudf_block_number[i][funct]
    cudf_block_number[i] = cudf_block_number[i][:4]+'.'+cudf_block_number[i][4:]+'0'
    cudf_block_number[i] = int(round(float(cudf_block_number[i]), 0))
    cudf_block_number[i] = str(cudf_block_number[i]).ljust(4,'0')
# print(cudf_tractnumber)
# print(cudf_block_number)

Python - Pandas library returns wrong column values after parsing a CSV file

SOLVED Found the solution myself. It turns out that when you want to retrieve specific columns by their names through the names parameter, you have to pass the names in the order in which they appear inside the csv (which seems counter-intuitive for a library intended to save a developer parsing time, IMO). Correct me if I am wrong, but I don't see an option to get a specific column's values by name if the names are listed in a different order...
I am trying to read a comma separated value file with Python and then parse it using the pandas library. Since the file has many values (columns) that are not needed, I make a list of the column names I do need.
Here's a look at the csv file format.
Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,Attendance,Referee,HS,AS,HST,AST,HHW,AHW,HC,AC,HF,AF,HO,AO,HY,AY,HR,AR,HBP,ABP,GBH,GBD,GBA,IWH,IWD,IWA,LBH,LBD,LBA,SBH,SBD,SBA,WHH,WHD,WHA
E0,19/08/00,Charlton,Man City,4,0,H,2,0,H,20043,Rob Harris,17,8,14,4,2,1,6,6,13,12,8,6,1,2,0,0,10,20,2,3,3.2,2.2,2.9,2.7,2.2,3.25,2.75,2.2,3.25,2.88,2.1,3.2,3.1
E0,19/08/00,Chelsea,West Ham,4,2,H,1,0,H,34914,Graham Barber,17,12,10,5,1,0,7,7,19,14,2,3,1,2,0,0,10,20,1.47,3.4,5.2,1.6,3.2,4.2,1.5,3.4,6,1.5,3.6,6,1.44,3.6,6.5
E0,19/08/00,Coventry,Middlesbrough,1,3,A,1,1,D,20624,Barry Knight,6,16,3,9,0,1,8,4,15,21,1,3,5,3,1,0,75,30,2.15,3,3,2.2,2.9,2.7,2.25,3.2,2.75,2.3,3.2,2.75,2.3,3.2,2.62
E0,19/08/00,Derby,Southampton,2,2,D,1,2,A,27223,Andy D'Urso,6,13,4,6,0,0,5,8,11,13,0,2,1,1,0,0,10,10,2,3.1,3.2,1.8,3,3.5,2.2,3.25,2.75,2.05,3.2,3.2,2,3.2,3.2
E0,19/08/00,Leeds,Everton,2,0,H,2,0,H,40010,Dermot Gallagher,17,12,8,6,0,0,6,4,21,20,6,1,1,3,0,0,10,30,1.65,3.3,4.3,1.55,3.3,4.5,1.55,3.5,5,1.57,3.6,5,1.61,3.5,4.5
E0,19/08/00,Leicester,Aston Villa,0,0,D,0,0,D,21455,Mike Riley,5,5,4,3,0,0,5,4,12,12,1,4,2,3,0,0,20,30,2.15,3.1,2.9,2.3,2.9,2.5,2.35,3.2,2.6,2.25,3.25,2.75,2.4,3.25,2.5
E0,19/08/00,Liverpool,Bradford,1,0,H,0,0,D,44183,Paul Durkin,16,3,10,2,0,0,6,1,8,8,5,0,1,1,0,0,10,10,1.25,4.1,7.2,1.25,4.3,8,1.35,4,8,1.36,4,8,1.33,4,8
This list is passed to pandas.read_csv()'s names parameter.
See code.
# Returns an array of the column names needed for our raw data table
def cols_to_extract():
    cols_to_use = [None] * RawDataCols.COUNT
    cols_to_use[RawDataCols.DATE] = 'Date'
    cols_to_use[RawDataCols.HOME_TEAM] = 'HomeTeam'
    cols_to_use[RawDataCols.AWAY_TEAM] = 'AwayTeam'
    cols_to_use[RawDataCols.FTHG] = 'FTHG'
    cols_to_use[RawDataCols.HG] = 'HG'
    cols_to_use[RawDataCols.FTAG] = 'FTAG'
    cols_to_use[RawDataCols.AG] = 'AG'
    cols_to_use[RawDataCols.FTR] = 'FTR'
    cols_to_use[RawDataCols.RES] = 'Res'
    cols_to_use[RawDataCols.HTHG] = 'HTHG'
    cols_to_use[RawDataCols.HTAG] = 'HTAG'
    cols_to_use[RawDataCols.HTR] = 'HTR'
    cols_to_use[RawDataCols.ATTENDANCE] = 'Attendance'
    cols_to_use[RawDataCols.HS] = 'HS'
    cols_to_use[RawDataCols.AS] = 'AS'
    cols_to_use[RawDataCols.HST] = 'HST'
    cols_to_use[RawDataCols.AST] = 'AST'
    cols_to_use[RawDataCols.HHW] = 'HHW'
    cols_to_use[RawDataCols.AHW] = 'AHW'
    cols_to_use[RawDataCols.HC] = 'HC'
    cols_to_use[RawDataCols.AC] = 'AC'
    cols_to_use[RawDataCols.HF] = 'HF'
    cols_to_use[RawDataCols.AF] = 'AF'
    cols_to_use[RawDataCols.HFKC] = 'HFKC'
    cols_to_use[RawDataCols.AFKC] = 'AFKC'
    cols_to_use[RawDataCols.HO] = 'HO'
    cols_to_use[RawDataCols.AO] = 'AO'
    cols_to_use[RawDataCols.HY] = 'HY'
    cols_to_use[RawDataCols.AY] = 'AY'
    cols_to_use[RawDataCols.HR] = 'HR'
    cols_to_use[RawDataCols.AR] = 'AR'
    return cols_to_use

# Extracts raw data from the raw data csv and populates the raw match data table in the database
def extract_raw_data(csv):
    # Clear the database table if it has any logs
    # if MatchRawData.objects.count != 0:
    #     MatchRawData.objects.delete()
    cols_to_use = cols_to_extract()
    # Read and parse the csv file
    parsed_csv = pd.read_csv(csv, delimiter=',', names=cols_to_use, header=0)
    for col in cols_to_use:
        values = parsed_csv[col].values
        for val in values:
            print(str(col) + ' --------> ' + str(val))
Where RawDataCols is an IntEnum.
class RawDataCols(IntEnum):
    DATE = 0
    HOME_TEAM = 1
    AWAY_TEAM = 2
    FTHG = 3
    HG = 4
    FTAG = 5
    AG = 6
    FTR = 7
    RES = 8
    ...
The column names are obtained using it; that part of the code works fine. The correct column name is obtained, but after trying to get its values using
values = parsed_csv[col].values
pandas returns the values of a wrong column. The wrong column is around 13 positions away from the one I am trying to get. What am I missing?
You can select columns by name. Just use the following line:
values = parsed_csv[["Column Name","Column Name2"]]
Or you can select by index:
cols = [1,2,3,4]
values = parsed_csv[parsed_csv.columns[cols]]
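As an aside, read_csv's usecols parameter selects columns by header name regardless of their position in the file, which sidesteps the ordering pitfall with names described above; a minimal sketch with made-up data:

```python
import io
import pandas as pd

csv_text = "Div,Date,HomeTeam,AwayTeam,FTHG\nE0,19/08/00,Charlton,Man City,4\n"

# usecols matches header names, so the order given here does not matter;
# the result keeps the columns in file order
parsed = pd.read_csv(io.StringIO(csv_text), usecols=['AwayTeam', 'Date'])
```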

Python Pandas rolling mean DataFrame Constructor not properly called

I am trying to create a simple time series of different rolling types. One specific example is a rolling mean of N periods using the pandas Python package.
I get the following error : ValueError: DataFrame constructor not properly called!
Below is my code :
def py_TA_MA(v, n, AscendType):
    df = pd.DataFrame(v, columns=['Close'])
    df = df.sort_index(ascending=AscendType)  # ascending/descending flag
    M = pd.Series(df['Close'].rolling(n), name='MovingAverage_' + str(n))
    df = df.join(M)
    df = df.sort_index(ascending=True)  # need to double-check this
    return df
Would anyone be able to advise?
Kind regards
Found the correction! It was erroring out (a new error) because I had to explicitly declare n as an integer. The code below works:
#xw.func
#xw.arg('n', numbers = int, doc = 'this is the rolling window')
#xw.ret(expand='table')
def py_TA_MA(v, n, AscendType):
    df = pd.DataFrame(v, columns=['Close'])
    df = df.sort_index(ascending=AscendType)  # ascending/descending flag
    M = pd.Series(df['Close'], name='Moving Average').rolling(window=n).mean()
    # df = pd.Series(df['Close']).rolling(window = n).mean()
    df = df.join(M)
    df = df.sort_index(ascending=True)  # need to double-check this
    return df
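For reference, the corrected rolling call can be exercised on toy data without the xlwings wrappers; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'Close': [10.0, 11.0, 12.0, 13.0, 14.0]})
n = 3  # rolling window, must be an integer

# .rolling(...) returns a Rolling object; .mean() reduces it to a Series
ma = df['Close'].rolling(window=n).mean().rename('MovingAverage_' + str(n))
df = df.join(ma)
```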

Pandas apply a change which affects 2 columns at the same time

I have the dataframe below. bet_pl and co_pl keep track of the daily changes in the two balances. I have updated co_balance based on co_pl and its cumsum.
init_balance = D('100.0')
co_thresh = D('1.05') * init_balance
def get_pl_co(row):
    if row['eod_bet_balance'] > co_thresh:
        diff = row['eod_bet_balance'] - co_thresh
        return diff
    else:
        return Decimal('0.0')
df_odds_winloss['eod_bet_balance'] = df_odds_winloss['bet_pl'].cumsum() + init_balance
df_odds_winloss['sod_bet_balance'] = df_odds_winloss['eod_bet_balance'].shift(1).fillna(init_balance)
df_odds_winloss['co_pl'] = df_odds_winloss.apply(get_pl_co, axis=1)
df_odds_winloss['co_balance'] = df_odds_winloss['co_pl'].cumsum()
# trying this
df_odds_winloss['eod_bet_balance'] = df_odds_winloss['eod_bet_balance'] - df_odds_winloss['co_pl']
Now I want eod_bet_balance to be reduced by co_pl, since it is a transfer between the two balances, but I am not getting the right eod (end of day) balances.
Can anyone give a hint?
UPDATED: The eod balances reflect the change in bet_pl but not the subsequent change in co_pl.
FINAL UPDATE:
init_balance = D('100.0')
df = pd.DataFrame({'SP': res_df['SP'], 'winloss': bin_seq_l}, columns=['SP', 'winloss'])
df['bet_pl'] = df.apply(get_pl_lvl, axis=1)
df['interim_balance'] = df['bet_pl'].cumsum() + init_balance
df['co_pl'] = (df['interim_balance'] - co_thresh).clip_lower(0)
df['co_balance'] = df['co_pl'].cumsum()
df['post_co_balance'] = df['interim_balance'] - df['co_pl']
bf_r = D('0.05')
df['post_co_deduct_balance'] = df['post_co_balance'] - (df['post_co_balance'] * bf_r)
df['sod_bet_balance'] = df['post_co_deduct_balance'].shift(1).fillna(init_balance)
First, you don't need to apply a custom function to get co_pl; it can be done like so:
df['co_pl'] = (df['eod_bet_balance'] - co_thresh).clip_lower(0)
As for updating the other column, if I understand correctly you want something like this:
df['eod_bet_balance'] = df['eod_bet_balance'].clip_upper(co_thresh)
or, equivalently...
df['eod_bet_balance'] -= df['co_pl']
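To see that the two expressions agree, here is a small self-contained check with a toy balance series (the values are made up; clip_lower/clip_upper from the answer are spelled clip(lower=...)/clip(upper=...) in pandas 1.0+):

```python
import pandas as pd

co_thresh = 105.0
eod = pd.Series([100.0, 110.0, 103.0, 120.0])  # hypothetical eod_bet_balance

# the sweep amount: anything above the threshold
co_pl = (eod - co_thresh).clip(lower=0)
# the remaining balance: capped at the threshold
capped = eod.clip(upper=co_thresh)
```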

Using Pandas to export multiple rows of data to a csv

I have a matching algorithm which links students to projects. It works, but I am having trouble exporting the data to a csv file: only the last value gets exported, when there are 200 values to export.
The exported data also treats each character as a separate value, so instead of the whole string s I get the individual numbers that make up s, split across three columns. I've attached the images below. Any help would be appreciated.
What it looks like
What it should look like
#Imports for Pandas
import pandas as pd
from pandas import DataFrame
SPA()
for m in M:
    s = m['student']
    l = m['lecturer']
    Lecturer[l]['limit'] = Lecturer[l]['limit'] - 1
    id = m['projectid']
    p = Project[id]['title']
    c = Project[id]['sourceid']
    r = str(getRank("Single_Projects1copy.csv",s,c))
    print(s+","+l+","+p+","+c+","+r)
    dataPack = (s+","+l+","+p+","+c+","+r)
    df = pd.DataFrame.from_records([dataPack])
    df.to_csv('try.csv')
You keep overwriting the file in the loop, so you only end up with the last bit of data. You need to append to the csv with df.to_csv('try.csv', mode="a", header=False), or build up one DataFrame and write it outside the loop, something like:
df = pd.DataFrame()
for m in M:
    s = m['student']
    l = m['lecturer']
    Lecturer[l]['limit'] = Lecturer[l]['limit'] - 1
    id = m['projectid']
    p = Project[id]['title']
    c = Project[id]['sourceid']
    r = str(getRank("Single_Projects1copy.csv",s,c))
    print(s+","+l+","+p+","+c+","+r)
    dataPack = (s+","+l+","+p+","+c+","+r)
    # append returns a new DataFrame, so the result must be reassigned
    df = df.append(pd.DataFrame.from_records([dataPack]))
df.to_csv('try.csv')  # write all data once outside the loop
A better option would be to open a file and pass that file object to to_csv:
with open('try.csv', 'w') as f:
    for m in M:
        s = m['student']
        l = m['lecturer']
        Lecturer[l]['limit'] = Lecturer[l]['limit'] - 1
        id = m['projectid']
        p = Project[id]['title']
        c = Project[id]['sourceid']
        r = str(getRank("Single_Projects1copy.csv",s,c))
        print(s+","+l+","+p+","+c+","+r)
        dataPack = (s+","+l+","+p+","+c+","+r)
        pd.DataFrame.from_records([dataPack]).to_csv(f, header=False)
You get individual chars because you pass from_records a single string (dataPack), so it iterates over the characters:
In [18]: df = pd.DataFrame.from_records(["foobar,"+"bar"])
In [19]: df
Out[19]:
0 1 2 3 4 5 6 7 8 9
0 f o o b a r , b a r
In [20]: df = pd.DataFrame(["foobar,"+"bar"])
In [21]: df
Out[21]:
0
0 foobar,bar
I think you basically want to leave dataPack as a tuple, dataPack = (s, l, p, c, r), and use pd.DataFrame([dataPack]). You don't really need pandas at all; the csv lib would do all of this for you without needing to create DataFrames.
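A sketch of that csv-only route, with hypothetical rows standing in for the matching results (M, getRank and friends from the question are not reproduced here):

```python
import csv

# hypothetical (s, l, p, c, r) tuples; in the question these come from the loop
rows = [
    ('123', 'Dr Smith', 'Project A', 'C1', '1'),
    ('456', 'Dr Jones', 'Project B', 'C2', '3'),
]

with open('try.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)  # one match per line, one field per column
```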
