Python pandas: extract variables from dataframe

What is the best way to convert DataFrame columns into variables? I have a condition for bet placement, and I use head(n=1):
back_bf_lay_bq = bb[(bb['bf_back_bq_lay_lose_net'] > 0) & (bb['bq_lay_price'] < 5) & (bb['bq_lay_price'] != 0) & (bb['bf_back_liquid'] > bb['bf_back_stake']) & (bb['bq_lay_liquid'] > bb['bq_lay_horse_win'])].head(n=1)
I would like to convert the columns into variables and pass them to an API for bet placement. So I convert back_bf_lay_bq to a dictionary and extract the values:
#Bets placements
dd = pd.DataFrame.to_dict(back_bf_lay_bq, orient='list')
#Betdaq bet placement
bq_selection_id = dd['bq_selection_id'][0]
bq_lay_stake = dd['bq_lay_stake'][0]
bq_lay_price = dd['bq_lay_price'][0]
bet_type = 2
reset_count = dd['bq_count_reset'][0]
withdrawal_sequence = dd['bq_withdrawal_sequence'][0]
kill_type = 2
betdaq_request = betdaq_api.PlaceOrdersNoReceipt(bq_selection_id,bq_lay_stake,bq_lay_price,bet_type,reset_count,withdrawal_sequence,kill_type)
I do not think this is the most efficient way, and it raises an error from time to time:
bq_selection_id = dd['bq_selection_id'][0]
IndexError: list index out of range
So can you suggest a better way to get values from a DataFrame and pass them to the API?

IIUC, you could slice your dataframe with your column subset and then use iloc to get the first row, unpacking the values into your variables. Something like this:
bq_selection_id, bq_lay_stake, bq_lay_price, withdrawal_sequence = back_bf_lay_bq[['bq_selection_id', 'bq_lay_stake', 'bq_lay_price', 'withdrawal_sequence']].iloc[0]
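Note that if the filter matches no rows, iloc[0] (like the dict lookup in your original code) raises an IndexError, so it may be worth guarding against an empty result first. A minimal sketch, reusing the names from your snippet:
subset = back_bf_lay_bq[['bq_selection_id', 'bq_lay_stake', 'bq_lay_price', 'withdrawal_sequence']]
if not subset.empty:
    # unpack the first matching row only when the filter found something
    bq_selection_id, bq_lay_stake, bq_lay_price, withdrawal_sequence = subset.iloc[0]
else:
    # no row satisfied the conditions, so skip the bet placement
    pass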

Related

How to filter this dataframe?

I have a large dataframe (sample). I was filtering the data according to this code:
A = [f"A{i}" for i in range(50)]
B = [f"B{i}" for i in range(50)]
C = [f"C{i}" for i in range(50)]
for i in A:
    cond_A = (df[i] >= -0.0423) & (df[i] <= 3)
    filt_df = df[cond_A]
for i in B:
    cond_B = (filt_df[i] >= 15) & (filt_df[i] <= 20)
    filt_df2 = filt_df[cond_B]
for i in C:
    cond_C = (filt_df2[i] >= 15) & (filt_df2[i] <= 20)
    filt_df3 = filt_df2[cond_B]
When I print filt_df3, I get only an empty dataframe - why?
How can I improve the code, perhaps with other approaches or more advanced techniques?
I am not sure the code above works as outlined in the edit below.
I would like to know how I can change the code so that it works as outlined in the edit below.
Edit:
I want to remove rows based on columns (A0 - A49) using cond_A.
Then filter the dataframe from step 1 based on columns (B0 - B49) with cond_B.
Then filter the dataframe from step 2 based on columns (C0 - C49) with cond_C.
Thank you very much in advance.
It seems to me that there is an issue with your code when you use iteration to do the filtering. For example, filt_df is overwritten in every iteration of the first loop, so when the loop ends, filt_df only contains the data filtered with the condition from the last iteration. Is this what you intend to do?
And if you want to do the filtering efficiently, you can try pandas.DataFrame.query (see documentation here). For example, if you want to keep only rows where columns B0 to B49 all contain values between 0 and 200 inclusive, you can try the Python code below (assuming that you have imported the raw data into the variable df).
condition_list = [f'B{i} >= 0 & B{i} <= 200' for i in range(50)]
filter_str = ' & '.join(condition_list)
subset_df = df.query(filter_str)
print(subset_df)
Since column A1 contains only -0.057, which is outside [-0.0423, 3], everything gets filtered out.
Nevertheless, you do not carry the filter over in each loop, as filt_df{1|2|3} is reset on every iteration.
This should work:
import pandas as pd

A = [f"A{i}" for i in range(50)]
B = [f"B{i}" for i in range(50)]
C = [f"C{i}" for i in range(50)]

filt_df = df.copy()
for i in A:
    cond_A = (df[i] >= -0.0423) & (df[i] <= 3)
    filt_df = filt_df[cond_A]

filt_df2 = filt_df.copy()
for i in B:
    cond_B = (filt_df[i] >= 15) & (filt_df[i] <= 20)
    filt_df2 = filt_df2[cond_B]

filt_df3 = filt_df2.copy()
for i in C:
    cond_C = (filt_df2[i] >= 15) & (filt_df2[i] <= 20)
    filt_df3 = filt_df3[cond_C]

print(filt_df3)
Of course, you will find a lot of filtering tools in the pandas library that can be applied to multiple columns at once.
For example this:
https://stackoverflow.com/a/39820329/6139079
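As an illustration, a minimal sketch (assuming df holds the sample data from the question) that selects the A-columns by their name prefix and keeps only rows where all of them fall inside the range:
# pick the columns whose names start with "A" and build one row-wise mask
a_cols = df.filter(regex='^A')
mask = ((a_cols >= -0.0423) & (a_cols <= 3)).all(axis=1)
filtered = df[mask]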
You can filter by all columns together with DataFrame.all to test whether every value in a row matches:
A = [f"A{i}" for i in range(50)]
cond_A = ((df[A] >= -0.0423) & (df[A]<=3)).all(axis=1)
B = [f"B{i}" for i in range(50)]
cond_B = ((df[B]>= 15) & (df[B]<=20)).all(axis=1)
C = [f"C{i}" for i in range(50)]
cond_C = ((df[C]>= 15) & (df[C]<=20)).all(axis=1)
Finally, chain all the masks with & for bitwise AND:
filt_df = df[cond_A & cond_B & cond_C]
If you get an empty DataFrame, it seems no row satisfies all the conditions.

How to add a new column to a pandas dataframe while iterating over the rows?

I want to generate a new column using some columns that already exist, but I think it is too difficult to do with an apply function. Can I generate a new column (ftp_price here) while iterating through this dataframe? Here is my code. When I call product_df['ftp_price'], I get a KeyError.
for index, row in product_df.iterrows():
    current_curve_type_df = curve_df[curve_df['curve_surrogate_key'] == row['curve_surrogate_key_x']]
    min_tmp_df = row['start_date'] - current_curve_type_df['datab_map'].apply(parse)
    min_tmp_df = min_tmp_df[min_tmp_df > timedelta(days=0)]
    curve = current_curve_type_df.loc[min_tmp_df.idxmin()]
    tmp_diff = row['end_time'] - np.array(row['start_time'])
    if np.isin(0, tmp_diff):
        idx = np.where(tmp_diff == 0)
        col_name = COL_NAMES[idx[0][0]]
        row['ftp_price'] = curve[col_name]
    else:
        idx = np.argmin(tmp_diff > 0)
        p_plus_one_rate = curve[COL_NAMES[idx]]
        p_minus_one_rate = curve[COL_NAMES[idx - 1]]
        d_plus_one_days = row['start_date'] + rate_mapping_dict[COL_NAMES[idx]]
        d_minus_one_days = row['start_date'] + rate_mapping_dict[COL_NAMES[idx - 1]]
        row['ftp_price'] = p_minus_one_rate + (p_plus_one_rate - p_minus_one_rate) * (row['start_date'] - d_minus_one_days) / (d_plus_one_days - d_minus_one_days)
An alternative for setting a new value at a particular index is using at:
for index, row in product_df.iterrows():
    product_df.at[index, 'ftp_price'] = val
Also, you should read about why using iterrows should be avoided.
A row can be a view or a copy (and is often a copy), so changing it would not change the original dataframe. The correct way is to always change the original dataframe using loc or iloc:
product_df.loc[index, 'ftp_price'] = ...
That being said, you should try to avoid explicitly iterating over the rows of a dataframe when possible...
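If the per-row computation cannot be expressed as plain column arithmetic, one compromise is to move it into a function and build the whole column in a single assignment with apply(axis=1), instead of mutating rows inside iterrows. A rough sketch, where compute_ftp_price is a hypothetical helper wrapping the interpolation logic from the question:
def compute_ftp_price(row):
    # hypothetical helper: look up the matching curve for this row and
    # return the interpolated price, as in the loop body above
    ...

product_df['ftp_price'] = product_df.apply(compute_ftp_price, axis=1)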

Counting the repeated values in one column based on another column

Using pandas, I am dealing with the following CSV data:
f,f,f,f,f,t,f,f,f,t,f,t,g,f,n,f,f,t,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,t,t,nowin
t,f,f,f,f,f,f,f,f,f,t,f,g,f,b,f,f,t,f,f,f,f,f,t,f,t,f,f,f,f,f,f,f,t,f,n,won
t,f,f,f,t,f,f,f,t,f,t,f,g,f,b,f,f,t,f,f,f,t,f,t,f,t,f,f,f,f,f,f,f,t,f,n,won
f,f,f,f,f,f,f,f,f,f,t,f,g,f,b,f,f,t,f,f,f,f,f,t,f,t,f,f,f,f,f,f,f,t,f,n,nowin
t,f,f,f,t,f,f,f,t,f,t,f,g,f,b,f,f,t,f,f,f,t,f,t,f,t,f,f,f,f,f,f,f,t,f,n,won
f,f,f,f,f,f,f,f,f,f,t,f,g,f,b,f,f,t,f,f,f,f,f,t,f,t,f,f,f,f,f,f,f,t,f,n,win
For this part of the raw data, I was trying to return something like:
Column1_name -- t -- counts of nowin = 0
Column1_name -- t -- count of wins = 3
Column1_name -- f -- count of nowin = 2
Column1_name -- f -- count of win = 1
Based on this idea (get dataframe row count based on conditions), I was thinking of doing something like this:
print(df[df.target == 'won'].count())
However, this always returns the same number of "won"s based on the last column, without taking into consideration whether another column is an "f" or a "t". In other words, I was hoping to use something from the pandas DataFrame API that works like a "group by" in SQL, grouping based on, for example, the 1st and last columns.
Should I keep pursuing this idea, or should I simply start using for loops?
If you need it, here is the rest of my code:
import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/chess/king-rook-vs-king-pawn/kr-vs-kp.data"
df = pd.read_csv(url,names=[
'bkblk','bknwy','bkon8','bkona','bkspr','bkxbq','bkxcr','bkxwp','blxwp','bxqsq','cntxt','dsopp','dwipd',
'hdchk','katri','mulch','qxmsq','r2ar8','reskd','reskr','rimmx','rkxwp','rxmsq','simpl','skach','skewr',
'skrxp','spcop','stlmt','thrsk','wkcti','wkna8','wknck','wkovl','wkpos','wtoeg','target'
])
features = ['bkblk','bknwy','bkon8','bkona','bkspr','bkxbq','bkxcr','bkxwp','blxwp','bxqsq','cntxt','dsopp','dwipd',
'hdchk','katri','mulch','qxmsq','r2ar8','reskd','reskr','rimmx','rkxwp','rxmsq','simpl','skach','skewr',
'skrxp','spcop','stlmt','thrsk','wkcti','wkna8','wknck','wkovl','wkpos','wtoeg','target']
# number of lines
#tot_of_records = np.size(my_data,0)
#tot_of_records = np.unique(my_data[:,1])
#for item in my_data:
# item[:,0]
num_of_won=0
num_of_nowin=0
for item in df.target:
    if item == 'won':
        num_of_won = num_of_won + 1
    else:
        num_of_nowin = num_of_nowin + 1
print(num_of_won)
print(num_of_nowin)
print(df[df.target == 'won'].count())
#print(df[:1])
#print(df.bkblk.to_string(index=False))
#print(df.target.unique())
#ini_entropy = (() + ())
This could work -
outdf = df.apply(lambda x: pd.crosstab(index=df.target,columns=x).to_dict())
Basically, we go over each feature column and make a crosstab with the target column.
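For a single feature, a small usage example of what this produces (bkblk is the first column name from the question's read_csv call):
# counts of each target value for each value of the bkblk column
print(pd.crosstab(index=df['target'], columns=df['bkblk']))

# the equivalent "group by" phrasing
print(df.groupby(['bkblk', 'target']).size())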
Hope this helps! :)

Pandas: how to add a dataframe inside a cell of another dataframe?

I have an empty dataframe like the following:
simReal2013 = pd.DataFrame(index = np.arange(0,1,1))
Then I read some .csv files as dataframes.
stat = np.arange(0,5)
xv = [0.005, 0.01, 0.05]
br = [0.001,0.005]
for i in xv:
    for j in br:
        I = 0
        for s in stat:
            string = 'results/2013/real/run_%d_%f_%f_15.0_10.0_T0_RealNet.csv' % (s, i, j)
            sim = pd.read_csv(string, sep=' ')
            I += np.array(sim.I)
        sim.I = I / 5
        col = '%f_%f' % (i, j)
        simReal2013.insert(0, col, sim)
I would like to put the dataframe that I read into a cell of simReal2013. In doing so I get the following error:
ValueError: Wrong number of items passed 9, placement implies 1
Yes, putting a dataframe inside of a dataframe is probably not the way you want to go, but if you must, this is one way to do it:
df_in = pd.DataFrame([[1, 2, 3]] * 2)
d = {}
d['a'] = df_in
df_out = pd.DataFrame([d])
type(df_out.loc[0, "a"])
>>> pandas.core.frame.DataFrame
Maybe a dictionary of dataframes would suffice for your use.
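Applied to the loop in the question, a rough sketch of that idea keys each averaged dataframe by the same '%f_%f' label instead of inserting it into a cell (xv, br and sim refer to the variables defined in the question's loop):
sims2013 = {}  # hypothetical dict replacing the single-cell dataframe
for i in xv:
    for j in br:
        ...  # same averaging over the 5 runs as in the question
        sims2013['%f_%f' % (i, j)] = sim  # store the whole dataframe under a key

# later, look a run up by its parameter label
sims2013['0.005000_0.001000'].head()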

Reading multiple CSV files to DataFrames and naming them after their original file name

I have several csv files with the same structure, and I want to:
Assign each file to a dataframe name in the same order:
1.csv -> data1, 2.csv -> data2
And assign columns in the same manner:
delta1 = data1["C"] - data1["A"]
I want to put it into a for loop, which looks like this:
for i in range(1, 22):
    data%i = pd.read_csv('CSV/' + str(i) + '.csv')
    delta%i = data%i["C"] - data%i["A"]
# And I want to compare the 2 series from dataframe.column to find a set intersection
set(data1[data1.delta1 > 0].column) & set(data2[data2.delta2 == 0].column)
set(data2[data2.delta2 > 0].column) & set(data3[data3.delta3 == 0].column)
but this is certainly the wrong syntax for the for loop. Is there a better way to code it so that after the loop I can get:
data1, data2, data3 ...
with corresponding:
delta1, delta2, delta3 ...
You can do everything with native pandas functions as opposed to dicts.
First read your csvs into a list:
df_list = []
for i in range(1, 22):
    df_list.append(pd.read_csv("{}.csv".format(i)))
Now concat them:
df = pd.concat(df_list, keys=range(1,22))
Now your dataframe df is indexed with the key of the file you loaded.
Doing for example df.loc[1] will get you the data from the file 1.csv
You can now set your deltas with a single operation:
df["delta"] = df["C"] - df["A"]
And you can access these deltas with the DataFrame.loc operation too, like this:
df.loc[2,"delta"]
This method is more native to pandas and may scale better with larger datasets.
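To reproduce the set intersections from the question with this layout, you can slice by file key first. A short sketch, assuming the files really contain a column literally named column as in the question:
d1, d2 = df.loc[1], df.loc[2]
set(d1.loc[d1['delta'] > 0, 'column']) & set(d2.loc[d2['delta'] == 0, 'column'])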
Here is an implementation of what you want using dictionaries (as also suggested by @EdChum in the comments, for example):
data = {}
delta = {}
for i in range(1, 22):
    data[i] = pd.read_csv('CSV/' + str(i) + '.csv')
    delta[i] = data[i]["C"] - data[i]["A"]

# And I want to compare the 2 series from dataframe.column to find a set intersection
set(data[1][delta[1] > 0].column) & set(data[2][delta[2] == 0].column)
set(data[2][delta[2] > 0].column) & set(data[3][delta[3] == 0].column)
I would really recommend using a dictionary as above. However, if you really, really, really insist on allocating variables dynamically like you want to do in your question, you can do the following highly dangerous and not recommended thing:
You can allocate variables dynamically using:
globals()[variable_name]=variable_value
Again, you really shouldn't do that. There is no reason to do it either, but here you go; here is the modification of your code which does exactly what you wanted:
for i in range(1, 22):
    datai = "data" + str(i)
    deltai = "delta" + str(i)
    globals()[datai] = pd.read_csv('CSV/' + str(i) + '.csv')
    globals()[deltai] = globals()[datai]["C"] - globals()[datai]["A"]

# And I want to compare the 2 series from dataframe.column to find a set intersection
set(data1[delta1 > 0].column) & set(data2[delta2 == 0].column)
set(data2[delta2 > 0].column) & set(data3[delta3 == 0].column)
