Starting with a CSV file with the columns ['race_number', 'number_of_horses_bet_on','odds']
I would like to add/calculate an extra column called 'desired_output'.
The 'desired_output' column is computed by
for 'race_number' 1, the 'number_of_horses_bet_on'=2, therefore in the 'desired_output column', only the first 2 'odds' are included. The remaining values for 'race_number' 1 are 0. Then we go to 'race_number' 2 and the cycle repeats.
Code I have tried includes:
import pandas as pd
df=pd.read_csv('test.csv')
desired_output=[]
count=0
for i in df.number_of_horses_bet_on:
    for j in df.odds:
        if count<i:
            desired_output.append(j)
            count+=1
        else:
            desired_output.append(0)
print(desired_output)
and also
df['desired_output']=df.odds.apply(lambda x: x if count<number_of_horses_bet_on else 0)
Neither of these gives the expected 'desired_output' column.
I realise the 'count' in the lambda above is misplaced - but hopefully you can see what I am after.
Thanks.
I'd do it a bit differently. This is what I would do:
get a list of all race_number values
for each race_number, extract the number_of_horses_bet_on
create a list of 1s and 0s, with number_of_horses_bet_on 1s followed by 0s for the remaining rows of that race
multiply this list with the odds column
import pandas as pd
df=pd.read_csv('test.csv')
mask = []
races = df['race_number'].unique().tolist() # unique list of all races
for race in races:
    # filter the dataframe by the race number
    df_race = df[df['race_number'] == race]
    # assuming number of horses is unique for every race, we extract it here
    number_of_horses = df_race['number_of_horses_bet_on'].iloc[0]
    # this mask will contain a list of 1s and 0s, for example for race 1 it'll be [1,1,0,0,0]
    mask = mask + [1] * number_of_horses + [0] * (len(df_race) - number_of_horses)
df['mask'] = mask
df['desired_output'] = df['mask'] * df['odds']
del df['mask']
print(df)
This assumes that, for each race, number_of_horses_bet_on is less than or equal to the number of rows for that race; otherwise you might need to use min/max to get proper results.
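The same mask idea can probably be written without the explicit loop. This is only a sketch with made-up sample data in place of test.csv: groupby(...).cumcount() numbers the rows inside each race from 0, and Series.where keeps the odds only where that position is below number_of_horses_bet_on. It also assumes the rows of each race are already in the order you want to count them in.

```python
import pandas as pd

# Made-up sample data standing in for test.csv
df = pd.DataFrame({
    "race_number": [1, 1, 1, 2, 2, 2],
    "number_of_horses_bet_on": [2, 2, 2, 1, 1, 1],
    "odds": [3.5, 2.0, 8.0, 1.5, 4.0, 6.0],
})

# Position of each row within its race: 0, 1, 2, ... restarting per race
position = df.groupby("race_number").cumcount()

# Keep the odds for the first N rows of each race, 0 for the rest
df["desired_output"] = df["odds"].where(position < df["number_of_horses_bet_on"], 0)
print(df)
```

If number_of_horses_bet_on is larger than the number of rows in a race, every row of that race simply keeps its odds, so no min/max handling is needed here.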
I have the following dataframe df, in which I highlighted in green the cells with values of interest:
[image not available: the dataframe df with the cells of interest highlighted in green]
and I would like to obtain, for each column (that is, across the whole dataframe), the following statistics: the number of values less than or equal to 0.5 (the green cells in the dataframe), not counting NaN values, and the percentage of such values in the column, so that I can use, say, 50% as a benchmark.
For this I tried value_counts, e.g. (df['A'].value_counts()/df['A'].count())*100, but this returns only a partial result, not in the form I want, and only for specific columns; I was also thinking about using filter or a lambda function like df.loc[lambda x: x <= 0.5], but clearly that is not the result I wanted.
The goal/output will be a dataframe as shown below in which are displayed just the columns that "beat" the benchmark (recall: at least (half) 50% of their values <= 0.5).
[image not available: the expected output dataframe]
e.g. in column A the count would be 2 and the percentage 2/3 * 100 = 66%, while in column B the count would be 4 and the percentage 4/8 * 100 = 50%. (The same goes for columns X, Y and Z.) On the other hand, column C, where 2/8 * 100 = 25%, won't beat the benchmark and is therefore not included in the output.
Is there a suitable way to achieve this IYHO? Apologies in advance if this was a kinda duplicated question but I found no other questions able to help me out, and thx to any saviour.
I believe I have understood your ask in the below code...
It would be good if you could provide an expected output in your question so that it is easier to follow.
Anyways the first part of the code below is just set up so can be ignored as you already have your data set up.
Basically I have created a quick function for you that will return the percentage of values that are under a threshold that you can define.
This function is called in a loop of all the columns within your dataframe and if this percentage is more than the output threshold (again you can define it) it will keep it for actually outputting.
import pandas as pd
import numpy as np
import random
import datetime
### SET UP ###
base = datetime.datetime.today()
date_list = [base - datetime.timedelta(days=x) for x in range(10)]
def rand_num_list(length):
    peak = [round(random.uniform(0,1),1) for i in range(length)] + [0] * (10-length)
    random.shuffle(peak)
    return peak
df = pd.DataFrame(
    {
        'A':rand_num_list(3),
        'B':rand_num_list(5),
        'C':rand_num_list(7),
        'D':rand_num_list(2),
        'E':rand_num_list(6),
        'F':rand_num_list(4)
    },
    index=date_list
)
df = df.replace({0:np.nan})
##############
print(df)
def less_than_threshold(thresh_df, thresh_col, threshold):
    if len(thresh_df[thresh_col].dropna()) == 0:
        return 0
    return len(thresh_df.loc[thresh_df[thresh_col]<=threshold]) / len(thresh_df[thresh_col].dropna())
output_dict = {'cols':[]}
col_threshold = 0.5
output_threshold = 0.5
for col in df.columns:
    if less_than_threshold(df, col, col_threshold) >= output_threshold:
        output_dict['cols'].append(col)
df_output = df.loc[:,output_dict.get('cols')]
print(df_output)
Hope this achieves your goal!
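For what it's worth, the same count and percentage can likely be computed for all columns at once, without the helper function or the loop. A minimal sketch with invented sample data: DataFrame.le(0.5) marks the values <= 0.5 (NaN compares as False), and DataFrame.count() gives the number of non-NaN values per column. The names below, share and summary are just illustrative.

```python
import numpy as np
import pandas as pd

# Invented sample data; NaN marks missing values
df = pd.DataFrame({
    "A": [0.2, 0.4, 0.9, np.nan],
    "B": [0.1, 0.5, 0.7, 0.8],
    "C": [0.6, 0.9, 0.3, 0.7],
})

below = df.le(0.5).sum()   # per column: how many values are <= 0.5
share = below / df.count() # per column: fraction of the non-NaN values

# Count and percentage side by side, one row per column
summary = pd.DataFrame({"count": below, "percent": share * 100})
print(summary)

# Keep only the columns that reach the 50% benchmark
df_output = df.loc[:, share >= 0.5]
print(df_output)
```

A column that contains only NaN gives 0/0 here, which is NaN and therefore fails the >= 0.5 test, so it is dropped as well.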
I am working with a dataset of about 200,000 rows. One column of the DataFrame contains some empty lists; most of its values are lists with several values. Here is a picture: [image not available]
What I want to do is replace the empty lists with this list:
[[close*0.95,close*0.94]]
where close is the close value in the table. The for loop that I use is this one:
for i in range(1,len(data3.index)):
    close = data3.close[data3.index==data3.index[i]].values[0]
    sell_list = data3.sell[data3.index==data3.index[i]].values[0]
    buy_list = data3.buy[data3.index==data3.index[i]].values[0]
    if len(sell_list)== 0:
        data3.loc[data3.index[i],"sell"].append([[close*1.05,close*1.06]])
    if len(buy_list)== 0:
        data3.loc[data3.index[i],"buy"].append([[close*0.95,close*0.94]])
I tried to make it work with multithreading, but since I need to read the whole table to do the next step I can't split the data. I hope you can help me write some kind of lambda function to apply to the df, or something similar. I am not very skilled at this. Thanks for reading!
The expected output for a row whose "buy" column is an empty list should be [[[11554, 11566]]]
Example data:
import pandas as pd
df = pd.DataFrame({'close': [11763, 21763, 31763], 'buy':[[], [[[21763, 21767]]], []]})
   close                 buy
0  11763                  []
1  21763  [[[21763, 21767]]]
2  31763                  []
You could do it like this:
# Create mask (a bit faster than df['buy'].apply(len) == 0).
# Assumes there are no NaNs in the column; len() raises on NaN, so handle those separately (e.g. with df['buy'].apply and a custom function).
m = [len(l) == 0 for l in df['buy'].tolist()]
# Create triple nested lists and assign.
df.loc[m, 'buy'] = list(df.loc[m, ['close', 'close']].mul([0.95, 0.94]).to_numpy()[:, None][:, None])
print(df)
Result:
   close                                 buy
0  11763            [[[11174.85, 11057.22]]]
1  21763                  [[[21763, 21767]]]
2  31763  [[[30174.85, 29857.219999999998]]]
Some explanation:
m is a boolean mask that selects the rows of the DataFrame with an empty list in the 'buy' column:
m = [len(l) == 0 for l in df['buy'].tolist()]
# Or (a bit slower)
# Apply the len() function to all lists in the column.
m = df['buy'].apply(len) == 0
print(m)
0     True
1    False
2     True
Name: buy, dtype: bool
We can use this mask to select where to calculate the values.
df.loc[m, ['close', 'close']].mul([0.95, 0.94]) duplicates the 'close' column and calculates the vectorised product of all the (close, close) pairs with (0.95, 0.94) to obtain (close*0.95, close*0.94) in each row of the resulting array.
[:, None][:, None] is just a trick to create two additional axes on the resulting array. This is required since you want triple nested lists ([[[]]]).
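If the array trick feels too opaque, a plain list comprehension over the two columns may be easier to read. This is just a sketch on the example data above, not a benchmarked replacement: it builds ordinary nested Python lists (rather than NumPy arrays) for the empty rows and leaves the other rows untouched.

```python
import pandas as pd

df = pd.DataFrame({
    "close": [11763, 21763, 31763],
    "buy": [[], [[[21763, 21767]]], []],
})

# Rebuild the column: keep non-empty lists, fill empty ones from 'close'
filled = [
    buy if len(buy) > 0 else [[[close * 0.95, close * 0.94]]]
    for close, buy in zip(df["close"], df["buy"])
]
df["buy"] = pd.Series(filled, index=df.index)
print(df)
```

The 'sell' column from the question could be handled the same way with the factors 1.05 and 1.06.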
I have a big dataframe (~10 million rows). Each row has:
category
start position
end position
If two rows are in the same category and the start and end position overlap with a +-5 tolerance, I want to keep just one of the rows.
For example
1, cat1, 10, 20
2, cat1, 12, 21
3, cat2, 10, 25
I want to filter out 1 or 2.
What I'm doing right now isn't very efficient,
import pandas as pd
df = pd.read_csv('data.csv', sep='\t', header=None)
dfs = {}  # one sub-dataframe per category
for seq in df.category.unique():
    dfs[seq] = df[df.category == seq]
discard = []  # indexes already identified as duplicates
rows = []  # rows to keep
for index, row in df.iterrows():
    if index in discard:
        continue
    df_2 = dfs[row.category]
    res = df_2[(abs(df_2.start - row.start) <= params['min_distance']) & (abs(df_2.end - row.end) <= params['min_distance'])]
    if len(res.index) > 1:
        discard.extend(res.index.values)
    rows.append(row)
df = pd.DataFrame(rows)
I've also tried a different approach making use of a sorted version of the dataframe.
my_index = 0
indexes = []
discard = []
count = 0
curr = 0
total_len = len(df.index)
while my_index < total_len - 1:
    row = df.iloc[[my_index]]
    cond = True
    next_index = 1
    while cond:
        second_row = df.iloc[[my_index + next_index]]
        c1 = (row.iloc[0].category == second_row.iloc[0].category)
        c2 = (abs(second_row.iloc[0].sstart - row.iloc[0].sstart) <= params['min_distance'])
        c3 = (abs(second_row.iloc[0].send - row.iloc[0].send) <= params['min_distance'])
        cond = c1 and c2 and c3
        if cond and (c2 and c3):
            indexes.append(my_index)
            cond = True
            next_index += 1
    indexes.append(my_index)
    my_index += next_index
indexes.append(total_len - 1)
The problem is that this solution is not perfect: sometimes it misses a row, because the overlapping row can be several rows ahead rather than the next one.
I'm looking for any ideas on how to approach this problem in a more pandas-friendly way, if one exists.
The approach here should be this:
pandas.groupby by categories
agg(Func) on groupby result
the Func should implement the logic of finding the best range inside categories (sorted search, balanced trees or anything else)
Do you want to merge all similar or only 2 consecutive?
If all similar, I suggest you first order the rows, by category, then on the 2 other columns and squash similar in a single row.
If only consecutive 2 then, check if the next value is in the range you set and if yes, merge it. Here you can see how:
merge rows pandas dataframe based on condition
I don't believe the numeric comparisons can be made without a loop, but you can make at least part of this cleaner and more efficient:
dfs = {}
for seq in df.category.unique():
    dfs[seq] = df[df.category == seq]
Instead of this, use df.groupby('category').apply(drop_duplicates).droplevel(0), where drop_duplicates is a function containing your second loop. The function will then be called separately for each category, with a dataframe that contains only the filtered rows. The outputs will be combined back into a single dataframe. The resulting dataframe will have a MultiIndex with the value of "category" as the outer level; this can be removed with droplevel(0).
Secondly, within the category you could sort by the first of the two numeric columns for another small speed-up:
def drop_duplicates(df):
    df = df.sort_values("sstart")
    ...
This will allow you to stop the inner loop as soon as the sstart column value is out of range, instead of comparing every row to every other row.
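To make the idea concrete, here is a rough sketch of what such a per-category function could look like. It assumes columns named category, start and end and a tolerance of 5; the function name drop_near_duplicates and the tol parameter are made up for this example. Rows are visited in order of start, each row is compared only against rows already kept, and the backwards scan stops as soon as the start values are more than tol apart.

```python
import pandas as pd

def drop_near_duplicates(group, tol=5):
    """Keep the first row of each set of rows whose start and end are both within tol."""
    group = group.sort_values("start")
    starts = group["start"].to_numpy()
    ends = group["end"].to_numpy()
    kept = []  # positions of the rows kept so far, in ascending start order
    for i in range(len(group)):
        duplicate = False
        # Walk backwards through the kept rows; stop once start is out of range
        for k in reversed(kept):
            if starts[i] - starts[k] > tol:
                break
            if abs(ends[i] - ends[k]) <= tol:
                duplicate = True
                break
        if not duplicate:
            kept.append(i)
    return group.iloc[kept]

df = pd.DataFrame({
    "category": ["cat1", "cat1", "cat2"],
    "start": [10, 12, 10],
    "end": [20, 21, 25],
})

# Apply the function per category and glue the pieces back together
result = pd.concat(drop_near_duplicates(g) for _, g in df.groupby("category"))
print(result)
```

Here pd.concat over the groups is used instead of groupby(...).apply, which keeps the original index and avoids the extra droplevel step. The inner part is still a Python loop, so on ~10 million rows it may need further tuning, but each row is only compared with nearby rows of the same category.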
This is what I am trying to do - I was able to do steps 1 to 4. Need help with steps 5 onward
Basically for each data point I would like to find euclidean distance from all mean vectors based upon column y
1. take data
2. separate out non numerical columns
3. find mean vectors by y column
4. save means
5. subtract each mean vector from each row based upon y value
6. square each column
7. add all columns
8. join back to numerical dataset and then join non numerical columns
import pandas as pd
data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
df = pd.DataFrame(data,columns=['Name','Age','weight','class'],dtype=float)
print (df)
df_numeric=df.select_dtypes(include='number')#, exclude=None)[source]
df_non_numeric=df.select_dtypes(exclude='number')
means=df_numeric.groupby('class').mean()
For each row of means, subtract that row from each row of df_numeric. then take square of each column in the output and then for each row add all columns. Then join this data back to df_numeric and df_non_numeric
--------------update1
added code as below. My questions have changed and updated questions are at the end.
def calculate_distance(row):
    return (np.sum(np.square(row-means.head(1)),1))
def calculate_distance2(row):
    return (np.sum(np.square(row-means.tail(1)),1))
df_numeric2=df_numeric.drop("class",axis=1)
#np.sum(np.square(df_numeric2.head(1)-means.head(1)),1)
df_numeric2['distance0']= df_numeric.apply(calculate_distance, axis=1)
df_numeric2['distance1']= df_numeric.apply(calculate_distance2, axis=1)
print(df_numeric2)
final = pd.concat([df_non_numeric, df_numeric2], axis=1)
final["class"]=df["class"]
Could anyone confirm that this is a correct way to achieve the results? I am mainly concerned about the last two statements. Would the second-to-last statement do a correct join? Would the final statement assign the original class? I would like to confirm that Python won't do the concat and class assignment in a random order, and that it will maintain the order in which the rows appear.
final = pd.concat([df_non_numeric, df_numeric2], axis=1)
final["class"]=df["class"]
I think this is what you want
import pandas as pd
import numpy as np
data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
df = pd.DataFrame(data,columns=['Name','Age','weight','class'],dtype=float)
print (df)
df_numeric=df.select_dtypes(include='number')#, exclude=None)[source]
# Make df_non_numeric a copy and not a view
df_non_numeric=df.select_dtypes(exclude='number').copy()
# Subtract mean (calculated using the transform function which preserves the
# number of rows) for each class to create distance to mean
df_dist_to_mean = df_numeric[['Age', 'weight']] - df_numeric[['Age', 'weight', 'class']].groupby('class').transform('mean')
# Finally calculate the euclidean distance (hypotenuse)
df_non_numeric['euc_dist'] = np.hypot(df_dist_to_mean['Age'], df_dist_to_mean['weight'])
df_non_numeric['class'] = df_numeric['class']
# If you want a separate dataframe named 'final' with the end result
df_final = df_non_numeric.copy()
print(df_final)
It is probably possible to write this even more densely, but this way you'll see what's going on.
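If what you need is the distance from every row to every class mean (not only to the mean of the row's own class), broadcasting might do it in one step. This is a sketch on the example data; the names features, diffs, dist_df and the distance0/distance1 column labels are chosen for this example only.

```python
import numpy as np
import pandas as pd

data = [['Alex', 10, 5, 0], ['Bob', 12, 4, 1], ['Clarke', 13, 6, 0], ['brke', 15, 1, 0]]
df = pd.DataFrame(data, columns=['Name', 'Age', 'weight', 'class'])

features = ['Age', 'weight']
means = df.groupby('class')[features].mean()  # one mean vector per class

# Shape (rows, 1, features) minus (1, classes, features) -> (rows, classes, features)
diffs = df[features].to_numpy()[:, None, :] - means.to_numpy()[None, :, :]

# Square, sum over the feature axis, take the root: Euclidean distance
dist = np.sqrt((diffs ** 2).sum(axis=2))

dist_df = pd.DataFrame(dist, index=df.index,
                       columns=[f'distance{c}' for c in means.index])
final = pd.concat([df, dist_df], axis=1)
print(final)
```

On the ordering question: pd.concat(..., axis=1) and column assignment such as final["class"] = df["class"] both align on the index labels rather than on position, so rows are matched correctly as long as the frames still share the original index.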
I'm sure there is a better way to do this but I iterated through depending on the class and follow the exact steps.
Assigned the 'class' as the index.
Rotated so that the 'class' was in the columns.
Performed that operation of means that corresponded with df_numeric
Squared the values.
Summed the rows.
Concatenated the dataframes back together.
data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
df = pd.DataFrame(data,columns=['Name','Age','weight','class'],dtype=float)
#print (df)
df_numeric=df.select_dtypes(include='number')#, exclude=None)[source]
df_non_numeric=df.select_dtypes(exclude='number')
means=df_numeric.groupby('class').mean().T
import numpy as np
# Changed index
df_numeric.index = df_numeric['class']
df_numeric.drop('class' , axis = 1 , inplace = True)
# Rotated the Numeric data sideways so the class was in the columns
df_numeric = df_numeric.T
# Iterated through the classes in means and selected the matching df_numeric columns
store = [] # Assigned an empty list
for j in means:
    sto = df_numeric[j]
    if isinstance(sto, pd.Series): # If there is a single matching column it comes out as a pd.Series
        sto = sto.to_frame() # Need to convert to dataframe type
    store.append(sto.sub(means[j], axis=0)) # subtract the mean of class j and append the result
values = [sto**2 for sto in store] # Squaring the values
# Summing the rows
summed = []
for i in values:
    summed.append((i.sum(axis = 1)))
df_new = pd.concat(summed , axis = 1)
df_new.T
I have a csv datafile that I've split by a column value into 5 datasets for each person using:
P = {}
for i in range(1,6):
    PersonData = df[df['Person'] == i].values
    P[i] = PersonData
I want to sort the data into ascending order according to one column, then split the data half way at that column to find the median.
So I sorted the data with the following:
dataP = {}
for i in range(1,6):
    sortData = P[i][P[i][:,9].argsort()]
    P[i] = sortData
    dataP[i] = pd.DataFrame(P[i])
dataP[1]
Using that I get a dataframe for each of my datasets 1-5 sorted by the relevant column (9), depending on which number I put into dataP[i].
Then I calculate half the length:
for i in range(1,6):
    middle = len(dataP[i])/2
    print(middle)
Here is where I'm stuck!
I need to create a new column in each dataP[i] dataframe that splits the length in 2 and gives the value 0 if it's in the first half and 1 if it's in the second.
This is what I've tried but I don't understand why it doesn't produce a new list of values 0 and 1 that I can later append to dataP[i]:
for n in range(1, (len(dataP[i]))):
    for n, line in enumerate(dataP[i]):
        if middle > n:
            confval = 0
        elif middle < n:
            confval = 1
for i in range(1,6):
    Confval[i] = confval
Confval[1]
Sorry if this is basic, I'm quite new to this so a lot of what I've written might not be the best way to do it/necessary, and sorry also for the long post.
Any help would be massively appreciated. Thanks in advance!
If I'm reading your question right I believe you are attempting to do two things.
Find the median value of a column
Create a new column which is 0 if the value is less than the median or 1 if greater.
Let's tackle #1 first:
median = df['originalcolumn'].median()
That easy! There are many great pandas functions for things like this.
Ok so number two:
df['newcolumn'] = (df['originalcolumn'] > median).astype(int)
What we're doing here is creating a new boolean Series: False if the value at that location is less than or equal to the median, True otherwise. Then we cast that to int, which gives us 0s and 1s.
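If the goal is the positional split described in the question (the first half of the sorted rows gets 0 and the second half gets 1, separately for each person), a sketch along these lines might work without splitting the data into separate dataframes. The column names Person and score and the sample values are invented for illustration.

```python
import pandas as pd

# Invented sample data: two persons with 4 and 3 rows
df = pd.DataFrame({
    "Person": [1, 1, 1, 1, 2, 2, 2],
    "score": [4, 1, 3, 2, 9, 7, 8],
})

# Sort within each person by the column of interest
df = df.sort_values(["Person", "score"])

# Position of each row within its person (0, 1, 2, ...) and the group size
rank = df.groupby("Person").cumcount()
size = df.groupby("Person")["score"].transform("size")

# 0 for the first half of each person's rows, 1 for the second half
df["half"] = (rank >= size / 2).astype(int)
print(df)
```

With an odd number of rows the middle row lands in the first half here; change the comparison if you would rather have it in the second half.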