I am working with a pandas data frame and trying to subset it so that the cumulative sum of the qty column is not greater than 18 and the percentage of yellow rows selected is not less than 65%, and then to run multiple iterations of that selection. However, the loop sometimes runs forever, and when it does produce results, every iteration returns the same result.
Everything after the while loop was taken from the post below:
Python random sample selection based on multiple conditions
import pandas as pd

df = pd.DataFrame({'id': ['A', 'B', 'C', 'D', 'E', 'G', 'H', 'I', 'J', 'k', 'l', 'm', 'n', 'o'],
                   'color': ['red', 'red', 'orange', 'red', 'red', 'red', 'red',
                             'yellow', 'yellow', 'yellow', 'yellow', 'yellow', 'yellow', 'yellow'],
                   'qty': [5, 2, 3, 4, 7, 6, 8, 1, 5, 2, 3, 4, 7, 6]})
df_sample = df
for x in range(2):
    sample_s = df.sample(n=df.shape[0])
    sample_s = sample_s[sample_s.qty.cumsum() <= 30]
    sample_size = len(sample_s)
    while sum(df['qty']) > 18:
        yellow_size = 0.65
        df_yellow = df[df['color'] == 'yellow'].sample(int(yellow_size * sample_size))
        others_size = 1 - yellow_size
        df_others = df[df['color'] != 'yellow'].sample(int(others_size * sample_size))
        df = pd.concat([df_yellow, df_others]).sample(frac=1)
    print(df)
This is the result I get when it works; note that both iterations produce the same output.
color id qty
red H 2
yellow n 3
yellow J 5
red G 2
yellow I 1
red D 4
color id qty
red H 2
yellow n 3
yellow J 5
red G 2
yellow I 1
red D 4
I would really appreciate any help resolving this.
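For reference, a minimal rejection-sampling sketch (my own illustration, not the original code) that avoids both symptoms: it re-draws from the original, unmodified frame on every attempt instead of shrinking df inside the loop, and it bounds the number of attempts so it cannot loop forever.

import pandas as pd

def draw_sample(df, max_qty=18, min_yellow=0.65, max_tries=1000):
    # Shuffle the original frame on each try, keep rows while the running
    # qty total stays within max_qty, and accept the draw only if at least
    # min_yellow of the kept rows are yellow. Bounding the tries prevents
    # an infinite loop when no valid sample exists.
    for _ in range(max_tries):
        shuffled = df.sample(frac=1)
        candidate = shuffled[shuffled.qty.cumsum() <= max_qty]
        if len(candidate) and (candidate.color == 'yellow').mean() >= min_yellow:
            return candidate
    raise ValueError('no sample satisfying the constraints was found')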
I have a data frame with two corresponding sets of columns, e.g. this sample containing people, their ratings of three fruits, and their ability to recognize each fruit ('corresponding' means that banana_rati corresponds to banana_reco, etc.).
import pandas as pd
df_raw = pd.DataFrame(data=[ ["name1", 10, 10, 9, 10, 10, 10],
["name2", 10, 10, 8, 10, 8, 4],
["name3", 10, 8, 8, 10, 8, 8],
["name4", 5, 10, 10, 5, 10, 8]],
columns=["name", "banana_rati", "mango_rati", "orange_rati",
"banana_reco", "mango_reco", "orange_reco"])
Suppose I now want to find each respondent's favorite fruit, which I define as the highest-rated fruit.
I do this via:
cols_find_max = ["banana_rati", "mango_rati", "orange_rati"] # columns to find the maximum in
mxs = df_raw[cols_find_max].eq(df_raw[cols_find_max].max(axis=1), axis=0) # bool indicator if the cell contains the row-wise maximum value across cols_find_max
However, some respondents rated more than one fruit with the highest value:
df_raw['highest_rated_fruits'] = mxs.dot(mxs.columns + ' ').str.rstrip(', ').str.replace("_rati", "").str.split()
df_raw['highest_rated_fruits']
# Out:
# [banana, mango]
# [banana, mango]
# [banana]
# [mango, orange]
I now want to use the maximum of ["banana_reco", "mango_reco", "orange_reco"] for tie breaks. If this also gives no tie break, I want a random selection of fruits from the so-determined favorite ones.
Can someone help me with this?
The expected output is:
df_raw['fav_fruit']
# Out
# mango # <- random selection from banana (rating: 10, recognition: 10) and mango (same values)
# banana # <- highest ratings: banana, mango; highest recognition: banana
# banana # <- highest rating: banana
# mango # <- highest ratings: mango, orange; highest recognition: mango
UPDATED
Here's a way to do what your question asks:
from random import sample

# one helper row per fruit, holding the names of the per-fruit helper columns
df = pd.DataFrame(
    {'name': [c[:-len('_rati')] for c in df_raw.columns if c.endswith('_rati')]})
df = df.assign(rand=df.name + '_rand', tupl=df.name + '_tupl')
num = len(df)
# a distinct random rank per fruit in every row, used as the final tie break
df_raw[df.rand] = [sample(range(num), k=num) for _ in range(len(df_raw))]
df_ord = pd.DataFrame(
    {f'{fr}_tupl': df_raw.apply(
        lambda x: tuple(x[[f'{fr}_{suff}' for suff in ('rati', 'reco', 'rand')]]), axis=1)
     for fr in df.name})
df_raw['fav_fruit'] = df_ord.apply(lambda x: df.name[list(x == x.max())].squeeze(), axis=1)
df_raw = df_raw.drop(columns=df.rand)
Sample output:
name banana_rati mango_rati orange_rati banana_reco mango_reco orange_reco fav_fruit
0 name1 10 10 9 10 10 10 banana
1 name2 10 10 8 10 8 4 banana
2 name3 10 8 8 10 8 8 banana
3 name4 5 10 10 5 10 8 mango
Explanation:
create one new column per fruit, ending in rand, which together hold a shuffled sequence of distinct integers (0 through the number of fruits minus 1) for each row
create one new column per fruit ending in tupl containing 3-tuples of rati, reco, rand corresponding to that fruit
because the rand value for each fruit in a given row is distinct, the 3-tuples will break ties, and therefore, for each row we can simply look up the favorite fruit, namely the fruit whose tuple matches the row's max tuple
drop intermediate columns and we're done.
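For orientation, the helper frame df built by the first two steps looks like this:

print(df)
#      name         rand         tupl
# 0  banana  banana_rand  banana_tupl
# 1   mango   mango_rand   mango_tupl
# 2  orange  orange_rand  orange_tupl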
Try:
import numpy as np
mxs.dot(mxs.columns + ' ').str.rstrip(', ').str.replace("_rati", "").str.split().apply(lambda x: x[np.random.randint(len(x))])
This adds .apply(lambda x: x[np.random.randint(len(x))]) to the end of your last statement and randomly selects an element from the list.
Run 1:
0 banana
1 banana
2 banana
3 orange
Run 2:
0 mango
1 banana
2 banana
3 orange
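Since the selection is random, the two runs differ; if you need repeatable picks, seeding NumPy's generator first is enough (a general note, not part of the original answer):

np.random.seed(0)  # any fixed seed makes the random tie-breaks repeatable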
Imagine I have the following data frame:
Product   Month 1   Month 2   Month 3   Month 4   Total
Stuff A         5         0         3         3      11
Stuff B        10        11         4         8      33
Stuff C         0         0        23        30      53
that can be constructed from:
df = pd.DataFrame({'Product': ['Stuff A', 'Stuff B', 'Stuff C'],
'Month 1': [5, 10, 0],
'Month 2': [0, 11, 0],
'Month 3': [3, 4, 23],
'Month 4': [3, 8, 30],
'Total': [11, 33, 53]})
This data frame shows the amount of units sold per product, per month.
Now, what I want to do is to create a new column called "Average" that calculates the average units sold per month. HOWEVER, notice in this example that Stuff C's values for months 1 and 2 are 0. This product was probably introduced in Month 3, so its average should be calculated based on months 3 and 4 only. Also notice that Stuff A's units sold in Month 2 were 0, but that does not mean the product was introduced in Month 3 since 5 units were sold in Month 1. That is, its average should be calculated based on all four months. Assume that the provided data frame may contain any number of months.
Based on these conditions, I have come up with the following solution in pseudo-code:
months = ["list of index names of months to calculate"]
x = len(months)
if df["Month 1"] != 0:
    df["Average"] = df["Total"] / x
elif df["Month 2"] != 0:
    df["Average"] = df["Total"] / (x - 1)
...
elif df["Month " + str(x)] != 0:
    df["Average"] = df["Total"] / 1
else:
    df["Average"] = 0
That way, the average would be calculated starting from the first month where units sold are different from 0. However, I haven't been able to translate this logical abstraction into actual working code. I couldn't manage to iterate over len(months) while maintaining the elif conditions. Or maybe there is a better, more practical approach.
I would appreciate any help, since I've been trying to crack this problem for a while with no success.
There is a NumPy function, np.trim_zeros, that trims leading and/or trailing zeros. Using a list comprehension, you can iterate over the relevant DataFrame rows, trim the leading zeros, and take the mean of what remains on each row.
Note that since 'Month 1' to 'Month 4' are consecutive, you can slice the columns between them using .loc.
import numpy as np

df['Average Sales'] = [np.trim_zeros(row, trim='f').mean()
                       for row in df.loc[:, 'Month 1':'Month 4'].to_numpy()]
Output:
Product Month 1 Month 2 Month 3 Month 4 Total Average Sales
0 Stuff A 5 0 3 3 11 2.75
1 Stuff B 10 11 4 8 33 8.25
2 Stuff C 0 0 23 30 53 26.50
Try:
df = df.set_index(['Product','Total'])
df['Average'] = df.where(df.ne(0).cummax(axis=1)).mean(axis=1)
df_out = df.reset_index()
print(df_out)
Output:
Product Total Month 1 Month 2 Month 3 Month 4 Average
0 Stuff A 11 5 0 3 3 2.75
1 Stuff B 33 10 11 4 8 8.25
2 Stuff C 53 0 0 23 30 26.50
Details:
Move Product and Total into the dataframe index, so the calculation runs on the rest of the dataframe only.
First, build a boolean matrix with ne (not equal to zero). Then take cummax along the rows: once a non-zero value appears, the mask stays True through the end of the row; if a row starts with zeros, it stays False until the first non-zero value and is True from there on.
Next, use pd.DataFrame.where to keep only the values where that boolean matrix is True; the remaining values (the leading zeros) become NaN and are excluded from the mean calculation.
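To make that concrete, this is the mask that where receives for the sample data:

print(df.ne(0).cummax(axis=1))
#                Month 1  Month 2  Month 3  Month 4
# Product Total
# Stuff A 11        True     True     True     True
# Stuff B 33        True     True     True     True
# Stuff C 53       False    False     True     True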
If you don't mind it being a little memory-inefficient, you could put your dataframe into a NumPy array. NumPy has a built-in way to index out the non-zero entries of an array, and then you can use the mean function to calculate the average. It could look something like this:
import numpy as np

arr = np.array(Stuff_A_DF)          # Stuff_A_DF: the monthly sales values for one product
mean = arr[np.nonzero(arr)].mean()  # keep only the non-zero entries, then average
Alternatively, you could manually extract the row to a list, then loop through to remove the zeroes.
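A rough sketch of that manual variant, using the df from the question (note that, like the np.nonzero approach, it drops every zero, not just the leading ones):

row = df.loc[df['Product'] == 'Stuff C', 'Month 1':'Month 4'].squeeze().tolist()
non_zero = [v for v in row if v != 0]
average = sum(non_zero) / len(non_zero) if non_zero else 0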
The problem: I have a SQLAlchemy-mapped table called NumFav that stores arrays of some people's favourite numbers, using this structure:
id name numbers
0 Vladislav [2, 3, 5]
1 Michael [4, 6, 7, 9]
numbers is a postgresql.ARRAY(Integer).
I want to make a plot with people's id on the X axis and a dot for each of their numbers on the Y axis, to show which numbers have been chosen.
I extract data using
df = pd.read_sql(Session.query(NumFav).statement, engine)
How can I create a plot with such data?
You can explode the number lists into "long form":
df = df.explode('numbers')
df['color'] = df.id.map({0: 'red', 1: 'blue'})
# id name numbers color
# 0 Vladislav 2 red
# 0 Vladislav 3 red
# 0 Vladislav 5 red
# 1 Michael 4 blue
# 1 Michael 6 blue
# 1 Michael 7 blue
# 1 Michael 9 blue
Then you can directly plot.scatter:
df.plot.scatter(x='name', y='numbers', c='color')
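One caveat, depending on your pandas version (an assumption, not something from the question): after explode, the numbers column may keep object dtype, so it is safer to cast it to a numeric type before plotting:

df['numbers'] = df['numbers'].astype(int)  # exploded list columns keep object dtype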
Like this:
import matplotlib.pyplot as plt
# one line per person: each row's list of numbers is plotted against its position
for idx, row in df.iterrows():
    plt.plot(row['numbers'])
plt.legend(df['name'])
plt.show()
I have some numerical time series of varying lengths stored in a wide pandas dataframe. Each row corresponds to one series and each column to a measurement time point. Because of their varying lengths, the series can have missing-value (NA) tails on the left (first time points), on the right (last time points), or both. There is always a continuous stripe without NA of a minimum length on each row.
I need to get a random subset of fixed length from each of these rows, without including any NA. Ideally, I wish to keep the original dataframe intact and to report the subsets in a new one.
I managed to obtain this output with a very inefficient for loop that goes through each row one by one, determines a start for the crop position such that NAs will not be included in the output and copies the cropped result. This works but it is extremely slow on large datasets. Here is the code:
import pandas as pd
import numpy as np
from copy import copy
def crop_random(df_in, output_length, ignore_na_tails=True):
    # Initialize new dataframe
    colnames = ['X_' + str(i) for i in range(output_length)]
    df_crop = pd.DataFrame(index=df_in.index, columns=colnames)
    # Go through all rows
    for irow in range(df_in.shape[0]):
        series = copy(df_in.iloc[irow, :])
        series = np.array(series).astype('float')
        length = len(series)
        if ignore_na_tails:
            pos_non_na = np.where(~np.isnan(series))
            # Range where the subset might start
            lo = pos_non_na[0][0]
            hi = pos_non_na[0][-1]
            left = np.random.randint(lo, hi - output_length + 2)
        else:
            left = np.random.randint(0, length - output_length)
        series = series[left:left + output_length]
        df_crop.iloc[irow, :] = series
    return df_crop
And a toy example:
df = pd.DataFrame.from_dict({'t0': [np.NaN, 1, np.NaN],
't1': [np.NaN, 2, np.NaN],
't2': [np.NaN, 3, np.NaN],
't3': [1, 4, 1],
't4': [2, 5, 2],
't5': [3, 6, 3],
't6': [4, 7, np.NaN],
't7': [5, 8, np.NaN],
't8': [6, 9, np.NaN]})
# t0 t1 t2 t3 t4 t5 t6 t7 t8
# 0 NaN NaN NaN 1 2 3 4 5 6
# 1 1 2 3 4 5 6 7 8 9
# 2 NaN NaN NaN 1 2 3 NaN NaN NaN
crop_random(df, 3)
# One possible output:
# X_0 X_1 X_2
# 0 2 3 4
# 1 7 8 9
# 2 1 2 3
How could I achieve same results in a way adapted to large dataframes?
Edit: Moved my improved solution to the answer section.
I managed to speed up things quite drastically with:
def crop_random(dataset, output_length, ignore_na_tails=True):
    # Get a random range to crop for each row
    def get_range_crop(series, output_length, ignore_na_tails):
        series = np.array(series).astype('float')
        if ignore_na_tails:
            pos_non_na = np.where(~np.isnan(series))
            start = pos_non_na[0][0]
            end = pos_non_na[0][-1]
            left = np.random.randint(start,
                                     end - output_length + 2)  # +1 to include last in randint; +1 for selection span
        else:
            length = len(series)
            left = np.random.randint(0, length - output_length)
        right = left + output_length
        return left, right

    # Crop the rows to a random range; reset_index to concat without recreating new columns
    range_subset = dataset.apply(get_range_crop, args=(output_length, ignore_na_tails), axis=1)
    new_rows = [dataset.iloc[irow, range_subset[irow][0]:range_subset[irow][1]]
                for irow in range(dataset.shape[0])]
    for row in new_rows:
        row.reset_index(drop=True, inplace=True)

    # Concatenate all rows
    dataset_cropped = pd.concat(new_rows, axis=1).T
    return dataset_cropped
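It is called the same way as the slow version, e.g. on the toy frame from the question (here the output columns come out as 0 .. output_length-1 rather than X_0 ..):

df_subsets = crop_random(df, output_length=3)
print(df_subsets)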
I'm looking at creating an algorithm where, if the views_per_hour is 2x larger than the average_views_per_hour, I give the channel 5 points; if it is 3x larger, I give it 10 points; and if it is 4x larger, I give it 20 points. I'm not really sure how to go about this and would really appreciate some help.
df = pd.DataFrame({'channel':['channel1','channel2','channel3','channel4'], 'views_per_hour_today':[300,500,2000,100], 'average_views_per_hour':[100,200,200,50],'points': [0,0,0,0] })
df.loc[:, 'average_views_per_hour'] *= 2
df['n=2'] = np.where(df['views_per_hour_today'] >= df['average_views_per_hour'], 5, 0)
df.loc[:, 'average_views_per_hour'] *= 3
df['n=3'] = np.where(df['views_per_hour_today'] >= df['average_views_per_hour'], 5, 0)
df.loc[:, 'average_views_per_hour'] *= 4
df['n=4'] = np.where(df['views_per_hour_today'] >= df['average_views_per_hour'], 10, 0)
I expected to be able to add up the results from columns n=2, n=3, and n=4 into the points column for each row, but those columns always show either 5 or 10 and never 0 (the code behaves as if views_per_hour_today is always greater than average_views_per_hour, even after average_views_per_hour has been multiplied by a large integer).
There are multiple ways of solving this kind of problem. You can use NumPy's select, which has a concise syntax; you can also define a function and apply it row-wise (a sketch of that variant follows the output below).
import numpy as np

div = df['views_per_hour_today'] / df['average_views_per_hour']
cond = [(div >= 2) & (div < 3), (div >= 3) & (div < 4), (div >= 4)]
choice = [5, 10, 20]
df['points'] = np.select(cond, choice)
channel views_per_hour_today average_views_per_hour points
0 channel1 300 100 10
1 channel2 500 200 5
2 channel3 2000 200 20
3 channel4 100 50 5
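And a sketch of the apply-based variant mentioned above (my own illustration of the same scoring rules):

def score(row):
    # points scale with how many times today's views exceed the hourly average
    ratio = row['views_per_hour_today'] / row['average_views_per_hour']
    if ratio >= 4:
        return 20
    if ratio >= 3:
        return 10
    if ratio >= 2:
        return 5
    return 0

df['points'] = df.apply(score, axis=1)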