import pandas as pd
import numpy as np
data_dir = 'data_r14.csv'
data = pd.read_csv(data_dir)
# print(data)
signals = data['signal']
value_counts = signals.value_counts()
buy_count = value_counts[1]
signals_code = [1, 2]
buy_sell_rows = data.loc[data['signal'].isin(signals_code)]
data_without_signals = data[~data['signal'].isin(signals_code)]
random_0_indexes = np.random.choice(data_without_signals.index.values, buy_count)
value_counts2 = data_without_signals['signal'].value_counts()
# print(value_counts2)
for index in random_0_indexes:
    row = data.loc[index, :]
    # df = row.to_frame()
    print(row)
    buy_sell_rows.append(row)
# print(buy_sell_rows)
# print(signals.loc[index, :])
# print(random_0_rows)
print(buy_sell_rows)
# print(buy_sell_rows['signal'].value_counts())
So I have a dataframe with a column named signal whose values are either 0, 1, or 2, and I want to balance it so there is an equal number of rows for each value, because the classes are very unbalanced: I have only 1984 rows with a non-zero value and over 20000 rows of zero value.
So I created a new dataframe called data_without_signals holding only the zero-valued rows, drew a random list of indexes from it, and then ran a loop to append each of those rows to another dataframe I created called buy_sell_rows, which holds only the non-zero rows. The issue is that the rows are not actually being appended.
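For reference, the root cause here is that DataFrame.append returned a new frame rather than mutating in place (the method was removed entirely in pandas 2.x), so its result must be reassigned. A minimal sketch with stand-in data (not the asker's CSV):

import pandas as pd

# hypothetical stand-in frames for illustration
buy_sell_rows = pd.DataFrame({"signal": [1, 2]})
row = pd.Series({"signal": 0}, name=0)

# buy_sell_rows.append(row)  # the bug: the returned frame is discarded
# pd.concat is the current idiom; reassign the result
buy_sell_rows = pd.concat([buy_sell_rows, row.to_frame().T], ignore_index=True)
print(buy_sell_rows)  # now contains the appended row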
As said in my comment, I think your general approach could be simplified by randomly sampling the different signals:
# my test signal of 0s, 1s and 2s
test = pd.DataFrame({"data" : [0,0,0,1,1,1,1,1,1,1,2,2,2,2,2,2]})
# get the lowest size of any group, which is the size all groups should be reduced to
max_size = test.groupby("data")["data"].count().min()
# sample
output = (test
          .groupby(["data"])
          .agg(sample=("data", lambda x: x.sample(max_size).to_list()))
          .explode("sample")
          .reset_index(drop=True)
          )
and the output for this test is:
  sample
0      0
1      0
2      0
3      1
4      1
5      1
6      2
7      2
8      2
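As a side note (my addition, not part of the original answer): pandas 1.1 and later ship DataFrameGroupBy.sample, which expresses the same down-sampling in one step:

import pandas as pd

test = pd.DataFrame({"data": [0,0,0,1,1,1,1,1,1,1,2,2,2,2,2,2]})
# size of the smallest group, i.e. the size every group is cut down to
min_size = test.groupby("data")["data"].count().min()
# draw min_size rows from each group (requires pandas >= 1.1)
balanced = test.groupby("data").sample(n=min_size).reset_index(drop=True)
print(balanced["data"].value_counts())  # each signal appears min_size times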
Good evening all,
I have a situation where I need to split a dataframe into two complementary parts based on the value of one feature.
What I mean by this is that for every row in dataframe 1, I need a complementary row in dataframe 2 that takes on the opposite value of that specific feature.
In my source dataframe, the feature I'm referring to is stored under column "773", and it can take on values of either 0.0 or 1.0.
I came up with the following code that does this sufficiently, but it is remarkably slow. It takes about a minute to split 10,000 rows, even on my all-powerful EC2 instance.
data = chunk.iloc[:,1:776]
listy1 = []
listy2 = []
for i in range(0, len(data)):
    random_row = data.sample(n=1).iloc[0]
    listy1.append(random_row.tolist())
    if random_row["773"] == 0.0:
        x = data[data["773"] == 1.0].sample(n=1).iloc[0]
        listy2.append(x.tolist())
    else:
        x = data[data["773"] == 0.0].sample(n=1).iloc[0]
        listy2.append(x.tolist())
df1 = pd.DataFrame(listy1)
df2 = pd.DataFrame(listy2)
Note: I don't care about duplicate rows, because this data is being used to train a model that compares two objects to tell which one is "better."
Do you have some insight into why this is so slow, or any suggestions as to make this faster?
A key concept in efficient numpy/scipy/pandas coding is using library-shipped vectorized functions whenever possible. Try to process multiple rows at once instead of iterate explicitly over rows. i.e. avoid for loops and .iterrows().
The implementation provided is a little subtle in terms of indexing, but the vectorized thinking is straightforward:
1. Draw the main dataset at once.
2. For the complementary dataset: draw the 0-rows at once, draw the complementary 1-rows at once, and then put them into the corresponding rows at once.
Code:
import pandas as pd
import numpy as np
from datetime import datetime
np.random.seed(52) # reproducibility
n = 10000
df = pd.DataFrame(
    data={
        "773": [0, 1] * int(n / 2),
        "dummy1": list(range(n)),
        "dummy2": list(range(0, 10 * n, 10))
    }
)
t0 = datetime.now()
print("Program begins...")
# 1. draw the main dataset
draw_idx = np.random.choice(n, n)  # draw n indexes, with replacement
df_main = df.iloc[draw_idx, :].reset_index(drop=True)
# 2. draw the complementary dataset
# (1) count number of 1's and 0's
n_1 = np.count_nonzero(df["773"][draw_idx].values)
n_0 = n - n_1
# (2) split data for drawing
df_0 = df[df["773"] == 0].reset_index(drop=True)
df_1 = df[df["773"] == 1].reset_index(drop=True)
# (3) draw n_1 indexes in df_0 and n_0 indexes in df_1
idx_0 = np.random.choice(len(df_0), n_1)
idx_1 = np.random.choice(len(df_1), n_0)
# (4) broadcast the drawn rows into the complementary dataset
df_comp = df_main.copy()
mask_0 = (df_main["773"] == 0).values
df_comp.iloc[mask_0, :] = df_1.iloc[idx_1, :].values   # df_1 into mask_0
df_comp.iloc[~mask_0, :] = df_0.iloc[idx_0, :].values  # df_0 into ~mask_0
print(f"Program ends in {(datetime.now() - t0).total_seconds():.3f}s...")
Check
print(df_main.head(5))
773 dummy1 dummy2
0 0 28 280
1 1 11 110
2 1 13 130
3 1 23 230
4 0 86 860
print(df_comp.head(5))
773 dummy1 dummy2
0 1 19 190
1 0 74 740
2 0 28 280 <- this row is complementary to df_main
3 0 60 600
4 1 37 370
Efficiency gain: 14.23s -> 0.011s (ca. 1300x)
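A quick sanity check (my addition, continuing the example above): each drawn row and its complement should carry opposite values in "773", so their pairwise sum is always 1.

# every main/complement pair must disagree in column "773"
assert (df_main["773"].values + df_comp["773"].values == 1).all()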
I have sales data till Jul-2020 and want to predict the next 3 months using a recovery rate.
This is the dataframe:
test = pd.DataFrame({'Country': ['USA', 'USA', 'USA', 'USA', 'USA'],
                     'Month': [6, 7, 8, 9, 10],
                     'Sales': [100, 200, 0, 0, 0],
                     'Recovery': [0, 1, 1.5, 2.5, 3]})
This is how it looks:

  Country  Month  Sales  Recovery
0     USA      6    100       0.0
1     USA      7    200       1.0
2     USA      8      0       1.5
3     USA      9      0       2.5
4     USA     10      0       3.0

Now, I want to add a "Predicted" column that fills in the zero-sales months, giving predictions of 300, 500, and 600 for months 8, 9, and 10.
The first value, 300 (at row 3), is basically (200 * 1.5/1). This becomes the base value going ahead, so the next value, i.e. 500, is basically (300 * 2.5/1.5), and so on.
How do I iterate over every row, starting from row 3 onwards? I tried using shift() but couldn't iterate over the rows.
You could do it like this:
import pandas as pd
test = pd.DataFrame({'Country': ['USA', 'USA', 'USA', 'USA', 'USA'],
                     'Month': [6, 7, 8, 9, 10],
                     'Sales': [100, 200, 0, 0, 0],
                     'Recovery': [0, 1, 1.5, 2.5, 3]})
test['Prediction'] = test['Sales']
for i in range(1, len(test)):
    # prevent division by zero
    if test.loc[i-1, 'Recovery'] != 0:
        test.loc[i, 'Prediction'] = test.loc[i-1, 'Prediction'] * test.loc[i, 'Recovery'] / test.loc[i-1, 'Recovery']
The sequence you have is really just Recovery times a base level (the last non-zero Sales value, 200).
You can compute that sequence like this:
valid_sales = test.Sales > 0
prediction = (test.Recovery * test.Sales[valid_sales].iloc[-1]).rename("Predicted")
And then combine by index, insert column or concat:
pd.concat([test, prediction], axis=1)
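For reference, with the sample frame above the last non-zero sale (200) is the base level, so this yields:

  Country  Month  Sales  Recovery  Predicted
0     USA      6    100       0.0        0.0
1     USA      7    200       1.0      200.0
2     USA      8      0       1.5      300.0
3     USA      9      0       2.5      500.0
4     USA     10      0       3.0      600.0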
I have a data frame with the column "Key" as index like below:
       Prediction
Key
C11D0           0
C11D1           8
C12D0           1
C12D1           5
C13D0           3
C13D1           9
C14D0           4
C14D1           9
C15D0           5
C15D1           3
C1D0            5
C2D0            7
C3D0            4
C4D0            1
C4D1            9
I want to add the values of two cells in the Prediction column when their indexes match in the first 4 characters, for example "C11D0" & "C11D1", or "C14D0" & "C14D1". Then the output will be:
Operation    Addition Result
C11D0+C11D1                8
C12D0+C12D1                6
C13D0+C13D1               12
You can use the isin function.
Example:
import pandas as pd
df = pd.DataFrame({'id':[1,2,3,4,5,6], 'value':[1,2,1,3,7,1]})
df[df.id.isin([1,5,6])].value.sum()
output:
9
For your case:
idx = ['C11D0', 'C11D1']
print(df[df.Key.isin(idx)].Prediction.sum()) #outputs 8
First set key as a column if it is the index:
df.reset_index(inplace=True)
Then you can use DataFrame.loc with boolean indexing:
df.loc[df['Key'].isin(["C11D0","C11D1"]), 'Prediction'].sum()
You can also create a function for it:
def sum_select_df(key_list, df):
    return pd.concat([df[df['Key'].isin(['C'+str(key)+'D1', 'C'+str(key)+'D0'])] for key in key_list])['Prediction'].sum()
sum_select_df([11,14],df)
Output:
21
Here is a complete solution, slightly different from the other answers so far. I tried to make it pretty self-explanatory, but let me know if you have any questions!
import numpy as np # only used to generate test data
import pandas as pd
import itertools as itt
start_inds = ["C11D0", "C11D1", "C12D0", "C12D1", "C13D0", "C13D1", "C14D0", "C14D1",
              "C15D0", "C15D1", "C1D0", "C2D0", "C3D0", "C4D0", "C4D1"]
test_vals = np.random.randint(low=0, high=10, size=len(start_inds))
df = pd.DataFrame(data=test_vals, index=start_inds, columns=["prediction"])
ind_combs = itt.combinations(df.index.array, 2)
sum_records = ((f"{ind1}+{ind2}", df.loc[[ind1, ind2], "prediction"].sum())
               for (ind1, ind2) in ind_combs if ind1[:4] == ind2[:4])
res_ind, res_vals = zip(*sum_records)
res_df = pd.DataFrame(data=res_vals, index=res_ind, columns=["sum_result"])
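A more pandas-native variant (my sketch, assuming the paired keys always differ only in their final character): group on the key minus its last character and keep only the complete pairs.

import pandas as pd

# hypothetical sample data matching the question's expected output
df = pd.DataFrame({"Prediction": [0, 8, 1, 5, 3, 9, 5]},
                  index=["C11D0", "C11D1", "C12D0", "C12D1", "C13D0", "C13D1", "C1D0"])

# group keys that differ only in the trailing character, e.g. C11D0 / C11D1
groups = df.groupby(df.index.str[:-1])["Prediction"]
pair_sums = groups.sum()[groups.count() == 2]  # drops unpaired keys like C1D0
print(pair_sums)  # C11D -> 8, C12D -> 6, C13D -> 12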
I have a data frame with 384 rows (plus an additional dummy row at the beginning).
Each row has 4 variables I entered manually, 3 fields calculated from those 4 variables, and 3 fields comparing each calculated variable to the previous row. Each of these fields can take 1 of two values (basically True/False).
Final goal: I want to arrange the data frame so that each of the 64 possible combinations of the 6 calculated fields (2^6) occurs exactly 6 times (2^6*6=384).
Each iteration builds a frequency table (pivot), and if any combination's count differs from 6, the loop breaks and re-randomizes the order.
The problem is that there are 384!-12*6! possible orderings, and my computer has been running the following script for over 4 days without finding a solution.
import pandas as pd
from numpy import random
# a function that calculates if a row is congruent or in-congruent
def set_cong(df):
    if (df["left"] > df["right"] and df["left_size"] > df["right_size"]) or (df["left"] < df["right"] and df["left_size"] < df["right_size"]):
        return "Cong"
    else:
        return "InC"
# open file and calculate the basic fields
DF = pd.read_csv("generator.csv")
DF["distance"] = abs(DF.right-DF.left)
DF["CR"] = DF.left > DF.right
DF["Cong"] = DF.apply(set_cong, axis=1)
again = 1
# main loop to try and find optimal order
while again == 1:
    # make a copy of the DF to not have to load it each iteration
    df = DF.copy()
    again = 0
    # note: the randint values must be scalars, not one-element lists,
    # otherwise mixing them with the scalar 0 below breaks sort_values
    df["rand"] = [random.randint(low=1, high=100000) for i in range(df.shape[0])]
    # as 3 of the fields are calculated based on the previous row, the first
    # row is a dummy and needs to stay first when sorted
    df.loc[0, "rand"] = 0
    Sorted = df.sort_values(['rand'])
    Sorted["Cong_n1"] = Sorted.Cong.eq(Sorted.Cong.shift())
    Sorted["Side_n1"] = Sorted.CR.eq(Sorted.CR.shift())
    Sorted["Dist_n1"] = Sorted.distance.eq(Sorted.distance.shift())
    # here the dummy is deleted
    Sorted = Sorted.drop(0, axis=0)
    grouped = Sorted.groupby(['distance', 'CR', 'Cong', 'Cong_n1', 'Dist_n1', "Side_n1"])
    for name, group in grouped:
        if group.shape[0] != 6:
            again = 1
            break
Sorted.to_csv("Edos.csv", sep="\t", index=False)
print("bye")
the data frame looks like this:
left  right  size_left  size_right  distance  cong  CR  distance_n1  cong_n1  side_n1
   1      6         22          44         5     T   F        dummy    dummy    dummy
   5      4         44          22         1     T   T            F        T        F
   2      3         44          22         1     F   F            T        F        F
Starting with this dataframe, I want to generate 100 random numbers per row, using the hmean column for loc and the hstd column for scale.
I am starting with a data frame that I convert to an array. I want to iterate through the entire data frame and produce the following output.
My code below only returns the answer for row zero.
Name amax hmean hstd amin
0 Bill 22.924545 22.515861 0.375822 22.110000
1 Bob 26.118182 24.713880 0.721507 23.738400
2 Becky 23.178606 22.722464 0.454028 22.096752
This code provides one row of output, instead of three
from scipy import stats
import pandas as pd
def h2f(df, n):
    for index, row in df.iterrows():
        list1 = []
        nr = df.as_matrix()
        ff = stats.norm.rvs(loc=nr[index, 2], scale=nr[index, 3], size=n)
        list1.append(ff)
        return list1
df2 = h2f(data, 100)
pd.DataFrame(df2)
This is the output of my code
0 1 2 3 4 ... 99 100
0 22.723833 22.208324 22.280701 22.416486 22.620035 22.55817
This is the desired output
0 1 2 3 ... 99 100
0 22.723833 22.208324 22.280701 22.416486 22.620035
1 21.585776 22.190145 22.206638 21.927285 22.561882
2 22.357906 22.680952 21.4789 22.641407 22.341165
Dedent return list1 so it is not in the for-loop.
Otherwise, the function returns after only one pass through the loop.
Also move list1 = [] outside the for-loop so list1 does not get re-initialized with every pass through the loop:
import io
from scipy import stats
import pandas as pd
def h2f(df, n):
    list1 = []
    for index, row in df.iterrows():
        mean, std = row['hmean'], row['hstd']
        ff = stats.norm.rvs(loc=mean, scale=std, size=n)
        list1.append(ff)
    return list1
content = '''\
Name amax hmean hstd amin
0 Bill 22.924545 22.515861 0.375822 22.110000
1 Bob 26.118182 24.713880 0.721507 23.738400
2 Becky 23.178606 22.722464 0.454028 22.096752'''
df = pd.read_table(io.StringIO(content), sep=r'\s+')  # StringIO, since content is a str in Python 3
df2 = pd.DataFrame(h2f(df, 100))
print(df2)
PS. It is inefficient to call nr = df.as_matrix() with each pass through the loop.
Since nr never changes, at most, call it once, before entering the for-loop.
Even better, just use row['hmean'] and row['hstd'] to obtain the desired numbers.
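Going one step further (my sketch, not part of the original answer): stats.norm.rvs broadcasts loc and scale against size, so the loop can be dropped entirely.

import pandas as pd
from scipy import stats

df = pd.DataFrame({"hmean": [22.515861, 24.713880, 22.722464],
                   "hstd": [0.375822, 0.721507, 0.454028]})
n = 100
# loc/scale of shape (3, 1) broadcast against size (3, n):
# one row of n draws per input row, no Python loop needed
samples = stats.norm.rvs(loc=df["hmean"].to_numpy()[:, None],
                         scale=df["hstd"].to_numpy()[:, None],
                         size=(len(df), n))
df2 = pd.DataFrame(samples)
print(df2.shape)  # (3, 100)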