Generate dummy data with 60% 0's and 40% 1's - python

I want to generate a new column in my dataframe df which can take only two values i.e. 0 or 1. My dataframe currently has 1000 rows with other columns as well. I want to generate 0 and 1 in such a way that 60% of the values in the column are 0 and rest 40% 1.
I did the following :
generated_data = []
for index, row in df.iterrows():
if index <= len(df) * 0.6 :
generated_data.append(0)
else :
generated_data.append(1)
The question is : How can this be achieved randomly. in my code top 60% of the rows are 0 and rest 1. I want to achieve the randomness in the creation.
Thanks

In case you want precisely 60% of 0 and 40% of 1, you could first create the column with np.onesand np.zeros, and then shuffle it :
import numpy as np
generated_data = np.concatenate([np.zeros(600), np.ones(400)])
np.random.shuffle(generated_data)
print(generated_data)

Use numpy.random.choice with p parameter if need each value has 60% chance to be 0 and 40% chance to be 1.
For 60% 0's and 40% 1's use numpy.random.shuffle. with all possible values generated before:
import numpy as np
np.random.seed(123)
df = pd.DataFrame({'a':range(1000)})
#print (df)
arr = np.ones(len(df))
arr[:int(len(df) * 0.6)] = 0
np.random.shuffle(arr)
df['new1'] = arr
df['new2'] = np.random.choice([0, 1], size=len(df), p=(0.6, 0.4))
print (df['new1'].value_counts())
0.0 600
1.0 400
Name: new1, dtype: int64
print (df['new2'].value_counts())
0 601
1 399
Name: new2, dtype: int64

Related

Select rows of dataframe whose column values amount to a given sum

I need to find out how many of the first N rows of a dataframe make up (just over) 50% of the sum of values for that column.
Here's an example:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10, 1), columns=list("A"))
0 0.681991
1 0.304026
2 0.552589
3 0.716845
4 0.559483
5 0.761653
6 0.551218
7 0.267064
8 0.290547
9 0.182846
therefore
sum_of_A = df["A"].sum()
4.868260213425804
and with this example I need to find, starting from row 0, how many rows I need to get a sum of at least 2.43413 (approximating 50% of sum_of_A).
Of course I could iterate through the rows and sum and break when I get over 50%, but is there a more concise/Pythonic/efficient way of doing this?
I would use .cumsum(), which we can use to get all the rows where the cumulative sum is at least half of the total sum:
df[df["A"].cumsum() < df["A"].sum() / 2]

How can I forward fill a dataframe column where the limit of rows filled is based on the value of a cell in another column?

So I am trying to forward fill a column with the limit being the value in another column. This is the code I run and I get this error message.
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['NM'] = [0, 0, 1, np.nan, np.nan, np.nan, 0]
df['length'] = [0, 0, 2, 0, 0, 0, 0]
print(df)
NM length
0 0.0 0
1 0.0 0
2 1.0 2
3 NaN 0
4 NaN 0
5 NaN 0
6 0.0 0
df['NM'] = df['NM'].fillna(method='ffill', limit=df['length'])
print(df)
ValueError: Limit must be an integer
The dataframe I want looks like this:
NM length
0 0.0 0
1 0.0 0
2 1.0 2
3 1.0 0
4 1.0 0
5 NaN 0
6 0.0 0
Thanks in advance for any help you can provide!
I do not think you want to use ffill for this instance.
Rather I would recommend filtering to where length is greater than 0, then iterating through those rows to enter the NM value from that row in the next n+length rows.
for row in df.loc[df.length.gt(0)].reset_index().to_dict(orient='records'):
df.loc[row['index']+1:row['index']+row['length'], 'NM'] = row['NM']
To better break this down:
Get rows containing change information be sure to include the index.
df.loc[df.length.gt(0)].reset_index().to_dict(orient='records')
iterate through them... I prefer to_dict for performance reasons on large datasets. It is a habit.
sets NM rows to the NM value of your row with the defined length.
You can first group the dataframe by the length column before filling. Only issue is that for the first group in your example limit would be 0 which causes an error, so we can make sure it's at least 1 with max. This might cause unexpected results if there are nan values before the first non-zero value in length but from the given data it's not clear if that can happen.
# make groups
m = df.length.gt(0).cumsum()
# fill the column
df["NM"] = df.groupby(m).apply(
lambda f: f.NM.fillna(
method="ffill",
limit=max(f.length.iloc[0], 1))
).values

searching rows for a value that is 90% of the max value in that row and then outputting the index of the column for CSV files, using pandas

I am a python newbie so struggling with this problem.
I am using pandas to read in csv files with multiple rows (changes depending on csv file, up to 200,000) and columns (495).
I want to search along the rows separately to find the max value, then I want to take the value that is 90% of the max and index this to find what column number it is. I want to do this for all rows separately.
For example:
row 1 has a max value of 12,098 and is in column 300
90% of 12,098 gives a value of 10,888. it is unlikely there will be an exact match, so i want
to find the nearest match in that row and then provide me with the column number (index), which
could be column 300 for example.
I then want to repeat this for every row.
This is what I have done so far:
1.search my rows of data to find the max value,
maxValues = df.max(axis = 1)
calculate 90% of this:
newmax = maxValues / 10 * 9
then find the value closest to that newmax in the row, and then tell me what the column number where that value is - this is the part I can't do. I have tried:
arr = pulses.to_numpy()
x = newmax.values`
difference_array = np.absolute(arr-x).axis=1
index = difference_array.argmin().axis=1
provides the following error: operands could not be broadcast together with shapes (114,495)
(114,)
I can do up to number 2 above, but can't figure out 3. I have tried converting them to arrays as you can see but this only produces errors.
Let's say we have a following dataframe:
import pandas as pd
d= {'a':[0,1], 'b':[10, 20], 'c':[30, 40], 'd':[15, 30]}
df = pd.DataFrame(data=d)
To go row by row you can use apply function
Since you operate with just one row, you can find its maximum with max
To find a closest value to 0.9 of maximum you need to find the smallest abs difference between numbers
To insert values by index of row in initial dataframe use at
So a code would be like this:
percent = 0.9
def foo(row):
max_val = row.max()
max_col = row[row==max_val].index[0]
second_max_val = percent * max_val
idx = row.name
df.at[idx, 'max'] = max_col
df.at[idx, '0.9max'] = (abs(row.loc[row.index!=max_col] - second_max_val)).idxmin()
return row
df.apply(lambda row: foo(row), axis=1)
print(df)
Your error occurs because you are comparing a two dimensional array with an one dimensional one (arr - x).
Consider this sample data frame:
import pandas as pd
import numpy as np
N=5
df = pd.DataFrame({
"col1": np.random.randint(100, size=(N,)),
"col2": np.random.randint(100, size=(N,)),
"col3": np.random.randint(100, size=(N,)),
"col4": np.random.randint(100, size=(N,)),
"col5": np.random.randint(100, size=(N,))
})
col1 col2 col3 col4 col5
0 48 21 74 76 95
1 66 1 13 56 83
2 91 67 96 93 28
3 49 76 39 95 84
4 65 31 61 68 24
IIUC, you could use the following code (no iteration needed, relies only on numpy and pandas) to find the index positions of those columns that are closest to the maximum value in each row multiplied by 0.9. If two values are equally close, the first index will be returned. The code only needs about five seconds for 2.mio rows.
Code:
np.argmin(df.sub(df.max(axis=1) * 0.9, axis=0).apply(np.abs).values, axis=1)
Output:
array([3, 4, 0, 4, 2])

How to pass the whole dataframe and the index of the row being operated upon to the apply() method

How do I pass the whole dataframe and the index of the row being operated upon when using the apply() method on a dataframe?
Specifically, I have a dataframe correlation_df with the following data:
id
scores
cosine
1
100
0.8
2
75
0.7
3
50
0.4
4
25
0.05
I want to create an extra column where each row value is the correlation of scores and cosine without that row's values included.
My understanding is that I should do this with with a custom function and the apply method, i.e. correlation_df.apply(my_fuct). However, I need to pass in the whole dataframe and the index of the row in question so that I can ignore it in the correlation calculation.
NB. Problem code:
import numpy as np
import pandas as pd
score = np.array([100, 75, 50, 25])
cosine = np.array([.8, 0.7, 0.4, .05])
correlation_df = pd.DataFrame(
{
"score": score,
"cosine": cosine,
}
)
corr = correlation_df.corr().values[0, 1]
[Edit] Roundabout solution that I'm sure can be improved:
def my_fuct(row):
i = int(row["index"])
r = list(range(correlation_df.shape[0]))
r.remove(i)
subset = correlation_df.iloc[r, :].copy()
subset = subset.set_index("index")
return subset.corr().values[0, 1]
correlation_df["diff_correlations"] = = correlation_df.apply(my_fuct, axis=1)
Your problem can be simplified to:
>>> df["diff_correlations"] = df.apply(lambda x: df.drop(x.name).corr().iat[0,1], axis=1)
>>> df
score cosine diff_correlations
0 100 0.80 0.999015
1 75 0.70 0.988522
2 50 0.40 0.977951
3 25 0.05 0.960769
A more sophisticated method would be:
The whole correlation matrix isn't made every time this way.
df.apply(lambda x: (tmp_df := df.drop(x.name)).score.corr(tmp_df.cosine), axis=1)
The index can be accessed in an apply with .name or .index, depending on the axis:
>>> correlation_df.apply(lambda x: x.name, axis=1)
0 0
1 1
2 2
3 3
dtype: int64
>>> correlation_df.apply(lambda x: x.index, axis=0)
score cosine
0 0 0
1 1 1
2 2 2
3 3 3
Using
correlation_df = correlation_df.reset_index()
gives you a new column index, denoting the index of the row, namely what previously was your index. Now when using pd.apply access it via:
correlation_df.apply(lambda r: r["index"])
After you are done you could do:
correlation_df = correlation_df.set_index("index")
to get your previous format back.

Cumulative sum of a pandas column until a maximum value is met, and average adjacent rows

I'm a biology student who is fairly new to python and was hoping someone might be able to help with a problem I have yet to solve
With some subsequent code I have created a pandas dataframe that looks like the example below:
Distance. No. of values Mean rSquared
1 500 0.6
2 80 0.3
3 40 0.4
4 30 0.2
5 50 0.2
6 30 0.1
I can provide my previous code to create this dataframe, but I didn't think it was particularly relevant.
I need to sum the number of values column until I achieve a value >= 100; and then combine the data of the rows of the adjacent columns, taking the weighted average of the distance and mean r2 values, as seen in the example below
Mean Distance. No. Of values Mean rSquared
1 500 0.6
(80*2+40*3)/120 (80+40) = 120 (80*0.3+40*0.4)/120
(30*4+50*5+30*6)/110 (30+50+30) = 110 (30*0.2+50*0.2+30*0.1)/110
etc...
I know pandas has it's .cumsum function, which I might be able to implement into a for loop with an if statement that checks the upper limit and resets the sum back to 0 when it is greater than or equal to the upper limit. However, I haven't a clue how to average the adjacent columns.
Any help would be appreciated!
You can use this code snippet to solve your problem.
# First, compute some weighted values
df.loc[:, "weighted_distance"] = df["Distance"] * df["No. of values"]
df.loc[:, "weighted_mean_rSquared"] = df["Mean rSquared"] * df["No. of values"]
min_threshold = 100
indexes = []
temp_sum = 0
# placeholder for final result
final_df = pd.DataFrame()
columns = ["Distance", "No. of values", "Mean rSquared"]
# reseting index to make the 'df' usable in following output
df = df.reset_index(drop=True)
# main loop to check and compute the desired output
for index, _ in df.iterrows():
temp_sum += df.iloc[index]["No. of values"]
indexes.append(index)
# if the sum exceeds 'min_threshold' then do some computation
if temp_sum >= min_threshold:
temp_distance = df.iloc[indexes]["weighted_distance"].sum() / temp_sum
temp_mean_rSquared = df.iloc[indexes]["weighted_mean_rSquared"].sum() / temp_sum
# create temporary dataframe and concatenate with the 'final_df'
temp_df = pd.DataFrame([[temp_distance, temp_sum, temp_mean_rSquared]], columns=columns)
final_df = pd.concat([final_df, temp_df])
# reset the variables
temp_sum = 0
indexes = []
Numpy has a function numpy.frompyfunc You can use that to get the cumulative value based on a threshold.
Here's how to implement it. With that, you can then figure out the index when the value goes over the threshold. Use that to calculate the Mean Distance and Mean rSquared for the values in your original dataframe.
I also leveraged #sujanay's idea of calculating the weighted values first.
c = ['Distance','No. of values','Mean rSquared']
d = [[1,500,0.6], [2,80,0.3], [3,40,0.4],
[4,30,0.2], [5,50,0.2], [6,30,0.1]]
import pandas as pd
import numpy as np
df = pd.DataFrame(d,columns=c)
#calculate the weighted distance and weighted mean squares first
df.loc[:, "w_distance"] = df["Distance"] * df["No. of values"]
df.loc[:, "w_mean_rSqrd"] = df["Mean rSquared"] * df["No. of values"]
#use numpy.frompyfunc to setup the threshold condition
sumvals = np.frompyfunc(lambda a,b: a+b if a <= 100 else b,2,1)
#assign value to cumvals based on threshold
df['cumvals'] = sumvals.accumulate(df['No. of values'], dtype=np.object)
#find out all records that have >= 100 as cumulative values
idx = df.index[df['cumvals'] >= 100].tolist()
#if last row not in idx, then add it to the list
if (len(df)-1) not in idx: idx += [len(df)-1]
#iterate thru the idx for each set and calculate Mean Distance and Mean rSquared
i = 0
for j in idx:
df.loc[j,'Mean Distance'] = (df.iloc[i:j+1]["w_distance"].sum() / df.loc[j,'cumvals']).round(2)
df.loc[j,'New Mean rSquared'] = (df.iloc[i:j+1]["w_mean_rSqrd"].sum() / df.loc[j,'cumvals']).round(2)
i = j+1
print (df)
The output of this will be:
Distance No. of values ... Mean Distance New Mean rSquared
0 1 500 ... 1.00 0.60
1 2 80 ... NaN NaN
2 3 40 ... 2.33 0.33
3 4 30 ... NaN NaN
4 5 50 ... NaN NaN
5 6 30 ... 5.00 0.17
If you want to extract only the records that are non NaN, you can do:
final_df = df[df['Mean Distance'].notnull()]
This will result in:
Distance No. of values ... Mean Distance New Mean rSquared
0 1 500 ... 1.00 0.60
2 3 40 ... 2.33 0.33
5 6 30 ... 5.00 0.17
I looked up BEN_YO's implementation of numpy.frompyfunc. The original SO post can be found here. Restart cumsum and get index if cumsum more than value
If you figure out the grouping first, pandas groupby-functionality will do a lot of the remaining work for you. A loop is appropriate to get the grouping (unless somebody has a clever one-liner):
>>> groups = []
>>> group = 0
>>> cumsum = 0
>>> for n in df["No. of values"]:
... if cumsum >= 100:
... cumsum = 0
... group = group + 1
... cumsum = cumsum + n
... groups.append(group)
>>>
>>> groups
[0, 1, 1, 2, 2, 2]
Before doing the grouped operations you need to use the No. of values information to get the weighting in:
df[["Distance.", "Mean rSquared"]] = df[["Distance.", "Mean rSquared"]].multiply(df["No. of values"], axis=0)
Now get the sums like this:
>>> sums = df.groupby(groups)["No. of values"].sum()
>>> sums
0 500
1 120
2 110
Name: No. of values, dtype: int64
And finally the weighted group averages like this:
>>> df[["Distance.", "Mean rSquared"]].groupby(groups).sum().div(sums, axis=0)
Distance. Mean rSquared
0 1.000000 0.600000
1 2.333333 0.333333
2 5.000000 0.172727

Categories

Resources