Randomly select rows from DataFrame Pandas - python

Okay this is somewhat tricky. I have a DataFrame of people and I want to randomly select 27% of them. I want to create a new Boolean column in that DataFrame that shows if that person was randomly selected.
Anyone have any idea how to do this?

The built-in sample method provides a frac argument that gives the fraction of rows to include in the sample.
If your DataFrame of people is people_df:
percent_sampled = 27
sample_df = people_df.sample(frac = percent_sampled/100)
people_df['is_selected'] = people_df.index.isin(sample_df.index)
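To sanity-check the result (a quick usage sketch, assuming people_df is the DataFrame above):
# roughly 27% of rows should be flagged True; sample rounds to a whole number of rows
print(people_df['is_selected'].value_counts(normalize=True))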

import numpy as np

n = len(df)
idx = np.arange(n)
np.random.shuffle(idx)                 # shuffles in place (random.shuffle returns None)
selected_idx = idx[:int(0.27 * n)]
selected_df = df[df.index.isin(selected_idx)]
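If you also want the Boolean column the question asks for (a small follow-up, assuming df keeps the default RangeIndex so labels and positions coincide):
df['is_selected'] = df.index.isin(selected_idx)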

Define a DataFrame whose column 0 holds the numbers 0-99 in random order:
import random
import pandas as pd
import numpy as np

vals = list(range(100))
random.shuffle(vals)              # shuffles the list in place
a = pd.DataFrame(vals)
Use random.sample to choose 27 random numbers from the column, without repetition (replace 27 with int(0.27 * len(a[0])) if you want to express it as a percentage):
choices = random.sample(list(a[0]),27)
Use np.where to assign boolean values to a new column in the dataframe:
a['Bool'] = np.where(a[0].isin(choices), True, False)
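If you want the 27% version rather than a fixed 27, a small sketch of computing the count first (the name k is just for illustration):
k = int(0.27 * len(a[0]))              # number of rows to pick
choices = random.sample(list(a[0]), k)
a['Bool'] = a[0].isin(choices)         # isin already returns booleans, so np.where is optional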

Related

Mean of every 15 rows of a dataframe in python

I have a dataframe of shape (1500, 11). I have to take each block of 15 rows and compute the mean of each of the 11 columns separately, so my final dataframe should have dimensions 100 x 11. How do I do this in Python?
The following should work:
# DataFrame.append was removed in pandas 2.0, so collect the block means in a list
chunks = []
for i in range(100):
    df2 = df.iloc[i*15:i*15+15, :]
    chunks.append(df2.mean())        # column-wise means of one block of 15 rows
dfnew = pd.DataFrame(chunks).reset_index(drop=True)
print(dfnew)
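A more compact alternative to the loop above (a sketch using integer-division group labels; groupby also copes if the row count is not an exact multiple of 15):
import numpy as np
dfnew = df.groupby(np.arange(len(df)) // 15).mean()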
I don't know much about pandas, so I've written my solution in pure NumPy. It uses no Python loops, so it is very efficient, and the result is converted back to a pandas DataFrame at the end. Note that the reshape requires the number of rows to be an exact multiple of 15:
import pandas as pd, numpy as np
df = pd.DataFrame([[i + j for j in range(11)] for i in range(1500)])
a = df.values
a = a.reshape((a.shape[0] // 15, 15, a.shape[1]))   # shape (100, 15, 11)
a = np.mean(a, axis=1)                              # average each block of 15 rows
df = pd.DataFrame(a)
print(df)
You can use pandas.DataFrame.
Use a for loop to compute the means, with a counter that is reset every 15 entries (a runnable version of this idea is sketched after the documentation link below):
columns = [col1, col2, ..., col12]
for column, values in df.items():
    # compute the mean
    # every 15 entries, save it
Also, you can create the new dataframe with pd.DataFrame().
I'd recommend reading the documentation:
https://pandas.pydata.org/pandas-docs/stable/reference/frame.html
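A runnable sketch of that counter-based idea (assuming df is the 1500 x 11 frame from the question; the running sum is reset every 15 rows):
import pandas as pd

block_means = []
running = None
for counter, (_, row) in enumerate(df.iterrows(), start=1):
    running = row if running is None else running + row
    if counter % 15 == 0:                 # every 15 entries, save the mean and reset
        block_means.append(running / 15)
        running = None
dfnew = pd.DataFrame(block_means).reset_index(drop=True)
print(dfnew)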

pandas DataFrame: replace values in multiple columns with the value from another

I've got a pandas DataFrame where I want to replace certain values in a selection of columns with the value from another in the same row.
I did the following:
df[cols[23:30]] = df[cols[23:30]].apply(lambda x: x.replace(99, df['col1']))
df[cols[30:36]] = df[cols[30:36]].apply(lambda x: x.replace(99, df['col2']))
cols is a list with column names.
99 is a missing-value marker which I want to replace with the (already calculated) mean for the given class (i.e., col1 or col2, depending on the selection).
It works, but replacing all those values takes longer than seems necessary. I figure there must be a computationally quicker way of achieving the same thing.
Any suggestions?
You can try:
import numpy as np
df[cols[23:30]] = np.where(df[cols[23:30]] == 99, df[['col1'] * (30-23)], df[cols[23:30]])
df[cols[30:36]] = np.where(df[cols[30:36]] == 99, df[['col2'] * (36-30)], df[cols[30:36]])
df[["col1"] * n] will create dataframe with exactly same column repeated n times, so numpy could use it as a mask for n columns you want to iterate through if 99 is encountered, otherwise taking respective value, which is already there.

Get all previous values for every row

I'm about to write a backtesting tool, so for every row I'd like to have access to all of the dataframe up to that row. In the following example I do it from a fixed index using a loop. I'm wondering if there is any better solution.
import numpy as np
import pandas as pd
N = 10   # for example
df = pd.DataFrame({"a": np.arange(N)})
for i in range(3, N):
    print(df["a"][:i].values)
UPDATE (toy example)
I need to apply a custom function to all the previous values. Here, as a toy example, I will use the sum of the squares of all previous values.
def toyFun(v):
    return np.sum(v**2)

res = np.empty(N)
res[:] = np.nan
for i in range(3, N):
    res[i] = toyFun(df["a"][:i].values)
df["res"] = res
If you are indexing rows for a particular column, say 'a', you can use the .iloc indexer (i for integer, loc for location) to slice by position:
df = pd.DataFrame({'a': [1,2,3,4]})
print(df.a.iloc[:2]) # get first two values
So, you can do:
for i in range(3, 10):
    print(df.a.iloc[:i])
The best way is to compute the running results once with a cumulative sum, so you are not recalculating everything for every row:
df["a"].apply(lambda x: x**2).cumsum()
Then re-index as you wish:
res[3:] = df["a"].apply(lambda x: x**2).cumsum()[2:N-1].values
or assign it directly to the dataframe.
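For custom functions that cannot be written as a cumulative sum, pandas' expanding window is another option (a sketch, not from the answers above; the shift(1) makes row i see only the values before it, matching df["a"][:i], though it fills every row rather than starting at index 3):
df["res"] = df["a"].expanding().apply(lambda v: np.sum(v**2)).shift(1)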

pandas sample based on criteria

I would like to use the pandas sample function but with a criterion, without grouping or filtering the data.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(low=0, high=5, size=(10000, 2)),columns=['a', 'b'])
print(df.sample(n=100))
This samples 100 rows, but what if I want to sample 50 rows where df['a'] is 0 and 50 rows where df['a'] is 1?
You can use the == operator to make a list* of boolean values, and when that list is put into the indexer ([]) it filters the rows. Then use n=50 to draw a sample of 50 rows.
New code
df[df['a']==1].sample(n=50)
Full code
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(low=0, high=5, size=(10000, 2)),columns=['a', 'b'])
print(df[df['a']==1].sample(n=50))
*"List" isn't literally a list in this context, but it is a useful word for explaining how it works. Technically it's a boolean Series that maps each row to True/False.
More obscure DataFrame sampling
If you want to sample 50 rows where a is 1 or 0:
print(df[(df['a']==1) | (df['a']==0)].sample(n=50))
And if you want to sample 50 of each:
df1 = df[df['a']==1].sample(n=50)
df0 = df[df['a']==0].sample(n=50)
print(pd.concat([df1,df0]))
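A more compact variant (a sketch assuming pandas >= 1.1, where GroupBy.sample is available), if you want 50 rows for each value of 'a' in {0, 1}:
sample = df[df['a'].isin([0, 1])].groupby('a').sample(n=50)
print(sample['a'].value_counts())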

Reading values from Pandas dataframe rows into equations and entering result back into dataframe

I have a dataframe. For each row of the dataframe: I need to read values from two column indexes, pass these values to a set of equations, enter the result of each equation into its own column index in the same row, go to the next row and repeat.
After reading the responses to similar questions I tried:
import pandas as pd
DF = pd.read_csv("...")
Equation_1 = f(x, y)
Equation_2 = g(x, y)
for index, row in DF.iterrows():
    a = DF[m]
    b = DF[n]
    DF[p] = Equation_1(a, b)
    DF[q] = Equation_2(a, b)
Rather than iterating over DF and entering new values for each row, this code iterates over DF and enters the same values for every row. I am not sure what I am doing wrong here.
Also, from what I have read, it is actually faster to treat the DF as a NumPy array and perform the calculation over the entire array at once rather than iterating. I'm not sure how I would go about this.
Thanks.
Turns out that this is extremely easy. All that must be done is to define two variables, assign the desired columns to them, and then set the result column equal to the equation written in terms of those variables.
Pandas already knows that it must apply the equation to every row and put each result at its proper index. I didn't realize it would be this easy and was looking for more explicit code.
e.g.,
import pandas as pd
df = pd.read_csv("...") # df is a large 2D array
A = df[0]
B = df[1]
def f(A, B):
    return ....          # the equation goes here
df[3] = f(A, B)
# If your equations are simple enough, do the operations column-wise in pandas:
import pandas as pd
test = pd.DataFrame([[1, 2], [3, 4], [5, 6]])
test                 # default column names are 0, 1
test[0]              # this is column 0
test.iloc[:, 0]      # column 0 by position, returned as a Series (icol() was removed)
test.columns = ['S', 'Q']   # column names are easier to use
test                 # column names! Use them column-wise:
test['result'] = test.S**2 + test.Q
test                 # results stored in the DataFrame
# For more complicated stuff, try apply, as in "Python pandas apply on more columns":
def toyfun(row):
    return row['S'] - row['Q']**2
test['out2'] = test[['S', 'Q']].apply(toyfun, axis=1)
# You can also define the column names when you create the DataFrame:
test2 = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=list('AB'))
