pandas sample based on criteria - python

I would like to use the pandas sample function with a criterion, without grouping or filtering the data.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(low=0, high=5, size=(10000, 2)),columns=['a', 'b'])
print(df.sample(n=100))
This will sample 100 rows, but what if I want to sample 50 rows containing 0 and 50 rows containing 1 in df['a']?

You can use the == operator to make a list* of boolean values. When that list is passed to the indexing operator ([]), it keeps only the rows where the value is True. You can then call sample(n=50) on the result to draw 50 of those rows.
New code
df[df['a']==1].sample(n=50)
Full code
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(low=0, high=5, size=(10000, 2)),columns=['a', 'b'])
print(df[df['a']==1].sample(n=50))
*It isn't literally a list in this context, but the word is useful for explaining how it works. Technically it is a boolean Series that maps each row to a True/False value.
More obscure DataFrame sampling
If you want to sample 50 rows in total from those where a is 1 or 0:
print(df[(df['a']==1) | (df['a']==0)].sample(n=50))
And if you want to sample 50 of each:
df1 = df[df['a']==1].sample(n=50)
df0 = df[df['a']==0].sample(n=50)
print(pd.concat([df1,df0]))
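The "50 of each" case above can also be written as a single expression. This is only a sketch under some assumptions: it relies on DataFrameGroupBy.sample, which is available in pandas 1.1 and later, it does group the data internally, and the names wanted and balanced as well as the fixed seeds are illustrative choices that are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Same shape of data as in the question, but seeded so the run is repeatable.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(low=0, high=5, size=(10000, 2)), columns=["a", "b"])

# Keep only rows whose 'a' value is one of the wanted categories,
# then draw 50 rows from each category in one call.
wanted = [0, 1]
balanced = df[df["a"].isin(wanted)].groupby("a").sample(n=50, random_state=0)

print(balanced["a"].value_counts())
```

Note that sample(n=50) raises an error if a category has fewer than 50 rows, so this only works when every wanted value is common enough.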

Related

Mean of every 15 rows of a dataframe in python

I have a dataframe of shape 1500x11. I need to take each consecutive block of 15 rows and compute the mean of each of the 11 columns separately, so my final dataframe should have dimension 100x11. How can I do this in Python?
The following should work:
rows = []
for i in range(100):
    df2 = df.iloc[i*15:i*15+15, :]
    rows.append(df2.mean())  # Series holding the 11 column means of this block
dfnew = pd.DataFrame(rows)  # DataFrame.append was removed in pandas 2.0
print(dfnew)
I don't know much about pandas, so I've coded this solution in pure NumPy. It uses no Python loops, so it should be efficient. The result is converted back to a pandas DataFrame:
import pandas as pd, numpy as np
df = pd.DataFrame([[i + j for j in range(11)] for i in range(1500)])
a = df.values
a = a.reshape((a.shape[0] // 15, 15, a.shape[1]))
a = np.mean(a, axis = 1)
df = pd.DataFrame(a)
print(df)
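For comparison, the same block means can be computed without leaving pandas by grouping on an integer block label. This is a sketch and not one of the original answers; block_id and block_means are illustrative names.

```python
import numpy as np
import pandas as pd

# Same sample data as in the NumPy answer: 1500 rows, 11 columns.
df = pd.DataFrame([[i + j for j in range(11)] for i in range(1500)])

# Rows 0-14 get label 0, rows 15-29 get label 1, and so on.
block_id = np.arange(len(df)) // 15

# One row of column means per block of 15 rows.
block_means = df.groupby(block_id).mean()
print(block_means.shape)
```

Unlike the reshape approach, this also works when the number of rows is not a multiple of 15; the last group is then simply smaller.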
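You can do this with a plain loop over the columns of the pandas DataFrame.
For each column, compute the mean of every consecutive block of 15 entries and collect the results:
means = {}
for column, values in df.items():
    # mean of each consecutive block of 15 entries in this column
    means[column] = [values.iloc[i:i + 15].mean() for i in range(0, len(values), 15)]
Then pd.DataFrame() creates the new dataframe from the collected means:
new_df = pd.DataFrame(means)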
I'd recommend you to read the documentation.
https://pandas.pydata.org/pandas-docs/stable/reference/frame.html

Randomly select rows from DataFrame Pandas

Okay this is somewhat tricky. I have a DataFrame of people and I want to randomly select 27% of them. I want to create a new Boolean column in that DataFrame that shows if that person was randomly selected.
Anyone have any idea how to do this?
The built-in sample function accepts a frac argument that sets the fraction of rows to draw.
If your DataFrame of people is people_df:
percent_sampled = 27
sample_df = people_df.sample(frac = percent_sampled/100)
people_df['is_selected'] = people_df.index.isin(sample_df.index)
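Alternatively, shuffle the row positions with NumPy and keep the first 27%:
import numpy as np
n = len(df)
idx = np.random.permutation(n)  # random.shuffle() works in place and returns None
selected_idx = idx[:int(0.27 * n)]
selected_df = df[df.index.isin(selected_idx)]  # assumes the default 0..n-1 index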
Defining a dataframe with 100 random numbers in column 0:
import random
import pandas as pd
import numpy as np
a = pd.DataFrame(np.random.permutation(100))
Using random.sample to choose 27 random numbers from the list, without repetition (replace 27 with int(0.27*len(a[0])) if you want to define this as a percentage):
choices = random.sample(list(a[0]),27)
Using np.where to assign boolean values to new column in dataframe:
a['Bool'] = np.where(a[0].isin(choices),True,False)
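The approaches above can be wrapped into a small helper that works for any index, not just the default 0..n-1 one. This is a minimal sketch: the function name mark_random_fraction, its parameters, and the sample people_df are all hypothetical and introduced only for illustration.

```python
import numpy as np
import pandas as pd

def mark_random_fraction(frame, frac, column="is_selected", seed=None):
    """Return a copy of frame with a boolean column marking a random fraction of rows."""
    rng = np.random.default_rng(seed)
    n_selected = int(round(frac * len(frame)))
    # Build the mask by position so that the index labels do not matter.
    mask = np.zeros(len(frame), dtype=bool)
    mask[rng.choice(len(frame), size=n_selected, replace=False)] = True
    out = frame.copy()
    out[column] = mask
    return out

# Hypothetical data with a non-default index.
people_df = pd.DataFrame(
    {"name": [f"person_{i}" for i in range(200)]},
    index=range(1000, 1200),
)
marked = mark_random_fraction(people_df, 0.27, seed=0)
print(marked["is_selected"].sum())
```

Passing a seed makes the selection repeatable, which can be handy when the flag has to be regenerated later.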

Remove values above/below standard deviation

I have a database that is made out of 18 columns and 15 million rows, in each column there are outliers and I wanted to remove values above and below 2 standard deviations. My code doesn't seem to edit anything in the database though.
Thank you.
import pandas as pd
import random as r
import numpy as np
df = pd.read_csv('D:\\Project\\database\\3-Last\\LastCombineHalf.csv')
df[df.apply(lambda x :(x-x.mean()).abs()<(2*x.std()) ).all(1)]
df.to_csv('D:\\Project\\database\\3-Last\\Removal.csv', index=False)
Perhaps because you didn't assign the results back to df?
From:
df[df.apply(lambda x :(x-x.mean()).abs()<(2*x.std()) ).all(1)]
To:
df = df[df.apply(lambda x :(x-x.mean()).abs()<(2*x.std()) ).all(1)]
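The effect of assigning the result can be checked on a small synthetic frame. This is a sketch with made-up data; the column names, the planted outlier, and the vectorised form of the mask (equivalent to the apply/lambda version for numeric columns) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["x", "y", "z"])
df.loc[0, "x"] = 100.0  # plant an obvious outlier in row 0

# True for rows where every column lies within 2 standard deviations of its mean.
within_2_std = ((df - df.mean()).abs() < 2 * df.std()).all(axis=1)

# Indexing returns a new DataFrame; df itself is unchanged until you assign.
filtered = df[within_2_std]
print(len(df), len(filtered))
```

Here df still has all of its rows after the indexing expression, which is why writing df to CSV without the assignment produces an unfiltered file.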

Appending new column to dask dataframe

This is a follow up question to Shuffling data in dask.
I have an existing dask dataframe df where I wish to do the following:
df['rand_index'] = np.random.permutation(len(df))
However, this gives the error "Column assignment doesn't support type ndarray". I tried df.assign(rand_index=np.random.permutation(len(df))), which gives the same error.
Here is a minimal (not) working sample:
import pandas as pd
import dask.dataframe as dd
import numpy as np
df = dd.from_pandas(pd.DataFrame({'A':[1,2,3]*10, 'B':[3,2,1]*10}), npartitions=10)
df['rand_index'] = np.random.permutation(len(df))
Note:
The previous question mentioned using df = df.map_partitions(add_random_column_to_pandas_dataframe, ...) but I'm not sure if that is relevant to this particular case.
Edit 1
I attempted
df['rand_index'] = dd.from_array(np.random.permutation(len(df))), which executed without an issue. When I inspected df.head(), the new column appeared to have been created correctly. However, in df.tail() the rand_index column is all NaN.
In fact just to confirm I checked df.rand_index.max().compute() which turned out to be smaller than len(df)-1. So this is probably where df.map_partitions comes into play as I suspect this is an issue with dask being partitioned. In my particular case I have 80 partitions (not referring to the sample case).
You would need to turn np.random.permutation(len(df)) into a type that dask understands:
permutations = dd.from_array(np.random.permutation(len(df)))
df['rand_index'] = permutations
df
This would yield:
Dask DataFrame Structure:
                    A      B rand_index
npartitions=10
0               int64  int64      int32
3                 ...    ...        ...
...               ...    ...        ...
27                ...    ...        ...
29                ...    ...        ...
Dask Name: assign, 61 tasks
The result is lazy, so it is up to you to call .compute() when you want the actual values.
To assign a column you should use df.assign
Got the same problem as in Edit 1.
My work around is to get a unique column from the existing dataframe and feed into the dataframe that is to be appended.
import dask.dataframe as dd
import dask.array as da
import numpy as np
import pandas as pd
df = dd.from_pandas(pd.DataFrame({'A':[1,2,3]*2, 'B':[3,2,1]*2, 'idx':[0,1,2,3,4,5]}), npartitions=10)
chunks = tuple(df.map_partitions(len).compute())
size = sum(chunks)
permutations = da.from_array(np.random.permutation(len(df)), chunks=chunks)
idx = da.from_array(df['idx'].compute(), chunks=chunks)
ddf = dd.concat([dd.from_dask_array(c) for c in [idx,permutations]], axis = 1)
ddf.columns = ['idx','rand_idx']
df = df.merge(ddf, on='idx')
df = df.set_index('rand_idx')
df.compute().head()

In Python, given that there is a matrix, how do I print a slice of it?

Let me first outline the overall context of the problem at hand, through the following code snippet.
import pandas as pd
df = pd.read_csv("abc.csv")
df.as_matrix
The desired matrix [100 rows x 785 columns] is output.
I am having difficulty printing a single row of the above matrix with print().
I tried the following, but in vain:
print(df[0])
print(df[:, 0])
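as_matrix() returns a two-dimensional NumPy array, so you can index it by row. (as_matrix() was deprecated in pandas 0.23 and removed in 1.0; on current versions use df.to_numpy() instead.) The following code should work:
import pandas as pd
df = pd.read_csv("abc.csv")
matrix = df.as_matrix()
print(matrix[0])  # output the first row
print(matrix[3:5])  # output rows 3 and 4 (the end index is exclusive)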
I think you are missing the parentheses:
df.as_matrix()[0]
Or you can use:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html
.head(n=5) Returns first n rows
df.head(1).as_matrix()
To get a specific row, where A is the row's integer position:
df.iloc[A, :785]
In general, you can use df.iloc to slice a pandas dataframe by integer position. For example, to keep the first 100 rows and the first 785 columns you can do the following:
import pandas as pd
df = pd.read_csv("abc.csv")
df = df.iloc[:100, :785]
df.as_matrix()
If you want to slice the first row after converting to a matrix, you are working with a two-dimensional NumPy array, so you can do that as follows:
print(df.as_matrix()[0,:])
Here is a working example:
from io import StringIO
import pandas as pd
st = """
col1|col2|col3
1|2|3
4|5|6
7|8|9
"""
pd.read_csv(StringIO(st), sep="|")
df = pd.read_csv(StringIO(st), sep="|")
print("print first row from a matrix")
print(df.as_matrix()[0,:])
print("print one column")
print(df.iloc[:2,1])
print("print a slice")
print(df.iloc[:2,:])
print("print one row")
print(df.iloc[1,:])
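On pandas 1.0 and later, where as_matrix() no longer exists, the same slicing can be sketched with to_numpy(). The inline CSV text and the variable names below are illustrative and mirror the working example above.

```python
from io import StringIO

import pandas as pd

csv_text = "col1|col2|col3\n1|2|3\n4|5|6\n7|8|9\n"
df = pd.read_csv(StringIO(csv_text), sep="|")

# to_numpy() returns a 2-D NumPy array, the modern replacement for as_matrix().
matrix = df.to_numpy()

first_row = matrix[0]        # a single row
row_slice = matrix[1:3]      # rows 1 and 2 (the end index is exclusive)
first_column = matrix[:, 0]  # a single column

print(first_row)
print(row_slice)
print(first_column)
```

The original attempt df[:, 0] fails because indexing a DataFrame with [] selects columns by label; positional slicing needs either df.iloc or the NumPy array.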
