Given a dataframe with numerical values in a specific column, I want to randomly remove a certain percentage of the rows for which the value in that column lies within a certain range.
For example given the following dataframe:
df = pd.DataFrame({'col1': [1,2,3,4,5,6,7,8,9,10]})
df
col1
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
2/5 of the rows where col1 is below 6 should be removed randomly.
What's the most concise way to do that?
Use sample + drop:
df.drop(df.query('col1 < 6').sample(frac=.4).index)
col1
1 2
3 4
4 5
5 6
6 7
7 8
8 9
9 10
For a range
df.drop(df.query('2 < col1 < 8').sample(frac=.4).index)
col1
0 1
1 2
3 4
4 5
5 6
7 8
8 9
9 10
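If the removed rows need to be reproducible, sample also accepts a seed; a minimal variant of the same call (the random_state value here is arbitrary):
df.drop(df.query('col1 < 6').sample(frac=.4, random_state=1).index)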
Related
How to split a single column containing 1000 rows into two columns of 500 rows each in pandas.
I have a csv file that contains a single column and I need to split this into multiple columns. Below is the format in csv.
Steps I took:
I had multiple csv files, each containing one column with 364 rows. I converted them into dataframes and concatenated them, but that simply stacks the files one after another into a single long column.
Code I tried
monthly_list = []
for file in ['D0_monthly.csv','c1_monthly.csv','c2_monthly.csv','c2i_monthly.csv','c3i_monthly.csv','c4i_monthly.csv','D1_monthly.csv','D2i_monthly.csv','D3i_monthly.csv','D4i_monthly.csv',
             'D2j_monthly.csv','D3j_monthly.csv','D4j_monthly.csv','c2j_monthly.csv','c3j_monthly.csv','c4j_monthly.csv']:
    monthly_file = pd.read_csv(file, header=None, index_col=None, skiprows=[0])
    monthly_list.append(monthly_file)
monthly_all_file = pd.concat(monthly_list)
How the data is:
column1
1
2
3
.
.
364
1
2
3
.
.
364
I need to split the above column in the format shown below.
What the data should be:
column1  column2
1        1
2        2
3        3
4        4
5        5
.        .
.        .
.        .
364      364
Answer updated to work for an arbitrary number of columns
You can start from either the number of columns or the target row length; given the initial column length, each can be calculated from the other. In this answer I use the desired target column length, tgt_row_len.
nb_groups = 4
tgt_row_len = 5
df = pd.DataFrame({'column1': np.arange(1,tgt_row_len*nb_groups+1)})
print(df)
column1
0 1
1 2
2 3
3 4
4 5
5 6
6 7
...
17 18
18 19
19 20
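If you start from an existing column instead of building one, the number of groups follows from its length (assuming the length is an exact multiple of tgt_row_len):
nb_groups = len(df) // tgt_row_len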
Create groups in the index for the grouping operation that follows:
df.index = df.reset_index(drop=True).index // tgt_row_len
column1
0 1
0 2
0 3
0 4
0 5
1 6
1 7
...
3 17
3 18
3 19
3 20
dfn = (
df.groupby(level=0).apply(lambda x: x['column1'].reset_index(drop=True)).T
.rename(columns = lambda x: 'col' + str(x+1)).rename_axis(None)
)
print(dfn)
col1 col2 col3 col4
0 1 6 11 16
1 2 7 12 17
2 3 8 13 18
3 4 9 14 19
4 5 10 15 20
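For comparison, a reshape-based sketch that builds the same wide frame directly with numpy, assuming the column length is an exact multiple of tgt_row_len (the names values and wide are just for illustration):
import numpy as np
import pandas as pd

values = df['column1'].to_numpy()
nb_groups = len(values) // tgt_row_len
wide = pd.DataFrame(
    values.reshape(nb_groups, tgt_row_len).T,   # each chunk of tgt_row_len rows becomes a column
    columns=['col' + str(i + 1) for i in range(nb_groups)],
)
print(wide)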
Previous answer that handles creating two columns
This answer just shows 10 target rows as an example. That can easily be changed to 364 or 500.
A dataframe where column1 contains 2 sets of 10 rows
tgt_row_len = 10
df = pd.DataFrame({'column1': np.tile(np.arange(1,tgt_row_len+1),2)})
print(df)
column1
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
10 1
11 2
12 3
13 4
14 5
15 6
16 7
17 8
18 9
19 10
Move the bottom set of rows to column2
df.assign(column2=df['column1'].shift(-tgt_row_len)).iloc[:tgt_row_len].astype(int)
column1 column2
0 1 1
1 2 2
2 3 3
3 4 4
4 5 5
5 6 6
6 7 7
7 8 8
8 9 9
9 10 10
I don't know if anyone has a more efficient solution, but using pd.merge on a temp column should solve your issue. Here is a quick implementation of what you could write.
csv1['temp'] = 1
csv2['temp'] = 1
new_df = pd.merge(csv1, csv2, on=["temp"])
new_df = new_df.drop("temp", axis=1)
I hope this helps!
I want to fill numbers in column flag, based on the value in column KEY.
Instead of using cumcount() to fill incremental numbers, I want to fill the same number for every two rows as long as the value in column KEY stays the same.
If the value in column KEY changes, the filled number changes as well.
Here is the example, df1 is what I want from df0.
df0 = pd.DataFrame({'KEY':['0','0','0','0','1','1','1','2','2','2','2','2','3','3','3','3','3','3','4','5','6']})
df1 = pd.DataFrame({'KEY':['0','0','0','0','1','1','1','2','2','2','2','2','3','3','3','3','3','3','4','5','6'],
'flag':['0','0','1','1','2','2','3','4','4','5','5','6','7','7','8','8','9','9','10','11','12']})
Get the cumcount within each KEY and add one, use % 2 to mark the odd positions within each group (the start of each pair), then take the cumulative sum and subtract 1 to start counting from zero.
You can use:
df0['flag'] = ((df0.groupby('KEY').cumcount() + 1) % 2).cumsum() - 1
df0
Out[1]:
KEY flag
0 0 0
1 0 0
2 0 1
3 0 1
4 1 2
5 1 2
6 1 3
7 2 4
8 2 4
9 2 5
10 2 5
11 2 6
12 3 7
13 3 7
14 3 8
15 3 8
16 3 9
17 3 9
18 4 10
19 5 11
20 6 12
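To see how the expression works, the intermediate steps can be spelled out as throwaway columns (tmp and the helper column names are only for illustration):
tmp = df0.copy()
tmp['cc'] = tmp.groupby('KEY').cumcount() + 1   # 1, 2, 3, ... within each KEY
tmp['odd'] = tmp['cc'] % 2                      # 1, 0, 1, 0, ... marks the start of each pair
tmp['flag'] = tmp['odd'].cumsum() - 1           # running count of pair starts, zero-based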
I need to calculate a column based on other rows. Basically, I want my new_column to be the sum of base_column for all rows with the same id.
I currently do the following (but it is not really efficient); what is the most efficient way to achieve that?
def calculate(x):
    # in fact my filter is more complex: basically same id and date in the last 4 weeks
    filtered_df = df[df["id"] == df.at[x.name, "id"]]
    df.at[x.name, "new_column"] = filtered_df["base_column"].sum()

df.apply(calculate, axis=1)
You can do as below:
df['new_column']= df.groupby('id')['base_column'].transform('sum')
input
id base_column
0 1 2
1 1 4
2 2 5
3 3 6
4 5 7
5 7 4
6 7 5
7 7 3
output
id base_column new_column
0 1 2 6
1 1 4 6
2 2 5 5
3 3 6 6
4 5 7 7
5 7 4 12
6 7 5 12
7 7 3 12
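The question notes that the real filter is "same id and date in the last 4 weeks". That per-id time window can also be handled without apply, using a grouped time-based rolling sum. A rough sketch, assuming a datetime column named date (the column name and sample data below are assumptions):
import pandas as pd

df = pd.DataFrame({
    'id':   [1, 1, 1, 2, 2],
    'date': pd.to_datetime(['2024-01-01', '2024-01-10', '2024-02-20',
                            '2024-01-05', '2024-01-25']),
    'base_column': [2, 4, 5, 6, 7],
})

df = df.sort_values(['id', 'date'])
# 28-day rolling sum of base_column within each id; after the sort,
# the grouped result comes back in the same row order, so the values
# can be assigned directly
df['new_column'] = (
    df.set_index('date')
      .groupby('id')['base_column']
      .rolling('28D')
      .sum()
      .to_numpy()
)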
Another way to do this is to use groupby and merge
import pandas as pd
df = pd.DataFrame({'id':[1,1,2],'base_column':[2,4,5]})
# compute sum by id
sum_base = df.groupby("id").agg({"base_column": 'sum'}).reset_index().rename(columns={'base_column': 'new_column'})
# join the result to df
df = pd.merge(df,sum_base,how='left',on='id')
# id base_column new_column
#0 1 2 6
#1 1 4 6
#2 2 5 5
I have a dataframe consisting of two id columns and one column with numerical values. I want to group by the first id column and keep all the rows corresponding to the smallest value in the second column within each group, so that I keep multiple rows if needed.
This is my pandas dataframe
id1 id2 num1
1 1 9
1 1 4
1 2 4
1 2 3
1 3 7
2 6 9
2 6 1
2 6 5
2 9 3
2 9 7
3 2 8
3 4 2
3 4 7
3 4 9
3 4 10
What I want to have is:
id1 id2 num1
1 1 9
1 1 4
2 6 9
2 6 1
2 6 5
3 2 8
I have tried keeping the min value, finding idxmin(), and removing duplicates, but these all end up with only one row per id1 and id2.
firstS.groupby('id1')['id2'].transform(min)
Many thanks in advance!
You are close; you only need to compare the id2 column with the transformed Series and filter by boolean indexing:
df = firstS[firstS['id2'] == firstS.groupby('id1')['id2'].transform(min)]
print (df)
id1 id2 num1
0 1 1 9
1 1 1 4
5 2 6 9
6 2 6 1
7 2 6 5
10 3 2 8
Simplest way: merge back against the per-group minimum of id2 (the default inner join on the shared id1 and id2 columns keeps only the matching rows):
df = df.merge(df.groupby("id1").id2.min().reset_index())
I am trying to create a new column named 'Cumulative Frequency' in a dataframe, where each row holds the sum of all previous frequencies plus the frequency of the current row, as shown here.
What is the way to do this?
You want cumsum:
df['Cumulative Frequency'] = df['Frequency'].cumsum()
Example:
In [23]:
df = pd.DataFrame({'Frequency':np.arange(10)})
df
Out[23]:
Frequency
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
In [24]:
df['Cumulative Frequency'] = df['Frequency'].cumsum()
df
Out[24]:
Frequency Cumulative Frequency
0 0 0
1 1 1
2 2 3
3 3 6
4 4 10
5 5 15
6 6 21
7 7 28
8 8 36
9 9 45