Splitting pandas dataframe based on value - python

I would like to split a pandas DataFrame into groups so that I can process each group separately. My 'value.csv' file contains the following numbers:
num tID y x height width
2 0 0 0 1 16
2 1 1 0 1 16
5 0 1 0 1 16
5 1 0 0 1 8
5 2 0 8 1 8
6 0 0 0 1 16
6 1 1 0 1 8
6 2 1 8 1 8
2 0 0 0 1 16
2 1 1 0 1 16
5 0 1 0 1 16
5 1 0 0 1 8
5 2 0 8 1 8
6 0 0 0 1 16
6 1 1 0 1 8
6 2 1 8 1 8
I would like to split the data wherever the tID column starts again at 0. The first four groups would look like this:
First:
2 0 0 0 1 16
2 1 1 0 1 16
Second:
5 0 1 0 1 16
5 1 0 0 1 8
5 2 0 8 1 8
Third:
6 0 0 0 1 16
6 1 1 0 1 8
6 2 1 8 1 8
Fourth:
2 0 0 0 1 16
2 1 1 0 1 16
I tried to do this with an if statement inside a loop, without success. Is there an efficient way to do it?
import pandas as pd
statQuality = 'value.csv'
df = pd.read_csv(statQuality, names=['num','tID','y','x','height','width'])
df2 = df.copy()
df2.drop(['num'], axis=1, inplace=True)
x = []
for index, row in df2.iterrows():
    if row['tID'] == 0:
        x = []
        x.append(row)
        print(x)
    else:
        x.append(row)

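Use consecutive runs of equal num values as the groups (in this data, num changes exactly where tID restarts at 0):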
#create groups by consecutive values
s = df['num'].ne(df['num'].shift()).cumsum()
#create helper count Series for duplicated groups like `2_0`, `2_1`...
g = s.groupby(df['num']).transform(lambda x: x.factorize()[0])
#dictionary of DataFrames
d = {'{}_{}'.format(i,j): v.drop('num', axis=1) for (i, j), v in df.groupby(['num', g])}
print (d)
{'2_0': tID y x height width
0 0 0 0 1 16
1 1 1 0 1 16, '2_1': tID y x height width
8 0 0 0 1 16
9 1 1 0 1 16, '5_0': tID y x height width
2 0 1 0 1 16
3 1 0 0 1 8
4 2 0 8 1 8, '5_1': tID y x height width
10 0 1 0 1 16
11 1 0 0 1 8
12 2 0 8 1 8, '6_0': tID y x height width
5 0 0 0 1 16
6 1 1 0 1 8
7 2 1 8 1 8, '6_1': tID y x height width
13 0 0 0 1 16
14 1 1 0 1 8
15 2 1 8 1 8}
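If the goal is to cut strictly at the rows where tID is 0, regardless of num, one possible sketch is to use the cumulative sum of tID == 0 as a block label. The small frame below is an assumption: it holds only the first ten rows of the sample and three of its columns, and the names block_id and blocks are made up for illustration.

```python
import pandas as pd

# Assumed subset of the sample data: first ten rows, three columns only.
df = pd.DataFrame({
    "num":   [2, 2, 5, 5, 5, 6, 6, 6, 2, 2],
    "tID":   [0, 1, 0, 1, 2, 0, 1, 2, 0, 1],
    "width": [16, 16, 16, 8, 8, 16, 8, 8, 16, 16],
})

# Every row with tID == 0 opens a new block, so the running count of
# such rows gives each block its own label (1, 2, 3, ...).
block_id = df["tID"].eq(0).cumsum()

# One DataFrame per block, in order of appearance.
blocks = [block for _, block in df.groupby(block_id)]

for block in blocks:
    print(block, end="\n\n")
```

Here blocks[0] holds the first two rows, blocks[1] the next three, and so on. Each piece keeps its original row labels.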

Related

Trying to merge dictionaries together to create a new df, but the dictionary values aren't showing up in the df

For my quarter columns I get NaN instead of the expected values such as 1, 0, 0, 0.
How do I fix the code below so that the values appear in my dataframe?
qrt_1 = {'q1':[1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0]}
qrt_2 = {'q2':[0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0]}
qrt_3 = {'q3':[0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0]}
qrt_4 = {'q4':[0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1]}
year = {'year': [1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,6,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9]}
value = data_1['Sales']
data = [year, qrt_1, qrt_2, qrt_3, qrt_4]
dataframes = []
for x in data:
    dataframes.append(pd.DataFrame(x))
df = pd.concat(dataframes)
I am expecting a dataframe that has year, q1, q2, etc. as columns, each filled with its corresponding values.
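Pass axis=1 to pd.concat so the frames are joined side by side as columns. The default, axis=0, stacks them vertically, which leaves NaN wherever a frame lacks a column: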
df = pd.concat(dataframes, axis=1)
print(df)
Prints:
year q1 q2 q3 q4
0 1 1 0 0 0
1 1 0 1 0 0
2 1 0 0 1 0
3 1 0 0 0 1
4 2 1 0 0 0
5 2 0 1 0 0
6 2 0 0 1 0
7 2 0 0 0 1
8 3 1 0 0 0
9 3 0 1 0 0
10 3 0 0 1 0
11 3 0 0 0 1
12 4 1 0 0 0
13 4 0 1 0 0
14 4 0 0 1 0
15 4 0 0 0 1
16 5 1 0 0 0
17 5 0 1 0 0
18 5 0 0 1 0
19 5 0 0 0 1
20 6 1 0 0 0
21 6 0 1 0 0
22 6 0 0 1 0
23 6 0 0 0 1
24 7 1 0 0 0
25 7 0 1 0 0
26 7 0 0 1 0
27 7 0 0 0 1
28 8 1 0 0 0
29 8 0 1 0 0
30 8 0 0 1 0
31 8 0 0 0 1
32 9 1 0 0 0
33 9 0 1 0 0
34 9 0 0 1 0
35 9 0 0 0 1
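As an alternative sketch, since every dictionary here holds a single column of equal length, they could also be merged into one dictionary before building the frame, which avoids concat altogether. The shortened lists (two years instead of nine) and the name merged are assumptions for illustration.

```python
import pandas as pd

# Shortened versions of the dictionaries from the question (two years only).
year  = {"year": [1, 1, 1, 1, 2, 2, 2, 2]}
qrt_1 = {"q1": [1, 0, 0, 0, 1, 0, 0, 0]}
qrt_2 = {"q2": [0, 1, 0, 0, 0, 1, 0, 0]}
qrt_3 = {"q3": [0, 0, 1, 0, 0, 0, 1, 0]}
qrt_4 = {"q4": [0, 0, 0, 1, 0, 0, 0, 1]}

# Merge the single-column dictionaries into one; the keys become columns.
merged = {}
for part in (year, qrt_1, qrt_2, qrt_3, qrt_4):
    merged.update(part)

df = pd.DataFrame(merged)
print(df)
```

This only works as-is when the keys are distinct and all lists have the same length; otherwise pd.concat with axis=1 is the safer route.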

Pandas - changing rows where less than n subsequent values are equal

I have the following dataframe:
df = pd.DataFrame({"col":[0,0,1,1,1,1,0,0,1,1,0,0,1,1,1,0,1,1,1,1,0,0,0]})
Now I would like to set to zero every row that belongs to a run of fewer than four consecutive 1's, i.e. I would like to end up with the following DataFrame:
df = pd.DataFrame({"col":[0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0]})
I was not able to find a way to achieve this nicely...
Try with groupby and where:
streaks = df.groupby(df["col"].ne(df["col"].shift()).cumsum()).transform("sum")
output = df.where(streaks.ge(4), 0)
>>> output
col
0 0
1 0
2 1
3 1
4 1
5 1
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 1
17 1
18 1
19 1
20 0
21 0
22 0
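To see why this works, here is a minimal sketch on a shorter, made-up column. Comparing each value with its predecessor and taking cumsum gives every run its own label; summing within a label then yields the run length for runs of 1 and 0 for runs of 0. The names s, run_id and run_sum are illustrative only.

```python
import pandas as pd

# Made-up column: a run of two 1's and a run of one 1.
s = pd.Series([0, 1, 1, 0, 1])

# True wherever the value differs from the previous row, i.e. at the
# start of each run; the cumulative sum turns that into a run label.
run_id = s.ne(s.shift()).cumsum()

# Sum of the values inside each run, broadcast back to every row.
run_sum = s.groupby(run_id).transform("sum")

print(pd.DataFrame({"col": s, "run_id": run_id, "run_sum": run_sum}))
```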
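We can also group on the running count of zeros. Each group then holds one 0 plus the 1's that follow it, so a group of fewer than 5 rows contains fewer than four 1's (this assumes the column starts with a 0):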
df.loc[df.groupby(df.col.eq(0).cumsum()).transform('count')['col']<5,'col'] = 0
df
Out[77]:
col
0 0
1 0
2 1
3 1
4 1
5 1
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 1
17 1
18 1
19 1
20 0
21 0
22 0

Replace values in df col - pandas

I'm aiming to replace values in a df column Num. Specifically:
where Num is 1, I want to replace the preceding 0's with 1, working backwards (backfilling) up to and including the nearest row where Item is 1.
where Num == 1, the corresponding row in Item will always be 0.
Also, Num == 0 will always follow Num == 1.
Input and code:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'Item' : [0,1,2,3,4,4,0,1,2,3,1,1,2,3,4,0],
    'Num' : [0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0]
})
df['Num'] = np.where((df['Num'] == 1) & (df['Item'].shift() > 1), 1, 0)
Item Num
0 0 0
1 1 0
2 2 0
3 3 0
4 4 0
5 4 1
6 0 0
7 1 0
8 2 0
9 3 0
10 1 0
11 1 0
12 2 0
13 3 0
14 4 1
15 0 0
intended output:
Item Num
0 0 0
1 1 1
2 2 1
3 3 1
4 4 1
5 4 1
6 0 0
7 1 0
8 2 0
9 3 0
10 1 0
11 1 1
12 2 1
13 3 1
14 4 1
15 0 0
First, label groups of rows with cumsum: a new group starts on the row after a Num of 1, or on any row where Item is 1. Then group by these labels and sum the Num column within each group. Every group that contains a 1 in Num gets the value 1, and all other groups get 0.
groups = ((df['Num'].shift() == 1) | (df['Item'] == 1)).cumsum()
df['Num'] = df.groupby(groups)['Num'].transform('sum')
Result:
Item Num
0 0 0
1 1 1
2 2 1
3 3 1
4 4 1
5 4 1
6 0 0
7 1 0
8 2 0
9 3 0
10 1 0
11 1 1
12 2 1
13 3 1
14 4 1
15 0 0
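The intermediate labels can be inspected with a short sketch like the one below, which rebuilds the input from the question. It assumes Num only contains 0 and 1; because a new group starts right after every 1, each group sum is then either 0 or 1. The name result is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Item": [0, 1, 2, 3, 4, 4, 0, 1, 2, 3, 1, 1, 2, 3, 4, 0],
    "Num":  [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
})

# A new group starts on the row after a Num of 1, or where Item is 1.
starts = (df["Num"].shift() == 1) | (df["Item"] == 1)
groups = starts.cumsum()

# Broadcast each group's sum of Num back to all of its rows.
result = df.groupby(groups)["Num"].transform("sum")

print(df.assign(group=groups, new_Num=result))
```

Rows 10 and 11 both have Item == 1, so row 10 ends up alone in its own group and stays 0, which is what makes the fill stop at the nearest Item of 1.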
You could try:
for a, b in zip(df[df['Item'] == 0].index, df[df['Num'] == 1].index):
    df.loc[(df.loc[a+1:b-1, 'Item'] == 1)[::-1].idxmax():b-1, 'Num'] = 1

Create calculated column of sum values of other columns in pandas

I have a dataframe with about 60 columns and the following structure:
A B C Y
0 12 1 0 1
1 13 1 0 [....] 0
2 14 0 1 1
3 15 1 0 0
4 16 0 1 1
I want to create a new column Z that holds the sum of the values in columns B to Y.
How can I proceed?
To create a copy of the dataframe that includes the new column, use assign:
df.assign(Z=df.loc[:, 'B':'Y'].sum(1))
A B C Y Z
0 12 1 0 1 2
1 13 1 0 0 1
2 14 0 1 1 2
3 15 1 0 0 1
4 16 0 1 1 2
To assign it to the same dataframe, in place, use
df['Z'] = df.loc[:, 'B':'Y'].sum(1)
df
A B C Y Z
0 12 1 0 1 2
1 13 1 0 0 1
2 14 0 1 1 2
3 15 1 0 0 1
4 16 0 1 1 2
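A self-contained sketch of the label-based slice, assuming a small frame with only the columns A, B, C and Y (the real frame in the question has about 60 columns, which are elided there). df.loc[:, 'B':'Y'] selects every column from B through Y inclusive, in the frame's column order.

```python
import pandas as pd

# Assumed miniature of the frame from the question.
df = pd.DataFrame({
    "A": [12, 13, 14, 15, 16],
    "B": [1, 1, 0, 1, 0],
    "C": [0, 0, 1, 0, 1],
    "Y": [1, 0, 1, 0, 1],
})

# Row-wise sum over the label slice B..Y (both ends included).
df["Z"] = df.loc[:, "B":"Y"].sum(axis=1)

print(df)
```

Because the slice is taken by label, column A is left out of the sum even though it is numeric.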
Try this, which selects every column after the first by position:
df['z']=df.iloc[:,1:].sum(1)
You could use assign:
In [2361]: df.assign(Z=df.loc[:, 'B':'Y'].sum(1))
Out[2361]:
A B C Y Z
0 12 1 0 1 2
1 13 1 0 0 1
2 14 0 1 1 2
3 15 1 0 0 1
4 16 0 1 1 2

Leave blocks of 1 of size >= k in Pandas data frame

I need to keep only the blocks of consecutive 1's whose length is >= k. All other blocks of 1's should be set to zero. For example, with k=2:
df=
a b
0 1 1
1 1 1
2 0 0
3 1 0
4 0 0
5 1 0
6 0 0
7 1 0
8 0 0
9 1 1
10 1 1
11 1 1
12 0 0
13 0 0
14 1 0
15 0 0
16 1 1
17 1 1
18 0 0
19 1 0
where column a is the original sequence and column b is the desired result.
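z = df.a.eq(0)
# label each block of 1's; rows where a == 0 share the dummy label -1
g = z.cumsum().mask(z, -1)
k = 2
# keep a 1 only where its block has at least k rows; cast to int for a 0/1 column
df['b'] = (df.a.groupby(g).transform('size').ge(k) & ~z).astype(int)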
a b
0 1 1
1 1 1
2 0 0
3 1 0
4 0 0
5 1 0
6 0 0
7 1 0
8 0 0
9 1 1
10 1 1
11 1 1
12 0 0
13 0 0
14 1 0
15 0 0
16 1 1
17 1 1
18 0 0
19 1 0
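If this is needed in more than one place, the same idea could be wrapped in a small helper. This is only a sketch under the assumption that the column contains nothing but 0 and 1; the function name keep_long_runs is hypothetical, and it labels runs with the ne/shift/cumsum idiom rather than the masked cumsum used above.

```python
import pandas as pd

def keep_long_runs(s: pd.Series, k: int) -> pd.Series:
    """Return a 0/1 Series keeping only runs of 1 with length >= k."""
    # Give every run of equal consecutive values its own label.
    run_id = s.ne(s.shift()).cumsum()
    # Length of the run each row belongs to.
    run_len = s.groupby(run_id).transform("size")
    # A row survives only if it is a 1 inside a sufficiently long run.
    return (s.eq(1) & run_len.ge(k)).astype(int)

a = pd.Series([1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1])
b = keep_long_runs(a, k=2)
print(pd.DataFrame({"a": a, "b": b}))
```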
