Split a python pandas column using positional string in another column - python

I have a dataframe with the structure:
ID  Split        Data
1   GT:RC:BC:CN  1:4:5:3
2   GT:RC:CN     1:7:0
3   GT:BC        4:2
I would like to create n new columns and populate with the data in the Data column, where n is the total number of unique fields split by a colon in the Split column (in this case, this would be 4 new columns: GT, RC, BC, CN). The new columns should be populated with the corresponding data in the Data column, so for ID 3, only column GT and BC should be populated. I have tried using string splitting, but that doesn't take into account the correct column to move the data to.
The output should look like this:
ID  Split        Data     GT  RC  BC  CN
1   GT:RC:BC:CN  1:4:5:3  1   4   5   3
2   GT:RC:CN     1:7:0    1   7       0
3   GT:BC        4:2      4       2

You can use:
out = df.join(pd.concat([pd.Series(d.split(':'), index=s.split(':'))
                         for s,d in zip(df['Split'], df['Data'])], axis=1).T)
output:
   ID        Split     Data GT   RC   BC   CN
0   1  GT:RC:BC:CN  1:4:5:3  1    4    5    3
1   2     GT:RC:CN    1:7:0  1    7  NaN    0
2   3        GT:BC      4:2  4  NaN    2  NaN
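For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. It is an alternative to the pd.concat approach above, not the answerer's code: it builds one dict per row and lets the DataFrame constructor align the keys. The names `records` and `wide` are made up for illustration, and the split values stay strings, as in the output above.

```python
import pandas as pd

# Sample data taken from the question
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Split": ["GT:RC:BC:CN", "GT:RC:CN", "GT:BC"],
    "Data": ["1:4:5:3", "1:7:0", "4:2"],
})

# One dict per row: field name -> value at the same position
records = [
    dict(zip(s.split(":"), d.split(":")))
    for s, d in zip(df["Split"], df["Data"])
]

# Missing fields become NaN; reuse df's index so join lines up row by row
wide = pd.DataFrame(records, index=df.index)
out = df.join(wide)
print(out)
```

If numeric columns are needed, something like `wide.apply(pd.to_numeric)` before the join should work, assuming every field really is numeric.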

Related

How to aggregate all values in a pandas dataframe columns in 2 values

I have a pandas dataframe with several columns, each containing different values (see the image).
In col1 the value 1 is more frequent than the others, so I need to transform this column to have only two values: 1 and more than 1.
How can I do that?
My goal is to turn this column into a categorical column, but I have no idea how to do that.
The expected output is something like the next image:
Try the clip function on the column:
df["col1"].clip(upper=2)
0 1
1 2
2 2
3 2
4 1
5 2
6 2
7 1
8 1
9 1
10 1
11 2
12 1
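Since the goal was a categorical column, here is a rough sketch of one way to go from the clipped values to labels. The sample values and the label strings are assumptions (the original data was only shown in an image), and it assumes col1 holds integers that are at least 1.

```python
import pandas as pd

# Hypothetical sample data; the real values were only visible in an image
df = pd.DataFrame({"col1": [1, 3, 2, 5, 1, 4]})

# Everything above 1 collapses to 2
clipped = df["col1"].clip(upper=2)

# Map the two remaining values to labels and store them as a categorical
labels = clipped.map({1: "1", 2: "more than 1"})
df["col1_cat"] = pd.Categorical(labels, categories=["1", "more than 1"])
print(df)
```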

How, in python, can I count unique values in a column for gradually increasing numbers of rows within groups

I am working in python on a pandas data frame and am trying to count unique values of a column within groups. My problem is that I need that count to represent steadily increasing numbers of rows within the groups and I also don't want NaNs to be counted.
Simplified, the data looks like this
ID occup
1 NaN
1 A
1 NaN
1 NaN
1 B
2 K
2 NaN
2 L
2 L
2 M
The new column 'occupcount' should, within the groups defined by 'ID', count the number of unique values in 'occup' but, in the first row of each group I want the count to only consider the first row in the respective group. In the second row, I want to count over the first two rows. In the fifth row, I want the count of unique values over all five rows within each group. It should look like this:
ID occup occupcount
1 NaN 0
1 A 1
1 NaN 1
1 B 2
1 A 2
2 K 1
2 NaN 1
2 L 2
2 K 2
2 M 3
I tried to solve the task with something like
df['occupcount'] = (df.groupby(["ID"])['occup'].transform('nunique'))
But it only provides the total amount of unique values over all rows within each group, no gradual increase. Thanks in advance!
The idea is to build a boolean mask that is True only for the first occurrence of each (ID, occup) pair where occup is not missing, and then take the running total of that mask within each ID group with GroupBy.cumsum:
df['occupcount'] = ((~df.duplicated(['ID','occup']) & df['occup'].notna())
                    .groupby(df['ID'])
                    .cumsum())
print (df)
ID occup occupcount
0 1 NaN 0
1 1 A 1
2 1 NaN 1
3 1 B 2
4 1 A 2
5 2 K 1
6 2 NaN 1
7 2 L 2
8 2 L 2
9 2 M 3
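To double-check the mask-and-cumsum trick, here is a small runnable sketch that compares it with a slow brute-force count over each growing prefix of a group. The data is the frame printed above; `first_seen` and `brute_force` are illustrative names, and the cast to int is only there to make the summed dtype explicit.

```python
import numpy as np
import pandas as pd

# Same data as in the printed output above
df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "occup": [np.nan, "A", np.nan, "B", "A", "K", np.nan, "L", "L", "M"],
})

# True only where a non-missing occup value appears for the first time in its ID
first_seen = ~df.duplicated(["ID", "occup"]) & df["occup"].notna()
df["occupcount"] = first_seen.astype(int).groupby(df["ID"]).cumsum()

# Brute force: nunique() over the first i+1 rows of each group (NaN is ignored)
brute_force = [
    group.iloc[: i + 1].nunique()
    for _, group in df.groupby("ID")["occup"]
    for i in range(len(group))
]
print(df)
```

The brute-force list only lines up with the column because the rows are already ordered by ID here.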

Group separated counting values in a pandas dataframe

I have following df
A B
0 1 10
1 2 20
2 NaN 5
3 3 1
4 NaN 2
5 NaN 3
6 1 10
7 2 50
8 Nan 80
9 3 5
Column A consists of repeating sequences from 1 to 3, separated by a variable number of NaNs. I want to group by each of these 1-3 sequences and get the minimum value of column B within each sequence.
Desired Output something like:
B_min
0 1
6 5
Many thanks beforehand
draj
The idea is to first remove the rows with missing values in A with DataFrame.dropna, then use GroupBy.min, grouping by a helper Series built by comparing A to 1 with Series.eq and taking Series.cumsum, and finally clean the result up into a one-column DataFrame:
df = (df.dropna(subset=['A'])
        .groupby(df['A'].eq(1).cumsum())['B']
        .min()
        .reset_index(drop=True)
        .to_frame(name='B_min'))
print (df)
B_min
0 1
1 5
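Here is a self-contained sketch of the same steps, with the helper Series computed after the dropna so nothing depends on index alignment. It also shows one way to keep the starting row label of each sequence (0 and 6), as in the desired output. The names `valid`, `seq_id` and `out_by_start` are illustrative, and it assumes every sequence starts with A == 1.

```python
import numpy as np
import pandas as pd

# Data from the question
df = pd.DataFrame({
    "A": [1, 2, np.nan, 3, np.nan, np.nan, 1, 2, np.nan, 3],
    "B": [10, 20, 5, 1, 2, 3, 10, 50, 80, 5],
})

# Drop the separator rows, then number the sequences: each A == 1 starts a new one
valid = df.dropna(subset=["A"])
seq_id = valid["A"].eq(1).cumsum()

mins = valid.groupby(seq_id)["B"].min()
out = mins.reset_index(drop=True).to_frame(name="B_min")

# Variant: label each minimum with the row where its sequence starts
out_by_start = mins.to_frame(name="B_min")
out_by_start.index = valid.index[valid["A"].eq(1)]
print(out)
print(out_by_start)
```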
All you need is df.groupby() followed by min(). Is this what you are expecting?
df.groupby('A')['B'].min()
Output:
A
1 10
2 20
3 1
Nan 80
If you don't want the NaNs in your group you can drop them using df.dropna()
df.dropna().groupby('A')['B'].min()

Count of columns which have some value in pandas dataframe

Derive a new pandas column based on length of string in other columns
I want to count the number of columns which have a value in each row and create a new column with that number. Assume if I have 3 columns and two columns have some value then new column for that row will have the value 2.
import pandas as pd
df = pd.DataFrame({'ID':['1','2','3'], 'J1': ['a','ab',''],'J2':['22','','33']})
print(df)
The output should be like:
  ID  J1  J2  Count_of_cols_have_values
0  1   a  22                          2
1  2  ab                              1
2  3      33                          1
One way could be to check which cells are not equal (DataFrame.ne) to an empty string, and take the sum along the rows:
df['Count_of_cols_have_values'] = df.set_index('ID').ne('').sum(axis=1).values
  ID  J1  J2  Count_of_cols_have_values
0  1   a  22                          2
1  2  ab                              1
2  3      33                          1
Or you can replace the empty strings with NaN (this needs import numpy as np) and use count, which returns the number of non-NA values per row:
df['Count_of_cols_have_values'] = df.set_index('ID').replace('',np.nan).count(axis=1).values
  ID  J1  J2  Count_of_cols_have_values
0  1   a  22                          2
1  2  ab                              1
2  3      33                          1
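A short runnable sketch that puts both approaches side by side on the question's data and confirms they agree. The intermediate names `values`, `count_ne` and `count_notna` are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": ["1", "2", "3"],
                   "J1": ["a", "ab", ""],
                   "J2": ["22", "", "33"]})

# Keep ID out of the count by moving it to the index
values = df.set_index("ID")

# Approach 1: count cells that are not the empty string
count_ne = values.ne("").sum(axis=1).to_numpy()

# Approach 2: turn empty strings into NaN, then count non-NA cells
count_notna = values.replace("", np.nan).count(axis=1).to_numpy()

df["Count_of_cols_have_values"] = count_ne
print(df)
```

The second form also skips cells that are already NaN, which may matter if the real data mixes empty strings and missing values.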

Removing duplicates based on two columns while deleting inconsistent data

I have a pandas dataframe like this:
a b c
0 1 1 1
1 1 1 0
2 2 4 1
3 3 5 0
4 3 5 0
where the first 2 columns ('a' and 'b') are IDs while the last one ('c') is a validation (0 = neg, 1 = pos). I know how to remove duplicates based on the values of the first 2 columns, but in this case I would also like to get rid of inconsistent data, i.e. duplicated data validated both as positive and negative. For example, the first 2 rows are duplicated but inconsistent, so I should remove the entire record, while the last 2 rows are both duplicated and consistent, so I'd keep one of them. The expected result should be:
a b c
0 2 4 1
1 3 5 0
The real dataframe can have more than two duplicates per group and, as you can see, the index has also been reset. Thanks.
First use GroupBy.transform with SeriesGroupBy.nunique to count the distinct 'c' values in each ('a', 'b') group, keep only the rows of groups with exactly one distinct value via boolean indexing, and then apply DataFrame.drop_duplicates:
df = (df[df.groupby(['a','b'])['c'].transform('nunique').eq(1)]
        .drop_duplicates(['a','b'])
        .reset_index(drop=True))
print (df)
a b c
0 2 4 1
1 3 5 0
Detail:
print (df.groupby(['a','b'])['c'].transform('nunique'))
0 2
1 2
2 1
3 1
4 1
Name: c, dtype: int64
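Below is a self-contained sketch of the same filter, split into named steps. The data is the question's frame plus one extra (3, 5, 0) row, which is a made-up addition to cover the "more than two duplicates per group" case. The names `n_distinct`, `consistent` and `out` are illustrative.

```python
import pandas as pd

# Question data plus one hypothetical extra duplicate in the (3, 5) group
df = pd.DataFrame({
    "a": [1, 1, 2, 3, 3, 3],
    "b": [1, 1, 4, 5, 5, 5],
    "c": [1, 0, 1, 0, 0, 0],
})

# Number of distinct validation values per (a, b) group, broadcast to every row
n_distinct = df.groupby(["a", "b"])["c"].transform("nunique")

# Keep only groups that agree on c, then one row per group
consistent = df[n_distinct.eq(1)]
out = consistent.drop_duplicates(["a", "b"]).reset_index(drop=True)
print(out)
```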
