How to generate a sequence but avoid numbering duplicates in groups? - python

I have a pandas dataframe as shown below:
import pandas as pd

df = pd.DataFrame({'sub_id': [101, 101, 101, 102, 102, 103, 104, 104, 105],
                   'test_id': ['A1', 'A1', 'C1', 'A1', 'B1', 'D1', 'E1', 'A1', 'F1'],
                   'dummy': ['hi', 'hello', 'how', 'are', 'you', 'am', 'fine', 'thank', 'you']})
I want each combination of sub_id and test_id to have a unique id (sequence number).
Please note that one subject can have duplicate test_ids, but the dummy values will be different.
Similarly, multiple subjects can share the same test_ids, as shown in the sample dataframe.
So I tried the two approaches below, but they are incorrect:
df.groupby(['sub_id', 'test_id']).cumcount() + 1  # incorrect
df['seq_id'] = df.index + 1  # incorrect
I expect my output to look as shown below.

IIUC, try ngroup():
df['seq_id'] = df.groupby(['sub_id', 'test_id'], sort=False).ngroup() + 1
Output of df:
   sub_id test_id  dummy  seq_id
0     101      A1     hi       1
1     101      A1  hello       1
2     101      C1    how       2
3     102      A1    are       3
4     102      B1    you       4
5     103      D1     am       5
6     104      E1   fine       6
7     104      A1  thank       7
8     105      F1    you       8
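For context on why the first attempt was incorrect: cumcount numbers the rows within each group, while ngroup numbers the groups themselves. A minimal sketch of the difference, using a cut-down version of the sample data:
import pandas as pd

df = pd.DataFrame({'sub_id': [101, 101, 101],
                   'test_id': ['A1', 'A1', 'C1']})

# cumcount: position of each row *within* its group
print(df.groupby(['sub_id', 'test_id']).cumcount().tolist())  # [0, 1, 0]

# ngroup: index of the group each row belongs to
print(df.groupby(['sub_id', 'test_id'], sort=False).ngroup().tolist())  # [0, 0, 1]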

Related

How to correctly use pandas' slice_replace with values from another column and then explode a row into two

I'm working with a DataFrame that looks like this:
  ID         SEQ  BEG_GAP  END_GAP
0 A1     ABCDEFG        2        4
1 B1     YUUAAMN        4        6
2 C1  LKHUTYYYYA        7        9
What I'm trying to do is first replace the part of the string in "SEQ" that lies between the values of "BEG_GAP" and "END_GAP", and then explode the two remaining pieces of the string into two different rows (probably using pandas' explode).
I.e., the first expected result:
  ID         SEQ  BEG_GAP  END_GAP
0 A1     AB---FG        2        4
1 B1     YUUA---        4        6
2 C1  LKHUTY--YA        7        8
To then get:
  ID     SEQ  BEG_GAP  END_GAP
0 A1      AB        2        4
1 A1      FG        2        4
2 B1    YUUA        4        6
3 C1  LKHUTY        7        8
4 C1      YA        7        8
I'm trying to use the following code:
import pandas as pd

df = pd.read_csv("..\path_to_the_csv.csv")
for index, rows in df.iterrows():
    start = df["BEG_GAP"].astype(float)
    stop = df["END_GAP"].astype(float)
    df["SEQ"] = df["SEQ"].astype(str)
    df['SEQ'] = df['SEQ'].str.slice_replace(start=start, stop=stop, repl='-')
But the column "SEQ" that I'm getting is full of NaN. I suppose it has to do with how I'm using start and stop. I could use some help with this, and also with how to later divide the rows according to the gaps.
I hope I was clear enough, thanks in advance!
Let's try:
df['SEQ'] = df.apply(lambda x: [x.SEQ[:x.BEG_GAP], x.SEQ[x.END_GAP+1:]], axis=1)
output = df.explode('SEQ').query('SEQ!=""')
Output:
   ID      SEQ  BEG_GAP  END_GAP
0  A1       AB        2        4
0  A1       FG        2        4
1  B1     YUUA        4        6
2  C1  LKHUTYY        7        8
2  C1        A        7        8
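A self-contained sketch of the same approach, with a reset_index(drop=True) appended since explode repeats the original index (hence the duplicated 0 and 2 above); the C1 END_GAP is taken as 8 here to match the answer's output:
import pandas as pd

df = pd.DataFrame({'ID': ['A1', 'B1', 'C1'],
                   'SEQ': ['ABCDEFG', 'YUUAAMN', 'LKHUTYYYYA'],
                   'BEG_GAP': [2, 4, 7],
                   'END_GAP': [4, 6, 8]})

# split each SEQ into the piece before BEG_GAP and the piece after END_GAP
df['SEQ'] = df.apply(lambda x: [x.SEQ[:x.BEG_GAP], x.SEQ[x.END_GAP + 1:]], axis=1)

# explode into one row per piece, drop empty pieces, renumber the index
output = (df.explode('SEQ')
            .query('SEQ != ""')
            .reset_index(drop=True))
print(output)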

Create new column in pandas that will get values depending on a condition

I have two databases:
Database 1 has data about the name of each observation and the different measurements that were taken for each observation, e.g.:
Database 2 contains the same observation names (but not all of them) and a carbon measurement for each.
I want to do these steps:
- add an empty column to database 1
- if a name in database 2 is also in database 1, take the carbon value and add it to the new column;
if not, leave it empty.
I have tried to write something, but it's really just the beginning and I feel stuck:
NaN = np.nan
df['carbon'] = NaN
for i in df.loc['name']:
    if i in df_chemo.loc['sample name'] is in df.loc['name']:
I know it is just the beginning, but I feel like I don't know how to write what I want.
My end goal: to add to database 1 a new column that will have values from database 2 only if the names match.
What you are looking for is the merge method:
df = df.merge(df_chemo, how='left', left_on='name', right_on='sample name')
Example:
import pandas as pd
df1 = pd.DataFrame({'x':[1,2,1,4], 'y':[11,22,33,44]})
print(df1, end='\n ------------- \n')
df2 = pd.DataFrame({'x':[1,2,5,7], 'z':list('abcd')})
print(df2, end='\n ------------- \n')
print(df1.merge(df2, on='x', how='left'))
Output:
   x   y
0  1  11
1  2  22
2  1  33
3  4  44
-------------
   x  z
0  1  a
1  2  b
2  5  c
3  7  d
-------------
   x   y    z
0  1  11    a
1  2  22    b
2  1  33    a
3  4  44  NaN
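Applied to the question's own frames, a minimal sketch might look like the following (the column names 'name', 'height', 'carbon' and the sample values are assumptions based on the question text):
import pandas as pd

# hypothetical stand-ins for the asker's two databases
df = pd.DataFrame({'name': ['s1', 's2', 's3'], 'height': [10, 20, 30]})
df_chemo = pd.DataFrame({'sample name': ['s1', 's3'], 'carbon': [0.5, 0.9]})

# a left merge keeps every row of df; unmatched names get NaN in 'carbon'
df = df.merge(df_chemo, how='left', left_on='name', right_on='sample name')
df = df.drop(columns='sample name')  # the key column from df_chemo is redundant
print(df)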

Groupby and sum several columns pandas

I am trying to group by several columns. With one column it is easy, just df.groupby('c1')['c3'].sum().
Original dataframe:
c1  c2  c3
 1   1   2
 1   2  12
 2   1  87
 2   2  12
 2   3  87
 2   3  13
Desired result:
c2  c3(c1_1)  c3(c1_2)
 1         2        87
 2        12        12
 3      (0?)       100
where c3(c1_1) means the sum of column c3 where c1 has a value of 1.
I have no idea how to apply groupby here. It would be nice if someone could show not only how to solve it, but also what to read so I can avoid such questions.
You can group by multiple columns at once by providing a list to groupby. If you don't mind the output being formatted slightly differently, you can achieve this result through
In [32]: df.groupby(['c2', 'c1']).c3.sum().unstack(fill_value=0)
Out[32]:
c1   1    2
c2
1    2   87
2   12   12
3    0  100
With a bit of work, this can be massaged into the form you give as well.
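An equivalent result can also be reached with pivot_table, which may read more naturally if you think of the task as building a crosstab; a small sketch using the question's data:
import pandas as pd

df = pd.DataFrame({'c1': [1, 1, 2, 2, 2, 2],
                   'c2': [1, 2, 1, 2, 3, 3],
                   'c3': [2, 12, 87, 12, 87, 13]})

# rows = c2, columns = c1, cells = sum of c3; absent combinations become 0
out = df.pivot_table(index='c2', columns='c1', values='c3',
                     aggfunc='sum', fill_value=0)
print(out)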

Randomly combine pandas group objects

Question:
How can one use pandas df.groupby() function to create randomly selected groups of groups?
Example:
I would like to group a dataframe into random groups of size n where n corresponds to the number of unique values in a given column.
I have a dataframe with a variety of columns including "id". Some rows have unique ids whereas others may have the same id. For example:
  c1  id  c2
0  a   1   4
1  b   2   6
2  c   2   2
3  d   5   7
4  y   9   3
In reality this dataframe can have up to 1000 or so rows.
I would like to be able to group this dataframe using the following criteria:
- each group should contain at most n unique ids
- no id should appear in more than one group
- the specific ids in a given group should be selected randomly
- each id should appear in exactly one group
For example the example dataframe (above) could become:
group1:
  c1  id  c2
0  a   1   4
4  y   9   3
group2:
  c1  id  c2
1  b   2   6
2  c   2   2
3  d   5   7
where n = 2.
Thanks for your suggestions.
It seems difficult with a single groupby statement. One way to do it:
import random
import numpy as np

uniq = df['id'].unique()
random.shuffle(uniq)
groups = np.split(uniq, 2)
dfr = df.set_index(df['id'])
for gp in groups:
    print(dfr.loc[gp])
which gives:
   c1  id  c2
id
9   y   9   3
1   a   1   4
   c1  id  c2
id
5   d   5   7
2   b   2   6
2   c   2   2
If the size of the groups (n) doesn't divide len(uniq), you can use np.split(uniq, range(n, len(uniq), n)) instead.
Here's a way to do it:
import numpy as np
import pandas as pd

df = pd.DataFrame({'c1': list('abcdy'), 'id': [1, 2, 2, 5, 9], 'c2': [4, 6, 2, 7, 3]})
n = 2
shuffled_ids = np.random.permutation(df['id'].unique())
id_groups = [shuffled_ids[i:i+n] for i in range(0, len(shuffled_ids), n)]
groups = [df['id'].apply(lambda x: x in g) for g in id_groups]
Output:
In [1]: df[groups[0]]
Out[1]:
  c1  c2  id
1  b   6   2
2  c   2   2
3  d   7   5
In [2]: df[groups[1]]
Out[2]:
  c1  c2  id
0  a   4   1
4  y   3   9
This approach doesn't involve changing the index, in case you need to keep it.
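Another option, sketched here as an assumption rather than taken from the answers above, is to map each unique id to a random group number and then use an ordinary groupby, which avoids building one boolean mask per group:
import numpy as np
import pandas as pd

df = pd.DataFrame({'c1': list('abcdy'), 'id': [1, 2, 2, 5, 9], 'c2': [4, 6, 2, 7, 3]})
n = 2

shuffled = np.random.permutation(df['id'].unique())
# id -> group number: the first n shuffled ids get 0, the next n get 1, ...
grp_of_id = {uid: i // n for i, uid in enumerate(shuffled)}

for label, group in df.groupby(df['id'].map(grp_of_id)):
    print(group)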

sum two pandas dataframe columns, keep non-common rows

I just asked a similar question but then realized it wasn't the right question.
What I'm trying to accomplish is to combine two data frames that have the same columns but may or may not have common rows (indices of a MultiIndex). I'd like to combine them by taking the sum of one of the columns while leaving the other columns alone.
According to the accepted answer, the approach may be something like:
import numpy as np
import pandas as pd

def mklbl(prefix, n):
    try:
        return ["%s%s" % (prefix, i) for i in range(n)]
    except TypeError:
        return ["%s%s" % (prefix, i) for i in n]

mi1 = pd.MultiIndex.from_product([mklbl('A', 4), mklbl('C', 2)])
mi2 = pd.MultiIndex.from_product([mklbl('A', [2, 3, 4]), mklbl('C', 2)])
df2 = pd.DataFrame({'a': np.arange(len(mi1)), 'b': np.arange(len(mi1)),
                    'c': np.arange(len(mi1)), 'd': np.arange(len(mi1))[::-1]},
                   index=mi1).sort_index().sort_index(axis=1)
df1 = pd.DataFrame({'a': np.arange(len(mi2)), 'b': np.arange(len(mi2)),
                    'c': np.arange(len(mi2)), 'd': np.arange(len(mi2))[::-1]},
                   index=mi2).sort_index().sort_index(axis=1)
df1 = df1.add(df2.pop('b'))
but the problem is this will fail as the indices don't align.
This is close to what I'm trying to achieve, except that I lose rows that are not common to the two dataframes:
df1['b'] = df1['b'].add(df2['b'], fill_value=0)
But this gives me:
Out[197]:
        a   b  c  d
A2 C0   0   4  0  5
   C1   1   6  1  4
A3 C0   2   8  2  3
   C1   3  10  3  2
A4 C0   4   4  4  1
   C1   5   5  5  0
When I want:
In [197]: df1
Out[197]:
        a   b  c  d
A0 C0   0   0  0  7
   C1   1   2  1  6
A1 C0   2   4  2  5
   C1   3   6  3  4
A2 C0   0   4  0  5
   C1   1   6  1  4
A3 C0   2   8  2  3
   C1   3  10  3  2
A4 C0   4   4  4  1
   C1   5   5  5  0
Note: in response to @RandyC's comment about the XY problem... the specific problem is that I have a class which reads data and returns a dataframe of 1e9 rows. The columns of the data frame are latll, latur, lonll, lonur, concentration, elevation. The data frame is indexed by a MultiIndex (lat, lon, time), where time is a datetime. The rows of the two dataframes may or may not be the same (if they exist for a given date, the lat/lon will be the same... they are grid cell centers). latll, latur, lonll, lonur are calculated from lat/lon. I want to sum the concentration column as I add two data frames, but not change the others.
Self-answering: there was an error in the comment above that caused double adding. This is correct:
newdata = df2.pop('b')
result = df1.combine_first(df2)
result['b'] = result['b'].add(newdata, fill_value=0)
seems to provide the solution to my use-case.
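For context, a minimal sketch (with small hypothetical single-index frames rather than the MultiIndex above) of why the pop matters: combine_first fills in the rows missing from df1, so if 'b' were still present in df2, its values would be counted once by combine_first and again by add:
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2], 'b': [10, 20]}, index=['x', 'y'])
df2 = pd.DataFrame({'a': [3, 4], 'b': [30, 40]}, index=['y', 'z'])

# remove 'b' from df2 first so combine_first only fills the other columns
newdata = df2.pop('b')               # Series: y -> 30, z -> 40
result = df1.combine_first(df2)      # rows x, y from df1; row z from df2
result['b'] = result['b'].add(newdata, fill_value=0)
print(result)
# 'b' is summed where the rows overlap (y: 20 + 30 = 50), kept otherwise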
