I have a large(ish) set of experimental data that contains pairs of values. Each pair is associated with a particular barcode. Ideally, each pair should have a unique barcode. Unfortunately, it turns out that I screwed something up during the experiment. Now several pairs share a single barcode. I need to exclude these pairs/barcodes from my analysis.
My data looks kind of like this:
The pairs are in columns 'A' and 'B' -- I just included 'X' to represent some arbitrary associated data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Barcode' : ['AABBCC', 'AABBCC', 'BABACC', 'AABBCC', 'DABBAC', 'ABDABD', 'DABBAC'],
                   'A' : ['v', 'v', 'x', 'y', 'z', 'h', 'z'],
                   'B' : ['h', 'h', 'j', 'k', 'l', 'v', 'l'],
                   'X' : np.random.randint(10, size = 7)})
df = df[['Barcode', 'A', 'B', 'X']]
df
Barcode A B X
0 AABBCC v h 8
1 AABBCC v h 7
2 BABACC x j 2
3 AABBCC y k 3
4 DABBAC z l 8
5 ABDABD h v 0
6 DABBAC z l 4
I want to get rid of the rows described by barcode 'AABBCC', since this barcode is associated with two different pairs (rows 0 and 1 are both the same pair -- which is fine -- but row 3 is a different pair).
df.loc[df.Barcode != 'AABBCC']
Barcode A B X
2 BABACC x j 2
4 DABBAC z l 8
5 ABDABD h v 0
6 DABBAC z l 4
My solution thus far:
def duplicates(bar):
    if len(df.loc[df.Barcode == bar].A.unique()) > 1 or len(df.loc[df.Barcode == bar].B.unique()) > 1:
        return 'collision'
    else:
        return 'single'
df['Barcode_collision'] = df.apply(lambda row: duplicates(row['Barcode']), axis = 1)
df.loc[df.Barcode_collision == 'single']
Barcode A B X Barcode_collision
2 BABACC x j 2 single
4 DABBAC z l 8 single
5 ABDABD h v 0 single
6 DABBAC z l 4 single
Unfortunately, this is very slow with a large dataframe (~500,000 rows) using my delicate computer. I'm sure there must be a better/faster way. Maybe using the groupby function?
df.groupby(['Barcode', 'A', 'B']).count()
X
Barcode A B
AABBCC v h 2
y k 1
ABDABD h v 1
BABACC x j 1
DABBAC z l 2
Then filtering out rows that have more than one value in the second or third indexes? But my brain and my googling skills can't seem to get me further than this...
You can use filter:
print(df.groupby('Barcode').filter(lambda x: ((x.A.nunique() == 1) and (x.B.nunique() == 1))))
Barcode A B X Barcode_collision
2 BABACC x j 2 single
4 DABBAC z l 8 single
5 ABDABD h v 0 single
6 DABBAC z l 4 single
Another solution with transform and boolean indexing:
g = df.groupby('Barcode')
A = g.A.transform('nunique')
B = g.B.transform('nunique')
print (df[(A == 1) & (B == 1)])
Barcode A B X Barcode_collision
2 BABACC x j 2 single
4 DABBAC z l 8 single
5 ABDABD h v 0 single
6 DABBAC z l 4 single
I have a dataset with a lot of items whose status I track each week (so an item can occur multiple times in the dataset). I would like to count the number of consecutive weeks an item has had a given status. Per item I would like to see how long it was status "Z", and preferably the last week in which the item was status "Z". The counter should only start from the first week the item became status "Z"; once it runs into a week where the item is no longer status "Z", the counter should stop and its value be inserted at the original row. For every week, only historical weeks should be taken into account (week 2 should not take week 3 into account).
Furthermore, I would like to include the most recent week the item had status "Z". For items which don't have status "Z" in the current week, I would like to see the last week in which status "Z" applied.
df = pd.DataFrame({'WeekNr': [202301,202302,202303,202304,202305,202301,202302,202303,202304,202305], 'Status': ['A', 'A', 'A', 'Z', 'Z', 'Z', 'A', 'A', 'Z', 'Z'], 'Item': ['x', 'x', 'x', 'x', 'x', 'y', 'y', 'y', 'y','y']})
First, I sort my dataframe to make sure we iterate in chronological order:
df.sort_values('WeekNr', ascending = False)
check = 0
for index, row in df.iterrows():
    for index2, row2 in df.iterrows():
        if row["Item"] == row2["Item"]:
            if row2["Status"] == "z":
                check += 1
        elif row["Item"] == row2["Item"]:
            if row2["Status"] != "z":
                row["Check"] = check
        else:
            continue
    Check = 0
Preferred output would be:
WeekNr Status Item counter Check
202301 A x 0 -
202302 A x 0 -
202303 A x 0 -
202304 Z x 1 202304
202305 Z x 2 202304
202301 Z y 1 202301
202302 A y 0 202301
202303 A y 0 202301
202304 Z y 1 202304
202305 Z y 2 202304
Could someone point out what I am doing wrong/suggest some improvements?
Thanks!
I would use:
# sort by Item/WeekNr
df = df.sort_values(by=['Item', 'WeekNr'])
# identify Z
m = df['Status'].eq('Z')
# group by successive Status
group = df['Status'].ne(df.groupby('Item')['Status'].shift()).cumsum()
# set up grouper
g = df.groupby(group)
# add counter
df['counter'] = g.cumcount().add(1).where(m, 0)
# Get first WeekNr, or previous one for Z
df['Check'] = (g['WeekNr'].transform('first')
.where(m).groupby(df['Item']).ffill()
)
Output:
WeekNr Status Item counter Check
0 202301 A x 0 NaN
1 202302 A x 0 NaN
2 202303 A x 0 NaN
3 202304 Z x 1 202304.0
4 202305 Z x 2 202304.0
5 202301 Z y 1 202301.0
6 202302 A y 0 202301.0
7 202303 A y 0 202301.0
8 202304 Z y 1 202304.0
9 202305 Z y 2 202304.0
I want to divide Columns_A and Columns_B below into 3 columns each.
The approach I am thinking of (but I have no idea how to write it in Python):
breakdown Columns_A and Columns_B into 3 columns
merge pass_one, pass_two and pass_three
append Columns_C and Columns_D to the longest of the lists
Original data (I changed it to a list of lists):
Columns_A Columns_B Columns_C Columns_D
1 A X Y
1 A X Y
1 A X Y
2 B X Y
2 B X Y
3 C X Y
3 C X Y
3 C X Y
3 C X Y
11 D Z Q
12 E Z Q
12 E Z Q
12 E Z Q
13 F Z Q
13 F Z Q
What I would like to create:
Columns_A_1 Columns_B_1 Columns_A_2 Columns_B_2 Columns_A_3 Columns_B_3 Columns_C Columns_D
1 A 2 B 3 C X Y
1 A 2 B 3 C X Y
1 A Blank Blank 3 C X Y
Blank Blank Blank Blank 3 C X Y
11 D 12 E 13 F Z Q
Blank Blank 12 E 13 F Z Q
Blank Blank 12 E Blank Blank Z Q
Code that I tried but didn't work (no error, but pass_two and pass_three come out blank):
# 1) breakdown Columns_A and Columns_B into 3 columns
!pip install pandas
import pandas as pd
dic = {'Column_A': ["1","1","1","2","2","3","3","3","3","11","12","12","12","13","13"],
'Column_B': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C', 'D', 'E', 'E', 'E', 'F', 'F'],
'Column_C': ['X'] * 9 + ['Z'] * 6,
'Column_D': ['Y'] * 9 + ['Q'] * 6,}
df = pd.DataFrame(dic)
list_data = df.values.tolist()
pass_one = [] #Columns_A_1 and Columns_B_1
pass_two = [] #Columns_A_2 and Columns_B_2
pass_three = [] #Columns_A_3 and Columns_B_3
for row in list_data:
    Columns_A = row[0]
    Columns_B = row[1]
    Columns_C = row[2]
    Columns_D = row[3]
    list_one = [Columns_A, Columns_B] #I would like to append these data sets
    if Columns_C in Columns_C and Columns_A not in Columns_A:
        pass_two.append(list_one)
    if Columns_C in Columns_C and Columns_A not in Columns_A:
        pass_three.append(list_one)
    else:
        pass_one.append(list_one)
Once Columns_A and Columns_B are separated into 3 lists of lists:
I would like to merge pass_one, pass_two and pass_three
At last, append Columns_C and Columns_D to the longest of the lists
Does anyone have any ideas how to do this?
This is not a complete answer, but perhaps it'll get you one step further. I assumed your sort criterion was Column_A mod 10:
# create the column we can group by; column A integers mod 10
df['Column_A_sort'] = df['Column_A'].astype(int) % 10
# group by that value
g = df.groupby('Column_A_sort')
Iterating over the groups:
for i in g.groups:
    print(g.get_group(i))
prints:
Column_A Column_B Column_C Column_D Column_A_sort
0 1 A X Y 1
1 1 A X Y 1
2 1 A X Y 1
9 11 D Z Q 1
Column_A Column_B Column_C Column_D Column_A_sort
3 2 B X Y 2
4 2 B X Y 2
10 12 E Z Q 2
11 12 E Z Q 2
12 12 E Z Q 2
Column_A Column_B Column_C Column_D Column_A_sort
5 3 C X Y 3
6 3 C X Y 3
7 3 C X Y 3
8 3 C X Y 3
13 13 F Z Q 3
14 13 F Z Q 3
As ignoring_gravity suggests, in order to go further, it'd be helpful to understand exactly your criteria for sorting and recombining the columns.
I have the following dataframe:
NAME SIGNAL
a 0
b 0
c 0
d 0
e 1
f 1
g 1
h 0
i 0
j 0
k 0
l 0
m 0
n 1
o 1
p 1
q 1
r 0
s 0
t 0
I need to write a function that will allow me to extract another dataframe, or just modify the existing frame, based on a condition:
Get all rows (in my case the NAME column) where SIGNAL is 1, but also extract the 2 rows above and the 2 rows below each of them.
In my example, the function should return the following table:
NAME SIGNAL
c 0
d 0
e 1
f 1
g 1
h 0
i 0
j 0
l 0
m 0
n 1
o 1
p 1
q 1
r 0
s 0
Thanks!
UPDATE:
This is the code I have so far:
# Import pandas library
import pandas as pd
# initialize list of lists
data = [['a', 0], ['b', 0], ['c', 1], ['d', 1], ['e', 0], ['f', 0], ['g', 0], ['h', 1], ['i', 0], ['j', 0], ['k', 0]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['NAME', 'SIGNAL'])
# print dataframe.
print(df)
print("----------------")
for index, row in df.iterrows():
    #print(row['Name'], row['Age'])
    if((df.iloc[index]['SIGNAL'] == 1) & (df.iloc[index-1]['SIGNAL'] == 0)): #check when the signal changes from 0 to 1
        print(df.iloc[index]['NAME']) #first line with signal 1 after it was 0
        #print the above 2 lines
        print(df.iloc[index-1]['NAME'])
        print(df.iloc[index-2]['NAME'])
My dataframe is like:
NAME SIGNAL
0 a 0
1 b 0
2 c 1
3 d 1
4 e 0
5 f 0
6 g 0
7 h 1
8 i 0
9 j 0
10 k 0
My code is returning:
c
b
a
h
g
f
The problem here is that I cannot return the values of "d" and "e" + "f" (or "i" and "j"), because I get "IndexError: single positional indexer is out-of-bounds" if I try the condition:
(df.iloc[index]['SIGNAL'] == 1) & (df.iloc[index+1]['SIGNAL'] == 0)
Also, the extended bounds will be variable: sometimes I will work with 2 extra rows from the top and bottom, sometimes with more.
I am looking for a solution using dataframe functions and not iteration.
thanks!
This will return the desired data frame:
df[(df.shift(periods=-2, axis="rows").SIGNAL == 1)
   | (df.shift(periods=-1, axis="rows").SIGNAL == 1)
   | (df.SIGNAL == 1)
   | (df.shift(periods=1, axis="rows").SIGNAL == 1)
   | (df.shift(periods=2, axis="rows").SIGNAL == 1)]
Output:
NAME SIGNAL
c 0
d 0
e 1
f 1
g 1
h 0
i 0
l 0
m 0
n 1
o 1
p 1
q 1
r 0
s 0
Add .NAME to the end to get your series of names
2 c
3 d
4 e
5 f
6 g
7 h
8 i
11 l
12 m
13 n
14 o
15 p
16 q
17 r
18 s
Name: NAME, dtype: object
Update: for an arbitrarily large span
m = (df.shift(periods=-400, axis="rows").SIGNAL == 1)
for i in range(-399, 401):
    m = m | (df.shift(periods=i, axis="rows").SIGNAL == 1)
print(df[m])
Disclaimer:
This method may be inefficient for large spans
I have a dataframe in which one column (col1) contains either Y or N. I would like to assign values (random, non-repeating numbers) to another column (col2) based on col1: if the value in col1 is N, col2 gets a new number; if it is Y, col2 repeats the previous row's number. I tried a for loop over df.iterrows(), but the numbers in col2 came out equal for all Ns.
Example of the dataframe I want to get:
df = pd.DataFrame({'col1': ['N', 'Y', 'Y', 'N', 'N', 'Y'], 'col2': [1, 1, 1, 2, 3, 3]})
where for each new N a new number is assigned in the other column, while for each Y the number from the previous row is repeated.
Assuming a DataFrame df:
df = pd.DataFrame(['N', 'Y', 'Y', 'N', 'N', 'Y'], columns=['YN'])
YN
0 N
1 Y
2 Y
3 N
4 N
5 Y
Using itertuples (no repetition):
np.random.seed(42)
arr = np.arange(1, len(df[df.YN == 'N']) + 1)
np.random.shuffle(arr)
cnt = 0
for idx, val in enumerate(df.itertuples()):
    if df.YN[idx] == 'N':
        df.loc[idx, 'new'] = arr[cnt]
        cnt += 1
    else:
        df.loc[idx, 'new'] = np.NaN
df.new = df.new.ffill().astype(int)
df
YN new
0 N 1
1 Y 1
2 Y 1
3 N 2
4 N 3
5 Y 3
Using apply (repetition may arise with small number range):
np.random.seed(42)
df['new'] = df.YN.apply(lambda x: np.random.randint(10) if x == 'N' else np.NaN).ffill().astype(int)
YN new
0 N 6
1 Y 6
2 Y 6
3 N 3
4 N 7
5 Y 7
I've searched for an answer for the past 30 min, but the only solutions are either for a single column or in R. I have a dataset in which I want to change the ('Y/N') values to 1 and 0 respectively. I feel like copying and pasting the code below 17 times is very inefficient.
df.loc[df.infants == 'n', 'infants'] = 0
df.loc[df.infants == 'y', 'infants'] = 1
df.loc[df.infants == '?', 'infants'] = 1
My solution is the following. This doesn't cause an error, but the values in the dataframe don't change. I'm assuming I need to do something like df = df_new. But how to do this?
for coln in df:
    for value in coln:
        if value == 'y':
            value = '1'
        elif value == 'n':
            value = '0'
        else:
            value = '1'
EDIT: There are 17 columns in this dataset, but there is another dataset I'm hoping to tackle which contains 56 columns.
republican n y n.1 y.1 y.2 y.3 n.2 n.3 n.4 y.4 ? y.5 y.6 y.7 n.5 y.8
0 republican n y n y y y n n n n n y y y n ?
1 democrat ? y y ? y y n n n n y n y y n n
2 democrat n y y n ? y n n n n y n y n n y
3 democrat y y y n y y n n n n y ? y y y y
4 democrat n y y n y y n n n n n n y y y y
This should work:
for col in df.columns:
    df.loc[df[col] == 'n', col] = 0
    df.loc[df[col] == 'y', col] = 1
    df.loc[df[col] == '?', col] = 1
I think the simplest is to use replace with a dict:
np.random.seed(100)
df = pd.DataFrame(np.random.choice(['n','y','?'], size=(5,5)),
columns=list('ABCDE'))
print (df)
A B C D E
0 n n n ? ?
1 n ? y ? ?
2 ? ? y n n
3 n n ? n y
4 y ? ? n n
d = {'n':0,'y':1,'?':1}
df = df.replace(d)
print (df)
A B C D E
0 0 0 0 1 1
1 0 1 1 1 1
2 1 1 1 0 0
3 0 0 1 0 1
4 1 1 1 0 0
This should do:
df.infants = df.infants.map({'y': 1, 'n': 0, '?': 1})
Maybe you can try apply:
import pandas as pd
# create dataframe
number = [1,2,3,4,5]
sex = ['male','female','female','female','male']
df_new = pd.DataFrame()
df_new['number'] = number
df_new['sex'] = sex
df_new.head()
# create def for category to number 0/1
def tran_cat_to_num(df):
    if df['sex'] == 'male':
        return 1
    elif df['sex'] == 'female':
        return 0
# create sex_new
df_new['sex_new']=df_new.apply(tran_cat_to_num,axis=1)
df_new
raw
number sex
0 1 male
1 2 female
2 3 female
3 4 female
4 5 male
after using apply
number sex sex_new
0 1 male 1
1 2 female 0
2 3 female 0
3 4 female 0
4 5 male 1
You can change the values using the map function.
Ex.:
x = {'y': 1, 'n': 0, '?': 1}
for col in df.columns:
    df[col] = df[col].map(x)
This way you map each column of your dataframe.
All the solutions above are correct, but what you could also do is:
df["infants"] = df["infants"].replace("Y", 1).replace("N", 0).replace("?", 1)
which now that I read more carefully is very similar to using replace with dict !
Use dataframe.replace():
df.replace({'infants':{'y':1,'?':1,'n':0}},inplace=True)