I have a dataset like:
Id Status
1 0
1 0
1 0
1 0
1 1
2 0
1 0
2 0
3 0
3 0
I want to drop all rows of an id after its status became 1, i.e. my new dataset will be:
Id Status
1 0
1 0
1 0
1 0
1 1
2 0
2 0
3 0
3 0
i.e.
1 0 --> gets removed since this row appears after id 1 already had a status of 1
How can I implement this efficiently? My dataset is very large (200 GB+).
Thanks for your help.
Here's an idea:
You can create a dict mapping each Id to the index of its first row with status 1 (this assumes the index increases with row order, as the default integer index does):
d = df.loc[df["Status"]==1].drop_duplicates("Id")
d = dict(zip(d["Id"], d.index))
Then you create a column holding the index of the first status=1 row for each Id (NaN for an Id that never reaches 1):
df["first"] = df["Id"].map(d)
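Finally, you keep every row whose index is at most the value in the first column, along with all rows of Ids that never reach status 1 (their first value is NaN):
df = df.loc[df["first"].isna() | (df.index <= df["first"])]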
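EDIT: Revisiting this question a month later, there is actually a much simpler way with groupby and cumsum: group by Id and take the cumsum of Status, then keep only the rows that have no earlier 1 in their group. Subtracting Status from the cumsum counts the 1s strictly before each row, so the first status=1 row itself is kept:
df[(df.groupby('Id')['Status'].cumsum() - df['Status']) < 1]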
The best way I have found is to find the index of the first 1 and slice each group that way. In cases where no 1 exists, return the group unchanged:
def remove(group):
    indexless = group.reset_index(drop=True)
    ones = indexless[indexless['Status'] == 1]
    if len(ones) > 0:
        return indexless.iloc[:ones.index[0] + 1]
    else:
        return indexless
df.groupby('Id').apply(remove).reset_index(drop=True)
Output:
Id Status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 2 0
6 2 0
7 3 0
8 3 0
Use groupby with cumsum to select, for each Id, the rows from the first Status 1 onward.
res = df.groupby('Id', group_keys=False).apply(lambda x: x[x.Status.cumsum() > 0])
res
Id Status
4 1 1
6 1 0
Then exclude the rows of that selection where Status==0, i.e. the rows that come after the first 1.
not_select_id = res[res.Status==0].index
df[~df.index.isin(not_select_id)]
Id Status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 2 0
7 2 0
8 3 0
9 3 0
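Since the question mentions a 200 GB+ dataset that will not fit in memory, here is a minimal sketch of the same cumsum idea applied chunk by chunk. The function name drop_after_first_one and the chunk size are illustrative, not from the answers above; in practice the chunks might come from something like pd.read_csv(path, chunksize=...), and the sketch assumes the set of Ids that have reached status 1 fits in memory.

```python
import pandas as pd

def drop_after_first_one(chunks):
    """Yield each chunk without the rows that follow an Id's first Status == 1."""
    finished = set()  # Ids that already reached Status == 1 in an earlier chunk
    for chunk in chunks:
        # Count the 1s strictly before each row, per Id, within this chunk
        ones_before = chunk.groupby("Id")["Status"].cumsum() - chunk["Status"]
        keep = (ones_before == 0) & ~chunk["Id"].isin(finished)
        # Remember the Ids that hit 1 in this chunk for all later chunks
        finished.update(chunk.loc[chunk["Status"] == 1, "Id"].unique())
        yield chunk[keep]

# Small in-memory stand-in for a chunked reader (hypothetical chunk size of 4)
df = pd.DataFrame({"Id": [1, 1, 1, 1, 1, 2, 1, 2, 3, 3],
                   "Status": [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]})
chunks = [df.iloc[i:i + 4] for i in range(0, len(df), 4)]
result = pd.concat(list(drop_after_first_one(chunks)))
```

Each filtered chunk could be appended to an output file instead of concatenated, so only one chunk is held in memory at a time.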
a = [[0,0,0,0],[0,-1,1,0],[1,-1,1,0],[1,-1,1,0]]
df = pd.DataFrame(a, columns=['A','B','C','D'])
df
Output:
A B C D
0 0 0 0 0
1 0 -1 1 0
2 1 -1 1 0
3 1 -1 1 0
So reading down each column, the values all begin at 0 on the first row; once they change they can never change back, and they can only become 1 or -1. I would like to rearrange the dataframe columns into this order:
Columns that hit 1, earliest row first
Columns that hit -1, earliest row first
Finally, the remaining columns that never changed and stayed at zero (if there are any left)
Desired Output:
C A B D
0 0 0 0 0
1 1 0 -1 0
2 1 1 -1 0
3 1 1 -1 0
My main data frame has 3000 rows and 61 columns. Is there any way of doing this quickly?
We have to handle the positive and negative values separately. One way is to take the sum of each column and then use sort_values to adjust the ordering. Because a value never changes back, a column that changes earlier has a larger absolute sum:
a = df.sum().sort_values(ascending=False)
b = pd.concat((a[a.gt(0)],a[a.lt(0)].sort_values(),a[a.eq(0)]))
out = df.reindex(columns=b.index)
print(out)
C A B D
0 0 0 0 0
1 1 0 -1 0
2 1 1 -1 0
3 1 1 -1 0
Try pd.Series.first_valid_index, which gives the first nonzero row of each column once the zeros are masked out:
s = df.where(df.ne(0))
s1 = s.apply(pd.Series.first_valid_index)
s2 = s.bfill().iloc[0]
out = df.loc[:,pd.concat([s2,s1],axis=1,keys=[0,1]).sort_values([0,1],ascending=[False,True]).index]
out
Out[35]:
C A B D
0 0 0 0 0
1 1 0 -1 0
2 1 1 -1 0
3 1 1 -1 0
I have two dataframes (pandas.DataFrame), each looking as follows. Let's call the first one df_A:
code1 code2 code3 code4 code5
0 1 4 2 0 0
1 3 2 1 5 0
2 2 3 0 0 0
has1 has2 has3 has4 has5
0 1 1 0 1 0
1 1 1 0 0 1
2 0 1 1 0 0
The objects (rows) are each given up to 5 codes, shown by the five columns in the first df.
I instead want a binary representation of which codes each object has, as shown in the second df.
The dummy-value functions in pandas or scikit-learn take into account which position the code is written in, which is unimportant here.
The attempts I have with my own code have not worked due to my inexperience in python and pandas.
This case is different from others I have seen on stack overflow as all the columns represent the same thing.
Thank you!
Edit:
for colname in df_bin.columns:
    for row in range(len(df_codes)):
        if int(colname) in df_codes.iloc[[row]]:
            df_bin[colname][row]=1
This is one of the attempts I made so far.
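You can try stack, then str.get_dummies. The per-row sum is written with groupby(level=0).sum(), since sum(level=0) was removed in pandas 2.0:
s=df.stack().loc[lambda x : x!=0].astype(str).str.get_dummies().groupby(level=0).sum().add_prefix('Has')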
Has1 Has2 Has3 Has4 Has5
0 1 1 0 1 0
1 1 1 1 0 1
2 0 1 1 0 0
Let's try:
(df.stack().groupby(level=0)
   .value_counts()
   .unstack(fill_value=0)
   [range(1,6)]
   .add_prefix('has')
)
Output:
has1 has2 has3 has4 has5
0 1 1 0 1 0
1 1 1 1 0 1
2 0 1 1 0 0
Here's another way using pd.crosstab:
df_out = df.reset_index().melt('index')
df_out = pd.crosstab(df_out['index'], df_out['value']).drop(0, axis=1).add_prefix('has')
Output:
value has1 has2 has3 has4 has5
index
0 1 1 0 1 0
1 1 1 1 0 1
2 0 1 1 0 0
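The answers above count occurrences, so a code repeated within a row would give a value above 1. As a hedged alternative, here is a minimal NumPy broadcasting sketch that yields strictly 0/1 indicators. It assumes the codes are the integers 1 to 5 with 0 meaning "no code", as in the question; the names codes, labels and out are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 4, 2, 0, 0],
                   [3, 2, 1, 5, 0],
                   [2, 3, 0, 0, 0]],
                  columns=[f"code{i}" for i in range(1, 6)])

codes = df.to_numpy()       # shape (n_rows, 5)
labels = np.arange(1, 6)    # assumed set of possible codes: 1..5

# Compare every cell against every label, then ask whether any cell in the
# row matches each label. Position within the row is ignored.
has = (codes[:, :, None] == labels).any(axis=1).astype(int)

out = pd.DataFrame(has, index=df.index,
                   columns=[f"has{k}" for k in labels])
```

The intermediate boolean array has shape (n_rows, 5, 5), so for very many rows or codes one of the stack-based answers may use less memory.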
I have a dataframe which is something like this:
index buyedA total
a 2 4
b 1 2
and I need to turn it into something like this:
index buyedA total
a 1 1
a 1 1
a 0 1
a 0 1
b 1 1
b 0 1
I need for each index as many rows as specified by column total (each one filled with a value of 1), and if column buyedA says 2, I need 2 of those rows filled with a 1.
Is there a way to do so in Python?
Thanks!
Using repeat and a simple groupby
n = df.loc[df.index.repeat(df.total)].assign(total=1)
n['buyedA'] = n.groupby('index').total.cumsum().le(n.buyedA).astype(int)
index buyedA total
0 a 1 1
0 a 1 1
0 a 0 1
0 a 0 1
1 b 1 1
1 b 0 1
Let's try this:
# make sure 'index' is the dataframe index
df = df.set_index('index')
# use repeat and reindex
df_out = df.reindex(df.index.repeat(df['total'])).assign(total=1)
# limit buyedA by the row number within each index group
df_out['buyedA'] = ((df_out.groupby('index').cumcount() + 1) <= df_out['buyedA']).mul(1)
df_out
output:
buyedA total
index
a 1 1
a 1 1
a 0 1
a 0 1
b 1 1
b 0 1
As described above, I want to get the positional index of a DataFrame entry based on a condition. It should look something like this:
import pandas as pd
a = [[1,0,0,1],[0,1,0,1],[0,0,0,1]]
df = pd.DataFrame(a)
df
Out[61]:
0 1 2 3
0 1 0 0 1
1 0 1 0 1
2 0 0 0 1
I want to create a new column that returns the position of the first 1 in the corresponding row. So the end result should look like this:
Out[62]:
0 1 2 3 New
0 1 0 0 1 0
1 0 1 0 1 1
2 0 0 0 1 3
This is my first question on Stack Overflow, so sorry if I made any formal mistakes while asking it.
Any help appreciated.
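One possible approach, as a minimal sketch: compare the frame against 1 and take the position of the first True in each row with NumPy's argmax. Using -1 for rows that contain no 1 is an assumption added here, since the question does not say what should happen in that case.

```python
import numpy as np
import pandas as pd

a = [[1, 0, 0, 1], [0, 1, 0, 1], [0, 0, 0, 1]]
df = pd.DataFrame(a)

is_one = df.eq(1).to_numpy()
# argmax returns the position of the first True in each row
first_pos = is_one.argmax(axis=1)
# argmax also returns 0 when a row has no True at all, so flag those rows as -1
first_pos = np.where(is_one.any(axis=1), first_pos, -1)

df["New"] = first_pos
```

For the sample data this gives 0, 1 and 3 in the New column, matching the desired output.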
I have a DataFrame where a combination of column values (A,B,C) identifies a unique address. I would like to identify all such rows and assign each address a unique identifier that increments with every new address.
For example
A B C D E
0 1 1 0 1
0 1 2 0 1
0 1 1 1 1
0 1 3 0 1
0 1 2 1 0
0 1 1 2 1
I would like to generate the following
A B C D E ID
0 1 1 0 1 0
0 1 2 0 1 1
0 1 1 1 1 0
0 1 3 0 1 2
0 1 2 1 0 1
0 1 1 2 1 0
I tried the following:
id = 0
def set_id(df):
    global id
    df['ID'] = id
    id += 1
df.groupby(['A','B','C']).transform(set_id)
This returns a NULL dataframe... This is definitely not the way to do it. I am new to pandas. The above should actually use df[['A','B','C']].drop_duplicates() to get all unique values.
Thank you.
I think this is what you need :
df2 = df[['A','B','C']].drop_duplicates() #get unique values of ABC
df2 = df2.reset_index(drop = True).reset_index() #reset index to create a column named index
df2=df2.rename(columns = {'index':'ID'}) #rename index to ID
df = pd.merge(df,df2,on = ['A','B','C'],how = 'left') #append ID column with merge
# Create tuple triplet using values from columns A, B & C.
df['key'] = list(zip(df['A'], df['B'], df['C']))
# Sort dataframe on new `key` column.
df.sort_values('key', inplace=True)
# Use `shift` and `cumsum` to keep a running count of changes in key value.
df['ID'] = (df['key'] != df['key'].shift()).cumsum() - 1
# Clean up.
del df['key']
df.sort_index(inplace=True)
>>> df
A B C D E ID
0 0 1 1 0 1 0
1 0 1 2 0 1 1
2 0 1 1 1 1 0
3 0 1 3 0 1 2
4 0 1 2 1 0 1
5 0 1 1 2 1 0
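As a shorter alternative to both answers, pandas can number the groups directly with groupby(...).ngroup(). This is a minimal sketch on the question's sample data; passing sort=False is a choice made here so that IDs follow the order in which each address first appears.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [0, 0, 0, 0, 0, 0],
    "B": [1, 1, 1, 1, 1, 1],
    "C": [1, 2, 1, 3, 2, 1],
    "D": [0, 0, 1, 0, 1, 2],
    "E": [1, 1, 1, 1, 0, 1],
})

# ngroup() labels every row with the number of the group it belongs to;
# with sort=False, groups are numbered in order of first appearance.
df["ID"] = df.groupby(["A", "B", "C"], sort=False).ngroup()
```

With the default sort=True the IDs would follow the sorted order of the (A, B, C) keys instead, which happens to give the same result for this sample.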