How to create new dataframe rows based on df value - python

I have a dataframe which is something like this:
index buyedA total
a 2 4
b 1 2
and I need to turn it into something like this:
index buyedA total
a 1 1
a 1 1
a 0 1
a 0 1
b 1 1
b 0 1
For each index, I need as many rows as column total specifies, each with total set to 1. If column buyedA says 2, the first 2 of those rows should have buyedA set to 1 and the rest 0.
Is there a way to do so in Python?
Thanks!

Using repeat and a simple groupby
n = df.loc[df.index.repeat(df.total)].assign(total=1)
n['buyedA'] = n.groupby('index').total.cumsum().le(n.buyedA).astype(int)
index buyedA total
0 a 1 1
0 a 1 1
0 a 0 1
0 a 0 1
1 b 1 1
1 b 0 1

Let's try this:
#make sure index is in the dataframe index
df=df.set_index('index')
#use repeat and reindex
df_out = df.reindex(df.index.repeat(df['total'])).assign(total=1)
#Limit buyedA by row number in each group of index
df_out['buyedA'] = ((df_out.groupby('index').cumcount() + 1) <= df_out['buyedA']).mul(1)
df_out
output:
buyedA total
index
a 1 1
a 1 1
a 0 1
a 0 1
b 1 1
b 0 1
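For reference, the repeat-and-count idea shared by both answers can be put together as a self-contained sketch. It assumes index is an ordinary column (as in the first answer); the names out and pos are only illustrative and do not come from the original posts.

```python
import pandas as pd

# Sample data from the question; "index" is a regular column here (assumption)
df = pd.DataFrame({"index": ["a", "b"], "buyedA": [2, 1], "total": [4, 2]})

# Repeat each row `total` times
out = df.loc[df.index.repeat(df["total"])].reset_index(drop=True)

# Position of each row within its group: 0, 1, 2, ...
pos = out.groupby("index").cumcount()

# The first `buyedA` rows of each group get 1, the rest 0
out["buyedA"] = (pos < out["buyedA"]).astype(int)
out["total"] = 1
print(out)
```

Using cumcount avoids relying on total already being 1 when the comparison is made, so the two assignments can happen in either order.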

Related

Re-order Columns In A Data Frame Depending On Conditions Of Values

a = [[0,0,0,0],[0,-1,1,0],[1,-1,1,0],[1,-1,1,0]]
df = pd.DataFrame(a, columns=['A','B','C','D'])
df
Output:
A B C D
0 0 0 0 0
1 0 -1 1 0
2 1 -1 1 0
3 1 -1 1 0
Reading down each column, the values all begin at 0 in the first row. Once a value changes it can never change back, and it can become either 1 or -1. I would like to rearrange the dataframe columns so that they appear in this order:
Columns that hit 1, earliest row first
Columns that hit -1, earliest row first
Finally, the remaining columns that never changed and stayed at zero (if there are any left)
Desired Output:
C A B D
0 0 0 0 0
1 1 0 -1 0
2 1 1 -1 0
3 1 1 -1 0
My main data frame is 3000 rows by 61 columns. Is there any way of doing this quickly?
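We have to handle the positive and negative values separately. Because a value never changes back, a column that hits 1 earlier has a larger sum, and a column that hits -1 earlier has a more negative sum. So one way is to take the sum of each column and then use sort_values to adjust the ordering: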
a = df.sum().sort_values(ascending=False)
b = pd.concat((a[a.gt(0)],a[a.lt(0)].sort_values(),a[a.eq(0)]))
out = df.reindex(columns=b.index)
print(out)
C A B D
0 0 0 0 0
1 1 0 -1 0
2 1 1 -1 0
3 1 1 -1 0
Try with pd.Series.first_valid_index
s = df.where(df.ne(0))
s1 = s.apply(pd.Series.first_valid_index)
s2 = s.bfill().iloc[0]
out = df.loc[:,pd.concat([s2,s1],axis=1,keys=[0,1]).sort_values([0,1],ascending=[False,True]).index]
out
Out[35]:
C A B D
0 0 0 0 0
1 1 0 -1 0
2 1 1 -1 0
3 1 1 -1 0
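As an alternative for larger frames, the same ordering can be sketched with plain NumPy arrays. This is only an illustration: it relies on the question's guarantee that a value never changes back once it leaves 0, so the last row gives each column's sign. The names nonzero, first_row, group and order are made up for the example.

```python
import numpy as np
import pandas as pd

a = [[0, 0, 0, 0], [0, -1, 1, 0], [1, -1, 1, 0], [1, -1, 1, 0]]
df = pd.DataFrame(a, columns=["A", "B", "C", "D"])

vals = df.to_numpy()
nonzero = vals != 0

# Row position of the first non-zero value per column; len(df) if the column never changes
first_row = np.where(nonzero.any(axis=0), nonzero.argmax(axis=0), len(df))

# Group 0: columns ending at 1, group 1: columns ending at -1, group 2: all-zero columns
final = vals[-1]
group = np.select([final > 0, final < 0], [0, 1], default=2)

# lexsort uses the LAST key as the primary key: sort by group, then by first_row
order = np.lexsort((first_row, group))
out = df.iloc[:, order]
print(out)
```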

Drop Rows of an id after a particular column value in Pandas

I have a dataset like:
Id Status
1 0
1 0
1 0
1 0
1 1
2 0
1 0
2 0
3 0
3 0
I want to drop all rows of an id after its status became 1, i.e. my new dataset will be:
Id Status
1 0
1 0
1 0
1 0
1 1
2 0
2 0
3 0
3 0
i.e.
1 0 --> gets removed since this row appears after id 1 already had a status of 1
How can I implement this efficiently, given that I have a very large (200 GB+) dataset?
Thanks for your help.
Here's an idea:
You can create a dict with the first index where the status is 1 for each ID (assuming the DataFrame is sorted by ID):
d = df.loc[df["Status"]==1].drop_duplicates("Id")
d = dict(zip(d["Id"], d.index))
Then you create a column with the index of the first status=1 for each Id:
df["first"] = df["Id"].map(d)
Finally you keep every row whose index is at most the value in the first column, along with all rows of Ids that never reach status 1 (where first is NaN):
df = df.loc[df["first"].isna() | (df.index <= df["first"])]
EDIT: Revisiting this question a month later, there is actually a much simpler way with groupby and cumsum: group by Id, take the cumsum of Status, and subtract the current Status so that each row counts only the 1s before it. Then keep the rows where that count is 0, which retains the first row with status 1 itself:
df[(df.groupby('Id')['Status'].cumsum() - df['Status']) < 1]
The best way I have found is to find the index of the first 1 and slice each group that way. In cases where no 1 exists, return the group unchanged:
def remove(series):
    indexless = series.reset_index(drop=True)
    ones = indexless[indexless['Status'] == 1]
    if len(ones) > 0:
        return indexless.iloc[:ones.index[0] + 1]
    else:
        return indexless
df.groupby('Id').apply(remove).reset_index(drop=True)
Output:
Id Status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 2 0
6 2 0
7 3 0
8 3 0
Use groupby with cumsum to find, for each Id, the rows at or after the first status of 1.
res = df.groupby('Id', group_keys=False).apply(lambda x: x[x.Status.cumsum() > 0])
res
Id Status
4 1 1
6 1 0
Then exclude the rows of res whose Status is 0, since those come after the first 1.
not_select_id = res[res.Status==0].index
df[~df.index.isin(not_select_id)]
Id Status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 2 0
7 2 0
8 3 0
9 3 0
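For the 200 GB case mentioned in the question, one possible approach is to process the data in chunks and remember which Ids have already reached status 1. The sketch below is only an illustration under stated assumptions: the rows arrive in their original order, the set of finished Ids fits in memory, and the helper name drop_after_first_one is hypothetical. In practice the chunks could come from something like pd.read_csv(..., chunksize=...); here they are simulated by slicing the sample frame.

```python
import pandas as pd

def drop_after_first_one(chunks):
    """Yield each chunk without the rows that follow an Id's first Status == 1."""
    closed = set()  # Ids that reached status 1 in an earlier chunk
    for chunk in chunks:
        is_one = chunk["Status"].eq(1).astype(int)
        # Number of 1s seen for the same Id earlier in this chunk
        ones_before = is_one.groupby(chunk["Id"]).cumsum() - is_one
        keep = ~chunk["Id"].isin(closed) & ones_before.eq(0)
        # Remember the Ids that hit 1 in this chunk for later chunks
        closed.update(chunk.loc[is_one.eq(1), "Id"].unique())
        yield chunk[keep]

df = pd.DataFrame({
    "Id":     [1, 1, 1, 1, 1, 2, 1, 2, 3, 3],
    "Status": [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
})

# Simulate chunked reading with slices of 4 rows
chunks = [df.iloc[i:i + 4] for i in range(0, len(df), 4)]
result = pd.concat(drop_after_first_one(chunks))
print(result)
```

Each filtered chunk can be written out as soon as it is produced, so only one chunk plus the set of finished Ids needs to be held in memory at a time.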

pandas: adding rows in a column

I have a dataframe like this:
Count
1
0
1
1
1
I want each row to hold the sum of the Count column up to and including that row. Is it possible to do this the pandas way?
The result should look like this; technically it is a cumulative sum:
Counts
1
1
2
3
4
You can use the cumulative sum function, cumsum().
df = pd.DataFrame([1, 0, 1, 1,1], columns=['Count'])
df['Counts'] = df['Count'].cumsum()
print(df)
giving you the desired output.
Count Counts
0 1 1
1 0 1
2 1 2
3 1 3
4 1 4

Pandas DataFrame Groupby to get Unique row condition and identify with increasing value up to Number of Groups

I have a DataFrame where a combination of column values identify a unique address (A,B,C). I would like to identify all such rows and assign them a unique identifier that I increment per address.
For example
A B C D E
0 1 1 0 1
0 1 2 0 1
0 1 1 1 1
0 1 3 0 1
0 1 2 1 0
0 1 1 2 1
I would like to generate the following
A B C D E ID
0 1 1 0 1 0
0 1 2 0 1 1
0 1 1 1 1 0
0 1 3 0 1 2
0 1 2 1 0 1
0 1 1 2 1 0
I tried the following:
id = 0
def set_id(df):
    global id
    df['ID'] = id
    id += 1
df.groupby(['A','B','C']).transform(set_id)
This returns a NULL dataframe. This is definitely not the way to do it; I am new to pandas. The above should probably use df[['A','B','C']].drop_duplicates() to get all unique values.
Thank you.
I think this is what you need :
df2 = df[['A','B','C']].drop_duplicates() #get unique values of ABC
df2 = df2.reset_index(drop = True).reset_index() #reset index to create a column named index
df2=df2.rename(columns = {'index':'ID'}) #rename index to ID
df = pd.merge(df,df2,on = ['A','B','C'],how = 'left') #append ID column with merge
# Create tuple triplet using values from columns A, B & C.
df['key'] = [triplet for triplet in zip(*[df[col].values.tolist() for col in ['A', 'B', 'C']])]
# Sort dataframe on new `key` column.
df.sort_values('key', inplace=True)
# Use `shift` and `cumsum` to keep a running count of changes in key value.
df['ID'] = (df['key'] != df['key'].shift()).cumsum() - 1
# Clean up.
del df['key']
df.sort_index(inplace=True)
>>> df
A B C D E ID
0 0 1 1 0 1 0
1 0 1 2 0 1 1
2 0 1 1 1 1 0
3 0 1 3 0 1 2
4 0 1 2 1 0 1
5 0 1 1 2 1 0
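A shorter route that may be worth trying is GroupBy.ngroup, which numbers the groups directly. The sketch below passes sort=False with the intent that IDs follow the order in which each (A, B, C) combination first appears; with the sample data the sorted order happens to give the same numbering.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "A": [0, 0, 0, 0, 0, 0],
    "B": [1, 1, 1, 1, 1, 1],
    "C": [1, 2, 1, 3, 2, 1],
    "D": [0, 0, 1, 0, 1, 2],
    "E": [1, 1, 1, 1, 0, 1],
})

# One integer per distinct (A, B, C) combination, starting at 0
df["ID"] = df.groupby(["A", "B", "C"], sort=False).ngroup()
print(df)
```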

How to use trailing rows on a column for calculations on that same column | Pandas Python

I'm trying to figure out how to compare the element of the previous row of a column to a different column on the current row in a Pandas DataFrame. For example:
data = pd.DataFrame({'a':['1','1','1','1','1'],'b':['0','0','1','0','0']})
Output:
a b
0 1 0
1 1 0
2 1 1
3 1 0
4 1 0
And now I want to make a new column that asks if (data['a'] + data['b']) is greater than the previous value of that same column.
Theoretically:
data['c'] = np.where(data['a']==( the previous row value of data['a'] ),min((data['b']+( the previous row value of data['c'] )),1),data['b'])
So that I can theoretically output:
a b c
0 1 0 0
1 1 0 0
2 1 1 1
3 1 0 1
4 1 0 1
I'm wondering how to do this because I'm trying to recreate this excel conditional statement: =IF(A70=A69,MIN((P70+Q69),1),P70)
where data['a'] = column A and data['b'] = column P.
If anyone has any ideas on how to do this, I'd greatly appreciate your advice.
Going by your statement, 'new column that asks if (data['a'] + data['b']) is greater than the previous value of that same column', I can suggest solving it this way:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'a':['1','1','1','1','1'],'b':['0','0','1','0','3']})
>>> df
a b
0 1 0
1 1 0
2 1 1
3 1 0
4 1 3
>>> df['c'] = np.where(df['a']+df['b'] > df['a'].shift(1)+df['b'].shift(1), 1, 0)
>>> df
a b c
0 1 0 0
1 1 0 0
2 1 1 1
3 1 0 0
4 1 3 1
But this does not look at the 'previous value of that same column'.
If you try to use df['c'].shift(1) inside np.where(), it will raise KeyError: 'c', because column c does not exist yet.
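The Excel formula in the question is recursive (each c depends on the previous c), which is why a single np.where cannot express it. A possible sketch follows, assuming b only ever contains 0 and 1 once converted to integers. Under that assumption the formula amounts to "has b been 1 yet within the current run of equal a values", which a grouped cumsum capped at 1 can compute. The names run_id and c_loop are illustrative only; the loop is a direct row-by-row translation kept for comparison.

```python
import pandas as pd

data = pd.DataFrame({"a": ["1", "1", "1", "1", "1"],
                     "b": ["0", "0", "1", "0", "0"]})
b = data["b"].astype(int)  # the sample stores numbers as strings

# Direct translation of =IF(A70=A69, MIN(P70+Q69, 1), P70), one row at a time
c = []
for i in range(len(data)):
    if i > 0 and data["a"].iloc[i] == data["a"].iloc[i - 1]:
        c.append(min(b.iloc[i] + c[-1], 1))
    else:
        c.append(b.iloc[i])
data["c_loop"] = c

# Vectorized version: label runs of consecutive equal `a`, then cap the running sum at 1
run_id = data["a"].ne(data["a"].shift()).cumsum()
data["c"] = b.groupby(run_id).cumsum().clip(upper=1)
print(data)
```

If b can hold values other than 0 and 1, the two versions may disagree on the first row of a run, so the loop is the safer reference in that case.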
