My input dataframe;
MinA MinB MaxA MaxB
0 1 2 5 7
1 1 0 8 6
2 2 15 15
3 3
4 10
I want to merge "min" and "max" columns amongst themselves with priority (A columns have more priority than B columns).
If both columns are null, they should have default values, for min=0 for max=100.
Desired output is;
MinA MinB MaxA MaxB Min Max
0 1 2 5 7 1 5
1 1 0 8 6 1 8
2 2 15 15 2 15
3 3 3 100
4 10 0 10
Could you please help me about this?
This can be accomplished using mask. With your data that would look like the following:
df = pd.DataFrame({
'MinA': [1,1,2,None,None],
'MinB': [2,0,None,3,None],
'MaxA': [5,8,15,None,None],
'MaxB': [7,6,15,None,10],
})
# Create new Column, using A as the base, if it is Nan, then use B.
# Then do the same again using specified values
df['Min'] = df['MinA'].mask(pd.isna, df['MinB']).mask(pd.isna, 0)
df['Max'] = df['MaxA'].mask(pd.isna, df['MaxB']).mask(pd.isna, 100)
The above would result in the desired output:
MinA MinB MaxA MaxB Min Max
0 1 2 5 7 1 5
1 1 0 8 6 1 8
2 2 NaN 15 15 2 15
3 NaN 3 NaN NaN 3 100
4 NaN NaN NaN 10 0 10
Just use fillna() will be fine.
df['Min'] = df['MinA'].fillna(df['MinB']).fillna(0)
df['Max'] = df['MaxA'].fillna(df['MaxB']).fillna(100)
Related
How to compare between 2 dataframes by 0,1?
df1
L_ID L_Values
1 20-25
2 30-35
3 25
4 45
5 30-45
df2
From L_ID in df1, it represents to columns 1,2,3,4,5 in df2
Name 1 2 3 4 5
John 25 25 20 30 45
Zara 20 NaN NaN 25 30
Kim NaN NaN NaN 45 50
I would like to compare values in df2, is it range in df1?
yes = 0, no = 1
Expect output in df3
Name 1 2 3 4 5
John 0 1 1 1 0
Zara 0 1 1 1 0
Kim 1 1 1 0 1
The following should work
Just take care for Nan condition in
df3.col.iloc[i]==df3.col.iloc[i]
if Nan is in sting format, just replace this part of the code with
if df3.col.iloc[i]!='Nan'
The code is:
d={}
for i in range len(df1):
d[df1.L_ID.iloc[i]]=[]
temp=df1.L_Values.iloc[i].split('-')
if len(temp)==2:
d[df1.L_ID.iloc[i]]=temp
else:
d[df1.L_ID.iloc[i]]=[temp[0], temp[0]]
df3=df2.copy()
for i in range(len(df3)):
for col in df3.columns:
if df3.col.iloc[i]==df3.col.iloc[i] and d[col][0] <= df3.col.iloc[i] <= d[col][1]:
df3.col.iloc[i]=0
else:
df3.col.iloc[i]=1
print(df3)
I'm trying to group several group of columns to count or sum the rows in a pandas dataframe
I've checked many questions already and the most similar I found is this one > Groupby sum and count on multiple columns in python, but, by what I understand I have to do many steps to reach my goal. and was also looking at this link
As an example, I have the dataframe below:
import numpy as np
df = pd.DataFrame(np.random.randint(0,5,size=(5, 7)), columns=["grey2","red1","blue1","red2","red3","blue2","grey1"])
grey2 red1 blue1 red2 red3 blue2 grey1
0 4 3 0 2 4 0 2
1 4 2 0 4 0 3 1
2 1 1 3 1 1 3 1
3 4 4 1 4 1 1 1
4 3 4 1 0 3 3 1
I want to group here, all the columns by colour, for example, and what I would expect is:
If I sum the numbers,
blue 15
grey 22
red 34
If I count ( x > 0 ) then I will get,
blue 7
grey 10
red 13
this is what I have achieved so far, so now i will have to sum and then create a dataframe with the results, but if I have 100 groups,this would be very time consuming.
pd.pivot_table(data=df, index=df.index, values=["red1","red2","red3"], aggfunc='sum', margins=True)
red1 red2 red3
0 3 2 4
1 2 4 0
2 1 1 1
3 4 4 1
4 4 0 3
ALL 14 11 9
pd.pivot_table(data=df, index=df.index, values=["red1","red2","red3"], aggfunc='count', margins=True)
But here is also counting the zeros:
red1 red2 red3
0 1 1 1
1 1 1 1
2 1 1 1
3 1 1 1
4 1 1 1
All 5 5 5
Not sure how to alter the function to get my results, and I've already spend hours, hopefully you can help.
NOTE:
I only use colours in this example to simplify the case, but I could have around many columns and they are called col001 till col300, etc...
So, the groups could be:
blue = col131, col254, col005
red = col023, col190, col053
and so on.....
You can use pd.wide_to_long:
data= pd.wide_to_long(df.reset_index(), stubnames=['grey','red','blue'],
i='index',
j='group',
sep=''
)
Output:
# data
grey red blue
index group
0 1 2.0 3 0.0
2 4.0 2 0.0
3 NaN 4 NaN
1 1 1.0 2 0.0
2 4.0 4 3.0
3 NaN 0 NaN
2 1 1.0 1 3.0
2 1.0 1 3.0
3 NaN 1 NaN
3 1 1.0 4 1.0
2 4.0 4 1.0
3 NaN 1 NaN
4 1 1.0 4 1.0
2 3.0 0 3.0
3 NaN 3 NaN
And:
data.sum()
# grey 22.0
# red 34.0
# blue 15.0
# dtype: float64
data.gt(0).sum()
# grey 10
# red 13
# blue 7
# dtype: int64
Update wide_to_long is just a convenient shortcut for merge and rename. So if you have a dictionary {cat:[col_list]}, you could resolve to that:
groups = {'blue' : ['col131', 'col254', 'col005'],
'red' : ['col023', 'col190', 'col053']}
# create the inverse dictionary for mapping
inv_group = {v:k for k,v in groups.items()}
data = df.melt()
# map the original columns to group
data['group'] = data['variable'].map(inv_group)
# from now on, it's similar to other answers
# sum
data.groupby('group')['value'].sum()
# count
data['value'].gt(0).groupby(data['group']).sum()
The complication here is that you want to collapse both by rows and columns, which is generally difficult to do at the same time. We can melt to go from your wide format to a longer format, which then reduces the problem to a single groupby
# Get rid of the numbers + reshape
df.columns = pd.Index(df.columns.str.rstrip('0123456789'), name='color')
df = df.melt()
df.groupby('color').sum()
# value
#color
#blue 15
#grey 22
#red 34
df.value.gt(0).groupby(df.color).sum()
#color
#blue 7.0
#grey 10.0
#red 13.0
#Name: value, dtype: float64
With names that are less simple to group, we'd need to have the mapping somewhere, the steps are very similar:
# Unnecessary in this case, but more general
d = {'grey1': 'color_1', 'grey2': 'color_1',
'red1': 'color_2', 'red2': 'color_2', 'red3': 'color_2',
'blue1': 'color_3', 'blue2': 'color_3'}
df.columns = pd.Index(df.columns.map(d), name='color')
df = df.melt()
df.groupby('color').sum()
# value
#color
#color_1 22
#color_2 34
#color_3 15
Use:
df.groupby(df.columns.str.replace('\d+', ''),axis=1).sum().sum()
Output:
blue 15
grey 22
red 34
dtype: int64
this works regardless of the number of digits contained in the name of the columns:
df=df.add_suffix('22')
print(df)
grey22222 red12222 blue12222 red22222 red32222 blue22222 grey12222
0 4 3 0 2 4 0 2
1 4 2 0 4 0 3 1
2 1 1 3 1 1 3 1
3 4 4 1 4 1 1 1
4 3 4 1 0 3 3 1
df.groupby(df.columns.str.replace('\d+', ''),axis=1).sum().sum()
blue 15
grey 22
red 34
dtype: int64
You could also do something like this for the general case:
colors = {'blue':['blue1','blue2'], 'red':['red1','red2','red3'], 'grey':['grey1','grey2']}
orig_columns = df.columns
df.columns = [key for col in df.columns for key in colors.keys() if col in colors[key]]
print(df.groupby(level=0,axis=1).sum().sum())
df.columns = orig_columns
I have some dataframes where data is tagged in groups, let's say as such:
df1 = pd.DataFrame({'id':[1,3,7, 10,30, 70, 100, 300], 'name':[1,1,1,1,1,1,1,1], 'tag': [1,1,1, 2,2,2, 3,3]})
df2 = pd.DataFrame({'id':[2,5,6, 20, 50, 200, 500, 600], 'name': [2,2,2,2,2,2,2,2], 'tag':[1,1,1, 2, 2, 3,3,3]})
df3 = pd.DataFrame({'id':[4, 8, 9, 40, 400, 800, 900], 'name': [3,3,3,3,3,3,3], 'tag':[1,1,1, 2, 3, 3,3]})
In each dataframe, the tag is attibuted in an ascending order of ids (so bigger ids will have equal or bigger tags).
My wish is to recalculate tags in the concatenated dataframe,
df = pd.concat([df1, df2, df3])
so that the tag of each group will be in ascending order of ids of the first element of each. So, the group starting by id=1 will be tagged by 1 (that is, ids 1,3,7), the group starting by id=2 will be tagged by 2 (that is , ids 2,5,6), the group starting by 4 will be tagged by 3, the group starting by 10 will be tagged as 4, and so on.
I did manage to get a (complicated!) solution:
1) Get first row of each group , put those in a dataframe , sort by id and create the new tags:
dff = pd.concat([df1.groupby('tag').first(), df2.groupby('tag').first(), df3.groupby('tag').first()])
dff = dff.sort(['id'])
dff = dff.reset_index()
dff['new_tags'] = dff.index +1
2) Concatenate this dataframe with initial ones, drop_duplicates so as to keep the newly tagged rows, order by group , then propagate new tags:
df = pd.concat([dff, df1, df2, df3])
df = df.drop_duplicates(subset=['id', 'tag', 'name'])
df = df.sort(['name', 'tag'])
df = df.fillna(method = 'pad')
The new tags are exactly what needed, but my solution seems too complicated. Would you have a suggestion on how to make easier? I think I must be missing something!
Thanks in advance,
M.
Using pd.concat + keys , I break down the steps
df=pd.concat([df1,df2,df3],keys=[0,1,2])
df=df.reset_index(level=0)#get the level=0 index
df=df.sort_values(['tag','level_0']) # sort the value
df['New']=(df['tag'].diff().ne(0)|df['level_0'].diff().ne(0)).cumsum()
df
Out[110]:
level_0 id name tag New
0 0 1 1 1 1
1 0 3 1 1 1
2 0 7 1 1 1
0 1 2 2 1 2
1 1 5 2 1 2
2 1 6 2 1 2
0 2 4 3 1 3
1 2 8 3 1 3
2 2 9 3 1 3
3 0 10 1 2 4
4 0 30 1 2 4
5 0 70 1 2 4
3 1 20 2 2 5
4 1 50 2 2 5
3 2 40 3 2 6
6 0 100 1 3 7
7 0 300 1 3 7
5 1 200 2 3 8
6 1 500 2 3 8
7 1 600 2 3 8
4 2 400 3 3 9
5 2 800 3 3 9
6 2 900 3 3 9
Once concatenated, you can use groupby the columns 'tag' and 'name' with transform and first on the column 'id'. Then sort_values this series and cumsum the diff is more than 0 such as:
df = pd.concat([df1, df2, df3]).sort_values('id').reset_index(drop=True)
df['new'] = (df.groupby(['tag','name'])['id'].transform('first')
.sort_values().diff().ne(0.).cumsum())
and you get the expected output:
id name tag new
0 1 1 1 1
1 2 2 1 2
2 3 1 1 1
3 4 3 1 3
4 5 2 1 2
5 6 2 1 2
6 7 1 1 1
7 8 3 1 3
8 9 3 1 3
9 10 1 2 4
10 20 2 2 5
11 30 1 2 4
12 40 3 2 6
...
EDIT: to avoid using groupby, you can drop_duplicates and index to get the index of the first ids, create the column new with an incremental value using loc and range and then ffill after sort_values to fill the values:
df = pd.concat([df1, df2, df3]).sort_values('id').reset_index(drop=True)
list_ind = df.drop_duplicates(['name','tag']).index
df.loc[list_ind,'new'] = range(1,len(list_ind)+1)
df['new'] = df.sort_values(['tag','name'])['new'].ffill().astype(int)
and you get the same result
I am looking to create a new column in panda based on the value in the row. My sample data:
df=pd.DataFrame({"A":['a','a','a','a','a','a','b','b','b'],
"Sales":[2,3,7,1,4,3,5,6,9,10,11,8,7,13,14],
"Week":[1,2,3,4,5,11,1,2,3,4])
I want a new column "Last3WeekSales" corresponding to each week, having the sum of sales for the previous 3 weeks.
NOTE: Shift() won't work here as data for some weeks is missing.
Logic which I thought:
Checking the week no. in each row, then summing up the data from w-1, w-2, w-3.
Output required:
A Week Last3WeekSales
0 a 1 0
1 a 2 2
2 a 3 5
3 a 4 12
4 a 5 11
5 a 11 0
6 b 1 0
7 b 2 5
8 b 3 11
9 b 4 20
Use groupby, shift and rolling:
df['Last3WeekSales'] = df.groupby('A')['Sales']\
.apply(lambda x: x.shift(1)
.rolling(3, min_periods=1)
.sum())\
.fillna(0)
Output:
A Sales Week Last3WeekSales
0 a 2 1 0.0
1 a 3 2 2.0
2 a 7 3 5.0
3 a 1 4 12.0
4 a 4 5 11.0
5 a 3 6 12.0
6 b 5 1 0.0
7 b 6 2 5.0
8 b 9 3 11.0
you can use pandas.rolling_sum to sum over 3 last values, and shift(n) to shift your column by n times (1 in your case).
if we suppose you a column 'sales' with the sales of each week, the code would be :
df["Last3WeekSales"] = df.groupby("A")["sales"].apply(lambda x: pd.rolling_sum(x.shoft(1),3))
I have a csv like
A,B,C,D
1,2,,
1,2,30,100
1,2,40,100
4,5,,
4,5,60,200
4,5,70,200
8,9,,
In row 1 and row 4 C value is missing (NaN). I want to take their value from row 2 and 5 respectively. (First occurrence of same A,B value).
If no matching row is found, just put 0 (like in last line)
Expected op:
A,B,C,D
1,2,30,
1,2,30,100
1,2,40,100
4,5,60,
4,5,60,200
4,5,70,200
8,9,0,
using fillna I found bfill: use NEXT valid observation to fill gap but the NEXT observation has to be taken logically (looking at col A,B values) and not just the upcoming C column value
You'll have to call df.groupby on A and B first and then apply the bfill function:
In [501]: df.C = df.groupby(['A', 'B']).apply(lambda x: x.C.bfill()).reset_index(drop=True)
In [502]: df
Out[502]:
A B C D
0 1 2 30 NaN
1 1 2 30 100.0
2 1 2 40 100.0
3 4 5 60 NaN
4 4 5 60 200.0
5 4 5 70 200.0
6 8 9 0 NaN
You can also group and then call dfGroupBy.bfill directly (I think this would be faster):
In [508]: df.C = df.groupby(['A', 'B']).C.bfill().fillna(0).astype(int); df
Out[508]:
A B C D
0 1 2 30 NaN
1 1 2 30 100.0
2 1 2 40 100.0
3 4 5 60 NaN
4 4 5 60 200.0
5 4 5 70 200.0
6 8 9 0 NaN
If you wish to get rid of NaNs in D, you could do:
df.D.fillna('', inplace=True)