Fill duplicates with missing value after grouping with some logic - python

I have a dataframe. I need to drop the duplicates of ticket_id if the owner_type is the same; if not, pick 'm' over 's'. If no value is picked, a NaN is returned:
data = pd.DataFrame({'owner_type':['m','m','m','s','s','m','s','s'],'ticket_id':[1,1,2,2,3,3,4,4]})
| | owner_type | ticket_id |
|---:|:-------------|------------:|
| 0 | m | 1 |
| 1 | m | 1 |
| 2 | m | 2 |
| 3 | s | 2 |
| 4 | s | 3 |
| 5 | m | 3 |
| 6 | s | 4 |
| 7 | s | 4 |
Should give back:
| | owner_type | ticket_id |
|---:|:-------------|------------:|
| 0 | m | NaN |
| 1 | m | NaN |
| 2 | m | 2 |
| 3 | s | NaN |
| 4 | s | NaN |
| 5 | m | 3 |
| 6 | s | NaN |
| 7 | s | NaN |
Pseudo code would be like: if ticket_id is duplicated, look at owner_type; if owner_type has more than one value, return the value for 'm' and NaN for 's'.
My attempt:
data.groupby('ticket_id').apply(lambda x: x['owner_type'] if len(x) < 2 else np.nan)
This is not working.

Try this:
(df['ticket_id'].where(
    # keep the id only when this (owner_type, ticket_id) pair is not duplicated ...
    ~df.duplicated(['owner_type', 'ticket_id'], keep=False) &
    # ... and this row holds the group's "smallest" owner_type, which picks 'm'
    # over 's' because 'm' sorts before 's'
    df['owner_type'].eq(df.groupby('ticket_id')['owner_type'].transform('min'))))
Old answer:
m = ~df.duplicated(keep=False) & df['owner_type'].eq('m')
df['ticket_id'].where(m)
Output:
0 NaN
1 NaN
2 2.0
3 NaN
4 NaN
5 3.0
6 NaN
7 NaN
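For completeness, here is a minimal, self-contained sketch of the first approach applied to the sample data from the question (the answer refers to the frame as df, so the sample data is built under that name):
import pandas as pd

df = pd.DataFrame({'owner_type': ['m', 'm', 'm', 's', 's', 'm', 's', 's'],
                   'ticket_id': [1, 1, 2, 2, 3, 3, 4, 4]})

# keep ticket_id only on the non-duplicated row whose owner_type is the
# alphabetical minimum of its ticket group ('m' beats 's'); all other rows become NaN
df['ticket_id'] = df['ticket_id'].where(
    ~df.duplicated(['owner_type', 'ticket_id'], keep=False)
    & df['owner_type'].eq(df.groupby('ticket_id')['owner_type'].transform('min'))
)
print(df)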

Related

Pandas.DataFrame: efficient way to add a column "seconds since last event"

I have a Pandas.DataFrame with a standard index representing seconds, and I want to add a column "seconds elapsed since last event" where the events are given in a list. Specifically, say
event = [2, 5]
and
df = pd.DataFrame(np.zeros((7, 1)))
| | 0 |
|---:|----:|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
| 5 | 0 |
| 6 | 0 |
Then I want to obtain
| | 0 | x |
|---:|----:|-----:|
| 0 | 0 | <NA> |
| 1 | 0 | <NA> |
| 2 | 0 | 0 |
| 3 | 0 | 1 |
| 4 | 0 | 2 |
| 5 | 0 | 0 |
| 6 | 0 | 1 |
I tried
df["x"] = pd.Series(range(5)).shift(2)
| | 0 | x |
|---:|----:|----:|
| 0 | 0 | nan |
| 1 | 0 | nan |
| 2 | 0 | 0 |
| 3 | 0 | 1 |
| 4 | 0 | 2 |
| 5 | 0 | nan |
| 6 | 0 | nan |
so apparently to make it work I need to write df["x"] = pd.Series(range(5+2)).shift(2).
More importantly, when I then do df["x"] = pd.Series(range(2+5)).shift(5) I obtain
| | 0 | x |
|---:|----:|----:|
| 0 | 0 | nan |
| 1 | 0 | nan |
| 2 | 0 | nan |
| 3 | 0 | nan |
| 4 | 0 | nan |
| 5 | 0 | 0 |
| 6 | 0 | 1 |
That is: the previous values have been overwritten. Is there a way to assign new values without overwriting existing values with NaN?
Then I could do something like
for i in event:
    df["x"] = pd.Series(range(len(df))).shift(i)
Or is there a more efficient way?
For the record, here is my naive code. It works, but looks inefficient and of poor design:
c = 1000000
df["x"] = c
if event:
    idx = 0
    for i in df.itertuples():
        print(i)
        if idx < len(event) and i.Index == event[idx]:
            c = 0
            idx += 1
        df.loc[i.Index, "x"] = c
        c += 1
return df
IIUC, you can do double groupby:
s = df.index.isin(event).cumsum()
# s labels the stretch of rows starting at each event; rows before the first event get 0
# or equivalently
# s = df.loc[event, 0].reindex(df.index).notna().cumsum()
df['x'] = np.where(s > 0, df.groupby(s).cumcount(), np.nan)
Output:
0 x
0 0.0 NaN
1 0.0 NaN
2 0.0 0.0
3 0.0 1.0
4 0.0 2.0
5 0.0 0.0
6 0.0 1.0
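For convenience, a self-contained sketch of the approach above, reusing the event list and df from the question:
import numpy as np
import pandas as pd

event = [2, 5]
df = pd.DataFrame(np.zeros((7, 1)))

s = df.index.isin(event).cumsum()          # running count of events seen so far
df['x'] = np.where(s > 0, df.groupby(s).cumcount(), np.nan)
print(df)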
Let's try this:
df = pd.DataFrame(np.zeros((7, 1)))
event = [2, 5]
df.loc[event, 0] = 1                                    # mark the event rows
df = df.replace(0, np.nan)                              # non-event rows become NaN
grp = df[0].cumsum().ffill()                            # running event counter, NaN before the first event
df['x'] = df.groupby(grp).cumcount().mask(grp.isna())   # count within each stretch, mask rows before the first event
df
Output:
| | 0 | x |
|---:|----:|----:|
| 0 | nan | nan |
| 1 | nan | nan |
| 2 | 1 | 0 |
| 3 | nan | 1 |
| 4 | nan | 2 |
| 5 | 1 | 0 |
| 6 | nan | 1 |

Append column values from one column to another based on value in another column

I have a dataframe of connected values (edges and nodes). It shows how family and friends are connected and it looks like:
+---------------+--------------+--------------+----------------+-----------------+-------------+------------+--------------+------------+--------------+------------+--------------+-------------------+-------------------+-------------------+
| Orginal_Match | Orginal_Name | Connected_ID | Connected_Name | Connection_Type | Match-Final | ID_Match_0 | Name_Match_0 | ID_Match_1 | Name_Match_1 | ID_match_2 | Name_Match_2 | Connection_Type_0 | Connection_Type_1 | Connection_Type_2 |
+---------------+--------------+--------------+----------------+-----------------+-------------+------------+--------------+------------+--------------+------------+--------------+-------------------+-------------------+-------------------+
| 1 | A | 2 | B | FRIEND | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | A | 4 | E | FAMILY | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | A | 3 | F | FRIEND | 2 | 3 | C | 11 | H | 2 | B | FRIEND | FRIEND | FRIEND |
| 1 | A | 5 | G | FRIEND | 2 | 4 | E | NaN | NaN | NaN | NaN | FAMILY | NaN | NaN |
| 1 | A | 6 | D | FRIEND | 2 | 3 | C | NaN | NaN | NaN | NaN | FRIEND | NaN | NaN |
| 1 | A | 7 | B | FAMILY | 2 | 2 | B | NaN | NaN | NaN | NaN | FRIEND | NaN | NaN |
| 1 | A | 7 | B | FRIEND | 2 | 2 | B | NaN | NaN | NaN | NaN | FRIEND | NaN | NaN |
| 1 | A | 8 | B | FRIEND | 2 | 2 | B | NaN | NaN | NaN | NaN | FRIEND | NaN | NaN |
| 1 | A | 9 | C | OTHER | 2 | 3 | C | NaN | NaN | NaN | NaN | FRIEND | NaN | NaN |
| 1 | A | 10 | I | FRIEND | 3 | 3 | C | 6 | D | NaN | NaN | FRIEND | FRIEND | NaN |
+---------------+--------------+--------------+----------------+-----------------+-------------+------------+--------------+------------+--------------+------------+--------------+-------------------+-------------------+-------------------+
In the above dataframe, Original_Match is connected to Connected_ID either directly or indirectly. Match-Final says how many hops separate them. Thus, if Match-Final is 1, Original_Match and Connected_ID are directly connected. For any value x in Match-Final, the number of intermediate nodes along a single path between Original_Match and Connected_ID is x-1 (that is, the number of edges one must walk along to get from Original_Match to Connected_ID is equal to Match-Final; there can still be n nodes at each step, but they are all the same distance away). The ID_Match_0 through ID_Match_n columns list all the IDs that were connected in between the previous node and the next node.
As a note for clarity, Match-Final only states the number of edges between the first and last node in the connection if you were to walk along one path. It has no bearing on the number of nodes at each step. Therefore Match-Final could be 2, meaning that you would need to walk along two edges to get from Orginal_Match to Connected_ID, but there could be n paths you could take to do so, because Original_Match may be connected to n nodes, which are then connected to Connected_ID. Thus they are still only two steps away from one another.
So for example, in the above dataframe, the data states that:
Row 0: Match-Final == 1, so Original_Match is connected directly to Connected_ID and they are connected via Connection_Type. Therefore
1A---FRIEND--2B
__________________________________________________________________________________________________
Row 2: Match-Final == 2, so Original_Match is connected to Connected_ID via ID_Match_0, ID_Match_1, ID_Match_2, using all the corresponding Connection_Type columns. Therefore
11H---------------------#
| |
FRIEND FRIEND
| |
1--FRIEND--3C--FRIEND--3F
| |
FRIEND FRIEND
| |
2B---------------------#
_________________________________________________________________________________________
Row 9: Match-Final == 3, so Original_Match is connected to something, which is then connected to ID_Match_1, ID_Match_2, which then connects to Connected_ID. Therefore
1A--FRIEND--3C--FRIEND--10I
In order to make this a network graph, however, I need to transform the dataframe above into:
+---------------+--------------+--------------+----------------+-----------------+
| Orginal_Match | Orginal_Name | Connected_ID | Connected_Name | Connection_Type |
+---------------+--------------+--------------+----------------+-----------------+
| 1 | A | 2 | B | FRIEND |
| 1 | A | 4 | E | FAMILY |
| 1 | A | 3 | F | FRIEND |
| 1 | A | 5 | G | FRIEND |
| 1 | A | 6 | D | FRIEND |
| 1 | A | 7 | B | FAMILY |
| 1 | A | 7 | B | FRIEND |
| 1 | A | 8 | B | FRIEND |
| 1 | A | 9 | C | OTHER |
| 1 | A | 10 | I | FRIEND |
| 3 | C | 3 | F | FRIEND |
| 11 | H | 3 | F | FRIEND |
| 2 | B | 3 | F | FRIEND |
| 4 | E | 5 | G | FAMILY |
| 3 | C | 6 | D | FRIEND |
| 2 | B | 7 | B | FRIEND |
| 2 | B | 7 | B | FRIEND |
| 2 | B | 8 | B | FRIEND |
| 3 | C | 9 | C | FRIEND |
| 3 | C | 10 | I | FRIEND |
| 6 | D | 10 | I | FRIEND |
+---------------+--------------+--------------+----------------+-----------------+
Which means that I need to append the values in ID_Match_0,..., ID_Match_n and Name_Match_0,..., Name_Match_n to Original_Match and Connected_ID based on where they match and the number in Match-Final. I also need to append the Connection_Type_n to Connection_Type via the same criteria.
This would need to be looped for n number of ID_Match, Name_Match, and Connection_Type columns.
I have considered using np.where but I haven't gotten anywhere with it. Any help would be greatly appreciated!
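One possible starting point, shown as a rough sketch rather than a verified solution: treat every populated ID_Match_i / Name_Match_i / Connection_Type_i triple as an extra edge ending at Connected_ID, and append those edges to the direct ones. It assumes the numbered columns are spelled consistently (ID_Match_0..n, Name_Match_0..n, Connection_Type_0..n) and uses the column names from the table above:
import pandas as pd

edge_cols = ['Orginal_Match', 'Orginal_Name', 'Connected_ID',
             'Connected_Name', 'Connection_Type']
n_matches = 3  # number of ID_Match_i / Name_Match_i / Connection_Type_i triples

pieces = [df[edge_cols]]                      # the direct edges
for i in range(n_matches):
    extra = df[[f'ID_Match_{i}', f'Name_Match_{i}', 'Connected_ID',
                'Connected_Name', f'Connection_Type_{i}']].copy()
    extra.columns = edge_cols                 # rename to the edge-list schema
    pieces.append(extra.dropna(subset=['Orginal_Match']))

edges = pd.concat(pieces, ignore_index=True)
The row order will differ from the target table above, but the set of edges should match; the Match-Final column is not needed here because rows with fewer hops simply have NaN in the unused ID_Match_i columns and are dropped.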

I have obtained another two dataframes from my original one; how can I merge the columns that I need into a final one

I have a table with 4 columns. From this data I obtained another 2 tables with some rolling averages from the original table. Now I want to combine these 3 into a final table, but the indexes are not in order now and I can't do it. I just started to learn Python, I have zero experience, and I would really need all the help I can get.
DF
+----+------------+-----------+------+------+
| | A | B | C | D |
+----+------------+-----------+------+------+
| 1 | Home Team | Away Team | Htgs | Atgs |
| 2 | dalboset | sopot | 1 | 2 |
| 3 | calnic | resita | 1 | 3 |
| 4 | sopot | dalboset | 2 | 2 |
| 5 | resita | sopot | 4 | 1 |
| 6 | sopot | dalboset | 2 | 1 |
| 7 | caransebes | dalboset | 1 | 2 |
| 8 | calnic | resita | 1 | 3 |
| 9 | dalboset | sopot | 2 | 2 |
| 10 | calnic | resita | 4 | 1 |
| 11 | sopot | dalboset | 2 | 1 |
| 12 | resita | sopot | 1 | 2 |
| 13 | sopot | dalboset | 1 | 3 |
| 14 | caransebes | dalboset | 2 | 2 |
| 15 | calnic | resita | 4 | 1 |
| 16 | dalboset | sopot | 2 | 1 |
| 17 | calnic | resita | 1 | 2 |
| 18 | sopot | dalboset | 4 | 1 |
| 19 | resita | sopot | 2 | 1 |
| 20 | sopot | dalboset | 1 | 2 |
| 21 | caransebes | dalboset | 1 | 3 |
| 22 | calnic | resita | 2 | 2 |
+----+------------+-----------+------+------+
CODE
df1 = df.groupby('Home Team')[['Htgs', 'Atgs']].rolling(window=4, min_periods=3).mean()
df1 = df1.rename(columns={'Htgs': 'Htgs/3', 'Atgs': 'Htgc/3'})
df1
df2 = df.groupby('Away Team')[['Htgs', 'Atgs']].rolling(window=4, min_periods=3).mean()
df2 = df2.rename(columns={'Htgs': 'Atgc/3', 'Atgs': 'Atgs/3'})
df2
Now I need a solution to see the rolling-average columns next to the Home Team, Away Team, Htgs, and Atgs columns from the original table.
Done!
I created the new column directly in the dataframe like this:
df = pd.read_csv('Fd.csv')
df['Htgs/3'] = df.groupby('Home Team')['Htgs'].rolling(window=4, min_periods=3).mean().reset_index(0, drop=True)
Htgs/3 will be the new column with the rolling average of Htgs for each Home Team, and for the rest I will do the same as in this part (a sketch of that is below).
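To sketch what "the same for the rest" could look like, here are the remaining rolling columns built the same way; the column names follow the df1/df2 renames earlier in the question, but the exact pairing of columns is an assumption:
import pandas as pd

df = pd.read_csv('Fd.csv')
home = df.groupby('Home Team')
away = df.groupby('Away Team')

df['Htgs/3'] = home['Htgs'].rolling(window=4, min_periods=3).mean().reset_index(0, drop=True)
df['Htgc/3'] = home['Atgs'].rolling(window=4, min_periods=3).mean().reset_index(0, drop=True)
df['Atgs/3'] = away['Atgs'].rolling(window=4, min_periods=3).mean().reset_index(0, drop=True)
df['Atgc/3'] = away['Htgs'].rolling(window=4, min_periods=3).mean().reset_index(0, drop=True)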

Create multiple new conditional columns and values based on flags

I have a dataframe like this.
import pandas as pd
from collections import OrderedDict
have = pd.DataFrame(OrderedDict({'User': ['101', '101', '102', '102', '103', '103', '103'],
                                 'Name': ['A', 'A', 'B', 'B', 'C', 'C', 'C'],
                                 'Country': ['India', 'UK', 'US', 'UK', 'US', 'India', 'UK'],
                                 'product': ['Soaps', 'Brush', 'Soaps', 'Brush', 'Soaps', 'Brush', 'Brush'],
                                 'channel': ['Retail', 'Online', 'Retail', 'Online', 'Retail', 'Online', 'Online'],
                                 'Country_flag': ['Y', 'Y', 'N', 'Y', 'N', 'N', 'Y'],
                                 'product_flag': ['N', 'Y', 'Y', 'Y', 'Y', 'N', 'N'],
                                 'channel_flag': ['N', 'N', 'N', 'Y', 'Y', 'Y', 'Y']
                                 }))
I want to create new columns based on the flags.
If a record has flag Y, then I want to combine the respective values.
In the image below, the 1st record has flag Y on country only, so I want to create a new ctry column whose value should be the concatenation (user|name|country); similarly, in the second record country and product have Y, so a ctry_prod column with values concatenated as (user|name|country|product), etc.
Wanted output (shown as an image in the original post):
My take:
# columns of interest
cat_cols = ['Country', 'product', 'channel']
flag_cols = [col+'_flag' for col in cat_cols]
# select those values marked 'Y'
s = (have[cat_cols].where(have[flag_cols].eq('Y').values)
         .stack()
         .reset_index(level=1)
     )
# join columns and values by |
s = s.groupby(s.index).agg('|'.join)
# add the 'User' and 'Name'
s[0] = have['User'] + "|" + have['Name'] + "|" + s[0]
# unstack to turn `level_1` to columns
s = s.reset_index().set_index(['index','level_1'])[0].unstack()
# concat by rows
pd.concat((have,s), axis=1)
Output:
+----+--------+--------+-----------+-----------+-----------+----------------+----------------+----------------+-------------+-------------------+-------------------+---------------------------+--------------+-------------+--------------------+
| | User | Name | Country | product | channel | Country_flag | product_flag | channel_flag | Country | Country|channel | Country|product | Country|product|channel | channel | product | product|channel |
|----+--------+--------+-----------+-----------+-----------+----------------+----------------+----------------+-------------+-------------------+-------------------+---------------------------+--------------+-------------+--------------------|
| 0 | 101 | A | India | Soaps | Retail | Y | N | N | 101|A|India | nan | nan | nan | nan | nan | nan |
| 1 | 101 | A | UK | Brush | Online | Y | Y | N | nan | nan | 101|A|UK|Brush | nan | nan | nan | nan |
| 2 | 102 | B | US | Soaps | Retail | N | Y | N | nan | nan | nan | nan | nan | 102|B|Soaps | nan |
| 3 | 102 | B | UK | Brush | Online | Y | Y | Y | nan | nan | nan | 102|B|UK|Brush|Online | nan | nan | nan |
| 4 | 103 | C | US | Soaps | Retail | N | Y | Y | nan | nan | nan | nan | nan | nan | 103|C|Soaps|Retail |
| 5 | 103 | C | India | Brush | Online | N | N | Y | nan | nan | nan | nan | 103|C|Online | nan | nan |
| 6 | 103 | C | UK | Brush | Online | Y | N | Y | nan | 103|C|UK|Online | nan | nan | nan | nan | nan |
+----+--------+--------+-----------+-----------+-----------+----------------+----------------+----------------+-------------+-------------------+-------------------+---------------------------+--------------+-------------+--------------------+
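As an optional follow-up, the generated columns can be renamed to the short labels the question mentions (ctry, ctry_prod, ...) before concatenating; the mapping below is a guess at the intended names, not part of the original answer:
# hypothetical mapping from the generated column names to the labels in the question
rename_map = {
    'Country': 'ctry',
    'product': 'prod',
    'channel': 'chnl',
    'Country|product': 'ctry_prod',
    'Country|channel': 'ctry_channel',
    'product|channel': 'prod_channel',
    'Country|product|channel': 'ctry_prod_channel',
}
out = pd.concat((have, s.rename(columns=rename_map)), axis=1)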
This is a hard question.
import numpy as np  # needed for np.nan below
s1 = have.iloc[:, -3:]
# the three flag columns
s2 = have.iloc[:, 2:-3]
# the three category columns (Country, product, channel)
s2 = s2.where((s1 == 'Y').values, np.nan)
# keep each value only where its flag is 'Y', otherwise NaN
s3 = pd.concat([have.iloc[:, :2], s2], axis=1).stack().groupby(level=0).agg('|'.join)
# build the joined user|name|... string for each row
s1 = s1.eq('Y').dot(s1.columns + '_').str.strip('_')
# using dot, build the combined flag-column name for each row
s = pd.crosstab(values=s3, index=have.index, columns=s1, aggfunc='first').fillna(0)
# pivot into one column per flag combination using crosstab
df = pd.concat([have, s], axis=1)
df
Out[175]:
User Name Country ... channel_flag product_flag product_flag_channel_flag
0 101 A India ... 0 0 0
1 101 A UK ... 0 0 0
2 102 B US ... 0 102|B|Soaps 0
3 102 B UK ... 0 0 0
4 103 C US ... 0 0 103|C|Soaps| Retail
5 103 C India ... 103|C|Online 0 0
6 103 C UK ... 0 0 0
[7 rows x 15 columns]
Not very elegant, but it will work. I kept the loops and if statements in multiple lines for clarity's sake:
have['Linked_Flags'] = have['Country_flag'] + have['product_flag'] + have['channel_flag']
mapping = OrderedDict([('YNN', 'ctry'), ('NYN', 'prod'), ('NNY', 'chnl'), ('YYY', 'ctry_prod_channel'),('YYN', 'ctry_prod'), ('YNY', 'ctry_channel'), ('NYY', 'prod_channel')])
string_to_add_dict = {0: 'Country', 1: 'product', 2: 'channel'}
for linked_flag in mapping.keys():
    string_to_add = ''
    for position, letter in enumerate(linked_flag):
        if letter == 'Y':
            string_to_add += have[string_to_add_dict[position]] + '| '
    have[mapping[linked_flag]] = np.where(have['Linked_Flags'] == linked_flag, have['User'] + '|' + have['Name'] + '|' + string_to_add, '')
del have['Linked_Flags']

Get next value from a row that satisfies a condition in pandas

I have a DataFrame that looks something like this:
|    | event_type | object_id |
|---:|:-----------|----------:|
|  0 | A          |         1 |
|  1 | D          |         1 |
|  2 | A          |         1 |
|  3 | D          |         1 |
|  4 | A          |         2 |
|  5 | A          |         2 |
|  6 | D          |         2 |
|  7 | A          |         3 |
|  8 | D          |         3 |
|  9 | A          |         3 |
What I want to do is get the index of the next row where the event_type is A and the object_id is still the same, so as an additional column this would look like this:
|    | event_type | object_id | next_A |
|---:|:-----------|----------:|-------:|
|  0 | A          |         1 |      2 |
|  1 | D          |         1 |      2 |
|  2 | A          |         1 |    NaN |
|  3 | D          |         1 |    NaN |
|  4 | A          |         2 |      5 |
|  5 | A          |         2 |    NaN |
|  6 | D          |         2 |    NaN |
|  7 | A          |         3 |      9 |
|  8 | D          |         3 |      9 |
|  9 | A          |         3 |    NaN |
and so on.
I want to avoid using .apply() because my DataFrame is quite large; is there a vectorized way to do this?
EDIT: for multiple A/D pairs for the same object_id, I'd like it to always use the next index of A, like this:
|    | event_type | object_id | next_A |
|---:|:-----------|----------:|-------:|
|  0 | A          |         1 |      2 |
|  1 | D          |         1 |      2 |
|  2 | A          |         1 |      4 |
|  3 | D          |         1 |      4 |
|  4 | A          |         1 |    NaN |
You can do it with groupby like:
def populate_next_a(object_df):
    object_df['a_index'] = pd.Series(object_df.index, index=object_df.index)[object_df.event_type == 'A']
    object_df['a_index'].fillna(method='bfill', inplace=True)
    object_df['next_A'] = object_df['a_index'].where(object_df.event_type != 'A', object_df['a_index'].shift(-1))
    object_df = object_df.drop('a_index', axis=1)  # assign back: drop() returns a new frame
    return object_df
result = df.groupby(['object_id']).apply(populate_next_a)
print(result)
event_type object_id next_A
0 A 1 2.0
1 D 1 2.0
2 A 1 NaN
3 D 1 NaN
4 A 2 5.0
5 A 2 NaN
6 D 2 NaN
7 A 3 9.0
8 D 3 9.0
9 A 3 NaN
GroupBy.apply will not have as much overhead as a simple apply.
Note that you cannot (yet) store integers alongside NaN: http://pandas.pydata.org/pandas-docs/stable/gotchas.html#support-for-integer-na so the values end up as floats.
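For reference, here is a fully vectorized sketch of the same idea without GroupBy.apply; this is a variation of my own, not part of the answer above, so treat it as a starting point:
# index of each 'A' row, NaN elsewhere
a_idx = df.index.to_series().where(df['event_type'].eq('A'))
# per object_id: index of the current-or-next 'A' row
filled = a_idx.groupby(df['object_id']).bfill()
# per object_id: index of the next 'A' row strictly after this one
shifted = filled.groupby(df['object_id']).shift(-1)
# 'A' rows want the strictly-next 'A'; other rows keep the current-or-next one
df['next_A'] = filled.where(df['event_type'].ne('A'), shifted)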
