I have a dataframe like this.
import pandas as pd
from collections import OrderedDict
have = pd.DataFrame(OrderedDict({'User': ['101', '101', '102', '102', '103', '103', '103'],
                                 'Name': ['A', 'A', 'B', 'B', 'C', 'C', 'C'],
                                 'Country': ['India', 'UK', 'US', 'UK', 'US', 'India', 'UK'],
                                 'product': ['Soaps', 'Brush', 'Soaps', 'Brush', 'Soaps', 'Brush', 'Brush'],
                                 'channel': ['Retail', 'Online', 'Retail', 'Online', 'Retail', 'Online', 'Online'],
                                 'Country_flag': ['Y', 'Y', 'N', 'Y', 'N', 'N', 'Y'],
                                 'product_flag': ['N', 'Y', 'Y', 'Y', 'Y', 'N', 'N'],
                                 'channel_flag': ['N', 'N', 'N', 'Y', 'Y', 'Y', 'Y']
                                 }))
I want to create new columns based on the flags.
If a record has flag 'Y' on a field, I want to combine the respective values.
For example, the 1st record has 'Y' on Country only, so I want a new ctry column whose value is the concatenation user|name|country. In the 2nd record both Country and product are 'Y', so a ctry_prod column with user|name|country|product, and so on.
Wanted output:
My take:
# columns of interest
cat_cols = ['Country', 'product', 'channel']
flag_cols = [col+'_flag' for col in cat_cols]
# select those values marked 'Y'
s = (have[cat_cols].where(have[flag_cols].eq('Y').values)
         .stack()
         .reset_index(level=1)
     )
# join columns and values by |
s = s.groupby(s.index).agg('|'.join)
# add the 'User' and 'Name'
s[0] = have['User'] + "|" + have['Name'] + "|" + s[0]
# unstack to turn `level_1` to columns
s = s.reset_index().set_index(['index','level_1'])[0].unstack()
# concatenate along the columns
pd.concat((have,s), axis=1)
Output:
+----+--------+--------+-----------+-----------+-----------+----------------+----------------+----------------+-------------+-------------------+-------------------+---------------------------+--------------+-------------+--------------------+
| | User | Name | Country | product | channel | Country_flag | product_flag | channel_flag | Country | Country|channel | Country|product | Country|product|channel | channel | product | product|channel |
|----+--------+--------+-----------+-----------+-----------+----------------+----------------+----------------+-------------+-------------------+-------------------+---------------------------+--------------+-------------+--------------------|
| 0 | 101 | A | India | Soaps | Retail | Y | N | N | 101|A|India | nan | nan | nan | nan | nan | nan |
| 1 | 101 | A | UK | Brush | Online | Y | Y | N | nan | nan | 101|A|UK|Brush | nan | nan | nan | nan |
| 2 | 102 | B | US | Soaps | Retail | N | Y | N | nan | nan | nan | nan | nan | 102|B|Soaps | nan |
| 3 | 102 | B | UK | Brush | Online | Y | Y | Y | nan | nan | nan | 102|B|UK|Brush|Online | nan | nan | nan |
| 4 | 103 | C | US | Soaps | Retail | N | Y | Y | nan | nan | nan | nan | nan | nan | 103|C|Soaps|Retail |
| 5 | 103 | C | India | Brush | Online | N | N | Y | nan | nan | nan | nan | 103|C|Online | nan | nan |
| 6 | 103 | C | UK | Brush | Online | Y | N | Y | nan | 103|C|UK|Online | nan | nan | nan | nan | nan |
+----+--------+--------+-----------+-----------+-----------+----------------+----------------+----------------+-------------+-------------------+-------------------+---------------------------+--------------+-------------+--------------------+
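If the short names from the question (ctry, ctry_prod, ...) are preferred over the pipe-joined ones, a rename can be chained on at the end. A minimal sketch, assuming the mapping below (built from the question's naming) covers every generated column:
rename_map = {'Country': 'ctry', 'product': 'prod', 'channel': 'chnl',
              'Country|product': 'ctry_prod', 'Country|channel': 'ctry_channel',
              'product|channel': 'prod_channel',
              'Country|product|channel': 'ctry_prod_channel'}
out = pd.concat((have, s), axis=1).rename(columns=rename_map)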
This is a hard question.
import numpy as np

# filter out the flag columns
s1 = have.iloc[:, -3:]
# filter out the value columns
s2 = have.iloc[:, 2:-3]
# mask the values: keep those whose flag is 'Y', otherwise NaN
s2 = s2.where((s1 == 'Y').values, np.nan)
# build the joined strings: user|name|<flagged values>
s3 = pd.concat([have.iloc[:, :2], s2], axis=1).stack().groupby(level=0).agg('|'.join)
# using dot, build the new column name from the flag columns
s1 = s1.eq('Y').dot(s1.columns + '_').str.strip('_')
# pivot into one column per flag combination using crosstab
s = pd.crosstab(values=s3, index=have.index, columns=s1, aggfunc='first').fillna(0)
df = pd.concat([have, s], axis=1)
df
Out[175]:
User Name Country ... channel_flag product_flag product_flag_channel_flag
0 101 A India ... 0 0 0
1 101 A UK ... 0 0 0
2 102 B US ... 0 102|B|Soaps 0
3 102 B UK ... 0 0 0
4 103 C US ... 0 0 103|C|Soaps| Retail
5 103 C India ... 103|C|Online 0 0
6 103 C UK ... 0 0 0
[7 rows x 15 columns]
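The .dot step is the clever part: multiplying a boolean frame by its string column names concatenates the names of the columns that are True in each row. A minimal sketch on a toy frame (flags is made up for illustration):
import pandas as pd

flags = pd.DataFrame({'a_flag': ['Y', 'N'], 'b_flag': ['Y', 'Y']})
# True * 'name_' keeps the name, False * 'name_' drops it; dot sums per row
labels = flags.eq('Y').dot(flags.columns + '_').str.strip('_')
print(labels.tolist())  # ['a_flag_b_flag', 'b_flag']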
Not very elegant, but it will work. I kept the loops and if statements in multiple lines for clarity's sake:
import numpy as np

have['Linked_Flags'] = have['Country_flag'] + have['product_flag'] + have['channel_flag']
mapping = OrderedDict([('YNN', 'ctry'), ('NYN', 'prod'), ('NNY', 'chnl'),
                       ('YYY', 'ctry_prod_channel'), ('YYN', 'ctry_prod'),
                       ('YNY', 'ctry_channel'), ('NYY', 'prod_channel')])
string_to_add_dict = {0: 'Country', 1: 'product', 2: 'channel'}

for linked_flag in mapping.keys():
    string_to_add = ''
    for position, letter in enumerate(linked_flag):
        if letter == 'Y':
            string_to_add += have[string_to_add_dict[position]] + '|'
    # string_to_add ends with a trailing '|', so strip it before concatenating
    have[mapping[linked_flag]] = np.where(have['Linked_Flags'] == linked_flag,
                                          have['User'] + '|' + have['Name'] + '|' + string_to_add.str.rstrip('|'),
                                          '')
del have['Linked_Flags']
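As a quick sanity check, the 102/UK row (all three flags 'Y') should land in the ctry_prod_channel column, matching the expected output table above:
print(have.loc[3, 'ctry_prod_channel'])  # 102|B|UK|Brush|Online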
I have a dataframe where I need to blank out duplicate ticket_id values: if the owner_type is the same for all rows of a ticket, return NaN for all of them; if not, pick 'm' over 's'. Wherever no value is picked, NaN is returned:
data = pd.DataFrame({'owner_type':['m','m','m','s','s','m','s','s'],'ticket_id':[1,1,2,2,3,3,4,4]})
| | owner_type | ticket_id |
|---:|:-------------|------------:|
| 0 | m | 1 |
| 1 | m | 1 |
| 2 | m | 2 |
| 3 | s | 2 |
| 4 | s | 3 |
| 5 | m | 3 |
| 6 | s | 4 |
| 7 | s | 4 |
Should give back:
| | owner_type | ticket_id |
|---:|:-------------|------------:|
| 0 | m | NaN |
| 1 | m | NaN |
| 2 | m | 2 |
| 3 | s | NaN |
| 4 | s | NaN |
| 5 | m | 3 |
| 6 | s | NaN |
| 7 | s | NaN |
Pseudo code would be like: if ticket_id is duplicated, look at owner_type; if owner_type has more than one value, return the ticket_id for 'm' and NaN for 's'.
My attempt
data.groupby('ticket_id').apply(lambda x: x['owner_type'] if len(x) < 2 else np.nan)
Not working
Try this:
(data['ticket_id'].where(
    ~data.duplicated(['owner_type', 'ticket_id'], keep=False) &
    data['owner_type'].eq(data.groupby('ticket_id')['owner_type'].transform('min'))))
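The transform('min') works because 'm' sorts before 's', so the group minimum is 'm' whenever both owner types are present. A quick check on a made-up two-row frame:
import pandas as pd

demo = pd.DataFrame({'owner_type': ['s', 'm'], 'ticket_id': [3, 3]})
print(demo.groupby('ticket_id')['owner_type'].transform('min').tolist())  # ['m', 'm']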
Old answer:
m = ~data.duplicated(keep=False) & data['owner_type'].eq('m')
data['ticket_id'].where(m)
Output:
0 NaN
1 NaN
2 2.0
3 NaN
4 NaN
5 3.0
6 NaN
7 NaN
I have a pandas DataFrame with a standard index representing seconds, and I want to add a column "seconds elapsed since last event", where the events are given in a list. Specifically, say
event = [2, 5]
and
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((7, 1)))
| | 0 |
|---:|----:|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
| 5 | 0 |
| 6 | 0 |
Then I want to obtain
| | 0 | x |
|---:|----:|-----:|
| 0 | 0 | <NA> |
| 1 | 0 | <NA> |
| 2 | 0 | 0 |
| 3 | 0 | 1 |
| 4 | 0 | 2 |
| 5 | 0 | 0 |
| 6 | 0 | 1 |
I tried
df["x"] = pd.Series(range(5)).shift(2)
| | 0 | x |
|---:|----:|----:|
| 0 | 0 | nan |
| 1 | 0 | nan |
| 2 | 0 | 0 |
| 3 | 0 | 1 |
| 4 | 0 | 2 |
| 5 | 0 | nan |
| 6 | 0 | nan |
so apparently to make it work I need to write df["x"] = pd.Series(range(5+2)).shift(2).
More importantly, when I then do df["x"] = pd.Series(range(2+5)).shift(5) I obtain
| | 0 | x |
|---:|----:|----:|
| 0 | 0 | nan |
| 1 | 0 | nan |
| 2 | 0 | nan |
| 3 | 0 | nan |
| 4 | 0 | nan |
| 5 | 0 | 0 |
| 6 | 0 | 1 |
That is: the previous values have been overwritten. Is there a way to assign new values without overwriting existing values with NaN?
Then, I can do something like
for i in event:
    df["x"] = pd.Series(range(len(df))).shift(i)
Or is there a more efficient way?
For the record, here is my naive code. It works, but looks inefficient and of poor design:
def seconds_since_last_event(df, event):
    c = 1000000  # sentinel for rows before the first event
    df["x"] = c
    if event:
        idx = 0
        for i in df.itertuples():
            print(i)
            if idx < len(event) and i.Index == event[idx]:
                c = 0
                idx += 1
            df.loc[i.Index, "x"] = c
            c += 1
    return df
IIUC, you can do it with a cumsum-based grouper and groupby.cumcount:
s = df.index.isin(event).cumsum()
# or equivalently
# s = df.loc[event, 0].reindex(df.index).notna().cumsum()
df['x'] = np.where(s > 0, df.groupby(s).cumcount(), np.nan)
Output:
0 x
0 0.0 NaN
1 0.0 NaN
2 0.0 0.0
3 0.0 1.0
4 0.0 2.0
5 0.0 0.0
6 0.0 1.0
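To see why this works: isin(event) flags the event rows and cumsum labels each stretch of rows from one event to the next, with 0 for rows before the first event. With the sample df and event from the question:
s = df.index.isin(event).cumsum()
print(list(s))  # [0, 0, 1, 1, 1, 2, 2] -> rows 2-4 form group 1, rows 5-6 group 2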
Let's try this:
df = pd.DataFrame(np.zeros((7, 1)))
event = [2, 5]
df.loc[event, 0] = 1
df = df.replace(0, np.nan)
grp = df[0].cumsum().ffill()
df['x'] = df.groupby(grp).cumcount().mask(grp.isna())
df
Output:
| | 0 | x |
|---:|----:|----:|
| 0 | nan | nan |
| 1 | nan | nan |
| 2 | 1 | 0 |
| 3 | nan | 1 |
| 4 | nan | 2 |
| 5 | 1 | 0 |
| 6 | nan | 1 |
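The desired output in the question shows <NA> rather than nan; if the integer look matters, either answer's result can presumably be cast to pandas' nullable integer dtype afterwards:
df['x'] = df['x'].astype('Int64')  # floats with NaN become integers with <NA>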
My current data looks something like this
+-------+----------------------------+-------------------+-----------------------+
| Index | 0 | 1 | 2 |
+-------+----------------------------+-------------------+-----------------------+
| 0 | Reference Curr | Daybook / Voucher | Invoice Date Due Date |
| 1 | V50011 Tech Comp | nan | Phone:0177222222 |
| 2 | Regis Place | nan | Fax:017757575789 |
| 3 | Catenberry | nan | nan |
| 4 | Manhattan, NY | nan | nan |
| 5 | V7484 Pipe | nan | Phone: |
| 6 | Japan | nan | nan |
| 7 | nan | nan | nan |
| 8 | 4543.34GBP (British Pound) | nan | nan |
+-------+----------------------------+-------------------+-----------------------+
I am trying to create a new column, df['Company'], that should contain what is in df[0] if df[0] starts with a "V" and df[2] has "Phone" in it. If the condition is not satisfied, it can be nan. Below is what I am looking for.
+-------+----------------------------+-------------------+-----------------------+------------+
| Index | 0 | 1 | 2 | Company |
+-------+----------------------------+-------------------+-----------------------+------------+
| 0 | Reference Curr | Daybook / Voucher | Invoice Date Due Date | nan |
| 1 | V50011 Tech | nan | Phone:0177222222 |V50011 Tech |
| 2 | Regis Place | nan | Fax:017757575789 | nan |
| 3 | Catenberry | nan | nan | nan |
| 4 | Manhattan, NY | nan | nan | nan |
| 5 | V7484 Pipe | nan | Phone: | V7484 Pipe |
| 6 | Japan | nan | nan | nan |
| 7 | nan | nan | nan | nan |
| 8 | 4543.34GBP (British Pound) | nan | nan | nan |
+-------+----------------------------+-------------------+-----------------------+------------+
I am trying the below script but I get an error ValueError: Wrong number of items passed 1420, placement implies 1
df['Company'] = pd.np.where(df[2].str.contains("Ph"), df[0].str.extract(r"(^V[A-Za-z0-9]+)"),"stop")
I put in "stop" as the else part because I don't know how to let python use nan when the condition is not met.
I would also like to be able to parse out a section of df[0], for example just the V5001 section, but not the rest of the cell contents. I tried something like this using AMC's answer but get an error:
df.loc[df[0].str.startswith('V') & df[2].str.contains('Phone'), 'Company'] = df[0].str.extract(r"(^V[A-Za-z0-9]+)")
Thank you
You haven't provided an easy way for us to test potential solutions, but this should do the job:
df.loc[df[0].str.startswith('V') & df[2].str.contains('Phone'), 'Company'] = df[0]
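On the follow-up error with str.extract: by default it returns a DataFrame (one column per capture group), which can't be assigned into a single-column .loc slice. Passing expand=False makes it return a Series; na=False also guards the mask against the NaN cells in this data. A sketch under those assumptions:
mask = df[0].str.startswith('V', na=False) & df[2].str.contains('Phone', na=False)
df.loc[mask, 'Company'] = df[0].str.extract(r"(^V[A-Za-z0-9]+)", expand=False)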
A potential solution to this would be to use a list comprehension. You could probably get a speed boost using some of pandas' built-in functions, but this will get you there.
#!/usr/bin/env python
import numpy as np
import pandas as pd
df = pd.DataFrame({
    0: ["reference", "v5001 tech comp", "catenberry", "very different"],
    1: ["not", "phone", "other", "text"]
})

df["new_column"] = [x if (x[0].lower() == "v") & ("phone" in y.lower())
                    else np.nan for x, y in df.loc[:, [0, 1]].values]
print(df)
Which will produce
0 1 new_column
0 reference not NaN
1 v5001 tech comp phone v5001 tech comp
2 catenberry other NaN
3 very different text NaN
All I'm doing is taking your two conditions and building a new list which will then be assigned to your new column.
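One caveat: if either cell can be NaN, as in the original frame, x[0] and y.lower() will raise a TypeError, so a type guard is worth adding. A sketch of the same comprehension with the guard:
df["new_column"] = [
    x if isinstance(x, str) and isinstance(y, str)
    and x.lower().startswith("v") and "phone" in y.lower()
    else np.nan
    for x, y in df.loc[:, [0, 1]].values
]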
Here's another way to get your result
import numpy as np

condition1 = df[0].str.startswith('V', na=False)
condition2 = df[2].str.contains('Phone', na=False)
df['Company'] = np.where(condition1 & condition2, df[0], np.nan)
# keep only the first token (e.g. 'V50011'); assigning the full
# expand=True frame to a single column would raise
df['Company'] = df['Company'].str.split(' ').str[0]
You can do it with the pandas apply function:
import re
import numpy as np
import pandas as pd
df['Company'] = df.apply(
    lambda x: x[0].split()[0]
    if re.match(r'^v[A-Za-z0-9]+', x[0].lower()) and 'phone' in x[1].lower()
    else np.nan,
    axis=1,
)
Edit: to address the comment under @AMC's answer.
IIUC, we can either use a boolean condition to extract the V number with some basic regex, or apply the same formula within a where statement. To set a value to NaN we can use np.nan. If you want to grab the entire string after the V, use [V]\w+.* , which will grab everything after the first match.
import pandas as pd
from io import StringIO
d = """+-------+----------------------------+-------------------+-----------------------+
| Index | 0 | 1 | 2 |
+-------+----------------------------+-------------------+-----------------------+
| 0 | Reference Curr | Daybook / Voucher | Invoice Date Due Date |
| 1 | V50011 Tech Comp | nan | Phone:0177222222 |
| 2 | Regis Place | nan | Fax:017757575789 |
| 3 | Catenberry | nan | nan |
| 4 | Manhattan, NY | nan | nan |
| 5 | Ultilagro, CT | nan | nan |
| 6 | Japan | nan | nan |
| 7 | nan | nan | nan |
| 8 | 4543.34GBP (British Pound) | nan | nan |
+-------+----------------------------+-------------------+-----------------------+"""
df = pd.read_csv(StringIO(d), sep='|', skiprows=1)
df = df.iloc[1:-1, 2:-1]
df.columns = df.columns.str.strip()
df["3"] = df[df["2"].str.contains("phone", case=False) == True]["0"].str.extract(
r"([V]\w+)"
)
print(df[['0','2','3']])
0 2 3
1 Reference Curr Invoice Date Due Date nan
2 V50011 Tech Comp Phone:0177222222 V50011
3 Regis Place Fax:017757575789 nan
4 Catenberry nan nan
5 Manhattan, NY nan nan
6 Ultilagro, CT nan nan
7 Japan nan nan
8 nan nan nan
9 4543.34GBP (British Pound) nan nan
if you want as a where statement:
import numpy as np

# use the boolean mask itself as the condition; expand=False makes
# str.extract return a Series that np.where can broadcast
df["3"] = np.where(
    df["2"].str.contains("phone", case=False) == True,
    df["0"].str.extract(r"([V]\w+)", expand=False),
    np.nan,
)
print(df[['0','2','3']])
0 2 3
1 Reference Curr Invoice Date Due Date NaN
2 V50011 Tech Comp Phone:0177222222 V50011
3 Regis Place Fax:017757575789 NaN
4 Catenberry nan NaN
5 Manhattan, NY nan NaN
6 Ultilagro, CT nan NaN
7 Japan nan NaN
8 nan nan NaN
9 4543.34GBP (British Pound) nan NaN
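A note on the == True comparison above: str.contains returns NaN for missing values, and comparing against True coerces those to False. The na parameter gives the same effect and is arguably clearer:
mask = df["2"].str.contains("phone", case=False, na=False)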
I am reading data from a text file in Python using pandas. There are no header values (column names) assigned to the data in the text file. I want to reshape the data into a readable form. The problem I am facing is variable column lengths.
For example, in my text file I have:
1,2,3,4,5,Hello,7,8
1,2,3,4,5,7,8,
1,2,3,4,5,7,8,
1,2,3,4,5,Hello,7,8,
Now when I create a dataframe, I want to make sure that in the second row a "NaN" is written instead of Hello, as the value for that column is not present. In the end, after assigning column names and rearranging, the dataframe will look like:
1,2,3,4,5,Hello,7,8
1,2,3,4,5,"NA,"7,8,
1,2,3,4,5,"NA",7,8,
1,2,3,4,5,Hello,7,8,
Answer to the updated question, and also a generalized solution for such cases.
import numpy as np

focus_col_idx = 5  # the column where NaN should appear in the expected output
last_idx = df.shape[1] - 1

# fetch the index of rows which have NaN in the last column
idx = df[df[last_idx].isnull()].index

# shift the column values one position right for those rows
df.iloc[idx, focus_col_idx + 1:] = df.iloc[idx, focus_col_idx:last_idx].values

# put NaN in the focus column for those rows
df.iloc[idx, focus_col_idx] = np.nan
df
+---+----+---+---+---+---+-------+---+-----+
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+---+----+---+---+---+---+-------+---+-----+
| 0 | 1 | 2 | 3 | 4 | 5 | Hello | 7 | 8.0 |
| 1 | 1 | 2 | 3 | 4 | 5 | NaN | 7 | 8.0 |
| 2 | 1 | 2 | 3 | 4 | 5 | NaN | 7 | 8.0 |
| 3 | 1 | 2 | 3 | 4 | 5 | Hello | 7 | 8.0 |
+---+----+---+---+---+---+-------+---+-----+
Answer to the previous data:
Assuming only one column has a missing value (say the 2nd column, as per your previous data), here's a quick solution -
df = pd.read_table('SO.txt', sep=',', header=None)
df
+---+---+---+---+---+------+
| | 0 | 1 | 2 | 3 | 4 |
+---+---+---+---+---+------+
| 0 | A | B | C | D | E |
| 1 | A | C | D | E | None |
+---+---+---+---+---+------+
# Fetching the index of rows which have None in last column
idx = df[df[4].isnull()].index
idx
# Int64Index([1], dtype='int64')
# Shifting the column values for those rows with index idx
df.iloc[idx,2:] = df.iloc[idx,1:4].values
df
+---+---+---+---+---+---+
| | 0 | 1 | 2 | 3 | 4 |
+---+---+---+---+---+---+
| 0 | A | B | C | D | E |
| 1 | A | C | C | D | E | # <- Notice the shifting.
+---+---+---+---+---+---+
# Putting NaN for second column where row index is idx
df.iloc[idx, 1] = np.nan
# Final output
df
+---+---+-----+---+---+---+
| | 0 | 1 | 2 | 3 | 4 |
+---+---+-----+---+---+---+
| 0 | A | B | C | D | E |
| 1 | A | NaN | C | D | E |
+---+---+-----+---+---+---+
The code below gives me this table:
raw = pd.read_clipboard()
raw.head()
+---+---------------------+-------------+---------+----------+-------------+
| | Afghanistan | South Asia | 652225 | 26000000 | Unnamed: 4 |
+---+---------------------+-------------+---------+----------+-------------+
| 0 | Albania | Europe | 28728 | 3200000 | 6656000000 |
| 1 | Algeria | Middle East | 2400000 | 32900000 | 75012000000 |
| 2 | Andorra | Europe | 468 | 64000 | NaN |
| 3 | Angola | Africa | 1250000 | 14500000 | 14935000000 |
| 4 | Antigua and Barbuda | Americas | 442 | 77000 | 770000000 |
+---+---------------------+-------------+---------+----------+-------------+
But when I attempt to rename the columns and create a DataFrame, all of the data disappears:
df = pd.DataFrame(raw, columns = ['name', 'region', 'area', 'population', 'gdp'])
df.head()
+---+------+--------+------+------------+-----+
| | name | region | area | population | gdp |
+---+------+--------+------+------------+-----+
| 0 | NaN | NaN | NaN | NaN | NaN |
| 1 | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN |
+---+------+--------+------+------------+-----+
Any idea why?
The data disappears because the columns argument of the DataFrame constructor selects columns by name, and none of the new names exist in raw, so every column comes back as NaN. To rename, you should just write:
df.columns = ['name', 'region', 'area', 'population', 'gdp']
This is also more efficient, since it relabels the existing DataFrame rather than building a new one.
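Note also that read_clipboard consumed the first data row (Afghanistan) as the header. To keep that row as data, the table can presumably be re-read with explicit names, assuming it is still on the clipboard:
raw = pd.read_clipboard(header=None,
                        names=['name', 'region', 'area', 'population', 'gdp'])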