Python Pandas: add column based on complicated if statement

I am working on analyzing customer return behavior and am working with the following dataframe df:
Customer_ID | Order | Store_ID | Date     | Item_ID | Count_of_units | Event_Return_Flag
ABC         | 123   | 1        | 23052016 | A       | -1             | Y
ABC         | 345   | 1        | 23052016 | B       | 1              | 0
ABC         | 567   | 1        | 24052016 | C       | -1             | 0
I need to add another column to find customers who returned during the event (Event_Return_Flag=Y) and bought something on the same day and in the same store.
In other words, I want to add a flag df['target'] with the following if logic:
same Customer_ID, Store_ID, Date as a record with Event_Return_Flag=Y
but a different Item_ID than the record with Event_Return_Flag=Y
Count_of_units > 0
I don't know how to accomplish this in python pandas.
I was thinking of creating a key by concatenating Customer_ID, Store_ID and Date; then splitting the file by Event_Return_Flag and using an isin statement, something like this:
df['key'] = df['Customer_ID'] + '_' + df['Store_ID'].apply(str) + '_' + df['Date'].apply(str)
df_1 = df.loc[df['Event_Return_Flag'] == 'Y']
df_2 = df.loc[df['Event_Return_Flag'] == '0']
df_3 = df_2.loc[df_2['Count_of_units'] > 0]
df_3['target'] = np.where(df_3['key'].isin(df_1['key']), 'Y', 0)
This approach seems quite wrong, but I couldn't come up with something better. I get this error message for the last line with np.where:
C:\Users\xxx\AppData\Local\Continuum\Anaconda2\lib\site-packages\ipykernel\__main__.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
if __name__ == '__main__':
I tried something along these lines, but couldn't figure out how to match rows based on the column Event_Return_Flag:
df['target'] = np.where((df.Count_of_units > 0) & (df.groupby(['key', 'Item_ID']).Event_Return_Flag.transform('nunique') > 1), 'Y', '')
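For reference, one possible way to build this flag (not from the original thread, just a sketch using the column names above) is to mark every Customer_ID/Store_ID/Date group that contains an event return, and then flag the purchase rows inside those groups:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Customer_ID': ['ABC', 'ABC', 'ABC'],
    'Order': [123, 345, 567],
    'Store_ID': [1, 1, 1],
    'Date': [23052016, 23052016, 24052016],
    'Item_ID': ['A', 'B', 'C'],
    'Count_of_units': [-1, 1, -1],
    'Event_Return_Flag': ['Y', '0', '0'],
})

group_cols = ['Customer_ID', 'Store_ID', 'Date']

# True for every row whose customer/store/day group contains an event return
has_event_return = (df.groupby(group_cols)['Event_Return_Flag']
                      .transform(lambda s: s.eq('Y').any()))

# Flag purchases (Count_of_units > 0) in those groups; excluding rows flagged 'Y'
# keeps the return row itself out (this sketch does not compare Item_IDs)
df['target'] = np.where(
    has_event_return & df['Event_Return_Flag'].ne('Y') & (df['Count_of_units'] > 0),
    'Y', '0'
)
print(df)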

get values for potentially multiple matches from an other dataframe

I want to fill the 'references' column in df_out with the 'ID' if the corresponding 'my_ID' from df_sp is contained in df_jira's 'reference_ids'.
import pandas as pd
d_sp = {'ID': [1,2,3,4], 'my_ID': ["my_123", "my_234", "my_345", "my_456"], 'references':["","","2",""]}
df_sp = pd.DataFrame(data=d_sp)
d_jira = {'my_ID': ["my_124", "my_235", "my_346"], 'reference_ids': ["my_123, my_234", "", "my_345"]}
df_jira = pd.DataFrame(data=d_jira)
df_new = df_jira[~df_jira["my_ID"].isin(df_sp["my_ID"])].copy()
df_out = pd.DataFrame(columns=df_sp.columns)
needed_cols = list(set(df_sp.columns).intersection(df_new.columns))
for column in needed_cols:
    df_out[column] = df_new[column]
df_out['Related elements_my'] = df_jira['reference_ids']
Desired output df_out:
| ID | my_ID | references |
|----|-------|------------|
| | my_124| 1, 2 |
| | my_235| |
| | my_346| 3 |
What I tried so far is a list comprehension, but I only managed to get the reference_ids "copied" from a helper column into my 'references' column with this:
for row, entry in df_out.iterrows():
    cpl_ids = [x for x in entry['Related elements_my'].split(', ') if any(vh_id == x for vh_id in df_cpl_list['my-ID'])]
    df_out.at[row, 'Related elements'] = ', '.join(cpl_ids)
I cannot wrap my head around how to get the specific 'ID's for the matches of 'any()', or whether this is actually the way to go, since I need all the matches, not just whether there is any match.
Any hints are appreciated!
I work with python 3.9.4 on Windows (adding in case python 3.10 has any other solution)
Backstory: Moving data from Jira to MS SharePoint lists. (Therefore, the 'ID' does not equal the actual index in the dataframe, but is rather assigned by SharePoint upon insertion into the list. Hence, empty after running for the new entries.)
ref_df = df_sp[["ID", "my_ID"]].set_index("my_ID")
df_out.references = df_out["Related elements_my"].apply(
    lambda x: ",".join(
        map(lambda y: "" if y == "" else str(ref_df.loc[y.strip()].ID), x.split(","))
    )
)
df_out[["ID", "my_ID", "references"]]
output:
ID my_ID references
0 NaN my_124 1,2
1 NaN my_235
2 NaN my_346 3
What is map?
map is something like [func(i) for i in lst]: it applies func to every element of lst, but in a way that can be faster than an explicit loop.
You can read more about it here: https://realpython.com/python-map-function/
Here, our function is: lambda y: "" if y == "" else str(ref_df.loc[y.strip()].ID)
So if y is empty (the y.strip() is there just to remove spaces), it maps to an empty string, "" if y == "", as happens for my_235.
Otherwise, it locates y in ref_df and gets the corresponding ID, i.e. it maps each my_ID to its ID.
Hope this is helpful :)
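For completeness, the answer above can be run end to end on the question's sample data roughly like this (a sketch; ref_df is reduced to a plain my_ID -> ID Series and the lambda is pulled out into a named helper):

import pandas as pd

df_sp = pd.DataFrame({'ID': [1, 2, 3, 4],
                      'my_ID': ["my_123", "my_234", "my_345", "my_456"]})
df_out = pd.DataFrame({'my_ID': ["my_124", "my_235", "my_346"],
                       'Related elements_my': ["my_123, my_234", "", "my_345"]})

# Lookup table: my_ID -> ID
ref_df = df_sp.set_index('my_ID')['ID']

def to_ids(refs):
    # map() applies the lookup to every individual reference in the cell
    return ",".join(map(lambda y: "" if y == "" else str(ref_df.loc[y.strip()]),
                        refs.split(",")))

df_out['references'] = df_out['Related elements_my'].apply(to_ids)
print(df_out[['my_ID', 'references']])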

How to re-number strings after sorting a dataframe

Description:
I have a GUI that allows the user to add variables that are displayed in a dataframe. As the variables are added, they are automatically numbered, e.g. 'FIELD_0', 'FIELD_1', etc., and each variable has a value associated with it. The data is actually row-based instead of column-based: the 'FIELD' ids are in column 0 and progress downwards, and the corresponding value is in column 1 of the same row. As shown below:
0 1
0 FIELD_0 HH_5_MILES
1 FIELD_1 POP_5_MILES
The user is able to reorder these values and move them up/down a row. However, it's important that the number ordering remains sequential. So, if the user positions 'FIELD_1' above 'FIELD_0' then it gets re-numbered appropriately. Example:
0 1
0 FIELD_0 POP_5_MILES
1 FIELD_1 HH_5_MILES
Currently, I'm using the below code to perform this adjustment - this same re-numbering occurs with other variable names within the same dataframe.
import pandas

df = pandas.DataFrame({0: ['FIELD_1', 'FIELD_0']})
variable_list = ['FIELD', 'OPERATOR', 'RESULT']
for var in variable_list:
    field_list = ['%s_%s' % (var, _) for _, field_name in enumerate(df[0].isin([var]))]
    field_count = 0
    for _, field_name in enumerate(df.loc[:, 0]):
        if var in field_name:
            df.loc[_, 0] = field_list[field_count]
            field_count += 1
This gets me the result I want, but it seems a bit inelegant. If there is a better way, I'd love to know what it is.
It appears you're looking to overwrite the FIELD values so that they always appear in order starting with 0.
We can filter to only the rows that str.contains the word FIELD, then assign those rows new values from a list comprehension, similar to your field_list.
import pandas as pd
# Modified DF
df = pd.DataFrame({0: ['FIELD_1', 'OTHER_1', 'FIELD_0', 'OTHER_0']})
# Select Where Values are Field
m = df[0].str.contains('FIELD')
# Overwrite field with new values by iterating over the total matches
df.loc[m, 0] = [f'FIELD_{n}' for n in range(m.sum())]
print(df)
df:
0
0 FIELD_0
1 OTHER_1
2 FIELD_1
3 OTHER_0
For multiple variables:
import pandas as pd
# Modified DF
df = pd.DataFrame({0: ['FIELD_1', 'OTHER_1', 'FIELD_0', 'OTHER_0']})
variable_list = ['FIELD', 'OTHER']
for v in variable_list:
    # Select rows where the value contains the variable name
    m = df[0].str.contains(v)
    # Overwrite with new values by iterating over the total matches
    df.loc[m, 0] = [f'{v}_{n}' for n in range(m.sum())]
df:
0
0 FIELD_0
1 OTHER_0
2 FIELD_1
3 OTHER_1
You can use sort_values as below:
def f(x):
    l = x.split('_')[1]
    return int(l)

df.sort_values(0, key=lambda col: [f(k) for k in col]).reset_index(drop=True)
0
0 FIELD_0
1 FIELD_1
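A slightly more vectorized variant of the same idea (an alternative sketch, not from the original answer; it needs pandas >= 1.1 for the key argument) extracts the numeric suffix with the .str accessor instead of a Python-level loop:

import pandas as pd

df = pd.DataFrame({0: ['FIELD_1', 'FIELD_0']})

# Sort on the integer suffix after the underscore
out = (df.sort_values(0, key=lambda col: col.str.split('_').str[1].astype(int))
         .reset_index(drop=True))
print(out)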

Python - How to filter a column mixing int and string using DataFrame.query()?

I would like to filter records based on some criteria as below:
import pandas as pd
def doFilter(df, type, criteria):
    if type == "contain":
        return df[df.country.apply(str).str.contains(criteria)]
    elif type == "start":
        return df[df.remarks.apply(str).str.startswith(criteria)]
df= pd.read_csv("testdata.csv")
tempdf = doFilter(df, "contain", "U")
finaldf = doFilter(tempdf, "start", "123")
print(finaldf)
[testdata.csv]
id country remarks
1 UK 123
2 UK 123abc
3 US 456
4 JP 456
[Output]
id country remarks
0 1 UK 123
1 2 UK 123abc
As I need to filter dynamically by reading input config for different criteria (e.g. startswith(), contains(), endswith(), substring() etc.), I would like to use DataFrame.query() so that I can filter everything in 1 go.
e.g.
I've tried many ways similar to below but no luck:
output=df.query('country.apply(str).str.contains("U") & remarks.apply(str).str.startswith("123")')
Any help would be greatly appreciated. Thank you so much.
Not tested against your exact data, but this will allow you to read filters at runtime and apply them with pandas' built-in string methods.
# better to cast all relevant columns to string while setting up
df.country = df.country.astype(str)
df.remarks = df.remarks.astype(str)
# get passed filters
filter1 = [  # list of (field, filtertype, value) tuples
('country', 'contains', 'U'),
('remarks', 'startswith', '123'),
]
# create a collection of boolean masks
mask = []
for field, filtertype, value in filter1:
    if filtertype == 'contains':
        mask.append(df[field].str.contains(value))
    elif filtertype == 'startswith':
        mask.append(df[field].str.startswith(value))
    elif filtertype == 'endswith':
        mask.append(df[field].str.endswith(value))
# all these filters need to be combined with `and`, as in `condition1 & condition2`
# if you need to allow for `or` then the whole thing gets a lot more complicated
# as you also need to expect parenthesis as in `(cond1 | cond2) & cond3`
# but it can be done with a parser
# allowing only `and` conditions
mask_combined = mask[0]
for m in mask[1:]:
    mask_combined &= m
# apply filter
df_final = df[mask_combined]
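Since the question specifically asks about DataFrame.query(), it is also worth noting that query's python engine (unlike the default numexpr engine) can evaluate .str methods inside the expression. A minimal sketch on the posted sample data, after casting the mixed column to str:

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'country': ['UK', 'UK', 'US', 'JP'],
                   'remarks': [123, '123abc', 456, 456]})  # mixed int/str column

# Cast the mixed column to string, then filter everything in one query call
df['remarks'] = df['remarks'].astype(str)
out = df.query('country.str.contains("U") and remarks.str.startswith("123")',
               engine='python')
print(out)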

Numerical column not displaying correctly in DF

I have a df such as below ( 3 rows for example )
ID | Dollar_Value
C 45.32
E 5.21
V 121.32
When I view the df in my notebook (just evaluating df), it shows the Dollar_Value as
ID | Dollar_Value
C 8.493000e+01
E 2.720000e+01
V 1.720000e+01
instead of the regular format. But when I filter the df for a specific ID, it shows the values as they are supposed to be (e.g. 82.23 or 2.45):
df[df['ID'] == 'E']
ID | Dollar_Value
E 45.32
Is there something I have to do formatting-wise so the df itself displays the value column as it's supposed to?
Thanks!
You can try running this code before printing, since your columns may contain very large or very small numbers (check with df.describe()):
pd.set_option('display.float_format', lambda x: '%.3f' % x)
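Applied to the sample values, the option looks roughly like this (a minimal sketch; the scientific notation in the question only appears when a column mixes very large and very small numbers):

import pandas as pd

df = pd.DataFrame({'ID': ['C', 'E', 'V'],
                   'Dollar_Value': [45.32, 5.21, 121.32]})

# Print floats with two decimals instead of scientific notation
pd.set_option('display.float_format', lambda x: '%.2f' % x)
print(df)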

Counting Values in Columns Ignoring Alphanumeric Values

First post here. I am trying to find the total count of values in an Excel file. After importing the file, I need to count all the values in each column except 0, and wherever it finds a 0, make that cell blank.
df6 = df5.append(df5.ne(0).sum().rename('Final Value'))
I tried the above, but it is not working properly: it counts the column names as well, and I only need to count the float values.
Demo DataFrame:
0 1 2 3
ID_REF 1007_s 1053_a 117_at 121_at
GSM95473 0.08277 0.00874 0.00363 0.01877
GSM95474 0.09503 0.00592 0.00352 0
GSM95475 0.08486 0.00678 0.00386 0.01973
GSM95476 0.08105 0.00913 0.00306 0.01801
GSM95477 0.00000 0.00812 0.00428 0
GSM95478 0.07615 0.00777 0.00438 0.01799
GSM95479 0 0.00508 1 0
GSM95480 0.08499 0.00442 0.00298 0.01897
GSM95481 0.08893 0.00734 0.00204 0
0 1 2 3
ID_REF 1007_s 1053_a 117_at 121_at
These are the column names and index values, which need to be ignored when counting.
The output should look like this after counting:
Final 8 9 9 5
If you just need the count, and don't mind changing the values in your dataframe, you could apply a function to each cell in your DataFrame with the applymap method. First create a function to check for a float:
def floatcheck(value):
    if isinstance(value, float):
        return 1
    else:
        return 0
Then apply it to your dataframe:
df6 = df5.applymap(floatcheck)
This will create a dataframe with a 1 if the value is a float and a 0 if not. Then you can apply your sum method:
df7 = df6.append(df6.sum().rename("Final Value"))
I was able to solve the issue, so here it is:
df5 = df4.append(pd.DataFrame(dict(((df4[1:] != 1) & (df4[1:] != 0)).sum()), index=['Final']))
df5.columns = df4.columns
went = df5.to_csv("output3.csv")
What I did was change the starting index so I didn't count the first row, which was alphanumeric, and then I just compared the values.
Thanks for your response.
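As a cross-check, the same per-column count can be produced without touching the original values (a sketch on a reduced version of the demo data; the first ID_REF row is assumed to hold the real headers and is skipped):

import pandas as pd

# Reduced demo: row 0 holds the real headers, the rest are measurements
df4 = pd.DataFrame({
    0: ['1007_s', 0.08277, 0.09503, 0.0],
    1: ['1053_a', 0.00874, 0.00592, 0.00812],
    2: ['117_at', 0.00363, 0.00352, 0.0],
    3: ['121_at', 0.01877, 0, 0],
})

# Skip the header row, coerce everything else to numbers, count the non-zero values
values = df4.iloc[1:].apply(pd.to_numeric, errors='coerce')
counts = values.ne(0).sum()

df5 = df4.copy()
df5.loc['Final'] = counts
print(df5)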
