If col is an empty string, make the adjacent column empty as well - python

Consider this sample df:
colAnum  colB   colCnum  colD
123      House  456      Book
         Car    789      Table
891      Chair           Porch
I am trying to iterate over this df: whenever a "num" column holds an empty string, the column immediately to its right should be made empty as well.
This is the expected output:
colAnum  colB   colCnum  colD
123      House  456      Book
                789      Table
891      Chair
I attempted this with variations on this:
for idx, col in enumerate(df.columns):
    if df.iloc[idx, col] == '':
        df[idx+1, col] == ''
I am sure I am missing something simple to make this occur, but cannot work my way around it.

Try shift with mask:
out = df.mask(df.eq('').shift(axis=1).fillna(False),'')
  colAnum   colB  colCnum   colD
0   123.0  House    456.0   Book
1                   789.0  Table
2   891.0  Chair
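For reference, a self-contained sketch of the shift-with-mask approach. The frame below is a hypothetical reconstruction of the sample, with blanks assumed to be empty strings rather than NaN; `shift(..., fill_value=False)` is used instead of `.fillna(False)` so the condition stays boolean dtype:

```python
import pandas as pd

# Hypothetical reconstruction of the sample frame; blanks are empty strings.
df = pd.DataFrame({
    'colAnum': ['123', '', '891'],
    'colB':    ['House', 'Car', 'Chair'],
    'colCnum': ['456', '789', ''],
    'colD':    ['Book', 'Table', 'Porch'],
})

# df.eq('') flags the empty cells, shift(axis=1) moves each flag one
# column to the right, and mask() blanks the flagged cells.
out = df.mask(df.eq('').shift(axis=1, fill_value=False), '')
print(out)
```

Row 1 loses "Car" (colAnum was empty) and row 2 loses "Porch" (colCnum was empty), matching the expected output.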

Related

pandas fill in missing index from other dataframe

I wanted to know if there is a way for me to merge / re-join the missing rows simply by index.
My original approach was to cleanly separate df1 into df1_cleaned and df1_untouched and then join them back together. But I figured there is probably an easier way to re-join the two frames, since I didn't change the index. I tried an outer merge with left_index and right_index, but was left with duplicate columns (with suffixes) to clean up.
df1

index  colA        colB  colC
0      California  123   abc
1      New York    456   def
2      Texas       789   ghi
df2 (subset of df1 and cleaned)

index  colA        colB  colC
0      California  321   abc
2      Texas       789   ihg
end-result

index  colA        colB  colC
0      California  321   abc
1      New York    456   def
2      Texas       789   ihg
You can use combine_first or update:
df_out = df2.combine_first(df1)
or, pd.DataFrame.update (which is an in-place operation and will overwrite df1):
df1.update(df2)
Output:
colA colB colC
index
0 California 321.0 abc
1 New York 456.0 def
2 Texas 789.0 ihg
You can take the index difference, reindex df2 to df1's index, and fill the missing rows from df1:
df_result = df2.reindex(df1.index)
missing_index = df1.index.difference(df2.index)
df_result.loc[missing_index] = df1.loc[missing_index]
print(df_result)
colA colB colC
0 California 321.0 abc
1 New York 456.0 def
2 Texas 789.0 ihg
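For reference, a self-contained sketch of the combine_first answer; the two frames are hypothetical reconstructions of df1 and df2 from the question:

```python
import pandas as pd

# Hypothetical reconstruction of the two frames from the question.
df1 = pd.DataFrame(
    {'colA': ['California', 'New York', 'Texas'],
     'colB': [123, 456, 789],
     'colC': ['abc', 'def', 'ghi']})
df2 = pd.DataFrame(
    {'colA': ['California', 'Texas'],
     'colB': [321, 789],
     'colC': ['abc', 'ihg']},
    index=[0, 2])  # subset of df1's index, indexes unchanged

# Rows present in df2 win; index labels missing from df2 (here: 1)
# fall back to the corresponding row in df1.
df_out = df2.combine_first(df1)
print(df_out)
```

Note that colB comes back as float (456.0 etc.) because the alignment step introduces NaN before filling; cast back with astype(int) if that matters.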

I want to count occurrences of a subset in a pandas dataframe

If in a data frame I have a data like below:
Name    Id
Alex    123
John    222
Alex    123
Kendal  333
So I want to add a column which will result:
Name    Id   Subset Count
Alex    123  2
John    222  1
Alex    123  2
Kendal  333  1
I used the code below but didn't get the expected output:
df['Subset Count'] = df.value_counts(subset=['Name','Id'])
Try via groupby():
df['Subset Count'] = df.groupby(['Name','Id'])['Name'].transform('count')
OR via droplevel() and map():
df['Subset Count']=df['Name'].map(df.value_counts(subset=['Name','Id']).droplevel(1))
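For reference, a self-contained sketch of the groupby approach, using a hypothetical reconstruction of the frame from the question. The key point is that transform('count') broadcasts each group's size back onto every member row, so the result aligns with the original index (unlike value_counts, which returns a differently-indexed Series):

```python
import pandas as pd

# Hypothetical reconstruction of the frame from the question.
df = pd.DataFrame({'Name': ['Alex', 'John', 'Alex', 'Kendal'],
                   'Id':   [123, 222, 123, 333]})

# transform('count') returns one value per original row: the size of
# the (Name, Id) group that row belongs to.
df['Subset Count'] = df.groupby(['Name', 'Id'])['Name'].transform('count')
print(df)
```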

Replace column values using mask and multiple mappings

I have two dataframes. One is v_df and looks like this:
VENDOR_ID  VENDOR_NAME
123        APPLE
456        GOOGLE
987        FACEBOOK
The other is n_df and looks like this:
Vendor_Name   GL_Transaction_Description
AMEX          HELLO 345
Not assigned  BYE 456
Not assigned  THANKS 123
I want to populate the 'Vendor_Name' column in n_df on the condition that the 'GL_Transaction_Description' on the same row contains any VENDOR_ID value from v_df. So the resulting n_df would be this:
Vendor_Name  GL_Transaction_Description
AMEX         HELLO 345
GOOGLE       BYE 456
APPLE        THANKS 123
So far I have written this code:
v_list = v_df['VENDOR_ID'].to_list()
mask_id = list(map((lambda x: any([(y in x) for y in v_list])), n_df['GL_Transaction_Description']))
n_df['Vendor_Name'].mask((mask_id), other = 'Solution Here', inplace=True)
I am just not able to grasp what to write in the 'other' condition of the final masking. Any ideas? (n_df has more than 100k rows, so the execution speed of the solution is of high importance)
Series.str.extract + map
i = v_df['VENDOR_ID'].astype(str)
m = v_df.set_index(i)['VENDOR_NAME']
s = n_df['GL_Transaction_Description'].str.extract(r'(\d+)', expand=False)
n_df['Vendor_Name'].update(s.map(m))
Explanations
Create a mapping series m from the v_df dataframe by setting the VENDOR_ID column as the index and selecting the VENDOR_NAME column
>>> m
VENDOR_ID
123 APPLE
456 GOOGLE
987 FACEBOOK
Name: VENDOR_NAME, dtype: object
Now extract the vendor id from the strings in GL_Transaction_Description column
>>> s
0 345
1 456
2 123
Name: GL_Transaction_Description, dtype: object
Map the extracted vendor id with the mapping series m and update the mapped values in Vendor_Name column
>>> n_df
Vendor_Name GL_Transaction_Description
0 AMEX HELLO 345
1 GOOGLE BYE 456
2 APPLE THANKS 123
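For reference, a self-contained sketch of the extract-and-map approach, with both frames as hypothetical reconstructions from the question. One deliberate change: a fillna-based assignment is used in place of Series.update(), since in-place update on a column can misbehave under pandas copy-on-write; the effect is the same (unmapped rows keep their existing value):

```python
import pandas as pd

# Hypothetical reconstruction of the two frames from the question.
v_df = pd.DataFrame({'VENDOR_ID': [123, 456, 987],
                     'VENDOR_NAME': ['APPLE', 'GOOGLE', 'FACEBOOK']})
n_df = pd.DataFrame({'Vendor_Name': ['AMEX', 'Not assigned', 'Not assigned'],
                     'GL_Transaction_Description':
                         ['HELLO 345', 'BYE 456', 'THANKS 123']})

# Mapping series keyed by the id as a string, so it matches the
# strings that str.extract returns.
m = v_df.set_index(v_df['VENDOR_ID'].astype(str))['VENDOR_NAME']

# Pull the first run of digits out of each description and map it;
# ids with no vendor match (345 here) map to NaN.
s = n_df['GL_Transaction_Description'].str.extract(r'(\d+)', expand=False)
mapped = s.map(m)

# Keep the existing Vendor_Name wherever the lookup failed.
n_df['Vendor_Name'] = mapped.fillna(n_df['Vendor_Name'])
print(n_df)
```

Because this is a single vectorized extract plus a hash-based map, it should scale comfortably to the 100k+ rows mentioned in the question.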

compare columns and generate duplicate rows in mysql (or) python pandas

I am new to MySQL and just getting started with some basic concepts. I have been trying to solve this for a while now; any help is appreciated.
I have a list of users with two phone numbers. I want to compare the two phone-number columns and generate a new row if the values differ, or retain the single row unchanged if they are the same.
The processed data would look like the second table.
Is there any way to achieve this in MySQL?
I also don't mind doing the transformation in a dataframe and then loading it into a table.
id  username  primary_phone  landline
1   John      222            222
2   Michael   123            121
3   lucy      456            456
4   Anderson  900            901
Thanks!!!
Use DataFrame.melt, then remove the helper variable column and apply DataFrame.drop_duplicates:
df = (df.melt(['id','username'], value_name='phone')
        .drop('variable', axis=1)
        .drop_duplicates()
        .sort_values('id'))
print (df)
   id  username  phone
0   1  John        222
1   2  Michael     123
5   2  Michael     121
2   3  lucy        456
3   4  Anderson    900
7   4  Anderson    901
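For reference, a self-contained sketch of the melt answer, with the users table reconstructed hypothetically from the question. melt stacks both phone columns into one; drop_duplicates then keeps a single row when primary_phone equals landline and two rows otherwise:

```python
import pandas as pd

# Hypothetical reconstruction of the users table from the question.
df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'username': ['John', 'Michael', 'lucy', 'Anderson'],
                   'primary_phone': [222, 123, 456, 900],
                   'landline': [222, 121, 456, 901]})

out = (df.melt(['id', 'username'], value_name='phone')
         .drop('variable', axis=1)   # discard the source-column label
         .drop_duplicates()          # collapse rows where both phones match
         .sort_values('id')
         .reset_index(drop=True))
print(out)
```

John and lucy, whose two numbers are identical, keep one row each; Michael and Anderson each get two rows.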

Use Excel sheet to create dictionary in order to replace values

I have an Excel file with product names. The first row is the category (A1: Water, A2: Sparkling, A3: Still, B1: Soft Drinks, B2: Coca Cola, B3: Orange Juice, B4: Lemonade, etc.), and each cell below is a different product. I want to keep this list in a viewable format (not comma separated etc.), as this makes it very easy for anybody to update the product names (a second person runs the script without understanding it).
If it helps, I can also save the Excel file as CSV, and I can move the categories from the top row to the first column.
I would like to replace the cells of a dataframe (df) with the product categories. For example, Coca Cola would become Soft Drinks. If the product is not in the Excel file it should not be replaced (e.g. Cookie).
print(df)
     Product  Quantity
0  Coca Cola      1234
1     Cookie         4
2      Still       333
3      Chips        88
Expected Outcome:
print (df1)
       Product  Quantity
0  Soft Drinks      1234
1       Cookie         4
2        Water       333
3       Snacks        88
Use DataFrame.melt with DataFrame.dropna (or DataFrame.stack) to build a helper Series, then use Series.replace:
s = df1.melt().dropna().set_index('value')['variable']
Alternative:
s = df1.stack().reset_index(name='v').set_index('v')['level_1']
df['Product'] = df['Product'].replace(s)
#if performance is important
#df['Product'] = df['Product'].map(s).fillna(df['Product'])
print (df)
       Product  Quantity
0  Soft Drinks      1234
1       Cookie         4
2        Water       333
3       Snacks        88
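For reference, a self-contained sketch of the mapping approach. The frame df1 below is a hypothetical stand-in for what pd.read_excel would return for the sheet described in the question: the category row becomes the column headers, products sit below, and NaN pads the shorter columns (the Snacks column is assumed for the Chips example):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Excel sheet after pd.read_excel:
# column headers are categories, cells below are products.
df1 = pd.DataFrame({'Water': ['Sparkling', 'Still', np.nan],
                    'Soft Drinks': ['Coca Cola', 'Orange Juice', 'Lemonade'],
                    'Snacks': ['Chips', np.nan, np.nan]})
df = pd.DataFrame({'Product': ['Coca Cola', 'Cookie', 'Still', 'Chips'],
                   'Quantity': [1234, 4, 333, 88]})

# melt turns the sheet into (category, product) pairs; indexing by the
# product gives a product -> category lookup Series.
s = df1.melt().dropna().set_index('value')['variable']

# map + fillna leaves products not in the sheet (Cookie) untouched.
df['Product'] = df['Product'].map(s).fillna(df['Product'])
print(df)
```

Since the second person only edits the sheet, adding a new product under an existing category column (or adding a whole new category column) requires no script changes.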
