How to fill missing values based on group using pandas - python

I am looking at order data. Each order comes in at multiple lines depending on how many different items are part of the order. The table looks like this:
+--------------+------------------+-------+
| order number | shipping address | item  |
+--------------+------------------+-------+
| A123         | Canada           | boots |
+--------------+------------------+-------+
| A123         | null             | socks |
+--------------+------------------+-------+
| A123         | null             | laces |
+--------------+------------------+-------+
| B456         | California       | shirt |
+--------------+------------------+-------+
How can I fill the null values with the actual shipping address, etc. for that order, in this case 'Canada'? (Using python + pandas ideally)

You need a dictionary with the order number as the key and the shipping address as the value. Just drop the NULLs, build the dict, and map it onto the shipping address column.
di = df[['order number', 'shipping address']]
di = di[di['shipping address'].notnull()]
di.set_index('order number', inplace=True)
di = di['shipping address'].to_dict()
df['shipping address'] = df['order number'].map(di)
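For reference, here is a self-contained sketch of the same idea on the sample data above (a slight variant that fills only the missing cells instead of rewriting the whole column; it assumes each order number has at most one non-null address):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'order number': ['A123', 'A123', 'A123', 'B456'],
    'shipping address': ['Canada', np.nan, np.nan, 'California'],
    'item': ['boots', 'socks', 'laces', 'shirt'],
})

# lookup Series: order number -> its (single) non-null shipping address
addr = df.dropna(subset=['shipping address']).set_index('order number')['shipping address']

# fill only the rows where the address is missing
df['shipping address'] = df['shipping address'].fillna(df['order number'].map(addr))
print(df)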

This is an approach using df.groupby()
df['shipping address'] = (df.groupby('order number')['shipping address']
                            .apply(lambda x: x.ffill().bfill()))
print(df)
order number shipping address item
0 A123 Canada boots
1 A123 Canada socks
2 A123 Canada laces
3 B456 California shirt

Related

How can I filter rows out if their start date is within 90 days from today, and leave them out until the 1st of the following month, in R?

I am having difficulty finding the words to describe what I am searching for but will try. I would like to solve the following using R or Python (but preferably R).
I have a dataframe of employees with their employee ID, department, start date etc. I am looking to perform calculations for each employee but would like to ignore employees that have a start date within 90 days from today. Additionally I would like for this employee to be left out of consideration until the 1st of the following month. So basically exclude employees until the 1st of the month following their 90th day after hire. I do not need to include only workdays for this project.
In the below example, for a report run on May 3, 2022, I would exclude IDs (22222, 33333, 44444, 66666, 88888, and 99999).
ID    | Dept       | Start Date |
11111 | Sales      | 04/10/2015 |
22222 | Field Tech | 04/30/2022 |
33333 | Lab tech   | 02/10/2022 |
44444 | Sales      | 02/01/2022 |
55555 | Proj. Man  | 01/01/2022 |
66666 | Administr  | 05/05/1999 |
77777 | Field Tech | 06/25/2015 |
88888 | Administr  | 03/01/2022 |
99999 | Lab tech   | 05/12/2022 |
your_data %>%
  mutate(`Start Date` = as.Date(`Start Date`, "%m/%d/%Y")) %>%
  # keep employees only from the 1st of the month after their 90th day of employment
  filter(Sys.Date() >= lubridate::ceiling_date(`Start Date` + 90, "month"))
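Since the question also allows Python, here is a rough pandas sketch of the same rule (the column names and the fixed report date of May 3, 2022 are taken from the example; emp is a hypothetical DataFrame holding the table above):
import pandas as pd

report_date = pd.Timestamp('2022-05-03')
emp['Start Date'] = pd.to_datetime(emp['Start Date'], format='%m/%d/%Y')

# 90th day after hire, rolled forward to the 1st of the following month
eligible_from = emp['Start Date'] + pd.Timedelta(days=90) + pd.offsets.MonthBegin(1)

# keep only employees already past that date on the report date
kept = emp[eligible_from <= report_date]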

Fill Null address based on the same owner in Python

Let's say I have a table of a house cleaning service like this.
| Customer| House Address | Date |
| ------- | ------------- | -------- |
| Sam | London | 10/01/22 |
| Lina | Manchester | 12/01/22 |
| Sam | Null | 15/01/22 |
We know that Sam's house address should be London (assume that the customer ID is the same).
How can I fill the third row based on the first row?
Data:
{'Customer': ['Sam', 'Lina', 'Sam'],
'House Address': ['London', 'Manchester', nan],
'Date': ['10/01/22', '12/01/22', '15/01/22']}
You could groupby "Customer" and transform "first" on "House Address" (first skips NaN values, so only London will be selected for Sam). It returns a Series aligned with the original df's index, filled with each group's first value.
Then pass this to fillna to fill NaN values in "House Address":
df['House Address'] = df['House Address'].fillna(df.groupby('Customer')['House Address'].transform('first'))
Output:
Customer House Address Date
0 Sam London 10/01/22
1 Lina Manchester 12/01/22
2 Sam London 15/01/22

Reorder columns of a dataframe based on max length of columns

I have a dataframe like so :
| emp_id | name | address | zipcode |
|--------|------------|----------------------|---------|
| 1234 | Jack Black | 123 at abc shore xyz | 12345 |
| 1233 | John Wick | 321 at xyz | 54321 |
| 1232 | Sam | 321 at xyz at qrst | 54311 |
I want to rearrange the columns based on the max length of string in each column.
In the above example, address would have the highest max string length (length 20 in row 1), and say emp_id has a max length of 4 (when converted to string).
I need to rearrange the columns based on this max length (descending), after which the table must look like the following:
| address              | name       | zipcode | emp_id |
|----------------------|------------|---------|--------|
| 123 at abc shore xyz | Jack Black | 12345   | 1234   |
| 321 at xyz           | John Wick  | 54321   | 1233   |
| 321 at xyz at qrst   | Sam        | 54311   | 1232   |
Is there a way to do this for any random number of columns?
try via assign() + sort_values() + drop() on the transposed frame:
the idea here is to cast a copy of the dataframe to string (only for measuring lengths, so the original values stay untouched), compute the string length of every cell, take the per-column maximum, attach it as a helper column on the transposed frame, sort by it in descending order, then drop the helper and transpose back:
df = (df.T.assign(s=df.astype(str).applymap(len).max())
        .sort_values('s', ascending=False)
        .drop(columns='s')
        .T)
OR
as suggested by @mozway:
df = df.loc[:, df.astype(str).applymap(len).max().sort_values(ascending=False).index]
OR
Another possible way is to reindex the columns after sorting:
df = df.reindex(columns=df.astype(str).applymap(len).max().sort_values(ascending=False).index)
Convert all the columns to string, then applymap to get the length of the strings, call max to get the per-column maximum, sort the maximum lengths in descending order, and take the index.
cols = df.astype(str).applymap(len).max().sort_values(ascending=False).index
#cols
Index(['address', 'name', 'zipcode', 'emp_id'], dtype='object')
Then re-order the dataframe based on this column index:
df.loc[:,cols]
OUTPUT:
address name zipcode emp_id
0 123 at abc shore xyz Jack Black 12345 1234
1 321 at xyz John Wick 54321 1233
2 321 at xyz at qrst Sam 54311 1232
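Note that in newer pandas versions (2.1 and later) DataFrame.applymap is deprecated in favour of DataFrame.map, so the same idea can be written as, for example:
cols = df.astype(str).map(len).max().sort_values(ascending=False).index
df = df.loc[:, cols]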

groupby: how to show max(field1) and the value of field2 corresponding to max(field1)?

Let's say I have a table with 3 fields: client, city, sales, with sales being a float.
+--------+--------+-------+
| client | city   | sales |
+--------+--------+-------+
| a      | NY     | 0     |
| a      | LA     | 1     |
| a      | London | 2     |
| b      | NY     | 3     |
| b      | LA     | 4     |
| b      | London | 5     |
+--------+--------+-------+
For each client, I would like to show what is the city with the greatest sales, and what those sales are, ie I want this output:
+--------+--------+-------+
| client | city   | sales |
+--------+--------+-------+
| a      | London | 2     |
| b      | London | 5     |
+--------+--------+-------+
Any suggestions?
This table can be generated with:
df = pd.DataFrame()
df['client'] = np.repeat(['a', 'b'], 3)
df['city'] = np.tile(['NY', 'LA', 'London'], 2)
df['sales'] = np.arange(0, 6)
This is wrong because it also takes the 'maximum' of the city column, and shows NY simply because N > L alphabetically:
max_by_id = df.groupby('client').max()
I can first create a dataframe with the highest sales, and then merge it with the initial dataframe to retrieve the city; it works, but I was wondering if there is a faster / more elegant way?
out = pd.merge(df, max_by_id, how='inner', on=['client', 'sales'])
I remember doing something similar with cross apply statements in SQL but wouldn't know how to run a Pandas equivalent.
You need to sort by sales (descending), then groupby client and pick the first row:
df.sort_values(['sales'], ascending=False).groupby('client').first().reset_index()
OR
As suggested by @user3483203:
df.loc[df.groupby('client')['sales'].idxmax()]
Output:
client city sales
0 a London 2
1 b London 5

find a record across multiple python pandas dataframes

Let's say I have three dataframes as follows, and I would like to find in which dataframes a particular record exists.
This is dataframe1 (df1)
index | name   | acct_no | country
2     | alex   | 112233  | USA
3     | rider  | 223344  | Mexico
This is dataframe2 (df2)
index | name   | acct_no | country
2     | alex   | 112233  | USA
3     | keith  | 993344  | Brazil
This is dataframe3 (df3)
index | name   | acct_no | country
2     | alex   | 112233  | USA
3     | hopper | 444444  | Canada
So, if I run the following code, I can find all the information about acct_no 112233 for a single dataframe.
p = df1.loc[df1['acct_no'] == 112233]
But I would like to know which code will help me find out whether acct_no 112233 exists in df1, df2, and df3.
One way to know if the element is in the column 'acct_no' of a dataframe is:
>>> (df1['acct_no'] == 112233).any()
True
You could check whether it exists in all of them at once by doing:
>>> all([(df['acct_no'] == 112233).any() for df in [df1, df2, df3]])
True
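If you actually need to know which of the dataframes contain the record, rather than whether all of them do, one possible sketch (assuming the frames are named df1, df2 and df3 as above):
frames = {'df1': df1, 'df2': df2, 'df3': df3}
containing = [name for name, frame in frames.items() if (frame['acct_no'] == 112233).any()]
print(containing)  # ['df1', 'df2', 'df3'] for the sample data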
