Fill Null address based on the same owner in Python

Let's say I have a table for a house-cleaning service like this.
| Customer| House Address | Date |
| ------- | ------------- | -------- |
| Sam | London | 10/01/22 |
| Lina | Manchester | 12/01/22 |
| Sam | Null | 15/01/22 |
We know that Sam's house address should be London (assume the customer ID is the same).
How can I fill the third row based on the first row?
Data:
{'Customer': ['Sam', 'Lina', 'Sam'],
'House Address': ['London', 'Manchester', nan],
'Date': ['10/01/22', '12/01/22', '15/01/22']}
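For reference, a minimal way to build this frame (assuming numpy's nan for the missing value):
import numpy as np
import pandas as pd

# Recreate the example data; NaN marks the missing address.
df = pd.DataFrame({
    'Customer': ['Sam', 'Lina', 'Sam'],
    'House Address': ['London', 'Manchester', np.nan],
    'Date': ['10/01/22', '12/01/22', '15/01/22'],
})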

You could groupby "Customer" and transform 'first' on "House Address" ('first' skips NaN values, so only London will be selected for Sam). This returns a Series with the same index as the original df, filled with each group's first non-null value.
Then pass it to fillna to fill the NaN values in "House Address":
df['House Address'] = df['House Address'].fillna(df.groupby('Customer')['House Address'].transform('first'))
Output:
Customer House Address Date
0 Sam London 10/01/22
1 Lina Manchester 12/01/22
2 Sam London 15/01/22
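If a customer could have more than one non-null address over time, you may want to sort by Date first so that 'first' deterministically picks the earliest record. A sketch, assuming the dates are day-first strings as above:
# Parse the day-first dates and sort so that 'first' means the earliest visit.
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df = df.sort_values(['Customer', 'Date'])
df['House Address'] = df['House Address'].fillna(
    df.groupby('Customer')['House Address'].transform('first'))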

Related

How to fill missing values based on group using pandas

I am looking at order data. Each order comes in as multiple rows, depending on how many different items are part of the order. The table looks like this:
+--------------+------------------+-------+
| order number | shipping address | item |
+--------------+------------------+-------+
| A123 | Canada | boots |
+--------------+------------------+-------+
| A123 | null | socks |
+--------------+------------------+-------+
| A123 | null | laces |
+--------------+------------------+-------+
| B456 | California | shirt |
+--------------+------------------+-------+
How can I fill the null values with the actual shipping address, etc. for that order, in this case 'Canada'? (Using python + pandas ideally)
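For a reproducible sketch, the table can be recreated like this (assuming None for the nulls):
import pandas as pd

# Recreate the order table; None becomes NaN in the DataFrame.
df = pd.DataFrame({
    'order number': ['A123', 'A123', 'A123', 'B456'],
    'shipping address': ['Canada', None, None, 'California'],
    'item': ['boots', 'socks', 'laces', 'shirt'],
})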
You can build a dictionary with the order number as the key and the shipping address as the value: just drop the NULLs, create the dict, and map it via the order-number column.
di = df[['order number', 'shipping address']]
di = di[di['shipping address'].notnull()]
di = di.set_index('order number')['shipping address'].to_dict()
df['shipping address'] = df['order number'].map(di)
Here is an approach using df.groupby():
df['shipping address'] = (df.groupby('order number')['shipping address']
                            .apply(lambda x: x.ffill().bfill()))
print(df)
order number shipping address item
0 A123 Canada boots
1 A123 Canada socks
2 A123 Canada laces
3 B456 California shirt
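An equivalent one-liner, mirroring the transform approach from the first question (a sketch, assuming each order has a single consistent address):
# Broadcast each order's first non-null address to all of its rows.
df['shipping address'] = df.groupby('order number')['shipping address'].transform('first')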

How to order columns in pyspark in a specific sequence based on a list?

I have a dataframe in Spark that looks like this (but with more rows), where each city has the number of visitors on my website.
| date | New York | Los Angeles | Tokyo | London | Berlin | Paris |
|:----------- |:--------:| -----------:|------:|-------:|-------:|------:|
| 2022-01-01 | 150000 | 1589200 | 500120| 120330 |95058331|980000 |
I want to order the columns based on this list of cities (they are ordered according to their importance to me):
order = ["Paris", "Berlin", "London", "New York", "Los Angeles", "Tokyo"]
In the end, I need a dataframe like this. Is there any way to create a function that performs this ordering every time I need it? Expected result below:
| date | Paris | Berlin | London | New York | Los Angeles | Tokyo |
|:----------- |:--------:| -------:|-------:|---------:|------------:|------:|
| 2022-01-01 | 980000 | 95058331| 120330 | 150000 | 1589200 | 500120|
Thank you!
Try select using the list. In this case, insert date at the start of the list:
order[0:0] = ['date']
df_exemple.select(order).show()
+----------+------+--------+------+--------+-----------+------+
|      date| Paris|  Berlin|London|New York|Los Angeles| Tokyo|
+----------+------+--------+------+--------+-----------+------+
|2022-01-01|980000|95058331|120330|  150000|    1589200|500120|
+----------+------+--------+------+--------+-----------+------+
Your example:
df_exemple = spark.createDataFrame(
    [
        ('2022-01-01', '150000', '1589200', '500120', '120330', '95058331', '980000')
    ], ['date', 'New York', 'Los Angeles', 'Tokyo', 'London', 'Berlin', 'Paris'])
order = ['Paris', 'Berlin', 'London', 'New York', 'Los Angeles', 'Tokyo']
Now, a simple function to reorder:
def order_func(df, order_list):
    return df.select('date', *order_list)
result_df = order_func(df_exemple, order)
result_df.show()
+----------+------+--------+------+--------+-----------+------+
|      date| Paris|  Berlin|London|New York|Los Angeles| Tokyo|
+----------+------+--------+------+--------+-----------+------+
|2022-01-01|980000|95058331|120330|  150000|    1589200|500120|
+----------+------+--------+------+--------+-----------+------+
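If the list might contain cities that are not (yet) columns of the dataframe, a slightly more defensive sketch (the function name and the keep_first parameter are illustrative, not part of the original answer):
def reorder_columns(df, order_list, keep_first=('date',)):
    # Keep the leading column(s), then the requested cities that actually exist.
    existing = [c for c in order_list if c in df.columns]
    return df.select(*keep_first, *existing)

result_df = reorder_columns(df_exemple, order)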

Reorder columns of a dataframe based on max length of columns

I have a dataframe like so:
| emp_id | name | address | zipcode |
|--------|------------|----------------------|---------|
| 1234 | Jack Black | 123 at abc shore xyz | 12345 |
| 1233 | John Wick | 321 at xyz | 54321 |
| 1232 | Sam | 321 at xyz at qrst | 54311 |
I want to rearrange the columns based on the max length of string in each column.
In the above example, address has the highest max string length (20 characters, in row 1), while emp_id has a max length of 4 (when converted to string).
I need to rearrange the columns based on this max length (descending), after which the table must look like the following:
| address | name | zipcode | emp_id |
|----------------------|-------------|---------|--------|
| 123 at abc shore xyz | Jack Black | 12345 | 1234 |
| 321 at xyz | John Wick | 54321 | 1233 |
| 321 at xyz at qrst | Sam | 54311 | 1232 |
Is there a way to do this for any random number of columns?
try via applymap()+max()+sort_values():
the idea here is to typecast the whole dataframe to string (only for measuring lengths, so the original dtypes in the dataframe remain the same), calculate the string length of every cell, take the maximum per column, sort the columns by that maximum in descending order, and select the columns in that order:
cols = df.astype(str).applymap(len).max().sort_values(ascending=False).index
df = df[cols]
OR
as suggested by #mozway:
df = df.loc[:, df.astype(str).applymap(len).max().sort_values(ascending=False).index]
OR
Another possible way is to reindex the columns after sorting:
df = df.reindex(columns=df.astype(str).applymap(len).max().sort_values(ascending=False).index)
Convert all the columns to string, then use applymap to get the length of each string, call max to get the maximum per column, sort the maximum lengths in descending order, and take the index:
cols = df.astype(str).applymap(len).max().sort_values(ascending=False).index
#cols
Index(['address', 'name', 'zipcode', 'emp_id'], dtype='object')
Then re-order the dataframe based on this column index:
df.loc[:,cols]
OUTPUT:
address name zipcode emp_id
0 123 at abc shore xyz Jack Black 12345 1234
1 321 at xyz John Wick 54321 1233
2 321 at xyz at qrst Sam 54311 1232
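To reuse this on any dataframe, a small helper along these lines (the function name is illustrative; on newer pandas, DataFrame.map can replace applymap):
def sort_columns_by_max_len(df, ascending=False):
    # Order columns by the longest string representation each one contains.
    widths = df.astype(str).applymap(len).max()
    return df[widths.sort_values(ascending=ascending).index]

df = sort_columns_by_max_len(df)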

groupby: how to show max(field1) and the value of field2 corresponding to max(field1)?

Let's say I have a table with 3 fields: client, city, sales, with sales being a float.
+--------+--------+-------+
| client | city | sales |
+--------+--------+-------+
| a | NY | 0 |
| a | LA | 1 |
| a | London | 2 |
| b | NY | 3 |
| b | LA | 4 |
| b | London | 5 |
+--------+--------+-------+
For each client, I would like to show which city has the greatest sales and what those sales are, i.e. I want this output:
+--------+--------+-------+
| client | city | sales |
+--------+--------+-------+
| a | London | 2 |
| b | London | 5 |
+--------+--------+-------+
Any suggestions?
This table can be generated with:
import numpy as np
import pandas as pd

df = pd.DataFrame()
df['client'] = np.repeat(['a', 'b'], 3)
df['city'] = np.tile(['NY', 'LA', 'London'], 2)
df['sales'] = np.arange(0, 6)
This is wrong because it also takes the 'maximum' of the city and shows NY, simply because 'N' > 'L' alphabetically:
max_by_id = df.groupby('client').max()
I can first create a dataframe with the highest sales, and then merge it with the initial dataframe to retrieve the city; it works, but I was wondering if there is a faster / more elegant way?
out = pd.merge( df, max_by_id, how='inner' ,on=['client','sales'] )
I remember doing something similar with cross apply statements in SQL but wouldn't know how to run a Pandas equivalent.
You can sort by sales in descending order, then groupby client and pick the first row:
df.sort_values(['sales'], ascending=False).groupby('client').first().reset_index()
OR
As suggested by #user3483203:
df.loc[df.groupby('client')['sales'].idxmax()]
Output:
client city sales
0 a London 2
1 b London 5
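Another common pattern for the same result (a sketch): sort by sales and keep only the top row per client:
# Keep the highest-sales row for each client, then restore client order.
out = (df.sort_values('sales', ascending=False)
         .drop_duplicates('client')
         .sort_values('client'))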

find a record across multiple python pandas dataframes

Let's say I have three dataframes as follows, and I would like to find out in which dataframes a particular record exists.
this is dataframe1 (df1)
index | name | acct_no | country
2 | alex | 112233 | USA
3 | rider | 223344 | Mexico
This is dataframe2 (df2)
index | name | acct_no | country
2 | alex | 112233 | USA
3 | keith | 993344 | Brazil
This is dataframe3 (df3)
index | name | acct_no | country
2 | alex | 112233 | USA
3 | hopper | 444444 | Canada
So, if I run the following code, I can find all the information about acct_no 112233 for a single dataframe.
p = df1.loc[df1['acct_no']==112233]
But I would like to know what code will help me find out in which of df1, df2, and df3 acct_no 112233 exists.
One way to know if the element is in the 'acct_no' column of a dataframe is:
>> (df1['acct_no']==112233).any()
True
You could check whether it exists in all of them at the same time by doing:
>> all([(df['acct_no']==112233).any() for df in [df1, df2, df3]])
True
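To see which of the dataframes contain the account, rather than whether all of them do, a sketch that collects the matching names (the frames dict is only for illustration):
# Map names to frames, then keep the names where the account number appears.
frames = {'df1': df1, 'df2': df2, 'df3': df3}
found_in = [name for name, df in frames.items() if (df['acct_no'] == 112233).any()]
print(found_in)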
