Reorder columns of a dataframe based on max length of columns - python

I have a dataframe like so :
| emp_id | name | address | zipcode |
|--------|------------|----------------------|---------|
| 1234 | Jack Black | 123 at abc shore xyz | 12345 |
| 1233 | John Wick | 321 at xyz | 54321 |
| 1232 | Sam | 321 at xyz at qrst | 54311 |
I want to rearrange the columns, based on the max length of string in each column.
In the above example, address has the highest max string length (20, in row 1), while emp_id has a max length of 4 (when converted to string).
I need to rearrange the columns in descending order of this max length, after which the table must look like the following:
| address | name | zipcode | emp_id |
|----------------------|------------|---------|--------|
| 123 at abc shore xyz | Jack Black | 12345 | 1234 |
| 321 at xyz | John Wick | 54321 | 1233 |
| 321 at xyz at qrst | Sam | 54311 | 1232 |
Is there a way to do this for any random number of columns?

Try via applymap() + max() + sort_values():
The idea here is to typecast the whole dataframe to string (only for measuring lengths, so the original dtypes in the dataframe remain the same), calculate the string length of every cell, take the maximum per column, sort those maxima in descending order, and use the resulting index to reorder the columns:
cols = df.astype(str).applymap(len).max().sort_values(ascending=False).index
df = df[cols]
OR
as suggested by @mozway, the same selection written with loc:
df = df.loc[:, df.astype(str).applymap(len).max().sort_values(ascending=False).index]
OR
Another possible way is to reindex the columns after sorting:
df = df.reindex(columns=df.astype(str).applymap(len).max().sort_values(ascending=False).index)

Convert all the columns to string, then use applymap to get the length of each string, call max to get the maximum per column, sort the maximum lengths in descending order, and take the index.
cols = df.astype(str).applymap(len).max().sort_values(ascending=False).index
#cols
Index(['address', 'name', 'zipcode', 'emp_id'], dtype='object')
Then re-order the dataframe based on this column index:
df.loc[:,cols]
OUTPUT:
address name zipcode emp_id
0 123 at abc shore xyz Jack Black 12345 1234
1 321 at xyz John Wick 54321 1233
2 321 at xyz at qrst Sam 54311 1232

Related

Python3 Pandas - filter dataframe rows by matching columns values on another table

I need to do some specific filtering on a dataframe by matching a pair of values in each row of the dataframe with another pair of values in a csv table and checking whether the table's third column for the match says True or False.
Here's the dataframe (not the actual data, but has the same structure):
         ID        Date      FName   LName   Value
0         1  01.03.2022       John     Doe   12.30
1         2  01.03.2022       John     Doe   53.45
2         3  01.03.2022       John     Doe   17.48
3         4  21.03.2022      Amber  Conley  100.48
4         5  21.03.2022      Amber  Conley   96.41
...     ...         ...        ...     ...     ...
9999  10000  23.03.2022  Nathaniel     Doe   79.88
And here's the csv table (again, not the actual data):
| FName | LName | Included? |
| ------ | ------- | ------ |
| John | Doe | False |
| Karen | Buck | False |
| Jessica | Ryan | True |
| John | Ballard | False |
| Maria | Ross | True |
| Thomas | Martin | False |
| Amber | Conley | True |
| Richard | Buck | False |
| Brenda | Martin | False |
What I need to do is:
for each row 'R1' of the dataframe:
    if there's any row 'R2' in the table with the same FName-LName pair:
        if the matched row 'R2' has a True value in the 'Included?' column, keep 'R1'
        else, if the matched row 'R2' has a False value in the 'Included?' column, remove 'R1' from the dataframe
    else, if there's no matching row in the table for 'R1', remove 'R1' from the dataframe, but also print a warning/alert to the user that there is a new name.
In the input I've shown, rows 0 to 2 will be removed, since 'John Doe' has a False value in the table; rows 3 and 4 will be kept, since 'Amber Conley' has a True value in the table; and the last row will be removed, but an alert will be printed, saying "New name: Doe, Nathaniel".
What is the best way to code this in Python?
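
One possible approach (a sketch, not from the original thread) is to left-merge the lookup table onto the dataframe on the name pair, warn about pairs that are missing from the table, and keep only the rows flagged True. The frames below are small stand-ins built from the sample data in the question; df and lookup are placeholder names:
import pandas as pd

# Small stand-ins for the questioner's data, taken from the samples above.
df = pd.DataFrame({'ID': [1, 2, 4, 10000],
                   'Date': ['01.03.2022', '01.03.2022', '21.03.2022', '23.03.2022'],
                   'FName': ['John', 'John', 'Amber', 'Nathaniel'],
                   'LName': ['Doe', 'Doe', 'Conley', 'Doe'],
                   'Value': [12.30, 53.45, 100.48, 79.88]})
lookup = pd.DataFrame({'FName': ['John', 'Amber'],
                       'LName': ['Doe', 'Conley'],
                       'Included?': [False, True]})   # in practice: pd.read_csv(...)

# Left-merge the Included? flag onto the dataframe on the name pair.
merged = df.merge(lookup, on=['FName', 'LName'], how='left')

# Warn once per name pair that is not present in the lookup table.
for _, row in merged[merged['Included?'].isna()].drop_duplicates(['FName', 'LName']).iterrows():
    print(f"New name: {row['LName']}, {row['FName']}")

# Keep only the rows whose pair is flagged True, then drop the helper column.
filtered = merged[merged['Included?'].eq(True)].drop(columns='Included?')
Note that merge returns a new default index; if the original row index matters, wrap the merge in reset_index()/set_index() to preserve it.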

Replace comma-separated values in a dataframe with values from another dataframe

this is my first question on StackOverflow, so please pardon if I am not clear enough. I usually find my answers here but this time I had no luck. Maybe I am being dense, but here we go.
I have two pandas dataframes formatted as follows
df1
+------------+-------------+
| References | Description |
+------------+-------------+
| 1,2 | Descr 1 |
| 3 | Descr 2 |
| 2,3,5 | Descr 3 |
+------------+-------------+
df2
+--------+--------------+
| Ref_ID | ShortRef |
+--------+--------------+
| 1 | Smith (2006) |
| 2 | Mike (2009) |
| 3 | John (2014) |
| 4 | Cole (2007) |
| 5 | Jill (2019) |
| 6 | Tom (2007) |
+--------+--------------+
Basically, Ref_ID in df2 contains the IDs that make up the strings in the References field of df1.
What I would like to do is to replace values in the References field in df1 so it looks like this:
+-------------------------------------+-------------+
| References | Description |
+-------------------------------------+-------------+
| Smith (2006); Mike (2009) | Descr 1 |
| John (2014) | Descr 2 |
| Mike (2009);John (2014);Jill (2019) | Descr 3 |
+-------------------------------------+-------------+
So far I have only had to deal with columns and IDs in a 1-to-1 relationship, and this works perfectly:
Pandas - Replacing Values by Looking Up in an Another Dataframe
But I cannot get my mind around this slightly different problem. The only solution I could think of is to iterate with nested for and if cycles, comparing every string of df1 to df2 and making the substitution.
This would, I am afraid, be very slow, as I have ca. 2000 unique Ref_IDs and I have to repeat this operation on several columns similar to the References one.
Is anyone willing to point me in the right direction?
Many thanks in advance.
Let's try this:
df1 = pd.DataFrame({'Reference':['1,2','3','1,3,5'], 'Description':['Descr 1', 'Descr 2', 'Descr 3']})
df2 = pd.DataFrame({'Ref_ID': [1, 2, 3, 4, 5, 6],
                    'ShortRef': ['Smith (2006)', 'Mike (2009)', 'John (2014)',
                                 'Cole (2007)', 'Jill (2019)', 'Tom (2007)']})
df1['Reference2'] = (df1['Reference'].str.split(',')
                     .explode()
                     .map(df2.assign(Ref_ID=df2.Ref_ID.astype(str))
                            .set_index('Ref_ID')['ShortRef'])
                     .groupby(level=0).agg(list))
Output:
Reference Description Reference2
0 1,2 Descr 1 [Smith (2006), Mike (2009)]
1 3 Descr 2 [John (2014)]
2 1,3,5 Descr 3 [Smith (2006), John (2014), Jill (2019)]
Thanks to @Datanovice for the update; joining with ';' gives strings instead of lists:
df1['Reference2'] = (df1['Reference'].str.split(',')
                     .explode()
                     .map(df2.assign(Ref_ID=df2.Ref_ID.astype(str))
                            .set_index('Ref_ID')['ShortRef'])
                     .groupby(level=0).agg(';'.join))
Output:
Reference Description Reference2
0 1,2 Descr 1 Smith (2006);Mike (2009)
1 3 Descr 2 John (2014)
2 1,3,5 Descr 3 Smith (2006);John (2014);Jill (2019)
You can use a list comprehension and dict lookups, and I don't think this will be too slow.
First, make a fast-to-access mapping from Ref_ID to ShortRef:
mapping_dict = df2.set_index('Ref_ID')['ShortRef'].to_dict()
Then, let's split the references on commas:
df1_values = [v.split(',') for v in df1['References']]
Finally, we can iterate and do dictionary lookups (casting each piece back to int so it matches the integer Ref_ID keys), before concatenating back to strings:
df1['References'] = pd.Series([';'.join([mapping_dict[int(v)] for v in values]) for values in df1_values])
Is this usable or is it too slow?
Another solution is to use str.get_dummies and dot:
df3 = (df1.set_index('Description').Reference.str.get_dummies(',')
.reindex(columns=df2.Ref_ID.astype(str).values, fill_value=0))
df_final = (df3.dot(df2.ShortRef.values+';').str.strip(';').rename('References')
.reset_index())
Out[462]:
Description References
0 Descr 1 Smith (2006);Mike (2009)
1 Descr 2 John (2014)
2 Descr 3 Mike (2009);John (2014);Jill (2019)

Copy data from 1 data-set to another on the basis of Unique ID

I am matching two large data sets and trying to perform update, remove and create operations on the original data set by comparing it with the other data set. How can I update 2 or 3 columns out of 10 in the original data set while keeping the other columns' values the same as before?
I tried merge but to no avail.
Original data:
id | full_name | date
1 | John | 02-23-2006
2 | Paul Elbert | 09-29-2001
3 | Donag | 11-12-2013
4 | Tom Holland | 06-17-2016
other data:
id | full_name | date
1 | John | 02-25-2018
2 | Paul | 03-09-2001
3 | Donag | 07-09-2017
4 | Tom | 05-09-2016
Is it possible to update date column of original data on the basis of ID?
Answering your question:
"When ID match code update all values in date column without changing any value in name column of original data set"
original = pd.DataFrame({'id': ['1', '2'], 'full_name': ['John', 'Paul Elbert'],
                         'date': ['02-23-2006', '09-29-2001']})
other = pd.DataFrame({'id': ['1', '2'], 'full_name': ['John', 'Paul'],
                      'date': ['02-25-2018', '03-09-2001']})
original = original[['id', 'full_name']].merge(other[['id', 'date']], on='id')
print(original)
id full_name date
0 1 John 02-25-2018
1 2 Paul Elbert 03-09-2001
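If ids that are missing from the other data set must keep their existing date, and no other columns should be touched at all, a map-based variant is an alternative to the merge. This is only a sketch, reusing the same sample frames with an extra unmatched row added to show the fallback:
import pandas as pd

original = pd.DataFrame({'id': ['1', '2', '3'],
                         'full_name': ['John', 'Paul Elbert', 'Donag'],
                         'date': ['02-23-2006', '09-29-2001', '11-12-2013']})
other = pd.DataFrame({'id': ['1', '2'],
                      'full_name': ['John', 'Paul'],
                      'date': ['02-25-2018', '03-09-2001']})

# Look up the new date by id and fall back to the existing date where there is
# no match, so id 3 keeps '11-12-2013' and full_name is never touched.
original['date'] = original['id'].map(other.set_index('id')['date']).fillna(original['date'])
print(original)
For updating two or three columns at once, DataFrame.update on id-indexed views, with the other frame restricted to just those columns, does the same aligned overwrite in place.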

How do I split a single dataframe into multiple dataframes by the range of a column value?

First off, I realize that this question has been asked a ton of times in many different forms, but a lot of the answers just give code that solves the problem without explaining what the code actually does or why it works.
I have an enormous data set of phone numbers and area codes that I have loaded into a dataframe in python to do some processing with. Before I do that processing, I need to split the single dataframe into multiple dataframes that contain phone numbers in certain ranges of area codes that I can then do more processing on. For example:
+---+--------------+-----------+
| | phone_number | area_code |
+---+--------------+-----------+
| 1 | 5501231234 | 550 |
+---+--------------+-----------+
| 2 | 5051231234 | 505 |
+---+--------------+-----------+
| 3 | 5001231234 | 500 |
+---+--------------+-----------+
| 4 | 6201231234 | 620 |
+---+--------------+-----------+
into
area-codes (500-550)
+---+--------------+-----------+
| | phone_number | area_code |
+---+--------------+-----------+
| 1 | 5501231234 | 550 |
+---+--------------+-----------+
| 2 | 5051231234 | 505 |
+---+--------------+-----------+
| 3 | 5001231234 | 500 |
+---+--------------+-----------+
and
area-codes (600-650)
+---+--------------+-----------+
| | phone_number | area_code |
+---+--------------+-----------+
| 1 | 6201231234 | 620 |
+---+--------------+-----------+
I get that this should be possible using pandas (specifically groupby and a Series object I think) but the documentation and examples on the internet I could find were a little too nebulous or sparse for me to follow. Maybe there's a better way to do this than the way I'm trying to do it?
You can use pd.cut to bin the area_code column, then use the labels to group the data and store the groups in a dictionary. Finally, print each key to see its dataframe:
bins=[500,550,600,650]
labels=['500-550','550-600','600-650']
d={f'area_code_{i}':g for i,g in
df.groupby(pd.cut(df.area_code,bins,include_lowest=True,labels=labels))}
print(d['area_code_500-550'])
print('\n')
print(d['area_code_600-650'])
phone_number area_code
0 5501231234 550
1 5051231234 505
2 5001231234 500
phone_number area_code
3 6201231234 620
You can also do this by selecting rows in the dataframe, chaining multiple conditions with the & or | operator:
df1 selects rows with area_code between 500 and 550
df2 selects rows with area_code between 600 and 650
df = pd.DataFrame({'phone_number':[5501231234, 5051231234, 5001231234 ,6201231234],
'area_code':[550,505,500,620]},
columns=['phone_number', 'area_code'])
df1 = df[ (df['area_code']>=500) & (df['area_code']<=550) ]
df2 = df[ (df['area_code']>=600) & (df['area_code']<=650) ]
df1
phone_number area_code
0 5501231234 550
1 5051231234 505
2 5001231234 500
df2
phone_number area_code
3 6201231234 620

find a record across multiple python pandas dataframes

Let's say, I have three dataframes as follows, and I would like to find in which dataframes a particular record exists.
this is dataframe1 (df1)
index | name | acct_no | country
2 | alex | 112233 | USA
3 | rider | 223344 | Mexico
This is dataframe2 (df2)
index | name | acct_no | country
2 | alex | 112233 | USA
3 | keith | 993344 | Brazil
This is dataframe3 (df3)
index | name | acct_no | country
2 | alex | 112233 | USA
3 | hopper | 444444 | Canada
So, if I run the following code, I can find all the information about acct_no 112233 for a single dataframe.
p = df1.loc[df1['acct_no']==112233]
But, I would like to know which code will help me find out that acct_no 112233 exists in df1, df2, df3
One way to know if the element is in the 'acct_no' column of a dataframe is:
>>> (df1['acct_no']==112233).any()
True
You could check whether it exists in all of them at the same time by doing:
>>> all([(df['acct_no']==112233).any() for df in [df1, df2, df3]])
True
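If you want to know which of the dataframes contain the record, rather than whether it is in all of them, one option (a sketch using the sample data from the question) is to put the frames in a named dict and keep the names where the check succeeds:
import pandas as pd

# The three sample frames from the question.
df1 = pd.DataFrame({'name': ['alex', 'rider'], 'acct_no': [112233, 223344], 'country': ['USA', 'Mexico']})
df2 = pd.DataFrame({'name': ['alex', 'keith'], 'acct_no': [112233, 993344], 'country': ['USA', 'Brazil']})
df3 = pd.DataFrame({'name': ['alex', 'hopper'], 'acct_no': [112233, 444444], 'country': ['USA', 'Canada']})

# Keep the name of every frame in which the account number occurs.
frames = {'df1': df1, 'df2': df2, 'df3': df3}
found_in = [name for name, frame in frames.items() if frame['acct_no'].eq(112233).any()]
print(found_in)   # ['df1', 'df2', 'df3']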
