How to do text-to-columns in pandas and create new columns? - python

I have a CSV file as shown below. The Name column has names separated by commas; I want to split them on the comma, append them to new columns, and write out the same CSV, similar to Text to Columns in Excel. The problem is that rows have a varying number of names.
| Address | Name |
| 1st st  | John, Smith |
| 2nd st. | Andrew, Jane, Aaron |
My pandas code looks something like this:
df1 = pd.read_csv('sample.csv')
df1['Name'] = df1['Name'].str.split(',', expand=True)
df1.to_csv('results.csv',index=None)
Of course this doesn't work, because "Columns must be same length as key". The expected output is:
| Address | Name   |       |       |
| 1st st  | John   | Smith |       |
| 2nd st. | Andrew | Jane  | Aaron |

Count the maximum number of names in a row, then assign to that many new columns accordingly:
max_names = df['name'].str.split(', ').str.len().max()
df[[f'name_{x}' for x in range(max_names)]] = df['name'].str.split(', ', expand=True)
input df:
col name
0 1st st john, smith
1 2nd st andrew, jane, aron
2 3rd st harry, philip, anna, james
output:
col name name_0 name_1 name_2 name_3
0 1st st john, smith john smith None None
1 2nd st andrew, jane, aron andrew jane aron None
2 3rd st harry, philip, anna, james harry philip anna james

You can do
out = df.join(df['Name'].str.split(', ',expand=True).add_prefix('name_'))
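Putting the join approach together with the CSV round trip from the question (a sketch; an inline frame stands in for `sample.csv`, and the original `Name` column is dropped to match the expected output):

```python
import pandas as pd

# Inline frame standing in for sample.csv from the question
df1 = pd.DataFrame({'Address': ['1st st', '2nd st.'],
                    'Name': ['John, Smith', 'Andrew, Jane, Aaron']})

# Split into as many columns as the longest row needs, then join back
names = df1['Name'].str.split(', ', expand=True).add_prefix('name_')
out = df1.drop(columns='Name').join(names)
# out.to_csv('results.csv', index=None) would write it back out
```

Rows with fewer names get `None` in the extra columns, which `to_csv` writes as empty cells.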


Python, Pandas: Assign cells in dataframe to groups

Good morning everyone!
I hope the title fits my question. I have a dataframe that looks something like this:
Dataframe (before):
id | name | position
1 | jane doe | position_1
2 | john doe | position_2
3 | john smith | position_3
I also have multiple lists of groups:
group_1 = ['position_3', 'position_18', 'position_45']
group_2 = ['position_2', 'position_9']
group_7 = ['position_1']
Now I'm wondering: what is the best way to add another column to my dataframe with the assigned group? For example:
Dataframe (after):
id | name | position | group
1 | jane doe | position_1 | group_7
2 | john doe | position_2 | group_2
3 | john smith | position_3 | group_1
Notes:
every position is unique and will never appear in more than one group!
You can create a mapping dictionary in which the key is the position and the value is the group name, then map this dictionary onto the column position:
dct = {'group_1': group_1, 'group_2': group_2, 'group_7': group_7}
mapping_dct = {pos:grp for grp, positions in dct.items() for pos in positions}
df['group'] = df['position'].map(mapping_dct)
>>> df
id name position group
0 1 jane doe position_1 group_7
1 2 john doe position_2 group_2
2 3 john smith position_3 group_1
id = [1, 2, 3]
name = ['jane doe', 'john doe', 'john smith']
position = ['position_1', 'position_2', 'position_3']
df = pd.DataFrame({'id': id, 'name': name, 'position': position})
group_1 = ['position_3', 'position_18', 'position_45']
group_2 = ['position_2', 'position_9']
group_7 = ['position_1']
dct = {'group_1': group_1, 'group_2': group_2, 'group_7': group_7}
def lookup(itemParam):
    keys = []
    for key, item in dct.items():
        if itemParam in item:
            keys.append(key)
    return keys
mylist = [*map(lookup, df['position'])]
mylist = [x[0] for x in mylist]
df['group'] = mylist
print(df.head())
output:
id name position group
0 1 jane doe position_1 group_7
1 2 john doe position_2 group_2
2 3 john smith position_3 group_1

Split lists with uncertain elements into different categories (using pandas)

I am having trouble with a pandas split. So I have a column of data that looks something like this:
Initial Dataframe
index | Address
0 | [123 New York St]
1 | [Amazing Building, 23 New Jersey St, 2F]
2 | [98 New Mexico Ave, 16F]
3 | [White House, 1600 Pennsylvania Ave, PH]
4 | [221 Baker Street]
5 | [Hogwarts]
As you can see, the lists contain varying categories and numbers of elements. Some have building names along with addresses; some only have addresses with building floors. I want to sort them out by category (building name, address, unit/floor number), but I'm having trouble coming up with a solution, as I'm a beginner Python and pandas learner.
How do I split the addresses into different categories to get an output that looks like this, assuming the building names all start with a letter and I can put Null for categories with a missing value?
Desired Output:
index | Building Name | Address | Unit Number
0 | Null | 123 New York St | Null
1 | Amazing Building | 23 New Jersey St. | 2F
2 | Null | 98 New Mexico Ave. | 16F
3 | White House | 1600 Pennsylvania Ave | PH
4 | Null | 221B Baker St | Null
5 | Hogwarts | Null | Null
The main thing I need is for all addresses to be in the Address Column. Thanks for any help!
Precondition: the building name starts with a letter, not a number.
If the building name starts with a number, the result may be wrong.
import pandas as pd
df = pd.DataFrame({'addr': ['123 New York St',
                            'Amazing Building, 23 New Jersey St, 2F',
                            '98 New Mexico Ave, 16F']})
# Check the number of items in the address value
df['addr'] = df['addr'].str.split(', ')
df['cnt'] = df['addr'].apply(len)
# Helper: does the string start with a digit?
def CheckInt(s):
    try:
        int(s[0])
        return True
    except ValueError:
        return False
for i, v in df.iterrows():
    # One item of address value
    if v.cnt == 1:
        df.loc[i, 'Address'] = v.addr[0]
    # Three items of address value
    elif v.cnt == 3:
        df.loc[i, 'Building'] = v.addr[0]
        df.loc[i, 'Address'] = v.addr[1]
        df.loc[i, 'Unit'] = v.addr[2]
    # Two items of address value
    else:
        if CheckInt(v.addr[0]):
            df.loc[i, 'Address'] = v.addr[0]
            df.loc[i, 'Unit'] = v.addr[1]
        else:
            df.loc[i, 'Building'] = v.addr[0]
            df.loc[i, 'Address'] = v.addr[1]
We can get the output for your input dataframe as below.
If the data is different, you may have to tinker around.
import numpy as np
df['com_Address'] = df[' Address'].apply(lambda x: x.replace('[', '').replace(']', '')).str.split(',')
st_list = ['St', 'Ave']
df['St_Address'] = df.apply(lambda x: [a if st in a else '' for st in st_list for a in x['com_Address']], axis=1)
df['St_Address'] = df['St_Address'].apply(lambda x: [i for i in x if i]).astype(str).apply(lambda x: x.strip("[]'"))
df['Building Name'] = df.apply(lambda x: [x['com_Address'][0] if len(x['com_Address']) == 3 else 'Null'], axis=1).astype(str).apply(lambda x: x.strip("[]'"))
df['Building Name'] = df.apply(lambda x: np.where((len(x['com_Address']) == 1) & (x['St_Address'] == ''), x['com_Address'][0], x['Building Name']), axis=1)
df['Unit Number'] = df.apply(lambda x: [x['com_Address'][2] if len(x['com_Address']) == 3 else 'Null'], axis=1).astype(str).apply(lambda x: x.strip("[]'"))
df['Unit Number'] = df.apply(lambda x: np.where((len(x['com_Address']) == 2) & (x['St_Address'] != ''), x['com_Address'][-1], x['Unit Number']), axis=1)
df
Column "com_Address" is optional. I had to create it because the 'Address' from your input came to me as a string and not as a list. If you already have it as a list, you don't need it; just replace "com_Address" with 'Address' in the code.
Output
index Address com_Address Building Name St_Address Unit Number
0 0 [123 New York St] [ 123 New York St] Null 123 New York St Null
1 1 [Amazing Building, 23 New Jersey St, 2F] [ Amazing Building, 23 New Jersey St, 2F] Amazing Building 23 New Jersey St 2F
2 2 [98 New Mexico Ave, 16F] [ 98 New Mexico Ave, 16F] Null 98 New Mexico Ave 16F
3 3 [White House, 1600 Pennsylvania Ave, PH] [ White House, 1600 Pennsylvania Ave, PH] White House 1600 Pennsylvania Ave PH
4 4 [221 Baker Street] [ 221 Baker Street] Null 221 Baker Street Null
5 5 [Hogwarts] [ Hogwarts] Hogwarts Null

Merging two columns with different information, python

I have a dataframe with one column of last names, and one column of first names. How do I merge these columns so that I have one column with first and last names?
Here is what I have:
First Name (Column 1)
John
Lisa
Jim
Last Name (Column 2)
Smith
Brown
Dandy
This is what I want:
Full Name
John Smith
Lisa Brown
Jim Dandy
Thank you!
Try
df.assign(name = df.apply(' '.join, axis = 1)).drop(['first name', 'last name'], axis = 1)
You get
name
0 bob smith
1 john smith
2 bill smith
Here's a sample df:
df
first name last name
0 bob smith
1 john smith
2 bill smith
You can do the following to combine columns:
df['combined']= df['first name'] + ' ' + df['last name']
df
first name last name combined
0 bob smith bob smith
1 john smith john smith
2 bill smith bill smith
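For string columns specifically, `Series.str.cat` does the same concatenation; a minimal sketch on the sample frame above:

```python
import pandas as pd

df = pd.DataFrame({'first name': ['bob', 'john', 'bill'],
                   'last name': ['smith', 'smith', 'smith']})

# str.cat concatenates element-wise; sep=' ' inserts the space
df['combined'] = df['first name'].str.cat(df['last name'], sep=' ')
```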

Splitting multiple pipe delimited values in multiple columns of a comma delimited CSV and mapping them to each other

I have a comma-delimited CSV with a column containing multiple pipe-delimited values, which I need to map to another column's multiple pipe-delimited values, giving each pair its own row along with the data in the original row that doesn't have multiple values. My CSV looks like this (with commas between the categories):
row name city amount
1 frank | john | dave toronto | new york | anaheim 10
2 george | joe | fred fresno | kansas city | reno 20
I need it to look like this:
row name city amount
1 frank toronto 10
2 john new york 10
3 dave anaheim 10
4 george fresno 20
5 joe kansas city 20
6 fred reno 20
Maybe not the nicest, but a working solution (it handles rows with no pipes and rows with differing pipe counts):
df = pd.read_csv('<your_data>.csv')
str_split = ' | '
# Calculate the maximum number of piped (' | ') values per row
df['max_len'] = df[['name', 'city']].apply(lambda x: max(len(x[0].split(str_split)),
                                                         len(x[1].split(str_split))), axis=1)
max_len = df['max_len'].max()
# Split '|'-piped cell values into columns (needed at the unpivot step):
# create as many new 'name_<x>' and 'city_<x>' columns as 'max_len'
df[['name_{}'.format(i) for i in range(max_len)]] = df['name'].apply(
    lambda x: pd.Series(x.split(str_split)))
df[['city_{}'.format(i) for i in range(max_len)]] = df['city'].apply(
    lambda x: pd.Series(x.split(str_split)))
# Unpivot 'name_<x>' and 'city_<x>' columns into rows
df_pv_name = pd.melt(df, value_vars=['name_{}'.format(i) for i in range(max_len)],
                     id_vars=['amount'])
df_pv_city = pd.melt(df, value_vars=['city_{}'.format(i) for i in range(max_len)],
                     id_vars=['amount'])
# Rename unpivoted columns (these are the final columns)
df_pv_name = df_pv_name.rename(columns={'value': 'name'})
df_pv_city = df_pv_city.rename(columns={'value': 'city'})
# Rename 'city_<x>' values (rows) to serve as the key for the join (merge)
df_pv_city['variable'] = df_pv_city['variable'].map(
    {'city_{}'.format(i): 'name_{}'.format(i) for i in range(max_len)})
# Join the unpivoted 'name' and 'city' dataframes
df_res = df_pv_name.merge(df_pv_city, on=['variable', 'amount'])
# Drop the 'variable' column, and NULL rows left over from rows with unequal pipe counts
# (if you want to drop any row containing a NULL, change 'all' to 'any')
df_res = df_res.drop(['variable'], axis=1).dropna(subset=['name', 'city'], how='all',
                                                  axis=0).reset_index(drop=True)
The result is:
amount name city
0 10 frank toronto
1 20 george fresno
2 10 john new york
3 20 joe kansas city
4 10 dave anaheim
5 20 fred reno
Another test input:
name city amount
0 frank | john | dave | joe | bill toronto | new york | anaheim | los angeles | caracas 10
1 george | joe | fred fresno | kansas city 20
2 danny miami 30
Result of this test (if you don't want NaN rows, change how='all' to how='any' in the dropna call):
amount name city
0 10 frank toronto
1 20 george fresno
2 30 danny miami
3 10 john new york
4 20 joe kansas city
5 10 dave anaheim
6 20 fred NaN
7 10 joe los angeles
8 10 bill caracas
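On newer pandas (>= 1.3, where `DataFrame.explode` accepts multiple columns), the same reshape can be sketched more directly:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['frank | john | dave', 'george | joe | fred'],
    'city': ['toronto | new york | anaheim', 'fresno | kansas city | reno'],
    'amount': [10, 20],
})

# Split both piped columns into equal-length lists (plain str.split keeps
# ' | ' literal), then explode the two list columns together row-wise
out = (df.assign(name=df['name'].apply(lambda s: s.split(' | ')),
                 city=df['city'].apply(lambda s: s.split(' | ')))
         .explode(['name', 'city'])
         .reset_index(drop=True))
```

This requires the per-row lists in `name` and `city` to have matching lengths, which the question's data guarantees.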
Given a row:
['1','frank|joe|dave', 'toronto|new york|anaheim', '20']
you can use Python 3's itertools.zip_longest (itertools.izip_longest on Python 2):
itertools.zip_longest(*[value.split('|') for value in row])
on it to obtain the following structure:
[('1', 'frank', 'toronto', '20'),
(None, 'joe', 'new york', None),
(None, 'dave', 'anaheim', None)]
Here we want to replace each None value with the last seen value in the corresponding column, which can be done while looping over the result.
So, given a TSV already split by tabs, the following code should do the trick:
import itertools
def flatten_tsv(lines):
    result = []
    for line in lines:
        flat_lines = itertools.zip_longest(*[value.split('|') for value in line])
        for flat_line in flat_lines:
            # Replace each None with the last seen value in that column
            result.append([result[-1][i] if v is None else v
                           for i, v in enumerate(flat_line)])
    return result
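A self-contained Python 3 demo of the same idea (the function name here is mine; row numbers and amounts are forward-filled from the previously emitted row):

```python
from itertools import zip_longest

def flatten_piped(lines):
    """Expand pipe-packed cells so each piped value gets its own row."""
    result = []
    for line in lines:
        for flat_line in zip_longest(*[value.split('|') for value in line]):
            # Replace each None with the last seen value in that column
            result.append([result[-1][i] if v is None else v
                           for i, v in enumerate(flat_line)])
    return result

flat = flatten_piped([['1', 'frank|joe|dave', 'toronto|new york|anaheim', '20']])
# flat → [['1', 'frank', 'toronto', '20'],
#         ['1', 'joe', 'new york', '20'],
#         ['1', 'dave', 'anaheim', '20']]
```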

Add UUID's to pandas DF

Say I have a pandas DataFrame like so:
df = pd.DataFrame({'Name': ['John Doe', 'Jane Smith', 'John Doe', 'Jane Smith','Jack Dawson','John Doe']})
df:
Name
0 John Doe
1 Jane Smith
2 John Doe
3 Jane Smith
4 Jack Dawson
5 John Doe
And I want to add a column with uuids that are the same if the name is the same. For example, the DataFrame above should become:
df:
Name UUID
0 John Doe 6d07cb5f-7faa-4893-9bad-d85d3c192f52
1 Jane Smith a709bd1a-5f98-4d29-81a8-09de6e675b56
2 John Doe 6d07cb5f-7faa-4893-9bad-d85d3c192f52
3 Jane Smith a709bd1a-5f98-4d29-81a8-09de6e675b56
4 Jack Dawson 6a495c95-dd68-4a7c-8109-43c2e32d5d42
5 John Doe 6d07cb5f-7faa-4893-9bad-d85d3c192f52
The uuid's should be generated from the uuid.uuid4() function.
My current idea is to use a groupby("Name").cumcount() to identify which rows have the same name and which are different. Then I'd create a dictionary with a key of the cumcount and a value of the uuid and use that to add the uuids to the DF.
While that would work, I'm wondering if there's a more efficient way to do this?
Grouping the data frame and applying uuid.uuid4 will be more efficient than looping through the groups. Since you want to keep the original shape of your data frame you should use pandas function transform.
Using your sample data frame, we'll add a column in order to have a series to apply transform to. Since uuid.uuid4 doesn't take any argument it really doesn't matter what the column is.
df = pd.DataFrame({'Name': ['John Doe', 'Jane Smith', 'John Doe', 'Jane Smith','Jack Dawson','John Doe']})
df.loc[:, "UUID"] = 1
Now to use transform:
import uuid
df.loc[:, "UUID"] = df.groupby("Name").UUID.transform(lambda g: uuid.uuid4())
+----+--------------+--------------------------------------+
| | Name | UUID |
+----+--------------+--------------------------------------+
| 0 | John Doe | c032c629-b565-4903-be5c-81bf05804717 |
| 1 | Jane Smith | a5434e69-bd1c-4d29-8b14-3743c06e1941 |
| 2 | John Doe | c032c629-b565-4903-be5c-81bf05804717 |
| 3 | Jane Smith | a5434e69-bd1c-4d29-8b14-3743c06e1941 |
| 4 | Jack Dawson | 6b843d0f-ba3a-4880-8a84-d98c4af09cc3 |
| 5 | John Doe | c032c629-b565-4903-be5c-81bf05804717 |
+----+--------------+--------------------------------------+
uuid.uuid4 will be called as many times as there are distinct groups
How about this:
names = df['Name'].unique()
for name in names:
    df.loc[df['Name'] == name, 'UUID'] = uuid.uuid4()
You could shorten it to:
for name in df['Name'].unique():
    df.loc[df['Name'] == name, 'UUID'] = uuid.uuid4()
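A dictionary-plus-`map` variant of the same idea (a sketch; like the loop above, it calls `uuid.uuid4` exactly once per distinct name):

```python
import uuid
import pandas as pd

df = pd.DataFrame({'Name': ['John Doe', 'Jane Smith', 'John Doe',
                            'Jane Smith', 'Jack Dawson', 'John Doe']})

# One UUID per unique name, built up front, then mapped onto the column
uuid_by_name = {name: uuid.uuid4() for name in df['Name'].unique()}
df['UUID'] = df['Name'].map(uuid_by_name)
```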
