Python, Pandas: Assign cells in dataframe to groups

Good morning, everyone!
I hope the title fits my question. I have a dataframe that looks something like this:
Dataframe (before):
id | name       | position
1  | jane doe   | position_1
2  | john doe   | position_2
3  | john smith | position_3
I also have multiple lists of groups:
group_1 = ['position_3', 'position_18', 'position_45']
group_2 = ['position_2', 'position_9']
group_7 = ['position_1']
Now I'm wondering: what is the best way to add another column to my dataframe with the assigned group? For example:
Dataframe (after):
id | name       | position   | group
1  | jane doe   | position_1 | group_7
2  | john doe   | position_2 | group_2
3  | john smith | position_3 | group_1
Notes:
Every position is unique and will never appear in more than one group!

You can create a mapping dictionary whose keys are the positions and whose values are the group names, then map this dictionary onto the position column:
dct = {'group_1': group_1, 'group_2': group_2, 'group_7': group_7}
mapping_dct = {pos:grp for grp, positions in dct.items() for pos in positions}
df['group'] = df['position'].map(mapping_dct)
>>> df
   id        name    position    group
0   1    jane doe  position_1  group_7
1   2    john doe  position_2  group_2
2   3  john smith  position_3  group_1
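If a position ever shows up that is missing from every group, map leaves NaN in the new column. A small optional guard (the 'unassigned' label is just a placeholder assumption for illustration):

# positions absent from every group come back as NaN from .map
df['group'] = df['position'].map(mapping_dct).fillna('unassigned')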

import pandas as pd

ids = [1, 2, 3]
names = ['jane doe', 'john doe', 'john smith']
positions = ['position_1', 'position_2', 'position_3']
df = pd.DataFrame({'id': ids, 'name': names, 'position': positions})

group_1 = ['position_3', 'position_18', 'position_45']
group_2 = ['position_2', 'position_9']
group_7 = ['position_1']
dct = {'group_1': group_1, 'group_2': group_2, 'group_7': group_7}

def lookup(position):
    # Collect every group whose position list contains this position
    keys = []
    for key, item in dct.items():
        if position in item:
            keys.append(key)
    return keys

mylist = [*map(lookup, df['position'])]
mylist = [x[0] for x in mylist]  # each position belongs to exactly one group
df['group'] = mylist
print(df.head())
output:
   id        name    position    group
0   1    jane doe  position_1  group_7
1   2    john doe  position_2  group_2
2   3  john smith  position_3  group_1

Related

Convert multiple rows into one row with multiple columns in pyspark?

I have something like this (I've simplified the number of columns for brevity; there are about 10 other attributes):
id  name  foods    foods_eaten  color  continent
1   john  apples   2            red    Europe
1   john  oranges  3            red    Europe
2   jack  apples   1            blue   North America
I want to convert it to:
id  name  apples  oranges  color  continent
1   john  2       3        red    Europe
2   jack  1       0        blue   North America
Edit:
(1) I updated the data to show a few more of the columns.
(2) I've done
df_piv = df.groupBy(['id', 'name', 'color', 'continent', ...]).pivot('foods').avg('foods_eaten')
Is there a simpler way to do this sort of thing? As far as I can tell, I'll need to groupby almost every attribute to get my result.
Extending from what you have done so far, and leveraging collect_list:
>>> from pyspark.sql import functions as F
>>> from pyspark.sql.types import *
>>> from pyspark.sql.functions import collect_list
>>> data = [{'id': 1, 'name': 'john', 'foods': "apples"}, {'id': 1, 'name': 'john', 'foods': "oranges"}, {'id': 2, 'name': 'jack', 'foods': "banana"}]
>>> dataframe = spark.createDataFrame(data)
>>> dataframe.show()
+-------+---+----+
| foods| id|name|
+-------+---+----+
| apples| 1|john|
|oranges| 1|john|
| banana| 2|jack|
+-------+---+----+
>>> grouping_cols = ["id", "name"]
>>> other_cols = [c for c in dataframe.columns if c not in grouping_cols]
>>> df = dataframe.groupBy(grouping_cols).agg(*[collect_list(c).alias(c) for c in other_cols])
>>> df.show()
+---+----+-----------------+
| id|name| foods|
+---+----+-----------------+
| 1|john|[apples, oranges]|
| 2|jack| [banana]|
+---+----+-----------------+
>>> df_sizes = df.select(*[F.size(col).alias(col) for col in other_cols])
>>> df_max = df_sizes.agg(*[F.max(col).alias(col) for col in other_cols])
>>> max_dict = df_max.collect()[0].asDict()
>>> df_result = df.select('id', 'name', *[df[col][i] for col in other_cols for i in range(max_dict[col])])
>>> df_result.show()
+---+----+--------+--------+
| id|name|foods[0]|foods[1]|
+---+----+--------+--------+
| 1|john| apples| oranges|
| 2|jack| banana| null|
+---+----+--------+--------+
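If the target is the pivoted counts shown in the question, the asker's own groupBy/pivot line is already the standard route; the long grouping list can be built programmatically rather than typed out. A sketch, assuming df is the question's original frame with its column names:

# Group by everything except the pivoted pair of columns
grouping_cols = [c for c in df.columns if c not in ('foods', 'foods_eaten')]
df_piv = (df.groupBy(grouping_cols)
            .pivot('foods')
            .sum('foods_eaten')
            .na.fill(0))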

In Python, trying to remove duplicate words in a dataframe, but getting an error

I'm trying to remove a duplicate word in a cell
   Current         Desired
0  John and Jane   John and Jane
1  John and John   John
2  John            John
3  Jane and Jane   Jane
I have tried the following, but the Desired column gets filled with o d i c t _ k e y s ( [ ' n a n ' ] ):
from collections import OrderedDict
df['Current'] = (df['Desired'].astype(str).str.split()
                   .apply(lambda x: OrderedDict.fromkeys(x).keys())
                   .astype(str).str.join(' '))
I have also tried this, but the Desired column gets filled with nan:
df['Desired'] = df['Current'].str.replace(r'\b(\w+)(\s+\1)+\b', r'\1')
Let us do split on ' and ' with set, then join back:
df['out'] = df.Current.str.split(' and ').map(lambda x : ' and '.join(set(x)))
df
Out[876]:
         Current            out
0  John and Jane  Jane and John
1  John and John           John
2           John           John
3  Jane and Jane           Jane
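Note that set does not preserve order, which is why the first row comes back as 'Jane and John'. If the original order matters, a small variation using dict.fromkeys (insertion-ordered in Python 3.7+) keeps it:

# dict.fromkeys deduplicates while keeping first-seen order
df['out'] = df.Current.str.split(' and ').map(lambda x: ' and '.join(dict.fromkeys(x)))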

How to do text-to-columns in pandas and create new columns?

I have a csv file as shown below. The Name column has names separated by commas; I want to split them on the comma, append them to new columns, and write out the same csv, similar to Text to Columns in Excel. The problem is that rows have a varying number of names.
| Address | Name                |
| 1st st  | John, Smith         |
| 2nd st. | Andrew, Jane, Aaron |
My pandas code looks something like this:
df1 = pd.read_csv('sample.csv')
df1['Name'] = df1['Name'].str.split(',', expand=True)
df1.to_csv('results.csv',index=None)
Of course this doesn't work, because "Columns must be same length as key". The expected output is:
| Address | Name   |       |       |
| 1st st  | John   | Smith |       |
| 2nd st. | Andrew | Jane  | Aaron |
Count the maximum number of parts after splitting, then assign to new columns accordingly:
max_commas = df['name'].str.split(',').transform(len).max()
df[[f'name_{x}' for x in range(max_commas)]] = df['name'].str.split(',', expand=True)
input df:
      col                        name
0  1st st                 john, smith
1  2nd st          andrew, jane, aron
2  3rd st  harry, philip, anna, james
output:
      col                        name  name_0  name_1 name_2 name_3
0  1st st                 john, smith    john   smith   None   None
1  2nd st          andrew, jane, aron  andrew    jane   aron   None
2  3rd st  harry, philip, anna, james   harry  philip   anna  james
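One caveat with the snippet above: splitting on ',' keeps the space after each comma in the resulting parts. A regex pattern that swallows the surrounding whitespace avoids that (or split on ', ' as in the answer below):

# '\s*,\s*' is treated as a regex because the pattern is longer than one character
df[[f'name_{x}' for x in range(max_commas)]] = df['name'].str.split(r'\s*,\s*', expand=True)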
You can do
out = df.join(df['Name'].str.split(', ',expand=True).add_prefix('name_'))
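A quick check of that one-liner on the sample frame (column names taken from the question):

import pandas as pd

df = pd.DataFrame({'Address': ['1st st', '2nd st.'],
                   'Name': ['John, Smith', 'Andrew, Jane, Aaron']})
out = df.join(df['Name'].str.split(', ', expand=True).add_prefix('name_'))
print(out)
#    Address                 Name  name_0 name_1 name_2
# 0   1st st          John, Smith    John  Smith   None
# 1  2nd st.  Andrew, Jane, Aaron  Andrew   Jane  Aaron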

Splitting multiple pipe delimited values in multiple columns of a comma delimited CSV and mapping them to each other

I have a csv with comma delimiters that has multiple pipe-delimited values in one column, which I need to map to another column with multiple pipe-delimited values, giving each pair its own row along with the data in the original row that doesn't have multiple values. My CSV looks like this (with commas between the categories):
row  name                 city                          amount
1    frank | john | dave  toronto | new york | anaheim  10
2    george | joe | fred  fresno | kansas city | reno   20
I need it to look like this:
row  name    city         amount
1    frank   toronto      10
2    john    new york     10
3    dave    anaheim      10
4    george  fresno       20
5    joe     kansas city  20
6    fred    reno         20
Maybe not the nicest, but a working solution (it handles rows without pipes and rows with differing pipe lengths):
import pandas as pd

df = pd.read_csv('<your_data>.csv')
str_split = ' | '

# Calculate the maximum number of piped (' | ') values per row
df['max_len'] = df[['name', 'city']].apply(
    lambda x: max(len(x[0].split(str_split)), len(x[1].split(str_split))), axis=1)
max_len = df['max_len'].max()

# Split '|' piped cell values into columns (needed at the unpivot step):
# create as many new 'name_<x>' & 'city_<x>' columns as 'max_len'
df[['name_{}'.format(i) for i in range(max_len)]] = df['name'].apply(
    lambda x: pd.Series(x.split(str_split)))
df[['city_{}'.format(i) for i in range(max_len)]] = df['city'].apply(
    lambda x: pd.Series(x.split(str_split)))

# Unpivot 'name_<x>' & 'city_<x>' columns into rows
df_pv_name = pd.melt(df, value_vars=['name_{}'.format(i) for i in range(max_len)],
                     id_vars=['amount'])
df_pv_city = pd.melt(df, value_vars=['city_{}'.format(i) for i in range(max_len)],
                     id_vars=['amount'])

# Rename unpivoted columns (these are the final columns)
df_pv_name = df_pv_name.rename(columns={'value': 'name'})
df_pv_city = df_pv_city.rename(columns={'value': 'city'})

# Rename 'city_<x>' values (rows) to be the 'key' for the join (merge)
df_pv_city['variable'] = df_pv_city['variable'].map(
    {'city_{}'.format(i): 'name_{}'.format(i) for i in range(max_len)})

# Join the unpivoted 'name' & 'city' dataframes
df_res = df_pv_name.merge(df_pv_city, on=['variable', 'amount'])

# Drop the 'variable' column and NULL rows left over when the original rows
# have unequal pipe lengths; replace how='all' with how='any' to drop any NULL rows
df_res = df_res.drop(['variable'], axis=1).dropna(
    subset=['name', 'city'], how='all', axis=0).reset_index(drop=True)
The result is:
   amount    name         city
0      10   frank      toronto
1      20  george       fresno
2      10    john     new york
3      20     joe  kansas city
4      10    dave      anaheim
5      20    fred         reno
Another test input:
   name                              city                                                  amount
0  frank | john | dave | joe | bill  toronto | new york | anaheim | los angeles | caracas  10
1  george | joe | fred               fresno | kansas city                                  20
2  danny                             miami                                                 30
Result of this test (if you don't want NaN rows, replace how='all' with how='any' in the dropna step):
   amount    name         city
0      10   frank      toronto
1      20  george       fresno
2      30   danny        miami
3      10    john     new york
4      20     joe  kansas city
5      10    dave      anaheim
6      20    fred          NaN
7      10     joe  los angeles
8      10    bill      caracas
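With pandas 1.3 or newer, a much shorter alternative is to split both piped columns into lists and explode them together. A sketch (multi-column explode requires the two lists in a row to have equal lengths, otherwise it raises a ValueError):

import pandas as pd

df = pd.read_csv('<your_data>.csv')
# Turn each piped cell into a list; the regex swallows whitespace around '|'
for col in ['name', 'city']:
    df[col] = df[col].str.split(r'\s*\|\s*')
df_res = df.explode(['name', 'city']).reset_index(drop=True)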
Given a row:
['1','frank|joe|dave', 'toronto|new york|anaheim', '20']
you can use
itertools.zip_longest(*[value.split('|') for value in row])
on it to obtain the following structure:
[('1', 'frank', 'toronto', '20'),
(None, 'joe', 'new york', None),
(None, 'dave', 'anaheim', None)]
Here we want to replace every None value with the last seen value in the corresponding column, which can be done while looping over the result. So, given a TSV already split into fields by tabs, the following code should do the trick:
import itertools

def flatten_tsv(lines):
    result = []
    for line in lines:
        flat_lines = itertools.zip_longest(*[value.split('|') for value in line])
        for flat_line in flat_lines:
            # Replace None with the last seen value in the same column
            result.append([result[-1][i] if v is None else v
                           for i, v in enumerate(flat_line)])
    return result
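For example, applied to the sample row above:

rows = [['1', 'frank|joe|dave', 'toronto|new york|anaheim', '20']]
print(flatten_tsv(rows))
# [['1', 'frank', 'toronto', '20'],
#  ['1', 'joe', 'new york', '20'],
#  ['1', 'dave', 'anaheim', '20']]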

Add UUIDs to pandas DF

Say I have a pandas DataFrame like so:
df = pd.DataFrame({'Name': ['John Doe', 'Jane Smith', 'John Doe', 'Jane Smith','Jack Dawson','John Doe']})
df:
Name
0 John Doe
1 Jane Smith
2 John Doe
3 Jane Smith
4 Jack Dawson
5 John Doe
And I want to add a column with uuids that are the same if the name is the same. For example, the DataFrame above should become:
df:
Name UUID
0 John Doe 6d07cb5f-7faa-4893-9bad-d85d3c192f52
1 Jane Smith a709bd1a-5f98-4d29-81a8-09de6e675b56
2 John Doe 6d07cb5f-7faa-4893-9bad-d85d3c192f52
3 Jane Smith a709bd1a-5f98-4d29-81a8-09de6e675b56
4 Jack Dawson 6a495c95-dd68-4a7c-8109-43c2e32d5d42
5 John Doe 6d07cb5f-7faa-4893-9bad-d85d3c192f52
The UUIDs should be generated by the uuid.uuid4() function.
My current idea is to use a groupby("Name").cumcount() to identify which rows have the same name and which are different. Then I'd create a dictionary with a key of the cumcount and a value of the uuid and use that to add the uuids to the DF.
While that would work, I'm wondering if there's a more efficient way to do this?
Grouping the data frame and applying uuid.uuid4 will be more efficient than looping through the groups. Since you want to keep the original shape of your data frame, you should use the pandas transform function.
Using your sample data frame, we'll add a column in order to have a series to apply transform to. Since uuid.uuid4 doesn't take any argument, it really doesn't matter what the column is.
df = pd.DataFrame({'Name': ['John Doe', 'Jane Smith', 'John Doe', 'Jane Smith','Jack Dawson','John Doe']})
df.loc[:, "UUID"] = 1
Now to use transform:
import uuid
df.loc[:, "UUID"] = df.groupby("Name").UUID.transform(lambda g: uuid.uuid4())
+----+--------------+--------------------------------------+
| | Name | UUID |
+----+--------------+--------------------------------------+
| 0 | John Doe | c032c629-b565-4903-be5c-81bf05804717 |
| 1 | Jane Smith | a5434e69-bd1c-4d29-8b14-3743c06e1941 |
| 2 | John Doe | c032c629-b565-4903-be5c-81bf05804717 |
| 3 | Jane Smith | a5434e69-bd1c-4d29-8b14-3743c06e1941 |
| 4 | Jack Dawson | 6b843d0f-ba3a-4880-8a84-d98c4af09cc3 |
| 5 | John Doe | c032c629-b565-4903-be5c-81bf05804717 |
+----+--------------+--------------------------------------+
uuid.uuid4 will be called as many times as there are distinct groups
How about this:
names = df['Name'].unique()
for name in names:
    df.loc[df['Name'] == name, 'UUID'] = uuid.uuid4()
which could be shortened to:
for name in df['Name'].unique():
    df.loc[df['Name'] == name, 'UUID'] = uuid.uuid4()
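A loop-free variant of the same idea, building the name-to-UUID mapping once and mapping it onto the column (a sketch, assuming uuid is imported as above):

import uuid

# One uuid4 per distinct name, reused for every row with that name
df['UUID'] = df['Name'].map({name: uuid.uuid4() for name in df['Name'].unique()})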
