I want to split two columns at their commas and bring the pieces back into the original pandas DataFrame. I tried explode(), but got ValueError: cannot handle a non-unique multi-index! How can I overcome this error?
import pandas as pd
data = {'fruit_tag': {0: 'apple, organge', 1: 'watermelon', 2: 'banana', 3: 'banana', 4: 'apple, banana'}, 'location': {0: 'Hong Kong , London', 1: 'New York, Tokyo', 2: 'Singapore', 3: 'Singapore, Hong Kong', 4: 'Tokyo'}, 'rating': {0: 'bad', 1: 'good', 2: 'good', 3: 'bad', 4: 'good'}, 'measure_score': {0: 0.9529434442520142, 1: 0.952498733997345, 2: 0.9080725312232971, 3: 0.8847543001174927, 4: 0.8679852485656738}}
dt = pd.DataFrame.from_dict(data)
dt.\
set_index(['rating', 'measure_score']).\
apply(lambda x: x.str.split(',').explode())
When you explode, the exploded rows all keep the index of the row they came from, so the index labels are duplicated. Pandas doesn't know (or try to guess) how to align these indexes, because the intent differs from case to case, e.g. align by order, or cross merge. In your case, for example, what do you expect to get from row 0, where each column has two entries? How about row 1, where only location has two?
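A minimal sketch of the problem using the frame above: exploding even one column already yields duplicated index labels, and apply then asks pandas to align two such columns.
s = dt.set_index(['rating', 'measure_score'])['fruit_tag'].str.split(',').explode()
print(s.index.is_unique)  # False: rows 0 and 4 each produce two entries with the same label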
If you want a cross merge, you would need to explode manually:
def explode(x, col):
    return x.assign(**{col: x[col].str.split(', ')}).explode(col)
explode(explode(dt, 'fruit_tag'), 'location')
Output:
fruit_tag location rating measure_score
0 apple Hong Kong bad 0.952943
0 apple London bad 0.952943
0 organge Hong Kong bad 0.952943
0 organge London bad 0.952943
1 watermelon New York good 0.952499
1 watermelon Tokyo good 0.952499
2 banana Singapore good 0.908073
3 banana Singapore bad 0.884754
3 banana Hong Kong bad 0.884754
4 apple Tokyo good 0.867985
4 banana Tokyo good 0.867985
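One caveat with the sketch above: the raw strings have uneven spacing around commas ('Hong Kong , London'), so splitting on ', ' can leave stray whitespace. A hypothetical variant, explode_ws, that splits on a regex instead (assumes pandas >= 1.4 for regex=True):
def explode_ws(x, col):
    # split on a comma plus any surrounding whitespace, then explode that column
    return x.assign(**{col: x[col].str.split(r'\s*,\s*', regex=True)}).explode(col)

explode_ws(explode_ws(dt, 'fruit_tag'), 'location')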
My data frame has multiple columns like ID, Organizations, Date, Location, etc. I am trying to extract the organization values from the Organizations column. My desired output is the organization names in a new column, separated by commas. For example:
ID  Organizations
1   [{organization=Glaxosmithkline, character_offset=10512}, {organization=Vulpes Fund, character_offset=13845}]
2   [{organization=Amazon, character_offset=14589}, {organization=Sinovac, character_offset=18923}]
I want the output to be something like:
ID  Organizations
1   Glaxosmithkline, Vulpes Fund
2   Amazon, Sinovac
I tried the following code (getting output as NaN):
latin_combined['newOrg'] = latin_combined['organizations'].str[0].str['organization']
Edited:
df.head(5)['organizations'].to_dict() gives me the following output:
{0: '[{organization= Vac, character_offset=14199}, {organization=Health, character_offset=1494}]',
1: '[{organization=Store, character_offset=700}, {organization= Museum, character_offset=1711}]',
2: '[{organization= Mart, character_offset=8232}, {organization= Rep, character_offset=5517}]',
3: '[{organization= Lodge, character_offset=3881}, {organization= Hotel, character_offset=5947}]',
4: '[{organization=Airport, character_offset=3881}, {organization=Landmark, character_offset=5947}]'}
Any suggestions will be helpful.
It seems you have strings rather than lists of dicts. You can use a regex to extract the key/value pairs separated by =, then pivot, as shown below:
# extractall pulls out every key=value pair as (key, value) capture groups;
# after labelling them, pivot so each key becomes a column, joining
# multiple values per row with ', '
(df['organizations'].str.extractall('([^{=,]+)= *([^=,}]+)')
 .rename({0: 'key', 1: 'value'}, axis=1).reset_index()
 .groupby(['level_0', 'key'])['value'].agg(', '.join).unstack())
key character_offset organization
level_0
0 14199, 1494 Vac, Health
1 700, 1711 Store, Museum
2 8232, 5517 Mart, Rep
3 3881, 5947 Lodge, Hotel
4 3881, 5947 Airport, Landmark
The data:
d = {0: '[{organization= Vac, character_offset=14199}, {organization=Health, character_offset=1494}]',
1: '[{organization=Store, character_offset=700}, {organization= Museum, character_offset=1711}]',
2: '[{organization= Mart, character_offset=8232}, {organization= Rep, character_offset=5517}]',
3: '[{organization= Lodge, character_offset=3881}, {organization= Hotel, character_offset=5947}]',
4: '[{organization=Airport, character_offset=3881}, {organization=Landmark, character_offset=5947}]'}
df = pd.Series(d).to_frame('organizations')
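Since the goal is a newOrg column on the original frame, the pivoted result (called out here, a name of my choosing) can be assigned back; level_0 lines up with the original index:
out = (df['organizations'].str.extractall('([^{=,]+)= *([^=,}]+)')
       .rename({0: 'key', 1: 'value'}, axis=1).reset_index()
       .groupby(['level_0', 'key'])['value'].agg(', '.join).unstack())
df['newOrg'] = out['organization']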
Is this what you are trying to do?
latin_combined['newOrg'] = latin_combined['organizations'].apply(lambda x: x.split(',')[0])
You could use a list comprehension with apply:
import pandas as pd
df = pd.DataFrame([[[{'organization': 'Glaxosmithkline', 'character_offset': 10512},
                     {'organization': 'Vulpes Life Sciences Fund', 'character_offset': 13845}]]],
                  columns=['newOrg'])
df['Organizations'] = df['newOrg'].apply(lambda x: [i['organization'] for i in x])
output (one row, shown transposed for readability):
newOrg         [{'organization': 'Glaxosmithkline', 'character_offset': 10512}, {'organization': 'Vulpes Life Sciences Fund', 'character_offset': 13845}]
Organizations  ['Glaxosmithkline', 'Vulpes Life Sciences Fund']
You could do:
df['organizations'].str.extractall(r"organization= *(\w+)") \
.groupby(level=0).agg(', '.join).rename(columns={0:'Organizations'})
Organizations
0 Vac, Health
1 Store, Museum
2 Mart, Rep
3 Lodge, Hotel
4 Airport, Landmark
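One caveat: \w+ stops at the first space, so a multi-word name like 'Vulpes Fund' from the question's first example would be truncated to 'Vulpes'. A variant pattern that captures up to the next comma or closing brace:
df['organizations'].str.extractall(r"organization=\s*([^,}]+)") \
    .groupby(level=0).agg(', '.join).rename(columns={0: 'Organizations'})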
1. Updated following your recent edit to the dataframe:
data = {'Organizations': ['[{organization= Vac, character_offset=14199}, {organization=Health, character_offset=1494}]',
'[{organization=Store, character_offset=700}, {organization= Museum, character_offset=1711}]',
'[{organization= Mart, character_offset=8232}, {organization= Rep, character_offset=5517}]',
'[{organization= Lodge, character_offset=3881}, {organization= Hotel, character_offset=5947}]',
'[{organization=Airport, character_offset=3881}, {organization=Landmark, character_offset=5947}]']}
df = pd.DataFrame(data)
df
index  Organizations
0      [{organization= Vac, character_offset=14199}, {organization=Health, character_offset=1494}]
1      [{organization=Store, character_offset=700}, {organization= Museum, character_offset=1711}]
2      [{organization= Mart, character_offset=8232}, {organization= Rep, character_offset=5517}]
3      [{organization= Lodge, character_offset=3881}, {organization= Hotel, character_offset=5947}]
4      [{organization=Airport, character_offset=3881}, {organization=Landmark, character_offset=5947}]
2. Use ', '.join() + a regex with .apply() on the column you want:
import re
df.Organizations = df.Organizations.apply(lambda x: ', '.join(re.findall(r'{[^=]+=\s*([^=,}]+)', x)))
df
3. Result:
index  Organizations
0      Vac, Health
1      Store, Museum
2      Mart, Rep
3      Lodge, Hotel
4      Airport, Landmark
In my opinion, you should try to scrape and/or clean your data better before putting it into a dataframe.
I am mainly interested in how this is done in a good, idiomatic pandas way.
In this example data, Tim from Osaka has two fruits.
import pandas as pd
data = {'name': ['Susan', 'Tim', 'Tim', 'Anna'],
'fruit': ['Apple', 'Apple', 'Banana', 'Banana'],
'town': ['Berlin', 'Osaka', 'Osaka', 'Singabpur']}
df = pd.DataFrame(data)
print(df)
Result
name fruit town
0 Susan Apple Berlin
1 Tim Apple Osaka
2 Tim Banana Osaka
3 Anna Banana Singabpur
I investigated the data and saw that one of the persons has multiple fruits. I want to create a new "category" for it named Apple&Banana (or something else). The point is that the other fields of Tim's rows hold equal values.
df.groupby(['name', 'town', 'fruit']).size()
I am not sure if this is the correct way to explore this data set. The underlying question is whether some person+town combinations have multiple fruits; the sketch below answers that directly.
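A short sketch using nunique (counts is a name of my choosing):
# count distinct fruits per person+town; combinations above 1 need merging
counts = df.groupby(['name', 'town'])['fruit'].nunique()
print(counts[counts > 1])
# name  town
# Tim   Osaka    2
# Name: fruit, dtype: int64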
As a result I want this
name fruit town
0 Susan Apple Berlin
1 Tim Apple&Banana Osaka
2 Anna Banana Singabpur
Use groupby agg:
new_df = (
df.groupby(['name', 'town'], as_index=False, sort=False)
.agg(fruit=('fruit', '&'.join))
)
new_df:
name town fruit
0 Susan Berlin Apple
1 Tim Osaka Apple&Banana
2 Anna Singabpur Banana
>>> df.groupby(["name", "town"])["fruit"].apply(lambda f: "&".join(f)).reset_index()
name town fruit
0 Anna Singabpur Banana
1 Susan Berlin Apple
2 Tim Osaka Apple&Banana
I have a pandas DataFrame like this:
city country city_population
0 New York USA 8300000
1 London UK 8900000
2 Paris France 2100000
3 Chicago USA 2700000
4 Manchester UK 510000
5 Marseille France 860000
I want to create a new column country_population by calculating a sum of every city for each country. I have tried:
df['Country population'] = df['city_population'].sum().where(df['country'])
But this won't work. Could I have some advice on the problem?
Sounds like you're looking for groupby:
import pandas as pd
data = {
'city': ['New York', 'London', 'Paris', 'Chicago', 'Manchester', 'Marseille'],
'country': ['USA', 'UK', 'France', 'USA', 'UK', 'France'],
'city_population': [8_300_000, 8_900_000, 2_100_000, 2_700_000, 510_000, 860_000]
}
df = pd.DataFrame.from_dict(data)
# group by country, access 'city_population' column, sum
pop = df.groupby('country')['city_population'].sum()
print(pop)
output:
country
France 2960000
UK 9410000
USA 11000000
Name: city_population, dtype: int64
You can then append this Series to the DataFrame (arguably discouraged, though, since it stores information redundantly and doesn't really fit the structure of the original DataFrame):
# add to existing df
pop.rename('country_population', inplace=True)
# how='left' to preserve original ordering of df
df = df.merge(pop, how='left', on='country')
print(df)
output:
city country city_population country_population
0 New York USA 8300000 11000000
1 London UK 8900000 9410000
2 Paris France 2100000 2960000
3 Chicago USA 2700000 11000000
4 Manchester UK 510000 9410000
5 Marseille France 860000 2960000
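Alternatively, since pop is indexed by country, Series.map gives the same aligned result without a merge; a minimal sketch:
# look up each row's country in the pop Series index
df['country_population'] = df['country'].map(pop)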
Based on @Vaishali's comment, a one-liner; transform('sum') returns a value for every row, aligned to the original index, so it can be assigned directly:
df['Country population'] = df.groupby('country')['city_population'].transform('sum')
I have a column full of state names.
I know how to iterate down through it, but I don't know what syntax to use to have it check for empty values as it goes. I tried isnull() but that seems to be the wrong approach. Anyone know a way?
I was thinking something like:
for state_name in datFrame.state_name:
    if datFrame.state_name.isnull():
        print('no name value')  # plus the other values from the row
    else:
        print('row is good')
df.head():
state_name state_ab city zip_code
0 Alabama AL Chickasaw 36611
1 Alabama AL Louisville 36048
2 Alabama AL Columbiana 35051
3 Alabama AL Satsuma 36572
4 Alabama AL Dauphin Island 36528
to_dict():
{'state_name': {0: 'Alabama',
1: 'Alabama',
2: 'Alabama',
3: 'Alabama',
4: 'Alabama'},
'state_ab': {0: 'AL', 1: 'AL', 2: 'AL', 3: 'AL', 4: 'AL'},
'city': {0: 'Chickasaw',
1: 'Louisville',
2: 'Columbiana',
3: 'Satsuma',
4: 'Dauphin Island'},
'zip_code': {0: '36611', 1: '36048', 2: '35051', 3: '36572', 4: '36528'}}
Based on your description, you can use np.where to check if rows are either null or empty strings.
df['status'] = np.where(df['state'].eq('') | df['state'].isnull(), 'Not Good', 'Good')
As a minimal reproducible example, suppose you have the following dataframe:
state
0 New York
1 Nevada
2
3 None
4 New Jersey
then,
state status
0 New York Good
1 Nevada Good
2 Not Good
3 None Not Good
4 New Jersey Good
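For reference, a runnable version of this example, with the frame above reconstructed (I'm assuming the blank entry is an empty string and None is a missing value):
import numpy as np
import pandas as pd

df = pd.DataFrame({'state': ['New York', 'Nevada', '', None, 'New Jersey']})
df['status'] = np.where(df['state'].eq('') | df['state'].isnull(), 'Not Good', 'Good')
print(df)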
It's always worth mentioning that you should avoid loops whenever possible, because they are far slower than vectorized masking.
I have a weak grasp of Pandas and not a strong understanding of Python.
I am wanting to update a column (d.Alias) based on the value of existing columns (d.Company and d2.Alias). d.Alias should be equal to d2.Alias if d2.Alias is a substring of d.Company.
Example datasets:
import numpy as np

d = {'Company': ['The Cool Company Inc', 'Cool Company, Inc', 'The Cool Company',
                 'The Shoe Company', 'Muffler Store', 'Muffler Store'],
     'Position': ['Cool Job A', 'Cool Job B', 'Cool Job C', 'Salesman', 'Sales', 'Technician'],
     'City': ['Tacoma', 'Tacoma', 'Tacoma', 'Boulder', 'Chicago', 'Chicago'],
     'State': ['AZ', 'AZ', 'AZ', 'CO', 'IL', 'IL'],
     'Alias': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]}

d2 = {'Company': ['The Cool Company, Inc.', 'The Shoe Company', 'Muffler Store LLC'],
      'Alias': ['Cool Company', np.nan, 'Muffler'],
      'First Name': ['Carol', 'James', 'Frankie'],
      'Last Name': ['Fisher', 'Smith', 'Johnson']}
The np.nan for The Shoe Company is because for that instance an alias is not necessary.
I have tried using .loc, for loops, while loops, pandas.where, numpy.where, and several variations of each with no desirable outcomes. When using a for loop, the end of d2.Alias was copied to all rows in d.Alias. I have not been able to reproduce that, however.
Previous posts that I have looked at but couldn't get to work, or didn't understand: Conditionally fill column with value from another DataFrame based on row match in Pandas
pandas create new column based on values from other columns
Any help is greatly appreciated!
EDIT:
Expected output
Update:
After a few days of tinkering I reached the desired outcome. With Wen's response I had to change a couple of things.
First, I created a list from df2.Alias called aliases:
aliases = df2.Alias.unique()
Then, I had to remove .map(df2.set_index('Company').Alias). The line that generated my desired results:
df1['Alias'] = df1.Company.apply(lambda x: [process.extract(x, aliases, limit=1)][0][0][0])
Solution from fuzzywuzzy
from fuzzywuzzy import process
df1['Alias'] = (df1.Company
                .apply(lambda x: process.extract(x, df2.Company, limit=1)[0][0])
                .map(df2.set_index('Company').Alias))
df1
Out[31]:
Alias City Company Position State
0 Cool Company Tacoma The Cool Company Inc Cool Job A AZ
1 Cool Company Tacoma Cool Company, Inc Cool Job B AZ
2 Cool Company Tacoma The Cool Company Cool Job C AZ
3 NaN Boulder The Shoe Company Salesman CO
4 Muffler Chicago Muffler Store Sales IL
5 Muffler Chicago Muffler Store Technician IL
One approach is to loop through your presumably much smaller dataframe and, wherever the alias is a substring of d.Company, fill d.Alias with that alias.
import pandas as pd
d = pd.DataFrame(d)
d2 = pd.DataFrame(d2)
for row in d2[d2.Alias.notnull()].itertuples():
d.loc[d.Company.str.contains(row.Alias), 'Alias'] = row.Alias
print(d)
# Alias City Company Position State
#0 Cool Company Tacoma The Cool Company Inc Cool Job A AZ
#1 Cool Company Tacoma Cool Company, Inc Cool Job B AZ
#2 Cool Company Tacoma The Cool Company Cool Job C AZ
#3 NaN Boulder The Shoe Company Salesman CO
#4 Muffler Chicago Muffler Store Sales IL
#5 Muffler Chicago Muffler Store Technician IL
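One caveat with this loop: str.contains treats its pattern as a regular expression by default, so an alias containing regex metacharacters could match unintentionally. Passing regex=False makes it a plain substring test:
for row in d2[d2.Alias.notnull()].itertuples():
    d.loc[d.Company.str.contains(row.Alias, regex=False), 'Alias'] = row.Alias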