Check if column in dataframe is missing values - python

I have a column full of state names.
I know how to iterate down through it, but I don't know what syntax to use to check for empty values as it goes. I tried "isnull()", but that seems to be the wrong approach. Does anyone know a way?
I was thinking something like:
for state_name in datFrame.state_name:
    if datFrame.state_name.isnull():
        print('no name value' + other values from row)
    else:
        print('row is good.')
df.head():
state_name state_ab city zip_code
0 Alabama AL Chickasaw 36611
1 Alabama AL Louisville 36048
2 Alabama AL Columbiana 35051
3 Alabama AL Satsuma 36572
4 Alabama AL Dauphin Island 36528
to_dict():
{'state_name': {0: 'Alabama',
1: 'Alabama',
2: 'Alabama',
3: 'Alabama',
4: 'Alabama'},
'state_ab': {0: 'AL', 1: 'AL', 2: 'AL', 3: 'AL', 4: 'AL'},
'city': {0: 'Chickasaw',
1: 'Louisville',
2: 'Columbiana',
3: 'Satsuma',
4: 'Dauphin Island'},
'zip_code': {0: '36611', 1: '36048', 2: '35051', 3: '36572', 4: '36528'}}

Based on your description, you can use np.where to check whether each row is either null or an empty string.
df['status'] = np.where(df['state'].eq('') | df['state'].isnull(), 'Not Good', 'Good')
(MCVE) For example, suppose you have the following dataframe
state
0 New York
1 Nevada
2
3 None
4 New Jersey
then,
state status
0 New York Good
1 Nevada Good
2 Not Good
3 None Not Good
4 New Jersey Good
It's always worth mentioning that you should avoid loops whenever possible, because they are far slower than vectorized masking.
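For reference, the whole MCVE above can be reproduced in a few lines (the column name 'state' comes from the example, not from the original dataframe):

```python
import numpy as np
import pandas as pd

# Reproduce the MCVE: real state names, an empty string, and a None
df = pd.DataFrame({'state': ['New York', 'Nevada', '', None, 'New Jersey']})

# Flag rows whose state is either an empty string or missing
df['status'] = np.where(df['state'].eq('') | df['state'].isnull(),
                        'Not Good', 'Good')
print(df)
```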

Related

Extract organization from a column with list of dictionaries using pd dataframe

My data frame has multiple columns like ID, Organizations, Date, Location, etc. I am trying to extract the "organization" values from the "Organizations" column. My desired output is the organizations' names in a new column, separated by commas. For example:
ID  Organizations
1   [{organization=Glaxosmithkline, character_offset=10512}, {organization=Vulpes Fund, character_offset=13845}]
2   [{organization=Amazon, character_offset=14589}, {organization=Sinovac, character_offset=18923}]
I want the output to be something like:
ID  Organizations
1   Glaxosmithkline, Vulpes Fund
2   Amazon, Sinovac
I tried the following code (getting output as NaN):
latin_combined['newOrg'] = latin_combined['organizations'].str[0].str['organization']
Edited:
df.head(5)['organizations'].to_dict() gives me the following output:
{0: '[{organization= Vac, character_offset=14199}, {organization=Health, character_offset=1494}]',
1: '[{organization=Store, character_offset=700}, {organization= Museum, character_offset=1711}]',
2: '[{organization= Mart, character_offset=8232}, {organization= Rep, character_offset=5517}]',
3: '[{organization= Lodge, character_offset=3881}, {organization= Hotel, character_offset=5947}]',
4: '[{organization=Airport, character_offset=3881}, {organization=Landmark, character_offset=5947}]'}
Any suggestions will be helpful.
It seems you have strings, not lists of dicts. You can use regex to extract the key/value pairs separated by =, then pivot as shown below:
(df['organizations'].str.extractall('([^{=,]+)= *([^=,}]+)')
   .rename({0: 'key', 1: 'value'}, axis=1).reset_index()
   .groupby(['level_0', 'key'])['value'].agg(', '.join).unstack())
key character_offset organization
level_0
0 14199, 1494 Vac, Health
1 700, 1711 Store, Museum
2 8232, 5517 Mart, Rep
3 3881, 5947 Lodge, Hotel
4 3881, 5947 Airport, Landmark
The data
d = {0: '[{organization= Vac, character_offset=14199}, {organization=Health, character_offset=1494}]',
1: '[{organization=Store, character_offset=700}, {organization= Museum, character_offset=1711}]',
2: '[{organization= Mart, character_offset=8232}, {organization= Rep, character_offset=5517}]',
3: '[{organization= Lodge, character_offset=3881}, {organization= Hotel, character_offset=5947}]',
4: '[{organization=Airport, character_offset=3881}, {organization=Landmark, character_offset=5947}]'}
df = pd.Series(d).to_frame('organizations')
Is this what you are trying to do?
latin_combined['newOrg'] = latin_combined['organizations'].apply(lambda x : x.split(',')[0])
You could use a list comprehension with apply:
import pandas as pd
df = pd.DataFrame([[[{'organization':'Glaxosmithkline', 'character_offset':10512}, {'organization':'Vulpes Life Sciences Fund', 'character_offset':13845}]]], columns=['newOrg'])
df['Organizations'] = df['newOrg'].apply(lambda x: [i['organization'] for i in x])
output:
                                                                                                                                       newOrg                                     Organizations
0  [{'organization': 'Glaxosmithkline', 'character_offset': 10512}, {'organization': 'Vulpes Life Sciences Fund', 'character_offset': 13845}]  ['Glaxosmithkline', 'Vulpes Life Sciences Fund']
You could do:
df['organizations'].str.extractall(r"organization= *(\w+)") \
    .groupby(level=0).agg(', '.join).rename(columns={0: 'Organizations'})
Organizations
0 Vac, Health
1 Store, Museum
2 Mart, Rep
3 Lodge, Hotel
4 Airport, Landmark
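For reference, the extractall answer above can be run end to end on the posted strings (a self-contained two-row sketch):

```python
import pandas as pd

# Two of the strings from the edited question
d = {0: '[{organization= Vac, character_offset=14199}, {organization=Health, character_offset=1494}]',
     1: '[{organization=Store, character_offset=700}, {organization= Museum, character_offset=1711}]'}
df = pd.Series(d).to_frame('organizations')

# Pull every value after "organization=", then join matches per row
out = (df['organizations'].str.extractall(r"organization= *(\w+)")
         .groupby(level=0).agg(', '.join)
         .rename(columns={0: 'Organizations'}))
print(out)
```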
1. Updated following your recent dataframe update:
data = {'Organizations': ['[{organization= Vac, character_offset=14199}, {organization=Health, character_offset=1494}]',
                          '[{organization=Store, character_offset=700}, {organization= Museum, character_offset=1711}]',
                          '[{organization= Mart, character_offset=8232}, {organization= Rep, character_offset=5517}]',
                          '[{organization= Lodge, character_offset=3881}, {organization= Hotel, character_offset=5947}]',
                          '[{organization=Airport, character_offset=3881}, {organization=Landmark, character_offset=5947}]']}
df = pd.DataFrame(data)
df
index  Organizations
0      [{organization= Vac, character_offset=14199}, {organization=Health, character_offset=1494}]
1      [{organization=Store, character_offset=700}, {organization= Museum, character_offset=1711}]
2      [{organization= Mart, character_offset=8232}, {organization= Rep, character_offset=5517}]
3      [{organization= Lodge, character_offset=3881}, {organization= Hotel, character_offset=5947}]
4      [{organization=Airport, character_offset=3881}, {organization=Landmark, character_offset=5947}]
2. Use ', '.join() + regex with .apply() on the column you want:
import re
df.Organizations = df.Organizations.apply(lambda x: ', '.join(re.findall(r'{[^=]+=\s*([^=,}]+)', x)))
df
3. Result:
index  Organizations
0      Vac, Health
1      Store, Museum
2      Mart, Rep
3      Lodge, Hotel
4      Airport, Landmark
In my opinion, you should try to better scrape and/or clean your data before putting it into a dataframe.

How to extract a specific "word" from a list

I have a dataframe.
dict_df = {'code': {0: 'a02',
1: 'a03',
2: 'a04',
3: 'a05',
4: 'a06',
5: 'a07',
6: 'a08',
7: 'a09',
8: 'a10'},
'name': {0: 'Dr Mike',
1: ' Dr. Benjamin',
2: 'Doctor Dre',
3: 'ApotekOne',
4: 'Aptek Two',
5: 'Apotek 3',
6: 'DrVladrimir',
7: ' dR Sarah inc.',
8: 'DR.John'}}
df = pd.DataFrame(dict_df)
I'm trying to extract different strings into another column. I will take "dr" as an example, but it applies to all of them.
For "dr" I need to match it in any form or shape (dr, DR, Dr, dR), plus:
before (dr) there can be a blank or any other char except a letter or a number (ex. " Dr")
after (dr) there can be a blank, a point, or any other char except a letter or a number (ex. DR.John)
if there is no special char after (dr) (ex. blank, point, etc.) and the next letter is uppercase, it is a match ("Dre" is not a match but "DrVlad" is)
What I have done so far, although it doesn't cover all the conditions above:
df['inclusions']= df['name'].str.findall(r'(?i)dr|doctor|apotek|aptek|two').str.join(", ").str.lower()
If in the "inclusions" column I get (dr) twice, how can I keep only one (no duplicates)?
Thank you.
IIUC, you can use an anchor to the start, a locally case-insensitive group, and a negative lookahead:
df['inclusions'] = (df['name']
                    .str.findall(r'^\s*(?i:dr|doctor|apotek|aptek|two)(?![a-z])')
                    .str.join(", ")
                    .str.lower()
                    )
output:
code name inclusions
0 a02 Dr Mike dr
1 a03 Dr. Benjamin dr
2 a04 Doctor Dre doctor
3 a05 ApotekOne apotek
4 a06 Aptek Two aptek
5 a07 Apotek 3 apotek
6 a08 DrVladrimir dr
7 a09 dR Sarah inc. dr
8 a10 DR.John dr
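Regarding the duplicates question at the end: the anchored pattern above yields at most one match per row, so duplicates cannot arise there. If you keep the original unanchored findall, though, you can drop per-row duplicates while preserving order with dict.fromkeys (a sketch, reusing two of the names from the question's dataframe):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Dr Mike', 'DrVladrimir']})

# findall can match "dr" twice in "DrVladrimir"; dict.fromkeys
# deduplicates the lowercased matches while keeping their order
df['inclusions'] = (df['name']
    .str.findall(r'(?i)dr|doctor|apotek|aptek|two')
    .apply(lambda matches: ', '.join(dict.fromkeys(m.lower() for m in matches))))
print(df)
```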

Getting ValueError - non-unique multi-index when using explode() in Pandas

I want to split two columns at their commas and bring the pieces back into the original pandas dataframe. I tried explode(), but I got ValueError: cannot handle a non-unique multi-index! How can I overcome this error?
import pandas as pd
data = {'fruit_tag': {0: 'apple, organge', 1: 'watermelon', 2: 'banana', 3: 'banana', 4: 'apple, banana'}, 'location': {0: 'Hong Kong , London', 1: 'New York, Tokyo', 2: 'Singapore', 3: 'Singapore, Hong Kong', 4: 'Tokyo'}, 'rating': {0: 'bad', 1: 'good', 2: 'good', 3: 'bad', 4: 'good'}, 'measure_score': {0: 0.9529434442520142, 1: 0.952498733997345, 2: 0.9080725312232971, 3: 0.8847543001174927, 4: 0.8679852485656738}}
dt = pd.DataFrame.from_dict(data)
dt.\
    set_index(['rating', 'measure_score']).\
    apply(lambda x: x.str.split(',').explode())
When you explode, the exploded rows keep the index of the old row they came from, so the index values are no longer unique. Pandas doesn't know (or like) to align these indexes, because users' intentions differ from case to case, e.g. align by order, or cross merge. In your case, for example, what do you expect to get from row 1, where you have 2 entries for each column? How about row 2?
If you want a cross merge, you would need to explode manually:
def explode(x, col):
    return x.assign(**{col: x[col].str.split(', ')}).explode(col)

explode(explode(dt, 'fruit_tag'), 'location')
Output:
fruit_tag location rating measure_score
0 apple Hong Kong bad 0.952943
0 apple London bad 0.952943
0 organge Hong Kong bad 0.952943
0 organge London bad 0.952943
1 watermelon New York good 0.952499
1 watermelon Tokyo good 0.952499
2 banana Singapore good 0.908073
3 banana Singapore bad 0.884754
3 banana Hong Kong bad 0.884754
4 apple Tokyo good 0.867985
4 banana Tokyo good 0.867985
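For completeness, the cross-merge helper above can be run as a self-contained sketch on a two-row subset of the posted data:

```python
import pandas as pd

# Two rows from the question's data (row 0 and row 3)
dt = pd.DataFrame({
    'fruit_tag': ['apple, organge', 'banana'],
    'location': ['Hong Kong , London', 'Singapore, Hong Kong'],
    'rating': ['bad', 'bad'],
})

def explode(x, col):
    # Split the column on ', ' and give each piece its own row
    return x.assign(**{col: x[col].str.split(', ')}).explode(col)

# Exploding one column, then the other, yields the cross merge per row
out = explode(explode(dt, 'fruit_tag'), 'location')
print(out)
```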

Performing an Operation On Grouped Pandas Data

I have a pandas DataFrame with the following information:
year state candidate percvotes electoral_votes perc_evotes vote_frac vote_int
1976 ALABAMA CARTER, JIMMY 55.727269 9 5.015454 0.015454 5
1976 ALABAMA FORD, GERALD 42.614871 9 3.835338 0.835338 3
1976 ALABAMA MADDOX, LESTER 0.777613 9 0.069985 0.069985 0
1976 ALABAMA BUBAR, BENJAMIN 0.563808 9 0.050743 0.050743 0
1976 ALABAMA HALL, GUS 0.165194 9 0.014867 0.014867 0
where percvotes is the percentage of the total votes cast that the candidate received (calculated earlier), electoral_votes is the electoral college vote total for that state, perc_evotes is the candidate's calculated share of the electoral votes, and vote_frac and vote_int are the fractional and whole-number parts of the electoral votes earned, respectively. This data repeats for each election year, and then by state within each year. Each candidate has a row per state, with similar data.
What I want to do is allocate the leftover electoral votes to the candidates with the highest fractions. This number differs by state and year. In this case there would be 1 leftover electoral vote (9 total votes, and 5+3=8 are already allocated), and the remaining one should go to 'FORD, GERALD' since he has 0.835338 in the vote_frac column. Sometimes there are 2 or 3 left unallocated.
I have a solution that adds the data to a dictionary, but it uses for loops. I know there must be a better way to do this in a more "pandas" way. I have touched on groupby in this loop, but I feel like I am not utilizing pandas to its full potential.
My for loop:
results = {}
grouped = electdf.groupby(["year", "state"])
for key, group in grouped:
    year, state = key
    group['vote_remaining'] = group['electoral_votes'] - group['vote_int'].sum()
    remaining = group['vote_remaining'].iloc[0]
    top_fracs = group['vote_frac'].nlargest(remaining)
    group['total'] = (group['vote_frac'].isin(top_fracs)).astype(int) + group['vote_int']
    if year not in results:
        results[year] = {}
    for candidate, evotes in zip(group['candidate'], group['total']):
        if candidate not in results[year] and evotes:
            results[year][candidate] = 0
        if evotes:
            results[year][candidate] += evotes
Thanks in advance!
Perhaps use an apply function which finds the available electoral votes and the votes already allocated, then conditionally adds the difference to the 'vote_int' column of the row with the largest 'vote_frac':
import pandas as pd

df = pd.DataFrame({'year': {0: 1976, 1: 1976, 2: 1976, 3: 1976, 4: 1976},
                   'state': {0: 'ALABAMA', 1: 'ALABAMA', 2: 'ALABAMA',
                             3: 'ALABAMA', 4: 'ALABAMA'},
                   'candidate': {0: 'CARTER, JIMMY', 1: 'FORD, GERALD',
                                 2: 'MADDOX, LESTER', 3: 'BUBAR, BENJAMIN',
                                 4: 'HALL, GUS'},
                   'percvotes': {0: 55.727269, 1: 42.614871, 2: 0.777613,
                                 3: 0.563808, 4: 0.165194},
                   'electoral_votes': {0: 9, 1: 9, 2: 9, 3: 9, 4: 9},
                   'perc_evotes': {0: 5.015454, 1: 3.835338, 2: 0.069985,
                                   3: 0.050743, 4: 0.014867},
                   'vote_frac': {0: 0.015454, 1: 0.835338, 2: 0.069985,
                                 3: 0.050743, 4: 0.014867},
                   'vote_int': {0: 5, 1: 3, 2: 0, 3: 0, 4: 0}})

def apply_extra_e_votes(grp):
    # Get the group's electoral vote total
    # (assumes the first row in the group contains the
    # correct number of electoral votes for the group)
    available_e_votes = grp['electoral_votes'].iloc[0]
    # Get the sum of the vote_int column
    current_e_votes = grp['vote_int'].sum()
    # If there are more available votes than votes already allocated
    if available_e_votes > current_e_votes:
        # Update the 'vote_int' column at the max value of 'vote_frac'
        grp.loc[
            grp['vote_frac'].idxmax(),
            'vote_int'
        ] += available_e_votes - current_e_votes  # (remaining votes)
    return grp

# Group by and apply the function
new_df = df.groupby(['year', 'state']).apply(apply_extra_e_votes)

# For display
print(new_df.to_string(index=False))
Output:
 year    state        candidate  percvotes  electoral_votes  perc_evotes  vote_frac  vote_int
 1976  ALABAMA    CARTER, JIMMY  55.727269                9     5.015454   0.015454         5
 1976  ALABAMA     FORD, GERALD  42.614871                9     3.835338   0.835338         4
 1976  ALABAMA   MADDOX, LESTER   0.777613                9     0.069985   0.069985         0
 1976  ALABAMA  BUBAR, BENJAMIN   0.563808                9     0.050743   0.050743         0
 1976  ALABAMA        HALL, GUS   0.165194                9     0.014867   0.014867         0
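If you would rather avoid apply entirely, the same largest-remainder step can be sketched with group-wise transforms. This is not the answer's code: leftover and rank are names introduced here, and 'total' mirrors the column computed in the question's loop.

```python
import pandas as pd

df = pd.DataFrame({
    'year': [1976] * 5,
    'state': ['ALABAMA'] * 5,
    'candidate': ['CARTER, JIMMY', 'FORD, GERALD', 'MADDOX, LESTER',
                  'BUBAR, BENJAMIN', 'HALL, GUS'],
    'electoral_votes': [9] * 5,
    'vote_frac': [0.015454, 0.835338, 0.069985, 0.050743, 0.014867],
    'vote_int': [5, 3, 0, 0, 0],
})

g = df.groupby(['year', 'state'])
# Leftover votes per (year, state): total minus what's already allocated
leftover = df['electoral_votes'] - g['vote_int'].transform('sum')
# Rank candidates by fractional part within each group (1 = largest)
rank = g['vote_frac'].rank(method='first', ascending=False)
# Hand one extra vote to the top `leftover` fractions in each group
df['total'] = df['vote_int'] + (rank <= leftover).astype(int)
print(df[['candidate', 'total']])
```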

Update column values based on other columns

I have a weak grasp of Pandas and not a strong understanding of Python.
I want to update a column (d.Alias) based on the values of existing columns (d.Company and d2.Alias). d.Alias should be equal to d2.Alias whenever d2.Alias is a substring of d.Company.
Example datasets:
d = {'Company': ['The Cool Company Inc', 'Cool Company, Inc', 'The Cool Company',
                 'The Shoe Company', 'Muffler Store', 'Muffler Store'],
     'Position': ['Cool Job A', 'Cool Job B', 'Cool Job C', 'Salesman',
                  'Sales', 'Technician'],
     'City': ['Tacoma', 'Tacoma', 'Tacoma', 'Boulder', 'Chicago', 'Chicago'],
     'State': ['AZ', 'AZ', 'AZ', 'CO', 'IL', 'IL'],
     'Alias': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]}
d2 = {'Company': ['The Cool Company, Inc.', 'The Shoe Company', 'Muffler Store LLC'],
      'Alias': ['Cool Company', np.nan, 'Muffler'],
      'First Name': ['Carol', 'James', 'Frankie'],
      'Last Name': ['Fisher', 'Smith', 'Johnson']}
The np.nan for The Shoe Company is because for that instance an alias is not necessary.
I have tried using .loc, for loops, while loops, pandas.where, numpy.where, and several variations of each, with no desirable outcome. When using a for loop, the last value of d2.Alias was copied to all rows in d.Alias. I have not been able to reproduce that, however.
Previous posts that I looked at but couldn't get to work, or didn't understand: Conditionally fill column with value from another DataFrame based on row match in Pandas
pandas create new column based on values from other columns
Any help is greatly appreciated!
EDIT:
Expected output
Update:
After a few days of tinkering I reached the desired outcome. With Wen's response I had to change a couple of things.
First, I created a list of the unique values of df2.Alias, called aliases:
aliases = df2.Alias.unique()
Then, I had to remove .map(df2.set_index('Company').Alias). The line that generated my desired results:
df1['Alias'] = df1.Company.apply(lambda x: [process.extract(x, aliases, limit=1)][0][0][0])
Solution from fuzzywuzzy
from fuzzywuzzy import process
df1['Alias'] = (df1.Company
                .apply(lambda x: [process.extract(x, df2.Company, limit=1)][0][0][0])
                .map(df2.set_index('Company').Alias))
df1
Out[31]:
Alias City Company Position State
0 Cool Company Tacoma The Cool Company Inc Cool Job A AZ
1 Cool Company Tacoma Cool Company, Inc Cool Job B AZ
2 Cool Company Tacoma The Cool Company Cool Job C AZ
3 NaN Boulder The Shoe Company Salesman CO
4 Muffler Chicago Muffler Store Sales IL
5 Muffler Chicago Muffler Store Technician IL
One approach is to loop through your (presumably much smaller) second dataframe and, wherever an alias is a substring of d.Company, write that alias into d.Alias.
import pandas as pd
d = pd.DataFrame(d)
d2 = pd.DataFrame(d2)
for row in d2[d2.Alias.notnull()].itertuples():
d.loc[d.Company.str.contains(row.Alias), 'Alias'] = row.Alias
print(d)
# Alias City Company Position State
#0 Cool Company Tacoma The Cool Company Inc Cool Job A AZ
#1 Cool Company Tacoma Cool Company, Inc Cool Job B AZ
#2 Cool Company Tacoma The Cool Company Cool Job C AZ
#3 NaN Boulder The Shoe Company Salesman CO
#4 Muffler Chicago Muffler Store Sales IL
#5 Muffler Chicago Muffler Store Technician IL
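The substring-matching loop above, condensed into a self-contained sketch on a three-row subset of the example data:

```python
import numpy as np
import pandas as pd

d = pd.DataFrame({'Company': ['The Cool Company Inc', 'The Shoe Company', 'Muffler Store'],
                  'Alias': [np.nan, np.nan, np.nan]})
d2 = pd.DataFrame({'Company': ['The Cool Company, Inc.', 'The Shoe Company', 'Muffler Store LLC'],
                   'Alias': ['Cool Company', np.nan, 'Muffler']})

# For each known alias, mark the rows of d whose Company contains it
for row in d2[d2.Alias.notnull()].itertuples():
    d.loc[d.Company.str.contains(row.Alias), 'Alias'] = row.Alias
print(d)
```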
