Update or replace value in df when conditions are met - python

I have a list of city names and a df with city, state, and zipcode columns. Some of the zipcodes are missing. When a zipcode is missing, I want to use a generic zipcode based on the city; for example, if the city is San Jose, the zipcode should be the generic 'SJ_zipcode'.
pattern_city = '|'.join(cities) #works
foundit = ( (df['cty_nm'].str.contains(pattern_city, flags=re.IGNORECASE)) & (df['zip_cd']==0) & (df['st_cd'].str.match('CA') ) ) #works--is this foundit a df?
df['zip_cd'] = foundit.replace( 'SJ_zipcode' ) #nope, error
Error: "Invalid dtype for pad_1d [bool]"
Implemented with where
df['zip_cd'].where( (df['cty_nm'].str.contains(pattern_city, flags=re.IGNORECASE)) & (df['zip_cd']==0) & (df['st_cd'].str.match('CA') ), "SJ_Zipcode", inplace = True) #nope, empty set; all set to nan?
Implemented with loc
df['zip_cd'].loc[ (df['cty_nm'].str.contains(pattern_city, flags=re.IGNORECASE)) & (df['zip_cd']==0) & (df['st_cd'].str.match('CA') ) ] = "SJ_Zipcode"
Some possible solutions that did not work
df.loc[df['First Season'] > 1990, 'First Season'] = 1, which I used as df.loc[foundit, 'zip_cd'] = 'SJ_zipcode' (from Pandas DataFrame: replace all values in a column, based on condition, and the similar Conditional Replace Pandas)
df['c'] = df.apply(lambda row: row['a']*row['b'] if np.isnan(row['c']) else row['c'], axis=1), although I am not multiplying values (https://datascience.stackexchange.com/questions/17769/how-to-fill-missing-value-based-on-other-columns-in-pandas-dataframe)
I tried a solution using where; however, it seemed to replace the values where the condition was not met with NaN, and the NaN values were not helpful (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.where.html)
This conditional approach looked promising, but without looping over each value I was confused about how anything actually happens (What should replace comparisons with False in python?)
An example using replace that does not have the multiple conditions and pattern: Replacing few values in a pandas dataframe column with another value
An additional 'want': I want to update the existing dataframe in place; I do not want to create a new dataframe.

Try this:
import pandas as pd
import numpy as np

# sample data reconstructed from the output below
data = {'city': ['Burbank', 'Anaheim', 'El Cerrito', 'Los Angeles', 'San Francisco'],
        'state': ['California'] * 5,
        'zip': [44325, np.nan, 57643, 56734, 32819]}
df = pd.DataFrame(data)
df
city state zip
0 Burbank California 44325
1 Anaheim California nan
2 El Cerrito California 57643
3 Los Angeles California 56734
4 San Francisco California 32819
def generate_placeholder_zip(row):
    if pd.isnull(row['zip']):
        row['zip'] = row['city'] + '_ZIPCODE'
    return row

df.apply(generate_placeholder_zip, axis=1)
city state zip
0 Burbank California 44325
1 Anaheim California Anaheim_ZIPCODE
2 El Cerrito California 57643
3 Los Angeles California 56734
4 San Francisco California 32819
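Note that apply with axis=1 returns a new frame, while the question asks for an in-place update. (Also, the foundit in the question is a boolean Series, not a DataFrame.) A single .loc assignment with a boolean mask updates in place and avoids the chained-indexing pitfall of df['zip_cd'].loc[...]. A minimal sketch, assuming the question's original column names and cities list:

import re

# build the mask once (assumes pattern_city, cty_nm, st_cd, zip_cd from the question)
mask = (df['cty_nm'].str.contains(pattern_city, flags=re.IGNORECASE)
        & (df['zip_cd'] == 0)
        & df['st_cd'].str.match('CA'))

# one .loc call on the frame itself: in-place update, no SettingWithCopyWarning
df.loc[mask, 'zip_cd'] = 'SJ_zipcode'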

Related

Text to columns in pandas dataframe

I have a pandas dataset like below:
import pandas as pd
data = {'id': ['001', '002', '003'],
        'address': ["William J. Clare\n290 Valley Dr.\nCasper, WY 82604\nUSA, United States",
                    "1180 Shelard Tower\nMinneapolis, MN 55426\nUSA, United States",
                    "William N. Barnard\n145 S. Durbin\nCasper, WY 82601\nUSA, United States"]
        }
df = pd.DataFrame(data)
print(df)
I need to split the address column on the '\n' delimiter and create new columns (Name, addressline1, City, State, Zipcode, Country) like below:
id Name addressline1 City State Zipcode Country
1 William J. Clare 290 Valley Dr. Casper WY 82604 United States
2 null 1180 Shelard Tower Minneapolis MN 55426 United States
3 William N. Barnard 145 S. Durbin Casper WY 82601 United States
I am learning Python and have been working on this since this morning. Any help will be greatly appreciated.
Thanks,
Right now, pandas returns a table with two columns. If you look at the value in the second column, the essential information is separated by commas. So, if you saved your dataframe to df, you can do the following:
df['address_and_city'] = df['address'].apply(lambda x: x.split(',')[0])
df['state_and_postal'] = df['address'].apply(lambda x: x.split(',')[1])
df['country'] = df['address'].apply(lambda x: x.split(',')[2])
Now you have three additional columns in your dataframe, and the last one already contains the full country information. From the first two columns you created, you can extract the info you need in a similar way.
df['address_first_line'] = df['address_and_city'].apply(lambda x: ' '.join(x.split('\n')[:-1]))
df['city'] = df['address_and_city'].apply(lambda x: x.split('\n')[-1])
df['state'] = df['state_and_postal'].apply(lambda x: x.split(' ')[1])
df['postal'] = df['state_and_postal'].apply(lambda x: x.split(' ')[2].split('\n')[0])
Now you should have all the columns you need. You can remove the excess columns with:
df.drop(columns=['address','address_and_city','state_and_postal'], inplace=True)
Of course, it can all be done faster and with fewer lines of code, but I think this is the clearest way of doing it, which I hope you will find useful. If you don't understand what happens there, check the documentation for the split and join string methods, and for the apply method, which is native to pandas.
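As one hedged sketch of the shorter route alluded to above, a single str.extract call with named groups can pull all fields in one pass; the pattern below is an assumption about the address format (optional name line, street, 'City, ST 12345', then 'USA, Country'):

pattern = (r'(?:(?P<Name>[^\n]+)\n)?'     # optional name line (NaN when absent)
           r'(?P<addressline1>[^\n]+)\n'  # street address
           r'(?P<City>[^,\n]+), '         # city, up to the comma
           r'(?P<State>[A-Z]{2}) '        # two-letter state code
           r'(?P<Zipcode>\d{5})\n'        # five-digit zip
           r'USA, (?P<Country>.+)')       # country after "USA, "
df = df.join(df['address'].str.extract(pattern))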

How to use pd.DataFrame.replace on a column containing lists

I am having trouble with the pandas replace function. Let's say we have an example dataframe like this:
df = pd.DataFrame({'State': ['Georgia', 'Alabama', 'Tennessee'],
                   'Cities': [['Atlanta', 'Albany'],
                              ['Montgomery', 'Huntsville', 'Birmingham'],
                              ['Nashville', 'Knoxville']]})
>>> df
State Cities
0 Georgia [Atlanta, Albany]
1 Alabama [Montgomery, Huntsville, Birmingham]
2 Tennessee [Nashville, Knoxville]
Now I want to replace all the state and city names with abbreviations. I have two dictionaries that define the replacement values:
state_abbrv = {'Alabama': 'AL', 'Georgia': 'GA', 'Tennessee': 'TN'}
city_abbrv = {'Albany': 'Alb.', 'Atlanta': 'Atl.', 'Birmingham': 'Birm.',
              'Huntsville': 'Htsv.', 'Knoxville': 'Kxv.',
              'Montgomery': 'Mont.', 'Nashville': 'Nhv.'}
When using pd.DataFrame.replace() on the "State" column (which only contains one value per row), it works as expected and replaces all state names:
>>> df.replace({'State': state_abbrv})
State Cities
0 GA [Atlanta, Albany]
1 AL [Montgomery, Huntsville, Birmingham]
2 TN [Nashville, Knoxville]
I was hoping that it would also individually replace all matching names within the lists in the "Cities" column, but unfortunately it does not seem to work, as all cities remain unabbreviated:
>>> df.replace({'Cities': city_abbrv})
State Cities
0 Georgia [Atlanta, Albany]
1 Alabama [Montgomery, Huntsville, Birmingham]
2 Tennessee [Nashville, Knoxville]
How do I get pd.DataFrame.replace() to iterate over the list elements in each row of the column and replace them accordingly?
Try:
explode to split the list into individual rows
replace each column using the relevant dictionary
groupby and agg to get back the original structure
>>> output = (df.explode("Cities")
...             .replace({"State": state_abbrv, "Cities": city_abbrv})
...             .groupby("State", as_index=False)["Cities"]
...             .agg(list))
>>> output
State Cities
0 AL [Mont., Htsv., Birm.]
1 GA [Atl., Alb.]
2 TN [Nhv., Kxv.]
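If you would rather not round-trip through explode/groupby (which also re-sorts the rows by state), a hedged alternative sketch is to map each list element directly, with dict.get falling back to the original name when no abbreviation exists:

df['State'] = df['State'].map(state_abbrv)
df['Cities'] = df['Cities'].apply(lambda cities: [city_abbrv.get(c, c) for c in cities])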

Create new column based on value of another column

I have a solution below that gives me a new column as a universal identifier, but what if there is additional data in the NAME column? How can I tweak the below to account for a wildcard-like search term?
Basically, if German/german or Mexican/mexican appears anywhere in that row's value, I want 'Euro' or 'South American' respectively in the new column.
df["Identifier"] = (df["NAME"].str.lower().replace(
to_replace = ['german', 'mexican'],
value = ['Euro', 'South American']
))
print(df)
NAME Identifier
0 German Euro
1 german Euro
2 Mexican South American
3 mexican South American
Desired output
NAME Identifier
0 1990 German Euro
1 german 1998 Euro
2 country Mexican South American
3 mexican city 2006 South American
Based on an answer in this post:
r = '(german|mexican)'
c = dict(german='Euro', mexican='South American')
df['Identifier'] = df['NAME'].str.lower().str.extract(r, expand=False).map(c)
Another approach would be using np.where with those two conditions, but there is probably a more elegant solution.
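For completeness, a minimal sketch of that np.where route (nested conditions, None when neither keyword is found):

import numpy as np

name = df['NAME'].str.lower()
df['Identifier'] = np.where(name.str.contains('german'), 'Euro',
                            np.where(name.str.contains('mexican'), 'South American', None))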
The code below will work. I tried it using the apply function but somehow couldn't get it working; perhaps I will in some time. Meanwhile, workable code below:
df3['identifier'] = ''
js_ref = [{'german': 'Euro'}, {'mexican': 'South American'}]
for i in range(len(df3)):
    for ref in js_ref:
        for k, v in ref.items():
            if k.lower() in df3.name[i].lower():
                df3.loc[i, 'identifier'] = v  # .loc avoids chained-assignment issues
                break
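For larger frames, a sketch of a vectorized version of the same keyword scan, using str.contains once per keyword instead of a Python loop over rows (assumes the df3 columns above):

lowered = df3['name'].str.lower()
for keyword, label in [('german', 'Euro'), ('mexican', 'South American')]:
    df3.loc[lowered.str.contains(keyword), 'identifier'] = label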

Best way to filter for multiple conditions in the same df

I've got a df that's been merged, and I want to apply some logic to it so that I capture issues from the data sources.
I want to capture both the case where the Areacodes match but the t's do not,
and the case where neither the Areacodes nor the t's match at all.
Here's a merged_df before the filter.
Name t_1 Areacode_1 t_2 Areacode_2
Jerry New Jersey 12674 Texas 12674
Elaine New York 98765 Alaska 78654
George New York 12345 New York 12345
Is there a way to do this all in one filter? This is what I have so far, but it would be nice to put it in one line:
m = merged_df.loc[(merged_df['t_1'] != merged_df['t_2']) & (merged_df['Areacode_1'] == merged_df['Areacode_2']) ]
m2 = merged_df.loc[(merged_df['t_1'] != merged_df['t_2']) & (merged_df['Areacode_1'] != merged_df['Areacode_2']) ]
After the filter I'd expect George to be removed because all of his columns matched.
Expected merged_df:
Name t_1 Areacode_1 t_2 Areacode_2
Jerry New Jersey 12674 Texas 12674
Elaine New York 98765 Alaska 78654
You could do it like this:
import pandas as pd
merged_df = pd.DataFrame({'Name': ['Jerry', 'Elaine', 'George'],
                          't_1': ['New Jersey', 'New York', 'New York'],
                          'Areacode_1': [12674, 98765, 12345],
                          't_2': ['Texas', 'Alaska', 'New York'],
                          'Areacode_2': [12674, 78654, 12345]})
filtered1 = merged_df.loc[~((merged_df.t_1 == merged_df.t_2) & (merged_df.Areacode_1 == merged_df.Areacode_2))]
display(filtered1)
filtered2 = merged_df.loc[(merged_df.t_1 != merged_df.t_2)]
display(filtered2)
Note that filtered1 shows the same output as filtered2 and is the same as your 'Expected merged_df'.
Both will essentially meet your criteria.
I used np.where to solve this.
merged_df2 = merged_df.assign(
    Filter=np.where((merged_df['Salesforce_Territory'] != merged_df['Snowflake Territory'])
                    & (merged_df['Salesforce_Zip_Code'] != merged_df['Snowflake Zip'])
                    | (merged_df['Salesforce_Territory'] != merged_df['Snowflake Territory']),
                    True, False))
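Note that by the absorption law, (A & B) | A reduces to A, so the condition above is equivalent to the single territory comparison; assuming the same column names, the boolean Series can be assigned directly without np.where:

# equivalent, simpler form of the Filter column above
merged_df2 = merged_df.assign(
    Filter=merged_df['Salesforce_Territory'] != merged_df['Snowflake Territory'])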

Speed up pandas dataframe lookup

I have a pandas data frame with zip codes, city, state and country of ~ 600,000 locations. Let's call it my_df
I'd like to look up the corresponding longitude and latitude for each of these locations. Thankfully, there is a database for this. Let's call this dataframe zipdb.
zipdb has, among others, columns for zip codes, city, state and country.
So, I'd like to look up all of the locations (zip, city, state and country) in zipdb.
def zipdb_lookup(zipcode, city, state, country):
    countries_mapping = {"UNITED STATES": "US",
                         "CANADA": "CA",
                         "KOREA REP OF": "KR",
                         "ITALY": "IT",
                         "AUSTRALIA": "AU",
                         "CHILE": "CL",
                         "UNITED KINGDOM": "GB",
                         "BERMUDA": "BM"}
    try:
        slc = zipdb[(zipdb.Zipcode == str(zipcode)) &
                    (zipdb.City == str(city).upper()) &
                    (zipdb.State == str(state).upper()) &
                    (zipdb.Country == countries_mapping[country].upper())]
        if slc.shape[0] == 1:
            return np.array(slc["Lat"])[0], np.array(slc["Long"])[0]
        else:
            return None
    except:
        return None
I have tried pandas' .apply as well as a for loop to do this.
Both are very slow. I recognize there are a large number of rows, but I can't help but think something faster must be possible.
zipdb = pandas.read_csv("free-zipcode-database.csv") #linked to above
Note: I've also performed this transformation on zipdb:
zipdb["Zipcode"] = zipdb["Zipcode"].astype(str)
Function Call:
# Defined a wrapper function:
def lookup(row):
    """
    :param row:
    :return:
    """
    lnglat = zipdb_lookup(
        zipcode=my_df["organization_zip"][row],
        city=my_df["organization_city"][row],
        state=my_df["organization_state"][row],
        country=my_df["organization_country"][row])
    return lnglat

lnglat = list()
for l in range(0, my_df.shape[0]):
    # if l % 5000 == 0: print(round((float(l) / my_df.shape[0]) * 100, 2), "%")
    lnglat.append(lookup(row=l))
Sample Data from my_df:
organization_zip organization_city organization_state organization_country
0 60208 EVANSTON IL United Sates
1 77555 GALVESTON TX United Sates
2 23284 RICHMOND VA United Sates
3 53233 MILWAUKEE WI United Sates
4 10036 NEW YORK NY United Sates
5 33620 TAMPA FL United Sates
6 10029 NEW YORK NY United Sates
7 97201 PORTLAND OR United Sates
8 97201 PORTLAND OR United Sates
9 53715 MADISON WI United Sates
Using merge() will be a lot faster than calling a function on every row. Make sure the field types match and strings are stripped:
# prepare your dataframe
data['organization_zip'] = data.organization_zip.astype(str)
data['organization_city'] = data.organization_city.apply(lambda v: v.strip())
# get the zips database
zips = pd.read_csv('/path/to/free-zipcode-database.csv')
zips['Zipcode'] = zips.Zipcode.astype(str)
# left join
# -- prepare common join columns
zips.rename(columns=dict(Zipcode='organization_zip',
                         City='organization_city'),
            inplace=True)
# specify join columns along with zips' columns to copy
cols = ['organization_zip', 'organization_city', 'Lat', 'Long']
data.merge(zips[cols], how='left')
Note you may need to extend the merge columns and/or add more columns to copy from the zips dataframe.
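As a sketch of that extension, state and country can be folded into the join as well; the normalization below reuses the question's countries_mapping dict and assumes the zips file has State and Country columns:

# rename the remaining join keys on the zips side (assumed column names)
zips.rename(columns=dict(State='organization_state',
                         Country='organization_country'),
            inplace=True)
data['organization_state'] = data.organization_state.str.upper()
data['organization_country'] = data.organization_country.str.upper().map(
    countries_mapping)  # e.g. "UNITED STATES" -> "US", to match the zips file
cols = ['organization_zip', 'organization_city', 'organization_state',
        'organization_country', 'Lat', 'Long']
merged = data.merge(zips[cols], how='left')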
