Compare values between 2 dataframes and transform data - python

The main aim of this script is to compare the format of the ZIP Code data in the csv with the official ZIP Code regex for that country; if the format does not match, the script carries out transformations on that data and outputs everything in one final dataframe.
I have 2 csv files, one (countries.csv) containing the following columns & data examples
INPUT:
Contact ID  Country  Zip Code
1           USA      71293
2           Italy    IT 2310219
and another csv (Regex.csv) with the following data examples:
Country  Regex format
USA      [0-9]{5}(?:-[0-9]{4})?
Italy    \d{5}
Now, the first csv has some 35k records, so I would like to create a function which loops through Regex.csv (as a dataframe) to grab the country column and the regex format. Then it would loop through the country list to grab every instance where regex['country'] == countries['country'] and apply the regex transformation to the ZIP Codes for that country.
So far I have this function but I can't get it to work.
def REGI(dframe):
    dframe = pd.DataFrame().reindex_like(contacts)
    cols = list(contacts.columns)
    for index, row in mergeOne.iterrows():
        country = row['Country']
        reg = row[r'regex']
        for i, r in contactsS.iterrows():
            if (r['Country of Residence'] == country
                    or r['Country of Residence.1'] == country
                    or r['Mailing Country (text only)'] == country
                    or r['Other Country (text only)'] == country):
                dframe.loc[i] = r
        dframe['Mailing Zip/Postal Code'] = (dframe['Mailing Zip/Postal Code']
                                             .apply(str)
                                             .str.extractall(reg)
                                             .unstack()
                                             .apply(lambda x: ','.join(x.dropna()), axis=1))
        contacts.loc[contacts['Contact ID'].isin(dframe['Contact ID']), cols] = dframe[cols]
    dframe = dframe.dropna(how='all')
    return dframe
['Contact ID'] is being used as an identifier column.
The second for loop works on its own; however, without the first for loop I would need to manually re-type a new dataframe name, regex format and country name each time.
At the moment I am getting the following error:
ValueError: pattern contains no capture groups
(Screenshots omitted: the dataframes, the full error traceback, and the results pasted into a new dataframe. Some columns were removed to mimic the example given above.)
Example as text
Account ID  Country         Zip/Postal Code
1           United Kingdom  WV9 5BT
2           Ireland         D24 EO29
3           Latvia          1009
4           United Kingdom  EN6 1JE
5           Italy           22010
REGEX table
Country         Regex
United Kingdom  (see full regex below)
Latvia          [L]{1}[V]{1}-{4}
Ireland         STRNG_LTN_EXT_255
Italy           \d{5}
United Kingdom regex:
([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})

Based on your response to my comment, I would suggest directly fixing the zip codes using your regexes:
df3 = df2.set_index('Country')
df1['corrected_Zip'] = (df1.groupby('Country')['Zip Code']
                           .apply(lambda x: x.str.extract('(%s)' % df3.loc[x.name, 'Regex format']))
                        )
df1
This groups by country, applies the regex for that country, and extracts the value.
output:
   Contact ID Country    Zip Code corrected_Zip
0           1     USA       71293         71293
1           2   Italy  IT 2310219         23102
NB. if you want you can directly overwrite Zip Code by doing df1['Zip Code'] = …
NB2. This will work only if all countries have an entry in df2; if this is not the case, you need to add a check for that (let me know; see the sketch after these notes)
NB3. if you want to know which rows had an invalid zip, you can fetch them using:
df1[df1['Zip Code']!=df1['corrected_Zip']]
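Regarding NB2, a minimal sketch of such a check (not part of the answer above), assuming df3 is indexed by Country with one regex per country; rows whose country has no entry, or whose zip does not match, keep their original value:
import re

def correct_zip(row):
    # No regex known for this country: keep the original value.
    if row['Country'] not in df3.index:
        return row['Zip Code']
    m = re.search(df3.loc[row['Country'], 'Regex format'], str(row['Zip Code']))
    return m.group(0) if m else row['Zip Code']

df1['corrected_Zip'] = df1.apply(correct_zip, axis=1)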

Related

Text to columns in pandas dataframe

I have a pandas dataset like below:
import pandas as pd
data = {'id': ['001', '002', '003'],
        'address': ["William J. Clare\n290 Valley Dr.\nCasper, WY 82604\nUSA, United States",
                    "1180 Shelard Tower\nMinneapolis, MN 55426\nUSA, United States",
                    "William N. Barnard\n145 S. Durbin\nCasper, WY 82601\nUSA, United States"]
        }
df = pd.DataFrame(data)
print(df)
I need to split the address column on \n and create new columns like Name, address line 1, City, State, Zipcode, Country as below:
id  Name                addressline1        City         State  Zipcode  Country
1   William J. Clare    290 Valley Dr.      Casper       WY     82604    United States
2   null                1180 Shelard Tower  Minneapolis  MN     55426    United States
3   William N. Barnard  145 S. Durbin       Casper       WY     82601    United States
I am learning Python and have been working on this since this morning. Any help will be greatly appreciated.
Thanks,
Right now, pandas is returning a table with 2 columns. If you look at the value in the second column, the essential information is separated by commas. Therefore, if you saved your dataframe to df you can do the following:
df['address_and_city'] = df['address'].apply(lambda x: x.split(',')[0])
df['state_and_postal'] = df['address'].apply(lambda x: x.split(',')[1])
df['country'] = df['address'].apply(lambda x: x.split(',')[2])
Now you have three additional columns in your dataframe; the last one already contains the full information about the country. From the first two columns that you have created you can extract the info you need in a similar way.
df['address_first_line'] = df['address_and_city'].apply(lambda x: ' '.join(x.split('\n')[:-1]))
df['city'] = df['address_and_city'].apply(lambda x: x.split('\n')[-1])
df['state'] = df['state_and_postal'].apply(lambda x: x.split(' ')[1])
df['postal'] = df['state_and_postal'].apply(lambda x: x.split(' ')[2].split('\n')[0])
Now you should have all the columns you need. You can remove the excess columns with:
df.drop(columns=['address','address_and_city','state_and_postal'], inplace=True)
Of course, it can all be done faster and with fewer lines of code, but I think this is the clearest way of doing it, which I hope you will find useful. If you don't understand what I did there, check the documentation for the split and join string methods, and also for the apply method, which is native to pandas.
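As an example of one of those shorter alternatives, a single str.extract with named groups could do the whole split in one go. This is only a sketch: it assumes every address ends with a "City, ST ZIP" line followed by a "USA, Country" line, and the group names are simply taken from the desired output:
# Optional name line, then address line 1, then "City, ST ZIP", then "USA, Country".
pattern = (r'^(?:(?P<Name>[^\n]+)\n)?'
           r'(?P<addressline1>[^\n]+)\n'
           r'(?P<City>[^,]+), (?P<State>[A-Z]{2}) (?P<Zipcode>\d{5})\n'
           r'USA, (?P<Country>.+)$')
out = df[['id']].join(df['address'].str.extract(pattern))
print(out)
Rows without a name line (like id 002) simply get NaN in the Name column.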

How to read CSV in pyspark with "," delimiter but not ", "

I am using the following code to read the CSV file in PySpark
cb_sdf = sqlContext.read.format("csv") \
    .options(header='true',
             multiLine='True',
             inferschema='true',
             treatEmptyValuesAsNulls='true') \
    .load(cb_file)
The number of rows is correct. But for some rows, the columns are separated incorrectly. I think it is because the current delimiter is ",", but some cells contain ", " in the text as well.
For example, the following row in the pandas dataframe (I used pd.read_csv to debug)
Unnamed: 0  name                                domain    industry                  locality                   country  size_range
111         cjsc "transport, customs, tourism"  ttt-w.ru  package/freight delivery  vyborg, leningrad, russia  russia   1 - 10
becomes
_c0  name               domain   industry    locality  country                   size_range
111  "cjsc ""transport  customs  tourism"""  ttt-w.ru  package/freight delivery  vyborg, leningrad, russia
when I read it with PySpark.
It seems the cell "cjsc "transport, customs, tourism"" is separated into 3 cells: |"cjsc ""transport| customs| tourism"""|.
How can I set the delimiter to be exactly "," without any whitespace following it?
UPDATE:
I checked the CSV file, the original line is:
111,"cjsc ""transport, customs, tourism""",ttt-w.ru,package/freight delivery,"vyborg, leningrad, russia",russia,1 - 10
So is it still the problem of delimiter, or is it the problem of quotes?
I think that when the quotes are parsed correctly we'll have:
col1: 111
col2: "cjsc ""transport, customs, tourism"""
col3: ttt-w.ru
col4: package/freight delivery
col5: "vyborg, leningrad, russia"
col6: russia
col7: 1 - 10
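To answer the update: the line you pasted is standard CSV quoting (commas inside double quotes, and embedded quotes doubled), so this looks like a quoting problem rather than a delimiter problem. A sketch of the reader options that should handle it, using Spark's quote and escape options (untested against your file):
cb_sdf = (sqlContext.read.format("csv")
          .option("header", "true")
          .option("multiLine", "true")
          .option("inferSchema", "true")
          .option("treatEmptyValuesAsNulls", "true")
          # quoted fields use " and escape embedded quotes by doubling them ("")
          .option("quote", '"')
          .option("escape", '"')
          .load(cb_file))
With escape set to the quote character, the field "cjsc ""transport, customs, tourism""" should come back as a single cell.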

Create new column based on value of another column

I have a solution below that gives me a new column as a universal identifier, but what if there is additional data in the NAME column? How can I tweak the code below to account for a wildcard-like search term?
Basically, if German/german or Mexican/mexican is anywhere in that row's value, I want Euro or South American in the new column.
df["Identifier"] = (df["NAME"].str.lower().replace(
to_replace = ['german', 'mexican'],
value = ['Euro', 'South American']
))
print(df)
NAME Identifier
0 German Euro
1 german Euro
2 Mexican South American
3 mexican South American
Desired output
NAME Identifier
0 1990 German Euro
1 german 1998 Euro
2 country Mexican South American
3 mexican city 2006 South American
Based on an answer in this post:
r = '(german|mexican)'
c = dict(german='Euro', mexican='South American')
df['Identifier'] = df['NAME'].str.lower().str.extract(r, expand=False).map(c)
Another approach would be using np.where with those two conditions, but probably there is a more elegant solution.
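For completeness, a rough sketch of that np.where idea (nested, since each np.where takes a single condition; not tested against the full data):
import numpy as np

name_lower = df['NAME'].str.lower()
df['Identifier'] = np.where(name_lower.str.contains('german'), 'Euro',
                            np.where(name_lower.str.contains('mexican'), 'South American', None))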
The code below will work. I tried it using the apply function but somehow couldn't get it to work; probably I will at some point. Meanwhile, workable code is below:
df3['identifier'] = ''
js_ref = [{'german': 'Euro'}, {'mexican': 'South American'}]
for i in range(len(df3)):
    for l in js_ref:
        for k, v in l.items():
            if k.lower() in df3.name[i].lower():
                df3.identifier[i] = v
                break

Python Pandas compare two dataframes with a similar (string) column, where 1 df's values are substrings of the other df's values

I have a large CSV with phone number prefixes, and corresponding country names.
Then I have a smaller dataframe with prefixes, and corresponding country names that are spelled wrong. I need to find a way to replace the smaller df's country names with the ones from the CSV. Unfortunately, I can't simply compare csv_df['prefix'] == small_df['prefix'].
The smaller dataframe contains very long, specific prefixes ("3167777"), while the CSV has less specific prefixes ("31"). As you can imagine, a prefix like 3167777 is a subcategory of 31, and therefore belongs to the CSV's 31-group. However, if the CSV contains a more specific prefix ("316"), then the longer prefix should only belong to that group.
Dummy example data:
# CSV-df with correct country names
prefix country
93, Portugal
937, Portugal Special Numbers
31, Netherlands
316, Netherlands Special Numbers
# Smaller df with wrong country names
prefix country
93123, PT
9377777, PT
3121, NL
31612345, NL
31677777, NL
31688888, NL
Ideal result:
# Smaller df, but with correct country names
prefix country
9312345, Portugal
9377777, Portugal Special Numbers
3121, Netherlands
31612345, Netherlands Special Numbers
31677777, Netherlands Special Numbers
31688888, Netherlands Special Numbers
So far, I have a very slow, non-elegant solution: I've put the smaller df's prefixes in a list and loop through the list, checking each prefix against the CSV.
# small_df_prefixes = ['9312345', '3167', '31612345', ...]
for prefix in small_df_prefixes:
    # See if the prefix occurs in the CSV
    if prefix in [p[:len(prefix)] for p in csv_df['prefix']]:
        new_dest = csv_df.loc[csv_df['prefix'] == prefix]['destination'].values[0]
    else:
        new_dest = 'UNKNOWN'
This runs slowly and does not give me the solution I need. I've looked into pandas' startswith and contains functions but can't quite figure out how to use them correctly here. I hope someone can help me out with a better solution.
I believe you should take a look at pandas.DataFrame.apply. It enables you to apply a function either column- or row-wise on a DataFrame.
The idea here is to order df_right (which has the right names) by the length of the prefix (the longer, the more specific) and then try to exactly match the prefix in df_wrong to a prefix in df_right.
import pandas as pd

df_wrong = pd.read_csv('path/to/wrong.csv')
# df_wrong
#  prefix country
# 9312345      PT
#    3167      NL
#     ...     ...

df_right = pd.read_csv('path/to/right.csv')
# df_right
# prefix                   country
#     93                  Portugal
#    937  Portugal Special Numbers
#     31               Netherlands
#    ...                       ...

# because prefixes 93 != 937, we will sort values by their length, as 93 would contain 937
df_right.sort_values(by='prefix', key=lambda prefix: prefix.str.len(),
                     inplace=True, ascending=False)

def correct_country(prefix):
    global df_right
    equal_match = df_right[df_right.prefix == prefix].country
    if equal_match.shape[0] == 0:  # no matches
        return 'UNKNOWN'
    return equal_match.iloc[0]

df_wrong['corrected_country'] = df_wrong['prefix'].apply(correct_country)
Resources:
Sort Dataframe by String Length
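For what it's worth, here is a sketch of a longest-prefix variant of the same idea, using startswith as mentioned in the question. It assumes both prefix columns are strings and is not tested against the full data:
# Sort the correct table by prefix length, longest (most specific) first.
df_right = df_right.sort_values(by='prefix', key=lambda s: s.str.len(), ascending=False)

def longest_prefix_country(prefix):
    # Take the first (i.e. longest) prefix that the number starts with.
    for p, country in zip(df_right['prefix'], df_right['country']):
        if prefix.startswith(p):
            return country
    return 'UNKNOWN'

df_wrong['country'] = df_wrong['prefix'].astype(str).apply(longest_prefix_country)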

Pandas/Python - Merging files on different columns based on incoming files

I have a Python program which receives incoming files. Incoming files are based on different countries. Sample files are below -
File 1 (USA) -
country state city population
USA IL Chicago 2000000
USA TX Dallas 1000000
USA CO Denver 5000000
File 2 (Non USA) -
country  state  city     population
UK              London   2000000
UK              Bristol  1000000
UK              Glasgow  5000000
Then I have a mapping file which needs to be merged with incoming files. Mapping file look like this
Country  state  Continent
UK              Europe
Egypt           Africa
USA      TX     North America
USA      IL     North America
USA      CO     North America
Now the requirement is that I need to join the incoming file with the mapping file based on the state column if it's a USA file, and based on the Country column if it's a non-USA file. For example -
If it's a USA file -
result_file = pd.merge(input_file, mapping_file, on="state", how="left")
If it's a non-USA file -
result_file = pd.merge(input_file, mapping_file, on="country", how="left")
How can I place a condition which can identify the incoming file and do the merging of files accordingly?
Thanks in advance
In order to get unified code for both cases: after reading the files, add another column named country_state to both the DataFrame of fileX (df) and the DataFrame of the mapping file (dfmap), in which country and state are combined, then use this column as the join key.
for example:
import pandas as pd

df = pd.read_csv('fileX.txt')            # assumed for fileX
dfmap = pd.read_csv('mapping_file.txt')  # assumed for mapping file
df = df.fillna('')                       # to replace NaN values with ''
dfmap = dfmap.fillna('')                 # same for the mapping file (non-USA rows have no state)
if 'state' in df.columns:
    df['country_state'] = df['country'] + df['state']
else:
    df['country_state'] = df['country']
dfmap['country_state'] = dfmap['country'] + dfmap['state']
result_file = pd.merge(df, dfmap, on="country_state", how="left")
Then you can drop the columns you do not need
A modification which adds state if it does not exist, and sets the relation based on country and state without adding the 'country_state' column shown in the previous code:
import pandas as pd

df = pd.read_csv('file1.txt')
dfmap = pd.read_csv('file_map.txt')
df = df.fillna('')
dfmap = dfmap.fillna('')
if 'state' not in df.columns:
    df['state'] = ''
result_file = pd.merge(df, dfmap, on=["country", "state"], how="left")
First, empty the state column for non-USA files.
input_file.loc[input_file.country != 'USA', 'state'] = ''
Then, merge on two columns:
result_file = pd.merge(input_file, mapping_file, on=["country", "state"], how="left")
- How are you loading the files? Are there any patterns in the names of the files which you can work with?
If they are in the same folder, you can recognize the file with
import os
list_of_files = os.listdir('my_directory/')
or you could do a simple search in the Country column looking for USA, and then apply the merge according to the situation.
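A sketch of that last suggestion (checking the Country column and branching on it); the file names are placeholders and the merge calls simply mirror the ones from the question:
import pandas as pd

input_file = pd.read_csv('incoming_file.csv')   # placeholder name
mapping_file = pd.read_csv('mapping_file.csv')  # placeholder name

if (input_file['country'] == 'USA').any():
    # USA file: join on state
    result_file = pd.merge(input_file, mapping_file, on="state", how="left")
else:
    # Non-USA file: join on country
    result_file = pd.merge(input_file, mapping_file, on="country", how="left")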
