Pandas/Python - Merging files on different columns based on incoming files

I have a Python program which receives incoming files. The incoming files are organised by country. Sample files are below:
File 1 (USA) -
country state city population
USA IL Chicago 2000000
USA TX Dallas 1000000
USA CO Denver 5000000
File 2 (Non USA) -
country state city population
UK London 2000000
UK Bristol 1000000
UK Glasgow 5000000
Then I have a mapping file which needs to be merged with the incoming files. The mapping file looks like this:
Country state Continent
UK Europe
Egypt Africa
USA TX North America
USA IL North America
USA CO North America
Now the requirement is that I need to join the incoming file with the mapping file on the state column if it's a USA file, and on the country column if it's a non-USA file. For example:
If it's a USA file:
result_file = pd.merge(input_file, mapping_file, on="state", how="left")
If it's a non-USA file:
result_file = pd.merge(input_file, mapping_file, on="country", how="left")
How can I place a condition which can identify the incoming file and do the merging of file accordingly?
Thanks in advance

To get unified code for both cases, after reading the files add another column (country_state) to both the incoming file's DataFrame (df) and the mapping file's DataFrame (dfmap), combining country and state, then merge on that column.
For example:
import pandas as pd
df = pd.read_csv('fileX.txt') # assumed for fileX
dfmap = pd.read_csv('mapping_file.txt') # assumed for mapping file
df = df.fillna('') # fillna returns a new DataFrame, so assign the result
dfmap = dfmap.fillna('') # blank states in the mapping file must also become ''
if 'state' in df.columns:
    df['country_state'] = df['country'] + df['state']
else:
    df['country_state'] = df['country']
dfmap['country_state'] = dfmap['country'] + dfmap['state']
result_file = pd.merge(df, dfmap, on="country_state", how="left")
Then you can drop the columns you do not need.
A modification that adds the state column if it does not exist and merges on country and state directly, without the extra 'country_state' column from the previous code:
import pandas as pd
df = pd.read_csv('file1.txt')
dfmap = pd.read_csv('file_map.txt')
df = df.fillna('')
dfmap = dfmap.fillna('')
if 'state' not in df.columns:
    df['state'] = ''
result_file = pd.merge(df, dfmap, on=["country", "state"], how="left")

First, empty the state column for non-USA files (the sample data uses the value 'USA').
input_file.loc[input_file.country != 'USA', 'state'] = ''
Then, merge on two columns:
result_file = pd.merge(input_file, mapping_file, on=["country", "state"], how="left")
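A minimal sketch of this two-column merge on the question's sample data. The lowercase column names and the in-memory frames are assumptions for illustration; it also assumes that blank states should be normalised to '' in the mapping file as well, so that the keys compare equal on both sides.

```python
import pandas as pd

# Sample frames mirroring the question; real code would use pd.read_csv.
input_file = pd.DataFrame({
    "country": ["UK", "UK", "UK"],
    "state": [None, None, None],
    "city": ["London", "Bristol", "Glasgow"],
    "population": [2000000, 1000000, 5000000],
})
mapping_file = pd.DataFrame({
    "country": ["UK", "Egypt", "USA", "USA", "USA"],
    "state": [None, None, "TX", "IL", "CO"],
    "continent": ["Europe", "Africa", "North America",
                  "North America", "North America"],
})

# Normalise missing states to '' on both sides.
input_file["state"] = input_file["state"].fillna("")
mapping_file["state"] = mapping_file["state"].fillna("")

# Empty the state column for non-USA rows, then merge on both keys.
input_file.loc[input_file["country"] != "USA", "state"] = ""
result_file = pd.merge(input_file, mapping_file,
                       on=["country", "state"], how="left")
print(result_file)
```

The same call handles a USA file unchanged, because its rows keep their state values and match the USA rows of the mapping.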

How are you loading the files?
Is there any pattern in the file names that you can work with?
If they are in the same folder, you can list the files with
import os
list_of_files = os.listdir('my_directory/')
Or you could do a simple search in the country column for USA, and then apply the merge that fits the situation.
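Checking the country column and choosing the merge accordingly might look like the sketch below. The helper name merge_by_file_type and the lowercase column names country, state and continent are assumptions for illustration; the sample frames mirror the files in the question.

```python
import pandas as pd

def merge_by_file_type(input_file, mapping_file):
    """Join on 'state' for a USA file and on 'country' for any other file."""
    if input_file["country"].eq("USA").all():
        # Only the USA rows of the mapping carry a state, so restrict to them.
        usa_map = mapping_file.loc[mapping_file["country"] == "USA",
                                   ["state", "continent"]]
        return pd.merge(input_file, usa_map, on="state", how="left")
    # One mapping row per country is enough for the non-USA join.
    country_map = mapping_file[["country", "continent"]].drop_duplicates("country")
    return pd.merge(input_file, country_map, on="country", how="left")

mapping_file = pd.DataFrame({
    "country": ["UK", "Egypt", "USA", "USA", "USA"],
    "state": [None, None, "TX", "IL", "CO"],
    "continent": ["Europe", "Africa", "North America",
                  "North America", "North America"],
})
usa_file = pd.DataFrame({
    "country": ["USA", "USA", "USA"],
    "state": ["IL", "TX", "CO"],
    "city": ["Chicago", "Dallas", "Denver"],
    "population": [2000000, 1000000, 5000000],
})
uk_file = pd.DataFrame({
    "country": ["UK", "UK", "UK"],
    "state": [None, None, None],
    "city": ["London", "Bristol", "Glasgow"],
    "population": [2000000, 1000000, 5000000],
})

usa_result = merge_by_file_type(usa_file, mapping_file)
uk_result = merge_by_file_type(uk_file, mapping_file)
print(usa_result)
print(uk_result)
```

Selecting only the needed mapping columns before each merge keeps the result free of duplicated country_x/country_y or state_x/state_y columns.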

Related

Python dataframe from 2 text files (different number of columns)

I need to make a dataframe from two txt files.
The first txt file looks like this: Street_name space id.
The second txt file looks like this: City_name space id.
Example:
text file 1:
Roseberry st 1234
Brooklyn st 4321
Wolseley 1234567
text file 2:
Winnipeg 4321
Winnipeg 1234
Ste Anne 1234567
I need to make one dataframe out of this. Sometimes there is just one word for Street_name, and sometimes more. The same goes for City_name.
I get an error: ParserError: Error tokenizing data. C error: Expected 2 fields in line 5, saw 3, because I'm trying to put both words of the street name into the same column but don't know how to do it. I want one column for street name (no matter if it consists of one or more words), one for city name and one for id.
I want a df with 3 rows and 3 cols.
Thanks!
Edit: both text files are huge (50 million+ rows each), so I need this code not to break and to be optimised for large files.

This is NOT valid CSV, so you may need to parse it on your own.
You can use the normal open() and read(), then split on newlines to create a list of lines. After that, use a for-loop with line.rsplit(" ", 1) to split each line on its last space.
Minimal working example:
I use io to simulate a file in memory, so everyone can simply copy and test it, but you should use open().
text = '''Roseberry st 1234
Brooklyn st 4321
Wolseley 1234567'''
import io
#with open('filename') as fh:
with io.StringIO(text) as fh:
    lines = fh.read().splitlines()
print(lines)
lines = [line.rsplit(" ", 1) for line in lines]
print(lines)
import pandas as pd
df = pd.DataFrame(lines, columns=['name', 'number'])
print(df)
Result:
['Roseberry st 1234', 'Brooklyn st 4321', 'Wolseley 1234567']
[['Roseberry st', '1234'], ['Brooklyn st', '4321'], ['Wolseley', '1234567']]
name number
0 Roseberry st 1234
1 Brooklyn st 4321
2 Wolseley 1234567
EDIT:
read_csv can use a regex to define the separator (e.g. sep=r"\s+" for runs of whitespace), and it can even use lookahead/lookbehind ((?=...)/(?<=...)) to check that a digit follows the space without consuming it as part of the separator.
text = '''Roseberry st 1234
Brooklyn st 4321
Wolseley 1234567'''
import io
import pandas as pd
#df = pd.read_csv('filename', names=['name', 'number'], sep=r'\s(?=\d)', engine='python')
df = pd.read_csv(io.StringIO(text), names=['name', 'number'], sep=r'\s(?=\d)', engine='python')
print(df)
Result:
name number
0 Roseberry st 1234
1 Brooklyn st 4321
2 Wolseley 1234567
Later you can combine both dataframes using .join() or .merge() with the on= parameter (or something similar), like a JOIN in an SQL query.
text1 = '''Roseberry st 1234
Brooklyn st 4321
Wolseley 1234567'''
text2 = '''Winnipeg 4321
Winnipeg 1234
Ste Anne 1234567'''
import io
import pandas as pd
df1 = pd.read_csv(io.StringIO(text1), names=['street name', 'id'], sep=r'\s(?=\d)', engine='python')
df2 = pd.read_csv(io.StringIO(text2), names=['city name', 'id'], sep=r'\s(?=\d)', engine='python')
print(df1)
print(df2)
df = df1.merge(df2, on='id')
print(df)
Result:
street name id
0 Roseberry st 1234
1 Brooklyn st 4321
2 Wolseley 1234567
city name id
0 Winnipeg 4321
1 Winnipeg 1234
2 Ste Anne 1234567
street name id city name
0 Roseberry st 1234 Winnipeg
1 Brooklyn st 4321 Winnipeg
2 Wolseley 1234567 Ste Anne
Pandas doc: Merge, join, concatenate and compare
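The split-on-last-space idea can also be written with the vectorised string method Series.str.rsplit instead of a Python loop over rsplit. This is only a sketch: the helper name read_name_id is hypothetical, io.StringIO stands in for the real files, and whether it is fast enough for 50 million rows is untested.

```python
import io
import pandas as pd

def read_name_id(source, name_col):
    """Read 'name id' lines and split each one on its last space."""
    # Each line is kept as one string; empty lines are skipped.
    lines = pd.Series([line.rstrip("\n") for line in source if line.strip()])
    # n=1 limits rsplit to a single split, counted from the right.
    parts = lines.str.rsplit(" ", n=1, expand=True)
    parts.columns = [name_col, "id"]
    return parts

text1 = "Roseberry st 1234\nBrooklyn st 4321\nWolseley 1234567\n"
text2 = "Winnipeg 4321\nWinnipeg 1234\nSte Anne 1234567\n"

streets = read_name_id(io.StringIO(text1), "street name")
cities = read_name_id(io.StringIO(text2), "city name")

# The ids stay strings on both sides, so the join keys have the same type.
merged = streets.merge(cities, on="id")
print(merged)
```

Unlike the lookahead separator, this splits on the last space regardless of whether the name itself contains digits.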

There's nothing that I'm aware of in pandas that does this automatically.
Below, I built a script that will merge those addresses (addy + st) into a single column, then merges the two data frames into one based on the "id".
I assume your actual text files are significantly larger, so assuming they follow the pattern set in the two examples, this script should work fine.
Basically, this code turns each line of text in the file into a list, then combines lists of length 3 into length 2 by combining the first two list items.
After that, it turns the "list of lists" into a dataframe and merges those dataframes on column "id".
Couple caveats:
Make sure you set the correct text file paths
Make sure the first line of the text files contains 2, single string column headers (ie: "address id") or (ie: "city id")
Make sure each text file id column header is named "id"
import pandas as pd
import numpy as np
# set both text file paths (you may need full path i.e. C:\Users\Name\bla\bla\bla\text1.txt)
text_path_1 = r'text1.txt'
text_path_2 = r'text2.txt'
# declares first text file
with open(text_path_1) as f1:
    text_file_1 = f1.readlines()
# declares second text file
with open(text_path_2) as f2:
    text_file_2 = f2.readlines()
# function that massages data into two columns (to put "st" into same column as address name)
def data_massager(text_file_lines):
    data_list = []
    for item in text_file_lines:
        stripped_item = item.strip('\n')
        split_stripped_item = stripped_item.split(' ')
        if len(split_stripped_item) == 3:
            split_stripped_item[0:2] = [' '.join(split_stripped_item[0 : 2])]
        data_list.append(split_stripped_item)
    return data_list
# runs function on both text files
data_list_1 = data_massager(text_file_1)
data_list_2 = data_massager(text_file_2)
# creates dataframes on both text files
df1 = pd.DataFrame(data_list_1[1:], columns = data_list_1[0])
df2 = pd.DataFrame(data_list_2[1:], columns = data_list_2[0])
# merges data based on id (make sure both text files' id is named "id")
merged_df = df1.merge(df2, how='left', on='id')
# prints dataframe (assuming you're using something like jupyter-lab)
merged_df

pandas has strong support for strings. You can make the lines of each file into a Series and then use a regular expression to separate the fields into separate columns. I assume that "id" is the common value that links the two datasets, so it can become the dataframe index and the columns can just be added together.
import pandas as pd
street_series = pd.Series([line.strip() for line in open("text1.txt")])
street_df = street_series.str.extract(r"(.*?) (\d+)$")
del street_series
street_df.rename({0:"street", 1:"id"}, axis=1, inplace=True)
street_df.set_index("id", inplace=True)
print(street_df)
city_series = pd.Series([line.strip() for line in open("text2.txt")])
city_df = city_series.str.extract(r"(.*?) (\d+)$")
del city_series
city_df.rename({0:"city", 1:"id"}, axis=1, inplace=True)
city_df.set_index("id", inplace=True)
print(city_df)
street_df["city"] = city_df["city"]
print(street_df)

Compare values between 2 dataframes and transform data

The main aim of this script is to compare the regex format of the data present in the csv with the official ZIP Code regex format for that country, and if the format does not match, the script would carry out transformations on said data and output it all in one final dataframe.
I have 2 csv files, one (countries.csv) containing the following columns & data examples
INPUT:
Contact ID   Country   Zip Code
1            USA       71293
2            Italy     IT 2310219
and another csv (Regex.csv) with the following data examples:
Country   Regex format
USA       [0-9]{5}(?:-[0-9]{4})?
Italy     \d{5}
Now, the first csv has some 35k records so I would like to create a function which loops through the regex.csv (Dataframe) to grab the country column and also the regex format. Then it would loop through the country list to grab every instance where regex['country'] == countries['country'] and it would apply the regex transformation to the ZIP Codes for that country.
So far I have this function but I can't get it to work.
def REGI (dframe):
    dframe=pd.DataFrame().reindex_like(contacts)
    cols = list(contacts.columns)
    for index,row in mergeOne.iterrows():
        country = (row['Country'])
        reg = (row[r'regex'])
        for i, r in contactsS.iterrows():
            if (r['Country of Residence'] == country or r['Country of Residence.1'] == country or r['Mailing Country (text only)'] == country or r['Other Country (text only)'] == country) :
                dframe.loc[i] = r
        dframe['Mailing Zip/Postal Code']=dframe['Mailing Zip/Postal Code'].apply(str).str.extractall(reg).unstack().apply(lambda x:','.join(x.dropna()), axis=1)
        contacts.loc[contacts['Contact ID'].isin(dframe['Contact ID']),cols] = dframe[cols]
        dframe = dframe.dropna(how='all')
    return dframe
['Contact ID'] is being used as an identifier column.
The second for loop works on its own however I would need to manually re-type a new dataframe name, regex format and country name (without the first for loop).
At the moment I am getting the following error:
ValueError
ValueError: pattern contains no capture groups
(Screenshots omitted: the dataframes and the error traceback, with some columns removed to mimic the example given above.)
If I paste the results into a new dataframe, it returns the following:
(Screenshot omitted: results in a new dataframe.)
Example as text:
Account ID   Country          Zip/Postal Code
1            United Kingdom   WV9 5BT
2            Ireland          D24 EO29
3            Latvia           1009
4            United Kingdom   EN6 1JE
5            Italy            22010
REGEX table:
Country          Regex
United Kingdom   (full pattern given under "United Kingdom regex" below)
Latvia           [L]{1}[V]{1}-{4}
Ireland          STRNG_LTN_EXT_255
Italy            \d{5}
United Kingdom regex:
([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})

Based on your response to my comment, I would suggest to directly fix the zip code using your regexes:
df3 = df2.set_index('Country')
df1['corrected_Zip'] = (df1.groupby('Country')
['Zip Code']
.apply(lambda x: x.str.extract('(%s)' % df3.loc[x.name, 'Regex format']))
)
df1
This groups by country, applies that country's regex, and extracts the matching value. The pattern is wrapped in parentheses because str.extract requires at least one capture group.
output:
Contact ID Country Zip Code corrected_Zip
0 1 USA 71293 71293
1 2 Italy IT 2310219 23102
NB. if you want you can directly overwrite Zip Code by doing df1['Zip Code'] = …
NB2. This will work only if all countries have an entry in df2, if this is not the case, you need to add a check for that (let me know)
NB3. if you want to know which rows had an invalid zip, you can fetch them using:
df1[df1['Zip Code']!=df1['corrected_Zip']]
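A hedged alternative sketch that also covers the case from NB2, where a country has no entry in the regex table. It uses Python's re module directly instead of str.extract, so the patterns need no capture group. The helper name extract_zip, the extra Spain row, and the choice to leave unmatched countries unchanged are assumptions for illustration.

```python
import re
import pandas as pd

df1 = pd.DataFrame({
    "Contact ID": [1, 2, 3],
    "Country": ["USA", "Italy", "Spain"],   # Spain has no regex entry
    "Zip Code": ["71293", "IT 2310219", "28001"],
})
df2 = pd.DataFrame({
    "Country": ["USA", "Italy"],
    "Regex format": [r"[0-9]{5}(?:-[0-9]{4})?", r"\d{5}"],
})

# Compile each country's pattern once, keyed by country name.
patterns = {country: re.compile(pattern)
            for country, pattern in zip(df2["Country"], df2["Regex format"])}

def extract_zip(country, zip_code):
    """Return the first match of the country's pattern, or None if no match."""
    pattern = patterns.get(country)
    if pattern is None:
        return zip_code  # assumption: no rule means the value is left as-is
    match = pattern.search(str(zip_code))
    return match.group(0) if match else None

df1["corrected_Zip"] = [extract_zip(country, zip_code)
                        for country, zip_code in zip(df1["Country"], df1["Zip Code"])]
print(df1)
```

match.group(0) returns the whole match, so this works the same whether or not the stored pattern contains its own groups.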

Adding information from a smaller table to a large one with Pandas

I would like to add the regional information to the main table that contains entity and account columns. In this way, each row in the main table should be duplicated, just like the append tool in Alteryx.
Is there a way to do this operation with Pandas in Python?
Thanks!

Older pandas versions have no built-in method for this (pandas 1.2 added how='cross' to merge), so you need to build the Cartesian product of the two DataFrames yourself; see the pandas documentation on merging DataFrames.
For your specific problem, try this:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(columns=['Entity', 'Account'])
df1.Entity = ['Entity1', 'Entity1']
df1.Account = ['Sales', 'Cost']
df2 = pd.DataFrame(columns=['Region'])
df2.Region = ['North America', 'Europa', 'Asia']
def cartesian_product_simplified(left, right):
    la, lb = len(left), len(right)
    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la,:lb])
    return pd.DataFrame(
        np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))
resultdf = cartesian_product_simplified(df1, df2)
print(resultdf)
output:
0 1 2
0 Entity1 Sales North America
1 Entity1 Sales Europa
2 Entity1 Sales Asia
3 Entity1 Cost North America
4 Entity1 Cost Europa
5 Entity1 Cost Asia
as expected.
By the way, please provide the DataFrame as code next time, not as a screenshot or a link. It helps us save time (please check how to ask).
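On pandas 1.2 or later, the same Cartesian product can be obtained with a cross merge, which also keeps the column names. A minimal sketch, assuming that version is available and using the same sample frames:

```python
import pandas as pd

df1 = pd.DataFrame({"Entity": ["Entity1", "Entity1"],
                    "Account": ["Sales", "Cost"]})
df2 = pd.DataFrame({"Region": ["North America", "Europa", "Asia"]})

# how="cross" pairs every row of df1 with every row of df2.
resultdf = df1.merge(df2, how="cross")
print(resultdf)
```

The result has len(df1) * len(df2) rows, so this is best suited to cases where the smaller table really is small.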

How do I add a blank line between merged files

I have several CSV files that I have managed to merge. However, I need to add a blank row between the files as they merge, so I know where a different file starts. I have tried everything. Please help.
import os
import glob
import pandas
def concatenate(indir="C:\\testing", outfile="C:\\done.csv"):
    os.chdir(indir)
    fileList=glob.glob("*.csv")
    dfList=[]
    colnames=["Creation Date","Author","Tweet","Language","Location","Country","Continent"]
    for filename in fileList:
        print(filename)
        df=pandas.read_csv(filename, header=None)
        ins=df.insert(len(df),'\n')
        dfList.append(ins)
    concatDf=pandas.concat(dfList,axis=0)
    concatDf.columns=colnames
    concatDf.to_csv(outfile,index=None)

Here is an example script. You can use the loc method with a non-existent key to enlarge the DataFrame and set the value of the new row.
The simplest solution seems to be to create a template DataFrame to use as a separator with the values set as desired. Then just insert it into the list of data frames to concatenate at appropriate positions.
Lastly, I removed the chdir, since glob can search in any path.
import glob
import pandas
def concatenate(input_dir, output_file_name):
    file_list=glob.glob(input_dir + "/*.csv")
    column_names=["Creation Date"
                  , "Author"
                  , "Tweet"
                  , "Language"
                  , "Location"
                  , "Country"
                  , "Continent"]
    # Create a separator template
    separator = pandas.DataFrame(columns=column_names)
    separator.loc[0] = [""]*7
    dataframes = []
    for file_name in file_list:
        print(file_name)
        if len(dataframes):
            # The list is not empty, so we need to add a separator
            dataframes.append(separator)
        dataframes.append(pandas.read_csv(file_name))
    concatenated = pandas.concat(dataframes, axis=0)
    concatenated.to_csv(output_file_name, index=None)
    print(concatenated)
concatenate("input", "out.csv")
An alternative, even shorter, way is to build the concatenated DataFrame iteratively, appending each file in turn. (DataFrame.append was removed in pandas 2.0, so pandas.concat is used for the append step here.)
def concatenate(input_dir, output_file_name):
    file_list=glob.glob(input_dir + "/*.csv")
    column_names=["Creation Date"
                  , "Author"
                  , "Tweet"
                  , "Language"
                  , "Location"
                  , "Country"
                  , "Continent"]
    concatenated = pandas.DataFrame(columns=column_names)
    for file_name in file_list:
        print(file_name)
        if len(concatenated):
            # The DataFrame is not empty, so we need to add a separator row
            concatenated.loc[len(concatenated)] = [""]*7
        concatenated = pandas.concat([concatenated, pandas.read_csv(file_name)])
    concatenated.to_csv(output_file_name, index=None)
    print(concatenated)
I tested the script with 3 input CSV files:
input/1.csv
Creation Date,Author,Tweet,Language,Location,Country,Continent
2015-12-17,foo,Hello,EN,London,UK,Europe
2015-12-18,bar,Bye,EN,Manchester,UK,Europe
2015-12-28,baz,Hallo,DE,Frankfurt,Germany,Europe
input/2.csv
Creation Date,Author,Tweet,Language,Location,Country,Continent
2016-01-09,bar,Tweeeeet,EN,New York,USA,America
2016-01-09,cat,Miau,FI,Helsinki,Finland,Europe
input/3.csv
Creation Date,Author,Tweet,Language,Location,Country,Continent
2018-12-12,who,Hello,EN,Delhi,India,Asia
When I ran it, the following output was written to console:
Console Output (using concat)
input\1.csv
input\2.csv
input\3.csv
Creation Date Author Tweet Language Location Country Continent
0 2015-12-17 foo Hello EN London UK Europe
1 2015-12-18 bar Bye EN Manchester UK Europe
2 2015-12-28 baz Hallo DE Frankfurt Germany Europe
0
0 2016-01-09 bar Tweeeeet EN New York USA America
1 2016-01-09 cat Miau FI Helsinki Finland Europe
0
0 2018-12-12 who Hello EN Delhi India Asia
The console output of the shorter variant is slightly different (note the indices in the first column), however this has no effect on the generated CSV file.
Console Output (using append)
input\1.csv
input\2.csv
input\3.csv
Creation Date Author Tweet Language Location Country Continent
0 2015-12-17 foo Hello EN London UK Europe
1 2015-12-18 bar Bye EN Manchester UK Europe
2 2015-12-28 baz Hallo DE Frankfurt Germany Europe
3
0 2016-01-09 bar Tweeeeet EN New York USA America
1 2016-01-09 cat Miau FI Helsinki Finland Europe
6
0 2018-12-12 who Hello EN Delhi India Asia
Finally, this is what the output CSV file it generated looks like:
out.csv
Creation Date,Author,Tweet,Language,Location,Country,Continent
2015-12-17,foo,Hello,EN,London,UK,Europe
2015-12-18,bar,Bye,EN,Manchester,UK,Europe
2015-12-28,baz,Hallo,DE,Frankfurt,Germany,Europe
,,,,,,
2016-01-09,bar,Tweeeeet,EN,New York,USA,America
2016-01-09,cat,Miau,FI,Helsinki,Finland,Europe
,,,,,,
2018-12-12,who,Hello,EN,Delhi,India,Asia
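If a truly empty line is wanted instead of a row of commas, one option is to write each DataFrame to the same open file handle and emit a newline in between. This is a sketch under assumptions: the helper name write_with_blank_lines is hypothetical, all frames are assumed to share the same columns, and io.StringIO stands in for the real output file.

```python
import io
import pandas as pd

def write_with_blank_lines(frames, handle):
    """Write each frame as CSV to an open text handle, separated by an empty line."""
    for position, frame in enumerate(frames):
        if position:
            handle.write("\n")  # empty line marks where the next file starts
        # Emit the header row only once, for the first frame.
        frame.to_csv(handle, index=False, header=(position == 0))

first = pd.DataFrame({"Author": ["foo", "bar"], "Country": ["UK", "UK"]})
second = pd.DataFrame({"Author": ["cat"], "Country": ["Finland"]})

buffer = io.StringIO()  # stands in for open("out.csv", "w", newline="")
write_with_blank_lines([first, second], buffer)
output_lines = buffer.getvalue().splitlines()
print(output_lines)
```

Note that pandas.read_csv skips blank lines by default (skip_blank_lines=True), so a file written this way reads back as one continuous table unless that option is turned off.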

Bash - Date manipulation and join

I have two CSV files that I would like to merge using the DATE (CSV 1) and pickup_datetime (CSV 2).
CSV 1: Weather.csv (45KB ~ 365 rows)
head -3 Weather.csv
STATION,STATION_NAME,ELEVATION,LATITUDE,LONGITUDE,DATE,PRCP,SNWD,SNOW,TMAX,TMIN,AWND,WDF2,WSF2
GHCND:USW00094728,NEW YORK CENTRAL PARK OBS BELVEDERE TOWER NY US,39.6,40.77889,-73.96917,20130101,0,0,0,44,-33,31,310,67
GHCND:USW00094728,NEW YORK CENTRAL PARK OBS BELVEDERE TOWER NY US,39.6,40.77889,-73.96917,20130102,0,0,0,6,-56,26,310,67
CSV 2: Final_Data_1.csv (250MB ~ 1.5M rows)
head -3 final_data_1.csv
medallion,hack_license,vendor_id_x,rate_code,store_and_fwd_flag,pickup_datetime,dropoff_datetime,passenger_count,trip_time_in_secs,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,vendor_id_y,payment_type,fare_amount,surcharge,mta_tax,tip_amount,tolls_amount,total_amount
DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,1,N,2013-01-01 23:54:15,2013-01-01 23:58:20,2,244,0.7,-73.974602,40.759945,-73.984734,40.759388,CMT,CSH,5.0,0.5,0.5,0.0,0.0,6.0
237F49C3ECC11F5024B254268F054384,93C363DDF8ED9385D65FAD07CE3F5F07,CMT,1,N,2013-01-01 07:35:47,2013-01-01 07:46:00,1,612,2.3,-73.98850999999999,40.774307,-73.981094,40.755325,CMT,CSH,10.0,0.0,0.5,0.0,0.0,10.5
How do I manipulate the date column in both CSV files and merge it to get one file with columns of Final_Data_1.csv coming before Weather.csv?

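You definitely don't want to be using Bash for this. A good way in Python would be to use pandas, something like this:
import pandas as pd
df1 = pd.read_csv('weather.csv')
df2 = pd.read_csv('final.csv')
# format the date columns so they match up (DATE is YYYYMMDD, pickup_datetime has a time part)
df1['date_formatted'] = pd.to_datetime(df1['DATE'].astype(str), format='%Y%m%d')
df2['date_formatted'] = pd.to_datetime(df2['pickup_datetime']).dt.normalize()
# df2 is the left frame, so its columns come before the weather columns
df3 = pd.merge(df2, df1, on='date_formatted')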
