Best strategy for merging a lot of data frames using pandas - python

I'm trying to merge many (a few thousand one column tsv files) data frames into a single csv file using pandas. I'm new to pandas (and python for that matter) and could use some input or direction.
My data frames are observational data on a list scraped from the web and do not contain headers. For example:
data frame 1:
bluebird 34
chickadee 168
eagle 10
hawk 67
sparrow 2
data frame 2:
albatross 56
bluebird 78
hawk 3
pelican 19
sparrow 178
What I'm looking to do is simply create a master file with all of the individual observations:
albatross 0 56
bluebird 34 78
chickadee 168 0
eagle 10 0
hawk 67 3
pelican 0 19
sparrow 2 178
I've tried to merge the data frames one at a time using pandas:
import pandas as pd
df1 = pd.read_table("~/home/birds1.tsv", sep='\t')
df2 = pd.read_table("~/home/birds2.tsv", sep='\t')
merged = df1.merge(df1, df2, how="left").fillna("0")
merged.to_csv("merged.csv", index=False)
but I am only getting one column. I don't have a master list of "birds", but I can concatenate all the data and sort on unique names for a dictionary list if this is needed.
What should my strategy be for merging a few thousand files?

Look at the docs for merge: when called on a DataFrame, the first parameter is the 'other' frame and the second is which variables you want to merge on (not actually sure what happens when you pass a DataFrame there).
But, assuming your bird column is called 'bird', what you probably want is:
In [412]: df1.merge(df2, on='bird', how='outer').fillna(0)
Out[412]:
bird value_x value_y
0 bluebird 34 78
1 chickadee 168 0
2 eagle 10 0
3 hawk 67 3
4 sparrow 2 178
5 albatross 0 56
6 pelican 0 19
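If you want to stay with merge for all of the files at once, one option (a rough sketch, assuming each .tsv really is a headerless bird-name/count pair as in the question) is to chain pairwise outer merges with functools.reduce. Note that with thousands of files the concat approach in the next answer will likely be much faster:
import os
import functools
import pandas as pd

directory = os.path.expanduser('~/home')
paths = [os.path.join(directory, f) for f in sorted(os.listdir(directory)) if f.endswith('.tsv')]

# read each headerless tsv, naming the count column after the file it came from
dfs = [pd.read_table(p, sep='\t', header=None, names=['bird', os.path.basename(p)]) for p in paths]

# chain pairwise outer merges on 'bird' across the whole list
merged = functools.reduce(lambda left, right: left.merge(right, on='bird', how='outer'), dfs)
merged.fillna(0).to_csv('merged.csv', index=False)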

I would think the fastest way is to set the column you want to merge on to the index, create a list of the dataframes and then pd.concat them. Something like this:
import os
import pandas as pd

directory = os.path.expanduser('~/home')
files = os.listdir(directory)
dfs = []
for filename in files:
    if filename.endswith('.tsv'):
        # index by the bird name so concat aligns the rows
        df = pd.read_table(os.path.join(directory, filename), sep='\t').set_index('bird')
        dfs.append(df)
master_df = pd.concat(dfs, axis=1)
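To match the desired output from the question (zeros rather than NaN, written out to a single csv), a final step along these lines should work, assuming every value is an integer count:
master_df.fillna(0).astype(int).to_csv('merged.csv')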

Related

When importing csv to PANDAS, how can you only import columns that contain a specified string within them?

I have thousands of CSV files that each contain hundreds of columns and hundreds of thousands of rows. For speed I want to only import the data that I need into PANDAS dataframes. I can filter out the CSV files that I do not need using a separate metadata file, but I am having trouble figuring out how to drop the columns that I do not need (during the import -- I know how to filter the columns of a dataframe after it's been imported, but like I said, I am trying to avoid importing unnecessary data).
So let's say I have the following csv file:
Date/Time  Apple Tart  Cherry Pie  Blueberry Pie  Banana Pudding  Tomato Soup
1:00       2           4           7              6               5
2:00       3           5           4              5               8
3:00       1           4           7              4               4
I want to import only columns that include the text "Pie", as well as the "Date/Time" column. Also note that the column names and number of columns are different for all of my csv files, so the "usecols" specification has not worked for me as-is since I do not know the specific column names to enter.
The usecols parameter in pandas read_csv accepts a function to filter for the columns you are interested in:
import pandas as pd
from io import StringIO

data = """Date/Time  Apple Tart  Cherry Pie  Blueberry Pie  Banana Pudding  Tomato Soup
1:00  2  4  7  6  5
2:00  3  5  4  5  8
3:00  1  4  7  4  4"""

df = pd.read_csv(StringIO(data),
                 sep=r'\s{2,}',
                 engine='python',
                 # this is the key part of the code for your use case:
                 # it looks for columns that contain Pie or Date/Time
                 # and returns only those columns
                 # (quite extensible as well, since it accepts a function)
                 usecols=lambda x: ("Pie" in x) or ("Date/Time" in x))
df
Date/Time Cherry Pie Blueberry Pie
0 1:00 4 7
1 2:00 5 4
2 3:00 4 7
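The StringIO buffer is just to make the example self-contained; reading from a file on disk works the same way (the filename here is hypothetical):
df = pd.read_csv('bakery_sales.csv', sep=r'\s{2,}', engine='python',
                 usecols=lambda x: ("Pie" in x) or ("Date/Time" in x))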
You can specify the column names you want as a list when you use read_csv(), for example:
df = pd.read_csv('fila.csv', names=['columnName#1', 'columnName3'])
Note that I did not use 'columnName2'.

how to map two dataframes with pandas [duplicate]

This question already has answers here: Pandas Merging 101 (8 answers). Closed 3 years ago.
I have two excel files:
+ File one contains specific data about different customers (like: Sex, Age, Name...) and
+ File two contains different transactions for each customer
I want to create a new column in File2 containing the specific data for each customer from File1
file1.csv
customer_id,sex,age,name
af4wf3,m,12,mike
z20ask,f,15,sam
file2.csv
transaction_id,customer_id,amount
12h2j4hk,af4wf3,123.20
12h2j4h1,af4wf3,5.22
12h2j4h2,z20ask,13.20
12h2j4h3,af4wf3,1.20
12h2j4h4,z20ask,2341.12
12h2j4h5,z20ask,235.96
12h2j4h6,af4wf3,999.30
Load and join the dataframes
import pandas as pd
df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')
df1.set_index('customer_id', inplace=True)
df2.set_index('transaction_id', inplace=True)
output = df2.join(df1, on='customer_id')
output.to_csv('file2_updated.csv')
file2_updated.csv
transaction_id,customer_id,amount,sex,age,name
12h2j4hk,af4wf3,123.2,m,12,mike
12h2j4h1,af4wf3,5.22,m,12,mike
12h2j4h2,z20ask,13.2,f,15,sam
12h2j4h3,af4wf3,1.2,m,12,mike
12h2j4h4,z20ask,2341.12,f,15,sam
12h2j4h5,z20ask,235.96,f,15,sam
12h2j4h6,af4wf3,999.3,m,12,mike
The same as @jc416's answer, but using pd.merge:
file2.merge(file1, on='customer_id')
transaction_id customer_id amount sex age name
0 12h2j4hk af4wf3 123.2 m 12 mike
1 12h2j4h1 af4wf3 5.22 m 12 mike
2 12h2j4h3 af4wf3 1.2 m 12 mike
3 12h2j4h6 af4wf3 999.3 m 12 mike
4 12h2j4h2 z20ask 13.2 f 15 sam
5 12h2j4h4 z20ask 2341.12 f 15 sam
6 12h2j4h5 z20ask 235.96 f 15 sam
You definitely should read Pandas merging 101
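One thing to note about the merge version: merge defaults to an inner join, so any transaction whose customer_id does not appear in file1 is dropped (join, by contrast, defaults to a left join). To keep every transaction regardless, pass how='left':
file2.merge(file1, on='customer_id', how='left')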

Merging DataFrames on specific columns

I have a frame moviegoers that includes zip codes but not cities.
I then redefined moviegoers to be zipcodes and changed the data type of zip codes to be a data frame instead of a series.
zipcodes = pd.read_csv('NYC1-moviegoers.csv',dtype={'zip_code': object})
I know the dataset URL I need is this: https://raw.githubusercontent.com/mafudge/datasets/master/zipcodes/free-zipcode-database-Primary.csv.
I defined a dataframe, zip_codes, to call the data from that dataset and changed the dataset type from series to dataframe so it's in the same format as the zipcodes dataframe.
I want to merge the dataframes so I can have the movie goer data. But, instead of zipcodes, I want to have the state abbreviation. This is where I am having issues.
The end goal is to count the number of movie goers per state. Example ideal output:
CA 116
MN 78
NY 60
TX 51
IL 50
Any ideas would be greatly appreciated.
I think you need to map by a Series and then use value_counts to count:
print (zipcodes)
zip_code
0 85711
1 94043
2 32067
3 43537
4 15213
# map each Zipcode in the zip_codes reference data to its State
s = zip_codes.set_index('Zipcode')['State']
# translate the moviegoers' zip codes to states, then count movie goers per state
df = zipcodes['zip_code'].map(s).value_counts().rename_axis('state').reset_index(name='count')
print (df.head())
state count
0 OH 1
1 CA 1
2 FL 1
3 AZ 1
4 PA 1
Simply merge both datasets on the Zipcode column, then run groupby for the state counts.
import pandas as pd

# READ DATA FILES WITH RENAMING OF ZIP COLUMN IN FIRST
url = "https://raw.githubusercontent.com/mafudge/datasets/master/zipcodes/free-zipcode-database-Primary.csv"
moviegoers = pd.read_csv('NYC1-moviegoers.csv', dtype={'zip_code': object}).rename(columns={'zip_code': 'Zipcode'})
zipcodes = pd.read_csv(url, dtype={'Zipcode': object})
# MERGE ON COMMON FIELD
merged_df = pd.merge(moviegoers, zipcodes, on='Zipcode')
# AGGREGATE BY INDICATOR (STATE)
merged_df.groupby('State').size()
# ALTERNATIVE GROUP BY COUNT
merged_df.groupby('State')['Zipcode'].agg('count')
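If you also want the counts sorted in descending order, as in the example output at the top, value_counts on the merged column gives that directly:
merged_df['State'].value_counts()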

Use pandas to get county name using fips codes

I have fips codes here: http://www2.census.gov/geo/docs/reference/codes/files/national_county.txt
And a dataset that looks like this:
fips_state fips_county value
1 1 10
1 3 34
1 5 37
1 7 88
1 9 93
How can I get the county name of each row using the data from the link above with pandas?
Simply load both data sets into DataFrames, then set the appropriate index:
df1.set_index(['fips_state', 'fips_county'], inplace=True)
This gives you a MultiIndex by state+county. Once you've done this for both datasets, you can trivially map them, for example:
df1['county_name'] = df2.county_name
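A fuller sketch of that approach (hedged: the filenames are placeholders, and it assumes the census file is comma-separated with no header row and columns state abbreviation, state FIPS, county FIPS, county name, class code, which is what the linked file looks like):
import pandas as pd

# dataset with fips_state, fips_county and value columns (hypothetical filename)
df1 = pd.read_csv('values.csv')

# census reference file: no header row, five columns
cols = ['state_abbr', 'fips_state', 'fips_county', 'county_name', 'class_code']
df2 = pd.read_csv('national_county.txt', header=None, names=cols)

# index both frames by the state+county FIPS pair
df1.set_index(['fips_state', 'fips_county'], inplace=True)
df2.set_index(['fips_state', 'fips_county'], inplace=True)

# index alignment fills in the matching county name for each row
df1['county_name'] = df2['county_name']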

Column Manipulations with date-Time Pandas

I am trying to do some column manipulations involving rows and columns at the same time, including date and time series, in Pandas. Traditionally, when no series are involved, python dictionaries are great, but with Pandas this is a new thing for me.
Input files: N of them.
File1.csv, File2.csv, File3.csv, ........... Filen.csv

File1.csv        File2.csv        File3.csv
Ids,Date-time-1  Ids,Date-time-2  Ids,Date-time-1
56,4568          645,5545         25,54165
45,464           458,546
I am trying to merge the Date-time column of all the files into a big data file with respect to Ids
Ids,Date-time-ref,Date-time-1,date-time-2
56,100,4468,NAN
45,150,314,NAN
645,50,NAN,5495
458,200,NAN,346
25,250,53915,NAN
Check for the date-time column - if it is not already there, create it, and then fill in the values with respect to Ids by subtracting the date-time-ref value of the respective Id from the current date-time value.
Fill the empty places with NAN, and if a later file has a value for that Id, replace the NAN with the new value.
If it were a straight column subtraction it would be pretty easy, but doing it in sync with the date-time series and with respect to Ids seems a bit confusing.
I'd appreciate some suggestions to begin with. Thanks in advance.
Here is one way to do it.
import pandas as pd
import numpy as np
from io import StringIO
# your csv file contents
csv_file1 = 'Ids,Date-time-1\n56,4568\n45,464\n'
csv_file2 = 'Ids,Date-time-2\n645,5545\n458,546\n'
# add a duplicated Ids record for testing purpose
csv_file3 = 'Ids,Date-time-1\n25,54165\n645, 4354\n'
csv_file_all = [csv_file1, csv_file2, csv_file3]
# read csv into df using list comprehension
# I use buffer here, replace stringIO with your file path
df_all = [pd.read_csv(StringIO(csv_file)) for csv_file in csv_file_all]
# processing
# =====================================================
# concat along axis=0, outer join on axis=1
merged = pd.concat(df_all, axis=0, ignore_index=True, join='outer').set_index('Ids')
Out[206]:
Date-time-1 Date-time-2
Ids
56 4568 NaN
45 464 NaN
645 NaN 5545
458 NaN 546
25 54165 NaN
645 4354 NaN
# custom function to handle/merge duplicates on Ids (axis=0)
def apply_func(group):
    return group.fillna(method='ffill').iloc[-1]
# remove Ids duplicates
merged_unique = merged.groupby(level='Ids').apply(apply_func)
Out[207]:
Date-time-1 Date-time-2
Ids
25 54165 NaN
45 464 NaN
56 4568 NaN
458 NaN 546
645 4354 5545
# do the subtraction
master_csv_file = 'Ids,Date-time-ref\n56,100\n45,150\n645,50\n458,200\n25,250\n'
df_master = pd.read_csv(StringIO(master_csv_file), index_col=['Ids']).sort_index()
# select matching records and horizontal concat
df_matched = pd.concat([df_master,merged_unique.reindex(df_master.index)], axis=1)
# use broadcasting
df_matched.iloc[:, 1:] = df_matched.iloc[:, 1:].sub(df_matched.iloc[:, 0], axis=0)
Out[208]:
Date-time-ref Date-time-1 Date-time-2
Ids
25 250 53915 NaN
45 150 314 NaN
56 100 4468 NaN
458 200 NaN 346
645 50 4304 5495
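To end up with the big data file the question asks for, writing the result back out is one more line (the filename is just a placeholder):
df_matched.reset_index().to_csv('master.csv', index=False)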
