Bash - Date manipulation and join - Python

I have two CSV files that I would like to merge using the DATE (CSV 1) and pickup_datetime (CSV 2).
CSV 1: Weather.csv (45KB ~ 365 rows)
head -3 Weather.csv
STATION,STATION_NAME,ELEVATION,LATITUDE,LONGITUDE,DATE,PRCP,SNWD,SNOW,TMAX,TMIN,AWND,WDF2,WSF2
GHCND:USW00094728,NEW YORK CENTRAL PARK OBS BELVEDERE TOWER NY US,39.6,40.77889,-73.96917,20130101,0,0,0,44,-33,31,310,67
GHCND:USW00094728,NEW YORK CENTRAL PARK OBS BELVEDERE TOWER NY US,39.6,40.77889,-73.96917,20130102,0,0,0,6,-56,26,310,67
CSV 2: Final_Data_1.csv (250MB ~ 1.5M rows)
head -3 Final_Data_1.csv
medallion,hack_license,vendor_id_x,rate_code,store_and_fwd_flag,pickup_datetime,dropoff_datetime,passenger_count,trip_time_in_secs,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,vendor_id_y,payment_type,fare_amount,surcharge,mta_tax,tip_amount,tolls_amount,total_amount
DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,1,N,2013-01-01 23:54:15,2013-01-01 23:58:20,2,244,0.7,-73.974602,40.759945,-73.984734,40.759388,CMT,CSH,5.0,0.5,0.5,0.0,0.0,6.0
237F49C3ECC11F5024B254268F054384,93C363DDF8ED9385D65FAD07CE3F5F07,CMT,1,N,2013-01-01 07:35:47,2013-01-01 07:46:00,1,612,2.3,-73.98850999999999,40.774307,-73.981094,40.755325,CMT,CSH,10.0,0.0,0.5,0.0,0.0,10.5
How do I normalise the date column in both CSV files and merge them into one file, with the Final_Data_1.csv columns coming before the Weather.csv ones?

You definitely don't want to be doing this in Bash; a good way in Python would be to use pandas, something like this:
import pandas as pd

df1 = pd.read_csv('Weather.csv')
df2 = pd.read_csv('Final_Data_1.csv')
# normalise both date columns to a common datetime dtype so they match up
df1['merge_date'] = pd.to_datetime(df1['DATE'].astype(str), format='%Y%m%d')
df2['merge_date'] = pd.to_datetime(df2['pickup_datetime']).dt.normalize()
# pass df2 first so the Final_Data_1.csv columns come before the Weather.csv ones
df3 = pd.merge(df2, df1, on='merge_date')
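Since Final_Data_1.csv is around 250MB, it may not fit comfortably in memory; read_csv can stream it in chunks instead. A sketch along the same lines ('merged.csv' is just a placeholder output name, and df1 must already carry the merge_date column from above):
reader = pd.read_csv('Final_Data_1.csv', chunksize=100000)
for i, chunk in enumerate(reader):
    chunk['merge_date'] = pd.to_datetime(chunk['pickup_datetime']).dt.normalize()
    merged = pd.merge(chunk, df1, on='merge_date')
    merged.to_csv('merged.csv', mode='a', header=(i == 0), index=False)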

Related

Pandas/Python - Merging files on different columns based on incoming files

I have a Python program which receives incoming files. The incoming files are based on different countries. Sample files are below -
File 1 (USA) -
country  state  city     population
USA      IL     Chicago  2000000
USA      TX     Dallas   1000000
USA      CO     Denver   5000000
File 2 (Non USA) -
country  state  city     population
UK              London   2000000
UK              Bristol  1000000
UK              Glasgow  5000000
Then I have a mapping file which needs to be merged with incoming files. Mapping file look like this
Country  state  Continent
UK              Europe
Egypt           Africa
USA      TX     North America
USA      IL     North America
USA      CO     North America
Now the requirement is that I need to join the incoming file with the mapping file on the state column if it's a USA file, and on the country column if it's a non-USA file. For example -
If it's a USA file -
result_file = pd.merge(input_file, mapping_file, on="state", how="left")
If it's a non-USA file -
result_file = pd.merge(input_file, mapping_file, on="country", how="left")
How can I add a condition that identifies the type of the incoming file and does the merging accordingly?
Thanks in advance
To get unified code for both cases: after reading the files, add another column named country_state to both the DataFrame of fileX (df) and the DataFrame of the mapping file (dfmap), in which country and state are concatenated, then merge on that column.
for example:
import pandas as pd

df = pd.read_csv('fileX.txt')            # assumed for fileX
dfmap = pd.read_csv('mapping_file.txt')  # assumed for the mapping file
df = df.fillna('')        # replace NaN values with '' (fillna returns a copy, so assign it back)
dfmap = dfmap.fillna('')  # same for the mapping file, whose state column has blanks
if 'state' in df.columns:
    df['country_state'] = df['country'] + df['state']
else:
    df['country_state'] = df['country']
dfmap['country_state'] = dfmap['country'] + dfmap['state']
result_file = pd.merge(df, dfmap, on="country_state", how="left")
Then you can drop the columns you do not need
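For example (a sketch; which leftover columns exist depends on the merge, since pandas suffixes overlapping column names with _x/_y):
result_file = result_file.drop(columns=['country_state'], errors='ignore')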
A modification that adds a state column if one does not exist, and merges on country and state directly, without the extra 'country_state' column from the previous code:
import pandas as pd

df = pd.read_csv('file1.txt')
dfmap = pd.read_csv('file_map.txt')
df = df.fillna('')        # fillna returns a copy; assign it back
dfmap = dfmap.fillna('')
if 'state' not in df.columns:
    df['state'] = ''
result_file = pd.merge(df, dfmap, on=["country", "state"], how="left")
First, empty the state column for non-USA files; this makes the non-USA rows line up with the mapping rows whose state is blank:
input_file.loc[input_file.country != 'USA', 'state'] = ''
Then, merge on two columns:
result_file = pd.merge(input_file, mapping_file, on=["country", "state"], how="left")
How are you loading the files? Is there any pattern in the file names you can work with?
If they are in the same folder, you can list them with
import os
list_of_files = os.listdir('my_directory/')
Or you could do a simple search in the country column looking for USA, and then apply the merge according to the situation.
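A minimal sketch of that second idea (the filenames are hypothetical; the detection rule simply checks whether any row has country == 'USA'):
import pandas as pd

input_file = pd.read_csv('incoming_file.txt')   # hypothetical incoming file
mapping_file = pd.read_csv('mapping_file.txt')
# pick the join key based on whether this is a USA file
if (input_file['country'] == 'USA').any():
    result_file = pd.merge(input_file, mapping_file, on="state", how="left")
else:
    result_file = pd.merge(input_file, mapping_file, on="country", how="left")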

Adding information from a smaller table to a large one with Pandas

I would like to add the regional information to the main table that contains entity and account columns, so that each row in the main table is duplicated once per region, just like the Append tool in Alteryx.
Is there a way to do this operation with Pandas in Python?
Thanks!
Unfortunately no built-in method exists; you'll need to build the Cartesian product of those DataFrames - check this fancy explanation of merging DataFrames in pandas.
But for your specific problem, try this:
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'Entity': ['Entity1', 'Entity1'],
                    'Account': ['Sales', 'Cost']})
df2 = pd.DataFrame({'Region': ['North America', 'Europa', 'Asia']})

def cartesian_product_simplified(left, right):
    # pair every row of `left` with every row of `right` via index broadcasting
    la, lb = len(left), len(right)
    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la, :lb])
    return pd.DataFrame(
        np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))

resultdf = cartesian_product_simplified(df1, df2)
print(resultdf)
output:
         0      1              2
0  Entity1  Sales  North America
1  Entity1  Sales         Europa
2  Entity1  Sales           Asia
3  Entity1   Cost  North America
4  Entity1   Cost         Europa
5  Entity1   Cost           Asia
as expected.
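For what it's worth, pandas 1.2 and later ship a built-in cross join that makes the helper above unnecessary:
resultdf = df1.merge(df2, how='cross')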
Btw, please provide the DataFrame as code next time, not as a screenshot or a link. It helps us save time (please check how to ask).

CSV File Transpose Column to Row in Python

I've been wrecking my head with this and I probably just need to step back.
I have a CSV file like this (dummy data - there could be 1-20 parameters):
CAR,NAME,AGE,COLOUR
Ford,Mike,45,Blue
VW,Peter,67,Yellow
And I need:
CAR,PARAMETER,VALUE
Ford,NAME,Mike
Ford,AGE,45
Ford,COLOUR,BLUE
VW,NAME,Peter
VW,AGE,67
VW,COLOUR,Yellow
I'm looking at:
How to transpose a dataset in a csv file?
Python writing a .csv file with rows and columns transpose
But I think, because I want to keep the CAR column static, the Python zip function might not hack it...
Any thoughts on this sunny Friday, gurus?
Regards!
Python - Transpose columns to rows within data operation and before writing to file
Use pandas:
df_in = pd.read_csv('infile.csv')
df_out = df_in.set_index('CAR').stack().reset_index()
df_out.columns = ['CAR', 'PARAMETER', 'VALUE']
df_out.to_csv('outfile.csv', index=False)
Input and output example:
>>> df_in
    CAR   NAME  AGE  COLOUR
0  Ford   Mike   45    Blue
1    VW  Peter   67  Yellow
>>> df_out
    CAR PARAMETER   VALUE
0  Ford      NAME    Mike
1  Ford       AGE      45
2  Ford    COLOUR    Blue
3    VW      NAME   Peter
4    VW       AGE      67
5    VW    COLOUR  Yellow
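An equivalent approach, if you prefer it, is melt; note the row order differs from stack, so sort afterwards if that matters:
df_out = df_in.melt(id_vars='CAR', var_name='PARAMETER', value_name='VALUE')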
I was able to use Python - Transpose columns to rows within data operation and before writing to file with some tweaks, and all is working well now:
import csv

with open('transposed.csv', 'wt') as destfile:
    writer = csv.writer(destfile)
    writer.writerow(['CAR', 'PARAMETER', 'VALUE'])
    with open('input.csv', 'rt') as sourcefile:
        for d in csv.DictReader(sourcefile):
            car = d.pop('CAR')  # hold the static CAR value for this record
            for parameter, value in sorted(d.items()):
                writer.writerow([car, parameter.upper(), value])

Python: Average values in a CSV file based on value of another column

I am a noob and I have a large CSV file with data structured like this (with a lot more columns):
State  daydiff
CT     5.5
CT     6.5
CT     6.25
NY     3.2
NY     3.225
PA     7.522
PA     4.25
I want to output a new CSV where the daydiff is averaged for each State like this:
State  daydiff
CT     6.083
NY     3.2125
PA     5.886
I have tried numerous ways, and the cleanest seemed to leverage pandas groupby, but when I run the code below:
import pandas as pd
df = pd.read_csv('C:...input.csv')
df.groupby('State')['daydiff'].mean()
df.to_csv('C:...AverageOutput.csv')
I get a file that is identical to the original file but with a counter added in the first column with no header:
,State,daydiff
0,CT,5.5
1,CT,6.5
2,CT,6.25
3,NY,3.2
4,NY,3.225
5,PA,7.522
6,PA,4.25
I was also hoping to limit the new average in daydiff to two decimal places. Thanks
The "problem" with the counter is because the default behaviour for to_csvis to write the index. You should do df.to_csv('C:...AverageOutput.csv', index=False).
You can control the output format of daydiff by converting it to string. df.daydiff = df.daydiff.apply(lambda x: '{:.2f}'.format(x))
Your complete code should be:
df = pd.read_csv('C:...input.csv')
df2 = df.groupby('State')['daydiff'].mean().apply(lambda x: '{:.2f}'.format(x))
df2.to_csv('C:...AverageOutput.csv')
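If you would rather keep daydiff numeric instead of converting it to strings, rounding the mean works too:
df2 = df.groupby('State')['daydiff'].mean().round(2)
df2.to_csv('C:...AverageOutput.csv')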

Python Pandas: df = pd.read_csv('olympics.csv')

I'm asking for help with the Python command df = pd.read_csv('olympics.csv'). My intention is to use pandas to read this file and determine how many countries have won more than 1 Gold medal.
Assumption: 'olympics.csv' resides in the same directory as the .py file. I tried using the entire path inside the parentheses, but that had no effect:
#('/Users/myname/temp/intro_ds/week2/olympics.csv')
The error I receive when running this file in Bash is: KeyError: 'Gold'
I'm using Python 2.7.10 on a MacBook (Unix).
CODE:
import pandas as pd
df = pd.read_csv('olympics.csv')
only_gold = df.where(df['Gold'] > 0)
print only_gold
olympics.csv has no column named Gold, Silver or Bronze when you first convert it to CSV. You have to rename the column headers, skip some unnecessary rows and make an index.
To read olympics.csv, skip rows (if you need to; it depends on your CSV formatting) and make an index on the team names:
import pandas as pd
df = pd.read_csv('olympics.csv', skiprows=1, index_col=0)
df.head()
This should give you a result that has 01!, 02! instead of Gold, Silver in the column headers.
To rename the column headers from 01!, 02! and 03! to Gold, Silver and Bronze, run the following:
for col in df.columns:
    if col[:2] == '01':
        df.rename(columns={col: 'Gold' + col[4:]}, inplace=True)
    if col[:2] == '02':
        df.rename(columns={col: 'Silver' + col[4:]}, inplace=True)
    if col[:2] == '03':
        df.rename(columns={col: 'Bronze' + col[4:]}, inplace=True)
    if col[:1] == '№':
        df.rename(columns={col: '#' + col[1:]}, inplace=True)
df.head()
Now you can make queries like:
df['Gold'] #for summer olympics Gold medals
df['Gold.1'] #for winter olympics Gold medals
df['Gold.2'] #for combined summer+winter Gold medals
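And the original goal (how many countries have won more than 1 Gold medal) then becomes a one-liner; a sketch using the summer-games column:
(df['Gold'] > 1).sum() #number of countries with more than one summer Gold medal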
Convert the Wikipedia All-time_Olympic_Games_medal_table to CSV.
