Python Pandas: df = pd.read_csv('olympics.csv')

I'm asking for help with the Python command df = pd.read_csv('olympics.csv'). My intention is to use pandas to read this file and determine how many countries have won more than 1 Gold medal.
Assumption: 'olympics.csv' resides in the same directory as the .py file. I also tried using the entire path inside the parentheses, ('/Users/myname/temp/intro_ds/week2/olympics.csv'), but that had no effect.
The error I receive when running this file in Bash is: KeyError: 'Gold'
I'm using Python 2.7.10 on a MacBook (Unix).
Code:
import pandas as pd
df = pd.read_csv('olympics.csv')
only_gold = df.where(df['Gold'] > 0)
print only_gold()

olympics.csv has no column named Gold, Silver or Bronze when you first convert the table to CSV, which is why df['Gold'] raises a KeyError. You have to rename the column headers, skip some unnecessary rows and set an index.
To read olympics.csv, skip rows (if you need to; this depends on how your CSV is formatted) and use the team names as the index.
import pandas as pd
df = pd.read_csv('olympics.csv', skiprows=1, index_col=0)
df.head()
This should give you a table whose column headers read 01!, 02! and so on instead of Gold, Silver.
To rename the column headers from 01!, 02! and 03! to Gold, Silver and Bronze, run the following:
for col in df.columns:
    if col[:2]=='01':
        df.rename(columns={col:'Gold'+col[4:]}, inplace=True)
    if col[:2]=='02':
        df.rename(columns={col:'Silver'+col[4:]}, inplace=True)
    if col[:2]=='03':
        df.rename(columns={col:'Bronze'+col[4:]}, inplace=True)
    if col[:1]=='№':
        df.rename(columns={col:'#'+col[1:]}, inplace=True)
df.head()
Now you can run queries like:
df['Gold'] #for summer olympics Gold medals
df['Gold.1'] #for winter olympics Gold medals
df['Gold.2'] #for combined summer+winter Gold medals
Convert All-time_Olympic_Games_medal_table table to csv
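Once the headers are renamed, the count asked for in the question can be sketched as below. This is only an illustration: the small DataFrame is a made-up stand-in for the real medal table (the team names and medal counts are hypothetical), and it assumes the summer gold column is called Gold as in the renaming step above.

```python
import pandas as pd

# Hypothetical stand-in for the renamed medal table; values are invented for illustration.
df = pd.DataFrame(
    {"Gold": [0, 5, 1, 2], "Silver": [0, 2, 0, 4], "Bronze": [2, 8, 0, 1]},
    index=["Team A", "Team B", "Team C", "Team D"],
)

# Boolean indexing keeps only the rows where the condition holds.
# (df.where would keep every row and fill the non-matching ones with NaN.)
more_than_one_gold = df[df["Gold"] > 1]

# The number of rows left is the number of teams with more than 1 Gold medal.
count = len(more_than_one_gold)
print(count)
```

Using df[...] instead of df.where(...) avoids having to drop the NaN rows before counting.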

Related

How to join columns in CSV files using Pandas in Python

I have a CSV file that looks something like this:
# data.csv (this line is not there in the file)
Names, Age, Names
John, 5, Jane
Rian, 29, Rath
And when I read it through Pandas in Python I get something like this:
import pandas as pd
data = pd.read_csv("data.csv")
print(data)
And the output of the program is:
Names Age Names
0 John 5 Jane
1 Rian 29 Rath
Is there any way to get:
Names Age
0 John 5
1 Rian 29
2 Jane
3 Rath
First, I'd suggest having unique names for each column. Either go into the CSV file and change the name of a column header, or do so in pandas.
Using 'Names2' as the header of the column with the second occurrence of the same column name, try this:
Starting from
datalist = [['John', 5, 'Jane'], ['Rian', 29, 'Rath']]
df = pd.DataFrame(datalist, columns=['Names', 'Age', 'Names2'])
We have
Names Age Names2
0 John 5 Jane
1 Rian 29 Rath
So, use:
dff = (pd.concat([pd.concat([df['Names'], df['Names2']], ignore_index=True),
                  df['Age']], axis=1, ignore_index=True)
       .fillna('')
       .rename(columns=dict(enumerate(['Names', 'Age']))))
to get your desired result.
From the inside out:
The inner pd.concat stacks the two name columns into one (Series.append, which did this job in older pandas, was removed in pandas 2.0).
The outer pd.concat( ... ) combines that stacked column with the Age column of the dataframe.
To discover what the other commands do, I suggest removing them one by one and looking at the results.
The expression is wrapped in parentheses and spread over several lines to make each step clear from an educational perspective.
You can use:
usecols, which reads only the selected columns.
low_memory, which controls whether the file is processed internally in chunks; setting it to False reads the whole file in one pass.
import pandas as pd
data = pd.read_csv("data.csv", usecols=['Names', 'Age'], low_memory=False)
print(data)
Please use unique column names in your CSV.
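As a rough sketch of how the layout asked for in the question could be produced straight from the file: pandas renames a repeated header on read (the second Names becomes Names.1), so the two name columns can be stacked afterwards. The in-memory csv_text below stands in for data.csv, and skipinitialspace=True is an assumption made here to strip the spaces after the commas in the sample.

```python
import io

import pandas as pd

# Stand-in for data.csv, using the sample rows from the question.
csv_text = "Names, Age, Names\nJohn, 5, Jane\nRian, 29, Rath\n"

# skipinitialspace drops the blank after each comma; the duplicate
# header is renamed automatically, giving Names, Age, Names.1.
data = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

# Split into the two blocks, giving the second name column the same label.
first = data[["Names", "Age"]]
second = data[["Names.1"]].rename(columns={"Names.1": "Names"})

# Stack the blocks; rows from the second block have no Age, so it becomes NaN.
stacked = pd.concat([first, second], ignore_index=True)
print(stacked)
```

If empty strings are preferred over NaN in the Age column, .fillna('') can be appended, as in the answer above.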

Merge multiple CSV files that share 2 columns into one unique data frame

I have multiple CSV files (around 200) in a folder that I want to merge into a single dataframe. Each file has 3 columns, of which 2 are common to all the files (Country and Year); the third column is different in each file.
For example, one file has the following columns:
Country Year X
----------------------
Mexico 2015 10
Spain 2014 6
And another file can look like this:
Country Year A
--------------------
Mexico 2015 90
Spain 2014 67
USA 2020 8
I can read these files and merge them with the following code:
import pandas as pd
x = pd.read_csv("x.csv")
a = pd.read_csv("a.csv")
# the key names must match the file headers exactly (Country, Year)
df = pd.merge(a, x, how="left", left_on=["Country", "Year"],
              right_on=["Country", "Year"], indicator=False)
And this results in the output that I want, like this:
Country Year A X
-------------------------
Mexico 2015 90 10
Spain 2014 67 6
USA 2020 8
However, my problem is repeating this process for every file. There are more than 200 of them, so I want to know if I can use a loop (or some other method) to read the files and merge them into a single dataframe.
Thank you very much, I hope I was clear enough.
Use glob like this:
import glob
print(glob.glob("/home/folder/*.csv"))
This gives all your files in a list: ['/home/folder/file1.csv', '/home/folder/file2.csv', .... ]
Now you can iterate over this list: read the first file (index 0) as your base, then loop over the rest, calling pd.read_csv() and pd.merge() on each. That should sort you out!
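The loop described above could look roughly like this. It is a sketch, not a tested solution for the real 200 files: it writes two small sample files (the rows from the question) into a temporary folder so that it runs anywhere, and it assumes every file has the Country and Year columns. The outer join is a choice made here so that rows missing from some files are kept.

```python
import glob
import os
import tempfile
from functools import reduce

import pandas as pd

# Create two sample files in a temporary folder (stand-ins for the real CSVs).
folder = tempfile.mkdtemp()
pd.DataFrame(
    {"Country": ["Mexico", "Spain"], "Year": [2015, 2014], "X": [10, 6]}
).to_csv(os.path.join(folder, "x.csv"), index=False)
pd.DataFrame(
    {"Country": ["Mexico", "Spain", "USA"], "Year": [2015, 2014, 2020], "A": [90, 67, 8]}
).to_csv(os.path.join(folder, "a.csv"), index=False)

# Collect every CSV in the folder and read each one into a DataFrame.
paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
frames = [pd.read_csv(p) for p in paths]

# Fold the list into a single frame, merging on the two shared columns.
merged = reduce(
    lambda left, right: pd.merge(left, right, on=["Country", "Year"], how="outer"),
    frames,
)
print(merged)
```

functools.reduce takes the first frame as the base and merges each following frame into the running result, which is the same idea as looping from the second file onwards.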
Try this:
import os
import pandas as pd
# update this to the path that contains your .csv files
path = '.'
# get the files in path that end with .csv
dir_list = [file for file in os.listdir(path) if file.endswith('.csv')]
# start with an empty list
df_list = []
# simple for loop with try/except that skips files that raise errors when read
for file in dir_list:
    try:
        # index every frame on Country and Year so pd.concat can align them later
        df_list.append(pd.read_csv(os.path.join(path, file)).set_index(['Country', 'Year']))
    except Exception:  # narrow this to whatever errors pd.read_csv() throws
        pass
# axis=1 places the frames side by side, aligned on the shared index
concatted = pd.concat(df_list, axis=1)

How to fix Excel import and compare error?

I'm comparing two Excel files and want to write the matches to a new file using some filters.
If there is a match on Make, Model, Modification and Horse Power, and the year from "WheelSizeFullDB_new" falls within the Year Start to Year Stop range of "sql-autobaza", then I want to create a new file containing all the columns of "sql-autobaza" plus the last 2 columns from "WheelSizeFullDB_new", Tire Size Front and Tire Size Back.
Download files:
sql_base : drive.google.com/open?id=1Dk_1q9n5RgKFRawT7qBwyMY4ldGUL0fb
sab_base : drive.google.com/file/d/1AewxBR9p0Tgxi2i-iXS_9RDCd90hsA4G
import pandas as pd
import re
sab_base = pd.read_excel('C:\\Users\\x\\Desktop\\Reziko\\Programming\\Visual Studio 2019\\WheelSizeFullDB_new.xlsx')
sql_base = pd.read_excel('C:\\Users\\x\\Desktop\\Reziko\\Programming\\Visual Studio 2019\\sql-autobaza.xlsx')
sqlbase = sql_base.loc[
    (sql_base['Make'].str.contains('%s[a-z]*'%sab_base['Make'], flags=re.I, regex=True)) &
    (sql_base['Model'].str.contains('%s[a-z]*'%sab_base['Model'], flags=re.I, regex=True)) &
    (sql_base['Modification'].str.contains('%s[a-z]*'%sab_base['Modification'], flags=re.I, regex=True)) &
    (sql_base['Horse Power'].str.contains('%s[a-z]*'%sab_base['Horse Power'], flags=re.I, regex=True)) &
    (sql_base['Year Start'] < sab_base['Year']) &
    (sql_base['Year Stop'] > sab_base['Year'])
]
print(sqlbase)
sqlbase.to_excel('sab_base_update.xlsx', index=False)
I expect to create a new file and add to it all the columns of "sql-autobaza" plus the two last columns from "WheelSizeFullDB_new", Tire Size Front and Tire Size Back, but my code does not work.
The best method is to press Ctrl + F (the Find function) and then select the tab that says Replace. Type "#REF!" in the Find field and leave the Replace field empty, then press Replace All. This will remove any #REF Excel errors from formulas and thus fix the problem.
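For the matching described in the question itself, one possible approach is to merge the two tables on the four shared columns and then filter on the year range. The sketch below is only an illustration under several assumptions: the two small DataFrames are hypothetical stand-ins for the spreadsheets (real data would come from pd.read_excel), the keys are matched exactly rather than by regex, and the year range is treated as inclusive, unlike the strict comparisons in the question's code.

```python
import pandas as pd

# Hypothetical stand-ins for sql-autobaza and WheelSizeFullDB_new; values are invented.
sql_base = pd.DataFrame({
    "Make": ["MakeA", "MakeA"],
    "Model": ["ModelX", "ModelX"],
    "Modification": ["2.0", "2.0"],
    "Horse Power": [140, 140],
    "Year Start": [2004, 2008],
    "Year Stop": [2007, 2011],
})
sab_base = pd.DataFrame({
    "Make": ["MakeA"],
    "Model": ["ModelX"],
    "Modification": ["2.0"],
    "Horse Power": [140],
    "Year": [2006],
    "Tire Size Front": ["205/55 R16"],
    "Tire Size Back": ["205/55 R16"],
})

# Pair up rows that agree on all four key columns.
keys = ["Make", "Model", "Modification", "Horse Power"]
merged = sql_base.merge(sab_base, on=keys, how="inner")

# Keep only the pairs whose Year falls inside the Year Start..Year Stop range.
in_range = merged[
    (merged["Year Start"] <= merged["Year"]) & (merged["Year"] <= merged["Year Stop"])
]

# All sql-autobaza columns plus the two tire size columns.
result = in_range[list(sql_base.columns) + ["Tire Size Front", "Tire Size Back"]]
print(result)
```

The filtered frame could then be written out with result.to_excel('sab_base_update.xlsx', index=False), as in the question.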

Python: Average values in a CSV file based on value of another column

I am a noob and I have a large CSV file with data structured like this (with a lot more columns):
State daydiff
CT 5.5
CT 6.5
CT 6.25
NY 3.2
NY 3.225
PA 7.522
PA 4.25
I want to output a new CSV where the daydiff is averaged for each State like this:
State daydiff
CT 6.083
NY 3.2125
PA 5.886
I have tried numerous approaches, and the cleanest seemed to be pandas groupby, but when I run the code below:
import pandas as pd
df = pd.read_csv('C:...input.csv')
df.groupby('State')['daydiff'].mean()
df.to_csv('C:...AverageOutput.csv')
I get a file that is identical to the original file but with a counter added in the first column with no header:
,State,daydiff
0,CT,5.5
1,CT,6.5
2,CT,6.25
3,NY,3.2
4,NY,3.225
5,PA,7.522
6,PA,4.25
I was also hoping to limit the new average in daydiff to two decimal places (hundredths). Thanks
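The file is identical to the original because the result of groupby(...).mean() is never assigned, so df itself is unchanged when it is written out. The "problem" with the counter arises because the default behaviour of to_csv is to write the index. For an ungrouped frame you would use df.to_csv('C:...AverageOutput.csv', index=False); for the grouped result, State is the index, so it should be kept.
You can control the output format of daydiff by converting it to a string: df.daydiff = df.daydiff.apply(lambda x: '{:.2f}'.format(x))
Your complete code should be:
import pandas as pd
df = pd.read_csv('C:...input.csv')
# assign the grouped result and format each mean to two decimals
df2 = df.groupby('State')['daydiff'].mean().apply(lambda x: '{:.2f}'.format(x))
df2.to_csv('C:...AverageOutput.csv')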
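A variant that keeps daydiff numeric instead of converting it to a string might look like the sketch below. It reads the sample rows from the question out of an in-memory string (a stand-in for the input file, whose real path is not shown) and writes the CSV to another in-memory buffer.

```python
import io

import pandas as pd

# Sample rows from the question, standing in for the input file.
csv_text = """State,daydiff
CT,5.5
CT,6.5
CT,6.25
NY,3.2
NY,3.225
PA,7.522
PA,4.25
"""
df = pd.read_csv(io.StringIO(csv_text))

# Average per state, round to two decimals, and turn State back into a column.
averages = df.groupby("State")["daydiff"].mean().round(2).reset_index()

# State is now an ordinary column, so the index can be left out of the file.
out = io.StringIO()
averages.to_csv(out, index=False)
print(out.getvalue())
```

Because State is restored as a column with reset_index(), index=False drops only the counter and not the state labels.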

Bash - Date manipulation and join

I have two CSV files that I would like to merge using the DATE (CSV 1) and pickup_datetime (CSV 2).
CSV 1: Weather.csv (45KB ~ 365 rows)
head -3 Weather.csv
STATION,STATION_NAME,ELEVATION,LATITUDE,LONGITUDE,DATE,PRCP,SNWD,SNOW,TMAX,TMIN,AWND,WDF2,WSF2
GHCND:USW00094728,NEW YORK CENTRAL PARK OBS BELVEDERE TOWER NY US,39.6,40.77889,-73.96917,20130101,0,0,0,44,-33,31,310,67
GHCND:USW00094728,NEW YORK CENTRAL PARK OBS BELVEDERE TOWER NY US,39.6,40.77889,-73.96917,20130102,0,0,0,6,-56,26,310,67
CSV 2: Final_Data_1.csv (250MB ~ 1.5M rows)
head -3 final_data_1.csv
medallion,hack_license,vendor_id_x,rate_code,store_and_fwd_flag,pickup_datetime,dropoff_datetime,passenger_count,trip_time_in_secs,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,vendor_id_y,payment_type,fare_amount,surcharge,mta_tax,tip_amount,tolls_amount,total_amount
DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,1,N,2013-01-01 23:54:15,2013-01-01 23:58:20,2,244,0.7,-73.974602,40.759945,-73.984734,40.759388,CMT,CSH,5.0,0.5,0.5,0.0,0.0,6.0
237F49C3ECC11F5024B254268F054384,93C363DDF8ED9385D65FAD07CE3F5F07,CMT,1,N,2013-01-01 07:35:47,2013-01-01 07:46:00,1,612,2.3,-73.98850999999999,40.774307,-73.981094,40.755325,CMT,CSH,10.0,0.0,0.5,0.0,0.0,10.5
How do I manipulate the date column in both CSV files and merge them to get one file, with the columns of Final_Data_1.csv coming before those of Weather.csv?
You definitely don't want to be using Bash for this. A good way in Python would be to use pandas, something like this:
import pandas as pd
df1 = pd.read_csv('weather.csv')
df2 = pd.read_csv('final.csv')
# format the date columns so they match up (for example, create a 'date_formatted' column in both frames)
df3 = pd.merge(df2,df1, on='date_formatted')
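The date-formatting step left as a comment above could be filled in along these lines. This is a sketch under assumptions: the in-memory strings hold only a few of the columns from the question's samples (with shortened medallion values), and date_formatted is the helper column name taken from the answer's merge call.

```python
import io

import pandas as pd

# Cut-down stand-ins for Weather.csv and Final_Data_1.csv (subset of columns).
weather_csv = "DATE,TMAX,TMIN\n20130101,44,-33\n20130102,6,-56\n"
trips_csv = (
    "medallion,pickup_datetime,fare_amount\n"
    "DFD2,2013-01-01 23:54:15,5.0\n"
    "237F,2013-01-01 07:35:47,10.0\n"
)
weather = pd.read_csv(io.StringIO(weather_csv))
trips = pd.read_csv(io.StringIO(trips_csv))

# DATE is stored as an integer like 20130101; parse it as year-month-day.
weather["date_formatted"] = pd.to_datetime(weather["DATE"].astype(str), format="%Y%m%d")

# pickup_datetime carries a time of day; normalize() resets it to midnight.
trips["date_formatted"] = pd.to_datetime(trips["pickup_datetime"]).dt.normalize()

# Left merge with trips first, so the trip columns come before the weather columns.
merged = pd.merge(trips, weather, on="date_formatted", how="left")
print(merged)
```

A left join keeps every trip even when no weather row exists for that day; how="inner" would drop such trips instead.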
