Querying CSV files in Python like SQL

This is apparently a popular interview question.
There are 2 CSV files with dinosaur data. We need to query them to return dinosaurs satisfying a certain condition.
Note: we cannot use additional modules like q, fsql, csvkit, etc.
file1.csv:
NAME,LEG_LENGTH,DIET
Hadrosaurus,1.2,herbivore
Struthiomimus,0.92,omnivore
Velociraptor,1.0,carnivore
Triceratops,0.87,herbivore
Euoplocephalus,1.6,herbivore
Stegosaurus,1.40,herbivore
Tyrannosaurus Rex,2.5,carnivore
file2.csv:
NAME,STRIDE_LENGTH,STANCE
Euoplocephalus,1.87,quadrupedal
Stegosaurus,1.90,quadrupedal
Tyrannosaurus Rex,5.76,bipedal
Hadrosaurus,1.4,bipedal
Deinonychus,1.21,bipedal
Struthiomimus,1.34,bipedal
Velociraptor,2.72,bipedal
Using the formula:
speed = ((STRIDE_LENGTH / LEG_LENGTH) - 1) * SQRT(LEG_LENGTH * g), where g = 9.8 m/s^2
write a program to read the CSV files and print only the names of bipedal dinosaurs, sorted by speed from fastest to slowest.
In SQL, this would be simple:
select f2.name from
file1 f1 join file2 f2 on f1.name = f2.name
where f2.stance = 'bipedal'
order by (f2.stride_length/f1.leg_length - 1)*pow(f1.leg_length*9.8,0.5) desc
How can this be done in Python?

You can do it in pandas:
import pandas as pd
df_1 = pd.read_csv('file1.csv')
df_2 = pd.read_csv('file2.csv')
# join on NAME and keep only the bipedal dinosaurs
df_comb = df_1.join(df_2.set_index('NAME'), on='NAME')
df_comb = df_comb.loc[df_comb.STANCE == 'bipedal']
# speed = ((STRIDE_LENGTH / LEG_LENGTH) - 1) * sqrt(LEG_LENGTH * g)
df_comb['SPEED'] = (df_comb.STRIDE_LENGTH / df_comb.LEG_LENGTH - 1) * (df_comb.LEG_LENGTH * 9.8) ** 0.5
# names only, fastest first
print(df_comb.sort_values('SPEED', ascending=False)['NAME'].to_string(index=False))
Not as clean as SQL!

You can write SQL in Python using pandasql.
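For illustration, a minimal pandasql sketch might look like the following (note that POW() needs an SQLite build with math functions; otherwise precompute the speed column in pandas first):
from pandasql import sqldf
import pandas as pd

file1 = pd.read_csv('file1.csv')
file2 = pd.read_csv('file2.csv')
query = """
    select f2.NAME
    from file1 f1 join file2 f2 on f1.NAME = f2.NAME
    where f2.STANCE = 'bipedal'
    order by (f2.STRIDE_LENGTH / f1.LEG_LENGTH - 1) * pow(f1.LEG_LENGTH * 9.8, 0.5) desc
"""
print(sqldf(query, locals()))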

def csvtable(file):  # Read CSV file into 2-D dictionary
    table = {}
    f = open(file)
    columns = f.readline().strip().split(',')  # Get column names
    for line in f.readlines():
        values = line.strip().split(',')  # Get current row
        for column, value in zip(columns, values):
            if column == 'NAME':  # table['TREX'] = {}
                key = value
                table[key] = {}
            else:
                table[key][column] = value  # table['TREX']['LENGTH'] = 10
    f.close()
    return table

# READ
try:
    table1 = csvtable('csv1.txt')
    table2 = csvtable('csv2.txt')
except Exception as e:
    print(e)

# JOIN, FILTER & COMPUTE
table3 = {}
for value in table1.keys():
    # Join both tables on key (NAME) and filter (STANCE)
    if value in table2.keys() and table2[value]['STANCE'] == 'bipedal':
        leg_length = float(table1[value]['LEG_LENGTH'])
        stride_length = float(table2[value]['STRIDE_LENGTH'])
        speed = ((stride_length / leg_length) - 1) * pow((leg_length * 9.8), 0.5)  # Compute SPEED
        table3[value] = speed

# SORT
result = sorted(table3, key=lambda x: table3[x], reverse=True)  # Sort descending by value

# WRITE
try:
    f = open('result.txt', 'w')
    for r in result:
        f.write('%s\n' % r)
    f.close()
except Exception as e:
    print(e)
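If actual SQL syntax is wanted and only the standard library is allowed, another option is to load both CSVs into an in-memory SQLite database and run essentially the original query there. A minimal sketch of that idea (not part of the answer above; the speed is computed in Python to avoid depending on SQLite math functions):
import csv
import sqlite3

def load(cur, table, path):
    # create a table whose columns match the CSV header and bulk-insert the rows
    with open(path, newline='') as f:
        rows = list(csv.reader(f))
    header, data = rows[0], rows[1:]
    cur.execute('CREATE TABLE %s (%s)' % (table, ','.join(header)))
    cur.executemany('INSERT INTO %s VALUES (%s)' % (table, ','.join('?' * len(header))), data)

con = sqlite3.connect(':memory:')
cur = con.cursor()
load(cur, 'file1', 'file1.csv')
load(cur, 'file2', 'file2.csv')

cur.execute("""
    SELECT f2.NAME, f1.LEG_LENGTH, f2.STRIDE_LENGTH
    FROM file1 f1 JOIN file2 f2 ON f1.NAME = f2.NAME
    WHERE f2.STANCE = 'bipedal'
""")
# speed = ((STRIDE_LENGTH / LEG_LENGTH) - 1) * sqrt(LEG_LENGTH * 9.8), fastest first
for name, leg, stride in sorted(cur.fetchall(),
                                key=lambda r: (float(r[2]) / float(r[1]) - 1) * (float(r[1]) * 9.8) ** 0.5,
                                reverse=True):
    print(name)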

I've encountered the same problem at work and decided to build an offline desktop app where you can load CSVs and start writing SQL. You can join, group by, and so on.
It's backed by C and SQLite and can handle gigabytes of CSV files in about 10 seconds, so it's very fast.
Here's the app: https://superintendent.app/
It's not Python, but it is a lot more convenient to use.

Related

Find duplicate data from CSV and merge them in a specific schema

Context:
For video editing purposes, I receive CSVs with shot name, start and end timecode.
The interesting CSV data is formatted this way:
the source in/out needs to be converted to a frame range (like 1-24),
so basically a shot is extracted as sequence('shot','framerange'), e.g. SQ010,('010','150-1000').
A sequence may or may not have multiple shots, and the same shot can appear multiple times with similar or different source in & out. An edit (a CSV) can have multiple sequences.
The problem:
1/ I'm not able to merge duplicate shots with overlapping frame ranges in the expected way, that is:
[ ('SQ010', ('010', '0-81'),('010', '10-250') ), ('SQ020', (...) ) ]
and if the cuts of the same shot don't overlap, the result must be:
[ ('SQ010', ('050', '0-65,70-81'),('070', '10-250') ), ('SQ020', (...) ) ]
2/ When I have duplicate shots, I can't run the downstream scripts we have, as they expect unique shots. So I can't (for now) change the expected result formatting.
3/ For some reason, some 'None' values can appear in the list because of duplicate shots with my code. I don't really understand how to avoid them.
What I can do for now:
I'm able to extract sequences, shots and timecodes and convert them, but not to merge duplicate shots' timecodes.
import csv
import re
import sys
import os
from itertools import groupby  # needed for the grouping step below

with open(writeClean, 'r') as clearSL:
    reader = csv.DictReader(clearSL)
    data = {}
    for row in reader:
        for header, value in row.items():
            try:
                data[header].append(value)
            except KeyError:
                data[header] = [value]

# extract from CSV
seqLine = []
shotLine = []
for i in data['Name']:
    if i.startswith('SQ'):
        seq, shot = i.split('_')
        seqLine.append(seq)
        shotLine.append(shot)
    else:
        pass

fInLine = []
fOutLine = []
# read framerate from the first usable shot  TODO: do it per shot for precision
framerate = int(float(list(set(data['Source FPS']))[0]))
# convert timecode to frames
for TC in data['Source In']:
    inTCtoFrame = int(round(tcToFrames(TC, framerate)))
    fInLine.append(inTCtoFrame)
for TC in data['Source Out']:
    outTCtoFrame = int(round(tcToFrames(TC, framerate)))
    fOutLine.append(outTCtoFrame)

fRange = ["{}-{}".format(*i) for i in zip(fInLine, fOutLine)]  # merge first frame and last frame in string format usable by GCC
shotList = list(map(lambda x, y: (x, y), shotLine, fRange))  # merge shot name with frame range
#TEST - print(shotList)
seqInfo = map(lambda x, y: (x, y), seqLine, shotList)  # merge sequence and shot
#TEST - print(seqInfo)
seqClean = list(set(seqInfo))  # remove duplicates with set()
seqClean.sort()  # order all
#TEST - print(seqClean)
seqList = [(i,) + tuple(i[1] for i in e) for i, e in groupby(seqClean, lambda x: x[0])]  # group shots per sequence
#TEST - print(seqList)
This gives me this type of result:
[('SQ010', ('050', '0-81'), ('050', '0-65'),(...)]
If it helps, I can provide a full CSV.
Thank you.
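For the merging step specifically, one common approach is to turn each 'start-end' string back into integers, sort the cuts of a given shot, and fold together any that overlap. A rough sketch of that idea (merge_ranges is a hypothetical helper, not part of the script above):
def merge_ranges(ranges):
    # merge overlapping 'start-end' cuts; keep non-overlapping cuts comma-separated
    cuts = sorted(tuple(map(int, r.split('-'))) for r in ranges)
    merged = [list(cuts[0])]
    for start, end in cuts[1:]:
        if start <= merged[-1][1]:      # this cut overlaps (or touches) the previous one
            merged[-1][1] = max(merged[-1][1], end)
        else:                           # disjoint cut, keep it separate
            merged.append([start, end])
    return ",".join("{}-{}".format(s, e) for s, e in merged)

# merge_ranges(['0-81', '10-250']) -> '0-250'
# merge_ranges(['0-65', '70-81'])  -> '0-65,70-81'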

Looking for a more elegant and sophisticated solution when multiple ifs and for-loops are used

I am a beginner/intermediate user working with Python, and when I write elaborate code (at least for me), I always try to rewrite it to reduce the number of lines where possible.
Here is the code I have written.
It basically reads all values of one data frame looking for a specific string; if the string is found, it saves the index and value in a dictionary and drops the rows where the string was found. Then it does the same with the next string...
##### Reading CSV file values and looking for variant IDs ######
# Find Variant ID (rs000000) in CSV
# \d+ is necessary in case the line finds rs + something; rs\d+ looks for rs + numbers
rs = df_draft[df_draft.apply(lambda x: x.str.contains(r"rs\d+"))].dropna(how='all').dropna(axis=1, how='all')
# Now, we save the results found in a dict: key=index and value=variant ID
if rs.empty == False:
    ind = rs.index.to_list()
    vals = list(rs.stack().values)
    row2rs = dict(zip(ind, vals))
    print(row2rs)
    # We need to remove the rows where rs has been found,
    # because if more than one ID variant is found in the same row (i.e. rs# and NM_#)
    # this code is going to get the same variant more than once.
    for index, rs in row2rs.items():
        # Rows where substring 'rs' has been found need to be deleted to avoid repetition
        # This will be done in df_draft
        df_draft = df_draft.drop(index)

## Same thing with other ID variants
# Here with Variant ID (NM_0000000) in CSV
NM = df_draft[df_draft.apply(lambda x: x.str.contains(r"NM_\d+"))].dropna(how='all').dropna(axis=1, how='all')
if NM.empty == False:
    ind = NM.index.to_list()
    vals = list(NM.stack().values)
    row2NM = dict(zip(ind, vals))
    print(row2NM)
    for index, NM in row2NM.items():
        df_draft = df_draft.drop(index)

# Here with Variant ID (NP_0000000) in CSV
NP = df_draft[df_draft.apply(lambda x: x.str.contains(r"NP_\d+"))].dropna(how='all').dropna(axis=1, how='all')
if NP.empty == False:
    ind = NP.index.to_list()
    vals = list(NP.stack().values)
    row2NP = dict(zip(ind, vals))
    print(row2NP)
    for index, NP in row2NP.items():
        df_draft = df_draft.drop(index)

# Here with ClinVar field (RCV#) in CSV
RCV = df_draft[df_draft.apply(lambda x: x.str.contains(r"RCV\d+"))].dropna(how='all').dropna(axis=1, how='all')
if RCV.empty == False:
    ind = RCV.index.to_list()
    vals = list(RCV.stack().values)
    row2RCV = dict(zip(ind, vals))
    print(row2RCV)
    for index, RCV in row2RCV.items():
        df_draft = df_draft.drop(index)
I was wondering about a more elegant way of writing this simple but long code.
I have been thinking of sa
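For what it's worth, one common way to shrink this kind of repetition is to loop over the patterns themselves; a rough sketch of that idea (the patterns list and results dict are illustrative names, not from the original code):
patterns = [r"rs\d+", r"NM_\d+", r"NP_\d+", r"RCV\d+"]
results = {}
for pat in patterns:
    found = df_draft[df_draft.apply(lambda x: x.str.contains(pat))].dropna(how='all').dropna(axis=1, how='all')
    if not found.empty:
        # same index -> value mapping as the per-pattern dicts above
        results[pat] = dict(zip(found.index.to_list(), found.stack().values))
        df_draft = df_draft.drop(found.index)  # drop matched rows before trying the next pattern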

Python - Pandas library returns wrong column values after parsing a CSV file

SOLVED: I found the solution myself. It turns out that when you want to retrieve specific columns by their names, you should pass the names in the order they appear inside the CSV (which is really stupid for a library that is intended to save some parsing time for a developer, IMO). Correct me if I am wrong, but I don't see an option to get a specific column's values by its name if the columns are in a different order...
I am trying to read a comma-separated value file with Python and then
parse it using the Pandas library. Since the file has many values (columns) that are not needed, I make a list of the column names I do need.
Here's a look at the CSV file format.
Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,Attendance,Referee,HS,AS,HST,AST,HHW,AHW,HC,AC,HF,AF,HO,AO,HY,AY,HR,AR,HBP,ABP,GBH,GBD,GBA,IWH,IWD,IWA,LBH,LBD,LBA,SBH,SBD,SBA,WHH,WHD,WHA
E0,19/08/00,Charlton,Man City,4,0,H,2,0,H,20043,Rob Harris,17,8,14,4,2,1,6,6,13,12,8,6,1,2,0,0,10,20,2,3,3.2,2.2,2.9,2.7,2.2,3.25,2.75,2.2,3.25,2.88,2.1,3.2,3.1
E0,19/08/00,Chelsea,West Ham,4,2,H,1,0,H,34914,Graham Barber,17,12,10,5,1,0,7,7,19,14,2,3,1,2,0,0,10,20,1.47,3.4,5.2,1.6,3.2,4.2,1.5,3.4,6,1.5,3.6,6,1.44,3.6,6.5
E0,19/08/00,Coventry,Middlesbrough,1,3,A,1,1,D,20624,Barry Knight,6,16,3,9,0,1,8,4,15,21,1,3,5,3,1,0,75,30,2.15,3,3,2.2,2.9,2.7,2.25,3.2,2.75,2.3,3.2,2.75,2.3,3.2,2.62
E0,19/08/00,Derby,Southampton,2,2,D,1,2,A,27223,Andy D'Urso,6,13,4,6,0,0,5,8,11,13,0,2,1,1,0,0,10,10,2,3.1,3.2,1.8,3,3.5,2.2,3.25,2.75,2.05,3.2,3.2,2,3.2,3.2
E0,19/08/00,Leeds,Everton,2,0,H,2,0,H,40010,Dermot Gallagher,17,12,8,6,0,0,6,4,21,20,6,1,1,3,0,0,10,30,1.65,3.3,4.3,1.55,3.3,4.5,1.55,3.5,5,1.57,3.6,5,1.61,3.5,4.5
E0,19/08/00,Leicester,Aston Villa,0,0,D,0,0,D,21455,Mike Riley,5,5,4,3,0,0,5,4,12,12,1,4,2,3,0,0,20,30,2.15,3.1,2.9,2.3,2.9,2.5,2.35,3.2,2.6,2.25,3.25,2.75,2.4,3.25,2.5
E0,19/08/00,Liverpool,Bradford,1,0,H,0,0,D,44183,Paul Durkin,16,3,10,2,0,0,6,1,8,8,5,0,1,1,0,0,10,10,1.25,4.1,7.2,1.25,4.3,8,1.35,4,8,1.36,4,8,1.33,4,8
This list is passed to pandas.read_csv()'s names parameter.
See code.
# Returns an array of the column names needed for our raw data table
def cols_to_extract():
    cols_to_use = [None] * RawDataCols.COUNT
    cols_to_use[RawDataCols.DATE] = 'Date'
    cols_to_use[RawDataCols.HOME_TEAM] = 'HomeTeam'
    cols_to_use[RawDataCols.AWAY_TEAM] = 'AwayTeam'
    cols_to_use[RawDataCols.FTHG] = 'FTHG'
    cols_to_use[RawDataCols.HG] = 'HG'
    cols_to_use[RawDataCols.FTAG] = 'FTAG'
    cols_to_use[RawDataCols.AG] = 'AG'
    cols_to_use[RawDataCols.FTR] = 'FTR'
    cols_to_use[RawDataCols.RES] = 'Res'
    cols_to_use[RawDataCols.HTHG] = 'HTHG'
    cols_to_use[RawDataCols.HTAG] = 'HTAG'
    cols_to_use[RawDataCols.HTR] = 'HTR'
    cols_to_use[RawDataCols.ATTENDANCE] = 'Attendance'
    cols_to_use[RawDataCols.HS] = 'HS'
    cols_to_use[RawDataCols.AS] = 'AS'
    cols_to_use[RawDataCols.HST] = 'HST'
    cols_to_use[RawDataCols.AST] = 'AST'
    cols_to_use[RawDataCols.HHW] = 'HHW'
    cols_to_use[RawDataCols.AHW] = 'AHW'
    cols_to_use[RawDataCols.HC] = 'HC'
    cols_to_use[RawDataCols.AC] = 'AC'
    cols_to_use[RawDataCols.HF] = 'HF'
    cols_to_use[RawDataCols.AF] = 'AF'
    cols_to_use[RawDataCols.HFKC] = 'HFKC'
    cols_to_use[RawDataCols.AFKC] = 'AFKC'
    cols_to_use[RawDataCols.HO] = 'HO'
    cols_to_use[RawDataCols.AO] = 'AO'
    cols_to_use[RawDataCols.HY] = 'HY'
    cols_to_use[RawDataCols.AY] = 'AY'
    cols_to_use[RawDataCols.HR] = 'HR'
    cols_to_use[RawDataCols.AR] = 'AR'
    return cols_to_use
# Extracts raw data from the raw data csv and populates the raw match data table in the database
def extract_raw_data(csv):
    # Clear the database table if it has any logs
    # if MatchRawData.objects.count != 0:
    #     MatchRawData.objects.delete()
    cols_to_use = cols_to_extract()
    # Read and parse the csv file
    parsed_csv = pd.read_csv(csv, delimiter=',', names=cols_to_use, header=0)
    for col in cols_to_use:
        values = parsed_csv[col].values
        for val in values:
            print(str(col) + ' --------> ' + str(val))
Where RawDataCols is an IntEnum.
class RawDataCols(IntEnum):
    DATE = 0
    HOME_TEAM = 1
    AWAY_TEAM = 2
    FTHG = 3
    HG = 4
    FTAG = 5
    AG = 6
    FTR = 7
    RES = 8
    ...
The column names are obtained using it. That part of the code works OK. The correct column name is obtained, but after trying to get its values using
values = parsed_csv[col].values
pandas returns the values of the wrong column. The wrong column index is around 13 indexes away from the one I am trying to get. What am I missing?
You can select columns by name. Just use the following line:
values = parsed_csv[["Column Name", "Column Name2"]]
Or you can select them by index:
cols = [1, 2, 3, 4]
values = parsed_csv[parsed_csv.columns[cols]]
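As a side note on the original problem: passing names= to read_csv relabels the columns positionally, so if the list is not in the file's order (or contains names the file doesn't have) the labels land on the wrong columns. If the goal is simply to keep certain columns by header name, usecols is probably closer to what was wanted; a small sketch (the file name is illustrative):
import pandas as pd

wanted = ['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR']  # subset for illustration
parsed_csv = pd.read_csv('matches.csv', usecols=wanted)  # selected by header name, file order does not matter
print(parsed_csv['HomeTeam'].values)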

Need to get the predicted values into a csv or Sframe or Dataframe in Python

Prerequisites
The dataset I'm working with is MovieLens 100k.
The Python packages I'm using are Surprise, io and Pandas.
The agenda is to test the recommendation system using KNN (+ K-Fold) with the cosine and Pearson similarity measures, for both user-based CF and item-based CF.
Briefing
So far, I have coded both UBCF & IBCF as below.
Q1. IBCF generates data as per the input given to it; I need it to export a CSV file since I need to find out the predicted values.
Q2. UBCF needs each data point entered separately and doesn't work even with the code immediately below:
csvfile = 'pred_matrix.csv'
with open(csvfile, "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    # algo.predict(user_id, item_id, estimated_ratings)
    for val in algo.predict(str(range(1, 943)), range(1, 1683), 1):
        writer.writerow([val])
Clearly it throws an error about lists, as they cannot be comma separated.
Q3. Getting precision & recall on the evaluated and recommended values.
CODE STARTS WITH:
if ip == 1:
    one = 'cosine'
else:
    one = 'pearson'
choice = raw_input("Filtering Method: \n1.User based \n2.Item based \n Choice:")
if choice == '1':
    user_based_cf(one)
elif choice == '2':
    item_based_cf(one)
else:
    sim_op = {}
    exit(0)
UBCF:
def user_based_cf(co_pe):
    # INITIALIZE REQUIRED PARAMETERS
    path = '/home/mister-t/Projects/PycharmProjects/RecommendationSys/ml-100k/u.user'
    prnt = "USER"
    sim_op = {'name': co_pe, 'user_based': True}
    algo = KNNBasic(sim_options=sim_op)

    # RESPONSIBLE TO EXECUTE DATA SPLITS Mentioned in STEP 4
    perf = evaluate(algo, df, measures=['RMSE', 'MAE'])
    print_perf(perf)
    print type(perf)

    # START TRAINING
    trainset = df.build_full_trainset()

    # APPLYING ALGORITHM KNN Basic
    res = algo.train(trainset)
    print "\t\t >>>TRAINED SET<<<<\n\n", res

    # PEEKING PREDICTED VALUES
    search_key = raw_input("Enter User ID:")
    item_id = raw_input("Enter Item ID:")
    actual_rating = input("Enter actual Rating:")
    print algo.predict(str(search_key), item_id, actual_rating)
IBCF:
def item_based_cf(co_pe):
    # INITIALIZE REQUIRED PARAMETERS
    path = '/location/ml-100k/u.item'
    prnt = "ITEM"
    sim_op = {'name': co_pe, 'user_based': False}
    algo = KNNBasic(sim_options=sim_op)

    # RESPONSIBLE TO EXECUTE DATA SPLITS = 2
    perf = evaluate(algo, df, measures=['RMSE', 'MAE'])
    print_perf(perf)
    print type(perf)

    # START TRAINING
    trainset = df.build_full_trainset()

    # APPLYING ALGORITHM KNN Basic
    res = algo.train(trainset)
    print "\t\t >>>TRAINED SET<<<<\n\n", res

    # Read the mappings raw id <-> movie name
    rid_to_name, name_to_rid = read_item_names(path)

    search_key = raw_input("ID:")
    print "ALGORITHM USED : ", one
    toy_story_raw_id = name_to_rid[search_key]
    toy_story_inner_id = algo.trainset.to_inner_iid(toy_story_raw_id)

    # Retrieve inner ids of the nearest neighbors of Toy Story.
    k = 5
    toy_story_neighbors = algo.get_neighbors(toy_story_inner_id, k=k)

    # Convert inner ids of the neighbors into names.
    toy_story_neighbors = (algo.trainset.to_raw_iid(inner_id)
                           for inner_id in toy_story_neighbors)
    toy_story_neighbors = (rid_to_name[rid]
                           for rid in toy_story_neighbors)

    print 'The ', k, ' nearest neighbors of ', search_key, ' are:'
    for movie in toy_story_neighbors:
        print(movie)
Q1. IBCF generates data as per the input given to it; I need it to export a CSV file since I need to find out the predicted values.
The easiest way to dump anything to a CSV would be to use the csv module!
import csv

res = [x, y, z, ....]
csvfile = "<path to output csv or txt>"

# Assuming res is a flat list
with open(csvfile, "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    for val in res:
        writer.writerow([val])

# Assuming res is a list of lists
with open(csvfile, "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    writer.writerows(res)
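On Q2 specifically: Surprise's predict() takes a single (user, item) pair at a time, so the ranges have to be looped over rather than passed in whole. A rough sketch under that assumption, reusing the trained algo from the question (the output file name is illustrative):
import csv

with open('pred_matrix.csv', 'w') as output:
    writer = csv.writer(output, lineterminator='\n')
    writer.writerow(['uid', 'iid', 'est'])
    for uid in range(1, 943):
        for iid in range(1, 1683):
            # predict() returns a Prediction namedtuple with the estimated rating in .est
            pred = algo.predict(str(uid), str(iid))
            writer.writerow([pred.uid, pred.iid, pred.est])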

Pandas optimization for multiple records

I have a file with around 500K records.
Each record needs to be validated.
Records are de-duplicated and stored in a list:
with open(filename) as f:
    records = f.readlines()
The validation file I use is stored in a Pandas DataFrame.
This DataFrame contains around 80K records and 9 columns (myfile.csv).
filename = 'myfile.csv'
df = pd.read_csv(filename)
def check(df, destination):
    try:
        area_code = destination[:3]
        office_code = destination[3:6]
        subscriber_number = destination[6:]
        if any(df['AREA_CODE'].astype(int) == area_code):
            area_code_numbers = df[df['AREA_CODE'] == area_code]
            if any(area_code_numbers['OFFICE_CODE'].astype(int) == office_code):
                matching_records = area_code_numbers[area_code_numbers['OFFICE_CODE'].astype(int) == office_code]
                start = subscriber_number >= matching_records['SUBSCRIBER_START']
                end = subscriber_number <= matching_records['SUBSCRIBER_END']
                # Perform intersection
                record_found = matching_records[start & end]['LABEL'].to_string(index=False)
                # We should return only 1 value
                if len(record_found) > 0:
                    return record_found
                else:
                    return 'INVALID_SUBSCRIBER'
            else:
                return 'INVALID_OFFICE_CODE'
        else:
            return 'INVALID_AREA_CODE'
    except KeyError:
        pass
    except Exception:
        pass
I'm looking for a way to improve the comparisons, as when I run it, it just hangs. If I run it with a small subset (10K) it works fine.
Not sure if there is a more efficient notation/recommendation.
for record in records:
    check(df, record)
Using macOS, 8 GB / 2.3 GHz Intel Core i7.
cProfile.run on the check function alone shows:
4253 function calls (4199 primitive calls) in 0.017 seconds.
Hence I assume 500K will take around 2 1/2 hours.
While no data is available, consider this untested approach: a couple of left-join merges of both data pieces, then run the validation steps. This avoids any looping and runs the conditional logic across columns:
import pandas as pd
import numpy as np

with open('RecordsValidate.txt') as f:
    records = f.readlines()
print(records)

rdf = pd.DataFrame({'rcd_id': list(range(1, len(records) + 1)),
                    'rcd_area_code': [int(rcd[:3]) for rcd in records],
                    'rcd_office_code': [int(rcd[3:6]) for rcd in records],
                    'rcd_subscriber_number': [rcd[6:] for rcd in records]})

filename = 'myfile.csv'
df = pd.read_csv(filename)

# VALIDATE AREA CODE
mrgdf = pd.merge(df, rdf, how='left', left_on=['AREA_CODE'], right_on=['rcd_area_code'])
mrgdf['RETURN'] = np.where(pd.isnull(mrgdf['rcd_id']), 'INVALID_AREA_CODE', np.nan)
mrgdf.drop([c for c in rdf.columns], inplace=True, axis=1)

# VALIDATE OFFICE CODE
mrgdf = pd.merge(mrgdf, rdf, how='left', left_on=['AREA_CODE', 'OFFICE_CODE'],
                 right_on=['rcd_area_code', 'rcd_office_code'])
mrgdf['RETURN'] = np.where(pd.isnull(mrgdf['rcd_id']), 'INVALID_OFFICE_CODE', mrgdf['RETURN'])

# VALIDATE SUBSCRIBER
mrgdf['RETURN'] = np.where((mrgdf['rcd_subscriber_number'] < mrgdf['SUBSCRIBER_START']) |
                           (mrgdf['rcd_subscriber_number'] > mrgdf['SUBSCRIBER_END']) |
                           (mrgdf['LABEL'].str.len() == 0),
                           'INVALID_SUBSCRIBER', mrgdf['RETURN'])
mrgdf.drop([c for c in rdf.columns], inplace=True, axis=1)
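As a possible follow-up under the same assumptions (column names as in the question, merged result in mrgdf), the rows that did not hit any of the three invalid cases can then pick up the matched LABEL:
mrgdf['RETURN'] = np.where(mrgdf['RETURN'].isin(['INVALID_AREA_CODE', 'INVALID_OFFICE_CODE', 'INVALID_SUBSCRIBER']),
                           mrgdf['RETURN'], mrgdf['LABEL'])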
