Querying CSV files in Python like SQL
This is apparently a popular interview question.
There are two CSV files with dinosaur data. We need to query them to return the dinosaurs that satisfy a certain condition.
Note: we cannot use additional modules such as q, fsql, or csvkit.
file1.csv:
NAME,LEG_LENGTH,DIET
Hadrosaurus,1.2,herbivore
Struthiomimus,0.92,omnivore
Velociraptor,1.0,carnivore
Triceratops,0.87,herbivore
Euoplocephalus,1.6,herbivore
Stegosaurus,1.40,herbivore
Tyrannosaurus Rex,2.5,carnivore
file2.csv:
NAME,STRIDE_LENGTH,STANCE
Euoplocephalus,1.87,quadrupedal
Stegosaurus,1.90,quadrupedal
Tyrannosaurus Rex,5.76,bipedal
Hadrosaurus,1.4,bipedal
Deinonychus,1.21,bipedal
Struthiomimus,1.34,bipedal
Velociraptor,2.72,bipedal
Using the formula:
speed = ((STRIDE_LENGTH / LEG_LENGTH) - 1) * SQRT(LEG_LENGTH * g), where g = 9.8 m/s^2
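For example, Tyrannosaurus Rex (LEG_LENGTH 2.5, STRIDE_LENGTH 5.76) works out to speed = (5.76 / 2.5 - 1) * SQRT(2.5 * 9.8) ≈ 1.304 * 4.95 ≈ 6.45 m/s.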
Write a program to read the CSV files and print only the names of bipedal dinosaurs, sorted by speed from fastest to slowest.
In SQL, this would be simple:
select f2.name
from file1 f1
join file2 f2 on f1.name = f2.name
where f2.stance = 'bipedal'
order by (f2.stride_length / f1.leg_length - 1) * pow(f1.leg_length * 9.8, 0.5) desc
How can this be done in Python?
You can do it in pandas:
import pandas as pd

df_1 = pd.read_csv('file1.csv')
df_2 = pd.read_csv('file2.csv')

# Join on NAME, keep only bipedal dinosaurs, compute speed, sort fastest first
df_comb = df_1.join(df_2.set_index('NAME'), on='NAME')
df_comb = df_comb.loc[df_comb.STANCE == 'bipedal']
df_comb['SPEED'] = (df_comb.STRIDE_LENGTH / df_comb.LEG_LENGTH - 1) * (df_comb.LEG_LENGTH * 9.8) ** 0.5
print(df_comb.sort_values('SPEED', ascending=False).NAME)
Not as clean as SQL!
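With the sample data above, the resulting order is Tyrannosaurus Rex, Velociraptor, Struthiomimus, Hadrosaurus; Deinonychus is dropped by the join because it has no LEG_LENGTH entry in file1.csv. That ordering is a useful sanity check for the other approaches here.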
You can write SQL in python using pandasql.
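For example, a minimal (untested) sketch with pandasql: sqldf() is its documented entry point, but POWER()/SQRT() in the ORDER BY only work if the SQLite build underneath ships the math functions; otherwise compute SPEED in pandas first.

import pandas as pd
from pandasql import sqldf

df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')

query = """
    SELECT f2.NAME
    FROM df1 f1
    JOIN df2 f2 ON f1.NAME = f2.NAME
    WHERE f2.STANCE = 'bipedal'
    ORDER BY (f2.STRIDE_LENGTH / f1.LEG_LENGTH - 1) * POWER(f1.LEG_LENGTH * 9.8, 0.5) DESC
"""
print(sqldf(query, locals()))  # POWER() requires SQLite math functions; see note above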
def csvtable(file):                               # Read CSV file into 2-D dictionary
    table = {}
    f = open(file)
    columns = f.readline().strip().split(',')     # Get column names
    for line in f.readlines():
        values = line.strip().split(',')          # Get current row
        for column, value in zip(columns, values):
            if column == 'NAME':                  # table['TREX'] = {} (assumes NAME is the first column)
                key = value
                table[key] = {}
            else:
                table[key][column] = value        # table['TREX']['LENGTH'] = 10
    f.close()
    return table

# READ
try:
    table1 = csvtable('file1.csv')
    table2 = csvtable('file2.csv')
except Exception as e:
    print(e)

# JOIN, FILTER & COMPUTE
table3 = {}
for value in table1.keys():
    # Join both tables on key (NAME) and filter (STANCE)
    if value in table2.keys() and table2[value]['STANCE'] == 'bipedal':
        leg_length = float(table1[value]['LEG_LENGTH'])
        stride_length = float(table2[value]['STRIDE_LENGTH'])
        # Compute SPEED
        speed = ((stride_length / leg_length) - 1) * pow((leg_length * 9.8), 0.5)
        table3[value] = speed

# SORT
result = sorted(table3, key=lambda x: table3[x], reverse=True)   # Sort descending by speed

# WRITE
try:
    f = open('result.txt', 'w')
    for r in result:
        f.write('%s\n' % r)
    f.close()
except Exception as e:
    print(e)
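Since the csv module is part of the standard library rather than an additional module, the same join/filter/sort can also be sketched with csv.DictReader (a minimal, untested variant of the solution above, using the file names from the question):

import csv

def read_csv(path):
    # Build {NAME: row_dict} from a CSV file
    with open(path, newline='') as f:
        return {row['NAME']: row for row in csv.DictReader(f)}

legs = read_csv('file1.csv')       # NAME, LEG_LENGTH, DIET
strides = read_csv('file2.csv')    # NAME, STRIDE_LENGTH, STANCE

speeds = {}
for name, row in strides.items():
    # Inner join on NAME, filter on STANCE
    if row['STANCE'] == 'bipedal' and name in legs:
        leg = float(legs[name]['LEG_LENGTH'])
        stride = float(row['STRIDE_LENGTH'])
        speeds[name] = (stride / leg - 1) * (leg * 9.8) ** 0.5

# Print names only, fastest first
for name in sorted(speeds, key=speeds.get, reverse=True):
    print(name)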
I've encountered the same problem at work and decided to build an offline desktop app where you can load CSVs and start writing SQL. You can join, group by, and so on.
It is backed by C and SQLite and can handle gigabytes of CSV files in about 10 seconds. It's very fast.
Here's the app: https://superintendent.app/
It's not Python, but it is a lot more convenient to use.
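Along the same lines, Python's built-in sqlite3 module can run the original SQL almost verbatim by loading both CSVs into an in-memory database. A rough sketch (SQRT is registered by hand because not every SQLite build ships the math functions):

import csv
import math
import sqlite3

conn = sqlite3.connect(':memory:')
conn.create_function('SQRT', 1, math.sqrt)   # register sqrt; not all SQLite builds provide it

conn.execute('CREATE TABLE file1 (NAME TEXT, LEG_LENGTH REAL, DIET TEXT)')
conn.execute('CREATE TABLE file2 (NAME TEXT, STRIDE_LENGTH REAL, STANCE TEXT)')

with open('file1.csv', newline='') as f:
    conn.executemany('INSERT INTO file1 VALUES (?, ?, ?)', list(csv.reader(f))[1:])  # skip header row
with open('file2.csv', newline='') as f:
    conn.executemany('INSERT INTO file2 VALUES (?, ?, ?)', list(csv.reader(f))[1:])

query = """
    SELECT f2.NAME
    FROM file1 f1
    JOIN file2 f2 ON f1.NAME = f2.NAME
    WHERE f2.STANCE = 'bipedal'
    ORDER BY (f2.STRIDE_LENGTH / f1.LEG_LENGTH - 1) * SQRT(f1.LEG_LENGTH * 9.8) DESC
"""
for (name,) in conn.execute(query):
    print(name)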