I'm trying to read in a .csv file and extract specific columns so that I can output a single table that essentially performs a 'GROUP BY' on a particular column and aggregates certain other columns of interest (similar to how you would in SQL), but I'm not too familiar with how to do this easily in Python.
The csv file is in the following form:
age,education,balance,approved
30,primary,1850,yes
54,secondary,800,no
24,tertiary,240,yes
I've tried to import and read the csv file, parse the three columns I care about, and iterate through them to put them into three separate lists. I'm not too familiar with packages, or with how to get these into a data frame or matrix with 3 columns, so that I can then iterate through them and compute all of the aggregated output fields (see expected results below).
import csv

with open('loans.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    next(readCSV)  # skip the header row
    education = []
    balance = []
    loan_approved = []
    for row in readCSV:
        educat = row[1]
        bal = row[2]
        approve = row[3]
        education.append(educat)
        balance.append(bal)
        loan_approved.append(approve)

print(education)
print(balance)
print(loan_approved)
The output would be a 4x7 table of four rows (grouped by education level) and the following headers:
Education|#Applicants|Min Bal|Max Bal|#Approved|#Rejected|%Apps Approved
Primary ...
Secondary ...
Tertiary ...
This seems much simpler to do with pandas instead. For instance, you can read only the columns that you care about instead of all of them:
import pandas as pd

# the approval column is named 'approved' in your sample header
df = pd.read_csv('loans.csv', usecols=['education', 'balance', 'approved'])
Now, to group by education level, you can find all the unique entries for that column and group them:
groupby_education = {}
for level in set(df['education']):
    groupby_education[level] = df.loc[df['education'] == level]
print(groupby_education)
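Alternatively, pandas can build the whole summary table in one go with groupby and named aggregation. A sketch, assuming the approval column is named 'approved' as in your sample header and a reasonably recent pandas (0.25+):

import pandas as pd

df = pd.read_csv('loans.csv', usecols=['education', 'balance', 'approved'])

summary = df.groupby('education').agg(
    num_applicants=('balance', 'size'),
    min_bal=('balance', 'min'),
    max_bal=('balance', 'max'),
)
approved = df['approved'].eq('yes')
summary['num_approved'] = approved.groupby(df['education']).sum()
summary['num_rejected'] = summary['num_applicants'] - summary['num_approved']
summary['pct_approved'] = 100 * summary['num_approved'] / summary['num_applicants']
print(summary)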
I hope this helped. Let me know if you still need help.
Cheers!
I'm streaming data to a csv file.
This is the request:
symbols=["SPY", "IVV", "SDS", "SH", "SPXL", "SPXS", "SPXU", "SSO", "UPRO", "VOO"]
Each symbol has a list of values indexed from 0 to 8.
This is what it looks like, in 3 columns:
-1583353249601,symbol,SH
-1583353249601,delayed,False
-1583353249601,asset-main-type,EQUITY
-1583353250614,symbol,SH
-1583353250614,last-price,24.7952
-1583353250614,bid-size,362
-1583353250614,symbol,VOO
-1583353250614,bid-price,284.79
-1583353250614,bid-size,3
-1583353250614,ask-size,1
-1583353250614,bid-id,N
My end goal is to reshape the data; this is what I need to achieve.
The problems that I encountered were not being able to group by timestamp and not being able to pivot.
1) I tried to create a dict so it could later be passed to pandas, but I'm missing data in the process. I need to find a way to group the data that has the same timestamp; it looks like it omits the lines with the same timestamp.
code:
import csv
import pandas as pd

new_data_dict = {}
with open("stream_data.csv", 'r') as data_file:
    data = csv.DictReader(data_file, delimiter=",")
    for row in data:
        item = new_data_dict.get(row["timestamp"], dict())
        # rows sharing a timestamp can repeat the same field name across
        # symbols, so later rows overwrite earlier ones here; this is
        # where the missing data goes
        item[row["symbol"]] = row["value"]
        new_data_dict[row['timestamp']] = item

data = pd.DataFrame.from_dict(new_data_dict)
data = data.T  # .T returns a new frame, so it must be assigned (or printed directly)
print(data)
2) This is another approach. I was able to group by timestamp by creating 2 different data sets, but I cannot split the value column into multiple columns to be merged later on matching indexes.
code:
import pandas as pd

data = pd.read_csv("tasty_hola.csv", sep=',')
data1 = data.groupby(['timestamp']).apply(lambda v: v['value'].unique())
data = data.groupby(['timestamp']).apply(lambda v: v['symbol'].unique())
data1 = pd.DataFrame({'timestamp': data1.index, 'value': data1.values})
At this moment I don't know if the logic that I'm trying to apply is the correct one. I'm very lost, not being able to see the light at the end of the tunnel.
Thank you very much
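For what it's worth, a minimal pandas sketch of the reshape described above, assuming the header row is timestamp,symbol,value as the DictReader code implies, and that rows whose field name is "symbol" mark which ticker the following rows belong to:

import pandas as pd

df = pd.read_csv("stream_data.csv")

# rows whose field name is "symbol" start a block: carry that ticker
# forward onto the rows that follow, then drop the marker rows
df["ticker"] = df["value"].where(df["symbol"] == "symbol").ffill()
df = df[df["symbol"] != "symbol"]

# pivot so each (timestamp, ticker) pair becomes one row and each
# field name becomes its own column
wide = df.pivot_table(index=["timestamp", "ticker"], columns="symbol",
                      values="value", aggfunc="first")
print(wide.reset_index())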
There is a csv with 9 columns and 1.5 million rows. The question asks us to compute the spending for each account. There are 7700 account numbers that I was able to extract. Here is a sample from the file, since someone asked (it is a link since I don't have enough clout on here to post photos, apparently):
sample of the file
I'm especially confused given that you need to add the extra step of multiplying quantity and price since the transactions in the table are for individual items.
Oh, and we are not allowed to use pandas. And all of this is string data.
I have not tried much because I'm pretty stumped beyond simply getting the list of all of the account IDs. Even that was a challenge for me, so any help is appreciated. Below is simply the code I used to get the list of IDs, and I'm pretty sure I wasn't even supposed to use import csv for that, but oh well.
import csv

f_file = open('myfile.csv')
csv_f_file = csv.reader(f_file)
account_id = []
for row in csv_f_file:
    account_id.append(row[4])
account_id = set(account_id)
account_id_list = list(account_id)
print(account_id_list)  # was print(customer_id_list), an undefined name
The result should look like this (but imagine it 7000 times):
account: SID600
spending: 87.500
Thank you to anyone who can help!!
You could make it readable by using DictReader and DictWriter, but then it's mandatory that your CSV has a header. You could also save the results in another CSV for persistence.
Since your input may contain several entries per account (for example, SID600 could have entries for a chair, a table, and another table, with different prices and quantities), the spendings are first gathered in a list per account and then summed to a total.
Sample CSV input:
date,trans,item,account,quantity,price
0409,h65009,chair,SID600,12.5,7
0409,h65009,table,SID600,40,2
0409,h65009,table,SID600,22,10
0409,h65009,chair,SID601,30,11
0409,h65009,table,SID601,30,11
0409,h65009,table,SID602,4,9
The code:
import csv
from collections import defaultdict

inpf = open("accounts.csv", "r")
outpf = open("accounts_spending.csv", "w", newline="")  # newline="" avoids blank lines on Windows

incsv = csv.DictReader(inpf)
outcsv = csv.DictWriter(outpf, fieldnames=['account', 'spending'])
outcsv.writeheader()

spending = defaultdict(list)

# gather the spending of every entry, per account
for row in incsv:
    spending[row["account"]].append(float(row["quantity"]) * float(row["price"]))

# sum the spendings for each account
for account in spending:
    spending[account] = sum(spending[account])

# write the totals to the output CSV
for account, total_spending in spending.items():
    outcsv.writerow({
        "account": account,
        "spending": total_spending
    })

inpf.close()
outpf.close()
whose output will be:
account,spending
SID600,387.5
SID601,660.0
SID602,36.0
You can try this:
import csv

with open('myfile.csv') as f:
    csv_f_file = csv.reader(f)
    next(csv_f_file)  # skip the header row so float() doesn't choke on it
    data = list(csv_f_file)

res = {}
for row in data:
    res[row[3]] = res.get(row[3], 0.0)
    res[row[3]] += float(row[4]) * float(row[5])
print(res)
import csv

f_file = open('myfile.csv')
csv_f_file = csv.reader(f_file)  # was csv.reader(p_supermarket_file), an undefined name
rows = list(csv_f_file)  # materialize the rows; a reader can only be iterated once

# make a dictionary to store each account id and its corresponding total amount
totals = {}
for row in rows:
    account = row[3]
    amount = float(row[4]) * float(row[5])  # quantity * price
    totals[account] = totals.get(account, 0.0) + amount
I have not tested it, but this is what I understood from the question.
Give pandas a try. Use the groupby method, with a lambda if needed.
If your CSV file has its features laid out row-wise, take the transpose and then use groupby.
Refer to the official pandas documentation.
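A minimal sketch of that idea, assuming the column names from the sample CSV shown earlier:

import pandas as pd

# assuming a header of date,trans,item,account,quantity,price
df = pd.read_csv("accounts.csv")
spending = (df["quantity"] * df["price"]).groupby(df["account"]).sum()
print(spending)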
I want to read the entire row of data and store it in variables, then later use them in Selenium to write them to web elements. The programming language is Python.
Example: I have an Excel sheet of incidents and their details regarding priority, date, assignee, etc.
If I give the string INC00000, it should match the Excel data, fetch all the above details, and store them in separate variables like
INC #= INC0000 Priority= Moderate Date = 11/2/2020
Is this feasible? I tried and failed to write the code for it. Please suggest other possible ways to do this.
I would,
load the sheet into a pandas DataFrame
filter the corresponding column in the DataFrame by the INC # of interest
convert the row to dictionary (assuming the INC filter produces only 1 row)
get the corresponding value in the dictionary to assign to the corresponding webelement
Example:
import pandas as pd
df = pd.read_excel("full_file_path", sheet_name="name_of_sheet")
# filter on the "INC #" column, then convert the single matching row to a dict
dict_data = df[df['INC #'] == 'INC00000'].to_dict("records")[0]

webelement1.send_keys(dict_data[columnname1])
webelement2.send_keys(dict_data[columnname2])
webelement3.send_keys(dict_data[columnname3])
# ... and so on for the remaining web elements
Please find the code below and change the variables as needed, after saving your Excel file as CSV:
Please find the dummy data image
import csv

# Set up the input file for the script
gTrack = open("file1.csv", "r")

# Set up the CSV reader and process the header
csvReader = csv.reader(gTrack)
header = next(csvReader)  # csvReader.next() was Python 2 syntax
print(header)

id_index = header.index("id")
date_index = header.index("date")
var1_index = header.index("var1")
var2_index = header.index("var2")

# Make an empty list
cList = []

# Loop through the lines in the file and get the required id
for row in csvReader:
    id = row[id_index]
    if id == 'INC001':
        date = row[date_index]
        var1 = row[var1_index]
        var2 = row[var2_index]
        cList.append([id, date, var1, var2])

# Print the collected list
print(cList)
I am processing a CSV file in python thats delimited by a comma (,).
Each column is a sampled parameter; for instance, column 0 is time, sampled once a second, column 1 is altitude, sampled 4 times a second, etc.
So columns will look like as below:
Column 0 -> ["Time", 0, " "," "," ",1]
Column 1 -> ["Altitude", 100, 200, 300, 400]
I am trying to create a list for each column that captures its name and all its data. That way I can do calculations and organize my data into a new file automatically (the sampled data I am working with has a substantial number of rows).
I want to do this for any file not just one, so the number of columns can vary.
Normally if every file was consistent I would do something like:
import csv

time = []
alt = []

with open('test.csv', "r") as csvfile:
    csv_f = csv.reader(csvfile)
    for row in csv_f:
        time.append(row[0])  # was header.append(row[0]); 'header' was never defined
        alt.append(row[1])   # etc. for all columns
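A sketch of how this could generalize to any number of columns, assuming the first row of the file holds the column names:

import csv

# build one list per column, keyed by the column's name
with open('test.csv', newline='') as csvfile:
    csv_f = csv.reader(csvfile)
    names = next(csv_f)
    columns = {name: [] for name in names}
    for row in csv_f:
        for name, cell in zip(names, row):
            columns[name].append(cell)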
I am pretty new to Python. Is this a good way to tackle this, and if not, what is a better methodology?
Thanks for your time
Pandas will probably work best for you. If you use read_csv from pandas, it will create a DataFrame based on the columns. It's roughly a dictionary of lists.
You can also use the .tolist() functionality of pandas to convert it to a list if you want a list specifically.
import pandas as pd

data = pd.read_csv("soqn.csv")

dict_of_lists = {}
for column_name in data.columns:
    temp_list = data[column_name].tolist()
    dict_of_lists[column_name] = temp_list
print(dict_of_lists)
EDIT:
dict_of_lists = {column_name: data[column_name].tolist() for column_name in data.columns}
# This dict comprehension might work faster.
I think I made my problem simpler and just focused on one column.
What I ultimately wanted to do was to interpolate to the highest sampling rate. So here is what I came up with... Please let me know if I can do anything more efficiently. I used A LOT of searching on this site to help build this. Again, I am new to Python (about 2-3 weeks, but some former programming experience).
import csv

header = []

# initialize variables
loc_int = 0
loc_fin = 0
temp_i = 0
temp_f = 0

with open('test2.csv', "r") as csvfile:  # open the csv file
    csv_f = csv.reader(csvfile)
    for row in csv_f:
        header.append(row[0])  # make a list of all content in column A

for x in range(0, len(header) - 1):  # go through the entire column
    if header[x].isdigit() and header[x + 1] == "":  # find the lower bound of the sample to be interpolated
        loc_int = x
        temp_i = int(header[x])
    elif header[x + 1].isdigit() and header[x] == "":  # find the upper bound of the sample to be interpolated
        loc_fin = x
        temp_f = int(header[x + 1])
        if temp_f > temp_i:  # calculate the interpolated values
            f_min_i = temp_f - temp_i
            interp = f_min_i / float((loc_fin + 1) - loc_int)
            for y in range(loc_int, loc_fin + 1):
                header[y] = temp_i + interp * (y - loc_int)

print(header)

with open("output.csv", 'w', newline='') as g:  # write to a new file; 'wb' was Python 2
    writer = csv.writer(g)
    for item in header:
        writer.writerow([item])
I couldn't figure out how to write my new list "header" with its interpolated values back out, replacing column A of my old file, test2.csv.
Anyhow, thank you very much for looking...
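One possible way to do that last step, as a sketch: read the old file whole, swap in the interpolated column, and write a new file. This assumes the "header" list from the code above has one value per row of test2.csv:

import csv

with open('test2.csv', newline='') as f:
    rows = list(csv.reader(f))

# swap the interpolated values into column A, row by row
for i, row in enumerate(rows):
    row[0] = header[i]

with open('test2_interpolated.csv', 'w', newline='') as g:
    csv.writer(g).writerows(rows)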
The purpose of my Python script is to compare the data present in multiple CSV files, looking for discrepancies. The data are ordered, but the ordering differs between files. The files contain about 70K lines, weighing around 15MB. Nothing fancy or hardcore here. Here's part of the code:
import csv

allRows = []

def getCSV(fpath):
    with open(fpath, newline="") as f:  # "rb" was Python 2; csv wants text mode in Python 3
        csvfile = csv.reader(f)
        for row in csvfile:
            allRows.append(row)

allCols = list(map(list, zip(*allRows)))
Am I properly reading from my CSV files? I'm using csv.reader, but would I benefit from using csv.DictReader?
How can I create a list containing whole rows which have a certain value in a precise column?
Are you sure you want to keep all the rows around? This creates a list with matching values only... fname could also come from glob.glob() or os.listdir() or whatever other data source you choose. Just to note: you mention the 20th column, but row[20] will be the 21st column...
import csv

matching20 = []

for fname in ('file1.csv', 'file2.csv', 'file3.csv'):
    with open(fname) as fin:
        csvin = csv.reader(fin)
        next(csvin)  # <--- if you want to skip the header row
        for row in csvin:
            if row[20] == 'value':
                matching20.append(row)  # or do something with it here
You only want csv.DictReader if you have a header row and want to access your columns by name.
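For illustration, the DictReader variant might look like this sketch, where the column name "status" is made up; your real header would supply the actual name:

import csv

matching = []
for fname in ('file1.csv', 'file2.csv', 'file3.csv'):
    with open(fname, newline='') as fin:
        for row in csv.DictReader(fin):   # the header row becomes the keys
            if row['status'] == 'value':  # 'status' is a hypothetical column name
                matching.append(row)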
This should work; you don't need to make another list to have access to the columns.
import csv

def getCSV(fpath):
    with open(fpath) as ifile:
        csvfile = csv.reader(ifile)
        rows = list(csvfile)
    value_20 = [x for x in rows if x[20] == 'value']
    return value_20
If I understand the question correctly, you want to include a row if value is in the row, but you don't know which column value is, correct?
If your rows are lists, then this should work:
testlist = [row for row in allRows if 'value' in row]
post-edit:
If, as you say, you want a list of rows where value is in a specified column (specified by an integer pos), then:
pos = 20
testlist = [row for row in allRows if row[pos] == 'value']
(I haven't tested this, but let me know if that works.)