How to create a MultiIndex DataFrame from a streaming CSV file - python

I'm streaming data to a CSV file.
This is the request:
symbols = ["SPY", "IVV", "SDS", "SH", "SPXL", "SPXS", "SPXU", "SSO", "UPRO", "VOO"]
Each symbol has a list of fields, indexed from 0 to 8.
This is how the file looks, in 3 columns:
-1583353249601,symbol,SH
-1583353249601,delayed,False
-1583353249601,asset-main-type,EQUITY
-1583353250614,symbol,SH
-1583353250614,last-price,24.7952
-1583353250614,bid-size,362
-1583353250614,symbol,VOO
-1583353250614,bid-price,284.79
-1583353250614,bid-size,3
-1583353250614,ask-size,1
-1583353250614,bid-id,N
My end goal is to reshape the data; this is what I need to achieve.
The problems that I encountered were: not being able to group by timestamp, and not being able to pivot.
1) I tried to create a dict so it could later be passed to pandas, but I'm missing data in the process.
I need to find a way to group the data that has the same timestamp; it looks like this approach omits lines with the same timestamp.
code:
import csv
import pandas as pd

new_data_dict = {}
with open("stream_data.csv", 'r') as data_file:
    data = csv.DictReader(data_file, delimiter=",")
    for row in data:
        item = new_data_dict.get(row["timestamp"], dict())
        item[row["symbol"]] = row["value"]
        new_data_dict[row['timestamp']] = item

data = new_data_dict
data = pd.DataFrame.from_dict(data)
data = data.T
print(data)
2) This is another approach: I was able to group by timestamp by creating two different DataFrames, but I cannot split the value column into multiple columns to be merged later on matching indexes.
code:
data = pd.read_csv("tasty_hola.csv",sep=',' )
data1 = data.groupby(['timestamp']).apply(lambda v: v['value'].unique())
data = data.groupby(['timestamp']).apply(lambda v: v['symbol'].unique())
data1 = pd.DataFrame({'timestamp':data1.index, 'value':data1.values})
At this moment I don't know if the logic that I'm trying to apply is the correct one; I'm very lost and not able to see the light at the end of the tunnel.
Thank you very much.
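One possible direction (a rough sketch, not a tested answer): treat the middle column as a field name, forward-fill the "symbol" rows so every reading knows which ticker it belongs to, and then unstack the field names into columns to get a (timestamp, symbol) MultiIndex. The column names below are assumptions based on the sample above.

import pandas as pd

# Sketch only: column names are assumptions. The sample above has no header
# row, so names= supplies them (drop names= and rename the columns instead if
# your file already has a header line).
df = pd.read_csv("stream_data.csv", names=["timestamp", "field", "value"])

# Propagate the ticker from each "symbol" row down to the readings below it.
df["ticker"] = df["value"].where(df["field"] == "symbol").ffill()
df = df[df["field"] != "symbol"]

# One row per (timestamp, ticker), one column per field.
wide = (df.groupby(["timestamp", "ticker", "field"])["value"]
          .last()
          .unstack("field"))
print(wide)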

Related

How can I do basic maths calculations faster on very large CSV files with mixed datatypes - with Python

I have some very large CSV files (15Gb+) that contain 4 initial rows of metadata / header info and then the data. The first 3 columns are 3D Cartesian coordinates and are the values I need to change with basic maths operations, e.g. add, subtract, multiply, divide. I need to do this en masse on each of the coordinate columns. The first 3 columns are float type values.
The rest of the columns in the CSV could be of any type, e.g. string, int, etc.
I currently use a script that reads in each row of the CSV, makes the modification, then writes to a new file, and it seems to work fine. But the problem is that it takes days on a large file. The machine I'm running on has plenty of memory (120Gb), but my current method doesn't utilise it.
I know I can update a column en masse using a numpy 2D array if I skip the 4 metadata rows, e.g.:
arr = np.genfromtxt(input_file_path, delimiter=',', skip_header=4)
arr[:, 0] = np.add(arr[:, 0], 300)
This will update the first column by adding 300 to each value. But the issues I have with trying to use numpy are:
Numpy arrays don't support mixed data types for the rest of the columns that will be imported (I don't know what the other columns will hold, so I can't use structured arrays - or rather, I want it to be a universal tool so I don't have to know what they will hold).
I can export the numpy array to CSV (providing it's not mixed types), and using regular text functions I can create a separate CSV for the 4 rows of metadata, but then I need to somehow concatenate them, and I don't want to read through all the lines of the data CSV just to append it to the bottom of the metadata CSV.
I know that if I can make this work with numpy it will greatly increase the speed by utilising the machine's large amount of memory and holding the entire CSV in memory while I do operations. I've never used pandas but would also consider it for a solution. I've had a bit of a look into pandas, thinking I may be able to do it with DataFrames, but I still need to figure out how to have 4 rows as my column header instead of one, and additionally I haven't seen a way to apply a mass update to a whole column (like I can with numpy) without using a Python loop - not sure if that would make it slow or not if it's already in memory.
The metadata can be empty for rows 2, 3 and 4, but in most cases row 4 will have the data type recorded. There could be up to 200 data columns in addition to the initial 3 coordinate columns.
My current (slow) code looks like this:
import os
import subprocess
import csv
import numpy as np


def move_txt_coords_to(move_by_coords, input_file_path, output_file_path):
    # create new empty output file
    open(output_file_path, 'a').close()

    with open(input_file_path, newline='') as f:
        reader = csv.reader(f)
        for idx, row in enumerate(reader):
            if idx < 4:
                append_row(output_file_path, row)
            else:
                new_x = round(float(row[0]) + move_by_coords['x'], 3)
                new_y = round(float(row[1]) + move_by_coords['y'], 3)
                new_z = round(float(row[2]) + move_by_coords['z'], 3)
                row[0] = new_x
                row[1] = new_y
                row[2] = new_z
                append_row(output_file_path, row)


def append_row(output_file, row):
    f = open(output_file, 'a', newline='')
    writer = csv.writer(f, delimiter=',')
    writer.writerow(row)
    f.close()


if __name__ == '__main__':
    move_by_coords = {
        'x': -338802.5,
        'y': -1714752.5,
        'z': 0
    }
    input_file_path = r'D:\incoming_data\large_data_set1.csv'
    output_file_path = r'D:\outgoing_data\large_data_set_relocated.csv'
    move_txt_coords_to(move_by_coords, input_file_path, output_file_path)
Okay so I've got an almost complete answer and it was so much easier than trying to use numpy.
import pandas as pd

input_file_path = r'D:\input\large_data.csv'
output_file_path = r'D:\output\large_data_relocated.csv'

move_by_coords = {
    'x': -338802.5,
    'y': -1714752.5,
    'z': 0
}

df = pd.read_csv(input_file_path, header=[0, 1, 2, 3])
df.centroid_x += move_by_coords['x']
df.centroid_y += move_by_coords['y']
df.centroid_z += move_by_coords['z']
df.to_csv(output_file_path, sep=',')
But I have one remaining issue (possibly two). The blank cells in my header are being populated with "Unnamed"; I somehow need to substitute a blank string for those in the header rows.
Also, #FBruzzesi has warned me I may need to use a batch size to make it more efficient, which I'll need to check out.
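For reference, the pandas option for this is called chunksize; a rough sketch of what a chunked version of the snippet above might look like (untested, reusing the same paths and offsets, and assuming the coordinates are the first three columns):

import pandas as pd

input_file_path = r'D:\input\large_data.csv'
output_file_path = r'D:\output\large_data_relocated.csv'
move_by_coords = {'x': -338802.5, 'y': -1714752.5, 'z': 0}

first_chunk = True
for chunk in pd.read_csv(input_file_path, header=[0, 1, 2, 3], chunksize=1_000_000):
    # shift the three coordinate columns (assumed to be the first three)
    chunk.iloc[:, 0] += move_by_coords['x']
    chunk.iloc[:, 1] += move_by_coords['y']
    chunk.iloc[:, 2] += move_by_coords['z']
    # write the header rows only for the first chunk, then append
    chunk.to_csv(output_file_path, sep=',', index=False,
                 mode='w' if first_chunk else 'a', header=first_chunk)
    first_chunk = False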
---------------------Update-------------
Okay, I resolved the multiline header issue. I just use the regular csv reader module to read the first 4 rows into a list of rows, then transpose that into a list of columns, converting each column list to a tuple at the same time. Once I have the list of column header tuples (where each tuple consists of the rows within that column header), I can use that list to name the header. I therefore skip the header rows when reading the CSV into the DataFrame, and then update each column by its index. I also drop the index column on export back to CSV once done.
It seems to work very well.
import csv
import itertools
import pandas as pd


def make_first_4rows_list_of_tuples(input_csv_file_path):
    f = open(input_csv_file_path, newline='')
    reader = csv.reader(f)
    header_rows = []
    for row in itertools.islice(reader, 0, 4):
        header_rows.append(row)
    header_col_tuples = list(map(tuple, zip(*header_rows)))
    print("Header columns: \n", header_col_tuples)
    return header_col_tuples


if __name__ == '__main__':
    move_by_coords = {
        'x': 1695381.5,
        'y': 5376792.5,
        'z': 100
    }
    input_file_path = r'D:\temp\mydata.csv'
    output_file_path = r'D:\temp\my_updated_data.csv'

    column_headers = make_first_4rows_list_of_tuples(input_file_path)
    df = pd.read_csv(input_file_path, skiprows=4, names=column_headers)
    df.iloc[:, 0] += move_by_coords['x']
    df.iloc[:, 1] += move_by_coords['y']
    df.iloc[:, 2] += move_by_coords['z']
    df.to_csv(output_file_path, sep=',', index=False)

Iterating through a csv file and creating a table

I'm trying to read in a .csv file and extract specific columns so that I can output a single table that essentially performs a 'GROUP BY' on a particular column and aggregates certain other columns of interest (similar to how you would in SQL), but I'm not too familiar with how to do this easily in Python.
The csv file is in the following form:
age,education,balance,approved
30,primary,1850,yes
54,secondary,800,no
24,tertiary,240,yes
I've tried to import and read in the CSV file to parse the three columns I care about and iterate through them, putting them into three separate lists. I'm not too familiar with the packages and how to get these into a data frame or matrix with 3 columns so that I can then iterate through them and compute all of the aggregated output fields (see the expected results below).
import csv

with open('loans.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    next(readCSV)  # skips the header row
    education = []
    balance = []
    loan_approved = []
    for row in readCSV:
        educat = row[1]
        bal = row[2]
        approve = row[3]
        education.append(educat)
        balance.append(bal)
        loan_approved.append(approve)

print(education)
print(balance)
print(loan_approved)
The output would be a 4x7 table: four rows (grouped by education level) with the following headers:
Education|#Applicants|Min Bal|Max Bal|#Approved|#Rejected|%Apps Approved
Primary ...
Secondary ...
Tertiary ...
This seems much simpler using pandas instead. For instance, you can read only the columns that you care about instead of all of them (column names follow the sample CSV above):
import pandas as pd
df = pd.read_csv('loans.csv', usecols=['education', 'balance', 'approved'])
Now, to group by education level, you can find all the unique entries for that column and group them:
groupby_education = {}
for level in list(set(df['education'])):
    groupby_education[level] = df.loc[df['education'] == level]
print(groupby_education)
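From there, the full aggregated table could be built with a single groupby/agg; here is a minimal sketch (column names follow the sample CSV above, and the approved/rejected counts assume the column holds "yes"/"no" values):

import pandas as pd

df = pd.read_csv('loans.csv', usecols=['education', 'balance', 'approved'])

summary = df.groupby('education').agg(
    applicants=('approved', 'size'),
    min_bal=('balance', 'min'),
    max_bal=('balance', 'max'),
    approved=('approved', lambda s: (s == 'yes').sum()),
)
summary['rejected'] = summary['applicants'] - summary['approved']
summary['pct_approved'] = 100 * summary['approved'] / summary['applicants']
print(summary)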
I hope this helped. Let me know if you still need help.
Cheers!

Creating multiple csv files from existing csv file python pandas

I'm trying to take a large CSV file and write a separate CSV file for each combination of two of its columns. I was able to get the unique values of those two columns from the file, so I know which CSV files need to be created.
Ex Data:
1,224939.203,1243008.651,1326.774,F,C-GRAD-FILL,09/22/18 07:24:34,
1,225994.242,1243021.426,1301.772,BS,C-GRAD-FILL,09/24/18 08:24:18,
451,225530.332,1243016.186,1316.173,GRD,C-TOE,10/02/18 11:49:13,
452,225522.429,1242996.017,1319.168,GRD,C-TOE KEY,10/02/18 11:49:46,
I would like to create a csv file "C-GRAD-FILL 09-22-18.csv" with all of the data that matches the two values. I cannot decide how to iterate through the data for both values.
import pandas as pd

def readData(fileName):
    df = pd.read_csv(fileName, index_col=False,
                     names=['Number', 'Northing', 'Easting', 'Elevation', 'Description', 'Layer', 'Date'],
                     parse_dates=['Date'])
    ## Layers here!!!
    layers = df['Layer'].unique()
    ## Dates here!!! AS DATETIME OBJECTS!!!!
    dates = df['Date'].map(lambda t: t.date()).unique()
    ## Sorted in order
    sortedList = df.sort_values(by=['Layer', 'Date'])
You can use a GroupBy object. First ensure your date is in the correct string format:
df['Date'] = pd.to_datetime(df['Date']).dt.strftime('%m-%d-%y')
To output all files, iterate a GroupBy object:
for (layer, date), group in df.groupby(['Layer', 'Date']):
    group.to_csv(f'{layer} {date}.csv', index=False)
Or, for one specific combination:
layer = 'C-GRAD-FILL'
date = '09-22-18'
g = df.groupby(['Layer', 'Date'])
g.get_group((layer, date)).to_csv(f'{layer} {date}.csv', index=False)

Read data from excel after a string matches

I want to read an entire row of data and store it in variables, then later use them in Selenium to write to web elements. The programming language is Python.
Example: I have an Excel sheet of incidents and their details regarding priority, date, assignee, etc.
If I give the string INC00000, it should match the Excel data, fetch all the above details, and store them in separate variables like:
INC # = INC0000, Priority = Moderate, Date = 11/2/2020
Is this feasible? I tried and failed to write the code. Please suggest possible ways to do this.
I would:
load the sheet into a pandas DataFrame
filter the corresponding column in the DataFrame by the INC # of interest
convert the row to a dictionary (assuming the INC filter produces only 1 row)
get the corresponding value in the dictionary to assign to the corresponding web element
Example:
import pandas as pd

df = pd.read_excel("full_file_path", sheet_name="name_of_sheet")
# assuming the incident numbers are in a column named "INC #" in the spreadsheet,
# and that the filter produces a single matching row
dict_data = df[df['INC #'] == 'INC00000'].to_dict("records")[0]
webelement1.send_keys(dict_data[columnname1])
webelement2.send_keys(dict_data[columnname2])
webelement3.send_keys(dict_data[columnname3])
...
Please find the code below and change the variables as needed after saving your Excel file as CSV.
(Dummy data image not shown.)
import csv

# Set up input and output variables for the script
gTrack = open("file1.csv", "r")

# Set up the CSV reader and process the header
csvReader = csv.reader(gTrack)
header = next(csvReader)
print(header)

id_index = header.index("id")
date_index = header.index("date ")
var1_index = header.index("var1")
var2_index = header.index("var2")

# Make an empty list
cList = []

# Loop through the lines in the file and get the required id
for row in csvReader:
    id = row[id_index]
    if id == 'INC001':
        date = row[date_index]
        var1 = row[var1_index]
        var2 = row[var2_index]
        cList.append([id, date, var1, var2])

# Print the collected list
print(cList)

Import CSV and create one list for each column in Python

I am processing a CSV file in Python that's delimited by a comma (,).
Each column is a sampled parameter; for instance, column 0 is time, sampled once a second, column 1 is altitude, sampled 4 times a second, etc.
So the columns will look like the below:
Column 0 -> ["Time", 0, " "," "," ",1]
Column 1 -> ["Altitude", 100, 200, 300, 400]
I am trying to create a list for each column that captures its name and all its data. That way I can do calculations and organize my data into a new file automatically (the sampled data I am working with has a substantial number of rows).
I want to do this for any file, not just one, so the number of columns can vary.
Normally, if every file were consistent, I would do something like:
import csv

time = []
alt = []
dct = {}
with open('test.csv', "r") as csvfile:
    csv_f = csv.reader(csvfile)
    for row in csv_f:
        time.append(row[0])
        alt.append(row[1])  # etc. for all columns
I am pretty new to Python. Is this a good way to tackle this? If not, what is a better methodology?
Thanks for your time.
Pandas will probably work best for you. If you use read_csv from pandas, it will create a DataFrame based on the columns; it's roughly a dictionary of lists.
You can also use the .tolist() functionality of pandas to convert a column to a list if you want a list specifically.
import pandas as pd

data = pd.read_csv("soqn.csv")
dict_of_lists = {}
for column_name in data.columns:
    temp_list = data[column_name].tolist()
    dict_of_lists[column_name] = temp_list
print(dict_of_lists)

EDIT:
dict_of_lists = {column_name: data[column_name].tolist() for column_name in data.columns}
# This dict comprehension might work faster.
I think I made my problem simpler and just focused on one column.
What I ultimately wanted to do was to interpolate to the highest sampling rate. So here is what I came up with... Please let me know if I can do anything more efficiently. I did A LOT of searching on this site to help build this. Again, I am new to Python (about 2-3 weeks, but with some former programming experience).
import csv

header = []
# initialize variables
loc_int = 0
loc_fin = 0
temp_i = 0
temp_f = 0

with open('test2.csv', "r") as csvfile:  # open csv file
    csv_f = csv.reader(csvfile)
    for row in csv_f:
        header.append(row[0])  # make a list that consists of all content in column A

for x in range(0, len(header) - 1):  # go through the entire column
    if header[x].isdigit() and header[x + 1] == "":  # find lower bound of sample to be interpolated
        loc_int = x
        temp_i = int(header[x])
    elif header[x + 1].isdigit() and header[x] == "":  # find upper bound of sample to be interpolated
        loc_fin = x
        temp_f = int(header[x + 1])
        if temp_f > temp_i:  # calculate interpolated values
            f_min_i = temp_f - temp_i
            interp = f_min_i / float((loc_fin + 1) - loc_int)
            for y in range(loc_int, loc_fin + 1):
                header[y] = temp_i + interp * (y - loc_int)

print(header)

with open("output.csv", 'w', newline='') as g:  # write to new file
    writer = csv.writer(g)
    for item in header:
        writer.writerow([item])
I couldn't figure out how to write my new list "header", with its interpolated values, back to replace column A of my old file, test2.csv.
Anyhow, thank you very much for looking...
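For what it's worth, that write-back could be done with pandas once the interpolated list exists; a small sketch (assuming test2.csv has no header row, that "header" holds one value per row as built in the code above, and using a hypothetical output file name):

import pandas as pd

# Sketch only: `header` is the interpolated list built by the script above.
df = pd.read_csv('test2.csv', header=None)   # assumes no header row in the file
df[0] = header                               # replace column A with the interpolated values
df.to_csv('test2_interpolated.csv', header=False, index=False)  # hypothetical output name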
