Looking to make the following code parallel: it reads data from one large 9 GB file in a proprietary format and produces 30 individual CSV files, one per column of data. It currently takes 9 minutes per CSV written for a 30-minute data set. The solution space of parallel libraries in Python is a bit overwhelming. Can you direct me to any good tutorials or sample code? I couldn't find anything very informative.
for i in range(0, NumColumns):
    aa = datetime.datetime.now()
    allData = [TimeStamp]
    ColumnData = allColumns[i].data  # Get the data within this one Column
    Samples = ColumnData.size        # Find the number of elements in Column data
    print('Formatting Column {0}'.format(i+1))
    truncColumnData = []  # Initialize truncColumnData array each time the loop runs
    if ColumnScale[i+1] == 'Scale: ' + tempScaleName:  # If it's temperature, format every value to 5 characters
        for j in range(Samples):
            truncValue = '{:.1f}'.format(ColumnData[j])
            truncColumnData.append(truncValue)  # Append formatted value to truncColumnData array
    allData.append(truncColumnData)  # Append the formatted Column data to the allData array

    zipObject = zip(*allData)
    zipList = list(zipObject)

    csvFileColumn = 'Column_' + str('{0:02d}'.format(i+1)) + '.csv'
    # Write the information to a .csv file
    with open(csvFileColumn, 'wb') as csvFile:
        print('Writing to .csv file')
        writer = csv.writer(csvFile)
        counter = 0
        for z in zipList:
            counter = counter + 1
            timeString = '{:.26},'.format(z[0])
            zList = list(z)
            columnVals = zList[1:]
            columnValStrs = list(map(str, columnVals))
            formattedStr = ','.join(columnValStrs)
            csvFile.write(timeString + formattedStr + '\n')  # Writes the time stamps and channel data by columns
One possible solution may be to use Dask: http://dask.pydata.org/en/latest/
A coworker recently recommended it to me, which is why I thought of it.
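A minimal, untested sketch of what that could look like with dask.delayed, assuming recent versions of Dask and a hypothetical function write_column(i) that wraps everything currently inside your "for i in range(0, NumColumns)" loop (it takes only the column index and writes one CSV). Note the column data has to be picklable for the multiprocessing scheduler to ship it to worker processes:
import dask

@dask.delayed
def write_column(i):
    # hypothetical wrapper: format allColumns[i].data and write Column_XX.csv,
    # exactly as in the loop body of the question
    return 'Column_' + '{0:02d}'.format(i + 1) + '.csv'

tasks = [write_column(i) for i in range(NumColumns)]
# run the 30 column jobs in parallel on separate processes
dask.compute(*tasks, scheduler='processes')
If the per-column work is dominated by Python-level formatting (as it appears to be), processes rather than threads are the right choice, since threads would be limited by the GIL.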
I am extracting data from a large PDF file with a regex, using Python in Databricks. The data comes out as one long string, and I use the string split function to convert it into a pandas DataFrame, since I want the final data as a CSV file. But the line.split command takes about 5 hours to run, and I am looking for ways to optimize it. I am new to Python and not sure which part of the code I should look at to reduce the running time.
for pdf in os.listdir(data_directory):
    # creating a file object
    file = open(data_directory + pdf, 'rb')
    # creating a PDF reader object
    fileReader = PyPDF2.PdfFileReader(file)
    num_pages = fileReader.numPages
    # print("total pages = " + str(num_pages))
    extracted_string = "start of file"
    current_page = 0
    while current_page < num_pages:
        # print("adding page " + str(current_page) + " to the file")
        extracted_string += fileReader.getPage(current_page).extract_text()
        current_page = current_page + 1

    regex_date = "\d{2}\/\d{2}\/\d{4}[^\n]*"
    table_lines = re.findall(regex_date, extracted_string)
The code above extracts the data from the PDF.
# create a dataframe out of the extracted string and load it into a single dataframe
for line in table_lines:
    df = pd.DataFrame([x.split(' ') for x in line.split('\n')])
    df.rename(columns={0: 'date_of_import', 1: 'entry_num', 2: 'warehouse_code_num', 3: 'declarant_ref_num', 4: 'declarant_EORI_num', 5: 'VAT_due'}, inplace=True)
    table = pd.concat([table, df], sort=False)
This part of the code is what takes up most of the time. I have tried different ways to get a DataFrame out of this data, but the above has worked best for me. I am looking for a faster way to run this code.
https://drive.google.com/file/d/1ew3Fw1IjeToBA-KMbTTD_hIINiQm0Bkg/view?usp=share_link pdf file for reference
There are two immediate optimization steps in your code.
Pre-compile a regex if it is used many times. It may or may not be relevant here, because I could not guess how many times table_lines = re.findall(regex_date, extracted_string) is executed, but this is often more efficient:
# before any loop
regex_date = re.compile(r"\d{2}/\d{2}/\d{4}[^\n]*")
...
# inside the loop
table_lines = regex_date.findall(extracted_string)
Do not repeatedly append to a DataFrame. A DataFrame is a rather complex container, and appending rows is a costly operation. It is generally much more efficient to build a plain Python container (a list or dict) first and then convert it to a DataFrame as a whole:
# each regex match is a single line (the pattern cannot contain '\n'),
# so split each match directly into its fields
data = [line.split(' ') for line in table_lines]
table = pd.DataFrame(data, columns=['date_of_import', 'entry_num',
                                    'warehouse_code_num', 'declarant_ref_num',
                                    'declarant_EORI_num', 'VAT_due'])
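For context, here is an untested sketch of how the two changes fit into the original per-PDF loop, using the same variable names as the question (data_directory is assumed to be defined as there, and each matched line is assumed to split into exactly the six fields above): collect the split rows from every PDF into one list and build the DataFrame once at the end.
import os
import re
import PyPDF2
import pandas as pd

regex_date = re.compile(r"\d{2}/\d{2}/\d{4}[^\n]*")  # compiled once, before any loop
rows = []  # plain list that accumulates one row (a list of fields) per match

for pdf in os.listdir(data_directory):
    with open(data_directory + pdf, 'rb') as f:
        fileReader = PyPDF2.PdfFileReader(f)
        extracted_string = "start of file"
        for page_num in range(fileReader.numPages):
            extracted_string += fileReader.getPage(page_num).extract_text()
    # split each matching line into its fields and stash it
    rows.extend(line.split(' ') for line in regex_date.findall(extracted_string))

# one DataFrame construction instead of thousands of pd.concat calls
table = pd.DataFrame(rows, columns=['date_of_import', 'entry_num',
                                    'warehouse_code_num', 'declarant_ref_num',
                                    'declarant_EORI_num', 'VAT_due'])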
I'm new to the group, and to Python. I have a very specific type of input file that I'm working with. It is a text file with one header row of text. In addition, there is a column of text too, which makes things more annoying. What I want to do is read in this file and then perform operations on the columns of numbers (like average, stdev, etc.), but reading in the file and parsing out the text column is giving me trouble.
I've played with many different approaches and got close, but figured I'd reach out to the group here. If this were MATLAB I'd have had it down hours ago. As of now, if I use fixed widths to define my columns I think it will work, but I suspect there is a more efficient way to read in the lines and ignore the strings properly.
Here is the file format. As you can see, row one is header...so can be ignored.
column 1 contains text.
postraw.txt
....I think I figured it out. My code is probably very crude, but it works for now:
CTlist = []
CLlist = []
CDlist = []
CMZlist = []
LDelist = []
loopout = {'a1': CTlist, 'a2': CLlist, 'a3': CDlist, 'a4': CMZlist, 'a5': LDelist}

# Specify number of header lines
headerlines = 1
# set initial line index to 0
i = 0
# begin loop to process input file, skipping any header lines
with open('post.out', 'r') as file:
    for row in file:
        if i > (headerlines - 1):
            rowvars = row.split()
            for i in range(2, len(rowvars)):
                # print(rowvars[i])  # JUST A CHECK/DEBUG LINE
                loopout['a{0}'.format(i-1)].append(float(rowvars[i]))
        i = i + 1
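For what it's worth, a library can do most of this parsing. Below is a minimal sketch with pandas, assuming (as in the loop above) that post.out is whitespace-delimited, has one header line, and that the first two fields of each row are non-numeric and should be skipped; the column names simply mirror the list names in the code above.
import pandas as pd

# read the whitespace-delimited file, skipping the single header line;
# header=None keeps pandas from treating the first data row as column names
df = pd.read_csv('post.out', delim_whitespace=True, header=None, skiprows=1)

# drop the first two (non-numeric) columns, keeping the five numeric ones
numeric = df.iloc[:, 2:]
numeric.columns = ['CT', 'CL', 'CD', 'CMZ', 'LDe']

print(numeric.mean())  # per-column averages
print(numeric.std())   # per-column standard deviations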
I have 2 csv files (well, one of them is .tab), both of them with 2 columns of numbers. My job is to go through each row of the first file, and see if it matches any of the rows in the second file. If it does, I print a blank line to my output csv file. Otherwise, I print 'R,R' to the output csv file. My current algorithm does the following:
Scan each row of the second file (two integers each), go to the position of those two integers in a 2D array (so if the integers are 2 and 3, I'll go to position [2,3]) and assign a value of 1.
Go through each row of the first file, check if the position of the two integers of each row has a value of 1 in the array, and then print the according output to a third csv file.
Unfortunately the csv files are very large, so I instantly get "MemoryError:" when running this. What is an alternative for scanning through large csv files?
I am using Jupyter Notebook. My code:
import csv
import numpy

def SNP():
    thelines = numpy.ndarray((6639, 524525))
    tempint = 0
    tempint2 = 0
    with open("SL05_AO_RO.tab") as tsv:
        for line in csv.reader(tsv, dialect="excel-tab"):
            tempint = int(line[0])
            tempint2 = int(line[1])
            thelines[tempint, tempint2] = 1
    return thelines

def common_sites():
    tempint = 0
    tempint2 = 0
    temparray = SNP()
    print('Checkpoint.')
    with open('output_SL05.csv', 'w', newline='') as fp:
        with open("covbreadth_common_sites.csv") as tsv:
            for line in csv.reader(tsv, dialect="excel-tab"):
                tempint = int(line[0])
                tempint2 = int(line[1])
                if temparray[tempint, tempint2] == 1:
                    a = csv.writer(fp, delimiter=',')
                    data = [['', '']]
                    a.writerows(data)
                else:
                    a = csv.writer(fp, delimiter=',')
                    data = [['R', 'R']]
                    a.writerows(data)
    print('Done.')
    return

common_sites()
Files:
https://drive.google.com/file/d/0B5v-nJeoVouHUjlJelZtV01KWFU/view?usp=sharing and https://drive.google.com/file/d/0B5v-nJeoVouHSDI4a2hQWEh3S3c/view?usp=sharing
Your dataset really isn't that big, but it is relatively sparse. You aren't using a sparse structure to store the data, which is what is causing the problem.
Just use a set of tuples to store the seen data; the lookup in that set is then O(1), e.g.:
In [1]:
import csv
with open("SL05_AO_RO.tab") as tsv:
seen = set(map(tuple, csv.reader(tsv, dialect="excel-tab")))
with open("covbreadth_common_sites.csv") as tsv:
common = [line for line in csv.reader(tsv, dialect="excel-tab") if tuple(line) in seen]
common[:10]
Out[1]:
[['1049', '7280'], ['1073', '39198'], ['1073', '39218'], ['1073', '39224'], ['1073', '39233'],
['1098', '661'], ['1098', '841'], ['1103', '15100'], ['1103', '15107'], ['1103', '28210']]
10 loops, best of 3: 150 ms per loop
In [2]:
len(common), len(seen)
Out[2]:
(190, 138205)
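To produce the output file the question asks for, the same seen set can drive the writing step. A minimal sketch along the lines of the answer above, reusing the filenames and the blank-row/'R,R' convention from the question's own code:
import csv

with open("SL05_AO_RO.tab") as tsv:
    seen = set(map(tuple, csv.reader(tsv, dialect="excel-tab")))

with open("covbreadth_common_sites.csv") as tsv, \
     open("output_SL05.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for line in csv.reader(tsv, dialect="excel-tab"):
        # blank row for a match, 'R,R' otherwise, as described in the question
        writer.writerow(['', ''] if tuple(line) in seen else ['R', 'R'])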
I have 2 csv files (well, one of them is .tab), both of them with 2 columns of numbers. My job is to go through each row of the first file, and see if it matches any of the rows in the second file. If it does, I print a blank line to my output csv file. Otherwise, I print 'R,R' to the output csv file.
import numpy as np

f1 = np.loadtxt('SL05_AO_RO.tab')
f2 = np.loadtxt('covbreadth_common_sites.csv')

# sort whole rows (lexicographically by first then second column);
# sorting with f1.sort(axis=0) would sort each column independently and break the pairs
f1 = f1[np.lexsort((f1[:, 1], f1[:, 0]))]
f2 = f2[np.lexsort((f2[:, 1], f2[:, 0]))]

i, j = 0, 0
while i < f1.shape[0]:
    while j < f2.shape[0] and f1[i][0] > f2[j][0]:
        j += 1
    while j < f2.shape[0] and f1[i][0] == f2[j][0] and f1[i][1] > f2[j][1]:
        j += 1
    if j < f2.shape[0] and np.array_equal(f1[i], f2[j]):
        print()
    else:
        print('R,R')
    i += 1
Load the data into ndarrays to optimize memory usage.
Sort the data by whole rows.
Find matches by walking the two sorted arrays in step.
Total complexity is O(n*log(n) + m*log(m)), where n and m are the sizes of the input files.
Using a set() will not reduce the memory usage per unique entry, so I do not recommend it for large datasets.
Since a CSV is just a DB dump, import it into any SQL database and then run your query on it. This is a very efficient way to do it.
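As a rough illustration of that suggestion, here is a sketch using the standard library's sqlite3. The table names ro and sites are hypothetical, and it assumes both files are tab-delimited, as the question's own code treats them:
import csv
import sqlite3

conn = sqlite3.connect('sites.db')  # or ':memory:' if the data fits in RAM
conn.executescript("""
    CREATE TABLE ro    (a INTEGER, b INTEGER);
    CREATE TABLE sites (a INTEGER, b INTEGER);
    CREATE INDEX idx_ro ON ro (a, b);
""")

def load(table, filename):
    # the question's code reads both files as tab-delimited, so this does too
    with open(filename) as tsv:
        conn.executemany('INSERT INTO %s VALUES (?, ?)' % table,
                         csv.reader(tsv, dialect="excel-tab"))

load('ro', 'SL05_AO_RO.tab')
load('sites', 'covbreadth_common_sites.csv')
conn.commit()

with open('output_SL05.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    query = ("SELECT EXISTS(SELECT 1 FROM ro WHERE ro.a = sites.a AND ro.b = sites.b) "
             "FROM sites")
    for (matched,) in conn.execute(query):
        # blank row for a match, 'R,R' otherwise, as in the question
        writer.writerow(['', ''] if matched else ['R', 'R'])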
I am processing a CSV file in Python that's delimited by a comma (,).
Each column is a sampled parameter, for instance column 0 is time, sampled at once a second, column 1 is altitude sampled at 4 times a second, etc.
So columns will look like as below:
Column 0 -> ["Time", 0, " "," "," ",1]
Column 1 -> ["Altitude", 100, 200, 300, 400]
I am trying to create a list for each column that captures its name and all its data. That way I can do calculations and organize my data into a new file automatically (the sampled data I am working with has a substantial number of rows).
I want to do this for any file not just one, so the number of columns can vary.
Normally if every file was consistent I would do something like:
import csv

time = []
alt = []
dct = {}

with open('test.csv', "r") as csvfile:
    csv_f = csv.reader(csvfile)
    for row in csv_f:
        time.append(row[0])
        alt.append(row[1])  # etc. for all columns
I am pretty new to Python. Is this a good way to tackle this, and if not, what is a better methodology?
Thanks for your time
Pandas will probably work best for you. If you use read_csv from pandas, it will create a DataFrame based on the columns. It's roughly a dictionary of lists.
You can also use the .tolist() method from pandas to convert a column to a list, if you want a list specifically.
import pandas as pd

data = pd.read_csv("soqn.csv")
dict_of_lists = {}
for column_name in data.columns:
    temp_list = data[column_name].tolist()
    dict_of_lists[column_name] = temp_list
print(dict_of_lists)
EDIT:
dict_of_lists = {column_name: data[column_name].tolist() for column_name in data.columns}
# This dict comprehension might work faster.
I think I made my problem simpler and just focused on one column.
What I ultimately wanted to do was to interpolate to the highest sampling rate. So here is what I came up with... Please let me know if I can do anything more efficiently. I used A LOT of searching on this site to help build this. Again, I am new at Python (about 2-3 weeks, but some former programming experience).
import csv

header = []

# initialize variables
loc_int = 0
loc_fin = 0
temp_i = 0
temp_f = 0

with open('test2.csv', "r") as csvfile:  # open csv file
    csv_f = csv.reader(csvfile)
    for row in csv_f:
        header.append(row[0])  # make a list that consists of all content in column A

for x in range(0, len(header)-1):  # go through entire column
    if header[x].isdigit() and header[x+1] == "":  # find lower bound of sample to be interpolated
        loc_int = x
        temp_i = int(header[x])
    elif header[x+1].isdigit() and header[x] == "":  # find upper bound of sample to be interpolated
        loc_fin = x
        temp_f = int(header[x+1])
        if temp_f > temp_i:  # calculate interpolated values
            f_min_i = temp_f - temp_i
            interp = f_min_i / float((loc_fin+1) - loc_int)
            for y in range(loc_int, loc_fin+1):
                header[y] = temp_i + interp*(y - loc_int)

print(header)

with open("output.csv", 'wb') as g:  # write to new file
    writer = csv.writer(g)
    for item in header:
        writer.writerow([item])
I couldn't figure out how to write my new list "header", with its interpolated values, back in place of column A of my old file, test2.csv.
Anywho thank you very much for looking...
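For the interpolation itself, pandas can do this kind of gap-filling directly. A minimal sketch, assuming (as in the code above) that the values sit in the first column of test2.csv with blank cells between samples; Series.interpolate() performs the same linear interpolation the loop above computes by hand:
import pandas as pd

# read only the first column; blank cells become NaN
col = pd.read_csv('test2.csv', header=None, usecols=[0]).iloc[:, 0]

# coerce any stray text (e.g. the "Time" header cell) to NaN, then linearly
# interpolate the gaps; leading NaNs are left untouched by default
filled = pd.to_numeric(col, errors='coerce').interpolate()

filled.to_csv('output.csv', index=False, header=False)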
I am currently writing a script where I am creating a csv file ('tableau_input.csv') composed both of other csv files columns and columns created by myself. I tried the following code:
def make_tableau_file(mp, current_season=2016):
    # Produces a csv file containing predicted and actual game results for the current season
    # Tableau uses the contents of the file to produce the visualization
    game_data_filename = 'game_data' + str(current_season) + '.csv'
    datetime_filename = 'datetime' + str(current_season) + '.csv'
    with open('tableau_input.csv', 'wb') as writefile:
        tableau_write = csv.writer(writefile)
        tableau_write.writerow(['Visitor_Team', 'V_Team_PTS', 'Home_Team', 'H_Team_PTS', 'True_Result', 'Predicted_Results', 'Confidence', 'Date'])
        with open(game_data_filename, 'rb') as readfile:
            scores = csv.reader(readfile)
            scores.next()
            for score in scores:
                tableau_content = score[1::]
                # Append True_Result
                if int(tableau_content[3]) > int(tableau_content[1]):
                    tableau_content.append(1)
                else:
                    tableau_content.append(0)
                # Append 'Predicted_Result' and 'Confidence'
                prediction_results = mp.make_predictions(tableau_content[0], tableau_content[2])
                tableau_content += list(prediction_results)
                tableau_write.writerow(tableau_content)
        with open(datetime_filename, 'rb') as readfile2:
            days = csv.reader(readfile2)
            days.next()
            for day in days:
                tableau_write.writerow(day)
'tableau_input.csv' is the file I am creating. The columns 'Visitor_Team', 'V_Team_PTS', 'Home_Team', 'H_Team_PTS' come from 'game_data_filename' (e.g. tableau_content = score[1::]). The columns 'True_Result', 'Predicted_Results', 'Confidence' are created in the first for loop.
So far everything works, but when I then try to add data from 'datetime_filename' to the 'Date' column using the same structure as above, there is no data in the 'Date' column when I open my 'tableau_input' file. Can someone solve this problem?
For info, below are screenshots of the csv files, respectively 'game_data_filename' and 'datetime_filename' (nb: datetime values are in datetime format).
It's hard to test this as I don't really know what the input should look like, but try something like this:
def make_tableau_file(mp, current_season=2016):
    # Produces a csv file containing predicted and actual game results for the current season
    # Tableau uses the contents of the file to produce the visualization
    game_data_filename = 'game_data' + str(current_season) + '.csv'
    datetime_filename = 'datetime' + str(current_season) + '.csv'
    with open('tableau_input.csv', 'wb') as writefile:
        tableau_write = csv.writer(writefile)
        tableau_write.writerow(
            ['Visitor_Team', 'V_Team_PTS', 'Home_Team', 'H_Team_PTS', 'True_Result', 'Predicted_Results', 'Confidence', 'Date'])
        with open(game_data_filename, 'rb') as readfile, open(datetime_filename, 'rb') as readfile2:
            scoreReader = csv.reader(readfile)
            scores = [row for row in scoreReader]
            scores = scores[1::]  # drop the header row
            daysReader = csv.reader(readfile2)
            days = [day for day in daysReader]
            days = days[1::]  # drop this header row too (your code skipped it with days.next()), so the lists line up
            if len(scores) != len(days):
                print("File lengths do not match")
            else:
                for i in range(len(days)):
                    tableau_content = scores[i][1::]
                    tableau_date = days[i]
                    # Append True_Result
                    if int(tableau_content[3]) > int(tableau_content[1]):
                        tableau_content.append(1)
                    else:
                        tableau_content.append(0)
                    # Append 'Predicted_Result' and 'Confidence'
                    prediction_results = mp.make_predictions(tableau_content[0], tableau_content[2])
                    tableau_content += list(prediction_results)
                    tableau_content += tableau_date
                    tableau_write.writerow(tableau_content)
This combines both of the file reading parts into one.
As per your questions below:
scoreReader = csv.reader(readfile)
scores = [row for row in scoreReader]
scores = scores[1::]
This uses a list comprehension to create a list called scores, with every element being one of the rows from scoreReader. As scoreReader behaves like a generator, every time we ask it for a row, it spits one out for us, until there are no more.
The second line, scores = scores[1::], just chops off the first element of the list, as you don't want the header.
For more info try these:
Generators on Wiki
List Comprehensions
Good luck!
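One further thought, as a sketch only: if you would rather not hold both files in lists, the two readers can be paired lazily with zip() (itertools.izip on Python 2), at the cost of losing the explicit length check. This fragment is meant as a drop-in replacement for the inner with block of the function above, keeping the same variable names:
with open(game_data_filename, 'rb') as readfile, open(datetime_filename, 'rb') as readfile2:
    scoreReader = csv.reader(readfile)
    daysReader = csv.reader(readfile2)
    next(scoreReader)  # skip the header row of each file
    next(daysReader)
    for score, day in zip(scoreReader, daysReader):
        tableau_content = score[1::]
        # ... same True_Result / prediction / Confidence logic as above ...
        tableau_content += day
        tableau_write.writerow(tableau_content)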