IronPython code taking too long to execute?

I have a piece of code written in IronPython that reads data from a table in Spotfire and serializes it into a JSON object. It is taking too long to execute. Please suggest alternatives.
import clr
import sys
clr.AddReference('System.Web.Extensions')
from System.Web.Script.Serialization import JavaScriptSerializer
from Spotfire.Dxp.Data import IndexSet
from Spotfire.Dxp.Data import DataValueCursor

rowCount = MyTable.RowCount
rows = IndexSet(rowCount,True)
cols = MyTable.Columns

MyTableData=[]
for r in rows:
    list={}
    item={}
    for c in cols:
        item[c.Name] = c.RowValues.GetFormattedValue(r)
        list['MyData']=item
    MyTableData.append(list)

json=JavaScriptSerializer(MaxJsonLength=sys.maxint).Serialize(MyTableData)

Your code will be faster if you don't assign list['MyData'] = item for every column; you only need to do it once per row.
You could also use list and dictionary comprehensions instead of appending and looking up keys for every value:
MyTableData = [{'MyData': {column.Name: column.RowValues.GetFormattedValue(row)
                           for column in cols}}
               for row in rows]
If column.RowValues is an expensive operation, you may be better off looping over the columns first, which isn't as neat.
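For instance, here is a rough column-first sketch that reuses only the calls already shown above; it is hypothetical, and whether it actually helps depends on how Spotfire implements RowValues and on being able to iterate the IndexSet more than once:
names = [c.Name for c in cols]
columns_data = []
for c in cols:
    row_values = c.RowValues  # look the cursor up once per column
    columns_data.append([row_values.GetFormattedValue(r) for r in rows])
# zip the per-column lists back together into one dict per row
MyTableData = [{'MyData': dict(zip(names, values))} for values in zip(*columns_data)]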

Related

Python Multiprocessing write to csv data for huge volume files

I am trying to do a calculation and write the result to another txt file using a multiprocessing program. I am getting a count mismatch in the output txt file; every time I execute it I get a different output count.
I am new to Python, could someone please help.
import pandas as pd
import multiprocessing as mp

source = "\\share\usr\data.txt"
target = "\\share\usr\data_masked.txt"
Chunk = 10000

def process_calc(df):
    '''
    get source df do calc and return newdf
    ...
    '''
    return(newdf)

def calc_frame(df):
    output_df = process_calc(df)
    output_df.to_csv(target,index=None,sep='|',mode='a',header=False)

if __name__ == '__main__':
    reader= pd.read_table(source,sep='|',chunksize = chunk,encoding='ANSI')
    pool = mp.Pool(mp.cpu_count())
    jobs = []

    for each_df in reader:
        process = mp.Process(target=calc_frame,args=(each_df)
        jobs.append(process)
        process.start()

    for j in jobs:
        j.join()
You have several issues in your source as posted that would prevent it from even compiling, let alone running. I have attempted to correct those in an effort to also solve your main problem. But do check the code below thoroughly just to make sure the corrections make sense.
First, the args argument to the Process constructor should be specified as a tuple. You have specified args=(each_df), but (each_df) is not a tuple, it is just a parenthesized expression; you need (each_df,) to make it a tuple (the statement is also missing a closing parenthesis).
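In other words, keeping the rest of your loop unchanged, the corrected line would look something like this:
# args must be a one-element tuple: note the trailing comma and the closing parenthesis
process = mp.Process(target=calc_frame, args=(each_df,))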
Beyond that, you make no provision against multiple processes simultaneously attempting to append to the same file, and you cannot be assured of the order in which the processes complete, so you have no real control over the order in which the dataframes are appended to the CSV file.
The solution is to use a processing pool with the imap method. The iterable to pass to this method is just the reader, which when iterated returns the next dataframe to process. The return value from imap is an iterable that when iterated will return the next return value from calc_frame in task-submission order, i.e. the same order that the dataframes were submitted. So as these new, modified dataframes are returned, the main process can simply append these to the output file one by one:
import pandas as pd
import multiprocessing as mp

source = r"\\share\usr\data.txt"
target = r"\\share\usr\data_masked.txt"
Chunk = 10000

def process_calc(df):
    '''
    get source df do calc and return newdf
    ...
    '''
    return(newdf)

def calc_frame(df):
    output_df = process_calc(df)
    return output_df

if __name__ == '__main__':
    with mp.Pool() as pool:
        reader = pd.read_table(source, sep='|', chunksize=Chunk, encoding='ANSI')
        # imap returns the modified dataframes in submission order
        for output_df in pool.imap(calc_frame, reader):
            output_df.to_csv(target, index=None, sep='|', mode='a', header=False)

Comparing the number of lines in a CSV to the number successfully processed into a dataframe by Pandas?

We are using Pandas to read a CSV into a dataframe:
someDataframe = pandas.read_csv(
    filepath_or_buffer=our_filepath_here,
    error_bad_lines=False,
    warn_bad_lines=True
)
Since we are allowing bad lines to be skipped, we want to track how many were skipped and store that count somewhere so we can report a metric from it.
To do this, I was thinking of comparing how many rows we have in the dataframe vs the number of rows in the original file.
I think this does what I want:
someDataframe = pandas.read_csv(
    filepath_or_buffer=our_filepath_here,
    error_bad_lines=False,
    warn_bad_lines=True
)
initialRowCount = sum(1 for line in open('our_filepath_here'))
difference = initialRowCount - len(someDataframe.index)
But the hardware running this is very limited, and I would rather not open the file and iterate through the whole thing just to get a row count when we are already going through it once via .read_csv. Does anyone know of a better way to get both the successfully processed count and the initial row count for the CSV?
Though I haven't tested this personally, I believe you can count the number of skipped lines by capturing the warnings generated and checking the length of the returned list, then adding that to the current length of your dataframe:
import warnings
import pandas as pd

with warnings.catch_warnings(record=True) as warning_list:
    someDataframe = pd.read_csv(
        filepath_or_buffer=our_filepath_here,
        error_bad_lines=False,
        warn_bad_lines=True
    )

# May want to check that each warning object is a pandas "bad line" warning
number_of_warned_lines = len(warning_list)
initialRowCount = len(someDataframe) + number_of_warned_lines
https://docs.python.org/3/library/warnings.html#warnings.catch_warnings
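If you go that route, you would probably want to count only the parser's bad-line warnings rather than everything captured; assuming they are raised as ParserWarning (which depends on the pandas version), a minimal sketch would be:
from pandas.errors import ParserWarning
# keep only warnings whose category is the pandas parser warning
number_of_warned_lines = sum(
    1 for w in warning_list if issubclass(w.category, ParserWarning)
)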
Edit: it took a little bit of toying around, but this seems to work with pandas. Instead of depending on the warnings built-in, we temporarily redirect stderr, then count the number of times "Skipping line" occurs in the captured output; that gives the number of bad lines that produced this warning message:
import contextlib
import io
import pandas as pd

bad_data = io.StringIO("""
a,b,c,d
1,2,3,4
f,g,h,i,j,
l,m,n,o
p,q,r,s
7,8,9,10,11
""".lstrip())

new_stderr = io.StringIO()
with contextlib.redirect_stderr(new_stderr):
    df = pd.read_csv(bad_data, error_bad_lines=False, warn_bad_lines=True)

n_warned_lines = new_stderr.getvalue().count("Skipping line")
print(n_warned_lines)  # 2
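On newer pandas, where error_bad_lines/warn_bad_lines have been deprecated and removed, another option (assuming pandas >= 1.4, where on_bad_lines accepts a callable when using the python engine) is to count the bad lines yourself:
import io
import pandas as pd

bad_lines = []

def note_bad_line(fields):
    # called once per malformed row; returning None tells pandas to skip it
    bad_lines.append(fields)
    return None

df = pd.read_csv(
    io.StringIO("a,b,c,d\n1,2,3,4\nf,g,h,i,j\n5,6,7,8\n"),
    on_bad_lines=note_bad_line,
    engine="python",
)
print(len(bad_lines))  # 1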

Pandas read_csv not reading all rows in file

I am trying to read a csv file with pandas. The file has 14993 lines after the header.
data = pd.read_csv(filename, usecols=['tweet', 'Sentiment'])
print(len(data))
It prints 14900, and if I add one line to the end of the file it becomes 14901 rows, so it is not a memory limit or anything like that. I also tried error_bad_lines but nothing changed.
By the names of your headers one can suspect that you have free text. That can easily trip up any CSV parser.
In any case, here's a version that lets you track down inconsistencies in the CSV, or at least gives a hint of what to look for, and then puts the data into a dataframe.
import csv
import pandas as pd

with open('file.csv') as fc:
    creader = csv.reader(fc)  # add settings as needed
    rows = [r for r in creader]

# check consistency of rows
print(len(rows))
print(set((len(r) for r in rows)))
print(tuple(((i, r) for i, r in enumerate(rows) if len(r) == bogus_nbr)))  # bogus_nbr: the unexpected field count you are hunting for

# find bogus lines and modify them in memory, or fix the csv and re-read it.

# assuming there are headers...
columns = list(zip(*rows))
df = pd.DataFrame({k: v for k, *v in columns if k in ['tweet', 'Sentiment']})
If the dataset is really big, the code should be rewritten to only use generators (which is not that hard to do).
The only thing not to forget when using a technique like this is that numeric columns come in as strings and should be recast to a suitable datatype if needed, but that becomes self-evident if one attempts to do math on a dataframe filled with strings.
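For example, assuming the 'Sentiment' column actually holds numbers, it could be recast after loading:
# the zip-based load leaves everything as strings, so convert numeric columns explicitly
df['Sentiment'] = pd.to_numeric(df['Sentiment'], errors='coerce')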

Python for loop to read csv using pandas

I can combine 2 CSV files with this script and it works well:
import pandas
csv1=pandas.read_csv('1.csv')
csv2=pandas.read_csv('2.csv')
merged=csv1.merge(csv2,on='field1')
merged.to_csv('output.csv',index=False)
Now, I would like to combine more than 2 CSVs using the same method as above.
I have a list of CSV files which I defined like this:
import pandas
collection=['1.csv','2.csv','3.csv','4.csv']
for i in collection:
    csv=pandas.read_csv(i)
    merged=csv.merge(??,on='field1')
merged.to_csv('output2.csv',index=False)
I haven't got it to work so far with more than one CSV. I guess it is just a matter of iterating inside the list. Any idea?
You need special handling for the first loop iteration:
import pandas

collection=['1.csv','2.csv','3.csv','4.csv']
result = None
for i in collection:
    csv=pandas.read_csv(i)
    if result is None:
        result = csv
    else:
        result = result.merge(csv, on='field1')

if result is not None:
    result.to_csv('output2.csv',index=False)
Another alternative would be to load the first CSV outside the loop but this breaks when the collection is empty:
import pandas

collection=['1.csv','2.csv','3.csv','4.csv']
result = pandas.read_csv(collection[0])
for i in collection[1:]:
    csv = pandas.read_csv(i)
    result = result.merge(csv, on='field1')

result.to_csv('output2.csv',index=False)
I don't know how to create an empty document (?) in pandas but that would work, too:
import pandas

collection=['1.csv','2.csv','3.csv','4.csv']
result = pandas.create_empty() # not sure how to do this
for i in collection:
    csv = pandas.read_csv(i)
    result = result.merge(csv, on='field1')

result.to_csv('output2.csv',index=False)
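As a side note, this kind of chained merge is often written with functools.reduce; here is a sketch, assuming every file really has a 'field1' column:
import functools
import pandas

collection = ['1.csv', '2.csv', '3.csv', '4.csv']
frames = [pandas.read_csv(name) for name in collection]

if frames:  # nothing to do for an empty collection
    # merge pairwise, left to right, on the shared key column
    merged = functools.reduce(lambda left, right: left.merge(right, on='field1'), frames)
    merged.to_csv('output2.csv', index=False)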

Getting my output into another excel file

import os, sys
from xlrd import open_workbook
from xlutils.copy import copy
from xlwt import easyxf, Style
import time

rb = open_workbook('A1.xls', on_demand=True, formatting_info=True)
rs = rb.sheet_by_index(0)
wb = copy(rb)
ws = wb.get_sheet(0)
start = time.time()

g1 = dict()
for row in range(1,rs.nrows):
    for cell in row:
        cellContent = str(cell.value)
        if cellContent not in g1.keys():
            g1[cellContent]=1
        else:
            g1[cellContent]=g1[cellContent]+1

for cellContent in g1.keys():
    print cellContent, g1[cellContent]
    ws.write(row,1, cellContent)

wb.save('A2.xls')
When I run this code, I get an error message saying the cell object is not iterable.
What could have gone wrong?
I am not familiar myself with xlrd or any of the other modules, but for any work with CSV or Excel spreadsheets I use pandas. It lets you easily read a workbook, make all sorts of modifications, and then write it out again very easily. If all you wanted was to copy the data, it would be really easy.
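For illustration, here is a rough pandas sketch of the same counting job; the output filename 'A2.xlsx' and reading the first sheet are assumptions, not taken from the question:
import pandas as pd

# read the first sheet, then count how often each value occurs across all cells,
# similar to the g1 dictionary built in the question
df = pd.read_excel('A1.xls', sheet_name=0)
counts = df.astype(str).stack().value_counts()
counts.to_excel('A2.xlsx')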
The problem you've got is that row is an integer, because it is populated by for row in range(1, rs.nrows): and range() yields integers, in your case each row number between 1 and the number of rows in your spreadsheet.
I'm not familiar with how the xlrd, xlutils and xlwt modules work, but I'd imagine you want to do something more like the following:
for row_number in range(1, rs.nrows):
    row = rs.row(row_number)
    for cell in row:
        ....
The Sheet.row(rowx) method gives you a sequence of Cell objects that you can iterate in your inner loop.
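Putting that together with the counting logic from the question, a minimal sketch of the fixed loop (reusing the rs sheet object from above) could be:
# tally how often each cell value appears, iterating real Cell objects
# returned by Sheet.row() instead of the bare integer row index
g1 = {}
for row_number in range(1, rs.nrows):
    for cell in rs.row(row_number):
        cellContent = str(cell.value)
        g1[cellContent] = g1.get(cellContent, 0) + 1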
