Python: efficient way to create new csv from large dataset

Python: efficient way to create new csv from large dataset - python

I have a script that removes "bad elements" from a master list of elements, then returns a csv with the updated elements and their associated values.
My question, is whether there is a more efficient way to perform the same operation in the for loop?
Master=pd.read_csv('some.csv', sep=',',header=0,error_bad_lines=False)
MasterList = Master['Elem'].tolist()
MasterListStrain1 = Master['Max_Principal_Strain'].tolist()
#this file should contain elements that are slated for deletion
BadElem=pd.read_csv('delete_me_elements_column.csv', sep=',',header=None, error_bad_lines=False)
BadElemList = BadElem[0].tolist()
NewMasterList = (list(set(MasterList) - set(BadElemList)))
filename = 'NewOutput.csv'
outfile = open(filename,'w')
#pdb.set_trace()
for i,j in enumerate(NewMasterList):
#pdb.set_trace()
Elem_Loc = MasterList.index(j)
line ='\n%s,%.25f'%(j,MasterListStrain1[Elem_Loc])
outfile.write(line)
print ("\n The new output file will be named: " + filename)
outfile.close()

Stage 1
If you necessarily want to iterate in the for loop then besides using pd.to_csv which likely to improve performance you can do the following:
...
SetBadElem = set(BadElemList)
...
for i,Elem_Loc in enumerate(MasterList):
if Elem_Loc not in SetBadElem:
line ='\n%s,%.25f'%(j,MasterListStrain1[Elem_Loc])
outfile.write(line)
Jumping around the index is never efficient whereas iteration with skipping will give you much better performance (checking presence in a set is log n operation so it is relatively quick).
Stage 2 Using Pandas properly
...
SetBadElem = set(BadElemList)
...
for Elem in Master:
if Elem not in SetBadElem:
line ='\n%s,%.25f'%(Elem['elem'], Elem['Max_Principal_Strain'])
outfile.write(line)
There is no need to create lists out of pandas dataframe columns. Using the whole dataframe (and indexing into it) is a much better approach.
Stage 3 Removing messy iterated formatting operations
We can add a column ('Formatted') that will contain formatted data. For that we will create a lambda function:
formatter = lambda row: '\n%s,%.25f'%(row['elem'], row['Max_Principal_Strain'])
Master['Formatted'] = Master.apply(formatter)
Stage 4 Pandas-way filtering and output
We can format the dataframe in two ways. My preference is to reuse the formatting function:
import numpy as np
formatter = lambda row: '\n%s,%.25f'%(row['elem'], row['Max_Principal_Strain']) if row not in SetBadElem else np.nan
Now we can use the built-in dropna which drops all rows that have any NaN values
Master.dropna()
Master.to_csv(filename)

Related

Comparing number of lines in a CSV compared to number successfully processed into dataframe by Pandas?

We are using Pandas to read a CSV into a dataframe:
someDataframe = pandas.read_csv(
filepath_or_buffer=our_filepath_here,
error_bad_lines=False,
warn_bad_lines=True
)
Since we are allowing bad lines to be skipped, we want to be able to track how many have been skipped and put it into a value so that we can metric off of it.
To do this, I was thinking of comparing how many rows we have in the dataframe vs the number of rows in the original file.
I think this does what I want:
someDataframe = pandas.read_csv(
filepath_or_buffer=our_filepath_here,
error_bad_lines=False,
warn_bad_lines=True
)
initialRowCount = sum(1 for line in open('our_filepath_here'))
difference = initialRowCount - len(someDataframe.index))
But the hardware running this is super limited and I would rather not open the file and iterate through the whole thing just to get a row count when we're already going through the whole thing once via .read_csv. Does anyone know of a better way to get both the successfully processed count and the initial row count for the CSV?

Though I haven't tested this personally, I believe you can count the number of warnings generated by capturing them and checking the length of the returned list of captured warnings. Then add that to current shape of your dataframe:
import warnings
import pandas as pd
with warnings.catch_warnings(record=True) as warning_list:
someDataframe = pandas.read_csv(
filepath_or_buffer=our_filepath_here,
error_bad_lines=False,
warn_bad_lines=True
)
# May want to check if each warning object a pandas "bad line warning"
number_of_warned_lines = len(warning_list)
initialRowCount = len(someDataframe) + number_of_warned_lines
https://docs.python.org/3/library/warnings.html#warnings.catch_warnings
Edit: took a little bit of toying around, but this seems to work with Pandas. Instead of depending on the warnings built-in, we'll just temporarily redirect stderr. Then we can count the number of times "Skipping Lines" occurs in that string and we'll end with the count of bad lines with this warning message!
import contextlib
import io
bad_data = io.StringIO("""
a,b,c,d
1,2,3,4
f,g,h,i,j,
l,m,n,o
p,q,r,s
7,8,9,10,11
""".lstrip())
new_stderr = io.StringIO()
with contextlib.redirect_stderr(new_stderr):
df = pd.read_csv(bad_data, error_bad_lines=False, warn_bad_lines=True)
n_warned_lines = new_stderr.getvalue().count("Skipping line")
print(n_warned_lines) # 2

Iterate through a for loop using multiple cores in Python

I have the following code that is currently running like normal Python code:
def remove_missing_rows(app_list):
print("########### Missing row removal ###########")
missing_rows = []
''' Remove any row that has missing data in the name, id, or description column'''
for row in app_list:
if not row[1]:
missing_rows.append(row)
continue # Continue loop to next row. No need to check more columns
if not row[5]:
missing_rows.append(row)
continue # Continue loop to next row. No need to check more columns
if not row[4]:
missing_rows.append(row)
print("Number of missing entries: " + str(len(missing_rows))) # 967 with current method
# Remove the missing_rows from the original data
app_list = [row for row in app_list if row not in missing_rows]
return app_list
Now, after writing this for a smaller sample I wish to run this on a very large data set. To do this I thought it would be useful to utilise the multiple cores of my computer.
I'm struggling to implement this using the multiprocessing module though. E.g. The idea I have is that Core 1 could work through the first half of the data set, while Core 2 would work through the last half. Etc. And do this in parallel. Is this possible?

This is probably not cpu bound. Try the code below.
I've used a set for very fast (hash-based) contains (you use it when you invoke if row not in missing_rows, and it's very slow for a long list).
If this is the csv module you're already holding tuples which are hashable so not many changes needed:
def remove_missing_rows(app_list):
print("########### Missing row removal ###########")
filterfunc = lambda row: not all([row[1], row[4], row[5]])
missing_rows = set(filter(filterfunc, app_list))
print("Number of missing entries: " + str(len(missing_rows))) # 967 with current method
# Remove the missing_rows from the original data
# note: should be a lot faster with a set
app_list = [row for row in app_list if row not in missing_rows]
return app_list

You can use filter, to not iterate twice:
def remove_missing_rows(app_list):
filter_func = lambda row: all((row[1], row[4], row[5]))
return list(filter(filter_func, app_list))
But if you are doing data analysis, you probably should have a look into pandas.
There you could do something like this:
import pandas as pd
df = pd.read_csv('your/csv/data/file', usecols=(1, 4, 5))
df = df.dropna() # remove missing values

Smart way to read big input file with multiple unmarked variables (assorted in columns) in python

I have the following code that runs for over a million lines. But this takes a lot of time. Is there a better way to read in such files? The current code looks like this:
for line in lines:
line = line.strip() #Strips extra characters from lines
columns = line.split() #Splits lines into individual 'strings'
x = columns[0] #Reads in x position
x = float(x) #Converts the strings to float
y = columns[1] #Reads in y
y = float(y) #Converts the strings to float
z = columns[2] #Reads in z
z = float(z) #Converts the strings to float
The file data looks like this:
347.528218024 354.824474847 223.554247185 -47.3141937738 -18.7595743981
317.843928028 652.710791858 795.452586986 -177.876355361 7.77755408015
789.419369714 557.566066378 338.090799912 -238.803813301 -209.784710166
449.259334688 639.283337249 304.600907059 26.9716202117 -167.461497735
739.302109761 532.139588049 635.08307865 -24.5716064556 -91.5271790951
I want to extract each number from different columns. Every element in a column is the same variable. How do I do that? For example I want a list, l, say to store the floats of first column.

It would be helpful to know what you are planning on doing with the data, but you might try:
data = [map(float, line.split()) for line in lines]
This will give you a list of lists with your data.

Pandas is built for this (among many other things)!
It uses numpy, which uses C under the hood and is very fast. (Actually, depending on what you're doing with the data, you may want to use numpy directly instead of pandas. However, I'd only do that after you've tried pandas; numpy is lower level and pandas will make your life easier.)
Here's how you could read in your data:
import pandas as pd
with open('testfile', 'r') as f:
d = pd.read_csv(f, delim_whitespace=True, header=None,
names=['delete me','col1','col2','col3','col4','col5'])
d = d.drop('delete me',1) # the first column is all spaces and gets interpreted
# as an empty column, so delete it
print d
This outputs:
col1 col2 col3 col4 col5
0 347.528218 354.824475 223.554247 -47.314194 -18.759574
1 317.843928 652.710792 795.452587 -177.876355 7.777554
2 789.419370 557.566066 338.090800 -238.803813 -209.784710
3 449.259335 639.283337 304.600907 26.971620 -167.461498
4 739.302110 532.139588 635.083079 -24.571606 -91.527179
The result d in this case is a powerful data structure called a dataframe that gives you a lot of options for manipulating the data very quickly.
As a simple example, this adds the two first columns and gets the mean of the result:
(d['col1'] + d['col2']).mean() # 1075.97544372
Pandas also handles missing data very nicely; if there are missing/bad values in the data file, pandas will simply replace them with NaN or None as appropriate when it reads them in.
Anyways, for fast,easy data analysis, I highly recommend this library.

Using CSV arrays as an input to Python

I have been presented with a csv file that is full of 100+ arrays that I need to run through my data analysis code but I am not sure how to read these arrays in Python. Each array is preceded with a line that includes only an integer that gives the number of rows in the array and ends with the line '1234567890' to be used as a line separator.
Here is a snippet of the csv file:
7,,,,,,,
1,-199.117,-105.4,-4.525,227.5415,225.2925647,-0.0198891,-2.6547518
2,133.0423,55.4573,-48.4174,155.16,144.1380093,-0.322813,0.3949385
3,129.8405,-16.9527,-303.3192,331.0847,130.9425427,-1.5644458,-0.1298311
4,-73.6373,71.4677,151.517,183.9712,102.616198,1.1678785,2.3711453
5,41.2654,10.4196,30.3773,54.0915,42.5605604,0.6351541,0.2473322
6,-20.3159,-32.4484,62.4574,74.8581,38.2836056,1.2022641,-2.1301853
7,-13.2904,22.029,-28.2895,38.5096,25.7276422,-0.9386666,2.1136489
1234567890,,,,,,,
5,,,,,,,
1,-136.0755,-204.2787,-48.2127,259.2592,245.4512762,-0.1881526,-2.158425
2,220.5184,46.9388,-113.6448,265.1745,225.4586784,-0.4581388,0.2097266
3,-45.3132,169.6283,-49.2729,188.9506,175.576326,-0.2669358,1.8318334
4,-40.7141,34.7414,25.5414,60.9535,53.5219844,0.4465159,2.4351851
5,15.3863,-49.6703,17.1692,56.7635,51.9988166,0.312235,-1.2704018
1234567890,,,,,,,
6,,,,,,,
1,-19.3083,295.4128,191.8666,360.3712,296.0431267,0.5935079,1.6360639
2,-169.8708,-128.3904,-1.0052,215.4187,212.9323449,-0.0046663,-2.4943822
3,15.4505,-209.6656,-178.0715,279.4077,210.2341118,-0.7536439,-1.4972381
4,172.4142,13.0485,-63.7912,192.2842,172.9072576,-0.3447988,0.0755371
5,16.7456,24.8768,-46.5025,55.9188,29.9878358,-1.1933262,0.9783247
6,-8.911,4.1138,12.7751,17.7283,9.8147477,0.9089022,2.7090895
1234567890,,,,,,,
I am certain I could import the array if the csv was just one big array but I am stumped when it comes to picking one array out of many. The data analysis needs to be run on the temporary array before it is replaced with the next array in the csv file.

You could use itertools.groupby to parse the rows into separate arrays:
import csv
import itertools
with open('errors','w') as err: pass
with open('data','r') as f:
for key, group in itertools.groupby(
csv.reader(f),
lambda row: row[0].startswith('1234567890')):
if key: continue # key is True means we've reach the end of an array
group=list(group) # group is an iterator; we turn it into a list
array=group[1:] # everything but the first row is data
arr_length=int(group[0][0]) # first row contains the length
if arr_length != len(array): # sanity check
with open('errors','a') as err:
err.write('''\
Data file claims arr_length = {l}
{a}
{h}
'''.format(l=arr_length,a=str(list(array)),h='-'*80))
print(array)
itertools.groupby returns an iterator. It loops through the rows in csv.reader(f), and applies the lambda function to each row. The lambda function returns True when the row starts with '1234567890'. The return value (e.g. True or False) is called the key. The important point is that itertools.groupby collects together all contiguous rows that return the same key.

This should give you a nicely formatted variable called "data" to work with.
import csv
rows = csv.reader(open('your_file.csv'))
data = []
temp = []
for row in rows:
if '1234567890' in row:
data.append(temp)
temp = []
continue
else:
temp.append(row)
if temp != []:
data.append(temp)

Need more efficient way to parse out csv file in Python

Here's a sample csv file
id, serial_no
2, 500
2, 501
2, 502
3, 600
3, 601
This is the output I'm looking for (list of serial_no withing a list of ids):
[2, [500,501,502]]
[3, [600, 601]]
I have implemented my solution but it's too much code and I'm sure there are better solutions out there. Still learning Python and I don't know all the tricks yet.
file = 'test.csv'
data = csv.reader(open(file))
fields = data.next()
for row in data:
each_row = []
each_row.append(row[0])
each_row.append(row[1])
zipped_data.append(each_row)
for rec in zipped_data:
if rec[0] not in ids:
ids.append(rec[0])
for id in ids:
for rec in zipped_data:
if rec[0] == id:
ser_no.append(rec[1])
tmp.append(id)
tmp.append(ser_no)
print tmp
tmp = []
ser_no = []
**I've omitted var initializing for simplicity of code
print tmp
Gives me output I mentioned above. I know there's a better way to do this or pythonic way to do it. It's just too messy! Any suggestions would be great!

from collections import defaultdict
records = defaultdict(list)
file = 'test.csv'
data = csv.reader(open(file))
fields = data.next()
for row in data:
records[row[0]].append(row[1])
#sorting by ids since keys don't maintain order
results = sorted(records.items(), key=lambda x: x[0])
print results
If the list of serial_nos need to be unique just replace defaultdict(list) with defaultdict(set) and records[row[0]].append(row[1]) with records[row[0]].add(row[1])

Instead of a list, I'd make it a collections.defaultdict(list), and then just call the append() method on the value.
result = collections.defaultdict(list)
for row in data:
result[row[0]].append(row[1])

Here's a version I wrote, looks like there are plenty of answers for this one already though.
You might like using csv.DictReader, gives you easy access to each column by field name (from the header / first line).
#!/usr/bin/python
import csv
myFile = open('sample.csv','rb')
csvFile = csv.DictReader(myFile)
# first row will be used for field names (by default)
myData = {}
for myRow in csvFile:
myId = myRow['id']
if not myData.has_key(myId): myData[myId] = []
myData[myId].append(myRow['serial_no'])
for myId in sorted(myData):
print '%s %s' % (myId, myData[myId])
myFile.close()

Some observations:
0) file is a built-in (a synonym for open), so it's a poor choice of name for a variable. Further, the variable actually holds a file name, so...
1) The file can be closed as soon as we're done reading from it. The easiest way to accomplish that is with a with block.
2) The first loop appears to go over all the rows, grab the first two elements from each, and make a list with those results. However, your rows already all contain only two elements, so this has no net effect. The CSV reader is already an iterator over rows, and the simple way to create a list from an iterator is to pass it to the list constructor.
3) You proceed to make a list of unique ID values, by manually checking. A list of unique things is better known as a set, and the Python set automatically ensures uniqueness.
4) You have the name zipped_data for your data. This is telling: applying zip to the list of rows would produce a list of columns - and the IDs are simply the first column, transformed into a set.
5) We can use a list comprehension to build the list of serial numbers for a given ID. Don't tell Python how to make a list; tell it what you want in it.
6) Printing the results as we get them is kind of messy and inflexible; better to create the entire chunk of data (then we have code that creates that data, so we can do something else with it other than just printing it and forgetting it).
Applying these ideas, we get:
filename = 'test.csv'
with open(filename) as in_file:
data = csv.reader(in_file)
data.next() # ignore the field labels
rows = list(data) # read the rest of the rows from the iterator
print [
# We want a list of all serial numbers from rows with a matching ID...
[serial_no for row_id, serial_no in rows if row_id == id]
# for each of the IDs that there is to match, which come from making
# a set from the first column of the data.
for id in set(zip(*rows)[0])
]
We can probably do even better than this by using the groupby function from the itertools module.

example using itertools.groupby. This only works if the rows are already grouped by id
from csv import DictReader
from itertools import groupby
from operator import itemgetter
filename = 'test.csv'
# the context manager ensures that infile is closed when it goes out of scope
with open(filename) as infile:
# group by id - this requires that the rows are already grouped by id
groups = groupby(DictReader(infile), key=itemgetter('id'))
# loop through the groups printing a list for each one
for i,j in groups:
print [i, map(itemgetter(' serial_no'), list(j))]
note the space in front of ' serial_no'. This is because of the space after the comma in the input file

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python: efficient way to create new csv from large dataset - python

Related

Comparing number of lines in a CSV compared to number successfully processed into dataframe by Pandas?

Iterate through a for loop using multiple cores in Python

Smart way to read big input file with multiple unmarked variables (assorted in columns) in python

Using CSV arrays as an input to Python

Need more efficient way to parse out csv file in Python

Categories

Resources