Pandas GroupBy Mean of Large DataSet in CSV - python

A common SQLism is "Select A, mean(X) from table group by A" and I would like to replicate this in pandas. Suppose that the data is stored in something like a CSV file and is too big to be loaded into memory.
If the CSV could fit in memory a simple two-liner would suffice:
data=pandas.read_csv("report.csv")
mean=data.groupby(data.A).mean()
When the CSV cannot be read into memory one might try:
chunks=pandas.read_csv("report.csv",chunksize=whatever)
cmeans=pandas.concat([chunk.groupby(chunk.A).mean() for chunk in chunks])
badMeans=cmeans.groupby(level=0).mean()
Except that the resulting cmeans table contains repeated entries for each distinct value of A, one for each appearance of that value of A in distinct chunks (since read_csv's chunksize knows nothing about the grouping fields). As a result the final badMeans table has the wrong answer: a plain mean of the per-chunk means ignores how many rows each chunk contributed, so what is really needed is a weighted average.
So a working approach seems to be something like:
final=pandas.DataFrame({"A":[],"tot":[],"cnt":[]})
for chunk in chunks:
    t=chunk.groupby(chunk.A)["X"].sum()
    c=chunk.groupby(chunk.A)["X"].count()
    cmean=pandas.DataFrame({"tot":t,"cnt":c}).reset_index()
    joined=pandas.concat([final,cmean])
    final=joined.groupby("A").sum().reset_index()
mean=final.tot/final.cnt
Am I missing something? This seems insanely complicated... I would rather write a for loop that processes a CSV line by line than deal with this. There has to be a better way.

I think you could do something like the following which seems a bit simpler to me. I made the following data:
id,val
A,2
A,5
B,4
A,2
C,9
A,7
B,6
B,1
B,2
C,4
C,4
A,6
A,9
A,10
A,11
C,12
A,4
A,4
B,6
B,5
C,7
C,8
B,9
B,10
B,11
A,20
I'll do chunks of 5:
import pandas as pd
chunks = pd.read_csv("foo.csv",chunksize=5)
pieces = [x.groupby('id')['val'].agg(['sum','count']) for x in chunks]
agg = pd.concat(pieces).groupby(level=0).sum()
print(agg['sum']/agg['count'])
id
A 7.272727
B 6.000000
C 7.333333
Compared to the non-chunk version:
df = pd.read_csv('foo.csv')
print(df.groupby('id')['val'].mean())
id
A 7.272727
B 6.000000
C 7.333333
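If this comes up a lot, the same idea drops naturally into a small helper function; here is a minimal sketch (the function name, column arguments and default chunk size are just placeholders):
import pandas as pd

def chunked_group_mean(path, by, val, chunksize=100000):
    # accumulate per-group sum and count one chunk at a time
    pieces = [chunk.groupby(by)[val].agg(['sum', 'count'])
              for chunk in pd.read_csv(path, chunksize=chunksize)]
    agg = pd.concat(pieces).groupby(level=0).sum()
    return agg['sum'] / agg['count']

print(chunked_group_mean("foo.csv", by="id", val="val", chunksize=5))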


Does anyone have a better way of efficiently creating a DataFrame from 60,000 txt files with keys in one column and values in the second?

Disclaimer: this is my first post ever, so sorry if I don't meet certain standards of the community.
I use python3, Jupyter Notebooks, Pandas
I used KMC kmer counter to count kmers of 60,000 DNA sequences in a reasonable amount of time. I want to use these kmer counts as input to ML algorithms as part of a Bag Of Words model.
The shape of a file containing kmer counts is as below (I have 60K such files):
AAAAAC     2
AAAAAG     6
AAAAAT      2
AAAACC     4
AAAACG     2
AAAACT     3
AAAAGA     5
I want to create a single DataFrame from all 60K files, with one row per DNA sequence holding that sequence's kmer counts, which would have this form:
(image: the target DataFrame shape - one row per sequence, one column per kmer)
A first approach was successful and I managed to import 100 sequences (100 txt files) in 58 seconds, using this code:
import time
import pandas as pd

countsPath = r'D:\DataSet\MULTI\bow\6mer'
k = 6                       # kmer length
df = pd.DataFrame()         # accumulator for all sequences
start = time.time()
for i in range(0, 60000):
    sample = pd.read_fwf(countsPath + r'\kmers-' + str(k) + '-seqNb-' + str(i) + '.txt', sep=" ", header=None).T
    new_header = sample.iloc[0]    # grab the first row for the header
    sample = sample[1:]            # take the data less the header row
    sample.columns = new_header    # set the header row as the df header
    df = df.append(sample, ignore_index=True)   # append sample to the df dataset
end = time.time()
# total time taken
print(f"Runtime of the program is {end - start} secs")
# display(sample)
display(df)
However, this was very slow, taking 59 secs for 100 files; the full dataset would be roughly 600 times that.
I tried a dask DataFrame/Bag to accelerate the process, because it reads dictionary-like data, but I couldn't append each file as a row. The resulting dask DataFrame is as follows:
0          AAAAA   18
1          AAAAC   16
2          AAAAG   13
...
1023   TTTTT   14
0          AAAAA   5
1          AAAAC   4
...
1023   TTTTT   9
0          AAAAA   18
1          AAAAC   16
2          AAAAG   13
3          AAAAT   12
4          AAACA   11
So the files are being inserted in a single column.
Does anyone have a better way of efficiently creating a DataFrame from 60K txt files?
Love the disclaimer. I have a similar one - this is the first time I'm trying to answer a question. But I'm pretty certain I got this...and so will you:
dict_name = dict(zip(df['column_name'],df['the_other_column_name']))
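Tying that back to the 60K-file problem, one way to use it would be to build one {kmer: count} dict per file and construct the DataFrame once at the end, instead of appending inside the loop. A rough sketch, assuming the path pattern and kmer length from the question:
import pandas as pd

countsPath = r'D:\DataSet\MULTI\bow\6mer'   # path from the question (assumption)
k = 6                                       # kmer length (assumption)

rows = []
for i in range(60000):
    path = countsPath + r'\kmers-' + str(k) + '-seqNb-' + str(i) + '.txt'
    # two whitespace-separated columns: kmer and count
    sample = pd.read_csv(path, sep=r'\s+', header=None, names=['kmer', 'count'])
    rows.append(dict(zip(sample['kmer'], sample['count'])))   # one {kmer: count} dict per sequence

# build the DataFrame in one shot; kmers missing from a file become 0
df = pd.DataFrame(rows).fillna(0)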

Are pandas and numpy any good for manipulating non-numeric data?

I've been going in circles for days now, and I've run out of steam. Doesn't help that I'm new to python / numpy / pandas etc.
I started with numpy which led me to pandas, because of a GIS function that delivers a numpy array of data. That is my starting point. I'm trying to get to an endpoint being a small enriched dataset, in an excel spreadsheet.
But extracting that data and then manipulating it with the numpy toolset feels like going down a rabbit hole. The delivered data is one-dimensional, but each row contains 8 fields. A simple conversion to pandas and then to ndarray magically makes it all good, except that I lose the headers in the process, and it just snowballs from there.
I've had to re-evaluate my understanding based on some feedback on another post, and that's fine. But I'm just going in circles. Example after example seems to use predominantly numerical data, and I'm starting to get the feeling that's where its strength lies; trying to use it for a more non-mathematical, non-numerical purpose feels like barking up the wrong tree.
Any advice?
Addendum
The data I extract from the GIS system is names, dates, other textual data. I then have another csv file that I need to use as a lookup, so that I can enrich the source with more textual information which finally gets published to excel.
SAMPLE DATA - SOURCE
WorkCode Status WorkName StartDate EndDate siteType Supplier
0 AT-W34319 None Second building 2020-05-04 2020-05-31 Type A Acem 1
1 AT-W67713 None Left of the red office tower 2019-02-11 2020-08-28 Type B Quester Q
2 AT-W68713 None 12 main street 2019-05-23 2020-11-03 Class 1 Type B Dettlim Group
3 AT-W70105 None city central 2019-03-07 2021-08-06 Other Hans Int
4 AT-W73855 None top floor 2019-05-06 2020-10-28 Type a None
SAMPLE DATA - CSV
["Id", "Version","Utility/Principal","Principal Contractor Contact"]
XM-N33463,7.1,"A Contracting company", "555-12345"
XM-N33211,2.1,"Contractor #b", "555-12345"
XM-N33225,1.3,"That other contractor", "555-12345"
XM-N58755,1.0,"v Contracting", "555-12345"
XM-N58755,2.3,"dsContracting", "555-12345"
XM-222222,2.3,"dsContracting", "555-12345"
BM-O33343,2.1,"dsContracting", "555-12345"
def SMAN():
    ####################################################################################################################
    # Exporting the results of the analysis...
    ####################################################################################################################
    """
    Approach is as follows:
    1) Get the source data
    2) Get the CSV lookup data loaded into memory - it'll be faster
    3) Iterate through the source data, looking for matches in the CSV data
    4) Add an extra couple of columns onto the source data, and populate them with the (matching) lookup data.
    5) Export the now enhanced data to excel.
    """
    arcpy.env.workspace = workspace + filenameGDB
    input = "ApprovedActivityByLocalBoard"
    exportFile = arcpy.da.FeatureClassToNumPyArray(input, ['WorkCode', 'Status', 'WorkName', 'PSN2', 'StartDate', 'EndDate', 'siteType', 'Supplier'])
    # we have our data, but it's (9893,) instead of [9893 rows x 8 columns]
    pdExportFile = pandas.DataFrame(exportFile)
    LBW = pdExportFile.to_numpy()
    del exportFile
    del pdExportFile
    # Now we have [9893 rows x 8 columns] - but we've lost the headers
    col_list = ["WorkCode", "Version", "Principal", "Contact"]
    allPermits = pandas.read_csv("lookup.csv", usecols=col_list)
    # Now we have the CSV file loaded ... and only the important parts - should be fast.
    # Shape: (94523, 4)
    # will have to find a way to improve this...
    # CSV file has got more than WorkCode, because there are different versions (as different records)
    # Only want the last one.
    # each record must now be "enhanced" with the matching record from the CSV file.
    finalReport = []  # we are expecting this to be [9893 rows x 12 columns] at the end
    counter = -1
    for eachWorksite in LBW[:5]:  # let's just work with 5 records right now...
        counter += 1
        # eachWorksite = list(eachWorksite)  # eachWorksite is a tuple - so need to convert it
        # # but if we change it to a list, we lose the headers!
        certID = LBW[counter][0]  # get the ID to use for lookup matching
        # Search the CSV data
        permitsFound = allPermits[allPermits['Id'] == certID]
        permitsFound = permitsFound.to_numpy()
        if numpy.shape(permitsFound)[0] > 1:
            print("Too many hits!")  # got to deal with that CSV Version field.
            exit()
        else:
            # now "enrich" the record/row by adding on the fields from the lookup
            # so a row goes from 8 fields to 12 fields
            newline = numpy.append(eachWorksite, permitsFound)
            # and this enhanced record/row must become the new normal
            # but I cannot change the original, so it must go into a new container
            finalReport = numpy.append(finalReport, newline, axis=0)
    # now I should have a new container of "enriched" data
    # which has gone from [9893 rows x 8 columns] to [9893 rows x 12 columns]
    # Some of the columns, of course, could be empty.
    # Now let's dump the results to an Excel file and make it accessible for everyone else.
    df = pandas.DataFrame(finalReport)
    filepath = 'finalreport.csv'
    df.to_csv(filepath, index=False)
    # Somewhere I was getting Error("Cannot convert {0!r} to Excel".format(value))
    # Now I get
    filepath = 'finalReport.xlsx'
    df.to_excel(filepath, index=False)
I have eventually answered my own question, and this is how:
Yes, for my situation, pandas worked just fine, even beautifully for
manipulating non numerical data. I just had to learn some basics.
The biggest learning was to understand the pandas data frame as an object that has to be manipulated remotely by various functions/tools. Just because I "print" the dataframe doesn't mean it's just text. (Thanks juanpa.arrivillaga for pointing out my erroneous assumptions in Why can I not reproduce a nd array manually?)
I also had to wrap my mind around the concept of indexes and columns, and how they could be altered/manipulated/ etc. And then, how to use them to maximum effect.
Once those fundamentals had been sorted, the rest followed naturally, and my code reduced to a couple of nice elegant functions.
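For illustration, the heart of it ended up being an ordinary lookup-and-merge. A minimal sketch using the column names from the samples above (placeholders, not my exact code):
import pandas

# exportFile is the structured array returned by arcpy's FeatureClassToNumPyArray
lbw = pandas.DataFrame(exportFile)        # keeps the field names as column headers
permits = pandas.read_csv("lookup.csv")   # the lookup CSV shown above

# keep only the latest Version of each Id
permits = permits.sort_values("Version").drop_duplicates("Id", keep="last")

# enrich: the eight source columns gain the lookup columns in one go
report = lbw.merge(permits, left_on="WorkCode", right_on="Id", how="left")
report.to_excel("finalReport.xlsx", index=False)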
Cheers

Can I speed up my reading and processing of many .csv files in python?

I am currently working with a dataset consisting of 90 .csv files. There are three types of .csv files (30 of each type).
Each CSV has 20k to 30k rows on average and 3 columns (timestamp in Unix format, integer, integer).
Here's an example of the header and a row:
Timestamp id1 id2
151341342 324 112
I am currently using 'os' to list all files in the directory.
The process for each CSV file is as follows:
Read it into a dataframe with pandas
Iterate the rows of the file and for each row convert the timestamp to a readable format
Use the converted timestamp and integers to create a relationship-type object and add it to a list of relationships
The list will later be looped to create the relationships in my neo4j database.
The problem I am having is that the process takes too much time. I have asked and searched for ways to do it faster (I got answers like PySpark and threads) but I did not find something that really fits my needs. I am really stuck, as with my resources it takes around 1 hour and 20 minutes to do all that processing for one of the big .csv files (meaning one with around 30k rows).
Converting to readable format:
ts = int(row['Timestamp'])
formatted_ts = datetime.utcfromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')
And I pass the parameters to the Relationship func of py2neo to create my relationships. Later that list will be looped over.
node1 = graph.evaluate('MATCH (n:User) WHERE n.id={id} RETURN n', id=int(row["id1"]))
node2 = graph.evaluate('MATCH (n:User) WHERE n.id={id} RETURN n', id=int(row['id2']))
rels.append(Relationship(node1, rel_type, node2, date=date, time=time))
time to compute row: 0:00:00.001000
time to create relationship: 0:00:00.169622
time to compute row: 0:00:00.001002
time to create relationship: 0:00:00.166384
time to compute row: 0:00:00
time to create relationship: 0:00:00.173672
time to compute row: 0:00:00
time to create relationship: 0:00:00.171142
I calculated the time for the two parts of the process as shown above. It is fast, and there really doesn't seem to be a problem except the size of the files. This is why the only thing that comes to mind is that parallelism would help to process those files faster (by processing, say, 4 files at the same time instead of one).
sorry for not posting everything
I am really looking forward for replies
Thank you in advance
That sounds fishy to me. Processing csv files of that size should not be that slow.
I just generated a 30k line csv file of the type you described (3 columns filled with random numbers of the size you specified).
import random

with open("file.csv", "w") as fid:
    fid.write("Timestamp;id1;id2\n")
    for i in range(30000):
        ts = int(random.random()*1000000000)
        id1 = int(random.random()*1000)
        id2 = int(random.random()*1000)
        fid.write("{};{};{}\n".format(ts, id1, id2))
Just reading the csv file into a list using plain Python takes well under a second. Printing all the data takes about 3 seconds.
from datetime import datetime

def convert_date(string):
    ts = int(string)
    formatted_ts = datetime.utcfromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')
    split_ts = formatted_ts.split()
    date = split_ts[0]
    time = split_ts[1]
    return date

with open("file.csv", "r") as fid:
    header = fid.readline()
    lines = []
    for line in fid.readlines():
        line_split = line.strip().split(";")
        line_split[0] = convert_date(line_split[0])
        lines.append(line_split)

for line in lines:
    print(line)
Could you elaborate what you do after reading the data? Especially "create a relationship-type of object and add it on a list of relationships"
That could help pinpoint your timing issue. Maybe there is a bug somewhere?
You could try timing different parts of your code to see which one takes the longest.
Generally, what you describe should be possible within seconds, not hours.
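As a starting point, here is a rough sketch of how you might time the two phases for each file (the names are placeholders, not your actual code):
import time
import pandas as pd

def process_file(path):
    t0 = time.perf_counter()
    df = pd.read_csv(path)            # phase 1: reading the CSV
    t1 = time.perf_counter()
    rels = []
    for _, row in df.iterrows():      # phase 2: per-row conversion / relationship building
        pass                          # timestamp conversion and py2neo lookups would go here
    t2 = time.perf_counter()
    print("read: {:.2f}s, per-row work: {:.2f}s".format(t1 - t0, t2 - t1))
    return rels
If the per-row phase dominates, the time is probably going into the two graph.evaluate() calls per row rather than into the CSV handling itself.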

Printing a Python pandas data frame with padding

I'm writing a short script to validate data in a CSV, and I'm formatting the results to dump to stdout; for readability I'm adding 5-space padding on the left. Note: I'm NOT using format because I don't want to justify the output.
Code:
def duplicate_data():
    dup_df = inventory_df[inventory_df.duplicated(['STORE_NO','SKU'], keep=False)]
    if dup_df.empty:
        print(five, 'INFO: No Duplicate Entries Found')
    else:
        #print('\n')
        print(five, 'WARN: Duplicate STORE_ID and SKU Data Found!')
        print(five, dup_df.to_string(index=False))
Results:
It all works great until it prints the data frame:
WARN: Duplicate STORE_ID and SKU Data Found!
Please Copy/Paste the following and send to the customer:
STORE_NO SKU ON_HAND_QTY
10000001 1000000000007 2
10000002 1000000000007 8
I could iterate over the rows but the formatting is worse than the example above.
for rows in dup_df.iterrows():
    print(five, rows)
Any thoughts as to how I can format the data frame output?
Not super nice, but you could do something like this:
def padlines(text, padding):
    return "\n".join(padding + l for l in text.splitlines())
And then padlines(df.to_string(), five)
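For example, dropping it into the function from the question might look like this (a sketch; I'm assuming five is the 5-space padding string):
five = ' ' * 5
print(five, 'WARN: Duplicate STORE_ID and SKU Data Found!')
print(padlines(dup_df.to_string(index=False), five))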

In PyTables, how to create nested array of variable length?

I'm using PyTables 2.2.1 w/ Python 2.6, and I would like to create a table which contains nested arrays of variable length.
I have searched the PyTables documentation, and the tutorial example (PyTables Tutorial 3.8) shows how to create a nested array of length = 1. But for this example, how would I add a variable number of rows to data 'info2/info3/x' and 'info2/info3/y'?
For perhaps an easier to understand table structure, here's my homegrown example:
"""Desired Pytable output:
DIEM TEMPUS Temperature Data
5 0 100 Category1 <--||--> Category2
x <--| |--> y z <--|
0 0 0
2 1 1
4 1.33 2.67
6 1.5 4.5
8 1.6 6.4
5 1 99
2 2 0
4 2 2
6 2 4
8 2 6
5 2 96
4 4 0
6 3 3
8 2.67 5.33
Note that nested arrays have variable length.
"""
import tables as ts

tableDef = {'DIEM': ts.Int32Col(pos=0),
            'TEMPUS': ts.Int32Col(pos=1),
            'Temperature': ts.Float32Col(pos=2),
            'Data':
                {'Category1':
                    {
                     'x': ts.Float32Col(),
                     'y': ts.Float32Col()
                    },
                 'Category2':
                    {
                     'z': ts.Float32Col(),
                    }
                }
            }
# create output file
fpath = 'TestDb.h5'
fh = ts.openFile(fpath, 'w')
# define my table
tableName = 'MyData'
fh.createTable('/', tableName, tableDef)
tablePath = '/' + tableName
table = fh.getNode(tablePath)
# get row iterator
row = table.row
for i in xrange(3):
    print '\ni=', i
    # calc some fake data
    row['DIEM'] = 5
    row['TEMPUS'] = i
    row['Temperature'] = 100 - i**2
    for j in xrange(5-i):
        # Note that nested array has variable number of rows
        print 'j=', j,
        # calc some fake nested data
        val1 = 2.0*(i+j)
        val2 = val1/(j+1.0)
        val3 = val1 - val2
        ''' Magic happens here...
        How do I write 'j' rows of data to the elements of
        Category1 and/or Category2?
        In bastardized pseudo-code, I want to do:
            row['Data/Category1/x'][j] = val1
            row['Data/Category1/y'][j] = val2
            row['Data/Category2/z'][j] = val3
        '''
    row.append()
table.flush()
fh.close()
I have not found any indication in the PyTables docs that such a structure is not possible... but in case such a structure is in fact not possible, what are my alternatives to variable length nested columns?
EArray? VLArray? If so, how to integrate these data types into the above described structure?
some other idea?
Any assistance is greatly appreciated!
EDIT w/ additional info:
It appears that the PyTables gurus have already addressed the "is such a structure possible" question:
PyTables Mail Forum - Hierarchical Datasets
So has anyone figured out a way to create an analogous PyTable data structure?
Thanks again!
I have a similar task: to dump fixed size data with arrays of a variable length.
I first tried using fixed size StringCol(64*1024) fields to store my variable length data (they are always < 64K). But it was rather slow and wasted a lot of disk space, despite blosc compression.
After days of investigation I ended with the following solution:
(spoiler: we store array fields in separate EArray instances, one EArray per array field)
I store fixed size data in a regular pytables table.
I added 2 additional fields to these tables: arrFieldName_Offset and arrFieldName_Length:
class Particle(IsDescription):
    idnumber = Int64Col()
    ADCcount = UInt16Col()
    TDCcount = UInt8Col()
    grid_i = Int32Col()
    grid_j = Int32Col()
    pressure = Float32Col()
    energy = FloatCol()
    buffer_Offset = UInt32Col()  # note this field!
    buffer_Length = UInt32Col()  # and this one too!
I also create one EArray instance per each array field:
datatype = StringAtom(1)
buffer = h5file.createEArray('/detector', 'arr', datatype, (0,), "")
Then I add rows corresponding to a fixed size data:
row['idnumber'] = ...
...
row['energy'] = ...
row['buffer_Offset'] = buffer.nrows
# my_buf is a string (I get it from a stream)
row['buffer_Length'] = len(my_buf)
table.append(row)
Ta-dah! Add the buffer into the array.
buffer.append(np.ndarray((len(my_buf),), buffer=my_buf, dtype=datatype))
That's the trick. In my experiments this approach is 2-10x faster than storing ragged fixed-size arrays (like StringAtom(HUGE_NUMBER)), and the resulting DB is a few times smaller (2-5x).
Getting the buffer data is easy. Suppose that row is a single row you read from your DB:
# Open the array for reading (the '/detector/arr' EArray created above)
buffer = h5file.getNode('/detector', 'arr')
...
row = ...
...
bufferDataYouNeed = buffer[ row['buffer_Offset'] : row['buffer_Offset'] + row['buffer_Length']]
This is a common thing that folks starting out with PyTables want to do. Certainly, it was the first thing I tried to do. As of 2009, I don't think this functionality was supported. You can look here for the solution I always recommend:
http://www.mail-archive.com/pytables-users#lists.sourceforge.net/msg01207.html
In short, just put each VLArray in a separate place. If you do that, maybe you don't end up needing VLArrays. If you store separate VLArrays for each trial (or whatever), you can keep metadata on those VLArrays (guaranteed to stay in sync with the array across renames, moves, etc.) or put it in a table (easier to search).
But you may also do well to pick whatever a single time-point would be for your column atom, then simply add another column for a time stamp. This would allow for a "ragged" array that still has a regular, repeated (tabular) structure in memory. For example:
Trial Data
1 0.4, 0.5, 0.45
2 0.3, 0.4, 0.45, 0.56
becomes
Trial Timepoint Data
1 1 0.4
1 2 0.5
...
2 4 0.56
Data above is a single number, but it could be, e.g. a 4x5x3 atom.
If nested VLArrays are supported in PyTables now, I'd certainly love to know!
Alternatively, I think h5py does support the full HDF5 feature-set, so if you're really committed to the nested data layout, you may have more luck there. You'll be losing out on a lot of nice features though! And in my experience, naive neuroscientists end up with quite poor performance since they don't get pytables intelligent choices for data layout, chunking, etc. Please report back if you go that route!
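For concreteness, here is a minimal sketch of that "ragged but tabular" layout, using the toy Trial/Timepoint data above and the PyTables 2.x API the question uses (the file and class names are placeholders; PyTables 3.x spells the calls open_file / create_table):
import tables as ts

class TrialRow(ts.IsDescription):
    Trial = ts.Int32Col(pos=0)
    Timepoint = ts.Int32Col(pos=1)
    Data = ts.Float32Col(pos=2)   # could instead be a shaped atom, e.g. shape=(4, 5, 3)

fh = ts.openFile('trials.h5', 'w')
table = fh.createTable('/', 'Trials', TrialRow)
row = table.row
ragged = {1: [0.4, 0.5, 0.45], 2: [0.3, 0.4, 0.45, 0.56]}
for trial in sorted(ragged):
    for timepoint, value in enumerate(ragged[trial], 1):
        row['Trial'] = trial
        row['Timepoint'] = timepoint
        row['Data'] = value
        row.append()
table.flush()
fh.close()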
I also ran into this, and I ended up using a fixed array size. The arrays I was trying to store were of variable length, so I created new ones from them with the correct fixed length.
I did something along the lines of
def filled_list(src_list, targ_len):
    """takes a variable len() list and creates a new one with a fixed len()"""
    for i in range(targ_len):
        try:
            yield src_list[i]
        except IndexError:
            yield 0
src_list = [1,2,3,4,5,6,7,8,9,10,11]
new_list = [x for x in filled_list(src_list, 100)]
That did the trick for me.
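As an aside, the same zero-padding can be written with itertools if you prefer (just an alternative, not what I used):
from itertools import chain, islice, repeat

src_list = [1,2,3,4,5,6,7,8,9,10,11]
new_list = list(islice(chain(src_list, repeat(0)), 100))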
