I'm trying to use DataFrame.map_partitions() from Dask to apply a function to each partition. The function takes as input a list of values and has to return the rows of the dataframe partition that contain these values in a specific column (using loc() and isin()).
The issue is that I get this error:
"index = partition_info['number'] - 1
TypeError: 'NoneType' object is not subscriptable"
When I print partition_info, it prints None hundreds of times (but I only have 60 elements in the loop, so I would expect only 60 prints). Is it normal for it to print None because it's a child process, or am I missing something with partition_info? I cannot find useful information on this.
from typing import List
from dask.dataframe import from_pandas

def apply_f(df, barcodes_per_core: List[List[str]], partition_info=None):
    print(partition_info)
    index = partition_info['number'] - 1
    indexes = barcodes_per_core[index]
    return df.loc[df['barcode'].isin(indexes)]

df = from_pandas(df, npartitions=nb_cores)
dfs_per_core = df.map_partitions(apply_f, barcodes_per_core, meta=df)
dfs_per_core = dfs_per_core.compute(scheduler='processes')
=> Doc of partition_info at the end of this page.
It's not clear why things are not working on your end; one potential issue is that you are re-using df multiple times. Here's an MWE that works:
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame(range(10), columns=["a"])
ddf = dd.from_pandas(df, npartitions=3)

def my_func(d, x, partition_info=None):
    print(x, partition_info)

ddf.map_partitions(my_func, 3, meta=df.head()).compute(scheduler='processes')
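Applied to the code in the question, a minimal sketch of that suggestion, keeping the pandas frame and the Dask frame under different names (pdf stands in for the original pandas DataFrame; nb_cores, apply_f and barcodes_per_core are the question's own variables):

import dask.dataframe as dd

# sketch: do not re-use `df` for both the pandas and the Dask frame
ddf = dd.from_pandas(pdf, npartitions=nb_cores)                             # pdf is the original pandas DataFrame
result = ddf.map_partitions(apply_f, barcodes_per_core, meta=pdf.head())    # meta built from the pandas frame
dfs_per_core = result.compute(scheduler='processes')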
Related
I want to read only the first n rows in pandas. I pasted the code I've tried below.
def s3_read_file(src_bucket_name, s3_path, s3_filename):
    try:
        src_bucket_name = "lla.analytics.dev"
        s3_path = "bigdata/dna/fixed/cwp/dt={}/".format(date_fmt)
        result = s3.list_objects(Bucket=src_bucket_name, Prefix=s3_path)  # getting dictionary
        for i in result["Contents"]:
            s3_filename = i['Key']
            # print(s3_filename)
            res = s3.get_object(Bucket=src_bucket_name, Key=s3_filename)  # s3://lla.analytics.dev/bigdata/dna/fixed/cwp/dt=2021-12-05/file.parquet
            # print(res)
            # df = pd.read_parquet(io.BytesIO(res['Body'].read()))
            # print(df)
            pf = spark.read.parquet().limit(1)
            logger.info("****")
            logging.info('dataframe head - {}'.format(pf.count()))
            logger.info("****")
    except Exception as error:
        logger.error(error)
I'm facing the below error (I also tried with PySpark, but that didn't work either):
ERROR:root:read_table() got an unexpected keyword argument 'nrows'
I also tried the following, but BytesIO doesn't take two arguments.
#df = pd.read_parquet(io.BytesIO(s3_obj['Body'].read(),nrows = 10))
This may be a good place to start.
You can pass a subset of columns to read, which can be much faster than reading the whole file (due to the columnar layout):
pq.read_table('example.parquet', columns=['one', 'three'])
Out[11]:
pyarrow.Table
one: double
three: bool
----
one: [[-1,null,2.5]]
three: [[true,false,true]]
When reading a subset of columns from a file that used a Pandas dataframe as the source, we use read_pandas to maintain any additional index column data.
Also, you could write a loop and use read_row_group.
https://arrow.apache.org/docs/python/parquet.html#:~:text=row%20groups%20with-,read_row_group,-%3A
parquet_file.num_row_groups
Out[22]: 1
parquet_file.read_row_group(0)
Out[23]:
pyarrow.Table
one: double
two: string
three: bool
__index_level_0__: string
----
one: [[-1,null,2.5]]
two: [["foo","bar","baz"]]
three: [[true,false,true]]
__index_level_0__: [["a","b","c"]]
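Putting the two ideas together for the question's setup, here is a rough sketch (not a tested solution) that fetches one of the S3 objects with the question's boto3 client s3, reads only the first row group, and keeps the first 10 rows:

import io
import pyarrow.parquet as pq

# sketch: the whole object is still downloaded from S3, but only the first
# row group is materialized as a table
res = s3.get_object(Bucket=src_bucket_name, Key=s3_filename)
parquet_file = pq.ParquetFile(io.BytesIO(res['Body'].read()))

first_group = parquet_file.read_row_group(0)  # a pyarrow.Table
df = first_group.to_pandas().head(10)         # first 10 rows as a pandas DataFrame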
I have a function which does some operations on each DataFrame column and extracts a shorter series from it (in the original code there are some time-consuming calculations going on).
It then adds the result to a dictionary before going on with the next column.
In the end it creates a dataframe from the dictionary and manipulates its index.
How can I parallelize the loop in which each column is manipulated?
Here is a less complicated, reproducible sample of the code.
import pandas as pd

raw_df = pd.DataFrame({"A": [1.1] * 100000,
                       "B": [2.2] * 100000,
                       "C": [3.3] * 100000})

def preprocess_columns(raw_df):
    df = {}
    width = 137
    for name in raw_df.columns:
        '''
        Note: the operations in this loop do not have a deep sense and are just for
        illustration of the function preprocess_columns. In the original code there
        are ~50 lines of list comprehensions etc.
        '''
        # do some column operations (actually there's more than just this operation)
        seriesF = raw_df[[name]].dropna()
        afterDropping_indices = seriesF.index.copy(deep=True)
        list_ = list(raw_df[name])[width:]
        df[name] = pd.Series(list_.copy(), index=afterDropping_indices[width:])
    # create df from dict and reindex
    df = pd.concat(df, axis=1)
    df = df.reindex(df.index[::-1])
    return df

raw_df = preprocess_columns(raw_df)
Maybe you can use this:
https://github.com/xieqihui/pandas-multiprocess
pip install pandas-multiprocess
from pandas_multiprocess import multi_process
args = {'width': 137}
result = multi_process(func=func, data=df, num_process=8, **args)
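As an alternative that stays in the standard library, here is a sketch using concurrent.futures to process each column in a separate worker. preprocess_one_column is a hypothetical helper that just repackages the loop body from the question (on Windows/macOS the call has to live under an if __name__ == '__main__': guard):

import pandas as pd
from concurrent.futures import ProcessPoolExecutor

def preprocess_one_column(series, width=137):
    # same per-column logic as the loop body in preprocess_columns
    seriesF = series.dropna()
    afterDropping_indices = seriesF.index.copy(deep=True)
    list_ = list(series)[width:]
    return pd.Series(list_.copy(), index=afterDropping_indices[width:])

def preprocess_columns_parallel(raw_df, num_workers=4):
    # each column is shipped to a worker process and processed independently
    with ProcessPoolExecutor(max_workers=num_workers) as executor:
        results = executor.map(preprocess_one_column, [raw_df[name] for name in raw_df.columns])
    df = pd.concat(dict(zip(raw_df.columns, results)), axis=1)
    return df.reindex(df.index[::-1])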
Trying to count occurrences of user ids (id) in a dataframe, by creating a new column called n_occurrences. It always results in a KeyError and I'm unsure why. In the same method, I create a new column for elapsed time (elapsed_time) which works without errors.
code sample:
import pandas as pd

class Clean_Data:
    def __init__(self, data):  # data is path to csv
        self.make_df(data)
        self.df

    def make_df(self, data):
        data_raw = pd.read_csv(data, header=0, nrows=100)
        # do stuff that returns valid dataframe, cleans up unwanted data
        self.df = data_raw
        return self.df  # df with shape 100 rows, 27 cols

    def analyze_the_data(self):
        self.df['n_occurrences'] = self.df.groupby(['id', 'time']).transform('count')
        # ^ fails as well if I try with or without the line 'self.df['n_Occurrences']'
        # has error ValueError: Wrong number of items passed 25, placement implies 1
        self.df['time_elapsed'] = (self.df['time'] - self.df['time'].shift())  # works--creates time_elapsed col
        print(self.df.shape)  # shows dataframe shape of 100 rows, 28 cols

if __name__ == '__main__':
    c = Clean_Data(r'/path/to.csv')
    c.analyze_the_data()
Dataframe (should) change to 63 rows, 28 cols after the groupby line. There are upwards of 50,000 lines so I'm just using 100 to save on time until everything is working.
output of error line:
ValueError: Wrong number of items passed 25, placement implies 1
Why does it fail when I create the n_occurrences col but not when I do the time_elapsed col?
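For reference, a minimal sketch (not from the original post) of what is going on: groupby on two keys followed by transform('count') returns a DataFrame with one column per remaining column (25 of them in the question's 27-column frame), which cannot be placed into a single new column; transforming a single column yields a Series that fits:

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2], 'time': [10, 11, 12], 'x': [0.1, 0.2, 0.3]})

# transform('count') on a two-key groupby returns one column per remaining column
wide = df.groupby(['id', 'time']).transform('count')
print(wide.shape)  # (3, 1) here; (100, 25) in the question's case

# counting occurrences of each id: transform a single column to get a Series
df['n_occurrences'] = df.groupby('id')['id'].transform('count')
print(df)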
I'm trying to calculate tracking error for a number of different benchmarks versus a fund that I'm looking at (tracking error is defined as the standard deviation of the percent difference between the fund and the benchmark). The time series for the fund and all the benchmarks are in a DataFrame that I'm reading from an Excel file. What I have so far is below (the idea being that arg1 represents each benchmark and is applied using applymap), but it returns a KeyError. Any suggestions?
import pandas as pd
import numpy as np

data = pd.read_excel('File_Path.xlsx')

def index_analytics(arg1):
    tracking_err = np.std((data['Fund'] - data[arg1]) / data[arg1])
    return tracking_err

data.applymap(index_analytics)
There are a few things that need to be fixed. First, applymap passes each individual value of every column to your calling function (index_analytics), so arg1 is a single scalar value from your dataframe; data[arg1] is always going to raise a KeyError unless all of your values are also column names.
You also shouldn't need to use applymap to do this. Assuming your benchmarks are in the same dataframe, you should be able to do something like this for each benchmark. (Next time, include a sample of your dataframe.)
data['Benchmark1_result'] = (data['Fund'] - data['Benchmark1']) / data['Benchmark1']
And if you want the standard deviation for all of the benchmarks at once, you can do this:
# assume benchmark_columns is a list of all the benchmark column names
benchmark_columns = [list, of, benchmark, columns]
np.std((data['Fund'].values[:, None] - data[benchmark_columns].values) / data[benchmark_columns].values, axis=0)
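For example, a small sketch (with made-up benchmark column names, reusing the data frame read from the Excel file) that collects the tracking error for each benchmark into a dict, following the question's definition:

import numpy as np

# hypothetical benchmark column names; substitute the real ones from the Excel file
benchmark_columns = ['Benchmark1', 'Benchmark2', 'Benchmark3']

tracking_errors = {
    bench: np.std((data['Fund'] - data[bench]) / data[bench])
    for bench in benchmark_columns
}
print(tracking_errors)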
Assuming you're following the definition of Tracking Error as the square root of the sum of squared active returns (active return = portfolio return minus benchmark return):
import pandas as pd
import numpy as np

# Example DataFrame
df = pd.DataFrame({'Portfolio_Returns': [5.00, 1.67], 'Bench_Returns': [2.89, .759]})
df['Active_Return'] = df['Portfolio_Returns'] - df['Bench_Returns']
print(df.head())

list_ = df['Active_Return']
temp_ = []
for val in list_:
    x = val**2
    temp_.append(x)
tracking_error = np.sqrt(sum(temp_))
print(f"Tracking Error is: {tracking_error}")
Or if you want it more compact (because apparently the cool kids do it):
df = pd.DataFrame({'Portfolio_Returns': [5.00, 1.67], 'Bench_Returns': [2.89, .759]})
tracking_error = np.sqrt(sum([val**2 for val in df['Portfolio_Returns'] - df['Bench_Returns']]))
print(f"Tracking Error is: {tracking_error}")
I need to create a pivot table of 2000 columns by around 30-50 million rows from a dataset of around 60 million rows. I've tried pivoting in chunks of 100,000 rows, and that works, but when I try to recombine the DataFrames by doing a .append() followed by .groupby('someKey').sum(), all my memory is taken up and python eventually crashes.
How can I do a pivot on data this large with a limited amount of RAM?
EDIT: adding sample code
The following code includes various test outputs along the way, but the last print is what we're really interested in. Note that if we change segMax to 3 instead of 4, the code will produce a false positive for correct output. The main issue is that if a shipmentid entry is not present in every chunk that the final sum(...) looks at, it doesn't show up in the output.
import pandas as pd
import numpy as np
import random
from pandas.io.pytables import *
import os

pd.set_option('io.hdf.default_format', 'table')

# create a small dataframe to simulate the real data.
def loadFrame():
    frame = pd.DataFrame()
    frame['shipmentid'] = [1, 2, 3, 1, 2, 3, 1, 2, 3]  # evenly distributing shipmentid values for testing purposes
    frame['qty'] = np.random.randint(1, 5, 9)           # random quantity is ok for this test
    frame['catid'] = np.random.randint(1, 5, 9)         # random category is ok for this test
    return frame

def pivotSegment(segmentNumber, passedFrame):
    segmentSize = 3  # take 3 rows at a time
    frame = passedFrame[(segmentNumber*segmentSize):(segmentNumber*segmentSize + segmentSize)]  # slice the input DF

    # ensure that all chunks are identically formatted after the pivot by appending
    # a dummy DF with all possible category values
    span = pd.DataFrame()
    span['catid'] = range(1, 5+1)
    span['shipmentid'] = 1
    span['qty'] = 0

    frame = frame.append(span)

    return frame.pivot_table(['qty'], index=['shipmentid'], columns='catid',
                             aggfunc='sum', fill_value=0).reset_index()

def createStore():
    store = pd.HDFStore('testdata.h5')
    return store

segMin = 0
segMax = 4

store = createStore()
frame = loadFrame()

print('Printing Frame')
print(frame)
print(frame.info())

for i in range(segMin, segMax):
    segment = pivotSegment(i, frame)
    store.append('data', frame[(i*3):(i*3 + 3)])
    store.append('pivotedData', segment)

print('\nPrinting Store')
print(store)
print('\nPrinting Store: data')
print(store['data'])
print('\nPrinting Store: pivotedData')
print(store['pivotedData'])

print('**************')
print(store['pivotedData'].set_index('shipmentid').groupby('shipmentid', level=0).sum())
print('**************')

print('$$$')
for df in store.select('pivotedData', chunksize=3):
    print(df.set_index('shipmentid').groupby('shipmentid', level=0).sum())
print('$$$')

store['pivotedAndSummed'] = sum((df.set_index('shipmentid').groupby('shipmentid', level=0).sum()
                                 for df in store.select('pivotedData', chunksize=3)))

print('\nPrinting Store: pivotedAndSummed')
print(store['pivotedAndSummed'])

store.close()
os.remove('testdata.h5')
print('closed')
You could do the appending with HDF5/pytables. This keeps it out of RAM.
Use the table format:
store = pd.HDFStore('store.h5')

for ...:
    ...
    chunk  # the chunk of the DataFrame (which you want to append)
    store.append('df', chunk)
Now you can read it in as a DataFrame in one go (assuming this DataFrame can fit in memory!):
df = store['df']
You can also query, to get only subsections of the DataFrame.
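For instance, a concrete sketch of that pattern (the CSV name and chunk size are placeholders; data_columns makes shipmentid queryable in the store):

import pandas as pd

store = pd.HDFStore('store.h5')

# append the raw data to the store chunk by chunk
for chunk in pd.read_csv('raw_data.csv', chunksize=100000):
    store.append('df', chunk, data_columns=['shipmentid'])

# read everything back in one go, or query just a subsection
df = store['df']
subset = store.select('df', where='shipmentid == 1')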
Aside: You should also buy more RAM, it's cheap.
Edit: you can groupby/sum from the store iteratively since this "map-reduces" over the chunks:
# note: this doesn't work, see below
sum(df.groupby().sum() for df in store.select('df', chunksize=50000))
# equivalent to (but doesn't read in the entire frame)
store['df'].groupby().sum()
Edit2: Using sum as above doesn't actually work in pandas 0.16 (I thought it did in 0.15.2); instead you can use reduce with add:
reduce(lambda x, y: x.add(y, fill_value=0),
(df.groupby().sum() for df in store.select('df', chunksize=50000)))
In Python 3 you must import reduce from functools.
Perhaps it's more pythonic/readable to write this as:
chunks = (df.groupby().sum() for df in store.select('df', chunksize=50000))
res = next(chunks) # will raise if there are no chunks!
for c in chunks:
    res = res.add(c, fill_value=0)
If performance is poor, or if there are a large number of new groups, then it may be preferable to start res as zeros of the correct size (by getting the unique group keys, e.g. by looping through the chunks), and then add in place, along the lines of the sketch below.
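A rough sketch of that idea, assuming the same store/'df' layout as above and using placeholder column names 'key' (the groupby key) and 'value':

import pandas as pd

# first pass over the chunks: collect every group key
keys = set()
for df in store.select('df', chunksize=50000):
    keys.update(df['key'].unique())

# preallocate the result as zeros, one entry per group key
res = pd.Series(0.0, index=sorted(keys), name='value')

# second pass: add each chunk's partial group sums in place
for df in store.select('df', chunksize=50000):
    partial = df.groupby('key')['value'].sum()
    res.loc[partial.index] += partial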