I'm coming from an R background where I didn't run into this issue.
Generally in the past I've made functions that act upon a dataframe and return some modified version of the dataframe. For example:
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [6, 7, 8, 9, 10]})

def multiply_function(dataset):
    dataset['output'] = dataset.iloc[:, 1] * dataset.iloc[:, 0]
    return dataset

new_df = multiply_function(df)
new_df # looks good!
df # I would expect that df stays the same and isn't updated with the new column
I'm trying to convert a good amount of functions or code from one language to another. I'd like to avoid having this issue happen so that df is NOT updated globally because of what happens inside a function.
This is particularly important when I'm re-running code or modifying code because a dataframe may not be valid to run through a function twice.
I have seen usage of
dataset = dataset.copy()
as the first line of a function... but is this really ideal? Is there a better way around it? I was thinking this would really blow up the amount of data in memory when working with large datasets.
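For reference, a minimal sketch of that copy-first pattern applied to the function above (the name multiply_function_safe is made up):

def multiply_function_safe(dataset):
    dataset = dataset.copy()  # copies the data (deep=True is the default), so the caller's df is untouched
    dataset['output'] = dataset.iloc[:, 1] * dataset.iloc[:, 0]
    return dataset

new_df = multiply_function_safe(df)
df.columns  # still only 'a' and 'b'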
Thank you!
Related
Both "krogh" and "barycentric" seem to not clean the dataframe fully (meaning between the first non-NaN and the last non-NaN).
What are they intended to use for? (My purpose would be a Timeseries).
Context: I'm setting up a pipeline with different cleaning functions to test later and adapted the whole pandas.DataFrame.interpolation() function because it comes in pretty handy.
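For comparison, a minimal sketch to run the two methods side by side (requires SciPy; the series values are made up):

import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, 4.0, 9.0, np.nan])

# Compare how each method treats the interior NaN versus the leading/trailing ones
print(s.interpolate(method='krogh'))
print(s.interpolate(method='barycentric'))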
I have two dataframes, each around 400k rows; call them a and b. For every row in dataframe b, I want to find that row's account number in dataframe a. If it exists, I want to drop that row from dataframe a. The problem is that when I try to run this code, I keep getting memory errors. Initially I was using iterrows, but that seems to be bad when working with large datasets, so I switched to apply, but I'm running into the same error. Below is simplified pseudocode of what I'm trying:
def reduceAccount(accountID):
    # find rows in frameA with this AccountID and drop them from frameA
    idx = frameA.loc[frameA["AccountID"] == accountID].index
    frameA.drop(idx, inplace=True)

frameB["AccountID"].apply(reduceAccount)
I've even tried some shenanigans like iterating through the first few hundred/thousand rows, but after a cycle I still hit the memory error, which makes me think I'm still loading things into memory rather than clearing them out. Is there a better way to reduce dataframe A than what I'm trying? Note that I do not want to merge the frames (yet), just remove any row in dataframe a that has a duplicate key in dataframe b.
The issue is that in order to see all values to filter, you will need to store both DFs in memory at some point. You can improve your efficiency somewhat by not using apply(), which is still an iterator. The following code is a more efficient, vectorized approach using boolean masking directly.
dfB[~dfB["AccountID"].isin(dfA["AccountID"])]
However, if storage is the problem, then this may still not work. Some approaches to consider are chunking the data, as you say you've already tried, or some of the options in the pandas documentation on enhancing performance.
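If frameA can be read in pieces, a rough sketch of the chunked variant might look like this (the file name and chunk size are illustrative):

import pandas as pd

b_ids = set(dfB["AccountID"])
pieces = []
for chunk in pd.read_csv("frameA.csv", chunksize=100_000):
    # keep only rows whose AccountID does not appear in dfB
    pieces.append(chunk[~chunk["AccountID"].isin(b_ids)])
dfA_reduced = pd.concat(pieces, ignore_index=True)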
So basically you want every row in A whose 'AccountID' is not in B.
This can be done with a left anti-join: merge frameB's keys into frameA with how='left' and indicator=True, then keep only the rows marked 'left_only'.
I think this is good in terms of memory efficiency, since you'll be leveraging pandas' optimized built-in code.
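A rough sketch of that anti-join, using the column name from the question (indicator=True labels where each row came from, so rows found only in frameA can be kept):

merged = frameA.merge(frameB[["AccountID"]].drop_duplicates(),
                      on="AccountID", how="left", indicator=True)
frameA_reduced = merged[merged["_merge"] == "left_only"].drop(columns="_merge")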
TL;DR
I want to allow workers to use a scattered Pandas Dataframe, but not allow them to mutate any data. Look below for sample code. Is this possible? (or is this a pure Pandas question?)
Full question
I'm reading a Pandas Dataframe, and scattering it to the workers. I then use this future when I submit new tasks, and store it in a Variable for easy access.
Sample code:
import pyarrow.parquet as pq           # assuming pq is pyarrow.parquet
from dask.distributed import Variable  # client below is an existing dask.distributed Client

df = pq.read_table('/data/%s.parq' % dataset).to_pandas()
df = client.scatter(df, broadcast=True, direct=True)
v = Variable('dataset')
v.set(df)
When I submit a job I use:
def top_ten_x_based_on_y(dataset, column):
    return (dataset.groupby(dataset['something'])[column]
            .mean()
            .sort_values(ascending=False)[0:10]
            .to_dict())
a = client.submit(top_ten_x_based_on_y, df, 'a_column')
Now, I want to run 10-20 QPS on this dataset, which all workers have in memory (data < RAM), but I want to protect against accidental changes to the dataset, such as one worker "corrupting" its own copy in memory, which can lead to inconsistencies. Preferably it would raise an exception on any attempt to modify the data.
The data set is roughly 2GB.
I understand this might be problematic since a Pandas Dataframe itself is not immutable (although a NumPy array can be made immutable).
Other ideas:
Copy the dataset on each query, but a 2 GB copy takes time even in RAM (roughly 1.4 seconds).
Devise a way to hash a dataframe (problematic in itself, even though hash_pandas_object now exists), and check before and after (or every minute) whether the dataframe is the same as expected; a sketch follows below. Running hash_pandas_object takes roughly 5 seconds.
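A rough sketch of that hash check (the assert_unchanged helper is hypothetical, and the baseline would be computed on the pandas frame before it is scattered):

from pandas.util import hash_pandas_object

baseline = hash_pandas_object(df).sum()   # digest recorded once, before scattering

def assert_unchanged(frame, expected=baseline):
    # hypothetical guard: re-hash and compare before trusting the frame
    if hash_pandas_object(frame).sum() != expected:
        raise RuntimeError("scattered DataFrame appears to have been mutated")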
Unfortunately Dask currently offers no additional features on top of Python to avoid mutation in this way. Dask just runs Python functions, and those Python functions can do anything they like.
Your suggestions of copying or checking before running operations seem sensible to me.
You might also consider raising this as a question or feature request to Pandas itself.
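As a small illustration of the NumPy point raised in the question: a bare array can be frozen, though pandas may copy or consolidate blocks internally, so this is not a reliable guard for a whole DataFrame.

import numpy as np

arr = np.array([1.0, 2.0, 3.0])
arr.flags.writeable = False      # freeze the buffer

try:
    arr[0] = 99.0                # any in-place write now fails
except ValueError as err:
    print(err)                   # "assignment destination is read-only"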
I have a very large simulation in python with lots of modules. I call a lot of random functions. To keep the same random results I have a variable keep_seed_random.
As so:
import random

keep_seed_random = True

if keep_seed_random is False:
    fixed_seed = random.Random(0)   # dedicated generator with a fixed seed
else:
    fixed_seed = random             # the module-level generator
Then I use fixed_seed all over the program, such as
fixed_seed.choice(['male', 'female'])
fixed_seed.randint(0, 100)
fixed_seed.gammavariate(3, 3)
fixed_seed.random()
fixed_seed.randrange(20, 40)
and so on...
It used to work well.
But now that the program has grown large, something else is interfering, and the results are no longer identical, even when I choose keep_seed_random = False.
My question is whether there is any other source of randomness in Python that I am missing?
P.S. I import random just once.
EDITED
We have been trying to pinpoint the exact moment when the program went from producing exactly the same results to producing different ones. It seemed to happen when we introduced a lot of database reads that have no connection to the random modules.
The results now ALTERNATE between two similar results.
That is, I run main.py once and get a result of 8148.78 for GDP.
I run it again and get 7851.49.
Again, 8148.78 comes back.
Again, 7851.49.
Also, for the working version before the change: on the first run (when we create instances and pickle-save them) I get one result; from the second run onwards the results are the same. So, I am guessing it is related to pickle reading/loading.
The question remains!
2nd EDITED
We partially found the problem.
The problem arises when we create instances, pickle-dump them, and then pickle-load them.
We still cannot get exactly the same results between creating and just loading.
However, when loading repeatedly, the results are identical.
Thus, the problem is in PICKLE.
Some randomization may occur when dumping and loading (I guess).
Thanks,
This is difficult to diagnose without a good reproducible example, as #mart0903 mentions. However, in general, there are several possible sources of randomness. A few things come to mind:
If, for example, you are using the multiprocessing and/or subprocess packages to spawn several parallel processes, you may be experiencing a race condition: different processes finish at different times each time you run the program. Perhaps you are combining the results in a way that depends on these processes executing in a particular order.
Perhaps you are simply looping over a dictionary and expecting the keys to be in a certain order, when in fact dictionaries in Python 3.5 and earlier make no ordering guarantee (since Python 3.7 they preserve insertion order). For example, run the following a couple of times in a row (I'm using Python 3.5, in case it matters) and you may notice that the key-value pairs print out in a different order each time:
if __name__ == '__main__':
    data = dict()
    data['a'] = 6
    data['b'] = 7
    data['c'] = 42
    for key in data:
        print(key + ' : ' + str(data[key]))
You might even be looking at timestamps to set some value, or perhaps generating a UUID somewhere that you then use in a calculation.
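A tiny illustration of that last point (both values differ on every run):

import uuid
from datetime import datetime

print(uuid.uuid4())                # new random UUID each run
print(datetime.now().timestamp())  # wall-clock time changes between runs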
The possibilities go on. But again, this is difficult to nail down without a simple reproducible case. It may just take some good old breakpoints and a lot of stepping through code.
Good luck!
I am working in Apache Spark, making multiple transformations on a single DataFrame with Python.
I have coded some functions to make the different transformations easier. Imagine we have functions like:
def clearAccents(df, columns):
    # lines that remove accents from the dataframe using Spark functions or a UDF
    return df
I use those functions to "overwrite" the dataframe variable, saving the newly transformed dataframe each time a function returns. I know this is not good practice and now I am seeing the consequences.
I noticed that every time I add a line like the one below, the running time gets longer:
# Step transformation 1:
df = function1(df,column)
# Step transformation 2.
df = function2(df, column)
As I understand it, Spark does not save the resulting dataframe; it saves all the operations needed to produce the dataframe at the current line. For example, when running function1, Spark runs only that function, but when running function2, Spark runs function1 and then function2. What if I really need each function to run only once?
I tried df.cache() and df.persist(), but I didn't get the desired results.
I want to save partial results so that it isn't necessary to recompute everything from the beginning, only from the last transformation function, and without getting a StackOverflowError.
You probably aren't getting the desired results from cache() or persist() because they won't be evaluated until you invoke an action. You can try something like this:
# Step transformation 1:
df = function1(df,column).cache()
# Now invoke an action
df.count()
# Step transformation 2.
df = function2(df, column)
To see that the execution graph changes, the SQL tab in the Spark Job UI is a particularly helpful debugging tool.
I'd also recommend checking out the ML Pipeline API and seeing if it's worth implementing a custom Transformer. See Create a custom Transformer in PySpark ML.
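For reference, a minimal sketch of such a custom Transformer, wrapping something like the clearAccents step from the question (the class name, single-column handling, and the translate call are illustrative, not a real implementation):

from pyspark.ml import Transformer
from pyspark.sql import functions as F

class ClearAccents(Transformer):
    """Hypothetical Transformer version of the clearAccents(df, columns) helper."""
    def __init__(self, column):
        super().__init__()
        self.column = column

    def _transform(self, df):
        # Illustrative accent stripping; a real version might use a UDF instead
        return df.withColumn(self.column,
                             F.translate(F.col(self.column), "áéíóúÁÉÍÓÚ", "aeiouAEIOU"))

# Usage sketch: df = ClearAccents("name").transform(df)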