I am running analytics on an edge device, and I need pandas DataFrames to compute everything. Here is my problem: every 10 seconds I update a master pandas DataFrame with a new set of rows. Some people disagree with this approach, since it might hurt performance. append is the only way I know to add the rows; is there another, more efficient way to update a pandas DataFrame? All I need is something like the list.append(x) or list.extend(x) API in pandas. I hope I am using the right API; is there a more efficient alternative?
I do not have a memory issue, since I discard the data after some time.
Snippet:
df = df.append(self.__get_pd_frame(tracker_data), ignore_index=True)
# tracker_data is another pandas DataFrame containing 100-200 rows
I changed from the append method to the from_records API, something like below:
data = np.array([[1, 3], [2, 4], [4, 5]])
pd.DataFrame.from_records(data, columns=("a", "b"))
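A common, more efficient pattern (a sketch only, not from the original post; the function names are illustrative) is to collect the incoming chunks in a plain Python list and concatenate them only when the combined frame is needed, since every append copies the whole master frame:

import pandas as pd

chunks = []  # small DataFrames collected every 10 seconds

def on_new_data(tracker_frame):
    # just remember the new chunk; nothing is copied here
    chunks.append(tracker_frame)

def get_master_frame():
    # build the master frame only when it is actually needed
    return pd.concat(chunks, ignore_index=True)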
I have a simple/flat dataset that looks like...
columnA columnB columnC
value1a value1b value1c
value2a value2b value2c
...
valueNa valueNb valueNc
Although the structure is simple, it's tens of millions of rows deep and I have 50+ columns.
I need to validate that each value in a row conforms with certain format requirements. Some checks are simple (e.g. isDecimal, isEmpty, isAllowedValue etc) but some involve references to other columns (e.g. does columnC = columnA / columnB) and some involve conditional validations (e.g. if columnC = x, does columnB contain y).
I started off thinking that the most efficient way to validate this data was by applying lambda functions to my dataframe...
df.apply(lambda x: validateCol(x), axis=1)
But it seems like this can't support the full range of conditional validations I need to perform (where specific cell validations need to refer to other cells in other columns).
Is the most efficient way to do this to simply loop through all rows one-by-one and check each cell one-by-one? At the moment, I'm resorting to this but it's taking several minutes to get through the list...
df.columns = ['columnA', 'columnB', 'columnC']
myList = df.T.to_dict().values()  # much faster to iterate over a list
for row in myList:
    validate(row['columnA'], row['columnB'], row['columnC'])
Thanks for any thoughts on the most efficient way to do this. At the moment, my solution works, but it feels ugly and slow!
Iterating over rows is very inefficient. You should work on columns directly, using vectorized NumPy or Pandas functions. Indeed, Pandas stores data column-major in memory, so row-by-row access requires data to be fetched from many different memory locations and aggregated. Vectorization is then not possible, or hard to achieve efficiently (most hardware cannot vectorize such access patterns). Moreover, Python loops are generally very slow (at least with the mainstream CPython interpreter).
You can work on columns using Pandas or NumPy directly. Note that NumPy operations are much faster if you work on well-defined native types (floats, small integers and bounded strings, as opposed to Python objects like unbounded strings and large integers). Note also that Pandas already stores data using NumPy internally.
Here is an example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'columnA':[1, 4, 5, 7, 8], 'columnB':[4, 7, 3, 6, 9], 'columnC':[5, 8, 6, 1, 2]})
a = df['columnA'].to_numpy()
b = df['columnB'].to_numpy()
c = df['columnC'].to_numpy()
# You can use np.isclose for a row-by-row result stored in an array
print(np.allclose(a/b, c))
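The conditional validations from the question can also be expressed on whole columns with boolean masks. Here is a minimal sketch reusing the arrays above; the specific values x=6 and y=3 are illustrative assumptions:

# "if columnC == x, does columnB equal y" as a single vectorized check
x, y = 6, 3
mask = c == x                       # rows where the condition applies
violations = mask & (b != y)        # rows where the conditional rule fails
print(np.flatnonzero(violations))   # indices of offending rows (empty if all pass)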
Another approach is to split your dataset into smaller pieces and parallelize computation.
For the validation part (types or schema), I suggest using a library like voluptuous or a similar one; I found a schema a very maintainable approach.
Anyway, as noted by @jerome, a vectorized approach can save you a lot of computational time.
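For illustration only, a minimal voluptuous schema for the flat rows described above might look like the following; the column rules are assumptions, not taken from the original question:

from voluptuous import Schema, All, Range, Invalid, ALLOW_EXTRA

row_schema = Schema({
    'columnA': All(float, Range(min=0)),  # hypothetical rule: non-negative float
    'columnB': All(float, Range(min=0)),
    'columnC': float,
}, extra=ALLOW_EXTRA)  # ignore the remaining 50+ columns in this sketch

def is_valid(row_dict):
    # returns True when a single row (as a dict) passes the schema
    try:
        row_schema(row_dict)
        return True
    except Invalid:
        return False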
I have a large data frame with all binary variables (a sparse matrix that was converted into pandas so that I can later convert to Dask). The dimensions are 398,888 x 52,034.
I am trying to create a much larger data frame that consists of 10,000 different bootstrap samples from the original data frame. Each sample is the same size as the original data. The final data frame will also have a column that keeps track of which bootstrap sample that row is from.
Here is my code:
import numpy as np
import pandas as pd
import dask.dataframe as dd

# sample df
df_pd = pd.DataFrame(np.array([[0, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 1]]),
                     columns=['a', 'b', 'c', 'd'])

# convert into Dask dataframe
df_dd = dd.from_pandas(df_pd, npartitions=4)

B = 2  # eventually 10,000
big_df = dd.from_pandas(pd.DataFrame([]), npartitions=1000)
for i in range(B + 1):
    data = df_dd.sample(frac=1, replace=True, random_state=i)
    data["sample"] = i
    big_df.append(data)
The data frame produced by the loop is empty, but I cannot figure out why. To be more specific, if I look at big_df.head() I get: UserWarning: Insufficient elements for 'head'. 5 elements requested, only 0 elements available. Try passing larger 'npartitions' to 'head'. If I try print(big_df), I get: ValueError: No objects to concatenate.
My guess is there is at least a problem with this line, big_df = dd.from_pandas(pd.DataFrame([]), npartitions = 1000), but I have no idea.
Let me know if I need to clarify anything. I am somewhat new to Python and even newer to Dask, so even small tips or feedback that don't fully answer the question would be greatly appreciated. Thanks!
You are probably better off using dask.dataframe.concat and concatenating the dataframes together -- still, there are a few problems.
append creates a new object, so you will have to save that object: df = df.append(data)
Try calling big_df.head(npartitions=-1); it uses all partitions to get 5 rows (the appending/concatenating here can create small partitions with fewer than 5 rows).
It would also be good to write this with Pandas first before jumping to Dask. You might also be interested in reading through: https://docs.dask.org/en/latest/best-practices.html#load-data-with-dask
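Putting those points together, a sketch of the concat-based version of the question's loop might look like this (reusing B and df_dd from the question; an illustration of the idea rather than tested code):

import dask.dataframe as dd

samples = []
for i in range(B + 1):
    data = df_dd.sample(frac=1, replace=True, random_state=i)
    data["sample"] = i
    samples.append(data)

big_df = dd.concat(samples)           # concatenate once, outside the loop
print(big_df.head(npartitions=-1))    # search all partitions for the first rows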
I am trying to create a new dataframe based on a condition applied per group.
Suppose I have a dataframe with Name, Flag and Month columns.
import pandas as pd
import numpy as np
data = {'Name':['A', 'A', 'B', 'B'], 'Flag':[0, 1, 0, 1], 'Month':[1,2,1,2]}
df = pd.DataFrame(data)
need = df.loc[df['Flag'] == 0].groupby(['Name'], as_index = False)['Month'].min()
My condition is to find the minimum month where flag equals 0, per name.
I have used .loc to define my condition; it works fine, but I found the performance quite poor when applying it to 10 million rows.
Is there a more efficient way to do this?
Thank you!
I just had this same scenario yesterday, where I took a 90-second process down to about 3 seconds. Because speed is your concern (like mine was), and you are not tied solely to Pandas itself, I would recommend using Numba and NumPy. The catch is that you're going to have to brush up on your data structures and types to get a good grasp of what Numba is really doing with JIT compilation. Once you do, though, it rocks.
I would recommend finding a way to get every value in your DataFrame to an integer. For your name column, try unique IDs. Flag and Month already look good.
name_ids = []
for i, name in enumerate(np.unique(df["Name"])):
    name_ids.append({i: name})
Then, create a function and loop the old-fashioned way:
from numba import njit

@njit
def really_fast_numba_loop(data):
    for row in data:
        pass  # do stuff with each row here
    return data

new_df = really_fast_numba_loop(data)
The first time your function is called in your file, it will be about the same speed as it would elsewhere, but all the other times it will be lightning fast. So, the trick is finding the balance between what to put in the function and what to put in its outside loop.
In either case, when you're done processing your values, convert your name_ids back to strings and wrap your data in pd.DataFrame.
Et voila. You just beat Pandas iterrows/itertuples.
Comment back if you have questions!
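For this specific question (minimum Month per Name where Flag is 0), a concrete version of the idea might look like the sketch below. It is an illustration under assumptions: it uses pd.factorize instead of the manual name_ids mapping purely for brevity, and the sentinel value and function name are made up:

import numpy as np
import pandas as pd
from numba import njit

@njit
def min_month_per_name(name_codes, flags, months, n_names):
    # sentinel larger than any real month value
    out = np.full(n_names, 1 << 30, dtype=np.int64)
    for i in range(len(name_codes)):
        if flags[i] == 0 and months[i] < out[name_codes[i]]:
            out[name_codes[i]] = months[i]
    return out

# hypothetical usage: encode names as integer codes, then call the jitted loop
codes, uniques = pd.factorize(df["Name"])
min_months = min_month_per_name(codes, df["Flag"].to_numpy(),
                                df["Month"].to_numpy(), len(uniques))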
Is there a way to estimate the size a dataframe would be without loading it into memory? I already know that I do not have enough memory for the dataframe that I am trying to create but I do not know how much more memory would be required to fully create it.
You can calculate it for one row and estimate based on that:

import pandas as pd

data = {'name': ['Bill'],
        'year': [2012],
        'num_sales': [4]}
df = pd.DataFrame(data, index=['sales'])
df.memory_usage(index=True).sum()  # -> 32
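From there, a rough extrapolation is just the one-row figure times the expected row count; a sketch (the row count is a hypothetical target, and deep=True also counts the contents of object columns such as strings):

bytes_per_row = df.memory_usage(index=True, deep=True).sum()
expected_rows = 10_000_000  # hypothetical number of rows you plan to load
print(bytes_per_row * expected_rows / 1e9, "GB (rough estimate)")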
I believe you're looking for df.memory_usage, which would tell you how much each column will occupy.
Altogether it would go something like:
df.memory_usage().sum()
Output:
123123000
You can do more specific things like including the index (index=True) or using deep=True, which will "introspect the data deeply". Feel free to check the documentation!
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.memory_usage.html
What is the best way to store and analyze high-dimensional data in Python? I like the Pandas DataFrame and Panel, where I can easily manipulate the axes. Now I have a hyper-cube (dim >= 4) of data. I have been thinking of things like a dict of Panels, or tuples as Panel entries. I wonder if there is a high-dimensional Panel-like thing in Python.
update 20/05/16:
Thanks very much for all the answers. I have tried MultiIndex and xarray; however, I am not able to comment on any of them. For my problem I will try to use ndarray instead, as I found the labels are not essential and I can save them separately.
update 16/09/16:
I ended up using MultiIndex. The ways to manipulate it are pretty tricky at first, but I have kind of got used to it now.
MultiIndex is most useful for higher-dimensional data, as explained in the docs and this SO answer, because it allows you to work with any number of dimensions in a DataFrame environment.
In addition to the Panel, there is also Panel4D, currently at an experimental stage. Given the advantages of MultiIndex, I wouldn't recommend using either this or the three-dimensional version. I don't think these data structures have gained much traction in comparison, and they will indeed be phased out.
If you need labelled arrays and pandas-like smart indexing, you can use the xarray package, which is essentially an n-dimensional extension of the pandas Panel (Panels are being deprecated in pandas in favour of xarray).
Otherwise, it may sometimes be reasonable to use plain numpy arrays which can be of any dimensionality; you can also have arbitrarily nested numpy record arrays of any dimension.
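As an illustration only (the dimension names and sizes below are made-up assumptions), a 4-D labelled array in xarray can be built and sliced much like a pandas object:

import numpy as np
import xarray as xr

cube = xr.DataArray(
    np.random.rand(2, 3, 4, 5),
    dims=("time", "x", "y", "channel"),
    coords={"time": [2015, 2016], "channel": list("abcde")},
)
print(cube.sel(time=2015, channel="a"))  # pandas-like label-based selection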
I recommend continuing to use DataFrame but utilize the MultiIndex feature. DataFrame is better supported and you preserve all of your dimensionality with the MultiIndex.
Example
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'], index=['A', 'B'])
df3 = pd.concat([df for _ in [0, 1]], keys=['one', 'two'])
df4 = pd.concat([df3 for _ in [0, 1]], axis=1, keys=['One', 'Two'])
print(df4)
Looks like:
      One    Two
        a  b   a  b
one A   1  2   1  2
    B   3  4   3  4
two A   1  2   1  2
    B   3  4   3  4
This is a hyper-cube of data, and you'll be much better served in terms of support, existing questions and answers, fewer bugs, and many other benefits.
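A small follow-up sketch, reusing df4 from above, to show how slices of the hyper-cube are selected through the MultiIndex:

print(df4.loc['one'])                # all rows under the outer row label 'one'
print(df4.xs('a', axis=1, level=1))  # the 'a' columns from both 'One' and 'Two'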