I have a function that creates new columns and data for a pandas dataframe. I am now trying to move this testing method to dask so I can test on larger sets of data. I am having trouble finding the problem because my function does not throw any errors, just that the data is wrong. I came to the conclusion that it must be an issue with the functions I am calling. What am I missing? I think it's here, but if it were, I would expect Python to give me an error, and it doesn't. I recently saw that transform is not supported in dask. I also believe between_time is not supported.
validSignalTime = (df1.index.time >= en) & (df1.index.time <= NoSignalAfter)
time_condition = df1.index.isin(
    df1.between_time(st, en, include_start=True, include_end=False).index
)
df1['Entry_Price'] = (
    df1[time_condition]
    .groupby(df1[time_condition].index.date)['High']
    .transform('cummax')
)
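Since groupby(...).transform and between_time are not available on dask dataframes, one workaround I'm considering is doing the whole computation per partition with map_partitions, where each partition is a plain pandas DataFrame. This is only a sketch: it assumes st, en and the columns are as above, that each partition holds whole calendar days (e.g. after repartitioning by day), and you may need to pass meta= explicitly if dask cannot infer the output schema:

import pandas as pd

def add_entry_price(pdf):
    # pdf is an ordinary pandas DataFrame here, so between_time and
    # groupby(...).transform behave exactly as they do in pandas.
    pdf = pdf.copy()
    cond = pdf.index.isin(
        pdf.between_time(st, en, include_start=True, include_end=False).index
    )
    pdf.loc[cond, 'Entry_Price'] = (
        pdf[cond]
        .groupby(pdf[cond].index.date)['High']
        .transform('cummax')
    )
    return pdf

# Assumes ddf is the dask version of df1, repartitioned so that no
# calendar day is split across two partitions (the cummax would
# otherwise restart mid-day).
ddf = ddf.map_partitions(add_entry_price)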
Both "krogh" and "barycentric" seem to not clean the dataframe fully (meaning between the first non-NaN and the last non-NaN).
What are they intended to use for? (My purpose would be a Timeseries).
Context: I'm setting up a pipeline with different cleaning functions to test later and adapted the whole pandas.DataFrame.interpolation() function because it comes in pretty handy.
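For reference, a minimal sketch of the kind of call I'm testing (toy data; both methods need scipy, and the exact edge behaviour may depend on the pandas/scipy versions):

import numpy as np
import pandas as pd

# Toy time series with leading, interior, and trailing NaNs.
s = pd.Series(
    [np.nan, 1.0, np.nan, 4.0, np.nan],
    index=pd.date_range("2021-01-01", periods=5, freq="D"),
)

# Both methods delegate to scipy interpolators.
print(s.interpolate(method="krogh"))
print(s.interpolate(method="barycentric"))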
I want to convert a SQL query like
SELECT * FROM df WHERE id IN (SELECT id FROM an_df)
into the dask equivalent.
So, I am trying this:
d = df[df['id'].isin(an_df['id'])]
But it is throwing NotImplementedError: passing a 'dask.dataframe.core.DataFrame' to 'isin'.
Then I converted an_df['id'] to a list, like
d = df[df['id'].isin(list(an_df['id']))] or d = df[df['id'].isin(an_df['id'].compute())]
but this is very time consuming.
I want a solution that is as fast as dask itself.
df has approximately 100 million rows.
Please help me with it.
Thanks
I recommend adding a minimal reproducible example, which will make solving this particular issue easier:
https://stackoverflow.com/help/minimal-reproducible-example
It seems like you are converting the pandas.core.series.Series object returned by an_df['id'].compute() to a list, which is not needed. isin() will take a pandas Series or DataFrame object as its argument. Please see:
https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.isin.html
In your example this should work:
series_to_match = an_df['id'].compute()
d = df[df['id'].isin(series_to_match)]
So you can omit the list(...) cast. I expect this to be a little bit faster since that type conversion can be dropped. But there are still things you need to consider here. Depending on the size of an_df['id'].compute(), you may run into trouble, since that statement pulls the resulting series object into the memory of the machine your scheduler is running on.
If this series is small enough, you could try to use client.scatter to make sure all of your workers have that series persisted in memory, see:
http://distributed.dask.org/en/stable/locality.html
If that series is a huge object you'll have to tackle this differently.
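A rough sketch of that idea (assuming a dask.distributed setup; whether a scattered future can be passed straight into a partition-wise filter like this depends on your dask/distributed versions, so treat it as a starting point rather than a definitive recipe):

from distributed import Client

client = Client()  # assumes a dask.distributed cluster/scheduler

# Pull the (small) id series to the client once...
series_to_match = an_df['id'].compute()

# ...then broadcast it so every worker keeps a copy in memory,
# instead of it being shipped inside every single task.
ids_future = client.scatter(series_to_match, broadcast=True)

# Filter each partition against the broadcast series; meta is given
# explicitly so dask does not try to call the lambda for inference.
d = df.map_partitions(
    lambda part, ids: part[part['id'].isin(ids)],
    ids_future,
    meta=df._meta,
)
result = d.compute()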
I am applying a OneHotEncoding function to two very similar dataframes. The first dataframe is the following:
When I apply the one hot encoding, everything works fine:
However, when I apply the exact same function to this different, but very similar dataframe:
The following error occurs:
I don't understand why this happens, because dataframes 1 and 2 were both extracted from the same previous dataframe (they serve as the train and test dfs for a machine learning application). Both are pyspark.sql dataframes. Can anyone help me?
As the error says, you can't sort a list containing None and integers. There is perhaps a null in your column, which causes the line categories.sort() to crash.
If you want to do ML with Spark, I'd suggest using the pyspark.ml package instead of writing your own one-hot encoder. For example, see here.
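For example, a minimal sketch with pyspark.ml (Spark 3.x API; the column and dataframe names are placeholders, and handleInvalid="keep" is one way to stop nulls or unseen categories from crashing the fit):

from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer

# "category", train_df and test_df are placeholder names.
indexer = StringIndexer(inputCol="category", outputCol="category_idx",
                        handleInvalid="keep")
encoder = OneHotEncoder(inputCols=["category_idx"],
                        outputCols=["category_vec"])

model = Pipeline(stages=[indexer, encoder]).fit(train_df)
train_enc = model.transform(train_df)  # fit once on the train df...
test_enc = model.transform(test_df)    # ...and reuse it on the test df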
I'm currently trying to learn more about deep learning/CNNs/Keras through what I thought would be a quite simple project: just training a CNN to detect a single specific sound. It's been a lot more of a headache than I expected.
I'm currently reading through this, ignoring the second section about GPU usage; the first part definitely seems like exactly what I need. But when I go to run the script (my script is lifted almost entirely from the section in the link above that says "Putting the pieces together, you may end up with something like this:"), it gives me this error:
AttributeError: 'DataFrame' object has no attribute 'file_path'
I can't find anything in the pandas documentation about a DataFrame.file_path attribute, so I'm confused as to what that part of the code is attempting to do.
My CSV file contains two columns: one with the file paths, and a second column labelling each path as either positive or negative.
Sidenote: I'm also aware that this entire guide may just not be the thing I'm looking for. I'm having a very hard time finding material that is useful for the specific project I'm trying to do, and if anyone has links that would be better, I'd be very appreciative.
The statement df.file_path denotes that you want to access the file_path column in your dataframe. It seems that your dataframe object does not contain this column. With df.head() you can check whether your dataframe object contains the needed fields.
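A quick sketch of that check (the file and column names are just placeholders):

import pandas as pd

df = pd.read_csv("labels.csv")   # placeholder CSV of paths + labels
print(df.head())                 # inspect the first rows
print(df.columns.tolist())       # see the actual column names

# If the columns are named differently, rename them to match what
# the guide's code expects:
df = df.rename(columns={"path": "file_path", "label": "target"})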
I'm using the sample Python machine learning "IRIS" dataset (as the starting point of a project). The data are POSTed into a Flask web service. Thus, the key difference between what I'm doing and all the examples I can find is that I'm trying to load a pandas DataFrame from a variable, not from a file or URL, both of which seem to be easy.
I extract the IRIS data from Flask's request.values. All good. But at that point, I can't figure out how to get a pandas dataframe the way pd.read_csv(....) does. So far, it seems the only solution is to parse each row and build up several Series I can use with the DataFrame constructor. There must be something I'm missing, since reading this data from a URL is a simple one-liner.
I'm assuming reading a variable into a pandas DataFrame should not be difficult, since it seems like an obvious use case.
I tried wrapping the data with io.StringIO(csv_data) and then calling read_csv on that variable, but that doesn't work either.
Note: I also tried things like ...
data = pd.DataFrame(csv_data, columns=['....'])
but got nothing but errors (for example, "constructor not called correctly!")
I am hoping for a simple method I can call that will infer the columns and names and create the DataFrame for me from a variable, without my needing to know a lot about pandas (just to read and load a simple CSV data set, anyway).
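For what it's worth, this is roughly the shape of what I was attempting with io.StringIO (the POST field name "data" is a placeholder for however the CSV text arrives):

import io

import pandas as pd
from flask import Flask, request

app = Flask(__name__)

@app.route("/iris", methods=["POST"])
def iris():
    # "data" is a placeholder for whatever field holds the raw CSV text.
    csv_data = request.values["data"]
    # Wrapping the string in a file-like object lets read_csv infer
    # columns and dtypes exactly as it would from a file or URL.
    df = pd.read_csv(io.StringIO(csv_data))
    return str(df.shape)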