I am trying to store a Dask dataframe, with a categorical column, to a *.h5 file per this tutorial - 1:23:25 - 1:23:45.
Here is my call to a store function:
stored = store(ddf,'/home/HdPC/Analyzed.h5', ['Tag'])
The function store is:
@delayed
def store(ddf, fp, c):
    ddf.categorize(columns=c).to_hdf(fp, '/data2')
and uses categorize.
ddf and stored are of type:
print(type(ddf), type(stored))
>>> (<class 'dask.dataframe.core.DataFrame'>, <class 'dask.delayed.Delayed'>)
When I run compute(*[stored]) or stored.compute(), I get this:
dask.async.AttributeError: 'DataFrame' object has no attribute 'categorize'
Is there a way to achieve this categorization of the Tag column with the store function? Or should I use a different method to store the Dask dataframe with a categorical?
I would suggest you try the dataframe operations without the delayed call - dask dataframes already are lazy compute graphs internally. I believe that by calling compute, you are actually passing the resultant pandas dataframe to your function, which is why you get the error.
In your case: simply remove the @delayed decorator (remembering that to_hdf is a blocking call).
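For example, a minimal sketch keeping the names and path from the question:

def store(ddf, fp, c):
    # No @delayed needed: the dask dataframe is already lazy, and to_hdf
    # triggers the computation and blocks until the file is written.
    ddf.categorize(columns=c).to_hdf(fp, '/data2')

store(ddf, '/home/HdPC/Analyzed.h5', ['Tag'])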
Python noob here, still learning.
I'm running into an issue when I use groupby. If I remove the groupby and print result, it is fine. Not sure what the issue is; any help would be greatly appreciated.
import pandas as pd
path1 = "/content/NYC_Jobs_1.csv"
path2 = "/content/NYC_Jobs_2.xlsx"
df1 = pd.read_csv(path1)
df2 = pd.read_excel(path2)
result = df1.merge(df2,on="Job ID",how='outer')
grouped = result.groupby('Job ID')
grouped.to_csv('NYC.csv', index=False)
I'm getting an AttributeError:
AttributeError Traceback (most recent call last)
<ipython-input-1-066a0fd6dfcb> in <module>
9 grouped = result.groupby('Job ID')
10
---> 11 grouped.to_csv('NYC.csv', index=False)
/usr/local/lib/python3.8/dist-packages/pandas/core/groupby/groupby.py in __getattr__(self, attr)
909 return self[attr]
910
--> 911 raise AttributeError(
912 f"'{type(self).__name__}' object has no attribute '{attr}'"
913 )
AttributeError: 'DataFrameGroupBy' object has no attribute 'to_csv'
The problem you've run into is that you've not kept clear in your mind the distinction between a DataFrame and a DataFrameGroupBy object.
If you're new to programming, one thing that may not be clear to you yet is the relationship between classes, objects, attributes, methods, and functions. So the error you got is opaque to you.
Let me translate:
a class represents a 'type of thing', like, say, 'a saucepan'
an object is a member of a class; a specific saucepan is an object, whereas the 'idea of a saucepan' is a class
objects can have attributes; a saucepan has a handle with a specific length, but this doesn't have to be the same for all saucepans
some attributes are not really 'properties' (like the length of the saucepan handle), but rather methods, meaning things you can do with the object. For example: 'cook spaghetti'.
these methods are the same kind of thing as a function, but they only make sense in the context of the object they are part of (good luck trying to cook spaghetti with your bare hands)
I'll illustrate.
The pandas library provides a function called read_csv. This function returns a DataFrame object. All DataFrame objects store data across various attributes representing columns in a table (these are themselves another type of object called a Series). The read_csv function creates a DataFrame that stores data read from a CSV file on disk.
DataFrame objects have a method, merge, which takes another DataFrame as the first argument. This method returns a third DataFrame, which is an amalgam of the two you started with.
DataFrames have a method to_csv which causes the contents of the DataFrame to be read out to a CSV file.
DataFrames have a method groupby which returns a DataFrameGroupBy object.
Now, DataFrameGroupBy objects do not have a method called to_csv. I hope that you can now understand the meaning of the error: AttributeError: 'DataFrameGroupBy' object has no attribute 'to_csv'.
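To make the distinction concrete, here is a tiny sketch with made-up data showing the types involved:

import pandas as pd

df = pd.DataFrame({"Job ID": [1, 1, 2], "Salary": [50, 60, 70]})
grouped = df.groupby("Job ID")

print(type(df))               # <class 'pandas.core.frame.DataFrame'>
print(type(grouped))          # <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
print(type(grouped.count()))  # aggregating gives back a DataFrame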
The way forward
The DataFrameGroupBy object is easy to confuse with the DataFrame: it has a similar name and similar attributes for storing data, and pandas often takes the approach of having DataFrame methods return new DataFrames.
Why? This allows method chaining:
pd.read_csv("/some/file.csv").merge(df2).some_other_method()...
Every (...) marks a new DataFrame being created. The expression is evaluated left to right:
pd.read_csv("/some/file.csv").merge(df2).some_other_method()...
-> df1.merge(df2).some_other_method()...
-> df3.some_other_method()...
-> df4...
Neat. Neat, but confusing if you aren't keeping track. Especially confusing if, just as you're getting used to it, one of these methods doesn't return a DataFrame at all, but rather some other kind of object.
The purpose of a DataFrameGroupBy object is to store groups of data ready for aggregation. They have a variety of methods available to do the aggregation, which you can read about in the documentation.
Here is a tutorial on how to use groupby properly. Some examples would be:
count_jobs = result.groupby("Job ID").count()
max_some_column_name = result.groupby("Job ID")["Some Column Name"].max()
The first of these directly aggregates the data; the second selects another column within each group and finds the maximum value of that column for each Job ID.
In the second case, the output will be a Series object. Since such objects do have a to_csv method, you could successfully write the data:
result.groupby("Job ID")["Some Column Name"].max().to_csv("output.csv")
I'm trying to convert a pandas DataFrame to a PCollection from Apache Beam.
Unfortunately, when I use to_pcollection() function, I get the following error:
AttributeError: 'DataFrame' object has no attribute '_expr'
Does anyone know how to solve it?
I'm using pandas=1.1.4, beam=2.25.0 and Python 3.6.9.
to_pcollection was only ever intended to apply to Beam's deferred DataFrames, but looking at this it makes sense that it should work, and it isn't obvious how to do so manually. https://github.com/apache/beam/pull/14170 should fix this.
I get this problem when I use a "native" Pandas dataframe instead of a dataframe created by to_dataframe within Beam. I suspect that the dataframe created by Beam wraps or subclasses a Pandas dataframe with new attributes (like _expr) that the native Pandas dataframe doesn't have.
The real answer involves knowing how to use apache_beam.dataframe.convert.to_dataframe, but I can't figure out how to set the proxy object correctly (I get Singleton errors when I later try to use to_pcollection). So since I can't get the "right" way to work in 2.25.0 (I'm new to Beam and Pandas, and don't know how proxy objects work, so take all this with a grain of salt), I use this workaround:
import pandas as pd
import apache_beam as beam

class SomeDoFn(beam.DoFn):
    def process(self, pair):  # pair is a key/value tuple
        df = pd.DataFrame(pair[1])  # just the array of values
        ## do something with the dataframe
        ...
        records = df.to_dict('records')
        # return tuples with the same shape as the one we received
        return [(rec["key"], rec) for rec in records]
which I invoke with something like this:
rows = (
    pcoll
    | beam.ParDo(SomeDoFn())
)
I hope others will give you a better answer than this workaround.
I'm trying to merge two dataframes using
grouped_data = pd.merge(grouped_data, df['Pattern'].str[7:11],
                        how='left', left_on='Calc_DRILLING_Holes',
                        right_on='Calc_DRILLING_Holes')
But I get an error saying: can not merge DataFrame with instance of type <class 'pandas.core.series.Series'>
What could be the issue here? The original dataframe that I'm trying to merge into was created from a much larger dataset with the following code:
import pandas as pd
raw_data = pd.read_csv(r"C:\Users\cherp2\Desktop\test.csv")
data_drill = raw_data.query('Activity =="DRILL"')
grouped_data = (data_drill
                .groupby([data_drill['PeriodStartDate'].str[:10], 'Blast'])
                ['Calc_DRILLING_Holes'].sum()
                .reset_index()
                .sort_values('PeriodStartDate'))
What do I need to change here to make it a regular, normal dataframe?
If I try to convert either of them to a dataframe using .to_frame(), I get an error saying that 'DataFrame' object has no attribute 'to_frame'.
I'm so confused as to what kind of data type it is.
Both objects in a call to pd.merge need to be DataFrame objects. Is grouped_data a Series? If so, try promoting it to a DataFrame by passing pd.DataFrame(grouped_data) instead of just grouped_data.
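As a generic illustration (made-up data, not the OP's actual columns):

import pandas as pd

s = pd.Series([10, 20, 30], name='Calc_DRILLING_Holes')
print(type(s))            # <class 'pandas.core.series.Series'>

df_ok = pd.DataFrame(s)   # or, equivalently, s.to_frame()
print(type(df_ok))        # <class 'pandas.core.frame.DataFrame'>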
In the following code, I import a csv file into Python's pandas library and display the first 5 rows, and query the 'shape' of the pandas dataframe.
import pandas as pd
data = pd.read_csv('my_file.csv')
data.head() #returns the first 5 rows of the dataframe
data.shape # displays the # of rows and # of columns of dataframe
Why is it that the head() method requires empty parentheses after head but shape does not? Does it have to do with their types?
If I called head without following it with the empty parentheses, I would not get the same result. Is it that head is a method and shape is just an attribute?
How could I generalize the answer to the above question to the rest of Python? I am trying to learn not just about pandas here but Python in general. For example, a sentence such as: "When _____ is the case, one must include empty parentheses if no arguments will be provided, but for other attributes one does not have to."
The reason that head is a method and not an attribute most likely has to do with performance. If head were an attribute, it would mean that every time you wrangle a dataframe, pandas would have to precompute the slice of data and store it in the head attribute, which would be a waste of resources. The same goes for the other methods called with empty parentheses.
In the case of shape, it is provided as an attribute since this information is essential to any dataframe manipulation; thus it is precomputed and made available as an attribute.
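Mechanically, Python makes this possible with @property, which exposes a cheaply computed value without parentheses. A toy sketch (a hypothetical class, not pandas' actual implementation):

class Table:
    def __init__(self, rows):
        self._rows = rows

    @property
    def shape(self):          # cheap to compute, so exposed without parentheses
        ncols = len(self._rows[0]) if self._rows else 0
        return (len(self._rows), ncols)

    def head(self, n=5):      # does work on demand, so it stays a method
        return self._rows[:n]

t = Table([[1, 2], [3, 4], [5, 6]])
print(t.shape)    # (3, 2), accessed like an attribute
print(t.head(2))  # [[1, 2], [3, 4]], called with parentheses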
When you call data.head(), you are calling the method head(self) on the object data.
However, when you write data.shape, you are referencing a public attribute of the object data.
It is good to keep in mind that there is a distinct difference between methods and object attributes. You can read up on it here.
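To generalize beyond pandas, a minimal sketch of the difference:

class Dog:
    def __init__(self, name):
        self.name = name      # attribute: plain data, accessed without ()

    def bark(self):           # method: behavior, must be called with ()
        return f"{self.name} says woof"

d = Dog("Rex")
print(d.name)     # attribute access
print(d.bark())   # method call; d.bark without () is just the bound method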
I'm new to programming. I'm trying to use scipy minimize; I've had several issues and gotten through most of them.
Right now this is the code, but I'm not understanding why I'm getting this error.
par_opt = so.minimize(fun=fun_obj, x0=par_ini, method='Nelder-Mead', args=[series_pt_cal, dt, series_caudal_cal])
Not enough info is given by the OP, but basically somewhere in the code it's specified to operate by data frame column (axis=1) on an object that is a pandas Series. If the code typically works but occasionally gives errors, check for degenerate cases where a data frame may have only 1 row. Pandas has a nasty habit of guessing what you want -- it may decide to reduce a 1-row data frame to a Series (e.g., in the apply() function; you can disable that by passing reduce=False there).
Add a line of code to check that the object is a DataFrame (isinstance(df, pd.DataFrame)), or else convert the offending pandas Series to a data frame, something like s.to_frame().T for the problems I had to deal with.
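A sketch of that defensive check (ensure_dataframe is a hypothetical helper name):

import pandas as pd

def ensure_dataframe(obj):
    if isinstance(obj, pd.DataFrame):
        return obj
    if isinstance(obj, pd.Series):
        return obj.to_frame().T   # one-row DataFrame; the index becomes columns
    raise TypeError(f"expected DataFrame or Series, got {type(obj).__name__}")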
Use pd.DataFrame(df) before your so.minimize call.
The operations inside want to run on a DataFrame, not a Series.