Pandas Groupby issue - python

Python noob and learning.
Running into an issue when I use groupby. If I remove the groupby and print result, it is fine. Not sure what the issue is; any help would be greatly appreciated.
import pandas as pd
path1 = "/content/NYC_Jobs_1.csv"
path2 = "/content/NYC_Jobs_2.xlsx"
df1 = pd.read_csv(path1)
df2 = pd.read_excel(path2)
result = df1.merge(df2,on="Job ID",how='outer')
grouped = result.groupby('Job ID')
grouped.to_csv('NYC.csv', index=False)
I'm getting an AttributeError:
AttributeError Traceback (most recent call last)
<ipython-input-1-066a0fd6dfcb> in <module>
9 grouped = result.groupby('Job ID')
10
---> 11 grouped.to_csv('NYC.csv', index=False)
/usr/local/lib/python3.8/dist-packages/pandas/core/groupby/groupby.py in __getattr__(self, attr)
909 return self[attr]
910
--> 911 raise AttributeError(
912 f"'{type(self).__name__}' object has no attribute '{attr}'"
913 )
AttributeError: 'DataFrameGroupBy' object has no attribute 'to_csv'

The problem you've run into is that you've not kept clear in your mind the distinction between a DataFrame and a DataFrameGroupBy object.
If you're new to programming, one thing that may not be clear to you yet is the relationship between classes, objects, attributes, methods, and functions. So the error you got is opaque to you.
Let me translate:
a class represents a 'type of thing', like, say, 'a saucepan'
an object is a member of a class; a specific saucepan is an object, whereas the 'idea of a saucepan' is a class
objects can have attributes; a saucepan has a handle with a specific length, but this doesn't have to be the same for all saucepans
some attributes are not really 'properties' (like the length of the saucepan handle), but rather methods, meaning things you can do with the object. For example: 'cook spaghetti'.
these methods are the same kind of thing as a function, but they only make sense in the context of the object they are part of (good luck trying to cook spaghetti with your bare hands)
I'll illustrate.
The pandas library provides a function called read_csv. This function returns a DataFrame object. All DataFrame objects store data across various attributes representing columns in a table (these are themselves another type of object called a Series). The read_csv function creates a DataFrame that stores data read from a CSV file on disk.
DataFrame objects have a method, merge, which takes another DataFrame as the first argument. This method returns a third DataFrame, which is an amalgam of the two you started with.
DataFrames have a method to_csv which causes the contents of the DataFrame to be read out to a CSV file.
DataFrames have a method groupby which returns a DataFrameGroupBy object.
Now, DataFrameGroupBy objects do not have a method called to_csv. I hope that you can now understand the meaning of the error: AttributeError: 'DataFrameGroupBy' object has no attribute 'to_csv'.
The way forward
The DataFrameGroupBy object is easy to confuse with a DataFrame: it has a similar name and, like a DataFrame, holds data in attributes, and pandas often takes the approach of having DataFrame methods return new DataFrames.
Why? This allows method chaining:
pd.read_csv("/some/file.csv").merge(df2).some_other_method()...
Every (...) marks a new DataFrame being created. The expression is evaluated left to right:
pd.read_csv("/some/file.csv").merge(df2).some_other_method()...
-> df1.merge(df2).some_other_method()...
-> df3.some_other_method()...
-> df4...
Neat. Neat, but confusing if you aren't keeping track. Especially confusing if, just as you're getting used to it, one of these methods doesn't return a DataFrame at all, but rather some other kind of object.
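A quick way to see this for yourself is to print the types along the chain. Here is a minimal sketch with a made-up frame and column names, not your actual data:
import pandas as pd

df = pd.DataFrame({"Job ID": [1, 1, 2], "Salary": [50, 60, 70]})

print(type(df.merge(df, on="Job ID")))  # still a DataFrame
print(type(df.groupby("Job ID")))       # a DataFrameGroupBy, not a DataFrame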
The purpose of a DataFrameGroupBy object is to store groups of data ready for aggregation. They have a variety of methods available to do the aggregation, which you can read about in the documentation.
Here is a tutorial on how to use groupby properly. Some examples would be:
count_jobs = result.groupby("Job ID").count()
max_some_column_name = result.groupby("Job ID")["Some Column Name"].max()
The first of these aggregates the data directly; the second selects another column from the groups and finds its maximum value for each Job ID.
In the second case, the output will be a Series object. Since such objects do have a to_csv method, you could successfully write the data:
result.groupby("Job ID")["Some Column Name"].max().to_csv("output.csv")

Related

Joining two dataframe of one column generated with spark

I'm working with pyspark and pandas in Databricks. I'm generating the following two dataframes:
start_date = datetime.today() - timedelta(days=60)
end_date = datetime.today()
date_list = pd.date_range(start=start_date,end=end_date).strftime('%Y-%m-%d').tolist()
date_df = spark.createDataFrame(date_list, 'string').toDF("date")
and
random_list = np.random.normal(loc=50, scale=10, size=61)
random_list = [round(i) for i in random_list]
integer_df = spark.createDataFrame(random_list, 'integer').toDF("value")
so I have two dataframes of one column each ("date" and "value") of the same length and I'd like to "merge" them into one dataframe.
I've tried this:
integer_df=pd.concat(date_df)
which returns the following error: first argument must be an iterable of pandas-on-Spark objects, you passed an object of type "DataFrame"
and this
test_df = pd.concat([integer_df, date_df], axis=1, join='inner')
which returns the following error: cannot concatenate object of type 'list; only ps.Series and ps.DataFrame are valid
Mostly I'd like to understand these errors.
From what I can see, you are not converting the objects correctly; for example, you are trying to concatenate a Spark DataFrame with a pandas DataFrame.
first argument must be an iterable of pandas-on-Spark objects, you passed an object of type "DataFrame"
This error was raised because you passed the wrong type of object to concatenate. You should use pandas-on-Spark objects, or plain pandas objects if you are going to use pandas.
So to fix your first error, I would just follow that convention and work with the objects of the library you are actually using.
Something like this (or maybe just use pd.Series() or pd.DataFrame()):
date_df = spark.createDataFrame(date_list, 'string').toPandas()
# toDF("date") is redundant, either use createDataFrame or toDf not both
integer_df = spark.createDataFrame(random_list, 'integer').toPandas()
After that, try pd.concat([...]) on the converted results.
Your second error was raised because that concat only accepts Series or DataFrame objects from its own library (something similar to your list); since you are passing a PySpark DataFrame, it seems to get read as a plain list and rejected.
So again, to fix it, use the correct object for the library, or convert to NumPy if you want something more efficient.
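For illustration, here is a minimal sketch that skips Spark entirely and builds both single-column frames as plain pandas objects (assuming pandas is all you need downstream; the variable names mirror your snippet):
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

start_date = datetime.today() - timedelta(days=60)
end_date = datetime.today()
date_list = pd.date_range(start=start_date, end=end_date).strftime('%Y-%m-%d').tolist()

# one random value per date, so the two columns line up
random_list = [round(i) for i in np.random.normal(loc=50, scale=10, size=len(date_list))]

date_df = pd.DataFrame({"date": date_list})
integer_df = pd.DataFrame({"value": random_list})

# both objects now come from the same library, so concat works
test_df = pd.concat([date_df, integer_df], axis=1, join='inner')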
Hope this helps.

How do you create a data-frame from NoneType?

I'm currently using the ffn library to analyze stock data. However, while using one of the functions to get information I've gotten back an object I can't manipulate.
While trying to turn the object back into a pandas DataFrame, I get the error: Invalid file path or buffer object type: <class 'NoneType'>
Is there any way to read this data type back into pandas?
From the python docs:
https://docs.python.org/3/library/constants.html
None
The sole value of the type NoneType. None is frequently used to represent the absence of a value, as when default arguments are not passed to a function. Assignments to None are illegal and raise a SyntaxError.
If you put nothing (the absence of a value) into a DataFrame, the DataFrame will be empty. NoneType means that the object is equal to the singleton None.
So to create a DataFrame from NoneType you could test for None and if the result is True create an empty DataFrame.
import pandas as pd

if result is None:
    df = pd.DataFrame()
But that's probably not what you want. More likely, the problem is in how you obtained the object in the first place; a None value carries no information.
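As a sketch of that guard, here is a minimal, self-contained example; get_data is just a hypothetical stand-in for whatever ffn call returned None in your case:
import pandas as pd

def get_data():
    # hypothetical stand-in for the call that returned None
    return None

result = get_data()
if result is None:
    df = pd.DataFrame()          # fall back to an empty frame
else:
    df = pd.DataFrame(result)    # only build the frame when there is actual data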

Retain successfully transformed rows in the event of runtime error in pandas

I'm applying a string manipulation function to a pandas DataFrame column whose length is north of a million rows. Due to some bad data in between, it fails with:
AttributeError: 'float' object has no attribute 'lower'
Is there a way I can save the progress made so far on the column?
Let's say the manipulation function is:
def clean_strings(strg):
    strg = strg.lower()  # lower
    return strg
And it is applied to the data frame as:
df_sample['clean_content'] = df_sample['content'].apply(clean_strings)
Where 'content' is the column with strings and 'clean_content' is the new column added.
Please suggest other approaches. TIA
First, use map, since your input is only one column and map is faster than apply:
df_sample['clean_content']= df_sample['content'].map(clean_strings)
Second, cast your column to string type so your function can run on every row:
df['content'] = df['content'].astype(str)

def clean_strings(strg):
    strg = strg.lower()  # lower
    return strg
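Putting those two steps together, a minimal sketch (with a hypothetical sample frame standing in for your data) might look like:
import pandas as pd

df_sample = pd.DataFrame({'content': ['Hello', 'WORLD', None]})  # hypothetical sample data

def clean_strings(strg):
    return strg.lower()

# cast first so every element is a string, then map the cleaning function over the column
df_sample['clean_content'] = df_sample['content'].astype(str).map(clean_strings)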
Is there a way I can save the progress made so far on the column?
Unfortunately not: these calls act atomically on the dataframe, meaning the entire operation either succeeds or fails. I'm assuming str.lower is just a representative example and you're actually doing much more in your function. That means this is a job for exception handling.
def clean_string(row):
    try:
        return row.lower()
    except AttributeError:
        return row
If a particular record fails, you can handle the raised exception inside the function itself, controlling what is returned in that case.
You'd call the function appropriately -
df_sample['clean_content'] = df_sample['content'].apply(clean_string)
Note that content is a column of objects, and objects generally offer very poor performance in terms of vectorised operations. I'd recommend performing a cast to string -
df_sample['content'] = df_sample['content'].astype(str)
After this, consider using pandas' vectorised .str accessor functions in place of clean_string.
For reference, if all you want to do is lowercase your string column, use str.lower -
df_sample['content'] = df_sample['content'].astype(str).str.lower()
Note that, for an object column, you can still use the .str accessor. However, non-string elements will be coerced to NaN -
df_sample['content'] = df_sample['content'].str.lower() # assuming `content` is of `object` type
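If you also want to know which rows contained the bad data, here is a small sketch (with a hypothetical sample frame) that flags non-string entries before cleaning:
import pandas as pd

# hypothetical sample: mostly strings with a stray float (e.g. NaN from a blank cell)
df_sample = pd.DataFrame({'content': ['Hello', 'WORLD', float('nan')]})

# flag rows whose value is not a string so you can inspect the bad data
bad_rows = df_sample[~df_sample['content'].map(lambda x: isinstance(x, str))]
print(bad_rows)

# vectorised lowercase on the object column; non-string elements become NaN instead of raising
df_sample['clean_content'] = df_sample['content'].str.lower()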

Why does `head` need `()` and `shape` does not?

In the following code, I import a CSV file into Python's pandas library, display the first 5 rows, and query the 'shape' of the dataframe.
import pandas as pd
data = pd.read_csv('my_file.csv')
data.head() #returns the first 5 rows of the dataframe
data.shape # displays the # of rows and # of columns of dataframe
Why is it that the head() method requires empty parentheses after head but shape does not? Does it have to do with their types?
If I called head without following it with the empty parentheses, I would not get the same result. Is it that head is a method and shape is just an attribute?
How could I generalize the answer to the above question to the rest of Python? I am trying to learn not just about pandas here but about Python in general. For example, a sentence such as "When _____ is the case, one must include empty parentheses if no arguments will be provided, but for other attributes one does not have to."
The reason that head is a method and not an attribute most likely has to do with performance. If head were an attribute, it would mean that every time you wrangle a dataframe, pandas would have to precompute the slice of data and store it in the head attribute, which would be a waste of resources. The same goes for the other methods called with empty parentheses.
In the case of shape, it is provided as an attribute since this information is essential to any dataframe manipulation, so it is precomputed and made available as an attribute.
When you call data.head(), you are calling the method head(self) on the object data.
However, when you write data.shape, you are referencing a public attribute of the object data.
It is good to keep in mind that there is a distinct difference between methods and object attributes. You can read up on it here
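To see the difference directly, here is a small sketch you can run:
import pandas as pd

data = pd.DataFrame({"a": [1, 2, 3]})

print(data.shape)    # attribute: stored data, read without parentheses -> (3, 1)
print(data.head)     # no parentheses: this is the bound method object itself, not its result
print(data.head())   # parentheses actually call the method and return the first rows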

Dask dataframe has no attribute categorize

I am trying to store a Dask dataframe, with a categorical column, to a *.h5 file per this tutorial - 1:23:25 - 1:23:45.
Here is my call to a store function:
stored = store(ddf,'/home/HdPC/Analyzed.h5', ['Tag'])
The function store is:
@delayed
def store(ddf, fp, c):
    ddf.categorize(columns=c).to_hdf(fp, '/data2')
and uses categorize.
ddf and stored are of type:
print(type(ddf), type(stored))
>>> (<class 'dask.dataframe.core.DataFrame'>, <class 'dask.delayed.Delayed'>)
When I run compute(*[stored]) or stored.compute(), I get this:
dask.async.AttributeError: 'DataFrame' object has no attribute 'categorize'
Is there a way to achieve this categorization of the Tag column with the store function? Or should I use a different method to store the Dask dataframe with a categorical?
I would suggest you try the dataframe operations without the delayed call: Dask dataframes are already lazy compute graphs internally. I believe that by calling compute, you are actually passing the resulting pandas dataframe to your function, which is why you get the error.
In your case: simply remove @delayed (remembering that to_hdf is a blocking call).
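A minimal sketch of that change, keeping your store function but letting Dask's own laziness do the work (assuming ddf is still the Dask dataframe from your snippet):
def store(ddf, fp, c):
    # categorize runs on the Dask dataframe itself; to_hdf then triggers the computation
    ddf.categorize(columns=c).to_hdf(fp, '/data2')

store(ddf, '/home/HdPC/Analyzed.h5', ['Tag'])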
