Convert pandas.core.groupby.SeriesGroupBy to a DataFrame - python

This question didn't have a satisfactory answer, so I'm asking it again.
Suppose I have the following Pandas DataFrame:
df1 = pd.DataFrame({'group': ['a', 'a', 'b', 'b'], 'values': [1, 1, 2, 2]})
I group by the first column 'group':
g1 = df1.groupby('group')
I've now created a "DataFrameGroupBy". Then I extract the first column from the GroupBy object:
g1_1st_column = g1['group']
The type of g1_1st_column is "pandas.core.groupby.SeriesGroupBy". Notice it's not a "DataFrameGroupBy" anymore.
My question is, how can I convert the SeriesGroupBy object back to a DataFrame object? I tried using the .to_frame() method, and got the following error:
g1_1st_column = g1['group'].to_frame()
AttributeError: Cannot access callable attribute 'to_frame' of 'SeriesGroupBy' objects, try using the 'apply' method.
How would I use the apply method, or some other method, to convert to a DataFrame?

Manish Saraswat kindly answered my question in the comments.
g1['group'].apply(pd.DataFrame)
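A SeriesGroupBy only describes how the rows are grouped, so another common route is to aggregate first and then turn the result into a DataFrame. The sketch below reuses df1 from the question; choosing sum() as the aggregation is an assumption made purely for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'group': ['a', 'a', 'b', 'b'], 'values': [1, 1, 2, 2]})

# Selecting a column from the GroupBy gives a SeriesGroupBy (no data yet).
grouped_values = df1.groupby('group')['values']

# Aggregating produces a Series indexed by group; reset_index() turns it
# into a regular DataFrame with 'group' and 'values' columns.
result = grouped_values.sum().reset_index()
print(result)
```

Any other aggregation (mean, count, first, and so on) can be swapped in for sum() in the same way.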

Related

How to convert 'SeriesGroupBy' object to list

I have a dataframe with several columns that I am filtering in hopes of using the output to filter another dataframe further. Ultimately I'd like to convert the group by object to a list to further filter another dataframe but I am having a hard time converting the SeriesGroupBy object to a list of values. I am using:
id_list = df[df['date_diff'] == pd.Timedelta('0 days')].groupby('id')['id'].tolist()
I've tried to reset_index() and to_frame() and .values before to_list() with no luck.
error is:
'SeriesGroupBy' object has no attribute tolist
Expected output: Simply a list of id's
Try:
df[df['date_diff'] == pd.Timedelta('0 days')].groupby('id')['id'].apply(list)
Also, I am a bit skeptical about how you are comparing df['date_diff'] with the groupby output.
EDIT: This might be useful for your intended purpose (s might be output of your groupby):
s = pd.Series([['a','a','b'],['b','b','c','d'],['a','b','e']])
s.explode().unique().tolist()
['a', 'b', 'c', 'd', 'e']
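If the goal is simply a flat list of ids, the groupby may not be needed at all. Here is a minimal sketch under the assumption that the frame has an 'id' column and a timedelta 'date_diff' column as in the question; the sample values are made up.

```python
import pandas as pd

# Hypothetical sample data shaped like the frame in the question.
df = pd.DataFrame({
    'id': [1, 1, 2, 3],
    'date_diff': pd.to_timedelta([0, 1, 0, 2], unit='D'),
})

# Boolean mask for the rows whose difference is exactly zero days.
mask = df['date_diff'] == pd.Timedelta('0 days')

# Take the matching ids, drop repeats, and convert to a plain Python list.
id_list = df.loc[mask, 'id'].unique().tolist()
print(id_list)
```

The resulting list can then be passed to something like other_df['id'].isin(id_list) to filter a second frame.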

How to change specific row value in dataframe using pandas? [duplicate]

My data frame is attached below as an image. I am trying to change a specific value in a row, but I have not succeeded. Any leads would be appreciated.
df.replace(to_replace ="Agriculture, forestry and fishing ",
value ="Agriculture")
Image of My data frame
Try this:
df['Name'] = df['Name'].str.replace('Agriculture, forestry and fishing', 'Agriculture')
This should work for any data type:
df.loc[df.loc[:, 'Name']=='Agriculture, forestry and fishing', 'Name'] = 'Agriculture'
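One likely reason the original attempt appeared to do nothing is that replace returns a new object instead of modifying df in place, so the result has to be assigned back. The sketch below assumes a 'Name' column as in the answers above; the rows and the 'Value' column are invented for illustration.

```python
import pandas as pd

# Hypothetical data; only the 'Name' column comes from the answers above.
df = pd.DataFrame({
    'Name': ['Agriculture, forestry and fishing', 'Mining'],
    'Value': [10, 20],
})

# Series.replace matches whole cell values and returns a new Series,
# so the result is assigned back to the column.
df['Name'] = df['Name'].replace('Agriculture, forestry and fishing', 'Agriculture')
print(df)
```

Note that the match must be exact: a trailing space in the search string, as in the question, would prevent a match against a value without one.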
You can get all the column names by calling df.columns.
You can then copy that list, replace the name of any column, and reassign the list to df.columns.
For example:
import pandas as pd
df = pd.DataFrame(data=[[1, 2], [10, 20], [100, 200]], columns=['A', 'B'])
df.columns
In a Jupyter notebook the output will be:
Index(['A', 'B'], dtype='object')
Copy that list, change the names you want, and reassign it:
df.columns = ['C', 'D']
The columns are now named C and D instead of A and B. You can check this by calling:
df.head()
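As an alternative to reassigning the whole list, DataFrame.rename takes a mapping from old names to new ones, which avoids retyping every column. A small sketch using the same frame as above:

```python
import pandas as pd

df = pd.DataFrame(data=[[1, 2], [10, 20], [100, 200]], columns=['A', 'B'])

# rename returns a new DataFrame; columns missing from the mapping
# keep their current names.
renamed = df.rename(columns={'A': 'C', 'B': 'D'})
print(renamed.columns)
```

The original df keeps its column names unless the result is assigned back to it.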

Python - Change the original dataframe object from a dictionary?

I'm trying to perform a number of operations on a list of dataframes. I've opted to use a dictionary to help me with this process, but I was wondering if it's possible to reference the originally created dataframe with the changes.
So using the below code as an example, is it possible to call the dfA object with the columns ['a', 'b', 'c'] that were added when it was nested within the dictionary object?
dfA = pd.DataFrame(data=[1], columns=['x'])
dfB = pd.DataFrame(data=[1], columns=['y'])
dfC = pd.DataFrame(data=[1], columns=['z'])
dfdict = {'A':dfA,
'B':dfB,
'C':dfC}
df_dummy = pd.DataFrame(data=[[1,2,3]], columns=['a', 'b', 'c'])
for key in dfdict:
    dfdict[str(key)] = pd.concat([dfdict[str(key)], df_dummy], axis=1)
The initial dfA that you created and the dfA DataFrame from your dictionary are two different objects. (You can confirm this by running dfA is dfdict['A'] or id(dfA) == id(dfdict['A']), both of which should return False).
To access the second (newly created) object you need to call it from the dictionary.
dfdict['A']
Or:
dfdict.get('A')
The returned DataFrame will have the new columns you added.
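The identity change described above can be checked directly. The sketch below is a reduced version of the question's code with a single entry; rebinding dfA at the end is one possible way to make the original name point at the updated frame, not the only one.

```python
import pandas as pd

dfA = pd.DataFrame(data=[1], columns=['x'])
dfdict = {'A': dfA}
df_dummy = pd.DataFrame(data=[[1, 2, 3]], columns=['a', 'b', 'c'])

# Before the loop body runs, the name and the dict entry are the same object.
same_before = dfA is dfdict['A']

# pd.concat builds a new DataFrame, so the dict entry now points elsewhere.
dfdict['A'] = pd.concat([dfdict['A'], df_dummy], axis=1)
same_after = dfA is dfdict['A']

# Rebind the original name if it should refer to the updated frame.
dfA = dfdict['A']
print(same_before, same_after, list(dfA.columns))
```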

Finding quartile using .apply in python

I am passing a column called petrol['tax'] of a dataframe to a function using .apply which returns 1st quartile. I am trying to use the below code but it throws me this error 'float' object has no attribute 'quantile'.
def Quartile(petrol_attrib):
    return petrol_attrib.quantile(.25)
petrol['tax'].apply(Quartile)
I need help to implement this.
df = pd.DataFrame(np.array([[1, 1], [2, 10], [3, 100], [4, 100]]),
columns=['a', 'b'])
Now you can use the quantile function in pandas. Make sure you pass a number between 0 and 1. Here is an example:
df.a.quantile(0.5)
2.5
You can apply the function to the whole dataframe
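The error in the question comes from Series.apply calling the function once per element, so each call receives a single float, which has no quantile method. A sketch of three ways around that follows; the petrol values are hypothetical and the quartile helper is just a lower-case version of the function in the question.

```python
import pandas as pd

# Hypothetical stand-in for the petrol data in the question.
petrol = pd.DataFrame({'tax': [1.0, 2.0, 3.0, 4.0, 5.0]})

def quartile(column):
    # Expects a whole Series, not a single value.
    return column.quantile(0.25)

# Option 1: call quantile on the column directly.
q1 = petrol['tax'].quantile(0.25)

# Option 2: DataFrame.apply passes each column as a Series to the function.
q1_via_apply = petrol[['tax']].apply(quartile)

# Option 3: DataFrame.quantile computes it for every numeric column at once.
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [1, 10, 100, 100]})
q1_all = df.quantile(0.25)
print(q1, q1_via_apply, q1_all, sep='\n')
```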

Append a column to a dataframe in Pandas

I am trying to append a numpy.ndarray to a dataframe with little success.
The dataframe is called user2 and the numpy.ndarray is called CallTime.
I tried:
user2["CallTime"] = CallTime.values
but I get an error message:
Traceback (most recent call last):
File "<ipython-input-53-fa327550a3e0>", line 1, in <module>
user2["CallTime"] = CallTime.values
AttributeError: 'numpy.ndarray' object has no attribute 'values'
Then I tried:
user2["CallTime"] = user2.assign(CallTime = CallTime.values)
but I get again the same error message as above.
I also tried to use the merge command but for some reason it was not recognized by Python although I have imported pandas. In the example below CallTime is a dataframe:
user3 = merge(user2, CallTime)
Error message:
Traceback (most recent call last):
File "<ipython-input-56-0ebf65759df3>", line 1, in <module>
user3 = merge(user2, CallTime)
NameError: name 'merge' is not defined
Any ideas?
Thank you!
A pandas DataFrame is a 2-dimensional data structure, and each column of a DataFrame is a 1-dimensional Series. So if you want to add one column to a DataFrame, the data you add must be 1-dimensional, for example a Series. An np.ndarray can have any number of dimensions. From your code, I believe the shape of the np.ndarray CallTime is n x 1 (n rows and 1 column), and it's easy to convert it to a Series. Here is an example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(5,2), columns=['A', 'B'])
This creates a dataframe df with two columns 'A' and 'B', and 5 rows.
CallTime = np.random.rand(5,1)
Assume this is your np.ndarray data CallTime
df['C'] = pd.Series(CallTime[:, 0])
This will add a new column to df. Here CallTime[:,0] is used to select first column of CallTime, so if you want to use different column from np.ndarray, change the index.
Please make sure that the number of rows for df and CallTime are equal.
Hope this would be helpful.
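For an n x 1 array, flattening it is another way to get the 1-dimensional data a column needs. This is a minimal sketch with made-up numbers; the lower-case call_time name is a stand-in for CallTime.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0]})

# Hypothetical column vector with shape (3, 1), like the CallTime array above.
call_time = np.array([[10.0], [20.0], [30.0]])

# ravel() flattens the array to shape (3,), which can be assigned as a column.
df['C'] = call_time.ravel()
print(df)
```

As noted above, the array must have as many rows as the DataFrame, otherwise the assignment raises a ValueError.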
Instead of only pointing to the documentation, I will try to provide a sample:
import numpy as np
import pandas as pd
data = {'A': [2010, 2011, 2012],
'B': ['Bears', 'Bears', 'Bears'],
'C': [11, 8, 10],
'D': [5, 8, 6]}
user2 = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'])
# Create the array that will be appended to the pandas DataFrame user2
CallTime = np.array([1, 2, 3])
# Convert the ndarray CallTime to a list. If CallTime is a matrix, you can iterate over the list after converting, or convert it into a DataFrame and append only the required column, or join the DataFrames.
user2.loc[:,'CallTime'] = CallTime.tolist()
print(user2)
I think this will help. See the numpy.ndarray.tolist documentation for why the list is needed and how it works. Here is also an example of how to create a DataFrame from a NumPy array, in case you need it: https://stackoverflow.com/a/35245297/2027457
Here is a simple solution.
user2["CallTime"] = CallTime
The problem is that CallTime is already an array, so it has no .values attribute; .values is what converts a DataFrame or Series to an array. For example,
df = pd.DataFrame(np.random.rand(10,2), columns=['A', 'B'])
# The following are all valid
df.values
df['A'].values
df['B'].values
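The NameError in the question arises because merge lives in the pandas namespace, so it has to be called as pd.merge (or as the DataFrame.merge method). The sketch below assumes CallTime is a DataFrame with the same row index as user2 and joins the two on that index; the data and the call_time_df name are hypothetical.

```python
import pandas as pd

# Hypothetical frames; the real user2 and CallTime are not shown in the question.
user2 = pd.DataFrame({'A': [2010, 2011, 2012]})
call_time_df = pd.DataFrame({'CallTime': [1, 2, 3]})

# With no shared key column, merge on the row index of both frames.
user3 = pd.merge(user2, call_time_df, left_index=True, right_index=True)
print(user3)
```

If the two frames do share a key column, passing on='that_column' would be the more usual choice.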
