I have the following pandas dataframe:
d2 = {'col1': [0, 0, 1, 1, 2], 'col2': [10, 11, 12, 13, 14]}
df2 = pd.DataFrame(data=d2)
df2
Output:
   col1  col2
0     0    10
1     0    11
2     1    12
3     1    13
4     2    14
And I need to run the following:
for i, g in df2.groupby(['col1']):
    col1_val = g["col1"].iloc[0]
    print(col1_val)
The original code is more complex, but I am writing it this way for the purpose of illustration.
And the part for i, g in df2.groupby(['col1']): gives the following warning:
FutureWarning: In a future version of pandas, a length 1 tuple will be returned
when iterating over a groupby with a grouper equal to a list of length 1.
Don't supply a list with a single grouper to avoid this warning.
How am I supposed to run the for loop to get rid of this warning?
This means that you should pass a plain string instead of a single-element list:
for i, g in df2.groupby('col1'):
    col1_val = g["col1"].iloc[0]
    print(col1_val)
If you keep the original code, in a future version of pandas i will take the values (0,)/(1,)/(2,) instead of 0/1/2.
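If you do need to keep the list grouper (say, because the grouping columns come from a variable), you can unpack the length-1 tuple key instead. A minimal sketch reusing df2 from above, and assuming a pandas version new enough (2.0 and later) to actually return the tuple keys:
for (col1_val,), g in df2.groupby(['col1']):
    print(col1_val)  # prints 0, 1, 2, same as the string version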
Related
I have a dataframe where one column (we'll call it info) contains another dataframe in every cell/row. I want to loop through all the rows in this column and literally stack the nested dataframes on top of each other, because they all have the same columns.
How would I go about this?
You could try as follows:
import pandas as pd
length=5
# some dfs
nested_dfs = [pd.DataFrame({'a': [*range(length)],
                            'b': [*range(length)]}) for x in range(length)]
print(nested_dfs[0])
   a  b
0  0  0
1  1  1
2  2  2
3  3  3
4  4  4
# df with nested_dfs in info
df = pd.DataFrame({'info_col': nested_dfs})
# code to be implemented
lst_dfs = df['info_col'].values.tolist()
df_final = pd.concat(lst_dfs, axis=0, ignore_index=True)
df_final.tail()
    a  b
20  0  0
21  1  1
22  2  2
23  3  3
24  4  4
This method should be a bit faster than the solution offered by nandoquintana, which also works.
Incidentally, it is ill-advised to name a df column info, because df.info is actually a method. E.g., normally df['col_name'].values.tolist() can also be written as df.col_name.values.tolist(). However, if you try this with df.info.values.tolist(), you will run into an error:
AttributeError: 'function' object has no attribute 'values'
You also run the risk of shadowing the method if you assign to the attribute instead of the column, which is probably not what you want. E.g.:
print(type(df.info))
<class 'method'>
df.info = 1
# the column is unaffected; you have just created an int attribute
print(type(df.info))
<class 'int'>
# but:
df['info'] = 1
# your column now has all 1's
print(type(df['info']))
<class 'pandas.core.series.Series'>
This is the solution that I came up with, although it's not the fastest (pd.concat inside the loop re-copies the accumulated rows on every iteration), which is why I am still leaving the question unanswered:
df1 = pd.DataFrame()
for frame in df['info'].tolist():
    df1 = pd.concat([df1, frame], axis=0).reset_index(drop=True)
Our dataframe has three columns (col1, col2 and info).
In info, each row has a nested df as value.
import pandas as pd
nested_d1 = {'coln1': [11, 12], 'coln2': [13, 14]}
nested_df1 = pd.DataFrame(data=nested_d1)
nested_d2 = {'coln1': [15, 16], 'coln2': [17, 18]}
nested_df2 = pd.DataFrame(data=nested_d2)
d = {'col1': [1, 2], 'col2': [3, 4], 'info': [nested_df1, nested_df2]}
df = pd.DataFrame(data=d)
We could combine all nested dfs' rows by appending them to a list (as the nested dfs' schema is constant) and concatenating them later.
nested_dfs = []
for index, row in df.iterrows():
    nested_dfs.append(row['info'])
result = pd.concat(nested_dfs, sort=False).reset_index(drop=True)
print(result)
This would be the result:
   coln1  coln2
0     11     13
1     12     14
2     15     17
3     16     18
I have a pandas dataframe where the columns are named like:
0,1,2,3,4,.....,n
I would like to drop every 3rd column so that I get a new dataframe with columns like:
0,1,3,4,6,7,9,.....,n
I have tried this:
shape = df.shape[1]
for i in range(2, shape, 3):
    df = df.drop(df.columns[i], axis=1)
but I get an error saying the index is out of bounds, and I assume this happens because the shape of the dataframe changes as I drop columns. If I just don't store the output of the for loop, the code runs without errors, but then I don't get my new dataframe.
How do I solve this?
Thanks
The issue with your code is that each time you drop a column in the loop, you overwrite df, so the next iteration works on a different set of columns. When you then drop the next 3rd column of THAT new set, you not only drop the wrong one, you eventually run out of columns. That's why you get the error.
iter1 -> 0,1,3,4,5,6,7,8,9,10 ... n  # first you drop 2, which is the 3rd col
iter2 -> 0,1,3,4,5,7,8,9,10 ... n    # next you drop 6, which is now the 6th col (should have been 5)
iter3 -> 0,1,3,4,5,7,8,9 ... n       # next you drop 10, which is now the 9th col (should have been 8)
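Here is a minimal reproduction of the failure (a sketch with a 10-column frame, assuming integer column names as in the question):
import pandas as pd
df = pd.DataFrame([range(10)], columns=range(10))
for i in range(2, df.shape[1], 3):       # i = 2, 5, 8 (the range is computed once, up front)
    df = df.drop(df.columns[i], axis=1)  # df shrinks on every iteration
# 3rd iteration: only 8 columns are left, so df.columns[8] raises
# IndexError: index 8 is out of bounds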
What you want to do is calculate the indexes beforehand and then remove them in one go. You can simply get the indexes of the columns you want to remove with range and then drop those.
drop_idx = list(range(2, df.shape[1], 3))  # indexes to drop
df2 = df.drop(drop_idx, axis=1)            # drop them all at once over axis=1
print('old columns->', list(df.columns))
print('idx to drop->', drop_idx)
print('new columns->', list(df2.columns))
old columns-> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
idx to drop-> [2, 5, 8]
new columns-> [0, 1, 3, 4, 6, 7, 9]
Note: This works only because your column names are the same as their positional indexes. If your column names are different, you will have to do an extra step of fetching the column names for the indexes you want to drop.
drop_idx = list(range(2, df.shape[1], 3))
drop_cols = [j for i, j in enumerate(df.columns) if i in drop_idx]  # <--
df2 = df.drop(drop_cols, axis=1)
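For instance, with hypothetical string column names (a sketch; the frame and names are made up):
import pandas as pd
df = pd.DataFrame([[1, 2, 3, 4, 5, 6, 7]], columns=list('abcdefg'))
drop_idx = list(range(2, df.shape[1], 3))                           # [2, 5]
drop_cols = [j for i, j in enumerate(df.columns) if i in drop_idx]  # ['c', 'f']
df2 = df.drop(drop_cols, axis=1)
print(list(df2.columns))                                            # ['a', 'b', 'd', 'e', 'g']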
Here is a solution with inverted logic - select all columns except every 3rd one.
You can build a helper array of column positions with np.arange, add 1, and keep the columns where the result modulo 3 is not equal to 0, passing the boolean mask to DataFrame.loc:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('aaabbb')
})
# keep only columns whose (position + 1) is not a multiple of 3
df = df.loc[:, (np.arange(len(df.columns)) + 1) % 3 != 0]
print(df)
   A  B  D  E
0  a  4  1  5
1  b  5  3  3
2  c  4  5  6
3  d  5  7  9
4  e  5  1  2
5  f  4  0  4
You can use list comprehension to filter columns:
df = df[[k for k in df.columns if (k + 1) % 3 != 0]]
If the names are different (e.g. strings) and you want to discard every 3rd column regardless of its name, then:
df = df[[k for i, k in enumerate(df.columns, 1) if i % 3 != 0]]
I am trying to get rid of values that are 0.000000 in my dataframe so I can find the min/max values, excluding zero.
When I try to exclude the zeros from my dataframe Answer2 using the code below, I am still getting the same dataframe with the zeros intact:
no_zero=Answer2.loc[(Answer2!=0.000000).any(1)]
no_zero
Any idea on how I can remove zeros?
You could replace the zeros with NaN:
import numpy
df = df.replace(0, numpy.nan)
then define your own min and max functions that ignore NaN values (pandas' built-in min/max already skip NaN by default, so series.min() alone would also work):
def non_zero_min(series):
    return series.dropna().min()
df.apply(non_zero_min, axis=1)  # min of each row, NaN ignored
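For instance, on a small made-up frame (a sketch):
import numpy
import pandas as pd
def non_zero_min(series):
    return series.dropna().min()
df = pd.DataFrame({"A": [0, 2, 5], "B": [1, 0, 7]})
df = df.replace(0, numpy.nan)
print(df.apply(non_zero_min, axis=1))  # row-wise mins: 1.0, 2.0, 5.0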
For example, where df is:
df = pd.DataFrame({"A": [0, 2, 5, 5, 7], "B":[1, 5, 0, 1, 7]})
A B
0 0 1
1 2 5
2 5 0
3 5 1
4 7 7
You can do this to mask every 0 in the dataframe with NaN:
df[df != 0]
Then you can find the minimum or maximum value using pandas.Series.min() and pandas.Series.max():
For example:
df = df[df != 0]
df.A.min()  # 2.0 (the 0 is ignored)
df.B.min()  # 1.0
I'm unsure how to do this with python and am stuck. The ticker values in the column are in a list format. When trying to pass that list to melt's value_vars I get an error. When I try converting to a tuple, it still contains the list brackets. The documentation says "value_vars -- tuple, list, or ndarray, optional" -- but I'm not having success with list or tuple. Thanks in advance.
My data:
                   sector                                             ticker
0  Communication Services  [ATVI.OQ, GOOGL.OQ, GOOG.OQ, T.N, CTL.N, CHTR....
1  Consumer Discretionary  [AAP.N, AMZN.OQ, APTV.N, AZO.N, BBY.N, BKNG.OQ...
rowData = groups.loc[groups['sector'] == 'Communication Services']
print(tuple(rowData['ticker']))
new_df = pd.melt(new_df, id_vars=['date'], value_vars=rowData['ticker'])
The tuple doesn't look right with this output:
(['ATVI.OQ', 'GOOGL.OQ', 'GOOG.OQ', 'T.N', 'CTL.N', 'CHTR'],)
And here is the value_vars error:
TypeError: unhashable type: 'list'
EDIT
Solved using
tup = tuple(rowData['ticker'].explode())
new_df = pd.melt(new_df, id_vars=['date'], value_vars=tup)
Your data are not actually in wide format. I think what you want is just to explode the column:
df = pd.DataFrame({'A': [[1, 2, 3], [4, 5, 6]], 'B': ['A', 'B']})
df.explode('A')
   A  B
0  1  A
0  2  A
0  3  A
1  4  B
1  5  B
1  6  B
But I'm not 100 per cent sure about what the end goal is. See http://xyproblem.info/.
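Applied to the data in the question, that would be something like this (a sketch; groups is the sector/ticker frame from the question):
long_df = groups.explode('ticker').reset_index(drop=True)
print(long_df.head())  # one row per (sector, ticker) pair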
I can't understand how to add column names to a pandas dataframe; an easy example will clarify my issue:
dic = {'a': [4, 1, 3, 1], 'b': [4, 2, 1, 4], 'c': [5, 7, 9, 1]}
df = pd.DataFrame(dic)
now if I type df then I get
   a  b  c
0  4  4  5
1  1  2  7
2  3  1  9
3  1  4  1
say now that I generate another dataframe just by summing up the columns on the previous one
a = df.sum()
if I type 'a' then I get
a     9
b    11
c    22
That looks like a dataframe with an index but without a name on its only column. So I wrote
a.columns = ['column']
or
a.columns = ['index', 'column']
and in both cases Python was happy, because it didn't give me any error message. But still, if I type 'a' I can't see the column name anywhere. What's wrong here?
The method DataFrame.sum() does an aggregation and therefore returns a Series, not a DataFrame. And a Series has no columns, only an index (assigning to a.columns silently creates a plain attribute on the object, which is why you saw no error). If you want to create a DataFrame out of your sum, you can change a = df.sum() to:
a = pd.DataFrame(df.sum(), columns=['whatever_name_you_want'])
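Alternatively, a Series can be promoted to a single-column DataFrame with Series.to_frame, which gives the same result:
a = df.sum().to_frame('whatever_name_you_want')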