Grouping data in Python with pandas yields a blank first row

I have this nice pandas dataframe:
And I want to group it by the column "0" (which represents the year) and calculate the mean of the other columns for each year. I do so with this code:
df.groupby(0)[[2, 3, 4]].mean()
That successfully calculates the mean of every column, but the problem is the apparently empty row that appears on top:

That's just a display artifact: the grouped column becomes the index, and this is simply the way it is displayed. You will notice that even when you set pd.set_option('display.notebook_repr_html', False) you still get this line; it has no effect on operations on the grouped df:
In [30]:
df = pd.DataFrame({'a':np.random.randn(5), 'b':np.random.randn(5), 'c':np.arange(5)})
df
Out[30]:
a b c
0 0.766706 -0.575700 0
1 0.594797 -0.966856 1
2 1.852405 1.003855 2
3 -0.919870 -1.089215 3
4 -0.647769 -0.541440 4
In [31]:
df.groupby('c')[['a','b']].mean()
Out[31]:
a b
c
0 0.766706 -0.575700
1 0.594797 -0.966856
2 1.852405 1.003855
3 -0.919870 -1.089215
4 -0.647769 -0.541440
Technically speaking, it has assigned the name attribute:
In [32]:
df.groupby('c')[['a','b']].mean().index.name
Out[32]:
'c'
By default, there will be no name if it has not been assigned:
In [34]:
print(df.index.name)
None
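If you would rather not have the grouped column as the index at all, one standard option (not part of the original answer) is as_index=False, or resetting the index after the fact; a minimal sketch:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randn(5), 'b': np.random.randn(5), 'c': np.arange(5)})

# Keep 'c' as an ordinary column instead of the index
print(df.groupby('c', as_index=False)[['a', 'b']].mean())

# Equivalent: group first, then move the index back into a column
print(df.groupby('c')[['a', 'b']].mean().reset_index())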

Related

Removing commented rows in place in pandas

I have a dataframe that may have comment rows at the bottom of it. For other reasons, I cannot pass the comment character when initializing the dataframe itself. Here is an example of what I would have:
df = pd.read_csv(file, header=None)
df
0 1
0 132605 1
1 132750 2
2 # total: 100000
Is there a way to remove all rows that start with a comment character in-place -- that is, without having to re-load the data frame?
Using startswith
newdf = df[df.iloc[:, 0].str.startswith('#').ne(True)]
Dataframe:
>>> df
0 1
0 132605 1
1 132750 2
2 # total: 100000
3 foo bar
Dropping in-place:
>>> to_drop = df[0].str.startswith('#').where(lambda s: s).dropna().index
>>> df.drop(to_drop, inplace=True)
>>> df
0 1
0 132605 1
1 132750 2
3 foo bar
Assumptions: you want to find rows where the column labeled 0 starts with '#'. Otherwise, adjust accordingly.
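If the comment marker can be preceded by whitespace, a slightly more defensive variant (an assumption about your data, not part of the original answer) strips it first:
# Strip leading whitespace before testing for '#'; .ne(True) keeps rows where the test is False or NaN
newdf = df[df.iloc[:, 0].str.lstrip().str.startswith('#').ne(True)]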

How to keep only the top n% rows of each group of a pandas dataframe?

I have seen a variant of this question that keeps the top n rows of each group of a pandas dataframe, and the solutions use n as an absolute number rather than a percentage (see Pandas get topmost n records within each group). However, in my dataframe each group has a different number of rows, and I want to keep the top n% of rows in each group. How would I approach this problem?
You can construct a Boolean series of flags and filter before you group. First, let's create an example dataframe and look at the number of rows for each unique value in the first series:
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 2, (10, 3)))
print(df[0].value_counts())
0 6
1 4
Name: 0, dtype: int64
Then define a fraction, e.g. 50% below, and construct a Boolean series for filtering:
n = 0.5
g = df.groupby(0)
flags = (g.cumcount() + 1) <= g[1].transform('size') * n
Then apply the condition, set the index as the first series and (if required) sort the index:
df = df.loc[flags].set_index(0).sort_index()
print(df)
1 2
0
0 1 1
0 1 1
0 1 0
1 1 1
1 1 0
As you can see, the resultant dataframe has only three rows with index 0 and two with index 1, in each case half the number in the original dataframe.
Here is another option, which builds on some of the answers in the post you mentioned.
First of all, here is a quick function to either round up or round down. If we want the top 30% of rows of a dataframe 8 rows long, then we would be trying to take 2.4 rows, so we need to round either up or down.
My preferred option is to round up. This is because, for example, if we were to take 50% of the rows but had one group which only had one row, we would still keep that one row. I kept this separate so that you can change the rounding as you wish:
import math

def round_func(x, up=True):
    '''Round a float up or down to an int.'''
    if up:
        return math.ceil(x)
    else:
        return math.floor(x)
Next, I make a dataframe to work with and set a parameter p, the fraction of the rows from each group that we should keep. Everything else follows, and I have commented it so that hopefully you can follow:
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})
p = 0.30  # top fraction of each group to keep

df_top = df.groupby('id').apply(                    # group by the ids
    lambda x: x.reset_index()['value'].nlargest(    # in each group take the top rows by column 'value'
        round_func(x.count().max() * p)))           # calculate how many to keep from each group
df_top = df_top.reset_index().drop('level_1', axis=1)  # make the dataframe nice again
df looked like this
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
df_top looks like this
id value
0 1 3
1 2 4
2 2 3
3 3 1
4 4 1
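For completeness, a more compact alternative (a sketch, not from either answer above) filters on within-group percentile ranks; note it effectively rounds down, so a very small group can lose all of its rows:
# Keep rows whose rank by 'value' (descending, as a fraction of group size) is within the top p
df_top = df[df.groupby('id')['value'].rank(pct=True, ascending=False) <= p]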

pandas DataFrame assign with format

I'm trying to use assign to create a new column in a pandas DataFrame. I need to use something like str.format to have the new column be pieces of existing columns. For instance...
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(3, 3))
gives me...
0 1 2
0 -0.738703 -1.027115 1.129253
1 0.674314 0.525223 -0.371896
2 1.021304 0.169181 -0.884293
An assign for a totally new column works:
# works
print(df.assign(c = "a"))
0 1 2 c
0 -0.738703 -1.027115 1.129253 a
1 0.674314 0.525223 -0.371896 a
2 1.021304 0.169181 -0.884293 a
But if I want to build a new column from an existing column, it seems like pandas is putting the whole existing frame into the new column.
# doesn't work
print(df.assign(c = "a{}b".format(df[0])))
0 1 2 \
0 -0.738703 -1.027115 1.129253
1 0.674314 0.525223 -0.371896
2 1.021304 0.169181 -0.884293
c
0 a0 -0.738703\n1 0.674314\n2 1.021304\n...
1 a0 -0.738703\n1 0.674314\n2 1.021304\n...
2 a0 -0.738703\n1 0.674314\n2 1.021304\n...
Thanks for the help.
In [131]: df.assign(c="a"+df[0].astype(str)+"b")
Out[131]:
0 1 2 c
0 0.833556 -0.106183 -0.910005 a0.833556419295b
1 -1.487825 1.173338 1.650466 a-1.48782514804b
2 -0.836795 -1.192674 -0.212900 a-0.836795026809b
'a{}b'.format(df[0]) is a str. "a"+df[0].astype(str)+"b" is a Series.
In [142]: type(df[0].astype(str))
Out[142]: pandas.core.series.Series
In [143]: type('{}'.format(df[0]))
Out[143]: str
When you assign a single string to the column c, that string is repeated for every row in df.
Thus, df.assign(c = "a{}b".format(df[0])) assigns the string 'a{}b'.format(df[0])
to each row of df:
In [138]: 'a{}b'.format(df[0])
Out[138]: 'a0 0.833556\n1 -1.487825\n2 -0.836795\nName: 0, dtype: float64b'
It is really no different than what happened with df.assign(c = "a").
In contrast, when you assign a Series to the column c, then the index of the Series is aligned with the index of df and the corresponding values are assigned to df['c'].
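For instance, here is a minimal sketch (not from the original answer) of that label alignment: the values land according to the Series index, not its order:
s = pd.Series(['x', 'y', 'z'], index=[2, 0, 1])
df.assign(c=s)  # row 0 gets 'y', row 1 gets 'z', row 2 gets 'x'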
Under the hood, the Series.__add__ method is defined in such a way so that addition of the Series containing strings with a string results in a new Series with the string concatenated with the values in the Series:
In [149]: "a"+df[0].astype(str)
Out[149]:
0 a0.833556419295
1 a-1.48782514804
2 a-0.836795026809
Name: 0, dtype: object
(The astype method was called to convert the floats in df[0] into strings.)
df['c'] = "a" + df[0].astype(str) + 'b'
df
0 1 2 c
0 -1.134154 -0.367397 0.906239 a-1.13415403091b
1 0.551997 -0.160217 -0.869291 a0.551996920472b
2 0.490102 -1.151301 0.541888 a0.490101854737b
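If you specifically want format-string control (for example, fixed precision), a hedged alternative is to map the format callable over the column element-wise; the precision here is just an illustrative choice:
# Series.map applies the format string to each element rather than to the whole Series
df = df.assign(c=df[0].map('a{:.3f}b'.format))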

Python Reindex Producing Nan

Here is the code that I am working with:
import pandas as pd
test3 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
test3 = test3.reindex(index=['f', 'g', 'z'])
Originally everything is fine: test3 has an index of 'a', 'b', 'c' and values 1, 2, 3. But when I go to reindex test3, my values 1, 2, 3 are lost. Why is that? The desired output would be:
f 1
g 2
z 3
The docs are clear on this behaviour:
Conform Series to new index with optional filling logic, placing
NA/NaN in locations having no value in the previous index
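In other words, reindex aligns by label: a value survives only if its label existed in the old index. A minimal sketch (not part of the original answer):
test3 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
test3.reindex(index=['a', 'x'])
# a    1.0
# x    NaN
# dtype: float64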
If you just want to overwrite the index values, then do:
In [32]:
test3.index = ['f','g','z']
test3
Out[32]:
f 1
g 2
z 3
dtype: int64

Added column to existing dataframe but entered all numbers as NaN

So I created two dataframes from existing CSV files, both consisting entirely of numbers. The second dataframe has an index from 0 to 8783 and one column of numbers, and I want to add it as a new column to the first dataframe, whose index consists of a month, day and hour. I tried using append, merge and concat, and none worked, and then tried simply using:
x1GBaverage['Power'] = x2_cut
where x1GBaverage is the first dataframe and x2_cut is the second. When I did this, it added x2_cut properly, but all the values were entered as NaN instead of the numerical values they should be. How should I be approaching this?
x1GBaverage['Power'] = x2_cut.values
problem solved :)
The thing about pandas is that values are implicitly linked to their indices unless you deliberately specify that you only need the values to be transferred over.
If the row counts are the same and you just want to tack it on the end, the indexes either need to match or you need to pass just the underlying values. In the example below, columns 3 and 5 are the index-matching and values versions, and column 4 is what you're running into now:
In [58]: df = pd.DataFrame(np.random.random((3,3)))
In [59]: df
Out[59]:
0 1 2
0 0.670812 0.500688 0.136661
1 0.185841 0.239175 0.542369
2 0.351280 0.451193 0.436108
In [61]: df2 = pd.DataFrame(np.random.random((3,1)))
In [62]: df2
Out[62]:
0
0 0.638216
1 0.477159
2 0.205981
In [64]: df[3] = df2
In [66]: df.index = ['a', 'b', 'c']
In [68]: df[4] = df2
In [70]: df[5] = df2.values
In [71]: df
Out[71]:
0 1 2 3 4 5
a 0.670812 0.500688 0.136661 0.638216 NaN 0.638216
b 0.185841 0.239175 0.542369 0.477159 NaN 0.477159
c 0.351280 0.451193 0.436108 0.205981 NaN 0.205981
If the row counts differ, you'll need to use df.merge and let it know which columns it should be using to join the two frames.
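As a hedged sketch of that last point (the frame and key-column names here are assumptions, not from the original question), you would join on whatever columns identify matching rows in both frames:
# Hypothetical keys; replace with the columns that identify a row in both frames
merged = left_df.merge(right_df, on=['month', 'day', 'hour'], how='left')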
