Mean of all rows excluding first column - python

I have a dataframe to which I would like to add a Mean column for every row, excluding the first column 'DEPT'. So, for example, row 0 should have 45.007000 instead of NaN.
df2 = df[MatchesWithDept].copy()
df2 = df2.replace(-999.250000, np.NaN)
df2 = df2.assign(Master_GR=df2.loc[:, Matches[:]].mean())
    DEPT        GRD  GRR  Master_GR
0  400.0  45.007000  NaN        NaN
1  400.5  42.575001  NaN        NaN
2  401.0  43.755001  NaN        NaN
3  401.5  45.417000  NaN        NaN
4  402.0  47.519001  NaN        NaN

You can drop the first column before taking the mean. Note that your attempt computes column means (mean() defaults to axis=0), and assigning that Series, which is indexed by column names, to a new column aligns on the row index and yields NaN everywhere. Pass axis=1 for a row-wise mean:
df['Master_GR'] = df.drop('DEPT', axis=1).mean(axis=1)
Or select all columns except the first with iloc:
df['Master_GR'] = df.iloc[:, 1:].mean(axis=1)
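A minimal, self-contained sketch of the fix on data shaped like the example above (column names taken from the question, values abbreviated to three rows):
import numpy as np
import pandas as pd

# Sample frame mirroring the question's layout
df = pd.DataFrame({
    'DEPT': [400.0, 400.5, 401.0],
    'GRD':  [45.007000, 42.575001, 43.755001],
    'GRR':  [np.nan, np.nan, np.nan],
})

# Row-wise mean of every column except 'DEPT'; NaNs are skipped by
# default, so row 0 gets 45.007 even though GRR is missing
df['Master_GR'] = df.drop('DEPT', axis=1).mean(axis=1)
print(df)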

Related

pandas dataframe regex filtering of hierarchical columns

Consider the following dataframe:
df = pd.DataFrame(columns=['[mg]', '[mg] true'], index=range(3))
To filter for the column ending in ], one may use:
print(df.filter(regex=r"\]$"))
  [mg]
0  NaN
1  NaN
2  NaN
Next, consider a hierarchical columns dataframe:
df1 = pd.DataFrame(columns=pd.MultiIndex.from_product([[0,1], ['[mg]', '[mg] true']]), index=range(3))
print(df1)
     0                1
  [mg] [mg] true  [mg] [mg] true
0  NaN       NaN   NaN       NaN
1  NaN       NaN   NaN       NaN
2  NaN       NaN   NaN       NaN
When I again attempt to filter for the same columns ending in ], it now fails:
print(df1.filter(regex=r"\]$"))
Empty DataFrame
Columns: []
Index: [0, 1, 2]
Why does this fail, and what can I do to obtain the filtering I desire?
One option is to apply str.contains to the relevant level from columns.get_level_values and then select the matching columns with loc:
import pandas as pd

df1 = pd.DataFrame(
    columns=pd.MultiIndex.from_product([[0, 1], ['[mg]', '[mg] true']]),
    index=range(3))

# Apply the regex to level 1 of the column index
matches = df1.columns.get_level_values(1).str.contains(r"\]$")

# Select the matching columns with loc
filtered_df = df1.loc[:, matches]
print(filtered_df)
filtered_df:
     0     1
  [mg]  [mg]
0  NaN   NaN
1  NaN   NaN
2  NaN   NaN
Interesting question. Looking at the pandas source code for .filter, pandas supplies the string form of each label from DataFrame._get_axis(1) to the regex. For a MultiIndex, those labels are tuples (matched in their string form):
MultiIndex([(0, '[mg]'),
            (0, '[mg] true'),
            (1, '[mg]'),
            (1, '[mg] true')],
           )
So to match only [mg] we can extend the regex to include the tuple's closing '):
print(df1.filter(regex=r"mg\]\'\)$"))
Prints:
     0     1
  [mg]  [mg]
0  NaN   NaN
1  NaN   NaN
2  NaN   NaN
NOTE: this is almost certainly implementation-dependent, so don't rely on it :)
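If the level holding the bracketed names is not known in advance, here is a small sketch of my own (not from the answers above) that tests every level of each column tuple:
import pandas as pd

df1 = pd.DataFrame(
    columns=pd.MultiIndex.from_product([[0, 1], ['[mg]', '[mg] true']]),
    index=range(3))

# Keep a column if any of its levels ends with ']'
mask = [any(str(level).endswith(']') for level in col) for col in df1.columns]
print(df1.loc[:, mask])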

Python pandas insert empty rows after each row

Hello, I am trying to insert 3 empty rows after each row of the current data using pandas, then export the data. For example, the current data could be:
name  profession
Bill  cashier
Sam   stock
Adam  security
Ideally what I want to achieve:
name  profession
Bill  cashier
NaN   NaN
NaN   NaN
NaN   NaN
Sam   stock
NaN   NaN
NaN   NaN
NaN   NaN
Adam  security
NaN   NaN
NaN   NaN
NaN   NaN
I have experimented with itertools, but I am not sure how I can get precisely three empty rows after each row with that method. Any help, guidance, or sample would definitely be appreciated!
Using append on a dataframe is quite inefficient, I believe (it has to reallocate memory for the entire data frame each time). DataFrames were meant for analyzing data and easily adding columns, but not rows. So I think a good approach is to create a new dataframe of the correct size and then transfer the data over to it. The easiest way to do that is with an index.
import numpy as np
import pandas as pd

# Demonstration data
data = 'name profession Bill cashier Sam stock Adam security'
data = np.array(data.split()).reshape((4, 2))
df = pd.DataFrame(data[1:], columns=data[0])

# Add n blank rows after each existing row
n = 3
new_index = pd.RangeIndex(len(df) * (n + 1))
new_df = pd.DataFrame(np.nan, index=new_index, columns=df.columns)

# Original rows land at positions 0, 4, 8, ...
ids = np.arange(len(df)) * (n + 1)
new_df.loc[ids] = df.values
print(new_df)
Output:
    name profession
0   Bill    cashier
1    NaN        NaN
2    NaN        NaN
3    NaN        NaN
4    Sam      stock
5    NaN        NaN
6    NaN        NaN
7    NaN        NaN
8   Adam   security
9    NaN        NaN
10   NaN        NaN
11   NaN        NaN
insert_rows = 3  # how many blank rows to insert after each row
step = insert_rows + 1  # each original row plus its blanks

# Spread the original rows out to indices 0, 4, 8, ...
df.index = range(0, step * len(df), step)

# Reindexing over the full range fills the gaps with NaN rows
new_df = df.reindex(index=range(step * len(df)))
More information would be helpful, but one thing that comes to mind is this command:
df.append(pd.Series(), ignore_index=True)
This will add an empty row to your data frame, though as you can see you have to set ignore_index=True, otherwise the append won't work.
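Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions the same effect needs pd.concat. A minimal sketch:
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['Bill'], 'profession': ['cashier']})

# concat equivalent of the removed df.append(pd.Series(), ignore_index=True)
empty = pd.DataFrame([[np.nan] * df.shape[1]], columns=df.columns)
df = pd.concat([df, empty], ignore_index=True)
print(df)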
The code below includes a function to add empty rows between the existing rows of a dataframe. It might not be the best approach for what you want to do; it may be better to add the blank rows when you export the data.
import numpy as np
import pandas as pd

def add_blank_rows(df, no_rows):
    # DataFrame.append (used in the original answer) was removed in
    # pandas 2.0, so collect the pieces and concatenate them once
    blanks = pd.DataFrame(np.nan, index=range(no_rows), columns=df.columns)
    pieces = []
    for idx in range(len(df)):
        pieces.append(df.iloc[[idx]])
        pieces.append(blanks)
    return pd.concat(pieces, ignore_index=True)

df = pd.read_csv('test.csv')
df_with_blank_rows = add_blank_rows(df, 3)
print(df_with_blank_rows)
This works (note that it relies on DataFrame.append, which was removed in pandas 2.0, so it only runs on older versions):
df_new = pd.DataFrame()
for i, row in df.iterrows():
    df_new = df_new.append(row)
    for _ in range(3):
        df_new = df_new.append(pd.Series(), ignore_index=True)
df of course is the original DataFrame.
Here is a function to do that with one loop:
import numpy as np
import pandas as pd

def NAN_rows(df):
    rows = df.shape[0]
    x = np.empty((3, 2))  # 3 empty rows and 2 columns; adjust to your df
    x[:] = np.nan
    df_x = pd.DataFrame(columns=['name', 'profession'])
    for i in range(rows):
        temp = np.vstack([df.iloc[i].tolist(), x])
        df_x = pd.concat([df_x, pd.DataFrame(temp, columns=['name', 'profession'])], axis=0)
    return df_x

df = pd.DataFrame({
    'name': ['Bill', 'Sam', 'Adam'],
    'profession': ['cashier', 'stock', 'security']
})
print(NAN_rows(df))
#Output:
   name profession
0  Bill    cashier
1   nan        nan
2   nan        nan
3   nan        nan
0   Sam      stock
1   nan        nan
2   nan        nan
3   nan        nan
0  Adam   security
1   nan        nan
2   nan        nan
3   nan        nan
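Two caveats on this output: the index repeats 0 to 3 for each block, and because np.vstack promotes the mixed rows to a string array, the blanks are the literal text 'nan' rather than real NaN. A small follow-up sketch to fix both:
result = NAN_rows(df).replace('nan', np.nan).reset_index(drop=True)
print(result)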

Pandas: Sum multiple columns, but write NaN if any column in that row is NaN or 0

I am trying to create a new column in a pandas dataframe that sums the other columns. However, if any of the source columns is blank (NaN or 0), I need the new column to also be written as blank (NaN):
a    b  c  d  sum
3    5  7  4   19
2    6  0  2  NaN   (note the 0 in column c)
4  NaN  3  7  NaN
I am currently using DataFrame.sum, formatted like this:
df['sum'] = df[['a','b','c','d']].sum(axis=1, numeric_only=True)
which ignores the NaNs, but does not write NaN to the sum column.
Thanks in advance for any advice.
Replace the 0s with np.nan, then pass skipna=False:
df.replace(0, np.nan).sum(axis=1, skipna=False)
0    19.0
1     NaN
2     NaN
dtype: float64
df['sum'] = df.replace(0, np.nan).sum(axis=1, skipna=False)
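An alternative sketch of my own (column names taken from the question) that leaves the zeros in place and instead masks the sum wherever any source value is 0 or missing:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [3, 2, 4], 'b': [5, 6, np.nan],
                   'c': [7, 0, 3], 'd': [4, 2, 7]})

cols = ['a', 'b', 'c', 'd']
bad = df[cols].isna().any(axis=1) | df[cols].eq(0).any(axis=1)
# mask() writes NaN wherever the condition is True
df['sum'] = df[cols].sum(axis=1).mask(bad)
print(df)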

Python: Create New Column Equal to the Sum of all Columns Starting from Column Number 9

I want to create a new column called 'test' in my dataframe that is equal to the sum of all the columns starting from column number 9 to the end of the dataframe. These columns are all datatype float.
Below is the code I tried, but it didn't work: it gives me back all NaN values in the 'test' column:
df_UBSrepscomp['test'] = df_UBSrepscomp.iloc[:, 9:].sum()
If I'm understanding your question, you want the row-wise sum starting at column 9, so you want .sum(axis=1). Without it, sum() returns column totals indexed by the column labels, and assigning that Series to a new column aligns on the row index, which yields all NaN. See an example below using column 2 instead of 9 for readability.
import numpy as np
import numpy.random as npr
import pandas as pd

df = pd.DataFrame(npr.rand(10, 5))
df.iloc[0:3, 0:4] = np.nan  # throw in some NA values
df.loc[:, 'test'] = df.iloc[:, 2:].sum(axis=1)
print(df)
         0        1        2        3        4     test
0      NaN      NaN      NaN      NaN  0.73046  0.73046
1      NaN      NaN      NaN      NaN  0.79060  0.79060
2      NaN      NaN      NaN      NaN  0.53859  0.53859
3  0.97469  0.60224  0.90022  0.45015  0.52246  1.87283
4  0.84111  0.52958  0.71513  0.17180  0.34494  1.23187
5  0.21991  0.10479  0.60755  0.79287  0.11051  1.51094
6  0.64966  0.53332  0.76289  0.38522  0.92313  2.07124
7  0.40139  0.41158  0.30072  0.09303  0.37026  0.76401
8  0.59258  0.06255  0.43663  0.52148  0.62933  1.58744
9  0.12762  0.01651  0.09622  0.30517  0.78018  1.18156
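Applied to the question's dataframe, the fix is then just the added keyword:
df_UBSrepscomp['test'] = df_UBSrepscomp.iloc[:, 9:].sum(axis=1)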

How to append new columns to a pandas groupby object from a list of values

I want to code a script that takes the series values from a column, splits them into strings, and makes a new column for each of the resulting strings (filled with NaN for now). As the df is grouped by Column1, I want to do this for every group.
My input data frame looks like this:
df1:
  Column1    Column2
0     L17  a,b,c,d,e
1      L7      a,b,c
2      L6      a,b,f
3      L6      h,d,e
What I finally want to have is:
  Column1    Column2    a    b    c    d    e    f    h
0     L17  a,b,c,d,e  nan  nan  nan  nan  nan  nan  nan
1      L7      a,b,c  nan  nan  nan  nan  nan  nan  nan
2      L6      a,b,f  nan  nan  nan  nan  nan  nan  nan
My code currently looks like this:
def NewCols(x):
    for item, frame in group['Column2'].iteritems():
        Genes = frame.split(',')
        for value in Genes:
            string = value
            x[string] = np.nan
            return x

df1.groupby('Column1').apply(NewCols)
My thought behind this was that the code loops through Column2 of every group, splits the values contained in frame at the commas, and creates a list for that group. So far the code works fine. Then I added
for value in Genes:
    string = value
    x[string] = np.nan
    return x
with the intention of adding a new column for every value contained in the list Genes. However, my output looks like this:
  Column1    Column2    d
0     L17  a,b,c,d,e  nan
1      L7      a,b,c  nan
2      L6      a,b,f  nan
3      L6      h,d,e  nan
and I am pretty much struck dumb. Can someone explain why only one column gets appended (which is not even named after the first value in the first list of the first group) and suggest how I could improve my code?
I think you just return too early in your function, before the end of the two loops. If you dedent it two levels, like this:
def NewCols(x):
    # items() replaces the removed iteritems(); x is the group that
    # groupby passes in, so iterate over its Column2 rather than a global
    for item, frame in x['Column2'].items():
        Genes = frame.split(',')
        for value in Genes:
            string = value
            x[string] = np.nan
    return x

df1.groupby('Column1').apply(NewCols)
It should work fine!
# Collect every distinct value appearing in Column2
cols = sorted(set(df1['Column2'].apply(lambda x: x.split(',')).sum()))

# Join Column2 within each group, then append one NaN column per value
df = df1.groupby('Column1').agg(lambda x: ','.join(x)).reset_index()
pd.concat([df, pd.DataFrame({c: np.nan for c in cols}, index=df.index)], axis=1)
  Column1      Column2    a    b    c    d    e    f    h
0     L17    a,b,c,d,e  NaN  NaN  NaN  NaN  NaN  NaN  NaN
1      L6  a,b,f,h,d,e  NaN  NaN  NaN  NaN  NaN  NaN  NaN
2      L7        a,b,c  NaN  NaN  NaN  NaN  NaN  NaN  NaN
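A variant sketch of my own (assuming the same grouped output is wanted) that builds the value list with str.cat and adds the NaN columns with reindex instead of concat:
import pandas as pd

df1 = pd.DataFrame({'Column1': ['L17', 'L7', 'L6', 'L6'],
                    'Column2': ['a,b,c,d,e', 'a,b,c', 'a,b,f', 'h,d,e']})

# Every distinct value across all rows
cols = sorted(set(df1['Column2'].str.cat(sep=',').split(',')))

# Join Column2 per group, then reindex to append one NaN column per value
out = (df1.groupby('Column1', as_index=False)
          .agg({'Column2': ','.join})
          .reindex(columns=['Column1', 'Column2'] + cols))
print(out)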
