I've just noticed that string operations on the index of a Pandas DataFrame don't maintain the index, so assigning the result back to the DataFrame is kind of awkward. For example (and the case where I noticed it):
import pandas as pd
df = pd.DataFrame(
[[1,2],[3,4],[5,6]],
index=['a11','b12','c13'])
df['num'] = df.index.str.extract('([0-9]+)')
gives me:
0 1 num
a11 1 2 NaN
b12 3 4 NaN
c13 5 6 NaN
as the index has been lost and reverts back to the default [0, 1, 2].
It took a bit of debugging to realise that this index loss was why I was getting NaNs, but once I did, it was obvious that I could just do:
df['num'] = df.index.str.extract('([0-9]+)').set_index(df.index)
is this right, or are there other methods that maintain the index?
You'll have to use the expand argument:
df['num'] = df.index.str.extract('([0-9]+)', expand=False)
from the docs:
expand : bool, default True
If True, return DataFrame with one column per capture group. If False, return a Series/Index if there is one capture group or
DataFrame if there are multiple capture groups.
New in version 0.18.0.
You can use the expand argument to get the same desired result:
df['num'] = df.index.str.extract('([0-9]+)', expand=False)
With expand=False, extract returns a Series or Index instead of a DataFrame; since you have only one capture group, this is exactly what you want here.
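As a quick check on the original example (a minimal sketch of the behaviour described above):
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]],
                  index=['a11', 'b12', 'c13'])

# expand=False returns an Index here, so the values line up with df's rows
df['num'] = df.index.str.extract('([0-9]+)', expand=False)
print(df)
#      0  1 num
# a11  1  2  11
# b12  3  4  12
# c13  5  6  13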
How about using assign?
df.assign(num=df.index.str.extract('([0-9]+)').values)
Related
I am trying to replace the NaN values with 0 using the code below. It doesn't give any error, but there's no change in the dataframe either. Can you tell me where my mistake is?
df2 = df_num.iloc[:,[8,9,10,11,12,13,14,15,16]]
df2.replace(to_replace ="NaN",value =0, inplace=True)
df2
[The image of my dataframe output is here][1]
[1]: https://i.stack.imgur.com/LamrH.png
In your code you passed to_replace="NaN".
Note that this is just a string containing those three letters, not the actual missing-value marker, so nothing in your DataFrame matches it and nothing gets replaced.
The real missing values are floating-point NaN (np.nan), and NaN has unusual comparison semantics: one np.nan is NOT equal to another np.nan, which is why equality-based matching of missing values is unreliable.
One possible solution is to run:
df2 = df2.where(~df2.isna(), 0)
Another, simpler solution, as richardec suggested, is to use fillna,
but the argument should be 0 (zero), not "o" (a character):
df2 = df2.fillna(0)
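A small self-contained sketch of the difference (the column name x is made up for illustration):
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'x': [1.0, np.nan, 3.0]})

df2.replace(to_replace="NaN", value=0, inplace=True)  # matches nothing: "NaN" is just a string
print(df2['x'].isna().sum())                          # still 1 missing value

df2 = df2.fillna(0)                                   # replaces the actual np.nan values
print(df2)
#      x
# 0  1.0
# 1  0.0
# 2  3.0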
For the following screenshot, I want to change the NaN values under total_claim_count_ge65 to 5 whenever the value of ge65_suppress_flag is the # symbol.
I want to use a for loop to go through the ge65_suppress_flag column and every time it encounters a # symbol, it will change the NaN value in the very next column (total_claim_count_ge65) to a 5.
Try something like the following; note that chained indexing combined with inplace=True operates on a copy and won't write back to df, so go through .loc and assign the result:
mask = df['ge65_suppress_flag'] == '#'
df.loc[mask, 'total_claim_count_ge65'] = df.loc[mask, 'total_claim_count_ge65'].fillna(5)
Creating a similar data frame:
import pandas as pd
df1 = pd.DataFrame({"ge65_suppress_flag": ['bla', 'bla', '#', 'bla'], "total_claim_count_ge65": [1.0, 2.0, None, 4.0]})
Filling in 5.0 in rows where the ge65_suppress_flag column equals '#':
df1.loc[df1['ge65_suppress_flag']=="#", 'total_claim_count_ge65'] = 5.0
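Running this on the sample frame fills in only the flagged row:
print(df1)
#   ge65_suppress_flag  total_claim_count_ge65
# 0                bla                     1.0
# 1                bla                     2.0
# 2                  #                     5.0
# 3                bla                     4.0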
Using df.apply with a lambda:
import numpy as np
import pandas as pd

d = {'ge65_suppress_flag': ['not_supressed','not_supressed','#'], 'total_claim_count_ge65': [516.03, 881.0, np.nan]}
df = pd.DataFrame(data=d)
df['total_claim_count_ge65'] = df.apply(lambda x: 5 if x['ge65_suppress_flag']=='#' else x['total_claim_count_ge65'], axis=1)
print(df)
prints:
ge65_suppress_flag total_claim_count_ge65
0 not_supressed 516.03
1 not_supressed 881.00
2 # 5.00
I'll explain step-by-step after giving you a solution.
Here's a one-liner that will work.
df[df[0]=='#'] = df[df[0]=='#'].fillna(5)
To make the solution more general, I used the column's index based on your screenshot. You can change the index number, or specify by name like so:
df['name_of_column']
Step-by-step explanation:
First, you want to use your first column df[0] to select only the rows whose value equals the string '#':
df[df[0]=='#']
Next, use the pandas fillna method to replace every np.nan in that selection with 5:
df[df[0]=='#'].fillna(5)
According to the fillna documentation, the function returns a new dataframe rather than modifying the original in place. So, to keep the change, assign what it returns back to that subsection of your dataframe:
df[df[0]=='#'] = df[df[0]=='#'].fillna(5)
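Putting it together on a small made-up frame (column 0 stands in for ge65_suppress_flag here):
import numpy as np
import pandas as pd

df = pd.DataFrame({0: ['ok', '#', 'ok'], 1: [10.0, np.nan, 30.0]})

# rows where column 0 equals '#' get their NaNs filled with 5
df[df[0] == '#'] = df[df[0] == '#'].fillna(5)
print(df)
#     0     1
# 0  ok  10.0
# 1   #   5.0
# 2  ok  30.0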
I am trying to create a row at the bottom of a dataframe to show the sum of certain columns. I am under the impression that this shall be a really simple operation, but to my surprise, none of the methods I found on SO works for me in one step.
The methods that I've found on SO:
df.loc['TOTAL'] = df.sum()
This doesn't work for me as long as there are non-numeric columns in the dataframe; I need to select the numeric columns first and then concat the non-numeric columns back.
df.append(df.sum(numeric_only=True), ignore_index=True)
This won't preserve my data types; integer columns get converted to float.
df3.loc['Total', 'ColumnA'] = df['ColumnA'].sum()
I can only use this to sum one column.
I must have missed something in the process as this is not that hard an operation. Please let me know how I can add a sum row while preserving the data type of the dataframe.
Thanks.
Edit:
First off, sorry for the late update. I was on the road over the weekend.
Example:
df1 = pd.DataFrame(data = {'CountyID': [77, 95], 'Acronym': ['LC', 'NC'], 'Developable': [44490, 56261], 'Protected': [40355, 35943],
'Developed': [66806, 72211]}, index = ['Lehigh', 'Northampton'])
What I want to get is the same dataframe with a total row appended at the bottom.
Please ignore the differences in the index.
It's a little tricky for me because I don't need a sum for the 'CountyID' column, since it's only used for identification. So the question is really about getting the sum of specific numeric columns.
Thanks again.
Here is some toy data to use as an example:
df = pd.DataFrame({'A':[1.0,2.0,3.0],'B':[1,2,3],'C':['A','B','C']})
So that we can preserve the dtypes after the sum, we first store them as d:
d = df.dtypes
Next, since we only want to sum the numeric columns, pass numeric_only=True to sum(), but follow similar logic to your first attempt:
df.loc['Total'] = df.sum(numeric_only=True)
And finally, reset the dtypes of your DataFrame to their original values.
df.astype(d)
A B C
0 1.0 1 A
1 2.0 2 B
2 3.0 3 C
Total 6.0 6 NaN
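Put together as one runnable snippet (a minimal sketch using the toy frame above):
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [1, 2, 3], 'C': ['A', 'B', 'C']})

d = df.dtypes                                # remember the original dtypes
df.loc['Total'] = df.sum(numeric_only=True)  # adds the row, but upcasts B to float
df = df.astype(d)                            # restore the original dtypes
print(df)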
To select the numeric columns, you can do
df_numeric = df.select_dtypes(include = ['int64', 'float64'])
df_num_cols = df_numeric.columns
Then do what you did first, summing only those columns:
df.loc['Total'] = pd.Series(df[df_num_cols].sum(), index=df_num_cols)
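Applied to the frame from the question, leaving 'CountyID' out of the total (a sketch along the same lines, not verified against your real data):
import pandas as pd

df1 = pd.DataFrame(data={'CountyID': [77, 95], 'Acronym': ['LC', 'NC'],
                         'Developable': [44490, 56261], 'Protected': [40355, 35943],
                         'Developed': [66806, 72211]},
                   index=['Lehigh', 'Northampton'])

num_cols = df1.select_dtypes(include=['int64', 'float64']).columns.drop('CountyID')
df1.loc['Total'] = df1[num_cols].sum()  # CountyID and Acronym stay NaN in the Total row
print(df1)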
When I create a dataframe using concat like this:
import pandas as pd
dfa = pd.DataFrame({'a':[1],'b':[2]})
dfb = pd.DataFrame({'a':[3],'b':[4]})
dfc = pd.concat([dfa,dfb])
And when I try to reference it like I would any other DataFrame, I get the following result:
>>> dfc['a'][0]
0 1
0 3
Name: a, dtype: int64
I would expect my concatenated DataFrame to behave like a normal DataFrame and return the integer that I want like this simple DataFrame does:
>>> dfa['a'][0]
1
I am just a beginner; is there a simple explanation for why the same call is returning an entire DataFrame and not the single entry that I want? Or, even better, an easy way to get my concatenated DataFrame to respond like a normal DataFrame when I try to reference it? Or should I be using something other than concat?
You've mistaken what normal behavior is. dfc['a'][0] is a label lookup and matches anything with an index value of 0, of which there are two, because you concatenated two dataframes whose indexes both include 0.
To select position 0 instead, use:
dfc['a'].iloc[0]
Or you could have constructed dfc like this:
dfc = pd.concat([dfa,dfb], ignore_index=True)
dfc['a'][0]
Both return:
1
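As a self-contained sketch of both options:
import pandas as pd

dfa = pd.DataFrame({'a': [1], 'b': [2]})
dfb = pd.DataFrame({'a': [3], 'b': [4]})

dfc = pd.concat([dfa, dfb])
print(dfc['a'].iloc[0])   # positional lookup -> 1

dfc = pd.concat([dfa, dfb], ignore_index=True)
print(dfc['a'][0])        # labels are now unique -> 1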
EDITED (thanks to piRSquared's comment)
Use append() instead of pd.concat(). Note that DataFrame.append was later deprecated and removed in pandas 2.0, so on current versions prefer pd.concat([dfa, dfb], ignore_index=True):
dfc = dfa.append(dfb, ignore_index=True)
dfc['a'][0]
1
Is it possible to select the negation of a given list of columns from a pandas dataframe? For instance, say I have the following dataframe:
T1_V2 T1_V3 T1_V4 T1_V5 T1_V6 T1_V7 T1_V8
1 15 3 2 N B N
4 16 14 5 H B N
1 10 10 5 N K N
and I want to get all columns except T1_V6. I would normally do that this way:
df = df[["T1_V2","T1_V3","T1_V4","T1_V5","T1_V7","T1_V8"]]
My question is whether there is a way to do this the other way around, something like this:
df = df[!["T1_V6"]]
Do:
df[df.columns.difference(["T1_V6"])]
Notes from comments:
This will sort the columns. If you don't want that, call difference with sort=False.
difference won't raise an error if the dropped column name doesn't exist. If you want an error when the column doesn't exist, use drop as suggested in other answers: df.drop(columns=["T1_V6"])
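For example (in recent pandas versions, where difference accepts a sort argument), this keeps the columns in their original order instead of sorting them:
df[df.columns.difference(["T1_V6"], sort=False)]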
For completeness, you can also easily use drop for this:
df.drop(["T1_V6"], axis=1)
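Since pandas 0.21 you can also spell this with the columns keyword instead of axis=1, which some find clearer:
df.drop(columns=["T1_V6"])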
Another way to exclude columns that you don't want:
df[df.columns[~df.columns.isin(['T1_V6'])]]
I would suggest using DataFrame.drop():
columns_to_exclude = ['T1_V6']
old_dataframe = ...  # the dataframe that has all the columns
new_dataframe = old_dataframe.drop(columns_to_exclude, axis=1)
You could use inplace=True to make the changes to the original dataframe itself:
old_dataframe.drop(columns_to_exclude, axis = 1, inplace = True)
#old_dataframe is changed
You can use a list comprehension to build the selection:
df[[col for col in df.columns if col != 'T1_V6']]