I have a dataframe formatted like this in pandas.
(df)
School ID Column 1
School 1 AD6000
School 2 3000TO4000
School 3 5000TO6000
School 4 AC2000
School 5 BB3300
School 6 9000TO9900
....
All I want to do is split the Column 1 rows that contain 'TO', using it as a delimiter, into two new columns while leaving the original intact. The result would be this.
(df)
School ID Column 1 Column 2 Column 3
School 1 AD6000 NaN NaN
School 2 3000TO4000 3000 4000
School 3 5000TO6000 5000 6000
School 4 AC2000 NaN NaN
School 5 BB3300 NaN NaN
School 6 9000TO9900 9000 9900
....
Here's the code I have. I thought it worked, but it turns out it leaves blanks in Columns 2 and 3 instead of splitting the numbers on either side of 'TO' into their respective columns.
df[['Column 2','Column 3']] = df['Column 1'].str.extract(r'(\d+)TO(\d+)')
Thanks for the help.
That's because the right-hand side is a dataframe with different column names (0 and 1). When assigning, pandas aligns on column labels, and it can't find 'Column 2' or 'Column 3' in that dataframe, so the assignment fills with NaN.
You can pass the underlying numpy array instead of the dataframe:
df[['Column 2','Column 3']] = df['Column 1'].str.extract(r'(\d+)TO(\d+)').values
Output:
School ID Column 1 Column 2 Column 3
0 School 1 AD6000 NaN NaN
1 School 2 3000TO4000 3000 4000
2 School 3 5000TO6000 5000 6000
3 School 4 AC2000 NaN NaN
4 School 5 BB3300 NaN NaN
5 School 6 9000TO9900 9000 9900
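If you'd rather not drop down to the underlying array, another option (a sketch using the question's target column names) is to rename the extracted columns before assigning, so the label alignment succeeds:

```python
import pandas as pd

df = pd.DataFrame({
    "School ID": ["School 1", "School 2", "School 3"],
    "Column 1": ["AD6000", "3000TO4000", "5000TO6000"],
})

# Rename the extracted columns to the target names first,
# so the label-aligned assignment finds them
extracted = df["Column 1"].str.extract(r"(\d+)TO(\d+)")
extracted.columns = ["Column 2", "Column 3"]
df[["Column 2", "Column 3"]] = extracted
```

Note that str.extract returns strings, not numbers; rows without a match get NaN.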
Alternatively, use str.split:
new = df["Column 1"].str.split("TO", n=1, expand=True)
and give the resulting columns new names:
df["Column 2"] = new[0]
df["Column 3"] = new[1]
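A self-contained sketch of that approach, using the question's target column names. One caveat: unlike str.extract, str.split leaves the whole unsplit string in the first result column for rows without 'TO', rather than NaN:

```python
import pandas as pd

df = pd.DataFrame({"Column 1": ["AD6000", "3000TO4000", "9000TO9900"]})

# expand=True returns a DataFrame; rows without "TO" get
# the original string in column 0 and None in column 1
new = df["Column 1"].str.split("TO", n=1, expand=True)
df["Column 2"] = new[0]
df["Column 3"] = new[1]
```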
I have multiple dataframes with data for each quarter of the year. My goal is to concatenate all of them so I can sum values and have a view of my entire year.
I managed to concatenate the four dataframes (which have the same column names and the same row names) into one, but I keep getting NaN in two columns even though I have the data. It goes like this:
df1:
my_data 1st_quarter
0 occurrence_1 2
1 occurrence_3 3
2 occurrence_2 0
df2:
my_data 2nd_quarter
0 occurrence_1 5
1 occurrence_3 10
2 occurrence_2 3
df3:
my_data 3th_quarter
0 occurrence_1 10
1 occurrence_3 2
2 occurrence_2 1
So I run this:
df_results = pd.concat(
(df_results.set_index('my_data') for df_results in [df1, df2, df3]),
axis=1, join='outer'
).reset_index()
This is the output I get:
type 1st_quarter 2nd_quarter 3th_quarter
0 occurrence_1 2 NaN 10
1 occurrence_3 3 10 2
2 occurrence_2 0 3 1
3 occurrence_1 NaN 5 NaN
If I use join='inner', the first row disappears. Note that the rows have exactly the same names in all dataframes.
How can I solve the NaN problem? Or after doing pd.concat reorganize my DF to "fill" the NaN with the correct numbers?
Update: my original dataset (which I unfortunately can't post publicly) has an inconsistency in the first row's name. Any suggestions about how I can get around it? Can I rename a row? Or combine two rows after concatenating the dataframes?
I managed to get around this problem using combine_first with loc:
df_results.loc[0] = df_results.loc[0].combine_first(df_results.loc[3])
So I got this:
type 1st_quarter 2nd_quarter 3th_quarter
0 occurrence_1 2 5 10
1 occurrence_3 3 10 2
2 occurrence_2 0 3 1
3 occurrence_1 NaN 5 NaN
Then I dropped the last row:
df_results = df_results.drop([3])
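To answer the "combine two rows" part directly: assuming the duplicate rows end up sharing the same label once the name inconsistency is fixed, a groupby sketch collapses them in one step, since first() takes the first non-null value per column within each group:

```python
import pandas as pd
import numpy as np

# reconstruction of the concatenated result with the duplicate row
df_results = pd.DataFrame({
    "type": ["occurrence_1", "occurrence_3", "occurrence_2", "occurrence_1"],
    "1st_quarter": [2, 3, 0, np.nan],
    "2nd_quarter": [np.nan, 10, 3, 5],
    "3th_quarter": [10, 2, 1, np.nan],
})

# first() skips NaN, so the two occurrence_1 rows merge into one
df_results = df_results.groupby("type", as_index=False, sort=False).first()
```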
I am trying to create a new column in a pandas dataframe that sums the total of other columns. However, if any of the source columns are blank (NaN or 0), I need the new column to also be written as blank (NaN)
a b c d sum
3 5 7 4 19
2 6 0 2 NaN (note the 0 in column c)
4 NaN 3 7 NaN
I am currently using the sum method, formatted like this:
df['sum'] = df[['a','b','c','d']].sum(axis=1, numeric_only=True)
which ignores the NaNs, but does not write NaN to the sum column.
Thanks in advance for any advice
Replace your 0s with np.nan, then pass skipna=False:
import numpy as np

df.replace(0, np.nan).sum(axis=1, skipna=False)
0    19.0
1     NaN
2     NaN
dtype: float64
df['sum'] = df.replace(0, np.nan).sum(axis=1, skipna=False)
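If zeros elsewhere in the dataframe must survive, an alternative sketch: sum normally, then mask the rows that contain a 0 or NaN in any of the source columns, instead of rewriting the data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [3, 2, 4], "b": [5, 6, np.nan],
                   "c": [7, 0, 3], "d": [4, 2, 7]})

cols = ["a", "b", "c", "d"]
# True for any row that has a NaN or a 0 among the source columns
bad = df[cols].isna().any(axis=1) | df[cols].eq(0).any(axis=1)
df["sum"] = df[cols].sum(axis=1).mask(bad)
```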
I am trying to add a column from one pandas dataframe to another.
Here is data frame 1:
print (df.head())
ID Name   Gender
1  John   Male
2  Denver 0
0  NaN    NaN
3  Jeff   Male
Note: Both ID and Name are indexes
Here is the data frame 2:
print (df2.head())
ID Appointed
1 R
2
3 SL
Note: ID is the index here.
I am trying to add the Appointed column from df2 to df1 based on the ID. I tried inserting the column and copying it from df2, but the Appointed column keeps coming back as all NaN values. So far I've had no luck; any suggestions would be greatly appreciated.
If I understand your problem correctly, you should get what you need using this:
df1.reset_index().merge(df2.reset_index(), on='ID')
ID Name Gender Appointed
0 1 John Male R
1 2 Denver 0 NaN
2 3 Jeff Male SL
Or, as an alternative, as pointed out by Wen, you could use join:
df1.join(df2)
Gender Appointed
ID Name
1 John Male R
2 Denver 0 NaN
0 NaN NaN NaN
3 Jeff Male SL
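For reference, a reconstruction of the two frames that makes the join reproducible (assuming the blank Appointed for ID 2 and the stray ID 0 row are missing data). df1's "ID" index level lines up with df2's index name, which is what join matches on:

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({
    "ID": [1, 2, 0, 3],
    "Name": ["John", "Denver", np.nan, "Jeff"],
    "Gender": ["Male", 0, np.nan, "Male"],
}).set_index(["ID", "Name"])

df2 = pd.DataFrame({
    "ID": [1, 2, 3],
    "Appointed": ["R", np.nan, "SL"],
}).set_index("ID")

# join matches df1's "ID" index level against df2's index
out = df1.join(df2)
```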
Reset the index for both dataframes, then create a column named 'Appointed' in df1 and assign the same column of df2 to it.
After resetting the index, both dataframes have an index beginning from 0. When we assign the column, the values automatically align by index, which is a property of pandas dataframes. Note that reset_index returns a new dataframe, so the result has to be assigned back:
df1 = df1.reset_index()
df2 = df2.reset_index()
df1['Appointed'] = df2['Appointed']
I have two dataframes in pandas, as shown below. EmpID is a primary key in both dataframes.
df_first = pd.DataFrame([[1, 'A',1000], [2, 'B',np.NaN],[3,np.NaN,3000],[4, 'D',8000],[5, 'E',6000]], columns=['EmpID', 'Name','Salary'])
df_second = pd.DataFrame([[1, 'A','HR','Delhi'], [8, 'B','Admin','Mumbai'],[3,'C','Finance',np.NaN],[9, 'D','Ops','Banglore'],[5, 'E','Programming',np.NaN],[10, 'K','Analytics','Mumbai']], columns=['EmpID', 'Name','Department','Location'])
I want to join these two dataframes with EmpID so that
Missing data in one dataframe can be filled with value from another table if exists and key matches
If there are observations with new keys then they should be appended in the resulting dataframe
I've used the code below to achieve this.
merged_df = pd.merge(df_first,df_second,how='outer',on=['EmpID'])
But this code gives me duplicate columns, which I don't want, so I used only the columns unique to df_second (plus the key) for merging:
ColNames = list(df_second.columns.difference(df_first.columns))
ColNames.append('EmpID')
merged_df = pd.merge(df_first, df_second[ColNames], how='outer', on=['EmpID'])
Now I don't get duplicate columns, but I don't get values either in observations where the key matches.
I'll really appreciate if someone can help me with this.
Regards,
Kailash Negi
It seems you need combine_first with set_index to match by the index created from column EmpID:
df = df_first.set_index('EmpID').combine_first(df_second.set_index('EmpID')).reset_index()
print (df)
EmpID Department Location Name Salary
0 1 HR Delhi A 1000.0
1 2 NaN NaN B NaN
2 3 Finance NaN C 3000.0
3 4 NaN NaN D 8000.0
4 5 Programming NaN E 6000.0
5 8 Admin Mumbai B NaN
6 9 Ops Banglore D NaN
7 10 Analytics Mumbai K NaN
EDIT:
To get a particular order of columns, use reindex:
# concatenate all column names together and remove dupes
ColNames = pd.Index(np.concatenate([df_second.columns, df_first.columns])).drop_duplicates()
print (ColNames)
Index(['EmpID', 'Name', 'Department', 'Location', 'Salary'], dtype='object')
df = (df_first.set_index('EmpID')
.combine_first(df_second.set_index('EmpID'))
.reset_index()
.reindex(columns=ColNames))
print (df)
EmpID Name Department Location Salary
0 1 A HR Delhi 1000.0
1 2 B NaN NaN NaN
2 3 C Finance NaN 3000.0
3 4 D NaN NaN 8000.0
4 5 E Programming NaN 6000.0
5 8 B Admin Mumbai NaN
6 9 D Ops Banglore NaN
7 10 K Analytics Mumbai NaN
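If you'd rather stay with the outer merge from the question, a sketch of one way to do it: keep both Name columns apart via suffixes, then resolve them with fillna so a value from either table survives:

```python
import pandas as pd
import numpy as np

df_first = pd.DataFrame([[1, 'A', 1000], [2, 'B', np.nan], [3, np.nan, 3000],
                         [4, 'D', 8000], [5, 'E', 6000]],
                        columns=['EmpID', 'Name', 'Salary'])
df_second = pd.DataFrame([[1, 'A', 'HR', 'Delhi'], [8, 'B', 'Admin', 'Mumbai'],
                          [3, 'C', 'Finance', np.nan], [9, 'D', 'Ops', 'Banglore'],
                          [5, 'E', 'Programming', np.nan],
                          [10, 'K', 'Analytics', 'Mumbai']],
                         columns=['EmpID', 'Name', 'Department', 'Location'])

# keep the duplicated Name columns apart, then fill one from the other
merged_df = pd.merge(df_first, df_second, how='outer', on='EmpID',
                     suffixes=('', '_second'))
merged_df['Name'] = merged_df['Name'].fillna(merged_df['Name_second'])
merged_df = merged_df.drop(columns='Name_second')
```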
I want to create a new column called 'test' in my dataframe that is equal to the sum of all the columns starting from column number 9 to the end of the dataframe. These columns are all datatype float.
Below is the code I tried, but it didn't work; it gives me back all NaN values in the 'test' column:
df_UBSrepscomp['test'] = df_UBSrepscomp.iloc[:, 9:].sum()
If I'm understanding your question, you want the row-wise sum starting at column 9. I believe you want .sum(axis=1). See an example below using column 2 instead of 9 for readability.
import numpy as np
import numpy.random as npr
from pandas import DataFrame
df = DataFrame(npr.rand(10, 5))
df.iloc[0:3, 0:4] = np.nan  # throw in some NA values
df.loc[:, 'test'] = df.iloc[:, 2:].sum(axis=1); print(df)
0 1 2 3 4 test
0 NaN NaN NaN NaN 0.73046 0.73046
1 NaN NaN NaN NaN 0.79060 0.79060
2 NaN NaN NaN NaN 0.53859 0.53859
3 0.97469 0.60224 0.90022 0.45015 0.52246 1.87283
4 0.84111 0.52958 0.71513 0.17180 0.34494 1.23187
5 0.21991 0.10479 0.60755 0.79287 0.11051 1.51094
6 0.64966 0.53332 0.76289 0.38522 0.92313 2.07124
7 0.40139 0.41158 0.30072 0.09303 0.37026 0.76401
8 0.59258 0.06255 0.43663 0.52148 0.62933 1.58744
9 0.12762 0.01651 0.09622 0.30517 0.78018 1.18156
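Applied to the question (with a hypothetical stand-in for df_UBSrepscomp, since the real data isn't shown), the fix is just adding axis=1:

```python
import pandas as pd
import numpy as np

# hypothetical stand-in for df_UBSrepscomp: 3 rows, 12 float columns
df_UBSrepscomp = pd.DataFrame(np.arange(36.0).reshape(3, 12))

# axis=1 sums across the columns within each row
df_UBSrepscomp["test"] = df_UBSrepscomp.iloc[:, 9:].sum(axis=1)
```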