I'm trying to concatenate several columns, which mostly contain NaNs, into one. Here is an example with just 2:
2013-06-18 21:46:33.422096-05:00 A NaN
2013-06-18 21:46:35.715770-05:00 A NaN
2013-06-18 21:46:42.669825-05:00 NaN B
2013-06-18 21:46:45.409733-05:00 A NaN
2013-06-18 21:46:47.130747-05:00 NaN B
2013-06-18 21:46:47.131314-05:00 NaN B
This could go on for 3, 4, or 10 columns, always with exactly one value per row that is pd.notnull() and the rest NaN.
I want to concatenate these into 1 column the fastest way possible. How can I do this?
Since each row has exactly one string value and the other cells are NaN, you can simply take the row-wise maximum:
df.max(axis=1)
As noted in the comments, if this fails under Python 3 (strings and NaN cannot be compared), convert the NaNs to strings first:
df.fillna('').max(axis=1)
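A minimal sketch of this on made-up data (the column names are assumptions, not from the question):

```python
import pandas as pd
import numpy as np

# Toy frame: exactly one non-null string per row, as in the question.
df = pd.DataFrame({'a': ['A', 'A', np.nan, 'A'],
                   'b': [np.nan, np.nan, 'B', np.nan]})

# fillna('') turns every cell into a string, so max() compares strings
# and the empty string always loses to the real value.
combined = df.fillna('').max(axis=1)
print(combined.tolist())  # ['A', 'A', 'B', 'A']
```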
You could do
In [278]: df = pd.DataFrame([[1, np.nan], [2, np.nan], [np.nan, 3]])
In [279]: df
Out[279]:
0 1
0 1 NaN
1 2 NaN
2 NaN 3
In [280]: df.sum(axis=1)
Out[280]:
0 1
1 2
2 3
dtype: float64
Since NaNs are treated as 0 when summed, they don't show up.
A couple of caveats: you need to be sure that only one of the columns has a non-NaN value in each row for this to work, and it only works on numeric data.
You can also use
df.ffill(axis=1).iloc[:, -1]
The last column will now contain all the valid observations, since each row's valid value has been carried forward across the NaNs (ffill(axis=1) is the current spelling of fillna(method='ffill', axis=1)). This approach is more flexible than summing, since it also works on non-numeric data, but slower. iloc[:, -1] selects every row of the last column.
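A quick demonstration on toy data (column names are made up for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical two-column frame with one valid value per row.
df = pd.DataFrame({'x': ['A', np.nan, 'B'],
                   'y': [np.nan, 'C', np.nan]})

# Forward-fill across columns, then keep the last column,
# which now holds the single valid value from each row.
result = df.ffill(axis=1).iloc[:, -1]
print(result.tolist())  # ['A', 'C', 'B']
```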
Related
I have two dataframes, the main dataframe has two columns for Lat and Long some of which have values and some of which are NaN. I have another dataframe that is a subset of this main dataframe with Lat and Long filled in with values. I'd like to fill in the main DataFrame with these values based on matching ID.
Main DataFrame:
ID Lat Long
0 9547507704 33.853682 -80.369867
1 9777677704 32.942332 -80.066165
2 5791407702 47.636067 -122.302559
3 6223567700 34.224719 -117.372550
4 9662437702 42.521828 -82.913680
... ... ... ...
968552 4395967002 NaN NaN
968553 6985647108 NaN NaN
968554 7996438405 NaN NaN
968555 9054647103 NaN NaN
968556 9184687004 NaN NaN
DataFrame to fill:
ID Lat Long
0 2392497107 36.824257 -76.272486
1 2649457102 37.633918 -77.507746
2 2952437110 37.511077 -77.528711
3 3379937304 39.119430 -77.569008
4 3773127208 36.909731 -76.070420
... ... ... ...
23263 9512327001 37.371059 -79.194838
23264 9677417002 38.406665 -78.913133
23265 9715167306 38.761194 -77.454184
23266 9767568404 37.022287 -76.319882
23267 9872047407 38.823017 -77.057818
The two dataframes are of different lengths.
EDIT for clarification: I need to replace the NaN in the Lat & Long columns of the main DataFrame with the Lat & Long from the subset if ID matches in both DataFrames. My DataFrames are both >60 columns, I am only trying to replace the NaN for those two columns.
Edit:
I went with this mapping solution, although it isn't exactly what I was looking for; I suspect there is a much simpler solution.
#mapping coordinates to NaN values in main
m = dict(zip(fill_df.ID,fill_df.Lat))
main_df.Lat = main_df.Lat.fillna(main_df.ID.map(m))
n = dict(zip(fill_df.ID,fill_df.Long))
main_df.Long = main_df.Long.fillna(main_df.ID.map(n))
new_df = pd.merge(main_df, sub_df, how='left', on='ID')
A left join on ID will bring the subset's values in alongside the main frame; you then still need to fill the NaNs from the merged columns.
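Sketching that idea end to end on toy data (the frame names, IDs, and the `_fill` suffix are illustrative assumptions): after the left join, the subset's columns come back with a suffix, and those are what you fill from before dropping them.

```python
import pandas as pd
import numpy as np

main_df = pd.DataFrame({'ID': [1, 2, 3],
                        'Lat': [10.0, np.nan, np.nan],
                        'Long': [20.0, np.nan, np.nan]})
sub_df = pd.DataFrame({'ID': [2, 3], 'Lat': [11.0, 12.0], 'Long': [21.0, 22.0]})

# Left join brings the subset's coordinates in as Lat_fill / Long_fill.
merged = main_df.merge(sub_df, on='ID', how='left', suffixes=('', '_fill'))
for col in ['Lat', 'Long']:
    merged[col] = merged[col].fillna(merged[col + '_fill'])
new_df = merged.drop(columns=['Lat_fill', 'Long_fill'])
print(new_df)
```

This touches only the two coordinate columns, so a frame with 60+ other columns passes through untouched.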
One approach is to use DataFrame.combine_first. This method aligns DataFrames on index and columns, so you need to set ID as the index of each DataFrame, call df_main.combine_first(df_filler), then reset ID back into a column. (Seems awkward; there's probably a more elegant approach.)
Assuming your main DataFrame is named df_main and your DataFrame to fill is named df_filler:
df_main.set_index('ID').combine_first(df_filler.set_index('ID')).reset_index()
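A self-contained sketch on toy data (IDs and coordinates are made up):

```python
import pandas as pd
import numpy as np

df_main = pd.DataFrame({'ID': [1, 2, 3],
                        'Lat': [10.0, np.nan, np.nan],
                        'Long': [20.0, np.nan, np.nan]})
df_filler = pd.DataFrame({'ID': [2, 3], 'Lat': [11.0, 12.0], 'Long': [21.0, 22.0]})

# combine_first keeps df_main's existing values and only fills its NaNs
# from df_filler, aligned on the shared ID index.
fixed = df_main.set_index('ID').combine_first(df_filler.set_index('ID')).reset_index()
print(fixed)
```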
This should do the trick:
import math
import pandas as pd

A = pd.DataFrame({'ID': [1, 2, 3], 'Lat': [4, 5, 6], 'Long': [7, 8, float('nan')]})
B = pd.DataFrame({'ID': [2, 3], 'Lat': [5, 6], 'Long': [8, 9]})

print('Old table:')
print(A)
print('Fix table:')
print(B)

# O(n*m) scan: for every row of A, find the matching ID in B
# and copy values over wherever A has a NaN.
for i in A.index.to_list():
    for j in B.index.to_list():
        if not A['ID'][i] == B['ID'][j]:
            continue
        if math.isnan(A['Lat'][i]):
            A.at[i, 'Lat'] = B['Lat'][j]
        if math.isnan(A['Long'][i]):
            A.at[i, 'Long'] = B['Long'][j]

print('New table:')
print(A)
Returns:
Old table:
ID Lat Long
0 1 4 7.0
1 2 5 8.0
2 3 6 NaN
Fix table:
ID Lat Long
0 2 5 8
1 3 6 9
New table:
ID Lat Long
0 1 4 7.0
1 2 5 8.0
2 3 6 9.0
Not very elegant but gets the job done :)
I have a dataframe (df1) and I want to replace the values for the columns V2 and V3 if they have the same value than V1.
import pandas as pd
import numpy as np
df_start= pd.DataFrame({"ID":[1, 2 , 3 ,4, 5], "V1":[10,5,15,20,20], "V2":[10,5,20,17,15], "V3":[10, 25, 15, 10, 20]})
df_end = pd.DataFrame({"ID":[1, 2 , 3 ,4, 5], "V1":[10,5,15,20,20], "V2":[np.nan,np.nan,20,17,15], "V3":[np.nan, 25, np.nan, 10, np.nan]})
I know iterrows is not recommended but I don't know what I should do.
You can use mask:
For a separate dataframe, use assign:
df_end = df_start.assign(**df_start[['V2', 'V3']]
                 .mask(df_start[['V2', 'V3']].eq(df_start['V1'], axis=0)))
For modifying the input dataframe just assign inplace:
df_start[['V2','V3']] = (df_start[['V2','V3']]
.mask(df_start[['V2','V3']].eq(df_start['V1'],axis=0)))
ID V1 V2 V3
0 1 10 NaN NaN
1 2 5 NaN 25.0
2 3 15 20.0 NaN
3 4 20 17.0 10.0
4 5 20 15.0 NaN
You'll still use a regular loop to go through the columns, but the apply function is your best friend for this kind of row-wise operation. If you need information from more than one column (here you're comparing each column against "V1"), apply it to the DataFrame and specify the axis. If you only need one column (say, doubling values from V1 when they're even), you can call apply on just that Series.
In both versions, the argument you pass is a lambda expression. When you apply it to a DataFrame as here, the x represents a row, which can be indexed by column name. Finally, you assign the result back to a new or existing column in your DataFrame.
Assuming that df_start and df_end represent your planned input and output:
cols = ["V2", "V3"]
for col in cols:
    df_start[col] = df_start.apply(lambda x: x[col] if x[col] != x["V1"] else np.nan, axis=1)
(no idea how to introduce a matrix here for readability)
I have two dataframes obtained with Panda and Python.
df1 = pd.DataFrame({'Index': ['0','1','2'], 'number':[3,'dd',1], 'people':[3,'s',3]})
df1 = df1.set_index('Index')
df2 = pd.DataFrame({'Index': ['0','1','2'], 'quantity':[3,2,'hi'], 'persons':[1,5,np.nan]})
I would like to sum the columns of the two dataframes based on Index. The columns do not have the same names and may contain strings (I actually have 50 columns in each df). I want to treat NaN as 0. The result should look like:
df3
Index column 1 column 2
0 6 4
1 nan nan
2 nan nan
I was wondering how could this be done.
Note:
A double while or for loop would certainly do the trick, just not very elegantly...
indices = 0
while indices < len(df1.index) - 1:
    columna = 0
    while columna < numbercolumns - 1:
        df3.iloc[indices, columna] = df1.iloc[indices, columna] + df2.iloc[indices, columna]
        columna += 1
    indices += 1
Thank you.
You can try concatenating both dataframes and then summing within each Index group. Align the column names first and coerce the string entries to NaN:
df2 = df2.set_index('Index')
df2.columns = df1.columns
df1 = df1.apply(pd.to_numeric, errors='coerce')
df2 = df2.apply(pd.to_numeric, errors='coerce')
pd.concat([df1, df2]).groupby('Index').sum()
Out:
       number  people
Index
0         6.0     4.0
1         2.0     5.0
2         1.0     3.0
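Since the goal is an element-wise sum aligned on Index, DataFrame.add with fill_value is a more direct route that avoids the concat/groupby round trip (a sketch on the question's own data; treating a NaN on one side as 0, per the question):

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'Index': ['0', '1', '2'],
                    'number': [3, 'dd', 1],
                    'people': [3, 's', 3]}).set_index('Index')
df2 = pd.DataFrame({'Index': ['0', '1', '2'],
                    'quantity': [3, 2, 'hi'],
                    'persons': [1, 5, np.nan]}).set_index('Index')

# Give the frames matching column names, coerce strings to NaN,
# then add them; fill_value=0 treats a NaN on one side as 0.
df2.columns = df1.columns
df3 = (df1.apply(pd.to_numeric, errors='coerce')
          .add(df2.apply(pd.to_numeric, errors='coerce'), fill_value=0))
print(df3)
```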
I need to read a very large Excel file into a DataFrame. The file has string, integer, float, and Boolean data, as well as missing data and totally empty rows. It may also be worth noting that some of the cell values are derived from cell formulas and/or VBA - although theoretically that shouldn't affect anything.
As the title says, pandas sometimes reads Boolean values as float or int 1's and 0's instead of True and False. It appears to depend on the number of empty rows and the types of the other data. For simplicity's sake, I'm just linking a 2-sheet Excel file where the issue is replicated.
Boolean_1.xlsx
Here's the code:
import pandas as pd
df1 = pd.read_excel('Boolean_1.xlsx','Sheet1')
df2 = pd.read_excel('Boolean_1.xlsx','Sheet2')
print(df1, '\n' *2, df2)
Here's the print. Mainly note row ZBA, which has the same values in both sheets, but different values in the DataFrames:
Name stuff Unnamed: 1 Unnamed: 2 Unnamed: 3
0 AFD a dsf ads
1 DFA 1 2 3
2 DFD 123.3 41.1 13.7
3 IIOP why why why
4 NaN NaN NaN NaN
5 ZBA False False True
Name adslfa Unnamed: 1 Unnamed: 2 Unnamed: 3
0 asdf 6.0 3.0 6.0
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 ZBA 0.0 0.0 1.0
I was also able to get integer 1's and 0's output in the large file I'm actually working on (yay), but wasn't able to easily replicate it.
What could be causing this inconsistency, and is there a way to force pandas to read Booleans as they should be read?
Pandas type-casting is applied by column / series. In general, Pandas doesn't work well with mixed types, or object dtype. You should expect internalised logic to determine the most efficient dtype for a series. In this case, Pandas has chosen float dtype as applicable for a series containing float and bool values. In my opinion, this is efficient and neat.
However, as you noted, this won't work when you have a transposed input dataset. Let's set up an example from scratch:
import pandas as pd, numpy as np
df = pd.DataFrame({'A': [True, False, True, True],
'B': [np.nan, np.nan, np.nan, False],
'C': [True, 'hello', np.nan, True]})
df = df.astype({'A': bool, 'B': float, 'C': object})
print(df)
A B C
0 True NaN True
1 False NaN hello
2 True NaN NaN
3 True 0.0 True
Option 1: change "row dtype"
You can, without transposing your data, change the dtype for objects in a row. This will force series B to have object dtype, i.e. a series storing pointers to arbitrary types:
df.iloc[3] = df.iloc[3].astype(bool)
print(df)
A B C
0 True NaN True
1 False NaN hello
2 True NaN NaN
3 True False True
print(df.dtypes)
A bool
B object
C object
dtype: object
Option 2: transpose and cast to Boolean
In my opinion, this is the better option, as a data type is being attached to a specific category / series of input data.
df = df.T # transpose dataframe
df[3] = df[3].astype(bool) # convert series to Boolean
print(df)
0 1 2 3
A True False True True
B NaN NaN NaN False
C True hello NaN True
print(df.dtypes)
0 object
1 object
2 object
3 bool
dtype: object
read_excel determines the dtype for each column based on the first row in that column with a value. If the first row of that column is empty, read_excel continues to the next row until a value is found.
In Sheet1, your first row with values in column B, C, and D contains strings. Therefore, all subsequent rows will be treated as strings for these columns. In this case, FALSE = False
In Sheet2, your first row with values in column B, C, and D contains integers. Therefore, all subsequent rows will be treated as integers for these columns. In this case, FALSE = 0.
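One way to force the issue is to pass a converters mapping so the affected columns are parsed as Booleans regardless of what gets inferred. The column name below is an assumption; the same converters keyword exists on both read_excel and read_csv, and the sketch uses read_csv on an in-memory buffer so it runs without the spreadsheet:

```python
import io
import pandas as pd

def to_bool(value):
    # Treat the cell's text as the source of truth: 'TRUE' -> True, else False.
    return str(value).strip().upper() == 'TRUE'

csv_data = io.StringIO("Name,flag\nZBA,FALSE\nABC,TRUE\n")
# For the real file this would be, e.g.:
# pd.read_excel('Boolean_1.xlsx', 'Sheet2', converters={'flag': to_bool})
df = pd.read_csv(csv_data, converters={'flag': to_bool})
print(df['flag'].tolist())  # [False, True]
```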
What I want is this:
visit_id atc_1 atc_2 atc_3 atc_4 atc_5 atc_6 atc_7
48944282 A02AG J01CA04 J095AX02 N02BE01 R05X NaN NaN
48944305 A02AG A03AX13 N02BE01 R05X NaN NaN NaN
I don't know in advance how many atc_1 ... atc_100 columns there will need to be. I just need to gather all the atc_codes associated with each visit_id into one row.
This seems like a group_by and then a pivot but I have tried many times and failed. I also tried to self-join a la SQL using pandas' merge() but that doesn't work either.
The end result is that I will paste together atc_1, atc_7, ... atc_100 to form one long atc_code. This composite atc_code will be my "Y" or "labels" column of my dataset that I am trying to predict.
Thank you!
First use cumcount to count the values within each visit_id group; those counts become the new column labels. Then pivot to wide form with unstack, add any missing columns with reindex, rename the columns with add_prefix, and finish with reset_index:
g = df.groupby('visit_id').cumcount() + 1
print(g)
0    1
1    2
2    3
3    4
4    5
5    1
6    2
7    3
8    4
dtype: int64
df = (df.set_index(['visit_id', g])['atc_code'].unstack()
        .reindex(range(1, 8), axis=1)
        .add_prefix('atc_')
        .reset_index())
print (df)
visit_id atc_1 atc_2 atc_3 atc_4 atc_5 atc_6 atc_7
0 48944282 A02AG J01CA04 J095AX02 N02BE01 R05X NaN NaN
1 48944305 A02AG A03AX13 N02BE01 R05X None NaN NaN
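Since the stated end goal is one concatenated code string per visit to use as a label, you can also skip the wide table entirely and join the codes inside each group (the separator and sample data here are illustrative assumptions):

```python
import pandas as pd

df = pd.DataFrame({'visit_id': [48944282] * 3 + [48944305] * 2,
                   'atc_code': ['A02AG', 'J01CA04', 'R05X', 'A02AG', 'N02BE01']})

# One label string per visit_id, codes joined in row order.
labels = df.groupby('visit_id')['atc_code'].agg('_'.join)
print(labels.tolist())  # ['A02AG_J01CA04_R05X', 'A02AG_N02BE01']
```

This sidesteps the variable-column problem, since the join handles any number of codes per visit.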