NaN values emerging when concatenating multiple dataframes with Pandas


I have multiple dataframes with data for each quarter of the year. My goal is to concatenate all of them so I can sum the values and get a view of the entire year.
I managed to concatenate the dataframes (which have the same column names and the same row names) into one, but I keep getting NaN in two rows even though I have the data. It goes like this:
df1:
my_data 1st_quarter
0 occurrence_1 2
1 occurrence_3 3
2 occurrence_2 0
df2:
my_data 2nd_quarter
0 occurrence_1 5
1 occurrence_3 10
2 occurrence_2 3
df3:
my_data 3th_quarter
0 occurrence_1 10
1 occurrence_3 2
2 occurrence_2 1
So I run this:
df_results = pd.concat(
    (df.set_index('my_data') for df in [df1, df2, df3]),
    axis=1, join='outer'
).reset_index()
This is the output I get:
type 1st_quarter 2nd_quarter 3th_quarter
0 occurrence_1 2 NaN 10
1 occurrence_3 3 10 2
2 occurrence_2 0 3 1
3 occurrence_1 NaN 5 NaN
If I use join='inner', the first row disappears. Note that the rows have the exact same names in all dataframes.
How can I solve the NaN problem? Or, after running pd.concat, how can I reorganize my DataFrame to "fill" the NaN with the correct numbers?
Update: My original dataset (which I unfortunately cannot post publicly) has an inconsistency in the first row name. Any suggestions about how I can get around it? Can I rename a row? Or combine two rows after concatenating the dataframes?

I managed to get around this problem using combine_first with loc:
df_results.loc[0] = df_results.loc[0].combine_first(df_results.loc[3])
So I got this:
type 1st_quarter 2nd_quarter 3th_quarter
0 occurrence_1 2 5 10
1 occurrence_3 3 10 2
2 occurrence_2 0 3 1
3 occurrence_1 NaN 5 NaN
Then I dropped the last row:
df_results = df_results.drop([3])
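
If the inconsistency is something mechanical, such as stray whitespace in one label, another option is to normalize the labels before concatenating so that the rows align in the first place. The sketch below assumes the problem is a trailing space in the first label of one frame; that is only a guess, since the original data is not shown, and the sample frames are rebuilt from the tables in the question.

```python
import pandas as pd

df1 = pd.DataFrame({"my_data": ["occurrence_1", "occurrence_3", "occurrence_2"],
                    "1st_quarter": [2, 3, 0]})
# Assumed inconsistency (hypothetical): a trailing space in the first label
df2 = pd.DataFrame({"my_data": ["occurrence_1 ", "occurrence_3", "occurrence_2"],
                    "2nd_quarter": [5, 10, 3]})
df3 = pd.DataFrame({"my_data": ["occurrence_1", "occurrence_3", "occurrence_2"],
                    "3th_quarter": [10, 2, 1]})

# Strip whitespace from the key column, then use it as the index for alignment
frames = [
    df.assign(my_data=df["my_data"].str.strip()).set_index("my_data")
    for df in (df1, df2, df3)
]

df_results = pd.concat(frames, axis=1, join="outer").reset_index()
print(df_results)
```

With the labels cleaned up front, there is no leftover row to merge with combine_first or to drop afterwards.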

Related

Combine 3 dataframe columns into 1 with priority while avoiding apply

Let's say I have 3 different columns
Column1 Column2 Column3
0 a 1 NaN
1 NaN 3 4
2 b 6 7
3 NaN NaN 7
and I want to create one final column that takes the first value in each row that isn't NA, resulting in:
Column1
0 a
1 3
2 b
3 7
I would usually do this with a custom apply function:
df.apply(lambda x: ...)
I need to do this for many different cases with millions of rows and this becomes very slow. Are there any operations that would take advantage of vectorization to make this faster?

Backfill the missing values along each row, then select the first column. Use a list ([0]) in iloc to get a one-column DataFrame, or a scalar (0) to get a Series:
df1 = df.bfill(axis=1).iloc[:, [0]]
s = df.bfill(axis=1).iloc[:, 0]
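
As a rough check of the backfill approach, the sketch below rebuilds the sample frame from the question (the construction of df is an assumption based on the table shown) and applies the Series variant.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Column1": ["a", np.nan, "b", np.nan],
    "Column2": [1, 3, 6, np.nan],
    "Column3": [np.nan, 4, 7, 7],
})

# Fill each NaN from the next column to the right, then keep the leftmost column
s = df.bfill(axis=1).iloc[:, 0]
print(s)
```

Because the columns have mixed types, the numeric values in the result may appear as floats (3.0 rather than 3).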

You can use Series.fillna() for this, as below:
df['Column1'].fillna(df['Column2']).fillna(df['Column3'])
output:
0 a
1 3
2 b
3 7
For more than 3 columns, this can be placed in a for loop as below, with new_col being your output:
new_col = df['Column1']
for col in df.columns:
    new_col = new_col.fillna(df[col])

Dropping multiple columns in a pandas dataframe between two columns based on column names

A super simple question, for which I cannot find an answer.
I have a dataframe with 1000+ columns and cannot drop by column number because I do not know the numbers. I want to drop all columns between two columns, based on their names.
foo = foo.drop(columns = ['columnWhatever233':'columnWhatever826'])
does not work. I tried several other options, but do not see a simple solution. Thanks!

You can use .loc with a column range. For example, if you have this dataframe:
A B C D E
0 1 3 3 6 0
1 2 2 4 9 1
2 3 1 5 8 4
Then to delete columns B to D:
df = df.drop(columns=df.loc[:, "B":"D"].columns)
print(df)
Prints:
A E
0 1 0
1 2 1
2 3 4
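
The same idea can be wrapped in a small helper. This is only a sketch: drop_between is a hypothetical name, not a pandas function, and the sample frame is rebuilt from the table above. Note that label slicing with .loc includes both endpoints and expects the column labels to be unique.

```python
import pandas as pd

def drop_between(df, first, last):
    """Return a copy of df without the columns from `first` to `last`, inclusive."""
    # Label-based slicing selects the whole block of columns by name
    to_drop = df.loc[:, first:last].columns
    return df.drop(columns=to_drop)

df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [3, 2, 1],
    "C": [3, 4, 5],
    "D": [6, 9, 8],
    "E": [0, 1, 4],
})

result = drop_between(df, "B", "D")
print(result)
```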

Group separated counting values in a pandas dataframe

I have the following df
A B
0 1 10
1 2 20
2 NaN 5
3 3 1
4 NaN 2
5 NaN 3
6 1 10
7 2 50
8 NaN 80
9 3 5
Column A consists of repeating sequences from 1 to 3, separated by a variable number of NaNs. I want to group by each of these sequences and get the minimum value of column B within each one.
The desired output is something like:
B_min
0 1
6 5
Many thanks beforehand
draj

The idea is to first remove the rows with missing values in A using DataFrame.dropna, then apply GroupBy.min with a helper Series that labels each sequence, created by comparing A to 1 with Series.eq and taking Series.cumsum. The last steps clean the result up into a one-column DataFrame:
df = (df.dropna(subset=['A'])
        .groupby(df['A'].eq(1).cumsum())['B']
        .min()
        .reset_index(drop=True)
        .to_frame(name='B_min'))
print(df)
B_min
0 1
1 5
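
To make the helper Series easier to follow, here is a self-contained sketch of the same approach with the intermediate group labels spelled out. The frame is rebuilt from the question's table; the names valid, groups and result are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, np.nan, 3, np.nan, np.nan, 1, 2, np.nan, 3],
    "B": [10, 20, 5, 1, 2, 3, 10, 50, 80, 5],
})

# Keep only the rows that belong to a 1-3 sequence
valid = df.dropna(subset=["A"])

# Every time A equals 1 a new sequence starts, so the running count of
# those starts gives each row the number of the sequence it belongs to
groups = valid["A"].eq(1).cumsum()

result = (valid.groupby(groups)["B"]
               .min()
               .reset_index(drop=True)
               .to_frame(name="B_min"))
print(result)
```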

All you need is df.groupby() followed by min(). Is this what you are expecting?
df.groupby('A')['B'].min()
Output:
A
1 10
2 20
3 1
Nan 80
If you don't want the NaNs in your group you can drop them using df.dropna()
df.dropna().groupby('A')['B'].min()

Add pandas Series as new columns to a specific Dataframe row

Say I have a Dataframe
df = pd.DataFrame({'A':[0,1],'B':[2,3]})
A B
0 0 2
1 1 3
Then I have a Series generated by some other function using inputs from the first row of the df but which has no overlap with the existing df
s = pd.Series ({'C':4,'D':6})
C 4
D 6
Now I want to add s to df.loc[0] with the keys becoming new columns and the values added only to this first row. The end result for df should look like:
A B C D
0 0 2 4 6
1 1 3 NaN NaN
How would I do that? Similar questions I've found only seem to look at doing this for one column or just adding the Series as a new row at the end of the DataFrame but not updating an existing row by adding multiple new columns from a Series.
I've tried df.loc[0,list(['C','D'])] = [4,6] which was suggested in another answer but that only works if ['C','D'] are already existing columns in the Dataframe. df.assign(**s) works but then assigns the Series values to all rows.

Use join with the Series converted to a one-row DataFrame by transposing:
df.join(pd.DataFrame(s).T)
A B C D
0 0 2 4.0 6.0
1 1 3 NaN NaN
Or use concat
pd.concat([df, pd.DataFrame(s).T], axis=1)
A B C D
0 0 2 4.0 6.0
1 1 3 NaN NaN
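
Both variants place the values in row 0 because the transposed frame gets the default index label 0. To target a different row, one possible approach is to name the Series after the target row label before transposing. The helper below is a hypothetical sketch, not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1], "B": [2, 3]})
s = pd.Series({"C": 4, "D": 6})

def add_series_to_row(df, s, label):
    """Return df with the keys of s as new columns, filled only at row `label`."""
    # to_frame(name=label) makes `label` the column name, so after the
    # transpose it becomes the index label of the single row
    row = s.to_frame(name=label).T
    return df.join(row)

out = add_series_to_row(df, s, 1)
print(out)
```

The new columns come back as floats because the rows without values hold NaN.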

Issue dividing one dataframe by another

I have rewritten this question several times now, as I thought I had solved the issue; however, it appears I have not. I am currently trying to loop through the columns of df1 and df2, dividing one column by the other to populate the columns of a new df3, but all my cells end up as NaN.
My code for the loop is as follows:
# Divide one by the other. Set up the counter for the loop
i = 0
for country in df3.columns:
    df3[country] = df1.iloc[:, [i]].div(df2.iloc[:, [i]])
    i += 1
The resulting df3 is a matrix full of NaNs only.
My df1 is of the structure:
And my df2 of the structure:
And I am creating my df3 as:
df3 = pd.DataFrame(index = df1.index, columns=tickers.index)
Which looks like (before population):
The only potential issue is the multi index in df3 perhaps? Struggling to see why they don't divide through.

Your current approach does not work because pandas aligns labels before dividing: Series are aligned on the index, and DataFrames on both index and columns. Wherever the labels do not match, the result is NaN. Here's an example with two pd.Series objects.
df1
5 0
4 1
3 2
2 3
1 4
dtype: int64
df2
5 0
6 1
7 2
8 3
9 4
dtype: int64
df1 / df2 # you'd expect all 1's in each row, but...
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
dtype: float64
Ensure that df1 and df2 have the same number of rows and columns; the division then becomes easy if you divide the NumPy array counterparts of the dataframes.
v = df1.values / df2.values
df3 = pd.DataFrame(v, index=df1.index, columns=tickers.index)
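
The difference between label-aligned and positional division can be seen in a small sketch. The frames and column names below are made up for illustration and are not the asker's data; to_numpy() is used here as an equivalent of .values.

```python
import pandas as pd

# Two frames with the same shape but different column labels
df1 = pd.DataFrame({"a": [2.0, 4.0], "b": [6.0, 8.0]})
df2 = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})

# Label-aligned division: no column label matches, so every cell is NaN
aligned = df1 / df2

# Positional division on the underlying arrays ignores the labels
positional = pd.DataFrame(
    df1.to_numpy() / df2.to_numpy(),
    index=df1.index,
    columns=df1.columns,
)
print(aligned)
print(positional)
```

Positional division is only meaningful when the rows and columns of both frames are already in the same order, since nothing checks the labels.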
