I have rewritten this question several times because I thought I had solved the issue, but it appears I have not. I am trying to loop through the columns of df1 and df2, dividing one column by the other to populate the columns of a new df3, but all of the resulting cells are NaN.
My code for the loop is as follows:
# Divide one by the other. Set up the for loop.
i = 0
for country in df3.columns:
    df3[country] = df1.iloc[:, [i]].div(df2.iloc[:, [i]])
    i += 1
The resulting df3 is a matrix full of NaNs only.
My df1 is of the structure:
And my df2 of the structure:
And I am creating my df3 as:
df3 = pd.DataFrame(index = df1.index, columns=tickers.index)
Which looks like (before population):
The only potential issue is the multi index in df3 perhaps? Struggling to see why they don't divide through.
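The reason your current approach does not work is that pandas aligns labels before dividing: the index for pd.Series objects, and both the index and the column labels for DataFrames. Wherever the labels do not match on both sides, the result is NaN. Here's an example with two Series.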
df1
5 0
4 1
3 2
2 3
1 4
dtype: int64
df2
5 0
6 1
7 2
8 3
9 4
dtype: int64
df1 / df2 # you might expect an element-by-element division by position, but...
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
dtype: float64
Ensure that df1 and df2 have the same number of rows and columns. The division then becomes easy if you divide the underlying np.array counterparts of the dataframes, which ignores the labels and works purely by position.
v = df1.values / df2.values
df3 = pd.DataFrame(v, index=df1.index, columns=tickers.index)
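To make the difference between label-aligned and positional division concrete, here is a minimal sketch. The frames and their column names (x_a, y_a, and so on) are made up for illustration and are not the asker's data; the only assumption is that the two frames have the same shape but different column labels.

```python
import pandas as pd

# Hypothetical frames: same shape, different column labels.
df1 = pd.DataFrame({"x_a": [2.0, 4.0], "x_b": [6.0, 8.0]})
df2 = pd.DataFrame({"y_a": [1.0, 2.0], "y_b": [3.0, 4.0]})

# Label-aligned division: no column label exists in both frames,
# so every cell of the result is NaN.
aligned = df1 / df2

# Positional division: drop the labels, divide the arrays,
# then attach whatever labels the result should carry.
positional = pd.DataFrame(
    df1.to_numpy() / df2.to_numpy(),
    index=df1.index,
    columns=["a", "b"],  # hypothetical output labels
)
```

Here to_numpy() plays the same role as .values in the answer above.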
Related
I have a dataframe with duplicate columns (number not known a priori) like this example:
   a    a  a  b  b
0  1    1  1  1  1
1  1  nan  1  1  1
I need to be able to aggregate the columns by summing their values (by rows) and returning NaN if at least one value, in one of the columns among the duplicates, is NaN.
I have tried this code:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,1,1,1,1], [1,np.nan,1,1,1]], columns=['a','a','a','b','b'])
df = df.groupby(axis=1, level=0).sum()
The result I get is as follows, but it does not return NaN in the second row of column 'a'.
   a  b
0  3  2
1  2  2
The documentation for pandas.DataFrame.sum lists a skipna parameter that might suit my case. However, I am using pandas.core.groupby.GroupBy.sum, which does not have this parameter. It does have min_count, which does what I want, but the required count is not known in advance and would differ for each group of duplicate columns.
For example, a min_count=3 solves the problem for column 'a', but obviously returns NaN on the whole of column 'b'.
The result I want to achieve is:
     a  b
0    3  2
1  nan  2
One workaround might be to use apply to get the DataFrame.sum:
df.groupby(level=0, axis=1).apply(lambda x: x.sum(axis=1, skipna=False))
Output:
a b
0 3.0 2.0
1 NaN 2.0
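As far as I know, groupby(axis=1) is deprecated in recent pandas releases, so the call above may warn or fail depending on the installed version. A minimal sketch of the same idea that avoids axis=1 by transposing first, using the sample frame from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 1, 1, 1, 1], [1, np.nan, 1, 1, 1]],
    columns=["a", "a", "a", "b", "b"],
)

# Transpose so the duplicate column labels become index labels,
# group on that index, and sum each group without skipping NaN.
result = (
    df.T.groupby(level=0)
    .apply(lambda g: g.sum(skipna=False))
    .T
)
```

Because skipna=False is applied within each group, a single NaN among the duplicates makes that group's sum NaN for the affected row, whatever the number of duplicates.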
Another possible solution:
cols, ldf = df.columns.unique(), len(df)
# The list is built row by row, so reshape to (rows, unique columns).
pd.DataFrame(
    np.reshape([sum(df.loc[i, x]) for i in range(ldf) for x in cols],
               (ldf, len(cols))),
    columns=cols)
Output:
a b
0 3.0 2.0
1 NaN 2.0
I have multiple dataframes with data for each quarter of the year. My goal is to concatenate all of them so I can sum values and get a view of the entire year.
I managed to concatenate the dataframes (which have the same column names and the same row names) into one. But I keep getting NaN in two columns, even though I have the data. It goes like this:
df1:
my_data 1st_quarter
0 occurrence_1 2
1 occurrence_3 3
2 occurrence_2 0
df2:
my_data 2nd_quarter
0 occurrence_1 5
1 occurrence_3 10
2 occurrence_2 3
df3:
my_data 3th_quarter
0 occurrence_1 10
1 occurrence_3 2
2 occurrence_2 1
So I run this:
df_results = pd.concat(
    (df_results.set_index('my_data') for df_results in [df1, df2, df3]),
    axis=1, join='outer'
).reset_index()
What I get is this output:
type 1st_quarter 2nd_quarter 3th_quarter
0 occurrence_1 2 NaN 10
1 occurrence_3 3 10 2
2 occurrence_2 0 3 1
3 occurrence_1 NaN 5 NaN
If I use join='inner', the first row disappears. Note that the rows have the exact same name in all dataframes.
How can I solve the NaN problem? Or, after doing pd.concat, how can I reorganize my DF to "fill" the NaN with the correct numbers?
Update: My original dataset (which I unfortunately cannot post publicly) has an inconsistency in the first row name. Any suggestions about how I can get around it? Can I rename a row? Or combine two rows after I concatenate the dataframes?
I managed to get around this problem using combine_first with loc:
df_results.loc[0] = df_results.loc[0].combine_first(df_results.loc[3])
So I got this:
type 1st_quarter 2nd_quarter 3th_quarter
0 occurrence_1 2 5 10
1 occurrence_3 3 10 2
2 occurrence_2 0 3 1
3 occurrence_1 NaN 5 NaN
Then I dropped the last line:
df_results = df_results.drop([3])
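An alternative is to fix the row name before concatenating, so the duplicate row never appears. The sketch below assumes the inconsistency is stray whitespace in the key (the question does not say what it actually is), and normalize_key is a hypothetical helper, not a pandas function.

```python
import pandas as pd

df1 = pd.DataFrame({"my_data": ["occurrence_1", "occurrence_3", "occurrence_2"],
                    "1st_quarter": [2, 3, 0]})
# Assumed inconsistency: a trailing space in the first row name.
df2 = pd.DataFrame({"my_data": ["occurrence_1 ", "occurrence_3", "occurrence_2"],
                    "2nd_quarter": [5, 10, 3]})
df3 = pd.DataFrame({"my_data": ["occurrence_1", "occurrence_3", "occurrence_2"],
                    "3th_quarter": [10, 2, 1]})

def normalize_key(frame, key="my_data"):
    """Hypothetical helper: strip whitespace from the key and index by it."""
    out = frame.copy()
    out[key] = out[key].str.strip()
    return out.set_index(key)

df_results = pd.concat(
    [normalize_key(frame) for frame in (df1, df2, df3)],
    axis=1, join="outer",
).reset_index()
```

If the real mismatch is something else (casing, a typo), the same pattern applies with a different cleaning step inside the helper.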
Let's say I have 3 different columns
Column1 Column2 Column3
0 a 1 NaN
1 NaN 3 4
2 b 6 7
3 NaN NaN 7
and I want to create 1 final column that would take first value that isn't NA, resulting in:
Column1
0 a
1 3
2 b
3 7
I would usually do this with custom apply function:
df.apply(lambda x: ...)
I need to do this for many different cases with millions of rows and this becomes very slow. Are there any operations that would take advantage of vectorization to make this faster?
Back-fill missing values along each row, then select the first column. Use iloc[:, [0]] to get a one-column DataFrame, or iloc[:, 0] to get a Series:
df1 = df.bfill(axis=1).iloc[:, [0]]
s = df.bfill(axis=1).iloc[:, 0]
You can use Series.fillna() for this, as below:
df['Column1'].fillna(df['Column2']).fillna(df['Column3'])
output:
0 a
1 3
2 b
3 7
For more than 3 columns, this can be placed in a for loop as below, with new_col being your output:
new_col = df['Column1']
for col in df.columns:
    new_col = new_col.fillna(df[col])
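A different technique from the two answers above is to find, for every row, the position of the first non-missing cell with NumPy and pick that cell directly. This is only a sketch on the sample data from the question; whether it beats bfill or chained fillna on millions of rows would need to be measured.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Column1": ["a", np.nan, "b", np.nan],
    "Column2": [1, 3, 6, np.nan],
    "Column3": [np.nan, 4, 7, 7],
})

values = df.to_numpy(dtype=object)   # raw cell values, mixed types
mask = df.notna().to_numpy()         # True where a cell is present
first_idx = mask.argmax(axis=1)      # column position of the first True per row

# Pick one cell per row by position.
picked = values[np.arange(len(df)), first_idx]

# Rows with no valid cell at all get NaN instead of a bogus pick.
first_valid = pd.Series(picked, index=df.index).where(mask.any(axis=1))
```

argmax returns 0 for a row that is entirely missing, which is why the final where() step is needed.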
I have a dataframe and want to fill the NaN values of a particular column with a list derived from another calculation.
df = pd.DataFrame([1,np.nan,3,np.nan], columns=['A'])
values_to_be_filled = [100.942,90.942]
df
     A
0    1
1  NaN
2    3
3  NaN
output:
df2
A
0 1
1 100.942
2 3
3 90.942
I have tried to use the replace function but was not able to replace the NaN values with the list elements. Any help would be much appreciated.
df.loc[df["A"].isnull(), "A"] = values_to_be_filled
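The assignment above relies on the list having exactly as many elements as there are NaN cells in the column. A minimal runnable sketch with that check made explicit (the error message is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([1, np.nan, 3, np.nan], columns=["A"])
values_to_be_filled = [100.942, 90.942]

mask = df["A"].isna()

# The replacement values are consumed in order, one per missing cell.
if mask.sum() != len(values_to_be_filled):
    raise ValueError("number of fill values must match number of NaN cells")

df.loc[mask, "A"] = values_to_be_filled
```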
Say I have a Dataframe
df = pd.DataFrame({'A':[0,1],'B':[2,3]})
A B
0 0 2
1 1 3
Then I have a Series generated by some other function using inputs from the first row of the df but which has no overlap with the existing df
s = pd.Series ({'C':4,'D':6})
C 4
D 6
Now I want to add s to df.loc[0] with the keys becoming new columns and the values added only to this first row. The end result for df should look like:
A B C D
0 0 2 4 6
1 1 3 NaN NaN
How would I do that? Similar questions I've found only seem to look at doing this for one column or just adding the Series as a new row at the end of the DataFrame but not updating an existing row by adding multiple new columns from a Series.
I've tried df.loc[0,list(['C','D'])] = [4,6] which was suggested in another answer but that only works if ['C','D'] are already existing columns in the Dataframe. df.assign(**s) works but then assigns the Series values to all rows.
join with transpose:
df.join(pd.DataFrame(s).T)
A B C D
0 0 2 4.0 6.0
1 1 3 NaN NaN
Or use concat
pd.concat([df, pd.DataFrame(s).T], axis=1)
A B C D
0 0 2 4.0 6.0
1 1 3 NaN NaN
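Both answers land on row 0 because pd.DataFrame(s).T happens to get the index label 0. To target another row, the Series can be given that row's label before transposing. A sketch, where row_label is an assumed variable holding the target index label:

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1], "B": [2, 3]})
s = pd.Series({"C": 4, "D": 6})

row_label = 1  # assumed target row; use 0 to reproduce the answers above

# to_frame(name=...) makes a one-column frame named after the row label;
# transposing turns it into a one-row frame indexed by that label.
new_cols = s.to_frame(name=row_label).T

result = df.join(new_cols)
```

The join aligns on the index, so only the row whose label matches receives the values and every other row gets NaN in the new columns.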