I'm working on exporting data frames to Excel using dataframe join.
However, after Join dataframe,
when calculating subtotal using groupby, the figure below is executed.
There's a "Subtotal" word in the index column.
enter image description here
Is there any way to move it into the code column and sort the indexes?
enter image description here
here codes :
def subtotal(df__, str):
container = []
for key, group in df__.groupby(['key']):
group.loc['subtotal'] = group[['quantity', 'quantity2', 'quantity3']].sum()
container.append(group)
df_subtotal = pd.concat(container)
df_subtotal.loc['GrandTotal'] = df__[['quantity', 'quantity2', 'quantity3']].sum()
print(df_subtotal)
return (df_subtotal.to_excel(writer, sheet_name=str))
Use np.where() to fill NaN in code column with value in df.index. Then assign a new index array to df.index.
import numpy as np
df['code'] = np.where(df['code'].isna(), df.index, df['code'])
df.index = np.arange(1, len(df) + 1)
print(df)
code key product quntity1 quntity2 quntity3
1 cs01767 a apple-a 10 0 10.0
2 Subtotal NaN NaN 10 0 10.0
3 cs0000 b bannana-a 50 10 40.0
4 cs0000 b bannana-b 0 0 0.0
5 cs0000 b bannana-c 0 0 0.0
6 cs0000 b bannana-d 80 20 60.0
7 cs0000 b bannana-e 0 0 0.0
8 cs01048 b bannana-f 0 0 NaN
9 cs01048 b bannana-g 0 0 0.0
10 Subtotal NaN NaN 130 30 100.0
11 cs99999 c melon-a 50 10 40.0
12 cs99999 c melon-b 20 20 0.0
13 cs01188 c melon-c 10 0 10.0
14 Subtotal NaN NaN 80 30 50.0
15 GrandTotal NaN NaN 220 60 160.0
I have dataframe for which I want to fill nan with values from previous rows mulitplied with pct_change column
col_to_fill pct_change
0 1 NaN
1 2 1.0
2 10 0.5
3 nan 0.5
4 nan 1.3
5 nan 2
6 5 3
so for 3rd row 10*0.5 = 5 and use that filled value to fill next rows if its nan.
col_to_fill pct_change
0 1 NaN
1 2 1.0
2 10 0.5
3 5 0.5
4 6.5 1.3
5 13 2
6 5 3
I have used this
while df['col_to_fill'].isna().sum() > 0:
df.loc[df['col_to_fill'].isna(), 'col_to_fill'] = df['col_to_fill'].shift(1) * df['pct_change']
but Its taking too much time as its only filling those row whos previous row are nonnan in one loop.
Try with cumprod after ffill
s = df.col_to_fill.ffill()*df.loc[df.col_to_fill.isna(),'pct_change'].cumprod()
df.col_to_fill.fillna(s, inplace=True)
df
Out[90]:
col_to_fill pct_change
0 1.0 NaN
1 2.0 1.0
2 10.0 0.5
3 5.0 0.5
4 6.5 1.3
5 13.0 2.0
6 5.0 3.0
In Python, I have the following Pandas dataframe:
Factor Value
0 a 1.2
1 b 3.4
2 b 4.5
3 b 5.6
4 c 1.3
5 d 4.6
I would like to organize this where:
unique row identifiers (the factor col) become columns
Their respective values remain under the created columns
The factor values are not in an organized.
Target:
A B C D
0 1.2 3.4 1.3 4.6
1 4.5
2 5.6
3
4
5
Use, set_index and unstack with groupby:
df.set_index(['Factor', df.groupby('Factor').cumcount()])['Value'].unstack(0)
Output:
Factor a b c d
0 1.2 3.4 1.3 4.6
1 NaN 4.5 NaN NaN
2 NaN 5.6 NaN NaN
Does anyone know how to iterate a pandas Dataframe with two columns for each iteration?
Say I have
a b c d
5.1 3.5 1.4 0.2
4.9 3.0 1.4 0.2
4.7 3.2 1.3 0.2
4.6 3.1 1.5 0.2
5.0 3.6 1.4 0.2
5.4 3.9 1.7 0.4
So something like
for x, y in ...:
correlation of x and y
So output will be
corr_ab corr_bc corr_cd
0.1 0.3 -0.4
You can use zip with indexing for tuples, create dictionary of one element lists with Series.corr and f-strings for columns names and pass to DataFrame constructor:
L = {f'corr_{col1}{col2}': [df[col1].corr(df[col2])]
for col1, col2 in zip(df.columns, df.columns[1:])}
df = pd.DataFrame(L)
print (df)
corr_ab corr_bc corr_cd
0 0.860108 0.61333 0.888523
You can use df.corr to get the correlation of the dataframe. You then use mask to avoid repeated correlations. After that you can stack your new dataframe to make it more readable. Assuming you have data like this
0 1 2 3 4
0 11 6 17 2 3
1 3 12 16 17 5
2 13 2 11 10 0
3 8 12 13 18 3
4 4 3 1 0 18
Finding the correlation,
corrData = data.corr(method='pearson')
We get,
0 1 2 3 4
0 1.000000 -0.446023 0.304108 -0.136610 -0.674082
1 -0.446023 1.000000 0.563112 0.773013 -0.258801
2 0.304108 0.563112 1.000000 0.494512 -0.823883
3 -0.136610 0.773013 0.494512 1.000000 -0.545530
4 -0.674082 -0.258801 -0.823883 -0.545530 1.000000
Masking out repeated correlations,
dataCorr = dataCorr.mask(np.tril(np.ones(dataCorr.shape)).astype(np.bool))
We get
0 1 2 3 4
0 NaN -0.446023 0.304108 -0.136610 -0.674082
1 NaN NaN 0.563112 0.773013 -0.258801
2 NaN NaN NaN 0.494512 -0.823883
3 NaN NaN NaN NaN -0.545530
4 NaN NaN NaN NaN NaN
Stacking the correlated data
dataCorr = dataCorr.stack().reset_index()
The stacked data will look as shown
level_0 level_1 0
0 0 1 -0.446023
1 0 2 0.304108
2 0 3 -0.136610
3 0 4 -0.674082
4 1 2 0.563112
5 1 3 0.773013
6 1 4 -0.258801
7 2 3 0.494512
8 2 4 -0.823883
9 3 4 -0.545530
I'm new to Python and Pandas so there might be a simple solution which I don't see.
I have a number of discontinuous datasets which look like this:
ind A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 3.5 2 0
4 4.0 4 5
5 4.5 3 3
I now look for a solution to get the following:
ind A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NAN NAN
4 2.0 NAN NAN
5 2.5 NAN NAN
6 3.0 NAN NAN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
The problem is,that the gap in A varies from dataset to dataset in position and length...
set_index and reset_index are your friends.
df = DataFrame({"A":[0,0.5,1.0,3.5,4.0,4.5], "B":[1,4,6,2,4,3], "C":[3,2,1,0,5,3]})
First move column A to the index:
In [64]: df.set_index("A")
Out[64]:
B C
A
0.0 1 3
0.5 4 2
1.0 6 1
3.5 2 0
4.0 4 5
4.5 3 3
Then reindex with a new index, here the missing data is filled in with nans. We use the Index object since we can name it; this will be used in the next step.
In [66]: new_index = Index(arange(0,5,0.5), name="A")
In [67]: df.set_index("A").reindex(new_index)
Out[67]:
B C
0.0 1 3
0.5 4 2
1.0 6 1
1.5 NaN NaN
2.0 NaN NaN
2.5 NaN NaN
3.0 NaN NaN
3.5 2 0
4.0 4 5
4.5 3 3
Finally move the index back to the columns with reset_index. Since we named the index, it all works magically:
In [69]: df.set_index("A").reindex(new_index).reset_index()
Out[69]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
Using the answer by EdChum above, I created the following function
def fill_missing_range(df, field, range_from, range_to, range_step=1, fill_with=0):
return df\
.merge(how='right', on=field,
right = pd.DataFrame({field:np.arange(range_from, range_to, range_step)}))\
.sort_values(by=field).reset_index().fillna(fill_with).drop(['index'], axis=1)
Example usage:
fill_missing_range(df, 'A', 0.0, 4.5, 0.5, np.nan)
In this case I am overwriting your A column with a newly generated dataframe and merging this to your original df, I then resort it:
In [177]:
df.merge(how='right', on='A', right = pd.DataFrame({'A':np.arange(df.iloc[0]['A'], df.iloc[-1]['A'] + 0.5, 0.5)})).sort(columns='A').reset_index().drop(['index'], axis=1)
Out[177]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
So in the general case you can adjust the arange function which takes a start and end value, note I added 0.5 to the end as ranges are open closed, and pass a step value.
A more general method could be like this:
In [197]:
df = df.set_index(keys='A', drop=False).reindex(np.arange(df.iloc[0]['A'], df.iloc[-1]['A'] + 0.5, 0.5))
df.reset_index(inplace=True)
df['A'] = df['index']
df.drop(['A'], axis=1, inplace=True)
df.reset_index().drop(['level_0'], axis=1)
Out[197]:
index B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
Here we set the index to column A but don't drop it and then reindex the df using the arange function.
This question was asked a long time ago, but I have a simple solution that's worth mentioning. You can simply use NumPy's NaN. For instance:
import numpy as np
df[i,j] = np.NaN
will do the trick.