How to prevent pd.pivot_table from sorting columns - python

I have the following long df:
import pandas as pd

df = pd.DataFrame({'stations': ["Toronto", "Toronto", "Toronto", "New York", "New York", "New York"],
                   'forecast_date': ["Jul 30", "Jul 31", "Aug 1", "Jul 30", "Jul 31", "Aug 1"],
                   'low': [58, 57, 59, 70, 72, 71],
                   'high': [65, 66, 64, 88, 87, 86]})
print(df)
I want to pivot the table to a wide df that looks like this:
Desired output:
               high                low
forecast_date Jul 30 Jul 31 Aug 1 Jul 30 Jul 31 Aug 1
stations
New York          88     87    86     70     72    71
Toronto           65     66    64     58     57    59
so I used the following function:
df = df.pivot_table(index='stations', columns='forecast_date', values=['high', 'low'], aggfunc='first').reset_index()
print(df)
but with this, I get the following df:
Output received (undesired):
               high               low
forecast_date Aug 1 Jul 30 Jul 31 Aug 1 Jul 30 Jul 31
stations
New York         86     88     87    71     70     72
Toronto          64     65     66    59     58     57
So basically pd.pivot_table seems to be sorting the columns alphabetically, whereas I want them sorted in chronological order.
Any help would be appreciated.
(Note that the dates are continuously changing, so other months will have a similar problem.)

You won't be able to prevent the sorting, but you can always enforce the original ordering by using .reindex with the unique values from the column!
table = df.pivot_table(index='stations', columns='forecast_date', values=['high', 'low'], aggfunc='first')
print(
table
)
               high               low
forecast_date Aug 1 Jul 30 Jul 31 Aug 1 Jul 30 Jul 31
stations
New York         86     88     87    71     70     72
Toronto          64     65     66    59     58     57
print(
table.reindex(columns=df['forecast_date'].unique(), level='forecast_date')
)
               high                low
forecast_date Jul 30 Jul 31 Aug 1 Jul 30 Jul 31 Aug 1
stations
New York          88     87    86     70     72    71
Toronto           65     66    64     58     57    59
Note that this is different from sorting in chronological order. To do that you would have to cast to a datetime and sort on that.
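For instance, a minimal sketch of a true chronological sort (the year is an assumption here, since the date strings don't carry one):
# parse the date strings with an assumed year to build a chronological column order
order = sorted(df['forecast_date'].unique(),
               key=lambda d: pd.to_datetime(f"{d} 2023", format="%b %d %Y"))
print(table.reindex(columns=order, level='forecast_date'))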

Related

Plotting different predictions with same column names and categories Python/Seaborn

I have a df with different groups, and two predictions (iqr, median).
cntx_iqr  pred_iqr cntx_median  pred_median
18-54           83      K18-54           72
R18-54          34      R18-54           48
25-54           33       18-34           47
K18-54          29       18-54           47
18-34           27      R25-54           29
K18-34          25       25-54           23
K25-54          24      K25-54           14
R18-34          22      R18-34            8
R25-54          17      K18-34            6
Now I want to plot them using seaborn, and I have melted the data for plotting. However, it does not look right to me.
pd.melt(df, id_vars=['cntx_iqr', 'cntx_median'], value_name='category', var_name="kind")
I am aiming to compare the predictions (pred_iqr, pred_median) across those two groupings (cntx_iqr, cntx_median), maybe with a stacked barplot or some other useful plot to see how each group differs between the two predictions.
Any help/suggestion would be appreciated.
Thanks in advance
Not sure how you obtained the data frame, but you need to match the values first:
df = df[['cntx_iqr', 'pred_iqr']].merge(df[['cntx_median', 'pred_median']],
                                        left_on='cntx_iqr', right_on='cntx_median')
df.head()
  cntx_iqr  pred_iqr cntx_median  pred_median
0    18-54        83       18-54           47
1   R18-54        34      R18-54           48
2    25-54        33       25-54           23
3   K18-54        29      K18-54           72
4    18-34        27       18-34           47
Once you have this, you can just make a scatterplot:
sns.scatterplot(x='pred_iqr', y='pred_median', data=df)
The barplot requires a bit of pivoting, but should be:
sns.barplot(x='cntx_iqr', y='value', hue='variable',
            data=df.melt(id_vars='cntx_iqr', value_vars=['pred_iqr', 'pred_median']))
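For a self-contained run, here is a sketch that rebuilds the frame from the values in the question (the imports are assumptions; the data is copied from above):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# rebuild the example frame from the question's values
df = pd.DataFrame({
    'cntx_iqr':    ['18-54', 'R18-54', '25-54', 'K18-54', '18-34',
                    'K18-34', 'K25-54', 'R18-34', 'R25-54'],
    'pred_iqr':    [83, 34, 33, 29, 27, 25, 24, 22, 17],
    'cntx_median': ['K18-54', 'R18-54', '18-34', '18-54', 'R25-54',
                    '25-54', 'K25-54', 'R18-34', 'K18-34'],
    'pred_median': [72, 48, 47, 47, 29, 23, 14, 8, 6],
})

# align the two groupings on their shared category labels
df = df[['cntx_iqr', 'pred_iqr']].merge(df[['cntx_median', 'pred_median']],
                                        left_on='cntx_iqr', right_on='cntx_median')

# grouped barplot comparing the two predictions per category
sns.barplot(x='cntx_iqr', y='value', hue='variable',
            data=df.melt(id_vars='cntx_iqr', value_vars=['pred_iqr', 'pred_median']))
plt.show()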

Pandas dataframe Plotly line chart with two lines

I have a pandas dataframe as below and I would like to produce a few charts with the data. The 'Acc' column holds the account names, the 'User' column is the number of users under each account, and the month columns are the login counts of each account in each month.
Acc      User  Jan  Feb  Mar  Apr  May  June
Nora       39    5   13   16   22   14    20
Bianca     53   14   31   22   21   20    29
Anna       65   30   17   18   28   12    13
Katie      46    9   12   30   34   25    15
Melissa    29   29   12   30   10    4     9
1st: I would like to monitor the trend of logins from January to May: one line illustrating Bianca's logins and the other illustrating everyone else's.
2nd: I would like to monitor the percentage change in logins from January to May: one line illustrating Bianca's percentage change and the other illustrating everyone else's.
Thank you for your time and assistance. I'm a beginner at this, so any help is much appreciated!
I suggest the best approach to grouping is to use categoricals. pct_change is not a direct aggregate function, so it's a bit more involved to get it.
import io
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv(io.StringIO("""Acc User Jan Feb Mar Apr May June
Nora 39 5 13 16 22 14 20
Bianca 53 14 31 22 21 20 29
Anna 65 30 17 18 28 12 13
Katie 46 9 12 30 34 25 15
Melissa 29 29 12 30 10 4 9"""), sep="\s+")

# just set up 2 plot areas
fig, ax = plt.subplots(1, 2, figsize=[20, 5])

# we want to divide the data into 2 groups
df["grp"] = pd.Categorical(df["Acc"], ["Bianca", "Others"])
df["grp"].fillna("Others", inplace=True)

# just get it out of the way...
df.drop(columns="User", inplace=True)

# simple plot where an aggregate function exists directly; no transform needed to get the lines
# (numeric_only=True keeps the string column "Acc" out of the sums)
df.groupby("grp").sum(numeric_only=True).T.plot(ax=ax[0])

# a bit more sophisticated to get pct change...
df.groupby("grp").sum(numeric_only=True).T.assign(
    Bianca=lambda x: x["Bianca"].pct_change().fillna(0) * 100,
    Others=lambda x: x["Others"].pct_change().fillna(0) * 100,
).plot(ax=ax[1])
plt.show()
(Output: two side-by-side line charts, total logins on the left and percent change on the right, each comparing Bianca vs. Others.)

How to get the column number by cell value in Python using openpyxl

I am completely new to openpyxl and Python, and I am having a hard time with this issue, so I need your help.
JAN  FEB  MAR  MAR YTD  2019 YTD
 25    9   57       23         7
 61   41   29        5        57
 54   34   58       10         7
 13   13   63       26        45
 31   71   40       40        40
 24   38   63       63        47
 31   50   43        2        61
 68   33   13        9        63
 28    1   30       39        71
I have an excel report with the data above. I'd like to search the cells for those that contain a specific string (i.e., YTD) and get the column number of the YTD column. I want to use that column number to extract data from the column. I do not want to use a fixed row and cell reference, as the excel file gets updated regularly, so the column will always move.
def t_PM(ff_sheet1, start_row):
    wb = openpyxl.load_workbook(filename='report')  # open report
    report_sheet1 = wb.get_sheet_by_name('sheet 1')
    col = -1
    for j, keyword in enumerate(report_sheet1.values(0)):
        if keyword == 'YTD':
            col = j
            break
    ff_sheet1.cell(row=insert_col + start_row, column=header['YTD_OT'],
                   value=report_sheet1.cell(row=i + 7, column=col).value)
But then, I get a "'generator' object is not callable" error. How can I fix this?
Your problem is that report_sheet1.values is a generator, so you can't call it with (0). I'm assuming from your code that you don't want to rely on "YTD" appearing in the first row, which is why you iterate over all cells. Do this by:
import openpyxl

def find_YTD():
    wb = openpyxl.load_workbook(filename='report')  # open report
    report_sheet1 = wb['sheet 1']
    # walk the columns; enumerate from 1 because openpyxl columns are 1-based
    for idx, col in enumerate(report_sheet1.iter_cols(values_only=True), start=1):
        for value in col:
            if isinstance(value, str) and 'YTD' in value:
                return idx  # the 1-based column number
If you are assuming this data will be in the first row, simply do:
for cell in report_sheet1[1]:
    if isinstance(cell.value, str) and 'YTD' in cell.value:
        return cell.column
openpyxl uses 1-based indexing for rows and columns.
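As a hypothetical follow-up (the variable names and the min_row=2 header skip are assumptions), the returned column number can then be used to pull that column's data:
wb = openpyxl.load_workbook(filename='report')
ws = wb['sheet 1']
col_idx = find_YTD()  # 1-based column number from the function above
if col_idx is not None:
    # collect every value in that column, skipping the header row
    ytd_values = [c.value for (c,) in ws.iter_rows(min_col=col_idx,
                                                   max_col=col_idx,
                                                   min_row=2)]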
Read the docs - access many cells

I need help building a new dataframe from an old one, by applying a method to each row, keeping the same index and columns

I have a dataframe (df_input), and I'm trying to convert it to another dataframe (df_output) by applying a formula to each element in each row. The formula requires information about the whole row (min, max, median).
df_input:
             A   B   C   D   E   F   G   H   I   J
2011-01-01  60  48  26  29  41  91  93  87  39  65
2011-01-02  88  52  24  99   1  27  12  26  64  87
2011-01-03  13   1  38  60   8  50  59   1   3  76
df_output:
            F(A)  F(B)  F(C)  F(D)  F(E)  F(F)  F(G)  F(H)  F(I)  F(J)
2011-01-01    93    54    45    52     8    94    65    37     2    53
2011-01-02    60    44    94    62    78    77    37    97    98    76
2011-01-03    53    58    16    63    60     9    31    44    79    35
I'm trying to go from df_input to df_output, as above, after applying f(x) to each cell per row. The function foo maps element x to f(x) by doing an OLS regression of the row's min, median and max against some coordinates. This is done for each period.
I'm aware that I should iterate over the rows and then, for each row, apply the function to each element. Where I am struggling is getting the output of foo into df_output.
for index, row in df_input.iterrows():
    row_min = row.min()   # renamed so the builtins min/max aren't shadowed
    row_max = row.max()
    row_mean = row.mean()
    # apply the function to each element of the row
    new_row = row.apply(lambda x: foo(x, row_min, row_max, row_mean))
    # add this to df_output
help!
My current thinking is to build up the new df row by row. I'm trying to do that, but I'm getting a lot of multiindex columns etc. Any pointers would be great.
Thanks so much... merry xmas to you all.
Consider calculating row aggregates with DataFrame.* methods and then passing the series values in a DataFrame.apply() across columns:
# ROW-WISE AGGREGATES (computed on the original columns only, so the new
# helper columns don't feed back into each other's aggregates)
cols = list('ABCDEFGHIJ')
df['row_min'] = df[cols].min(axis=1)
df['row_max'] = df[cols].max(axis=1)
df['row_mean'] = df[cols].mean(axis=1)
# COLUMN-WISE CALCULATION (DEFAULT axis=0)
new_df = df[cols].apply(lambda col: foo(col,
                                        df['row_min'],
                                        df['row_max'],
                                        df['row_mean']))
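To make this runnable end to end, here is a purely hypothetical stand-in for foo (the question's real foo fits an OLS line through the row's min, median and max); simple min-max scaling keeps the same signature:
# hypothetical foo: scale each value into [0, 1] using the row's min and max
# (row_mean is accepted but unused in this stand-in)
def foo(col, row_min, row_max, row_mean):
    return (col - row_min) / (row_max - row_min)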

Remove Unnamed columns in pandas dataframe [duplicate]

This question already has answers here:
How to get rid of "Unnamed: 0" column in a pandas DataFrame read in from CSV file?
(11 answers)
Closed 4 years ago.
I have a data file with columns A-G like below, but when I read it with pd.read_csv('data.csv') it prints an extra unnamed column at the end for no reason.
colA  ColB  colC  colD  colE  colF  colG  Unnamed: 7
  44    45    26    26    40    26    46         NaN
  47    16    38    47    48    22    37         NaN
  19    28    36    18    40    18    46         NaN
  50    14    12    33    12    44    23         NaN
  39    47    16    42    33    48    38         NaN
I have checked my data file several times, but I have no extra data in any other column. How should I remove this extra column while reading? Thanks
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]
In [162]: df
Out[162]:
   colA  ColB  colC  colD  colE  colF  colG
0    44    45    26    26    40    26    46
1    47    16    38    47    48    22    37
2    19    28    36    18    40    18    46
3    50    14    12    33    12    44    23
4    39    47    16    42    33    48    38
NOTE: very often there is only one unnamed column Unnamed: 0, which is the first column in the CSV file. This is the result of the following steps:
a DataFrame is saved into a CSV file using parameter index=True, which is the default behaviour
we read this CSV file into a DataFrame using pd.read_csv() without explicitly specifying index_col=0 (default: index_col=None)
The easiest way to get rid of this column is to specify the parameter pd.read_csv(..., index_col=0):
df = pd.read_csv('data.csv', index_col=0)
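A minimal round trip showing where the column comes from (the file name and sample values are illustrative):
import pandas as pd

df = pd.DataFrame({'colA': [44, 47], 'colB': [45, 16]})
df.to_csv('data.csv')                       # index=True by default -> extra first column
print(pd.read_csv('data.csv').columns)      # Index(['Unnamed: 0', 'colA', 'colB'], dtype='object')
df = pd.read_csv('data.csv', index_col=0)   # reads that column back in as the index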
First, find the columns whose names contain 'unnamed', then drop those columns. Note that inplace=True is passed to .drop, so the DataFrame is modified in place:
df.drop(df.columns[df.columns.str.contains('unnamed', case=False)], axis=1, inplace=True)
The pandas.DataFrame.dropna function removes missing values (e.g. NaN, NaT).
For example, the following would remove any columns from your dataframe where all of the elements of that column are missing (note that dropna returns a new DataFrame, so assign the result):
df = df.dropna(how='all', axis='columns')
The accepted solution doesn't work in my case, so my solution is the following:
# The column name in the example case is "Unnamed: 7",
# but it works with any other name ("Unnamed: 0" for example).
df.rename({"Unnamed: 7": "a"}, axis="columns", inplace=True)
# Then, drop the column as usual.
df.drop(["a"], axis=1, inplace=True)
Hope it helps others.
