I would like to group the strings in the column called 'tipology' and plot the counts in a Plotly bar chart. The problem is that I can't extract the x and y values from the new table created with groupby to use in the graph:
tipol1 = df.groupby(['tipology']).nunique()
tipol1
The output gives me tipology as the index and, for each group, the counts:
          number  data
tipology
one            2   113
two           33    33
three         12    88
four          44   888
five          11    66
The number column (in which I have other values) gives me the correct counts for the tipology grouping.
The data column also gives me values (I think it is grouping the dates, but not showing the dates in the correct format).
I also found:
tipol=df.groupby(['tipology']).nunique()
tipol2 = tipol[['number']]
tipol2
to take only the number column, but no luck: I need the tipology column (not as the index) plus the column with the tipology group counts, so I can use them as the x and y axes and import them into Plotly!
One last attempt I made (making a big mess):
tipol = df.groupby(['tipology'], as_index=False).nunique()
tipol2 = tipol[['number']]
fig = go.Figure(data=[
    go.Bar(name='test', x=df['tipology'], y=tipol2)
])
fig.update_layout(barmode='stack')
fig.show()
Any suggestions?
Thanks!
UPDATE
I would have too much code to give as an example; it would be difficult for me and would waste your time too. Basically, I need a groupby with an added column that shows the group count, e.g.:
tipology Date
home 10/01/18
home 11/01/18
garden 12/01/18
garden 12/01/18
garden 13/01/18
bathroom 13/01/18
bedroom 14/01/18
bedroom 15/01/18
kitchen 16/01/18
kitchen 16/01/18
kitchen 17/01/18
I wish this would happen: first, deleting the Date column and inserting a value column into the DataFrame that holds the count:
tipology value
home 2
garden 3
bathroom 1
bedroom 2
kitchen 3
Then (I'm working with a Jupyter notebook), keeping the Date column and adding the corresponding counts to a value column based on their grouping:
tipology   Date       value
home       10/01/18   1
home       11/01/18   1
garden     12/01/18   2
garden     12/01/18
garden     13/01/18   1
bathroom   13/01/18   1
bedroom    14/01/18   1
bedroom    15/01/18   1
kitchen    16/01/18   2
kitchen    16/01/18
kitchen    17/01/18   1
I need these columns available so I can assign them to the x and y axes and import them into a graph, so none of the columns should be the index.
By default, groupby returns a dataframe where the fields you group on end up in the index of the dataframe. You can adjust this behaviour by setting as_index=False in the groupby; then tipology will still be a regular column in the dataframe that is returned:
tipol1 = df.groupby('tipology', as_index=False).nunique()
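As a minimal end-to-end sketch (the sample data below is copied from the update; note that the counts in the update's first table are row counts per group, so .size() rather than .nunique() reproduces them exactly):

import pandas as pd
import plotly.graph_objects as go

# Sample data taken from the update's example (assumed for illustration)
df = pd.DataFrame({
    'tipology': ['home', 'home', 'garden', 'garden', 'garden', 'bathroom',
                 'bedroom', 'bedroom', 'kitchen', 'kitchen', 'kitchen'],
    'Date': ['10/01/18', '11/01/18', '12/01/18', '12/01/18', '13/01/18',
             '13/01/18', '14/01/18', '15/01/18', '16/01/18', '16/01/18',
             '17/01/18'],
})

# as_index=False keeps tipology as a regular column; in recent pandas,
# size() then yields a 'size' column holding the row count per group
tipol1 = (df.groupby('tipology', as_index=False).size()
          .rename(columns={'size': 'value'}))

fig = go.Figure(data=[
    go.Bar(name='test', x=tipol1['tipology'], y=tipol1['value'])
])
fig.update_layout(barmode='stack')
fig.show()

For the row-level value column in the update's second table, something like df.groupby(['tipology', 'Date'])['Date'].transform('size') gets close, though it repeats the count on every duplicated row instead of leaving blanks.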
Related
I have the following dataframe, which contains 2 rows:
index name food color number year hobby music
0 Lorenzo pasta blue 5 1995 art jazz
1 Lorenzo pasta blue 3 1995 art jazz
I want to write code that will be able to tell me which column is the one that can distinguish between these two rows.
For example, in this dataframe, the column "number" is the one that distinguishes between the two rows.
Until now I have done this very simply by just going over column after column using iloc and looking at the values:
duplicates.iloc[:,3]
>>>
0 blue
1 blue
It's important to take into account that:
This should run inside a for loop, since each time I check a newly generated dataframe.
There may be more than 2 rows which I need to check.
There may be more than 1 column that can distinguish between the rows.
I thought that the way to check such a thing would be something like: take one column at a time, get its unique values, and check whether they are all equal to each other, similar to this:
for n in np.arange(0, len(df.columns)):
    tmp = df.iloc[:, n]
and then I thought to compare whether all the values in the temporary column are equal to each other, but here I got stuck because sometimes I have many rows.
My end goal: to be able to check, inside a for loop, which column has different values in each row of the temporary dataframe, and hence can help to distinguish between the rows.
You can apply the duplicated method on all columns:
s = df.apply(pd.Series.duplicated).any()
s[~s].index
Output: ['number']
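For illustration, a self-contained sketch of why this works, rebuilding the two-row dataframe from the question:

import pandas as pd

df = pd.DataFrame({
    'name':   ['Lorenzo', 'Lorenzo'],
    'food':   ['pasta', 'pasta'],
    'color':  ['blue', 'blue'],
    'number': [5, 3],
    'year':   [1995, 1995],
    'hobby':  ['art', 'art'],
    'music':  ['jazz', 'jazz'],
})

# duplicated() flags values that repeat an earlier value in the column,
# so any() is True if the column contains at least one repeated value.
s = df.apply(pd.Series.duplicated).any()

# Columns with no duplicates are the ones that distinguish the rows.
print(s[~s].index.tolist())  # ['number']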
I'm trying to count the number of 0s in a column based on conditions in another column. I have three columns in the spreadsheet: DATE, LOCATION, and SALES. Column 1 is the date column. Column 2 is the location column (there are 5 different locations). Column 3 is the sales volume for the day.
I want to count the number of instances where the different locations have 0 sales for the day.
I have tried a number of groupby combinations and cannot get an answer.
df_summary = df.groupby(['Location']).count()['Sales'] == 0
Any help is appreciated.
Try filtering first:
(df[df['Sales'] == 0].groupby('Location').size()
 .reindex(df['Location'].unique(), fill_value=0)  # fill in locations with no zero-sales days
 .reset_index(name='No Sales Days')               # convert to dataframe
)
Or
df['Sales'].eq(0).groupby(df['Location']).sum()
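A runnable sketch with made-up sample data (the location names and figures here are assumptions):

import pandas as pd

df = pd.DataFrame({
    'Date':     ['2018-01-01', '2018-01-01', '2018-01-02', '2018-01-02'],
    'Location': ['North', 'South', 'North', 'South'],
    'Sales':    [0, 120, 0, 0],
})

# Keep only zero-sales rows, then count them per location;
# reindex fills in locations that never had a zero-sales day.
no_sales = (df[df['Sales'] == 0].groupby('Location').size()
            .reindex(df['Location'].unique(), fill_value=0)
            .reset_index(name='No Sales Days'))
print(no_sales)
#   Location  No Sales Days
# 0    North              2
# 1    South              1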
Hi, I have a table like this after a groupby:
t = df.loc[(year-3 <= year) & (year <= year-1), 'Net Sum'].groupby([month, association]).sum()
t
YearMonth  Type
1          Other     27471.73
           base  -14563752.74
           plan   16286620.30
2          Other    754691.36
           base   30465722.53
           plan   17906687.29
3          Other     20285.92
           base   29339325.21
           plan   15492558.91
How can I fill the blanks with the grouped YearMonth without resetting the index, as I'd like to keep YearMonth as the index?
Expected outcome:
t
YearMonth  Type
1          Other     27471.73
1          base  -14563752.74
1          plan   16286620.30
2          Other    754691.36
2          base   30465722.53
2          plan   17906687.29
3          Other     20285.92
3          base   29339325.21
3          plan   15492558.91
I think this can only be achieved by altering the display option:
with pd.option_context('display.multi_sparse', False):
    print(t)
If we refer to the docs:
display.multi_sparse True
“Sparsify” MultiIndex display (don’t display repeated elements in outer levels within groups)
Hence we can set this to False.
The following should also do the work:
t.reset_index()
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html
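For illustration, a small sketch showing both options on a toy series shaped like the question's output (the numbers are copied from it):

import pandas as pd

# Build a MultiIndex series shaped like the question's table
idx = pd.MultiIndex.from_product([[1, 2], ['Other', 'base', 'plan']],
                                 names=['YearMonth', 'Type'])
t = pd.Series([27471.73, -14563752.74, 16286620.30,
               754691.36, 30465722.53, 17906687.29],
              index=idx, name='Net Sum')

# Option 1: keep the MultiIndex, only change how it is displayed
with pd.option_context('display.multi_sparse', False):
    print(t)  # YearMonth repeats on every row

# Option 2: move the index levels back into columns
print(t.reset_index())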
Let's say I start with a dataframe that looks like this:
Group Val date
0 home first 2017-12-01
1 home second 2017-12-02
2 away first 2018-03-07
3 away second 2018-03-01
Data types are [string, string, datetime]. I would like to get a dataframe that for each group, shows me the value that was entered most recently:
  Group Most recent Val Most recent date
0  home          second       12-02-2017
1  away           first       03-07-2018
(Data types are [string, string, datetime])
My initial thought is that I should be able to do something like this by grouping by 'Group' and then aggregating the dates and Vals. I know I can get the most recent datetime using the 'max' agg function, but I'm stuck on what function to use to get the corresponding Val:
df.groupby('Group').agg({'Val': lambda x: ____????____,
                         'date': 'max'})
Thanks,
In case I understood you right, you can do this:
df.iloc[df.groupby('Group').agg({'date': 'idxmax'}).date]
Or as a whole example:
import pandas as pd
import numpy as np

np.random.seed(42)
data = [(np.random.choice(['home', 'away'], size=1)[0],
         np.random.choice(['first', 'second'], size=1)[0],
         pd.Timestamp(np.random.rand() * 1.9989e+18)) for i in range(10)]
df = pd.DataFrame.from_records(data)
df.columns = ['Group', 'Val', 'date']
df.iloc[df.groupby('Group').agg({'date': 'idxmax'}).date]
Which selects
Group Val date
5 away first 2031-06-09 06:26:43.486610432
0 home second 2030-03-22 04:07:07.082781440
from
Group Val date
0 home second 2030-03-22 04:07:07.082781440
1 home second 2007-12-03 05:07:24.061456384
2 home second 1979-11-18 23:57:26.700035456
3 home first 2024-11-12 08:18:17.789517824
4 away second 2014-11-07 13:17:55.756515328
5 away first 2031-06-09 06:26:43.486610432
6 away second 1983-06-14 13:17:28.334806208
7 away second 1981-08-14 03:21:14.746028864
8 away second 2003-03-29 11:00:31.189680256
9 away first 1988-06-12 16:58:48.341865984
First select the indices of the rows whose date value is maximum within each group:
max_indices = df.groupby(['Group'])['date'].idxmax()
and then select the corresponding rows in the original dataframe, perhaps only keeping the actual value you are interested in (idxmax returns index labels, so .loc is the safe lookup):
df.loc[max_indices, 'Val']
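On the randomly seeded df above, a short usage sketch that retrieves the full rows rather than just 'Val':

# Index labels of the most recent date within each group
max_indices = df.groupby('Group')['date'].idxmax()

# .loc works on the labels returned by idxmax
print(df.loc[max_indices, ['Group', 'Val', 'date']])
#   Group    Val                          date
# 5  away  first 2031-06-09 06:26:43.486610432
# 0  home second 2030-03-22 04:07:07.082781440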
I have a dataset
a b c d
10-Apr-86 Jimmy 1 this is
11-Apr-86 Minnie 2 the way
12-Apr-86 Jimmy 3 the world
13-Apr-86 Minnie 4 ends
14-Apr-86 Jimmy 5 this is the
15-Apr-86 Eliot 6 way
16-Apr-86 Jimmy 7 the world ends
17-Apr-86 Eliot 8 not with a bang
18-Apr-86 Minnie 9 but a whimper
I want to make a chart in matplotlib that looks like this
I've figured out how to get just the dots (no annotations) using the following code:
df = pd.read_csv('python.csv')
df_wanted = pd.pivot_table(df,
                           index='a',
                           columns='b',
                           values='c')
df_wanted.index = pd.to_datetime(df_wanted.index)
plt.scatter(df_wanted.index, df_wanted['Jimmy'])
plt.scatter(df_wanted.index, df_wanted['Minnie'])
plt.scatter(df_wanted.index, df_wanted['Eliot'])
I think that to annotate, I need a list of values (as demonstrated here) in the final column of my pivot table.
My problem is: how do I get that final column 'd' of the original dataset to become the final column of my pivot table?
I tried dat1 = pd.concat([df_wanted, df['d']], axis=1), but this created a new set of rows underneath the rows of my dataframe; I realized the indexes weren't the same. So I tried to make a new pivot table with the d column as values, but got the error message "No numeric types to aggregate".
I tried df_wanted2.append(df['d']), but this made a new column for every element in column d.
Any advice? Ultimately, I want the data labels to appear when one hovers over a point with the mouse.
In this specific case, it doesn't seem you need to set column d as the final column of your pivot table.
plt.scatter(df_wanted.index, df_wanted['Jimmy'])
plt.scatter(df_wanted.index, df_wanted['Minnie'])
plt.scatter(df_wanted.index, df_wanted['Eliot'])
plt.legend(loc=0)

# Annotate each point with column d; convert 'a' to datetime so the text
# x-positions line up with the datetime axis used by the scatter plots
for k, v in df.set_index('a').iterrows():
    plt.text(pd.to_datetime(k), v['c'], v['d'])  # or: plt.annotate(v['d'], xy=(pd.to_datetime(k), v['c']))
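For reference, a self-contained sketch that builds the dataframe inline instead of reading python.csv (the data is copied from the question; the date format string is an assumption):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'a': ['10-Apr-86', '11-Apr-86', '12-Apr-86', '13-Apr-86', '14-Apr-86',
          '15-Apr-86', '16-Apr-86', '17-Apr-86', '18-Apr-86'],
    'b': ['Jimmy', 'Minnie', 'Jimmy', 'Minnie', 'Jimmy',
          'Eliot', 'Jimmy', 'Eliot', 'Minnie'],
    'c': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'd': ['this is', 'the way', 'the world', 'ends', 'this is the',
          'way', 'the world ends', 'not with a bang', 'but a whimper'],
})
# Parse the dates up front so everything shares one datetime axis
df['a'] = pd.to_datetime(df['a'], format='%d-%b-%y')

df_wanted = pd.pivot_table(df, index='a', columns='b', values='c')

for name in ['Jimmy', 'Minnie', 'Eliot']:
    plt.scatter(df_wanted.index, df_wanted[name], label=name)
plt.legend(loc=0)

# One text label per original row, placed at (date, value)
for _, row in df.iterrows():
    plt.text(row['a'], row['c'], row['d'])
plt.show()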