I have 3 dataframes of yearly data (one each for 2014, 2015 and 2016), each with 3 columns named 'PRACTICE', 'BNF NAME' and 'ITEMS'.
BNF NAME refers to drug names, and I am picking out 3 of them: Ampicillin, Amoxicillin and Co-Amoxiclav. This column includes different strengths/dosages (e.g. Co-Amoxiclav 200mg or Co-Amoxiclav 300mg) that I want to ignore, so I have used str.contains() to select these 3 drugs.
ITEMS is the total number of prescriptions written for each drug.
I want to create a stacked bar chart with the year on the x axis (2014, 2015, 2016) and the total number of prescriptions on the y axis, with each of the 3 bars split into 3 segments, one per drug name.
I am assuming I need to use df.groupby() and maybe select on a partial string, but I am unsure how to combine the yearly data and then how to group it to create the stacked bar chart.
Any guidance would be much appreciated.
This is the line of code I am using to select the rows for the 3 drug names only.
frame=frame[frame['BNF NAME'].str.contains('Ampicillin' and 'Amoxicillin' and 'Co-Amoxiclav')]
This is what each of the dataframes resembles:
PRACTICE | BNF NAME | ITEMS
Y00327 | Co-Amoxiclav_Tab 250mg/125mg | 23
Y00327 | Co-Amoxiclav_Susp 125mg/31mg/5ml S/F | 10
Y00327 | Co-Amoxiclav_Susp 250mg/62mg/5ml S/F | 6
Y00327 | Co-Amoxiclav_Susp 250mg/62mg/5ml | 1
Y00327 | Co-Amoxiclav_Tab 500mg/125mg | 50
There are likely going to be a few different ways in which you could accomplish this. Here's how I would do it. I'm using a jupyter notebook, so your matplotlib imports may be different.
import pandas as pd
%matplotlib
import matplotlib.pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
df = pd.DataFrame({'PRACTICE': ['Y00327', 'Y00327', 'Y00327', 'Y00327', 'Y00327'],
'BNF NAME': ['Co-Amoxiclav_Tab 250mg/125mg', 'Co-Amoxiclav_Susp 125mg/31mg/5ml S/F',
'Co-Amoxiclav_Susp 250mg/62mg/5ml S/F', 'Ampicillin 250mg/62mg/5ml',
'Amoxicillin_Tab 500mg/125mg'],
'ITEMS': [23, 10, 6, 1, 50]})
Out[52]:
BNF NAME ITEMS PRACTICE
0 Co-Amoxiclav_Tab 250mg/125mg 23 Y00327
1 Co-Amoxiclav_Susp 125mg/31mg/5ml S/F 10 Y00327
2 Co-Amoxiclav_Susp 250mg/62mg/5ml S/F 6 Y00327
3 Ampicillin 250mg/62mg/5ml 1 Y00327
4 Amoxicillin_Tab 500mg/125mg 50 Y00327
To simulate your three dataframes:
df1 = df.copy()
df2 = df.copy()
df3 = df.copy()
Set a column indicating what year the dataframe represents.
df1['YEAR'] = 2014
df2['YEAR'] = 2015
df3['YEAR'] = 2016
Combining the three dataframes:
combined_df = pd.concat([df1, df2, df3], ignore_index=True)
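As a variation on the two steps above, the year can be attached while concatenating. This is only a minimal sketch: the `frames` dict and the tiny stand-in dataframes are made up for illustration, and the real yearly dataframes would take their place.

```python
import pandas as pd

# Hypothetical stand-ins for the three yearly dataframes.
df1 = pd.DataFrame({'BNF NAME': ['Ampicillin 250mg'], 'ITEMS': [1]})
df2 = pd.DataFrame({'BNF NAME': ['Ampicillin 250mg'], 'ITEMS': [2]})
df3 = pd.DataFrame({'BNF NAME': ['Ampicillin 250mg'], 'ITEMS': [3]})

# Map each year to its dataframe (names assumed for this sketch).
frames = {2014: df1, 2015: df2, 2016: df3}

# assign() returns a copy with the extra column, so the originals stay untouched.
combined_df = pd.concat(
    [frame.assign(YEAR=year) for year, frame in frames.items()],
    ignore_index=True,
)
print(combined_df)
```

Keeping the year-to-dataframe mapping in one place makes it harder to label a dataframe with the wrong year when more years are added.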
To set what drug each row represents:
combined_df['parsed_drug_name'] = "" # creates a blank column
amp_bool = combined_df['BNF NAME'].str.contains('Ampicillin', case=False)
combined_df.loc[amp_bool, 'parsed_drug_name'] = 'Ampicillin' # sets the row to 'Ampicillin' if BNF NAME contains 'ampicillin'
amox_bool = combined_df['BNF NAME'].str.contains('Amoxicillin', case=False)
combined_df.loc[amox_bool, 'parsed_drug_name'] = 'Amoxicillin'
co_amox_bool = combined_df['BNF NAME'].str.contains('Co-Amoxiclav', case=False)
combined_df.loc[co_amox_bool, 'parsed_drug_name'] = 'Co-Amoxiclav'
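It is worth noting why the filter in the question does not select all three drugs: Python evaluates `'Ampicillin' and 'Amoxicillin' and 'Co-Amoxiclav'` before pandas sees it, and the result is just the last string. A possible alternative, sketched below with made-up sample names, is a regular-expression alternation, which can both filter the rows and extract the drug name in one step.

```python
import re
import pandas as pd

names = pd.Series([
    'Co-Amoxiclav_Tab 250mg/125mg',
    'Ampicillin 250mg/62mg/5ml',
    'Amoxicillin_Tab 500mg/125mg',
    'Paracetamol_Tab 500mg',  # hypothetical row that should not match
])

# A chained `and` of non-empty strings evaluates to the last one.
last_only = 'Ampicillin' and 'Amoxicillin' and 'Co-Amoxiclav'

drugs = ['Ampicillin', 'Amoxicillin', 'Co-Amoxiclav']
alternation = '|'.join(re.escape(d) for d in drugs)

# True where any of the three names appears, ignoring case.
mask = names.str.contains(alternation, case=False)

# The matched name itself; rows without a match become NaN.
parsed = names.str.extract(f'({alternation})', flags=re.IGNORECASE, expand=False)
print(parsed)
```

`str.extract` returns the text as it appears in the data, so if the capitalisation varies between rows it may need normalising before grouping.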
Finally, perform a pivot on the data, and plot the results:
combined_df.pivot_table(index='YEAR', columns='parsed_drug_name', values='ITEMS', aggfunc='sum').plot.bar(rot=0, stacked=True)
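To see what the plot is built from, it can help to look at the pivoted table on its own first. The sketch below uses a small, hypothetical, already-labelled dataframe; `fill_value=0` is an addition here so that a drug missing in some year shows as 0 rather than NaN.

```python
import pandas as pd

# Hypothetical, already-labelled data for two years.
combined_df = pd.DataFrame({
    'YEAR': [2014, 2014, 2014, 2015, 2015],
    'parsed_drug_name': ['Co-Amoxiclav', 'Co-Amoxiclav', 'Ampicillin',
                         'Co-Amoxiclav', 'Amoxicillin'],
    'ITEMS': [23, 10, 1, 6, 50],
})

# One row per year, one column per drug, cells hold the summed ITEMS.
table = combined_df.pivot_table(index='YEAR', columns='parsed_drug_name',
                                values='ITEMS', aggfunc='sum', fill_value=0)
print(table)
```

Calling `.plot.bar(stacked=True, rot=0)` on a table of this shape draws one bar per row (year) and one segment per column (drug).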
I'm trying to filter unique values for product descriptions (column: description) with a unique price (net_price). Example:
description net_price site
Product 1 9 USCA
Product 2 7 USCA
product 3 6 USCA
product 1 12 USNY
product 2 7 USNY
product 4 10 USBP
After filtering with un = master.dropna().groupby('description')['net_price'].unique() I get the following, which is what I want: a list of services that is unique by price.
description net_price
product 1 [9,12]
product 2 [7]
product 3 [6]
product 4 [10]
Using serv = master.pivot_table(values='quantity', index=un, columns='site', aggfunc=np.sum), I'm trying to build a pivot table with the unique descriptions as the index.
What I'm expecting to get:
USCA USNY USBP
description
product 1 9
product 1 12
product 2 7
product 3 6
product 4 10
What I actually get is an empty DataFrame. NOTE: this works if I don't filter for unique values, except that it does not give me the duplicates with different prices.
My script:
import camelot
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import re
re_filter = re.compile(r'^[A-Z]')
pdf_data = camelot.read_pdf('test.pdf', pages='all', flavor='lattice', encoding='latin-1' )# # GETS Tables
list_of_df = list()# LIST HOLDS ALL DATAFRAMES
prev_df = []# # LIST HOLDS ARRAY WITH ENDING SITE COL AS DIGIT
for table in pdf_data:# iterates over all dataframes
    df = table.df.rename({0:'product_no',1:'description', 2:'quantity', 3:'net_price', 4:'total'}, axis=1)
    site = df.iloc[-1, 0]
    df =df.reindex(df.index.drop(0)).reset_index(drop=True)
    if re.match(re_filter, str(site)): # CHECKS IF SITE STARTS LIKE 'USCA' INSTEAD OF THIS '112233'
        if len(prev_df) > 0: # CHECKS IF 'prev_df' HAS ANY DATAFRAME IN IT
            prev_df.insert(1,df)
            df =pd.concat(prev_df) # TRYS TO CONCATENATE CURRENT DATAFRAME WITH STORED ON 'prev_df' DATAFRAME
            df["site"] = site
            df.drop(index=df.index[-1], axis=0, inplace=True)# THIS IS DROPING LAST ROW OF 'prev_df -->need to Move '
            list_of_df.append(df)
            prev_df.clear() # Removes item in Prev_df
        else:
            # makes normal changes to current DataFrame
            df['site'] = site
            df.drop(index=df.index[-1], axis=0, inplace=True)
            list_of_df.append(df)
    else:
        prev_df.insert(0,df) # APPENDS INCOMPLETE DATAFRAME TO 'prev_df'
master = pd.concat(list_of_df)
# FILTER UNIQUE LIST OF DESCRIPTIONS
un= master.dropna().groupby('description')['net_price'].unique()
print(un)
# master_filtered = master.dropna(axis=1, how='all', inplace=True)
# currently gets data in this format need to display like this
# | service1 | 4 | site1 | site1 | site2 | site3 | site4 | site5 |
# | 2 | site2 service 1 | 4 | 2 | | | 4 |
# | 4 | site5
serv= master.pivot_table(values='quantity', index=un, columns='site', aggfunc=np.sum)
print(serv)
serv.to_csv('test2-output.csv')
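No answer is recorded for this question, so the following is only a sketch of one possible fix. The empty result is probably because `index=un` passes a Series (keyed by description and holding arrays of prices) rather than column names, so it cannot be aligned with the rows of `master`. Grouping on both `description` and `net_price` gives one row per description/price pair instead. The `quantity` values below are invented, since the example data does not show them.

```python
import pandas as pd

master = pd.DataFrame({
    'description': ['product 1', 'product 2', 'product 3',
                    'product 1', 'product 2', 'product 4'],
    'net_price':   [9, 7, 6, 12, 7, 10],
    'site':        ['USCA', 'USCA', 'USCA', 'USNY', 'USNY', 'USBP'],
    'quantity':    [1, 2, 3, 4, 5, 6],  # hypothetical values
})

# Using both columns as the index keeps 'product 1' at 9 and at 12 as separate rows.
serv = master.pivot_table(values='quantity',
                          index=['description', 'net_price'],
                          columns='site', aggfunc='sum')
print(serv)
```

If the tables come straight from a PDF extraction, the `quantity` and `net_price` columns may arrive as strings and need `pd.to_numeric` before they can be summed.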
I want to search a column for each month of the year. The column holds dates formatted like "01-Jan-2018", and I want to find how many times "Jan-2018" appears in it, i.e. count the occurrences and plot them on a bar graph. The graph should show the quantities for "Jan-2018", "Feb-2018", etc., so there should be 12 bars, maybe using count or sum. I am pulling the data from a CSV using pandas and Python.
I have tried printing it to the console with some success, but I am confused about the correct way to search on only part of the date.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import csv
import seaborn as sns
data = pd.read_csv(r'C:\Users\rmond\Downloads\PS_csvFile1.csv', error_bad_lines=False, encoding="ISO-8859-1", skiprows=6)
cols = data.columns
cols = cols.map(lambda x: x.replace(' ', '_') if isinstance(x, (str)) else x)
data.columns = cols
print(data.groupby('Case_Date').mean().plot(kind='bar'))
I am expecting a bar graph that shows the total quantity for each month, so there should be 12 bars. But I am not sure how to search the column 12 times, each time looking only at the data for one month, i.e. ignoring the day and matching only on the month and year.
IIUC, this is what you need.
Let's work with the below dataframe as input dataframe.
date
0 1/31/2018
1 2/28/2018
2 2/28/2018
3 3/31/2018
4 4/30/2018
5 5/31/2018
6 6/30/2018
7 6/30/2018
8 7/31/2018
9 8/31/2018
10 9/30/2018
11 9/30/2018
12 9/30/2018
13 9/30/2018
14 10/31/2018
15 11/30/2018
16 12/31/2018
The lines of code below plot the count for each month as a bar graph. Once a column is a datetime object, many operations become much easier and its contents are much more flexible, so you don't need to search for the month name as a string.
df['date'] = pd.to_datetime(df['date'])
df['my']=df.date.dt.strftime('%b-%Y')
counts = df.groupby('my', sort=False).size()  # rows per month, in order of first appearance
ax = counts.plot(kind='bar', rot=90)  # tick labels come from the index of counts
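A variant of the approach above, sketched for the "01-Jan-2018" format from the question: converting to monthly periods keeps the months in calendar order, and reindexing against a full year gives 12 entries even when a month has no rows. The sample dates and the 2018 range are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample in the question's "01-Jan-2018" format.
dates = pd.Series(['01-Jan-2018', '15-Jan-2018', '03-Feb-2018',
                   '20-Apr-2018', '21-Apr-2018', '22-Apr-2018'])

parsed = pd.to_datetime(dates, format='%d-%b-%Y')

# Count rows per calendar month.
counts = parsed.dt.to_period('M').value_counts()

# Make sure every month of the (assumed) year is present, with 0 where empty.
all_months = pd.period_range('2018-01', '2018-12', freq='M')
counts = counts.reindex(all_months, fill_value=0)

# Relabel as "Jan-2018" style strings for display.
counts.index = counts.index.strftime('%b-%Y')
print(counts)
```

`counts.plot(kind='bar')` would then draw one bar per month, including empty months.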
Output
I believe my question can be solved with a loop but I haven't been able to create such. I have a data sample which looks like this
sample data
And I would like to have dataframe that would be organised by the year:
result data
I tried the pivot function: I created a year column with df['year'] = df.index.year and then reshaped with pivot, but this populates only the first year column because of the index.
I have managed to do this type of reshaping manually, but with several years of data it is a time-consuming solution. Here is example code for the manual solution:
mydata1 = pd.DataFrame()
mydata2 = pd.DataFrame()
mydata3 = pd.DataFrame()
mydata1['1'] = df['data'].iloc[160:664]
mydata2['2'] = df['data'].iloc[2769:3273]
mydata3['3'] = df['data'].iloc[5583:6087]
mydata1.reset_index(drop=True, inplace=True)
mydata2.reset_index(drop=True, inplace=True)
mydata3.reset_index(drop=True, inplace=True)
mydata = pd.concat([mydata1, mydata2, mydata3],axis=1, ignore_index=True)
mydata.columns = ['78','88','00','05']
Welcome to StackOverflow! I think I understood what you were asking for from your question, but please correct me if I'm wrong. Basically, you want to reshape your current pandas.DataFrame using a pivot. I set up a sample dataset and solved the problem in the following way:
import pandas as pd
#test set
df = pd.DataFrame({'Index':['2.1.2000','3.1.2000','3.1.2001','4.1.2001','3.1.2002','4.1.2002'],
'Value':[100,101,110,111,105,104]})
#create a year column for yourself
#by splitting on '.' and selecting year element.
df['Year'] = df['Index'].str.split('.', expand=True)[2]
#pivot your table
pivot = pd.pivot_table(df, index=df.index, columns='Year', values='Value')
#now, in my pivoted test set there should be unwanted null values showing up so
#we can apply another function that drops null values in each column without losing values in other columns
pivot = pivot.apply(lambda x: pd.Series(x.dropna().values))
Result on my end
| Year | 2000 | 2001 | 2002 |
|------|------|------|------|
| 0 | 100 | 110 | 105 |
| 1 | 101 | 111 | 104 |
Hope this solves your problem!
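For data that already has a DatetimeIndex, as the question's `df.index.year` suggests, another option is to number the rows within each year and pivot on that position. This is a sketch on made-up data; the column names `data`, `year` and `pos` are assumptions.

```python
import pandas as pd

# Hypothetical data with a DatetimeIndex.
idx = pd.to_datetime(['2000-01-02', '2000-01-03', '2001-01-03',
                      '2001-01-04', '2002-01-03', '2002-01-04'])
df = pd.DataFrame({'data': [100, 101, 110, 111, 105, 104]}, index=idx)

df['year'] = df.index.year
# Position of each row within its own year: 0, 1, 2, ...
df['pos'] = df.groupby('year').cumcount()

# Rows line up by position, so every year column starts at row 0.
wide = df.pivot(index='pos', columns='year', values='data')
print(wide)
```

Because the rows are aligned by position within the year, no gaps need to be dropped afterwards; years with fewer rows simply end in NaN.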
I am rather new to Pandas and am currently running into a problem when trying to insert a Dataframe inside a Dataframe.
What I want to do:
I have multiple simulations and corresponding signal files, and I want all of them in one big DataFrame. So I want a DataFrame that holds all my simulation parameters and also my signals as nested DataFrames. It should look something like this:
SimName | Date | Parameter 1 | Parameter 2 | Signal 1 | Signal 2 |
Name 1 | 123 | XYZ | XYZ | DataFrame | DataFrame |
Name 2 | 456 | XYZ | XYZ | DataFrame | DataFrame |
Where SimName is my index for the big DataFrame and every entry in Signal 1 and Signal 2 is an individual DataFrame.
My idea was to implement it like this:
big_DataFrame['Signal 1'].loc['Name 1']
But this results in a ValueError:
Incompatible indexer with DataFrame
Is it possible to have such nested DataFrames in Pandas?
Nico
The 'pointers' referred to at the end of ns63sr's answer could be implemented as a class, e.g...
Definition:
class df_holder:
    def __init__(self, df):
        self.df = df
Set:
df.loc[0,'df_holder'] = df_holder(df)
Get:
df.loc[0].df_holder.df
The docs say that only Series can be within a DataFrame. However, passing DataFrames seems to work as well. Here is an example, assuming that none of the columns is in a MultiIndex:
import pandas as pd
signal_df = pd.DataFrame({'X': [1,2,3],
'Y': [10,20,30]} )
big_df = pd.DataFrame({'SimName': ['Name 1','Name 2'],
'Date':[123 , 456 ],
'Parameter 1':['XYZ', 'XYZ'],
'Parameter 2':['XYZ', 'XYZ'],
'Signal 1':[signal_df, signal_df],
'Signal 2':[signal_df, signal_df]} )
big_df.loc[0,'Signal 1']
big_df.loc[0,'Signal 1']['X']
This results in:
out1: X Y
0 1 10
1 2 20
2 3 30
out2: 0 1
1 2
2 3
Name: X, dtype: int64
In case nested dataframes are not properly working, you may implement some sort of pointers that you store in big_df that allow you to access the signal dataframes stored elsewhere.
Instead of big_DataFrame['Signal 1'].loc['Name 1'] you should use
big_DataFrame.loc['Name 1','Signal 1']
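If nesting turns out to be awkward, one alternative (not from the answers above, just a sketch) is to keep the parameters in one DataFrame and stack the signals into a second one with a MultiIndex keyed by simulation name. All names and values below are hypothetical.

```python
import pandas as pd

# Simulation parameters, one row per simulation.
params = pd.DataFrame(
    {'Date': [123, 456], 'Parameter 1': ['XYZ', 'XYZ']},
    index=pd.Index(['Name 1', 'Name 2'], name='SimName'),
)

# One signal DataFrame per simulation (made-up values).
signals = {
    'Name 1': pd.DataFrame({'X': [1, 2, 3], 'Y': [10, 20, 30]}),
    'Name 2': pd.DataFrame({'X': [4, 5, 6], 'Y': [40, 50, 60]}),
}

# Passing a dict to concat uses its keys as the outer index level.
all_signals = pd.concat(signals, names=['SimName', 'sample'])

# Selecting on the outer level returns that simulation's signal table.
sig1 = all_signals.loc['Name 1']
print(sig1)
```

Both tables share the `SimName` key, so parameters and signals can still be looked up together, while the signals stay in ordinary numeric columns.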
I've got data in the below format, and what I'm trying to do is to:
1) loop over each value in Region
2) For each region, plot a time series of the aggregated (across Category) sales number.
Date |Region |Category | Sales
01/01/2016| USA| Furniture|1
01/01/2016| USA| Clothes |0
01/01/2016| Europe| Furniture|2
01/01/2016| Europe| Clothes |0
01/02/2016| USA| Furniture|3
01/02/2016| USA|Clothes|0
01/02/2016| Europe| Furniture|4
01/02/2016| Europe| Clothes|0 ...
The plot should look like the attached (done in excel).
However, if I try to do it in Python using the below, I get multiple charts when I really want all the lines to show up in one figure.
Python code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv(r'C:\Users\wusm\Desktop\Book7.csv')
plt.legend()
for index, group in df.groupby(["Region"]):
    group.plot(x='Date',y='Sales',title=str(index))
plt.show()
Short of reformatting the data, could anyone advise on how to get the graphs in one figure please?
You can use pivot_table:
df = df.pivot_table(index='Date', columns='Region', values='Sales', aggfunc='sum')
print (df)
Region Europe USA
Date
01/01/2016 2 1
01/02/2016 4 3
or groupby + sum + unstack:
df = df.groupby(['Date', 'Region'])['Sales'].sum().unstack(fill_value=0)
print (df)
Region Europe USA
Date
01/01/2016 2 1
01/02/2016 4 3
and then DataFrame.plot
df.plot()
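If the loop from the question is preferred, the lines can also be drawn on one figure by creating a single Axes and passing it to every `plot` call. The sketch below assumes the dates are month-first and uses a non-interactive backend only so that it runs without a display; the sample rows mirror the question's table.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, assumed here so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'Date': ['01/01/2016'] * 4 + ['01/02/2016'] * 4,
    'Region': ['USA', 'USA', 'Europe', 'Europe'] * 2,
    'Category': ['Furniture', 'Clothes'] * 4,
    'Sales': [1, 0, 2, 0, 3, 0, 4, 0],
})
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')  # month-first is an assumption

# Aggregate across Category first, so each region has one value per date.
totals = df.groupby(['Region', 'Date'], as_index=False)['Sales'].sum()

fig, ax = plt.subplots()
for region, group in totals.groupby('Region'):
    # Reusing the same ax puts every region's line on one chart.
    group.plot(x='Date', y='Sales', ax=ax, label=region)
ax.set_ylabel('Sales')
```

Without the `ax=ax` argument, each `group.plot` call opens its own figure, which is what produces the multiple charts described in the question.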