I have two CSV files, each with the columns id, product name, normal price, discount price, card price, and discount. The two files contain prices from different time periods, so I want to plot a graph to see how the prices behave for each product. Both files contain products that are repeated, and I would like those repeated products to appear in the graph.
I tried converting them to a DataFrame, but without success.
This is what the CSV file looks like:
It would be helpful to see your second .csv file too.
From this answer.
Assuming both your files have the same column names, use Pandas' pandas.concat() method to make one dataframe from a list of dataframes.
Then, simply plot a graph of your new dataframe as usual.
import pandas as pd

listOfDFs = []
df1 = pd.read_csv('csv1.csv')
df2 = pd.read_csv('csv2.csv')
listOfDFs.append(df1)
listOfDFs.append(df2)

# Stack the two dataframes vertically and renumber the rows
dfBoth = pd.concat(listOfDFs, axis=0, ignore_index=True)
dfBoth.plot()
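If the goal is one line per repeated product rather than plotting every numeric column at once, you can label each row with the file it came from and pivot. This is only a sketch; it assumes column names like 'product name' and 'normal price' taken from the description above, and that each product appears at most once per file.

# Label which snapshot each row came from (the two files are from different time periods)
dfBoth['snapshot'] = ['csv1'] * len(df1) + ['csv2'] * len(df2)

# Keep only the products that appear in both files
repeated = dfBoth[dfBoth.duplicated('product name', keep=False)]

# One line per product: x-axis is the snapshot, y-axis the normal price
repeated.pivot(index='snapshot', columns='product name', values='normal price').plot(marker='o')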
I am working with time series data that is broken down by several geographical levels (province and health region). The data are further broken down by age group. I would like to aggregate the data up to the health region level, and sometimes up to the province level, depending on the need.
Sample data:
df = pd.DataFrame({'Date':['1/1/2022','1/1/2022','1/1/2022','1/1/2022','1/1/2022','1/1/2022','1/1/2022','1/1/2022','1/1/2022'],
'province':[35,35,35,35,35,35,35,35,35,],'health region':[1,1,1,2,2,2,3,3,3],
'age group':[1,2,3,1,2,3,1,2,3],'cases':[6,1,9,7,9,0,4,2,2]})
Desired output when aggregating up to the health region level:
df_hr = pd.DataFrame({'Date':['1/1/2022','1/1/2022','1/1/2022'],
'province':[35,35,35],
'health region':[1,2,3],'cases':[16,16,8]})
When I use the following code:
df = df.groupby('health region').sum()
I lose dates.
When I try:
df = df.groupby(['health region','Date']).sum()
or
df = df.groupby(['health region', df['Date'].dt.date]).sum()
I get an error ValueError: mixed datetimes and integers in passed array
Is there an easy way to do this? I was thinking of using a loop to split the data by health region, save the unique dates, aggregate, merge the dates back, and then stack the health regions back together, but I'd rather not if there is a simpler way.
Thank you,
i.
I figured out a way to do this.
First, convert Date to datetime and set it as the index:
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
Then group by health region and resample the data by day:
df1 = df.groupby('health region').resample('D').sum()
And reset the index:
df1 = df1.reset_index()
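If you later need the data aggregated up to the province level instead, the same pattern applies. A minimal sketch based on the sample data above (Date is already the datetime index at this point):

# Group by province instead of health region and sum cases per day
df_prov = df.groupby('province').resample('D')['cases'].sum().reset_index()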
I have weather data covering a variety of years. From this I am trying to find the long-term averages of the temperature for each month, which I achieved using the following:
mh3 = mh3.groupby([mh3.index.month, mh3.index.day])
mh3 = mh3[['dry_bulb_tmp_mean', 'global_horiz_radiation']].mean()
However, in doing this, I get two indexes for the dataframe (both month and day, which is fine). The issue is that both of these index levels are assigned the name date. Is there a way to manually set a name for each? This causes problems later in my code when I need to do some data analysis by month. Thank you.
The name of the Series you group by becomes the name of the corresponding index level, so rename them in the grouper.
mh3 = mh3.groupby([mh3.index.month.rename('month'), mh3.index.day.rename('day')])
Or, if you don't want to type as much, you can build the grouper with a list comprehension, using getattr and renaming each level to the attribute name.
import pandas as pd
df = pd.DataFrame(index=pd.date_range('2010-01-01', freq='4H', periods=10),
                  data={'col1': range(10)})
grpr = [getattr(df.index, attr).rename(attr) for attr in ['month', 'day']]
df.groupby(grpr).sum()
#            col1
# month day
# 1     1      15
#       2      30
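If the groupby has already been done with the duplicate-named groupers, another option is to rename the index levels afterwards. A small sketch using the same example dataframe as above:

# Rename both levels of the resulting MultiIndex after the fact
out = df.groupby([df.index.month, df.index.day]).sum()
out = out.rename_axis(['month', 'day'])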
I have the following dataframe, which is first grouped by invoice cycle, with a count of the clinics in each invoice cycle added.
Dataframe after groupby function
I used the following code to add the count column:
df5 = df4.groupby(['Invoice Cycle', 'Clinic']).size().reset_index(name='counts')
and then this code to set the index and get the dataframe, as seen in the image above:
df5 = df5.set_index(['Invoice Cycle','Clinic'])
Now, I want to reorder the Invoice Cycle column so the dates are in order 16-Dec, 17-Jan, 17-Feb, 17-Mar, etc.
Then I want to reorder the clinics within each invoice cycle so that the clinic with the highest count is at the top and the clinic with the lowest count is at the bottom.
Because the values in Invoice Cycle are strings rather than timestamps, I can't seem to do either of the above tasks.
Is there a way to reorder the dataframe?
You can create a function to transform the date-string into a datetime format:
import pandas as pd
import datetime
def str_to_date(string):
    # Parse strings like '17-Jan'; the day defaults to the first of the month
    date = datetime.datetime.strptime(string, '%y-%b')
    return date
df['Invoice Cycle'] = df['Invoice Cycle'].apply(str_to_date)
# now you can sort correctly: dates ascending, highest count first within each cycle
df = df.sort_values(['Invoice Cycle', 'counts'], ascending=[True, False])
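As an aside, pandas can parse these strings directly without a helper function; this is just a sketch assuming the same '16-Dec' style values:

# Vectorized alternative: parse the '%y-%b' strings in one call (day defaults to the 1st)
df['Invoice Cycle'] = pd.to_datetime(df['Invoice Cycle'], format='%y-%b')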
I am trying to combine two separate data series of one-minute data to create a ratio, and then build Open High Low Close (OHLC) files for that ratio for the entire day. I bring in the two time series and create the associated dataframes with pandas. The time series have missing data, so I create a datetime variable in each file and then merge the files on that variable using pd.merge. Up to this point everything is going fine.
Next I group the data by date using groupby. I then feed the grouped data to a for loop that calculates the OHLC and writes it into a new dataframe, one row per day. However, the newly populated dataframe uses the date (from the grouping) as its index, and the sorting is off. The index data looks like this (even when sorted):
01/29/2013
01/29/2014
01/29/2015
12/2/2013
12/2/2014
In short, the sorting is being done only on the month, not on the whole date as a date, so it isn't chronological. My goal is to get it sorted by date so that it is chronological. Perhaps I need to create a new column in the dataframe referencing the index (I'm not sure how), or maybe there is a way to tell pandas that the index is a date, not just a value? I tried various sort approaches, including sort_index, but since the dates are the index and don't seem to be treated as dates, the sort functions order by month regardless of the year, and my output file ends up out of order. More generally, I am not sure how to reference or manipulate the index of a pandas dataframe, so any related material would be useful.
Thank you
Years later...
This fixes the problem.
df is a dataframe
import pandas as pd
df.index = pd.to_datetime(df.index)  # convert the index to a DatetimeIndex
df = df.sort_index()                 # sort by the converted index
This should get the sorting back into chronological order.
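If to_datetime guesses the day/month order wrong for index values like 12/2/2013, you can be explicit about the format; a small sketch assuming month/day/year strings as shown above:

# Parse the index with an explicit month/day/year format, then sort chronologically
df.index = pd.to_datetime(df.index, format='%m/%d/%Y')
df = df.sort_index()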