How to access columns after creating multiIndex

How to access columns after creating multiIndex - python

I am making my DataFrame like this:
influenza_data = pd.DataFrame(data, columns = ['year', 'week', 'weekly_infections'])
and then I create MultiIndex from year and week columns:
influenza_data = influenza_data.set_index(['year', 'week'])
If I have MultiIndex my DataFrame looks like this:
weekly_infections
year week
2009 40 6600
41 7100
42 7700
43 8300
44 8600
... ...
2019 10 8900
11 6200
12 5500
13 3900
14 3300
and data_influenza.columns:
Index(['weekly_infections'], dtype='object')
The problem I have is that I can't access year and week columns now.
If I try data_influenza['week'] or year I get KeyError: 'week'. I can only do data_influenza.weekly_infections and that returns a whole DataFrame
I know if I remove multiIndex I can easily access them but why can't I data_influenza.year or week with MultiIndex? I specified columns when I was creating Dataframe

As Pandas documentation says here, you can access MultiIndex object levels by get_level_values(index) method:
influenza_data.index.get_level_values(0) # year
influenza_data.index.get_level_values(1) # week
Obviously, the index parameter represents the order of indices.

Related

Segmenting a dataframe based on date with datetime column. Python

I have a dataframe named data as shown:
Date
Value
X1
X2
X3
2019-05
15
23
65
98
2019-05
34
132
56
87
2019-06
23
66
90
44
The date column is in a datetime format of Year-Month starting from 2017-01 and the most recent 2022-05. I want to write a piece that will extract data into separate data frames. More specifically I want one data frame to contain the rows of the current month and year (2022-05), another dataframe to contain to data from the previous month (2022-04), and one more dataframe that contains data from 12 months ago (2021-05).
For my code I have the following:
import pandas as pd
from datetime import datetime as dt
data = pd.read_csv("results.csv")
current = data[data["Date"].dt.month == dt.now().month]
My results show the following:
Date
Value
X1
X2
X3
2019-05
15
23
65
98
2019-05
34
132
56
87
2020-05
23
66
90
44
So I get the rows that match the current month but I need it to match the current year I assumed I could just add multiple conditions to match current month and current year but that did not seem to work for me.
Also is there a way to write the code in such a way where I can extract the data from the previous month and the previous year based on what the current month-year is? My first thought was to just take the month and subtract 1 and do the same thing for the year and if the current year is in January I would just write an exception to subtract 1 from both the month and year for the previous month analysis.

Split your DF into a dict of DFs and then access the one you want directly by the date (YYYY-MM).
index
Date
Value
X1
X2
X3
0
2017-04
15
23
65
98
1
2019-05
34
132
56
87
2
2021-06
23
66
90
44
dfs = {x:df[df.Date == x ] for x in df.Date.unique()}
dfs['2017-04']
index
Date
Value
X1
X2
X3
0
2017-04
15
23
65
98

You can do this with a groupby operation, which is a first-class kind of thing in tabular analysis (sql/pandas). In this case, you want to group by both year and month, creating dataframes:
dfs = []
for key, group_df in df.groupby([df.Date.dt.year, df.Date.dt.month]):
dfs.append(group_df)
dfs will have the subgroups you want.
One thing: it's worth noting that there is a performance cost breaking dataframes into list items. Its just as likely that you could do whatever processing comes next directly in the groupby statement, such as df.groupby(...).X1.transform(sum) for example.

count values of groups by consecutive days

i have data with 3 columns: date, id, sales.
my first task is filtering sales above 100. i did it.
second task, grouping id by consecutive days.
index
date
id
sales
0
01/01/2018
03
101
1
01/01/2018
07
178
2
02/01/2018
03
120
3
03/01/2018
03
150
4
05/01/2018
07
205
the result should be:
index
id
count
0
03
3
1
07
1
2
07
1
i need to do this task without using pandas/dataframe, but right now i can't imagine from which side attack this problem.
just for effort, i tried the suggestion for a solution here count consecutive days python dataframe
but the ids' not grouped.
here is my code:
data = df[df['sales'] >= 100]
data['date'] = pd.to_datetime(data['date']).dt.date
s = data.groupby('id').date.diff().dt.days.ne(1).cumsum()
new_frame = data.groupby(['id', s]).size().reset_index(level=0, drop=True)
it is very importent that the "new_frame" will have "count" column, because after i need to count id by range of those count days in "count" column. e.g. count of id's in range of 0-7 days, 7-12 days etc. but it's not part of my question.
Thank you a lot

Your code is close, but need some fine-tuning, as follows:
data = df[df['sales'] >= 100]
data['date'] = pd.to_datetime(data['date'], dayfirst=True)
df2 = data.sort_values(['id', 'date'])
s = df2.groupby('id').date.diff().dt.days.ne(1).cumsum()
new_frame = df2.groupby(['id', s]).size().reset_index(level=1, drop=True).reset_index(name='count')
Result:
print(new_frame)
id count
0 3 3
1 7 1
2 7 1
Summary of changes:
As your dates are in dd/mm/yyyy instead of the default mm/dd/yyyy, you have to specify the parameter dayfirst=True in pd.to_datetime(). Otherwise, 02/01/2018 will be regarded as 2018-02-01 instead of 2018-01-02 as expected and the day diff with adjacent entries will be around 30 as opposed to 1.
We added a sort step to sort by columns id and date to simplify the later grouping during the creation of the series s.
In the last groupby() the code reset_index(level=0, drop=True) should be dropping level=1 instead. Since, level=0 is the id fields which we want to keep.
In the last groupby(), we do an extra .reset_index(name='count') to make the Pandas series change back to a dataframe and also name the new column as count.

Python convert daily column into a new dataframe with year as index week as column

I have a data frame with the date as an index and a parameter. I want to convert column data into a new data frame with year as row index and week number as column name and cells showing weekly mean value. I would then use this information to plot using seaborn https://seaborn.pydata.org/generated/seaborn.relplot.html.
My data:
df =
data
2019-01-03 10
2019-01-04 20
2019-05-21 30
2019-05-22 40
2020-10-15 50
2020-10-16 60
2021-04-04 70
2021-04-05 80
My code:
# convert the df into weekly averaged dataframe
wdf = df.groupby(df.index.dt.strftime('%Y-%W')).data.mean()
wdf
2019-01 15
2019-26 35
2020-45 55
2021-20 75
Expected answer: Column name denotes the week number, index denotes the year. Cell denotes the sample's mean in that week.
01 20 26 45
2019 15 NaN 35 NaN # 15 is mean of 1st week (10,20) in above df
2020 NaN NaN NaN 55
2021 NaN 75 NaN NaN
No idea on how to proceed further to get the expected answer from the above-obtained solution.

You can use a pivot_table :
df['year'] = pd.DatetimeIndex(df['date']).year
df['week'] = pd.DatetimeIndex(df['date']).week
final_table = pd.pivot_table(data = df,index= 'year', columns = 'week',values = 'data', aggfunc = np.mean )

You need to use two dimensions in the groupby, and then unstack to lay out the data as a grid:
df.groupby([df.index.year,df.index.week])['data'].mean().unstack()

How to fill missing observations in time series data

I have a hypothetical time series data frame, which is with some missing observations (assumption is that the data frame shall include all dates and corresponding values and for all the dates in the year). As we can see in the head and tail information, there are certain dates and corresponding values are missing (30th Jan & 29th Dec). There would be many more such in the data frame, sometimes missing observations for more than one consecutive date.
Is there a way that missing dates are detected and inserted into the data frame and corresponding values are filled with a rolling average with one week window (this would naturally increase the number of rows of the data frame)? Appreciate inputs.
df.head(3)
date value
0 2020-01-28 25
1 2020-01-29 32
2 2020-01-31 45
df.tail(3)
date value
3 2020-12-28 24
4 2020-12-30 35
5 2020-12-31 37
df.dtypes
date object
value int64
dtype: object

Create DaetimeIndex, then use DataFrame.asfreq with rolling and mean:
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date').asfreq('d').rolling('7D').mean()
If need all values by year use:
df['date'] = pd.to_datetime(df['date'])
idx = pd.date_range('2020-01-01','2020-12-31')
df = df.set_index('date').reindex(idx).rolling('7D').mean()

using pd.to_datetime to form a date by taking input of year,months,day present in different columns in a data frame

I have a problem combining the day month year columns to form a date column in a data frame using pd.to_datetime. Below is the dataframe,i'm working on and the columns Yr,Mo,Dy represents as year month day.
data = pd.read_table("/ALabs/wind.data",sep = ',')
Yr Mo Dy RPT VAL ROS KIL
61 1 1 15.04 14.96 13.17 9.29
61 1 2 14.71 NaN 10.83 6.50
61 1 3 18.50 16.88 12.33 10.13
So I've tried the below code, i get the following error: "to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing"
Step 1:
data['Date'] = pd.to_datetime(data[['Yr','Mo','Dy']],format="%y-%m-%d")
Next I've tried converting Yr,Mo,Dy column datatype to datetime64 from int64 and assigning the result to new columns Year,Month,Day. Now when i try to combine the columns i'm getting the proper date format in the new date column and i have no idea how i got the desired result.
Step2:
data['Year'] = pd.to_datetime(data.Yr,format='%y').dt.year
data['Month'] = pd.to_datetime(data.Mo,format='%m').dt.month
data['Day'] = pd.to_datetime(data.Dy,format ='%d').dt.day
data['Date'] =pd.to_datetime(data[['Year','Month','Day']])
Result:
Yr Mo Dy Year Month Day Date
61 1 1 2061 1 1 2061-01-01
61 1 2 2061 1 2 2061-01-02
61 1 3 2061 1 3 2061-01-03
61 1 4 2061 1 4 2061-01-04
But if i try doing the same method by changing the column names from year,month, day to Yy,Mh,Di like in the below code. I get the same error "to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing"
Step3:
data['Yy'] = pd.to_datetime(data.Yr,format='%y').dt.year
data['Mh'] = pd.to_datetime(data.Mo,format='%m').dt.month
data['Di'] = pd.to_datetime(data.Dy,format ='%d').dt.day
data['Date'] =pd.to_datetime(data[['Yy','Mh','Di']])
What i want to know :
1) Is it mandatory for the argument names to be 'Year' 'Month' 'Day' if we are using pd.to_datetime?
2) Is there any other way to combine the columns in a dataframe to form a date, rather than using this long method?
3) Is this error specific only to python version 3.7??
4)where have i gone wrong in Step 1 and Step 3 ,and why i'm getting o/p when i follow step 2 ?

As per the pandas.to_datetime docs, the column names really do have to be 'year', 'month', and 'day' (capitalizing the first letter is fine). This explains the answer to all of your questions, and no it has nothing to do with the version of Python (and all recent versions of Pandas behave the same).
Also, you should be aware that when you call to_datetime with a sequence of columns (as opposed to a single column/list of strings), the format argument seems to be ignored. So you'll need to normalize your years (to 1961 or 2061 or 1061, etc) yourself. Here's a complete example of how you could do the conversion in a single line:
import pandas as pd
d = '''Yr Mo Dy RPT VAL ROS KIL
61 1 1 15.04 14.96 13.17 9.29
61 1 2 14.71 NaN 10.83 6.50
61 1 3 18.50 16.88 12.33 10.13 '''
data = pd.read_csv(pd.compat.StringIO(d), sep='\s+')
dtime = pd.to_datetime({k:data[c]+v for c,k,v in zip(('Yr', 'Mo', 'Dy'), ('Year', 'Month', 'Day'), (1900, 0, 0))})
print(dtime)
Output:
0 1961-01-01
1 1961-01-02
2 1961-01-03
dtype: datetime64[ns]
In the above code, instead of adding the appropriately named columns to the dataframe data, I just made a dict where the key/value pairs are eg. ('Year', data['Yr']), and also added 1900 to the years.
You can simplify the dict comprehension a bit by just adding 1900 directly to the appropriate column:
data['Yr'] += 1900
dtime = pd.to_datetime({k:data[c] for c,k in zip(('Yr', 'Mo', 'Dy'), ('year', 'month', 'day'))})
This code will have the same output as the previous.

I don't really know how Python deals with years, but the reason it wasn't working had to do with the fact that you were using the year 61.
This works for me
d = {'Day': ["1", "2","3"],
'Month': ["1", "1","1"],
'Year':["61", "61", "61"]}
df = pd.DataFrame(data=d)
df["Year"] = pd.to_numeric(df["Year"])
df.Year = df.Year+2000
df['Date'] = pd.to_datetime(df[['Year','Month','Day']], format='%Y%m%d')

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to access columns after creating multiIndex - python

As Pandas documentation says here, you can access MultiIndex object levels by get_level_values(index) method: influenza_data.index.get_level_values(0) # year influenza_data.index.get_level_values(1) # week Obviously, the index parameter represents the order of indices.

Related

Segmenting a dataframe based on date with datetime column. Python

count values of groups by consecutive days

Python convert daily column into a new dataframe with year as index week as column

How to fill missing observations in time series data

using pd.to_datetime to form a date by taking input of year,months,day present in different columns in a data frame

Categories

Resources