Index automatically replaced when creating a new column out of it - python

I am currently doing some exercises on a Pandas DataFrame indexed by date (DD/MM/YY). The current exercise requires me to groupby on Year to obtain average yearly values.
So what I tried to do was to create a new column containing only the years extracted from the DataFrame's index. The code I wrote is:
data["year"] = [t.year for t in data.index]
data.groupby("year").mean()
but for some reason, the new column "year" ends up replacing the previous full-date index (which does not even become a "standard" column; it simply disappears), which came as a bit of a surprise. How can this be?
Thanks in advance!

For a sample dataframe:
value
2016-01-22 1
2014-02-02 2
2014-08-27 3
2016-01-23 4
2014-03-18 5
If you would like to keep your logic, select the column you want to take the mean() of and use transform(), which returns a result aligned to the original index, then assign it back to the value column:
data['year'] = [t.year for t in data.index]
data['value'] = data.groupby('year')['value'].transform('mean')
Yields:
value year
2016-01-22 2.500000 2016
2014-02-02 3.333333 2014
2014-08-27 3.333333 2014
2016-01-23 2.500000 2016
2014-03-18 3.333333 2014
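
For what it's worth, the full-date index does not really vanish from data itself; groupby("year").mean() returns a new DataFrame indexed by the group key, and the date index is simply not carried into that result. A minimal sketch, reconstructing the sample data above, that groups on the index directly without creating a helper column:
import pandas as pd

# Reconstruction of the sample frame from the question.
data = pd.DataFrame(
    {"value": [1, 2, 3, 4, 5]},
    index=pd.to_datetime(["2016-01-22", "2014-02-02", "2014-08-27",
                          "2016-01-23", "2014-03-18"]),
)

# groupby(...).mean() builds a *new* frame indexed by the group key,
# which is why the dates are absent from the result.
yearly = data.groupby(data.index.year).mean()
print(yearly)
#          value
# 2014  3.333333
# 2016  2.500000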

Related

Mapping two rows to one row in pandas

I have a dataframe a with 14 rows and another dataframe comp1sum with 7 rows. a has a date column covering 7 days at 12-hour intervals, which makes 14 rows, and comp1sum has a column with 7 days.
(Screenshots of the comp1sum and a dataframes were included here.)
I want to map 2 rows of dataframe a to a single row of comp1sum, so that one day of dataframe a is mapped to one day of comp1sum.
I have the following code for that
j = 0
for i in range(0, 7):
    a.loc[i, 'comp1_sum'] = comp_sum.iloc[j]['comp1sum']
    a.loc[i, 'comp2_sum'] = comp_sum.iloc[j]['comp2sum']
    j = j + 1
And its output is
dt_truncated comp1_sum
3 2015-02-01 00:00:00 142.0
10 2015-02-01 12:00:00 144.0
12 2015-02-03 00:00:00 145.0
2 2015-02-05 00:00:00 141.0
14 2015-02-05 12:00:00 NaN
The code maps the days from comp1sum based on the index of a, not on the dates of a. I want 2015-02-01 00:00:00 to have the value 139.0, 2015-02-02 00:00:00 to have the value 140.0, and so on, such that increasing dates have increasing values.
I am not able to map it that way. Please help.
Edit 1: As per @Ssayan's answer, I am getting this error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-255-77e55efca5f9> in <module>
3 # use the sorted index to iterate through the sorted dataframe
4 for i, idx in enumerate(a.index):
----> 5 a.loc[idx, 'comp1_sum'] = b.iloc[i//2]['comp1sum']
6 a.loc[idx,'comp2_sum'] = b.iloc[i//2]['comp2sum']
IndexError: single positional indexer is out-of-bounds
Your issue is that your DataFrame a is not sorted by date, so index 0 does not match the earliest date. When you use loc it selects by index label, not by the row's position in the table, which is why sorting alone does not fix it: the labels travel with the rows.
One way out is to sort the DataFrame a by date and then use the sorted positions to apply the values in the order you need.
# sort the dataframe by date
a = a.sort_values("dt_truncated")

# use the sorted index to iterate through the sorted dataframe
for i, idx in enumerate(a.index):
    a.loc[idx, 'val_1'] = b.iloc[i//2]['val1']
    a.loc[idx, 'val_2'] = b.iloc[i//2]['val2']
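
As a sanity check, here is a self-contained sketch of the same approach on hypothetical toy frames (the names and values are stand-ins, not the asker's data):
import pandas as pd

# Stand-ins: `a` has 14 rows at 12-hour intervals (shuffled to mimic the
# unsorted frame), `b` has 7 daily summary rows.
a = pd.DataFrame(
    {"dt_truncated": pd.date_range("2015-02-01", periods=14, freq="12h")}
).sample(frac=1, random_state=0)
b = pd.DataFrame({"comp1sum": range(139, 146),
                  "comp2sum": range(239, 246)})

# Sort by date so row order matches day order, then map each pair of
# rows of `a` (i // 2) onto a single row of `b`.
a = a.sort_values("dt_truncated")
for i, idx in enumerate(a.index):
    a.loc[idx, "comp1_sum"] = b.iloc[i // 2]["comp1sum"]
    a.loc[idx, "comp2_sum"] = b.iloc[i // 2]["comp2sum"]
Both 2015-02-01 rows end up with 139, both 2015-02-02 rows with 140, and so on. Incidentally, the IndexError in Edit 1 means i // 2 ran past the last row of b, i.e. the real a has more than twice as many rows as b; comparing the two lengths before the loop would catch that.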

Is there any Python code to help me replace the year of every date with 2022

I have a pandas dataframe column named disbursal_date which is a datetime:
disbursal_date
2009-01-28
2008-01-03
2008-07-15
and so on...
I want to keep the date and month part and replace the years by 2022 for all values.
I tried using df['disbursal_date'].map(lambda x: x.replace(year=2022)) but this didn't work for me.
Both map and apply can run a Python function element-wise on a Series, so the choice between them is not the real problem here.
What matters is that the dtype is a pandas datetime and not object or string: on strings, x.replace(year=2022) fails because str.replace does not accept a year keyword.
Below is sample code that works; it replaces the year with 2022.
import pandas as pd

df = pd.DataFrame(['2009-01-28', '2008-01-03', '2008-07-15'],
                  columns=['disbursal_old'])
# Convert from object (string) dtype to pandas datetimes first.
df['disbursal_old'] = df['disbursal_old'].astype('datetime64[ns]')
df['disbursal_new'] = df['disbursal_old'].apply(lambda x: x.replace(year=2022))
print(df['disbursal_new'])
0 2022-01-28
1 2022-01-03
2 2022-07-15
Name: disbursal_new, dtype: datetime64[ns]
The below code gives the difference between the years.
df['disbursal_diff_year'] = df['disbursal_new'].dt.year - df['disbursal_old'].dt.year
print(df)
disbursal_old disbursal_new disbursal_diff_year
0 2009-01-28 2022-01-28 13
1 2008-01-03 2022-01-03 14
2 2008-07-15 2022-07-15 14
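If you would rather avoid the per-element lambda, here is a sketch of an alternative using pd.DateOffset; note that the singular keyword year replaces the component, while the plural years would add to it:
import pandas as pd

df = pd.DataFrame({"disbursal_old": pd.to_datetime(
    ["2009-01-28", "2008-01-03", "2008-07-15"])})

# DateOffset(year=2022) *sets* the year; pandas may still apply it
# element-wise under the hood (a PerformanceWarning is possible).
df["disbursal_new"] = df["disbursal_old"] + pd.DateOffset(year=2022)
One behavioral difference worth knowing: on a Feb 29 input, x.replace(year=2022) raises ValueError because 2022 is not a leap year, whereas DateOffset should clamp such dates to Feb 28.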

How to set the column name for first column in python pandas? Weird error

I have an xls file whose title rows look like this:
AZ-Phoenix CA-Los Angeles CA-San Diego
YEAR PHXR LXXR SDXR
January 1987 59.33 54.67 77
February 1987 59.65 54.89 78
March 1987 59.99 55.16 79
Note: the first header row has no name above the "YEAR" column. How do I set the name of this column to YEAR?
I have tried: data_xls = data_xls.rename(columns={data_xls.columns[0]: 'YEAR'})
But it replaces the AZ-Phoenix header with YEAR, and I can't target the column I actually want.
How do I change this?
YEAR is not a column, it's an index here.
try:
df.index.name = 'foobar'
or:
df = df.reset_index()
in this case, YEAR will become a normal column and you can rename it.
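A minimal sketch of both suggestions, on a hypothetical two-row frame:
import pandas as pd

# A frame whose first "column" is really the index.
df = pd.DataFrame({"PHXR": [59.33, 59.65]},
                  index=["January 1987", "February 1987"])

df.index.name = "YEAR"       # name the index in place...
df = df.reset_index()        # ...or promote it to a regular column
print(df.columns.tolist())   # ['YEAR', 'PHXR']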
If the text you pasted reflects the layout of the Excel file, you can handle this in a couple of ways:
You can treat the two header lines as a multilevel (MultiIndex) column header:
df = pandas.read_excel('test.xlsx', header=[0,1])
This results in a DataFrame which you can index like this:
df['AZ-Phoenix']
resulting in
YEAR PHXR
1987-01-01 59.33
1987-02-01 59.65
1987-03-01 59.99
If the first row is actually superfluous (it seems each series is already uniquely identified by the three-letter code with an R tacked on), you can simply ignore that row when importing and get a "flatter" DataFrame:
df_flat = pandas.read_excel('test.xlsx', skiprows=1, index_col=0)
This gives you something you can index by the airport code:
df_flat.PHXR
gives
YEAR
1987-01-01 59.33
1987-02-01 59.65
1987-03-01 59.99
Name: PHXR, dtype: float64
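Since test.xlsx is not available here, a self-contained sketch that rebuilds the same two-row header structure in memory and shows the MultiIndex behavior (the tuples below simply mirror the sample):
import pandas as pd

columns = pd.MultiIndex.from_tuples([
    ("AZ-Phoenix", "PHXR"),
    ("CA-Los Angeles", "LXXR"),
    ("CA-San Diego", "SDXR"),
])
df = pd.DataFrame(
    [[59.33, 54.67, 77], [59.65, 54.89, 78], [59.99, 55.16, 79]],
    index=pd.Index(["January 1987", "February 1987", "March 1987"],
                   name="YEAR"),
    columns=columns,
)

print(df["AZ-Phoenix"])  # selects the PHXR sub-column via the city level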
By using rename_axis (shown here with the axis keyword; the positional form is deprecated in recent pandas):
df.rename_axis('YEAR', axis=1).rename_axis('YEAR', axis=0)  # change 'YEAR' to whatever you need :)
Out[754]:
YEAR value timestamp
YEAR
0 1 2017-10-03 14:33:52
1 Water 2017-10-04 14:33:48
2 1 2017-10-04 14:33:45
3 1 2017-10-05 14:33:30
4 Water 2017-10-03 14:33:40
5 Water 2017-10-05 14:32:13
6 Water 2017-10-04 14:32:01
7 1 2017-10-03 14:31:55

Adjusting Monthly Time Series Data in Pandas

I have a pandas DataFrame like the one shown in the EDIT below.
As you can see, the data corresponds to end-of-month values. The problem is that the end-of-month date is not the same for all the columns. (The underlying reason is that the last trading day of the month does not always coincide with the end of the month.)
Currently, the end of January 2016 has two rows, "2016-01-29" and "2016-01-31"; it should be just one row. For example, the end of January 2016 should just be 451.1473, 1951.218, 1401.093 for Index A, Index B, and Index C.
Another point: even though each row almost always corresponds to end-of-month data, the data might not be clean and could conceivably include middle-of-the-month values for a random column. In that case, I don't want to make any adjustment, so that any prior data-collection error is caught.
What is the most efficient way to achieve this goal?
EDIT:
Index A Index B Index C
DATE
2015-03-31 2067.89 1535.07 229.1
2015-04-30 2085.51 1543 229.4
2015-05-29 2107.39 NaN NaN
2015-05-31 NaN 1550.39 229.1
2015-06-30 2063.11 1534.96 229
2015-07-31 2103.84 NaN 228.8
2015-08-31 1972.18 1464.32 NaN
2015-09-30 1920.03 1416.84 227.5
2015-10-30 2079.36 NaN NaN
2015-10-31 NaN 1448.39 227.7
2015-11-30 2080.41 1421.6 227.6
2015-12-31 2043.94 1408.33 227.5
2016-01-29 1940.24 NaN NaN
2016-01-31 NaN 1354.66 227.5
2016-02-29 1932.23 1355.42 227.3
So, in this case, I need to combine the rows at the end of 2015-05, 2015-10, and 2016-01. However, the rows for 2015-07 and 2015-08 simply do not have data, so I would like to leave those as NaN while merging the end-of-month rows for 2015-05, 2015-10, and 2016-01. Hopefully this gives more insight into what I am trying to do.
You can use:
df = df.groupby(pd.Grouper(freq='M')).ffill()
df = df.resample('M').last()
to create a new DatetimeIndex ending on the last day of each month and take the last available data point for each month. The grouped ffill() ensures that, for columns missing data on the month's last available date, the prior value within that month is used. (Older pandas wrote this as pd.TimeGrouper('M') and resample(rule='M', how='last'), both of which have since been removed.)
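A runnable sketch on a hypothetical slice of the sample, to make the behavior concrete (in pandas >= 2.2 the month-end alias is 'ME' rather than 'M'):
import pandas as pd

# 2016-01 is split across the last trading day (01-29) and the
# calendar month end (01-31), as in the question.
df = pd.DataFrame(
    {"Index A": [2043.94, 1940.24, None, 1932.23],
     "Index B": [1408.33, None, 1354.66, 1355.42]},
    index=pd.to_datetime(["2015-12-31", "2016-01-29",
                          "2016-01-31", "2016-02-29"]),
)

# Forward-fill within each calendar month, then keep one row per month.
merged = df.groupby(pd.Grouper(freq="M")).ffill().resample("M").last()
print(merged)
#             Index A  Index B
# 2015-12-31  2043.94  1408.33
# 2016-01-31  1940.24  1354.66
# 2016-02-29  1932.23  1355.42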

Pandas: Count Unique Values after Resample

I'm just getting started with Pandas and am trying to combine: Grouping my data by date, and counting the unique values in each group.
Here's what my data looks like:
User, Type
Datetime
2014-04-15 11:00:00, A, New
2014-04-15 12:00:00, B, Returning
2014-04-15 13:00:00, C, New
2014-04-20 14:00:00, D, New
2014-04-20 15:00:00, B, Returning
2014-04-20 16:00:00, B, Returning
2014-04-20 17:00:00, D, Returning
And here's what I would like to get to: Resample the datetime index to the day (which I can do), and also count the unique users for each day.
I'm not interested in the 'Type' column yet.
Day, Unique Users
2014-04-15, 3
2014-04-20, 2
I'm trying df.user.resample('D', how='count').unique but it doesn't seem to give me the right answer.
You don't need to resample to get the desired output; you can get by with just a groupby on the date:
print(df.groupby(df.index.date)['User'].nunique())
2014-04-15 3
2014-04-20 2
dtype: int64
And then, if you want, you can resample to fill in the time-series gaps after counting the unique users:
cnt = df.groupby(df.index.date)['User'].nunique()
cnt.index = pd.to_datetime(cnt.index)
print(cnt.resample('D').asfreq())
2014-04-15 3
2014-04-16 NaN
2014-04-17 NaN
2014-04-18 NaN
2014-04-19 NaN
2014-04-20 2
Freq: D, dtype: float64
I came across the same problem. Resample worked for me with nunique. The nice thing about resample is that it makes it very simple to change the sampling rate (to hours or minutes, for example) and that the timestamp is kept as the index.
df.user.resample('D').nunique()
I was running into the same problem. Karl D's answer works for some kinds of reindexing -- on date, for example -- but what if you want the index to be
Jan 2014
Feb 2014
March 2014
and then plot it as a time series? Here's what I did (in modern pandas the aggregation is a method call on the resampler rather than an argument):
df.user.resample('M').apply(lambda x: x.nunique())
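
Pulling these together, a sketch with current pandas syntax on data mirroring the question's sample:
import pandas as pd

df = pd.DataFrame(
    {"User": ["A", "B", "C", "D", "B", "B", "D"]},
    index=pd.to_datetime([
        "2014-04-15 11:00", "2014-04-15 12:00", "2014-04-15 13:00",
        "2014-04-20 14:00", "2014-04-20 15:00", "2014-04-20 16:00",
        "2014-04-20 17:00",
    ]),
)

# Daily unique users; the gap days show up with a count of 0 because
# resample keeps a regular DatetimeIndex.
daily = df["User"].resample("D").nunique()

# Monthly unique users (use "ME" instead of "M" in pandas >= 2.2).
monthly = df["User"].resample("M").nunique()
print(daily)
# 2014-04-15    3
# 2014-04-16    0
# 2014-04-17    0
# 2014-04-18    0
# 2014-04-19    0
# 2014-04-20    2
# Freq: D, Name: User, dtype: int64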
