How can I call individual columns in pandas? - python

I understand this must be a very basic question, but oddly enough, the resources I've read online don't seem very clear on how to do the following:
How can I index specific columns in pandas?
For example, after importing data from a csv, I have a pandas Series object with individual dates, along with a corresponding dollar amount for each date.
Now, I'd like to group the dates by month (and add their respective dollar amounts for that given month). I plan to create an array where the indexing column is the month, and the next column is the sum of dollar amounts for that month. I would then take this array and create another pandas Series object out of it.
My problem is that I can't seem to call the specific columns from the current pandas series object I have.
Any help?
Edited to add:
from pandas import Series
from matplotlib import pyplot
import numpy as np
series = Series.from_csv('FCdata.csv', header=0, parse_dates = [0], index_col =0)
print(series)
pyplot.plot(series)
pyplot.show() # this successfully plots the x-axis (date) with the y-axis (dollar amount)
dates = series[0] # this is where I try to call the column, but with no luck
This is what my data looks like in a csv:
Dates Amount
1/1/2015 112
1/2/2015 65
1/3/2015 63
1/4/2015 125
1/5/2015 135
1/6/2015 56
1/7/2015 55
1/12/2015 84
1/27/2015 69
1/28/2015 133
1/29/2015 52
1/30/2015 91
2/2/2015 144
2/3/2015 114
2/4/2015 59
2/5/2015 95
2/6/2015 72
2/9/2015 73
2/10/2015 119
2/11/2015 133
2/12/2015 128
2/13/2015 141
2/17/2015 105
2/18/2015 107
2/19/2015 81
2/20/2015 52
2/23/2015 135
2/24/2015 65
2/25/2015 58
2/26/2015 144
2/27/2015 102
3/2/2015 95
3/3/2015 98

You are reading the CSV file into a Series. A Series is a one-dimensional object - there are no columns associated with it. You see the index of that Series (dates) and probably think that's another column but it's not.
You have two alternatives: you can convert it to a DataFrame (by calling either reset_index() or to_frame()), or keep using it as a Series.
series.resample('M').sum()
Out:
Dates
2015-01-31 1040
2015-02-28 1927
2015-03-31 193
Freq: M, Name: Amount, dtype: int64
Since you already have an index formatted as date, grouping by month with resample is very straightforward so I'd suggest keeping it as a Series.
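As a rough sketch for newer pandas versions, where Series.from_csv is no longer available (it was removed in pandas 1.0), the same kind of monthly totals can be reproduced with read_csv. The inline csv_text is a stand-in for FCdata.csv and assumes comma-separated values. Grouping on index.to_period('M') is used here instead of resample only because it does not depend on the version-specific month-end alias.

```python
import io
import pandas as pd

# Stand-in for FCdata.csv (assumed comma-separated); a few rows from the question.
csv_text = """Dates,Amount
1/1/2015,112
1/2/2015,65
2/2/2015,144
2/3/2015,114
3/2/2015,95
3/3/2015,98
"""

df = pd.read_csv(io.StringIO(csv_text))
df["Dates"] = pd.to_datetime(df["Dates"], format="%m/%d/%Y")

# Selecting one column of a date-indexed DataFrame gives a Series like the one in the question.
series = df.set_index("Dates")["Amount"]

# Sum the amounts per calendar month.
monthly = series.groupby(series.index.to_period("M")).sum()
print(monthly)
```

The result is indexed by monthly periods (2015-01, 2015-02, ...) rather than month-end timestamps.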
However, you can always convert it to a DataFrame with:
df = series.to_frame('Value')
Now, you can use df['Value'] to select that single column. Resampling can be done on both the DataFrame and the Series:
df.resample('M').sum()
Out:
Value
Dates
2015-01-31 1040
2015-02-28 1927
2015-03-31 193
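To make the difference between the two conversions concrete, here is a small sketch. The three-row series is made-up sample data in the same shape as the question's.

```python
import pandas as pd

# Made-up sample in the same shape as the question's data.
series = pd.Series(
    [112, 65, 144],
    index=pd.DatetimeIndex(["2015-01-01", "2015-01-02", "2015-02-02"], name="Dates"),
    name="Amount",
)

# to_frame keeps the dates as the index and adds one named column.
df = series.to_frame("Value")
print(df.columns.tolist())

# reset_index turns the dates into an ordinary column next to the values.
flat = series.reset_index()
print(flat.columns.tolist())

# Both the dates and the amounts are now selectable by name.
dates = flat["Dates"]
amounts = flat["Amount"]
```

If you want the dates as a real column, reset_index is the one to use; to_frame leaves them in the index.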
And you can access the index if you want to use that in plotting:
series.index # df.index would return the same
Out:
DatetimeIndex(['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04',
'2015-01-05', '2015-01-06', '2015-01-07', '2015-01-12',
'2015-01-27', '2015-01-28', '2015-01-29', '2015-01-30',
'2015-02-02', '2015-02-03', '2015-02-04', '2015-02-05',
'2015-02-06', '2015-02-09', '2015-02-10', '2015-02-11',
'2015-02-12', '2015-02-13', '2015-02-17', '2015-02-18',
'2015-02-19', '2015-02-20', '2015-02-23', '2015-02-24',
'2015-02-25', '2015-02-26', '2015-02-27', '2015-03-02',
'2015-03-03'],
dtype='datetime64[ns]', name='Dates', freq=None)
Note: For basic time-series charts, you can use pandas' plotting tools.
df.plot() draws the daily values as a line chart, and df.resample('M').sum().plot() does the same for the monthly totals.

Related

Plotting timestamps as string vs. datetime object

I am observing some strange behavior when plotting a column from my Pandas dataframe.
My data looks like the following:
value timestamp
0 67 2023-01-04T05:17:49+00:00
1 67 2023-01-04T05:27:57+00:00
2 66 2023-01-04T05:28:01+00:00
3 81 2023-01-04T05:28:19+00:00
4 67 2023-01-04T05:33:02+00:00
... ... ...
102 73 2023-01-04T22:22:04+00:00
103 76 2023-01-04T22:26:59+00:00
104 77 2023-01-04T22:27:12+00:00
105 75 2023-01-04T22:27:13+00:00
106 73 2023-01-04T22:32:35+00:00
Now comes the fun part:
I convert the timestamp column to a pandas datetime object with: df['timestamp_dt'] = pd.to_datetime(df['timestamp']). I create a new column so you can observe the behavior.
I then plot both of the columns with seaborn:
#Plot where timestamp is "String"
sns.lineplot(data=df, x="timestamp", y="bpm")
plt.show()
#Plot where timestamp is "datetime object"
sns.lineplot(data=df, x="timestamp_dt", y="bpm")
plt.show()
As you can see, converting the timestamps into datetime object results in a weird behavior of the graph.
Why is this and how can I convert the timestamps and have a normal looking graph?
I have looked at the following solution, which suggests adding a format. However I was not able to solve it that way.
Why does changing "Date" column to datetime ruin graph?

How do I delete specific dataframe rows based on a columns value?

I have a pandas dataframe with 2 columns ("Date" and "Gross Margin"). I want to delete rows based on what the value in the "Date" column is. This is my dataframe:
Date Gross Margin
0 2021-03-31 44.79%
1 2020-12-31 44.53%
2 2020-09-30 44.47%
3 2020-06-30 44.36%
4 2020-03-31 43.69%
.. ... ...
57 2006-12-31 49.65%
58 2006-09-30 52.56%
59 2006-06-30 49.86%
60 2006-03-31 46.20%
61 2005-12-31 40.88%
I want to delete every row where the "Date" value doesn't end with "12-31". I read some similar posts on this and the pandas.drop() function seemed to be the solution, but I haven't figured out how to use it for this specific case.
Please leave any suggestions as to what I should do.
You can try the following code, which matches on the month and day.
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
df = df[df['Date'].dt.strftime('%m-%d') == '12-31']
Assuming the dates are strings formatted as year-month-day, keep only the rows that end with "12-31":
df = df[df['Date'].str.endswith('12-31')]
If the dates are using a consistent format, you can do it like this:
df = df[df['Date'].str.contains("12-31", regex=False)]
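A small sketch to compare a string-based filter with a datetime-based one on a few rows from the question. The frame is rebuilt by hand here, and Date is assumed to hold strings. Both filters keep the year-end rows, which is what the question asks for.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2021-03-31", "2020-12-31", "2020-09-30", "2006-12-31", "2005-12-31"],
    "Gross Margin": ["44.79%", "44.53%", "44.47%", "49.65%", "40.88%"],
})

# String approach: keep only rows whose date text ends with "12-31".
by_string = df[df["Date"].str.endswith("12-31")]

# Datetime approach: parse first, then test month and day explicitly.
dates = pd.to_datetime(df["Date"], format="%Y-%m-%d")
by_datetime = df[(dates.dt.month == 12) & (dates.dt.day == 31)]

print(by_string)
```

The datetime version does not depend on how the dates are written, as long as they parse.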

How to fill missing observations in time series data

I have a hypothetical time series data frame with some missing observations (the assumption is that the data frame should include every date in the year, each with a corresponding value). As the head and tail output shows, certain dates and their values are missing (30th Jan and 29th Dec). There may be many more such gaps in the data frame, sometimes spanning more than one consecutive date.
Is there a way to detect the missing dates, insert them into the data frame, and fill the corresponding values with a rolling average over a one-week window (this would naturally increase the number of rows of the data frame)? I'd appreciate any input.
df.head(3)
date value
0 2020-01-28 25
1 2020-01-29 32
2 2020-01-31 45
df.tail(3)
date value
3 2020-12-28 24
4 2020-12-30 35
5 2020-12-31 37
df.dtypes
date object
value int64
dtype: object
Create a DatetimeIndex, then use DataFrame.asfreq with rolling and mean:
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date').asfreq('D').rolling('7D').mean()
If you need all dates in the year, use:
df['date'] = pd.to_datetime(df['date'])
idx = pd.date_range('2020-01-01','2020-12-31')
df = df.set_index('date').reindex(idx).rolling('7D').mean()
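Note that both snippets above replace every value with its rolling mean. If the goal is to keep the observed values and fill only the inserted dates, one possible variation is to use the rolling mean as the fill value. This is a sketch rather than part of the answer above, and the sample rows are taken from the question's head output.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2020-01-28", "2020-01-29", "2020-01-31"],
    "value": [25, 32, 45],
})
df["date"] = pd.to_datetime(df["date"])

# asfreq inserts the missing calendar days with NaN values.
daily = df.set_index("date").asfreq("D")

# Trailing 7-day mean; NaN entries inside a window are skipped.
weekly_mean = daily["value"].rolling("7D").mean()

# Keep the observed values and fill only the gaps.
filled = daily["value"].fillna(weekly_mean)
print(filled)
```

Here 30th Jan is inserted and filled from the two preceding observations, while the original three values are left untouched.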

How to access columns after creating multiIndex

I am making my DataFrame like this:
influenza_data = pd.DataFrame(data, columns = ['year', 'week', 'weekly_infections'])
and then I create MultiIndex from year and week columns:
influenza_data = influenza_data.set_index(['year', 'week'])
If I have MultiIndex my DataFrame looks like this:
weekly_infections
year week
2009 40 6600
41 7100
42 7700
43 8300
44 8600
... ...
2019 10 8900
11 6200
12 5500
13 3900
14 3300
and influenza_data.columns:
Index(['weekly_infections'], dtype='object')
The problem I have is that I can't access the year and week columns now.
If I try influenza_data['week'] or influenza_data['year'], I get a KeyError (for example, KeyError: 'week'). I can only do influenza_data.weekly_infections, and that returns a whole DataFrame.
I know that if I remove the MultiIndex I can easily access them, but why can't I use influenza_data.year or influenza_data.week with a MultiIndex? I specified the columns when I was creating the DataFrame.
As the pandas documentation says, you can access the levels of a MultiIndex with the get_level_values(level) method:
influenza_data.index.get_level_values(0) # year
influenza_data.index.get_level_values(1) # week
The argument is the position of the level in the MultiIndex, so 0 is year and 1 is week.
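A minimal sketch of the options, using a few made-up rows shaped like the question's data. get_level_values also accepts the level name, and reset_index turns the levels back into ordinary columns if that is more convenient.

```python
import pandas as pd

influenza_data = pd.DataFrame(
    {
        "year": [2009, 2009, 2019],
        "week": [40, 41, 10],
        "weekly_infections": [6600, 7100, 8900],
    }
).set_index(["year", "week"])

# Index levels are not columns, so read them from the index, by name or by position.
years = influenza_data.index.get_level_values("year")
weeks = influenza_data.index.get_level_values(1)

# Filter on a level without leaving the MultiIndex.
late_weeks = influenza_data[weeks >= 40]

# Or move the levels back into regular columns.
flat = influenza_data.reset_index()
print(flat["week"].tolist())
```

After set_index, year and week live in the index rather than in columns, which is why influenza_data['week'] raises a KeyError.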

Pandas group values and get mean by date range

I have a DataFrame like this
df = pd.DataFrame( data = numpy_data, columns=['value','date'])
value date
0 64.885 2018-01-11
1 74.839 2018-01-15
2 41.481 2018-01-17
3 22.027 2018-01-17
4 53.747 2018-01-18
... ... ...
514 61.017 2018-12-22
515 68.376 2018-12-21
516 79.079 2018-12-26
517 73.975 2018-12-26
518 76.923 2018-12-26
519 rows × 2 columns
And I want to plot this value vs date and I am using this
df.plot( x='date',y='value')
And I get this
The point here is that this plot has too many fluctuations and I want to smooth it. My idea is to group the values by date intervals and take the mean; for example, with 10-day intervals, take the mean of the values between July 1 and July 10 and place the point at July 5.
The long way is to get the date range, split it into N ranges with start and end dates, filter the data by date, calculate the mean, and put the result in another DataFrame.
Is there a shorter way to do that?
PS: Ignore the peaks
One thing you could do for instance is to take the rolling mean of the dataframe, using DataFrame.rolling along with mean:
df = df.set_index(df.date).drop('date', axis=1)
df.rolling(3).mean().plot()
For the example dataframe you have, directly plotting the dataframe would result in:
And after taking the rolling mean, you would have:
Here I chose a window of 3, but this will depend on how smooth you want it to be.
Based on yatu's answer:
The problem with that answer is that rolling treats the window as a number of rows, not as a span of dates. With some transformations, rolling can read the Timestamp index and use time as the window [ pandas.rolling ]
df = pd.DataFrame( data = numpy_data, columns=['value','date'])
df['date'] = pd.to_datetime(df['date'])  # convert to Timestamps so rolling can use time offsets
df = df.set_index('date')
df.sort_index(inplace=True)  # time-based windows need a sorted index
df.rolling('10D').mean().plot(ylim=(30, 100), figsize=(16, 5), grid=True)
Final results
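To see the difference between a row-count window and a time window, here is a small sketch on made-up data with an uneven gap between dates:

```python
import pandas as pd

df = pd.DataFrame(
    {"value": [10.0, 20.0, 30.0]},
    index=pd.to_datetime(["2018-01-01", "2018-01-02", "2018-01-20"]),
)

# Window of 2 rows: the last mean mixes points that are 18 days apart.
by_rows = df["value"].rolling(2).mean()

# Window of 10 days: the last point has no neighbours in range, so it keeps its own value.
by_time = df["value"].rolling("10D").mean()

print(by_rows.tolist())
print(by_time.tolist())
```

With irregularly spaced dates, the time-based window is usually closer to the "mean over N days" idea in the question.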
