I have not found a post here that quite fits my situation. I have a CSV file where the first column is the year (e.g. 2002), the second column is the month name (January), and the third column is a month code (1 for January, etc.). I would like to import it into a pandas DataFrame and create a full date index. The following code gives an error, but it should show what I am trying to do.
The error is:
ValueError: time data '2002' does not match format '%Y%b%d'
Note: I do not have a Day of the month in the data so I have to use the first or last, unless there is a way to index on just Year and Month with no Day.
The data looks like this:
Year Month Month Code District Code District
2002 January 1 1 Albany
2002 January 1 2 Allegany
2002 January 1 3 Broome
2002 January 1 4 Cattaraugus
2002 January 1 5 Cayuga
The code that does not work:
file = 'C:/.../snap.csv'
parser = lambda date: pd.datetime.strptime(date, '%Y%b%d')
# create dataframe from csv file
snapdf = pd.read_csv(file, parse_dates = [0,1], date_parser = parser)
# NOTE: I also tried parse_dates = [0,2] but same error
I altered the data to make it more obvious how the dates get parsed into the DataFrame:
Year,Month,Month Code,District Code,District
2002,January,1,1,Albany
2004,February,1,2,Allegany
2005,December,1,3,Broome
2007,August,1,4,Cattaraugus
2001,March,1,5,Cayuga
Using the parse_dates parameter with columns 1-3:
>>> with open('snap.csv') as f:
...     df = pd.read_csv(f, parse_dates={'Date': [0,1,2]}, index_col='Date')
>>> df
District Code District
Date
2002-01-01 1 Albany
2004-02-01 2 Allegany
2005-12-01 3 Broome
2007-08-01 4 Cattaraugus
2001-03-01 5 Cayuga
>>> df.District
Date
2002-01-01 Albany
2004-02-01 Allegany
2005-12-01 Broome
2007-08-01 Cattaraugus
2001-03-01 Cayuga
Name: District, dtype: object
I finally got this running and it was actually quite simple in the end.
snapdf["DateIndex"] = pd.to_datetime(snapdf['Year'].astype(str), format='%Y')
This takes the value from the Year column of the DataFrame (stored as int), converts it to a string, and parses it into a datetime in a new column DateIndex. Because there is no month or day information, pandas defaults both to 1.
So 2017 in the Year column becomes 2017-01-01.
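Since the original goal was a date index built from both Year and Month, here is a minimal sketch of one way to fold the month code in as well (assuming the columns are named Year and Month Code as in the sample; the file path is hypothetical):
import pandas as pd

snapdf = pd.read_csv('snap.csv')  # hypothetical path
# Assemble a monthly date from Year and Month Code; the day defaults to 1
snapdf['DateIndex'] = pd.to_datetime({'year': snapdf['Year'], 'month': snapdf['Month Code'], 'day': 1})
snapdf = snapdf.set_index('DateIndex')
# If a month-level index with no day is preferred, a PeriodIndex also works:
# snapdf.index = snapdf.index.to_period('M')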
Related
I have looked for solutions but can't find any that point me in the right direction; hopefully someone here can help. I have a stock price dataset with a frequency of Month Start. I am trying to get an output where the calendar years are the column names and the day and month form the index (there will only be 12 rows since the data is monthly). The rows will be filled with the stock prices corresponding to that year and month. Unfortunately I have no code, since I have looked at for loops, groupby, etc. but can't figure this one out.
You might want to split the date into month and year and apply a pivot:
s = pd.to_datetime(df.index)
out = (df
       .assign(year=s.year, month=s.month)
       .pivot_table(index='month', columns='year', values='Close', fill_value=0)
      )
Output:
year 2003 2004
month
1 0 2
2 0 3
3 0 4
12 1 0
Used input:
df = pd.DataFrame({'Close': [1, 2, 3, 4]},
                  index=['2003-12-01', '2004-01-01', '2004-02-01', '2004-03-01'])
You need multiple steps to do that.
First split your column into the right format.
Then convert this column into two separate columns.
Then pivot the table accordingly.
import pandas as pd
# Test Dataframe
df = pd.DataFrame({'Date': ['2003-12-01', '2004-01-01', '2004-02-01', '2004-12-01'],
                   'Close': [6.661, 7.053, 6.625, 8.999]})
# Split datestring into list of form [year, month-day]
df = df.assign(Date=df.Date.str.split(pat='-', n=1))
# Separate date-list column into two columns
df = pd.DataFrame(df.Date.to_list(), columns=['Year', 'Date'], index=df.index).join(df.Close)
# Pivot the table
df = df.pivot(columns='Year', index='Date')
df
Output:
Close
Year 2003 2004
Date
01-01 NaN 7.053
02-01 NaN 6.625
12-01 6.661 8.999
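If the extra 'Close' level in the column header is unwanted, it can be dropped afterwards; a small optional step (not part of the original answer) on the pivoted df above:
# Drop the top 'Close' level from the MultiIndex columns, leaving just the years
df.columns = df.columns.droplevel(0)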
I have a dataset of property prices, currently listed by SALE_DATE, and I'd like to be able to count them by year. The dataset looks like this -
SALE_DATE COUNTY SALE_PRICE
0 2010-01-01 Dublin 343000.0
1 2010-01-03 Laois 185000.0
2 2010-01-04 Dublin 438500.0
3 2010-01-04 Meath 400000.0
4 2010-01-04 Kilkenny 160000.0
This is the code I've tried -
by_year = property_prices['SALE_DATE'] = pd.to_datetime(property_prices['SALE_DATE'])
print(by_year)
I think I'm close, but as a complete noob it's quite frustrating!
Thank you for any help you can provide; this site has been awesome so far in finding little tips and tricks to make my life easier
You are close. As you did, you can use pd.to_datetime to convert SALE_DATE to a datetime column. Then group by the year, using dt.year, which extracts the year from the datetime, and call size() on the result, which returns the number of rows in each group, i.e. the number of sales per year.
property_prices['SALE_DATE'] = pd.to_datetime(property_prices['SALE_DATE'])
property_prices.groupby(property_prices.SALE_DATE.dt.year).size()
Which prints:
SALE_DATE
2010 5
dtype: int64
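For completeness, a one-line alternative that gives the same per-year counts (assuming SALE_DATE has already been converted to datetime as above):
# Count rows per year, sorted by year
property_prices['SALE_DATE'].dt.year.value_counts().sort_index()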
import pandas as pd
sample_dict = {'Date':['2010-01-11', '2020-01-22', '2010-03-12'], 'Price':[1000,2000,3500]}
df = pd.DataFrame(sample_dict)
# Create a 'year' column from the Date column
df['year'] = df.apply(lambda row: row.Date.split('-')[0], axis=1)
# Group by the new 'year' column
df1 = df.groupby('year')
# Show the first row in each group
df1.first()
Output:
            Date  Price
year
2010  2010-01-11   1000
2020  2020-01-22   2000
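Note that first() shows the first row per year rather than a count; to get the per-year count the question asks for, the same grouping works with size() (a small addition to the snippet above):
# Number of sales per year
df.groupby('year').size()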
I need to get the month-end balance from a series of entries.
Sample data:
date contrib totalShrs
0 2009-04-23 5220.00 10000.000
1 2009-04-24 10210.00 20000.000
2 2009-04-27 16710.00 30000.000
3 2009-04-30 22610.00 40000.000
4 2009-05-05 28909.00 50000.000
5 2009-05-20 38409.00 60000.000
6 2009-05-28 46508.00 70000.000
7 2009-05-29 56308.00 80000.000
8 2009-06-01 66108.00 90000.000
9 2009-06-02 78108.00 100000.000
10 2009-06-12 86606.00 110000.000
11 2009-08-03 95606.00 120000.000
The output would look something like this:
2009-04-30 40000
2009-05-31 80000
2009-06-30 110000
2009-07-31 110000
2009-08-31 120000
Is there a simple Pandas method?
I don't see how I can do this with something like a groupby?
Or would I have to do something like iterrows, find all the monthly entries, order them by date and pick the last one?
Thanks.
Use pd.Grouper with GroupBy.last, forward fill missing values with ffill, and turn the result back into a DataFrame with Series.reset_index:
#if necessary
#df['date'] = pd.to_datetime(df['date'])
df = df.groupby(pd.Grouper(freq='M', key='date'))['totalShrs'].last().ffill().reset_index()
#alternative
#df = df.resample('M', on='date')['totalShrs'].last().ffill().reset_index()
print(df)
date totalShrs
0 2009-04-30 40000.0
1 2009-05-31 80000.0
2 2009-06-30 110000.0
3 2009-07-31 110000.0
4 2009-08-31 120000.0
The following gives you the information you want, i.e. end-of-month values, though the format is not exactly what you asked for:
df['month'] = df['date'].str.split('-', expand=True)[1]  # split date column to get month column
newdf = pd.DataFrame(columns=df.columns)  # create a new dataframe for output
grouped = df.groupby('month')  # get grouped values
for g in grouped:  # for each group, get last row
    gdf = pd.DataFrame(data=g[1])
    newdf.loc[len(newdf), :] = gdf.iloc[-1, :]  # fill new dataframe with last row obtained
newdf = newdf.drop('date', axis=1)  # drop date column, since month column is there
print(newdf)
Output:
contrib totalShrs month
0 22610 40000 04
1 56308 80000 05
2 86606 110000 06
3 95606 120000 08
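One caveat (an assumption about the general case: data spanning more than one year): grouping on the month string alone would merge April 2009 with April 2010. Grouping on a year-month key keeps them apart; a minimal sketch on the same df:
# Use the 'YYYY-MM' prefix of the date string as the group key,
# so the same month in different years lands in separate groups
df['ym'] = df['date'].str[:7]
newdf = df.sort_values('date').groupby('ym', as_index=False).last()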
I have a problem combining the day, month, and year columns to form a date column in a DataFrame using pd.to_datetime. Below is the DataFrame I'm working on; the columns Yr, Mo, Dy represent year, month, and day.
data = pd.read_table("/ALabs/wind.data",sep = ',')
Yr Mo Dy RPT VAL ROS KIL
61 1 1 15.04 14.96 13.17 9.29
61 1 2 14.71 NaN 10.83 6.50
61 1 3 18.50 16.88 12.33 10.13
So I've tried the code below, and I get the following error: "to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing"
Step 1:
data['Date'] = pd.to_datetime(data[['Yr','Mo','Dy']],format="%y-%m-%d")
Next I tried converting the Yr, Mo, Dy columns from int64 to datetime64 and assigning the results to new columns Year, Month, Day. Now when I combine the columns I get the proper date format in the new Date column, and I have no idea why this gives the desired result.
Step 2:
data['Year'] = pd.to_datetime(data.Yr,format='%y').dt.year
data['Month'] = pd.to_datetime(data.Mo,format='%m').dt.month
data['Day'] = pd.to_datetime(data.Dy,format ='%d').dt.day
data['Date'] =pd.to_datetime(data[['Year','Month','Day']])
Result:
Yr Mo Dy Year Month Day Date
61 1 1 2061 1 1 2061-01-01
61 1 2 2061 1 2 2061-01-02
61 1 3 2061 1 3 2061-01-03
61 1 4 2061 1 4 2061-01-04
But if I try the same method with the column names changed from Year, Month, Day to Yy, Mh, Di, as in the code below, I get the same error: "to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing"
Step 3:
data['Yy'] = pd.to_datetime(data.Yr,format='%y').dt.year
data['Mh'] = pd.to_datetime(data.Mo,format='%m').dt.month
data['Di'] = pd.to_datetime(data.Dy,format ='%d').dt.day
data['Date'] =pd.to_datetime(data[['Yy','Mh','Di']])
What I want to know:
1) Is it mandatory for the column names to be 'Year', 'Month', and 'Day' when using pd.to_datetime this way?
2) Is there any other way to combine the columns in a DataFrame to form a date, rather than this long method?
3) Is this error specific to Python version 3.7?
4) Where have I gone wrong in Step 1 and Step 3, and why do I get the desired output when I follow Step 2?
As per the pandas.to_datetime docs, the column names really do have to be 'year', 'month', and 'day' (capitalizing the first letter is fine). This explains the answer to all of your questions, and no it has nothing to do with the version of Python (and all recent versions of Pandas behave the same).
Also, you should be aware that when you call to_datetime with a sequence of columns (as opposed to a single column/list of strings), the format argument seems to be ignored. So you'll need to normalize your years (to 1961 or 2061 or 1061, etc) yourself. Here's a complete example of how you could do the conversion in a single line:
import pandas as pd
from io import StringIO
d = '''Yr Mo Dy RPT VAL ROS KIL
61 1 1 15.04 14.96 13.17 9.29
61 1 2 14.71 NaN 10.83 6.50
61 1 3 18.50 16.88 12.33 10.13 '''
data = pd.read_csv(StringIO(d), sep=r'\s+')
dtime = pd.to_datetime({k:data[c]+v for c,k,v in zip(('Yr', 'Mo', 'Dy'), ('Year', 'Month', 'Day'), (1900, 0, 0))})
print(dtime)
Output:
0 1961-01-01
1 1961-01-02
2 1961-01-03
dtype: datetime64[ns]
In the above code, instead of adding the appropriately named columns to the DataFrame data, I just made a dict where the key/value pairs are e.g. ('Year', data['Yr'] + 1900), i.e. 1900 is added to the years.
You can simplify the dict comprehension a bit by just adding 1900 directly to the appropriate column:
data['Yr'] += 1900
dtime = pd.to_datetime({k:data[c] for c,k in zip(('Yr', 'Mo', 'Dy'), ('year', 'month', 'day'))})
This code will have the same output as the previous.
I don't really know how Python deals with years, but the reason it wasn't working had to do with the fact that you were using the year 61.
This works for me
d = {'Day': ["1", "2", "3"],
     'Month': ["1", "1", "1"],
     'Year': ["61", "61", "61"]}
df = pd.DataFrame(data=d)
df["Year"] = pd.to_numeric(df["Year"])
df.Year = df.Year+2000
df['Date'] = pd.to_datetime(df[['Year','Month','Day']], format='%Y%m%d')
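If the two-digit years here actually refer to the 1900s rather than the 2000s (an assumption about the original wind dataset), add 1900 instead of 2000 before assembling the date; a minimal variant starting from the same dict d:
# Assumption: '61' means 1961, not 2061
df = pd.DataFrame(data=d)
df['Year'] = pd.to_numeric(df['Year']) + 1900
df['Date'] = pd.to_datetime(df[['Year', 'Month', 'Day']])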
I have an xls file whose title rows look like this:
AZ-Phoenix CA-Los Angeles CA-San Diego
YEAR PHXR LXXR SDXR
January 1987 59.33 54.67 77
February 1987 59.65 54.89 78
March 1987 59.99 55.16 79
Note: in the first header row there is no name above the YEAR column. How can I set the name YEAR for it?
I have tried: data_xls = data_xls.rename(columns={data_xls.columns[0]: 'YEAR'})
But it replaces the AZ-Phoenix header with YEAR, and I can't change the column I actually want to.
How can I change this?
YEAR is not a column, it's an index here.
try:
df.index.name = 'foobar'
or:
df = df.reset_index()
in this case, YEAR will become a normal column and you can rename it.
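Putting the two steps together (a minimal sketch, assuming the year ended up as the unnamed row index when the file was read):
# Name the unnamed row index, then promote it to a regular column called YEAR
df.index.name = 'YEAR'
df = df.reset_index()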
If the text you pasted reflects the layout of the Excel file (a two-row header), you can handle this in a couple of ways:
You can treat the two header rows as a multilevel (MultiIndex) column header:
df = pandas.read_excel('test.xlsx', header=[0,1])
This results in a DataFrame which you can index like this:
df['AZ-Phoenix']
resulting in
YEAR PHXR
1987-01-01 59.33
1987-02-01 59.65
1987-03-01 59.99
If the first row is actually superfluous (it seems the airport is already uniquely identified by the three-letter airport code with an R tacked on), you can simply skip that row when importing and get a "flatter" DataFrame:
df_flat = pandas.read_excel('test.xlsx', skiprows=1, index_col=0)
This gives you something you can index by the airport code:
df_flat.PHXR
gives
YEAR
1987-01-01 59.33
1987-02-01 59.65
1987-03-01 59.99
Name: PHXR, dtype: float64
By using rename_axis:
df.rename_axis('YEAR', axis=1).rename_axis('YEAR', axis=0)  # change YEAR to whatever name you need :)
Out[754]:
YEAR value timestamp
YEAR
0 1 2017-10-03 14:33:52
1 Water 2017-10-04 14:33:48
2 1 2017-10-04 14:33:45
3 1 2017-10-05 14:33:30
4 Water 2017-10-03 14:33:40
5 Water 2017-10-05 14:32:13
6 Water 2017-10-04 14:32:01
7 1 2017-10-03 14:31:55