Looking to clean multiple data sets in a more automated way. The current format is year as column, month as row, the number values.
Below is an example of the current format, the original data has multiple years/months.
Current Format:
Year
Jan
Feb
2022
300
200
Below is an example of how I would like the new format to look like. It combines month and year into one column and transposes the number into another column.
How would I go about doing this in excel or python? Have files with many years and multiple months.
New Format:
Date
Number
2022-01
300
2022-02
200
Check below solution. You need to extend month_df for the months, current just cater to the example.
import pandas as pd
df = pd.DataFrame({'Year':[2022],'Jan':[300],'Feb':[200]})
month_df = pd.DataFrame({'Char_Month':['Jan','Feb'], 'Int_Month':['01','02']})
melted_df = pd.melt(df, id_vars=['Year'], value_vars=['Jan', 'Feb'], var_name='Char_Month',value_name='Number')
pd.merge(melted_df, month_df,left_on='Char_Month', right_on='Char_Month').\
assign(Year=melted_df['Year'].astype('str')+'-'+month_df['Int_Month'])\
[['Year','Number']]
Output:
Related
I am doing a time series analysis. I have run the below code to generate random year in the dataframe as the original year did not have year values:
wc['Random_date'] = wc.Monthdate.apply(lambda val: f'{val} {randint(2019,2022)}')
#Generating random year from 2019 to 2022 to create ideal conditions
And now I have a dataframe that looks like this:
wc.head()
The ID column is the index currently, and I would like to generate a pivoted dataframe that looks like this:
Random_date
Count_of_ID
Jul 3 2019
2
Jul 4 2019
3
I do understand that aggregation will be needed to be done after I pivot the data, but the following code is not working:
abscount = wc.pivot(index= 'Random_date', columns= 'Random_date', values= 'ID')
Here is the ending part of the error that I see:
Please help. Thanks.
You may check with
df['Random_date'].value_counts()
If need unique count
df.reset_index().drop_duplicates('ID')['Random_date'].value_counts()
Or
df.reset_index().groupby('Random_date')['ID'].nunique()
I have a pandas dataframe with date values, however, I need to convert it from dates to text General format like in Excel, not to date string, in order to match with primary keys values in SQL, which are, unfortunately, reordered in general format. Is it possible to do it Python or the only way to convert this column to general format in Excel?
Here is how the dataframe's column looks like:
ID Desired Output
1/1/2022 44562
7/21/2024 45494
1/1/1931 11324
Yes, it's possible. The general format in Excel starts counting the days from the date 1900-1-1.
You can calculate a time delta between the dates in ID and 1900-1-1.
Inspired by this post you could do...
data = pd.DataFrame({'ID': ['1/1/2022','7/21/2024','1/1/1931']})
data['General format'] = (
pd.to_datetime(data["ID"]) - pd.Timestamp("1900-01-01")
).dt.days + 2
print(data)
ID General format
0 1/1/2022 44562
1 7/21/2024 45494
2 1/1/1931 11324
The +2 is because:
Excel starts counting from 1 instead of 0
Excel incorrectly considers 1900 as a leap year
Excel stores dates as sequential serial numbers so that they can be
used in calculations. By default, January 1, 1900 is serial number 1,
and January 1, 2008 is serial number 39448 because it is 39,447 days
after January 1, 1900.
-Microsoft's documentation
So you can just calculate (difference between your date and January 1, 1900) + 1
see How to calculate number of days between two given dates
import pandas as pd
import numpy as np
# Show the specified columns and save it to a new file
col_list= ["STATION", "NAME", "DATE", "AWND", "SNOW"]
df = pd.read_csv('Data.csv', usecols=col_list)
df.to_csv('filteredData.csv')
df['year'] = pd.DatetimeIndex(df['DATE']).year
df2016 = df[(df.year==2016)]
df_2016 = df2016.groupby(['NAME', 'DATE'])['SNOW'].mean()
df_2016.to_csv('average2016.csv')
How come my dates are not ordered correctly here? Row 12 should be on the top but it's on the bottom of May instead and same goes for row 25
The average of SNOW per NAME/month is also not being displayed on my excel sheet. Why is that? Basically, I'm trying to calculate the average SNOW for May in ADA 0.7 SE, MI US. Then calculate the average SNOW for June in ADA 0.7 SE, MI US. etc..
I've spent all day and this is all I have got... Any help will be appreciated. Thanks in advance.
original data
https://gofile.io/?c=1gpbyT
Please try
Data
df=pd.read_csv(r'directorywhere the data is\data.csv')
df
Working
df.dtypes# Checking the datatype on each column
df.columns#listing columns
df['DATE']=pd.to_datetime(df['DATE'])#Converting date from object to a date format
df.set_index(df['DATE'], inplace=True)#Seeting the date as index
df['SNOW'].fillna(0)#filling all Not a Number values with zeros to make aggregation possible
df['SnowMean']=df.groupby([df.index.month, df.NAME])['SNOW'].transform('mean')#Groupby name, month and calculate the mean of snow. Store the result in anew column called df['SnowMean']
df
Checking
df.loc[:,['DATE','Month','SnowMean']]# Slice relevant columns to check
I realize you have multiple years. If you wanted mean per month in each year, again extract the year and add it in the groups to groupby as follows
df['SnowMeanPerYearPerMonth']=df.groupby([df.index.month,df.index.year,df.NAME])['SNOW'].transform('mean')
df
Check again
pd.set_option('display.max_rows',999)#diaplay upto 999 rows to check
df.loc[:,['DATE','Month','Year','SnowMean']]# Slice relevant columns to check
This question already has answers here:
Extracting just Month and Year separately from Pandas Datetime column
(13 answers)
Closed 3 months ago.
I have a dataframe with a date column (type datetime). I can easily extract the year or the month to perform groupings, but I can't find a way to extract both year and month at the same time from a date. I need to analyze performance of a product over a 1 year period and make a graph with how it performed each month. Naturally I can't just group by month because it will add the same months for 2 different years, and grouping by year doesn't produce my desired results because I need to look at performance monthly.
I've been looking at several solutions, but none of them have worked so far.
So basically, my current dates look like this
2018-07-20
2018-08-20
2018-08-21
2018-10-11
2019-07-20
2019-08-21
And I'd just like to have 2018-07, 2018-08, 2018-10, and so on.
You can use to_period
df['month_year'] = df['date'].dt.to_period('M')
If they are stored as datetime you should be able to create a string with just the year and month to group by using datetime.strftime (https://strftime.org/).
It would look something like:
df['ym-date'] = df['date'].dt.strftime('%Y-%m')
If you have some data that uses datetime values, like this:
sale_date = [
pd.date_range('2017', freq='W', periods=121).to_series().reset_index(drop=True).rename('Sale Date'),
pd.Series(np.random.normal(1000, 100, 121)).rename('Quantity')
]
sales = pd.concat(data, axis='columns')
You can group by year and date simultaneously like this:
d = sales['Sale Date']
sales.groupby([d.dt.year.rename('Year'), d.dt.month.rename('Month')]).sum()
You can also create a string that represents the combination of month and year and group by that:
ym_id = d.apply("{:%Y-%m}".format).rename('Sale Month')
sales.groupby(ym_id).sum()
A couple of options, one is to map to the first of each month:
Assuming your dates are in a column called 'Date', something like:
df['Date_no_day'] = df['Date'].apply(lambda x: x.replace(day=1))
If you are really keen on storing the year and month only, you could map to a (year, month) tuple, eg:
df['Date_no_day'] = df['Date'].apply(lambda x: (x.year, x.month))
From here, you can groupby/aggregate by this new column and perform your analysis
One way could be to transform the column to get the first of month for all of these dates and then create your analsis on month to month:
date_col = pd.to_datetime(['2011-09-30', '2012-02-28'])
new_col = date_col + pd.offsets.MonthBegin(1)
Here your analysis remains intact as monthly
I have a dataframe in the following format:
month count
2015/01 100
2015/02 200
2015/03 300
...
And want to get a new dataframe which contains month greater than a month, e.g.. 2015/03.
I tried to use the following code:
sdf = df.loc[datetime.strptime(df['month'], '%Y-%m')>=datetime.date(2015,9,1,0,0,0)]
I'm new to Python. Really appreciate it if some one can help.
If you want to get rows which have months larger than 2015/02 for instance, use pd.to_datetime instead of datetime.strptime since the former is vectorized and can accept a Series object as parameter (assuming you are using pandas):
import pandas as pd
df[pd.to_datetime(df.month) >= pd.to_datetime("2015/02")]
# month count
#1 2015/02 200
#2 2015/03 300