How to extract year and month from the index into columns without the loop overwriting them? - python

I want to split each index value into a column[year] and a column[month].
for i in range(len(df_tmp.index)):
    yy=str(df_tmp.index[i])[0:4]
    mm=str(df_tmp.index[i])[-2:]
    df_tmp['year']=yy
    print(df_tmp['year'])
    i=i+1
But now every row of column[year] ends up overwritten with the value from the last index entry,
and I don't know how to solve it.

Try using this sample code:
import pandas as pd
I = ['2111','2112','2201','2202']
df = pd.DataFrame( [3,1,2,5], index=I, columns=['variable'])
df['year'] = df.index.str[:2]
df['month'] = df.index.str[2:]
      variable year month
2111         3   21    11
2112         1   21    12
2201         2   22    01
2202         5   22    02
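The loop in the question assigns a single string to df_tmp['year'] on every pass, and pandas broadcasts a scalar to all rows, so only the last value survives. Below is a minimal sketch of the vectorized fix using the question's own slicing (first four and last two characters). It assumes the index holds codes such as 202111; the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: index labels are assumed to be YYYYMM codes
df_tmp = pd.DataFrame({"variable": [3, 1, 2, 5]},
                      index=[202111, 202112, 202201, 202202])

# Convert the whole index to strings once, then slice all labels in one step
idx = df_tmp.index.astype(str)
df_tmp["year"] = idx.str[:4]    # first four characters of each label
df_tmp["month"] = idx.str[-2:]  # last two characters of each label

print(df_tmp)
```

Because the slicing runs on the whole index at once, no loop and no counter are needed.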

Related

How to remove duplicate entries but keep the first row selected columns value and last row selected columns value?

I'm creating charts in periscopedata and using pandas to derive our results. I'm having difficulty removing duplicates from the results.
This is what our data looks like in the final dataframe after calculating.
vendor_ID        date  opening  purchase  paid  closing
    B2345  01/01/2015        5        20    10       15
    B2345  01/01/2015       15        50    20       45
    B2345  02/01/2015       45         4    30       19
I want to remove the duplicate entries based on vendor_ID and date, but keep the opening value of the first row and the closing value of the last row, with purchase and paid summed.
i.e. the expected result is
vendor_ID        date  opening  purchase  paid  closing
    B2345  01/01/2015        5        70    30       45
    B2345  02/01/2015       45         4    30       19
I've tried the code below to remove the duplicates, but it gives a different result.
df.drop_duplicates(subset=["vendor_ID", "date"], keep="last", inplace=True)
How do I remove the duplicates and keep the first and last values as in the example above?
Use GroupBy.agg with GroupBy.first, GroupBy.last and GroupBy.sum specified per column for the output:
Note (thanks @Erfan): if you need the minimal and maximal values instead of first and last, change the dict to {'opening':'min','purchase':'sum','paid':'sum', 'closing':'max'}
df1 = (df.groupby(["vendor_ID", "date"], as_index=False)
         .agg({'opening':'first','purchase':'sum','paid':'sum', 'closing':'last'}))
print (df1)
  vendor_ID        date  opening  purchase  paid  closing
0     B2345  01/01/2015        5        70    30       45
1     B2345  02/01/2015       45         4    30       19
If you are not sure the rows are sorted by date, convert the column to datetimes and sort first:
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df = df.sort_values(["vendor_ID", "date"])
df1 = (df.groupby(["vendor_ID", "date"], as_index=False)
         .agg({'opening':'first','purchase':'sum','paid':'sum', 'closing':'last'}))
print (df1)
  vendor_ID       date  opening  purchase  paid  closing
0     B2345 2015-01-01        5        70    30       45
1     B2345 2015-01-02       45         4    30       19
You can also build the dictionary dynamically, so that every column except the two grouping keys and the columns aggregated with first and last is summed:
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df = df.sort_values(["vendor_ID", "date"])
d = {'opening':'first', 'closing':'last'}
sum_cols = df.columns.difference(list(d.keys()) + ['vendor_ID','date'])
final_d = {**dict.fromkeys(sum_cols,'sum'), **d}
df1 = df.groupby(["vendor_ID", "date"], as_index=False).agg(final_d).reindex(df.columns,axis=1)
print (df1)
  vendor_ID       date  opening  purchase  paid  closing
0     B2345 2015-01-01        5        70    30       45
1     B2345 2015-01-02       45         4    30       19
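As an alternative spelling of the same aggregation, here is a sketch using pandas named aggregation (keyword arguments of the form output_column=(input_column, function)). The DataFrame is rebuilt from the sample rows in the question, so the snippet runs on its own.

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    "vendor_ID": ["B2345", "B2345", "B2345"],
    "date": ["01/01/2015", "01/01/2015", "02/01/2015"],
    "opening": [5, 15, 45],
    "purchase": [20, 50, 4],
    "paid": [10, 20, 30],
    "closing": [15, 45, 19],
})

# One keyword per output column: (source column, aggregation function)
df1 = df.groupby(["vendor_ID", "date"], as_index=False).agg(
    opening=("opening", "first"),
    purchase=("purchase", "sum"),
    paid=("paid", "sum"),
    closing=("closing", "last"),
)

print(df1)
```

The result should match the dict-based version; the keyword form only makes the output column names explicit.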

How to calculate number of events per day using python?

I am having problems calculating/counting the number of events per day using python. I have a .txt file of earthquake data that I am using to do this. Here is what the file looks like:
2000 Jan 19 00 21 45 -118.815670 37.533170 3.870000 2.180000 383.270000
2000 Jan 11 16 16 46 -118.804500 37.551330 5.150000 2.430000 380.930000
2000 Jan 11 19 55 54 -118.821830 37.508830 0.600000 2.360000 378.080000
2000 Jan 11 05 33 02 -118.802000 37.554670 4.820000 2.530000 375.480000
2000 Jan 08 19 37 04 -118.815500 37.534670 3.900000 2.740000 373.650000
2000 Jan 09 19 34 27 -118.817670 37.529670 3.990000 3.170000 373.07000
Where column 0 is the year, 1 is the month, 2 is the day. There are no headers.
I want to calculate/count the number of events per day. Each line in the file (example: 2000 Jan 11) is an event. So for January 11th, I would like to know how many events there were. In this case, there were 3 events on January 11th.
I've tried looking on stack for some guidance and have found code that works for arrays such as:
a = [1, 1, 1, 0, 0, 0, 1]
which counts the occurrence of certain items in the array using code like:
unique, counts = numpy.unique(a, return_counts=True)
dict(zip(unique, counts))
I have not been able to find anything that helps me. Any help/advice would be appreciated.
groupby() is going to be your friend here. However, I would concatenate the Year, Month and Day first so that you can use dataframe.groupby(["full_date"]).count()
Full solution
Set up the DataFrame
import pandas as pd
df = pd.DataFrame([[2000, "Jan", 19],[2000, "Jan", 20],[2000, "Jan", 19],[2000, "Jan", 19]], columns = ["Year", "Month", "Day"])
Convert datatypes to str for concatenation
df["Year"] = df["Year"].astype(str)
df["Day"] = df["Day"].astype(str)
Create the 'full_date' column
df["full_date"] = df["Year"] + "-" + df["Month"] + "-" + df["Day"]
Count the number of events per day
df.groupby(["full_date"])["Day"].count()
Hope this helps/provides value :)
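To connect this to the file in the question, here is a minimal sketch that reads the whitespace-separated lines and counts rows per (year, month, day). It uses io.StringIO with the six sample lines in place of the real .txt file, and the column names year/month/day are labels chosen here for illustration.

```python
import io
import pandas as pd

# Stand-in for the .txt file: the six sample lines from the question
raw = """2000 Jan 19 00 21 45 -118.815670 37.533170 3.870000 2.180000 383.270000
2000 Jan 11 16 16 46 -118.804500 37.551330 5.150000 2.430000 380.930000
2000 Jan 11 19 55 54 -118.821830 37.508830 0.600000 2.360000 378.080000
2000 Jan 11 05 33 02 -118.802000 37.554670 4.820000 2.530000 375.480000
2000 Jan 08 19 37 04 -118.815500 37.534670 3.900000 2.740000 373.650000
2000 Jan 09 19 34 27 -118.817670 37.529670 3.990000 3.170000 373.07000
"""

# No header row; columns are separated by runs of whitespace
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None)

# Give the first three columns readable (hypothetical) names
df = df.rename(columns={0: "year", 1: "month", 2: "day"})

# size() counts the rows in each group, i.e. the events per day
counts = df.groupby(["year", "month", "day"]).size()
print(counts)
```

For a real file, pass its path to pd.read_csv instead of the StringIO object.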

Pandas get the Month Ending Values from Series

I need to get the month-end balance from a series of entries.
Sample data:
          date   contrib   totalShrs
0   2009-04-23   5220.00   10000.000
1   2009-04-24  10210.00   20000.000
2   2009-04-27  16710.00   30000.000
3   2009-04-30  22610.00   40000.000
4   2009-05-05  28909.00   50000.000
5   2009-05-20  38409.00   60000.000
6   2009-05-28  46508.00   70000.000
7   2009-05-29  56308.00   80000.000
8   2009-06-01  66108.00   90000.000
9   2009-06-02  78108.00  100000.000
10  2009-06-12  86606.00  110000.000
11  2009-08-03  95606.00  120000.000
The output would look something like this:
2009-04-30 40000
2009-05-31 80000
2009-06-30 110000
2009-07-31 110000
2009-08-31 120000
Is there a simple Pandas method?
I don't see how I can do this with something like a groupby?
Or would I have to do something like iterrows, find all the monthly entries, order them by date and pick the last one?
Thanks.
Use Grouper with GroupBy.last, forward fill the missing months with ffill, then call Series.reset_index:
#if necessary
#df['date'] = pd.to_datetime(df['date'])
#note: pandas 2.2+ spells the month-end alias 'ME' instead of 'm'/'M'
df = df.groupby(pd.Grouper(freq='m',key='date'))['totalShrs'].last().ffill().reset_index()
#alternative
#df = df.resample('m',on='date')['totalShrs'].last().ffill().reset_index()
print (df)
        date  totalShrs
0 2009-04-30    40000.0
1 2009-05-31    80000.0
2 2009-06-30   110000.0
3 2009-07-31   110000.0
4 2009-08-31   120000.0
The following gives you the information you want, i.e. the end-of-month values, though the format is not exactly what you asked for (months with no entries, such as July, are not filled in):
df['month'] = df['date'].str.split('-', expand = True)[1] # split date column to get month column
newdf = pd.DataFrame(columns=df.columns) # create a new dataframe for output
grouped = df.groupby('month') # get grouped values
for g in grouped: # for each group, get last row
    gdf = pd.DataFrame(data=g[1])
    newdf.loc[len(newdf),:] = gdf.iloc[-1,:] # fill new dataframe with last row obtained
newdf = newdf.drop('date', axis=1) # drop date column, since month column is there
print(newdf)
Output:
  contrib totalShrs month
0   22610     40000    04
1   56308     80000    05
2   86606    110000    06
3   95606    120000    08
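Another way to get one value per month, including months with no entries, is to group on monthly periods and then reindex over the full range of months. The sketch below uses a shortened copy of the sample data and assumes the rows are already sorted by date; the result is indexed by period (2009-04, 2009-05, ...) rather than by month-end date.

```python
import pandas as pd

# Shortened sample: the last entry of each month from the question, plus one earlier row
df = pd.DataFrame({
    "date": pd.to_datetime(["2009-04-23", "2009-04-30", "2009-05-29",
                            "2009-06-12", "2009-08-03"]),
    "totalShrs": [10000.0, 40000.0, 80000.0, 110000.0, 120000.0],
})

# Last balance recorded in each calendar month (rows assumed sorted by date)
month_end = df.groupby(df["date"].dt.to_period("M"))["totalShrs"].last()

# Add the months that have no entries (July here) and carry the balance forward
all_months = pd.period_range(month_end.index.min(), month_end.index.max(), freq="M")
month_end = month_end.reindex(all_months).ffill()

print(month_end)
```

If real dates are needed, month_end.index.to_timestamp(how="end") converts the periods back to timestamps.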

grouping time-series data based on starting and ending date

I have time-series data of a yearly sports tournament, with the date when each game was played. I want to group the games by the season (year) they were played in. Each season starts in August and ends in July of the NEXT year.
How would I go about grouping the games by season, like -
season(2016-2017), season(2017-2018), etc..
This Answer involving df.resample() may be related, but I'm not sure how I'd go about doing it.
This is what the date column looks like:
DATE
26/09/09
04/10/09
17/10/09
25/10/09
31/10/09
...
29/09/18
07/10/18
28/10/18
03/11/18
I want to group by seasons so that I can perform visualization operations over the aggregated data.
UPDATE: For the time being my solution is to split up the dataframe into groups of 32 as I know each season has 32 games. This is the code I've used:
split_df = np.array_split(df, np.arange(0, len(df),32))
But I'd prefer something more elegant that actually uses the dates, so I'll keep the question open.
The key to success is proper grouping, in your case pd.Grouper(key='DATA', freq='AS-AUG').
Note that freq='AS-AUG' states that each group should start at the beginning of
August each year (pandas 2.2+ spells the same alias 'YS-AUG').
Look at the following script:
import numpy as np
import pandas as pd
# Source columns
dates = [ '01/04/09', '31/07/09', '01/08/09', '26/09/09', '04/10/09', '17/12/09',
          '25/01/10', '20/04/10', '31/07/10', '01/08/10', '28/10/10', '03/11/10',
          '25/12/10', '20/04/11', '31/07/11' ]
scores_x = np.random.randint(0, 20, len(dates))
scores_y = np.random.randint(0, 20, len(dates))
# Source DataFrame
df = pd.DataFrame({'DATA': dates, 'SCORE_X': scores_x, 'SCORE_Y': scores_y})
# Convert string date to datetime
df.DATA = pd.to_datetime(df.DATA, format='%d/%m/%y')
# Grouping
gr = df.groupby(pd.Grouper(key='DATA', freq='AS-AUG'))
If you print the results:
for name, group in gr:
    print()
    print(name)
    print(group)
you will get (the scores are random, so yours will differ):
2008-08-01 00:00:00
        DATA  SCORE_X  SCORE_Y
0 2009-04-01       16       11
1 2009-07-31       10        7
2009-08-01 00:00:00
        DATA  SCORE_X  SCORE_Y
2 2009-08-01       19        6
3 2009-09-26       14        5
4 2009-10-04        8       11
5 2009-12-17       12       19
6 2010-01-25        0        0
7 2010-04-20       17        6
8 2010-07-31       18        2
2010-08-01 00:00:00
         DATA  SCORE_X  SCORE_Y
9  2010-08-01       15       18
10 2010-10-28        2        4
11 2010-11-03        8       16
12 2010-12-25       13        1
13 2011-04-20       19        7
14 2011-07-31        8        3
As you can see, each group starts on the 1st of August and ends on
the 31st of July.
Then you can do whatever you want with your groups.
To group by calendar year, use:
df.groupby(df['DATE'].dt.year).count()
Output
      DATE
DATE
2009     5
2018     4
Custom Season Grouping
min_year = df['DATE'].dt.year.min()
max_year = df['DATE'].dt.year.max()
rng = pd.date_range(start='{}-07'.format(min_year), end='{}-08'.format(max_year), freq='12M').to_series()
df.groupby(pd.cut(df['DATE'], rng)).count()
Output
                          DATE
DATE
(2009-07-31, 2010-07-31]     3
(2010-07-31, 2011-07-31]     0
(2011-07-31, 2012-07-31]     0
(2012-07-31, 2013-07-31]     0
(2013-07-31, 2014-07-31]     0
(2014-07-31, 2015-07-31]     0
(2015-07-31, 2016-07-31]     0
(2016-07-31, 2017-07-31]     0
(2017-07-31, 2018-07-31]     1
Resampling using 'A-JUL' as an anchored offset alias should do the trick:
>>> df
            SAMPLE
DATE
2009-01-30       1
2009-07-10       4
2009-11-20       3
2010-01-01       5
2010-05-13       1
2010-08-01       1
>>> df.resample('A-JUL').sum()
            SAMPLE
DATE
2009-07-31       5
2010-07-31       9
2011-07-31       1
A indicates a yearly interval, and -JUL indicates that each interval ends in July (pandas 2.2+ spells this alias 'YE-JUL').
You could build a season column and group by that. In the code below, I use pandas.DateOffset() to move all dates 7 months back, so that a game played in August looks like it was played in January. This aligns the season year with the calendar year. Building the season string is fairly straightforward after that.
import pandas as pd
from datetime import date
dates = pd.date_range(date(2009, 8, 1), date(2018, 7, 30), freq='17d')
df = pd.DataFrame(dates, columns=['date'])
# copy the date column to a separate dataframe to do the work
# (.copy() avoids a SettingWithCopyWarning on the assignments below)
df_tmp = df[['date']].copy()
df_tmp['season_start_year'] = (df_tmp['date'] - pd.DateOffset(months=7)).dt.year
df_tmp['season_end_year'] = df_tmp['season_start_year'] + 1
df_tmp['season'] = df_tmp['season_start_year'].map(str) + '-' + df_tmp['season_end_year'].map(str)
# copy season column to the main dataframe
df['season'] = df_tmp['season']
df.groupby('season').count()
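The same season label can also be computed arithmetically, without shifting the dates: a game played before August belongs to the season that started in the previous calendar year. This is a minimal sketch on four made-up dates in the question's DD/MM/YY format; the column name season is chosen here for illustration.

```python
import pandas as pd

# Hypothetical sample dates in the question's DD/MM/YY format
df = pd.DataFrame({"DATE": ["26/09/09", "31/07/10", "01/08/10", "03/11/18"]})
df["DATE"] = pd.to_datetime(df["DATE"], format="%d/%m/%y")

# Months January to July count towards the season that began the year before
start_year = df["DATE"].dt.year - (df["DATE"].dt.month < 8).astype(int)

# Build labels such as "2009-2010"
df["season"] = start_year.astype(str) + "-" + (start_year + 1).astype(str)

print(df.groupby("season").size())
```

Grouping by the season column then works like any other groupby key.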

Subsetting a Pandas.DataFrame object only where there is a difference between two rows in python

I was wondering if there is an easy way in python to return a subset of my DataFrame rows only where there is a change between two consecutive rows. For example, my dataframe object might look like this:
Date            A   B
20160713070000  20  21
20160713070100  20  23
20160713070128  20  23
20160713070128  21  24
20160713070134  23  24
In this case, I would want to return the following dataframe object:
Date            A   B
20160713070000  20  21
20160713070100  20  23
20160713070128  21  24
20160713070134  23  24
Thanks for the help!
I'd use the drop_duplicates() function:
In [262]: df.drop_duplicates(subset=['A','B'])
Out[262]:
             Date   A   B
0  20160713070000  20  21
1  20160713070100  20  23
3  20160713070128  21  24
4  20160713070134  23  24
Assuming your dataframe is df, try the following:
sub_df = df[df.groupby('Date')['A'].transform(lambda x: x.index[-1])==df.index]
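Note that drop_duplicates() also drops a row whose (A, B) pair reappears later, after other values in between. If only changes relative to the immediately preceding row should count, comparing each row with its shifted copy is one option. This is a sketch on the sample data from the question.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Date": [20160713070000, 20160713070100, 20160713070128,
             20160713070128, 20160713070134],
    "A": [20, 20, 20, 21, 23],
    "B": [21, 23, 23, 24, 24],
})

values = df[["A", "B"]]

# True where A or B differs from the previous row; the first row has no
# predecessor (compared against NaN), so it is always kept
changed = (values != values.shift()).any(axis=1)

sub_df = df[changed]
print(sub_df)
```

On this sample the result is the same four rows as the expected output in the question.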
