I have the following column in my dataframe
year-month
2020-01
2020-01
2020-01
2020-02
2020-02
...
2021-06
This column is stored as an "object" type in my dataframe. I didn't convert it to a "datetime" type from the outset because the values would then change to "2020-01-01" instead(?)
Anyway, I wanted to do a value_counts() by month so that I can plot it subsequently. How can I order the value_counts() by month while displaying the month as "Jan", "Feb"..."Dec" at the same time?
I've tried this:
pd.to_datetime(df['year-month']).dt.month.value_counts().sort_index()
However, the months come out as "1", "2"..."12", which isn't what I want.
I then tried this:
pd.to_datetime(df['year-month']).dt.strftime('%b').value_counts().sort_index()
This gives me the months as "Jan", "Feb"..."Dec" indeed, but now they are sorted alphabetically instead of in the actual month sequence.
Picking up from this point of yours:
result = pd.to_datetime(df["year-month"]).dt.strftime("%b").value_counts()
we can reindex the result so that the index becomes the month-name abbreviations in calendar order. These can be borrowed from the calendar module:
import calendar
# slice off the first element since it is an empty string
month_names = calendar.month_abbr[1:]
# reindex, filling months that didn't appear at all with 0
result = result.reindex(month_names, fill_value=0)
to get
>>> result
Jan 3
Feb 2
Mar 0
Apr 0
May 0
Jun 1
Jul 0
Aug 0
Sep 0
Oct 0
Nov 0
Dec 0
(The reason calendar.month_abbr has an empty string at the beginning is that Python sequences are 0-indexed but we call the 2nd month February; putting an empty string at position 0 means month_abbr[2] == "Feb".)
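Putting it all together, a minimal end-to-end sketch on data shaped like the question's:
import calendar
import pandas as pd

# sample data mirroring the question's "year-month" strings
df = pd.DataFrame({"year-month": ["2020-01", "2020-01", "2020-01",
                                  "2020-02", "2020-02", "2021-06"]})

# count occurrences of each month abbreviation ("Jan", "Feb", ...)
result = pd.to_datetime(df["year-month"]).dt.strftime("%b").value_counts()

# reorder into calendar order, filling months that never appear with 0
result = result.reindex(calendar.month_abbr[1:], fill_value=0)
print(result)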
I'm trying to analyse a COVID data set and I'm kind of at a loss on how to reshape the data with pandas. The data set looks like the following:
I'm trying to make it look like this:
             April 2                        | April 3                        | April 4
unique_tests total unique tests for April 2 | total unique tests for April 3 | total unique tests for April 4
positive     total positive for April 2     | total positive for April 3     | total positive for April 4
negative     total negative for April 2     | total negative for April 3     | total negative for April 4
remaining    total remaining for April 2    | total remaining for April 3    | total remaining for April 4
I have dates up to April 24.
Any ideas on how I can implement this? I can't make it work with pivot_table in pandas.
Use:
# convert numeric columns (with thousands separators) and parse the date column to datetimes
df = pd.read_csv(file, thousands=',', parse_dates=['date'])
# group by a custom datetime format, aggregate with sum, then transpose
df1 = df.groupby(df['date'].dt.strftime('%d-%b')).sum().T
Alternatively, it is possible to reassign the date column with the new datetime format:
df1 = df.assign(date = df['date'].dt.strftime('%d-%b')).groupby('date').sum().T
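Since the sample data set isn't shown above, here is a minimal sketch on made-up data with the shape the answer assumes (a date column plus the four count columns):
import pandas as pd
from io import StringIO

# hypothetical CSV standing in for the data set not shown above
csv = StringIO(
    "date,unique_tests,positive,negative,remaining\n"
    "2020-04-02,1000,100,850,50\n"
    "2020-04-02,500,40,440,20\n"
    "2020-04-03,1200,150,1000,50\n"
)
df = pd.read_csv(csv, thousands=",", parse_dates=["date"])

# group by formatted date, sum each day's counts, then transpose so dates become columns
# (numeric_only=True drops the raw datetime column on recent pandas versions)
df1 = df.groupby(df["date"].dt.strftime("%d-%b")).sum(numeric_only=True).T
print(df1)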
I've got a pandas dataframe that looks like this
miles dollars gallons date gal_cost mpg tank%_used day
0 253.2 21.37 11.138 2019-01-15 1.918657 22.732986 0.821993 Tuesday
1 211.9 22.24 11.239 2019-01-26 1.978824 18.853991 0.829446 Saturday
2 258.1 22.70 11.708 2019-02-02 1.938845 22.044756 0.864059 Saturday
3 223.0 22.24 11.713 2019-02-15 1.898745 19.038675 0.864428 Friday
I'd like to create a new column called 'id' that is unique for each entry. For the first entry in the df, the id would be c0115201901 because it is from the df_c dataframe, the date is 01 15 2019 and it is the first entry.
I know I'll end up doing something like this
df_c = df_c.assign(id=('c'+df_c['date']) + ?????)
but I'd like to parse the df_c['date'] column to pull values for the day, month and year individually. The df_c['date'] column is a datetime64[ns] type.
The other issue is I'd like to have a counter at the end of the id to count which number entry for the date it is. For example, 01 for the first entry, 02 for the second, etc.
I also have a df_m dataframe, but I can repeat the process with a different letter for that dataframe.
Refer to the pandas datetime properties docs.
The components can be extracted easily, e.g.
df_c['date'].dt.day, df_c['date'].dt.month and df_c['date'].dt.year
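A sketch putting this together with the per-entry counter (groupby('date').cumcount() numbers the entries within each date starting at 0, so we add 1 and zero-pad):
import pandas as pd

# hypothetical sample mirroring the question's df_c
df_c = pd.DataFrame({"date": pd.to_datetime(
    ["2019-01-15", "2019-01-26", "2019-01-26", "2019-02-02"])})

# running count per date: 1 for the first entry on a date, 2 for the second, ...
counter = df_c.groupby("date").cumcount().add(1)

# 'c' + MMDDYYYY + two-digit counter, e.g. c0115201901 for the first row
df_c["id"] = "c" + df_c["date"].dt.strftime("%m%d%Y") + counter.astype(str).str.zfill(2)
print(df_c)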
I have the following data (in csv form):
Country,City,Year,Value1,Value2
Germany,Berlin,2020,9,3
Germany,Berlin,2017,1,4
Germany,Berlin,2011,1,4
Israel,Tel Aviv,2007,4.5,1
I would like to create bins according to the Year column such that instead of using the specific year there would be a 5-year-range, and then sum up the values in Value1, Value2, grouping by the Country, City and bin ID (in the following example, I called this YearRange).
For example, after running this process, the data would look like so:
Country,City,YearRange,Value1,Value2
Germany,Berlin,2016-2020,10,7
Germany,Berlin,2011-2015,1,4
Israel,Tel Aviv,2006-2010,4.5,1
If this simplifies things, I don't mind creating the possible ranges in advance (i.e. I will have a table with all possible ranges: 2016-2020, 2011-2015, 2006-2010, down to the earliest date in my data).
How can I achieve this using Pandas?
Thanks!
Using pd.cut with groupby
df.groupby([df.Country, df.City,
            pd.cut(df.Year, [2006, 2011, 2016, 2020]).astype(str)]
          )[['Value1', 'Value2']].sum().reset_index()
Out[254]:
Country City Year Value1 Value2
0 Germany Berlin (2006, 2011] 1.0 4
1 Germany Berlin (2016, 2020] 10.0 7
2 Israel Tel Aviv (2006, 2011] 4.5 1
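If you want labels like 2016-2020 exactly as in the desired output, you can pass explicit bin edges and labels to pd.cut. A sketch, with the bin edges chosen here to match the question's 5-year ranges:
import pandas as pd

# the sample data from the question
df = pd.DataFrame({
    "Country": ["Germany", "Germany", "Germany", "Israel"],
    "City": ["Berlin", "Berlin", "Berlin", "Tel Aviv"],
    "Year": [2020, 2017, 2011, 2007],
    "Value1": [9, 1, 1, 4.5],
    "Value2": [3, 4, 4, 1],
})

# right-closed bins (2005, 2010], (2010, 2015], (2015, 2020] with readable labels
year_range = pd.cut(df["Year"],
                    bins=[2005, 2010, 2015, 2020],
                    labels=["2006-2010", "2011-2015", "2016-2020"]).rename("YearRange")

# observed=True keeps only the (Country, City, bin) combinations that occur
out = (df.groupby(["Country", "City", year_range], observed=True)[["Value1", "Value2"]]
         .sum()
         .reset_index())
print(out)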
I have a dataframe that looks as follows
SIM Sim_1 Sim_2
2015 100.0000 100.0000
2016 2.504613 0.123291
2017 3.802958 -0.919886
2018 4.513224 -1.976056
2019 -0.775783 3.914312
The following calculation
df = sims.shift(1, axis = 0)*(1+sims/100)
returns a dataframe which looks like this
SIMULATION Sim_1 Sim_2
2015 NaN NaN
2016 102.504613 100.123291
2017 2.599862 0.122157
2018 3.974594 -0.901709
The value in 2016 is exactly the one that should be calculated. But the calculation for 2017 should take the 2016 output (102.504613 and 100.123291) as its input; instead, the formula keeps using the original values, which gives 2.599862 and 0.122157.
Is there a simple way to do this in Python?
You are trying to show the growth of 100 given subsequent returns. Your problem is that the initial 100 is not in the same units as the returns. If you replace it with zero (a 0% return) and then take a cumulative product, your problem is solved.
# treat the first row as a 0% return instead of the level 100
sims.iloc[0] = 0
# convert percent returns to growth factors, compound, and rescale to a base of 100
sims.div(100).add(1).cumprod().mul(100)
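As a quick check, here is the same data from the question run through this fix:
import pandas as pd

# the data from the question
sims = pd.DataFrame(
    {"Sim_1": [100.0, 2.504613, 3.802958, 4.513224, -0.775783],
     "Sim_2": [100.0, 0.123291, -0.919886, -1.976056, 3.914312]},
    index=[2015, 2016, 2017, 2018, 2019],
)

# replace the starting level with a 0% return, then compound period by period
sims.iloc[0] = 0
paths = sims.div(100).add(1).cumprod().mul(100)
print(paths)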
Just a crude way of implementing this:
df2 = sims.copy()
for i in range(1, len(df2)):
    # each row compounds the previous row by that period's percent return
    df2.iloc[i] = df2.iloc[i - 1] * (1 + sims.iloc[i] / 100)
There may be a better way to optimize this.