I am having problems counting the number of events per day using Python. I have a .txt file of earthquake data that I am using for this. Here is what the file looks like:
2000 Jan 19 00 21 45 -118.815670 37.533170 3.870000 2.180000 383.270000
2000 Jan 11 16 16 46 -118.804500 37.551330 5.150000 2.430000 380.930000
2000 Jan 11 19 55 54 -118.821830 37.508830 0.600000 2.360000 378.080000
2000 Jan 11 05 33 02 -118.802000 37.554670 4.820000 2.530000 375.480000
2000 Jan 08 19 37 04 -118.815500 37.534670 3.900000 2.740000 373.650000
2000 Jan 09 19 34 27 -118.817670 37.529670 3.990000 3.170000 373.07000
Column 0 is the year, column 1 is the month, and column 2 is the day. There are no headers.
I want to count the number of events per day. Each line in the file is one event. So, for January 11th (2000 Jan 11), I would like to know how many events there were; in this case, there were 3 events on January 11th.
I've tried looking on Stack Overflow for some guidance and have found code that works for arrays such as:
a = [1, 1, 1, 0, 0, 0, 1]
which counts the occurrence of certain items in the array using code like:
unique, counts = numpy.unique(a, return_counts=True)
dict(zip(unique, counts))
I have not been able to find anything that helps me. Any help/advice would be appreciated.
groupby() is going to be your friend here. However, I would concatenate the Year, Month, and Day so that you can use dataframe.groupby(["full_date"]).count().
Full solution
Setup DF
import pandas as pd

df = pd.DataFrame([[2000, "Jan", 19], [2000, "Jan", 20], [2000, "Jan", 19], [2000, "Jan", 19]],
                  columns=["Year", "Month", "Day"])
Convert datatypes to str for concatenation
df["Year"] = df["Year"].astype(str)
df["Day"] = df["Day"].astype(str)
Create 'full_date' column
df["full_date"] = df["Year"] + "-" + df["Month"] + "-" + df["Day"]
Count the # of events per date
df.groupby(["full_date"])["Day"].count()
Hope this helps/provides value :)
Related
I want to split the index number into a [year] column and a [month] column.
for i in range(len(df_tmp.index)):
    yy = str(df_tmp.index[i])[0:4]
    mm = str(df_tmp.index[i])[-2:]
    df_tmp['year'] = yy
    print(df_tmp['year'])
    i = i + 1
But in the output, the [year] column gets overwritten by the last value of the index, and I don't know how to solve it.
Try using this sample code:
import pandas as pd

I = ['2111', '2112', '2201', '2202']
df = pd.DataFrame([3, 1, 2, 5], index=I, columns=['variable'])
df['year'] = df.index.str[:2]
df['month'] = df.index.str[2:]
variable year month
2111 3 21 11
2112 1 21 12
2201 2 22 01
2202 5 22 02
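For the original df_tmp, the same slicing replaces the loop entirely (a sketch assuming the index values are strings like 'YYYYMM'; adjust the slice positions to match your index format):
df_tmp['year'] = df_tmp.index.astype(str).str[:4]
df_tmp['month'] = df_tmp.index.astype(str).str[-2:]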
I am trying to handle the following dataframe
import pandas as pd
import io
csv_data = '''
ID,age,Year
100,75,2020
100,76,2021
200,64,2020
200,65,2021
200,66,2022
300,69,2020
300,70,2021
300,71,2022
300,72,2023
'''
df = pd.read_csv(io.StringIO(csv_data))
df = df.set_index(['ID', 'age'])
df
Year
ID age
100 75 2020
76 2021
200 64 2020
65 2021
66 2022
300 69 2020
70 2021
71 2022
72 2023
The ID in this data frame represents the same person, and Year means the year of the visit.
The frame is multi-indexed on ID and age. At this point, is it possible to display the IDs with more age entries first?
The ideal display is shown below.
Year
ID age
300 69 2020
70 2021
71 2022
72 2023
200 64 2020
65 2021
66 2022
100 75 2020
76 2021
Unless it's a requirement that you immediately set_index after your .read_csv, the following should be reasonably okay...
Get a count of how many times each ID occurs - we can use .value_counts for this, which handily sorts in descending order automatically...
id_freq = df['ID'].value_counts()
Then index your DF and .reindex using the index of id_freq, eg:
df = df.set_index(['ID', 'age']).reindex(id_freq.index, level=0)
This'll give you:
Year
ID age
300 69 2020
70 2021
71 2022
72 2023
200 64 2020
65 2021
66 2022
100 75 2020
76 2021
This might also have the useful side effect that you can run id_freq.value_counts() to get a distribution of how many times patients appear in general.
If you do have to index from the start, then you might as well provide it to .read_csv, such as:
df = pd.read_csv(io.StringIO(csv_data), index_col=['ID', 'age'])
Then, similar to above, reindex on the value counts of the first index level, e.g.:
df = df.reindex(df.index.get_level_values(0).value_counts().index, level=0)
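Putting the first approach together as a runnable sketch, using the csv_data from the question:
import io
import pandas as pd

csv_data = '''
ID,age,Year
100,75,2020
100,76,2021
200,64,2020
200,65,2021
200,66,2022
300,69,2020
300,70,2021
300,71,2022
300,72,2023
'''

df = pd.read_csv(io.StringIO(csv_data))

# count rows per ID; value_counts sorts by descending frequency
id_freq = df['ID'].value_counts()

# index on ID/age, then reorder the ID level by frequency
df = df.set_index(['ID', 'age']).reindex(id_freq.index, level=0)
print(df)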
You can first compute the length of each ID group using groupby and size, and then sort the ID values based on the length:
s = df.groupby('ID').size()
df.sort_values('ID', key=lambda i:s[i], ascending=False)
Output:
Year
ID age
300 69 2020
70 2021
71 2022
72 2023
200 64 2020
65 2021
66 2022
100 75 2020
76 2021
You can sort based on two columns and choose which is sorted ascending and which descending, like below:
>>> print(df.sort_values(['ID', 'age'], ascending=[False, True]))
Year
ID age
300 69 2020
70 2021
71 2022
72 2023
200 64 2020
65 2021
66 2022
100 75 2020
76 2021
You could sort these values like this:
df = df.sort_values(['ID', 'age'], ascending=[False, True])
pandas sort_index() is useful for sorting a dataframe by its index values.
df.sort_index(ascending=[False, True])
I have daily temperature data from 1901-1940. I want to exclude leap years i.e. remove any temperature data that falls on 2/29. My data is currently one long array. I am reshaping it so that every year is a row and every column is a day. I'm trying to remove the leap years with the last line of code here:
import requests
import pandas as pd
from datetime import date
params = {"sid": "PHLthr", "sdate":"1900-12-31", "edate":"2020-12-31", "elems": [{"name": "maxt", "interval": "dly", "duration": "dly", "prec": 6}]}
baseurl = "http://data.rcc-acis.org/StnData"
#get the data
resp = requests.post(baseurl, json=params)
#package into the dataframe
df = pd.DataFrame(columns=['date', 'tmax'], data=resp.json()['data'])
#convert the date column to datetimes
df['date']=pd.to_datetime(df['date'])
#select years
mask = (df['date'] >= '1900-01-01') & (df['date'] <= '1940-12-31')
Baseline=df.loc[mask]
#get rid of leap years:
Baseline=Baseline.loc[(Baseline['date'].dt.day!=29) & (Baseline['date'].dt.month!=2)]
but when I reshape the array, I notice that there are 366 columns instead of 365, so I don't think I'm actually getting rid of the February 29th data. How would I completely eliminate any temperature data recorded on 2/29 throughout my data set? I only want 365 data points for each year.
daily=pd.DataFrame(data={'date':Baseline.date,'tmax':Baseline.tmax})
daily['day']=daily.date.dt.dayofyear
daily['year']=daily.date.dt.year
daily.pivot(index='year', columns='day', values='tmax')
The source of your problem is that you used daily.date.dt.dayofyear.
Each day in a year, including Feb 29 has its own number.
To make things worse, e.g. Mar 1 has dayofyear:
61 in leap years,
60 in non-leap years.
One possible solution is to set the day column to a string representation of the month and day.
For proper sorting in the pivoted table, the month part should come first.
So, after you convert the date column to datetime, create both additional columns by running:
daily['year'] = daily.date.dt.year
daily['day'] = daily.date.dt.strftime('%m-%d')
Then you can filter out Feb 29 and generate the pivot table in one go:
result = daily[daily.day != '02-29'].pivot(index='year', columns='day', values='tmax')
For a limited source data sample (other than yours), I got:
day 02-27 02-28 03-01 03-02
year
2020 11 10 14 15
2021 11 21 22 24
An alternative
Create 3 additional columns:
daily['year'] = daily.date.dt.year
daily['month'] = daily.date.dt.strftime('%m')
daily['day'] = daily.date.dt.strftime('%d')
Note the string representation of month and day, to keep leading
zeroes.
Then filter out Feb 29 and generate the pivot table with a MultiIndex
on columns:
result = daily[(daily.month != '02') | (daily.day != '29')].pivot(
    index='year', columns=['month', 'day'], values='tmax')
This time the result is:
month 02 03
day 27 28 01 02
year
2020 11 10 14 15
2021 11 21 22 24
The easy way is to eliminate those items before building the array.
import requests
from datetime import date
params = {"sid": "PHLthr", "sdate":"1900-12-31", "edate":"2020-12-31", "elems": [{"name": "maxt", "interval": "dly", "duration": "dly", "prec": 6}]}
baseurl = "http://data.rcc-acis.org/StnData"
#get the data
resp = requests.post(baseurl, json=params)
vals = resp.json()
rows = [row for row in vals['data'] if '02-29' not in row[0]]
print(rows)
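From there, a short follow-up (a sketch that just mirrors the question's own column names) gets the filtered rows back into a dataframe for the reshape step:
import pandas as pd

df = pd.DataFrame(rows, columns=['date', 'tmax'])
df['date'] = pd.to_datetime(df['date'])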
You get 366 columns because of using dayofyear. That will calculate the day per the actual calendar (i.e. without removing 29 Feb).
To see this:
>>> daily.iloc[1154:1157]
date tmax day year
1154 1904-02-28 38.000000 59 1904
1156 1904-03-01 39.000000 61 1904
1157 1904-03-02 37.000000 62 1904
Notice the day goes from 59 to 61 (the 60th day was 29 February 1904).
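As a side note, the original mask (Baseline['date'].dt.day != 29) & (Baseline['date'].dt.month != 2) also drops every row in February, not just Feb 29. A sketch of the filter the question seems to intend is:
# keep everything except rows that fall exactly on Feb 29
Baseline = Baseline.loc[~((Baseline['date'].dt.month == 2) & (Baseline['date'].dt.day == 29))]
Even with that fix, though, the 366-column symptom comes from dayofyear as described above.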
I have a dataframe with historical market caps for which I need to compute their 5-year compound annual growth rates (CAGRs). However, the dataframe has hundreds of companies with 20 years of values each, so I need to be able to isolate each company's data to compute their CAGRs. How do I go about doing this?
The function to calculate a CAGR is: (end/start)^(1/# years)-1. I have never used .groupby() or .apply(), so I don't know how to implement the CAGR equation for rolling values.
Here is a screenshot of part of the dataframe so you have a visual representation of what I am trying to use:
Screenshot of dataframe.
Any guidance would be greatly appreciated!
Assuming there is 1 value per company per year, you can reduce the date to the year. This is a lot simpler: no need for groupby or apply.
Say your dataframe is named df. First, reduce the date to the year:
df['year'] = df['Date'].dt.year
Second, add a year+5 column:
df['year+5'] = df['year'] + 5
Third, merge df with itself (the left side supplies the start year and the right side the year five years later, so the _start/_end suffixes line up):
df_new = pd.merge(df, df, how='inner', left_on=['Instrument', 'year+5'], right_on=['Instrument', 'year'], suffixes=['_start', '_end'])
Finally, calculate the rolling 5-year CAGR (the 0.2 exponent is 1/5):
df_new['CAGR'] = (df_new['Company Market Cap_end']/df_new['Company Market Cap_start'])**(0.2)-1
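A small self-contained sketch with made-up numbers, assuming columns named 'Instrument', 'Date', and 'Company Market Cap' as in the screenshot:
import pandas as pd

df = pd.DataFrame({
    'Instrument': ['AAA'] * 6,
    'Date': pd.to_datetime(['2015-12-31', '2016-12-31', '2017-12-31',
                            '2018-12-31', '2019-12-31', '2020-12-31']),
    'Company Market Cap': [100, 110, 130, 150, 160, 200],
})

df['year'] = df['Date'].dt.year
df['year+5'] = df['year'] + 5

df_new = pd.merge(df, df, how='inner',
                  left_on=['Instrument', 'year+5'],
                  right_on=['Instrument', 'year'],
                  suffixes=['_start', '_end'])

# 5-year CAGR: (end / start) ** (1/5) - 1
df_new['CAGR'] = (df_new['Company Market Cap_end'] / df_new['Company Market Cap_start']) ** 0.2 - 1
print(df_new[['year_start', 'year_end', 'CAGR']])
Here the only matched pair is 2015 -> 2020, which gives a CAGR of (200/100)**0.2 - 1, about 0.149.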
Setting up a toy example:
import numpy as np
import pandas as pd
idx_level_0 = np.repeat(["company1", "company2", "company3"], 5)
idx_level_1 = np.tile([2015, 2016, 2017, 2018, 2019], 3)
values = np.random.randint(low=1, high=100, size=15)
df = pd.DataFrame({"values": values}, index=[idx_level_0, idx_level_1])
df.index.names = ["company", "year"]
print(df)
values
company year
company1 2015 19
2016 61
2017 87
2018 55
2019 46
company2 2015 1
2016 68
2017 50
2018 93
2019 84
company3 2015 11
2016 84
2017 54
2018 21
2019 55
I suggest using groupby to group by individual companies. You can then apply your computation via a lambda function. The result is basically a one-liner.
# actual computation for a two-year period
cagr_period = 2
df["cagr"] = df.groupby("company").apply(lambda x, period: ((x.pct_change(period) + 1) ** (1/period)) - 1, cagr_period)
print(df)
values cagr
company year
company1 2015 19 NaN
2016 61 NaN
2017 87 1.139848
2018 55 -0.050453
2019 46 -0.272858
company2 2015 1 NaN
2016 68 NaN
2017 50 6.071068
2018 93 0.169464
2019 84 0.296148
company3 2015 11 NaN
2016 84 NaN
2017 54 1.215647
2018 21 -0.500000
2019 55 0.009217
I'm working in Python and I have a Pandas DataFrame of Uber data from New York City. A part of the DataFrame looks like this:
Year Week_Number Total_Dispatched_Trips
2015 51 1,109
2015 5 54,380
2015 50 8,989
2015 51 1,025
2015 21 10,195
2015 38 51,957
2015 43 266,465
2015 29 66,139
2015 40 74,321
2015 39 3
2015 50 854
As it is right now, the same week appears multiple times for each year. I want to sum the values for "Total_Dispatched_Trips" for every week for each year. I want each week to appear only once per year. (So week 51 can't appear multiple times for year 2015 etc.). How do I do this? My dataset is over 3k rows, so I would prefer not to do this manually.
Thanks in advance.
Okidoki, here it is, borrowing from Convert number strings with commas in pandas DataFrame to float:
import locale
from locale import atof

import pandas as pd

locale.setlocale(locale.LC_NUMERIC, '')
df['numeric_trip'] = pd.to_numeric(df.Total_Dispatched_Trips.apply(atof), errors='coerce')
df.groupby(['Year', 'Week_Number']).numeric_trip.sum()
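If setting the locale is awkward on your system, an alternative sketch is to strip the commas directly (this assumes Total_Dispatched_Trips is stored as strings like '1,109'):
df['numeric_trip'] = pd.to_numeric(df['Total_Dispatched_Trips'].astype(str).str.replace(',', '', regex=False), errors='coerce')
weekly = df.groupby(['Year', 'Week_Number'], as_index=False)['numeric_trip'].sum()
The as_index=False keeps Year and Week_Number as regular columns in the result, with one row per year/week pair.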