I am trying to join two datasets, but they are not the same and don't share the same criteria.
Currently I have the dataset below, which contains the number of fires based on month and year, but the months are part of the header and the years are a column.
I would like to add this data to a new column (let's hypothetically call it nr_total_queimadas), using the data_medicao column from this other dataset as the target.
The date format is YYYY-MM-DD, but the day doesn't really matter here.
I tried to write a loop for this, but I think I'm doing something wrong and I don't have much idea how to proceed.
Below is an example of how I would like the output after joining the two datasets:
I used as an example the case where some dates repeat (which should happen), so the corresponding number of fires should also repeat.
First, I assume that the first dataframe is in variable a and the second is in variable b.
To make looking up simpler, we set the index of a to year:
a = a.set_index('year')
Then, we take the years from the data_medicao column in dataframe b:
years = b['data_medicao'].dt.year
To get the month name from the dataframe b, we use strftime. Then we need to make the month name into lower case so that it matches the column names in a. To do that, we use .str.lower():
month_name_lowercase = b['data_medicao'].dt.strftime("%B").str.lower()
Then, using lookup, we can pick the value from dataframe a for each (year, month name) pair given by years and month_name_lowercase:
num_fires = a.lookup(years.values, month_name_lowercase.values)
Finally, add the values to a new column in b:
b['nr_total_queimadas'] = num_fires
So the complete code is like this:
a = a.set_index('year')
years = b['data_medicao'].dt.year
month_name_lowercase = b['data_medicao'].dt.strftime("%B").str.lower()
num_fires = a.lookup(years.values, month_name_lowercase.values)
b['nr_total_queimadas'] = num_fires
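Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so the call above only works on older versions. As a minimal sketch for newer pandas, assuming the same setup as above (pandas imported as pd, a indexed by year with lowercase month names as columns, and years and month_name_lowercase already computed), you can stack a into a long Series and reindex it:

# Alternative for pandas >= 2.0, where DataFrame.lookup no longer exists.
# Assumes `a` is indexed by 'year' and its columns are lowercase month names.
long_a = a.stack()  # MultiIndex Series: (year, month name) -> number of fires
pairs = pd.MultiIndex.from_arrays([years, month_name_lowercase])
b['nr_total_queimadas'] = long_a.reindex(pairs).to_numpy()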
Assume the following data for year vs. month. Convert the month names to numbers:
import pandas as pd

columns = ["year","jan","feb","mar"]
data = [
(2001,110,120,130),
(2002,210,220,230),
(2003,310,320,330)
]
df = pd.DataFrame(data=data, columns=columns)
month_map = {"jan":"1", "feb":"2", "mar":"3"}
df = df.rename(columns=month_map)
[Out]:
year 1 2 3
0 2001 110 120 130
1 2002 210 220 230
2 2003 310 320 330
Assume the following data for date-wise transactions. Extract the year and month from the date:
columns2 = ["date"]
data2 = [
("2001-02-15"),
("2001-03-15"),
("2002-01-15"),
("2002-03-15"),
("2003-01-15"),
("2003-02-15"),
]
df2 = pd.DataFrame(data=data2, columns=columns2)
df2["date"] = pd.to_datetime(df2["date"])
df2["year"] = df2["date"].dt.year
df2["month"] = df2["date"].dt.month
[Out]:
date year month
0 2001-02-15 2001 2
1 2001-03-15 2001 3
2 2002-01-15 2002 1
3 2002-03-15 2002 3
4 2003-01-15 2003 1
5 2003-02-15 2003 2
Join on year:
df2 = df2.merge(df, left_on="year", right_on="year", how="left")
[Out]:
date year month 1 2 3
0 2001-02-15 2001 2 110 120 130
1 2001-03-15 2001 3 110 120 130
2 2002-01-15 2002 1 210 220 230
3 2002-03-15 2002 3 210 220 230
4 2003-01-15 2003 1 310 320 330
5 2003-02-15 2003 2 310 320 330
Compute row-wise sum of months:
df2["nr_total_queimadas"] = df2[list(month_map.values())].apply(pd.Series.sum, axis=1)
df2[["date", "nr_total_queimadas"]]
[Out]:
date nr_total_queimadas
0 2001-02-15 360
1 2001-03-15 360
2 2002-01-15 660
3 2002-03-15 660
4 2003-01-15 960
5 2003-02-15 960
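If instead you want the number of fires for each date's own month rather than the yearly total, here is a sketch of that variant (the column name nr_mes_queimadas is hypothetical), picking each row's month column from the merged frame:

import numpy as np

# Hypothetical variant: take the value of each row's own month column instead of summing all months.
month_values = df2[list(month_map.values())].to_numpy()  # columns "1", "2", "3", in that order
df2["nr_mes_queimadas"] = month_values[np.arange(len(df2)), df2["month"].to_numpy() - 1]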
*Input:*
df["waiting_time"].value_counts()
*Output:*
2 days 6724
4 days 5290
1 days 5213
7 days 4906
6 days 4037
...
132 days 1
125 days 1
117 days 1
146 days 1
123 days 1
Name: waiting_time, Length: 128, dtype: int64
I tried:
df['wait_dur'] = df['waiting_time'].values.astype(str)
and I've tried apply as well. The data type doesn't change; it stays the same.
You need to skip the 'values' part in your code:
df['wait_dur'] = df['waiting_time'].astype(str)
If you check the first row, for example, you will get:
type(df['wait_dur'][0])
<class 'str'>
df = df.applymap(str)
This should also work; it applies str element-wise to the entire DataFrame.
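If waiting_time is a timedelta column (as the value_counts output suggests) and you only need the number of days as text, a hedged alternative is to pull the days out first:

# Assumes waiting_time is a timedelta64 column; .dt.days gives the whole-day count.
df['wait_dur'] = df['waiting_time'].dt.days.astype(str)  # e.g. '2', '4', ...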
I am trying to calculate the number of days until the next holiday and since the last holiday. My method of calculation is like below:
holidays = pd.Series(pd.to_datetime(["01.01.2013", "06.01.2013", "14.02.2013","29.03.2013",
"31.03.2013", "01.04.2013", "01.05.2013", "03.05.2013",
"19.05.2013", "26.05.2013", "30.05.2013", "23.06.2013",
"15.07.2013", "27.10.2013", "01.11.2013", "11.11.2013",
"24.12.2013", "25.12.2013", "26.12.2013", "31.12.2013",
"01.01.2014", "06.01.2014", "14.02.2014", "30.03.2014",
"18.04.2014", "20.04.2014", "21.04.2014", "01.05.2014",
"03.05.2014", "03.05.2014", "26.05.2014", "08.06.2014",
"19.06.2014", "23.06.2014", "15.08.2014", "26.10.2014",
"01.11.2014", "11.11.2014", "24.12.2014", "25.12.2014",
"26.12.2014", "31.12.2014",
"01.01.2015", "06.01.2015", "14.02.2015", "29.03.2015",
"03.04.2015", "05.04.2015", "06.04.2015", "01.05.2015",
"03.05.2015", "24.05.2015", "26.05.2015", "04.06.2015",
"23.06.2015", "15.08.2015", "25.10.2015", "01.11.2015",
"11.11.2015", "24.12.2015", "25.12.2015", "26.12.2015",
"31.12.2015"], dayfirst=True))
#Number of days until next holiday
d_until_next_holiday = []
#Number of days since last holiday
d_since_last_holiday = []
for row in data.itertuples():
    next_special_date = holidays[holidays >= row["Date"]].iloc[0]
    d_until_next_holiday.append((next_special_date - row["Date"])/pd.Timedelta('1D'))
    previous_special_date = holidays[holidays <= row.index].iloc[-1]
    d_since_last_holiday.append((row["Date"] - previous_special_date)/pd.Timedelta('1D'))
#Add new cols to DF
sto2STG14["d_until_next_holiday"] = d_until_next_holiday
sto2STG14["d_since_last_holiday"] = d_since_last_holiday
Nevertheless, I get an error like below:
TypeError: tuple indices must be integers or slices, not str
Why do I get this error? I know that row is a tuple, but I use .iloc[0] and .iloc[-1] in my code. What can I do?
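The immediate cause of the TypeError is that itertuples() yields namedtuples, which are accessed by attribute (row.Date) or by position, not by string key, so row["Date"] fails. A minimal sketch of that fix, assuming the date column in data is named Date:

for row in data.itertuples():
    # attribute access instead of row["Date"]; also compare against the row's date, not row.index
    next_special_date = holidays[holidays >= row.Date].iloc[0]
    d_until_next_holiday.append((next_special_date - row.Date) / pd.Timedelta('1D'))
    previous_special_date = holidays[holidays <= row.Date].iloc[-1]
    d_since_last_holiday.append((row.Date - previous_special_date) / pd.Timedelta('1D'))

This still assumes every date has at least one holiday before and after it; otherwise .iloc[0] or .iloc[-1] will raise an IndexError.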
With pandas, you rarely need to loop. In this case, the .shift method allows you to compute everything in one go:
import pandas
holidays = pandas.Series(pandas.to_datetime([
"01.01.2013", "06.01.2013", "14.02.2013","29.03.2013",
"31.03.2013", "01.04.2013", "01.05.2013", "03.05.2013",
"19.05.2013", "26.05.2013", "30.05.2013", "23.06.2013",
"15.07.2013", "27.10.2013", "01.11.2013", "11.11.2013",
"24.12.2013", "25.12.2013", "26.12.2013", "31.12.2013",
"01.01.2014", "06.01.2014", "14.02.2014", "30.03.2014",
"18.04.2014", "20.04.2014", "21.04.2014", "01.05.2014",
"03.05.2014", "03.05.2014", "26.05.2014", "08.06.2014",
"19.06.2014", "23.06.2014", "15.08.2014", "26.10.2014",
"01.11.2014", "11.11.2014", "24.12.2014", "25.12.2014",
"26.12.2014", "31.12.2014",
"01.01.2015", "06.01.2015", "14.02.2015", "29.03.2015",
"03.04.2015", "05.04.2015", "06.04.2015", "01.05.2015",
"03.05.2015", "24.05.2015", "26.05.2015", "04.06.2015",
"23.06.2015", "15.08.2015", "25.10.2015", "01.11.2015",
"11.11.2015", "24.12.2015", "25.12.2015", "26.12.2015",
"31.12.2015"
], dayfirst=True)
)
results = (
holidays
.sort_values()
.to_frame('holiday')
.assign(
days_since_prev=lambda df: df['holiday'] - df['holiday'].shift(1),
days_until_next=lambda df: df['holiday'].shift(-1) - df['holiday'],
)
)
results.head(10)
And I get:
holiday days_since_prev days_until_next
0 2013-01-01 NaT 5 days
1 2013-01-06 5 days 39 days
2 2013-02-14 39 days 43 days
3 2013-03-29 43 days 2 days
4 2013-03-31 2 days 1 days
5 2013-04-01 1 days 30 days
6 2013-05-01 30 days 2 days
7 2013-05-03 2 days 16 days
8 2013-05-19 16 days 7 days
9 2013-05-26 7 days 4 days
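The table above gives the gaps between consecutive holidays. If you also need the gaps relative to the dates in your own data frame, as in the original loop, here is a sketch using pandas.merge_asof, assuming data has a datetime64 column named Date:

# Sketch: attach the previous and the next holiday to each row of `data`.
# Rows with no holiday on one side get NaT, so the gap becomes NaN instead of raising.
hol = holidays.sort_values().to_frame('holiday')
data = data.sort_values('Date').reset_index(drop=True)
prev_hol = pandas.merge_asof(data, hol, left_on='Date', right_on='holiday', direction='backward')
next_hol = pandas.merge_asof(data, hol, left_on='Date', right_on='holiday', direction='forward')
data['d_since_last_holiday'] = (data['Date'] - prev_hol['holiday']) / pandas.Timedelta('1D')
data['d_until_next_holiday'] = (next_hol['holiday'] - data['Date']) / pandas.Timedelta('1D')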
I have a pandas DataFrame with a column "StartTime" that could be any datetime value. I would like to create a second column that gives the StartTime relative to the beginning of the week (i.e., 12am on the previous Sunday). For example, this post is 5 days, 14 hours since the beginning of this week.
StartTime
1 2007-01-19 15:59:24
2 2007-03-01 04:16:08
3 2006-11-08 20:47:14
4 2008-09-06 23:57:35
5 2007-02-17 18:57:32
6 2006-12-09 12:30:49
7 2006-11-11 11:21:34
I can do this, but it's pretty dang slow:
def time_since_week_beg(x):
    y = x.to_datetime()
    return pd.Timedelta(days=y.weekday(),
                        hours=y.hour,
                        minutes=y.minute,
                        seconds=y.second
                        )
df['dt'] = df.StartTime.apply(time_since_week_beg)
What I want is something like this, that doesn't result in an error:
df['dt'] = pd.Timedelta(days=df.StartTime.dt.dayofweek,
hours=df.StartTime.dt.hour,
minute=df.StartTime.dt.minute,
second=df.StartTime.dt.second
)
TypeError: Invalid type <class 'pandas.core.series.Series'>. Must be int or float.
Any thoughts?
You can use a list comprehension:
df['dt'] = [pd.Timedelta(days=ts.dayofweek,
hours=ts.hour,
minutes=ts.minute,
seconds=ts.second)
for ts in df.StartTime]
>>> df
StartTime dt
0 2007-01-19 15:59:24 4 days 15:59:24
1 2007-03-01 04:16:08 3 days 04:16:08
2 2006-11-08 20:47:14 2 days 20:47:14
3 2008-09-06 23:57:35 5 days 23:57:35
4 2007-02-17 18:57:32 5 days 18:57:32
5 2006-12-09 12:30:49 5 days 12:30:49
6 2006-11-11 11:21:34 5 days 11:21:34
Depending on the format of StartTime, you may need:
...for ts in pd.to_datetime(df.StartTime)
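A vectorized alternative is also possible, as a sketch assuming StartTime is (or can be parsed to) datetime64: subtract each timestamp's midnight and add one whole day per weekday.

# Time elapsed since that day's midnight, plus whole days for the weekday (Monday = 0),
# matching the Timedelta values produced by the list comprehension above.
start = pd.to_datetime(df.StartTime)
df['dt'] = (start - start.dt.normalize()) + pd.to_timedelta(start.dt.dayofweek, unit='D')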