I am trying to join two datasets, but they do not have the same shape or the same key structure.
Currently I have the dataset below, which contains the number of fires by month and year, but the months are column headers and the years are a column.
I would like to add this data to another dataset, matching on its data_medicao column, as a new column (let's hypothetically call it nr_total_queimadas).
The date format is YYYY-MM-DD, but the day doesn't really matter here.
I tried to write a loop for this, but I think I'm doing something wrong and I don't have much idea how to proceed.
Below is an example of how I would like the output to look after joining the two datasets:
In the example some dates repeat (which is expected), so the corresponding number of fires should also repeat.
First, I assume that the first dataframe is in variable a and the second is in variable b.
To make looking up simpler, we set the index of a to year:
a = a.set_index('year')
Then, we take the years from the data_medicao in the dataframe b:
years = b['data_medicao'].dt.year
To get the month name from the dataframe b, we use strftime. Then we need to make the month name into lower case so that it matches the column names in a. To do that, we use .str.lower():
month_name_lowercase = b['data_medicao'].dt.strftime("%B").str.lower()
Then, using lookup, we can pick out the values from dataframe a at the positions given by years and month_name_lowercase:
num_fires = a.lookup(years.values, month_name_lowercase.values)
Finally add the new values into the new column in b:
b['nr_total_queimadas'] = num_fires
So the complete code is like this:
a = a.set_index('year')
years = b['data_medicao'].dt.year
month_name_lowercase = b['data_medicao'].dt.strftime("%B").str.lower()
num_fires = a.lookup(years.values, month_name_lowercase.values)
b['nr_total_queimadas'] = num_fires
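Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0. On recent versions the same result can be obtained by melting a into long form and merging; this is only a sketch under the same column-name assumptions as above:
# reshape a so that each (year, month) pair becomes one row
a_long = a.reset_index().melt(id_vars='year', var_name='month', value_name='nr_total_queimadas')
# build matching keys on b, merge, then drop the helper columns
b['year'] = b['data_medicao'].dt.year
b['month'] = b['data_medicao'].dt.strftime('%B').str.lower()
b = b.merge(a_long, on=['year', 'month'], how='left').drop(columns=['year', 'month'])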
Assume the following data for year vs. month. Convert the month names to numbers:
columns = ["year","jan","feb","mar"]
data = [
(2001,110,120,130),
(2002,210,220,230),
(2003,310,320,330)
]
df = pd.DataFrame(data=data, columns=columns)
month_map = {"jan":"1", "feb":"2", "mar":"3"}
df = df.rename(columns=month_map)
[Out]:
year 1 2 3
0 2001 110 120 130
1 2002 210 220 230
2 2003 310 320 330
Assume the following data for date-wise transactions. Extract the year and month from the date:
columns2 = ["date"]
data2 = [
("2001-02-15"),
("2001-03-15"),
("2002-01-15"),
("2002-03-15"),
("2003-01-15"),
("2003-02-15"),
]
df2 = pd.DataFrame(data=data2, columns=columns2)
df2["date"] = pd.to_datetime(df2["date"])
df2["year"] = df2["date"].dt.year
df2["month"] = df2["date"].dt.month
[Out]:
date year month
0 2001-02-15 2001 2
1 2001-03-15 2001 3
2 2002-01-15 2002 1
3 2002-03-15 2002 3
4 2003-01-15 2003 1
5 2003-02-15 2003 2
Join on year:
df2 = df2.merge(df, left_on="year", right_on="year", how="left")
[Out]:
date year month 1 2 3
0 2001-02-15 2001 2 110 120 130
1 2001-03-15 2001 3 110 120 130
2 2002-01-15 2002 1 210 220 230
3 2002-03-15 2002 3 210 220 230
4 2003-01-15 2003 1 310 320 330
5 2003-02-15 2003 2 310 320 330
Compute the row-wise sum across the month columns (i.e. the yearly total):
df2["nr_total_queimadas"] = df2[list(month_map.values())].sum(axis=1)
df2[["date", "nr_total_queimadas"]]
[Out]:
date nr_total_queimadas
0 2001-02-15 360
1 2001-03-15 360
2 2002-01-15 660
3 2002-03-15 660
4 2003-01-15 960
5 2003-02-15 960
I have a dataframe with three median rent variables. The dataframe looks like this:
region_id  year  1bed_med_rent  2bed_med_rent  3bed_med_rent
        1  2010            800           1000           1200
        1  2011            850           1050           1250
        2  2010            900           1000           1100
        2  2011            950           1050           1150
I would like to combine all rent variables into one variable using common elements of region and year like so:
region_id  year  med_rent
        1  2010      1000
        1  2011      1050
        2  2010      1000
        2  2011      1050
Using the agg() function in pandas, I have been able to perform functions on multiple variables, but I have not been able to combine variables and insert into the dataframe. I have attempted to use the assign() function in combination with the below code without success.
#Creating the group list of common IDs
group_list = ['region_id', 'year']
#Grouping by common ID and taking median values of each group
new_df = df.groupby(group_list).agg({'1bed_med_rent': ['median'],
                                     '2bed_med_rent': ['median'],
                                     '3bed_med_rent': ['median']}).reset_index()
What other method might there be for this?
Here, set_index combined with apply over the remaining columns of each row ought to do it:
(df.set_index(['region_id','year'])
.apply(lambda r:r.median(), axis=1)
.reset_index()
.rename(columns = {0:'med_rent'})
)
produces
region_id year med_rent
0 1 2010 1000.0
1 1 2011 1050.0
2 2 2010 1000.0
3 2 2011 1050.0
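A slightly more direct variant (a sketch, not part of the original answer) skips the lambda, since median already accepts axis=1:
# same result, calling median directly across the remaining columns
out = (df.set_index(['region_id', 'year'])
         .median(axis=1)
         .reset_index(name='med_rent'))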
Let's say I create the following Pandas Series, which contains a daily measurement over 10 years at three different stations:
import numpy as np
import pandas as pd
stations = ['a', 'b', 'c']
dates = pd.date_range(start = '2000-01-01', end = '2009-12-31')
index = [(stations[i], dates[j]) for i in range(len(stations)) for j in range(len(dates))]
index = pd.MultiIndex.from_tuples(index, names=["station", "date"])
x = np.random.random(len(index))
df = pd.Series(index = index, data = x)
Resulting in a Series that looks like:
>>> df
station date
a 2000-01-01 0.736381
2000-01-02 0.203178
2000-01-03 0.640063
2000-01-04 0.942664
2000-01-05 0.953994
...
c 2009-12-27 0.713189
2009-12-28 0.800085
2009-12-29 0.033923
2009-12-30 0.972547
2009-12-31 0.387804
Length: 10959, dtype: float64
Now, for each station, I want to calculate the average number of days per year that have measurement values which are greater than the daily mean on a given day.
I know I can calculate the daily mean value for each station like this:
daily_mean = df.groupby(['station',index.get_level_values('date').dayofyear]).mean()
>>> daily_mean
station date
a 1 0.529211
2 0.432048
3 0.438350
4 0.629226
5 0.523919
...
c 362 0.524537
363 0.346734
364 0.423349
365 0.433348
366 0.316085
Length: 1098, dtype: float64
But after this step, I can't figure out what to do.
Basically I want to do something like:
df['a','2000-01-01'] > daily_mean['a', 1]
df['a','2000-01-02'] > daily_mean['a', 2]
...
df['a','2000-12-31'] > daily_mean['a', 365]
...Then calculate how many days that year were above average, and do this for each year, and then take the mean number of days above average across all years. And then do that for each station.
I could probably do what I want with some painful looping, but I figure there might be a more Pandas-y way to do it?
You can compare a value for a column to the within-group column average with the following pattern.
This technique uses the transform method on a grouped dataframe, which yields a result of the same length as the original dataframe rather than condensing the rows. As an illustrative example:
test = pd.DataFrame({'A': np.random.choice(['a', 'b', 'c'], 10), 'B': np.random.beta(2, 9, 10)})
test
Out
A B
0 b 0.099245
1 c 0.081244
2 b 0.239556
3 b 0.211645
4 c 0.256624
5 c 0.091649
6 b 0.213261
7 a 0.327473
8 a 0.240529
9 c 0.235569
test.groupby('A').B.mean()
Out
A
a 0.284001
b 0.190927
c 0.166271
Name: B, dtype: float64
Using transform:
test['within_A_mean'] = test.groupby('A').B.transform('mean')
test.sort_values('A')
Out
A B within_A_mean
7 a 0.327473 0.284001
8 a 0.240529 0.284001
0 b 0.099245 0.190927
2 b 0.239556 0.190927
3 b 0.211645 0.190927
6 b 0.213261 0.190927
1 c 0.081244 0.166271
4 c 0.256624 0.166271
5 c 0.091649 0.166271
9 c 0.235569 0.166271
So, going back to your example:
# setting up the data as a dataframe instead of a series, with 'measurement' column
import numpy as np
import pandas as pd
stations = ['a', 'b', 'c']
dates = pd.date_range(start = '2000-01-01', end = '2009-12-31')
index = [(stations[i], dates[j]) for i in range(len(stations)) for j in range(len(dates))]
index = pd.MultiIndex.from_tuples(index, names=["station", "date"])
x = np.random.random(len(index))
df = pd.DataFrame(index = index, data = x, columns=['measurement'])
# create a new boolean column which will indicate if a particular measurement
# is above the average measurement for the same day of year across the dataset
df['above_average'] = df\
.groupby(df.index.get_level_values('date').dayofyear)\
.measurement\
.transform(lambda x: x > x.mean())
The expression for df['above_average'] reads: for each group (i.e. for each day of year), and for each row in that group, is the row's value greater than the group's average value of the column?
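An equivalent formulation (a sketch under the same setup) computes the group mean with transform('mean') and does the comparison outside the groupby, avoiding the Python-level lambda:
# day of year for every row, used as the grouping key
doy = df.index.get_level_values('date').dayofyear
# compare each measurement with the mean of its day-of-year group
df['above_average'] = df['measurement'] > df.groupby(doy)['measurement'].transform('mean')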
Once you have this boolean column calculated, you can easily get the number of days for each year that were above average:
df.groupby(df.index.get_level_values('date').year).above_average.mean()
Out
date
2000 0.478142
2001 0.515068
2002 0.508676
2003 0.466667
2004 0.534608
2005 0.518721
2006 0.478539
2007 0.484932
2008 0.467213
2009 0.509589
Name: above_average, dtype: float64
You can also get the overall average of days that were above day-of-year average:
df.above_average.mean()
Out
0.49621315813486633
EDIT:
To get a count rather than a mean, use sum() instead of mean() as the aggregate function. Getting this count by station and year is a matter of grouping by those values.
df = df.reset_index()
df.groupby(['station', df['date'].dt.year]).above_average.sum()
Out
station date
a 2000 193
2001 175
2002 181
2003 177
2004 163
2005 183
2006 200
2007 178
2008 180
2009 176
b 2000 159
2001 185
2002 186
2003 170
2004 188
2005 176
2006 190
2007 175
2008 185
2009 171
c 2000 183
2001 186
2002 194
2003 178
2004 181
2005 187
2006 185
2007 169
2008 195
2009 175
Name: above_average, dtype: int64
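To get back to the original question (the average number of above-average days per year for each station), you can average these yearly counts per station; a sketch building on the counts above:
# yearly counts of above-average days, then their mean per station
per_year = df.groupby(['station', df['date'].dt.year]).above_average.sum()
per_year.groupby(level='station').mean()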
I have previously worked with Stata and am now trying to get the same done with Python. However, I am having trouble with the merge command. Somehow I must be missing something. The two dataframes I want to merge look like this:
df1:
Date id Market_Cap
2000 1 400
2000 2 200
2001 1 410
2001 2 220
df2:
id Ticker
1 Shell
2 ExxonMobil
My aim now is to get the following dataset:
Date id Market_Cap Ticker
2000 1 400 Shell
2000 2 200 ExxonMobil
2001 1 410 Shell
2001 2 220 ExxonMobil
I tried the following command:
merged= pd.merge(df1, df2, how="left", on="id")
This merges the datasets, but gives me only NaNs in the Ticker column.
I looked at several sources and maybe I am mistaken, but isn't "left" the right option for my purpose? I also tried "right" and "outer". They don't give the result I want, and "inner" does not seem to work here in general.
Am I missing something crucial?
The problem is that the id column in one DataFrame has dtype object (i.e. string) and in the other it is int, so nothing matches and you get NaN.
If both have the same dtype:
print (df1['id'].dtypes)
int64
print (df2['id'].dtypes)
int64
merged = pd.merge(df1, df2, how="left", on="id")
print (merged)
Date id Market_Cap Ticker
0 2000 1 400 Shell
1 2000 2 200 ExxonMobil
2 2001 1 410 Shell
3 2001 2 220 ExxonMobil
Another solution, if you only need to add one new column, is map:
df1['Ticker'] = df1['id'].map(df2.set_index('id')['Ticker'])
print (df1)
Date id Market_Cap Ticker
0 2000 1 400 Shell
1 2000 2 200 ExxonMobil
2 2001 1 410 Shell
3 2001 2 220 ExxonMobil
Simulate your problem:
print (df1['id'].dtypes)
object
print (df2['id'].dtypes)
int64
df1['Ticker'] = df1['id'].map(df2.set_index('id')['Ticker'])
print (df1)
Date id Market_Cap Ticker
0 2000 1 400 NaN
1 2000 2 200 NaN
2 2001 1 410 NaN
3 2001 2 220 NaN
And the solution is to convert the column to int with astype (or convert the id column in df2 to str):
df1['id'] = df1['id'].astype(int)
#alternatively
#df2['id'] = df2['id'].astype(str)
df1['Ticker'] = df1['id'].map(df2.set_index('id')['Ticker'])
print (df1)
Date id Market_Cap Ticker
0 2000 1 400 Shell
1 2000 2 200 ExxonMobil
2 2001 1 410 Shell
3 2001 2 220 ExxonMobil
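If the id column might contain values that are not clean integers, a more defensive conversion (a sketch, not from the original answer) is pd.to_numeric with errors='coerce', which turns unparseable ids into missing values instead of raising:
# convert ids to a nullable integer dtype; bad values become <NA> rather than an error
df1['id'] = pd.to_numeric(df1['id'], errors='coerce').astype('Int64')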
I'm looping through a DataFrame of 200k rows. It's doing what I want but it takes hours. I'm not very sophisticated when it comes to all the ways you can join and manipulate DataFrames so I wonder if I'm doing this in a very inefficient way. It's quite simple, here's the code:
three_yr_gaps = []
for index, row in df.iterrows():
    three_yr_gaps.append(df[(df['GROUP_ID'] == row['GROUP_ID']) &
                            (df['BEG_DATE'] >= row['THREE_YEAR_AGO']) &
                            (df['END_DATE'] <= row['BEG_DATE'])]['GAP'].sum() + row['GAP'])
df['GAP_THREE'] = three_yr_gaps
The DF has a column called GAP that holds an integer value. The logic I'm using to sum this number up is:
for each row, get those rows from the dataframe:
that match on GROUP_ID, and...
that have a beginning date within the last 3 years of this row's start date, and...
that have an ending date before this row's beginning date.
Sum up those rows' GAP numbers, add this row's GAP number, then append the result to a list.
So is there a faster way to introduce this logic into some kind of automatic merge or join that could speed up this process?
PS.
I was asked for some clarification on input and output, so here's a constructed dataset to play with:
from dateutil import parser
df = pd.DataFrame( columns = ['ID_NBR','GROUP_ID','BEG_DATE','END_DATE','THREE_YEAR_AGO','GAP'],
data = [['09','185',parser.parse('2008-08-13'),parser.parse('2009-07-01'),parser.parse('2005-08-13'),44],
['10','185',parser.parse('2009-08-04'),parser.parse('2010-01-18'),parser.parse('2006-08-04'),35],
['11','185',parser.parse('2010-01-18'),parser.parse('2011-01-18'),parser.parse('2007-01-18'),0],
['12','185',parser.parse('2014-09-04'),parser.parse('2015-09-04'),parser.parse('2011-09-04'),0]])
and here's what I wrote at the top of the script, may help:
The purpose of this script is to extract gap counts over the
last 3-year period. It uses gaps.sql as its source extract. This query
returns a DataFrame that looks like this:
ID_NBR GROUP_ID BEG_DATE END_DATE THREE_YEAR_AGO GAP
09 185 2008-08-13 2009-07-01 2005-08-13 44
10 185 2009-08-04 2010-01-18 2006-08-04 35
11 185 2010-01-18 2011-01-18 2007-01-18 0
12 185 2014-09-04 2015-09-04 2011-09-04 0
The Python code then looks back at the previous 3 years (those
previous rows that have the same GROUP_ID but whose effective dates
come after their own THREE_YEAR_AGO and whose end dates come before
their own beginning date). Those rows are added up and a new column
called GAP_THREE is made. What remains is this:
ID_NBR GROUP_ID BEG_DATE END_DATE THREE_YEAR_AGO GAP GAP_THREE
09 185 2008-08-13 2009-07-01 2005-08-13 44 44
10 185 2009-08-04 2010-01-18 2006-08-04 35 79
11 185 2010-01-18 2011-01-18 2007-01-18 0 79
12 185 2014-09-04 2015-09-04 2011-09-04 0 0
You'll notice that row id_nbr 11 has a value of 79 for the last 3 years, but id_nbr 12 has 0, because the last gap was 35 in 2009, which is more than 3 years before 12's beginning date in 2014.
So, I was trying to accomplish this in SQL but was advised there would be a simple way to do this in Pandas... I would appreciate your help/hints!
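One possible vectorized rewrite (a sketch, not from the original thread, and assuming ID_NBR is unique) replaces the loop with a self-merge on GROUP_ID followed by a filter and a groupby; note that the within-group cross join can get large for big groups:
# pair every row with every row in the same GROUP_ID
pairs = df.merge(df, on='GROUP_ID', suffixes=('', '_other'))
# keep pairs where the other row falls inside this row's 3-year window
mask = (pairs['BEG_DATE_other'] >= pairs['THREE_YEAR_AGO']) & \
       (pairs['END_DATE_other'] <= pairs['BEG_DATE'])
# sum the qualifying GAPs per row, then add the row's own GAP
gap_three = pairs[mask].groupby('ID_NBR')['GAP_other'].sum()
df['GAP_THREE'] = df['ID_NBR'].map(gap_three).fillna(0) + df['GAP']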
I currently have the table on the left with two columns (begin subsession and end subsession), and I would like to add two new columns, "session start" and "session end". I know how to simply add the columns, but I can't figure out the query that would identify the continuous values in the two original columns (i.e. the end sub-session value is the same as the next row's begin sub-session value) and then put the first begin-session value and the last end-session value (for the continuous rows) into the respective rows of my new columns. Please refer to the image: for example, for the first three rows the "end subsession" value is the same as the next row's "begin subsession" value, so the first three rows get the same "session start" and "session end", namely the minimum of the "begin subsession" values and the maximum of the "end subsession" values.
I was trying something along these lines in SQL; it obviously didn't work, and I realized the aggregate function doesn't work in this case...
SELECT
FROM viewershipContinuous =
CASE
WHEN endSubsession.ROWID = beginSubession.ROWID+1
THEN MIN(beginSubsession)
ELSE beginSubsession.ROWID+1
END;
The table on the left is what I have, the table on the right is what I want to achieve
You can first compare each row's bsub with the previous row's esub (using shift and ne, i.e. !=); wherever they differ a new group starts, and cumsum turns that into group labels:
s = df['bsub'].ne(df['esub'].shift()).cumsum()
print (s)
0 1
1 1
2 1
3 2
4 2
5 2
6 2
7 3
8 3
dtype: int32
Then group by the Series s and use transform with min and max:
g = df.groupby(s)
df['session start'] = g['bsub'].transform('min')
df['session end'] = g['esub'].transform('max')
print (df)
bsub esub session start session end
0 1700 1705 1700 1800
1 1705 1730 1700 1800
2 1730 1800 1700 1800
3 1900 1920 1900 1965
4 1920 1950 1900 1965
5 1950 1960 1900 1965
6 1960 1965 1900 1965
7 2000 2001 2000 2002
8 2001 2002 2000 2002
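If you want one row per detected session instead of repeating the values on every row, the same grouping works with a named aggregation (a sketch under the same column names, using underscores since keyword arguments cannot contain spaces):
# collapse each group of continuous rows into a single session row
sessions = df.groupby(s).agg(session_start=('bsub', 'min'), session_end=('esub', 'max'))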