Assign a value to a group of rows pandas - python

I have a large DataFrame and I am interested in three main columns: individual id, education, and year. I would like to create a new variable, educ85, that assigns to every row the education the individual had in 1985, regardless of the year in that row. I want to do this without a loop, since my data is very large, and without knowing the individual IDs in advance. Below is an example with made-up data showing what I need.
# This is the initial data frame:
import pandas as pd
individual_id = [1,1,1,1,2,2,2,2,2,4,4,4,4,5,5,6,6,6,7,7,8,9,9,9,9,9,9]
education = [1,1,2,3,1,2,3,3,3,2,2,3,4,3,4,4,4,5,1,2,2,1,2,3,3,3,3]
year = [1984,1985,1986,1987,1983,1984,1985,1986,1987,1985,1986,1987,1989,1984,1985,1984,1985,1986,1985,1986,1985,1983,1984,1985,1986,1987,1987]
df = pd.DataFrame()
df["id"] = individual_id
df["education"] = education
df["year"] = year
# And the desired outcome is to create the variable educ85:
df2 = df.copy()
df2["educ85"] = educ85  # educ85 is the column I am trying to build
Many thanks!!

Try this:
df = df.join(
    other=df[df.year == 1985][['id', 'education']].set_index('id').rename(
        columns={'education': 'educ85'}
    ),
    how='left',
    on='id'
)
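One caveat worth hedging: set_index('id') assumes each id has at most one 1985 row; if an id could have several, the join would duplicate rows. A defensive sketch (not needed for the sample data above) de-duplicates first:
educ_1985 = (
    df[df.year == 1985]
    .drop_duplicates('id')[['id', 'education']]
    .set_index('id')
    .rename(columns={'education': 'educ85'})
)
df = df.join(educ_1985, how='left', on='id')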

I tend to like approaches that are relatively simple.
Initial data setup.
individual_id = [1,1,1,1,2,2,2,2,2,4,4,4,4,5,5,6,6,6,7,7,8,9,9,9,9,9,9]
education = [1,1,2,3,1,2,3,3,3,2,2,3,4,3,4,4,4,5,1,2,2,1,2,3,3,3,3]
year = [1984,1985,1986,1987,1983,1984,1985,1986,1987,1985,1986,1987,1989,1984,1985,1984,1985,1986,1985,1986,1985,1983,1984,1985,1986,1987,1987]
df = pd.DataFrame()
df["id"] = individual_id
df["education"] = education
df["year"] = year
Filter the data down to the 1985 data and then create a dictionary mapping ID to Education.
d1985 = df[df.year == 1985]
d1985 = dict(zip(d1985.id, d1985.education))
Then, create your new DataFrame -
df2 = df.copy()
df2['educ85'] = df2.id.replace(d1985)
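A closely related sketch: .map() does the same dictionary lookup, but ids that have no 1985 row come back as NaN, whereas .replace() silently leaves them as the id value.
df2['educ85'] = df2.id.map(d1985)  # unmatched ids become NaN instead of staying as the id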

Related

How can I speed up a multi-column loop?

I have a roughly 8 million row data frame consisting of sales for 615 products across 16 stores each day for five years.
I need to make new columns that contain the sales shifted back from 1 to 7 days. I've decided to sort the data frame by date, product, and location. Then I concatenate product and location into its own column.
Using that column, I loop through each unique product/location combination and make the shifted sales columns. This code is below:
import pandas as pd

# sort values by date, product, location
df = df.sort_values(['date', 'product', 'location'])
df['sort_values'] = df['product'] + "_" + df['location']

df1 = pd.DataFrame()
z = 0
for i in list(df['sort_values'].unique()):
    df_ = df[df['sort_values'] == i]
    df_ = df_.sort_values('ORD_DATE')
    df_['eaches_1'] = df_['eaches'].shift(-1)
    df_['eaches_2'] = df_['eaches'].shift(-2)
    df_['eaches_3'] = df_['eaches'].shift(-3)
    df_['eaches_4'] = df_['eaches'].shift(-4)
    df_['eaches_5'] = df_['eaches'].shift(-5)
    df_['eaches_6'] = df_['eaches'].shift(-6)
    df_['eaches_7'] = df_['eaches'].shift(-7)
    df1 = pd.concat((df1, df_))
    z += 1
    if z % 100 == 0:
        print(z)
The above code gets me exactly what I want, but takes FOREVER to complete. Is there a faster way to accomplish what I want?
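One vectorized sketch (assuming the 'sort_values', 'ORD_DATE' and 'eaches' columns from the snippet above): groupby().shift() computes the per-group shifts in one pass, without the Python loop.
import pandas as pd

# Sort once, then shift within each product/location group
df = df.sort_values(['sort_values', 'ORD_DATE'])
grouped = df.groupby('sort_values')['eaches']
for lag in range(1, 8):
    df[f'eaches_{lag}'] = grouped.shift(-lag)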

How to create lag feature in pandas in this case?

I have a table like this (with more columns):
date,Sector,Value1,Value2
14/03/22,Medical,86,64
14/03/22,Medical,464,99
14/03/22,Industry,22,35
14/03/22,Services,555,843
15/03/22,Services,111,533
15/03/22,Industry,222,169
15/03/22,Medical,672,937
15/03/22,Medical,5534,825
I have created some features like this:
sectorGroup = df.groupby(["date","Sector"])[["Value1","Value2"]].mean().reset_index()
df = pd.merge(df,sectorGroup,on=["date","Sector"],how="left",suffixes=["","_bySector"])
dateGroupGroup = df.groupby(["date"])[["Value1","Value2"]].mean().reset_index()
df = pd.merge(df,dateGroupGroup,on=["date"],how="left",suffixes=["","_byDate"])
Now my new df looks like this:
date,Sector,Value1,Value2,Value1_bySector,Value2_bySector,Value1_byDate,Value2_byDate
14/03/22,Medical,86,64,275.0,81.5,281.75,260.25
14/03/22,Medical,464,99,275.0,81.5,281.75,260.25
14/03/22,Industry,22,35,22.0,35.0,281.75,260.25
14/03/22,Services,555,843,555.0,843.0,281.75,260.25
15/03/22,Services,111,533,111.0,533.0,1634.75,616.0
15/03/22,Industry,222,169,222.0,169.0,1634.75,616.0
15/03/22,Medical,672,937,3103.0,881.0,1634.75,616.0
15/03/22,Medical,5534,825,3103.0,881.0,1634.75,616.0
Now, I want to create lag features for Value1_bySector,Value2_bySector,Value1_byDate,Value2_byDate
For example, a new column named Value1_by_Date_lag1 and Value1_bySector_lag1.
And this new column will look like this:
date,Sector,Value1_by_Date_lag1,Value1_bySector_lag1
15/03/22,Services,281.75,555.0
15/03/22,Industry,281.75,22.0
15/03/22,Medical,281.75,275.0
15/03/22,Medical,281.75,275.0
In Value1_by_Date_lag1, the rows for date "15/03" will contain the value "281.75", which is the by-date value for "14/03" (a lag of one shift).
In Value1_bySector_lag1, the rows for date "15/03" and Sector "Medical" will contain the value "275.0", which is the value for the "14/03" "Medical" rows.
I hope the question is clear and gives you all the details.
Create a lagged date variable by shifting the date column, and then merge again with dateGroupGroup and sectorGroup using the lagged date instead of the actual date.
import io
import pandas as pd

df = pd.read_csv(io.StringIO("""date,Sector,Value1,Value2
14/03/22,Medical,86,64
14/03/22,Medical,464,99
14/03/22,Industry,22,35
14/03/22,Services,555,843
15/03/22,Services,111,533
15/03/22,Industry,222,169
15/03/22,Medical,672,937
15/03/22,Medical,5534,825"""))
# Add a lagged date variable
lagged = df.groupby("date")["date"].first().shift()
df = df.join(lagged, on="date", rsuffix="_lag")
# Create date and sector groups and merge them into df, as you already do
sectorGroup = df.groupby(["date","Sector"])[["Value1","Value2"]].mean().reset_index()
df = pd.merge(df,sectorGroup,on=["date","Sector"],how="left",suffixes=["","_bySector"])
dateGroupGroup = df.groupby("date")[["Value1","Value2"]].mean().reset_index()
df = pd.merge(df, dateGroupGroup, on="date",how="left", suffixes=["","_byDate"])
# Merge again, this time matching the lagged date in df to the actual date in sectorGroup and dateGroupGroup
df = pd.merge(df, sectorGroup, left_on=["date_lag", "Sector"], right_on=["date", "Sector"], how="left", suffixes=["", "_by_sector_lag"])
df = pd.merge(df, dateGroupGroup, left_on="date_lag", right_on="date", how="left", suffixes=["", "_by_date_lag"])
# Drop the extra unnecessary columns that have been created in the merge
df = df.drop(columns=['date_by_date_lag', 'date_by_sector_lag'])
This assumes the data is sorted by date - if not you will have to sort before generating the lagged date. It will work whether or not all the dates are consecutive.
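If the rows might not be in date order, a minimal sort sketch (the temporary _d column is just illustrative; parsing the dd/mm/yy strings makes the order chronological rather than lexical):
df["_d"] = pd.to_datetime(df["date"], format="%d/%m/%y")
df = df.sort_values("_d").drop(columns="_d").reset_index(drop=True)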
I found one inefficient solution (slow and memory-intensive).
Lag of "date" group
cols = ["Value1_byDate","Value2_byDate"]
temp = df[["date"]+cols]
temp = temp.drop_duplicates()
for i in range(10):
temp.date = temp.date.shift(-1-i)
df = pd.merge(df,temp,on="date",how="left",suffixes=["","_lag"+str(i+1)])
Lag of "date" and "Sector" group
cols = ["Value1_bySector","Value2_bySector"]
temp = df[["date","Sector"]+cols]
temp = temp.drop_duplicates()
for i in range(10):
temp[["Value1_bySector","Value2_bySector"]] = temp.groupby("Sector")["Value1_bySector","Value2_bySector"].shift(1+1)
df = pd.merge(df,temp,on=["date","Sector"],how="left",suffixes=["","_lag"+str(i+1)])
Is there a simpler solution?
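For what it's worth, here is a sketch of a somewhat more compact variant for a single lag, assuming the rows are sorted by date and reusing the *_byDate / *_bySector columns built above: de-duplicate, shift once, and merge back.
# Lag 1 of the by-date means (repeat or loop for deeper lags)
by_date = df[["date", "Value1_byDate", "Value2_byDate"]].drop_duplicates().copy()
by_date[["Value1_byDate", "Value2_byDate"]] = by_date[["Value1_byDate", "Value2_byDate"]].shift(1)
df = pd.merge(df, by_date, on="date", how="left", suffixes=["", "_lag1"])

# Lag 1 of the by-sector means, shifted within each Sector
by_sector = df[["date", "Sector", "Value1_bySector", "Value2_bySector"]].drop_duplicates().copy()
by_sector[["Value1_bySector", "Value2_bySector"]] = (
    by_sector.groupby("Sector")[["Value1_bySector", "Value2_bySector"]].shift(1)
)
df = pd.merge(df, by_sector, on=["date", "Sector"], how="left", suffixes=["", "_lag1"])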

Reformatting a dataframe to access it for sort after concatenating two series

I've joined (concatenated) two series into a dataframe. However, one of the issues I'm now facing is that I have no column headings on the actual data that would help me do a sort.
hist_a = pd.crosstab(category_a, category, normalize=True)
hist_b = pd.crosstab(category_b, category, normalize=True)
counts_a = pd.Series(np.diag(hist_a), index=[hist_a.index])
counts_b = pd.Series(np.diag(hist_b), index=[hist_b.index])
df_plots = pd.concat([counts_a, counts_b], axis=1).fillna(0)
The data looks like the following:
0 1
category
0017817703277 0.000516 5.384341e-04
0017817703284 0.000516 5.384341e-04
0017817731348 0.000216 2.856169e-04
0017817731355 0.000216 2.856169e-04
and I'd like to do a sort, but there are no proper column headings
df_plots = df_plots.sort_values(by=['0?'])
But the dataframe seems to be in two parts. How could I better structure the dataframe to have 'proper' columns such as '0' or 'plot a' rather than being indexable by an integer, which seems to be hard to work with.
category plot a plot b
0017817703277 0.000516 5.384341e-04
0017817703284 0.000516 5.384341e-04
0017817731348 0.000216 2.856169e-04
0017817731355 0.000216 2.856169e-04
Just rename the columns of the dataframe, for example:
df = pd.DataFrame({0:[1,23]})
df = df.rename(columns={0:'new name'})
If you have a lot of columns, you can rename all of them at once:
df = pd.DataFrame({0:[1,23]})
rename_dict = {key: f'Col {key}' for key in df.keys() }
df = df.rename(columns=rename_dict)
You can also define the series with the name, so you avoid changing the name afterwards:
counts_a = pd.Series(np.diag(hist_a), index=[hist_a.index], name = 'counts_a')
counts_b = pd.Series(np.diag(hist_b), index=[hist_b.index], name = 'counts_b')
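Or, as a small sketch of a third route, pass keys= to pd.concat so the columns are named at concatenation time, which makes the sort straightforward (column names taken from the desired output above):
df_plots = pd.concat([counts_a, counts_b], axis=1, keys=['plot a', 'plot b']).fillna(0)
df_plots = df_plots.sort_values(by='plot a')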

How to find the total runs for every match (Can be understood by match_id and innings)

df=pd.read_csv('./ipl/all_matches.csv')
df1=df[['match_id','season','venue','innings','striker','bowler','batting_team','bowling_team','ball','runs_off_bat','extras']]
df1=df1.loc[(df1['ball'] < 6.1)]
df1['total'] = df1['runs_off_bat'] + df1['extras']
The resulting data frame should show the total runs for every match.
A few changes I've made:
Specify the columns you want to read via the usecols argument of pd.read_csv.
Group by match_id and aggregate the sum of the total column for each group.
use_col = ['match_id','season','venue','innings','striker','bowler','batting_team','bowling_team','ball','runs_off_bat','extras']
df = pd.read_csv('./ipl/all_matches.csv', usecols=use_col)
df1 = df.loc[df['ball'] < 6.1]
df1['total'] = df1['runs_off_bat'] + df1['extras']
total_run_per_match = df1.groupby('match_id').agg({'total': 'sum'})
If you want to preserve the original data frame:
df1['total_run_per_match'] = df1.groupby('match_id')['total'].transform('sum')

Python how to create new dataset from an existing one based on condition

For example:
I have this code:
import pandas
df = pandas.read_csv('covid_19_data.csv')
This dataset has a column called countryterritoryCode, which holds each country's code.
The dataset has information about COVID-19 cases from all the countries in the world.
How do I create a new dataset containing only the USA rows
(where countryterritoryCode == USA)?
import pandas
df = pandas.read_csv('covid_19_data.csv')
new_df = df[df["country"] == "USA"]
or
new_df = df[df.country == "USA"]
Or use df.groupby with get_group:
df = pandas.read_csv('covid_19_data.csv')
df_new = df.groupby('countryterritoryCode').get_group('USA')
