How to fetch preceding ids on the fly using pandas - python

I have a data frame as shown below
df = pd.DataFrame({'subject_id':[11,11,11,12,12,12],
                   'test_date':['02/03/2012 10:24:21','05/01/2019 10:41:21','12/13/2011 11:14:21','10/11/1992 11:14:21','02/23/2002 10:24:21','07/19/2005 10:24:21'],
                   'original_enc':['A742','B963','C354','D563','J323','G578']})
hash_file = pd.DataFrame({'source_enc':['A742','B963','C354','D563','J323','G578'],
                          'hash_id':[1,2,3,4,5,6]})
cols = ["subject_id","test_date","enc_id","previous_enc_id"]
test_df = pd.DataFrame(columns=cols)
test_df.head()
I would like to do two things here:
1. Map original_enc to its corresponding hash_id and store it in enc_id
2. Find the previous hash_id for each subject based on the current hash_id and store it in previous_enc_id
I tried the below
test_df['subject_id'] = df['subject_id']
test_df['test_date'] = df['test_date']
test_df['enc_id'] = df['original_enc'].map(hash_file)
test_df = test_df.sort_values(['subject_id','test_date'],ascending=True)
test_df['previous_enc_id'] = test_df.groupby(['subject_id','test_date'])['enc_id'].shift(1)
However, I don't get the expected output for the previous_enc_id column: it is all NA.
I expect my output to look like the example below. You see NA in the first row of every subject in the expected output because that is their first encounter; there is no earlier record to look back to.

Use only one column for groupby:
test_df['previous_enc_id'] = test_df.groupby('subject_id')['enc_id'].shift()
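For context, here is a minimal end-to-end sketch of that idea. It assumes original_enc is mapped through a Series built from hash_file via set_index (Series.map needs a Series, dict, or callable, not a DataFrame) and that test_date is parsed so the sort is chronological rather than lexicographic:
import pandas as pd

# Map original_enc -> hash_id using a Series lookup built from hash_file
enc_map = hash_file.set_index('source_enc')['hash_id']

test_df = pd.DataFrame({
    'subject_id': df['subject_id'],
    'test_date': pd.to_datetime(df['test_date']),  # parse so sorting is chronological
    'enc_id': df['original_enc'].map(enc_map),
})

# Sort each subject's encounters by date, then take the previous enc_id per subject
test_df = test_df.sort_values(['subject_id', 'test_date'])
test_df['previous_enc_id'] = test_df.groupby('subject_id')['enc_id'].shift()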

Related

Appending values corresponding to matching date in different dataframes (in R or Python)

I have the following data:
#1. dates of 15 day frequency:
dates = seq(as.Date("2017-01-01"), as.Date("2020-12-30"), by=15)
#2. I have a dataframe containing the dates on which a certain observation is recorded per variable, as:
#3. Values corresponding to the dates in #2, as:
What I am trying to do is assign the values to their respective dates, keep NaN for the dates that have no observation, and save the result as a text file. The output should look something like below. Appreciate your help. It can be in R or in Python.
This code works on the example data you provided.
Due to the loop, it will not be the fastest way out there, but it does the job.
The date DataFrame contains the dates (your data shown in #2), and data is the DataFrame containing the values shown in #3.
# IMPORT PACKAGES AND LOAD DATA
import pandas as pd
import numpy as np
data = pd.read_csv("./data.csv")
date = pd.read_csv("./date.csv")
# GET THE UNIQUE DATES
date_unique = pd.Series(np.concatenate([date[col_name].dropna().unique() for col_name in date.columns]).flat).unique()
# GET THE END DATA FRAME READY
data_reform = pd.DataFrame(data=[], columns=["date"])
for col in data.columns:
    data_reform.insert(len(data_reform.columns), col, [])
data_reform["date"] = date_unique
data_reform.sort_values(by=["date"], inplace=True)
# ITERATE THROUGH THE DATA AND ALLOCATE IT TO THE FINAL DATA FRAME
for row_id, row_val in data.iterrows():
    for col_id, col_val in row_val.dropna().items():
        data_reform.loc[data_reform["date"] == date[col_id].iloc[row_id], col_id] = col_val
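The question also asks for the result as a text file; assuming a space-separated layout and a hypothetical output filename, the reshaped frame could be written out with:
data_reform.to_csv("data_reformatted.txt", sep=" ", index=False, na_rep="NaN")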
You can use stack, then merge, and finally spread to go back to your column-style matrix. However, I think this can create a sparse matrix (not very memory-efficient for large datasets).
library(pivottabler)
library(dplyr)
library(tidyr)
dates_df <- data.frame(read.csv("date.txt"))
dates_df$dates_formatted <- as.Date(dates_df$dates, format = "%d/%m/%Y")
dates_df <- data.frame(dates_df[,2])
names(dates_df) <- c("dates")
valid_observations <- read.csv("cp.txt", sep = " " ,na.strings=c('NaN'))
observations <- read.csv("cpAC.txt", sep = " ")
# Idea is to produce an EAV table
# date var_name value
EAV_Table <- cbind(stack(valid_observations), stack(observations))
complete_EAV <- EAV_Table[complete.cases(EAV_Table), 1:3]
complete_EAV$date <- as.Date(complete_EAV$values, format = "%Y-%m-%d")
complete_EAV <- complete_EAV[, 2:4]
names(complete_EAV) <- c("Variable", "Value", "dates")
complete_EAV$Variable <- as.factor(complete_EAV$Variable)
dates_measures <- merge(dates_df, complete_EAV, all.x = TRUE)
result <- spread(dates_measures, Variable, Value)
write.csv(result, "data_measurements.csv", row.names = FALSE)

How to create lag feature in pandas in this case?

I have a table like this (with more columns):
date,Sector,Value1,Value2
14/03/22,Medical,86,64
14/03/22,Medical,464,99
14/03/22,Industry,22,35
14/03/22,Services,555,843
15/03/22,Services,111,533
15/03/22,Industry,222,169
15/03/22,Medical,672,937
15/03/22,Medical,5534,825
I have created some features like this:
sectorGroup = df.groupby(["date","Sector"])[["Value1","Value2"]].mean().reset_index()
df = pd.merge(df,sectorGroup,on=["date","Sector"],how="left",suffixes=["","_bySector"])
dateGroupGroup = df.groupby(["date"])[["Value1","Value2"]].mean().reset_index()
df = pd.merge(df,dateGroupGroup,on=["date"],how="left",suffixes=["","_byDate"])
Now my new df looks like this:
date,Sector,Value1,Value2,Value1_bySector,Value2_bySector,Value1_byDate,Value2_byDate
14/03/22,Medical,86,64,275.0,81.5,281.75,260.25
14/03/22,Medical,464,99,275.0,81.5,281.75,260.25
14/03/22,Industry,22,35,22.0,35.0,281.75,260.25
14/03/22,Services,555,843,555.0,843.0,281.75,260.25
15/03/22,Services,111,533,111.0,533.0,1634.75,616.0
15/03/22,Industry,222,169,222.0,169.0,1634.75,616.0
15/03/22,Medical,672,937,3103.0,881.0,1634.75,616.0
15/03/22,Medical,5534,825,3103.0,881.0,1634.75,616.0
Now, I want to create lag features for Value1_bySector, Value2_bySector, Value1_byDate, and Value2_byDate.
For example, new columns named Value1_by_Date_lag1 and Value1_bySector_lag1.
And this new column will look like this:
date,Sector,Value1_by_Date_lag1,Value1_bySector_lag1
15/03/22,Services,281.75,555.0
15/03/22,Industry,281.75,22.0
15/03/22,Medical,281.75,275.0
15/03/22,Medical,281.75,275.0
In Value1_by_Date_lag1, the date "15/03" will contain the value "281.75", which is the value for the date "14/03" (a lag of 1 shift).
Similarly, in Value1_bySector_lag1, the date "15/03" and Sector "Medical" will contain the value "275.0", which is the value for the "14/03" "Medical" rows.
I hope the question is clear and gives you all the details.
Create a lagged date variable by shifting the date column, and then merge again with dateGroupGroup and sectorGroup using the lagged date instead of the actual date.
import io
import pandas as pd

df = pd.read_csv(io.StringIO("""date,Sector,Value1,Value2
14/03/22,Medical,86,64
14/03/22,Medical,464,99
14/03/22,Industry,22,35
14/03/22,Services,555,843
15/03/22,Services,111,533
15/03/22,Industry,222,169
15/03/22,Medical,672,937
15/03/22,Medical,5534,825"""))
# Add a lagged date variable
lagged = df.groupby("date")["date"].first().shift()
df = df.join(lagged, on="date", rsuffix="_lag")
# Create date and sector groups and merge them into df, as you already do
sectorGroup = df.groupby(["date","Sector"])[["Value1","Value2"]].mean().reset_index()
df = pd.merge(df,sectorGroup,on=["date","Sector"],how="left",suffixes=["","_bySector"])
dateGroupGroup = df.groupby("date")[["Value1","Value2"]].mean().reset_index()
df = pd.merge(df, dateGroupGroup, on="date",how="left", suffixes=["","_byDate"])
# Merge again, this time matching the lagged date in df to the actual date in sectorGroup and dateGroupGroup
df = pd.merge(df, sectorGroup, left_on=["date_lag", "Sector"], right_on=["date", "Sector"], how="left", suffixes=["", "_by_sector_lag"])
df = pd.merge(df, dateGroupGroup, left_on="date_lag", right_on="date", how="left", suffixes=["", "_by_date_lag"])
# Drop the extra unnecessary columns that have been created in the merge
df = df.drop(columns=['date_by_date_lag', 'date_by_sector_lag'])
This assumes the data is sorted by date - if not you will have to sort before generating the lagged date. It will work whether or not all the dates are consecutive.
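If the rows are not already ordered, a small pre-step could sort them chronologically first (a sketch assuming the dd/mm/yy format shown above; the helper column name is arbitrary):
# Sort chronologically before building the lagged date
df["date_parsed"] = pd.to_datetime(df["date"], format="%d/%m/%y")
df = df.sort_values("date_parsed").drop(columns="date_parsed")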
I found one inefficient solution (slow and memory intensive).
Lag of the "date" group:
cols = ["Value1_byDate","Value2_byDate"]
base = df[["date"]+cols].drop_duplicates()
for i in range(10):
    # shift the dates of a fresh copy so each merge attaches the values from i+1 dates back
    temp = base.copy()
    temp["date"] = temp["date"].shift(-(i+1))
    df = pd.merge(df,temp,on="date",how="left",suffixes=["","_lag"+str(i+1)])
Lag of the "date" and "Sector" group:
cols = ["Value1_bySector","Value2_bySector"]
base = df[["date","Sector"]+cols].drop_duplicates()
for i in range(10):
    # lag the per-sector values by i+1 rows within each sector, again on a fresh copy
    temp = base.copy()
    temp[cols] = temp.groupby("Sector")[cols].shift(i+1)
    df = pd.merge(df,temp,on=["date","Sector"],how="left",suffixes=["","_lag"+str(i+1)])
Is there a simpler solution?

Reformatting a dataframe to access it for sort after concatenating two series

I've joined or concatenated two series into a dataframe. However, one of the issues I'm now facing is that I have no column headings on the actual data that would help me do a sort.
hist_a = pd.crosstab(category_a, category, normalize=True)
hist_b = pd.crosstab(category_b, category, normalize=True)
counts_a = pd.Series(np.diag(hist_a), index=[hist_a.index])
counts_b = pd.Series(np.diag(hist_b), index=[hist_b.index])
df_plots = pd.concat([counts_a, counts_b], axis=1).fillna(0)
The data looks like the following:
0 1
category
0017817703277 0.000516 5.384341e-04
0017817703284 0.000516 5.384341e-04
0017817731348 0.000216 2.856169e-04
0017817731355 0.000216 2.856169e-04
I'd like to do a sort, but there are no proper column headings:
df_plots = df_plots.sort_values(by=['0?'])
The dataframe seems to be in two parts. How could I better structure it to have 'proper' column names such as '0' or 'plot a', rather than columns only addressable by an integer, which seems hard to work with?
category plot a plot b
0017817703277 0.000516 5.384341e-04
0017817703284 0.000516 5.384341e-04
0017817731348 0.000216 2.856169e-04
0017817731355 0.000216 2.856169e-04
Just rename the columns of the dataframe, for example:
df = pd.DataFrame({0:[1,23]})
df = df.rename(columns={0:'new name'})
If you have a lot of columns, you can rename all of them at once, like:
df = pd.DataFrame({0:[1,23]})
rename_dict = {key: f'Col {key}' for key in df.keys() }
df = df.rename(columns=rename_dict)
You can also define the series with the name, so you avoid changing the name afterwards:
counts_a = pd.Series(np.diag(hist_a), index=[hist_a.index], name = 'counts_a')
counts_b = pd.Series(np.diag(hist_b), index=[hist_b.index], name = 'counts_b')
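With named columns in place, the sort from the question becomes straightforward. A short usage sketch, assuming the series were created with the name= arguments above:
df_plots = pd.concat([counts_a, counts_b], axis=1).fillna(0)
df_plots = df_plots.sort_values(by='counts_a', ascending=False)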

Concatenate specific columns in pandas

I'm trying to concatenate several different datasets in pandas. I can concatenate them, but the result has several columns with the same name. How do I produce only one column per name, instead of multiples?
concatenated_dataframes = pd.concat(
    [
        dice.reset_index(drop=True),
        json.reset_index(drop=True),
        flexjobs.reset_index(drop=True),
        indeed.reset_index(drop=True),
        simply.reset_index(drop=True),
    ],
    axis=1,
    ignore_index=True,
)
concatenated_dataframes_columns = [
    list(dice.columns),
    list(json.columns),
    list(flexjobs.columns),
    list(indeed.columns),
    list(simply.columns),
]
flatten = lambda nested_lists: [item for sublist in nested_lists for item in sublist]
concatenated_dataframes.columns = flatten(concatenated_dataframes_columns)
df = concatenated_dataframes
This results in
UNNAMED: 0 TITLE COMPANY DESCRIPTION LOCATION TITLE JOBLOCATION POSTEDDATE DETAILSPAGEURL COMPANYPAGEURL COMPANYLOGOURL SALARY CLIENTBRANDID COMPANYNAME EMPLOYMENTTYPE SUMMARY SCORE EASYAPPLY EMPLOYERTYPE WORKFROMHOMEAVAILABILITY ISREMOTE UNNAMED: 0 TITLE SALARY JOBTYPE LOCATION DESCRIPTION UNNAMED: 0 TITLE SALARY JOBTYPE DESCRIPTION LOCATION UNNAMED: 0 COMPANY DESCRIPTION LOCATION SALARY TITLE
Again, how do I combine all the 'title' values into one column, all the 'location' values into one column, and so on, instead of having multiples of them?
I think we can get away with making a blank dataframe that just has the columns we will want at the end and then concat() everything onto it.
import numpy as np
import pandas as pd
all_columns = list(dice.columns) + list(json.columns) + list(flexjobs.columns) + list(indeed.columns) + list(simply.columns)
all_unique_columns = np.unique(np.array(all_columns))  # this will, as the name suggests, give a list of just the unique columns. You could run print(all_unique_columns) to make sure it has what you want
df = pd.DataFrame(columns=all_unique_columns)
# stack everything onto the blank frame; same-named columns are aligned into a single column
df = pd.concat([df, dice, json, flexjobs, indeed, simply], axis=0)
It's a little tricky not having reproducible examples of the dataframes that you have. I tested this on a small mock-up example I put together, but let me know if it works for your more complex example.
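For illustration, here is a tiny hypothetical mock-up of that behaviour (the real dice/json/flexjobs/indeed/simply frames aren't shown in the question, so these stand-ins are invented):
import pandas as pd

# Hypothetical stand-ins for two of the scraped datasets
dice = pd.DataFrame({"TITLE": ["Data Engineer"], "LOCATION": ["NYC"], "SALARY": ["100k"]})
simply = pd.DataFrame({"TITLE": ["Analyst"], "COMPANY": ["Acme"], "LOCATION": ["Remote"]})

# Concatenating along axis=0 aligns columns by name, so each name appears once
combined = pd.concat([dice, simply], axis=0, ignore_index=True)
print(combined.columns.tolist())  # one TITLE, one LOCATION, plus SALARY and COMPANY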

Inaccessible first column in pandas dataframe?

I have a dataframe with multiple columns. When I execute the following code, it assigns the header for the first column to the second column, thereby making the first column inaccessible.
import pandas as pd
from google.colab import files  # presumably where files.upload() below comes from

COLUMN_NAMES = ['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
                'smoothness_mean', 'compactness_mean', 'concavity_mean',
                'concave_points_mean', 'symmetry_mean', 'fractal_dimension_mean',
                'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
                'compactness_se', 'concavity_se', 'concave_points_se', 'symmetry_se',
                'fractal_dimension_se', 'radius_worst', 'texture_worst',
                'perimeter_worst', 'area_worst', 'smoothness_worst',
                'compactness_worst', 'concavity_worst', 'concave_points_worst',
                'symmetry_worst']
TUMOR_TYPE = ['M', 'B']
path_to_file = list(files.upload().keys())[0]
data = pd.read_csv(path_to_file, names=COLUMN_NAMES, header=0)
print(data)
id diagnosis ... concave_points_worst symmetry_worst
842302 M 17.99 ... 0.4601 0.11890
842517 M 20.57 ... 0.2750 0.08902
84300903 M 19.69 ... 0.3613 0.08758
The id header is supposed to be associated with the first column, but it's associated with the second one, resulting in the last column header being dropped.
pd.read_csv is going to make your first column the index rather than a regular column like the rest of the names on your list, because fewer names were passed than there are columns in the file.
You could update it to be:
path_to_file = list(files.upload().keys())[0]
data = pd.read_csv(path_to_file, names=COLUMN_NAMES, header=0, index_col=False)
to make sure that first column isn't treated as the index.
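As a rough illustration of the behaviour (using a hypothetical three-column sample rather than the real file): when fewer names are supplied than there are columns, pandas promotes the extra leading column to the index, so every header shifts one position to the right.
import io
import pandas as pd

sample = "842302,M,17.99\n842517,M,20.57\n"
# Two names for three data columns: the first column becomes the index,
# so 'id' ends up labelling the M/B column, exactly as in the question.
print(pd.read_csv(io.StringIO(sample), names=["id", "diagnosis"]))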
