I would like to iterate through a dataframe rows and concatenate that row to a different dataframe basically building up a different dataframe with some rows.
For example:
`IPCSection and IPCClass Dataframes
allcolumns = np.concatenate((IPCSection.columns, IPCClass.columns), axis = 0)
finalpatentclasses = pd.DataFrame(columns=allcolumns)
for isec, secrow in IPCSection.iterrows():
for icl, clrow in IPCClass.iterrows():
if (secrow[0] in clrow[0]):
pdList = [finalpatentclasses, pd.DataFrame(secrow), pd.DataFrame(clrow)]
finalpatentclasses = pd.concat(pdList, axis=0, ignore_index=True)
display(finalpatentclasses)
The output is:
I want the nan values to dissapear and move all the data under the correct columns. I tried axis = 1 but messes up the column names. Append does not work as well all values are placed diagonally at the table with nan values as well.
Alright, I have figured it out. The idea is that you create a newrowDataframe and concatenate all the data in a list from there you can add it to the dataframe and then conc with the final dataframe.
Here is the code:
allcolumns = np.concatenate((IPCSection.columns, IPCClass.columns), axis = 0)
finalpatentclasses = pd.DataFrame(columns=allcolumns)
for isec, secrow in IPCSection.iterrows():
for icl, clrow in IPCClass.iterrows():
newrow = pd.DataFrame(columns=allcolumns)
values = np.concatenate((secrow.values, subclrow.values), axis=0)
newrow.loc[len(newrow.index)] = values
finalpatentclasses = pd.concat([finalpatentclasses, newrow], axis=0)
finalpatentclasses.reset_index(drop=false, inplace=True)
display(finalpatentclasses)
Update the code below is more efficient:
allcolumns = np.concatenate((IPCSection.columns, IPCClass.columns, IPCSubClass.columns, IPCGroup.columns), axis = 0)
newList = []
for secrow in IPCSection.itertuples():
for clrow in IPCClass.itertuples():
if (secrow[1] in clrow[1]):
values = ([secrow[1], secrow[2], subclrow[1], subclrow[2]])
new_row = {IPCSection.columns[0]: [secrow[1]], IPCSection.columns[1]: [secrow[2]],
IPCClass.columns[0]: [clrow[1]], IPCClass.columns[1]: [clrow[2]]}
newList.append(values)
finalpatentclasses = pd.DataFrame(newList, columns=allcolumns)
display(finalpatentclasses)
I have this code which I thought would reformat the dataframe so that the columns with the same column name would be replaced by their duplicates.
# Function that splits dataframe into two separate dataframes, one with all unique
# columns and one with all duplicates
def sub_dataframes(dataframe):
# Extract common prefix -> remove trailing digits
columns = dataframe.columns.str.replace(r'\d*$', '', regex=True).to_series().value_counts()
# Split columns
unq_cols = columns[columns == 1].index
dup_cols = dataframe.columns[~dataframe.columns.isin(unq_cols)] # All columns from
dataframe that is not in unq_cols
return dataframe[unq_cols], dataframe[dup_cols]
unq_df = sub_dataframes(df)[0]
dup_df = sub_dataframes(df)[1]
print("Unique columns:\n\n{}\n\nDuplicate
columns:\n\n{}".format(unq_df.columns.tolist(), dup_df.columns.tolist()))
Output:
Unique columns:
['total_tracks', 'popularity']
Duplicate columns:
['t_dur0', 't_dur1', 't_dur2', 't_dance0', 't_dance1', 't_dance2', 't_energy0', 't_energy1', 't_energy2',
't_key0', 't_key1', 't_key2', 't_speech0', 't_speech1', 't_speech2', 't_acous0', 't_acous1', 't_acous2',
't_ins0', 't_ins1', 't_ins2', 't_live0', 't_live1', 't_live2', 't_val0', 't_val1', 't_val2', 't_tempo0',
't_tempo1', 't_tempo2']
Then I tried to use wide_to_long to combine columns with the same name:
cols = unq_df.columns.tolist()
temp = pd.wide_to_long(dataset.reset_index(), stubnames=['t_dur','t_dance', 't_energy', 't_key', 't_mode',
't_speech', 't_acous', 't_ins', 't_live', 't_val',
't_tempo'], i=['index'] + cols, j='temp', sep='t_')
.reset_index().groupby(cols, as_index=False).mean()
temp
Which gave me this output:
I tried to look at this question, but the dataframe that's returned has "Nothing to show". What am I doing wrong here? How do I fix this?
EDIT
Here is an example of how I've done it "by-hand", but I am trying to do it more efficiently using the already defined built-in functions.
The desired output is the dataframe that is shown last.
I have a table like this (with more columns):
date,Sector,Value1,Value2
14/03/22,Medical,86,64
14/03/22,Medical,464,99
14/03/22,Industry,22,35
14/03/22,Services,555,843
15/03/22,Services,111,533
15/03/22,Industry,222,169
15/03/22,Medical,672,937
15/03/22,Medical,5534,825
I have created some features like this:
sectorGroup = df.groupby(["date","Sector"])["Value1","Value2"].mean().reset_index()
df = pd.merge(df,sectorGroup,on=["date","Sector"],how="left",suffixes=["","_bySector"])
dateGroupGroup = df.groupby(["date"])["Value1","Value2"].mean().reset_index()
df = pd.merge(df,dateGroupGroup,on=["date"],how="left",suffixes=["","_byDate"])
Now my new df looks like this:
date,Sector,Value1,Value2,Value1_bySector,Value2_bySector,Value1_byDate,Value2_byDate
14/03/22,Medical,86,64,275.0,81.5,281.75,260.25
14/03/22,Medical,464,99,275.0,81.5,281.75,260.25
14/03/22,Industry,22,35,22.0,35.0,281.75,260.25
14/03/22,Services,555,843,555.0,843.0,281.75,260.25
15/03/22,Services,111,533,111.0,533.0,1634.75,616.0
15/03/22,Industry,222,169,222.0,169.0,1634.75,616.0
15/03/22,Medical,672,937,3103.0,881.0,1634.75,616.0
15/03/22,Medical,5534,825,3103.0,881.0,1634.75,616.0
Now, I want to create lag features for Value1_bySector,Value2_bySector,Value1_byDate,Value2_byDate
For example, a new column named Value1_by_Date_lag1 and Value1_bySector_lag1.
And this new column will look like this:
date,Sector,Value1_by_Date_lag1,Value1_bySector_lag1
15/03/22,Services,281.75,555.0
15/03/22,Industry,281.75,22.0
15/03/22,Medical,281.75,275.0
15/03/22,Medical,281.75,275.0
Basically in Value1_by_Date_lag1, the date "15/03" will contain the value "281.75" which is for the date "14/03" (lag of 1 shift).
Basically in Value1_bySector_lag1, the date "15/03" and Sector "Medical" will contain the value "275.0", which is the value for "14/03" and "Medical" rows.
I hope, the question is clear and gave you all the details.
Create a lagged date variable by shifting the date column, and then merge again with dateGroupGroup and sectorGroup using the lagged date instead of the actual date.
df = pd.read_csv(io.StringIO("""date,Sector,Value1,Value2
14/03/22,Medical,86,64
14/03/22,Medical,464,99
14/03/22,Industry,22,35
14/03/22,Services,555,843
15/03/22,Services,111,533
15/03/22,Industry,222,169
15/03/22,Medical,672,937
15/03/22,Medical,5534,825"""))
# Add a lagged date variable
lagged = df.groupby("date")["date"].first().shift()
df = df.join(lagged, on="date", rsuffix="_lag")
# Create date and sector groups and merge them into df, as you already do
sectorGroup = df.groupby(["date","Sector"])[["Value1","Value2"]].mean().reset_index()
df = pd.merge(df,sectorGroup,on=["date","Sector"],how="left",suffixes=["","_bySector"])
dateGroupGroup = df.groupby("date")[["Value1","Value2"]].mean().reset_index()
df = pd.merge(df, dateGroupGroup, on="date",how="left", suffixes=["","_byDate"])
# Merge again, this time matching the lagged date in df to the actual date in sectorGroup and dateGroupGroup
df = pd.merge(df, sectorGroup, left_on=["date_lag", "Sector"], right_on=["date", "Sector"], how="left", suffixes=["", "_by_sector_lag"])
df = pd.merge(df, dateGroupGroup, left_on="date_lag", right_on="date", how="left", suffixes=["", "_by_date_lag"])
# Drop the extra unnecessary columns that have been created in the merge
df = df.drop(columns=['date_by_date_lag', 'date_by_sector_lag'])
This assumes the data is sorted by date - if not you will have to sort before generating the lagged date. It will work whether or not all the dates are consecutive.
I found 1 inefficient solution (slow and memory intensive).
Lag of "date" group
cols = ["Value1_byDate","Value2_byDate"]
temp = df[["date"]+cols]
temp = temp.drop_duplicates()
for i in range(10):
temp.date = temp.date.shift(-1-i)
df = pd.merge(df,temp,on="date",how="left",suffixes=["","_lag"+str(i+1)])
Lag of "date" and "Sector" group
cols = ["Value1_bySector","Value2_bySector"]
temp = df[["date","Sector"]+cols]
temp = temp.drop_duplicates()
for i in range(10):
temp[["Value1_bySector","Value2_bySector"]] = temp.groupby("Sector")["Value1_bySector","Value2_bySector"].shift(1+1)
df = pd.merge(df,temp,on=["date","Sector"],how="left",suffixes=["","_lag"+str(i+1)])
Is there a more simple solution?
I have a dataframe with multiple columns. When I execute the following code it assigns the header for the first column to the second column thereby making the first column inaccessible.
COLUMN_NAMES = ['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
'smoothness_mean', 'compactness_mean', 'concavity_mean',
'concave_points_mean', 'symmetry_mean', 'fractal_dimension_mean',
'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
'compactness_se', 'concavity_se', 'concave_points_se', 'symmetry_se',
'fractal_dimension_se', 'radius_worst', 'texture_worst',
'perimeter_worst', 'area_worst', 'smoothness_worst',
'compactness_worst', 'concavity_worst', 'concave_points_worst',
'symmetry_worst']
TUMOR_TYPE = ['M', 'B']
path_to_file = list(files.upload().keys())[0]
data = pd.read_csv(path_to_file, names=COLUMN_NAMES, header=0)
print(data)
id diagnosis ... concave_points_worst symmetry_worst
842302 M 17.99 ... 0.4601 0.11890
842517 M 20.57 ... 0.2750 0.08902
84300903 M 19.69 ... 0.3613 0.08758
The id tag is supposed to be associated with the first column but it's associated with the second one resulting in the last column header to get deleted.
pd.read_csv is going to make your first column the index rather than a column like the rest of what is on your list.
You could update it to be:
path_to_file = list(files.upload().keys())[0]
data = pd.read_csv(path_to_file, names=COLUMN_NAMES, header=0,index_col = False)
to make sure that first column isn't treated as the index.
I'm looking to append a multi-index column headers to an existing dataframe, this is my current dataframe.
Name = pd.Series(['John','Paul','Sarah'])
Grades = pd.Series(['A','A','B'])
HumanGender = pd.Series(['M','M','F'])
DogName = pd.Series(['Rocko','Oreo','Cosmo'])
Breed = pd.Series(['Bulldog','Poodle','Golden Retriever'])
Age = pd.Series([2,5,4])
DogGender = pd.Series(['F','F','F'])
SchoolName = pd.Series(['NYU','UCLA','UCSD'])
Location = pd.Series(['New York','Los Angeles','San Diego'])
df = (pd.DataFrame({'Name':Name,'Grades':Grades,'HumanGender':HumanGender,'DogName':DogName,'Breed':Breed,
'Age':Age,'DogGender':DogGender,'SchoolName':SchoolName,'Location':Location}))
I want add 3 columns on top of the existing columns I already have. For example, columns [0,1,2,3] should be labeled 'People', columns [4,5,6] should be labeled 'Dogs', and columns [7,8] should be labeled 'Schools'. In the final result, it should be 3 columns on top of 9 columns.
Thanks!
IIUC, you can do:
newlevel = ['People']*4 + ['Dogs']*3 + ['Schools']*2
df.columns = pd.MultiIndex.from_tuples([*zip(newlevel, df.columns)])
Note [*zip(newlevel, df.columns)] is equivalent to
[(a,b) for a,b in zip(new_level, df.columns)]