I'm getting the error pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects when I attempt to concatenate two dataframes. I believe the problem is somewhere in this code, where I map one column onto another.
import pandas as pd

mapping_g = {'Hospice': ['ALLCARE', 'CARING EDGE MINOT', 'CARING EDGE HERMANTOWN', 'CARING EDGE BISMARK',
                         'BLUEBIRD HOSPICE', 'DOCTORS HOSPICE', 'FIRST CHOICE HOSPICE', 'KEYSTONE HOSPICE',
                         'JOURNEY\'S HOSPICE', 'LIGHTHOUSE HOSPICE', 'SALMON VALLEY HOSPICE'],
             'Group': ['ACH1507', 'CE11507', 'CE21507', 'CE51507', 'BBH1507', 'DOC1507', 'FCH1507',
                       'KEY1507', 'JOU1507', 'LHH1507', 'SVH1507']}
g_mapping_df = pd.DataFrame(data=mapping_g)
g = dict(zip(g_mapping_df.Group, g_mapping_df.Hospice))
raw_pbm_data['Name Of Hospice'] = raw_pbm_data['GroupID'].map(g)
combined_data = pd.concat([raw_direct_data,raw_pbm_data], axis=0, ignore_index=True)
I think the problem arises on the second-to-last line, where I map the GroupID column into the Name Of Hospice column.
I realized that raw_pbm_data['Name Of Hospice'] = raw_pbm_data['GroupID'].map(g)
should actually have been raw_pbm_data['HospiceName'] = raw_pbm_data['GroupID'].map(g)
I'm working with a bunch of different Excel files whose column names are all slightly different while being similar. I can't manually rename the Excel files because they're in the format my employer wants them in, so I'm juggling the information in dataframes. The change above solved my issue: the original line was appending a new column onto the dataframe instead of replacing an existing one, which I believe explains why the concatenation was breaking.
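For anyone hitting the same error: a likely reason pd.concat raised this particular InvalidIndexError is that one of the frames ended up with duplicate column labels, so pandas could not align the columns of the two frames. A minimal sketch with made-up column names (not my real files) that reproduces it:
import pandas as pd

a = pd.DataFrame([[1, 2]], columns=['HospiceName', 'HospiceName'])  # duplicate column label
b = pd.DataFrame([[3, 4]], columns=['HospiceName', 'GroupID'])

# pd.concat([a, b], ignore_index=True) raises:
# InvalidIndexError: Reindexing only valid with uniquely valued Index objects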
Related
I'm trying to make a temporary dataframe that is created by filtering an existing dataframe (stock_data) based on two criteria:
The stock_data ticker column matches the tick_id variable
The stock_data date column is within a range from start to end (both variables are created using pd.to_datetime)
I've attempted this using two different solutions
First:
temp = stock_data[(stock_data.ticker == tick_id) & (stock_data["date"].isin(pd.date_range(start, end)))]
Second:
mask = (stock_data.ticker == tick_id) & ((stock_data.date > start) & (stock_data.date <= end))
temp = stock_data.loc[mask]
Both solutions result in the same error:
ValueError: Can only compare identically-labeled Series objects
The error is telling you that your tick_id Series has different labels from stock_data['ticker']. I'm guessing one is a ticker name like 'AAPL' and the other a numerical ticker id (or, even worse, just auto-indexed 0, 1, 2, ... from the output of some previous operation).
Solution: make ticker or tick_id the index of both series/dataframes. Do this everywhere, if you can, and avoid numerical indices; then joins, aggregations etc. become trivially easy.
(PS: this is better for most use-cases and also makes exporting to CSV or pickle more intuitive.)
Anyway, this looks like a clumsy attempt at a join operation.
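To make the error concrete, here is a minimal sketch (hypothetical data and variable names, not from the question) showing when the comparison fails and two ways around it:
import pandas as pd

stock_data = pd.DataFrame({
    'ticker': ['AAPL', 'MSFT', 'AAPL'],
    'date': pd.to_datetime(['2020-01-02', '2020-01-02', '2020-01-03']),
})

# If tick_id is a Series carrying its own (different) labels, the element-wise
# comparison cannot align the two indexes and raises the error:
tick_id = pd.Series(['AAPL', 'AAPL', 'AAPL'], index=[7, 8, 9])
# stock_data['ticker'] == tick_id
# ValueError: Can only compare identically-labeled Series objects

# Fix 1: compare against a plain scalar
temp = stock_data[stock_data['ticker'] == 'AAPL']

# Fix 2 (the advice above): make ticker the index and select by label
indexed = stock_data.set_index('ticker')
temp = indexed.loc[['AAPL']]
temp = temp[(temp['date'] > '2020-01-01') & (temp['date'] <= '2020-01-02')]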
I have this function to split the 'text' column into rows while keeping the corresponding 'emotion' value for each row. It works correctly with a premade dataframe, but it does not work with a big dataframe: when I apply the function there, it creates another column containing the 'text' column as lists.
def splitting_rows(df, subset, subset_explode, split_value=r'\s+'):
    r'''
    Creates new rows by splitting the targeted subset.
    Transforms each element of a list-like to a row, replicating index values.
    :param df: dataframe
    :param subset: target column to be split
    :param subset_explode: the column to explode, transforming each element of a list-like
        to a row, replicating index values
    :param split_value: pattern to split on.
        Note: split('\s') is almost always wrong because it creates empty strings if there is
        more than one space separator; use split(r'\s+') or a plain split() instead.
    :return: dataframe with the new, split rows
    '''
    return df.assign(text=df[subset].str.split(split_value)).explode(subset_explode)
Example of the correct output:
# DATAFRAME INPUT
df = pd.DataFrame({
    'emotion': ['joy', 'fear', 'sadness'],
    'text': ['falling love', 'involved traffic accident', 'lost person meant']
})
# EXPECTED OUTPUT
df_result = pd.DataFrame({
    'emotion': ['joy', 'joy', 'fear', 'fear', 'fear', 'sadness', 'sadness', 'sadness'],
    'text': ['falling', 'love', 'involved', 'traffic', 'accident', 'lost', 'person', 'meant']
})
# This will give the correct output
splitting_rows(df, subset='text', subset_explode='text')
Current problem with the dataframe
Emotion Text text
0 joy period falling love time met especially met lo... [period, falling, love, time, met, lo...
1 fear involved traffic accident [involved, traffic, accident]
2 anger driving home several days hard work motorist a... [driving, home, several, days, hard, work, mot...
3 sadness lost person meant [lost, person, meant]
I tried to recreate the dataframe by appending each column to a list and each list to a new dataframe (there are no NaN values) to get something similar to the first working example, but the result was the same.
I'm using this dataframe.
The issue comes from the text keyword argument in the assign call: the keyword name is what determines the column that gets assigned. In your example dataframe the column is called text, while in the online one it is Text, so assign adds a brand-new text column instead of overwriting the existing one.
The correct approach is to build the keyword name in assign dynamically, based on the value of the subset parameter.
Replace your return statement with this one:
return df.assign(**{subset:df[subset].str.split(split_value)}).explode(subset_explode)
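Put together, the whole function would look something like this (a sketch based on the original function, keeping its signature):
def splitting_rows(df, subset, subset_explode, split_value=r'\s+'):
    '''Create new rows by splitting `subset` and exploding `subset_explode`.'''
    # Building the keyword argument dynamically makes the split overwrite the
    # column named by `subset` instead of always creating a 'text' column.
    return df.assign(**{subset: df[subset].str.split(split_value)}).explode(subset_explode)

# Now it works regardless of whether the column is called 'text' or 'Text':
splitting_rows(df, subset='Text', subset_explode='Text')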
I first split the data and expanded it into columns, then joined the expanded columns back onto the original so the emotion variable stays connected:
df_expand = df['text'].str.split(' ', expand=True)
df_merge = pd.concat([df, df_expand], axis=1).drop('text', axis=1)
After that I put the column names into a list and removed emotion, so that the list held only the names of the expanded columns:
lc = list(df_merge.columns)
lc.remove('emotion')
Then I used melt to unpivot the values of the expanded dataset
df_melt = pd.melt(df_merge, id_vars=['emotion'], value_vars=lc).drop('variable', axis=1)
Then I dropped the values that were null and sorted by emotion for a cleaner view:
df_melt = df_melt[df_melt['value'].notnull()]
df_melt.sort_values('emotion')
This is what I got
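For reference, the same steps can also be chained into a single expression (a sketch using the small example df from above):
df_melt = (
    pd.concat([df, df['text'].str.split(' ', expand=True)], axis=1)
      .drop('text', axis=1)
      .melt(id_vars=['emotion'])
      .drop('variable', axis=1)
      .dropna(subset=['value'])
      .sort_values('emotion')
)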
I have the following problem and had an idea to solve it, but it didn't work:
I have data on DAX Call and Put Options for every trading day in a month. After some transformations and calculations I have the following DataFrame: DaxOpt. The goal is now to get rid of every row (either a Call or a Put Option) that does not have its respective pair. By pair I mean a Call and a Put Option with the same 'EXERCISE_PRICE' and 'TAU', where 'TAU' is the time to maturity in years. The red boxes in the picture are examples of such a pair. So the result should be either a single DataFrame containing only the pairs, or two DataFrames of Call and Put Options whose rows are the respective pairs.
My idea was to create two new DataFrames, one containing only the Call Options and the other the Put Options, sort them by 'TAU' and 'EXERCISE_PRICE', and work my way through with pandas' isin function in order to get rid of the Call or Put Options which do not have the respective pair.
DaxOptCall = DaxOpt[DaxOpt.CALL_PUT_FLAG == 'C']
DaxOptPut = DaxOpt[DaxOpt.CALL_PUT_FLAG == 'P']
The problem is that DaxOptCall and DaxOptPut have different dimensions, so the isin function is not applicable. I am trying to find the most efficient way, since the data I am using now is just a fraction of the real data.
I would appreciate any help or ideas.
See if this works for you:
Once you've separated your df into two dfs by the CALL/PUT flag, convert the column(s) that uniquely identify your pairs into index columns:
# Assuming your unique columns are TAU and EXERCISE_PRICE
df_call = df_call.set_index(["EXERCISE_PRICE", "TAU"])
df_put = df_put.set_index(["EXERCISE_PRICE", "TAU"])
Next, take the intersection of the indexes, which will return a pandas MultiIndex object
mtx = df_call.index.intersection(df_put.index)
Then use the mtx object to extract the common elements from the dfs
df_call.loc[mtx]
df_put.loc[mtx]
You can merge these if you want them in the same df, and reset the index to get the original columns back.
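For example, one way to stack the matched rows back into a single DataFrame (a sketch reusing the names above):
# Keep only the matched call/put rows, put them back into one frame and
# restore EXERCISE_PRICE and TAU as ordinary columns.
pairs = (
    pd.concat([df_call.loc[mtx], df_put.loc[mtx]])
      .sort_index()
      .reset_index()
)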
I am trying to map the values of one df column to values in another df.
First df contains football match results:
Date|HomeTeam|AwayTeam
2009-08-15|0|2
2009-08-15|18|15
2009-08-15|20|10
Second df contains teams and has only one column:
TeamName
Arsenal
Bournetmouth
Chelsea
The desired end result is the first df (the matches) but with team names instead of numbers in "HomeTeam" and "AwayTeam". The numbers in the first df are the row indexes of the second one.
I've tried ".replace":
for item in matches.HomeTeam:
    matches = matches.replace(to_replace=matches.HomeTeam[item], value=teams.TeamName[item])
It did replace the values for some items (~80% of them) but ignored the rest, and I could not find a way to replace the remaining values.
Please let me know what I did wrong and how this can be fixed. Thanks!
Maybe try using applymap:
df[['HomeTeam', 'AwayTeam']] = df[['HomeTeam', 'AwayTeam']].applymap(lambda x: teams['TeamName'].tolist()[x])
And now:
print(df)
Output will be as expected.
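As a side note, since the numbers are positions in the teams index, the same result can be obtained by mapping each column through the TeamName Series directly (an alternative sketch, not part of the original answer):
# Series.map aligns the integer codes in each column against the index of teams['TeamName']
df[['HomeTeam', 'AwayTeam']] = df[['HomeTeam', 'AwayTeam']].apply(
    lambda col: col.map(teams['TeamName'])
)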
I assume that teams is also a DataFrame, something like:
teams = pd.DataFrame(data=[['Team_0'], ['Team_1'], ['Team_2'], ['Team_3'], ['Team_4'],
                           ['Team_5'], ['Team_6'], ['Team_7'], ['Team_8'], ['Team_9']],
                     columns=['TeamName'])
but you failed to include the index in the provided sample (actually, in both samples).
Then my proposal is:
matches.set_index('Date') \
       .applymap(lambda id: teams.loc[id, 'TeamName']) \
       .reset_index()
I am trying to read a certain DF from a file and add two more columns to it containing, say, the year and the week derived from other columns in the DF. When I apply the code to generate a single new column, everything works great. But when several columns have to be created, the change does not apply. Specifically, the new columns are created but their values are not what they are supposed to be.
I know this happens because I first set all new values to a certain initial string and then change some of them, but I don't understand why it works for a single column and is "nulled" for multiple columns, leaving only the latest column changed... Help please?
import numpy as np
import pandas as pd

tbl = pd.read_csv(file).fillna('No Fill')
date_cols = ['Col1', 'Col2']
for i in range(len(date_cols)):
    tmp_col_name = date_cols[i] + '_WEEK'
    tbl[tmp_col_name] = 'No Week'
    bad_ind = list(np.where(tbl[date_cols[i]] == 'No Fill')[0])
    tbl_ind = list(range(len(tbl)))
    for j in range(len(bad_ind)):
        tbl_ind.remove(bad_ind[j])
    tmp = pd.to_datetime(tbl[date_cols[i]][tbl_ind])
    tbl[tmp_col_name][tbl_ind] = tmp.apply(lambda x: str(x.isocalendar()[0]) + '+' + str(x.isocalendar()[1]))
If I try the following lines, disregarding possible "empty data values", everything works...
tbl = pd.read_csv(file).fillna('No Fill')
date_cols = ['Col1', 'Col2']
for i in range(len(date_cols)):
    tmp_col_name = date_cols[i] + '_WEEK'
    tbl[tmp_col_name] = 'No Week'
    tmp = pd.to_datetime(tbl[date_cols[i]])
    tbl[tmp_col_name] = tmp.apply(lambda x: str(x.isocalendar()[0]) + '+' + str(x.isocalendar()[1]))
It has to do with not changing all of the data values, but I don't understand why the change does not apply. After all, before the second iteration begins the DF seems to be updated, and then tbl[tmp_col_name] = 'No Week' in the second iteration "deletes" the changes made in the first iteration, but only partially: it leaves the new column in place, just filled with 'No Week' values...
Many thanks to #EdChum! Performing chained indexing may or may not work. In the case of creating multiple new columns and then filling in only some of their values, it doesn't work; more precisely, it does work, but only on the last updated column. Using the loc, iloc or ix accessors to set the data does work. In the case of the above code, to make it work one needs to cast tbl_ind into an np.array, using tbl[col_name[j]].iloc[np.array(tbl_ind)] = tmp.apply(lambda x: x.year)
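For the week columns built in the code above, the fully un-chained version would look something like this (a sketch; a single .loc call writes straight into tbl instead of into a temporary copy):
tbl.loc[tbl_ind, tmp_col_name] = tmp.apply(
    lambda x: str(x.isocalendar()[0]) + '+' + str(x.isocalendar()[1])
)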
Many thanks and credit for the answer to #EdChum.