Join all dataframes whose names start with "df_" - python

I have multiple dataframes (with exactly the same structure and variables) and their names all start with "df_".
What I would like to do is to join all these dataframes into one.
I can do it manually, but I have many data frames and their names can change.
frames = [df_24_10000, df_48_10000, df_64_20000, df_82_30000]
result = pd.concat(frames)
Is it possible to join all data frames whose names start with "df_"?

This is normally the sign of a design problem. If you find yourself trying to group a number of objects by their names, it means they should have been elements of the same container (list, dict, set, or whatever) from the very beginning.
Said differently, if instead of df_24_10000, df_48_10000, df_64_20000, you had dfs['24_10000'], dfs['48_10000'], dfs['64_20000'], the join would simply be:
result = pd.concat(dfs.values())
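If the frames really do already exist as separate top-level variables, you can still collect them programmatically; this is a fragile stopgap relying on module globals rather than good design (a sketch, not a recommendation):
import pandas as pd

# Gather every top-level DataFrame whose name starts with "df_" into a dict
dfs = {name: obj for name, obj in globals().items()
       if name.startswith('df_') and isinstance(obj, pd.DataFrame)}
result = pd.concat(dfs.values())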

Related

How Do I Count The Number of Times a Subset of Words Appear In My Pandas Dataframe?

I have a bunch of keywords stored in a 620x2 pandas dataframe. I think I need to treat each entry as its own set, where semicolons separate elements, so we end up with 1240 sets. Then I'd like to be able to search how many times keywords of my choosing appear together. For example, I'd like to figure out how many times 'computation theory' and 'critical infrastructure' appear together as a subset of these sets, in any order. Is there any straightforward way I can do this?
Use .loc to find if the keywords appear together.
Do this after you have split the data into 1240 sets. It isn't clear whether you want to make new columns or keep the columns as they are.
# create a filter for keyword 1
filter_keyword_1 = (df['column_name'].str.contains('critical infrastructure'))
# create a filter for keyword 2
filter_keyword_2 = (df['column_name'].str.contains('computation theory'))
# you can create more filters with the same construction as above.
# To check the number of times both the keywords appear
len(df.loc[filter_keyword_1 & filter_keyword_2])
# To see the dataframe
subset_df = df.loc[filter_keyword_1 & filter_keyword_2]
.loc selects the rows matching the condition. If you have only one condition, you can simply use subset_df = df[df['column_name'].str.contains('string')].
Do the column split or any other processing before you make the filters, or run the filters again after processing.
Not sure if this is considered straightforward, but it works. keyword_list is the list of paired keywords you want to search.
df['Author Keywords'] = df['Author Keywords'].fillna('').str.split(r';\s*').apply(set)
df['Index Keywords'] = df['Index Keywords'].fillna('').str.split(r';\s*').apply(set)
# Count the cells (across both columns) that contain every keyword in keyword_list
df.apply(lambda col: col.apply(lambda kws: all(kw in kws for kw in keyword_list))).sum().sum()
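A self-contained toy run of the same idea (hypothetical data; keyword_list holds the pair from the question):
import pandas as pd

df = pd.DataFrame({
    'Author Keywords': ['computation theory; critical infrastructure', 'databases'],
    'Index Keywords': ['critical infrastructure; computation theory; security', None],
})
keyword_list = ['computation theory', 'critical infrastructure']
df['Author Keywords'] = df['Author Keywords'].fillna('').str.split(r';\s*').apply(set)
df['Index Keywords'] = df['Index Keywords'].fillna('').str.split(r';\s*').apply(set)
# Prints 2: one cell in each column contains both keywords
print(df.apply(lambda col: col.apply(lambda kws: all(kw in kws for kw in keyword_list))).sum().sum())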

How to correctly merge/join a feature to another dataframe

I have two dataframes and I want to add 3 features from the first dataframe to the second, but ONLY where they match on a certain key value (TicketNr). This key is not unique and can occur multiple times in both dataframes.
I have tried different versions of concat, merge and join but I can't get it the way I need it. I don't want to add any rows to the dataframe, just these three columns.
I think this illustration sums up my question. Who can help me with the right code?
You mentioned that TicketNr is not unique in the training set, but if I am correct to assume that TicketSurvRate, AllSurvived, AllDIED are the same as long as TicketNr is the same, we could try the following:
# Drop duplicates to get one row per TicketNr, assuming that
# TicketSurvRate, AllSurvived, AllDIED are uniquely defined by TicketNr
x = engineered_train[['TicketNr', 'TicketSurvRate', 'AllSurvived', 'AllDIED']].drop_duplicates()
# Merge test dataset with these de-duplicated stats.
# The how='left' parameter will keep all records from the test set.
# There will be `NaN`s where no match for TicketNr is found.
engineered_test = engineered_test.merge(x, how='left', on='TicketNr')
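If you want to check that uniqueness assumption before merging, a quick sanity check (a sketch using the column names above):
# Each TicketNr should map to exactly one value of each of the three stats
stats = ['TicketSurvRate', 'AllSurvived', 'AllDIED']
assert engineered_train.groupby('TicketNr')[stats].nunique().le(1).all().all()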

How to loop a command in python with a list as variable input?

This is my first post to the coding community, so I hope I get the right level of detail in my request for help!
Background info:
I want to repeat (loop) a command on a df using a variable that contains a list of options. While the series 'amenity_options' contains a simple list of specific items (say, only four amenities as in the example below), the df is a large data frame with many other items. My goal is to run the operation below for each item in 'amenity_options' until the end of the list.
amenity_options = ['cafe','bar','cinema','casino'] # this is a series type with multiple options
df = df[df['amenity'] == amenity_options] # this is my attempt to select the first value in the series (e.g. cafe) out of a dataframe that contains such a column
df.to_excel('{}_amenity.xlsx'.format('amenity')) # wish to save the result (e.g. cafe_amenity) as a separate file
Desired result: I wish to loop steps one and two for each and every item available in the list (e.g. cafe, bar, cinema...), so that I end up with separate Excel files. Any thoughts?
What @Rakesh suggested is correct; you probably just need one more step.
df = df[df['amenity'].isin(amenity_options)]
for key, g in df.groupby('amenity'):
    g.to_excel('{}_amenity.xlsx'.format(key))
After you call groupby() on your df, you will get 4 groups that you can loop over directly. key is the group key ('cafe', 'bar', etc.) and g is the sub-dataframe filtered to that key.
Seems like you just need a simple for loop:
for amenity in amenity_options:
    df[df['amenity'] == amenity].to_excel(f"{amenity}_amenity.xlsx")

Splitting a DataFrame into two DataFrames and filtering these two DataFrames in order to have the same dimensions

I have the following problem and had an idea to solve it, but it didn't work:
I have the data on DAX Call and Put Options for every trading day in a month. After transforming and some calculations I have the DataFrame DaxOpt. The goal is now to get rid of every row (either Call or Put Option) which does not have the respective pair. By pair I mean a Call and a Put Option with the same 'EXERCISE_PRICE' and 'TAU', where 'TAU' is the time to maturity in years. (The red boxes in the picture, not shown here, are examples of a pair.) So the result is either a DataFrame with only the pairs, or two DataFrames with Call and Put Options where the rows are the respective pairs.
My idea was to create two new DataFrames, one containing only the Call Options and the other the Put Options, sort them by 'TAU' and 'EXERCISE_PRICE', and work my way through with pandas' isin function in order to get rid of the Call or Put Options which do not have the respective pair.
DaxOptCall = DaxOpt[DaxOpt.CALL_PUT_FLAG == 'C']
DaxOptPut = DaxOpt[DaxOpt.CALL_PUT_FLAG == 'P']
The problem is that DaxOptCall and DaxOptPut have different dimensions, so the isin function is not applicable. I am trying to find the most efficient way, since the data I am using now is just a fraction of the real data.
Would appreciate any help or idea.
See if this works for you:
Once you have separated your df into two dfs by CALL/PUT options, convert the column(s) that uniquely identify your pairs into index columns:
# Assuming your unique columns are TAU and EXERCISE_PRICE
df_call = df_call.set_index(["EXERCISE_PRICE", "TAU"])
df_put = df_put.set_index(["EXERCISE_PRICE", "TAU"])
Next, take the intersection of the indexes, which will return a pandas MultiIndex object
mtx = df_call.index.intersection(df_put.index)
Then use the mtx object to extract the common elements from the dfs
df_call.loc[mtx]
df_put.loc[mtx]
You can merge these if you want them to be in the same df and reset the index to the original column.
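A minimal sketch of that last step (assuming the frames above; the suffixes are only needed where column names overlap, e.g. CALL_PUT_FLAG):
# Join the paired rows side by side; reset_index turns
# EXERCISE_PRICE and TAU back into ordinary columns
pairs = df_call.loc[mtx].join(df_put.loc[mtx], lsuffix='_call', rsuffix='_put').reset_index()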

Is there a more efficient tool than iterrows() in this situation?

Okay so, here's the thing. I'm working with a lot of pandas data frames and arrays. Often, I need to pair up a value from one frame with a value from another, ideally combining the information into one frame in the end.
Say I'm looking at image files. There's a set of information specific to each file. Sometimes there's certain types of image files that share the same kind of information. Simple example:
FILEPATH, TYPE, COLOR, VALUE_I
/img2.jpg, A, 'green', 0.6294
/img45.jpg, B, 'green', 0.1846
/img87.jpg, A, 'blue', 34.78
Often, this information is indexed out by type/color/value etc. and fed into some other function that gives me another important output, let's say VALUE_II. But I can't concatenate it directly onto the original dataframe because the indices won't match, either because of the nature of the output or because I only fed in part of the frame.
Or another situation: I learn that images of a certain TYPE have a specific value attached to them, so I make a dictionary of types and their values. Again, this column doesn't exist, so in this case I would use iterrows() to march down the frame, see if the type matches a specific key, and if it does, append it to an array. Then at the end, I convert that array to a dataframe and concatenate it onto the original.
Here's the worst offender. With up to 1800 rows in each frame, it takes FOREVER:
newColumn = []
for index, row in originalDataframe.iterrows():
    for indx, rw in otherDataframe.iterrows():
        if row['filename'] in rw['filepath']:
            newColumn.append([rw['VALUE_I'], rw['VALUE_II'], rw['VALUE_III']])
newColumn = pd.DataFrame(newColumn, columns=['VALUE_I', 'VALUE_II', 'VALUE_III'])
originalDataframe = pd.concat([originalDataframe, newColumn], axis=1)
Solutions would be appreciated!
If you can split the filename out of otherDataframe["filepath"], you can then just compare it for equality with originalDataframe's filename, without needing the in check. After that you can simplify the calculation with pandas.DataFrame.join, which for each filename in originalDataframe will find the same filename in otherDataframe and add all its other columns.
import os
otherDataframe["filename"] = otherDataframe["filepath"].map(os.path.basename)
joinedDataframe = originalDataframe.join(otherDataframe.set_index("filename"), on="filename")
If there are columns with the same name in originalDataframe and otherDataframe you should set lsuffix or rsuffix.
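For example (hypothetical clash; rsuffix renames the overlapping columns coming from otherDataframe):
joinedDataframe = originalDataframe.join(
    otherDataframe.set_index("filename"), on="filename", rsuffix="_other")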
Focusing on the second half of your question, as that's what you provided code for: your program is checking every row of df1 against every row of df2, yielding potentially 1800 * 1800, or 3,240,000 possible combinations. If there is only one possible match for each row then adding a break will help some, but is not ideal.
newColumn.append([rw['VALUE_I'], rw['VALUE_II'], rw['VALUE_III']])
break
If the structure of your data allows it, I would try something like:
ref = {}
for i, path in enumerate(otherDataframe['filepath']):
    *_, file = path.split('\\')
    ref[file] = i
originalDataframe['VALUE_I'] = None
originalDataframe['VALUE_II'] = None
originalDataframe['VALUE_III'] = None
for i, file in enumerate(originalDataframe['filename']):
    try:
        j = ref[file]
        originalDataframe.loc[i, 'VALUE_I'] = otherDataframe.loc[j, 'VALUE_I']
        originalDataframe.loc[i, 'VALUE_II'] = otherDataframe.loc[j, 'VALUE_II']
        originalDataframe.loc[i, 'VALUE_III'] = otherDataframe.loc[j, 'VALUE_III']
    except KeyError:
        pass
Here we iterate through the paths in otherDataframe (I assume they follow a pattern of C:\asdf\asdf\file), split the path on \ to pull out the file name, and then construct a dictionary mapping files to row numbers. Next we initialize the 3 columns in originalDataframe that you want to write to.
Lastly we iterate through the files in originalDataframe, check whether that file exists in our dictionary of files from otherDataframe (done inside a try/except to catch missing keys), and pull the row number out of the dictionary, which we then use to write the values from other to original.
Side note: you describe your paths as being in the vein of 'C:/asd/fdg/img2.jpg', in which case you should use:
*_, file = path.split('/')
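If you would rather avoid the explicit loops entirely, the same filename-to-row mapping can be expressed with Series.map (a sketch, assuming the filename and filepath columns described above; note it requires the filenames in otherDataframe to be unique):
import os

# Index otherDataframe by bare filename, then map each value column
# onto originalDataframe['filename']; unmatched files become NaN
lookup = otherDataframe.set_index(otherDataframe['filepath'].map(os.path.basename))
for col in ['VALUE_I', 'VALUE_II', 'VALUE_III']:
    originalDataframe[col] = originalDataframe['filename'].map(lookup[col])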
