I have a data frame with x variables and an id_number running from 1 to n (n is large). I want to create a new data frame that horizontally merges every pair of rows based on id_number.
Original data looks like this:
id_number var_x1 var_x2
1 sth stuff
2 other things
3 more info
I want to get this for every possible pair:
id_numberA var_x1A var_x2A id_numberB var_x1B var_x2B
1 sth stuff 1 sth stuff
1 sth stuff 2 other things
1 sth stuff 3 more info
2 other things 3 more info
What is the most efficient way to do this for a large dataset?
You can create a merging index with:
df['temp'] = 1
And then merge the dataframe to itself with:
merged_df = df.merge(df, on='temp', suffixes=('A', 'B')).drop('temp', axis=1)
If you don't want rows that pair an id_number with itself, filter at the end:
merged_df = merged_df[merged_df['id_numberA'] != merged_df['id_numberB']]
And if you don't want the same pair to appear in both orders (e.g. both (1, 2) and (2, 1)), use this filter at the end instead:
merged_df = merged_df[merged_df['id_numberA'] < merged_df['id_numberB']]
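Putting it together, here is a minimal runnable sketch using the sample values from the question (the how='cross' option in newer pandas versions does the same job without the helper column):
import pandas as pd

df = pd.DataFrame({'id_number': [1, 2, 3],
                   'var_x1': ['sth', 'other', 'more'],
                   'var_x2': ['stuff', 'things', 'info']})

# Cross join via a constant helper key (works on any pandas version)
df['temp'] = 1
merged_df = df.merge(df, on='temp', suffixes=('A', 'B')).drop('temp', axis=1)
df = df.drop('temp', axis=1)

# On pandas >= 1.2 the helper column is unnecessary:
# merged_df = df.merge(df, how='cross', suffixes=('A', 'B'))

# Keep each unordered pair once, without self-pairs
merged_df = merged_df[merged_df['id_numberA'] < merged_df['id_numberB']]
print(merged_df)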
So I struggled to even come up with a title for this question. Not sure I can edit the question title, but I would be happy to do so once there is clarity.
I have a data set from an experiment where each row is a point in time for a specific group. [Edited based on better approach to generate data by Daniela Vera below]
df = pd.DataFrame({'x1': np.random.randn(30),'time': [1,2,3,4,5,6,7,8,9,10] * 3,'grp': ['c', 'c', 'c','a','a','b','b','c','c','c'] * 3})
df.head(10)
x1 time grp
0 0.533131 1 c
1 1.486672 2 c
2 1.560158 3 c
3 -1.076457 4 a
4 -1.835047 5 a
5 -0.374595 6 b
6 -1.301875 7 b
7 -0.533907 8 c
8 0.052951 9 c
9 -0.257982 10 c
10 -0.442044 1 c
In the dataset some groups only start to have values after time 5; in this case, group b. However, in the dataset I am working with there are up to 5,000 groups rather than just the 3 groups in this example.
I would like to identify every group that only has values after time 5 and drop them from the overall dataframe.
I have come up with a solution that works, but I feel like it is very clunky, and wondered if there was something cleaner.
# First I split the data into before and after the time of interest
after = df[df['time'] > 5].copy()
before = df[df['time'] < 5].copy()
# Then I merge the two dataframes and use indicator to find out which groups only appear after time 5.
missing = pd.merge(after, before, on='grp', how='outer', indicator=True)
# Then I use groupby and nunique to identify the groups that only appear after time 5 and save them as an array
something = missing[missing['_merge'] == 'left_only'].groupby('grp').nunique()
# I extract the list of group ids from the array
something = something.index
# I go back to my main dataframe and make group id the index
df = df.set_index('grp')
# I then apply .drop on the array of group ids
df = df.drop(something)
df = df.reset_index()
Like I said, super clunky. But I just couldn't figure out an alternative. Please let me know if anything isn't clear and I'll happily edit with more details.
I am not sure if I get it, but let's say you have this data:
df = pd.DataFrame({'x1': np.random.randn(30),'time': [1,2,3,4,5,6,7,8,9,10] * 3,'grp': ['c', 'c', 'c','a','a','b','b','c','c','c'] * 3})
In this case, group "b" only has data for times 6 and 7, which are above time 5. You can use this process to get a dictionary with the times at which each group has at least one data point, and also a list called "keep" with the groups that have at least one data point before time 5.
list_groups = ["a", "b", "c"]
times_per_group = {}
keep = []
for group in list_groups:
    times_per_group[group] = list(df[df.grp == group].time.unique())
    condition = any([i < 5 for i in times_per_group[group]])
    if condition:
        keep.append(group)
Finally, you just keep the groups present in the list "keep":
df = df[df.grp.isin(keep)]
Let me know if I understood your question!
Of course you can simplify the process; the dictionary is just there to check, you don't actually need the whole code.
If this result is what you're looking for, you can just do:
keep = [group for group in list_groups if any([i<5 for i in list(df[df.grp == group].time.unique())])]
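If the loop gets slow with ~5,000 groups, a more vectorized sketch of the same idea (assuming the same column names as above) uses groupby with transform to compute each group's earliest time and keeps only the groups that have at least one observation before time 5:
import numpy as np
import pandas as pd

df = pd.DataFrame({'x1': np.random.randn(30),
                   'time': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] * 3,
                   'grp': ['c', 'c', 'c', 'a', 'a', 'b', 'b', 'c', 'c', 'c'] * 3})

# A group's minimum time is below 5 exactly when it has at least one value before time 5,
# so this keeps the same groups as the "keep" list above.
df = df[df.groupby('grp')['time'].transform('min') < 5]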
So, I have indexes in the _range_ data frame. I want to use them to find values in the test dataframe and extract the corresponding rows into a new data frame.
My current code is:
d = []
for index in _range_.index:
    d.append(test.loc[[index], :])
_range_ data set:
a
2334 0.097946
3345 0.098201
3357 0.091249
3486 0.098214
5862 0.097946
6873 0.098201
6885 0.091249
7014 0.098214
_test_ data set:
0 1 2 3 4 5
0 4.187268 4.261664 4.329495 4.458864 3.071192 3.652938
You could join the two dataframes together on their common index using an 'inner' join, then keep only the test columns.
cols = test.columns
df = _range_.join(test, how='inner')
df = df[cols]
If there is an overlap of column names between the two dataframes, pass lsuffix='_l' or something similar so the range columns can be distinguished and ignored.
I'm unable to test this code against your example though; it might be worth reading over this for future posts: https://stackoverflow.com/help/minimal-reproducible-example
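For what it's worth, here is a runnable sketch on stand-in data (the frame names match the question, the values don't). Selecting rows of test directly by the shared labels with Index.intersection gives the same rows as the inner join, assuming _range_'s index labels actually appear in test:
import numpy as np
import pandas as pd

# Stand-in frames: _range_ holds the row labels of interest, test holds the values
_range_ = pd.DataFrame({'a': [0.097946, 0.098201, 0.091249]}, index=[2334, 3345, 3357])
test = pd.DataFrame(np.random.randn(10000, 6))

# Inner join on the index, keeping only the test columns (the approach above)
joined = _range_.join(test, how='inner')[test.columns]

# Equivalent direct selection on the shared index labels
picked = test.loc[test.index.intersection(_range_.index)]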
def check_LM_CapLetters(nex):
    df_ja = nex.copy()
    df_nein = nex.copy()
    criterion_nein = df_nein["NAME2"].map(lambda x: x.endswith("_nein"))
    criterion_ja = df_ja["NAME2"].map(lambda y: y.endswith("_ja"))
    df_ja[criterion_ja]
    df_nein[criterion_nein]
    frames = [df_ja, df_nein]
    df_found_small = pd.concat(frames)
    return df_found_small
Trying to reduce my dataframe to the rows where the cell entry ends with "_ja" or "_nein".
But the output is just the two copies concatenated together. What I want is a filter that keeps only the rows which fulfill my criteria.
By the way, is there a more elegant and efficient way? It's my first time dealing with list comprehensions and I'm kind of overwhelmed.
My data looks like:
Relation ID; TermID; NAME; bla; blub; TermID2; NAME2
Try the following:
- apply the pandas string method .str.endswith()
- combine multiple criteria with boolean operators to filter the rows
Working example:
df = pd.DataFrame(['aaa', 'bba', 'baa', 'cba', 'xbb'], columns=['name2'])
df_small = df[df.name2.str.endswith('aa') | df.name2.str.endswith('ba')]
>>>
name2
0 aaa
1 bba
2 baa
3 cba
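Applied to the NAME2 column from the question, the same pattern might look like this (a sketch with a made-up stand-in for the original frame nex):
import pandas as pd

# Stand-in for the original frame; only NAME2 matters here
nex = pd.DataFrame({'NAME2': ['foo_ja', 'bar_nein', 'baz_vielleicht', 'qux_ja']})

# Keep only the rows whose NAME2 ends with "_ja" or "_nein"
mask = nex['NAME2'].str.endswith('_ja') | nex['NAME2'].str.endswith('_nein')
df_found_small = nex[mask]
print(df_found_small)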
I have the following two frames:
frame1:
id
0 111-111-111
1 111-111-222
2 222-222-222
3 333-333-333
frame2:
data id
0 ones 111-111
1 threes 333-333
And, I have a lambda function that maps the frame1.id to frame2.id:
id_map = lambda x: x[:7]
My goal is to perform an inner join between these two tables, but to have the id go through the lambda, so that the output is:
id data
0 111-111-111 ones
1 111-111-222 ones
2 333-333-333 threes
I've come up with a rather non-elegant solution that almost does what I'm trying to do; however, it messes up when the inner join removes rows:
# Save a copy the original ids of frame1
frame1_ids = frame1['id'].copy()
# Apply the id change to frame1
frame1['id'] = frame1['id'].apply(id_map)
# Merge
frame1 = frame1.merge(frame2, how='inner', on='id')
# Set the ids back to what they originally were
frame1['id'] = frame1_ids
Is there an elegant solution for this?
You could use assign to create a dummy id column (newid) to join on, like this:
(frame1.assign(newid=frame1['id'].str[:7])
       .merge(frame2, left_on='newid', right_on='id', suffixes=('', '_y'))
       .drop(['id_y', 'newid'], axis=1))
Output:
id data
0 111-111-111 ones
1 111-111-222 ones
2 333-333-333 threes
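For reference, a self-contained sketch of the approach above, built from the sample frames in the question and passing the id through the original lambda:
import pandas as pd

frame1 = pd.DataFrame({'id': ['111-111-111', '111-111-222', '222-222-222', '333-333-333']})
frame2 = pd.DataFrame({'data': ['ones', 'threes'], 'id': ['111-111', '333-333']})

id_map = lambda x: x[:7]

# Build the join key with the lambda, merge, then drop the helper columns
result = (frame1.assign(newid=frame1['id'].apply(id_map))
                .merge(frame2, left_on='newid', right_on='id', suffixes=('', '_y'))
                .drop(['id_y', 'newid'], axis=1))
print(result)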
Fairly new to pandas and I have created a data frame called rollParametersDf:
rollParametersDf = pd.DataFrame(columns=['insampleStart','insampleEnd','outsampleStart','outsampleEnd'], index=[])
with the 4 column headings given, which I would like to hold the reference dates for a study I am running. I want to add rows of data (one at a time) with the index names roll1, roll2, ..., rolln, created using the following code:
outsampleEnd = customCalender.iloc[[totalDaysAvailable]]
outsampleStart = customCalender.iloc[[totalDaysAvailable-outsampleLength+1]]
insampleEnd = customCalender.iloc[[totalDaysAvailable-outsampleLength]]
insampleStart = customCalender.iloc[[totalDaysAvailable-outsampleLength-insampleLength+1]]
print('roll',rollCount,'\t',outsampleEnd,'\t',outsampleStart,'\t',insampleEnd,'\t',insampleStart,'\t')
rollParametersDf.append({insampleStart,insampleEnd,outsampleStart,outsampleEnd})
I have tried using append but cannot get an individual row to append.
I would like the final dataframe to look like:
insampleStart insampleEnd outsampleStart outsampleEnd
roll1 1 5 6 8
roll2 2 6 7 9
:
rolln
You give key-value pairs to append:
df = pd.DataFrame({'insampleStart':[], 'insampleEnd':[], 'outsampleStart':[], 'outsampleEnd':[]})
df = df.append({'insampleStart':[1,2], 'insampleEnd':[5,6], 'outsampleStart':[6,7], 'outsampleEnd':[8,9]}, ignore_index=True)
The pandas documentation has an example of appending rows to a DataFrame. Unlike appending to a list, DataFrame.append returns a new DataFrame, so each call rebuilds and reindexes the whole frame, which is pretty inefficient. Here is an example solution:
import numpy as np
import pandas as pd

# create empty dataframe
columns = ['insampleStart', 'insampleEnd', 'outsampleStart', 'outsampleEnd']
rollParametersDf = pd.DataFrame(columns=columns)

# loop through 5 rows and append them to the dataframe
for i in range(5):
    # create some artificial data
    data = np.random.normal(size=(1, len(columns)))
    # append creates a new dataframe, which makes this operation inefficient;
    # ignore_index causes reindexing on each call
    rollParametersDf = rollParametersDf.append(pd.DataFrame(data, columns=columns),
                                               ignore_index=True)

print(rollParametersDf)
insampleStart insampleEnd outsampleStart outsampleEnd
0 2.297031 1.792745 0.436704 0.706682
1 0.984812 -0.417183 -1.828572 -0.034844
2 0.239083 -1.305873 0.092712 0.695459
3 -0.511505 -0.835284 -0.823365 -0.182080
4 0.609052 -1.916952 -0.907588 0.898772
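Because each append call copies the whole frame, a common alternative is to collect the rows first and build the DataFrame once at the end. A sketch with placeholder values and the roll1..rolln index from the question (in the real code the values would come from customCalender):
import pandas as pd

columns = ['insampleStart', 'insampleEnd', 'outsampleStart', 'outsampleEnd']

rows = {}
for rollCount in range(1, 4):
    # Placeholder values standing in for the customCalender lookups
    rows['roll%d' % rollCount] = [rollCount, rollCount + 4, rollCount + 5, rollCount + 7]

# Build the frame once, with the roll labels as the index
rollParametersDf = pd.DataFrame.from_dict(rows, orient='index', columns=columns)
print(rollParametersDf)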