This question already has answers here:
Pandas join/merge/concat two dataframes
(2 answers)
Closed 3 years ago.
I am sending 15-minute audio files of 2-person conversations to a transcription/speaker diarization service. Circumstances require me to chunk the 15-minute files into three 5-minute files. Unfortunately, speaker labels are not consistent across chunks, but I need them to be for analysis.
For example, in the first file, speakers are labeled '0' and '1'. However, in the second file, they are labeled '1' and '2'. In the third file, they may be labeled '1' and '0' respectively. This is a problem as I need consistent labeling.
My current approach is to represent data from each chunk in a dataframe. To have a reference for labels across dataframes, I overlapped each dataframe by 10 seconds. I want to merge each dataframe where the 'transcript', 'start', and/or 'stop' columns match.
Then, I want to modify the speaker labeling scheme on the newly merged dataframe to match the previous dataframe based on the overlapping values.
This is what dataframe 1 looks like:
df
transcript start stop speaker_label
0 hello world 1.2 2.2 0
1 why hello, how are you? 2.3 4.0 1
2 fine, thank you 4.1 5.0 0
This is what dataframe 2 looks like. Note how the first row matches the last row in the previous dataframe because of the overlapping, but now the speaker_label scheme is different.
df1
transcript start stop speaker_label
0 fine, thank you 4.1 5.0 1
1 you?(should be speaker 0) 5.1 6.0 1
2 good, thanks(should be speaker 1) 6.1 7.0 2
This is what I want, dataframes vertically merged where 'start' values match, and having the 'df1' 'speaker_label' scheme match the scheme of 'df'.
ideal_df
transcript start stop speaker_label
0 hello world 1.2 2.2 0
1 why hello, how are you? 2.3 4.0 1
2 fine, thank you 4.1 5.0 0
3 you?(should be speaker 0) 5.1 6.0 0
4 good, thanks(should be speaker 1) 6.1 7.0 1
You can use pd.concat to concatenate vertically. You can refer to the pandas examples on merging, concatenating, and joining.
ideal_df = pd.concat([df, df1])
ideal_df.drop_duplicates(keep='first', inplace=True)
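Note that the overlapping rows carry different speaker_label values (0 in df, 1 in df1), so a full-row drop_duplicates will not remove them, and concat alone does not relabel the speakers. A minimal sketch of the relabeling step, assuming exactly two speakers per chunk: it learns the new-to-old label correspondence from the overlap, maps the leftover label, then dedups on 'start'.

```python
import pandas as pd

df = pd.DataFrame({
    'transcript': ['hello world', 'why hello, how are you?', 'fine, thank you'],
    'start': [1.2, 2.3, 4.1], 'stop': [2.2, 4.0, 5.0], 'speaker_label': [0, 1, 0]})
df1 = pd.DataFrame({
    'transcript': ['fine, thank you', 'you?', 'good, thanks'],
    'start': [4.1, 5.1, 6.1], 'stop': [5.0, 6.0, 7.0], 'speaker_label': [1, 1, 2]})

# Learn the new-label -> old-label correspondence from the overlapping rows
overlap = df.merge(df1, on='start', suffixes=('_old', '_new'))
mapping = dict(zip(overlap['speaker_label_new'], overlap['speaker_label_old']))

# The overlap may only cover one speaker; with two speakers per chunk,
# the remaining new label must map to the remaining old label
old_left = set(df['speaker_label']) - set(mapping.values())
new_left = set(df1['speaker_label']) - set(mapping.keys())
if len(old_left) == 1 and len(new_left) == 1:
    mapping[new_left.pop()] = old_left.pop()

df1['speaker_label'] = df1['speaker_label'].map(mapping)
ideal_df = (pd.concat([df, df1])
            .drop_duplicates('start', keep='first')
            .reset_index(drop=True))
print(ideal_df['speaker_label'].tolist())  # [0, 1, 0, 0, 1]
```

With more than two speakers the leftover-label step would need a more careful assignment, e.g. from additional overlap rows.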
Try this ;):
import pandas as pd
df1 = pd.DataFrame({'c1':['titi','toto','tutu'], 'c2': [0,1,0]})
df2 = pd.DataFrame({'c1':['tata','tete','titi'], 'c2': [1,1,0]})
df = pd.concat([df1, df2])
df.drop_duplicates(keep='first')
Context: I have 5 years of weight data. The first column is the date (month and day), the succeeding columns are the years with corresponding weight for each day of the month. I want to have a full plot of all of my data among other things and so I want to combine all into just two columns. First column is the dates from 2018 to 2022, then the second column is the corresponding weight to each date. I have managed the date part, but can't combine the weight data. In essence, I want to turn ...
0 1
0 1 4.0
1 2 NaN
2 3 6.0
Into ...
0
0 1
1 2
2 3
3 4
4 NaN
5 6.0
pd.concat only puts the year columns next to each other. .join, .merge, melt, stack, and agg don't work either. How do I do this?
sample code:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'2018': [1, 2, 3]})
df2 = pd.DataFrame({'2019': [4, np.nan, 6]})
merged_df = pd.concat([df1, df2], axis=1, ignore_index=True)
print(merged_df)
P.S. I particularly don't want to input any index names (like id_vars="2018") because I want this process to be automated as the years go by with more data.
In short, after trying concat, merge, melt, join, stack, and agg: I want to combine all the column data into just one series.
I think np.ravel(merged_df,order='F') will do the job for you.
If you want it in the form of a dataframe then pd.DataFrame(np.ravel(merged_df,order='F')).
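For reference, a self-contained version of the ravel approach, reconstructing the asker's frames (using np.nan rather than the np.NaN alias, which NumPy 2.0 removed):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'2018': [1, 2, 3]})
df2 = pd.DataFrame({'2019': [4, np.nan, 6]})
merged_df = pd.concat([df1, df2], axis=1)

# order='F' flattens column by column, so all 2018 values come before 2019
combined = pd.DataFrame(np.ravel(merged_df, order='F'))
print(combined[0].tolist())  # [1.0, 2.0, 3.0, 4.0, nan, 6.0]
```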
It's not fully clear what's your I/O but based on your first example, you can use concat like this :
pd.concat([df[0], df[1].rename(0)], ignore_index=True)
Output :
0 1.0
1 2.0
2 3.0
3 4.0
4 NaN
5 6.0
Name: 0, dtype: float64
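Since the P.S. asks for the process to stay automatic as more year columns are added, melt() with no id_vars is another option worth sketching; it stacks every column into one value series without naming any column:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'2018': [1, 2, 3]})
df2 = pd.DataFrame({'2019': [4, np.nan, 6]})
merged_df = pd.concat([df1, df2], axis=1)

# melt() without id_vars stacks all columns; 'year' keeps the source
# column name (handy for labeling), 'weight' is the combined series
long_df = merged_df.melt(var_name='year', value_name='weight')
weights = long_df['weight']
```

This keeps working unchanged when a '2023' column appears, and the 'year' column can feed the date reconstruction directly.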
I have two dataframes with different dimensions. Let's say it's displacement measurements, but the readings are slightly different values and one has more data. Looks like this:
df1
Index  displacement
1      0
2      2
3      4
4      2
5      0

df2
Index  displacement  other data
1      0             5
2      0.4           6
3      0.9           7
4      1.3           8
5      1.8           9
6      2.4           10
I want to add the "other data" to the first dataframe (df1) by looking for a similar displacement value in df2 and associating its "other data" value. In this case, the output I want should be similar to this:
df1
Index  displacement  other data (from df2)
1      0             5
2      2             9
And so on, continuing to add the "other data" from df2. I don't know if pd.merge will work. I'm thinking maybe a loop until the displacement is higher than the one I'm looking for, then taking the data from the previous row, but df2 has 10 times more rows than df1, and if a displacement measurement repeats a previous row's value it may not work. Any help with a cleaner/easier way to do it would be greatly appreciated.
I used the merge_asof function to find the nearest value based on the two DataFrames' displacement columns, and then filtered the resulting DataFrame by a threshold.
df1['displacement'] = df1['displacement'].astype(float)
df1 = df1.drop_duplicates('displacement', keep='last')
df_out = pd.merge_asof(
    df1.sort_values("displacement"),
    df2.sort_values("displacement").assign(df2_displacement=lambda d: d["displacement"]),
    on="displacement",
    direction="nearest",
)
threshold = 0.5
df_out1 = df_out[abs(df_out['displacement'] - df_out['df2_displacement']) < threshold]
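Put together with the question's data (values transcribed from the tables above), the whole thing is runnable as:

```python
import pandas as pd

df1 = pd.DataFrame({'displacement': [0, 2, 4, 2, 0]}, index=[1, 2, 3, 4, 5])
df2 = pd.DataFrame({'displacement': [0, 0.4, 0.9, 1.3, 1.8, 2.4],
                    'other data': [5, 6, 7, 8, 9, 10]}, index=[1, 2, 3, 4, 5, 6])

df1['displacement'] = df1['displacement'].astype(float)
df1 = df1.drop_duplicates('displacement', keep='last')

# merge_asof requires both sides sorted on the join key
df_out = pd.merge_asof(
    df1.sort_values('displacement'),
    df2.sort_values('displacement').assign(df2_displacement=lambda d: d['displacement']),
    on='displacement',
    direction='nearest',
)

# Keep only matches closer than the threshold
threshold = 0.5
df_out1 = df_out[abs(df_out['displacement'] - df_out['df2_displacement']) < threshold]
print(df_out1[['displacement', 'other data']])
```

This reproduces the desired rows (displacement 0 gets 5, displacement 2 gets 9); displacement 4 is dropped because its nearest df2 value, 2.4, misses the 0.5 threshold.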
This question already has answers here:
Pandas: how to merge two dataframes on a column by keeping the information of the first one?
(4 answers)
Closed 1 year ago.
I turned a JSON file into a dataframe, but I am unsure how to map a certain value from the JSON dataframe onto the existing dataframe I have.
df1 =  # (the 2nd column doesn't matter, it's just there)
category_id  tags
1            a
1            a
10           b
10           c
40           d

df2 (json) =
id  title
1   film
2   music
3   travel
4   cooking
5   dance
I would like to make a new column in df1 that maps the titles from df2 onto df1, corresponding to the category_id. I am sorry, I am new to Python programming. I know I can hard-code the dictionary and key values and go from there; however, I was wondering if there is an easier way to do this with python/pandas.
You can use pandas.Series.map() which maps values of Series according to input correspondence.
df1['title'] = df1['category_id'].map(df2.set_index('id')['title'])
# print(df1)
  category_id tags title
0 1 a film
1 1 a film
2 10 b NaN
3 10 c NaN
4 40 d NaN
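If you later need to bring over more than one column from df2, a left merge is an equivalent alternative to map (a sketch using the same frames):

```python
import pandas as pd

df1 = pd.DataFrame({'category_id': [1, 1, 10, 10, 40],
                    'tags': ['a', 'a', 'b', 'c', 'd']})
df2 = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                    'title': ['film', 'music', 'travel', 'cooking', 'dance']})

# how='left' keeps every df1 row; category_ids with no matching id get NaN
out = (df1.merge(df2, left_on='category_id', right_on='id', how='left')
          .drop(columns='id'))
print(out)
```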
I have data that tracks a group of individuals over time. To give a small example it looks kind of like this:
ID TIME HEIGHT
0 0 10.2
0 1 3.3
0 2 2.1
1 0 11.3
1 1 8.6
1 2 9.1
2 0 10.0
2 1 35.0
2 2 4.1
.
.
.
100 0 1.0
100 1 3.0
100 2 9.0
Where, for illustration, ID refers to a particular person. Thus, plotting TIME on the x-axis and HEIGHT on the y-axis for all the values of ID=0 gives us the change in person 0's height.
I want to graph a random sample of these people and plot them. So for instance, I want to plot the change in height over time of 3 people. However, applying the usual df.sample(3) will not always ensure that I get all of the time for a particular person, instead it will select randomly 3 rows and plot them. Is there a preferred/convenient way in pandas to sample random groups?
A lot of questions like this one seem to be about sampling from every group which is not what I want to do.
You want to plot 'TIME' on the x-axis, so build a rectangular dataframe with 'TIME' as the index and 'ID' as the columns. From there, use sample with axis=1 to sample columns and leave the index intact.
df.set_index(['TIME', 'ID']).HEIGHT.unstack().sample(3, axis=1).plot()
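If you would rather keep the long format (for seaborn's hue=, say), an alternative sketch is to sample the IDs first and then keep every row for those people. The toy data below is made up to mirror the question's shape:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
df = pd.DataFrame({'ID': np.repeat(np.arange(101), 3),
                   'TIME': np.tile(np.arange(3), 101),
                   'HEIGHT': rng.uniform(1, 12, 303)})

# Sample 3 people (not 3 rows), then keep all of their time points
ids = rng.choice(df['ID'].unique(), size=3, replace=False)
sample = df[df['ID'].isin(ids)]

# sample.set_index(['TIME', 'ID'])['HEIGHT'].unstack().plot()
```

Because the filter is on ID membership, every sampled person keeps their full time series.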
I'm working on topic modeling data where I have one data frame with a small selection of topics and their scores for each document or author (called "scores"), and another data frame with the top three words for all 250 topics (called "words").
I'm trying to combine the two data frames in a way to have an extra column in "scores", in which the top three words from "words" appear for each of the topics included in "scores". This is useful for visualizing the data as a heatmap, as seaborn or pyplot will pick up the labels automatically from such a dataframe.
I have tried a wide variety of merge and concat commands, but do not get the desired result. The strange thing is: what seems the most logical command, according to my understanding of the relevant documentation and the examples there (i.e. use concat on the two df with axis=1 and join="inner"), works on toy data but does not work on my real data.
Here is my toy data with the code I used to generate it and to do the merge:
import pandas as pd
## Defining the two data frames
scores = pd.DataFrame({'author1': ['1.00', '1.50'],
'author2': ['2.75', '1.20'],
'author3': ['0.55', '1.25'],
'author4': ['0.95', '1.3']},
index=[1, 3])
words = pd.DataFrame({'words': ['cadavre','fenêtre','musique','mariage']},
index=[0, 1, 2, 3])
## Inspecting the two dataframes
print("\n==scores==\n", scores)
print("\n==words==\n", words)
## Merging the dataframes
merged = pd.concat([scores, words], axis=1, join="inner")
## Check the result
print("\n==merged==\n", merged)
And this is the output, as expected:
==scores==
author1 author2 author3 author4
1 1.00 2.75 0.55 0.95
3 1.50 1.20 1.25 1.3
==words==
words
0 cadavre
1 fenêtre
2 musique
3 mariage
==merged==
author1 author2 author3 author4 words
1 1.00 2.75 0.55 0.95 fenêtre
3 1.50 1.20 1.25 1.3 mariage
This is exactly what I would like to accomplish with my real data. And although the two dataframes seem no different from the test data, I get an empty dataframe as the result of the merge.
Here is a small example from my real data:
someScores (complete table):
blanche policier
108 0.003028 0.017494
71 0.002997 0.016956
115 0.029324 0.016127
187 0.004867 0.017631
122 0.002948 0.015118
firstWords (first 5 rows only; the index goes to 249, all index entries in "someScores" have an equivalent in "firstwords"):
topicwords
0 château-pays-intendant (0)
1 esclave-palais-race (1)
2 linge-voisin-chose (2)
3 question-messieurs-réponse (3)
4 prince-princesse-monseigneur (4)
5 arbre-branche-feuille (5)
My merge command:
dataToPlot = pd.concat([someScores, firstWords], axis=1, join="inner")
And the resulting data frame (empty)!
Empty DataFrame
Columns: [blanche, policier, topicwords]
Index: []
I have tried many variants, like using merge instead or creating extra columns replicating the indexes and then merging on those with left_on and right_on, but then I either get the same result or I just get NaN in the "topicwords" column.
Any hints and help would be greatly appreciated!
An inner join only returns rows whose index is present in both dataframes.
Since the row indices of someScores (108, 71, 115, 187, 122) and firstWords (0, 1, 2, 3, 4, 5) share no common values as shown, the result is an empty dataframe.
Either set these indices correctly or specify different criteria for joining.
You can confirm the problem by checking for common values in both index
someScores.index.intersection(firstWords.index)
For different strategies of joining, refer to the documentation.
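One common cause here (an assumption, since the real index dtypes are not shown) is that someScores was read with a string index while firstWords has an integer index: the values look identical when printed but never compare equal. A sketch with placeholder topic words:

```python
import pandas as pd

someScores = pd.DataFrame({'blanche': [0.003028, 0.002997],
                           'policier': [0.017494, 0.016956]},
                          index=['108', '71'])  # string labels, e.g. from a CSV
firstWords = pd.DataFrame({'topicwords': [f'word-word-word ({i})' for i in range(250)]},
                          index=range(250))     # integer labels

# The empty intersection confirms the mismatch
print(someScores.index.intersection(firstWords.index))

# Align the dtypes, then the inner concat works as on the toy data
someScores.index = someScores.index.astype(int)
dataToPlot = pd.concat([someScores, firstWords], axis=1, join='inner')
```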