Convenient way to sample groups in Pandas? - python

I have data that tracks a group of individuals over time. To give a small example it looks kind of like this:
ID   TIME  HEIGHT
0    0     10.2
0    1     3.3
0    2     2.1
1    0     11.3
1    1     8.6
1    2     9.1
2    0     10.0
2    1     35.0
2    2     4.1
...
100  0     1.0
100  1     3.0
100  2     9.0
Here, for illustration, ID refers to a particular person. Thus, plotting TIME on the x-axis and HEIGHT on the y-axis for all rows where ID=0 gives us the change in person 0's height.
I want to take a random sample of these people and plot them. For instance, I want to plot the change in height over time for 3 people. However, the usual df.sample(3) will not ensure that I get all of the time points for a particular person; it simply selects 3 rows at random and plots them. Is there a preferred/convenient way in pandas to sample random groups?
A lot of questions like this one seem to be about sampling from every group, which is not what I want to do.

Since you want to plot 'TIME' on the x-axis, first get a rectangular dataframe with 'TIME' as the index and 'ID' as the columns. From there, use sample with axis=1 to sample columns and leave the index intact:
df.set_index(['TIME', 'ID']).HEIGHT.unstack().sample(3, axis=1).plot()
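For a self-contained illustration, here is a minimal sketch of that approach with made-up data (the column names follow the question; the random values are only for demonstration):
import numpy as np
import pandas as pd

# Made-up long-format data: 10 people, 3 time points each.
df = pd.DataFrame({
    'ID': np.repeat(range(10), 3),
    'TIME': np.tile(range(3), 10),
    'HEIGHT': np.random.rand(30) * 10,
})

# Pivot to TIME x ID, sample 3 people (columns), and plot one line per person.
df.set_index(['TIME', 'ID'])['HEIGHT'].unstack().sample(3, axis=1).plot()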

Related

Splice two different dataframes based on similar value

I have two dataframes with different dimensions. Let's say they're displacement measurements, but the readings are slightly different values and one has more data. They look like this:
df1
Index  displacement
1      0
2      2
3      4
4      2
5      0
df2
Index  displacement  other data
1      0             5
2      0.4           6
3      0.9           7
4      1.3           8
5      1.8           9
6      2.4           10
I want to add the "other data" to the first dataframe (df1) by looking for the most similar displacement value in df2 and associating its "other data" value. In this case, the output I want should be similar to this:
df1
Index  displacement  other data (from df2)
1      0             5
2      2             9
And so on, continuing to add the "other data" from df2. I don't know if pd.merge will work, and I'm thinking maybe of a loop that runs until the displacement is higher than the one I'm looking for and then adds the data from the previous row; but df2 has 10 times more rows than df1, and if a displacement measurement is the same as one from a previous row it may not work. Any help with a cleaner/easier way to do it would be greatly appreciated.
I used the merge_asof function to find the nearest value based on the two DataFrames' displacement columns, and then filtered the resulting DataFrame by a threshold:
# Ensure the join keys share a dtype and are unique in df1.
df1['displacement'] = df1['displacement'].astype(float)
df1 = df1.drop_duplicates('displacement', keep='last')

# merge_asof pairs each df1 row with the nearest df2 displacement.
df_out = pd.merge_asof(
    df1.sort_values("displacement"),
    df2.sort_values("displacement").assign(df2_displacement=lambda d: d["displacement"]),
    on="displacement",
    direction="nearest",
)

# Discard pairs whose displacements differ by more than the threshold.
threshold = 0.5
dfout1 = df_out[abs(df_out['displacement'] - df_out['df2_displacement']) < threshold]
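To see it end to end, here is a sketch that rebuilds the question's two dataframes (values read off the tables above) and reproduces the desired output:
import pandas as pd

df1 = pd.DataFrame({'displacement': [0, 2, 4, 2, 0]}, index=[1, 2, 3, 4, 5])
df2 = pd.DataFrame({'displacement': [0, 0.4, 0.9, 1.3, 1.8, 2.4],
                    'other data': [5, 6, 7, 8, 9, 10]}, index=[1, 2, 3, 4, 5, 6])

df1['displacement'] = df1['displacement'].astype(float)
df1 = df1.drop_duplicates('displacement', keep='last')
df_out = pd.merge_asof(
    df1.sort_values('displacement'),
    df2.sort_values('displacement').assign(df2_displacement=lambda d: d['displacement']),
    on='displacement',
    direction='nearest',
)
dfout1 = df_out[(df_out['displacement'] - df_out['df2_displacement']).abs() < 0.5]
# dfout1 pairs displacement 0 with other data 5, and displacement 2 with 9;
# displacement 4 is dropped because its nearest match (2.4) misses the threshold.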

How to return a new dataframe where numbers represent a percent of their containing row [duplicate]

This question already has answers here:
Normalizing pandas DataFrame rows by their sums
(2 answers)
Closed 1 year ago.
Using Python and pandas, I have a dataframe filled with numerical values. What I'm trying to do, and can't figure out, is how to return a new dataframe where each number represents a percentage of its row.
Essentially, I need to return a new dataframe where the numbers from the old dataframe are changed to the percentage they represent of their specific row as a whole. Hope that makes sense.
Below is an example of the starting dataframe; each row totals 10 to keep the example easy and simple.
            ambivalent  negative  neutral  positive
11/15/2021           6         2        1         1
11/8/2021            4         1        2         3
What I want to achieve is this:
            ambivalent  negative  neutral  positive
11/15/2021         60%       20%      10%      10%
11/8/2021          40%       10%      20%      30%
I don't need the actual % symbol just the actual percent numbers will work.
Can someone point me in the right direction in how to do this?
You can use the .apply() method with a lambda function:
result = df.apply(lambda row: row/sum(row)*100, axis=1)
Example:
df = pd.DataFrame({'a':[2,3],'b':[3,5],'c':[5,2]})
result = df.apply(lambda row: row/sum(row)*100, axis=1)
df is:
a b c
0 2 3 5
1 3 5 2
result is:
a b c
0 20.0 30.0 50.0
1 30.0 50.0 20.0
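As a side note, the same result can be computed without apply by dividing each row by its sum, which is generally faster on large frames; a minimal sketch:
import pandas as pd

df = pd.DataFrame({'a': [2, 3], 'b': [3, 5], 'c': [5, 2]})
# Divide every row by its own total, then scale to percentages.
result = df.div(df.sum(axis=1), axis=0) * 100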

How to vertically merge dataframes on matching column value [duplicate]

This question already has answers here:
Pandas join/merge/concat two dataframes
(2 answers)
Closed 3 years ago.
I am sending 15-minute audio files of 2-person conversations to a transcription/speaker diarization service. Circumstances require me to chunk the 15-minute files into three 5-minute files. Unfortunately, speaker labels are not consistent across chunks, but I need them to be for analysis.
For example, in the first file, speakers are labeled '0' and '1'. However, in the second file, they are labeled '1' and '2'. In the third file, they may be labeled '1' and '0' respectively. This is a problem as I need consistent labeling.
My current approach is to represent the data from each chunk in a dataframe. To have a reference for labels across dataframes, I overlapped each chunk by 10 seconds. I want to merge the dataframes where the 'transcript', 'start', and/or 'stop' columns match.
Then, I want to modify the speaker labeling scheme on the newly merged dataframe to match the previous dataframe based on the overlapping values.
This is what dataframe 1 looks like:
df
transcript start stop speaker_label
0 hello world 1.2 2.2 0
1 why hello, how are you? 2.3 4.0 1
2 fine, thank you 4.1 5.0 0
This is what dataframe 2 looks like. Note how the first row matches the last row in the previous dataframe because of the overlapping, but now the speaker_label scheme is different.
df1
transcript start stop speaker_label
0 fine, thank you 4.1 5.0 1
1 you?(should be speaker 0) 5.1 6.0 1
2 good, thanks(should be speaker 1) 6.1 7.0 2
This is what I want, dataframes vertically merged where 'start' values match, and having the 'df1' 'speaker_label' scheme match the scheme of 'df'.
ideal_df
transcript start stop speaker_label
0 hello world 1.2 2.2 0
1 why hello, how are you? 2.3 4.0 1
2 fine, thank you 4.1 5.0 0
3 you?(should be speaker 0) 5.1 6.0 0
4 good, thanks(should be speaker 1) 6.1 7.0 1
You can use pd.concat to concatenate vertically; you can refer to the pandas merging/concat/join examples.
ideal_df = pd.concat([df, df1])
ideal_df.drop_duplicates(keep='first', inplace=True)
Try this ;):
import pandas as pd
df1 = pd.DataFrame({'c1':['titi','toto','tutu'], 'c2': [0,1,0]})
df2 = pd.DataFrame({'c1':['tata','tete','titi'], 'c2': [1,1,0]})
df = pd.concat([df1, df2])
df.drop_duplicates(keep='first')
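The snippets above cover the vertical concat; for the relabeling half of the question, here is one possible sketch, assuming the chunks share at least one identical overlapping row and exactly two speakers (the mapping logic is an illustration, not part of the answers above):
import pandas as pd

# Learn the label correspondence from the overlapping row(s).
overlap = df.merge(df1, on=['transcript', 'start', 'stop'], suffixes=('_ref', '_new'))
mapping = dict(zip(overlap['speaker_label_new'], overlap['speaker_label_ref']))

# With two speakers, the unmatched new label maps to the remaining old label.
new_left = (set(df1['speaker_label']) - set(mapping)).pop()
ref_left = (set(df['speaker_label']) - set(mapping.values())).pop()
mapping[new_left] = ref_left

df1['speaker_label'] = df1['speaker_label'].map(mapping)
ideal_df = (pd.concat([df, df1])
              .drop_duplicates(subset=['transcript', 'start', 'stop'], keep='first')
              .reset_index(drop=True))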

Python: create a lag (t-1) data structure of multiple elements

I'm having trouble creating a time lag column for my data. It works fine for a dataframe with just one kind of element, but not when I have different elements. For example, my dataset tracks a 'total_tax' value over time for several ids ('id_inf'). When using the command suggested:
data1['lag_t'] = data1['total_tax'].shift(1)
all the 'total_tax' values are simply displaced one row across the whole dataframe. However, I need to compute the lag for EACH of the id_inf values separately (as separate items).
My dataset is really huge, so I need an efficient way to solve this issue.
You can groupby on the index and shift:
# an example with random data
data1 = pd.DataFrame({'id': [9, 9, 9, 54, 54, 54], 'total_tax': [5, 6, 7, 1, 2, 3]}).set_index('id')
data1['lag_t'] = data1.groupby(level=0)['total_tax'].apply(lambda x: x.shift())
print(data1)
    total_tax  lag_t
id
9           5    NaN
9           6    5.0
9           7    6.0
54          1    NaN
54          2    1.0
54          3    2.0
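As a side note, groupby objects support shift directly, so the apply can be dropped; this spelling is usually faster on a huge dataset:
# Equivalent one-liner: shift within each id group.
data1['lag_t'] = data1.groupby(level=0)['total_tax'].shift()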

Taking the average of values from a range of rows in a pandas dataframe

I have a pandas dataframe with a column called 'coverage'. For a series of specific index values, I'd like to get the mean 'coverage' value for the 100 prior rows. For example, for index position 1001, I want the mean 'coverage' for rows 901-1000. My index values of interest are in a separate list.
I'm stumped on how to tell pandas to look at a series of rows relative to a given index. I don't think I can use GroupBy, since there will be some groups of rows that overlap (for example, suppose my list of index values of interest includes 1001 and 1050).
If anyone can point me in the right direction, I'd be very grateful!
pandas.rolling_mean seems like a good candidate for your problem.
For instance:
In [9]: pandas.rolling_mean(pandas.Series(range(10)), window=2)
Out[9]:
0 NaN
1 0.5
2 1.5
3 2.5
4 3.5
5 4.5
6 5.5
7 6.5
8 7.5
9 8.5
dtype: float64
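Note that rolling_mean was removed in later pandas versions in favor of the .rolling() method. Here is a sketch of the same idea applied to the question, where positions_of_interest stands in for your list of index values (a hypothetical name):
import pandas as pd

# Mean of the 100 rows before each position; shift(1) excludes the current row,
# so position 1001 gets the mean of rows 901-1000.
prior_mean = df['coverage'].rolling(window=100).mean().shift(1)

positions_of_interest = [1001, 1050]  # hypothetical example values
result = prior_mean.iloc[positions_of_interest]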
