Convert frequency table to raw data in Pandas - python

I have a sensor. For some reason, the sensor likes to record data like this:
>df
    obs  count
   -0.3      3
    0.9      2
    1.4      5
i.e. it first records observations and then makes a count table out of them. What I would like to do is convert this df into a series of raw observations. For example, I would like to end up with: [-0.3, -0.3, -0.3, 0.9, 0.9, 1.4, 1.4, ...]
A similar question has been asked for Excel.

If your dataframe structure is like this one (or similar):
    obs  count
0  -0.3      3
1   0.9      2
2   1.4      5
This is an option, using numpy.repeat:
import numpy as np
import pandas as pd

times = df['count']
# repeat each observation according to its count and wrap the result in a new frame
df2 = pd.DataFrame({'obs': np.repeat(df['obs'].values, times)})
print(df2)
obs
0 -0.3
1 -0.3
2 -0.3
3 0.9
4 0.9
5 1.4
6 1.4
7 1.4
8 1.4
9 1.4
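A pure-pandas alternative (a minimal sketch, assuming the same df as in the question) uses Index.repeat instead of NumPy:
import pandas as pd

df = pd.DataFrame({'obs': [-0.3, 0.9, 1.4], 'count': [3, 2, 5]})

# repeat each row's index label as often as its count, then keep only the 'obs' column
raw = df.loc[df.index.repeat(df['count']), 'obs'].reset_index(drop=True)
print(raw.tolist())  # [-0.3, -0.3, -0.3, 0.9, 0.9, 1.4, 1.4, 1.4, 1.4, 1.4]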

Related

Replacing a cell value by multiplying another cell in Pandas

I have the following dataframe:
I want to verify whether the value of a cell is 0 for any date. If it is, I want to replace the value of the cell by multiplying the value in the previous cell by the proper multiplier.
For example, if Day 14 = 0, I want to multiply Day 7 by Mul 14 and store the new value in Day 14, and so on for the whole dataframe.
I have tried this code, but it is not working:
if df['day 30'] == 0.00:
    df['day 30'] = df['day 14']*df['Mul 30']
And this is my expected output:
Thanks!
Here is a solution with a small example:
import pandas as pd
import numpy as np

df = pd.DataFrame([[0.8, 0.9, 0.7, 2, 6],
                   [0.6, 0, 0, 2, 3],
                   [0.2, 0, 0, 4, 2]],
                  columns=["Day 7", "Day 14", "Day 30", "Mul 14", "Mul 30"])
print(df)

# replace zeros in "Day 14" using "Day 7" * "Mul 14", then do the same for "Day 30"
df["Day 14"] = np.where(df["Day 14"] == 0, df["Day 7"] * df["Mul 14"], df["Day 14"])
df["Day 30"] = np.where(df["Day 30"] == 0, df["Day 14"] * df["Mul 30"], df["Day 30"])
print(df)
If you want, you can iterate over the day numbers instead of writing an individual line for each column; see the sketch after the output below.
Result of above code:
Day 7 Day 14 Day 30 Mul 14 Mul 30
0 0.8 0.9 0.7 2 6
1 0.6 0.0 0.0 2 3
2 0.2 0.0 0.0 4 2
Day 7 Day 14 Day 30 Mul 14 Mul 30
0 0.8 0.9 0.7 2 6
1 0.6 1.2 3.6 2 3
2 0.2 0.8 1.6 4 2
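A minimal sketch of that loop, assuming the columns keep the "Day N" / "Mul N" naming above and that each day's multiplier applies to the previous day's column:
import numpy as np

# hypothetical list of (previous day, current day) pairs, processed in order
day_pairs = [(7, 14), (14, 30)]
for prev, cur in day_pairs:
    day_col, mul_col, prev_col = f"Day {cur}", f"Mul {cur}", f"Day {prev}"
    # where the current day is 0, fill it with the previous day times the multiplier
    df[day_col] = np.where(df[day_col] == 0, df[prev_col] * df[mul_col], df[day_col])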

select columns based on row values of another dataset

So I have two dataframes, dfA and dfB. I want to select several columns of dfA based on the rows in dfB. This is what my dfA looks like:
index abandoned dismiss yes train tram go
0 0.5 9.1 1.4 2.5 2.5 5.6
1 2.4 3.2 1.8 4.9 9.3 3.2
2 1.5 5.7 3.9 2.1 1.1 0.9
and this is what dfB looks like:
index keywords
0 abandoned
1 wanted
2 goes
3 train
4 bold
5 go
6 images
7 links
so I want my dfC to look like this:
index abandoned train go
0 0.5 2.5 5.6
1 2.4 4.9 3.2
2 1.5 2.1 0.9
This was my attempt, but it gave me an empty dataframe:
dfC = dfB[~dfB["keywords"].isin(dfA)]
Can anyone help me? Thank you.
Use DataFrame.loc and filter the column names with Index.isin:
dfC = dfA.loc[:, dfA.columns.isin(dfB['keywords'])]
Or filtering by Index.intersection:
dfC = dfA[dfA.columns.intersection(dfB['keywords'])]
print (dfC)
abandoned train go
index
0 0.5 2.5 5.6
1 2.4 4.9 3.2
2 1.5 2.1 0.9
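Another option (a small sketch using the same frames) is DataFrame.filter, which simply ignores keywords that are not actual columns of dfA:
import pandas as pd

dfA = pd.DataFrame({'abandoned': [0.5, 2.4, 1.5], 'dismiss': [9.1, 3.2, 5.7],
                    'yes': [1.4, 1.8, 3.9], 'train': [2.5, 4.9, 2.1],
                    'tram': [2.5, 9.3, 1.1], 'go': [5.6, 3.2, 0.9]})
dfB = pd.DataFrame({'keywords': ['abandoned', 'wanted', 'goes', 'train',
                                 'bold', 'go', 'images', 'links']})

# keep only the dfA columns whose names appear in dfB['keywords']
dfC = dfA.filter(items=dfB['keywords'].tolist())
print(dfC)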

Python/Pandas: How does one pivot a table whereby the unique values in a specified multi-index or column form part of the resultant column name?

I am trying to pivot a pandas table composed of 3 columns, where a process id identifies the process that generated a series of scalar values and forms part of the resultant dataframe's column names (one column per process), as the following describes:
Input
time scalar process_id
1 0.5 A
1 0.6 B
2 0.7 A
2 1.5 B
3 1.6 A
3 1.9 B
Resultant:
time scalar_A scalar_B
1 0.5 0.6
2 0.7 1.5
3 1.6 1.9
I have tried using unstack (after setting process_id in a multi-index); however, this nests the columns together with the process id that generated them:
df.set_index(['time', 'process_id'], inplace=True)
df.unstack(level=-1)
How would one most efficiently/effectively achieve this?
Thanks
It's actually already covered by the pd.DataFrame.pivot method:
new_df = df.pivot(index='time', columns='process_id', values='scalar').reset_index()
Output:
process_id time A B
0 1 0.5 0.6
1 2 0.7 1.5
2 3 1.6 1.9
And if you want to rename your columns:
new_df = df.pivot(index='time', columns='process_id', values='scalar')
new_df.columns = [f'scalar_{i}' for i in new_df.columns]
new_df = new_df.reset_index()
Output:
time scalar_A scalar_B
0 1 0.5 0.6
1 2 0.7 1.5
2 3 1.6 1.9
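As a small variation (a sketch of the same idea), DataFrame.add_prefix can do the renaming inside one chained expression:
new_df = (df.pivot(index='time', columns='process_id', values='scalar')
            .add_prefix('scalar_')   # 'A' -> 'scalar_A', 'B' -> 'scalar_B'
            .reset_index())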

Python Pandas .loc update 2 columns at once

I face a problem in pandas where I perform many changes on the data, but eventually I don't know which change caused the final state of a value in the column.
For example, I change volumes like this, and I run many checks like this one:
# Last check
for i in range(5):
    df_gp.tail(1).loc[(df_gp['volume'] < df_gp['volume'].shift(1)) | (df_gp['volume'] < 0.4), ['new_volume']] = df_gp['new_volume']*1.1
I want to update not only the 'new_volume' column, but also the 'commentary' column if the conditions are fulfilled.
Is it possible to add this somewhere, so that 'commentary' is updated at the same time as 'new_volume'?
Thanks!
Yes, it is possible with assign, but in my opinion it is less readable; it is better to update each column separately with a boolean mask cached in a variable:
import pandas as pd

df_gp = pd.DataFrame({'volume': [.1, .3, .5, .7, .1, .7],
                      'new_volume': [5, 3, 6, 9, 2, 4],
                      'commentary': list('aaabbb')})
print (df_gp)
volume new_volume commentary
0 0.1 5 a
1 0.3 3 a
2 0.5 6 a
3 0.7 9 b
4 0.1 2 b
5 0.7 4 b
# create a boolean mask and assign it to a variable for reuse
m = (df_gp['volume'] < df_gp['volume'].shift(1)) | (df_gp['volume'] < 0.4)
# change both columns via assign and write back only the filtered rows and columns
c = ['commentary', 'new_volume']
df_gp.loc[m, c] = df_gp.loc[m, c].assign(new_volume=df_gp['new_volume']*1.1,
                                         commentary='updated')
print (df_gp)
volume new_volume commentary
0 0.1 5.5 updated
1 0.3 3.3 updated
2 0.5 6.0 a
3 0.7 9.0 b
4 0.1 2.2 updated
5 0.7 4.0 b
# recommended alternative (this assumes df_gp has been recreated from the original data):
# multiply the filtered column by a scalar
df_gp.loc[m, 'new_volume'] *= 1.1
# assign a new value to the filtered column
df_gp.loc[m, 'commentary'] = 'updated'
print (df_gp)
volume new_volume commentary
0 0.1 5.5 updated
1 0.3 3.3 updated
2 0.5 6.0 a
3 0.7 9.0 b
4 0.1 2.2 updated
5 0.7 4.0 b
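An equivalent np.where-based sketch (assuming the same df_gp and mask m as above), if you prefer rebuilding each column as a whole:
import numpy as np

# the mask decides, row by row, whether to keep the old value or take the new one
df_gp['new_volume'] = np.where(m, df_gp['new_volume'] * 1.1, df_gp['new_volume'])
df_gp['commentary'] = np.where(m, 'updated', df_gp['commentary'])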

CSV data - max values for segments of columns using numpy

So let's say I have a csv file with data like so:
'time' 'speed'
0 2.3
0 3.4
0 4.1
0 2.1
1 1.3
1 3.5
1 5.1
1 1.1
2 2.3
2 2.4
2 4.4
2 3.9
I want to be able to process this file so that, for each distinct number under the header 'time', I find the max number in the 'speed' column and return it next to the time value in an array. The actual csv file I'm using is a lot larger, so I'd want to iterate over a big mass of data and not just run it where 'time' is 0, 1, or 2.
So basically I want this to return:
array([[0, 4.1], [1, 5.1], [2, 4.4]])
Using numpy specifically.
This is a bit tricky to get done in a fully vectorised way in NumPy. Here's one option:
import numpy

# load the file as a structured array with two named fields
a = numpy.genfromtxt("a.csv", names=["time", "speed"], skip_header=1)
a.sort()
unique_times = numpy.unique(a["time"])
# rightmost entry for each time value, i.e. the one with the largest speed
indices = a["time"].searchsorted(unique_times, side="right") - 1
result = a[indices]
This will load the data into a one-dimensional array with two fields and sort it first. The result is an array whose entries are grouped by time, with the biggest speed value always being the last in each group. We then determine the unique time values that occur and find the rightmost entry in the array for each time value.
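If you need the plain two-column array from the question rather than a structured array, one way (a small sketch building on result above) is to stack the two fields:
import numpy

# combine the two fields into a regular (n, 2) array: [[time, max_speed], ...]
out = numpy.column_stack([result["time"], result["speed"]])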
pandas fits nicely for this kind of stuff:
>>> from io import StringIO
>>> import pandas as pd
>>> df = pd.read_table(StringIO("""\
... time speed
... 0 2.3
... 0 3.4
... 0 4.1
... 0 2.1
... 1 1.3
... 1 3.5
... 1 5.1
... 1 1.1
... 2 2.3
... 2 2.4
... 2 4.4
... 2 3.9
... """), delim_whitespace=True)
>>> df
time speed
0 0 2.3
1 0 3.4
2 0 4.1
3 0 2.1
4 1 1.3
5 1 3.5
6 1 5.1
7 1 1.1
8 2 2.3
9 2 2.4
10 2 4.4
11 2 3.9
[12 rows x 2 columns]
once you have the dataframe, all you need is to group by time and aggregate with the maximum of speed:
>>> df.groupby('time')['speed'].aggregate(max)
time
0 4.1
1 5.1
2 4.4
Name: speed, dtype: float64
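And if you want the exact array layout asked for in the question (a small sketch continuing from df above), keeping time as a column and converting to NumPy should give something like:
>>> df.groupby('time', as_index=False)['speed'].max().to_numpy()
array([[0. , 4.1],
       [1. , 5.1],
       [2. , 4.4]])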
