I have the following data (just a snippet). They start at 0 min and end at 65 min.
R.Time (min) Intensity 215 Intensity 260 Intensity 280
0 0.00000 0 0 0
1 0.01067 0 0 0
2 0.02133 0 0 0
3 0.03200 0 0 0
and
Time %B c B c KCl
0 16.01 0.00 0.0000 0.00
1 16.01 0.00 0.0000 0.00
2 17.00 0.85 0.0085 4.25
3 18.00 1.70 0.0170 8.50
How can I create a DataFrame with the time [min] column and all the other columns aligned at the correct row for that time? I assume I need to tell pandas which column holds the time and how to merge, and it will then sort the rows. But I also need to combine rows when the time is the same.
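A minimal sketch of one way to do this with an outer merge. The variable names (df1, df2) and the reconstructed column values are assumptions based on the snippets above; the key step is renaming both time columns to a common name and merging on it, which combines rows with equal times and interleaves the rest:

```python
import pandas as pd

# Hypothetical frames standing in for the two tables above.
df1 = pd.DataFrame({"R.Time (min)": [0.00000, 0.01067, 0.02133, 0.03200],
                    "Intensity 215": [0, 0, 0, 0],
                    "Intensity 260": [0, 0, 0, 0],
                    "Intensity 280": [0, 0, 0, 0]})
df2 = pd.DataFrame({"Time": [0, 1, 2, 3],
                    "%B": [16.01, 16.01, 17.00, 18.00],
                    "c KCl": [0.00, 0.00, 4.25, 8.50]})

# Rename both time columns to a common name, outer-merge on it, then sort.
# Rows with the same time value are combined into one row; times present in
# only one table get NaN in the other table's columns.
merged = (df1.rename(columns={"R.Time (min)": "time"})
             .merge(df2.rename(columns={"Time": "time"}), on="time", how="outer")
             .sort_values("time")
             .reset_index(drop=True))
```

Only time 0 appears in both snippets here, so the merged frame has 7 rows; with the full 0-65 min data the shared times would collapse into single rows the same way.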
What I want to do is add the ratios in the 1st table based on the correct child in the 2nd table.
So for example for the 1st observation I want to do 0.52 (1st child 16-17)+0.84 (2nd child 11-13)+0.78 (3rd child 0-3)=2.14 and create a new column for those values.
There are no observations with more than 1 child in any age range. The "Child_18-older_18" and "Pregnant" columns should be treated in the ratio table as a child aged 16-17 and 0-3, respectively. Regarding the 2nd table: the entire dataframe consists of 4000 observations; these 5 observations were picked randomly.
Age    First_child_ratio  Second_child_ratio  Third_child_ratio  Fourth_child_ratio
0-3    1.0                0.72                0.78               0.66
4-6    0.83               0.6                 0.65               0.54
7-10   0.77               0.69                0.73               0.59
11-13  0.88               0.84                0.87               0.86
14-15  0.52               0.52                0.68               0.68
16-17  0.52               0.52                0.52               0.52
Pregnant  Child_0-3  Child_4-6  Child_7-10  Child_11-13  Child_14-15  Child_16-17  Child_18-older_18  No_Child
0         1          0          0           1            0            1            0                  0
0         1          0          1           0            0            0            0                  0
0         0          1          0           1            0            1            1                  0
1         0          1          0           0            0            0            0                  0
0         0          0          0           0            0            0            0                  1
To explain clearly: my data has temperature values, and when the sensor is not registering any values it gives continuous 0s. My sampling interval is in ms, so each dropout produces a run of 0s. Each time a run of 0s ends, I want the counter to increment from 1 to 2 to 3 and so on.
Essentially, I want the counter column as shown below:
No.  Temp  Count
1    80.0  0
2    81.6  0
3    0.00  0
4    0.00  0
5    0.00  0
6    81.6  1
7    80.0  0
8    83.7  0
9    0.00  0
10   0.00  0
11   0.00  0
12   81.6  1
13   81.6  0
14   80.0  0
15   83.7  0
16   0.00  0
17   0.00  0
18   0.00  0
19   81.6  1
I was thinking of
df['count'] = df.groupby((df['col'] == df['col'].shift(2)).cumsum()).cumcount()+1
But there has to be an easier way. Also, this can misfire if my temperature values coincidentally match this pattern.
You can use a simple boolean flag to track whether the last value was zero, and increment a counter whenever the values stop being zero.
I saw that you used a pandas DataFrame, but since your question says you will be working with real-time data, I kept the code generic so that you can apply it to your own use case.
Instead of iterating through a list with a for loop, you can place the logic that connects to your sensor. And of course you can replace the print statement at the bottom with whatever you need.
nums = [80.0, 81.6, 0.0, 0.0, 0.0, 81.6, 80.0, 83.7, 0.0, 0.0, 0.0, 81.6]
zeroCount = 1
isZero = False

for num in nums:  # can be whatever type of iteration you need
    if num == 0:
        isZero = True
    if isZero and num != 0:
        # The run of zeros just ended: bump the counter.
        zeroCount += 1
        isZero = False
    print("%.1f %d" % (num, zeroCount))
Running that code produces a counter that increments each time a run of zeros ends:
80.0 1
81.6 1
0.0 1
0.0 1
0.0 1
81.6 2
80.0 2
83.7 2
0.0 2
0.0 2
0.0 2
81.6 3
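If the data does end up in a pandas DataFrame, the 0/1 flag column tabulated in the question (1 on the first non-zero reading after a run of zeros, 0 otherwise) can also be computed without an explicit loop. This is a sketch assuming the column is named Temp as in the question:

```python
import pandas as pd

df = pd.DataFrame({"Temp": [80.0, 81.6, 0.0, 0.0, 0.0, 81.6,
                            80.0, 83.7, 0.0, 0.0, 0.0, 81.6]})

# Flag rows where the current reading is non-zero but the previous one was zero.
prev_zero = df["Temp"].shift(1) == 0
df["Count"] = ((df["Temp"] != 0) & prev_zero).astype(int)
```

A cumulative run counter like the loop above would then just be `df["Count"].cumsum() + 1`.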
I have two data frames, each with 672 rows of data.
I want to subtract the values in a column of one data frame from the values in a column of the other data frame. The result can either be a new data frame, or a series, it does not really matter to me. The size of the result should obviously be 672 rows or 672 values.
I have the code:
stock_returns = beta_portfolios_196307_201906.iloc[:,6] - \
fama_french_factors_196307_201906.iloc[:,4]
I also tried
stock_returns = beta_portfolios_196307_201906["Lo 10"] + \
fama_french_factors_196307_201906["RF"]
For both, the result is a Series of size (1116,), and most of the values in the Series are NaN, with only a few being numeric.
Could someone please explain why this is happening and how I can get the result I want?
Here is the .head() of my data frames:
beta_portfolios_196307_201906.head()
Date Lo 20 Qnt 2 Qnt 3 Qnt 4 ... Dec 6 Dec 7 Dec 8 Dec 9 Hi 10
0 196307 1.13 -0.08 -0.97 -0.94 ... -1.20 -0.49 -1.39 -1.94 -0.77
1 196308 3.66 4.77 6.46 6.23 ... 7.55 7.57 4.91 9.04 10.47
2 196309 -2.78 -0.76 -0.78 -0.81 ... -0.27 -0.63 -1.00 -1.92 -3.68
3 196310 0.74 3.56 2.03 5.70 ... 1.78 6.63 4.78 3.10 3.01
4 196311 -0.63 -0.26 -0.81 -0.92 ... -0.69 -1.32 -0.51 -0.20 0.52
[5 rows x 16 columns]
fama_french_factors_196307_201906.head()
Date Mkt-RF SMB HML RF
444 196307 -0.39 -0.56 -0.83 0.27
445 196308 5.07 -0.94 1.67 0.25
446 196309 -1.57 -0.30 0.18 0.27
447 196310 2.53 -0.54 -0.10 0.29
448 196311 -0.85 -1.13 1.71 0.27
One last thing I should add: At first, all of the values in both data frames were strings, so I had to convert the values to numeric values using:
beta_portfolios_196307_201906 = beta_portfolios_196307_201906.apply(pd.to_numeric, errors='coerce')
Let's explain the issue with an example of just 5 rows.
When both DataFrames, a and b, have the same indices, e.g.:
a b
Lo 10 Xxx RF Yyy
0 10 1 0 9 1
1 20 1 1 8 1
2 30 1 2 7 1
3 40 1 3 6 1
4 50 1 4 5 1
The result of subtraction a['Lo 10'] - b['RF'] is:
0 1
1 12
2 23
3 34
4 45
dtype: int64
Rows of both DataFrames are aligned on the index and then corresponding
elements are subtracted.
And now take a look at the case when b has some other indices, e.g.:
RF Yyy
0 9 1
1 8 1
2 7 1
8 6 1
9 5 1
i.e. the last 2 rows have indices 8 and 9, which are absent from a.
Then the result of the same subtraction is:
0 1.0
1 12.0
2 23.0
3 NaN
4 NaN
8 NaN
9 NaN
dtype: float64
i.e.:
- for rows with index 0, 1 and 2 the result is as before, since both DataFrames have these indices;
- if some index is present in only one DataFrame, the result is NaN;
- the number of rows in this result is bigger.
If you want to align both columns by position instead of by the index, you
can run a.reset_index()['Lo 10'] - b.reset_index()['RF'], getting the
result as in the first case.
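The explanation above can be reproduced in a few lines; the frames a and b are the small illustrative ones from this answer, not the asker's real data:

```python
import pandas as pd

# Same setup as the explanation above: b's last two indices are absent from a.
a = pd.DataFrame({"Lo 10": [10, 20, 30, 40, 50]}, index=[0, 1, 2, 3, 4])
b = pd.DataFrame({"RF": [9, 8, 7, 6, 5]}, index=[0, 1, 2, 8, 9])

# Index-aligned subtraction: NaN wherever an index exists in only one frame.
aligned = a["Lo 10"] - b["RF"]

# Positional subtraction: reset both indices first, then subtract row by row.
by_position = (a.reset_index(drop=True)["Lo 10"]
               - b.reset_index(drop=True)["RF"])
```

Here `aligned` has 7 rows (the union of the indices) with 4 NaNs, while `by_position` gives the 5 values 1, 12, 23, 34, 45.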
I want to plot my time series data (of bikes) by showing the mean number of bikes per hour and weekday.
here is an extract of the initial data :
date nb_bike
2019-09-20 12:00:00 15
2019-09-20 13:00:00 10
2019-09-20 14:00:00 17
2019-09-20 15:00:00 12
2019-09-20 16:00:00 24
I computed this mean per weekday and hour that way :
data_b = data_b_init.groupby([data_b_init.index.weekday.rename('wkday'),data_b_init.index.hour.rename('hour')]).mean()
data_b = data_b.reset_index()
So I want to plot these data (here an extract) :
data_b
wkday hour nb_bike_mean
0 0 0.44
0 1 0.11
0 2 0.00
0 3 0.11
0 4 0.00
0 5 0.67
0 6 0.78
0 7 6.44
0 8 13.83
0 9 9.78
I would like to do something like the question "How to plot data per hour, grouped by days?" (especially a graph like the one shown there), but I can't find how to do it while keeping just the weekday and hour information, not the days.
For example, I tried this code :
sns.lineplot(x='hour',y='nb_bike_mean',data=data_b, hue='wkday')
but it isn't what I want, because I want both wkday and hour on the x axis.
Do you know a way to plot with 2 levels on the x axis?
Or better, to have wkday and hour recognized as a datetime and joined as the index?
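One common workaround, sketched here under the assumption that data_b looks like the extract above (only Monday's first hours are reproduced), is to collapse the two levels into a single hour-of-week position (wkday * 24 + hour) and label the x axis at the day boundaries:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for the sketch
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for data_b, built from the extract in the question.
data_b = pd.DataFrame({"wkday": [0] * 10,
                       "hour": list(range(10)),
                       "nb_bike_mean": [0.44, 0.11, 0.00, 0.11, 0.00,
                                        0.67, 0.78, 6.44, 13.83, 9.78]})

# Collapse (wkday, hour) into one continuous hour-of-week x coordinate.
data_b["hour_of_week"] = data_b["wkday"] * 24 + data_b["hour"]

fig, ax = plt.subplots()
ax.plot(data_b["hour_of_week"], data_b["nb_bike_mean"])
ax.set_xticks(range(0, 7 * 24, 24))
ax.set_xticklabels(["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])
ax.set_xlabel("weekday / hour")
ax.set_ylabel("mean number of bikes")
```

With the full week of data this gives one continuous line over 168 hourly points, with the weekday names marking each 24-hour block, which approximates the two-level axis asked about.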
I have got a dataframe of several hundred thousand rows. Which is of the following format:
time_elapsed cycle
0 0.00 1
1 0.50 1
2 1.00 1
3 1.30 1
4 1.50 1
5 0.00 2
6 0.75 2
7 1.50 2
8 3.00 2
I want to create a third column giving, for each row, the elapsed time as a percentage of the total duration of that cycle (i.e. up to the next time_elapsed = 0). To give something like:
time_elapsed cycle percentage
0 0.00 1 0
1 0.50 1 33
2 1.00 1 75
3 1.30 1 87
4 1.50 1 100
5 0.00 2 0
6 0.75 2 25
7 1.50 2 50
8 3.00 2 100
I'm not fussed about the number of decimal places, I've just excluded them for ease here.
I started going along this route, but I keep getting errors.
data['percentage'] = data['time_elapsed'].sub(data.groupby(['cycle'])['time_elapsed'].transform(lambda x: x*100/data['time_elapsed'].max()))
I think it's the lambda function causing errors, but I'm not sure what I should do to change it. Any help is much appreciated :)
Use Series.div for division instead of sub for subtraction; the solution then simplifies: get only the max per group, multiply by Series.mul, round with Series.round if necessary, and finally convert to integers with Series.astype:
s = data.groupby(['cycle'])['time_elapsed'].transform('max')
data['percentage'] = data['time_elapsed'].div(s).mul(100).round().astype(int)
print (data)
time_elapsed cycle percentage
0 0.00 1 0
1 0.50 1 33
2 1.00 1 67
3 1.30 1 87
4 1.50 1 100
5 0.00 2 0
6 0.75 2 25
7 1.50 2 50
8 3.00 2 100