I have data flowing into a CSV file daily that shows the number of pieces being manufactured, and I want to clearly show the daily % increase in pieces produced.
I have tried transpose() and unstack(), but have not been able to solve this.
Here is what the data looks like:
The output should be something like this:
How should I get this done?
You would need Series.pct_change() and Series.shift():
# put the next day's date alongside each row
df.insert(1, 'Day2', df.Day.shift(-1))
# pct_change looks backwards, so shift the result up one row to show the change to the next day
df['Percent_change'] = (df.Peice_Produced.pct_change() * 100).shift(-1).fillna(0).round(2)
print(df)
Day Day2 Peice_Produced Percent_change
0 1/1/17 1/2/17 10 -50.00
1 1/2/17 1/3/17 5 200.00
2 1/3/17 1/4/17 15 -60.00
3 1/4/17 1/5/17 6 250.00
4 1/5/17 1/6/17 21 -66.67
5 1/6/17 1/7/17 7 300.00
6 1/7/17 1/8/17 28 -71.43
7 1/8/17 1/9/17 8 350.00
8 1/9/17 1/10/17 36 -75.00
9 1/10/17 1/11/17 9 400.00
10 1/11/17 NaN 45 0.00
I admit I do not fully understand your intent. Nevertheless, I may have a solution as I understand it.
Use the diff() function to find the discrete difference.
Your simulated DataFrame:
>>> df
Day Peice_Produced
0 1/1/17 10
1 1/2/17 5
2 1/3/17 15
3 1/4/17 6
4 1/5/17 21
5 1/6/17 7
6 1/7/17 28
7 1/8/17 8
8 1/9/17 36
9 1/10/17 9
10 1/11/17 45
Solution: one way of doing it:
>>> df['Day_over_day%'] = df.Peice_Produced.diff(periods=1).fillna(0).astype(str) + '%'
>>> df
Day Peice_Produced Day_over_day%
0 1/1/17 10 0.0%
1 1/2/17 5 -5.0%
2 1/3/17 15 10.0%
3 1/4/17 6 -9.0%
4 1/5/17 21 15.0%
5 1/6/17 7 -14.0%
6 1/7/17 28 21.0%
7 1/8/17 8 -20.0%
8 1/9/17 36 28.0%
9 1/10/17 9 -27.0%
10 1/11/17 45 36.0%
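Note that diff() gives the absolute day-over-day change rather than a percentage. A rough sketch (same column names as above) of turning it into a true percentage change by dividing by the previous day's value:
# absolute change divided by the previous day's value gives the % change
pct = (df.Peice_Produced.diff() / df.Peice_Produced.shift(1) * 100).fillna(0).round(2)
df['Day_over_day%'] = pct.astype(str) + '%'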
You can just add a calculated column. I'm assuming you are storing this data in a pandas DataFrame called df. You can do this simply with:
df['change'] = (df['Pieces Produced'] / df['Pieces Produced'].shift(1)) - 1
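For reference, this is the same quantity pct_change() returns; a minimal sketch, keeping this answer's assumed 'Pieces Produced' column name and expressing it as a rounded percentage:
# pct_change() is (x / x.shift(1)) - 1; multiply by 100 for a percentage
df['change'] = (df['Pieces Produced'].pct_change() * 100).round(2)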
Related
I have a DataFrame with a column B as follows:
A B
0 0 20.00
1 1 35.00
2 2 75.00
3 3 29.00
4 4 125.00
5 5 16.00
6 6 52.50
7 7 NaN
8 8 NaN
9 9 NaN
10 10 NaN
11 11 NaN
12 12 NaN
13 13 239.91
14 14 22.87
15 15 52.74
16 16 37.20
17 17 27.44
18 18 57.01
19 19 29.88
I want to change the values of the column as follows:
if 0 < B < 10.0, replace the cell value of B with "0 to 10"
if 10.1 < B < 20.0, replace the cell value of B with "10 to 20"
and continue like this until the maximum range is reached.
I have tried:
ds['B'] = np.where(ds['B'].between(10.0,20.0), "10 to 20", ds['B'])
But once I perform this operation, column B contains the string "10 to 20", so I cannot perform this operation again for the remaining values of the DataFrame. After this step, the DataFrame looks like this:
A B
0 0 10 to 20
1 1 35.0
2 2 75.0
3 3 29.0
4 4 125.0
5 5 10 to 20
6 6 52.5
7 7 nan
8 8 nan
9 9 nan
10 10 nan
11 11 nan
12 12 nan
13 13 239.91
14 14 22.87
15 15 52.74
16 16 37.2
17 17 27.44
18 18 57.01
19 19 29.88
And the following line: ds['B'] = np.where(ds['B'].between(20.0,30.0), "20 to 30", ds['B']) will throw TypeError: '>=' not supported between instances of 'str' and 'float'
How can I code this to change all of the values in the DataFrame to these range strings all at once?
Build your bins and labels and use pd.cut:
bins = np.arange(0, df["B"].max() // 10 * 10 + 20, 10).astype(int)
labels = [' to '.join(t) for t in zip(bins[:-1].astype(str), bins[1:].astype(str))]
df["B"] = pd.cut(df["B"], bins=bins, labels=labels)
>>> df
A B
0 0 10 to 20
1 1 30 to 40
2 2 70 to 80
3 3 20 to 30
4 4 120 to 130
5 5 10 to 20
6 6 50 to 60
7 7 NaN
8 8 NaN
9 9 NaN
10 10 NaN
11 11 NaN
12 12 NaN
13 13 230 to 240
14 14 20 to 30
15 15 50 to 60
16 16 30 to 40
17 17 20 to 30
18 18 50 to 60
19 19 20 to 30
This can be done with much less code, since it is actually just a matter of string formatting.
ds['B'] = ds['B'].apply(lambda x: f'{int(x/10) if x>=10 else ""}0 to {int(x/10)+1}0' if pd.notnull(x) else x)
You can create a custom function that maps each value to a range string (for example, 19.0 maps to "10 to 20") and then apply this function to each row.
I've written the code so that the minimum and maximum of the range are derived from the DataFrame and are multiples of 10.
import numpy as np
import pandas as pd

# copy and paste your DataFrame
ds = pd.read_clipboard()

# floor to the nearest multiple of 10
ds_min = ds['B'].min() // 10 * 10
# ceiling to the nearest multiple of 10
ds_max = round(ds['B'].max(), -1)
# bin edges in steps of 10; linspace expects an integer count of points
ranges = np.linspace(ds_min, ds_max, int((ds_max - ds_min) / 10) + 1)

def map_value_to_string(value):
    for idx in range(1, len(ranges)):
        low_value, high_value = ranges[idx - 1], ranges[idx]
        if low_value < value <= high_value:
            return f"{int(low_value)} to {int(high_value)}"
    return None

ds['B'] = ds['B'].apply(map_value_to_string)
Output:
>>> ds
A B
0 0 10 to 20
1 1 30 to 40
2 2 70 to 80
3 3 20 to 30
4 4 120 to 130
5 5 10 to 20
6 6 50 to 60
7 7 None
8 8 None
9 9 None
10 10 None
11 11 None
12 12 None
13 13 230 to 240
14 14 20 to 30
15 15 50 to 60
16 16 30 to 40
17 17 20 to 30
18 18 50 to 60
19 19 20 to 30
Assume I have a DataFrame as shown below. It records the number of events that occurred in each second.
Time events_occured
1 2
2 3
3 7
4 4
5 6
6 3
7 86
8 26
9 7
10 26
. .
. .
. .
996 56
997 26
998 97
999 58
1000 34
Now I need to get the cumulative occurrences of events in every 5 seconds: in the first 5 seconds 22 events occurred, from 6 to 10 seconds 148 events occurred, and so on.
Like this:
In [647]: df['cumulative'] = df.events_occured.groupby(df.index // 5).cumsum()
In [648]: df
Out[648]:
Time events_occured cumulative
0 1 2 2
1 2 3 5
2 3 7 12
3 4 4 16
4 5 6 22
5 6 3 3
6 7 86 89
7 8 26 115
8 9 7 122
9 10 26 148
If there are missing values of Time, using df.index could produce errors in the logic, so use df['Time'] instead. This also works if Time starts at any value N and if there are missing values greater than N:
GROUP_SIZE = 5
df['cumulative'] = df.events_occured\
.groupby(df['Time'].sub(df['Time'].min()) // GROUP_SIZE).cumsum()
print(df)
Time events_occured cumulative
0 1 2 2
1 2 3 5
2 3 7 12
3 4 4 16
4 5 6 22
5 6 3 3
6 7 86 89
7 8 26 115
8 9 7 122
9 10 26 148
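If only the per-window totals are needed (one row per 5-second block) rather than a running total, a plain sum per group gives them directly; a minimal sketch reusing the same columns:
# total events per 5-second window, keyed by window number
totals = df.groupby((df['Time'] - df['Time'].min()) // 5)['events_occured'].sum()
print(totals)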
I have a DataFrame, "df", with a datetime index. Here is a rough snapshot of what it looks like:
V1 V2 V3 V4 V5
1/12/2008 4 15 11 7 1
1/13/2008 5 2 8 7 1
1/14/2008 13 13 9 6 4
1/15/2008 14 15 12 9 3
1/16/2008 1 10 2 12 15
1/17/2008 10 5 9 9 1
1/18/2008 13 11 5 7 2
1/19/2008 2 6 7 9 6
1/20/2008 5 4 14 3 7
1/21/2008 11 11 4 7 15
1/22/2008 9 4 15 10 3
1/23/2008 2 13 13 10 3
1/24/2008 12 15 14 12 8
1/25/2008 1 4 2 6 15
Some of the days in the index are weekends and holidays.
I would like to move all dates, in the datetime index of "df", to their respective closest (US) business day (i.e. Mon-Friday, excluding holidays).
How would you recommend I do this? I am aware that pandas has a "timeseries offset" facility for this, but I haven't been able to find an example that walks a novice reader through it.
Can you help?
I am not familiar with this class, but after looking at the source code it seems fairly straightforward to achieve this. Keep in mind that it picks the next business day, meaning Saturday turns into Monday rather than Friday. Also, making your index non-unique will decrease performance on your DataFrame, so I suggest assigning these values to a new column.
The one prerequisite is that your index must be one of these three types: datetime, timedelta, or pd.tseries.offsets.Tick.
offset = pd.tseries.offsets.CustomBusinessDay(n=0)
df.assign(
closest_business_day=df.index.to_series().apply(offset)
)
V1 V2 V3 V4 V5 closest_business_day
2008-01-12 4 15 11 7 1 2008-01-14
2008-01-13 5 2 8 7 1 2008-01-14
2008-01-14 13 13 9 6 4 2008-01-14
2008-01-15 14 15 12 9 3 2008-01-15
2008-01-16 1 10 2 12 15 2008-01-16
2008-01-17 10 5 9 9 1 2008-01-17
2008-01-18 13 11 5 7 2 2008-01-18
2008-01-19 2 6 7 9 6 2008-01-21
2008-01-20 5 4 14 3 7 2008-01-21
2008-01-21 11 11 4 7 15 2008-01-21
2008-01-22 9 4 15 10 3 2008-01-22
2008-01-23 2 13 13 10 3 2008-01-23
2008-01-24 12 15 14 12 8 2008-01-24
2008-01-25 1 4 2 6 15 2008-01-25
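If you instead want the truly nearest business day (Saturday rolled back to Friday, Sunday forward to Monday), one possible sketch is to roll each date in both directions with NumPy's busday_offset and keep the closer result. The nearest_business_day helper below is hypothetical, not part of pandas, and US holidays would still need to be supplied via the holidays argument:
import numpy as np
import pandas as pd

def nearest_business_day(idx):
    # roll every date both forward and backward to a weekday, then keep whichever is closer
    days = idx.values.astype('datetime64[D]')
    fwd = np.busday_offset(days, 0, roll='forward')
    back = np.busday_offset(days, 0, roll='backward')
    return pd.DatetimeIndex(np.where((fwd - days) <= (days - back), fwd, back))

df = df.assign(closest_business_day=nearest_business_day(df.index))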
I have created a DataFrame from an Excel sheet, then filtered it to values in the [Date_rank] column less than 10. The resulting DataFrame is called filtered.
I've then used g = filtered.groupby("Well_name") to segregate the data by well.
Now that I have the data grouped by Well_name, how can I find the standard deviation of [RandomNumber] within each group (giving me the stdev of RandomNumber for both wells)? Perhaps it was not necessary to use the groupby function?
df = pd.read_csv('here.csv')
print(df)
filtered = df[df['Date_rank'] < 10]  # filter the DataFrame to Date_rank less than 10
print(filtered)
g = filtered.groupby('Well_name')  # group the data by well name
Here is my data
Well_name Date_rank RandomNumber
0 Velta 1 4
1 Velta 2 5
2 Velta 3 2
3 Velta 4 4
4 Velta 5 4
5 Velta 6 9
6 Velta 7 0
7 Velta 8 9
8 Velta 9 1
9 Velta 10 3
10 Velta 11 8
11 Velta 12 3
12 Velta 13 10
13 Velta 14 10
14 Velta 15 0
15 Ronnie 1 8
16 Ronnie 2 1
17 Ronnie 3 6
18 Ronnie 4 2
19 Ronnie 5 2
20 Ronnie 6 9
21 Ronnie 7 6
22 Ronnie 8 5
23 Ronnie 9 2
24 Ronnie 10 1
25 Ronnie 11 3
26 Ronnie 12 3
27 Ronnie 13 4
28 Ronnie 14 0
29 Ronnie 15 4
You should be able to solve the problem with groupby() as you stated. The code you should use is the following:
g = filtered.groupby('Well_name')['RandomNumber'].std()
Or using .agg():
g = filtered.groupby('Well_name').agg({'RandomNumber': 'std'})
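Note that pandas' std() computes the sample standard deviation (ddof=1), while np.std defaults to the population version (ddof=0), so the two can give slightly different numbers; a minimal sketch of both, reusing the filtered frame from the question:
sample_std = filtered.groupby('Well_name')['RandomNumber'].std()            # ddof=1, pandas default
population_std = filtered.groupby('Well_name')['RandomNumber'].std(ddof=0)  # matches np.std's default
print(sample_std)
print(population_std)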
Happy new year to all!
I guess this question might be an easy one, but I can't figure it out.
How can I turn hourly data into 15-minute buckets quickly in Python (see table below)? Basically, the left column should be converted into the right one: just duplicate each hourly value four times and dump it into a new column.
Thanks for the support!
Cheers!
Hourly 15mins
1 28.90 1 28.90
2 28.88 1 28.90
3 28.68 1 28.90
4 28.67 1 28.90
5 28.52 2 28.88
6 28.79 2 28.88
7 31.33 2 28.88
8 32.60 2 28.88
9 42.00 3 28.68
10 44.00 3 28.68
11 44.00 3 28.68
12 44.00 3 28.68
13 39.94 4 28.67
14 39.90 4 28.67
15 38.09 4 28.67
16 39.94 4 28.67
17 44.94 5 28.52
18 66.01 5 28.52
19 49.45 5 28.52
20 48.37 5 28.52
21 38.02 6 28.79
22 34.55 6 28.79
23 33.33 6 28.79
24 32.05 6 28.79
7 31.33
7 31.33
7 31.33
7 31.33
You could also do this by constructing a new DataFrame and using NumPy methods.
import numpy as np
import pandas as pd

pd.DataFrame(np.column_stack((np.arange(df.shape[0]).repeat(4, axis=0),
                              np.array(df).repeat(4, axis=0))),
             columns=['hours', '15_minutes'])
which returns
hours 15_minutes
0 0 28.90
1 0 28.90
2 0 28.90
3 0 28.90
4 1 28.88
5 1 28.88
...
91 22 33.33
92 23 32.05
93 23 32.05
94 23 32.05
95 23 32.05
column_stack stacks the arrays as columns. np.arange(df.shape[0]).repeat(4, axis=0) gets the hour IDs by repeating 0 through 23 four times each, and the values for each 15-minute slot are constructed in a similar manner. pd.DataFrame then builds the DataFrame, and the column names are added as well.
Create a datetime-like index for your DataFrame, then you can use resample and forward-fill each hourly value into its 15-minute buckets:
series.resample('15T').ffill()
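A minimal sketch of the same idea, assuming the hourly values sit in a column named Hourly and picking an arbitrary start date purely for illustration:
import pandas as pd

# give the frame an hourly DatetimeIndex (start date chosen only for illustration)
df.index = pd.date_range('2017-01-01', periods=len(df), freq='H')
# upsample to 15-minute buckets and carry each hourly value forward
quarter_hourly = df['Hourly'].resample('15T').ffill()
print(quarter_hourly.head(8))
Note that the final hour only contributes a single bucket this way, because resample stops at the last timestamp; appending one extra hourly row before resampling covers the last hour fully.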