I have a DataFrame whose x-axis values are the column labels. However, columns where no data exists were omitted, so the steps between columns are uneven. For instance:
   0.1  0.2  0.5  ...
0    1    4    7  ...
1    2    5    8  ...
2    3    6    9  ...
I want to plot each of those rows with the x-axis np.arange(0, max(df.columns), step=0.1), and also a combined plot of them all. Is there an easy way to achieve this with matplotlib.pyplot?
plt.plot(np.arange(0, max(df.columns), step=0.1), new_data)
Any help would be appreciated.
If I understood you correctly, your final dataframe is supposed to look like this:
   0.0  0.1  0.2  0.3  0.4  0.5
0  0.0    1    4  0.0  0.0    7
1  0.0    2    5  0.0  0.0    8
2  0.0    3    6  0.0  0.0    9
which can be generated (and then also plotted) like this:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame({0.1:[1,2,3],0.2:[4,5,6],0.5:[7,8,9]})
## make sure to actually include the maximum value (add one step)
# or alternatively rather use np.linspace() with appropriate number of points
xs = np.arange(0, max(df.columns) +0.1, step=0.1)
df = df.reindex(columns=xs, fill_value=0.0)
plt.plot(df.T)
plt.show()
which yields a line plot with one line per row of the DataFrame.
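As the comment in the code above hints, np.linspace is an alternative worth sketching (my addition): it includes the endpoint exactly and avoids floating-point drift from repeated 0.1 steps:
n_points = int(round(max(df.columns) / 0.1)) + 1
xs = np.linspace(0, max(df.columns), n_points)
df = df.reindex(columns=xs, fill_value=0.0)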
I have two time-based datasets. One is the accelerometer's measurement data; the other is label data.
For example,
accelerometer.csv
timestamp,X,Y,Z
1.0,0.5,0.2,0.0
1.1,0.2,0.3,0.0
1.2,-0.1,0.5,0.0
...
2.0,0.9,0.8,0.5
2.1,0.4,0.1,0.0
2.2,0.3,0.2,0.3
...
label.csv
start,end,label
1.0,2.0,"running"
2.0,3.0,"exercising"
Maybe these data are unrealistic, since they are just examples.
In this case, I want to merge them as below:
merged.csv
timestamp,X,Y,Z,label
1.0,0.5,0.2,0.0,"running"
1.1,0.2,0.3,0.0,"running"
1.2,-0.1,0.5,0.0,"running"
...
2.0,0.9,0.8,0.5,"exercising"
2.1,0.4,0.1,0.0,"exercising"
2.2,0.3,0.2,0.3,"exercising"
...
I'm using pandas' iterrows, but the real data has more than 10,000 rows, so the program's running time is very long. I think there must be a way to do this without iteration.
My code looks like this:
import pandas as pd
acc = pd.read_csv("./accelerometer.csv")
labeled = pd.read_csv("./label.csv")
for index, row in labeled.iterrows():
    start = row["start"]
    end = row["end"]
    acc.loc[(start <= acc["timestamp"]) & (acc["timestamp"] < end), "label"] = row["label"]
How can I modify my code to get rid of "for" iteration?
If the times in accelerometer don't go outside the boundaries of the times in label, you could use merge_asof:
accmerged = pd.merge_asof(acc, labeled, left_on='timestamp', right_on='start', direction='backward')
Output (for the sample data in your question):
timestamp X Y Z start end label
0 1.0 0.5 0.2 0.0 1.0 2.0 running
1 1.1 0.2 0.3 0.0 1.0 2.0 running
2 1.2 -0.1 0.5 0.0 1.0 2.0 running
3 2.0 0.9 0.8 0.5 2.0 3.0 exercising
4 2.1 0.4 0.1 0.0 2.0 3.0 exercising
5 2.2 0.3 0.2 0.3 2.0 3.0 exercising
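One caveat (my addition, not part of the original answer): merge_asof with direction='backward' matches only on start, so a timestamp at or past a label's end would still inherit that label. If your accelerometer data can run past the last interval, you can blank those rows out before dropping the helper columns:
accmerged.loc[accmerged['timestamp'] >= accmerged['end'], 'label'] = None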
Note you can remove the start and end columns with drop if you want to:
accmerged = accmerged.drop(['start', 'end'], axis=1)
Output:
timestamp X Y Z label
0 1.0 0.5 0.2 0.0 running
1 1.1 0.2 0.3 0.0 running
2 1.2 -0.1 0.5 0.0 running
3 2.0 0.9 0.8 0.5 exercising
4 2.1 0.4 0.1 0.0 exercising
5 2.2 0.3 0.2 0.3 exercising
The program I'm writing simulates rolling 4 dice and adds the results together into a "Total" column. I'm trying to print the outcomes for 10,000 dice rolls, but for some reason the value of each die drops to 0.0 partway through and stays that way until the end. Could anyone tell me what's going wrong here and how to fix it? Thanks :)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(101)
four_dice = np.zeros([pow(10,4),5]) # 10,000 rows, 5 columns
n = 0
outcomes = [1,2,3,4,5,6]
for i in outcomes:
    for j in outcomes:
        for k in outcomes:
            for l in outcomes:
                four_dice[n,:] = [i,j,k,l,i+j+k+l]
                n += 1
four_dice_df = pd.DataFrame(four_dice,columns=('1','2','3','4','Total'))
print(four_dice_df) #print the table
OUTPUT
1 2 3 4 Total
0 1.0 1.0 1.0 1.0 4.0
1 1.0 1.0 1.0 2.0 5.0
2 1.0 1.0 1.0 3.0 6.0
3 1.0 1.0 1.0 4.0 7.0
4 1.0 1.0 1.0 5.0 8.0
... ... ... ... ... ...
9995 0.0 0.0 0.0 0.0 0.0
9996 0.0 0.0 0.0 0.0 0.0
9997 0.0 0.0 0.0 0.0 0.0
9998 0.0 0.0 0.0 0.0 0.0
9999 0.0 0.0 0.0 0.0 0.0
[10000 rows x 5 columns]
Does this work for what you want?
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(1,7,size=(10000,4)),columns = [1,2,3,4])
df['total'] = df.sum(axis=1)
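Note this draws 10,000 independent random rolls rather than enumerating every combination. For reproducible output you can seed the generator first, as the question's code does:
np.random.seed(101)  # any fixed seed makes the rolls repeatable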
You ran out of dice combinations. You made your table 10^4 rows long, but there are only 6^4 combinations. Any row from 1296 through 9999 will be 0, because that's the initialized value.
To fix this, cut your table at the proper value: pow(6, 4)
Response to OP comment:
Of course you can write a loop. In this case, the controlling factor should be the number of results you want; then you generate the rolls to fulfill your needs. The Pythonic way to do this is the itertools package: product will give you all the rolls in order; cycle will repeat the sequence until you stop asking (see the sketch below).
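A minimal sketch of that itertools approach (assuming, as in your original code, that 10,000 rows are wanted):
import itertools
import pandas as pd
outcomes = [1, 2, 3, 4, 5, 6]
# product() yields all 6**4 = 1296 ordered rolls; cycle() repeats them forever
rolls = itertools.cycle(itertools.product(outcomes, repeat=4))
rows = [list(r) + [sum(r)] for r, _ in zip(rolls, range(10000))]
four_dice_df = pd.DataFrame(rows, columns=['1', '2', '3', '4', 'Total'])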
However, the more obvious way for your current programming is perhaps to simply count in base 6:
digits = [1, 1, 1, 1]  # one counter per die
for i in range(10000):
    # Record your digits in the data frame
    ...
    # Add one for the next iteration; roll over if the die is already 6
    for idx, die in enumerate(digits):
        if die < 6:
            digits[idx] += 1
            break
        else:  # Reset die to 1 and continue to the next die
            digits[idx] = 1
This will increment the dice, left to right, until you either have one that doesn't need a reset to 1, or run out of dice.
Another possibility is to copy any of the many base-conversion functions available online. Convert your iteration counter i to base 6, take the lowest 4 digits (the quantity of dice), and add 1 to each digit.
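For illustration, a small sketch of that conversion (the helper name roll_from_index is mine):
def roll_from_index(i):
    # Interpret counter i in base 6; map digits 0-5 to die faces 1-6
    dice = []
    for _ in range(4):  # four dice
        i, digit = divmod(i, 6)
        dice.append(digit + 1)
    return dice
# roll_from_index(0) -> [1, 1, 1, 1]; roll_from_index(1) -> [2, 1, 1, 1]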
Assume a big set of data like
Height (m) My data
0 18 5.0
1 25 6.0
2 10 1.0
3 13 1.5
4 32 8.0
5 26 6.7
6 23 5.0
7 5 2.0
8 7 2.0
And I want to plot the average (and, if possible, the standard deviation) of "My data" as a function of height, separated into the ranges [0,5), [5,10), [10,15), and so on.
Any idea? I've tried different approaches and none of them work
If I understand you correctly:
# Precompute bins for pd.cut
bins = list(range(0, df['Height (m)'].max() + 5, 5))
# Cut Height into intervals which exclude the right endpoint,
# with bin edges at multiples of 5
df['HeightBin'] = pd.cut(df['Height (m)'], bins=bins, right=False)
# Within each bin, get mean, stdev (normalized by N-1 by default),
# and also show sample size to explain why some std values are NaN
df.groupby('HeightBin')['My data'].agg(['mean', 'std', 'count'])
mean std count
HeightBin
[0, 5) NaN NaN 0
[5, 10) 2.00 0.000000 2
[10, 15) 1.25 0.353553 2
[15, 20) 5.00 NaN 1
[20, 25) 5.00 NaN 1
[25, 30) 6.35 0.494975 2
[30, 35) 8.00 NaN 1
If I understand correctly, this is what you would like to do:
import pandas as pd
import numpy as np
bins = np.arange(0, 40, 5) # adjust as desired, but make sure the bins cover the maximum Height (32 here), or those rows are silently dropped
df_stats = pd.DataFrame(columns=['mean', 'st_dev']) # DataFrame for the results
df_stats['mean'] = df.groupby(pd.cut(df['Height (m)'], bins, right=False)).mean()['My data']
df_stats['st_dev'] = df.groupby(pd.cut(df['Height (m)'], bins, right=False)).std()['My data']
I have some timeseries data that basically contains information on price change period by period. For example, let's say:
df = pd.DataFrame(columns = ['TimeStamp','PercPriceChange'])
df.loc[:,'TimeStamp']=[1457280,1457281,1457282,1457283,1457284,1457285,1457286]
df.loc[:,'PercPriceChange']=[0.1,0.2,-0.1,0.1,0.2,0.1,-0.1]
so that df looks like
TimeStamp PercPriceChange
0 1457280 0.1
1 1457281 0.2
2 1457282 -0.1
3 1457283 0.1
4 1457284 0.2
5 1457285 0.1
6 1457286 -0.1
What I want to achieve is to calculate the overall price change before an increase/decrease streak ends, and store the value in the row where the streak started. That is, what I want is a column 'TotalPriceChange':
TimeStamp PercPriceChange TotalPriceChange
0 1457280 0.1 1.1 * 1.2 - 1 = 0.31
1 1457281 0.2 0
2 1457282 -0.1 -0.1
3 1457283 0.1 1.1 * 1.2 * 1.1 - 1 = 0.452
4 1457284 0.2 0
5 1457285 0.1 0
6 1457286 -0.1 -0.1
I can identify the starting points using something like:
df['turn'] = 0
df['PriceChange_L1'] = df['PercPriceChange'].shift(periods=1, freq=None, axis=0)
df.loc[ df['PercPriceChange'] * df['PriceChange_L1'] < 0, 'turn' ] = 1
to get
TimeStamp PercPriceChange turn
0 1457280 0.1 NaN or 1?
1 1457281 0.2 0
2 1457282 -0.1 1
3 1457283 0.1 1
4 1457284 0.2 0
5 1457285 0.1 0
6 1457286 -0.1 1
Given this column "turn", I need help proceeding with my quest (or perhaps we don't need this "turn" at all). I am pretty sure I can write a nested for-loop going through the entire DataFrame row by row, calculating what I need and populating the column 'TotalPriceChange', but given that I plan on doing this on a fairly large data set (think minute or hour data for couple of years), I imagine nested for-loops will be really slow.
Therefore, I just wanted to check with you experts to see if there is any efficient solution to my problem that I am not aware of. Any help would be much appreciated!
Thanks!
The calculation you are looking for looks like a groupby/product operation.
To set up the groupby operation, we need to assign a group value to each row. Taking the cumulative sum of the turn column gives the desired result:
df['group'] = df['turn'].cumsum()
# 0 0
# 1 0
# 2 1
# 3 2
# 4 2
# 5 2
# 6 3
# Name: group, dtype: int64
Now we can define the TotalPriceChange column (modulo a little cleanup work) as
df['PercPriceChange_plus_one'] = df['PercPriceChange']+1
df['TotalPriceChange'] = df.groupby('group')['PercPriceChange_plus_one'].transform('prod') - 1
Putting it all together:
import pandas as pd
df = pd.DataFrame({'PercPriceChange': [0.1, 0.2, -0.1, 0.1, 0.2, 0.1, -0.1],
                   'TimeStamp': [1457280, 1457281, 1457282, 1457283, 1457284, 1457285, 1457286]})
df['turn'] = 0
df['PriceChange_L1'] = df['PercPriceChange'].shift(periods=1, freq=None, axis=0)
df.loc[ df['PercPriceChange'] * df['PriceChange_L1'] < 0, 'turn' ] = 1
df['group'] = df['turn'].cumsum()
df['PercPriceChange_plus_one'] = df['PercPriceChange']+1
df['TotalPriceChange'] = df.groupby('group')['PercPriceChange_plus_one'].transform('prod') - 1
mask = (df['group'].diff() != 0)
df.loc[~mask, 'TotalPriceChange'] = 0
df = df[['TimeStamp', 'PercPriceChange', 'TotalPriceChange']]
print(df)
yields
TimeStamp PercPriceChange TotalPriceChange
0 1457280 0.1 0.320
1 1457281 0.2 0.000
2 1457282 -0.1 -0.100
3 1457283 0.1 0.452
4 1457284 0.2 0.000
5 1457285 0.1 0.000
6 1457286 -0.1 -0.100
I am trying to plot line graphs in matplotlib with the following data; the x, y points belonging to the same id form one line, so there are 3 lines in the df below.
id x y
0 1 0.50 0.0
1 1 1.00 0.3
2 1 1.50 0.5
4 1 2.00 0.7
5 2 0.20 0.0
6 2 1.00 0.8
7 2 1.50 1.0
8 2 2.00 1.2
9 2 3.50 2.0
10 3 0.10 0.0
11 3 1.10 0.5
12 3 3.55 2.2
It can be simply plotted with following code:
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib notebook
fig, ax = plt.subplots(figsize=(12,8))
cmap = plt.cm.get_cmap("viridis")
groups = df.groupby("id")
ngroups = len(groups)
for i1, (key, grp) in enumerate(groups):
    grp.plot(linestyle="solid", x = "x", y = "y", ax = ax, label = key)
plt.show()
But I have another DataFrame df2 where the weight of each id is given, and I am hoping to find a way to control the thickness of each line according to its weight: the larger the weight, the thicker the line. How can I do this? Also, what relation should hold between the weight and the width of the line?
id weight
0 1 5
1 2 15
2 3 2
Please let me know if anything is unclear.
Based on the comments, you need to know a few things:
How to set the line width?
That's simple: linewidth=number. See https://matplotlib.org/examples/pylab_examples/set_and_get.html
How to take the weight and make it a significant width?
This depends on the range of your weight. If it's consistently between 2 and 15, I'd recommend simply dividing it by 2, i.e.:
linewidth=weight/2
If you find this aesthetically unpleasing, divide by a bigger number, though that would obviously reduce the number of linewidths you get.
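If the weights vary over a wider or unknown range, one alternative (an assumption on my part, not the only choice) is to rescale them linearly into a fixed linewidth range:
w_min, w_max = df2['weight'].min(), df2['weight'].max()
def to_linewidth(w, lw_min=1.0, lw_max=6.0):
    # Map weight w from [w_min, w_max] onto [lw_min, lw_max]
    return lw_min + (w - w_min) * (lw_max - lw_min) / (w_max - w_min)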
How to get the weight out of df2?
Given the df2 you described and the code you showed, key is the id in df2. So you want (with .iloc[0] to pull out the scalar rather than a one-element Series):
df2.loc[df2['id'] == key, 'weight'].iloc[0]
Putting it all together:
Replace your grp.plot line with the following:
grp.plot(linestyle="solid",
         linewidth=df2.loc[df2['id'] == key, 'weight'].iloc[0] / 2.0,
         x = "x", y = "y", ax = ax, label = key)
(All this is your line with the entry for linewidth added in.)