I have time series data with a column that accumulates the seconds something has been running. All numbers are divisible by 30 s, but the column sometimes skips values (it may jump from 30 to 90). The column can also reset while running, setting the count back to 30 s. How would I break up every chunk of runtime?
For example: if the numbers in the column are 30, 60, 120, 150, 30, 60, 90, 30, 60, how would I break the dataframe apart into the full sequences with no resets:
30, 60, 120, 150 in one dataframe, 30, 60, 90 in the next, and 30, 60 in the last? At the end I need to take the max of each dataframe and add them together (that part I could figure out).
Using @RSale's input:
import pandas as pd
df = pd.DataFrame({'data': [30, 60, 120, 150, 30, 60, 90, 30, 60]})
# Each 30 marks the start of a new chunk; cumulatively summing the True flags
# labels every row with its chunk number, which groupby can then split on.
d = dict(tuple(df.groupby(df['data'].eq(30).cumsum())))
d is a dictionary of three dataframes:
d[1]:
data
0 30
1 60
2 120
3 150
d[2]:
data
4 30
5 60
6 90
And d[3]:
data
7 30
8 60
Not very elegant, but it gets the job done.
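Since the question also asks for the sum of the per-chunk maxima, a minimal follow-up sketch using the dictionary d built above:
total = sum(chunk['data'].max() for chunk in d.values())
print(total)  # 150 + 90 + 60 = 300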
Loop through the array; whenever a number is smaller than the one before, cut off the chunk before it and append it to a list. Repeat on the remainder until no further reset is found.
numpy & recursive
import numpy as np

a = np.array([30, 60, 120, 150, 30, 60, 90, 30, 60])
y = []

def split(a, y):
    for count, val in enumerate(a):
        if count == 0:
            pass
        elif val < a[count - 1]:
            # A reset: save the chunk before it
            y.append(a[:count])
            a = a[count:]
            if len(a) > 0 and sorted(a) != list(a):
                split(a, y)      # more resets ahead: recurse on the rest
            else:
                y.append(a)      # the remainder is a single sorted chunk
                a = []
            return y
    return y

y = split(a, y)
print(y)
>>[array([ 30, 60, 120, 150]), array([30, 60, 90]), array([30, 60])]
print([max(lis) for lis in y])
>>[150,90,60]
Note that this does not treat 30 specifically as the starting point; it splits at the smallest number after each reset, i.e. at any decrease.
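If chunks should instead start exactly at 30, as in the groupby answer above, a small sketch with np.split (assuming a is the array defined above):
starts = np.where(a == 30)[0][1:]  # index of every new 30, skipping the first
print(np.split(a, starts))
# [array([ 30,  60, 120, 150]), array([30, 60, 90]), array([30, 60])]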
Or using diff to find the change points.
numpy & diff version
import numpy as np

a = np.array([30, 60, 120, 150, 30, 60, 90, 30, 60])
y = []

def split(a, y):
    # Indices where the value drops, i.e. the last element of each chunk
    a_diff = np.where(np.diff(a) < 0)[0]
    while len(a_diff) > 0:
        y.append(a[:a_diff[0] + 1])   # chunk up to and including the drop
        a = a[a_diff[0] + 1:]
        a_diff = np.where(np.diff(a) < 0)[0]
    y.append(a)                       # the final chunk
    return y
y = split(a,y)
print(y)
print([max(lis) for lis in y])
>>[array([ 30, 60, 120, 150]), array([30, 60, 90]), array([30, 60])]
>>[150, 90, 60]
pandas & DataFrame version
import pandas as pd

df = pd.DataFrame({'data': [30, 60, 120, 150, 30, 60, 90, 30, 60]})
y = []

def split(df, y):
    a = df['data']
    # Positions where the value drops, i.e. the last element of each chunk
    a_diff = [count for count, val in enumerate(a.diff()[1:]) if val < 0]
    while len(a_diff) > 0:
        y.append(a[:a_diff[0] + 1])   # chunk up to and including the drop
        a = a[a_diff[0] + 1:]
        a_diff = [count for count, val in enumerate(a.diff()[1:]) if val < 0]
    y.append(a)                       # the final chunk
    return y
y = split(df,y)
print(y)
print([max(lis) for lis in y])
I am trying to convert the column 'reward levels' to int type; it is currently listed as object type.
I have tried:
.astype(int)
ValueError: invalid literal for int() with base 10: '25,50,100,250,500,1,000,2,500'
also:
tuple(map(int, df['reward levels'].split(',')))
AttributeError: 'Series' object has no attribute 'split'
and finally:
pd.to_numeric(df['reward levels'])
ValueError: Unable to parse string "25,50,100,250,500,1,000,2,500" at position 0
https://drive.google.com/file/d/0By26wLpAqHfQaF9Jb19RUFVnNjA/view
Link to the data. Thanks in advance; I am a novice.
After looking at your data, it seems that reward levels holds either comma-separated values, each preceded by a $ sign, or NaN. So, for each value of reward levels, you can:
Remove all $ signs by replacing them with the empty string ''
Split each value on the comma ,; this gives the numbers as a list of strings
Call pd.to_numeric on each row of reward levels
df['reward levels'] = df['reward levels'].str.replace('$', '', regex=False).str.split(',').apply(pd.to_numeric)
OUTPUT:
1 [1, 5, 10, 25, 50]
2 [1, 10, 25, 40, 50, 100, 250, 1, 0, 1, 337, 9, 1]
3 [1, 10, 25, 30, 50, 75, 85, 100, 110, 250, 500...
4 [10, 25, 50, 100, 150, 250]
...
45952 [20, 50, 100]
45953 [1, 5, 10, 25, 50, 50, 75, 100, 200, 250, 500,...
45954 [10, 25, 100, 500]
45955 [15, 16, 19, 29, 29, 39, 75]
45956 [25, 25, 50, 100, 125, 250, 500, 1, 250, 2, 50...
Name: reward levels, Length: 45957, dtype: object
Furthermore, if you wish to have each of the list items on a separate row, you can use explode:
df.explode('reward levels')
OUTPUT:
0 25
0 50
0 100
0 250
0 500
...
45956 250
45956 2
45956 500
45956 5
45956 0
Name: reward levels, Length: 416706, dtype: object
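Note that the exploded column still has object dtype (see the output above). A possible follow-up cast, a sketch that keeps any NaN rows as NaN:
exploded = pd.to_numeric(df.explode('reward levels')['reward levels'])  # float64 if NaN present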
It depends what you want the output format to be. If you just want to split the strings as comma separated values and cast them as ints, you can use:
import pandas as pd

data = {'reward_levels': {0: '25,50,100,250,500,1,000,2,500',
                          1: '25,50,10',
                          2: '15,16,19,22'}}
df = pd.DataFrame(data)
df.apply(lambda x: [int(j) for j in x.reward_levels.split(",")], axis=1)
but the result may not be exactly what you want:
0 [25, 50, 100, 250, 500, 1, 0, 2, 500]
1 [25, 50, 10]
2 [15, 16, 19, 22]
It is more typical to have a single value for each cell/index. You can either expand into multiple columns or duplicate index entries as rows; the latter might be preferable since your lists have unequal lengths:
df.reward_levels.str.split(",", expand=True)
output:
0 1 2 3 4 5 6 7 8
0 25 50 100 250 500 1 000 2 500
1 25 50 10 None None None None None None
2 15 16 19 22 None None None None None
or
df.reward_levels.str.split(",").explode().astype(int)
output:
0 25
0 50
0 100
0 250
0 500
0 1
0 0
0 2
0 500
1 25
1 50
1 10
2 15
2 16
2 19
2 22
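If the raw file really has a $ before every amount (as the other answer observed), splitting on $ instead of , keeps thousands separators such as 1,000 intact. A sketch, assuming input of the form '$25,$50,$1,000':
import pandas as pd

s = pd.Series(['$25,$50,$100,$250,$500,$1,000,$2,500'])
parsed = s.str.split('$').apply(
    lambda parts: [int(p.strip(', ').replace(',', ''))
                   for p in parts if p.strip(', ')])
print(parsed[0])  # [25, 50, 100, 250, 500, 1000, 2500]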
I concatenated 500 XLSX files, giving a dataframe of shape (672006, 12). Each process has a unique number, by which I want to groupby() the data to obtain relevant information. For temperature I would like to select the first value, and for height the most frequent value.
Test data:
import pandas as pd

df_test = pd.DataFrame({"number": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
                        'temperature': [2, 3, 4, 5, 4, 3, 4, 5, 5, 3, 4, 4],
                        'height': [100, 100, 0, 100, 100, 90, 90, 100, 100, 90, 80, 80]})
df_test.groupby('number')['temperature'].first()
df_test.groupby('number')['height'].agg(lambda x: x.value_counts().index[0])
I get the following error when trying to get the most frequent height per number:
IndexError: index 0 is out of bounds for axis 0 with size 0
Strangely enough, mean() / first() / max() etc. all work.
And on the second part of the dataset, which I concatenated separately, the aggregation worked.
Can somebody suggest what to do with this error?
Thanks!
I think the problem is that one or more of your groupby groups contains only NaN heights:
See this example, where I added a number 4 with np.NaN as its heights.
import numpy as np
import pandas as pd

df_test = pd.DataFrame({"number": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4],
                        'temperature': [2, 3, 4, 5, 4, 3, 4, 5, 5, 3, 4, 4, 5, 5],
                        'height': [100, 100, 0, 100, 100, 90, 90, 100, 100, 90, 80, 80, np.nan, np.nan]})
df_test.groupby('number')['temperature'].first()
df_test.groupby('number')['height'].agg(lambda x: x.value_counts().index[0])
Output:
IndexError: index 0 is out of bounds for axis 0 with size 0
Let's fill those NaN with zero and rerun.
df_test = pd.DataFrame({"number": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4],
                        'temperature': [2, 3, 4, 5, 4, 3, 4, 5, 5, 3, 4, 4, 5, 5],
                        'height': [100, 100, 0, 100, 100, 90, 90, 100, 100, 90, 80, 80, np.nan, np.nan]})
df_test = df_test.fillna(0)  # Add this line
df_test.groupby('number')['temperature'].first()
df_test.groupby('number')['height'].agg(lambda x: x.value_counts().index[0])
Output:
number
1 100.0
2 90.0
3 80.0
4 0.0
Name: height, dtype: float64
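If filling with zero would distort the data, another option is to guard the aggregation itself: value_counts() drops NaN, so an all-NaN group yields an empty Series, and its index[0] is what raises the IndexError. A sketch that returns NaN for such groups instead:
df_test.groupby('number')['height'].agg(
    lambda x: x.value_counts().index[0] if len(x.value_counts()) else np.nan)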
I have an Nx2 matrix (as a numpy array) such as:
import numpy as np

M = np.array([[10, 1000],
              [11, 200],
              [15, 800],
              [20, 5000],
              [28, 100],
              [32, 3000],
              [35, 3500],
              [38, 100],
              [50, 5000],
              [51, 100],
              [55, 2000],
              [58, 3000],
              [66, 4000],
              [90, 5000]])
I need to create an Nx3 matrix that reflects the relationship of the rows of the first matrix in the following way:
Use the right column to identify candidates for range boundaries, the condition is value >= 1000
This condition applied to the matrix:
[[10, 1000],
[20, 5000],
[32, 3000],
[35, 3500],
[50, 5000],
[55, 2000],
[58, 3000],
[66, 4000],
[90, 5000]]
So far I came up with M[M[:,1] >= 1000], which works. For this new matrix I now want to find the points in the first column where the distance to the next point is <= 10, and use these as range boundaries.
What I came up with so far: np.diff(M[:,0]) <= 10 which returns:
[True, False, True, False, True, True, True, False]
This is where I'm stuck. I want to use this condition to define the lower and upper boundaries of the ranges. For example:
[[10, 1000], #<- Range 1 start
[20, 5000], #<- Range 1 end (as 32 would be 12 points away)
[32, 3000], #<- Range 2 start
[35, 3500], #<- Range 2 end
[50, 5000], #<- Range 3 start
[55, 2000], #<- Range 3 cont (as 55 is only 5 points away)
[58, 3000], #<- Range 3 cont
[66, 4000], #<- Range 3 end
[90, 5000]] #<- Range 4 start and end (as there is no point +-10)
Lastly, referring back to the very first matrix, I want to add the right-column values together for each range within (and including) the boundaries.
So I have the four ranges which define start and stop for boundaries.
Range 1: Start 10, end 20
Range 2: Start 32, end 35
Range 3: Start 50, end 66
Range 4: Start 90, end 90
The resulting matrix would look like this, where column 0 is the start boundary, column 1 the end boundary and column 2 the added values from matrix M from the right column in between start and end.
[[10, 20, 7000], # 7000 = 1000+200+800+5000
[32, 35, 6500], # 6500 = 3000+3500
[50, 66, 14100], # 14100 = 5000+100+2000+3000+4000
[90, 90, 5000]] # 5000 = just 5000 as upper=lower boundary
I got stuck at the second step, after getting the true/false values for the range boundaries. How to create the ranges from the boolean values, and then how to add up the values within these ranges, is unclear to me. I would appreciate any suggestions. Also, I'm not sure about my approach; maybe there is a better way to get from the first to the last matrix, perhaps skipping a step?
EDIT
So, I got a bit further with the middle step, and I can now return the start and end values of the ranges:
# M here is the filtered matrix (only the rows with value >= 1000)
start_diffs = np.diff(M[:, 0]) > 10
start_indexes = np.insert(start_diffs, 0, True)  # the first point always starts a range
end_diffs = np.diff(M[:, 0]) > 10
end_indexes = np.append(end_diffs, True)         # the last point always ends a range
start_values = M[:, 0][start_indexes]
end_values = M[:, 0][end_indexes]
print(np.array([start_values, end_values]).T)
Returns:
[[10 20]
[32 35]
[50 66]
[90 90]]
What is missing is to somehow use these ranges to calculate the sums of the right column of matrix M.
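One possible way to finish this last step, a sketch that masks the original matrix M for each [start, end] pair and sums the right column:
import numpy as np

M = np.array([[10, 1000], [11, 200], [15, 800], [20, 5000], [28, 100],
              [32, 3000], [35, 3500], [38, 100], [50, 5000], [51, 100],
              [55, 2000], [58, 3000], [66, 4000], [90, 5000]])
ranges = np.array([[10, 20], [32, 35], [50, 66], [90, 90]])

result = np.array([[lo, hi, M[(M[:, 0] >= lo) & (M[:, 0] <= hi), 1].sum()]
                   for lo, hi in ranges])
print(result)
# [[   10    20  7000]
#  [   32    35  6500]
#  [   50    66 14100]
#  [   90    90  5000]]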
If you are open to using pandas, here's a solution that seems a bit over-thought in retrospect, but works:
# Initial array
import numpy as np
import pandas as pd

M = np.array([[10, 1000],
              [11, 200],
              [15, 800],
              [20, 5000],
              [28, 100],
              [32, 3000],
              [35, 3500],
              [38, 100],
              [50, 5000],
              [51, 100],
              [55, 2000],
              [58, 3000],
              [66, 4000],
              [90, 5000]])
# Build a DataFrame with default integer index and column labels
df = pd.DataFrame(M)
# Get a subset of rows that represent potential interval edges
subset = df[df[1] >= 1000].copy()
# If a row is the first row in a new range, flag it with 1.
# Then cumulatively sum these 1s. This labels each row with a
# unique integer, one per range
subset[2] = (subset[0].diff() > 10).astype(int).cumsum()
# Get the start and end values of each range
edges = subset.groupby(2).agg({0: ['first', 'last']})
edges
      0
  first last
2
0    10   20
1    32   35
2    50   66
3    90   90
# Build a pandas IntervalIndex out of these interval edges
tups = list(edges.itertuples(index=False, name=None))
idx = pd.IntervalIndex.from_tuples(tups, closed='both')
# Build a Series that maps each interval to a unique range number
mapping = pd.Series(range(len(idx)), index=idx)
# Apply this mapping to create a new column of the original df
df[2] = [mapping.loc[i] if idx.contains(i) else None for i in df[0]]
df
0 1 2
0 10 1000 0.0
1 11 200 0.0
2 15 800 0.0
3 20 5000 0.0
4 28 100 NaN
5 32 3000 1.0
6 35 3500 1.0
7 38 100 NaN
8 50 5000 2.0
9 51 100 2.0
10 55 2000 2.0
11 58 3000 2.0
12 66 4000 2.0
13 90 5000 3.0
# Group by this new column, get edges of each interval,
# sum values, and get the underlying numpy array
df.groupby(2).agg({0: ['first', 'last'], 1: 'sum'}).values
array([[ 10, 20, 7000],
[ 32, 35, 6500],
[ 50, 66, 14100],
[ 90, 90, 5000]])
I would like to replace row values in pandas.
In example:
import pandas as pd
import numpy as np

a = np.array(([100, 100, 101, 101, 102, 102],
              np.arange(6)))
df = pd.DataFrame(a.T)
Result:
array([[100, 0],
[100, 1],
[101, 2],
[101, 3],
[102, 4],
[102, 5]])
Here, I would like to replace the rows with the values [101, 3] with [200, 10] and the result should therefore be:
array([[100, 0],
[100, 1],
[101, 2],
[200, 10],
[102, 4],
[102, 5]])
Update
In a more general case I would like to replace multiple rows.
Therefore the old and new row values are represented by nx2 matrices (n is the number of rows to replace). For example:
old_vals = np.array([[101, 3],
                     [100, 0],
                     [102, 5]])
new_vals = np.array([[200, 10],
                     [300, 20],
                     [400, 30]])
And the result is:
array([[300, 20],
[100, 1],
[101, 2],
[200, 10],
[102, 4],
[400, 30]])
For the single row case:
In [35]:
df.loc[(df[0]==101) & (df[1]==3)] = [[200,10]]
df
Out[35]:
0 1
0 100 0
1 100 1
2 101 2
3 200 10
4 102 4
5 102 5
For the multiple row-case the following would work:
In [60]:
a = np.array(([100, 100, 101, 101, 102, 102],
              [0, 1, 3, 3, 3, 4]))
df = pd.DataFrame(a.T)
df
Out[60]:
0 1
0 100 0
1 100 1
2 101 3
3 101 3
4 102 3
5 102 4
In [61]:
df.loc[(df[0]==101) & (df[1]==3)] = 200,10
df
Out[61]:
0 1
0 100 0
1 100 1
2 200 10
3 200 10
4 102 3
5 102 4
For a multi-row update like you propose, the following works when each replacement targets a single row. First construct a dict of the old values to search for, using the new values as the replacements:
In [78]:
old_keys = [(x[0], x[1]) for x in old_vals]
replace_vals = dict(zip(old_keys, new_vals))
replace_vals
replace_vals
Out[78]:
{(100, 0): array([300, 20]),
(101, 3): array([200, 10]),
(102, 5): array([400, 30])}
We can then iterate over the dict and set the rows using the same method as in my first answer:
In [93]:
for k, v in replace_vals.items():
    df.loc[(df[0] == k[0]) & (df[1] == k[1])] = [[v[0], v[1]]]
df
Out[93]:
0 1
0 300 20
1 100 1
2 101 2
3 200 10
4 102 4
5 400 30
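For larger frames, a loop-free variant is possible; this is a sketch, not part of the answer above. It treats each (col0, col1) row as a key and replaces all matches at once:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[100, 0], [100, 1], [101, 2],
                            [101, 3], [102, 4], [102, 5]]))
old_vals = np.array([[101, 3], [100, 0], [102, 5]])
new_vals = np.array([[200, 10], [300, 20], [400, 30]])

keys = pd.MultiIndex.from_arrays([df[0], df[1]])
mask = keys.isin(list(map(tuple, old_vals)))
# Order the replacement rows to match where the old pairs occur in df
lookup = {tuple(o): n for o, n in zip(old_vals, new_vals)}
df.loc[mask, [0, 1]] = np.array([lookup[k] for k in keys[mask]])
print(df.values)
# [[300  20]
#  [100   1]
#  [101   2]
#  [200  10]
#  [102   4]
#  [400  30]]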
The simplest way should be this one:
df.loc[[3],0:1] = 200,10
In this case, 3 is the index label of the row (the fourth row here, since counting starts at 0), while 0 and 1 are the columns.
This code instead allows you to iterate over each row, check its content, and replace it with what you want.
target = [101, 3]
mod = [200, 10]
for index, row in df.iterrows():
    if row[0] == target[0] and row[1] == target[1]:
        # Write back through df.loc: modifying `row` directly is not
        # guaranteed to propagate to the original DataFrame
        df.loc[index, [0, 1]] = mod
print(df)
Replace 'A' with 1 and 'B' with 2.
df = df.replace(['A', 'B'],[1, 2])
This is done over the entire DataFrame no matter the column.
However, we can target a single column this way:
df[column] = df[column].replace(['A', 'B'],[1, 2])
More in-depth examples are available HERE.
Another possibility is:
import io
import numpy as np
import pandas as pd

a = np.array(([100, 100, 101, 101, 102, 102],
              np.arange(6)))
df = pd.DataFrame(a.T)
string = df.to_string(header=False, index=False, index_names=False)
dictionary = {'100 0': '300 20',
              '101 3': '200 10',
              '102 5': '400 30'}

def replace_all(text, dic):
    # Apply every old -> new substitution to the dumped string
    for i, j in dic.items():
        text = text.replace(i, j)
    return text

string = replace_all(string, dictionary)
df = pd.read_csv(io.StringIO(string), delim_whitespace=True, header=None)
I found this solution better since, when dealing with a large amount of data to replace, it is faster than EdChum's solution.