I would like to transform a list of dates in the following format:
01-02-12
01-03-12
01-27-12
02-01-12
02-23-12
.
.
.
01-03-13
02-02-13
as
1
1
1
2
2
.
.
.
13
14
ie: index each date by month, with respect to year also.
I am not sure how to do this and can't find a similar problem, so advice would be appreciated.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Edit:
In response to #Psidom.
Just an example dataset with made up numbers. In the actual dataset I'm dealing with I've transformed the dates to datetime objects.
dat = pd.read_csv('matchdata-update.csv',encoding = "ISO-8859-1")
dat['Date']=pd.to_datetime(dat['Date'],format='%m-%d-%y% I:%M%p').
Ideally I would like it to count a month , even if it was not observed.
End goal is to index each month and to count number of rows in that insex, so if no month was observed then the number of rows counted for that index would just be 0.
If you want to count the number of rows for every month, this should work:
dat.set_index("Date").resample("M").size()
Here is a different answer using the data as given and producing the answer requested, including 0s for missing monthes.
dates = '''\
01-02-12
01-03-12
01-27-12
02-01-12
02-23-12
01-03-13
02-02-13
'''.splitlines()
def monthnum(date, baseyear):
"Convert date as 'mm-dd-yy' to month number starting with baseyear xx."
m,d,y = map(int, date.split('-'))
return m + 12 * (y-baseyear)
print(monthnum(dates[0], 12) == 1, monthnum(dates[-1], 12) == 14)
def monthnums(dates, baseyear):
"Yield month numbers of 'mm-dd-yy' starting with baseyear."
for date in dates:
m,d,y = map(int, date.split('-'))
yield m + 12 * (y-baseyear)
print(list(monthnums(dates, 12)) == [1,1,1,2,2,13,14])
def num_per_month(mnums):
prev, n = 1, 0
for k in mnums:
if k == prev:
n += 1
else:
yield prev, n
for i in range(prev+1, k):
yield i, 0
prev, n = k, 1
yield prev, n
for m, n in num_per_month(monthnums(dates, 12)):
print(m, n)
prints
True True
True
1 3
2 2
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 1
14 1
Related
I have a DataFrame that contains gas concentrations and the corresponding valve number. This data was taken continuously where we switched the valves back and forth (valves=1 or 2) for a certain amount of time to get 10 cycles for each valve value (20 cycles total). A snippet of the data looks like this (I have 2,000+ points and each valve stayed on for about 90 seconds each cycle):
gas1 valveW time
246.9438 2 1
247.5367 2 2
246.7167 2 3
246.6770 2 4
245.9197 1 5
245.9518 1 6
246.9207 1 7
246.1517 1 8
246.9015 1 9
246.3712 2 10
247.0826 2 11
... ... ...
My goal is to save the last N points of each valve's cycle. For example, the first cycle where valve=1, I want to index and save the last N points from the end before the valve switches to 2. I would then save the last N points and average them to find one value to represent that first cycle. Then I want to repeat this step for the second cycle when valve=1 again.
I am currently converting from Matlab to Python so here is the Matlab code that I am trying to translate:
% NOAA high
n2o_noaaHigh = [];
co2_noaaHigh = [];
co_noaaHigh = [];
h2o_noaaHigh = [];
ind_noaaHigh_end = zeros(1,length(t_c));
numPoints = 40;
for i = 1:length(valveW_c)-1
if (valveW_c(i) == 1 && valveW_c(i+1) ~= 1)
test = (i-numPoints):i;
ind_noaaHigh_end(test) = 1;
n2o_noaaHigh = [n2o_noaaHigh mean(n2o_c(test))];
co2_noaaHigh = [co2_noaaHigh mean(co2_c(test))];
co_noaaHigh = [co_noaaHigh mean(co_c(test))];
h2o_noaaHigh = [h2o_noaaHigh mean(h2o_c(test))];
end
end
ind_noaaHigh_end = logical(ind_noaaHigh_end);
This is what I have so far for Python:
# NOAA high
n2o_noaaHigh = [];
co2_noaaHigh = [];
co_noaaHigh = [];
h2o_noaaHigh = [];
t_c_High = []; # time
for i in range(len(valveW_c)):
# NOAA HIGH
if (valveW_c[i] == 1):
t_c_High.append(t_c[i])
n2o_noaaHigh.append(n2o_c[i])
co2_noaaHigh.append(co2_c[i])
co_noaaHigh.append(co_c[i])
h2o_noaaHigh.append(h2o_c[i])
Thanks in advance!
I'm not sure if I understood correctly, but I guess this is what you are looking for:
# First we create a column to show cycles:
df['cycle'] = (df.valveW.diff() != 0).cumsum()
print(df)
gas1 valveW time cycle
0 246.9438 2 1 1
1 247.5367 2 2 1
2 246.7167 2 3 1
3 246.677 2 4 1
4 245.9197 1 5 2
5 245.9518 1 6 2
6 246.9207 1 7 2
7 246.1517 1 8 2
8 246.9015 1 9 2
9 246.3712 2 10 3
10 247.0826 2 11 3
Now you can use groupby method to get the average for the last n points of each cycle:
n = 3 #we assume this is n
df.groupby('cycle').apply(lambda x: x.iloc[-n:, 0].mean())
Output:
cycle 0
1 246.9768
2 246.6579
3 246.7269
Let's call your DataFrame df; then you could do:
results = {}
for k, v in df.groupby((df['valveW'].shift() != df['valveW']).cumsum()):
results[k] = v
print(f'[group {k}]')
print(v)
Shift(), as it suggests, shifts the column of the valve cycle allows to detect changes in number sequences. Then, cumsum() helps to give a unique number to each of the group with the same number sequence. Then we can do a groupby() on this column (which was not possible before because groups were either of ones or twos!).
which gives e.g. for your code snippet (saved in results):
[group 1]
gas1 valveW time
0 246.9438 2 1
1 247.5367 2 2
2 246.7167 2 3
3 246.6770 2 4
[group 2]
gas1 valveW time
4 245.9197 1 5
5 245.9518 1 6
6 246.9207 1 7
7 246.1517 1 8
8 246.9015 1 9
[group 3]
gas1 valveW time
9 246.3712 2 10
10 247.0826 2 11
Then to get the mean for each cycle; you could e.g. do:
df.groupby((df['valveW'].shift() != df['valveW']).cumsum()).mean()
which gives (again for your code snippet):
gas1 valveW time
valveW
1 246.96855 2.0 2.5
2 246.36908 1.0 7.0
3 246.72690 2.0 10.5
where you wouldn't care much about the time mean but the gas1 one!
Then, based on results you could e.g. do:
n = 3
mean_n_last = []
for k, v in results.items():
if len(v) < n:
mean_n_last.append(np.nan)
else:
mean_n_last.append(np.nanmean(v.iloc[len(v) - n:, 0]))
which gives [246.9768, 246.65796666666665, nan] for n = 3 !
If your dataframe is sorted by time you could get the last N records for each valve like this.
N=2
valve1 = df[df['valveW']==1].iloc[-N:,:]
valve2 = df[df['valveW']==2].iloc[-N:,:]
If it isn't currently sorted you could easily sort it like this.
df.sort_values(by=['time'])
Consider this pandas dataframe where the condition column is 1 when value is below 5 (any threshold).
import pandas as pd
d = {'value': [30,100,4,0,80,0,1,4,70,70],'condition':[0,0,1,1,0,1,1,1,0,0]}
df = pd.DataFrame(data=d)
df
Out[1]:
value condition
0 30 0
1 100 0
2 4 1
3 0 1
4 80 0
5 0 1
6 1 1
7 4 1
8 70 0
9 70 0
What I want is to have all consecutive values below 5 to have the same id and all values above five have 0 (or NA or a negative value, doesn't matter, they just need to be the same). I want to create a new column called new_id that contains these cumulative ids as follows:
value condition new_id
0 30 0 0
1 100 0 0
2 4 1 1
3 0 1 1
4 80 0 0
5 0 1 2
6 1 1 2
7 4 1 2
8 70 0 0
9 70 0 0
In a very inefficient for loop I would do this (which works):
for i in range(0,df.shape[0]):
if (df.loc[df.index[i],'condition'] == 1) & (df.loc[df.index[i-1],'condition']==0):
new_id = counter # assign new id
counter += 1
elif (df.loc[df.index[i],'condition']==1) & (df.loc[df.index[i-1],'condition']!=0):
new_id = counter-1 # assign current id
elif (df.loc[df.index[i],'condition']==0):
new_id = df.loc[df.index[i],'condition'] # assign 0
df.loc[df.index[i],'new_id'] = new_id
df
But this is very inefficient and I have a very big dataset. Therefore I tried different kinds of vectorization but I so far failed to keep it from counting up inside each "cluster" of consecutive points:
# First try using cumsum():
df['new_id'] = 0
df['new_id_temp'] = ((df['condition'] == 1)).astype(int).cumsum()
df.loc[(df['condition'] == 1), 'new_id'] = df['new_id_temp']
df[['value', 'condition', 'new_id']]
# Another try using list comprehension but this just does +1:
[row+1 for ind, row in enumerate(df['condition']) if (row != row-1)]
I also tried using apply() with a custom if else function but it seems like this does not allow me to use a counter.
There is already a ton of similar posts about this but none of them keep the same id for consecutive rows.
Example posts are:
Maintain count in python list comprehension
Pandas cumsum on a separate column condition
Python - keeping counter inside list comprehension
python pandas conditional cumulative sum
Conditional count of cumulative sum Dataframe - Loop through columns
You can use the cumsum(), as you did in your first try, just modify it a bit:
# calculate delta
df['delta'] = df['condition']-df['condition'].shift(1)
# get rid of -1 for the cumsum (replace it by 0)
df['delta'] = df['delta'].replace(-1,0)
# cumulative sum conditional: multiply with condition column
df['cumsum_x'] = df['delta'].cumsum()*df['condition']
Welcome to SO! Why not just rely on base Python for this?
def counter_func(l):
new_id = [0] # First value is zero in any case
counter = 0
for i in range(1, len(l)):
if l[i] == 0:
new_id.append(0)
elif l[i] == 1 and l[i-1] == 0:
counter += 1
new_id.append(counter)
elif l[i] == l[i-1] == 1:
new_id.append(counter)
else: new_id.append(None)
return new_id
df["new_id"] = counter_func(df["condition"])
Looks like this
value condition new_id
0 30 0 0
1 100 0 0
2 4 1 1
3 0 1 1
4 80 0 0
5 0 1 2
6 1 1 2
7 4 1 2
8 70 0 0
9 70 0 0
Edit :
You can also use numba, which sped up the function quite a lot for me about : about 1sec to ~60ms.
You should input numpy arrays into the function to use it, meaning you'll have to df["condition"].values.
from numba import njit
import numpy as np
#njit
def func(arr):
res = np.empty(arr.shape[0])
counter = 0
res[0] = 0 # First value is zero anyway
for i in range(1, arr.shape[0]):
if arr[i] == 0:
res[i] = 0
elif arr[i] and arr[i-1] == 0:
counter += 1
res[i] = counter
elif arr[i] == arr[i-1] == 1:
res[i] = counter
else: res[i] = np.nan
return res
df["new_id"] = func(df["condition"].values)
I want to find the count for the number of previous rows that have the a greater value than the current row in a column and store it in a new column. It would be like a rolling countif that goes back to the beginning of the column. The desired example output below shows the value column given and the count column I want to create.
Desired Output:
Value Count
5 0
7 0
4 2
12 0
3 4
4 3
1 6
I plan on using this code with a large dataframe so the fastest way possible is appreciated.
We can do subtract.outer from numpy , then get lower tri and find the value is less than 0, and sum the value per row
a = np.sum(np.tril(np.subtract.outer(df.Value.values,df.Value.values), k=0)<0, axis=1)
# results in array([0, 0, 2, 0, 4, 3, 6])
df['Count'] = a
IMPORTANT: this only works with pandas < 1.0.0 and the error seems to be a pandas bug. An issue is already created at https://github.com/pandas-dev/pandas/issues/35203
We can do this with expanding and applying a function which checks for values that are higher than the last element in the expanding array.
import pandas as pd
import numpy as np
# setup
df = pd.DataFrame([5,7,4,12,3,4,1], columns=['Value'])
# calculate countif
df['Count'] = df.Value.expanding(1).apply(lambda x: np.sum(np.where(x > x[-1], 1, 0))).astype('int')
Input
Value
0 5
1 7
2 4
3 12
4 3
5 4
6 1
Output
Value Count
0 5 0
1 7 0
2 4 2
3 12 0
4 3 4
5 4 3
6 1 6
count = []
for i in range(len(values)):
count = 0
for j in values[:i]:
if values[i] < j:
count += 1
count.append(count)
The below generator will do what you need. You may be able to further optimize this if needed.
def generator (data) :
i=0
count_dict ={}
while i<len(data) :
m=max(data)
v=data[i]
count_dict[v] =count_dict[v] +1 if v in count_dict else 1
t=sum([(count_dict[j] if j in count_dict else 0) for j in range(v+1,m)])
i +=1
yield t
d=[1, 5,7,3,5,8]
foo=generator (d)
result =[b for b in foo]
print(result)
I have an pandas data frame and I want the average number of consecutive values in a row. For example, for the following data
a b c d e f g h i j k l
p1 0 0 4 4 4 4 4 4 1 4 4 1
p2 0 4 4 0 4 4 0 1 4 4 0 1
so the average number of consecutive 4's for p1 is (6+2)/2 = 4 and for p2 is (2+2+2)/3 = 2
Is there also a way to find the min and max number of consecutive values? i.e. max for p1 is 6.
You can transpose your dataframe and use the method suggested in below post. You will get a dataframe of count of consecutive numbers, using which you can perform Mean, Min and Max.
https://stackoverflow.com/a/29643066/12452044
This will work for p1. To get p2, just replace the 0's with 1's whenever you see an 'iloc' function being used.
dict = {0:[],1:[],2:[],3:[],4:[]}
counter = 1
for i in range(len(df.iloc[0])-1):
num = df.iloc[0,i]
num2 = df.iloc[0,i+1]
if num == num2:
counter += 1
else:
dict[num].append(counter)
counter = 1
Then to get the average number of consecutive 4's:
print(sum(dict[4])/len(dict[4]))
And to get the max number of consecutive 4's:
print(max(dict[4]))
Given a set or a list (assume its ordered)
myset = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
I want to find out how many numbers appear in a range.
say my range is 10. Then given the list above, I have two sets of 10.
I want the function to return [10,10]
if my range was 15. Then I should get [15,5]
The range will change. Here is what I came up with
myRange = 10
start = 1
current = start
next = current + myRange
count = 0
setTotal = []
for i in myset:
if i >= current and i < next :
count = count + 1
print str(i)+" in "+str(len(setTotal)+1)
else:
current = current + myRange
next = myRange + current
if next >= myset[-1]:
next = myset[-1]
setTotal.append(count)
count = 0
print setTotal
Output
1 in 1
2 in 1
3 in 1
4 in 1
5 in 1
6 in 1
7 in 1
8 in 1
9 in 1
10 in 1
12 in 2
13 in 2
14 in 2
15 in 2
16 in 2
17 in 2
18 in 2
19 in 2
[10, 8]
notice 11 and 20 where skipped. I also played around with the condition and got wired results.
EDIT: Range defines a range that every value in the range should be counted into one chuck.
think of a range as from current value to currentvalue+range as one chunk.
EDIT:
Wanted output:
1 in 1
2 in 1
3 in 1
4 in 1
5 in 1
6 in 1
7 in 1
8 in 1
9 in 1
10 in 1
11 in 2
12 in 2
13 in 2
14 in 2
15 in 2
16 in 2
17 in 2
18 in 2
19 in 2
[10, 10]
With the right key function, thegroupbymethod in the itertoolsmodule makes doing this fairly simple:
from itertools import groupby
def ranger(values, range_size):
def keyfunc(n):
key = n/(range_size+1) + 1
print '{} in {}'.format(n, key)
return key
return [len(list(g)) for k, g in groupby(values, key=keyfunc)]
myset = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
print ranger(myset, 10)
print ranger(myset, 15)
You want to use simple division and the remainder; the divmod() function gives you both:
def chunks(lst, size):
count, remainder = divmod(len(lst), size)
return [size] * count + ([remainder] if remainder else [])
To create your desired output, then use the output of chunks():
lst = range(1, 21)
size = 10
start = 0
for count, chunk in enumerate(chunks(lst, size), 1):
for i in lst[start:start + chunk]:
print '{} in {}'.format(i, count)
start += chunk
count is the number of the current chunk (starting at 1; python uses 0-based indexing normally).
This prints:
1 in 1
2 in 1
3 in 1
4 in 1
5 in 1
6 in 1
7 in 1
8 in 1
9 in 1
10 in 1
11 in 2
12 in 2
13 in 2
14 in 2
15 in 2
16 in 2
17 in 2
18 in 2
19 in 2
20 in 2
If you don't care about what numbers are in a given chunk, you can calculate the size easily:
def chunk_sizes(lst, size):
complete = len(lst) // size # Number of `size`-sized chunks
partial = len(lst) % size # Last chunk
if partial: # Sometimes the last chunk is empty
return [size] * complete + [partial]
else:
return [size] * complete