I have the following DataFrame:
df = pd.DataFrame([10, 10, 23, 23, 9, 9, 9, 10, 10, 10, 10, 12], columns=['values'])
I want to calculate the frequency of each value, but not an overall count: the count of each consecutive run of a value until it changes to another value.
I tried:
df['values'].value_counts()
but it gives me
10 6
9 3
23 2
12 1
The desired output is
10:2
23:2
9:3
10:4
12:1
How can I do this?
Use:
df = df.groupby(df['values'].ne(df['values'].shift()).cumsum())['values'].value_counts()
Or:
df = df.groupby([df['values'].ne(df['values'].shift()).cumsum(), 'values']).size()
print (df)
values values
1 10 2
2 23 2
3 9 3
4 10 4
5 12 1
Name: values, dtype: int64
Finally, remove the first level of the MultiIndex:
df = df.reset_index(level=0, drop=True)
print (df)
values
10 2
23 2
9 3
10 4
12 1
dtype: int64
Explanation:
Compare the original column with its shifted version using ne (not equal), then apply cumsum to build a helper Series:
a = df['values'].shift()
b = df['values'].ne(a)
c = b.cumsum()
print (pd.concat([df['values'], a, b, c],
                 keys=('orig','shifted', 'not_equal', 'cumsum'), axis=1))
orig shifted not_equal cumsum
0 10 NaN True 1
1 10 10.0 False 1
2 23 10.0 True 2
3 23 23.0 False 2
4 9 23.0 True 3
5 9 9.0 False 3
6 9 9.0 False 3
7 10 9.0 True 4
8 10 10.0 False 4
9 10 10.0 False 4
10 10 10.0 False 4
11 12 10.0 True 5
You can keep track of where the changes in df['values'] occur, then group by both the changes and df['values'] (to keep the values as the index), computing the size of each group:
changes = df['values'].diff().ne(0).cumsum()
df.groupby([changes,'values']).size().reset_index(level=0, drop=True)
values
10 2
23 2
9 3
10 4
12 1
dtype: int64
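For reference, the intermediate grouper from the first line works out to one group number per run of equal values in the sample data:
print(changes.tolist())
# [1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5]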
itertools.groupby
from itertools import groupby
pd.Series(*zip(*[[len([*v]), k] for k, v in groupby(df['values'])]))
10 2
23 2
9 3
10 4
12 1
dtype: int64
It's a generator
def f(x):
count = 1
for this, that in zip(x, x[1:]):
if this == that:
count += 1
else:
yield count, this
count = 1
yield count, [*x][-1]
pd.Series(*zip(*f(df['values'])))
10 2
23 2
9 3
10 4
12 1
dtype: int64
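If the unpacking in pd.Series(*zip(*f(df['values']))) looks cryptic, this equivalent sketch spells out what it does:
pairs = list(f(df['values']))           # [(2, 10), (2, 23), (3, 9), (4, 10), (1, 12)]
counts, keys = zip(*pairs)              # (2, 2, 3, 4, 1) and (10, 23, 9, 10, 12)
result = pd.Series(counts, index=keys)  # same Series as above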
Using crosstab
df['key']=df['values'].diff().ne(0).cumsum()
pd.crosstab(df['key'],df['values'])
Out[353]:
values 9 10 12 23
key
1 0 2 0 0
2 0 0 0 2
3 3 0 0 0
4 0 4 0 0
5 0 0 1 0
Slightly modify the result above
pd.crosstab(df['key'],df['values']).stack().loc[lambda x:x.ne(0)]
Out[355]:
key values
1 10 2
2 23 2
3 9 3
4 10 4
5 12 1
dtype: int64
Based on Python's groupby
from itertools import groupby
[ (k,len(list(g))) for k,g in groupby(df['values'].tolist())]
Out[366]: [(10, 2), (23, 2), (9, 3), (10, 4), (12, 1)]
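If you prefer the result as a Series (matching the desired output), one way to finish from that list of tuples is:
pairs = [(k, len(list(g))) for k, g in groupby(df['values'].tolist())]
pd.Series([c for _, c in pairs], index=[k for k, _ in pairs])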
This is far from the most time/memory-efficient method in this thread, but here's an iterative approach that is pretty straightforward. Please feel free to suggest improvements.
import pandas as pd
df = pd.DataFrame([10, 10, 23, 23, 9, 9, 9, 10, 10, 10, 10, 12], columns=['values'])
# start every distinct value at a count of zero
dict_count = {}
for v in df['values'].unique():
    dict_count[v] = 0

# walk the column, tracking the length of the current run;
# when the value changes, keep the longest run seen so far for that value
curr_val = df.iloc[0]['values']
count = 1
for i in range(1, len(df)):
    if df.iloc[i]['values'] == curr_val:
        count += 1
    else:
        if count > dict_count[curr_val]:
            dict_count[curr_val] = count
        curr_val = df.iloc[i]['values']
        count = 1
if count > dict_count[curr_val]:
    dict_count[curr_val] = count

df_count = pd.DataFrame(dict_count, index=[0])
print(df_count)
The groupby function in itertools can help. For a string:
>>> from itertools import groupby
>>> string = 'aabbaacc'
>>> for char, freq in groupby(string):
...     print(char, len(list(freq)), sep=':', end='\n')
[out]:
a:2
b:2
a:2
c:2
This function also works for a list:
>>> df = pd.DataFrame([10, 10, 23, 23, 9, 9, 9, 10, 10, 10, 10, 12], columns=['values'])
>>> for char, freq in groupby(df['values'].tolist()):
...     print(char, len(list(freq)), sep=':', end='\n')
[out]:
10:2
23:2
9:3
10:4
12:1
Note: for a DataFrame, always select the column as df['values'], because DataFrame already has a .values attribute, so df.values would not give you the column.
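A quick illustration of the difference on the DataFrame from this question:
df.values        # the underlying 2-D NumPy array of the whole frame
df['values']     # the column labelled 'values'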
Related
Say I have such Pandas dataframe
df = pd.DataFrame({
'a': [4, 5, 3, 1, 2],
'b': [20, 10, 40, 50, 30],
'c': [25, 20, 5, 15, 10]
})
so df looks like:
print(df)
a b c
0 4 20 25
1 5 10 20
2 3 40 5
3 1 50 15
4 2 30 10
And I want to get the column name of the 2nd largest value in each row. Borrowing the answer from Felex Le in this thread, I can now get the 2nd largest value by:
def second_largest(l = []):
return (l.nlargest(2).min())
print(df.apply(second_largest, axis = 1))
which gives me:
0 20
1 10
2 5
3 15
4 10
dtype: int64
But what I really want is the column names for those values, or to say:
0 b
1 b
2 c
3 c
4 c
Pandas has a function idxmax which can do the job for the largest value:
df.idxmax(axis = 1)
0 c
1 c
2 b
3 b
4 b
dtype: object
Is there any elegant way to do the same job but for the 2nd largest value?
Use numpy.argsort for the positions of the second largest values:
import numpy as np

# argsort sorts each row ascending, so position -2 holds the column of the 2nd largest value
df['new'] = df.columns.to_numpy()[np.argsort(df.to_numpy())[:, -2]]
print(df)
a b c new
0 4 20 25 b
1 5 10 20 b
2 3 40 5 c
3 1 50 15 c
4 2 30 10 c
Your solution also works if you use idxmin instead of min to get the column label, but it is slow:
def second_largest(l = []):
return (l.nlargest(2).idxmin())
print(df.apply(second_largest, axis = 1))
If efficiency is important, numpy.argpartition does only a partial sort and is quite fast:
N = 2
cols = df.columns.to_numpy()
pd.Series(cols[np.argpartition(df.to_numpy().T, -N, axis=0)[-N]], index=df.index)
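For the sample frame, that line breaks down roughly as follows (reusing N and cols from above); argpartition does just enough sorting to put the index of the N-th largest value at position -N:
arr = df.to_numpy().T                    # shape (n_columns, n_rows)
idx = np.argpartition(arr, -N, axis=0)   # row -N holds the column position of the N-th largest per row
print(cols[idx[-N]])                     # ['b' 'b' 'c' 'c' 'c']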
If you want a pure pandas solution (less efficient):
out = df.stack().groupby(level=0).apply(lambda s: s.nlargest(2).index[-1][1])
Output:
0 b
1 b
2 c
3 c
4 c
dtype: object
I have two dataframes that I want to compare, but only want to use the values that are not in both dataframes.
Example:
DF1:
A B C
0 1 2 3
1 4 5 6
DF2:
A B C
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
So, from this example I want to work with row index 2 and 3 ([7, 8, 9] and [10, 11, 12]).
The code I currently have (which only removes the duplicates) is below.
df = pd.concat([di_old, di_new])
df = df.reset_index(drop=True)
df_gpby = df.groupby(list(df.columns))
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
print(df.reindex(idx))
I would do:
df_n = df2[~df2.isin(df1).all(axis=1)]
output:
    A   B   C
2   7   8   9
3  10  11  12
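The intermediate df2.isin(df1) shows why the mask picks rows 2 and 3; with a DataFrame argument, isin matches on both index and column labels:
print(df2.isin(df1))
#        A      B      C
# 0   True   True   True
# 1   True   True   True
# 2  False  False  False
# 3  False  False  False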
I want to drop and count duplicates in column val when val equals 1.
Then set start to be the first row and end to be the last row of consecutive duplicates.
df = pd.DataFrame()
df['start'] = [1, 2, 3, 4, 5, 6, 18, 30, 31]
df['end'] = [2, 3, 4, 5, 6, 18, 30, 31, 32]
df['val'] = [1 , 1, 1, 1, 1, 12, 12, 1, 1]
df
start end val
0 1 2 1
1 2 3 1
2 3 4 1
3 4 5 1
4 5 6 1
5 6 18 12
6 18 30 12
7 30 31 1
8 31 32 1
Expected Result
start end val
0 1 6 5
1 6 18 12
2 18 30 12
3 30 32 2
I tried:
df[~((df.val==1) & (df.val == df.val.shift(1)) & (df.val == df.val.shift(-1)))]
start end val
0 1 2 1
4 5 6 1
5 6 18 12
6 18 30 12
7 30 31 1
8 31 32 1
but I can't figure out how to get to my expected result. Any suggestions?
Use:
#mask by condition
m = df.val==1
#consecutive groups
g = m.ne(m.shift()).cumsum()
#filter by condition and aggregate per groups
df1 = df.groupby(g[m]).agg({'start':'first', 'end':'last', 'val':'sum'})
#concat together, for correct order create index by g
df = pd.concat([df1, df.set_index(g)[~m.values]]).sort_index().reset_index(drop=True)
print (df)
start end val
0 1 6 5
1 6 18 12
2 18 30 12
3 30 32 2
You could also do it as a two-liner, building a mask to group by:
m = (df.val.ne(1) | df.val.ne(df.val.shift())).cumsum()
df = df.groupby(m).agg({'start': 'first', 'end': 'last', 'val': 'sum'})
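For the sample data, the grouper works out as below; every row where val is not 1 gets its own group, so 'sum' leaves the 12s untouched while it adds up the runs of 1s:
print(m.tolist())
# [1, 1, 1, 1, 1, 2, 3, 4, 4]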
The solution by @jezrael is perfect, but here is a slightly different approach:
df['aux'] = (df['val'] != df['val'].shift()).cumsum()
df.loc[df['val'] == 1, 'end'] = df[df['val'] == 1].groupby('aux')['end'].transform('last')
df.loc[df['val'] == 1, 'val'] = df.groupby('aux')['val'].transform('sum')
df = df.drop_duplicates(subset=df.columns.difference(['start']), keep='first')
df = df.drop(columns=['aux'])
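For reference, on the original sample data the helper computed in the first line above splits the rows into three consecutive blocks:
vals = pd.Series([1, 1, 1, 1, 1, 12, 12, 1, 1])
print((vals != vals.shift()).cumsum().tolist())
# [1, 1, 1, 1, 1, 2, 2, 3, 3]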
I have time series data in a pandas DataFrame, with the index being the time at the start of each measurement and the columns holding lists of values recorded at a fixed sampling rate (the difference between consecutive index values divided by the number of elements in the list).
Here is what it looks like:
Time A B ....... Z
0 [1, 2, 3, 4] [1, 2, 3, 4]
2 [5, 6, 7, 8] [5, 6, 7, 8]
4 [9, 10, 11, 12] [9, 10, 11, 12]
6 [13, 14, 15, 16] [13, 14, 15, 16 ]
...
I want to expand each row in all the columns to multiple rows such that:
Time A B .... Z
0 1 1
0.5 2 2
1 3 3
1.5 4 4
2 5 5
2.5 6 6
.......
So far I am thinking along these lines (the code doesn't work):
def expand_row(dstruc):
for i in range (len(dstruc)):
for j in range (1,len(dstruc[i])):
dstruc.loc[i+j/len(dstruc[i])] = dstruc[i][j]
dstruc.loc[i] = dstruc[i][0]
return dstruc
expanded = testdf.apply(expand_row)
I also tried using split(',') and stack() together but I am not able to fix my indexing appropriately.
import numpy as np
import pandas as pd
df = pd.DataFrame({key: list(zip(*[iter(range(1, 17))]*4)) for key in list('ABC')},
                  index=range(0, 8, 2))
result = pd.DataFrame.from_items([(index, zipped) for index, row in df.iterrows() for zipped in zip(*row)], orient='index', columns=df.columns)
result.index.name = 'Time'
grouped = result.groupby(level=0)
increment = (grouped.cumcount()/grouped.size())
result.index = result.index + increment
print(result)
yields
In [183]: result
Out[183]:
A B C
Time
0.00 1 1 1
0.25 2 2 2
0.50 3 3 3
0.75 4 4 4
2.00 5 5 5
2.25 6 6 6
2.50 7 7 7
2.75 8 8 8
4.00 9 9 9
4.25 10 10 10
4.50 11 11 11
4.75 12 12 12
6.00 13 13 13
6.25 14 14 14
6.50 15 15 15
6.75 16 16 16
Explanation:
One way to loop over the contents of the lists is to use a list comprehension:
In [172]: df = pd.DataFrame({key: list(zip(*[iter(range(1, 17))]*4)) for key in list('ABC')}, index=range(0, 8, 2))
In [173]: [(index, zipped) for index, row in df.iterrows() for zipped in zip(*row)]
Out[173]:
[(0, (1, 1, 1)),
(0, (2, 2, 2)),
...
(6, (15, 15, 15)),
(6, (16, 16, 16))]
Once you have the values in the above form, you can build the desired DataFrame with pd.DataFrame.from_items:
result = pd.DataFrame.from_items([(index, zipped) for index, row in df.iterrows() for zipped in zip(*row)], orient='index', columns=df.columns)
result.index.name = 'Time'
yields
In [175]: result
Out[175]:
A B C
Time
0 1 1 1
0 2 2 2
...
6 15 15 15
6 16 16 16
To compute the increments to be added to the index, you can group by the index and find the ratio of the cumcount to the size of each group:
In [176]: grouped = result.groupby(level=0)
In [177]: increment = (grouped.cumcount()/grouped.size())
In [179]: result.index = result.index + increment
In [199]: result.index
Out[199]:
Float64Index([ 0.0, 0.25, 0.5, 0.75, 2.0, 2.25, 2.5, 2.75, 4.0, 4.25, 4.5,
               4.75, 6.0, 6.25, 6.5, 6.75],
              dtype='float64', name='Time')
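As a quick sanity check: within a group of four samples, cumcount() runs 0 through 3 and size() is 4, so the per-row increments are:
[i / 4 for i in range(4)]
# [0.0, 0.25, 0.5, 0.75] -- added to the base time of each group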
Probably not ideal, but this can be done by grouping on Time and applying a function which returns the expanded DataFrame for each row (here the time difference is assumed to be fixed at 2.0):
def expand(x):
    data = {c: x[c].iloc[0] for c in x if c != 'Time'}
    n = len(data['A'])
    step = 2.0 / n
    data['Time'] = [x['Time'].iloc[0] + i*step for i in range(n)]
    return pd.DataFrame(data)

print(df.groupby('Time').apply(expand).set_index('Time', drop=True))
Output:
A B
Time
0.0 1 1
0.5 2 2
1.0 3 3
1.5 4 4
2.0 5 5
2.5 6 6
3.0 7 7
3.5 8 8
4.0 9 9
4.5 10 10
5.0 11 11
5.5 12 12
6.0 13 13
6.5 14 14
7.0 15 15
7.5 16 16
Say the dataframe to be expanded is named df_to_expand; you could do the following using eval (this assumes each cell stores its list as a string):
df_expanded_list = []
for coln in df_to_expand.columns:
_df = df_to_expand[coln].apply(lambda x: pd.Series(eval(x), index=[coln + '_' + str(i) for i in range(len(eval(x)))]))
df_expanded_list.append(_df)
df_expanded = pd.concat(df_expanded_list, axis=1)
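A hypothetical input for this approach (the frame below is made up for illustration) would look like this; note that each list element ends up in its own column (A_0, A_1, ...) rather than in a new row:
# hypothetical input: each cell stores its list as a string
df_to_expand = pd.DataFrame({'A': ['[1, 2, 3, 4]', '[5, 6, 7, 8]'],
                             'B': ['[1, 2, 3, 4]', '[5, 6, 7, 8]']})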
References:
convert a string which is a list into a proper list python
I am trying to concatenate multiple Pandas DataFrames, some of which use multi-indexing and others use single indices. As an example, let's consider the following single indexed dataframe:
> import pandas as pd
> df1 = pd.DataFrame({'single': [10,11,12]})
> df1
single
0 10
1 11
2 12
Along with a multiindex dataframe:
> level_dict = {}
> level_dict[('level 1','a','h')] = [1,2,3]
> level_dict[('level 1','b','j')] = [5,6,7]
> level_dict[('level 2','c','k')] = [10, 11, 12]
> level_dict[('level 2','d','l')] = [20, 21, 22]
> df2 = pd.DataFrame(level_dict)
> df2
level 1 level 2
a b c d
h j k l
0 1 5 10 20
1 2 6 11 21
2 3 7 12 22
Now I wish to concatenate the two dataframes. When I try to use concat it flattens the multiindex as follows:
> df3 = pd.concat([df2,df1], axis=1)
> df3
(level 1, a, h) (level 1, b, j) (level 2, c, k) (level 2, d, l) single
0 1 5 10 20 10
1 2 6 11 21 11
2 3 7 12 22 12
If instead I append a single column to the multiindex dataframe df2 as follows:
> df2['single'] = [10,11,12]
> df2
level 1 level 2 single
a b c d
h j k l
0 1 5 10 20 10
1 2 6 11 21 11
2 3 7 12 22 12
How can I instead generate this dataframe from df1 and df2 with concat, merge, or join?
I don't think you can avoid converting the single index into a MultiIndex. This is probably the easiest way; you could also convert after joining.
In [48]: df1.columns = pd.MultiIndex.from_tuples([(c, '', '') for c in df1])
In [49]: pd.concat([df2, df1], axis=1)
Out[49]:
level 1 level 2 single
a b c d
h j k l
0 1 5 10 20 10
1 2 6 11 21 11
2 3 7 12 22 12
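A minimal sketch of the "convert after joining" route mentioned above, assuming df1 still has its original flat column and relying on the flattened tuple columns shown in the question:
df3 = pd.concat([df2, df1], axis=1)            # columns flatten to tuples plus 'single'
df3.columns = pd.MultiIndex.from_tuples(
    [c if isinstance(c, tuple) else (c, '', '') for c in df3.columns])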
If you're just appending one column, you can access df1 essentially as a Series:
df2[df1.columns[0]] = df1.iloc[:, 0]
df2
level 1 level 2 single
a b c d
h j k l
0 1 5 10 20 10
1 2 6 11 21 11
2 3 7 12 22 12
If you had just made a Series in the first place, it would be a little easier to read. These commands do the same thing:
ser1 = df1.iloc[:, 0] # make df1's column into a series
df2[ser1.name] = ser1