pandas unique values how to iterate as a starting point - python

Good morning (pandas beginner here).
I have a pandas dataframe (stocks2) with an ID column and a DELTA column.
My goal is: the first time a new ID appears, set that row's VALUE to 1000 * the DELTA of that row. For all consecutive rows of that ID, the VALUE is the VALUE of the row above * the DELTA of the current row.
I tried by getting all unique ID values:
a=stocks2.ID.unique()
a.tolist()
It works; unfortunately, I do not really know how to iterate in the way I described. Any kind of help or tip would be greatly appreciated!

A way to do it would be as follows. Example dataframe:
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 5, 3, 3], 'delta': [0.3, 0.5, 0.2, 2, 4]}).assign(value=[2, 5, 4, 2, 3])
print(df)
ID delta value
0 1 0.3 2
1 1 0.5 5
2 5 0.2 4
3 3 2.0 2
4 3 4.0 3
Fill value from the row above as:
df['value'] = df.shift(1).delta * df.shift(1).value
Group by ID to get the index of the row where each ID first appears:
w = df.groupby('ID', as_index=False).nth(0).index.values
And compute the values for value using the indices in w:
df.loc[w,'value'] = df.loc[w,'delta'] * 1000
Which gives for this example:
ID delta value
0 1 0.3 300.0
1 1 0.5 0.6
2 5 0.2 200.0
3 3 2.0 2000.0
4 3 4.0 4.0
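Note that if the "VALUE of the row above" refers to the already updated VALUE, as the question reads, the recursion unrolls into a per-ID cumulative product, and no explicit iteration over the unique IDs is needed. A minimal sketch on the same example data (the real frame would be stocks2 with its ID and DELTA columns):
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 5, 3, 3], 'delta': [0.3, 0.5, 0.2, 2, 4]})
# VALUE_1 = 1000 * delta_1 and VALUE_n = VALUE_{n-1} * delta_n
# unroll to VALUE_n = 1000 * (delta_1 * ... * delta_n):
df['value'] = df.groupby('ID')['delta'].cumprod() * 1000
print(df)
   ID  delta   value
0   1    0.3   300.0
1   1    0.5   150.0
2   5    0.2   200.0
3   3    2.0  2000.0
4   3    4.0  8000.0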

Related

How to get the ID linked with the highest value in a subset of values based on a label, for every label in pandas?

I think this question is best visualized as follows, given a dataframe:
val_1 true_val ID label
-0.0127894447 0.0 1 A
0.9604560385 1.0 2 A
0.0001271985 0.0 3 A
0.0007419337 0.0 3 B
0.3420448566 0.0 2 B
0.1322384726 1.0 4 B
So what I want to get is:
val_1 true_val label ID_val_1_highest ID_true_val_highest
0.9604560385 1.0 A 2 2
0.3420448566 1.0 B 2 4
Or, even more preferably, only the last 2 columns (just the IDs, so I can calculate precision and recall).
I want to get the ID that has the highest value for both val_1 and true_val, and then return both corresponding IDs for every label.
Anyone have an idea how to do this? I tried:
df.sort_values('val_1', ascending=False).drop_duplicates(['label'])
But it doesn't return the ID associated with the highest value for label X, for both values. Note: ID can appear more than once in the 'ID' column.
Use DataFrameGroupBy.idxmax after converting ID to the index: this returns, for each label, the ID with the maximal val_1 and true_val. Then join the result to the first DataFrame with DataFrame.join:
df1 = df.sort_values('true_val', ascending=False).drop_duplicates(['label'])
print (df1)
val_1 true_val ID label
1 0.960456 1.0 2 A
5 0.132238 1.0 4 B
df2 = df.set_index('ID').groupby('label').idxmax().add_suffix('_highest')
print (df2)
val_1_highest true_val_highest
label
A 2 2
B 2 4
df = df1.join(df2, on='label')
print (df)
val_1 true_val ID label val_1_highest true_val_highest
1 0.960456 1.0 2 A 2 2
5 0.132238 1.0 4 B 2 4
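For completeness, a self-contained sketch of the same approach on the question's data (values shortened for readability); the ids frame alone is the "last 2 columns only" the asker said they ultimately need:
import pandas as pd

df = pd.DataFrame({
    'val_1':    [-0.0128, 0.9605, 0.0001, 0.0007, 0.3420, 0.1322],
    'true_val': [0.0, 1.0, 0.0, 0.0, 0.0, 1.0],
    'ID':       [1, 2, 3, 3, 2, 4],
    'label':    ['A', 'A', 'A', 'B', 'B', 'B'],
})
# with ID as the index, idxmax returns per label the ID of the maximal row
ids = df.set_index('ID').groupby('label').idxmax().add_suffix('_highest')
print(ids)
       val_1_highest  true_val_highest
label
A                  2                 2
B                  2                 4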

How can I compute my data frame by slicing the index

I have the data as
A=[0.1,0.3,0.5,0.7,0.9,1.1,1.3,1.5,1.7]
B=[3,4,6,8,2,10,2,3,4]
If A is my index and B holds the values corresponding to A, I have to group the first three index values, i.e. [0.1, 0.3, 0.5], and calculate the average of the corresponding B values, i.e. [3, 4, 6]. Similarly, the average of the second three values [8, 2, 10] corresponding to [0.7, 0.9, 1.1], and again of [2, 3, 4] corresponding to [1.3, 1.5, 1.7], and then prepare a table of these three averages. The final data frame should be like:
A=[1,2,3]
B=[average 1, average 2, average 3]
If you need to aggregate the mean over each group of 3 values, build a helper array from the length of the DataFrame using integer division by 3, and pass it to GroupBy.mean:
import numpy as np
import pandas as pd

A = [0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3, 1.5, 1.7]
B = [3, 4, 6, 8, 2, 10, 2, 3, 4]
df = pd.DataFrame({'col': B}, index=A)
print (df)
col
0.1 3
0.3 4
0.5 6
0.7 8
0.9 2
1.1 10
1.3 2
1.5 3
1.7 4
df = df.groupby(np.arange(len(df)) // 3).mean()
df.index +=1
print (df)
col
1 4.333333
2 6.666667
3 3.000000
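If the length is guaranteed to be an exact multiple of 3, a plain NumPy reshape gives the same result; a sketch of this alternative (not part of the original answer):
import numpy as np
import pandas as pd

B = [3, 4, 6, 8, 2, 10, 2, 3, 4]
# average each consecutive block of 3 values
means = np.asarray(B, dtype=float).reshape(-1, 3).mean(axis=1)
out = pd.DataFrame({'col': means}, index=[1, 2, 3])
print(out)
        col
1  4.333333
2  6.666667
3  3.000000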

How to take difference in a specific dataframe

I'm trying to take differences of consecutive numbers in one of the dataframe columns, while preserving the order given by other columns, for example:
import pandas as pd
df = pd.DataFrame({"A": [1,1,1,2,2,2,3,3,3,4],
"B": [2,1,3,3,2,1,1,2,3,4],
"C": [2.1,2.0,2.2,1.2,1.1,1.0,3.0,3.1,3.2,3.3]})
In [1]: df
Out[1]:
A B C
0 1 2 2.1
1 1 1 2.0
2 1 3 2.2
3 2 3 1.2
4 2 2 1.2
5 2 1 1.0
6 3 1 3.0
7 3 2 3.1
8 3 3 3.2
9 4 4 3.3
I would like to:
- for each distinctive element of column A (1, 2, 3, and 4)
- sort column B and take consecutive differences of column C
without a loop, to get something like this:
In [2]: df2
Out[2]:
A B C Diff
0 1 2 2.1 0.1
2 1 3 2.2 0.1
3 2 3 1.2 0.1
4 2 2 1.1 0.1
7 3 2 3.1 0.1
8 3 3 3.2 0.1
I have run a number of operations:
df2 = df.groupby(by='A').apply(lambda x: x.sort_values(by = ['B'])['C'].diff())
df3 = pd.DataFrame(df2)
df3.reset_index(inplace=True)
df4 = df3.set_index('level_1')
df5 = df.copy()
df5['diff'] = df4['C']
and got what I wanted:
df5
Out[1]:
A B C diff
0 1 2 2.1 0.1
1 1 1 2.0 NaN
2 1 3 2.2 0.1
3 2 3 1.2 0.1
4 2 2 1.1 0.1
5 2 1 1.0 NaN
6 3 1 3.0 NaN
7 3 2 3.1 0.1
8 3 3 3.2 0.1
9 4 4 3.3 NaN
but is there a more efficient way of doing so?
(NaN values can be easily removed so I'm not fussy about that part)
It is a little unclear what is expected as the result (why are there fewer rows?).
For taking the consecutive differences you probably want to use Series.diff() (see docs here)
df['Diff'] = df.C.diff()
You can use the periods keyword if you want some (positive or negative) lag when taking the differences.
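A minimal illustration of the periods argument:
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0, 16.0])
print(s.diff())             # lag 1:  NaN, 3.0, 5.0, 7.0
print(s.diff(periods=2))    # lag 2:  NaN, NaN, 8.0, 12.0
print(s.diff(periods=-1))   # lag -1: -3.0, -5.0, -7.0, NaN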
Don't see where the sort part comes into effect, but for that you probably want to use Series.sort_values() (see docs here)
EDIT
Based on your updated information, I believe this may be what you are looking for:
df.sort_values(by=['B', 'C'], inplace=True)
df['diff'] = df.C.diff()
EDIT 2
Based on your new updated information about the calculation, you want to:
- groupby by A (see docs on DataFrame.groupby() here)
- sort (each group) by B (or presort by A then B, prior to groupby)
- calculate differences of C (and dismiss the first record of each group, since its difference will be missing).
The following code achieves that:
df.sort_values(by=['A','B'], inplace=True)
df['Diff'] = df.groupby('A').apply(lambda x: x['C'].diff()).values
df2 = df.dropna()
Explanation of the code:
First line sorts the dataframe first.
The second line there has a bunch of things going...:
First groupby (which now generates a grouped DataFrame, see the helpful pandas page on split-apply-combine if you're new to the groupby)
then obtain the differences of C for each group
and "flatten" the grouped dataframe by obtaining a series with .values
which we assign to df['Diff'] (that is why we needed to presort the dataframe, so this assignment would get it right... if not we would have to merge the series on A and B).
The third line just removes the NAs and assigns that to df2.
EDIT 3
I think my EDIT 2 version may be what you are looking for: it is a bit more concise and generates less auxiliary data. However, you can also improve your version of the solution a little:
df3.reset_index(level=0, inplace=True) # no need to reset and then set again
df5 = df.copy() # only if you don't want to change df
df5['diff'] = df3.C # else, just do df.insert(2, 'diff', df3.C)
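As a further alternative (a sketch of my own, not one of the versions above): a groupby diff keeps the original index, so the result aligns on assignment and no in-place presorting is needed:
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4],
                   "B": [2, 1, 3, 3, 2, 1, 1, 2, 3, 4],
                   "C": [2.1, 2.0, 2.2, 1.2, 1.1, 1.0, 3.0, 3.1, 3.2, 3.3]})
# sort a copy by A then B, diff C within each A group;
# the resulting Series carries the original index, so assignment lines up
df['Diff'] = df.sort_values(['A', 'B']).groupby('A')['C'].diff()
print(df.dropna())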

Pandas MultiIndex Aggregation

I am trying to do some aggregation on a multi-index DataFrame based on a DatetimeIndex generated from pandas.date_range.
My DatetimeIndex looks like this:
DatetimeIndex(['2000-05-30', '2000-05-31', '2000-06-01' ... '2001-01-31'])
And my multi-index DataFrame looks like this:
value
date id
2000-05-31 1 0
2 1
3 1
2000-06-30 2 1
3 0
4 0
2000-07-30 2 1
4 0
1 0
2002-09-30 1 1
3 1
The dates in the DatetimeIndex may or may not be in the date index.
I need to retrieve all the id such that the percentage of value==1 is greater than or equal to some decimal threshold e.g. 0.6 for all the rows where the date for that id is in the DatetimeIndex.
For example if the threshold is 0.5, then the output should be [2, 3] or some DataFrame containing 2 and 3.
1 does not meet the requirement because 2002-09-30 is not in the DatetimeIndex.
I have a solution with loops and dictionaries to keep track of how often value==1 for each id, but it runs very slowly.
How can I utilize pandas to perform this aggregation?
Thank you.
You can use:
#define the date range
rng = pd.date_range('2000-05-30', '2000-07-01')
#filter rows whose date is in the range with isin
df = df[df.index.get_level_values('date').isin(rng)]
#fraction of value==1 per id (the mean of a 0/1 column)
s = df.groupby('id')['value'].mean()
print (s)
id
1 0.0
2 1.0
3 0.5
4 0.0
Name: value, dtype: float64
#get the ids whose fraction meets the threshold
a = s.index[s >= 0.5].tolist()
print (a)
[2, 3]
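A self-contained sketch of the whole pipeline on the question's sample data (how the frame is built here is an assumption for illustration):
import pandas as pd

rows = [('2000-05-31', 1, 0), ('2000-05-31', 2, 1), ('2000-05-31', 3, 1),
        ('2000-06-30', 2, 1), ('2000-06-30', 3, 0), ('2000-06-30', 4, 0),
        ('2000-07-30', 2, 1), ('2000-07-30', 4, 0), ('2000-07-30', 1, 0),
        ('2002-09-30', 1, 1), ('2002-09-30', 3, 1)]
df = pd.DataFrame(rows, columns=['date', 'id', 'value'])
df['date'] = pd.to_datetime(df['date'])
df = df.set_index(['date', 'id'])

rng = pd.date_range('2000-05-30', '2000-07-01')
kept = df[df.index.get_level_values('date').isin(rng)]
frac = kept.groupby('id')['value'].mean()
print(frac.index[frac >= 0.5].tolist())
[2, 3]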

taking a count of numbers occurring in a column in a dataframe using pandas

I have a dataframe like the one given below: a survey question on top, the item names (X, Y, Z) below it, and the ratings given under each item. I am trying to take a count of each number under each item, and to transpose the data so that the rank becomes the column header and the count becomes the data underneath each rank. I tried multiple pandas methods like:
df.eq('1').sum(axis=1)
df2=df.transpose
but I am not getting the desired output.
how would you rank these items on a scale of 1-5
X Y Z
1 2 1
2 1 3
3 1 1
1 3 2
1 1 2
2 5 3
4 1 2
1 4 4
3 3 5
The desired output is something like:
1 2 3 4 5
X (count of 1s)(count of 2s).....so on
Y (count of 1s)(count of 2s).......
Z (count of 1s)(count of 2s)............
Any help would really mean a lot.
You can apply pd.value_counts to all columns, which counts the values in each column; then transpose the result:
df.apply(pd.value_counts).fillna(0).T
# 1 2 3 4 5
#X 4.0 2.0 2.0 1.0 0.0
#Y 4.0 1.0 2.0 1.0 1.0
#Z 2.0 3.0 2.0 1.0 1.0
Option 0
pd.concat
pd.concat({c: s.value_counts() for c, s in df.items()}).unstack(fill_value=0)
Option 1
stack preserves int dtype
df.stack().groupby(level=1).apply(
    pd.value_counts
).unstack(fill_value=0)
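Another possibility (a sketch of an alternative, not from the original answers) is to melt to long form and cross-tabulate, which also keeps integer counts:
import pandas as pd

df = pd.DataFrame({'X': [1, 2, 3, 1, 1, 2, 4, 1, 3],
                   'Y': [2, 1, 1, 3, 1, 5, 1, 4, 3],
                   'Z': [1, 3, 1, 2, 2, 3, 2, 4, 5]})
# one row per (item, rating) pair, then count the combinations
long = df.melt(var_name='item', value_name='rating')
print(pd.crosstab(long['item'], long['rating']))
rating  1  2  3  4  5
item
X       4  2  2  1  0
Y       4  1  2  1  1
Z       2  3  2  1  1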
