python - How do I perform the calculation using shift function - python

below is the df
df = pd.DataFrame({
'Sr. No': [1, 2, 3, 4, 5, 6],
'val1' : [2,3,2,4,1,2],
})
I want Val2 such that the first row is same as first row of val1
but but row 2 and below the formula is as show in the pic. I am assuming it should be an easy one with shift, but just not getting my head around this.

This is mul and cumsum:
df["new"] = df["Sr. No"].mul(df["val1"]).cumsum()
print (df)
Sr. No val1 new
0 1 2 2
1 2 3 8
2 3 2 14
3 4 4 30
4 5 1 35
5 6 2 47

Related

Select rows of pandas dataframe in order of a given list with repetitions and keep the original index

After looking here and here and in the documentation, I still cannot find a way to select rows from a DataFrame according to all these criteria:
Return rows in an order given from a list of values from a given column
Return repeated rows (associated with repeated values in the list)
Preserve the original indices
Ignore values of the list not present in the DataFrame
As an example, let
df = pd.DataFrame({'A': [5, 6, 3, 4], 'B': [1, 2, 3, 5]})
df
A B
0 5 1
1 6 2
2 3 3
3 4 5
and let
list_of_values = [3, 4, 6, 4, 3, 8]
Then I would like to get the following DataFrame:
A B
2 3 3
3 4 5
1 6 2
3 4 5
2 3 3
How can I accomplish that? Zero's answer looks promising as it is the only one I found which preserves the original index, but it does not work with repetitions. Any ideas about how to modify/generalize it?
We have to preserve the index by assigning it as a column first so we can set_index after the mering:
list_of_values = [3, 4, 6, 4, 3, 8]
df2 = pd.DataFrame({'A': list_of_values, 'order': range(len(list_of_values))})
dfn = (
df.assign(idx=df.index)
.merge(df2, on='A')
.sort_values('order')
.set_index('idx')
.drop('order', axis=1)
)
A B
idx
2 3 3
3 4 5
1 6 2
3 4 5
2 3 3
If you want to get rid of the index name (idx), use rename_axis:
dfn = dfn.rename_axis(None)
A B
2 3 3
3 4 5
1 6 2
3 4 5
2 3 3
Here's a way to do that using merge:
list_df = pd.DataFrame({"A": list_of_values, "order": range(len(list_of_values))})
pd.merge(list_df, df, on="A").sort_values("order").drop("order", axis=1)
The output is:
A B
0 3 3
2 4 5
4 6 2
3 4 5
1 3 3

How do I find rows in a pandas dataframe where values have changed?

I am in a situation where I get data on various items and sometimes some fields for an item changes. I'm looking for an easy way to detect when something has changed. Here's an example. Let's say have like this:
df = pd.DataFrame({'ID': [1, 1, 1, 2, 3, 2],
'SUBID': [1, 2, 1, 1, 1, 1],
'val1': [1, 4, 1, 7, 10, 8],
'val2': [2, 5, 2, 8, 11, 9],
'val3': [3, 6, 4, 9, 12, 9],
'date': ['20200413', '20200413', '20200414', '20200414',
'20200415', '20200416']})
ID SUBID val1 val2 val3 date
1 1 1 2 3 20200413
1 2 4 5 6 20200413
1 1 1 2 4 20200414
2 1 7 8 9 20200414
3 1 10 11 12 20200415
2 1 8 9 9 20200416
I'd like an operation that gives me something like this:
ID SUBID col date prev new
0 1 1 val3 20200414 3 4
1 2 1 val1 20200416 7 8
2 2 1 val2 20200416 8 9
Thanks in advance!
IIUC, you can do it by first reshaping the dataframe with set_index, stack, then groupby some columns to agg date and the values columns with first and/or last, then to filter only the one that have changed you can query.
df_f = (df.set_index(['ID', 'SUBID', 'date'])
.rename_axis(columns='col').stack()
.reset_index('date', name='val')
.groupby(['ID', 'SUBID', 'col'])
.agg(date=('date','last'), prev=('val','first'), new=('val','last'))
.query('prev != new').reset_index())
print (df_f)
ID SUBID col date prev new
0 1 1 val3 20200414 3 4
1 2 1 val1 20200416 7 8
2 2 1 val2 20200416 8 9
Here is my approach with GroupBy.shift and DataFrame.melt
new_df = (df.melt(['ID','SUBID','date'], value_name='new')
.assign(prev=lambda x: x.groupby(['ID','SUBID','variable'])['new']
.shift())
.loc[lambda x: ~x['new'].eq(x['prev']) & x['prev'].notna()]
.sort_values(['ID','SUBID']))
print(new_df)
ID SUBID date variable new prev
14 1 1 20200414 val3 4 3.0
5 2 1 20200416 val1 8 7.0
11 2 1 20200416 val2 9 8.0

create a dataframe from 3 other dataframes in python

I am trying to create a new df which summarises my key information, by taking that information from 3 (say) other dataframes.
dfdate = {'x1': [2, 4, 7, 5, 6],
'x2': [2, 2, 2, 6, 7],
'y1': [3, 1, 4, 5, 9]}
dfdate = pd.DataFrame(df, index=range(0:4))
dfqty = {'x1': [1, 2, 6, 6, 8],
'x2': [3, 1, 1, 7, 5],
'y1': [2, 4, 3, 2, 8]}
dfqty = pd.DataFrame(df2, range(0:4))
dfprices = {'x1': [0, 2, 2, 4, 4],
'x2': [2, 0, 0, 3, 4],
'y1': [1, 3, 2, 1, 3]}
dfprices = pd.DataFrame(df3, range(0:4))
Let us say the above 3 dataframes are my data. Say, some dates, qty, and prices of goods. My new df is to be constructed from the above data:
rng = len(dfprices.columns)*len(dfprices.index) # This is the len of new df
dfnew = pd.DataFrame(np.nan,index=range(0,rng),columns=['Letter', 'Number', 'date', 'qty', 'price])
Now, this is where I struggle to piece my stuff together. I am trying to take all the data in dfdate and put it into a column in the new df. same with dfqty and dfprice. (so 3x5 matricies essentially goto a 1x15 vector and are placed into the new df).
As well as that, I need a couple of columns in dfnew as identifiers, from the names of the columns of the old df.
Ive tried for loops but to no avail, and don't know how to convert a df to series. But my desired output is:
dfnew:
'Lettercol','Numbercol', 'date', 'qty', 'price'
0 X 1 2 1 0
1 X 1 4 2 2
2 X 1 7 6 2
3 X 1 5 6 4
4 X 1 6 8 4
5 X 2 2 3 2
6 X 2 2 1 0
7 X 2 2 1 0
8 X 2 6 7 3
9 X 2 7 5 4
10 Y 1 3 2 1
11 Y 1 1 4 3
12 Y 1 4 3 2
13 Y 1 5 2 1
14 Y 1 9 8 3
where the numbers 0-14 are the index.
letter = letter from col header in DFs
number = number from col header in DFs
next 3 columns are data from the orig df's
(don't ask why the original data is in that funny format :)
thanks so much. my last Q wasn't well received so have tried to make this one better, thanks
Use:
#list of DataFrames
dfs = [dfdate, dfqty, dfprices]
#list comprehension with reshape
comb = [x.unstack() for x in dfs]
#join together
df = pd.concat(comb, axis=1, keys=['date', 'qty', 'price'])
#remove second level of MultiIndex and index to column
df = df.reset_index(level=1, drop=True).reset_index().rename(columns={'index':'col'})
#extract all values without first by indexing [1:] and first letter by [0]
df['Number'] = df['col'].str[1:]
df['Letter'] = df['col'].str[0]
cols = ['Letter', 'Number', 'date', 'qty', 'price']
#change order of columns
df = df.reindex(columns=cols)
print (df)
Letter Number date qty price
0 x 1 2 1 0
1 x 1 4 2 2
2 x 1 7 6 2
3 x 1 5 6 4
4 x 1 6 8 4
5 x 2 2 3 2
6 x 2 2 1 0
7 x 2 2 1 0
8 x 2 6 7 3
9 x 2 7 5 4
10 y 1 3 2 1
11 y 1 1 4 3
12 y 1 4 3 2
13 y 1 5 2 1
14 y 1 9 8 3

Subdataframes of Subdataframes

If I have a dataframe df_i and I want to split it into sub-dataframes based on unique values of 'Cycle Number'
I use:
dfs = {k: df_i[df_i['Cycle Number'] == k] for k in df_i['Cycle Number'].unique()}
Assuming the 'Cycle Number' ranges from 1 to 50 and in each cycle, I have steps ranging from 1 to 15, how do I split each data frame into 15 further data frames?
I am presuming something of this type would work:
for i in range(1,51):
dsfs = {k: dfs[i][dfs[i]['Step Number'] == k] for k in dfs[i]['Step Number'].unique()}
But, this will return me 15 data frames only from the cycle number corresponding to 50, not the ones before.
If I want to access a sub-dataframe in the 20th Cycle with step number 10, is there a way of generating the subdata frame such that I can access it using something like dfs[20][10]?
A simple parallel:
Step Number Cycle Number Desired Access
1 1 dfs[1][1]
2 1 dfs[1][2]
3 1 dfs[1][3]
4 1 dfs[1][4]
5 1 dfs[1][5]
1 2 dfs[2][1]
2 2 dfs[2][2]
3 2 dfs[2][3]
4 2 dfs[2][4]
5 2 dfs[2][5]
1 3 dfs[3][1]
2 3 dfs[3][2]
3 3 dfs[3][3]
4 3 dfs[3][4]
5 3 dfs[3][5]
1 4 dfs[4][1]
2 4 dfs[4][2]
3 4 dfs[4][3]
4 4 dfs[4][4]
5 4 dfs[4][5]
You can use tuple keys instead and utilize groupby. Here's a minimal example:
df = pd.DataFrame([[0, 1, 2], [0, 1, 3], [1, 2, 4], [1, 2, 5], [1, 3, 6], [1, 3, 7]],
columns=['col1', 'col2', 'col3'])
dfs = dict(tuple(df.groupby(['col1', 'col2'])))
for k, v in dfs.items():
print(k)
print(v)
(0, 1)
col1 col2 col3
0 0 1 2
1 0 1 3
(1, 2)
col1 col2 col3
2 1 2 4
3 1 2 5
(1, 3)
col1 col2 col3
4 1 3 6
5 1 3 7

Sort all columns of a pandas DataFrame independently using sort_values()

I have a dataframe and want to sort all columns independently in descending or ascending order.
import pandas as pd
data = {'a': [5, 2, 3, 6],
'b': [7, 9, 1, 4],
'c': [1, 5, 4, 2]}
df = pd.DataFrame.from_dict(data)
a b c
0 5 7 1
1 2 9 5
2 3 1 4
3 6 4 2
When I use sort_values() for this it does not work as expected (to me) and only sorts one column:
foo = df.sort_values(by=['a', 'b', 'c'], ascending=[False, False, False])
a b c
3 6 4 2
0 5 7 1
2 3 1 4
1 2 9 5
I can get the desired result if I use the solution from this answer which applies a lambda function:
bar = df.apply(lambda x: x.sort_values().values)
print(bar)
a b c
0 2 1 1
1 3 4 2
2 5 7 4
3 6 9 5
But this looks a bit heavy-handed to me.
What's actually happening in the sort_values() example above and how can I sort all columns in my dataframe in a pandas-way without the lambda function?
You can use numpy.sort with DataFrame constructor:
df1 = pd.DataFrame(np.sort(df.values, axis=0), index=df.index, columns=df.columns)
print (df1)
a b c
0 2 1 1
1 3 4 2
2 5 7 4
3 6 9 5
EDIT:
Answer with descending order:
arr = df.values
arr.sort(axis=0)
arr = arr[::-1]
print (arr)
[[6 9 5]
[5 7 4]
[3 4 2]
[2 1 1]]
df1 = pd.DataFrame(arr, index=df.index, columns=df.columns)
print (df1)
a b c
0 6 9 5
1 5 7 4
2 3 4 2
3 2 1 1
sort_values will sort the entire data frame by the columns order you pass to it. In your first example you are sorting the entire data frame with ['a', 'b', 'c']. This will sort first by 'a', then by 'b' and finally by 'c'.
Notice how, after sorting by a, the rows maintain the same. This is the expected result.
Using lambda you are passing each column to it, this means sort_values will apply to a single column, and that's why this second approach sorts the columns as you would expect. In this case, the rows change.
If you don't want to use lambda nor numpy you can get around using this:
pd.DataFrame({x: df[x].sort_values().values for x in df.columns.values})
Output:
a b c
0 2 1 1
1 3 4 2
2 5 7 4
3 6 9 5

Categories

Resources