I have a column col1 in a data frame with the following values:
col1 col2 col3
9.1
9.1
9.11
9.12
9.13
9.14
9.15
9.16
9.2
9.3
9.4
9.5
9.6
9.7
9.8
9.9
10.1
10.1
10.2
10.3
Is it possible to sort the data frame based on col1 values as follows:
col1 col2 col3
9.1
9.2
9.3
9.4
9.5
9.6
9.7
9.8
9.9
9.10
9.11
9.12
9.13
9.14
9.15
9.16
10.1
10.1
10.2
10.3
There are two things here:
9.10 is interpreted as 9.1, which I want to avoid.
I want 9.10 to appear after 9.9 in the sort order.
Here is some example code:
>>> import pandas as pd
>>> df = pd.DataFrame([9.1,9.7,9.8,9.9,9.10,10.0,10.1,10.2,10.11])
>>> df
0
0 9.10
1 9.70
2 9.80
3 9.90
4 9.10
5 10.00
6 10.10
7 10.20
8 10.11
>>> df.sort_values(0)
0
0 9.10
4 9.10
1 9.70
2 9.80
3 9.90
5 10.00
6 10.10
8 10.11
7 10.20
I wanted it to be:
0
0 9.1
1 9.7
2 9.8
3 9.9
4 9.10
5 10.0
6 10.1
7 10.2
8 10.11
I am OK if it shows two digits after the decimal point, like 9.70, but the order should be the same.
PS: I didn't specify any column type as I am OK with any. My goal is to achieve the two points specified above. These column values are actually directory names which I am loading into a data frame and trying to sort in the order specified above.
You must create the dataframe with str data (I shuffled it randomly):
data = ['9.1', '10.1', '10.2', '10.11', '9.8', '10.0', '9.10', '9.7', '9.9']
df = pd.DataFrame(data, columns = ['col1'])
# col1
#0 9.1
#1 10.1
#2 10.2
#3 10.11
#4 9.8
#5 10.0
#6 9.10
#7 9.7
#8 9.9
Now, you can split the column:
new = df['col1'].str.split('.', expand = True)
# 0 1
#0 9 1
#1 10 1
#2 10 2
#3 10 11
#4 9 8
#5 10 0
#6 9 10
#7 9 7
#8 9 9
Add the new columns to df and sort by them. Remember that 'new' contains str instances, so cast them to int so the values compare numerically when sorting:
df['num0'] = new[0].astype(int)
df['num1'] = new[1].astype(int)
df = df.sort_values(['num0','num1'])
# col1 num0 num1
#0 9.1 9 1
#7 9.7 9 7
#4 9.8 9 8
#8 9.9 9 9
#6 9.10 9 10
#5 10.0 10 0
#1 10.1 10 1
#2 10.2 10 2
#3 10.11 10 11
Optional
If you don't want to keep the columns num0 and num1, change the last line of code to:
df = df.sort_values(['num0','num1'])['col1']
You can also reset the dataframe index with:
df = df.reset_index(drop=True)
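Putting those steps together, a minimal end-to-end sketch of this approach (using the same shuffled strings as above):
import pandas as pd

data = ['9.1', '10.1', '10.2', '10.11', '9.8', '10.0', '9.10', '9.7', '9.9']
df = pd.DataFrame(data, columns=['col1'])

# split into major/minor parts and compare them as ints
parts = df['col1'].str.split('.', expand=True).astype(int)
df = (df.assign(num0=parts[0], num1=parts[1])
        .sort_values(['num0', 'num1'])
        .drop(columns=['num0', 'num1'])
        .reset_index(drop=True))
print(df)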
A different answer suggests simply casting to float and sorting:
df.col1 = df.col1.astype(float)
df = df.sort_values(by='col1')
Note, however, that the float cast collapses 9.10 back into 9.1, which is exactly what the question wanted to avoid.
Try this:
data = [9.1, 9.1, 9.11, 9.12, 9.13, 9.14, 9.15, 9.16, 9.2, 9.3, 9.4, 9.5, 9.6, 9.7, 9.8, 9.9, 10.1, 10.1, 10.2, 10.3]
df = pd.DataFrame([[i, "", ""] for i in data], columns=["col1", "col2", "col3"]).astype("str")
df.sort_values(by=['col1'], key=lambda x: [(int(i[0]), int(i[-1])) for i in x.str.split(".")])
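For completeness: if a third-party package is an option, the natsort library is designed for exactly this version-style ordering. A sketch, assuming natsort (>= 7.1) is installed, pandas is >= 1.1 (for the key argument to sort_values), and col1 holds the values as strings:
from natsort import natsort_keygen

# natural sort: 9.9 < 9.10 < 9.11, unlike plain lexicographic or float order
df = df.sort_values('col1', key=natsort_keygen())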
I have a MultiIndex dataframe (table 1) and I want to merge specific columns from another dataframe that is not MultiIndex (table 2).
Example of table 1:
>>> name 2020-10-21 2020-10-22 ...
Column 9 10 11 12 9 10 11 12
0 A5 2.1 2.2 2.4 2.8 5.4 3.4 1.1 7.3
1 B9 7.2 1.2 14.5 7.5 3.4 5.2 6.4 8.1
2 C3 1.1 6.5 8.4 9.1 1.1 4.3 6.5 8.7
...
Example of table 2:
>>> name indc control code
0 A5 0.32 yes 1
1 C3 0.11 no 2
2 B18 0.23 yes 2
3 B9 0.45 no 3
I want to merge the column "code" based on the key "name" from table 2 (and "index" from table 1) to get the code beside the name:
>>> index 2020-10-21 2020-10-22 ...
Column code 9 10 11 12 9 10 11 12
0 A5 1 2.1 2.2 2.4 2.8 5.4 3.4 1.1 7.3
1 B9 3 7.2 1.2 14.5 7.5 3.4 5.2 6.4 8.1
2 C3 2 1.1 6.5 8.4 9.1 1.1 4.3 6.5 8.7
...
I know how to merge when the columns are not a MultiIndex; then I do something like this:
df = table1.merge(table2[['code','name']], how = 'left',
left_on = 'index', right_on = 'name')
but now I get this error:
UserWarning: merging between different levels can give an unintended
result (2 levels on the left,1 on the right) warnings.warn(msg,
UserWarning)
and then:
ValueError: 'index' is not in list
When I print the columns I can see that they are like tuples, but I don't know why it says 'index' is not in the list, since when I print the columns of table 1 I get:
Index([ ('index', ''), (2020-10-22, 9)...
So I'm a bit confused.
My end goal is to merge the code column based on the columns "name" and "index".
For this to work correctly, you need a MultiIndex in both DataFrames:
df2 = table2[['code','name']].rename(columns={'name':'index'})
df2.columns = pd.MultiIndex.from_product([df2.columns, ['']])
df = table1.merge(df2, how = 'left', on = [('index', '')])
# if necessary, reorder the column names
cols = df.columns[:1].tolist() + df.columns[-1:].tolist() + df.columns[1:-1].tolist()
df = df[cols]
print (df)
index code 2020-10-21 2020-10-22
9 10 11 12 9 10 11 12
0 A5 1 2.1 2.2 2.4 2.8 5.4 3.4 1.1 7.3
1 B9 3 7.2 1.2 14.5 7.5 3.4 5.2 6.4 8.1
2 C3 2 1.1 6.5 8.4 9.1 1.1 4.3 6.5 8.7
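Here is a minimal, self-contained sketch of the same idea; the toy frames below are invented for illustration:
import pandas as pd

# toy versions of table1 and table2
cols = pd.MultiIndex.from_tuples([('index', ''), ('2020-10-21', 9), ('2020-10-21', 10)])
table1 = pd.DataFrame([['A5', 2.1, 2.2], ['B9', 7.2, 1.2]], columns=cols)
table2 = pd.DataFrame({'name': ['A5', 'B9'], 'indc': [0.32, 0.45], 'code': [1, 3]})

df2 = table2[['code', 'name']].rename(columns={'name': 'index'})
df2.columns = pd.MultiIndex.from_product([df2.columns, ['']])  # lift to two levels
df = table1.merge(df2, how='left', on=[('index', '')])
print(df)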
I have a pandas dataframe where the first header row has one entry per group, but the second header row has repeating column names.
A B C
Date open r close open r close open r close
2000-07-03 19.7 5 17.1 66.26 4 6.22 23.26 1 9.9
2000-07-05 49.8 2 8.3 78.81 6 4.34 39.81 5 5.1
2000-07-15 89.5 3 4.1 43.45 7 2.45 29.3 8 1.2
2000-08-13 74.7 6 7.4 34.26 8 6.4 72.26 9 5.4
2000-08-25 39.84 1 8.4 95.43 3 4.3 69.81 0 5.2
2000-08-28 61.8 4 4.2 43.81 1 2.2 129.81 6 1.3
2000-09-11 82.79 7 7.4 66.26 1 6.5 72.25 6 5.6
2000-09-16 64.8 8 8.7 73.45 5 4.7 69.45 4 5.4
2000-09-22 58.5 9 3.3 13.81 8 2.9 777.8 8 1.4
I want to extract the data for the 7th month of 2000 and find out which of A, B, or C has the lowest (open - close).
MY PLAN:
s=data.stack(level=0)
values = s[s.index.get_level_values(1)]['open', 'close'].reset_index()
values['Date'] = pd.to_datetime(values['Date'])
start_date = 2000-07-01
end_date = 2000-08-01
mask = (data['date'] > start_date) & (data['date'] <= end_date)
df = data.loc[mask]
df['Val_Diff'] = df['open'] - df['close']
print(df['Val_Diff'].max())
I get the error
KeyError: "None of [Index are in the [columns]"
Why is the MultiIndex a problem for this code?
I think it's caused by the unnamed columns in the MultiIndex when stack reshapes the frame vertically.
Process flow:
Flatten the MultiIndex column names.
Reshape from wide to long with the wide_to_long function.
Convert the date column to datetime format for conditional extraction.
import pandas as pd
import numpy as np
import io
import datetime
data = '''
Date open r close open r close open r close
2000-07-03 19.7 5 17.1 66.26 4 6.22 23.26 1 9.9
2000-07-05 49.8 2 8.3 78.81 6 4.34 39.81 5 5.1
2000-07-15 89.5 3 4.1 43.45 7 2.45 29.3 8 1.2
2000-08-13 74.7 6 7.4 34.26 8 6.4 72.26 9 5.4
2000-08-25 39.84 1 8.4 95.43 3 4.3 69.81 0 5.2
2000-08-28 61.8 4 4.2 43.81 1 2.2 129.81 6 1.3
2000-09-11 82.79 7 7.4 66.26 1 6.5 72.25 6 5.6
2000-09-16 64.8 8 8.7 73.45 5 4.7 69.45 4 5.4
2000-09-22 58.5 9 3.3 13.81 8 2.9 777.8 8 1.4
'''
data = pd.read_csv(io.StringIO(data), sep=r'\s+')
idx = pd.MultiIndex.from_arrays([['','A','A','A','B','B','B','C','C','C'], ['Date','open','r','close','open','r','close','open','r','close']])
data.columns = idx
new_cols = [k[1]+'_'+k[0] for k in data.columns[1:]]
new_cols.insert(0, 'Date')
data.columns = new_cols
data = pd.wide_to_long(data,['open','r','close'], i='Date', j='item', sep='_', suffix='\\w+')
data.reset_index(inplace=True)
data['Date'] = pd.to_datetime(data['Date'])
start_date = datetime.datetime(2000,7,1)
end_date = datetime.datetime(2000,8,1)
mask = (data.Date > start_date) & (data.Date <= end_date)
data = data.loc[mask]
data
Date item open r close
0 2000-07-03 A 19.70 5 17.10
1 2000-07-05 A 49.80 2 8.30
2 2000-07-15 A 89.50 3 4.10
9 2000-07-03 B 66.26 4 6.22
10 2000-07-05 B 78.81 6 4.34
11 2000-07-15 B 43.45 7 2.45
18 2000-07-03 C 23.26 1 9.90
19 2000-07-05 C 39.81 5 5.10
20 2000-07-15 C 29.30 8 1.20
data['Val_Diff'] = data['open'] - data['close']
print(data['Val_Diff'].max())
85.4
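The question actually asked for the lowest (open - close) rather than the highest. Continuing from the filtered frame above, a grouped aggregation would give that (a sketch):
per_item = data.groupby('item')['Val_Diff'].min()
print(per_item)            # lowest (open - close) within each of A, B and C
print(per_item.idxmin())   # the item with the overall lowest difference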
Using pandas, I'd like to use groupby with an aggregate function, e.g. mean, and then put the results back into the original dataframe, but in the next group rather than in the group itself. How can I do this in a vectorized way?
I have a pandas dataframe like this:
data = {'Group': ['A','A','B','B','B','B', 'C','C', 'D','D'],
'Value': [1.1,1.3,9.1,9.2,9.5,9.4,6.2,6.4,2.2,2.3]
}
df = pd.DataFrame(data, columns = ['Group','Value'])
print (df)
Group Value
0 A 1.1
1 A 1.3
2 B 9.1
3 B 9.2
4 B 9.5
5 B 9.4
6 C 6.2
7 C 6.4
8 D 2.2
9 D 2.3
I'd like to get this, where each group has the mean value of the previous group.
Group Value
0 A NaN
1 A NaN
2 B 1.2
3 B 1.2
4 B 1.2
5 B 1.2
6 C 9.3
7 C 9.3
8 D 6.3
9 D 6.3
I tried this, but it is missing the shift to the next group:
df.groupby('Group')['Value'].transform('mean')
Easy, use map on a groupby result:
df['Value'] = df['Group'].map(df.groupby('Group')['Value'].mean().shift())
df
Group Value
0 A NaN
1 A NaN
2 B 1.2
3 B 1.2
4 B 1.2
5 B 1.2
6 C 9.3
7 C 9.3
8 D 6.3
9 D 6.3
How It Works
Get the mean
df.groupby('Group')['Value'].mean()
Group
A 1.20
B 9.30
C 6.30
D 2.25
Name: Value, dtype: float64
Shift it down by 1
df.groupby('Group')['Value'].mean().shift()
Group
A NaN
B 1.2
C 9.3
D 6.3
Name: Value, dtype: float64
Map it back.
df['Group'].map(df.groupby('Group')['Value'].mean().shift())
0 NaN
1 NaN
2 1.2
3 1.2
4 1.2
5 1.2
6 9.3
7 9.3
8 6.3
9 6.3
Name: Group, dtype: float64
You can calculate the aggregated GroupBy.mean of each group, shift it with Series.shift, and take advantage of pandas index alignment:
df.set_index('Group').assign(value = df.groupby('Group')['Value'].mean().shift()).reset_index()
Group Value value
0 A 1.1 NaN
1 A 1.3 NaN
2 B 9.1 1.2
3 B 9.2 1.2
4 B 9.5 1.2
5 B 9.4 1.2
6 C 6.2 9.3
7 C 6.4 9.3
8 D 2.2 6.3
9 D 2.3 6.3
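One caveat worth noting: .mean().shift() shifts in the sorted order of the group keys, which here happens to coincide with the order of appearance. If "previous group" should instead mean the previous group by first appearance, you can reindex before shifting; a sketch, where PrevMean is a hypothetical column name:
order = df['Group'].drop_duplicates()            # groups in order of first appearance
means = df.groupby('Group')['Value'].mean().reindex(order).shift()
df['PrevMean'] = df['Group'].map(means)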
I am trying to calculate rolling averages within groups. For this task I want a rolling average of the rows above, so I thought the easiest way would be to use shift() and then rolling(). The problem is that the rolling window runs over the whole column, so it pulls in data from previous groups, which makes the first row in groups 2 and 3 incorrect. Column 'ma' should have NaN in rows 3 and 6. How can I achieve this?
import pandas as pd
df = pd.DataFrame(
{"Group": [1, 2, 3, 1, 2, 3, 1, 2, 3],
"Value": [2.5, 2.9, 1.6, 9.1, 5.7, 8.2, 4.9, 3.1, 7.5]
})
df = df.sort_values(['Group'])
df.reset_index(inplace=True)
df['ma'] = df.groupby('Group', as_index=False)['Value'].shift(1).rolling(3, min_periods=1).mean()
print(df)
I get this:
index Group Value ma
0 0 1 2.5 NaN
1 3 1 9.1 2.50
2 6 1 4.9 5.80
3 1 2 2.9 5.80
4 4 2 5.7 6.00
5 7 2 3.1 4.30
6 2 3 1.6 4.30
7 5 3 8.2 3.65
8 8 3 7.5 4.90
I tried the answers from a couple of similar questions, but nothing seems to work.
If I understand the question correctly, the solution you require can be achieved in two steps:
df['sa'] = df.groupby('Group', as_index=False)['Value'].transform(lambda x: x.shift(1))
df['ma'] = df.groupby('Group', as_index=False)['sa'].transform(lambda x: x.rolling(3, min_periods=1).mean())
I got the output below, where 'ma' is the desired column:
index Group Value sa ma
0 0 1 2.5 NaN NaN
1 3 1 9.1 2.5 2.5
2 6 1 4.9 9.1 5.8
3 1 2 2.9 NaN NaN
4 4 2 5.7 2.9 2.9
5 7 2 3.1 5.7 4.3
6 2 3 1.6 NaN NaN
7 5 3 8.2 1.6 1.6
8 8 3 7.5 8.2 4.9
Edit: Example with one groupby
def shift_ma(x):
return x.shift(1).rolling(3, min_periods=1).mean()
df['ma'] = df.groupby('Group', as_index=False)['Value'].apply(shift_ma).reset_index(drop=True)
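The two steps (or the apply-based variant above) can also be collapsed into a single grouped transform; a sketch of the same logic:
df['ma'] = df.groupby('Group')['Value'].transform(
    lambda x: x.shift(1).rolling(3, min_periods=1).mean()
)
Because both the shift and the rolling window run inside the transform, neither crosses group boundaries.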
I need to add three columns to a pandas dataframe, computed from existing data.
df
>>
n a b
0 3 1.2 1.4
1 2 2.8 3.8
2 3 2.3 2.0
3 3 1.7 5.7
4 2 6.9 4.9
5 1 3.9 19.0
6 9 2.3 8.3
7 5 8.5 3.1
8 18 6.7 7.0
9 10 5.6 6.4
I have done the following:
import pandas
import numpy
def add_tests(add_df):
    new_tests = """
    (a+b)/n
    (a*b)/n
    ((a+b)/n)**-1
    """.split()
    rows = add_df.shape[0]
    cols = len(new_tests)
    U = pandas.DataFrame(numpy.empty([rows, cols]), columns=new_tests)
    add_df = pandas.concat([df, U], axis=1)
    for i, row in add_df.iterrows():
        # 1) good calculation:
        add_df['(a+b)/n'].loc[i] = (add_df['a'].loc[i] + add_df['b'].loc[i]) / add_df['n'].loc[i]
        # 2) good calculation (both ways):
        add_df['(a*b)/n'].loc[i] = (row['a'] * row['b']) / row['n']
        # 3) bad calculation
        add_df['((a+b)/n)**-1'].loc[i] = row['(a+b)/n'] ** -1
    return add_df
I get the following warning message:
df = add_tests(df)
df
>>
C:...\pandas\core\indexing.py:141: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
n a b (a+b)/n (a*b)/n ((a+b)/n)**-1
0 3 1.2 1.4 0.866667 0.560000 0.833333
1 2 2.8 3.8 3.300000 5.320000 0.588235
2 3 2.3 2.0 1.433333 1.533333 0.434783
3 3 1.7 5.7 2.466667 3.230000 0.178571
4 2 6.9 4.9 5.900000 16.905000 0.500000
5 1 3.9 19.0 22.900000 74.100000 0.052632
6 9 2.3 8.3 1.177778 2.121111 0.142857
7 5 8.5 3.1 2.320000 5.270000 0.263158
8 18 6.7 7.0 0.761111 2.605556 0.111111
9 10 5.6 6.4 1.200000 3.584000 0.666667
Obviously step 3 does not work properly: row is a copy yielded by iterrows() before step 1 runs, so row['(a+b)/n'] still holds the uninitialized value from numpy.empty rather than the freshly computed result.
How do I do it the right way?
Fun with eval:
define tuples of temporary column names and formulas
create a newline-separated string of formulas to pass to eval
use a dictionary to rename the temporary names back to the formula column names
ftups = [('aa', '(a+b)/n'), ('bb', '(a*b)/n'), ('cc', '((a+b)/n)**-1')]
forms = '\n'.join([' = '.join(tup) for tup in ftups])
fdict = dict(ftups)
df.eval(forms, inplace=False).rename(columns=fdict)
n a b (a+b)/n (a*b)/n ((a+b)/n)**-1
0 3 1.2 1.4 0.866667 0.560000 1.153846
1 2 2.8 3.8 3.300000 5.320000 0.303030
2 3 2.3 2.0 1.433333 1.533333 0.697674
3 3 1.7 5.7 2.466667 3.230000 0.405405
4 2 6.9 4.9 5.900000 16.905000 0.169492
5 1 3.9 19.0 22.900000 74.100000 0.043668
6 9 2.3 8.3 1.177778 2.121111 0.849057
7 5 8.5 3.1 2.320000 5.270000 0.431034
8 18 6.7 7.0 0.761111 2.605556 1.313869
9 10 5.6 6.4 1.200000 3.584000 0.833333
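For comparison, the row loop in the question can also be replaced with plain vectorized column arithmetic, which avoids both the SettingWithCopyWarning and the stale-row bug in step 3:
df['(a+b)/n'] = (df['a'] + df['b']) / df['n']
df['(a*b)/n'] = (df['a'] * df['b']) / df['n']
df['((a+b)/n)**-1'] = 1 / df['(a+b)/n']   # reads the column computed on the first line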