I'm new to pandas and trying to manage some dataframe operations. I have a multi-index dataframe with 4 columns, and I need an extra column whose value equals the value in each row divided by the value in one specific row.
In my example below, I would like, for each entry, the new column "Agg" to be the column "Values" for each Type (1, 2, 3) divided by the "Values" for Calc.
Date                Values    Agg
2016-01-01  Type 1      17   1.70
            Type 2      23   2.30
            Type 3      11   1.10
            Calc        10   1.00
2016-01-02  Type 1      25   0.25
            Type 2      39   0.39
            Type 3      34   0.34
            Calc       100   1.00
2016-01-03  Type 1      20   1.00
            Type 2       9   0.45
            Type 3      12   0.60
            Calc        20   1.00
In my actual code, I group by "Date" along with other index levels; these change depending on the results of a query to the database.
Thanks in advance!
The code below works. I spent too much time writing it, so I have to leave it at that. Let me know if you need explanations!
def func(df1):
    # Take this group's Date label and drop that index level.
    idx = df1.index.get_level_values(0)[0]
    df1 = df1.loc[idx]
    # Divide every row's Values by the Calc row's Values.
    return (df1['Values'] / df1.loc['Calc']['Values']).to_frame()

df.groupby(level=0).apply(func)
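For reference, here is a minimal sketch of an alternative that avoids the helper function by pulling the Calc rows out with xs and broadcasting them back per date. It assumes a two-level index named (Date, Type); the sample data below is hypothetical, mirroring the example above.

import pandas as pd

idx = pd.MultiIndex.from_product(
    [['2016-01-01', '2016-01-02'], ['Type 1', 'Type 2', 'Type 3', 'Calc']],
    names=['Date', 'Type'])
df = pd.DataFrame({'Values': [17, 23, 11, 10, 25, 39, 34, 100]}, index=idx)

# One Calc value per Date, repeated onto every row of that date.
calc = df.xs('Calc', level='Type')['Values']
df['Agg'] = df['Values'] / calc.reindex(df.index.get_level_values('Date')).values
print(df)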
I am very new to Python and trying to complete an assignment for uni. I've already tried googling the issue (and there may already be a solution out there), but I could not find one for my problem.
I have a dataframe with values and a timestamp. It looks like this:
created_at    delta
2020-01-01     1.45
2020-01-02     0.12
2020-01-03     1.01
...             ...
I want to create a new column 'sum' which adds up all the previous values, like this:
created_at    delta    sum
2020-01-01     1.45   1.45
2020-01-02     0.12   1.57
2020-01-03     1.01   2.58
...             ...    ...
I want to define a method that I can use on different files (the data is spread across multiple files).
I have tried this, but it doesn't work:
def sum_(data_index):
    df_sum = delta_(data_index)  # getting the data
    y = len(df_sum)
    for x in range(0, y):
        df_sum['sum'].iloc[[0]] = df_sum['delta'].iloc[[0]]
        df_sum['sum'].iloc[[x]] = df_sum['sum'].iloc[[x-1]] + df_sum['delta'].iloc[[x]]
    return df_sum
I would be very thankful for any help.
Kind regards
Try cumsum():
df['sum'] = df['delta'].cumsum()
A simple example using cumsum:
import pandas as pd
df = pd.DataFrame({'x':[1,2,3,4,5]})
df['y'] = df['x'].cumsum()
print(df)
Output:
x y
0 1 1
1 2 3
2 3 6
3 4 10
4 5 15
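Since the data is spread across multiple files, one way to reuse this is to wrap cumsum in a small helper. A minimal sketch, assuming CSV files with the created_at and delta columns from the question (the file paths are placeholders):

import pandas as pd

def add_running_sum(path):
    # Read one file and append the running total of delta.
    df = pd.read_csv(path, parse_dates=['created_at'])
    df['sum'] = df['delta'].cumsum()
    return df

# frames = [add_running_sum(p) for p in ['file1.csv', 'file2.csv']]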
I have a rather specific requirement, but others might be able to reuse the knowledge from this question in their own scenarios.
I have the following dataframe, which contains value columns, each followed by a related relay column. I need to multiply each value column by the relay column next to it and write the result into a new dataset (a kind of summary). There are many columns, so it is very tedious to multiply them manually.
What is the best way to do this?
A      B  C      D  ....
12.23  0  43.34  1  ....
78.56  1  67.78  0  ....
Result:
X      Y
0      43.34  ....
78.56  0      ....
Use:
out = pd.DataFrame(df[df.columns[::2]].values * df[df.columns[1::2]].values,
                   columns=['X', 'Y'])
print(out)
X Y
0 0.00 43.34
1 78.56 0.00
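With many column pairs you may not want to hard-code the output names. A minimal sketch that keeps the value columns' own names and the original index (the sample frame is hypothetical, laid out like the question's):

import pandas as pd

df = pd.DataFrame({'A': [12.23, 78.56], 'B': [0, 1],
                   'C': [43.34, 67.78], 'D': [1, 0]})

vals = df.iloc[:, ::2]     # value columns (A, C, ...)
relays = df.iloc[:, 1::2]  # relay columns next to them (B, D, ...)
out = pd.DataFrame(vals.values * relays.values,
                   columns=vals.columns, index=df.index)
print(out)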
I would like to remove the characters from the columns in a pandas dataframe. I have around 10 columns, and each has characters. Please see the sample column below. The column type is string; I would like to remove the characters and convert the column into float.
10.2\I
10.1\Y
NAN
12.5\T
13.3\T
9.4\J
NAN
12.2\N
NAN
11.9\U
NAN
12.4\O
NAN
8.3\U
13.5\B
NAN
13.1\V
11.0\Q
11.0\X
8.200000000000001\U
NAN
13.1\T
8.1\O
9.4\N
I would like to remove the '\' and all the letters and make the values into floats. I don't want to change the NAN values.
I used df['column name'] = df.str[:4]. It trims some of the cells but not all of them, and I am unable to convert to float as I am getting an error:
df['column name'] = df.str[:4]
df['column name'].astype(float)
0 10.2
1 10.1
2 NaN
3 12.5
4 13.3
5 9.4\
6 8.3\
22 8.1\
27 9.4\
28 NaN
29 10.6
30 10.8
31 NaN
32 7.3\
33 9.8\
34 NaN
35 12.4
36 8.1\
It still does not convert the other cells, and I get an error when I try to convert to float:
ValueError: could not convert string to float: '10.2\I'
Two reasons I can see why your code is not working:
Using [:4] will not work for all values in your example since the number of digits before the decimal point (and apparently after it) varies.
In the df['column name'] = df.str[:4] assignment there needs to be the same column identifier on the right side of the equal sign.
Here is a solution with a sample dataframe I prepared with two abbreviated columns like in your example. It uses [:-2] to truncate each value from the right side and then replaces remaining N's with the original NAN's before converting to float.
import pandas as pd

# Raw strings avoid backslash-escape problems (e.g. \U would otherwise
# start a unicode escape).
col = pd.Series([r"10.2\I", r"10.1\Y", 'NAN', r"12.5\T"])
col2 = pd.Series([r"11.0\Q", r"11.0\X", 'NAN', r"8.200000000000001\U"])
df = pd.concat([col, col2], axis=1)
df.rename(columns={0: 'col1', 1: 'col2'}, inplace=True)
df
col1 col2
0 10.2\I 11.0\Q
1 10.1\Y 11.0\X
2 NAN NAN
3 12.5\T 8.200000000000001\U
# Apply the conversion to all columns in the dataframe.
for col in df:
    df[col] = df[col].str[:-2].replace('N', 'NAN').astype(float)
df
col1 col2
0 10.2 11.0
1 10.1 11.0
2 NaN NaN
3 12.5 8.2
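If the trailing junk is not always exactly two characters, a regex is more robust than a fixed slice. A minimal sketch using str.extract (the pattern is an assumption about the data, matching a leading decimal number):

# Pull the leading number out of each cell; non-matches such as 'NAN' become NaN.
for col in df:
    df[col] = df[col].str.extract(r'(\d+\.?\d*)', expand=False).astype(float)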
I have a DataFrame A in Jupyter that looks like the following
Index  Var1.A.1  Var1.B.1  Var1.CA.1  Var2.A.1  Var2.B.1  Var2.CA.1
0             1        21          3         3         4          4
1             3         5          4         9         5          1
....
100           9        75          2         4         8          2
I'd like to assess the mean value based on the extension of the name, i.e.
Mean value of .A.1
Mean Value of .B.1
Mean value of .CA.1
For example, to assess the mean value of the variable with extension .A.1, I've tried the following, which doesn't return what I look for
List=['.A.1', '.B.1', '.CA.1']
A[List[List.str.contains('.A.1')]].mean()
However, this way I get the mean values of the different variables, including CA.1, which is not what I am looking for.
Any advice?
Thanks
If you want the mean per row grouped by everything after the first ., use groupby with a lambda function and mean:
df = df.groupby(lambda x: x.split('.', 1)[-1], axis=1).mean()
print(df)

     A.1   B.1  CA.1
0    2.0  12.5   3.5
1    6.0   5.0   2.5
100  6.5  41.5   2.0
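Note that groupby(..., axis=1) is deprecated in recent pandas versions; a minimal sketch of the same idea via a transpose:

df = df.T.groupby(lambda x: x.split('.', 1)[-1]).mean().T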
Here is a third option:
columns = A.columns
A[[s for s in columns if ".A.1" in s]].stack().reset_index().mean()
dfA.filter(like='.A.1') gives you the columns containing the '.A.1' substring.
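A minimal sketch combining filter with a loop to get one overall mean per extension. Note that like= is a plain substring match, not a regex, so '.A.1' does not accidentally match '.CA.1' the way str.contains with a regex dot does:

for ext in ['.A.1', '.B.1', '.CA.1']:
    print(ext, A.filter(like=ext).stack().mean())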
I am trying to take the word "Days" out of the column Days_To_Maturity, so instead of "Days 0" it will just be 0. I have tried a few things, but I am wondering if there is an easy way to do this built into Python. Thanks.
In[12]:
from pandas import *
XYZ = read_csv('XYZ')
df_XYZ = DataFrame(XYZ)
df_XYZ.head()
Out[12]:
Dates Days_To_Maturity Yield
0 5/1/2002 Days 0 0.00
1 5/1/2002 Days 1 0.06
2 5/1/2002 Days 2 0.12
3 5/1/2002 Days 3 0.18
4 5/1/2002 Days 4 0.23
5 rows × 3 columns
You can explore the .str methods: extract the numbers using a regex, take a slice with .str.slice, or, as in this example, replace 'Days ' with an empty string:
In [109]:
df.Days_To_Maturity.str.replace('Days ','').astype(int)
Out[109]:
0 0
1 1
2 2
3 3
4 4
Name: Days_To_Maturity, dtype: int32
I think the solution you are looking for is in the "converters" option of the read_csv function of pandas. From help(pandas.read_csv):
converters: dict. optional
Dict of functions for converting values in certain columns. Keys can either be integers or column labels.
So instead of read_csv('XYZ') you would make a custom converter:
myconverter = { 'Days_To_Maturity': lambda x: x.split(' ')[1] }
read_csv('XYZ', converters=myconverter)
This should work. Please let me know if it helps!
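If you want a numeric column directly, the converter can wrap the split in int(). A minimal sketch (same assumed file and column name as above):

myconverter = {'Days_To_Maturity': lambda x: int(x.split(' ')[1])}
df_XYZ = read_csv('XYZ', converters=myconverter)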