I am attempting to perform arithmetic on the 'data_d' column.
dataframe
data_a data_b data_c data_d
60 0.30786 Discharge 2.31714
61 0.30792 Rest 2.34857
121 0.62095 Rest 2.38647
182 0.93398 Discharge 2.31115
183 0.93408 Rest 2.34550
243 1.24711 Rest 2.37162
304 1.56014 Discharge 2.30855
305 1.56019 Rest 2.34215
365 1.87322 Rest 2.36276
426 2.18630 Discharge 2.30591
I want to assign the variables A,B,C into a new column named 'variable'. As shown below.
dataframe2
data_a data_b data_c data_d variable
60 0.30786 Discharge 2.31714 A
61 0.30792 Rest 2.34857 B
121 0.62095 Rest 2.38647 C
182 0.93398 Discharge 2.31115 A
183 0.93408 Rest 2.34550 B
243 1.24711 Rest 2.37162 C
304 1.56014 Discharge 2.30855 A
305 1.56019 Rest 2.34215 B
365 1.87322 Rest 2.36276 C
426 2.18630 Discharge 2.30591 A
The script then should perform the following operation iteratively over the entire 'data_d' column.
(C - (B-A))
(2.38647 - (2.34857-2.31714))
(2.35504)
...
dataframe3
measurement
0 2.35504
1 2.33727
2 2.32916
... ...
And so on.
Thank you in advance for any insight.
We use the cumsum to create the groupby key , then do cumcount with groupby map the number of count back to letter
key = df['data_c'].eq('Discharge').cumsum()
df['variable'] = df.groupby(key).cumcount().map({0:'A',1:'B',2:'C'})
df
Out[61]:
data_a data_b data_c data_d variable
0 60 0.30786 Discharge 2.31714 A
1 61 0.30792 Rest 2.34857 B
2 121 0.62095 Rest 2.38647 C
3 182 0.93398 Discharge 2.31115 A
4 183 0.93408 Rest 2.34550 B
5 243 1.24711 Rest 2.37162 C
6 304 1.56014 Discharge 2.30855 A
7 305 1.56019 Rest 2.34215 B
8 365 1.87322 Rest 2.36276 C
9 426 2.18630 Discharge 2.30591 A
Then we just need to pivot : here I am using crosstab
s = pd.crosstab(index=key, columns=df['variable'], values=df['data_d'], aggfunc='sum')
dfout = s.eval('C - (B-A)').to_frame(name = 'measurement')
dfout
Out[69]:
measurement
data_c
1 2.35504
2 2.33727
3 2.32916
4 NaN
Related
I have a DataFrame in which I have already defined rows to be summed up and store the results in a new row.
For example in Year 1990:
Category
A
B
C
D
Year
E
147
78
476
531
1990
F
914
356
337
781
1990
G
117
874
15
69
1990
H
45
682
247
65
1990
I
20
255
465
19
1990
Here, the rows G - H should be summed up and the results stored in a new row. The same categories repeat every year from 1990 - 2019
I have already tried it with .iloc e.g. [4:8], [50:54] [96:100] and so on, but with iloc I can not specify multiple index. I can't manage to make a loop over the single years.
Is there a way to sum the values in categories (G-H) for each year (1990 -2019)?
I'm not sure the multiple index what you mean.
It usually appear after some group and aggregate function.
At your table, it looks just multiple column
So, if I understand correctly.
Here a complete code to show how to use the multiple condition of DataFrame
import io
import pandas as pd
data = """Category A B C D Year
E 147 78 476 531 1990
F 914 356 337 781 1990
G 117 874 15 69 1990
H 45 682 247 65 1990
I 20 255 465 19 1990"""
table = pd.read_csv(io.StringIO(data), delimiter="\t")
years = table["Year"].unique()
for year in years:
row = table[((table["Category"] == "G") | (table["Category"] == "H")) & (table["Year"] == year)]
row = row[["A", "B", "C", "D"]].sum()
row["Category"], row["Year"] = "sum", year
table = table.append(row, ignore_index=True)
If you are only interested in G/H, you can slice with isin combined with boolean indexing, then sum:
df[df['Category'].isin(['G', 'H'])].sum()
output:
Category GH
A 162
B 1556
C 262
D 134
Year 3980
dtype: object
NB. note here the side effect of sum that combines the two "G"/"H" strings into one "GH".
Or, better, set Category as index and slice with loc:
df.set_index('Category').loc[['G', 'H']].sum()
output:
A 162
B 1556
C 262
D 134
Year 3980
dtype: int64
Still learning my way around Python and trying to figure out how to process some data. I've got a dataframe with 1 column that I need to extract into 3 columns of data. I don't need to keep the original column.
Here's the data - "Given Data" is the original column and I want to extract out columns A and B, then do the math for column C (A/B). Thanks for your help!
Try with str.strip and str.split:
df[["A", "B"]] = df["Given Data"].str.strip("()").str.split(" / ", expand=True).astype(int)
df["C"] = df["A"].div(df["B"])
>>> df
Given Data A B C
0 (313 / 321) 313 321 0.975078
1 (654 / 654) 654 654 1.000000
2 (673 / 842) 673 842 0.799287
3 (342 / 402) 342 402 0.850746
4 (586 / 774) 586 774 0.757106
If you want to convert the numeric "C" column to percentage strings, you can do:
df["C"] = df["C"].mul(100).map("{:.2f}%".format)
>>> df
Given Data A B C
0 (313 / 321) 313 321 97.51%
1 (654 / 654) 654 654 100.00%
2 (673 / 842) 673 842 79.93%
3 (342 / 402) 342 402 85.07%
4 (586 / 774) 586 774 75.71%
I have a DataFrame (df) like shown below where each column is sorted from largest to smallest for frequency analysis. That leaves some values either zeros or NaN values as each column has a different length.
08FB006 08FC001 08FC003 08FC005 08GD004
----------------------------------------------
0 253 872 256 11.80 2660
1 250 850 255 10.60 2510
2 246 850 241 10.30 2130
3 241 827 235 9.32 1970
4 241 821 229 9.17 1900
5 232 0 228 8.93 1840
6 231 0 225 8.05 1710
7 0 0 225 0 1610
8 0 0 224 0 1590
9 0 0 0 0 1590
10 0 0 0 0 1550
I need to perform the following calculation as if each column has different lengths or number of records (ignoring zero values). I have tried using NaN but for some reason operations on Nan values are not possible.
Here is what I am trying to do with my df columns :
shape_list1=[]
location_list1=[]
scale_list1=[]
for column in df.columns:
shape1, location1, scale1=stats.genpareto.fit(df[column])
shape_list1.append(shape1)
location_list1.append(location1)
scale_list1.append(scale1)
Assuming all values are positive (as seems from your example and description), try:
stats.genpareto.fit(df[df[column] > 0][column])
This filters every column to operate just on the positive values.
Or, if negative values are allowed,
stats.genpareto.fit(df[df[column] != 0][column])
The syntax is messy, but change
shape1, location1, scale1=stats.genpareto.fit(df[column])
to
shape1, location1, scale1=stats.genpareto.fit(df[column][df[column].nonzero()[0]])
Explanation: df[column].nonzero() returns a tuple of size (1,) whose only element, element [0], is a numpy array that holds the index labels where df is nonzero. To index df[column] by these nonzero labels, you can use df[column][df[column].nonzero()[0]].
I am trying to translate the input dataframe (inp_df) to output dataframe (out_df) using the the data from the cell based intermediate dataframe (matrix_df) as shown below.
There are several cell number based files with distance values shown in matrix_df .
The program iterates by cell & fetches data from appropriate file so each time matrix_df will have the data for all rows of the current cell# that we are iterating for in inp_df.
inp_df
A B cell
100 200 1
115 270 1
145 255 2
115 266 1
matrix_df (cell_1.csv)
B 100 115 199 avg_distance
200 7.5 80.7 67.8 52
270 6.8 53 92 50
266 58 84 31 57
matrix_df (cell_2.csv)
B 145 121 166 avg_distance
255 74.9 77.53 8 53.47
out_df dataframe
A B cell distance avg_distance
100 200 1 7.5 52
115 270 1 53 50
145 255 2 74.9 53.47
115 266 1 84 57
My current thought process for each cell# based data is
use a apply function to go row by row
then use a join based on column B in the inp_df with with matrix_df, where the matrix df is somehow translated into a tuple of column name, distance & average distance.
But I am looking for a pandonic way of doing this since my approach will slow down when there are millions of rows in the input. I am specifically looking for core logic inside an iteration to fetch the matches, since in each cell the number of columns in matrix_df would vary
If its any help the matrix files is the distance based outputs from sklearn.metrics.pairwise.pairwise_distances .
NB: In inp_df the value of column B is unique and values of column A may or may not be unique
Also the matrix_dfs first column was empty & i had renamed it with the following code for easiness in understanding since it was a header-less matrix output file.
dist_df = pd.read_csv(mypath,index_col=False)
dist_df.rename(columns={'Unnamed: 0':'B'}, inplace=True)
Step 1: Concatenate your inputs with pd.concat and merge with inp_df using df.merge
In [641]: out_df = pd.concat([matrix_df1, matrix_df2]).merge(inp_df)
Step 2: Create the distance column with df.apply by using A's values to index into the correct column
In [642]: out_df.assign(distance=out_df.apply(lambda x: x[str(int(x['A']))], axis=1))\
[['A', 'B', 'cell', 'distance', 'avg_distance']]
Out[642]:
A B cell distance avg_distance
0 100 200 1 7.5 52.00
1 115 270 1 53.0 50.00
2 115 266 1 84.0 57.00
3 145 255 2 74.9 53.47
I'm trying to write a function to aggregate and perform various stats calcuations on a dataframe in Pandas and then merge it to the original dataframe however, I'm running to issues. This is code equivalent in SQL:
SELECT EID,
PCODE,
SUM(PVALUE) AS PVALUE,
SUM(SQRT(SC*EXP(SC-1))) AS SC,
SUM(SI) AS SI,
SUM(EE) AS EE
INTO foo_bar_grp
FROM foo_bar
GROUP BY EID, PCODE
And then join on the original table:
SELECT *
FROM foo_bar_grp INNER JOIN
foo_bar ON foo_bar.EID = foo_bar_grp.EID
AND foo_bar.PCODE = foo_bar_grp.PCODE
Here are the steps: Loading the data
IN:>>
pol_dict = {'PID':[1,1,2,2],
'EID':[123,123,123,123],
'PCODE':['GU','GR','GU','GR'],
'PVALUE':[100,50,150,300],
'SI':[400,40,140,140],
'SC':[230,23,213,213],
'EE':[10000,10000,2000,30000],
}
pol_df = DataFrame(pol_dict)
pol_df
OUT:>>
EID EE PCODE PID PVALUE SC SI
0 123 10000 GU 1 100 230 400
1 123 10000 GR 1 50 23 40
2 123 2000 GU 2 150 213 140
3 123 30000 GR 2 300 213 140
Step 2: Calculating and Grouping on the data:
My pandas code is as follows:
#create aggregation dataframe
poagg_df = pol_df
del poagg_df['PID']
po_grouped_df = poagg_df.groupby(['EID','PCODE'])
#generate acc level aggregate
acc_df = po_grouped_df.agg({
'PVALUE' : np.sum,
'SI' : lambda x: np.sqrt(np.sum(x * np.exp(x-1))),
'SC' : np.sum,
'EE' : np.sum
})
This works fine until I want to join on the original table:
IN:>>
po_account_df = pd.merge(acc_df, po_df, on=['EID','PCODE'], how='inner',suffixes=('_Acc','_Po'))
OUT:>>
KeyError: u'no item named EID'
For some reason, the grouped dataframe can't join back to the original table. I've looked at ways of trying to convert the groupby columns to actual columns but that doesn't seem to work.
Please note, the end goal is to be able to find the percentage for each column (PVALUE, SI, SC, EE) IE:
pol_acc_df['PVALUE_PCT'] = np.round(pol_acc_df.PVALUE_Po/pol_acc_df.PVALUE_Acc,4)
Thanks!
By default, groupby output has the grouping columns as indicies, not columns, which is why the merge is failing.
There are a couple different ways to handle it, probably the easiest is using the as_index parameter when you define the groupby object.
po_grouped_df = poagg_df.groupby(['EID','PCODE'], as_index=False)
Then, your merge should work as expected.
In [356]: pd.merge(acc_df, pol_df, on=['EID','PCODE'], how='inner',suffixes=('_Acc','_Po'))
Out[356]:
EID PCODE SC_Acc EE_Acc SI_Acc PVALUE_Acc EE_Po PVALUE_Po \
0 123 GR 236 40000 1.805222e+31 350 10000 50
1 123 GR 236 40000 1.805222e+31 350 30000 300
2 123 GU 443 12000 8.765549e+87 250 10000 100
3 123 GU 443 12000 8.765549e+87 250 2000 150
SC_Po SI_Po
0 23 40
1 213 140
2 230 400
3 213 140
From the pandas docs:
Transformation: perform some group-specific computations and return a like-indexed object
Unfortunately, transform works series by series, so you wouldn't be able to perform multiple functions on multiple columns as you've done with agg, but transform does allow you to skip merge
po_grouped_df = pol_df.groupby(['EID','PCODE'])
pol_df['sum_pval'] = po_grouped_df['PVALUE'].transform(sum)
pol_df['func_si'] = po_grouped_df['SI'].transform(lambda x: np.sqrt(np.sum(x * np.exp(x-1))))
pol_df['sum_sc'] = po_grouped_df['SC'].transform(sum)
pol_df['sum_ee'] = po_grouped_df['EE'].transform(sum)
pol_df
Results in:
PID EID PCODE PVALUE SI SC EE sum_pval func_si sum_sc sum_ee
1 123 GU 100 400 230 10000 250 8.765549e+87 443 12000
1 123 GR 50 40 23 10000 350 1.805222e+31 236 40000
2 123 GU 150 140 213 2000 250 8.765549e+87 443 12000
2 123 GR 300 140 213 30000 350 1.805222e+31 236 40000
For more info, check out this SO answer.