Summing rows in Python Dataframe

I just started learning Python so forgive me if this question has already been answered somewhere else. I want to create a new column called "Sum", which will simply be the previous columns added up.
Risk_Parity.tail()
VCIT VCLT PCY RWR IJR XLU EWL
Date
2017-01-31 21.704155 11.733716 9.588649 8.278629 5.061788 7.010918 7.951747
2017-02-28 19.839319 10.748690 9.582891 7.548530 5.066478 7.453951 7.950232
2017-03-31 19.986782 10.754507 9.593623 7.370828 5.024079 7.402774 7.654366
2017-04-30 18.897307 11.102380 10.021139 9.666693 5.901137 7.398604 11.284331
2017-05-31 63.962659 23.670240 46.018698 9.917160 15.234977 12.344524 20.405587
The table columns are a little off but all I need is (21.70 + 11.73...+7.95)
I can only get as far as creating the column Risk_Parity['sum'] = , but then I'm lost.
I'd rather not have to do Risk_Parity['sum'] = Risk_Parity['VCIT'] + Risk_Parity['VCLT']...
After creating the sum column, I want to divide each column by the sum column and make that into a new dataframe, which wouldn't include the sum column.
If anyone could help, I'd greatly appreciate it. Please try to dumb your answers down as much as possible lol.
Thanks!
Tom

Use sum with the parameter axis=1 to specify summation over rows:
Risk_Parity['Sum'] = Risk_Parity.sum(axis=1)
To create a new copy of Risk_Parity without writing a new column to the original:
Risk_Parity.assign(Sum=Risk_Parity.sum(axis=1))
Notice also that I named the column Sum and not sum. I did this to avoid colliding with the DataFrame method of the same name that I used to create the column.
To only include numeric columns explicitly, use select_dtypes, although sum skips non-numeric columns anyway:
Risk_Parity.assign(Sum=Risk_Parity.select_dtypes(['number']).sum(axis=1))
# same as
# Risk_Parity.assign(Sum=Risk_Parity.sum(axis=1))
VCIT VCLT PCY RWR IJR XLU EWL Sum
Date
2017-01-31 21.70 11.73 9.59 8.28 5.06 7.01 7.95 71.33
2017-02-28 19.84 10.75 9.58 7.55 5.07 7.45 7.95 68.19
2017-03-31 19.99 10.75 9.59 7.37 5.02 7.40 7.65 67.79
2017-04-30 18.90 11.10 10.02 9.67 5.90 7.40 11.28 74.27
2017-05-31 63.96 23.67 46.02 9.92 15.23 12.34 20.41 191.55
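For the second part of the question (dividing each column by the sum and making a new dataframe that does not include the sum column), a small sketch, assuming the Sum column was created as above, could be:
weights = Risk_Parity.drop(columns='Sum').div(Risk_Parity['Sum'], axis=0)
Each row of weights then adds up to 1, and the Sum column itself is left out of the new dataframe.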

cols = ['VCIT', 'VCLT', 'PCY', 'RWR', 'IJR', 'XLU', 'EWL']
Risk_Parity['sum'] = 0
for item in cols:
    Risk_Parity['sum'] += Risk_Parity[item]
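The same result without an explicit loop (a sketch reusing the cols list above) would be:
Risk_Parity['sum'] = Risk_Parity[cols].sum(axis=1)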

Related

Pandas Dataframe Comparison and Copying

Below I have two dataframes, the first being dataframe det and the second being orig. I need to compare det['Detection'] with orig['Date/Time']. Once matching values are found during the comparison, I need to copy values from orig and det to some final dataframe (final). The format that I need the final dataframe in is det['Date/Time'], orig['Lat'], orig['Lon'], orig['Dep'], det['Mag']. I hope that my formatting is adequate; I was not sure how to handle the dataframes so I just placed them in tables. Some additional information that probably won't matter is that det is 3385 rows by 3 columns and orig is 818 rows by 9 columns.
det:
Date/Time                Mag     Detection
2008/12/27T01:06:56.37   0.280   2008/12/27T13:50:07.00
2008/12/27T01:17:39.39   0.485   2008/12/27T01:17:39.00
2008/12/27T01:33:23.00  -0.080   2008/12/27T01:17:39.00
orig:
Date/Time               Lat      Lon        Dep   Ml     Mc    N  Dmin  ehz
2008/12/27T01:17:39.00  44.5112  -110.3742  5.07  -9.99  0.51  5  6     3.2
2008/12/27T04:33:30.00  44.4985  -110.3750  4.24  -9.99  1.63  9  8     0.9
2008/12/27T05:38:22.00  44.4912  -110.3743  4.73  -9.99  0.37  8  8     0.8
final:
det['Date/Time']  orig['Lat']  orig['Lon']  orig['Dep']  det['Mag']
You can merge the two dataframes. Since you want to match the Detection column from the first dataframe against the Date/Time column from the second, you can simply rename that column of the second dataframe while merging, since the name Date/Time already exists in the first dataframe:
det.merge(orig.rename(columns={'Date/Time': 'Detection'}))
OUTPUT:
Date/Time Mag Detection Lat Lon Dep Ml Mc N Dmin ehz
0 2008/12/27T01:17:39.39 0.485 2008/12/27T01:17:39.00 44.5112 -110.3742 5.07 -9.99 0.51 5 6 3.2
1 2008/12/27T01:33:23.00 -0.080 2008/12/27T01:17:39.00 44.5112 -110.3742 5.07 -9.99 0.51 5 6 3.2
You can then select the columns you want.
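For example, to get exactly the final layout from the question (merged and final are just assumed variable names), one way could be:
merged = det.merge(orig.rename(columns={'Date/Time': 'Detection'}))
final = merged[['Date/Time', 'Lat', 'Lon', 'Dep', 'Mag']]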

How to perform calculation in string column pandas

df['concentration']
'4.0±0.0'
'2.5±0.2'
'5.8±0.1'
'45.0'
'23'
'26.07'
'64'
I want result as:
4.00
2.70
5.90
45.00
23.00
26.07
64.00
which is (4.0 + 0.0), i.e. taking the highest possible value.
However, my concentration column is not float, so how can I perform calculations on this type of data?
Any kind of help will be appreciated.
Thank you.
Use str.split with expand=True, convert the resulting digits to float, and add them row-wise using a lambda.
Data
df=pd.DataFrame({'concentration':['4.0±0.0','2.5±0.2','5.8±0.1','45.0','23','26.07','64']})
Solution
df['concentration'] = df.concentration.str.split('±', expand=True).apply(lambda x: x.astype(float)).sum(axis=1)
concentration
0 4.00
1 2.70
2 5.90
3 45.00
4 23.00
5 26.07
6 64.00
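To make the one-liner easier to follow, here is the same computation split into steps (a sketch; parts is just an assumed intermediate name):
parts = df.concentration.str.split('±', expand=True)
# Entries without '±' get None in the second column; converting to float
# turns those into NaN, which sum(axis=1) simply skips.
df['concentration'] = parts.apply(lambda x: x.astype(float)).sum(axis=1)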

How to remove missing data and 0s whilst keeping the dataframe the same shape using Pandas?

I have a dataframe and I want to reformat it in order to remove any missing values or zeros that occur before the first non-zero value across each row. However, I do not want to delete any rows or columns, and I do not want to remove any 0s or missing values that appear after the first non-zero value.
Below is the dataframe I am working with:
> data =[['Adam',2.55,4.53,3.45,2.12,3.14],['Bill',np.NaN,2.14,3.65,4.12],['Chris',np.NaN,0,2.82,0,6.04],['David',np.NaN,0,7.42,3.52]]
> df = pd.DataFrame(data, columns = ['Name', 'A','B','C','D','E'])
Moreover, here is the expected outcome:
> data1 =[['Adam',2.55,4.53,3.45,2.12,3.14],['Bill',2.14,3.65,4.12],['Chris',2.82,0,6.04],['David',7.42,3.52]]
> df1 = pd.DataFrame(data1, columns = ['Name', 'A','B','C','D','E'])
This is not a trivial problem. Here is the solution:
m=df.set_index('Name')
m=m[m.isin(m.mask(m.le(0)).bfill(axis=1).iloc[:,0]).cumsum(axis=1).astype(bool)]
print(m)
A B C D E
Name
Adam 2.55 4.53 3.45 2.12 3.14
Bill NaN 2.14 3.65 4.12 NaN
Chris NaN NaN 2.82 0.00 6.04
David NaN NaN 7.42 3.52 NaN
Then using justify:
pd.DataFrame(justify(m.values,np.nan),columns=m.columns,index=m.index).reset_index()
Name A B C D E
0 Adam 2.55 4.53 3.45 2.12 3.14
1 Bill 2.14 3.65 4.12 NaN NaN
2 Chris 2.82 0.00 6.04 NaN NaN
3 David 7.42 3.52 NaN NaN NaN
Explanation:
Step 1: Set the Name column as the index so we only deal with numeric values.
Step 2: m.mask(m.le(0)).bfill(axis=1).iloc[:, 0] gives the first value in each row that is greater than 0.
Step 3: isin() then returns True wherever that value appears in each row.
Step 4: cumsum(axis=1).astype(bool) marks every element from that point onward as True, so only those values are kept; everything else becomes NaN.
Then use the justify function from the linked post.
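justify is not a pandas built-in; it comes from that post. A sketch of such a helper (assuming the usual semantics: push all valid values in each row to one side and pad the rest with invalid_val) might look like:
import numpy as np

def justify(a, invalid_val=np.nan, axis=1, side='left'):
    # Mark which entries are valid
    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    # Sorting booleans pushes the True (valid) positions to one end of each row
    justified_mask = np.sort(mask, axis=axis)
    if side in ('up', 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val, dtype=float)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out
With a helper like that, the last line of the answer produces the df1-style layout shown above.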

Get time difference between two values in csv file [duplicate]

This question already has answers here:
Pandas: Difference to previous value
(2 answers)
Closed 3 years ago.
I'm trying to get the average, max and min time difference between value occurrences in a csv file.
The file contains multiple columns and rows.
I am currently working in Python and trying to use pandas to solve my problem.
I have managed to break the csv file down to the column I want to get the time difference from and the time column,
i.e. the rows where the "payload" value occurrences happen,
looking like:
time | payload
12.1 2368
13.8 2508
I have also tried to collect the times of the value occurrences in an array and step through the array, but that failed badly. I felt like there was an easier way to do it.
def average_time(avg_file):
    avg_read = pd.read_csv(avg_file, skiprows=2, names=new_col_names, usecols=[2, 3], na_filter=False, skip_blank_lines=True)
    test=[]
    i=0
    for row in avg_read.payload:
        if row != None:
            test[i]=avg_read.time
            i+=1
        if len[test] > 2:
            average=test[1]-test[0]
            i=0
            test=[]
    return average
The csv file currently looks like:
time | payload
12.1 2250
12.5 2305
12.9 (blank)
13.1 (blank)
13.5 2309
14.6 2350
14.9 2680
15.0 (blank)
I want to get the time difference between the values in the payload column, for example the time between
2250 and 2305 --> 12.5 - 12.1 = 0.4 sec
and then the difference between
2305 and 2309 --> 13.5 - 12.5 = 1 sec,
skipping the blank values,
to later on get the maximum, minimum and average difference.
First use dropna, then use Series.diff:
DataFrame used:
print(df)
time payload
0 12.1 2250.0
1 12.5 2305.0
2 12.9 NaN
3 13.1 NaN
4 13.5 2309.0
5 14.6 2350.0
6 14.9 2680.0
7 15.0 NaN
df.dropna().time.diff()
0 NaN
1 0.4
4 1.0
5 1.1
6 0.3
Name: time, dtype: float64
Note I assumed your (blank) values are NaN; otherwise use the following before running my code:
df.replace('(blank)', np.nan, inplace=True)
# Or if they are whitespaces
df.replace('', np.nan, inplace=True)
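To then get the maximum, minimum and average differences the question asks for, a short follow-up sketch could be:
diffs = df.dropna().time.diff()
diffs.max(), diffs.min(), diffs.mean()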

Join/merge not working in python

Trying to join df61 and df_petsy_gz on pic_code. I've included the data types for the variables also. My code is outputting a bunch of NaN indicating none of the pic_codes match between the two data sets. There are a couple million lines of data so I'm certain there are a bunch of matches. I think I'm doing something wrong.
df61.head(3)
mpe_wgt pic_code
10 420336479305589843900801597032
10 420907139300189843900792911982
10 420967449300189843900797682603
mpe_wgt object
pic_code object
df_petsy_gz.head(3)
monthly_fiscal_year month pic_code class_of_mail
2017 11 420606019300189843900566128707 FC
2017 11 420731629300189843900584700299 FC
2017 11 420405029300189843900568579224 FC
weight calc_postage calc_total_postage MikeZone
0.8750 4.02 4.02 5
0.3750 2.77 2.77 6
0.6875 3.60 3.60 8
monthly_fiscal_year int64
month int64
pic_code object
class_of_mail object
weight float64
calc_postage float64
calc_total_postage float64
MikeZone int64
df61_mpe=pd.merge(df_petsy_gz,df61,on='pic_code', how='outer')
output
monthly_fiscal_year month pic_code class_of_mail \
2017.0 11.0 420606019300189843900566128707 FC
2017.0 11.0 420731629300189843900584700299 FC
2017.0 11.0 420405029300189843900568579224 FC
2017.0 11.0 420301349300189843900567382542 FC
weight calc_postage calc_total_postage MikeZone mpe_wgt
0.8750 4.02 4.02 5.0 NaN
0.3750 2.77 2.77 6.0 NaN
0.6875 3.60 3.60 8.0 NaN
0.5000 2.77 2.77 4.0 NaN
I don't know what your data looks like, but what I do know is that the type of join affects which rows are formed after the join.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.join.html
how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default: ‘left’
How to handle the operation of the two objects.
left: use calling frame’s index (or column if on is specified)
right: use other frame’s index
outer: form union of calling frame’s index (or column if on is specified) with other frame’s index, and sort it lexicographically
inner: form intersection of calling frame’s index (or column if on is specified) with other frame’s index, preserving the order of the calling’s one
Try using 'inner' join and see if that is what you need.
This will only return rows where pic_code is found in both dataframes, each of which will therefore have an mpe_wgt.
Additionally, make sure that pic_code has no trailing/leading whitespace, so that equal pic_code values from the two dataframes actually match.
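A small sketch combining both suggestions (stripping whitespace first is just a precaution, not something your data is known to need):
df61['pic_code'] = df61['pic_code'].str.strip()
df_petsy_gz['pic_code'] = df_petsy_gz['pic_code'].str.strip()
df61_mpe = pd.merge(df_petsy_gz, df61, on='pic_code', how='inner')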
