I have the following Dataframe:
Original Dataframe
I want the following output:
Output Dataframe
I have tried using groupby on the "Container" column (summing the other columns) but it only gives the first row as output.
I am very new to Python and pandas, and not sure if I am doing it correctly.
Some of the answers to similar questions are too advanced for me to understand.
I am just wondering if I can get the output with just 2-3 lines of code.
To get exactly the result you showed as the "Output Dataframe", the "NaN" values in the "Container" column of your original DataFrame must first be replaced with the value immediately above them. I added more "NaN" values to illustrate:
Original DataFrame:
df
Container SB No Pkgs CBM Weight
257 CXRU1219452 195375 1650 65 23000
259 BEAU4883430 140801 26 3 575
260 NaN 140868 60 8 1153
261 NaN 140824 11 1 197
262 NaN 140851 253 32 4793
263 NaN 140645 14 1 278
264 NaN 140723 5 0 71
265 NaN 140741 1 0 22
266 NaN 140768 5 0 93
268 SZLU9366565 189355 1800 65 23000
259 ZBCD1234567 100000 100 10 1000
260 NaN 100000 100 10 1000
261 NaN 100000 100 10 1000
262 NaN 100000 100 10 1000
Use "fillna" function with method "ffill" as suggested by [https://stackoverflow.com/a/27905350/6057650][1]
Then you will get "Container" column without "NaN" values:
df=df.fillna(method='ffill')
df
Container SB No Pkgs CBM Weight
257 CXRU1219452 195375 1650 65 23000
259 BEAU4883430 140801 26 3 575
260 BEAU4883430 140868 60 8 1153
261 BEAU4883430 140824 11 1 197
262 BEAU4883430 140851 253 32 4793
263 BEAU4883430 140645 14 1 278
264 BEAU4883430 140723 5 0 71
265 BEAU4883430 140741 1 0 22
266 BEAU4883430 140768 5 0 93
268 SZLU9366565 189355 1800 65 23000
259 ZBCD1234567 100000 100 10 1000
260 ZBCD1234567 100000 100 10 1000
261 ZBCD1234567 100000 100 10 1000
262 ZBCD1234567 100000 100 10 1000
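A side note: newer pandas versions deprecate the method= argument of fillna, so if you are on a recent version the equivalent forward-fill is:
df = df.ffill()
# or fill only the Container column:
df['Container'] = df['Container'].ffill()
Either way, the groupby step below is unchanged.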
Now you can get the expected "Output DataFrame" using groupby:
df.groupby(['Container']).sum()
SB No Pkgs CBM Weight
Container
BEAU4883430 1126221 375 45 7182
CXRU1219452 195375 1650 65 23000
SZLU9366565 189355 1800 65 23000
ZBCD1234567 400000 400 40 4000
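If you would rather keep Container as a regular column in the result instead of the index, as_index=False does that (same data, same sums):
df.groupby('Container', as_index=False).sum()
df.groupby(['Container']).sum().reset_index() is equivalent.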
I believe you could groupby and sum like below. The dropna call drops the NaN/empty values from your DataFrame.
df.dropna().groupby(['Container']).sum()
import pandas as pd

d = [['CXRU', 195, 1650, 65, 23000],
     ['BEAU', 140, 26, 3, 575],
     ['NaN', 140, 60, 8, 1153]]
df = pd.DataFrame(d, columns=['Container', 'SB No', 'Pkgs', 'CBM', 'Weight'])
df

# here 'NaN' is a literal string, so those rows can be filtered out by comparison
sel = df['Container'] != 'NaN'
df[sel]
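From here the totals per container can be taken with the same groupby as in the other answers; note that, unlike forward-filling, this simply discards the rows whose Container is the literal string 'NaN':
df[sel].groupby('Container').sum()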
import pandas as pd

df = pd.DataFrame({'id': ['aaa', 'aaa', 'bbb', 'ccc', 'bbb', 'NaN', 'NaN', 'aaa', 'NaN'],
                   'values': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
df

# replace each literal 'NaN' string with the value from the row above
for i in range(len(df)):
    if df.iloc[i, 0] == "NaN":
        df.iloc[i, 0] = df.iloc[i - 1, 0]

df.groupby('id').sum()
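The same fill can also be done without the explicit loop, by turning the literal 'NaN' strings into real missing values and forward-filling; a small sketch on the same toy frame (numpy is only needed for np.nan):
import numpy as np

# replace the string 'NaN' with a real missing value, then copy values downwards
df['id'] = df['id'].replace('NaN', np.nan).ffill()
df.groupby('id').sum()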
I am new to Python and I am trying to understand how to aggregate and manipulate data.
I have a dataframe:
df3
Out[122]:
SBK SSC CountRecs
0 99 22 9
1 99 12 10
2 99 121 11
3 99 138 12
4 99 123 8
... ... ...
160247 184 1318 1
160248 394 2659 1
160249 412 757 1
160250 357 1312 1
160251 202 106 1
For the entire data frame, I want to know what percentage each row's CountRecs is of the total CountRecs for its SBK.
For example, the CountRecs for SBK 99 sum to 50, so for the first row it is 9/50 * 100. But I want this to be done automatically for all rows. How can I go about this?
You need to:
1. group by the column you want,
2. merge on the grouped column (2.1: you can change the name of the new column here),
3. add the percentage column.
a = df3.merge(pd.DataFrame(df3.groupby('SBK')['CountRecs'].sum()), on='SBK')
df3['percent'] = (a['CountRecs_x']/a['CountRecs_y']) *100
df3
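If you want clearer names than CountRecs_x/CountRecs_y (step 2.1 above), one option is to name the per-SBK total before merging; a sketch, assuming df3 has the default integer index so the assignment lines up row for row:
totals = df3.groupby('SBK')['CountRecs'].sum().rename('CountRecs_total').reset_index()
a = df3.merge(totals, on='SBK')
df3['percent'] = a['CountRecs'] / a['CountRecs_total'] * 100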
Use GroupBy.transform, which returns a Series the same size as the original DataFrame filled with the per-group sums, so you can divide the original column by it:
df3['percent'] = df3['CountRecs'] / df3.groupby('SBK')['CountRecs'].transform('sum') * 100
print (df3)
SBK SSC CountRecs percent
0 99 22 9 18.0
1 99 12 10 20.0
2 99 121 11 22.0
3 99 138 12 24.0
4 99 123 8 16.0
160247 184 1318 1 100.0
160248 394 2659 1 100.0
160249 412 757 1 100.0
160250 357 1312 1 100.0
160251 202 106 1 100.0
I started this question yesterday and have done more work on it. Thanks @AMC, @ALollz.
I have a dataframe of surgical activity data that has 58 columns and 200,000 records. Each row corresponds to a patient encounter, and one of the columns, 'TRETSPEF', is the treatment specialty. I want to see the relative contribution of the medical specialties, so I have used pd.read_csv with usecols=['TRETSPEF'] to import just that series.
df
TRETSPEF
0 150
1 150
2 150
3 150
4 150
... ...
218462 150
218463 &
218464 150
218465 150
218466 218
The most common treatment specialty is neurosurgery (code 150). So here's the problem: when I apply .value_counts() I get two groups for the 150 code (and for the 218 code).
df['TRETSPEF'].value_counts()
150 140411
150 40839
218 13692
108 10552
218 4143
...
501 1
120 1
302 1
219 1
106 1
Name: TRETSPEF, Length: 69, dtype: int64
There are some '&' values in there (454 of them), so I wondered whether the fact that they aren't integers was messing things up. I replaced them with empty values and ran value_counts again.
df['TRETSPEF'].str.replace("&", "").value_counts()
150 140411
218 13692
108 10552
800 858
110 835
811 692
191 580
323 555
454
100 271
400 116
420 47
301 45
812 38
214 24
215 23
180 22
300 17
370 15
421 11
258 11
314 5
422 4
260 4
192 4
242 4
171 4
350 2
307 2
302 2
328 2
160 1
219 1
120 1
107 1
101 1
143 1
501 1
144 1
320 1
104 1
106 1
430 1
264 1
Name: TRETSPEF, dtype: int64
So now I seem to have lost the second group of 150 (about 40,000 records) by changing '&' to empty values, although the empty values still show up in .value_counts. The length of the series has gone down to 45 from 69.
I tried stripping whitespace, with no difference. I am not sure what tests to run to see why this is happening; I feel it must somehow be due to the data.
This is 100% a data cleansing issue. Try to force the column to be numeric.
pd.to_numeric(df['TRETSPEF'], errors='coerce').value_counts()
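If you want to confirm where the duplicate 150 group comes from before coercing, a quick check (a sketch, assuming df is already loaded) is to look at the Python types stored in the column; a mix of ints and strings, or strings with stray whitespace, shows up as separate value_counts groups:
# which Python types are stored in the object column?
df['TRETSPEF'].map(type).value_counts()

# coerce everything to numbers; entries such as '&' become NaN
df['TRETSPEF'] = pd.to_numeric(df['TRETSPEF'], errors='coerce')
df['TRETSPEF'].value_counts()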
I've managed to display the columns from the csv with pandas on Python 3. However, the columns are being split across 3 lines. Is it possible to fit all the columns onto a single line? This was done in a Jupyter notebook.
import pandas as pd
import numpy as np
raw = pd.read_csv("D:/Python/vitamin.csv")
print(raw.head())
Result
RowID Gender BMI Energy_Actual VitaminA_Actual VitaminC_Actual \
0 1 F 18.0 1330 206 15
1 2 F 25.0 1792 469 59
2 3 F 21.6 1211 317 18
3 4 F 23.9 1072 654 24
4 5 F 24.3 1534 946 118
Calcium_Actual Iron_Actual Energy_DRI VitaminA_DRI VitaminC_DRI \
0 827 22 1604 700 65
1 900 12 2011 700 65
2 707 7 2242 700 75
3 560 11 1912 700 75
4 851 12 1895 700 65
Calcium_DRI Iron_DRI
0 1300 15
1 1300 15
2 1000 8
3 1000 18
4 1300 15
You should add the line below at the beginning of your code; please refer to pandas.set_option:
pd.set_option('display.expand_frame_repr', False)
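Putting it together with the snippet from the question, it might look like this:
import pandas as pd

# stop pandas from wrapping wide frames across several lines
pd.set_option('display.expand_frame_repr', False)

raw = pd.read_csv("D:/Python/vitamin.csv")
print(raw.head())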
Say I have a DataFrame where the data is ordered with respect to time. I have a column of weights and I want to find the maximum weight relative to the current index. For example, the max value for the 10th row would be taken from elements 11 to the end.
I ended up writing this function, but performance is a big concern.
import pandas as pd

df = pd.DataFrame({"time": [100, 200, 300, 400, 500, 600, 700, 800],
                   "weights": [120, 160, 190, 110, 34, 55, 66, 33]})
totalRows = df['time'].count()

def findMaximumValRelativeToCurrentRow(row):
    index = row.name
    if index != totalRows:
        tempDf = df[index:totalRows]
        val = tempDf['weights'].max()
        # note: set_value is removed in recent pandas; df.at[index, 'max'] = val is the equivalent
        df.set_value(index, 'max', val)
    else:
        df.set_value(index, 'max', row['weights'])

df.apply(findMaximumValRelativeToCurrentRow, axis=1)
print(df)
Is there any better way to do the operation than this?
You can use cummax on the column in reverse order (selected with iloc):
print (df['weights'].iloc[::-1])
7 33
6 66
5 55
4 34
3 110
2 190
1 160
0 120
Name: weights, dtype: int64
df['max1'] = df['weights'].iloc[::-1].cummax()
print (df)
time weights max max1
0 100 120 190.0 190
1 200 160 190.0 190
2 300 190 190.0 190
3 400 110 110.0 110
4 500 34 66.0 66
5 600 55 66.0 66
6 700 66 66.0 66
7 800 33 33.0 33
I have a dataframe containing strings, as read from a sloppy csv:
id Total B C ...
0 56 974 20 739 34 482
1 29 479 10 253 16 704
2 86 961 29 837 43 593
3 52 687 22 921 28 299
4 23 794 7 646 15 600
What I want to do: convert every cell in the frame into a number. It should ignore the whitespace, but put NaN where the cell contains something really strange.
I probably know how to do it with terribly unperformant manual looping and value replacement, but I was wondering if there's a nice and clean way to do this.
You can use read_csv with the regex separator \s{2,} (two or more whitespace characters) and the thousands parameter:
import pandas as pd
from io import StringIO

temp = u"""id  Total  B  C
0  56 974  20 739  34 482
1  29 479  10 253  16 704
2  86 961  29 837  43 593
3  52 687  22 921  28 299
4  23 794  7 646  15 600 """
#after testing replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), sep=r"\s{2,}", engine='python', thousands=' ')
print (df)
id Total B C
0 0 56974 20739 34482
1 1 29479 10253 16704
2 2 86961 29837 43593
3 3 52687 22921 28299
4 4 23794 7646 15600
print (df.dtypes)
id int64
Total int64
B int64
C int64
dtype: object
And then, if necessary, apply the to_numeric function with errors='coerce', which replaces non-numeric values with NaN:
df = df.apply(pd.to_numeric, errors='coerce')
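To see what errors='coerce' does with a genuinely strange cell, here is a small made-up illustration (the 'oops' value is hypothetical):
import pandas as pd

s = pd.Series(['56 974', 'oops', '29 479'])
# drop the thousands spaces, then coerce; 'oops' becomes NaN
print(pd.to_numeric(s.str.replace(' ', '', regex=False), errors='coerce'))
# 0    56974.0
# 1        NaN
# 2    29479.0
# dtype: float64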