Modify and round numbers in a pandas DataFrame in Python

Long story short, I have a csv file which I read as a pandas dataframe. The file contains a weather report, but all of the measurements for temperature are in Fahrenheit. I've figured out how to convert them:
import pandas as pd
df = pd.read_csv('report.csv')
df['average temperature'] = (df['average temperature'] - 32) * 5/9
But then the data in this column has up to six decimal places.
I've found code that rounds all the data in the dataframe, but I need it for only this column:
df.round(2)
I don't like that it has to be a separate piece of code on a separate line, and that it modifies all of my data. Is there a more elegant way to go about this? And is there a way to apply it to other columns in my dataframe, such as maximum temperature and minimum temperature, without having to copy the code above?

To round only some columns, use a subset:
cols = ['maximum temperature','minimum temperature','average temperature']
df[cols] = df[cols].round(2)
If you want to convert only some columns from a list:
cols = ['maximum temperature','minimum temperature','average temperature']
df[cols] = ((df[cols] - 32) * 5/9).round(2)
If you want to round each column separately:
df['average temperature'] = df['average temperature'].round(2)
df['maximum temperature'] = df['maximum temperature'].round(2)
df['minimum temperature'] = df['minimum temperature'].round(2)
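A variant worth knowing (not shown in the answer above): DataFrame.round also accepts a dict mapping column names to decimal places, so the three separate lines above can collapse into one call without touching the other columns. A minimal sketch with made-up data:

```python
import pandas as pd

# a small made-up frame standing in for the weather report
df = pd.DataFrame({'average temperature': [68.123456, 75.987654],
                   'maximum temperature': [80.123456, 90.123456]})

# round several columns at once via a dict of column -> decimals;
# columns not named in the dict are left unchanged
df = df.round({'average temperature': 2, 'maximum temperature': 2})
```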
Sample:
import pandas as pd
import numpy as np

df = (pd.DataFrame(np.random.randint(30, 100, (10, 3)),
                   columns=['maximum temperature','minimum temperature','average temperature'])
        .assign(a='m', b=range(10)))
print (df)
maximum temperature minimum temperature average temperature a b
0 97 60 98 m 0
1 64 86 64 m 1
2 32 64 95 m 2
3 60 56 93 m 3
4 43 89 64 m 4
5 40 62 86 m 5
6 37 40 70 m 6
7 61 33 46 m 7
8 36 44 46 m 8
9 63 30 33 m 9
cols = ['maximum temperature','minimum temperature','average temperature']
df[cols] = ((df[cols] - 32) * 5/9).round(2)
print (df)
maximum temperature minimum temperature average temperature a b
0 36.11 15.56 36.67 m 0
1 17.78 30.00 17.78 m 1
2 0.00 17.78 35.00 m 2
3 15.56 13.33 33.89 m 3
4 6.11 31.67 17.78 m 4
5 4.44 16.67 30.00 m 5
6 2.78 4.44 21.11 m 6
7 16.11 0.56 7.78 m 7
8 2.22 6.67 7.78 m 8
9 17.22 -1.11 0.56 m 9

Here's a single-line solution with apply and a conversion function.
def convert_to_celsius(f):
    return 5.0 / 9.0 * (f - 32)

df[['Column A','Column B']] = df[['Column A','Column B']].apply(convert_to_celsius).round(2)

Related

Calculate Multiple Column Growth in Python Dataframe

The data I used looks like this:
data
Subject 2000_X1 2000_X2 2001_X1 2001_X2 2002_X1 2002_X2
1 100 50 120 45 110 50
2 95 40 100 45 105 50
3 110 45 100 45 110 40
I want to calculate each variable's growth for each year, so the result will look like this:
Subject 2001_X1_gro 2001_X2_gro 2002_X1_gro 2002_X2_gro
1 0.2 -0.1 -0.08333 0.11111
2 0.052632 0.125 0.05 0.11111
3 -0.09091 0 0.1 -0.11111
I already do it manually for each variable and each year with code like this:
data['2001_X1_gro'] = (data['2001_X1'] - data['2000_X1']) / data['2000_X1']
data['2002_X1_gro'] = (data['2002_X1'] - data['2001_X1']) / data['2001_X1']
data['2001_X2_gro'] = (data['2001_X2'] - data['2000_X2']) / data['2000_X2']
data['2002_X2_gro'] = (data['2002_X2'] - data['2001_X2']) / data['2001_X2']
Is there a way to do it more efficiently, especially if I have more years and/or more variables?
import pandas as pd
df = pd.read_csv('data.txt', sep=',', header=0)
Input
Subject 2000_X1 2000_X2 2001_X1 2001_X2 2002_X1 2002_X2
0 1 100 50 120 45 110 50
1 2 95 40 100 45 105 50
2 3 110 45 100 45 110 40
Next, a loop is created and the columns are filled:
qqq = '_gro'
for i in range(1, len(df.columns) - 2):
    year = str(int(df.columns[i][:4]) + 1) + df.columns[i][4:]
    new_name = year + qqq
    df[new_name] = (df[year] - df[df.columns[i]]) / df[df.columns[i]]
print(df)
Output
Subject 2000_X1 2000_X2 2001_X1 2001_X2 2002_X1 2002_X2 2001_X1_gro \
0 1 100 50 120 45 110 50 0.200000
1 2 95 40 100 45 105 50 0.052632
2 3 110 45 100 45 110 40 -0.090909
2001_X2_gro 2002_X1_gro 2002_X2_gro
0 -0.100 -0.083333 0.111111
1 0.125 0.050000 0.111111
2 0.000 0.100000 -0.111111
In the loop, the year is extracted from the column name, converted to int, and incremented by 1. The value is converted back to a string and the '_Xn' suffix is re-attached. A new_name variable is created by appending the string '_gro'. A new column is then created and filled with the calculated values.
If you want growth over, say, three years, add 3 instead of 1. This assumes your data is ordered by year. Also note that the loop does not go through all the columns: for i in range(1, len(df.columns) - 2) skips the Subject column and stops short of the last two columns, so you need to know where to stop it.
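As a sketch of an alternative (my suggestion, not from the answer above): reshaping the frame to long format and using groupby with pct_change scales to any number of years and variables, assuming the column names keep the YYYY_Xn pattern:

```python
import pandas as pd

# made-up data matching the question's table
data = pd.DataFrame({
    'Subject': [1, 2, 3],
    '2000_X1': [100, 95, 110], '2000_X2': [50, 40, 45],
    '2001_X1': [120, 100, 100], '2001_X2': [45, 45, 45],
    '2002_X1': [110, 105, 110], '2002_X2': [50, 50, 40],
})

# wide -> long: one row per (Subject, year, variable)
long = data.melt('Subject', var_name='year_var', value_name='val')
long[['year', 'var']] = long['year_var'].str.split('_', expand=True)
long['year'] = long['year'].astype(int)

# growth = percent change along the years within each (Subject, variable)
long = long.sort_values(['Subject', 'var', 'year'])
long['gro'] = long.groupby(['Subject', 'var'])['val'].pct_change()

# back to wide with the question's naming convention
wide = (long.dropna(subset=['gro'])
            .assign(col=lambda d: d['year'].astype(str) + '_' + d['var'] + '_gro')
            .pivot(index='Subject', columns='col', values='gro')
            .reset_index())
```

Adding more years or more variables needs no code changes, only more columns in the input.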

How to calculate cumulative sum and average on file data in python

I have a below data in file
NAME,AGE,MARKS
A1,12,40
B1,13,54
C1,15,67
D1,11,41
E1,16,59
F1,10,60
If the data were in a database table, I would have used the SUM and AVG functions to get the cumulative sum and average.
But how to get it with Python is a bit challenging, as I am a learner.
Expected output :
NAME,AGE,MARKS,CUM_SUM,AVG
A1,12,40,40,40
B1,13,54,94,47
C1,15,67,161,53.66
D1,11,41,202,50.5
E1,16,59,261,43.5
F1,10,60,321,45.85
IIUC use:
import pandas as pd

df = pd.read_csv('file.csv')
df['CUM_SUM'] = df['MARKS'].cumsum()
df['AVG'] = df['MARKS'].expanding().mean()
print (df)
NAME AGE MARKS CUM_SUM AVG
0 A1 12 40 40 40.000000
1 B1 13 54 94 47.000000
2 C1 15 67 161 53.666667
3 D1 11 41 202 50.500000
4 E1 16 59 261 52.200000
5 F1 10 60 321 53.500000
Last, write the result back with:
df.to_csv('file.csv', index=False)
Or:
out = df.to_string(index=False)
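For intuition: the expanding mean is just the cumulative sum divided by the running row count, so the AVG column can also be computed without expanding(). A small sketch with the first rows of the sample data:

```python
import pandas as pd

df = pd.DataFrame({'NAME': ['A1', 'B1', 'C1'],
                   'MARKS': [40, 54, 67]})
df['CUM_SUM'] = df['MARKS'].cumsum()
# running count 1, 2, 3, ... over the default RangeIndex
df['AVG'] = df['CUM_SUM'] / (df.index + 1)
```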

Results in columns without decimal places?

I have looked through a lot of posts, but I cannot implement any of the solutions in my code:
x4 = x4.set_index('grupa').T.rename_axis('DANE').reset_index().rename_axis(None,1).round()
After which I get the results DataFrame:
DANE BAKALIE NASIONA OWOCE WARZYWA
0 ilosc 5.0 94.0 61.0 623.0
1 marza_netto 7.0 120.0 69.0 668.0
2 marza_procent2 32.0 34.0 29.0 27.0
But I would like to receive:
DANE BAKALIE NASIONA OWOCE WARZYWA
0 ilosc 5 94 61 623
1 marza_netto 7 120 69 668
2 marza_procent2 32 34 29 27
I tried replace('.0',''), int(round()), astype(int), but I don't get good results, or I get attribute incompatibilities with the DataFrame.
If the only non-numeric column is DANE, cast before converting the index back to a column:
x4 = (x4.set_index('grupa')
        .T
        .rename_axis('DANE')
        .astype(int)
        .reset_index()
        .rename_axis(None, 1))
A more general solution is to select all float columns and cast them:
cols = df.select_dtypes(include=['float']).columns
df[cols] = df[cols].astype(int)
print (df)
DANE BAKALIE NASIONA OWOCE WARZYWA
0 ilosc 5 94 61 623
1 marza_netto 7 120 69 668
2 marza_procent2 32 34 29 27
If there are NaN values, converting to int is not possible.
So you can either:
1. Drop all rows with NaNs:
df = df.dropna()
2. Replace NaNs with some integer like 0:
df = df.fillna(0)
Not 100% sure I got your question, but you can use an astype(int) conversion.
df = df.set_index('DANE').astype(int).reset_index()
df
DANE BAKALIE NASIONA OWOCE WARZYWA
0 ilosc 5 94 61 623
1 marza_netto 7 120 69 668
2 marza_procent2 32 34 29 27
If you're dealing with rows that have NaNs, either drop those rows and convert, or convert to astype(object). The latter is not recommended because you lose performance.
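In newer pandas versions (0.24+) there is a third option that the answers above predate: the nullable integer dtype 'Int64' (capital I), which keeps whole numbers displayed without decimal places while representing missing values as <NA>:

```python
import pandas as pd
import numpy as np

s = pd.Series([5.0, np.nan, 623.0])
# a plain astype(int) would raise here because of the NaN;
# the nullable 'Int64' dtype keeps the missing value as <NA>
s2 = s.astype('Int64')
```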

Plot histogram using two columns (values, counts) in python dataframe

I have a dataframe having multiple columns in pairs: if one column is values then the adjacent column is the corresponding counts. I want to plot a histogram using values as x variable and counts as the frequency.
For example, I have the following columns:
Age Counts
60 1204
45 700
21 400
. .
. .
34 56
10 150
I want my code to bin the Age values in ten-year intervals between the maximum and minimum values, get the cumulative frequencies for each interval from the Counts column, and then plot a histogram. Is there a way to do this using matplotlib?
I have tried the following but in vain:
patient_dets.plot(x='PatientAge', y='PatientAgecounts', kind='hist')
(patient_dets is the dataframe with 'PatientAge' and 'PatientAgecounts' as columns)
I think you need Series.plot.bar:
patient_dets.set_index('PatientAge')['PatientAgecounts'].plot.bar()
If you need bins, one possible solution is pd.cut:
#helper df with min and max ages
df1 = pd.DataFrame({'G':['14 yo and younger','15-19','20-24','25-29','30-34',
'35-39','40-44','45-49','50-54','55-59','60-64','65+'],
'Min':[0, 15,20,25,30,35,40,45,50,55,60,65],
'Max':[14,19,24,29,34,39,44,49,54,59,64,120]})
print (df1)
G Max Min
0 14 yo and younger 14 0
1 15-19 19 15
2 20-24 24 20
3 25-29 29 25
4 30-34 34 30
5 35-39 39 35
6 40-44 44 40
7 45-49 49 45
8 50-54 54 50
9 55-59 59 55
10 60-64 64 60
11 65+ 120 65
import numpy as np

cutoff = np.hstack([np.array(df1.Min[0]), df1.Max.values])
labels = df1.G.values
patient_dets['Groups'] = pd.cut(patient_dets.PatientAge, bins=cutoff, labels=labels, right=True, include_lowest=True)
print (patient_dets)
PatientAge PatientAgecounts Groups
0 60 1204 60-64
1 45 700 45-49
2 21 400 20-24
3 34 56 30-34
4 10 150 14 yo and younger
patient_dets.groupby(['PatientAge','Groups'])['PatientAgecounts'].sum().plot.bar()
You can use pd.cut() to bin your data, and then plot with plot(kind='bar'):
import numpy as np
nBins = 10
my_bins = np.linspace(patient_dets.Age.min(), patient_dets.Age.max(), nBins)
patient_dets.groupby(pd.cut(patient_dets.Age, bins=my_bins)).sum()['Counts'].plot(kind='bar')
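A different sketch for the values/counts layout (my suggestion, not from the answers above): NumPy's histogram accepts per-value weights, so the Counts column can be summed into ten-year bins directly, and the result plotted as a bar chart:

```python
import numpy as np
import pandas as pd

# made-up data in the question's shape
patient_dets = pd.DataFrame({'PatientAge': [60, 45, 21, 34, 10],
                             'PatientAgecounts': [1204, 700, 400, 56, 150]})

# ten-year edges covering the observed range: 10, 20, ..., 70
edges = np.arange(10, 71, 10)
freq, _ = np.histogram(patient_dets['PatientAge'], bins=edges,
                       weights=patient_dets['PatientAgecounts'])
# freq holds the summed counts per interval; with matplotlib available:
# plt.bar(edges[:-1], freq, width=10, align='edge')
```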

How to read unstructured csv in pandas

I have got a messy csv file (only the extension is csv). When I open this file in MS Excel with ; as the delimiter, it looks like the (dummy) sample below.
I investigated this file and found the following:
Some columns have names and others do not.
The row length is variable, but each row ends with a newline character that triggers the start of the next line.
Question:
How can I read this table in pandas so that all existing columns (headers) remain, and the blank columns are filled with consecutive numbers, taking care of the variable row lengths?
In fact, I want to take groups of 8 cell values from the header-less columns, again and again until each row is exhausted, for analysis.
N.B. I have tried usecols, names, skiprows, sep etc. in read_csv, but with no success.
EDIT
Added sample input and expected output (formatting is worse but pandas.read_clipboard() should work)
INPUT
car_id car_type entry_gate entry_time(ms) exit_gate exit_time(ms) traveled_dist(m) avg_speed(m/s) trajectory(x[m] y[m] speed[m/s] a_tangential[ms-2] a_lateral[ms-2] timestamp[ms] )
24 Bus 25 4300 26 48520 118.47 2.678999 509552.78 5039855.59 10.074 0.429 0.2012 0 509552.97 5039855.57 10.0821 0.3853 0.2183 20
25 Car 25 20 26 45900 113.91 2.482746 509583.7 5039848.78 4.5344 -0.1649 0.2398 0 509583.77 5039848.71
26 Car - - - - 109.68 8.859805 509572.75 5039862.75 4.0734 -0.7164 -0.1066 0 509572.67 5039862.76 4.0593 -0.7021 -0.1141 20 509553.17 5039855.55 10.0886 0.2636 0.2356 40
27 Car - - - - 119.84 3.075936 509582.73 5039862.78 1.191 0.5247 0.0005 0 509582.71 5039862.78 1.2015 0.5322
28 Car - - - - 129.64 4.347466 509591.07 5039862.9 1.6473 0.1987 -0.0033 0 509591.04 5039862.89 1.6513 0.2015 -0.0036 20
Expected OUTPUT(dataframe)
car_id car_type entry_gate entry_time(ms) exit_gate exit_time(ms) traveled_dist(m) avg_speed(m/s) trajectory(x[m] y[m] speed[m/s] a_tangential[ms-2] a_lateral[ms-2] timestamp[ms] 1 2 3 4 5 6 7 8 9 10 11 12
24 Bus 25 4300 26 48520 118.47 2.678999 509552.78 5039855.59 10.074 0.429 0.2012 0 509552.97 5039855.57 10.0821 0.3853 0.2183 20
25 Car 25 20 26 45900 113.91 2.482746 509583.7 5039848.78 4.5344 -0.1649 0.2398 0 509583.77 5039848.71
26 Car - - - - 109.68 8.859805 509572.75 5039862.75 4.0734 -0.7164 -0.1066 0 509572.67 5039862.76 4.0593 -0.7021 -0.1141 20 509553.17 5039855.55 10.0886 0.2636 0.2356 40
27 Car - - - - 119.84 3.075936 509582.73 5039862.78 1.191 0.5247 0.0005 0 509582.71 5039862.78 1.2015 0.5322
28 Car - - - - 129.64 4.347466 509591.07 5039862.9 1.6473 0.1987 -0.0033 0 509591.04 5039862.89 1.6513 0.2015 -0.0036 20
Preprocessing
The function get_names() opens the file and finds the maximum length of the split rows.
Then it reads the first row and adds the names missing up to that maximum length.
The last value of the first row is ), so it is removed with firstline[:-1], and the missing columns are added as a range: rng = range(1, m - lenfirstline + 2).
The + 2 is because the range starts from 1.
Then you can use read_csv, skip the first line, and pass the output of get_names() as names.
import pandas as pd
import csv

#preprocessing
def get_names():
    with open('test/file.txt', 'r') as csvfile:
        reader = csv.reader(csvfile)
        num = []
        for i, row in enumerate(reader):
            if i == 0:
                firstline = ''.join(row).split()
                lenfirstline = len(firstline)
            num.append(len(''.join(row).split()))
        m = max(num)
        rng = range(1, m - lenfirstline + 2)
        #remove the trailing ) and append the numeric names
        return firstline[:-1] + list(rng)

#names is the list returned from the function
df = pd.read_csv('test/file.txt', sep="\s+", names=get_names(), index_col=[0], skiprows=1)
#temporarily display 10 rows and 30 columns
with pd.option_context('display.max_rows', 10, 'display.max_columns', 30):
    print(df)
car_type entry_gate entry_time(ms) exit_gate exit_time(ms) \
car_id
24 Bus 25 4300 26 48520
25 Car 25 20 26 45900
26 Car - - - -
27 Car - - - -
28 Car - - - -
traveled_dist(m) avg_speed(m/s) trajectory(x[m] y[m] \
car_id
24 118.47 2.678999 509552.78 5039855.59
25 113.91 2.482746 509583.70 5039848.78
26 109.68 8.859805 509572.75 5039862.75
27 119.84 3.075936 509582.73 5039862.78
28 129.64 4.347466 509591.07 5039862.90
speed[m/s] a_tangential[ms-2] a_lateral[ms-2] timestamp[ms] \
car_id
24 10.0740 0.4290 0.2012 0
25 4.5344 -0.1649 0.2398 0
26 4.0734 -0.7164 -0.1066 0
27 1.1910 0.5247 0.0005 0
28 1.6473 0.1987 -0.0033 0
1 2 3 4 5 6 7 \
car_id
24 509552.97 5039855.57 10.0821 0.3853 0.2183 20 NaN
25 509583.77 5039848.71 NaN NaN NaN NaN NaN
26 509572.67 5039862.76 4.0593 -0.7021 -0.1141 20 509553.17
27 509582.71 5039862.78 1.2015 0.5322 NaN NaN NaN
28 509591.04 5039862.89 1.6513 0.2015 -0.0036 20 NaN
8 9 10 11 12
car_id
24 NaN NaN NaN NaN NaN
25 NaN NaN NaN NaN NaN
26 5039855.55 10.0886 0.2636 0.2356 40
27 NaN NaN NaN NaN NaN
28 NaN NaN NaN NaN NaN
Postprocessing
First you have to estimate the maximum number of columns N. I know their real number is 26, so I estimate N = 30. read_csv with names=range(N) returns all-NaN columns for the difference between the estimated and the real number of columns.
After dropping them, you can select the first row with the column names where values are not NaN (the last column ) is removed by [:-1]): df1.loc[0].dropna()[:-1].
Then you can append a new Series with a range from 1 to the number of NaN values in the first row.
Finally, the first row is removed by taking a subset of df.
#set more than the estimated number of columns
N = 30
df1 = pd.read_csv('test/file.txt', sep="\s+", names=range(N))
df1 = df1.dropna(axis=1, how='all') #drop columns with all NaN
first = df1.loc[0].dropna()[:-1]
df1.columns = first.append(pd.Series(range(1, len(df1.columns) - len(first) + 1)))
#remove the first line with incomplete column names
df1 = df1.iloc[1:]
print(df1.head())
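The core trick above, passing more names than any row needs and dropping the all-NaN leftovers, can be seen on a tiny made-up string without a file:

```python
import io
import pandas as pd

# three ragged whitespace-separated rows
raw = "a b c\n1 2 3 4 5\n6 7\n"
# hand read_csv more column names than the widest row uses,
# then drop the columns that ended up entirely NaN
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", names=range(8))
df = df.dropna(axis=1, how='all')
```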
