Sum all values in each column of a csv using Pandas - python

The program I have written has mostly done what I wanted it to do: add the totals of each column. My dataframe comes from a csv file. My code is below:
import pandas as pd
import matplotlib.pyplot

class ColumnCalculation:
    """This houses the functions for all the column manipulation calculations"""

    def max_electricity(self):
        df.set_index('Date', inplace=True)
        df.loc['Total'] = df.sum()
        print(df)

df = pd.read_csv("2011-onwards-city-elec-consumption.csv")
ColumnCalculation.max_electricity(df)
Also my dataset (I didn't know how to format it properly).
The code nicely adds up the totals into a 'Total' row at the bottom of each column, except when it comes to the last column (2017).
I am not sure what the program is doing; I've tried different indexing options like .iloc or .ix, but it doesn't seem to make a difference. I have also tried adding each column individually (below):
def max_electricity(self):
    df.set_index('Date', inplace=True)
    df.loc['Total', '2011'] = df['2011'].sum()
    df.loc['Total', '2012'] = df['2012'].sum()
    df.loc['Total', '2013'] = df['2013'].sum()
    df.loc['Total', '2014'] = df['2014'].sum()
    df.loc['Total', '2015'] = df['2015'].sum()
    df.loc['Total', '2016'] = df['2016'].sum()
    df.loc['Total', '2017'] = df['2017'].sum()
    print(df)
But I receive an error, and I assume this approach is too unwieldy anyway. I've been trying to figure this out for a good hour and a bit.

Your last column isn't being parsed as floats, but as strings, because of the thousands separators.
To fix this, cast to numeric before summing:
import locale
locale.setlocale(locale.LC_NUMERIC, '')
df['2017'] = df['2017'].map(locale.atoi)
Better still, read the data in as numeric in the first place. For example:
df = pd.read_csv('file.csv', sep='\t', thousands=',')
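For instance, a minimal sketch (with made-up numbers) of why a thousands separator breaks the sum, and how thousands=',' fixes it:

import io
import pandas as pd

# Made-up sample data: the 2017 values contain thousands separators.
csv_data = 'Date,2016,2017\nJan,100,"1,234"\nFeb,200,"2,345"\n'

raw = pd.read_csv(io.StringIO(csv_data))
print(raw['2017'].dtype)    # object -- the strings '1,234' and '2,345'

fixed = pd.read_csv(io.StringIO(csv_data), thousands=',')
print(fixed['2017'].dtype)  # int64 -- parsed as numbers
print(fixed[['2016', '2017']].sum())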


CSV file handling with Pandas for text comparison

I have this csv file called input.csv
KEY;Rate;BYld;DataAsOfDate
CH04;0.719;0.674;2020-01-29
CH03;1.5;0.148;2020-01-29
then I execute the following code:
import pandas as pd
input_df = pd.read_csv('input.csv', sep=";")
input_df.to_csv('output.csv', sep=";")
and get the following output.csv file
KEY;Rate;BYld;DataAsOfDate
CH04;0.7190000000000001;0.674;2020-01-29
CH03;1.5;0.14800000000000002;2020-01-29
I was hoping for and expecting an output like this:
(to be able to use a tool like winmerge.org to detect real differences on each row)
(my real code truly modifies the dataframe - this stack overflow example is for demonstration only)
KEY;Rate;BYld;DataAsOfDate
CH04;0.719;0.674;2020-01-29
CH03;1.5;0.148;2020-01-29
What is the idiomatic way to achieve such an unmodified output with Pandas?
Python does not use traditional rounding; it uses banker's rounding, so you cannot rely on the textual representation coming back out unchanged. However, if being close is not a problem, you could use the round function, replacing the "2" with whichever number of digits you would like to round to:
import pandas as pd

d = [['CH04', 0.719, 0.674, '2020-01-29']]
df = pd.DataFrame(d, columns=['KEY', 'Rate', 'BYld', 'DataAsOfDate'])
df['Rate'] = df['Rate'].apply(lambda x: round(x, 2))
df
Using @Proko's idea I changed the code like this:
import pandas as pd
input_df = pd.read_csv('input.csv', dtype='str',sep=";")
input_df.to_csv('str_output.csv', sep=";", index=False)
and that meets the requirement - all columns come out unchanged.
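If you still need real numeric dtypes while processing, an alternative sketch (not from the original thread) is to keep the floats and control the written text with to_csv's float_format parameter:

import pandas as pd

input_df = pd.read_csv('input.csv', sep=';')
# '%g' trims the float-representation noise: 0.7190000000000001 -> 0.719
input_df.to_csv('fmt_output.csv', sep=';', index=False, float_format='%g')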

Action on one pandas dataframe does the same to the one it was copied from

I was using this bit of code (re-worked for my application) when I found that df_temp.drop(index=sample.index, inplace=True) performed the same action on df_input, i.e. it emptied it! I was not expecting that at all.
I solved it by changing df_temp = df_input to df_temp = df_input.copy(), but can someone illuminate me on what is going on here?
import seaborn as sns
import pandas as pd

df_input = sns.load_dataset('diamonds')
df = df_input.loc[[]]   # empty frame with the same columns
df_temp = df_input      # this is where we're sampling from
n_samples = 1000
for _ in range(n_samples):
    sample = df_temp.sample(1)
    df_temp.drop(index=sample.index, inplace=True)
    df = df.append(sample)
assert (df.index.value_counts() > 1).sum() == 0
df
Pandas does not copy the whole DataFrame if you simply assign it to a new variable. After executing df_temp = df_input you end up with two variables referring to the exact same DataFrame. It's not the case that each refers to an identical copy; they actually point to the same object (think: you just gave one DataFrame two variable names).

So no matter which variable (name) you use to alter the DataFrame, you are also changing it for the other variable. If you use .copy() you get what you intended: two variables holding two distinct DataFrames.
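A minimal sketch (with made-up data) that makes the aliasing visible:

import pandas as pd

df_input = pd.DataFrame({'a': [1, 2, 3]})

alias = df_input               # a second name for the same object
independent = df_input.copy()  # a real, separate copy

alias.drop(index=[0], inplace=True)

print(len(df_input))            # 2 -- the drop through `alias` also hit df_input
print(len(independent))         # 3 -- the copy is unaffected
print(alias is df_input)        # True: same object
print(independent is df_input)  # False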

When trying to drop a column from my dataset using pandas, I get the error "['churn'] not found in axis"

I want x to be all the columns except the "churn" column.
But when I do the below I get the "['churn'] not found in axis" error, even though I can see the column name when I run print(list(df.columns)).
Here is my code:
import pandas as pd
import numpy as np
df = pd.read_csv("/Users/utkusenel/Documents/Data Analyzing/data.csv", header=0)
print(df.head())
print(df.columns)
print(len(df.columns))
x = df.drop(["churn"], axis=1) ## this is the part it gives the error
I am adding a snippet of my dataset as well:
account_length;area_code;international_plan;voice_mail_plan;number_vmail_messages;total_day_minutes;total_day_calls;total_day_charge;total_eve_minutes;total_eve_calls;total_eve_charge;total_night_minutes;total_night_calls;total_night_charge;total_intl_minutes;total_intl_calls;total_intl_charge;number_customer_service_calls;churn;
1;KS;128;area_code_415;no;yes;25;265.1;110;45.07;197.4;99;16.78;244.7;91;11.01;10;3;2.7;1;no
2;OH;107;area_code_415;no;yes;26;161.6;123;27.47;195.5;103;16.62;254.4;103;11.45;13.7;3;3.7;1;no
3;NJ;137;area_code_415;no;no;0;243.4;114;41.38;121.2;110;10.3;162.6;104;7.32;12.2;5;3.29;0;no
I see that your df snippet is separated with ';' (semicolon). If that is what your actual data looks like, then your csv is probably being read wrong. Please try adding sep=';' to the read_csv call:
df = pd.read_csv("/Users/utkusenel/Documents/Data Analyzing/data.csv", header=0, sep=';')
I also suggest printing df.columns again and checking whether there is leading or trailing whitespace in the column name for churn.
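Putting both suggestions together, a sketch (reusing the file path from the question) might look like this:

import pandas as pd

df = pd.read_csv("/Users/utkusenel/Documents/Data Analyzing/data.csv", header=0, sep=';')
df.columns = df.columns.str.strip()  # guard against names like 'churn ' or ' churn'
print(df.columns.tolist())           # verify 'churn' really appears
x = df.drop(columns=['churn'])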

Error when using .loc to select rows with a list of dates in pandas

I have the following code:
import pandas as pd
from pandas_datareader import data as web
df = web.DataReader('^GSPC', 'yahoo')
df['pct'] = df['Close'].pct_change()
dates_list = df.index[df['pct'].gt(0.002)]
df2 = web.DataReader('^GDAXI', 'yahoo')
df2['pct2'] = df2['Close'].pct_change()
I was trying to run this:
df2.loc[dates_list, 'pct2']
But I keep getting this error:
KeyError: 'Passing list-likes to .loc or [] with any missing labels is no longer supported,
I am guessing this is because data is missing for some of the dates in dates_list. To resolve this:
idx1 = df.index
idx2 = df2.index
missing = idx2.difference(idx1)
df.drop(missing, inplace = True)
df2.drop(missing, inplace = True)
However I am still getting the same error. I don't understand why that is.
Note that dates_list has been created from df, so it includes some dates present in the index of df. Then you read df2 and attempt to retrieve pct2 from rows on just these dates.

But there is a chance that the index of df2 does not contain all the dates given in dates_list, and just this is the cause of your exception.

To avoid it, retrieve only rows on dates that are present in the index. To narrow the row specification down to such "allowed" dates, pass:

dates_list[dates_list.isin(df2.index)]

Run this alone and you will see the "allowed" dates (some dates will be eliminated).
So change the offending instruction to:
df2.loc[dates_list[dates_list.isin(df2.index)], 'pct2']
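An equivalent sketch using Index.intersection, plus a variant that keeps the missing dates as NaN rows instead of dropping them:

# Same effect via the index:
common_dates = df2.index.intersection(dates_list)
print(df2.loc[common_dates, 'pct2'])

# Or keep every date from dates_list, with NaN where df2 has no row:
print(df2['pct2'].reindex(dates_list))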

dataframe values converted to NaN after applying df.iloc()

I ran into a problem after running pd.DataFrame(): the whole dataframe became NaN (empty), and I could not reverse this. I had also assigned column names to the dataframe, but their values disappeared as well:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('PuntaCapi.csv', header=None, sep='\n')
df = df[0].str.split(',', expand=True)
df.to_csv("PuntaCapi.tab", sep="\t", header=None, index=False)
print(df)

Akim = df.iloc[:, 0:1]
A = pd.DataFrame(data=Akim, columns=['Akim'])
veriler2 = pd.DataFrame(data=df, columns=['Akim', 'Kuvvet', 'Zaman', 'Soguma', 'Yaklasma', 'Baski', 'SacKalinliği', 'PuntaCapi'])
print(veriler2)
Please view the following results from the above code:
[screenshot: Spyder view of the DataFrame]
There is no NaN value in the csv file, but after .iloc[] the entire dataframe became NaN. I have tried to solve the problem but could not. I need help.
I do not fully understand your question, but note one thing: you read the data using pd.read_csv('PuntaCapi.csv', header=None, sep='\n') and save it as df, but then you modify df with df[0].str.split(',', expand=True). The split frame's column labels are the integers 0-7, so a later pd.DataFrame(data=df, columns=['Akim', ...]) looks up names that do not exist in df, and every value comes out as NaN.
Try this code.
df = pd.read_csv('PuntaCapi.csv', header=None, sep='\n')
veriler2 = pd.DataFrame(data = df.values, columns=['Akim','Kuvvet','Zaman','Soguma','Yaklasma','Baski','SacKalinliği','PuntaCapi'])
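The reason df.values works where df did not: passing .values hands pandas a bare NumPy array with no column labels, so the columns argument names the columns instead of selecting them. A small sketch with made-up data:

import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]])                      # column labels are 0 and 1
bad = pd.DataFrame(data=df, columns=['a', 'b'])          # labels not found -> all NaN
good = pd.DataFrame(data=df.values, columns=['a', 'b'])  # array has no labels -> renamed

print(bad)
print(good)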
