Cumsum with pandas on financial data - python

I'm very new to Python, and I have managed to import data from an Excel sheet using the pd.read_excel function. The data is arranged in a DataFrame as shown by the .info() output below.
I'm trying to do a cumsum() over this DataFrame; however, I get this error message:
TypeError: unsupported operand type(s) for +: 'Timestamp' and 'Timestamp'
How can I apply cumsum() to only the returns columns without removing my Dates column?
I loaded the data with the following call:
oFX = pd.read_excel('C:\\Work\\Python Dev\\Athenes\\FX.xlsx', 0)
The .info() output is the following:
oFX.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4133 entries, 0 to 4132
Data columns (total 11 columns):
Dates 4133 non-null datetime64[ns]
AUD 4133 non-null float64
CAD 4133 non-null float64
CHF 4133 non-null float64
EUR 4133 non-null float64
GBP 4133 non-null float64
JPY 4133 non-null float64
KRW 4133 non-null float64
MEP 4133 non-null float64
NZD 4133 non-null float64
USD 4133 non-null float64
dtypes: datetime64[ns](1), float64(10)
memory usage: 387.5 KB
Thanks in advance.

Seeing as your 'Dates' are just daily entries, you can temporarily set the index to that column, call cumsum, and then reset_index:
oFX.set_index('Dates').cumsum().reset_index()
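An alternative sketch, assuming oFX as loaded in the question: select only the numeric columns and apply cumsum to those, leaving the Dates column untouched:
num_cols = oFX.select_dtypes(include='number').columns  # the ten float64 return columns
oFX[num_cols] = oFX[num_cols].cumsum()                  # Dates stays as-is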

Related

How to remove the (%) sign from a float column

When I try to read the data from an Excel file, I find that
the column called "TOT_SALES" has data type float64, yet all of its values display with a % sign.
I want to remove this sign and divide all the values by 100. At the same time, the values in the column in the Excel file itself are plain numbers, as I show below.
Any help on how to remove the (%) from the values?
df= pd.read_excel("QVI_transaction_data_1.xlsx")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264836 entries, 0 to 264835
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 DATE 264836 non-null datetime64[ns]
1 STORE_NBR 264836 non-null int64
2 LYLTY_CARD_NBR 264836 non-null int64
3 TXN_ID 264836 non-null int64
4 PROD_NBR 264836 non-null int64
5 PROD_NAME 264836 non-null object
6 PROD_QTY 264836 non-null int64
7 TOT_SALES 264836 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(5), object(1)
memory usage: 16.2+ MB
df.head()
This is the result that appears in the DataFrame:
TOT_SALES
1180.00%
740.00%
420.00%
1080.00%
660.00%
These are the values in the Excel file, without the (%) sign:
TOT_SALES
11.80
7.40
4.20
10.80
6.60
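Since TOT_SALES is float64, the % sign cannot live in the stored values; it is most likely a pandas display format set earlier in the session (e.g. '{:.2%}'.format renders the stored float 11.80 as 1180.00%). A minimal sketch under that assumption:
import pandas as pd

# Reset any percent-style float format back to the default display;
# the stored values themselves never carried a % character.
pd.reset_option('display.float_format')

df = pd.read_excel("QVI_transaction_data_1.xlsx")
print(df["TOT_SALES"].head())  # should now print 11.80, 7.40, 4.20, ...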

Copy output from describe and info commands to different dataframes - python

I am reading a CSV file as a DataFrame in Python, and then I use the two commands below to get more information about it.
Is there a way to copy the output of these two commands into separate DataFrames?
data.describe(include='all')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 carat 53940 non-null float64
1 cut 53940 non-null object
2 color 53940 non-null object
3 clarity 53940 non-null object
4 depth 53940 non-null float64
5 table 53940 non-null float64
6 price 53940 non-null int64
7 x 53940 non-null float64
8 y 53940 non-null float64
9 z 53940 non-null float64
dtypes: float64(6), int64(1), object(3)
memory usage: 4.1+ MB
Regarding df.describe(): since its return value is already a DataFrame, you can either assign it to a new variable directly or save it to CSV, as below:
des = df.describe()  # already a DataFrame; no pd.DataFrame() wrapper needed
or
df.describe().to_csv('describe.csv')  # without a path argument, to_csv() returns a string instead of writing a file
Regarding df.info(), its return value is of type NoneType (the report is printed to stdout), so it cannot be saved directly. You can check some alternative solutions here:
Is there a way to export pandas dataframe info -- df.info() into an excel file?
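One common workaround, sketched here with a hypothetical file name: .info() accepts a writable buffer via its buf parameter, so the report can be captured and wrapped in a DataFrame:
import io
import pandas as pd

df = pd.read_csv("diamonds.csv")  # hypothetical file name

buf = io.StringIO()
df.info(buf=buf)  # write the report into the buffer instead of printing it
info_df = pd.DataFrame({"info": buf.getvalue().splitlines()})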

Did dropna() on dataframe, why is the number of rows lower than expected?

I have a DataFrame where most columns have 10866 non-null values, except for a couple of columns with fewer. The column with the fewest non-null values is "keywords" (9373). So when I drop the NA values from the DataFrame, I expect the number of non-null values in each column to equal the number in the column with the fewest non-null values; in this case, "keywords".
However, when I apply df.dropna(inplace=True), the number of non-null values in each column is reduced to 8665, a count that previously appeared nowhere in the DataFrame, not even in "keywords", the column with the fewest non-null values.
How is this possible? And how does the number 8665 come about?
Here is what the original Dataframe looks like:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 19 columns):
id 10866 non-null int64
imdb_id 10856 non-null object
popularity 10866 non-null float64
budget 10866 non-null int64
revenue 10866 non-null int64
original_title 10866 non-null object
cast 10790 non-null object
director 10822 non-null object
keywords 9373 non-null object
overview 10862 non-null object
runtime 10866 non-null int64
genres 10843 non-null object
production_companies 9836 non-null object
release_date 10866 non-null object
vote_count 10866 non-null int64
vote_average 10866 non-null float64
release_year 10866 non-null int64
budget_adj 10866 non-null float64
revenue_adj 10866 non-null float64
dtypes: float64(4), int64(6), object(9)
memory usage: 1.6+ MB
And here is what the DataFrame looks like after I have dropped the NAs:
df.dropna(inplace = True)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8665 entries, 0 to 10865
Data columns (total 19 columns):
id 8665 non-null int64
imdb_id 8665 non-null object
popularity 8665 non-null float64
budget 8665 non-null int64
revenue 8665 non-null int64
original_title 8665 non-null object
cast 8665 non-null object
director 8665 non-null object
keywords 8665 non-null object
overview 8665 non-null object
runtime 8665 non-null int64
genres 8665 non-null object
production_companies 8665 non-null object
release_date 8665 non-null object
vote_count 8665 non-null int64
vote_average 8665 non-null float64
release_year 8665 non-null int64
budget_adj 8665 non-null float64
revenue_adj 8665 non-null float64
dtypes: float64(4), int64(6), object(9)
memory usage: 1.3+ MB
Consider the following code:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {"name": ['A', 'B', 'C'],
     1: [1, 2, np.nan],
     2: [1, np.nan, 3],
     3: [np.nan, 2, 3]})
print(df)
df.dropna(inplace=True)
print(df)
What do you think the dataframe will look like after df.dropna? By default pandas will drop a row in which any column has a null value. So even though each column only has one null value, all three rows are dropped. You can change this behavior with the how, thresh and subset arguments to the dropna function.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
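A brief sketch of those arguments applied to the question's frame (assuming df is the 19-column frame above):
# Require non-null values only in "keywords"; this keeps the
# 9373 rows the questioner expected.
df_keywords = df.dropna(subset=["keywords"])

# Or keep any row with at least 17 non-null values out of the 19 columns.
df_thresh = df.dropna(thresh=17)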

Display all information with data.info() in Pandas

I would like to display all the information for my DataFrame, which contains more than 100 columns, with .info() from pandas, but it won't:
data_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85529 entries, 0 to 85528
Columns: 110 entries, ID to TARGET
dtypes: float64(40), int64(19), object(51)
memory usage: 71.8+ MB
I would like it to display like this:
data_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
datetime 10886 non-null object
season 10886 non-null int64
holiday 10886 non-null int64
workingday 10886 non-null int64
weather 10886 non-null int64
temp 10886 non-null float64
atemp 10886 non-null float64
humidity 10886 non-null int64
windspeed 10886 non-null float64
casual 10886 non-null int64
registered 10886 non-null int64
count 10886 non-null int64
dtypes: float64(3), int64(8), object(1)
memory usage: 1020.6+ KB
But the problem seems to be the high number of columns in my DataFrame. I would like .info() to list all the columns, including their non-null counts.
You can pass the optional arguments verbose=True and show_counts=True (null_counts=True, deprecated since pandas 1.2.0) to the .info() method to output information for all of the columns:
pandas >=1.2.0: data_train.info(verbose=True, show_counts=True)
pandas <1.2.0: data_train.info(verbose=True, null_counts=True)
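A related knob, sketched here as an alternative: .info() falls back to the summary view once a frame exceeds the display.max_info_columns option (100 columns by default), so raising that threshold also restores the per-column listing:
import pandas as pd

# Raise the threshold above this frame's 110 columns.
pd.set_option('display.max_info_columns', 200)
data_train.info()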

Python - Input contains NaN, infinity or a value too large for dtype('float64')

I am new to Python. I am trying to use sklearn.cluster.
Here is my code:
from sklearn.cluster import MiniBatchKMeans
kmeans=MiniBatchKMeans(n_clusters=2)
kmeans.fit(df)
But I get the following error:
50 and not np.isfinite(X).all()):
51 raise ValueError("Input contains NaN, infinity"
---> 52 " or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64')
I checked that there are no NaN or infinity values, so only one option is left. However, my data info tells me that all the variables are float64, so I don't understand where the problem comes from.
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 362358 entries, 135 to 4747145
Data columns (total 8 columns):
User 362358 non-null float64
Hour 362352 non-null float64
Minute 362352 non-null float64
Day 362352 non-null float64
Month 362352 non-null float64
Year 362352 non-null float64
Latitude 362352 non-null float64
Longitude 362352 non-null float64
dtypes: float64(8)
memory usage: 24.9 MB
Thanks a lot.
By looking at your df.info(), it appears that there are 6 more non-null User values than there are values in any other column. This indicates that you have 6 nulls in each of the other columns, and that is the reason for the error.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 362358 entries, 135 to 4747145
Data columns (total 8 columns):
User 362358 non-null float64
Hour 362352 non-null float64
Minute 362352 non-null float64
Day 362352 non-null float64
Month 362352 non-null float64
Year 362352 non-null float64
Latitude 362352 non-null float64
Longitude 362352 non-null float64
dtypes: float64(8)
memory usage: 24.9 MB
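A minimal sketch of the resulting fix, assuming df is the frame from the question: drop those rows before fitting.
from sklearn.cluster import MiniBatchKMeans

df_clean = df.dropna()  # removes the 6 rows carrying the missing values
kmeans = MiniBatchKMeans(n_clusters=2)
kmeans.fit(df_clean)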
I think that fit() accepts only "array-like, shape = [n_samples, n_features]", not pandas DataFrames. So try to pass the values of the DataFrame into it:
kmeans=MiniBatchKMeans(n_clusters=2)
kmeans.fit(df.values)
Or reshape them so the function runs correctly. Hope that helps.
As noted above, the 6 extra non-null User values point to 6 rows with missing values in the other columns, and those rows are what trigger the error.
So one option is to slice your data down to a clean range with iloc():
df = pd.read_csv(location1, encoding = "ISO-8859-1").iloc[2:20]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 2 to 19
Data columns (total 6 columns):
zip_code 18 non-null int64
latitude 18 non-null float64
longitude 18 non-null float64
city 18 non-null object
state 18 non-null object
county 18 non-null object
dtypes: float64(2), int64(1), object(3)
