Group and divide values in Python

What I want: for records that have the same "NROCUENTA", create a column containing the first "SALDO" of that group divided by the number of records in the group.
import pandas as pd
import csv, sys

try:
    file_encoding = 'utf8'
    input_fd = open('DAT_210.del', encoding=file_encoding)
    df = pd.read_csv(input_fd, sep=' ', quotechar='"', error_bad_lines=False)
    result = df.groupby('NROCUENTA').apply(
        lambda x: ................................
    )
except csv.Error as e:
    sys.exit('file {}, line {}: {}'.format("datahist.del", reader.line_num, e))

result2 = result.to_csv('result001.csv', mode='w', index=False)
Desired formula: SALDO = FIRST(SALDO) / COUNT(NROCUENTA)
DATA
"NROCUENTA" "SALDO"
"210-1-388" 159.20
"210-1-388" 159.20
"210-1-1219" 0.93
"210-1-11657" 0.06
"210-1-11657" 0.06
"210-1-11657" 0.06
RESULT
"210-1-388" 79.6
"210-1-388" 79.6
"210-1-1219" 0.93
"210-1-11657" 0.02
"210-1-11657" 0.02
"210-1-11657" 0.02
TRIED
I was trying with the dfply library, but it kept throwing errors, so I decided to do it with pandas instead.

IIUC, you need transform with 'count' and then divide the SALDO column by it. I assign the result to a new column AVG_SALDO:
df['AVG_SALDO'] = df['SALDO'] / df.groupby('NROCUENTA').SALDO.transform('count')
Out[1112]:
NROCUENTA SALDO AVG_SALDO
0 210-1-388 159.20 79.60
1 210-1-388 159.20 79.60
2 210-1-1219 0.93 0.93
3 210-1-11657 0.06 0.02
4 210-1-11657 0.06 0.02
5 210-1-11657 0.06 0.02
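An alternative sketch that follows the pseudo-formula literally (first SALDO of the group divided by the group's record count) uses two transforms; with the sample data, where duplicated rows share the same SALDO, it gives the same result as the answer above:

import pandas as pd

df = pd.DataFrame({
    'NROCUENTA': ['210-1-388', '210-1-388', '210-1-1219',
                  '210-1-11657', '210-1-11657', '210-1-11657'],
    'SALDO': [159.20, 159.20, 0.93, 0.06, 0.06, 0.06],
})

grp = df.groupby('NROCUENTA')['SALDO']
# first SALDO of each group divided by the number of records in the group
df['AVG_SALDO'] = grp.transform('first') / grp.transform('size')
print(df)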

Related

Peak detection for unevenly spaced time series: dataframe with one datetime column and NaN values

I'm working with a dataframe containing environmental values (Sentinel-2 satellite: NDVI) like:
Date ID_151894 ID_109386 ID_111656 ID_110006 ID_112281 ID_132408
0 2015-07-06 0.82 0.61 0.85 0.86 0.76 nan
1 2015-07-16 0.83 0.81 0.77 0.83 0.84 0.82
2 2015-08-02 0.88 0.89 0.89 0.89 0.86 0.84
3 2015-08-05 nan nan 0.85 nan 0.83 0.77
4 2015-08-12 0.82 0.77 nan 0.65 nan 0.42
5 2015-08-22 0.85 0.85 0.88 0.87 0.83 0.83
The columns correspond to different places, and the nan values are due to cloudy conditions (which happen often in Belgium). There are obviously a lot more values. To remove outliers, I use the method described in the TIMESAT manual (Jönsson & Eklundh, 2015): a value is treated as an outlier if
it deviates by more than a maximum deviation (here called cutoff) from the median,
it is lower than the mean value of its immediate neighbors minus the cutoff,
or it is larger than the highest value of its immediate neighbors plus the cutoff.
So I wrote the code below to do this:
import pandas as pd

NDVI = pd.read_excel("C:/Python_files/Cartofor/NDVI_frene_5ha.xlsx")
date = NDVI["Date"]
MED = NDVI.median(axis=0, skipna=True, numeric_only=True)
SD = NDVI.std(axis=0, skipna=True, numeric_only=True)
cutoff = 1.5 * SD
NDVIF = NDVI.copy()  # assumption: NDVIF is a working copy of NDVI (it was not defined in the original snippet)

for j in range(1, 21):       # columns
    for i in range(1, 480):  # rows
        if NDVIF.iloc[i, j] < ((NDVIF.iloc[i - 1, j] + NDVIF.iloc[i + 1, j]) / 2) - cutoff.iloc[j]:
            NDVIF.iloc[i, j] = float('NaN')
        elif NDVIF.iloc[i, j] > max(NDVIF.iloc[i - 1, j], NDVIF.iloc[i + 1, j]) + cutoff.iloc[j]:  # 2)
            NDVIF.iloc[i, j] = float('NaN')
        elif (NDVIF.iloc[i, j] >= abs(MED.iloc[j] - cutoff.iloc[j])) & (NDVIF.iloc[i, j] <= abs(MED.iloc[j] + cutoff.iloc[j])):  # 1)
            NDVIF.iloc[i, j] = NDVIF.iloc[i, j]  # keep the value
        else:
            NDVIF.iloc[i, j] = float('NaN')
The problem is that I need to omit the 'NaN' values for the calculations. The goal is to have a dataframe like the one above without the outliers.
Once this is done, I have to interpolate the values onto a new, chosen time index (e.g. one value per day, or one value every five days, from 2016 to 2020) and write each interpolated column to a txt file to load it into the TIMESAT software.
I hope my english is not too bad and thank you for your answers! :)
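No answer is recorded here, but a minimal sketch of the two missing pieces could look like the following: skipping NaNs when comparing against neighbors, then interpolating onto a regular date index. It assumes NDVI is the dataframe read above with a datetime Date column; the helper name flag_outliers, the combined reading of the three rules, and the 5-day target grid are illustrative choices, not part of the original code.

import numpy as np
import pandas as pd

def flag_outliers(col, cutoff):
    # Work only on the valid (non-NaN) observations, so the "immediate
    # neighbours" are the previous/next valid values rather than NaNs.
    valid = col.dropna()
    prev_ = valid.shift(1)
    next_ = valid.shift(-1)
    neigh_mean = (prev_ + next_) / 2
    neigh_max = pd.concat([prev_, next_], axis=1).max(axis=1)
    med = valid.median()
    low = valid < neigh_mean - cutoff    # below the neighbour mean minus the cutoff
    high = valid > neigh_max + cutoff    # above the highest neighbour plus the cutoff
    far = (valid - med).abs() > cutoff   # deviates more than the cutoff from the median
    return ((low | high) & far).reindex(col.index, fill_value=False)

value_cols = NDVI.columns.drop('Date')
cutoff = 1.5 * NDVI[value_cols].std()
cleaned = NDVI.set_index('Date')
for c in value_cols:
    cleaned.loc[flag_outliers(cleaned[c], cutoff[c]), c] = np.nan

# Interpolate onto a regular index (here: one value every five days).
target = pd.date_range('2016-01-01', '2020-12-31', freq='5D')
regular = (cleaned.reindex(cleaned.index.union(target))
                  .interpolate(method='time')
                  .reindex(target))
# Each interpolated column can then be written out, e.g.:
# regular['ID_151894'].to_csv('ID_151894.txt', sep='\t', header=True)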

How can I convert a 1d array to a 2d array?

I am running the code below.
import datetime
import pandas as pd
import numpy as np
import pylab as pl
import datetime
from sklearn.cluster import AffinityPropagation
from sklearn import metrics
from matplotlib.collections import LineCollection
from pandas_datareader import data as wb
from sklearn import cluster, covariance, manifold
###############################################################################
start = '2019-02-01'
end = '2020-02-01'
tickers = ['MMM',
'ABT',
'ABBV',
'ABMD',
'ACN',
'ATVI']
thelen = len(tickers)
price_data = []
for ticker in tickers:
    prices = wb.DataReader(ticker, start=start, end=end, data_source='yahoo')[['Open', 'Adj Close']]
    price_data.append(prices.assign(ticker=ticker)[['ticker', 'Open', 'Adj Close']])
df = pd.concat(price_data)
df.rename(columns = {'ticker':'Ticker', 'Adj Close':'Close'}, inplace = True)
df.dtypes
df.head()
df.shape
#df.reset_index()
pd.set_option('display.max_columns', 500)
open = np.array([df.Open]).astype(np.float)
close = np.array([df.Close]).astype(np.float)
# The daily variations of the quotes are what carry most information
variation = (close - open)
The code above gives me this 1d array:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
0 0.38 0.93 0.3 0.72 -0.42 0.37 0.36 0.71 0.89 -0.32 0.11 -0.06 -0.17 0.4 0.25 -0.48 0.1 -0.29 -0.29 -0.38 0.21 0.22 0.11 -0.01 -0.07 -0.66 0 -0.78 0.24 -0.89 0.07
My desired output would be a 2d array, like this.
0 1 2 3 4 5 6 7 8 9 10
0 0.38 0.93 0.3 0.72 -0.42 0.37 0.36 0.71 0.89 -0.32 0.11
1 0.61 0.18 0.63 0.02 -0.03 -0.27 -0.75 -1 0.48 -0.74 -0.34
2 1.77 0.95 1.69 2.05 -1.36 2.25 1.83 -0.8 1.35 -0.99 -1.35
3 0.7 -0.12 0.32 -0.14 -0.53 0.63 0.85 0.46 0.23 -0.83 0.59
4 1.71 -0.8 0.74 -0.58 -1.2 0.38 0.35 0.06 0.56 -0.38 0.64
5 0.47 0.25 0.93 -0.9 -0.15 0.64 -0.11 -0.09 0.44 -0.47 -0.09
How can I change my 1d array to a 2d array, with each stock's open-close differences running horizontally and the different stocks stacked vertically? Thanks!
I actually got this to work. Apparently you have to store items in a list rather than a dataframe.
import datetime
import pandas as pd
import numpy as np
import pylab as pl
import datetime
from sklearn.cluster import AffinityPropagation
from sklearn import metrics
from matplotlib.collections import LineCollection
from pandas_datareader import data as wb
from sklearn import cluster, covariance, manifold
start = '2019-02-01'
end = '2020-02-01'
tickers = ['AXP',
'AAPL',
'BA',
'CAT',
'CSCO',
'CVX',
'XOM',
'GS',
'HD',
'IBM',
'INTC',
'JNJ',
'KO',
'JPM',
'MCD',
'MMM',
'MRK',
'MSFT',
'NKE',
'PFE',
'PG',
'TRV',
'UNH',
'UTX',
'VZ',
'V',
'WBA',
'WMT',
'DIS']
thelen = len(tickers)
price_data = []
for ticker in tickers:
    prices = wb.DataReader(ticker, start=start, end=end, data_source='yahoo')[['Open', 'Adj Close']]
    price_data.append(prices.assign(ticker=ticker)[['ticker', 'Open', 'Adj Close']])
#names = np.reshape(price_data, (len(price_data), 1))
names = pd.concat(price_data)
names.reset_index()
#pd.set_option('display.max_columns', 500)
open = np.array([q['Open'] for q in price_data]).astype(np.float)
close = np.array([q['Adj Close'] for q in price_data]).astype(np.float)
#close_prices = np.array([q.close for q in quotes]).astype(np.float)
# The daily variations of the quotes are what carry most information
variation = (close - open)
# pd.DataFrame(variation).to_csv("C:\\path\\file.csv")
# Learn a graphical structure from the correlations
edge_model = covariance.GraphicalLassoCV()
X = variation
# standardize the time series: using correlations rather than covariance
# is more efficient for structure recovery
X = variation.copy().T
X /= X.std(axis=0)
edge_model.fit(X)
# Cluster using affinity propagation
_, labels = cluster.affinity_propagation(edge_model.covariance_)
n_labels = labels.max()
details = [(name,cluster) for name, cluster in zip(tickers,labels)]
for detail in details:
    print(detail)
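For the narrower question in the title, a reshape is often enough: the single-row shape in the original code comes from wrapping the column in an extra list (np.array([df.Open]) has shape (1, n)). A minimal sketch, assuming df is the concatenated frame from the question (rows grouped ticker by ticker, 'Adj Close' renamed to 'Close') and every ticker has the same number of trading days:

import numpy as np

variation_1d = (df['Close'] - df['Open']).to_numpy(dtype=float)
variation_2d = variation_1d.reshape(len(tickers), -1)  # one row per ticker, one column per day
print(variation_2d.shape)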

Only one index label in the dataset

I am working with the ecoli dataset from http://archive.ics.uci.edu/ml/datasets/Ecoli. The values are separated by tabs. I would like to index each column and give them a name. But when I do that using the following code:
import pandas as pd
ecoli_cols = ['N_ecoli', 'info1', 'info2', 'info3', 'info4', 'info5', 'info6', 'info7', 'type']
d = pd.read_table('ecoli.csv', sep=' ', header=None, names=ecoli_cols)
Instead of naming each column that I already have, it creates 6 new columns. I would like those names assigned to the existing columns. Later I would like to extract information from this dataset, so it is important to have the values properly separated into columns. Thanks
You can read the data directly from its URL and use the separator \s+ (one or more whitespace characters):
import pandas as pd

url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/ecoli/ecoli.data'
ecoli_cols = ['N_ecoli', 'info1', 'info2', 'info3', 'info4', 'info5', 'info6', 'info7', 'type']
df = pd.read_table(url, sep=r'\s+', header=None, names=ecoli_cols)
# alternative: use the delim_whitespace parameter
# df = pd.read_table(url, delim_whitespace=True, header=None, names=ecoli_cols)
print(df.head())
N_ecoli info1 info2 info3 info4 info5 info6 info7 type
0 AAT_ECOLI 0.49 0.29 0.48 0.5 0.56 0.24 0.35 cp
1 ACEA_ECOLI 0.07 0.40 0.48 0.5 0.54 0.35 0.44 cp
2 ACEK_ECOLI 0.56 0.40 0.48 0.5 0.49 0.37 0.46 cp
3 ACKA_ECOLI 0.59 0.49 0.48 0.5 0.52 0.45 0.36 cp
4 ADI_ECOLI 0.23 0.32 0.48 0.5 0.55 0.25 0.35 cp
But if you want to use your file with tab as the separator:
d = pd.read_table('ecoli.csv', sep='\t', header=None, names=ecoli_cols)
And if the separator is ;:
d = pd.read_table('ecoli.csv', sep=';', header=None, names=ecoli_cols)
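Once the columns are named, extracting information later is plain pandas selection. For example, with the df loaded from the URL above (the chosen columns and class label are only illustrations):

# distribution of the localization classes in the last column
print(df['type'].value_counts())

# numeric features for one class only
cp_rows = df.loc[df['type'] == 'cp', ['info1', 'info2', 'info3']]
print(cp_rows.head())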

Create new DF with values representing difference between two dataframes

I am working with two numeric data.frames, both with 13803 obs. and 13803 variables. Their col- and rownames are identical; however, their entries are different. What I want to do is create a new data.frame where I subtract df2's values from df1's values.
The "formula" would be: df1(entry values) - df2(entry values) = df3 (the difference). In other words, the purpose is to find the difference between all corresponding entries.
My problem is illustrated here.
DF1
[GENE128] [GENE271] [GENE2983]
[GENE231] 0.71 0.98 0.32
[GENE128] 0.23 0.61 0.90
[GENE271] 0.87 0.95 0.63
DF2
[GENE128] [GENE271] [GENE2983]
[GENE231] 0.70 0.94 0.30
[GENE128] 0.25 0.51 0.80
[GENE271] 0.82 0.92 0.60
NEW DF3
[GENE128] [GENE271] [GENE2983]
[GENE231] 0.01 0.04 0.02
[GENE128] -.02 0.10 0.10
[GENE271] 0.05 0.03 0.03
So, in DF3 the values are the difference between DF1 and DF2 for each entry.
DF1(GENE231) - DF2(GENE231) = DF3(DIFFERENCE-GENE231)
DF1(GENE271) - DF2(GENE271) = DF3(DIFFERENCE-GENE271)
and so on...
Help would be much appreciated!
Kind regards,
Harebell
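No answer is recorded here. The question is phrased in R data.frame terms, but in pandas (the library used elsewhere on this page) the same element-wise difference is a single subtraction, since identically labelled rows and columns are aligned automatically. A minimal sketch with the sample values above:

import pandas as pd

genes = ['GENE231', 'GENE128', 'GENE271']
cols = ['GENE128', 'GENE271', 'GENE2983']
df1 = pd.DataFrame([[0.71, 0.98, 0.32],
                    [0.23, 0.61, 0.90],
                    [0.87, 0.95, 0.63]], index=genes, columns=cols)
df2 = pd.DataFrame([[0.70, 0.94, 0.30],
                    [0.25, 0.51, 0.80],
                    [0.82, 0.92, 0.60]], index=genes, columns=cols)

df3 = df1 - df2  # or df1.sub(df2); aligns on index and columns
print(df3.round(2))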

python pandas How to remove outliers from a dataframe and replace with an average value of preceding records

I have a dataframe with 16k records and multiple groups of countries and other fields. I have produced an initial output of the data that looks like the snippet below. Now I need to do some data cleansing and manipulation: remove skews or outliers and replace them with a value based on certain rules.
I.e., in the data below, how could I identify the skewed points (any value greater than 1) and replace them with the average of the next two records, or with the previous record if there are no later records (within that group)?
So in the dataframe below I would like to replace Bill%4 for IT week1, which is 1.21, with the average of week2 and week3 for IT, so it becomes 0.81.
Any tricks for this?
Country Week Bill%1 Bill%2 Bill%3 Bill%4 Bill%5 Bill%6
IT week1 0.94 0.88 0.85 1.21 0.77 0.75
IT week2 0.93 0.88 1.25 0.80 0.77 0.72
IT week3 0.94 1.33 0.85 0.82 0.76 0.76
IT week4 1.39 0.89 0.86 0.80 0.80 0.76
FR week1 0.92 0.86 0.82 1.18 0.75 0.73
FR week2 0.91 0.86 1.22 0.78 0.75 0.71
FR week3 0.92 1.29 0.83 0.80 0.75 0.75
FR week4 1.35 0.87 0.84 0.78 0.78 0.74
I don't know of any built-ins to do this, but you should be able to customize this to meet your needs, no?
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 5), columns=list('ABCDE'))
df.index = list('abcdeflght')
# Define cutoff value
cutoff = 0.90

for col in df.columns:
    # Identify index locations above cutoff
    outliers = df[col][df[col] > cutoff]
    # Browse through outliers and average according to index location
    for idx in outliers.index:
        # Get integer position of the outlier
        loc = df.index.get_loc(idx)
        # If not one of last two values in dataframe
        if loc < df.shape[0] - 2:
            df.loc[idx, col] = np.mean(df[col].iloc[loc + 1:loc + 3])
        else:
            df.loc[idx, col] = np.mean(df[col].iloc[loc - 3:loc - 1])
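The answer above works on a generic frame; to restrict the replacement to each country group as the question asks, one possible sketch (assuming df holds the Country/Week/Bill% table from the question; the rule used is the mean of the next two values within the group, falling back to the previous value near the end of the group):

import pandas as pd

def fix_skews(s, cutoff=1.0):
    # Mean of the next two values in the group; fall back to the previous
    # value when fewer than two later records exist.
    next_mean = (s.shift(-1) + s.shift(-2)) / 2
    replacement = next_mean.fillna(s.shift(1))
    return s.where(s <= cutoff, replacement)

bill_cols = ['Bill%1', 'Bill%2', 'Bill%3', 'Bill%4', 'Bill%5', 'Bill%6']
df[bill_cols] = df.groupby('Country')[bill_cols].transform(fix_skews)

On the sample IT rows this replaces the 1.21 in Bill%4 for week1 with (0.80 + 0.82) / 2 = 0.81, as requested.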
