I have a dataframe with dates (30/09/2022 to 30/11/2022) and 15 stock prices (only 5 shown for reference) for each of these dates (excluding weekends).
Current Data:
DATES | A | B | C | D | E |
30/09/22 |100.5|151.3|233.4|237.2|38.42|
01/10/22 |101.5|148.0|237.6|232.2|38.54|
02/10/22 |102.2|147.6|238.3|231.4|39.32|
03/10/22 |103.4|145.7|239.2|232.2|39.54|
I wanted to get the Pearson correlation matrix, so I did this:
import numpy as np
import pandas as pd

df = pd.read_excel(file_path, sheet_name)
df = df.dropna()  # remove dates that do not have prices for all stocks
log_df = df.set_index("DATES").pipe(lambda d: np.log(d.div(d.shift()))).reset_index()  # daily log returns
corrM = log_df.corr()
Now I want to build the Pearson Uncentered Correlation Matrix, so I have the following function:
def uncentered_correlation(x, y):
    x_dim = len(x)
    xy = 0.0
    xx = 0.0
    yy = 0.0
    for i in range(x_dim):
        xy = xy + x[i] * y[i]
        xx = xx + x[i] ** 2.0
        yy = yy + y[i] ** 2.0
    corr = xy / np.sqrt(xx * yy)
    return corr
However, I do not know how to apply this function to each possible pair of columns of the dataframe to get the correlation matrix.
Try this? Not the most elegant, but it may work for you. :)
from itertools import product

def iter_product(a, b):
    return list(product(a, b))

df = 'your dataframe here'
re_dict = {}
iter_re = iter_product(df.columns, df.columns)
for i in iter_re:
    result = uncentered_correlation(df[i[0]], df[i[1]])
    re_dict[i] = result
re_df = pd.DataFrame(re_dict, index=[0]).stack()
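For reference, pandas can also build the whole matrix for you: since 0.24, DataFrame.corr accepts a callable that takes two 1-D arrays and returns a scalar. A minimal sketch, assuming the log_df and uncentered_correlation defined in the question:

returns = log_df.set_index("DATES").dropna()  # drop the NaN row created by the shift
uncentered_corrM = returns.corr(method=uncentered_correlation)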
First compute a list of the possible column combinations; you can use the itertools library for that. Then use pandas.DataFrame.apply() over multiple columns. Here is a simple code example:
import pandas as pd
import itertools

data = {'col1': [1, 3], 'col2': [2, 4], 'col3': [5, 6]}
df = pd.DataFrame(data)

def add(num1, num2):
    return num1 + num2

cols = list(df)
combList = list(itertools.combinations(cols, 2))

for tup in combList:
    firstCol = tup[0]
    secCol = tup[1]
    df[f'sum_{firstCol}_{secCol}'] = df.apply(lambda x: add(x[firstCol], x[secCol]), axis=1)
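Adapting the same combinations pattern back to the uncentered correlation question (a sketch; it assumes the uncentered_correlation function and the log_df built in the question, with DATES as a regular column):

import itertools
import numpy as np
import pandas as pd

cols = [c for c in log_df.columns if c != "DATES"]
# diagonal of an uncentered correlation matrix is always 1
uncorrM = pd.DataFrame(np.eye(len(cols)), index=cols, columns=cols)
for a, b in itertools.combinations(cols, 2):
    pair = log_df[[a, b]].dropna()  # drop the NaN row created by the shift
    val = uncentered_correlation(pair[a].to_numpy(), pair[b].to_numpy())
    uncorrM.loc[a, b] = uncorrM.loc[b, a] = val  # the matrix is symmetric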
I have a data frame as follows
I/P
date,low,high,close
d1,l1,h1,c1
d2,l2,h2,c2
d3,l3,h3,c3
d4,l4,h4,c4
d5,l5,h5,c5
d6,l6,h6,c6
d7,l7,h7,c7
O/P
d1,l1,h1,c1,d2,l2,h2,c2,d3,l3,h3,c3
d2,l2,h2,c2,d3,l3,h3,c3,d4,l4,h4,c4
d3,l3,h3,c3,d4,l4,h4,c4,d5,l5,h5,c5
d4,l4,h4,c4,d5,l5,h5,c5,d6,l6,h6,c6
....
Basically, join all rows, split them into subarrays of size 3 starting at each index, and create the output data frame.
The following code works, but it's too verbose and slow. Does pandas have something built in for this?
def flatten(df):
    candles = []
    i = 0
    while i < len(df):
        candles.append(df.iloc[i])
        i = i + 1
    return candles

def slide_and_expand(candles, k):
    return [candles[i:i+k] for i in range(len(candles) - k + 1)]

def candle_to_dict(col_name_prefix, candle_series):
    candle_dict = {}
    for index, val in candle_series.iteritems():
        col_name = col_name_prefix + index
        candle_dict[col_name] = val
    return candle_dict

def candle_group_to_feature_vector(candle_group):
    feature_vector_dict = {}
    i = 0
    for candle in candle_group:
        col_name_prefix = f"c{i}_"
        candle_dict = candle_to_dict(col_name_prefix, candle)
        feature_vector_dict.update(candle_dict)
        i = i + 1
    return feature_vector_dict

def candle_groups_to_feature_vectors(candle_groups):
    feature_vectors = []
    for candle_group in candle_groups:
        feature_vector = candle_group_to_feature_vector(candle_group)
        feature_vectors.append(feature_vector)
    return feature_vectors

fv_len = 3
candles = flatten(data)
candle_groups = slide_and_expand(candles, fv_len)
feature_vectors = candle_groups_to_feature_vectors(candle_groups)
data_fv = pd.DataFrame.from_dict(feature_vectors, orient='columns')
data_fv
You could do something like this:
n = len(df.index)  # number of rows in original dataframe 'df'
# reset the index of each slice so concat lines them up positionally
# instead of aligning on the original index
df_0 = df.loc[0:n-3].reset_index(drop=True)
df_1 = df.loc[1:n-2].reset_index(drop=True)
df_2 = df.loc[2:n-1].reset_index(drop=True)
df_final = pd.concat([df_0, df_1, df_2], axis=1)
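The same idea generalized to any window size k (a sketch, using the same df):

k = 3  # desired subarray size
parts = [df.iloc[i:len(df) - k + 1 + i].reset_index(drop=True) for i in range(k)]
df_final = pd.concat(parts, axis=1)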
You can save a few steps with pandas' rolling function, using the desired subarray length as the window size (window=SUBARR_SZ). Then join the values in each row with a comma, convert the result to a Series so a join can be applied again, this time across the rows of each window (each complete window yields one subarray).
import pandas as pd
df = pd.read_csv('sample.csv')
SUBARR_SZ = 3 # subarray size
df_list = []
for w in df.rolling(window=SUBARR_SZ):
    if len(w) == SUBARR_SZ:
        s = w.apply(','.join, axis=1).apply(pd.Series).apply(','.join)
        df_list.append(s)
dff = pd.concat(df_list).reset_index(drop=True)
print(dff)
Output from dff
0 d1,l1,h1,c1,d2,l2,h2,c2,d3,l3,h3,c3
1 d2,l2,h2,c2,d3,l3,h3,c3,d4,l4,h4,c4
2 d3,l3,h3,c3,d4,l4,h4,c4,d5,l5,h5,c5
3 d4,l4,h4,c4,d5,l5,h5,c5,d6,l6,h6,c6
4 d5,l5,h5,c5,d6,l6,h6,c6,d7,l7,h7,c7
dtype: object
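A shift-based variant that avoids the explicit loop over windows (a sketch on the same sample.csv; the _0/_1/_2 column suffixes are only there to keep the concatenated column names unique):

import pandas as pd

df = pd.read_csv('sample.csv')
SUBARR_SZ = 3

dff = (pd.concat([df.add_suffix(f'_{i}').shift(-i) for i in range(SUBARR_SZ)], axis=1)
         .dropna()                   # drops the incomplete windows at the end
         .apply(','.join, axis=1))   # same comma-joined output as above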
I'm in the process of cleaning a data frame, and one particular column contains values that are lists. I'm trying to find the average of those lists and update the existing column with an int while preserving the indices. I can successfully and efficiently convert those values to a list, but I lose the index values in the process. The code I've written below is too memory-intensive to execute. Is there a simpler approach that would work?
data: https://docs.google.com/spreadsheets/d/1Od7AhXn9OwLO-SryT--erqOQl_NNAGNuY4QPSJBbI18/edit?usp=sharing
def Average(x):
    sum1 = 0
    average = 0
    if len(x) == 1:
        for obj in x:
            sum1 = int(obj)
    if len(x) > 1:
        for year in x:
            sum1 += int(year)
        average = sum1 / len(x)
    return average

hello = hello[hello.apply([lambda x: mean(x) for x in hello])]
Here's the loop I used to convert the values into a list:
df_list1 = []
for x in hello:
    sum1 = 0
    average = 0
    if len(x) == 1:
        for obj in x:
            df_list1.append(int(obj))
    if len(x) > 1:
        for year in x:
            sum1 += int(year)
        average = sum1 / len(x)
        df_list1.append(int(average))
Use apply and np.mean.
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'listcol': [np.random.randint(1, 10, 5) for _ in range(3)]}, index=['a', 'b', 'c'])

# np.mean returns NaN on an empty list, so replace missing entries with [] first
df['listcol'] = df['listcol'].apply(lambda x: x if isinstance(x, (list, np.ndarray)) else [])

# can use this if all elements in the lists are numeric
df['listcol'] = df['listcol'].apply(lambda x: np.mean(x))

# use this instead if the lists have numbers stored as strings
df['listcol'] = df['listcol'].apply(lambda x: np.mean([int(i) for i in x]))
Output
>>>df
listcol
a 5.0
b 5.2
c 4.4
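An alternative that skips the element-wise apply entirely (a sketch; Series.explode needs pandas >= 0.25, and astype(float) also covers numbers stored as strings):

df['listcol'] = df['listcol'].explode().astype(float).groupby(level=0).mean()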
I'm trying to create a function that builds a new column in a pandas dataframe: it figures out which substring appears in a column of strings and uses that substring as the value of the new column.
The problem is that the text to find does not appear at the same position within x.
df = pd.DataFrame({'x': ["var_m500_0_somevartext","var_m500_0_vartextagain",
"varwithsomeothertext_0_500", "varwithsomext_m150_0_text"], 'x1': [4, 5, 6,8]})
finds = ["m500_0","0_500","m150_0"]
Which of finds is in a given df["x"] row?
I've made a function that works, but it is terribly slow for large datasets:
import re

def pd_create_substring_var(df, new_var_name="new_var", substring_list=["1"], var_ori="x"):
    df[new_var_name] = "na"
    cols = list(df.columns)
    for ix in range(len(df)):
        for find in substring_list:
            for m in re.finditer(find, df.iloc[ix][var_ori]):
                df.iat[ix, cols.index(new_var_name)] = df.iloc[ix][var_ori][m.start():m.end()]
    return df
df = pd_create_substring_var(df,"t",finds,var_ori="x")
df
x x1 t
0 var_m500_0_somevartext 4 m500_0
1 var_m500_0_vartextagain 5 m500_0
2 varwithsomeothertext_0_500 6 0_500
3 varwithsomext_m150_0_text 8 m150_0
Does this accomplish what you need?
finds = ["m500_0", "0_500", "m150_0"]
df["t"] = df["x"].str.extract(f"({'|'.join(finds)})")
Use pandas.Series.str.findall:
df['x'].str.findall("|".join(finds))
0 [m500_0]
1 [m500_0]
2 [0_500]
3 [m150_0]
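findall returns a list per row; if a plain string column like in the expected output is needed, take the first hit (a sketch):

df['t'] = df['x'].str.findall("|".join(finds)).str[0]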
Probably not the best way:
df['t'] = df['x'].apply(lambda x: ''.join([i for i in finds if i in x]))
And now print(df) gives:
x x1 t
0 var_m500_0_somevartext 4 m500_0
1 var_m500_0_vartextagain 5 m500_0
2 varwithsomeothertext_0_500 6 0_500
3 varwithsomext_m150_0_text 8 m150_0
And now, just adding to @pythonjokeun's answer, you can do:
df["t"] = df["x"].str.extract("(%s)" % '|'.join(finds))
Or:
df["t"] = df["x"].str.extract("({})".format('|'.join(finds)))
Or:
df["t"] = df["x"].str.extract("(" + '|'.join(finds) + ")")
I don't know how large your dataset is, but you can use the map function like below:
import pandas as pd

def subset_df_test():
    df = pd.DataFrame({'x': ["var_m500_0_somevartext", "var_m500_0_vartextagain",
                             "varwithsomeothertext_0_500", "varwithsomext_m150_0_text"],
                       'x1': [4, 5, 6, 8]})
    finds = ["m500_0", "0_500", "m150_0"]
    df['t'] = df['x'].map(lambda x: compare(x, finds))
    print(df)

def compare(x, finds):
    # return the first substring found in x (None if there is no match)
    for f in finds:
        if f in x:
            return f
Try this:
df["t"] = df["x"].apply(lambda x: [i for i in finds if i in x][0])
My data looks as follows:
ID my_val db_val
a X X
a X X
a Y X
b X Y
b Y Y
b Y Y
c Z X
c X X
c Z X
Expected result:
ID my_val db match
a X:2;Y:1 X full_match
b Y:2;X:1 Y full_match
c Z:2;X:1 X partial_match
A full_match is when db_val matches the most abundant my_val.
A partial_match is when db_val appears among the other values but doesn't match the top one.
My current approach consists of grouping by ID, counting values into a separate column, concatenating each value with its count, then aggregating all values into one row for each ID.
This is how I aggregate the columns:
from functools import reduce

def all_hits_aggregate_df(df, columns=['my_val']):
    grouped = df.groupby('ID')
    l = []
    for c in columns:
        res = grouped[c].value_counts(ascending=False, normalize=False).to_frame('count_' + c).reset_index(level=1)
        res[c] = res[c].astype(str) + ':' + res['count_' + c].astype(str)
        l.append(res.groupby('ID').agg(lambda x: ';'.join(x)))
    return reduce(lambda x, y: pd.merge(x, y, on='ID'), l)
And for the comparison phase, I loop through each row and parse the my_val column into lists then do the comparison.
I am sure that the way I do the comparison step is extremely inefficient but I am unsure how I would do it before aggregation to avoid having to parse the generated string later in the process.
We can group the DataFrame by ID, count my_val values with value_counts, and convert the counts to JSON with to_json, which, with some small changes in formatting, gives us the requested format (we just need to remove curly brackets and quotes and replace commas with semicolons). On the grouped data we also take the first (and presumably only) value of db_val per ID and calculate the fraction of matches (more than 50% gives full_match, 0-50% is partial_match, and 0% is no_match):
import numpy as np

df['match'] = df['my_val'] == df['db_val']
z = (df
     .groupby('ID')
     .agg({'my_val': lambda x: x.value_counts().to_json(),
           'db_val': 'first',
           'match': 'mean'})
     ).reset_index()
z['my_val'] = z['my_val'].str.replace('[{"}]', '', regex=True).str.replace(',', ';')
z['match'] = np.select(
    [z['match'] > 0.5, z['match'] > 0],
    ['full_match', 'partial_match'], 'no_match')
print(z)
Output:
ID my_val db_val match
0 a X:2;Y:1 X full_match
1 b Y:2;X:1 Y full_match
2 c Z:2;X:1 X partial_match
This should give you the first part of what you want:
df['equal'] = df.my_val == df.db_val
df2 = pd.DataFrame()
df2['my_val'] = df.groupby('ID')['my_val'].sum()
df2['db'] = df.groupby('ID')['db_val'].unique()
df2['match_frac'] = df.groupby('ID')['equal'].mean()  # fraction of rows per ID where my_val == db_val
df2['match'] = ''
df2.loc[df2.match_frac > 0.5, 'match'] = 'full_match'
df2.loc[df2.match_frac <= 0.5, 'match'] = 'partial_match'
df2.loc[df2.match_frac == 0, 'match'] = 'no_match'
df2 = df2.drop(columns='match_frac')
print(df2)
my_val db match
ID
a XXY [X] full_match
b XYY [Y] full_match
c ZXZ [X] partial_match
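To also get the X:2;Y:1 style counts from the expected output instead of the concatenated strings, a sketch built on the same groupby:

df2['my_val'] = df.groupby('ID')['my_val'].apply(
    lambda s: ';'.join(f'{v}:{c}' for v, c in s.value_counts().items()))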
I have a bunch of data (10M + records) that breaks down to an identifier, a location and a date. I want to find the number of times that any identifier moved from some locationA to some other locationB over the entire set of dates. Any identifier may not have a location for all possible dates. When an identifier does not have a location recorded, that should be treated as an actual 'unknown' location for that date.
Here is some reproducible fake data...
import numpy as np
import pandas as pd
import datetime
base = datetime.date.today()
num_days = 50
dates = np.array([base - datetime.timedelta(days=x) for x in range(num_days-1, -1, -1)])
ids = np.arange(50)
mi = pd.MultiIndex.from_product([ids, dates])
locations = np.array([chr(x) for x in 97 + np.random.randint(26, size=len(mi))])
s = pd.Series(locations, index=mi)
mask = np.random.rand(len(mi)) > .5
s[mask] = np.nan
s = s.dropna()
My initial thought was to create a dataframe and use boolean masking/vectorized operations to solve this:
df = s.unstack(0).fillna('unknown')
Apparently my data is sparse enough to cause a MemoryError (from all the extra entries resulting from unstacking).
My current working solution is the following
def series_fn(s):
    s = s.reindex(pd.date_range(s.index.levels[1].min(), s.index.levels[1].max()), level=-1).fillna('unknown')
    mask_prev = (s != s.shift(-1))[:-1]
    mask_next = (s != s.shift())[1:]
    s_prev = s[:-1][mask_prev]
    s_next = s[1:][mask_next]
    s_tup = pd.Series(list(zip(s_prev, s_next)))
    return s_tup.value_counts()
result_per_id = s.groupby(level=0).apply(series_fn)
result = result_per_id.sum(level=-1)
result looks like
(a, b) 1
(a, c) 5
(a, e) 3
(a, f) 3
(a, g) 3
(a, h) 3
(a, i) 1
(a, j) 1
(a, k) 2
(a, l) 2
...
This is going to take ~5 hours for all my data. Does anyone know any faster ways of doing this?
Thanks!
Hmmm, I guess I should have transposed the data... well, that was a relatively simple fix. Instead of using groupby and apply:
import time

s = s.reorder_levels(['date', 'id'])
s = s.sortlevel(0)
results = []
for i in range(len(s.index.levels[0]) - 1):
    t = time.time()
    s0 = s.loc[s.index.levels[0][i]]
    s1 = s.loc[s.index.levels[0][i+1]]
    df = pd.concat((s0, s1), axis=1)
    # Note: this is slower than the line above
    # df = s.loc[s.index.levels[0][0:2], :].unstack(0)
    df = df.fillna('unknown')
    mi = pd.MultiIndex.from_arrays((df.iloc[:, 0], df.iloc[:, 1]))
    s2 = pd.Series(1, mi)
    res = s2.groupby(level=[0, 1]).apply(np.sum)
    results.append(res)
    print(time.time() - t)
results = pd.concat(results, axis=1)
Still unclear on why the commented out section takes about three times as long as the three lines above it.
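To collapse the per-date-pair columns back into the overall transition counts asked for in the question (a sketch, assuming the results frame built above):

total = results.sum(axis=1).astype(int)  # NaNs (pairs absent on a given date) are skipped by sum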