I have the following multi-index DataFrame, df.
I computed a 20-day moving average on df['Close'] with the code below, producing ma20.
How do I append ma20 to df as a multi-index DataFrame?
I should end up with 3 level-0 columns, ['Adj Close', 'Close', 'ma20'], each with the 3 tickers, ['MSFT', 'AAPL', 'AMZN'], as level-1 columns.
The answer should also not require me to type out all the tickers manually.
import pandas as pd
import yfinance as yf
df = yf.download(['MSFT','AAPL','AMZN'], start="2022-01-01", end="2022-09-01").loc[:,['Adj Close','Close']]
ma20 = df['Close'].sort_index(ascending=True).rolling(20, min_periods=20).mean()
pd.concat([df,ma20], axis=1)????
It's not a very elegant solution, but it should work. The idea is to specify the MultiIndex names for the new columns explicitly:
df[[('Mean', col) for col in ma20.columns]] = ma20
UPDATE:
If you want to use the concat() method, you first need to add another column index level to ma20. The way to do this looks counter-intuitive to me:
pd.concat((df, pd.concat({'Mean': ma20}, axis=1)), sort=False, axis=1)
The purpose of pd.concat({'Mean': ma20}, axis=1) is just to add another index level to the columns of ma20.
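For completeness, a minimal sketch of the same trick using 'ma20' as the level-0 label, which is the layout the question asks for (not tested against live yfinance data; the sort_index is just to group the level-0 blocks together):
out = pd.concat([df, pd.concat({'ma20': ma20}, axis=1)], axis=1)
out = out.sort_index(axis=1)  # groups columns as ['Adj Close', 'Close', 'ma20']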
I'm having a bit of trouble understanding the documentation for this warning. This line:
dfbreed['x'] = dfbreed.apply(testbreed, axis=1)
produces:
C:/Users/erasmuss/PycharmProjects/Sarah/farmdata.py:38: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
The code basically re-arranges and cleans some data to make analysis easier.
The data is given row-by-row per animal, but has repetitions, blanks, and other sparse values.
The idea is to stack rows into columns and grab the useful data (Weight by date and final BCS) per animal.
The initial DF (a few snippets of the dataframe) and the desired output format (output DF/csv) were shown as images.
import pandas as pd
import numpy as np
#Function for cleaning up multiple entries of breeds
def testbreed(x):
    if x.first_valid_index() is None:
        return None
    else:
        return x[x.first_valid_index()]
#Read Data
df1 = pd.read_csv("farmdata.csv")
#Drop empty rows
df1.dropna(how='all', axis=1, inplace=True)
#Copy to extract Weights in DF2
df2 = df1.copy()
df2 = df2.drop(['BCS', 'Breed','Age'], axis=1)
#Pivot for ID names in DF1
df1 = df1.pivot(index='ID', columns='Date', values=['Breed','Weight', 'BCS'])
#Pivot for weights in DF2
df2 = df2.pivot(index='ID', columns='Date', values = 'Weight')
#Split out Breeds and BCS into individual dataframes w/Duplicate/missing data for each ID
df3 = df1.copy()
dfbreed = df3[['Breed']]
dfBCS = df3[['BCS']]
#Drop empty BCS columns
df1.dropna(how='all', axis=1, inplace=True)
#Shorten Breed and BCS to single Column by grabbing first value that is real. see function above
dfbreed['x'] = dfbreed.apply(testbreed, axis=1)
dfBCS['x'] = dfBCS.apply(testbreed, axis=1)
#Populate BCS and Breed into new DF
df5= pd.DataFrame(data=None)
df5['Breed'] = dfbreed['x']
df5['BCS'] = dfBCS['x']
#Join Weights
df5 = df5.join(df2)
#Write output
df5.to_csv(r'.\out1.csv')
I want to take the Breed and BCS dataframes, which are multi-indexed on the columns by Breed or BCS and then by date, take the first non-NaN value across the date columns in each row, and set that into a single column named breed.
I had a lot of trouble getting the columns to pick the first valid values in situ on the DF.
I found a work-around in a 2015 answer, which defined the function at the top.
Reading through the docs on setting a value on a copy of a slice makes sense intuitively, but I can't think of a way to make it work as a direct replacement or an index-based assignment.
Should I be looping through?
Trying the second answer here, I get
dfbreed.loc[:,'Breed'] = dfbreed['Breed'].apply(testbreed, axis=1)
dfBCS.loc[:, 'BCS'] = dfBCS['BCS'].apply(testbreed, axis=1)
which returns
ValueError: Must have equal len keys and value when setting with an iterable
I'm thinking this has something to do with the multi-index. The keys come up as:
MultiIndex([('Breed', '1/28/2021'),
('Breed', '2/12/2021'),
('Breed', '2/4/2021'),
('Breed', '3/18/2021'),
('Breed', '7/30/2021')],
names=[None, 'Date'])
MultiIndex([('BCS', '1/28/2021'),
('BCS', '2/12/2021'),
('BCS', '2/4/2021'),
('BCS', '3/18/2021'),
('BCS', '7/30/2021')],
names=[None, 'Date'])
Sorry for the long question(s?)
Can anyone help me out?
Thanks.
You created dfbreed as:
dfbreed = df3[['Breed']]
So it is a view of the original DataFrame (limited to just this one column).
Remember that a view does not have its own data buffer; it is only a tool to "view"
a fragment of the original DataFrame, with read-only access.
When you attempt to perform dfbreed['x'] = dfbreed.apply(...), you are
effectively attempting to violate that read-only access.
To avoid this error, create dfbreed as an "independent" DataFrame:
dfbreed = df3[['Breed']].copy()
Now dfbreed has its own data buffer and you are free to change the data.
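A minimal sketch of the difference, using a toy frame (the column labels and values here are made up) and the testbreed function from the question:
import pandas as pd
df3 = pd.DataFrame({('Breed', '1/28/2021'): ['Angus', None],
                    ('Breed', '2/12/2021'): [None, 'Hereford']})
dfbreed = df3[['Breed']]                         # slice: assigning to it may warn
dfbreed = df3[['Breed']].copy()                  # independent frame with its own buffer
dfbreed['x'] = dfbreed.apply(testbreed, axis=1)  # now safe: no SettingWithCopyWarning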
In my notebook I have 3 dataframes.
I would like to calculate the mean age based on Pclass and Sex. I achieved this by using a groupby; the result of the groupby should fill the NaN fields of Age:
avg = traindf_cln.groupby(["Pclass", "Sex"])["Age"].transform('mean')
traindf_cln["Age"].fillna(avg, inplace=True)
validationdf_cln["Age"].fillna(avg, inplace=True)
testdf_cln["Age"].fillna(avg, inplace=True)
The problem is that the code above only works on the traindf_cln dataframe and not on the other two.
I think the issue is that the transform result is aligned to traindf_cln's index, so a value (of a groupby) of one specific dataframe can't be used directly on another dataframe.
How can I fix this?
A sample of traindf_cln was shown as an image.
Edit:
New code:
group = traindf_cln.groupby(["Pclass", "Sex"])["Age"].mean()
lookup_keys = pd.Series(tuple(zip(traindf_cln["Pclass"], traindf_cln["Sex"])))
traindf_cln["Age"].fillna(lookup_keys.map(group), inplace=True)
lookup_keys_val = pd.Series(tuple(zip(validationdf_cln["Pclass"], validationdf_cln["Sex"])))
validationdf_cln["Age"].fillna(lookup_keys_val.map(group), inplace=True)
A few samples of traindf_cln where Age is still NaN; some rows did change, but not all of them.
You don't need transform here, just a groupby result that can then be mapped onto the Pclass and Sex columns of the test/validation DataFrames. Here we create a Series of (Pclass, Sex) tuples that can be used to map the groupby means into the missing Age data:
group = traindf_cln.groupby(["Pclass", "Sex"])["Age"].mean()
lookup_keys = pd.Series(tuple(zip(traindf_cln["Pclass"], traindf_cln["Sex"])))
traindf_cln["Age"].fillna(lookup_keys.map(group), inplace=True)
Then just repeat the final 2 lines using the same group object on the test/validation sets.
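For instance, on the test set (a sketch mirroring the validation code in the edit above; note it reuses the group object computed from the training data):
lookup_keys_test = pd.Series(tuple(zip(testdf_cln["Pclass"], testdf_cln["Sex"])))
testdf_cln["Age"].fillna(lookup_keys_test.map(group), inplace=True)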
I have an array of dataframes dfs = [df0, df1, ...]. Each of them has a date column of varying size (some dates might be in one dataframe but not in another).
What I'm trying to do is this:
pd.concat(dfs).groupby("date", as_index=False).sum()
But with date no longer being a column but an index (dfs = [df.set_index("date") for df in dfs]).
I've seen you can pass df.index to groupby (.groupby(df.index)) but df.index might not include all the dates.
How can I do this?
The goal here is to call .sum() on the groupby, so I'm not tied to using groupby or concat if there's an alternative way to do this.
If I understand correctly, maybe you want something like this:
df = pd.concat(dfs)
df.groupby(df.index).sum()
Here's a small example:
tmp1 = pd.DataFrame({'date':['2019-09-01','2019-09-02','2019-09-03'],'value':[1,1,1]}).set_index('date')
tmp2 = pd.DataFrame({'date':['2019-09-01','2019-09-02','2019-09-04','2019-09-05'],'value':[2,2,2,2]}).set_index('date')
df = pd.concat([tmp1,tmp2])
df.groupby(df.index).sum()
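For reference, this example produces the following grouped sums (the dates 2019-09-01 and 2019-09-02 appear in both frames):
#             value
# date
# 2019-09-01      3
# 2019-09-02      3
# 2019-09-03      1
# 2019-09-04      2
# 2019-09-05      2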
I set up a pandas dataframes that besides my data stores the respective units with it using a MultiIndex like this:
Name        Relative_Pressure  Volume_STP
Unit                        -       ccm/g
Description              p/p0
0                    0.042691     29.3601
1                    0.078319     30.3071
2                    0.129529     31.1643
3                    0.183355     31.8513
4                    0.233435     32.3972
5                    0.280847     32.8724
Now I can, for example, extract only the Volume_STP data with df.Volume_STP:
Unit          ccm/g
Description
0           29.3601
1           30.3071
2           31.1643
3           31.8513
4           32.3972
5           32.8724
With .values I can obtain a numpy array of the data. However, how can I get the stored unit? I can't figure out what I need to do to retrieve the stored 'ccm/g' string.
EDIT: Added an example of how the data frame is generated.
Let's say I have a string that looks like this:
Relative Volume # STP
Pressure
cc/g
4.26910e-02 29.3601
7.83190e-02 30.3071
1.29529e-01 31.1643
1.83355e-01 31.8513
2.33435e-01 32.3972
2.80847e-01 32.8724
3.34769e-01 33.4049
3.79123e-01 33.8401
I then use this function:
from io import StringIO
import pandas as pd

def read_result(contents, columns, units, descr):
    df = pd.read_csv(StringIO(contents), skiprows=4, delim_whitespace=True, index_col=False, header=None)
    df.drop(df.index[-1], inplace=True)
    index = pd.MultiIndex.from_arrays((columns, units, descr))
    df.columns = index
    df.columns.names = ['Name', 'Unit', 'Description']
    df = df.apply(pd.to_numeric)
    return df
like this
def isotherm(contents):
    columns = ['Relative_Pressure', 'Volume_STP']
    units = ['-', 'ccm/g']
    descr = ['p/p0', '']
    df = read_result(contents, columns, units, descr)
    return df
to generate the DataFrame at the beginning of my question.
As df has a MultiIndex as columns, df.Volume_STP is still a pandas DataFrame. So you can still access its columns attribute, and the relevant item will be at index 0 because this sub-dataframe contains only one column.
So, you can extract the names that way:
print(df.Volume_STP.columns[0])
which should give: ('ccm/g', '')
In the end you extract the unit with .columns[0][0] and the description with .columns[0][1].
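For example, a one-line unpacking of both values (assuming the df built in the question):
unit, descr = df.Volume_STP.columns[0]
print(unit)   # 'ccm/g'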
You can do something like this:
df.xs('Volume_STP', axis=1).columns.remove_unused_levels().get_level_values(0).tolist()[0]
Output:
'ccm/g'
Slice the 'Volume_STP' part of the dataframe using xs, then take its columns, remove the unused parts of the column headers, and get the values of the topmost remaining level, which is the Unit. Convert to a list and select the first value.
A generic way of accessing values on multi-index/columns is by using the index.get_level_values or columns.get_level_values functions of a data frame.
In your example, try df.columns.get_level_values(1) to access the second level of the multi-level columns, "Unit". If you have already selected a column, say "Volume_STP", then you have removed the top level, and in this case your units would be at level 0.
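A quick sketch of both cases (assuming the df from the question):
df.columns.get_level_values('Unit')           # Index(['-', 'ccm/g'], dtype='object', name='Unit')
df.Volume_STP.columns.get_level_values(0)[0]  # 'ccm/g'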
I am working with two csv files imported as dataframes, df1 and df2.
df1 has 50000 rows and df2 has 150000 rows.
I want to compare (iterate through each row) the 'time' of df2 with df1, find the difference in time, and return the values of all columns corresponding to the nearest row, saving them in df3 (time synchronization).
For example, 35427949712 (in 'time' of df1) is nearest or equal to 35427949712 (in 'time' of df2), so I would like to return the contents of df1 ('velocity_x' and 'yaw') and df2 ('velocity' and 'yawrate') and save them in df3.
For this I used two techniques, shown in the code below.
Code 1 takes a very long time to execute (72 hours), which is not practical since I have a lot of csv files.
Code 2 gives me a "memory error" and the kernel dies.
It would be great to get a more robust solution considering computational time, memory, and power (Intel Core i7-6700HQ, 8 GB RAM).
Here is the sample data,
import pandas as pd
df1 = pd.DataFrame({'time': [35427889701, 35427909854, 35427929709,35427949712, 35428009860],
'velocity_x':[12.5451, 12.5401,12.5351,12.5401,12.5251],
'yaw' : [-0.0787806, -0.0784749, -0.0794889,-0.0795915,-0.0795472]})
df2 = pd.DataFrame({'time': [35427929709, 35427949712, 35427009860,35427029728, 35427049705],
'velocity':[12.6583, 12.6556,12.6556,12.6556,12.6444],
'yawrate' : [-0.0750492, -0.0750492, -0.074351,-0.074351,-0.074351]})
df3 = pd.DataFrame(columns=['time','velocity_x','yaw','velocity','yawrate'])
Code 1
for index, row in df1.iterrows():
    min = 100000
    for indexer, rows in df2.iterrows():
        if abs(float(row['time'])-float(rows['time'])) < min:
            min = abs(float(row['time'])-float(rows['time']))
            #storing the position
            pos = indexer
    df3.loc[index,'time'] = df1['time'][pos]
    df3.loc[index,'velocity_x'] = df1['velocity_x'][pos]
    df3.loc[index,'yaw'] = df1['yaw'][pos]
    df3.loc[index,'velocity'] = df2['velocity'][pos]
    df3.loc[index,'yawrate'] = df2['yawrate'][pos]
Code 2
df1['key'] = 1
df2['key'] = 1
df1.rename(index=str, columns ={'time' : 'time_x'}, inplace=True)
df = df2.merge(df1, on='key', how ='left').reset_index()
df['diff'] = df.apply(lambda x: abs(x['time'] - x['time_x']), axis=1)
df.sort_values(by=['time', 'diff'], inplace=True)
df=df.groupby(['time']).first().reset_index()[['time', 'velocity_x', 'yaw', 'velocity', 'yawrate']]
You're looking for pandas.merge_asof. It allows you to combine two DataFrames on a key, in this case time, without requiring an exact match. You can choose a direction for prioritizing the match, but in this case it's clear that you want nearest:
A “nearest” search selects the row in the right DataFrame whose ‘on’ key is closest in absolute distance to the left’s key.
One caveat is that you need to sort things for merge_asof to work.
import pandas as pd
pd.merge_asof(df2.sort_values('time'), df1.sort_values('time'), on='time', direction='nearest')
#           time  velocity   yawrate  velocity_x       yaw
# 0  35427009860   12.6556 -0.074351     12.5451 -0.078781
# 1  35427029728   12.6556 -0.074351     12.5451 -0.078781
# 2  35427049705   12.6444 -0.074351     12.5451 -0.078781
# 3  35427929709   12.6583 -0.075049     12.5351 -0.079489
# 4  35427949712   12.6556 -0.075049     12.5401 -0.079591
Just be careful about which DataFrame you choose as the left or right frame, as that changes the result. In this case I'm selecting the time in df1 which is closest in absolute distance to the time in df2.
You also need to be careful if you have duplicated on keys in the right df because for exact matches, merge_asof only merges the last sorted row of the right df to the left df, instead of creating multiple entries for each exact match. If that's a problem, you can instead merge the exact keys first to get all of the combinations, and then merge the remainder with asof.
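A rough sketch of that two-step approach, using the sample frames above (the intermediate names exact, rest, and nearest are my own):
exact = df2.merge(df1, on='time', how='inner')   # all combinations for exact key matches
rest = df2[~df2['time'].isin(df1['time'])]       # df2 rows with no exact match in df1
nearest = pd.merge_asof(rest.sort_values('time'), df1.sort_values('time'),
                        on='time', direction='nearest')
df3 = pd.concat([exact, nearest], ignore_index=True).sort_values('time')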
Just a side note (not an answer):
min_delta = 100000
for indexer, rows in df2.iterrows():
    if abs(float(row['time'])-float(rows['time'])) < min_delta:
        min_delta = abs(float(row['time'])-float(rows['time']))
        #storing the position
        pos = indexer
can be written as
diff = np.abs(row['time'] - df2['time'])
pos = np.argmin(diff)
(avoid explicit for loops whenever possible),
and don't name your variables after built-ins (min).