Subtract each column by the preceding column on Dataframe in Python - python

Using the following code I can build a simple table with the current COVID-19 cases worldwide, per country:
url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
raw_data = pd.read_csv(url, sep=",")
raw_data.drop(['Province/State','Lat','Long'], axis = 1, inplace = True)
plot_data = raw_data.groupby('Country/Region').sum()
The plot_data is a simple DataFrame:
What I would like to do now is to subtract the values on each column by the values on the column on a prior day - i.e., I wan to get the new cases per day.
If I do something like plot_data['3/30/20'].add(-plot_data['3/29/20']), it works well. But if I do something like plot_data.iloc[:,68:69].add(-plot_data.iloc[:,67:68]), I got two columns with NaN values. I.e. Python tries to "preserve" de columns header and doesn't perform the operation the way I would like it to.
My goal was to perform this operation in an "elegant way". I was thinking something in the lines of plot_data.iloc[:,1:69].add(-plot_data.iloc[:,0:68]). But of course, if it doesn't work as the single-column example, it doesn't work with multiple columns either (as Python will match the column headers and return a bunch of zeros/NaN values).
Maybe there is a way to tell Python to ignore the headers during an operation with a DataFrame? I know that I can transform my DataFrame into a NumPy array and do a bunch of operations. However, since this is a simple/small table, I thought I would try to keep using a DataFrame data type.

The good old shift can be used on the horizontal axis:
plot_data - plot_data.shift(-1, axis=1)
should be what you want.

Thank you very much #Serge Ballesta! Your answer is exactly the type of "elegant solution" I was looking for. The only comment is the shift sign should be "positive".
plot_data - plot_data.shift(1, axis=1)
This way we bring the historical figures forward one day and now I can subtract it from the actual numbers on each day.

Related

Pycharm problem set (Stuck from step 3 onwards)

Using the ff_monthly.csv data set https://github.com/alexpetralia/fama_french,
use the first column as an index
(this contains the year and month of the data as a string
Create a new column ‘Mkt’ as ‘Mkt-RF’ + ‘RF’
Create two new columns in the loaded DataFrame, ‘Month’ and ‘Year’ to
contain the year and month of the dataset extracted from the index column.
Create a new DataFrame with columns ‘Mean’ and ‘Standard
Deviation’ and the full set of years from (b) above.
Write a function which accepts (r_m,s_m) the monthy mean and standard
deviation of a return series and returns a tuple (r_a,s_a), the annualised
mean and standard deviation. Use the formulae: r_a = (1+r_m)^12 -1, and
s_a = s_m * 12^0.5.
Loop through each year in the data, and calculate the annualised mean and
standard deviation of the new ‘Mkt’ column, storing each in the newly
created DataFrame. Note that the values in the input file are % returns, and
need to be divided by 100 to return decimals (i.e the value for August 2022
represents a return of -3.78%).
. Print the DataFrame and output it to a csv file.
Workings so far:
import pandas as pd
ff_monthly=pd.read_csv(r"file path")
ff_monthly=pd.read_csv(r"file path",index_col=0)
Mkt=ff_monthly['Mkt-RF']+ff_monthly['RF']
ff_monthly= ff_monthly.assign(Mkt=Mkt)
df=pd.DataFrame(ff_monthly)
enter image description here
There are a few things to pay attention to.
The Date is the index of your DataFrame. This is treated in a special way compared to the normal columns. This is the reason df.Date gives an Attribute error. Date is not an Attribute, but the index. Instead try df.index
df.Date.str.split("_", expand=True) would work if your Date would look like 22_10. However according to your picture it doesn't contain an underscore and also contains the day, so this cannot work
In fact the format you have is not even following any standard. In order to properly deal with that the best way would be parsing this to a proper datetime64[ns] type that pandas will understand with df.index = pd.to_datetime(df.index, format='%y%m%d'). See the python docu for supported format strings.
If all this works, it should be rather straightforward to create the columns
df['year'] = df.index.dt.year
In fact, this part has been asked before

Pandas: going from long to wide format in a dataframe

I am having trouble going from a long format to a wide one, in pandas. There are plenty of examples going from wide to long, but I did not find one from long to wide.
I am trying to reformat my dataframe and pivot, groupby, unstack are a bit confusing for my use case.
This is how I want it to be. The numbers are actually the intensity column from the second image.
And this is how it is now
I tried to build a MultiIndex based on Peptide, Charge and Protein. Then I tried to pivot based on that multi index, and keep all the samples and their intensity as values:
df.set_index(['Peptide', 'Charge', 'Protein'], append=False)
df.pivot(index=df.index, columns='Sample', values='Intensity')
Of course, this does not work since my index is now a combination of the 3 and not an actual column in the dataframe.
It tells me
KeyError: None of [RangeIndex(start=0, stop=3397898, step=1)] are in the [columns]
I tried also to group by, but I am not sure how to move from the long format back to wide. I am quite new to the dataframe way of thinking and I want to learn how to do this right.
It was very tempting for me to do an old school "java"-like approach with 4 for loops and building it as a matrix. Thank you in advnace!
I think based on your attempt that this might work:
df2 = df.pivot(['Peptide', 'Charge', 'Protein'], columns='Sample', values='Intensity').reset_index()
After that, if you want to remove the name from the column axis:
df2 = df2.rename_axis(None, axis=1)

Z-score calculation/standardisation using pandas

I come across this video and it bugs me.
Essentially, at 5:50, they calculate Z-score for the whole data frame by the following snippet:
df_z = (df - df.describle.T['mean'])/df.describle.T['std']
It is a neat and beautiful line.
However, df.describle.T looks like this and df looks like this
df.describle.T['mean'] and df.describle.T['std'] are two individual series, which take the df columns name as index and describle statistic parameters as columns, and df is an ordinary pd.DataFramewhich has numercial index and columns names in the right places.
My question is: how does that line make sense when they are not matching at all, in particular, how do they ensure that every variable example (x_i) matches their mean or std?
Thank you.

GroupBy on a dataframe with multiindex columns using periodindex

I have a pivot_table generated DataFrame with a single index for its rows, and a MultiIndex for its columns. The top level of the MultiIndex is the name of the data I am running calculations on, and the second level is the DATE of that data. The values are the result of those calculations. It looks like this:
Imgur link - my reputation not high enough to post inline images
I am trying to group this data by quarters (Q42018, for example), instead of every single day (the native format of the data).
I found this post that uses PeriodIndex and GroupBy to convert an index of dates into an index of quarters/years to be quite elegant and make the most sense.
The problem is that this solution is for a dataframe with only single index columns. I'm running into a problem trying to do this because my columns are a multi-index, and I can't figure out how to get it to work. Here is my attempt thus far:
bt = cleaned2018_df.pivot_table(index='Broker',
values=['Interaction Id','Net Points'],
columns='Date',
aggfunc={'Interaction Id':pd.Series.nunique,
'Net Points':np.sum},
fill_value=0)
pidx = pd.PeriodIndex(bt.columns.levels[1], freq='Q')
broker_qtr_totals = bt.groupby(pidx, axis=1, level=1).sum()
As you can see, I'm grabbing the second level of the MultiIndex that contains all the dates, and running it through the PeriodIndex function to get back an index of quarters. I then pass that PeriodIndex into groupby, and tell it to operate on columns and the second level where the dates are.
This returns a ValueError response of Grouper and axis must be same length. And I know the reason is because the pidx value I'm passing in to the GroupBy is of length x, whereas the column axis of the dataframe is length 2x (since the 1st level of the multiindex has 2 values).
I'm just getting hung up on how to properly apply this to the entire index. I can't seem to figure it out syntactically, so I wanted to rely on the community's expertise to see if someone could help me out.
If my explanation is not clear, I'm happy to clarify further. Thank you in advance.
I figured this out, and am going to post the answer in case anyone else with a similar problem lands here. I was thinking about the problem correctly, but had a few errors in my first attempt.
The length error was due to me passing an explicit reference to the 2nd level of the MultiIndex into the PeriodIndex function, and then passing that into groupby. The better solution is to use the .get_level_values function, as this takes into account the multi-level nature of the index and returns the appropriate # of values based on how many items are in higher levels.
For instance - if you have a DataFrame with MultiIndex columns with 2 levels - and those 2 levels each contain 3 values, your table will have 9 columns, as the lower level is broken out for each value in the top level. My initial solution was just grabbing those 3 values from the second level directly, instead of all 9. get_level_values corrects for this.
The second issue was that I was passing just this PeriodIndex object by itself into the groupby. That will work, but then it basically just disregards the top level of the MultiIndex. So you need to make sure to pass in a list that contains the original top level, and your new second level that you want to group by.
Corrected code:
#use get_level_values instead of accessing levels directly
pIdx = pd.PeriodIndex(bt.columns.get_level_values(1), freq='Q')
# to maintain original grouping, pass in a list of your original top level,
# and the new second level
broker_qtr_totals = bt.groupby(by=[bt.columns.get_level_values(0), pidx],
axis=1).sum()
This works
imgur link to dataframe image as my rep is too low

How to combine multiple rows of data into a single sting per group

To preface: I'm new to using Python.
I'm working on cleaning up a file where data was spread across multiple rows. I'm struggling to find a solution that will concatenate multiple text strings to a single cell. The .csv data looks similar to this:
name,date,description
bundy,12-12-2017,good dog
,,smells kind of weird
,,needs to be washed
with one or two blank rows between each entry, too.
The amount of rows used for 'description' isn't consistent. Sometimes it's just one cell, sometimes up to about four. The ideal output turns these multiple rows into a single row of useful data, without all the wasted space. I thought maybe I could create a series of masks by copying the data across a few columns, shifted up, and then iterating in some way. I haven't found a solution that matches what I'm trying to do, though. This is where I'm at so far:
#Add column f description stuff and shift up a row for concatenation
DogData['Z'] = DogData['Y'].shift(-1)
DogData['AA'] = DogData['Z'].shift(-1)
DogData['AB'] = DogData['AA'].shift(-1)
#create series checks to determine how to concat values properly
YNAs = DogData['Y'].isnull()
ZNAs = DogData['Z'].isnull()
AANAs = DogData['AA'].isnull()
The idea here was basically that I'd iterate over column 'Y', check if the same row in column 'Z' was NA or had a value, and concat if it did. If not, just use the value in 'Y'. Carry that logic across but stopping if it encountered an NA in any subsequent columns. I can't figure out how to do that, or if there's a more efficient way to do this.
What do I have to do to get to my end result? I can't figure out the right way to iterate or concatenate in the way I was hoping to.
'''
name,date,description
bundy,12-12-2017,good dog
,,smells kind of weird
,,needs to be washed
'''
df = pd.read_clipboard(sep=',')
df.fillna(method = 'ffill').groupby([
'name',
'date'
]).description.apply(lambda x : ', '.join(x)).to_frame(name = 'description')
I'm not sure I follow exactly what you mean. I took that text, saved it as a csv file, and successfully read it into a pandas dataframe.
import pandas as pd
df = pd.read_csv('test.csv')
df
Output:
name date description
0 bundy 12-12-2017 good dog
1 NaN NaN smells kind of weird
2 NaN NaN needs to be washed
Isn't this the output you require?

Categories

Resources