Pandas: going from long to wide format in a dataframe - python

I am having trouble going from a long format to a wide one in pandas. There are plenty of examples going from wide to long, but I did not find any from long to wide.
I am trying to reformat my dataframe, and pivot, groupby, and unstack are a bit confusing for my use case.
This is how I want it to be (the numbers are actually the Intensity column from the second image).
And this is how it is now:
I tried to build a MultiIndex based on Peptide, Charge and Protein. Then I tried to pivot based on that multi index, and keep all the samples and their intensity as values:
df.set_index(['Peptide', 'Charge', 'Protein'], append=False)
df.pivot(index=df.index, columns='Sample', values='Intensity')
Of course, this does not work since my index is now a combination of the 3 and not an actual column in the dataframe.
It tells me
KeyError: None of [RangeIndex(start=0, stop=3397898, step=1)] are in the [columns]
I also tried groupby, but I am not sure how to move from the long format back to wide. I am quite new to the dataframe way of thinking and I want to learn how to do this right.
It was very tempting to do an old-school "Java"-like approach with 4 for loops and build it as a matrix. Thank you in advance!

I think based on your attempt that this might work:
df2 = df.pivot(index=['Peptide', 'Charge', 'Protein'], columns='Sample', values='Intensity').reset_index()
After that, if you want to remove the name from the column axis:
df2 = df2.rename_axis(None, axis=1)
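For concreteness, here is a minimal sketch on made-up data (the column names follow the question; the values are invented). Note that passing a list as the index requires pandas 1.1+, and in pandas 2.0+ the pivot arguments are keyword-only:

import pandas as pd

# Toy long-format data mimicking the question's columns (values invented)
df = pd.DataFrame({
    'Peptide': ['AAA', 'AAA', 'BBB', 'BBB'],
    'Charge': [2, 2, 3, 3],
    'Protein': ['P1', 'P1', 'P2', 'P2'],
    'Sample': ['S1', 'S2', 'S1', 'S2'],
    'Intensity': [10.0, 20.0, 30.0, 40.0],
})

# One row per (Peptide, Charge, Protein); one Intensity column per Sample
df2 = (df.pivot(index=['Peptide', 'Charge', 'Protein'],
                columns='Sample', values='Intensity')
         .reset_index()
         .rename_axis(None, axis=1))
print(df2)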

Related

Divide dataframe column by series

Python beginner here so I reckon I'm massively overcomplicating this in some way.
I have a dataframe with about 20 columns, but I've only shown a small subset for simplicity. I want to get the totals for red, blue, and none as percentages of the total for that month. So I thought it might be easiest to take a subset of these three columns and then add the result back to the rest of the data:
data = [['2022-08', 10,'red',0,0], ['2022-04', 15,'blue',1,0], ['2022-08', 14,'none',1,1],['2022-04', 14,'blue',0,0],['2022-03', 14,'none',1,0]]
df = pd.DataFrame(data, columns=['Month', 'Balance','Type','Flag_1','Flag_2'])
df2 = df[['Month','Type','Balance']].groupby(['Month','Type']).sum().unstack().fillna(0)
df2['balance_all_categories']= df2.sum(axis=1)
Now I want to add this back to my full dataframe and turn my balances for red, blue, and none into percentages of the total for that month. I have many more than just 2 flags, and I will need to make subsets based on all flags being zero, all flags being 1, and so on. If I group by month and type here, the columns start to have incredibly long names, so I would like to avoid that if possible.
Is there an easy way to deal with this? Thanks for any suggestions! :)
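One hedged sketch of a simpler route (not from the original thread): groupby(...).transform('sum') broadcasts each month's total back onto the original rows, so the division aligns on the index automatically and no unstacking or long column names are needed. The column name pct_of_month is my own invention:

import pandas as pd

data = [['2022-08', 10, 'red', 0, 0], ['2022-04', 15, 'blue', 1, 0],
        ['2022-08', 14, 'none', 1, 1], ['2022-04', 14, 'blue', 0, 0],
        ['2022-03', 14, 'none', 1, 0]]
df = pd.DataFrame(data, columns=['Month', 'Balance', 'Type', 'Flag_1', 'Flag_2'])

# Each row's balance is divided by its own month's total balance
df['pct_of_month'] = df['Balance'] / df.groupby('Month')['Balance'].transform('sum') * 100
print(df)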

Pandas - How do you find the top n elements of 1 column based on a condition from another column

I am struggling with a question based on Pandas. I have an earthquake data set with columns of countries and magnitudes. I am asked to:
"Find the top 10 states / countries where the strongest and weakest
earthquakes occurred."
From this question, I gathered that I am meant to find the top 10 countries (["country"]) with the highest values (value_counts), but sorting by magnitude (["mag"]).
How would I go about doing this? I've looked around but there's nothing I've found about this online.
Are you sure you did not find anything useful? If I understand your question correctly, it is a simple one. After creating a dataframe, the methods below will get you what you need.
import pandas as pd
df = pd.read_csv(".csv")
df.nlargest(x, ['Column Name'])
Here x is the number of largest elements you want.
The same goes for nsmallest. Just use these:
DataFrame.nsmallest(n, columns, keep='first')
DataFrame.nlargest(n, columns, keep='first')
Please read and check the documentation first.
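For instance, assuming columns named "country" and "mag" as in the question, a minimal sketch on invented data:

import pandas as pd

# Toy earthquake data; column names follow the question, values are invented
df = pd.DataFrame({'country': ['Japan', 'Chile', 'USA', 'Peru', 'Fiji'],
                   'mag': [9.1, 8.8, 7.1, 8.0, 6.4]})

strongest = df.nlargest(3, 'mag')    # rows with the 3 largest magnitudes
weakest = df.nsmallest(3, 'mag')     # rows with the 3 smallest magnitudes
print(strongest['country'].tolist())
print(weakest['country'].tolist())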

Subtract each column by the preceding column on Dataframe in Python

Using the following code I can build a simple table with the current COVID-19 cases worldwide, per country:
url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
raw_data = pd.read_csv(url, sep=",")
raw_data.drop(['Province/State','Lat','Long'], axis = 1, inplace = True)
plot_data = raw_data.groupby('Country/Region').sum()
The plot_data is a simple DataFrame:
What I would like to do now is subtract from each column the values of the prior day's column - i.e., I want to get the new cases per day.
If I do something like plot_data['3/30/20'].add(-plot_data['3/29/20']), it works well. But if I do something like plot_data.iloc[:,68:69].add(-plot_data.iloc[:,67:68]), I get two columns of NaN values. I.e., Python tries to "preserve" the column headers and doesn't perform the operation the way I would like it to.
My goal was to perform this operation in an "elegant way". I was thinking of something along the lines of plot_data.iloc[:,1:69].add(-plot_data.iloc[:,0:68]). But of course, if it doesn't work in the single-column example, it doesn't work with multiple columns either (Python will match the column headers and return a bunch of zeros/NaN values).
Maybe there is a way to tell Python to ignore the headers during an operation with a DataFrame? I know that I can transform my DataFrame into a NumPy array and do a bunch of operations. However, since this is a simple/small table, I thought I would try to keep using a DataFrame data type.
The good old shift can be used on the horizontal axis:
plot_data - plot_data.shift(-1, axis=1)
should be what you want.
Thank you very much @Serge Ballesta! Your answer is exactly the type of "elegant solution" I was looking for. The only comment is that the shift sign should be positive:
plot_data - plot_data.shift(1, axis=1)
This way we bring the historical figures forward one day and now I can subtract it from the actual numbers on each day.
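As an aside (not part of the original thread), pandas also has a built-in DataFrame.diff, which computes the same column-minus-preceding-column difference in one call:

# Equivalent one-liner: each column minus the preceding column
new_cases = plot_data.diff(axis=1)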

Z-score calculation/standardisation using pandas

I came across this video and it bugs me.
Essentially, at 5:50, they calculate the Z-score for the whole dataframe with the following snippet:
df_z = (df - df.describe().T['mean']) / df.describe().T['std']
It is a neat and beautiful line.
However, df.describe().T looks like this and df looks like this.
df.describe().T['mean'] and df.describe().T['std'] are two individual Series, which take the df column names as their index and the describe statistics as columns, while df is an ordinary pd.DataFrame which has a numerical index and the column names in the right places.
My question is: how does that line make sense when they do not match at all? In particular, how is it ensured that every observation (x_i) is matched with the mean and std of its own column?
Thank you.
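For reference (this is standard pandas alignment behavior, not something stated in the thread): when a Series is subtracted from a DataFrame, pandas aligns the Series' index against the DataFrame's columns, so each column is paired with the entry whose label matches its name. A small demonstration:

import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]})

# describe().T has one row per column of df, so ['mean'] and ['std']
# are Series indexed by the column names 'a' and 'b'
means = df.describe().T['mean']
stds = df.describe().T['std']

# The Series index ('a', 'b') is matched to df's columns, column by column
df_z = (df - means) / stds
print(df_z)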

GroupBy on a dataframe with multiindex columns using periodindex

I have a pivot_table generated DataFrame with a single index for its rows, and a MultiIndex for its columns. The top level of the MultiIndex is the name of the data I am running calculations on, and the second level is the DATE of that data. The values are the result of those calculations. It looks like this:
Imgur link - my reputation not high enough to post inline images
I am trying to group this data by quarters (Q42018, for example), instead of every single day (the native format of the data).
I found this post, which uses PeriodIndex and GroupBy to convert an index of dates into an index of quarters/years, to be quite elegant and to make the most sense.
The problem is that this solution is for a dataframe with only single index columns. I'm running into a problem trying to do this because my columns are a multi-index, and I can't figure out how to get it to work. Here is my attempt thus far:
bt = cleaned2018_df.pivot_table(index='Broker',
                                values=['Interaction Id', 'Net Points'],
                                columns='Date',
                                aggfunc={'Interaction Id': pd.Series.nunique,
                                         'Net Points': np.sum},
                                fill_value=0)
pidx = pd.PeriodIndex(bt.columns.levels[1], freq='Q')
broker_qtr_totals = bt.groupby(pidx, axis=1, level=1).sum()
As you can see, I'm grabbing the second level of the MultiIndex that contains all the dates, and running it through the PeriodIndex function to get back an index of quarters. I then pass that PeriodIndex into groupby, and tell it to operate on columns and the second level where the dates are.
This returns a ValueError response of Grouper and axis must be same length. And I know the reason is because the pidx value I'm passing in to the GroupBy is of length x, whereas the column axis of the dataframe is length 2x (since the 1st level of the multiindex has 2 values).
I'm just getting hung up on how to properly apply this to the entire index. I can't seem to figure it out syntactically, so I wanted to rely on the community's expertise to see if someone could help me out.
If my explanation is not clear, I'm happy to clarify further. Thank you in advance.
I figured this out, and am going to post the answer in case anyone else with a similar problem lands here. I was thinking about the problem correctly, but had a few errors in my first attempt.
The length error was due to me passing an explicit reference to the 2nd level of the MultiIndex into the PeriodIndex function, and then passing that into groupby. The better solution is to use the .get_level_values function, as this takes into account the multi-level nature of the index and returns the appropriate # of values based on how many items are in higher levels.
For instance - if you have a DataFrame with MultiIndex columns with 2 levels - and those 2 levels each contain 3 values, your table will have 9 columns, as the lower level is broken out for each value in the top level. My initial solution was just grabbing those 3 values from the second level directly, instead of all 9. get_level_values corrects for this.
The second issue was that I was passing just this PeriodIndex object by itself into the groupby. That will work, but then it basically just disregards the top level of the MultiIndex. So you need to make sure to pass in a list that contains the original top level, and your new second level that you want to group by.
Corrected code:
# use get_level_values instead of accessing levels directly
pidx = pd.PeriodIndex(bt.columns.get_level_values(1), freq='Q')
# to maintain the original grouping, pass in a list of your original top level
# and the new second level
broker_qtr_totals = bt.groupby(by=[bt.columns.get_level_values(0), pidx],
                               axis=1).sum()
This works (imgur link to the resulting dataframe, as my rep is too low to post images).
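A self-contained sketch of the same pattern on toy data (all names here are invented; note that axis=1 in groupby is deprecated in pandas 2.x, where transposing, grouping on axis 0, and transposing back is one alternative):

import numpy as np
import pandas as pd

# Toy frame with MultiIndex columns: metric name on top, daily dates below
dates = pd.to_datetime(['2018-10-01', '2018-11-15', '2019-01-10'])
cols = pd.MultiIndex.from_product([['Interaction Id', 'Net Points'], dates])
bt = pd.DataFrame(np.arange(12).reshape(2, 6),
                  index=['BrokerA', 'BrokerB'], columns=cols)

# One quarter label per column (all 6 columns, not just the 3 unique dates)
pidx = pd.PeriodIndex(bt.columns.get_level_values(1), freq='Q')

# Keep the metric level, collapse the date level into quarters
broker_qtr_totals = bt.groupby(by=[bt.columns.get_level_values(0), pidx],
                               axis=1).sum()
print(broker_qtr_totals)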
