Divide dataframe column by series - python

Python beginner here so I reckon I'm massively overcomplicating this in some way.
I have a dataframe with about 20 columns, but I've only shown a small subset for simplicity. I want to get the totals for red, blue, and none as percentages of the total for that month. So I thought it might be easiest to take a subset of these three columns, then add the result back to the rest of the data:
import pandas as pd

data = [['2022-08', 10, 'red', 0, 0], ['2022-04', 15, 'blue', 1, 0],
        ['2022-08', 14, 'none', 1, 1], ['2022-04', 14, 'blue', 0, 0],
        ['2022-03', 14, 'none', 1, 0]]
df = pd.DataFrame(data, columns=['Month', 'Balance', 'Type', 'Flag_1', 'Flag_2'])

# summing the Balance series (rather than the whole frame) keeps the
# unstacked column names flat instead of a MultiIndex
df2 = df.groupby(['Month', 'Type'])['Balance'].sum().unstack().fillna(0)
df2['balance_all_categories'] = df2.sum(axis=1)
Now I want to add this back to my full dataframe and turn the balances for red, blue, and none into percentages of the total for that month. I have many more than just two flags, and I will need to make subsets based on all flags being zero, all flags being one, and so on. If I group by month and type on the full dataframe, the resulting column names get incredibly long, so I'd like to avoid that if possible.
Is there an easy way to deal with this? Thanks for any suggestions! :)
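A sketch of one possible approach, assuming df2 as built above (its columns are the Type categories from the sample plus balance_all_categories): divide the category columns by the monthly total with DataFrame.div(..., axis=0), which aligns the divisor Series with the rows rather than the column names, then merge the result back by Month:

# the columns produced by the unstack (the Type categories in this sample)
type_cols = ['blue', 'none', 'red']

# each category's balance as a percentage of that month's total
pct = df2[type_cols].div(df2['balance_all_categories'], axis=0) * 100

# attach the percentages back onto the full dataframe, keyed on Month
df = df.merge(pct.add_suffix('_pct'), left_on='Month', right_index=True, how='left')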

Related

How to create a new df with pandas based on a set of conditions that compares each row by the others within the same df?

I am attempting to use pandas to create a new df based on a set of conditions that compare the rows of the original df against one another. I am new to pandas: I am comfortable comparing two dfs and doing basic column comparisons, but for some reason the row-by-row comparison is stumping me. My specific conditions and problem are below:
Cosine_i_ start_time fid_ Shape_Area
0 0.820108 2022-08-31T10:48:34Z emit20220831t104834_o24307_s000 0.067763
1 0.962301 2022-08-27T12:25:06Z emit20220827t122506_o23908_s000 0.067763
2 0.811369 2022-08-19T15:39:39Z emit20220819t153939_o23110_s000 0.404882
3 0.788322 2023-01-29T13:23:39Z emit20230129t132339_o02909_s000 0.404882
4 0.811369 2022-08-19T15:39:39Z emit20220819t153939_o23110_s000 0.108256
^^Above is my original df that I will be working with.
Goal: I am hoping to create a new df that contains only the FIDs that meet the following conditions: If the shape area is equal, the cosi values have a difference greater than 0.1, and the start time has a difference greater than 5 days. This is going to be applied to a large dataset, the df displayed is just a small sample one I made to help write the code.
For example: Rows 2 & 3 have the same shape area, so then looking at the cosi values, they have a difference in values greater than 0.1, and lastly they have a difference in their start times that is greater than 5 days. They meet all set conditions, so I would then like to take the FID values for BOTH of these rows and append it to a new df.
So essentially I want to compare every row with the other rows and that's where I am having trouble.
I am looking for as much guidance as possible on how to set this up as I am very very new to coding and am hoping to get a tutorial of some sort!
Thanks in advance.
Group by Shape_Area and filter each pair (groups with only a single row on Shape_Area are omitted) by the required conditions:
fids = df.groupby('Shape_Area').filter(
    lambda x: x.index.size > 1
              and abs(x['Cosine_i_'].diff().values[-1]) > 0.1
              and x['start_time'].diff().abs().dt.days.values[-1] > 5,
    dropna=True)['fid_'].tolist()
print(fids)
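Note that the diff() on start_time assumes the column already holds datetimes; if it is still strings, convert it first (assuming the ISO-8601 format shown in the sample):
df['start_time'] = pd.to_datetime(df['start_time'])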

Sort DataFrame columns which are dynamically generated

I have a dataframe which is similar to this
import pandas as pd

d1 = pd.DataFrame({'name': ['xyz', 'abc', 'dfg'],
                   'age': [15, 34, 22],
                   'sex': ['s1', 's2', 's3'],
                   'w-1(6)': [96, 66, 74],
                   'w-2(5)': [55, 86, 99],
                   'w-3(4)': [11, 66, 44]})
Note that in my original DataFrame the week columns are generated dynamically: the columns w-1(6), w-2(5) and w-3(4) change every week. I want to sort all three week columns in descending order of their values.
But the names of the columns cannot be used as they change every week.
Is there any possible way to achieve this?
Edit: The numbers might not always be present for all three weeks, in the sense that if w-1 has no data, I won't have that column in the dataset at all. So that would mean only two week columns and not three.
You can use the column indices:
d1.sort_values(by=[d1.columns[3], d1.columns[4], d1.columns[5]], ascending=False)
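Given the edit (a week column may be missing entirely), fixed positions are fragile; a sketch of a name-based variant, assuming every week column starts with 'w-':
# select whatever week columns are actually present this week
week_cols = [c for c in d1.columns if c.startswith('w-')]
d1 = d1.sort_values(by=week_cols, ascending=False)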

Pandas: going from long to wide format in a dataframe

I am having trouble going from a long format to a wide one, in pandas. There are plenty of examples going from wide to long, but I did not find one from long to wide.
I am trying to reformat my dataframe and pivot, groupby, unstack are a bit confusing for my use case.
This is how I want it to be. The numbers are actually the intensity column from the second image.
And this is how it is now
I tried to build a MultiIndex based on Peptide, Charge and Protein. Then I tried to pivot based on that multi index, and keep all the samples and their intensity as values:
df.set_index(['Peptide', 'Charge', 'Protein'], append=False)
df.pivot(index=df.index, columns='Sample', values='Intensity')
Of course, this does not work since my index is now a combination of the 3 and not an actual column in the dataframe.
It tells me
KeyError: None of [RangeIndex(start=0, stop=3397898, step=1)] are in the [columns]
I tried also to group by, but I am not sure how to move from the long format back to wide. I am quite new to the dataframe way of thinking and I want to learn how to do this right.
It was very tempting for me to take an old-school, Java-like approach with four for loops and build it as a matrix. Thank you in advance!
I think based on your attempt that this might work:
df2 = df.pivot(index=['Peptide', 'Charge', 'Protein'], columns='Sample', values='Intensity').reset_index()
After that, if you want to remove the name from the column axis:
df2 = df2.rename_axis(None, axis=1)
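Note that pivot requires each (Peptide, Charge, Protein, Sample) combination to be unique and raises a ValueError on duplicates. In that case pivot_table is an alternative, since it aggregates the duplicates (summing intensities here is an assumption; choose the aggregation that fits the data):
df2 = (df.pivot_table(index=['Peptide', 'Charge', 'Protein'],
                      columns='Sample', values='Intensity', aggfunc='sum')
         .reset_index()
         .rename_axis(None, axis=1))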

How to set parameters for new column in pandas dataframe OR for a value count on python?

I'm using some data from Kaggle about blue plaques in Europe. Many of these plaques describe famous people, but others describe places or events or animals. The dataframe includes the years of both birth and death for those famous people, and I have added a new column that displays the age of the lead subject at their time of death with the following code:
plaques['subject_age'] = plaques['lead_subject_died_in'] - plaques['lead_subject_born_in']
This works for some of the dataset, but since some of the subjects don't have values for the columns 'lead_subject_died_in' and 'lead_subject_born_in', some of my results are funky.
I was trying to determine the most common age of death with this:
agecount = plaques['subject_age'].value_counts()
print(agecount)
and I got some crazy stuff: negative numbers, values of 600+, etc. How do I make it count only the values for people who actually have data in both of those columns?
By the way, I'm a beginner, so if the operations you suggest are very difficult, please explain what they're doing so that I can learn and use it in the future!
You can use dropna to remove the rows with NaN values in those two columns:
# remove rows missing either the birth or the death year
plaques = plaques.dropna(subset=['lead_subject_died_in', 'lead_subject_born_in'])
plaques['subject_age'] = plaques['lead_subject_died_in'] - plaques['lead_subject_born_in']
# get the most frequent age
plaques['subject_age'].value_counts().idxmax()
# get the five most common ages
plaques['subject_age'].value_counts().head()
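Since negative ages and values over 600 suggest bad entries as well as missing ones, a hedged follow-up is to restrict the count to plausible ages (the 0-120 bound is an assumption, not something taken from the dataset):
# keep only plausible human ages before counting (0-120 is an assumed bound)
valid_ages = plaques.loc[plaques['subject_age'].between(0, 120), 'subject_age']
print(valid_ages.value_counts().head())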

Subtract each column by the preceding column on Dataframe in Python

Using the following code I can build a simple table with the current COVID-19 cases worldwide, per country:
url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
raw_data = pd.read_csv(url, sep=",")
raw_data.drop(['Province/State','Lat','Long'], axis = 1, inplace = True)
plot_data = raw_data.groupby('Country/Region').sum()
The plot_data is a simple DataFrame:
What I would like to do now is to subtract the values in each column from the values in the column for the prior day, i.e., I want to get the new cases per day.
If I do something like plot_data['3/30/20'].add(-plot_data['3/29/20']), it works well. But if I do something like plot_data.iloc[:,68:69].add(-plot_data.iloc[:,67:68]), I get two columns of NaN values; i.e., Python tries to "preserve" the column headers and doesn't perform the operation the way I would like it to.
My goal was to perform this operation in an "elegant way". I was thinking of something along the lines of plot_data.iloc[:,1:69].add(-plot_data.iloc[:,0:68]). But of course, if it doesn't work in the single-column example, it doesn't work with multiple columns either (as Python will match the column headers and return a bunch of zeros/NaN values).
Maybe there is a way to tell Python to ignore the headers during an operation with a DataFrame? I know that I can transform my DataFrame into a NumPy array and do a bunch of operations. However, since this is a simple/small table, I thought I would try to keep using a DataFrame data type.
The good old shift can be used on the horizontal axis:
plot_data - plot_data.shift(-1, axis=1)
should be what you want.
Thank you very much @Serge Ballesta! Your answer is exactly the type of "elegant solution" I was looking for. The only comment is that the shift sign should be positive:
plot_data - plot_data.shift(1, axis=1)
This way we bring the historical figures forward one day and now I can subtract it from the actual numbers on each day.
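For reference, DataFrame.diff along axis=1 computes the same column-minus-previous-column change directly:
# each day's column minus the previous day's column
daily_new = plot_data.diff(axis=1)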
