Combine pandas DataFrames to give unique element counts

I have a few pandas DataFrames and I am trying to find a good way to calculate and plot the number of times each unique entry occurs across DataFrames. As an example, if I had the following 2 DataFrames:
   year  month
0  1900      1
1  1950      2
2  2000      3

   year  month
0  1900      1
1  1975      2
2  2000      3
I was thinking maybe there is a way to combine them into a single DataFrame while using a new column counts to keep track of the number of times a unique combination of year + month occurred in any of the DataFrames. From there I figured I could just scatter plot the year + month combinations with their corresponding counts.
   year  month  counts
0  1900      1       2
1  1950      2       1
2  2000      3       2
3  1975      2       1
Is there a good way to achieve this?

Use concat, then groupby with agg:
pd.concat([df1, df2]).groupby('year').month.agg(['count', 'first']).reset_index().rename(columns={'first': 'month'})
Out[467]:
   year  count  month
0  1900      2      1
1  1950      1      2
2  1975      1      2
3  2000      2      3
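The snippet above groups on year alone, which works here only because each year appears with a single month. A sketch that counts unique (year, month) pairs directly and then draws the scatter plot the question mentions (df1 and df2 rebuilt from the sample data; the marker scaling is illustrative):

import matplotlib.pyplot as plt
import pandas as pd

df1 = pd.DataFrame({'year': [1900, 1950, 2000], 'month': [1, 2, 3]})
df2 = pd.DataFrame({'year': [1900, 1975, 2000], 'month': [1, 2, 3]})

# size() counts every occurrence of each unique year/month combination
counts = (pd.concat([df1, df2])
            .groupby(['year', 'month'])
            .size()
            .reset_index(name='counts'))

# scatter the year/month pairs, sizing each marker by its count
plt.scatter(counts['year'], counts['month'], s=counts['counts'] * 40)
plt.xlabel('year')
plt.ylabel('month')
plt.show()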

Related

How to Multiply a Column in a Dataframe with a Column in a separate Dataframe where the column name is dependent on the value of the first column

So essentially I am using Python with pandas and I have 2 DataFrames based around sales of items.
The first contains an index and that index's percentage sales contribution per week, where each column is a week. Something like the below:
Index  Week 1  Week 2
1      5.00%   10.00%
2      6.00%   9.00%
3      7.00%   14.00%
The second DataFrame contains an index, items, the number of weeks they have been selling for, and the total sales value across those selling weeks, like the below:
Index  Item  Selling Weeks  Total Sales
1      1     5              10
1      2     1              3
3      3     2              1
2      4     52             1000
2      5     3              2
1      6     10             34
What I would like to do is create a new column on DataFrame 2 that multiplies the Total Sales in DataFrame 2 by the column in DataFrame 1 whose name matches the Selling Weeks value in DataFrame 2.
Is there a way to do this?
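One possible sketch: look up, for each row, the "Week N" column named by that row's Selling Weeks value. Assumptions not stated in the question: DataFrame 1 is indexed by Index, its week columns hold percentage strings, and Selling Weeks values with no matching week column are left as NaN.

import pandas as pd

# sample frames rebuilt from the question
df1 = pd.DataFrame({
    'Index': [1, 2, 3],
    'Week 1': ['5.00%', '6.00%', '7.00%'],
    'Week 2': ['10.00%', '9.00%', '14.00%'],
}).set_index('Index')

df2 = pd.DataFrame({
    'Index': [1, 1, 3, 2, 2, 1],
    'Item': [1, 2, 3, 4, 5, 6],
    'Selling Weeks': [5, 1, 2, 52, 3, 10],
    'Total Sales': [10, 3, 1, 1000, 2, 34],
})

# convert the percentage strings to fractions once, up front
pct = df1.apply(lambda col: col.str.rstrip('%').astype(float) / 100)

def weighted_sales(row):
    col = f"Week {row['Selling Weeks']}"   # column named by Selling Weeks
    if col not in pct.columns:             # no matching week column -> NaN
        return float('nan')
    return pct.at[row['Index'], col] * row['Total Sales']

df2['Weighted Sales'] = df2.apply(weighted_sales, axis=1)
print(df2)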

Create a new column based on timestamp values to count the days by steps

I would like to add a column to my dataset that corresponds to the timestamp and counts the days in steps. That is, for one year there should be 365 "steps": for each account, all payments on day 1 are labeled 1 in this column, all payments on day 2 are labeled 2, and so on up to day 365. I would like it to look something like this:
  account        time  steps
0       A  2022.01.01      1
1       A  2022.01.02      2
2       A  2022.01.02      2
3       B  2022.01.01      1
4       B  2022.01.03      3
5       B  2022.01.05      5
I have tried this:
def day_step(x):
    x['steps'] = x.time.dt.day.shift()
    return x

df = df.groupby('account').apply(day_step)
however, it only counts within each month; once a new month begins it starts again from 1.
How can I fix this to make it provide the step count for the entire year?
Use GroupBy.transform with 'first' (or 'min') to get each account's first timestamp, subtract it from the time column, convert the timedeltas to days, and add 1:
df['time'] = pd.to_datetime(df['time'])
df['steps1'] = (df['time'].sub(df.groupby('account')['time'].transform('first'))
                          .dt.days
                          .add(1))
print(df)
  account       time  steps  steps1
0       A 2022-01-01      1       1
1       A 2022-01-02      2       2
2       A 2022-01-02      2       2
3       B 2022-01-01      1       1
4       B 2022-01-03      3       3
5       B 2022-01-05      5       5
A first idea, which works only if each account's first payment falls on January 1:
df['steps'] = df['time'].dt.dayofyear
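A minimal sketch of how that dayofyear idea can be normalized per account, assuming all payments fall within a single calendar year (the sample frame is rebuilt from the question):

import pandas as pd

df = pd.DataFrame({
    'account': ['A', 'A', 'A', 'B', 'B', 'B'],
    'time': ['2022.01.01', '2022.01.02', '2022.01.02',
             '2022.01.01', '2022.01.03', '2022.01.05'],
})
df['time'] = pd.to_datetime(df['time'], format='%Y.%m.%d')

# shift each account's day-of-year so its first payment lands on step 1
doy = df['time'].dt.dayofyear
df['steps'] = doy - doy.groupby(df['account']).transform('first') + 1
print(df)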

Add new column to a dataframe showing the sum of all rows containing a number [duplicate]

This question already has answers here: pandas add column to groupby dataframe (3 answers). Closed 2 years ago.
My dataframe looks like this
Year X
2000 a
2000 b
2004 c
2004 d
2004 e
2001 f
I would like to add a new column that contains the sum of all rows associated with a specific year. The output would look like this:
Year X Total
2000 a 2
2000 b 2
2004 c 3
2004 d 3
2004 e 3
2001 f 1
For example, the total number of rows with the year '2004' is three, so the number three appears in the Total column for every row with the value '2004'.
How would I go about adding this column?
df['Total'] = df['Year'].map(df['Year'].value_counts())
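A sketch of the same idea with GroupBy.transform, which broadcasts each year's row count back to its rows without building the mapping by hand:

import pandas as pd

df = pd.DataFrame({'Year': [2000, 2000, 2004, 2004, 2004, 2001],
                   'X': list('abcdef')})

# transform('count') returns one value per row, aligned with the original index
df['Total'] = df.groupby('Year')['Year'].transform('count')
print(df)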

returning rows within range in pandas MultiIndex

I have a dataframe that looks like:
              count
year person
2008 a.smith      1
     b.johns      2
     c.gilles     3
2009 a.smith      4
     b.johns      3
     c.gilles     2
in which both year and person are part of the index. I'd like to return all rows with a.smith for all years. I can locate a count for a specific year with df.loc[(2008, 'a.smith')], which outputs 1. But if I try df.loc[(:, 'a.smith')], I get SyntaxError: invalid syntax.
How do I use df.loc for a range of index values in a MultiIndex?
Using pd.IndexSlice:
idx = pd.IndexSlice
df.loc[idx[:, 'a.smith'], :]
Out[200]:
              count
year person
2008 a.smith      1
2009 a.smith      4
Data input:
df
Out[211]:
              count
year person
2008 a.smith      1
     b.johns      2
     c.gilles     3
2009 a.smith      4
     b.johns      3
     c.gilles     2
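An alternative sketch with DataFrame.xs, which selects on a single level of the MultiIndex (drop_level=False keeps the person level in the result):

import pandas as pd

df = pd.DataFrame(
    {'count': [1, 2, 3, 4, 3, 2]},
    index=pd.MultiIndex.from_product(
        [[2008, 2009], ['a.smith', 'b.johns', 'c.gilles']],
        names=['year', 'person']),
)

# all years' rows for a.smith
print(df.xs('a.smith', level='person', drop_level=False))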

How to combine 2 integer columns in a DataFrame and keep the type as integer in Python

I have 2 columns, df['year'] and df['month']. The year values range from 2000 to 2017 and the month values from 1 to 12.
How can I combine these into another column containing the combined output?
Eg:
Year  Month  Y0M
2000  1      200001
2000  2      200002
2000  3      200003
2000  10     200010
Note: a 0 is inserted between Year and Month in the Y0M column (only for single-digit months, not double-digit ones).
Currently I can only do this by converting to strings, but I want to keep them as numbers.
Alternative solution:
In [11]: df['Y0M'] = df[['Year', 'Month']].dot([100, 1])
In [12]: df
Out[12]:
   Year  Month     Y0M
0  2000      1  200001
1  2000      2  200002
2  2000      3  200003
3  2000     10  200010
Maybe something like df['Year'] * 100 + df['Month'] would help.
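A minimal runnable sketch (column names as in the example above) confirming that both approaches keep an integer dtype:

import pandas as pd

df = pd.DataFrame({'Year': [2000, 2000, 2000, 2000],
                   'Month': [1, 2, 3, 10]})

# integer arithmetic: 2000 * 100 + 1 == 200001, and the dtype stays int64
df['Y0M'] = df['Year'] * 100 + df['Month']
print(df)
print(df.dtypes)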
