I have a few pandas DataFrames and I am trying to find a good way to calculate and plot the number of times each unique entry occurs across the DataFrames. As an example, if I had the two following DataFrames:
  year month
0 1900 1
1 1950 2
2 2000 3

  year month
0 1900 1
1 1975 2
2 2000 3
I was thinking maybe there is a way to combine them into a single DataFrame, using a new counts column to keep track of the number of times each unique year + month combination occurred in any of the DataFrames. From there I figured I could just scatter plot the year + month combinations with their corresponding counts.
year month counts
0 1900 1 2
1 1950 2 1
2 2000 3 2
3 1975 2 1
Is there a good way to achieve this?
concat, then use groupby with agg:
pd.concat([df1,df2]).groupby('year').month.agg(['count','first']).reset_index().rename(columns={'first':'month'})
Out[467]:
year count month
0 1900 2 1
1 1950 1 2
2 1975 1 2
3 2000 2 3
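If you want to keep each year + month combination rather than grouping on year alone, and then plot it, a minimal sketch along those lines (assuming the two frames are named df1 and df2 as above, and using matplotlib for the scatter plot) could be:

import pandas as pd
import matplotlib.pyplot as plt

# Count every unique (year, month) combination across both frames
counts = (pd.concat([df1, df2])
            .groupby(['year', 'month'])
            .size()
            .reset_index(name='counts'))

# Scatter plot: year on x, month on y, point size scaled by the count
plt.scatter(counts['year'], counts['month'], s=counts['counts'] * 50)
plt.xlabel('year')
plt.ylabel('month')
plt.show()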
So essentially I am using Python with pandas and I have 2 DataFrames based around sales of items.
The first contains an index and that index's percentage sales contribution per week, where each column is a week. Something like the below:
Index  Week 1  Week 2
1      5.00%   10.00%
2      6.00%   9.00%
3      7.00%   14.00%
The second DataFrame contains an index, items, the number of weeks they have been selling for, and the total sales value over those selling weeks, like the below:
Index  Item  Selling Weeks  Total Sales
1      1     5              10
1      2     1              3
3      3     2              1
2      4     52             1000
2      5     3              2
1      6     10             34
What I would like to do is create a new column in DataFrame 2 that multiplies the Total Sales in DataFrame 2 by the column in DataFrame 1 that matches the Selling Weeks value in DataFrame 2.
Is there a way to do this?
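One possible approach, sketched below, is to melt DataFrame 1 to long form and merge it onto DataFrame 2. This assumes DataFrame 1 keeps Index as a regular column, that its week columns are literally named 'Week 1', 'Week 2', and so on, and that the percentage row is matched on Index as well as Selling Weeks; the names df1, df2 and 'Weighted Sales' are made up for illustration:

import pandas as pd

# Hypothetical miniature versions of the two DataFrames described above
df1 = pd.DataFrame({'Index': [1, 2, 3],
                    'Week 1': [0.05, 0.06, 0.07],
                    'Week 2': [0.10, 0.09, 0.14]})
df2 = pd.DataFrame({'Index': [1, 1, 3],
                    'Item': [1, 2, 3],
                    'Selling Weeks': [2, 1, 2],
                    'Total Sales': [10, 3, 1]})

# Reshape df1 to long form: one row per (Index, week number, percentage)
long_df = df1.melt(id_vars='Index', var_name='Week', value_name='Pct')
long_df['Selling Weeks'] = long_df['Week'].str.extract(r'(\d+)', expand=False).astype(int)

# Bring the matching percentage onto df2 and multiply
df2 = df2.merge(long_df[['Index', 'Selling Weeks', 'Pct']],
                on=['Index', 'Selling Weeks'], how='left')
df2['Weighted Sales'] = df2['Total Sales'] * df2['Pct']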
I would like to add a column to my dataset which corresponds to the time stamp and counts the days as steps. That is, for one year there should be 365 "steps": all payments for each account on day 1 should be labeled 1 in this column, all payments on day 2 labeled 2, and so on up to day 365. I would like it to look something like this:
account time steps
0 A 2022.01.01 1
1 A 2022.01.02 2
2 A 2022.01.02 2
3 B 2022.01.01 1
4 B 2022.01.03 3
5 B 2022.01.05 5
I have tried this:
def day_step(x):
    x['steps'] = x.time.dt.day.shift()
    return x
df = df.groupby('account').apply(day_step)
However, it only counts within each month; once a new month begins it starts again from 1.
How can I fix this to make it provide the step count for the entire year?
Use GroupBy.transform with 'first' (or 'min') to get each account's first time, subtract it from the time column, convert the timedeltas to days, and add 1:
df['time'] = pd.to_datetime(df['time'])

df['steps1'] = (df['time'].sub(df.groupby('account')['time'].transform('first'))
                          .dt.days
                          .add(1))
print(df)
account time steps steps1
0 A 2022-01-01 1 1
1 A 2022-01-02 2 2
2 A 2022-01-02 2 2
3 B 2022-01-01 1 1
4 B 2022-01-03 3 3
5 B 2022-01-05 5 5
First idea, which works only if the first row is January 1:
df['steps'] = df['time'].dt.dayofyear
My dataframe looks like this:
Year X
2000 a
2000 b
2004 c
2004 d
2004 e
2001 f
I would like to add a new column that contains the count of all rows associated with a specific year. The output would look like this:
Year X Total
2000 a 2
2000 b 2
2004 c 3
2004 d 3
2004 e 3
2001 f 1
For example, the total number of rows with the year '2004' is three, so the number three is in the Total column for every row associated with the value '2004'.
How would I go about adding this column?
df['Total'] = df['Year'].map({2000:2, 2001:1, 2004:3})
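That map is hard-coded to this exact data; a more general sketch (assuming the same df) computes the counts from the data itself with groupby and transform:

# Broadcast the per-year row count back to every row of that year
df['Total'] = df.groupby('Year')['X'].transform('count')

# Equivalent alternative: map each year to its frequency
# df['Total'] = df['Year'].map(df['Year'].value_counts())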
I have a dataframe that looks like:
               count
year person
2008 a.smith       1
     b.johns       2
     c.gilles      3
2009 a.smith       4
     b.johns       3
     c.gilles      2
in which both year and person are part of the index. I'd like to return all rows with a.smith for all years. I can locate a count for a specific year with df.loc[(2008, 'a.smith')], which outputs 1. But if I try df.loc[(:, 'a.smith')], I get SyntaxError: invalid syntax.
How do I use df.loc for a range of index values in a MultiIndex?
Using pd.IndexSlice
idx = pd.IndexSlice
df.loc[idx[:,'a.smith'],:]
Out[200]:
              count
year person
2008 a.smith      1
2009 a.smith      4
Data Input
df
Out[211]:
               count
year person
2008 a.smith       1
     b.johns       2
     c.gilles      3
2009 a.smith       4
     b.johns       3
     c.gilles      2
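For this particular pattern (every row whose person level equals one value), df.xs or a slice(None) inside loc are equivalent alternatives; a small sketch, assuming the same df:

# Cross-section on the 'person' level; drop_level=False keeps the MultiIndex intact
df.xs('a.smith', level='person', drop_level=False)

# The same selection without pd.IndexSlice, using slice(None) for "all years"
df.loc[(slice(None), 'a.smith'), :]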
I have 2 columns, df['year'] and df['month']. The year values range from 2000 to 2017 and the month values from 1 to 12.
How do I combine these into another column that contains the combined output?
Eg:
Year Month Y0M
2000 1 200001
2000 2 200002
2000 3 200003
2000 10 200010
Note: there is a 0 added in between Year and Month in the Y0M column (only for single-digit months, not double-digit ones).
Currently I am only able to do this by converting to strings, but I want to retain them as numeric types.
Alternative solution:
In [11]: df['Y0M'] = df[['Year','Month']].dot([100,1])
In [12]: df
Out[12]:
Year Month Y0M
0 2000 1 200001
1 2000 2 200002
2 2000 3 200003
3 2000 10 200010
Maybe something like df['Year'] * 100 + df['Month'] would help.
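Both suggestions keep the result numeric; a quick sketch to confirm the dtype, assuming the columns are named Year and Month as in the example:

import pandas as pd

df = pd.DataFrame({'Year': [2000, 2000, 2000, 2000],
                   'Month': [1, 2, 3, 10]})

# Multiply the year by 100 and add the month; single-digit months keep the leading 0
df['Y0M'] = df['Year'] * 100 + df['Month']

print(df.dtypes)  # Y0M stays int64, no string conversion needed
print(df)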