I have a dataframe as follows. I'm attempting to sum the values in the Total column for each unique pair from columns P_buy and P_sell (each pair can occur on several dates).
+-------+----------+------+----------+-------+--------+-------+
| Index | Date     | Type | Quantity | P_buy | P_sell | Total |
+-------+----------+------+----------+-------+--------+-------+
| 0     | 1/1/2020 | 1    | 10       | 1     | 1      | 10    |
| 1     | 1/1/2020 | 1    | 10       | 2     | 1      | 20    |
| 2     | 1/1/2020 | 2    | 20       | 3     | 1      | 25    |
| 3     | 1/1/2020 | 2    | 20       | 4     | 1      | 20    |
| 4     | 2/1/2020 | 3    | 30       | 1     | 1      | 35    |
| 5     | 2/1/2020 | 3    | 30       | 2     | 1      | 30    |
| 6     | 2/1/2020 | 1    | 40       | 3     | 1      | 45    |
| 7     | 2/1/2020 | 1    | 40       | 4     | 1      | 40    |
| 8     | 3/1/2020 | 2    | 50       | 1     | 1      | 55    |
| 9     | 3/1/2020 | 2    | 50       | 2     | 1      | 53    |
+-------+----------+------+----------+-------+--------+-------+
My desired output would be as follows, where for each unique P_buy/P_sell pair I get the sum of Total across all dates:
+-------+--------+-------+
| P_buy | P_sell | Total |
+-------+--------+-------+
| 1     | 1      | 100   |
| 2     | 1      | 103   |
| 3     | 1      | 70    |
+-------+--------+-------+
My attempts have been using the groupby function, but I haven't been able to implement it successfully.
# use a groupby on the desired columns and sum the total
df.groupby(['P_buy','P_sell'], as_index=False)['Total'].sum()
P_buy P_sell Total
0 1 1 100
1 2 1 103
2 3 1 70
3 4 1 60
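If you also need the totals broken down by date, as the question mentions, include Date in the grouping keys; a minimal sketch, assuming the same df:

# group by date as well as by the price pair
df.groupby(['Date', 'P_buy', 'P_sell'], as_index=False)['Total'].sum()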
I have a dataset similar to this sample below:
| id | old_a | old_b | new_a | new_b |
|----|-------|-------|-------|-------|
| 6 | 3 | 0 | 0 | 0 |
| 6 | 9 | 0 | 2 | 0 |
| 13 | 3 | 0 | 0 | 0 |
| 13 | 37 | 0 | 0 | 1 |
| 13 | 30 | 0 | 0 | 6 |
| 13 | 12 | 2 | 0 | 0 |
| 6 | 7 | 0 | 2 | 0 |
| 6 | 8 | 0 | 0 | 0 |
| 6 | 19 | 0 | 3 | 0 |
| 6 | 54 | 0 | 0 | 0 |
| 87 | 6 | 0 | 2 | 0 |
| 87 | 11 | 1 | 1 | 0 |
| 87 | 25 | 0 | 1 | 0 |
| 87 | 10 | 0 | 0 | 0 |
| 9 | 8 | 1 | 0 | 0 |
| 9 | 19 | 0 | 2 | 0 |
| 9 | 1 | 0 | 0 | 0 |
| 9 | 34 | 0 | 7 | 0 |
I'm providing this sample dataset for the above table:
import pandas as pd

data = [[6,3,0,0,0],[6,9,0,2,0],[13,3,0,0,0],[13,37,0,0,1],[13,30,0,0,6],[13,12,2,0,0],[6,7,0,2,0],
        [6,8,0,0,0],[6,19,0,3,0],[6,54,0,0,0],[87,6,0,2,0],[87,11,1,1,0],[87,25,0,1,0],[87,10,0,0,0],
        [9,8,1,0,0],[9,19,0,2,0],[9,1,0,0,0],[9,34,0,7,0]]
data = pd.DataFrame(data, columns=['id','old_a','old_b','new_a','new_b'])
I want to look at columns 'new_a' and 'new_b' for each id. If even a single non-zero value exists in one of these columns for an id, I want to count it as 1, irrespective of how many times a value occurs, and assign 0 if no value is present. For example, for id '9' there are two distinct values in new_a, but I want to count it as 1. Similarly, for id '13' there are no values in new_a, so I would assign it 0.
My final output should look like:
| id | new_a | new_b |
|----|-------|-------|
| 6 | 1 | 0 |
| 9 | 1 | 0 |
| 13 | 0 | 1 |
| 87 | 1 | 0 |
I would eventually want to calculate the % of clients using new_a and new_b. So from the above table, 75% of clients use new_a and 25% use new_b. I'm a beginner in Python and not sure how to proceed with this.
Use GroupBy.any, because 0 is treated as False; then convert the boolean output to integers:
df = data.groupby('id')[['new_a','new_b']].any().astype(int).reset_index()
print (df)
id new_a new_b
0 6 1 0
1 9 1 0
2 13 0 1
3 87 1 0
For the percentages, take the mean of the output above; the mean of a 0/1 column is exactly the fraction of ones:
s = df[['new_a','new_b']].mean().mul(100)
print (s)
new_a 75.0
new_b 25.0
dtype: float64
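The same flags can also be computed without any, by testing each cell against zero and taking the group maximum; a minimal sketch under the same sample data:

# True where a cell is non-zero, then "any row in the group" via max
df = data[['new_a','new_b']].ne(0).groupby(data['id']).max().astype(int).reset_index()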
I have the following pandas DataFrame, where the first column is the datetime index. I am trying to achieve the desired_output column, which increments every time the flag changes from 0 to 1 or from 1 to 0. I have been able to achieve this sort of thing in SQL, but pandasql's sqldf for some strange reason changes the values of the field undergoing the partition, so I am now trying to achieve it using regular pandas syntax.
Any help would be much appreciated.
+-------------+------+----------------+
| date(index) | flag | desired_output |
+-------------+------+----------------+
| 1/01/2020   | 0    | 1              |
| 2/01/2020   | 0    | 1              |
| 3/01/2020   | 0    | 1              |
| 4/01/2020   | 1    | 2              |
| 5/01/2020   | 1    | 2              |
| 6/01/2020   | 0    | 3              |
| 7/01/2020   | 1    | 4              |
| 8/01/2020   | 1    | 4              |
| 9/01/2020   | 1    | 4              |
| 10/01/2020  | 1    | 4              |
| 11/01/2020  | 1    | 4              |
| 12/01/2020  | 1    | 4              |
| 13/01/2020  | 0    | 5              |
| 14/01/2020  | 0    | 5              |
| 15/01/2020  | 0    | 5              |
| 16/01/2020  | 0    | 5              |
| 17/01/2020  | 1    | 6              |
| 18/01/2020  | 0    | 7              |
| 19/01/2020  | 0    | 7              |
| 20/01/2020  | 0    | 7              |
| 21/01/2020  | 0    | 7              |
| 22/01/2020  | 1    | 8              |
| 23/01/2020  | 1    | 8              |
+-------------+------+----------------+
Use diff to detect where flag changes, ne(0) to mark those positions, and cumsum to number the runs:
print (df["flag"].diff().ne(0).cumsum())
0 1
1 1
2 1
3 2
4 2
5 3
6 4
7 4
8 4
9 4
10 4
11 4
12 5
13 5
14 5
15 5
16 6
17 7
18 7
19 7
20 7
21 8
22 8
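To attach this as the desired_output column, assign it back; note that the first diff is NaN, which also compares not-equal to 0, so the counter starts at 1:

# every change of flag (and the initial NaN) starts a new run
df["desired_output"] = df["flag"].diff().ne(0).cumsum()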
I have loaded raw_data from MySQL using SQLAlchemy and PyMySQL:

from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://[user]:[passwd]@[host]:[port]/[database]')
df = pd.read_sql_table('data', engine)

df is something like this:
| Age Category | Category |
|--------------|----------------|
| 31-26 | Engaged |
| 26-31 | Engaged |
| 31-36 | Not Engaged |
| Above 51 | Engaged |
| 41-46 | Disengaged |
| 46-51 | Nearly Engaged |
| 26-31 | Disengaged |
Then I performed the analysis as follows:
age = pd.crosstab(df['Age Category'], df['Category'])
| Category | A | B | C | D |
|--------------|---|----|----|---|
| Age Category | | | | |
| 21-26 | 2 | 2 | 4 | 1 |
| 26-31 | 7 | 11 | 12 | 5 |
| 31-36 | 3 | 5 | 5 | 2 |
| 36-41 | 2 | 4 | 1 | 7 |
| 41-46 | 0 | 1 | 3 | 2 |
| 46-51 | 0 | 0 | 2 | 3 |
| Above 51 | 0 | 3 | 0 | 6 |
I want to change it to a pandas DataFrame, something like this:
| Age Category | A | B | C | D |
|--------------|---|----|----|---|
| 21-26 | 2 | 2 | 4 | 1 |
| 26-31 | 7 | 11 | 12 | 5 |
| 31-36 | 3 | 5 | 5 | 2 |
| 36-41 | 2 | 4 | 1 | 7 |
| 41-46 | 0 | 1 | 3 | 2 |
| 46-51 | 0 | 0 | 2 | 3 |
| Above 51 | 0 | 3 | 0 | 6 |
Thank you for your time and consideration
Both of these labels are the columns and index names; the solution for changing them is DataFrame.rename_axis:
age = age.rename_axis(index=None, columns='Age Category')
Or set the columns name from the index name, and then reset the index name to its default, None:
age.columns.name = age.index.name
age.index.name = None
print (age)
Age Category Disengaged Engaged Nearly Engaged Not Engaged
26-31 1 1 0 0
31-26 0 1 0 0
31-36 0 0 0 1
41-46 1 0 0 0
46-51 0 0 1 0
Above 51 0 1 0 0
But these labels are something like metadata, so some functions may remove them.
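If you instead want Age Category as a regular column rather than axis metadata (for example, before writing to a database), one option is reset_index; a minimal sketch, assuming the original crosstab output before renaming its axes:

# turn the index into an ordinary column named 'Age Category'
age = age.reset_index()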
I have a pandas DataFrame created using the cross tabulation function:
df = pd.crosstab(db['Age Category'], db['Category'])
| Age Category | A | B | C | D |
|--------------|---|----|----|---|
| 21-26 | 2 | 2 | 4 | 1 |
| 26-31 | 7 | 11 | 12 | 5 |
| 31-36 | 3 | 5 | 5 | 2 |
| 36-41 | 2 | 4 | 1 | 7 |
| 41-46 | 0 | 1 | 3 | 2 |
| 46-51 | 0 | 0 | 2 | 3 |
| Above 51 | 0 | 3 | 0 | 6 |
df.dtypes gives me:
Age Category
A int64
B int64
C int64
D int64
dtype: object
But when I write this to MySQL, I am not getting the first column. The output in MySQL is shown below:
| A | B | C | D |
|---|----|----|---|
| 2 | 2 | 4 | 1 |
| 7 | 11 | 12 | 5 |
| 3 | 5 | 5 | 2 |
| 2 | 4 | 1 | 7 |
| 0 | 1 | 3 | 2 |
| 0 | 0 | 2 | 3 |
| 0 | 3 | 0 | 6 |
I want to write it to MySQL with the first column included.
I have created the connection using SQLAlchemy and PyMySQL:

engine = create_engine('mysql+pymysql://[user]:[passwd]@[host]:[port]/[database]')
and I am writing using pd.to_sql():
df.to_sql(name = 'demo', con = engine, if_exists = 'replace', index = False)
but this is not giving me the first column in MySQL.
Thank you for your time and consideration.
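A note on the likely cause: crosstab puts Age Category into the DataFrame's index, and index=False tells to_sql to drop the index. A minimal sketch of two possible fixes, assuming the same engine and table name:

# option 1: write the index as a column
df.to_sql(name='demo', con=engine, if_exists='replace', index=True, index_label='Age Category')

# option 2: turn the index into a regular column first
df.reset_index().to_sql(name='demo', con=engine, if_exists='replace', index=False)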