I have loaded raw data from MySQL using SQLAlchemy and PyMySQL:
engine = create_engine('mysql+pymysql://[user]:[passwd]@[host]:[port]/[database]')
df = pd.read_sql_table('data', engine)
df looks something like this:
| Age Category | Category |
|--------------|----------------|
| 31-26 | Engaged |
| 26-31 | Engaged |
| 31-36 | Not Engaged |
| Above 51 | Engaged |
| 41-46 | Disengaged |
| 46-51 | Nearly Engaged |
| 26-31 | Disengaged |
Then I performed the following analysis:
age = pd.crosstab(df['Age Category'], df['Category'])
| Category | A | B | C | D |
|--------------|---|----|----|---|
| Age Category | | | | |
| 21-26 | 2 | 2 | 4 | 1 |
| 26-31 | 7 | 11 | 12 | 5 |
| 31-36 | 3 | 5 | 5 | 2 |
| 36-41 | 2 | 4 | 1 | 7 |
| 41-46 | 0 | 1 | 3 | 2 |
| 46-51 | 0 | 0 | 2 | 3 |
| Above 51 | 0 | 3 | 0 | 6 |
I want to change it to a pandas DataFrame like this:
| Age Category | A | B | C | D |
|--------------|---|----|----|---|
| 21-26 | 2 | 2 | 4 | 1 |
| 26-31 | 7 | 11 | 12 | 5 |
| 31-36 | 3 | 5 | 5 | 2 |
| 36-41 | 2 | 4 | 1 | 7 |
| 41-46 | 0 | 1 | 3 | 2 |
| 46-51 | 0 | 0 | 2 | 3 |
| Above 51 | 0 | 3 | 0 | 6 |
Thank you for your time and consideration
Both labels are the columns and index names; the solution to change them is DataFrame.rename_axis:
age = age.rename_axis(index=None, columns='Age Category')
Or set the columns name from the index name, and then reset the index name to the default, None:
age.columns.name = age.index.name
age.index.name = None
print (age)
Age Category Disengaged Engaged Nearly Engaged Not Engaged
26-31 1 1 0 0
31-26 0 1 0 0
31-36 0 0 0 1
41-46 1 0 0 0
46-51 0 0 1 0
Above 51 0 1 0 0
But these labels are metadata, so some functions may remove them.
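A minimal, self-contained sketch of the second approach, using a small made-up frame in place of the MySQL data:

```python
import pandas as pd

# hypothetical stand-in for the data loaded from MySQL
df = pd.DataFrame({'Age Category': ['26-31', '26-31', '41-46'],
                   'Category': ['Engaged', 'Disengaged', 'Disengaged']})
age = pd.crosstab(df['Age Category'], df['Category'])

# move the index name onto the columns, then clear the index name
age.columns.name = age.index.name
age.index.name = None
print(age)
```

After this, the crosstab prints with 'Age Category' above the column labels and no separate index name line.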
I have a dataframe as follows. I'm attempting to sum the values in the Total column, for each date, for each unique pair from columns P_buy and P_sell.
+--------+----------+-------+---------+--------+----------+-----------------+
| Index | Date | Type | Quantity| P_buy | P_sell | Total |
+--------+----------+-------+---------+--------+----------+-----------------+
| 0 | 1/1/2020 | 1 | 10 | 1 | 1 | 10 |
| 1 | 1/1/2020 | 1 | 10 | 2 | 1 | 20 |
| 2 | 1/1/2020 | 2 | 20 | 3 | 1 | 25 |
| 3 | 1/1/2020 | 2 | 20 | 4 | 1 | 20 |
| 4 | 2/1/2020 | 3 | 30 | 1 | 1 | 35 |
| 5 | 2/1/2020 | 3 | 30 | 2 | 1 | 30 |
| 6 | 2/1/2020 | 1 | 40 | 3 | 1 | 45 |
| 7 | 2/1/2020 | 1 | 40 | 4 | 1 | 40 |
| 8 | 3/1/2020 | 2 | 50 | 1 | 1 | 55 |
| 9 | 3/1/2020 | 2 | 50 | 2 | 1 | 53 |
+--------+----------+-------+---------+--------+----------+-----------------+
My desired output is as follows, where for each unique P_buy/P_sell pair I get the sum of Total across all dates.
+--------+----------+-------+---------+
| P_buy | P_sell | Total |
+--------+----------+-------+---------+
| 1 | 1 | 100 |
| 2 | 1 | 103 |
| 3 | 1 | 70 |
+--------+----------+-------+---------+
My attempts have used the groupby function, but I haven't been able to implement it successfully.
# use a groupby on the desired columns and sum the total
df.groupby(['P_buy','P_sell'], as_index=False)['Total'].sum()
P_buy P_sell Total
0 1 1 100
1 2 1 103
2 3 1 70
3 4 1 60
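For reference, here is the same groupby as a self-contained sketch, with the frame rebuilt from the question's table:

```python
import pandas as pd

# rebuild the question's frame (values copied from the table above)
df = pd.DataFrame({
    'Date': ['1/1/2020'] * 4 + ['2/1/2020'] * 4 + ['3/1/2020'] * 2,
    'Type': [1, 1, 2, 2, 3, 3, 1, 1, 2, 2],
    'Quantity': [10, 10, 20, 20, 30, 30, 40, 40, 50, 50],
    'P_buy': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2],
    'P_sell': [1] * 10,
    'Total': [10, 20, 25, 20, 35, 30, 45, 40, 55, 53],
})

# sum Total over each unique (P_buy, P_sell) pair; as_index=False keeps
# the grouping keys as regular columns instead of an index
out = df.groupby(['P_buy', 'P_sell'], as_index=False)['Total'].sum()
print(out)
```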
I am new to Python. I have two dataframes: df1 contains information about all students with their group and score, and df2 contains updated information for a few students who changed their group and score. How can I update the information in df1 based on the values of df2 (group and score)?
df1
+----+----------+-----------+----------------+
| |student No| group | score |
|----+----------+-----------+----------------|
| 0 | 0 | 0 | 0.839626 |
| 1 | 1 | 0 | 0.845435 |
| 2 | 2 | 3 | 0.830778 |
| 3 | 3 | 2 | 0.831565 |
| 4 | 4 | 3 | 0.823569 |
| 5 | 5 | 0 | 0.808109 |
| 6 | 6 | 4 | 0.831645 |
| 7 | 7 | 1 | 0.851048 |
| 8 | 8 | 3 | 0.843209 |
| 9 | 9 | 4 | 0.84902 |
| 10 | 10 | 0 | 0.835143 |
| 11 | 11 | 4 | 0.843228 |
| 12 | 12 | 2 | 0.826949 |
| 13 | 13 | 0 | 0.84196 |
| 14 | 14 | 1 | 0.821634 |
| 15 | 15 | 3 | 0.840702 |
| 16 | 16 | 0 | 0.828994 |
| 17 | 17 | 2 | 0.843043 |
| 18 | 18 | 4 | 0.809093 |
| 19 | 19 | 1 | 0.85426 |
+----+----------+-----------+----------------+
df2
+----+-----------+----------+----------------+
| | group |student No| score |
|----+-----------+----------+----------------|
| 0 | 2 | 1 | 0.887435 |
| 1 | 0 | 19 | 0.81214 |
| 2 | 3 | 17 | 0.899041 |
| 3 | 0 | 8 | 0.853333 |
| 4 | 4 | 9 | 0.88512 |
+----+-----------+----------+----------------+
The desired result:
+----+----------+-----------+----------------+
| |student No| group | score |
|----+----------+-----------+----------------|
| 0 | 0 | 0 | 0.839626 |
| 1 | 1 | 2 | 0.887435 |
| 2 | 2 | 3 | 0.830778 |
| 3 | 3 | 2 | 0.831565 |
| 4 | 4 | 3 | 0.823569 |
| 5 | 5 | 0 | 0.808109 |
| 6 | 6 | 4 | 0.831645 |
| 7 | 7 | 1 | 0.851048 |
| 8 | 8 | 0 | 0.853333 |
| 9 | 9 | 4 | 0.88512 |
| 10 | 10 | 0 | 0.835143 |
| 11 | 11 | 4 | 0.843228 |
| 12 | 12 | 2 | 0.826949 |
| 13 | 13 | 0 | 0.84196 |
| 14 | 14 | 1 | 0.821634 |
| 15 | 15 | 3 | 0.840702 |
| 16 | 16 | 0 | 0.828994 |
| 17 | 17 | 3 | 0.899041 |
| 18 | 18 | 4 | 0.809093 |
| 19 | 19 | 0 | 0.81214 |
+----+----------+-----------+----------------+
My code to update df1 from df2:
dfupdated = df1.merge(df2, how='left', on=['student No'], suffixes=('', '_new'))
dfupdated['group'] = np.where(pd.notnull(dfupdated['group_new']), dfupdated['group_new'],
dfupdated['group'])
dfupdated['score'] = np.where(pd.notnull(dfupdated['score_new']), dfupdated['score_new'],
dfupdated['score'])
dfupdated.drop(['group_new', 'score_new'],axis=1, inplace=True)
dfupdated.reset_index(drop=True, inplace=True)
but I face the following error
KeyError: "['group'] not in index"
I don't know what's wrong.
I ran the same code and got the answer. Here is a different way to solve it; try:
dfupdated = df1.merge(df2, on='student No', how='left')
dfupdated['group'] = dfupdated['group_y'].fillna(dfupdated['group_x'])
dfupdated['score'] = dfupdated['score_y'].fillna(dfupdated['score_x'])
dfupdated.drop(['group_x', 'group_y','score_x', 'score_y'], axis=1,inplace=True)
This will give you the solution you want.
To get the max score from each group:
dfupdated.groupby(['group'], sort=False)['score'].max()
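Another option, not shown in the answers above, is DataFrame.update, which aligns on the index and overwrites matching cells in place. A sketch with small made-up frames (note that update may upcast integer columns to float):

```python
import pandas as pd

# small made-up frames with the same columns as df1/df2
df1 = pd.DataFrame({'student No': [0, 1, 2],
                    'group': [0, 0, 3],
                    'score': [0.84, 0.85, 0.83]})
df2 = pd.DataFrame({'group': [2], 'student No': [1], 'score': [0.89]})

# align both frames on 'student No'; update overwrites matching cells in place
dfupdated = df1.set_index('student No')
dfupdated.update(df2.set_index('student No'))
dfupdated = dfupdated.reset_index()
print(dfupdated)
```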
I have cross-tabulation data, which I created using
x = pd.crosstab(a['Age Category'], a['Category'])
| Category | A | B | C | D |
|--------------|---|----|----|---|
| Age Category | | | | |
| 21-26 | 2 | 2 | 4 | 1 |
| 26-31 | 7 | 11 | 12 | 5 |
| 31-36 | 3 | 5 | 5 | 2 |
| 36-41 | 2 | 4 | 1 | 7 |
| 41-46 | 0 | 1 | 3 | 2 |
| 46-51 | 0 | 0 | 2 | 3 |
| Above 51 | 0 | 3 | 0 | 6 |
And I want to add a new column, Total, containing the row sums, like this:
| Category | A | B | C | D | Total |
|--------------|---|----|----|---|-------|
| Age Category | | | | | |
| 21-26 | 2 | 2 | 4 | 1 | 9 |
| 26-31 | 7 | 11 | 12 | 5 | 35 |
| 31-36 | 3 | 5 | 5 | 2 | 15 |
| 36-41 | 2 | 4 | 1 | 7 | 14 |
| 41-46 | 0 | 1 | 3 | 2 | 6 |
| 46-51 | 0 | 0 | 2 | 3 | 5 |
| Above 51 | 0 | 3 | 0 | 6 | 9 |
I tried x['Total'] = x.sum(axis=1), but this code gives me TypeError: cannot insert an item into a CategoricalIndex that is not already an existing category.
Thank you for your time and consideration.
Use CategoricalIndex.add_categories to append a new category to the columns:
x.columns = x.columns.add_categories(['Total'])
x['Total'] = x.sum(axis = 1)
print (x)
A B C D Total
Category
21-26 2 2 4 1 9
26-31 7 11 12 5 35
31-36 3 5 5 2 15
36-41 2 4 1 7 14
41-46 0 1 3 2 6
46-51 0 0 2 3 5
Above 51 0 3 0 6 9
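A self-contained sketch reproducing the situation: crosstab on a categorical column yields a CategoricalIndex for the columns, so 'Total' must be registered as a category before assignment (the frame here is a small hypothetical stand-in):

```python
import pandas as pd

# hypothetical frame; the Categorical column makes crosstab
# produce a CategoricalIndex for the columns
a = pd.DataFrame({'Age Category': ['21-26', '26-31', '21-26'],
                  'Category': pd.Categorical(['A', 'B', 'A'])})
x = pd.crosstab(a['Age Category'], a['Category'])

# a plain x['Total'] = ... would raise the TypeError from the question,
# so register 'Total' as a category first
x.columns = x.columns.add_categories(['Total'])
x['Total'] = x.sum(axis=1)
print(x)
```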
I have a pandas DataFrame created using the cross-tabulation function
df = pd.crosstab(db['Age Category'], db['Category'])
| Age Category | A | B | C | D |
|--------------|---|----|----|---|
| 21-26 | 2 | 2 | 4 | 1 |
| 26-31 | 7 | 11 | 12 | 5 |
| 31-36 | 3 | 5 | 5 | 2 |
| 36-41 | 2 | 4 | 1 | 7 |
| 41-46 | 0 | 1 | 3 | 2 |
| 46-51 | 0 | 0 | 2 | 3 |
| Above 51 | 0 | 3 | 0 | 6 |
df.dtypes gives me
Age Category
A int64
B int64
C int64
D int64
dtype: object
But when I write this to MySQL, I am not getting the first column.
The Output of MySQL is shown below:
| A | B | C | D |
|---|----|----|---|
| | | | |
| 2 | 2 | 4 | 1 |
| 7 | 11 | 12 | 5 |
| 3 | 5 | 5 | 2 |
| 2 | 4 | 1 | 7 |
| 0 | 1 | 3 | 2 |
| 0 | 0 | 2 | 3 |
| 0 | 3 | 0 | 6 |
I want to write it to MySQL with the first column included.
I created the connection using SQLAlchemy and PyMySQL:
engine = create_engine('mysql+pymysql://[user]:[passwd]@[host]:[port]/[database]')
and I am writing using DataFrame.to_sql():
df.to_sql(name = 'demo', con = engine, if_exists = 'replace', index = False)
but this does not give me the first column in MySQL.
Thank you for your time and consideration.
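The likely cause is that the age categories live in the crosstab's index, and index=False drops the index on write. A hedged sketch of the fix, using an in-memory SQLite engine as a stand-in for the MySQL one:

```python
import pandas as pd
from sqlalchemy import create_engine

# small stand-in frame; in the question this comes from MySQL
db = pd.DataFrame({'Age Category': ['21-26', '26-31', '21-26'],
                   'Category': ['A', 'B', 'A']})
df = pd.crosstab(db['Age Category'], db['Category'])

engine = create_engine('sqlite://')  # in-memory stand-in for the MySQL engine

# promote the index to a regular column so index=False no longer drops it;
# alternatively, keep index=True on the original frame
df.reset_index().to_sql(name='demo', con=engine, if_exists='replace', index=False)

back = pd.read_sql_table('demo', engine)
print(back.columns.tolist())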
How can I merge/join these two dataframes ONLY on "sample_id" and drop the extra rows from the second dataframe when merging/joining?
Using pandas in Python.
First dataframe (fdf)
| sample_id | name |
|-----------|-------|
| 1 | Mark |
| 1 | Dart |
| 2 | Julia |
| 2 | Oolia |
| 2 | Talia |
Second dataframe (sdf)
| sample_id | salary | time |
|-----------|--------|------|
| 1 | 20 | 0 |
| 1 | 30 | 5 |
| 1 | 40 | 10 |
| 1 | 50 | 15 |
| 2 | 33 | 0 |
| 2 | 23 | 5 |
| 2 | 24 | 10 |
| 2 | 28 | 15 |
| 2 | 29 | 20 |
So the resulting df will be like:
| sample_id | name | salary | time |
|-----------|-------|--------|------|
| 1 | Mark | 20 | 0 |
| 1 | Dart | 30 | 5 |
| 2 | Julia | 33 | 0 |
| 2 | Oolia | 23 | 5 |
| 2 | Talia | 24 | 10 |
There are duplicates, so you need a helper column for a correct DataFrame.merge, with GroupBy.cumcount as a counter:
df = (fdf.assign(g=fdf.groupby('sample_id').cumcount())
.merge(sdf.assign(g=sdf.groupby('sample_id').cumcount()), on=['sample_id', 'g'])
.drop('g', axis=1))
print (df)
sample_id name salary time
0 1 Mark 20 0
1 1 Dart 30 5
2 2 Julia 33 0
3 2 Oolia 23 5
4 2 Talia 24 10
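The helper-column approach above, as a self-contained sketch with the frames rebuilt from the question's tables:

```python
import pandas as pd

fdf = pd.DataFrame({'sample_id': [1, 1, 2, 2, 2],
                    'name': ['Mark', 'Dart', 'Julia', 'Oolia', 'Talia']})
sdf = pd.DataFrame({'sample_id': [1, 1, 1, 1, 2, 2, 2, 2, 2],
                    'salary': [20, 30, 40, 50, 33, 23, 24, 28, 29],
                    'time': [0, 5, 10, 15, 0, 5, 10, 15, 20]})

# number the rows within each sample_id so the merge pairs them positionally;
# the inner merge then drops sdf's extra rows automatically
df = (fdf.assign(g=fdf.groupby('sample_id').cumcount())
         .merge(sdf.assign(g=sdf.groupby('sample_id').cumcount()),
                on=['sample_id', 'g'])
         .drop('g', axis=1))
print(df)
```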
An alternative without the helper column; note that after the left merge, sorting by time and keeping the first row per (sample_id, name) gives every name within a group the same earliest salary/time row, which differs from the positional pairing shown above:
final_res = pd.merge(fdf, sdf, on=['sample_id'], how='left')
final_res.sort_values(['sample_id','name','time'], ascending=[True,True,True], inplace=True)
final_res.drop_duplicates(subset=['sample_id','name'], keep='first', inplace=True)