I'm trying to run a crosstab on my dataframe called "d_recent" using the following line of code:
pd.crosstab(d_recent['BinnedAge'], d_recent['APBI'])
The output I am getting is this:
|Age Bin|Brachytherapy|EBRT|IORT|
|-------|-------------|----|----|
|51-60|1|1|0|
|71-80|86|62|11|
|61-70|2578|723|276|
|41-50|9386|2049|1188|
|81-90|13860|3257|2449|
|31-40|7725|2078|1628|
|21-30|1958|615|425|
This is wrong. What it should look like is:
|Age Bin|Brachytherapy|EBRT|IORT|
|-------|-------------|----|----|
|21-30|1|1|0|
|31-40|86|62|11|
|41-50|2578|723|276|
|51-60|9386|2049|1188|
|61-70|13860|3257|2449|
|71-80|7725|2078|1628|
|81-90|1958|615|425|
Any idea what is going on here and how I can fix it? I can tell that the order of the rows in the first table is related to the order in which the bins are first encountered in my dataframe. I can get the correct output if I sort by age before running the crosstab, but that isn't a preferable solution because I need to do this with multiple variables. Thanks!
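A sketch of one possible fix, assuming the column names from the question: convert BinnedAge to an ordered categorical so crosstab emits the bins in their natural order, without sorting the whole dataframe first.

```python
import pandas as pd

# Assumed bin labels, taken from the tables in the question.
bin_order = ['21-30', '31-40', '41-50', '51-60', '61-70', '71-80', '81-90']
d_recent['BinnedAge'] = pd.Categorical(d_recent['BinnedAge'],
                                       categories=bin_order, ordered=True)

# crosstab now lists the rows in category order.
print(pd.crosstab(d_recent['BinnedAge'], d_recent['APBI']))

# Alternatively, sort the result's index after the fact:
# pd.crosstab(d_recent['BinnedAge'], d_recent['APBI']).sort_index()
```

Either way, the underlying dataframe is left unsorted, so the same approach can be repeated for other variables.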
I want to create a new column (not derived from any existing column) with type MapType(StringType(), StringType()). How do I achieve this?
So far, I was able to write the below code but it's not working.
df = df.withColumn("normalizedvariation",lit(None).cast(MapType(StringType(),StringType())))
I would also like to know about the different methods to achieve the same. Thank you.
This is one way you can try; newMapColumn here would be the name of the map column.
You can see the output I got below. If that is not what you are looking for, please let me know. Thanks!
Also you will have to import the functions using the below line:
from pyspark.sql.functions import col,lit,create_map
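A minimal sketch of what the create_map approach could look like; the column name newMapColumn and the key/value pairs below are purely illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, create_map

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,)], ["id"])  # hypothetical input frame

# create_map takes alternating key/value expressions; with string literals the
# resulting column has type MapType(StringType, StringType).
df = df.withColumn("newMapColumn",
                   create_map(lit("key1"), lit("value1"),
                              lit("key2"), lit("value2")))

df.printSchema()  # newMapColumn: map<string,string>
```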
I have a dataframe as such:
I wish to transpose it to:
I understand that this might be a basic question, so if someone could direct me to the correct references, I can try to figure out how to do this in pandas.
try with melt() and set_index():
out=(df.melt(id_vars=['Market','Product'],var_name='Date',value_name='Value')
.set_index('Date'))
If needed use:
out.index.name=None
Now, if you print out, you will get your desired output.
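For illustration, here is how the reshape behaves on a small, made-up frame; the Market, Product, and date columns are assumptions based on the question.

```python
import pandas as pd

# Hypothetical input in the wide layout described in the question.
df = pd.DataFrame({
    'Market':  ['US', 'EU'],
    'Product': ['A', 'B'],
    '2021-01': [10, 20],
    '2021-02': [11, 21],
})

out = (df.melt(id_vars=['Market', 'Product'], var_name='Date', value_name='Value')
         .set_index('Date'))
out.index.name = None
print(out)
#          Market Product  Value
# 2021-01      US       A     10
# 2021-01      EU       B     20
# 2021-02      US       A     11
# 2021-02      EU       B     21
```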
I got a question regarding the multiple aggregation in pandas.
Originally I have a dataset which shows the oil price, and the detail is as follows:
And the head of the dataset is as follows:
What I want to do here is to get the mean and standard deviation for each quarter of the year 2014. And the ideal output is as follows:
In my script, I have already created the quarter info.
However, there is one thing that I do not understand here.
If I try to use this command to do so:
brent[brent.index.year == 2014].groupby('quarter').agg({"average_price": np.mean, "std_price": np.std})
I got an error as follows:
And if I use the following script, then it works
brent[brent.index.year == 2014].groupby('quarter').agg(average_price=('Price','mean'),
std_price=('Price','std'))
So the questions are:
What's wrong with the first approach here?
And why do we need to use the second approach here?
Thank you all for the help in advance!
What's wrong with the first approach here?
A dict is passed, so pandas looks for columns matching the keys average_price and std_price; because those columns do not exist in the DataFrame, it raises an error.
A possible solution is to select the column after groupby and pass a list of tuples that specify the new column names together with their aggregate functions:
brent[brent.index.year == 2014].groupby('quarter')['Price'].agg([('average_price','mean'),('std_price',np.std)])
This is possible because, for the single column Price, multiple output column names can be defined.
In later pandas versions, named aggregation is used:
brent[brent.index.year == 2014].groupby('quarter').agg(average_price=('Price','mean'),
std_price=('Price',np.std))
Here is the logic: for each aggregation, a new column name is defined together with the column to aggregate and the aggregate function. So it is possible to aggregate multiple columns with different functions:
brent[brent.index.year == 2014].groupby('quarter').agg(average_price=('Price','mean'),
std_price=('Price',np.std),
sumQ=('quarter','sum'))
Notice that np.std has a default of ddof=0 while pandas std uses ddof=1, so the outputs differ.
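A quick illustration of that ddof difference:

```python
import numpy as np
import pandas as pd

vals = [1, 2, 3]
print(np.std(vals))           # 0.816... (population std, ddof=0)
print(pd.Series(vals).std())  # 1.0      (sample std, ddof=1)
```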
This problem has been solved (I think). Excel was the problem and not Python after all. The code below should work for my needs and does not seem to be dropping rows after all.
Rows highlighted in yellow are the rows I want to select in DF1. The selection should be made based on the values in column_2 of DF1 that match the values in column_1 of DF2.
Here was my preferred solution using the pandas package in Python, after a lot of trial and error/searching:
NEW_MATCHED_DF1 = DF1.loc[DF1['column_2'].isin(DF2['column_1'])]
The problem I am seeing is that when I compare my results to what happens in Excel when I do the same thing, I am getting almost double the results, and I think that my Python technique is dropping duplicates. Of course, it is possible that I am doing something wrong in Excel, or that Excel is incorrect for some other reason, but it is something I have verified in the past and I am much more familiar with Excel, so I suspected it was more likely that I was doing something wrong in Python. EXCEL IS THE PROBLEM AFTER ALL!! :/
Ultimately, I would like to use Python to select any and all rows in DF1 where column_2 of DF1 matches column_1 of DF2. Excel is absurdly slow and I would like to move away from using it for manipulating large dataframes.
I appreciate any help or directions. I really haven't been able to figure out whether my code is in fact dropping duplicates, and/or whether there is another solution that I can be confident won't do this.
Try this using np.where:
import numpy as np
list_df2 = df2['column1'].unique().tolist()
df1['matching_rows'] = np.where(df1['column2'].isin(list_df2),'Match','No Match')
And then create a new dataframe with the matches:
matched_df = df1[df1['matching_rows']=='Match']
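If the worry is that .isin() silently drops duplicate matches, a quick check is to compare counts: .isin() only builds a boolean mask the same length as the dataframe, so filtering with it keeps every matching row, duplicates included. A small sketch, with the column names assumed from the question:

```python
# Boolean mask: True where column_2 of DF1 appears in column_1 of DF2.
mask = DF1['column_2'].isin(DF2['column_1'])

# These two numbers should match: the filter does not deduplicate rows.
print(mask.sum(), len(DF1.loc[mask]))

# This shows how many of the matched rows are exact duplicates of each other.
print(DF1.loc[mask].duplicated().sum())
```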
I use the pd.pivot_table() method to create a user-item matrix by pivoting the user-item activity data. However, the dataframe is so large that I get a complaint like this:
Unstacked DataFrame is too big, causing int32 overflow
Any suggestions on solving this problem? Thanks!
r_matrix = df.pivot_table(values='rating', index='userId', columns='movieId')
Some Solutions:
You can downgrade your pandas version to 0.21, which has no problem building pivot tables from large datasets.
You can convert your data to dictionary format, e.g. df.groupby('EVENT_ID')['DIAGNOSIS'].apply(list).to_dict()
You can use groupby instead. Try this code:
reviews.groupby(['userId','movieId'])['rating'].max().unstack()
An integer overflow inside library code is not something you can do much about. You basically have three options:
1. Change the input data you provide to the library so the overflow does not occur. You probably need to make the input smaller in some sense. If that does not help, you may be using the library in the wrong way, or you may have hit a bug in the library.
2. Use a different library (or none at all); it seems that the library you are using is not intended to operate on large inputs.
3. Modify the code of the library itself so it can handle your input. This may be hard to do, but if you submit a pull request to the library's source code, many people will benefit from it.
You don't provide much code, so I cannot tell what is the best solution for you.
If you want movieId as your columns, first sort the dataframe using movieId as the key.
Then split the dataframe in half such that each subset contains all the ratings for any particular movie (i.e. no movie's ratings are split across the two subsets).
subset1 = df[:n]
subset2 = df[n:]
Now, apply pivot_table to each of the subsets:
matrix1 = subset1.pivot_table(values='rating', index='userId', columns='movieId')
matrix2 = subset2.pivot_table(values='rating', index='userId', columns='movieId')
Finally, join matrix1 and matrix2 using:
complete_matrix = matrix1.join(matrix2)
On the other hand, if you want userId as your columns, sort the dataframe using userId as the key and repeat the above process.
Note: please be sure to delete subset1, subset2, matrix1 and matrix2 after you're done, or else you'll end up with a MemoryError.
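A small sketch of that cleanup, assuming the variable names used above; the gc.collect() call is optional and just encourages Python to release the memory sooner.

```python
import gc

# Drop the intermediate objects once complete_matrix has been built.
del subset1, subset2, matrix1, matrix2
gc.collect()
```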
Converting the values column should resolve your issue:
df['rating'] = df['rating'].astype('int64')