I want to create a DataFrame that breaks down the genres of movies into separate columns, with each individual genre column having a value of 1 for movies that are of that genre.
from this movie dataframe (screenshot not shown)
to this: a dataframe with a distinct column created per genre, 1 for true and 0 for false.
I'm using Databricks PySpark.
Many thanks!
I would first get the unique values of the genres column into a list, and then iterate over that list. The dataframe is taken as df here:
from pyspark.sql import functions as F

# Collect the distinct genre values into a Python list
unique_vals = df.select('genres').distinct().rdd.flatMap(lambda x: x).collect()
Now let's iterate over the list, adding one 0/1 column per genre:
df1 = df
for i in unique_vals:
    # 1 where the row's genre matches the current value, 0 otherwise
    df1 = df1.withColumn(i, F.when(F.col('genres') == i, 1).otherwise(0))
df1.show()
I think this would work, grouping by whatever column identifies each movie (e.g. title); an empty groupby() would collapse the whole dataframe into a single row:
df.groupBy('title').pivot('genres').agg(F.lit(1)).fillna(0)
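A minimal runnable sketch of that approach, with made-up sample data (the real column names aren't shown in the screenshots, so title and the example rows are assumptions):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('Toy Story', 'Animation'), ('Toy Story', 'Comedy'), ('Heat', 'Crime')],
    ['title', 'genres'],
)

# One row per title, one 0/1 column per distinct genre value
one_hot = df.groupBy('title').pivot('genres').agg(F.lit(1)).fillna(0)
one_hot.show()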
I have a dataframe 'raw' that looks like this -
It has many rows with duplicate values in each column.
I want to make a new dataframe 'new_df' which has the unique customer_code values and their corresponding market_code.
The new_df should look like this -
It sounds like you simply want to create a DataFrame with unique customer_code values which also shows the corresponding market_code. Here's a way to do it:
df = df[['customer_code','market_code']].drop_duplicates('customer_code')
Output:
customer_code market_code
0 Cus001 Mark001
1 Cus003 Mark003
3 Cus004 Mark003
4 Cus005 Mark004
The df[['customer_code','market_code']] part gives us a DataFrame containing only the two columns of interest, and drop_duplicates('customer_code') keeps only the first occurrence of each duplicate value in the customer_code column (you could instead keep the last occurrence by passing the keep='last' argument).
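For example, to keep the last occurrence of each customer_code instead:
# keep='last' retains the final row for each duplicated customer_code
df = df[['customer_code','market_code']].drop_duplicates('customer_code', keep='last')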
I would like to set for each row, the value of ART_IN_TICKET to be the number of rows that have the same TICKET_ID as this row.
For example, for the first 5 rows of this dataframe, TICKET_ID is 35592159, and ART_IN_TICKET should be 5 since there are 5 rows with that same TICKET_ID.
There can be other solutions as well. A relatively simple one is to get the count of rows for each TICKET_ID and then merge that back into this dataframe to fill ART_IN_TICKET. Assuming the above dataframe is in df:
# Count rows per TICKET_ID; the counted column keeps the name ART_IN_TICKET
count_df = df[['TICKET_ID', 'ART_IN_TICKET']].groupby('TICKET_ID').count().reset_index()
df = df.drop(columns=['ART_IN_TICKET'])  # remove the old column before merging
final_df = df.merge(count_df, on='TICKET_ID')
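A shorter alternative sketch using transform, which writes the group size straight back onto each row and keeps the original row order:
# Overwrite ART_IN_TICKET with the number of rows sharing each TICKET_ID
df['ART_IN_TICKET'] = df.groupby('TICKET_ID')['TICKET_ID'].transform('count')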
I have the below DataFrame
As you can see, ItemNo 1 is duplicated three times, and each column has a value corresponding to it.
I am looking for a method to check against all columns, and if they match then put Price, Sales, and Stock as one entry, not three.
Any help will be greatly appreciated.
Simply remove all the NaN instances in each row and redefine the column names:
# Collapse each row to its non-null values, left-aligned into new positions
df = df1.apply(lambda x: pd.Series(x.dropna().values), axis=1)
df.columns = ['ItemNo','Category','SIZE','Model','Customer','Week Date','<New col name>']
To collapse the duplicates into one row, you can use groupby like this:
df.groupby('ItemNo', as_index=False).first()
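Note that first() returns the first non-null value per column within each group, so depending on how the values are laid out, grouping the original dataframe may already merge the partial rows on its own, a sketch assuming ItemNo identifies each logical record:
# first() skips NaN, so partially-filled duplicate rows collapse into one
merged = df1.groupby('ItemNo', as_index=False).first()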
I have a pandas dataframe which looks like this (screenshot omitted):
There are multiple columns for Australia based on its provinces, titled Australia, Australia.1, Australia.2 and so on. The same is true for other countries such as the USA, the UK or Canada. I want only one column for each of these countries, for example a single Australia column with the sum of the values across its provinces, and I want to avoid duplicate column names. How can I do this using a pandas dataframe in Python?
You can transpose the dataframe and reset the index. Then strip the .1, .2, ... suffixes from the country names and group by that column, which in my example is called index:
df = df.T.reset_index()
# Remove the ".1", ".2", ... suffixes pandas appends to duplicate column names
df['index'] = df['index'].str.replace(r'\.\d+', '', regex=True)
# Make sure the value columns are numeric before summing
df.iloc[:, 2:] = df.iloc[:, 2:].astype(float)
df = df.groupby('index').sum()
df
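Alternatively, a sketch that starts from the original wide dataframe and returns the result in the same orientation (assuming the value columns are all numeric):
# Map each column label to its base country name, group the transposed
# frame by those cleaned labels, then flip back
cleaned = df.columns.str.replace(r'\.\d+$', '', regex=True)
grouped = df.T.groupby(cleaned).sum().T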
I am working on a pandas dataframe with 168 columns. The first three columns contain the name of the country, latitude and longitude. The rest of the columns contain numerical data. Each row represents a country, but some countries have multiple rows. I need to aggregate those rows by summing. I can aggregate the first three columns with the following code:
df = df.groupby('Country', as_index=False).agg({'Lat':'first','Long':'first'})
However, I couldn't find a way to include the remaining 165 columns in that code without explicitly writing all the column names. In addition, the column names represent dates, such as 5/27/20, 5/28/20, 5/29/20, etc., so I need to keep them.
How can I do that? Thanks.
Maybe you can generate the dictionary from the column names, taking the first Lat/Long per country and summing everything else (the groupby key itself has to stay out of the dict):
# Build the aggregation spec programmatically so the 165 date columns
# don't have to be typed out
agg_dict = {c: 'sum' for c in df.columns if c not in ('Country', 'Lat', 'Long')}
agg_dict.update({'Lat': 'first', 'Long': 'first'})
df = df.groupby('Country', as_index=False).agg(agg_dict)