How to transpose rows into columns on pyspark? - python

One question: how can I transpose rows into columns on pyspark?
My original dataframe looks like this:
ID | DATE       | APP       | DOWNLOADS | ACTIVE_USERS
_______________________________________________________
0  | 2021-01-10 | FACEBOOK  | 1000      | 5000
1  | 2021-01-10 | INSTAGRAM | 9000      | 90000
2  | 2021-02-10 | FACEBOOK  | 9000      | 72000
3  | 2021-02-10 | INSTAGRAM | 16000     | 500000
But I need it like this:
ID | DATE       | FACEBOOK - DOWNLOADS | FACEBOOK - ACTIVE_USERS | INSTAGRAM - DOWNLOADS | INSTAGRAM - ACTIVE_USERS
___________________________________________________________________________________________________________________
0  | 2021-01-10 | 1000                 | 5000                    | 9000                  | 90000
1  | 2021-02-10 | 9000                 | 72000                   | 16000                 | 500000
I tried using the answer to this question: Transpose pyspark rows into columns, but I couldn't make it work.
Could you help me please? Thank you!

From your example I assume the "ID" column is not needed to group on, as it looks to be reset in your outcome. That would make the query something like below:
import pyspark.sql.functions as F
df.groupBy("DATE").pivot("APP").agg(
    F.first("DOWNLOADS").alias("DOWNLOADS"),
    F.first("ACTIVE_USERS").alias("ACTIVE_USERS")
)
We group by the date, pivot on the app, and retrieve the first value for downloads and active users.
Outcome:
+----------+------------------+---------------------+-------------------+----------------------+
| DATE|FACEBOOK_DOWNLOADS|FACEBOOK_ACTIVE_USERS|INSTAGRAM_DOWNLOADS|INSTAGRAM_ACTIVE_USERS|
+----------+------------------+---------------------+-------------------+----------------------+
|2021-02-10| 9000| 72000| 16000| 500000|
|2021-01-10| 1000| 5000| 9000| 90000|
+----------+------------------+---------------------+-------------------+----------------------+
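If the column headers need to match the "FACEBOOK - DOWNLOADS" style from the question, the pivoted columns can be renamed afterwards. This is a minimal sketch under that assumption (the pivoted frame is stored in a variable called `pivoted` here; column names containing spaces need backticks when referenced later):

import pyspark.sql.functions as F

pivoted = df.groupBy("DATE").pivot("APP").agg(
    F.first("DOWNLOADS").alias("DOWNLOADS"),
    F.first("ACTIVE_USERS").alias("ACTIVE_USERS")
)

# Turn e.g. "FACEBOOK_DOWNLOADS" into "FACEBOOK - DOWNLOADS"
# (only the first underscore separates the app from the metric name).
renamed = pivoted.select(
    [F.col("DATE")]
    + [F.col(c).alias(c.replace("_", " - ", 1)) for c in pivoted.columns if c != "DATE"]
)
renamed.show()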


Create column based on percentage of recurring customers

I have a DataFrame that contains order data, one order per row. The columns are:
date_created
customer_id
total_value
recurring_customer
A customer counts as recurring once they have placed their third order. I want to find out what percentage of the total value returning customers contribute.
The DataFrame looks like this:
import pandas as pd

df = pd.DataFrame(
    {
        "date_created": ["2019-11-16", "2019-11-16", "2019-11-16", "2019-11-16", "2019-11-16", "2019-11-16"],
        "customer_id": ["1733", "6356", "6457", "6599", "6637", "6638"],
        "total": [746.02, 1236.60, 1002.32, 1187.21, 1745.03, 2313.14],
        "recurring_customer": [False, False, False, False, False, False],
    }
)
# Note: date_created is parsed to datetime and set as the index before resampling:
# df["date_created"] = pd.to_datetime(df["date_created"])
# df = df.set_index("date_created")
By resampling the data to monthly data:
df_monthly = df.resample('1M').mean()
I got the following output:
df_monthly = pd.DataFrame(
    {
        "date_created": ["2019-11-30", "2019-12-31", "2020-01-31", "2020-02-29", "2020-03-31", "2020-04-30"],
        "customer_id": [4987.02, 5291.56, 5702.13, 6439.27, 7263.11, 8080.91],
        "total": [2915.25, 2550.85, 2486.72, 2515.81, 2633.77, 2558.19],
        "recurring_customer": [0.009050, 0.016667, 0.075630, 0.138122, 0.130045, 0.175503],
    }
)
So the real question is: what percentage of each month's total value do returning customers contribute?
The desired output should look something like this:
| date_created | customer_id | total   | recurring_customer | recurring_customer_total | recurring_customer_total_percentage |
| ------------ | ----------- | ------- | ------------------ | ------------------------ | ----------------------------------- |
| 2019-11-30   | 4987.02     | 2915.25 | 0.009050           | ??????                   | ??????                              |
| 2019-12-31   | 5291.56     | 2550.85 | 0.016667           | ??????                   | ??????                              |
| 2020-01-31   | 5702.13     | 2486.72 | 0.075630           | ??????                   | ??????                              |
| 2020-02-29   | 6439.27     | 2515.81 | 0.138122           | ??????                   | ??????                              |
| 2020-03-31   | 7263.11     | 2633.77 | 0.130045           | ??????                   | ??????                              |
| 2020-04-30   | 8080.91     | 2558.19 | 0.175503           | ??????                   | ??????                              |
Note that I can't just multiply the recurring_customer percentage by the total value, because I assume recurring customers contribute a lot more to the total value than non-recurring customers do.
I tried the np.where() function on the daily dataframe:
I would create a column 'recurring_customer_total' in the daily dataframe that copies the value of the 'total' column, but only when 'recurring_customer' is True, and is 0 otherwise. I found a similar question here: get values from first column when other columns are true using a lookup list. Another similar question was asked here: Getting indices of True values in a boolean list. That answer returns all True values and their positions, whereas I want the value of 'total' copied into 'recurring_customer_total' whenever 'recurring_customer' is True.
Then I would resample the daily dataframe to a monthly dataframe, which would give me the mean amount that recurring customers contributed to the total value. Those values would be visible in 'recurring_customer_total'.
The final step would be to calculate the percentage of 'recurring_customer_total' relative to the 'total' column. Those values should be stored in 'recurring_customer_total_percentage'.
I think those are the steps I need to follow; the only problem is that I don't really know how to get there.
Thanks in advance!
So I'm fairly new to Python, but I've managed to answer my own question. I can't say this is the best, easiest, or fastest way, but it worked for me.
First of all I made a new dataframe that is a copy of the original dataframe, but only with the rows where 'recurring_customer' is True. I did that with the following code:
df_recurring_customers = df.loc[df['recurring_customer'] == True]
It gave me the following dataframe:
df_recurring_customers.head()
# equivalent to:
pd.DataFrame(
    {
        "date_created": ["2019-11-25", "2019-11-28", "2019-12-02", "2019-12-09", "2019-12-11"],
        "customer_id": ["577", "6457", "577", "6647", "840"],
        "total": [33891.12, 81.98, 9937.68, 1166.28, 2969.60],
        "recurring_customer": [True, True, True, True, True],
    }
)
Then I resampled the values using:
df_recurring_customers_monthly_sum = df_recurring_customers.resample('1M').sum()
I then dropped the 'number' and 'customer_id' columns, which had no useful values. The next step was to join the two dataframes 'df_monthly' and 'df_recurring_customers_monthly_sum' using:
df_total = df_recurring_customers_monthly_sum.join(df_monthly)
This gave me:
| date_created | total | recurring_customer_total |
| ------------ | ---------- | ------------------------ |
| 2019-11-30 | 644272.02 | 33973.10 |
| 2019-12-31 | 612205.99 | 15775.29 |
| 2020-01-31 | 887761.60 | 61612.27 |
| 2020-02-29 | 910724.75 | 125315.31 |
| 2020-03-31 | 1174662.59 | 125315.31 |
| 2020-04-30 | 1399332.26 | 248277.97 |
Then I wanted to know the percentage, so:
df_total['total_recurring_customer_percentage'] = (df_total['recurring_customer_total'] / df_total['total']) * 100
Which gave me:
| date_created | total      | recurring_customer_total | recurring_customer_total_percentage |
| ------------ | ---------- | ------------------------ | ----------------------------------- |
| 2019-11-30   | 644272.02  | 33973.10                 | 5.273099                            |
| 2019-12-31   | 612205.99  | 15775.29                 | 2.576794                            |
| 2020-01-31   | 887761.60  | 61612.27                 | 6.940182                            |
| 2020-02-29   | 910724.75  | 125315.31                | 13.759954                           |
| 2020-03-31   | 1174662.59 | 125315.31                | 13.967221                           |
| 2020-04-30   | 1399332.26 | 248277.97                | 17.742603                           |
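For reference, the same result can be computed in one short pipeline using the np.where() idea from the question. This is a minimal sketch, assuming df is the daily frame with a DatetimeIndex on date_created, a numeric 'total' column, and a boolean 'recurring_customer' column:

import numpy as np
import pandas as pd

# Copy 'total' only for recurring customers, otherwise 0.
df["recurring_customer_total"] = np.where(df["recurring_customer"], df["total"], 0)

# Monthly sums of the overall total and the recurring-customer total.
df_total = df.resample("1M")[["total", "recurring_customer_total"]].sum()

# Share of monthly revenue contributed by recurring customers, in percent.
df_total["recurring_customer_total_percentage"] = (
    df_total["recurring_customer_total"] / df_total["total"] * 100
)
print(df_total)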

groupby: how to show max(field1) and the value of field2 corresponding to max(field1)?

Let's say I have a table with 3 fields: client, city, sales, with sales being a float.
+--------+--------+-------+
| client | city | sales |
+--------+--------+-------+
| a | NY | 0 |
| a | LA | 1 |
| a | London | 2 |
| b | NY | 3 |
| b | LA | 4 |
| b | London | 5 |
+--------+--------+-------+
For each client, I would like to show the city with the greatest sales and what those sales are, i.e. I want this output:
+--------+--------+-------+
| client | city | sales |
+--------+--------+-------+
| a | London | 2 |
| b | London | 5 |
+--------+--------+-------+
Any suggestions?
This table can be generated with:
df=pd.DataFrame()
df['client']= np.repeat( ['a','b'],3 )
df['city'] = np.tile( ['NY','LA','London'],2)
df['sales']= np.arange(0,6)
This is wrong because it also takes the 'maximum' of the city, and shows NY because it considers that N > L:
max_by_id = df.groupby('client').max()
I can first create a dataframe with the highest sales, and then merge it with the initial dataframe to retrieve the city; it works, but I was wondering if there is a faster / more elegant way?
out = pd.merge( df, max_by_id, how='inner' ,on=['client','sales'] )
I remember doing something similar with cross apply statements in SQL but wouldn't know how to run a Pandas equivalent.
You need to sort by sales, then group by client and pick the first row:
df.sort_values(['sales'], ascending=False).groupby('client').first().reset_index()
OR
Or, as suggested by @user3483203:
df.loc[df.groupby('client')['sales'].idxmax()]
Output:
client city sales
0 a London 2
1 b London 5
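Another common pattern, shown here as a small self-contained sketch on the sample frame, is to sort and then drop duplicates per client; it keeps the original columns without an explicit aggregation:

import numpy as np
import pandas as pd

df = pd.DataFrame()
df['client'] = np.repeat(['a', 'b'], 3)
df['city'] = np.tile(['NY', 'LA', 'London'], 2)
df['sales'] = np.arange(0, 6)

# Sort so the largest sales per client comes first, then keep one row per client.
out = (
    df.sort_values('sales', ascending=False)
      .drop_duplicates('client')
      .sort_values('client')
      .reset_index(drop=True)
)
print(out)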

How do I split a single dataframe into multiple dataframes by the range of a column value?

First off, I realize that this question has been asked a ton of times in many different forms, but a lot of the answers just give code that solves the problem without explaining what the code actually does or why it works.
I have an enormous data set of phone numbers and area codes that I have loaded into a dataframe in python to do some processing with. Before I do that processing, I need to split the single dataframe into multiple dataframes that contain phone numbers in certain ranges of area codes that I can then do more processing on. For example:
+---+--------------+-----------+
| | phone_number | area_code |
+---+--------------+-----------+
| 1 | 5501231234 | 550 |
+---+--------------+-----------+
| 2 | 5051231234 | 505 |
+---+--------------+-----------+
| 3 | 5001231234 | 500 |
+---+--------------+-----------+
| 4 | 6201231234 | 620 |
+---+--------------+-----------+
into
area-codes (500-550)
+---+--------------+-----------+
| | phone_number | area_code |
+---+--------------+-----------+
| 1 | 5501231234 | 550 |
+---+--------------+-----------+
| 2 | 5051231234 | 505 |
+---+--------------+-----------+
| 3 | 5001231234 | 500 |
+---+--------------+-----------+
and
area-codes (600-650)
+---+--------------+-----------+
| | phone_number | area_code |
+---+--------------+-----------+
| 1 | 6201231234 | 620 |
+---+--------------+-----------+
I get that this should be possible using pandas (specifically groupby and a Series object I think) but the documentation and examples on the internet I could find were a little too nebulous or sparse for me to follow. Maybe there's a better way to do this than the way I'm trying to do it?
You can use pd.cut to bin the area_code column, then use the labels to group the data and store the groups in a dictionary. Finally, print each key to see the dataframe:
bins = [500, 550, 600, 650]
labels = ['500-550', '550-600', '600-650']

d = {f'area_code_{i}': g for i, g in
     df.groupby(pd.cut(df.area_code, bins, include_lowest=True, labels=labels))}

print(d['area_code_500-550'])
print('\n')
print(d['area_code_600-650'])
phone_number area_code
0 5501231234 550
1 5051231234 505
2 5001231234 500
phone_number area_code
3 6201231234 620
You can also do this by selecting rows in the dataframe, chaining multiple conditions with the & or | operator:
df1 selects rows with area_code between 500 and 550
df2 selects rows with area_code between 600 and 650
df = pd.DataFrame({'phone_number': [5501231234, 5051231234, 5001231234, 6201231234],
                   'area_code': [550, 505, 500, 620]},
                  columns=['phone_number', 'area_code'])

df1 = df[(df['area_code'] >= 500) & (df['area_code'] <= 550)]
df2 = df[(df['area_code'] >= 600) & (df['area_code'] <= 650)]
df1
phone_number area_code
0 5501231234 550
1 5051231234 505
2 5001231234 500
df2
phone_number area_code
3 6201231234 620
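If the ranges are known up front, the same idea can be written once and reused for any number of bins. This is a minimal sketch using Series.between (inclusive on both ends); the range list here is only an example:

import pandas as pd

df = pd.DataFrame({'phone_number': [5501231234, 5051231234, 5001231234, 6201231234],
                   'area_code': [550, 505, 500, 620]})

# One frame per (low, high) range, keyed by a "low-high" label.
ranges = [(500, 550), (600, 650)]
frames = {f'{lo}-{hi}': df[df['area_code'].between(lo, hi)] for lo, hi in ranges}

print(frames['500-550'])
print(frames['600-650'])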

(Python 3.x and SQL) Check if entries exists in multiple tables and convert to binary values

I have a 'master table' that contains just one column of ids from all my other tables. I also have several other tables that contain some of the ids, along with other columns of data. I am trying to iterate through all of the ids for each smaller table, create a new column for the smaller table, check if the id exists on that table and create a binary entry in the master table. (0 if the id doesn't exist, and 1 if the id does exist on the specified table)
That seems pretty confusing, but the application of this is to check if a user exists on the table for a specific date, and keep track of this information day to day.
Right now I am iterating through the dates, and inside each iteration I am iterating through all of the ids to check if they exist for that date. This is likely going to be incredibly slow, and there is probably a better way to do it. My code looks like this:
def main():
    dates = init()
    id_list = getids()
    print(dates)
    for date in reversed(dates):
        cursor.execute("ALTER TABLE " + table + " ADD " + date + " BIT;")
        cnx.commit()
        for ID in id_list:
            (...)
I know that the next step will be to generate a query using each id that looks something like:
SELECT id FROM [date]_table
WHERE EXISTS (SELECT 1 FROM master_table WHERE master_table.id = [date]_table.id)
I've been stuck on this problem for a couple days and so far I cannot come up with a query that gives a useful result.
For example, if I had three tables for three days...
Monday:
+------+-----+
| id | ... |
+------+-----+
| 1001 | ... |
| 1002 | ... |
| 1003 | ... |
| 1004 | ... |
| 1005 | ... |
+------+-----+
Tuesday:
+------+-----+
| id | ... |
+------+-----+
| 1001 | ... |
| 1003 | ... |
| 1005 | ... |
+------+-----+
Wednesday:
+------+-----+
| id | ... |
+------+-----+
| 1002 | ... |
| 1004 | ... |
+------+-----+
I'd like to end up with a master table like this:
+------+--------+---------+-----------+
| id | monday | tuesday | wednesday |
+------+--------+---------+-----------+
| 1001 | 1 | 1 | 0 |
| 1002 | 1 | 0 | 1 |
| 1003 | 1 | 1 | 0 |
| 1004 | 1 | 0 | 1 |
| 1005 | 1 | 1 | 0 |
+------+--------+---------+-----------+
Thank you ahead of time for any help with this issue. And since it's sort of a confusing problem, please let me know if there are any additional details I can provide.
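As a rough illustration of the goal (not the asker's actual schema), one way to build the binary columns is to load each day's ids and mark membership against the master id list with isin. A minimal pandas sketch with made-up table contents:

import pandas as pd

# Hypothetical data; in practice the ids would be read from the per-day tables.
master = pd.DataFrame({"id": [1001, 1002, 1003, 1004, 1005]})
day_ids = {
    "monday": {1001, 1002, 1003, 1004, 1005},
    "tuesday": {1001, 1003, 1005},
    "wednesday": {1002, 1004},
}

# 1 if the id appears in that day's table, otherwise 0.
for day, ids in day_ids.items():
    master[day] = master["id"].isin(ids).astype(int)

print(master)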

python: how to split excel files grouped by first column

I have a table that needs to be split into multiple files grouped by values in column 1 - serial.
+--------+--------+-------+
| serial | name | price |
+--------+--------+-------+
| 100-a | rdl | 123 |
| 100-b | gm1 | -120 |
| 100-b | gm1 | 123 |
| 180r | xxom | 12 |
| 182d | data11 | 11.50 |
+--------+--------+-------+
The output would be like this:
100-a.xls
100-b.xls
180r.xls, etc.
and opening 100-b.xls shows this:
+--------+------+-------+
| serial | name | price |
+--------+------+-------+
| 100-b | gm1 | -120 |
| 100-b | gm1 | 123 |
+--------+------+-------+
I tried using Pandas to define the dataframe by using this code:
import pandas as pd
#from itertools import groupby
df = pd.read_excel('myExcelFile.xlsx')
I was successful in getting the dataframe, but I have no idea what to do next. I tried following this similar question on Stack Overflow, but the scenario is a bit different. What should the next step be?
This is not a groupby but a filter.
You need to follow two steps:
Generate the data that you need in the Excel file.
Save the dataframe as Excel.
Something like this should do the trick -
for x in list(df.serial.unique()):
    df[df.serial == x].to_excel("{}.xlsx".format(x))
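Equivalently, groupby can drive the same loop, which avoids re-filtering the frame for every serial. A small sketch under the same assumptions (one output file per serial, named after the serial value):

import pandas as pd

df = pd.read_excel('myExcelFile.xlsx')

# One output file per serial; each file holds only that serial's rows.
for serial, group in df.groupby('serial'):
    group.to_excel('{}.xlsx'.format(serial), index=False)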
