I want to make a daily table of team stats: a data frame where every team has a row on every date. I tried doing it with counting and then repeated joins, but it gets messy, so I was wondering if there is a simpler solution, like a built-in function.
Not sure if that is what you want, but it seems like pandas.melt() is what you are after:
reference: https://www.geeksforgeeks.org/python-pandas-melt/
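Something along these lines, assuming your stats table is currently wide with one column per team and one row per date (the column names here are made up):

import pandas as pd

# made-up wide table: one row per date, one stats column per team
wide = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=3, freq="D"),
    "team_a": [1, 2, 3],
    "team_b": [4, 5, 6],
})

# melt keeps "date" fixed and turns the team columns into rows,
# giving exactly one row per (date, team) pair
long_df = wide.melt(id_vars="date", var_name="team", value_name="stat")
print(long_df)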
This is my first post here, and it's based on an issue I've run into and tried to solve at work. I'll try to summarize it precisely, as I'm having trouble wrapping my head around a preferred solution. Step #3 below is a real stumper for me.
1. Grab a large data file based on a parquet - no problem
2. Select 5 columns from the parquet and create a dataframe - no problem
import pandas as pd

df = pd.read_parquet('/Users/marmicha/Downloads/sample.parquet',
                     columns=["ts", "session_id", "event", "duration", "sample_data"])
3. But here is where it gets a bit tricky for me. One column (a key column) is called "session_id". Some of its values are unique, but many session_ids are duplicated across multiple rows of data. I want to iterate through the master dataframe and create a separate dataframe per session_id. Each of these sub dataframes would have one calculation done on it: the SUM of the "duration" column for that session_id. That sum is unique to each session_id, so each sub dataframe would get a row added with the total listed alongside the session_id. I'm thinking there is a nested loop that will work for me, but every effort so far has been a mess.
4. Ultimately, I'd like a final dataframe that is a collection of these unique sub dataframes. I guess I'd need to define this final dataframe and append each new sub dataframe to it as I iterate through the data. I should be able to do that simply enough.
5. Finally, write this final df to a new parquet file. That should be simple enough, so I won't need help with it.
But that is my challenge in a nutshell. The main design I'd need help with is #3. I've played with itertuples and iterrows.
I think the groupby function will work:
df.groupby('session_id')['duration'].sum()
More info here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
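And if you want the per-session sums back as a regular dataframe that you can write to a new parquet file, something like this should do it (the toy data and output file name are just examples):

import pandas as pd

# toy data standing in for the session_id / duration columns from the parquet
df = pd.DataFrame({"session_id": ["a", "a", "b"],
                   "duration": [10, 20, 5]})

# one row per session_id with the summed duration
totals = df.groupby("session_id", as_index=False)["duration"].sum()

# write the result out (needs pyarrow or fastparquet installed)
totals.to_parquet("session_totals.parquet")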
I'm new to the community and I only recently started to use Python and more specifically Pandas.
For the data set I have, I would like the columns to be the dates. For each date I would like a customer list that then breaks down into more specific row elements. Everything would be rolled up by order number - a distinct count on the order number, because sometimes a client purchases more than one item. In Excel I create a pivot table and process it by distinct order. Then I sort each row element by the distinct count of the order number. I collapse each row down until I just have the client name; if I click to expand the cell I see each row element.
So my question: if I'm pulling these huge data sets in as a dataframe, can I pull the xlsx in as an array? I know it will strip the values, so I would have to set the date column as a datetime64 element. I've been trying to reshape the array so that the dates become the columns and the rows are what I want, but so far I haven't had luck. I have tried pivot_table and groupby with some success, but I wasn't able to move the dates to the columns.
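For reference, this is roughly the shape of what I've been attempting, with made-up column names just to show the structure:

import pandas as pd

# made-up frame mirroring the raw export: one row per order line
df = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-01-01", "2021-01-02"]),
    "client": ["Acme", "Acme", "Beta"],
    "order_number": ["A1", "A1", "B7"],
})

# distinct count of order numbers per client, with the dates across the columns
pivot = df.pivot_table(index="client", columns="date",
                       values="order_number",
                       aggfunc=pd.Series.nunique, fill_value=0)
print(pivot)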
Summary: overall, what I'm looking to know is whether I'm going down the wrong rabbit hole altogether. I'm looking to basically create a collapsible pivot table, with specific color parameters for the table as well, so that the automated spreadsheet will look identical to the current one.
I really appreciate any help; as I said, I'm brand new to Pandas, so direction is key. I'd also like to know whether I'm onto the "best" way of handling the export to Excel after I've imported and modified the spreadsheet. I get a single sheet of raw data kicked out in .xlsx form. Thanks again!
So, I am trying to pivot this data (link here) so that all the metrics/numbers end up in a single column, with another column holding the ID. Obviously, having a ton of data/metrics spread across a bunch of columns is much harder to compare and do calculations on than having it all in one column.
So, I know what tools I need for this: Pandas, findall, wide_to_long (or melt), and maybe stack. However, I am having a bit of difficulty putting them all in the right place.
I can easily import the data into the df and view it, but when it comes to using findall with wide_to_long to pivot the data I get pretty confused. I am basing my idea on this example (about halfway down, where they use findall / regex to define new column names). I am looking to create a new column for each category of metric (i.e. population estimate is one column and % change is another; they should not all be one column).
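Here is roughly where I've got to, with placeholder column names since I'm not sure how to write the regex for my real ones:

import re
import pandas as pd

# placeholder wide table: one row per ID, each metric repeated per year
df = pd.DataFrame({
    "id": [1, 2],
    "pop_est2019": [100, 200],
    "pop_est2020": [110, 210],
    "pct_change2019": [1.0, 2.0],
    "pct_change2020": [1.5, 2.5],
})

# pull the metric "stubs" out of the column names with a regex,
# then let wide_to_long split the year suffix into its own column
stubs = sorted({re.match(r"(\D+)\d+", c).group(1) for c in df.columns if c != "id"})
long_df = pd.wide_to_long(df, stubnames=stubs, i="id", j="year", suffix=r"\d+").reset_index()
print(long_df)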
Can someone help me set up the syntax correctly for this part? I am not good with regular expressions and pattern matching.
Thank you.
What is the simplest code that can be used to create a basic dataframe with a single column (let's call it date), with daily rows between dateA and dateB?
This dataframe can later be used for multiple purposes.
I can think of many ways to create it, but all of them need multiple lines of code. I wonder if there is a one-liner, or an example of very simple code, for such a simple task.
You can use
df = pd.DataFrame({'date': pd.date_range('2018-10-01','2019-09-01', freq='D')})
If I need to do a transformation on a dataframe (for example, adding a column), which is the better way to get optimal performance?
1.
a = [(1,), (2,), (3,)]
df = spark.createDataFrame(a, ["a"])
df = df.withColumn("b", lit(1))
2.
a = [(1,), (2,), (3,)]
df = spark.createDataFrame(a, ["a"])
df2 = df.withColumn("b", lit(1))
Consider that I am adding 200 columns.
withColumn will be evaluated lazily, so you need to understand how you are going to add a new column to the same data frame without calling an action statement.
As for your question: you can keep reassigning to the same data frame if you always want the latest (updated) version, assuming you don't need the original df any further.
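A rough sketch of what I mean, with toy data standing in for yours:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,), (3,)], ["a"])

# keep reassigning the same variable; nothing actually runs until an
# action (count, show, write, ...) triggers the accumulated plan
for i in range(200):          # stands in for your 200 new columns
    df = df.withColumn(f"col_{i}", lit(1))

df.show(1)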
Hope it's clear
When you use withColumn to add a new column to a Spark DataFrame, a new narrow transformation is added to the execution plan for each withColumn statement. You can try the method described in this blog, which explains the scenario properly.
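One common alternative (not necessarily exactly what that blog describes) is to build all the new columns as expressions and add them in a single select, so the plan gets one projection instead of 200 chained withColumn calls. A rough sketch with toy data:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,), (3,)], ["a"])

# build the 200 new columns as expressions and add them in one select
new_cols = [lit(1).alias(f"col_{i}") for i in range(200)]
df2 = df.select("*", *new_cols)
df2.show(1)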