Strategy for creating pivot tables that collapse with large data sets - python

I'm new to the community and I only recently started to use Python and more specifically Pandas.
In my data set I would like the columns to be the dates. Under each date I would like a customer list that breaks down into more specific row elements. Everything is rolled up by order number, as a distinct count, because sometimes a client purchases more than one item. In Excel I create a pivot table with a distinct count of order numbers, sort each row element by that distinct count, and collapse each row down until only the client name shows; clicking a cell expands it to reveal each row element.
So my question: if I'm pulling these huge data sets in as a DataFrame, can I read the .xlsx in as an array? I know it will strip the types, so I would have to cast the date column to datetime64. I've been trying to reshape the array so the dates become the columns and the rows are what I want, but so far I haven't had luck. I have tried pivot_table and groupby with some success, but I wasn't able to move the date to the columns.
Summary: overall, what I'm looking to know is whether I'm going down the wrong rabbit hole altogether. I'm looking to create a collapsible pivot table, with specific color parameters, so that the automated spreadsheet looks identical to the current one.
I really appreciate any help; as I said, I'm brand new to Pandas, so direction is key. Mostly I want to know whether I'm on the "best" path for the export to Excel after I've imported and modified the data. I get a single sheet of raw data kicked out in .xlsx form. Thanks again!
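One possible direction, sketched below; the column names (client, item, order_date, order_number) and file names are assumptions, not taken from the actual data. pandas builds the pivot with a distinct count, and xlsxwriter's row outline levels can reproduce Excel's collapsible +/- grouping:

import pandas as pd

# Sketch only: column and file names below are assumptions.
df = pd.read_excel("raw_data.xlsx", parse_dates=["order_date"])

# Dates across the top, clients and their row elements down the side,
# distinct count of order numbers in the body (Excel's "Distinct Count").
pivot = pd.pivot_table(
    df,
    index=["client", "item"],
    columns="order_date",
    values="order_number",
    aggfunc="nunique",
    fill_value=0,
)

# xlsxwriter can reproduce the collapsible rows: detail rows get an
# outline level, which makes Excel draw the +/- grouping controls.
with pd.ExcelWriter("pivot.xlsx", engine="xlsxwriter") as writer:
    pivot.to_excel(writer, sheet_name="Pivot")
    worksheet = writer.sheets["Pivot"]
    # Data starts after the header rows; the offset may need adjusting.
    # A fuller version would leave one summary row per client at level 0.
    for row in range(2, len(pivot) + 2):
        worksheet.set_row(row, None, None, {"level": 1})

Coloring can then be layered on with xlsxwriter formats, so staying in pandas for the reshaping and xlsxwriter for the presentation does not look like the wrong rabbit hole.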

Related

Fill out Excel Template with Python

I'm having trouble finding a solution to fill out an Excel template using Python. I currently have a pandas DataFrame and use openpyxl to write the necessary data to specific rows and cells in a for loop. The issue is that in my next project several of the cells I have to write are not contiguous: instead of going A1, A2, A3, it can go A1, A5, A9. Listing the cells one by one, as I did in the past, would be impractical this time.
So I was looking for something that works like a VLOOKUP in Excel, where Python would match the necessary row and column in the template and drop the information there. I know I might need to use different commands.
I added a picture below as an example. I need to drop values into the empty cells; ideally Python would read "USA" and "Revenue" and know to put that information in cell B2. I know I might need something to map it, I am just not sure how to start or whether it is even possible.
[Image: template grid with countries as row labels and metrics such as Revenue as column headers; the USA/Revenue cell is B2.]
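One way to do this, sketched below with openpyxl (the file name, the header positions, and the values dict are all assumptions): build lookup tables from the template's own row and column headers, then write each value at the matching intersection.

from openpyxl import load_workbook

# Sketch: assumes row labels in column A and column labels in row 1.
wb = load_workbook("template.xlsx")
ws = wb.active

# Hypothetical data to place: (row label, column label) -> value.
values = {("USA", "Revenue"): 1000, ("USA", "Cost"): 400}

# Map each header text to its row or column number in the template.
row_of = {ws.cell(row=r, column=1).value: r for r in range(2, ws.max_row + 1)}
col_of = {ws.cell(row=1, column=c).value: c for c in range(2, ws.max_column + 1)}

# Drop every value at the intersection of its two headers,
# e.g. ("USA", "Revenue") lands in B2.
for (row_label, col_label), value in values.items():
    ws.cell(row=row_of[row_label], column=col_of[col_label], value=value)

wb.save("template_filled.xlsx")

Because the lookups are built from the template itself, the cells no longer need to be contiguous or listed by hand.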

Pandas Pivot table and Excel's style cells

I get measurement data from instruments. These measurements depend on several parameters, and a pivot table is a good way to represent them. Every measurement can be associated with a scope screenshot to make it more explicit. I get all the data in the following CSV format:
The number of measurements and parameters can change.
I am trying to write a Python script (for now with the pandas library) that creates the pivot table in Excel. With pandas I can color the data in and out of a defined range. However, I would also like every cell to carry a link that takes me to the corresponding screenshot, and that is where I am stuck.
I would like a result like the following, but with the link to the corresponding screenshot:
I did find a way to add the link to every cell, using the =HYPERLINK() Excel function through pandas' apply().
However, I can no longer apply conditional formatting with XlsxWriter, because the cells no longer hold numerical content.
I could apply the conditional formatting first and then iterate through the whole sheet to add the links, but it would be a total mess to recover the relation between the data and the different measurement parameters.
I would appreciate ideas and efficient ways to do this.
xlsxwriter has a function called write_url, but you must apply write_url first, while creating the new worksheet, and then use openpyxl to insert your pandas DataFrame:
1) create the worksheet and insert write_url;
2) use openpyxl to write the data into the already formatted cells.
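An alternative sketch using openpyxl only (the file names and the accepted value range below are assumptions): in openpyxl a hyperlink is a property of the cell, separate from its value, so the cell stays numeric and conditional formatting still works on it.

from openpyxl import Workbook
from openpyxl.formatting.rule import CellIsRule
from openpyxl.styles import PatternFill

wb = Workbook()
ws = wb.active

cell = ws["B2"]
cell.value = 3.47                           # the measurement stays a number
cell.hyperlink = "screenshots/meas_B2.png"  # link to the scope screenshot

# Hypothetical accepted range: flag anything outside 1.0 - 5.0 in red.
red = PatternFill(start_color="FFC7CE", end_color="FFC7CE", fill_type="solid")
ws.conditional_formatting.add(
    "B2:B10",
    CellIsRule(operator="notBetween", formula=["1.0", "5.0"], fill=red),
)

wb.save("measurements.xlsx")

This avoids =HYPERLINK() entirely, so there is no conflict between the links and the conditional formatting.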

Iterating through big data with pandas, large and small dataframes

This is my first post here, and it's based on an issue I've run into and tried to solve at work. I'll try to summarize it precisely, as I'm having trouble wrapping my head around a preferred solution. Step 3 is the real stumper for me.
1. Grab a large data file in parquet format - no problem.
2. Select 5 columns from the parquet and create a dataframe - no problem:
import pandas as pd

df = pd.read_parquet("/Users/marmicha/Downloads/sample.parquet",
                     columns=["ts", "session_id", "event", "duration", "sample_data"])
3. But here is where it gets tricky for me. One key column is called "session_id". Many of its values are unique, but many duplicates exist too, each with multiple associated rows of data. I wish to iterate through the master dataframe and create a sub-dataframe per session_id. Each sub-dataframe would get a simple calculation: the sum of its "duration" column. That sum is unique per session_id, so each sub-dataframe would gain a row listing that total along with the session_id. I'm thinking a nested loop will work for me, but every effort has been a mess to date.
4. Ultimately, I'd like a final dataframe that is a collection of these unique sub-dataframes. I guess I'd need to define this final dataframe and append each new sub-dataframe to it as I iterate through the data. I should be able to do that simply enough.
5. Finally, write this final df to a new parquet file. That should be simple enough, so I won't need help with it.
But that is my challenge in a nutshell. The main design I'd need help with is step 3. I've played with itertuples and iterrows.
I think the groupby function will work:
df.groupby('session_id')['duration'].sum()
More info here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
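A fuller sketch of the same idea (the output file name is an assumption): groupby replaces the nested loop entirely, producing one total-duration row per distinct session_id, which can then go straight back out to parquet.

import pandas as pd

df = pd.read_parquet(
    "/Users/marmicha/Downloads/sample.parquet",
    columns=["ts", "session_id", "event", "duration", "sample_data"],
)

# One row per distinct session_id with the summed duration; no explicit
# iteration over sub-dataframes is needed.
totals = (
    df.groupby("session_id", as_index=False)["duration"]
      .sum()
      .rename(columns={"duration": "total_duration"})
)

totals.to_parquet("session_totals.parquet")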

Adding Calculated Column in Pandas Dataframe via Indexing Column #s

Currently building a simple Customer Lifetime Value calculator program for marketing purposes. For a portion of the program, I give the user the option to import a CSV file via pd.read_csv to allow calculations across multiple customer records. I designate the required order of the CSV data in notes included in the output window.
The imported CSV should have 4 inputs per row. Building on this, I want to create a new column in the dataframe that multiplies columns 1-4 together. Since some users will include headers (which will vary per user) while others will not, is there a way to create the new calculated column based on column number rather than header?
Beginner here; none of the answers I have found have worked for me or matched my situation.
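Yes: iloc selects by position, so the calculation can ignore headers entirely. A sketch (the file name, the "clv" column name, and the header prompt are assumptions):

import pandas as pd

# Whether a given user's file has a header row can't be detected
# reliably, so this sketch simply asks.
has_header = input("Does the CSV have a header row? (y/n) ").lower() == "y"
df = pd.read_csv("customers.csv", header=0 if has_header else None)

# iloc works on positions, not labels: multiply the first four columns
# row-wise into a new column (columns must be numeric for prod to work).
df["clv"] = df.iloc[:, 0:4].prod(axis=1)

Because iloc never looks at the header text, the same line works whether the columns are named, unnamed, or named differently per user.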

Combine Rows with similar values with one column different - Excel

I have a worksheet with many near-duplicate rows, where only one column differs in a way that matters.
Is there a function that will put each of the differing streams into new columns, with the header being the date of the stream?
Essentially, I would like to have each song as a row and the day's streams as a column in that row. Please see the attached image for the end result I would like to achieve.
If this is possible in Python, that would be great as well, as I am pulling the data via a Python script using openpyxl.
Thanks!
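Since the data is already coming through Python, this is a single pivot with pandas. A sketch, assuming the sheet flattens to columns named song, date, and streams (the real names may differ):

import pandas as pd

df = pd.read_excel("streams.xlsx")

# One row per song, one column per date, stream counts in the body;
# aggfunc="sum" combines any duplicate (song, date) rows.
wide = df.pivot_table(index="song", columns="date", values="streams",
                      aggfunc="sum")

wide.to_excel("streams_by_date.xlsx")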
