I have a worksheet with many nearly duplicate rows that differ only in one important column.
Is there a function that will put each of the differing stream counts into a new column, with the column header being the date of the stream?
Essentially, I would like to have each song as a row and the day's streams as a column in that row. Please see the attached image for the end result I would like to achieve.
If this is possible in Python, that would be great as well, as I am pulling the data via a Python script using openpyxl.
Thanks!
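Since the data is already being pulled with Python, pandas can do this reshape in one call. A minimal sketch, assuming the raw sheet has one row per song per day with columns named "song", "date", and "streams" (column names and filenames are placeholders):

import pandas as pd

df = pd.read_excel("streams.xlsx")  # placeholder filename

# One row per song, one column per date, that day's streams as the value;
# aggfunc="sum" collapses any remaining duplicates for the same song and day.
wide = df.pivot_table(index="song", columns="date", values="streams", aggfunc="sum")

wide.to_excel("streams_wide.xlsx")  # placeholder filename

pandas writes .xlsx through openpyxl by default, so this slots into an existing openpyxl-based script.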
I'm having trouble finding a way to fill out an Excel template using Python. I currently have a pandas DataFrame and use openpyxl to write the necessary data to specific rows and cells in a for loop. The issue is that in my next project several of the cells I have to write are not contiguous, so instead of going A1, A2, A3 it can go A1, A5, A9. This time, listing the cells as I did in the past would be impractical.
So I was looking for something that would work similarly to a VLOOKUP in Excel, where Python would match the necessary row and column in the template and drop the information there. I know I might need to use different commands.
I added a picture below as an example. I need to drop values into the empty cells, and ideally Python would read "USA" and "Revenue" and know to put that value in cell B2. I know I might also need something to map it; I'm just not sure how to start or whether it is even possible.
[image: example template with labeled rows and columns and empty cells to fill in]
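One way to get VLOOKUP-like behaviour is to build lookup dictionaries from the template's own labels and address cells by (row label, column header) instead of hard-coded coordinates. A sketch, assuming the row labels sit in column A and the headers in row 1 (filenames and the sample data are placeholders):

from openpyxl import load_workbook

wb = load_workbook("template.xlsx")  # placeholder filename
ws = wb.active

# Map each row label (column A) and each column header (row 1) to its index.
row_for = {ws.cell(row=r, column=1).value: r for r in range(2, ws.max_row + 1)}
col_for = {ws.cell(row=1, column=c).value: c for c in range(2, ws.max_column + 1)}

# Values keyed by (row label, column header); with the layout assumed
# above, ("USA", "Revenue") lands in cell B2.
values = {("USA", "Revenue"): 1000}

for (row_label, header), value in values.items():
    ws.cell(row=row_for[row_label], column=col_for[header], value=value)

wb.save("filled_template.xlsx")  # placeholder filename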
This is my first post here, and it's based on an issue I've run into and tried to solve at work. I'll try to summarize it precisely, as I'm having trouble wrapping my head around a preferred solution. #3 below is a real stumper for me.
1. Grab a large data file based on a parquet - no problem.
2. Select 5 columns from the parquet and create a dataframe - no problem:
import pandas as pd

df = pd.read_parquet('/Users/marmicha/Downloads/sample.parquet',
                     columns=["ts", "session_id", "event", "duration", "sample_data"])
3. But here is where it gets a bit tricky for me. One column (a key column) is called "session_id". Many of its values are unique, but many duplicates also exist, each with multiple associated rows of data. I wish to iterate through the master dataframe and create a unique dataframe per session_id. Each of these (sub) dataframes would have a calculation done on it that simply gets the SUM of its "duration" column. That SUM would be unique per session_id, so each sub dataframe would get a row added listing that total along with the session_id. I'm thinking a nested loop will work for me, but every effort to date has been a mess.
4. Ultimately, I'd like a final dataframe that is a collection of these unique sub dataframes. I guess I'd need to define this final dataframe and append each new sub dataframe to it as I iterate through the data. I should be able to do that simply enough.
5. Finally, write this final df to a new parquet file. That should be simple enough, so I won't need help with it.
But that is my challenge in a nutshell. The main design I'd need help with is #3. I've played with itertuples and iterrows.
I think the groupby function will work:
df.groupby('session_id')['duration'].sum()
More info here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
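For the full flow in the question (a total per session_id, collected into one final dataframe and written back out), a sketch along those lines; only the output filename is a placeholder:

import pandas as pd

df = pd.read_parquet('/Users/marmicha/Downloads/sample.parquet',
                     columns=["ts", "session_id", "event", "duration", "sample_data"])

# One row per session_id with its summed duration -- no per-group
# sub dataframes or nested loops needed.
totals = df.groupby("session_id", as_index=False)["duration"].sum()

totals.to_parquet("session_totals.parquet")  # placeholder filename

If you genuinely need a sub dataframe per session_id, iterating with for session_id, sub in df.groupby("session_id"): gives you exactly that, but for a plain sum the groupby above replaces the whole nested loop.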
I'm new to the community and I only recently started to use Python and more specifically Pandas.
In my data set, I would like the columns to be the dates. For each date I would like a customer list that then breaks down into more specific row elements. Everything would be rolled up by order number, specifically a distinct count of order numbers, because sometimes a client purchases more than one item. In Excel I create a pivot table and process it by distinct order, then sort each row element by the distinct count of the order number. I collapse each row down until I just have the client name; if I click to expand the cell, then I see each row element.
So my question: if I'm pulling these huge data sets in as a dataframe, can I pull the xlsx in as an array? I know it will strip the values, so I would have to set the date as a datetime64 element. I've been trying to reshape the array so the dates become the columns and the rows are what I want, but so far I haven't had luck. I have tried pivot_table and groupby with some success, but I wasn't able to move the date to the columns.
Summary: overall, what I'm looking to know is whether I'm going down the wrong rabbit hole altogether. I'm looking to create a collapsible pivot table, with specific color parameters for the table as well, so that the spreadsheet I'm automating will look identical to the current one.
I really appreciate any help; as I said, I'm brand new to pandas, so direction is key. I'd also like to know whether I'm onto the "best" way of handling the export to Excel after I've imported and modified the spreadsheet. I get a single sheet of raw data kicked out in .xlsx form. Thanks again!
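pivot_table can produce the numbers side of this; a sketch, assuming the sheet has columns named "date", "client", "item", and "order_number" (all names are placeholders):

import pandas as pd

df = pd.read_excel("raw_data.xlsx", parse_dates=["date"])  # placeholder filename

pivot = pd.pivot_table(
    df,
    index=["client", "item"],  # row hierarchy: client first, then line items
    columns="date",            # dates across the top
    values="order_number",
    aggfunc="nunique",         # distinct count of order numbers
    fill_value=0,
)

pivot.to_excel("pivot.xlsx")  # placeholder filename

The collapsible outline and the colors are Excel presentation features; to_excel writes values only, so those would need to be applied afterwards, for example with openpyxl.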
I am writing a program that will process a bunch of data and fill a column in Excel. I am using openpyxl, strictly in write-only mode. Each column will have a fixed size of 75 cells, and each cell in the column will have the same formula applied to it. However, I can only process the data one column at a time; I cannot process an entire row and then iterate through all of the rows.
How can I write to a column, then move onto the next column once I have filled the previous one?
This is a rather open-ended question, but may I suggest using pandas. Without some kind of example of what you are trying to achieve it's difficult to make a great recommendation, but I have used pandas a ton in the past for automating the processing of Excel files. Basically, you would load your data into a pandas DataFrame, do your transformations/calculations, and when you are done write it back to the same or a new Excel file (or a number of other formats).
Because the OOXML file format is row-oriented, you must write in rows in write-only mode; it is simply not possible otherwise.
What you might be able to do is create some kind of transitional object that you can fill with columns and then use to write to openpyxl. A pandas DataFrame would probably be suitable for this, and openpyxl supports converting DataFrames into rows.
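A sketch of that idea: accumulate the data column by column in a DataFrame, then hand it to a write-only worksheet row by row with openpyxl's dataframe_to_rows helper. compute_column and the column names are stand-ins for the real per-column processing:

import pandas as pd
from openpyxl import Workbook
from openpyxl.utils.dataframe import dataframe_to_rows

def compute_column(name):
    # Stand-in for the real one-column-at-a-time processing (75 values).
    return list(range(75))

# Build the data one column at a time.
df = pd.DataFrame()
for name in ["col_a", "col_b", "col_c"]:  # placeholder column names
    df[name] = compute_column(name)

# Write-only mode only accepts whole rows, so emit the DataFrame row by row.
wb = Workbook(write_only=True)
ws = wb.create_sheet()
for row in dataframe_to_rows(df, index=False, header=True):
    ws.append(row)
wb.save("output.xlsx")  # placeholder filename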
I have an old Excel spreadsheet with a lot of data in a relational-database-type format, with one main primary key that I need to go through.
I want to compare some rows, but there are many entries (thousands of rows, dozens of columns) and Excel doesn't really have built-in features for this. After looking around, I found that the best way to extract the data is with a Python script, but I have no programming skills in Python or any other language for that matter. I need to look for duplicates in the key column, check whether there are duplicate rows for the same key value, and if so merge them into a new row, then write a new Excel file/sheet that separates the merged rows from the non-merged rows.
I don't know if this sounds too complicated or not. I am new here, and I did do some research scouring the internet for existing scripts, but no luck really. Here are the closest posts I found that may have something to do with what I want, though what I found is usually about people wanting to merge two different Excel files together:
http://pbpython.com/excel-file-combine.html
Looking to merge two Excel files by ID into one Excel file using Python 2.7
(I have more links but could only post two.)
Basically, I'm looking for duplicate rows and want to merge them together into a new file or spreadsheet in Excel, separating them from the non-duplicates and putting it all back together.
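A sketch of that pipeline in pandas: split the rows on whether the key repeats, collapse each group of duplicates into a single row (here by joining each column's distinct values, which is just one possible merge rule), and write the two sets to separate sheets. The filename and key column name are placeholders:

import pandas as pd

df = pd.read_excel("old_data.xlsx")  # placeholder filename
key = "primary_key"                  # placeholder key column name

dupe_mask = df.duplicated(key, keep=False)  # True for every row whose key repeats
dupes, uniques = df[dupe_mask], df[~dupe_mask]

# Merge each group of duplicate rows into one row by joining the
# distinct values found in each column.
merged = dupes.groupby(key, as_index=False).agg(
    lambda s: ", ".join(s.dropna().astype(str).unique())
)

with pd.ExcelWriter("merged_output.xlsx") as writer:  # placeholder filename
    merged.to_excel(writer, sheet_name="merged", index=False)
    uniques.to_excel(writer, sheet_name="unmerged", index=False)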