I have an Excel spreadsheet with the following columns:
I want to group this data by vendor and show all transaction and amount data for that vendor by Type (i.e. Wireless, Bonus, etc.). For example, it should show all data for vendor 'A' classified by 'Type'. Once done, it should export this to separate Excel files (i.e. for vendor 'A', 3 Excel files are created showing all transactions for the different revenue types, i.e. Wireless, Bonus and Gift). I tried using the pandas groupby function, but it requires aggregation, which doesn't help solve the problem.
Can anyone provide any guidance/input on how to solve this?
I propose the following steps: get the unique combinations of Vendor and Type (a "distinct", e.g. with drop_duplicates). Once you have these unique combinations, loop through them, filter your dataframe, and export each filtered dataframe to its own Excel file.
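A minimal sketch of that approach, assuming 'Vendor' and 'Type' are the relevant column names and a hypothetical input file:

```python
import pandas as pd

# 'Vendor' and 'Type' are assumed column names taken from the question.
df = pd.read_excel("transactions.xlsx")

# The "distinct" step: unique Vendor/Type combinations
combos = df[["Vendor", "Type"]].drop_duplicates()

# Filter on each combination and export it to its own Excel file
for vendor, rev_type in combos.itertuples(index=False):
    subset = df[(df["Vendor"] == vendor) & (df["Type"] == rev_type)]
    subset.to_excel(f"{vendor}_{rev_type}.xlsx", index=False)
```

As an aside, groupby does not actually force you to aggregate: iterating over df.groupby(["Vendor", "Type"]) yields the same (key, subset) pairs directly.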
I have a large CSV file (300 MB) with data about accidents based on pincodes/zipcodes. The file is basically a header plus comma-separated values. Key fields are Month, Date, Year, Pincode, Count.
Count represents the accident count for that pincode; however, each pincode can get several entries through the day, say every few hours. So I want to be able to calculate the max accidents per pincode on a given date, i.e. I need to group by Month, Date, Year and Pincode and then sum over Count after grouping?
I have an idea of how to do this if I loaded the large-ish file into a database or a cloud service such as GCP BigQuery, but I want to be able to do this with Python/pandas dataframes and then store the metrics I am calculating in a table. Is this approach possible with pandas? If not, then PySpark is possibly my last option, but that involves the overhead of having to set up Hadoop, etc.
I am open to any other ideas as I am a PyNovice :)
Thank you
You can sign up for Databricks Community Edition (for free), which gives you a Spark-ready environment; it's also easy enough to upload your CSV file there.
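That said, a 300 MB CSV usually fits comfortably in pandas on an ordinary machine. A minimal sketch of the grouping described in the question (the file name and column names are assumptions):

```python
import pandas as pd

# File name and column names are assumptions based on the question.
df = pd.read_csv("accidents.csv")

# Total accident count per pincode per day
daily = (
    df.groupby(["Year", "Month", "Date", "Pincode"], as_index=False)["Count"]
      .sum()
)

# Store the computed metrics, e.g. as a new CSV (or df.to_sql for a database table)
daily.to_csv("daily_accident_counts.csv", index=False)
```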
I'm new to the community and I only recently started to use Python and more specifically Pandas.
For the data set I have, I would like the columns to be the dates. For each date I would like to have a customer list that then breaks down into more specific row elements. Everything would be rolled up by order number, i.e. a distinct count on order number, because sometimes a client purchases more than one item. In Excel I create a pivot table and process it by distinct order. Then I sort each row element by the distinct count of the order number. I collapse each row down until I just have the client name. If I click to expand the cell, then I see each row element.
So my question: if I'm pulling in these huge data sets as a dataframe, can I pull the xlsx in as an array? I know it will strip the values, so I would have to set the date as a datetime64 element. I've been trying to reshape the array with the date as the columns and the rows I want, but so far I haven't had luck. I have tried to use pivot_table and groupby with some success, but I wasn't able to move the date to the columns.
Summary: overall, what I'm looking to know is, am I going down the wrong rabbit hole altogether? I'm looking to basically create a collapsible pivot table, with specific color parameters for the table as well, so that the current spreadsheet will look identical to the one I'm automating.
I really appreciate any help; as I said, I'm brand new to pandas, so direction is key. I'd also like to know whether I'm onto the "best" way of dealing with the export to Excel after I've imported and modified the spreadsheet. I get a single sheet of raw data kicked out in .xlsx form. Thanks again!
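For what it's worth, a hedged sketch of the pivot_table reshaping described above, assuming hypothetical column names 'Client', 'Order Number' and 'Date'; the collapsible outline and colors would need openpyxl or XlsxWriter formatting on top of this:

```python
import pandas as pd

# Hypothetical column names; adjust to match the raw data sheet.
df = pd.read_excel("raw_data.xlsx")
df["Date"] = pd.to_datetime(df["Date"])

# Distinct order count per client, with one column per date
pivot = pd.pivot_table(
    df,
    index="Client",
    columns="Date",
    values="Order Number",
    aggfunc=pd.Series.nunique,  # distinct count of order numbers
    fill_value=0,
)

pivot.to_excel("pivot_output.xlsx")
```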
I have an Excel file with several tabs, each reporting quarterly account values.
I want to create a dataframe to group by account and report by Period.
I managed to generate a period index and read the file with pd.ExcelFile.parse (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.ExcelFile.parse.html), and I get a dict with tab names as keys and the period data as dataframes. So far, so good.
Now, when I loop through the dict to either append the different dataframes or concatenate them, pandas raises errors saying that only Series and DataFrames can be combined that way, not dict objects.
How can I generate one DF from the different dataframes in the dict?
thanks in advance,
kind regards,
Marc
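A minimal sketch of one way to do this, assuming the dict of tab name → DataFrame described above (the file name is hypothetical); pd.concat accepts such a dict directly, so no manual appending is needed:

```python
import pandas as pd

xls = pd.ExcelFile("accounts.xlsx")

# Parse every tab into a dict of {tab_name: DataFrame}
frames = {name: xls.parse(name) for name in xls.sheet_names}

# Concatenate all tabs; the tab name becomes the outer index level
combined = pd.concat(frames)

# Optionally turn that outer level into a regular 'Period' column
combined = combined.reset_index(level=0).rename(columns={"level_0": "Period"})
```

From there, combined.groupby("Period") or a groupby on the account column works as usual.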
Currently building a simple Customer Lifetime Value calculator program for marketing purposes. For a portion of the program, I give the user the option to import a CSV file via pd.read_csv to allow calculations across multiple customer records. I designate the required order of the CSV data in notes included in the output window.
The imported CSV should have 4 inputs per row. Building off of this, I want to create a new column in the dataframe that multiplies columns 1-4. Operating under the assumption that some users will include headers (that will vary per user) while others will not, is there a way I can create the new calculated column based on column # rather than header?
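One way to do this positionally is with iloc; a sketch that assumes the first four columns hold the numeric inputs (the file name and new column name are hypothetical):

```python
import pandas as pd

# Hypothetical file name; assumes the first four columns hold the inputs.
# If a file has no header row, read it with header=None so the first
# data row isn't swallowed as column names.
df = pd.read_csv("customer_data.csv")

# Select columns by position with iloc (works whatever the headers are),
# coerce them to numbers, and multiply them row-wise into a new column.
df["clv"] = df.iloc[:, 0:4].apply(pd.to_numeric, errors="coerce").prod(axis=1)
```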
Beginner here. None of the answers I have found have worked for me/been similar to my situation.
I have a worksheet with many duplicate rows, where only one column (the one that matters) differs between them.
Is there a function that will put each of the differing streams into new columns, with the header being the date of the stream?
Essentially, I would like to have each song as a row and the day's streams as a column in that row. Please see the attached image for the end result I would like to achieve.
If this is possible in Python, that would be great as well, as I am pulling the data via a Python script using openpyxl.
Thanks!
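If the data ends up in a pandas DataFrame (which works well alongside openpyxl), a minimal sketch of the reshape, assuming hypothetical column names 'Song', 'Date' and 'Streams':

```python
import pandas as pd

# Hypothetical column names; adjust to match the worksheet.
df = pd.read_excel("streams.xlsx")

# One row per song, one column per date, values are that day's streams.
# pivot_table collapses duplicate (Song, Date) rows by summing them.
wide = df.pivot_table(index="Song", columns="Date", values="Streams", aggfunc="sum")

wide.to_excel("streams_by_date.xlsx")
```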