I have an excel file with several tabs, each reporting quarterly account values.
I want to create a dataframe to group_by accounts and report by Period.
I managed to generate a period_index and read the file with pd.ExcelFile.parse (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.ExcelFile.parse.html), which gives me a dict with tab names as keys and the period data as dataframes. So far, so good.
Now, when I loop through the dict to either append or concatenate the different dataframes, pandas raises errors saying that only Series and DataFrame objects can be combined that way, not the dict object.
How can I generate one DF from the different dataframes in the dict?
thanks in advance,
kind regards,
Marc
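One possible approach, as a minimal sketch: pd.concat accepts a dict of dataframes directly and uses the keys as an outer index level, so no loop is needed. The tab names and the "account" / "value" columns below are assumptions for illustration, and the in-memory dict stands in for what the Excel parse returns.

```python
import pandas as pd

# Stand-in for the dict of sheets (tab name -> DataFrame) read from the workbook.
# Tab names and column names are hypothetical.
sheets = {
    "2019Q1": pd.DataFrame({"account": ["A", "B"], "value": [100.0, 200.0]}),
    "2019Q2": pd.DataFrame({"account": ["A", "B"], "value": [110.0, 190.0]}),
}

# Passing the dict itself: the keys become the outer level of the index.
combined = pd.concat(sheets, names=["period"])

# Turn the tab name into an ordinary column and discard the old row numbers.
combined = combined.reset_index(level="period").reset_index(drop=True)

# One row per account, one column per period.
report = combined.pivot_table(
    index="account", columns="period", values="value", aggfunc="sum"
)
print(report)
```

If a long format is preferred over a pivot, combined.groupby(["account", "period"])["value"].sum() gives the same numbers.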
I am wondering if there is a fast way to rearrange the rows of a csv using pandas so that they match the order of the rows in another csv that has the same data, but arranged differently. To be clear, these two csvs have the same data in the form of several numeric features spread across several columns. I tried loops that match each row of data with its counterpart by comparing the values in multiple columns, but this proved too slow for my purposes.
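You should use pandas DataFrames:
Use "read_csv" to read both files.
Both are then "DataFrame" objects ("read_csv" returns one directly).
Use "merge" to match the rows of one file against the other.
Use "to_csv" to save the result.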
Please share your data.
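Without the actual data, here is a rough sketch of the merge idea. It assumes the rows are unique on the feature columns and that the values match exactly (floats that differ by rounding would not match). The column names and the small in-memory frames are made up; in practice both would come from read_csv.

```python
import pandas as pd

# Stand-ins for pd.read_csv("reference.csv") and pd.read_csv("shuffled.csv");
# file names and columns are hypothetical.
reference = pd.DataFrame({"f1": [1, 2, 3], "f2": [10, 20, 30]})
shuffled = pd.DataFrame(
    {"f1": [3, 1, 2], "f2": [30, 10, 20], "label": ["c", "a", "b"]}
)

key_cols = ["f1", "f2"]

# A left merge keeps the row order of the left frame, so the rows of
# `shuffled` come out in the order they have in `reference`.
reordered = reference[key_cols].merge(shuffled, on=key_cols, how="left")

# reordered.to_csv("reordered.csv", index=False)  # save the result
print(reordered)
```

This replaces the row-by-row comparison with a single join, which is usually much faster than a Python loop.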
I have an Excel spreadsheet with the following columns:
I want to group this data by vendor and show all transaction and amount data for that vendor by Type (i.e. Wireless, Bonus etc.). For example, it should show all data for vendor 'A' classified by 'Type'. Once done, it should export this to separate Excel files (i.e. for vendor 'A', 3 Excel files are created showing all transactions for the different revenue types, i.e. Wireless, Bonus and Gift). I tried using the pandas groupby function, but it requires aggregation, which doesn't help solve the problem.
Can anyone provide any guidance or input on how to solve this?
I propose the following steps: use drop_duplicates (the pandas counterpart of a SQL DISTINCT) to get the unique combinations of Vendor and Type. Once you have these unique combinations, loop through them, filter your dataframe and export each filtered dataframe to its own Excel file.
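The steps above could look roughly like this. The column names ("Vendor", "Type", "Transaction", "Amount"), the sample rows and the file naming scheme are assumptions, since the real columns were not shown.

```python
import pandas as pd

# Hypothetical sample data standing in for the spreadsheet.
df = pd.DataFrame({
    "Vendor": ["A", "A", "A", "B"],
    "Type": ["Wireless", "Bonus", "Wireless", "Gift"],
    "Transaction": [1, 2, 3, 4],
    "Amount": [10.0, 5.0, 7.5, 3.0],
})

# Unique (Vendor, Type) combinations.
combos = df[["Vendor", "Type"]].drop_duplicates()

# Filter the full frame once per combination.
subsets = {}
for vendor, rev_type in combos.itertuples(index=False, name=None):
    mask = (df["Vendor"] == vendor) & (df["Type"] == rev_type)
    subsets[f"{vendor}_{rev_type}.xlsx"] = df[mask]

# Writing is left commented out; to_excel needs an Excel writer such as openpyxl.
# for filename, subset in subsets.items():
#     subset.to_excel(filename, index=False)
print(sorted(subsets))
```

Iterating over df.groupby(["Vendor", "Type"]) yields the same pieces without any aggregation, so groupby is also an option here.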
This is my first post here and it’s based upon an issue I’ve created and tried to solve at work. I’ll try to precisely summarize my issue as I’m having trouble wrapping my head around a preferred solution. #3 is a real stumper for me.
1. Grab a large data file based on a parquet - no problem
2. Select 5 columns from the parquet and create a dataframe - no problem
import pandas as pd
df = pd.read_parquet('/Users/marmicha/Downloads/sample.parquet',
                     columns=["ts", "session_id", "event", "duration", "sample_data"])
3. But here is where it gets a bit tricky for me. One column (a key column) is called "session_id". Many values are unique. Many duplicate values (of session_id) exist and have multiple associated rows of data. I wish to iterate through the master dataframe and create a unique dataframe per session_id. Each of these unique (sub) dataframes would have a calculation done that simply gets the SUM of the "duration" column per session_id. Again, that SUM would be unique per session_id, so each sub dataframe would have its own SUM, with a row added that lists that total along with the session_id. I'm thinking there is a nested loop formula that will work for me, but every effort has been a mess to date.
4. Ultimately, I'd like to have a final dataframe that is a collection of these unique sub dataframes. I guess I'd need to define this final dataframe and append each new sub dataframe to it as I iterate through the data. I should be able to do that simply.
5. Finally, write this final df to a new parquet file. Should be simple enough, so I won't need help with that.
But that is my challenge in a nutshell. The main design I’d need help with is #3. I’ve played with itertuples and iterrows.
I think the groupby function will work:
df.groupby('session_id')['duration'].sum()
More info here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
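To make that a bit more concrete, here is a small sketch on made-up data. It shows the per-session totals as their own dataframe, and alternatively the total attached to every original row. The "session_total" column name and the output path are assumptions.

```python
import pandas as pd

# Toy stand-in for the frame read from the parquet file; values are made up.
df = pd.DataFrame({
    "session_id": ["s1", "s2", "s1", "s3", "s2"],
    "duration": [5, 3, 7, 2, 4],
})

# One row per session_id with its summed duration.
totals = df.groupby("session_id", as_index=False)["duration"].sum()

# Or keep every original row and attach the per-session total as a column.
df["session_total"] = df.groupby("session_id")["duration"].transform("sum")

# Hypothetical output path; writing parquet needs pyarrow or fastparquet.
# totals.to_parquet("totals.parquet")
print(totals)
```

Either form avoids iterating over rows and building sub dataframes by hand.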
I need only some of the functionality of the pandas DataFrame and need to remove the rest or restrict users from using it. So, I am planning to write my own dataframe class which would only have a subset of the methods of pandas DataFrames.
The code for pandas DataFrame object can be found here.
Theoretically you could clone the repository and re-write sections of it. However, it's not a simple object and this may take a decent amount of reading into the code to understand how it works.
For example: pandas describes the dataframe object as a
Two-dimensional size-mutable, potentially heterogeneous tabular data
structure with labeled axes (rows and columns). Arithmetic operations
align on both row and column labels. Can be thought of as a dict-like
container for Series objects.
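A lighter alternative to rewriting pandas, offered only as a sketch: wrap a DataFrame in your own class and forward just the attributes you want to expose. The class name RestrictedFrame and the allowed set are hypothetical, and this does not stop a determined user from reaching the wrapped object.

```python
import pandas as pd


class RestrictedFrame:
    """Hypothetical wrapper exposing only a whitelist of DataFrame attributes."""

    _allowed = frozenset({"head", "shape", "columns", "describe"})

    def __init__(self, data):
        self._df = pd.DataFrame(data)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so _df and
        # _allowed themselves are resolved before this method runs.
        if name in self._allowed:
            return getattr(self._df, name)
        raise AttributeError(f"{name!r} is not available on RestrictedFrame")


rf = RestrictedFrame({"a": [1, 2], "b": [3, 4]})
print(rf.shape)
print(rf.head(1))
```

Composition like this keeps pandas itself untouched, so upgrades to pandas do not require re-applying your changes.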
I need to reconcile two separate dataframes. Each row within the two dataframes has a unique id that I am using to match the two dataframes. Without using a loop, how can I reconcile one dataframe against another and vice-versa?
I tried merging the two dataframes on an index (the unique id), but I run into problems when there are duplicate rows of data. Is there a way to identify duplicate rows of data and put that data into an array or export it to a CSV?
Your help is much appreciated. Thanks.
Try DataFrame.duplicated and DataFrame.drop_duplicates.
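A small sketch of how those two could fit together with a merge, assuming the id column is called "id" (the data and the CSV file name are made up):

```python
import pandas as pd

# Hypothetical data; "id" is the unique identifier used for matching.
left = pd.DataFrame({"id": [1, 2, 2, 3], "amount": [10, 20, 20, 30]})
right = pd.DataFrame({"id": [2, 3, 4], "amount": [20, 30, 40]})

# keep=False marks every row whose id occurs more than once.
dupes = left[left.duplicated(subset="id", keep=False)]
# dupes.to_csv("duplicates.csv", index=False)  # export the duplicates

# Drop the repeats, then reconcile in both directions with one outer merge.
left_unique = left.drop_duplicates(subset="id")
recon = left_unique.merge(
    right, on="id", how="outer", suffixes=("_left", "_right"), indicator=True
)

# The _merge column says where each id was found.
only_left = recon[recon["_merge"] == "left_only"]
only_right = recon[recon["_merge"] == "right_only"]
print(recon)
```

Rows marked "both" exist in the two dataframes, so their amount_left and amount_right columns can then be compared directly.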