Iterating through big data with pandas, large and small dataframes - python

This is my first post here, and it's based on an issue I've run into and tried to solve at work. I'll try to summarize my issue precisely, as I'm having trouble wrapping my head around a preferred solution. Step #3 below is a real stumper for me.
1. Grab a large data file based on a parquet - no problem
2. Select 5 columns from the parquet and create a dataframe - no problem
import pandas as pd

df = pd.read_parquet('/Users/marmicha/Downloads/sample.parquet',
                     columns=["ts", "session_id", "event", "duration", "sample_data"])
3. But here is where it gets a bit tricky for me. One column (a key column) is called "session_id". Many values are unique, but many duplicate session_id values also exist, each with multiple associated rows of data. I want to iterate through the master dataframe and create a unique dataframe per session_id. Each of these unique (sub) dataframes would have a calculation done on it that simply gets the SUM of the "duration" column for that session_id. That SUM would be unique per session_id, so each sub dataframe would have its own SUM, with a row added listing that total along with the session_id. I'm thinking there is a nested loop that will work for me, but every effort has been a mess to date.
4. Ultimately, I'd like to have a final dataframe that is a collection of these unique sub dataframes. I guess I'd need to define this final dataframe and append each new sub dataframe to it as I iterate through the data. I should be able to do that simply.
5. Finally, write this final df to a new parquet file. That should be simple enough, so I won't need help with that.
But that is my challenge in a nutshell. The main design I'd need help with is #3. I've played with itertuples and iterrows.

I think the groupby function will work:
df.groupby('session_id')['duration'].sum()
More info here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
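For instance, a minimal sketch of that approach (the output file name here is just a placeholder):
import pandas as pd

df = pd.read_parquet('/Users/marmicha/Downloads/sample.parquet',
                     columns=["ts", "session_id", "event", "duration", "sample_data"])

# one row per session_id, holding the total of its "duration" values
totals = (df.groupby("session_id", as_index=False)["duration"]
            .sum()
            .rename(columns={"duration": "total_duration"}))

# write the summary out as a new parquet file
totals.to_parquet("session_totals.parquet", index=False)
If you still want the per-row detail together with each session's total, df.merge(totals, on="session_id") attaches the total to every row of its session, which avoids building sub dataframes in a loop.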

Related

Partition Data By Year/Month Column without Adding Columns to Result - pyspark/databricks

I have a dataframe in pyspark (and databricks) with the following schema structure:
orders schema:
submitted_at:timestamp
submitted_yyyy_mm:string (in the format "yyyy-MM")
order_id:string
customer_id:string
sales_rep_id:string
shipping_address_attention:string
shipping_address_address:string
shipping_address_city:string
shipping_address_state:string
shipping_address_zip:integer
ingest_file_name:string
ingested_at:timestamp
I need to capture the data in my table in delta lake format, with a partition for every month of the order history reflected in the data of the submitted_yyyy_mm column. I am capturing the data correctly with the exception of two problems. One, my technique is adding two columns (and corresponding data) to the schema (could not figure out how to do the partitioning without adding columns). Two, the partitions correctly capture all the year/months with data, but are missing the year/months without data (requirement is those need to be included also). Specifically, all the months of 2017-2019 should have their own partition (so 36 months). However, my technique only created partitions for those months that actually had orders (which turned out to be 18 of the 36 months of the years 2017-2019).
Here is the relevant part of my code:
from pyspark.sql import functions as F, types as T

# take the pristine order table and add these two extra columns you should not have
# in order to get the partition structure
df_with_year_and_month = (df_orders
    .withColumn("year", F.year(F.col("submitted_yyyy_mm").cast(T.TimestampType())))
    .withColumn("month", F.month(F.col("submitted_yyyy_mm").cast(T.TimestampType()))))

# capture the data to the orders table using the year/month partitioning
df_with_year_and_month.write.partitionBy("year", "month").mode("overwrite").format("delta").saveAsTable(orders_table)
I would be grateful to anyone who might be able to help me tweak my code to fix the two issues I have with the result. Thank you.
There's no issue here. That's just how it works.
You want to partition on year and month, so you need those values in your data; there's no way around it. You should also only partition on values you want to filter on, since that is what enables partition pruning and results in faster queries. It would make no sense to partition on a field without a related value.
It's also totally normal that partitions aren't created where there is no data for them. Once data is added, the corresponding partition is created if it doesn't exist yet. You don't need it any sooner than that.
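If the aim is simply one partition per month without introducing derived columns, one possible sketch (reusing df_orders and orders_table from the question, and assuming a partition per yyyy-MM value is acceptable) is to partition on the column that already exists:
# partition directly on the existing yyyy-MM column; the partition column itself
# stays part of the schema, which is unavoidable
(df_orders
    .write
    .partitionBy("submitted_yyyy_mm")
    .mode("overwrite")
    .format("delta")
    .saveAsTable(orders_table))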

Strategy for creating pivot tables that collapse with large data sets

I'm new to the community and I only recently started to use Python and more specifically Pandas.
In the data set I have, I would like the columns to be the date. For each date I would like to have a customer list that then breaks down into more specific row elements. Everything would be rolled up by order number (a distinct count on order number, because sometimes a client purchases more than one item). In Excel I create a pivot table and process it by distinct order. Then I sort each row element by the distinct count of the order number. I collapse each row down until I just have the client name. If I click to expand the cell, I see each row element.
So my question: if I'm pulling in these huge data sets as a dataframe, can I pull the xlsx in as an array? I know it will strip the values, so I would have to set the date as a datetime64 element. I've been trying to reshape the array around the date being the column, and the rows I want, but so far I haven't had luck. I have tried to use pivot_table and groupby with some success, but I wasn't able to move the date to the columns.
Summary: Overall, what I'm looking to know is whether I'm going down the wrong rabbit hole altogether. I'm looking to basically create a collapsible pivot table with specific color parameters for the table as well, so that the current spreadsheet will look identical to the one I'm automating.
I really appreciate any help; as I said, I'm brand new to Pandas, so direction is key. I'd also like to know whether I'm onto the "best" way of dealing with the export to Excel after I've imported and modified the spreadsheet. I get a single sheet of raw data kicked out in .xlsx form. Thanks again!
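A rough sketch of the pivot_table direction, assuming hypothetical column names order_date, customer and order_id in the raw export:
import pandas as pd

# file name and column names are assumptions about the raw export
raw = pd.read_excel("raw_export.xlsx", parse_dates=["order_date"])

# distinct count of order numbers per customer, with one column per date
pivot = pd.pivot_table(raw,
                       index="customer",
                       columns="order_date",
                       values="order_id",
                       aggfunc="nunique",
                       fill_value=0)

pivot.to_excel("pivot_output.xlsx")
Collapsible row groups and cell colors are Excel-level features, so they would still have to be added through the writer engine (xlsxwriter or openpyxl) after the values are written.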

Is there a way to map 2 dataframes onto each other to produce a rearranged dataframe (with one dataframe's values acting as the new column names)?

I have one dataframe of readings that come in a particular arrangement due to the nature of the experiment. I also have another dataframe, that contains information about each point on the dataframe and what each point corresponds to, in terms of what chemical was at that point. Note, there are only a few different chemicals over the dataframe, but they are arranged all over the dataframe.
What I want to do is create a new, reorganised dataframe where the columns are the type of chemical. My initial thought was to compare the data and information dataframes to produce a dictionary, which I could then transform into a new dataframe. I could not figure out how to do this, and it might not be the best approach either!
I have previously achieved it by manually rearranging the points on the dataframe to match the pattern I want, but I'm not happy with this approach and there must be a better way.
Thanks in advance for any help!
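One possible sketch, using two tiny made-up dataframes in place of the real readings and chemical-label grids:
import pandas as pd

# hypothetical example data: readings and labels share the same grid layout
readings = pd.DataFrame([[1.2, 3.4], [5.6, 7.8]])
labels = pd.DataFrame([["A", "B"], ["B", "A"]])

# flatten both grids in the same order and pair each reading with its chemical
long = pd.DataFrame({
    "chemical": labels.to_numpy().ravel(),
    "value": readings.to_numpy().ravel(),
})

# one column per chemical; columns are NaN-padded if the counts differ
rearranged = pd.DataFrame({chem: grp["value"].reset_index(drop=True)
                           for chem, grp in long.groupby("chemical")})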

Adding Calculated Column in Pandas Dataframe via Indexing Column #s

Currently building a simple Customer Lifetime Value calculator program for marketing purposes. For a portion of the program, I give the user the option to import a CSV file via pd.read_csv to allow calculations across multiple customer records. I designate the required order of the CSV data in notes included in the output window.
The imported CSV should have 4 inputs per row. Building off of this, I want to create a new column in the dataframe that multiplies columns 1-4 together. Operating under the assumption that some users will include headers (which will vary per user) while others will not, is there a way I can create the new calculated column based on column number rather than header?
Beginner here. None of the answers I have found have worked for me/been similar to my situation.
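A short sketch of the positional approach; the file path and the new column name "clv" are placeholders:
import pandas as pd

df = pd.read_csv("customer_data.csv")

# .iloc selects by position, so the header names (or their absence) don't matter:
# multiply the first four columns row-wise into a new column
df["clv"] = df.iloc[:, 0:4].prod(axis=1)
For files without a header row, pd.read_csv(path, header=None) keeps the first row as data, and .iloc behaves the same either way.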

How can I manipulate tabular data with python?

I have column-based data in a CSV file and I would like to manipulate it in several ways. People have pointed me to R because it gives you easy access to both rows and columns, but I am already familiar with Python and would rather use it.
For example, I want to be able to delete all the rows that have a certain value in one of the columns. Or I want to change all the values of one column (i.e., trim the string). I also want to be able to aggregate rows based on common values (like a SQL GROUP BY).
Is there a way to do this in python without having to write a loop to iterate over all of the rows each time?
Look at the pandas library. It provides a DataFrame type similar to R's data.frame that lets you do the kind of thing you're talking about.
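For example, a small sketch of the three operations mentioned above, with made-up column names:
import pandas as pd

# column names ("status", "name", "category", "amount") are made up for illustration
df = pd.read_csv("data.csv")

# delete all the rows that have a certain value in one of the columns
df = df[df["status"] != "cancelled"]

# change all the values of one column (e.g. trim the string)
df["name"] = df["name"].str.strip()

# aggregate rows based on common values, like a SQL GROUP BY
totals = df.groupby("category")["amount"].sum()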
