I'm currently building a simple Customer Lifetime Value calculator program for marketing purposes. For a portion of the program, I give the user the option to import a CSV file via pd.read_csv to allow calculations across multiple customer records. I designate the required order of the CSV data in notes included in the output window.
The imported CSV should have 4 inputs per row. Building on this, I want to create a new column in the dataframe that multiplies columns 1-4. Operating under the assumption that some users will include headers (which will vary per user) while others will not, is there a way I can create the new calculated column based on column number rather than header?
Beginner here. None of the answers I have found have worked for me/been similar to my situation.
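A rough sketch of what I mean, with customers.csv as a placeholder path; whether positional access like this is the idiomatic way is exactly my question:

import pandas as pd

# Placeholder path; the real file is supplied by the user at runtime.
# NOTE: read_csv assumes a header row by default; headerless files
# would presumably need header=None instead.
df = pd.read_csv('customers.csv')

# Multiply the first four columns by position rather than by header name.
df['clv'] = df.iloc[:, :4].prod(axis=1)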
This is my first post here, and it's based on an issue I've run into and tried to solve at work. I'll try to precisely summarize my issue, as I'm having trouble wrapping my head around a preferred solution. #3 is a real stumper for me.
1. Grab a large data file based on a parquet - no problem.
2. Select 5 columns from the parquet and create a dataframe - no problem:
import pandas as pd

df = pd.read_parquet('/Users/marmicha/Downloads/sample.parquet',
                     columns=["ts", "session_id", "event", "duration", "sample_data"])
3. But here is where it gets a bit tricky for me. One column (a key column) is called "session_id". Many of its values are unique, but many duplicates also exist, each with multiple associated rows of data. I wish to iterate through the master dataframe and create a unique dataframe per session_id. Each of these unique (sub) dataframes would have a calculation done that simply gets the sum of the "duration" column for that session_id. That sum is unique per session_id, so each sub dataframe would have its own sum, with a row added listing that total along with the session_id. I'm thinking there is a nested loop that will work for me, but every effort has been a mess to date.
4. Ultimately, I'd like to have a final dataframe that is a collection of these unique sub dataframes. I guess I'd need to define this final dataframe and append each new sub dataframe to it as I iterate through the data. I should be able to do that simply enough.
5. Finally, write this final df to a new parquet file. That should be simple enough, so I won't need help with it.
But that is my challenge in a nutshell. The main step I'd need help with is #3. I've played with itertuples and iterrows, but without luck.
I think the groupby function will work:
df.groupby('session_id')['duration'].sum()
More info here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
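If you also want the result as a regular dataframe (one row per session_id with its total) and written back out, something along these lines should do it; the output path below is just a placeholder:

import pandas as pd

df = pd.read_parquet('/Users/marmicha/Downloads/sample.parquet',
                     columns=["ts", "session_id", "event", "duration", "sample_data"])

# One row per unique session_id with the summed duration; no nested loops needed.
totals = df.groupby('session_id', as_index=False)['duration'].sum()

# Placeholder output path.
totals.to_parquet('/Users/marmicha/Downloads/session_totals.parquet')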
In my Excel file, I have a list of some 7000-8000 binary chemical compounds (each consists of 2 elements only).
I have segregated them into their component elements, i.e., I have 2 columns of elements, namely First Element and Second Element.
I have attached a screenshot below:
Now I want to fill in the respective Atomic Number and Atomic Weight beside every element as per a predefined list using Python.
How do I do that?
I have attached a screenshot of my predefined list below, as well:
People have told me things like "use the csv package" or "use the pandas package", but I would request some more procedural help with respect to the above packages, or any other method you might suggest.
Also, if it cannot be done via Python, I am open to other languages as well.
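From what I've gathered so far, the pandas suggestion seems to amount to something like the sketch below, but I don't know if this is the right approach; the file and column names here are placeholders for my actual sheets:

import pandas as pd

# Placeholder file and column names.
compounds = pd.read_excel('compounds.xlsx')   # First Element / Second Element columns
elements = pd.read_excel('elements.xlsx')     # predefined list: Element, Atomic Number, Atomic Weight

# Index the predefined list by element symbol, then map it onto each column.
lookup = elements.set_index('Element')
for col in ['First Element', 'Second Element']:
    compounds[col + ' Atomic Number'] = compounds[col].map(lookup['Atomic Number'])
    compounds[col + ' Atomic Weight'] = compounds[col].map(lookup['Atomic Weight'])

compounds.to_excel('compounds_with_data.xlsx', index=False)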
I noticed that your task does not actually require Python programming. The reason is:
You already have a predefined list of items stored in an Excel sheet.
Excel already has a built-in function (VLOOKUP) for this task.
We just have to use the VLOOKUP function in the Atomic Number and Atomic Weight columns (you have to create those columns in the data2 sheet); it will take care of searching for a particular element's atomic number or weight and returning it in the active cell.
Next, use the fill handle to apply the function to all the cells (or, if the data is in a table, great! no need for the fill handle, because a table automatically applies the function to the whole column range).
I expect that you already know how to work with Excel formulas and functions; if not, comment down below for further assistance. Kindly upvote the answer if you liked it.
NOTE: If you need automation, then be sure to check out Excel VBA, Google Sheets, and Apps Script.
I'm new to the community, and I only recently started to use Python and, more specifically, Pandas.
For the data set I have, I would like the columns to be the date. For each date I would like to have a customer list that then breaks down into more specific row elements. Everything would be rolled up by an order number, specifically a distinct count on the order number, because sometimes a client purchases more than one item. In Excel I create a pivot table and process it by distinct order. Then I sort each row element by the distinct count of the order number. I collapse each row down until I just have the client name; if I click to expand the cell, I see each row element.
So my question: if I'm pulling these huge data sets in as a dataframe, can I pull the xlsx in as an array? I know it will strip the values, so I would have to set the date as a datetime64 element. I've been trying to reshape the array around the date being the column, with the rows I want, but so far I haven't had luck. I have tried to use pivot_table and groupby with some success, but I wasn't able to move the date to the columns.
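A rough sketch of the pivot_table call I've been attempting, with hypothetical file and column names (date, client, order_number) standing in for my real ones:

import pandas as pd

# Hypothetical file and column names.
df = pd.read_excel('raw_data.xlsx')
df['date'] = pd.to_datetime(df['date'])

# Distinct count of order numbers, rolled up by client, with dates as columns.
pivot = df.pivot_table(index='client',
                       columns='date',
                       values='order_number',
                       aggfunc=pd.Series.nunique)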
Summary: overall, what I'm looking to know is whether I'm going down the wrong rabbit hole altogether. I'm looking to basically create a collapsible pivot table, with specific color parameters for the table as well, so that the current spreadsheet will look identical to the one I'm automating.
I really appreciate any help; as I said, I'm brand new to Pandas, so direction is key. I'd also like to know whether I'm onto the "best" way of dealing with the export to Excel after I've imported and modified the spreadsheet. I get a single sheet of raw data kicked out in .xlsx form. Thanks again!
I am trying to identify specific rows in one data frame based on the rows in a second data frame. Each row in the second data frame specifies a unique filter. The filter criteria (which columns to use and which values) are known only at execution time and vary.
import pandas as pd

data = pd.DataFrame({'a': [0, 1, 2, 3], 'b': [4, 5, 6, 7], 'c': [9, 6, 4, 2]})
flt = pd.DataFrame({'a': [3, None, 0], 'c': [None, 2, 5]})
The intention is to generate the search criteria dynamically in a way that still allows vectorized processing, like:
data[data['a']==flt['a'].iloc[0]]
data[data['c']==flt['c'].iloc[1]]
data[(data['a']==flt['a'].iloc[2]) & (data['c']==flt['c'].iloc[2])]
I was thinking about a form of metaprogramming or templating that would generate the code on the fly, potentially as a string, and run it with exec. However, it seems that is considered a bad way to do things in Python?
The problem is that the 'real' application uses very large data frames, in particular for the data to be searched, O[millions by hundreds], and the combination of columns used for the search varies a lot: between 1 and up to a dozen columns. Also, flexibility and speed of search are crucial.
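For what it's worth, here is a minimal sketch of the exec-free direction I've been considering: build each filter row's mask dynamically by AND-ing together one boolean comparison per non-null column. I don't know whether this is fast enough for the real data sizes:

import pandas as pd

data = pd.DataFrame({'a': [0, 1, 2, 3], 'b': [4, 5, 6, 7], 'c': [9, 6, 4, 2]})
flt = pd.DataFrame({'a': [3, None, 0], 'c': [None, 2, 5]})

for _, row in flt.iterrows():
    crit = row.dropna()                        # only the columns this filter uses
    mask = pd.Series(True, index=data.index)   # start with all rows selected
    for col, val in crit.items():
        mask &= data[col] == val               # AND in each column criterion
    print(data[mask])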
I have a worksheet with many rows that are duplicates except for one important column.
Is there a function that will put each of the differing streams into new columns, with the header being the date of the stream?
Essentially, I would like to have each song as a row and the day's streams as a column in that row. Please see the attached image for the end result I would like to achieve.
If this is possible in Python, that would be great as well, as I am pulling the data via a Python script using openpyxl.
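If it helps, here is roughly the reshape I have in mind in pandas, assuming hypothetical columns named song, date, and streams:

import pandas as pd

# Hypothetical file and column names.
df = pd.read_excel('streams.xlsx')

# One row per song, one column per date, stream counts as the values.
wide = df.pivot_table(index='song', columns='date', values='streams', aggfunc='sum')
wide.to_excel('streams_wide.xlsx')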
Thanks!