Test ANOVA on multiple groups - python

I have the following dataframe:
I would like to use this code to compare the means across all the columns of my dataframe:
F_statistic, pVal = stats.f_oneway(percentage_age_ss.iloc[:, 0:1],
                                   percentage_age_ss.iloc[:, 1:2],
                                   percentage_age_ss.iloc[:, 2:3],
                                   percentage_age_ss.iloc[:, 3:4])  # etc...
However, I don't want to write .iloc for every column because it takes too much time. Is there another way to do it?
Thanks

Build the column Series with a generator expression, then use star syntax to expand them into the argument list:
stats.f_oneway(*(percentage_age_ss[col] for col in percentage_age_ss.columns))
or, just
stats.f_oneway(*(percentage_age_ss.T.values))
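For reference, a minimal self-contained sketch of either call; the percentage_age_ss frame below is made up, since the original isn't shown:
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical stand-in for the question's dataframe: 4 columns of values
percentage_age_ss = pd.DataFrame(np.random.default_rng(0).random((50, 4)),
                                 columns=['a', 'b', 'c', 'd'])

# Unpack one Series per column into f_oneway
F_statistic, pVal = stats.f_oneway(
    *(percentage_age_ss[col] for col in percentage_age_ss.columns))
print(F_statistic, pVal)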

Get specific elements of a list using a list iteration

I'm trying to build the below dataframe
df = pd.DataFrame(columns=['Year','Revenue','Gross Profit','Operating Profit','Net Profit'])
rep_vals =['year','net_sales','gross_income','operating_income','profit_to_equity_holders']
for i in range(len(yearly_reports)):
    df.loc[i] = [yearly_reports[i].x for x in rep_vals]
However, I get the error 'Report' object has no attribute 'x'.
The brute-force version of the code below works:
for i in range(len(yearly_reports)):
    df.loc[i] = [yearly_reports[i].year, yearly_reports[i].net_sales,
                 yearly_reports[i].gross_income, yearly_reports[i].operating_income,
                 yearly_reports[i].profit_to_equity_holders]
My issue is that I want to add a lot more columns, and I don't want to fetch every item from my yearly_reports into the dataframe. How can I iterate over just the values I want more efficiently?
Instead of using .x (which looks up an attribute literally named x), fetch each attribute by its string name with getattr:
getattr(yearly_reports[i], x)
Also, it is probably a bad idea / not necessary / slow to iterate over your dataframe like this. Have a look at join/merge, which might be a lot faster.
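A minimal sketch of the getattr route, assuming each item in yearly_reports exposes these fields as attributes and that rep_vals corresponds, in order, to the column names from the question (a SimpleNamespace stands in for the real Report class here):
import pandas as pd
from types import SimpleNamespace

# Hypothetical stand-in for the question's Report objects
yearly_reports = [SimpleNamespace(year=2020, net_sales=100, gross_income=40,
                                  operating_income=25, profit_to_equity_holders=15)]
rep_vals = ['year', 'net_sales', 'gross_income', 'operating_income',
            'profit_to_equity_holders']
cols = ['Year', 'Revenue', 'Gross Profit', 'Operating Profit', 'Net Profit']

# Fetch each attribute by its string name with getattr
rows = [[getattr(report, name) for name in rep_vals] for report in yearly_reports]
df = pd.DataFrame(rows, columns=cols)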

how to find value inside pandas columns

I have a pandas column, 'function', of job functions:
IT, HR, etc.
but I have them in a few variations for each function
('IT application', 'IT,Digital,Digital', etc.).
I want to change all values that contain IT to just IT, for example.
I tried:
df['function'].str.contains('IT')
df['function'].isin(['IT'])
which gives only partial results.
I wanted something like:
'IT' in df.loc[:,'function']
but a solution that would work for the whole column and not for one index at a time.
if there is a solution that doesn't need a loop that would be great.
This should work (the pattern .*IT.* matches any value that contains IT anywhere, and the whole value is replaced):
df['function'] = df['function'].str.replace(r'^.*IT.*$', 'IT', regex=True)
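A quick illustration on made-up values, since the real column isn't shown:
import pandas as pd

df = pd.DataFrame({'function': ['IT application', 'IT,Digital,Digital', 'HR admin']})
df['function'] = df['function'].str.replace(r'^.*IT.*$', 'IT', regex=True)
print(df['function'].tolist())  # ['IT', 'IT', 'HR admin']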

Pandas, accessing every nth element in nested array

I have a dataframe of many rows and 4 columns. Each column contains an array of 100 values.
My intuitive way of doing this is the same way I would do it with multi-dimensional numpy arrays.
For example, I want the first element of every array in column1. So I say
df["column1"][:][0]
To me this makes sense: first select the column, then take every array, then take the first element of every array.
However, it just doesn't work at all. Instead, it simply spits out the entire array from column1, row 1.
But - and this is the most frustrating thing - if I say:
df["column1"][1][0]
It gives me EXACTLY what I expect based on my expected logic, as in, I get the first element in the array in the second row in column1.
How can I get every nth element in every array in column1?
The reason that df["column1"][:][0] isn't doing what you expect is that df["column1"][:] returns a Series. With a Series, using bracket indexing returns the item of the series at that index.
If you want a Series where each item is the element of the corresponding array at that index, the correct solution - whether it seems intuitive or not - is to use .str[...] on the Series.
Instead of
df["column1"][:][0]
use this:
df["column1"].str[0]
It might seem like .str should only be used for actual str values, but a neat trick is that it works for lists too.
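For example, on a made-up column of lists (the original data isn't shown):
import pandas as pd

df = pd.DataFrame({'column1': [[10, 11, 12], [20, 21, 22], [30, 31, 32]]})
print(df['column1'].str[0])  # first element of every array: 10, 20, 30
print(df['column1'].str[2])  # works for any nth element: 12, 22, 32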
Here are some ways to do this:
[item[0] for item in df['column1']] # will result in a list
or
df['column1'].apply(lambda item: item[0]) # will result in a series
Not sure if you're looking for a way that's similar to slicing, but as far as I understand, pandas sees the lists in your table as arbitrary objects, not something it provides any sugar for.
Of course, you can do other fancy things by creating a data frame out of your column:
pd.DataFrame(df['column1'].tolist())
And then do whatever you want with it.
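For instance, a sketch using the same kind of made-up column of lists as above:
import pandas as pd

df = pd.DataFrame({'column1': [[10, 11, 12], [20, 21, 22], [30, 31, 32]]})
expanded = pd.DataFrame(df['column1'].tolist(), index=df.index)  # one column per position
print(expanded[0])           # first element of every array
print(expanded.iloc[:, 1:])  # or slice out several positions at once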

Filter Dataframe using regular expression

I have a Dataframe with a column whose values are separated by semicolons, e.g. Patient1_Control2; Patient1_Patient3; Control1_Control3. However, I only want the rows with PatientX_ControlX or ControlX_PatientX. I don't want ControlX_ControlX or PatientX_PatientX. I thought of the method filter(regex='...'), but this does not quite do the job. I want to filter the dataframe by a regular expression matching PatientX_ControlX or ControlX_PatientX (X meaning any string). Is there any method for that? Thanks so much in advance. I'm still learning how to code, so every tip would be great. If you have any sources where I can learn more about regular expressions, that'd be amazing!
Filter the column to keep only the rows that contain the relevant values -
df[df["data"].str.contains(r'Patient\d+_Control\d+|Control\d+_Patient\d+')]
For the following dataframe -
df = pd.DataFrame({"data": ["Patient1_Control2", "Patient1_Patient3", "Control1_Patient3", "Control1_Control3"]})
df[df["data"].str.contains(r'Patient\d+_Control\d+|Control\d+_Patient\d+')]
Output is -
data
0 Patient1_Control2
2 Control1_Patient3

Python Pandas: .apply taking forever?

I have a DataFrame 'clicks' created by parsing a 1.4 GB CSV. I'm trying to create a new column 'bought' using the apply function.
clicks['bought'] = clicks['session'].apply(getBoughtItemIDs)
In getBoughtItemIDs, I'm checking whether the 'buys' dataframe has the values I want, and if so, returning a string concatenating them. The first line in getBoughtItemIDs is taking forever. What are the ways to make it faster?
def getBoughtItemIDs(val):
    boughtSessions = buys[buys['session'] == val].values
    output = ''
    for row in boughtSessions:
        output += str(row[1]) + ","
    return output
There are a couple of things that make this code run slowly.
apply is essentially just syntactic sugar for a for loop over the rows of a column. There's also an explicit for loop over a NumPy array in your function (the for row in boughtSessions part). Looping in this (non-vectorised) way is best avoided whenever possible as it impacts performance heavily.
buys[buys['session'] == val].values is looking up val across an entire column for each row of clicks, then returning a sub-DataFrame and then creating a new NumPy array. Repeatedly looking for values in this way is expensive (O(n) complexity each lookup). Creating new arrays is going to be expensive since memory has to be allocated and the data copied across each time.
If I understand what you're trying to do, you could try the following approach to get your new column.
First use groupby to group the rows of buys by the values in 'session'. apply is used to join up the strings for each value:
boughtSessions = buys.groupby('session')[col_to_join].apply(lambda x: ','.join(x))
where col_to_join is the column from buys which contains the values you want to join together into a string.
groupby means that only one pass through the DataFrame is needed and is pretty well-optimised in Pandas. The use of apply to join the strings is unavoidable here, but only one pass through the grouped values is needed.
boughtSessions is now a Series of strings indexed by the unique values in the 'session' column. This is useful because lookups to Pandas indexes are O(1) in complexity.
To match each string in boughtSessions to the appropriate value in clicks['session'] you can use map. Unlike apply, map is fully vectorised and should be very fast:
clicks['bought'] = clicks['session'].map(boughtSessions)
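Putting the two steps together on made-up frames; the item column name here, item_id, is just a placeholder for whatever col_to_join is in the real buys data:
import pandas as pd

# Hypothetical stand-ins for the real frames
buys = pd.DataFrame({'session': [1, 1, 2], 'item_id': [11, 12, 13]})
clicks = pd.DataFrame({'session': [1, 2, 3]})

# One pass over buys: join each session's item ids into a string
boughtSessions = buys.groupby('session')['item_id'].apply(lambda x: ','.join(x.astype(str)))

# Vectorised lookup of every click's session against that index
clicks['bought'] = clicks['session'].map(boughtSessions)
print(clicks)  # sessions with no matching buys get NaN in 'bought'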
