Finding mean of consecutive column data - python

I have the following data:
(the data given here is just representational)
`
I want to do the following with this data:
I want to get column only after the 201
i.e. I want to remove the 200-1 to 200-4 column data.
One way to do this is to retrieve only the required column while reading the data from excel, but I want to know how we can filter the column name on the basis of a particular pattern as 200-1 to 200-4 column name has pattern 200-*
I want to make a column after 202-4 which stores the values in the following ways:
201q1= mean of (201-1 and 201-2)
201q2 = mean of(201-3 and 201-4)
Similarly, if 202-1 to 201-4 data would have been there, a similar column should have been formed.
Please help.
Thanks in advance for your support.

This is a rough example but it will get you close. The example assume that there are always four columns per group:
#sample data
np.random.seed(1)
df = pd.DataFrame(np.random.randn(2,12), columns=['200-1','200-2','200-3','200-4', '201-1', '201-2', '201-3','201-4', '202-1', '202-2', '202-3','202-4'])
# remove 200-* columns
df2 = df[df.columns[~df.columns.str.contains('200-')]]
# us np.arange to create groups
new = df2.groupby(np.arange(len(df2.columns))//2, axis=1).mean()
# rename columns
new.columns = [f'{v}{k}' for v,k in zip([x[:3] for x in df2.columns[::2]], ['q1','q2']*int(len(df2.columns[::2])/2))]
# join
df2.join(new)
201-1 201-2 201-3 201-4 202-1 202-2 202-3 \
0 0.865408 -2.301539 1.744812 -0.761207 0.319039 -0.249370 1.462108
1 -0.172428 -0.877858 0.042214 0.582815 -1.100619 1.144724 0.901591
202-4 201q1 201q2 202q1 202q2
0 -2.060141 -0.718066 0.491802 0.034834 -0.299016
1 0.502494 -0.525143 0.312514 0.022052 0.702043

For step 1, you can get away with list comprehension, and the pandas drop function:
dropcols = [x for x in df.columns if '200-' in x]
df.drop(dropcols, axis=1, inplace=True)
Steps 3 and 4 are similar, you could calculate the rolling mean of the columns:
df2 = df.rolling(2, axis = 1).mean() # creates rolling mean
df2.columns = [x.replace('-', 'q') for x in df2.columns] # renames the columns
dfans = pd.concat([df, df2], axis = 1) # concatenate the columns together
Now, you just need to remove the columns that you dont want and rename them.

Related

How to concatenate values from many columns into one column when one doesn't know the number of columns will have

My application saves an indeterminate number of values in different columns. As a results, I have a data frame with a certain number of columns at the beginning but then from a particular column (that I know) I will have an uncertain number of columns saving same data
Example:
known1 known2 know3 unknow1 unknow2 unknow3 ...
1 3 3 data data2 data3
The result I would like to get should be something like this
known1 known2 know3 all_unknow
1 3 3 data,data2,data3
How can I do this when I don't know the number of unknown columns but what I do know is this will occur (in this example) from the 4th column.
IIUC, use filter to select the columns by keyword:
cols = list(df.filter(like='unknow'))
# ['unknow1', 'unknow2', 'unknow3']
df['all_unknow'] = df[cols].apply(','.join, axis=1)
df = df.drop(columns=cols)
or take all columns from the 4th one:
cols = df.columns[3:]
df['all_unknow'] = df[cols].apply(','.join, axis=1)
df = df.drop(columns=cols)
output:
known1 known2 know3 all_unknow
0 1 3 3 data,data2,data3
df['all_unknown'] = df.iloc[:, 3:].apply(','.join, axis=1)
if you also want to drop all columns after the 4th:
cols = df.columns[3:-1]
df.drop(cols, axis=1)
the -1 is to avoid dropping the new column

Pandas - contains from other DF

I have 2 dataframes:
DF A:
and DF B:
I need to check every row in the DFA['item'] if it contains some of the values in the DFB['original'] and if it does, then add new column in DFA['my'] that would correspond to the value in DFB['my'].
So here is the result I need:
I tought of converting the DFB['original'] into list and then use regex, but this way I wont get the matching result from column 'my'.
Ok, maybe not the best solution, but it seems to be working.
I did cartesian join and then check the records which contains the data needed
dfa['join'] = 1
dfb['join'] = 1
dfFull = dfa.merge(dfb, on='join').drop('join' , axis=1)
dfFull['match'] = dfFull.apply(lambda x: x.original in x.item, axis = 1)
dfFull[dfFull['match']]

how to efficiently decode arrays to columns in pandas dataframe

I have a function that produces results for every month of a year. In my dataframe I collect these results for different data columns. After that, I have a dataframe containing multiple columns with arrays as values. Now I want to "pivot" those columns to have each value in its own column.
For example, if a row contains values [1,2,3,4,5,6,7,8,9,10,11,12] in column 'A', I want to have twelve columns 'A_01', 'A_02', ..., 'A_12' that each contain one value from the array.
My current code is this:
# create new columns
columns_to_add = []
column_count = len(columns_to_process)
for _, row in df[columns_to_process].iterrows():
columns_to_add += [[row[name][offset] if type(row[name]) == list else row[name]
for offset in range(array_len) for name in range(column_count)]]
new_df = pd.DataFrame(columns_to_add,
columns=[name+'_'+str(offset+1) for offset in range(array_len)
for name in columns_to_process],
index=df.index) # make dataframe addendum
(note: some rows don't have any values, so I had to put the condition if type() == list into the iteration)
But this code is awfully slow. I believe there must be a much more elegant solution. Can you show me such a solution?
IIUC, use Series.tolist with the pandas.DataFrame constructor.
We'll use DataFrame.rename as well to fix your column name format.
# Setup
df = pd.DataFrame({'A': [ [1,2,3,4,5,6,7,8,9,10,11,12] ]})
pd.DataFrame(df['A'].tolist()).rename(columns=lambda x: f'A_{x+1:0>2d}')
[out]
A_01 A_02 A_03 A_04 A_05 A_06 A_07 A_08 A_09 A_10 A_11 A_12
0 1 2 3 4 5 6 7 8 9 10 11 12

Wide to long returns empty output - Python dataframe

I have a dataframe which can be generated from the code as given below
df = pd.DataFrame({'person_id' :[1,2,3],'date1':
['12/31/2007','11/25/2009','10/06/2005'],'val1':
[2,4,6],'date2': ['12/31/2017','11/25/2019','10/06/2015'],'val2':[1,3,5],'date3':
['12/31/2027','11/25/2029','10/06/2025'],'val3':[7,9,11]})
I followed the below solution to convert it from wide to long
pd.wide_to_long(df, stubnames=['date', 'val'], i='person_id',
j='grp').sort_index(level=0)
Though this works with sample data as shown below, it doesn't work with my real data which has more than 200 columns. Instead of person_id, my real data has subject_ID which is values like DC0001,DC0002 etc. Does "I" always have to be numeric? Instead it adds the stub values as new columns in my dataset and has zero rows
This is how my real columns looks like
My real data might contains NA's as well. So do I have to fill them with default values for wide_to_long to work?
Can you please help as to what can be the issue? Or any other approach to achieve the same result is also helpful.
Try adding additional argument in the function which allows the strings suffix.
pd.long_to_wide(.......................,suffix='\w+')
The issue is with your column names, the numbers used to convert from wide to long need to be at the end of your column names or you need to specify a suffix to groupby. I think the easiest solution is to create a function that accepts regex and the dataframe.
import pandas as pd
import re
def change_names(df, regex):
# Select one of three column groups
old_cols = df.filter(regex = regex).columns
# Create list of new column names
new_cols = []
for col in old_cols:
# Get the stubname of the original column
stub = ''.join(re.split(r'\d', col))
# Get the time point
num = re.findall(r'\d+', col) # returns a list like ['1']
# Make new column name
new_col = stub + num[0]
new_cols.append(new_col)
# Create dictionary mapping old column names to new column names
dd = {oc: nc for oc, nc in zip(old_cols, new_cols)}
# Rename columns
df.rename(columns = dd, inplace = True)
return df
tdf = pd.DataFrame({'person_id' :[1,2,3],'h1date': ['12/31/2007','11/25/2009','10/06/2005'],'t1val': [2,4,6],'h2date': ['12/31/2017','11/25/2019','10/06/2015'],'t2val':[1,3,5],'h3date': ['12/31/2027','11/25/2029','10/06/2025'],'t3val':[7,9,11]})
# Change date columns
tdf = change_names(tdf, 'date$')
tdf = change_names(tdf, 'val$')
print(tdf)
person_id hdate1 tval1 hdate2 tval2 hdate3 tval3
0 1 12/31/2007 2 12/31/2017 1 12/31/2027 7
1 2 11/25/2009 4 11/25/2019 3 11/25/2029 9
2 3 10/06/2005 6 10/06/2015 5 10/06/2025 11
This is quite late to answer this question. But putting the solution here in case someone else find it useful
tdf = pd.DataFrame({'person_id' :[1,2,3],'h1date': ['12/31/2007','11/25/2009','10/06/2005'],'t1val': [2,4,6],'h2date': ['12/31/2017','11/25/2019','10/06/2015'],'t2val':[1,3,5],'h3date': ['12/31/2027','11/25/2029','10/06/2025'],'t3val':[7,9,11]})
## You can use m13op22 solution to rename your columns with numeric part at the
## end of the column name. This is important.
tdf = tdf.rename(columns={'h1date': 'hdate1', 't1val': 'tval1',
'h2date': 'hdate2', 't2val': 'tval2',
'h3date': 'hdate3', 't3val': 'tval3'})
## Then use the non-numeric portion, (in this example 'hdate', 'tval') as
## stubnames. The mistake you were doing was using ['date', 'val'] as stubnames.
df = pd.wide_to_long(tdf, stubnames=['hdate', 'tval'], i='person_id', j='grp').sort_index(level=0)
print(df)

How to take mean,std and mad of multiple dataframes that are appended in a list?

I've several hundred dataframes that are appended in a list. All the dataframes have same number of columns but the number of rows are not same. The column names are also same.
So i want to take the mean, mad, std of column value of each column and i'm doing something like this:
All the dataframes are appended in a list (lst)
lst = []
for filen, filen1 in zip(filelistn, filelist1):
df1 = pd.read_table(path_to_files+filen, skiprows=0, usecols=(0,1,2,3,4,8),names=['wave','num','stlines','fwhm','EWs','MeasredWave'],delimiter=r'\s+')
df2 = pd.read_table(path_to_files1+filen1, skiprows=0, usecols=(0,1,2,3,4,8),names=['wave','num','stlines','fwhm','EWs','MeasredWave'],delimiter=r'\s+')
dfs = pd.merge(df1,df2, on='wave', how='inner')
dfs = df1 - df2
lst.append(dfs)
df = reduce(lambda x, y: pd.merge(x, y, on = 'wave',how='outer'), lst)
df = df.rename(columns = lambda x: x.split('_')[0]).T
df = df.groupby(df.index).agg(['mean','std','mad','median']).T
But the results that i'm getting are a bit weird, Like in column mad there are values like 21,65,36 which is absurd.
wave mean median mad
0 4050.32 -0.016182 -0.011940 0.008885
1 4208.98 0.023707 0.007189 0.032585
2 4374.94 -0.001321 -0.001196 0.000378
3 4379.74 0.002778 0.003380 0.004685
4 6828.60 -10.604568 -0.000590 21.084799
5 6839.84 -0.003466 -0.001870 0.010169
6 6842.04 -32.751551 -0.002514 65.118329
7 6842.69 18.293519 -0.002158 36.385884
The column wave is same in all the dataframes, but the number of rows are not. Does it has anything to do with that? May be it's taking the mean of the wrong rows?
Can anyone tell me how to solve this?
You can use pandas.concat to concatenate the sequence of data frames into one large data frame and calculate the statistics afterwards like so.
import pandas as pd
# lst = [construct list of dataframes ...]
df = pd.concat(lst, axis=0)
means = df.mean()
stds = df.std()
Edit: if you would like to get the statistics broken down by some key, e.g. wave, you can use the following.
means = df.groupby('wave').mean()

Categories

Resources