how to efficiently decode arrays to columns in pandas dataframe


I have a function that produces results for every month of a year. In my dataframe I collect these results for different data columns. After that, I have a dataframe containing multiple columns with arrays as values. Now I want to "pivot" those columns to have each value in its own column.
For example, if a row contains values [1,2,3,4,5,6,7,8,9,10,11,12] in column 'A', I want to have twelve columns 'A_01', 'A_02', ..., 'A_12' that each contain one value from the array.
My current code is this:
# create new columns
columns_to_add = []
column_count = len(columns_to_process)
for _, row in df[columns_to_process].iterrows():
    columns_to_add += [[row[name][offset] if type(row[name]) == list else row[name]
                        for offset in range(array_len) for name in range(column_count)]]
new_df = pd.DataFrame(columns_to_add,
                      columns=[name+'_'+str(offset+1) for offset in range(array_len)
                               for name in columns_to_process],
                      index=df.index) # make dataframe addendum
(note: some rows don't have any values, so I had to put the condition if type() == list into the iteration)
But this code is awfully slow. I believe there must be a much more elegant solution. Can you show me such a solution?

IIUC, use Series.tolist with the pandas.DataFrame constructor.
We'll use DataFrame.rename as well to fix your column name format.
# Setup
df = pd.DataFrame({'A': [ [1,2,3,4,5,6,7,8,9,10,11,12] ]})
pd.DataFrame(df['A'].tolist()).rename(columns=lambda x: f'A_{x+1:0>2d}')
[out]
   A_01  A_02  A_03  A_04  A_05  A_06  A_07  A_08  A_09  A_10  A_11  A_12
0     1     2     3     4     5     6     7     8     9    10    11    12
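The one-column example above could be extended to several list columns, and to rows that hold no list, roughly as follows. This is a minimal sketch rather than a benchmarked solution: the helper name expand_list_columns is hypothetical, and it assumes that a non-list cell (for example NaN) should become NaN in every generated column, whereas the question's loop repeats the scalar. The output is also grouped per source column (A_01 ... A_12, then B_01 ...), not interleaved by month.

```python
import numpy as np
import pandas as pd


def expand_list_columns(df, columns, array_len):
    """Hypothetical helper: turn each list-valued column into array_len columns."""
    parts = []
    for name in columns:
        # Assumption: cells that are not lists are treated as "no data"
        # and padded with NaN so every row has the same length.
        rows = [v if isinstance(v, list) else [np.nan] * array_len
                for v in df[name]]
        part = pd.DataFrame(rows, index=df.index)
        part.columns = [f"{name}_{i + 1:02d}" for i in range(part.shape[1])]
        parts.append(part)
    return pd.concat(parts, axis=1)


df = pd.DataFrame({
    "A": [list(range(1, 13)), np.nan],               # second row has no list
    "B": [list(range(101, 113)), list(range(201, 213))],
})
wide = expand_list_columns(df, ["A", "B"], array_len=12)
print(wide.iloc[:, :3])
```

Building one frame per source column with the DataFrame constructor avoids the Python-level loop over rows that iterrows implies, which is usually where the time goes.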

Related

How to concatenate values from many columns into one column when the number of columns is not known in advance

My application saves an indeterminate number of values in different columns. As a result, I have a data frame with a fixed set of columns at the beginning, but from a particular column onward (which I know) there is an unknown number of columns holding the same kind of data.
Example:
known1 known2 know3 unknow1 unknow2 unknow3 ...
1 3 3 data data2 data3
The result I would like to get should be something like this
known1 known2 know3 all_unknow
1 3 3 data,data2,data3
How can I do this when I don't know the number of unknown columns? What I do know is where they start (in this example, at the 4th column).

IIUC, use filter to select the columns by keyword:
cols = list(df.filter(like='unknow'))
# ['unknow1', 'unknow2', 'unknow3']
df['all_unknow'] = df[cols].apply(','.join, axis=1)
df = df.drop(columns=cols)
or take all columns from the 4th one:
cols = df.columns[3:]
df['all_unknow'] = df[cols].apply(','.join, axis=1)
df = df.drop(columns=cols)
output:
known1 known2 know3 all_unknow
0 1 3 3 data,data2,data3

df['all_unknow'] = df.iloc[:, 3:].apply(','.join, axis=1)
If you also want to drop the original columns from the 4th onward:
cols = df.columns[3:-1]
df = df.drop(cols, axis=1)
The -1 keeps the newly added column from being dropped.
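The answers above rely on every cell being a string, because ','.join fails on numbers or missing values. A hedged variant that tolerates both might look like the sketch below; the sample frame is made up for illustration, and skipping missing cells (rather than writing "nan") is an assumption about the desired result.

```python
import pandas as pd

# Illustrative data: the second row has a number and a missing value
df = pd.DataFrame({
    "known1": [1, 2], "known2": [3, 4], "know3": [3, 5],
    "unknow1": ["data", "x"],
    "unknow2": ["data2", 5],
    "unknow3": ["data3", None],
})

cols = df.columns[3:]  # everything from the 4th column onward
# Drop missing cells, cast the rest to str, then join them per row
df["all_unknow"] = df[cols].apply(
    lambda row: ",".join(row.dropna().astype(str)), axis=1
)
df = df.drop(columns=cols)
print(df)
```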

How to split the values of a column to different columns in a dataframe

I have a dataframe in the format below. I want to split the values of the points column into separate columns named A, B, C and so on, according to the number of items in the list, and delete the original column.
df:
x y points
0 82.123610 16.724781 [1075038212.0, -18.099967840282456, -18.158378...
1 82.126540 16.490998 [1071765909.0, -20.406018294234215, -15.850444...
2 82.369578 17.402203 [1072646747.0, -16.839004016179505, -18.334996...
3 81.612240 17.464167 [1096294130.0, -15.335239025421126, -15.303402...

I think the best approach here is to create numeric column names:
df = df.join(pd.DataFrame(df.pop('points').tolist(), index=df.index))
If the lists have fewer than 27 items, you can use letters as column names instead (swap in string.ascii_uppercase for A, B, C, ...):
import string
d = dict(enumerate(string.ascii_lowercase))
df = df.join(pd.DataFrame(df.pop('points').tolist(), index=df.index).rename(columns=d))
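As a self-contained illustration of the letter-named variant, here is a small sketch that produces the A, B, C names from the question. The values are shortened stand-ins for the real data, and the variable names letters and expanded are only illustrative.

```python
import string
import pandas as pd

# Shortened stand-in for the frame in the question
df = pd.DataFrame({
    "x": [82.12, 82.13],
    "y": [16.72, 16.49],
    "points": [[1.0, -18.1, -18.2], [2.0, -20.4, -15.9]],
})

# Map positions 0, 1, 2, ... to "A", "B", "C", ... (works for up to 26 items)
letters = dict(enumerate(string.ascii_uppercase))

# pop() removes "points" from df and returns it, so the original column is gone
expanded = pd.DataFrame(df.pop("points").tolist(), index=df.index).rename(columns=letters)
df = df.join(expanded)
print(df)
```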

How to sort a dataFrame in python pandas by two or more columns based on a list of values? [duplicate]

So I have a pandas DataFrame, df, with columns that represent taxonomical classification (i.e. Kingdom, Phylum, Class etc...) I also have a list of taxonomic labels that correspond to the order I would like the DataFrame to be ordered by.
The list looks something like this:
class_list=['Gammaproteobacteria', 'Bacteroidetes', 'Negativicutes', 'Clostridia', 'Bacilli', 'Actinobacteria', 'Betaproteobacteria', 'delta/epsilon subdivisions', 'Synergistia', 'Mollicutes', 'Nitrospira', 'Spirochaetia', 'Thermotogae', 'Aquificae', 'Fimbriimonas', 'Gemmatimonadetes', 'Dehalococcoidia', 'Oscillatoriophycideae', 'Chlamydiae', 'Nostocales', 'Thermodesulfobacteria', 'Erysipelotrichia', 'Chlorobi', 'Deinococci']
This list would correspond to the Dataframe column df['Class']. I would like to sort all the rows for the whole dataframe based on the order of the list as df['Class'] is in a different order currently. What would be the best way to do this?

You could make the Class column your index column
df = df.set_index('Class')
and then use df.loc to reindex the DataFrame with class_list:
df.loc[class_list]
Minimal example:
>>> df = pd.DataFrame({'Class': ['Gammaproteobacteria', 'Bacteroidetes', 'Negativicutes'], 'Number': [3, 5, 6]})
>>> df
                 Class  Number
0  Gammaproteobacteria       3
1        Bacteroidetes       5
2        Negativicutes       6
>>> df = df.set_index('Class')
>>> df.loc[['Bacteroidetes', 'Negativicutes', 'Gammaproteobacteria']]
                     Number
Bacteroidetes             5
Negativicutes             6
Gammaproteobacteria       3

Alex's solution doesn't work if your original dataframe does not contain all of the elements in the ordered list, i.e. if your input data at some point does not contain "Negativicutes", the script will fail. One way around this is to filter the dataframe once per class, collect the pieces in a list, and concatenate them at the end. For example:
ordered_classes = ['Bacteroidetes', 'Negativicutes', 'Gammaproteobacteria']
df_list = []
for i in ordered_classes:
    df_list.append(df[df['Class']==i])
ordered_df = pd.concat(df_list)
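A different technique from either answer, shown here only as a sketch, is to turn the column into an ordered categorical and sort on it. It copes with classes that are absent from the data and with repeated classes. The small frame below is made up for illustration, and values that are not in the list would become NaN and sort last.

```python
import pandas as pd

class_list = ["Bacteroidetes", "Negativicutes", "Gammaproteobacteria"]

# Illustrative data: "Negativicutes" is absent, "Gammaproteobacteria" repeats
df = pd.DataFrame({
    "Class": ["Gammaproteobacteria", "Bacteroidetes", "Gammaproteobacteria"],
    "Number": [3, 5, 6],
})

# The category order defines the sort order
df["Class"] = pd.Categorical(df["Class"], categories=class_list, ordered=True)

# mergesort is stable, so rows of the same class keep their original order
ordered_df = df.sort_values("Class", kind="mergesort")
print(ordered_df)
```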

Adding columns dynamically to a pandas dataframe, from a list contained in the dataframe

I have a dataframe in which the first column contains a list of random size, from 0 to around 10 items in each list. This dataframe also contains several other columns of data.
I would like to insert as many columns as the length of the longest list, and then populate the values across sequentially such that each column has one item from the list in column one.
I was unsure of a good way to go about this.
sample = [[[0,2,3,7,8,9],2,3,4,5],[[1,2],2,3,4,5],[[1,3,4,5,6,7,8,9,0],2,3,4,5]]
headers = ["col1","col2","col3","col4","col5"]
df = pd.DataFrame(sample, columns = headers)
In this example I would like to add 9 columns after column 1, as this is the maximum length of the list (found in the third row of the dataframe). These columns would be populated with:
0 2 3 7 8 9 NULL NULL NULL in the first row,
1 2 NULL NULL NULL NULL NULL NULL NULL in the second, etc...

Edit to fit OPs edit
This is how I would do it. First I would pad the lists of the original column so that they're all the same length and it's easier to work with them. Afterwards it's a matter of creating the columns and filling it with the value corresponding to the position in the list. Let's say our lists are of size up to 4 for an easier example:
import numpy as np
df = pd.DataFrame(sample, columns = headers)
df = df.rename(columns={'col1':'col_of_lists'})
# length of the longest list
max_length = max(df['col_of_lists'].apply(lambda x:len(x)))
# pad shorter lists with NaN so all lists have max_length items
df['col_of_lists'] = df['col_of_lists'].apply(lambda x:x + ([np.nan] * (max_length - len(x))))
for i in range(max_length):
    df['col_'+str(i)] = df['col_of_lists'].apply(lambda x: x[i])

The easiest way to turn a series of lists into separate columns is to use apply to convert them into a Series, which triggers the 'expand' result type:
result = df['col1'].apply(pd.Series)
At this point, we can adjust the columns from the automatically numbered to include the name of the original 'col1', for example:
result.columns = [
    'col1_{}'.format(i + 1)
    for i in result.columns]
Finally, we can join it back to the original DataFrame. Using the fact that this was the first column makes it easy, just joining it to the left of the original frame, dropping the original 'col1' in the process:
result = result.join(df.drop('col1', axis=1))
You can even do it all as a one-liner, by using the rename() method to change column names:
df['col1'].apply(pd.Series).rename(
    lambda i: 'col1_{}'.format(i + 1),
    axis=1,
).join(df.drop('col1', axis=1))
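Because apply(pd.Series) builds one Series per row, it tends to be slow on large frames. A possible alternative, sketched here on the sample data from the question, is to pass the lists straight to the DataFrame constructor, which pads shorter lists with NaN. The variable name expanded is only illustrative.

```python
import pandas as pd

sample = [[[0, 2, 3, 7, 8, 9], 2, 3, 4, 5],
          [[1, 2], 2, 3, 4, 5],
          [[1, 3, 4, 5, 6, 7, 8, 9, 0], 2, 3, 4, 5]]
headers = ["col1", "col2", "col3", "col4", "col5"]
df = pd.DataFrame(sample, columns=headers)

# One row per list; rows shorter than the longest list are padded with NaN
expanded = pd.DataFrame(df["col1"].tolist(), index=df.index)
expanded.columns = [f"col1_{i + 1}" for i in expanded.columns]

# Put the new columns first and drop the original list column
result = expanded.join(df.drop(columns="col1"))
print(result)
```

Padded columns come out as floats, since NaN cannot be stored in a plain integer column.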

Finding mean of consecutive column data

I have the following data:
(the data given here is just representational)
I want to do the following with this data:
I want to get column only after the 201
i.e. I want to remove the 200-1 to 200-4 column data.
One way to do this is to retrieve only the required column while reading the data from excel, but I want to know how we can filter the column name on the basis of a particular pattern as 200-1 to 200-4 column name has pattern 200-*
I want to make a column after 202-4 which stores the values in the following ways:
201q1= mean of (201-1 and 201-2)
201q2 = mean of(201-3 and 201-4)
Similarly, if 202-1 to 202-4 data were present, the corresponding columns should be created for them.
Please help.
Thanks in advance for your support.

This is a rough example, but it will get you close. The example assumes that there are always four columns per group:
import numpy as np
import pandas as pd
# sample data
np.random.seed(1)
df = pd.DataFrame(np.random.randn(2,12), columns=['200-1','200-2','200-3','200-4', '201-1', '201-2', '201-3','201-4', '202-1', '202-2', '202-3','202-4'])
# remove 200-* columns
df2 = df[df.columns[~df.columns.str.contains('200-')]]
# use np.arange to group consecutive pairs of columns (axis=1 is deprecated in recent pandas versions)
new = df2.groupby(np.arange(len(df2.columns))//2, axis=1).mean()
# rename columns
new.columns = [f'{v}{k}' for v,k in zip([x[:3] for x in df2.columns[::2]], ['q1','q2']*int(len(df2.columns[::2])/2))]
# join
df2.join(new)
      201-1     201-2     201-3     201-4     202-1     202-2     202-3  \
0  0.865408 -2.301539  1.744812 -0.761207  0.319039 -0.249370  1.462108
1 -0.172428 -0.877858  0.042214  0.582815 -1.100619  1.144724  0.901591
      202-4     201q1     201q2     202q1     202q2
0 -2.060141 -0.718066  0.491802  0.034834 -0.299016
1  0.502494 -0.525143  0.312514  0.022052  0.702043
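The same pairing idea can be written without the axis=1 argument by transposing before and after the groupby. The sketch below is one way to do that under the same assumption of four columns per group; it uses simple made-up numbers instead of random data so the means are easy to check by hand, and the names pair_ids and means are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up data: columns 200-1 ... 202-4 holding 0..11 and 12..23
cols = [f"{year}-{q}" for year in (200, 201, 202) for q in range(1, 5)]
df = pd.DataFrame(np.arange(24, dtype=float).reshape(2, 12), columns=cols)

# Drop every column whose name starts with "200-"
df2 = df.loc[:, ~df.columns.str.startswith("200-")]

# Positions 0,1 -> group 0; 2,3 -> group 1; and so on
pair_ids = np.arange(df2.shape[1]) // 2
# Transpose so the columns become rows, group the rows, transpose back
means = df2.T.groupby(pair_ids).mean().T

# Name each pair "<prefix>q<n>": 201-1/201-2 -> 201q1, 201-3/201-4 -> 201q2
means.columns = [
    f"{name.split('-')[0]}q{i % 2 + 1}"
    for i, name in enumerate(df2.columns[::2])
]
result = df2.join(means)
print(result)
```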

For step 1, you can get away with a list comprehension and the pandas drop function:
dropcols = [x for x in df.columns if '200-' in x]
df.drop(dropcols, axis=1, inplace=True)
Steps 3 and 4 are similar, you could calculate the rolling mean of the columns:
df2 = df.rolling(2, axis = 1).mean() # creates rolling mean
df2.columns = [x.replace('-', 'q') for x in df2.columns] # renames the columns
dfans = pd.concat([df, df2], axis = 1) # concatenate the columns together
Now you just need to remove the columns that you don't want and rename the rest.
