Drop every nth column in pandas dataframe - python

I have a pandas dataframe where the columns are named like:
0,1,2,3,4,.....,n
I would like to drop every 3rd column so that I get a new dataframe with columns like:
0,1,3,4,6,7,9,.....,n
I have tried this:
shape = df.shape[1]
for i in range(2, shape, 3):
    df = df.drop(df.columns[i], axis=1)
but I get an error saying the index is out of bounds, and I assume this happens because the shape of the dataframe changes while I am dropping the columns. If I just don't store the output of the loop, the code runs, but then I don't get my new dataframe.
How do I solve this?
Thanks

The issue with your code is that each time you drop a column in the loop, you end up with a different set of columns, because you write the result back to df after each iteration. When you then drop the "next 3rd column" of THAT new set of columns, you not only drop the wrong one, you eventually run out of columns. That's why you get the error.
iter1 -> 0,1,3,4,5,6,7,8,9,10 ... n  # drop index 2: column 2 goes (correct)
iter2 -> 0,1,3,4,5,7,8,9,10 ... n    # drop index 5: column 6 goes (you wanted 5)
iter3 -> 0,1,3,4,5,7,8,9,11 ... n    # drop index 8: column 10 goes (you wanted 8)
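Incidentally, the loop would work if it dropped from right to left, because removing a column never shifts the positions below it:

for i in reversed(range(2, df.shape[1], 3)):
    df = df.drop(df.columns[i], axis=1)

Each drop still copies the frame, though, so computing the indexes up front (below) is both simpler and faster.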
What you want to do is calculate the indexes beforehand and then remove them in one go. You can get the indexes of the columns you want to remove with range and then drop them all at once.
drop_idx = list(range(2, df.shape[1], 3))  # indexes to drop
df2 = df.drop(drop_idx, axis=1)            # drop them all at once over axis=1
print('old columns->', list(df.columns))
print('idx to drop->', drop_idx)
print('new columns->',list(df2.columns))
old columns-> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
idx to drop-> [2, 5, 8]
new columns-> [0, 1, 3, 4, 6, 7, 9]
Note: This works only because your column names are the same as their positional indexes. If your column names are not like that, you will have to do an extra step of fetching the column names based on the indexes you want to drop.
drop_idx = list(range(2, df.shape[1], 3))
drop_cols = [j for i, j in enumerate(df.columns) if i in drop_idx]  # <--
df2 = df.drop(drop_cols, axis=1)
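Equivalently, since df.columns can be sliced by position, both steps collapse into a single line that works for any (unique) column names:

df2 = df.drop(columns=df.columns[2::3])  # every 3rd column, by position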

Here is a solution with the logic inverted - instead of dropping every 3rd column, select all the others.
Build a helper array of column positions with np.arange, add 1, keep the positions where the value modulo 3 is not equal to 0, and pass the resulting boolean mask to DataFrame.loc:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4,5,4,5,5,4],
    'C': [7,8,9,4,2,3],
    'D': [1,3,5,7,1,0],
    'E': [5,3,6,9,2,4],
    'F': list('aaabbb')
})

df = df.loc[:, (np.arange(len(df.columns)) + 1) % 3 != 0]
print(df)
   A  B  D  E
0  a  4  1  5
1  b  5  3  3
2  c  4  5  6
3  d  5  7  9
4  e  5  1  2
5  f  4  0  4

You can use list comprehension to filter columns:
df = df[[k for k in df.columns if (k + 1) % 3 != 0]]
If the names are different (e.g. strings) and you want to discard every 3rd column regardless of its name, then:
df = df[[k for i, k in enumerate(df.columns, 1) if i % 3 != 0]]

Related

pandas: how to merge columns irrespective of index

I have two dataframes with meaningless indexes but carefully curated order, and I want to merge them while preserving that order. So, for example:
>>> df1
   First
a      1
b      3
and
>>> df2
   Second
c       2
d       4
After merging, what I want to obtain is this:
>>> Desired_output
                    First  Second
AnythingAtAll           1       2  # <--- Row names are meaningless.
SeriouslyIDontCare      3       4  # <--- But the ORDER of the rows is critical and must be preserved.
The fact that I've got row indices "a/b" and "c/d" is irrelevant; what is crucial is the order in which the rows appear. Every version of "join" I've seen requires me to manually reset indices, which seems really awkward, and I don't trust that it won't screw up the ordering. I thought concat would work, but I get this:
>>> pd.concat([df1, df2], axis=1, ignore_index=True)
     0    1
a  1.0  NaN
b  3.0  NaN
c  NaN  2.0
d  NaN  4.0
# ^ obviously not what I want.
Even when I explicitly declare ignore_index.
How do I "overrule" the indexing and force the columns to be merged with the rows kept in the exact order that I supply them?
Edit:
Note that if I assign another column, the results are all "NaN".
>>> df1["second"] = df2["Second"]
>>> df1
   First  second
a      1     NaN
b      3     NaN
This was screwing me up, but thanks to the suggestions from jsmart and topsail, you can bypass the indices by directly accessing the values in the column:
df1["second"]=df2["Second"].values
>>> df1
First second
a 1 2
b 3 4
^ Solution
This should also work I think:
df1["second"] = df2["second"].values
It would keep the index from the first dataframe, but since you have values in there such as "AnythingAtAll" and "SeriouslyIDontCare", I guess any index values whatsoever are acceptable.
Basically, we are just adding the values from your series as a new column to the first dataframe.
Here's a test example similar to your described problem:
import pandas as pd

# -----------
# sample data
# -----------
df1 = pd.DataFrame({
    'x': ['a', 'b'],
    'First': [1, 3],
})
df1.set_index("x", drop=True, inplace=True)

df2 = pd.DataFrame({
    'x': ['c', 'd'],
    'Second': [2, 4],
})
df2.set_index("x", drop=True, inplace=True)

# ---------------------------------------------
# Add series as a new column to first dataframe
# ---------------------------------------------
df1["Second"] = df2["Second"].values
Result is:
   First  Second
x
a      1       2
b      3       4
The goal is to combine data based on position (not by Index). Here is one way to do it:
import pandas as pd
# create data frames df1 and df2
df1 = pd.DataFrame(data = {'First': [1, 3]}, index=['a', 'b'])
df2 = pd.DataFrame(data = {'Second': [2, 4]}, index = ['c', 'd'])
# add a column to df1 -- add by position, not by Index
df1['Second'] = df2['Second'].values
print(df1)
   First  Second
a      1       2
b      3       4
And you could create a completely new data frame like this:
data = {'1st': df1['First'].values, '2nd': df1['Second'].values}
print(pd.DataFrame(data))
   1st  2nd
0    1    2
1    3    4
ignore_index controls whether the output dataframe keeps the original labels along the concatenation axis. If it is True, the original labels are discarded and replaced with 0 to n-1, which is exactly what the column headers 0, 1 in your result show.
You can try
out = pd.concat([df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1)
print(out)
   First  Second
0      1       2
1      3       4
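An equivalent trick that keeps df1's original row labels is to relabel df2's index to match before concatenating (a sketch, assuming both frames have the same length and a reasonably recent pandas):

out = pd.concat([df1, df2.set_axis(df1.index)], axis=1)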

Appending column values from one dataframe to another as a list

I have dozens of very similar dataframes. What I want is to combine all the 'VALUE' column values from each into lists, and return a dataframe where the 'VALUE' column is made up of these lists. I only want to do this for rows where 'PV' contains a substring from a list of substrings.
I came up with one way I thought would work, but it's real nasty and doesn't work anyway (I stopped it at 3 minutes). There has to be a better way of doing this; does anyone here have any ideas? Thanks for any and all help.
import pandas as pd
# Example dataFrames
df0 = pd.DataFrame(data={'PV': ['pv1', 'pv2', 'pv3', 'pv4'], 'VALUE': [1, 2, 3, 4]})
df1 = pd.DataFrame(data={'PV': ['pv1', 'pv2', 'pv3', 'pv4'], 'VALUE': [5, 6, 7, 8]})
df2 = pd.DataFrame(data={'PV': ['pv1', 'pv2', 'pv3', 'pv4'], 'VALUE': [10, 11, 12, 13]})
DATAFRAMES
df0 dataFrame df1 dataFrame df2 dataFrame
PV VALUE PV VALUE PV VALUE
pv1 1 pv1 5 pv1 10
pv2 2 pv2 6 pv2 11
pv3 3 pv3 7 pv3 12
pv4 4 pv4 8 pv4 13
# Nasty code I thought might work
strings = ['v2', 'v4']
for i, row0 in df0.iterrows():
    for j, row1 in df1.iterrows():
        if (row0['PV'] == row1['PV']) & any(substring in row0['PV'] for substring in strings):
            df0.at[i, 'VALUE'] = [row0['VALUE'], row1['VALUE']]
Desired result:
PV VALUE
pv1 1
pv2 [2,6]
pv3 3
pv4 [4,8]
@enke thank you for your help! I had to play with it a bit to figure out how to keep nested lists from occurring, and ended up using the following commented function/code/output:
def appendValues(df0, df1, pvStrings=['v2','v4']):
    # Turn values in the VALUE column into list objects
    df0['VALUE'] = df0['VALUE'].apply(lambda x: x if isinstance(x, list) else [x])
    # For rows where the PV string DOESN'T contain a substring, set the value to max()+1;
    # apply then makes those entries empty lists, and wraps the rest as [x]
    df1['VALUE'] = (df1['VALUE']
                    .where(df1['PV'].str.contains('|'.join(pvStrings)), df1['VALUE'].max()+1)
                    .apply(lambda x: [x] if x <= df1['VALUE'].max() else []))
    # Concatenate df1's VALUE column onto df0, set the indexing column to 'PV',
    # and sum all row values (axis=1) into one list
    data = (df0.merge(df1, on='PV')
               .set_index('PV')
               .sum(axis=1))
    # Restore singleton lists to their original type; reset_index moves the current
    # 'PV' index back to a column and implements a new sequential index
    data = data.mask(data.str.len().eq(1), data.str[0]).reset_index(name='VALUE')
    return data
data = appendValues(df0, df1, pvStrings=['v2','v4'])
data = appendValues(data, df2, pvStrings=['v1','v4'])
data
Output:
PV VALUE
0 pv1 [1,10]
1 pv2 [2,6]
2 pv3 3
3 pv4 [4,8,13]
You could filter df1 for rows whose "PV" contains one of the strings; concatenate the result with df0; then groupby + agg(list) can aggregate the "VALUE"s for each "PV".
Finally, you can use mask to take the elements out of the singleton lists.
out = (pd.concat([df0, df1[df1['PV'].str.contains('|'.join(strings))]])
.groupby('PV', as_index=False)['VALUE'].agg(list))
out['VALUE'] = out['VALUE'].mask(out['VALUE'].str.len().eq(1), out['VALUE'].str[0])
Alternatively, we could make the values in the "VALUE" columns lists and merge + concatenate the lists:
df0['VALUE'] = df0['VALUE'].apply(lambda x: [x])
df1['VALUE'] = (df1['VALUE']
                .where(df1['PV'].str.contains('|'.join(strings)), df1['VALUE'].max()+1)
                .apply(lambda x: [x] if x <= df1['VALUE'].max() else []))
out = df0.merge(df1, on='PV').set_index('PV').sum(axis=1)
out = out.mask(out.str.len().eq(1), out.str[0]).reset_index(name='VALUE')
Output:
PV VALUE
0 pv1 1
1 pv2 [2, 6]
2 pv3 3
3 pv4 [4, 8]
If you don't want to filter out the rows that contain "strings" in "PV" but rather keep them as separate rows, then you could concat + groupby first; then filter + explode:
out = pd.concat([df0, df1]).groupby('PV', as_index=False)['VALUE'].agg(list)
msk = out['PV'].str.contains('|'.join(strings))
out = pd.concat((out[msk].explode('VALUE'), out[~msk])).sort_index()
Output:
PV VALUE
0 pv1 [1, 5]
1 pv2 2
1 pv2 6
2 pv3 [3, 7]
3 pv4 4
3 pv4 8
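Since the question mentions dozens of similar dataframes, the pairwise calls can also collapse into a single concat; a sketch under the assumption that every appended frame uses the same substring filter:

dfs = [df1, df2]        # the frames to append to df0
strings = ['v2', 'v4']  # one shared filter for all of them (an assumption)

keep = lambda d: d[d['PV'].str.contains('|'.join(strings))]
out = (pd.concat([df0] + [keep(d) for d in dfs])
         .groupby('PV', as_index=False)['VALUE'].agg(list))
# unwrap singleton lists back to scalars
out['VALUE'] = out['VALUE'].mask(out['VALUE'].str.len().eq(1), out['VALUE'].str[0])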

Scalable approach to make values in a list as column values in a dataframe in pandas in Python

I have a pandas dataframe which has only one column; the value of each cell in the column is a list/array of numbers. This list is of length 100, and the length is consistent across all the cell values.
We need to convert each list value into a column value, in other words produce a dataframe which has 100 columns, where each column value is one list/array item.
(The original question showed screenshots here: a dataframe with a single list-valued column becoming a wide dataframe with one column per list element.)
It can be done with iterrows() as shown below, but we have around 1.5 million rows and need a scalable solution, as iterrows() would take a lot of time.
cols = [f'col_{i}' for i in range(0, 4)]
df_inter = pd.DataFrame(columns=cols)
for index, row in df.iterrows():
    df_inter.loc[len(df_inter)] = row['message']
You can do this:
In [28]: df = pd.DataFrame({'message': [[1,2,3,4,5], [3,4,5,6,7]]})

In [29]: df
Out[29]:
           message
0  [1, 2, 3, 4, 5]
1  [3, 4, 5, 6, 7]

In [30]: res = pd.DataFrame(df.message.tolist(), index=df.index)

In [31]: res
Out[31]:
   0  1  2  3  4
0  1  2  3  4  5
1  3  4  5  6  7
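If you want the col_0, col_1, ... names from the question instead of integer labels, renaming afterwards is cheap:

res.columns = [f'col_{i}' for i in range(res.shape[1])]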
I think this would work:
df.message.apply(pd.Series)
To use dask to scale (assuming it is installed):
import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=8)
ddf.message.apply(pd.Series, meta={0: 'object'})

Python- How to apply dot product between elements of a data frame

I have two data frames. Both have one column of numpy arrays with 3 elements per entry, like so:
0 [0.552347, 0.762896, 0.336009]
1 [0.530716, 0.808313, 0.254895]
2 [0.528786, 0.734991, 0.424469]
3 [0.202799, 0.669395, -0.714691]
4 [0.791936, -0.100072, -0.602347]
6 [0.428896, -0.122712, 0.89498]
How do I take the dot product of each row of one data frame with the corresponding row of the other data frame? Meaning, I want to calculate the dot product of the first element of df1 with the first element of df2, then the second element of df1 with the second element of df2, then third, and so on.
import numpy as np
import pandas as pd

df1 = pd.DataFrame([(np.array([0.552347, 0.762896, 0.336009]),),
                    (np.array([0.530716, 0.808313, 0.254895]),)], columns=['v1'])
df2 = pd.DataFrame([(np.array([0.528786, 0.734991, 0.424469]),),
                    (np.array([0.202799, 0.669395, -0.714691]),)], columns=['v2'])

pd.concat((df1, df2), axis=1).apply(lambda row: row.v1.dot(row.v2), axis=1)
0 0.995420
1 0.466538
Assuming df1 and df2 are the same length (with the arrays in a column named col1):
[x.dot(y) for x, y in zip(df1.col1.values,df2.col1.values)]
Out[648]: [0.9999995633060001, 1.00000083965]
It's pretty fast to calculate dot products manually. For this you can use mul and sum if the 2 dataframes share the same index:
df1.col.mul(df2.col).apply(sum)
If they don't share the same index (but are the same length), use reset_index first:
df1.reset_index().col.mul(df2.reset_index().col).apply(sum)
Example:
>>> df1
         col
0  [0, 1, 2]
1  [3, 4, 5]
>>> df2
         col
0  [5, 6, 7]
1  [1, 2, 3]
>>> df1.col.mul(df2.col).apply(sum)
0    20
1    26
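For large frames, a fully vectorized variant (assuming every cell holds an equal-length array) stacks the column into a 2-D NumPy array first:

import numpy as np

v1 = np.stack(df1['col'].to_numpy())  # shape (n, 3)
v2 = np.stack(df2['col'].to_numpy())
dots = np.einsum('ij,ij->i', v1, v2)  # row-wise dot products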

Access values in a dataframe based on index and values in another

How do I get the values from a dataframe based on a list of indexes and headers?
These are the dataframes I have:
a = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=['a','b','c'])
referencingDf = pd.DataFrame(['c','c','b'])
Based on the same index, I am trying to get the following dataframe output:
outputDf = pd.DataFrame([3,6,8])
Currently I tried this, but it would require taking the diagonal values, and I am pretty sure there is a better way of doing it:
a.loc[referencingDf.index.values, referencingDf[:][0].values]
You need lookup:
b = a.lookup(a.index, referencingDf[0])
print(b)
[3 6 8]

df1 = pd.DataFrame({'vals': b}, index=a.index)
print(df1)
   vals
0     3
1     6
2     8
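Note: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. A NumPy-based equivalent for the same setup:

import numpy as np

col_pos = a.columns.get_indexer(referencingDf[0])  # positions of 'c', 'c', 'b' -> [2, 2, 1]
b = a.to_numpy()[np.arange(len(a)), col_pos]       # array([3, 6, 8])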
Another way is to use a list comprehension:
vals = [a.loc[i,j] for i,j in enumerate(referencingDf[0])]
# [3, 6, 8]
IIUC, you can use df.get_value in a list comprehension.
vals = [a.get_value(*x) for x in referencingDf.reset_index().values]
# a simplification would be [ ... for x in enumerate(referencingDf[0])] - DYZ
print(vals)
[3, 6, 8]
And then, construct a dataframe.
df = pd.DataFrame(vals)
print(df)
   0
0  3
1  6
2  8
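Note that get_value was deprecated in pandas 0.21 in favour of .at; the same idea with .at would be:

vals = [a.at[i, j] for i, j in zip(referencingDf.index, referencingDf[0])]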
Here's one vectorized approach that uses column_index to get the column positions and then NumPy's advanced indexing to extract those values off each row of the dataframe -
In [177]: col_idx = column_index(a, referencingDf.values.ravel())
In [178]: a.values[np.arange(len(col_idx)), col_idx]
Out[178]: array([3, 6, 8])
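column_index isn't defined in the answer; presumably it maps column labels to their integer positions, along these lines (a sketch):

import numpy as np

def column_index(df, query_cols):
    # map column labels to integer positions via a sorted search
    cols = df.columns.values
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols, query_cols, sorter=sidx)]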
