How do I get values from a dataframe based on a list of indices and headers?
These are the dataframes I have:
a = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=['a','b','c'])
referencingDf = pd.DataFrame(['c','c','b'])
Based on the same index, I am trying to get the following dataframe output:
outputDf = pd.DataFrame([3,6,8])
Currently I have tried the following, but it would require taking the diagonal values. I am pretty sure there is a better way of doing this:
a.loc[referencingDf.index.values, referencingDf[:][0].values]
You need lookup:
b = a.lookup(a.index, referencingDf[0])
print (b)
[3 6 8]
df1 = pd.DataFrame({'vals':b}, index=a.index)
print (df1)
vals
0 3
1 6
2 8
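Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0; on recent versions an equivalent result can be built from Index.get_indexer plus NumPy indexing (a minimal sketch using the same example frames):
import numpy as np
# positional index of the requested column label for each row
col_pos = a.columns.get_indexer(referencingDf[0])
b = a.to_numpy()[np.arange(len(a)), col_pos]
# array([3, 6, 8])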
Another way, using a list comprehension:
vals = [a.loc[i,j] for i,j in enumerate(referencingDf[0])]
# [3, 6, 8]
IIUC, you can use df.get_value in a list comprehension.
vals = [a.get_value(*x) for x in referencingDf.reset_index().values]
# a simplification would be [ ... for x in enumerate(referencingDf[0])] - DYZ
print(vals)
[3, 6, 8]
And then, construct a dataframe.
df = pd.DataFrame(vals)
print(df)
0
0 3
1 6
2 8
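Note that get_value was deprecated and later removed from pandas; DataFrame.at is the documented replacement, so a version of the same comprehension that still runs on current versions could look like this (a sketch on the same example):
vals = [a.at[i, c] for i, c in referencingDf[0].items()]
# [3, 6, 8]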
Here's a vectorized approach that uses column_index and then NumPy's advanced indexing to extract those values from each row of the dataframe:
In [177]: col_idx = column_index(a, referencingDf.values.ravel())
In [178]: a.values[np.arange(len(col_idx)), col_idx]
Out[178]: array([3, 6, 8])
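column_index is not defined in this snippet; a minimal sketch of such a helper, assuming it only needs to map column labels to their positional indices, could be:
def column_index(df, query_cols):
    # positions of the requested column labels in df.columns
    return df.columns.get_indexer(query_cols)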
I have dozens of very similar DataFrames. What I want is to combine all 'VALUE' column values from each into lists, and return a DataFrame where the 'VALUE' column is composed of these lists. I only want to do this for rows where 'PV' contains a substring from a list of substrings.
I came up with one way I thought would work, but it's really nasty and doesn't work anyway (I stopped it after 3 minutes). There has to be a better way of doing this; does anyone here have any ideas? Thanks for any and all help.
import pandas as pd
# Example dataFrames
df0 = pd.DataFrame(data={'PV': ['pv1', 'pv2', 'pv3', 'pv4'], 'VALUE': [1, 2, 3, 4]})
df1 = pd.DataFrame(data={'PV': ['pv1', 'pv2', 'pv3', 'pv4'], 'VALUE': [5, 6, 7, 8]})
df2 = pd.DataFrame(data={'PV': ['pv1', 'pv2', 'pv3', 'pv4'], 'VALUE': [10, 11, 12, 13]})
DATAFRAMES
df0 DataFrame        df1 DataFrame        df2 DataFrame
PV   VALUE           PV   VALUE           PV   VALUE
pv1  1               pv1  5               pv1  10
pv2  2               pv2  6               pv2  11
pv3  3               pv3  7               pv3  12
pv4  4               pv4  8               pv4  13
# Nasty code I thought might work
strings = ['v2', 'v4']
for i, row0 in df0.iterrows():
    for j, row1 in df1.iterrows():
        if (row0['PV'] == row1['PV']) & any(substring in row0['PV'] for substring in strings):
            df0.at[i, 'VALUE'] = [row0['VALUE'], row1['VALUE']]
Desired result:
PV VALUE
pv1 1
pv2 [2,6]
pv3 3
pv4 [4,8]
@enke thank you for your help! I had to play with it a bit to figure out how to keep nested lists from occurring, and ended up using the following commented function/code/output:
def appendValues(df0, df1, pvStrings=['v2','v4']):
    # Turn values in the VALUE column into list objects
    df0['VALUE'] = df0['VALUE'].apply(lambda x: x if isinstance(x, list) else [x])
    # For rows where the PV string DOESN'T contain a substring, set the value to max()+1;
    # apply then makes the lists [x] empty if they were set to max()+1, else [x]
    df1['VALUE'] = (df1['VALUE']
                    .where(df1['PV'].str.contains('|'.join(pvStrings)), df1['VALUE'].max()+1)
                    .apply(lambda x: [x] if x <= df1['VALUE'].max() else []))
    # Concatenate df1's VALUE column to df0,
    # set the indexing column to 'PV',
    # and sum all row values (axis=1) into one list
    data = (df0.merge(df1, on='PV')
            .set_index('PV')
            .sum(axis=1))
    # Restore singleton lists to their original type; reset_index moves the current 'PV'
    # index back to a column and creates a new sequential index
    data = data.mask(data.str.len().eq(1), data.str[0]).reset_index(name='VALUE')
    return data
data = appendValues(df0, df1, pvStrings=['v2','v4'])
data = appendValues(data, df2, pvStrings=['v1','v4'])
data
Output:
PV VALUE
0 pv1 [1,10]
1 pv2 [2,6]
2 pv3 3
3 pv4 [4,8,13]
You could filter df1 for rows whose "PV" contains one of the substrings; concatenate it with df0; then groupby + agg(list) can aggregate the "VALUE"s for each "PV".
Finally, you could use mask to unwrap the elements from the singleton lists.
out = (pd.concat([df0, df1[df1['PV'].str.contains('|'.join(strings))]])
.groupby('PV', as_index=False)['VALUE'].agg(list))
out['VALUE'] = out['VALUE'].mask(out['VALUE'].str.len().eq(1), out['VALUE'].str[0])
Alternatively, we could make the values in the "VALUE" columns lists and merge + concatenate the lists:
df0['VALUE'] = df0['VALUE'].apply(lambda x: [x])
df1['VALUE'] = df1['VALUE'].where(df1['PV'].str.contains('|'.join(strings)), df1['VALUE'].max()+1).apply(lambda x: [x] if x <= df1['VALUE'].max() else [])
out = df0.merge(df1, on='PV').set_index('PV').sum(axis=1)
out = out.mask(out.str.len().eq(1), out.str[0]).reset_index(name='VALUE')
Output:
PV VALUE
0 pv1 1
1 pv2 [2, 6]
2 pv3 3
3 pv4 [4, 8]
If you don't want to filter out any rows but would rather keep the values for the "PV"s that match strings as separate rows, you could concat + groupby first; then filter + explode:
out = pd.concat([df0, df1]).groupby('PV', as_index=False)['VALUE'].agg(list)
msk = out['PV'].str.contains('|'.join(strings))
out = pd.concat((out[msk].explode('VALUE'), out[~msk])).sort_index()
Output:
PV VALUE
0 pv1 [1, 5]
1 pv2 2
1 pv2 6
2 pv3 [3, 7]
3 pv4 4
3 pv4 8
I have a pandas.DataFrame like
B1 B2 B3
A1 0 1 2
A2 3 4 5
Also, index=pd.Index(['A2', 'A1']), and columns=pd.Index(['B2', 'B3']). What I want to get is [4, 2], that is, elements in A2-B2 and A1-B3, respectively.
Is there a clever built-in operation to perform this in pandas?
I searched with different expressions for a while but no clue. There could be duplicate questions, sorry for that case. Thank you for taking a look at this.
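For reference, the example frame can be reconstructed like this (a sketch assumed from the table above; the answers below refer to it as df):
import pandas as pd
df = pd.DataFrame([[0, 1, 2], [3, 4, 5]],
                  index=['A1', 'A2'],
                  columns=['B1', 'B2', 'B3'])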
Use Index.get_indexer to get the positions corresponding to the index and column labels, which makes NumPy indexing possible (just convert the values of df to a NumPy array):
index=pd.Index(['A2', 'A1'])
columns=pd.Index(['B2', 'B3'])
i = df.index.get_indexer(index)
c = df.columns.get_indexer(columns)
L = df.to_numpy()[i, c].tolist()
print (L)
[4, 2]
Or reshape by DataFrame.stack and select by DataFrame.loc with MultiIndex.from_tuples:
L = df.stack().loc[pd.MultiIndex.from_tuples(zip(index, columns))].tolist()
print (L)
[4, 2]
If there are only a few values, it is possible to use a list comprehension with zip and DataFrame.at:
L = [df.at[i, c] for i, c in zip(index, columns)]
print (L)
[4, 2]
Another option would be to zip the index and columns (same idea as @Jezrael); however, you could just pass the zipped pairs to loc (internally, it takes care of finding the right values):
temp = df.stack()
zipped = [*zip(index, columns)]
temp.loc(axis=0)[zipped].array
<PandasArray>
[4, 2]
Length: 2, dtype: int64
I have a DataFrame (temp) and an array (x) whose elements correspond to some of the rows of the DataFrame. I want to get the indexes of the DataFrame whose corresponding rows are identical to the elements of the array:
For example:
temp = pd.DataFrame({"A": [1,2,3,4], "B": [4,5,6,7], "C": [7,8,9,10]})
A B C
0 1 4 7
1 2 5 8
2 3 6 9
3 4 7 10
x = np.array([[1,4,7], [3,6,9]])
It should return the indexes: 0 and 2.
I was trying unsuccessfully with this:
temp.loc[temp.isin(x[0])].index
Using numpy broadcasting:
array = temp.to_numpy()[:, None]                # shape (n_rows, 1, n_cols)
mask = (array == x).all(axis=-1).any(axis=-1)   # True where a row fully matches some row of x
temp.index[mask]
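For the sample temp and x above, rows 0 and 2 are the full matches, so temp.index[mask] gives an index containing [0, 2].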
I would convert to a MultiIndex and then use isin with np.where:
i = pd.MultiIndex.from_frame(temp[['A','B','C']])
out = np.where(i.isin(pd.MultiIndex.from_arrays(x.T)))[0]
print(out)
#[0 2]
Or with merge:
cols = ['A','B','C']
out = temp.reset_index().merge(pd.DataFrame(x,columns=cols)).loc[:,'index'].tolist()
Or with np.isin and all
out = temp.index[np.isin(temp[['A','B','C']],x).all(1)]
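One caveat with the np.isin variant: it tests element membership rather than whole-row equality, so it can over-match when a row's values each appear somewhere in x but not together in a single row; the MultiIndex and merge variants above compare full rows.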
Since you need to match entire rows of the DataFrame to rows in the numpy array, you can convert the DataFrame to an array and then use enumerate to loop and return the indices:
temp_arr = temp.to_numpy()
for idx, row in enumerate(temp_arr):
    # compare full rows; `row in x` would only test element membership
    if (row == x).all(axis=1).any():
        print(idx)
Output:
0
2
A more elegant way, using a list comprehension, would be:
idx_list = [i for i, row in enumerate(temp_arr) if (row == x).all(axis=1).any()]
print(idx_list)
Output:
[0, 2]
I have a pandas dataframe where the columns are named like:
0,1,2,3,4,.....,n
I would like to drop every 3rd column so that I get a new dataframe with columns like:
0,1,3,4,6,7,9,.....,n
I have tried this:
shape = df.shape[1]
for i in range(2, shape, 3):
    df = df.drop(df.columns[i], axis=1)
but I get an error saying the index is out of bounds, and I assume this happens because the shape of the dataframe changes as I drop columns. If I just don't store the output of the for loop, the code runs, but then I don't get my new dataframe.
How do I solve this?
Thanks
The issue with your code is that each time you drop a column in the loop, you end up with a different set of columns, because you write the result back to df after each iteration. When you then try to drop the next 3rd column of THAT new set of columns, you not only drop the wrong one, you eventually run out of columns. That's why you get the error you are getting.
iter1 -> 0,1,3,4,5,6,7,8,9,10 ... n #first you drop 2 which is 3rd col
iter2 -> 0,1,3,4,5,7,8,9,10 ... n #next you drop 6 which is 6th col (should be 5)
iter3 -> 0,1,3,4,5,7,8,9, ... n #next you drop 10 which is 9th col (should be 8)
What you want to do is calculate the indexes beforehand and then remove them in one go.
You can simply just get the indexes of columns you want to remove with range and then drop those.
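For illustration, assume a frame whose column names are the integers 0 through 9 (a hypothetical example chosen to match the printed output below):
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(30).reshape(3, 10))  # columns are 0..9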
drop_idx = list(range(2,df.shape[1],3)) #Indexes to drop
df2 = df.drop(drop_idx, axis=1) #Drop them at once over axis=1
print('old columns->', list(df.columns))
print('idx to drop->', drop_idx)
print('new columns->',list(df2.columns))
old columns-> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
idx to drop-> [2, 5, 8]
new columns-> [0, 1, 3, 4, 6, 7, 9]
Note: This works only because your column names are the same as their positional indexes. If, however, your column names are not like that, you will have to do an extra step of fetching the column names based on the positions you want to drop.
drop_idx = list(range(2,df.shape[1],3))
drop_cols = [j for i,j in enumerate(df.columns) if i in drop_idx] #<--
df2 = df.drop(drop_cols, axis=1)
Here is a solution with inverted logic: select all columns except every 3rd one.
You can build a helper array of column positions, add 1, take the result modulo 3, compare it against 0 for inequality, and pass the resulting mask to DataFrame.loc:
df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('aaabbb')
})
df = df.loc[:, (np.arange(len(df.columns)) + 1) % 3 != 0]
print (df)
A B D E
0 a 4 1 5
1 b 5 3 3
2 c 4 5 6
3 d 5 7 9
4 e 5 1 2
5 f 4 0 4
You can use a list comprehension to filter the columns:
df = df[[k for k in df.columns if (k + 1) % 3 != 0]]
If the names are different (e.g. strings) and you want to discard every 3rd column regardless of its name, then:
df = df[[k for i, k in enumerate(df.columns, 1) if i % 3 != 0]]
I have a pandas dataframe with only one column; the value of each cell in the column is a list/array of numbers. This list has length 100, and this length is consistent across all the cell values.
We need to convert each list element into a column value, in other words, produce a dataframe with 100 columns where each column holds one item of the list/array.
Something like a single list-valued column becomes a wide dataframe with one column per list element.
It can be done with iterrows() as shown below, but we have around 1.5 million rows and need a scalable solution, since iterrows() would take a lot of time.
cols = [f'col_{i}' for i in range(0, 4)]
df_inter = pd.DataFrame(columns=cols)
for index, row in df.iterrows():
    df_inter.loc[len(df_inter)] = row['message']
You can do this:
In [28]: df = pd.DataFrame({'message':[[1,2,3,4,5], [3,4,5,6,7]]})
In [29]: df
Out[29]:
message
0 [1, 2, 3, 4, 5]
1 [3, 4, 5, 6, 7]
In [30]: res = pd.DataFrame(df.message.tolist(), index= df.index)
In [31]: res
Out[31]:
0 1 2 3 4
0 1 2 3 4 5
1 3 4 5 6 7
I think this would work:
df.message.apply(pd.Series)
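Note that apply(pd.Series) constructs a new Series object per row, so for around 1.5 million rows it is typically far slower than the pd.DataFrame(df.message.tolist(), index=df.index) construction shown above.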
To use dask to scale (assuming it is installed):
import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=8)
ddf.message.apply(pd.Series, meta={0: 'object'})
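The dask result is lazy, so it still has to be materialized. One way to do that, reusing the tolist construction per partition, is sketched below; the meta frame is an assumption describing five int64 output columns to match the example lists, and should be adjusted to the real list length and dtype:
import dask.dataframe as dd
import pandas as pd

ddf = dd.from_pandas(df, npartitions=8)

# assumed meta: five int64 columns, matching the example lists
meta = pd.DataFrame({i: pd.Series(dtype='int64') for i in range(5)})

res = ddf.map_partitions(
    lambda pdf: pd.DataFrame(pdf['message'].tolist(), index=pdf.index),
    meta=meta,
).compute()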