How to slice by hybrid style - Python

Given a sample df:
df = pd.DataFrame([[1,2,3,4],[4,5,6,7],[7,8,9,10],[10,11,12,13],[14,15,16,17]], columns=['A', 'B','C','D'])
cols_in = list(df)[0:2]+list(df)[4:]
Now:
x = []
for i in range(df.shape[0]):
    x.append(df.iloc[i,cols_in])
The loop raises an error because cols_in holds column names, which iloc does not accept.
How can I apply this kind of mixed slicing to df inside the append call?

It seems like you want to exclude one column? There is no column 4 (list(df)[4:] is empty), so depending on which columns you want, something like this might work:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,2,3,4],[4,5,6,7],[7,8,9,10],[10,11,12,13],[14,15,16,17]], columns=['A', 'B','C','D'])
If you want to get the column indices from column names, you can do:
cols = ['A', 'B', 'D']
cols_in = np.nonzero(df.columns.isin(cols))[0]
x = []
for i in range(df.shape[0]):
    x.append(df.iloc[i, cols_in].to_list())
x
Output:
[[1, 2, 4], [4, 5, 7], [7, 8, 10], [10, 11, 13], [14, 15, 17]]
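If the goal is a mix of positional slices rather than a list of names, a minimal sketch using np.r_ to join several slice ranges into one integer array might look like the following. The ranges 0:2 and 3:4 (columns A, B and D) are an assumption about which columns are wanted, and the row loop is replaced by a single iloc call.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10], [10, 11, 12, 13], [14, 15, 16, 17]],
    columns=['A', 'B', 'C', 'D'],
)

# np.r_ concatenates several slice ranges into one array of integer positions.
# The ranges 0:2 and 3:4 are assumed here purely for illustration.
cols_in = np.r_[0:2, 3:4]

# Select all rows at once instead of appending row by row.
x = df.iloc[:, cols_in].to_numpy().tolist()
print(x)
```

Because cols_in contains integers only, iloc accepts it directly, which is what the original loop was missing.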

Related

How to connect a pandas DataFrame and a dict?

I need to connect a DataFrame and a dict like this. The number of each cell is different, so the number of "0", "1" and so on is different. The total number of cells is 16.
You should try writing some code yourself first, but how about this?
import pandas as pd
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
d = {"C": [7, 8, 9], "D": [10, 11, 12]}
df = pd.concat([df, pd.DataFrame(d)], axis=1)
print(df)
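One detail worth keeping in mind: pd.concat with axis=1 aligns rows by index label, not by position. A small sketch, assuming df carries a non-default index (the labels 10, 20, 30 are made up for illustration), builds the dict's frame on the same index so the rows line up:

```python
import pandas as pd

# Hypothetical non-default index, used only to show the alignment behaviour.
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=[10, 20, 30])
d = {"C": [7, 8, 9], "D": [10, 11, 12]}

# Build the new frame on df's index so concat matches the rows one to one.
extra = pd.DataFrame(d, index=df.index)
result = pd.concat([df, extra], axis=1)
print(result)
```

Without index=df.index, the second frame would get the default labels 0, 1, 2 and the concatenation would produce extra rows filled with NaN.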

Check pandas df2.colA for occurrences of df1.id and write (df2.colB, df2.colC) into df1.colAB

I have two pandas df and they do not have the same length. df1 has unique id's in column id. These id's occur (multiple times) in df2.colA. I'd like to add a list of all occurrences of df1.id in df2.colA (and another column at the matching index of df1.id == df2.colA) into a new column in df1. Either with the index of df2.colA of the match or additionally with other row entries of all matches.
Example:
df1.id = [1, 2, 3, 4]
df2.colA = [3, 4, 4, 2, 1, 1]
df2.colB = [5, 9, 6, 5, 8, 7]
So that my operation creates something like:
df1.colAB = [ [[1,8],[1,7]], [[2,5]], [[3,5]], [[4,9],[4,6]] ]
I've tried a bunch of approaches with mapping, looping explicitly (super slow), checking with isin etc.
You could use pandas apply to iterate over each value in df1.id. For each value, select the rows of df2 where colA matches, use loc to pull the corresponding colB entries, and build the list of [id, colB] pairs with a list comprehension inside the apply.
import pandas as pd
# setup
df1 = pd.DataFrame({'id':[1,2,3,4]})
print(df1)
df2 = pd.DataFrame({
'colA' : [3, 4, 4, 2, 1, 1],
'colB' : [5, 9, 6, 5, 8, 7]
})
print(df2)
# code: for each id, pair it with every colB value whose colA matches
df1['colAB'] = df1['id'].apply(lambda row:
    [[row, val] for val in df2.loc[df2[df2.colA == row].index, 'colB']])
print(df1)
Output from df1:
   id             colAB
0   1  [[1, 8], [1, 7]]
1   2          [[2, 5]]
2   3          [[3, 5]]
3   4  [[4, 9], [4, 6]]
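Since the question mentions that explicit looping was slow, here is a hedged alternative sketch that groups df2 once and then maps the result onto df1.id, instead of filtering df2 again for every id. The name grouped is introduced here for illustration only.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4]})
df2 = pd.DataFrame({
    'colA': [3, 4, 4, 2, 1, 1],
    'colB': [5, 9, 6, 5, 8, 7],
})

# Scan df2 a single time: for each colA value, collect its [colA, colB] pairs.
grouped = {
    int(key): [[int(key), val] for val in vals.tolist()]
    for key, vals in df2.groupby('colA')['colB']
}

# Look up each id; ids that never occur in df2.colA would become NaN.
df1['colAB'] = df1['id'].map(grouped)
print(df1)
```

The pairs keep the row order of df2 within each group, so the result matches the apply-based version for this example.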

Selecting different rows from different GroupBy groups

As opposed to GroupBy.nth, which selects the same index for each group, I would like to take specific indices from each group. For example, if my GroupBy object consisted of four groups and I would like the 1st, 5th, 10th, and 15th from each respectively, then I would like to be able to pass x = [0, 4, 9, 14] and get those rows.
This is kind of a strange thing to want; is there a reason?
In any case, to do what you want, try this:
import pandas as pd
df = pd.DataFrame([['a', 1], ['a', 2],
                   ['b', 3], ['b', 4], ['b', 5],
                   ['c', 6], ['c', 7]],
                  columns=['group', 'value'])
def index_getter(which):
    # series.name is the group key, so look up the position wanted for that group
    def get(series):
        return series.iloc[which[series.name]]
    return get
which = {'a': 0, 'b': 2, 'c': 1}
df.groupby('group')['value'].apply(index_getter(which))
Which results in:
group
a 1
b 5
c 7
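The same selection can also be sketched without a custom function, by numbering the rows inside each group with cumcount and comparing that number with the position requested for the group. This is an alternative illustration, not the method from the answer above; it reuses the same which mapping.

```python
import pandas as pd

df = pd.DataFrame([['a', 1], ['a', 2],
                   ['b', 3], ['b', 4], ['b', 5],
                   ['c', 6], ['c', 7]],
                  columns=['group', 'value'])
which = {'a': 0, 'b': 2, 'c': 1}

# cumcount numbers the rows within each group, starting from 0.
position_in_group = df.groupby('group').cumcount()

# Keep the rows whose within-group position equals the one requested for that group.
selected = df[position_in_group == df['group'].map(which)]
print(selected)
```

This variant returns whole rows with their original index, and a group whose requested position does not exist simply yields no row instead of raising an IndexError.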

plotting a given column name across different data frames in python

All, I have multiple dataframes like
df1 = pd.DataFrame(np.array([
['a', 1, 2],
['b', 3, 4],
['c', 5, 6]]),
columns=['name', 'attr1', 'attr2'])
df2 = pd.DataFrame(np.array([
['a', 2, 3],
['b', 4, 5],
['c', 6, 7]]),
columns=['name', 'attr1', 'attr2'])
df3 = pd.DataFrame(np.array([
['a', 3, 4],
['b', 5, 6],
['c', 7, 8]]),
columns=['name', 'attr1', 'attr2'])
Each of these dataframes is generated at a specific time step, say T=[t1, t2, t3].
I would like to plot attr1 or attr2 of the different dataframes as a function of time T, for 'a', 'b' and 'c' all on the same graph.
For example: plot attr1 vs. time for 'a', 'b' and 'c'.
If I understand correctly, first assign a column T to each of your dataframes, then concatenate the three. Then, you can groupby the name column, iterate through each, and plot T against attr1 or attr2:
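import matplotlib.pyplot as plt
dfs = pd.concat([df1.assign(T=1), df2.assign(T=2), df3.assign(T=3)])
# np.array stored the numbers as strings, so convert them back before plotting
dfs = dfs.astype({'attr1': int, 'attr2': int})
for name, data in dfs.groupby('name'):
    plt.plot(data['T'], data['attr2'], label=name)
plt.xlabel('Time')
plt.ylabel('attr2')
plt.legend()
plt.show()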
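As a variation on the concatenate-and-group idea, the combined frame can be pivoted so that each name becomes its own column indexed by T. The sketch below assumes numeric attr columns and uses 1, 2, 3 as stand-ins for t1, t2, t3; the frames are rebuilt from dicts here only to keep the example self-contained.

```python
import pandas as pd

# Stand-in frames with numeric columns; the values mirror the question's example.
df1 = pd.DataFrame({'name': ['a', 'b', 'c'], 'attr1': [1, 3, 5], 'attr2': [2, 4, 6]})
df2 = pd.DataFrame({'name': ['a', 'b', 'c'], 'attr1': [2, 4, 6], 'attr2': [3, 5, 7]})
df3 = pd.DataFrame({'name': ['a', 'b', 'c'], 'attr1': [3, 5, 7], 'attr2': [4, 6, 8]})

# Tag each frame with its (assumed) time step and stack them.
dfs = pd.concat([df1.assign(T=1), df2.assign(T=2), df3.assign(T=3)])

# One row per time step, one column per name, holding attr1.
wide = dfs.pivot(index='T', columns='name', values='attr1')
print(wide)
```

Calling wide.plot() on the result (with matplotlib installed) would then draw one line per name against T in a single call.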

Get DataFrame selection's row positions

Instead of the indices, I'd like to obtain the row positions, so I can use the result later with df.iloc[row_positions].
This is the example:
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']}, index=[10, 2, 7])
print(df[df['a']>=2].index)
# Int64Index([2, 7], dtype='int64')
# How do I convert the index list [2, 7] to [1, 2] (the row position)
# I managed to do this for 1 index element, but how can I do this for the entire selection/index list?
df.index.get_loc(2)
Update
I could use a list comprehension to apply get_loc to each element of the selected index, but perhaps there's a built-in pandas function.
You can use where from NumPy:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']}, index=[10, 2, 7])
np.where(df.a >= 2)
This returns the row positions:
(array([1, 2], dtype=int64),)
@ssm's answer is what I would normally use. However, to answer your specific query of how to select multiple rows, try this:
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']}, index=[10, 2, 7])
indices = df[df['a']>=2].index
print(df.loc[indices])
Note that .ix was removed in pandas 1.0; .loc is the label-based replacement.
[EDIT to answer the specific query]
How do I convert the index list [2, 7] to [1, 2] (the row position)?
df.index.get_indexer(indices)
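To tie the pieces together, here is a short sketch that turns the boolean selection into integer row positions and feeds them back into iloc. It uses np.flatnonzero, which is equivalent to taking the first element of np.where for a one-dimensional mask; the names mask, row_positions and subset are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']}, index=[10, 2, 7])

# Boolean mask for the selection, one entry per row.
mask = df['a'] >= 2

# Positions (not labels) of the rows where the mask is True.
row_positions = np.flatnonzero(mask.to_numpy())

# The positions can be used later with iloc.
subset = df.iloc[row_positions]
print(row_positions)
print(subset)
```

Because the positions come from the mask rather than from index labels, this also works when the index contains duplicate labels.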
