How to convert rows into a list using pandas in Python?

product,count,value1,value2,value3
A,10,5,3,2
B,8,2,2,4
This is my dataframe. I need the output in the following format:
product,count,values
A,10,[5,3,2]
B,8,[2,2,4]

Here's one way:
In [27]: df['values'] = df[['value1', 'value2', 'value3']].values.tolist()
In [28]: df
Out[28]:
product count value1 value2 value3 values
0 A 10 5 3 2 [5, 3, 2]
1 B 8 2 2 4 [2, 2, 4]
In [29]: df.drop(['value1', 'value2', 'value3'], axis=1)
Out[29]:
product count values
0 A 10 [5, 3, 2]
1 B 8 [2, 2, 4]
Details:
In [35]: df = pd.DataFrame([['A', 10, 5, 3, 2], ['B', 8, 2, 2, 4]],
....: columns=['product', 'count', 'value1', 'value2', 'value3'])
In [36]: df
Out[36]:
product count value1 value2 value3
0 A 10 5 3 2
1 B 8 2 2 4
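The two steps above (build the list column, then drop the originals) can also be written as one expression. This is only a sketch: the value_cols name and the startswith("value") rule for picking columns are assumptions made for illustration, not part of the answer.

```python
import pandas as pd

df = pd.DataFrame(
    [["A", 10, 5, 3, 2], ["B", 8, 2, 2, 4]],
    columns=["product", "count", "value1", "value2", "value3"],
)

# Assumption: every column to collapse has a name starting with "value".
value_cols = [c for c in df.columns if c.startswith("value")]

# Drop the separate columns and attach one list per row in a single step.
result = df.drop(columns=value_cols).assign(
    values=df[value_cols].to_numpy().tolist()
)
print(result)
```

Unlike the in-place assignment, this leaves df untouched and returns a new frame.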

Related

Turning a list of dictionaries into a DataFrame

If you have a list of dictionaries like this:
listofdict = [{'value1': [1, 2, 3, 4, 5]}, {'value2': [5, 4, 3, 2, 1]}, {'value3': ['a', 'b', 'c', 'd', 'e']}]
How can you turn it into a dataframe where value1, value2 and value3 are the column names and the lists are the column values?
I tried:
df = pd.DataFrame(listofdict)
But it puts each whole list into a single cell of its own row and fills the remaining cells with NaN.
Here is another way:
df = pd.DataFrame({k:v for i in listofdict for k,v in i.items()})
Output:
value1 value2 value3
0 1 5 a
1 2 4 b
2 3 3 c
3 4 2 d
4 5 1 e
DataFrame expects a single dictionary with column names as keys, so you need to merge all these dictionaries into a single one, like {'value1': [1, 2, 3, 4, 5], 'value2': [5, 4, 3, 2, 1], ... }
You can try
listofdict = [{'value1':[1,2,3,4,5]}, {'value2':[5,4,3,2,1]},{'value3':['a','b','c','d','e']}]
dicofdics = {}
for dct in listofdict:
    dicofdics.update(dct)
df = pd.DataFrame(dicofdics)
df
index value1 value2 value3
0 1 5 a
1 2 4 b
2 3 3 c
3 4 2 d
4 5 1 e
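To see why the merge step matters, here is a small sketch comparing the two constructor inputs. The names as_rows, merged and as_cols are made up for this illustration.

```python
import pandas as pd

listofdict = [
    {"value1": [1, 2, 3, 4, 5]},
    {"value2": [5, 4, 3, 2, 1]},
    {"value3": ["a", "b", "c", "d", "e"]},
]

# A list of dicts is read as one row per dict, so each list lands in one cell.
as_rows = pd.DataFrame(listofdict)

# A single dict is read as one column per key, which is what was asked for.
merged = {}
for d in listofdict:
    merged.update(d)
as_cols = pd.DataFrame(merged)

print(as_rows.shape)  # (3, 3)
print(as_cols.shape)  # (5, 3)
```

Note that dict.update keeps only the last value if two dictionaries share a key.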

Pandas DataFrame filter by multiple column criteria and multiple intervals

I have checked several answers but found no luck so far.
My dataset is like this:
df = pd.DataFrame({
'Location':['A', 'A', 'A', 'B', 'C', 'C'],
'Place':[1, 2, 3, 4, 2, 3],
'Value1':[1, 1, 2, 3, 4, 5],
'Value2':[1, 1, 2, 3, 4, 5]
}, columns = ['Location','Place','Value1','Value2'])
Location Place Value1 Value2
A 1 1 1
A 2 1 1
A 3 2 2
B 4 3 3
C 2 4 4
C 3 5 5
and I have a list of intervals:
A: [0, 1]
A: [3, 5]
B: [1, 3]
C: [1, 4]
C: [6, 10]
Now, for every row whose Location matches an entry in the filter list, I want to keep the row only if its Place falls within one of that location's intervals. So the desired output will be:
Location Place Value1 Value2
A 1 1 1
A 3 2 2
C 2 4 4
C 3 5 5
I know that I can chain multiple between conditions with |, but I have a really long list of intervals, so entering the conditions manually is not feasible. I also considered a for loop to slice the data by location first, but I think there could be a more efficient way.
Thank you for your help.
Edit: Currently the list of intervals is just strings like this
A 0 1
A 3 5
B 1 3
C 1 4
C 6 10
but I would like to parse them into a list of dicts. A better structure for it is also welcome!
First define dataframe df and filters dff:
df = pd.DataFrame({
'Location':['A', 'A', 'A', 'B', 'C', 'C'],
'Place':[1, 2, 3, 4, 2, 3],
'Value1':[1, 1, 2, 3, 4, 5],
'Value2':[1, 1, 2, 3, 4, 5]
}, columns = ['Location','Place','Value1','Value2'])
dff = pd.DataFrame({'Location':['A','A','B','C','C'],
'fPlace':[[0,1], [3, 5], [1, 3], [1, 4], [6, 10]]})
dff[['p1', 'p2']] = pd.DataFrame(dff["fPlace"].to_list())
now dff is:
Location fPlace p1 p2
0 A [0, 1] 0 1
1 A [3, 5] 3 5
2 B [1, 3] 1 3
3 C [1, 4] 1 4
4 C [6, 10] 6 10
where fPlace is split into lower and upper bounds p1 and p2, which define the filter to apply to Place. Next:
df.merge(dff).query('Place >= p1 and Place <= p2').drop(columns = ['fPlace','p1','p2'])
result:
Location Place Value1 Value2
0 A 1 1 1
5 A 3 2 2
7 C 2 4 4
9 C 3 5 5
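The merge renumbers the rows, which is why the result above shows labels 0, 5, 7, 9. If you want to keep the original row labels of df, one possible variation (a sketch, not the answer's exact method) is to carry the index through the merge as a column. The names merged, mask and kept are chosen for this example only.

```python
import pandas as pd

df = pd.DataFrame({
    "Location": ["A", "A", "A", "B", "C", "C"],
    "Place": [1, 2, 3, 4, 2, 3],
    "Value1": [1, 1, 2, 3, 4, 5],
    "Value2": [1, 1, 2, 3, 4, 5],
})
dff = pd.DataFrame({
    "Location": ["A", "A", "B", "C", "C"],
    "p1": [0, 3, 1, 1, 6],
    "p2": [1, 5, 3, 4, 10],
})

# reset_index() stores the original row labels in a column named "index".
merged = df.reset_index().merge(dff, on="Location")

# Keep the pairs where Place lies inside the interval (both ends inclusive).
mask = merged["Place"].between(merged["p1"], merged["p2"])

# A row may match several intervals, so de-duplicate its label.
kept = merged.loc[mask, "index"].unique()
result = df.loc[kept]
print(result)
```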
Prerequisites:
# presumed setup for your intervals:
intervals = {
"A": [
[0, 1],
[3, 5],
],
"B": [
[1, 3],
],
"C": [
[1, 4],
[6, 10],
],
}
Actual solution:
x = df["Location"].map(intervals).explode().str
l, r = x[0], x[1]
res = df["Place"].loc[l.index].between(l, r)
res = res.loc[res].index.unique()
res = df.loc[res]
Outputs:
>>> res
Location Place Value1 Value2
0 A 1 1 1
2 A 3 2 2
4 C 2 4 4
5 C 3 5 5
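The question's edit says the intervals currently exist only as strings such as "A 0 1". A minimal sketch of turning those lines into the intervals dict presumed above, assuming each line is whitespace-separated as "location low high" (raw_lines is a made-up name for wherever the strings come from):

```python
from collections import defaultdict

raw_lines = ["A 0 1", "A 3 5", "B 1 3", "C 1 4", "C 6 10"]

intervals = defaultdict(list)
for line in raw_lines:
    # Assumption: exactly three whitespace-separated fields per line.
    location, low, high = line.split()
    intervals[location].append([int(low), int(high)])

# Convert back to a plain dict so missing keys are not silently created later.
intervals = dict(intervals)
print(intervals)
```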

How to make a multi-dimensional column into a single valued vector for training data in sklearn pandas

I have a data set in which certain column is a combination of couple of independent values, as in the example below:
id age marks
1 5 3,6,7
2 7 1,2
3 4 34,78,2
Thus the column by itself is composed of multiple values, and I need to pass the vector into a machine learning algorithm. I cannot really combine the values to assign a single value like:
3,6,7 => 1
1,2 => 2
34,78,2 => 3
making my new vector as
id age marks
1 5 1
2 7 2
3 4 3
and then subsequently pass it to the algorithm, as the number of such combinations would be infinite and that might not really capture the real meaning of the data.
How should I handle such a situation, where an individual feature is a combination of multiple features?
Note:
The values in the column marks are just examples; it could be any list of values: a list of integers, a list of strings, or a string composed of multiple strings separated by commas.
You can use pd.factorize on tuples.
Assuming each entry of marks is a list:
df
id age marks
0 1 5 [3, 6, 7]
1 2 7 [1, 2]
2 3 4 [34, 78, 2]
3 4 5 [3, 6, 7]
Apply tuple and factorize
df.assign(new=pd.factorize(df.marks.apply(tuple))[0] + 1)
id age marks new
0 1 5 [3, 6, 7] 1
1 2 7 [1, 2] 2
2 3 4 [34, 78, 2] 3
3 4 5 [3, 6, 7] 1
Setup for df:
df = pd.DataFrame([
[1, 5, ['3', '6', '7']],
[2, 7, ['1', '2']],
[3, 4, ['34', '78', '2']],
[4, 5, ['3', '6', '7']]
], [0, 1, 2, 3], ['id', 'age', 'marks']
)
UPDATE: I think we can use CountVectorizer in this case:
assuming we have the following DF:
In [33]: df
Out[33]:
id age marks
0 1 5 [3, 6, 7]
1 2 7 [1, 2]
2 3 4 [34, 78, 2]
3 4 11 [3, 6, 7]
In [34]: %paste
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import TreebankWordTokenizer
vect = CountVectorizer(ngram_range=(1,1), stop_words=None, tokenizer=TreebankWordTokenizer().tokenize)
X = vect.fit_transform(df.marks.apply(' '.join))
r = pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
## -- End pasted text --
Result:
In [35]: r
Out[35]:
1 2 3 34 6 7 78
0 0 0 1 0 1 1 0
1 1 1 0 0 0 0 0
2 0 1 0 1 0 0 1
3 0 0 1 0 1 1 0
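If you would rather not depend on NLTK, a similar indicator matrix can be sketched with pandas alone. To be clear, this swaps CountVectorizer for explode plus get_dummies, so it is an alternative and not the method above. It assumes marks holds lists of strings, as in the setup, and the names exploded, indicators and features are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [5, 7, 4, 11],
    "marks": [["3", "6", "7"], ["1", "2"], ["34", "78", "2"], ["3", "6", "7"]],
})

# One row per list element; the original row label is repeated in the index.
exploded = df["marks"].explode()

# One indicator column per distinct mark, then collapse back to one row per label.
indicators = pd.get_dummies(exploded, dtype=int).groupby(level=0).sum()

features = df[["id", "age"]].join(indicators)
print(features)
```

A mark that occurs twice in the same list is counted twice here, which matches what a count vectorizer would do.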
OLD answer:
you can first convert your list to string and then categorize it:
In [119]: df
Out[119]:
id age marks
0 1 5 [3, 6, 7]
1 2 7 [1, 2]
2 3 4 [34, 78, 2]
3 4 11 [3, 6, 7]
In [120]: df['new'] = pd.Categorical(pd.factorize(df.marks.str.join('|'))[0])
In [121]: df
Out[121]:
id age marks new
0 1 5 [3, 6, 7] 0
1 2 7 [1, 2] 1
2 3 4 [34, 78, 2] 2
3 4 11 [3, 6, 7] 0
In [122]: df.dtypes
Out[122]:
id int64
age int64
marks object
new category
dtype: object
this will also work if marks is a column of strings:
In [124]: df
Out[124]:
id age marks
0 1 5 3,6,7
1 2 7 1,2
2 3 4 34,78,2
3 4 11 3,6,7
In [125]: df['new'] = pd.Categorical(pd.factorize(df.marks.str.join('|'))[0])
In [126]: df
Out[126]:
id age marks new
0 1 5 3,6,7 0
1 2 7 1,2 1
2 3 4 34,78,2 2
3 4 11 3,6,7 0
To access them as either [[x, y, z], [x, y, z]] or [[x, x], [y, y], [z, z]] (whichever is most appropriate for the function you need to call), use:
import pandas as pd
import numpy as np
df = pd.DataFrame(dict(a=[1, 2, 3, 4], b=[3, 4, 3, 4], c=[[1,2,3], [1,2], [], [2]]))
df.values
list(zip(*df.values))  # zip returns an iterator in Python 3
where
>>> df
a b c
0 1 3 [1, 2, 3]
1 2 4 [1, 2]
2 3 3 []
3 4 4 [2]
>>> df.values
array([[1, 3, [1, 2, 3]],
[2, 4, [1, 2]],
[3, 3, []],
[4, 4, [2]]], dtype=object)
>>> list(zip(*df.values))
[(1, 2, 3, 4), (3, 4, 3, 4), ([1, 2, 3], [1, 2], [], [2])]
To convert a column try this:
import pandas as pd
import numpy as np
df = pd.DataFrame(dict(a=[1, 2], b=[3, 4], c=[[1,2,3], [1,2]]))
df['c'] = df['c'].apply(np.mean)  # assign back, otherwise df is unchanged
before:
>>> df
a b c
0 1 3 [1, 2, 3]
1 2 4 [1, 2]
after:
>>> df
a b c
0 1 3 2.0
1 2 4 1.5
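A single mean throws away a lot of information, so you may want a few fixed-size summaries per list instead. The sketch below is one possible choice; the column names c_len, c_mean and c_max are invented for the example, and it assumes every list is non-empty (max and np.mean misbehave on an empty list).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(a=[1, 2], b=[3, 4], c=[[1, 2, 3], [1, 2]]))

# Replace the variable-length column with fixed-size numeric summaries.
features = df.drop(columns="c").assign(
    c_len=df["c"].apply(len),
    c_mean=df["c"].apply(np.mean),
    c_max=df["c"].apply(max),
)
print(features)
```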

Been trying to build new data frame from existing data frame and series

I'm trying to write a loop that does the following:
df_f.ix[0] = df_n.loc[0]
df_f.ix[1] = h[0]
df_f.ix[2] = df_n.loc[1]
df_f.ix[3] = h[1]
df_f.ix[4] = df_n.loc[2]
df_f.ix[5] = h[2]
...
df_f.ix[94778] = df_n.loc[47389]
df_f.ix[94779] = h[47389]
Basically, row 1 (and all the rows incremented by 2) of data frame df_f is equal to row 1 of data frame df_n (and its rows incremented by 1) and row 2 (and the rows incremented by 2) of df_f is equal to row 1 (and its rows incremented by 1) of series h. And so on...Can anyone help?
You don't necessarily need loops... You can just create a new list of data from your existing data frame/series and then make that into a new DataFrame:
import pandas as pd
#example data
df_n = pd.DataFrame([1,2, 3, 4,5])
h = pd.Series([99, 98, 97, 96, 95])
new_data = [None] * (len(df_n) * 2)
new_data[::2] = df_n.loc[:, 0].values
new_data[1::2] = h.values
new_df = pd.DataFrame(new_data)
In [135]: new_df
Out[135]:
0
0 1
1 99
2 2
3 98
4 3
5 97
6 4
7 96
8 5
9 95
If you really want a loop that will do it you could create an empty data frame like so:
other_df = pd.DataFrame([None] * (len(df_n) * 2))
y = 0
for x in range(len(df_n)):  # use xrange on Python 2
    other_df.loc[y] = df_n.loc[x]
    y += 1
    other_df.loc[y] = h[x]
    y += 1
In [136]: other_df
Out[136]:
0
0 1
1 99
2 2
3 98
4 3
5 97
6 4
7 96
8 5
9 95
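The same interleaving can also be sketched without building a list by hand, by concatenating the two and sorting on the shared index. This assumes df_n and h carry the same default integer index, as in the example data. A stable sort (mergesort) is needed so that the df_n row comes before the h row with the same label.

```python
import pandas as pd

df_n = pd.DataFrame([1, 2, 3, 4, 5])
h = pd.Series([99, 98, 97, 96, 95])

# After concat the index is 0..4 followed by 0..4 again; a stable sort on it
# puts each df_n row directly before the h row with the same label.
interleaved = (
    pd.concat([df_n[0], h])
    .sort_index(kind="mergesort")
    .reset_index(drop=True)
)
new_df = interleaved.to_frame()
print(new_df)
```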
This is easy to do in NumPy. You can retrieve the data from a pandas DataFrame using df.values.
>>> import numpy as np
>>> import pandas as pd
>>> df_a, df_b = pd.DataFrame([1, 2, 3, 4]), pd.DataFrame([5, 6, 7, 8])
>>> df_a
0
0 1
1 2
2 3
3 4
>>> df_b
0
0 5
1 6
2 7
3 8
>>> np_a, np_b = df_a.values, df_b.values
>>> np_a
array([[1],
[2],
[3],
[4]])
>>> np_b
array([[5],
[6],
[7],
[8]])
>>> np_c = np.hstack((np_a, np_b))
>>> np_c
array([[1, 5],
[2, 6],
[3, 7],
[4, 8]])
>>> np_c = np_c.flatten()
>>> np_c
array([1, 5, 2, 6, 3, 7, 4, 8])
>>> df_c = pd.DataFrame(np_c)
>>> df_c
0
0 1
1 5
2 2
3 6
4 3
5 7
6 4
7 8
All of this in one line, given df_a and df_b:
>>> df_c = pd.DataFrame(np.hstack((df_a.values, df_b.values)).flatten())
>>> df_c
0
0 1
1 5
2 2
3 6
4 3
5 7
6 4
7 8
Edit:
If you have more than one column, which is the general case,
>>> df_a = pd.DataFrame([[1, 2], [3, 4]])
>>> df_b = pd.DataFrame([[5, 6], [7, 8]])
>>> df_a
0 1
0 1 2
1 3 4
>>> df_b
0 1
0 5 6
1 7 8
>>> np_a = df_a.values
>>> np_a = np_a.reshape(np_a.shape[0], 1, np_a.shape[1])
>>> np_a
array([[[1, 2]],
[[3, 4]]])
>>> np_b = df_b.values
>>> np_b = np_b.reshape(np_b.shape[0], 1, np_b.shape[1])
>>> np_b
array([[[5, 6]],
[[7, 8]]])
>>> np_c = np.concatenate((np_a, np_b), axis=1)
>>> np_c
array([[[1, 2],
[5, 6]],
[[3, 4],
[7, 8]]])
>>> np_c = np_c.reshape(np_c.shape[0] * np_c.shape[1], np_c.shape[2])  # (rows * 2, columns)
>>> np_c
array([[1, 2],
[5, 6],
[3, 4],
[7, 8]])
>>> df_c = pd.DataFrame(np_c)
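The multi-column case can be wrapped in a small helper. This is a sketch under the assumption that both frames have the same shape; interleave_rows is a hypothetical name, and np.stack is used here in place of the reshape-then-concatenate steps above. The example uses three columns and two rows on purpose, so that a wrong reshape would show up.

```python
import numpy as np
import pandas as pd

def interleave_rows(df_a, df_b):
    """Return a frame whose rows alternate between df_a and df_b."""
    a, b = df_a.to_numpy(), df_b.to_numpy()
    # Stack along a new middle axis: shape (n_rows, 2, n_cols).
    stacked = np.stack((a, b), axis=1)
    # Collapse the first two axes: shape (2 * n_rows, n_cols).
    return pd.DataFrame(stacked.reshape(-1, a.shape[1]), columns=df_a.columns)

df_a = pd.DataFrame([[1, 2, 10], [3, 4, 11]])
df_b = pd.DataFrame([[5, 6, 12], [7, 8, 13]])
df_c = interleave_rows(df_a, df_b)
print(df_c)
```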

Nested dictionary to multiindex dataframe where dictionary keys are column labels

Say I have a dictionary that looks like this:
dictionary = {'A' : {'a': [1,2,3,4,5],
'b': [6,7,8,9,1]},
'B' : {'a': [2,3,4,5,6],
'b': [7,8,9,1,2]}}
and I want a dataframe that looks something like this:
A B
a b a b
0 1 6 2 7
1 2 7 3 8
2 3 8 4 9
3 4 9 5 1
4 5 1 6 2
Is there a convenient way to do this? If I try:
In [99]:
DataFrame(dictionary)
Out[99]:
A B
a [1, 2, 3, 4, 5] [2, 3, 4, 5, 6]
b [6, 7, 8, 9, 1] [7, 8, 9, 1, 2]
I get a dataframe where each element is a list. What I need is a multiindex where each level corresponds to the keys in the nested dict and the rows corresponding to each element in the list as shown above. I think I can work a very crude solution but I'm hoping there might be something a bit simpler.
Pandas wants the MultiIndex values as tuples, not nested dicts. The simplest thing is to convert your dictionary to the right format before trying to pass it to DataFrame:
>>> reform = {(outerKey, innerKey): values for outerKey, innerDict in dictionary.items() for innerKey, values in innerDict.items()}
>>> reform
{('A', 'a'): [1, 2, 3, 4, 5],
('A', 'b'): [6, 7, 8, 9, 1],
('B', 'a'): [2, 3, 4, 5, 6],
('B', 'b'): [7, 8, 9, 1, 2]}
>>> pandas.DataFrame(reform)
A B
a b a b
0 1 6 2 7
1 2 7 3 8
2 3 8 4 9
3 4 9 5 1
4 5 1 6 2
[5 rows x 4 columns]
You're looking for the functionality in .stack:
df = pandas.DataFrame.from_dict(dictionary, orient="index").stack().to_frame()
# to break out the lists into columns
df = pandas.DataFrame(df[0].values.tolist(), index=df.index)
Another option is to build one DataFrame per outer key and concatenate them along the columns:
dict_of_df = {k: pd.DataFrame(v) for k,v in dictionary.items()}
df = pd.concat(dict_of_df, axis=1)
Note that the column order is not guaranteed for Python < 3.6, where dicts do not preserve insertion order.
This recursive function should work:
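def reform_dict(dictionary, t=tuple(), reform=None):
    # Use None rather than a mutable default so results do not leak between calls
    if reform is None:
        reform = {}
    for key, val in dictionary.items():
        t = t + (key,)
        if isinstance(val, dict):
            reform_dict(val, t, reform)
        else:
            reform.update({t: val})
        t = t[:-1]
    return reform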
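For reference, here is how a recursive flattening like this can be fed to DataFrame. The sketch uses its own small variant, flatten_nested (a hypothetical name), which returns a fresh dict on every call instead of sharing an accumulator.

```python
import pandas as pd

def flatten_nested(nested, prefix=()):
    """Map each leaf to a tuple of the keys leading to it."""
    flat = {}
    for key, val in nested.items():
        path = prefix + (key,)
        if isinstance(val, dict):
            # Recurse into inner dicts, extending the key path.
            flat.update(flatten_nested(val, path))
        else:
            flat[path] = val
    return flat

dictionary = {
    "A": {"a": [1, 2, 3, 4, 5], "b": [6, 7, 8, 9, 1]},
    "B": {"a": [2, 3, 4, 5, 6], "b": [7, 8, 9, 1, 2]},
}

# Tuple keys become the levels of a column MultiIndex.
df = pd.DataFrame(flatten_nested(dictionary))
print(df)
```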
If the lists in the dictionary are not of the same length, you can adapt BrenBarn's method:
>>> dictionary = {'A' : {'a': [1,2,3,4,5],
'b': [6,7,8,9,1]},
'B' : {'a': [2,3,4,5,6],
'b': [7,8,9,1]}}
>>> reform = {(outerKey, innerKey): values for outerKey, innerDict in dictionary.items() for innerKey, values in innerDict.items()}
>>> reform
{('A', 'a'): [1, 2, 3, 4, 5],
('A', 'b'): [6, 7, 8, 9, 1],
('B', 'a'): [2, 3, 4, 5, 6],
('B', 'b'): [7, 8, 9, 1]}
>>> df = pandas.DataFrame.from_dict(reform, orient='index').transpose()
>>> df.columns = pandas.MultiIndex.from_tuples(df.columns)
>>> df
A B
a b a b
0 1 6 2 7
1 2 7 3 8
2 3 8 4 9
3 4 9 5 1
4 5 1 6 NaN
[5 rows x 4 columns]
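Another way to handle unequal lengths, sketched here as an alternative to the transpose trick, is to wrap each list in a Series so that pandas aligns them on the index and pads the short one with NaN.

```python
import pandas as pd

dictionary = {
    "A": {"a": [1, 2, 3, 4, 5], "b": [6, 7, 8, 9, 1]},
    "B": {"a": [2, 3, 4, 5, 6], "b": [7, 8, 9, 1]},
}

# Series of different lengths are aligned on their index; gaps become NaN.
reform = {
    (outer_key, inner_key): pd.Series(values)
    for outer_key, inner_dict in dictionary.items()
    for inner_key, values in inner_dict.items()
}
df = pd.DataFrame(reform)
print(df)
```

The padded column becomes float, since NaN cannot be stored in an integer column by default.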
This solution works for a larger dataframe and fits what was requested:
cols = df.columns
int_cols = len(cols)
col_subset_1 = [cols[x] for x in range(1,int(int_cols/2)+1)]
col_subset_2 = [cols[x] for x in range(int(int_cols/2)+1, int_cols)]
col_subset_1_label = list(zip(['A']*len(col_subset_1), col_subset_1))
col_subset_2_label = list(zip(['B']*len(col_subset_2), col_subset_2))
df.columns = pd.MultiIndex.from_tuples([('','myIndex'),*col_subset_1_label,*col_subset_2_label])
OUTPUT
A B
myIndex a b c d
0 0.159710 1.472925 0.619508 -0.476738 0.866238
1 -0.665062 0.609273 -0.089719 0.730012 0.751615
2 0.215350 -0.403239 1.801829 -2.052797 -1.026114
3 -0.609692 1.163072 -1.007984 -0.324902 -1.624007
4 0.791321 -0.060026 -1.328531 -0.498092 0.559837
5 0.247412 -0.841714 0.354314 0.506985 0.425254
6 0.443535 1.037502 -0.433115 0.601754 -1.405284
7 -0.433744 1.514892 1.963495 -2.353169 1.285580
