I have a dataset (~7000 rows) that I have imported into Pandas for some "data wrangling", but I need some pointers in the right direction to take the next step. My data looks something like the example below; it describes a structure with several sub-levels. B, D, and again B are sub-levels of A; C is a sub-level of B; and so on...
Level, Name
0, A
1, B
2, C
1, D
2, E
3, F
3, G
1, B
2, C
But I want something like the below, with Name and Mother_name on the same row:
Level, Name, Mother_name
1, B, A
2, C, B
1, D, A
2, E, D
3, F, E
3, G, E
1, B, A
2, C, B
If I understand the format correctly, the parent of a name is the name in the nearest prior row whose level is one less than the current row's level.
Your DataFrame has a modest number of rows (~7000), so there is little performance harm in simply iterating through them. If the DataFrame were very large, you would often get better performance from column-wise vectorized Pandas operations instead of row-wise iteration; in this case, however, a vectorized approach appears awkward and overly complicated, so I believe row-wise iteration is the best choice here.
Using df.iterrows to perform row-wise iteration, you can simply record the current parent at every level as you go, and fill in the mothers as appropriate:
import pandas as pd

df = pd.DataFrame({'level': [0, 1, 2, 1, 2, 3, 3, 1, 2],
                   'name': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'B', 'C']})

parent = dict()   # maps level -> most recently seen name at that level
mother = []
for index, row in df.iterrows():
    parent[row['level']] = row['name']
    mother.append(parent.get(row['level'] - 1))
df['mother'] = mother
print(df)
yields
level name mother
0 0 A None
1 1 B A
2 2 C B
3 1 D A
4 2 E D
5 3 F E
6 3 G E
7 1 B A
8 2 C B
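Note that the desired output in the question omits the level-0 root row. To match it exactly, you can drop the rows with no mother afterwards (a small follow-up sketch):
# The root has mother None; drop it to match the desired output.
result = df[df['mother'].notna()].reset_index(drop=True)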
If you can specify the mapping of the two columns in something like a dictionary, then you can just use the map method of the original column.
import pandas as pd

names = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'B', 'C']

# name -> parent name (the root 'A' maps to itself)
sublevel_map = {
    'A': 'A',
    'B': 'A',
    'C': 'B',
    'D': 'A',
    'E': 'D',
    'F': 'E',
    'G': 'E',
}

df = pd.DataFrame({'Name': names})
df['Sublevel'] = df['Name'].map(sublevel_map)
Which gives you:
Name Sublevel
0 A A
1 B A
2 C B
3 D A
4 E D
5 F E
6 G E
7 B A
8 C B
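If the name-to-parent mapping is not known up front, you could first derive it from the level data with the same bookkeeping as in the iteration answer above (a sketch; it assumes each name always has the same parent, since a later occurrence overwrites an earlier one):
import pandas as pd

# Same data as in the first answer.
df_levels = pd.DataFrame({'level': [0, 1, 2, 1, 2, 3, 3, 1, 2],
                          'name': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'B', 'C']})

# Track the current name at each level; a name's parent is the current
# name one level up (the root maps to itself, as in sublevel_map above).
current = {}
sublevel_map = {}
for level, name in zip(df_levels['level'], df_levels['name']):
    current[level] = name
    sublevel_map[name] = current.get(level - 1, name)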
I am trying to create a dictionary from a dataframe.
from pandas import util
df = util.testing.makeDataFrame()
df.index.name = 'name'
A B C D
name
qSfQX3rj48 0.184091 -1.195861 0.998988 -0.970523
KSYYLUGiJB -0.998997 -0.387378 -0.303704 0.833731
PmsVVmRbQX -1.510940 -1.062814 0.934954 0.970467
oHjAqjAv1P -1.366054 0.595680 -1.039310 -0.126625
a1cU5c4psT -0.486282 -0.369012 -0.284495 -1.263010
qnqmltdFGR -0.041243 -0.792538 0.234809 0.894919
df.to_dict()
{'A': {'qSfQX3rj48': 0.1840905950693832,
'KSYYLUGiJB': -0.9989969426889559,
'PmsVVmRbQX': -1.5109402125881068,
'oHjAqjAv1P': -1.3660539127241154,
'a1cU5c4psT': -0.48628192605203563,
'qnqmltdFGR': -0.04124312561281138,
The above to_dict call uses the column names as keys:
dict_keys(['A', 'B', 'C', 'D'])
How can I set it to a dict where the values of columns A, B, C, and D are the values for each name, so that each name is a single key?
A B C D
name
qSfQX3rj48 0.184091 -1.195861 0.998988 -0.970523
Should produce a dictionary with a list of values.
{'qSfQX3rj48': [0.184091, -1.195861, 0.998988, -0.970523],
'KSYYLUGiJB': [-0.998997, -0.387378 , -0.303704, 0.833731],
And the column names should be the values for the name key, thus:
{'name': ['A', 'B', 'C', 'D']}
d = df.T.to_dict('list')                 # each index label -> list of row values
d[df.index.name] = df.columns.tolist()   # add 'name' -> column names
Example
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  columns=['A', 'B', 'C', 'D'],
                  index=['one', 'two', 'three'])
df.index.name = 'name'
df:
A B C D
name
one 0 1 2 3
two 4 5 6 7
three 8 9 10 11
d:
{'one': [0, 1, 2, 3],
'two': [4, 5, 6, 7],
'three': [8, 9, 10, 11],
'name': ['A', 'B', 'C', 'D']}
Code:
dict([(nm, [a, b, c, d]) for nm, a, b, c, d in zip(df.index, df.A, df.B, df.C, df.D)])
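If you would rather not list the columns by hand, a roughly equivalent sketch using itertuples:
# Each tuple's first element is the index label; the rest are the row values.
d = {row[0]: list(row[1:]) for row in df.itertuples()}
d[df.index.name] = df.columns.tolist()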
I want to analyze the items in a set of sequences and the positions in each sequence where the items appear.
For example:
dataframe['sequence_list'][0] = ['a','b', 'f', 'e']
dataframe['sequence_list'][1] = ['a','c', 'd', 'e']
dataframe['sequence_list'][2] = ['a','d']
...
dataframe['sequence_list'][i] = ['a','b', 'c']
What I want to get is:
How many times does 'a' appear in positions 0, 1, 2, 3 of the list?
How many times does 'b' appear in positions 0, 1, 2, 3 of the list?
...
Output would be like:
output[1,'a'] = 4
output[2,'a'] = 0
output[3,'a'] = 0
output[4,'a'] = 0
output[1,'b'] = 2
...
The output format could be different. I just want to know whether there is a quick matrix-computation method that produces these statistics.
Start by converting the lists into Series using one of the two statements:
df_ser = dataframe.sequence_list.apply(pd.Series)
df_ser = pd.DataFrame(dataframe.sequence_list.tolist()) # ~30% faster?
The columns of the new dataframe are item positions for each row:
# 0 1 2 3
#0 a b f e
#1 a c d e
#2 a d NaN NaN
#3 a b c NaN
Convert the column numbers into the second-level index, then the second-level index into a column of its own:
df_col = df_ser.stack().reset_index(level=1)
# level_1 0
#0 0 a
#0 1 b
#0 2 f
#....
Count the combinations. This is your answer:
output = df_col.groupby(['level_1', 0]).size()
#level_1 0
#0 a 4
#1 b 2
# c 1
# d 1
#2 c 1
# d 1
# f 1
#3 e 2
You can have it as a dictionary:
output.to_dict()
#{(0, 'a'): 4, (1, 'b'): 2, (1, 'c'): 1, (1, 'd'): 1,
# (2, 'c'): 1, (2, 'd'): 1, (2, 'f'): 1, (3, 'e'): 2}
All in one line:
dataframe.sequence_list.apply(pd.Series)\
         .stack().reset_index(level=1)\
         .groupby(['level_1', 0]).size().to_dict()
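Since the question asks for a matrix-style computation, you could also cross-tabulate position against item, which fills in the zero counts (a sketch building on df_col from above):
# Rows are positions, columns are items; absent combinations show as 0.
matrix = pd.crosstab(df_col['level_1'], df_col[0])
matrix.loc[0, 'a']   # -> 4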
Using the setup
df = pd.DataFrame({'col': [['a','b', 'f', 'e'], ['a','c', 'd', 'e'], ['a','d'], ['a','b', 'c']]})
col
0 [a, b, f, e]
1 [a, c, d, e]
2 [a, d]
3 [a, b, c]
You can apply collections.Counter to each column:
from collections import Counter

pd.DataFrame(df.col.tolist()).apply(Counter)
which yields
0 {'a': 4}
1 {'b': 2, 'c': 1, 'd': 1}
2 {'f': 1, 'd': 1, None: 1, 'c': 1}
3 {'e': 2, None: 2}
dtype: object
for each position (column). You can then process the result as you need, e.g. fill in zeroes for missing items or, if appropriate, disregard the None entries that come from padding the shorter lists.
I have a basic question regarding working with arrays:
a = ['c', 'b', 'a', 'a', 'b', 'b', 'c', 'a', 'a', 'b', 'b', 'c', 'a', 'a', 'b', 'a', 'c', 'b']
b = [0, 1, 0, 1, 0, 0, 0, 0, 2, 0, 1, 0, 2, 0, 0, 1, 0, 1]
I) Is there a short way to count the number of times 'c' in a corresponds to 0, 1, and 2 in b, the number of times 'b' in a corresponds to 0, 1, and 2, and so on?
II) How do I create new arrays c (a subset of a) and d (a subset of b) that contain only those elements whose corresponding element in a is 'c'?
In [10]: p = ['a', 'b', 'c', 'a', 'c', 'a']
In [11]: q = [1, 2, 1, 3, 3, 1]
In [12]: z = list(zip(p, q))
In [13]: z
Out[13]: [('a', 1), ('b', 2), ('c', 1), ('a', 3), ('c', 3), ('a', 1)]
In [14]: counts = {}
In [15]: for pair in z:
...: if pair in counts:
...: counts[pair] += 1
...: else:
...: counts[pair] = 1
...:
In [16]: counts
Out[16]: {('a', 1): 2, ('a', 3): 1, ('b', 2): 1, ('c', 1): 1, ('c', 3): 1}
In [17]: sub_p = []
In [18]: sub_q = []
In [19]: for i, element in enumerate(p):
...: if element == 'a':
...: sub_p.append(element)
...: sub_q.append(q[i])
In [20]: sub_p
Out[20]: ['a', 'a', 'a']
In [21]: sub_q
Out[21]: [1, 3, 1]
Explanation
zip takes two lists and runs a figurative zipper between them, resulting in a list of tuples.
I've used a simplistic approach: I'm just maintaining a map/dictionary that makes note of how many times it has seen each (char, int) pair.
Then I make two sub-lists that you can adapt to the character in question to figure out what it maps to.
Alternative methods
As abarnert suggested, you could use a Counter from collections instead.
Or you could just use the count method on z, e.g. z.count(('a', 1)), or use a defaultdict instead.
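For instance, the Counter version is a one-liner (a minimal sketch, reusing p and q from the session above):
from collections import Counter

counts = Counter(zip(p, q))
# Counter({('a', 1): 2, ('b', 2): 1, ('c', 1): 1, ('a', 3): 1, ('c', 3): 1})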
The questions are a bit vague, but here's a quick method (some would call it dirty) using Pandas, though I think something written without recourse to Pandas would be preferable.
import pandas as pd
#create OP's lists
a= [ 'c', 'b', 'a', 'a', 'b', 'b', 'c', 'a', 'a', 'b', 'b', 'c', 'a', 'a', 'b', 'a', 'c', 'b']
b= [ 0, 1, 0, 1, 0, 0, 0, 0, 2, 0, 1, 0, 2, 0, 0, 1, 0, 1]
#dump lists to a Pandas DataFrame
df = pd.DataFrame({'a':a, 'b':b})
Question 1
provided I interpreted it correctly, you can cross-tabulate the two arrays:
pd.crosstab(df.a, df.b).stack(). Cross-tabulating basically counts the number of times each number corresponds to a particular letter; .stack turns the output of .crosstab into a more legible format.
#question 1
pd.crosstab(df.a, df.b).stack()
Out[9]:
a b
a 0 3
1 2
2 2
b 0 4
1 3
2 0
c 0 4
1 0
2 0
dtype: int64
Question 2
Here, I use Pandas' boolean indexing to select only the elements in array a that correspond to the value 'c'. df.a == 'c' returns True for every value in a that is 'c' and False otherwise, and df.loc[df.a == 'c', 'a'] returns the values from a for which the boolean statement is True.
c = df.loc[df.a == 'c', 'a']
d = df.loc[df.a == 'c', 'b']
In [15]: c
Out[15]:
0 c
6 c
11 c
16 c
Name: a, dtype: object
In [16]: d
Out[16]:
0 0
6 0
11 0
16 0
Name: b, dtype: int64
Python lists (https://www.tutorialspoint.com/python/python_lists.htm) have a count method.
I suggest you first zip both lists, as said in the comments, and then count the occurrences of the tuple ('c', 0), of the tuple ('c', 1), and so on; that is basically what you need for (I).
For (II), if I understood you correctly, you have to take the zipped lists and apply filter to them with lambda x: x[0] == 'c'.
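Put together, a minimal sketch of that approach:
from collections import Counter

a = ['c', 'b', 'a', 'a', 'b', 'b', 'c', 'a', 'a', 'b', 'b', 'c', 'a', 'a', 'b', 'a', 'c', 'b']
b = [0, 1, 0, 1, 0, 0, 0, 0, 2, 0, 1, 0, 2, 0, 0, 1, 0, 1]

pairs = list(zip(a, b))
counts = Counter(pairs)                        # (I): counts[('c', 0)] == 4
c_pairs = list(filter(lambda x: x[0] == 'c', pairs))
c = [x for x, y in c_pairs]                    # (II): ['c', 'c', 'c', 'c']
d = [y for x, y in c_pairs]                    # (II): [0, 0, 0, 0]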
I am trying to find the record with the maximum value among the first records of each group after a groupby, and to delete that record from the original dataframe.
import pandas as pd
df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
'cost': [1, 2, 1, 1, 3, 1, 5]})
print(df)
t = df.groupby('item_id').first() #lost track of the index
desired_row = t[t.cost == t.cost.max()]
#delete this row from df
cost
item_id
d 5
I need to keep track of desired_row and delete this row from df and repeat the process.
What is the best way to find and delete the desired_row?
I am not sure of a general way, but this will work in your case since you are taking the first item of each group (it would also easily work on the last). In fact, because of the general nature of split-apply-combine, I don't think this is easily achievable without doing it yourself.
gb = df.groupby('item_id', as_index=False)
>>> gb.groups # Index locations of each group.
{'a': [0, 1], 'b': [2, 3, 4], 'c': [5], 'd': [6]}
# Get the first index location from each group using a dictionary comprehension.
subset = {k: v[0] for k, v in gb.groups.items()}
df2 = df.iloc[list(subset.values())]
# These are the first items in each groupby.
>>> df2
cost item_id
0 1 a
5 1 c
2 1 b
6 5 d
# Exclude any items from above where the cost is equal to the max cost across the first item in each group.
>>> df[~df.index.isin(df2[df2.cost == df2.cost.max()].index)]
cost item_id
0 1 a
1 2 a
2 1 b
3 1 b
4 3 b
5 1 c
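A more index-friendly variant of the same idea (a sketch using head(1), which keeps the original index, so the row can be dropped directly):
first_rows = df.groupby('item_id').head(1)   # first row of each group, original index kept
drop_idx = first_rows['cost'].idxmax()       # index label of the max cost among them
df_out = df.drop(drop_idx)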
Try this:
import pandas as pd
df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
'cost': [1, 2, 1, 1, 3, 1, 5]})
t = df.drop_duplicates(subset=['item_id'], keep='first')
desired_row = t[t.cost == t.cost.max()]
df[~df.index.isin([desired_row.index[0]])]
Out[186]:
cost item_id
0 1 a
1 2 a
2 1 b
3 1 b
4 3 b
5 1 c
Or, using the equivalent of not in. Consider this df with a few more rows:
df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd', 'd', 'd'],
                   'cost': [1, 2, 1, 1, 3, 1, 5, 1, 7]})
df[~df.cost.isin(df.groupby('item_id').first().max().tolist())]
cost item_id
0 1 a
1 2 a
2 1 b
3 1 b
4 3 b
5 1 c
7 1 d
8 7 d
Overview: create a dataframe from a dictionary. Group by item_id and find the max value per group. Enumerate over the grouped result, using the numeric key to recover the alphabetic index value. Create a result_df dataframe if you desire.
import pandas as pd

df_temp = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
                        'cost': [1, 2, 1, 1, 3, 1, 5]})
grouped = df_temp.groupby(['item_id'])['cost'].max()

result_df = pd.DataFrame(columns=['item_id', 'cost'])
for key, value in enumerate(grouped):
    index = grouped.index[key]
    # note: DataFrame.append was removed in pandas 2.0; use pd.concat there
    result_df = result_df.append({'item_id': index, 'cost': value}, ignore_index=True)
print(result_df.head(5))
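The same result_df can be built without the loop, since the grouped Series already carries item_id as its index (a sketch):
result_df = grouped.reset_index()   # columns: item_id, cost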
I have a list of lists - representing a table with 4 columns and many rows (10000+).
Each sub-list contains 4 variables.
Here is a small part of my table:
['1810569', 'a', 5, '1241.52']
['1437437', 'a', 5, '1123.90']
['1437437', 'b', 5, '1232.43']
['1810569', 'b', 5, '1321.31']
['1810569', 'a', 5, '1993.52']
The first column represents the household ID, and the second represents the member ID within the household.
The fourth column represents weights that I want to sum - distinctly for each member.
For the example above I want the output to be:
['1810569', 'a', 5, '3235.04']
['1437437', 'a', 5, '1123.90']
['1437437', 'b', 5, '1232.43']
['1810569', 'b', 5, '1321.31']
In other words: sum the weights in lines 1 and 5, since they belong to the same member, while all the other members stay distinct.
I saw something about groupby in pandas, but I didn't understand how exactly to use it for my problem.
Assuming the following is your list, then this would work:
In [192]:
l=[['1810569', 'a', 5, '1241.52'],
['1437437', 'a', 5, '1123.90'],
['1437437', 'b', 5, '1232.43'],
['1810569', 'b', 5, '1321.31'],
['1810569', 'a', 5, '1993.52']]
l
Out[192]:
[['1810569', 'a', 5, '1241.52'],
['1437437', 'a', 5, '1123.90'],
['1437437', 'b', 5, '1232.43'],
['1810569', 'b', 5, '1321.31'],
['1810569', 'a', 5, '1993.52']]
In [201]:
# construct the df and convert the last column to float
df = pd.DataFrame(l, columns=['household ID', 'Member ID', 'some col', 'weights'])
df['weights'] = df['weights'].astype(float)
df
Out[201]:
household ID Member ID some col weights
0 1810569 a 5 1241.52
1 1437437 a 5 1123.90
2 1437437 b 5 1232.43
3 1810569 b 5 1321.31
4 1810569 a 5 1993.52
So we can now group by the household and member ID and call sum on the 'weights' column:
In [200]:
df.groupby(['household ID', 'Member ID'])['weights'].sum().reset_index()
Out[200]:
household ID Member ID weights
0 1437437 a 1123.90
1 1437437 b 1232.43
2 1810569 a 3235.04
3 1810569 b 1321.31
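If you need the result back as a list of lists like the input, you could keep the third column by grouping on it as well (a sketch; it assumes, as in the sample, that the third value is constant within each member):
summed = df.groupby(['household ID', 'Member ID', 'some col'], as_index=False)['weights'].sum()
rows = summed.values.tolist()
# [['1437437', 'a', 5, 1123.9], ['1437437', 'b', 5, 1232.43], ...]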
You could do it with a dict, using the first three elements as keys to group the data by:
d = {}
for k, b, c, w in l:
    if (k, b, c) in d:
        d[k, b, c][-1] += float(w)   # same member seen before: add the weight
    else:
        d[k, b, c] = [k, b, c, float(w)]

from pprint import pprint as pp
pp(list(d.values()))
Output:
[['1810569', 'b', 5, 1321.31],
['1437437', 'b', 5, 1232.43],
['1437437', 'a', 5, 1123.9],
['1810569', 'a', 5, 3235.04]]
If you wanted to maintain a first-seen order (on Python 3.7+ plain dicts already preserve insertion order, so this is only needed on older versions):
from collections import OrderedDict

d = OrderedDict()
for k, b, c, w in l:
    if (k, b, c) in d:
        d[k, b, c][-1] += float(w)
    else:
        d[k, b, c] = [k, b, c, float(w)]

from pprint import pprint as pp
pp(list(d.values()))
Output:
[['1810569', 'a', 5, 3235.04],
['1437437', 'a', 5, 1123.9],
['1437437', 'b', 5, 1232.43],
['1810569', 'b', 5, 1321.31]]