I have a data set of orders with the item ordered, the quantity ordered, and the box it was shipped in. I'd like to find the possible order combinations of [Box Type, Item, Quantity] and assign each order an identifier for its combination for further analysis. Ideally, the output would look like this:
d2 = {'Order Number': [1, 2, 3], 'Order Type': [1, 2, 1]}
pd.DataFrame(d2)
Where grouping by 'Order Type' would provide a count of the unique order types.
The problem is that each box is assigned a unique code that is needed to distinguish whether a box held multiple items. In the example data below, "box_id" = 3 shows that the second "Box A" contains two items, A2 and A3. While this field is needed to capture which items shared a box, its uniqueness prevents otherwise-identical orders from being recognized as the same combination.
import pandas as pd
d = {'Order Number': [1, 2, 2, 2, 3], 'Box_id': [1, 2, 3, 3, 4], 'Box Type': ['Box A', 'Box B', 'Box A', 'Box A', 'Box A'],
'Item': ['A1', 'A2', 'A2', 'A3', 'A1'], 'Quantity': [2, 4, 2, 2, 2]}
pd.DataFrame(d)
I have tried representing each order as a tuple of its [Box type, Item, Quantity] data and using those tuples to capture counts with a default dictionary, but that output is understandably messy to interpret and difficult to match with orders afterwards.
from collections import defaultdict
combinations = defaultdict(int)
Order1 = ((('Box A', 'A1', 2),),)
Order2 = ((('Box B', 'A2', 4),), (('Box A', 'A2', 2), ('Box A', 'A3', 2)))
Order3 = ((('Box A', 'A1', 2),),)
combinations[Order1] += 1
combinations[Order2] += 1
combinations[Order3] += 1
# Should result in
combinations = {((('Box A', 'A1', 2),),): 2,
                ((('Box B', 'A2', 4),), (('Box A', 'A2', 2), ('Box A', 'A3', 2))): 1}
Is there an easier way to get a representation of unique order combinations and their counts?
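Not from the original post, but one possible sketch of a pandas-based alternative: build the same canonical nested-tuple key per order with groupby, then let pd.factorize assign each unique combination an integer "Order Type" label. Column names follow the sample frame above; the box grouping via Box_id is preserved, but Box_id itself drops out of the key.

```python
import pandas as pd

d = {'Order Number': [1, 2, 2, 2, 3], 'Box_id': [1, 2, 3, 3, 4],
     'Box Type': ['Box A', 'Box B', 'Box A', 'Box A', 'Box A'],
     'Item': ['A1', 'A2', 'A2', 'A3', 'A1'], 'Quantity': [2, 4, 2, 2, 2]}
df = pd.DataFrame(d)

# One hashable key per box: a sorted tuple of its (Box Type, Item, Quantity) rows.
box_keys = (df.groupby(['Order Number', 'Box_id'])
              .apply(lambda g: tuple(sorted(zip(g['Box Type'], g['Item'], g['Quantity'])))))

# One key per order: the sorted tuple of its box keys (the unique Box_id drops out).
order_keys = box_keys.groupby(level=0).apply(lambda s: tuple(sorted(s)))

# factorize gives identical keys the same integer label.
codes, uniques = pd.factorize(order_keys)
result = pd.DataFrame({'Order Number': order_keys.index, 'Order Type': codes + 1})
```

Counts per combination then come from `result['Order Type'].value_counts()`, and `result` can be merged back onto the original orders.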
I have a pandas dataframe being generated by some other piece of code. The dataframe may have a different number of columns each time it is generated: let's call them col1, col2, ..., coln, where n is not fixed. Please note that col1, col2, ... are just placeholders; the actual column names can be arbitrary, like TimeStamp or PrevState.
From this, I want to convert each column into a list, with the name of the list being the same as the column. So, I want a list named col1 with the entries in the first column of the dataframe and so on till coln.
How do I do this?
Thanks
Creating variables from column names dynamically is not recommended; it is better to create a dictionary:
d = df.to_dict('list')
Then select a list by its dict key, which is the column name:
print (d['col1'])
Sample:
df = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
})
d = df.to_dict('list')
print (d)
{'A': ['a', 'b', 'c', 'd', 'e', 'f'], 'B': [4, 5, 4, 5, 5, 4], 'C': [7, 8, 9, 4, 2, 3]}
print (d['A'])
['a', 'b', 'c', 'd', 'e', 'f']
import pandas as pd
df = pd.DataFrame()
df["col1"] = [1,2,3,4,5]
df["colTWO"] = [6,7,8,9,10]
for col_name in df.columns:
    exec(col_name + " = " + repr(df[col_name].tolist()))
I am trying to obtain a list from a Dataframe based on a common value of the index.
In the example below I am trying to obtain the lists for 'type' and 'xx' based on 'date'.
Here is the Dataframe:
import pandas as pd
import numpy as np
idx = [np.array(['Jan', 'Jan', 'Feb', 'Mar', 'Mar', 'Mar']),np.array(['A1', 'A2', 'A2', 'A1', 'A3', 'A4'])]
data = [{'xx': 1}, {'xx': 5}, {'xx': 3}, {'xx': 2}, {'xx': 7}, {'xx': 3}]
df = pd.DataFrame(data, index=idx, columns=['xx'])
df.index.names=['date','type']
df.reset_index(inplace=True)
df=df.set_index(['date'])
Which looks like this:
type xx
date
Jan A1 1
Jan A2 5
Feb A2 3
Mar A1 2
Mar A3 7
Mar A4 3
What I am trying to do is to create these two lists:
#list_type
[['A1', 'A2'], ['A2'], ['A1', 'A3', 'A4']]
#list_xx
[['1', '5'], ['3'], ['2', '7', '3']]
As you can see, the elements of the lists are constructed based on a common date.
I would really value an efficient way of doing this in Python.
Use GroupBy.agg with list, then convert the DataFrame to a dictionary of lists with DataFrame.to_dict:
d = df.groupby(level=0, sort=False).agg(list).to_dict('list')
print (d)
{'type': [['A1', 'A2'], ['A2'], ['A1', 'A3', 'A4']], 'xx': [[1, 5], [3], [2, 7, 3]]}
print (d['type'])
[['A1', 'A2'], ['A2'], ['A1', 'A3', 'A4']]
print (d['xx'])
[[1, 5], [3], [2, 7, 3]]
SOLVED:
# Split and save all unique parts to separate CSV
for unique_part in df['Part'].unique():
    df.loc[df['Part'] == unique_part].to_csv(f'Part_{unique_part}.csv')
I have a table containing production data on parts and the variables that were recorded during their production. I need to slice out all columns for each unique part's rows, i.e. all columns for parts #1, #2, and #3 should be sliced out and put into separate dataframes.
FORMAT:
Part | Variable1 | Variable 2 etc
1-----------X---------------X
1-----------X---------------X
2-----------X---------------X
2-----------X---------------X
2-----------X---------------X
2-----------X---------------X
2-----------X---------------X
2-----------X---------------X
2-----------X---------------X
3-----------X---------------X
3-----------X---------------X
3-----------X---------------X
I have already tried
Creating a dictionary to group by
dict = {k: v for k, v in df.groupby('Part')}
This didn't work because I couldn't properly convert from dict to DataFrame with the correct format
I also tried creating a variable to store all unique part numbers, I just don't know how to loop through the main dataframe to slice out each unique part row section
part_num = df['Part'].unique()
In summary, I need to create separate dataframes with all variable columns for each cluster of rows with unique part number ids.
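For what it's worth, the dictionary attempt above is closer than it looks: each value produced by iterating over a groupby is already a DataFrame, so no dict-to-DataFrame conversion step is needed. A minimal sketch with made-up variable values:

```python
import pandas as pd

df = pd.DataFrame({'Part': [1, 1, 2, 2, 3],
                   'Variable1': [10, 11, 12, 13, 14],   # made-up values
                   'Variable2': [20, 21, 22, 23, 24]})

# groupby iteration yields (key, sub-DataFrame) pairs, so each dict value
# is already a DataFrame holding every column for one part number.
frames = {part: grp for part, grp in df.groupby('Part')}

print(frames[2])   # all rows and columns for part #2
```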
You can groupby and then apply to turn each group into a list of dicts, and then turn the groupby into a dict where each key is the unique Part value.
Something like:
df = pd.DataFrame({
'Part': [1,1,1,3,3,2,2,2],
'other': ['a','b','c','d','e','f','g','h']
})
d = df.groupby('Part').apply(lambda d: d.to_dict('records')).to_dict()
print(d)
will print
{1: [{'Part': 1, 'other': 'a'},
{'Part': 1, 'other': 'b'},
{'Part': 1, 'other': 'c'}],
2: [{'Part': 2, 'other': 'f'},
{'Part': 2, 'other': 'g'},
{'Part': 2, 'other': 'h'}],
3: [{'Part': 3, 'other': 'd'}, {'Part': 3, 'other': 'e'}]}
I think you are on the right track with groupby.
df = pd.DataFrame({"Part": [1, 1, 2, 2],
"Var1": [10, 11, 12, 13],
"Var2": [20, 21, 22, 23]})
dfg = df.groupby("Part")
df1 = dfg.get_group(1)
df2 = dfg.get_group(2)
What do you want to DO with the data? Do you really need to create a bunch of individual data frames? The example below loops through each group (each part #) and prints. You could use the same method to do something or get something from each group without creating individual data frames.
for grp in dfg.groups:
    print(dfg.get_group(grp))
    print()
Output:
Part Var1 Var2
0 1 10 20
1 1 11 21
Part Var1 Var2
2 2 12 22
3 2 13 23
I have data like --
sample 1, domain 1, value 1
sample 1, domain 2, value 1
sample 2, domain 1, value 1
sample 2, domain 3, value 1
-- stored in a dictionary --
dict_1 = {('sample 1','domain 1'): value 1, ('sample 1', 'domain 2'): value 1}
-- etc.
Now, I have a different kind of value, named value 2 --
sample 1, domain 1, value 2
sample 1, domain 2, value 2
sample 2, domain 1, value 2
sample 2, domain 3, value 2
-- which I again put in a dictionary,
dict_2 = {('sample 1','domain 1'): value 2, ('sample 1', 'domain 2'): value 2}
How can I merge these two dictionaries in python? The keys, for instance ('sample 1', 'domain 1') are the same for both dictionaries.
I expect it to look like --
final_dict = {('sample 1', 'domain 1'): (value 1, value 2), ('sample 1', 'domain 2'): (value 1, value 2)}
-- etc.
The closest you're likely to get to this would be a dict of lists (or sets). For simplicity, you usually go with collections.defaultdict(list) so you're not constantly checking if the key already exists. You need to map to some collection type as a value because dicts have unique keys, so you need some way to group the multiple values you want to store for each key.
from collections import defaultdict
final_dict = defaultdict(list)
for d in (dict_1, dict_2):
    for k, v in d.items():
        final_dict[k].append(v)
Or equivalently with itertools.chain, you just change the loop to:
from itertools import chain
for k, v in chain(dict_1.items(), dict_2.items()):
    final_dict[k].append(v)
Side-note: If you really need it to be a proper dict at the end, and/or insist on the values being tuples rather than lists, a final pass can convert to such at the end:
final_dict = {k: tuple(v) for k, v in final_dict.items()}
You can use set intersection of keys to do this:
dict_1 = {('sample 1','domain 1'): 'value 1', ('sample 1', 'domain 2'): 'value 1'}
dict_2 = {('sample 1','domain 1'): 'value 2', ('sample 1', 'domain 2'): 'value 2'}
result = {k: (dict_1.get(k), dict_2.get(k)) for k in dict_1.keys() & dict_2.keys()}
print(result)
# {('sample 1', 'domain 1'): ('value 1', 'value 2'), ('sample 1', 'domain 2'): ('value 1', 'value 2')}
The above uses dict.get() to avoid the possibility of a KeyError being raised (very unlikely), since it will just return None by default.
As #ShadowRanger suggests in the comments, If a key is for some reason not found, you could replace from the opposite dictionary:
{k: (dict_1.get(k, dict_2.get(k)), dict_2.get(k, dict_1.get(k))) for k in dict_1.keys() | dict_2.keys()}
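As a quick illustration (with made-up keys and values), the union version falls back to the other dictionary whenever one side is missing a key:

```python
dict_1 = {('sample 1', 'domain 1'): 'value 1', ('sample 1', 'domain 2'): 'value 1'}
dict_2 = {('sample 1', 'domain 1'): 'value 2', ('sample 2', 'domain 3'): 'value 2'}

# Keys from either dict; a side that lacks the key borrows the other side's value.
merged = {k: (dict_1.get(k, dict_2.get(k)), dict_2.get(k, dict_1.get(k)))
          for k in dict_1.keys() | dict_2.keys()}
```

A key present in both dicts gets both values, e.g. `('sample 1', 'domain 1')` maps to `('value 1', 'value 2')`, while a one-sided key like `('sample 2', 'domain 3')` repeats its only value.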
Does something handcrafted like this work for you?
dict3 = {}
for i in dict1:
    dict3[i] = (dict1[i], dict2[i])
from collections import defaultdict
from itertools import chain
dict_1 = {('sample 1','domain 1'): 1, ('sample 1', 'domain 2'): 2}
dict_2 = {('sample 1','domain 1'): 3, ('sample 1', 'domain 2'): 4}
new_dict_to_process = defaultdict(list)
dict_list=[dict_1.items(),dict_2.items()]
for k, v in chain(*dict_list):
    new_dict_to_process[k].append(v)
Output will be
{('sample 1', 'domain 1'): [1, 3],
('sample 1', 'domain 2'): [2, 4]}
I have a Pandas DataFrame which contains an ID, Code and Date. For certain codes, I would like to flag subsequent appearances of the same ID, ordered by date, with the set of codes that are now missing. I would also like to know the date of the first appearance of each such code against the respective ID.
Example as follows, NB: missing codes are A and B (only codes A and B carry over):
import pandas as pd
d = {'ID': [1, 2, 1, 2, 3, 1], 'date': ['2017-03-22', '2017-03-21', '2017-03-23', '2017-03-24', '2017-03-28', '2017-03-28'], 'Code': ['A, C', 'A', 'B, C', 'E, D', 'A', 'C']}
df = pd.DataFrame(data=d)
# only A and B codes carry over
df
The target dataframe would ideally look as follows:
import pandas as pd
d = {'ID': [1, 2, 1, 2, 3, 1], 'date': ['2017-03-22', '2017-03-21', '2017-03-23', '2017-03-24', '2017-03-28', '2017-03-28'], 'Code': ['A, C', 'A', 'B, C', 'E, D', 'A', 'C'], 'Missing_code': ['', '', 'A', 'A', '', 'A, B'], 'First_code_date': ['', '', '2017-03-22', '2017-03-21', '', '2017-03-22, 2017-03-23']}
df = pd.DataFrame(data=d)
df
Note I am not fussy about how 'First_code_date' looks, provided it is dynamic, as the number of codes may increase or decrease.
If the example is not clear please let me know and I will adjust.
Thank you for your help.
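One possible sketch, not a tested production answer. Assumptions: only codes A and B carry over, rows within an ID are compared in date order, and the ISO-style date strings sort chronologically as plain strings. It walks the rows by date, remembering the first date each carry-over code was seen per ID:

```python
import pandas as pd

d = {'ID': [1, 2, 1, 2, 3, 1],
     'date': ['2017-03-22', '2017-03-21', '2017-03-23', '2017-03-24', '2017-03-28', '2017-03-28'],
     'Code': ['A, C', 'A', 'B, C', 'E, D', 'A', 'C']}
df = pd.DataFrame(data=d)

CARRY = ['A', 'B']                       # only these codes carry over

missing = pd.Series('', index=df.index)
first_dates = pd.Series('', index=df.index)
seen = {}                                # ID -> {code: date of first appearance}

# Walk the rows in date order; ISO-style dates sort chronologically as strings.
for idx, row in df.sort_values('date').iterrows():
    codes_here = {c.strip() for c in row['Code'].split(',')}
    prior = seen.setdefault(row['ID'], {})
    # Carry-over codes seen earlier for this ID but absent from this row.
    gaps = [c for c in CARRY if c in prior and c not in codes_here]
    missing[idx] = ', '.join(gaps)
    first_dates[idx] = ', '.join(prior[c] for c in gaps)
    # Record the first date each carry-over code appears for this ID.
    for c in codes_here.intersection(CARRY):
        prior.setdefault(c, row['date'])

df['Missing_code'] = missing
df['First_code_date'] = first_dates
```

On the sample frame this marks 'A' as missing from ID 1's 2017-03-23 row (first seen 2017-03-22) and 'A, B' from its 2017-03-28 row, since for ID 1 code A first appears on 2017-03-22 and code B on 2017-03-23.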