Pandas data manipulation: mapping a column with some predetermined values - python

My data looks like this:
import pandas as pd
import numpy as np
T1_Delivery = 20
T2_Delivery = 30
T3_Delivery = 40
T4_Delivery = 55
data = [
{'Person': 'A', 'Present_Delivery': -10, 'update': 'T1'},
{'Person': 'B', 'Present_Delivery': 30},
{'Person': 'C', 'Present_Delivery': 40},
{'Person': 'D', 'Present_Delivery': 70, 'update': 'T3'},
{'Person': 'E', 'Present_Delivery': 50, 'update': 'T2'},
{'Person': 'F', 'Present_Delivery': 50}
]
df = pd.DataFrame(data)
df['Actual_Delivery'] = np.where(df['update']==np.NaN, df['Present_Delivery'],0)
#map T{x} to T{x}_Delivery
I need to map each update entry T{x} to the T{x}_Delivery value defined globally. Is this possible? I am able to do the mapping when the global names don't carry the _Delivery suffix.
My desired output is something like this:
data = [
{'Person': 'A', 'Actual_Delivery': 20},
{'Person': 'B', 'Actual_Delivery': 30},
{'Person': 'C', 'Actual_Delivery': 40},
{'Person': 'D', 'Actual_Delivery': 40},
{'Person': 'E', 'Actual_Delivery': 30},
{'Person': 'F', 'Actual_Delivery': 50}
]
df_desired = pd.DataFrame(data)
EDIT: This is part of a bigger script, and it is not possible to change the global variables into a dictionary!

You can build a dictionary for the mapping, then use pd.Series.map and pd.Series.fillna:
mapping = {'T1':20,'T2':30,'T3':40,'T4':55}
df_final = (df[['Person', 'Present_Delivery']]
            .assign(Present_Delivery=df['update'].map(mapping).fillna(df['Present_Delivery']))
            )
Person Present_Delivery
0 A 20.0
1 B 30.0
2 C 40.0
3 D 40.0
4 E 30.0
5 F 50.0
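Since the OP notes the globals cannot be replaced with a dictionary, the mapping itself can be built from them at runtime instead of being hard-coded. A minimal sketch, assuming the variables follow the T{x}_Delivery naming pattern from the question:
# Build the mapping from the existing module-level variables;
# range(1, 5) is an assumption based on T1..T4 in the question.
mapping = {f'T{i}': globals()[f'T{i}_Delivery'] for i in range(1, 5)}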
Another idea using pd.Series.where along with pd.Series.isna:
df['Present_Delivery'] = df['Present_Delivery'].where(
    df['update'].isna(), df['update'].map(mapping)
)
df_final = df.drop(columns='update')
Person Present_Delivery
0 A 20
1 B 30
2 C 40
3 D 40
4 E 30
5 F 50

Related

Join json files in Pandas from multiple rows

I am given a data frame (Table 1) with the following format. It has an id, col1, col2, and a json_col.
id col1 col2 json_col
1 a b json1
2 a c json2
3 b d json3
4 c a json4
5 d e json5
I have a new table (Table 2), and I would like to join the JSON objects for each row of my new table:
col1 col2 col3 col4 union_json
a    b              json1
a    b    d         json1 and json3 union
a    b    d    e    json1, json3, and json5 union
c    a              json4
Here is an example of Table 1
df = pd.DataFrame({'col1': ['a', 'a', 'b', 'c', 'd'],
'col2': ['b', 'c', 'd', 'a', 'e'],
'col3': [{"origin":"a","destination":"b", "arc":[{"Type":"763","Number":"20"}]},
{"origin":"a","destination":"c", "arc":[{"Type":"763","Number":"50"}]},
{"origin":"a","destination":"d", "arc":[{"Type":"723","Number":"40"}]},
{"origin":"c","destination":"a", "arc":[{"Type":"700","Number":"30"}]},
{"origin":"d","destination":"e", "arc":[{"Type":"700","Number":"40"}]}]})
And, here is an example of Table 2:
df = pd.DataFrame({'col1': ['a', 'a', 'a', 'c'],
'col2': ['b', 'b', 'b', 'a'],
'col3': ['', 'd', 'd', ''],
'col4': ['', '', 'e', '']})
The union of json1 and json3 should look like this:
[[{"origin":"a","destination":"b", "arc":[{"Type":"763","Number":"20"}]}],
[{"origin":"a","destination":"d", "arc":[{"Type":"723","Number":"40"}]}]]
I hope I've understood your question right:
from itertools import combinations

def fn(x):
    out, non_empty_vals = [], x[x != ""]
    for c in combinations(non_empty_vals, 2):
        out.extend(df1.loc[df1[["col1", "col2"]].eq(c).all(axis=1), "col3"])
    return out

df2["union_json"] = df2.apply(fn, axis=1)
print(df2.to_markdown(index=False))
Prints:
| col1 | col2 | col3 | col4 | union_json |
|:-----|:-----|:-----|:-----|:-----------|
| a    | b    |      |      | [{'origin': 'a', 'destination': 'b', 'arc': [{'Type': '763', 'Number': '20'}]}] |
| a    | b    | d    |      | [{'origin': 'a', 'destination': 'b', 'arc': [{'Type': '763', 'Number': '20'}]}, {'origin': 'a', 'destination': 'd', 'arc': [{'Type': '723', 'Number': '40'}]}] |
| a    | b    | d    | e    | [{'origin': 'a', 'destination': 'b', 'arc': [{'Type': '763', 'Number': '20'}]}, {'origin': 'a', 'destination': 'd', 'arc': [{'Type': '723', 'Number': '40'}]}, {'origin': 'd', 'destination': 'e', 'arc': [{'Type': '700', 'Number': '40'}]}] |
| c    | a    |      |      | [{'origin': 'c', 'destination': 'a', 'arc': [{'Type': '700', 'Number': '30'}]}] |
Dataframes used:
df1
col1 col2 col3
0 a b {'origin': 'a', 'destination': 'b', 'arc': [{'Type': '763', 'Number': '20'}]}
1 a c {'origin': 'a', 'destination': 'c', 'arc': [{'Type': '763', 'Number': '50'}]}
2 b d {'origin': 'a', 'destination': 'd', 'arc': [{'Type': '723', 'Number': '40'}]}
3 c a {'origin': 'c', 'destination': 'a', 'arc': [{'Type': '700', 'Number': '30'}]}
4 d e {'origin': 'd', 'destination': 'e', 'arc': [{'Type': '700', 'Number': '40'}]}
df2
col1 col2 col3 col4
0 a b
1 a b d
2 a b d e
3 c a
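For larger frames, re-scanning df1 inside apply for every pair can get slow. A hedged alternative (a sketch under the same df1/df2 assumptions as above) builds a (col1, col2) lookup once and reuses it:
from itertools import combinations

# One-time lookup keyed on (col1, col2), so each row of df2 is resolved
# without filtering df1 repeatedly.
lookup = {(r.col1, r.col2): r.col3 for r in df1.itertuples(index=False)}

def fn_fast(row):
    vals = [v for v in row if v != ""]
    return [lookup[c] for c in combinations(vals, 2) if c in lookup]

df2["union_json"] = df2[["col1", "col2", "col3", "col4"]].apply(fn_fast, axis=1)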

List to a Readable Representation using Python

I have data as
[{'name': 'A', 'subsets': ['X_1', 'X_A', 'X_B'], 'cluster': 0},
{'name': 'B', 'subsets': ['B_1', 'B_A'], 'cluster': 2},
{'name': 'C', 'subsets': ['X_1', 'X_A', 'X_B'], 'cluster': 0},
{'name': 'D', 'subsets': ['D_1', 'D_2', 'D_3', 'D_4'], 'cluster': 1}]
I need to represent it as
Cluster Number Subset Name
0 ['X_1', 'X_A', 'X_B'] A, C
1 ['D_1', 'D_2', 'D_3', 'D_4'] D
2 ['B_1', 'B_A'] B
For the sake of completeness, I think it is fair to mention that you can actually create a dataframe without json_normalize in your case and apply groupby as originally shown here:
import pandas as pd
data = [{'name': 'A', 'subsets': ['X_1', 'X_A', 'X_B'], 'cluster': 0},
{'name': 'B', 'subsets': ['B_1', 'B_A'], 'cluster': 2},
{'name': 'C', 'subsets': ['X_1', 'X_A', 'X_B'], 'cluster': 0},
{'name': 'D', 'subsets': ['D_1', 'D_2', 'D_3', 'D_4'], 'cluster': 1}]
df = (pd.DataFrame(data)
      .groupby('cluster')
      .agg({'subsets': 'first', 'name': ', '.join})
      .reset_index()
      .set_index('cluster')
      .rename_axis('Cluster Number'))
subsets name
Cluster Number
0 [X_1, X_A, X_B] A, C
1 [D_1, D_2, D_3, D_4] D
2 [B_1, B_A] B
You can use json_normalize + groupby on "cluster", aggregating "name" with join and "subsets" with first:
df = pd.json_normalize(data).groupby('cluster').agg({'subsets':'first','name':', '.join}).reset_index()
Output:
cluster subsets name
0 0 [X_1, X_A, X_B] A, C
1 1 [D_1, D_2, D_3, D_4] D
2 2 [B_1, B_A] B
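If the exact headers from the question are wanted, a small hedged extension of the same pipeline renames the columns (a sketch, assuming the data shown above):
out = (pd.json_normalize(data)
       .groupby('cluster', as_index=False)
       .agg({'subsets': 'first', 'name': ', '.join})
       .rename(columns={'cluster': 'Cluster Number', 'subsets': 'Subset', 'name': 'Name'}))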

What is the most efficient way to sum a dict with multiple keys by one key?

I have the following dict structure.
product1 = {'product_tmpl_id': product_id,
'qty':product_uom_qty,
'price':price_unit,
'subtotal':price_subtotal,
'total':price_total,
}
And then a list of products, each item in the list is a dict with the above structure
list_ = [product1,product2,product3,.....]
I need to sum the items in the list, grouped by the key product_tmpl_id. I'm using collections.defaultdict, but it only sums the qty key; I need to sum every key except product_tmpl_id, which is the grouping criterion.
from collections import defaultdict

c = defaultdict(float)
for d in list_:
    c[d['product_tmpl_id']] += d['qty']
c = [{'product_id': id, 'qty': qty} for id, qty in c.items()]
I know how to do it with a for loop, but I'm looking for a more Pythonic way.
Thanks.
EDIT:
What I need is to go from this:
lst = [
{'Name': 'A', 'qty':100,'price':10},
{'Name': 'A', 'qty':100,'price':10},
{'Name': 'A', 'qty':100,'price':10},
{'Name': 'B', 'qty':100,'price':10},
{'Name': 'C', 'qty':100,'price':10},
{'Name': 'C', 'qty':100,'price':10},
]
to this
group_lst = [
{'Name': 'A', 'qty':300,'price':30},
{'Name': 'B', 'qty':100,'price':10},
{'Name': 'C', 'qty':200,'price':20},
]
Using basic Python, this doesn't get a whole lot better. You could hack something together with itertools.groupby, but it'd be ugly and probably slower, certainly less clear.
As #9769953 suggested, though, Pandas is a good package to handle this sort of structured, tabular data.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame(lst)
Out[2]:
Name price qty
0 A 10 100
1 A 10 100
2 A 10 100
3 B 10 100
4 C 10 100
5 C 10 100
In [3]: df.groupby('Name').agg(sum)
Out[3]:
price qty
Name
A 30 300
B 10 100
C 20 200
You just need a little extra mojo if you don't want to keep the data as a dataframe:
In [4]: grouped = df.groupby('Name', as_index=False).agg(sum)
In [5]: list(grouped.T.to_dict().values())
Out[5]:
[{'Name': 'A', 'price': 30, 'qty': 300},
{'Name': 'B', 'price': 10, 'qty': 100},
{'Name': 'C', 'price': 20, 'qty': 200}]
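As a hedged aside, the same flip from frame to records can usually be done in one step with to_dict using the 'records' orientation, which should give the same list:
In [6]: grouped.to_dict(orient='records')
Out[6]:
[{'Name': 'A', 'price': 30, 'qty': 300},
 {'Name': 'B', 'price': 10, 'qty': 100},
 {'Name': 'C', 'price': 20, 'qty': 200}]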
On the verbose side, but gets the job done:
from pprint import pprint

group_lst = []
lst_of_names = []
for item in lst:
    qty_total = 0
    price_total = 0
    # Get names that have already been totalled
    lst_of_names = [item_get_name['Name'] for item_get_name in group_lst]
    if item['Name'] in lst_of_names:
        continue
    for item2 in lst:
        if item['Name'] == item2['Name']:
            qty_total += item2['qty']
            price_total += item2['price']
    group_lst.append(
        {
            'Name': item['Name'],
            'qty': qty_total,
            'price': price_total
        }
    )
pprint(group_lst)
Output:
[{'Name': 'A', 'price': 30, 'qty': 300},
{'Name': 'B', 'price': 10, 'qty': 100},
{'Name': 'C', 'price': 20, 'qty': 200}]
You can use defaultdict and Counter
>>> from collections import Counter, defaultdict
>>> cntr = defaultdict(Counter)
>>> for d in lst:
...     cntr[d['Name']].update(d)
...
>>> res = [dict(v, **{'Name':k}) for k,v in cntr.items()]
>>> pprint(res)
[{'Name': 'A', 'price': 30, 'qty': 300},
{'Name': 'C', 'price': 20, 'qty': 200},
{'Name': 'B', 'price': 10, 'qty': 100}]

how to write a list of dictionaries into a CSV with multiple values

I have a list of dictionaries in "my_list" as follows:
my_list = [{'Id': '100', 'A': [val1, val2], 'B': [val3, val4], 'C': [val5, val6]},
           {'Id': '200', 'A': [val7, val8], 'B': [val9, val10], 'C': [val11, val12]},
           {'Id': '300', 'A': [val13, val14], 'B': [val15, val16], 'C': [val17, val18]}]
I want to write this list into a CSV file as follows:
ID, A, AA, B, BB, C, CC
100, val1, val2, val3, val4, val5, val6
200, val7, val8, val9, val10, val11, val12
300, val13, val14, val15, val16, val17, val18
Does anyone know how I can handle this?
Tablib should do the trick.
I leave here the example from their front page (which you can adapt to the CSV format):
>>> data = tablib.Dataset(headers=['First Name', 'Last Name', 'Age'])
>>> for i in [('Kenneth', 'Reitz', 22), ('Bessie', 'Monke', 21)]:
... data.append(i)
>>> print(data.export('json'))
[{"Last Name": "Reitz", "First Name": "Kenneth", "Age": 22}, {"Last Name": "Monke", "First Name": "Bessie", "Age": 21}]
>>> print(data.export('yaml'))
- {Age: 22, First Name: Kenneth, Last Name: Reitz}
- {Age: 21, First Name: Bessie, Last Name: Monke}
>>> data.export('xlsx')
<censored binary data>
>>> data.export('df')
First Name Last Name Age
0 Kenneth Reitz 22
1 Bessie Monke 21
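For CSV specifically, the same Dataset can be exported with the 'csv' format; hedged, based on the generic export interface shown above, the output should look like:
>>> print(data.export('csv'))
First Name,Last Name,Age
Kenneth,Reitz,22
Bessie,Monke,21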
You could do this... (replacing print with a csv writerow as appropriate)
print(['ID', 'A', 'AA', 'B', 'BB', 'C', 'CC'])
for row in my_list:
    out_row = []
    out_row.append(row['Id'])
    for v in row['A']:
        out_row.append(v)
    for v in row['B']:
        out_row.append(v)
    for v in row['C']:
        out_row.append(v)
    print(out_row)
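A minimal sketch of that same loop writing an actual file with the csv module (assuming my_list as defined in the question):
import csv

with open('out.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['ID', 'A', 'AA', 'B', 'BB', 'C', 'CC'])
    for row in my_list:
        # Each of A, B, C holds a two-element list, so unpack them in order.
        writer.writerow([row['Id'], *row['A'], *row['B'], *row['C']])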
You can use pandas to do the trick:
import pandas as pd

my_list = [{'Id': '100', 'A': [val1, val2], 'B': [val3, val4], 'C': [val5, val6]},
           {'Id': '200', 'A': [val7, val8], 'B': [val9, val10], 'C': [val11, val12]},
           {'Id': '300', 'A': [val13, val14], 'B': [val15, val16], 'C': [val17, val18]}]
index = ['Id', 'A', 'AA', 'B', 'BB', 'C', 'CC']
df = pd.DataFrame(data=my_list)
for letter in ['A', 'B', 'C']:
    first = []
    second = []
    for a in df[letter].values.tolist():
        first.append(a[0])
        second.append(a[1])
    df[letter] = first          # first element keeps the original column name
    df[letter * 2] = second     # second element goes to 'AA', 'BB', 'CC'
df = df.reindex(columns=index)  # put the columns in the required order
df.to_csv('out.csv')
This produces the following output as dataframe:
Id A AA B BB C CC
0 100 1 2 3 4 5 6
1 200 7 8 9 10 11 12
2 300 13 14 15 16 17 18
and this is the out.csv-file:
,Id,A,AA,B,BB,C,CC
0,100,1,2,3,4,5,6
1,200,7,8,9,10,11,12
2,300,13,14,15,16,17,18
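A hedged, more compact variant builds the flat rows first and skips the column shuffling (and drops the index column from the CSV); a sketch under the same my_list assumptions:
rows = [{'Id': d['Id'],
         'A': d['A'][0], 'AA': d['A'][1],
         'B': d['B'][0], 'BB': d['B'][1],
         'C': d['C'][0], 'CC': d['C'][1]} for d in my_list]
pd.DataFrame(rows).to_csv('out.csv', index=False)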
See the pandas documentation for DataFrame.to_csv ("Write DataFrame to a comma-separated values (csv) file").

pandas dataframe convert values in array of objects

I want to convert the below pandas data frame
data = pd.DataFrame([[1,2], [5,6]], columns=['10+', '20+'], index=['A', 'B'])
data.index.name = 'City'
data.columns.name= 'Age Group'
print(data)
Age Group 10+ 20+
City
A 1 2
B 5 6
in to an array of dictionaries, like
[
{'Age Group': '10+', 'City': 'A', 'count': 1},
{'Age Group': '20+', 'City': 'A', 'count': 2},
{'Age Group': '10+', 'City': 'B', 'count': 5},
{'Age Group': '20+', 'City': 'B', 'count': 6}
]
I am able to get the above expected result using the following loops
result = []
cols_name = data.columns.name
index_names = data.index.name
for index in data.index:
    for col in data.columns:
        result.append({cols_name: col, index_names: index, 'count': data.loc[index, col]})
Is there any better way of doing this? Since my original data will have a large number of records, using for loops will take more time.
I think you can use stack with reset_index to reshape, and finally to_dict:
print (data.stack().reset_index(name='count'))
City Age Group count
0 A 10+ 1
1 A 20+ 2
2 B 10+ 5
3 B 20+ 6
print (data.stack().reset_index(name='count').to_dict(orient='records'))
[
{'Age Group': '10+', 'City': 'A', 'count': 1},
{'Age Group': '20+', 'City': 'A', 'count': 2},
{'Age Group': '10+', 'City': 'B', 'count': 5},
{'Age Group': '20+', 'City': 'B', 'count': 6}
]
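A hedged alternative with the same records (the order may differ) reshapes with melt instead of stack, reusing the index and column names already set on data:
# reset_index brings 'City' back as a column; melt turns the age-group
# columns into ('Age Group', 'count') pairs.
records = (data.reset_index()
               .melt(id_vars='City', var_name='Age Group', value_name='count')
               .to_dict(orient='records'))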
