Converting list to aggregated list - python

So I have a list like this:
date, type
29-5-2017, x
30-5-2017, x
31-5-2017, y
1-6-2017, z
2-6-2017, z
3-6-2017, y
28-5-2017, y
29-5-2017, z
30-5-2017, z
31-5-2017, y
1-6-2017, z
2-6-2017, z
3-6-2017, x
29-5-2017, x
30-5-2017, z
31-5-2017, z
1-6-2017, y
2-6-2017, x
3-6-2017, z
4-6-2017, y
How would I create an aggregated version of this list? So I get each date only once, and see how many of each type there are on a given date.
Like this:
date, no._of_x, no._of_y, no._of_z
28-5-2017, 0, 1, 0
29-5-2017, 2, 0, 1
30-5-2017, 1, 0, 2
31-5-2017, 0, 2, 1
1-6-2017, 0, 1, 2
2-6-2017, 1, 0, 2
3-6-2017, 1, 1, 1
4-6-2017, 0, 1, 0

Assuming that your list is a list of lists, with each sublist containing a date string and one of x, y or z, you can first group your list by date. Just create a dictionary, or collections.defaultdict, mapping dates to lists of x/y/z.
import collections

dates = collections.defaultdict(list)
for date, xyz in data:
    dates[date].append(xyz)
Next, you can create another dictionary, mapping each date to a Counter of its x/y/z list:
counts = {date: collections.Counter(xyz) for date, xyz in dates.items()}
Afterwards, counts is this:
{'2-6-2017,': Counter({'z': 2, 'x': 1}),
'29-5-2017,': Counter({'x': 2, 'z': 1}),
'3-6-2017,': Counter({'x': 1, 'y': 1, 'z': 1}),
'28-5-2017,': Counter({'y': 1}),
'1-6-2017,': Counter({'z': 2, 'y': 1}),
'4-6-2017,': Counter({'y': 1}),
'31-5-2017,': Counter({'y': 2, 'z': 1}),
'30-5-2017,': Counter({'z': 2, 'x': 1})}
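To get from these counts to the table asked for in the question, one possible sketch is shown below. It assumes data is a list of [date, type] pairs with clean date strings (no trailing commas) and sorts the dates chronologically; the names rows and kind are illustrative, not part of the original answer.

```python
import collections
from datetime import datetime

# Assumed input shape: a list of [date, type] pairs taken from the question.
data = [
    ["29-5-2017", "x"], ["30-5-2017", "x"], ["31-5-2017", "y"], ["1-6-2017", "z"],
    ["2-6-2017", "z"], ["3-6-2017", "y"], ["28-5-2017", "y"], ["29-5-2017", "z"],
    ["30-5-2017", "z"], ["31-5-2017", "y"], ["1-6-2017", "z"], ["2-6-2017", "z"],
    ["3-6-2017", "x"], ["29-5-2017", "x"], ["30-5-2017", "z"], ["31-5-2017", "z"],
    ["1-6-2017", "y"], ["2-6-2017", "x"], ["3-6-2017", "z"], ["4-6-2017", "y"],
]

# One Counter per date; a missing type simply counts as 0.
counts = collections.defaultdict(collections.Counter)
for date, kind in data:
    counts[date][kind] += 1

# Sort by the real date, not by the string, so 1-6-2017 comes after 31-5-2017.
ordered_dates = sorted(counts, key=lambda d: datetime.strptime(d, "%d-%m-%Y"))
rows = [(d, counts[d]["x"], counts[d]["y"], counts[d]["z"]) for d in ordered_dates]

print("date, no._of_x, no._of_y, no._of_z")
for row in rows:
    print(", ".join(map(str, row)))
```

Indexing a Counter with a key it has never seen returns 0, which is what fills the empty cells of the table.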

Related

Python, how to find patterns of different length and to sum the number of match

I have a list like this:
hg = [['A1'], ['A1b'], ['A1b1a1a2a1a~'], ['BT'], ['CF'], ['CT'], ['F'], ['GHIJK'], ['I'], ['I1a2a1a1d2a1a~'], ['I2'], ['I2~'], ['I2a'], ['I2a1'], ['I2a1a'], ['I2a1a2'], ['I2a1a2~'], ['IJ'], ['IJK'], ['L1a2']]
For example, if we look at ['A1'], ['A1b'] and ['A1b1a1a2a1a~']:
I want to count how many times the patterns 'A1', 'A1b' and 'A1b1a1a2a1a~' occur.
Basically, A1 appears 3 times (A1 itself, A1 in A1b and A1 in A1b1a1a2a1a~), A1b two times (A1b itself and A1b in A1b1a1a2a1a~) and A1b1a1a2a1a~ one time. Obviously, I want to do that for the entire list.
However, if the list contains, for example, E1b1a1, I don't want to count a match of A1 in E1b1a1.
So what I did is:
dic_test = {}
for i in hg:
    for j in hg:
        if ''.join(i) in ''.join(j):
            if ''.join(i) not in dic_test.keys():
                dic_test[''.join(i)] = 1
            else:
                dic_test[''.join(i)] += 1
print(dic_test)
Output:
{'A1': 3, 'A1b': 2, 'A1b1a1a2a1a~': 1, 'BT': 1, 'CF': 1, 'CT': 1, 'F': 2, 'GHIJK': 1, 'I': 12, 'I1a2a1a1d2a1a~': 1, 'I2': 7, 'I2~': 1, 'I2a': 5, 'I2a1': 4, 'I2a1a': 3, 'I2a1a2': 2, 'I2a1a2~': 1, 'IJ': 3, 'IJK': 2, 'L1a2': 1}
However, as explained above, there is one issue. For example, F should be equal to 1 and not 2. The reason is that, with the code above, I look for F anywhere in each string, so it also matches CF. But I don't know how to correct that!
There is a second thing that I don't know how to do:
Based on the output:
{'A1': 3, 'A1b': 2, 'A1b1a1a2a1a~': 1, 'BT': 1, 'CF': 1, 'CT': 1, 'F': 2, 'GHIJK': 1, 'I': 12, 'I1a2a1a1d2a1a~': 1, 'I2': 7, 'I2~': 1, 'I2a': 5, 'I2a1': 4, 'I2a1a': 3, 'I2a1a2': 2, 'I2a1a2~': 1, 'IJ': 3, 'IJK': 2, 'L1a2': 1}
I would like to sum the values of the dict based on shared patterns.
Example of the desired output:
{A1b1a1a2a1a~: 6, 'BT': 1,'CF': 1, 'CT': 1, 'F': 1, 'GHIJK': 1, 'I1a2a1a1d2a1a~': 13, I2a1a2:35, 'IJK': 5, 'IJK': 5}
For example, A1b1a1a2a1a~ = 6 because it is made up of A1, which has a value of 3, A1b, with a value of 2, and A1b1a1a2a1a~ itself, with a value of 1.
I don't know how to do that.
Any help will be much appreciated!
Thanks
'F' is counted twice because in matches a pattern anywhere in a string, so 'F' is found both in 'F' itself and in 'CF'.
You mentioned in the comments that the pattern should be at the beginning of the string, so in doesn't work here. You can use .startswith() for that.
I first created a dictionary from the items, sorted (that's important for your second question about summing the values). Every key starts with a value of 1. Then I iterated over the items and increased the value only if the two items are not at the same position, so that an item is not counted against itself.
For the second part of your question: because the keys are sorted, only earlier items can be prefixes of later ones. So I take the pairs with .popitem(), which returns the last pair (in Python 3.7 and above), and check each one against the previous ones until the dictionary is empty.
hg = [['A1'], ['A1b'], ['A1b1a1a2a1a~'], ['BT'], ['CF'], ['CT'], ['F'], ['GHIJK'], ['I'], ['I1a2a1a1d2a1a~'], ['I2'], ['I2~'], ['I2a'], ['I2a1'], ['I2a1a'], ['I2a1a2'], ['I2a1a2~'], ['IJ'], ['IJK'], ['L1a2']]
# create a sorted dictionary of all items, each with the value of 1.
d = dict.fromkeys((item[0] for item in sorted(hg)), 1)
for idx1, (k, v) in enumerate(d.items()):
    for idx2, item in enumerate(hg):
        if idx1 != idx2 and item[0].startswith(k):
            d[k] += 1
print(d)
print("-----------------------------------")
# last pair in `d`
k, v = d.popitem()
result = {k: v}
while d:
    # pop last pair in `d`
    k1, v1 = d.popitem()
    # get last pair in `result`
    k2, v2 = next(reversed(result.items()))
    if k2.startswith(k1):
        result[k2] += v1
    else:
        result[k1] = v1
print({k: result[k] for k in reversed(result)})
output:
{'A1': 3, 'A1b': 2, 'A1b1a1a2a1a~': 1, 'BT': 1, 'CF': 1, 'CT': 1, 'F': 1, 'GHIJK': 1, 'I': 11, 'I1a2a1a1d2a1a~': 1, 'I2': 7, 'I2a': 6, 'I2a1': 5, 'I2a1a': 4, 'I2a1a2': 3, 'I2a1a2~': 2, 'I2~': 2, 'IJ': 2, 'IJK': 1, 'L1a2': 1}
-----------------------------------
{'A1b1a1a2a1a~': 6, 'BT': 1, 'CF': 1, 'CT': 1, 'F': 1, 'GHIJK': 1, 'I1a2a1a1d2a1a~': 12, 'I2a1a2~': 27, 'I2~': 2, 'IJK': 3, 'L1a2': 1}
I think you made a mistake in your expected result and it should look like this, but let me know if mine is wrong.
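A possible variation, offered as a sketch rather than as the answer's own method: the check idx1 != idx2 only skips an item's own entry when hg is in the same order as sorted(hg). In this data 'I2~' sorts after 'I2a1a2~' (because '~' comes after the lowercase letters in ASCII), so the two orders drift apart from 'I2a' onwards. Counting prefix matches over the sorted keys directly avoids the index comparison altogether. The names keys and counts are illustrative.

```python
hg = [['A1'], ['A1b'], ['A1b1a1a2a1a~'], ['BT'], ['CF'], ['CT'], ['F'], ['GHIJK'],
      ['I'], ['I1a2a1a1d2a1a~'], ['I2'], ['I2~'], ['I2a'], ['I2a1'], ['I2a1a'],
      ['I2a1a2'], ['I2a1a2~'], ['IJ'], ['IJK'], ['L1a2']]

# Flatten the one-element sublists and sort the strings.
keys = sorted(item[0] for item in hg)

# For every key, count the strings that start with it (the key itself included).
counts = {k: sum(other.startswith(k) for other in keys) for k in keys}

print(counts)
```

With this version 'F' comes out as 1, 'I2a' as 5 and 'I2~' as 1, which agrees with the question's own output for the keys that were not affected by the substring problem.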
@S.B helped me to better understand what I wanted to do, so I made some modifications to the second part of the script.
I converted the dictionary "d" (renamed "hg_d") into a list of lists:
hg_d_to_list = list(map(list, hg_d.items()))
Then, I created a dictionary whose keys are the words and whose values are the lists of shorter words that each key starts with, found with startswith():
from collections import defaultdict

nested_HGs = defaultdict(list)
for i in range(len(hg_d_to_list)):
    for j in range(i+1, len(hg_d_to_list)):
        if hg_d_to_list[j][0].startswith(hg_d_to_list[i][0]):
            nested_HGs[hg_d_to_list[j][0]].append(hg_d_to_list[i][0])
nested_HGs
defaultdict(<class 'list'>, {'A1b': ['A1'], 'A1b1a1a2a1a': ['A1', 'A1b'], 'I1a2a1a1d2a1a~': ['I'], 'I2': ['I'], 'I2a': ['I', 'I2'], 'I2a1': ['I', 'I2', 'I2a'], 'I2a1a': ['I', 'I2', 'I2a', 'I2a1'], 'I2a1a2': ['I', 'I2', 'I2a', 'I2a1', 'I2a1a'], 'I2a1a2~': ['I', 'I2', 'I2a', 'I2a1', 'I2a1a', 'I2a1a2'], 'I2~': ['I', 'I2'], 'IJ': ['I'], 'IJK': ['I', 'IJ']})
Then, for each key of "nested_HGs", I sum its own value and the values of its associated words, all taken from the dictionary "hg_d":
HGs_score = {}
for key, val in hg_d.items():
    for key2, val2 in nested_HGs.items():
        if key in val2 or key in key2:
            if key2 not in HGs_score.keys():
                HGs_score[key2] = val
            else:
                HGs_score[key2] += val
HGs_score
{'A1b': 5, 'A1b1a1a2a1a': 6, 'I1a2a1a1d2a1a~': 12, 'I2': 18, 'I2a': 24, 'I2a1': 29, 'I2a1a': 33, 'I2a1a2': 36, 'I2a1a2~': 38, 'I2~': 20, 'IJ': 13, 'IJK': 14}
At this point, I realized that I don't care about the keys with a value of 1.
To finish, I get the key of the dictionary that has the highest value:
final_HG_classification = max(HGs_score, key=HGs_score.get)
This gives final_HG_classification = 'I2a1a2~'.
It looks like it's working! Any suggestions or improvements are more than welcome.
Thanks in advance.
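One possible simplification, as a sketch only: the two steps above (collecting prefixes, then summing) can be folded into a single loop. It assumes a counts dictionary like hg_d; the small subset below uses the values for the A1 family from the question. Note that it tests prefixes with startswith() throughout, whereas the code above uses key in key2, which is a substring test.

```python
# Counts for a small subset of the keys (values as given in the question).
counts = {"A1": 3, "A1b": 2, "A1b1a1a2a1a~": 1, "BT": 1}

scores = {}
for key in counts:
    # Every known key that is a prefix of this key, the key itself included.
    prefixes = [other for other in counts if key.startswith(other)]
    # Keep only keys that have at least one shorter prefix, as nested_HGs does.
    if len(prefixes) > 1:
        scores[key] = sum(counts[p] for p in prefixes)

# Key with the highest accumulated score.
best = max(scores, key=scores.get)
print(scores)
print(best)
```

For this subset the scores are 5 for 'A1b' and 6 for 'A1b1a1a2a1a~', the same values as in HGs_score above.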

Set tuple pair of (x, y) coordinates into dict as key with id value

The data looks like this:
d = {'location_id': [1, 2, 3, 4, 5], 'x': [47.43715, 48.213889, 46.631111, 46.551111, 47.356628], 'y': [11.880689, 14.274444, 14.371, 13.665556, 11.705181]}
df = pd.DataFrame(data=d)
print(df)
   location_id          x          y
0            1   47.43715  11.880689
1            2  48.213889  14.274444
2            3  46.631111     14.371
3            4  46.551111  13.665556
4            5  47.356628  11.705181
Expected output:
{(47.43715, 11.880689): 1, (48.213889, 14.274444): 2, (46.631111, 14.371): 3, ...}
So I can simply look up an ID by providing the point coordinates.
What I have tried:
dict(zip(df['x'].astype('float'), df['y'].astype('float'), zip(df['location_id'])))
Error: ValueError: dictionary update sequence element #0 has length 3; 2 is required
or
dict(zip(tuple(df['x'].astype('float'), df['y'].astype('float')), zip(df['location_id'])))
TypeError: tuple expected at most 1 arguments, got 2
I have Googled for it a while, but I am not very clear about it. Thank you for any assistance.
I think this
result = dict(zip(zip(df['x'], df['y']), df['location_id']))
should give you what you want? Result:
{(47.43715, 11.880689): 1,
(48.213889, 14.274444): 2,
(46.631111, 14.371): 3,
(46.551111, 13.665556): 4,
(47.356628, 11.705181): 5}
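To see why the nested zip works where the attempts in the question fail, here is a small sketch with plain lists standing in for the DataFrame columns (the names xs, ys and ids are illustrative). dict() needs a sequence of 2-item (key, value) pairs; passing three iterables to a single zip yields 3-item tuples instead.

```python
xs = [47.43715, 48.213889, 46.631111]
ys = [11.880689, 14.274444, 14.371]
ids = [1, 2, 3]

# One zip over three iterables gives 3-tuples, which dict() cannot consume.
triples = list(zip(xs, ys, ids))

# The inner zip builds the (x, y) keys; the outer zip pairs each key with its id.
pairs = list(zip(zip(xs, ys), ids))
result = dict(pairs)

print(triples[0])
print(pairs[0])
print(result[(46.631111, 14.371)])
```

Because the keys are floats, a lookup only succeeds with exactly the same float values that were stored.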
I didn't use a dataframe, is this what you wanted?
my_dict = {}
d = {'location_id': [1, 2, 3, 4, 5], 'x': [47.43715, 48.213889, 46.631111, 46.551111, 47.356628], 'y': [11.880689, 14.274444, 14.371, 13.665556, 11.705181]}
for i in range(len(d['location_id'])):
    my_dict[(d['x'][i], d['y'][i])] = d['location_id'][i]
You can set the x and y columns as the index, then export the location_id column to a dictionary:
d = df.set_index(['x', 'y'])['location_id'].to_dict()
print(d)
{(47.43715, 11.880689): 1, (48.213889, 14.274444): 2, (46.631111, 14.371): 3, (46.551111, 13.665556): 4, (47.356628, 11.705181): 5}

Return dictionary list in for loop statement

So, I have this dataframe, and I need to convert its categorical columns to ordinal/numerical values.
Processing one column at a time looks like this:
labels = df_main_correlation['job_level'].astype('category').cat.categories.tolist()
replace_map_comp = {'job_level' : {k: v for k,v in zip(labels,list(range(1,len(labels)+1)))}}
print(replace_map_comp)
It will return
{'job_level': {'JG03': 1, 'JG04': 2, 'JG05': 3, 'JG06': 4}}
But you can do this with a for loop in order to process all the columns, right?
I tried this:
columns_categorical = list(df_main_correlation.select_dtypes(['object']).columns)  # take the columns I want to process
replace_map_comp_list = []
for i, column in enumerate(columns_categorical):
    labels = df_main_correlation[column].astype('category').cat.categories.tolist()
    replace_map_comp = {column : {k: v for k,v in zip(labels,list(range(1,len(labels)+1)))}}  # Return dictionary
    print(replace_map_comp)
    replace_map_comp_list.append(replace_map_comp[i])
replace_map_comp_list
But it only returns
{'job_level': {'JG03': 1, 'JG04': 2, 'JG05': 3, 'JG06': 4}}
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-202-acc2ad8defaa> in <module>()
      8     #df_main_correlation.replace(replace_map_comp, inplace=True)
      9
---> 10     replace_map_comp_list.append(replace_map_comp[i])
     11 replace_map_comp_list

KeyError: 0
My expected result would be
{'job_level': {'JG03': 1, 'JG04': 2, 'JG05': 3, 'JG06': 4}}
{'person_level': {'PG01': 1, 'PG02': 2, 'PG03': 3, 'PG04': 4, 'PG05': 5, 'PG06': 6, 'PG07': 7, 'PG08': 8}}
{'Employee_type': {'RM_type_A': 1, 'RM_type_B': 2, 'RM_type_C': 3}}
Any advice?
Consider df:
In [1543]: df
Out[1543]:
  job_level person_level Employee_type
0      JG05         PG06     RM_type_A
1      JG04         PG04     RM_type_A
2      JG04         PG05     RM_type_B
3      JG03         PG03     RM_type_C
Use collections.Counter with Dictionary Comprehension:
In [1539]: from collections import Counter
In [1537]: x = df.to_dict('list')
In [1544]: res = {k: Counter(v) for k,v in x.items()}
In [1545]: res
Out[1545]:
{'job_level': Counter({'JG05': 1, 'JG04': 2, 'JG03': 1}),
'person_level': Counter({'PG06': 1, 'PG04': 1, 'PG05': 1, 'PG03': 1}),
'Employee_type': Counter({'RM_type_A': 2, 'RM_type_B': 1, 'RM_type_C': 1})}
Counter is itself a dict subclass, so each value can be used like a regular dict.
try this, not sure
replace_map_comp_list.append(replace_map_comp['job_level'][column])
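For the KeyError itself: replace_map_comp has a single key, the column name, so indexing it with the integer i raises KeyError: 0 on the first pass. A minimal sketch of the loop with that line changed to append the whole dictionary is below. It uses plain lists (taken from the example df above) in place of the DataFrame, and sorted(set(values)) as an assumed stand-in for .cat.categories.tolist().

```python
# Plain lists standing in for the DataFrame columns (values from the example df).
columns = {
    "job_level": ["JG05", "JG04", "JG04", "JG03"],
    "person_level": ["PG06", "PG04", "PG05", "PG03"],
    "Employee_type": ["RM_type_A", "RM_type_A", "RM_type_B", "RM_type_C"],
}

replace_map_comp_list = []
for column, values in columns.items():
    # Sorted unique labels, numbered from 1.
    labels = sorted(set(values))
    replace_map_comp = {column: {label: rank for rank, label in enumerate(labels, start=1)}}
    # Append the dictionary itself; it is keyed by column name, not by position.
    replace_map_comp_list.append(replace_map_comp)

for mapping in replace_map_comp_list:
    print(mapping)
```

In the original loop the same change would be replace_map_comp_list.append(replace_map_comp), after which the enumerate index i is no longer needed.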

Count of rows linked by ids in Pandas dataframe

I have a table of ids and previous ids (see image 1). I want to count the total number of unique ids linked in one chain. E.g., if we take the latest id as the 'parent', the result for the example data below would be something like image 2, where 'a' is linked to 5 ids in total (a, b, c, d & e) and 'w' is linked to 4 ids (w, x, y & z). In practice, I am dealing with randomly generated ids, not sequenced letters.
Python Code to produce example dataframes:
import pandas as pd
raw_data = pd.DataFrame([['a','b'], ['b','c'], ['c', 'd'],['d','e'],['e','-'],
                         ['w','x'], ['x', 'y'], ['y','z'], ['z','-']], columns=['id', 'previous_id'])
output = pd.DataFrame([['a',5],['w',4]], columns = ['parent_id','linked_ids'])
First use convert_matrix.from_pandas_edgelist with connected_components, then create a dictionary that maps each id to the number of its component. Keep the first id of each group by filtering with Series.map and Series.duplicated, and finally add the new column with Series.map, using a Counter to build the mapp dictionary:
df1 = raw_data[raw_data['previous_id'].ne('-')]
import networkx as nx
from collections import Counter
g = nx.from_pandas_edgelist(df1,'id','previous_id')
connected_components = nx.connected_components(g)
d = {y:i for i, x in enumerate(connected_components) for y in x}
print (d)
{'c': 0, 'e': 0, 'b': 0, 'd': 0, 'a': 0, 'y': 1, 'x': 1, 'w': 1, 'z': 1}
c = Counter(d.values())
mapp = {k: c[v] for k, v in d.items()}
print (mapp)
{'c': 5, 'e': 5, 'b': 5, 'd': 5, 'a': 5, 'y': 4, 'x': 4, 'w': 4, 'z': 4}
df = (raw_data.loc[~raw_data['id'].map(d).duplicated(), ['id']]
              .rename(columns={'id':'parent_id'})
              .assign(linked_ids = lambda x: x['parent_id'].map(mapp)))
print (df)
  parent_id  linked_ids
0         a           5
5         w           4
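If networkx is not available, the same counts can be sketched in plain Python by following the previous_id links. This is an alternative technique, not the approach above, and it assumes every chain is linear, has no cycles, and ends in '-'. The names previous, referenced and linked are illustrative.

```python
# (id, previous_id) pairs from the example data.
pairs = [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e'), ('e', '-'),
         ('w', 'x'), ('x', 'y'), ('y', 'z'), ('z', '-')]

# Map each id to its previous id, skipping the '-' end markers.
previous = {id_: prev for id_, prev in pairs if prev != '-'}

# A parent is an id that never appears as someone else's previous_id.
referenced = set(previous.values())
parents = [id_ for id_, _ in pairs if id_ not in referenced]

linked = {}
for parent in parents:
    count, current = 1, parent
    # Walk down the chain until an id has no previous id.
    while current in previous:
        current = previous[current]
        count += 1
    linked[parent] = count

print(linked)
```

Unlike connected_components, this walk would not terminate on cyclic data, so it only fits inputs shaped like the example.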

Simpler method for performing replacement on multiple columns at a time?

import pandas as pd
w=pd.read_csv('w.csv')
Takes sections of a CSV to add them up. Two columns require numerical conversion
w["Social Media Use Score"]=w.iloc[:,[6,7,8,9,10,11,12,13,14,15,16]].sum(axis=1)
Switches Yes or No in this section to 1 or 0 and adds them up; the other section switches A/B/C/D to 1/2/3/4 and sums them.
w['Q1'],w['Q3'],w['Q6'] = w['Q1'].map({'No': 1, 'Yes': 0}),\
                          w['Q3'].map({'No': 1, 'Yes': 0}),\
                          w['Q6'].map({'No': 1, 'Yes': 0})
w['Q2'],w['Q4'],w['Q5'],w['Q7'],w['Q8'],w['Q9'],w['Q10']=\
    w['Q2'].map({'Yes': 1, 'No': 0}),\
    w['Q4'].map({'Yes': 1, 'No': 0}),\
    w['Q5'].map({'Yes': 1, 'No': 0}),\
    w['Q7'].map({'Yes': 1, 'No': 0}),\
    w['Q8'].map({'Yes': 1, 'No': 0}),\
    w['Q9'].map({'Yes': 1, 'No': 0}),\
    w['Q10'].map({'Yes': 1, 'No': 0})
w["Anxiety Score"]=w.iloc[:,[17,18,19,20,21,22,23,24,25,26]].sum(axis=1)
w['d1'],w['d2'],w['d3'],w['d4'],w['d5'],w['d6'],w['d7'],w['d8'],w['d9'],w['d10']=\
    w['d1'].map({'A': 1, 'B': 2,'C':3,'D':4}),\
    w['d2'].map({'A': 1, 'B': 2,'C':3,'D':4}),\
    w['d3'].map({'A': 1, 'B': 2,'C':3,'D':4}),\
    w['d4'].map({'A': 1, 'B': 2,'C':3,'D':4}),\
    w['d5'].map({'A': 1, 'B': 2,'C':3,'D':4}),\
    w['d6'].map({'A': 1, 'B': 2,'C':3,'D':4}),\
    w['d7'].map({'A': 1, 'B': 2,'C':3,'D':4}),\
    w['d8'].map({'A': 1, 'B': 2,'C':3,'D':4}),\
    w['d9'].map({'A': 1, 'B': 2,'C':3,'D':4}),\
    w['d10'].map({'A': 1, 'B': 2,'C':3,'D':4})
w['Depression Score']=w.iloc[:,[27,28,29,30,31,32,33,34,35,36]].sum(axis=1)
w.to_csv("foranal.csv")
If you want to perform replacement on multiple columns simultaneously, you should use df.replace (it is slower than map, so use it only if you can afford to).
# Mapping for replacement.
repl_dict = {'A':1, 'B':2,'C':3, 'D':4}
repl_dict.update({'Yes':1, 'No':0})
# Generate the list of columns to perform replace on.
cols = [f'{x}{y}' for x in ('Q','d') for y in range(1, 11)]
w[cols] = w[cols].replace(repl_dict)
# Fix values for special columns.
w.loc[:, ['Q1', 'Q3', 'Q6']] = 1 - w.loc[:, ['Q1', 'Q3', 'Q6']]
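To make the logic of the replacement and the final flip concrete, here is a plain-Python sketch of what happens to individual cell values. It is only an illustration: the row dictionary and the names reversed_cols and converted are made up for the example and are not part of the pandas code.

```python
# Same mapping and column list as in the pandas version.
repl_dict = {'A': 1, 'B': 2, 'C': 3, 'D': 4}
repl_dict.update({'Yes': 1, 'No': 0})
cols = [f'{x}{y}' for x in ('Q', 'd') for y in range(1, 11)]

# Hypothetical values for three of the columns in one CSV row.
row = {'Q1': 'No', 'Q2': 'Yes', 'd1': 'C'}

# Columns where 'No' should score 1 and 'Yes' should score 0.
reversed_cols = {'Q1', 'Q3', 'Q6'}

converted = {}
for col, value in row.items():
    number = repl_dict[value]
    # 1 - number swaps 0 and 1, which is what the "special columns" line does.
    if col in reversed_cols:
        number = 1 - number
    converted[col] = number

print(cols)
print(converted)
```

The list comprehension produces Q1 to Q10 followed by d1 to d10, and the flip reproduces the question's {'No': 1, 'Yes': 0} mapping for Q1, Q3 and Q6.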
"Social Media Use Score" and "Anxiety Score" are fine.
