Remove multiple dictionary values from a list using Python

My data is as follows:
dd = [{'id': 'aa', 'age': 22, 'data': {}, 'background': {}},
      {'id': 'bb', 'age': 23, 'data': {}, 'background': {}},
      {'id': 'cc', 'age': 24, 'data': {}, 'background': {}},
      {'id': 'dd', 'age': 25, 'data': {}, 'background': {}},
      {'id': 'ee', 'age': 26, 'data': {}, 'background': {}}]
How do I remove several entries based on their id? I have almost 100 entries that need to be removed.
For example:
id = ' aa bb cc '

Use a list comprehension to filter out the entries you do not want. Note, however, that you should not use the name id, because it shadows the Python built-in id():
dd = [item for item in dd if item['id'] not in id]
Also be aware that membership testing against a string matches substrings; see the set-based sketch below for a safer variant.

You can also use filter with a lambda function here:
dd = list(filter(lambda x: x["id"] not in id, dd))
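If the ids arrive as one space-separated string, as in the question, a small sketch (variable name ids_to_remove is mine) that splits the string into a set first avoids accidental substring matches and gives constant-time lookups:
ids_to_remove = set(' aa bb cc '.split())  # {'aa', 'bb', 'cc'}
dd = [item for item in dd if item['id'] not in ids_to_remove]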

Related

How can one sort a list of dictionaries into an Excel sheet such that each unique key and its corresponding items are placed into columns?

I have a list of dictionaries organized as such:
listofdictionaries = [{'key1': (A, B, C), 'key2': [1, 2, 3]},
                      {'key1': (AA, BB, CC), 'key2': [1, 2, 3]},
                      {'key1': (AAA, BBB, CCC), 'key2': [4, 5, 6]}]
This list's first and second items have an equivalent value for key2. The third item has a different value for key2. Still using Python, I want the columns organized as follows:
Group 1   | Group 1 Items | Group 2   | Group 2 Items
[1, 2, 3] | (A, B, C)     | [4, 5, 6] | (AAA, BBB, CCC)
          | (AA, BB, CC)  |           |
In addition I would like the output to be a .csv file.
With pandas, you can use something like this function:
import pandas

def groupItems(dictList, itemsFrom, groupBy, saveTo=None):
    ik, gk, colsDict = itemsFrom, groupBy, {}
    # map the string form of each group value to the value itself
    # (lists are unhashable, so their str() form is used as the dict key)
    groups = {str(d.get(gk)): d.get(gk) for d in dictList}
    itemsList = [[d.get(ik) for d in dictList if str(d.get(gk)) == g]
                 for g in groups]
    maxRows = max(len(li) for li in itemsList) if groups else 0
    for gi, (g, li) in enumerate(zip(groups.keys(), itemsList), 1):
        colsDict[f'Group {gi}'] = [groups[g]] + [None]*(maxRows-1)
        colsDict[f'Group {gi} Items'] = li + [None]*(maxRows-len(li))
    rdf = pandas.DataFrame(colsDict)
    if saveTo and isinstance(saveTo, str):
        print('Saving', maxRows, 'rows for', len(groups), 'groups to', saveTo)
        rdf.to_csv(saveTo, index=False)
    return rdf
Calling groupItems(listofdictionaries, 'key1', 'key2', 'x.csv') will save the resulting DataFrame to x.csv. (The original answer showed screenshots of the DataFrame and of the saved file here, demonstrating that the brackets around the tuples and lists were not lost.)
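For reference, a minimal runnable example using the function above (assuming the placeholder tuples hold strings, since A, B, C are not defined in the question):
listofdictionaries = [{'key1': ('A', 'B', 'C'), 'key2': [1, 2, 3]},
                      {'key1': ('AA', 'BB', 'CC'), 'key2': [1, 2, 3]},
                      {'key1': ('AAA', 'BBB', 'CCC'), 'key2': [4, 5, 6]}]
rdf = groupItems(listofdictionaries, 'key1', 'key2', 'x.csv')
print(rdf)  # two columns per group: 'Group i' (the key2 value) and 'Group i Items' (the key1 tuples)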
You could also get the group value stacked above its items in a single column if you change the function to:
def groupItems(dictList, itemsFrom, groupBy, saveTo=None):
    ik, gk = itemsFrom, groupBy
    groups = {str(d.get(gk)): d.get(gk) for d in dictList}
    itemsList = [[d.get(ik) for d in dictList if str(d.get(gk)) == g]
                 for g in groups]
    maxRows = max(len(li) for li in itemsList) if groups else 0
    # build each column as: group value, blank cell, items header, then the items
    colsDict = {f'Group {gi}': [groups[g]] + [None] + (
                    [f'Group {gi} Items'] + li + [None]*(maxRows-len(li))
                ) for gi, (g, li) in enumerate(zip(groups.keys(), itemsList), 1)}
    rdf = pandas.DataFrame(colsDict)
    if saveTo and isinstance(saveTo, str):
        print('Saving', maxRows+4, 'rows for', len(groups), 'groups to', saveTo)
        rdf.to_csv(saveTo, index=False)
    return rdf
NOTE: Saving as CSV will stringify all non-numeric cells. If you want to preserve the nested structure, I suggest saving as JSON instead.
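For example, a minimal sketch of the JSON route (the file name x.json is just an illustration):
rdf = groupItems(listofdictionaries, 'key1', 'key2')
rdf.to_json('x.json', orient='records')  # tuples and lists survive as JSON arrays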
Here is a solution using an Excel formula. There is probably a shorter way using FILTERXML, but it is not available in Excel for the web, which is the version I use. In cell B3, you can use the following formula:
=LET(header, {"Group","Name"}, clean, SUBSTITUTE(TEXTSPLIT(B1,
{"[{","'key1': ","'key2':","}]","}, {"},,1),"),",")"),
split, WRAPROWS(clean,2), gps,INDEX(split,,2),names, INDEX(split,,1),
gpsUx, UNIQUE(gps), out, DROP(REDUCE("", gpsUx, LAMBDA(ac,x,
HSTACK(ac, VSTACK(header, HSTACK(x, FILTER(names, gps=x)))))),,1),
IFERROR(out,""))
Here is the output: (the original answer included a screenshot of the result)
An alternative solution uses a recursive function to do all the replacements. Please check my answer to the question "Is it possible to convert words to numbers in a string without VBA?"; I use the same function, MultiReplace, here:
= LAMBDA(text, old, new, IF(MIN(LEN(old))=0, text,
MultiReplace(SUBSTITUTE(text, INDEX(old,1), INDEX(new,1)),
IFERROR(DROP(old,1),""), IFERROR(DROP(new,1),""))))
where old and new are n×1 arrays and n is the number of substitutions.
The above function needs to be defined in the Name Manager since it is recursive. Now we use it in a formula to produce the requested output in cell B3:
=LET(header, {"Group", "Name"}, clean, MultiReplace(B1,
{"{'key1': " ; ", 'key2':" ; "}" ; "[(";"]]";"], ("},
{"" ; "-" ; "" ; "(" ; "]" ; "] & ("}),
split, TEXTSPLIT(clean,"-"," & "), gps, INDEX(split,,2), names, INDEX(split,,1),
gpsUx, UNIQUE(gps), out, DROP(REDUCE("", gpsUx, LAMBDA(ac,x,
HSTACK(ac, VSTACK(header, HSTACK(x, FILTER(names, gps=x)))))),,1),
IFERROR(out,""))
You can check the output of each name variable (clean, split) to see the intermediate results. Once we have the information in array format (split), we apply the DROP/REDUCE/HSTACK pattern. Check my answer to the question "how to transform a table in Excel from vertical to horizontal but with different length" for more details.
Both solutions work for multiple groups, not just the two groups in the input sample. For each unique key1 group, they generate the Group and its corresponding Name columns and concatenate the results horizontally via HSTACK.

Nested dictionary with key: list[key:value] pairs to dataframe

I'm currently struggling to create a dataframe from a dictionary that is nested like {key1: [{key: value}, {key: value}, ...], key2: [{key: value}, {key: value}, ...]}.
I want this to go into a dataframe where the values of key1 and key2 become the index, while the nested key:value pairs become the columns and record values.
The list of key:value pairs can differ in size for each of key1, key2, etc. Example data:
some_dict = {'0000297386FB11E2A2730050568F1BAB': [{'FILE_ID': '0000297386FB11E2A2730050568F1BAB'},
{'FileTime': '1362642335'},
{'Size': '1016439'},
{'DocType_Code': 'AF3BD580734A77068DD083389AD7FDAF'},
{'Filenr': 'F682B798EC9481FF031C4C12865AEB9A'},
{'DateRegistered': 'FAC4F7F9C3217645C518D5AE473DCB1E'},
{'TITLE': '2096158F036B0F8ACF6F766A9B61A58B'}],
'000031EA51DA11E397D30050568F1BAB': [{'FILE_ID': '000031EA51DA11E397D30050568F1BAB'},
{'FileTime': '1384948248'},
{'Size': '873514'},
{'DatePosted': '7C6BCB90AC45DA1ED6D1C376FC300E7B'},
{'DocType_Code': '28F404E9F3C394518AF2FD6A043D3A81'},
{'Filenr': '13A6A062672A88DE75C4D35917F3C415'},
{'DateRegistered': '8DD4262899F20DE45F09F22B3107B026'},
{'Comment': 'AE207D73C9DDB76E1EEAA9241VJGN02'},
{'TITLE': 'DF96336A6FE08E34C5A94F6A828B4B62'}]}
The final result should look like this:
Index | File_ID | ... | DatePosted | ... | Comment | Title
0000297386FB11E2A2730050568F1BAB|0000297386FB11E2A2730050568F1BAB|...|NaN|...|NaN|2096158F036B0F8ACF6F766A9B61A58B
000031EA51DA11E397D30050568F1BAB|000031EA51DA11E397D30050568F1BAB|...|7C6BCB90AC45DA1ED6D1C376FC300E7B|...|AE207D73C9DDB76E1EEAA9241VJGN02|DF96336A6FE08E34C5A94F6A828B4B62
I've tried parsing the dict directly to pandas using a comprehension, as suggested in "Creating dataframe from a dictionary where entries have different lengths", and flattening the dict further before passing it to pandas, as in "Flatten nested dictionaries, compressing keys", both to no avail.
Here you go.
You do not need the key of the outer dict, because it is also available in the nested dicts.
Then you need to merge the multiple dicts into a single one; I did that with update.
Then we turn each merged dict into a pd.Series,
and concat the series into a dataframe.
In [39]: seriess = []
    ...: for values in some_dict.values():
    ...:     d = {}
    ...:     for thing in values:
    ...:         d.update(thing)
    ...:     s = pd.Series(d)
    ...:     seriess.append(s)
    ...:

In [40]: pd.concat(seriess, axis=1).T
Out[40]:
FILE_ID FileTime Size ... TITLE DatePosted Comment
0 0000297386FB11E2A2730050568F1BAB 1362642335 1016439 ... 2096158F036B0F8ACF6F766A9B61A58B NaN NaN
1 000031EA51DA11E397D30050568F1BAB 1384948248 873514 ... DF96336A6FE08E34C5A94F6A828B4B62 7C6BCB90AC45DA1ED6D1C376FC300E7B AE207D73C9DDB76E1EEAA9241VJGN02
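The same flattening can also be written as a dict comprehension (a sketch, assuming pandas is imported as pd), which additionally keeps the outer keys as the index, as the question requested:
import pandas as pd

flat = {k: {kk: vv for d in lst for kk, vv in d.items()}
        for k, lst in some_dict.items()}
df = pd.DataFrame.from_dict(flat, orient='index')  # index = outer keys, NaN where a key is missing
print(df)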
Let's try the following code (DataFrame.append was removed in pandas 2.0, so pd.concat is used here instead):
dfs = []
for k in some_dict.keys():
    dfs.append(pd.DataFrame.from_records(some_dict[k]))
new_df = pd.concat(dfs, ignore_index=True)
final_result = (new_df
                .groupby(new_df['FILE_ID'].notna().cumsum())
                .first())
Output
FILE_ID FileTime Size DocType_Code Filenr DateRegistered TITLE DatePosted Comment
FILE_ID
1 0000297386FB11E2A2730050568F1BAB 1362642335 1016439 AF3BD580734A77068DD083389AD7FDAF F682B798EC9481FF031C4C12865AEB9A FAC4F7F9C3217645C518D5AE473DCB1E 2096158F036B0F8ACF6F766A9B61A58B None None
2 000031EA51DA11E397D30050568F1BAB 1384948248 873514 28F404E9F3C394518AF2FD6A043D3A81 13A6A062672A88DE75C4D35917F3C415 8DD4262899F20DE45F09F22B3107B026 DF96336A6FE08E34C5A94F6A828B4B62 7C6BCB90AC45DA1ED6D1C376FC300E7B AE207D73C9DDB76E1EEAA9241VJGN02

Dataframe to dictionary, values came out scrambled

I have a dataframe that contains two columns that I would like to convert into a dictionary to use as a map.
I have tried multiple ways of converting, but my dictionary values always come out in the wrong order.
My python version is 3 and Pandas version is 0.24.2.
This is what the first few rows of my dataframe looks like:
geozip.head()
Out[30]:
Geoid ZIP
0 100100 36276
1 100124 36310
2 100460 35005
3 100460 35062
4 100460 35214
I would like my dictionary to look like this:
{100100: 36276,
100124: 36310,
100460: 35005,
100460: 35062,
100460: 35214,...}
But instead my outputs came up with the wrong order for the values.
{100100: 98520,
100124: 36310,
100460: 57520,
100484: 35540,
100676: 19018,
100820: 57311,
100988: 15483,
101132: 36861,...}
I tried this first but the dictionary came out unordered:
geozipmap = geozip.set_index('Geoid')['ZIP'].to_dict()
Then I tried converting the two columns into lists first and then converting them to a dictionary, but the same problem occurred:
geoid = geozip.Geoid.tolist()
zipcode = geozip.ZIP.tolist()
geozipmap = dict(zip(geoid, zipcode))
I tried converting to OrderedDict and that didn't work either.
Then I've tried:
geozipmap = {k: v for k, v in zip(geoid, zipcode)}
I've also tried:
geozipmap = {}
for index, g in enumerate(geoid):
    geozipmap[geoid[index]] = zipcode[index]
I've also tried the answers suggested in "panda dataframe to ordered dictionary".
None of these work, and I'm really not sure what is going on.
Try defaultdict; if the same key has multiple values, you can collect them in a list. Note that a Python dict cannot hold duplicate keys, so the expected output above ({100460: 35005, 100460: 35062, ...}) is not possible: each new assignment to 100460 overwrites the previous value.
from collections import defaultdict
import pandas as pd

df = pd.DataFrame(data={"Geoid": [100100, 100124, 100460, 100460, 100460],
                        "ZIP": [36276, 36310, 35005, 35062, 35214]})
data_dict = defaultdict(list)
for k, v in zip(df['Geoid'], df['ZIP']):
    data_dict[k].append(v)
print(data_dict)
defaultdict(<class 'list'>, {100100: [36276], 100124: [36310], 100460: [35005, 35062, 35214]})
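If a single value per key is acceptable, note that a plain dict simply keeps the last ZIP seen for each duplicate Geoid (and insertion order is preserved since Python 3.7), as this sketch shows:
geozipmap = dict(zip(df['Geoid'], df['ZIP']))
print(geozipmap)  # {100100: 36276, 100124: 36310, 100460: 35214}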
Will this work for you?
dfG = df['Geoid'].values
dfZ = df['ZIP'].values
for g, z in zip(dfG, dfZ):
    print(str(g) + ':' + str(z))
This gives the output below (but note the values are printed as strings):
100100:36276
100124:36310
100460:35005
100460:35062
100460:35214
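A pandas-native sketch of the same grouping idea, using the df defined in the first answer:
grouped = df.groupby('Geoid')['ZIP'].apply(list).to_dict()
print(grouped)  # {100100: [36276], 100124: [36310], 100460: [35005, 35062, 35214]}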

Replace string in pandas dataframe if it contains specific substring

I have a dataframe generated from a .csv (I use Python 3.5). The df['category'] column contains only strings. What I want is to check this column and, if a string contains a specific substring (I am not really interested in where it occurs in the string, as long as it exists), to replace it. I am using this script:
import pandas as pd
df=pd.read_csv('lastfile.csv')
df.dropna(inplace=True)
g='Drugs'
z='Weapons'
c='Flowers'
df.category = df.category.str.lower().apply(lambda x: g if ('mdma' or 'xanax' or 'kamagra' or 'weed' or 'tabs' or 'lsd' or 'heroin' or 'morphine' or 'hci' or 'cap' or 'mda' or 'hash' or 'kush' or 'wax'or 'klonop'or\
'dextro'or'zepam'or'amphetamine'or'ketamine'or 'speed' or 'xtc' or 'XTC' or 'SPEED' or 'crystal' or 'meth' or 'marijuana' or 'powder' or 'afghan'or'cocaine'or'haze'or'pollen'or\
'sativa'or'indica'or'valium'or'diazepam'or'tablet'or'codeine'or \
'mg' or 'dmt'or'diclazepam'or'zepam'or 'heroin' ) in x else(z if ('weapon'or'milit'or'gun'or'grenades'or'submachine'or'rifle'or'ak47')in x else c) )
print(df['category'])
My problem is that some records, even though they contain some of the substrings I defined, do not get replaced. Is it a regex-related problem?
Thank you in advance.
This is not a regex problem: the expression ('mdma' or 'xanax' or ...) evaluates the or chain first, so the lambda only ever tests 'mdma' in x. Instead, create a dictionary mapping each replacement string to a list of substrings, loop over it, join each list with | for regex OR, check the column with str.contains, and replace the matched rows with loc:
df = pd.DataFrame({'category': ['sss mdma df', 'milit ss aa', 'aa ss']})

a = ['mdma', 'xanax', 'kamagra']
b = ['weapon', 'milit', 'gun']
g = 'Drugs'
z = 'Weapons'
c = 'Flowers'

d = {g: a, z: b}
df['new_category'] = c

for k, v in d.items():
    pat = '|'.join(v)
    mask = df.category.str.contains(pat, case=False)
    df.loc[mask, 'new_category'] = k

print(df)
category new_category
0 sss mdma df Drugs
1 milit ss aa Weapons
2 aa ss Flowers
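For two fixed pattern lists, an equivalent sketch with numpy.select (my variant, not part of the original answer), reusing the names defined above:
import numpy as np

conds = [df['category'].str.contains('|'.join(a), case=False),
         df['category'].str.contains('|'.join(b), case=False)]
df['new_category'] = np.select(conds, [g, z], default=c)  # first matching condition wins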

Creating pandas dataframes from nested json file that has lists

(The original question included a picture of the data structure.)
So, I have a JSON file with data; the file is deeply nested. I want to take only the words and create a new dataframe for each post id. Can anyone help with this?
You can use apply with a list comprehension:
df = pd.DataFrame({'member_info.vocabulary': [[], [{'post_iD': '3913', 'word': 'Twisters'},
                                                   {'post_iD': '3911', 'word': 'articulate'}]]})
df['words'] = df['member_info.vocabulary'].apply(lambda x: [y.get('word') for y in x])
print(df)
member_info.vocabulary words
0 [] []
1 [{'post_iD': '3913', 'word': 'Twisters'}, {'po... [Twisters, articulate]
And if you only ever get one-element lists, add str[0] to select the first value of each list:
df = pd.DataFrame({'member_info.vocabulary': [[], [{'post_iD': '3913', 'word': 'Twisters'}]]})
df['words'] = df['member_info.vocabulary'].apply(lambda x: [y.get('word') for y in x]).str[0]
print(df)
member_info.vocabulary words
0 [] NaN
1 [{'post_iD': '3913', 'word': 'Twisters'}] Twisters
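If you instead want one row per word together with its post id, a sketch using explode (available since pandas 0.25) plus apply(pd.Series), with the two-word frame from the first example:
df = pd.DataFrame({'member_info.vocabulary': [[], [{'post_iD': '3913', 'word': 'Twisters'},
                                                   {'post_iD': '3911', 'word': 'articulate'}]]})
rows = df['member_info.vocabulary'].explode().dropna().apply(pd.Series)
print(rows)  # columns: post_iD, word; one row per dict, empty lists dropped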
