I am running a fairly large loop that pulls out multiple key/value pairs, which are then formed into dictionaries. Eventually I want to turn these into a dataframe. I assume I must first collect them in a list? The code looks like this:
data = {}
ls_dict = []
keys = [name]
values = [number]
for i in range(len(keys)):
    data[keys[i]] = values[i]
ls_dict.append(data)
print(ls_dict)
This loop is inside another larger loop. That is where the key and values are coming from.
When I run the code, I get a load of separate dictionaries like so:
[{'name': number}]
[{'name': number}]
[{'name': number}]
But I was hoping to get them in a list like this:
[{'name': number}, {'name': number}, {'name': number}]
The plan was then to return that list out of the function and turn it into a dataframe with column headings "User" and "User Number".
First of all, any ideas why it's not producing a single list? And is there maybe a better way to make a dataframe out of the name and number I'm getting from my larger loop?
All help greatly appreciated.
Try:
final_list = [{key: value} for key, value in zip(keys,values)]
It looks like both keys and values always have only a single element. That's why the given for loop does only one step. Could you maybe also show the outer loop?
A good way to turn this kind of data into dataframes might be to use the from_dict classmethod.
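The outer loop is not shown, so the following is only a guess at the cause: the output in the question is what you would see if ls_dict were recreated on every pass of the outer loop. A minimal sketch of the usual fix is to create the list once, before the loop, and append a fresh dictionary per iteration. The records list and its sample values are hypothetical stand-ins for whatever the outer loop produces.

```python
import pandas as pd

# Hypothetical stand-in for the data produced by the outer loop.
records = [("alice", 101), ("bob", 102), ("carol", 103)]

rows = []  # created once, before the outer loop starts
for name, number in records:
    # A fresh dict per iteration, so earlier rows are never overwritten.
    rows.append({"User": name, "User Number": number})

# A list of dicts with identical keys maps directly onto dataframe columns.
df = pd.DataFrame(rows)
print(df)
```

Using the column headings as dictionary keys means no renaming is needed after the dataframe is built.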
Related
I have a dataframe which has 10k movie names and 40k actor names.
I'm trying to make a graph with nx, but the graphic becomes unreadable because of the actors' names, so I want to change their names to numbers. Some of these actors played in multiple movies, which means they appear more than once. I want to change all these actors to numbers, like 'Leslie Howard' = '1' and so on. I tried some loops and lists but I failed. I also want a dictionary so I can check which number belongs to which actor. Can you help me?
You could get all unique names of the column, generate a dictionary and then use map to change the values to the numbers. At the same time you have the dictionary to check to which actor the number refers.
all_names = df['Actor_Name'].unique()
dic = dict((v,k) for k,v in enumerate(all_names))
df['Actor_Name'] = df['Actor_Name'].map(dic)
You can just use factorize:
df['Movie_name'] = df['Movie_name'].factorize()[0]
df['Actor_name'] = df['Actor_name'].factorize()[0]
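Since the question also asks for a dictionary to look the numbers up again, here is a small sketch of how factorize can provide one: it returns both the integer codes and the unique values, in order of first appearance. The sample names and the Actor_Id column are made up for illustration.

```python
import pandas as pd

# Made-up sample data; "Actor B" is a placeholder name.
df = pd.DataFrame({"Actor_Name": ["Leslie Howard", "Actor B", "Leslie Howard"]})

# factorize returns (codes, uniques); codes index into uniques.
codes, uniques = pd.factorize(df["Actor_Name"])
df["Actor_Id"] = codes

# Lookup table from number back to actor name.
id_to_actor = dict(enumerate(uniques))
print(id_to_actor)
```

Keeping the codes in a separate column (rather than overwriting Actor_Name) is a design choice here, so the names stay available for checking.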
Convert the column into type category and get their unique values with .cat.codes:
df['Actor_Name'] = df['Actor_Name'].astype('category').cat.codes
I have this code which extracts subsets of a dataframe to individual dataframes which represent rainfall events:
j=list(range(len(eventdf)))
for k in range(len(eventdf)):
    dfname = 'event' + str(j[k])
    dfnatp = meandf2.iloc[eventdf.iloc[k,0]: eventdf.iloc[k,1]+2]
    dfnatp.to_csv(dfname + '.csv', sep=',')
While I can very easily dump each dataframe to a .csv file, doing anything with it means that I then have to read it back in.
How do I create each dataframe with the name given by the value of 'dfname', in the same way that I can name each csv file?
To elaborate Muhammad's suggestion a little more, you can create an empty dictionary like this (before your for loop):
dfdict = {}
Then you can create new dictionary entries like this (inside your for loop):
dfdict[dfname] = dfnatp
These entries will have dfname as the key and dfnatp as the value, so you can access each dfnatp by using dfdict['eventXXX'], where eventXXX is your identifier.
Here is an introduction to python's dictionary data structure for further reading.
As commented, consider a dictionary of data frames, which you can build with a dictionary comprehension. You lose no functionality of a data frame by storing it in a dict or list. Since you also need to save to CSV, consider a defined function. The code below uses f-strings for string interpolation.
def proc_data(k):
    dfnatp = meandf2.iloc[eventdf.iloc[k,0]: eventdf.iloc[k,1]+2]
    dfnatp.to_csv(f"event_{k}.csv")
    return dfnatp

df_dict = {
    f"event_{k}": proc_data(k) for k in range(len(eventdf))
}
# ACCESS INDIVIDUAL DATA FRAMES
df_dict["event_0"]
df_dict["event_1"]
df_dict["event_2"]
...
I have a dictionary "c" with 30000 keys and around 600000 unique values (around 20 unique values per key)
I want to create a new pandas column 'DOC_PORTL_ID': for each row, take the value in column 'image_keys', find the dictionary key whose values contain it, and return that key. So I wrote a function like this:
def find_match(row, c):
    for key, val in c.items():
        for item in val:
            if item == row['image_keys']:
                return key
and then I use .apply to create my new column like:
df_image_keys['DOC_PORTL_ID'] = df_image_keys.apply(lambda x: find_match(x, c), axis =1)
This takes a long time. I am wondering if I can improve my code snippet to make it faster.
I googled a lot and was not able to find the best way of doing this. Any help would be appreciated.
You're using your dictionary as a reverse lookup. And frankly, you haven't given us enough information about the dictionary. Are the 600,000 values unique? If not, you're only returning the first one you find. Is that expected?
Assume they are unique
reverse_dict = {val: key for key, values in c.items() for val in values}
df_image_keys['DOC_PORTL_ID'] = df_image_keys['image_keys'].map(reverse_dict)
This does the same job as your own function. If those values are not unique, you'll have to provide a better explanation of what you expect to happen.
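A small self-contained sketch of the reverse lookup, with made-up keys and values, may make the behaviour clearer. The reverse dictionary is built once, so each row then costs a single dictionary lookup instead of a scan over every value. One difference from find_match worth noting: values that are not found come back as NaN rather than None.

```python
import pandas as pd

# Made-up stand-in for the real dictionary "c".
c = {"doc_1": ["img_a", "img_b"], "doc_2": ["img_c"]}

# Built once: maps each value back to the key that owns it.
reverse_dict = {val: key for key, values in c.items() for val in values}

df_image_keys = pd.DataFrame({"image_keys": ["img_c", "img_a", "img_x"]})

# map() does one dict lookup per row; unknown values become NaN.
df_image_keys["DOC_PORTL_ID"] = df_image_keys["image_keys"].map(reverse_dict)
print(df_image_keys)
```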
Dict = {'Things' : {'Car':'Lambo', 'Home':'NatureVilla', 'Gadgets':{'Laptop':{'Programs':{'Data':'Excel', 'Officework': 'Word', 'Coding':{'Python':'PyCharm', 'Java':'Eclipse', 'Others': 'SublimeText'}, 'Wearables': 'SamsungGear', 'Smartphone': 'Nexus'}, 'clothes': 'ArmaaniSuit', 'Bags':'TravelBags'}}}}
d = {(i,j,k,l,m,n): Dict[i][j][k][l][m][n]
     for i in Dict.keys()
     for j in Dict[i].keys()
     for k in Dict[j].keys()
     for l in Dict[k].keys()
     for m in Dict[l].keys()
     for n in Dict[n].keys()
     }
mux = pd.MultiIndex.from_tuples(d.keys())
df = pd.DataFrame(list(d.values()), index=mux)
print (df)
What I have already done:
I tried to MultiIndex this irregular data using pandas, but I am getting a KeyError at 'Car'. Then I tried to handle the exception and pass on it, but that resulted in a SyntaxError. So maybe I have lost direction. Is there any other module or way to index this irregular data and put it in a table somehow? I have a chunk of raw data like this.
What I am trying to do:
I wanted to use this data for printing in QTableView which is from PyQt5 (Making a program with GUI).
Conditions:
This Data keeps on updating every hour from an API.
What I have thought till now:
Maybe I can append all this data to MySQL. When the data updates from the API, only the values will change; the keys will stay the same. But then it will require more space.
References:
How to convert a 3-level dictionary to a desired format?
How to build a MultiIndex Pandas DataFrame from a nested dictionary with lists
Any Help will be appreciated. Thanks for reading the question.
Your data is not actually a 6-level dictionary in the way that the dictionary in the 3-level example you referenced is 3-level. The difference is that your dictionary holds data at several different levels: e.g. the 'Lambo' value is on the second level of the hierarchy, with key ('Things','Car'), but the 'Eclipse' value is on the sixth level, with key ('Things','Gadgets','Laptop','Programs','Coding','Java').
If you want to 'flatten' your structure, you will need to decide what to do with the 'missing' key values at the deeper levels for values like 'Lambo'.
By the way, this may not actually be the solution to your problem; you may need a more appropriate UI widget such as a TreeView to work with this kind of hierarchical data. Still, I will try to address your exact question directly.
Unfortunately, there seems to be no easy way to reference values at all the different levels uniformly in one simple dict or list comprehension.
Just look at your 'value extractor' (Dict[i][j][k][l][m][n]): no values of i,j,k,l,m,n exist that would give you 'Lambo', because to get a Lambo you just need Dict['Things']['Car'] (ironically, in real life it can also be difficult to get a Lambo :-) ).
One straightforward way to solve your task is:
extract the second-level data, then the third-level data, and so on, and combine them.
E.g. to extract second level values you can write something like this:
val_level2 = {(k1,k2):Dict[k1][k2]
              for k1 in Dict
              for k2 in Dict[k1]
              if isinstance(Dict[k1],dict) and
              not isinstance(Dict[k1][k2],dict)}
But if you want to combine it later with the sixth-level values, you will need to add some padding to your key tuples:
val_level2 = {(k1,k2,'','','',''):Dict[k1][k2]
              for k1 in Dict
              for k2 in Dict[k1]
              if isinstance(Dict[k1],dict) and
              not isinstance(Dict[k1][k2],dict)}
Later you can combine them all with something like:
d = {}
d.update(val_level2)
d.update(val_level3)
But usually the most organic way to work with hierarchical data is to use some recursion, like this:
def flatten_dict(d,key_prefix,max_deep):
    return [(tuple(key_prefix+[k]+['']*(max_deep-len(key_prefix))),v)
            for k,v in d.items() if not isinstance(v,dict)] +\
           sum([flatten_dict(v,key_prefix+[k],max_deep)
                for k,v in d.items() if isinstance(v,dict)],[])
And later with code like this:
d={k:v for k,v in flatten_dict(Dict,[],5)}
mux = pd.MultiIndex.from_tuples(d.keys())
df = pd.DataFrame(list(d.values()), index=mux)
df.reset_index()
I actually get this result with your data:
P.S. According to https://www.python.org/dev/peps/pep-0008/#prescriptive-naming-conventions, lowercase_with_underscores is preferred for variable names; CapWords is for classes. So src_dict would be a much better name than Dict in your case.
Your information looks a lot like JSON, and that is probably what the API is returning. If that's the case, and you are turning it into a dictionary, then you might be better off using Python's json library or even pandas' built-in read_json function.
Pandas read json
Python's json
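If the API response is already a nested dictionary in memory, one more pandas option is json_normalize, which is not mentioned above but flattens nested dictionaries of uneven depth into columns named by their key path. A minimal sketch, assuming a trimmed-down version of the question's data (src_dict and the "/" separator are choices made for this example):

```python
import pandas as pd

# Trimmed-down stand-in for the nested data in the question.
src_dict = {
    "Things": {
        "Car": "Lambo",
        "Gadgets": {"Laptop": {"Coding": {"Python": "PyCharm"}}},
    }
}

# Each leaf becomes one column, named by its key path joined with sep.
flat = pd.json_normalize(src_dict, sep="/")

# One row in, one row out; turn it into a path -> value mapping.
row = flat.iloc[0].to_dict()
print(row)
```

Because the column names carry the full path, values at different depths need no padding.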
For a column in my pandas data frame, I need to check whether each entry matches any of the words in the dictionary values, and if so return the corresponding key.
my_dict = {'woodhill': ["woodhill"],'woodcocks': ["woodcocks"], 'whangateau' : ["whangateau","whangate"],'whangaripo' : ["whangaripo","whangari","whangar"],
'westmere' : ["westmere"],'western springs': ["western springs","western springs","western spring","western sprin",
"western spri","western spr","western sp","western s"]}
I can write a for loop for this. However, I have nearly 1.5 million records in my data frame, and the dictionary has more than 100 items, each of which may have up to 20 values in some cases. How do I do this efficiently? Can I reverse the dictionary, so that the values become keys and the keys become values, to make it fast? Thanks.
You can reverse your dictionary:
reversed_dict = {val: key for key in my_dict for val in my_dict[key]}
and then map it onto your dataframe:
df = pd.DataFrame({'col1':['western springs','westerns','whangateau','whangate']})
df['col1'] = df['col1'].map(reversed_dict)
Try this approach; it may help.
1st, reverse the dictionary items. Since there are only a limited number of items, this will be fast.
2nd, create a dataframe from the reversed dictionary. Instead of searching all keys for every row of the dataframe, it is better to do a join, and a join needs a dataframe.
3rd, do a left join from the large dataframe to the small one (in this case, the one built from the dictionary).
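The three steps can be sketched as follows. This is only an illustration under assumptions: the dictionary is cut down to two entries, and the names lookup and matched_key are invented for the example. The drop_duplicates call is there because a variant listed twice (as "western springs" is in the question's dictionary) would otherwise duplicate rows in the join.

```python
import pandas as pd

# Cut-down version of the dictionary from the question.
my_dict = {"whangateau": ["whangateau", "whangate"], "westmere": ["westmere"]}

# Steps 1 and 2: one row per (variant, key) pair in a small lookup frame.
lookup = pd.DataFrame(
    [(val, key) for key, vals in my_dict.items() for val in vals],
    columns=["col1", "matched_key"],
).drop_duplicates("col1")

df = pd.DataFrame({"col1": ["whangate", "westmere", "westerns"]})

# Step 3: left join keeps every row of the large frame; misses become NaN.
result = df.merge(lookup, on="col1", how="left")
print(result)
```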