I have a Python dictionary whose keys are 3-element tuples, created using the following code. The dictionary is very large - approximately 100,000 entries.
t1 = list(zip(df.Col1, df.Col2, df.Col3))   # list of (Col1, Col2, Col3) tuples
d_dict = dict(list(zip(t1, df.Col4)))       # maps each tuple to its Col4 value
I now have a separate dataframe, also very large, with 3 columns that match the dictionary keys. I want to apply Series.map(d_dict) to it in order to optimize some code. How can I do this?
I am currently using the following code, which errors on NaN and takes a very long time:
s1 = df2.apply(lambda x: d_dict[x.Col1, x.Col2, x.Col3], axis=1)
s1 = df2.map(d_dict)
is the kind of code that I am looking for.
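One hedged way to get that map-style lookup while keeping the tuple keys (a minimal sketch; df2, d_dict, and the column names are taken from the code above):
import pandas as pd
# Build a Series of (Col1, Col2, Col3) tuples aligned with df2's index, then map it through the dict.
keys = pd.Series(list(zip(df2.Col1, df2.Col2, df2.Col3)), index=df2.index)
s1 = keys.map(d_dict)   # tuples missing from d_dict become NaN instead of raising a KeyError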
I solved this by converting the 3 key columns into a single text key such as '1,0,1', making my dictionary keys the same text (e.g. '1,0,1': 2341), and then using Series.map(dict) with that one key.
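A rough sketch of that string-key approach (the comma separator and column names are assumptions based on the description above):
# Build the same 'Col1,Col2,Col3' text key on both sides, then map.
str_key = df.Col1.astype(str) + ',' + df.Col2.astype(str) + ',' + df.Col3.astype(str)
d_dict = dict(zip(str_key, df.Col4))
str_key2 = df2.Col1.astype(str) + ',' + df2.Col2.astype(str) + ',' + df2.Col3.astype(str)
s1 = str_key2.map(d_dict)   # unmatched keys map to NaN rather than raising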
Trying to create a unique key for a dataframe based on some columns. I used hashlib and zlib; both generate different values on each new Python program execution for the same record in the dataframe.
I am looking for a way to create a unique checksum that stays the same for a given data record in the dataframe. There are many columns, so I don't want to use a concatenated column as the key. Any insights would be much appreciated. Sample code tested with hashlib and zlib is below.
Hashlib
stg_matchdf["Unique travelid"] = pd.DataFrame(stg_matchdf[uniquecols_list].astype(str).values.sum(axis=1))[0].\
str.encode('utf-8').apply(lambda x: (hashlib.sha512(x).hexdigest().upper()))
zlib.adler32
stg_matchdf["Unique travelid"] = pd.DataFrame(stg_matchdf[uniquecols_list].astype(str).values.sum(axis=1))[0].\
str.encode('utf-8').apply(lambda x: (zlib.adler32(x) & 0xffffffff ))
Edited (10/21): I changed the code and am now hitting a new problem. Please review, and sorry for any confusion.
The code snippets above have a problem: for a given row, the hash of some other row's column values ended up in the 'Unique travelid' column, because pd.DataFrame() altered the original row order of the df. The modified code below fetches the correct column values for each row, but hits the new issue explained below.
Modified code
stg_matchdf["Unique travelid_Sum"] = stg_matchdf[uniquecols_list].astype(str).values.sum(axis=1)
stg_matchdf["Unique travelid_Key"] = stg_matchdf["Unique travelid_Sum"].apply(lambda x: (zlib.adler32(str(x).encode('utf-8')) & 0xffffffff))
stg_matchdf[uniquecols_list].astype(str).values.sum(axis=1) is not concatenating the columns in one particular order across multiple runs. Please see the samples below from two runs: the overall length is the same, but the order of concatenation is random, so hashlib or zlib return different values each time. Is there any way to specify the order of the columns in the above code?
Run1:
AHKGCANADACANADANORTH AMERICA266430RDirect WDAYYZINTERNATIONALMANULIFE - CANADA TRANSIENTFeb-2020HONG KONGASIA/PACIFICPARTIAL REFUND2020-02-15Canada266430.02020-02-02Hong Kong2020-03-01QVKGS6
Run2:
YYZCANADAPARTIAL REFUND2664302020-02-02AMANULIFE - CANADA TRANSIENTHONG KONGNORTH AMERICA2020-03-01Hong KongQVKGS6INTERNATIONALDirect WDRHKGACanadaFeb-2020266430.02020-02-15CANADAASIA/PACIFIC
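A hedged sketch of one possible fix, assuming the nondeterminism comes from uniquecols_list being an unordered collection such as a set: pin the column order explicitly before concatenating, so the hash input is identical on every run.
import zlib
# Fix the column order (assumption: uniquecols_list was unordered, e.g. a set).
ordered_cols = sorted(uniquecols_list)
stg_matchdf["Unique travelid_Sum"] = stg_matchdf[ordered_cols].astype(str).values.sum(axis=1)
stg_matchdf["Unique travelid_Key"] = stg_matchdf["Unique travelid_Sum"].apply(
    lambda x: zlib.adler32(str(x).encode('utf-8')) & 0xffffffff)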
This is my first post to the coding community, so I hope I get the right level of detail in my request for help!
Background info:
I want to repeat (loop) a command over a df using a variable that contains a list of options. The series 'amenity_options' contains a simple list of specific items (say only four amenities, as in the example below), while the df is a large data frame with many other items. My goal is to run the operation below for each item in 'amenity_options' until the end of the list.
amenity_options = ['cafe','bar','cinema','casino']  # this is a series type with multiple options
df = df[df['amenity'] == amenity_options]  # my attempt to select the first value in the series (e.g. cafe) out of the dataframe that contains such a column name
df.to_excel('{}_amenity.xlsx'.format('amenity'))  # wish to save the result (e.g. cafe_amenity) as a separate file
Desired result: I wish to loop steps one and two for each and every item in the list (e.g. cafe, bar, cinema...), so that I end up with separate Excel files. Any thoughts?
What @Rakesh suggested is correct; you probably just need one more step.
df = df[df['amenity'].isin(amenity_options)]
for key, g in df.groupby('amenity'):
    g.to_excel('{}_amenity.xlsx'.format(key))
After you call groupby() on your df, you get 4 groups, so you can loop over them directly. key is the group key (cafe, bar, etc.) and g is the sub-dataframe filtered to that specific key.
Seems like you just need a simple for loop:
for amenity in amenity_options:
    df[df['amenity'] == amenity].to_excel(f"{amenity}_amenity.xlsx")
I am trying to create a series of dictionaries from CSVs that I want to import, but I am not sure of the best way to do it.
I used RatingFactors = os.listdir(RatingDirectory) and
CSVLocations = []
for factor in RatingFactors:
    CSVLocations.append(RatingDirectory + factor)
to create a list of CSV paths. These CSVs contain what is essentially a dictionary of FactorName | Factor Value, e.g. 1 | 5, 2 | 3.5.
I want to create a dictionary for each CSV, ideally named after the CSV's name. However, I understand that it is considered bad practice to dynamically name variables inside a loop.
I tried creating a generator expression, df_from_each_file = (pd.read_csv(CSVs) for CSVs in CSVLocations), and if I iterate over it with for y in df_from_each_file: print(y), it prints each of the dataframes, but I don't know how to separate them out.
What is the Pythonic way to do this?
How the CSVs look post import
0 0 1.1
1 1 0.9
2 2 0.9
3 3 0.9
etc
Edit:
Attempt to rephrase my question.
I have a series of CSVs which look as if they are formatted like dictionaries: they have two columns and represent how one factor relates to another. I would like to make a dictionary for each CSV, named like the CSV, so that I can interact with them from Python.
Edit 2:
I believe this question is different from the one referenced, as that one creates a single dataframe containing all of the dictionaries; I want the dictionaries to be separate rather than combined into a single unit. I tried that answer before asking this and could not separate them out.
I think you need a dict comprehension with basename for the keys:
import glob, os
import pandas as pd

files = glob.glob('files/*.csv')
sers = {os.path.basename(f).split('.')[0]: pd.read_csv(f, index_col=[0]).squeeze() for f in files}
If you want one big Series:
d = pd.concat(sers, ignore_index=False)
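Since the question asks for dictionaries rather than Series, a small follow-up sketch (assuming the sers mapping built above): each Series converts to a plain dict with .to_dict().
# One plain Python dict per CSV, keyed by the file's base name.
dicts = {name: ser.to_dict() for name, ser in sers.items()}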
Dict = {'Things' : {'Car':'Lambo', 'Home':'NatureVilla', 'Gadgets':{'Laptop':{'Programs':{'Data':'Excel', 'Officework': 'Word', 'Coding':{'Python':'PyCharm', 'Java':'Eclipse', 'Others': 'SublimeText'}, 'Wearables': 'SamsungGear', 'Smartphone': 'Nexus'}, 'clothes': 'ArmaaniSuit', 'Bags':'TravelBags'}}}}
d = {(i,j,k,l,m,n): Dict[i][j][k][l][m][n]
for i in Dict.keys()
for j in Dict[i].keys()
for k in Dict[j].keys()
for l in Dict[k].keys()
for m in Dict[l].keys()
for n in Dict[n].keys()
}
mux = pd.MultiIndex.from_tuples(d.keys())
df = pd.DataFrame(list(d.values()), index=mux)
print (df)
What I have already done:
I tried to MultiIndex this irregular data using pandas, but I am getting a KeyError at 'Car'. Then I tried to handle the exception with pass, but that results in a SyntaxError. So maybe I have lost the direction. Is there any other module or way I can index this irregular data and put it in a table somehow? I have a chunk of raw data like this.
What I am trying to do:
I want to use this data for display in a QTableView, which is from PyQt5 (I am making a program with a GUI).
Conditions:
This data keeps updating every hour from an API.
What I have thought of so far:
Maybe I can append all this data to MySQL. But when the data updates from the API, only the values will change and the keys will stay the same, so it would require more space.
References:
How to convert a 3-level dictionary to a desired format?
How to build a MultiIndex Pandas DataFrame from a nested dictionary with lists
Any help will be appreciated. Thanks for reading the question.
Your data is not actually a uniform 6-level dictionary like the dictionary in the 3-level example you referenced. The difference is that your dictionary has data at multiple different levels, e.g. the value 'Lambo' sits at the second level of the hierarchy with key ('Things','Car'), but 'Eclipse' sits at the sixth level with key ('Things','Gadgets','Laptop','Programs','Coding','Java').
If you want to 'flatten' your structure, you will need to decide what to do with the 'missing' key values at deeper levels for values like 'Lambo'.
Btw, maybe this is not actually a solution to your problem; perhaps a more appropriate UI widget such as a TreeView is better suited to this kind of hierarchical data. But I will try to directly address your exact question.
Unfortunately, there seems to be no easy way to reference values at all the different levels uniformly in one simple dict or list comprehension.
Just look at your 'value extractor' (Dict[i][j][k][l][m][n]): there are no values of i, j, k, l, m, n that allow you to get 'Lambo', because to get a Lambo you only need Dict['Things']['Car'] (ironically, in real life it can also be difficult to get a Lambo :-) ).
One straightforward way to solve your task is to extract the second-level data, extract the third-level data, and so on, and then combine them all together.
E.g. to extract the second-level values you can write something like this:
val_level2 = {(k1, k2): Dict[k1][k2]
              for k1 in Dict
              for k2 in Dict[k1]
              if isinstance(Dict[k1], dict) and
                 not isinstance(Dict[k1][k2], dict)}
but if you want to combine it later with the sixth-level values, you will need to add some padding to your key tuples:
val_level2 = {(k1, k2, '', '', '', ''): Dict[k1][k2]
              for k1 in Dict
              for k2 in Dict[k1]
              if isinstance(Dict[k1], dict) and
                 not isinstance(Dict[k1][k2], dict)}
Later you can combine everything together with something like:
d = {}
d.update(val_level2)
d.update(val_level3)
But usually the most organic way to work with hierarchical data is to use some recursion, like this:
def flatten_dict(d, key_prefix, max_deep):
    return [(tuple(key_prefix + [k] + [''] * (max_deep - len(key_prefix))), v)
            for k, v in d.items() if not isinstance(v, dict)] + \
           sum([flatten_dict(v, key_prefix + [k], max_deep)
                for k, v in d.items() if isinstance(v, dict)], [])
And later with code like this:
d={k:v for k,v in flatten_dict(Dict,[],5)}
mux = pd.MultiIndex.from_tuples(d.keys())
df = pd.DataFrame(list(d.values()), index=mux)
df.reset_index()
With your data this actually gives me the flattened result I'd expect.
P.S. According to https://www.python.org/dev/peps/pep-0008/#prescriptive-naming-conventions we prefer lowercase_with_underscores for variable names; CapWords is for classes. So src_dict would be much better than Dict in your case.
Your information looks a lot like JSON, and that is probably what the API is returning. If that's the case, and you are turning it into a dictionary, then you might be better off using Python's json library or even pandas' built-in read_json.
Pandas read json
Python's json
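A minimal sketch of those two options, with a hypothetical JSON payload standing in for the API response (pd.json_normalize assumes pandas >= 1.0):
import json
import pandas as pd

payload = '{"Things": {"Car": "Lambo", "Home": "NatureVilla"}}'   # hypothetical API response
data = json.loads(payload)        # plain Python dict via the standard json module
flat = pd.json_normalize(data)    # one-row DataFrame with dotted column names
print(flat.columns.tolist())      # ['Things.Car', 'Things.Home']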
In my pandas data frame column, I need to check whether the column contains any of the words in the dictionary values; if it does, I should return the corresponding key.
my_dict = {'woodhill': ["woodhill"],'woodcocks': ["woodcocks"], 'whangateau' : ["whangateau","whangate"],'whangaripo' : ["whangaripo","whangari","whangar"],
'westmere' : ["westmere"],'western springs': ["western springs","western springs","western spring","western sprin",
"western spri","western spr","western sp","western s"]}
I can write a for loop for this; however, I have nearly 1.5 million records in my data frame, the dictionary has more than 100 items, and each may have up to 20 values in some cases. How do I do this efficiently? Can I reverse the dictionary, making the values keys and the keys values, to make it fast? Thanks.
You can reverse your dictionary:
reversed_dict = {val: key for key in my_dict for val in my_dict[key]}
and then map it onto your dataframe:
df = pd.DataFrame({'col1': ['western springs', 'westerns', 'whangateau', 'whangate']})
df['col1'] = df['col1'].map(reversed_dict)
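One optional tweak, noted as a hedged alternative to the last line above: map() leaves values that are not found in reversed_dict as NaN, so you can fall back to the original text in that case.
# Keep the original value where the reversed dictionary has no match.
df['col1'] = df['col1'].map(reversed_dict).fillna(df['col1'])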
Try this approach; it may help you (a sketch follows below).
1st: reverse the dictionary items. # limited number of items, so it will be fast
2nd: create a dataframe from the dictionary. # instead of searching all keys for each comparison against the dataframe, it's better to do a join, so build a dataframe for that
3rd: left-join the large dataframe onto the small dataframe (built from the dictionary).
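A hedged sketch of those three steps (my_dict and df are taken from the question; the column name 'col1' and the new 'matched_key' column are assumptions):
import pandas as pd

# 1) Reverse the dictionary: each value string points back to its original key.
reversed_dict = {val: key for key, vals in my_dict.items() for val in vals}

# 2) Build a small lookup DataFrame from the reversed dictionary.
lookup_df = pd.DataFrame(list(reversed_dict.items()), columns=['col1', 'matched_key'])

# 3) Left-join the large DataFrame onto the small lookup table.
df = df.merge(lookup_df, on='col1', how='left')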