I am trying to create a series of dictionaries from CSVs that I want to import, but I am not sure of the best way to do it.
I used

RatingFactors = os.listdir(RatingDirectory)
CSVLocations = []
for factor in RatingFactors:
    CSVLocations.append(RatingDirectory + factor)

to create a list of CSV paths. These CSVs contain what is essentially a dictionary of FactorName | FactorValue pairs, e.g. 1 | 5, 2 | 3.5.
I want to create a dictionary for each CSV, ideally named after the CSV itself. However, I understand that dynamically naming variables inside a loop is considered bad practice.
I tried creating a generator expression, df_from_each_file = (pd.read_csv(CSVs) for CSVs in CSVLocations), and if I iterate over it with

for y in df_from_each_file:
    print(y)

it prints each of the dataframes, but I don't know how to separate them out.
What is the Pythonic way to do this?
How the CSVs look post import:
0 0 1.1
1 1 0.9
2 2 0.9
3 3 0.9
etc
Edit:
Attempt to rephrase my question.
I have a series of CSVs that are formatted like dictionaries: they have two columns and represent how one factor relates to another. I would like to make a dictionary for each CSV, named like the CSV, so that I can interact with them from Python.
Edit 2:
I believe this question is different from the one referenced, as that one creates a single dataframe containing all of the dictionaries; I want the dictionaries to be separate rather than in a single unit. I tried that answer before asking this and could not separate them out.
I think you need a dict comprehension with os.path.basename for the keys:
import glob, os
import pandas as pd

files = glob.glob('files/*.csv')
sers = {os.path.basename(f).split('.')[0]: pd.read_csv(f, index_col=[0]).squeeze()
        for f in files}
If you want one big Series:
d = pd.concat(sers, ignore_index=False)
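You can then pull any single CSV out by key. A minimal usage sketch (the file name files/factor1.csv, and hence the key 'factor1', is hypothetical):

# hypothetical key: files/factor1.csv would appear under 'factor1'
factor1 = sers['factor1']            # a Series indexed by the first CSV column
value = factor1.loc[1]               # look up the factor value for key 1
factor1_dict = factor1.to_dict()     # or a plain dict, as the question asks for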
Related
I have a Python dictionary keyed by 3-element tuples, which I created using the following code. The dictionary is very large - approximately 100,000 rows.
t1 = list(zip(df.Col1, df.Col2, df.Col3))
d_dict = dict(zip(t1, df.Col4))
I now have a separate dataframe, also very large, with 3 columns that match the dictionary keys. I want to apply series.map(d_dict) to it in order to optimize some code. How can I do this? I am currently using the following code, which errors on NaN values and takes a very long time:
s1 = df2.apply(lambda x: d_dict[x.Col1, x.Col2, x.Col3], axis=1)

Something like

s1 = df2.map(d_dict)

is the kind of code that I am looking for.
I solved this by converting the 3 keys into a single text key such as '1,0,1', making my dictionary keys the same text form ('1,0,1': 2341), and then using series.map(dict) with that one key.
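A minimal sketch of that approach, assuming the column names from the question (Col1-Col4 on df, the same three key columns on df2):

# build string keys like '1,0,1' from the three key columns
keys = df.Col1.astype(str) + ',' + df.Col2.astype(str) + ',' + df.Col3.astype(str)
d_dict = dict(zip(keys, df.Col4))

# map the lookup dataframe the same way; missing combinations become NaN
lookup = df2.Col1.astype(str) + ',' + df2.Col2.astype(str) + ',' + df2.Col3.astype(str)
s1 = lookup.map(d_dict)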
I am currently working to process some data imported into Python as a dataframe with 10,000 rows and 20 columns. The columns store sample names and chemical elements. The dataframe is currently indexed by both sample name and time, appearing as so:
(screenshot: https://i.stack.imgur.com/7knqD.png)
From this dataframe, I want to create an individual array for each sample, of which there are around 25, with a loop. I have generated an index and an array of the sample names, which yields an array that appears as so:
samplename = fuegodataframe.index.levels[0]
samplearray = samplename.to_numpy()
array(['AC4-EUH41', 'AC4-EUH79N', 'AC4-EUH79S', 'AC4-EUH80', 'AC4-EUH81',
'AC4-EUH81b', 'AC4-EUH82N', 'AC4-EUH82W', 'AC4-EUH84',
'AC4-EUH85N', 'AC4_EUH48', 'AC4_EUH48b', 'AC4_EUH54N',
'AC4_EUH54S', 'AC4_EUH60', 'AC4_EUH72', 'AC4_EUH73', 'AC4_EUH73W',
'AC4_EUH78', 'AC4_EUH79E', 'AC4_EUH79W', 'AC4_EUH88', 'AC4_EUH89',
'bhvo-1', 'bhvo-2', 'bir-1', 'bir-2', 'gor132-1', 'gor132-2',
'gor132-3', 'sc ol-1', 'sc ol-2'], dtype=object)
I have also created a dictionary whose keys are each of these sample names. I am now wondering how I would use this dictionary to generate individual variables for each sample that capture all the rows in which that sample is found.
I have tried something along these lines:
for ii in sampledictionary.keys():
    if ii == sampledictionary[ii]:
        sampledictionary[ii] = fuegodataframe.loc[sampledictionary[ii]]
but this fails. How would I actually go about doing something like this? Is this possible?
I think you're asking how to generate variables dynamically rather than assign your output to a key in your dictionary.
In Python there is a function, globals(), that returns a dictionary of all the names defined in the module's global scope. You can assign new variables dynamically through this dictionary:
globals()[f'variablename_{ii}'] = fuegodataframe.loc[sampledictionary[ii]]
etc.
If ii were 0, then variablename_0 would be available with the assigned value.
In general this is not considered good practice, but it is occasionally required.
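That said, the usual alternative is a dictionary of DataFrames keyed by sample name, which avoids dynamic variables entirely. A minimal sketch, assuming fuegodataframe has the sample name as level 0 of its index as in the question:

# one entry per sample, each holding every row for that sample
samples = {name: fuegodataframe.loc[name] for name in samplearray}
samples['AC4-EUH41']  # all rows for the sample 'AC4-EUH41'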
I have several dataframes on which I am performing the same functions - extracting the mean, geomean, median, etc. of a particular column (PurchasePrice), organised by groups within another column (GORegion). At the moment I am just performing this for each dataframe separately, as I cannot work out how to do this in a for loop and save a separate data series for each function performed on each dataframe.
i.e. I perform median like this:
regmedian15 = pd.Series(nw15.groupby(["GORegion"])['PurchasePrice'].median(), name = "regmedian_nw15")
I want to do this for a list of dataframes [nw15, nw16, nw17], extracting the same variable outputs for each of them.
I have tried things like :
listofnwdfs = [nw15, nw16, nw17]
for df in listofnwdfs:
    df+'regmedian' = pd.Series(df.groupby(["GORegion"])['PurchasePrice'].median(),
                               name=df+'regmedian')
but it says "can't assign to operator"
I think the main point is I can't work out how to create separate output variable names using the names of the dataframes I am inputting into the for loop. I just want a for loop function that produces my median output as a series for each dataframe in the list separately, and I can then do this for means and so on.
Many thanks for your help!
First, df+'regmedian' = ... is not valid Python syntax. You are trying to assign a value to an expression of the form A + B, which is why Python complains that you can't assign to an operator.
Also, df+'regmedian' itself seems strange. You are trying to add a DataFrame and a string.
One way to keep track of different statistics for different dataframes is by using dicts. For example, you can replace
listofnwdfs = [nw15, nw16, nw17]
with
dict_of_nwd_frames = {15: nw15, 16: nw16, 17: nw17}
Say you want to store 'regmedian' data for each frame. You can do this with dicts as well.
data = dict()
for key, df in dict_of_nwd_frames.items():
    data[(key, 'regmedian')] = pd.Series(
        df.groupby(["GORegion"])['PurchasePrice'].median(),
        name=str(key) + 'regmedian')
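You can then look up any stored result by its key, and the same pattern extends to the other statistics you mention. A sketch, reusing the names above:

data[(15, 'regmedian')]  # the median series for nw15

# means, computed the same way
for key, df in dict_of_nwd_frames.items():
    data[(key, 'regmean')] = pd.Series(
        df.groupby(["GORegion"])['PurchasePrice'].mean(),
        name=str(key) + 'regmean')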
Dict = {'Things': {'Car': 'Lambo',
                   'Home': 'NatureVilla',
                   'Gadgets': {'Laptop': {'Programs': {'Data': 'Excel',
                                                       'Officework': 'Word',
                                                       'Coding': {'Python': 'PyCharm',
                                                                  'Java': 'Eclipse',
                                                                  'Others': 'SublimeText'},
                                                       'Wearables': 'SamsungGear',
                                                       'Smartphone': 'Nexus'},
                                          'clothes': 'ArmaaniSuit',
                                          'Bags': 'TravelBags'}}}}
d = {(i,j,k,l,m,n): Dict[i][j][k][l][m][n]
     for i in Dict.keys()
     for j in Dict[i].keys()
     for k in Dict[j].keys()
     for l in Dict[k].keys()
     for m in Dict[l].keys()
     for n in Dict[n].keys()}
mux = pd.MultiIndex.from_tuples(d.keys())
df = pd.DataFrame(list(d.values()), index=mux)
print (df)
What I have already done:
I tried to MultiIndex this irregular data using pandas, but I am getting a KeyError at 'Car'. Then I tried to handle the exception and pass, but that results in a SyntaxError. So maybe I have lost the direction. Is there any other module or way I can index this irregular data and put it in a table somehow? I have a chunk of raw data like this.
What I am trying to do:
I wanted to use this data for printing in a QTableView from PyQt5 (I am making a program with a GUI).
Conditions:
This data keeps updating every hour from an API.
What I have thought till now:
Maybe I can append all this data to MySQL. But when this data updates from the API, only the values will change; the keys will stay the same, so it would require more space than necessary.
References:
How to convert a 3-level dictionary to a desired format?
How to build a MultiIndex Pandas DataFrame from a nested dictionary with lists
Any Help will be appreciated. Thanks for reading the question.
Your data is not actually a 6-level dictionary like the dictionary in the 3-level example you referenced. The difference is that your dictionary has data on multiple different levels: e.g. the value 'Lambo' is on the second level of the hierarchy, with key ('Things','Car'), while 'Eclipse' is on the sixth level, with key ('Things','Gadgets','Laptop','Programs','Coding','Java').
If you want to 'flatten' your structure, you will need to decide what to do with the 'missing' key values at the deeper levels for values like 'Lambo'.
Btw, maybe this is not actually a solution to your problem; perhaps you need a more appropriate UI widget, such as a TreeView, to work with this kind of hierarchical data. But I will try to directly address your exact question.
Unfortunately there seems to be no easy way to reference all the different-level values uniformly in one simple dict or list comprehension. Just look at your 'value extractor' (Dict[i][j][k][l][m][n]): no values of i,j,k,l,m,n exist that would get you 'Lambo', because to get a Lambo you just need Dict['Things']['Car'] (ironically, in real life it can also be difficult to get a Lambo :-)).
One straightforward way to solve your task is to extract the second-level data, extract the third-level data, and so on, and combine them together. E.g. to extract the second-level values you can write something like this:
val_level2 = {(k1, k2): Dict[k1][k2]
              for k1 in Dict
              for k2 in Dict[k1]
              if isinstance(Dict[k1], dict) and
                 not isinstance(Dict[k1][k2], dict)}
but if you want to combine it later with the sixth-level values, you will need to add some padding to your key tuples:
val_level2 = {(k1, k2, '', '', '', ''): Dict[k1][k2]
              for k1 in Dict
              for k2 in Dict[k1]
              if isinstance(Dict[k1], dict) and
                 not isinstance(Dict[k1][k2], dict)}
Later you can combine everything together with something like:
d = {}
d.update(val_level2)
d.update(val_level3)
But usually the most organic way to work with hierarchical data is recursion, like this:
def flatten_dict(d, key_prefix, max_deep):
    # leaves at this level, with keys padded to a fixed tuple length
    return [(tuple(key_prefix + [k] + [''] * (max_deep - len(key_prefix))), v)
            for k, v in d.items() if not isinstance(v, dict)] + \
           sum([flatten_dict(v, key_prefix + [k], max_deep)
                for k, v in d.items() if isinstance(v, dict)], [])
And later, with code like this:

d = {k: v for k, v in flatten_dict(Dict, [], 5)}
mux = pd.MultiIndex.from_tuples(d.keys())
df = pd.DataFrame(list(d.values()), index=mux)
df.reset_index()
Running this, I actually get the expected flattened result with your data.
P.S. According to PEP 8 (https://www.python.org/dev/peps/pep-0008/#prescriptive-naming-conventions), lowercase_with_underscores is preferred for variable names; CapWords is for classes. So src_dict would be much better than Dict in your case.
Your information looks a lot like JSON, and that is probably what the API is returning. If that's the case, and you are turning it into a dictionary, then you might be better off using Python's json library or even pandas' built-in read_json.
Pandas read json
Python's json
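A minimal sketch of that route, assuming api_response holds the raw JSON text returned by the hourly API call (the variable name is hypothetical):

import json
import pandas as pd

# parse the raw JSON into nested dicts, like Dict above
nested = json.loads(api_response)

# flatten the nesting into one row, with dotted column names per leaf
flat = pd.json_normalize(nested, sep='.')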
Suppose I have a DataFrame that is block sparse. By this I mean that there are groups of rows that have disjoint sets of non-null columns. Storing this as one huge table will use more memory in the values (NaN filling), and unstacking the table to rows will create a large index (at least it appears that way on saving to disk; I'm not 100% clear whether there is some efficient MultiIndexing that is supposed to be going on).
Typically, I store the blocks as separate DataFrames in a dict or list (dropping the NaN columns) and make a class that has almost the same API as a DataFrame, 'manually' passing queries to the blocks and concatenating the results. This works well but involves some special code to store and handle these objects.
Recently, I've noticed that pytables provides a feature similar to this but only for the pytables query api.
Is there some way of handling this natively in pandas? Or am I missing some simpler way of getting a solution that is similar in performance?
EDIT: Here is a small example dataset
import string
import numpy as np
import pandas as pd

# create some data and put it in a list of blocks (d)
m, n = 10, 6
s = list(string.ascii_uppercase)
A = np.array([s[x] * (1 + x % 3) for x in np.random.randint(0, 26, m * n)]).reshape(m, n)
df = pd.DataFrame(A)
d = list()
d += [df.iloc[:m // 2, :n // 2]]
d += [df.iloc[m // 2:, n // 2:]]

# 1. uses lots of memory, fills the off-block cells with NaN
d0 = pd.concat(d)  # the blocks reassembled into one frame

# 2. maybe ok, not sure how this is handled across different pandas versions
d1 = pd.concat([x.unstack() for x in d])

# want this to work however the blocks are stored
print(d0.loc[[0, 8], [2, 5]])

# this raises an exception
sdf = pd.SparseDataFrame(df)
You could use HDFStore this way: store different tables with a common index (that is itself a column). Only the non-all-NaN rows would be stored, so if you group your columns intelligently (e.g. put the ones that tend to be sparse in the same places together), I think you could achieve a 'sparse'-like layout. You can compress the table if necessary. You can then query individual tables, and get the coordinates to pull from the other tables (this is what select_as_multiple does).
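A minimal sketch of that layout; the file name, the A-D column names, the column split, and the query are all hypothetical:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(8, 4), columns=list('ABCD'))

with pd.HDFStore('blocks.h5', complevel=9, complib='blosc') as store:
    # append_to_multiple stores column groups as separate tables sharing one
    # index; None means "all remaining columns" go to that table
    store.append_to_multiple({'block1': ['A', 'B'], 'block2': None},
                             df, selector='block1', data_columns=['A'])
    # query one table and pull the matching rows from the others
    result = store.select_as_multiple(['block1', 'block2'],
                                      where=['A > 0'], selector='block1')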
Can you provide a small example, and the rough size of the data set, e.g. number of rows, columns, disjoint groups, etc.? What do your queries look like? This is generally how I approach the problem: figure out how you are going to query the data; that will define how the data should be stored.