I have this code, which extracts subsets of a dataframe into individual dataframes, each representing a rainfall event:
j = list(range(len(eventdf)))
for k in range(len(eventdf)):
    dfname = 'event' + str(j[k])
    dfnatp = meandf2.iloc[eventdf.iloc[k, 0]: eventdf.iloc[k, 1] + 2]
    dfnatp.to_csv(dfname + '.csv', sep=',')
While I can very easily dump each dataframe to a .csv file, to do anything with it afterwards I have to read it back in.
How do I create each dataframe with the name given by the value of dfname, in the same way that I name each csv file?
To elaborate on Muhammad's suggestion a little more, you can create an empty dictionary like this (before your for loop):
dfdict = {}
Then you can create new dictionary entries like this (inside your for loop):
dfdict[dfname] = dfnatp
These entries will have dfname as the key and dfnatp as the value, so you can access each dfnatp by using dfdict['eventXXX'], where eventXXX is your identifier.
Here is an introduction to Python's dictionary data structure for further reading.
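Putting the pieces together, a minimal sketch of your loop (assuming eventdf and meandf2 are defined as in your question) could look like this:

dfdict = {}
for k in range(len(eventdf)):
    dfname = 'event' + str(k)
    # slice out one rainfall event, as in your original code
    dfnatp = meandf2.iloc[eventdf.iloc[k, 0]: eventdf.iloc[k, 1] + 2]
    dfnatp.to_csv(dfname + '.csv')
    dfdict[dfname] = dfnatp

# later: work with any event directly, no re-reading from disk
dfdict['event0'].describe()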
As commented, consider a dictionary of data frames, which you can build with a dictionary comprehension. You lose no data frame functionality by storing it in a dict or list. Since you also need to save to CSV, consider a defined method. The code below uses f-strings for string interpolation.
def proc_data(k):
    dfnatp = meandf2.iloc[eventdf.iloc[k, 0]: eventdf.iloc[k, 1] + 2]
    dfnatp.to_csv(f"event_{k}.csv")
    return dfnatp

df_dict = {
    f"event_{k}": proc_data(k) for k in range(len(eventdf))
}
# ACCESS INDIVIDUAL DATA FRAMES
df_dict["event_0"]
df_dict["event_1"]
df_dict["event_2"]
...
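As a follow-up, if you ever need all events in a single frame, pd.concat accepts a dict of data frames and turns its keys into an extra index level; a small sketch, assuming the df_dict built above:

import pandas as pd

# the dict keys ("event_0", "event_1", ...) become the outer index level
all_events = pd.concat(df_dict)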
I'm not a Python guru (I'm more used to R).
I use the pypdf package (v3.4.1) to extract data from a PDF form I created and filled in with Acrobat.
I can read the form fields with
f = PdfReader('test_formulaire.pdf')
ffields = f.get_fields()
ffields is a dict object of size 3 (3 keys: 'a1', 'a2', 'a5'). Each value of the dict is a Field class object.
I can access the value of a key with print(ffields['a1'].value)
I now want to create a pandas dataframe with a column for each key of ffields (3 columns, named after the keys) and a row containing the value of each key.
Is there a quick and easy way to do it?
I can create an empty dataframe with the column names with something like this (probably far from optimal):
column_names = ["" for x in range(len(ffields))]
idx = 0
for i in ffields:
    column_names[idx] = i
    idx += 1
data = pd.DataFrame(columns=column_names)
And filling it should be possible with more for loops, but that seems ugly... (note that some values are numbers and others are strings).
Does anybody have a hint for doing this efficiently?
Thanks in advance
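For what it's worth, here is a minimal sketch of a one-step approach (an assumption on my part, relying only on what is described above: ffields is a dict and each value exposes a .value attribute):

import pandas as pd

# one row; column names are the field keys, cell values are the field values
data = pd.DataFrame([{key: field.value for key, field in ffields.items()}])

pandas infers a dtype per column, so numeric and string values can coexist across columns.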
I'm reading a lot of log files, parsing each log into a dictionary, and adding these dictionaries to a dataframe that I later use for analysis. But the information I need in the dataframe may differ every time based on user input, so I don't want everything in the dictionary added to the dataframe, only the columns I define.
As of now I'm appending the dictionaries one by one to a list, then loading that list of dictionaries into a dataframe.
for log in log_lines:
    # here logic to parse the log and generate the dictionary d
    my_dict_list.append(d)

pd.DataFrame(my_dict_list)
This way it adds all the keys and their values to the dataframe, but what I want is: if the user asks for columns ['a', 'b', 'c'] for analysis, the dataframe should load only those keys and their values; the rest should be ignored.
my_dict_list =[ {'a':'abc','b':'123','c':'hello', 'date':'20-5-2019'},
{'a':'dfc','b':'453','c':'user', 'date':'23-5-2019'},
{'a':'bla','b':'2313','c':'anything', 'date':'25-5-2019'} ]
Note: I don't want to drop the keys at log-extraction time, because I will be extracting a lot of logs and that would be time-consuming.
Is there a way I can achieve this with pandas, in a faster way?
In the tmp_dict line you can filter down to the requested columns, so only those are saved.
def log_dataframe(log_lines, requested_columns):
    my_dict_list = []
    for log in log_lines:
        # here logic to parse the log and generate the dictionary d
        tmp_dict = {key: d[key] for key in requested_columns}
        my_dict_list.append(tmp_dict)
    return pd.DataFrame(my_dict_list)
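Alternatively, pandas can do the filtering itself: when constructing a data frame from a list of dicts, the columns argument keeps only the listed keys, so the intermediate dictionary may not be needed at all. A short sketch using the sample data above:

import pandas as pd

df = pd.DataFrame(my_dict_list, columns=['a', 'b', 'c'])  # 'date' is dropped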
I am just providing some raw logic for your query; I may be wrong on some parts, but if you find it helpful that will be very great. You can also mail me with future queries and I will be happy to help.
columns = []
x = int(input('enter no of columns you need: '))
for i in range(x):
    print("Please specify a column")
    col = input()
    columns.append(col)

my_dict_list = [{'a': 'abc', 'b': '123', 'c': 'hello', 'date': '20-5-2019'},
                {'a': 'dfc', 'b': '453', 'c': 'user', 'date': '23-5-2019'},
                {'a': 'bla', 'b': '2313', 'c': 'anything', 'date': '25-5-2019'}]

value = pd.DataFrame(my_dict_list)
print(value[columns])
Dict = {'Things': {'Car': 'Lambo',
                   'Home': 'NatureVilla',
                   'Gadgets': {'Laptop': {'Programs': {'Data': 'Excel',
                                                       'Officework': 'Word',
                                                       'Coding': {'Python': 'PyCharm',
                                                                  'Java': 'Eclipse',
                                                                  'Others': 'SublimeText'},
                                                       'Wearables': 'SamsungGear',
                                                       'Smartphone': 'Nexus'},
                                          'clothes': 'ArmaaniSuit',
                                          'Bags': 'TravelBags'}}}}
d = {(i, j, k, l, m, n): Dict[i][j][k][l][m][n]
     for i in Dict.keys()
     for j in Dict[i].keys()
     for k in Dict[j].keys()
     for l in Dict[k].keys()
     for m in Dict[l].keys()
     for n in Dict[n].keys()
     }
mux = pd.MultiIndex.from_tuples(d.keys())
df = pd.DataFrame(list(d.values()), index=mux)
print (df)
What I have already done:
I tried to MultiIndex this irregular data using pandas, but I am getting a KeyError at 'Car'. Then I tried to handle the exception and pass it, but that results in a SyntaxError. So maybe I lost the direction. Is there any other module or way I can index this irregular data and put it in a table somehow? The chunk of raw data I have is shown above.
What I am trying to do:
I want to use this data for printing in a QTableView, which is from PyQt5 (I am making a program with a GUI).
Conditions:
This data keeps updating every hour from an API.
What I have thought till now:
Maybe I can append all this data to MySQL. But when the data updates from the API, only the values will change; the keys will stay the same. And it would require more space.
References:
How to convert a 3-level dictionary to a desired format?
How to build a MultiIndex Pandas DataFrame from a nested dictionary with lists
Any Help will be appreciated. Thanks for reading the question.
Your data is not actually a 6-level dictionary like the dictionary in the 3-level example you referenced. The difference is that your dictionary holds data on multiple different levels: e.g. the 'Lambo' value is on the second level of the hierarchy, with key ('Things', 'Car'), but the 'Eclipse' value is on the sixth level, with key ('Things', 'Gadgets', 'Laptop', 'Programs', 'Coding', 'Java').
If you want to 'flatten' your structure, you will need to decide what to do with the 'missing' key levels for shallow values like 'Lambo'.
By the way, maybe this is not really the solution to your problem; perhaps a more appropriate UI widget, like a TreeView, would suit such hierarchical data better. But I will try to directly address your exact question.
Unfortunately, there seems to be no easy way to reference values on all the different levels uniformly in one simple dict or list comprehension.
Just look at your 'value extractor' (Dict[i][j][k][l][m][n]): no values of i, j, k, l, m, n exist that get you 'Lambo', because to get 'Lambo' you just need Dict['Things']['Car'] (ironically, in real life it can also be difficult to get a Lambo :-) ).
One straightforward way to solve your task is to extract the second-level data, then the third-level data, and so on, and combine them together.
E.g. to extract the second-level values you can write something like this:
val_level2 = {(k1, k2): Dict[k1][k2]
              for k1 in Dict
              for k2 in Dict[k1]
              if isinstance(Dict[k1], dict) and
              not isinstance(Dict[k1][k2], dict)}
but if you want to combine it later with the sixth-level values, you will need to add some padding to your key tuples:
val_level2 = {(k1, k2, '', '', '', ''): Dict[k1][k2]
              for k1 in Dict
              for k2 in Dict[k1]
              if isinstance(Dict[k1], dict) and
              not isinstance(Dict[k1][k2], dict)}
Later you can combine everything with something like:
d = {}
d.update(val_level2)
d.update(val_level3)
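Equivalently, on Python 3.5+ you can merge the per-level dicts in one expression with dict unpacking:

d = {**val_level2, **val_level3}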
But usually the most organic way to work with hierarchical data is recursion, like this:
def flatten_dict(d, key_prefix, max_deep):
    return [(tuple(key_prefix + [k] + [''] * (max_deep - len(key_prefix))), v)
            for k, v in d.items() if not isinstance(v, dict)] + \
           sum([flatten_dict(v, key_prefix + [k], max_deep)
                for k, v in d.items() if isinstance(v, dict)], [])
And later with code like this:
d={k:v for k,v in flatten_dict(Dict,[],5)}
mux = pd.MultiIndex.from_tuples(d.keys())
df = pd.DataFrame(list(d.values()), index=mux)
df.reset_index()
I actually get the expected flattened result with your data.
P.S. According to https://www.python.org/dev/peps/pep-0008/#prescriptive-naming-conventions, lowercase_with_underscores is preferred for variable names; CapWords is for classes. So src_dict would be much better than Dict in your case.
Your information looks a lot like JSON, and that is probably what the API is returning. If that's the case, and you are turning it into a dictionary, then you might be better off using Python's json library or even pandas' built-in read_json.
Pandas read json
Python's json
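As a quick sketch of that idea (assuming the API payload parses into the nested dict shown in the question), recent pandas versions can flatten the nesting directly with json_normalize, producing dotted column names:

import pandas as pd

flat = pd.json_normalize(Dict)  # columns like 'Things.Car', 'Things.Gadgets.Laptop.Bags', ...
print(flat.T)  # transpose for a tall key/value view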
I want to insert data from a dictionary into a SQLite table using SQLAlchemy. The keys in the dictionary and the column names are the same, and I want to insert each value into the column of the same name. This is my code:
# This is the class where I create a table with sqlalchemy, and I want to
# insert my data into it.
# I didn't write the __init__ for simplicity
class Sizecurve(Base):
    __tablename__ = 'sizecurve'

    XS = Column(String(5))
    S = Column(String(5))
    M = Column(String(5))
    L = Column(String(5))
    XL = Column(String(5))
    XXL = Column(String(5))
o = Mapping()  # This creates an object which is actually a dictionary
for eachitem in myitems:
    # Here I populate the dictionary with keys from another list
    # This gives me a dictionary looking like this: o = {'S': None, 'M': None, 'L': None}
    o[eachitem] = None

for eachsize in mysizes:
    # Here I assign a value to each key of the dictionary if one exists, else it stays None
    # product_row is a class; size and stock are its attributes
    if product_row.size in o:
        o[product_row.size] = product_row.stock

# I put the final object into a list
simplelist.append(o)
Now I want to put the values from the dictionaries saved in simplelist into the right columns of the sizecurve table, but I am stuck; I don't know how to do that. So for example I have an object like this:
o = {'S': 4, 'M': 2, 'L': 1}
And I want the row to have value 4 in column S, value 2 in column M, etc.
Yes, it's possible (though aren't you missing primary keys/foreign keys on this table?).
session.add(Sizecurve(**o))
session.commit()
That should insert the row.
http://docs.sqlalchemy.org/en/latest/core/tutorial.html#executing-multiple-statements
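On the primary key point above: SQLAlchemy's declarative classes do need one, so a minimal sketch of the class with a surrogate key added (the id name here is just an assumption) would be:

from sqlalchemy import Column, Integer, String

class Sizecurve(Base):
    __tablename__ = 'sizecurve'

    id = Column(Integer, primary_key=True)  # surrogate key; mapped classes require a primary key
    # ... the XS/S/M/L/XL/XXL columns as before ...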
EDIT: On second read it seems like you are trying to insert all those values into one column? If so, I would make use of pickle.
https://docs.python.org/3.5/library/pickle.html
If performance is an issue (pickle is pretty fast, but if you're doing 10,000 reads per second it'll be the bottleneck), you should either redesign the table or use a database like PostgreSQL that supports JSON objects.
I have found this answer to a similar question, though it is about reading the data from a JSON file; so now I am working on understanding the code and also changing my data type to JSON so that I can insert the values in the right place.
Convert JSON to SQLite in Python - How to map json keys to database columns properly?
Is there a way to create a PyTable with a specific column order?
By default, the columns are ordered alphabetically when using either a dictionary or a class for the schema definition in the call to createTable(). My need is to establish a specific order and then use numpy.genfromtxt() to read and store my data from text. Unfortunately, my text file does not include the variable names, only the data in its fixed column order.
At this time, my columns are ordered alphabetically while the data is ordered according to the file layout, so the two are misaligned. It is desirable (but not essential) to maintain the file's order in the PyTable.
Thanks
See:
Is there a way to store PyTable columns in a specific order?
For example, assuming text file is named mydata.txt and is organized as follows:
time(row1) bVar(row1) dVar(row1) aVar(row1) cVar(row1)
time(row2) bVar(row2) dVar(row2) aVar(row2) cVar(row2)
...
time(rowN) bVar(rowN) dVar(rowN) aVar(rowN) cVar(rowN)
So, the desire is to create a table with the columns in this order, and then use the numpy.genfromtxt command to populate the table.
# Column and Table definition with desired order
class parmDev(tables.IsDescription):
    time = tables.Float64Col()
    bVar = tables.Float64Col()
    dVar = tables.Float64Col()
    aVar = tables.Float64Col()
    cVar = tables.Float64Col()

# ...
mytab = h5file.createTable(group, tabName, parmDev)  # h5file is the open tables File object
data = numpy.genfromtxt('mydata.txt')
mytab.append(data)
This makes for straightforward code and is very fast. But the table columns are always ordered alphabetically, while the appended data is ordered according to the desired file order. Is there a way to have the table columns follow the class definition order instead of being alphabetical?
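One documented way to achieve this, sketched below: each PyTables Col constructor accepts a pos argument that fixes the column position explicitly, overriding the alphabetical default (worth verifying against your PyTables version):

# Column and Table definition with an explicit position per column
class parmDev(tables.IsDescription):
    time = tables.Float64Col(pos=0)
    bVar = tables.Float64Col(pos=1)
    dVar = tables.Float64Col(pos=2)
    aVar = tables.Float64Col(pos=3)
    cVar = tables.Float64Col(pos=4)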