Is there a way to create a PyTable with a specific column order?
By default, the columns are ordered alphabetically when either a dictionary or a class is used as the schema definition in the call to createTable(). I need to establish a specific order and then use numpy.genfromtxt() to read and store my data from a text file. Unfortunately, my text file contains only the data; the variable names are not included in it.
At this time, my columns are ordered alphabetically, so the data is misaligned because it is ordered according to the file layout. It is desirable (but not essential) to maintain the same order in the PyTable.
Thanks
See:
Is there a way to store PyTable columns in a specific order?
For example, assuming the text file is named mydata.txt and is organized as follows:
time(row1) bVar(row1) dVar(row1) aVar(row1) cVar(row1)
time(row2) bVar(row2) dVar(row2) aVar(row2) cVar(row2)
...
time(rowN) bVar(rowN) dVar(rowN) aVar(rowN) cVar(rowN)
So, the desire is to create a table that is ordered with these columns
and then use the numpy.genfromtxt command to populate the table.
# Column and Table definition with desired order
class parmDev(tables.IsDescription):
    time = tables.Float64Col()
    bVar = tables.Float64Col()
    dVar = tables.Float64Col()
    aVar = tables.Float64Col()
    cVar = tables.Float64Col()
#...
mytab = tables.createTable( group, tabName, parmDev )
data = numpy.genfromtxt("mydata.txt")
mytab.append(data)
This makes for straightforward code and is very fast. But the table columns are always
ordered alphabetically, while the appended data follows the desired order. Is
there a way to have the order of the table columns follow the class definition order instead of being alphabetical?
Related
I have a dataframe which has 10k movie names and 40k actor names.
I'm trying to make a graph with nx, but the graphic becomes unreadable because of the actors' names, so I want to change their names to numbers. Some of these actors played in multiple movies, which means they appear more than once. I want to change all these actors to numbers, like 'Leslie Howard' = '1' and so on. I tried some loops and lists but failed. I also want a dictionary so that I can check which number refers to which actor. Can you help me?
You could get all unique names of the column, generate a dictionary and then use map to change the values to the numbers. At the same time you have the dictionary to check to which actor the number refers.
all_names = df['Actor_Name'].unique()
dic = dict((v,k) for k,v in enumerate(all_names))
df['Actor_Name'] = df['Actor_Name'].map(dic)
You can just use factorize:
df['Movie_name'] = df['Movie_name'].factorize()[0]
df['Actor_name'] = df['Actor_name'].factorize()[0]
Convert the column to the category type and get an integer code for each value with .cat.codes:
df['Actor_Name'] = df['Actor_Name'].astype('category').cat.codes
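The question also asks for a dictionary to look up which number belongs to which actor. A minimal sketch of one way to get both with pd.factorize, which returns the unique values alongside the codes. The sample rows and the Actor_Id column name are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; the real frame would hold the full movie/actor list.
df = pd.DataFrame({
    "Movie_Name": ["Movie A", "Movie A", "Movie B", "Movie C"],
    "Actor_Name": ["Leslie Howard", "Actor X", "Leslie Howard", "Actor Y"],
})

# factorize returns the integer codes and the unique values,
# numbered in order of first appearance.
codes, uniques = pd.factorize(df["Actor_Name"])
df["Actor_Id"] = codes  # illustrative column name, keeps the original names intact

# Lookup tables in both directions.
id_to_actor = dict(enumerate(uniques))
actor_to_id = {name: i for i, name in id_to_actor.items()}

print(df)
print(id_to_actor)
```

Note that factorize numbers values in order of first appearance, while .cat.codes numbers them by the sorted categories, so the two approaches generally assign different numbers to the same actor.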
I have this code which extracts subsets of a dataframe to individual dataframes which represent rainfall events:
j=list(range(len(eventdf)))
for k in range(len(eventdf)):
    dfname= 'event'+str(j[k])
    dfnatp=meandf2.iloc[eventdf.iloc[k,0]: eventdf.iloc[k,1]+2]
    dfnatp.to_csv(dfname+'.csv', sep=',')
While I can very easily dump each dataframe to a .csv file, doing anything with it means that I then have to read it back in.
How do I create each dataframe with the name given by the value of 'dfname', in the same way that I can name each csv file?
To elaborate Muhammad's suggestion a little more, you can create an empty dictionary like this (before your for loop):
dfdict = {}
Then you can create new dictionary entries like this (inside your for loop):
dfdict[dfname] = dfnatp
These entries will have dfname as the key and dfnatp as the value, so you can access each dfnatp by using dfdict['eventXXX'], where eventXXX is your identifier.
Here is an introduction to python's dictionary data structure for further reading.
As commented, consider a dictionary of data frames, which you can build with a dictionary comprehension. You lose no functionality of a data frame by storing it in a dict or list. Since you also need to save to CSV, consider a defined function. The code below uses f-strings for string interpolation.
def proc_data(k):
    dfnatp = meandf2.iloc[eventdf.iloc[k,0]: eventdf.iloc[k,1]+2]
    dfnatp.to_csv(f"event_{k}.csv")
    return dfnatp

df_dict = {
    f"event_{k}": proc_data(k) for k in range(len(eventdf))
}

# ACCESS INDIVIDUAL DATA FRAMES
df_dict["event_0"]
df_dict["event_1"]
df_dict["event_2"]
...
I'm reading a lot of log files and generating a dictionary by parsing each log. I want to add these dictionaries to a dataframe, which I later use for analysis. But the information I need in the dataframe may differ each time based on user input, so I don't want everything in the dictionary added to the dataframe. I want only the columns I define to be added.
As of now I'm adding all the dictionaries one by one to a list, then loading this list into a dataframe.
for log in log_lines:
    # here logic to parse the log and generate the dictionary
    my_dict_list.append(d)
pd.DataFrame(my_dict_list)
This adds all the keys and their values to the dataframe,
but what I want is to define some columns. Let's say the user asks for the ['a','b','c'] columns for analysis: I want the dataframe to load only these keys and their values, and the rest should be ignored.
my_dict_list =[ {'a':'abc','b':'123','c':'hello', 'date':'20-5-2019'},
{'a':'dfc','b':'453','c':'user', 'date':'23-5-2019'},
{'a':'bla','b':'2313','c':'anything', 'date':'25-5-2019'} ]
Note: I don't want to drop these keys while extracting the logs, because I will be extracting a lot of logs, so doing it there would be time consuming.
Is there a faster way to achieve this using pandas?
In the tmp_Dict line you can filter the dictionary so that only the requested columns are saved.
def log_dataframe(log_lines, requested_columns):
    my_dict_list = []
    for log in log_lines:
        # here logic to parse the log and generate the dictionary
        tmp_Dict = {requested_key : d[requested_key] for requested_key in requested_columns}
        my_dict_list.append(tmp_Dict)
    return pd.DataFrame(my_dict_list)
I am just providing some raw logic for your query. I may be wrong in some parts, but I hope you find it helpful. You can also mail me with future queries; I will be happy to help.
import pandas as pd

columns = []
x = int(input('enter no of columns you need'))
for i in range(x):
    print("Please specify columns")
    column = input()  # column names are strings such as 'a'
    columns.append(column)
my_dict_list =[ {'a':'abc','b':'123','c':'hello', 'date':'20-5-2019'},
{'a':'dfc','b':'453','c':'user', 'date':'23-5-2019'},
{'a':'bla','b':'2313','c':'anything', 'date':'25-5-2019'} ]
# build the dataframe once, then print each requested column
value = pd.DataFrame(my_dict_list)
for data in range(x):
    print(value[[columns[data]]])
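Another option, sketched here as an alternative to filtering each dictionary by hand, is to pass the requested keys through the columns argument of the DataFrame constructor. With a list of dictionaries, pandas then keeps only those keys. The requested_columns list stands in for the user input and is an assumption.

```python
import pandas as pd

my_dict_list = [
    {'a': 'abc', 'b': '123', 'c': 'hello', 'date': '20-5-2019'},
    {'a': 'dfc', 'b': '453', 'c': 'user', 'date': '23-5-2019'},
    {'a': 'bla', 'b': '2313', 'c': 'anything', 'date': '25-5-2019'},
]

# Assumed to come from user input.
requested_columns = ['a', 'b', 'c']

# Keys that are not listed in `columns` (here 'date') are left out of the frame.
df = pd.DataFrame(my_dict_list, columns=requested_columns)
print(df)
```

The parsing loop stays unchanged this way, since the selection happens once when the dataframe is built. A requested key that is missing from a dictionary shows up as NaN instead of raising a KeyError.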
I want to insert data from a dictionary into a sqlite table using sqlalchemy. The keys in the dictionary and the column names are the same, and I want to insert each value into the column with the matching name. This is my code:
#This is the class where I create a table from with sqlalchemy, and I want to
#insert my data into.
#I didn't write the __init__ for simplicity
class Sizecurve(Base):
    __tablename__ = 'sizecurve'
    XS = Column(String(5))
    S = Column(String(5))
    M = Column(String(5))
    L = Column(String(5))
    XL = Column(String(5))
    XXL = Column(String(5))

o = Mapping() #This creates an object which is actually a dictionary
for eachitem in myitems:
    # Here I populate the dictionary with keys from another list
    # This gives me a dictionary looking like this: o={'S':None, 'M':None, 'L':None}
    o[eachitem] = None
for eachsize in mysizes:
    # Here I assign values to each key of the dictionary, if a value exists if not just None
    # product_row is a class and size and stock are its attributes
    if(product_row.size in o):
        o[product_row.size] = product_row.stock
# I put the final object into a list
simplelist.append(o)
Now I want to put each of the values from the dictionaries saved in simplelist into the right column in the sizecurve table. But I am stuck; I don't know how to do that. So for example I have an object like this:
o= {'S':4, 'M':2, 'L':1}
And I want the row to hold the value 4 in column S, the value 2 in column M, etc.
Yes, it's possible (though aren't you missing primary keys/foreign keys on this table?).
session.add(Sizecurve(**o))
session.commit()
That should insert the row.
http://docs.sqlalchemy.org/en/latest/core/tutorial.html#executing-multiple-statements
EDIT: On second read it seems like you are trying to insert all those values into one column? If so, I would make use of pickle.
https://docs.python.org/3.5/library/pickle.html
If performance is an issue (pickle is pretty fast, but if you're doing 10000 reads per second it'll be the bottleneck), you should either redesign the table or use a database like PostgreSQL that supports JSON objects.
I have found this answer to a similar question, though this is about reading the data from a json file, so now I am working on understanding the code and also changing my data type to json so that I can insert them in the right place.
Convert JSON to SQLite in Python - How to map json keys to database columns properly?
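For comparison, the key-to-column mapping can also be sketched with the standard-library sqlite3 module instead of SQLAlchemy, using named placeholders. This is a swapped-in technique, not the approach from the answers above. The id primary key, the INTEGER column types and the insert_sizes helper are assumptions made for illustration.

```python
import sqlite3

# In-memory database; the table layout mirrors the Sizecurve columns,
# plus an assumed primary key and INTEGER types for the stock counts.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sizecurve ("
    "id INTEGER PRIMARY KEY, "
    "XS INTEGER, S INTEGER, M INTEGER, L INTEGER, XL INTEGER, XXL INTEGER)"
)

COLUMNS = ("XS", "S", "M", "L", "XL", "XXL")

def insert_sizes(conn, sizes):
    """Insert one row, matching dictionary keys to column names (hypothetical helper)."""
    # Sizes missing from the dictionary become None so every placeholder has a value.
    row = {col: sizes.get(col) for col in COLUMNS}
    conn.execute(
        "INSERT INTO sizecurve (XS, S, M, L, XL, XXL) "
        "VALUES (:XS, :S, :M, :L, :XL, :XXL)",
        row,
    )

simplelist = [{'S': 4, 'M': 2, 'L': 1}]
for o in simplelist:
    insert_sizes(conn, o)
conn.commit()

stored = conn.execute("SELECT XS, S, M, L, XL, XXL FROM sizecurve").fetchall()
print(stored)
```

Each dictionary becomes one row, and the sizes that were never set are stored as NULL.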
I have a dataset in a relational database format (linked by IDs over various .csv files).
I know that each data frame contains only one value of an ID, and I'd like to know the simplest way to extract values from that row.
What I'm doing now:
# the group has only one element
purchase_group = purchase_groups.get_group(user_id)
price = list(purchase_group['Column_name'])[0]
The third line is bothering me as it seems ugly, but I'm not sure what the workaround is. The grouping (I guess) assumes that there might be multiple values and returns a <class 'pandas.core.frame.DataFrame'> object, while I'd like just a row returned.
If you want just the value and not a df/series then call values and index the first element [0] so just:
price = purchase_group['Column_name'].values[0]
will work.
If purchase_group has a single row, then purchase_group = purchase_group.squeeze() turns it into a series, so you can simply call purchase_group['Column_name'] to get your value.
Late to the party here, but purchase_group['Column_name'].item() is now available and is cleaner than some other solutions.
This method is intuitive; for example to get the first row (list from a list of lists) of values from the dataframe:
np.array(df)[0]
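The suggestions above can be compared side by side on a small frame. This is only a sketch: the purchases data and the user_id column are made up, and Column_name is the placeholder name from the question.

```python
import pandas as pd

# Hypothetical data: exactly one purchase row per user.
purchases = pd.DataFrame({
    "user_id": [10, 11, 12],
    "Column_name": [9.99, 4.50, 20.00],
})
purchase_groups = purchases.groupby("user_id")

# The group has only one element.
purchase_group = purchase_groups.get_group(11)

# Underlying NumPy array, first element.
price_values = purchase_group["Column_name"].values[0]

# .item() raises ValueError unless the series holds exactly one element.
price_item = purchase_group["Column_name"].item()

# Positional access to the first element.
price_iloc = purchase_group["Column_name"].iloc[0]

# squeeze() turns the one-row frame into a Series indexed by column name.
row = purchase_group.squeeze()
price_squeeze = row["Column_name"]

print(price_values, price_item, price_iloc, price_squeeze)
```

.item() doubles as a check on the assumption that the group really contains a single row, whereas .values[0] and .iloc[0] silently return the first of several.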