Add dictionary to pandas data frame and ignore extra values - python

I'm reading a lot of log files and generating a dictionary by parsing each log. I want to add these dictionaries to a dataframe, which I later use for analysis. But the information I need in the dataframe may differ every time based on user input, so I don't want everything in the dictionary added to the dataframe; only the columns I define should be added.
As of now I'm appending each dictionary to a list, then loading that list into a dataframe:
my_dict_list = []
for log in log_lines:
    # here logic to parse the log and generate the dictionary d
    my_dict_list.append(d)
pd.DataFrame(my_dict_list)
In this way it adds all the keys and their values to the dataframe, but what I want is to define some columns up front. Let's say the user asks for the ['a','b','c'] columns for analysis; I want the dataframe to load only those keys and their values, and the rest should be ignored.
my_dict_list =[ {'a':'abc','b':'123','c':'hello', 'date':'20-5-2019'},
{'a':'dfc','b':'453','c':'user', 'date':'23-5-2019'},
{'a':'bla','b':'2313','c':'anything', 'date':'25-5-2019'} ]
Note: I don't want to drop keys at the time of log extraction, because I will be extracting a lot of logs, so that would be time consuming.
Is there a faster way I can achieve this using pandas?

In the tmp_Dict line you can filter for, and keep, only the requested columns.
def log_dataframe(log_lines, requested_columns):
    my_dict_list = []
    for log in log_lines:
        # here logic to parse the log and generate the dictionary d
        tmp_Dict = {requested_key: d[requested_key] for requested_key in requested_columns}
        my_dict_list.append(tmp_Dict)
    return pd.DataFrame(my_dict_list)
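Alternatively, pandas can do the filtering itself: when a DataFrame is built from a list of dicts, the columns argument keeps only the listed keys. A minimal sketch with the sample data from the question:
import pandas as pd

my_dict_list = [{'a': 'abc', 'b': '123', 'c': 'hello', 'date': '20-5-2019'},
                {'a': 'dfc', 'b': '453', 'c': 'user', 'date': '23-5-2019'},
                {'a': 'bla', 'b': '2313', 'c': 'anything', 'date': '25-5-2019'}]

# only the requested keys become columns; 'date' is ignored
df = pd.DataFrame(my_dict_list, columns=['a', 'b', 'c'])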

I am just providing some raw logic for your query. I may be wrong on some parts, but I hope you find it helpful.
columns = []
x = int(input('enter no of columns you need: '))
for i in range(x):
    print("Please specify a column name")
    columns.append(input())
my_dict_list = [ {'a':'abc','b':'123','c':'hello', 'date':'20-5-2019'},
                 {'a':'dfc','b':'453','c':'user', 'date':'23-5-2019'},
                 {'a':'bla','b':'2313','c':'anything', 'date':'25-5-2019'} ]
df = pd.DataFrame(my_dict_list, columns=columns)
print(df)

Related

creating many dataframes as subsets of one large one

I have this code, which extracts subsets of a dataframe into individual dataframes representing rainfall events:
for k in range(len(eventdf)):
    dfname = 'event' + str(k)
    dfnatp = meandf2.iloc[eventdf.iloc[k, 0]: eventdf.iloc[k, 1] + 2]
    dfnatp.to_csv(dfname + '.csv', sep=',')
While I can very easily dump each dataframe to a .csv file, to do anything with it I then have to read it back in.
How do I create each dataframe with the name given by the value of 'dfname', in the same way that I can name each csv file?
To elaborate on Muhammad's suggestion a little more, you can create an empty dictionary like this (before your for loop):
dfdict = {}
Then you can create new dictionary entries like this (inside your for loop):
dfdict[dfname] = dfnatp
These entries will have dfname as the key and dfnatp as the value, so you can access each dfnatp by using dfdict['eventXXX'], where eventXXX is your identifier.
Here is an introduction to python's dictionary data structure for further reading.
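Put together with the loop from the question, a minimal sketch looks like this (assuming meandf2 and eventdf as defined there):
dfdict = {}
for k in range(len(eventdf)):
    dfname = 'event' + str(k)
    dfnatp = meandf2.iloc[eventdf.iloc[k, 0]: eventdf.iloc[k, 1] + 2]
    dfnatp.to_csv(dfname + '.csv', sep=',')
    dfdict[dfname] = dfnatp  # keep each dataframe in memory under its name

dfdict['event0'].head()  # work with any event later without re-reading its CSV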
As commented, consider a dictionary of data frames, which you can achieve with a dictionary comprehension. You lose no functionality of a data frame by saving it in a dict or list. Since you also need to save to CSV, consider a defined method. Below uses f-strings for string interpolation.
def proc_data(k):
    dfnatp = meandf2.iloc[eventdf.iloc[k, 0]: eventdf.iloc[k, 1] + 2]
    dfnatp.to_csv(f"event_{k}.csv")
    return dfnatp

df_dict = {f"event_{k}": proc_data(k) for k in range(len(eventdf))}
# ACCESS INDIVIDUAL DATA FRAMES
df_dict["event_0"]
df_dict["event_1"]
df_dict["event_2"]
...

Split and create data from a column to many columns

I have a pandas data frame in which the values of one of its columns look like this:
print(VCF['INFO'].iloc[0])
Results (sorry, I can't copy and paste this data as I am working from a cluster without an internet connection)
I need to create new columns named END, SVTYPE and SVLEN, with their info as the values of those columns. Following the example, this would be:
END        SVTYPE  SVLEN
224015456  DEL     -223224913
The rest of the info contained in the column INFO I do not need so far.
The information contained in this column is huge, but as far as I can read there is nothing more than something=value pairs, as you can see in the picture.
Simply use .str.extract:
extracted = df['INFO'].str.extract('END=(?P<END>.+?);SVTYPE=(?P<SVTYPE>.+?);SVLEN=(?P<SVLEN>.+?);')
Output:
>>> extracted
         END SVTYPE       SVLEN
0  224015456    DEL  -223224913
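For reference, a self-contained reproduction; the INFO string here is a hypothetical value assembled from the expected output above:
import pandas as pd

# hypothetical INFO value in the semicolon-separated key=value layout
df = pd.DataFrame({'INFO': ['END=224015456;SVTYPE=DEL;SVLEN=-223224913;OTHER=x']})
extracted = df['INFO'].str.extract('END=(?P<END>.+?);SVTYPE=(?P<SVTYPE>.+?);SVLEN=(?P<SVLEN>.+?);')
print(extracted)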

How to loop a command in python with a list as variable input?

This is my first post to the coding community, so I hope I get the right level of detail in my request for help!
Background info:
I want to repeat (loop) a command on a df using a variable that contains a list of options. While the series 'amenity_options' contains a simple list of specific items (let's say only four amenities, as in the example below), the df is a large data frame with many other items. My goal is to run the operation below for each item in 'amenity_options' until the end of the list.
amenity_options = ['cafe', 'bar', 'cinema', 'casino']  # this is a series type with multiple options
df = df[df['amenity'] == amenity_options]  # this is my attempt to select the first value in the series (e.g. cafe) out of a dataframe that contains such a column name
df.to_excel('{}_amenity.xlsx'.format('amenity'))  # wish to save the result (e.g. cafe_amenity) as a separate file
Desired result: I wish to loop steps one and two for each and every item available in the list (e.g. cafe, bar, cinema...), so that I end up with separate Excel files. Any thoughts?
What @Rakesh suggested is correct; you probably just need one more step.
df = df[df['amenity'].isin(amenity_options)]
for key, g in df.groupby('amenity'):
    g.to_excel('{}_amenity.xlsx'.format(key))
After you call groupby() on your df, you will get 4 groups, which you can loop over directly.
The key is the group key (cafe, bar, etc.), and g is the sub-dataframe specifically filtered to that key.
Seems like you just need a simple for loop:
for amenity in amenity_options:
    df[df['amenity'] == amenity].to_excel(f"{amenity}_amenity.xlsx")

select appropriate dataframe based on user input

I have the below data frames loaded:
df_1000-2000
df_3000-4000
df_5000-6000
df_7000-8000
Now I get a user input value such as 1000-2000. Based on the user input value, I need to work on the respective data frame.
In this case, I need to work on: df_1000-2000
How do I select the data frame dynamically based on user input and start working on it?
Use a dictionary
You should restructure how you store and access your dataframes. (Note that hyphenated names like df_1000-2000 aren't valid Python identifiers, so the names below use underscores.) First define a dictionary:
dfs = {'1000-2000': df_1000_2000, '3000-4000': df_3000_4000}  # and so on for the other ranges
Then taking a user input and using it to query your dictionary is straightforward:
value = input('Input the range you require, e.g. 1000-2000:')
res = dfs[value]
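If the user may type a range that isn't loaded, dict.get avoids an unhandled KeyError. A small defensive sketch, assuming the dfs dictionary above:
value = input('Input the range you require, e.g. 1000-2000: ')
res = dfs.get(value)
if res is None:
    print('No dataframe loaded for range {}'.format(value))
else:
    print(res.head())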

Pandas formatting column within DataFrame and adding timedelta Index error

I'm trying to use pandas to do some analysis on some messaging data and am running into a few problems trying to prep the data. It is coming from a database I don't have control of, and therefore I need to do a little pruning and formatting before analyzing it.
Here is where I'm at so far:
#select all the messages in the database. Be careful if you get the whole test database, it may have 5000000 messages.
full_set_data = pd.read_sql("Select * from message",con=engine)
After I make this change to the timestamp and set it as index, I'm no longer able to call to_csv.
#convert timestamp to a timedelta and set as index
#full_set_data[['timestamp']] = full_set_data[['timestamp']].astype(np.timedelta64)
indexed = full_set_data.set_index('timestamp')
indexed.to_csv('indexed.csv')
#extract the data columns I really care about since there are a bunch I don't need
datacolumns = indexed[['address','subaddress','rx_or_tx', 'wordcount'] + [col for col in indexed.columns if ('DATA' in col)]]
Here, where I need to format the DATA columns, I get a "SettingWithCopyWarning".
#now need to format the DATA columns to something useful by removing the upper 4 bytes
for col in datacolumns.columns:
    if 'DATA' in col:
        datacolumns[col] = datacolumns[col].apply(lambda x: int(x, 16) & 0x0000ffff)
datacolumns.to_csv('data_col.csv')
#now group the data by "interaction key"
groups = datacolumns.groupby(['address','subaddress','rx_or_tx'])
I need to figure out how to get all the messages from a given group. get_group() requires I know key values ahead of time.
key_group = groups.get_group((1,1,1))
# for each group in groups:
#     do analysis
I have tried everything I could think of to fix the problems I'm running into, but I can't seem to get around them. I'm sure it's from me misunderstanding/misusing pandas, as I'm still figuring it out.
I'm looking to solve these issues:
1) I can't save to csv after I add an index of timestamp as timedelta64.
2) How do I apply a function to a set of columns so as to remove the SettingWithCopyWarning when reformatting the DATA columns?
3) How do I grab the rows for each group without having to use get_group(), since I don't know the keys ahead of time?
Thanks for any insight and help so I can better understand how to properly use pandas.
Firstly, you can set the index column(s) and parse dates while querying the DB:
indexed = pd.read_sql_query("Select * from message", con=engine,
                            parse_dates=['timestamp'], index_col='timestamp')
Note I've used pd.read_sql_query here rather than pd.read_sql; read_sql is just a convenience wrapper that delegates to read_sql_query or read_sql_table.
The SettingWithCopy warning is due to the fact that datacolumns is a view of indexed, i.e. a subset of its rows/columns, not an object in its own right. Check out this part of the docs: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
One way to get around this is to define
datacolumns = indexed[<cols>].copy()
Another would be to do
indexed = indexed[<cols>]
which effectively removes the columns you don't want, if you're happy that you won't need them again. You can then manipulate indexed at your leisure.
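Applied to the question's column selection, the first option would look like this (a sketch using the column names from the question):
# explicit copy, so later assignments work on an independent object
datacolumns = indexed[['address', 'subaddress', 'rx_or_tx', 'wordcount']
                      + [col for col in indexed.columns if 'DATA' in col]].copy()

# no SettingWithCopyWarning now when reformatting the DATA columns
for col in datacolumns.columns:
    if 'DATA' in col:
        datacolumns[col] = datacolumns[col].apply(lambda x: int(x, 16) & 0x0000ffff)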
As for the groupby, you could introduce a column of tuples to serve as the group key:
indexed['interaction_key'] = list(zip(indexed['address'], indexed['subaddress'], indexed['rx_or_tx']))
indexed.groupby('interaction_key').apply(
    lambda g: some_function(g.name, g))
I'm not sure if it's all exactly what you want, but let me know and I can edit.
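For the third issue, you don't actually need get_group(): a groupby object is iterable and yields (key, sub-dataframe) pairs, so you can loop over every group without knowing the keys ahead of time. A sketch using the question's grouping columns:
groups = datacolumns.groupby(['address', 'subaddress', 'rx_or_tx'])
for (address, subaddress, rx_or_tx), group in groups:
    # 'group' holds every message for this interaction key
    print(address, subaddress, rx_or_tx, len(group))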
