Split and create data from a column to many columns - python

I have a pandas data frame in which the values of one of its columns looks like that
print(VCF['INFO'].iloc[0])
Results (Sorry I can copy and paste this data as I am working from a cluster without an internet connection)
I need to create new columns with the name END, SVTYPE and SVLEN and their info as values of that columns. Following the example, this would be
END SVTYPE SVLEN-
224015456 DEL 223224913
The rest of the info contained in the column INFOI do not need it so far.
The information contained in this column is huge but as far I can read there is not more something=value as you can see in the picture.

Simply use .str.extract:
extracted = df['INFO'].str.extract('END=(?P<END>.+?);SVTYPE=(?P<SVTYPE>.+?);SVLEN=(?P<SVLEN>.+?);')
Output:
>>> extracted
END SVTYPE SVLEN
0 224015456 DEL -223224913

Related

How to maintain incrementing values in a column after removing rows from a data frame

I work with a test system that outputs a large CSV matrix of values which I then process using the Pandas module in Python. The parameters that system uses when testing a given part are governed by a predetermined sequence. A simplified example is shown here:
Raw data frame
However, not all of these steps are desired in the output data. In fact, the rows containing a 'Clock Frequency' value of '3.0MHz' are only included to act as buffer points to allow a climate chamber to reach the intended temperature. I do not wish to include data collected at these parameters in my results.
I found I was pretty easily able to remove these rows from my data frame by using the below code. Note that in this example I am working with a Pandas data frame called 'csvDF'.
tempBuffers = csvDF[csvDF['Clock Frequency']==3e6].index
csvDF.drop(tempBuffers, inplace=True)
This produces the following output:
Data frame with buffer steps removed
The issue with this is that now my 'Sequence Step' column is wrong. I want the data table to appear as if those buffer steps never existed. The sequence steps should be sequential for all non-buffer steps. The desired output is shown below:
Data frame with buffer steps removed and corrected sequence step column
What code do I need to instantiate in order to achieve this?
You can try something like this:
n = 3 # number of rows in step
csvDF.reset_index(inplace=True, drop=True)
csvDF['Sequence step'] = pd.Series(range(len(csvDF)))
csvDF['Sequence step'] = csvDF['Sequence step'].apply(lambda x: int(x / n))

Add dictionary to pandas data frame and ignore extra values

I'm reading lot of log files, from which I generate dictionary by parsing each log, I want to add this dictionary to dataframe, later I use this dataframe for analysis. But the information I need in dataframe may differ every time based on user input. So I don't want all the information in the dictionary to add in to data frame. I want the columns I defined in the data frame only to add to data frame.
As of now I'm adding all the dictionaries one by one to a list, then loading this dictionary to dataframe.
for log in log_lines:
# here logic to parse the log and generate the dictionary
my_dict_list.append(d)
pd.Dataframe(my_dict_list)
In this way it adds all the keys and their values to the dataframe,
but what I want is, I will define some columns, let's say user asks ['a','b','c'] columns for analysis, I want the dataframe to load only these keys and their values to the data frame, rest should be ignored.
my_dict_list =[ {'a':'abc','b':'123','c':'hello', 'date':'20-5-2019'},
{'a':'dfc','b':'453','c':'user', 'date':'23-5-2019'},
{'a':'bla','b':'2313','c':'anything', 'date':'25-5-2019'} ]
Note: I don't want this ignoring keys at the time extraction of logs, because I will be extracting lot of logs so its time consuming.
is there a way I can achieve this, using pandas in faster way?.
In tmp_Dict line you can filter only requested columns and save only requested columns.
def log_dataframe(log_lines, requested_columns):
for log in log_lines:
# here logic to parse the log and generate the dictionary
tmp_Dict = {requested_key : d[requested_key] for requested_key in request_columns}
my_dict_list.append(tmp_Dict)
return pd.Dataframe(my_dict_list)
I am just providing you some raw logic for your query i may be wrong on some part but if you find it helpful for you that will be very great you can mail me also for future queries I will be happy to help you.
columns = []
x = int(input('enter no of columns you need'))
for i in range(x):
print("Please specify columns")
columns = int(input())
columns.append(columns)
my_dict_list =[ {'a':'abc','b':'123','c':'hello', 'date':'20-5-2019'},
{'a':'dfc','b':'453','c':'user', 'date':'23-5-2019'},
{'a':'bla','b':'2313','c':'anything', 'date':'25-5-2019'} ]
for data in range(x):
value = pd.DataFrame(my_dict_list[columns[data]])
print(value[[data]])

How to create one large dataframe of specific excel information using python pandas from a list of excel paths

Perhaps an easy fix.
I am looking to extract specific information from many of the same style of excel workbooks within a directory and concatenate the specific information all in into one workbook (while changing the format). I have completed every part of this task except for successfully creating one big dataframe of n columns from the different workbooks(proportional to the number of xlsx files read). Each of the read workbooks has only one sheet ['Sheet1']. Does this sound like I am taking the right approach? I am currently using a for loop to gather this data.
Upon much research online (Github, youtube, stackoverflow), others say to make one big dataframe, then concatenate. I have tried to use a for loop to create this dataframe; however, I have not seen users "piece together" bits of data to form a dataframe the way I have. I don't believe this should hinder the operation. I realize I am not appending or concatenating, just not sure where to go with it.
for i in filepaths: #filepaths is a list of n filepaths`
df = pd.read_excel(i) #read the excel sheets`
info = otherslices #condensed form of added slices from df`
Final = pd.DataFrame(info) #expected big dataframe`
The expected results should be columns directly next to each other (one from each excel sheet respectively)
Excel1 Excel2 -> Excel(n)
info1a info1b
info2a info2b
info3a info3b
... ...
What I currently get when using "print(Final)" in loop is
Excel1
info1a
info2a
info3a
...
Excel2
info1b
info2b
info3b
...
|
Excel(n)
However, the dataframe I get from this loop (when I type "Final") is only
the very last excel workbook's data
I would create a list of data frames which you append in each loop then after the loop concate the list into a single data frame. So something like this.
Final=[]
for i in filepaths: #filepaths is a list of n filepaths`
df = pd.read_excel(i) #read the excel sheets`
info = otherslices #condensed form of added slices from df`
Final.append(info) #expected big dataframe`'
Final=pd.concat(Final)
I discovered my own solution to this problem.
Final = pd.DataFrame(index=range(95)) #95 is the number of rows I have for each column
n=0
for i in filepaths: #filepaths is a list of n filepaths
df = pd.read_excel(i) #read the excel sheets`
info = otherslices #condensed form of added slices from df`
Final[n]=pd.DataFrame(info)
n+=1
Final = Final.append(Final) #big dataframe of n columns
Final

how to locate row in dataframe without headers

I noticed that when using .loc in pandas dataframe, it not only finds the row of data I am looking for but also includes the header column names of the dataframe I am searching within.
So when I try to append the .loc row of data, it includes the data + column headers - I don't want any column headers!
##1st dataframe
df_futures.head(1)
date max min
19990101 2000 1900
##2nd dataframe
df_cash.head(1)
date$ max$ min$
1999101 50 40
##if date is found in dataframe 2, I will collect the row of data
data_to_track = []
for ii in range(len(df_futures['date'])):
##date I will try to find in df2
date_to_find = df_futures['date'][ii]
##append the row of data to my list
data_to_track.append(df_cash.loc[df_cash['Date$'] == date_to_find])
I want the for loop to return just 19990101 50 40
It currently returns 0 19990101 50 40, date$, max$, min$
I agree with other comments regarding the clarity of the question. However, if what you want to get is just a string that contains a particular row's data, then you could use to_string() method of Pandas.
In your case,
Instead of this:
df_cash.loc[df_cash['Date$'] == date_to_find]
You could get a string that includes only the row data:
df_cash[df_cash['Date$'] == date_to_find].to_string(header=None)
Also notice that I dropped the .loc part, which outputs the same result.
If your dataframe has multiple columns and you dont want them to be joined in a string (may bring data type issues and is potentially problematic if you want to separate them later on), you could use list() method such as:
list(df_cash[df_cash['Date$'] == date_to_find].iloc[0])

Pandas formatting column within DataFrame and adding timedelta Index error

I'm trying to use panda to do some analysis on some messaging data and am running into a few problems try to prep the data. It is coming from a database I don't have control of and therefore I need to do a little pruning and formatting before analyzing it.
Here is where I'm at so far:
#select all the messages in the database. Be careful if you get the whole test data base, may have 5000000 messages.
full_set_data = pd.read_sql("Select * from message",con=engine)
After I make this change to the timestamp, and set it as index, I'm no longer and to call to_csv.
#convert timestamp to a timedelta and set as index
#full_set_data[['timestamp']] = full_set_data[['timestamp']].astype(np.timedelta64)
indexed = full_set_data.set_index('timestamp')
indexed.to_csv('indexed.csv')
#extract the data columns I really care about since there as a bunch I don't need
datacolumns = indexed[['address','subaddress','rx_or_tx', 'wordcount'] + [col for col in indexed.columns if ('DATA' in col)]]
Here I need to format the DATA columns, I get a "SettingWithCopyWarning".
#now need to format the DATA columns to something useful by removing the upper 4 bytes
for col in datacolumns.columns:
if 'DATA' in col:
datacolumns[col] = datacolumns[col].apply(lambda x : int(x,16) & 0x0000ffff)
datacolumns.to_csv('data_col.csv')
#now group the data by "interaction key"
groups = datacolumns.groupby(['address','subaddress','rx_or_tx'])
I need to figure out how to get all the messages from a given group. get_group() requires I know key values ahead of time.
key_group = groups.get_group((1,1,1))
#foreach group in groups:
#do analysis
I have tried everything I could think of to fix the problems I'm running into but I cant seem to get around it. I'm sure it's from me misunderstanding/misusing Pandas as I'm still figuring it out.
I looking to solve these issues:
1) Can't save to csv after I add index of timestamp as timedelta64
2) How do I apply a function to a set of columns to remove SettingWithCopyWarning when reformatting DATA columns.
3) How to grab the rows for each group without having to use get_group() since I don't know the keys ahead of time.
Thanks for any insight and help so I can better understand how to properly use Pandas.
Firstly, you can set the index column(s) and parse dates while querying the DB:
indexed = pd.read_sql_query("Select * from message", engine=engine,
parse_dates='timestamp', index_col='timestamp')
Note I've used pd.read_sql_query here rather than pd.read_sql, which is deprecated, I think.
SettingWithCopy warning is due to the fact that datacolumns is a view of indexed, i.e. a subset of it's rows /columns, not an object in it's own right. Check out this part of the docs: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
One way to get around this is to define
datacolumns = indexed[<cols>].copy()
Another would to do
indexed = indexed[<cols>]
which effectively removes the columns you don't want, if you're happy that you won't need them again. You can then manipulate indexed at your leisure.
As for the groupby, you could introduce a columns of tuples which would be the group keys:
indexed['interaction_key'] = zip(indexed[['address','subaddress','rx_or_tx']]
indexed.groupby('interaction_key').apply(
lambda df: some_function(df.interaction_key, ...)
I'm not sure if it's all exactly what you want but let me know and I can edit.

Categories

Resources