How to drop duplicates with priority in pandas - python

I'm new to pandas and Python, and I want to remove duplicates but with a priority. It's hard to explain, so I will give an example to make it clear:
ID    Phone  Email
0001  0234+  null
0001  null   a#.com
0001  0234+  a#.com
How can I remove the duplicates in ID and keep the third row, because it has both a phone and an email, rather than removing rows at random?
And if an ID has no row with both values filled in, one row should still remain for it.

First drop the rows with NaN in Phone or Email, then drop the duplicates on ID:
df2 = df.dropna(subset=['Phone']).dropna(subset=['Email']).drop_duplicates('ID')

You can just drop the NaN values based on Phone and Email.
df.dropna(subset=['Phone', 'Email'], inplace=True)

I solved this by splitting each case into its own DataFrame: if both email and phone have a value the row goes into the first DataFrame, if only the email has a value it goes into the second, and so on.
Then I concatenate them into one final DataFrame, with the most important case on top,
and drop the duplicate IDs (so the most important case wins).
code:
# drop rows where both phone and email are null
# ("الجوال" = phone, "البريد الالكتروني" = email, "رقم الهوية" = ID)
ff = ff.dropna(subset=["الجوال", "البريد الالكتروني"], how="all")
# hh = rows where both phone and email are present
hh = ff.dropna(subset=["الجوال", "البريد الالكتروني"])
# ss = rows where the phone is present (the email may still be null)
ss = ff.dropna(subset=["الجوال"])
# yy = rows where the email is present (the phone may still be null)
yy = ff.dropna(subset=["البريد الالكتروني"])
# priority: concatenate with the most important case on top, so that
# drop_duplicates (which keeps the first occurrence) prefers it
df1 = pd.concat([hh, ss], axis=0)
len(hh) + len(ss)   # sanity check on the row count
df2 = pd.concat([df1, yy], axis=0)
len(df1) + len(yy)  # sanity check on the row count
final = df2.copy()
final = final.drop_duplicates(subset=["رقم الهوية"])
final.to_excel(r'Result.xlsx', index=False)
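A more compact way to express the same priority (a minimal sketch using the English column names ID, Phone and Email from the example above, plus a hypothetical second ID, rather than the author's actual data): count how many of the two contact fields are filled, sort so the most complete row comes first, and let drop_duplicates keep it.
import pandas as pd

df = pd.DataFrame({
    "ID": ["0001", "0001", "0001", "0002"],
    "Phone": ["0234+", None, "0234+", None],
    "Email": [None, "a#.com", "a#.com", None],
})

# Rank each row by how many of Phone/Email are filled, keep the most complete row per ID
df["completeness"] = df[["Phone", "Email"]].notna().sum(axis=1)
result = (df.sort_values("completeness", ascending=False)
            .drop_duplicates("ID")
            .drop(columns="completeness"))
print(result)   # ID 0001 keeps the row with both values; ID 0002 still keeps its only row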

Related

In Python, if there is a duplicate, use the date column to choose which duplicate to use

I have code that runs 16 test cases against a CSV, checking for anomalies from poor data entry. A new column, 'Test case failed,' is created. A number corresponding to which test it failed is added to this column when a row fails a test. These failed rows are separated from the passed rows; then, they are sent back to be corrected before they are uploaded into a database.
There are duplicates in my data, and I would like to add code to check for duplicates, then decide what field to use based on the date, selecting the most updated fields.
Here is my data with two duplicate IDs, with the first row having the most recent Address while the second row has the most recent name.
ID   MnLast  MnFist  MnDead?  MnInactive?  SpLast  SpFirst  SPInactive?  SpDead  Addee         Sal       Address    NameChanged  AddrChange
123  Doe     John    No       No           Doe     Jane     No           No      Mr. John Doe  Mr. John  123 place  05/01/2022   11/22/2022
123  Doe     Dan     No       No           Doe     Jane     No           No      Mr. John Doe  Mr. John  789 road   11/01/2022   05/06/2022
Here is a snippet of my code showing the 5th test case, which checks for the following: the record has name information, the spouse has name information, no one is marked deceased, but the addressee or salutation doesn't contain "&" or "AND". The addressee or salutation needs to be corrected; this record is married.
import pandas as pd
import numpy as np
data = pd.read_csv("C:/Users/file.csv", encoding='latin-1' )
# Create array to store which test number the row failed
data['Test Case Failed']= ''
data = data.replace(np.nan,'',regex=True)
data.insert(0, 'ID', range(0, len(data)))
# There are several test cases, but they function primarily the same
# Testcase 1
# Testcase 2
# Testcase 3
# Testcase 4
# Testcase 5 - comparing strings in columns
df = data[((data['FirstName'] != '') & (data['LastName'] != '')) &
          ((data['SRFirstName'] != '') & (data['SRLastName'] != '') &
           (data['SRDeceased'].str.contains('Yes') == False) &
           (data['Deceased'].str.contains('Yes') == False))]
df1 = df[df['PrimAddText'].str.contains("AND|&") == False]
data_5 = df1[df1['PrimSalText'].str.contains("AND|&") == False]
ids = data_5.index.tolist()
# Assign 5 for each failed row
for i in ids:
    data.at[i, 'Test Case Failed'] += ', 5'
# Failed if column 'Test Case Failed' is not empty, Passed if empty
failed = data[(data['Test Case Failed'] != '')]
passed = data[(data['Test Case Failed'] == '')]
failed['Test Case Failed'] =failed['Test Case Failed'].str[1:]
failed = failed[(failed['Test Case Failed'] != '')]
# Clean up
del failed["ID"]
del passed["ID"]
failed['Test Case Failed'].value_counts()
# Print to console
print("There was a total of",data.shape[0], "rows.", "There was" ,data.shape[0] - failed.shape[0], "rows passed and" ,failed.shape[0], "rows failed at least one test case")
# output two files
failed.to_csv("C:/Users/Failed.csv", index = False)
passed.to_csv("C:/Users/Passed.csv", index = False)
What is the best approach to check for duplicates, choose the most updated fields, drop the outdated fields/row, and perform my test?
First, try to set a mapping that associates update date columns to their corresponding value columns.
date2val = {"AddrChange": ["Address"], "NameChanged": ["MnFist", "MnLast"], ...}
Then, transform date columns into datetime format to be able to compare them (using argmax later).
for key in date2val.keys():
    failed[key] = pd.to_datetime(failed[key])
Then, group the duplicates by ID (since ID is the value that decides whether a row is a duplicate), and for each date column get the maximum value in the group (which refers to the most recent update) and retrieve the columns to update from the initial mapping. I'll update the last row of each group and set it as the final updated result (by putting it in the corrected list).
corrected = list()
for _, grp in failed.groupby("ID"):
    last = grp.iloc[-1].copy()              # take the last row of the group as the base record
    for key in date2val.keys():
        recent = grp[key].argmax()          # position of the most recent date within the group
        for col in date2val[key]:
            last[col] = grp.iloc[recent][col]   # pull the value from the most recently updated row
    corrected.append(last)
corrected = pd.DataFrame(corrected)
Preparing data:
import pandas as pd
c = 'ID MnLast MnFist MnDead? MnInactive? SpLast SpFirst SPInactive? SpDead Addee Sal Address NameChanged AddrChange'.split()
data1 = '123 Doe John No No Doe Jane No No Mr.JohnDoe Mr.John 123place 05/01/2022 11/22/2022'.split()
data2 = '123 Doe Dan No No Doe Jane No No Mr.JohnDoe Mr.John 789road 11/01/2022 05/06/2022'.split()
data3 = '8888 Brown Peter No No Brwon Peter No No Mr.PeterBrown M.Peter 666Avenue 01/01/2011 01/01/2011'.split()
df = pd.DataFrame(columns = c, data = [data1, data2, data3])
df['AddrChange'] = pd.to_datetime(df['AddrChange'])
df['NameChanged'] = pd.to_datetime(df['NameChanged'])
df
The resulting DataFrame matches the example in the question.
Then you take a slice of the DataFrame, so the original is not modified. After sorting, adjacent rows have the same ID and the first one has the appropriate (most recent) name:
df1 = df[['ID', 'MnFist', 'NameChanged']].sort_values(by=['ID', 'NameChanged'], ascending = False)
df1
Then you build a dictionary with df.ID as the key and the appropriate name as its value, which will be used to rebuild the whole MnFist column:
d = {}
for id in set(df.ID.values):
    df_mask = df1.ID == id              # filter only rows with the same id
    filtered_df = df1[df_mask]
    if len(filtered_df) <= 1:
        d[id] = filtered_df.iat[0, 1]   # id has only one row, so no changes
        continue
    for name in filtered_df.MnFist:
        if name in ['unknown', '', ' '] or name is None:   # discard unusable names
            continue
        else:
            d[id] = name                # found a serviceable name (rows are sorted most recent first)
            break
    if id not in d.keys():
        d[id] = filtered_df.iat[0, 1]   # no serviceable name, so pick the first
print(d)
The partial output of the dictionary is:
{'8888': 'Peter', '123': 'Dan'}
Then you build all the column:
df.MnFist = [d[id] for id in df.ID]
df
The MnFist column is now filled with the chosen names.
Then apply the same procedure to the other column:
df1 = df[['ID', 'Address', 'AddrChange']].sort_values(by=['ID', 'AddrChange'], ascending = False)
df1
d = { id: df1.loc[df1.ID == id, 'Address'].values[0] for id in set(df.ID.values) }
d
df.Address = [d[id] for id in df.ID]
df
The Address column is rebuilt the same way, giving the final output.
Edited after the author commented on the possibility of unknown or unserviceable data.
Let me restate what I understood from the question:
You have a dataset on which you are doing several sanity checks. (Looks like you already have everything in place for this step)
In the next step you are finding duplicate rows, with different columns updated at different dates. (I assume that you already have this)
Now, you are looking for a new dataset that has non-duplicated rows with updated fields using the latest date entries.
First, define the different date columns and their related value columns in the form of a dictionary:
date_to_cols = {"AddrChange": "Address", "NameChanged": ["MnLast", "MnFist"]}
Next, group by "ID" and get the index of the maximum value of each date column. Once we have that index, we can pull the related fields for that date from the data.
data[list(date_to_cols.keys())] = data[list(date_to_cols.keys())].astype('datetime64[ns]')
latest_data = data.groupby('ID')[list(date_to_cols.keys())].idxmax().reset_index()
for date_field, cols_to_update in date_to_cols.items():
    latest_data[cols_to_update] = latest_data[date_field].apply(lambda x: data.iloc[x][cols_to_update])
    latest_data[date_field] = latest_data[date_field].apply(lambda x: data.iloc[x][date_field])
Next, you can merge these latest_data with the original data (after removing old columns):
cols_to_drop = list(latest_data.columns)
cols_to_drop.remove("ID")
data.drop(columns= cols_to_drop, inplace=True)
latest_data_all_fields = data.merge(latest_data, on="ID", how="left")
latest_data_all_fields.drop_duplicates(inplace=True)
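For reference, here is a compact, self-contained sketch of the same idea applied to the two duplicate rows from the question (the frame below and the date-to-column mapping are illustrative assumptions, not the asker's real schema):
import pandas as pd

data = pd.DataFrame({
    "ID": [123, 123],
    "MnFist": ["John", "Dan"],
    "MnLast": ["Doe", "Doe"],
    "Address": ["123 place", "789 road"],
    "NameChanged": pd.to_datetime(["05/01/2022", "11/01/2022"]),
    "AddrChange": pd.to_datetime(["11/22/2022", "05/06/2022"]),
})
date_to_cols = {"AddrChange": ["Address"], "NameChanged": ["MnFist", "MnLast"]}

rows = []
for _, grp in data.groupby("ID"):
    merged = grp.iloc[-1].copy()          # start from one row of the group
    for date_col, cols in date_to_cols.items():
        newest = grp[date_col].idxmax()   # index label of the most recent update
        for col in cols:
            merged[col] = grp.loc[newest, col]
        merged[date_col] = grp.loc[newest, date_col]
    rows.append(merged)
result = pd.DataFrame(rows)
print(result)   # one row for ID 123: MnFist 'Dan' and Address '123 place' are the newest values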

My headers are in the first column of my txt file. I want to create a Pandas DF

Sample data from text file
[User]
employeeNo=123
last_name=Toole
first_name=Michael
language=english
email = michael.toole#123.ie
department=Marketing
role=Marketing Lead
[User]
employeeNo=456
last_name= Ronaldo
first_name=Juan
language=Spanish
email=juan.ronaldo#sms.ie
department=Data Science
role=Team Lead
Location=Spain
[User]
employeeNo=998
last_name=Lee
first_name=Damian
language=english
email=damian.lee#email.com
[User]
Wondering if someone could help me; you can see my sample dataset above. What I would like to do (please tell me if there is a more efficient way) is loop through the first column and, wherever one of the unique keys occurs (e.g. first_name, last_name, role, etc.), append the value from the corresponding row to that key's list, doing this for each unique key so that I end up with one row per user and one column per key.
I have read about multi-indexing and I'm not sure if that might be a better solution, but I couldn't get it to work (I'm quite new to Python).
# Define a list of selected persons
selectedList = textFile
# Define a list of searching person
searchList = ['uid']
# Define an empty list
foundList = []
# Iterate each element from the selected list
for index, sList in enumerate(textFile):
    # Match the element with the element of searchList
    if sList in searchList:
        # Store the value in foundList if the match is found
        foundList.append(selectedList[index])
You have a text file where each record starts with a [User] line and the data lines have a key=value format. I know of no module able to handle that automatically, but it is easy to parse by hand. The code could be:
import pandas as pd

with open('file.txt') as fd:
    data = []                           # a list of records
    for line in fd:
        line = line.strip()             # strip end of line
        if line == '[User]':            # new record
            row = {}                    # row will be a key: value dict
            data.append(row)
        else:
            k, v = line.split('=', 1)   # split on the first = character
            row[k] = v
df = pd.DataFrame(data)                 # list of key: value dicts => dataframe
With the sample data shown, we get:
employeeNo last_name first_name language email department role email Location
0 123 Toole Michael english michael.toole#123.ie Marketing Marketing Lead NaN NaN
1 456 Ronaldo Juan Spanish NaN Data Science Team Lead juan.ronaldo#sms.ie Spain
2 998 Lee Damian english NaN NaN NaN damian.lee#email.com NaN
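Note the two email columns in this output: the first record's line reads email = michael.toole#123.ie with spaces around the =, so its key is parsed as 'email ' (with a trailing space). If that is unwanted, stripping the key and value collapses them into one column; a minimal variation of the loop above:
import pandas as pd

data = []
with open('file.txt') as fd:
    for line in fd:
        line = line.strip()
        if line == '[User]':
            row = {}
            data.append(row)
        elif line:                       # skip blank lines, if any
            k, v = line.split('=', 1)
            row[k.strip()] = v.strip()   # 'email ' and 'email' now map to the same column
df = pd.DataFrame(data)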
I'm sure there is a more optimal way to do this, but one approach is to get the unique list of keys, extract the rows for each key in a loop, and combine them into a new DataFrame, finally setting the desired column names.
import pandas as pd
import numpy as np
import io
data = '''
[User]
employeeNo=123
last_name=Toole
first_name=Michael
language=english
email=michael.toole#123.ie
department=Marketing
role="Marketing Lead"
[User]
employeeNo=456
last_name= Ronaldo
first_name=Juan
language=Spanish
email=juan.ronaldo#sms.ie
department="Data Science"
role=Team Lead
Location=Spain
[User]
employeeNo=998
last_name=Lee
first_name=Damian
language=english
email=damian.lee#email.com
[User]
'''
df = pd.read_csv(io.StringIO(data), sep='=', comment='[', header=None)
new_cols = df[0].unique()
new_df = pd.DataFrame()
for col in new_cols:
    tmp = df[df[0] == col]
    tmp.reset_index(inplace=True)
    new_df = pd.concat([new_df, tmp[1]], axis=1)
new_df.columns = new_cols
new_df['User'] = None
new_df = new_df[['User','employeeNo','last_name','first_name','language','email','department','role','Location']]
new_df
User employeeNo last_name first_name language email department role Location
0 None 123 Toole Michael english michael.toole#123.ie Marketing Marketing Lead Spain
1 None 456 Ronaldo Juan Spanish juan.ronaldo#sms.ie Data Science Team Lead NaN
2 None 998 Lee Damian english damian.lee#email.com NaN NaN NaN
Rewrite, based on testing of the previous version's offset values:
import pandas as pd
# Revised from previous answer - ensures key value pairs are contained to the same
# record - previous version assumed the first record had all the expected keys -
# inadvertently assigned (Location) value of second record to the first record
# which did not have a Location key
# This version should perform better - only dealing with one single df
# - and using pandas own pivot() function
textFile = 'file.txt'
filter = '[User]'
# Decoration - enabling a check and balance - how many users are we processing?
textFileOpened = open(textFile,'r')
initialRead = textFileOpened.read()
userCount = initialRead.count(filter) # sample has 4 [User] entries - but only three actual unique records
print ('User Count {}'.format(userCount))
# Create sets so able to manipulate and interrogate
allData = []
oneRow = []
userSeq = 0
#Iterate through file - assign record key and [userSeq] Key to each pair
with open(textFile, 'r') as fp:
    for fileLineSeq, line in enumerate(fp):
        if filter in str(line):
            userSeq = userSeq + 1   # Ensures each key value pair is grouped
        else:
            userSeq = userSeq
        oneRow = [fileLineSeq, userSeq, line]
        allData.append(oneRow)
df = pd.DataFrame(allData)
df.columns = ['FileRow','UserSeq','KeyValue'] # rename columns
userSeparators = df[df['KeyValue'] == str(filter+'\n') ].index # Locate [User Records]
df.drop(userSeparators, inplace = True) # Remove [User] records
df = df.replace(' = ' , '=' , regex=True ) # Input data dirty - cleaning up
df = df.replace('\n' , '' , regex=True ) # remove the new lines appended during the list generation
# print(df) # Test as necessary here
# split KeyValue column into two
df[['Key', 'Value']] = df.KeyValue.str.split('=', expand=True)
# very powerful function - convert to table
df = df.pivot(index='UserSeq', columns='Key', values='Value')
print(df)
Results
User Count 4
Key Location department email employeeNo first_name language last_name role
UserSeq
1 NaN Marketing michael.toole#123.ie 123 Michael english Toole Marketing Lead
2 Spain Data Science juan.ronaldo#sms.ie 456 Juan Spanish Ronaldo Team Lead
3 NaN NaN damian.lee#email.com 998 Damian english Lee NaN

Pandas reading tall data into a DataFrame

I have a text file which consists of tall data. I want to iterate through each line within the text file and create a Dataframe.
The text file looks like this; note that the same fields don't exist for all Users (e.g. some might have an email field, some might not). Also note that each User is separated by [User]:
[User]
Field=Data
employeeNo=123
last_name=Toole
first_name=Michael
language=english
department=Marketing
role=Marketing Lead
[User]
employeeNo=456
last_name= Ronaldo
first_name=Juan
language=Spanish
email=juan.ronaldo#sms.ie
department=Data Science
role=Team Lead
Location=Spain
[User]
employeeNo=998
last_name=Lee
first_name=Damian
language=english
email=damian.lee#email.com
[User]
My issue is as follows:
My code iterates through the data, but for any field that is not present for a given User it moves down the list and takes the next piece of data relating to that field.
For example, the first User does not have an email associated with him, so the code assigns the email of the second user in the list; what I want instead is to return NaN/N/A/blank if no information is available.
## Import Libraries
import pandas as pd
import numpy as np
from pandas import DataFrame
## Import Data
## Set column names so that no lines in the text file are missed
col_names = ['Field', 'Data']
## If you have been sent this script you need to change the file path below to where you have the .txt file saved
textFile = pd.read_csv(r'Desktop\SampleData.txt', delimiter="=", engine='python', names=col_names)
## Get a list of the unique IDs
new_cols = pd.unique(textFile['Field'])
userListing_DF = pd.DataFrame()
## Iterate through the first column to get the unique keys, then concatenate each key's values as a new column
for col in new_cols:
    tmp = textFile[textFile['Field'] == col]
    tmp.reset_index(inplace=True)
    userListing_DF = pd.concat([userListing_DF, tmp['Data']], axis=1)
userListing_DF.columns = new_cols
Read in the single long column, then form a group indicator by checking where the value is '[User]'. Then separate the column labels and values with str.split and join them back to your DataFrame. Finally, pivot to your desired shape.
df = pd.read_csv('test.txt', sep='\n', header=None)
df['Group'] = df[0].eq('[User]').cumsum()
df = df[df[0].ne('[User]')] # No longer need these rows
df = pd.concat([df, df[0].str.split('=', expand=True).rename(columns={0: 'col', 1: 'val'})],
axis=1)
df = df.pivot(index='Group', columns='col', values='val').rename_axis(columns=None)
Field Location department email employeeNo first_name language last_name role
Group
1 Data NaN Marketing NaN 123 Michael english Toole Marketing Lead
2 NaN Spain Data Science juan.ronaldo#sms.ie 456 Juan Spanish Ronaldo Team Lead
3 NaN NaN NaN damian.lee#email.com 998 Damian english Lee NaN
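One caveat: newer pandas versions reject sep='\n' in read_csv. If you run into that, building the single-column frame by reading the lines yourself works the same way, and the remaining steps above are unchanged; a small sketch (assuming the same test.txt file):
import pandas as pd

# Read each physical line as one value in a single column, with no separator parsing
with open('test.txt') as fh:
    df = pd.DataFrame({0: [line.rstrip('\n') for line in fh]})

df['Group'] = df[0].eq('[User]').cumsum()
df = df[df[0].ne('[User]')]   # then continue with the split and pivot shown above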

List to dataframe, list to multiple lists, single column to dataframe

Still figuring out programming, so help is appreciated! I have a single column of information that I would ultimately like to turn into a DataFrame. I could transpose it, but the address information varies: it is either 2 lines or 3 lines (some have suite numbers, etc.).
It generally looks like this.
name x,
ID 1,
123-xyz,
ID 2,
abcdefg,
ACTIVITY,
ggg,
TYPE,
C,
COUNTY,
orange county,
ADDRESS,
123 stack st,
city state zip,
PHONE,
111-111-1111,
EXPIRES,
date,
name y,
ID 1,
456-abc,
ID 2,
cvbnmnb,
ACTIVITY,
ggg,
TYPE,
A,
COUNTY,
dakota county,
ADDRESS,
234 overflow st,
lot a,
city state zip,
PHONE,
000-000-0000,
EXPIRES,
date,
name z,
...,
I was thinking of creating new lists for all desired columns and conditionally appending values with a for loop.
for i in list
    if value = ID
        append previous value to name list
        append next value to ID list
    elif value = phone
        send next value to phone
    elif value = address
        evaluate 3 rows down
        if value = phone
            concatenate previous two values and append to address list
        if value != phone
            concatenate current and previous 2 values and append to address list
    else print error message
Would this be a decently efficient option for lists of around ~20,000 values?
I don't really know how to write this, I am using python in a jupyter notebook. Looking for solutions but also looking to learn more!
-EDIT-
A user had suggested a while loop, and the original data sample I gave was simplified and contained 4 fields. My actual set contained 9, and I tried playing around but unfortunately wasn't able to figure it out on my own.
count = 0            # Pointer to start of a cluster
lengthdf = len(df)   # Length of the existing dataframe, used as the terminating condition
while count != lengthdf:
    name = id1 = id2 = activity = type = county = address = phone = expires = ""   # Reset the fields for every cluster of information
    name = df[0][count]          # Name is always the first line of the cluster
    id1 = df[0][count+2]         # id is always the third line of the cluster
    id2 = df[0][count+4]
    activity = df[0][count+6]
    type = df[0][count+8]
    county = df[0][count+10]
    n = 11
    while df[0][count+n] != "Phone":   # While the row is not 'PHONE', everything in between is the address, appended and separated by commas
        address = address + df[0][count+n] + ", "
        n += 1
    phone = df[0][count+n+1]     # Phone number is always the row after 'PHONE', and is only 1 line
    expires = df[0][count+n+3]
    n += 2
    newdf = newdf.append({'NAME': name, 'ID 1': id1, 'ID 2': id2, 'ACTIVITY': activity, 'TYPE': type, 'COUNTY': county, 'ADDRESS': address, 'Phone': phone, 'Expires': expires}, ignore_index=True)   # Append the data into the new dataframe
    count = count + n
You seem to have a brief understanding of what you need to do judging by the pseudocode you provided!
I'm assuming that your xlsx file looks something like this without the commas.
Based on your sample data, this is what I can come with for you. I'll be referencing each user data as a 'cluster'.
This code works under a few assumptions:
The PHONE field always has only 1 line of data
There is complete data for every cluster (or, if there is missing data, a blank exists on the next row).
Data is always in this particular order (i.e. name, ID, address, Phone)
count will be like a pointer to the start of a cluster, while n will be the offset from count. Read the comments for the explanations.
import pandas as pd
df = pd.read_excel(r'test.xlsx', header=None)   # Import xlsx file
newdf = pd.DataFrame(columns=['name', 'id', 'address', 'phone'])   # Create a blank dataframe
count = 0            # Pointer to start of a cluster
lengthdf = len(df)   # Length of the existing dataframe, used as the terminating condition
while count != lengthdf:
    this_add = this_name = this_id = this_phone = ""   # Reset the fields for every cluster of information
    this_name = df[0][count]     # Name is always the first line of the cluster
    this_id = df[0][count+2]     # id is always the third line of the cluster
    n = 4
    while df[0][count+n] != "PHONE":   # While the row is not 'PHONE', everything in between is the address, appended and separated by commas
        this_add = this_add + df[0][count+n] + ", "
        n += 1
    this_phone = df[0][count+n+1]   # Phone number is always the row after 'PHONE', and is only 1 line
    n += 2
    newdf = newdf.append({'name': this_name, 'id': this_id, 'address': this_add, 'phone': this_phone}, ignore_index=True)   # Append the data into the new dataframe
    count = count + n
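A side note, not part of the original answer: DataFrame.append was deprecated and then removed in pandas 2.0, and appending row by row is also slow. The same loop can collect plain dicts and build the frame once at the end; a sketch of that change, using the same hypothetical test.xlsx input:
import pandas as pd

df = pd.read_excel(r'test.xlsx', header=None)
rows = []                                  # collect one dict per cluster
count = 0
while count < len(df):
    this_name = df[0][count]               # name: first line of the cluster
    this_id = df[0][count + 2]             # id: third line of the cluster
    n, this_add = 4, ""
    while df[0][count + n] != "PHONE":     # address: everything up to the 'PHONE' marker
        this_add += df[0][count + n] + ", "
        n += 1
    this_phone = df[0][count + n + 1]      # phone: the line after 'PHONE'
    rows.append({'name': this_name, 'id': this_id,
                 'address': this_add, 'phone': this_phone})
    count += n + 2
newdf = pd.DataFrame(rows)                 # build the DataFrame once, no append needed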
As for performance, I honestly do not think there is much optimisation that can be done given the nature of the dataset (I might be wrong). You may have noticed my solution is pretty "hard-coded" to reduce the need for if-else statements, but 20,000 lines should not be a huge problem for a Jupyter Notebook. It may take a couple of minutes, but that should be alright.
I hope this gets you started on tackling other scenarios you may encounter with the remaining datasets!

parsing dataframe columns containing functions

Python/pandas newbie here. The csv file I'm trying to work with has been populated with data that looks something like this:
A                                                     B           C      D
Option1(item1=12345, item12='string', item345=0.123)  2020-03-16  1.234  Option2(item4=123, item56=234, item678=345)
I'd like it to look like this:
item1 item12 item345 B C item4 item56 item678
12345 'string' 0.123 2020-03-16 1.234 123 234 345
In other words, I want to replace columns A and D with new columns headed by what's on the left of the equal sign, using what's to the right of the equal sign as the corresponding value, and with the Option1() and Option2() parts and the commas stripped out. The columns that don't contain functions should be left as is.
Is there an elegant way to do this?
Actually, at this point, I'd settle for any old way, elegant or not; I've found various ways of dealing with this situation if, say, there were dicts populating columns, but nothing to help me pick it apart if there are functions there. Trying to search for the answer only gives me a bunch of results for how to apply functions to dataframes.
As long as your functions always have the same arguments, this should work.
You can read the csv with (if separators are 2 or more spaces, that's what I get when I paste your question example):
df = pd.read_csv('test.csv', sep=r'\s{2,}', index_col=False, engine='python')
If your dataframe is df:
# break out both sides of the equal sign in function into columns
A_vals = df['A'].str.extractall(r'([\w\d]+)=([^,\)]*)')
# get rid of the multi-index and put the values after '=' into columns
A_converted = A_vals.unstack(level=-1)[1]
# set column names to values before '='
A_converted.columns = list(A_vals.unstack(level=-1)[0].values[0])
# same thing for 'D'
D_vals = df['D'].str.extractall(r'([\w\d]+)=([^,\)]*)')
D_converted = D_vals.unstack(level=-1)[1]
D_converted.columns = list(D_vals.unstack(level=-1)[0].values[0])
# join everything together
df = A_converted.join(df.drop(['A','D'], axis=1)).join(D_converted)
Some clarification on the regex: '([\w\d]+)=([^,\)]*)' has two capture groups (each part in parens):
Group 1 ([\w\d]+) is one or more characters (+) that are word characters \w or numbers \d.
= between groups.
Group 2 ([^,\)]*) is 0 or more characters (*) that are not (^) a comma , or paren \).
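To make the intermediate steps concrete, here is a tiny standalone demo of what extractall and unstack produce (a hypothetical one-row Series, not the asker's file):
import pandas as pd

s = pd.Series(["Option1(item1=12345, item12='string', item345=0.123)"])
vals = s.str.extractall(r'([\w\d]+)=([^,\)]*)')
# vals has a (row, match) MultiIndex; column 0 holds the names, column 1 the values
wide = vals.unstack(level=-1)[1]                           # one row per original row, one column per match
wide.columns = list(vals.unstack(level=-1)[0].values[0])   # label the columns with the names before '='
print(wide)   # columns item1, item12, item345 with their values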
I believe you're looking for something along these lines:
contracts = ["Option(conId=384688665, symbol='SPX', lastTradeDateOrContractMonth='20200116', strike=3205.0, right='P', multiplier='100', exchange='SMART', currency='USD', localSymbol='SPX 200117P03205000', tradingClass='SPX')",
"Option(conId=12345678, symbol='DJX', lastTradeDateOrContractMonth='20200113', strike=1205.0, right='P', multiplier='200', exchange='SMART', currency='USD', localSymbol='DJXX 333117Y13205000', tradingClass='DJX')"]
new_conts = []
columns = []
for i in range (len(contracts)):
mod = contracts[i].replace('Option(','').replace(')','')
contracts[i] = mod
new_cont = contracts[i].split(',')
new_conts.append(new_cont)
for contract in new_conts:
column = []
for i in range (len(contract)):
mod = contract[i].split('=')
contract[i] = mod[1]
column.append(mod[0])
columns.append(column)
print(len(columns[0]))
df = pd.DataFrame(new_conts,columns=columns[0])
df
Output:
conId symbol lastTradeDateOrContractMonth strike right multiplier exchange currency localSymbol tradingClass
0 384688665 'SPX' '20200116' 3205.0 'P' '100' 'SMART' 'USD' 'SPX 200117P03205000' 'SPX'
1 12345678 'DJX' '20200113' 1205.0 'P' '200' 'SMART' 'USD' 'DJXX 333117Y13205000' 'DJX'
Obviously you can then delete unwanted columns, change names, etc.
