Split pandas dataframe sentence, by text between () - python

I have a dataframe like this:
| Index | Labels                                              | Text                |
|-------|-----------------------------------------------------|---------------------|
| 1.    | [(Task, add), (Application, MAMP), (Task, delete)]  | "add or delete"     |
| 2.    | [(Servername, abhs)]                                 | "servername create" |
Is it possible to split it like this:
| Index | Task        | Application | Servername | Text                |
|-------|-------------|-------------|------------|---------------------|
| 1.    | add, delete | MAMP        |            | "add or delete"     |
| 2.    |             |             | abhs       | "servername create" |
Basically, Labels is a list of tuples. The first entry of each tuple is the key and the second entry is the value. I want the key to become the column name and the value to become the cell value for that row. If another value appears with the same key, the values should be joined together.
Columns whose key does not appear in a row should stay empty.
df = pd.DataFrame()
df["Index"] = ["1.","2."]
df["Labels"] =[[("Task","add"),("Application", "MAMP"), ("Task", "delete")],[("Servername", "abhs")]]
df["Text"] = ["add or delete", "servername create"]
At the moment I'm looking for ")" and splitting on it, but that treats the tuples as strings, and I think it shouldn't be that complicated. There has to be a better way.

You can do it as follows:
df = pd.DataFrame()
df["Index"] = ["1.","2."]
df["Labels"] =[[("Task","add"),("Application", "MAMP"), ("Task", "delete")],[("Servername", "abhs")]]
df["Text"] = ["add or delete", "servername create"]
def get_labels(labels):
    label_dict = {}
    for label, value in labels:
        label_dict[label] = label_dict.get(label, []) + [value]
    for key in label_dict:
        label_dict[key] = ", ".join(label_dict[key])  # in case you want it as a string and not as a list
    return label_dict

df = df.merge(df["Labels"].apply(lambda s: pd.Series(get_labels(s))),
              left_index=True, right_index=True).drop(["Labels"], axis=1).fillna('')
print(df)
Output:
  Index               Text Application Servername         Task
0    1.      add or delete        MAMP             add, delete
1    2.  servername create                   abhs
As mentioned in the comment, you don't need the second for loop. If you delete it, the output looks like this (which I think is better for future use, but that is up to you):
  Index               Text           Task Application Servername
0    1.      add or delete  [add, delete]      [MAMP]
1    2.  servername create                                [abhs]
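If you'd rather not write the helper function, a rough alternative sketch (my addition, not part of the answer above, assuming the same sample frame) is to explode the tuples into rows and pivot them back into columns:
import pandas as pd

df = pd.DataFrame()
df["Index"] = ["1.", "2."]
df["Labels"] = [[("Task", "add"), ("Application", "MAMP"), ("Task", "delete")],
                [("Servername", "abhs")]]
df["Text"] = ["add or delete", "servername create"]

# one row per (key, value) tuple
exploded = df.explode("Labels")
exploded["key"] = exploded["Labels"].str[0]
exploded["value"] = exploded["Labels"].str[1]

# pivot the keys back into columns, joining duplicate keys with ", "
wide = (exploded
        .pivot_table(index=["Index", "Text"], columns="key",
                     values="value", aggfunc=", ".join)
        .reset_index()
        .fillna(""))
print(wide)
This should give the same wide layout as the merge-based version, with empty strings where a key does not occur in a row.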


In Python, if there is a duplicate, use the date column to choose which duplicate to use

I have code that runs 16 test cases against a CSV, checking for anomalies from poor data entry. A new column, 'Test case failed,' is created. A number corresponding to which test it failed is added to this column when a row fails a test. These failed rows are separated from the passed rows; then, they are sent back to be corrected before they are uploaded into a database.
There are duplicates in my data, and I would like to add code to check for duplicates and then decide which fields to use based on the date, selecting the most recently updated fields.
Here is my data with two duplicate IDs, with the first row having the most recent Address while the second row has the most recent name.
| ID  | MnLast | MnFist | MnDead? | MnInactive? | SpLast | SpFirst | SPInactive? | SpDead | Addee        | Sal      | Address   | NameChanged | AddrChange |
|-----|--------|--------|---------|-------------|--------|---------|-------------|--------|--------------|----------|-----------|-------------|------------|
| 123 | Doe    | John   | No      | No          | Doe    | Jane    | No          | No     | Mr. John Doe | Mr. John | 123 place | 05/01/2022  | 11/22/2022 |
| 123 | Doe    | Dan    | No      | No          | Doe    | Jane    | No          | No     | Mr. John Doe | Mr. John | 789 road  | 11/01/2022  | 05/06/2022 |
Here is a snippet of my code showing the 5th test case, which checks for the following: the record has name information, the spouse has name information, no one is marked deceased, but the Addressee or Salutation doesn't contain "&" or "AND". The Addressee or Salutation needs to be corrected; this record is married.
import pandas as pd
import numpy as np
data = pd.read_csv("C:/Users/file.csv", encoding='latin-1' )
# Create array to store which test number the row failed
data['Test Case Failed']= ''
data = data.replace(np.nan,'',regex=True)
data.insert(0, 'ID', range(0, len(data)))
# There are several test cases, but they function primarily the same
# Testcase 1
# Testcase 2
# Testcase 3
# Testcase 4
# Testcase 5 - comparing strings in columns
df = data[((data['FirstName'] != '') & (data['LastName'] != '')) &
          ((data['SRFirstName'] != '') & (data['SRLastName'] != '') &
           (data['SRDeceased'].str.contains('Yes') == False) &
           (data['Deceased'].str.contains('Yes') == False))]
df1 = df[df['PrimAddText'].str.contains("AND|&") == False]
data_5 = df1[df1['PrimSalText'].str.contains("AND|&") == False]
ids = data_5.index.tolist()
# Assign 5 for each failed row
for i in ids:
    data.at[i, 'Test Case Failed'] += ', 5'
# Failed if column 'Test Case Failed' is not empty, Passed if empty
failed = data[data['Test Case Failed'] != ''].copy()
passed = data[data['Test Case Failed'] == ''].copy()
failed['Test Case Failed'] = failed['Test Case Failed'].str[1:]
failed = failed[failed['Test Case Failed'] != '']
# Clean up
del failed["ID"]
del passed["ID"]
failed['Test Case Failed'].value_counts()
# Print to console
print("There was a total of",data.shape[0], "rows.", "There was" ,data.shape[0] - failed.shape[0], "rows passed and" ,failed.shape[0], "rows failed at least one test case")
# output two files
failed.to_csv("C:/Users/Failed.csv", index = False)
passed.to_csv("C:/Users/Passed.csv", index = False)
What is the best approach to check for duplicates, choose the most updated fields, drop the outdated fields/row, and perform my test?
First, set up a mapping that associates each update-date column with its corresponding value columns.
date2val = {"AddrChange": ["Address"], "NameChanged": ["MnFist", "MnLast"], ...}
Then, transform date columns into datetime format to be able to compare them (using argmax later).
for key in date2val.keys():
    failed[key] = pd.to_datetime(failed[key])
Then, group the duplicates by ID (since ID is what decides whether a row is a duplicate); for each date column, take the maximum value in the group (which refers to the most recent update) and look up the columns to update from the initial mapping. I'll update the last row of the group and keep it as the final, corrected result (by appending it to the corrected list).
corrected = list()
for _, grp in failed.groupby("ID"):
    for key in date2val.keys():
        recent = grp[key].argmax()  # positional index of the most recent date in the group
        for col in date2val[key]:
            # positional indexing on both axes so the assignment modifies grp (not a copy)
            grp.iloc[-1, grp.columns.get_loc(col)] = grp.iloc[recent][col]
    corrected.append(grp.iloc[-1])
corrected = pd.DataFrame(corrected)
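As a possible next step (my assumption, not spelled out in the answer), the corrected representatives could replace the duplicated rows before re-running the test cases:
# keep rows whose ID appears only once, then append the corrected rows
deduped = pd.concat(
    [failed[~failed["ID"].duplicated(keep=False)], corrected],
    ignore_index=True,
)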
Preparing data:
import pandas as pd
c = 'ID MnLast MnFist MnDead? MnInactive? SpLast SpFirst SPInactive? SpDead Addee Sal Address NameChanged AddrChange'.split()
data1 = '123 Doe John No No Doe Jane No No Mr.JohnDoe Mr.John 123place 05/01/2022 11/22/2022'.split()
data2 = '123 Doe Dan No No Doe Jane No No Mr.JohnDoe Mr.John 789road 11/01/2022 05/06/2022'.split()
data3 = '8888 Brown Peter No No Brwon Peter No No Mr.PeterBrown M.Peter 666Avenue 01/01/2011 01/01/2011'.split()
df = pd.DataFrame(columns = c, data = [data1, data2, data3])
df.AddrChange = df.AddrChange.astype('datetime64[ns]')
df.NameChanged = df.NameChanged.astype('datetime64[ns]')
df
The DataFrame looks like the example.
Then you take a slice of the dataframe to avoid changing the original. After sorting, adjacent rows have the same ID and the first one holds the appropriate name:
df1 = df[['ID', 'MnFist', 'NameChanged']].sort_values(by=['ID', 'NameChanged'], ascending = False)
df1
Then you build a dictionary with df.ID as the key and the appropriate name as its value, which will be used to rebuild the whole MnFist column:
d = {}
for id in set(df.ID.values):
    df_mask = df1.ID == id  # filter only rows with the same id
    filtered_df = df1[df_mask]
    if len(filtered_df) <= 1:
        d[id] = filtered_df.iat[0, 1]  # id has only one row, so no changes
        continue
    for name in filtered_df.MnFist:
        if name in ['unknown', '', ' '] or name is None:  # discard unusable names
            continue
        else:
            d[id] = name  # found a usable name
    if id not in d.keys():
        d[id] = filtered_df.iat[0, 1]  # no usable name, so pick the first
print(d)
The partial output of the dictionary is:
{'8888': 'Peter', '123': 'Dan'}
Then you build all the column:
df.MnFist = [d[id] for id in df.ID]
df
The partial output is:
Then the same procedure is applied to the other column:
df1 = df[['ID', 'Address', 'AddrChange']].sort_values(by=['ID', 'AddrChange'], ascending = False)
df1
d = { id: df1.loc[df1.ID == id, 'Address'].values[0] for id in set(df.ID.values) }
d
df.Address = [d[id] for id in df.ID]
df
The final output is:
Edited after the author commented on the possibility of unknown or unusable data.
Let me restate what I understood from the question:
You have a dataset on which you are doing several sanity checks. (Looks like you already have everything in place for this step)
In the next step you are finding duplicate rows whose columns were updated at different dates. (I assume that you already have this)
Now, you are looking for a new dataset that has non-duplicated rows with updated fields using the latest date entries.
First, define different dates and their related columns in a form of dictionary:
date_to_cols = {"AddrChange": "Address", "NameChanged": ["MnLast", "MnFirst"]}
Next, group by "ID" and get the index of the maximum value for each date column. Once we have that index, we can pull the related fields for that date from the data.
data[list(date_to_cols.keys())] = data[list(date_to_cols.keys())].astype('datetime64[ns]')
latest_data = data.groupby('ID')[list(date_to_cols.keys())].idxmax().reset_index()
for date_field, cols_to_update in date_to_cols.items():
    latest_data[cols_to_update] = latest_data[date_field].apply(lambda x: data.iloc[x][cols_to_update])
    latest_data[date_field] = latest_data[date_field].apply(lambda x: data.iloc[x][date_field])
Next, you can merge these latest_data with the original data (after removing old columns):
cols_to_drop = list(latest_data.columns)
cols_to_drop.remove("ID")
data.drop(columns= cols_to_drop, inplace=True)
latest_data_all_fields = data.merge(latest_data, on="ID", how="left")
latest_data_all_fields.drop_duplicates(inplace=True)

Get values for potentially multiple matches from another dataframe

I want to fill the 'references' column in df_out with the 'ID' if the corresponding 'my_ID' in df_sp is contained in df_jira 'reference_ids'.
import pandas as pd
d_sp = {'ID': [1,2,3,4], 'my_ID': ["my_123", "my_234", "my_345", "my_456"], 'references':["","","2",""]}
df_sp = pd.DataFrame(data=d_sp)
d_jira = {'my_ID': ["my_124", "my_235", "my_346"], 'reference_ids': ["my_123, my_234", "", "my_345"]}
df_jira = pd.DataFrame(data=d_jira)
df_new = df_jira[~df_jira["my_ID"].isin(df_sp["my_ID"])].copy()
df_out = pd.DataFrame(columns=df_sp.columns)
needed_cols = list(set(df_sp.columns).intersection(df_new.columns))
for column in needed_cols:
    df_out[column] = df_new[column]
df_out['Related elements_my'] = df_jira['reference_ids']
Desired output df_out:
| ID | my_ID | references |
|----|-------|------------|
| | my_124| 1, 2 |
| | my_235| |
| | my_346| 3 |
What I tried so far is list comprehension, but I only managed to get the reference_ids "copied" from a helper column to my 'references' column with this:
for row, entry in df_out.iterrows():
    cpl_ids = [x for x in entry['Related elements_my'].split(', ') if any(vh_id == x for vh_id in df_cpl_list['my-ID'])]
    df_out.at[row, 'Related elements'] = ', '.join(cpl_ids)
I cannot wrap my head around how to get the specific 'ID's for the matches found by 'any()', or whether this is actually the way to go, since I need all the matches, not just whether there is any match.
Any hints are appreciated!
I work with python 3.9.4 on Windows (adding in case python 3.10 has any other solution)
Backstory: Moving data from Jira to MS SharePoint lists. (Therefore, the 'ID' does not equal the actual index in the dataframe, but is rather assigned by SharePoint upon insertion into the list. Hence, empty after running for the new entries.)
ref_df = df_sp[["ID","my_ID"]].set_index("my_ID")
df_out.references = df_out["Related elements_my"].apply(
    lambda x: ",".join(list(map(lambda y: "" if y == "" else str(ref_df.loc[y.strip()].ID),
                                x.split(",")))))
df_out[["ID","my_ID","references"]]
output:
    ID   my_ID references
0  NaN  my_124        1,2
1  NaN  my_235
2  NaN  my_346          3
what is map?
map works like [func(i) for i in lst]: it applies func to every element of lst, just in a lazier way that can also be faster.
You can read more about it here: https://realpython.com/python-map-function/
In our case the function is: lambda y: "" if y == "" else str(ref_df.loc[y.strip()].ID)
So if y is empty (y.strip() is there just to remove spaces), it maps to the empty string: "" if y == "", as for the row my_235 with no reference_ids.
Otherwise, it locates y in ref_df and gets the corresponding ID, i.e. it maps each my_ID to its ID.
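If the nested lambdas are hard to read, here is an equivalent loop-based sketch (my rewrite, not part of the original answer) that should behave the same way for this example data:
def lookup_ids(reference_ids):
    # map each referenced my_ID to its ID in df_sp, keeping empty entries empty
    ids = []
    for my_id in reference_ids.split(","):
        my_id = my_id.strip()
        ids.append("" if my_id == "" else str(ref_df.loc[my_id, "ID"]))
    return ",".join(ids)

df_out["references"] = df_out["Related elements_my"].apply(lookup_ids)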
Hope this is helpful :)

Create DataFrame from raw input

I am getting data as follows:-
$0011:0524-08-2021
$0021:0624-08-2021
&0011:0724-08-2021
&0021:0924-08-2021
$0031:3124-08-2021
&0031:3224-08-2021
$0041:3924-08-2021
&0041:3924-08-2021
$0012:3124-08-2021
&0012:3324-08-2021
In $0011:0524-08-2021, $ denotes the start of the string, 001 denotes the ID, 1:05 denotes the time, and 24-08-2021 denotes the date. Similarly, in &0011:0624-08-2021 everything is the same except that & denotes the end of the string.
Taking the above data I want to create a data frame as follows:-
1. $0011:0524-08-2021 &0011:0724-08-2021
2. $0021:0624-08-2021 &0021:0924-08-2021
3. $0031:3124-08-2021 &0031:3224-08-2021
4. $0041:3924-08-2021 &0041:3924-08-2021
5. $0012:3124-08-2021 &0012:3324-08-2021
Basically I want to sort the entries into a data frame as shown above. There are a few conditions that must be satisfied in doing so:
1.) Column1 should have only $ entries and Column2 should have only & entries.
2.) Both columns should be arranged in increasing order of time: Column1 with $ entries should be arranged in increasing order of time, and the same goes for Column2 with & entries.
If you're getting the lines as shown in your example, you can try:
import pandas as pd
def process_lines(lines):
    buffer = {}
    for line in map(str.strip, lines):
        id_ = line[1:4]
        if line[0] == "$":
            buffer[id_] = line
        elif line[0] == "&" and buffer.get(id_):
            yield buffer[id_], line
            del buffer[id_]
txt = """$0011:0524-08-2021
$0021:0624-08-2021
&0011:0724-08-2021
&0021:0924-08-2021
$0031:3124-08-2021
&0031:3224-08-2021
$0041:3924-08-2021
&0041:3924-08-2021
$0012:3124-08-2021
&0012:3324-08-2021"""
df = pd.DataFrame(process_lines(txt.splitlines()), columns=["A", "B"])
print(df)
Prints:
A B
0 $0011:0524-08-2021 &0011:0724-08-2021
1 $0021:0624-08-2021 &0021:0924-08-2021
2 $0031:3124-08-2021 &0031:3224-08-2021
3 $0041:3924-08-2021 &0041:3924-08-2021
4 $0012:3124-08-2021 &0012:3324-08-2021
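If the lines arrive in a text file rather than a string (my assumption; the file name below is hypothetical), the same generator can consume the file object directly:
with open("data.txt") as f:
    df = pd.DataFrame(process_lines(f), columns=["A", "B"])
print(df)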

Col names not detected - AnalysisException: Cannot resolve 'Name' given input columns 'col10'

I'm trying to run a transformation function in a pyspark script:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "dev", table_name = "test_csv", transformation_ctx = "datasource0")
...
dataframe = datasource0.toDF()
...
def to_long(df, by):
    # Filter dtypes and split into column names and type description
    cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
    # Spark SQL supports only homogeneous columns
    assert len(set(dtypes)) == 1, "All columns have to be of the same type"
    # Create and explode an array of (column_name, column_value) structs
    kvs = explode(array([
        struct(lit(c).alias("key"), col(c).alias("val")) for c in cols
    ])).alias("kvs")
    return df.select(by + [kvs]).select(by + ["kvs.key", "kvs.val"])
#to_long(df, ["A"])
....
df = to_long(dataframe, ["Name","Type"])
My dataset looks like this:
| Name    | 01/01(FRI) | 01/02(SAT) |
|---------|------------|------------|
| ALZA CZ | 0          | 0          |
| CLPA CZ | 1          | 5          |
My desired output is something like this:
| Name    | Type | Date       | Value |
|---------|------|------------|-------|
| ALZA CZ | New  | 01/01(FRI) | 0     |
| CLPA CZ | New  | 01/01(FRI) | 1     |
| ALZA CZ | Old  | 01/02(SAT) | 1     |
| CLPA CZ | Old  | 01/02(SAT) | 5     |
However, the last code line gives me an error similar to this:
AnalysisException: Cannot resolve 'Name' given input columns 'col10'
When I check:
df.show()
I see 'col1', 'col2' etc in the first row instead of the actual labels ( ["Name","Type"] ). Should I separately remove and then add the original column titles?
It seems that your metadata table was configured using the built-in CSV classifier. If this classifier isn't able to detect a header, it names the columns col1, col2, etc.
Your problem lies one stage before your ETL job, so in my opinion you shouldn't remove and re-add the original column titles, but rather fix your data import / schema detection by using a custom classifier.
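If changing the classifier isn't immediately possible, one alternative sketch (my suggestion, not part of the answer above; the S3 path is hypothetical) is to read the CSV directly with from_options and tell Glue that the first row is a header:
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/my-prefix/"]},  # hypothetical path
    format="csv",
    format_options={"withHeader": True},  # treat the first row as column names
    transformation_ctx="datasource0",
)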

Reading specific columns from CSV Python

I am trying to parse through a CSV file and extract a few columns from it.
| ID       | Code  | Phase | FBB | AM | Development status | AN REMARKS | stem  | year | IN -NAME | IN Year | Company |
|----------|-------|-------|-----|----|--------------------|------------|-------|------|----------|---------|---------|
| L2106538 | Rs124 | 4     |     |    | Unknown            |            | -pre- | 1982 | Domoedne | 1982    | XYZ     |
I would like to group and extract a few columns to upload them to different models.
For example, I would like to group the first 3 columns into one model, the next two into a different model, the first column together with columns 6 and 7 into another model, and so on.
I also need to keep the header of the file and store the data as key-value pairs so that I know which column should go to which field in a model.
This is what I have so far.
def group_header_value(file):
    reader = csv.DictReader(open(file, 'r'))  # to have the header and get the data as a key value pair
    all_result = []
    for row in reader:
        print row
        all_result.append(row)
    return all_result

def group_by_models(all_results):
    MD = range(1, 3)  # to get the required cols
    for every_row in all_results:
        contents = [(every_row[i] for i in MD)]
        print contents

def handle(self, *args, **options):
    database = options.get('database')
    filename = options.get('filename')
    all_results = group_header_value(filename)
    print 'grouped_bymodel', group_by_models(all_results)
This is what I get when I try to get the contents
grouped_by model: [<generator object <genexpr> at 0x7f9f5382e0f0>]
[<generator object <genexpr> at 0x7f9f5382e0a0>]
[<generator object <genexpr> at 0x7f9f5382e0f0>]
Is there a different approach to extracting particular columns with DictReader? How else can I extract the required columns using DictReader? Thanks
(every_row[i] for i in MD) is a generator expression. The syntax for a generator expression is (mostly) the same as that for a list comprehension, except that a generator expression is enclosed by parentheses, (...), while a list comprehension uses brackets, [...].
[(every_row[i] for i in MD)] is a list containing one element, the generator expression.
To fix your code with minimal changes, remove the parentheses:
def group_by_models(all_results):
    MD = range(1, 3)  # to get the required cols
    for every_row in all_results:
        contents = [every_row[i] for i in MD]
        print(contents)
You could also make group_by_models more reusable by making MD a parameter:
def group_by_models(all_results, MD=range(3)):
    for every_row in all_results:
        contents = [every_row[i] for i in MD]
        print(contents)
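Since DictReader already yields each row as a dict keyed by the header names, another option is to pick columns by name rather than by position. A small sketch (the column names and file name are assumptions based on the sample data):
import csv

model_a_cols = ["ID", "Code", "Phase"]  # hypothetical grouping for one model

with open("data.csv") as f:
    reader = csv.DictReader(f)
    for row in reader:
        contents = {name: row[name] for name in model_a_cols}
        print(contents)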
