This may be a simple or repeat question, but I could not find or figure out how to do it yet.
I have two csv files:
info.csv:
"Last Name", First Name, ID, phone, adress, age X [Total age: 100] |009076
abc, xyz, 1234, 982-128-0000, pqt,
bcd, uvw, 3124, 813-222-1111, tre,
poi, ccc, 9087, 123-45607890, weq,
and then
age.csv:
student_id,age_1
3124,20
9087,21
1234,45
I want to compare the two csv files, based on the columns "id" from info.csv and "student_id" from age.csv and take the corresponding "age_1" data and put it into the "age" column in info.csv.
So the final output should be:
info.csv:
"Last Name", First Name, ID, phone, adress, age X [Total age: 100] |009076
abc, xyz, 1234, 982-128-0000, pqt,45
bcd, uvw, 3124, 813-222-1111, tre,20
poi, ccc, 9087, 123-45607890, weq,21
I am able to simply join the tables based on the keys into a new.csv, but I can't put the data into the column titled "age". I used "csvkit" to do that.
Here is what I used:
csvjoin -c 3,1 info.csv age.csv > new.csv
You can use Pandas and update the info dataframe using the age data. You do it by setting the index of both data frames to ID and student_id respectively, then update the age column in the info dataframe. After that you reset the index so ID becomes a column again.
from io import StringIO  # Python 3 location; the StringIO module existed only in Python 2
import pandas as pd
info = StringIO("""Last Name,First Name,ID,phone,adress,age X [Total age: 100] |009076
abc, xyz, 1234, 982-128-0000, pqt,
bcd, uvw, 3124, 813-222-1111, tre,
poi, ccc, 9087, 123-45607890, weq,""")
age = StringIO("""student_id,age_1
3124,20
9087,21
1234,45""")
info_df = pd.read_csv(info, sep=",", engine='python')
age_df = pd.read_csv(age, sep=",", engine='python')
info_df = info_df.set_index('ID')
age_df = age_df.set_index('student_id')
# DataFrame.update works in place and aligns on the index (ID / student_id)
info_df.update(age_df.rename(columns={'age_1': 'age X [Total age: 100] |009076'}))
info_df.reset_index(level=0, inplace=True)
info_df
outputs:
ID Last Name First Name phone adress age X [Total age: 100] |009076
0 1234 abc xyz 982-128-0000 pqt 45
1 3124 bcd uvw 813-222-1111 tre 20
2 9087 poi ccc 123-45607890 weq 21
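The same lookup can also be sketched with `Series.map` instead of `update`. This is a minimal sketch, assuming the header and rows look exactly like the sample in the question; `AGE_COL` and `age_by_id` are names introduced here for illustration only.

```python
from io import StringIO

import pandas as pd

# Assumption: the age column keeps the long header text from the question.
AGE_COL = "age X [Total age: 100] |009076"

info_csv = StringIO(
    '"Last Name", First Name, ID, phone, adress, ' + AGE_COL + "\n"
    "abc, xyz, 1234, 982-128-0000, pqt,\n"
    "bcd, uvw, 3124, 813-222-1111, tre,\n"
    "poi, ccc, 9087, 123-45607890, weq,\n"
)
age_csv = StringIO("student_id,age_1\n3124,20\n9087,21\n1234,45\n")

# skipinitialspace drops the blank that follows each comma in headers and values
info_df = pd.read_csv(info_csv, skipinitialspace=True)
age_df = pd.read_csv(age_csv)

# Build an ID -> age lookup, then map it onto the ID column
age_by_id = age_df.set_index("student_id")["age_1"]
info_df[AGE_COL] = info_df["ID"].map(age_by_id)

print(info_df)
```

Rows whose ID is missing from age.csv would end up with NaN in the age column, so this variant overwrites rather than preserves existing ages.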
Try this...
import csv
# Python 3: open CSV files in text mode with newline='' ('rb'/'wb' were the Python 2 idiom)
with open("info.csv", newline='') as f:
    info = list(csv.reader(f))
with open("age.csv", newline='') as f:
    age = list(csv.reader(f))
def copyCSV(age, info, outFileName = 'out.csv'):
    # put age into dict, indexed by ID
    # assumes no duplicate entries
    # 1 - build a dict ageDict to represent data
    ageDict = dict([(entry[0].replace(' ',''), entry[1]) for entry in age[1:] if entry != []])
    # 2 - setup output
    with open(outFileName, 'w', newline='') as outFile:
        outwriter = csv.writer(outFile)
        # 3 - run through info and slot in ages and write to output
        # nb: had to use .replace(' ','') to strip out whitespaces - these may not be in original .csv
        outwriter.writerow(info[0])
        for entry in info[1:]:
            if entry != []:
                key = entry[2].replace(' ','')
                if key in ageDict: # checks that you have data from age.csv
                    entry[5] = ageDict[key]
                outwriter.writerow(entry)
copyCSV(age, info)
Let me know if it works or if anything is unclear. I've used a dict because it should be faster if your files are massive, as you only have to loop through the data in age.csv once.
There may be a simpler way / something already implemented...but this should do the trick.
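To try the dict-lookup idea without touching files on disk, here is a minimal in-memory sketch. It assumes the same column positions as the answer (ID at index 2, age at index 5); `fill_ages` and its parameters are hypothetical names used only for this illustration.

```python
import csv
import io


def fill_ages(info_text, age_text, id_index=2, age_index=5):
    """Return the info CSV text with the age field filled from the age CSV text."""
    age_rows = csv.reader(io.StringIO(age_text))
    next(age_rows)  # skip the "student_id,age_1" header
    ages = {row[0].strip(): row[1].strip() for row in age_rows if row}

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    info_rows = csv.reader(io.StringIO(info_text))
    writer.writerow(next(info_rows))  # copy the header row
    for row in info_rows:
        if not row:
            continue
        key = row[id_index].strip()
        if key in ages:  # only overwrite when age.csv has this ID
            row[age_index] = ages[key]
        writer.writerow(row)
    return out.getvalue()


info_text = (
    '"Last Name", First Name, ID, phone, adress, age X [Total age: 100] |009076\n'
    "abc, xyz, 1234, 982-128-0000, pqt,\n"
    "bcd, uvw, 3124, 813-222-1111, tre,\n"
    "poi, ccc, 9087, 123-45607890, weq,\n"
)
age_text = "student_id,age_1\n3124,20\n9087,21\n1234,45\n"

result = fill_ages(info_text, age_text)
print(result)
```

Note that the csv module drops the quotes around "Last Name" when it writes the header back, since they are not needed there.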
Related
I need to make a dataframe from two txt files.
The first txt file looks like this Street_name space id.
The second txt file looks like this City_name space id.
Example:
text file 1:
Roseberry st 1234
Brooklyn st 4321
Wolseley 1234567
text file 2:
Winnipeg 4321
Winnipeg 1234
Ste Anne 1234567
I need to make one dataframe out of this. Sometimes there is just one word for Street_name, and sometimes more. The same goes for City_name.
I get an error: ParserError: Error tokenizing data. C error: Expected 2 fields in line 5, saw 3. This happens because I'm trying to put both words of the street name into the same column, but I don't know how to do it. I want one column for street name (no matter if it consists of one or more words), one for city name and one for id.
I want a df with 3 rows and 3 cols.
Thanks!
Edit: both text files are huge (50+ million rows each), so I need this code not to break and to be optimised for large files.
This is NOT valid CSV, so you may need to parse it on your own.
You can use the normal open() and read(), then split on newlines to create a list of lines. After that, you can use a for-loop with line.rsplit(" ", 1) to split each line on its last space.
Minimal working example:
I use io to simulate file in memory - so everyone can simply copy and test it - but you should use open()
text = '''Roseberry st 1234
Brooklyn st 4321
Wolseley 1234567'''
import io
#with open('filename') as fh:
with io.StringIO(text) as fh:
    lines = fh.read().splitlines()
print(lines)
lines = [line.rsplit(" ", 1) for line in lines]
print(lines)
import pandas as pd
df = pd.DataFrame(lines, columns=['name', 'number'])
print(df)
Result:
['Roseberry st 1234', 'Brooklyn st 4321', 'Wolseley 1234567']
[['Roseberry st', '1234'], ['Brooklyn st', '4321'], ['Wolseley', '1234567']]
name number
0 Roseberry st 1234
1 Brooklyn st 4321
2 Wolseley 1234567
EDIT:
read_csv can use a regex to define the separator (e.g. sep="\s+" for runs of whitespace), and it can even use lookahead/lookbehind ((?=...)/(?<=...)) to check whether there is a digit after the space without consuming it as part of the separator.
text = '''Roseberry st 1234
Brooklyn st 4321
Wolseley 1234567'''
import io
import pandas as pd
#df = pd.read_csv('filename', names=['name', 'number'], sep=r'\s(?=\d)', engine='python')
df = pd.read_csv(io.StringIO(text), names=['name', 'number'], sep=r'\s(?=\d)', engine='python')
print(df)
Result:
name number
0 Roseberry st 1234
1 Brooklyn st 4321
2 Wolseley 1234567
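To see what the lookahead separator actually does, it can be tried with `re.split` on single lines. This is only a sketch: "Route 66 1234" is a made-up line, and the stricter pattern is an assumption that the id is always the final all-digit token.

```python
import re

loose = r"\s(?=\d)"      # pattern from the answer: whitespace followed by any digit
strict = r"\s(?=\d+$)"   # assumption: split only before the trailing all-digit id

# Works for the sample data, where only the id starts with a digit
print(re.split(loose, "Roseberry st 1234"))

# A hypothetical name containing a number is split in too many places...
print(re.split(loose, "Route 66 1234"))

# ...unless the lookahead also requires the digits to run to the end of the line
print(re.split(strict, "Route 66 1234"))
```

If names in the real files can contain digit-initial words, the stricter pattern may be the safer choice for `sep=`.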
Later you can combine both dataframes using .join() or .merge() with the parameter on= (or something similar), like in an SQL query.
text1 = '''Roseberry st 1234
Brooklyn st 4321
Wolseley 1234567'''
text2 = '''Winnipeg 4321
Winnipeg 1234
Ste Anne 1234567'''
import io
import pandas as pd
df1 = pd.read_csv(io.StringIO(text1), names=['street name', 'id'], sep=r'\s(?=\d)', engine='python')
df2 = pd.read_csv(io.StringIO(text2), names=['city name', 'id'], sep=r'\s(?=\d)', engine='python')
print(df1)
print(df2)
df = df1.merge(df2, on='id')
print(df)
Result:
street name id
0 Roseberry st 1234
1 Brooklyn st 4321
2 Wolseley 1234567
city name id
0 Winnipeg 4321
1 Winnipeg 1234
2 Ste Anne 1234567
street name id city name
0 Roseberry st 1234 Winnipeg
1 Brooklyn st 4321 Winnipeg
2 Wolseley 1234567 Ste Anne
Pandas doc: Merge, join, concatenate and compare
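Since the question mentions very large files, here is a sketch that iterates over the lines instead of calling read() on the whole file, and stores the id as an integer. `read_name_id` is a hypothetical helper, and io.StringIO stands in for open('filename'). The parsed values are still held in memory, so for 50 million rows this may need further work such as chunking.

```python
import io

import pandas as pd


def read_name_id(lines, name_col):
    """Parse 'some name 1234' lines into a two-column DataFrame."""
    names, ids = [], []
    for line in lines:  # file objects yield one line at a time
        line = line.rstrip("\n")
        if not line:
            continue
        name, id_text = line.rsplit(" ", 1)  # split on the last space only
        names.append(name)
        ids.append(int(id_text))
    return pd.DataFrame({name_col: names, "id": ids})


text1 = "Roseberry st 1234\nBrooklyn st 4321\nWolseley 1234567\n"
text2 = "Winnipeg 4321\nWinnipeg 1234\nSte Anne 1234567\n"

streets = read_name_id(io.StringIO(text1), "street name")
cities = read_name_id(io.StringIO(text2), "city name")
merged = streets.merge(cities, on="id")
print(merged)
```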
There's nothing that I'm aware of in pandas that does this automatically.
Below, I built a script that will merge those addresses (addy + st) into a single column, then merges the two data frames into one based on the "id".
I assume your actual text files are significantly larger, so assuming they follow the pattern set in the two examples, this script should work fine.
Basically, this code turns each line of text in the file into a list, then combines lists of length 3 into length 2 by combining the first two list items.
After that, it turns the "list of lists" into a dataframe and merges those dataframes on column "id".
Couple caveats:
Make sure you set the correct text file paths
Make sure the first line of the text files contains 2, single string column headers (ie: "address id") or (ie: "city id")
Make sure each text file id column header is named "id"
import pandas as pd
import numpy as np
# set both text file paths (you may need full path i.e. C:\Users\Name\bla\bla\bla\text1.txt)
text_path_1 = r'text1.txt'
text_path_2 = r'text2.txt'
# reads the lines of the first text file
with open(text_path_1) as f1:
    text_file_1 = f1.readlines()
# reads the lines of the second text file
with open(text_path_2) as f2:
    text_file_2 = f2.readlines()
# function that massages data into two columns (to put "st" into same column as address name)
def data_massager(text_file_lines):
    data_list = []
    for item in text_file_lines:
        stripped_item = item.strip('\n')
        split_stripped_item = stripped_item.split(' ')
        if len(split_stripped_item) == 3:
            split_stripped_item[0:2] = [' '.join(split_stripped_item[0 : 2])]
        data_list.append(split_stripped_item)
    return data_list
# runs function on both text files
data_list_1 = data_massager(text_file_1)
data_list_2 = data_massager(text_file_2)
# creates dataframes on both text files
df1 = pd.DataFrame(data_list_1[1:], columns = data_list_1[0])
df2 = pd.DataFrame(data_list_2[1:], columns = data_list_2[0])
# merges data based on id (make sure both text files' id is named "id")
merged_df = df1.merge(df2, how='left', on='id')
# prints dataframe (assuming you're using something like jupyter-lab)
merged_df
pandas has strong support for strings. You can make the lines of each file into a Series and then use a regular expression to separate the fields into separate columns. I assume that "id" is the common value that links the two datasets, so it can become the dataframe index and the columns can just be added together.
import pandas as pd
street_series = pd.Series([line.strip() for line in open("text1.txt")])
street_df = street_series.str.extract(r"(.*?) (\d+)$")
del street_series
street_df.rename({0:"street", 1:"id"}, axis=1, inplace=True)
street_df.set_index("id", inplace=True)
print(street_df)
city_series = pd.Series([line.strip() for line in open("text2.txt")])
city_df = city_series.str.extract(r"(.*?) (\d+)$")
del city_series
city_df.rename({0:"city", 1:"id"}, axis=1, inplace=True)
city_df.set_index("id", inplace=True)
print(city_df)
street_df["city"] = city_df["city"]
print(street_df)
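The extract step can be tried in memory with named groups, which give the columns their names directly. This is a sketch under the assumption that every line ends in a space followed by a numeric id; the in-memory Series stand in for the two text files.

```python
import pandas as pd

street_lines = pd.Series(["Roseberry st 1234", "Brooklyn st 4321", "Wolseley 1234567"])
city_lines = pd.Series(["Winnipeg 4321", "Winnipeg 1234", "Ste Anne 1234567"])

# Named groups become column names; the id is the trailing run of digits
pattern = r"^(?P<name>.*) (?P<id>\d+)$"

street_df = (
    street_lines.str.extract(pattern)
    .rename(columns={"name": "street"})
    .set_index("id")
)
city_df = (
    city_lines.str.extract(pattern)
    .rename(columns={"name": "city"})
    .set_index("id")
)

# join aligns the two frames on the shared "id" index
combined = street_df.join(city_df)
print(combined)
```

The ids stay strings here; converting them with astype(int) before setting the index is an option if numeric ids are wanted.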
I have a CSV file with a lot of rows and different number of columns.
How to group data by count of columns and show it in different frames?
File CSV has the following data:
1 OLEG US FRANCE BIG
1 OLEG FR 18
1 NATA 18
Because each row has a different number of columns, I have to group rows by column count and show 3 frames so that I can then set the headers:
FR1:
ID NAME STATE COUNTRY HOBBY
1 OLEG US FRANCE BIG
FR2:
ID NAME COUNTRY AGE
1 OLEG FR 18
FR3:
ID NAME AGE
1 NATA 18
In other words, I need to group rows by count of columns and show them in different dataframes.
Since pandas doesn't allow rows of different lengths in one dataframe, just don't use it to import your data. Your goal is to create three separate df's, so first import the data as lists, and then deal with the different lengths.
One way to solve this is to read the data with csv.reader and create the df's with list comprehensions together with a condition on the length of each list.
import csv
import pandas as pd
with open('input.csv', 'r') as f:
    reader = csv.reader(f, delimiter=' ')
    data= list(reader)
df1 = pd.DataFrame([item for item in data if len(item)==3], columns='ID NAME AGE'.split())
df2 = pd.DataFrame([item for item in data if len(item)==4], columns='ID NAME COUNTRY AGE'.split())
df3 = pd.DataFrame([item for item in data if len(item)==5], columns='ID NAME STATE COUNTRY HOBBY'.split())
print(df1, df2, df3, sep='\n\n')
ID NAME AGE
0 1 NATA 18
ID NAME COUNTRY AGE
0 1 OLEG FR 18
ID NAME STATE COUNTRY HOBBY
0 1 OLEG US FRANCE BIG
If you need to hardcode too many lines for the same step (e.g. too many df's), then you should consider using a loop to create them and store each dataframe as key/value in a dictionary.
EDIT
Here is a slightly optimized way of creating those df's. I think you can't get around creating a list of the columns you want to use for the separate df's, so you need to know which column counts occur in your data (unless you want to create those df's without naming the columns).
col_list=[['ID', 'NAME', 'AGE'],['ID', 'NAME', 'COUNTRY', 'AGE'],['ID', 'NAME', 'STATE', 'COUNTRY', 'HOBBY']]
with open('input.csv', 'r') as f:
    reader = csv.reader(f, delimiter=' ')
    data= list(reader)
dict_of_dfs = {}
for cols in col_list:
    dict_of_dfs[f'df_{len(cols)}'] = pd.DataFrame([item for item in data if len(item)==len(cols)], columns=cols)
for key,val in dict_of_dfs.items():
    print(f'{key=}: \n {val} \n')
key='df_3':
ID NAME AGE
0 1 NATA 18
key='df_4':
ID NAME COUNTRY AGE
0 1 OLEG FR 18
key='df_5':
ID NAME STATE COUNTRY HOBBY
0 1 OLEG US FRANCE BIG
Now you don't have variables for your df's; instead you have them in a dictionary under keys. (I named each df after the number of columns it has, so df_3 is the df with three columns.)
If you need to import the data with pandas, you could have a look at this post.
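If the set of column counts is not known up front, the rows can be bucketed by length in a single pass. This is a minimal sketch using the sample rows from the question; `rows_by_width`, `headers` and `frames` are names made up for the illustration, and widths without a known header simply get default integer column names.

```python
import csv
import io
from collections import defaultdict

import pandas as pd

raw = "1 OLEG US FRANCE BIG\n1 OLEG FR 18\n1 NATA 18\n"

# One pass over the file: bucket each row by its number of fields
rows_by_width = defaultdict(list)
for row in csv.reader(io.StringIO(raw), delimiter=" "):
    if row:
        rows_by_width[len(row)].append(row)

# Headers for the widths we know about; other widths fall back to 0, 1, 2, ...
headers = {
    3: ["ID", "NAME", "AGE"],
    4: ["ID", "NAME", "COUNTRY", "AGE"],
    5: ["ID", "NAME", "STATE", "COUNTRY", "HOBBY"],
}

frames = {
    width: pd.DataFrame(rows, columns=headers.get(width))
    for width, rows in rows_by_width.items()
}

for width in sorted(frames):
    print(width, frames[width], sep="\n")
```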
I have code that runs 16 test cases against a CSV, checking for anomalies from poor data entry. A new column, 'Test case failed,' is created. A number corresponding to which test it failed is added to this column when a row fails a test. These failed rows are separated from the passed rows; then, they are sent back to be corrected before they are uploaded into a database.
There are duplicates in my data, and I would like to add code to check for duplicates, then decide what field to use based on the date, selecting the most updated fields.
Here is my data with two duplicate IDs, with the first row having the most recent Address while the second row has the most recent name.
ID  | MnLast | MnFist | MnDead? | MnInactive? | SpLast | SpFirst | SPInactive? | SpDead | Addee        | Sal      | Address   | NameChanged | AddrChange
123 | Doe    | John   | No      | No          | Doe    | Jane    | No          | No     | Mr. John Doe | Mr. John | 123 place | 05/01/2022  | 11/22/2022
123 | Doe    | Dan    | No      | No          | Doe    | Jane    | No          | No     | Mr. John Doe | Mr. John | 789 road  | 11/01/2022  | 05/06/2022
Here is a snippet of my code showing the 5th testcase, which checks for the following: Record has Name information, Spouse has name information, no one is marked deceased, but Addressee or salutation doesn't have "&" or "AND." Addressee or salutation needs to be corrected; this record is married.
import pandas as pd
import numpy as np
data = pd.read_csv("C:/Users/file.csv", encoding='latin-1' )
# Create array to store which test number the row failed
data['Test Case Failed']= ''
data = data.replace(np.nan,'',regex=True)
data.insert(0, 'ID', range(0, len(data)))
# There are several test cases, but they function primarily the same
# Testcase 1
# Testcase 2
# Testcase 3
# Testcase 4
# Testcase 5 - comparing strings in columns
df = data[((data['FirstName']!='') & (data['LastName']!='')) &
((data['SRFirstName']!='') & (data['SRLastName']!='') &
(data['SRDeceased'].str.contains('Yes')==False) & (data['Deceased'].str.contains('Yes')==False)
)]
df1 = df[df['PrimAddText'].str.contains("AND|&")==False]
data_5 = df1[df1['PrimSalText'].str.contains("AND|&")==False]
ids = data_5.index.tolist()
# Assign 5 for each failed
for i in ids:
    data.at[i,'Test Case Failed']+=', 5'
# Failed if column 'Test Case Failed' is not empty, Passed if empty
failed = data[(data['Test Case Failed'] != '')]
passed = data[(data['Test Case Failed'] == '')]
failed['Test Case Failed'] =failed['Test Case Failed'].str[1:]
failed = failed[(failed['Test Case Failed'] != '')]
# Clean up
del failed["ID"]
del passed["ID"]
failed['Test Case Failed'].value_counts()
# Print to console
print("There was a total of",data.shape[0], "rows.", "There was" ,data.shape[0] - failed.shape[0], "rows passed and" ,failed.shape[0], "rows failed at least one test case")
# output two files
failed.to_csv("C:/Users/Failed.csv", index = False)
passed.to_csv("C:/Users/Passed.csv", index = False)
What is the best approach to check for duplicates, choose the most updated fields, drop the outdated fields/row, and perform my test?
First, try to set a mapping that associates update date columns to their corresponding value columns.
date2val = {"AddrChange": ["Address"], "NameChanged": ["MnFist", "MnLast"], ...}
Then, transform date columns into datetime format to be able to compare them (using argmax later).
for key in date2val.keys():
    failed[key] = pd.to_datetime(failed[key])
Then, group by ID the duplicates (since ID is the value that decides whether it is a duplicate), and for each date column get the maximum value in the group (which refers to the most recent update) and retrieve the columns to update from the initial mapping. I'll update the last row and set it as the final updated result (by putting it in corrected list).
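corrected = list()
for _, grp in failed.groupby("ID"):
    # work on a copy of the last row; assigning through grp.iloc[-1][col] would be lost
    last = grp.iloc[-1].copy()
    for key in date2val.keys():
        recent = grp[key].argmax()  # position of the most recent date in the group
        for col in date2val[key]:
            last[col] = grp.iloc[recent][col]
    corrected.append(last)
corrected = pd.DataFrame(corrected)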
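A self-contained sketch of this group-and-pick approach, run on the two duplicate rows from the question. Only a subset of the columns is kept for brevity, `collapse` is a hypothetical helper name, and the date format is assumed to be month/day/year as in the sample.

```python
import pandas as pd

df = pd.DataFrame(
    [
        {"ID": 123, "MnLast": "Doe", "MnFist": "John", "Address": "123 place",
         "NameChanged": "05/01/2022", "AddrChange": "11/22/2022"},
        {"ID": 123, "MnLast": "Doe", "MnFist": "Dan", "Address": "789 road",
         "NameChanged": "11/01/2022", "AddrChange": "05/06/2022"},
    ]
)

# Each date column decides which value columns are taken from the newest row
date2val = {"AddrChange": ["Address"], "NameChanged": ["MnFist", "MnLast"]}
for date_col in date2val:
    df[date_col] = pd.to_datetime(df[date_col], format="%m/%d/%Y")


def collapse(grp):
    """Merge one group of duplicate rows into a single, most-recent row."""
    out = grp.iloc[-1].copy()
    for date_col, value_cols in date2val.items():
        newest = grp.loc[grp[date_col].idxmax()]  # row with the latest date
        out[date_col] = newest[date_col]
        for col in value_cols:
            out[col] = newest[col]
    return out


corrected = pd.DataFrame([collapse(g) for _, g in df.groupby("ID")])
corrected = corrected.reset_index(drop=True)
print(corrected)
```

Unlike the loop above, this sketch also copies the newest date itself into the result, which is a design choice rather than something the answer specifies.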
Preparing data:
import pandas as pd
c = 'ID MnLast MnFist MnDead? MnInactive? SpLast SpFirst SPInactive? SpDead Addee Sal Address NameChanged AddrChange'.split()
data1 = '123 Doe John No No Doe Jane No No Mr.JohnDoe Mr.John 123place 05/01/2022 11/22/2022'.split()
data2 = '123 Doe Dan No No Doe Jane No No Mr.JohnDoe Mr.John 789road 11/01/2022 05/06/2022'.split()
data3 = '8888 Brown Peter No No Brwon Peter No No Mr.PeterBrown M.Peter 666Avenue 01/01/2011 01/01/2011'.split()
df = pd.DataFrame(columns = c, data = [data1, data2, data3])
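# convert the date columns and assign the result back (astype alone does not modify df)
df['AddrChange'] = pd.to_datetime(df['AddrChange'], format='%m/%d/%Y')
df['NameChanged'] = pd.to_datetime(df['NameChanged'], format='%m/%d/%Y')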
df
DataFrame is like the example:
Then you take a slice of the dataframe to avoid changing the original. After sorting, rows with the same ID are adjacent and the first one has the appropriate name:
df1 = df[['ID', 'MnFist', 'NameChanged']].sort_values(by=['ID', 'NameChanged'], ascending = False)
df1
Then you build a dictionary with df.ID as the key and the appropriate name as its value. It is used to rebuild the whole MnFist column:
d = {}
for id in set(df.ID.values):
    df_mask = df1.ID == id # filter only rows with same id
    filtered_df = df1[df_mask]
    if len(filtered_df) <= 1:
        d[id] = filtered_df.iat[0, 1] # id has only one row, so no changes
        continue
    for name in filtered_df.MnFist:
        if name in ['unknown', '', ' '] or name is None: # names to discard
            continue
        else:
            d[id] = name # found a usable name; rows are sorted newest first, so stop here
            break
    if id not in d.keys():
        d[id] = filtered_df.iat[0, 1] # no usable name, so pick the first
print(d)
The partial output of the dictionary is:
{'8888': 'Peter', '123': 'Dan'}
Then you rebuild the whole column:
df.MnFist = [d[id] for id in df.ID]
df
The partial output is:
Then the same procedure to the other column:
df1 = df[['ID', 'Address', 'AddrChange']].sort_values(by=['ID', 'AddrChange'], ascending = False)
df1
d = { id: df1.loc[df1.ID == id, 'Address'].values[0] for id in set(df.ID.values) }
d
df.Address = [d[id] for id in df.ID]
df
The final output is:
Edited after the author commented on the possibility of unknown or unusable data.
Let me restate what I understood from the question:
You have a dataset on which you are doing several sanity checks. (Looks like you already have everything in place for this step)
In the next step you find duplicate rows with different columns updated at different dates. (I assume that you already have this.)
Now, you are looking for a new dataset that has non-duplicated rows with updated fields using the latest date entries.
First, define the different dates and their related columns in the form of a dictionary:
date_to_cols = {"AddrChange": "Address", "NameChanged": ["MnLast", "MnFist"]}
Next, apply group by using "ID" and then get the index for maximum value of different dates. Once we have the index, we can pull the related fields for that date from the data.
date_cols = list(date_to_cols.keys())
data[date_cols] = data[date_cols].apply(pd.to_datetime)
latest_data = data.groupby('ID')[date_cols].idxmax().reset_index()
for date_field, cols_to_update in date_to_cols.items():
    # idxmax returns index labels, so look the rows up with .loc
    latest_data[cols_to_update] = latest_data[date_field].apply(lambda x: data.loc[x][cols_to_update])
    latest_data[date_field] = latest_data[date_field].apply(lambda x: data.loc[x][date_field])
Next, you can merge these latest_data with the original data (after removing old columns):
cols_to_drop = list(latest_data.columns)
cols_to_drop.remove("ID")
data.drop(columns= cols_to_drop, inplace=True)
latest_data_all_fields = data.merge(latest_data, on="ID", how="left")
latest_data_all_fields.drop_duplicates(inplace=True)
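As a cross-check, the same "latest fields per ID" result can be sketched without apply, by sorting on each date column and keeping the last row per ID. This is a swapped-in technique, not the method above; `latest_fields` is a hypothetical helper, and the sample rows (including the third, non-duplicated ID) are taken from the data in this thread with only the relevant columns kept.

```python
import pandas as pd

data = pd.DataFrame(
    {
        "ID": [123, 123, 8888],
        "MnLast": ["Doe", "Doe", "Brown"],
        "MnFist": ["John", "Dan", "Peter"],
        "Address": ["123 place", "789 road", "666 Avenue"],
        "NameChanged": ["05/01/2022", "11/01/2022", "01/01/2011"],
        "AddrChange": ["11/22/2022", "05/06/2022", "01/01/2011"],
    }
)
date_to_cols = {"AddrChange": ["Address"], "NameChanged": ["MnLast", "MnFist"]}
for date_col in date_to_cols:
    data[date_col] = pd.to_datetime(data[date_col], format="%m/%d/%Y")


def latest_fields(frame, mapping, key="ID"):
    """For each date column, keep the value columns from the newest row per key."""
    pieces = []
    for date_col, cols in mapping.items():
        newest = (
            frame.sort_values(date_col)
            .drop_duplicates(subset=key, keep="last")  # last row = latest date
            .set_index(key)[[date_col, *cols]]
            .sort_index()
        )
        pieces.append(newest)
    return pd.concat(pieces, axis=1).reset_index()


latest = latest_fields(data, date_to_cols)
print(latest)
```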
I have two CSVs. The first one contains a list of all previous customers with IDs assigned to them. And a new csv in which I'm auto generating IDs with following code:
df['ID'] = pd.to_datetime('today').strftime('%m%d%y') + df.index.map(str)
OLD.csv
ID FirstName LastName
1 John Smith
2 Jack Ma
3 John Wick
.... .... ....
210906ABC3 Jon Snow
210907ABC0 Peter Parker
210907ABC1 Tony Stark
NEW.csv with current script
ID FirstName LastName
210908ABC0 Black Widow
210908ABC1 Steve Rogers
210908ABC2 John Wick
210908ABC3 John Rambo
210908ABC4 Tony Stark
I need to compare the FirstName, LastName columns from the CSVs and if the customer already exists in OLD.csv, instead of generating a new ID, it should take the ID value from OLD.csv
Expected output for NEW.csv
ID FirstName LastName
210908ABC1 Black Widow
210908ABC2 Steve Rogers
3 John Wick
210908ABC3 John Rambo
1 John Smith
In the future I might need to compare three or four columns and only assign the IDs if all the columns match. FirstName and LastName and (CellPhone or Address) and (Location or SSN)
If you have both files in two dataframes, df1 and df2, you can merge the two, then update the ID in the first file and print only the columns from the first file. This will only work for files up to a few thousand rows, as the merge is quite slow.
df2.columns = [x + "_2" for x in df2.columns] # to avoid auto renaming by pd
result = pd.merge(df1, df2, how='left', left_on = key_cols1, right_on = key_cols2)
# update the ID column
result.ID = np.where(result.ID_2.isnull(), result.ID, result.ID_2)
print(result.to_csv(index=False,columns=df1.columns))
Edit:
This is a simple working example. file1 (df1) is the file you want to update, and file2 is the file that contains the IDs you want to copy over to file1.
import pandas as pd, numpy as np, argparse, os
parser = argparse.ArgumentParser(description='update id in file1 with id from file2.')
parser.add_argument('-k', help='key column both file', required=True)
parser.add_argument('file1', help='file1 to be updated')
parser.add_argument('file2', help='file2 contains updates for file1')
args = parser.parse_args()
if not os.path.isfile(args.file1): raise ValueError('File does not exist: ' + args.file1)
if not os.path.isfile(args.file2): raise ValueError('File does not exist: ' + args.file2)
df1 = pd.read_csv(args.file1,dtype=str,header=0)
df2 = pd.read_csv(args.file2,dtype=str,header=0)
df2.columns = [x + "_2" for x in df2.columns]
key_col1 = [list(df1.columns)[int(x)] for x in args.k.split(",")]
key_col2 = [list(df2.columns)[int(x)] for x in args.k.split(",")]
result = pd.merge(df1, df2, how='left', left_on = key_col1, right_on = key_col2)
result.ID = np.where(result.ID_2.isnull(), result.ID, result.ID_2)
print(result.to_csv(index=False,columns=df1.columns))
use as follows:
$ python merge.py -k 1,2 file1.csv file2.csv
ID,FirstName,LastName
210908ABC0,Black,Widow
210908ABC1,Steve,Rogers
3,John,Wick
210908ABC3,John,Rambo
210907ABC1,Tony,Stark
Make sure that the key is unique per row; otherwise you can get multiple matches in the join, generating extra rows in the output file.
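The merge-then-overwrite idea can also be tried directly on in-memory frames built from the sample rows in the question. This is a sketch, assuming the match is on FirstName and LastName only; `keys`, `lookup` and `ID_old` are names introduced for the illustration. Comparing more columns later would mean extending `keys`.

```python
import pandas as pd

old = pd.DataFrame(
    {
        "ID": ["1", "2", "3", "210906ABC3", "210907ABC0", "210907ABC1"],
        "FirstName": ["John", "Jack", "John", "Jon", "Peter", "Tony"],
        "LastName": ["Smith", "Ma", "Wick", "Snow", "Parker", "Stark"],
    }
)
new = pd.DataFrame(
    {
        "ID": ["210908ABC0", "210908ABC1", "210908ABC2", "210908ABC3", "210908ABC4"],
        "FirstName": ["Black", "Steve", "John", "John", "Tony"],
        "LastName": ["Widow", "Rogers", "Wick", "Rambo", "Stark"],
    }
)

keys = ["FirstName", "LastName"]

# One row per customer in the lookup, so the left merge cannot add rows
lookup = old.drop_duplicates(subset=keys)[keys + ["ID"]].rename(columns={"ID": "ID_old"})
merged = new.merge(lookup, on=keys, how="left")

# Prefer the existing ID where a match was found, else keep the generated one
merged["ID"] = merged["ID_old"].fillna(merged["ID"])
result = merged[new.columns]
print(result.to_csv(index=False))
```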
Sample data from text file
[User]
employeeNo=123
last_name=Toole
first_name=Michael
language=english
email = michael.toole#123.ie
department=Marketing
role=Marketing Lead
[User]
employeeNo=456
last_name= Ronaldo
first_name=Juan
language=Spanish
email=juan.ronaldo#sms.ie
department=Data Science
role=Team Lead
Location=Spain
[User]
employeeNo=998
last_name=Lee
first_name=Damian
language=english
email=damian.lee#email.com
[User]
I'm wondering if someone could help me; you can see my sample dataset above. What I would like to do (please tell me if there is a more efficient way) is to loop through the first column and, wherever one of the unique ids occurs (e.g. first_name, last_name, role, etc.), append the value in the corresponding row to that id's list, and do this with each unique ID so that I'm left with the below.
I have read about multi-indexing and I'm not sure if that might be a better solution but I couldn't get it to work (I'm quite new to python)
[Image: desired output table]
# Define a list of selected persons
selectedList = textFile
# Define a list of searching person
searchList = ['uid']
# Define an empty list
foundList = []
# Iterate each element from the selected list
for index, sList in enumerate(textFile):
    # Match the element with the element of searchList
    if sList in searchList:
        # Store the value in foundList if the match is found
        foundList.append(selectedList[index])
You have a text file where each record starts with a [User] line and the data lines have a key=value format. I know of no module able to handle that automatically, but it is easy to parse by hand. The code could be:
import pandas as pd
with open('file.txt') as fd:
    data = [] # a list of records
    for line in fd:
        line = line.strip() # strip end of line
        if line == '[User]': # new record
            row = {} # row will be a key: value dict
            data.append(row)
        else:
            k,v = line.split('=', 1) # split on the = character
            row[k] = v
df = pd.DataFrame(data) # list of key: value dicts => dataframe
With the sample data shown, we get:
employeeNo last_name first_name language email department role email Location
0 123 Toole Michael english michael.toole#123.ie Marketing Marketing Lead NaN NaN
1 456 Ronaldo Juan Spanish NaN Data Science Team Lead juan.ronaldo#sms.ie Spain
2 998 Lee Damian english NaN NaN NaN damian.lee#email.com NaN
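The duplicated email column above comes from the spaces around "=" in the first record. A variant that strips keys and values, skips blank lines and ignores the trailing empty [User] marker might look like the sketch below. `parse_records` is a hypothetical helper, and the sample is a trimmed subset of the data in the question.

```python
import pandas as pd

sample = """[User]
employeeNo=123
last_name=Toole
email = michael.toole#123.ie
[User]
employeeNo=456
last_name= Ronaldo
email=juan.ronaldo#sms.ie
Location=Spain
[User]
employeeNo=998
last_name=Lee
email=damian.lee#email.com
[User]
"""


def parse_records(lines, marker="[User]"):
    """Turn marker-delimited key=value lines into a list of dicts."""
    records = []
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line == marker:
            current = {}
            records.append(current)
        elif current is not None and "=" in line:
            key, value = line.split("=", 1)
            current[key.strip()] = value.strip()  # tolerate "key = value"
    return [record for record in records if record]  # drop empty records


df = pd.DataFrame(parse_records(sample.splitlines()))
print(df)
```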
I'm sure there is a more optimal way to do this, but one approach is to get the unique list of row names, extract the values for each name in a loop, and combine them into a new dataframe. Finally, update it with the desired column names.
import pandas as pd
import numpy as np
import io
data = '''
[User]
employeeNo=123
last_name=Toole
first_name=Michael
language=english
email=michael.toole#123.ie
department=Marketing
role="Marketing Lead"
[User]
employeeNo=456
last_name= Ronaldo
first_name=Juan
language=Spanish
email=juan.ronaldo#sms.ie
department="Data Science"
role=Team Lead
Location=Spain
[User]
employeeNo=998
last_name=Lee
first_name=Damian
language=english
email=damian.lee#email.com
[User]
'''
df = pd.read_csv(io.StringIO(data), sep='=', comment='[', header=None)
new_cols = df[0].unique()
new_df = pd.DataFrame()
for col in new_cols:
    tmp = df[df[0] == col]
    tmp.reset_index(inplace=True)
    new_df = pd.concat([new_df, tmp[1]], axis=1)
new_df.columns = new_cols
new_df['User'] = None
new_df = new_df[['User','employeeNo','last_name','first_name','language','email','department','role','Location']]
new_df
User employeeNo last_name first_name language email department role Location
0 None 123 Toole Michael english michael.toole#123.ie Marketing Marketing Lead Spain
1 None 456 Ronaldo Juan Spanish juan.ronaldo#sms.ie Data Science Team Lead NaN
2 None 998 Lee Damian english damian.lee#email.com NaN NaN NaN
Rewrite based on testing of previous version offset values
import pandas as pd
# Revised from previous answer - ensures key value pairs are contained to the same
# record - previous version assumed the first record had all the expected keys -
# inadvertently assigned (Location) value of second record to the first record
# which did not have a Location key
# This version should perform better - only dealing with one single df
# - and using pandas own pivot() function
textFile = 'file.txt'
filter = '[User]'
# Decoration - enabling a check and balance - how many users are we processing?
textFileOpened = open(textFile,'r')
initialRead = textFileOpened.read()
userCount = initialRead.count(filter) # sample has 4 [User] entries - but only three actual unique records
print ('User Count {}'.format(userCount))
# Create sets so able to manipulate and interrogate
allData = []
oneRow = []
userSeq = 0
#Iterate through file - assign record key and [userSeq] Key to each pair
with open(textFile, 'r') as fp:
    for fileLineSeq, line in enumerate(fp):
        if filter in str(line):
            userSeq = userSeq + 1 # Ensures each key value pair is grouped
        else: userSeq = userSeq
        oneRow = [fileLineSeq, userSeq, line]
        allData.append(oneRow)
df = pd.DataFrame(allData)
df.columns = ['FileRow','UserSeq','KeyValue'] # rename columns
userSeparators = df[df['KeyValue'] == str(filter+'\n') ].index # Locate [User Records]
df.drop(userSeparators, inplace = True) # Remove [User] records
df = df.replace(' = ' , '=' , regex=True ) # Input data dirty - cleaning up
df = df.replace('\n' , '' , regex=True ) # remove the new lines appended during the list generation
# print(df) # Test as necessary here
# split KeyValue column into two
df[['Key', 'Value']] = df.KeyValue.str.split('=', expand=True)
# very powerful function - convert to table
df = df.pivot(index='UserSeq', columns='Key', values='Value')
print(df)
Results
User Count 4
Key Location department email employeeNo first_name language last_name role
UserSeq
1 NaN Marketing michael.toole#123.ie 123 Michael english Toole Marketing Lead
2 Spain Data Science juan.ronaldo#sms.ie 456 Juan Spanish Ronaldo Team Lead
3 NaN NaN damian.lee#email.com 998 Damian english Lee NaN
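The record numbering can also be sketched without an explicit loop, by taking a cumulative sum over the [User] marker lines before pivoting. This is a condensed variant under the same assumptions as above (one key=value pair per line, no repeated key within a record); the sample is a trimmed subset of the question's data and the variable names are illustrative.

```python
import pandas as pd

text = """[User]
employeeNo=123
last_name=Toole
email = michael.toole#123.ie
[User]
employeeNo=456
last_name= Ronaldo
email=juan.ronaldo#sms.ie
Location=Spain
[User]
employeeNo=998
last_name=Lee
email=damian.lee#email.com
[User]
"""

lines = pd.Series(text.splitlines()).str.strip()
lines = lines[lines != ""]

# Every [User] line starts a new record, so a running count labels the records
is_marker = lines == "[User]"
user_seq = is_marker.cumsum()

# Split the remaining lines on the first "=" only, then tidy the whitespace
kv = lines[~is_marker].str.split("=", n=1, expand=True)
kv.columns = ["Key", "Value"]
kv["Key"] = kv["Key"].str.strip()
kv["Value"] = kv["Value"].str.strip()
kv["UserSeq"] = user_seq[~is_marker]

table = kv.pivot(index="UserSeq", columns="Key", values="Value")
print(table)
```

The trailing [User] marker has no key=value lines after it, so it does not produce a row in the pivoted table.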