Splitting a string into multiple variable fields using regex in Python

I have a dataframe where each row of a certain column is text that comes from a badly formatted form, in which each field value follows its field title. An example is:
col
Name: Bob Surname: Ross Title: painter age:34
Surname: Isaac Name: Newton Title: coin checker age: 42
age:20 Title: pilot Name: jack
this is some trash text Name: John Surname: Doe
As the example shows, the fields can be in any order and some of them may be missing.
What I need to do is to parse the fields so that the second line becomes something like:
{'Surname': 'Isaac', 'Name': 'Newton', ...}
While I can deal with the 'pythonic part', I believe the parsing should be done with a regex (also because there are thousands of rows), but I have no idea how to design it.

Try:
x = df["col"].str.extractall(r"([^\s:]+):\s*(.+?)\s*(?=[^\s:]+:|\Z)")
x = x.droplevel(level="match").pivot(columns=0, values=1)
print(x.apply(lambda x: x[x.notna()].to_dict(), axis=1).to_list())
Prints:
[
{"Name": "Bob", "Surname": "Ross", "Title": "painter", "age": "34"},
{
"Name": "Newton",
"Surname": "Isaac",
"Title": "coin checker",
"age": "42",
},
{"Name": "jack", "Title": "pilot", "age": "20"},
]
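If pandas is not needed for the parsing step itself, the same pattern can probably be applied row by row with the standard re module. The following is only a minimal sketch, not part of the answer above: the helper name parse_fields and the rows list are made up for illustration, and it assumes that a field value never contains a word immediately followed by a colon.

```python
import re

# Same pattern as in the answer above: a key (no spaces or colons), a colon,
# then a lazily matched value that stops before the next "key:" or at the end.
FIELD_RE = re.compile(r"([^\s:]+):\s*(.+?)\s*(?=[^\s:]+:|\Z)")


def parse_fields(text):
    """Return a dict of the 'key: value' pairs found in text (hypothetical helper)."""
    return dict(FIELD_RE.findall(text))


# The four example rows from the question.
rows = [
    "Name: Bob Surname: Ross Title: painter age:34",
    "Surname: Isaac Name: Newton Title: coin checker age: 42",
    "age:20 Title: pilot Name: jack",
    "this is some trash text Name: John Surname: Doe",
]
parsed = [parse_fields(row) for row in rows]
print(parsed[1])
```

Text in front of the first key (the "trash text" in the last row) is skipped because a match can only start at a word that is directly followed by a colon. On a dataframe, something like df["col"].map(parse_fields) should give one dict per row.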

Related

loop through a list of dictionaries, perform function, and append result to csv

I am trying to loop over a list containing Twitter data in JSON format. The list is made up of several dictionaries, each containing data on one politician. The code works if the input json_response holds data on only one politician. However, when json_response is a list of dictionaries, I get an error.
In short, I believe the issue can be isolated to three for-loops in the code: for tweet in json_response['data']:, for dics in json_response['includes']['users']:, and for element in json_response['includes']['media']:.
# Inputs for the request
bearer_token = auth()
headers = create_headers(bearer_token)
keyword = search_query
start_time = "2016-03-01T00:00:00.000Z"
end_time = "2021-03-31T00:00:00.000Z"
max_results = 3000
json_response = [] # empty list that will hold tweet objects
for i in keyword: # loop through list of politicians in keyword i.e. search query and extract tweets
    url = create_url(i, start_time, end_time, max_results)
    json_response.append(connect_to_endpoint(url[0], headers, url[1]))
    pass
I have only pasted the json_response object for 2 of the 30 politicians due to the cap on characters. However, the structure is the same for the remaining 28 politicians.
print(json.dumps(json_response, indent=4, sort_keys=True)) # look at json_response object.
[
{
"data": [
{
"author_id": "2877379617",
"created_at": "2021-03-25T12:11:14.000Z",
"id": "1375057688355336195",
"text": "#prettynobodyco She blocked me in 2015 - for pointing out that Tim Kaine enables sexual assault in the military and the evidence was his killing of the MJIA and publicly stated that Military commanders should remain in charge of military rape cases. She's Tanden level awful. Congrats!"
},
{
"author_id": "1265018154444562440",
"created_at": "2021-03-22T19:48:59.000Z",
"id": "1374085719472361474",
"text": "#MehcatCat #AlasscanIsBack #PattyArquette #timkaine Funny, they blocked me. \ud83e\udd23\ud83e\udd23"
},
{
"author_id": "2378324935",
"created_at": "2021-03-07T21:32:13.000Z",
"id": "1368675879312887810",
"text": "#DrWinarick #KatieOGrady4 I apologize for any drama. Katie O Grady blocked me because we had a disagreement about Tim Kaine on one of your older posts. I guess I can't please everyone haha. :/"
},
{
"author_id": "821870502943817729",
"created_at": "2021-02-12T23:53:59.000Z",
"id": "1360376637385244673",
"text": "She blocked me a long ass time ago when I asked her why we shoulf care about Tim Kaine's personal view on abortion if it didn't impact legislation"
},
{
"attachments": {
"media_keys": [
"16_1341045032732770306"
]
},
"author_id": "17232340",
"created_at": "2020-12-21T15:37:07.000Z",
"id": "1341045038420275205",
"text": "#DSingh4Biden #moomintroll8 #timkaine #GovernorVA That's why I replied to you. She blocked me previously, for what silliness I can't remember. Tough being a troll AND a snowflake!"
}
],
"includes": {
"media": [
{
"media_key": "16_1341045032732770306",
"type": "animated_gif"
}
],
"users": [
{
"created_at": "2014-11-15T02:23:57.000Z",
"description": "",
"id": "2877379617",
"name": "Laura Saylor",
"username": "lauraleesaylor"
},
{
"created_at": "2020-05-25T20:33:36.000Z",
"description": "Weird Writer & Lunatic Linguist\nWicked Witch of the East\nshe/her",
"id": "1265018154444562440",
"name": "Zauberkind",
"username": "Zauberkind2"
},
{
"created_at": "2014-03-08T07:22:31.000Z",
"description": "#Resist, #BLM, #Vaxxed, liberal, autistic, kidney transplant survivor, political nerd, mental health advocate, fighter for equality, truth, justice, etc.",
"id": "2378324935",
"name": "Trevor \"Trev\" McKee Achilles",
"username": "MrTAchilles"
},
{
"created_at": "2017-01-19T00:02:52.000Z",
"description": "statist / Progressive Gun Nut/ Single and hating it\n\n / \n\nstraight????? /\n\npronouns / brain worm survivor\n\n \n",
"id": "821870502943817729",
"name": "Squirrel Dad",
"username": "nihilisticpillo"
},
{
"created_at": "2008-11-07T15:09:46.000Z",
"description": "Liberal-Veteran-Dog Lover | Taste for irony, but in moderation | Humor is reason gone mad. ~Groucho Marx | I follow & unfollow back #VeteransResist #Resist",
"id": "17232340",
"name": "anti-Fascist Jim",
"username": "JimnBL"
}
]
},
"meta": {
"newest_id": "1375057688355336195",
"next_token": "b26v89c19zqg8o3foseug43lzoqdft4ghg78o9sn9ds3h",
"oldest_id": "1341045038420275205",
"result_count": 5
}
},
{
"data": [
{
"author_id": "1248251899884814336",
"created_at": "2021-03-27T13:36:45.000Z",
"id": "1375803982409576450",
"text": "#gavinjeffries0 #steven86026859 #MSNBC #SenBooker Uh Oh our friend Steve blocked me, I guess not being able to answer your simple question and being asked to was too much for him."
},
{
"author_id": "293104735",
"created_at": "2021-02-07T21:45:47.000Z",
"id": "1358532435122683904",
"text": "#slwilliams1101 #annabella313 #CrossConnection #TiffanyDCross #Scaramucci #JoyAnnReid #CapehartJ #MSNBC #SenBooker #AliVelshi I stopped watching #TiffanyDCross as well and only watch #CapehartJ now (even though he blocked me in 2016 because I had a \"strong\" response to something mean he said about Hillary Clinton)."
},
{
"author_id": "380970864",
"created_at": "2021-02-07T20:58:01.000Z",
"id": "1358520416273326081",
"text": "#annabella313 #CrossConnection #TiffanyDCross #Scaramucci #JoyAnnReid #CapehartJ #MSNBC After I criticized #TiffanyDCross she blocked me. #JoyAnnReid called herself petty during and interview with #SenBooker. Why be petty? Be mature and thoughtful so people can learn. Hosts need to learn too. I only watch #AliVelshi #CapehartJ now."
},
{
"attachments": {
"media_keys": [
"3_1358448920632909825"
]
},
"author_id": "793175035322171397",
"created_at": "2021-02-07T16:17:44.000Z",
"id": "1358449876565164034",
"text": "#FinstaManhattan #SenSchumer #SenBooker #RonWyden Lmao he blocked me over that. His bio said he likes to 'debate & that sometimes he's wrong but he can admit that'.\n\nGuess not.\n\nI wasn't rude or mean at all. This is too funny \ud83e\udd23"
},
{
"author_id": "752266160352010241",
"created_at": "2021-02-06T20:34:06.000Z",
"id": "1358152008948195328",
"text": "#fattypinner #tkbone32221 #SenSchumer #SenBooker #RonWyden He blocked me \ud83e\udd23\ud83d\ude2d\ud83e\udd23\ud83e\udd23\ud83e\udd23\ud83d\ude2d"
}
],
"includes": {
"media": [
{
"media_key": "3_1358448920632909825",
"type": "photo",
"url": ""
}
],
"users": [
{
"created_at": "2020-04-09T14:11:04.000Z",
"description": "",
"id": "1248251899884814336",
"name": "Firstcomm",
"username": "Firstcomm1"
},
{
"created_at": "2011-05-04T19:26:22.000Z",
"description": "Cinephile, balletomane, book lover, tennis fan, K-Drama fanatic, Jang Na-ra fangirl, USC School of Cinematic Arts alumna, Hillary Clinton and Nancy Pelosi Dem.",
"id": "293104735",
"name": "Joyce Tyler",
"username": "joyce_tyler"
},
{
"created_at": "2011-09-27T14:50:37.000Z",
"description": "Spelman College, BA, George Washington University MA, University of South Florida Ph.D. in Political Science, proud Ted Kennedy, Obama, Biden/Harris Democrat!",
"id": "380970864",
"name": "Stephanie L. Williams, Ph.D.",
"username": "slwilliams1101"
},
{
"created_at": "2016-10-31T19:37:19.000Z",
"description": "Loves: life, fam, cats, cars, tattoos, reality TV; collector of t-shirts & Volkswagen\u2019s. Hates: Oxford commas. #CombatVet #Medic #BidenHarris2020 #Resist",
"id": "793175035322171397",
"name": "Que Sarah Sarah \ud83d\udda4",
"username": "sarahalli13"
},
{
"created_at": "2016-07-10T22:20:03.000Z",
"description": "3x Hollywood Video Street Fighter 2 Champion",
"id": "752266160352010241",
"name": "Sugarcoder",
"username": "TheSugarCoder"
}
]
},
"meta": {
"newest_id": "1375803982409576450",
"next_token": "b26v89c19zqg8o3fosktkdplqiw2q9kzx2ibm4r4y27wd",
"oldest_id": "1358152008948195328",
"result_count": 5
}
}
...28 other politicians
# Create file
csvFile = open("tweet_sample.csv", "a", newline="", encoding='utf-8')
csvWriter = csv.writer(csvFile)
# Create headers for the data I want to save. I only want to save these columns in my dataset
csvWriter.writerow(
['author id', 'created_at', 'id', 'tweet', 'bio', 'image_url'])
csvFile.close()
def append_to_csv(json_response, fileName):
    # A counter variable
    global created_at, tweet_id, bio, text, author_id
    counter = 0
    # Open OR create the target CSV file
    csvFile = open(fileName, "a", newline="", encoding='utf-8')
    csvWriter = csv.writer(csvFile)
    # Loop through each tweet
    for tweet in json_response[0]['data']: # NOTE adding a 0 gives access to the data for the first politician while adding 1 gives access to data for the second politician and so on...
        # 1. Author ID
        author_id = tweet['author_id']
        # 2. Time created
        created_at = dateutil.parser.parse(tweet['created_at'])
        # 3. Tweet ID
        tweet_id = tweet['id']
        # 4. Tweet text
        text = tweet['text']
    for dics in json_response[0]['includes']['users']: # NOTE 0 added
        # 5. description. Contained in includes data object
        if ('description' in dics):
            bio = dics['description']
        else:
            bio = " "
    for element in json_response[0]['includes']['media']: # NOTE 0 added
        # 6. image url. Contained in includes data object
        if ('url' in element):
            image_url = element['url']
        else:
            image_url = " "
    # Assemble all data in a list
    res = [author_id, created_at, tweet_id, text, bio, image_url]
    # Append the result to the CSV file
    csvWriter.writerow(res)
    counter += 1
    # When done, close the CSV file
    csvFile.close()
    # Print the number of tweets for this iteration
    print("# of Tweets added from this response: ", counter)

append_to_csv(json_response, "tweet_sample.csv") # Save tweet data in a csv file
Error message:
TypeError: list indices must be integers or slices, not str
By adding the [0] in the loops I avoid the TypeError above. However, the output of append_to_csv is not ideal, as it only includes the last tweet for the first politician. I guess my loop overwrites data.
The desired output is a data frame with the columns author_id, created_at, id, tweet, bio, and image_url. Not all users have a bio on their profile or an image URL in their tweet, hence the if-else statements in the function above and the no_bio and no_image_url placeholders in the desired data frame.
pol_df = pd.read_csv("path_to_tweet_sample.csv" )
pol_df.head()
author_id created_at id tweet bio image_url
0 737885223858384896 2021-03-26T21:56:02.000Z 1375567243082338314 tweet_text no_bio no_image_url
1 847612931487416323 2021-03-26T21:55:24.000Z 1375567083791073283 tweet_text no_bio no_image_url
2 18634205 2021-03-08T12:29:00.000Z 1368901564363051010 tweet_text bio image_url
3 27327319 2021-03-02T11:53:16.000Z 1366718245521211393 tweet_text bio no_image_url
4 917634626247647232 2021-02-28T18:16:45.000Z 1366089974907432961 tweet_text bio image_url
I think you are confusing lists with dicts. When you try to access a list like a dict (e.g. data["author_id"]) the TypeError you're getting will be raised. You have to iterate over a list and then try to access each dict in that list like [x['author_id'] for x in data], for example. If you want to extract values from the dicts and write it to a csv file you might want to do something like this:
import pandas as pd

# resp is the list of response dicts (json_response in the question)
resp = json_response

# One row per tweet
author_data = []
for data in resp:
    for author in data['data']:
        author_id = author['author_id']
        created_at = author['created_at']
        another_id = author['id']
        tweet_text = author['text']
        author_data.append([author_id, created_at, another_id, tweet_text])
author_df = pd.DataFrame(author_data, columns=['author_id', 'created_at', 'id', 'text'])

# One row per media object; 'no_url' is used when the url key is missing
media_data = []
for data in resp:
    # .get() guards against responses that contain no media at all
    for media in data['includes'].get('media', []):
        url = media.get('url', 'no_url')
        media_data.append([url])
media_df = pd.DataFrame(media_data, columns=['url'])

# One row per user
bio_data = []
for data in resp:
    for user in data['includes']['users']:
        bio = user['description']
        author_id = user['id']
        bio_data.append([bio, author_id])
bio_df = pd.DataFrame(bio_data, columns=['bio', 'author_id'])

# Attach each user's bio to their tweets
final_df = author_df.merge(bio_df, on="author_id")
print(final_df)
You have to save the different parts of the data in separate dataframes and then merge them. Note that the entries in ['includes']['media'] do not contain author_id, so media_df cannot be merged on that column. In the sample above, the only link between a media entry and a tweet is the media_key that also appears in the tweet's attachments.media_keys.
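As an alternative to merging dataframes, the rows can be assembled directly with dictionary lookups. This is only a sketch under assumptions: the function name flatten_responses and the tiny sample list are invented for illustration, the no_bio and no_image_url placeholders are taken from the desired output in the question, and the link between tweets and media relies on the attachments.media_keys and media_key fields visible in the pasted response.

```python
def flatten_responses(responses):
    """Build one row (dict) per tweet from a list of response dicts."""
    rows = []
    for resp in responses:
        includes = resp.get("includes", {})
        # Look-up tables: user id -> bio, media_key -> url.
        # Empty or missing values fall back to the placeholders.
        bios = {user["id"]: user.get("description") or "no_bio"
                for user in includes.get("users", [])}
        urls = {media["media_key"]: media.get("url") or "no_image_url"
                for media in includes.get("media", [])}
        for tweet in resp.get("data", []):
            media_keys = tweet.get("attachments", {}).get("media_keys", [])
            # Only the first attached media item is used in this sketch.
            image_url = urls.get(media_keys[0], "no_image_url") if media_keys else "no_image_url"
            rows.append({
                "author_id": tweet["author_id"],
                "created_at": tweet["created_at"],
                "id": tweet["id"],
                "tweet": tweet["text"],
                "bio": bios.get(tweet["author_id"], "no_bio"),
                "image_url": image_url,
            })
    return rows


# Made-up miniature input with the same shape as the pasted json_response.
sample = [{
    "data": [
        {"author_id": "1", "created_at": "2021-03-25T12:11:14.000Z",
         "id": "10", "text": "first tweet"},
        {"author_id": "2", "created_at": "2021-03-22T19:48:59.000Z",
         "id": "11", "text": "second tweet",
         "attachments": {"media_keys": ["3_99"]}},
    ],
    "includes": {
        "media": [{"media_key": "3_99", "type": "photo",
                   "url": "https://example.com/a.jpg"}],
        "users": [{"id": "1", "description": ""},
                  {"id": "2", "description": "some bio"}],
    },
}]
rows = flatten_responses(sample)
print(len(rows))
```

The resulting list of dicts can be passed to pd.DataFrame(rows) and written out with to_csv, which would replace the manual csv.writer bookkeeping.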

How can I convert nested dictionary to pd.dataframe faster?

I have a json file which looks like this
{
"file": "name",
"main": [{
"question_no": "Q.1",
"question": "what is ?",
"answer": [{
"user": "John",
"comment": "It is defined as",
"value": [
{
"my_value": 5,
"value_2": 10
},
{
"my_value": 24,
"value_2": 30
}
]
},
{
"user": "Sam",
"comment": "as John said above it simply means",
"value": [
{
"my_value": 9,
"value_2": 10
},
{
"my_value": 54,
"value_2": 19
}
]
}
],
"closed": "no"
}]
}
desired result:
Question_no question my_value_sum value_2_sum user comment
Q.1 what is ? 29 40 john It is defined as
Q.1 what is ? 63 29 Sam as John said above it simply means
What I have tried is data = json_normalize(file_json, "main") followed by a for loop like
for ans, row in data.iterrows():
    ....
    ....
    df = df.append(the data)
But this takes so much time that my client would reject the solution. There are around 1200 items in the main list and 450 JSON files like this to convert, so this intermediate conversion step would take almost an hour to complete.
EDIT:
Is it possible to get the sums of my_value and value_2 as columns? (I have also updated the desired result.)
Select the list stored under the main key and pass it to json_normalize with the record_path and meta parameters:
data = pd.json_normalize(file_json["main"],
                         record_path='answer',
                         meta=['question_no', 'question'])
print (data)
user comment question_no question
0 John It is defined as Q.1 what is ?
1 Sam as John said above it simply means Q.1 what is ?
Then, if the column order matters, move the last N columns to the front:
N = 2
data = data[data.columns[-N:].tolist() + data.columns[:-N].tolist()]
print (data)
question_no question user comment
0 Q.1 what is ? John It is defined as
1 Q.1 what is ? Sam as John said above it simply means
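The output above does not yet contain the sums asked about in the question's edit. One possible way to get them, sketched here in plain Python rather than with json_normalize, is to walk the nested lists once and sum the values per answer. The function name summarise_answers is hypothetical, and the sketch assumes every answer has a value list whose items carry both my_value and value_2.

```python
def summarise_answers(file_json):
    """Return one row per answer, with my_value and value_2 summed."""
    rows = []
    for item in file_json["main"]:
        for answer in item["answer"]:
            values = answer.get("value", [])
            rows.append({
                "question_no": item["question_no"],
                "question": item["question"],
                "my_value_sum": sum(v["my_value"] for v in values),
                "value_2_sum": sum(v["value_2"] for v in values),
                "user": answer["user"],
                "comment": answer["comment"],
            })
    return rows


# The JSON from the question, as a Python dict.
file_json = {
    "file": "name",
    "main": [{
        "question_no": "Q.1",
        "question": "what is ?",
        "answer": [
            {"user": "John", "comment": "It is defined as",
             "value": [{"my_value": 5, "value_2": 10},
                       {"my_value": 24, "value_2": 30}]},
            {"user": "Sam", "comment": "as John said above it simply means",
             "value": [{"my_value": 9, "value_2": 10},
                       {"my_value": 54, "value_2": 19}]},
        ],
        "closed": "no",
    }],
}
rows = summarise_answers(file_json)
print(rows[0]["my_value_sum"], rows[0]["value_2_sum"])
```

Building a plain list of dicts and calling pd.DataFrame(rows) once at the end avoids growing a dataframe row by row, which is usually the slow part of the iterrows approach.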

How to print clean string from JSON list // Dict?

I am having trouble working out how to print a clean string from a JSON list of dicts.
I tried the .join and .split methods but they don't seem to work. Thanks for the help, guys.
My code:
import json
with open("user.json") as f:
    data = json.load(f)
for person in data["person"]:
    print(person)
The JSON file
{
"person": [
{
"name": "Peter",
"Country": "Montreal",
"Gender": "Male"
},
{
"name": "Alex",
"Country": "Laval",
"Gender": "Male"
}
]
}
The print output (Which is not the correct format I want)
{'name': 'Peter', 'Country': 'Montreal', 'Gender': 'Male'}
{'name': 'Alex', 'Country': 'Laval', 'Gender': 'Male'}
I want the output to be printed in this format:
Name: Peter
Country: Montreal
Gender: Male
If you want to print all the attributes in the person dictionary (with no exceptions) you can use:
for person in data["person"]:
    for k, v in person.items():
        print(k, ':', v)
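The generic loop above prints the keys exactly as stored (so name stays lowercase) and puts a space before the colon. A small variation that gets closer to the requested format is sketched below; the helper format_person is made up for this example, and it assumes that capitalising the first letter of each key is what is wanted.

```python
def format_person(person):
    """Return 'Key: value' lines for one person dict, one line per attribute."""
    return "\n".join(f"{key.capitalize()}: {value}" for key, value in person.items())


# Same structure as the user.json file in the question.
data = {
    "person": [
        {"name": "Peter", "Country": "Montreal", "Gender": "Male"},
        {"name": "Alex", "Country": "Laval", "Gender": "Male"},
    ]
}
for person in data["person"]:
    print(format_person(person))
```

Note that str.capitalize() also lowercases the rest of the key, so a key such as "postalCode" would come out as "Postalcode".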
You can access the values using their keys, as follows:
import json
with open("user.json") as f:
    data = json.load(f)
for person in data["person"]:
    print(f'Name: {person["name"]}')
    print(f'Country: {person["Country"]}')
    print(f'Gender: {person["Gender"]}')
Result:
Name: Peter
Country: Montreal
Gender: Male
Name: Alex
Country: Laval
Gender: Male
With f-strings (Python 3.6+):
for person in data["person"]:
    print(f"Name: {person['name']}")
    print(f"Country: {person['Country']}")
    print(f"Gender: {person['Gender']}")

accessing second value in python list loop

I am trying to access the second value in a list created from a JSON object. When accessing the first value, "name", I have no problems. But when trying to access "address" I get an error:
Result: Failure
Exception: KeyError: 'address'
The json coming in looks like this
{
"DataToCompare": [
{
"name": "Alex Young",
"address": "123 Main Street"
}
],
"DataSetAgainst": [
{
"name": "Bob Doll",
"address": "555 South Street"
},
{
"name": "Bob Young",
"adress": "123 Main St."
}
]
}
In the example below dataBack["address"] = i["address"] is where the error comes in. If I comment it out I get results back for name and name match
import difflib
import json

def processing_request(dataIncoming):
    data_to_compare = dataIncoming["DataToCompare"][0]
    dataList = []
    for i in dataIncoming["DataSetAgainst"]:
        dataList.append(i)
    dataResults = []
    for i in dataList:
        dataBack = {}
        clean_name = ''.join(e for e in i["name"] if e.isalnum())
        sequence = difflib.SequenceMatcher(isjunk=None, a=data_to_compare["name"], b=clean_name)
        difference = sequence.ratio()*100
        difference = round(difference, 1)
        # works
        dataBack["name"] = i["name"]
        dataBack["name match"] = difference
        # doesn't work
        dataBack["address"] = i["address"]
        dataResults.append(dataBack)
    return json.dumps(dataResults)
The error appears to be a result of a typo:
{
"DataToCompare": [
{
"name": "Alex Young",
"address": "123 Main Street"
}
],
"DataSetAgainst": [
{
"name": "Bob Doll",
"address": "555 South Street"
},
{
"name": "Bob Young",
"address": "123 Main St."
}
]
}
What I fixed: "adress": "123 Main St." became "address": "123 Main St."
Output for the code you shared in this case:
'[{"name": "Bob Doll", "name match": 11.8, "address": "555 South Street"}, {"name": "Bob Young", "name match": 55.6, "address": "123 Main St."}]'
From what I can see, the error is caused by a misspelled key in the line:
"adress": "123 Main St."
But let's give the benefit of the doubt and assume that the data is corrupted and we need a workaround for it.
We could transform our i value into a list of values in the following way:
tab = []
for key in i:
    value = i[key]
    tab.append(value)
If we now print tab for each item, the output is:
['Bob Doll', '555 South Street']
['Bob Young', '123 Main St.']
So, instead of the lines that assign by key, you could write:
dataBack["name"], dataBack["address"] = tab
dataBack["name match"] = "cookies"
As a full answer following your script:
def processing_request(dataIncoming):
    data_to_compare = dataIncoming["DataToCompare"][0]
    dataList = []
    for i in dataIncoming["DataSetAgainst"]:
        dataList.append(i)
    dataResults = []
    for i in dataList:
        dataBack = {}
        clean_name = ''.join(e for e in i["name"] if e.isalnum())
        sequence = difflib.SequenceMatcher(isjunk=None, a=data_to_compare["name"], b=clean_name)
        difference = sequence.ratio()*100
        difference = round(difference, 1)
        tab = [i[key] for key in i]
        # works
        dataBack["name"], dataBack["address"] = tab
        dataBack["name match"] = difference
        dataResults.append(dataBack)
    return json.dumps(dataResults)
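The positional unpacking above only works while every record has exactly two values in the order name, address. If the goal is just to tolerate a known misspelling of the key, a lookup with fallbacks may be more robust. The sketch below is an assumption-laden illustration: get_first is a hypothetical helper, and "adress" is the only alternative spelling it knows about because that is the one seen in the question's data.

```python
def get_first(record, *keys, default=None):
    """Return the value of the first key in keys that exists in record."""
    for key in keys:
        if key in record:
            return record[key]
    return default


# The DataSetAgainst list from the question, including the misspelled key.
records = [
    {"name": "Bob Doll", "address": "555 South Street"},
    {"name": "Bob Young", "adress": "123 Main St."},
    {"name": "No Address"},
]
addresses = [get_first(r, "address", "adress", default="") for r in records]
print(addresses)
```

Inside processing_request this would replace the failing line with something like dataBack["address"] = get_first(i, "address", "adress", default="").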

How to handle composite type of Entities using RASA NLU?

Let's say I have this utterance: "My name is John James Doe"
{
"rasa_nlu_data": {
"common_examples":
[
{
"text": "My name is John James Doe",
"intent": "Introduction",
"entities": [
{
"start": 11,
"end": 25,
"value": "John James Doe",
"entity": "Name"
}
]
}
],
"regex_features" : [],
"entity_synonyms": []
}
}
Here the substring John James Doe is a composite entity of type Name having 3 simple entities (First Name, Middle Name, Last Name) as follows:
John - First Name(Simple Entity)
James - Middle Name(Simple Entity)
Doe - Last Name(Simple Entity)
So, is there any way in Rasa to create a training format that handles these kinds of composite entities?
Any help is appreciated. Thank you.
I believe you'll have an easier time if you continue to train with an entity type of Name that pulls out a section of text for all the names, and then try to process the individual component parts from the returned entity text. The reason is that if you try to train on the component parts, you'll quickly have to provide a whole raft of combinations in your training data, and that will become ineffective.
Also bear in mind that this isn't a trivial problem as you go deeper. If you use position alone to determine first / middle / last, you may have problems in Japan (https://www.sljfaq.org/afaq/names-for-people.html), and if you try to train a model to recognise names based on content (i.e. to pick out Doe as a last name), it will be prone to problems: it's not unknown for Americans to have first names that elsewhere are thought of as last names (Jackson, Hunter, etc.), and middle names vary a lot too (https://en.m.wikipedia.org/wiki/Middle_name).
I've written a custom component for this as I needed composite entities as well. Here is a summary of how it works.
Let's say your training data looked like this instead:
{
"rasa_nlu_data": {
"common_examples":
[
{
"text": "My name is John James Doe",
"intent": "Introduction",
"entities": [
{
"start": 11,
"end": 15,
"value": "John",
"entity": "first_name"
},
{
"start": 16,
"end": 21,
"value": "James",
"entity": "middle_name"
},
{
"start": 22,
"end": 25,
"value": "Doe",
"entity": "last_name"
}
]
}
],
"regex_features" : [],
"entity_synonyms": []
}
}
So instead of training the full name as an entity that will be split, you train name parts that will be grouped to full names.
The basic idea now is that you define composite patterns with entity placeholders. For your example, you could define this pattern:
full_name = "#first_name #middle_name #last_name"
For your example sentence, Rasa NLU will recognize the three entities in it like this:
My name is John       James       Doe
           ^          ^           ^
           first_name middle_name last_name
You take the input sentence and replace every recognized entity with its entity type:
My name is #first_name #middle_name #last_name
You can now perform a simple check whether your defined pattern is included in this string.
My name is #first_name #middle_name #last_name
           ^
           | Pattern matches
           |
           "#first_name #middle_name #last_name"
If it is included, you take all entity values that are part of the match and group them together into a full_name.
My name is John James Doe
           ^
           | Pattern matches
           |
           #first_name #middle_name #last_name
-> full_name = ["John", "James", "Doe"]
If you use regular expressions instead of simple string matching, you can make this system a lot more flexible. For example, you could make the middle name optional by changing your pattern to
full_name = "#first_name (#middle_name )?#last_name"
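The placeholder-and-match idea described above can be sketched in a few lines of plain Python. This is not the actual custom component and does not use any Rasa API; the function find_composites is hypothetical, and it assumes the entities arrive as dicts with start, end, value and entity keys (the same shape as the training data) and do not overlap.

```python
import re


def find_composites(text, entities, pattern):
    """Replace entities by '#type' placeholders, then group the values of
    all entities that fall inside each regex match of pattern."""
    entities = sorted(entities, key=lambda e: e["start"])
    parts = []    # pieces of the placeholder string
    spans = []    # (start, end, entity) positions inside the placeholder string
    cursor = 0    # position in the original text
    length = 0    # current length of the placeholder string
    for ent in entities:
        gap = text[cursor:ent["start"]]      # untouched text before the entity
        parts.append(gap)
        length += len(gap)
        token = "#" + ent["entity"]
        spans.append((length, length + len(token), ent))
        parts.append(token)
        length += len(token)
        cursor = ent["end"]
    parts.append(text[cursor:])
    placeholder = "".join(parts)

    groups = []
    for match in re.finditer(pattern, placeholder):
        groups.append([ent["value"] for start, end, ent in spans
                       if start >= match.start() and end <= match.end()])
    return placeholder, groups


text = "My name is John James Doe"
entities = [
    {"start": 11, "end": 15, "value": "John", "entity": "first_name"},
    {"start": 16, "end": 21, "value": "James", "entity": "middle_name"},
    {"start": 22, "end": 25, "value": "Doe", "entity": "last_name"},
]
full_name_pattern = r"#first_name (#middle_name )?#last_name"
placeholder, groups = find_composites(text, entities, full_name_pattern)
print(placeholder)
print(groups)
```

Because the middle name is optional in the pattern, a sentence with only a first and last name entity would be grouped as well.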
