Convert a text file consisting of strings into a dictionary


I want to know how to convert a text file consisting of strings into a dictionary. My text file looks like this:
Donald Trump, 45th US President, 71 years old
Barack Obama, 44th US President, 56 years old
George W. Bush, 43rd US President, 71 years old
I want to be able to convert that text file into a dictionary being:
{Donald Trump: 45th US President, 71 years old, Barack Obama: 44th US President, 56 years old, George W. Bush: 43rd US President, 71 years old}
How would I go about doing this? Thanks!
I tried to do it by doing this:
d = {}
with open('presidents.txt', 'r') as f:
    for line in f:
        key = line[0]
        value = line[1:]
        d[key] = value

Is this what you're looking for?
d = {}
with open("presidents.txt", "r") as f:
    for line in f:
        # Each line has exactly three comma-separated fields
        k, v, z = line.strip().split(",")
        d[k.strip()] = v.strip(), z.strip()
# No f.close() needed: the with block closes the file
print(d)
The final output looks like this:
{'Donald Trump': ('45th US President', '71 years old'), 'Barack Obama': ('44th US President', '56 years old'), 'George W. Bush': ('43rd US President', '71 years old')}
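If you would rather keep everything after the name as a single string, as in the dictionary sketched in the question, splitting only at the first comma may be enough. The following is a minimal sketch under that assumption; the helper name lines_to_dict and the in-memory sample list are made up for illustration, and in practice you would pass the open file object instead.

```python
def lines_to_dict(lines):
    """Map the text before the first comma to the rest of the line."""
    result = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # partition splits at the first comma only
        name, _, rest = line.partition(",")
        result[name.strip()] = rest.strip()
    return result


# Hypothetical stand-in for the contents of presidents.txt
sample = [
    "Donald Trump, 45th US President, 71 years old\n",
    "Barack Obama, 44th US President, 56 years old\n",
    "George W. Bush, 43rd US President, 71 years old\n",
]

presidents = lines_to_dict(sample)
print(presidents["Donald Trump"])
```

With a real file, `with open("presidents.txt") as f: presidents = lines_to_dict(f)` should behave the same way, since a file object yields its lines when iterated.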

You can use pandas for this:
import pandas as pd
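# A multi-character delimiter is parsed by the Python engine; naming it explicitly avoids a ParserWarning
df = pd.read_csv('file.csv', delimiter=', ', header=None, names=['Name', 'President', 'Age'], engine='python')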
d = df.set_index(['Name'])[['President', 'Age']].T.to_dict(orient='list')
# {'Barack Obama': ['44th US President', '56 years old'],
# 'Donald Trump': ['45th US President', '71 years old'],
# 'George W. Bush': ['43rd US President', '71 years old']}

Related

How to split the strings in a particular column of a dataframe based on the value of another column? [duplicate]

This question already has answers here:
Is there a difference between "==" and "is"?
(13 answers)
Closed 2 years ago.
I am trying to split the strings in a column tweet_text if the column lang is en
Here is how to do it on a string:
s = 'I am always sad'
s_split = s.split(" ")
This returns:
['I', 'am', 'always', 'sad']
My current code which does not work:
df['tweet_text'] = df.apply(lambda x: x['tweet_text'].split(" ") if x['lang'] is 'en' else x['tweet_text'], axis = 1)
Dictionary of data:
{'lang': {1404: 'en',
1943: 'en',
2169: 'en',
2502: 'de',
3981: 'nl',
4226: 'en',
7223: 'en',
8557: 'de',
11339: 'pt',
11854: 'en'},
'tweet_text': {1404: 'I am always sad when a colleague loses his job and Frank is not just a colleague he is an impoant person in my',
1943: 'It remains goalless at FNB Stadium between Kaizer Chiefs and Baroka at halftimeRead more',
2169: 'Which one gets your vote 05',
2502: 'Was sagt ihr zu den ersten Minuten',
3981: 'En we gaan door speelronde begint vandaagTegen wie speelt jouw favoriete club',
4226: 'Quote tweet or replyYour favourite Mesut Ozil moment as a Gunner was',
7223: 'How to follow the game live The opponent Current form Did you know The squad Koeman said It must b',
8557: 'BAYERN BAYERN BAYERN BAYERN BAYERN BAYERN BAYERN BAYERN BAYERN BAYERN BAYERN BAYERN BAYERN BAYERN BAYERN BAYERN',
11339: '9o golo para',
11854: 'have loads of boss stuff available on their store products available including the m'}}

Use == instead of is. Also, for text separated by single spaces, split() gives the same result as split(" "):
df['tweet_text'] = df.apply(lambda x: x['tweet_text'].split() if x['lang'] == 'en' else x['tweet_text'], axis = 1)
Alternatively, use Series.str.split on the en rows only:
m = df['lang'] == 'en'
df.loc[m, 'tweet_text'] = df.loc[m, 'tweet_text'].str.split()

You can also do it this way:
mask = df["lang"] == "en", "tweet_text"
df.loc[mask] = df.loc[mask].str.split()
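For comparison, here is a small self-contained sketch of the same idea that avoids apply by zipping the two columns in a list comprehension. The three-row frame and the new tokens column are invented for illustration; the point is only that the language check uses ==, which compares values, rather than is, which compares object identity.

```python
import pandas as pd

# Tiny stand-in for the real data (assumed column names: lang, tweet_text)
df = pd.DataFrame({
    "lang": ["en", "de", "en"],
    "tweet_text": [
        "I am always sad",
        "Was sagt ihr",
        "Which one gets your vote 05",
    ],
})

# Split English rows into word lists; leave all other rows as plain strings
df["tokens"] = [
    text.split() if lang == "en" else text
    for lang, text in zip(df["lang"], df["tweet_text"])
]

print(df["tokens"].tolist())
```

Writing to a separate column keeps the original text available, which can make it easier to check the result before overwriting tweet_text.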

Extract an entire Python dictionary from a list and export to csv [duplicate]

This question already has answers here:
How to write information from a list to a csv file using csv DictWriter?
(2 answers)
Closed 4 years ago.
I've a dictionary embedded within a list in Python. I'd like to extract just the dictionary and then export it to a csv file, with each column representing a unique dictionary key (i.e. Date). For example, here is a (small snippet) of the embedded dictionary I have:
[[{'Date': '2018-069', 'Title': 'The Effect of Common Ownership on Profits : Evidence From the U.S. Banking Industry', 'Author': 'Jacob P. Gramlich & Serafin J. Grundl'}, {'Date': '2018-068', 'Title': 'The Long and Short of It : Do Public and Private Firms Invest Differently?', 'Author': 'Naomi E. Feldman & Laura Kawano & Elena Patel & Nirupama Rao & Michael Stevens & Jesse Edgerton'}]]
Any help with this would be much appreciated!

Something like below should do, using DictWriter as Patrick mentioned in the comments
import csv
def main():
    '''The Main'''
    data = [[{'Date': '2018-069', 'Title': 'The Effect of Common Ownership on Profits : Evidence From the U.S. Banking Industry', 'Author': 'Jacob P. Gramlich & Serafin J. Grundl'}, {'Date': '2018-068', 'Title': 'The Long and Short of It : Do Public and Private Firms Invest Differently?', 'Author': 'Naomi E. Feldman & Laura Kawano & Elena Patel & Nirupama Rao & Michael Stevens & Jesse Edgerton'}]]
    with open('sample.csv', 'w', newline='') as csvfile:
        fieldnames = data[0][0].keys()
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for row in data[0]:
            writer.writerow(row)
if __name__ == '__main__':
    main()
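If the outer list may hold more than one inner list, a variation that flattens every inner list before writing might be useful. This is only a sketch: the function name nested_dicts_to_csv is hypothetical, it assumes every dictionary has the same keys as the first one, and it writes to an in-memory buffer so that it can be run without touching the disk.

```python
import csv
import io
from itertools import chain


def nested_dicts_to_csv(nested, out):
    """Write a list of lists of dicts to out as CSV; return the row count."""
    rows = list(chain.from_iterable(nested))  # flatten one level
    if not rows:
        return 0
    # Assumption: all dicts share the keys of the first one
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return len(rows)


data = [[
    {"Date": "2018-069",
     "Title": "The Effect of Common Ownership on Profits : Evidence From the U.S. Banking Industry",
     "Author": "Jacob P. Gramlich & Serafin J. Grundl"},
    {"Date": "2018-068",
     "Title": "The Long and Short of It : Do Public and Private Firms Invest Differently?",
     "Author": "Naomi E. Feldman & Laura Kawano & Elena Patel & Nirupama Rao & Michael Stevens & Jesse Edgerton"},
]]

buffer = io.StringIO()
count = nested_dicts_to_csv(data, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file, pass a handle opened with `open('sample.csv', 'w', newline='')` in place of the buffer.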

Replace whole string if it contains substring in pandas dataframe based on dictionary key

I am trying to replace the data in column 'Place' with data from a dictionary I created. Each value in 'Place' contains one of the dictionary keys as a substring (not case sensitive). I cannot get either of my methods to work. Any guidance is appreciated.
incoming_df = pd.DataFrame({'First_Name' : ['John', 'Chris', 'renzo', 'Laura', 'Stan', 'Russ', 'Lip', 'Hick', 'Donald'],
'Last_Name' : ['stanford', 'lee', 'Olivares', 'Johnson', 'Stanley', 'Russaford', 'Lipper', 'Hero', 'Lipsey'],
'location' : ['Grant Elementary', 'Code Academy', 'Queen Prep', 'Waves College', 'duke Prep', 'california Academy', 'SF College Prep', 'San Ramon Prep', 'San Jose High']})
df = pd.DataFrame({'FirstN': [],
'LastN':[],
'Place': []})
# re index based on data given
df = df.reindex(incoming_df.index)
# copy data over to new dataframe
df['LastN'] = incoming_df.loc[:, incoming_df.columns.str.contains('Last', case=False)]
df['FirstN'] = incoming_df.loc[:, incoming_df.columns.str.contains('First', case=False)]
df['Place'] = incoming_df.loc[:, incoming_df.columns.str.contains('School|Work|Site|Location', case=False)]
places = { 'Grant' : 'DEF Grant Elementary',
'Code' : 'DEF Code Academy',
'Queen' : 'DEF Queen Preparatory High School',
'Waves' : 'DEF Waves College Prep',
'Duke' : 'DEF Duke Preparatory Institute',
'California' : 'DEF California Academy',
'SF College' : 'DEF San Francisco College',
'San Ramon' : 'DEF San Ramon Prep',
'San Jose' : 'DEF San Jose High School' }
# replace the values in Place with the dictionary values (results in NaN values inside the 'Place' column)
pat = r'({})'.format('|'.join(places.keys()))
extracted = df.Place.str.extract(pat, expand=False).dropna()
df['Place'] = extracted.apply(lambda x: places[x])
# Also tried this method but did not work
df['Place'] = df['Place'].replace(places)
# original df
FirstN LastN Place
0 John stanford Grant Elementary
1 Chris lee Code Academy
2 renzo Olivares Queen Prep
3 Laura Johnson Waves College
4 Stan Stanley duke Prep
5 Russ Russaford california Academy
6 Lip Lipper SF College Prep
7 Hick Hero San Ramon Prep
8 Donald Lipsey San Jose High
# target df
FirstN LastN Place
0 John Stanford DEF Grant Elementary
1 Chris Lee DEF Code Academy
2 Renzo Olivares DEF Queen Preparatory High School
3 Laura Johnson DEF Waves College Prep
4 Stan Stanley DEF Duke Preparatory Institute
5 Russ Russaford DEF California Academy
6 Lip Lipper DEF San Francisco College
7 Hick Hero DEF San Ramon Prep
8 Donald Lipsey DEF San Jose High School

Using this loop solved my issue (here dic is the lookup dictionary, i.e. places above):
import numpy as np
for k, v in dic.items():
    df['Place'] = np.where(df['Place'].str.contains(k, case=False), v, df['Place'])
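The matching rule behind that loop can also be sketched in plain Python, which may make it easier to test in isolation. The helper name expand_place is hypothetical; it returns the full name for the first key found as a case-insensitive substring and leaves the value unchanged when nothing matches.

```python
def expand_place(place, lookup):
    """Return the full name for the first key contained in place (case-insensitive)."""
    lowered = place.lower()
    for key, full_name in lookup.items():
        if key.lower() in lowered:
            return full_name
    return place  # no key matched: keep the original value


places = {
    'Grant': 'DEF Grant Elementary',
    'Code': 'DEF Code Academy',
    'Queen': 'DEF Queen Preparatory High School',
    'Waves': 'DEF Waves College Prep',
    'Duke': 'DEF Duke Preparatory Institute',
    'California': 'DEF California Academy',
    'SF College': 'DEF San Francisco College',
    'San Ramon': 'DEF San Ramon Prep',
    'San Jose': 'DEF San Jose High School',
}

original = ['Grant Elementary', 'duke Prep', 'california Academy', 'SF College Prep']
expanded = [expand_place(p, places) for p in original]
print(expanded)
```

On a DataFrame the same helper could presumably be applied with `df['Place'].map(lambda p: expand_place(p, places))`. Note that the first matching key wins, so overlapping keys would need to be ordered from most to least specific.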

Using a list comprehension, and making use of next to short circuit and avoid wasted iteration.
df.assign(Place=[next((v for i in df.Place if i in k.lower()), None) for k,v in dic.items()])
Place User
0 Heights College arenzo
1 Queens University brenzo
2 York Academy crenzo
3 Danes Institute drenzo
4 Duke University erenzo

Using apply and loc
for key, value in dic.items():
    df.loc[df['Place'].apply(lambda x: x in key.lower()), 'Place'] = value

This is challenging given the string mismatch on 'Place'. Some naive workarounds:
1) You can utilize an index mapping, reformatting your dict to:
dic = {'1' : 'Heights College',
'2' : 'Queens University',
'3' : 'York Academy',
'4' : 'Danes Institute',
'5' : 'Duke University'}
Then use a map from your dict to df index:
df['Place'] = df.index.to_series().map(dic)
2) Alternatively, if your User column is unique, edit your dic to map each user to a place and use map, which looks up each user in the dict and returns the place:
dic = {'arenzo' : 'Heights College',
'brenzo' : 'Queens University',
'crenzo' : 'York Academy',
'drenzo' : 'Danes Institute',
'erenzo' : 'Duke University'}
df['Place'] = df['User'].map(dic)

Remove extra spaces between columns

I got the output below:
sports(6 spaces)mourinho keen to tie up long-term de gea deal
opinion(5 spaces)the reality of north korea as a nuclear power
How can I make them become sports(1 space) .... and opinion(1 space)... when I write to a .txt file?
Here is my code:
the_frame = pdsql.read_sql_query("SELECT category, title FROM training;", conn)
pd.set_option('display.max_colwidth', -1)
print(the_frame)
the_frame = the_frame.replace('\s+', ' ', regex=True)  # tried to remove multiple spaces
base_filename = 'Values.txt'
with open(os.path.join(base_filename),'w') as outfile:
    df = pd.DataFrame(the_frame)
    df.to_string(outfile, index=False, header=False)

I think your solution is fine; it only needs to be simplified.
I also tested it with multiple tabs and it works well too.
the_frame = pdsql.read_sql_query("SELECT category, title FROM training;", conn)
the_frame = the_frame.replace('\s+', ' ', regex=True)
base_filename = 'Values.txt'
the_frame.to_csv(base_filename, index=False, header=False)
Sample:
the_frame = pd.DataFrame({
'A': ['sports mourinho keen to tie up long-term de gea deal',
'opinion the reality of north korea as a nuclear power'],
'B': list(range(2))
})
print (the_frame)
A B
0 sports mourinho keen to tie up long-term ... 0
1 opinion the reality of north korea as a nu... 1
the_frame = the_frame.replace('\s+', ' ', regex=True)
print (the_frame)
A B
0 sports mourinho keen to tie up long-term de ge... 0
1 opinion the reality of north korea as a nuclea... 1
EDIT: I believe you need to join both columns with a space and write the output to a file without the sep parameter.
the_frame = pd.DataFrame({'category': {0: 'sports', 1: 'sports', 2: 'opinion', 3: 'opinion', 4: 'opinion'}, 'title': {0: 'mourinho keen to tie up long-term de gea deal', 1: 'suarez fires barcelona nine clear in sociedad fightback', 2: 'the reality of north korea as a nuclear power', 3: 'the real fire fury', 4: 'opposition and dr mahathir'}} )
print (the_frame)
category title
0 sports mourinho keen to tie up long-term de gea deal
1 sports suarez fires barcelona nine clear in sociedad ...
2 opinion the reality of north korea as a nuclear power
3 opinion the real fire fury
4 opinion opposition and dr mahathir
the_frame = the_frame['category'] + ' ' + the_frame['title']
print (the_frame)
0 sports mourinho keen to tie up long-term de ge...
1 sports suarez fires barcelona nine clear in so...
2 opinion the reality of north korea as a nuclea...
3 opinion the real fire fury
4 opinion opposition and dr mahathir
dtype: object
base_filename = 'Values.txt'
the_frame.to_csv(base_filename, index=False, header=False)
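The join-then-collapse step can be illustrated without pandas as well. The sketch below is an assumption-laden stand-in: the helper name join_and_collapse and the padded sample rows are invented, and the output goes to an in-memory buffer rather than Values.txt.

```python
import io
import re


def join_and_collapse(category, title):
    """Join two fields with a space and collapse any run of whitespace to one space."""
    combined = f"{category} {title}"
    return re.sub(r"\s+", " ", combined).strip()


# Hypothetical rows with padding similar to the to_string output
rows = [
    ("sports      ", "mourinho keen to tie up long-term de gea deal"),
    ("opinion     ", "the reality of north korea as a nuclear power"),
]

lines = [join_and_collapse(category, title) for category, title in rows]

out = io.StringIO()  # swap for open('Values.txt', 'w') to write a real file
out.write("\n".join(lines) + "\n")
print(out.getvalue())
```

The padding in the original output comes from to_string aligning the columns, so building each line yourself sidesteps it entirely.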

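If the_frame is a Series (a single column), you can try the following instead of
the_frame = the_frame.replace('\s+', ' ', regex=True)
# use the syntax below (.str is available on a Series, not on a whole DataFrame)
the_frame = the_frame.str.replace('\s+', ' ', regex=True)  # collapses runs of whitespace into one space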

Aggregate sets according to keys with defaultdict python

I have a bunch of lines in text with names and teams in this format:
Team (year)|Surname1, Name1
e.g.
Yankees (1993)|Abbot, Jim
Yankees (1994)|Abbot, Jim
Yankees (1993)|Assenmacher, Paul
Yankees (2000)|Buddies, Mike
Yankees (2000)|Canseco, Jose
and so on for several years and several teams.
I would like to aggregate names of players according to team (year) combination deleting any duplicated names (it may happen that in the original database there is some redundant information). In the example, my output should be:
Yankees (1993)|Abbot, Jim|Assenmacher, Paul
Yankees (1994)|Abbot, Jim
Yankees (2000)|Buddies, Mike|Canseco, Jose
I've written this code so far:
file_in = open('filein.txt')
file_out = open('fileout.txt', 'w+')
from collections import defaultdict
teams = defaultdict(set)
for line in file_in:
    items = [entry.strip() for entry in line.split('|') if entry]
    team = items[0]
    name = items[1]
    teams[team].add(name)
I end up with a big dictionary made up by keys (the name of the team and the year) and sets of values. But I don't know exactly how to go on to aggregate things up.
I would also like to be able to compare my final sets of values (e.g. how many players do the Yankees teams of 1993 and 1994 have in common?). How can I do this?
Any help is appreciated

You can use a tuple as a key here, e.g. ('Yankees', '1994') (Python 2 syntax):
from collections import defaultdict
dic = defaultdict(list)
with open('abc') as f:
    for line in f:
        key,val = line.split('|')
        keys = tuple(x.strip('()') for x in key.split())
        vals = [x.strip() for x in val.split(', ')]
        dic[keys].append(vals)
print dic
for k,v in dic.iteritems():
    print "{}({})|{}".format(k[0],k[1],"|".join([", ".join(x) for x in v]))
Output:
defaultdict(<type 'list'>,
{('Yankees', '1994'): [['Abbot', 'Jim']],
('Yankees', '2000'): [['Buddies', 'Mike'], ['Canseco', 'Jose']],
('Yankees', '1993'): [['Abbot', 'Jim'], ['Assenmacher', 'Paul']]})
Yankees(1994)|Abbot, Jim
Yankees(2000)|Buddies, Mike|Canseco, Jose
Yankees(1993)|Abbot, Jim|Assenmacher, Paul
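Since the question also asks about removing duplicates and comparing rosters, here is a Python 3 sketch that keeps the defaultdict(set) from the question instead of the lists used above. The function name aggregate and the in-memory sample (including a deliberately repeated line) are assumptions for illustration; set intersection with & gives the players two rosters share.

```python
from collections import defaultdict


def aggregate(lines):
    """Group player names by 'Team (year)', dropping duplicate names."""
    teams = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        team, _, player = line.partition("|")
        teams[team.strip()].add(player.strip())
    return teams


# Hypothetical stand-in for filein.txt, with one duplicated line
sample = [
    "Yankees (1993)|Abbot, Jim",
    "Yankees (1994)|Abbot, Jim",
    "Yankees (1993)|Assenmacher, Paul",
    "Yankees (1993)|Abbot, Jim",
    "Yankees (2000)|Buddies, Mike",
    "Yankees (2000)|Canseco, Jose",
]

teams = aggregate(sample)

# One output line per team, with names sorted for a stable order
output_lines = ["|".join([team] + sorted(teams[team])) for team in sorted(teams)]
print("\n".join(output_lines))

# Players that the 1993 and 1994 rosters have in common
common = teams["Yankees (1993)"] & teams["Yankees (1994)"]
print(len(common), common)
```

Sorting is optional; it is used here only because sets have no defined order, so the output would otherwise vary between runs.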
