How to merge three files with common id in pandas? - python

I have three files which are users.dat, ratings.dat and movies.dat.
users.dat
1::F::1::10::48067
1::F::1::10::48067
1::F::1::10::48067
1::F::1::10::48067
1::F::1::10::48067
1::F::1::10::48067
1::F::1::10::48067
1::F::1::10::48067
ratings.dat
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
1::2355::5::978824291
1::1197::3::978302268
1::1287::5::978302039
1::2804::5::978300719
movies.dat
1193::One Flew Over the Cuckoo's Nest (1975)::Drama
661::James and the Giant Peach (1996)::Animation|Children's|Musical
914::My Fair Lady (1964)::Musical|Romance
3408::Erin Brockovich (2000)::Drama
2355::Bug's Life, A (1998)::Animation|Children's|Comedy
1197::Princess Bride, The (1987)::Action|Adventure|Comedy|Romance
1287::Ben-Hur (1959)::Action|Adventure|Drama
2804::Christmas Story, A (1983)::Comedy|Drama
My expected output
1::1193::5::978300760::F::1::10::48067::One Flew Over the Cuckoo's Nest::Drama::1975
1::661::3::978302109::F::1::10::48067::James and the Giant Peach::Animation|Children's|Musical::1996
1::914::3::978301968::F::1::10::48067::My Fair Lady ::Musical|Romance::1964
1::3408::4::978300275::F::1::10::48067::Erin Brockovich ::Drama::2000
1::2355::5::978824291::F::1::10::48067::Bug's Life, A ::Animation|Children's|Comedy::1998
I am trying to merge these files without using pandas. I created three dictionaries, with the user ID as the common key, and then tried to merge the three files using the users' keys. However, the result is not exactly what I want. Any advice or suggestions would be greatly appreciated.
My code
import json
file = open("users.dat","r",encoding = 'utf-8')
users={}
for line in file:
    x = line.split('::')
    user_id=x[0]
    gender=x[1]
    age=x[2]
    occupation=x[3]
    i_zip=x[4]
    users[user_id]=gender,age,occupation,i_zip.strip()
file = open("movies.dat","r",encoding='latin-1')
movies={}
for line in file:
    x = line.split('::')
    movie_id=x[0]
    title=x[1]
    genre=x[2]
    movies[movie_id]=title,genre.strip()
file = open("ratings.dat","r")
ratings={}
for line in file:
    x = line.split('::')
    a=x[0]
    b=x[1]
    c=x[2]
    d=x[3]
    ratings[a]=b,c,d.strip()
newdict = {}
newdict.update(users)
newdict.update(movies)
newdict.update(ratings)
for i in users.keys():
    addition = users[i] + movies[i]+ratings[i]
    newdict[i] = addition
with open('data.txt', 'w') as outfile:
    json.dump(newdict, outfile)
My output like this
{"1": ["F", "1", "10", "48067", "Toy Story (1995)", "Animation|Children's|Comedy", "1246", "4", "978302091"], "2": ["M", "56", "16", "70072", "Jumanji (1995)", "Adventure|Children's|Fantasy", "1247", "5", "978298652"],

The first mistake in your code (apart from the messed-up indentation) is that you build the ratings dictionary with the user ID as the key:
ratings[a]=b,c,d.strip()
For your dataset, the ratings dictionary ends up as { '1': ('2804', '5', '978300719') }. All but one rating are lost, because you have only one user and each assignment overwrites the previous value for that key.
What you want instead is to treat the ratings data as a list, not a dictionary. The result you are trying to achieve is essentially an extended version of the ratings: it has as many rows as there are ratings.
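To see the overwrite concretely, here is a small sketch using three of the sample rating lines from the question. The variable names ratings_dict and ratings_list are only for illustration:

```python
# Three rating lines for the same user, taken from the question's sample.
lines = [
    "1::1193::5::978300760",
    "1::661::3::978302109",
    "1::914::3::978301968",
]

ratings_dict = {}
ratings_list = []
for line in lines:
    user_id, movie_id, score, dt = line.split("::")
    # Keyed by user ID: each new rating replaces the previous one.
    ratings_dict[user_id] = (movie_id, score, dt)
    # A list keeps one entry per rating.
    ratings_list.append((user_id, movie_id, score, dt))

print(len(ratings_dict))  # 1
print(len(ratings_list))  # 3
```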
Secondly, you don't need the json module, since your desired output is not in JSON format.
Here's code that does the job:
#!/usr/bin/env python3
# Part 1: collect data from the files
users = {}
file = open("users.dat","r",encoding = 'utf-8')
for line in file:
    user_id, gender, age, occupation, i_zip = line.rstrip().split('::')
    users[user_id] = (gender, age, occupation, i_zip)
movies={}
file = open("movies.dat","r",encoding='latin-1')
for line in file:
    movie_id, title, genre = line.rstrip().split('::')
    # Parse year from title
    title = title.rstrip()
    year = 'N/A'
    if title[-1]==')' and '(' in title:
        short_title, in_parenthesis = title.rsplit('(', 1)
        in_parenthesis = in_parenthesis.rstrip(')').rstrip()
        if in_parenthesis.isdigit() and len(in_parenthesis)==4:
            # Text in parenthesis has four digits - it must be year
            title = short_title.rstrip()
            year = in_parenthesis
    movies[movie_id] = (title, genre, year)
ratings=[]
file = open("ratings.dat","r")
for line in file:
    user_id, movie_id, score, dt = line.rstrip().split('::')
    ratings.append((user_id, movie_id, score, dt))
# Part 2: save the output
file = open('output.dat','w',encoding='utf-8')
for user_id, movie_id, score, dt in ratings:
    # Get user data from dictionary
    gender, age, occupation, i_zip = users[user_id]
    # Get movie data from dictionary
    title, genre, year = movies[movie_id]
    # Merge data into a single string
    row = '::'.join([user_id, movie_id, score, dt,
                     gender, age, occupation, i_zip,
                     title, genre, year])
    # Write to the file
    file.write(row + '\n')
file.close()
Part 1 is based on your code, with the main differences that I save the ratings to a list (not dictionary) and that I added parsing of years.
Part 2 is where the output is being saved.
Contents of output.dat file after running the script:
1::1193::5::978300760::F::1::10::48067::One Flew Over the Cuckoo's Nest::Drama::1975
1::661::3::978302109::F::1::10::48067::James and the Giant Peach::Animation|Children's|Musical::1996
1::914::3::978301968::F::1::10::48067::My Fair Lady::Musical|Romance::1964
1::3408::4::978300275::F::1::10::48067::Erin Brockovich::Drama::2000
1::2355::5::978824291::F::1::10::48067::Bug's Life, A::Animation|Children's|Comedy::1998
1::1197::3::978302268::F::1::10::48067::Princess Bride, The::Action|Adventure|Comedy|Romance::1987
1::1287::5::978302039::F::1::10::48067::Ben-Hur::Action|Adventure|Drama::1959
1::2804::5::978300719::F::1::10::48067::Christmas Story, A::Comedy|Drama::1983
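The year parsing in Part 1 could also be written with a regular expression. This is only a sketch of an alternative, not a change to the script above; the helper name split_title_year and the TITLE_YEAR pattern are made up for the example, and it assumes the year is always four digits in parentheses at the very end of the title:

```python
import re

# Matches "<title> (<four digits>)" with the year at the very end.
TITLE_YEAR = re.compile(r"(.*?)\s*\((\d{4})\)\s*")

def split_title_year(title):
    """Return (title_without_year, year), or (title, 'N/A') if no year is found."""
    match = TITLE_YEAR.fullmatch(title)
    if match is None:
        return title.strip(), "N/A"
    return match.group(1), match.group(2)

print(split_title_year("Bug's Life, A (1998)"))
print(split_title_year("Untitled"))
```

Because the pattern must match the whole string, parentheses earlier in the title are left alone and only a trailing four-digit group is treated as the year.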

Related

How to read txt file data and convert into nested dictionary?

I have this txt file but I'm having trouble converting it into a nested dictionary in Python. The txt file only has the values for each pokemon and is missing the keys such as 'quantity' or 'fee'. Below is the content of the txt file. (I can change the txt file if needed.)
charmander,3,100,fire
squirtle,2,50,water
bulbasaur,5,25,grass
gyrados,1,1000,water flying
This is my desired dictionary:
pokemon = {
'charmander':{'quantity':3,'fee':100,'powers':['fire']},
'squirtle':{'quantity':2,'fee':50,'powers':['water']},
'bulbasaur':{'quantity':5,'fee':25,'powers':['grass']},
'gyrados':{'quantity':1,'fee':1000,'powers':['water','flying']}
}
Convert text file to lines, then process each line using "," delimiters. For powers, split the string again using " " delimiter. Then just package each extracted piece of information into your dict structure as below.
with open('pokemonInfo.txt') as f:
    data = f.readlines()
pokemon = {}  # not named "dict", which would shadow the built-in type
for r in data:
    fields = r.strip().split(",")
    pName = fields[0]
    qty = int(fields[1])  # convert to int to match the desired dictionary
    fee = int(fields[2])
    powers = fields[3]
    pokemon[pName] = {"quantity": qty, "fee": fee, "powers": powers.split()}
for record in pokemon.items():
    print(record)
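As a variation on the same idea, the splitting on commas can be left to the standard csv module. This is a sketch only: the function name parse_pokemon is made up, the sample string stands in for the text file, and every non-blank line is assumed to have exactly four comma-separated fields:

```python
import csv
import io

# Stand-in for the contents of the text file from the question.
sample = """charmander,3,100,fire
squirtle,2,50,water
bulbasaur,5,25,grass
gyrados,1,1000,water flying
"""

def parse_pokemon(lines):
    """Build the nested dictionary from an iterable of comma-separated lines."""
    result = {}
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        name, quantity, fee, powers = row
        result[name] = {
            "quantity": int(quantity),
            "fee": int(fee),
            "powers": powers.split(),  # splits on any whitespace
        }
    return result

pokemon = parse_pokemon(io.StringIO(sample))
print(pokemon["gyrados"])
```

For a real file, an open file object can be passed in place of the io.StringIO wrapper.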

Program should read from a file and returns a dictionary but returning a type error

The dataset looks like this-
Action|10|Golden Tree (2012)
Drama|3|Titanic (1967)
So it is Genre|SerialNo|Movie
Required output is-
{ "Toy Story (1995)" : "Adventure", "Golden Tree (2012)" : "Action" }
Currently, the only output generated is "Action", I tried to write some code to fix it, but returns a type error. How do I fix this?
from collections import defaultdict
def read_genre_data(file):
    movie_genre_dict = {}
    ratings = defaultdict(list)
    for line in open(file):
        genre, num, movie = line.split('|')
        #movie[genre].append(movie)
    return genre
readGenre = read_genre_data("genreMovieSample.txt")
print(readGenre)
You need to add to the dictionary, and then return the dictionary. You're just returning the value of genre from the last line of the file.
def read_genre_data(file):
    movie_genre_dict = {}
    with open(file) as f:
        for line in f:
            # strip() removes the trailing newline so it does not end up in the movie title
            genre, num, movie = line.strip().split('|')
            movie_genre_dict[movie] = genre
    return movie_genre_dict
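To try the same logic without a file on disk, the parsing can be separated from the file handling. The sketch below uses a hypothetical helper, parse_genre_lines, that accepts any iterable of lines; it assumes each non-blank line has exactly three '|'-separated parts:

```python
def parse_genre_lines(lines):
    """Map movie title -> genre from lines shaped like 'Genre|SerialNo|Movie'."""
    movie_genre = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        genre, _num, movie = line.split("|")
        movie_genre[movie] = genre
    return movie_genre

# The two sample lines from the question, with trailing newlines as read from a file.
sample_lines = ["Action|10|Golden Tree (2012)\n", "Drama|3|Titanic (1967)\n"]
print(parse_genre_lines(sample_lines))
```

An open file object is also an iterable of lines, so it can be passed in directly.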

Extract data from text file using Python (or any language)

I have a text file that looks like:
First Name Bob
Last name Smith
Phone 555-555-5555
Email bob@bob.com
Date of Birth 11/02/1986
Preferred Method of Contact Text Message
Desired Appointment Date 04/29
Desired Appointment Time 10am
City Pittsburgh
Location State
IP Address x.x.x.x
User-Agent (Browser/OS) Apple Safari 14.0.3 / OS X
Referrer http://www.example.com

First Name john
Last name Smith
Phone 555-555-4444
Email john@gmail.com
Date of Birth 03/02/1955
Preferred Method of Contact Text Message
Desired Appointment Date 05/22
Desired Appointment Time 9am
City Pittsburgh
Location State
IP Address x.x.x.x
User-Agent (Browser/OS) Apple Safari 14.0.3 / OS X
Referrer http://www.example.com
.... and so on
I need to extract each entry to a csv file, so the data should look like: first name, last name, phone, email, etc. I don't even know where to start on something like this.
First of all, you'll need to open the text file in read mode.
I'd suggest using a context manager, like so:
with open('path/to/your/file.txt', 'r') as file:
    for line in file.readlines():
        # do something with the line (it is a string)
        pass
As for managing the info, you could build some intermediate structure, for example a dictionary or a list of dictionaries, and then translate that into a CSV file with the csv module.
You could, for example, split the file whenever there is a blank line, maybe like this:
with open('Downloads/test.txt', 'r') as f:
    my_list = list() # this will be the final list
    entry = dict() # this contains each user info as a dict
    for line in f.readlines():
        if line.strip() == "": # if line is empty start a new dict
            my_list.append(entry) # and append the old one to the list
            entry = dict()
        else: # otherwise split the line and create new dict
            line_items = line.split(r' ')
            print(line_items)
            entry[line_items[0]] = line_items[1]
print(my_list)
This code won't work as-is because your text is not formatted consistently: you need a reliable way to split each line into "title" and "content" (like "First Name" and "Bob"). I suggest looking at regex, or fixing the txt file by making the spacing more consistent.
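Once the records are in a list of dictionaries, the csv module step mentioned above could look roughly like this. It is a sketch under assumptions: the records list is hypothetical sample data, records_to_csv is a made-up helper name, and only three of the fields are shown:

```python
import csv
import io

# Hypothetical records, shaped like the list of dictionaries described above.
records = [
    {"First Name": "Bob", "Last name": "Smith", "Phone": "555-555-5555"},
    {"First Name": "john", "Last name": "Smith", "Phone": "555-555-4444"},
]

def records_to_csv(records, fieldnames, out):
    """Write the records to the file-like object `out` as CSV with a header row."""
    # restval fills in an empty string for any field a record is missing.
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(records)

buffer = io.StringIO()  # use open("out.csv", "w", newline="") for a real file
records_to_csv(records, ["First Name", "Last name", "Phone"], buffer)
print(buffer.getvalue())
```

csv.DictWriter takes care of quoting, so values that contain commas still end up in a single column.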
Assuming the data resides in a:
a="""
First Name Bob
Last name Smith
Phone 555-555-5555
Email bob@bob.com
Date of Birth 11/02/1986
Preferred Method of Contact Text Message
Desired Appointment Date 04/29
Desired Appointment Time 10am
City Pittsburgh
Location State
IP Address x.x.x.x
User-Agent (Browser/OS) Apple Safari 14.0.3 / OS X
Referrer http://www.example.com

First Name john
Last name Smith
Phone 555-555-4444
Email john@gmail.com
Date of Birth 03/02/1955
Preferred Method of Contact Text Message
Desired Appointment Date 05/22
Desired Appointment Time 9am
City Pittsburgh
Location State
IP Address x.x.x.x
User-Agent (Browser/OS) Apple Safari 14.0.3 / OS X
Referrer http://www.example.com
"""
line_sep = "\n" # CHANGE ME ACCORDING TO DATA
fields = ["First Name", "Last name", "Phone",
          "Email", "Date of Birth", "Preferred Method of Contact",
          "Desired Appointment Date", "Desired Appointment Time",
          "City", "Location", "IP Address", "User-Agent","Referrer"]
records = a.split(line_sep * 2)
all_records = []
for record in records:
    splitted_record = record.split(line_sep)
    one_record = {}
    csv_record = []
    for f in fields:
        found = False
        for one_field in splitted_record:
            if one_field.startswith(f):
                data = one_field[len(f):].strip()
                one_record[f] = data
                csv_record.append(data)
                found = True
        if not found:
            csv_record.append("")
    all_records.append(";".join(csv_record))
one_record will hold the record as a dictionary, and csv_record will hold it as a list of fields (ordered as in the fields variable).
Edited to add: ignore this answer, the code from Koko Jumbo looks infinitely more sensible and actually gives you a CSV file at the end of it! It was a fun exercise though :)
Just to expand on fcagnola's code a bit.
If it's a quick and dirty one-off, and you know that the data will be consistently presented, the following should work to create a list of dictionaries with the correct key/value pairing. Each line is processed by splitting the line and comparing the line number (reset to 0 with each new dict) against an array of values that represent where the boundary between key and value falls.
For example, "First Name Bob" becomes ["First","Name","Bob"]. The function has been told that linenumber= 0 so it checks entries[linenumber] to get the value "2", which it uses to join the key name (items 0 & 1) and then join the data (items 2 onwards). The end result is ["First Name", "Bob"] which is then added to the dictionary.
class Extract:
    def extractEntry(self,linedata,lineindex):
        # Hardcoded list! The quick and dirty part.
        # This is specific to the example data provided. The entries
        # represent the index to be used when splitting the string
        # between the key and the data
        entries = (2,2,1,1,3,4,3,3,1,1,2,2,1)
        return self.createNewEntry(linedata,entries[lineindex])
    def createNewEntry(self,linedata,dataindex):
        list_data = linedata.split()
        key = " ".join(list_data[:dataindex])
        data = " ".join(list_data[dataindex:])
        return [key,data]
with open('test.txt', 'r') as f:
    my_list = list() # this will be the final list
    entry = dict() # this contains each user info as a dict
    extr = Extract() # class for splitting the entries into key/value
    x = 0
    for line in f.readlines():
        if line.strip() == "": # if line is empty start a new dict
            my_list.append(entry) # and append the old one to the list
            entry = dict()
            x = 0
        else: # otherwise split the line and create new dict
            extracted_data = extr.extractEntry(line,x)
            entry[extracted_data[0]] = extracted_data[1]
            x += 1
    my_list.append(entry)
print(my_list)

save two list in one json file

I'm collecting data into two lists and I want to save both of them in a single JSON file. Can someone help me?
I'm using Selenium.
def get_name(self):
    name = []
    name = self.find_elements_by_class_name ('item-desc')
    price = []
    price = self.find_elements_by_class_name ('item-goodPrice')
    for names in name :
        names = (names.text)
        #print names
    for prices in price :
        prices = (prices.text)
        #print price
I would create a dictionary and then dump it as JSON.
An example could be:
import json
def get_name(self):
    names = [ name.text for name in self.find_elements_by_class_name('item-desc') ]
    prices = [ price.text for price in self.find_elements_by_class_name('item-goodPrice')]
    with open('output-file-name.json', 'w') as f:
        f.write(json.dumps({'names': names, 'prices': prices}))
EDIT: The first version of this answer only created the JSON string. If you want to write it to a file as well, include what @Andersson suggested in the comments.
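If each name belongs with the price at the same position, another option is to pair them up before dumping, so the JSON holds one object per item. This is only a sketch: the names and prices values below are hypothetical stand-ins for the scraped element texts, and no Selenium call is involved:

```python
import json

# Hypothetical scraped values standing in for the element texts.
names = ["Widget", "Gadget"]
prices = ["9.99", "19.50"]

# Pair each name with the price at the same position.
# zip() stops at the shorter list, so unmatched trailing items are dropped.
items = [{"name": n, "price": p} for n, p in zip(names, prices)]

payload = json.dumps(items, indent=2)
print(payload)
```

The payload string can then be written to a file in the same way as in the answer above.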

Converting a text file into csv file using python

I have a requirement wherein I need to convert my text files into CSV, and I am using Python to do it. My text file looks like this:
Employee Name : XXXXX
Employee Number : 12345
Age : 45
Hobbies: Tennis
Employee Name: xxx
Employee Number :123456
Hobbies : Football
I want my CSV file to have the column names as Employee Name, Employee Number , Age and Hobbies and when a particular value is not present it should have a value of NA in that particular place. Any simple solutions to do this? Thanks in advance
You can do something like this:
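records = """Employee Name : XXXXX
Employee Number : 12345
Age : 45
Hobbies: Tennis
Employee Name: xxx
Employee Number :123456
Hobbies : Football"""
for record in records.split('Employee Name'):
    if not record.strip():
        continue  # skip the empty chunk before the first "Employee Name"
    name = 'NA'
    number = 'NA'
    age = 'NA'
    hobbies = 'NA'
    for field in record.split('\n'):
        if ':' not in field:
            continue  # skip blank lines
        field_name, field_value = field.split(':', 1)
        field_name = field_name.strip()  # the spacing around ':' is inconsistent
        field_value = field_value.strip()
        if field_name == "": # This is employee name, since we split on it
            name = field_value
        if field_name == "Employee Number":
            number = field_value
        if field_name == "Age":
            age = field_value
        if field_name == "Hobbies":
            hobbies = field_value
    # One comma-separated row per employee
    print(','.join([name, number, age, hobbies]))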
Of course, this method assumes that there is (at least) Employee Name field in every record.
Maybe this helps you get started? It's just the static output of the first employee data. You would now need to wrap this into some sort of iteration over the file. There is very very likely a more elegant solution, but this is how you would do it without a single import statement ;)
with open('test.txt', 'r') as f:
    content = f.readlines()
output_line = "".join([line.split(':')[1].replace('\n',';').strip() for line in content[0:4]])
print(output_line)
I followed very simple steps for this. It may not be optimal, but it solves the problem. The important case I can see here is that the same key ("Employee Name" etc.) can occur multiple times in a single file.
Steps:
1. Read the txt file into a list of lines.
2. Convert the list to a dict (the logic here can be improved, or more complex lambdas can be added).
3. Simply use pandas to convert the dict to CSV.
Below is the code:
import pandas
txt_file = r"test.txt"
with open(txt_file, "r") as txt:
    txt_lines = txt.read().split("\n")
txt_dict = {}
for txt_line in txt_lines:
    if ":" not in txt_line:
        continue  # skip blank lines
    k, v = txt_line.split(":", 1)
    k = k.strip()
    v = v.strip()
    # collect every value seen for a key in a list
    txt_dict.setdefault(k, []).append(v)
print(pandas.DataFrame.from_dict(txt_dict, orient="index"))
Output (row order may vary with the Python version):
                      0         1
Employee Number   12345    123456
Age                  45      None
Employee Name     XXXXX       xxx
Hobbies          Tennis  Football
I hope this helps.
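For the NA requirement in the question, here is a sketch that avoids pandas and writes one CSV row per employee with the standard csv module. It rests on assumptions: every record is taken to start with an "Employee Name" line, the four column names are the only keys that occur, and parse_employees and COLUMNS are names made up for the example:

```python
import csv
import io

# Sample text from the question.
text = """Employee Name : XXXXX
Employee Number : 12345
Age : 45
Hobbies: Tennis
Employee Name: xxx
Employee Number :123456
Hobbies : Football"""

COLUMNS = ["Employee Name", "Employee Number", "Age", "Hobbies"]

def parse_employees(lines):
    """Group 'key : value' lines into one dict per employee."""
    records = []
    for line in lines:
        if ":" not in line:
            continue  # skip blank or malformed lines
        key, value = (part.strip() for part in line.split(":", 1))
        if key == "Employee Name" or not records:
            records.append({})  # a new employee starts here
        records[-1][key] = value
    return records

out = io.StringIO()  # swap for open("employees.csv", "w", newline="")
# restval="NA" fills in any column a record does not have.
writer = csv.DictWriter(out, fieldnames=COLUMNS, restval="NA")
writer.writeheader()
writer.writerows(parse_employees(text.splitlines()))
print(out.getvalue())
```

With the sample text, the second employee has no Age line, so that cell is written as NA.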
