Python: Extra comma in a csv.DictReader column

I have a read function that reads a CSV file using csv.DictReader. The file.csv is comma-separated and it reads fully. However, part of my file has a column that contains extra commas. My question is: how can I make sure those commas are counted as part of a single column? I cannot alter my csv file to meet the criteria.
Text File:
ID,Name,University,Street,ZipCode,Country
12,Jon Snow,U of Winterfell,Winterfell #45,60434,Westeros
13,Steve Rogers,NYU,108, Chelsea St.,23333,United States
20,Peter Parker,Yale,34, Tribeca,32444,United States
34,Tyrion Lannister,U of Casterly Rock,Kings Landing #89, 43543,Westeros
The desired output is this:
{'ID': '12', 'Name': 'Jon Snow', 'University': 'U of Winterfell', 'Street': 'Winterfell #45', 'ZipCode': '60434', 'Country': 'Westeros'}
{'ID': '13', 'Name': 'Steve Rogers', 'University': 'NYU', 'Street': '108, Chelsea St.', 'ZipCode': '23333', 'Country': 'United States'}
{'ID': '20', 'Name': 'Peter Parker', 'University': 'Yale', 'Street': '34, Tribeca', 'ZipCode': '32444', 'Country': 'United States'}
{'ID': '34', 'Name': 'Tyrion Lannister', 'University': 'U of Casterly Rock', 'Street': 'Kings Landing #89', 'ZipCode': '43543', 'Country': 'Westeros'}
As you can tell, the 'Street' value contains an extra comma after the street number:
13,Steve Rogers,NYU,108, Chelsea St.,23333,United States
20,Peter Parker,Yale,34, Tribeca,32444,United States
Note: Most of the columns split on str,str, BUT under the 'Street' column the pattern is str, str (there is an extra space after the comma). I hope this makes sense.
One option I looked at is re.split, but I don't know how to apply it to my read file. I was thinking re.split(r'(?!\s),(?!\s)',x[:-1])? How can I make sure the commas in my file are counted as part of the column? I can't use pandas.
My current output looks like this right now:
{'ID': '12', 'Name': 'Jon Snow', 'University': 'U of Winterfell', 'Street': 'Winterfell #45', 'ZipCode': '60434', 'Country': 'Westeros'}
{'ID': '13', 'Name': 'Steve Rogers', 'University': 'NYU', 'Street': '108', 'ZipCode': 'Chelsea St.', 'Country': '23333', None: ['United States']}
{'ID': '20', 'Name': 'Peter Parker', 'University': 'Yale', 'Street': '34', 'ZipCode': 'Tribeca', 'Country': '32444', None: ['United States']}
{'ID': '34', 'Name': 'Tyrion Lannister', 'University': 'U of Casterly Rock', 'Street': 'Kings Landing #89', 'ZipCode': '43543', 'Country': 'Westeros'}
This is my read function:
import csv

rows = []
with open('file.csv', mode='r') as csv_file:
    csv_reader = csv.DictReader(csv_file, delimiter=",", skipinitialspace=True)
    for row in csv_reader:
        rows.append(dict(row))
        print(dict(row))

You can't use the csv module if the file isn't in valid CSV format.
You need to call re.split() on the raw lines, not on dictionaries.
import re

rows = []
with open('file.csv', mode='r') as csv_file:
    keys = csv_file.readline().strip().split(',')  # read the header line
    for line in csv_file:
        row = re.split(r'(?!\s),(?!\s)', line.strip())
        rows.append(dict(zip(keys, row)))

The actual solution to the problem is to modify the script that generates the csv file.
If you have a chance to modify that output, you can do one of two things:
Use a delimiter other than a comma, such as | or ;, anything you are sure doesn't appear in the data.
Or enclose every column in double quotes, so you can safely split on the commas that are actual separators.
If you don't have a chance to modify the output, and if you are sure the extra commas occur only in the street column, then use csv.reader instead of DictReader. That way you can fetch the columns by the indexes you already know: row[0] is ID, row[1] is Name, row[-1] is Country and row[-2] is ZipCode, so row[3:-2] gives you the street parts. The indexes can be adjusted, but the idea should be clear.
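To illustrate the quoting option (my own sketch, not from the original post): once the street value is enclosed in double quotes, the csv module parses the row correctly on its own. A quick demo with an in-memory file and a hypothetical quoted version of one problem row:

```python
import csv
import io

# hypothetical quoted version of one of the problem rows
text = ('ID,Name,University,Street,ZipCode,Country\n'
        '13,Steve Rogers,NYU,"108, Chelsea St.",23333,United States\n')

rows = list(csv.DictReader(io.StringIO(text)))
print(rows[0]['Street'])  # the embedded comma survives: 108, Chelsea St.
```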
Hope that helps.
Edit:
import csv

rows = []
with open('file.csv', mode='r') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=",", skipinitialspace=True)
    next(csv_reader)  # skip the header row
    for row in csv_reader:
        rows.append({"ID": row[0],
                     "Name": row[1],
                     "University": row[2],
                     "Street": ' '.join(row[3:-2]),
                     "ZipCode": row[-2],
                     "Country": row[-1]})
print(rows)
--
Here is the output (with pprint)
[{'Country': 'Westeros',
  'ID': '12',
  'Name': 'Jon Snow',
  'Street': 'Winterfell #45',
  'University': 'U of Winterfell',
  'ZipCode': '60434'},
 {'Country': 'United States',
  'ID': '13',
  'Name': 'Steve Rogers',
  'Street': '108 Chelsea St.',
  'University': 'NYU',
  'ZipCode': '23333'},
 {'Country': 'United States',
  'ID': '20',
  'Name': 'Peter Parker',
  'Street': '34 Tribeca',
  'University': 'Yale',
  'ZipCode': '32444'},
 {'Country': 'Westeros',
  'ID': '34',
  'Name': 'Tyrion Lannister',
  'Street': 'Kings Landing #89',
  'University': 'U of Casterly Rock',
  'ZipCode': '43543'}]
-- second edit
Fixed the index on the street column.
Regards.

Related

How to create csv file without blank rows [duplicate]

This question already has answers here:
CSV file written with Python has blank lines between each row
(11 answers)
Closed last year.
I have the below code, which creates a csv file:
import csv

# my data rows as dictionary objects
mydict = [{'branch': 'COE', 'cgpa': '9.0', 'name': 'Nikhil', 'year': '2'},
          {'branch': 'COE', 'cgpa': '9.1', 'name': 'Sanchit', 'year': '2'},
          {'branch': 'IT', 'cgpa': '9.3', 'name': 'Aditya', 'year': '2'},
          {'branch': 'SE', 'cgpa': '9.5', 'name': 'Sagar', 'year': '1'},
          {'branch': 'MCE', 'cgpa': '7.8', 'name': 'Prateek', 'year': '3'},
          {'branch': 'EP', 'cgpa': '9.1', 'name': 'Sahil', 'year': '2'}]
# field names
fields = ['name', 'branch', 'year', 'cgpa']
# name of csv file
filename = "university_records.csv"
# writing to csv file
with open(filename, 'w') as csvfile:
    # creating a csv dict writer object
    writer = csv.DictWriter(csvfile, fieldnames=fields)
    # writing headers (field names)
    writer.writeheader()
    # writing data rows
    writer.writerows(mydict)
Running the above code gives the below Excel sheet.
It contains blank rows as well. How can I remove these blank rows? Thanks
You could create a DataFrame from your dicts and then just use
df.to_csv(filename, sep=your_sep); pandas does not add the blank rows.
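A minimal sketch of that approach (note that to_csv is a method called on the DataFrame itself, and index=False keeps the row index out of the file):

```python
import pandas as pd

mydict = [{'branch': 'COE', 'cgpa': '9.0', 'name': 'Nikhil', 'year': '2'},
          {'branch': 'EP', 'cgpa': '9.1', 'name': 'Sahil', 'year': '2'}]

# pandas writes one line per row, with no blank lines in between
df = pd.DataFrame(mydict, columns=['name', 'branch', 'year', 'cgpa'])
df.to_csv('university_records.csv', index=False)
```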
Adding the newline='' in the with open ... does the trick:
import csv

my_dict = [{'branch': 'COE', 'cgpa': '9.0', 'name': 'Nikhil', 'year': '2'},
           {'branch': 'COE', 'cgpa': '9.1', 'name': 'Sanchit', 'year': '2'},
           {'branch': 'IT', 'cgpa': '9.3', 'name': 'Aditya', 'year': '2'},
           {'branch': 'SE', 'cgpa': '9.5', 'name': 'Sagar', 'year': '1'},
           {'branch': 'MCE', 'cgpa': '7.8', 'name': 'Prateek', 'year': '3'},
           {'branch': 'EP', 'cgpa': '9.1', 'name': 'Sahil', 'year': '2'}]
fields = ['name', 'branch', 'year', 'cgpa']
filename = "foo_bar.csv"
with open(filename, 'w', newline='') as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=fields)
    writer.writeheader()
    writer.writerows(my_dict)

Append field to a list

How can I extract just the value?
I have this code:
import csv

data = []
with open('city.txt') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    for row in readCSV:
        data.append(row[3])
That appends the following to the list (the list is massive):
[....... " 'id': 'AX~only~Mariehamn'", " 'id': 'AX~only~Saltvik'", " 'id': 'AX~only~Sund'"]
How can I append just the value of the 'id' key to the list? I just want to append AX~only~Saltvik, and so on.
city.txt is a file containing the following (a 90k-line file):
{'name': 'Herat', 'displayName': 'Herat', 'meta': {'type': 'CITY'}, 'id': 'AF~HER~Herat', 'countryId': 'AF', 'countryName': 'AF', 'regionId': 'AF~HER', 'regionName': 'HER', 'latitude': 34.3482, 'longitude': 62.1997, 'links': {'self': {'path': '/api/netim/v1/cities/AF~HER~Herat'}}}
{'name': 'Kabul', 'displayName': 'Kabul', 'meta': {'type': 'CITY'}, 'id': 'AF~KAB~Kabul', 'countryId': 'AF', 'countryName': 'AF', 'regionId': 'AF~KAB', 'regionName': 'KAB', 'latitude': 34.5167, 'longitude': 69.1833, 'links': {'self': {'path': '/api/netim/v1/cities/AF~KAB~Kabul'}}}
so on ....
When I print(row) inside the for loop I get the following (this is just the last line of the output):
["{'name': 'Advancetown'", " 'displayName': 'Advancetown'", " 'meta': {'type': 'CITY'}", " 'id': 'AU~QLD~Advancetown'", " 'countryId': 'AU'", " 'countryName': 'AU'", " 'regionId': 'AU~QLD'", " 'regionName': 'QLD'^C: 152.7713", " 'links': {'self': {'path': '/api/netim/v1/cities/AU~QLD~Basin%20Pocket'}}}"]
This answer assumes that your output is exact, and that each value appended to your list is a string along the lines of " 'id': 'AX~only~Mariehamn'".
This means that in the base file the key and value are stored together as one string, so you can extract the second part with ordinary string functions.
for row in readCSV:
    data.append(row[3].split(": ")[1].strip("'"))
The above code splits the string into a list with two parts, one before the colon and one after: [" 'id'", "'AX~only~Mariehamn'"]. Then it takes the second part and strips the 's, resulting in a clean string.
It looks like row[3] is a string representing a key, value pair.
I would split it further and select only the value portion:
import csv

data = []
with open('city.txt') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    for row in readCSV:
        data.append(row[3].split(':')[1][2:-1])
The [2:-1] slice removes the leading space and the surrounding ' quotes.
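Since each line of city.txt is actually a Python dict literal rather than real CSV, another option (my own sketch, not from the answers above) is to parse each line with ast.literal_eval and read the 'id' key directly; this avoids all the quote and comma juggling:

```python
import ast
import io

# two sample lines in the same format as city.txt (taken from the question)
sample = io.StringIO(
    "{'name': 'Herat', 'id': 'AF~HER~Herat', 'countryId': 'AF'}\n"
    "{'name': 'Kabul', 'id': 'AF~KAB~Kabul', 'countryId': 'AF'}\n"
)

data = []
for line in sample:  # with the real file: for line in open('city.txt')
    record = ast.literal_eval(line.strip())  # safely evaluate the dict literal
    data.append(record['id'])

print(data)  # ['AF~HER~Herat', 'AF~KAB~Kabul']
```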

Get value from data-set field sublist

I have a dataset (that pulls its data from a dict) that I am attempting to clean and republish. Within this dataset there is a field whose value is a nested dict, and I would like to extract specific data from it.
Here's the data:
[{'id': 'oH58h122Jpv47pqXhL9p_Q', 'alias': 'original-pizza-brooklyn-4', 'name': 'Original Pizza', 'image_url': 'https://s3-media1.fl.yelpcdn.com/bphoto/HVT0Vr_Vh52R_niODyPzCQ/o.jpg', 'is_closed': False, 'url': 'https://www.yelp.com/biz/original-pizza-brooklyn-4?adjust_creative=IelPnWlrTpzPtN2YRie19A&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=IelPnWlrTpzPtN2YRie19A', 'review_count': 102, 'categories': [{'alias': 'pizza', 'title': 'Pizza'}], 'rating': 4.0, 'coordinates': {'latitude': 40.63781, 'longitude': -73.8963799}, 'transactions': [], 'price': '$', 'location': {'address1': '9514 Ave L', 'address2': '', 'address3': '', 'city': 'Brooklyn', 'zip_code': '11236', 'country': 'US', 'state': 'NY', 'display_address': ['9514 Ave L', 'Brooklyn, NY 11236']}, 'phone': '+17185313559', 'display_phone': '(718) 531-3559', 'distance': 319.98144420799355},
Here's how the data is presented within the csv/spreadsheet:
location
{'address1': '9514 Ave L', 'address2': '', 'address3': '', 'city': 'Brooklyn', 'zip_code': '11236', 'country': 'US', 'state': 'NY', 'display_address': ['9514 Ave L', 'Brooklyn, NY 11236']}
Is there a way to pull location.city, for example?
The code below simply adds a few fields and exports the result to a csv.
def data_set(data):
    df = pd.DataFrame(data)
    df['zip'] = get_zip()
    df['region'] = get_region()
    newdf = df.filter(['name', 'phone', 'location', 'zip', 'region', 'coordinates',
                       'rating', 'review_count', 'categories', 'url'], axis=1)
    if not os.path.isfile('yelp_data.csv'):
        newdf.to_csv('data.csv', header='column_names')
    else:  # it exists, so append without writing the header
        newdf.to_csv('data.csv', mode='a', header=False)
If that doesn't make sense, please let me know. Thanks in advance!
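Since 'location' is itself a dict, one way (a sketch, assuming records shaped like the sample above, with a trimmed-down hypothetical record) is to index into it before or after building the DataFrame:

```python
# a trimmed-down record in the same shape as the Yelp data above
data = [{'name': 'Original Pizza',
         'location': {'address1': '9514 Ave L', 'city': 'Brooklyn',
                      'zip_code': '11236', 'state': 'NY'}}]

# pull the nested 'city' value out of each record's 'location' dict
cities = [biz['location'].get('city') for biz in data]
print(cities)  # ['Brooklyn']
```

On the DataFrame side the same idea would be df['city'] = df['location'].apply(lambda loc: loc.get('city')).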

python 2.x csv writer

When I print these strings
print query, netinfo
I get the output below, which is fine. How do I take these strings and put them into a CSV file as a single row?
8.8.8.8 [{'updated': '2012-02-24T00:00:00', 'handle': 'NET-8-0-0-0-1', 'description': 'Level 3 Communications, Inc.', 'tech_emails': 'ipaddressing#level3.com', 'abuse_emails': 'abuse#level3.com', 'postal_code': '80021', 'address': '1025 Eldorado Blvd.', 'cidr': '8.0.0.0/8', 'city': 'Broomfield', 'name': 'LVLT-ORG-8-8', 'created': '1992-12-01T00:00:00', 'country': 'US', 'state': 'CO', 'range': '8.0.0.0 - 8.255.255.255', 'misc_emails': None}, {'updated': '2014-03-14T00:00:00', 'handle': 'NET-8-8-8-0-1', 'description': 'Google Inc.', 'tech_emails': 'arin-contact#google.com', 'abuse_emails': 'arin-contact#google.com', 'postal_code': '94043', 'address': '1600 Amphitheatre Parkway', 'cidr': '8.8.8.0/24', 'city': 'Mountain View', 'name': 'LVLT-GOGL-8-8-8', 'created': '2014-03-14T00:00:00', 'country': 'US', 'state': 'CA', 'range': None, 'misc_emails': None}]
I have tried cobbling this together, but it's all jacked up. I could use some help with the csv module.
writer = csv.writer(open('dict.csv', 'ab'))
for key in query:
    writer.writerow(query)
You can put your variables in a tuple and write them to the csv file:
import csv

with open('ex.csv', 'wb') as csvfile:
    writer = csv.writer(csvfile, delimiter=' ')
    writer.writerow((query, netinfo))
Note: if you are on Python 3, use the following code:
import csv

with open('ex.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=' ')
    writer.writerow((query, netinfo))
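If instead you want each dict in netinfo as its own CSV row rather than one giant cell, csv.DictWriter can handle that. A Python 3 sketch, with a made-up two-key netinfo for brevity:

```python
import csv
import io

query = '8.8.8.8'
netinfo = [{'handle': 'NET-8-0-0-0-1', 'cidr': '8.0.0.0/8'},
           {'handle': 'NET-8-8-8-0-1', 'cidr': '8.8.8.0/24'}]

out = io.StringIO()  # stands in for open('dict.csv', 'w', newline='')
writer = csv.DictWriter(out, fieldnames=['query', 'handle', 'cidr'])
writer.writeheader()
for info in netinfo:
    writer.writerow({'query': query, **info})  # one row per dict, query repeated

print(out.getvalue())
```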

Python/Shell Script - Merging 2 rows of a CSV file where Address column has 'New Line' character

I have a CSV file which contains a couple of columns. For example:
FName,LName,Address1,City,Country,Phone,Email
Matt,Shew,"503, Avenue Park",Auckland,NZ,19809224478,matt#xxx.com
Patt,Smith,"503, Baker Street
Mickey Park
Suite 510",Austraila,AZ,19807824478,patt#xxx.com
Doug,Stew,"12, Main St.
21st Lane
Suit 290",Chicago,US,19809224478,doug#xxx.com
Henry,Mark,"88, Washington Park",NY,US,19809224478,matt#xxx.com
In Excel it looks something like this:
It's a usual human tendency to feed/copy-paste addresses in this manner; people sometimes copy their signature and paste it into the Address column, which creates this situation.
I have tried reading this using Python's csv module, and it looks like Python doesn't distinguish between a '\n' newline inside a field value and the end of a line.
My code :
import csv

with open(file_path, 'r') as f_obj:
    input_data = []
    reader = csv.DictReader(f_obj)
    for row in reader:
        print row
The output looks something like this:
{'City': 'Auckland', 'Address1': '503, Avenue Park', 'LName': 'Shew', 'Phone': '19809224478', 'FName': 'Matt', 'Country': 'NZ', 'Email': 'matt#xxx.com'}
{'City': 'Austraila', 'Address1': '503, Baker Street\nMickey Park\nSuite 510', 'LName': 'Smith', 'Phone': '19807824478', 'FName': 'Patt', 'Country': 'AZ', 'Email': 'patt#xxx.com'}
{'City': 'Chicago', 'Address1': '12, Main St. \n21st Lane \nSuit 290', 'LName': 'Stew', 'Phone': '19809224478', 'FName': 'Doug', 'Country': 'US', 'Email': 'doug#xxx.com'}
{'City': 'NY', 'Address1': '88, Washington Park', 'LName': 'Mark', 'Phone': '19809224478', 'FName': 'Henry', 'Country': 'US', 'Email': 'matt#xxx.com'}
I just want to write the same content to a file where the Address1 values contain no '\n' characters and look like this:
{'City': 'Auckland', 'Address1': '503, Avenue Park', 'LName': 'Shew', 'Phone': '19809224478', 'FName': 'Matt', 'Country': 'NZ', 'Email': 'matt#xxx.com'}
{'City': 'Austraila', 'Address1': '503, Baker Street Mickey Park Suite 510', 'LName': 'Smith', 'Phone': '19807824478', 'FName': 'Patt', 'Country': 'AZ', 'Email': 'patt#xxx.com'}
{'City': 'Chicago', 'Address1': '12, Main St. 21st Lane Suit 290', 'LName': 'Stew', 'Phone': '19809224478', 'FName': 'Doug', 'Country': 'US', 'Email': 'doug#xxx.com'}
{'City': 'NY', 'Address1': '88, Washington Park', 'LName': 'Mark', 'Phone': '19809224478', 'FName': 'Henry', 'Country': 'US', 'Email': 'matt#xxx.com'}
Any suggestions, guys?
PS:
I have more than 100K such records in my csv file!
You can replace the print row with a dict comprehension that replaces newlines in the values:
row = {k: v.replace('\n', ' ') for k, v in row.iteritems()}
print row
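In Python 3, where dict.iteritems() is gone (use .items()), a full read-clean-write pass might look like this; a sketch using in-memory files and a shortened version of the sample data:

```python
import csv
import io

# stands in for the real input file
src = io.StringIO('FName,LName,Address1,City\n'
                  'Patt,Smith,"503, Baker Street\nMickey Park\nSuite 510",Austraila\n')
out = io.StringIO()  # stands in for open('clean.csv', 'w', newline='')

reader = csv.DictReader(src)
writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
writer.writeheader()
for row in reader:
    # flatten any embedded newlines in every field before writing
    writer.writerow({k: v.replace('\n', ' ') for k, v in row.items()})

print(out.getvalue())
```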
