How to parse a CSV file in Python?

I need the first column of the table to be written to a variable, and the remaining columns (their number may vary) to be written to a list so I can pull the desired value out of that list. I'm trying to get email addresses, but the table itself is a mess, so every column needs to be checked.
with open('data.csv', 'r', encoding='utf-8-sig', newline='') as file:
    reader = csv.reader(file)
    name = list(next(reader))
    for items in list(reader):
        for item in items:
            if '#' in item:
                if not item in emails:
                    emails.append(item)

with open('result.csv', 'a', encoding='utf-8-sig', newline='') as file:
    writer = csv.writer(file, delimiter=';')
    for email in emails:
        writer.writerow(
            (
                name,
                email
            )
        )
Input:
Наименование,Описание,Адрес,Комментарий к адресу,Почтовый индекс,Микрорайон,Район,Город,Округ,Регион,Страна,Часы работы,Часовой пояс,Телефон 1,E-mail 1,Веб-сайт 1,Instagram 1,Twitter 1,Facebook 1,ВКонтакте 1,YouTube 1,Skype 1,Широта,Долгота,2GIS URL
Магазин автозапчастей,,"Мира, 007",,655153,,,Черногорск,Черногорск городской округ,Республика Хакасия,Россия,Пн: 09:00-18:00; Вт: 09:00-18:00; Ср: 09:00-18:00; Чт: 09:00-18:00; Пт: 09:00-18:00; Сб: 09:00-18:00,+07:00,89130502009,grandauto007#mail.ru,http://avtomagazin.2gis.biz,,,,,,,53.805192,91.334047,https://2gis.com/firm/9711414977516651
Спектр-Авто,автотехцентр,"Вяткина, 4",1 этаж,655017,,,Абакан,Абакан городской округ,Республика Хакасия,Россия,Пн: 09:00-18:00; Вт: 09:00-18:00; Ср: 09:00-18:00; Чт: 09:00-18:00; Пт: 09:00-18:00; Сб: 09:00-18:00,+07:00,89233931771,+79233940022#yandex.ru,http://spectr-avto.2gis.biz,,,,,,,53.716581,91.45005,https://2gis.com/firm/70000001034136187
The result is:
['Наименование', 'Описание', 'Адрес', 'Комментарий к адресу', 'Почтовый индекс', 'Микрорайон', 'Район', 'Город', 'Округ', 'Регион', 'Страна', 'Часы работы', 'Часовой пояс', 'Телефон 1', 'E-mail 1', 'Веб-сайт 1', 'Instagram 1', 'Twitter 1', 'Facebook 1', 'ВКонтакте 1', 'YouTube 1', 'Skype 1', 'Широта', 'Долгота', '2GIS URL'];grandauto007#mail.ru
['Наименование', 'Описание', 'Адрес', 'Комментарий к адресу', 'Почтовый индекс', 'Микрорайон', 'Район', 'Город', 'Округ', 'Регион', 'Страна', 'Часы работы', 'Часовой пояс', 'Телефон 1', 'E-mail 1', 'Веб-сайт 1', 'Instagram 1', 'Twitter 1', 'Facebook 1', 'ВКонтакте 1', 'YouTube 1', 'Skype 1', 'Широта', 'Долгота', '2GIS URL'];+79233940022#yandex.ru
['Наименование', 'Описание', 'Адрес', 'Комментарий к адресу', 'Почтовый индекс', 'Микрорайон', 'Район', 'Город', 'Округ', 'Регион', 'Страна', 'Часы работы', 'Часовой пояс', 'Телефон 1', 'E-mail 1', 'Веб-сайт 1', 'Instagram 1', 'Twitter 1', 'Facebook 1', 'ВКонтакте 1', 'YouTube 1', 'Skype 1', 'Широта', 'Долгота', '2GIS URL'];zhvirblis_yuliya#mail.ru

If I understand the question correctly, what you really want to output is a two-column CSV, with names in the first column, which I assume come from the original CSV's first column, and e-mail in the second column.
If my assumptions are correct, this should work for you:
import csv
with open('data.csv', 'r', encoding='utf-8-sig', newline='') as file:
    reader = csv.reader(file)
    header = list(next(reader))
    emails = []
    for items in reader:
        name = items[0]
        for item in items:
            if '#' in item:
                if (name, item) not in emails:
                    emails.append((name, item))

with open('result.csv', 'a', encoding='utf-8-sig', newline='') as file:
    writer = csv.writer(file, delimiter=';')
    for email in emails:
        writer.writerow(email)
Output:
Магазин автозапчастей;grandauto007#mail.ru
Спектр-Авто;+79233940022#yandex.ru
Things I have changed in your code:
The input CSV header is now read into header - did you want to do anything with that?
The name is now set from items[0] for each row in the input CSV.
The emails list is now a list of (name, email) pairs.
Optimization detail: you don't need to turn reader into a list to iterate over it. Just say for items in reader:, it'll be more efficient since it will process each row as it reads it instead of storing them all into a list.

import petl
table = petl.fromcsv('data.csv', encoding='utf-8-sig')
table2 = petl.addfield(table, 'email_address', lambda r: [r[r1] for r1 in petl.header(table) if '#' in r[r1]])
table3 = petl.cut(table2, 'Наименование', 'email_address')
petl.tocsv(table3, 'result.csv', encoding='utf-8-sig', delimiter=';', write_header=True)
Load the CSV into a table
Create a new field (column) that is an aggregate of any field containing an email address
Reduce (cut) the table to only contain the 2 important fields: 'Наименование', 'email_address'
Output the results to a CSV
Output:
Наименование;email_address
Магазин автозапчастей;['grandauto007#mail.ru']
Спектр-Авто;['+79233940022#yandex.ru']
Be sure to install petl:
pip install petl
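As a follow-up: in the output above, the new email_address column is rendered as a Python list literal. If a plain string is preferred, the lambda could join the matching fields instead; this is a small sketch along the same lines, not something the original answer shows:

import petl

table = petl.fromcsv('data.csv', encoding='utf-8-sig')
# Join every field that looks like an email (contains '#') into a single string
table2 = petl.addfield(table, 'email_address', lambda r: ', '.join(v for v in r if '#' in v))
table3 = petl.cut(table2, 'Наименование', 'email_address')
petl.tocsv(table3, 'result.csv', encoding='utf-8-sig', delimiter=';', write_header=True)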

Related

Python CSV Writes ;;;;;; to every line

I stored data in a CSV file with Python. Now I need to read it back with Python, but there is an issue: there is a
";;;;;;"
string at the end of every line.
Here is the code that I used for writing data to CSV :
file = open("products.csv", "a")
writer = csv.writer(file, delimiter=",", quotechar='"', quoting=csv.QUOTE_ALL)
writer.writerow(data)
And I am trying to read that with that code :
with open("products.csv", "r", newline="") as in_file, open("3.csv", "w", newline='') as to_file:
reader = csv.reader(in_file, delimiter="," ,doublequote=True)
for row in reader:
print(row)
Of course, I am not reading it just to print it; I need to remove duplicated lines and turn it into a readable CSV.
I've tried the following to fetch the strings and edit them, and it worked for the other fields, but not for the semicolons. I can't understand why I can't edit those semicolons.
for row in reader:
    try:
        print(row)
        rowList = row[0].split(",")
        for index, field in enumerate(rowList):
            if '"' in field:
                field = field.replace('"', "")
            elif ";;;;;;" in rowList[index]:
                field = field.replace(";;;;;;", "")
            rowList[index] = field
        print(rowList)
Here is the output of the code above :
['Product Name', 'Product Description', 'SKU', 'Regular Price', 'Sale Price', 'Images;;;;;;']
Can anybody help me?
I realized that I used 'elif' there. I changed it and that solved it. Thanks for the help. But I still don't know why that semicolon ended up there.
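For reference, this is a minimal sketch of what the corrected loop looks like once the elif becomes a second if, so both replacements run on the same field (the try/except framing from the original snippet is left out here):

for row in reader:
    rowList = row[0].split(",")
    for index, field in enumerate(rowList):
        # Strip stray quote characters first...
        if '"' in field:
            field = field.replace('"', "")
        # ...and then, independently, strip the trailing ";;;;;;" junk
        if ";;;;;;" in field:
            field = field.replace(";;;;;;", "")
        rowList[index] = field
    print(rowList)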

Python 2 replacement specific column from CSV

I have some CSV files; the format is ID, timestamp, customerID, email, etc. I want to empty the Email column and keep the other columns the same. I'm using Python 2.7 and am restricted from using Pandas. Can anyone help me?
Thank you all for the help
My code is below, but it is not very efficient or reliable; if some rows contain strange characters, the logic breaks.
new_columns = [
'\xef\xbb\xbfID', 'timestamp', 'CustomerID', 'Email', 'CountryCode', 'LifeCycle', 'Package', 'Paystatus', 'NoUsageEver', 'NoUsage', 'VeryLowUsage',
'LowUsage', 'NormalUsage', 'HighUsage', 'VeryHighUsage', 'LastStartDate', 'NPS 0-8', 'NPS Score (Q2)', 'Gender(Q38)', 'DOB(Q39)',
'Viaplay users(Q3)', 'Primary Content (Q42)', 'Primary platform(Q4)', 'Detractor (strong) (Q5)', 'Detractor open text(Q22)',
'Contact Detractor (Q21)', 'Contact Detractor (Q20)', 'Contact Detractor (Q43)', 'Contact Detractor(Q26)', 'Contact Detractor(Q27)',
'Contact Detractor(Q44)', 'Improvement areas(Q7)', 'Improvement areas (Q40)', 'D2 More value for money(Q45)', 'D2 Sport content(Q8)',
'D2 Series content(Q9)', 'D2 Film content(Q10)', 'D2 Children content(Q11)', 'D2 Easy to start and use(Q12)',
'D2 Technical and quality(Q13)',
'D2 Platforms(Q14)', 'D2 Service and support(Q15)', 'D3 Sport content(Q16)', 'Missing Sport Content (Q41)',
'D3 Series and films content(Q17)',
'NPS 9-10', 'Recommendation drivers(Q28)', 'R2 Sport content(Q29)', 'R2 Series content(Q30)', 'R2 Film content(Q31)',
'R2 Children content(Q32)', 'R2 Easy to start and use(Q33)', 'R2 Technical and quality(Q34)', 'R2 Platforms(Q35)',
'R2 Service and support(Q36)',
'Promoter open text(Q37)'
]
with open(file_path, 'r') as infile:
    print file_path
    reader = csv.reader(infile, delimiter=";")
    first_row = next(reader)
    for row in reader:
        output_row = []
        for column_name in new_columns:
            ind = first_row.index(column_name)
            data = row[ind]
            if ind == first_row.index('Email'):
                data = ''
            output_row.append(data)
        writer.writerow(output_row)
(The question included screenshots of the file format before and after.)
So you are reordering the columns and clearing the email column:
with open(file_path, 'r') as infile:
    print file_path
    reader = csv.reader(infile, delimiter=";")
    first_row = next(reader)
    for row in reader:
        output_row = []
        for column_name in new_columns:
            ind = first_row.index(column_name)
            data = row[ind]
            if ind == first_row.index('Email'):
                data = ''
            output_row.append(data)
        writer.writerow(output_row)
I would suggest moving the searches first_row.index(column_name) and first_row.index('Email') out of the per row processing.
with open(file_path, 'r') as infile:
    print file_path
    reader = csv.reader(infile, delimiter=";")
    first_row = next(reader)
    email = first_row.index('Email')
    indexes = []
    for column_name in new_columns:
        ind = first_row.index(column_name)
        indexes.append(ind)
    for row in reader:
        output_row = []
        for ind in indexes:
            data = row[ind]
            if ind == email:
                data = ''
            output_row.append(data)
        writer.writerow(output_row)
email is the index of the email column in the input. indexes is a list of the indexes of the columns in the input in the order specified by the new_columns.
Untested.
You could use dict versions of the csv reader/writer to get the column by name. Something like this:
import csv

with open('./test.csv', 'r') as infile:
    reader = csv.DictReader(infile, delimiter=";")
    with open('./output.csv', 'w') as outfile:
        writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            row['Email'] = ''
            writer.writerow(row)
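One detail worth checking: the DictWriter above falls back to the default comma delimiter, while the input is read with delimiter=";". If the output is supposed to stay semicolon-separated (an assumption about the desired format), the writer would be constructed as csv.DictWriter(outfile, fieldnames=reader.fieldnames, delimiter=';') instead.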

Python csv file writing in different columns

I'm trying to read a CSV file and create a new CSV file with the contents of the old one using Python. My problem is that all entries end up in the first column, and I can't find a way to write the information into separate columns. Here is my code:
import csv
from itertools import zip_longest

fieldnamesOrdered = ['First Name', 'Last Name', 'Email', 'Phone Number',
                     'Street Address', 'City', 'State', 'HubSpot Owner',
                     'Lifecyle Stage', 'Lead Status', 'Favorite Color']

listOne = []
listTwo = []

with open('Contac.csv', 'r', encoding = 'utf-8') as inputFile, \
        open('result.csv', 'w', encoding = 'utf-8') as outputFile:
    reader = csv.DictReader(inputFile)
    writer = csv.writer(outputFile, delimiter = 't')
    for row in reader:
        listOne.append(row['First Name'])
        listTwo.append(row['Last Name'])
    dataLists = [listOne, listTwo]
    export_data = zip_longest(*dataLists, fillvalue='')
    writer.writerow(fieldnamesOrdered)
    writer.writerows(export_data)

inputFile.close()
outputFile.close()
Thank you very much for your answers
writer = csv.writer(outputFile, delimiter = 't')
Aren't those entries in the first column additionally interspersed with strange unsolicited 't' characters?
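In other words, 't' here is used as a literal one-character delimiter, so the letter t lands between the values and spreadsheet programs keep everything in one column. If tab-separated output was intended, the escape sequence is what's needed; a minimal sketch (use delimiter=',' instead if an ordinary comma-separated file is what you actually want):

writer = csv.writer(outputFile, delimiter='\t')  # '\t' is a tab character; plain 't' is just the letter t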

Python OrderedDict to CSV: Eliminating Blank Lines

When I run this code...
from simple_salesforce import Salesforce
sf = Salesforce(username='un', password='pw', security_token='tk')
cons = sf.query_all("SELECT Id, Name FROM Contact WHERE IsDeleted=false LIMIT 2")

import csv
with open('c:\test.csv', 'w') as csvfile:
    fieldnames = ['contact_name__c', 'recordtypeid']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for con in cons['records']:
        writer.writerow({'contact_name__c': con['Id'], 'recordtypeid': '082I8294817IWfiIWX'})
print('done')
I get the following output inside my CSV file...
contact_name__c,recordtypeid

xyzzyID1xyzzy,082I8294817IWfiIWX

abccbID2abccb,082I8294817IWfiIWX
I'm not sure why those extra lines are there.
Any tips for getting rid of them so my CSV file will be normal-looking?
I'm on Python 3.4.3 according to sys.version_info.
Here are a few more code-and-output pairs, to show the kind of data I'm working with:
from simple_salesforce import Salesforce
sf = Salesforce(username='un', password='pw', security_token='tk')
print(sf.query_all("SELECT Id, Name FROM Contact WHERE IsDeleted=false LIMIT 2"))
produces
OrderedDict([('totalSize', 2), ('done', True), ('records', [OrderedDict([('attributes', OrderedDict([('type', 'Contact'), ('url', '/services/data/v29.0/sobjects/Contact/xyzzyID1xyzzy')])), ('Id', 'xyzzyID1xyzzy'), ('Name', 'Person One')]), OrderedDict([('attributes', OrderedDict([('type', 'Contact'), ('url', '/services/data/v29.0/sobjects/Contact/abccbID2abccb')])), ('Id', 'abccbID2abccb'), ('Name', 'Person Two')])])])
and
from simple_salesforce import Salesforce
sf = Salesforce(username='un', password='pw', security_token='tk')
cons = sf.query_all("SELECT Id, Name FROM Contact WHERE IsDeleted=false LIMIT 2")
for con in cons['records']:
    print(con['Id'])
produces
xyzzyID1xyzzy
abccbID2abccb
Two likely possibilities: the output file needs to be opened without newline translation (the Python 3 counterpart of opening it in binary mode in Python 2), and/or the writer needs to be told not to use DOS-style line endings.
To suppress the newline translation in Python 3, replace your current with open line with:
with open('c:\test.csv', 'w', newline='') as csvfile:
to eliminate the DOS style line endings try:
writer = csv.DictWriter(csvfile, fieldnames=fieldnames, lineterminator="\n")
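Putting the first fix into the snippet from the question, a minimal sketch (cons is assumed to hold the Salesforce query result exactly as above):

import csv

with open(r'c:\test.csv', 'w', newline='') as csvfile:  # raw string so '\t' isn't read as a tab; newline='' prevents the extra blank rows
    fieldnames = ['contact_name__c', 'recordtypeid']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for con in cons['records']:
        writer.writerow({'contact_name__c': con['Id'], 'recordtypeid': '082I8294817IWfiIWX'})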

Parsing data from a text file

I have built a contact form which sends me an email for every user registration. My question is more related to parsing some text data into CSV format. I have received multiple users' information in my mailbox, which I copied into a text file. The data looks like the sample below.
Name: testuser2
Email: testuser2#gmail.com
Cluster Name: o b
Contact No.: 12346971239
Coming: Yes

Name: testuser3
Email: testuser3#gmail.com
Cluster Name: Mediternea
Contact No.: 9121319107
Coming: Yes

Name: testuser4
Email: tuser4#yahoo.com
Cluster Name: Mediterranea
Contact No.: 7892174896
Coming: Yes

Name: tuser5
Email: tuserner5#gmail.com
Cluster Name: River Retreat A
Contact No.: 7583450912
Coming: Yes
Members Participating: 2

Name: Test User
Email: testuser#yahoo.co.in
Cluster Name: RD
Contact No.: 09833123445
Coming: Yes
Members Participating: 2
As you can see, the data contains some common fields and some fields that are not always present. I am looking for a solution/suggestion on how to parse this data so that under the heading "Name" I collect the name information in that column, and similarly for the others. For the field titled "Members Participating" I can just pick the number and add it to the Excel sheet under the same heading; if this information is not present for a user, it can be left blank.
Let's decompose the problem into smaller subproblems:
Split the large block of text into separate registrations
Convert each of those registrations to a dictionary
Write the list of dictionaries to CSV
First, let's break the blocks of registration data into different elements:
DATA = '''
Name: testuser2
Email: testuser2#gmail.com
Cluster Name: o b
Contact No.: 12346971239
Coming: Yes

Name: testuser3
Email: testuser3#gmail.com
Cluster Name: Mediternea
Contact No.: 9121319107
Coming: Yes
'''

def parse_registrations(data):
    data = data.strip()
    return data.split('\n\n')
This function gives us a list of each registration:
>>> regs = parse_registrations(DATA)
>>> regs
['Name: testuser2\nEmail: testuser2#gmail.com\nCluster Name: o b\nContact No.: 12346971239\nComing: Yes', 'Name: testuser3\nEmail: testuser3#gmail.com\nCluster Name: Mediternea\nContact No.: 9121319107\nComing: Yes']
>>> regs[0]
'Name: testuser2\nEmail: testuser2#gmail.com\nCluster Name: o b\nContact No.: 12346971239\nComing: Yes'
>>> regs[1]
'Name: testuser3\nEmail: testuser3#gmail.com\nCluster Name: Mediternea\nContact No.: 9121319107\nComing: Yes'
Next, we can convert those substrings to a list of (key, value) pairs:
>>> [field.split(': ', 1) for field in regs[0].split('\n')]
[['Name', 'testuser2'], ['Email', 'testuser2#gmail.com'], ['Cluster Name', 'o b'], ['Contact No.', '12346971239'], ['Coming', 'Yes']]
The dict() function can convert a list of (key, value) pairs into a dictionary:
>>> dict(field.split(': ', 1) for field in regs[0].split('\n'))
{'Coming': 'Yes', 'Cluster Name': 'o b', 'Name': 'testuser2', 'Contact No.': '12346971239', 'Email': 'testuser2#gmail.com'}
We can pass these dictionaries into a csv.DictWriter to write the records as CSV with defaults for any missing values.
>>> w = csv.DictWriter(open("/tmp/foo.csv", "w"), fieldnames=["Name", "Email", "Cluster Name", "Contact No.", "Coming", "Members Participating"])
>>> w.writeheader()
>>> w.writerow({'Name': 'Steve'})
12
Now, let's combine these all together!
import csv

DATA = '''
Name: testuser2
Email: testuser2#gmail.com
Cluster Name: o b
Contact No.: 12346971239
Coming: Yes

Name: tuser5
Email: tuserner5#gmail.com
Cluster Name: River Retreat A
Contact No.: 7583450912
Coming: Yes
Members Participating: 2
'''

COLUMNS = ["Name", "Email", "Cluster Name", "Contact No.", "Coming", "Members Participating"]

def parse_registration(reg):
    return dict(field.split(': ', 1) for field in reg.split('\n'))

def parse_registrations(data):
    data = data.strip()
    regs = data.split('\n\n')
    return [parse_registration(r) for r in regs]

def write_csv(data, filename):
    regs = parse_registrations(data)
    with open(filename, 'w') as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(regs)

if __name__ == '__main__':
    write_csv(DATA, "/tmp/test.csv")
Output:
$ python3 write_csv.py
$ cat /tmp/test.csv
Name,Email,Cluster Name,Contact No.,Coming,Members Participating
testuser2,testuser2#gmail.com,o b,12346971239,Yes,
tuser5,tuserner5#gmail.com,River Retreat A,7583450912,Yes,2
The program below might satisfy your requirement. The general strategy:
First read in all of the email files, parsing the data "by hand", and
Second write the data to a CSV file, using csv.DictWriter.writerows().
import sys
import pprint
import csv

# Usage:
# python cfg2csv.py input1.cfg input2.cfg ...
# The data is combined and written to 'output.csv'

def parse_file(data):
    total_result = []
    single_result = []
    for line in data:
        line = line.strip()
        if line:
            single_result.append([item.strip() for item in line.split(':', 1)])
        else:
            if single_result:
                total_result.append(dict(single_result))
                single_result = []
    if single_result:
        total_result.append(dict(single_result))
    return total_result

def read_file(filename):
    with open(filename) as fp:
        return parse_file(fp)

# First parse the data:
data = sum((read_file(filename) for filename in sys.argv[1:]), [])
keys = set().union(*data)

# Next write the data to a CSV file
with open('output.csv', 'w') as fp:
    writer = csv.DictWriter(fp, sorted(keys))
    writer.writeheader()
    writer.writerows(data)
You can use the empty line between records to signify the end of a record. Then process the input file line by line and construct a list of dictionaries. Finally, write the dictionaries out to a CSV file.
from csv import DictWriter
from collections import OrderedDict

with open('input') as infile:
    registrations = []
    fields = OrderedDict()
    d = {}
    for line in infile:
        line = line.strip()
        if line:
            key, value = [s.strip() for s in line.split(':', 1)]
            d[key] = value
            fields[key] = None
        else:
            if d:
                registrations.append(d)
                d = {}
    else:
        if d:  # handle EOF
            registrations.append(d)

# fieldnames = ['Name', 'Email', 'Cluster Name', 'Contact No.', 'Coming', 'Members Participating']
fieldnames = fields.keys()

with open('registrations.csv', 'w') as outfile:
    writer = DictWriter(outfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(registrations)
This code attempts to automate the collection of field names, and will use the same order as unique keys are first seen in the input. If you require a specific field order in the output you can nail it up by uncommenting the appropriate line.
Running this code on your sample input produces this:
Name,Email,Cluster Name,Contact No.,Coming,Members Participating
testuser2,testuser2#gmail.com,o b,12346971239,Yes,
testuser3,testuser3#gmail.com,Mediternea,9121319107,Yes,
testuser4,tuser4#yahoo.com,Mediterranea,7892174896,Yes,
tuser5,tuserner5#gmail.com,River Retreat A,7583450912,Yes,2
Test User,testuser#yahoo.co.in,RD,09833123445,Yes,2
The following will convert your input text file automatically to a CSV file. The headings are automatically generated based on the longest entry.
import csv, re

with open("input.txt", "r") as f_input, open("output.csv", "wb") as f_output:
    csv_output = csv.writer(f_output)
    entries = re.findall("^(Name: .*?)(?:\n\n|\Z)", f_input.read(), re.M+re.S)

    # Determine the entry with the most fields for the CSV headers
    headings = []
    for entry in entries:
        headings = max(headings, [line.split(":")[0] for line in entry.split("\n")], key=len)
    csv_output.writerow(headings)

    # Write the entries
    for entry in entries:
        csv_output.writerow([line.split(":")[1].strip() for line in entry.split("\n")])
This produces a CSV text file that can be opened in Excel as follows:
Name,Email,Cluster Name,Contact No.,Coming,Members Participating
testuser2,testuser2#gmail.com,o b,12346971239,Yes
testuser3,testuser3#gmail.com,Mediternea,9121319107,Yes
testuser4,tuser4#yahoo.com,Mediterranea,7892174896,Yes
tuser5,tuserner5#gmail.com,River Retreat A,7583450912,Yes,2
Test User,testuser#yahoo.co.in,RD,09833123445,Yes,2
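A small portability note on the regex-based snippet above: opening the output file with "wb" is the Python 2 convention for the csv module. Under Python 3 the same code would want a text-mode file instead, along the lines of open("output.csv", "w", newline="").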
