How can I merge fields in a CSV string using Python?

I am trying to merge three fields in each line of a CSV file using Python. This would be simple, except some of the fields are surrounded by double quotes and include commas. Here is an example:
,,Joe,Smith,New Haven,CT,"Moved from Portland, CT",,goo,
Is there a simple algorithm that could merge fields 7-9 for each line in this format? Not all lines include commas in double quotes.
Thanks.

Something like this?
import csv

source = csv.reader(open("some file", "rb"))
dest = csv.writer(open("another file", "wb"))
for row in source:
    result = row[:6] + [row[6] + row[7] + row[8]] + row[9:]
    dest.writerow(result)
Example
>>> data=''',,Joe,Smith,New Haven,CT,"Moved from Portland, CT",,goo,
... '''.splitlines()
>>> rdr= csv.reader( data )
>>> row= rdr.next()
>>> row
['', '', 'Joe', 'Smith', 'New Haven', 'CT', 'Moved from Portland, CT', '', 'goo', '']
>>> row[:6] + [ row[6]+row[7]+row[8] ] + row[9:]
['', '', 'Joe', 'Smith', 'New Haven', 'CT', 'Moved from Portland, CTgoo', '']
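For what it's worth, a minimal sketch of the same approach in Python 3 syntax (the file names are placeholders; the csv module there wants text-mode files opened with newline=''):

import csv

with open("some file", newline="") as src, open("another file", "w", newline="") as dst:
    dest = csv.writer(dst)
    for row in csv.reader(src):
        # merge fields 7-9 (indexes 6-8) into a single field
        dest.writerow(row[:6] + [row[6] + row[7] + row[8]] + row[9:])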

You can use the csv module to do the heavy lifting: http://docs.python.org/library/csv.html
You didn't say exactly how you wanted to merge the columns; presumably you don't want your merged field to be "Moved from Portland, CTgoo". The code below allows you to specify a separator string (maybe ", ") and handles empty/blank fields.
[transcript of session]
prompt>type merge.py
import csv

def merge_csv_cols(infile, outfile, startcol, numcols, sep=", "):
    reader = csv.reader(open(infile, "rb"))
    writer = csv.writer(open(outfile, "wb"))
    endcol = startcol + numcols
    for row in reader:
        merged = sep.join(x for x in row[startcol:endcol] if x.strip())
        row[startcol:endcol] = [merged]
        writer.writerow(row)

if __name__ == "__main__":
    import sys
    args = sys.argv[1:6]
    args[2:4] = map(int, args[2:4])
    merge_csv_cols(*args)
prompt>type input.csv
1,2,3,4,5,6,7,8,9,a,b,c
1,2,3,4,5,6,,,,a,b,c
1,2,3,4,5,6,7,8,,a,b,c
1,2,3,4,5,6,7,,9,a,b,c
prompt>\python26\python merge.py input.csv output.csv 6 3 ", "
prompt>type output.csv
1,2,3,4,5,6,"7, 8, 9",a,b,c
1,2,3,4,5,6,,a,b,c
1,2,3,4,5,6,"7, 8",a,b,c
1,2,3,4,5,6,"7, 9",a,b,c

There's a built-in module in Python for parsing CSV files:
http://docs.python.org/library/csv.html

You have tagged this question 'database'. It might actually be easier to load the files into separate tables of a database (you can use SQLite or any Python SQL library, such as SQLAlchemy) and then join them.
That would give you an advantage afterwards: you could query the tables with SQL syntax, and the data can be stored on disk instead of held in memory, so think about it. :)
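For instance, here is a minimal sketch of that idea with the standard-library sqlite3 module; the table layout and file name below are made up to match the sample row in the question:

import csv
import sqlite3

conn = sqlite3.connect(":memory:")  # or a file path, to keep the data on disk
conn.execute("CREATE TABLE people (first TEXT, last TEXT, city TEXT, state TEXT, note TEXT)")

with open("some file") as f:
    for row in csv.reader(f):
        conn.execute("INSERT INTO people VALUES (?, ?, ?, ?, ?)",
                     (row[2], row[3], row[4], row[5], row[6]))

# plain SQL can now do the querying/joining
for first, last in conn.execute("SELECT first, last FROM people WHERE state = 'CT'"):
    print("%s %s" % (first, last))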

Related

Pandas read_csv is retrieving different data than what is in the text file

I have a .txt (Notepad) file called Log1. It has the following saved in it: [1, 1, 1, 0]
When I write a program to retrieve the data:
Log1 = pd.read_csv('Path...\\Log1.txt')
Log1 = list(Log1)
print(Log1)
it prints: ['[1', ' 1', ' 1.1', ' 0]']
I don't understand where the ".1" is coming from on the third number. It's not in the text file; it just gets added.
Funnily enough, if I change the numbers in the text file to [1, 0, 1, 1], it does not add the .1; it prints ['[1', ' 0', ' 1', ' 1]'].
It's very odd that it acts this way; does anyone have an idea?
Well, I worked out some other options as well, just for the record:
Solution 1 (plain read - this one gets a list of strings)
log4 = []
with open('log4.txt') as f:
    log4 = f.readlines()
print(log4)
Solution 2 (convert to a list of ints)
import ast

with open('log4.txt', 'r') as f:
    inp = ast.literal_eval(f.read())
print(inp)
Solution 3 (old school string parsing - convert to a list of ints, then put it in a dataframe)
import pandas as pd

with open('log4.txt', 'r') as f:
    mylist = f.read()
mylist = mylist.replace('[', '').replace(']', '').replace(' ', '')
mylist = mylist.split(',')
df = pd.DataFrame({'Col1': mylist})
df['Col1'] = df['Col1'].astype(int)
print(df)
Other ideas here as well:
https://docs.python-guide.org/scenarios/serialization/
In general, reading from the text file (deserializing) is easier if the text file is written in a well-structured format in the first place - a csv file, pickle file, json file, etc. In this case ast.literal_eval() worked well since the data was written out as a list using its __repr__ format -- though honestly I've never done that before, so it was an interesting solution to me as well :)
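For instance, a minimal sketch of that round trip with json (the file name is a placeholder); writing the data out in a structured format up front makes the read side trivial:

import json

with open('log1.json', 'w') as f:
    json.dump([1, 1, 1, 0], f)   # serialize the list

with open('log1.json') as f:
    print(json.load(f))          # deserialize -> [1, 1, 1, 0]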
This should work; can you please try this:
log2 = log1.values.tolist()
Output:
[['1'], ['1'], ['1'], ['0']]
Your data is not in CSV format, and read_csv treats your single line as a header row; pandas de-duplicates repeated column names by appending .1, which is where the ' 1.1' comes from (the second ' 1' gets renamed). In a CSV file you would rather have
1;1;0;1
or something similar.
If you have multiple lines like this, it might make sense to parse the file as CSV; otherwise I'd rather parse it using a regexp and .split on the result.
Proposal: add a bigger input example and your expected output.
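A minimal sketch of that regexp idea, assuming the file really contains just the single line [1, 1, 1, 0] (the path is shortened here):

import re

# pull every (optionally signed) integer out of the raw text
with open('Log1.txt') as f:
    nums = [int(tok) for tok in re.findall(r'-?\d+', f.read())]
print(nums)  # -> [1, 1, 1, 0]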

Read file lines into lists delimited by a specific string (hash "#")

My file content is something like:
############################
Data1
133
124
FRE
new
Cable
Sat
############################
DataB
233
445
DEU
Old
Sat
###########################
MyValue
4566
455
ITA
NEW
###########################
MyValue5
455
22332
Eng
Sat
Cable
##############################
What I need is to put each of them into a list and the separator must be the "#":
The result here must be:
mylist1=["Data1","133","124","FRE","new","Cable","Sat"]
mylist2=["DataB","233","445","DEU","Old","Sat"]
etc...
The number of lists is variable since the data file length can be variable.
This is one approach:
with open('your_file.txt', 'r') as f:
    data = f.readlines()

master_list = []
lst = []
for i in data:
    if '#' in i:
        master_list.append(lst)
        lst = []
    else:
        lst.append(i.replace('\n', ''))
Drop the first (empty) element:
master_list[1:]
[['Data1', '133', '124', 'FRE', 'new', 'Cable', 'Sat'], ['DataB', '233', '445', 'DEU', 'Old', 'Sat'], ['MyValue', '4566', '455', 'ITA', 'NEW'], ['MyValue5', '455', '22332', 'Eng', 'Sat', 'Cable']]
This will convert the file to a list of lists (rather than the individually named variables) after reading the file into a string; it uses an intermediate character - / here - which you could change if it appears elsewhere in your data.
data = [block.split('\n') for block in re.sub('\n?#+\n?', '/', text).split('/')]
If you prefer the names, you could do something similar to build a dictionary, which will likely be better than individual variables.
data = {'mylist' + str(i): block.split('\n') for i, block in enumerate(re.sub('\n?#+\n?', '/', text).split('/'))}
Both of these will have an extra (empty) list if there are separators at the top or bottom of the file, as in the post, but you can chop those off if needed. A self-contained sketch follows below.
If you really need to assign to individual variables, you could use exec to set them, but I wouldn't recommend this.
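Put together as a self-contained sketch (the import, the file read, and the filtering of the extra empty blocks are filled in here; 'your_file.txt' is a placeholder):

import re

with open('your_file.txt') as f:
    text = f.read()

# collapse each run of '#' (plus surrounding newlines) into one '/',
# split on it, and drop the empty blocks from leading/trailing separators
blocks = [b for b in re.sub('\n?#+\n?', '/', text).split('/') if b]
data = [block.split('\n') for block in blocks]
print(data)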
(This assumes a Python version >= 3.5.)
I do not quite understand what you mean by "What I need is to put each of them into a list and the separator must be the #" - the example result doesn't quite match that requirement. Maybe you want:
filepath = "./data.txt"  # the full pathname of your data file
with open(filepath, "r", encoding="utf-8") as f:  # get data
    data = f.readlines()

# handle data
myList = []
for item in data[1:]:
    if "#" in item:
        print(myList)
        myList = []
        continue
    myList.append(item.strip("\n"))
The result is shown below:
['Data1', '133', '124', 'FRE', 'new', 'Cable', 'Sat']
['DataB', '233', '445', 'DEU', 'Old', 'Sat']
['MyValue', '4566', '455', 'ITA', 'NEW']
['MyValue5', '455', '22332', 'Eng', 'Sat', 'Cable']
If the separator lines all contain the same number of # characters, it is easy to handle your problem with the split function, like below:
filepath = "./data.txt"  # the full pathname of your data file
with open(filepath, "r", encoding="utf-8") as f:  # get data
    # f.readline()  # filter the first line
    data = f.read()
data = data.strip("#").strip("\n").split("###########################\n")
for item in data:
    print(item.split("\n"))
I'm afraid I'm new to the site, so I don't have enough reputation to post an image.

python 2.7: iterate dictionary and map values to a file

I have a list of dictionaries which I build from a .xml file:
list_1=[{'lat': '00.6849879', 'phone': '+3002201600', 'amenity': 'restaurant', 'lon': '00.2855850', 'name': 'Telegraf'},{'lat': '00.6850230', 'addr:housenumber': '6', 'lon': '00.2844493', 'addr:city': 'XXX', 'addr:street': 'YYY.'},{'lat': '00.6860304', 'crossing': 'traffic_signals', 'lon': '00.2861978', 'highway': 'crossing'}]
My aim is to build a text file with the values (not keys) in this order:
lat,lon,'addr:street','addr:housenumber','addr:city','amenity','crossing' etc...
00.6849879,00.2855850, , , ,restaurant, ,'\n'00.6850230,00.2844493,YYY,6,XXX, , ,'\n'00.6860304,00.2861978, , , , ,traffic_signals,'\n'
If a value does not exist, there should be an empty space.
I tried a for loop:
for i in list_1:
    line = i['lat'], i['lon']
    print line
A problem occurs if I add a key which does not exist in some of the dictionaries:
for i in list_1:
    line = i['lat'], i['lon'], i['phone']
    print line
I also tried to loop and use the map() function, but the results do not seem correct:
for i in list_1:
    line = map(lambda x1, x2: x1 + ',' + x2 + '\n', i['lat'], i['lon'])
    print line
Also tried:
for i in list_1:
    for k, v in i.items():
        if k == 'addr:housenumber':
            print v
This time I think there would be too many if/else conditions to write.
It seems like the solution is somewhere close, but I can't figure it out or find the optimal way.
I would look to use the csv module, in particular DictWriter. The fieldnames dictate the order in which the dictionary information is written out. Actually writing the header is optional:
import csv

fields = ['lat','lon','addr:street','addr:housenumber','addr:city','amenity','crossing',...]
with open('<file>', 'w') as f:
    # note: if rows contain keys not listed in fields (e.g. 'highway'),
    # pass extrasaction='ignore' to DictWriter so it skips them
    writer = csv.DictWriter(f, fields)
    #writer.writeheader() # If you want a header
    writer.writerows(list_1)
If you really didn't want to use the csv module, then you can simply iterate over the list of fields you want, in the order you want them:
fields = ['lat','lon','addr:street','addr:housenumber','addr:city','amenity','crossing',...]
for row in list_1:
    print(','.join(row.get(field, '') for field in fields))
If you can't or don't want to use csv you can do something like
order = ['lat','lon','addr:street','addr:housenumber',
         'addr:city','amenity','crossing']
for entry in list_1:
    f.write(", ".join([entry.get(x, "") for x in order]) + "\n")
This will create a line with the values from the entry map in the order given by the order list, defaulting to "" if a value is not present in the map.
If your output is a csv file, I strongly recommend using the csv module because it will also escape values correctly and other csv file specific things that we don't think about right now.
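As a small illustration of that escaping (not part of the original answer): any value containing a comma is quoted automatically by the writer, so the output stays parseable:

import csv
import sys

writer = csv.writer(sys.stdout)
writer.writerow(['Telegraf', 'Moved from Portland, CT'])
# prints: Telegraf,"Moved from Portland, CT"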
Thanks guys, I found the solution. Maybe it is not so elegant, but it works.
I made a list of node keys, looked each one up in the dictionaries of the other list, and took the values.
key_list=['lat','lon','addr:street','addr:housenumber','amenity','source','name','operator']
list=[{'lat': '00.6849879', 'phone': '+3002201600', 'amenity': 'restaurant', 'lon': '00.2855850', 'name': 'Telegraf'},{'lat': '00.6850230', 'addr:housenumber': '6', 'lon': '00.2844493', 'addr:city': 'XXX', 'addr:street': 'YYY.'},{'lat': '00.6860304', 'crossing': 'traffic_signals', 'lon': '00.2861978', 'highway': 'crossing'}]
Solution:
final_list = []
for i in list:  # note: the name 'list' shadows the built-in type
    line = str()
    for ii in key_list:
        if ii in i:
            line = line + str(i[ii]) + ','
        else:
            line = line + ' ' + ','
    final_list.append(line)

Using the split function in Python

I am working with the CSV module, and I am writing a simple program which takes the names of several authors listed in the file, and formats them in this manner: john.doe
So far I've achieved the results that I want, but I am having trouble getting the code to exclude titles such as "Mr.", "Mrs", etc. I've been thinking about using the split function, but I am not sure if this would be a good use for it.
Any suggestions? Thanks in advance!
Here's my code so far:
import csv
books = csv.reader(open("books.csv","rU"))
for row in books:
    print '.'.join([item.lower() for item in [row[index] for index in (1, 0)]])
It depends on how messy the strings are; in the worst cases this regexp-based solution should do the job:
import re
x=re.compile(r"^\s*(mr|mrs|ms|miss)[\.\s]+", flags=re.IGNORECASE)
x.sub("", text)
(I'm using re.compile() here since, for some reason, Python 2.6's re.sub doesn't accept the flags= kwarg.)
UPDATE: I wrote some code to test that and, although I wasn't able to figure out a way to automate the result checking, it looks like it's working fine. This is the test code:
import re
x=re.compile(r"^\s*(mr|mrs|ms|miss)[\.\s]+", flags=re.IGNORECASE)
names = ["".join([a, b, c, d]) for a in ['', ' ', ' ', '..', 'X']
         for b in ['mr', 'Mr', 'miss', 'Miss', 'mrs', 'Mrs', 'ms', 'Ms']
         for c in ['', '.', '. ', ' ']
         for d in ['Aaaaa', 'Aaaa Bbbb', 'Aaa Bbb Ccc', ' aa ']]
print "\n".join([" => ".join((n,x.sub('',n))) for n in names])
Depending on the complexity of your data and the scope of your needs you may be able to get away with something as simple as stripping titles from the lines in the csv using replace() as you iterate over them.
Something along the lines of:
titles = ["Mr.", "Mrs.", "Ms", "Dr"]  # and so on
for line in lines:
    line_data = line
    for title in titles:
        line_data = line_data.replace(title, "")
    # your code for processing the line
This may not be the most efficient method, but depending on your needs it may be a good fit.
Here is how this could work with the code you posted (I am guessing the Mr./Mrs. is part of column 1, the first name):
import csv

books = csv.reader(open("books.csv", "rU"))
for row in books:
    first_name = row[1]
    last_name = row[0]
    for title in titles:
        first_name = first_name.replace(title, "")
    print '.'.join((first_name, last_name))

Download CSV directly into Python CSV parser

I'm trying to download CSV content from Morningstar and then parse its contents. If I inject the HTTP content directly into Python's CSV parser, the result is not formatted correctly, yet if I save the HTTP content to a file (/tmp/tmp.csv) and then import the file into Python's csv parser, the result is correct. In other words, why does:
import csv
import httplib2

def finDownload(code, report):
    h = httplib2.Http('.cache')
    url = 'http://financials.morningstar.com/ajax/ReportProcess4CSV.html?t=' + code + '&region=AUS&culture=en_us&reportType=' + report + '&period=12&dataType=A&order=asc&columnYear=5&rounding=1&view=raw&productCode=usa&denominatorView=raw&number=1'
    headers, data = h.request(url)
    return data

balancesheet = csv.reader(finDownload('FGE', 'is'))
for row in balancesheet:
    print row
return:
['F']
['o']
['r']
['g']
['e']
[' ']
['G']
['r']
['o']
['u']
(etc...)
instead of:
['Forge Group Limited (FGE) Income Statement']
?
The problem results from the fact that iteration over a file is done line-by-line whereas iteration over a string is done character-by-character.
You want StringIO/cStringIO (Python 2) or io.StringIO (Python 3, thanks to John Machin for pointing me to it) so a string can be treated as a file-like object:
Python 2:
mystring = 'a,"b\nb",c\n1,2,3'
import cStringIO
csvio = cStringIO.StringIO(mystring)
mycsv = csv.reader(csvio)
Python 3:
mystring = 'a,"b\nb",c\n1,2,3'
import io
csvio = io.StringIO(mystring, newline="")
mycsv = csv.reader(csvio)
Both will correctly preserve newlines inside quoted fields:
>>> for row in mycsv: print(row)
...
['a', 'b\nb', 'c']
['1', '2', '3']
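Applied to the code in the question, that means wrapping the downloaded string before handing it to the reader (Python 2 spelling, to match the question's code):

import csv
import cStringIO

balancesheet = csv.reader(cStringIO.StringIO(finDownload('FGE', 'is')))
for row in balancesheet:
    print row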
