Removing certain separators from csv file with pandas or csv

Removing certain separators from csv file with pandas or csv - python

I've got multiple csv files, which I received in the following line format:
-8,000E-04,2,8E+1,
The first and the third comma are meant to be decimal separators, the second comma is a column delimiter and I think the last one is supposed to indicate a new line. So the csv should only consist of two columns and I have to prepare the data in order to plot it. Therefore I need to specify the two columns as x and y to plot the data.I tried removing or replacing the separators in every line but by doing that I'm no longer able to specify the two columns. Is there a way to remove certain separators from every line of the csv?

You can use the string returned by reading line as follow
line="-8,000E-04,2,8E+1,"
list_string = line.split(',')
x= float(list_string[0]+"."+list_string[1])
y= float(list_string[2]+"."+list_string[3])
print(x,y)
Result is
-0.0008 28.0
you can arrange x and y in columns also or whatever you want

Here a short program in python to convert your csv-files
import csv
f1 = "in_test.csv"
f2 = "out_test.csv"
with open(f1, newline='') as csv_reader:
reader = csv.reader(csv_reader, delimiter=',')
with open(f2, mode='w', newline='') as csv_writer:
writer = csv.writer(csv_writer, delimiter=";")
for row in reader:
out_row = [row[0] + '.' + row[1], row[2] + '.' + row[3]]
writer.writerow(out_row)
Sample input:
-8,000E-04,2,8E+1,
-2,000E-03,2,7E+2,
Sample output:
-8.000E-04;2.8E+1
-2.000E-03;2.7E+2

I think you should replace the second comma using regex. Well, I'm definitely not an expert at it, but I've managed to come up with this:
import re
s = "-8,000E-04,2,8E+1,"
pattern = "^([^,]*,[^,]*),(.*),$"
grps = re.search(pattern, s).groups()
res = [float(s.replace(",", ".")) for s in grps]
print(res)
# [-0.0008, 28.0]
Sample csv file:
-8,000E-04,2,8E+1,
6,0E-6,-45E+2,
-5,550E-6,-6,2E+1,
And you can do something like this:
x = []
y = []
regex = re.compile("^([^,]*,[^,]*),(.*),$")
with open("a.csv") as f:
for line in f:
result = regex.search(line).groups()
x.append(float(result[0].replace(",", ".")))
y.append(float(result[1].replace(",", ".")))
The result is:
print(x, y)
# [-0.0008, 6e-06, -5.55e-06] [28.0, -4500.0, -62.0]
I'm not sure this is the most efficient way, but it works.

Related

edit a specific colum in a text file

I have a file text with some content.
I want to edit only the column "Medicalization". For example with a program, by entring on keypad B the column "Medicalization" becomes B :
This column has coordinates 14 for each letter of medicalization.
I tried something but I get an "index out of range" error :
with open('d:/test.txt','r') as infile:
with open('d:/test2.txt','w') as outfile:
for line in infile :
line = line.split()
new_line = '"B"\n'.format(line[14])
outfile.write(new_line)
Is that possible to do that with Python ?

Since data is in tabular form so use pandas.read_csv with sep \s+ then use pandas.DataFrame.loc to replace A with B in medicalization.
import pandas as pd
df = pd.read_csv("test.txt", sep="\s+")
df.loc[df["medicalization"] == "A" ,"medicalization"] = "B"
print(df)
typtpt name medicalization
0 1 Entrance B
1 2 Departure B
2 3 Consultation B
3 4 Meeting B
4 5 Transfer B
And if you want to save it back then use:
df.to_csv('test.txt', sep='\t', index=False)

The 'A' value you wish to change cannot possibly be column 14 in every line. If you look at, for example, the 4th row (with 'Consultation' as the name), even with a single space separating the columns, the third column would be at column position 17. So your assumption about fixed column positions must be wrong. If there is, for example, a single space or tab character separating each column, then for the first row of actual data the 'A` value would be at offset 12 and this would explain your exception.
Assuming a single space is separating each column from one another, then you could use the csv module as follows:
import csv
with open('d:/test.txt') as infile:
with open('d:/test2.txt', 'w', newline='') as outfile:
rdr = csv.reader(infile, delimiter=' ')
wtr = csv.writer(outfile, delimiter=' ')
# just write out the first row:
header = next(rdr)
wtr.writerow(header)
for row in rdr:
row[2] = 'B'
wtr.writerow(row)
Or specify delimiter='\t' if a tab is used to separate the columns.
If an arbitrary number of whitespace characters (spaces or tabs) separates each column, then:
with open('test.txt') as infile:
with open('test2.txt', 'w') as outfile:
first_time = True
for row in infile:
columns = row.split()
if first_time:
first_time = False
else:
columns[2] = 'B'
print(' '.join(columns), file=outfile)

The index out of range error is because of the output you get from the line = line.split(). This splits by all the whitespace thus the output of the line.split() is a list like so ['01','Entrance','A'] for line 2 for example. So when you do the indexing you're indexing at 14 which does not exist within the list.
If you're data files format is consistent (all Medicalization data is in the 3rd column) you can achieve what you're after with pure python like so:
with open('test.txt','r') as infile:
with open('test2.txt','w') as outfile:
for idx, line in enumerate(infile) :
line = line.split()
# if idx is 0 its the headers so we don't want to change those
if idx != 0:
line[2] = '"B"'
outfile.write(' '.join(line) + '\n')
However, #Hamza's answer is potentially a nicer one using pandas.

How to find max and min values within lists without using maps/SQL?

I'm learning python and have a data set (csv file) I've been able to split the lines by comma but now I need to find the max and min value in the third column and output the corresponding value in the first column in the same row.
This is the .csv file: https://www.dropbox.com/s/fj8tanwy1lr24yk/loan.csv?dl=0
I also can't use Pandas or any external libraries; I think it would have been easier if I used them
I have written this code so far:
f = open("loanData.csv", "r")
mylist = []
for line in f:
mylist.append(line)
newdata = []
for row in mylist:
data = row.split(",")
newdata.append(data)

I'd use the built-in csv library for parsing your CSV file, and then just generate a list with the 3rd column values in it:
import csv
with open("loanData.csv", "r") as loanCsv:
loanCsvReader = csv.reader(loanCsv)
# Comment out if no headers
next(loanCsvReader, None)
loan_data = [ row[2] for row in loanCsvReader]
max_val = max(loan_data)
min_val = min(loan_data)
print("Max: {}".format(max_val))
print("Max: {}".format(min_val))
Don't know if the details of your file, whether it has a headers or not but you can comment out
next(loanCsvReader, None)
if you don't have any headers present

Something like this might work. The index would start at zero, so the third column should be 2.
min = min([row.split(',')[2] for row in mylist])
max = max([row.split(',')[2] for row in mylist])
Separately, you could probably read and reformat your data to a list with the following:
with open('loanData.csv', 'r') as f:
data = f.read()
mylist = list(data.split('\n'))
This assumes that the end of each row of data is newline (\n) delimited (Windows), but that might be different depending on the OS you're using.

Reading Data into Lists

I'm trying to open a CSV file that contains 100 columns and 2 rows. I want to read the file and put the data in the first column into one list (my x_coordinates) and the data in the second column into another list (my y_coordinates)
X= []
Y = []
data = open("data.csv")
headers = data.readline()
readMyDocument = data.read()
for data in readMyDocument:
X = readMyDocument[0]
Y = readMyDocument[1]
print(X)
print(Y)
I'm looking to get two lists but instead the output is simply a list of 2's.
Any suggestions on how I can change it/where my logic is wrong.

You can do something like:
import csv
# No need to initilize your lists here
X = []
Y = []
with open('data.csv', 'r') as f:
data = list(csv.reader(f))
X = data[0]
Y = data[1]
print(X)
print(Y)
See if that works.

You can use pandas:
import pandas as pd
XY = pd.read_csv(path_to_file)
X = XY.iloc[:,0]
Y = XY.iloc[:,1]
or you can
X=[]
Y=[]
with open(path_to_file) as f:
for line in f:
xy = line.strip().split(',')
X.append(xy[0])
Y.append(xy[1])

First things first: you are not closing your file.
A good practice would be to use with when opening files so it can close even if the code breaks.
Then, if you want just one column, you can break your lines by the column separator and use just the column you want.
But this would be kind of learning only, in a real situation you may want to use a lib like built in csv or, even better, pandas.
X = []
Y = []
with open("data.csv") as data:
lines = data.read().split('\n')
# headers is not being used in this spinet
headers = lines[0]
lines = lines[1:]
# changing variable name for better reading
for line in lines:
X.append(line[0])
Y.append(line[1])
print(X)
print(Y)
Ps.: I'm ignoring some variables that you used but were not declared in your code snipet. But they could be a problem too.

Using numpy's genfromtxt , read the docs here: https://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html
Some assumptions:
Delimiter is ","
You don't want the headers obliviously in the lists, that's why
skipping the headers.
You can read the docs and use other keywords as well.
import numpy as np
X= list(np.genfromtxt('data.csv',delimiter=",",skip_header=1)[:,0])
Y = list(np.genfromtxt('data.csv',delimiter=",",skip_header=1)[:,1])

Python: Even after specifying delimiter, csv writer delimits at wrong place

I'm trying to write lists like this to a CSV file:
['ABC','One,Two','12']
['DSE','Five,Two','52']
To a file like this:
ABC One,Two 12
DSE Five,Two 52
Basically, write anything inside '' to a cell.
However, it is splitting One and Two into different cells and merging ABC with One in the first cell.
Part of my script:
out_file_handle = open(output_path, "ab")
writer = csv.writer(out_file_handle, delimiter = "\t", dialect='excel', lineterminator='\n', quoting=csv.QUOTE_NONE)
output_final = (tsv_name_list.split(".")[0]+"\t"+key + "\t" + str(listOfThings))
output_final = str([output_final]).replace("[","").replace("]","").replace('"',"").replace("'","")
output_final = output_final.split("\\t")
print output_final #gives the first lists of strings I mentioned above.
writer.writerow(output_final)
First output_final line gives
ABC One,Two 12
DSE Five,Two 52

Using the csv module simply works, so you're going to need to be more specific about what's convincing you that the elements are bleeding across cells. For example, using the (now quite outdated) Python 2.7:
import csv
data_lists = [['ABC','One,Two','12'],
['DSE','Five,Two','52']]
with open("out.tsv", "wb") as fp:
writer = csv.writer(fp, delimiter="\t", dialect="excel", lineterminator="\n")
writer.writerows(data_lists)
I get an out.tsv file of:
dsm#winter:~/coding$ more out.tsv
ABC One,Two 12
DSE Five,Two 52
or
>>> out = open("out.tsv").readlines()
>>> for row in out: print repr(row)
...
'ABC\tOne,Two\t12\n'
'DSE\tFive,Two\t52\n'
which is exactly as it should be. Now if you take these rows, which are tab-delimited, and for some reason split them using commas as the delimiter, sure, you'll think that there are two columns, one with ABC\tOne and one with Two\t12. But that would be silly.

You've set up the CSV writer, but then for some reason you completely ignore it and try to output the lines manually. That's pointless. Use the functionality available.
writer = csv.writer(...)
for row in tsv_list:
writer.writerow(row)

parse list in python with \t

I have a file, each line looks like this:
#each line is a list:
a = ['1\t2\t3\t4\t5']
#type(a) is list
#str(a) shows as below:
["['1\\t2\\t3\\t4\\t5']"]
I want an output of only 1 2 3 4 5, how should I accomplish that? Thanks everyone.

I agree with Bhargav Rao. Use a for loop.
for i in a:
i.replace('\t',' ')
print i
I believe it will print out the way you want.
Otherwise please specify.

If you are reading lines from a file, you can use the following approach:
with open("file.txt", "r") as f_input:
for line in f_input:
print line.replace("\t"," "),
If the file you are reading from is actually columns of data, then the following approach might be suitable:
import csv
with open("file.txt", "r") as f_input:
csv_input = csv.reader(f_input, delimiter="\t")
for cols in csv_input:
print " ".join(cols)
cols would give you a list of columns for each row in the file, i.e. cols[0] would be the first column. You can then easily print them out with spaces as shown.

there are many ways of doing this.
you can use a for loop or you can just join the list and replace all the \t
a = ['1\t2\t3\t4\t5']
a = "".join(a).replace("\t", "")
print(a)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Removing certain separators from csv file with pandas or csv - python

You can use the string returned by reading line as follow line="-8,000E-04,2,8E+1," list_string = line.split(',') x= float(list_string[0]+"."+list_string[1]) y= float(list_string[2]+"."+list_string[3]) print(x,y) Result is -0.0008 28.0 you can arrange x and y in columns also or whatever you want

Related

edit a specific colum in a text file

How to find max and min values within lists without using maps/SQL?

Reading Data into Lists

Python: Even after specifying delimiter, csv writer delimits at wrong place

parse list in python with \t

Categories

Resources