Parsing a csv file with column data in Python - python

I want to read the first 3 columns of a csv file and do some modification before storing them.
Data in csv file:
{::[name]str1_str2_str3[0]},1,U0.00 - Sensor1 Not Ready\nTry Again,1,0,12
{::[name]str1_str2_str3[1]},2,U0.00 - Sensor2 Not Ready\nTry Again,1,0,12
From the column1, I just want to parse the value 0 or 1 within the [ ].
Then the value in column2
From column3, I want to parse the substring "Sensor1 Not Ready". Then convert to upper case and replace the space with underscore (eg - SENSOR1_NOT_READY). And then print the string in a new column.
Parsing format -
**<value from column 1>.<value from column 2>.<string from column 3>**
I am new to coding in Python. Can someone help me with this? What is the best and the most efficient way to do this?
TIA
What I have tried so far -
import csv
from collections import defaultdict
columns = defaultdict(list)
with open('filename.csv','rb') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        for i in range(len(row)):
            columns[i].append(row[i])
columns = dict(columns)
Is this a good way for Column 3?
x = columns[2][0]  # parsed data from column 3 (first row)
a, b = x.split("\n")  # 'a' denotes the substring before \n
c, d = a.split("-")  # 'd' denotes the substring after '-'
e = d.strip().upper()
new_str = e.replace(" ", "_")
print(new_str)

My suggestion is to read each line as a string and then extract the desired data with the re module, like this:
import re
# Raw string, so the regex engine receives \[ \d \s and \n unchanged
term = r'\[(\d)\].*,(\d+),.*-\s([\w\s]+)\n'
line = '{::[name]str1_str2_str3[0]},1,U0.00 - Sensor1 Not Ready\nTry Again,1,0,12'
capture = list(re.search(term, line).groups())
capture[-1] = '_'.join(capture[-1].split()).upper()
result = ','.join(capture)
# 0,1,SENSOR1_NOT_READY
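To process every row of the file rather than a single line, the same extraction could be sketched with the csv module. This is only a sketch under assumptions, not the answer author's method: it assumes the file really contains the two characters backslash and n (as the question displays them), it joins the parts with '.' as the question's format asks, and the in-memory SAMPLE and the helper name format_row are hypothetical.

```python
import csv
import io
import re

# Two rows copied from the question. Raw strings keep "\n" as the two literal
# characters backslash + n (an assumption about what the file really contains).
SAMPLE = "\n".join([
    r"{::[name]str1_str2_str3[0]},1,U0.00 - Sensor1 Not Ready\nTry Again,1,0,12",
    r"{::[name]str1_str2_str3[1]},2,U0.00 - Sensor2 Not Ready\nTry Again,1,0,12",
]) + "\n"


def format_row(row):
    """Hypothetical helper: build '<col1 index>.<col2>.<COL3_TEXT>' from one CSV row."""
    # First bracketed number in column 1; "[name]" is skipped because it has no digits.
    index = re.search(r"\[(\d+)\]", row[0]).group(1)
    # Keep the text after " - " and before the literal backslash-n.
    message = row[2].split(" - ", 1)[1].split("\\n", 1)[0]
    label = message.strip().upper().replace(" ", "_")
    return ".".join([index, row[1], label])


results = [format_row(row) for row in csv.reader(io.StringIO(SAMPLE))]
for item in results:
    print(item)
```

For a real file, io.StringIO(SAMPLE) would be replaced by a file opened with newline=''.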

Related

Edit a specific column in a text file

I have a text file with some content.
I want to edit only the "Medicalization" column. For example, when I enter B on the keyboard, every value in the "Medicalization" column should become B:
The medicalization letter is at position 14 on each line.
I tried something, but I get an "index out of range" error:
with open('d:/test.txt','r') as infile:
    with open('d:/test2.txt','w') as outfile:
        for line in infile :
            line = line.split()
            new_line = '"B"\n'.format(line[14])
            outfile.write(new_line)
Is it possible to do that with Python?
Since the data is tabular, use pandas.read_csv with sep \s+ and then pandas.DataFrame.loc to replace A with B in the medicalization column.
import pandas as pd
df = pd.read_csv("test.txt", sep=r"\s+")
df.loc[df["medicalization"] == "A" ,"medicalization"] = "B"
print(df)
   typtpt          name  medicalization
0       1      Entrance               B
1       2     Departure               B
2       3  Consultation               B
3       4       Meeting               B
4       5      Transfer               B
And if you want to save it back then use:
df.to_csv('test.txt', sep='\t', index=False)
The 'A' value you wish to change cannot possibly be at column 14 in every line. Look, for example, at the 4th row (with 'Consultation' as the name): even with a single space separating the columns, the third column would be at column position 17. So your assumption about fixed column positions must be wrong. If, for example, a single space or tab character separates each column, then for the first row of actual data the 'A' value would be at offset 12, and this would explain your exception.
Assuming a single space is separating each column from one another, then you could use the csv module as follows:
import csv
with open('d:/test.txt') as infile:
    with open('d:/test2.txt', 'w', newline='') as outfile:
        rdr = csv.reader(infile, delimiter=' ')
        wtr = csv.writer(outfile, delimiter=' ')
        # just write out the first row:
        header = next(rdr)
        wtr.writerow(header)
        for row in rdr:
            row[2] = 'B'
            wtr.writerow(row)
Or specify delimiter='\t' if a tab is used to separate the columns.
If an arbitrary number of whitespace characters (spaces or tabs) separates each column, then:
with open('test.txt') as infile:
    with open('test2.txt', 'w') as outfile:
        first_time = True
        for row in infile:
            columns = row.split()
            if first_time:
                first_time = False
            else:
                columns[2] = 'B'
            print(' '.join(columns), file=outfile)
The index out of range error comes from line = line.split(). This splits on all whitespace, so the result is a list such as ['01','Entrance','A'] (for line 2, for example). When you then index at 14, that position does not exist in the list.
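The failure can be reproduced in a few lines. This is a small illustration only; the row content is hypothetical and modelled on the example list in the explanation above.

```python
# A row shaped like the example list in the explanation (hypothetical content).
line = "01 Entrance A\n"

fields = line.split()   # whitespace split gives three items
print(fields)

try:
    fields[14]          # the list has no item at index 14
except IndexError as exc:
    error_message = str(exc)
    print(error_message)

# The medicalization value is the third item, so index 2 is the one to change.
fields[2] = "B"
new_line = " ".join(fields)
print(new_line)
```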
If your data file's format is consistent (all Medicalization data is in the 3rd column), you can achieve what you're after with pure Python like so:
with open('test.txt','r') as infile:
    with open('test2.txt','w') as outfile:
        for idx, line in enumerate(infile) :
            line = line.split()
            # if idx is 0 its the headers so we don't want to change those
            if idx != 0:
                line[2] = '"B"'
            outfile.write(' '.join(line) + '\n')
However, @Hamza's answer is potentially a nicer one using pandas.

How to find max and min values within lists without using maps/SQL?

I'm learning Python and have a data set (a csv file). I've been able to split the lines by comma, but now I need to find the max and min values in the third column and output the corresponding value from the first column of the same row.
This is the .csv file: https://www.dropbox.com/s/fj8tanwy1lr24yk/loan.csv?dl=0
I also can't use Pandas or any external libraries; I think it would have been easier if I used them
I have written this code so far:
f = open("loanData.csv", "r")
mylist = []
for line in f:
    mylist.append(line)
newdata = []
for row in mylist:
    data = row.split(",")
    newdata.append(data)
I'd use the built-in csv library to parse your CSV file, and then build a list of the third-column values:
import csv
with open("loanData.csv", "r", newline="") as loanCsv:
    loanCsvReader = csv.reader(loanCsv)
    # Comment out if no headers
    next(loanCsvReader, None)
    # Convert to float so max/min compare numerically rather than as strings
    # (this assumes the third column is numeric)
    loan_data = [float(row[2]) for row in loanCsvReader]
max_val = max(loan_data)
min_val = min(loan_data)
print("Max: {}".format(max_val))
print("Min: {}".format(min_val))
I don't know whether your file has a header row. If it doesn't, comment out this line:
next(loanCsvReader, None)
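Something like this might work. Indexing starts at zero, so the third column is index 2. The values are converted to float so that they compare numerically (this assumes the column is numeric and that mylist contains no header row).
min_val = min([float(row.split(',')[2]) for row in mylist])
max_val = max([float(row.split(',')[2]) for row in mylist])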
Separately, you could read your data into a list of lines with the following:
with open('loanData.csv', 'r') as f:
    data = f.read()
mylist = data.splitlines()
This does not depend on the operating system you're using: splitlines() handles both \n and \r\n line endings, and in text mode Python 3 already translates line endings to \n when reading.
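The question also asks for the first-column value that belongs to the largest and smallest third-column values. One way to get it, using only the standard library, is max() and min() with a key function so that whole rows are compared by their third column. The sketch below is hedged: the real file is not shown, so the SAMPLE data and its column names are hypothetical, and it assumes the third column is numeric.

```python
import csv
import io

# Hypothetical rows standing in for loanData.csv; the real column names and
# values are not shown in the question.
SAMPLE = """loan_id,term,amount
A1,36,5000
B2,60,12000
C3,36,750
"""

reader = csv.reader(io.StringIO(SAMPLE))
next(reader, None)                      # skip the header row
rows = [row for row in reader if row]   # ignore blank lines

# Compare on the third column as a number; keep the whole row so the
# first column is still available afterwards.
max_row = max(rows, key=lambda row: float(row[2]))
min_row = min(rows, key=lambda row: float(row[2]))

print("Max: {} (first column: {})".format(max_row[2], max_row[0]))
print("Min: {} (first column: {})".format(min_row[2], min_row[0]))
```

The float() conversion matters: compared as strings, "750" would rank above "12000" because the comparison goes character by character.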

Python: Split List into 2 Sublists by tab-separated elements

Question:
How can I split a list into two sublists where the elements are separated by a tab in the element?
Context:
I want to read a .txt file delimited by tabs into a Pandas DataFrame. The files look something like:
Column1 \t 123
Column2 \t
Column3 \t text
Meaning that each line has one column followed by one tab and then one value of the column (sometimes no value).
My idea was to read the file and save each line as an element of a list, then split the list into two keeping the first part before the tab as one list and the second part after the tab as another. Then build my dataframe from there.
for file in txt_files: #iterate over all files
    f = open(file) #open each file individually
    lines = f.readlines() #read each line as an element into a list
    f.close()
    #make sublists columns and values
You can read your files into a dataframe like this:
import pandas as pd
# Empty list to store dataframe rows
df_rows = []
# Read all text files (txt_files is the list of paths from the question)
for tf in txt_files:
    # For each file
    with open(tf) as f:
        # Empty dictionary to store column names and values
        df_dict = {}
        # For each line
        for line in f:
            # Drop the trailing newline, then split on the first tab
            k, v = line.rstrip('\n').split('\t', 1)
            # Column name as key, value as value
            df_dict[k] = v
    # Add the dictionary to list
    df_rows.append(df_dict)
# Read the list of dictionaries as a dataframe
df = pd.DataFrame(df_rows)
# Preview dataframe
df.head()
If I understand correctly, you can just transpose the dataframe read_csv will give you with delimiter='\t'.
Demo:
>>> from io import StringIO
>>> import pandas as pd
>>>
>>> file = StringIO('''Column1\t123
... Column2\t
... Column3\ttext''')
>>>
>>> df = pd.read_csv(file, delimiter='\t', index_col=0, header=None).T
>>> df
0 Column1 Column2 Column3
1     123     NaN    text
(If your delimiter is really ' \t ' then use delimiter=' \t ' and engine='python').
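For the literal question of splitting the list of lines into two sublists, a plain-Python sketch is also possible. It assumes each line holds a column name, one tab, and an optional value, as described in the question; the sample lines are taken from the question's example with the surrounding spaces removed.

```python
# Lines as f.readlines() would return them for the sample file in the question.
lines = ["Column1\t123\n", "Column2\t\n", "Column3\ttext\n"]

columns = []
values = []
for line in lines:
    # partition() splits on the first tab only and does not raise,
    # even when the value part is empty.
    name, _, value = line.rstrip("\n").partition("\t")
    columns.append(name.strip())
    values.append(value.strip())

print(columns)
print(values)
```

The two lists line up by position, so dict(zip(columns, values)) gives one record per file that can be collected into a list of rows.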

Splitting a delimited file and storing into new column

I am trying to split a csv file. After reading the delimited file, I want to split one of the columns further. My sample code:
import csv
sample = open('~/sample.txt')
adr = csv.reader(sample, delimiter='|')
for row in adr:
    a = row[0]
    b = row[1]
    c = row[2]
    d = row[3]
    new=""
    new = row[4].split(",")
    for row1 in new:
        print row1
sample.txt file contains:
aa|bb|cc|dd|1,2,3,4|xx
ab|ax|am|ef|1,5,6|jk
cx|kd|rd|j|1,9|k
The above code produces this output:
[1,2,3,4]
[1,5,6]
[1,9]
I am trying to further split the new column and will use the split output for comparison. For example, the desired output after splitting is:
aa|bb|cc|dd|1|2|3|4|xx
ab|ax|am|ef|1|5|6| |jk
cx|kd|rd|j|1|9| | |k
I also want to store blank or NULL values where a row has fewer items in the new column, as shown in the example above ([1,2,3,4] versus [1,5,6]). Is there a better way to split?
You're pretty much there already! A few more lines after new = row[4].split(",") are all you need.
    for i in range(len(new), 4):
        new.append('')
    newrow = row[0:4] + new + row[5:]
    print('|'.join(newrow))
Edit 2: addressing your comments below in the simplest way possible, just loop through it twice, looking for the longest "subarray" the first time. Re: printing extra times, you likely copied the code into the wrong place/indentation and have it in the loop.
Full code:
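import csv
import os
# open() does not expand '~', so expand it explicitly
sample = open(os.path.expanduser('~/sample.txt'))
adr = csv.reader(sample, delimiter='|')
longest = 0
# First pass: find the longest comma-separated list in the fifth column
for row in adr:
    curLen = len(row[4].split(','))
    if curLen > longest:
        longest = curLen
# Rewind the file for the second pass
sample.seek(0)
for row in adr:
    new = row[4].split(",")
    for i in range(len(new), longest):
        new.append(' ')
    newrow = row[0:4] + new + row[5:]
    print('|'.join(newrow))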
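The same two-step idea (find the widest list, then pad every row to that width) can be sketched without rewinding the file by reading all rows into memory first. This is an illustration only: it uses the three sample rows from the question held in a string, and it assumes the file is small enough to load at once.

```python
import csv
import io

# The sample.txt content from the question, held in memory for the sketch.
SAMPLE = "aa|bb|cc|dd|1,2,3,4|xx\nab|ax|am|ef|1,5,6|jk\ncx|kd|rd|j|1,9|k\n"

rows = list(csv.reader(io.StringIO(SAMPLE), delimiter="|"))

# Split the fifth column of every row, then find the widest result.
split_fields = [row[4].split(",") for row in rows]
width = max(len(parts) for parts in split_fields)

output_lines = []
for row, parts in zip(rows, split_fields):
    # Pad shorter lists with a single space so every row has the same columns.
    padded = parts + [" "] * (width - len(parts))
    output_lines.append("|".join(row[:4] + padded + row[5:]))

for text in output_lines:
    print(text)
```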

Inserting data into two columns of csv

My test1111.csv looks similar to this:
Sales #, Date, Tel Number, Comment
393ED3, 5/12/2010, 5555551212, left message
585E54, 6/15/2014, 5555551213, voice mail
585868, 8/16/2010, , number is 5555551214
I have the following code:
import re
import csv
from collections import defaultdict
# Below code places csv entries into dictionary so that they can be parsed
# by column. Then print statement prints Sales # column.
columns = defaultdict(list)
with open("c:\\test1111.csv", "r") as f:
    reader = csv.DictReader(f)
    for row in reader:
        for (k,v) in row.items():
            columns[k].append(v)
# To print all columns, use: print columns
# To print a specific column, use: print(columns['ST'])
# Below line takes list output and separates into new lines
sales1 = "\n".join(columns['Sales #'])
print sales1
# Below code searches all columns for a 10 digit number and outputs the
# results to a new csv file.
with open("c:\\test1111.csv", "r") as old, \
        open("c:\\results1111.csv", 'wb') as new:
    for line in old:
        #Regex to match exactly 10 digits
        match = re.search('(?<!\d)\d{10}(?!\d)', line)
        if match:
            match1 = match.group()
            print match1
            new.writelines((match1) + '\n')
        else:
            nomatch = "No match"
            print nomatch
            new.writelines((nomatch) + '\n')
The first section of the code opens the original csv and prints all entries from the Sales # column to stdout with each entry in its own row.
The second section of the code opens the original csv and searches every row for a 10 digit number. When it finds one it writes each one (or no match) to each row of a new csv.
What I would like to now do is to also write the sales column data to the new csv. So ultimately, the sales column data would appear as rows in the first column and the regex data would appear as rows in the second column in the new csv. I have been having trouble getting that to work as the new.writelines won't take two arguments. Can someone please help me with how to accomplish this?
I would like the results1111.csv to look like this:
393ED3, 5555551212
585E54, 5555551213
585868, 5555551214
Starting with the second part of your code, all you need to do is concatenate the sales data within your writelines:
sales_list = sales1.split('\n')
# Below code searches all columns for a 10 digit number and outputs the
# results to a new csv file.
with open("c:\\test1111.csv", "r") as old, \
        open("c:\\results1111.csv", 'wb') as new:
    next(old)  # skip the header row so each line lines up with sales_list
    i = 0 # counter to add the proper sales figure
    for line in old:
        #Regex to match exactly 10 digits
        match = re.search('(?<!\d)\d{10}(?!\d)', line)
        if match:
            match1 = match.group()
            print match1
            new.writelines(str(sales_list[i])+ ',' + (match1) + '\n')
        else:
            nomatch = "No match"
            print nomatch
            new.writelines(str(sales_list[i])+ ',' + (nomatch) + '\n')
        i += 1
Using the counter i, you can keep track of what row you're on and use that to add the corresponding sales column figure.
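If the counter feels fragile, the sales value and the matched number can be taken from the same row in a single pass and written with csv.writer. The following is a hedged Python 3 sketch rather than a drop-in replacement: it reads the sample data from the question out of an in-memory string, writes to an in-memory buffer instead of c:\results1111.csv, and relies on skipinitialspace=True to drop the space after each comma.

```python
import csv
import io
import re

# The test1111.csv content shown in the question.
SAMPLE = """Sales #, Date, Tel Number, Comment
393ED3, 5/12/2010, 5555551212, left message
585E54, 6/15/2014, 5555551213, voice mail
585868, 8/16/2010, , number is 5555551214
"""

# Exactly 10 digits, not preceded or followed by another digit.
TEN_DIGITS = re.compile(r"(?<!\d)\d{10}(?!\d)")

reader = csv.DictReader(io.StringIO(SAMPLE), skipinitialspace=True)
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")

for row in reader:
    # Search every field of the row, as the original line-based search did.
    match = TEN_DIGITS.search(",".join(row.values()))
    writer.writerow([row["Sales #"], match.group() if match else "No match"])

result = out.getvalue()
print(result)
```

Because both columns come from the same row object, there is no index to keep in step with the file.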
Just to point out that in a CSV, unless the spaces are really needed, they shouldn't be there. Your data should look like this:
Sales #,Date,Tel Number,Comment
393ED3,5/12/2010,5555551212,left message
585E54,6/15/2014,5555551213,voice mail
585868,8/16/2010,,number is 5555551214
As another way of getting the same answer, you can use the pandas data analysis library for tasks involving data tables. It takes only two lines to do what you want:
>>> import pandas as pd
# Read data (the first column becomes the index; DataFrame.from_csv was removed in pandas 1.0)
>>> data = pd.read_csv('/tmp/in.cvs', index_col=0)
>>> data
             Date  Tel Number               Comment
Sales#
393ED3  5/12/2010  5555551212          left message
585E54  6/15/2014  5555551213            voice mail
585868  8/16/2010         NaN  number is 5555551214
# Write data
>>> data.to_csv('/tmp/out.cvs', columns=['Tel Number'], na_rep='No match')
That last line will write to out.cvs the column Tel Number inserting No match when no telephone number is found, exactly what you want. Output file:
Sales#,Tel Number
393ED3,5555551212.0
585E54,5555551213.0
585868,No match
