Splitting a delimited file and storing into new column - python

I am trying to split csv file. After reading the delimited file, I want to split desired column furthur. My sample code:
import csv
sample = open('~/sample.txt')
adr = csv.reader(sample, delimiter='|')
for row in adr:
a = row[0]
b = row[1]
c = row[2]
d = row [3]
new=""
new = row[4].split(",")
for row1 in new:
print row1
sample.txt file contains:
aa|bb|cc|dd|1,2,3,4|xx
ab|ax|am|ef|1,5,6|jk
cx|kd|rd|j|1,9|k
Above code produce output as:
[1,2,3,4]
[1,5,6]
[1,9]
I am trying to further split new column and going to use splited output for comparison. For example, desired output for splitting will be :
aa|bb|cc|dd|1|2|3|4|xx
ab|ax|am|ef|1|5|6| |jk
cx|kd|rd|j|1|9| | |k
Also I want to store mutiple blank or NULL value of new column, as shown in above example [1,2,3,4], [1,5,6]. Is there better way to split?

You're pretty much there already! A few more lines after new = row[4].split(",") are all you need.
for i in range(len(new), 4):
new.append('')
newrow = row[0:4] + new + row[5:]
print('|'.join(newrow))
Edit 2: addressing your comments below in the simplest way possible, just loop through it twice, looking for the longest "subarray" the first time. Re: printing extra times, you likely copied the code into the wrong place/indentation and have it in the loop.
Full code:
import csv
sample = open('~/sample.txt')
adr = csv.reader(sample, delimiter='|')
longest = 0
for row in adr:
curLen = len(row[4].split(','))
if curLen > longest:
longest = curLen
sample.seek(0)
for row in adr:
new = row[4].split(",")
for i in range(len(new), longest):
new.append(' ')
newrow = row[0:4] + new + row[5:]
print('|'.join(newrow))

Related

I need to read a csv with unknown number of columns and then write the data to a csv with a set number of columns

So I have a file that looks like this:
name,number,email,job1,job2,job3,job4
I need to convert it to one that looks like this:
name,number,email,job1
name,number,email,job2
name,number,email,job3
name,number,email,job4
How would I do this in Python?
As said in a comment that you can use pandas to read, write and manipulate csv file.
Here is one example of how you can solve your problem with pandas in python
import pandas as pd
# df = pd.read_csv("filename.csv") # read csv file from disk
# comment out below line when open from disk
df = pd.DataFrame([['ss','0152','ss#','student','others']],columns=['name','number','email','job1','job2'])
print(df)
this line output is
name number email job1 job2
0 ss 0152 ss# student others
Now we need to know how many columns are there:
x = len(df.columns)
print(x)
it will store the number of column in x
5
Now let's create a empty Dataframe with columns= [name,number,email,job]
c = pd.DataFrame(columns=['name','number','email','job'])
print(c)
output:
Columns: [name, number, email, job]
Index: []
Now we use loop from range 3 to end of the column and concat datafarme with our empty dataframe:
for i in range(3,x):
df1 = df.iloc[:,0:3].copy() # we took first 3 column
df2 = df.iloc[:,[i]].copy() # we took ith coulmn
df1['job'] = df2; # added ith coulmn to the df1
c = pd.concat([df1,c]); # concat df1 and c
print(c)
output:
name number email job
0 ss 0152 ss# others
0 ss 0152 ss# student
Dataframe c has your desired output. Now you can save it using
c.to_csv('ouput.csv')
Let's assume this is the dataframe:
import pandas as pd
df = pd.DataFrame(columns=['name','number','email','job1','job2','job3','job4'])
df = df.append({'name':'jon', 'number':123, 'email':'smth#smth.smth', 'job1':'a','job2':'b','job3':'c','job4':'d'},ignore_index=True)
We define a new dataframe:
new_df = pd.DataFrame(columns=['name','number','email','job'])
Now, we loop over the old one to split it based on the jobs. I assume you have 4 jobs to split:
for i, row in df.iterrows():
for job in range(1,5):
job_col = "job" + str(job)
new_df = new_df.append({'name':row['name'], 'number':row['number'], 'email':row['email'], 'job':row[job_col]}, ignore_index=True)
You can use the csv module and Python's unpacking syntax to get the data from the input file and write it to the output file.
import csv
with open('input.csv', newline='') as infile, open('output.csv', 'w', newline='') as outfile:
reader = csv.reader(infile)
writer = csv.writer(outfile)
# Skip header row, if necessary
next(reader)
# Use sequence unpacking to get the fixed variables and
# and arbitrary number of "jobs".
for name, number, email, *jobs in reader:
for job in jobs:
writer.writerow([name, number, email, job])
Below:
with open('input.csv') as f_in:
lines = [l.strip() for l in f_in.readlines()]
with open('output.csv','w') as f_out:
for idx,line in enumerate(lines):
if idx > 0:
fields = line.split(',')
for idx in range(3,len(fields)):
f_out.write(','.join(fields[:3]) + ',' + fields[idx] + '\n')
input.csv
header row
name,number,email,job1,job2,job3,job4
name1,number1,email1,job11,job21,job31,job41
output.csv
name,number,email,job1
name,number,email,job2
name,number,email,job3
name,number,email,job4
name1,number1,email1,job11
name1,number1,email1,job21
name1,number1,email1,job31
name1,number1,email1,job41

Replacing and deleting columns from a csv using python

Here is a code that I am writing
import csv
import openpyxl
def read_file(fn):
rows = []
with open(fn) as f:
reader = csv.reader(f, quotechar='"',delimiter=",")
for row in reader:
if row:
rows.append(row)
return rows
replace = {x[0]:x[1:] for x in read_file("replace.csv")}
delete = set( (row[0] for row in read_file("delete.csv")) )
result = []
input_file="input.csv"
with open(input_file) as f:
reader = csv.reader(f, quotechar='"')
for row in reader:
if row:
if row[7] in delete:
continue
elif row[7] in replace:
result.append(replace[row[7]])
else:
result.append(row)
with open ("done.csv", "w+", newline="") as f:
w = csv.writer(f,quotechar='"', delimiter= ",")
w.writerows(result)
here are my files:
input.csv:
c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13
"-","-","-","-","-","-","-","aaaaa","-","-","bbbbb","-",","
"-","-","-","-","-","-","-","ccccc","-","-","ddddd","-",","
"-","-","-","-","-","-","-","eeeee","-","-","fffff","-",","
this is a 13 column csv. I am interested only in the 8th and the 11th fields.
this is my replace.csv:
"aaaaa","11111","22222"
delete.csv:
ccccc
so what I am doing is compare the first column of replace.csv(line by line) with the 8th column of input.csv and if they match then replace 8th column of input.csv with the second column of replace.csv and 11th column of input with the 3rd column of replace.csv
and for delete.csv it compares both files line by line and if match is found it deletes the entire row.
and if any line is not present in either replace.csv or delete.csv then print the line as it is.
so my desired output is:
c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13
"-","-","-","-","-","-","-",11111,"-","-",22222,"-",","
"-","-","-","-","-","-","-","eeeee","-","-","fffff","-",","
but when I run this code it gives me an output like this:
c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13
11111,22222
where am I going wrong?
I am trying to make changes to my program that I had earlier posted a question about.Since the input file has changed I am trying to make changes to my program.
https://stackoverflow.com/a/54388144/9279313
#anuj
I think SafeDev's solution is optimal but if you don't want to go with pandas, just make little changes in your code.
for row in reader:
if row:
if row[7] in delete:
continue
elif row[7] in replace:
key = row[7]
row[7] = replace[key][0]
row[10]= replace[key][1]
result.append(row)
else:
result.append(row)
Hope this solves your issue.
It's actually quite simple. Instead of making it by scratch just use the panda library. From there it's easier to handle any dataset. This is how you would do it:
EDIT:
import pandas as pd
input_csv = pd.read_csv('input.csv')
replace_csv = pd.read_csv('replace.csv', header=None)
delete_csv = pd.read_csv('delete.csv')
r_lst = [i for i in replace_csv.iloc[:, 0]]
d_lst = [i for i in delete_csv]
input2_csv = pd.DataFrame.copy(input_csv)
for i, row in input_csv.iterrows():
if row['c8'] in r_lst:
input2_csv.loc[i, 'c8'] = replace_csv.iloc[r_lst.index(row['c8']), 1]
input2_csv.loc[i, 'c11'] = replace_csv.iloc[r_lst.index(row['c8']), 2]
if row['c8'] in d_lst:
input2_csv = input2_csv[input2_csv.c8 != row['c8']]
input2_csv.to_csv('output.csv', index=False)
This process can be made even more dynamic by turning it into a function that has parameters of column names and replacing 'c8' and 'c11' with those two parameters.

Parsing a csv file with column data in Python

I want to read the first 3 columns of a csv file and do some modification before storing them.
Data in csv file:
{::[name]str1_str2_str3[0]},1,U0.00 - Sensor1 Not Ready\nTry Again,1,0,12
{::[name]str1_str2_str3[1]},2,U0.00 - Sensor2 Not Ready\nTry Again,1,0,12
From the column1, I just want to parse the value 0 or 1 within the [ ].
Then the value in column2
From column3, I want to parse the substring "Sensor1 Not Ready". Then convert to upper case and replace the space with underscore (eg - SENSOR1_NOT_READY). And then print the string in a new column.
Parsing format -
**<value from column 1>.<value from column 2>.<string from column 3>**
I am new to coding in Python. Can someone help me with this? What is the best and the most efficient way to do this?
TIA
What I have tried so far -
import csv
from collections import defaultdict
columns = defaultdict(list)
with open('filename.csv','rb') as f:
reader = csv.reader(f, delimiter=',')
for row in reader:
for i in range(len(row)):
columns[i].append(row[i])
columns = dict(columns)
Is this a good way for Column 3?
x = # Parsed data from Column 3'
a, b = x.split("\n") # 'a' denotes the substring before \n
c, d = a.split("-") # 'd' denotes the substring after '-'
e = d.upper()
new_str = str.replace(" ", "_")
print new_str
My suggestion is to read a whole line as a string, and then extract desired data with re module like this:
import re
term = '\[(\d)\].*,(\d+),.*-\s([\w\s]+)\\n'
line = '{::[name]str1_str2_str3[0]},1,U0.00 - Sensor1 Not Ready\nTry Again,1,0,12'
capture = list(re.search(term, line).groups())
capture[-1] = '_'.join(capture[-1].split()).upper()
result = ','.join(capture)
#0,1,Sensor1_Not_Ready

How to find minimum value from CSV file row in Python?

I'm a beginner at Python and I am using it for my project.
I want to extract the minimum value from column4 of a CSV file and I am not sure how to.
I can print the whole of column[4] but am not sure how to print the minimum value (just one column) from column[4].
CSV File: https://www.emcsg.com/marketdata/priceinformation
I'm downloading the Uniform Singapore Energy Price & Demand Forecast for 9 Sep.
Thank you in advance.
This is my current codes:
import csv
import operator
with open('sep.csv') as csvfile:
readCSV = csv.reader(csvfile, delimiter=',')
header = next(readCSV)
data = []
for row in readCSV:
Date = row[0]
Time = row[1]
Demand = row[2]
RCL = row[3]
USEP = row [4]
EHEUR = row [5]
LCP = row[6]
Regulations = row[7]
Primary = row[8]
Secondary = row[9]
Contingency = row[10]
Last_Updated = row[11]
print header[4]
print row[4]
not sure how are you reading the values. however, you can add all the values to and list and then:
list = []
<loop to extract values>
list.insert(index, value)
min_value = min(list)
Note: index is the 'place' where the value get inserted.
Your phrasing is a bit ambiguous. At first I thought you meant the minimum of the fourth row, but looking at the data you seem to be wanting the minimum of the fourth column (USEP($/MWh)). For that, (assuming that "Realtime_Sep-2017.csv" is the filename) you can do:
import pandas as pd
df = pd.read_csv("Realtime_Sep-2017.csv")
print(min(df["USEP($/MWh)"])
Other options include df.min()["USEP($/MWh)"], df.min()[4], and min(df.iloc[:,4])
EDIT 2 :
Solution for a column without pandas module:
with open("Realtime_Sep-2017.csv") as file:
lines = file.read().split("\n") #Read lines
num_list = []
for line in lines:
try:
item = line.split(",")[4][1:-1] #Choose 4th column and delete ""
num_list.append(float(item)) #Try to parse
except:
pass #If it can't parse, the string is not a number
print(max(num_list)) #Prints maximum value
print(min(num_list)) #Prints minimum value
Output:
81.92
79.83
EDIT :
Here is the solution for a column:
import pandas as pd
df = pd.read_csv("Realtime_Sep-2017.csv")
row_count = df.shape[0]
column_list = []
for i in range(row_count):
item = df.at[i, df.columns.values[4]] #4th column
column_list.append(float(item)) #parse float and append list
print(max(column_list)) #Prints maximum value
print(min(column_list)) #Prints minimum value
BEFORE EDIT :
(solution for a row)
Here is a simple code block:
with open("Realtime_Sep-2017.csv") as file:
lines = file.read().split("\n") #Reading lines
num_list = []
line = lines[3] #Choosing 4th row.
for item in line.split(","):
try:
num_list.append(float(item[1:-1])) #Try to parse
except:
pass #If it can't parse, the string is not a number
print(max(num_list)) #Prints maximum value
print(min(num_list)) #Prints minimum value

Split a row into multiple cells and keep the maximum value of second value for each gene

I am new to Python and I prepared a script that will modify the following csv file
accordingly:
1) Each row that contains multiple Gene entries separated by the /// such as:
C16orf52 /// LOC102725138 1.00551
should be transformed to:
C16orf52 1.00551
LOC102725138 1.00551
2) The same gene may have different ratio values
AASDHPPT 0.860705
AASDHPPT 0.983691
and we want to keep only the pair with the highest ratio value (delete the pair AASDHPPT 0.860705)
Here is the script I wrote but it does not assign the correct ratio values to the genes:
import csv
import pandas as pd
with open('2column.csv','rb') as f:
reader = csv.reader(f)
a = list(reader)
gene = []
ratio = []
for t in range(len(a)):
if '///' in a[t][0]:
s = a[t][0].split('///')
gene.append(s[0])
gene.append(s[1])
ratio.append(a[t][1])
ratio.append(a[t][1])
else:
gene.append(a[t][0])
ratio.append(a[t][1])
gene[t] = gene[t].strip()
newgene = []
newratio = []
for i in range(len(gene)):
g = gene[i]
r = ratio[i]
if g not in newgene:
newgene.append(g)
for j in range(i+1,len(gene)):
if g==gene[j]:
if ratio[j]>r:
r = ratio[j]
newratio.append(r)
for i in range(len(newgene)):
print newgene[i] + '\t' + newratio[i]
if len(newgene) > len(set(newgene)):
print 'missionfailed'
Thank you very much for any help or suggestion.
Try this:
with open('2column.csv') as f:
lines = f.read().splitlines()
new_lines = {}
for line in lines:
cols = line.split(',')
for part in cols[0].split('///'):
part = part.strip()
if not part in new_lines:
new_lines[part] = cols[1]
else:
if float(cols[1]) > float(new_lines[part]):
new_lines[part] = cols[1]
import csv
with open('clean_2column.csv', 'wb') as csvfile:
writer = csv.writer(csvfile, delimiter=' ',
quotechar='|', quoting=csv.QUOTE_MINIMAL)
for k, v in new_lines.items():
writer.writerow([k, v])
First of all, if you're importing Pandas, know that you have I/O Tools to read CSV files.
So first, let's import it that way :
df = pd.read_csv('2column.csv')
Then, you can extract the indexes where you have your '///' pattern:
l = list(df[df['Gene Symbol'].str.contains('///')].index)
Then, you can create your new rows :
for i in l :
for sub in df['Gene Symbol'][i].split('///') :
df=df.append(pd.DataFrame([[sub, df['Ratio(ifna vs. ctrl)'][i]]], columns = df.columns))
Then, drop the old ones :
df=df.drop(df.index[l])
Then, I'll do a little trick to remove your lowest duplicate values. First, I'll sort them by 'Ratio (ifna vs. ctrl)' then I'll drop all the duplicates but the first one :
df = df.sort('Ratio(ifna vs. ctrl)', ascending=False).drop_duplicates('Gene Symbol', keep='first')
If you want to keep your sorting by Gene Symbol and reset indexes to have simpler ones, simply do :
df = df.sort('Gene Symbol').reset_index(drop=True)
If you want to re-export your modified data to your csv, do :
df.to_csv('2column.csv')
EDIT : I edited my answer to correct syntax errors, I've tested this solution with your csv and it worked perfectly :)
This should work.
It uses the dictionary suggestion of Peter.
import csv
with open('2column.csv','r') as f:
reader = csv.reader(f)
original_file = list(reader)
# gets rid of the header
original_file = original_file[1:]
# create an empty dictionary
genes_ratio = {}
# loop over every row in the original file
for row in original_file:
gene_name = row[0]
gene_ratio = row[1]
# check if /// is in the string if so split the string
if '///' in gene_name:
gene_names = gene_name.split('///')
# loop over all the resulting compontents
for gene in gene_names:
# check if the component is in the dictionary
# if not in dictionary set value to gene_ratio
if gene not in genes_ratio:
genes_ratio[gene] = gene_ratio
# if in dictionary compare value in dictionary to gene_ratio
# if dictionary value is smaller overwrite value
elif genes_ratio[gene] < gene_ratio:
genes_ratio[gene] = gene_ratio
else:
if gene_name not in genes_ratio:
genes_ratio[gene_name] = gene_ratio
elif genes_ratio[gene_name] < gene_ratio:
genes_ratio[gene_name] = gene_ratio
#loop over dictionary and print gene names and their ratio values
for key in genes_ratio:
print key, genes_ratio[key]

Categories

Resources