Read Column Values Between Two Rows CSV Python - python

I have a CSV which is in the format:
Name1,Value1
,Value2
,Value3
Name2,Value40
,Value50
,Value60
Name3,Value5
,Value10
,Value15
There is not a set number of "values" per "name".
There is no pattern to the names.
I want to read the Values for Each Name into a dict such as:
Name1 : [Value1,Value2,Value3]
Name2 : [Value40,Value50,Value60]
etc.
My current code is this:
CSVFile = open("GroupsCSV.csv")
Reader = csv.reader(CSVFile)
for row in Reader:
    if row[0] and row[2]:
        objlist = []
        objlist.append(row[2])
        for row in Reader:
            if not row[0] and row[2]:
                objlist.append(row[2])
            else:
                break
        print(objlist)
This half-works.
It handles Name1, Name3, Name5, Name7, etc., skipping every other name.
I can't seem to find a way to stop it skipping.
I would prefer to do this without something like lambda (as it's not something I fully understand yet!).
EDIT: The real data has another, unnecessary column, hence the "row[2]" in the code.

Try pandas:
import pandas as pd
df = pd.read_csv('your_file.csv', header=None)
(df.ffill()              # fill the blanks with the previous Name
   .groupby([0])[1]      # collect rows with the same name
   .apply(list)          # put each group's values in a list
   .to_dict()            # make a dictionary
)
Output:
{'Name1': ['Value1', 'Value2', 'Value3'],
'Name2': ['Value40', 'Value50', 'Value60'],
'Name3': ['Value5', 'Value10', 'Value15']}
Update: a pure Python 3 solution:
with open('your_file.csv') as f:
    lines = f.readlines()
d = {}
for line in lines:
    row = line.rstrip('\n').split(',')  # drop the trailing newline before splitting
    if row[0] != '':
        key = row[0]
        d[key] = []
    d[key].append(row[1])
d

I think the issue you are facing is caused by your nested loop. Both loops consume the same iterator. The inner loop starts after the outer loop finds Name1 and breaks when it reads the Name2 row. That row has already been consumed, so when the outer loop continues after the break it moves on to the row after Name2, and Name2 is skipped.
You could have both conditions in the same loop:
# with open("GroupsCSV.csv") as csv_file:
#     reader = csv.reader(csv_file)
reader = [[1,2,3],[None,5,6]] # Mocking the csv input
objlist = []
for row in reader:
    if row[0] and row[2]:
        objlist.clear()
        objlist.append(row[2])
    elif not row[0] and row[2]:
        objlist.append(row[2])
    print(objlist)
EDIT: I have updated the code to provide a testable output.
The printed output looks as follows:
[3]
[3, 6]
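To get the dictionary the question asks for, the single-loop idea can be extended to remember the most recent name. The following is only a sketch: the function name group_values and the value_col parameter are made up for illustration, and the sample data is an in-memory stand-in for the real file (for the three-column file described in the question, value_col would be 2).

```python
import csv
import io

def group_values(lines, value_col=1):
    """Map each name to the list of values that follow it (hypothetical helper)."""
    groups = {}
    current = None  # the most recent non-empty name
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        if row[0]:
            current = row[0]
            groups[current] = []
        # Only append when a name has been seen and the value cell is non-empty.
        if current is not None and len(row) > value_col and row[value_col]:
            groups[current].append(row[value_col])
    return groups

# In-memory sample standing in for GroupsCSV.csv (assumption for illustration).
sample = io.StringIO("Name1,Value1\n,Value2\n,Value3\nName2,Value40\n,Value50\n")
result = group_values(sample)
print(result)
```

Because there is only one loop over the reader, no row is consumed twice and no name is skipped.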

Related

Turning a CSV file with a header into a python dictionary

Let's say I have the following example csv file:
a,b
100,200
400,500
How would I make into a dictionary like below:
{a:[100,400],b:[200,500]}
I am having trouble figuring out how to do it manually before I use a package, so that I understand it. Can anyone help?
Some code I tried:
with open("fake.csv") as f:
    index= 0
    dictionary = {}
    for line in f:
        words = line.strip()
        words = words.split(",")
        if index >= 1:
            for i in range(len(headers_list)):
                dictionary[headers_list[i]] = words[i]
                # only returns the last element which makes sense
        else:
            headers_list = words
        index += 1
At the very least, you should be using the built-in csv package for reading csv files without having to bother with parsing. That said, this first approach is still applicable to your .strip and .split technique:
1. Initialize a dictionary with the column names as keys and empty lists as values
2. Read a line from the csv reader
3. Zip the line's contents with the column names you got in step 1
4. For each key:value pair in the zip, update the dictionary by appending
import csv
with open("test.csv", "r") as file:
    reader = csv.reader(file)
    column_names = next(reader) # Reads the first line, which contains the header
    data = {col: [] for col in column_names}
    for row in reader:
        for key, value in zip(column_names, row):
            data[key].append(value)
Your issue was that you were using the assignment operator = to overwrite the contents of your dictionary on every iteration. This is why you either want to pre-initialize the dictionary like above, or use a membership check first to test if the key exists in the dictionary, adding it if not:
key = headers_list[i]
if key not in dictionary:
    dictionary[key] = []
dictionary[key].append(words[i])
An even cleaner shortcut is to take advantage of dict.get:
key = headers_list[i]
dictionary[key] = dictionary.get(key, []) + [words[i]]
Another approach would be to take advantage of the csv package by reading each row of the csv file as a dictionary itself:
with open("test.csv", "r") as file:
    reader = csv.DictReader(file)
    data = {}
    for row_dict in reader:
        for key, value in row_dict.items():
            data[key] = data.get(key, []) + [value]
Another standard library package you could use to clean this up further is collections, with defaultdict(list), where you can directly append to the dictionary at a given key without worrying about initializing with an empty list if the key wasn't already there.
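A minimal sketch of the defaultdict(list) variant, assuming the same two-column example; the in-memory sample below stands in for the real file:

```python
import csv
import io
from collections import defaultdict

# In-memory stand-in for the example file (assumption for illustration).
sample = io.StringIO("a,b\n100,200\n400,500\n")

data = defaultdict(list)  # missing keys start out as empty lists
for row_dict in csv.DictReader(sample):
    for key, value in row_dict.items():
        data[key].append(value)

data = dict(data)  # convert back to a plain dict if preferred
print(data)
```

Note that the values stay strings; wrap value in int() if numbers are needed.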
To do that, keep the column names and the data rows separate, then iterate over the columns and collect the value at the corresponding index from each row. I am not sure whether this works with empty values.
However, I am fairly sure that going through pandas would be much easier; it is a widely used library for working with data in external files.
import csv
datas = []
with open('fake.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    line_count = 0
    for row in csv_reader:
        if line_count == 0:
            cols = row
            line_count += 1
        else:
            datas.append(row)
            line_count += 1
result = {}  # named result to avoid shadowing the built-in dict
for index, col in enumerate(cols):  # iterate over the column names with their indices
    result[col] = []
    for data in datas:  # append this column's value from each data row
        result[col].append(data[index])
print(result)

Replacing and deleting columns from a csv using python

Here is the code I am writing:
import csv
import openpyxl
def read_file(fn):
    rows = []
    with open(fn) as f:
        reader = csv.reader(f, quotechar='"',delimiter=",")
        for row in reader:
            if row:
                rows.append(row)
    return rows
replace = {x[0]:x[1:] for x in read_file("replace.csv")}
delete = set( (row[0] for row in read_file("delete.csv")) )
result = []
input_file="input.csv"
with open(input_file) as f:
    reader = csv.reader(f, quotechar='"')
    for row in reader:
        if row:
            if row[7] in delete:
                continue
            elif row[7] in replace:
                result.append(replace[row[7]])
            else:
                result.append(row)
with open ("done.csv", "w+", newline="") as f:
    w = csv.writer(f,quotechar='"', delimiter= ",")
    w.writerows(result)
Here are my files:
input.csv:
c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13
"-","-","-","-","-","-","-","aaaaa","-","-","bbbbb","-",","
"-","-","-","-","-","-","-","ccccc","-","-","ddddd","-",","
"-","-","-","-","-","-","-","eeeee","-","-","fffff","-",","
This is a 13-column CSV. I am interested only in the 8th and 11th fields.
This is my replace.csv:
"aaaaa","11111","22222"
delete.csv:
ccccc
For replace.csv, I compare its first column, line by line, with the 8th column of input.csv. If they match, I replace the 8th column of input.csv with the second column of replace.csv, and the 11th column of input.csv with the third column of replace.csv.
For delete.csv, I compare the files line by line, and if a match is found the entire row is deleted.
If a line matches neither replace.csv nor delete.csv, it is written out as it is.
So my desired output is:
c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13
"-","-","-","-","-","-","-",11111,"-","-",22222,"-",","
"-","-","-","-","-","-","-","eeeee","-","-","fffff","-",","
But when I run this code, it gives me output like this:
c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13
11111,22222
Where am I going wrong?
I am trying to adapt the program from an earlier question of mine, since the input file has changed:
https://stackoverflow.com/a/54388144/9279313
@anuj, I think SafeDev's solution is optimal, but if you don't want to go with pandas, just make a few small changes to your code:
for row in reader:
    if row:
        if row[7] in delete:
            continue
        elif row[7] in replace:
            key = row[7]
            row[7] = replace[key][0]
            row[10]= replace[key][1]
            result.append(row)
        else:
            result.append(row)
Hope this solves your issue.
It's actually quite simple. Instead of building it from scratch, just use the pandas library, which makes it easier to handle any dataset. This is how you would do it:
EDIT:
import pandas as pd
input_csv = pd.read_csv('input.csv')
replace_csv = pd.read_csv('replace.csv', header=None)
delete_csv = pd.read_csv('delete.csv', header=None)  # delete.csv has no header row
r_lst = list(replace_csv.iloc[:, 0])
d_lst = list(delete_csv.iloc[:, 0])
input2_csv = input_csv.copy()
for i, row in input_csv.iterrows():
    if row['c8'] in r_lst:
        input2_csv.loc[i, 'c8'] = replace_csv.iloc[r_lst.index(row['c8']), 1]
        input2_csv.loc[i, 'c11'] = replace_csv.iloc[r_lst.index(row['c8']), 2]
    if row['c8'] in d_lst:
        input2_csv = input2_csv[input2_csv.c8 != row['c8']]
input2_csv.to_csv('output.csv', index=False)
This process can be made even more dynamic by turning it into a function that has parameters of column names and replacing 'c8' and 'c11' with those two parameters.
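The same parameterization idea can also be sketched with only the standard library. This is an illustrative variant, not the pandas code above: the helper name apply_rules and its parameters key_idx and target_idxs are hypothetical, and the three-column sample is a shortened stand-in for the 13-column input.

```python
import csv
import io

def apply_rules(rows, replace, delete, key_idx, target_idxs):
    """Drop rows whose key is in `delete`; overwrite target columns for keys in `replace`."""
    result = []
    for row in rows:
        if not row:
            continue  # skip blank lines
        key = row[key_idx]
        if key in delete:
            continue
        if key in replace:
            row = list(row)  # copy so the caller's row is left untouched
            for idx, new_value in zip(target_idxs, replace[key]):
                row[idx] = new_value
        result.append(row)
    return result

# Shortened sample: the key is in column index 1, the second target in index 2.
sample = io.StringIO("c1,c2,c3\n-,aaaaa,bbbbb\n-,ccccc,ddddd\n-,eeeee,fffff\n")
rows = list(csv.reader(sample))
cleaned = apply_rules(rows, {"aaaaa": ["11111", "22222"]}, {"ccccc"},
                      key_idx=1, target_idxs=(1, 2))
print(cleaned)
```

For the file in the question, key_idx would be 7 and target_idxs would be (7, 10).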

Python CSV import with nested list creation

I am trying to simply import a .csv into Python. I've read numerous documents but for the life of me I can't figure out how to do the following.
The CSV format is as follows
NYC,22,55
BOSTON,39,22
I'm trying to generate the following : {NYC = [22,55], BOSTON = [39,22]} so that I can call i[0] and i[1] in a loop for each variable.
I've tried
import csv
input_file = csv.DictReader(open(r"C:\Python\Sandbox\longlat.csv"))
for row in input_file:
    print(row)
This prints my data, but I don't know how to nest the two numeric values under the city name and generate the list that I'm hoping to get.
Thanks for your help, and sorry for the rookie question.
If you are not familiar with Python comprehensions, you can use the following code, which uses a for loop:
import csv
with open(r'C:\Python\Sandbox\longlat.csv', 'r') as f:
    reader = csv.reader(f)
    result = {}
    for row in reader:
        result[row[0]] = row[1:]
The previous code works if you want the numbers to stay strings. If you want them converted to numbers, use:
import csv
with open(r'C:\Python\Sandbox\longlat.csv', 'r') as f:
    reader = csv.reader(f)
    result = {}
    for row in reader:
        result[row[0]] = [int(e) for e in row[1:]] # float instead of int is also valid
Use dictionary comprehension:
import csv
with open(r'C:\Python\Sandbox\longlat.csv', mode='r') as csvfile:
    csvread = csv.reader(csvfile)
    result = {k: [int(c) for c in cs] for k, *cs in csvread}
This works in python-3.x, and produces on my machine:
>>> result
{'NYC': [22, 55], 'BOSTON': [39, 22]}
It also works for an arbitrary number of columns.
In case you use python-2.7, you can use indexing and slicing over sequence unpacking:
import csv
with open(r'C:\Python\Sandbox\longlat.csv', mode='r') as csvfile:
    csvread = csv.reader(csvfile)
    result = {row[0]: [int(c) for c in row[1:]] for row in csvread}
Each row will have 3 values. You want the first as the key and the rest as the value.
>>> row
['NYC','22','55']
>>> {row[0]: row[1:]}
{'NYC': ['22', '55']}
You can create the whole dict:
lookup = {row[0]: row[1:] for row in input_file}
You can also use pandas like so:
import pandas as pd
df = pd.read_csv(r'C:\Python\Sandbox\longlat.csv', header=None)  # the file has no header row
result = {}
for index, row in df.iterrows():
    result[row.iloc[0]] = row.iloc[1:].tolist()
Here's a hint: try familiarizing yourself with the str.split() method.
strVar = "NYC,22,55"
listVar = strVar.split(',') # ["NYC", "22", "55"]
cityVar = listVar[0] # "NYC"
restVar = listVar[1:] # ["22", "55"]
# If you want to convert `restVar` into integers
restVar = list(map(int, restVar)) # [22, 55]

Read in only rows in between certain strings Python

So I have a text file that I am trying to read with csv in python, however I only want the rows in between two rows that start with certain strings. I have no problems with just reading the data, I have:
import csv
with open('path to file','r') as inf:
    reader = csv.reader(inf, delimiter=" ")
and to get all the data I can just loop through and append to a list:
raw_data=[]
for row in reader:
    raw_data.append(row)
I know I can get the rows I want by doing something like:
for row in raw_data:
    if row[0] == 'string1':
        begin_idx = raw_data.index(row)
    elif row[0] == 'string2':
        end_idx = raw_data.index(row)
data=[]
for idx in range(begin_idx+1,end_idx):
    data.append(raw_data[idx])
However, I was hoping to be able to do this all at once when I first loop through the text file, so if anyone has any ideas on how this could be done, it would be appreciated.
Note, the reason I am not just looking for index of the rows I want is because they are just a list of integers that will change each time I run this. The pdf to text conversion I run isn't extremely clean, so the row titles don't line up with the actual data for the row.
Iterator objects such as reader are convenient here: a for loop just calls next() on the object, so a nested loop over the same reader continues where the outer loop left off.
This lets you go through the file in one linear pass by looping separately once you hit the starting string. Try this:
import csv
with open('path to file','r') as inf:
    reader = csv.reader(inf, delimiter=" ")
    data=[]
    for row in reader:
        if row[0] == 'string1':
            for row in reader:
                if row[0]=='string2':
                    break
                data.append(row)
You can introduce a state variable into your for loop:
data = []
copying = False
for row in reader:
    if row[0] == 'string2':
        copying = False  # stop before the end marker row is copied
    if copying:
        data.append(row)
    if row[0] == 'string1':
        copying = True   # start copying from the next row
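The state-variable approach can also be wrapped in a generator so it works on any reader. This is a sketch only: the function name rows_between is made up for illustration, and the sample text stands in for the real space-delimited file.

```python
import csv
import io

def rows_between(reader, start, end):
    """Yield the rows after the one starting with `start`, up to the one starting with `end`."""
    copying = False
    for row in reader:
        first = row[0] if row else None  # tolerate blank lines
        if first == end:
            copying = False
        elif copying:
            yield row
        elif first == start:
            copying = True

# In-memory sample standing in for the real file (assumption for illustration).
text = "header x\nstring1 a\n1 2\n3 4\nstring2 b\n5 6\n"
reader = csv.reader(io.StringIO(text), delimiter=" ")
data = list(rows_between(reader, "string1", "string2"))
print(data)  # [['1', '2'], ['3', '4']]
```

Neither marker row is included in the result, and the file is still read in a single pass.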

shorten csv file based on rules python

I am stuck writing the following program.
I have a csv file
"SNo","Column1","Column2"
"A1","X","Y"
"A2","A","B"
"A1","X","Z"
"A3","M","N"
"A1","D","E"
I want to shorten this csv to follow these rules
a.) If the SNo occurs more than once in the file,
combine all column1 and column2 entries of that serial number
b.) If same column1 entries and column2 entries occur more than once,
then do not combine them twice.
Therefore the output of the above should be
"SNo","Column1","Column2"
"A1","X,D","Y,Z,E"
"A2","A","B"
"A3","M","N"
So far I am reading the csv file and iterating over the rows, checking whether the SNo of the next row is the same as the previous row. What's the best way to combine them?
import csv
temp = "A1"
col1=""
col2=""
col3=""
with open("C:\\file\\file1.csv","rb") as f:
    reader = csv.reader(f)
    for row in reader:
        if row[0] == temp:
            continue
        col1 = col1+row[1]
        col2=col2+row[2]
        col3=col3+row[3]
        temp = row[0]
        print row[0]+";"+col1+";"+col2+";"+col3
        col1=""
        col2=""
        col3=""
Please let me know a good way to do this.
Thanks
The simplest approach is to maintain a dictionary with keys as serial numbers and sets to contain the columns. Then you could do something like the following:
my_dict = {}
for row in reader:
    if row[0] not in my_dict:
        my_dict[row[0]] = [set(), set()]
    my_dict[row[0]][0].add(row[1])
    my_dict[row[0]][1].add(row[2])
Writing the file out (to a file opened as file_out) would be as simple as iterating through the dictionary using a join command:
for k in my_dict.keys():
    file_out.write("{0},\"{1}\",\"{2}\"\n".format(
        k,
        ','.join([x for x in my_dict[k][0]]),
        ','.join([x for x in my_dict[k][1]])
    ))
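One caveat with sets is that they do not preserve the order in which values were seen, so "X,D" could come out as "D,X". A possible variant, assuming Python 3.7+ where dicts keep insertion order, uses dict keys as an ordered set and lets csv.writer handle the quoting. The function name merge_rows is hypothetical and the data is an in-memory copy of the example:

```python
import csv
import io

def merge_rows(reader):
    """Combine Column1/Column2 values per SNo, dropping repeats but keeping first-seen order."""
    header = next(reader)  # the first row holds the column names
    merged = {}  # SNo -> [dict of Column1 values, dict of Column2 values]
    for row in reader:
        if not row:
            continue
        cols = merged.setdefault(row[0], [{}, {}])
        cols[0][row[1]] = None  # dict keys act as an insertion-ordered set
        cols[1][row[2]] = None
    rows = [[sno, ",".join(c1), ",".join(c2)] for sno, (c1, c2) in merged.items()]
    return header, rows

source = io.StringIO(
    '"SNo","Column1","Column2"\n'
    '"A1","X","Y"\n"A2","A","B"\n"A1","X","Z"\n"A3","M","N"\n"A1","D","E"\n'
)
header, rows = merge_rows(csv.reader(source))

out = io.StringIO()
writer = csv.writer(out, quoting=csv.QUOTE_ALL)  # quote every field, as in the example
writer.writerow(header)
writer.writerows(rows)
print(out.getvalue())
```

Reading the header separately also keeps it from being treated as a serial number.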
