Taking last line from each row in a large csv file? - python

I have a CSV with 12000 rows, with multiple lines in each row.
I need to read the last line of each of the 12000 rows and write it into a new column.
"► Контакт с пациентом | 07.02.2019 | |
► Принять в работу | 07.02.2019 | |
► Контакт с пациентом | 08.02.2019 | |
► Получить КП | 14.02.2019 | |
► ждем КП | 18.02.2019 | |
► отправил ему ответ и стоимости лекарств! через дви недели с ним связываться | 05.03.2019 | |
► арихив | 23.03.2019 | | ";
"► Контакт с пациентом | 19.06.2019 | |
► Принять в работу | 19.06.2019 | |
► Контакт с пациентом | 26.08.2019 | |
► Архив. | 10.09.2019 | | ";
I can do that for one row only, and that's it. How can I do it for all 12000 rows?
import pandas as pd
df = pd.read_csv('/Users/gfidarov/Desktop/crosscheck/crosscheck/sheet1')
r = df.split('|')
r = r[-4:]
r = '|'.join(r)
print(r)
Here I can read it with the csv library, but I can't take only the last line. And if I try to slice it like I did with pandas (row = row[-4:]), I get an error. How can I solve my problem?
import csv
with open('/Users/gfidarov/Desktop/sheet_one') as f:
    reader = csv.DictReader(f, delimiter='|')
    for row in reader:
        print(list(row))

For that file, the last line of each row is the line ending with a semicolon (;) following a double quote (").
So this could be enough:
with open('/Users/gfidarov/Desktop/sheet_one') as f:
    for line in f:
        if line.strip().endswith('";'):  # OK, this is the line we want...
            line = line.strip().strip('";')  # clean it a little
            print(line)
BTW, the csv attempt did not work because, by default, the double quote is used to quote fields containing the delimiter or newlines, so here the csv module will only see one single field.
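If you still want to go through the csv module, one way around that behaviour is to turn quote handling off entirely. A minimal sketch, assuming the pipes never occur inside a field (with quoting disabled, each physical line simply splits on the pipes):
import csv

# Sketch: csv.QUOTE_NONE tells the reader to treat double quotes as
# ordinary characters instead of quoting markers.
with open('/Users/gfidarov/Desktop/sheet_one') as f:
    reader = csv.reader(f, delimiter='|', quoting=csv.QUOTE_NONE)
    for row in reader:
        print(row)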

row in DictReader is a dict whose keys are taken from the first row.
When you use list(row), that only gives you those keys.
You want to use csv.reader instead of csv.DictReader, which gives you a list for each row.
with open('/Users/gfidarov/Desktop/sheet_one.csv') as f:
    reader = csv.reader(f, delimiter='|')
    for row in reader:
        print(row)
Also, like @BergeBallesta said, the double quotes cause the error,
but you need to use a text editor to find and replace the "s and the ;s so the csv module can read it properly.
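For the original goal (the last line of every row written into a new column), pandas can do it in a few lines once the file loads cleanly. A minimal sketch, assuming the multi-line history ends up as one cell per row in a column hypothetically named 'history':
import pandas as pd

# Sketch: split each history cell on newlines, keep the last line,
# trim the trailing '";' and store the result in a new column.
df = pd.read_csv('/Users/gfidarov/Desktop/sheet_one.csv')
df['last_line'] = df['history'].str.split('\n').str[-1].str.strip(' ";')
df.to_csv('sheet_one_with_last_line.csv', index=False)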

TypeError: '_csv.reader' object is not subscriptable and days passed [duplicate]

I'm trying to parse through a csv file and extract the data from only specific columns.
Example csv:
ID | Name | Address | City | State | Zip | Phone | OPEID | IPEDS |
10 | C... | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
I'm trying to capture only specific columns, say ID, Name, Zip and Phone.
Code I've looked at has led me to believe I can call a specific column by its corresponding number, i.e. Name would correspond to 2, and iterating through each row using row[2] would produce all the items in column 2. Only it doesn't.
Here's what I've done so far:
import sys, argparse, csv
from settings import *

# command arguments
parser = argparse.ArgumentParser(description='csv to postgres',
                                 fromfile_prefix_chars="@")
parser.add_argument('file', help='csv file to import', action='store')
args = parser.parse_args()
csv_file = args.file

# open csv file
with open(csv_file, 'rb') as csvfile:
    # get number of columns
    for line in csvfile.readlines():
        array = line.split(',')
        first_item = array[0]
    num_columns = len(array)
    csvfile.seek(0)
    reader = csv.reader(csvfile, delimiter=' ')
    included_cols = [1, 2, 6, 7]
    for row in reader:
        content = list(row[i] for i in included_cols)
    print content
and I'm expecting that this will print out only the specific columns I want for each row, except it doesn't; I get the last column only.
The only way you would be getting the last column from this code is if you don't include your print statement in your for loop.
This is most likely the end of your code:
for row in reader:
    content = list(row[i] for i in included_cols)
print content
You want it to be this:
for row in reader:
    content = list(row[i] for i in included_cols)
    print content
Now that we have covered your mistake, I would like to take this time to introduce you to the pandas module.
Pandas is spectacular for dealing with csv files, and the following code would be all you need to read a csv and save an entire column into a variable:
import pandas as pd
df = pd.read_csv(csv_file)
saved_column = df.column_name #you can also use df['column_name']
so if you wanted to save all of the info in your column Names into a variable, this is all you need to do:
names = df.Names
It's a great module and I suggest you look into it. If for some reason your print statement was in the for loop and it was still only printing out the last column (which shouldn't happen), let me know if my assumption was wrong. Your posted code has a lot of indentation errors, so it was hard to know what was supposed to be where. Hope this was helpful!
import csv
from collections import defaultdict

columns = defaultdict(list)  # each value in each column is appended to a list

with open('file.txt') as f:
    reader = csv.DictReader(f)  # read rows into a dictionary format
    for row in reader:  # read a row as {column1: value1, column2: value2,...}
        for (k, v) in row.items():  # go over each column name and value
            columns[k].append(v)  # append the value into the appropriate list
                                  # based on column name k

print(columns['name'])
print(columns['phone'])
print(columns['street'])
With a file like
name,phone,street
Bob,0893,32 Silly
James,000,400 McHilly
Smithers,4442,23 Looped St.
Will output
>>>
['Bob', 'James', 'Smithers']
['0893', '000', '4442']
['32 Silly', '400 McHilly', '23 Looped St.']
Or alternatively if you want numerical indexing for the columns:
with open('file.txt') as f:
    reader = csv.reader(f)
    next(reader)
    for row in reader:
        for (i, v) in enumerate(row):
            columns[i].append(v)
print(columns[0])
>>>
['Bob', 'James', 'Smithers']
To change the delimiter, add delimiter=" " to the appropriate instantiation, i.e. reader = csv.reader(f, delimiter=" ").
Use pandas:
import pandas as pd
my_csv = pd.read_csv(filename)
column = my_csv.column_name
# you can also use my_csv['column_name']
Discard unneeded columns at parse time:
my_filtered_csv = pd.read_csv(filename, usecols=['col1', 'col3', 'col7'])
P.S. I'm just aggregating what others have said in a simple manner. The actual answers are taken from here and here.
You can use numpy.loadtxt(filename). For example, if this is your database .csv:
ID | Name | Address | City | State | Zip | Phone | OPEID | IPEDS |
10 | Adam | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
10 | Carl | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
10 | Adolf | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
10 | Den | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
And you want the Name column:
import numpy as np
b=np.loadtxt(r'filepath\name.csv',dtype=str,delimiter='|',skiprows=1,usecols=(1,))
>>> b
array([' Adam ', ' Carl ', ' Adolf ', ' Den '],
dtype='|S7')
More easily, you can use genfromtxt:
b = np.genfromtxt(r'filepath\name.csv', delimiter='|', names=True, dtype=None)
>>> b['Name']
array([' Adam ', ' Carl ', ' Adolf ', ' Den '],
dtype='|S7')
With pandas you can use read_csv with usecols parameter:
df = pd.read_csv(filename, usecols=['col1', 'col3', 'col7'])
Example:
import pandas as pd
import io
s = '''
total_bill,tip,sex,smoker,day,time,size
16.99,1.01,Female,No,Sun,Dinner,2
10.34,1.66,Male,No,Sun,Dinner,3
21.01,3.5,Male,No,Sun,Dinner,3
'''
df = pd.read_csv(io.StringIO(s), usecols=['total_bill', 'day', 'size'])
print(df)
total_bill day size
0 16.99 Sun 2
1 10.34 Sun 3
2 21.01 Sun 3
Context: For this type of work you should use the amazing python petl library. That will save you a lot of work and potential frustration from doing things 'manually' with the standard csv module. AFAIK, the only people who still use the csv module are those who have not yet discovered better tools for working with tabular data (pandas, petl, etc.), which is fine, but if you plan to work with a lot of data in your career from various strange sources, learning something like petl is one of the best investments you can make. Getting started should only take 30 minutes after you've done pip install petl. The documentation is excellent.
Answer: Let's say you have the first table in a csv file (you can also load directly from the database using petl). Then you would simply load it and do the following.
from petl import fromcsv, look, cut, tocsv

# Load the table
table1 = fromcsv('table1.csv')
# Alter the columns
table2 = cut(table1, 'Song_Name', 'Artist_ID')
# Have a quick look to make sure things are OK; prints a nicely formatted table to your console
print(look(table2))
# Save to a new file
tocsv(table2, 'new.csv')
I think there is an easier way
import pandas as pd
dataset = pd.read_csv('table1.csv')
ftCol = dataset.iloc[:, 0].values
Here, in iloc[:, 0], the : means all rows and the 0 is the position of the column.
In the example below, ID will be selected:
ID | Name | Address | City | State | Zip | Phone | OPEID | IPEDS |
10 | C... | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
import pandas as pd
csv_file = pd.read_csv("file.csv")
column_val_list = csv_file.column_name._ndarray_values
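Note that ._ndarray_values was a private pandas attribute and has since been removed; a sketch of the same idea on current pandas would use the public .to_numpy() method instead:
import pandas as pd

# Sketch: extract one column as a NumPy array via the public API.
csv_file = pd.read_csv("file.csv")
column_val_list = csv_file["column_name"].to_numpy()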
Thanks to the way you can index and subset a pandas dataframe, a very easy way to extract a single column from a csv file into a variable is:
myVar = pd.read_csv('YourPath', sep=",")['ColumnName']
A few things to consider:
The snippet above will produce a pandas Series, not a DataFrame.
The suggestion from ayhan with usecols will also be faster if speed is an issue.
Testing the two different approaches using %timeit on a 2122 KB sized csv file yields 22.8 ms for the usecols approach and 53 ms for my suggested approach.
And don't forget import pandas as pd
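If you want to reproduce that comparison yourself, here is a rough sketch with the standard timeit module (the file name 'big.csv' and the column name are hypothetical):
import timeit

# Sketch: time 10 reads of each approach and report the per-read average.
setup = "import pandas as pd"
usecols = timeit.timeit(
    "pd.read_csv('big.csv', usecols=['ColumnName'])", setup=setup, number=10)
full = timeit.timeit(
    "pd.read_csv('big.csv')['ColumnName']", setup=setup, number=10)
print('usecols: %.3fs per read, full read: %.3fs per read'
      % (usecols / 10, full / 10))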
If you need to process the columns separately, I like to destructure the columns with the zip(*iterable) pattern (effectively "unzip"). So for your example:
ids, names, zips, phones = zip(*(
    (row[1], row[2], row[6], row[7])
    for row in reader
))
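A self-contained sketch of the same pattern, assuming a hypothetical data.csv with a header row and at least eight columns:
import csv

with open('data.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    # zip(*...) transposes the per-row tuples into one tuple per column
    ids, names, zips, phones = zip(*(
        (row[1], row[2], row[6], row[7])
        for row in reader
    ))

print(names)  # every value from the Name column, as one tuple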
import pandas as pd
dataset = pd.read_csv('Train.csv')
X = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values
X is a bunch of columns; use it if you want to read more than one column.
y is a single column; use it to read one column.
[:, 1:-1] means [row_index : to_row_index, column_index : to_column_index].
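A tiny demonstration of that slicing on an inline frame (the column names are made up):
import pandas as pd

dataset = pd.DataFrame({'id': [1, 2], 'f1': [0.1, 0.2],
                        'f2': [5, 6], 'label': ['a', 'b']})
X = dataset.iloc[:, 1:-1].values  # all rows, every column between first and last (f1, f2)
y = dataset.iloc[:, -1].values    # all rows, the last column only (label)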
SAMPLE.CSV
a, 1, +
b, 2, -
c, 3, *
d, 4, /
import pandas as pd

column_names = ["Letter", "Number", "Symbol"]
df = pd.read_csv("sample.csv", names=column_names)
print(df)
OUTPUT
Letter Number Symbol
0 a 1 +
1 b 2 -
2 c 3 *
3 d 4 /
letters = df.Letter.to_list()
print(letters)
OUTPUT
['a', 'b', 'c', 'd']
import csv

with open('input.csv', encoding='utf-8-sig') as csv_file:
    # the below statement will skip the first row
    next(csv_file)
    reader = csv.DictReader(csv_file)
    Time_col = {'Time': []}
    # print(Time_col)
    for record in reader:
        Time_col['Time'].append(record['Time'])
print(Time_col)
From CSV File Reading and Writing (the csv module documentation), you can import csv and use this code:
with open('names.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row['first_name'], row['last_name'])
To fetch the column names, use readline() instead of readlines(), which avoids looping over and reading the complete file into an array:
with open(csv_file, 'rb') as csvfile:
    # get the header line and count the columns
    line = csvfile.readline()
    first_item = line.split(',')
    num_columns = len(first_item)

Grouping CSV Rows By The Names of Users

I have a table in Python with the following data from a CSV:
| subscriberKey        | Name  | Job           |
| -------------------- | ----- | ------------- |
| 123@yahoo.com        | Brian | Computer Tech |
| example@gmail.com    | Brian | Sales         |
| someone@google.com   | Gabby | Sales         |
| testinge@sendesk.com | Gabby | Marketing     |
| sandbox@aol.com      | Tyler | Porter        |
I want to be able to group the data by the Name and have all of the other cells come with it.
It should end up looking like this.
| subscriberKey        | Name  | Job           |
| -------------------- | ----- | ------------- |
| 123@yahoo.com        | Brian | Computer Tech |
| example@gmail.com    | Brian | Sales         |

| subscriberKey        | Name  | Job           |
| -------------------- | ----- | ------------- |
| someone@google.com   | Gabby | Sales         |
| testinge@sendesk.com | Gabby | Marketing     |

| subscriberKey        | Name  | Job           |
| -------------------- | ----- | ------------- |
| sandbox@aol.com      | Tyler | Porter        |
Furthermore, I want to create a new csv file for every table that is created. I have tried to loop through it but have failed too many times. I am currently back at the start and only have the file propagating in its normal table. Can anyone help?
import csv

f = open('work.csv')
csv_f = csv.reader(f)
for row in csv_f:
    print(row)
When you are trying to group variables based on a certain key (the name in this case) a hashmap is usually a good data structure to try.
As a general solution for future readers:
1. Create an empty dictionary.
2. Choose the key that you want to group your data by.
3. Iterate over the data and parse the key and related items.
4. Add the related items to dict[key].
Now each key in dict will have a list of all the items related to it.
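A minimal generic sketch of those steps, assuming data is an iterable of (key, item) pairs (the names are hypothetical):
from collections import defaultdict

def group_by_key(data):
    groups = defaultdict(list)
    for key, item in data:
        groups[key].append(item)  # every item lands in its key's list
    return groups

print(group_by_key([('a', 1), ('b', 2), ('a', 3)]))
# defaultdict(<class 'list'>, {'a': [1, 3], 'b': [2]})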
Tailored more specifically to the OP's question:
import collections


def write_csv(name, lines):
    with open(f"{name}_work.csv", "w") as f:
        for line in lines:
            f.write(','.join(item for item in line))
            f.write('\n')


if __name__ == "__main__":
    # LOAD DATA
    with open("work.csv", 'r') as f:
        lines = []
        for line in f.readlines():
            lines.append(line.strip('\n').split(','))

    # GROUP DATA BY NAME INTO A DICTIONARY
    names = collections.defaultdict(list)
    for email, name, job in lines[1:]:
        names[name].append((email, job))

    # WRITE A NEW .csv FILE FOR EACH NAME
    for name in names:
        new_lines = lines[:1]  # keep the header row
        for email, job in names[name]:
            new_lines.append([email, name, job])  # same column order as the header
        write_csv(name, new_lines)

string manipulation, data wrangling, regex

I have a .txt file of 3 million rows. The file contains data that looks like this:
# RSYNC: 0 1 1 0 512 0
#$SOA 5m localhost. hostmaster.localhost. 1906022338 1h 10m 5d 1s
# random_number_ofspaces_before_this text $TTL 60s
#more random information
:127.0.1.2:https://www.spamhaus.org/query/domain/$
test
:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-0m5tk.com
.0-1-hub.com
.zzzy1129.cn
:127.0.1.4:https://www.spamhaus.org/query/domain/$
.0-il.ml
.005verf-desj.com
.01accesfunds.com
In the above data, there is a code associated with all domains listed beneath it.
I want to turn the above data into a format that can be loaded into HiveQL/SQL. The HiveQL table should look like:
+--------------------+--------------+-------------+-----------------------------------------------------+
| domain_name | period_count | parsed_code | raw_code |
+--------------------+--------------+-------------+-----------------------------------------------------+
| test | 0 | 127.0.1.2 | :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .0-0m5tk.com | 2 | 127.0.1.2 | :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .0-1-hub.com | 2 | 127.0.1.2 | :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .zzzy1129.cn | 2 | 127.0.1.2 | :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .0-il.ml | 2 | 127.0.1.4 | :127.0.1.4:https://www.spamhaus.org/query/domain/$ |
| .005verf-desj.com | 2 | 127.0.1.4 | :127.0.1.4:https://www.spamhaus.org/query/domain/$ |
| .01accesfunds.com | 2 | 127.0.1.4 | :127.0.1.4:https://www.spamhaus.org/query/domain/$ |
+--------------------+--------------+-------------+-----------------------------------------------------+
Please note that I do not want the vertical bars in any output. They are just to make the above look like a table
I'm guessing that creating a HiveQL table like the above will involve converting the .txt into a .csv or a Pandas data frame. If creating a .csv, then the .csv would probably look like:
domain_name,period_count,parsed_code,raw_code
test,0,127.0.1.2,:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-0m5tk.com,2,127.0.1.2,:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-1-hub.com,2,127.0.1.2,:127.0.1.2:https://www.spamhaus.org/query/domain/$
.zzzy1129.cn,2,127.0.1.2,:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-il.ml,2,127.0.1.4,:127.0.1.4:https://www.spamhaus.org/query/domain/$
.005verf-desj.com,2,127.0.1.4,:127.0.1.4:https://www.spamhaus.org/query/domain/$
.01accesfunds.com,2,127.0.1.4,:127.0.1.4:https://www.spamhaus.org/query/domain/$
I'd be interested in a Python solution, but lack familiarity with the packages and functions necessary to complete the above data wrangling steps. I'm looking for a complete solution, or code tidbits to construct my own solution. I'm guessing regular expressions will be needed to identify the "category" or "code" line in the raw data. They always start with ":127.0.1." I'd also like to parse the code out to create a parsed_code column, and a period_count column that counts the number of periods in the domain_name string. For testing purposes, please create a .txt of the sample data I have provided at the beginning of this post.
Regardless of how you want to format things in the end, I suppose the first step is to separate the domain_name and code. That part is pure Python:
rows = []
code = None
parsed_code = None
with open('input.txt', 'r') as f:
    for line in f:
        line = line.rstrip('\n')
        if line.startswith(':127'):
            code = line
            parsed_code = line.split(':')[1]
            continue
        if line.startswith('#'):
            continue
        period_count = line.count('.')
        rows.append((line, period_count, parsed_code, code))
Just for illustration, you can use pandas to format the data nicely as tables, which might help if you want to pipe this to SQL, but it's not absolutely necessary. Post-processing of strings is also quite straightforward in pandas.
import pandas as pd
df = pd.DataFrame(rows, columns=['domain_name', 'period_count', 'parsed_code', 'raw_code'])
print (df)
prints this:
domain_name period_count parsed_code raw_code
0 test 0 127.0.1.2 :127.0.1.2:https://www.spamhaus.org/query/doma...
1 .0-0m5tk.com 2 127.0.1.2 :127.0.1.2:https://www.spamhaus.org/query/doma...
2 .0-1-hub.com 2 127.0.1.2 :127.0.1.2:https://www.spamhaus.org/query/doma...
3 .zzzy1129.cn 2 127.0.1.2 :127.0.1.2:https://www.spamhaus.org/query/doma...
4 .0-il.ml 2 127.0.1.4 :127.0.1.4:https://www.spamhaus.org/query/doma...
5 .005verf-desj.com 2 127.0.1.4 :127.0.1.4:https://www.spamhaus.org/query/doma...
6 .01accesfunds.com 2 127.0.1.4 :127.0.1.4:https://www.spamhaus.org/query/doma...
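From there, writing the flat CSV layout described in the question is a single call; a sketch, assuming the df built above (the output file name is hypothetical):
# Sketch: write a plain comma-separated file with no index column,
# ready for a Hive external table or LOAD DATA statement.
df.to_csv('spamhaus_domains.csv', index=False)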
You can do all of this with the Python standard library.
HEADER = "domain_name | code"

# Open files
with open("input.txt") as f_in, open("output.txt", "w") as f_out:
    # Write header
    print(HEADER, file=f_out)
    print("-" * len(HEADER), file=f_out)
    # Parse file and output in correct format
    code = None
    for line in f_in:
        line = line.rstrip('\n')  # drop the newline so endswith("$") can match
        if line.startswith("#"):
            # Ignore comments
            continue
        if line.endswith("$"):
            # Store line as the current "code"
            code = line
        else:
            # Write these domain_name entries into the
            # output file separated by ' | '
            print(line, code, sep=" | ", file=f_out)

how to check list which has tab characters in it in python?

I have a data.csv file with the below content in it, and at the end of this file it has some new lines as well. Now I want to read this file and get the value from the last row for a particular column.
Connecting to the ControlService endpoint
Found 3 rows.
Requests List:
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Client ID | Client Type | Service Type | Status | Trust Domain | Data Instance Name | Data Version | Creation Time | Last Update | Scheduled Time |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
REFRESH_ROUTINGTIER_ARTIFACTS_1465901168866 | ROUTINGTIER_ARTIFACTS | SYSTEM | COMPLETED | RRA Bulk Client | soa_server1 | 18.2.2.0.0 | 2016-06-14 03:49:55 -07:00 | 2016-06-14 03:49:57 -07:00 | --- |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
500333443 | CREATE | [FA_GSI] | COMPLETED | holder | soa_server1 | 18.3.2.0.0 | 2018-08-07 11:59:57 -07:00 | 2018-08-07 12:04:37 -07:00 | --- |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
500333446 | CREATE | [FA_GSI] | COMPLETED | holder-test | soa_server1 | 18.3.2.0.0 | 2018-08-07 12:04:48 -07:00 | 2018-08-07 12:08:52 -07:00 | --- |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Now I want to parse the above file and extract values from the last row. I want to extract the values of the "Client ID" and "Trust Domain" columns in the last row, which are:
Client ID: 500333446
Trust Domain: holder-test
I have the below Python script, but it fails because of the new lines at the end of the csv file. If my csv file doesn't have any trailing new lines, it works fine.
import csv

lines_to_skip = 4
with open('data.csv', 'r') as f:
    reader = csv.reader(f, delimiter='|')
    for i in range(lines_to_skip):
        next(reader)
    data = []
    for line in reader:
        if line[0].find("---") != 0:
            print line
            data.append(line)
    print("{}={}".format(data[-1][0].replace(" ", ""), data[-1][4].replace(" ", "")))
I am getting this error at the if line when my csv file has some new lines at the end:
Traceback (most recent call last):
  File "test.py", line 11, in <module>
    if line[0].find("---") != 0:
IndexError: list index out of range
This is the line that prints out at the end:
[' \t\t']
You could try splitting each row on | into a list of dictionaries and only printing the Client ID and Trust Domain from the last row:
with open('data.txt') as f:
    # collect rows of interest
    rows = []
    for line in f:
        if '|' in line:
            items = [item.strip() for item in line.split('|')]
            rows.append(items)

# first item will be the headers
headers = rows[0]

# put each row into a dictionary
data = [dict(zip(headers, row)) for row in rows[1:]]

# print out the last row's information of interest
print('Client ID:', data[-1]['Client ID'])
print('Trust Domain:', data[-1]['Trust Domain'])
Which Outputs:
Client ID: 500333446
Trust Domain: holder-test
As requested in the comments, if you want to print 500333446=holder-test instead, you can change the final print sequence to:
print('%s=%s' % (data[-1]['Client ID'], data[-1]['Trust Domain']))
# 500333446=holder-test
If you have empty lines at the end, the csv.reader will give you empty rows, so you have to write code to deal with that. If you just do line[0] on every line, even the empty ones, you will get exactly the exception you're asking about.
But all you have to do is check whether line is empty before trying to check line[0]:
if line:
    if line[0].find("---") != 0:
… or, more compactly:
if line and line[0].find("---") != 0:
Before processing the line, you should strip off any unwanted characters and verify that it is a line that you want.
What you can do is this:
if line and line[0].strip(" \t") and not line[0].startswith("---"):
Or another way:
if all([line, line[0].strip(" \t"), not line[0].startswith("---")]):
1. if line checks that the line is not an empty list, so that check 2 does not throw an error.
2. line[0].strip(" \t") checks that the first value contains more than just unwanted characters.
3. not line[0].startswith("---") is the same as your line[0].find("---") != 0.

Merge two CSV columns and match up

I have a CSV with three major columns I need to merge.
One of them is the name of the product, called "Martial".
One of them is the group name, called "Serial".
The final one is "Related", which matches the Martial with the Serial.
(example, has more fields and different data)
Martial | Serial | Related
ExOne | GroupOne |
ExTwo | GroupOne |
ExThree | GroupOne |
ExFour | GroupTwo |
ExFive | GroupTwo |
ExSix | GroupThree |
I need to match each Martial to the others with the same Serial, limited to five (and separated by "///").
The example outcome should look like the following:
Martial | Serial | Related
ExOne | GroupOne | ExOne///ExTwo///ExThree
ExTwo | GroupOne | ExOne///ExTwo///ExThree
ExThree | GroupOne | ExOne///ExTwo///ExThree
ExFour | GroupTwo | ExFour///ExFive
ExFive | GroupTwo | ExFour///ExFive
ExSix | GroupThree | ExSix
This is my first attempt at Python, and the code I've tried so far only touches on what I said. I'm building the code bit by bit; the first bit (aim) is to match the Serial groups and list all Martial items under each, for example:
GroupOne
ExOne
ExTwo
ExThree
GroupTwo
ExFour
ExFive
GroupSix
ExSix
Then from there I can handle the cases and combine by factors (if more than 5, etc.).
import csv
import sys

with open('EGLOINDOORCSV.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    Materials = []
    Serials = []
    for row in readCSV:
        Material = row[0]
        Serial = row[4]
        Materials.append(Material)
        Serials.append(Serial)
        if Serial == Serial:
            print(Serial)
            print(Material, end="///")
            print("\n")
            break
print("Done")
First let's recreate a sample file:
data = '''\
Martial|Serial|Related
ExOne|GroupOne|
ExTwo|GroupOne|
ExThree|GroupOne|
ExFour|GroupTwo|
ExFive|GroupTwo|
ExSix|GroupThree|'''
with open('test.csv', 'w') as f:
    f.write(data)
Now the actual code using pandas (pandas comes bundled with the Anaconda distribution; use pip install pandas to install it without Anaconda).
import pandas as pd
df = pd.read_csv('test.csv', sep='|')
df['Related'] = df['Serial'].map(df.groupby('Serial')['Martial']
                                 .apply(lambda x: '///'.join(x)))
df.to_csv('output.csv', index=False)
Returns:
Martial Serial Related
0 ExOne GroupOne ExOne///ExTwo///ExThree
1 ExTwo GroupOne ExOne///ExTwo///ExThree
2 ExThree GroupOne ExOne///ExTwo///ExThree
3 ExFour GroupTwo ExFour///ExFive
4 ExFive GroupTwo ExFour///ExFive
5 ExSix GroupThree ExSix
My approach is to read the CSV twice. In the first pass, I gather related information and in the second, output:
import csv

# Pass 1: gather related materials
with open('EGLOINDOORCSV.csv') as csvfile:
    reader = csv.reader(csvfile)
    related = {}
    for row in reader:
        material = row[0]
        serial = row[1]
        related.setdefault(serial, set()).add(material)

# print(related)  # for debugging

# Pass 2: print
with open('EGLOINDOORCSV.csv') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        material = row[0]
        serial = row[1]
        print('%s | %s | %s' % (material, serial, '///'.join(sorted(related[serial]))))
Output:
ExOne | GroupOne | ExOne///ExThree///ExTwo
ExTwo | GroupOne | ExOne///ExThree///ExTwo
ExThree | GroupOne | ExOne///ExThree///ExTwo
ExFour | GroupTwo | ExFive///ExFour
ExFive | GroupTwo | ExFive///ExFour
ExSix | GroupThree | ExSix
Notes
I assume your CSV file does not have a header. If it does, you will need to skip it:
reader = csv.reader(csvfile)
next(reader)  # Skip the header, then move on
Based on the CSV you supplied, I assigned row[0] to material; please adjust the index numbers to match your file.
About the related dictionary
This dictionary is where I keep the relations; it looks like this:
{
    "GroupTwo": set(["ExFour", "ExFive"]),
    "GroupOne": set(["ExOne", "ExThree", "ExTwo"]),
    "GroupThree": set(["ExSix"])
}
In my code, the statement:
related.setdefault(serial, set()).add(material)
is a shorthand for:
if serial not in related:
    related[serial] = set()
related[serial].add(material)
This is an approach using the built-in itertools, so you don't need to install any extra package. It shows how to write it in a Pythonic way, also using a dictionary comprehension and a list comprehension.
Step by step approach:
# reading all the file at once
import csv
with open('EGLOINDOORCSV.csv') as csvfile:
    l = [r for r in csv.reader(csvfile, delimiter=',')][1:]  # skip header

# itertools requires sorted data; sorting by the second field
key = lambda x: x[1]
l = sorted(l, key=key)

# grouping into an aux dictionary
from itertools import groupby
d = {k: "///".join(x[0] for x in g) for k, g in groupby(l, key)}

# updating the third column from the aux dictionary
for x in l:
    x[2] = d[x[1]]
Et voilà!
# this is the content of l, ready to go back to a new csv
[
    ['ExOne', 'GroupOne', 'ExOne///ExTwo///ExThree'],
    ['ExTwo', 'GroupOne', 'ExOne///ExTwo///ExThree'],
    ['ExThree', 'GroupOne', 'ExOne///ExTwo///ExThree'],
    ['ExSix', 'GroupThree', 'ExSix'],
    ['ExFour', 'GroupTwo', 'ExFour///ExFive'],
    ['ExFive', 'GroupTwo', 'ExFour///ExFive'],
]
Disclaimer: this is a vanilla solution, all in the box. But remember, pandas is your friend for handling data; keep in mind to install it and move to a pandas solution if you need to manage lots of data.
Raw data
$ cat EGLOINDOORCSV.csv
Martial,Serial,Related
ExOne,GroupOne,
ExTwo,GroupOne,
ExThree,GroupOne,
ExFour,GroupTwo,
ExFive,GroupTwo,
ExSix,GroupThree,
