Make edits to the original csv file - python

I have three different columns in my csv file, each with its respective values. Column B (the Name column) has its values in all caps. I am trying to convert them to title case, but when I run the code it returns all the columns squished together and in quotes.
The Original File:
Company Name Job Title
xxxxxx JACK NICHOLSON Manager
yyyyyy BRAD PITT Accountant
I am trying to do:
Company Name Job Title
xxxxxx Jack Nicholson Manager
yyyyyy Brad Pitt Accountant
My code:
import csv
with open('C:\\Users\\Data.csv', 'rb') as f:
    reader = csv.reader(f, delimiter='\t')
    data = list(reader)
for item in data:
    if len(item) > 1:
        item[1] = item[1].title()
with open('C:\\Users\\Data.csv', 'wb') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(data)
My result after I run the code: instead of returning three separate columns with the second one adjusted by title(), it returns all three columns squished together into a single column, in quotes.
"Company","Name","Job Title"
xxxxxx,"JACK NICHOLSON","Manager"
yyyyyy,"BRAD PITT","Accountant"
I do not know what is wrong with my snippet. The result also has some strange markings at the beginning.

A slight change to Mohammed's solution using read_fwf to simplify reading the file.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_fwf.html
import pandas as pd
df = pd.read_fwf('old_csv_file')
df.Name = df.Name.str.title()
df.to_csv('new_csv_file', index=False, sep='\t')
EDIT:
Changed to use a string method instead of a lambda. I prefer to use lambdas only as a last resort.
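For readers without the original file, here is a self-contained sketch of the same approach. The fixed-width sample below and its exact spacing are reconstructed from the question, and explicit colspecs are passed because width inference could split 'JACK NICHOLSON' at its internal space:

```python
import io

import pandas as pd

# Reconstructed sample (the exact column widths are an assumption).
data = (
    "Company    Name              Job Title\n"
    "xxxxxx     JACK NICHOLSON    Manager\n"
    "yyyyyy     BRAD PITT         Accountant\n"
)

# Explicit column boundaries: chars 0-10, 11-28, and 29 to end of line.
df = pd.read_fwf(io.StringIO(data), colspecs=[(0, 11), (11, 29), (29, None)])
df.Name = df.Name.str.title()
print(df.Name.tolist())  # ['Jack Nicholson', 'Brad Pitt']
```

With a real file, replace io.StringIO(data) with the filename.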

You can do something like this with pandas:
import pandas as pd
df = pd.read_csv('old_csv_file', sep=r'\s{3,}', engine='python')
df.Name = df.Name.apply(lambda x: x.title())
df.to_csv('new_csv_file', index=False, sep='\t')
str.title() converts a string to title case, i.e. the first letter of every word is capitalized and the subsequent letters are converted to lower case.
With df.apply you can perform an operation on an entire column or row.
'\s{3,}' is a regular expression: \s matches a whitespace character, and \s{3,} matches a run of 3 or more of them.
When you read a CSV file you have to specify how the columns are separated. Generally columns are separated by a comma or a tab, but in your case there are 5-6 spaces between the columns of a row. So by using \s{3,} I am telling the CSV parser that the columns in a row are delimited by runs of 3 or more spaces.
If I had used just \s, it would have treated the first name and last name as two separate columns, because there is a single space between them. By requiring 3+ spaces, First Name and Last Name stay together in a single column.
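The splitting behaviour is easy to verify on a single reconstructed line (the exact spacing here is an assumption):

```python
import re

# One space stays inside the Name field; runs of 3+ spaces mark column breaks.
line = "xxxxxx     JACK NICHOLSON     Manager"
parts = re.split(r"\s{3,}", line)
print(parts)  # ['xxxxxx', 'JACK NICHOLSON', 'Manager']
```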

Take note that data stores each row as a list containing just one string. Since each row has a length of 1, the statement inside this if block never executes:
if len(item) > 1:
    item[1] = item[1].title()
Aside from that, reading and writing in binary format is unnecessary.
import csv
with open('C:\\Users\\Data.csv', 'r') as f:
    reader = csv.reader(f, delimiter='\t')
    data = list(reader)
for item in data[1:]:  # excludes headers
    item[0] = item[0].title()  # will capitalize the Company column too
    item[0] = item[0][0].lower() + item[0][1:]  # that's why we need to revert
    print(item)
# see that data contains lists having one element only
# the loop above will output
# ['xxxxxx Jack Nicholson Manager']
# ['yyyyyy Brad Pitt Accountant']
with open('C:\\Users\\Data.csv', 'w') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(data)
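The title-then-revert trick is easiest to see on a single row string (sample reconstructed from the question; note the revert only works because the company value is a single lower-case token):

```python
row = "xxxxxx JACK NICHOLSON Manager"
fixed = row.title()                   # capitalizes the company value too
fixed = fixed[0].lower() + fixed[1:]  # revert the first character
print(fixed)  # 'xxxxxx Jack Nicholson Manager'
```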

Related

In python 3 Compare different rows of 2 different csv files and create new csv

Here is the scenario.
I have 2 csv files as follows:
CSV FILE 1 (previousmembers.csv):
john.doe#mydomain.com
suzy.smith#mydomain.com
test.person#mydomain.com
another.person#mydomain.com
cool.guy#mydomain.com
CSV FILE 2 (updatedmembers.csv):
1234,password1,John,Mike,Doe,2022,john.doe#mydomain.com
83762,password2,Suzy,Sally,Smith,2022,suzy.smith#mydomain.com
91209,password3,Test,Kid,Person,2023,test.person#mydomain.com
671653,password4,Cool,Tom,Guy,2027,cool.guy#mydomain.com
82637,password5,New,Billy,Kid,2026,new.kid#mydomain.com
956656,password6,Another,New,Newbie,2027,another.newbie#mydomain.com
Desired output (newfolks.csv):
82637,password5,New,Billy,Kid,2026,new.kid#mydomain.com
956656,password6,Another,New,Newbie,2027,another.newbie#mydomain.com
Here is what I have so far, and it's not even close to working:
with open('previousmembers.csv') as check_file:
    check_set = set([row[0] for row in check_file])
with open('updatedmembers.csv', 'r') as in_file, open('newfolks.csv', 'w') as out_file:
    check_set2 = set([row[6] for row in in_file])
    for line in check_set2:
        if line not in check_set:
            out_file.write(line)
The idea is to create a csv file that has every line from updatedmembers.csv where row[6] of updatedmembers.csv does NOT exist in previousmembers.csv. (previousmembers.csv will only ever have an email listed, which is why I need to compare against row[6] of updatedmembers.csv.)
Any help is greatly appreciated!
The main issue is that you are not splitting the comma-separated values into a list. You would typically use the csv module for this, which handles edge cases nicely and makes some things simpler, but if you are just learning, you can use split(',') to break up the values. After that you can index into the fields. For example:
with open('previousmembers.csv') as check_file:
    # no need to index here; it's just one string per line
    # strip whitespace to be sure there's no junk
    check_set = set(row.strip() for row in check_file)
with open('updatedmembers.csv', 'r') as in_file, open('newfolks.csv', 'w') as out_file:
    for line in in_file:
        # split on commas (or use the csv module)
        fields = line.split(',')
        if fields[6].strip() not in check_set:
            out_file.write(line)
This will write these rows to the new file:
82637,password5,New,Billy,Kid,2026,new.kid#mydomain.com
956656,password6,Another,New,Newbie,2027,another.newbie#mydomain.com
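If you'd rather use the csv module, as suggested above, the same logic might look like this; in-memory StringIO objects stand in for the two files so the sketch runs on its own (swap them for open(...) calls in practice):

```python
import csv
import io

# Stand-ins for previousmembers.csv and updatedmembers.csv.
previous = io.StringIO("john.doe#mydomain.com\ncool.guy#mydomain.com\n")
updated = io.StringIO(
    "1234,password1,John,Mike,Doe,2022,john.doe#mydomain.com\n"
    "82637,password5,New,Billy,Kid,2026,new.kid#mydomain.com\n"
)

# emails already seen
check_set = {row[0].strip() for row in csv.reader(previous) if row}

out = io.StringIO()
writer = csv.writer(out)
for row in csv.reader(updated):
    # keep rows whose 7th field (the email) is new
    if row and row[6].strip() not in check_set:
        writer.writerow(row)

print(out.getvalue().strip())
```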
An alternative option to do it with Pandas.
import pandas as pd

csv_1 = pd.DataFrame(['john.doe#mydomain.com',
                      'suzy.smith#mydomain.com',
                      'test.person#mydomain.com',
                      'another.person#mydomain.com',
                      'cool.guy#mydomain.com'])
# csv_1 = pd.read_csv('previousmembers.csv', header=None)
csv_1.columns = ['email_csv1']

csv_2 = pd.DataFrame([
    [1234, 'password1', 'John', 'Mike', 'Doe', 2022, 'john.doe#mydomain.com'],
    [83762, 'password2', 'Suzy', 'Sally', 'Smith', 2022, 'suzy.smith#mydomain.com'],
    [91209, 'password3', 'Test', 'Kid', 'Person', 2023, 'test.person#mydomain.com'],
    [671653, 'password4', 'Cool', 'Tom', 'Guy', 2027, 'cool.guy#mydomain.com'],
    [82637, 'password5', 'New', 'Billy', 'Kid', 2026, 'new.kid#mydomain.com'],
    [956656, 'password6', 'Another', 'New', 'Newbie', 2027, 'another.newbie#mydomain.com']
])
# csv_2 = pd.read_csv('updatedmembers.csv', header=None)
csv_2.columns = ['id', 'password', 'first_name', 'last_name', 'person_type', 'year', 'email_csv2']

csv_3 = csv_2.merge(csv_1, how="left", left_on='email_csv2', right_on='email_csv1')
newfolks = csv_3.loc[csv_3.email_csv1.isna()][csv_2.columns]
newfolks.to_csv('newfolks.csv', index=False)
Basically, do a left join from csv2 to csv1 on email field. Then select the rows in csv2 which don't have a corresponding email in csv1 and we're done.
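A shorter pandas variant (an alternative to the merge above, not part of the original answer) filters with isin; abbreviated data stands in for the two files:

```python
import pandas as pd

# Abbreviated stand-ins; in practice read these from the two CSV files.
csv_1 = pd.DataFrame({'email_csv1': ['john.doe#mydomain.com',
                                     'cool.guy#mydomain.com']})
csv_2 = pd.DataFrame({'id': [1234, 82637],
                      'email_csv2': ['john.doe#mydomain.com',
                                     'new.kid#mydomain.com']})

# keep the rows of csv_2 whose email does not appear in csv_1
newfolks = csv_2[~csv_2['email_csv2'].isin(csv_1['email_csv1'])]
print(newfolks['id'].tolist())  # [82637]
```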

Find String in between strings in CSV file. Return results and Rows

I have a CSV file I am trying to run through; for every row I want to pull out the string between two marker strings using Python. I am new to Python. I would then like to return the string found in a new column, along with all the other columns and rows. A sample of how the CSV looks is below; I am trying to get everything between /**ZTAG and ZTAG**\
Number Assignment_Group Notes
123456 Team One "2019-06-10 16:24:36 - (Work notes)
05924267 /**ZTAG-DEVICE-HW-APPLICATION-WONT-LAUNCH-STUFF-
SENT-REPAIR-FORCE-REPROCESSED-APPLICATION-ZTAG**\
2019-05-24 16:44:48 - (Work notes)
Attachment:snippet.PNG sys_attachment
sys_id:b08bf432db69ff083bfe3a10ad961961
I have been reading about this for two days. I know I am missing something easy. I have looked at splitting the file in multiple ways.
import csv
import pandas
import re

f = open('test.csv')
csv_f = csv.reader(f)
#match = re.search("/**\ZTAG (.+?) ZTAG**\\", csv_f, flags=re.IGNORECASE)
for row in csv_f:
    #print(re.split('/**ZTAG| ', csv_f))
    #x = csv_f.split('/**ZTAG')
    match = re.search("/**\ZTAG (.+?) ZTAG**\\", csv_f, flags=re.IGNORECASE)
    print(row[0])
f.close()
I would need all columns and rows returned, with a new column containing the extracted string. Example below:
Number, Assignment_group, Notes, TAG
123456, Team One, All stuff, ZTAG-DEVICE-HW-APPLICATION-WONT-
LAUNCH-STUFF-SENT-REPAIR-FORCE-REPROCESSED-APPLICATION-
I believe this regular expression should work:
result = re.search("\/\**ZTAG(.*)ZTAG\**", text)
extracted_text = result.group(1)
this will give you the string
-DEVICE-HW-APPLICATION-WONT-LAUNCH-STUFF- SENT-REPAIR-FORCE-REPROCESSED-APPLICATION-
you can do this for each row in your for loop if necessary
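Wired into a per-row loop, a runnable sketch could look like the following; the notes text is reconstructed from the question, and the raw-string pattern escapes the literal `*` characters, which the asker's pattern did not:

```python
import re

# One reconstructed row: [Number, Assignment_Group, Notes].
row = ["123456", "Team One",
       "2019-06-10 16:24:36 - (Work notes)\n"
       "05924267 /**ZTAG-DEVICE-HW-APPLICATION-WONT-LAUNCH-STUFF-"
       "SENT-REPAIR-FORCE-REPROCESSED-APPLICATION-ZTAG**\\\n"
       "2019-05-24 16:44:48 - (Work notes)"]

# re.DOTALL lets .*? cross line breaks inside the notes field.
pattern = re.compile(r"/\*\*ZTAG(.*?)ZTAG\*\*", re.DOTALL)

match = pattern.search(row[2])
row.append(match.group(1) if match else "")  # the new TAG column
print(row[3])
```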

How to manage a problem reading a csv that is a semicolon-separated file where some strings contain semi-colons?

The problem I have can be illustrated by showing a couple of sample rows in my csv (semicolon-separated) file, which look like this:
4;1;"COFFEE; COMPANY";4
3;2;SALVATION ARMY;4
Notice that in one row, a string is in quotation marks and has a semi-colon inside of it (none of the columns have quotations marks around them in my input file except for the ones containing semicolons).
These rows with the quotation marks and semicolons are causing a problem -- basically, my code is counting the semicolon inside quotation marks within the column/field. So when I read in this row, it reads this semicolon inside the string as a delimiter, thus making it seem like this row has an extra field/column.
The desired output would look like this, with no quotation marks around "coffee company" and no semicolon between 'coffee' and 'company':
4;1;COFFEE COMPANY;4
3;2;SALVATION ARMY;4
Actually, this column with "coffee company" is totally useless to me, so the final file could look like this too:
4;1;xxxxxxxxxxx;4
3;2;xxxxxxxxxxx;4
How can I get rid of just the semi-colons inside of this one particular column, but without getting rid of all of the other semi-colons?
The csv module makes it relatively easy to handle a situation like this:
# Contents of input_file.csv
# 4;1;"COFFEE; COMPANY";4
# 3;2;SALVATION ARMY;4
import csv

input_file = 'input_file.csv'  # Contents as shown in your question.

with open(input_file, 'r', newline='') as inp:
    for row in csv.reader(inp, delimiter=';'):
        row[2] = row[2].replace(';', '')  # Remove embedded ';' chars.
        # If you don't care about what's in the column, use this instead:
        # row[2] = 'xyz'  # Value not needed.
        print(';'.join(row))
Printed output:
4;1;COFFEE COMPANY;4
3;2;SALVATION ARMY;4
Follow-on question: How to write this data to a new csv file?
import csv

input_file = 'input_file.csv'  # Contents as shown in your question.
output_file = 'output_file.csv'

with open(input_file, 'r', newline='') as inp, \
        open(output_file, 'w', newline='') as outp:
    writer = csv.writer(outp, delimiter=';')
    for row in csv.reader(inp, delimiter=';'):
        row[2] = row[2].replace(';', '')  # Remove embedded ';' chars.
        writer.writerow(row)
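As an aside, this quoting behaviour is also why the input file had quotes around COFFEE; COMPANY in the first place: csv.writer quotes any field that contains the delimiter. A quick in-memory check:

```python
import csv
import io

out = io.StringIO()
writer = csv.writer(out, delimiter=';')
writer.writerow(['4', '1', 'COFFEE; COMPANY', '4'])

# The field containing the delimiter is re-quoted automatically.
print(out.getvalue())  # 4;1;"COFFEE; COMPANY";4
```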
Here's an alternative approach using the Pandas library which spares you having to code for loops:
import pandas as pd
#Read csv into dataframe df
df = pd.read_csv('data.csv', sep=';', header=None)
#Remove semicolon in column 2
df[2] = df[2].apply(lambda x: x.replace(';', ''))
This gives the following dataframe df:
0 1 2 3
0 4 1 COFFEE COMPANY 4
1 3 2 SALVATION ARMY 4
Pandas provides several inbuilt functions to help you manipulate data or make statistical conclusions. Having the data in a tabular format can also make working with it more intuitive.
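To write the cleaned dataframe back out, a step the snippet above stops short of, to_csv with the same separator should do; the sample data is inlined here and the output filename is made up:

```python
import io

import pandas as pd

data = '4;1;"COFFEE; COMPANY";4\n3;2;SALVATION ARMY;4\n'
df = pd.read_csv(io.StringIO(data), sep=';', header=None)

# str.replace with regex=False performs a plain substring replacement.
df[2] = df[2].str.replace(';', '', regex=False)
print(df[2].tolist())  # ['COFFEE COMPANY', 'SALVATION ARMY']

# header=False and index=False reproduce the original headerless layout.
df.to_csv('data_clean.csv', sep=';', index=False, header=False)
```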

Extract a column from csv file which has few rows with extra commas as value(address field), which causes the column count to break

I need to access the values of the columns that occur after the address column, but the commas inside the address field cause extra columns to be counted.
Example csv:
id,name,place,address,age,type,dob,date
1,Murtaza,someplace,Street,MA,22,B,somedate,somedate,
2,Murtaza,someplace,somestreet,45,C,somedate,somedate,
3,Murtaza,someplace,somestreet,MA,44,V,somedate,somedate
Excel output:
id name place address age type dob date newcolumn9
1 Murtaza someplace somestreet MA 22 B somedate somedate
2 Murtaza someplace somestreet 45 C somedate somedate
3 Murtaza someplace somestreet MA 44 V somedate somedate
This is what I tried:
# I was able to see that all columns before the column with extra commas displayed fine using this code.
import pandas as pd
import csv

with open('Myfile', 'rb') as f, \
        open('Newfile', 'wb') as g:
    writer = csv.writer(g, delimiter=',')
    for line in f:
        row = line.split(',', 2)
        writer.writerow(row)
I am trying to do this in Python with pandas. If I can parse the csv in reverse, I'll be able to get the proper values regardless of the error. From the above example, I want to extract the age column.
Pandas, or simply re.split():
import re

your_csv_file = open('your_csv_file.csv', 'r').read()
i_column = 2  # index of the desired column, counted from the back

lines = re.split('\n', your_csv_file)[:-1]  # drop the trailing (empty) line
your_column = []
for line in lines:
    your_column.append(re.split(',', line)[-i_column])  # the minus indexes from the end
print(your_column)
executed on a .csv-file like the one below
4rth,askj,fpou,ABC,aekert
kjgf,poiuf,pejhh,,oeiu,DEF,akdhg
iuzrit,fslgk,gth,,rhf,,rhe,GHI,ozug
pwiuto,,,,eflgjkhrlguiazg,JKL,rgj
this returns
['ABC', 'DEF', 'GHI', 'JKL']
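Applied to the question's own data, counting from the end reaches the age column even though the address contributes a variable number of fields; the offset -4 assumes the trailing columns are age, type, dob, date:

```python
# Row copied from the question; the address contributes two fields here.
line = "3,Murtaza,someplace,somestreet,MA,44,V,somedate,somedate"
fields = line.split(',')

# Counting from the end: -1 date, -2 dob, -3 type, -4 age.
print(fields[-4])  # 44
```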
I think the best way to do this might be to write a separate script that removes the faulty commas. But if you want to just ignore the faulty lines, then that can be done by reading in each line into a StringIO and ignore the line with the incorrect number of commas. So if you're expecting 4 columns:
from cStringIO import StringIO
import pandas

s = StringIO()
correct_columns = 4

with open('MyData.csv') as file:
    for line in file:
        if len(line.split(',')) == correct_columns:
            s.write(line)
s.seek(0)
pandas.read_csv(s)
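On Python 3 with a recent pandas (1.3 or later), read_csv can drop such rows directly via on_bad_lines, which replaces the manual StringIO pass with a different technique:

```python
import io

import pandas as pd

data = (
    "id,name,place,age\n"
    "1,Murtaza,someplace,22\n"
    "2,Murtaza,some,place,with,commas,45\n"  # too many fields for the header
    "3,Murtaza,someplace,44\n"
)

# Rows whose field count exceeds the header's are silently skipped.
df = pd.read_csv(io.StringIO(data), on_bad_lines='skip')
print(df['id'].tolist())  # [1, 3]
```

Note that only rows with too many fields are treated as bad lines; rows with too few fields are padded with NaN instead.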

Remove rows by keyword in column, and then remove all columns and save as text in python

This is kind of confusing I suppose, but I have a CSV with 3 columns,
Example:
name, product, type
John, car, new
Jim, truck, used
Jack, minivan, new
Jane, SUV, used
Jeff, car, used
First, I want to go through the CSV and remove all rows except for "new". Once that is done I want to remove all but the first column, and then save that list as a text file.
The code I have so far...
import csv

input_file = 'example.csv'
output_file = 'namesonly.txt'

reader = csv.reader(open(input_file, 'rb'), delimiter=',')
for line in reader:
    if "new" in line:
        print line
With the code I have it prints just what I want:
John, car, new
Jack, minivan, new
Now that I have just the customers who bought "new" vehicles, I want to cut the two columns on the right, leaving just a list of names, and then save that list to a .txt file. This is where I am stuck; I don't know how to proceed from here.
This is no problem. Look at the following.
f = open('namesonly.txt', 'w')
...
for line in reader:
    if "new" in line[2]:
        #line = line.split(',')  #<- not needed; csv.reader already gives you a delimited list
        f.write(line[0] + '\n')  # write the first field before the first comma (the name)
f.close()
This is untested, but something similar should work.
import csv

with open('example.csv') as infile, open('namesonly.txt', 'w') as outfile:
    for name, _prod, condition in csv.reader(infile):
        if condition.strip().lower() != 'new':
            continue
        outfile.write(name)
        outfile.write('\n')
While all the approaches given so far work, they are naive and will perform poorly on large CSV files. They also require you to work with the CSV files "manually" and write for loops.
Whenever you see CSV files you should think of two options: SQLite or Pandas.
SQLite is built into your Python already. It uses SQL, so you need to learn some SQL first.
Pandas uses a more Pythonic API to do the things you want; it is not included in the standard library (but it should not be complicated to install either).
Here is how to do what you wanted with Pandas:
In [1]: import pandas as pd
In [2]: df = pd.read_csv('example.csv')
Get all the names (the first column):
In [3]: df['name']
Out[3]:
0 John
1 Jim
2 Jack
3 Jane
4 Jeff
Name: name, dtype: object
Find all new products:
In [18]: df[df[' type'] == ' new']
Out[18]:
name product type
0 John car new
2 Jack minivan new
You can assign the result, and then save it to a csv file.
In [19]: res = df[df[' type'] == ' new']
In [20]: res.to_csv('new_products.csv')
Also note that Pandas can handle CSV files very efficiently since it is written in C.
About loading CSV with pandas
The CSV reader has tons of options. Check them out! I loaded the file naively and hence the space in the column name. If you think it's ugly, I would agree. You can pass the following keyword to fix the situation:
df = pd.read_csv('example.csv', skipinitialspace=True)
To show how simple pandas is
If you really want the names of those who have new products, as in the answer of Padraic Cunningham, you can simply concatenate methods:
In [46]: df[df['type'] == 'new'].name
Out[46]:
0 John
2 Jack
Name: name, dtype: object
In [47]: df[df['type'] == 'new'].name.to_csv('out.csv')
Just unpack using a generator expression and keep the name when the type row entry is equal to new:
import csv
with open("in.csv") as f, open("out.csv", "w") as out:
    wr = csv.writer(out)
    wr.writerows((name,) for name, _, tpe in csv.reader(f) if tpe == "new")
in.csv:
name,product,type
John,car,new
Jim,truck,used
Jack,minivan,new
Jane,SUV,used
Jeff,car,used
out.csv:
John
Jack
