Stemming csv files in Python

I have a Python program that imports two CSV files. The first is named "claims" (one column, many rows) and the second is named "sexualHarassment" (one column, many rows). The program checks every row of "claims" for any word from "sexualHarassment" and, if it finds one, writes that row to a new CSV file named "output". It also removes certain stopwords that I chose. This part of the program works.
Now I need to stem all of the words to strip tenses from them, for example "discriminated" to "discriminat" and "harassed" to "harass".
I've downloaded and installed a couple of stemming packages, but I can only stem individual words, such as:
from nltk import PorterStemmer
PorterStemmer().stem_word('discriminated')
>>>discriminate
Is there any way to run this stemming step on all words in each row of the "sexual harassment" file before the row is written to the new CSV file?
Here is a copy of my code:
import csv
with open("claims.csv") as file1, open("masterlist.csv") as file2, \
        open("stopwords.csv") as file3, open("output.csv", "wb+") as file4:
    writer = csv.writer(file4)
    key_words = [word.strip() for word in file2.readlines()]
    stop_words = [' also ', ' although ', ' always ', ' and ', ' any ', ' are ', ' as ', ' at ',
                  ' around ', ' be ', ' by ', ' for ', ' from ', ' has ', ' on ', ' that ', ' were ', ' will ',
                  ' with ', ' can ', ' cannot ', ' if ', ' it ', ' the ', ' there ', ' which ', ' in ', ' is ',
                  ' its ', ' me ', ' of ', ' was ', ' then ', ' with ', ' a ', ' an ', ' to ', ' to ', ' when ',
                  ' however ', '"', ',', '.', '-', '?', '!', '(', ')']
    for row in file1:
        row = row.strip()
        row = row.lower()
        for stopword in stop_words:
            if stopword in row:
                row = row.replace(stopword," ")
        for key in key_words:
            if key in row:
                writer.writerow([key, row])
                break
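One possible way to stem every word of every row before matching is sketched below. It is only an illustration under some assumptions: crude_stem is a hypothetical stand-in that strips a few suffixes so the example runs without extra packages, and the sample rows and key words are made up. With NLTK installed, PorterStemmer().stem could be passed as the stem argument instead. Stop-word removal is left out to keep the sketch short.

```python
import csv
import io

def crude_stem(word):
    """Hypothetical stand-in stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stem_text(text, stem=crude_stem):
    """Lower-case the text and stem every whitespace-separated word."""
    return " ".join(stem(word) for word in text.lower().split())

def matching_rows(rows, key_words, stem=crude_stem):
    """Yield (key, row) for the first stemmed key found in each stemmed row."""
    stemmed_keys = [stem_text(key, stem) for key in key_words]
    for row in rows:
        stemmed_row = stem_text(row, stem)
        for key in stemmed_keys:
            if key in stemmed_row:
                yield key, stemmed_row
                break

# Made-up sample data standing in for the two CSV files.
claims = ["She was harassed at work", "The invoice was paid"]
keys = ["harassing", "discriminated"]

out = io.StringIO()  # stands in for output.csv
csv.writer(out).writerows(matching_rows(claims, keys))
print(out.getvalue())
```

Stemming both the key words and the rows means "harassing" in the key list can match "harassed" in a claim, because both reduce to the same stem.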

Related

How to make a function that lays mines to random squares on nested list?

The field is created like this:
field = []
for row in range(10):
    field.append([])
    for col in range(15):
        field[-1].append(" ")
Tuples represent free squares where mines can be laid:
free = []
for x in range(15):
    for y in range(10):
        free.append((x, y))
I have to lay the mines through this function:
def lay_mines(field, free, number_of_mines):
    for _ in number_of_mines:
        mines = random.sample(free, number_of_mines)
        field(mines) = ["x"]
I was thinking of using random.sample() or random.choice(), but I just can't get it to work. How can I place the string "x" at a random coordinate?

import random
def lay_mines(x, y, number_of_mines=0):
    f = [list(' ' * x) for _ in range(y)]
    for m in random.sample(range(x * y), k=number_of_mines): # random sampling without replacement
        f[m % y][m // y] = 'X'
    return f
field = lay_mines(15, 10, 20)
print(*field, sep='\n')
Prints:
['X', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
[' ', ' ', ' ', ' ', 'X', ' ', ' ', ' ', 'X', ' ', ' ', ' ', ' ', ' ', ' ']
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', 'X', ' ', ' ', ' ', ' ', ' ', ' ']
[' ', ' ', 'X', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
[' ', ' ', 'X', ' ', ' ', ' ', ' ', 'X', 'X', ' ', ' ', ' ', ' ', ' ', ' ']
[' ', ' ', ' ', ' ', ' ', ' ', 'X', ' ', ' ', ' ', ' ', 'X', ' ', ' ', ' ']
[' ', 'X', ' ', ' ', 'X', ' ', ' ', ' ', ' ', 'X', ' ', ' ', ' ', ' ', ' ']
[' ', ' ', 'X', 'X', 'X', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
[' ', ' ', ' ', ' ', ' ', 'X', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', 'X']
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', 'X', ' ', 'X', ' ', ' ', ' ', ' ']
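If the signature from the question has to stay as lay_mines(field, free, number_of_mines), a sketch along these lines might work. It assumes each tuple in free is (x, y) with x as the column and y as the row, which matches how the question builds the two lists, and it assumes a mined square should be removed from free.

```python
import random

def lay_mines(field, free, number_of_mines):
    """Place "x" on number_of_mines distinct squares chosen from free."""
    # random.sample picks without replacement, so no square is chosen twice.
    for x, y in random.sample(free, number_of_mines):
        field[y][x] = "x"      # field is indexed row first, then column
        free.remove((x, y))    # assumption: a mined square is no longer free

field = [[" "] * 15 for _ in range(10)]
free = [(x, y) for x in range(15) for y in range(10)]
lay_mines(field, free, 20)
print(*field, sep="\n")
```

random.sample builds its selection before the loop starts, so removing tuples from free inside the loop does not disturb the iteration.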

How to clean a list

I have a list like this
countries = ["['Luxemburgo ", 'Suiza ', 'Noruega ', 'Irlanda ', 'Islandia ', 'Catar ', 'Singapur ', 'Estados Unidos ', 'Dinamarca ', 'Australia ', 'Suecia ', 'Países Bajos ', 'San Marino ', 'Austria ', 'Finlandia ', 'Alemania ', 'Hong Kong ', 'Bélgica ', 'Canadá ', 'Emiratos Árabes Unidos ', 'Reino Unido ', 'Israel ', 'Nueva Zelanda ', 'Francia ', "Japón ']"]
and I don't know how to convert it to a real list. If I print the first element:
>>> print(countries[0])
['Luxemburgo
How can I get rid of the [ and the quote so that the first element prints as just the word, like the rest of the words in the list?

The better question is: Where did the list come from and can we fix the problem at the source?
But you can fix it now by doing:
import ast
countries_fixed = ast.literal_eval("', '".join(countries))
The list elements still contain trailing spaces after this. To fix that too, strip each element after parsing:
countries_fixed = [
    country.strip()
    for country in ast.literal_eval("', '".join(countries))
]
Result:
>>> print(countries_fixed[0])
Luxemburgo
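To see why the join works, here is a small sketch on a shortened copy of the list (three entries taken from the question; the name joined is only for illustration). Joining with "', '" rebuilds the text of a list literal, which ast.literal_eval can then parse safely.

```python
import ast

# Shortened copy of the broken list from the question.
countries = ["['Luxemburgo ", 'Suiza ', "Japón ']"]

# Joining restores the separators that were lost, giving the text of a list literal.
joined = "', '".join(countries)
print(joined)

# literal_eval parses the literal; strip() removes the trailing spaces.
countries_fixed = [country.strip() for country in ast.literal_eval(joined)]
print(countries_fixed)
```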

You can simply do this:
countries = ["['Luxemburgo ", 'Suiza ', 'Noruega ', 'Irlanda ', 'Islandia ', 'Catar ', 'Singapur ', 'Estados Unidos ', 'Dinamarca ', 'Australia ', 'Suecia ', 'Países Bajos ', 'San Marino ', 'Austria ', 'Finlandia ', 'Alemania ', 'Hong Kong ', 'Bélgica ', 'Canadá ', 'Emiratos Árabes Unidos ', 'Reino Unido ', 'Israel ', 'Nueva Zelanda ', 'Francia ', "Japón ']"]
lis=[]
for x in countries:
    lis.append(x.replace("['","").replace("']",""))
print(lis[0])
Output:
Luxemburgo

For loop not working as intended

I am in the middle of my coursework and am having trouble with one of my for loops.
def update():
    update=[]
    update1=[]
    with open('Stock2.txt','r') as stockFile:
        for eachLine in stockFile:
            eachLine=eachLine.strip().split()
            update.append(eachLine)
        update.remove(update[0])
        stockFile.close()
    with open('Stock2.txt','r') as stockFile:
        for eachLine in stockFile:
            eachLine=eachLine.strip().split(' ')
            update1.append(eachLine)
        update1.remove(update1[0])
    for eachList in update1:
        loopCon=-1
        for eachItem in eachList:
            loopCon+=1
            if eachItem=='':
                eachList[loopCon]=' '
    count=-1
    for eachList in update1:
        for eachItem in eachList:
            count+=1
            if eachItem != ' ':
                print(count)
The last for loop is looping fine, but when I add one to count on every pass of the inner loop 'for eachItem in eachList:', it prints seemingly random numbers:
0 10 14 21 28 35 36 46 62 69 76 83 84 94 111
Here is the stock file I am using - Stock2.txt
GTIN-8 Product-Name Price(£) CSL ROL TSL
95820194 Windows-10-64bit 119.99 0 1 3
68196167 Cheese 1.00 0 3 8
62017014 Bread 0.93 0 3 9
86179616 10tb-memory-stick 916.96 0 0 4
19610577 Freddo 0.15 0 2 9
And so on.
Have I done anything wrong here? I probably would not be able to spot it easily, as I have only been using Python for about a year.
Thank you for your time.

You increment count outside the if that prints. Try this instead:
for eachList in update1:
    for eachItem in eachList:
        if eachItem != ' ':
            count+=1
            print(count)

If I put a print update1 statement before your last for loop, i.e., before the statement for eachList in update1:, I get the following output:
[['95820194', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', 'Windows-10-64bit', ' ', ' ', ' ', '119.99', ' ', ' ', ' ', ' ', ' ', ' ', '0', ' ', ' ', ' ', ' ', ' ', ' ', '1', ' ', ' ', ' ', ' ', ' ', ' ', '3'], ['68196167', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', 'Cheese', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '1.00', ' ', ' ', ' ', ' ', ' ', ' ', '0', ' ', ' ', ' ', ' ', ' ', ' ', '3', ' ', ' ', ' ', ' ', ' ', ' ', '8'], ['62017014', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', 'Bread', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '0.93', ' ', ' ', ' ', ' ', ' ', ' ', '0', ' ', ' ', ' ', ' ', ' ', ' ', '3', ' ', ' ', ' ', ' ', ' ', ' ', '9'], ['86179616', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '10tb-memory-stick', ' ', ' ', '916.96', ' ', ' ', ' ', ' ', ' ', ' ', '0', ' ', ' ', ' ', ' ', ' ', ' ', '0', ' ', ' ', ' ', ' ', ' ', ' ', '4'], ['19610577', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', 'Freddo', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '0.15', ' ', ' ', ' ', ' ', ' ', ' ', '0', ' ', ' ', ' ', ' ', ' ', ' ', '2', ' ', ' ', ' ', ' ', ' ', ' ', '9']]
So the output isn't random at all. You are traversing each list inside update1 and incrementing count for every element, including the ' ' placeholders.
However, you print count only when eachItem != ' '. That is why it prints 0 when eachItem == '95820194', then 10 when eachItem == 'Windows-10-64bit', and so on: count was still incremented for each ' ' in between, just not printed.
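If the goal was the position of each non-blank item within its own row, which is a guess since the question does not say, enumerate avoids the manual counter altogether. The rows below are shortened, made-up versions of the update1 lists shown above.

```python
# Shortened stand-ins for two of the lists in update1.
rows = [
    ['95820194', ' ', ' ', 'Windows-10-64bit', ' ', '119.99'],
    ['68196167', ' ', 'Cheese', ' ', ' ', '1.00'],
]

# enumerate restarts at 0 for every row, so positions are relative to the row.
positions = [
    [index for index, item in enumerate(row) if item != ' ']
    for row in rows
]
print(positions)
```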

Python: Reading csv file into lists and an array

I am new to Python and this is my first post here, so I hope you will bear with me. I am having a lot of trouble reading a CSV file into the format I want. My file consists of 132 columns, and the head of the file looks like this:
['10520', ' 386681375.82149398', ' 85.25775430', ' -56.07840500', ' 173', ' 153', ' 151', ' 161', ' 180', ' 167', ' 189', ' 171', ' 173', ' 171', ' 207', ' 169', ' 173', ' 168', ' 184', ' 168', ' 201', ' 197', ' 204', ' 201', ' 210', ' 239', ' 211', ' 227', ' 247', ' 248', ' 266', ' 276', ' 322', ' 336', ' 331', ' 381', ' 358', ' 483', ' 532', ' 709', ' 841', ' 1004', ' 1128', ' 1540', ' 1945', ' 2747', ' 3718', ' 5378', ' 6273', ' 8415', ' 12727', ' 18248', ' 24103', ' 33688', ' 40744', ' 52821', ' 65535', ' 59114', ' 55225', ' 49919', ' 51894', ' 58381', ' 50376', ' 48315', ' 42337', ' 30577', ' 24078', ' 24337', ' 22432', ' 20191', ' 19999', ' 17674', ' 22519', ' 22542', ' 22644', ' 23966', ' 21033', ' 21326', ' 20257', ' 20441', ' 21859', ' 26976', ' 32514', ' 34732', ' 45555', ' 48416', ' 34952', ' 28511', ' 24611', ' 18843', ' 17081', ' 14592', ' 13550', ' 13011', ' 15370', ' 15827', ' 15232', ' 16054', ' 14823', ' 14538', ' 12544', ' 11865', ' 11442', ' 10089', ' 10340', ' 11269', ' 11336', ' 11873', ' 10012', ' 9824', ' 9488', ' 7696', ' 9273', ' 9502', ' 8752', ' 8341', ' 8192', ' 8293', ' 8067', ' 8402', ' 9258', ' 9290', ' 8144', ' 8009', ' 7660', ' 6772', ' 6008', ' 6792', ' 6993', ' 6662', ' 7047', ' 6662 ']
['10520', ' 386681375.86699998', ' 85.25527360', ' -56.09263480', ' 113', ' 102', ' 120', ' 124', ' 117', ' 127', ' 124', ' 118', ' 128', ' 120', ' 125', ' 120', ' 140', ' 135', ' 144', ' 127', ' 143', ' 148', ' 141', ' 153', ' 142', ' 142', ' 149', ' 152', ' 168', ' 180', ' 196', ' 188', ' 196', ' 246', ' 259', ' 270', ' 337', ' 360', ' 506', ' 540', ' 625', ' 887', ' 1122', ' 1251', ' 2007', ' 2883', ' 3238', ' 4370', ' 6240', ' 9164', ' 10751', ' 16656', ' 20996', ' 27753', ' 37774', ' 35377', ' 38637', ' 39265', ' 35183', ' 38830', ' 32149', ' 25455', ' 27272', ' 24488', ' 21036', ' 20931', ' 17166', ' 17019', ' 18196', ' 15450', ' 15120', ' 15934', ' 15021', ' 14936', ' 16253', ' 16457', ' 15873', ' 19667', ' 23150', ' 26140', ' 35761', ' 42594', ' 61758', ' 65535', ' 42354', ' 28672', ' 25173', ' 20344', ' 15883', ' 14432', ' 10575', ' 11342', ' 12348', ' 13229', ' 19632', ' 23456', ' 18102', ' 15600', ' 13425', ' 9962', ' 8281', ' 7609', ' 6948', ' 7391', ' 8878', ' 10006', ' 11295', ' 10073', ' 9410', ' 10354', ' 10667', ' 10054', ' 9011', ' 8793', ' 9055', ' 7463', ' 6692', ' 8051', ' 8330', ' 7369', ' 6612', ' 6328', ' 6545', ' 6235', ' 5895', ' 5085', ' 4876', ' 5154', ' 4649', ' 5226', ' 6137', ' 5354 ']
and I am interested in getting:
four lists/vectors/1D arrays (or whatever) from the first four columns.
The next 128 columns I would like to get into an array.
I would like to get the output without ([] , ' ") and other non-numeric characters.
So far the code looks like this:
import sys, math, numpy
from numpy import *
from scipy import *
import csv
try:
    ifile = sys.argv[1]
    #; ofile = sys.argv[2]
except:
    print "Usage:", sys.argv[0], "ifile"; sys.exit(1)
# Open and read file from std, and assign first four (orbit, time, lat, lon) columns to four lists, and last 128 columns (waveforms) to an array.
ifile = open(ifile)
orbit = []
time = []
lat = []
lon = []
#wvf= [[],[]]
try:
    reader = csv.reader(ifile, delimiter=',')
    for row in reader:
        orbit.append(row[0])
        time.append(row[1])
        lat.append(row[2])
        lon.append(row[3])
        # wvf = [row[4:132] for row in reader] row[0:128] for col in len(reader)]
        wvf = [row[4:132]],[row[1:128]]
finally:
    ifile.close()
...and now do something with data...
I have thought about first splitting all lines and then gathering the last 128 columns into the array, but I haven't managed to do it.
I hope you can see what I am trying to accomplish and are able to help me out.
Thanks

You can load the file into a numpy array using np.genfromtxt. An advantage of doing it this way is that the data goes directly from the file to a space-efficient numpy array. If you use the csv module, and store the data in Python lists, then your data will consume a lot more memory.
import sys
import numpy as np
try:
    ifile = sys.argv[1]
    #; ofile = sys.argv[2]
except IndexError:
    print "Usage:", sys.argv[0], "ifile"; sys.exit(1)
# Open and read file from std, and assign first four (orbit, time, lat, lon)
# columns to four lists, and last 128 columns (waveforms) to an array.
def remove_bracket(line):
    return float(line.strip("][ '"))
data = np.genfromtxt(ifile, delimiter = ',',
                     dtype = 'float',
                     converters = {i:remove_bracket for i in range(132)}
                     )
orbit = data[:,0]
time = data[:,1]
lat = data[:,2]
lon = data[:,3]
# Columns 4 through 131 hold the 128 waveform values.
wvf = data[:,4:132]
print(wvf)
Note that the variables orbit, time, etc. are "views" of data -- they are not copies of data, and so do not require (much) additional memory. This also means that modifying orbit will also affect data, and vice versa.

Simply:
wvf = []
try:
    reader = csv.reader(ifile, delimiter=',')
    for row in reader:
        # ...
        wvf.append(row[4:132])
finally:
    ifile.close()
Initialize wvf as an empty list like the others, then append one sub-list (slice) per row of data.
(Just in case your data is really big and you want to optimise your memory usage: there's the array module for efficient storage.)
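Putting the pieces together without NumPy, a minimal sketch could look like this. It assumes the file itself is plain comma-separated text with a space after each comma, and that the brackets and quotes in the question come from printing the rows rather than from the file. The sample text is made up and shortened to three waveform columns instead of 128.

```python
import csv
import io

# Made-up, shortened sample: 4 leading columns plus 3 waveform columns.
sample = io.StringIO(
    "10520, 386681375.82, 85.25, -56.07, 173, 153, 151\n"
    "10520, 386681375.86, 85.25, -56.09, 113, 102, 120\n"
)

orbit, time, lat, lon, wvf = [], [], [], [], []
for row in csv.reader(sample, delimiter=","):
    # float() ignores the surrounding spaces, so no manual stripping is needed.
    values = [float(cell) for cell in row]
    orbit.append(values[0])
    time.append(values[1])
    lat.append(values[2])
    lon.append(values[3])
    wvf.append(values[4:])  # everything after the first four columns

print(orbit)
print(wvf)
```

Converting each cell with float() is what removes the quotes and spaces from the printed output: the values become numbers instead of strings.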

Partitioning a string in Python by a regular expression

I need to split a string into an array on word boundaries (whitespace) while maintaining the whitespace.
For example:
'this is a\nsentence'
Would become
['this', ' ', 'is', ' ', 'a', '\n', 'sentence']
I know about str.partition and re.split, but neither of them quite do what I want and there is no re.partition.
How should I partition strings on whitespace in Python with reasonable efficiency?

Try this:
import re
s = "this is a\nsentence"
re.split(r'(\W+)', s) # Notice parentheses and a plus sign.
Result would be:
['this', ' ', 'is', ' ', 'a', '\n', 'sentence']

The whitespace class in re is '\s', not '\W'. '\W' matches any non-word character, so it also splits on punctuation.
Compare:
import re
s = "With a sign # written # the beginning , that's a\nsentence,"\
'\nno more an instruction!,\tyou know ?? "Cases" & and surprises:'\
"that will 'lways unknown **before**, in 81% of time$"
a = re.split('(\W+)', s)
print a
print len(a)
print
b = re.split('(\s+)', s)
print b
print len(b)
produces
['With', ' ', 'a', ' ', 'sign', ' # ', 'written', ' # ', 'the', ' ', 'beginning', ' , ', 'that', "'", 's', ' ', 'a', '\n', 'sentence', ',\n', 'no', ' ', 'more', ' ', 'an', ' ', 'instruction', '!,\t', 'you', ' ', 'know', ' ?? "', 'Cases', '" & ', 'and', ' ', 'surprises', ':', 'that', ' ', 'will', " '", 'lways', ' ', 'unknown', ' **', 'before', '**, ', 'in', ' ', '81', '% ', 'of', ' ', 'time', '$', '']
57
['With', ' ', 'a', ' ', 'sign', ' ', '#', ' ', 'written', ' ', '#', ' ', 'the', ' ', 'beginning', ' ', ',', ' ', "that's", ' ', 'a', '\n', 'sentence,', '\n', 'no', ' ', 'more', ' ', 'an', ' ', 'instruction!,', '\t', 'you', ' ', 'know', ' ', '??', ' ', '"Cases"', ' ', '&', ' ', 'and', ' ', 'surprises:that', ' ', 'will', ' ', "'lways", ' ', 'unknown', ' ', '**before**,', ' ', 'in', ' ', '81%', ' ', 'of', ' ', 'time$']
61
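A small helper built on the '\s' variant might look like the sketch below. The name partition_whitespace is made up for illustration. The capturing group is what keeps the whitespace runs in the result, and the filter drops the empty strings that re.split returns when the text starts or ends with whitespace.

```python
import re

def partition_whitespace(text):
    """Split text on runs of whitespace, keeping the runs as separate items."""
    # The capturing group makes re.split return the separators as well.
    return [piece for piece in re.split(r'(\s+)', text) if piece]

print(partition_whitespace('this is a\nsentence'))
```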

Try this:
re.split(r'(\W+)', 'this is a\nsentence')
