I am trying to clean user reviews that were crawled from the web. When I read them with pandas there is no warning or error, and then I print the length of the dataframe.
Then I would like to apply a normalization step. But I am focusing on the Turkish language, so I cannot use a Python library; I will use third-party software.
For this purpose, I am trying to write the reviews column to a text file. When I write these data to the text file, the length of the sample is
and target size:
Basically I do this:
Note: As I mentioned, these are customer reviews, so as expected they are dirty and noisy. Some of the samples contain many newline characters; approximately 56 of the samples contain "\n\n\n\n". I have tried to solve this problem in Python by cleaning the data, but every time I lose samples. I also tried to fix it in Excel, but that did not work.
Question: Do you have any suggestion for fixing data?
It seems that you are producing two CSV files from your df and then reading them back as reviews and targets.
If you use pd.read_csv to read them back, note that pd.read_csv has the argument skip_blank_lines=True by default, which skips blank lines. If some rows of your original df contain only a number of '\n' characters, they will end up as empty lines in your new CSVs, which will be skipped the next time the files are read.
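For example, a quick check (the file name reviews.csv below is purely hypothetical; use whichever file you actually wrote out) is to read it back with and without blank-line skipping and compare the lengths:
import pandas as pd

# 'reviews.csv' is a placeholder -- substitute the CSV you wrote from your df
reviews_skipped = pd.read_csv('reviews.csv')                          # skip_blank_lines=True is the default
reviews_kept    = pd.read_csv('reviews.csv', skip_blank_lines=False)  # keep the blank lines as NaN rows

print(len(reviews_skipped), len(reviews_kept))  # the difference is the number of blank lines dropped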
You can verify this by setting up two counter variables for the total number of empty lines and seeing whether that matches the 'loss'.
num_empty_review = 0
num_empty_target = 0
for _, row in df.iterrows():
    # 'review' and 'target' are placeholder column names -- use your actual ones
    review = str(row['review']).replace('\n', '')
    target = str(row['target']).replace('\n', '')
    if review.replace(' ', '') == '':
        num_empty_review += 1
    if target.replace(' ', '') == '':
        num_empty_target += 1
    # ... write the cleaned review/target to your text files here ...
print(num_empty_review, num_empty_target)
Lastly, next time please paste your code here as text, like I did above :)
I am working with a .txt file. It has 100 rows and 5 columns. I need to divide it into five vectors of length 100, one for each column. I am trying to follow this: Reading specific columns from a text file in python.
However, when I implement it as:
token = open('token_data.txt', 'r')
linestoken = token.readlines()
resulttoken = []
for x in linestoken:
    resulttoken.append(x.split(' ')[1])
token.close()
I don't know how this is stored. If I write print('resulttoken'), nothing appears on my screen.
Can someone please tell me what I am doing wrong?
Thanks.
part of my text file
x.split(' ') is not useful here, because the columns of your text file are separated by more than one space. Use x.split() to split on any run of whitespace:
token = open('token_data.txt', 'r')
linestoken = token.readlines()
tokens_column_number = 1
resulttoken = []
for x in linestoken:
    resulttoken.append(x.split()[tokens_column_number])
token.close()
print(resulttoken)
Well, the file looks like it is separated by tabs rather than spaces, so try this:
token = open('token_data.txt', 'r')
linestoken = token.readlines()
tokens_column_number = 1
resulttoken = []
for x in linestoken:
    resulttoken.append(x.split('\t')[tokens_column_number])
token.close()
print(resulttoken)
You want a list of five distinct lists, and append to each in turn.
# five distinct lists (note: [[]] * 5 would create five references to the same list)
columns = [[] for _ in range(5)]
with open('token_data.txt', 'r') as token:
    for line in token:
        for field, value in enumerate(line.split()):
            columns[field].append(value)
Now, you will find the first value from the first line in columns[0][0], the second value from the first line in columns[1][0], the first value from the second line in columns[0][1], etc.
To print the value of a variable, don't put quotes around it. Quotes create a literal string.
print(columns[0][0])
prints the value of columns[0][0] whereas
print('columns[0][0]')
simply prints the literal text "columns[0][0]".
You can use the data_py package to read column-wise data in FORTRAN style.
Install this package using
pip install data-py
Usage Example
from data_py import datafile
NoOfLines=0
lineNumber=2 # Line number to read (Excluding lines starting with '#')
df1=datafile("C:/Folder/SubFolder/data-file-name.txt")
df1.separator="," # No need to specify if separator is space(" ") and for 'tab' separated values use '\t'
NoOfLines=df1.lines # Total number of lines in the data file (Excluding lines starting with '#')
[Col1,Col2,Col3,Col4,Col5]=["","","","",""] # Initial values
[Col1,Col2,Col3,Col4,Col5]=df1.read([Col1,Col2,Col3,Col4,Col5],lineNumber)
print(Col1,Col2,Col3,Col4,Col5) # In str format
For details please follow the link https://www.respt.in/p/python-package-datapy.html
I have a file in a special format, .cns, which is a segmented file used to analyze copy number. It is a text file that looks like this (first line plus header):
head -1 copynumber.cns
chromosome,start,end,gene,log2
chr1,13402,861395,"LOC102725121,DDX11L1,OR4F5,LOC100133331,LOC100132062,LOC100132287,LOC100133331,LINC00115,SAMD11",-0.28067
We transformed it to a .csv so we could separate it by tab (but it didn't work well). The .cns is separated by commas but genes are a single string delimited by quotes. I hope this is useful. The output I need is something like this:
gene log2
LOC102725121 -0.28067
DDX11L1 -0.28067
OR4F5 -0.28067
PIK3CA 0.35475
NRAS 3.35475
The first step would be to separate everything by commas, then transpose columns, and finally print the log2 value for each gene contained in the quoted string. If you could help me with an R or Python script it would help a lot. Perhaps awk would work too.
I am using Linux Ubuntu 16.04.
I'm not sure if I am being clear, let me know if this is useful.
Thank you!
I hope the following Python code helps:
import csv

list1 = []
with open('copynumber.cns', 'r') as file:
    exampleReader = csv.reader(file)
    for row in exampleReader:
        list1.append(row)

for row in list1:
    strings = row[3].split(',')  # Get the fourth column in the CSV, i.e. the gene column, and split on commas
    for string in strings:       # Loop through each gene name
        print(string + ' ' + str(row[4]))
This is a short script I've written to refine and validate a large dataset that I have.
# The purpose of this script is the refinement of the job data attained from the
# JSI as it is rendered by the `csv generator` contributed by Luis for purposes
# of presentation on the dashboard map.
import csv
# The number of columns
num_headers = 9
# Remove invalid characters from records
def url_escaper(data):
    for line in data:
        yield line.replace('&', '&amp;')
# Be sure to configure input & output files
with open("adzuna_input_THRESHOLD.csv", 'r') as file_in, open("adzuna_output_GO.csv", 'w') as file_out:
    csv_in = csv.reader(url_escaper(file_in))
    csv_out = csv.writer(file_out)
    # Get rid of rows that have the wrong number of columns
    # and rows that have only whitespace for a columnar value
    for i, row in enumerate(csv_in, start=1):
        if not [e for e in row if not e.strip()]:
            if len(row) == num_headers:
                csv_out.writerow(row)
            else:
                print "line %d is malformed" % i
I have one field that is structured like so:
finance|statistics|lisp
I've seen ways to do this using other utilities like R, but I want to ideally achieve the same effect within the scope of this python code.
Maybe I can iterate over all the characters of all the columnar values, perhaps as a list, and if I see a | I can dispose of the | and all the text that follows it within the scope of the column value.
I think surely it can be achieved with slices as they do here, but I don't quite understand how the indices with slices work, and I can't see how I could include this process harmoniously within the cascade of the current script pipeline.
With regex I guess it's something like this
(?:|)(.*)
Why not use string's split method?
In[4]: 'finance|statistics|lisp'.split('|')[0]
Out[4]: 'finance'
It also does not fail with an exception when the separator character is not present in the string:
In[5]: 'finance/statistics/lisp'.split('|')[0]
Out[5]: 'finance/statistics/lisp'
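If you want to fold this into the script above rather than use a separate tool, here is a minimal sketch (the column index 2 is purely hypothetical; adjust it to wherever the pipe-delimited field sits in your rows):
import csv

PIPE_COLUMN = 2  # hypothetical index of the 'finance|statistics|lisp' field

with open("adzuna_input_THRESHOLD.csv", 'r') as file_in, open("adzuna_output_GO.csv", 'w') as file_out:
    csv_in = csv.reader(file_in)
    csv_out = csv.writer(file_out)
    for row in csv_in:
        if len(row) > PIPE_COLUMN:
            # keep only the text before the first '|'
            row[PIPE_COLUMN] = row[PIPE_COLUMN].split('|')[0]
        csv_out.writerow(row)
Because split('|')[0] returns the whole value when there is no '|', columns without the separator pass through unchanged.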
I've tried many approaches based on great stack overflow ideas per:
How to write header row with csv.DictWriter?
Writing a Python list of lists to a csv file
csv.DictWriter -- TypeError: __init__() takes at least 3 arguments (4 given)
Python: tuple indices must be integers, not str when selecting from mysql table
https://docs.python.org/2/library/csv.html
python csv write only certain fieldnames, not all
Python 2.6 Text Processing and
Why is DictWriter not Writing all rows in my Dictreader instance?
I tried mapping reader and writer fieldnames and special header parameters.
I built a second layer test from some great multi-column SO articles:
code follows
import csv
import re

t = re.compile(r'<\*(.*?)\*>')
headers = ['a', 'b', 'd', 'g']
with open('in2.csv', 'rb') as csvfile:
    with open('out2.csv', 'wb') as output_file:
        reader = csv.DictReader(csvfile)
        writer = csv.DictWriter(output_file, headers, extrasaction='ignore')
        writer.writeheader()
        print(headers)
        for row in reader:
            row['d'] = re.findall(t, row['d'])
            print(row['a'], row['b'], row['d'], row['g'])
            writer.writerow(row)
input data is:
a, b, c, d, e, f, g, h
<* number 1 *>, <* number 2 *>, <* number 3 *>, <* number 4 *>, ...<* number 8 *>
<* number 2 *>, <* number 3 *>, <* number 4 *>, ...<* number 8 *>, <* number 9 *>
output data is:
['a', 'b', 'd', 'g' ]
('<* number 1 *>', '<* number 2 *>', ' number 4 ', <* number 7 *>)
('<* number 2 *>', '<* number 3 *>', ' number 5 ', <* number 8 *>)
exactly as desired.
But when I use a rougher data set that has words with blanks, double quotes, and mixes of upper and lower case letters, the printing works at the row level, but the writing does not work entirely.
By "entirely", I mean I have been able (I know I'm in epic fail mode here) to write one row of the challenging data, but not, in that attempt, a header plus multiple rows. It's pretty lame that I can't overcome this hurdle with all the talented articles I've read.
All four columns fail with either a KeyError or with "TypeError: tuple indices must be integers, not str".
I'm obviously not understanding how to grasp what Python needs to make this happen.
The high level is: read in text files with seven observations / columns. Use only four columns to write out; perform the regex on one column. Make sure to write out each newly formed row, not the original row.
I may need a more friendly type of global temp table to read the row into, update the row, then write the row out to a file.
Maybe I'm asking too much of Python architecture to coordinate a DictReader and a DictWriter to read in data, filter to four columns, update the fourth column with a regex, then write out the file with the updated four tuples.
At this juncture, I don't have the time to investigate a parser. I would like to eventually in more detail, since per release of Python (2.7 now, 3.x later) parsers seem handy.
Again, apologize for the complexity of the approach and my lack of understanding of the underpinnings of Python. In R language, the parallel of my shortcomings would be understanding coding at the S4 level, not just the S3 level.
Here is data that is closer to what fails, sorry. I needed to show how the headers are set up, how the incoming file rows are formatted with individual double quotes plus quotes around the entire row, and how the date is formatted but not quoted:
stuff_type|stuff_date|stuff_text
""cool stuff"|01-25-2015|""the text stuff <*to test*> to find a way to extract all text that is <*included in special tags*> less than star and greater than star"""
""cool stuff"|05-13-2014|""the text stuff <*to test a second*> to find a way to extract all text that is <*included in extra special tags*> less than star and greater than star"""
""great big stuff"|12-7-2014|"the text stuff <*to test a third*> to find a way to extract all text that is <*included in very special tags*> less than star and greater than star"""
""nice stuff"|2-22-2013|""the text stuff <*to test a fourth ,*> to find a way to extract all text that is <*included in doubly special tags*> less than star and greater than star"""
stuff_type,stuff_date,stuff_text
cool stuff,1/25/2015,the text stuff <*to test*> to find a way to extract all text that is <*included in special tags*> less than star and greater than star
cool stuff,5/13/2014,the text stuff <*to test a second*> to find a way to extract all text that is <*included in extra special tags*> less than star and greater than star
great big stuff,12/7/2014,the text stuff <*to test a third*> to find a way to extract all text that is <*included in very special tags*> less than star and greater than star
nice stuff,2/22/2013,the text stuff <*to test a fourth *> to find a way to extract all text that is <*included in really special tags*> less or greater than star
I plan to retest this, but a Spyder update made my Python console crash this morning. Ugghh. With vanilla Python, the test data above fails with the following code... no need to do the write step... it can't even print here... I may need QUOTE_NONE in the dialect.
import csv
import re

t = re.compile(r'<\*(.*?)\*>')
headers = ['stuff_type', 'stuff_date', 'stuff_text']
with open('C:/Temp/in3.csv', 'rb') as csvfile:
    with open('C:/Temp/out3.csv', 'wb') as output_file:
        reader = csv.DictReader(csvfile)
        writer = csv.DictWriter(output_file, headers, extrasaction='ignore')
        writer.writeheader()
        print(headers)
        for row in reader:
            row['stuff_text'] = re.findall(t, row['stuff_text'])
            print(row['stuff_type'], row['stuff_date'], row['stuff_text'])
            writer.writerow(row)
Error:
can't paste the snipping tool image in here... sorry
KeyError: 'stuff_text'
OK: it might be in the quoting and separation of columns. The data above without the quotes prints without a KeyError and now writes to the file correctly. I may have to clean the quote characters out of the file before I pull out text with the regex. Any thoughts would be appreciated.
Good question, @Andrea Corbellini.
The code above generates the following output if I've manually removed the quotes:
stuff_type,stuff_date,stuff_text
cool stuff,1/25/2015,"['to test', 'included in special tags']"
cool stuff,5/13/2014,"['to test a second', 'included in extra special tags']"
great big stuff,12/7/2014,"['to test a third', 'included in very special tags']"
nice stuff,2/22/2013,"['to test a fourth ', 'included in really special tags']"
which is what I want in terms of output. So, thanks for your "lazy" question; I'm the lazy one who should have put this second output in as a follow-on.
Again, without removing the multiple sets of quotation marks, I get KeyError: 'stuff_type'. I apologize that I attempted to insert the image from a screen capture of the Python error but have not yet figured out how to do that on SO. I used the Images section above, but that seems to point to a file that is uploaded to SO, not inserted?
With @monkut's excellent input below on using ":".join, things (or, literally, stuff) are getting better.
['stuff_type', 'stuff_date', 'stuff_text']
('cool stuff', '1/25/2015', 'to test:included in special tags')
('cool stuff', '5/13/2014', 'to test a second:included in extra special tags')
('great big stuff', '12/7/2014', 'to test a third:included in very special tags')
('nice stuff', '2/22/2013', 'to test a fourth :included in really special tags')
import csv
import re

t = re.compile(r'<\*(.*?)\*>')
headers = ['stuff_type', 'stuff_date', 'stuff_text']
csv.register_dialect('piper', delimiter='|', quoting=csv.QUOTE_NONE)
with open('C:/Python/in3.txt', 'rb') as csvfile:
    with open('C:/Python/out5.csv', 'wb') as output_file:
        reader = csv.DictReader(csvfile, dialect='piper')
        writer = csv.DictWriter(output_file, headers, extrasaction='ignore')
        writer.writeheader()
        print(headers)
        for row in reader:
            row['stuff_text'] = ":".join(re.findall(t, row['stuff_text']))
            print(row['stuff_type'], row['stuff_date'], row['stuff_text'])
            writer.writerow(row)
Error path follows:
runfile('C:/Python/test quotes with dialect quotes none or quotes filter and special characters with findall regex.py', wdir='C:/Python')
['stuff_type', 'stuff_date', 'stuff_text']
('""cool stuff"', '01-25-2015', 'to test')
Traceback (most recent call last):
File "<ipython-input-3-832ce30e0de3>", line 1, in <module>
runfile('C:/Python/test quotes with dialect quotes none or quotes filter and special characters with findall regex.py', wdir='C:/Python')
File "C:\Users\Methody\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 699, in runfile
execfile(filename, namespace)
File "C:\Users\Methody\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 74, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/Python/test quotes with dialect quotes none or quotes filter and special characters with findall regex.py", line 20, in <module>
row['stuff_text'] = ":".join(re.findall(t, row['stuff_text']))
File "C:\Users\Methody\Anaconda\lib\re.py", line 177, in findall
return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer
I'll have to find a stronger way to clean up and remove the quotes before processing the regex findall. Probably something like row = string.remove(quotes with blanks).
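One possible way to do that, as a rough sketch reusing the reader, writer, and t objects from the code above (not tested against the exact file), is to strip the double quotes from the field and guard against a missing value before running findall:
for row in reader:
    raw = row.get('stuff_text') or ''   # guard against a missing/None field on malformed rows
    raw = raw.replace('"', '')          # drop the stray double quotes before matching
    row['stuff_text'] = ":".join(t.findall(raw))
    writer.writerow(row)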
I think findall returns a list, which may be screwing things up, since dictwriter wants a single string value.
row['d'] = re.findall(t, row['d'])
You can use .join to turn the results into a single string value:
row['d'] = ":".join(re.findall(t, row['d']))
Here the values are joined with ":". As you mention, though, you may need to clean the values a bit more...
You mentioned there was a problem with using the compiled regex object.
Here's an example of how the compiled regex object is used:
import re
t = re.compile(r'<\*(.*?)\*>')
text = ('''cool stuff,1/25/2015,the text stuff <*to test*> to find a way to extract all text that'''
        ''' is <*included in special tags*> less than star and greater than star''')
result = t.findall(text)
This should return the following into result:
['to test', 'included in special tags']