Data Formatting and Fixing - python

I am trying to clean user reviews that were crawled from the web. When I read the data with pandas, there is no warning or error. I then print the length of the DataFrame.
Next I would like to apply a normalization step. Because I am working with Turkish, I cannot use a Python library for this, so I will use third-party software.
For this purpose, I am trying to write the review column to a text file. When I write the data to a text file, the number of samples is
and the target size is:
Basically I do this:
Note: As I mentioned, these are customer reviews, so as expected they are dirty and noisy. Some samples contain many newline characters; for example, approximately 56 of the samples contain "\n\n\n\n". I have tried to solve this problem in Python by cleaning the data, but every time I lose samples. I also tried to fix it in Excel, which did not work.
Question: Do you have any suggestions for fixing the data?

It seems that you are producing two CSV files from your df and then reading them back as reviews and targets.
If you use pd.read_csv to read them back, note that its skip_blank_lines argument defaults to True, which skips blank lines. If some rows of your original df contain nothing but '\n' characters, they end up as empty lines in your new CSVs and are skipped the next time the files are read.
You can verify this by counting the empty reviews and targets with two counter variables and checking whether the totals match the 'loss'.
num_empty_review = 0
num_empty_target = 0
for ..., ... in df.iterrows():
    review = ...replace('\n', '')
    target = ...replace('\n', '')
    if review.replace(' ', '') == '':
        num_empty_review += 1
    if target.replace(' ', '') == '':
        num_empty_target += 1
    ...
    ...
print(num_empty_review, num_empty_target)
Lastly, next time, please paste your code here as text, as I did above :)
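One possible way to avoid losing samples when writing the reviews out, sketched under the assumption that the third-party tool expects exactly one review per line: collapse the whitespace inside each review before writing, and substitute a placeholder for reviews that end up empty. The column names, the sample data, and the `<EMPTY>` placeholder are made up for illustration.

```python
import io

import pandas as pd


def flatten_whitespace(text):
    """Collapse every run of whitespace (including newlines) into one space."""
    return " ".join(str(text).split())


# Hypothetical sample standing in for the crawled reviews.
df = pd.DataFrame({
    "review": ["good\n\n\n\nproduct", "\n\n\n\n", "fast   delivery"],
    "target": [1, 0, 1],
})

cleaned = df["review"].map(flatten_whitespace)
# Keep a placeholder for reviews that became empty, so the line count
# of the text file still matches the number of rows in the DataFrame.
cleaned = cleaned.mask(cleaned == "", "<EMPTY>")

# An in-memory buffer stands in for the real text file.
buffer = io.StringIO()
for review in cleaned:
    buffer.write(review + "\n")

lines = buffer.getvalue().splitlines()
print(len(df), len(lines))
```

Because every row produces exactly one non-empty line, the reviews and the targets stay aligned after the round trip.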

Related

commas in between data cells not quoted load to dataframe in pandas

I want to read a comma-separated CSV file in Python whose cells contain commas but are not quoted.
For example, the CSV file is in the format below:
product,unit,count,alter,denom
(any name or id) xyz,kg,1,000,volume,1
1142,KG,1,000,L,910
1143,v,1,000,L,910
11144,K,1,EA,1
11529,KG,1,EA,1
11548,V,1,EA,10
11551,V,1,EA,4
11562,K,1,000,TO,1
11567,K,28,EA,100
11569,v,1,000,TO,1
Here the count value is 1,000, but because it contains a comma it is split into two values. This should be rectified when the data is loaded into a DataFrame.
The output should look like this:
product unit count alter denom
xyz kg 1,000 volume 1
1142 KG 1,000 L 910
I have used:
df=pd.read_csv("filename.csv",sep=",")
The fundamental problem is that your input is not a valid .csv file. Either a comma is part of the data or it is a field delimiter. It can't be both.
The simplest approach is to go back to whoever or whatever supplied the file and complain that the format is invalid.
The producer of the file has several, usually easy, options to fix this: (1) Suppress the thousands separator. (2) Quote the field containing the comma, for example "1,000". (3) Choose a different field delimiter, such as ;. This is a very common approach in Europe because , frequently means a decimal point and so ignoring it is a bad idea.
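As a rough illustration of options (2) and (3), pandas can read either corrected format directly. The rows below are adapted from the question, and the in-memory strings stand in for real files. The `thousands` argument of `pd.read_csv` asks pandas to treat the comma inside a number as a digit-group separator.

```python
import io

import pandas as pd

# Option (2): the field containing the comma is quoted.
quoted = io.StringIO(
    'product,unit,count,alter,denom\n'
    '1142,KG,"1,000",L,910\n'
    '11144,K,1,EA,1\n'
)
df_quoted = pd.read_csv(quoted, thousands=",")

# Option (3): a different field delimiter, here a semicolon.
semicolon = io.StringIO(
    "product;unit;count;alter;denom\n"
    "1142;KG;1,000;L;910\n"
    "11144;K;1;EA;1\n"
)
df_semicolon = pd.read_csv(semicolon, sep=";", thousands=",")

print(df_quoted)
print(df_semicolon)
```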
You should not be in the position of having to clean up someone else's sloppy export.
However, since you have the file that you have, and don't seem in a position to take this advice, your only option is to reprocess the file so that it is valid.
The approach is to read the defective input file, check each row to see how many fields it has, and if it has one too many and the cause is a thousands separator comma masquerading as a field delimiter, then glue the two halves of the number back together; and then write out the modified file.
# fixit.py
# Program to accept an invalid csv file with an unescaped comma in column 3 and regularize it
# Use like this: python fixit.py < wrongfile.csv > rightfile.csv
import sys
import csv

def fix(row: list[str]) -> list[str]:
    """
    If there are 5 columns:
        return unchanged.
    If there are 6 columns
        and columns 2 and 3 can be interpreted as a number with a thousand separator:
        combine columns 2 and 3 and return the row.
    Otherwise return an empty list.
    """
    if len(row) == 5:
        return row
    if len(row) == 6 and row[2].isdigit() and row[3].isdigit():
        return row[:2] + [row[2] + row[3]] + row[4:]
    return []

def main(infile, outfile):
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    for row in reader:
        if fixed_row := fix(row):
            writer.writerow(fixed_row)
        else:
            print(f"Line {reader.line_num} could not be fixed", file=sys.stderr)

if __name__ == '__main__':
    sys.stdout.reconfigure(newline="")
    # This is because module csv does its own thing with end-of-line and requires the file have newline=""
    main(sys.stdin,sys.stdout)
Given this input:
product,unit,count,alter,denom
(any name or id) xyz,kg,1,000,volume,11142,KG,1,000,L,910
1143,v,1,000,L,910
11144,K,1,EA,1
11529,KG,1,EA,1
11548,V,1,EA,10
11551,V,1,EA,4
11562,K,1,000,TO,1
11567,K,28,EA,100
11569,v,1,000,TO,1
you will see this output:
product,unit,count,alter,denom
1143,v,1000,L,910
11144,K,1,EA,1
11529,KG,1,EA,1
11548,V,1,EA,10
11551,V,1,EA,4
11562,K,1000,TO,1
11567,K,28,EA,100
11569,v,1000,TO,1
along with a warning written to the console about line 2.
Your question shows the data with a blank line between each row of data. I'm assuming that your data is not really like that and the blank lines are the result of your inexperience in formatting a Stack Overflow question properly. But if your data really is like that, the program will still work. You will just get a lot of warnings about blank lines. There won't be any blank lines in the output because pandas.read_csv() doesn't need them.
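If an intermediate file is not wanted, the same repair rule can be applied in memory and the result handed straight to pandas. This is only a sketch: `repair` is a hypothetical helper that mirrors `fix()` above, the sample rows are taken from the question, and rows that cannot be repaired are silently dropped here rather than reported.

```python
import csv
import io

import pandas as pd


def repair(row):
    """Same rule as fix() above: merge a split thousands value in columns 2 and 3."""
    if len(row) == 5:
        return row
    if len(row) == 6 and row[2].isdigit() and row[3].isdigit():
        return row[:2] + [row[2] + row[3]] + row[4:]
    return []


# In-memory stand-in for the defective file.
raw = io.StringIO(
    "product,unit,count,alter,denom\n"
    "1143,v,1,000,L,910\n"
    "11144,K,1,EA,1\n"
    "11567,K,28,EA,100\n"
)

reader = csv.reader(raw)
header = next(reader)
rows = [fixed for fixed in map(repair, reader) if fixed]

# csv.reader yields strings, so convert the numeric columns explicitly.
df = pd.DataFrame(rows, columns=header).astype({"count": int, "denom": int})
print(df)
```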

Rewriting Single Words in a .txt with Python

I need to create a database using Python and a .txt file.
Creating new items is no problem. The contents of the database .txt file look like this:
Index Objektname Objektplace Username
e.g.:
1 Pen Office Daniel
2 Saw Shed Nic
6 Shovel Shed Evelyn
4 Knife Room6 Evelyn
I get the index from a QR scanner (OpenCV), and the other information comes from Tkinter Entry widgets. If an object is already saved in the database, it should be possible to rewrite its Objektplace and Username.
My problems now are the following:
If I scan the code with index 6, how do I navigate to that entry, even if it is not on line 6, without causing a problem with Room6?
How do I, for example, replace only the "Shed" of index 4 when that object is moved to, say, Room6?
The same goes for the usernames.
So far I have tried different methods, but nothing has worked.
My last attempt looked something like this:
def DBChange():
    #Removes unwanted bits from the scanned code
    data2 = data.replace("'", "")
    Index = data2.replace("b","")
    #Gets the Data from the Entry-Widgets
    User = Nutzer.get()
    Einlagerungsort = Ort.get()
    #Adds a whitespace at the end of the Entrys to separate them
    Userlen = len(User)
    User2 = User.ljust(Userlen)
    Einlagerungsortlen = len(Einlagerungsort)+1
    Einlagerungsort2 = Einlagerungsort.ljust(Einlagerungsortlen)
    #Navigate to the exact line of the scanned Index and replace the words
    #for the place and the user ONLY in this line
    file = open("Datenbank.txt","r+")
    lines=file.readlines()
    for word in lines[Index].split():
        List.append(word)
    checkWords = (List[2],List[3])
    repWords = (Einlagerungsort2, User2)
    for line in file:
        for check, rep in zip(checkWords, repWords):
            line = line.replace(check, rep)
        file.write(line)
    file.close()
    Return()
Thanks in advance
I'd suggest using pandas to read and write your text file. That way you can use the index to select the appropriate line. And if there is no specific reason to keep your text format, I would switch to CSV for ease of use.
import pandas as pd

def DBChange():
    # Removes unwanted bits from the scanned code
    # I haven't changed this part, since I guess you need this for some input data
    data2 = data.replace("'", "")
    # Convert to int so that it matches the numeric values of the Index column
    Indexnr = int(data2.replace("b",""))
    # Gets the Data from the Entry-Widgets
    User = Nutzer.get()
    Einlagerungsort = Ort.get()
    # I removed the lines here. This isn't necessary when using csv and Pandas
    # Read in the csv file, using the Index column as the row labels
    df = pd.read_csv("Datenbank.csv", index_col="Index")
    # Select the row with this index value and replace the values
    df.loc[Indexnr, 'Username'] = User
    df.loc[Indexnr, 'Objektplace'] = Einlagerungsort
    # Write back to csv (the Index column is written out as the row labels)
    df.to_csv("Datenbank.csv")
    Return()
Since I can't reproduce your specific problem, I haven't tested it, but something like this should work.
Edit
To read and write the text file, use ' ' as the separator. (I assume that no value contains spaces and that your text file uses exactly one space between values.)
Reading:
df = pd.read_csv('Datenbank.txt', sep=' ', index_col='Index')
Writing:
df.to_csv('Datenbank.txt', sep=' ')
First of all, this is a terrible way to store data. My suggestion is not particularly good code either; don't do this in production!
newlines = []
for line in lines:
    entry = line.split()
    if entry and entry[0] == Index:
        # line is now the correct line
        # index 2 is the place, index 0 the ID, etc.
        entry[2] = Einlagerungsort2
    # keep every line, changed or not
    newlines.append(" ".join(entry))
# Now write newlines back to the file
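Putting the pieces together, a complete read-modify-write cycle without pandas might look like the sketch below. It assumes that the file is whitespace-separated, that no value contains a space, and that the columns appear in the order shown in the question. `update_entry` is a hypothetical helper name, and the demo works on a temporary copy of the sample data.

```python
import tempfile
from pathlib import Path


def update_entry(path, index, new_place, new_user):
    """Rewrite place and user of the row whose first field equals index.

    Returns True if a matching row was found.
    """
    path = Path(path)
    found = False
    new_lines = []
    for line in path.read_text(encoding="utf-8").splitlines():
        fields = line.split()
        # Compare only the first field, so "Room6" never matches index 6.
        if len(fields) >= 4 and fields[0] == str(index):
            fields[2], fields[3] = new_place, new_user
            found = True
        new_lines.append(" ".join(fields))
    path.write_text("\n".join(new_lines) + "\n", encoding="utf-8")
    return found


# Demo on a throwaway file containing sample rows from the question.
with tempfile.TemporaryDirectory() as tmp:
    db = Path(tmp) / "Datenbank.txt"
    db.write_text(
        "Index Objektname Objektplace Username\n"
        "1 Pen Office Daniel\n"
        "6 Shovel Shed Evelyn\n"
        "4 Knife Room6 Evelyn\n",
        encoding="utf-8",
    )
    found = update_entry(db, "6", "Room6", "Nic")
    result = db.read_text(encoding="utf-8").splitlines()

print(found, result)
```

Matching on the first field only is what keeps an index such as 6 from colliding with a place name such as Room6.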

How to organise dataFrame like this, in Python:

I have a file that contains the following information:
1. Movie ID (the first number, before a ":")
2. User ID
3. User Rating
4. Date
All elements are separated by a ",", except Movie ID, which is followed by a colon.
If I create a dataframe like this:
df=pd.read_csv('combined_data_1.txt',header = None,names=['Movie_ID','User_ID','Rating','Date'])
and print the dataframe, I get this:
Which is obviously not correct.
So, if you look at the "Movie_ID" column, in the first row there is a 1:1488844. Only the number "1" (just before the colon) should be in the "Movie_ID" column, not "1:1488844". The rest (1488844) should be in the User_ID column.
Another problem is that not every row has the correct ID in its "Movie_ID" column. In this case, it should be "1" until I find another movie ID, which again will be the first number before a colon.
I know that the IDs of all the movies follow a sequence, that is: 1,2,3,4,...
Another problem I saw is that when I read the file, for some reason a split occurs wherever there is a colon. So after the first row (which doesn't get split), whenever a colon appears, a row is created in "Movie_ID" containing only, for example, "2:", unlike the first row.
In the end, I would like to get something like this:
But I don't know how to organise the data like this.
Thanks for the help!
Use shift with axis=1 and simply modify the columns:
df=df.shift(axis=1)
df['Movie_ID']=df['User_ID'].str[0]
df['User_ID']=df['User_ID'].str[2:]
And now:
print(df)
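should print the desired result. (Note that .str[0] takes only the first character, so this works as written only for single-digit movie IDs.)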
I believe the issue comes from how your data is stored and therefore parsed: the Movie ID is separated by a : (colon) rather than by a , (comma), as a CSV would require.
If you can preprocess the text so that it is delimited exclusively by commas before it is opened as a CSV, you may be able to eliminate this issue. I note this because pandas' default parser does not accept multiple delimiters (a regular-expression separator works only with the slower Python engine).
Here is what I was able to come up with to split the data into sections at each colon. While I know this isn't your ultimate goal, I hope it gets you on the right path.
import pandas as pd

with open("combined_data_1.txt") as file:
    lines = file.readlines()

#Splitting the data into a list delineated by colons
data = []
for line in lines:
    if ":" in line:
        data.append([])
    else: #Using else here prevents the line containing the colon from being saved.
        data[-1].append(line)

for x in range(len(data)):
    print("Section " + str(x+1) + ":\n")
    print(str(data[x]) + "\n\n")
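Going one step further, the sections can be turned into the DataFrame the question asks for by carrying the most recent movie ID forward onto every rating row. The sketch below assumes that each movie starts with an "N:" marker, either on its own line or directly in front of the first rating. `parse_ratings` is a hypothetical helper, and the sample values are made up apart from the user ID quoted in the question.

```python
import io

import pandas as pd


def parse_ratings(lines):
    """Build a DataFrame with the current movie ID attached to every rating row."""
    rows = []
    movie_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if ":" in line:
            # Text before the colon is the movie ID; anything after it
            # (possibly nothing) is an ordinary rating record.
            head, line = line.split(":", 1)
            movie_id = int(head)
            if not line:
                continue
        user_id, rating, date = line.split(",")
        rows.append((movie_id, int(user_id), int(rating), date))
    return pd.DataFrame(rows, columns=["Movie_ID", "User_ID", "Rating", "Date"])


# Made-up sample in the assumed layout; a real run would pass an open file.
sample = io.StringIO(
    "1:\n"
    "1488844,3,2005-09-06\n"
    "822109,5,2005-05-13\n"
    "2:\n"
    "2059652,4,2005-09-05\n"
)
df = parse_ratings(sample)
print(df)
```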

pandas filtering in front of specific string

As you can see in the yellow column (column name '식품이름'), some values contain a # and some do not.
I want to delete everything up to and including the # and keep only the last word.
So I wrote this:
import pandas as pd
pr1 = pd.read_csv('D:\\py_project\\.vscode\\wdata.csv', encoding='utf-8')
for i in pr1['식품이름']:
    if '#' in i:
        i = i.split('#')[-1]
But the problem is how to apply the edit to the actual file.
If I print the values, it works, but the change is not saved in the original file.
How can I solve this?
I think the condition is not necessary; use str.split and select the last element by indexing:
pr1 = pd.DataFrame({'식품이름':[' ~~#text','ssds','wew#efs']})
print (pr1)
식품이름
0 ~~#text
1 ssds
2 wew#efs
pr1['new'] = pr1['식품이름'].str.split("#").str[-1]
print (pr1)
식품이름 new
0 ~~#text text
1 ssds ssds
2 wew#efs efs
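As for saving the change: the loop in the question only rebinds the loop variable i, so neither the DataFrame nor the file is modified. A minimal sketch of persisting the result is to assign the cleaned values back to the column and write the DataFrame out with to_csv. The sample values and the output file name below are placeholders, and a temporary directory stands in for the real location.

```python
import os
import tempfile

import pandas as pd

# Made-up sample values; the real data would come from pd.read_csv.
pr1 = pd.DataFrame({'식품이름': ['a#b#text', 'ssds', 'wew#efs']})

# Overwrite the column in the DataFrame with the part after the last '#'.
pr1['식품이름'] = pr1['식품이름'].str.split('#').str[-1]

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical output file name.
    path = os.path.join(tmp, 'wdata_clean.csv')
    # index=False keeps the row numbers out of the saved file.
    pr1.to_csv(path, index=False, encoding='utf-8')
    reloaded = pd.read_csv(path, encoding='utf-8')

print(reloaded)
```

Writing to a new file first, rather than overwriting the original, leaves the raw data available in case the cleaning rule needs to change.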

Use Python xlsxwriter module to write srt data into and excel

This time I tried to use Python's xlsxwriter module to write data from an .srt file into an Excel file.
The subtitle file looks like this in Sublime Text:
but I want to write the data into an Excel file so that it looks like this:
This is my first time writing Python for this, so I am still in the trial-and-error stage... I tried to write some code like the one below
but I don't think it makes sense...
I'll keep trying, but if you know how to do it, please let me know. I'll read your code and try to understand it! Thank you! :)
The following breaks the problem into a few pieces:
Parsing the input file. parse_subtitles is a generator that takes a source of lines and yields a sequence of records of the form {'index': 'N', 'timestamp': 'NN:NN:NN,NNN --> NN:NN:NN,NNN', 'subtitles': ['TEXT', ...]}. The approach I took was to track which of three distinct states we're in:
'seeking to next entry', when we're looking for the next index number, which should match the regular expression ^\d*$ (nothing but digits);
'looking for timestamp', when an index has been found and we expect a timestamp on the next line, which should match the regular expression ^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$ (HH:MM:SS,mmm --> HH:MM:SS,mmm); and
'reading subtitles', while consuming actual subtitle text, with blank lines and EOF interpreted as subtitle termination points.
Writing the above records to rows in a worksheet. write_dict_to_worksheet accepts a row and a worksheet, plus a record and a dictionary defining the 0-indexed Excel column number for each of the record's keys, and then writes the data accordingly.
Organizing the overall conversion. convert accepts an input filename (e.g. 'Wildlife.srt'), which is opened and passed to the parse_subtitles function, and an output filename (e.g. 'Subtitle.xlsx'), which is created using xlsxwriter. It then writes a header and, for each record parsed from the input file, writes that record to the XLSX file.
Logging statements are left in for self-commenting purposes, and because when reproducing your input file I fat-fingered a : to a ; in a timestamp, making it unrecognized, and having the error pop up was handy for debugging!
I've put a text version of your source file, along with the below code, in this Gist
import xlsxwriter
import re
import logging

def parse_subtitles(lines):
    # Raw strings keep the regex backslashes from being read as string escapes
    line_index = re.compile(r'^\d*$')
    line_timestamp = re.compile(r'^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$')
    line_seperator = re.compile(r'^\s*$')
    current_record = {'index':None, 'timestamp':None, 'subtitles':[]}
    state = 'seeking to next entry'
    for line in lines:
        line = line.strip('\n')
        if state == 'seeking to next entry':
            if line_index.match(line):
                logging.debug('Found index: {i}'.format(i=line))
                current_record['index'] = line
                state = 'looking for timestamp'
            else:
                logging.error('HUH: Expected to find an index, but instead found: [{d}]'.format(d=line))
        elif state == 'looking for timestamp':
            if line_timestamp.match(line):
                logging.debug('Found timestamp: {t}'.format(t=line))
                current_record['timestamp'] = line
                state = 'reading subtitles'
            else:
                logging.error('HUH: Expected to find a timestamp, but instead found: [{d}]'.format(d=line))
        elif state == 'reading subtitles':
            if line_seperator.match(line):
                logging.info('Blank line reached, yielding record: {r}'.format(r=current_record))
                yield current_record
                state = 'seeking to next entry'
                current_record = {'index':None, 'timestamp':None, 'subtitles':[]}
            else:
                logging.debug('Appending to subtitle: {s}'.format(s=line))
                current_record['subtitles'].append(line)
        else:
            logging.error('HUH: Fell into an unknown state: `{s}`'.format(s=state))
    if state == 'reading subtitles':
        # We must have finished the file without encountering a blank line. Dump the last record
        yield current_record

def write_dict_to_worksheet(columns_for_keys, keyed_data, worksheet, row):
    """
    Write a subtitle-record to a worksheet.
    Return the row number after those that were written (since this may write multiple rows)
    """
    current_row = row
    #First, horizontally write the entry and timecode
    for (colname, colindex) in columns_for_keys.items():
        if colname != 'subtitles':
            worksheet.write(current_row, colindex, keyed_data[colname])
    #Next, vertically write the subtitle data
    subtitle_column = columns_for_keys['subtitles']
    for morelines in keyed_data['subtitles']:
        worksheet.write(current_row, subtitle_column, morelines)
        current_row+=1
    return current_row

def convert(input_filename, output_filename):
    workbook = xlsxwriter.Workbook(output_filename)
    worksheet = workbook.add_worksheet('subtitles')
    columns = {'index':0, 'timestamp':1, 'subtitles':2}
    next_available_row = 0
    records_processed = 0
    headings = {'index':"Entries", 'timestamp':"Timecodes", 'subtitles':["Subtitles"]}
    next_available_row=write_dict_to_worksheet(columns, headings, worksheet, next_available_row)
    with open(input_filename) as textfile:
        for record in parse_subtitles(textfile):
            next_available_row = write_dict_to_worksheet(columns, record, worksheet, next_available_row)
            records_processed += 1
    print('Done converting {inp} to {outp}. {n} subtitle entries found. {m} rows written'.format(inp=input_filename, outp=output_filename, n=records_processed, m=next_available_row))
    workbook.close()

convert(input_filename='Wildlife.srt', output_filename='Subtitle.xlsx')
Edit: Updated to split multiline subtitles across multiple rows in output
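To see how the two patterns classify lines, here is a small standalone check. The sample lines are made up; the third one imitates the mistyped ; mentioned above. The pattern names INDEX_RE and TIMESTAMP_RE are used only in this sketch.

```python
import re

# The same patterns as in parse_subtitles, written as raw strings.
INDEX_RE = re.compile(r'^\d*$')
TIMESTAMP_RE = re.compile(r'^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$')

sample_lines = [
    "1",                              # an index line
    "00:00:01,000 --> 00:00:04,000",  # a well-formed timestamp
    "00;00:01,000 --> 00:00:04,000",  # mistyped ';' instead of ':'
    "",                               # a blank line
]

index_hits = [bool(INDEX_RE.match(s)) for s in sample_lines]
timestamp_hits = [bool(TIMESTAMP_RE.match(s)) for s in sample_lines]

for text, is_index, is_timestamp in zip(sample_lines, index_hits, timestamp_hits):
    print(repr(text), is_index, is_timestamp)
```

Because \d* also matches the empty string, a blank line counts as an index here; changing the pattern to \d+ would be one way to require at least one digit, at the cost of extra blank lines being logged as errors.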
