I have a file which has some information:
1.Movie ID (the first character before a ":")
2.User ID
4.User Rating
3.Date
All elements are splited by a "," but Movie ID, which is separated by a colon
if I create a dataframe like this:
df=pd.read_csv('combined_data_1.txt',header = None,names['Movie_ID','User_ID','Rating','Date'])
and print the dataframe, I will get this:
Which is not correct, obviosly.
So, if you look at the "Movie_ID" column, in the first row, there is a 1:1488844. Only the number "1" (just before the colon) should be in the "Movie_ID" column, not "1:1488844". The rest (1488844) should be in the User_ID column.
Another problem is that not every "Movie_ID" column have its correctly ID, and in this case, it should be "1" until I find another movie id, that again, will be the first number before a colon.
I know that the ids of all the movies follow a sequence, that is: 1,2,3,4,...
Another problem that I saw, was that when I read the file, for some reason a split occours when there is a colon, so after the first row (which doesn't get splited), when a colon appears, a row in "Movie_ID" is created containing only, for example: "2:", not something like the first row.
In the end, I would like to get something like this:
But I don't know how to organise like this.
Thanks for the help!
Use shift with axis=1 and simply modify the columns:
df=df.shift(axis=1)
df['Movie_ID']=df['User_ID'].str[0]
df['User_ID']=df['User_ID'].str[2:]
And now:
print(df)
Would be desired result.
I believe the issue might be coming from how your data is being stored and thus parsed due to the way your Movie ID is stored separated by a : (colon) rather than a , (comma) as would be needed in a CSV.
If you are able to parse to have it delineate by commas exclusively. the text before it is opened as a CSV, you may be able to eliminate this issue. I only note this because Pandas does not permit multiple delimiters.
Here is what I was able to come up with regarding making something which delineates by colon and comma for how you desire. While I know this isn't your ultimate goal, hopefully this is able to get you on the right path.
import pandas as pd
with open("combined_data_1.txt") as file:
lines = file.readlines()
#Splitting the data into a list delineated by colons
data = []
for line in lines:
if(":" in line):
data.append([])
else: #Using else here prevents the line containing the colon from being saved.
data[len(data)-1].append(line)
for x in range(len(data)):
print("Section " + str(x+1) + ":\n")
print(str(data[x]) + "\n\n")
Related
I am trying to clean user reviews which they are crawled in the web. When I try to read on the pandas. There is no warning or error. Then print the lenght of the dataframe.
Then I would like to apply normalization step. But I am focusing on Turkish language,so I cannot use python library. I will use third party software.
For this purpose, I am trying to write reviews columns to text file. When I write to these data text file lenght of the sample is
and target size:
Basically I do this:
Note: As I mentioned these are the customer reviews, as we expected they are dirty and noisy. Some of the samples contains many enter characters such as approximately 56 of the sample contains "\n\n\n\n". I have tried solve this problem in python by cleaning data but every time I am losing sample. Also I tried to fix it on Excel, it did not work.
Question: Do you have any suggestion for fixing data?
It seems that you are producing two CSVs files from your df and then read them back as reviews and targets.
If you use pd.read_csv to read them back, pd.read_csv has this argument skip_blank_lines=True by default which skips blank lines. If some rows of your original df contains only a number of '\n', then it will end up with an empty line in your new CSVs which will be skipped the next time they get read.
You can verify this by setting up two counter variables for the total number of empty lines and see if that matches with the 'loss'.
num_empty_review = 0
num_empty_target = 0
for ..., ... in df.iterrows():
review = ...replace('\n', '')
target = ...replace('\n', '')
if review.replace(' ', '') == '':
num_empty_review += 1
if target.replace(' ', '') == '':
num_empty_target += 1
...
...
print(num_empty_review, num_empty_target)
Lastly, next time, please paste your code here in text form like what I did in above :)
I am working with a .txt file. This has 100 rows and 5 columns. I need to divide it in five vectors of lenght 100, one for each column. I am trying to follow this: Reading specific columns from a text file in python.
However, when I implement it as:
token = open('token_data.txt','r')
linestoken=token.readlines()
resulttoken=[]
for x in linestoken:
resulttoken.append(x.split(' ')[1])
token.close()
I don't know how this is stored. If I write print('resulttoken'), nothing appears on my screen.
Can someone please tell me what I am doing wrong?
Thanks.
part of my text file
x.split(' ') is not useful, because columns of your text file separated by more than one space. Use x.split() to ignore spaces:
token = open('token_data.txt','r')
linestoken=token.readlines()
tokens_column_number = 1
resulttoken=[]
for x in linestoken:
resulttoken.append(x.split()[tokens_column_number])
token.close()
print(resulttoken)
Well, the file looks like to be split by table rather than space, so try this:
token = open('token_data.txt','r')
linestoken=token.readlines()
tokens_column_number = 1 resulttoken=[] for x in linestoken:
resulttoken.append(x.split('\t'))
token.close()
print(resulttoken)
You want a list of five distinct lists, and append to each in turn.
columns = [[]] * 5
with open('token_data.txt','r') as token:
for line in token:
for field, value in enumerate(line.split()):
columns[field].append(value)
Now, you will find the first value from the first line in columns[0][0], the second value from the first line in columns[1][0], the first value from the second line in columns[0][1], etc.
To print the value of a variable, don't put quotes around it. Quotes create a literal string.
print(columns[0][0])
prints the value of columns[0][0] whereas
print('columns[0][0]')
simply prints the literal text "columns[0][0]".
You can use data_py package to read column wise data in FORTRAN style.
Install this package using
pip install data-py
Usage Example
from data_py import datafile
NoOfLines=0
lineNumber=2 # Line number to read (Excluding lines starting with '#')
df1=datafile("C:/Folder/SubFolder/data-file-name.txt")
df1.separator="," # No need to specify if separator is space(" ") and for 'tab' separated values use '\t'
NoOfLines=df1.lines # Total number of lines in the data file (Excluding lines starting with '#')
[Col1,Col2,Col3,Col4,Col5]=["","","","",""] # Initial values
[Col1,Col2,Col3,Col4,Col5]=df1.read([Col1,Col2,Col3,Col4,Col5)],lineNumber)
print(Col1,Col2,Col3,Col4,Col5) # In str format
For details please follow the link https://www.respt.in/p/python-package-datapy.html
I have a file in a special format .cns,which is a segmented file used to analyze copy number. It is a text file, that looks like this (first line plus header):
head -1 copynumber.cns
chromosome,start,end,gene,log2 chr1,13402,861395,"LOC102725121,DDX11L1,OR4F5,LOC100133331,LOC100132062,LOC100132287,LOC100133331,LINC00115,SAMD11",-0.28067
We transformed it to a .csv so we could separate it by tab (but it didn't work well). The .cns is separated by commas but genes are a single string delimited by quotes. I hope this is useful. The output I need is something like this:
gene log2
LOC102725121 -0.28067
DDX11L1 -0.28067
OR4F5 -0.28067
PIK3CA 0.35475
NRAS 3.35475
The fist step, would be, to separate everything by commas and then, transpose columns? and finally print de log2 value for each gene that was contained in that string delimited by quotes. If you could help me with an R, or python script it would help a lot. Perhaps awk would work too.
I am using LInux UBuntu V16.04
I'm not sure if I am being clear, let me know if this is useful.
Thank you!
Hope following code in Python helps
import csv
list1 = []
with open('copynumber.cns','r') as file:
exampleReader = csv.reader(file)
for row in exampleReader:
list1.append(row)
for row in list1:
strings = row[3].split(',') # Get fourth column in CSV, i.e. gene column, and split on occurrance of comma
for string in strings: # Loop through each string
print(string + ' ' + str(row[4]))
I've a CSV dataset with 2000 rows, with a messy column about first name/surname. In this column I need to dissociate first names and surnames. For that, I've a base with all surnames given in France in the twenty last years.
So, the source database looks like :
"name"; "town"
"Johnny Aaaaaa"; "Bordeaux"
"Bbbb Tom";"Paris"
"Ccccc Pierre Dddd" ; "Lyon"
...
I want obtain something like :
"surname"; "firstname"; "town"
"Aaaaaa"; "Johnny "; "Bordeaux"
"Bbbb"; "Tom"; "Paris"
"Ccccc Dddd" ; "Pierre"; "Lyon"
...
And, my reference database of first names :
"firstname"; "sex"
"Andre"; "M"
"Bob"; "M"
"Johnny"; "M"
...
Technically, I must compare each row from the first base with each field from the second base, in order to identify which character chain correspond to the first name...
I have no idea about the way to do that.
Any ideas are welcome... thanks.
Looks like you want to
Read the data from file say input.csv
Extract the name and split it into first name and last name
Get the sex using first name
And probably write the data again to a new csv or print it.
You can follow the approach as below. You can get more sophisticated in splitting using regex but here is something basic using strip commands:
inFile=open('input.csv','r')
rows=inFile.readlines()
newData=[]
if len(rows) > 1:
for row in rows[1:]:
#Remove the new line chars at the end of line and split on ;
data=row.rstrip('\n').split(';')
#Remove additional spaces in your data
name=data[0].strip()
#Get rid of quotes
name=name.strip('"').split(' ')
fname=name[1]
lname=name[0]
city=data[1].strip()
city=city.strip('"')
#Now you can get the sex info from your other database save this in a list to get the sex info later
sex='M' #replace this with your db calls
newData.append([fname, lname, sex, city])
inFile.close()
#You can put all of this in the new csv file by something like this (it seperates the fileds using comma):
outFile=open('otput.csv','w')
for row in newData:
outFile.write(','.join(row))
outFile.write('\n')
outFile.close(
Well. Finally, I chose the "brute force" approach : each term of each line is compared with the 11.000 keys of my second base (converted in a dictionnary). Not smart, but efficient.
for row in input:
splitted = row[0].lower().split()
for s in splitted :
for cle, valeur in dict.items() :
if cle == s :
print ("{} >> {}".format(cle, valeur))
All ideas about more pretty solutions are still welcome.
Im trying to get an output like:
KPLR003222854-2009131105131
in a text file. The way I am attempting to derive that output is as such:
with open('Processed_Data.txt', 'r') as file_P, open('KIC_list.txt', 'w') as namelist:
nameData = []
for line in file_P:
splt_file_P = line.split()
nameData.append(splt_file_P[0])
for key in nameData:
namelist.write('\n' 'KPLR00' + "".join(str(w) for w in nameData) + '-2009131105131')
However I am having an issue in that the numbers in the nameData array are all appearing at once in the specified output, instead of using on ID cleanly as shown above the output is something like this:
KPLR00322285472138721382172198371823798123781923781237819237894676472634973256279234987-2009131105131
So my question is how do I loop the write command in a way that will allow me to get each separate ID (each has a specific index value, but there are over 150) to be properly outputted.
EDIT:
Also, some of the ID's in the list are not the same length, so I wanted to add 0's to the front of the 'key' to make them all equal 9 digits. I cheated this by adding the 0's into the KPLR in quotes but not all of the ID's need just two 0's. The question is, could I add 0's between KPLR and the key in any way to match the 9-digit format?
Your code looks like it's working as one would expect: "".join(str(w) for w in nameData) makes a string composed of the concatenation of every item in nameData.
Chances are you want;
for key in nameData:
namelist.write('\n' 'KPLR00' + key + '-2009131105131')
Or even better:
for key in nameData:
namelist.write('\nKPLR%09i-2009131105131'%int(key)) #no string concatenation
String concatenation tends to be slower, and if you're not only operating on strings, will involve explicit calls to str. Here's a pair of ideone snippets showing the difference: http://ideone.com/RR5RnL and http://ideone.com/VH2gzx
Also, the above form with the format string '%09i' will pad with 0s to make the number up to 9 digits. Because the format is '%i', I've added an explicit conversion to int. See here for full details: http://docs.python.org/2/library/stdtypes.html#string-formatting-operations
Finally, here's a single line version (excepting the with statement, which you should of course keep):
namelist.write("\n".join("KPLR%09i-2009131105131"%int(line.split()[0]) for line in file_P))
You can change this:
"".join(str(w) for w in nameData)
to this:
",".join(str(w) for w in nameData)
Basically, the "," will comma delimit the elements in your nameData list. If you use "", then there will be nothing to separate the elements, so they appear all at once. You can change the delimiter to suit your needs.
Just for kicks:
with open('Processed_Data.txt', 'r') as file_P, open('KIC_list.txt', 'w') as namelist:
nameData = [line.split()[0] for line in file_P]
namelist.write("\n".join("KPLR00" + str(key) + '-2009131105131' for key in nameData))
I think that will work, but I haven't tested it. You can make it even smaller/uglier by not using nameData at all, and just use that list comprehension right in its place.