Python: append not adding to an empty list

Hi, sorry if this is an obvious one; I've looked around online and I can't seem to find what I'm doing wrong.
I am trying to compare the contents of two lists, in two separate csv files (file A and file B). Both csv files are of x rows, but only 1 column each. File A consists of rows with sentences in each, file B consists of single words. My goal is to search the rows in file B, and if any of these rows appear in file A, append the relevant rows from file A to a separate, empty list to be exported. The code I am running is as follows:
import pandas as pd
#Importing csv files
##File A is 2 rows of sentences in 1 column, e.g. "This list should be picked up for the word DISHWASHER" and "This sentence should not appear in list_AB"
file_A = pd.read_csv(r"C:\Users\User\Desktop\File\file_A.csv")
##File B is 2 rows of singular words that should appear in file A e.g. "DISHWASHER", "QWERTYXYZ123"
file_B = pd.read_csv(r"C:\Users\User\Desktop\File\file_B.csv", converters={i: str for i in range(10)})
#Convert csv files to lists
file_A2 = file_A.values.tolist()
file_B2 = file_B.values.tolist()
#Empty list
list_AB = []
#for loop supposed to filter file_A based on file_B
for x in file_A2:
    words = x[0].split(" ")
    #print(words)
    for y in file_B2:
        #print(y)
        if y in words:
            list_AB.append(x)
print(list_AB)
The problem is that print(list_AB) only returns an empty list ([]), not a filtered version of file_A. The reason I want to do it this way is because the actual csv files I want to read consist of 21600 (file A) and 50400 (file B) rows. Apologies in advance if this is a really basic question.
Edit: Added images of csv file examples, couldn't see how to upload files.

The problem is in the if-statement y in words.
Here y is a list: DataFrame.values.tolist() gives a list of rows, so each y is a one-element list such as ['DISHWASHER']. You're searching for a list inside words, which is a list of strings (not a list of lists).
Using y[0] in words solves your problem.
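For reference, here is a minimal sketch of the corrected loop, with the same variable names as the question. The set is an optional speed-up I am adding for the 21600 x 50400 row case, not something from the original code:
#Build a set of the single words from file B for fast membership tests
file_B_words = {y[0] for y in file_B2}
list_AB = []
for x in file_A2:
    words = x[0].split(" ")
    #Keep the row from file A if any of its words appears in file B
    if any(word in file_B_words for word in words):
        list_AB.append(x)
print(list_AB)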

commas in between data cells not quoted load to dataframe in pandas

I need to read a comma-separated CSV file that has commas inside cells with no quotes, in Python. For example, the CSV file is in the below format:
product,unit,count,alter,denom
(any name or id) xyz,kg,1,000,volume,1
1142,KG,1,000,L,910
1143,v,1,000,L,910
11144,K,1,EA,1
11529,KG,1,EA,1
11548,V,1,EA,10
11551,V,1,EA,4
11562,K,1,000,TO,1
11567,K,28,EA,100
11569,v,1,000,TO,1
Here the count value is 1,000, but it is separated by a comma, which gives 2 values. This should be rectified and the data loaded into dataframes. The output should be like:
product unit count alter denom
xyz kg 1,000 volume 1
1142 KG 1,000 L 910
I have used
df=pd.read_csv("filename.csv",sep=",")
The fundamental problem is that your input is not a valid .csv file. Either a comma is part of the data or it is a field delimiter. It can't be both.
The simplest approach is to go back to whoever or whatever supplied the file and complain that the format is invalid.
The producer of the file has several, usually easy, options to fix this: (1) suppress the thousands separator; (2) quote the field containing the comma, for example "1,000"; (3) choose a different field delimiter, such as ;. The last is a very common approach in Europe, where , frequently serves as the decimal point, so treating it as a field delimiter is a bad idea.
You should not be in the position of having to clean up someone else's sloppy export.
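As an aside, option (2) is what Python's own csv module does automatically on output, so a producer writing the file with it would not have this problem. A minimal sketch (the row values here are invented for the example):
import csv
import sys

# csv.writer quotes any field containing the delimiter, so "1,000" survives intact
writer = csv.writer(sys.stdout)
writer.writerow(["xyz", "kg", "1,000", "volume", "1"])
# prints: xyz,kg,"1,000",volume,1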
However, since you have the file that you have, and don't seem in a position to take this advice, your only option is to reprocess the file so that it is valid.
The approach is to read the defective input file, check each row to see how many fields it has, and if it has one too many and the cause is a thousands separator comma masquerading as a field delimiter, then glue the two halves of the number back together; and then write out the modified file.
# fixit.py
# Program to accept an invalid csv file with an unescaped comma in column 3 and regularize it
# Use like this: python fixit.py < wrongfile.csv > rightfile.csv
import sys
import csv

def fix(row: list[str]) -> list[str]:
    """
    If there are 5 columns:
        return unchanged.
    If there are 6 columns
    and columns 2 and 3 can be interpreted as a number with a thousands separator:
        combine columns 2 and 3 and return the row.
    Otherwise return an empty list.
    """
    if len(row) == 5:
        return row
    if len(row) == 6 and row[2].isdigit() and row[3].isdigit():
        return row[:2] + [row[2] + row[3]] + row[4:]
    return []

def main(infile, outfile):
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    for row in reader:
        if fixed_row := fix(row):
            writer.writerow(fixed_row)
        else:
            print(f"Line {reader.line_num} could not be fixed", file=sys.stderr)

if __name__ == '__main__':
    sys.stdout.reconfigure(newline="")
    # This is because module csv does its own thing with end-of-line and requires the file have newline=""
    main(sys.stdin, sys.stdout)
Given this input:
product,unit,count,alter,denom
(any name or id) xyz,kg,1,000,volume,11142,KG,1,000,L,910
1143,v,1,000,L,910
11144,K,1,EA,1
11529,KG,1,EA,1
11548,V,1,EA,10
11551,V,1,EA,4
11562,K,1,000,TO,1
11567,K,28,EA,100
11569,v,1,000,TO,1
you will see this output:
product,unit,count,alter,denom
1143,v,1000,L,910
11144,K,1,EA,1
11529,KG,1,EA,1
11548,V,1,EA,10
11551,V,1,EA,4
11562,K,1000,TO,1
11567,K,28,EA,100
11569,v,1000,TO,1
along with a warning written to the console about line 2.
Your question shows the data with a blank line between each row of data. I'm assuming that your data is not really like that and that the blank lines are an artifact of formatting the Stack Overflow question. But if your data really is like that, the program will still work; you will just get a lot of warnings about blank lines. There won't be any blank lines in the output, which is fine, because pandas.read_csv() doesn't need them.

How to separate elements from textfile based on substring of element into 2 output files

I have a long list of animal identifiers in a text file. Our convention is to use two alphabetical characters, followed by a litter identifier, a dash, and then the animal id within that litter. The number before the dash identifies whether they are control or manipulated animals.
So it looks like this (with explanations in parentheses, which are not in the text file). The only things in the text file are the identifier and possibly some data after that identifier on the same line:
XL20-4 is a control animal (0 - even),
XL21-4 is a manipulated animal (1 - odd),
Running all the way to the 300s
XL304-5 (4 - even - control),
XL303-4 (3 - odd - manipulated).
First, how do I create an ordered list of the animals in each condition, in separate text files built from the original text file, so they can then be read by our MATLAB code? The new text files need to retain the order of animal generation,
i.e.
XL302-4,
XL304-5,
XL304-6,
XL306-1,
each line with a '\n' ending.
Thanks in advance.
Based on what you said, this would be the way to do it, but some finer tweaking may be needed because the original file contents are unknown (the file name and how the entries are placed in the text file):
import re

def write_to_file(file_name, data_to_write):
    with open(file_name, 'w') as file:
        for item in data_to_write:
            file.write(f"{item}\n")

# read contents from file
with open('original.txt', 'r') as file:
    contents = file.readlines()

# assuming that each of the 'XL20-4,' are on a new line
control_group = []
manipulated_group = []
for item in contents:
    item = item.rstrip('\n')  # drop the trailing newline so write_to_file doesn't double it
    # get only the first number, between the letters and the dash
    test_generation = int(item[re.search(r"\d", item).start():item.find('-')])
    if test_generation % 2:  # an even number evaluates to 0 ~ being false, so evens fall through to else
        manipulated_group.append(item)
    else:
        control_group.append(item)

# write to files with the data
write_to_file('control.txt', control_group)
write_to_file('manipulated.txt', manipulated_group)
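For example, if original.txt held the four identifiers from the question (a hypothetical run):
XL20-4
XL21-4
XL304-5
XL303-4
then control.txt would come out containing XL20-4 and XL304-5, and manipulated.txt would contain XL21-4 and XL303-4, each in their original order.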

I need to append a list into another list (specific issue)

This is my code:
files = open('clean.txt').readlines()
print files
finallist = []
for items in files:
    new = items.split()
    new.append(finallist)
And since the text file is too huge, here is an example of "print files":
files = ['chemistry leads outstanding another story \n', 'rhapsodic moments blow narrative prevent bohemian rhapsody']
I really need each line to be split into words and placed in a list of lists, just like the format below:
outcome = [['chemistry','leads','outstanding', 'another', 'story'],['rhapsodic','moments','blow', 'narrative', 'prevent', 'bohemian', 'rhapsody']]
I've tried methods just like the first code given and it returns an empty list. Please help! Thanks in advance.
The last line of your code is backwards, it seems. Instead of
new.append(finallist)
it should be
finallist.append(new)
I changed the last line to the version above, and the result was a list (finallist) containing 2 sub-lists. Here is the code that seems to work:
files = open('clean.txt').readlines()
print files
finallist = []
for items in files:
    new = items.split()
    finallist.append(new)
Use a list comprehension to reduce this to one line:
finallist = [i.split() for i in files]
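A quick check against the sample data from the question (a sketch; note that split() with no argument also discards the trailing '\n'):
files = ['chemistry leads outstanding another story \n', 'rhapsodic moments blow narrative prevent bohemian rhapsody']
finallist = [i.split() for i in files]
print finallist
# [['chemistry', 'leads', 'outstanding', 'another', 'story'], ['rhapsodic', 'moments', 'blow', 'narrative', 'prevent', 'bohemian', 'rhapsody']]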

Reading columns of a txt file on python

I am working with a .txt file. This has 100 rows and 5 columns. I need to divide it into five vectors of length 100, one for each column. I am trying to follow this: Reading specific columns from a text file in python.
However, when I implement it as:
token = open('token_data.txt','r')
linestoken=token.readlines()
resulttoken=[]
for x in linestoken:
    resulttoken.append(x.split(' ')[1])
token.close()
I don't know how this is stored. If I write print('resulttoken'), nothing appears on my screen.
Can someone please tell me what I am doing wrong?
Thanks.
(image: part of my text file)
x.split(' ') is not useful, because the columns of your text file are separated by more than one space. Use x.split(), which splits on any run of whitespace:
token = open('token_data.txt','r')
linestoken=token.readlines()
tokens_column_number = 1
resulttoken=[]
for x in linestoken:
    resulttoken.append(x.split()[tokens_column_number])
token.close()
print(resulttoken)
Well, the file looks like it is split by tabs rather than spaces, so try this:
token = open('token_data.txt','r')
linestoken=token.readlines()
tokens_column_number = 1
resulttoken=[]
for x in linestoken:
    resulttoken.append(x.split('\t'))
token.close()
print(resulttoken)
You want a list of five distinct lists, and append to each in turn.
columns = [[] for _ in range(5)]  # five distinct lists; [[]] * 5 would repeat one shared list
with open('token_data.txt','r') as token:
    for line in token:
        for field, value in enumerate(line.split()):
            columns[field].append(value)
Now, you will find the first value from the first line in columns[0][0], the second value from the first line in columns[1][0], the first value from the second line in columns[0][1], etc.
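If you prefer, the same transposition can be written in one line with zip (a sketch under the same assumptions about token_data.txt; zip(*...) flips rows into columns and truncates at the shortest row):
with open('token_data.txt','r') as token:
    columns = [list(col) for col in zip(*(line.split() for line in token))]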
To print the value of a variable, don't put quotes around it. Quotes create a literal string.
print(columns[0][0])
prints the value of columns[0][0] whereas
print('columns[0][0]')
simply prints the literal text "columns[0][0]".
You can use the data_py package to read column-wise data in FORTRAN style.
Install this package using
pip install data-py
Usage Example
from data_py import datafile
NoOfLines=0
lineNumber=2 # Line number to read (excluding lines starting with '#')
df1=datafile("C:/Folder/SubFolder/data-file-name.txt")
df1.separator="," # No need to specify if the separator is space(" "); for tab-separated values use '\t'
NoOfLines=df1.lines # Total number of lines in the data file (excluding lines starting with '#')
[Col1,Col2,Col3,Col4,Col5]=["","","","",""] # Initial values
[Col1,Col2,Col3,Col4,Col5]=df1.read([Col1,Col2,Col3,Col4,Col5],lineNumber)
print(Col1,Col2,Col3,Col4,Col5) # In str format
For details please follow the link https://www.respt.in/p/python-package-datapy.html

Copy number file format issue (Need to modify the structure)

I have a file in a special format, .cns, which is a segmented file used to analyze copy number. It is a text file that looks like this (header plus first line):
head -1 copynumber.cns
chromosome,start,end,gene,log2
chr1,13402,861395,"LOC102725121,DDX11L1,OR4F5,LOC100133331,LOC100132062,LOC100132287,LOC100133331,LINC00115,SAMD11",-0.28067
We transformed it to a .csv so we could separate it by tab (but it didn't work well). The .cns is separated by commas, but the genes are a single string delimited by quotes. I hope this is useful. The output I need is something like this:
gene log2
LOC102725121 -0.28067
DDX11L1 -0.28067
OR4F5 -0.28067
PIK3CA 0.35475
NRAS 3.35475
The first step would be to separate everything by commas, then transpose columns, and finally print the log2 value for each gene that was contained in that quote-delimited string. If you could help me with an R or Python script, it would help a lot. Perhaps awk would work too.
I am using Linux Ubuntu v16.04.
I'm not sure if I am being clear, let me know if this is useful.
Thank you!
Hope the following code in Python helps:
import csv

list1 = []
with open('copynumber.cns','r') as file:
    exampleReader = csv.reader(file)
    for row in exampleReader:
        list1.append(row)

for row in list1:
    strings = row[3].split(',')  # Get fourth column in CSV, i.e. gene column, and split on occurrence of comma
    for string in strings:  # Loop through each string
        print(string + ' ' + str(row[4]))
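If a dataframe is more convenient than printed lines, here is an alternative sketch using pandas (my own suggestion, not part of the answer above; explode requires pandas 0.25+):
import pandas as pd

# read_csv honours the quotes around the gene list
df = pd.read_csv('copynumber.cns')
# split the quoted gene string into a list, then explode to one row per gene
df['gene'] = df['gene'].str.split(',')
out = df.explode('gene')[['gene', 'log2']]
print(out.to_string(index=False))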
