If I have, for example, 3 txt files that look as follows:
file1.txt:
a 10
b 20
c 30
file2.txt:
d 40
e 50
f 60
file3.txt:
g 70
h 80
i 90
I would like to read the data from these files and create a single Excel file that looks like this:
Specifically in my case I have 100+ txt files that I read using glob and loop.
Thank you
There's a bit of logic involved in getting the output you need.
First, process each input file into a separate list of rows. You might need to adjust this logic depending on the actual contents of your files, since you need to be able to recover the columns; for the samples provided, the logic below works.
I added a safety check to see if the input files have the same number of rows. If they don't, it will seriously mess up the resulting Excel file, so you'll need to add some handling for a length mismatch (a sketch of one option follows the main code below).
Writing the Excel file is very easy using pandas in combination with openpyxl. There are likely more elegant solutions, but I'll leave that to you.
I'm referencing some SO answers in the code for further reading.
requirements.txt
pandas
openpyxl
main.py
# we use pandas for easy saving as XLSX
import pandas as pd

filelist = ["file1.txt", "file2.txt", "file3.txt"]

def load_file(filename: str) -> list:
    result = []
    with open(filename) as infile:
        # splitlines() is OS agnostic and removes the EOL characters
        for line in infile.read().splitlines():
            # split() splits on whitespace by default
            result.append(line.split())
    return result
loaded_files = []
for filename in filelist:
    loaded_files.append(load_file(filename))

# you will want to check that the files have the same number of rows,
# it will break stuff if they don't; you could fix that by appending
# empty rows (a sketch follows below the main code)
# stolen from:
# https://stackoverflow.com/a/10825126/9267296
len_first = len(loaded_files[0]) if loaded_files else None
if not all(len(i) == len_first for i in loaded_files):
    print("length mismatch")
    exit(419)
# generate an empty list of lists so we don't get an IndexError below
# stolen from:
# https://stackoverflow.com/a/33990699/9267296
result = [[] for _ in range(len(loaded_files[0]))]
for f in loaded_files:
    for index, row in enumerate(f):
        result[index].extend(row)
        # the empty string creates a spacer column between files
        result[index].append('')

# trim the last empty column
result = [line[:-1] for line in result]
# write as excel file
# stolen from:
# https://stackoverflow.com/a/55511313/9267296
# note that there are some other options on this SO question, but this one
# is easily readable
df = pd.DataFrame(result)
with pd.ExcelWriter("output.xlsx") as writer:
    df.to_excel(writer, sheet_name="sheet_name_goes_here", index=False)
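If you do hit a length mismatch, one option instead of exiting is to pad the shorter files with empty rows before building the result. A minimal sketch, assuming every row within a file has the same number of columns:

max_len = max(len(f) for f in loaded_files)
for f in loaded_files:
    # pad with rows of empty strings until every file has max_len rows
    width = len(f[0]) if f else 0
    f.extend([[''] * width for _ in range(max_len - len(f))])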
result:
Still quite new to this and am struggling.
I have a directory of a few hundred text files, each file has thousands of lines of information on it.
Some lines contain one number, some many
example:
39 312.000000 168.871795
100.835446
101.800298
102.414406
104.491999
108.855079
107.384008
103.608815
I need to pull all of the information from each text file. I want the name of the text file (minus the '.txt') in the first column, and all the other values following it to complete the row (regardless of their layout within the file).
import pandas as pd
import os

data = '/path/to/data/'
path = '/other/directory/path/'
lst = ['list of files needed']

for dirpath, dirs, subj in os.walk(data):
    while i <= 5:  # currently being used to break before iterating through entire directory to check it's working
        with open(dirpath + lst[i], 'r') as file:
            info = file.read().replace('\n', '')  # txt file onto one line
            corpus.append(lst[i] + ' ')  # begin list with txt file name
            corpus.append(info)  # add file contents to list after file name
            output = ''.join(corpus)  # get out of list format
            output.split()
            i += 1
            df = pd.read_table(output, lineterminator=',')
            df.to_csv(path + 'testing.csv')
        if i > 5:
            break
Currently, this prints Errno 2 (no such file or directory), then goes on to print the contents of the first file and no others, and it does not save to csv.
This also seems horribly convoluted and I'm sure there's another way of doing it
I also suspect the lineterminator will not force each new text file onto a new row, so any suggestions there would be appreciated
desired output:
file1 39 312.000 168.871
file2 72 317.212 173.526
You are already loading os and pandas, so take advantage of their functionality (listdir, path, DataFrame, concat, and to_csv) to drastically reduce your code's complexity.
import os
import pandas as pd

data = 'data/'
path = 'output/'

files = os.listdir(data)
output = pd.DataFrame()
for file in files:
    file_name = os.path.splitext(file)[0]
    with open(os.path.join(data, file)) as f:
        info = [float(x) for x in f.read().split()]
    # print(info)
    df = pd.DataFrame(info, columns=[file_name], index=range(len(info)))
    output = pd.concat([output, df], axis=1)
output = output.T
print(output)
# keep the index: after the transpose it holds the file names,
# which should end up in the first column of the csv
output.to_csv(path + 'testing.csv')
I would double-check that your data folder only has txt files, and maybe add a check for txt files to the code (see the sketch below).
This got less elegant as I learned about the requirements. If you want to flip the columns and rows, just take out the output = output.T line; .T transposes the dataframe.
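A minimal version of that check, assuming the data files all use a plain '.txt' extension:

files = [f for f in os.listdir(data) if f.endswith('.txt')]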
I have about 200 CSV files and I need to combine them on specific columns. Each CSV file contains 1000 filled rows on specific columns. My file names are like below:
csv_files = [en_tr_translated0.csv, en_tr_translated1000.csv, en_tr_translated2000.csv, ......... , en_tr_translated200000.csv]
My csv file columns are like below:
The first two columns are prefilled with the same 200,000 rows/sentences in all the csv files. Each en_tr_translated{ }.csv file contains the 1000 translated sentences indicated by its file name. For example:
en_tr_translated1000.csv contains the translated sentences from row 0 to row 1000, en_tr_translated2000.csv contains the translated sentences from row 1000 to row 2000, etc. The rest is nan/empty. Below is an example image from the en_tr_translated3000.csv file.
I want to copy/merge/join the rows to have one full csv file that contains all the translated sentences. I tried the below code:
out = pd.read_csv(path + 'en_tr_translated0.csv', sep='\t', names=['en_sentence', 'tr_sentence', 'translated_tr_sentence', 'translated_en_sentence'], dtype=str, encoding='utf-8', low_memory=False)
##
i = 1000
for _ in tqdm(range(200000)):
    new = pd.read_csv(path + f'en_tr_translated{i}.csv', sep='\t', names=['en_sentence', 'tr_sentence', 'translated_tr_sentence', 'translated_en_sentence'], dtype=str, encoding='utf-8', low_memory=False)
    out.loc[_, 'translated_tr_sentence'] = new.loc[_, 'translated_tr_sentence']
    out.loc[_, 'translated_en_sentence'] = new.loc[_, 'translated_en_sentence']
    if _ == i:
        i += 1000
Actually, it works fine, but my problem is that it takes 105 HOURS!!
Is there any faster way to do this? I have to do this for like 5 different datasets and this is getting very annoying.
Any suggestion is appreciated.
Your input files have exactly one row of data per line in the file, correct? Then it would probably be even faster not to use pandas at all, although done correctly, 200,000 rows should still be very fast whether you use pandas or not.
To do it without pandas: just open each file, move to the fitting index, and write 1000 lines to the output file, then move on to the next file. You might have to fix headers etc., and watch out for shifts in the indices, but here is an idea of how to do that:
with open(path + 'en_tr_translated_combined.csv', 'w') as f_out:  # open the output file in write mode
    for filename_index in tqdm(range(0, 201000, 1000)):  # iterate over the file indices in steps of 1000 between 0 and 200000
        with open(path + f'en_tr_translated{filename_index}.csv') as f_in:  # open the file with that index
            for row_index, line in enumerate(f_in):  # iterate over its rows
                if row_index < filename_index:  # skip rows until you reach the ones with translated content
                    continue
                if row_index >= filename_index + 1000:  # stop once the translations end (>= so exactly 1000 lines are copied)
                    break
                f_out.write(line)  # for the rows in between: copy the content to the output file
I would load all the files, drop the rows that are not fully filled, and afterwards concatenate all of the dataframes.
Something like:
from pathlib import Path

import pandas as pd

dfs = []
for ff in Path('.').rglob('*.csv'):
    dfs.append(pd.read_csv(ff, names=['en_sentence', 'tr_sentence', 'translated_tr_sentence', 'translated_en_sentence'], dtype=str, encoding='utf-8', low_memory=True).dropna())
df = pd.concat(dfs)
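One caveat, assuming each file really carries all 200,000 prefilled rows: dropna() preserves each row's original index, so the original sentence order can be restored after concatenating:

df = pd.concat(dfs).sort_index()  # the index survives dropna(), so this restores the row order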
I have multiple CSV files that I want to compare. The file contents are the same except for some additional changes, and I want to list those additional changes.
For eg:
files = [1.csv, 2.csv, 3.csv]
I want to compare 1.csv and 2.csv, get the difference and store somewhere, then compare 2.csv and 3.csv, store the diff somewhere.
for dirs in glob.glob(INPUT_PATH + "*"):
    if os.path.isdir(dirs):
        for files in glob.glob(dirs + '*/' + '/*.csv'):
            ## this lists all the csv files, but how do I read them to get the difference?
You can use pandas to read each csv into a dataframe and collect them in a list, then compare them from that list:
import pandas as pd

dfList = []
dfList.append(pd.read_csv('FilePath'))
dfList[0] contains the content of the first csv file, and so on.
So, to compare the first and second csv files, you compare dfList[0] and dfList[1].
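A minimal sketch of one way to list the differing rows, assuming the files share the same columns and contain no duplicate rows within a file (concatenating two frames and dropping every duplicated row keeps exactly the rows that appear in only one of the two files):

import pandas as pd

files = ['1.csv', '2.csv', '3.csv']
dfList = [pd.read_csv(f) for f in files]

for first, second in zip(dfList, dfList[1:]):
    # rows that appear in exactly one of the two files
    diff = pd.concat([first, second]).drop_duplicates(keep=False)
    print(diff)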
The first function compares 2 files and the second function creates an additional file with the difference between the 2 files.
def compare(file_compared, file_master):
    """
    A = [100,200,300]
    B = [400,500,100]
    compare(A,B) = [200,300]
    """
    file_compared_list = []
    file_master_list = []
    with open(file_compared, 'r') as fc:
        for line in fc:
            file_compared_list.append(line.strip())
    with open(file_master, 'r') as fm:
        for line in fm:
            file_master_list.append(line.strip())
    return list(set(file_compared_list) - set(file_master_list))

def create_file(filename):
    diff = compare("file1.csv", "file2.csv")
    with open(filename, 'w') as f:
        for element in diff:
            f.write(element + '\n')  # re-add the newline that strip() removed

create_file("test.csv")
I have a CSV file and I want to:
1. Import the CSV as a Dataframe
2. Read in a row at a time
3. Copy the VALUES of each cell to a separate string
4. Print the strings
5. Go to the next row and repeat steps 3-4 until done.
My code kind of works: it does read in and print the first 2 rows, but there are 6 in my CSV file.
I tried adding an index field, but that didn't help much: 3 lines printed instead of 6.
Here is what my CSV file looks like (the extra line return is for readability and is not in my file):
00C525B70C246049E4.dwg,011021a.dwg
00CD5B2301DF204DCC.dwg,010636e.dwg
00F70B6C0B1EF04B54.dwg,005159v.dwg
0A02B9F7087BF040D5.dwg,003552n.dwg
0A1EE7CC078B404C64.dwg,020526c.dwg
0A1F67D201CCD04F81.doc,X1771-a.doc
import pandas

colnames = ['infocard', 'file_name']
data = pandas.read_csv('E:/test_Files_To_Rename.csv', names=colnames)
for i, elem in enumerate(data, 0):
    sfile = data.loc[i, "infocard"]
    dst = data.loc[i, "file_name"]
    print(sfile + ' to ' + dst)
Once I get the program to print the two different file names I want to replace the print statement with:
os.rename(sfile, dst)
so I can rename the files. I am testing with 6 files; my database has 50,000 files, which is why I want to use a script.
This is what is displayed:
00C525B70C246049E4.dwg to 011021a.dwg
00CD5B2301DF204DCC.dwg to 010636e.dwg
Any ideas?
Thanks!
I used the following code to iterate through the .csv spreadsheet:
import pandas as pd

# header=None because the file has no header row; otherwise the first
# data row would be swallowed as column names
df = pd.read_csv('/home/stephen/Desktop/data.csv', header=None)
for i in range(len(df)):
    sfile = df.values[i][0]
    dst = df.values[i][1]
    print(sfile + ' to ' + dst)
I got the following output:
00C525B70C246049E4.dwg to 011021a.dwg
00CD5B2301DF204DCC.dwg to 010636e.dwg
00F70B6C0B1EF04B54.dwg to 005159v.dwg
0A02B9F7087BF040D5.dwg to 003552n.dwg
0A1EE7CC078B404C64.dwg to 020526c.dwg
0A1F67D201CCD04F81.doc to X1771-a.doc
This is the spreadsheet that I used:
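Once the printed pairs look right, the print can be swapped for the actual rename. A sketch reusing the dataframe from the snippet above, assuming the old and new names live in a single folder (the folder path here is hypothetical):

import os

folder = '/path/to/dwg/files'  # hypothetical location of the files
for i in range(len(df)):
    sfile = df.values[i][0]
    dst = df.values[i][1]
    os.rename(os.path.join(folder, sfile), os.path.join(folder, dst))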
I currently have 1 text file (tab del) with 15,000 records.
I have another text file (tab del) with 5000 records.
In the file with 5000 records there are some rows that match rows in the file containing 15,000 records. These are identifiable by a column header named URN (unique record number). For example, I may need URN 62294 to be taken out of the main file, but I don't know that I have to take it out until I compare the two files and see that it is in both.
How difficult is this to do in Python?
Try installing pandas with pip install pandas
Then run this:
import pandas as pd

filename1 = 'file1.txt'  # the main file
filename2 = 'file2.txt'  # the other file

main = pd.read_csv(filename1, sep='\t')  # sep='\t' is for tab delimited files
side = pd.read_csv(filename2, sep='\t')

main['URN'] = main['URN'].astype(int)
side['URN'] = side['URN'].astype(int)

merge = pd.merge(main, side, on='URN', how='inner')  # how='inner' keeps only URN values present in both files
# merge = merge[merge['URN'] != 62294]
print(merge)
merge.to_excel('Output.xlsx', index=False)
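The inner merge lists the URNs that occur in both files. If the goal is instead to take those rows out of the main file, a sketch of the opposite filter:

# keep only the main-file rows whose URN does NOT appear in the other file
cleaned = main[~main['URN'].isin(side['URN'])]
cleaned.to_excel('Cleaned.xlsx', index=False)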
Is it difficult? No, you could do it rather easily with:
file1 = open("file1.txt", "r")
results = []
for line in file1:
    file2 = open("file2.txt", "r")
    for l in file2:
        if l.split("\t")[0] == line.split("\t")[0]:
            results.append(l.split("\t")[0])
            break
    file2.close()
file1.close()

for i in results:
    print(i)
Now, is it the best way? Probably not for large files.
(Took me 74 seconds with your files).
You can look at the pandas library. It will allow you to load both tables as dataframes and join them on the needed column in SQL-like style. It should be rather easy with the documentation.
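A minimal sketch of that idea, assuming tab-delimited files with a URN column (the file names follow the earlier answers):

import pandas as pd

main = pd.read_csv("file1.txt", sep="\t")
side = pd.read_csv("file2.txt", sep="\t")

# inner join: rows whose URN appears in both files
common = main.merge(side, on="URN", how="inner")
print(common["URN"].tolist())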