Generating multiple .txt files from a single .csv? - python

I'm new to Python and am having a difficult time figuring out how to write a program that will write out a single .txt file for every line in a .csv file. For instance, I have the following .csv file with data from multiple calculations, and I need .txt files created for each individual calculation. Formatting is rough to do here, but the bold letters are column names and the corresponding elements are underneath (ex: "Run2" and "20" belong to column C).
A                            B      C      D
Title:                       Run1   Run 2  Run3
"Initial Composition: FeO"   10     20     30
"Initial Composition: MgO"   40     50     60
I want my Python code to output the following:
1.txt:
Title: Run 1
Initial Composition: FeO 10
Initial Composition: MgO 40
2.txt:
Title: Run 2
Initial Composition: FeO 20
Initial Composition: MgO 50
The elements from column A need to be printed in every .txt file, with the numbers from the various calculations contained in columns B, C, etc. printed beside them with a space. Bonus points for anyone who can also help me create custom filenames for the .txt files based on the title (ex: the data from column B creates a .txt file called "Run1.txt"). I don't know if assigning each column to a dictionary and then appending them all together would be the best route?
Thank you!

Something like this:
import csv

with open('runs.csv', newline='') as read_file:   # text mode with newline='' for the csv module (Python 3)
    reader = csv.reader(read_file)
    for run in reader:
        # first column: output file name, second column: line to write
        with open(run[0] + '.txt', 'w') as write_file:
            write_file.write(run[1] + '\n')
For a csv file with the format "Name of file","Run results", obviously this can be replaced with anything you want.
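If the runs actually sit in columns (as in the example above) rather than one per row, a rough sketch along these lines should get close to the requested output, including the custom file names; the input file name 'runs.csv' and the column layout are assumptions taken from the example:
import csv

# A rough sketch, assuming 'runs.csv' has the labels in column A and one run
# per remaining column, as in the example table in the question.
with open('runs.csv', newline='') as read_file:
    rows = list(csv.reader(read_file))

titles = rows[0]                          # e.g. ['Title:', 'Run1', 'Run 2', 'Run3']
for col in range(1, len(titles)):
    run_name = titles[col]                # used for the custom file name
    with open(run_name + '.txt', 'w') as write_file:
        for row in rows:                  # label from column A plus this run's value
            write_file.write(row[0] + ' ' + row[col] + '\n')
For the example data this would produce Run1.txt containing "Title: Run1", "Initial Composition: FeO 10" and "Initial Composition: MgO 40", and similarly for the other runs.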

Related

Code is working slow - performance issue in python

I have a file which has 4 columns with comma-separated values. I need only the first column, so I read the file, split each line on the comma, and store the first column in a list variable called first_file_list.
I have another file which has 6 columns with comma-separated values. My requirement is to read the first column of each row of the second file and check whether that string exists in the list called first_file_list. If it exists, copy that line to a new file.
My first file has approx. 6 million records and the second file has approx. 4.5 million records. Just to check the performance of my code, instead of 4.5 million I put only 100k records in the second file, and processing the 100k records takes approx. 2.5 hours.
Following is my logic for this:
first_file_list = []
with open(r"c:\first_file.csv") as first_f:
    next(first_f)  # ignoring first row as it is header and I don't need that
    temp = first_f.readlines()
    for x in temp:
        first_file_list.append(x.split(',')[0])
first_f.close()

with open(r"c:\second_file.csv") as second_f:
    next(second_f)
    second_file_co = second_f.readlines()
second_f.close()

out_file = open(r"c:\output_file.csv", "a")
for x in second_file_co:
    if x.split(',')[0] in first_file_list:
        out_file.write(x)
out_file.close()
Can you please help me understand what I am doing wrong here that makes my code take this much time to compare 100k records? Or can you suggest a better way to do this in Python?
Use a set for fast membership checking.
Also, there's no need to copy the contents of the entire file to memory. You can just iterate over the remaining contents of the file.
first_entries = set()
with open(r"c:\first_file.csv") as first_f:
    next(first_f)
    for line in first_f:
        first_entries.add(line.split(',')[0])

with open(r"c:\second_file.csv") as second_f:
    with open(r"c:\output_file.csv", "a") as out_file:
        next(second_f)
        for line in second_f:
            if line.split(',')[0] in first_entries:
                out_file.write(line)
Additionally, I noticed you called .close() on file objects that were opened with the with statement. Using with (a context manager) means all the clean-up is done when you exit its context, so it handles the .close() for you.
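For example, just to illustrate that point (the path is the same placeholder used above):
with open(r"c:\first_file.csv") as first_f:
    header = next(first_f)   # do the work inside the block
print(first_f.closed)        # True: the with statement already closed the file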
Work with sets - see below:
first_file_values = set()
second_file_values = set()

with open(r"c:\first_file.csv") as first_f:
    next(first_f)
    temp = first_f.readlines()
    for x in temp:
        first_file_values.add(x.split(',')[0])

with open(r"c:\second_file.csv") as second_f:
    next(second_f)
    second_file_co = second_f.readlines()
    for x in second_file_co:
        second_file_values.add(x.split(',')[0])

with open(r"c:\output_file.csv", "a") as out_file:
    for x in second_file_values:
        if x in first_file_values:
            out_file.write(x)
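As a side note on the design, once both sets are built, the final loop can also be expressed as a set intersection; a minimal sketch of the same idea (note that sets are unordered, so the output order may differ):
# keep only the first-column values that appear in both files
common_values = second_file_values & first_file_values
with open(r"c:\output_file.csv", "a") as out_file:
    for value in common_values:
        out_file.write(value + "\n")   # one matching value per line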

How to extract (slice) fixed sized 2D arrays from a formatted file using Python?

I have a file that contains a set of multiple 2D arrays stacked above each other and separated by strings. This file is generated by another piece of software whose output I have no control over.
I would like to separate these sets into separate files using a "for" or "while" loop.
i.e., the loop will:
look for where the string characters ("x" "y") begin,
copy the data until the next string begins (the 2nd string is not included in the copy),
create a new file and save the copied data,
proceed this way until all the data is copied out.
Here is an example of the data; this one contains only 2 sets, which could be handled manually, but I have a file that contains tens of sets (see the sketch after the example data for one possible approach):
"x" "y"
1e+06 28.1647
1.77828e+06 28.1647
3.16228e+06 28.1646
5.62341e+06 28.1646
1e+07 28.1645
1.77828e+07 28.1641
3.16228e+07 28.1629
5.62341e+07 28.1591
1e+08 28.1471
1.77828e+08 28.1095
3.16228e+08 27.9924
5.62341e+08 27.6412
1e+09 26.6846
1.77828e+09 24.5621
3.16228e+09 21.0562
5.62341e+09 16.5839
1e+10 11.599
1.77828e+10 6.18774
3.16228e+10 0.10613
5.62341e+10 -6.99352
1e+11 -15.4214
1.77828e+11 -25.7501
3.16228e+11 -40.0745
5.62341e+11 -58.1688
1e+12 -66.9569
"x" "y"
1e+06 28.1784
1.77828e+06 28.1784
3.16228e+06 28.1784
5.62341e+06 28.1783
1e+07 28.1782
1.77828e+07 28.1778
3.16228e+07 28.1767
5.62341e+07 28.173
1e+08 28.1614
1.77828e+08 28.1249
3.16228e+08 28.0114
5.62341e+08 27.6705
1e+09 26.738
1.77828e+09 24.6535
3.16228e+09 21.1808
5.62341e+09 16.7247
1e+10 11.7433
1.77828e+10 6.3266
3.16228e+10 0.230876
5.62341e+10 -6.88885
1e+11 -15.3386
1.77828e+11 -25.689
3.16228e+11 -40.0328
5.62341e+11 -58.2147
1e+12 -67.1267
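One possible sketch of the loop described above, assuming the blocks always start with the literal header line "x" "y" and that numbered output names like set_1.txt are acceptable (the input name data.txt and the output names are assumptions):
output = None
file_count = 0

with open("data.txt") as infile:              # "data.txt" is a placeholder input name
    for line in infile:
        if line.startswith('"x"'):            # a new set starts at every header line
            if output is not None:
                output.close()
            file_count += 1
            output = open(f"set_{file_count}.txt", "w")
        if output is not None:                # skips anything before the first header
            output.write(line)

if output is not None:
    output.close()
Each set_N.txt then contains one header line plus its data block; if the header should not be copied, add a continue right after the open call.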

Finding Min and Max Values in List Pattern in Python

I have a csv file and the data pattern is like this:
I am importing it from the csv file. In the input data there are some whitespaces, which I am handling with the pattern shown in the code below. For the output, I want to write a function that takes this file as input and prints the lowest and highest blood pressure. It should also return the average of all the mean values. Additionally, I am not supposed to use pandas.
I wrote the code block below.
import re

bloods = open(bloodfilename).read().split("\n")
blood_pressure = bloods[4].split(",")[1]
pattern = r"\s*(\d+)\s*\[\s*(\d+)-(\d+)\s*\]"
re.findall(pattern, blood_pressure)

# now extract mean, min and max information from the blood_pressure of each patient
# and write a new file called blood_pressure_modified.csv
pattern = r"\s*(\d+)\s*\[\s*(\d+)-(\d+)\s*\]"
outputfilename = "blood_pressure_modified.csv"

# create a writeable file
outputfile = open(outputfilename, "w")

for blood in bloods:
    patient_id, blood_pressure = blood.strip().split(",")
    mean = re.findall(pattern, blood_pressure)[0]
    blood_pressure_modified = re.sub(pattern, "", blood_pressure)
    print(patient_id, blood_pressure_modified, mean, sep=",", file=outputfile)

outputfile.close()
Output should look like this:
This is a very simple kind of answer: no regex, pandas or anything.
Let me know if this works for you. I can try to make it handle any case it doesn't.
bloods=open("bloodfilename.csv").read().split("\n")
means = []
'''
Also, rather than having two list of mins and maxs,
we can have just one and taking min and max from this
list later would do the same trick. But just for clarity I kept both.
'''
mins = []
maxs = []
for val in bloods[1:]: #assuming first line is header of the csv
mean, ranges = val.split(',')[1].split('[')
means.append(int(mean.strip()))
first, second = ranges.split(']')[0].split('-')
mins.append(int(first.strip()))
maxs.append(int(second.strip()))
print(f'the lowest and the highest blood pressure are: {min(mins)} {max(maxs)} respectively\naverage of mean values is {sum(means)/len(means)}')
You can also create small helper functions for the strip-and-split steps; that's usually a better way to structure the code. I wrote this in a bit of a hurry, so don't mind the roughness.
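For instance, the per-line parsing could be pulled into a small helper, roughly like this (same assumed layout and file name as in the snippet above):
def parse_blood_pressure(field):
    """Split a value like ' 135 [113-166]' into (mean, low, high) integers."""
    mean, ranges = field.split('[')
    low, high = ranges.split(']')[0].split('-')
    return int(mean.strip()), int(low.strip()), int(high.strip())

bloods = open("bloodfilename.csv").read().split("\n")
means, mins, maxs = [], [], []
for val in bloods[1:]:                    # same loop as above, now using the helper
    mean, low, high = parse_blood_pressure(val.split(',')[1])
    means.append(mean)
    mins.append(low)
    maxs.append(high)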
Maybe this could help with your question.
Suppose you have a CSV file like this, and you want to extract only the min and max values:
SN Number
1 135[113-166]
2 140[110-155]
3 132[108-180]
4 40[130-178]
5 133[118-160]
Then,
import pandas as pd

df = pd.read_csv("REPLACE_WITH_YOUR_FILE_NAME.csv")
results = df['YOUR_NUMBER_COLUMN'].apply(lambda x: x.split("[")[1].strip("]").split("-"))

with open("results.csv", mode="w") as f:
    f.write("MIN," + "MAX")
    f.write("\n")
    for i in results:
        f.write(str(i[0]) + "," + str(i[1]))
        f.write("\n")
After you run the snippet without any errors, there should be a file named results.csv in your current working directory. Open it up and you will have the results.
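Since the question says pandas should be avoided, here is a rough equivalent sketch using only the standard library and the regex pattern from the question; the file names and the column position are assumptions carried over from the snippets above:
import csv
import re

# Extract (mean, low, high) from values like "135[113-166]" and write MIN,MAX rows.
pattern = re.compile(r"\s*(\d+)\s*\[\s*(\d+)-(\d+)\s*\]")

with open("REPLACE_WITH_YOUR_FILE_NAME.csv", newline="") as f, \
     open("results.csv", "w", newline="") as out:
    reader = csv.reader(f)
    writer = csv.writer(out)
    next(reader)                      # skip the header row
    writer.writerow(["MIN", "MAX"])
    for row in reader:
        match = pattern.search(row[1])
        if match:                     # groups are mean, low, high
            _, low, high = match.groups()
            writer.writerow([low, high])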

U-SQL Python extension: very slow performance

I'm doing something seemingly trivial that takes much longer than I would expect it to. I'm loading a 70MB file, running it through a reducer which calls a Python script that does not modify the data, and writing the data back to a new file.
It takes 42 minutes when I run it through the Python script, but less than one minute (including compilation) if I don't.
I'm trying to understand:
What am I doing wrong?
What is going on underneath the hood that takes so long?
I store the input and output files on Azure Data Lake Store. I'm using parallelism 1, a TSV input file of about 70MB (2000 rows, 2 columns). I'm just passing the data through. It takes 42 minutes until the job finishes.
I generated the test input data with this Python script:
import base64
import os

# create a roughly 70MB TSV file with 2000 rows and 2 columns:
# ID (integer) and roughly 30KB of data (string)
fo = open('testinput.tsv', 'wb')
for i in range(2000):
    fo.write(str(i).encode() + b'\t' + base64.b85encode(bytearray(os.urandom(30000))) + b'\n')
fo.close()
This is the U-SQL script I use:
REFERENCE ASSEMBLY [ExtPython];

DECLARE @myScript = @"
def usqlml_main(df):
    return df
";

@step1 =
    EXTRACT
        col1 string,
        col2 string
    FROM "/test/testinput.tsv" USING Extractors.Tsv();

@step2 =
    REDUCE @step1 ON col1
    PRODUCE col1 string, col2 string
    USING new Extension.Python.Reducer(pyScript:@myScript);

OUTPUT @step2
TO "/test/testoutput.csv"
USING Outputters.Tsv(outputHeader: true);
I have the same issue.
I have a 116 MB csv file I want to read in (and then do stuff with). When trying to read in the file and do nothing in the Python script, it times out after 5 hours. I even tried reducing the file to 9.28 MB; it also times out after 5 hours.
However, when reduced to 1.32 MB, the job finishes after 16 minutes (with the results as expected).
REFERENCE ASSEMBLY [ExtPython];

DECLARE @myScript = @"
def usqlml_main(df):
    return df
";

@train =
    EXTRACT txt string,
            casegroup string
    FROM "/test/t.csv"
    USING Extractors.Csv();

@train =
    SELECT *,
           1 AS Order
    FROM @train
    ORDER BY Order
    FETCH 10000;

@train =
    SELECT txt,
           casegroup
    FROM @train; // 1000 rows: 16 mins, 10000 rows: times out at 5 hours.

@m =
    REDUCE @train ON txt, casegroup
    PRODUCE txt string, casegroup string
    USING new Extension.Python.Reducer(pyScript:@myScript);

OUTPUT @m
TO "/test/t_res.csv"
USING Outputters.Csv();
Use REDUCE @ROWSET ALL.
If you don't reduce ON ALL, it will invoke the Python function once per reduce group, which here (with keys that are effectively unique per row) means once per row.
If you want to use parallelism, you could create temporary groups to reduce on.

Sorting content of a text file in python 3.x

I would like to sort a text file which contains lines like:
1000S 00RR: 20 values
1200S -10RR: 10 values
900S -20RR: 6 values
150S -05RR: 4 values
10000S 00RR: 2 values
I want to sort it as follows (in ascending order, taking the numerical value of the first element before the space into consideration):
150S -05RR: 4 values
900S -20RR: 6 values
1000S 00RR: 20 values
1200S -10RR: 10 values
10000S 00RR: 2 values
I was wondering what would be a better way to implement it.
I tried the following:
def sort_lines(file_name):
    list_one = []
    with open(file_name, "r") as file_name_opened:
        lines = file_name_opened.readlines()
    for x in range(0, 20):
        try:
            list_one.append(lines[x])
        except IndexError:
            pass
    print("sorted: " + str(sorted(list_one)))
    return sorted(list_one)
It would be nice to know if there is a better way to do it...
An open file is an iterable of strings that end with '\n'. While developing code to process an open file as such an iterable, one can instead define such an iterable in the code itself. This can be done with a multiline string literal and splitlines, as shown below. This makes it possible to post code that readers can execute themselves without the bother of creating and later deleting a separate file.
The key problem in sorting is to define the key function. Based on your example, you want the integer value of each line up to the 'S'. So you need to find the 'S', split out the preceding digits, and convert the number substring to an int.
data = '''\
1000S 00RR: 20 values
1200S -10RR: 10 values
900S -20RR: 6 values
150S -05RR: 4 values
10000S 00RR: 2 values
'''.splitlines(keepends=True)
def keyfunc(line):
    return int(line[:line.index('S')])
test_int = keyfunc("1000S 00RR: 20 values") # for testing only
print(type(test_int), test_int)
# <class 'int'> 1000
out = sorted(data, key=keyfunc)
for line in out: print(line, end='')
This code prints the output you requested. To use it with a file, delete the data definition and the keyfunc test statements, and wrap the sorted call in with open(filename) as data:
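For reference, that file-based version would look roughly like this (the file name is a placeholder):
def keyfunc(line):
    return int(line[:line.index('S')])

with open("lines_to_sort.txt") as data:   # placeholder file name
    out = sorted(data, key=keyfunc)

for line in out:
    print(line, end='')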
