Replace text strings in a file that start with certain characters - python

I would like to replace text in a file by searching for specific letters at the beginning of the string. For example here is a section of the file:
6 HT 4.092000 4.750000 -0.502000 0 5 7
7 HT 5.367000 5.548000 -0.325000 0 5 6
8 OT -5.470000 5.461000 1.463000 0 9 10
9 HT -5.167000 4.571000 1.284000 0 8 10
10 HT -4.726000 6.018000 1.235000 0 8 9
11 OT -4.865000 -5.029000 -3.915000 0 12 13
12 HT -4.758000 -4.129000 -3.608000 0 11 13
I would like to use "HT" as the search and be able to replace the "space0space" with 2002. When I try I replace all 0 with 2002 and not the column that is just 0. After this I need to then search "OT" and replace the 0 column with 2001.
So basically I need to search a string that identify the line and replace a column specific string while the text that lies between is variable. The output needs to be printed to a new_file.xyz. Also I will be doing this repeatedly on lots of files so it would be great to be a script that can typed in front of the file that will be operated on. Thanks.

This should do it for you (for HT):
with open('file.txt') as f:
    lines = f.readlines()

new_lines = []
for line in lines:
    if "HT" in line:
        # Keep the surrounding spaces so only the standalone 0 column changes
        new_line = line.replace(' 0 ', ' 2002 ')
        new_lines.append(new_line)
    else:
        new_lines.append(line)

content = ''.join(new_lines)
print(content)
# 6 HT 4.092000 4.750000 -0.502000 2002 5 7
# 7 HT 5.367000 5.548000 -0.325000 2002 5 6
# 8 OT -5.470000 5.461000 1.463000 0 9 10
# 9 HT -5.167000 4.571000 1.284000 2002 8 10
# 10 HT -4.726000 6.018000 1.235000 2002 8 9
# 11 OT -4.865000 -5.029000 -3.915000 0 12 13
# 12 HT -4.758000 -4.129000 -3.608000 2002 11 13
Repeat the same logic for the other line identifiers. If you put this in a function, you can reuse it for each identifier:
def _find_and_replace(current_lines, line_id, value):
    lines = []
    for l in current_lines:
        if line_id in l:
            # Pad the value with spaces so only the standalone 0 column changes
            lines.append(l.replace(' 0 ', ' ' + value + ' '))
        else:
            lines.append(l)
    return ''.join(lines)
with open('file.txt') as f:
    lines = f.readlines()

new_lines = _find_and_replace(lines, line_id='HT', value='2002')
print(new_lines)
Though if you have many identifiers, I would implement a solution that doesn't go through the list of lines once per identifier, but instead looks up each line's identifier as it iterates the lines, as in the sketch below.
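A minimal single-pass sketch of that idea (the REPLACEMENTS mapping, the output filename, and the column positions are assumptions based on the sample data, not part of the answer above):

REPLACEMENTS = {'HT': '2002', 'OT': '2001'}  # identifier -> new column value

with open('file.txt') as f, open('new_file.xyz', 'w') as out:
    for line in f:
        fields = line.split()
        # Assumes the identifier is the 2nd column and the 0 flag the 6th,
        # as in the sample lines above.
        if len(fields) >= 6 and fields[1] in REPLACEMENTS:
            fields[5] = REPLACEMENTS[fields[1]]
            line = ' '.join(fields) + '\n'
        out.write(line)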

A solution using the fileinput module with the re.search() and re.sub() functions:
import fileinput, re

with fileinput.input(files='lines.txt', inplace=True) as f:
    for line in f:
        if re.search(r'\bHT\b', line):    # line contains an `HT` column
            print(re.sub(r' 0 ', ' 2002 ', line).strip())
        elif re.search(r'\bOT\b', line):  # line contains an `OT` column
            print(re.sub(r' 0 ', ' 2001 ', line).strip())
        else:
            print(line.strip())
The file contents after processing:
6 HT 4.092000 4.750000 -0.502000 2002 5 7
7 HT 5.367000 5.548000 -0.325000 2002 5 6
8 OT -5.470000 5.461000 1.463000 2001 9 10
9 HT -5.167000 4.571000 1.284000 2002 8 10
10 HT -4.726000 6.018000 1.235000 2002 8 9
11 OT -4.865000 -5.029000 -3.915000 2001 12 13
12 HT -4.758000 -4.129000 -3.608000 2002 11 13
Optional in-place filtering: if the keyword argument inplace=True is passed to fileinput.input() or to the FileInput constructor, the file is moved to a backup file and standard output is directed to the input file (if a file of the same name as the backup file already exists, it will be replaced silently). This makes it possible to write a filter that rewrites its input file in place.
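Note that when fileinput.input() is called with no files argument, it falls back to the filenames in sys.argv[1:], which gives you exactly the "script typed in front of the file" workflow asked about in the question. A sketch (the script name process.py is made up for illustration):

# Hypothetical usage: python process.py file1.xyz file2.xyz ...
# With no files= argument, fileinput.input() reads the filenames from
# sys.argv[1:]; inplace=True rewrites each file in place.
import fileinput, re

with fileinput.input(inplace=True) as f:
    for line in f:
        if re.search(r'\bHT\b', line):
            line = re.sub(r' 0 ', ' 2002 ', line)
        elif re.search(r'\bOT\b', line):
            line = re.sub(r' 0 ', ' 2001 ', line)
        print(line.strip())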


Unwanted white spaces resulting in distorted columns

I am trying to import a list of chemicals from a txt file which is spaced (not tabbed).
NO FORMULA NAME CAS No A B C D TMIN TMAX code ngas#TMIN ngas#25 C ngas#TMAX
1 CBrClF2 bromochlorodifluoromethane 353-59-3 -0.0799 4.9660E-01 -6.3021E-05 -9.0961E-09 200 1500 2 96.65 142.14 572.33
2 CBrCl2F bromodichlorofluoromethane 353-58-2 4.0684 4.1343E-01 1.6576E-05 -3.4388E-08 200 1500 2 87.14 127.90 545.46
3 CBrCl3 bromotrichloromethane 75-62-7 7.3767 3.5056E-01 6.9163E-05 -4.9571E-08 200 1500 2 79.86 116.73 521.53
4 CBrF3 bromotrifluoromethane 75-63-8 -9.5253 6.5020E-01 -3.4459E-04 1.0987E-07 230 1500 1,2 123.13 156.61 561.26
5 CBr2F2 dibromodifluoromethane 75-61-6 2.8167 4.9405E-01 -1.2627E-05 -2.8629E-08 200 1500 2 100.89 148.24 618.87
6 CBr4 carbon tetrabromide 558-13-4 10.6812 3.2869E-01 1.0739E-04 -6.0788E-08 200 1500 2 80.23 116.62 540.18
7 CClF3 chlorotrifluoromethane 75-72-9 13.8075 4.7487E-01 -1.3368E-04 2.2485E-08 230 1500 1,2 116.23 144.10 501.22
8 CClN cyanogen chloride 506-77-4 0.8665 3.6619E-01 -2.9975E-05 -1.3191E-08 200 1500 2 72.80 107.03 438.19
When I import with pandas
df = pd.read_csv('trial1.txt', sep='\s')
I get:
For the first 5 compounds (index 0-4) the name lands correctly in the NAME column, but for the 6th (index 5) and 8th (index 7) compounds the name contains a space, so it gets split and the second word spills into the CAS column, which in turn pushes the CAS value under the No column, and so on down the row.
Is there a way to eliminate this issue? Thank you
I would suggest you do some preprocessing on the 'trial1.txt' file before loading it into a df. The following code should result in what you finally want to get:
import pandas as pd

with open('trial1.txt') as f:
    l = f.readlines()

l = [i.split() for i in l]
target = len(l[1])
for i in range(1, len(l)):
    if len(l[i]) > target:  # the NAME field was split in two
        l[i][2] = l[i][2] + ' ' + l[i][3]
        l[i].pop(3)

# Join with '|' -- the header itself contains '#' (e.g. ngas#TMIN), so use
# some rare symbol that doesn't exist anywhere in your file.
l = ['|'.join(k) for k in l]
l = [i + '\n' for i in l]

with open('trial2.txt', 'w') as f:
    f.writelines(l)

df = pd.read_csv('trial2.txt', sep='|', index_col=0)
Try this: you basically have to strip out the space between the words in the NAME column. So here I first read the file and then strip out those spaces using re.sub.
In this code, I am assuming that the two words have at least 5 letters on either side of the space. You can change that number {5,} as you deem fit.
import io
import re
import pandas as pd

with open('trial1.txt', 'r') as f:
    lines = f.readlines()

l = [re.sub(r"([a-z]{5,})\s([a-z]{5,})", r"\1\2", line) for line in lines]
df = pd.read_csv(io.StringIO('\n'.join(l)), delim_whitespace=True)
Prints:
NO FORMULA NAME CAS No A B C D TMIN TMAX code ngas#TMIN ngas#25 C.1 ngas#TMAX
1 CBrClF2 bromochlorodifluoromethane 353-59-3 -0.0799 0.49660 -0.000063 -9.096100e-09 200 1500 2 96.65 142.14 572.33 NaN NaN
2 CBrCl2F bromodichlorofluoromethane 353-58-2 4.0684 0.41343 0.000017 -3.438800e-08 200 1500 2 87.14 127.90 545.46 NaN NaN
3 CBrCl3 bromotrichloromethane 75-62-7 7.3767 0.35056 0.000069 -4.957100e-08 200 1500 2 79.86 116.73 521.53 NaN NaN
4 CBrF3 bromotrifluoromethane 75-63-8 -9.5253 0.65020 -0.000345 1.098700e-07 230 1500 1,2 123.13 156.61 561.26 NaN NaN
5 CBr2F2 dibromodifluoromethane 75-61-6 2.8167 0.49405 -0.000013 -2.862900e-08 200 1500 2 100.89 148.24 618.87 NaN NaN
6 CBr4 carbontetrabromide 558-13-4 10.6812 0.32869 0.000107 -6.078800e-08 200 1500 2 80.23 116.62 540.18 NaN NaN
7 CClF3 chlorotrifluoromethane 75-72-9 13.8075 0.47487 -0.000134 2.248500e-08 230 1500 1,2 116.23 144.10 501.22 NaN NaN
8 CClN cyanogenchloride 506-77-4 0.8665 0.36619 -0.000030 -1.319100e-08 200 1500 2 72.80 107.03 438.19 NaN NaN

How to open a CSV in Python?

I have a dataset in following format.
row_num;locale;day_of_week;hour_of_day;agent_id;entry_page;path_id_set;traffic_type;session_durantion;hits
"988681;L6;Monday;17;1;2111;""31672;0"";6;7037;\N"
"988680;L2;Thursday;22;10;2113;""31965;0"";2;49;14"
"988679;L4;Saturday;21;2;2100;""0;78464"";1;1892;14"
"988678;L3;Saturday;19;8;2113;51462;6;0;1;\N"
I want it to be in following format :
row_num locale day_of_week hour_of_day agent_id entry_page path_id_set traffic_type session_durantion hits
988681 L6 Monday 17 1 2111 31672 0 6 7037 N
988680 L2 Thursday 22 10 2113 31965 0 2 49 14
988679 L4 Saturday 21 2 2100 0 78464 1 1892 14
988678 L3 Saturday 19 8 2113 51462 6 0 1 N
I tried with the following code:
import pandas as pd
df = pd.read_csv("C:\Users\Rahhy\Desktop\trivago.csv", delimiter = ";")
But I am getting an error:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
Using replace(). (As an aside, the SyntaxError you got comes from the backslashes in the Windows path being treated as escape sequences; prefix the string with r, as in r"C:\Users\Rahhy\Desktop\trivago.csv", or use forward slashes.)
with open("data_test.csv", "r") as fileObj:
contents = fileObj.read().replace(';',' ').replace('\\', '').replace('"', '')
print(contents)
OUTPUT:
row_num locale day_of_week hour_of_day agent_id entry_page path_id_set traffic_type session_durantion hits
988681 L6 Monday 17 1 2111 31672 0 6 7037 N
988680 L2 Thursday 22 10 2113 31965 0 2 49 14
988679 L4 Saturday 21 2 2100 0 78464 1 1892 14
988678 L3 Saturday 19 8 2113 51462 6 0 1 N
EDIT:
You can open the file, read its contents, replace the unwanted chars, write the new contents back to the file, and then read it through pd.read_csv:

import pandas as pd

with open("data_test.csv", "r") as fileObj:
    contents = fileObj.read().replace(';', ' ').replace('\\', '').replace('"', '')

with open("data_test.csv", "w+") as fileObj2:
    fileObj2.write(contents)

# The rewritten file is space-separated, so tell read_csv about the delimiter
df = pd.read_csv("data_test.csv", sep=" ", index_col=False)
print(df)
OUTPUT:
row_num locale day_of_week hour_of_day agent_id entry_page path_id_set traffic_type session_durantion hits
988681 L6 Monday 17 1 2111 31672 0 6 7037 N
988680 L2 Thursday 22 10 2113 31965 0 2 49 14
988679 L4 Saturday 21 2 2100 0 78464 1 1892 14
988678 L3 Saturday 19 8 2113 51462 6 0 1 N
import pandas as pd
from io import StringIO

# Load the file to a string (prefix r (raw) to not use \ for escaping)
filename = r'c:\temp\x.csv'
with open(filename, 'r') as file:
    raw_file_content = file.read()

# Remove the quotes which break the CSV file
file_content_without_quotes = raw_file_content.replace('"', '')

# Simulate a file with the corrected CSV content
simulated_file = StringIO(file_content_without_quotes)

# Get the CSV as a table with pandas.
# Since the first field in each data row shall not be used for indexing,
# we need to set index_col=False.
csv_data = pd.read_csv(simulated_file, delimiter=';', index_col=False)
print(csv_data['hits'])  # print some column
csv_data  # in a notebook, this displays the full table
Since there are 11 data fields but only 10 headers, only the first 10 fields are used. You'll have to figure out what you want to do with the last one (values: \N, 14).
Output:
0 7037
1 49
2 1892
3 1
See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
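If you do want to keep that 11th field, one option (not from the answer above; the column list mirrors the file's header and the extra name is made up for illustration) is to supply explicit column names so the stray field gets its own column:

# Re-create the in-memory file, then name all 11 columns explicitly.
simulated_file = StringIO(file_content_without_quotes)
cols = ['row_num', 'locale', 'day_of_week', 'hour_of_day', 'agent_id',
        'entry_page', 'path_id_set', 'traffic_type', 'session_durantion',
        'hits', 'extra']  # 'extra' is a hypothetical name for the 11th field
csv_data = pd.read_csv(simulated_file, delimiter=';', header=0,
                       names=cols, index_col=False)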

When/why does split_on_silence() return an empty list

I'm trying to take an mp3 and simply remove silent blocks. I'm using pydub.split_on_silence(), but it returns an empty list. In my code below, the audio chunk seems to be silent for the first 4 seconds, has 12 seconds of audio, then is silent for the remainder.
from pydub import AudioSegment
from pydub.silence import split_on_silence

sound = AudioSegment.from_mp3("audio_files/xxxxxx.mp3")
clip = sound[21*1000:45*1000]

# "graph" the volume in 1 second increments
for x in range(0, int(len(clip)/1000)):
    print(x, clip[x*1000:(x+1)*1000].max_dBFS)

chunks = split_on_silence(
    clip,
    min_silence_len=1000,
    silence_thresh=-16,
    keep_silence=100
)
print("number of chunks", len(chunks))
print(chunks)
Output:
0 -59.67942035834925
1 -59.67942035834925
2 -60.20599913279624
3 -59.18294868384861
4 -7.294483767470469
5 -9.54772815923718
6 -7.8863408992261785
7 -8.018780602216872
8 -8.086437972291877
9 -9.689721851628853
10 -12.146807891343315
11 -13.187719632532362
12 -14.065443216019279
13 -14.344275171835644
14 -14.668150366783275
15 -10.544064231686791
16 -59.67942035834925
17 -59.9387199016366
18 -58.94496421785445
19 -59.9387199016366
20 -59.42763781218885
21 -59.67942035834925
22 -60.20599913279624
23 -59.67942035834925
number of chunks 0
[]
Thanks @ggrelet. I think the solution is that silence is judged by the average loudness (the .dBFS property), not max_dBFS. I changed my code accordingly (display the average dBFS per second, lower the threshold to -40) and got a non-empty return.
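A minimal sketch of that fix, reusing the clip from above (the -40 threshold is the value mentioned in the comment; tune it to sit between the loud sections and the silent floor of your audio):

# "graph" the *average* loudness per second; split_on_silence compares
# each window's average dBFS, not its max_dBFS
for x in range(0, int(len(clip)/1000)):
    print(x, clip[x*1000:(x+1)*1000].dBFS)

chunks = split_on_silence(
    clip,
    min_silence_len=1000,
    silence_thresh=-40,  # between the audible level and the silent floor
    keep_silence=100
)
print("number of chunks", len(chunks))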

Group rows in a CSV by blocks of 25

I have a csv file with 2 columns, representing a distribution of items per year which looks like this:
A B
1900 10
1901 2
1903 5
1908 8
1910 25
1925 3
1926 4
1928 1
1950 10
etc, about 15000 lines.
When making a distribution diagram based on this data, there are too many points on the axis and it's not very pretty. I want to group the rows into blocks of 25 years, so that at the end I have fewer points on the axis.
So, for example, from 1900 till 1925 I would have the sum of produced items as one row in column A and one row in column B:
1925 53
1950 15
So far I have only figured out how to convert the data in the csv file to int:

import csv

o = open('/dates_dist.csv', 'rU')
mydata = csv.reader(o)

def int_wrapper(mydata):
    for v in reader:
        yield map(int, v)

reader = int_wrapper(mydata)

I can't figure out how to take it further...
You could use itertools.groupby:
import itertools as IT
import csv

def int_wrapper(mydata):
    for v in mydata:
        yield map(int, v)

with open('data', 'rU') as o:
    mydata = csv.reader(o)
    header = next(mydata)
    reader = int_wrapper(mydata)
    for key, group in IT.groupby(reader, lambda row: (row[0]-1)//25+1):
        year = key*25
        total = sum(row[1] for row in group)
        print(year, total)
yields
(1900, 10)
(1925, 43)
(1950, 15)
Note that 1900 to 1925 (inclusive) spans 26 years, not 25. So
if you want to group 25 years, given the way you are reporting the totals, you probably want the half-open interval (1900, 1925].
The expression row[0]//25 takes the year and integer divides by 25.
This number will be the same for all numbers in the range [1900, 1925).
To make the range half-open on the left, subtract and add 1: (row[0]-1)//25+1.
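A quick sanity check of that key expression at the boundary years (a sketch; the arrow output is just the key multiplied back into a year label):

for year in (1900, 1901, 1925, 1926, 1950):
    key = (year - 1) // 25 + 1
    print('%d -> %d' % (year, key * 25))
# 1900 -> 1900
# 1901 -> 1925
# 1925 -> 1925
# 1926 -> 1950
# 1950 -> 1950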
Here is my approach. It's definitely not the most elegant Python code, but it could be a way to achieve the desired output.
if __name__ == '__main__':
    o = open('dates_dist.csv', 'rU')
    lines = o.read().split("\n")  # Create a list having each line of the file
    out_dict = {}
    curr_count = 0
    chunk_sz = 25  # years
    if len(lines) > 0:
        line_split = lines[0].split(",")
        start_year = int(line_split[0])
    # Iterate over each line of the file (assumes no header row)
    for line in lines:
        if not line:
            continue  # skip the empty string a trailing newline produces
        # Split at comma to get the year and the count:
        # line_split[0] is the year and line_split[1] is the count.
        line_split = line.split(",")
        curr_year = int(line_split[0])
        time_delta = curr_year - start_year
        if time_delta <= chunk_sz:
            curr_count = curr_count + int(line_split[1])
        else:
            out_dict[start_year + chunk_sz] = curr_count
            start_year = start_year + chunk_sz
            curr_count = int(line_split[1])
    out_dict[start_year + chunk_sz] = curr_count
    print out_dict
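With the sample data above (and assuming the file has no header row), this should print {1925: 53, 1950: 15}, which matches the totals in the question.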
You could create a dummy column and group by it after doing some integer division:
df['temp'] = df['A'] // 25
>>> df
A B temp
0 1900 10 76
1 1901 2 76
2 1903 5 76
3 1908 8 76
4 1910 25 76
5 1925 3 77
6 1926 4 77
7 1928 1 77
8 1950 10 78
>>> df.groupby('temp').sum()
A B
temp
76 9522 50
77 5779 8
78 1950 10
My numbers are slightly different from yours since I am technically grouping from 1900-1924, 1925-1949, and 1950-1974, but the idea is the same.
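If you want the group labels to line up with the year boundaries from the question (1925, 1950, ...) rather than the raw quotients, a variant of the same idea (using the half-open key from the groupby answer above) should give:

>>> df.groupby(((df['A'] - 1) // 25 + 1) * 25)['B'].sum()
A
1900    10
1925    43
1950    15
Name: B, dtype: int64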

Python csv reader-zipping reader with range

I have a really simple csv file of this type (I have put in the Fibonacci numbers as an example):
nn,number
1,1
2,1
3,2
4,3
5,5
6,8
7,13
8,21
9,34
10,55
11,89
12,144
13,233
14,377
15,610
16,987
17,1597
18,2584
19,4181
20,6765
21,10946
22,17711
23,28657
24,46368
25,75025
26,121393
27,196418
and I am just trying to bulk-process the rows in the following manner (the Fibonacci numbers are irrelevant):
import csv

b = 0
s = 1
i = 1
itera = 0
maximum = 10000
bulk_save = 10
csv_file = 'really_simple.csv'
fo = open(csv_file)
reader = csv.reader(fo)
## Skipping headers
_headers = reader.next()
while (s > 0) and itera < maximum:
    print 'processing...'
    b += 1
    tobesaved = []
    for row, i in zip(reader, range(1, bulk_save+1)):
        itera += 1
        tobesaved.append(row)
        print itera, row[0]
    s = len(tobesaved)
    print 'chunk no '+str(b)+' processed '+str(s)+' rows'
print 'Exit.'
The output I get is a bit weird (as if the reader is omitting an entry at the end of each loop):
processing...
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
chunk no 1 commited 10 rows
processing...
11 12
12 13
13 14
14 15
15 16
16 17
17 18
18 19
19 20
20 21
chunk no 2 commited 10 rows
processing...
21 23
22 24
23 25
24 26
25 27
chunk no 3 commited 5 rows
processing...
chunk no 4 commited 0 rows
Exit.
Do you have any idea what the problem could be? My guess is the zip function.
The reason I have the code like that (getting chunks of data) is that I need to bulk-save the csv entries to an sqlite3 database (using executemany and commit at the end of every zip loop) so that I will not overload my memory.
Thanks!
Try the following. (Your guess about zip is right: on each pass, zip pulls one row from the csv reader before discovering that the range iterator is exhausted, and that row is silently discarded.)
import csv

def process(rows, chunk_no):
    for no, data in rows:
        print no, data
    print 'chunk no {} process {} rows'.format(chunk_no, len(rows))

csv_file = 'really_simple.csv'
with open(csv_file) as fo:
    reader = csv.reader(fo)
    _headers = reader.next()
    chunk_no = 1
    tobesaved = []
    for row in reader:
        tobesaved.append(row)
        if len(tobesaved) == 10:
            process(tobesaved, chunk_no)
            chunk_no += 1
            tobesaved = []
    if tobesaved:
        process(tobesaved, chunk_no)
prints
1 1
2 1
3 2
4 3
5 5
6 8
7 13
8 21
9 34
10 55
chunk no 1 process 10 rows
11 89
12 144
13 233
14 377
15 610
16 987
17 1597
18 2584
19 4181
20 6765
chunk no 2 process 10 rows
21 10946
22 17711
23 28657
24 46368
25 75025
26 121393
27 196418
chunk no 3 process 7 rows
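An alternative sketch using itertools.islice for the chunking (not from the answer above): it keeps the reader as the only iterator, so no rows can be lost, and each chunk is a ready-made argument for executemany followed by a commit:

import csv
from itertools import islice

with open('really_simple.csv') as fo:
    reader = csv.reader(fo)
    next(reader)  # skip headers
    while True:
        chunk = list(islice(reader, 10))  # pull at most 10 rows
        if not chunk:
            break
        # e.g. cursor.executemany(insert_sql, chunk); connection.commit()
        print len(chunk), 'rows in this chunk'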
