removing line breaks in a csv file - python

I have a CSV file in which each line begins with (#) and the fields within a line are separated by (;). One of the fields, the one that contains the text (""[ ]""), has line breaks in it that produce errors when I import the whole CSV file into Excel or Access. The text after each line break is treated as an independent line that does not follow the structure of the table.
#4627289301; Lima, Peru; 490; 835551022915420161; Sat Feb 25 18:04:22 +0000 2017; ""[OJO!
la premiacin de los #Oscar, nuestros amigos de #cinencuentro revisan las categoras.
+info: co/plHcfSIfn8]""; 0
#624974422; None; 114; 835551038581137416; Sat Feb 25 18:04:26 +0000 2017; ""[Porque nunca dejamos de amar]""; 0
Any help with this using a Python script, or any other solution?
As output I would like to have these lines:
#4627289301; Lima, Peru; 490; 835551022915420161; Sat Feb 25 18:04:22 +0000 2017; ""[OJO! la premiacin de los #Oscar, nuestros amigos de #cinencuentro revisan las categoras. +info: co/plHcfSIfn8]""; 0
#624974422; None; 114; 835551038581137416; Sat Feb 25 18:04:26 +0000 2017; ""[Porque nunca dejamos de amar]""; 0
Any help? I have a CSV file (54 MB) in which many lines contain line breaks, while other lines are fine.

You should share your expected output as well.
Anyway, I suggest you first clean the file to remove the newline characters; then you can read it as CSV. One solution (I believe someone will suggest something better :-) ):
Clean the file (on Linux, with GNU sed):
sed ':a;N;$!ba;s/\n/ /g' input_file | sed 's/ \(#[0-9][0-9]*;\)/\n\1/g' > output_file
The first sed joins all lines into one. The second restores a line break before each record marker (# followed by digits and ;). Matching only " #" would also split the text at hashtags such as #Oscar.
Read the file as CSV (you can read it using any other method):
import pandas as pd
df = pd.read_csv('output_file', delimiter=';', header=None)
df.to_csv('your_csv_file_name', index=False)
Let's see if it helps you :-)

You can search for line breaks that are not followed by the start of a new record (a "#", digits, and a ";"), like this: \r?\n+(?!#\d+;).
The following was generated from this regex101 demo. It replaces such line ends with a space. You can change that to whatever you like.
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re

regex = r"\r?\n+(?!#\d+;)"
test_str = ("#4627289301; Lima, Peru; 490; 835551022915420161; Sat Feb 25 18:04:22 +0000 2017; \"\"[OJO!\n"
            "la premiacin de los #Oscar, nuestros amigos de #cinencuentro revisan las categoras.\n"
            "+info: co/plHcfSIfn8]\"\"; 0\n"
            "#624974422; None; 114; 835551038581137416; Sat Feb 25 18:04:26 +0000 2017; \"\"[Porque nunca dejamos de amar]\"\"; 0")
subst = " "
# You can limit the number of replacements with the count argument (0 means replace all)
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)
if result:
    print(result)
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
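The same substitution could be applied to a whole file roughly as follows. This is only a sketch: it assumes the file fits in memory (a 54 MB file normally does), and the helper names join_broken_lines and clean_file, the paths, and the utf-8 encoding are illustrative assumptions rather than anything given in the question.

```python
import re

# Same pattern as in the answer: a line break NOT followed by "#<digits>;"
RECORD_BREAK = re.compile(r"\r?\n+(?!#\d+;)")


def join_broken_lines(text, replacement=" "):
    """Replace every line break that does not start a new record."""
    return RECORD_BREAK.sub(replacement, text)


def clean_file(src_path, dst_path, encoding="utf-8"):
    """Hypothetical helper: read src_path, join broken lines, write dst_path."""
    # newline="" keeps the original line endings instead of translating them
    with open(src_path, encoding=encoding, newline="") as src:
        cleaned = join_broken_lines(src.read())
    with open(dst_path, "w", encoding=encoding, newline="") as dst:
        dst.write(cleaned)


sample = "#1; Lima; ''[OJO!\nmore text\n+info]''; 0\n#2; None; ''[ok]''; 0"
print(join_broken_lines(sample))
```

Note that a newline at the very end of the file is not followed by a record marker either, so it is also replaced by a space.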

Related

File CSV with pipe (|) but I've pipe in the middle of the field

I have a CSV file, generated by an ERP, whose delimiter is the pipe (|). But some columns in this file have the Text format in the ERP, and in many lines the users put a pipe (|) in the middle of the text.
Example:
|100019391 |99806354 |EV | RES: Consulta COBRO VVISTA - Chile |31|24.06.2021|
The part EV | Res*** is the field where the user put a pipe.
When pandas reads these lines, it gives me an error:
Skipping line 46: Expected 28 fields in line 46, saw 29
Is there an option to fix it?
Thanks

Assuming that a pipe acting as a separator is never followed by whitespace, we can use the regex r"\|(?!\s)" for the sep argument.
Sample input:
col|col1|col2|col3|col4|
100019391 |99806354 |EV | RES: Consulta COBRO VVISTA - Chile |31|24.06.2021|
100019392 |99806777 |TEST - Chile |31|25.06.2021|
100019393 |99806779 |TE | ST - Chile |31|25.06.2021|
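Then, we can import the above CSV as follows:
import pandas as pd

df = pd.read_csv(csv_filename,
                 usecols=range(5),
                 sep=r"\|(?!\s)",
                 lineterminator='\r',
                 engine='python')
Adjust usecols according to the number of columns you have. Adjust lineterminator according to the line terminator being used in your file; note that the pandas documentation describes lineterminator as valid only with the C parser, so it may need to be dropped when engine='python' is used. engine='python' is needed because the default C engine does not support regex separators; without it, pandas falls back to the Python engine and emits a ParserWarning.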
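To see what this separator does to a single record, here is a small sketch using re.split directly, outside pandas. The helper name split_fields is made up for illustration, and the sketch only holds under the same assumption as above (a separator pipe is never followed by whitespace).

```python
import re

# A pipe counts as a separator only when it is NOT followed by whitespace
SEP = re.compile(r"\|(?!\s)")


def split_fields(line):
    """Split one record on separator pipes and trim padding around each field."""
    return [field.strip() for field in SEP.split(line.rstrip("\r\n"))]


row = "100019391 |99806354 |EV | RES: Consulta COBRO VVISTA - Chile |31|24.06.2021|"
print(split_fields(row))
```

The pipe inside "EV | RES" is followed by a space, so it stays within the field. The trailing pipe produces one extra empty field at the end, which is why usecols is used above to keep only the real columns.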

read and split information from text file python

I'm stuck on a problem:
I have a text file called id_numbers.txt that contains this information:
325255, Jan Jansen
334343, Erik Materus
235434, Ali Ahson
645345, Eva Versteeg
534545, Jan de Wilde
345355, Henk de Vries
I need Python to split the information at the comma and display it as follows:
Jan Jansen has cardnumber: 325255
Erik Materus has cardnumber: 334343
Ali Ahson has cardnumber: 235434
Eva Versteeg has cardnumber: 645345
I've tried to read the file into a list with split(","), but that ends up attaching the next number to each name, like this:
['325255', ' Jan Jansen\n334343', ' Erik Materus\n235434', ' Ali Ahson\n645345', ' Eva Versteeg\n534545', ' Jan de Wilde\n345355', ' Henk de Vries']
Help would be appreciated!

You can do it this way:
with open('id_numbers.txt', 'r') as f:
    for line in f:
        line = line.rstrip()  # this removes the \n at the end
        id_num, name = line.split(',', 1)  # split at the first comma only
        name = name.strip()  # in case the name has spaces on either side
        print('{0} has cardnumber: {1}'.format(name, id_num))
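The attempt in the question splits the whole file content at the commas, which is why each name ends up glued to the next number; splitting line by line avoids that. As a sketch, the same loop can be wrapped in a generator that works on any iterable of lines and skips blank ones. The name format_cards and the in-memory sample are illustrative assumptions, not part of the original answer.

```python
def format_cards(lines):
    """Yield '<name> has cardnumber: <id>' for each 'id, name' line."""
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines, e.g. an empty line at the end of the file
            continue
        id_num, name = line.split(",", 1)  # split at the first comma only
        yield "{0} has cardnumber: {1}".format(name.strip(), id_num.strip())


# In-memory stand-in for the lines of id_numbers.txt
sample = ["325255, Jan Jansen\n", "334343, Erik Materus\n", "\n"]
for text in format_cards(sample):
    print(text)
```

With a real file, the call would be format_cards(f) inside the with block.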

How to use multiline flag in python regex?

I want to transform chunks of text into a database of single-line entries with regex. But I don't know why the regex group isn't recognized.
Maybe it is because the multiline flag isn't properly set.
I am a beginner at Python.
import re
with open("a-j-0101.txt", encoding="cp1252") as f:
    start=1
    ecx=r"(?P<entrcnt>[0-9]{1,3}) célébrités ou évènements"
    ec1=""
    nmx=r"(?P<ename>.+)\r\nAfficher le.*"
    nm1=""
    for line in f:
        if start == 1:
            out = open('AST0101.txt' + ".txt", "w", encoding="cp1252") #utf8 cp1252
            ec1 = re.search(ecx,line)
            out.write(ec1.group("entrcnt"))
            start=0
        out.write(r"\r\n")
        nm1 = re.search(nmx,line, re.M)
        out.write(str(nm1.group("ename")).rstrip('\r\n'))
out.close()
But I get the error:
File "C:\work-python\transform-asth-b.py", line 16, in <module>
out.write(str(nm1.group("ename")).rstrip('\r\n'))
builtins.AttributeError: 'NoneType' object has no attribute 'group'
here is the input:
210 célébrités ou évènements ont été trouvés pour la date du 1er janvier.
Création de l'euro
Afficher le...
...
...
...
expected output:
210
Création de l'euro ;...
... ;...
... ;...
EDIT: I tried changing nmx to match \n or \r\n, but with no result:
nmx=r"(?P<ename>.+)(\n|\r\n)Afficher le"
best regards

In this statement:
nm1 = re.search(nmx,line, re.M)
you get None (nm1 = None) because no match was found. So investigate the nmx pattern further to find out why it doesn't match.
By the way, if the result can be None, you can guard against it before calling group():
if nm1 is not None:
    out.write(str(nm1.group("ename")).rstrip('\r\n'))
else:
    pass  # handle the no-match case here

If you are reading a single line at a time, there is no way for a regex to match on a previous line you have read and then forgotten.
If you read a group of lines, you can apply a regex to the collection of lines, and the multiline flag will do something useful. But your current code should probably simply search for r'^Afficher le\.\.\.' and use the state machine (start == 0 or start == 1) to do this in the right context.
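As a rough sketch of the "read a group of lines" option: when the whole text is in one string, the multiline flag lets ^ anchor at the start of every line, so a line can be captured based on what follows it. The sample below is an in-memory stand-in for the file, and extract_names is a made-up helper name. Note also that Python's default text mode translates \r\n to \n on reading, so a pattern that insists on \r\n will not match text read that way; \r?\n covers both cases.

```python
import re

COUNT = re.compile(r"(?P<entrcnt>[0-9]{1,3}) célébrités ou évènements")
# A whole line that is immediately followed by a line starting with "Afficher le"
NAME = re.compile(r"^(?P<ename>.+)\r?\n(?=Afficher le)", re.MULTILINE)


def extract_names(text):
    """Return every line that directly precedes an 'Afficher le' line."""
    return [m.group("ename") for m in NAME.finditer(text)]


# In-memory stand-in for f.read() on the input file
sample = (
    "210 célébrités ou évènements ont été trouvés pour la date du 1er janvier.\n"
    "Création de l'euro\n"
    "Afficher le...\n"
)
count = COUNT.search(sample).group("entrcnt")
names = extract_names(sample)
print(count)
print(names)
```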

Python: how to extract string from file - only once

I have the below output from router stored in a file
-#- --length-- -----date/time------ path
3 97103164 Feb 7 2016 01:36:16 +05:30 taas/NN41_R11_Golden_Image
4 1896 Sep 27 2019 14:22:08 +05:30 taas/NN41_R11_Golden_Config
5 1876 Nov 27 2017 20:07:50 +05:30 taas/nfast_default.cfg
I want to search for the substring 'Golden_Image' in the file and get the complete path. So here, the required output would be this string:
taas/NN41_R11_Golden_Image
First attempt:
import re
with open("outlog.out") as f:
    for line in f:
        if "Golden_Image" in line:
            print(line)
Output:
3 97103164 Feb 7 2016 01:36:16 +05:30 taas/NN41_R11_Golden_Image
Second attempt
import re
hand = open('outlog.out')
for line in hand:
    line = line.rstrip()
    x = re.findall('.*?Golden_Image.*?',line)
    if len(x) > 0:
        print(x)
Output:
['3 97103164 Feb 7 2016 01:36:16 +05:30 taas/NN41_R11_Golden_Image']
Neither of these gives the required output. How can I fix this?

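This is actually surprisingly fiddly to do if the path can contain spaces.
You need to pass the maxsplit argument to split() so that everything after the seventh field stays together as the path field.
with open("outlog.out") as f:
    for line in f:
        fields = line.split(None, 7)
        # the path is the eighth field; the header line has fewer fields
        if len(fields) == 8 and "Golden_Image" in fields[7]:
            print(fields[7].rstrip())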
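To illustrate why maxsplit matters, here is a small sketch run on in-memory rows instead of outlog.out. The helper name path_field and the third row (a path containing spaces) are made up for the example; the column layout is assumed to match the router output in the question.

```python
def path_field(line, needle="Golden_Image"):
    """Return the path column of `line` if it contains `needle`, else None."""
    fields = line.split(None, 7)  # at most 8 fields; the last one keeps its spaces
    if len(fields) == 8 and needle in fields[7]:
        return fields[7].rstrip()
    return None


rows = [
    "-#- --length-- -----date/time------ path",
    "3 97103164 Feb 7 2016 01:36:16 +05:30 taas/NN41_R11_Golden_Image",
    "6 1024 Jan 1 2020 00:00:00 +05:30 taas/My Golden_Image copy",  # made-up row
]
for row in rows:
    print(path_field(row))
```

A plain line.split()[-1] would return only "copy" for the last row, whereas limiting the number of splits keeps the whole path together.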

Split the line and check whether the "Golden_Image" string exists in any of the parts, or search the line with a regex:
import re
with open("outlog.out") as f:
    for line in f:
        if "Golden_Image" not in line:
            continue
        print(re.search(r'\S*Golden_Image\S*', line).group())
or
with open("outlog.out") as f:
    images = re.findall(r'\S*Golden_Image\S*', f.read())
Example:
>>> s = '''
... -#- --length-- -----date/time------ path
... 3 97103164 Feb 7 2016 01:36:16 +05:30 taas/NN41_R11_Golden_Image
... 4 1896 Sep 27 2019 14:22:08 +05:30 taas/NN41_R11_Golden_Config
... 5 1876 Nov 27 2017 20:07:50 +05:30 taas/nfast_default.cfg'''.splitlines()
>>> for line in s:
...     for i in line.split():
...         if "Golden_Image" in i:
...             print(i)
...
taas/NN41_R11_Golden_Image
>>>

Reading the full content at once and then searching it will not be efficient. Instead, the file can be read line by line, and if a line matches the criterion, the path can be extracted with a regex, without any further splitting.
Use the following regex to get the path:
\s+(?=\S*$).*
Link: https://regex101.com/r/zuH0Zv/1
Here is working code:
import re

regex = r"\s+(?=\S*$).*"
test_str = "3 97103164 Feb 7 2016 01:36:16 +05:30 taas/NN41_R11_Golden_Image"
matches = re.search(regex, test_str)
print(matches.group().strip())

Following your code, if you just want the right output, it can be simpler:
with open("outlog.out") as f:
    for line in f:
        if "Golden_Image" in line:
            print(line.split()[-1])
The output is:
taas/NN41_R11_Golden_Image
PS: if you want more complex operations, you may need to try the re module, as in Avinash Raj's answer.

Python print both the matching groups in regex

I want to find two fixed patterns in a log file. Here is what a line in the log file looks like:
passed dangerb.xavier64.423181.k000.drmanhattan_resources.log Aug 23 04:19:37 84526 362
From this log, I want to extract drmanhattan and 362, which is the number just before the line ends.
Here is what I have tried so far.
import sys
import re
with open("Xavier.txt") as f:
    for line in f:
        match1 = re.search(r'((\w+_\w+)|(\d+$))',line)
        if match1:
            print(match1.groups())
However, every time I run this script, I always get drmanhattan as output and not drmanhattan 362.
Is it because of the | sign?
How do I tell regex to catch this group and that group?
I have already consulted this link and this one; however, they did not solve my problem.

import re

line = 'Passed dangerb.xavier64.423181.r000.drmanhattan_resources.log Aug 23 04:19:37 84526 362'
match1 = re.search(r'(\w+_\w+).*?(\d+$)', line)
if match1:
    print(match1.groups())
    # ('drmanhattan_resources', '362')
If you have a test.txt file that contains the following lines:
Passed dangerb.xavier64.423181.r000.drmanhattan_resources.log Aug 23 04:19:37 84526 362
Passed dangerb.xavier64.423181.r000.drmanhattan_resources.log Aug 23 04:19:37 84526 363
Passed dangerb.xavier64.423181.r000.drmanhattan_resources.log Aug 23 04:19:37 84526 361
you can do:
with open('test.txt', 'r') as fil:
    for line in fil:
        match1 = re.search(r'(\w+_\w+).*?(\d+)\s*$', line)
        if match1:
            print(match1.groups())
            # ('drmanhattan_resources', '362')
            # ('drmanhattan_resources', '363')
            # ('drmanhattan_resources', '361')

| means OR, so your regex catches (\w+_\w+) OR (\d+$).
Maybe you want something like this:
((\w+_\w+).*?(\d+$))

With re.search you only get the first match, if any, and with | you tell re to look for either this or that pattern. As suggested in other answers, you could replace the | with .* to match "anything in between" those two patterns. Alternatively, you could use re.findall to get all matches:
>>> line = "passed dangerb.xavier64.423181.k000.drmanhattan_resources.log Aug 23 04:19:37 84526 362"
>>> re.findall(r'\w+_\w+|\d+$', line)
['drmanhattan_resources', '362']
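If only the part before the underscore (drmanhattan) and the trailing number are wanted, named groups make both values easy to pick out of a single match. The following is a sketch under the assumption that every log line looks like the sample; the pattern and the helper name parse_line are illustrative, not taken from the answers above.

```python
import re

# name:   the alphanumeric run just before the first underscore
# number: the digits at the very end of the line (trailing whitespace allowed)
LOG_LINE = re.compile(r"(?P<name>[A-Za-z0-9]+)_\w+.*?(?P<number>\d+)\s*$")


def parse_line(line):
    """Return (name, number) for a matching log line, or None if it does not match."""
    m = LOG_LINE.search(line)
    if m is None:
        return None
    return m.group("name"), int(m.group("number"))


line = "passed dangerb.xavier64.423181.k000.drmanhattan_resources.log Aug 23 04:19:37 84526 362\n"
print(parse_line(line))
```

Returning None for lines that do not match avoids the AttributeError that calling group() on a failed search would raise.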
