Trying to convert txt data to CSV columns - Python

I am having a small issue. The code below works, but when I put /test1 and /test2 into the file and change the pattern to '/test1 (.*?) /test2', it doesn't find anything in the file and the except branch runs.
import re

with open('test.txt') as f:
    fin = f.read()
try:
    print(re.search('test1 (.*?) test2', fin).group(1))
except:
    print("Didn't find test")
My goal is to extract values from a list of text files and push them into CSV columns. The text looks like the sample below, where I would extract /J6 to /K6 as a value range. There are multiple different runs from /J6 to /K6, and each value should be put into a separate column in the CSV.
/J60000,0000,0819,0016,0356,-13,0363/K60013
,0012,0013,0875,-0021,00465,0120/L60089,0002,
I just want to understand whether there is a syntax problem with detecting the /. I am trying to extract the values between one marker and the next. Thank you.

You can use the re.split function.
Something like this...
In [84]: import re
In [85]: inp = "/J60000,0000,0819,0016,0356,-13,0363/K60013 ,0012,0013,0875,-0021,00465,0120/L60089,0002,"
In [86]: re.split(r"/[JKL]\d", inp)
Out[86]:
['',
 '0000,0000,0819,0016,0356,-13,0363',
 '0013 ,0012,0013,0875,-0021,00465,0120',
 '0089,0002,']
Disclaimer: I'm not good with regex at all. I used this link as a reference.
https://www.dataquest.io/blog/regex-cheatsheet/
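If the end goal is the CSV columns mentioned in the question, here is a minimal sketch built on that same split. The file names and the one-row-per-line layout are assumptions:

import csv
import re

# A minimal sketch, assuming each line of the input becomes one CSV row and
# every comma-separated value between the /J6, /K6, /L6 markers becomes its
# own column. 'input.txt' and 'output.csv' are hypothetical file names.
with open('input.txt') as src, open('output.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    for line in src:
        # re.split leaves an empty leading string when the line starts
        # with a marker, so filter those out before writing.
        chunks = [c for c in re.split(r"/[JKL]\d", line.strip()) if c]
        values = [v.strip() for chunk in chunks for v in chunk.split(',') if v.strip()]
        writer.writerow(values)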


How to get an unknown substring between two known substrings, within a giant string/file

I'm trying to get all the substrings under a "customLabel" key, for example "Month" inside of ...,"customLabel":"Month"},"schema":"metric...
Unusual issue: this is a 1,071,552-character ndjson file consisting of a single line (so "for line in file:" is pointless, since there is only one).
The best I found was this:
How to find a substring of text with a known starting point but unknown ending point in python
but if I use that, the result doesn't stop at Month and keeps printing the whole remainder of the file, the same as when using partition()[2].
Just know that Month is only an example; customLabel has about 300 variants and they are not listed anywhere (I'm actually doing this to list them...)
To give some details here's my script so far:
with open("file.ndjson","rt", encoding='utf-8') as ndjson:
filedata = ndjson.read()
x="customLabel"
count=filedata.count(x)
for i in range (count):
if filedata.find(x)>0:
print("Found "+str(i+1))
So right now it correctly tells me how many occurrences of customLabel there are. Instead, I'd like to get the substring that comes after customLabel":" (Month in the example) and put them all in a list, to locate them far more easily and enable the use of replace() for translations later on.
I'd guess regexes are the solution, but I'm pretty new to them, so I'll post this question while I learn about them...
If you want to search for all (even nested) customLabel values like this:
{"customLabel":"Month" , "otherJson" : {"customLabel" : 23525235}}
you can use RegEx patterns with the re module
import re

label_values = []
regex_pattern = r"\"customLabel\"[ ]?:[ ]?([0-9a-zA-Z\"]+)"
with open("file.ndjson", "rt", encoding="utf-8") as ndjson:
    for line in ndjson:
        values = re.findall(regex_pattern, line)
        label_values.extend(values)
print(label_values)  # ['"Month"', '23525235']
# If you don't want the items to have quotation marks
label_values = [i.replace('"', "") for i in label_values]
print(label_values)  # ['Month', '23525235']
Note: If you're only dealing with ndjson files and don't need nested searching, then it'd be better to use the json module to parse the lines and simply get the value of your specific key, which is customLabel.
import json

label = "customLabel"
label_values = []
with open("file.ndjson", "rt", encoding="utf-8") as ndjson:
    for line in ndjson:
        line_json = json.loads(line)
        if line_json.get(label) is not None:
            label_values.append(line_json.get(label))
print(label_values)  # ['Month']

"Replace" from central file?

I am trying to extend the replace function. Instead of doing the replacements on individual lines or individual commands, I would like to use the replacements from a central text file.
That's the source:
import os
import feedparser
import pandas as pd
pd.set_option('max_colwidth', None)  # -1 is no longer accepted by recent pandas versions
RSS_URL = "https://techcrunch.com/startups/feed/"
feed = feedparser.parse(RSS_URL)
entries = pd.DataFrame(feed.entries)
entries = entries[['title']]
entries = entries.to_string(index=False, header=False)
entries = entries.replace(' ', '\n')
entries = os.linesep.join([s for s in entries.splitlines() if s])
print(entries)
I want to be able to replace words in an RSS feed using a central "replacement" file. That source file should have two columns: old word, new word, just like the replace function replace('old', 'new').
Output/Print Example:
truck
rental
marketplace
D’Amelio
family
launches
to
invest
up
to
$25M
...
In most cases I want to delete the words that are unnecessary for me, e.g. replace('to', ''). But I also want to be able to change special names, e.g. replace("D'Amelio", "DAmelio"). The goal is to reduce the number of words and build up a kind of keyword radar.
Is this possible? I can't find any help by googling, but it could well be that I don't know the right terms or can't formulate the question properly.
with open('<filepath>', 'r') as r:
    # if you remove the ' marks from around your words, you can remove
    # the [1:-1] slice in the line below
    words_to_replace = [word.strip()[1:-1] for word in r.read().split(',')]

def replace_words(original_text, words_to_replace):
    for word in words_to_replace:
        original_text = original_text.replace(word, '')
    return original_text
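A short usage sketch under the same assumptions, applied to the entries string built in the question's code:

# Hypothetical usage: strip the unwanted words from the RSS titles.
cleaned = replace_words(entries, words_to_replace)
print(cleaned)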
I was unable to understand your question fully, but as far as I can tell you have strings like cat, dog, etc., and a file containing the data you want to replace them with. If that was your requirement, I have given a solution below, so try running it and see whether it satisfies your requirement.
If that's not what you meant, please comment below.
TXT file (don't use '' around the strings in the text file):
papa, papi
dog, dogo
cat, kitten
Python File:
your_string = input("Type a string here: ")  # string you want to replace
with open('textfile.txt', "r") as file1:  # open your file
    lines = file1.readlines()
for line in lines:  # take the lines of the file one by one in a loop
    string1 = line.split()  # split the line into a list like ['cat,', 'kitten']
    if your_string == string1[0][:-1]:  # compare your string with the first column (minus the trailing comma)
        your_string = string1[1]  # if it matches (e.g. the input was cat), replace it with kitten
        print(your_string)
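Since the question asks for a two-column central file (old word, new word) covering both deletions and renames, here is a minimal sketch under that assumption; 'replacements.csv' is a hypothetical file name:

import csv

# A minimal sketch, assuming 'replacements.csv' holds one old,new pair per row:
#   to,                  (an empty second column deletes the word)
#   D'Amelio,DAmelio
replacements = {}
with open('replacements.csv', newline='', encoding='utf-8') as f:
    for row in csv.reader(f):
        if row:  # skip blank lines
            replacements[row[0]] = row[1] if len(row) > 1 else ''

def apply_replacements(text, replacements):
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text

# 'entries' is the newline-separated titles string from the question's code.
print(apply_replacements(entries, replacements))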

Remove embedded line-feeds from fields of a CSV file

I have a CSV file in which a single row is getting split into multiple rows.
The source file contents are:
"ID","BotName"
"1","ABC"
"2","CDEF"
"3","AAA
XYZ"
"4",
"ABCD"
As we can see, IDs 3 and 4 are getting split into multiple rows. So, is there any way in Python to join those rows with the previous line?
Desired output:
"ID","BotName"
"1","ABC"
"2","CDEF"
"3","AAAXYZ"
"4","ABCD"
This is the code I have:
data = open(r"C:\Users\suksengupta\Downloads\BotReport_V01.csv","r")
It looks like your CSV file has control characters embedded in the field contents. If that is the case, you need to strip them out in order to get each field's contents joined back together.
With that in mind, something like this will fix the problem:
import re

src = r'C:\Users\suksengupta\Downloads\BotReport_V01.csv'
with open(src) as f:
    # Join any line ending in a word character or comma with the next line;
    # complete rows end with a closing quote, so their newlines survive.
    data = re.sub(r'([\w,])\s+', r'\1', f.read())
print(data)
The above code will result in the output below printed to console:
"ID","BotName"
"1","ABC"
"2","CDEF"
"3","AAAXYZ"
"4","ABCD"

String manipulation with Python pandas and a replacement function

I'm trying to write code that checks the sentences in a CSV file, searches for the words given in a second CSV file, and replaces them. My code is below; it doesn't return any errors, but for some reason it is not replacing any words and prints back the same sentences without any replacement.
import string
import pandas as pd

text = pd.read_csv("sentences.csv")
change = pd.read_csv("replace.csv")
for row in text:
    print(text.replace(change['word'], change['replacement']))
The sentences CSV file contains a sentences column; the change CSV file contains word and replacement columns.
Try:
text=pd.read_csv("sentences.csv")
change=pd.read_csv("replace.csv")
toupdate = dict(zip(change.word, change.replacement))
text = text['sentences'].replace(toupdate, regex=True)
print(text)
DataFrame.replace(x, y) changes a complete value x to y, not part of x.
You have to use regex or a custom function to do what you want. For example:
change_dict = dict(zip(change.word, change.replacement))

def replace_word(txt):
    for key, val in change_dict.items():
        txt = txt.replace(key, val)
    return txt

print(text['sentences'].apply(replace_word))
# To create one more additional column, so the original column is left unchanged:
text["new_sentence"] = text["sentences"]
for changeInd in change.index:
    for eachTextid in text.index:
        text.at[eachTextid, "new_sentence"] = text.at[eachTextid, "new_sentence"].replace(
            change['word'][changeInd], change['replacement'][changeInd])

How do I handle closing double quotes in CSV column with python?

This is the python script:
f = open('csvdata.csv', 'r')
fo = open('out6.csv', 'w')
for line in f:
    bits = line.split(',')
    bits[1] = '"input"'
    fo.write(','.join(bits))
f.close()
fo.close()
I have a CSV file and I'm replacing the content of the 2nd column with the string "input". However, I need to grab some information from that column content first.
The content might look like this:
failurelog_wl","inputfile/source/XXXXXXXX"; "X_CORD2"; "Invoice_2M";
"Y_CORD42"; "SIZE_ID37""
It has a weird kind of data, as you can see; in particular, there are two double quotes at the end of the line instead of the single one you would expect.
I need to extract the X_CORD and Y_CORD information, like X_CORD = 2 and Y_CORD = 42, before replacing the column value. I then want to insert an extra column, named X_Y, which represents (2_42).
How can I modify my script to do that?
If I understand your question correctly, you can use a simple regular expression to pull out the numbers you want:
import re

f = open('csvdata.csv', 'r')
fo = open('out6.csv', 'w')
for line in f:
    bits = line.rstrip('\n').split(',')
    x_y_matches = re.match(r'.*X_CORD(\d+).*Y_CORD(\d+).*', bits[1])
    assert x_y_matches is not None, 'Line had unexpected format: {0}'.format(bits[1])
    x_y = '({0}_{1})'.format(x_y_matches.group(1), x_y_matches.group(2))
    bits[1] = '"input"'
    bits.append(x_y)  # appended after the newline is stripped, so it lands on the same row
    fo.write(','.join(bits) + '\n')
f.close()
fo.close()
Note that this will only work if column 2 always says 'X_CORD' and 'Y_CORD' immediately before the numbers. If it is sometimes a slightly different format, you'll need to adjust the regular expression to allow for that. I added the assert to give a more useful error message if that happens.
You mentioned wanting the column to be named X_Y. Your script appears to assume that there is no header, and my modified version definitely makes this assumption. Again, you'd need to adjust for that if there is a header line.
And, yes, I agree with the other commenters that using the csv module would be cleaner, in general, for reading and writing csv files.
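For reference, a minimal csv-module sketch of the same transformation, under the same assumptions (no header row, and column 2 always contains X_CORD<n> and Y_CORD<n>):

import csv
import re

with open('csvdata.csv', newline='') as f, open('out6.csv', 'w', newline='') as fo:
    reader = csv.reader(f)
    # QUOTE_ALL keeps every field quoted, matching the original file's style.
    writer = csv.writer(fo, quoting=csv.QUOTE_ALL)
    for bits in reader:
        m = re.match(r'.*X_CORD(\d+).*Y_CORD(\d+).*', bits[1])
        assert m is not None, 'Row had unexpected format: {0}'.format(bits[1])
        bits[1] = 'input'
        bits.append('({0}_{1})'.format(m.group(1), m.group(2)))
        writer.writerow(bits)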
