Removing various characters from (.csv or .txt) file Python - python

I have a .csv file which looks like:
['NAME' " 'RA_I1'" " 'DEC_I1'" " 'Mean_I1'" " 'Median_I1'" " 'Mode_I1'" ...]"
where this string carries on for (I think) 95 entries, the entire file is over a thousand rows deep. I want to remove all the characters: [ ' " and just have everything separated by a single white space entry (' ').
So far I've tried:
import pandas as pd
df1 = pd.read_table('slap.txt')
for char in df1:
    if char in " '[":
        df1.replace(char, '')
print df1
Here I'm just testing the code to see whether it does what I want, and it doesn't. I'd like to apply it to the entire file, but I'm not sure how.
I've checked this old post out but not quite getting it to work for my purposes. I've also played with the linked post, the only problem with it seems to be that all the entries are spaced twice rather than just once....

This looks like something you ought to be able to grab with a (not particularly pretty) regular expression in the sep argument of read_csv:
In [11]: pd.read_csv(file_name, sep='\[\'|\'\"\]|[ \'\"]+', header=None)
Out[11]:
     0     1      2       3        4          5        6   7
0  NaN  NAME  RA_I1  DEC_I1  Mean_I1  Median_I1  Mode_I1 NaN
You can play about with the regular expression til it truly fits your needs.
To explain this one:
sep = ('\[\''        # each line starts with [' (the | means "or")
       '|\'\"\]'     # ends with '"] (at least the one I had)
       '|[ \'\"]+')  # the actual delimiter: + means at least one, so a run of ", ' and space in any order
You can see this hack leaves a NaN column at either end. The main reason this is so awkward is the inconsistency of your "csv"; I would definitely recommend cleaning it up. One way to do that is to read it with pandas as above and then write it back out with to_csv. If it's generated by someone else... complain (!).
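To see what that separator does before handing it to read_csv, here is a minimal sketch that applies the same pattern to a single line with re.split from the standard library. The sample line is a shortened, assumed version of the row in the question; the empty strings at both ends are what show up as the NaN columns.

```python
import re

# The separator from the answer, written as one raw string.
SEP = re.compile(r"""\['|'"\]|[ '"]+""")

# Shortened sample row (an assumption based on the question's data).
line = """['NAME' " 'RA_I1'" " 'DEC_I1'"]"""

fields = SEP.split(line)
print(fields)

# Dropping the empty edge fields gives the single-space layout asked for.
cleaned = " ".join(f for f in fields if f)
print(cleaned)
```

If the real file has rows that end differently, the middle alternative of the pattern would need adjusting.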

Have you tried:
string.strip(s[, chars])
?
http://docs.python.org/2/library/string.html
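Note that strip only removes characters from the two ends of a string, so on its own it would not touch the quotes in the middle of a row. A rough sketch of a plain-Python alternative, using str.translate to delete the unwanted characters and split/join to collapse the leftover spaces (clean_line and the sample line are made up for illustration):

```python
# Characters to delete, taken from the question: [ ] ' "
DROP = str.maketrans("", "", "[]'\"")

def clean_line(line):
    """Delete the unwanted characters, then collapse whitespace runs to one space."""
    return " ".join(line.translate(DROP).split())

# Shortened sample row (an assumption based on the question's data).
line = """['NAME' " 'RA_I1'" " 'DEC_I1'"]"""
result = clean_line(line)
print(result)
```

The split/join step is what avoids the double spacing mentioned in the question.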

Related

.replace('\n','') not working to remove \n from string that is taken from pandas df

In the following string
SHANTELL'S CHANNEL - https://www.youtube.com/shantellmartin\nCANDICE - https://www.lovebilly.com\n\nfilmed this video in 4k on this -- http://amzn.to/2sTDnRZ\nwith this lens -- http://amzn.to/2rUJOmD\nbig drone - http://amzn.to/2o3GLX5\nSony CAMERA http://amzn.to/2nOBmnv\nOLD CAMERA; http://amzn.to/2o2cQBT\nMAIN LENS; http://amzn.to/2od5gBJ\nBIG SONY CAMERA; http://amzn.to/2nrdJRO\nBIG Canon CAMERA; on http://instagram.com/caseyneistat\non https://www.facebook.com/cneistat\non https://twitter.com/CaseyNeistat\n\namazing intro song by https://soundcloud.com/discoteeth\n\nad disclosure. THIS IS NOT AN AD. not selling or promoting anything. but samsung did produce the Shantell Video as a 'GALAXY PROJECT' which is an initiative that enables creators like Shantell and me to make projects we might otherwise not have the opportunity to make. hope that's clear. if not ask in the comments and i'll answer any specifics.
I am trying to remove any \n. This string is accessed from a pandas df. The solution I have tried is:
i = str(i).replace("\n", "")
The original code looks like:
for i in data["description"]:
    print(i)
    i = str(i).replace("\n", "")
    i = str(i).split(" ")
    for x in i:
        x = x.replace("\n", "")
        print(x)
where data is the df that stores all of the data from the csv file, and description is the column where the string is taken out of.
I suspect that the failure of replace() to work is due to the string being from a df, as when I try it with just a regular string
x = "a \n\n string"
.replace() works just fine. Any reason why taking strings from a df causes replace to fail? Thanks.
pandas keeps the string methods of a Series (for example, a DataFrame column) a bit hidden behind the .str accessor. Something like df["column_name"].str.replace("\n", "") should work, and I'd recommend the pandas documentation below to learn more.
https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#string-methods
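This should work (assign the result back, since str.replace returns a new Series):
df["description"] = df["description"].str.replace("\n", "")
Or you could use either of the following if you want to do this for the entire df. Note that regex=True is needed here; without it, DataFrame.replace only matches cells whose whole value equals "\n":
df = df.replace("\n", "", regex=True)
df.replace("\n", "", regex=True, inplace=True)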
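A small sketch of the column-level approach, assuming pandas is installed. The two sample rows are invented: one holds a real newline, the other holds a backslash followed by the letter n. If print() shows a visible \n, as in the question, the text may well be the second kind, which "\n" does not match.

```python
import pandas as pd

# Invented sample data; "description" is the column name from the question.
df = pd.DataFrame({"description": ["line one\nline two", "literal\\nbackslash-n"]})

# Real newline characters: plain (non-regex) substring replacement.
cleaned = df["description"].str.replace("\n", " ", regex=False)

# A backslash followed by "n" (two characters), as can appear in scraped text.
cleaned = cleaned.str.replace("\\n", " ", regex=False)

# str.replace returns a new Series, so assign it back to keep the result.
df["description"] = cleaned
print(df["description"].tolist())
```

Replacing with a space rather than an empty string keeps the words on either side of the break apart; use "" if you really want them joined.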

How to replace string and exclude certain changing integers?

I am trying to replace
'AMAT_0000006951_10Q_20200726_Filing Section: Risk'
with:
'AMAT 10Q Filing Section: Risk'
However, everything up until Filing Section: Risk will be constantly changing, except for positioning. I just want to pull the characters from position 0 to 5 and from 15 through 19.
df['section'] = df['section'].str.replace(
I'd like to manipulate this but not sure how?
Any help is much appreciated!
Given your series as s
s.str.slice(0, 5) + s.str.slice(15, 19) # if substring-ing
s.str.replace(r'\d{5}', '', regex=True) # for a 5-length digit string
You may need to adjust your numbers to index properly. If that doesn't work, you probably want to use a regular expression to get rid of some length of numbers (as above, with the example of 5).
Or in a single line to produce the final output you have above:
s.str.replace(r'\d{10}_|\d{8}_', '', regex=True).str.replace('_', ' ')
Though, it might not be wise to replace the underscores. Instead, if they change, explode the data into various columns which can be worked on separately.
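As a rough illustration of splitting the value into separate fields, here is a plain-Python sketch. The field names (ticker, filer_id, form, filed, section) are guesses at what the pieces mean, not something stated in the question.

```python
s = 'AMAT_0000006951_10Q_20200726_Filing Section: Risk'

# maxsplit=4 keeps any underscores inside the last field intact.
ticker, filer_id, form, filed, section = s.split('_', 4)

label = f'{ticker} {form} {section}'
print(label)
```

On a pandas column, Series.str.split('_', n=4, expand=True) should give the same pieces as separate columns.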
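If you want to replace characters at fixed positions, use str.slice_replace. Each call replaces one slice, so chain two calls, with the second counting positions in the already shortened string:
df['section'] = df['section'].str.slice_replace(4, 16, ' ').str.slice_replace(8, 18, ' ')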
Other people would probably use regex to replace pieces in your string. However, I would:
1) Split the string
2) Append the piece if it isn't a number
3) Join the remaining data
Like so:
s = 'AMAT_0000006951_10Q_20200726_Filing Section: Risk'
n = []
for i in s.split('_'):
    try:
        i = int(i)
    except ValueError:
        n.append(i)
print(' '.join(n))
AMAT 10Q Filing Section: Risk
Edit:
Re-reading your question, if you are just looking to substring:
Grabbing the first 4 characters:
s = 'AMAT_0000006951_10Q_20200726_Filing Section: Risk'
print(s[:4]) # indices 0 to 3, the first 4 characters: AMAT
print(s[15:19]) # indices 15 to 18: _10Q
print(s[15:]) # index 15 to the end
If you would like to just replace pieces:
print(s.replace('_', ' '))
You could put this in one line as well:
print((s[:4] + s[15:19] + s[28:]).replace('_', ' '))
AMAT 10Q Filing Section: Risk

Python string grouping?

Basically, I print a long message but I want to group all of those words into 5 character long strings.
For example "iPhone 6 isn’t simply bigger — it’s better in every way. Larger, yet dramatically thinner." I want to make that
"iPhon 6isn' tsimp lybig ger-i t'sbe terri never yway. Large r,yet drama tical lythi nner. "
As suggested by @vaultah, this is achieved by splitting the string on whitespace and joining the pieces back without spaces, then using a for loop to append the result of a slice operation to a list. An elegant solution is to use a comprehension.
text = "iPhone 6 isn’t simply bigger — it’s better in every way. Larger, yet dramatically thinner."
joined_text = ''.join(text.split())
chunks_of_five = [joined_text[i:i+5] for i in range(0, len(joined_text), 5)]
' '.join(chunks_of_five)
I'm sure you can use the re module to get back dashes and apostrophes as they're meant to be
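For the dashes and apostrophes, one possible reading is that the typographic characters should become plain ASCII before grouping. A sketch of that using str.translate rather than re (group_chars and ASCII_PUNCT are made-up names, and the mapping covers only the two characters in the example):

```python
# Map the typographic apostrophe and em dash to ASCII (assumed to be the intent).
ASCII_PUNCT = str.maketrans({"\u2019": "'", "\u2014": "-"})

def group_chars(text, size=5):
    """Drop all whitespace, then return the text in space-separated groups of `size`."""
    joined = "".join(text.translate(ASCII_PUNCT).split())
    return " ".join(joined[i:i + size] for i in range(0, len(joined), size))

text = ("iPhone 6 isn\u2019t simply bigger \u2014 it\u2019s better in every way. "
        "Larger, yet dramatically thinner.")
grouped = group_chars(text)
print(grouped)
```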
Simply do the following.
import re
sentence = "iPhone 6 isn't simply bigger - it's better in every way. Larger, yet dramatically thinner."
sentence = re.sub(' ', '', sentence)
count = 0
new_sentence = ''
for i in sentence:
    if count % 5 == 0 and count != 0:
        new_sentence = new_sentence + ' '
    new_sentence = new_sentence + i
    count = count + 1
print(new_sentence)
Output:
iPhon e6isn 'tsim plybi gger- it'sb etter ineve ryway .Larg er,ye tdram atica llyth inner .

Python image creation loops

This code below creates a 1D image of a race track:
def displayTrack(position):
    output = ''  #value given to output
    track = [' '] * 20  # track is initially just a bunch of empty spaces
    track[position] = 'r'  #AND track also contains an r icon
    print(' -' * 20)  #these are the top and bottom borders
    print('0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J')  #these represent each individual cell
    for i in range(len(track)):
        output = output + track[i] + '|'  #append a "|" before and after each empty space " "
    print(output)  #print the result
    print(' -' * 20)
If you run this code you will be able to view the image. If you look at the character "r", you will see that there is a "|" character to its right. I need to put a "|" on the left side of the runner as well. I need to use a method similar to the one above, because the initial states of many of the variables and the image depend on other variables, etc.
I know the problem lies in the fact that output = ''. If output started as something other than an empty string, the image would display properly, but I do not know how to do that. Can someone please give me a hand? Any and all help is appreciated.
If anything is unclear, please let me know and I will change it as soon as possible.
EDIT: So I figured out that the new code should look something like this: There are 3 changes:
1) output='|' instead of ''
2) in the strings that contain the hyphens as well as the alphanumeric characters, the space at the end is moved to the beginning instead. This fixes all the problems.
Is this what you want ? It is unclear, since your original layout is strange.
def displayTrack(position):
    output = '|'  #value given to output
    track = [' '] * 20  # track is initially just a bunch of empty spaces
    track[position] = 'r'  #AND track also contains an r icon
    print(' -' * 20)  #these are the top and bottom borders
    print(' 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J')  #these represent each individual cell
    for i in range(len(track)):
        output = output + track[i] + '|'  #append a "|" before and after each empty space " "
    print(output)  #print the result
    print(' -' * 20)
Your comment #append a "|" before and after each empty space " " is misleading. What the statement before it does is add a part of the track and a "|". It doesn't check whether the character is a space, and it doesn't put anything before it. The only reason there are |'s before the spaces is that each one follows a position which has a | after it.
To put something before the rest, start with output = '|' instead of ''. You may want to put an extra space before the other lines as well in that case, to keep things lined up. For example: print (' ' + ' -' * 20)
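If building the row in a loop isn't a hard requirement, a different technique is str.join, which puts the separators between cells so only the two outer bars have to be added by hand. A minimal sketch (render_track is a made-up helper that returns the row instead of printing it):

```python
def render_track(position, length=20):
    """Return one row of the track with a '|' on both sides of every cell."""
    track = [' '] * length
    track[position] = 'r'
    # join() places a bar between neighbouring cells; add the two outer bars.
    return '|' + '|'.join(track) + '|'

row = render_track(2, length=5)
print(row)
```

The border and label lines would still need the extra leading space to stay lined up.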

Python parsing

I'm trying to parse the title tag in an RSS 2.0 feed into three different variables for each entry in that feed. Using ElementTree I've already parsed the RSS so that I can print each title [minus the trailing )] with the code below:
feed = getfeed("http://www.tourfilter.com/dallas/rss/by_concert_date")
for item in feed:
    print repr(item.title[0:-1])
I include that because, as you can see, the item.title is a repr() data type, which I don't know much about.
A particular repr(item.title[0:-1]) printed in the interactive window looks like this:
'randy travis (Billy Bobs 3/21'
'Michael Schenker Group (House of Blues Dallas 3/26'
The user selects a band and I hope to, after parsing each item.title into 3 variables (one each for band, venue, and date... or possibly an array or I don't know...) select only those related to the band selected. Then they are sent to Google for geocoding, but that's another story.
I've seen some examples of regex and I'm reading about them, but it seems very complicated. Is it? I thought maybe someone here would have some insight as to exactly how to do this in an intelligent way. Should I use the re module? Does it matter that the output is currently a series of repr()s? Is there a better way? I was thinking I'd use a loop like this (it's my pseudoPython, just notes I'm writing):
list = bandRaw,venue,date,latLong
for item in feed:
    parse item.title for bandRaw, venue, date
    if bandRaw == str(band)
        send venue name + ", Dallas, TX" to google for geocoding
        return lat,long
        list = list + return character + bandRaw + "," + venue + "," + date + "," + lat + "," + long
    else
In the end, I need to have the chosen entries in a .csv (comma-delimited) file looking like this:
band,venue,date,lat,long
randy travis,Billy Bobs,3/21,1234.5678,1234.5678
Michael Schenker Group,House of Blues Dallas,3/26,4321.8765,4321.8765
I hope this isn't too much to ask. I'll be looking into it on my own, just thought I should post here to make sure it got answered.
So, the question is, how do I best parse each repr(item.title[0:-1]) in the feed into the 3 separate values that I can then concatenate into a .csv file?
Don't let regex scare you off... it's well worth learning.
Given the examples above, you might try putting the trailing parenthesis back in, and then using this pattern:
import re
s = 'Michael Schenker Group (House of Blues Dallas 3/26)'
pat = re.compile(r'([\w\s]+)\(([\w\s]+)(\d+/\d+)\)')
info = pat.match(s)
print(info.groups())
('Michael Schenker Group ', 'House of Blues Dallas ', '3/26')
To get at each group individually, just call group() on the info object (strip() drops the trailing space that [\w\s]+ captures):
print(info.group(1)) # or info.groups()[0]
print('"%s","%s","%s"' % tuple(g.strip() for g in info.groups()))
"Michael Schenker Group","House of Blues Dallas","3/26"
The hard thing about regex in this case is making sure you know all the known possible characters in the title. If there are non-alpha chars in the 'Michael Schenker Group' part, you'll have to adjust the regex for that part to allow them.
The pattern above breaks down as follows, which is parsed left to right:
([\w\s]+) : Match any word or space characters (the plus symbol indicates that there should be one or more such characters). The parentheses mean that the match will be captured as a group. This is the "Michael Schenker Group " part. If there can be numbers and dashes here, you'll want to modify the pieces between the square brackets, which are the possible characters for the set.
\( : A literal parenthesis. The backslash escapes the parenthesis, since otherwise it counts as a regex command. This is the "(" part of the string.
([\w\s]+) : Same as the one above, but this time matches the "House of Blues Dallas " part. In parentheses so they will be captured as the second group.
(\d+/\d+) : Matches the digits 3 and 26 with a slash in the middle. In parentheses so they will be captured as the third group.
\) : Closing parenthesis for the above.
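The same breakdown can be kept next to the pattern itself with re.VERBOSE, which ignores whitespace in the pattern and allows comments. A small sketch using the example title with its closing parenthesis restored (TITLE is just an illustrative name):

```python
import re

TITLE = re.compile(r"""
    ([\w\s]+)    # group 1: band name (word characters and whitespace)
    \(           # literal opening parenthesis
    ([\w\s]+)    # group 2: venue
    (\d+/\d+)    # group 3: date such as 3/26
    \)           # literal closing parenthesis
""", re.VERBOSE)

m = TITLE.match('Michael Schenker Group (House of Blues Dallas 3/26)')

# Groups 1 and 2 capture a trailing space, so strip each piece.
band, venue, date = (part.strip() for part in m.groups())
print(band, venue, date, sep=' | ')
```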
The python intro to regex is quite good, and you might want to spend an evening going over it http://docs.python.org/library/re.html#module-re. Also, check Dive Into Python, which has a friendly introduction: http://diveintopython3.ep.io/regular-expressions.html.
EDIT: See zacherates below, who has some nice edits. Two heads are better than one!
Regular expressions are a great solution to this problem:
>>> import re
>>> s = 'Michael Schenker Group (House of Blues Dallas 3/26'
>>> re.match(r'(.*) \((.*) (\d+/\d+)', s).groups()
('Michael Schenker Group', 'House of Blues Dallas', '3/26')
As a side note, you might want to look at the Universal Feed Parser for handling the RSS parsing as feeds have a bad habit of being malformed.
Edit
In regards to your comment... The strings occasionally being wrapped in "s rather than 's has to do with the fact that you're using repr. The repr of a string is usually delimited with 's, unless that string contains one or more 's, where instead it uses "s so that the 's don't have to be escaped:
>>> "Hello there"
'Hello there'
>>> "it's not its"
"it's not its"
Notice the different quote styles.
Regarding the repr(item.title[0:-1]) part, not sure where you got that from but I'm pretty sure you can simply use item.title. All you're doing is removing the last char from the string and then calling repr() on it, which does nothing.
Your code should look something like this:
import re
import feedparser # from www.feedparser.org
from geopy import geocoders # from GeoPy
us = geocoders.GeocoderDotUS()
feedurl = "http://www.tourfilter.com/dallas/rss/by_concert_date"
feed = feedparser.parse(feedurl)
lines = []
for entry in feed.entries:
    m = re.search(r'(.*) \((.*) (\d+/\d+)\)', entry.title)
    if m:
        bandRaw, venue, date = m.groups()
        if band == bandRaw: # band is the name the user selected
            place, (lat, lng) = us.geocode(venue + ", Dallas, TX")
            lines.append(",".join([band, venue, date, str(lat), str(lng)])) # join needs strings
result = "\n".join(lines)
EDIT: replaced list with lines as the var name. list is a builtin and should not be used as a variable name. Sorry.
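For the .csv output, the csv module takes care of quoting if a venue name ever contains a comma. A rough sketch that leaves out the feed download and the geocoding step, so it only writes band, venue and date (titles_to_csv is a made-up helper and the titles are the two examples from the question):

```python
import csv
import io
import re

TITLE = re.compile(r'(.*) \((.*) (\d+/\d+)\)')

def titles_to_csv(titles, band):
    """Return CSV text with one band,venue,date row per title matching `band`."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator='\n')
    writer.writerow(['band', 'venue', 'date'])
    for title in titles:
        m = TITLE.search(title)
        if m and m.group(1) == band:
            writer.writerow(m.groups())
    return out.getvalue()

titles = ['randy travis (Billy Bobs 3/21)',
          'Michael Schenker Group (House of Blues Dallas 3/26)']
csv_text = titles_to_csv(titles, 'randy travis')
print(csv_text)
```

To write a real file, the same writer can be pointed at open(path, 'w', newline='') instead of the StringIO buffer.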
