Remove text:u from strings in Python

I am using the xlrd library to import values from an Excel file into a Python list.
The Excel file has a single column, and I am extracting the data row by row.
The problem is that the data I get in the list looks like this:
list = ["text:u'__string__'","text:u'__string__'",.....so on]
How can I remove this text:u to get a plain list of strings?
My code (Python 2.7):
from xlrd import open_workbook

book = open_workbook("blabla.xlsx")
sheet = book.sheet_by_index(0)
documents = []
for row in range(1, 50): #start from 1, to leave out row 0
    documents.append(sheet.cell(row, 0)) #extract from first col
data = [str(r) for r in documents]
print data

Iterate over the items and remove the extra characters from each one:
s = []
for x in list:
    s.append(x[7:-1]) # Slice from index 7 up to, but not including, the last character

If that's the standard input list you have, you can do it with a simple split:
[s.split("'")[1] for s in list]
# if the string itself contains "'", a regex is the safer choice
import re
[re.findall(r"u'(.*)'", s)[0] for s in list]
# Output
# ['__string__', '__string__']

I had the same problem. The following code helped me.
list = ["text:u'__string__'","text:u'__string__'",.....so on]
for index, item in enumerate(list):
    list[index] = list[index][7:] #Deletes first 7 characters
    list[index] = list[index][:-1] #Deletes last character
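The slicing answers depend on the prefix always being exactly seven characters long. Here is a slightly more defensive sketch of the same cleanup. The helper name strip_cell_repr is made up for illustration, and it assumes the strings look like the ones in the question (a "text:" label followed by a quoted Python 2 repr). As far as I know, xlrd cells also expose their content through the .value attribute, so reading sheet.cell(row, 0).value instead of calling str() on the cell should avoid the prefix in the first place.

```python
import re

# Hypothetical helper (not part of xlrd): pull the text out of strings such as
# "text:u'__string__'", which is what str() produces for an xlrd cell on Python 2.
_CELL_REPR = re.compile(r"^text:u?(['\"])(.*)\1$", re.DOTALL)

def strip_cell_repr(item):
    match = _CELL_REPR.match(item)
    # Anything that does not look like a cell repr is returned unchanged.
    return match.group(2) if match else item

raw = ["text:u'__string__'", "text:u'it's here'", "plain"]
cleaned = [strip_cell_repr(item) for item in raw]
print(cleaned)
```

Unlike split("'"), the pattern only treats the first and last quote as delimiters, so quotes inside the text survive.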

Related

Group Many Lists in Twos

I currently have several lists with one element inside each. Is there a way Python can combine the first two lists into one? My attempt is the last for loop in the code below. As the actual output shows, it only duplicates the element and never picks up the second one. I need the first and second elements to be listed together.
Please note that the post I made earlier is not a duplicate of the post mentioned there. The post the moderator suggested answers how to split a SINGLE list into even chunks, whereas I am asking how to group many lists in twos.
Basically, I am opening a file, looking for the values between the strings 'cdc or dcc\s(space)', and returning those values. I then want to compare each value with the string that comes next.
text.txt
^random binary characters d1234 d0123456789d 1234c null null null d34 dc49416494949 c3456
output:
['d1234d0123456789d1234c']
['d34dc49416494949c3456']
expected output:
['d1234d0123456789d1234c','d34dc49416494949c3456']
code:
import re
from itertools import zip_longest

micr_ocr_dat_l = []
with open('text.txt', 'r', encoding="ISO-8859-1") as in_file:
    data = in_file.readlines()
    for row in data:
        micr_ocr_line = re.findall(r'd[^d]*d[^d]*c[0-9]+|d[^d]*d[^d]*c\s+[0-9]+', row)
        for r in micr_ocr_line:
            rmve_spcl_char = re.sub(r'([^a-zA-Z-0-9]+?)', '', r)
            rmve_spcl_char = re.sub(r'(c\d{4,}).*', r'\1', rmve_spcl_char)
            a = [l for l in rmve_spcl_char.split('\n')]
            for previous, current in zip_longest(a, a[::1]):
                print(previous, current)
                micr_ocr_dat = [previous, current]
                print(micr_ocr_dat)
                micr_ocr_dat_l.append(micr_ocr_dat)
You've almost got it, I think. Just slice slightly differently:
a = [l for l in rmve_spcl_char.split('\n')]
for previous, current in zip(a[::2], a[1::2]):
    print(previous, current)
    micr_ocr_dat = [previous, current]
    print(micr_ocr_dat)
    micr_ocr_dat_l.append(micr_ocr_dat)
By the way, you should use len() to get the length of the list and pass it to range() in the for loop.
Here is a way to group the lines in twos:
for fp in dat_filepath:
    with open(fp, 'r', encoding="ISO-8859-1") as in_file:
        data = in_file.readlines()
    listresult = []
    for a in range(0, len(data), 2):
        listresult.append(data[a:a + 2]) # slicing avoids an IndexError when the number of lines is odd
    print(listresult)
listresult should then hold the data you expected.
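In the question, each row produces its own one-element list, so one way to read the answers above is: first collect every match into a single flat list, then group that list in twos. A minimal sketch of the grouping step follows. The name pair_up is made up for illustration, and the sample values are copied from the question's expected output.

```python
def pair_up(items):
    """Group a flat list into consecutive pairs; an odd leftover stays alone."""
    return [items[i:i + 2] for i in range(0, len(items), 2)]

# Values taken from the question's expected output.
matches = ['d1234d0123456789d1234c', 'd34dc49416494949c3456']
print(pair_up(matches))

# With an odd number of items the last group holds a single element.
print(pair_up(['a', 'b', 'c']))
```

Slicing is used instead of indexing with i + 1 so that an odd-length list does not raise an IndexError.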

Excel cell into list in Python

So I have an Excel column which contains Python lists.
The problem is that when I loop through it in Python, it reads the cells as str. Attempting to split a cell produces list items such as:
list = ["['Gdynia',", "'(2262011)']"]
list[0] = "['Gdynia',"
list[1] = "'(2262011)']"
I only want to get the city name, e.g. 'Gdynia' or 'Tczew'. Any idea how I can do that?
You can split the string at a chosen symbol; ' would work well for your example.
You then get a list of strings and can choose the part you need.
s = "['Gdynia', '(2262011)']"
s_parts = s.split("'") #['[', 'Gdynia', ', ', '(2262011)', ']']
city = s_parts[1] #'Gdynia'
Solution with re:
import re
data = ["['Gdynia', '(2262011)']",
        "['Tczew', '(2214011)']",
        "['Zory', '(2479011)']"]
r = re.compile("'(.*?)'")
print(*[r.search(s).group(1) for s in data], sep='\n')
Output
Gdynia
Tczew
Zory
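Since each cell holds the text of a Python list, another option (not one the answers above use) is to turn the string back into a real list with ast.literal_eval from the standard library and take the first element. This is only a sketch and assumes every cell is a well-formed list literal, as in the examples above.

```python
import ast

cells = ["['Gdynia', '(2262011)']",
         "['Tczew', '(2214011)']",
         "['Zory', '(2479011)']"]

cities = []
for cell in cells:
    # literal_eval only accepts Python literals, so it does not execute arbitrary code.
    parsed = ast.literal_eval(cell)
    cities.append(parsed[0])

print(cities)
```

If a cell is not a valid literal, literal_eval raises an exception (ValueError or SyntaxError), so malformed rows would need their own handling.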

Parsing a list of lists and manipulating it in place

So I have a list of lists that I need to parse through and manipulate the contents of. There are strings of numbers and words in the sublists, and I want to change the numbers into integers. I don't think it's relevant but I'll mention it just in case: my original data came from a CSV that I split on newlines, and then split again on commas.
What my code looks like:
def prep_data(data):
    list = data.split('\n') #Splits data on newline
    list = list[1:-1] #Gets rid of header and last row, which is an empty string
    prepped = []
    for x in list:
        prepped.append(x.split(','))
    for item in prepped: #Converts the item into an int if it is able to be converted
        for x in item:
            try:
                item[x] = int(item[x])
            except:
                pass
    return prepped
I tried to loop through every sublist in prepped and change the type of the values in them, but it doesn't seem like the loop does anything as the prep_data returns the same thing as it did before I implemented that for loop.
I think I see what is wrong: you are assuming Python is more generous with its assignment than it actually is.
def prep_data(data):
    list = data.split('\n') #Splits data on newline
    list = list[1:-1] #Gets rid of header and last row, which is an empty string
    prepped = []
    for x in list:
        prepped.append(x.split(','))
    for i in range(len(prepped)): #Converts the item into an int if it is able to be converted
        item = prepped[i]
        for j in range(len(item)): #j is a position, not a value
            try:
                item[j] = int(item[j])
            except ValueError:
                pass
        prepped[i] = item
    return prepped
I can't run this on the machine I'm on right now, but the problem seems to be that "prepped" never actually received any new values: in "for x in item", x is the value itself rather than its position, so item[x] raised a TypeError that the bare except silently swallowed.
I'm not sure about your function, because maybe I didn't understand your input data, but you could try something like the following. If you only pass in the except branch, you could lose strings or unusual data:
def parse_data(raw_data):
    data_lines = raw_data.split('\n') #Splits data on newline
    data_rows_without_header = data_lines[1:-1] #Gets rid of header and last row, which is an empty string
    parsed_data = []
    for raw_row in data_rows_without_header:
        split_row = raw_row.split(',')
        parsed_row = []
        for value in split_row:
            try:
                parsed_row.append(int(value))
            except ValueError:
                print("The value '{}' is not castable".format(value))
                parsed_row.append(value) # if cast fails, add the string as it is
        parsed_data.append(parsed_row)
    return parsed_data
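The same idea can be written more compactly by moving the conversion into a small helper and building the result with a nested list comprehension. This is only a sketch: the names to_int_if_possible and prep_rows and the sample CSV text are made up for illustration, and it keeps the question's assumption that the first line is a header and the last line is empty.

```python
def to_int_if_possible(value):
    """Return int(value) when the string is a valid integer, else the string itself."""
    try:
        return int(value)
    except ValueError:
        return value

def prep_rows(raw_data):
    # Drop the header and the trailing empty string, as in the question.
    lines = raw_data.split('\n')[1:-1]
    return [[to_int_if_possible(v) for v in line.split(',')] for line in lines]

# Made-up sample data, only to show the shape of the result.
sample = "name,age,score\nalice,30,7\nbob,x,12\n"
print(prep_rows(sample))
```

Building new rows instead of assigning into the existing ones sidesteps the index-versus-value confusion from the original loop.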

Getting slashes and letters while I only want number

I am using the following code to bring back prices from an ecommerce website:
response.css('div.price.regularPrice::text').extract()
but getting the following result:
'\r\n\t\t\tDhs 5.00\r\n\t\t\t\t\t\t\t\t',
'\r\n\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t',
I do not want the slashes and letters, only the number 5. How do I get this?
First you can use strip() to remove the tabs "\t", carriage returns "\r" and newlines "\n".
data = ['\r\n\t\t\tDhs 5.00\r\n\t\t\t\t\t\t\t\t',
'\r\n\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t']
data = [item.strip() for item in data]
and you get
['Dhs 5.00', '']
Next you can use if to skip empty elements
data = [item for item in data if item]
and you get
['Dhs 5.00']
If item always has the same structure, Dhs XXX.00,
then you can use slicing [4:-3] to remove "Dhs " and ".00"
data = [item[4:-3] for item in data]
and you get
['5']
So now you only have to take the first element, data[0], to get '5'.
If you need to, you can convert the string "5" to the integer 5 using int()
result = int(data[0])
You can even combine all the steps
data = ['\r\n\t\t\tDhs 5.00\r\n\t\t\t\t\t\t\t\t',
'\r\n\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t']
data = [item.strip()[4:-3] for item in data if item.strip()]
result = int(data[0])
If you only ever need the first element of the list, you can write
data = ['\r\n\t\t\tDhs 5.00\r\n\t\t\t\t\t\t\t\t',
'\r\n\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t']
result = int( data[0].strip()[4:-3] )
Use a regex to fetch only the numbers.
The pattern \d+ should do the trick.
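As a rough sketch of the regex approach: note that re.findall(r'\d+', 'Dhs 5.00') returns ['5', '00'], so the pattern below also accepts an optional decimal part to keep the price in one piece. Converting to float is my own choice here; use int(float(...)) if you really want the integer 5.

```python
import re

data = ['\r\n\t\t\tDhs 5.00\r\n\t\t\t\t\t\t\t\t',
        '\r\n\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t']

prices = []
for item in data:
    # Digits, optionally followed by a dot and more digits.
    match = re.search(r'\d+(?:\.\d+)?', item)
    if match:  # whitespace-only entries have no match and are skipped
        prices.append(float(match.group()))

print(prices)
```

This does not depend on the currency prefix having a fixed length, unlike the [4:-3] slice.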

Slicing a string into a list based on recurring patterns

I have a long string variable full of hex values:
hexValues = 'AA08E3020202AA08E302AA1AA08E3020101' etc..
The first 2 bytes (AA08) are a signature for the start of a frame, and the rest of the data up to the next AA08 is the content of that frame.
I want to slice the string into a list based on the recurring start-of-frame signature, e.g.:
list = [AA08, E3020202, AA08, F25S1212, AA08, 42ABC82] etc...
I'm not sure how I can split the string up like this. Some of the frames are also corrupted, where the start of the frame won't have AA08, but maybe AA01.. so I'd need some kind of regex to spot these.
If I do list = hexValues.split('AA08'), the list just removes all the starts of the frames...
So I'm a bit stuck.
Newbie to python.
Thanks
For the case when you don't have "corrupted" data the following should do:
hex_values = 'AA08E3020202AA08E302AA1AA08E3020101'
delimiter = hex_values[:4]
hex_values = hex_values.replace(delimiter, ',' + delimiter + ',')
hex_list = hex_values.split(',')[1:]
print(hex_list)
['AA08', 'E3020202', 'AA08', 'E302AA1', 'AA08', 'E3020101']
Without considering corruption, you may try this:
l = []
for s in hexValues.split('AA08'):
    if s:
        l += ['AA08', s]
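Neither answer covers the corrupted signatures mentioned in the question. One possible sketch uses re.split with a capturing group, which keeps the matched delimiters in the result. The pattern is an assumption on my part: it treats "AA0" followed by any one hex digit as a start of frame, which covers AA08 and AA01 but would also split on the same four characters if they happened to appear inside a frame's payload. The function name split_frames is made up for illustration.

```python
import re

# Assumption: a start-of-frame marker is "AA0" plus one hex digit (AA08, AA01, ...).
# The capturing group makes re.split keep the markers in the output.
FRAME_START = re.compile(r'(AA0[0-9A-F])')

def split_frames(hex_values):
    # re.split yields an empty string before a leading marker; drop empties.
    return [part for part in FRAME_START.split(hex_values) if part]

print(split_frames('AA08E3020202AA08E302AA1AA08E3020101'))
print(split_frames('AA08E302AA01FFFF'))
```

For the uncorrupted sample string this gives the same list as the replace-and-split answer above.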
