python split by "\t" is not showing all elements in it - python

I am trying to split by "\t" but it is not printing all the elements in it
import sys
reload(sys)
sys.setdefaultencoding('utf8')
s = ['A\t"Ravi"\t"Tirupur"\t"India"\t"641652"\t"arunachalamravi#gmail.com"\t"17379602"\t"+ 2"\t"Government Higher Secondary School', ' Tiruppur"\t\t"1989"\t"Maths',' Science"\t"No"\t"Encotec Energy 2 X 600 MW ITPCL"\t"Associate Vice President- Head Maintenance"\t"2015"\t"2016"\t"No"\t"27-Mar-2017"\t"9937297878"\t\t"2874875"\t"Submitted"\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t']
print s[0].split("\t")
Results
['A', '"Ravi"', '"Tirupur"', '"India"', '"641652"', '"arunachalamravi#gmail.com"', '"17379602"', '"+ 2"', '"Government Higher Secondary School']
But I want results up to this:
2874875, Submitted
How to fix the code and where is the change?

You have more than one item in your list, so s[0] only gives you the first one. Either fix the list or join its items before splitting, like this:
joined_string = ''.join(s)
print joined_string.split("\t")
It should work
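As a rough Python 3 sketch of the same idea: the list below is a shortened, made-up stand-in for the data in the question, and joining with "," assumes the fragments came from splitting one line on commas (use '' as the separator if that assumption does not hold).

```python
# Hypothetical, shortened stand-in for the list in the question.
s = ['A\t"Ravi"\t"School', ' Tiruppur"\t"1989"\t"Submitted"\t\t\t']

# s[0] is only the first fragment, so splitting it loses everything after it.
first_only = s[0].split("\t")

# Assumption: the fragments came from one line split on commas, so joining
# with "," restores that line. rstrip drops the trailing run of tabs.
fields = ",".join(s).rstrip("\t").split("\t")

print(first_only)
print(fields)
```

Splitting the joined string keeps the quoted school name together as a single field instead of cutting it at the comma.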

With your data you should do something like this:
s[2].split("\t")[10:12]

You could use itertools.chain.from_iterable() to build a single flat list from the split results of all the elements:
from itertools import chain
s = ['A\t"Ravi"\t"Tirupur"\t"India"\t"641652"\t"arunachalamravi#gmail.com"\t"17379602"\t"+ 2"\t"Government Higher Secondary School', ' Tiruppur"\t\t"1989"\t"Maths',' Science"\t"No"\t"Encotec Energy 2 X 600 MW ITPCL"\t"Associate Vice President- Head Maintenance"\t"2015"\t"2016"\t"No"\t"27-Mar-2017"\t"9937297878"\t\t"2874875"\t"Submitted"\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t']
result = list(chain.from_iterable(x.rstrip('\t').split('\t') for x in s))
print result
This would give you all of the split entries, and remove the trailing tabs from the end:
['A', '"Ravi"', '"Tirupur"', '"India"', '"641652"', '"arunachalamravi#gmail.com"', '"17379602"', '"+ 2"', '"Government Higher Secondary School', ' Tiruppur"', '', '"1989"', '"Maths', ' Science"', '"No"', '"Encotec Energy 2 X 600 MW ITPCL"', '"Associate Vice President- Head Maintenance"', '"2015"', '"2016"', '"No"', '"27-Mar-2017"', '"9937297878"', '', '"2874875"', '"Submitted"']
If you also want to get rid of the quotes, then use this instead:
result = [v.strip('"') for v in chain.from_iterable(x.rstrip('\t').split('\t') for x in s)]
Giving you:
['A', 'Ravi', 'Tirupur', 'India', '641652', 'arunachalamravi#gmail.com', '17379602', '+ 2', 'Government Higher Secondary School', ' Tiruppur', '', '1989', 'Maths', ' Science', 'No', 'Encotec Energy 2 X 600 MW ITPCL', 'Associate Vice President- Head Maintenance', '2015', '2016', 'No', '27-Mar-2017', '9937297878', '', '2874875', 'Submitted']

Related

Python Lists of Lists Converting to Dict to Dataframe

I am converting a large list of lists into a dictionary, but my code only works for the first item in the list of lists.
a_list = [[('Bedrooms', ' 4'),
('Street Address', ' 90 Lake '),
('Contact Phone', ' 970-xxx-xxxx'),
('Bathrooms', ' 5'),
('Price', ' $5,350,000'),
('Zip Code', ' 5000')],
[('Bedrooms', ' 4'),
('Street Address', ' 1490 Creek '),
('Contact Phone', ' 970-xxx-xxx3'),
('Bathrooms', ' 10'),
('Price', ' $7,350,000'),
('Zip Code', ' 6000'),
('City', ' Edwards'),
('Price1', ' 4200000')],
[('Street Address', ' 280 Lane'),
('Bedrooms', ' 2'),
('Property Type', ' Townhouse'),
('Square Feet', ' 3000'),
('Bathrooms', ' 4'),
('Contact Phone', ' 303-xxx-xxxx'),
('MLS', ' 66666'),
('Contact Name', ' C Name'),
('Brokerage', ' Real Estate'),
('City', 'Creek'),
('Zip Code', '89899'),
('Price1', ' 2100000'),
('Posted On', ' Nov 13, 2019')
]]
Current code only assigns k,v to 1st item:
items = {}
for line in list:
    for i in range(len(line)):
        key = line[i][0]
        value = line[i][1]
        items[key] = value
    items.update(line)
RESULT:
items = {'Bedrooms': ' 4',
'Street Address': ' 90 Lake ',
'Contact Phone': ' 970-xxx-xxxx',
'Bathrooms': ' 5',
'Price': ' $5,350,000',
'Zip Code': ' 5000'}
Ultimately, I want to build DataFrame matching keys and values from the list of lists.
There is a nicer way to do this by using map to convert each list to a dict and then calling the DataFrame constructor on it. Also, do not use built-ins as variable names, in this case list. I went ahead and renamed your input data as the variable data.
dicts = list(map(dict, data))
pd.DataFrame(dicts)
  Bathrooms Bedrooms    Brokerage  ... Square Feet Street Address Zip Code
0         5        4          NaN  ...         NaN        90 Lake     5000
1        10        4          NaN  ...         NaN     1490 Creek     6000
2         4        2  Real Estate  ...        3000       280 Lane    89899
[3 rows x 14 columns]
Something like this?
unpacked = [{k: v for k,v in one_list} for one_list in list_of_lists]
pd.DataFrame(unpacked)
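To see what the DataFrame constructor is being handed, here is a small standard-library sketch of the same unpacking. The two-row data list is a made-up miniature of the question's input, and None stands in for the NaN that pandas would show for missing keys.

```python
# Hypothetical miniature of the question's list of lists of (key, value) pairs.
data = [
    [('Bedrooms', ' 4'), ('Zip Code', ' 5000')],
    [('Bedrooms', ' 2'), ('City', 'Creek')],
]

# One dict per inner list, so rows never overwrite each other.
rows = [dict(pairs) for pairs in data]

# The union of all keys plays the role of the DataFrame columns.
columns = sorted({key for row in rows for key in row})

# Missing keys become None here (pandas would display NaN instead).
table = [[row.get(col) for col in columns] for row in rows]

print(columns)
print(table)
```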
A dictionary in Python is a data structure that stores key-value pairs, and each key is unique. Every time you add a key-value pair to the dictionary (for example with update), it does the following:
Checks if the key is present
If the key is present, it updates the value to the new value
If the key is not present, it adds the key-value pair to the dictionary
You could take a look at this link for better understanding of 'update'
https://python-reference.readthedocs.io/en/latest/docs/dict/update.html
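A tiny illustration of that behaviour, using made-up pairs in the style of the question's data:

```python
items = {}

# First batch: both keys are new, so both pairs are added.
items.update([('Bedrooms', ' 4'), ('Price', ' $5,350,000')])

# Second batch: 'Bedrooms' already exists, so its value is replaced;
# 'City' is new, so it is added.
items.update([('Bedrooms', ' 2'), ('City', 'Creek')])

print(items)
```

This is why feeding every inner list into one shared dictionary leaves only a single value per key.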
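Although there are easier ways to do this, the issue with your code is that every inner list is written into the same items dictionary, so repeated keys such as 'Bedrooms' are overwritten by later values. The last line, i.e.
items.update(line)
does not change that.
Instead of your code, you could use the code below (if you choose to proceed down this same approach, rather than the approach suggested by other answers):
new_list = []  # collects one dict per inner list
for line in a_list:  # a_list is the list of lists from the question
    items = {}  # start a fresh dict for each inner list
    for i in range(len(line)):
        key = line[i][0]
        value = line[i][1]
        items[key] = value
    new_list.append(items)  # use this line instead of your update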
then
import pandas as pd
pd.DataFrame(new_list)
This should give you the result that you are looking for.

How do I use regex to extract pair of values from text into lists?

I have a string like this:
string='<final:company name> abc. </final:company name> <final:number of employees> 143.</final:number of employees> <final: average salary> medium. </final: average salary>'
What I want to extract is all the pattern expression headings and then the values separately within the < >. So, I want: -company name, -number of employees, -average salary in one list maybe.
and I want to extract the values separately like: abc, 143, medium
When I code as follows:
regex='<final:(.*?)</final'
pattern=re.compile(regex)
finding=re.findall(pattern,string)
print(finding)
I get ['company name> abc. ', 'number of employees> 143.', ' average salary> medium. ']
Which is not exactly what I am looking for. How do I code this correctly?
You can use this regex:
regex = r'<final:([^>]*)>\s*([^<\s\.]*)'
Your group 1 will contain the tags i.e, company name, number of employees, average salary and group 2 will contain their values i.e, abc, 143, medium.
OUTPUT
>>> pattern=re.compile(regex)
>>> finding=re.findall(pattern,string)
>>> print(finding)
[('company name', 'abc'), ('number of employees', '143'), (' average salary', 'medium')]
To make 2 different lists from your finding, you can do something like this:
>>> tags = list(map(lambda x: x[0], finding))
>>> values = list(map(lambda x: x[1], finding))
>>> tags
['company name', 'number of employees', ' average salary']
>>> values
['abc', '143', 'medium']
or you can also use zip to convert it to two lists:
>>> tags, values = map(list, zip(*finding))
>>> tags
['company name', 'number of employees', ' average salary']
>>> values
['abc', '143', 'medium']
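Putting the pieces together, a self-contained sketch that runs on Python 3 could look like the following. It uses the regex from this answer and additionally strips the tag names, which is an extra step not shown above.

```python
import re

string = ('<final:company name> abc. </final:company name> '
          '<final:number of employees> 143.</final:number of employees> '
          '<final: average salary> medium. </final: average salary>')

# Group 1 captures the tag name, group 2 the value up to the first
# whitespace, '<' or full stop.
pairs = re.findall(r'<final:([^>]*)>\s*([^<\s.]*)', string)

# Strip the tag names so ' average salary' loses its leading space.
tags = [tag.strip() for tag, _ in pairs]
values = [value for _, value in pairs]

print(tags)
print(values)
```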
To allow whitespaces in the content and strip the whitespaces in the tag names you could do
import re
string='<final:company name> abc. </final:company name> <final:number of employees> 143.</final:number of employees> <final: average salary> medium. </final: average salary>'
rx = re.compile(r'''
<final:\s*
(?P<tag>[^>]+)>
(?P<content>[^<]+)
</final:\1>''', re.X)
results = {m.group('tag').strip(): m.group('content').strip() for m in rx.finditer(string)}
print(results)
# {'number of employees': '143.', 'company name': 'abc.', 'average salary': 'medium.'}
Afterwards, you will be able to access your elements like results['company name'].
Generally though this looks like some (invalid ?) XML file. If it was valid (and you just had some typos while copying to the question) consider using a real parser instead.

Regex - Splitting Strings at full-stops unless it's part of an honorific [closed]

I have a list containing all possible titles:
['Mr.', 'Mrs.', 'Ms.', 'Dr.', 'Prof.', 'Rev.', 'Capt.', 'Lt.-Col.', 'Col.', 'Lt.-Cmdr.', 'The Hon.', 'Cmdr.', 'Flt. Lt.', 'Brgdr.', 'Wng. Cmdr.', 'Group Capt.' ,'Rt.', 'Maj.-Gen.', 'Rear Admrl.', 'Esq.', 'Mx', 'Adv', 'Jr.']
I need Python 2.7 code that replaces every full stop \. with a newline \n unless it is part of one of the above titles.
Splitting it into a list of strings would be fine as well.
Sample Input:
Modi is waiting in line to Thank Dr. Manmohan Singh for preparing a road map for introduction of GST in India. The bill is set to pass.
Sample Output:
Modi is waiting in line to Thank Dr. Manmohan Singh for preparing a road map for introduction of GST in India.
The bill is set to pass.
This should do the trick. We append a \n to every word that contains a full stop and is not in the set of keywords; otherwise we just append a space.
Finally the words in the sentence are joined using join(), and we use rstrip() to eliminate any newline remaining at the end of the string.
l = set(['Mr.', 'Mrs.', 'Ms.', 'Dr.', 'Prof.', 'Rev.', 'Capt.', 'Lt.-Col.',
         'Col.', 'Lt.-Cmdr.', 'The Hon.', 'Cmdr.', 'Flt. Lt.', 'Brgdr.', 'Wng. Cmdr.',
         'Group Capt.', 'Rt.', 'Maj.-Gen.', 'Rear Admrl.', 'Esq.', 'Mx', 'Adv', 'Jr.'])
s = ('Modi is waiting in line to Thank Dr. Manmohan Singh for preparing a road '
     'map for introduction of GST in India. The bill is set to pass.')
def split_at_period(input_str, keywords):
    final = []
    split_l = input_str.split(' ')
    for word in split_l:
        # A word with a full stop that is not a known title ends a sentence.
        if '.' in word and word not in keywords:
            final.append(word + '\n')
            continue
        final.append(word + ' ')
    return ''.join(final).rstrip()
print split_at_period(s, l)
or a one liner :D
print ''.join([w + '\n' if '.' in w and w not in l else w + ' ' for w in s.split(' ')]).rstrip()
Sample output:
Modi is waiting in line to Thank Dr. Manmohan Singh for preparing a road map for introduction of GST in India.
The bill is set to pass.
How does it work?
Firstly we split up our string with a space ' ' delimiter using the split() string function, thus returning the following list:
>>> ['Modi', 'is', 'waiting', 'in', 'line', 'to', 'Thank', 'Dr.',
'Manmohan', 'Singh', 'for', 'preparing', 'a', 'road', 'map', 'for',
'introduction', 'of', 'GST', 'in', 'India.', 'The', 'bill', 'is',
'set', 'to', 'pass.']
We then start to build up a new list by iterating through the split-up list. If we see a word that contains a period, but is not a keyword, (Ex: India. and pass. in this case) then we have to concatenate a newline \n to the word to begin the new sentence. We can then append() to our final list, and continue out of the current iteration.
If the word does not end off a sentence with a period, we can just concatenate a space to rebuild the original string.
This is what final looks like before it is built as a string using join().
>>> ['Modi ', 'is ', 'waiting ', 'in ', 'line ', 'to ', 'Thank ', 'Dr. ',
'Manmohan ', 'Singh ', 'for ', 'preparing ', 'a ', 'road ', 'map ',
'for ', 'introduction ', 'of ', 'GST ', 'in ', 'India.\n', 'The ', 'bill ',
'is ', 'set ', 'to ', 'pass.\n']
Excellent, we have spaces and newlines where they need to be! Now we can rebuild the string. Notice, however, that the last element in the list also ends with a \n; we can clean that up by calling rstrip() on our new string.
The initial solution did not support spaces in the keywords, so I've included a more robust solution below:
import re
def format_string(input_string, keywords):
    # Escape the keywords so that '.' only matches a literal full stop.
    regexes = '|'.join(map(re.escape, keywords))  # Combine all keywords into a regex.
    split_list = re.split(regexes, input_string)  # Split on keys.
    removed = re.findall(regexes, input_string)  # Find removed keys.
    newly_joined = split_list + removed  # Interleave removed and split.
    newly_joined[::2] = split_list
    newly_joined[1::2] = removed
    space_regex = r'\.\s*'
    for index, section in enumerate(newly_joined):
        if '.' in section and section not in removed:
            newly_joined[index] = re.sub(space_regex, '.\n', section)
    return ''.join(newly_joined).strip()
convert all titles (and sole dot) into a regular expression
use a replacement callback
code:
import re
l = "|".join(map(re.escape, ['.', 'Mr.', 'Mrs.', 'Ms.', 'Dr.', 'Prof.', 'Rev.', 'Capt.', 'Lt.-Col.', 'Col.', 'Lt.-Cmdr.', 'The Hon.', 'Cmdr.', 'Flt. Lt.', 'Brgdr.', 'Wng. Cmdr.', 'Group Capt.', 'Rt.', 'Maj.-Gen.', 'Rear Admrl.', 'Esq.', 'Mx', 'Adv', 'Jr.']))
e = "Dear Mr. Foo, I would like to thank you. Because Lt.-Col. Collins told me blah blah. Bye."
def do_repl(m):
    s = m.group(1)
    if s == ".":
        rval = ".\n"  # a lone dot ends a sentence
    else:
        rval = s  # a title is put back unchanged
    return rval
z = re.sub("(" + l + ")", do_repl, e)
# bonus: blanks around the newlines are stripped, even though that's not the question
z = re.sub(r"\s*\n\s*", "\n", z)
print(z)
output:
Dear Mr. Foo, I would like to thank you.
Because Lt.-Col. Collins told me blah blah.
Bye.
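Since the question says a list of strings would be fine as well, here is a hedged Python 3 sketch of the same callback idea that returns a list. The shortened TITLES list and the name split_sentences are illustrative only; sorting the titles longest first is a precaution, not something the answers above do.

```python
import re

# Shortened, illustrative title list (the question's list is longer).
TITLES = ['Mr.', 'Mrs.', 'Dr.', 'Lt.-Col.', 'Col.']

# Escape each title and try longer ones first, e.g. 'Lt.-Col.' before 'Col.'.
_alternatives = '|'.join(
    re.escape(t) for t in sorted(TITLES, key=len, reverse=True))

# Either a whole title (group 1) or a lone full stop plus trailing whitespace.
_pattern = re.compile(r'(' + _alternatives + r')|\.\s*')

def split_sentences(text):
    """Split text at full stops that are not part of a known title."""
    # Titles are put back unchanged; any other full stop becomes '.\n'.
    marked = _pattern.sub(lambda m: m.group(1) or '.\n', text)
    return [part for part in marked.split('\n') if part]

print(split_sentences('Thank Dr. Singh for GST in India. The bill is set to pass.'))
```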

Preserve whitespaces when using split() and join() in python

I have a data file with columns like
BBP1 0.000000 -0.150000 2.033000 0.00 -0.150 1.77
and the individual columns are separated by a varying number of whitespaces.
My goal is to read in those lines, do some math on several rows, for example multiplying column 4 by .95, and write them out to a new file. The new file should look like the original one, except for the values that I modified.
My approach would be reading in the lines as items of a list. And then I would use split() on those rows I am interested in, which will give me a sublist with the individual column values. Then I do the modification, join() the columns together and write the lines from the list to a new text file.
The problem is that I have those varying amount of whitespaces. I don't know how to introduce them back in the same way I read them in. The only way I could think of is to count characters in the line before I split them, which would be very tedious. Does someone have a better idea to tackle this problem?
You want to use re.split() in that case, with a group:
re.split(r'(\s+)', line)
would return both the columns and the whitespace so you can rejoin the line later with the same amount of whitespace included.
Example:
>>> re.split(r'(\s+)', line)
['BBP1', ' ', '0.000000', ' ', '-0.150000', ' ', '2.033000', ' ', '0.00', ' ', '-0.150', ' ', '1.77']
You probably do want to remove the newline from the end.
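A possible round trip for the task in the question might look like this. The spacing in line is only an example of varying whitespace, the six-decimal output format is an assumption, and the index arithmetic assumes the line does not start with whitespace.

```python
import re

# Example line with a varying number of spaces between columns.
line = 'BBP1   0.000000  -0.150000    2.033000  0.00 -0.150   1.77\n'

# The capturing group keeps the whitespace runs in the result.
parts = re.split(r'(\s+)', line.rstrip('\n'))

# Columns sit at even indices and whitespace runs at odd indices,
# so column 4 (counting from 1) is parts[6].
parts[6] = '%.6f' % (float(parts[6]) * 0.95)

# Joining with an empty string restores the original spacing.
new_line = ''.join(parts) + '\n'
print(repr(new_line))
```

If the new value has a different width than the old one, the columns after it shift; padding the formatted value to the old width would avoid that.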
Another way to do this is:
s = 'BBP1   0.000000  -0.150000    2.033000  0.00 -0.150   1.77'
s.split(' ')
>>> ['BBP1', '', '', '0.000000', '', '-0.150000', '', '', '', '2.033000', '', '0.00', '-0.150', '', '', '1.77']
If we pass a space character as the argument to split(), it does not collapse successive spaces; each extra space produces an empty string in the list. So ' '.join() on that list restores the original number of spaces.
For lines that have whitespace at the beginning and/or end, a more robust pattern is (\S+) to split at non-whitespace characters:
import re
line1 = ' 4 426.2 orange\n'
line2 = '12 82.1 apple\n'
re_S = re.compile(r'(\S+)')
items1 = re_S.split(line1)
items2 = re_S.split(line2)
print(items1) # [' ', '4', ' ', '426.2', ' ', 'orange', '\n']
print(items2) # ['', '12', ' ', '82.1', ' ', 'apple', '\n']
These two lines have the same number of items after splitting, which is handy. The first and last items are always whitespace (or empty) strings. These lines can be reconstituted using a join with a zero-length string:
print(repr(''.join(items1))) # ' 4 426.2 orange\n'
print(repr(''.join(items2))) # '12 82.1 apple\n'
To contrast the example with a similar pattern (\s+) (lower-case) used in the other answer here, each line splits with different result lengths and positions of the items:
re_s = re.compile(r'(\s+)')
print(re_s.split(line1)) # ['', ' ', '4', ' ', '426.2', ' ', 'orange', '\n', '']
print(re_s.split(line2)) # ['12', ' ', '82.1', ' ', 'apple', '\n', '']
As you can see, this would be a bit more difficult to process in a consistent manner.

Iterating over multiple lists in Python

I have a list within a list, and I am trying to iterate through one list, and then in the inner list I want to search for a value, and if this value is present, place that list in a variable.
Here's what I have, which doesn't seem to be doing the job:
for z, g in range(len(tablerows), len(andrewlist)):
    tablerowslist = tablerows[z]
    if "Andrew Alexander" in tablerowslist:
        andrewlist[g] = tablerowslist
Any ideas?
This is the list structure:
[['Kyle Bazzy', 'FUP dropbox message', '8/18/2011', 'Swing Trade Stocks</a>', ' ', 'Affiliate blog'], ['Kyle Bazzy', 'FUP dropbox message', '8/18/2011', 'Swing Trade Software</a>', ' ', 'FUP from dropbox message. Affiliate blog'], ['Kyle Bazzy', 'FUP dropbox message', '8/18/2011', 'Start Day Trading (Blog)</a>', ' ', 'FUP from dropbox message'], ['Kyle Bazzy', 'Call, be VERY NICE', '8/18/2011', ' ', 'r24867</a>', 'We have been very nice to him, but he wants to cancel, we need to keep being nice and seeing what is wrong now.'], ['Jason Raznick', 'Reach out', '8/18/2011', 'Lexis Nexis</a>', ' ', '-'], ['Andrew Alexander', 'Check on account in one week', '8/18/2011', ' ', 'r46876</a>', '-'], ['Andrew Alexander', 'Cancel him from 5 dollar feed', '8/18/2011', ' ', 'r37693</a>', '-'], ['Aaron Wise', 'FUP with contract', '8/18/2011', 'YouTradeFX</a>', ' ', "Zisa is on vacation...FUP next week and then try again if she's still gone."], ['Aaron Wise', 'Email--JASON', '8/18/2011', 'Lexis Nexis</a>', ' ', 'email by today'], ['Sarah Knapp', '3rd FUP', '8/18/2011', 'Steven L. Pomeranz</a>', ' ', '-'], ['Sarah Knapp', 'Are we really interested in partnering?', '8/18/2011', 'Reverse Spins</a>', ' ', "V. political, doesn't seem like high quality content. Do we really want a partnership?"], ['Sarah Knapp', '2nd follow up', '8/18/2011', 'Business World</a>', ' ', '-'], ['Sarah Knapp', 'Determine whether we are actually interested in partnership', '8/18/2011', 'Fayrouz In Dallas</a>', ' ', "Hasn't updated since September 2010."], ['Sarah Knapp', 'See email exchange w/Autumn; what should happen', '8/18/2011', 'Graham and Doddsville</a>', ' ', "Wasn't sure if we could partner bc of regulations, but could do something meant simply to increase traffic both ways."], ['Sarah Knapp', '3rd follow up', '8/18/2011', 'Fund Action</a>', ' ', '-']]
For every inner list that has a particular value in it, say, Andrew Alexander, I want to collect those lists into a separate list.
For example:
[['Andrew Alexander', 'Check on account in one week', '8/18/2011', ' ', 'r46876</a>', '-'], ['Andrew Alexander', 'Cancel him from 5 dollar feed', '8/18/2011', ' ', 'r37693</a>', '-']]
Assuming you have a list whose elements are lists, this is what I'd do:
andrewlist = [row for row in tablerows if "Andrew Alexander" in row]
>>> # I have a list within a list,
>>> lol = [[1, 2, 42, 3], [4, 5, 6], [7, 42, 8]]
>>> found = []
>>> # iterate through one list,
>>> for i in lol:
...     # in the inner list I want to search for a value
...     if 42 in i:
...         # if this value is present, place that list in a variable
...         found.append(i)
...
>>> found
[[1, 2, 42, 3], [7, 42, 8]]
for z, g in range(len(tablerows), len(andrewlist)):
This means "make a list of the numbers which are between the length of tablerows and the length of andrewlist, and then look at each of those numbers in turn, and treat those numbers as a list of two values, and assign the two values to z and g each time through the loop".
A number cannot be treated as a list of two values, so this fails.
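The failure can be reproduced in a few lines. This sketch uses a made-up, shortened tablerows and, as one possible alternative, enumerate() for the case where the row positions are wanted as well.

```python
# Hypothetical, shortened version of the list in the question.
tablerows = [
    ['Kyle Bazzy', 'FUP dropbox message'],
    ['Andrew Alexander', 'Check on account in one week'],
    ['Andrew Alexander', 'Cancel him from 5 dollar feed'],
]

# range() yields single integers, so unpacking each one into two
# names raises a TypeError on the first iteration.
try:
    for z, g in range(0, len(tablerows)):
        pass
except TypeError as exc:
    error = str(exc)

# enumerate() yields (index, row) pairs, which can be unpacked.
andrewlist = []
positions = []
for index, row in enumerate(tablerows):
    if "Andrew Alexander" in row:
        positions.append(index)
        andrewlist.append(row)

print(error)
print(positions)
```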
You need to be much, much clearer about what you are doing. Show an example of the contents of tablerows before the loop, and the contents of andrewlist before the loop, and what it should look like afterwards. Your description is muddled: I can only guess that when you say "and then I want to iterate through one list" you mean one of the lists in your list-of-lists; but I can't tell whether you want one specific one, or each one in turn. And then when you next say "and then in the inner list I want to...", I have no idea what you're referring to.
