Getting a nested list with "" in the outer list - python

After opening and reading an input file, I'm trying to split the input on different characters. This works well, although I seem to be getting a nested list which I don't want. My list does not look like [[list]], but like ["[list]"]. What did I do wrong here?
The input looks like this:
name1___1 2 3 4 5
5=20=22=10=2=0=0=1=0=1something,something
name2___1 2 3 4
2=30=15=8=4=3=2=0=0=0;
The output looks like this:
["['name1", '', '', "1 2 3 4 5', 'name2", '', '', "1 2 3 4']"]
Here is my code:
file = open("file.txt")
input_of_this_file = file.read()
a = input_of_this_file.split("\n")
b = a[0::2] # so i get only the even lines
c = str(b) # to make it a string so the .strip() works
d = c.strip() # because there were whitespaces
e = d.split("_")
print e
If I then do:
x = e[0]
I get:
['name1
This removes the outer list, but also removes the last ].
I would like the result to look like: name1, name2
so that I get only the names.

Use itertools.islice and a list comprehension.
>>> from itertools import islice
>>> with open("tmp.txt") as f:
...     [line.rstrip("\n").split("_") for line in islice(f, None, None, 2)]
...
[['name1', '', '', '1 2 3 4 5'], ['name2', '', '', '1 2 3 4']]

Keeping your code syntax without imports:
c=[]
input_of_file = '''name1___1 2 3 4 5
5=20=22=10=2=0=0=1=0=1something,something
name2___1 2 3 4
2=30=15=8=4=3=2=0=0=0;'''
a = input_of_file.split("\n")
b = a[::2]
for item in b:
    new_item = item.split('__')
    c.append(new_item)
Results
c = [['name1', '_1 2 3 4 5'], ['name2', '_1 2 3 4']]
c[0][0] = 'name1'
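Since the question asks for only the names, a shorter route may be to split each name line once at the first underscore and keep the part before it. This is a minimal sketch that assumes the names never contain an underscore themselves; `extract_names` is a hypothetical helper name, and the sample text is copied from the question rather than read from a file.

```python
# Sample text copied from the question; in practice it would come from file.read().
text = """name1___1 2 3 4 5
5=20=22=10=2=0=0=1=0=1something,something
name2___1 2 3 4
2=30=15=8=4=3=2=0=0=0;"""


def extract_names(text):
    """Return the part before the first underscore on every other line."""
    lines = text.split("\n")
    # lines[::2] keeps lines 0, 2, 4, ... (the name lines in this layout)
    return [line.split("_", 1)[0] for line in lines[::2]]


names = extract_names(text)
print(", ".join(names))
```

Working on the list of lines directly avoids the `str(b)` step, which is what turned the whole list into one string in the original code.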

Related

Convert a large string to Dataframe

I have a large string that looks like this:
'1 Start Date str_date B 10 C \n 2 Calculation notional cal_nt C 10 0\n 3 Calculation RATE Today cal_Rate_td C 9 R\n ....'
The issue is that I can't split the string on one or two spaces: between Start Date and str_date there are 2 spaces, but the next line may have 3, and the line after that may have only 1. This makes it very hard to build a correct DataFrame. Is there a way to do this? Thanks.
To get a list of all the words that contain _ (as you requested in the comments), you could use a regular expression:
import re
s = '1 Start Date str_date B 10 C \n 2 Calculation notional cal_nt C 10 0\n 3 Calculation RATE Today cal_Rate_td C 9 R\n ....'
list(map(re.Match.group, re.finditer(r'\w+_.\w+', s)))
output:
['str_date', 'cal_nt', 'cal_Rate_td']
or you can use a list comprehension:
[e for e in s.split() if '_' in e]
output:
['str_date', 'cal_nt', 'cal_Rate_td']
To get a data frame from your string, you can use the above information about the third field:
import re
import pandas as pd
s = '1 Start Date str_date B 10 C \n 2 Calculation notional cal_nt C 10 0\n 3 Calculation RATE Today cal_Rate_td C 9 R\n'
third_fields = [e for e in s.split() if '_' in e]
rows = []
for third_field, row in zip(third_fields, s.split('\n')):
    current_row = []
    row = row.strip()
    first_field = re.search(r'\d+\b', row).group()
    current_row.append(first_field)
    # remove first field
    row = row[len(first_field):].strip()
    second_field, rest_of_fields = row.split(third_field)
    parsed_fields = [e.group() for e in re.finditer(r'\b[\w\d]+\b', rest_of_fields)]
    current_row.extend([second_field, third_field, *parsed_fields])
    rows.append(current_row)
pd.DataFrame(rows)
output:
Like @kederrac's answer, you can use a regex to split them:
import re
s = "1 Start Date str_date B 10 C "
l = re.compile(r"\s+").split(s.strip())
# output ['1', 'Start', 'Date', 'str_date', 'B', '10', 'C']
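Splitting on `\s+` also breaks multi-word labels such as "Start Date" into separate fields. One possible way around that is to match each line with a single pattern. The sketch below assumes every row has the layout number, free-text label, one identifier containing an underscore, then the remaining fields; `ROW_PATTERN` and `parse_rows` are hypothetical names, and the irregular spacing in the sample string is made up to mimic the problem described in the question.

```python
import re

# Assumed layout: number, free-text label, identifier containing "_", remaining fields.
ROW_PATTERN = re.compile(r"(\d+)\s+(.+?)\s+(\w*_\w*)\s+(.*)")


def parse_rows(text):
    """Split each non-empty line into a list of fields."""
    rows = []
    for line in text.split("\n"):
        match = ROW_PATTERN.fullmatch(line.strip())
        if match is None:
            continue  # skip blank or malformed lines
        number, label, identifier, rest = match.groups()
        # Collapse runs of whitespace inside the label to single spaces.
        label = " ".join(label.split())
        rows.append([number, label, identifier, *rest.split()])
    return rows


s = ("1 Start  Date   str_date B 10 C \n"
     " 2 Calculation notional cal_nt  C 10 0\n"
     " 3 Calculation RATE Today cal_Rate_td C 9 R\n")

for row in parse_rows(s):
    print(row)
```

The resulting list of rows can be passed to `pd.DataFrame(rows)` in the same way as in the answer above.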

Remove last digit from String depending on length

I am trying to remove the last digit in the df[4] string if the string is over 5 digits.
I tried adding .str[:-1] to df[4]=df[4].astype(str), but this removes the last digit from every string in the column.
df[3]=df[3].astype(str)
df[4]=df[4].astype(str).str[:-1]
df[5]=df[5].astype(str)
I tried several different combinations of if statements but none have worked.
I'm new to Python and pandas, so any help is appreciated.
You can filter first on the string length:
condition = df[4].astype(str).str.len() > 5
df.loc[condition, 4]=df.loc[condition, 4].astype(str).str[:-1]
For example:
>>> df
4
0 1
1 11
2 111
3 1111
4 11111
5 111111
6 1111111
7 11111111
8 111111111
>>> condition = df[4].astype(str).str.len() > 5
>>> df.loc[condition, 4]=df.loc[condition, 4].astype(str).str[:-1]
>>> df
4
0 1
1 11
2 111
3 1111
4 11111
5 11111
6 111111
7 1111111
8 11111111
If these are natural integers, it is however more efficient to divide by 10:
condition = df[4].astype(str).str.len() > 5
df.loc[condition, 4]=df.loc[condition, 4] // 10
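The integer shortcut relies on floor division by 10 dropping the last decimal digit. A small check on plain Python integers (no pandas needed) may make that easier to trust. It assumes non-negative integers; for negative numbers `//` rounds toward negative infinity, so the two approaches would differ. The helper names and the sample values are hypothetical.

```python
def drop_last_digit_str(n, max_len=5):
    """String approach: slice off the last character if the number is too long."""
    s = str(n)
    return s[:-1] if len(s) > max_len else s


def drop_last_digit_int(n, max_len=5):
    """Integer approach: floor-divide by 10 if the number is too long."""
    return n // 10 if len(str(n)) > max_len else n


values = [1, 11111, 111111, 123456789]
for v in values:
    # Both approaches agree for non-negative integers.
    assert int(drop_last_digit_str(v)) == drop_last_digit_int(v)
    print(v, "->", drop_last_digit_int(v))
```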
Accessing Elements of a Collection
>>> x = "123456"
# get element at index from start
>>> x[0]
'1'
# get element at index from end
>>> x[-1]
'6'
# get range of elements from n-index to m-index
>>> x[0:3]
'123'
>>> x[1:-2]
'234'
>>> x[-4:-2]
'34'
# get range from/to index with open end/start
>>> x[:-2]
'1234'
>>> x[4:]
'56'
List Comprehension Syntax
I haven't seen anyone mention Python's list comprehension syntax, which is really cool and easy.
# input data frame with variable string length 1 to n
df = [
'a',
'ab',
'abc',
'abcd',
'abcdf',
'abcdfg',
'abcdfgh',
'abcdfghi',
'abcdfghij',
'abcdfghijk',
'abcdfghijkl',
'abcdfghijklm'
]
# using list comprehension syntax: [element for element in collection]
df_new = [
# short hand if syntax: value_a if True else value_b
r if len(r) <= 5 else r[0:5]
for r in df
]
Now df_new contains only strings up to a length of 5:
[
'a',
'ab',
'abc',
'abcd',
'abcdf',
'abcdf',
'abcdf',
'abcdf',
'abcdf',
'abcdf',
'abcdf',
'abcdf'
]
cause [-1]removes last numbers or change number to -1
try str df[4]=-1

using regex to find multiple occurences in a file in python

I'm trying the following:
myfile.txt has the following content, and I want to extract the data between each 'abc start' and 'abc end' using a regular expression in Python. Thanks for the help.
abc start
1
2
3
4
abc end
5
6
7
abc start
8
9
10
abc end
I'm expecting the output 1 2 3 4 8 9 10.
import re
with open('myfile.txt') as f:
    txt = f.read()
strings = re.findall('abc start\n(.+?)\nabc end', txt, re.DOTALL)
# to transform to your output..
result = []
for s in strings:
    result += s.split('\n')
print(result)
#['1', '2', '3', '4', '8', '9', '10']
Using regex:
import re
string = ''
# the with block closes the file automatically, so no f.close() is needed
with open('file.txt', 'r') as f:
    for i in f.readlines():
        string += i.strip() + ' '
exp = re.compile(r'abc start(.+?)abc end')
result = [[int(j) for j in i.strip().split()] for i in exp.findall(string)]
print(result)
# [[1, 2, 3, 4], [8, 9, 10]]
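To try the pattern without creating a file first, the same idea can be run on an in-memory string. This is a minimal sketch: `sample` reproduces the content of myfile.txt from the question, and `extract_blocks` is a hypothetical helper name. The lazy group `(.*?)` together with `re.DOTALL` lets a match span several lines while stopping at the nearest 'abc end'.

```python
import re

# Same content as myfile.txt in the question, held in a string.
sample = (
    "abc start\n1\n2\n3\n4\nabc end\n"
    "5\n6\n7\n"
    "abc start\n8\n9\n10\nabc end\n"
)


def extract_blocks(text):
    """Return every line found between 'abc start' and 'abc end' markers."""
    blocks = re.findall(r"abc start\n(.*?)\nabc end", text, re.DOTALL)
    # Flatten the per-block strings into one list of lines.
    return [line for block in blocks for line in block.split("\n")]


print(" ".join(extract_blocks(sample)))
```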

indirect sorting a list in python

I have a list that I need to sort
list_of_sent = ['a5 abc xyz','w1 3 45 7','a6 abc deg','r4 2 7 9']
The rules are as follows.
If the 2nd item is a number, that entry always comes later.
The others are in lexicographically sorted order (without changing the ordering of the individual items).
In the above example, the expected output is
['a5 abc deg','a6 abc xyz','r4 2 7 9','w1 3 45 7']
I understand this is some form of indirect sorting, but I'm not sure how to approach it. So far, I have separated the list according to whether the 2nd and onward items are numbers or not, but I'm not sure how to proceed after that.
def reorderLines(logLines):
    numList = []
    letterList = []
    hashMap = {}
    for item in logLines:
        words = item.split()
        key = words[0]
        hashMap[key] = item
        if words[1].isdigit():
            numList.append(item)
        else:
            letterList.append(item)
    #sort each list individually
    print(numList)
    print(letterList)
EDIT:
This will output
['a5 abc xyz','a6 abc deg']
['w1 3 45 7','r4 2 7 9']
How do I proceed afterwards to reach to the output of
['a5 abc deg','a6 abc xyz','r4 2 7 9','w1 3 45 7']
The answer to your direct question is simple.
You've already worked out how to split the list into these two lists:
['a5 abc xyz','a6 abc deg']
['w1 3 45 7','r4 2 7 9']
Now, you just need to sort each one, and add them together.
But this really isn't the right approach in the first place. When looking at how to do a custom sort, the first thing you should do is ask yourself whether some other list, which you could easily transform this one into, would be trivial to sort.
For example, imagine you had this:
list_of_sent = [
(False, 'a5 abc xyz'),
(True, 'w1 3 45 7'),
(False, 'a6 abc deg'),
(True, 'r4 2 7 9')]
… where that first value in each tuple is True iff the second word in the string is a number.
If that were your list, you could just call sort or sorted on it, and you'd be done.
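As a quick check of that claim, here is the tuple list from above passed straight to `sorted`. Tuples compare element by element and `False` sorts before `True`, so the flagged entries end up last and ties are broken by the string. The names `flagged` and `ordered` are only illustrative.

```python
flagged = [
    (False, 'a5 abc xyz'),
    (True, 'w1 3 45 7'),
    (False, 'a6 abc deg'),
    (True, 'r4 2 7 9'),
]

# Tuples compare element-wise: first by the flag, then by the string.
ordered = sorted(flagged)

# Drop the flags again to get back plain strings.
print([text for _, text in ordered])
```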
So, can you transform each of your strings into a tuple like that? Sure you can:
def flagnumbers(words):
    isnumber = words.split()[1].isdigit()
    return isnumber, words
And now, you can just pass that as the key function to sort your list:
list_of_sent = ['a5 abc xyz','w1 3 45 7','a6 abc deg','r4 2 7 9']
print(sorted(list_of_sent, key=flagnumbers))
That's it.
The Sorting HOWTO in the docs covers key functions in more detail, with some nice examples.
We can write a sorting key as follows:
def sorting_key(element):
    second_substring = element.split()[1]
    return second_substring.isdecimal(), element
then use it with the sorted builtin like this:
>>> list_of_sent = ['a5 abc xyz', 'w1 3 45 7', 'a6 abc deg', 'r4 2 7 9']
>>> sorted(list_of_sent, key=sorting_key)
['a5 abc xyz', 'a6 abc deg', 'r4 2 7 9', 'w1 3 45 7']
or, if we don't need the old order, we can sort list_of_sent in place (this may be more efficient; at the very least it does not allocate memory for a new list):
>>> list_of_sent = ['a5 abc xyz', 'w1 3 45 7', 'a6 abc deg', 'r4 2 7 9']
>>> list_of_sent.sort(key=sorting_key)
>>> list_of_sent
['a5 abc xyz', 'a6 abc deg', 'r4 2 7 9', 'w1 3 45 7']
More info about the differences between sorted and list.sort can be found in this thread.
After the comment in your function, do this:
newlist = []
numList.sort()
letterList.sort()
newlist = letterList + numList
print(numList)
print(letterList)
print(newlist)
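Put together, the split-sort-concatenate approach from the question could look roughly like the sketch below. `reorder_lines` is a hypothetical snake_case variant of the asker's `reorderLines` that returns the result instead of printing it, and it assumes every entry has at least two whitespace-separated items.

```python
def reorder_lines(log_lines):
    """Sort entries so that those whose 2nd item is a number come last."""
    num_list = []
    letter_list = []
    for item in log_lines:
        words = item.split()
        # Assumes each entry has at least two items.
        if words[1].isdigit():
            num_list.append(item)
        else:
            letter_list.append(item)
    # Sort each group on its own, letters first, then numbers.
    return sorted(letter_list) + sorted(num_list)


list_of_sent = ['a5 abc xyz', 'w1 3 45 7', 'a6 abc deg', 'r4 2 7 9']
print(reorder_lines(list_of_sent))
```

This gives the same order as the key-function answers above; the key-function version simply does the grouping and sorting in one call.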

Strip all characters from column header before a :

I have columns named like this:
1:Arnston 2:Berg 3:Carlson 53:Brown
and I want to strip all the characters before and including :. I know I can rename the columns, but that would be pretty tedious since my numbers go up to 100.
My desired output is:
Arnston Berg Carlson Brown
Assuming that you have a frame looking something like this:
>>> df
1:Arnston 2:Berg 3:Carlson 53:Brown
0 5 0 2 1
1 9 3 2 9
2 9 2 9 7
You can use the vectorized string operators to split each entry at the first colon and then take the second part:
>>> df.columns = df.columns.str.split(":", n=1).str[1]
>>> df
Arnston Berg Carlson Brown
0 5 0 2 1
1 9 3 2 9
2 9 2 9 7
import re
s = '1:Arnston 2:Berg 3:Carlson 53:Brown'
s_minus_numbers = re.sub(r'\d+:', '', s)
Gets you
'Arnston Berg Carlson Brown'
The best solution IMO is to use pandas' str attribute on the columns. This allows for the use of regular expressions without having to import re:
df.columns = df.columns.str.extract(r'\d+:(.*)', expand=False)
Where the regex means: select everything ((.*)) after one or more digits (\d+) and a colon (:).
You can do it with a list comprehension:
columns = '1:Arnston 2:Berg 3:Carlson 53:Brown'.split()
print('Before: {!r}'.format(columns))
columns = [col.split(':')[1] for col in columns]
print('After: {!r}'.format(columns))
Output
Before: ['1:Arnston', '2:Berg', '3:Carlson', '53:Brown']
After: ['Arnston', 'Berg', 'Carlson', 'Brown']
Another way is with a regular expression using re.sub():
import re
columns = '1:Arnston 2:Berg 3:Carlson 53:Brown'.split()
pattern = re.compile(r'^.+:')
columns = [pattern.sub('', col) for col in columns]
print(columns)
Output
['Arnston', 'Berg', 'Carlson', 'Brown']
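Both list-based variants above assume every column name contains a colon. If some names might not, `str.partition` offers a way to leave those untouched while still cutting at the first colon only. This is a small sketch; `strip_prefix` is a hypothetical helper name and the extra "Total" column is made up to show the fallback.

```python
def strip_prefix(name):
    """Drop everything up to and including the first colon, if there is one."""
    head, sep, tail = name.partition(":")
    # sep is empty when the name has no colon; keep the name unchanged then.
    return tail if sep else name


# "Total" is a made-up column without a prefix, added to show the fallback.
columns = ["1:Arnston", "2:Berg", "3:Carlson", "53:Brown", "Total"]
print([strip_prefix(col) for col in columns])
```

With a DataFrame, the result could be assigned back with something like `df.columns = [strip_prefix(c) for c in df.columns]`.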
import pandas as pd
df = pd.DataFrame({'1:Arnston': [5, 9, 9],
                   '2:Berg': [0, 3, 2],
                   '3:Carlson': [2, 2, 9],
                   '53:Brown': [1, 9, 7]})
[x.split(':')[1] for x in df.columns.factorize()[1]]
output:
['Arnston', 'Berg', 'Carlson', 'Brown']
You could use str.replace and pass a regular expression:
In [52]: df
Out[52]:
1:Arnston 2:Berg 3:Carlson 53:Brown
0 1.340711 1.261500 -0.512704 -0.064384
1 0.462526 -0.358382 0.168122 -0.660446
2 -0.089622 0.656828 -0.838688 -0.046186
3 1.041807 0.775830 -0.436045 0.162221
4 -0.422146 0.775747 0.106112 -0.044917
In [51]: df.columns.str.replace(r'\d+[:]', '', regex=True)
Out[51]: Index(['Arnston', 'Berg', 'Carlson', 'Brown'], dtype='object')
