indirect sorting a list in python - python

I have a list that I need to sort
list_of_sent = ['a5 abc xyz','w1 3 45 7','a6 abc deg','r4 2 7 9']
The rules are as follows.
if the 2nd item is a number, that will always come later
others are in their lexicographically sorted order (without changing the ordering of the individual items)
In the above example, the expected output is
['a5 abc deg','a6 abc xyz','r4 2 7 9','w1 3 45 7']
I understand this is some form of indirect sorting, but I am not sure how to approach it. So far, I have split the list according to whether the 2nd item is a number. I am not sure how to proceed after that.
def reorderLines(logLines):
    numList = []
    letterList = []
    hashMap = {}
    for item in logLines:
        words = item.split()
        key = words[0]
        hashMap[key] = item
        if words[1].isdigit():
            numList.append(item)
        else:
            letterList.append(item)
    # sort each list individually
    print(numList)
    print(letterList)
EDIT:
This will output
['a5 abc xyz','a6 abc deg']
['w1 3 45 7','r4 2 7 9']
How do I proceed afterwards to reach to the output of
['a5 abc deg','a6 abc xyz','r4 2 7 9','w1 3 45 7']

The answer to your direct question is simple.
You've already worked out how to split the list into these two lists:
['a5 abc xyz','a6 abc deg']
['w1 3 45 7','r4 2 7 9']
Now, you just need to sort each one, and add them together.
But this really isn't the right approach in the first place. When looking at how to do a custom sort, the first thing you should do is ask yourself whether some other list, which you could easily transform this one into, would be trivial to sort.
For example, imagine you had this:
list_of_sent = [
(False, 'a5 abc xyz'),
(True, 'w1 3 45 7'),
(False, 'a6 abc deg'),
(True, 'r4 2 7 9')]
… where that first value in each tuple is True iff the second word in the string is a number.
If that were your list, you could just call sort or sorted on it, and you'd be done.
So, can you transform each of your strings into a tuple like that? Sure you can:
def flagnumbers(words):
    isnumber = words.split()[1].isdigit()
    return isnumber, words
And now, you can just pass that as the key function to sort your list:
list_of_sent = ['a5 abc xyz','w1 3 45 7','a6 abc deg','r4 2 7 9']
print(sorted(list_of_sent, key=flagnumbers))
That's it.
The Sorting HOWTO in the docs covers key functions in more detail, with some nice examples.
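To make the idea concrete, here is a minimal sketch that prints the keys the function produces before sorting on them. It assumes every string has at least two space-separated words; the name flag_numbers is just an illustrative respelling of the function above.

```python
def flag_numbers(sentence):
    # (is_number, sentence): tuples compare element by element, and
    # False sorts before True, so numeric entries end up last.
    return sentence.split()[1].isdigit(), sentence

list_of_sent = ['a5 abc xyz', 'w1 3 45 7', 'a6 abc deg', 'r4 2 7 9']

# The "other list" that would be trivial to sort.
keys = [flag_numbers(s) for s in list_of_sent]
print(keys)

# sorted() compares the keys but returns the original strings.
result = sorted(list_of_sent, key=flag_numbers)
print(result)
```

Ties on the first tuple element are broken by the string itself, which is what gives the lexicographic order inside each group.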

We can write sorting key as follows:
def sorting_key(element):
    second_substring = element.split()[1]
    return second_substring.isdecimal(), element
then use it in sorted builtin like
>>> list_of_sent = ['a5 abc xyz', 'w1 3 45 7', 'a6 abc deg', 'r4 2 7 9']
>>> sorted(list_of_sent, key=sorting_key)
['a5 abc xyz', 'a6 abc deg', 'r4 2 7 9', 'w1 3 45 7']
or if we don't need old order we can sort list_of_sent in place (may be more efficient, at least will not occupy additional memory for a new list):
>>> list_of_sent = ['a5 abc xyz', 'w1 3 45 7', 'a6 abc deg', 'r4 2 7 9']
>>> list_of_sent.sort(key=sorting_key)
>>> list_of_sent
['a5 abc xyz', 'a6 abc deg', 'r4 2 7 9', 'w1 3 45 7']
More info about differences between sorted & list.sort could be found in this thread.
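As a rough illustration of that difference, the sketch below runs both on copies of the same data. The variable names are only for illustration; the point is that sorted returns a new list and leaves its input alone, while list.sort mutates the list and returns None.

```python
def sorting_key(element):
    # Entries whose second word is a number get True and sort last.
    return element.split()[1].isdecimal(), element

original = ['a5 abc xyz', 'w1 3 45 7', 'a6 abc deg', 'r4 2 7 9']

# sorted() builds and returns a new list; original keeps its order.
new_list = sorted(original, key=sorting_key)

# list.sort() reorders the list itself and returns None.
in_place = list(original)  # copy, so the two results can be compared
returned = in_place.sort(key=sorting_key)

print(new_list == in_place)
print(returned)
```

A common slip is writing `x = x.sort(...)`, which leaves `x` set to None.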

After the comment do these:
newlist = []
numList.sort()
letterList.sort()
newlist = letterList + numList
print(numList)
print (letterList)
print (newlist)
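Put together with the splitting loop from the question, the whole thing might look like the sketch below. The function name reorder_lines and the choice to return the list rather than print it are assumptions made for the example.

```python
def reorder_lines(log_lines):
    """Split on whether the second word is a number, sort each part, join."""
    num_list = []
    letter_list = []
    for item in log_lines:
        if item.split()[1].isdigit():
            num_list.append(item)
        else:
            letter_list.append(item)
    # Letter entries first, numeric entries afterwards.
    return sorted(letter_list) + sorted(num_list)

result = reorder_lines(['a5 abc xyz', 'w1 3 45 7', 'a6 abc deg', 'r4 2 7 9'])
print(result)
```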

Related

concatenate text from a list based on another list

I have two lists, one of them is some lines, and the other is some values for these lines as follows:
text = ['Hello, ','I need some help in here ','things are not working well ','so i posted ','this question here ','hoping to get some ','good ','answers ','out of you ','that\'s it ','thanks']
value = [1,1,0,1,1,1,0,1,1,0,1]
Goal is to concatenate lines that meet the value 1 continuously, to get this result in any way possible:
['Hello ,I need some help in here ',
'so i posted this question here hoping to get some',
'answers out of you ',
'thanks']
I tried to put it as a DataFrame but I then didn't know how to go on (using pandas in solution is not a must)
print(pd.DataFrame(data={"text":text,"value":value}))
text value
0 Hello, 1
1 I need some help in here 1
2 things are not working well 0
3 so i posted 1
4 this question here 1
5 hoping to get some 1
6 good 0
7 answers 1
8 out of you 1
9 that's it 0
10 thanks 1
Waiting for some Answers
There is no need to use Pandas:
tmp_str = ""
results = []
for chunk, is_included in zip(text, value):
    if is_included:
        tmp_str += chunk
    elif tmp_str:
        # a 0 ends the current run; skip empty runs (leading or repeated 0s)
        results.append(tmp_str)
        tmp_str = ""
if tmp_str:
    results.append(tmp_str)
print(results)
If you want a pandas approach you can use pandas.Series.cumsum, pandas.DataFrame.groupby, df.groupby.transform and aggregate by str.join, and then access indices where value is 1:
>>> df.groupby(
df['value'].ne(df['value'].shift(1)
).cumsum()
).transform(' '.join)[df['value'].eq(1)].drop_duplicates()
text
0 Hello, I need some help in here
3 so i posted this question here hoping to get some
7 answers out of you
10 thanks
EXPLANATION
>>> df['value'].ne(df['value'].shift(1)).cumsum()
0 1
1 1
2 2
3 3
4 3
5 3
6 4
7 5
8 5
9 6
10 7
Name: value, dtype: int32
>>> df.groupby(df['value'].ne(df['value'].shift(1)).cumsum()).transform(' '.join)
text
0 Hello, I need some help in here
1 Hello, I need some help in here
2 things are not working well
3 so i posted this question here hoping to get some
4 so i posted this question here hoping to get some
5 so i posted this question here hoping to get some
6 good
7 answers out of you
8 answers out of you
9 that's it
10 thanks
If you don't need a dataframe, you can use itertools.groupby over zipped values of (text, value) and groupby the second element, i.e. value. Then str.join groups' text part if key == 1.
>>> from itertools import groupby
>>> [' '.join([*zip(*g)][0]) for k, g in groupby(zip(text, value), lambda x: x[1]) if k]
['Hello, I need some help in here ',
'so i posted this question here hoping to get some ',
'answers out of you ',
'thanks']
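The one-liner above can be unpacked into a small function, which may be easier to follow. This is only a sketch: the name join_runs is hypothetical, and it joins with '' rather than ' ' on the assumption that the pieces already carry their own trailing spaces, as they do in the question's data.

```python
from itertools import groupby

def join_runs(text, value):
    """Concatenate consecutive items of text whose flag in value is truthy."""
    results = []
    # groupby yields one (flag, run) pair per stretch of equal flags.
    for flag, run in groupby(zip(text, value), key=lambda pair: pair[1]):
        if flag:
            results.append(''.join(chunk for chunk, _ in run))
    return results

text = ['Hello, ', 'I need some help in here ', 'things are not working well ',
        'so i posted ', 'this question here ', 'hoping to get some ', 'good ',
        'answers ', 'out of you ', "that's it ", 'thanks']
value = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1]

joined = join_runs(text, value)
print(joined)
```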
Solution
A Pythonic approach without pandas:
text_ = [text[count] for count, n in enumerate(value) if n == 1]
Description
This keeps each item of text whose corresponding entry in value equals 1. Note that it only filters the items; it does not join consecutive ones.
Output
['Hello, ', 'I need some help in here ', 'so i posted ', 'this question here ', 'hoping to get some ', 'answers ', 'out of you ', 'thanks']

Getting a nested list with "" in the outer list

After opening and reading an input file, I'm trying to split the input on different characters. This works well, although I seem to be getting a nested list which I don't want. My list does not look like [[list]], but like ["[list]"]. What did I do wrong here?
The input looks like this:
name1___1 2 3 4 5
5=20=22=10=2=0=0=1=0=1something,something
name2___1 2 3 4
2=30=15=8=4=3=2=0=0=0;
The output looks like this:
["['name1", '', '', "1 2 3 4 5', 'name2", '', '', "1 2 3 4']"]
Here is my code:
file = open("file.txt")
input_of_this_file = file.read()
a = input_of_this_file.split("\n")
b = a[0::2] # so i get only the even lines
c = str(b) # to make it a string so the .strip() works
d = c.strip() # because there were whitespaces
e = d.split("_")
print(e)
If i then do:
x = e[0]
I get:
['name1
This removes the outer list, but also removes the last ].
I would like it to look like: name1, name2
So that i only get the names.
Use itertools.islice and a list comprehension.
>>> from itertools import islice
>>> with open("tmp.txt") as f:
... [line.rstrip("\n").split("_") for line in islice(f, None, None, 2)]
...
[['name1', '', '', '1 2 3 4 5'], ['name2', '', '', '1 2 3 4']]
Keeping your code syntax without imports:
c=[]
input_of_file = '''name1___1 2 3 4 5
5=20=22=10=2=0=0=1=0=1something,something
name2___1 2 3 4
2=30=15=8=4=3=2=0=0=0;'''
a = input_of_file.split("\n")
b = a[::2]
for item in b:
    new_item = item.split('__')
    c.append(new_item)
Results
c = [['name1', '_1 2 3 4 5'], ['name2', '_1 2 3 4']]
c[0][0] = 'name1'
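To see where the ["[list]"] shape in the question comes from, and to get just the names, a short sketch follows. It uses an inline string in place of the file, and it assumes each name is everything before the first underscore.

```python
raw = """name1___1 2 3 4 5
5=20=22=10=2=0=0=1=0=1something,something
name2___1 2 3 4
2=30=15=8=4=3=2=0=0=0;"""

lines = raw.split("\n")

# str() on a list gives its printed form as ONE string, brackets and
# quotes included, so splitting it produces the odd pieces from the question.
as_string = str(lines[0::2])
pieces = as_string.split("_")
print(pieces)

# Splitting each line separately avoids that; [0] keeps only the name.
names = [line.split("_")[0] for line in lines[0::2]]
print(", ".join(names))
```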

How to find duplicates from a Pandas dataframe based upon the values in other columns?

I have a Pandas Df-
A=
[period store item
1 32 'A'
1 34 'A'
1 32 'B'
1 34 'B'
2 42 'X'
2 44 'X'
2 42 'Y'
2 44 'Y']
I need to implement something like this:
If an item has the same set of stores as any other item for that particular period then those items are duplicate.
So in this case A and B are duplicates as they have the same stores for the respective periods.
I have tried converting this into a nested dictionary using this:
dicta = {p: g.groupby('item')['store'].apply(tuple).to_dict()
         for p, g in mkt.groupby('period')}
Which is returning me a dictionary like this:
dicta = {1: {'A': (32, 34),'B': (32, 34)}, 2: {'X': (42, 44),'Y': (42, 44)}}
...
So in the end I want a dictionary like this.
{1:(A,B),2:(X,Y)}
However, I cannot work out the logic for finding the duplicate items.
Is there another way to find them?
You can simply use .duplicated. Make sure to pass ['period', 'store'] as subset and keep as False so all the rows will be returned.
print(A[A.duplicated(subset=['period', 'store'], keep=False)])
Outputs
period store item
0 1 32 A
1 1 34 A
2 1 32 B
3 1 34 B
4 2 42 X
5 2 44 X
6 2 42 Y
7 2 44 Y
Note that according to the logic you specified all the rows are duplicates.
EDIT After OP elaborated on the expected format, I suggest
duplicates = A[A.duplicated(subset=['period', 'store'], keep=False)]
output = {g: tuple(df['item'].unique()) for g, df in duplicates.groupby('period')}
Then output is {1: ('A', 'B'), 2: ('X', 'Y')}.
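The .duplicated call flags rows that share a (period, store) pair. If the rule is read strictly, so that two items are duplicates only when their whole sets of stores match within a period, one possible sketch starts from the nested dictionary shown in the question. The helper name duplicate_items is hypothetical, and the result holds a list of groups per period because a period could contain more than one group of duplicates.

```python
from collections import defaultdict

dicta = {1: {'A': (32, 34), 'B': (32, 34)},
         2: {'X': (42, 44), 'Y': (42, 44)}}

def duplicate_items(stores_by_item):
    """Group items that have exactly the same set of stores."""
    groups = defaultdict(list)
    for item, stores in stores_by_item.items():
        # frozenset ignores store order and can be used as a dict key.
        groups[frozenset(stores)].append(item)
    # Only groups with more than one item count as duplicates.
    return [tuple(items) for items in groups.values() if len(items) > 1]

output = {period: duplicate_items(items) for period, items in dicta.items()}
print(output)
```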

Python replace entire string if it begin with certain character in dataframe

I have data that contains the string 'None ...' at random places. I am trying to replace a cell in the dataframe with an empty string only when it begins with 'None ...'. Here is what I tried, but I get errors like 'KeyError'.
df = pd.DataFrame({'id': [1,2,3,4,5],
'sub': ['None ... ','None ... test','math None ...','probability','chemistry']})
df.loc[df['sub'].str.replace('None ...','',1), 'sub'] = '' # getting key error
output looking for: (I need to replace entire value in cell if 'None ...' is starting string. Notice, 3rd row shouldn't be replaced because 'None ...' is not starting character)
id sub
1
2
3 math None ...
4 probability
5 chemistry
You can use the below to identify the cells to replace and then assign them an empty value:
df.loc[df['sub'].str.startswith("None"), 'sub'] = ""
df.head()
id sub
0 1
1 2
2 3 math None ...
3 4 probability
4 5 chemistry
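You can simply replace 'None ...', and by using a regular expression you can restrict the replacement to strings that start with None. Note that this removes only the matched prefix, not the rest of the cell.
# regex=True is required on pandas 2.0+, where the default is a literal match
df['sub'] = df['sub'].str.replace(r'^None \.\.\.*', '', n=1, regex=True)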
the output looks like this:
id sub
0 1
1 2 test
2 3 math None ...
3 4 probability
4 5 chemistry
df['sub'] = df['sub'].str.replace(r'[\w\s]*?(None \.\.\.)[\s\w]*?', '', n=1, regex=True)
Out:
sub
id
1
2 test
3
4 probability
5 chemistry
Use startswith to find the rows that need to be replaced, then blank them out with mask:
df['sub']=df['sub'].mask(df['sub'].str.startswith('None ... '),'')
df
Out[338]:
id sub
0 1
1 2
2 3 math None ...
3 4 probability
4 5 chemistry
First, you are passing the strings returned by str.replace to df.loc as row labels, which is why you received a KeyError.
Second you can do this by:
df['sub']=df['sub'].apply(lambda x: '' if x.find('None')==0 else x)
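The answers above fall into two groups: those that blank the whole cell and those that strip only the matched prefix. A small sketch, assuming pandas is installed, shows the two side by side on the question's data; the intermediate names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'sub': ['None ... ', 'None ... test', 'math None ...',
                           'probability', 'chemistry']})

# Boolean mask: True where the cell begins with the literal text.
starts_with_none = df['sub'].str.startswith('None ...')

# Option 1: blank the entire cell wherever the mask is True.
blanked = df['sub'].mask(starts_with_none, '')

# Option 2: remove only the prefix; the rest of the cell survives.
stripped = df['sub'].str.replace(r'^None \.\.\.', '', n=1, regex=True)

print(blanked.tolist())
print(stripped.tolist())
```

Only the first option matches the output the question asks for, since the second leaves ' test' behind in row 2.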

Slicing large lists based on input

If I have multiple lists such that
hello = [1,3,5,7,9,11,13]
bye = [2,4,6,8,10,12,14]
and the user inputs 3
is there a way to get the output to go back 3 indexes in the list and start there to get:
9 10
11 12
13 14
with tabs \t between each space.
if the user would input 5
the expected output would be
5 6
7 8
9 10
11 12
13 14
I've tried
for i in range(user_input):
    print(hello[-i-1], '\t', bye[-i-1])
Just use negative indices that start from the end minus the user input (-user_input) and move to the end (-1), something like:
for i in range(-user_input, 0):
    print(hello[i], bye[i])
Another zip solution, but one-lined:
for h, b in zip(hello[-user_input:], bye[-user_input:]):
    print(h, b, sep='\t')
Avoids converting the result of zip to a list, so the only temporaries are the slices of hello and bye. While iterating by index can avoid those temporaries, in practice it's almost always cleaner and faster to do the slice and iterate the values, as repeated indexing is both unpythonic and surprisingly slow in CPython.
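One edge case is worth keeping in mind with negative slices: -0 is just 0, so hello[-0:] returns the whole list rather than nothing. A minimal sketch that guards against this is below; the function name last_pairs is hypothetical, and it assumes both lists have the same length.

```python
def last_pairs(first, second, count):
    """Return the last `count` aligned pairs from two equal-length lists."""
    if count <= 0:
        # first[-0:] would be the whole list, so handle this explicitly.
        return []
    return list(zip(first[-count:], second[-count:]))

hello = [1, 3, 5, 7, 9, 11, 13]
bye = [2, 4, 6, 8, 10, 12, 14]

pairs = last_pairs(hello, bye, 3)
for h, b in pairs:
    print(h, b, sep='\t')
```

If count is larger than the lists, the slice simply returns everything, so no extra check is needed for that case.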
Use negative indexing in the slice.
hello = [1,3,5,7,9,11,13]
print(hello[-3:])
print(hello[-3:-2])
output
[9, 11, 13]
[9]
You can zip the two lists and use itertools.islice to obtain the desired portion of the output:
from itertools import islice
print('\n'.join(map(' '.join, islice(zip(map(str, hello), map(str, bye)), len(hello) - int(input()), len(hello)))))
Given an input of 5, this outputs:
5 6
7 8
9 10
11 12
13 14
You can use zip to build a list of tuples, where the i-th tuple contains the i-th element from each of the argument iterables, and then use a negative slice to get what you want.
zip_ = list(zip(hello, bye))
for item in zip_[-user_input:]:
    print(item[0], '\t', item[1])
If you want to analyze the data, I think using pandas.DataFrame may be helpful.
import pandas as pd
INPUT_INDEX = int(input('index='))
df = pd.DataFrame([hello, bye])
df = df.iloc[:, len(df.columns)-INPUT_INDEX:]
for col in df.columns:
    h_value, b_value = df[col].values
    print(h_value, b_value)
console
index=3
9 10
11 12
13 14
