Appending columns using loc pandas dataframe - python

I am working with a dataframe that I have created with the below code:
df = pd.DataFrame({'player': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
                   'playerlookup': ['B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'],
                   'score': ['10', '9', '8', '7', '6', '5', '4', '3']})
I want to add a new column called "scorelookup" to this dataframe. For each row, it should take the value in the 'playerlookup' column, search for it in the 'player' column, and return the corresponding score in the new column. For example, the value of "scorelookup" in the first row would be '9', because that was the score for player 'B'. In rows where the 'playerlookup' value isn't contained in the 'player' column (for example the last row, which has 'I' in 'playerlookup'), the new column should be blank.
I have tried using code like:
df['playerlookup'].apply(lambda n: df.loc[df['player'] == n, 'score'])
but have been unsuccessful.
Any help massively appreciated!

I hope this is the result you are looking for:
import pandas as pd

df = pd.DataFrame({'player': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
                   'playerlookup': ['B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'],
                   'score': ['10', '9', '8', '7', '6', '5', '4', '3']})

d1 = df[["playerlookup"]].copy()
d2 = df[["player", "score"]].copy()
d1.rename({'playerlookup': 'player'}, axis='columns', inplace=True)
df["scorelookup"] = d1.merge(d2, on='player', how='left')["score"]
The output:
  player playerlookup score scorelookup
0      A            B    10           9
1      B            C     9           8
2      C            D     8           7
3      D            E     7           6
4      E            F     6           5
5      F            G     5           4
6      G            H     4           3
7      H            I     3         NaN
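For reference, the same lookup can be written more compactly with Series.map; this is a sketch of the same idea, assuming the 'player' values are unique as in the example:
# Build a player -> score Series and map each 'playerlookup' value through it;
# players with no match (like 'I') come back as NaN automatically.
df["scorelookup"] = df["playerlookup"].map(df.set_index("player")["score"])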

How to get particular index number of list items

my_list = ['A', 'B', 'C', 'D', 'E', 'B', 'F', 'D', 'C', 'B']
idx = my_list.index('B')
print("index :", idx)
Here I used the '.index()' function.
for i in my_list:
    print(f"index no. {my_list.index(i)}")
I tried to print the index number of each item in my_list, but it gave the same result for equal values, even though they are located in different places in the list.
if 'B' == my_list[len(my_list) - 1]:
    print("True")
if 'B' == my_list[len(my_list) - 4]:
    print("True")
I need to refer to particular values by their index number (to do something with them). Imagine I need to nest the values of the list into pairs, i.e.:
my_list_2 = ['A', 'B', '2', 'C', '3', 'D', '4', 'E', 'B', '2', 'F', '6', 'D', 'C', '3', 'B']
Each letter should be nested with the consecutive numeric item that follows it, and letters that have no consecutive numeric value should be nested with a default mark instead. So how do I refer to each string value and each numeric value in the code in order to nest them?
In this case, for my example, the expected result is:
--> my_list_2 = [['A', ''], ['B', '2'], ['C', '3'], ['D', '4'], ['E', ''], ['B', '2'], ['F', '6'], ['D', ''], ['C', '3'], ['B', '']]
This is the code I tried:
def_setter = [
    [my_list_2[i], '*'] if my_list_2[i].isalpha() and my_list_2[i + 1].isalpha()
    else [my_list_2[i], my_list_2[i + 1]]
    for i in range(0, len(my_list_2) - 1)]
print("Result : ", def_setter)
But it did not give me the expected result.
Could you please help me with this?
There might be a more Pythonic way to reorganize this list; however, with the following loop you can walk through the list and append [letter, value] if the next item is a number, or [letter, ''] if it is another letter.
def_setter = []
i = 0
while i < len(my_list_2):
    # Last element: a trailing letter gets the default '' partner
    if i + 1 == len(my_list_2):
        if my_list_2[i].isalpha():
            def_setter.append([my_list_2[i], ''])
        break
    prev, cur = my_list_2[i], my_list_2[i + 1]
    if cur.isalpha():
        # Two letters in a row: the first one has no numeric partner
        def_setter.append([prev, ''])
        i += 1
    else:
        # Letter followed by a number: pair them and skip both
        def_setter.append([prev, cur])
        i += 2
print(def_setter)
>>> [['A', ''],
     ['B', '2'],
     ['C', '3'],
     ['D', '4'],
     ['E', ''],
     ['B', '2'],
     ['F', '6'],
     ['D', ''],
     ['C', '3'],
     ['B', '']]
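A note on the design choice here: the index advances by 2 when a letter/number pair is consumed and by 1 when a letter stands alone, which is why an explicit while loop is used rather than a for loop or a comprehension with a fixed step.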

Iterate over rows in pandas dataframe. If blanks exist before a specific column, move all column values over

I am attempting to iterate over all rows in a pandas dataframe and shift the values in the leftmost columns of each row to the right until the non-null values in the row are contiguous. The amount of movement depends on the number of empty columns between the first null value and the cutoff column.
In this case I am attempting to 'close the gap' between the values in the leftmost columns so that they end at column 'ddd', touching the specific cutoff column 'eee'. The corresponding 'abc' rows should help to visualize the problem.
Column 'eee' and the columns to the right of 'eee' should not be touched or moved.
import pandas as pd

df = pd.DataFrame({
    'aaa': ['a', 'a', 'a', 'a', 'a', 'a'],
    'bbb': ['', 'b', 'b', 'b', '', 'b'],
    'ccc': ['', '', 'c', 'c', '', 'c'],
    'ddd': ['', '', '', 'd', '', ''],
    'eee': ['b', 'c', 'd', 'e', 'b', 'd'],
    'fff': ['c', 'd', 'e', 'f', 'c', 'e'],
    'ggg': ['d', 'e', 'f', 'g', 'd', 'f']
})
In rows 1 and 5, 'a' would be moved over 3 columns to column 'ddd'.
In row 2, ['a', 'b'] would be moved over 2 columns to columns ['ccc', 'ddd'] respectively,
etc.
finalOutput = {
    'aaa': ['', '', '', 'a', '', ''],
    'bbb': ['', '', 'a', 'b', '', 'a'],
    'ccc': ['', 'a', 'b', 'c', '', 'b'],
    'ddd': ['a', 'b', 'c', 'd', 'a', 'c'],
    'eee': ['b', 'c', 'd', 'e', 'b', 'd'],
    'fff': ['c', 'd', 'e', 'f', 'c', 'e'],
    'ggg': ['d', 'e', 'f', 'g', 'd', 'f']
}
You can do this:
import numpy as np
from collections import Counter

keep_cols = df.columns[0:df.columns.get_loc('eee')]
df.loc[:, keep_cols] = [np.roll(v, Counter(v)['']) for v in df[keep_cols].values]
print(df):
  aaa bbb ccc ddd eee fff ggg
0               a   b   c   d
1           a   b   c   d   e
2       a   b   c   d   e   f
3   a   b   c   d   e   f   g
4               a   b   c   d
5           a   b   c   d   e
Explanation:
You want to consider only the columns to the left of 'eee', so you take those columns and store them in keep_cols.
Next, each row needs to be shifted to the right by some amount; for the shifting I used numpy's roll. By how much? By the number of blank values in the row, which is counted with Counter from collections.
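To see the mechanics on a single row, here is a minimal check using row 0 of the columns left of 'eee':
import numpy as np
from collections import Counter

row = ['a', '', '', '']      # row 0 of keep_cols
shift = Counter(row)['']     # 3 blanks, so roll right by 3
print(np.roll(row, shift))   # ['' '' '' 'a'] -- 'a' now sits next to 'eee'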

Python: sort 2D list according to the properties of sublists

I have a 2D list:
ls = [
['-2,60233106656288100', '2', 'C'],
['-9,60233106656288100', '2', 'E'],
['-4,60233106656288100', '2', 'E'],
['-3,60233106656288100', '2', 'C'],
['-5,60233106656288100', '4', 'T'],
['-0,39019660724115224', '3', 'E'],
['-3,60233106656288100', '2', 'T'],
['-6,01086748514074000', '1', 'Q'],
['-5,02684650459461800', '0', 'X'],
['-1,25228509312138300', 'A', 'N'],
['-0,85517128843547330', '3', 'E'],
['1,837508975733196200', '3', '-', 'E'],
['1,850925075915637700', '5', '-', 'T'],
['1,826767133229081000', '4', '-', 'C'],
['1,845357865328532300', '3', '-', 'E'],
['0,636275318914609100', 'a', 'n', 'N']
]
I want to sort it so that the shorter sublists come first, ordered by the second column, and then by the third column in such a way that the list stays sorted by the second column (the first row has 0 in the second column, then 1, then the five twos, etc.; but the twos switch places so that I first have the two E's, then the two C's, and then T). After that I want the longer sublists sorted by the fourth column. The row with 'A' should be the last of the shorter lists, and the row with 'a' should be the last row overall. So the output should be as follows:
[
['-5,02684650459461800', '0', 'X'],
['-6,01086748514074000', '1', 'Q'],
['-9,60233106656288100', '2', 'E'],
['-4,60233106656288100', '2', 'E'],
['-3,60233106656288100', '2', 'C'],
['-2,60233106656288100', '2', 'C'],
['-3,60233106656288100', '2', 'T'],
['-0,39019660724115224', '3', 'E'],
['-0,85517128843547330', '3', 'E'],
['-5,60233106656288100', '4', 'T'],
['-1,25228509312138300', 'A', 'N'],
['1,837508975733196200', '3', '-', 'E'],
['1,845357865328532300', '3', '-', 'E'],
['1,826767133229081000', '4', '-', 'C'],
['1,850925075915637700', '5', '-', 'T'],
['0,636275318914609100', 'a', 'n', 'N']
]
I know that I can sort according to the second column as:
ls.sort(key=lambda x:x[1])
But this sorts the whole list and gives:
['-5,02684650459461800', '0', 'X']
['-6,01086748514074000', '1', 'Q']
['-2,60233106656288100', '2', 'C']
['-9,60233106656288100', '2', 'E']
['-4,60233106656288100', '2', 'E']
['-3,60233106656288100', '2', 'C']
['-3,60233106656288100', '2', 'T']
['-0,39019660724115224', '3', 'E']
['-0,85517128843547330', '3', 'E']
['1,837508975733196200', '3', '-', 'E']
['1,845357865328532300', '3', '-', 'E']
['-5,60233106656288100', '4', 'T']
['1,826767133229081000', '4', '-', 'C']
['1,850925075915637700', '5', '-', 'T']
['-1,25228509312138300', 'A', 'N']
['0,636275318914609100', 'a', 'n', 'N']
How can I implement the sorting so that I can choose a certain portion of the list, sort it, and then sort it again according to another column?
If I understand you correctly, you want to sort the list
first by the len of the sublists,
then by each of the elements in the sublist except the first, using the next element as a tie-breaker in case the previous ones are all equal.
For this, you can use a tuple as the sort key, combining the len and a slice of the sublist starting at the second element (i.e. at index 1):
ls.sort(key=lambda x: (len(x), x[1:]))
Note that this will also use elements after the fourth as further tie-breakers, which might not be wanted. Also this creates temporary (near) copies of all the sublists, which may be prohibitive if the lists are longer, even if all comparisons may be decided after the 3rd or 4th element.
Alternatively, if you only need the first four, or ten, or whatever number of elements, you can create a closed slice and use that to compare:
ls.sort(key=lambda x: (len(x), x[1:4]))
Since out-of-bounds slices are evaluated as empty lists, this works even if the lists have fewer elements than either the start- or end-index.
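For instance, a quick illustration of that slicing behaviour:
print(['x', 'y', 'z'][1:4])   # ['y', 'z'] -- the slice stops at the end of the list
print(['x'][1:4])             # [] -- a fully out-of-range slice yields an empty list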
How about:
ls.sort(key=lambda x: (l := len(x), x[1], '' if l < 4 else x[3]))
That would sort it by length of the sublist first, then by the 2nd column and finally by the 4th column, if there is one (picking '' in case there isn't, which would still sort it all the way to the top).

sort data by date and time python

I have my data in a txt file:
1 B F 2019-03-10 16:13:38.935
1 C G 2019-03-11 16:14:39.999
1 B H 2019-03-10 16:13:59.045
1 C I 2019-03-10 16:14:07.561
1 B J 2019-03-10 16:14:35.371
2 A K 2019-03-10 16:14:43.224
1 D L 2019-03-10 16:14:40.854
2 D M 2019-03-10 16:13:58.641
2 E N 2019-03-11 16:15:01.807
1 E O 2019-03-10 16:15:05.878
What I need to do is to split the data according to the first column.
So all rows with 1 in the first column go to one list (or dictionary or whatever) and all rows with 2 in the first column go to another. This is sample data; in the original data we do not know how many different numbers appear in the first column.
What I have to do next is to sort the data for each key (in my case for the numbers 1 and 2) by date and time. I could do that with data.txt, but not with the dictionary.
with open("data.txt") as file:
reader = csv.reader(file, delimiter="\t")
data=sorted(reader, key=itemgetter(0))
lines = sorted(data, key=itemgetter(3))
lines
OUTPUT:
[['1', 'B', 'F', '2019-03-10 16:13:38.935'],
 ['2', 'D', 'M', '2019-03-10 16:13:58.641'],
 ['1', 'B', 'H', '2019-03-10 16:13:59.045'],
 ['1', 'C', 'I', '2019-03-10 16:14:07.561'],
 ['1', 'B', 'J', '2019-03-10 16:14:35.371'],
 ['1', 'D', 'L', '2019-03-10 16:14:40.854'],
 ['2', 'A', 'K', '2019-03-10 16:14:43.224'],
 ['1', 'E', 'O', '2019-03-10 16:15:05.878'],
 ['1', 'C', 'G', '2019-03-11 16:14:39.999'],
 ['2', 'E', 'N', '2019-03-11 16:15:01.807']]
So what I need is to group the data by the number in the first column as well as to sort this by the date and time. Could anyone please help me to combine these two codes somehow? I am not sure if I had to use a dictionary, maybe there is another way to do that.
You can sort the corresponding list for each key after splitting the data according to the first column:
def sort_by_time(key_items):
    return sorted(key_items, key=itemgetter(3))

d = {k: sort_by_time(v) for k, v in d.items()}
If the rows keep the date and the time as separate elements, you can sort by both columns:
sorted(key_items, key=itemgetter(3, 4))
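For completeness, here is one way the dictionary d could have been built from the file before the per-key sort; a sketch that assumes the whitespace-separated layout from the question:
from collections import defaultdict
from operator import itemgetter

d = defaultdict(list)
with open('data.txt') as f:
    for line in f:
        row = line.split()
        d[row[0]].append(row)   # group rows by the first column

# sort each group by date (column 3) and time (column 4)
d = {k: sorted(v, key=itemgetter(3, 4)) for k, v in d.items()}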
itertools.groupby can help build the lists:
from operator import itemgetter
from itertools import groupby
from pprint import pprint
# Read all the data splitting on whitespace
with open('data.txt') as f:
    data = [line.split() for line in f]
# Sort by indicated columns
data.sort(key=itemgetter(0,3,4))
# Build a dictionary keyed on the first column
# Note: data must be pre-sorted by the groupby key for groupby to work correctly.
d = {group:list(items) for group,items in groupby(data,key=itemgetter(0))}
pprint(d)
Output:
{'1': [['1', 'B', 'F', '2019-03-10', '16:13:38.935'],
['1', 'B', 'H', '2019-03-10', '16:13:59.045'],
['1', 'C', 'I', '2019-03-10', '16:14:07.561'],
['1', 'B', 'J', '2019-03-10', '16:14:35.371'],
['1', 'D', 'L', '2019-03-10', '16:14:40.854'],
['1', 'E', 'O', '2019-03-10', '16:15:05.878'],
['1', 'C', 'G', '2019-03-11', '16:14:39.999']],
'2': [['2', 'D', 'M', '2019-03-10', '16:13:58.641'],
['2', 'A', 'K', '2019-03-10', '16:14:43.224'],
['2', 'E', 'N', '2019-03-11', '16:15:01.807']]}

Trying to do a left join of two datasets but getting strange results

To make this as clear as possible, I started with a simple example. I created two random dataframes:
dummy_data1 = {
    'id': ['1', '2', '3', '4', '5'],
    'Feature1': ['A', 'C', 'E', 'G', 'I'],
    'Feature2': ['B', 'D', 'F', 'H', 'J']}
df1 = pd.DataFrame(dummy_data1, columns=['id', 'Feature1', 'Feature2'])

dummy_data2 = {
    'id': ['1', '2', '6', '7', '8'],
    'Feature3': ['K', 'M', 'O', 'Q', 'S'],
    'Feature4': ['L', 'N', 'P', 'R', 'T']}
df2 = pd.DataFrame(dummy_data2, columns=['id', 'Feature3', 'Feature4'])
And if I apply df_merge = pd.merge(df1, df2, on='id', how='outer') or df_merge = df1.merge(df2, how='left', left_on='id', right_on='id'), I get the desired output.
Now I am trying to apply the same technique to two large datasets that have the same number of rows. All I want to do is join the columns together into one large dataframe. The length of each dataframe is 512573, but when I apply
df_merge = orig_data_updated.merge(demographic_data1, how='left', left_on='Location+Type', right_on='Location+Type')
the length magically becomes 3596301, which is simply not possible. My question is simple: how do I do a left join on two dataframes such that the number of rows stays the same and I just join the columns together?
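A left join preserves the left frame's row count only when the join key is unique on the right-hand side; every duplicated key on the right multiplies the matching rows, which is exactly how 512573 rows can become 3596301. A sketch of how one might check for and work around this, using the column and frame names from the question:
# Count duplicated keys on the right-hand side
print(demographic_data1['Location+Type'].duplicated().sum())

# If each key should contribute one row, de-duplicate the right side first
df_merge = orig_data_updated.merge(
    demographic_data1.drop_duplicates(subset='Location+Type'),
    how='left', on='Location+Type')
print(len(df_merge) == len(orig_data_updated))   # now True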
