Index error while concatenating two dataframes in pandas

I get the following error:
pandas.core.indexes.base.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
from this code:
dfp = pd.concat([df, tdf], axis=1)
I am trying to append the columns of tdf to the columns of df.
For these print statements
print(df.shape)
print(tdf.shape)
print(df.columns)
print(tdf.columns)
print(df.index)
print(tdf.index)
I get the following output:
(70000, 25)
(70000, 20)
Index(['300', '301', '302', '303', '304', '305', '306', '307', '308', '309',
'310', '311', '312', '313', '314', '315', '316', '317', '318', '319',
'320', '321', '322', '323', '324'],
dtype='object')
Index(['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13',
'14', '15', '16', '17', '18', '19', '20'],
dtype='object')
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
...
9990, 9991, 9992, 9993, 9994, 9995, 9996, 9997, 9998, 9999],
dtype='int64', length=70000)
RangeIndex(start=0, stop=70000, step=1)
Any idea what the issue is? Why would the index be a problem? The indexes should be the same since I am concatenating columns, not rows, and the column names are all different.
Thanks!

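The problem is that df is not uniquely indexed: its Int64Index has 70000 entries but starts at 0 and ends at 9999, so labels are presumably repeated. With axis=1, concat aligns rows on the index, which fails when the two indexes differ and one of them has duplicate labels. So you need to either reset the index, which keeps the old index as an extra column,
pd.concat([df.reset_index(),tdf], axis=1)
or drop it
pd.concat([df.reset_index(drop=True),tdf], axis=1)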
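A minimal sketch of the diagnosis and the fix, assuming two small stand-in frames (the data and sizes here are hypothetical, not the asker's): checking index.is_unique shows which frame carries repeated labels, and resetting that index lets the column-wise concat line the rows up by position.

```python
import pandas as pd

# Hypothetical stand-ins for df and tdf from the question.
df = pd.DataFrame({"300": [1, 2, 3, 4]}, index=[0, 1, 0, 1])  # repeated labels
tdf = pd.DataFrame({"1": [10, 20, 30, 40]})                   # default RangeIndex 0..3

print(df.index.is_unique)   # False
print(tdf.index.is_unique)  # True

# Discard the non-unique index so both frames share 0..n-1 before concatenating.
dfp = pd.concat([df.reset_index(drop=True), tdf], axis=1)
print(dfp.shape)            # (4, 2)
```

This only makes sense when the rows of both frames are already in the same order, since the original labels are thrown away.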

Related

Numpy / Flatten a list

I have created this list of lists of strings:
list1 = [['20']*3,['35']*2,['40']*4,['10']*2,['15']*3]
Result:
[['20', '20', '20'], ['35', '35'], ['40', '40', '40', '40'], ['10', '10'], ['15', '15', '15']]
I can convert it into a single list using list comprehension
charlist = [x for sublist in list1 for x in sublist]
print(charlist)
['20', '20', '20', '35', '35', '40', '40', '40', '40', '10', '10', '15', '15', '15']
I was wondering how to do that with numpy
listNP=np.array(list1)
gives as output :
array([list(['20', '20', '20']), list(['35', '35']),
list(['40', '40', '40', '40']), list(['10', '10']),
list(['15', '15', '15'])], dtype=object)
However, listNP.flatten() returns the same result. I probably missed a step when converting the list into a NumPy array.

You can bypass all the extra operations and use np.repeat:
>>> np.repeat(['20', '35', '40', '10', '15'], [3, 2, 4, 2, 3])
array(['20', '20', '20', '35', '35', '40', '40', '40', '40',
'10', '10', '15', '15', '15'], dtype='<U2')
If you need dtype=object, make the first argument into an array first:
arr1 = np.array(['20', '35', '40', '10', '15'], dtype=object)
np.repeat(arr1, [3, 2, 4, 2, 3])

Use hstack()
import numpy as np
list1 = [['20']*3,['35']*2,['40']*4,['10']*2,['15']*3]
flatlist = np.hstack(list1)
print(flatlist)
['20' '20' '20' '35' '35' '40' '40' '40' '40' '10' '10' '15' '15' '15']
In trying to construct your listNP with np.array as you do in the question, I got a warning about jagged arrays and having to use dtype=object, but letting hstack construct it directly doesn't raise a warning (thanks @Michael Delgado in the comments).
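As a further sketch (not taken from the answers above), np.concatenate and itertools.chain.from_iterable should also flatten a jagged list of lists without first building an object array. The variable names flat_np and flat_py are only illustrative.

```python
from itertools import chain

import numpy as np

list1 = [['20'] * 3, ['35'] * 2, ['40'] * 4, ['10'] * 2, ['15'] * 3]

# Each sublist is converted to its own 1-D array and joined end to end.
flat_np = np.concatenate(list1)

# Pure-Python equivalent, useful when a plain list is enough.
flat_py = list(chain.from_iterable(list1))

print(flat_np.tolist() == flat_py)  # True
```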

How to select number columns in pandas dataframe

How can I select only the columns whose names are numbers from the column names below?
output_df.columns = Index(['EVENT_ID', 'Date', 'Time', 'Track', '#', 'Distance', 'Betfair Grade','Runners', 'Win Trap', 'Win BSP', '1', '2', '3', '4', '5', '6', '7',
'8', '9', '10', 'Trap1 Odds Band', 'Trap2 Odds Band', 'Trap3 Odds Band'],
dtype='object')
I tried this function and I got the below output.
output_df.filter(regex="\d+", axis=1).columns
Index(['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'Trap1 Odds Band',
'Trap2 Odds Band', 'Trap3 Odds Band'],dtype='object')
I just want the number columns:
['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']

new_df = df.loc[:, df.columns.str.isnumeric()]
This should work, since the column labels are strings.

Try filtering on a full match:
output_df.filter(regex=r"^\d+$", axis=1).columns
Or, better, without filter:
output_df.columns[output_df.columns.str.isdigit()]
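A small sketch comparing the two approaches on a shortened, hypothetical version of the column list from the question. It assumes all column labels are strings, which is what the .str accessor needs.

```python
import pandas as pd

# Shortened, hypothetical subset of the columns in the question.
cols = ['EVENT_ID', 'Date', 'Win BSP', '1', '2', '10', 'Trap1 Odds Band']
output_df = pd.DataFrame([[0] * len(cols)], columns=cols)

# Anchored regex: the whole label must consist of digits.
by_regex = output_df.filter(regex=r"^\d+$", axis=1).columns.tolist()

# Boolean mask built with the string accessor on the column index.
by_mask = output_df.columns[output_df.columns.str.isdigit()].tolist()

print(by_regex)  # ['1', '2', '10']
print(by_mask)   # ['1', '2', '10']
```

The unanchored pattern from the question also matches 'Trap1 Odds Band' because the regex only has to be found somewhere in the label.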

Create array using lists (consisting of lists) but without flattening inner lists - python

I am trying to create an array from two lists, one of which has a list as each element. In the first case below, np.column_stack does what I want, but in the second case, although my initial lists look similar in structure, my list of lists enters the array flattened, which is not what I need.
I am attaching two examples to replicate. In the first case, I get an array where each row has a string as its first element and a list as its second, while in the second case I get 4 columns (the list is flattened) for no obvious reason.
Example 1
temp_list_column1=['St. Raphael',
'Goppingen',
'HSG Wetzlar',
'Huttenberg',
'Kiel',
'Stuttgart',
'Izvidac',
'Viborg W',
'Silkeborg-Voel W',
'Bjerringbro W',
'Lyngby W',
'Most W',
'Ostrava W',
'Presov W',
'Slavia Prague W',
'Dicken',
'Elbflorenz',
'Lubeck-Schwartau',
'HK Ogre/Miandum',
'Stal Mielec',
'MKS Perla Lublin W',
'Koscierzyna W',
'CS Madeira W',
'CSM Focsani',
'CSM Bucuresti',
'Constanta',
'Iasi',
'Suceava',
'Timisoara',
'Saratov',
'Alisa Ufa W',
'Pozarevac',
'Nove Zamky',
'Aranas',
'Ricoh',
'H 65 Hoor W',
'Lugi W',
'Strands W']
temp_list_column2=[['32', '16', '16'],
['32', '16', '16'],
['27', '13', '14'],
['23', '9', '14'],
['29', '14', '15'],
['24', '17', '7'],
['30', '15', '15'],
['26', '12', '14'],
['27', '13', '14'],
['26'],
['18', '9', '9'],
['34', '15', '19'],
['30', '13', '17'],
['31', '13', '18'],
['27', '10', '17'],
['28', '14', '14'],
['24', '14', '10'],
['28', '12', '16'],
['28', '9', '19'],
['22', '13', '9'],
['30', '14', '16'],
['22', '14', '8'],
['17', '8', '9'],
['26'],
['41', '21', '20'],
['36', '18', '18'],
['10'],
['25', '12', '13'],
['27', '16', '11'],
['31', '15', '16'],
['25', '15', '10'],
['24', '8', '16'],
['28', '14', '14'],
['24', '13', '11'],
['26', '14', '12'],
['33', '17', '16'],
['26', '12', '14'],
['17', '12', '5']]
import numpy as np
temp_array = np.column_stack((temp_list_column1,temp_list_column2))
output
array([['St. Raphael', ['32', '16', '16']],
['Goppingen', ['32', '16', '16']],
['HSG Wetzlar', ['27', '13', '14']],
['Huttenberg', ['23', '9', '14']],
['Kiel', ['29', '14', '15']],
['Stuttgart', ['24', '17', '7']],
['Izvidac', ['30', '15', '15']],
['Viborg W', ['26', '12', '14']],
['Silkeborg-Voel W', ['27', '13', '14']],
['Bjerringbro W', ['26']],
['Lyngby W', ['18', '9', '9']],
['Most W', ['34', '15', '19']],
['Ostrava W', ['30', '13', '17']],
['Presov W', ['31', '13', '18']],
['Slavia Prague W', ['27', '10', '17']],
['Dicken', ['28', '14', '14']],
['Elbflorenz', ['24', '14', '10']],
['Lubeck-Schwartau', ['28', '12', '16']],
['HK Ogre/Miandum', ['28', '9', '19']],
['Stal Mielec', ['22', '13', '9']],
['MKS Perla Lublin W', ['30', '14', '16']],
['Koscierzyna W', ['22', '14', '8']],
['CS Madeira W', ['17', '8', '9']],
['CSM Focsani', ['26']],
['CSM Bucuresti', ['41', '21', '20']],
['Constanta', ['36', '18', '18']],
['Iasi', ['10']],
['Suceava', ['25', '12', '13']],
['Timisoara', ['27', '16', '11']],
['Saratov', ['31', '15', '16']],
['Alisa Ufa W', ['25', '15', '10']],
['Pozarevac', ['24', '8', '16']],
['Nove Zamky', ['28', '14', '14']],
['Aranas', ['24', '13', '11']],
['Ricoh', ['26', '14', '12']],
['H 65 Hoor W', ['33', '17', '16']],
['Lugi W', ['26', '12', '14']],
['Strands W', ['17', '12', '5']]], dtype=object)
Example 2
temp_list_column1b=['Benidorm',
'Alpla Hard',
'Dubrava',
'Frydek-Mistek',
'Karvina',
'Koprivnice',
'Nove Veseli',
'Vardar',
'Meble Elblag Wojcik',
'Zaglebie',
'Benfica',
'Barros W',
'Juvelis W',
'Assomada W',
'UOR No.2 Moscow',
'Izhevsk W',
'Stavropol W',
'Din. Volgograd W',
'Zvenigorod W',
'Adyif W',
'Crvena zvezda',
'Ribnica',
'Slovan',
'Jeruzalem Ormoz',
'Karlskrona',
'Torslanda W']
temp_list_column2b=[['28', '14', '14'],
['27', '12', '15'],
['24', '13', '11'],
['24', '14', '10'],
['28', '17', '11'],
['30', '16', '14'],
['26', '15', '11'],
['38', '18', '20'],
['24', '13', '11'],
['33', '15', '18'],
['24', '10', '14'],
['18', '11', '7'],
['22', '9', '13'],
['25', '12', '13'],
['19', '11', '8'],
['24', '10', '14'],
['21', '9', '12'],
['18', '10', '8'],
['31', '17', '14'],
['29', '15', '14'],
['26', '14', '12'],
['29', '12', '17'],
['25', '11', '14'],
['33', '19', '14'],
['32', '14', '18'],
['19', '12', '7']]
import numpy as np
temp_arrayb = np.column_stack((temp_list_column1b,temp_list_column2b))
output
array([['Benidorm', '28', '14', '14'],
['Alpla Hard', '27', '12', '15'],
['Dubrava', '24', '13', '11'],
['Frydek-Mistek', '24', '14', '10'],
['Karvina', '28', '17', '11'],
['Koprivnice', '30', '16', '14'],
['Nove Veseli', '26', '15', '11'],
['Vardar', '38', '18', '20'],
['Meble Elblag Wojcik', '24', '13', '11'],
['Zaglebie', '33', '15', '18'],
['Benfica', '24', '10', '14'],
['Barros W', '18', '11', '7'],
['Juvelis W', '22', '9', '13'],
['Assomada W', '25', '12', '13'],
['UOR No.2 Moscow', '19', '11', '8'],
['Izhevsk W', '24', '10', '14'],
['Stavropol W', '21', '9', '12'],
['Din. Volgograd W', '18', '10', '8'],
['Zvenigorod W', '31', '17', '14'],
['Adyif W', '29', '15', '14'],
['Crvena zvezda', '26', '14', '12'],
['Ribnica', '29', '12', '17'],
['Slovan', '25', '11', '14'],
['Jeruzalem Ormoz', '33', '19', '14'],
['Karlskrona', '32', '14', '18'],
['Torslanda W', '19', '12', '7']],
dtype='<U19')
In the first case, the shape is (38, 2), while in the second it is (26, 4) (I am interested in the number of columns only). Am I missing something obvious?

Your problem here seems to be that your first list of lists is jagged, while your second is rectangular.
Look at the difference in how NumPy converts the following two lists into arrays (which, as @hpaulj points out, is exactly what happens when you pass them to column_stack):
In [1]: b1 = [
...: [1,2,3],
...: [2,3,4],
...: [3,4,5],
...: [4,5,6]]
In [2]: np.array(b1)
Out[2]:
array([[1, 2, 3],
[2, 3, 4],
[3, 4, 5],
[4, 5, 6]])
In [3]: b2 = [
...: [1,2,3],
...: [2,3],
...: [3]]
In [4]: np.array(b2)
Out[4]: array([list([1, 2, 3]), list([2, 3]), list([3])], dtype=object)
Thus, when column stacking your example lists, in the first case you have a 1D array of lists that gets converted into a single column, whereas in the second case you have a 2D matrix of numbers that has 3 columns.
You should probably not be using NumPy's column_stack at all in this case; just zip the two lists together. If you want a NumPy array as your final result, use np.array(list(zip(list_a, list_b))).
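If the goal is an (n, 2) object array whose second column holds the inner lists unflattened, one option is to preallocate the array and fill it cell by cell. This is a sketch under the assumption that a dtype=object array is acceptable; the names, scores and arr variables are illustrative. Recent NumPy versions may refuse to build such mixed string/list arrays implicitly, and this approach sidesteps that by never asking NumPy to infer a shape.

```python
import numpy as np

# Hypothetical short versions of the rectangular lists from Example 2.
names = ['Benidorm', 'Alpla Hard', 'Dubrava']
scores = [['28', '14', '14'], ['27', '12', '15'], ['24', '13', '11']]

# Preallocate an object array so NumPy never tries to expand the inner lists.
arr = np.empty((len(names), 2), dtype=object)
arr[:, 0] = names
for i, row in enumerate(scores):
    arr[i, 1] = row  # single-cell assignment stores the list itself

print(arr.shape)  # (3, 2)
print(arr[0, 1])  # ['28', '14', '14']
```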
EDIT: In retrospect, your data structure sounds more like what's typically referred to as a DataFrame, rather than a matrix which is what Numpy is trying to give you.
import pandas as pd
data = pd.DataFrame()
data['name'] = temp_list_column1
data['numbers'] = temp_list_column2
# Or
data = pd.DataFrame(list(zip(temp_list_column1, temp_list_column2)), columns=['name', 'numbers'])
Which gives you a data structure that looks like:
name numbers
0 John [1, 2, 3]
1 James [2, 3, 4]
2 Peter [3, 4, 5]
3 Paul [4, 5, 6]

Diagnosis
It seems the issue is that in the second example all the sublists have 3 elements, while in the first example some sublists have length 1, e.g. ['Bjerringbro W', ['26']]; the list ['26'] has only one element.
In the second case, np.column_stack apparently does not keep lists as cell elements. Why you would want lists as cell elements is a separate discussion that I will not go into here. Here is the solution.
Special Case Solution
I assume you don't mind using pandas
import pandas as pd
series_1 = pd.Series(temp_list_column1b).to_frame(name='col1') # name it whatever you want
series_2 = pd.Series(temp_list_column2b).to_frame(name='col2') # name it whatever you want
df = pd.concat([series_1, series_2], axis=1)
# print(df) # view in pandas form
# print(df.values) # to see how it looks as a NumPy array
# print(df.values.shape) # to see what the shape is in NumPy terms
Generalized Solution
Assuming you have a list of such columns which is called "list_of_cols". Then:
import pandas as pd
# list_of_cols: all the lists you want to combine
df = pd.concat([pd.Series(temp_col).to_frame() for temp_col in list_of_cols], axis=1)
I hope this helps!

Python 3 Dictionary sorted with wrong output

I am currently working on an assignment and got some output that confused me.
I am trying to sort the following dictionary:
result = {'A1': '9', 'A2': '14', 'A3': '16', 'A4': '0', 'B1': '53', 'B2': '267', 'B3': '75', 'B4': '22', 'C1': '19', 'C2': '407', 'C3': '171', 'C4': '56', 'C5': '10', 'D3': '47', 'D4': '34', 'D5': '10'}
My sorting code with Python 3 is the following : (only sorted by value)
sortedList = [v for v in sorted(result.values())]
The output is :
['0', '10', '10', '14', '16', '171', '19', '22', '267', '34', '407', '47', '53', '56', '75', '9']
which is not sorted numerically.
Why does this happen?
I have used another dict to test like this:
testD = {'A':'5','B': '9','c': '8','d': '6'}
the output is right :
['5', '6', '8', '9']
Is there something wrong with my result dictionary or is there something I am missing?

Strings will be ordered with a lexical sort. To sort your data numerically, convert the values into integers first.

Strings are compared one character at a time, so '30' < '4' because '3' < '4'. You need to use a key parameter to get the comparison based on the numeric value, not the string characters.
Also, it's redundant to use a list comprehension on something that already returns a list.
sortedList = sorted(result.values(), key=int)
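To make the difference concrete, here is a short sketch on a hypothetical subset of the dictionary. It contrasts the default lexical ordering with key=int, and also shows one way to keep the keys attached while sorting by numeric value.

```python
# Hypothetical subset of the dictionary from the question.
result = {'A1': '9', 'A2': '14', 'A4': '0', 'B2': '267', 'C5': '10'}

lexical = sorted(result.values())           # compares character by character
numeric = sorted(result.values(), key=int)  # compares by integer value

print(lexical)  # ['0', '10', '14', '267', '9']
print(numeric)  # ['0', '9', '10', '14', '267']

# Sorting (key, value) pairs by the numeric value keeps the keys attached.
pairs = sorted(result.items(), key=lambda kv: int(kv[1]))
print(pairs[0])  # ('A4', '0')
```

Using key=int leaves the values as strings in the output; only the comparison uses the integers.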

Since the dictionary values are strings, they are sorted lexically, based on their character codes.
You need the values to be sorted according to their integer values.
result = {'A1': '9', 'A2': '14', 'A3': '16', 'A4': '0', 'B1': '53', 'B2': '267', 'B3': '75', 'B4': '22', 'C1': '19', 'C2': '407', 'C3': '171', 'C4': '56', 'C5': '10', 'D3': '47', 'D4': '34', 'D5': '10'}
As mentioned in the comments by @AChampion, you can pass a key function so that the values are compared as integers, like this:
sortedList = sorted(result.values(), key = int)
print(sortedList)
Or you can do something like this :
result_ints = dict((k,int(v)) for k,v in result.items())
sortedList = [str(v) for v in sorted(result_ints.values())]
print(sortedList)
Both of the above code snippets will result in :
['0', '9', '10', '10', '14', '16', '19', '22', '34', '47', '53', '56', '75', '171', '267', '407']

You can try this:
result = [{'A1': '9', 'A2': '14', 'A3': '16', 'A4': '0', 'B1': '53', 'B2': '267', 'B3': '75', 'B4': '22', 'C1': '19', 'C2': '407', 'C3': '171', 'C4': '56', 'C5': '10', 'D3': '47', 'D4': '34', 'D5': '10'}]
lst = []
for item in result:
    for key in item.keys():
        # convert each value to int so the sort is numeric
        lst.append(int(item.get(key)))
sortedList = sorted(lst)
print(sortedList)
Output: [0, 9, 10, 10, 14, 16, 19, 22, 34, 47, 53, 56, 75, 171, 267, 407]

Identify or count continuously repeated number (actually missing value: nan) in the list

Basically, I would like to identify whether the missing values in a data set occur consecutively or not. If there are consecutive missing values in the data set, I would like to know whether the length of each run of missing values is above a certain number or not.
For example:
data =['1', '0', '9', '31', '11', '12', 'nan', '10', '44', '53', '12', '66', '99', '3', '2', '6.75833',....., 'nan', 'nan', 'nan', '3', '7', 'nan', 'nan']
In the data above, the total number of 'nan' would be 6, which could be calculated with data.count('nan'). However, what I want to know is the length of the longest run of consecutive missing values. For this data, the answer would be 3.
I apologize for not showing example code, but I am a novice in this area and had no idea how to start.
Any ideas, help or tips would be really appreciated.

This looks like a job for itertools.groupby():
>>> from itertools import groupby
>>> data =['1', '0', '9', '31', '11', '12', 'nan', '10', '44', '53',
'12', '66', '99', '3', '2', '6.75833', 'nan', 'nan', 'nan',
'3', '7', 'nan', 'nan']
>>> [len(list(group)) for key, group in groupby(data) if key == 'nan']
[1, 3, 2]
Note if your code actually has real NaNs instead of strings, the if key == 'nan' equality test should be replaced with math.isnan(key).
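For real float NaNs, a variant worth considering is to group on math.isnan itself rather than on the raw values, since separate NaN objects do not compare equal and might otherwise end up in separate groups. The sketch below assumes hypothetical float data; nan_runs and longest are illustrative names.

```python
import math
from itertools import groupby

# Hypothetical data with real float NaNs instead of the string 'nan'.
nan = float('nan')
data = [1.0, 0.0, float('nan'), 10.0,
        float('nan'), float('nan'), float('nan'),
        3.0, float('nan'), float('nan')]

# Group on the predicate, so every NaN maps to the same key (True).
nan_runs = [len(list(group))
            for is_nan, group in groupby(data, key=math.isnan)
            if is_nan]
longest = max(nan_runs, default=0)  # 0 when there are no NaNs at all

print(nan_runs)  # [1, 3, 2]
print(longest)   # 3
```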

Or you can try this one, which is faster because it counts each run without building an intermediate list:
grouped_L = [sum(1 for i in group) for k,group in groupby(data) if k == 'nan']

Using pyrle for speed. In this solution I replace nan with a number not in the data (-42). This is because nan is a difficult value for rles, as np.nan != np.nan and hence no nans are treated as consecutive.
import numpy as np
data =['1', '0', '9', '31', '11', '12', 'nan', '10', '44', '53', '12', '66', '99', '3', '2', '6.75833', 'nan', 'nan', 'nan', '3', '7', 'nan', 'nan']
arr = np.array([float(f) for f in data])  # np.float was removed in NumPy 1.24
assert -42 not in arr
from pyrle import Rle
arr[np.isnan(arr)] = -42  # replace NaN before building the Rle
r = Rle(arr)
is_nan = r.values == -42
np.max(r.runs[is_nan])
# 3
