I'm trying to remove duplicate rows from a DataFrame. The DataFrame comes from calculating the distance between every pair in a given list of geocoordinates. As you can see in the following DataFrame, each pair appears twice (once in each direction), but I can't simply deduplicate on 'dist', because in other cases the distance might legitimately be 0 or 1 for different pairs, and then important data would be discarded.
import pandas as pd
df = pd.DataFrame({'Name_x': ['a', 'b', 'c', 'd'],
                   'Name_y': ['b', 'a', 'd', 'c'],
                   'Latitude_x': ['lat_a', 'lat_b', 'lat_c', 'lat_d'],
                   'Longitude_x': ['long_a', 'long_b', 'long_c', 'long_d'],
                   'Latitude_y': ['lat_b', 'lat_a', 'lat_d', 'lat_c'],
                   'Longitude_y': ['long_b', 'long_a', 'long_d', 'long_c'],
                   'dist': [0, 0, 1, 1]})
df
In this case I would like to keep the rows with Name_x: ['a','c'] and Name_y: ['b','d'], with the corresponding geocoordinates: Latitude_x: ['lat_a','lat_c'], Latitude_y: ['lat_b','lat_d'], Longitude_x: ['long_a','long_c'], Longitude_y: ['long_b','long_d'].
I'm not sure if this is what you want:
df['Name_x'].eq(df['Name_y'].shift(-1))  # True on the first row of each mirrored pair
df.loc[df['Name_x'].eq(df['Name_y'].shift(-1))]  # your "unique" rows: Name_x ['a','c'], Name_y ['b','d']
Note this relies on the mirrored rows being adjacent.
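If the mirrored rows are not guaranteed to be adjacent, here is a sketch of an order-independent alternative: build a canonical key by sorting each name pair, then keep only the first occurrence of each key (the key frame is just a throwaway helper, not part of your data).
import numpy as np

# 'a','b' and 'b','a' sort to the same pair, so mirrored rows get the same key
key = pd.DataFrame(np.sort(df[['Name_x', 'Name_y']].to_numpy(), axis=1),
                   index=df.index)
unique_pairs = df.loc[~key.duplicated()]  # keeps Name_x ['a','c'], Name_y ['b','d']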
I created a dataframe with some previous operations, but when I query a column name with an index (for example, df['order_number'][0]), multiple rows/records come back as output.
(A screenshot, not reproduced here, shows that the number of unique indexes in the dataframe is smaller than the total number of indexes, i.e. the index contains duplicates.)
It looks like the rows kept their original indexes when you merged/joined the dataframes. Try:
df = df.reset_index()  # add drop=True if you don't want the old index kept as a column
Could you share a df.head() as an example? Usually when you consume a data source, if you set the arg index to True, each row will be assigned a unique numerical index.
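As a minimal sketch of what is probably going on (the part_1/part_2 frames here are hypothetical):
import pandas as pd

part_1 = pd.DataFrame({'order_number': [101, 102]})
part_2 = pd.DataFrame({'order_number': [103, 104]})

df = pd.concat([part_1, part_2])  # both pieces keep their 0-based indexes
print(df['order_number'][0])      # two rows come back, one per duplicated index 0

df = df.reset_index(drop=True)    # rebuild a unique 0..n-1 index
print(df['order_number'][0])      # now a single scalar: 101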
I have a bunch of dataframes where I want to pull out a single column from each and merge them into another dataframe with a timestamp column that is not indexed.
So e.g. all the dataframes look like:
[Index] [time]                [col1] [col2] [etc]
0       2020-04-21T18:00:00Z  1      2      ...
All of the dataframes have a 'time' column and a 'col1' column. Because the 'time' columns do not necessarily overlap, I made a new dataframe with a join of all the dataframes (which I added to a dictionary):
di = ...  # dictionary of all the dataframes of interest

fulltimeslist = []
for key in di:
    temptimeslist = di[key]['time'].tolist()
    fulltimeslist.extend(x for x in temptimeslist if x not in fulltimeslist)

datadf = pd.DataFrame()
datadf['time'] = fulltimeslist  # make a new df and add this as a column
(I'm sure there's an easier way to do the above; any suggestions are welcome.) Note that for a number of reasons, translating the ISO datetime format into a datetime and setting that as an index is not ideal.
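For what it's worth, one possibly simpler way to build that union of time strings, a sketch assuming di holds the dataframes as above, is to concatenate the 'time' columns and drop duplicates, which also preserves first-seen order:
import pandas as pd

datadf = pd.DataFrame()
datadf['time'] = (pd.concat([d['time'] for d in di.values()], ignore_index=True)
                    .drop_duplicates()
                    .reset_index(drop=True))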
The dumb way to do what I want is obvious enough:
for key in di:
    datadf[key] = float("NaN")
    tempdf = di[key]  # could skip this probably
    for i in range(len(datadf)):
        matches = tempdf.time[tempdf.time == datadf.time[i]].index.tolist()
        # make sure the value only shows up once; could reasonably skip this
        # and put protection in elsewhere
        if len(matches) == 1:
            datadf.loc[i, key] = float(tempdf[colofinterest][matches[0]])
# I guess I could do the above backwards, so I loop over only the shorter
# dataframe to save some time.
but this seems needlessly long for Python... I originally tried the pandas merge and join methods but got various KeyErrors when trying them; the same goes for 'in' statements inside the if statements.
E.g., I've tried things like
datadf.join(Nodes_dict[key],datadf['time']==Nodes_dict[key]['time'],how="left").select()
but this fails.
I guess the question boils down to the following steps:
1) given 2 dataframes with a column of strings (times in ISO format), find the indexes in the larger one where they match the shorter one (or vice versa)
2) given that list of indexes, populate a separate column in the larger df using values from the smaller df, but only in the correct spots, and NaN otherwise
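For what it's worth, both steps are what a left merge on the 'time' strings does in one call. A sketch, reusing the names from the question and assuming each source frame has at most one row per time value:
for key in di:
    # a left merge keeps every row of datadf; rows whose time string has no
    # match in di[key] get NaN in the new column automatically
    col = (di[key][['time', colofinterest]]
           .drop_duplicates('time')
           .rename(columns={colofinterest: key}))
    datadf = datadf.merge(col, on='time', how='left')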
I am appending different dataframes to make one set. Occasionally, some values have the same index, so it stores the value as a series. Is there a quick way within Pandas to just overwrite the value instead of storing all the values as a series?
You weren't very clear. If you want to resolve the duplicated-indexes problem, the pd.DataFrame.reset_index() method will probably be enough. But if you have duplicate rows when you concat the DataFrames, just use the pd.DataFrame.drop_duplicates() method. Otherwise, share a bit of your code or be clearer.
I'm not sure that the code below is what you're looking for. Say you have two dataframes with one column, the same index, and different values, and you want to overwrite the values in one dataframe with the other's. You can do it with a simple loop over the rows:
import pandas as pd

df_1 = pd.DataFrame({'col_1': ['a', 'b', 'c', 'd']})
df_2 = pd.DataFrame({'col_1': ['q', 'w', 'e', 'r']})

rows = df_1.shape[0]
for idx in range(rows):
    # .loc avoids chained assignment; note the column is 'col_1' in both frames
    df_1.loc[idx, 'col_1'] = df_2['col_1'].iloc[idx]
Then check df_1; you should get this:
df_1
col_1
0 q
1 w
2 e
3 r
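For what it's worth, assuming the two frames share the same index, the loop isn't strictly needed; a direct column assignment, which aligns on the index, does the same overwrite in one step:
df_1['col_1'] = df_2['col_1']  # overwrites every value of col_1 in df_1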
Whether or not this is what you want, let me know so I can help you.
I have implemented the following piece of code:
array = [table.iloc[:, [0]], table.iloc[:, [i]]]
It is supposed to be a dataframe consisting of two vectors extracted from a previously imported dataset. I use the parameter i because this code is part of a loop that uses a predefined function to analyze correlations between one fixed variable [0] and the rest of them; each iteration checks the correlation with a different variable [i].
Python treats this object as a list (or as a tuple when I change the brackets to round ones). I need this object to be a dataframe (the next step is to remove NaN values using .dropna, which is a DataFrame attribute).
How can I fix that issue?
If I have correctly understood your question, you want to build an extract from a larger dataframe containing only 2 columns known by their index number. You can simply do:
sub = table.iloc[:, [0, i]]
It will keep all attributes (including index, column names and dtype) from the original table dataframe.
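And since the result is a regular DataFrame, the next step mentioned in the question works directly:
sub = table.iloc[:, [0, i]].dropna()  # drop rows with NaN in either column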
What is your goal with the dataframe?
A dataframe is a common term in data analysis using pandas.
Pandas was developed just to facilitate such analysis; with it, getting the data from a .csv file and transforming it into a dataframe is as simple as:
import pandas as pd
df = pd.read_csv('my-data.csv')
df.info()
Or from a dict or array
df = pd.DataFrame(my_dict_or_array)
Then you can select the columns you wish (the second argument to .loc is a list of column labels):
df.loc[:, ['COLUMN_1', 'COLUMN_2']]
Let us know if it's what you are looking for
I'm new to pandas and python, and could definitely use some help.
I have the code below, which almost does what I want. It creates dummy variables for the unique values in a field and indexes them by the unique combinations of the values in two other fields.
What I would like is only one row for each unique combination of the fields used for the index. Right now I get multiple rows for, say, 'asset subs end dt' = 10/30/2008 and 'reseller csn' = 55008 if the dummy variable comes up 3 times. I would rather have one row for that combination of index field values, with a 3 in the dummy variable column.
Code:
df = data
df = df.set_index(['ASSET_SUBS_END_DT','RESELLER_CSN'])
Dummies = pd.get_dummies(df['EXPERTISE'])
something like:
df.groupby(level=[0, 1]).EXPERTISE.count()
When you do this groupby, everything with the same index is grouped together. Assuming your data in EXPERTISE is not null, you will get back a new object with unique index values and the count per index. Try it out for yourself, play around with the results, and see how it can be combined with your existing DataFrame to get the final result you want.
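Concretely, a sketch of the full pattern under the MultiIndex set above: summing the dummies per index combination collapses the repeats into counts, e.g. a 3 where a dummy value occurred three times.
Dummies = pd.get_dummies(df['EXPERTISE'])     # inherits the MultiIndex from df
result = Dummies.groupby(level=[0, 1]).sum()  # one row per index combination;
                                              # each cell counts that dummy's occurrences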