Fast searching a Pandas dataframe column - python

I have a Pandas dataframe with one column containing string IDs. I am using idxmax() to return the index of a matching ID, but since the data is over a million rows, each search takes a long time. Is there a more efficient way to search that would reduce the time? The IDs are not currently sorted.

Please add a sample of the data you are working on.
The following snippet sorts the dataframe on that column and selects the first row:
df.sort_values('column_name').iloc[0,0]
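
If the lookups are exact matches on the ID, a common alternative (just a sketch, assuming the IDs live in a column named 'id') is to set that column as the index, so each lookup is a hash lookup instead of a full column scan:

import pandas as pd

# Hypothetical frame; the real one has over a million rows
df = pd.DataFrame({'id': ['a17', 'b42', 'c08'], 'value': [1, 2, 3]})

# One-time cost: build a hash-based index on the ID column
indexed = df.set_index('id')

# Positional index of a given ID, replacing (df['id'] == 'b42').idxmax()
pos = indexed.index.get_loc('b42')   # -> 1
row = indexed.loc['b42']             # the matching row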

Related

How to generate duplicate rows and enumerate/calculate based on a column value in a DataFrame in python

I am trying to convert some SAS code to Python. I am having difficulty creating a new column that enumerates and calculates values by looping over rows based on a column value.
Original dataset:
My desired output looks like:
This is a solution I found after researching several similar cases on Stack Overflow:
df = df.iloc[np.arange(len(df)).repeat(df['count'])]  # 'count' is the column holding how many copies to make
df = df.set_index(df.groupby('id').cumcount() + 1).reset_index()
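
For reference, a minimal sketch of what those two lines do, assuming a hypothetical frame with an 'id' column and a 'count' column holding the number of copies per row:

import numpy as np
import pandas as pd

df = pd.DataFrame({'id': ['A', 'B'], 'count': [2, 3]})

# Repeat each row as many times as its 'count' value
df = df.iloc[np.arange(len(df)).repeat(df['count'])]

# Enumerate the copies within each id: 1, 2, ..., count
df = df.set_index(df.groupby('id').cumcount() + 1).reset_index()
print(df)   # the 'index' column holds the enumeration 1, 2, 1, 2, 3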

How to split a dataframe column into multiple rows and transform string

I am pulling some data from a graphql api and the data I am getting is being returned in a dict like this:
and then I also turned the dict into a pandas dataframe which returns this:
so from my beginner understanding, it looks like the 'swaps' row is just a super long string. I have looked at some online tutorials and still cannot figure out how to transform this row into many rows (it's 1000 rows). Any help would be greatly appreciated!
You can convert just the value of swaps to a dataframe:
df = pd.DataFrame(x['data']['swaps'])
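
A minimal sketch, assuming the GraphQL response is a dict of the shape x = {'data': {'swaps': [...]}} with one dict per swap (the field names here are made up):

import pandas as pd

# Hypothetical payload; the real one comes back from the GraphQL API
x = {'data': {'swaps': [
    {'amountUSD': '120.5', 'timestamp': 1620000000},
    {'amountUSD': '75.0', 'timestamp': 1620000600},
]}}

# One row per swap record instead of one long string
df = pd.DataFrame(x['data']['swaps'])
print(df)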

Merging values from columns in Pandas Dataframe

Before I start, my disclaimer is that I'm very new to Python and I've been building a flask app in an effort to learn more, so my question might be silly but please oblige me.
I have a Pandas Dataframe created from reading in a csv or excel doc on a flask app. The user uploads the document, so the dataframe and column names change with every upload.
The user also selects the columns they want to merge from a html multiselect object, which returns the selected columns from the user to the app script in the form of a python list.
What I currently have is:
df=pd.read_csv(file)
columns=df.columns.values
and
selected_col=request.form.getlist('columns')
All of this works fine, but I'm now stuck. How can I merge the values from the rows of the selected columns (selected_col) into a new column on the dataframe, so that df["Merged"] holds the list of selected column values for each row?
I've seen people use the merge function, which seems to work well for 2 columns, but in this case any number of columns could be merged, so I'm looking for a function that either takes a list of columns and merges them, or iterates through the list of columns, appending the values into a new column.
Sounds like what you want to do is more like an element-wise concatenation, not a merge.
If I understand you correctly, you can get your desired result with a list comprehension creating a nested list that is turned into a pandas Series by assigning it as a new DataFrame column:
df['Merged'] = [list(row) for row in df[selected_col].values]
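
For illustration, a small sketch with a hypothetical upload and selection (the column names are made up; the real ones change with every upload):

import pandas as pd

df = pd.DataFrame({'first': ['Ann', 'Bob'], 'last': ['Lee', 'Ray'], 'age': [30, 41]})
selected_col = ['first', 'last']   # what request.form.getlist('columns') would return

# One list per row, built from just the selected columns
df['Merged'] = [list(row) for row in df[selected_col].values]
print(df['Merged'].tolist())   # [['Ann', 'Lee'], ['Bob', 'Ray']]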

Groupby for large number columns in pandas

I am trying to loop through multiple excel files in pandas. The structure of the files is very similar: the first 10 columns form a key and the rest of the columns hold the values. I want to group by the first 10 columns and sum the rest.
I have searched and found solutions online for similar cases, but my problem is that:
I have a large number of value columns (to be aggregated as a sum), and
the number / names of the value columns are different for each file (dataframe).
The key columns are the same across all the files.
I can't share the actual data sample but here is the format sample of the file structure
and here is the desired output from the above data
It is like a groupby operation, but the uncertain, large number of columns and the varying header names make it difficult to use groupby or pivot. Can anyone suggest the best possible solution for this in Python?
Edited:
df.groupby(list(df.columns[:11])).agg(sum)
is working, but for some reason it takes 25-30 minutes; MS Access does the same thing in 1-2 minutes. Am I doing something wrong here, or is there another way to do it in Python itself?
Just use df.columns, which holds the list of columns; you can then slice it to get the 10 leftmost columns.
This should work:
df.groupby(df.columns[:10].to_list()).sum()
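
A small sketch of the idea, assuming (hypothetically) that the first few columns are the key and the value columns vary per file; on the real files the slice would be df.columns[:10]:

import pandas as pd

# Hypothetical file: first 2 columns are the key, the rest are value columns
df = pd.DataFrame({
    'region': ['N', 'N', 'S'],
    'product': ['x', 'x', 'y'],
    'v2021': [10, 5, 7],
    'v2022': [1, 2, 3],
})

key_cols = df.columns[:2].to_list()    # df.columns[:10] for the real files
out = df.groupby(key_cols, sort=False).sum().reset_index()
print(out)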

fastest way to copy values from one cell of a dataframe to another data frame if a third cell matches

I have a master dataframe with anywhere between 750 to 3000 rows of data.
I have a daily order dataframe with anywhere from 3000 to 5000 rows of data.
If the product code of the daily order dataframe is found in the master dataframe, I get the item cost. Otherwise, it is marked as invalid and deleted.
I currently do this via 2 for loops, but I will have to do many more such comparisons and data updates (other fields to compare, other values to copy).
What is the most efficient way to do this?
I cannot make the column I am comparing the index column of the master dataframe.
In this case, the product code may be unique in the master and I could do a merge, but there are other cases where I may have to compare other values like supplier city which may not be unique.
I seem to be doing this repeatedly in all my Python codes and I want to learn the most efficient way to do this.
Order DF: (screenshot of the order CSV the Order DF is created from)
Master DF: (screenshot of the master CSV the Master DF is created from)
def fillVol(orderDF, mstrDF, paramC, paramF, notFound):
    orderDF['ttlVol'] = 0
    for i in range(len(orderDF)):
        found = False
        for row in mstrDF.itertuples():
            if orderDF.loc[i, paramC] == getattr(row, paramC):
                orderDF.loc[i, paramF[0]] = getattr(row, paramF[0])  # material volume (CBF/CBM)
                found = True
                break
        if not found:
            notFound.append(orderDF.loc[i, paramC])
    orderDF['ttlVol'] = orderDF[paramF[0]] * orderDF[paramF[2]]
    return notFound
I am passing in the column names I am comparing and the column names I am filling with data because there are minor variations in naming across the CSVs. In the data I have shared, the material volume is CBF; in some cases it is CBM.
The data columns cannot be the index because no single column holds unique data; it is always a combination of values that makes a row unique.
The data, in this case, is a float and numpy could be used, but in other cases, like copying city names from a master, the data is a string. numpy was the suggestion given to other people with a similar issue.
I don't know if this is the most efficient way of doing it - as someone who started programming with Fortran and then C, I always favour basic datatypes, and this solution does not use them. It is, however, a highly Pythonic solution.
orderDF = orderDF[orderDF[paramC].isin(mstrDF[paramC])]
orderDF = orderDF.reset_index(drop=True)
I use a left merge on the orderDF and mstrDF data frames to copy all relevant values:
orderDF = orderDF.merge(mstrDF.drop_duplicates(paramC, keep='last')[[paramC, paramF[0]]], on=paramC, how='left', validate='m:1')
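
A minimal end-to-end sketch of the filter-then-merge approach, with made-up column names (paramC as the product code, paramF[0] as the CBF volume, paramF[2] as the quantity):

import pandas as pd

paramC, paramF = 'productCode', ['CBF', 'price', 'qty']

orderDF = pd.DataFrame({'productCode': ['P1', 'P2', 'P9'], 'qty': [4, 2, 5]})
mstrDF = pd.DataFrame({'productCode': ['P1', 'P2', 'P3'], 'CBF': [0.5, 1.2, 0.8]})

# Drop orders whose product code is missing from the master
orderDF = orderDF[orderDF[paramC].isin(mstrDF[paramC])].reset_index(drop=True)

# Left merge to copy the volume across; 'm:1' guards against duplicate master keys
orderDF = orderDF.merge(
    mstrDF.drop_duplicates(paramC, keep='last')[[paramC, paramF[0]]],
    on=paramC, how='left', validate='m:1')

orderDF['ttlVol'] = orderDF[paramF[0]] * orderDF[paramF[2]]
print(orderDF)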
