Efficiently running operations on Pandas DataFrame columns that are not unique - python

I have a DataFrame similar to this:
import pandas as pd

raw_data = {'Identifier': ['10', '10', '10', '11', '11', '12', '13']}
df = pd.DataFrame(raw_data, columns=['Identifier'])
print df
As you can see, the 'Identifier' column is not unique, and the dataframe itself has many rows.
Every time I need to compute a new column from the Identifier column, I do something like:
df['CalculatedColumn'] = df['Identifier'] + apply calculation here
Since Identifier is not unique, is there a better way of doing this? Maybe store the calculation result for each unique identifier and then map the results back? The calculation is quite complex, and combined with the number of rows it takes a long time; I would like to avoid repeating it for every duplicate identifier.
Any thoughts?

I'm pretty sure there is a more pythonic way, but this works for me:
import pandas as pd

raw_data = {'Identifier': ['10', '10', '10', '11', '11', '12', '13']}
df = pd.DataFrame(raw_data, columns=['Identifier'])
df['CalculatedColumn'] = 0

# Keep only one row per unique identifier (copy to avoid SettingWithCopyWarning)
dfuni = df.drop_duplicates(['Identifier']).copy()
dfuni['CalculatedColumn'] = dfuni['Identifier'] * 2  # perform calculation

# Copy each unique result back into the original dataframe
for j in range(len(dfuni)):
    mask = df['Identifier'] == dfuni['Identifier'].iloc[j]
    df.loc[mask, 'CalculatedColumn'] = dfuni['CalculatedColumn'].iloc[j]

print(df)
print(dfuni)
As an explanation: I create a new dataframe dfuni that contains only the unique identifiers from your original dataframe. The calculation is then performed on this smaller frame (here I just multiplied the Identifier by two; because it is a string, the result is 1010, 1111, etc.). Up to here I like the code, but then I loop through all the values of dfuni to copy the results back into your original df; for that part there might be a more elegant solution (one option is sketched after the output below).
As a result, I get:
Identifier CalculatedColumn
0 10 1010
1 10 1010
2 10 1010
3 11 1111
4 11 1111
5 12 1212
6 13 1313
PS: This code was tested with Python 3; the only thing I adapted from your snippet was the print statements. I may have missed something.
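Following up on the "more elegant solution" point above, here is a minimal sketch of the copy-back step without an explicit loop. It still uses the toy "multiply the string by two" as a stand-in for the real, expensive calculation, so expensive_calculation below is only a placeholder: compute the result once per unique identifier, then map it back onto every row.
import pandas as pd

raw_data = {'Identifier': ['10', '10', '10', '11', '11', '12', '13']}
df = pd.DataFrame(raw_data, columns=['Identifier'])

def expensive_calculation(identifier):
    # placeholder for the real, complex calculation
    return identifier * 2

# Run the calculation once per unique identifier, then map the
# result back onto every row that shares that identifier.
unique_results = {ident: expensive_calculation(ident)
                  for ident in df['Identifier'].unique()}
df['CalculatedColumn'] = df['Identifier'].map(unique_results)
print(df)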


pandas dataframe: how to select rows where one column-value is like 'values in a list'

I need to select rows from a dataframe where one column's value is "like" any of the values in a list.
The dataframe is large, with millions of rows, and I need to search for rows whose column value matches any of a list of thousands of values.
Below is some sample data.
NAME,AGE
Amar,80
Rameshwar,60
Farzand,90
Naren,60
Sheikh,45
Ramesh,55
Narendra,85
Rakesh,86
Ram,85
Kajol,80
Naresh,86
Badri,85
Ramendra,80
My code is below. The problem is that I'm using a for loop, so as the number of values in the search list (the names_like variable in my code) grows, the number of loop iterations and concat operations grows with it, and the code runs very slowly.
I can't use isin(), because isin() does exact matching, and what I need is a "like" (substring) condition.
I'm looking for a better, more performant way of getting the required result.
My Code:-
import pandas as pd

infile = "input.csv"
df = pd.read_csv(infile)
print(f"df=\n{df}")

names_like = ['Ram', 'Nar']
df_res = pd.DataFrame(columns=df.columns)
for name in names_like:
    df1 = df[df['NAME'].str.contains(name, na=False)]
    df_res = pd.concat([df_res, df1], axis=0)
print(f"df_res=\n{df_res}")
My Output:-
df_res=
NAME AGE
1 Rameshwar 60
5 Ramesh 55
8 Ram 85
12 Ramendra 80
3 Naren 60
6 Narendra 85
10 Naresh 86
You can join all the names with | and pass the result to str.contains as a regex "or" pattern; the loop is not necessary:
df_res = df[df['NAME'].str.contains('|'.join(names_like), na=False)]
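One caveat worth adding (hedged, not part of the original answer): str.contains interprets the joined string as a regular expression, so if the search values might ever contain regex metacharacters, escaping them first is safer:
import re
import pandas as pd

df = pd.read_csv("input.csv")
names_like = ['Ram', 'Nar']

# Escape each value so any regex metacharacters are matched literally.
pattern = '|'.join(re.escape(name) for name in names_like)
df_res = df[df['NAME'].str.contains(pattern, na=False)]
print(df_res)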

How can I separate text into multiple values in a CSV file using Python? [duplicate]

I'd like to begin processing some data for analysis, but first I have to separate the responses into multiple values. Currently each column contains a single string that combines three responses, e.g. Agree: #score; Disagree: #score; Neither agree nor disagree: #score. I'd like to split those responses into individual values so I can build an analysis for a visualization. Would I need to use regular expressions to do this?
So far that code I have is just to load the data with some libraries I plan to use:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

def load_data():
    # importing datasets
    df = pd.read_csv('dataset.csv')
    return df

load_data().head()
You need to use str.split(';') first to split the values into multiple columns. Then, for each column value, split the string again using str.split(':') and take the [-1] part of it.
Here's how you can do it.
import pandas as pd

df = pd.DataFrame({'username': ['Dragonfly', 'SpeedHawk', 'EagleEye'],
                   'Question1': ['Comfortable:64;Neither comfortable nor uncomfortable:36',
                                 'Comfortable:0;Neither comfortable nor uncomfortable:100',
                                 'Comfortable:10;Neither comfortable nor uncomfortable:90'],
                   'Question2': ['Agree:46;Disagree:13;Neither agree nor disagree:41',
                                 'Agree:96;Disagree:0;Neither agree nor disagree:4',
                                 'Agree:90;Disagree:5;Neither agree nor disagree:5']})

# Split each question column on ';' into separate response columns
df[['Q1_Comfortable', 'Q1_Neutral']] = df['Question1'].str.split(';', expand=True)
df[['Q2_Agree', 'Q2_Disagree', 'Q2_Neutral']] = df['Question2'].str.split(';', expand=True)
df.drop(columns=['Question1', 'Question2'], inplace=True)

# Keep only the score part after ':' in every response column
for col in df.columns[1:]:
    df[col] = df[col].str.split(':').str[-1]

print(df)
The output of this will be:
username Q1_Comfortable Q1_Neutral Q2_Agree Q2_Disagree Q2_Neutral
0 Dragonfly 64 36 46 13 41
1 SpeedHawk 0 100 96 0 4
2 EagleEye 10 90 90 5 5
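A small follow-up, hedged: after the split, the extracted scores are still strings, so if the goal is numeric analysis or plotting, converting them is a likely next step (this snippet continues from the code above):
# Convert the extracted score columns from strings to numbers
for col in df.columns[1:]:
    df[col] = pd.to_numeric(df[col])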

Cell-wise calculations in a Pandas Dataframe

I have what I'm sure is a fundamental lack of understanding about how dataframes work in Python. I am sure this is an easy question, but I have looked everywhere and can't find a good explanation. I am trying to understand why sometimes dataframe calculations seem to run on a row-by-row (or cell by cell) basis, and sometimes seem to run for an entire column... For example:
data = {'Name':['49-037-23094', '49-029-21476', '49-029-20812', '49-041-21318'], 'Depth':[20, 21, 7, 18]}
df = pd.DataFrame(data)
df
Which gives:
Name Depth
0 49-037-23094 20
1 49-029-21476 21
2 49-029-20812 7
3 49-041-21318 18
Now I know I can do:
df['DepthDouble']=df['Depth']*2
And get:
Name Depth DepthDouble
0 49-037-23094 20 40
1 49-029-21476 21 42
2 49-029-20812 7 14
3 49-041-21318 18 36
Which is what I would expect. But this doesn't always work, and I'm trying to understand why. For example, I am trying to run this code to modify the name:
df['newName']=''.join(re.findall('\d',str(df['Name'])))
which gives:
Name Depth DepthDouble \
0 49-037-23094 20 40
1 49-029-21476 21 42
2 49-029-20812 7 14
3 49-041-21318 18 36
newName
0 04903723094149029214762490292081234904121318
1 04903723094149029214762490292081234904121318
2 04903723094149029214762490292081234904121318
3 04903723094149029214762490292081234904121318
So it is taking all the values in my name column, removing the dashes, and concatenating them. Of course, I'd just like it to be a new name column exactly the same as the original "Name" column, but without the dashes.
So, can anyone help me understand what I am doing wrong here? I don't understand why sometimes DataFrame calculations for one column are done row by row (e.g., the DepthDouble column) and sometimes Python seems to take all the values in the entire column and run the calculation once (e.g., the newName column).
Surely the way to get around this isn't by making a loop for every index in the df to force it to run individually for each row for a given column?
If the output you're looking for is:
Name Depth newName
0 49-037-23094 20 4903723094
1 49-029-21476 21 4902921476
2 49-029-20812 7 4902920812
3 49-041-21318 18 4904121318
The way to get this is:
df['newName'] = df['Name'].map(lambda name: ''.join(re.findall(r'\d', name)))
map is like apply but specifically for Series objects. Since you're applying to only the Name column you are operating on a Series.
If the lambda part is confusing, an equivalent way to write it is:
def find_digits(name):
    return ''.join(re.findall(r'\d', name))

df['newName'] = df['Name'].map(find_digits)
The equivalent operation in traditional for loops is:
newNameSeries = pd.Series(name='newName')
for name in df['Name']:
    newNameSeries = newNameSeries.append(pd.Series(''.join(re.findall(r'\d', name))), ignore_index=True)
pd.concat([df, newNameSeries], axis=1).rename(columns={0: 'newName'})
While there might be a slightly cleaner way to do the loop, you can see how much simpler the first approach is compared to trying to use for-loops. It's also faster. As you already have indicated you know, avoid for loops when using pandas.
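A caveat, hedged as a version-dependent note: Series.append has been removed in pandas 2.0, so the loop version above no longer runs on recent releases. A list comprehension expresses the same row-by-row idea, and is still clunkier than map:
import re
import pandas as pd

data = {'Name': ['49-037-23094', '49-029-21476', '49-029-20812', '49-041-21318'],
        'Depth': [20, 21, 7, 18]}
df = pd.DataFrame(data)

# Same row-by-row idea without Series.append (removed in pandas 2.0)
df['newName'] = [''.join(re.findall(r'\d', name)) for name in df['Name']]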
The issue is that with str(df['Name']) you are converting the entire Name-column of your DataFrame to one single string. What you want to do instead is to use one of pandas' own methods for strings, which will be applied to every single element of the column.
For example, you could use pandas' replace method for strings:
import pandas as pd
data = {'Name':['49-037-23094', '49-029-21476', '49-029-20812', '49-041-21318'], 'Depth':[20, 21, 7, 18]}
df = pd.DataFrame(data)
df['newName'] = df['Name'].str.replace('-', '')
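An equivalent variant, offered as a hedged addition rather than the answer's own suggestion: a regex replace can strip everything except digits in a single vectorized call, which also covers names containing characters other than '-' (continuing from the frame built above):
# Remove every non-digit character in one vectorized string operation
df['newName'] = df['Name'].str.replace(r'\D', '', regex=True)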

Pandas dataframe "all true" criterion

Python 2.7, Pandas 0.18.
I have a DataFrame, and I have methods that select a subset of the rows via a criterion parameter. I'd like to know a more idiomatic way to write a criterion that matches all rows.
Here's a very simple example:
import pandas as pd

def apply_to_matching(df, criterion):
    df.loc[criterion, 'A'] = df[criterion]['A'] * df[criterion]['B']

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 100, 1000, 10000]})
criterion = (df['A'] < 3)
result = apply_to_matching(df, criterion)
print df
The output would be:
A B
0 10 10
1 200 100
2 3 1000
3 4 10000
because the criterion applies to only the first two rows.
I would like to know the idiomatic way to create a criterion that selects all rows of the DataFrame.
This could be done by adding a column of all true values to the DataFrame:
# Add a column
df['AllTrue']=True
criterion = df['AllTrue']
result = apply_to_matching(df,criterion)
print df.drop('AllTrue',axis=1)
The output is:
A B
0 10 10
1 200 100
2 3000 1000
3 40000 10000
but that approach adds a column to my DataFrame, which I have to filter out later to not get it in my output.
So, is there a more idiomatic way to do this in Pandas? One which does not require me to know anything about the column names and does not change the DataFrame?
When everything should be True, boolean indexing would require a Series of True values. But with the code you have above, another way to look at it is that the criterion argument can also be a slice: selecting all rows with .loc looks like df.loc[:, 'A'], and the bare : is just a slice. Since you need to pass it as an argument to the apply_to_matching function, use the slice builtin:
apply_to_matching(df, slice(None, None))
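Another option, offered as a hedged sketch rather than part of the original answer: if a boolean criterion is preferred over a slice, an all-True Series aligned to the DataFrame's index also matches every row without adding a column:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 100, 1000, 10000]})

# An all-True boolean Series on the same index selects every row.
criterion = pd.Series(True, index=df.index)
df.loc[criterion, 'A'] = df[criterion]['A'] * df[criterion]['B']
print(df)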

Pandas Groupby - Sparse Matrix Error

This question is related to the question I asked previously about using pandas get_dummies() function (link below).
Pandas Get_dummies for nested tables
However, in the course of utilizing the solution provided in the answer, I noticed odd behavior with the groupby function. The issue is that repeated (non-unique) index values in a dataframe appear to cause an error when the matrix is represented in sparse format, while everything works as expected for a dense matrix.
I have extremely high-dimensional data, so a sparse matrix is required for memory reasons. An example of the error is below. If anyone has a workaround, it would be greatly appreciated.
Working:
import pandas as pd

df = pd.DataFrame({'Instance': [1, 1, 2, 3],
                   'Cat_col': ['John', 'Smith', 'Jane', 'Doe']})
result = pd.get_dummies(df.Cat_col, prefix='Name')
result['Instance'] = df.Instance
result = result.set_index('Instance')
result = result.groupby(level=0).apply(max)
Failing
import pandas as pd

df = pd.DataFrame({'Instance': [1, 1, 2, 3],
                   'Cat_col': ['John', 'Smith', 'Jane', 'Doe']})
result = pd.get_dummies(df.Cat_col, prefix='Name', sparse=True)
result['Instance'] = df.Instance
result = result.set_index('Instance')
result = result.groupby(level=0).apply(max)
Note: you will need pandas version 0.16.1 or greater.
Thank you in advance
You can perform your groupby in a different way as a workaround. Don't set Instance as the index; use the column for your groupby and then drop the Instance column from the result (the last column in this case, since it was just added). Groupby will make an Instance index for you.
import pandas as pd

df = pd.DataFrame({'Instance': [1, 1, 2, 3],
                   'Cat_col': ['John', 'Smith', 'Jane', 'Doe']})
result = pd.get_dummies(df.Cat_col, prefix='Name', sparse=True)
result['Instance'] = df.Instance

# WORKAROUND: group by the Instance column, then drop it (the last column) from the output
result = result.groupby('Instance').apply(max)[result.columns[:-1]]
result
Out[58]:
Name_Doe Name_Jane Name_John Name_Smith
Instance
1 0 0 1 1
2 0 1 0 0
3 1 0 0 0
Note: the sparse dataframe stores your Instance ints as floats within a BlockIndex in the dataframe column. To make the index exactly the same as in the first example, you'd need to convert it back from float to int:
result.index = result.index.map(int)
result.index.name = 'Instance'
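A related variant, offered only as a hedged sketch (behavior with sparse=True may differ across pandas versions): the dummies can be grouped by the Instance column directly, without ever adding it to the dummy frame, which sidesteps the index manipulation entirely:
import pandas as pd

df = pd.DataFrame({'Instance': [1, 1, 2, 3],
                   'Cat_col': ['John', 'Smith', 'Jane', 'Doe']})

# Group the dummy columns by the Instance values from the original frame;
# the result is indexed by Instance with integer labels preserved.
dummies = pd.get_dummies(df['Cat_col'], prefix='Name')
result = dummies.groupby(df['Instance']).max()
result.index.name = 'Instance'
print(result)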
