pandas error in df.apply() only for a specific dataframe - python

Noticed something very strange in pandas. My dataframe(with 3 rows and 3 columns) looks like this:
When I try to extract ID and Name(separated by underscore) to their own columns using command below, it gives me an error:
df[['ID','Name']] = df.apply(lambda x: get_first_last(x['ID_Name']), axis=1, result_type='broadcast')
Error is:
ValueError: cannot broadcast result
Here's the interesting part though..When I delete the "From_To" column from the original dataframe, performing the same df.apply() to split ID_Name works perfectly fine and I get the new columns like this:
I have checked a lot of SO answers but none seem to help. What did I miss here?
P.S. get_first_last is a very simple function like this:
def get_first_last(s):
str_lis = s.split("_")
return [str_lis[0], str_lis[1]]

From the doc of pandas.DataFrame.apply :
'broadcast' : results will be broadcast to the original shape of the DataFrame, the original index and columns will be retained.
So the problem is that the original shape of your dataframe is (3, 3) and the result of your apply function is 2 columns, so you have a mismatch. and that also explane why when you delete the "From_To", the new shape is (3, 2) and now you have a match ...
You can use 'broadcast' instead of 'expand' and you will have your expected result.
table = [
['1_john', 23, 'LoNDon_paris'],
['2_bob', 34, 'Madrid_milan'],
['3_abdellah', 26, 'Paris_Stockhom']
]
df = pd.DataFrame(table, columns=['ID_Name', 'Score', 'From_to'])
df[['ID','Name']] = df.apply(lambda x: get_first_last(x['ID_Name']), axis=1, result_type='expand')
hope this helps !!

It's definitely not a good use case to use apply, you should rather do:
df[["ID", "Name"]]=df["ID_Name"].str.split("_", expand=True, n=1)
Which for your data will output (I took only first 2 columns from your data frame):
ID_Name Score ID Name
0 1_john 23 1 john
1 2_bob 34 2 bob
2 3_janet 45 3 janet
Now n=1 is just in case you would have multiple _ (e.g. as a part of the name) - to make sure you will return at most 2 columns (otherwise the above code would fail)
For instance, if we slightly modify your code, we get the following output:
ID_Name Score ID Name
0 1_john 23 1 john
1 2_bob_jr 34 2 bob_jr
2 3_janet 45 3 janet

Related

Python Return all Columns [duplicate]

Using Python Pandas I am trying to find the Country & Place with the maximum value.
This returns the maximum value:
data.groupby(['Country','Place'])['Value'].max()
But how do I get the corresponding Country and Place name?
Assuming df has a unique index, this gives the row with the maximum value:
In [34]: df.loc[df['Value'].idxmax()]
Out[34]:
Country US
Place Kansas
Value 894
Name: 7
Note that idxmax returns index labels. So if the DataFrame has duplicates in the index, the label may not uniquely identify the row, so df.loc may return more than one row.
Therefore, if df does not have a unique index, you must make the index unique before proceeding as above. Depending on the DataFrame, sometimes you can use stack or set_index to make the index unique. Or, you can simply reset the index (so the rows become renumbered, starting at 0):
df = df.reset_index()
df[df['Value']==df['Value'].max()]
This will return the entire row with max value
I think the easiest way to return a row with the maximum value is by getting its index. argmax() can be used to return the index of the row with the largest value.
index = df.Value.argmax()
Now the index could be used to get the features for that particular row:
df.iloc[df.Value.argmax(), 0:2]
The country and place is the index of the series, if you don't need the index, you can set as_index=False:
df.groupby(['country','place'], as_index=False)['value'].max()
Edit:
It seems that you want the place with max value for every country, following code will do what you want:
df.groupby("country").apply(lambda df:df.irow(df.value.argmax()))
Use the index attribute of DataFrame. Note that I don't type all the rows in the example.
In [14]: df = data.groupby(['Country','Place'])['Value'].max()
In [15]: df.index
Out[15]:
MultiIndex
[Spain Manchester, UK London , US Mchigan , NewYork ]
In [16]: df.index[0]
Out[16]: ('Spain', 'Manchester')
In [17]: df.index[1]
Out[17]: ('UK', 'London')
You can also get the value by that index:
In [21]: for index in df.index:
print index, df[index]
....:
('Spain', 'Manchester') 512
('UK', 'London') 778
('US', 'Mchigan') 854
('US', 'NewYork') 562
Edit
Sorry for misunderstanding what you want, try followings:
In [52]: s=data.max()
In [53]: print '%s, %s, %s' % (s['Country'], s['Place'], s['Value'])
US, NewYork, 854
In order to print the Country and Place with maximum value, use the following line of code.
print(df[['Country', 'Place']][df.Value == df.Value.max()])
You can use:
print(df[df['Value']==df['Value'].max()])
Using DataFrame.nlargest.
The dedicated method for this is nlargest which uses algorithm.SelectNFrame on the background, which is a performant way of doing: sort_values().head(n)
x y a b
0 1 2 a x
1 2 4 b x
2 3 6 c y
3 4 1 a z
4 5 2 b z
5 6 3 c z
df.nlargest(1, 'y')
x y a b
2 3 6 c y
import pandas
df is the data frame you create.
Use the command:
df1=df[['Country','Place']][df.Value == df['Value'].max()]
This will display the country and place whose value is maximum.
My solution for finding maximum values in columns:
df.ix[df.idxmax()]
, also minimum:
df.ix[df.idxmin()]
I'd recommend using nlargest for better performance and shorter code. import pandas
df[col_name].value_counts().nlargest(n=1)
I encountered a similar error while trying to import data using pandas, The first column on my dataset had spaces before the start of the words. I removed the spaces and it worked like a charm!!

Cell-wise calculations in a Pandas Dataframe

I have what I'm sure is a fundamental lack of understanding about how dataframes work in Python. I am sure this is an easy question, but I have looked everywhere and can't find a good explanation. I am trying to understand why sometimes dataframe calculations seem to run on a row-by-row (or cell by cell) basis, and sometimes seem to run for an entire column... For example:
data = {'Name':['49-037-23094', '49-029-21476', '49-029-20812', '49-041-21318'], 'Depth':[20, 21, 7, 18]}
df = pd.DataFrame(data)
df
Which gives:
Name Depth
0 49-037-23094 20
1 49-029-21476 21
2 49-029-20812 7
3 49-041-21318 18
Now I know I can do:
df['DepthDouble']=df['Depth']*2
And get:
Name Depth DepthDouble
0 49-037-23094 20 40
1 49-029-21476 21 42
2 49-029-20812 7 14
3 49-041-21318 18 36
Which is what I would expect. But this doesn't always work, and I'm trying to understand why. For example, I am trying to run this code to modify the name:
df['newName']=''.join(re.findall('\d',str(df['Name'])))
which gives:
Name Depth DepthDouble \
0 49-037-23094 20 40
1 49-029-21476 21 42
2 49-029-20812 7 14
3 49-041-21318 18 36
newName
0 04903723094149029214762490292081234904121318
1 04903723094149029214762490292081234904121318
2 04903723094149029214762490292081234904121318
3 04903723094149029214762490292081234904121318
So it is taking all the values in my name column, removing the dashes, and concatenating them. Of course, I'd just like it to be a new name column exactly the same as the original "Name" column, but without the dashes.
So, can anyone help me understand what I am doing wrong here? I Don't understand why sometimes Dataframe calculations for one column are done row by row (e.g., the Depth Doubled column) and sometimes Python seems to take all values in the entire column and run the calculation (e.g., the newName column).
Surely the way to get around this isn't by making a loop for every index in the df to force it to run individually for each row for a given column?
If the output you're looking for is:
Name Depth newName
0 49-037-23094 20 4903723094
1 49-029-21476 21 4902921476
2 49-029-20812 7 4902920812
3 49-041-21318 18 4904121318
The way to get this is:
df['newName']=df['Name'].map(lambda name: ''.join(re.findall('\d', name)))
map is like apply but specifically for Series objects. Since you're applying to only the Name column you are operating on a Series.
If the lambda part is confusing, an equivalent way to write it is:
def find_digits(name):
return ''.join(re.findall('\d', name))
df['newName']=df['Name'].map(find_digits)
The equivalent operation in traditional for loops is:
newNameSeries = pd.Series(name='newName')
for name in df['Name']:
newNameSeries = newNameSeries.append(pd.Series(''.join(re.findall('\d', name))), ignore_index=True)
pd.concat([df, newNameSeries], axis=1).rename(columns={0:'newName'})
While there might be a slightly cleaner way to do the loop, you can see how much simpler the first approach is compared to trying to use for-loops. It's also faster. As you already have indicated you know, avoid for loops when using pandas.
The issue is that with str(df['Name']) you are converting the entire Name-column of your DataFrame to one single string. What you want to do instead is to use one of pandas' own methods for strings, which will be applied to every single element of the column.
For example, you could use pandas' replace method for strings:
import pandas as pd
data = {'Name':['49-037-23094', '49-029-21476', '49-029-20812', '49-041-21318'], 'Depth':[20, 21, 7, 18]}
df = pd.DataFrame(data)
df['newName'] = df['Name'].str.replace('-', '')

Pandas Apply/Lambda returning dataframe and not single row

New to Python and Pandas, so please bear with me here.
I have created a dataframe with 10 rows, with a column called 'Distance' and I want to calculate a new column (TotalCost) with apply and a lambda funtion that I have created. Snippet below of the function
def TotalCost(Distance, m, c):
return m * df.Distance + c
where Distance is the column in the dataframe df, while m and c are just constants that I declare earlier in the main code.
I then try to apply it in the following manner:
df = df.apply(lambda row: TotalCost(row['Distance'], m, c), axis=1)
but when running this, I keep getting a dataframe as an output, instead of a single row.
EDIT: Adding in an example of input and desired output,
Input: df = {Distance: '1','2','3'}
if we assume m and c equal 10,
then the output of applying the function should be
df['TotalCost'] = 20,30,40
I will post the error below this, but what am I missing here? As far as I understand, my syntax is correct. Any assistance would be greatly appreciated :)
The error message:
ValueError: Wrong number of items passed 10, placement implies 1
Your lambda in apply should process only one row. BTW, apply return only calculated columns, not whole dataframe
def TotalCost(Distance,m,c): return m * Distance + c
df['TotalCost'] = df.apply(lambda row: TotalCost(row['Distance'],m,c),axis=1)
Your apply function will basically pass one row at a time to your lambda function and then returns a copy of your data frame with the edited or changed values
Finally it returns a modified copy of dataframe constructed with rows returned by lambda functions, instead of altering the original dataframe.
have a look at this link it should help you gain more insight
https://thispointer.com/pandas-apply-apply-a-function-to-each-row-column-in-dataframe/
import numpy as np
import pandas as pd
def star(x,m,c):
return x*m+c
vals=[(1,2,4),
(3,4,5),
(5,6,6) ]
df=pd.DataFrame(vals,columns=('one','two','three'))
res=df.apply(star,axis=0,args=[2,3])
Initial DataFrame
one two three
0 1 2 4
1 3 4 5
2 5 6 6
After applying the function you should get this stored in res
one two three
0 5 7 11
1 9 11 13
2 13 15 15
This is a more memory-efficient and cleaner way:
df.eval('total_cost = #m * Distance + #c', inplace=True)
Update: I also sometimes stick to assign,
df = df.assign(total_cost=lambda x: TotalCost(x['Distance'], m, c))

Pandas is printing more rows than expected

Currently I work on a database and I try to sort my rows with pandas. I have a column called 'sessionkey' which refers to a session. So each row can be assigned to a session. I tried to seperate the Data into these sessions.
Furthermore there can be duplicated rows. I tried to drop those with the drop_duplicates function from pandas.
df = pd.read_csv((path_of_data+'part-00000-9d3e32a7-87f8-4218-bed1-e30855ce6f0c-c000.csv'), keep_default_na=False, engine='python')
tmp = df['sessionkey'].values #I want to split data into different sessions
tmp = np.unique(tmp)
df.set_index('sessionkey', inplace=True)
watching = df.loc[tmp[10]].drop_duplicates(keep='first') #here I pick one example
print(watching.sort_values(by =['eventTimestamp', 'eventClickSequenz']))
print(watching.info())
I would have thought that this works fine but when I tried to check my results by printing out my splitted dataframe the output looks very odd to me. For example I printed the length of the Dataframe it says 38 rows x 4 columns. But when I print the same Dataframe there are clearly more than 38 rows and there are still duplicates in it.
I already tried to split the data by using unique indices:
comparison = pd.DataFrame()
for index, item in enumerate(df['sessionkey'].values):
if item==tmp: comparison = comparison.append(df.iloc[index])
comparison.drop_duplicates(keep='first', inplace=True)
print(comparison.sort_values( by = ['eventTimestamp']))
But the Problem is still the same.
The output also seems to follow a pattern. Lets say we have 38 entries. Then pandas returns me the first 1-37 entries and then appends the 2-38 entries. So the last one is left out and then the whole list is shifted and printed again.
When I return the numpy values there are just 38 different rows. So is this a problem of the print function from pandas? Is there an error in my code? Does pandas have a problem with not-unique indexes?
EDIT:
Okay I figured out what the problem is. I wanted to look at a long dataframe so I used:
pd.set_option('display.max_rows', -1)
Now we can use some sample data:
data = np.array([[119, 0], [119, 1], [119, 2]])
columns = ['sessionkey', 'event']
df = pd.DataFrame(data, columns = columns)
print(df)
Printed it now looks like this:
sessionkey event
0 119 0
1 119 1
1 119 1
2 119 2
Although I expected it to look like this:
sessionkey event
0 119 0
1 119 1
2 119 2
I thought my Dataframe has the wrong shape but this is not the case.
So the event in the middle gets printed doubled. Is this a bug or the intendent output?
so drop_duplicates() doesn't look at the index when getting rid of rows, instead it looks at the whole row. But it does have a useful subset kwarg which allows you to specify which rows to use.
You can try the following
df = pd.read_csv((path_of_data+'part-00000-9d3e32a7-87f8-4218-bed1-e30855ce6f0c-c000.csv'), keep_default_na=False, engine='python')
print(df.shape)
print(df["session"].nunique()) # number of unique sessions
df_unique = df.drop_duplicates(subset=["session"],keep='first')
# these two numbers should be the same
print(df_unique.shape)
print(df_unique["session"].nunique())
It sounds like you want to drop_duplicates based on the index - by default drop_duplicates drops based on the column values. To do that try
df.loc[~df.index.duplicated()]
This should only select index values which are not duplicated
I used your sample code.
data = np.array([[119, 0], [119, 1], [119, 2]])
columns = ['sessionkey', 'event']
df = pd.DataFrame(data, columns = columns)
print(df)
And I got your expected outcome.
sessionkey event
0 119 0
1 119 1
2 119 2
After I set the max_rows option, as you did:
pd.set_option('display.max_rows', -1)
I got the incorrect outcome.
sessionkey event
0 119 0
1 119 1
1 119 1
2 119 2
The problem might be the "-1" setting. The doc states that "None" will set max rows to unlimited. I am unsure what "-1" will do in a parameter that takes positive integers or None as acceptable values.
Try
pd.set_option('display.max_rows', None)

How to do columnwise operations in pandas?

I have a dataframe that looks something like:
sample parameter1 parameter2 parameter3
A 9 6 3
B 4 5 7
C 1 5 8
and I want to do an operation that does something like:
for sample in dataframe:
df['new parameter'] = df[sample, parameter1]/df[sample, parameter2]
so far I have tried:
df2.loc['ratio'] = df2.loc['reads mapped']/df2.loc['raw total sequences']
but I get the error:
KeyError: 'the label [reads mapped] is not in the [index]'
when I know well that it is in the index, so I figure I am missing some concept somewhere. Any help is much appreciated!
I should add that the parameter values are floats, just in case that is a problem as well!
The method .loc first expects row indices, then column indices, so the following should work, since you wanted to do column-wise operations:
df2['ratio'] = df2.loc[:, 'reads mapped'] / df2.loc[:, 'raw total sequences']
You can find more info in the documentation.

Categories

Resources