I have a data frame with the following columns:
import pandas as pd
d = {'find_no': [1, 2, 3], 'zip_code': [32351, 19207, 8723]}
df = pd.DataFrame(data=d)
When a zip_code value has 5 digits, I want to return True; when it does not, I want to return the corresponding find_no.
The sample output would have these results in a new column added to the dataframe, one value per row.
You could try np.where:
import numpy as np
df['result'] = np.where(df['zip_code'].astype(str).str.len() == 5, True, df['find_no'])
The only downside of this approach is that NumPy will convert your True values to 1s, which could be confusing. One way to keep the values you want is to do
import numpy as np
df['result'] = np.where(df['zip_code'].astype(str).str.len() == 5, 'True', df['find_no'].astype(str))
The downside here is that you lose the meaning of those values by casting them to strings. It all depends on what you're hoping to accomplish.
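If keeping real booleans alongside the original integers matters, here is a minimal sketch using pandas' mask on an object-dtype column (an alternative to np.where, not from the answer above) that avoids both downsides:
import pandas as pd
d = {'find_no': [1, 2, 3], 'zip_code': [32351, 19207, 8723]}
df = pd.DataFrame(data=d)
# cast to object so the column can hold bools and ints together,
# then overwrite the rows where the zip code has exactly 5 digits
mask = df['zip_code'].astype(str).str.len() == 5
df['result'] = df['find_no'].astype(object).mask(mask, True)
The resulting column has dtype object, so numeric operations on it later will need care.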
Here is my issue. I have a dataframe, let's say:
import pandas as pd
df = pd.DataFrame({'A': [5, 6, 3, 4], 'B': [1, 2, 3, 5]})
I also have a list of indexes:
idx = [1, 2]
I would like to store the corresponding value from each column in a list, meaning I want the value at index 1 from column A and the value at index 2 from column B.
I'm sure there is a simple answer to my issue, but I'm mixing everything up with iloc and cannot find an optimized method for my case (I have 1000 rows and 4 columns).
IIUC, you can try:
You can extract the complete rows and then pick the diagonal elements:
import numpy as np
result = np.diag(df.values[idx])
Alternative:
Convert the dataframe to a NumPy array, then use NumPy indexing to access the required values:
result = df.values[idx, range(len(df.columns))]
OUTPUT:
array([6, 3])
Use:
list(df.values[idx, range(len(idx))])
Output:
[6, 3]
Here is a different way:
df.stack().loc[list(zip(idx,df.columns[:(len(idx))]))].to_numpy()
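One caveat worth noting: the answers above treat idx as positional. If your idx actually holds row labels rather than positions (an assumption about your intent), a small sketch with get_indexer converts labels to positions first; here labels and positions happen to coincide, so the result is the same:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [5, 6, 3, 4], 'B': [1, 2, 3, 5]})
idx = [1, 2]
pos = df.index.get_indexer(idx)                  # label -> integer position
result = df.to_numpy()[pos, np.arange(len(pos))] # array([6, 3])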
I would like to apply the loop below where for each index value the unique values of a column called SERIAL_NUMBER will be returned. Essentially I want to confirm that for each index there is a unique serial number.
index_values = df.index.levels
for i in index_values:
    x = df.loc[[i]]
    x["SERIAL_NUMBER"].unique()
The problem, however, is that my dataset has a multi-index and, as you can see below, the index values are stored in a FrozenList. I am only interested in the index values that contain a long number; the word "vehicle", which also appears as an index level, can be removed since it is repeated all over the dataset.
How can I extract these values into a list so I can use them in the loop?
index_values
>>
FrozenList([['0557bf98-c3e0-4955-a23f-2394635ab531', '074705a3-a96a-418c-9bfe-14c37f5c4e6f', '0f47e260-0fa2-40ba-a417-7c00ea74248c', '17342ca2-6246-4150-8080-96d6125cf2b5', '26c6c0d1-0134-4b3a-a149-61dd93afab3b', '7600be43-5d0a-49b3-a1ee-fd107db5822f', 'a07f2b0c-447c-4143-a361-d7ddbffdcc77', 'b929801c-2f32-4a95-bfc4-48a05b48ee01', 'cc912023-0113-42cd-8fe7-4df4005127c2', 'e424bd02-e188-462e-a1a6-2f4ed8fe0a2d'], ['vehicle']])
Without an example it's hard to judge, but I think you need
df.index.get_level_values(0).unique() # add .tolist() if you want a list
import pandas as pd
df = pd.DataFrame({'A': [5]*5, 'B': [6]*5})
df = df.set_index('A', append=True)  # build a two-level MultiIndex
df.index.get_level_values(0).unique()
Int64Index([0, 1, 2, 3, 4], dtype='int64')
df.index.get_level_values(1).unique()
Int64Index([5], dtype='int64', name='A')
To drop duplicates from an index level, use the .duplicated() method:
df[~df.index.get_level_values(1).duplicated(keep='first')]
     B
  A
0 5  6
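If the end goal is just to confirm that each id has exactly one serial number, a groupby sketch avoids the loop entirely; it assumes the UUIDs sit in level 0 and the column is literally named SERIAL_NUMBER:
per_id = df.groupby(level=0)['SERIAL_NUMBER'].nunique()
per_id[per_id > 1]  # an empty result means every id maps to a unique serial number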
I have input/output data where the index and header contain numbers that represent different types of industries. I want to create new columns and rows that sum the columns and rows belonging to a certain industry group. To give an example (please refer to the sample I made by hand below), I would want to create a new row/column with index/header US_industry_135/CAN_industry_135, which would sum the rows/columns whose industry number is 1, 3, or 5. The example below is a small set I created manually, but I wanted to know if there is a way to put this condition into the summation so that I sum rows/columns whose index/header contains specific numbers. I could extract the numbers from the header/index into a separate row/column, but I was wondering if there is a way to check the index/headers directly without creating new columns. Thank you in advance for your help!
import pandas as pd
data = {'US1': [3, 2, 1, 4, 3, 2, 1, 4, 2, 3, 7, 9],
        'US2': [8, 4, 9, 2, 1, 3, 4, 2, 5, 6, 18, 11],
        'US3': [2, 4, 2, 2, 3, 2, 4, 2, 3, 2, 7, 6],
        'US4': [7, 4, 8, 2, 2, 3, 2, 4, 6, 8, 17, 15],
        'US5': [2, 4, 3, 2, 2, 4, 1, 3, 2, 4, 7, 11],
        'CAN1': [3, 2, 1, 4, 6, 2, 3, 1, 4, 2, 10, 5],
        'CAN2': [8, 4, 9, 2, 5, 7, 3, 5, 7, 1, 22, 13],
        'CAN3': [2, 4, 2, 2, 4, 5, 2, 3, 3, 2, 8, 10],
        'CAN4': [7, 4, 8, 2, 2, 3, 1, 3, 2, 4, 17, 10],
        'CAN5': [2, 4, 3, 2, 6, 7, 5, 4, 0, 9, 11, 20],
        'US_IND_135': [7, 10, 6, 8, 8, 8, 6, 9, 7, 9, 21, 26],
        'CAN_IND_135': [7, 10, 6, 8, 16, 14, 10, 8, 7, 13, 29, 35]}
df = pd.DataFrame(data, index=['US1', 'US2', 'US3', 'US4', 'US5',
                               'CAN1', 'CAN2', 'CAN3', 'CAN4', 'CAN5',
                               'US_IND_135', 'CAN_IND_135'])
df
Let's define a list of the industry numbers of interest:
idx = [1, 3, 5]
Do the summation over the specified columns:
df[['US' + str(i) for i in idx]].sum(axis=1)
Alternatively, if you want to join the summation column to the dataframe, assign the result to a variable:
s1 = df[['US' + str(i) for i in idx]].sum(axis=1)
s1.name = 'NEW_US_IND_' + ''.join(str(i) for i in idx)
And add the new column:
df.join(s1)
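To check the industry numbers directly from the labels without creating helper columns, here is a sketch using string matching on the axis labels; the US[135] pattern and the _check names are illustrative assumptions, and Index.str.fullmatch needs pandas >= 1.1:
# columns whose label is exactly 'US' plus one of the digits 1, 3, 5
us_cols = df.columns[df.columns.str.fullmatch(r'US[135]')]
df['US_IND_135_check'] = df[us_cols].sum(axis=1)
# the row-wise counterpart: pick matching index labels and sum down the columns
us_rows = df.index[df.index.str.fullmatch(r'US[135]')]
row_sum = df.loc[us_rows].sum(axis=0)
row_sum.name = 'US_IND_135_row_check'
df = pd.concat([df, row_sum.to_frame().T])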
I have a dataframe with three columns like this:
Subject   {1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, ...}
datetime  {6/4/16 3:04:30, 6/5/16 6:02:15, ...}
markers   {}
It is sorted by subject and then by datetime, and the markers column is empty.
I also have a dictionary that maps subject numbers to lists of datetimes. These datetimes are not exactly the same as the ones already in the dataframe. For comparison purposes, I want to add each of these datetimes to the markers column of the row whose subject and date match; so a dictionary entry with key (subject) 1 and a list of values like {6/4/16 5:00:15, 6/5/16 6:10:30} would have its first value added to row 1, because the subject and date match, and its second value added to row 2 for the same reason.
I thought of looping through each dictionary key and all its corresponding datetimes, but finding the corresponding dataframe row for each datetime within the nested loops would be very inefficient. It would be something like this:
for subject in df.iloc[:, 0]:
    # go to subject in dictionary and loop through datetimes in corresponding list,
    # adding the matching datetime to the current row
    # O(n^2) time!
Is there a more efficient way to do this?
Thanks!
Try this; you will have to customize the answer somewhat to meet your specific needs, but the logic is basically the same:
import pandas as pd
df = pd.DataFrame({'colA': [100, 200], 'colB': ['NaN', 'NaN']})
dict1 = {100: ['rat', 'cat', 'hat'], 200: ['hen', 'men', 'den']}
# map each key to its list of values, then expand each list into its own columns
df = pd.concat([df['colA'], df['colA'].map(dict1).apply(pd.Series)], axis=1)
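If the goal from the question is to line each dictionary datetime up with the nearest dataframe row for the same subject, a pd.merge_asof sketch may fit more directly. The column names Subject/datetime come from the question; marker_dict and direction='nearest' are assumptions:
import pandas as pd
df = pd.DataFrame({
    'Subject': [1, 1, 2],
    'datetime': pd.to_datetime(['6/4/16 3:04:30', '6/5/16 6:02:15', '6/4/16 4:00:00']),
})
marker_dict = {1: ['6/4/16 5:00:15', '6/5/16 6:10:30'], 2: ['6/4/16 4:30:00']}
# flatten the dict into a long frame of (Subject, marker) pairs
markers = pd.DataFrame(
    [(subj, pd.Timestamp(t)) for subj, times in marker_dict.items() for t in times],
    columns=['Subject', 'marker'],
)
# nearest-match each row's datetime against markers of the same subject;
# both frames must be sorted on the time keys for merge_asof
result = pd.merge_asof(
    df.sort_values('datetime'),
    markers.sort_values('marker'),
    left_on='datetime', right_on='marker',
    by='Subject', direction='nearest',
)
This runs in roughly O(n log n) for the sorts instead of the nested O(n^2) loop.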