I am trying to split a huge dataframe into smaller dataframes based on the values in a specific column.
What I basically did was create a for loop and assign each dataframe to a dictionary.
However, when I call the items from the dictionary, all values are NaN except for the cell_id values that I used for splitting.
Why would this happen?
I would also appreciate suggestions for more practical ways to do this.
df_sliced_dict = {}
for cell in ex_df['cell_id'].unique():
    df_sliced_dict[cell] = ex_df[ex_df.loc[:, ['cell_id']] == cell]
Replace
df_sliced_dict[cell] = ex_df[ex_df.loc[:, ['cell_id']] == cell]
with
df_sliced_dict[cell] = ex_df[ex_df['cell_id'] == cell]
inside the for-loop and it will work as expected.
The problem is that ex_df.loc[:, ['cell_id']] (or ex_df[['cell_id']]) is a DataFrame, not a Series, and you want a Series to construct your boolean mask.
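As for a more practical way: a minimal sketch using groupby, which does the split in a single pass (assuming ex_df and the cell_id column from the question):

# groupby iteration yields (key, sub-DataFrame) pairs, one per unique cell_id
df_sliced_dict = {cell: sub_df for cell, sub_df in ex_df.groupby('cell_id')}

# each value is a full DataFrame holding only the rows for that cell_id,
# e.g. df_sliced_dict[some_cell_id].head()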
I have a DataFrame that's read in from a CSV. The data has various problems. The one I'm concerned about for this post is that some data is not in the column it should be. For example, '900' is in the zipcode column, or 'RQ' is in the language column when it should be in the nationality column. In some cases, these "misinsertions" are just anomalies and can be converted to NaN. In other cases they indicate that the values have shifted one column to the right or the left, such that the whole row has misinserted data. I want to remove these shifted rows from the DataFrame and try to fix them separately.

My proposed solution has been to keep track of the number of bad values in each row as I clean each column. Here is an example with the zipcode column:
import re
from numpy import nan

def is_zipcode(value: str, regx):
    # keep valid zipcodes, replace everything else with NaN
    if regx.match(value):
        return value
    else:
        return nan

regx = re.compile("^[0-9]{5}(?:-[0-9]{4})?$")
df['ZIPCODE'] = df['ZIPCODE'].map(lambda x: is_zipcode(x, regx), na_action='ignore')
I'm doing something like this on every column in the dataframe, depending on the data in that column; e.g., for the 'Nationality' column I'll look up the values in a JSON file of nationality codes.
What I haven't been able to achieve is to keep count of the bad values in a row. I tried something like this:
def is_zipcode(value: str, regx):
    # return a flag instead of the value: 0 = good, 1 = bad
    if regx.match(value):
        return 0
    else:
        return 1

regx = re.compile("^[0-9]{5}(?:-[0-9]{4})?$")
df['badValues'] = df['ZIPCODE'].map(lambda x: is_zipcode(x, regx), na_action='ignore')
df['badValues'] = df['badValues'] + df['Nationalities'].map(is_nationality, na_action='ignore')  # where is_nationality() similarly returns 1 if it is a bad value
And this can work to keep track of the bad values. What I'd like to do is somehow combine the process of cleaning the data and getting the bad values. I'd love to do something like this:
def is_zipcode(value: str, regx):
    if regx.match(value):
        return value
    else:
        # add 1 to the value of df['badValues'] at the corresponding index
        return nan
The problem is that I don't think it's possible to access the index of the value being passed to the map function. I looked at these two questions (one, two) but I didn't see a solution to my issue.
I guess this would do what you want ...
is_zipcode_mask = df['ZIPCODE'].str.match(regex_for_zipcode)
print(len(df[is_zipcode_mask]))
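Building on that mask, here is a hedged sketch of how the cleaning and the bad-value count could be combined in one vectorized pass, with no map at all (assuming the df and the zipcode regex from the question):

import numpy as np

# True where the zipcode matches; NaN where the cell was already missing
matches = df['ZIPCODE'].str.match("^[0-9]{5}(?:-[0-9]{4})?$")

# treat already-missing cells as "not bad", mirroring na_action='ignore'
bad = ~matches.fillna(True).astype(bool)

df['badValues'] = bad.astype(int)   # or accumulate with += across columns
df.loc[bad, 'ZIPCODE'] = np.nan     # clean and count in the same pass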
I know that one can compare a whole column of a dataframe and make a list of all rows that contain a certain value with:
values = parsedData[parsedData['column'] == valueToCompare]
But is there a way to make a list of all rows by comparing two columns against values, like:
values = parsedData[parsedData['column01'] == valueToCompare01 and parsedData['column02'] == valueToCompare02]
Thank you!
It is completely possible, but and does not work for masking a dataframe (it raises a ValueError about the truth value of a Series being ambiguous); use & instead. Note that each comparison needs its own parentheses, since & binds more tightly than ==:
values = parsedData[(parsedData['column01'] == valueToCompare01) & (parsedData['column02'] == valueToCompare02)]
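For illustration, a minimal runnable sketch with made-up data (the column names and values here are hypothetical):

import pandas as pd

parsedData = pd.DataFrame({'column01': ['a', 'b', 'a'],
                           'column02': [1, 2, 1]})

# parentheses are required: & binds tighter than ==
values = parsedData[(parsedData['column01'] == 'a') & (parsedData['column02'] == 1)]
print(values)  # rows 0 and 2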
for lat, lng, value in zip(location_saopaulo_df['geolocation_lat'],
                           location_saopaulo_df['geolocation_lng'],
                           location_saopaulo_df['municipality']):
    coordinates = (lat, lng)
    items = rg.search(coordinates)
    value = items[0]['admin2']
I am trying to iterate over 3 columns of the dataframe, get the latitude and longitude values from the first two columns, use them to get the address, and then add the city name to the last column I mentioned, which is an empty column consisting of NaN values.
However, my for loop is not stopping. I would be grateful if you could tell me why it doesn't stop, or suggest a better way to do what I'm trying to do.
Thank you in advance.
If rg is reverse_geocoder, there is a better way to query several coordinates at once than looping. Try this:
res = rg.search(tuple(zip(location_saopaulo_df['geolocation_lat'],
location_saopaulo_df['geolocation_lng'])))
Then extract just the admin2 value, for example by constructing a dataframe:
df_ = pd.DataFrame(res)
and see what it looks like. You may be able to perform a merge or index alignment to put it back into your original dataframe location_saopaulo_df.
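Putting the round trip together, a hedged sketch (assuming rg is reverse_geocoder, whose batch search preserves input order and returns records with an admin2 field):

import pandas as pd
import reverse_geocoder as rg

coords = tuple(zip(location_saopaulo_df['geolocation_lat'],
                   location_saopaulo_df['geolocation_lng']))
res = rg.search(coords)       # one batch query instead of a per-row loop

res_df = pd.DataFrame(res)    # columns include lat, lon, name, admin1, admin2, cc
# positional alignment: row i of res_df corresponds to row i of the input,
# so .values sidesteps any index mismatch with the original frame
location_saopaulo_df['municipality'] = res_df['admin2'].values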
I am trying to use a for loop to assign a column one of two values based on the value of another column. I created a list of the reasons that should be assigned one value, using else to assign the other. However, my code is only assigning the else value to the column. I also tried elif and it did not work. Here is my code:
# create list of aggressive reasons
aggressive = ['AGGRESSIVE - ANIMAL', 'AGGRESSIVE - PEOPLE', 'BITES']

# create new column assigning 'Aggressive' or 'Not Aggressive'
for reason in top_dogs_reason['Reason']:
    if reason in aggressive:
        top_dogs_reason['Aggression'] = 'Aggressive'
    else:
        top_dogs_reason['Aggression'] = 'Not Aggressive'
My new column top_dogs_reason['Aggression'] only has the value of Not Aggressive. Can someone please tell me why?
You should be using loc to assign things like this, which isolates the part of a dataframe you want to update. The first line grabs the values in the "Aggression" column where the "Reason" column has a value contained in the list `aggressive`. The second line finds the places where it's not:
top_dogs_reason.loc[top_dogs_reason['Reason'].isin(aggressive), 'Aggression'] = 'Aggressive'
top_dogs_reason.loc[~top_dogs_reason['Reason'].isin(aggressive), 'Aggression'] = 'Not Aggressive'
Or in one line, as Roganjosh explained, using np.where, which is much like an Excel if/else statement. Here we're saying: if the reason is in aggressive, give us "Aggressive", otherwise "Not Aggressive", and assign that to the "Aggression" column:
top_dogs_reason['Aggression'] = np.where(top_dogs_reason['Reason'].isin(aggressive), "Aggressive", "Not Aggressive")
Or anky_91's answer, which uses .map to map values. This is an effective way to feed a dictionary to a pandas series: for each value in the series it looks up the key in the dictionary and returns the corresponding value:
top_dogs_reason['Aggression'] = top_dogs_reason['Reason'].isin(aggressive).map({True: 'Aggressive', False: 'Not Aggressive'})
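For a quick check, a minimal runnable sketch of the np.where approach with made-up rows (the data here is hypothetical):

import numpy as np
import pandas as pd

top_dogs_reason = pd.DataFrame({'Reason': ['BITES', 'STRAY', 'AGGRESSIVE - PEOPLE']})
aggressive = ['AGGRESSIVE - ANIMAL', 'AGGRESSIVE - PEOPLE', 'BITES']

top_dogs_reason['Aggression'] = np.where(
    top_dogs_reason['Reason'].isin(aggressive), 'Aggressive', 'Not Aggressive')
print(top_dogs_reason['Aggression'].tolist())
# ['Aggressive', 'Not Aggressive', 'Aggressive']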
My problem: I have a pandas dataframe, and one column in particular that I need to process contains values separated by ":". In some cases, one of those values between the ":" separators can be of the form value=value, and it can appear at the start, middle, or end of the string. The length of the string can differ from cell to cell as we iterate through the rows, e.g.:
clickstream['events']
1:3:5:7=23
23=1:5:1:5:3
9:0:8:6=5:65:3:44:56
1:3:5:4
I have a file which contains the lookup values of these numbers, e.g.:
event_no,description,event
1,xxxxxx,login
3,ffffff,logout
5,eeeeee,button_click
7,tttttt,interaction
23,ferfef,click1
output required:
clickstream['events']
login:logout:button_click:interaction=23
click1=1:button_click:login:button_click:logout
Is there a Pythonic way of looking up these individual values and replacing them with the event column corresponding to the event_no row, as shown in the output? I have hundreds of events and am trying to work out a smart way of doing this. pd.merge would have done the trick if I had a single value, but I'm struggling to work out how to work across the values and ignore the "=value" part of the string.
Edit to ignore missing keys in the dict:
import pandas as pd

EventsDict = {1: '1:3:5:7', 2: '23:45:1:5:3', 39: '0:8:46:65:3:44:56', 4: '1:3:5:4'}
clickstream = pd.Series(EventsDict)

# Keep this as a dictionary
EventsLookup = {1: 'login', 3: 'logout', 5: 'button_click', 7: 'interaction'}

def EventLookup(x):
    # dict.get falls back to 'Missing' for event numbers not in the lookup
    list1 = [EventsLookup.get(int(item), 'Missing') for item in x.split(':')]
    return ":".join(list1)

clickstream.apply(EventLookup)
Since you are using a full DF and not just a series, use:
clickstream['events'].apply(EventLookup)
Output:
1 login:logout:button_click:interaction
2 Missing:Missing:login:button_click:logout
4 login:logout:button_click:Missing
39 Missing:Missing:Missing:Missing:logout:Missing...
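The demo above drops the "=value" parts from the original question. Here is a hedged sketch of one way to extend the lookup to keep them, assuming the event number always sits to the left of the "=" (which matches the expected output shown in the question):

# extend the demo lookup with the remaining code from the question's file
EventsLookup[23] = 'click1'

def EventLookupKeepValues(x):
    out = []
    for item in x.split(':'):
        if '=' in item:
            # assumption: event number on the left of '='; the raw value on
            # the right is kept as-is, per the question's expected output
            key, value = item.split('=', 1)
            out.append(f"{EventsLookup.get(int(key), 'Missing')}={value}")
        else:
            out.append(EventsLookup.get(int(item), 'Missing'))
    return ':'.join(out)

# e.g. '1:3:5:7=23' -> 'login:logout:button_click:interaction=23'
clickstream['events'] = clickstream['events'].apply(EventLookupKeepValues)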