Creating new column based on multiple different values - python

I have the code below, which creates a new column from an existing column whose values represent similar things, such as Car and Van, or Ship, Boat, and Submarine. I want each such group classified under a single value in the new column, such as Vehicle or Boat.
Code with Simplified Dataset example:
def f(row):
    if row['Type'] == 'Car':
        val = 'Vehicle'
    elif row['Type'] == 'Van':
        val = 'Vehicle'
    elif row['Type'] == 'Ship':
        val = 'Boat'
    elif row['Type'] == 'Scooter':
        val = 'Bike'
    elif row['Type'] == 'Segway':
        val = 'Bike'
    else:
        val = None
    return val
What is the best method, similar to using wildcards, to avoid typing out each value when there are many values (30 plus) that I want to bucket into the same new value in the new column?
Thanks

One way is to use np.select with isin:
import pandas as pd
import numpy as np

df = pd.DataFrame({"Type": ["Car", "Van", "Ship", "Scooter", "Segway"]})
df["new"] = np.select([df["Type"].isin(["Car", "Van"]),
                       df["Type"].isin(["Scooter", "Segway"])],
                      ["Vehicle", "Bike"], "Boat")
print(df)

      Type      new
0      Car  Vehicle
1      Van  Vehicle
2     Ship     Boat
3  Scooter     Bike
4   Segway     Bike
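With 30-plus values, another option is to keep the buckets in a plain dict and invert it into a value-to-bucket mapping for `Series.map`. A minimal sketch, assuming the same stand-in frame and a hypothetical `"Other"` fallback for unmatched values:

```python
import pandas as pd

df = pd.DataFrame({"Type": ["Car", "Van", "Ship", "Scooter", "Segway"]})

# One entry per bucket, listing its members; inverted so each member
# maps directly to its bucket label.
buckets = {
    "Vehicle": ["Car", "Van"],
    "Boat": ["Ship"],
    "Bike": ["Scooter", "Segway"],
}
mapping = {member: bucket for bucket, members in buckets.items()
           for member in members}

# map() leaves unmatched values as NaN; fillna supplies a default.
df["new"] = df["Type"].map(mapping).fillna("Other")
print(df)
```

This keeps the 30-plus values out of the selection logic entirely: adding a new value is just one more list entry in `buckets`.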

Related

creating a new list isn't working with a for loop

I'm trying to create new lists from the df = coffee DataFrame, using the column Type, which has 4 values (Coffee, Expresso, Herbal Tea, and Tea). I want to split them into two separate lists, one for bean and the other for leaf. The code runs but doesn't print the lists I am looking for, and I also want to apply .lower() to lowercase the first capital letter of each value.
thanks
bean = []
for b in df['Type']:
    # for y in df['Type'] == 'Coffee' and df['Type'] == 'Expresso':
    if b == 'Coffee'.lower():
        bean.append(b)
    elif b == 'Expresso'.lower():
        bean.append(b)
print(bean)

leaf = []
for l in df['Type']:
    if l == 'Tea'.lower():
        # x in df['Type'] == 'Tea' and df['Type'] == 'Herbal Tea':
        leaf.append(l)
    elif l == 'Herbal Tea'.lower():
        leaf.append(l)
print(leaf)
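The lists come out empty because `'Coffee'.lower()` evaluates to `'coffee'`, which never matches the capitalized values stored in the column; the comparison should use the original capitalization and the lowercasing should happen when appending. A minimal sketch, assuming a hypothetical stand-in for the `coffee` DataFrame with the four capitalized Type values:

```python
import pandas as pd

# Hypothetical frame standing in for the asker's `coffee` DataFrame.
df = pd.DataFrame({"Type": ["Coffee", "Expresso", "Herbal Tea", "Tea"]})

# Match against the capitalized values, lowercase only on append.
bean = [t.lower() for t in df["Type"] if t in ("Coffee", "Expresso")]
leaf = [t.lower() for t in df["Type"] if t in ("Tea", "Herbal Tea")]

print(bean)  # ['coffee', 'expresso']
print(leaf)  # ['herbal tea', 'tea']
```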

How to create a new column as a result of comparing two nested consecutive rows in the Panda dataframe?

I need to write code for a pandas DataFrame. The values in the ID column are checked sequentially for equality, and three cases arise. Case 1: if the ID differs from the next row's ID, write "unique" in the Comment column. Case 2: if the ID matches the next row's ID but not the one after, write "ring" in the Comment column. Case 3: if the ID matches multiple following rows, write "multi" in the Comment column. This continues until the rows of the ID column are exhausted.
import pandas as pd

df = pd.read_csv('History-s.csv')
a = len(df['ID'])
c = 0
while a != 0:
    c += 1
    while df['ID'][i] == df['ID'][i + 1]:
        if c == 2:
            if df['Nod 1'][i] == df['Nod 2'][i + 1]:
                df['Comment'][i] = "Ring"
                df['Comment'][i + 1] = "Ring"
            else:
                df['Comment'][i] = "Multi"
                df['Comment'][i + 1] = "Multi"
        elif c > 2:
            df['Comment'][i] = "Multi"
            df['Comment'][i + 1] = "Multi"
        i += 1
    else:
        df['Comment'][i] = "Unique"
    a = a - 1
print(df, '\n')
The sample data and the expected result after coding were attached as images (Data, Result).
From the input DataFrame you provided, my first impression was that, since you check the next line in a while loop, you are strictly considering just the next coming line, for example:
ID  value  comment
1   2      MULTI
1   3      RING
3   4      UNIQUE
But if that is not the case, you can simply use the pandas groupby function.
def func(df):
    if len(df) > 2:
        df['comment'] = 'MULTI'
    elif len(df) == 2:
        df['comment'] = 'RING'
    else:
        df['comment'] = 'UNIQUE'
    return df

df = df.groupby(['ID']).apply(func)
Output:
   ID  value comment
0   1      2    RING
1   1      3    RING
2   3      4  UNIQUE
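An equivalent vectorized sketch avoids `apply` entirely: broadcast each ID's group size back to its rows with groupby-transform, then pick the label with np.select. Assuming a small stand-in frame with the same columns:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"ID": [1, 1, 3], "value": [2, 3, 4]})

# Group size per ID, repeated on every row of that group.
sizes = df.groupby("ID")["ID"].transform("size")

# size > 2 -> MULTI, size == 2 -> RING, otherwise UNIQUE.
df["comment"] = np.select([sizes > 2, sizes == 2],
                          ["MULTI", "RING"], "UNIQUE")
print(df)
```

On large frames this tends to be much faster than a Python-level `func` called once per group.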

Assign a value to a sub group if condition is met

I would like to create a column and assign a number to each team that won or lost in a given 'Rally' (0 for a Loss, 1 for a Win). The last row of each rally shows who won in the 'Points' column.
The image shows how the data is formatted and the desired result is in the 'Outcome' column:
My current code is;
def winLoss(x):
    if 'A' in x['Points']:
        if x.TeamAB == 'A':
            return 1
        else:
            return 0
    elif 'B' in x['Points']:
        if x.TeamAB == 'B':
            return 1
        else:
            return 0

df['Outcome'] = df.groupby('Rally').apply(winLoss).any()
Grab the winner of each rally by grouping and taking the last row of Points for each group, then use a MultiIndex to filter with loc and assign the Outcome:
winners = pd.MultiIndex.from_frame(
    df.groupby(['Rally'])['Points']
      .last().str.slice(-1).reset_index()
)
df.set_index(['Rally', 'TeamAB'], inplace=True)
df['Outcome'] = 0
df.loc[df.index.isin(winners), 'Outcome'] = 1
df.reset_index(inplace=True)
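A variant without the MultiIndex: assuming, as the loc-based answer does, that the last character of each rally's final Points entry names the winning team, you can broadcast that winner to every row of the rally with transform and compare it to TeamAB. A sketch over hypothetical data in the question's layout:

```python
import pandas as pd

# Hypothetical data: one row per team per rally; the last row of each
# rally carries the winner's letter at the end of Points.
df = pd.DataFrame({
    "Rally":  [1, 1, 2, 2],
    "TeamAB": ["A", "B", "A", "B"],
    "Points": ["", "Point A", "", "Point B"],
})

# Last Points value per rally, repeated on every row, then its final
# character (the winning team's letter).
winner = df.groupby("Rally")["Points"].transform("last").str[-1]

# 1 where the row's team is the winner, 0 otherwise.
df["Outcome"] = (df["TeamAB"] == winner).astype(int)
print(df)
```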

Need assistance in analyzing data (mean, median, mode, etc) from csv file

I am having difficulty finding the mean, median, and mode, and counting occurrences of a value, within a csv file.
This section of the file is a column of letters 'M' or 'F'
This specific excerpt of code displays a problem I am facing:
I am not sure why the counting variables are not being incremented.
Any assistance would be greatly appreciated
import csv

citations2 = open('Non Traffic Citations.csv')
data2 = csv.reader(citations2)
gender = []
for row in data2:
    gender.append(row[2])
del gender[0]

male_count = 0
female_count = 0
for item in gender:
    # print(item) - shows that the list has values within it
    if 'M' == item:
        male_count = + 1
    if 'F' == item:
        female_count = + 1
print(male_count)
print(female_count)
If you are trying to increment the gender counts, the syntax in your loop is incorrect: `male_count = + 1` just assigns positive 1 each time rather than adding 1. Use the `+=` operator instead.
for item in gender:
    if 'F' == item:
        female_count += 1
    elif 'M' == item:
        male_count += 1
print(male_count)
print(female_count)
You can use pandas:
import pandas as pd

df = pd.read_csv('Non Traffic Citations.csv')
df.describe()
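For the occurrence counting specifically, the stdlib can also do the tallying in one pass with `collections.Counter`. A sketch over hypothetical rows standing in for the CSV (gender in column index 2, header in the first row):

```python
from collections import Counter

# Hypothetical rows standing in for 'Non Traffic Citations.csv'.
rows = [
    ["id", "name", "gender"],
    ["1", "a", "M"],
    ["2", "b", "F"],
    ["3", "c", "F"],
]

# Skip the header, pull column 2, and count every distinct value.
genders = [row[2] for row in rows[1:]]
counts = Counter(genders)
print(counts["M"], counts["F"])  # 1 2
```

In pandas the same tally is `df['gender'].value_counts()`, and the mean/median/mode of numeric columns come from `df.mean()`, `df.median()`, and `df.mode()`.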

Optimizing runtime of comparing 2 Pandas datasets

I am working with the 2009/2010 dataset of White House visitors, a CSV with the headers described here:
https://obamawhitehouse.archives.gov/files/disclosures/visitors/WhiteHouse-WAVES-Key-1209.txt
I want to extract the names of all visitors who visited in both 2009 and 2010.
I have this function to do this, but it is far too slow. Is there a conceptually faster way to do this?
def task3():
    culled_data = data[["NAMELAST", "NAMEFIRST", "TOA", "TOD"]]
    data9 = culled_data[culled_data["TOA"].str.contains("2009", na=False)]
    data10 = culled_data[culled_data["TOA"].str.contains("2010", na=False)]
    unique_names = pandas.DataFrame(
        {'count': data.groupby(["NAMELAST", "NAMEFIRST"]).size()}
    ).reset_index()
    unqiue_names = unique_names[unique_names["count"] > 1]
    count = 0
    for index, row in unique_names.iterrows():
        if (data9[data9.NAMELAST == row["NAMELAST"]].shape[0] > 0
                and data10[data10.NAMELAST == row["NAMELAST"]].shape[0] > 0
                and data9[data9.NAMEFIRST == row["NAMEFIRST"]].shape[0] > 0
                and data10[data10.NAMEFIRST == row["NAMEFIRST"]].shape[0] > 0):
            count += 1
        else:
            unique_names = unique_names[unique_names.NAMELAST != row["NAMELAST"]]
    return count, unique_names
One way is to use python sets:
fullnames9 = set([' '.join(r) for r in data9[['NAMEFIRST', 'NAMELAST']].values])
fullnames10 = set([' '.join(r) for r in data10[['NAMEFIRST', 'NAMELAST']].values])
names_who_visited_in_both_years = fullnames9 & fullnames10 # set intersection
Note that if two different people have the same first and last name, this code will falsely conclude that they visited in both years. Also, this only gets full names. Getting the DataFrame indexes of people who visited in both years would be more useful, and is left as an exercise ;)
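Staying in pandas, an inner merge on the two name columns gives the same intersection and keeps the result as a DataFrame. A sketch over hypothetical stand-ins for the two per-year subsets:

```python
import pandas as pd

# Hypothetical per-year subsets with the dataset's name columns.
data9 = pd.DataFrame({"NAMEFIRST": ["ANN", "BOB"],
                      "NAMELAST": ["LEE", "RAY"]})
data10 = pd.DataFrame({"NAMEFIRST": ["ANN", "CYD"],
                       "NAMELAST": ["LEE", "FOX"]})

# Inner merge keeps only (first, last) pairs present in both years;
# drop_duplicates avoids row multiplication for repeat visitors.
both = data9.drop_duplicates().merge(data10.drop_duplicates(),
                                     on=["NAMEFIRST", "NAMELAST"])
print(both)
```

Like the set-based version, this matches on full name only, so two different people sharing a name would still be conflated.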
