I have a data-frame (df)
which looks like:
first_name surname location identifier
0 Fred Smith London FredSmith
1 Jane Jones Bristol JaneJones
I am trying to query a particular field and return it to a variable value using:
value = df.loc[df['identifier'] == query_identifier ,'location']
so where query_identifier is equal to FredSmith I get returned to value:
0 London
How can I remove the 0 so I just have:
London
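Select the matching values as an array and take the first element:
value = df.loc[df['identifier'] == "FredSmith", 'location'].values[0]
Here .values returns the underlying NumPy array without the index, and [0] picks its first element.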
If there are multiple values for the same identifier, drop the [0] to get all of them as an array and loop over it:
value = df.loc[df['identifier'] == "FredSmith", 'location'].values
for df_values in value:
    print(df_values)
This is just an optional extension of the first statement.
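One caveat with .values[0] is that it raises an IndexError when nothing matches. A minimal sketch of a guarded lookup, assuming the same df as in the question (the None fallback is my own choice, not something the question asks for):

```python
import pandas as pd

df = pd.DataFrame({
    "first_name": ["Fred", "Jane"],
    "surname": ["Smith", "Jones"],
    "location": ["London", "Bristol"],
    "identifier": ["FredSmith", "JaneJones"],
})

query_identifier = "FredSmith"

# Boolean mask on the rows, column label on the columns: returns a Series.
matches = df.loc[df["identifier"] == query_identifier, "location"]

# Take the first match as a plain scalar, or None when nothing matched.
value = matches.iloc[0] if not matches.empty else None
print(value)  # London
```

If exactly one match is expected, matches.item() is another option; it raises a ValueError when the Series does not hold exactly one element.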
Related
I have a file which contains data of users in rows which is stored in some cryptic format. I want to decode that and create a dataframe
sample row -- AN04N010105SANDY0205SMITH030802031989
Note-
AN04N01 is standard 7 letter string at the start to denote that this row is valid.
Here 0105SANDY refers to 1st column(name) having length 5
01 -> 1st column ( which is name column )
05 -> length of name ( Sandy )
Similarly,0205SMITH refers to
02 -> 2nd column ( which is surname column )
05 -> length of surname ( Smith )
Similarly,030802031989 refers to
03 -> 3rd column ( DOB )
08 -> length of DOB
I want a data frame like --
| name | surname | DOB |
|Sandy | SMITH | 02031989 |
I was trying to use regex, but I don't know how to put the result into a data frame after identifying the names. Also, how do you find the number of characters to read?
Rather than using regex for groups that might be out of order and varying length, it might be simpler to consume the string in a serial manner.
With the following, you track an index i through the string and consume two characters for code, then length and finally the variable amount of characters given by length. Then, you store the values in a dict, append the dicts to a list and turn that list of dicts into a dataframe. Bonus, it works with the elements in any order.
import pandas as pd

test_strings = [
    "AN04N010105ALICE0205ADAMS030802031989",
    "AN04N010103BOB0205SMITH0306210876",
    "AN04N0103060101010104FRED0204OWEN",
    "XXXXXXX0105SANDY0205SMITH030802031989",
]
code_map = {"01": "name", "02": "surname", "03": "DOB"}

def parse(s):
    i = 7  # skip the 7-character validity prefix
    d = {}
    while i < len(s):
        code, i = s[i:i+2], i+2             # read code
        length, i = int(s[i:i+2]), i+2      # read length
        val, i = s[i:i+length], i + length  # read value
        d[code_map[code]] = val             # store value
    return d

ds = []
for s in test_strings:
    if not s.startswith("AN04N01"):
        continue  # skip rows without the valid prefix
    ds.append(parse(s))
df = pd.DataFrame(ds)
df contains:
name surname DOB
0 ALICE ADAMS 02031989
1 BOB SMITH 210876
2 FRED OWEN 010101
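The parser above assumes every valid row is well formed. If the file may contain truncated rows or unknown column codes, a more defensive variant could look like the sketch below. The name parse_checked and the decision to return None for bad rows are my own assumptions, not part of the original answer:

```python
CODE_MAP = {"01": "name", "02": "surname", "03": "DOB"}
PREFIX = "AN04N01"

def parse_checked(s):
    """Return a dict of decoded fields, or None if the row is malformed."""
    if not s.startswith(PREFIX):
        return None
    i, fields = len(PREFIX), {}
    while i < len(s):
        code, length = s[i:i + 2], s[i + 2:i + 4]
        # Reject unknown column codes and non-numeric (or missing) lengths.
        if code not in CODE_MAP or not length.isdigit():
            return None
        start, end = i + 4, i + 4 + int(length)
        # Reject values that would run past the end of the string.
        if end > len(s):
            return None
        fields[CODE_MAP[code]] = s[start:end]
        i = end
    return fields

print(parse_checked("AN04N010105SANDY0205SMITH030802031989"))
```

Rows that come back as None can be filtered out before building the dataframe.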
Try:
def fn(x):
    rv, x = [], x[7:]  # drop the 7-character prefix
    while x:
        _, n, x = x[:2], x[2:4], x[4:]       # column code, length, rest
        value, x = x[: int(n)], x[int(n) :]  # value of that length, rest
        rv.append(value)
    return rv
m = df["row"].str.startswith("AN04N01")
df[["NAME", "SURNAME", "DOB"]] = df.loc[m, "row"].apply(fn).apply(pd.Series)
print(df)
Prints:
row NAME SURNAME DOB
0 AN04N010105SANDY0205SMITH030802031989 SANDY SMITH 02031989
1 AN04N010105BANDY0205BMITH030802031989 BANDY BMITH 02031989
2 AN04N010105CANDY0205CMITH030802031989 CANDY CMITH 02031989
3 XXXXXXX0105DANDY0205DMITH030802031989 NaN NaN NaN
Dataframe used:
row
0 AN04N010105SANDY0205SMITH030802031989
1 AN04N010105BANDY0205BMITH030802031989
2 AN04N010105CANDY0205CMITH030802031989
3 XXXXXXX0105DANDY0205DMITH030802031989
Here is a regex pattern for this format:
(\w{2}\d{2}\w{1}\d{2})(\d{4}\w{5}\d+\w{5})(\d+)
Or use this pattern:
(\D{5})\d+(\D+)\d+(02\d+)
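To see what the second pattern captures, here is a small check against the sample row from the question. Note that it hard-codes a five-letter first field and a DOB starting with 02, so it only fits rows shaped like that sample; the second test string is taken from the other answer to show the limitation:

```python
import re

pattern = re.compile(r"(\D{5})\d+(\D+)\d+(02\d+)")

row = "AN04N010105SANDY0205SMITH030802031989"
match = pattern.search(row)
print(match.groups())  # ('SANDY', 'SMITH', '02031989')

# A three-letter name and a DOB that does not start with 02: no match.
other = "AN04N010103BOB0205SMITH0306210876"
print(pattern.search(other))  # None
```

This is why reading the length fields, as in the other answers, is more robust than a fixed regex.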
Please help me with a Python script to filter the CSV below.
Here is an example of the CSV dump, on which I have already done some initial filtering.
Last_name  Gender  Name      Phone      city
Ford       Male    Tom       123        NY
Rich       Male    Robert    21312      LA
Ford       Female  Jessica   123123     NY
Ford       Male    John      3412       NY
Rich       Other   Linda     12312      LA
Ford       Other   James     4321       NY
Smith      Male    David     123123     TX
Rich       Female  Mary      98689      LA
Rich       Female  Jennifer  86860      LA
Ford       Male    Richard   12123      NY
Smith      Other   Daniel    897097     TX
Ford       Other   Lisa      123123123  NY
import re
def gather_info(L_name):
    dump_filename = "~/Documents/name_report.csv"
    LN = []
    with open(dump_filename, "r") as FH:
        for var in FH.readlines():
            if L_name in var:
                final = var.split(",")
                print(final[1], final[2], final[3])
    return LN

if __name__ == "__main__":
    L_name = input("Enter the Last name: ")
    la_name = gather_info(L_name)
With this, I am able to filter by last name. For example, if I choose L_name as Ford, my output is:
Gender  Name     Phone
Male    Tom      123
Female  Jessica  123123
Male    John     3412
Other   James    4321
Male    Richard  12123
Other   Lisa     22412
I need help extending the script so that it selects each gender in turn, together with the values for that gender, and performs other functions on them. For example, it first selects the gender Male [Tom, John] and performs the other functions, then selects the next gender Female [Jessica] and performs the same functions, and then selects the gender Other [James, Lisa] and performs the same functions.
I would recommend using the pandas module, which allows for easy filtering and grouping of data:
import pandas as pd

if __name__ == '__main__':
    data = pd.read_csv('name_reports.csv')
    L_name = input("Enter the last name: ")
    by_last_name = data[data['Last_name'] == L_name]
    groups = by_last_name.groupby(['Gender'])
    for group_name, group_data in groups:
        print(group_name)
        print(group_data)
Breaking this down into its pieces, the first part is
data = pd.read_csv('name_reports.csv')
This reads the data from the CSV file into a dataframe.
Second, we have
by_last_name = data[data['Last_name'] == L_name]
This filters the dataframe down to the rows whose Last_name equals L_name.
Next, we group the data:
groups = by_last_name.groupby(['Gender'])
This groups the filtered rows by gender.
Then we iterate over the groups. Each iteration yields a tuple of the group name and the dataframe for that group.
for group_name, group_data in groups:
    print(group_name)
    print(group_data)
This loop just prints the data. To access individual fields, you can use the iterrows function:
for index, row in group_data.iterrows():
    print(row['city'])
    print(row['Phone'])
    print(row['Name'])
You can then pass those values to whatever function you want. I would recommend reading the pandas documentation, since depending on the function you plan to use there may be a better way to do it with the library: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
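If the per-gender step only needs the list of names, one possible shortcut is to let groupby collect them directly. This is a sketch on a few inline rows from the question's table; the process function is a hypothetical stand-in for the "other functions", and io.StringIO is used only so the example runs without a file:

```python
import io
import pandas as pd

csv_text = """Last_name,Gender,Name,Phone,city
Ford,Male,Tom,123,NY
Rich,Male,Robert,21312,LA
Ford,Female,Jessica,123123,NY
Ford,Male,John,3412,NY
Ford,Other,James,4321,NY
"""

def process(gender, names):
    # Hypothetical placeholder for whatever should happen per gender.
    return f"{gender}: {', '.join(names)}"

data = pd.read_csv(io.StringIO(csv_text))
by_last_name = data[data["Last_name"] == "Ford"]

# One list of names per gender; group keys come back sorted.
names_by_gender = by_last_name.groupby("Gender")["Name"].apply(list)

results = [process(gender, names) for gender, names in names_by_gender.items()]
print(results)  # ['Female: Jessica', 'Male: Tom, John', 'Other: James']
```

Grouping on the plain string "Gender" keeps each group key a plain string such as 'Male'.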
Since you cannot use the pandas module, a method using only the csv module would look like this:
import csv

def has_last_name(row, last_name):
    return row['Last_name'] == last_name

def has_gender(row, current_gender):
    return row['Gender'] == current_gender

if __name__ == '__main__':
    data = None
    genders = ['Male', 'Female', 'Other']
    with open('name_reports.csv') as csvfile:
        data = list(csv.DictReader(csvfile, delimiter=','))
    L_name = input('Enter the Last name: ')
    get_by_last_name = lambda row: has_last_name(row, L_name)
    filtered_by_last_name = list(filter(get_by_last_name, data))
    for gender in genders:
        get_by_gender = lambda row: has_gender(row, gender)
        filtered_by_gender = list(filter(get_by_gender, filtered_by_last_name))
        print(filtered_by_gender)
The important part is the built-in filter function. It takes a function that accepts one item and returns a bool, plus an iterable, and returns a lazy iterator over the items for which the function returns True; wrapping it in list() materializes the results. The other important part is csv.DictReader, which yields each row of the CSV file as a dictionary keyed by the header names, so you can access fields by name instead of by index.
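As an alternative to filtering once per gender, the rows can be grouped in a single pass with collections.defaultdict. This is only a sketch; the function name group_names_by_gender and the inline CSV text are assumptions made so the example is self-contained:

```python
import csv
import io
from collections import defaultdict

csv_text = """Last_name,Gender,Name,Phone,city
Ford,Male,Tom,123,NY
Rich,Male,Robert,21312,LA
Ford,Female,Jessica,123123,NY
Ford,Male,John,3412,NY
Ford,Other,James,4321,NY
"""

def group_names_by_gender(csvfile, last_name):
    """Map each gender to the list of names that have the given last name."""
    groups = defaultdict(list)
    for row in csv.DictReader(csvfile):
        if row["Last_name"] == last_name:
            groups[row["Gender"]].append(row["Name"])
    return dict(groups)

groups = group_names_by_gender(io.StringIO(csv_text), "Ford")
for gender, names in groups.items():
    print(gender, names)
```

With a real file, pass the object returned by open('name_reports.csv', newline='') instead of the StringIO.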
OK, my frustration has hit epic proportions. I am new to pandas and trying to use it on an Excel database I have. However, I cannot figure out what should be a VERY simple action.
I have a dataframe as such:
ID UID NAME STATE
1 123 Bob NY
1 123 Bob PA
2 124 Jim NY
2 124 Jim PA
3 125 Sue NY
All I need is to be able to locate and print the ID of a record by the unique combination of UID and STATE.
The closest I can come up with is this:
temp_db = fd_db.loc[(fd_db['UID'] == "1") & (fd_db['STATE'] == "NY")]
But this still grabs all rows for the UID, not ONLY the one with the matching STATE.
Then, when I try to print the result
temp_db.ID.values
prints this:
['1', '1']
I need just the data and not the structure.
My end result needs to be just to print to the screen : 1
Any help is much appreciated.
I think it's because your UID condition is wrong: the UID column holds integers and you are comparing it with a string.
For example, when I run this:
df.loc[(df['UID'] == "123") & (df['STATE'] == 'NY')]
The output is:
Empty DataFrame
Columns: [ID, UID, NAME, STATE]
Index: []
But when I treat UID as an integer:
df.loc[(df['UID'] == 123) & (df['STATE'] == 'NY')]
It outputs:
ID UID NAME STATE
0 1 123 Bob NY
I hope that will help you !
fd_db.loc[(fd_db['UID'] == 123) & (fd_db['STATE'] == 'NY')]['ID'].iloc[0]
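A slightly more guarded version of the same lookup, as a sketch: it selects the ID column inside a single .loc call and returns None when nothing matches, instead of raising on .iloc[0]. The helper name find_id is my own, and the frame below assumes ID and UID are stored as integers, as the previous answer suggests:

```python
import pandas as pd

fd_db = pd.DataFrame({
    "ID": [1, 1, 2, 2, 3],
    "UID": [123, 123, 124, 124, 125],
    "NAME": ["Bob", "Bob", "Jim", "Jim", "Sue"],
    "STATE": ["NY", "PA", "NY", "PA", "NY"],
})

def find_id(frame, uid, state):
    """Return the first ID matching uid and state, or None if there is none."""
    mask = (frame["UID"] == uid) & (frame["STATE"] == state)
    ids = frame.loc[mask, "ID"]
    return None if ids.empty else ids.iloc[0]

print(fd_db["UID"].dtype)         # check the column type before comparing
print(find_id(fd_db, 123, "NY"))  # 1
```

Checking the dtype first shows whether UID must be compared with 123 or with "123".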
I'm trying to add a new column to my data frame that specifies whether the user in the "created_by" column is part of a team (which is held in a separate list).
Original Data frame(df)
URL text created_by
id
1 www.pandora.com Pandora John
2 m.jcpenney.com other Steve
3 www.youtube.com You-tube Rob
4 www.facebook.com Facebook David
Team_Names = ['John','Steve','Rob','Euan']
I want the final data frame to contain a new column with True or False values depending on whether the "created_by" value is in the "Team_Names" list.
Team_Mask = df['Created by'].isin(Team_Names)
df['In_Team'] = df.[Team_Mask]
I'm getting errors on the last line of code. Any help would be appreciated.
Assign mask to new column:
Team_Names = ['John','Steve','Rob','Euan']
df['In_Team'] = df['created_by'].isin(Team_Names)
print (df)
URL text created_by In_Team
1 www.pandora.com Pandora John True
2 m.jcpenney.com other Steve True
3 www.youtube.com You-tube Rob True
4 www.facebook.com Facebook David False
Or use assign:
df = df.assign(In_Team = df['created_by'].isin(Team_Names))
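For context on the original error, here is a short sketch that rebuilds the frame from the question and contrasts the two uses of the mask: assigning it stores the True/False values as a column, while indexing with it (df[Team_Mask]) selects rows. The variable name team_rows is only for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "URL": ["www.pandora.com", "m.jcpenney.com",
                "www.youtube.com", "www.facebook.com"],
        "text": ["Pandora", "other", "You-tube", "Facebook"],
        "created_by": ["John", "Steve", "Rob", "David"],
    },
    index=pd.Index([1, 2, 3, 4], name="id"),
)
Team_Names = ["John", "Steve", "Rob", "Euan"]

# A boolean Series aligned with df's index.
Team_Mask = df["created_by"].isin(Team_Names)

# Assigning the mask adds the True/False column.
df["In_Team"] = Team_Mask

# Indexing with the mask instead keeps only the rows where it is True.
team_rows = df[Team_Mask]
print(team_rows)
```

Note also that the column is named created_by, so df['Created by'] in the question would raise a KeyError on its own.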
I have two data frames. df1 looks like -
MovieName Actors
lights out Maria Bello
legend Tom Hardy*Emily Browning*Christopher Eccleston*David Thewlis
df2 looks like -
ActorName Gender
Tom male
Emily female
Christopher male
I want to add two columns to df1, 'female_actors' and 'male_actors', which contain the count of female and male actors in that particular movie respectively. Whether an actor is male or female is determined from df2.
Here is what I am doing -
def func(actors, gender):
    actors = [act.split()[0] for act in actors.split('*')]
    n_gender = df2.Gender[df2.Gender==gender][df2.ActorName.isin(actors)].count()
    return n_gender
df1['male_actors'] = df1.Actors.apply(lambda x: func(x, 'male'))
df1['female_actors'] = df1.Actors.apply(lambda x: func(x, 'female'))
This code gives me list index out of range error.
Please note that -
If a particular name isn't present in gender.csv, don't count it in the total.
If there is just one actor in a movie and that actor isn't present in gender.csv, then its count should be zero.
Result should be -
MovieName Actors male_actors female_actors
lights out Maria Bello 0 0
legend Tom Hardy*Emily Browning*Christopher Eccleston*David Thewlis 2 1
Feel free to suggest some other approach.
How about this? It looks up each actor's first name in df2, since df2 only stores first names:
df1['Male'] = df1.Actors.apply(lambda x: len(pd.concat( [df2[(df2.ActorName == name.split()[0]) & (df2.Gender == 'male')] for name in x.split('*')] )))
df1['Female'] = df1.Actors.apply(lambda x: len(pd.concat( [df2[(df2.ActorName == name.split()[0]) & (df2.Gender == 'female')] for name in x.split('*')] )))
Using str methods and join. First, index both frames:
d2 = df2.set_index('ActorName')
d1 = df1.set_index('MovieName')
Method 1: split
d1.join(d1.Actors.str.split('*', expand=True).stack() \
    .str.split(expand=True)[0].map(d2.Gender) \
    .groupby(level='MovieName') \
    .value_counts().unstack()).fillna(0).reset_index()
Method 2: extractall
d1.join(d1.Actors.str.extractall(r'((?P<first>[^*]+)\s+(?P<last>[^*]+))') \
    ['first'].map(d2.Gender).groupby(level='MovieName') \
    .value_counts().unstack()).fillna(0).reset_index()
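If the chained expressions are hard to follow, a plainer variant is to turn df2 into a dictionary and count per row. This is a sketch that rebuilds the two frames from the question; the helper name count_gender is my own, and it matches on first names only, as the question's attempt does:

```python
import pandas as pd

df1 = pd.DataFrame({
    "MovieName": ["lights out", "legend"],
    "Actors": [
        "Maria Bello",
        "Tom Hardy*Emily Browning*Christopher Eccleston*David Thewlis",
    ],
})
df2 = pd.DataFrame({
    "ActorName": ["Tom", "Emily", "Christopher"],
    "Gender": ["male", "female", "male"],
})

# First name -> gender lookup; names missing from df2 simply are not in it.
gender_of = dict(zip(df2["ActorName"], df2["Gender"]))

def count_gender(actors, gender):
    """Count actors in a '*'-separated string whose first name maps to gender."""
    first_names = [full.split()[0] for full in actors.split("*") if full.strip()]
    return sum(gender_of.get(name) == gender for name in first_names)

df1["male_actors"] = df1["Actors"].apply(lambda a: count_gender(a, "male"))
df1["female_actors"] = df1["Actors"].apply(lambda a: count_gender(a, "female"))
print(df1[["MovieName", "male_actors", "female_actors"]])
```

The `if full.strip()` guard skips empty entries, so a stray or trailing '*' in Actors does not cause an index error.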