Update cell values in dataframe - python

I am parsing data row-wise. How can I update a data frame cell value in a loop (read a value, parse it, write it to another column)?
I have tried the code below:
data = pd.read_csv("MyNames.csv")
data["title"] = ""
i = 0
for row in data.iterrows():
    name = (HumanName(data.iat[i,1]))
    print(name)
    data.ix['title',i] = name["title"]
    i = i + 1
data.to_csv('out.csv')
I would expect the following
name = "Mr John Smith"
| Title
Mr John Smith | Mr
All help appreciated!
Edit: I realise that I might not need to iterate. If I could call the function for all rows in a column and dump the results into another column that would be easier - like a SQL update statement. Thanks

Assuming that HumanName is a callable that takes a string and returns a dict-like object with a 'title' key, something like this should work. I can't test this code from here, but you get the gist:
data['title'] = data['name'].apply(lambda name: HumanName(name)['title'])
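EDIT: I assumed the names live in a column labelled 'name'. Your code reads data.iat[i,1], which is the second column by position, so use whatever label that column actually has (or check whether the index should be 0 instead of 1).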

You can try .apply
def name_parsing(name):
    """Parse the name any way you want; here, return its title."""
    return HumanName(name)['title']

# With .apply, the function is applied to every item in the column.
# The result is a Series, which is assigned here to the 'title' column.
data['title'] = data['name'].apply(name_parsing)
Another option, as we're discussing below, is to persist an instance of HumanName in the dataframe, so if you need other information from it later you don't need to instantiate and parse the name again (string manipulation can be very slow on big dataframes).
If so, part of the solution would be to create a new column. After that you would get the 'title' value from it:
# this line creates a column of HumanName instances
data['HumanName'] = data['name'].apply(lambda x: HumanName(x))
# this line gets the 'title' from each HumanName object and assigns it to a 'title' column
data['title'] = data['HumanName'].apply(lambda x: x['title'])
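The apply pattern above can be sketched in a form that runs without the nameparser package. Here parse_title is a hypothetical stand-in for HumanName(name)['title'], and the sample names and the 'name' column label are assumptions made for illustration:

```python
import pandas as pd

# Hypothetical stand-in for HumanName(name)['title']: treats a leading
# "Mr", "Mrs", "Ms" or "Dr" token as the title, otherwise returns "".
KNOWN_TITLES = {"Mr", "Mrs", "Ms", "Dr"}

def parse_title(name):
    first = name.split()[0].rstrip(".") if name.strip() else ""
    return first if first in KNOWN_TITLES else ""

# Assumed sample data; the real file would come from pd.read_csv
data = pd.DataFrame({"name": ["Mr John Smith", "Jane Doe", "Dr. Ann Lee"]})

# One call per value in the column; the returned Series becomes the new column
data["title"] = data["name"].apply(parse_title)
print(data)
```

Swapping parse_title for a function that wraps HumanName should leave the rest unchanged.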

Related

different output from pandas iterrows if .csv column headed "name" than other text

I'm a beginner using pandas to look at a csv. I'm using .iterrows() to see if a given record matches today's date, so far so good. However, when calling (row.name) for a .csv with a column headed 'name', I get different output than if I rename the column and edit the (row."column-heading") to match. I can call it anything but "name" and get the right output. I tried (row.notthename), (row.fish) and (row.thisisodd), which all worked fine, before coming here.
If the first column in birthdays.csv is "name" and I call print(row.name), it returns "2". If the first column is "notthename" and I call print(row.notthename), it returns the relevant name. What gives? I don't understand why arbitrarily renaming the column and the attribute access yields different output.
eg case A: column named "name"
birthdays.csv:
name,email,year,month,day
a test name,test#email.com,1961,12,21
testerito,blagh#sdgdg.com,1985,02,23
testeroonie,sihgfdb#sidkghsb.com,2022,01,17
data = pandas.read_csv("birthdays.csv")
for (index, row) in data.iterrows():
    if (dt.datetime.now()).month == row.month and (dt.datetime.now()).day == row.day:
        print(row.name)
outputs "2"
whereas case B: column named "notthename"
data = pandas.read_csv("birthdays.csv")
for (index, row) in data.iterrows():
    if (dt.datetime.now()).month == row.month and (dt.datetime.now()).day == row.day:
        print(row.notthename)
outputs "testeroonie"
i'm missing something.... is there some special handling of "name" going on?
thanks for helping me learn!
This happens because DataFrame.iterrows yields each row as a Series object, and a Series has a built-in attribute called name. For a row, name holds the row's index label, which is why you see 2 (the label of the matching third row). This is why attribute access for column names, although convenient, can be dangerous. The dictionary notation doesn't have this issue:
print(row['name'])
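A small sketch of the difference, using made-up rows modelled on the question's CSV (the data here is an assumption, not the asker's file):

```python
import pandas as pd

data = pd.DataFrame({"name": ["a test name", "testerito", "testeroonie"],
                     "month": [12, 2, 1]})

for index, row in data.iterrows():
    # row.name is the Series' own name attribute: the row's index label
    assert row.name == index
    # row["name"] is the value stored in the column called "name"
    print(index, row.name, row["name"])

# The third row has index label 2, which is what the question saw printed
last_row = data.iloc[2]
```

The same clash applies to any column whose label matches an existing Series attribute or method, so bracket access is the safer habit.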

How to change the value of a cell in a row in a CSV file using iterrows()?

I'm trying to write a script to change the cells of a column named 'ticker', when the cell is a specific name. For example, for all the cells that are 'BRK.B' in the ticker column, I want to change them to 'BRK-B' instead.
Code:
company_input = input('What company do you want to change the value of?')
company_input.upper()
print(f'Company chosen is: {company_input}')
company_change = input('What would you like to change the value to?')
company_change.upper()
# For SF1 file
for ticker, row in sf1.iterrows():
    row['ticker'] = company_input
    row['ticker'] = company_change
This isn't changing anything in the file and I'm not sure why. I'd appreciate some help.
In the documentation of iterrows() it says
You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
Instead, to modify all the values in a column, you should use the apply() method and define a function to handle changing BRK.B to BRK-B.
That would look something like this
def change_cell(x):  # x is the original cell value
    return x.replace(".", "-")  # the returned value becomes the new cell value
df['ticker'] = df['ticker'].apply(change_cell)
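If only cells that exactly match one ticker should change, a boolean mask with .loc is an alternative worth sketching. The sample frame and the hard-coded inputs below are assumptions standing in for the sf1 file and the input() calls. Unlike replacing every "." in the column, this leaves other tickers containing a dot untouched:

```python
import pandas as pd

# Assumed sample data standing in for the SF1 file
sf1 = pd.DataFrame({"ticker": ["BRK.B", "AAPL", "BF.B", "BRK.B"]})

# str.upper() returns a new string, so the result has to be assigned
company_input = "brk.b".upper()
company_change = "brk-b".upper()

# The mask selects matching rows; .loc writes to the DataFrame itself,
# not to a temporary copy of a row
sf1.loc[sf1["ticker"] == company_input, "ticker"] = company_change
print(sf1["ticker"].tolist())
```

The change lives only in memory until the frame is written back out, for example with to_csv.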

How to feed array of user_ids to flickr.people.getInfo()?

I have been working on extracting Flickr users' locations (not latitude and longitude, but the person's country) from their user_ids. I have made a dataframe (Here's the dataframe) consisting of photo id, owner and a few other columns. My attempt was to feed each owner to the flickr.people.getInfo() query by iterating over the owner column in the dataframe. Here is my attempt:
for index, row in df.iterrows():
    A=np.array(df["owner"])
    for i in range(len(A)):
        B=flickr.people.getInfo(user_id=A[i])
Unfortunately, it returns only 1 result. After careful examination I found that it belongs to the last user in the dataframe. My dataframe has 250 observations. I don't know how to extract the others.
Any help is appreciated.
It seems like you forgot to store the results while iterating over the dataframe. I haven't used the API, but I think this snippet should do it.
result_dict = {}
# Series.items() replaces iteritems(), which was removed in pandas 2.0
for idx, owner in df['owner'].items():
    result_dict[owner] = flickr.people.getInfo(user_id=owner)
The results are stored in a dictionary where the user id is the key.
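To see why storing each response matters, here is a sketch that runs without network access. fake_get_info is a hypothetical stand-in for flickr.people.getInfo, and the shape of its return value is an assumption, not the real API response:

```python
import pandas as pd

# Hypothetical stand-in for flickr.people.getInfo; the real response layout may differ
def fake_get_info(user_id):
    return {"person": {"id": user_id, "location": f"Country of {user_id}"}}

# Assumed sample data with an 'owner' column like the question's frame
df = pd.DataFrame({"owner": ["u1", "u2", "u3"]})

result_dict = {}
for owner in df["owner"]:
    # Each response is kept under its own key instead of overwriting one variable
    result_dict[owner] = fake_get_info(owner)

print(len(result_dict))
```

In the original loop, B was reassigned on every pass, so only the last user's response survived.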
EDIT:
Since the response is JSON, you can use the pandas read_json function to parse the result.
Example:
import json
from io import StringIO

result_list = []
for idx, owner in df['owner'].items():
    info = flickr.people.getInfo(user_id=owner)
    # typ='series' returns a Series; you may have to adjust the orient parameter.
    # Options for a Series are 'split', 'records', 'index'; the default is 'index'.
    result_list.append(pd.read_json(StringIO(json.dumps(info)), typ='series', orient='index'))
Note: I switched the dictionary to a list, since it is more convenient.
Afterwards you can concatenate the resulting pandas Series objects like this:
df = pd.concat(result_list, axis=1).transpose()
I added the transpose() since you probably want the ID as the index.
Afterwards you should be able to sort by the column 'location'.
Hope that helps.
The canonical way to do this is with apply. It is more concise than an explicit loop and keeps each result aligned with its row, although the API is still called once per row.
import pandas as pd
import numpy as np
np.random.seed(0)

# A function to simulate the call to the API
def get_user_info(user_id):
    return np.random.randint(user_id, user_id + 10)

# Some test data
df = pd.DataFrame({'id': [0,1,2], 'name': ['Pierre', 'Paul', 'Jacques']})
# Here the call is made for each ID
df['info'] = df['id'].apply(get_user_info)
#    id     name  info
# 0   0   Pierre     5
# 1   1     Paul     1
# 2   2  Jacques     5
Note, another way to write the same thing is
df['info'] = df['id'].map(lambda x: get_user_info(x))
Before calling the API method, create the client with the following lines:
from flickrapi import FlickrAPI
flickr = FlickrAPI(FLICKR_KEY, FLICKR_SECRET, format='parsed-json')

Python Pandas get one text value

I have a dataframe of information. One column is a rank. I just want to find the row with a rank of 2 and get the value of another column (called 'Name'). I can find the row and get the name, but it's not a plain text item that I can add to other text. It's an object.
How do I just get the name as text?
Code:
print "The name of the 2nd best is: " + groupDF.loc[(DF['Rank']==2),'Name']
This gives me the id of the row and the Name. I just want the Name
This is what I get:
4 The name of the 2nd best is: Hawthorne
Name: CleanName, dtype: object
I just can't figure out what to search on to get the answer. I get lots of other stuff but not this answer.
Thanks in advance.
In a little bit more detail:
I understand you have a data frame of the kind:
names = ["Almond","Hawthorn","Peach"]
groupDF = pd.DataFrame({'Rank':[1,2,3],'Name':names})
groupDF.loc[(groupDF['Rank']==2),'Name'] gives you a Series object. If the rank is unique then either of the following two possibilities works
groupDF.loc[(groupDF['Rank']==2),'Name'].item()
or
groupDF.loc[(groupDF['Rank']==2),'Name'].iloc[0]
result:
'Hawthorn'
If the rank is not unique, the second one still works and gives you the first hit, that is, the first element of the Series object created by the command.
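A short sketch of the non-unique case, with made-up data in which two rows share rank 2. It shows .iloc[0] returning the first hit while .item() raises a ValueError because the selection holds more than one element:

```python
import pandas as pd

# Assumed sample data: rank 2 appears twice
groupDF = pd.DataFrame({"Rank": [1, 2, 2], "Name": ["Almond", "Hawthorn", "Peach"]})
matches = groupDF.loc[groupDF["Rank"] == 2, "Name"]

# .iloc[0] takes the first matching value as a plain Python string
first_hit = matches.iloc[0]

# .item() only works when the Series holds exactly one element
try:
    matches.item()
    item_failed = False
except ValueError:
    item_failed = True

# A plain string concatenates with other text without the index or dtype lines
message = "The name of the 2nd best is: " + first_hit
print(message)
```

If no row matches at all, .iloc[0] raises an IndexError, so checking matches.empty first may be worthwhile.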
You need to call the item() method of the resulting Series object.

Pandas For Loop, If String Is Present In ColumnA Then ColumnB Value = X

I'm pulling Json data from the Binance REST API, after formatting I'm left with the following...
I have a dataframe called Assets with 3 columns [Asset,Amount,Location],
['Asset'] holds ticker names for crypto assets e.g.(ETH,LTC,BNB).
However when all or part of that asset has been moved to 'Binance Earn' the strings are returned like this e.g.(LDETH,LDLTC,LDBNB).
['Amount'] can be ignored for now.
['Location'] is initially empty.
I'm trying to set the value of ['Location'] to 'Earn' if the string in ['Asset'] includes 'LD'.
This is how far I got, but I can't remember how to apply the change to only the current item, it's been ages since I've used Pandas or for loops.
And I'm only able to apply it to the entire column rather than the row iteration.
for Row in Assets['Asset']:
    if Row.find('LD') == 0:
        print('Earn')
        Assets['Location'] = 'Earn' # <----How to apply this to current row only
    else:
        print('???')
        Assets['Location'] = '???' # <----How to apply this to current row only
The print statements work correctly, but currently the whole column gets populated with the same value (whichever was last) as you might expect.
So (LDETH,HOT,LDBTC) returns ('Earn','Earn','Earn') rather than the desired ('Earn','???','Earn')
Any help would be appreciated...
np.where() fits here. If the Asset starts with LD, then return Earn, else return ???:
Assets['Location'] = np.where(Assets['Asset'].str.startswith('LD'), 'Earn', '???')
You could pass a lambda to apply on the 'Asset' column to check whether 'LD' appears in each value:
df['Location'] = df['Asset'].apply(lambda x: 'Earn' if 'LD' in x else None)
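The two checks above are not equivalent: startswith matches only a leading 'LD', while 'LD' in x matches it anywhere in the string. A sketch on made-up data (the 'GOLD' ticker is an assumption added purely to show the difference):

```python
import numpy as np
import pandas as pd

# Assumed sample data; 'GOLD' contains 'LD' but does not start with it
Assets = pd.DataFrame({"Asset": ["LDETH", "HOT", "LDBTC", "GOLD"],
                       "Amount": [1.0, 2.0, 3.0, 4.0]})

# Vectorised prefix test: one value per row, no explicit loop
starts_with_ld = Assets["Asset"].str.startswith("LD")
Assets["Location"] = np.where(starts_with_ld, "Earn", "???")

# Substring test, for comparison: also flags 'GOLD'
contains_ld = Assets["Asset"].apply(lambda x: "LD" in x)

print(Assets)
```

Since the question's own loop tests Row.find('LD') == 0, the prefix version is the closer match to the stated intent.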
One possible solution:
def get_loc(row):
    asset = row['Asset']
    if asset.find('LD') == 0:
        print('Earn')
        return 'Earn'
    print('???')
    return '???'
Assets['Location'] = Assets.apply(get_loc, axis=1)
Note, you should almost never iterate over a pandas dataframe or series.
