I have a dictionary of states (for example IA: Idaho). I have loaded the dictionary into a DataFrame, byState_df.
Then I import a CSV of state deaths, which I want to add to byState_df as I read the lines:
byState_df = pd.DataFrame(states.items())
byState_df['Deaths'] = 0
df['Deaths'] = df['Deaths'].convert_objects(convert_numeric=True)
print byState_df
for index, row in df.iterrows():
    if row['Area'] in states:
        byState_df[(byState_df[0] == row['Area'])]['Deaths'] = row['Deaths']
print byState_df
but the Deaths column in byState_df is still 0 afterwards:
0 1 Deaths
0 WA Washington 0
1 WI Wisconsin 0
2 WV West Virginia 0
3 FL Florida 0
4 WY Wyoming 0
5 NH New Hampshire 0
6 NJ New Jersey 0
7 NM New Mexico 0
8 NA National 0
I tested row['Deaths'] while it iterates and it produces the correct values; it just seems to be setting the byState_df value incorrectly.
Try the following code, which uses .loc instead of chained [][] indexing. The chained form byState_df[...]['Deaths'] = ... assigns to a temporary copy, so the original DataFrame is never updated:
byState_df = pd.DataFrame(states.items())
byState_df['Deaths'] = 0
df['Deaths'] = df['Deaths'].convert_objects(convert_numeric=True)
print byState_df
for index, row in df.iterrows():
    if row['Area'] in states:
        byState_df.loc[byState_df[0] == row['Area'], 'Deaths'] = row['Deaths']
print byState_df
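Alternatively, the whole loop can be replaced with a vectorized map lookup. A minimal sketch (the small inline frames below are made-up stand-ins for the real dictionary and CSV):

```python
import pandas as pd

# Made-up stand-ins for the real states dict and the CSV data
states = {'WA': 'Washington', 'WI': 'Wisconsin'}
df = pd.DataFrame({'Area': ['WA', 'WI', 'XX'], 'Deaths': ['10', '5', '99']})
df['Deaths'] = pd.to_numeric(df['Deaths'], errors='coerce')

byState_df = pd.DataFrame(list(states.items()), columns=['Code', 'Name'])

# Look up each state code in the CSV's Area column; unmatched codes stay 0
deaths_by_area = df.set_index('Area')['Deaths']
byState_df['Deaths'] = byState_df['Code'].map(deaths_by_area).fillna(0)
print(byState_df)
```

This also sidesteps the chained-assignment problem entirely, since there is only a single assignment to byState_df['Deaths'].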
Given the following data ...
city country
0 London UK
1 Paris FR
2 Paris US
3 London UK
... I'd like a count of each city-country pair
city country n
0 London UK 2
1 Paris FR 1
2 Paris US 1
The following works but feels like a hack:
df = pd.DataFrame([('London', 'UK'), ('Paris', 'FR'), ('Paris', 'US'), ('London', 'UK')], columns=['city', 'country'])
df.assign(**{'n': 1}).groupby(['city', 'country']).count().reset_index()
I'm assigning an additional column n of all 1s, grouping on city&country, and then count()ing occurrences of this new 'all 1s' column. It works, but adding a column just to count it feels wrong.
Is there a cleaner solution?
There is a better way: use value_counts.
df.value_counts().reset_index(name='n')
city country n
0 London UK 2
1 Paris FR 1
2 Paris US 1
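If you're on a pandas version older than 1.1 (where DataFrame.value_counts was added), groupby with size does the same thing without a helper column; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame([('London', 'UK'), ('Paris', 'FR'), ('Paris', 'US'), ('London', 'UK')],
                  columns=['city', 'country'])

# size() counts the rows in each group directly, no dummy column needed
counts = df.groupby(['city', 'country']).size().reset_index(name='n')
print(counts)
```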
I've been searching around for a while now, but I can't seem to find the answer to this small problem.
I have this code to replace values:
df = {'Name': ['al', 'el', 'naila', 'dori', 'jlo'],
      'living': ['Alvando', 'Georgia GG', 'Newyork NY', 'Indiana IN', 'Florida FL'],
      'sample2': ['malang', 'kaltim', 'ambon', 'jepara', 'sragen'],
      'output': ['KOTA', 'KAB', 'WILAYAH', 'KAB', 'DAERAH']}
df = pd.DataFrame(df)
df = df.replace(['KOTA', 'WILAYAH', 'DAERAH'], 0)
df = df.replace('KAB', 1)
But I am expecting this output, with simpler code that doesn't repeat replace:
Name living sample2 output
0 al Alvando malang 0
1 el Georgia GG kaltim 1
2 naila Newyork NY ambon 0
3 dori Indiana IN jepara 1
4 jlo Florida FL sragen 0
I've tried using np.where, but it doesn't give the desired result: all results display 0, even where the value should be 1.
df['output'] = pd.DataFrame({'output':np.where(df == "KAB", 1, 0).reshape(-1, )})
This code should work for you:
df = df.replace(['KOTA', 'WILAYAH', 'DAERAH'], 0).replace('KAB', 1)
Output:
>>> df
Name living sample2 output
0 al Alvando malang 0
1 el Georgia GG kaltim 1
2 naila Newyork NY ambon 0
3 dori Indiana IN jepara 1
4 jlo Florida FL sragen 0
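Another way to avoid repeating replace is to pass a single mapping dict; a minimal sketch using just the output column:

```python
import pandas as pd

df = pd.DataFrame({'output': ['KOTA', 'KAB', 'WILAYAH', 'KAB', 'DAERAH']})

# One dict maps every label to its numeric code in a single call
df['output'] = df['output'].replace({'KOTA': 0, 'WILAYAH': 0, 'DAERAH': 0, 'KAB': 1})
print(df['output'].tolist())  # [0, 1, 0, 1, 0]
```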
I have a loop that goes through a list of countries, and I want to find the total for each one and store it as a new object named after the country. So at the end I want variables like australia_total, brazil_total, etc., each holding the total for that country.
The last line in the code is the one that doesn't work:
for c in countrylist:
    country = data[data.country == c]
    pivotcountry = country.pivot_table(['new_cases', 'new_deaths'], index='week', aggfunc='sum', margins=False)
    pivotcountry.reset_index(level=0, inplace=True)
    {c}_total = pivotcountry.new_cases.iloc[-1]
If you want the totals to appear in a new DataFrame, then you'll need to add a new row each time around the loop:
import pandas

data = pandas.read_csv(r"WHO-COVID-19-global-data.csv", skipinitialspace=True)
countrylist = data.Country.unique()
out_columns = ['Country', 'Total_cases', 'Total_deaths']
out_data = pandas.DataFrame(columns=out_columns)
for c in countrylist:
    country = data[data.Country == c]
    case_total = country['New_cases'].sum()
    deaths_total = country['New_deaths'].sum()
    out_data = pandas.concat([out_data, pandas.DataFrame([[c, case_total, deaths_total]], columns=out_columns)])
print(out_data)
Gives:
Country Total_cases Total_deaths
0 Afghanistan 34451 1010
0 Albania 3571 95
0 Algeria 19195 1011
0 Andorra 855 52
0 Angola 506 26
.. ... ... ...
0 Venezuela (Bolivarian Republic of) 9178 85
0 Viet Nam 372 0
0 Yemen 1469 418
0 Zambia 1895 42
0 Zimbabwe 985 18
[216 rows x 3 columns]
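For what it's worth, the loop can also be collapsed into a single groupby, which is the idiomatic way to get per-country totals; a minimal sketch with made-up rows standing in for the WHO CSV:

```python
import pandas as pd

# Made-up rows standing in for the WHO CSV
data = pd.DataFrame({'Country': ['Australia', 'Australia', 'Germany'],
                     'New_cases': [10, 5, 7],
                     'New_deaths': [1, 0, 2]})

# Sum both columns per country in one pass, then rename to the output schema
out_data = (data.groupby('Country', as_index=False)[['New_cases', 'New_deaths']].sum()
            .rename(columns={'New_cases': 'Total_cases', 'New_deaths': 'Total_deaths'}))
print(out_data)
```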
But if you genuinely want a Python variable for each country, then you can use the locals() dictionary:
import pandas

data = pandas.read_csv(r"WHO-COVID-19-global-data.csv", skipinitialspace=True)
countrylist = data.Country.unique()
for c in countrylist:
    country = data[data.Country == c]
    case_total = country['New_cases'].sum()
    locals()['%s_total' % (c.lower())] = case_total
print("Australia: %s" % australia_total)
print("Germany: %s" % germany_total)
However, I wouldn't recommend using locals() to set new variables: it's not obvious to someone reading the code that new variable names are being created without the traditional foo = 123 assignment statement. (Note also that writing to locals() is only guaranteed to work at module scope, where it is the same dictionary as globals().)
I have a webscraped Twitter DataFrame that includes user location. The location variable looks like this:
2 Crockett, Houston County, Texas, 75835, USA
3 NYC, New York, USA
4 Warszawa, mazowieckie, RP
5 Texas, USA
6 Virginia Beach, Virginia, 23451, USA
7 Louisville, Jefferson County, Kentucky, USA
I would like to construct state dummies for all USA states by using a loop.
I have managed to extract users from the USA using
location_usa = location_df['location'].str.contains('usa', case = False)
However, the code would be too bulky if I wrote this for every single state. I have a list of the states as strings.
Also, I am unable to use
pd.Series.str.get_dummies()
as there are different locations within the same state and each entry is a whole sentence.
I would like the output to look something like this:
Alabama Alaska Arizona
1 0 0 1
2 0 1 0
3 1 0 0
4 0 0 0
Or the same with Boolean values.
Use .str.extract to get a Series of the states, and then use pd.get_dummies on that Series. You will need to define a list of all 50 states:
import pandas as pd
states = ['Texas', 'New York', 'Kentucky', 'Virginia']
pd.get_dummies(df.col1.str.extract('(' + '|'.join(x+',' for x in states)+ ')')[0].str.strip(','))
Kentucky New York Texas Virginia
0 0 0 1 0
1 0 1 0 0
2 0 0 0 0
3 0 0 1 0
4 0 0 0 1
5 1 0 0 0
Note I matched on states followed by a ',', as that seems to be the pattern; it also lets you avoid false matches like 'Virginia' within 'Virginia Beach', or more problematic things like 'Washington County, Minnesota'.
If you expect multiple states to match on a single line, then this becomes .extractall, summing within the 0th index level:
pd.get_dummies(df.col1.str.extractall('(' + '|'.join(x+',' for x in states)+ ')')[0].str.strip(',')).groupby(level=0).sum().clip(upper=1)
Edit:
Perhaps there are better ways, but this can be a bit safer, as suggested by @BradSolomon, allowing matches on 'State,( optional 5-digit Zip,) USA':
states = ['Texas', 'New York', 'Kentucky', 'Virginia', 'California', 'Pennsylvania']
pat = '(' + '|'.join(x + r',?(\s\d{5},)?\sUSA' for x in states) + ')'
s = df.col1.str.extract(pat)[0].str.split(',').str[0]
Output of s:
0 Texas
1 New York
2 NaN
3 Texas
4 Virginia
5 Kentucky
6 Pennsylvania
Name: 0, dtype: object
from Input
col1
0 Crockett, Houston County, Texas, 75835, USA
1 NYC, New York, USA
2 Warszawa, mazowieckie, RP
3 Texas, USA
4 Virginia Beach, Virginia, 23451, USA
5 Louisville, Jefferson County, Kentucky, USA
6 California, Pennsylvania, USA
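Putting the extract-then-get_dummies idea together on the sample input, here is a minimal self-contained sketch (the column name col1 and the short state list are assumptions):

```python
import pandas as pd

states = ['Texas', 'New York', 'Kentucky', 'Virginia']
df = pd.DataFrame({'col1': ['Crockett, Houston County, Texas, 75835, USA',
                            'NYC, New York, USA',
                            'Warszawa, mazowieckie, RP',
                            'Texas, USA']})

# Match 'State,' and strip the trailing comma before one-hot encoding
found = df['col1'].str.extract('(' + '|'.join(s + ',' for s in states) + ')')[0].str.strip(',')
dummies = pd.get_dummies(found)
print(dummies)
```

Rows with no match (like the Warszawa entry) come out as all zeros, matching the desired output.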
Sorry for the trivial question.
I am having trouble selecting and replacing a value in a list based on the values in another column. I have the following list:
Jack 0.794938 0
Marc 0.05155265 0
Eliza 0.96454115 0
Louis 0.075102 0
Milo 0.951499 0
Marc 0.63319 0
Michael 0.719391 0
Louis 0.502843 0
Eliza 0.620387 0
I would like to keep the first occurrence of each name, with the third column taking the value from the second column of the second occurrence. So the result should be:
Jack 0.794938 0
Marc 0.05155265 0.63319
Eliza 0.96454115 0.620387
Louis 0.075102 0.502843
Milo 0.951499 0
Michael 0.719391 0
I am using this code:
res = []
already_added = set()
for e in a:
    key1 = e[0]
    if key1 not in already_added:
        res.append(e)
From that point on, I would like something like:
else:
    res[res[:][0] == e[0]][2] = e[1]
or
else:
    res[np.where(res[:][0] == e[0])][2] = e[1]
But I keep getting the TypeError: list indices must be integers or slices, not list.
Can someone help me solve this?
Thanks
Edit: I corrected the indices
Here is a pure NumPy solution. It sorts the records by the first column to easily find duplicate names.
import numpy as np
data = """
Jack 0.794938 0
Marc 0.05155265 0
Eliza 0.96454115 0
Louis 0.075102 0
Milo 0.951499 0
Marc 0.63319 0
Michael 0.719391 0
Louis 0.502843 0
Eliza 0.620387 0
"""
data = (line.split() for line in data.strip().split('\n'))
data = np.array([(x, float(y), float(z)) for x, y, z in data], dtype=object)
res = data.copy()
idx = np.argsort(res[:, 0], kind='mergesort')
dupl = res[idx[:-1], 0] == res[idx[1:], 0]
res[idx[:-1][dupl], 2] = res[idx[1:][dupl], 1]
mask = np.ones(res.shape[:1], dtype=bool)
mask[idx[1:][dupl]] = False
res = res[mask]
Result:
# array([['Jack', 0.794938, 0.0],
# ['Marc', 0.05155265, 0.63319],
# ['Eliza', 0.96454115, 0.620387],
# ['Louis', 0.075102, 0.502843],
# ['Milo', 0.951499, 0.0],
# ['Michael', 0.719391, 0.0]], dtype=object)
You could use Pandas:
Load values into a dataframe, df:
import pandas as pd
from io import StringIO

csvfile = StringIO("""Jack 0.794938 0
Marc 0.05155265 0
Eliza 0.96454115 0
Louis 0.075102 0
Milo 0.951499 0
Marc 0.63319 0
Michael 0.719391 0
Louis 0.502843 0
Eliza 0.620387 0""")
df = pd.read_csv(csvfile, header=None, sep=r'\s+')
Then, use groupby and unstack:
df.groupby(0).apply(lambda x: pd.Series(x[1].tolist()))\
.unstack().add_prefix('value').reset_index()
Output:
0 value0 value1
0 Eliza 0.964541 0.620387
1 Jack 0.794938 NaN
2 Louis 0.075102 0.502843
3 Marc 0.051553 0.633190
4 Michael 0.719391 NaN
5 Milo 0.951499 NaN
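For completeness, the asker's original loop can also be finished with a plain dict keyed on name, which avoids searching res for every duplicate; a minimal pure-Python sketch:

```python
rows = [('Jack', 0.794938), ('Marc', 0.05155265), ('Eliza', 0.96454115),
        ('Louis', 0.075102), ('Milo', 0.951499), ('Marc', 0.63319),
        ('Michael', 0.719391), ('Louis', 0.502843), ('Eliza', 0.620387)]

# First occurrence fills slot 1; a later duplicate fills slot 2
merged = {}
for name, value in rows:
    if name not in merged:
        merged[name] = [name, value, 0]
    else:
        merged[name][2] = value
res = list(merged.values())
print(res)
```

Dicts preserve insertion order in Python 3.7+, so the output keeps the first-seen order of names, matching the desired result.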