Using Pandas with CSV data - Python

I am new to Python; kindly help me understand and move forward in learning it. Find the sample data below:
Country Age Sal OnWork
USA 52 12345 No
UK 23 1142 Yes
MAL 25 4456 No
I would like to find the mean of the Sal column where OnWork is No.

Let's say your data looks like the following:
{'Country': 'USA', 'Age': '52', 'Sal': '12345', 'OnWork': 'No'}
{'Country': 'UK', 'Age': '23', 'Sal': '1142', 'OnWork': 'Yes'}
{'Country': 'MAL', 'Age': '25', 'Sal': '4456', 'OnWork': 'No'}
The code below should help in your case:
df = your_dataframe
df[df["OnWork"]=="No"]["Sal"].mean()

Related

How do I add a column to a DataFrame that is a derivative of another column

Sorry for the newbie question, I am taking baby steps in Python. My DataFrame has a column address of type object. address contains a country, like this: {... "city": "...", "state": "...", "country": "..."}. How do I add a column country derived from the column address?
Without the data it's difficult to answer, but if the values are Python dicts, applying the pandas Series constructor to the rows should work:
df['address'].apply(pd.Series)
You will have to assign the result back to the original dataframe. If the values are JSON strings, you may first want to convert them to dictionaries using json.loads.
SAMPLE RUN:
>>> df
x address
0 1 {'city': 'xyz', 'state': 'Massachusetts', 'country': 'US'}
1 2 {'city': 'ABC', 'state': 'LONDON', 'country': 'UK'}
>>> df.assign(country=df['address'].apply(pd.Series)['country'])
x address country
0 1 {'city': 'xyz', 'state': 'Massachusetts', 'country': 'US'} US
1 2 {'city': 'ABC', 'state': 'LONDON', 'country': 'UK'} UK
Even better, use the key directly with the Series.str accessor:
>>> df.assign(country=df['address'].str['country'])
x address country
0 1 {'city': 'xyz', 'state': 'Massachusetts', 'country': 'US'} US
1 2 {'city': 'ABC', 'state': 'LONDON', 'country': 'UK'} UK
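If address holds JSON strings rather than Python dicts, a minimal sketch of the json.loads conversion mentioned above (the sample values are assumptions for illustration):
import json
import pandas as pd

df = pd.DataFrame({
    'x': [1, 2],
    'address': ['{"city": "xyz", "state": "Massachusetts", "country": "US"}',
                '{"city": "ABC", "state": "LONDON", "country": "UK"}'],
})

# Parse each JSON string into a dict, then pull out the country key
df['address'] = df['address'].apply(json.loads)
df = df.assign(country=df['address'].str['country'])
print(df)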

Is there a way to do this in Python?

I have a data frame that looks like this:
data = {'State': ['24', '24', '24', '24', '24', '24', '24', '24', '24', '24', '24', '24'],
        'County code': ['001', '001', '001', '001', '002', '002', '002', '002', '003', '003', '003', '003'],
        'TT code': ['123', '123', '123', '123', '124', '124', '124', '124', '125', '125', '125', '125'],
        'BLK code': ['221', '221', '221', '221', '222', '222', '222', '222', '223', '223', '223', '223'],
        'Age Code': ['1', '1', '2', '2', '2', '2', '2', '2', '2', '1', '2', '1']}
df = pd.DataFrame(data)
Essentially I want to keep only the rows for the TT codes where the Age Code is 2 and there are no 1's. So I just want the data frame where:
'State': ['24', '24', '24', '24'],
'County code': ['002', '002', '002', '002'],
'TT code': ['124', '124', '124', '124'],
'BLK code': ['222', '222', '222', '222'],
'Age Code': ['2', '2', '2', '2']
Is there a way to do this?
IIUC, you want to keep only the TT groups where there are only Age groups with value '2'?
You can use a groupby.transform('all') on the boolean Series:
df[df['Age Code'].eq('2').groupby(df['TT code']).transform('all')]
output:
State County code TT code BLK code Age Code
4 24 002 124 222 2
5 24 002 124 222 2
6 24 002 124 222 2
7 24 002 124 222 2
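An equivalent sketch (not from the original answer) uses groupby.filter, which keeps whole groups where the condition holds for every row:
# Keep only the TT code groups whose Age Code values are all '2'
df.groupby('TT code').filter(lambda g: g['Age Code'].eq('2').all())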
This should work.
df111['Age Code'] = "2"
I am just wondering why strings were chosen for values that are integers.

Pandas: Is there a way to count the occurrences of values in a column whose cells contain dictionaries?

I have a df column where each cell contains a list of dictionaries, so when I apply value_counts to this column I obviously get the number of occurrences of each whole cell. But what I need is the number of occurrences of the separate values.
The column's cells look something like this:
col1
1 [{'name': 'John'}, {'name': 'Mark'}, {'name': 'Susan'}, {'name': 'Mr.Bean'}, {'name': 'The Smiths'}]
2 [{'name': 'Mark'}, {'name': 'Barbara'}, {'name': 'Poly'}, {'name': 'John'}, {'name': 'Nick'}]
So basically what I need as a result is how many Susans, Johns, etc. there are in the entire column.
Any help will be appreciated.
You can try this, using #jch's setup:
import pandas as pd

df = pd.DataFrame({'col1': [[{'name': 'John'}, {'name': 'Mark'}, {'name': 'Susan'}, {'name': 'Mr.Bean'}, {'name': 'The Smiths'}],
                            [{'name': 'Mark'}, {'name': 'Barbara'}, {'name': 'Poly'}, {'name': 'John'}, {'name': 'Nick'}]]})

pd.DataFrame(df['col1'].to_list()).stack().str['name'].value_counts()
Output:
John 2
Mark 2
Susan 1
Mr.Bean 1
The Smiths 1
Barbara 1
Poly 1
Nick 1
dtype: int64
We use the pandas DataFrame constructor, stack to reshape into a single column, the selector from the .str accessor to get the values out of the dictionaries, and lastly value_counts.
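To make that chain concrete, a sketch (same df as above) that names the intermediate steps:
wide = pd.DataFrame(df['col1'].to_list())  # one column per list position, one dict per cell
stacked = wide.stack()                     # a single Series of dicts
names = stacked.str['name']                # pull the 'name' value out of each dict
print(names.value_counts())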
The data is actually a list of dictionaries on each line. You can build a dataframe from each row; the names then sit in a column that can be converted to a list, exploded, and passed to value_counts:
df = pd.DataFrame({'col1': [[{'name': 'John'}, {'name': 'Mark'}, {'name': 'Susan'}, {'name': 'Mr.Bean'}, {'name': 'The Smiths'}],
                            [{'name': 'Mark'}, {'name': 'Barbara'}, {'name': 'Poly'}, {'name': 'John'}, {'name': 'Nick'}]]})
print(df)
Output:
col1
0 [{'name': 'John'}, {'name': 'Mark'}, {'name': ...
1 [{'name': 'Mark'}, {'name': 'Barbara'}, {'name...
value_counts:
df.apply(lambda x: pd.DataFrame(x['col1']).squeeze().to_list(), axis=1).explode().value_counts()
Output:
John 2
Mark 2
Susan 1
Mr.Bean 1
The Smiths 1
Barbara 1
Poly 1
Nick 1
We can use the explode() function to transform each element of a list-like into its own row, replicating the index values. Then json_normalize turns each dictionary key into a column, and value_counts() counts each unique value in the DataFrame.
df = pd.DataFrame({'col1': [[{'name': 'John'}, {'name': 'Mark'}, {'name': 'Susan'}, {'name': 'Mr.Bean'}, {'name': 'The Smiths'}],
                            [{'name': 'Mark'}, {'name': 'Barbara'}, {'name': 'Poly'}, {'name': 'John'}, {'name': 'Nick'}]]})
print(pd.json_normalize(df.col1.explode()).value_counts())
Result:
name
John 2
Mark 2
Barbara 1
Mr.Bean 1
Nick 1
Poly 1
Susan 1
The Smiths 1
If you want the count for any one name, say John:
name = 'John'
print(pd.json_normalize(df.col1.explode()).eq(name).sum())
Result:
2
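As a plain-Python cross-check (a sketch, not from any of the answers above), collections.Counter gives the same tallies by iterating over the lists directly:
from collections import Counter

counts = Counter(d['name'] for row in df['col1'] for d in row)
print(counts)  # Counter({'John': 2, 'Mark': 2, 'Susan': 1, ...})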

Create new fake data with new primary keys from existing dataframe python

I have a dataframe as follows:
df1 = pd.DataFrame({'id': ['1a', '2b', '3c'], 'name': ['Anna', 'Peter', 'John'], 'year': [1999, 2001, 1993]})
I want to create new data by randomly re-arranging the values in each column, but for the id column I also need to append a random letter to the values, then add the new data to the existing df1 as follows:
df1 = pd.DataFrame({'id': ['1a', '2b', '3c', '2by', '1ao', '1az', '3cc'], 'name': ['Anna', 'Peter', 'John', 'John', 'Peter', 'Anna', 'Anna'], 'year': [1999, 2001, 1993, 1999, 1999, 2001, 2001]})
Could anyone help me, please? Thank you very much.
Use DataFrame.sample and add a random letter with numpy.random.choice:
import string
import numpy as np

N = 5
df2 = (df1.sample(n=N, replace=True)
          .assign(id=lambda x: x['id'] + np.random.choice(list(string.ascii_letters), size=N)))
df1 = df1.append(df2, ignore_index=True)
print(df1)
id name year
0 1a Anna 1999
1 2b Peter 2001
2 3c John 1993
3 1aY Anna 1999
4 3cp John 1993
5 3cE John 1993
6 2bz Peter 2001
7 3cu John 1993
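Note that DataFrame.append was removed in pandas 2.0; on current versions the equivalent last step is pd.concat:
df1 = pd.concat([df1, df2], ignore_index=True)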

How to target specific rows for replacement when the order differs between dataframes?

Say I have two dataframes and I want to fillna one with the other, but the row order is not exactly the same. Is it possible to target specific columns as part of the mapping?
Here's an example:
import pandas as pd
import numpy as np

data = pd.DataFrame({'year': ['2016', '2016', '2015', '2014', '2013'],
                     'country': ['uk', 'usa', 'fr', 'fr', 'uk'],
                     'rep': ['john', 'john', 'claire', 'kyle', 'kyle']})
dataNew = pd.DataFrame({'year': ['2016', '2016', '2015', '2013', '2014'],
                        'country': ['usa', 'uk', 'fr', 'uk', 'fr'],
                        'sales': [21, 10, 20, 12, 10],
                        'rep': [np.nan, 'john', np.nan, np.nan, 'kyle']})
print(data)
print(dataNew)
print(dataNew.fillna(data))
My output is not right because, as you can see, dataNew's country data is not in the same order (uk/usa are swapped, and so are fr/uk at the end). Is there a way to match on year, country and sales before replacing the NaN values in the rep column?
The output I'm looking for (ordered like the first dataframe) is below. I'm trying to understand how I could fill in the NAs by selecting the matching cells in another df. To keep the question simple, the first dataframe has fewer columns, so the question is purely about mapping/searching.
year country sales rep
0 2016 uk 10 john
1 2016 usa 21 john
2 2015 fr 20 claire
3 2014 fr 10 kyle
4 2013 uk 12 kyle
import pandas as pd
import numpy as np

data = pd.DataFrame({'year': ['2016', '2016', '2015', '2014', '2013'],
                     'country': ['uk', 'usa', 'fr', 'fr', 'uk'],
                     'sales': [10, 21, 20, 10, 12],
                     'rep': ['john', 'john', 'claire', 'kyle', 'kyle']})
dataNew = pd.DataFrame({'year': ['2016', '2016', '2015', '2013', '2014'],
                        'country': ['usa', 'uk', 'fr', 'uk', 'fr'],
                        'sales': [21, 10, 20, 12, 10],
                        'rep': [np.nan, 'john', np.nan, np.nan, 'kyle']})

# join on these three columns and get the rep column from the other dataframe
cols = ['year', 'country', 'sales']
dataNew = dataNew.drop(columns='rep').join(data.set_index(cols), on=cols)
print(data)
output:
year country sales rep
0 2016 uk 10 john
1 2016 usa 21 john
2 2015 fr 20 claire
3 2014 fr 10 kyle
4 2013 uk 12 kyle
You can also use a MultiIndex along with fillna():
import pandas as pd
import numpy as np

data = pd.DataFrame({'year': ['2016', '2016', '2015', '2014', '2013'],
                     'country': ['uk', 'usa', 'fr', 'fr', 'uk'],
                     'rep': ['john', 'john', 'claire', 'kyle', 'kyle']})
dataNew = pd.DataFrame({'year': ['2016', '2016', '2015', '2013', '2014'],
                        'country': ['usa', 'uk', 'fr', 'uk', 'fr'],
                        'sales': [21, 10, 20, 12, 10],
                        'rep': [np.nan, 'john', np.nan, np.nan, 'kyle']})

(dataNew.set_index(['year', 'country']).sort_index()
        .fillna(data.set_index(['year', 'country']).sort_index())
 ).reset_index()
year country sales rep
0 2013 uk 12 kyle
1 2014 fr 10 kyle
2 2015 fr 20 claire
3 2016 uk 10 john
4 2016 usa 21 john
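Another option (a sketch, not from the answers above) is a left merge on year and country to bring rep over from data and fill the remaining gaps, using the same data and dataNew defined just above:
filled = dataNew.merge(data[['year', 'country', 'rep']],
                       on=['year', 'country'], how='left',
                       suffixes=('', '_from_data'))
filled['rep'] = filled['rep'].fillna(filled['rep_from_data'])
filled = filled.drop(columns='rep_from_data')
print(filled)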
