I have a nested dictionary, shown below, and I want to print the data in tabular form. The condition is that I only want certain fields in the table, for example BIRT, NAME, and SEX. How can I do that?
import sys
import pandas as pd
indi ={}
indi = {'#I7#': {'BIRT': '15 NOV 1925', 'FAMS': '#F2#', 'NAME': 'Rose /Campbell/', 'DEAT': '26 AUG 2009', 'SEX': 'F'}, '#I5#': {'BIRT': '15 SEP 1928', 'FAMS': '#F3#', 'NAME': 'Rosy /Huleknberg/', 'DEAT': '10 MAR 2010', 'SEX': 'F'}}
person = pd.DataFrame(indi).T
person.fillna(0, inplace=True)
print(person)
output
BIRT DEAT FAMC FAMS NAME SEX
#I5# 15 SEP 1928 10 MAR 2010 0 #F3# Rosy /Huleknberg/ F
#I7# 15 NOV 1925 26 AUG 2009 0 #F2# Rose /Campbell/ F
Documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.filter.html
You can try to use something like this:
print(person.filter(['BIRT', 'NAME', 'SEX']))
Output will be:
BIRT NAME SEX
#I5# 15 SEP 1928 Rosy /Huleknberg/ F
#I7# 15 NOV 1925 Rose /Campbell/ F
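As a possible alternative to filter, plain column selection should also work. The sketch below assumes the same indi dictionary as in the question; the names wanted and table are only illustrative. One difference worth knowing: filter silently skips labels that do not exist, while selecting with a list raises a KeyError if a column is missing.

```python
import pandas as pd

indi = {
    '#I7#': {'BIRT': '15 NOV 1925', 'FAMS': '#F2#', 'NAME': 'Rose /Campbell/',
             'DEAT': '26 AUG 2009', 'SEX': 'F'},
    '#I5#': {'BIRT': '15 SEP 1928', 'FAMS': '#F3#', 'NAME': 'Rosy /Huleknberg/',
             'DEAT': '10 MAR 2010', 'SEX': 'F'},
}

# Hypothetical name: the columns we want to show, in display order.
wanted = ['BIRT', 'NAME', 'SEX']

# orient='index' makes each outer key a row, so no transpose is needed.
person = pd.DataFrame.from_dict(indi, orient='index')

# Selecting with a list keeps only those columns (KeyError if one is absent).
table = person[wanted]
print(table)
```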
I'm trying to sort these values by year so that entries from the current year come first.
mdlist = [{0:'31 Jan 2022', 1:'', 2:'10 Feb 2022'},
          {0:'10 Feb 2021', 1:'20 Feb 2021', 2:''},
          {0:'10 Feb 2022', 1:'10 Feb 2022', 2:'10 Feb 2022'}]
mdlist = sorted(mdlist, key=lambda d:d[0])
but it does not work as expected.
expected output:
mdlist = [{0:'31 Jan 2022', 1:'', 2:'10 Feb 2022'},
{0:'10 Feb 2022', 1:'10 Feb 2022', 2:'10 Feb 2022'},
{0:'10 Feb 2021', 1:'20 Feb 2021', 2:''}]
You could take advantage of the fact that these are dates: parse them with the datetime module, then sort by year in descending order and by month and day in ascending order:
from datetime import datetime
def sorting_key(dct):
    ymd = datetime.strptime(dct[0], "%d %b %Y")
    return -ymd.year, ymd.month, ymd.day
mdlist.sort(key=sorting_key)
Output:
[{0: '31 Jan 2022', 1: '', 2: '10 Feb 2022'},
{0: '10 Feb 2022', 1: '10 Feb 2022', 2: '10 Feb 2022'},
{0: '10 Feb 2021', 1: '20 Feb 2021', 2: ''}]
Use a key function that returns 0 if the year is 2022, 1 otherwise. This will sort all the 2022 dates first.
firstyear = '2022'
mdlist = sorted(mdlist, key=lambda d: 0 if d[0].split()[-1] == firstyear else 1)
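If the year should not be hardcoded, the two ideas above can be combined. This is only a sketch: current_year_first is a hypothetical helper name, it assumes key 0 always holds a date in "%d %b %Y" form, and %b parsing assumes an English locale.

```python
from datetime import datetime

def current_year_first(rows, current_year=None):
    """Return rows sorted with the current year first, then newer years first."""
    if current_year is None:
        current_year = datetime.now().year

    def key(d):
        parsed = datetime.strptime(d[0], "%d %b %Y")
        # False sorts before True, so rows from current_year come first.
        return (parsed.year != current_year, -parsed.year, parsed.month, parsed.day)

    return sorted(rows, key=key)

mdlist = [{0: '31 Jan 2022', 1: '', 2: '10 Feb 2022'},
          {0: '10 Feb 2021', 1: '20 Feb 2021', 2: ''},
          {0: '10 Feb 2022', 1: '10 Feb 2022', 2: '10 Feb 2022'}]

# The year is passed explicitly here so the result does not depend on today's date.
result = current_year_first(mdlist, current_year=2022)
print([d[0] for d in result])
```

Calling it without current_year falls back to datetime.now().year.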
I have a data frame like this:
year drug_name avg_number_of_ingredients
0 2019 NEXIUM I.V. 8
1 2016 ZOLADEX 10
2 2017 PRILOSEC 59
3 2017 BYDUREON BCise 24
4 2019 Lynparza 28
And I need to group drug names and mean number of ingredients by year like this:
year drug_name avg_number_of_ingredients
0 2019 drug a,b,c.. mean value for column
1 2018 drug a,b,c.. mean value for column
2 2017 drug a,b,c.. mean value for column
If I do df.groupby('year'), I lose drug names. How can I do it?
Let me show the solution on a simple example. First, I build the same data frame as yours:
>>> import pandas as pd
>>> df = pd.DataFrame(
...     [
...         {'year': 2019, 'drug_name': 'NEXIUM I.V.', 'avg_number_of_ingredients': 8},
...         {'year': 2016, 'drug_name': 'ZOLADEX', 'avg_number_of_ingredients': 10},
...         {'year': 2017, 'drug_name': 'PRILOSEC', 'avg_number_of_ingredients': 59},
...         {'year': 2017, 'drug_name': 'BYDUREON BCise', 'avg_number_of_ingredients': 24},
...         {'year': 2019, 'drug_name': 'Lynparza', 'avg_number_of_ingredients': 28},
...     ]
... )
>>> print(df)
year drug_name avg_number_of_ingredients
0 2019 NEXIUM I.V. 8
1 2016 ZOLADEX 10
2 2017 PRILOSEC 59
3 2017 BYDUREON BCise 24
4 2019 Lynparza 28
Now I build df_grouped, which still keeps the drug names: they are joined into one string per year, while the ingredient counts are averaged.
>>> df_grouped = df.groupby('year', as_index=False).agg({'drug_name': ', '.join, 'avg_number_of_ingredients': 'mean'})
>>> print(df_grouped)
year drug_name avg_number_of_ingredients
0 2016 ZOLADEX 10.0
1 2017 PRILOSEC, BYDUREON BCise 41.5
2 2019 NEXIUM I.V., Lynparza 18.0
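The same grouping can probably also be written with named aggregation, which spells out the output column names. The sketch below is one possible variant, not the only way; sorting by year in descending order is an assumption based on the layout shown in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2019, 2016, 2017, 2017, 2019],
    'drug_name': ['NEXIUM I.V.', 'ZOLADEX', 'PRILOSEC', 'BYDUREON BCise', 'Lynparza'],
    'avg_number_of_ingredients': [8, 10, 59, 24, 28],
})

df_grouped = (
    df.groupby('year', as_index=False)
      # Named aggregation: output_column=(input_column, function).
      .agg(drug_name=('drug_name', ', '.join),
           avg_number_of_ingredients=('avg_number_of_ingredients', 'mean'))
      # Assumption: newest year first, as in the question's desired layout.
      .sort_values('year', ascending=False)
      .reset_index(drop=True)
)
print(df_grouped)
```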
I have a dataframe column with four values like the ones below:
Input
India,Chennai - 24 Oct 1992
India,-Chennai, Oct 1992
(Asia) India,Chennai-22 Oct 1992
India,-Chennai, 1992
Output
Place
India Chennai
India Chennai
(Asia) India Chennai
India Chennai
Date
24 Oct 1992
Oct 1992
22 Oct 1992
1992
I need to split the date part (for example 24 Oct 1992, or just 1992) into one column and the text (India,Chennai) into a separate column.
I'm a bit confused about how to extract the values; I tried the replace and split options but couldn't get the result.
I would appreciate it if somebody could help!
Apologies for the format of the input and output data!
Use:
import re
df['Date'] = df['col'].str.split("(-|,)").str[-1]
df['Place'] = df.apply(lambda x: x['col'].split(x['Date']), axis=1).str[0].str.replace(',', ' ').str.replace('-', '')
Input
col
0 India,Chennai - 24 Oct 1992
1 India,-Chennai,Oct 1992
2 India,-Chennai, 1992
3 (Asia) India,Chennai-22 Oct 1992
Output
col Place Date
0 India,Chennai - 24 Oct 1992 India Chennai 24 Oct 1992
1 India,-Chennai,Oct 1992 India Chennai Oct 1992
2 India,-Chennai, 1992 India Chennai 1992
3 (Asia) India,Chennai-22 Oct 1992 (Asia) India Chennai 22 Oct 1992
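Another option might be to anchor on the date itself with str.extract. This is a sketch that assumes every value ends with a four-digit year, optionally preceded by a three-letter month and a one- or two-digit day; date_pattern is a name made up for the example, and the column name col follows the answer above.

```python
import pandas as pd

df = pd.DataFrame({'col': [
    'India,Chennai - 24 Oct 1992',
    'India,-Chennai, Oct 1992',
    '(Asia) India,Chennai-22 Oct 1992',
    'India,-Chennai, 1992',
]})

# Assumed shape of the date: [day ][Mon ]YYYY at the very end of the string.
date_pattern = r'((?:\d{1,2}\s+)?(?:[A-Za-z]{3}\s+)?\d{4})\s*$'

# The single capture group becomes the Date column.
df['Date'] = df['col'].str.extract(date_pattern, expand=False)

# Remove the date, then turn runs of commas, hyphens and spaces into one space.
df['Place'] = (
    df['col']
    .str.replace(date_pattern, '', regex=True)
    .str.replace(r'[,\-\s]+', ' ', regex=True)
    .str.strip()
)
print(df[['Place', 'Date']])
```

Unlike splitting on the last separator, this leaves no leading space in Date, but it will give NaN for rows that do not end in a year.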
There are many ways to create columns with the pandas library in Python:
you can build a DataFrame from a list of lists, a list of dictionaries, or a dictionary of lists.
For simplicity, I'll use lists here.
First, import pandas:
import pandas as pd
Create a list from the given data:
data = [['India','chennai', '24 Oct', 1992], ['India','chennai', '23 Oct', 1992],\
['India','chennai', '23 Oct', 1992],['India','chennai', '21 Oct', 1992]]
Create the DataFrame from the list:
df = pd.DataFrame(data, columns=['Country', 'City', 'Date', 'Year'], index=(0, 1, 2, 3))
Print it:
print(df)
The output will be:
Country City Date Year
0 India chennai 24 Oct 1992
1 India chennai 23 Oct 1992
2 India chennai 23 Oct 1992
3 India chennai 21 Oct 1992
hope this will help you
The following assumes that the first digit is where we always want to split the text. If the assumption fails then the code also fails!
>>> import re
>>> text_array
['India,Chennai - 24 Oct 1992', 'India,-Chennai,23 Oct 1992', '(Asia) India,Chennai-22 Oct 1992', 'India,-Chennai, 1992']
# split at the first digit, keep the digit, split at only the first digit
>>> tmp = [re.split("([0-9]){1}", t, maxsplit=1) for t in text_array]
>>> tmp
[['India,Chennai - ', '2', '4 Oct 1992'], ['India,-Chennai,', '2', '3 Oct 1992'], ['(Asia) India,Chennai-', '2', '2 Oct 1992'], ['India,-Chennai, ', '1', '992']]
# join the last two fields together to get the digit back.
>>> r = [(i[0], "".join(i[1:])) for i in tmp]
>>> r
[('India,Chennai - ', '24 Oct 1992'), ('India,-Chennai,', '23 Oct 1992'), ('(Asia) India,Chennai-', '22 Oct 1992'), ('India,-Chennai, ', '1992')]
If you have control over how the input is generated, I would suggest
making the input more consistent so that it can be parsed with a tool like
pandas or directly with csv.
Hope this helps.
Regards,
Prasanth
Python code:
import re
import pandas as pd
input_dir = '/content/drive/My Drive/TestData'
csv_file = '{}/test001.csv'.format(input_dir)
p = re.compile(r'(?:[0-9]|[0-2][0-9]|[3][0-1])\s(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s(?:\d{4})|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s(?:\d{4})|(?:\d{4})', re.IGNORECASE)
places = []
dates = []
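with open(csv_file, encoding='utf-8', errors='ignore') as f:
    for line in f:
        # Replace commas and hyphens with spaces, then collapse repeated whitespace.
        s = re.sub(r"[,-]", " ", line.strip())
        s = re.sub(r"\s+", " ", s)
        r = p.search(s)
        if r is None:
            continue  # skip lines without a recognizable date
        str_date = r.group()
        dates.append(str_date)
        # Everything before the date is the place.
        place = s[0:s.find(str_date)]
        places.append(place)
# Named "data" rather than "dict" to avoid shadowing the built-in.
data = {'Place': places,
        'Date': dates}
df = pd.DataFrame(data)
print(df)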
Output:
Place Date
0 India Chennai 24 Oct 1992
1 India Chennai Oct 1992
2 (Asia) India Chennai 22 Oct 1992
3 India Chennai 1992
I am trying to drop all records with name = Tina, but keep the record if year = 2015.
import pandas as pd
data = {'name': ['Jason', 'Tina', 'Tina', 'Tina', 'Amy'],
'year': [2015, 2012, 2013, 2015, 2014],
'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
The code df[df.name != 'Tina'] drops all the records where name = Tina, but I need to keep the one with year = 2015.
expected output:
name reports year
Cochice Jason 4 2015
Maricopa Tina 2 2015
Yuma Amy 3 2014
Try drop_duplicates with keep='last', if the records are sorted the way you need them. Note that this keeps only the last row for each name, so it depends on the row order.
df.drop_duplicates('name', keep='last')
Output:
name year reports
Cochice Jason 2015 4
Maricopa Tina 2015 2
Yuma Amy 2014 3
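If the row order cannot be relied on, a boolean mask that states the condition directly may be safer. A minimal sketch using the same data as the question; mask and result are just illustrative names.

```python
import pandas as pd

data = {'name': ['Jason', 'Tina', 'Tina', 'Tina', 'Amy'],
        'year': [2015, 2012, 2013, 2015, 2014],
        'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data, index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])

# Keep a row if it is not Tina, or if its year is 2015.
mask = (df['name'] != 'Tina') | (df['year'] == 2015)
result = df[mask]
print(result)
```

Each comparison needs its own parentheses because | binds more tightly than != and ==.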
I'm not sure if this is possible but I have a very large array containing dates
a = ['Fri, 19 Aug 2011 19:28:17 -0000',....., 'Wed, 05 Feb 2012 11:00:00 -0000']
I'm trying to find if there is a way to count the frequency of the days and months in the array. In this case I'm trying to count strings for abbreviations of months or days (such as Fri,Mon, Apr, Jul)
You can use Counter() from the collections module.
from collections import Counter
a = ['Fri, 19 Aug 2011 19:28:17 -0000',
'Fri, 09 June 2017 11:11:11 -0000',
'Wed, 05 Feb 2012 11:00:00 -0000']
# this generator splits the dates into words and strips ".,;-:" characters from each word:
# ['Fri', '19', 'Aug', '2011', '19:28:17', '0000', 'Fri', '09', 'June',
# '2017', '11:11:11', '0000', 'Wed', '05', 'Feb', '2012', '11:00:00', '0000']
# and feeds it to counting:
c = Counter( (x.strip().strip(".,;-:") for word in a for x in word.split() ))
for key in c:
    if key.isalpha():
        print(key, c[key])
The if prints only those keys from the counter that are pure "letters" - not digits:
Fri 2
Aug 1
June 1
Wed 1
Feb 1
Day-names and Month-names are the only pure isalpha() parts of your dates.
Full c output:
Counter({'0000': 3, 'Fri': 2, '19': 1, 'Aug': 1, '2011': 1,
'19:28:17': 1, '09': 1, 'June': 1, '2017': 1, '11:11:11': 1,
'Wed': 1, '05': 1, 'Feb': 1, '2012': 1, '11:00:00': 1})
Improvement from @AzatIbrakov's comment:
c = Counter( (x.strip().strip(".,;-:") for word in a for x in word.split()
if x.strip().strip(".,;-:").isalpha()))
would weed out non-alpha words in the generation step already.
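If days and months should be counted separately, one possible sketch relies on token position instead of isalpha(). It assumes every string has the shape 'Day, DD Mon YYYY HH:MM:SS zone', as in the sample; day_counts and month_counts are hypothetical names.

```python
from collections import Counter

a = ['Fri, 19 Aug 2011 19:28:17 -0000',
     'Fri, 09 June 2017 11:11:11 -0000',
     'Wed, 05 Feb 2012 11:00:00 -0000']

# First token is the day name (with a trailing comma), third token is the month.
day_counts = Counter(s.split()[0].rstrip(',') for s in a)
month_counts = Counter(s.split()[2] for s in a)

print(day_counts)
print(month_counts)
```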
Python lists have a built-in .count method, which is very useful here:
lista = [
'Fri, 19 Aug 2011 19:28:17 -0000',
'Fri, 19 Aug 2011 19:28:17 -0000',
'Sun, 19 Jan 2011 19:28:17 -0000',
'Sun, 19 Aug 2011 19:28:17 -0000',
'Fri, 19 Jan 2011 19:28:17 -0000',
'Mon, 05 Feb 2012 11:00:00 -0000',
'Mon, 05 Nov 2012 11:00:00 -0000',
'Wed, 05 Feb 2012 11:00:00 -0000',
'Tue, 05 Nov 2012 11:00:00 -0000',
'Tue, 05 Dec 2012 11:00:00 -0000',
'Wed, 05 Jan 2012 11:00:00 -0000',
]
# Join with a space so the end of one date does not fuse with the start of the next.
listb = ' '.join(lista).split()
count = {}
for item in listb:
    count[item] = listb.count(item)
months = ['Jan', 'Feb', 'Aug', 'Nov', 'Dec']
for k in count:
    if k in months:
        print(f"{k}: {count[k]}")
Output:
(xenial)vash@localhost:~/python/stack_overflow$ python3.7 count_months.py
Aug: 3
Jan: 3
Feb: 2
Nov: 2
Dec: 1
What happens is that we take all the items of lista and join them into one string, then split that string to get the individual words.
Now we can use the count method and build a dictionary to hold the counts. We then make a list of the items we want from the dictionary and retrieve only those keys.