How to count certain elements of a string in a large array? - python

I'm not sure if this is possible but I have a very large array containing dates
a = ['Fri, 19 Aug 2011 19:28:17 -0000',....., 'Wed, 05 Feb 2012 11:00:00 -0000']
I'm trying to find out whether there is a way to count the frequency of the days and months in the array. In this case, I'm trying to count the abbreviations of months or days (such as Fri, Mon, Apr, Jul).

You can use Counter() from the collections module.
from collections import Counter
a = ['Fri, 19 Aug 2011 19:28:17 -0000',
'Fri, 09 June 2017 11:11:11 -0000',
'Wed, 05 Feb 2012 11:00:00 -0000']
# this generator splits the dates into words and strips the characters ".,;-:" from each word:
# ['Fri', '19', 'Aug', '2011', '19:28:17', '0000', 'Fri', '09', 'June',
#  '2017', '11:11:11', '0000', 'Wed', '05', 'Feb', '2012', '11:00:00', '0000']
# and feeds the words to the counter:
c = Counter(x.strip().strip(".,;-:") for word in a for x in word.split())
for key in c:
    if key.isalpha():
        print(key, c[key])
The if statement prints only those keys from the counter that consist purely of letters, not digits:
Fri 2
Aug 1
June 1
Wed 1
Feb 1
Day-names and Month-names are the only pure isalpha() parts of your dates.
Full c output:
Counter({'0000': 3, 'Fri': 2, '19': 1, 'Aug': 1, '2011': 1,
'19:28:17': 1, '09': 1, 'June': 1, '2017': 1, '11:11:11': 1,
'Wed': 1, '05': 1, 'Feb': 1, '2012': 1, '11:00:00': 1})
Improvement suggested in a comment by @AzatIbrakov:
c = Counter(x.strip().strip(".,;-:") for word in a for x in word.split()
            if x.strip().strip(".,;-:").isalpha())
This weeds out the non-alphabetic words in the generation step already.
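If you want the days and the months counted separately, one option is to pick the words by position instead of filtering with isalpha(). This is only a minimal sketch. It assumes every entry has the same "Day, DD Mon YYYY HH:MM:SS zone" layout as in the question, and the names day_counts and month_counts are just illustrative:

```python
from collections import Counter

a = ['Fri, 19 Aug 2011 19:28:17 -0000',
     'Fri, 09 June 2017 11:11:11 -0000',
     'Wed, 05 Feb 2012 11:00:00 -0000']

day_counts = Counter()
month_counts = Counter()
for entry in a:
    parts = entry.split()
    # parts[0] is the day name with a trailing comma, parts[2] is the month name
    day_counts[parts[0].rstrip(',')] += 1
    month_counts[parts[2]] += 1

print(day_counts)
print(month_counts)
```

Keeping two counters avoids mixing day names and month names in one result, at the cost of relying on a fixed word order.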

Python lists have a built-in .count method, which is very useful here:
lista = [
'Fri, 19 Aug 2011 19:28:17 -0000',
'Fri, 19 Aug 2011 19:28:17 -0000',
'Sun, 19 Jan 2011 19:28:17 -0000',
'Sun, 19 Aug 2011 19:28:17 -0000',
'Fri, 19 Jan 2011 19:28:17 -0000',
'Mon, 05 Feb 2012 11:00:00 -0000',
'Mon, 05 Nov 2012 11:00:00 -0000',
'Wed, 05 Feb 2012 11:00:00 -0000',
'Tue, 05 Nov 2012 11:00:00 -0000',
'Tue, 05 Dec 2012 11:00:00 -0000',
'Wed, 05 Jan 2012 11:00:00 -0000',
]
# join with a space so the end of one date does not fuse with the start of the next
listb = ' '.join(lista).split()
count = {}
for item in listb:
    count[item] = listb.count(item)
months = ['Jan', 'Feb', 'Aug', 'Nov', 'Dec']
for k in count:
    if k in months:
        print(f"{k}: {count[k]}")
Output:
(xenial)vash@localhost:~/python/stack_overflow$ python3.7 count_months.py
Aug: 3
Jan: 3
Feb: 2
Nov: 2
Dec: 1
What happens is that we take all the items of lista and join them into one string. Then we split that string to get all the individual words.
Now we can use the count method and create a dictionary to hold the counts. We can create a list of the items we want to retrieve from the dictionary and retrieve only those keys.
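Note that listb.count(item) rescans the whole list for every word, which can get slow on a very large array. A single-pass variant could look like the sketch below. The name wanted and the shortened sample list are only there for illustration:

```python
lista = [
    'Fri, 19 Aug 2011 19:28:17 -0000',
    'Sun, 19 Jan 2011 19:28:17 -0000',
    'Mon, 05 Feb 2012 11:00:00 -0000',
    'Wed, 05 Jan 2012 11:00:00 -0000',
]

# a set gives fast membership tests for the abbreviations we care about
wanted = {'Jan', 'Feb', 'Aug', 'Nov', 'Dec'}

count = {}
for date_string in lista:
    for word in date_string.split():
        if word in wanted:
            # each word is visited exactly once
            count[word] = count.get(word, 0) + 1

print(count)
```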

Related

Group python dataframe and display all correspond values for each unique key in a dictionary

I have the following dataset
id      date
7510    15 Jun 2020
7510    16 Jun 2020
7512    15 Jun 2020
7512    07 Jul 2020
7520    15 Jun 2020
7520    16 Aug 2020
I need to convert this to a dictionary, which is quite straightforward, but I need each unique id as a key and all corresponding dates as the values of that key.
For example:
dictionary = {7510: ["15 Jun 2020", "16 Jun 2020"], 7512: ["15 Jun 2020", "07 Jul 2020"],
7520: ["15 Jun 2020", "16 Aug 2020"] }
Try this:
df.groupby('id')['date'].agg(list).to_dict()
Output:
{7510: ['15 Jun 2020', '16 Jun 2020'],
7512: ['15 Jun 2020', '07 Jul 2020'],
7520: ['15 Jun 2020', '16 Aug 2020']}
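To see what the grouping computes, here is roughly the same result spelled out in plain Python. This is only a sketch of the idea, not how pandas does it internally. The rows list is an assumed stand-in for the (id, date) pairs of the DataFrame:

```python
# assumed stand-in for the rows of the DataFrame shown in the question
rows = [
    (7510, '15 Jun 2020'), (7510, '16 Jun 2020'),
    (7512, '15 Jun 2020'), (7512, '07 Jul 2020'),
    (7520, '15 Jun 2020'), (7520, '16 Aug 2020'),
]

dictionary = {}
for row_id, date in rows:
    # create the list the first time an id is seen, then keep appending to it
    dictionary.setdefault(row_id, []).append(date)

print(dictionary)
```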

Sort a list of dictionaries of dates by value

I'm trying to sort a list of dictionaries so that the entries from the current year come first.
mdlist = [{0:'31 Jan 2022', 1:'', 2:'10 Feb 2022'},
          {0:'10 Feb 2021', 1:'20 Feb 2021', 2:''},
          {0:'10 Feb 2022', 1:'10 Feb 2022', 2:'10 Feb 2022'}]
mdlist = sorted(mdlist, key=lambda d:d[0])
But it is not working as expected.
Expected output:
mdlist = [{0:'31 Jan 2022', 1:'', 2:'10 Feb 2022'},
{0:'10 Feb 2022', 1:'10 Feb 2022', 2:'10 Feb 2022'},
{0:'10 Feb 2021', 1:'20 Feb 2021', 2:''}]
Since these are dates, you could parse them with the datetime module and sort by year in descending order, then by month and day in ascending order:
from datetime import datetime
def sorting_key(dct):
    ymd = datetime.strptime(dct[0], "%d %b %Y")
    return -ymd.year, ymd.month, ymd.day
mdlist.sort(key=sorting_key)
Output:
[{0: '31 Jan 2022', 1: '', 2: '10 Feb 2022'},
{0: '10 Feb 2022', 1: '10 Feb 2022', 2: '10 Feb 2022'},
{0: '10 Feb 2021', 1: '20 Feb 2021', 2: ''}]
Use a key function that returns 0 if the year is 2022 and 1 otherwise. This sorts all the 2022 dates first; because sorted() is stable, entries with the same key keep their original order.
firstyear = '2022'
mdlist = sorted(mdlist, key=lambda d: 0 if d[0].split()[-1] == firstyear else 1)
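If "current year" should follow the calendar instead of being hard-coded, the two ideas above can be combined. The sketch below is one possible way. The function name current_year_first and its year parameter are made up for illustration. It also assumes key 0 always holds a non-empty "DD Mon YYYY" string and that %b matches English month abbreviations in your locale:

```python
from datetime import datetime

def current_year_first(rows, year=None):
    """Return rows sorted with entries from `year` first (defaults to today's year)."""
    if year is None:
        year = datetime.now().year

    def key(d):
        parsed = datetime.strptime(d[0], "%d %b %Y")
        # False sorts before True, so entries from `year` come first;
        # remaining years are newest first, then month and day ascending
        return (parsed.year != year, -parsed.year, parsed.month, parsed.day)

    return sorted(rows, key=key)

mdlist = [{0: '31 Jan 2022', 1: '', 2: '10 Feb 2022'},
          {0: '10 Feb 2021', 1: '20 Feb 2021', 2: ''},
          {0: '10 Feb 2022', 1: '10 Feb 2022', 2: '10 Feb 2022'}]

# the year is passed explicitly here so the example does not depend on today's date
result = current_year_first(mdlist, year=2022)
print([d[0] for d in result])
```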

How to add two array lists to a single pandas DataFrame with two separate column names

Hello, I have a program that takes two arrays, "Year_Array" and "Month_Array", and generates output according to some conditions.
I want to put both arrays into a single DataFrame with the column names year and month, so that in the future I can combine that DataFrame with other DataFrames.
Below is the sample code:
Year_Array=[2010,2011,2012,2013,2014]
Month_Array=['Jan','Feb','Mar','April','May','June','July','Aug','Sep','Oct','Nov','Dec']
segment=[1, 1, 3, 5, 2, 1, 1, 1, 2, 1, 6, 1]
p=0
for p in range(0, len(Year_Array), 1):
    c=0
    for i in range(0, len(segment),1):
        h = segment[i]
        for j in range(0,int(h) , 1):
            print((Year_Array[p]) ,(Month_Array[c]))
        c += 1
Based on segment, the code produces this output:
2010 Jan
2010 Feb
2010 Mar
2010 Mar
2010 Mar
2010 April
2010 April
2010 April
2010 April
2010 April
2010 May
2010 May
2010 June
2010 July
2010 Aug
2010 Sep
2010 Sep
2010 Oct
2010 Nov
2010 Nov
2010 Nov
2010 Nov
2010 Nov
2010 Nov
2010 Dec
2011 Jan
2011 Feb
2011 Mar
2011 Mar
2011 Mar
2011 April
......
......
2012 Jan
2012 Feb
2012 Mar
2012 Mar
2012 Mar
2012 April
......
......so on till 2014
I want to store all of this output in a single DataFrame. I tried it this way:
df=pd.DataFrame(Year_Array[p])
print(df)
df.columns = ['year']
print("df-",df)
df1=pd.DataFrame(Month_Array[c])
df1.columns = ['month']
print(df1)
If I write the following instead, it also prints only the array values, not the output:
df=pd.DataFrame(Year_Array)
print(df)
But this is not working. I want the DataFrame to contain the same output as the printed loop, with the column names "year" and "month". Please tell me how to do it. Thanks.
You can build an array with the expected output and create a DataFrame from it.
Edit: added column names to the DataFrame.
import pandas as pd
Year_Array=[2010,2011,2012,2013,2014]
Month_Array=['Jan','Feb','Mar','April','May','June','July','Aug','Sep','Oct','Nov','Dec']
final_Array=[]
segment=[1, 1, 3, 5, 2, 1, 1, 1, 2, 1, 6, 1]
for p in range(0, len(Year_Array), 1):
    c=0  # index into Month_Array, restarted for every year
    for i in range(0, len(segment),1):
        h = segment[i]  # how many times the current month is repeated
        for j in range(0,int(h) , 1):
            final_Array.append((Year_Array[p], Month_Array[c]))
        c += 1
data = pd.DataFrame(final_Array,columns=['year','month'])
data.head()
Output:
   year month
0  2010   Jan
1  2010   Feb
2  2010   Mar
3  2010   Mar
4  2010   Mar
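The nested loops can also be written as one comprehension, pairing each month with its repeat count from segment. This is a sketch of the same idea rather than a different result. The name rows is just illustrative:

```python
Year_Array = [2010, 2011, 2012, 2013, 2014]
Month_Array = ['Jan', 'Feb', 'Mar', 'April', 'May', 'June',
               'July', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
segment = [1, 1, 3, 5, 2, 1, 1, 1, 2, 1, 6, 1]

rows = [
    (year, month)
    for year in Year_Array
    for month, repeats in zip(Month_Array, segment)  # month paired with its repeat count
    for _ in range(repeats)                          # emit the pair that many times
]

print(rows[:5])
print(len(rows))
```

The resulting rows list can then be passed to pd.DataFrame(rows, columns=['year', 'month']) in the same way as final_Array.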

Order dates inside a string from a list

I have a list of lists like this example input:
thislist= [[1, 'Aug 2014, Sept 2016, Ian 2014, Feb 2016', 2], [5,'Aug 2015, Sept 2012, Ian 2015, Aug 2017',4]]
I'm only interested in index [1] of each inner list (the one with the dates), and my desired output is this:
thislist= [[1, 'Ian, Aug 2014; Feb, Sept 2016', 2], [5,'Sept 2012; Ian, Aug 2015; Aug 2017',4]]
(The above is just an example; in my actual case I will have many more dates with years, but the format is exactly the same.)
Basically, I want to sort the month abbreviations (they are in Romanian, but they are quite similar to the English ones) in calendar order (e.g. Ian, Feb, Mar, Apr, etc.), group them by year in chronological order (2010, 2011, 2012, 2013, etc.) as in the example, and use ";" as the separator between years. How can I do this? I think the only option is regex, but I'm not that good with it, so how can I get to my desired output? I'm using Python 3. Thank you so much for your time!
Note that "%B %Y" expects the full month name. The example below uses full English names because the Romanian and English month abbreviations do not match in all cases.
from datetime import datetime
thislist = [[1, 'August 2014, September 2016, January 2014, February 2016', 2],
            [5, 'August 2015, September 2012, January 2015, February 2017', 4]]
i = 0
for dates in thislist:
    sorted_list = []
    chgDates = dates[1].split(",")
    for test1 in chgDates:
        sorted_list.append(test1.strip())
    # sort chronologically by parsing "<full month name> <year>"
    test = sorted(sorted_list, key=lambda x: datetime.strptime(x, "%B %Y"))
    str1 = ', '.join(test)
    thislist[i][1] = str1.replace(",", ";")
    i += 1
print(thislist)
Response:
[[1, 'January 2014; August 2014; February 2016; September 2016', 2], [5, 'September 2012; January 2015; August 2015; February 2017', 4]]
Now you can translate from English to Romanian. It is worth reading a little about lists and dictionaries in Python; I don't think you will receive a full answer if you just wait for the community.
from datetime import datetime
import re
thislist = [[1, 'August 2014, September 2016, January 2014, February 2016, March 2016', 2],
            [5, 'August 2015, September 2012, January 2015, February 2017', 4]]
i = 0
def translateInRo(string, dyct):
    # try longer keys first so a short key never shadows a longer one
    substrs = sorted(dyct, key=len, reverse=True)
    regexp = re.compile('|'.join(map(re.escape, substrs)))
    return regexp.sub(lambda match: dyct[match.group(0)], string)
for dates in thislist:
    sorted_list = []
    chgDates = dates[1].split(",")
    for test1 in chgDates:
        sorted_list.append(test1.strip())
    test = sorted(sorted_list, key=lambda x: datetime.strptime(x, "%B %Y"))
    str1 = ', '.join(test)
    translate = translateInRo(
        str1, {"September": "Septembrie", "January": "Ianuarie", "February": "Februarie", "March": "Martie"})
    thislist[i][1] = translate
    i += 1
print(thislist)
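To get the grouped format from the question directly (months in calendar order, one group per year, ";" between years), you could skip datetime and regex and rank the abbreviations yourself. The sketch below takes that different approach. MONTH_ORDER is an assumed list: only Ian, Feb, Aug and Sept appear in the question, so adjust the other entries to the abbreviations your data really uses. The name group_by_year is also made up:

```python
from collections import defaultdict

# Assumed calendar order of the abbreviations; only 'Ian', 'Feb', 'Aug' and
# 'Sept' come from the question, the remaining entries are illustrative.
MONTH_ORDER = ['Ian', 'Feb', 'Mar', 'Apr', 'Mai', 'Iun',
               'Iul', 'Aug', 'Sept', 'Oct', 'Noi', 'Dec']
RANK = {name: position for position, name in enumerate(MONTH_ORDER)}

def group_by_year(text):
    """Turn 'Aug 2014, Sept 2016, Ian 2014' into 'Ian, Aug 2014; Sept 2016'."""
    by_year = defaultdict(list)
    for item in text.split(','):
        month, year = item.split()
        by_year[int(year)].append(month)
    groups = []
    for year in sorted(by_year):                       # years in chronological order
        months = sorted(by_year[year], key=RANK.__getitem__)  # months in calendar order
        groups.append(f"{', '.join(months)} {year}")
    return '; '.join(groups)

thislist = [[1, 'Aug 2014, Sept 2016, Ian 2014, Feb 2016', 2],
            [5, 'Aug 2015, Sept 2012, Ian 2015, Aug 2017', 4]]
for row in thislist:
    row[1] = group_by_year(row[1])
print(thislist)
```

An abbreviation missing from MONTH_ORDER raises a KeyError, which makes unexpected spellings visible instead of sorting them silently.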

how to print dictionary data in the tabular form in python

I have a nested dictionary as follows, and I want to print the data in tabular form. The condition is that I only want to print certain fields in the table, for example BIRT, NAME, and SEX. How can I do that?
import sys
import pandas as pd
indi ={}
indi = {'#I7#': {'BIRT': '15 NOV 1925', 'FAMS': '#F2#', 'NAME': 'Rose /Campbell/', 'DEAT': '26 AUG 2009', 'SEX': 'F'}, '#I5#': {'BIRT': '15 SEP 1928', 'FAMS': '#F3#', 'NAME': 'Rosy /Huleknberg/', 'DEAT': '10 MAR 2010', 'SEX': 'F'}}
person = pd.DataFrame(indi).T
person.fillna(0, inplace=True)
print(person)
Output:
             BIRT         DEAT  FAMC  FAMS               NAME  SEX
#I5#  15 SEP 1928  10 MAR 2010     0  #F3#  Rosy /Huleknberg/    F
#I7#  15 NOV 1925  26 AUG 2009     0  #F2#    Rose /Campbell/    F
You can use DataFrame.filter to keep only the columns you need:
print(person.filter(['BIRT', 'NAME', 'SEX']))
Documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.filter.html
Output will be:
             BIRT               NAME  SEX
#I5#  15 SEP 1928  Rosy /Huleknberg/    F
#I7#  15 NOV 1925    Rose /Campbell/    F
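If you would rather not go through pandas just for printing, a plain-Python sketch along these lines could work as well. The helper format_table is hypothetical and not part of any library. It left-aligns the cells and pads each column to its widest entry:

```python
indi = {
    '#I7#': {'BIRT': '15 NOV 1925', 'FAMS': '#F2#', 'NAME': 'Rose /Campbell/',
             'DEAT': '26 AUG 2009', 'SEX': 'F'},
    '#I5#': {'BIRT': '15 SEP 1928', 'FAMS': '#F3#', 'NAME': 'Rosy /Huleknberg/',
             'DEAT': '10 MAR 2010', 'SEX': 'F'},
}

def format_table(records, columns):
    """Build a text table with one row per record and only the given columns."""
    key_width = max(len(key) for key in records)
    # each column is as wide as its header or its longest value, whichever is longer
    widths = {
        col: max([len(col)] + [len(str(rec.get(col, ''))) for rec in records.values()])
        for col in columns
    }
    header = ' ' * key_width + '  ' + '  '.join(col.ljust(widths[col]) for col in columns)
    lines = [header.rstrip()]
    for key, rec in records.items():
        cells = '  '.join(str(rec.get(col, '')).ljust(widths[col]) for col in columns)
        lines.append((key.ljust(key_width) + '  ' + cells).rstrip())
    return '\n'.join(lines)

table = format_table(indi, ['BIRT', 'NAME', 'SEX'])
print(table)
```

Missing fields are printed as empty cells here, whereas the pandas version above fills them with 0.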
