I am trying to convert the values of a python dataframe into column headers. I am using the transpose function, but the results are not as expected. Which function can be used to accomplish the result given below?
The data is:
Year 2020
Month SEPTEMBER
Filed Date 29-11-2020
Year 2022
Month JULY
Filed Date 20-08-2022
Year 2022
Month APRIL
Filed Date 20-05-2022
Year 2017
Month AUGUST
Filed Date 21-09-2017
Year 2018
Month JULY
Filed Date 03-02-2019
Year 2021
Month MAY
Filed Date 22-06-2021
Year 2017
Month DECEMBER
Filed Date 19-01-2018
Year 2018
Month MAY
Filed Date 03-02-2019
Year 2019
Month MARCH
Filed Date 28-09-2019
and I want to convert it into:
Year Month Filed Date
2020 September 29-11-2020
2022 July 20-08-2022
You can do it like this:
df = pd.DataFrame(
    [df1.iloc[i:i+3][1].tolist() for i in range(0, len(df1), 3)],
    columns=df1.iloc[0:3][0].tolist(),
)
print(df):
Year Month Filed
0 2020 SEPTEMBER Date 29-11-2020
1 2022 JULY Date 20-08-2022
2 2022 APRIL Date 20-05-2022
3 2017 AUGUST Date 21-09-2017
4 2018 JULY Date 03-02-2019
5 2021 MAY Date 22-06-2021
6 2017 DECEMBER Date 19-01-2018
7 2018 MAY Date 03-02-2019
8 2019 MARCH Date 28-09-2019
I have found a solution to my problem. Here df1 is the same two-column frame shown in the question, with the labels ('Year', 'Month', 'Filed Date') in column 'A' and the values in column 'B'.
I used the pivot function and approached the problem like this:
df = pd.DataFrame()
for i in range(0, len(df1), 3):
    # DataFrame.append was removed in pandas 2.x; pd.concat is the replacement
    df = pd.concat([df, df1.pivot(columns='A', values='B').bfill(axis=0).iloc[[i]]])
df.reset_index(drop=True, inplace=True)
print(df)
result:
A Filed Date Month Year
0 29-11-2020 SEPTEMBER 2020
1 20-08-2022 JULY 2022
2 20-05-2022 APRIL 2022
3 21-09-2017 AUGUST 2017
4 03-02-2019 JULY 2018
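For reference, since every record spans exactly three rows, a plain reshape does the same job without a loop. A minimal sketch, assuming df1 has the 'A'/'B' columns described above:
import pandas as pd

# Hypothetical two-record frame in the same shape as df1 above.
df1 = pd.DataFrame({
    'A': ['Year', 'Month', 'Filed Date'] * 2,
    'B': ['2020', 'SEPTEMBER', '29-11-2020', '2022', 'JULY', '20-08-2022'],
})

# Each record is 3 consecutive rows of column B: reshape into 3 columns
# and label them with the first record's column A values.
df = pd.DataFrame(
    df1['B'].to_numpy().reshape(-1, 3),
    columns=df1['A'].iloc[:3].tolist(),
)
print(df)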
I have a program that takes two arrays, "Year_Array" and "Month_Array", and generates output according to some conditions.
I want to put both arrays into a single dataframe with the column names "year" and "month", so that later I can combine that dataframe with other dataframes.
Below is the sample code:
Year_Array=[2010,2011,2012,2013,2014]
Month_Array=['Jan','Feb','Mar','April','May','June','July','Aug','Sep','Oct','Nov','Dec']
segment=[1, 1, 3, 5, 2, 1, 1, 1, 2, 1, 6, 1]

for p in range(0, len(Year_Array)):
    c = 0
    for i in range(0, len(segment)):
        h = segment[i]
        for j in range(0, int(h)):
            print(Year_Array[p], Month_Array[c])
        c += 1
Based on segment, the output is generated like this:
2010 Jan
2010 Feb
2010 Mar
2010 Mar
2010 Mar
2010 April
2010 April
2010 April
2010 April
2010 April
2010 May
2010 May
2010 June
2010 July
2010 Aug
2010 Sep
2010 Sep
2010 Oct
2010 Nov
2010 Nov
2010 Nov
2010 Nov
2010 Nov
2010 Nov
2010 Dec
2011 Jan
2011 Feb
2011 Mar
2011 Mar
2011 Mar
2011 April
......
......
2012 Jan
2012 Feb
2012 Mar
2012 Mar
2012 Mar
2012 April
......
......so on till 2014
I want to store all of this output in a single dataframe. I tried it this way:
df=pd.DataFrame(Year_Array[p])
print(df)
df.columns = ['year']
print("df-",df)
df1=pd.DataFrame(Month_Array[c])
df1.columns = ['month']
print(df1)
If I write the following, it also prints only the array values, not the generated output:
df=pd.DataFrame(Year_Array)
print(df)
But this is not working. I want the same output as printed above, stored in a dataframe with the column names "year" and "month". Please tell me how to do it. Thanks.
You can create an array with the expected output and create a dataframe from it.
Edit : added column name to dataframe.
import pandas as pd

Year_Array=[2010,2011,2012,2013,2014]
Month_Array=['Jan','Feb','Mar','April','May','June','July','Aug','Sep','Oct','Nov','Dec']
final_Array=[]
segment=[1, 1, 3, 5, 2, 1, 1, 1, 2, 1, 6, 1]

for p in range(0, len(Year_Array)):
    c = 0
    for i in range(0, len(segment)):
        h = segment[i]
        # print(h)
        for j in range(0, int(h)):
            final_Array.append((Year_Array[p], Month_Array[c]))
        c += 1

data = pd.DataFrame(final_Array, columns=['year','month'])
data.head()
Output :
year month
0 2010 Jan
1 2010 Feb
2 2010 Mar
3 2010 Mar
4 2010 Mar
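For what it's worth, the same list can be built more compactly with a comprehension over zip (same arrays as above):
final_Array = [
    (year, month)
    for year in Year_Array
    for month, count in zip(Month_Array, segment)
    for _ in range(count)
]
data = pd.DataFrame(final_Array, columns=['year', 'month'])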
I have a large file containing multiple lines, and some lines have a unique pattern. I want to split the large file based on this pattern.
Below is the data in the text file:
commit e6bcab96ffe1f55e80be8d0c1e5342fb9d69ca30
Date: Sat Jun 9 04:11:37 2018 +0530
configurations
commit 5c8deb3114b4ed17c5d2ea31842869515073670f
Date: Sat Jun 9 02:59:56 2018 +0530
remote
commit 499516b7e4f95daee4f839f34cc46df404b52d7a
Date: Sat Jun 9 02:52:51 2018 +0530
remote fix
This reverts commit 0a2917bd49eec7ca2f380c890300d75b69152353.
commit 349e1b42d3b3d23e95a227a1ab744fc6167e6893
Date: Sat Jun 9 02:52:37 2018 +0530
Revert "Removing the printf added"
This reverts commit da0fac94719176009188ce40864b09cfb84ca590.
commit 8bfd4e7086ff5987491f280b57d10c1b6e6433fe
Date: Sat Jun 9 02:52:18 2018 +0530
Revert Bulk
This reverts commit c2ee318635987d44e579c92d0b86b003e1d2a076.
commit bcb10c54068602a96d367ec09f08530ede8059ef
Date: Fri Jun 8 19:53:03 2018 +0530
fix crash observed
commit a84169f79fbe9b18702f6885b0070bce54d6dd5a
Date: Fri Jun 8 18:14:21 2018 +0530
Interface PBR
commit 254726fe3fe0b9f6b228189e8a6fe7bdf4aa9314
Date: Fri Jun 8 18:12:10 2018 +0530
Crash observed
commit 18e7106d54e19310d32e8b31d584cec214fb2cb7
Date: Fri Jun 8 18:09:13 2018 +0530
Changes to fix crash
Currently my code is as below:
import re

readtxtfile = r'C:\gitlog.txt'
with open(readtxtfile) as fp:
    txtrawdata = fp.read()

commits = re.split(r'^(commit|)[ a-zA-Z0-9]{40}$', txtrawdata)
print(commits)
Expected Output:
I want to split the above string on lines like "commit 18e7106d54e19310d32e8b31d584cec214fb2cb7" and convert the pieces into a python list.
import re
text = ''' commit e6bcab96ffe1f55e80be8d0c1e5342fb9d69ca30
Date: Sat Jun 9 04:11:37 2018 +0530
configurations
commit 5c8deb3114b4ed17c5d2ea31842869515073670f
Date: Sat Jun 9 02:59:56 2018 +0530
remote
commit 499516b7e4f95daee4f839f34cc46df404b52d7a
Date: Sat Jun 9 02:52:51 2018 +0530
remote fix
This reverts commit 0a2917bd49eec7ca2f380c890300d75b69152353.'''
print(re.split(r'^\s*commit \S*\s*', text, flags=re.MULTILINE))
This outputs:
['', 'Date: Sat Jun 9 04:11:37 2018 +0530\n\n configurations\n', 'Date: Sat Jun 9 02:59:56 2018 +0530\n\n remote\n', 'Date: Sat Jun 9 02:52:51 2018 +0530\n\n remote fix\n This reverts commit 0a2917bd49eec7ca2f380c890300d75b69152353.']
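If you also want to keep the commit lines themselves in the result, re.split returns the contents of capturing groups as well, so a small variation on the same pattern works:
parts = re.split(r'^\s*(commit \S+)\s*', text, flags=re.MULTILINE)
# parts now alternates between each commit line and the text that follows it,
# starting with whatever precedes the first commit (here, an empty string).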
Another option is re.findall, which keeps each commit block whole:
import re

# data holds the full contents of the gitlog file
groups = re.findall(r'(^\s*commit\s+[a-z0-9]+.*?)(?=^commit|\Z)', data, flags=re.DOTALL|re.MULTILINE)

for g in groups:
    print(g)
    print('-' * 80)
Prints:
commit e6bcab96ffe1f55e80be8d0c1e5342fb9d69ca30
Date: Sat Jun 9 04:11:37 2018 +0530
configurations
--------------------------------------------------------------------------------
commit 5c8deb3114b4ed17c5d2ea31842869515073670f
Date: Sat Jun 9 02:59:56 2018 +0530
remote
--------------------------------------------------------------------------------
commit 499516b7e4f95daee4f839f34cc46df404b52d7a
Date: Sat Jun 9 02:52:51 2018 +0530
remote fix
This reverts commit 0a2917bd49eec7ca2f380c890300d75b69152353.
--------------------------------------------------------------------------------
...and so on
This will extract the commit shas:
import re

commits = list()
readtxtfile = r'C:\gitlog.txt'
with open(readtxtfile) as fp:
    for line in fp:
        m = re.match(r'^commit\s+([a-f0-9]{40})$', line)
        if m:
            commits.append(m.group(1))  # group(1) is just the sha
commits is now a list of just the commit sha strings. If your gitlog output format changes, this matching regex will need to change with it. Make sure you're generating the log with --no-abbrev-commit.
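If you need each sha together with its block of text, the two ideas above combine naturally. A sketch, assuming data holds the full file contents as in the first answer (blocks is a hypothetical name):
import re

# Map each sha to its full commit block.
blocks = {}
for block in re.findall(r'(^commit\s+[a-f0-9]{40}.*?)(?=^commit\s|\Z)',
                        data, flags=re.DOTALL | re.MULTILINE):
    sha = block.split(None, 2)[1]  # second whitespace-separated token
    blocks[sha] = block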
For example, say I have the following event data and want to find clusters of at least 2 events that are within 1 minute of each other, in which id_1, id_2, and id_3 are all the same. For reference, I have the epoch timestamp (in microseconds) in addition to the date-time timestamp.
event_id timestamp id_1 id_2 id_3
9442823 Jun 15, 2015 10:22 PM PDT A 1 34567
9442821 Jun 15, 2015 10:22 PM PDT A 2 12345
9442817 Jun 15, 2015 10:22 PM PDT A 3 34567
9442814 Jun 15, 2015 10:22 PM PDT A 1 12345
9442813 Jun 15, 2015 10:22 PM PDT A 2 34567
9442810 Jun 15, 2015 10:22 PM PDT A 3 12345
9442805 Jun 15, 2015 10:22 PM PDT A 1 34567
9442876 Jun 15, 2015 10:23 PM PDT A 2 12345
9442866 Jun 15, 2015 10:23 PM PDT A 3 34567
9442858 Jun 15, 2015 10:23 PM PDT A 1 12345
9442845 Jun 15, 2015 10:23 PM PDT C 2 34567
9442840 Jun 15, 2015 10:23 PM PDT C 3 12345
9442839 Jun 15, 2015 10:23 PM PDT C 1 34567
9442838 Jun 15, 2015 10:23 PM PDT C 2 12345
9442907 Jun 15, 2015 10:24 PM PDT C 3 34567
9442886 Jun 15, 2015 10:24 PM PDT C 1 12345
9442949 Jun 15, 2015 10:25 PM PDT C 2 34567
9442934 Jun 15, 2015 10:25 PM PDT C 3 12345
For each cluster found, I want to return a set of (id_1, id_2, id_3, [list of event_ids], min_timestamp_of_cluster, max_timestamp_of_cluster). Additionally, if there's a cluster with (e.g.) 6 events, I'd only want to return a single result with all events, not one for each grouping of 3 events.
If I understood your problem correctly, you can make use of scikit-learn's DBSCAN clustering algorithm with a custom distance (or metric) function. Your distance function should return a very large number if any of id_1, id_2, or id_3 differ, and the time difference otherwise.
But with this method, the number of clusters is determined by the algorithm rather than given as an input. If you are determined to pass the number of clusters as an input, k-means is the clustering algorithm you may need to look into.
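A minimal sketch of the DBSCAN idea, assuming the events sit in a DataFrame df with columns id_1, id_2, id_3 and an epoch_ts column already converted from microseconds to seconds (the names here are illustrative):
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

# Numeric feature matrix: the three ids factorized to integer codes,
# followed by the timestamp in seconds.
X = np.column_stack([
    pd.factorize(df['id_1'])[0],
    pd.factorize(df['id_2'])[0],
    pd.factorize(df['id_3'])[0],
    df['epoch_ts'].to_numpy(dtype=float),
])

def event_distance(a, b):
    # A differing id triple makes two events effectively infinitely far apart.
    if not np.array_equal(a[:3], b[:3]):
        return 1e12
    return abs(a[3] - b[3])

# eps=60 seconds and min_samples=2 match the question; label -1 means
# "noise", i.e. the event belongs to no cluster.
labels = DBSCAN(eps=60, min_samples=2, metric=event_distance).fit_predict(X)
df['cluster'] = labels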
In pure python, use a "sliding window" that encompasses all the events in a 1 minute range.
The premise is simple: maintain a queue of events that is a subsequence
of the total list, in order. The "window" (queue) should be all the events you care about. In this case, that is determined by the 60-second time gap requirement.
As you process events, add one event to the end of the queue. If the first event in the queue is more than 60 seconds from the newly-added last event, slide the window forward by dropping the first event from the front of the queue.
This is python3:
import collections
import operator
import itertools
from datetime import datetime

#### FROM HERE: vvv is just faking events. Delete or replace.
class Event(collections.namedtuple('Event', 'event_id timestamp id_1 id_2 id_3 epoch_ts')):
    def __str__(self):
        return ('{e.event_id} {e.timestamp} {e.id_1} {e.id_2} {e.id_3}'
                .format(e=self))

def get_events():
    event_list = map(operator.methodcaller('strip'), '''
        9442823 Jun 15, 2015 10:22 PM PDT A 1 34567
        9442821 Jun 15, 2015 10:22 PM PDT A 2 12345
        9442817 Jun 15, 2015 10:22 PM PDT A 3 34567
        9442814 Jun 15, 2015 10:22 PM PDT A 1 12345
        9442813 Jun 15, 2015 10:22 PM PDT A 2 34567
        9442810 Jun 15, 2015 10:22 PM PDT A 3 12345
        9442805 Jun 15, 2015 10:22 PM PDT A 1 34567
        9442876 Jun 15, 2015 10:23 PM PDT A 2 12345
        9442866 Jun 15, 2015 10:23 PM PDT A 3 34567
        9442858 Jun 15, 2015 10:23 PM PDT A 1 12345
        9442845 Jun 15, 2015 10:23 PM PDT C 2 34567
        9442840 Jun 15, 2015 10:23 PM PDT C 3 12345
        9442839 Jun 15, 2015 10:23 PM PDT C 1 34567
        9442838 Jun 15, 2015 10:23 PM PDT C 2 12345
        9442907 Jun 15, 2015 10:24 PM PDT C 3 34567
        9442886 Jun 15, 2015 10:24 PM PDT C 1 12345
        9442949 Jun 15, 2015 10:25 PM PDT C 2 34567
        9442934 Jun 15, 2015 10:25 PM PDT C 3 12345
    '''.strip().splitlines())

    for line in event_list:
        idev, *rest = line.split()
        ts = rest[:6]
        id1, id2, id3 = rest[6:]
        id2 = int(id2)  # faster when sorting (see find_clustered_events)
        id3 = int(id3)  # faster when sorting (see find_clustered_events)
        ts_str = ' '.join(ts)
        dt = datetime.strptime(ts_str.replace('PDT', '-0700'), '%b %d, %Y %I:%M %p %z')
        epoch = dt.timestamp()
        ev = Event(idev, ts_str, id1, id2, id3, epoch)
        yield ev
#### TO HERE: ^^^ was just faking up your events. Delete or replace.
def add_cluster(key, group):
    '''Do whatever you want with the clusters. I'll print them.'''
    print('Cluster:', key)
    print('  ', '\n  '.join(map(str, group)), sep='')

def find_clustered_events(events, cluster=3, gap_secs=60):
    '''Call add_cluster on clusters of events within a maximum time gap.

    Args:
        events (iterable): series of events, in chronological order
        cluster (int): minimum number of events in a cluster
        gap_secs (float): maximum time-gap from start to end of cluster

    Returns:
        None.
    '''
    window = collections.deque()
    evkey = lambda e: (e.id_1, e.id_2, e.id_3)

    for ev in events:
        window.append(ev)
        t0 = window[0].epoch_ts
        tn = window[-1].epoch_ts

        if tn - t0 < gap_secs:
            continue

        # The newest event pushed the window past the gap: set it aside,
        # report any clusters in the current window, then slide forward.
        window.pop()

        for k, g in itertools.groupby(sorted(window, key=evkey), key=evkey):
            group = tuple(g)
            if len(group) >= cluster:
                add_cluster(k, group)

        window.append(ev)
        window.popleft()
# Call find_clustered_events with the event generator and cluster arguments.
# Note that your data doesn't have any 3-event clusters, so cluster=2 is used here. :-(
find_clustered_events(get_events(), cluster=2)
The output looks like this:
$ python test.py
Cluster: ('A', 1, 34567)
9442823 Jun 15, 2015 10:22 PM PDT A 1 34567
9442805 Jun 15, 2015 10:22 PM PDT A 1 34567
Cluster: ('A', 2, 12345)
9442821 Jun 15, 2015 10:22 PM PDT A 2 12345
9442876 Jun 15, 2015 10:23 PM PDT A 2 12345
Cluster: ('A', 3, 34567)
9442817 Jun 15, 2015 10:22 PM PDT A 3 34567
9442866 Jun 15, 2015 10:23 PM PDT A 3 34567
Cluster: ('A', 1, 12345)
9442814 Jun 15, 2015 10:22 PM PDT A 1 12345
9442858 Jun 15, 2015 10:23 PM PDT A 1 12345
Cluster: ('C', 2, 34567)
9442845 Jun 15, 2015 10:23 PM PDT C 2 34567
9442949 Jun 15, 2015 10:25 PM PDT C 2 34567
Please note: the code above doesn't try to keep track of events already in a cluster. So if you have, for example, an event type that occurs every fifteen seconds, you will have a sequence like this:
event1 t=0:00
event2 t=0:15
event3 t=0:30
event4 t=0:45
event5 t=1:00
And you will get overlapping clusters:
event1, event2, event3 (t=0:00 .. 0:30)
event2, event3, event4 (t=0:15 .. 0:45)
event3, event4, event5 (t=0:30 .. 1:00)
Technically, those are valid clusters, each slightly different. But you may wish to expunge previously-clustered events from the window, if you want the events to only appear in a single cluster.
Alternatively, if the chance of clustering and repetition is low, it might improve performance to implement repeat-checking in the add_cluster() function, to reduce the work done by the main loop.
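A minimal sketch of that repeat-check, assuming event_id uniquely identifies an event (the seen set is my addition, not part of the code above):
seen = set()

def add_cluster(key, group):
    '''Report a cluster, skipping groups that were already reported in full.'''
    new_ids = {ev.event_id for ev in group} - seen
    if not new_ids:
        return  # every event here was already part of a reported cluster
    seen.update(ev.event_id for ev in group)
    print('Cluster:', key)
    print('  ', '\n  '.join(map(str, group)), sep='')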
A final note: this does a LOT of sorting. And the sorting is not efficient, since it gets repeated every time a new event appears. If you have a large data set, the performance will probably be bad. If your event keys are relatively few - that is, if the id1,2,3 values tend to repeat over and over again - you would be better off dynamically creating separate deques for each distinct key (id1+id2+id3) and dispatching the event to the appropriate deque, applying the same window logic, and then checking the length of the deque.
On the other hand, if you are processing something like web-server logs, where the requester is always changing, that might tend to create a memory problem with all the useless deques. So this is a memory vs. time trade-off you'll have to be aware of.
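A rough sketch of that per-key variant (illustrative, not a drop-in replacement; it calls add_cluster as soon as a key's window reaches the threshold, so the repeat-checking above still applies):
def find_clustered_events_by_key(events, cluster=3, gap_secs=60):
    # One deque per distinct (id_1, id_2, id_3) key: each deque only
    # holds events sharing that key, so no sorting or groupby is needed.
    windows = collections.defaultdict(collections.deque)
    for ev in events:
        key = (ev.id_1, ev.id_2, ev.id_3)
        w = windows[key]
        w.append(ev)
        # Slide this key's window: drop events older than gap_secs.
        while ev.epoch_ts - w[0].epoch_ts > gap_secs:
            w.popleft()
        if len(w) >= cluster:
            add_cluster(key, tuple(w))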
You can use this to create the dataframe:
xyz = pd.DataFrame({'release' : ['7 June 2013', '2012', '31 January 2013',
'February 2008', '17 June 2014', '2013']})
I am trying to split the data and save it into 3 columns named "day", "month" and "year", using this command:
dataframe[['day','month','year']] = dataframe['release'].str.rsplit(expand=True)
As the resulting dataframe shows, this works perfectly when the split yields 3 strings, but whenever it yields fewer than 3, the data is saved in the wrong place.
I have tried split and rsplit; both give the same result.
Any solution to get the data into the right place?
The last token is the year and is present in every row, so it should be saved first; then the month, if present, otherwise nothing; and the day should be stored the same way.
You could reverse each row's split tokens with apply:
In [17]: dataframe[['year', 'month', 'day']] = dataframe['release'].apply(
    ...:     lambda x: pd.Series(x.split()[::-1]))
In [18]: dataframe
Out[18]:
release year month day
0 7 June 2013 2013 June 7
1 2012 2012 NaN NaN
2 31 January 2013 2013 January 31
3 February 2008 2008 February NaN
4 17 June 2014 2014 June 17
5 2013 2013 NaN NaN
Try reversing each row's split result before assigning, so the year always comes first and pandas pads any missing month/day with None:
dataframe[['year','month','day']] = pd.DataFrame(
    [parts[::-1] for parts in dataframe['release'].str.split()],
    index=dataframe.index,
)
(Reversing the expand=True columns wholesale would not work, because short rows are padded on the right.)