Replace -1 in pandas series with unique values - python

I have a pandas series that can have positive integers (0, 8, 10, etc) and -1s:
id values
1137 -1
1097 -1
201 8
610 -1
594 -1
727 -1
970 21
300 -1
243 0
715 -1
946 -1
548 4
Name: cluster, dtype: int64
I want to replace each -1 with a value that doesn't already exist in the series and that is unique among the replacements; in other words, I can't fill twice with, for example, 90. What's the most pythonic way to do that?
Here is the expected output:
id values
1137 1
1097 2
201 8
610 3
594 5
727 6
970 21
300 7
243 0
715 9
946 10
548 4
Name: cluster, dtype: int64

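The idea is to generate candidate values with np.arange, making the range long enough that enough unused values remain once the existing ones are removed, then take the set difference with the existing values and assign the result to the rows that hold -1:
import numpy as np
m = df['values'] != -1  # True where a real value is already present
# candidate values minus the values already in use
s = np.setdiff1d(np.arange(len(df) + m.sum()), df.loc[m, 'values'])
df.loc[~m, 'values'] = s[:(~m).sum()]  # fill the -1 rows with the first unused values
print(df)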
id values
0 1137 1
1 1097 2
2 201 8
3 610 3
4 594 5
5 727 6
6 970 21
7 300 7
8 243 0
9 715 9
10 946 10
11 548 4
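The same idea can be sketched on a plain Series, as in the question. This is a minimal sketch, not a definitive implementation: the helper name fill_unique and its placeholder parameter are assumptions introduced here for illustration. It relies on the observation that len(series) candidates are always enough, because at most as many candidates are already in use as there are non-placeholder entries.

```python
import numpy as np
import pandas as pd

# Sample data from the question: -1 marks the entries to fill
s = pd.Series([-1, -1, 8, -1, -1, -1, 21, -1, 0, -1, -1, 4], name="cluster")

def fill_unique(series, placeholder=-1):
    """Replace each placeholder with a distinct value not already in the series.

    Hypothetical helper, written for this example only.
    """
    out = series.copy()
    missing = out == placeholder
    used = out[~missing].to_numpy()
    # 0..len-1 always leaves at least missing.sum() unused candidates
    candidates = np.setdiff1d(np.arange(len(out)), used)
    out[missing] = candidates[: missing.sum()]
    return out

filled = fill_unique(s)
print(filled.tolist())
```

Returning a copy rather than modifying the input keeps the original series available for comparison.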

Related

Pandas: Determine which rows are connected by a matching value in the next

I have a dataframe like so:
J1 J2 J3 J4
0 551 5 552 553
1 551 554 2 5
2 2 554 555 556
3 7 6 557 558
4 559 9 560 561
The goal is to determine which rows are connected to one another. For example: rows 0, 1, and 2 have a matching value that connects it to the next (551 in row 0 and 1, and 554 in row 1 and 2). Once that is determined, I need to isolate those rows into its own separate chunk of data. It should work for any row in the dataframe, not necessarily just the next row. I can't quite figure out how to do this. Any ideas?
As your dataset is small, you can use numpy broadcasting to perform all comparisons.
The code below gives you the number of connected rows (I added an extra connected row for the example):
a = df.values
b = (a==a[:,None]).sum(2)
np.fill_diagonal(b, 0)
df['connected'] = b.sum(0)
output:
0 1 2 3 connected
0 551 5 552 553 1
1 551 554 2 5 3
2 2 554 555 556 1
3 7 6 557 558 0
4 559 9 560 561 0
5 500 0 2 0 1
Finding the connected successive rows:
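You can compare each row with the next one using shift + any (axis is passed by keyword, which recent pandas versions require):
mask = df.eq(df.shift(-1)).any(axis=1)
df['connected'] = mask | mask.shift(fill_value=False)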
output:
J1 J2 J3 J4 connected
0 551 5 552 553 True
1 551 554 2 5 True
2 2 554 555 556 True
3 7 6 557 558 False
4 559 9 560 561 False
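The two snippets above count matches or compare only adjacent rows. To isolate connected rows into separate chunks for any pair of rows, one possible approach is a small union-find pass. The sketch below is an assumption-laden illustration: the helper name connected_groups is hypothetical, and it treats two rows as connected when they share any value in any column, directly or through a chain of other rows.

```python
import pandas as pd

# Data from the question
df = pd.DataFrame({
    "J1": [551, 551, 2, 7, 559],
    "J2": [5, 554, 554, 6, 9],
    "J3": [552, 2, 555, 557, 560],
    "J4": [553, 5, 556, 558, 561],
})

def connected_groups(frame):
    """Label rows so that rows linked by shared values get the same label.

    Hypothetical helper: assumes a match in any column counts as a link.
    """
    parent = list(range(len(frame)))

    def find(i):
        # Follow parent links to the representative, halving the path
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # value -> position of the first row containing it
    for pos, row in enumerate(frame.to_numpy()):
        for value in row:
            if value in first_seen:
                # Merge this row's group with the group that owns the value
                parent[find(pos)] = find(first_seen[value])
            else:
                first_seen[value] = pos
    return pd.Series([find(i) for i in range(len(frame))], index=frame.index)

labels = connected_groups(df)
# One DataFrame per connected group
chunks = [chunk for _, chunk in df.groupby(labels)]
```

Here rows 0, 1 and 2 end up in one chunk, while rows 3 and 4 each form their own.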

Pandas - how to sort week and year numbers formatted as strings?

I have a pandas dataframe like this, sorted as follows:
>>> weekly_count.sort_values(by='date_in_weeks', inplace=True)
>>> weekly_count.loc[:9,:]
date_in_weeks count
0 1-2013 362
1 1-2014 378
2 1-2015 201
3 1-2016 294
4 1-2017 300
5 1-2018 297
6 10-2013 329
7 10-2014 314
8 10-2015 324
9 10-2016 322
In the data above, every value in the first column, date_in_weeks, has the form "week number of the year - year". I now want to sort it like this:
date_in_weeks count
0 1-2013 362
6 10-2013 329
1 1-2014 378
7 10-2014 314
2 1-2015 201
8 10-2015 324
3 1-2016 294
9 10-2016 322
4 1-2017 300
5 1-2018 297
How do I do this?
Use Series.argsort on the values converted to datetimes with the format %W (week number of the year):
df = df.iloc[pd.to_datetime(df['date_in_weeks'] + '-0',format='%W-%Y-%w').argsort()]
print (df)
date_in_weeks count
0 1-2013 362
6 10-2013 329
1 1-2014 378
7 10-2014 314
2 1-2015 201
8 10-2015 324
3 1-2016 294
9 10-2016 322
4 1-2017 300
5 1-2018 297
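You can also convert to datetime, assign the result to the df, then sort the values and drop the extra column. Note that %M is the minutes directive, so the week number is parsed as minutes; the resulting timestamps are only meaningful as a sort key:
s = pd.to_datetime(df['date_in_weeks'],format='%M-%Y')
final = df.assign(dt=s).sort_values(['dt','count']).drop(columns='dt')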
print(final)
date_in_weeks count
0 1-2013 362
6 10-2013 329
1 1-2014 378
7 10-2014 314
2 1-2015 201
8 10-2015 324
3 1-2016 294
9 10-2016 322
4 1-2017 300
5 1-2018 297
You can try using auxiliary columns:
import pandas as pd
df = pd.DataFrame({'date_in_weeks':['1-2013','1-2014','1-2015','10-2013','10-2014'],
'count':[362,378,201,329,314]})
df['aux'] = df['date_in_weeks'].str.split('-')
df['aux_2'] = df['aux'].str.get(1).astype(int)
df['aux'] = df['aux'].str.get(0).astype(int)
df = df.sort_values(['aux_2','aux'],ascending=True)
df = df.drop(columns=['aux','aux_2'])
print(df)
Output:
date_in_weeks count
0 1-2013 362
3 10-2013 329
1 1-2014 378
4 10-2014 314
2 1-2015 201
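The auxiliary columns can also be avoided with the key argument of sort_values (available in pandas 1.1 and later). This is a minimal sketch; the function name year_week_key is an assumption made for this example, and it presumes every value has the form "week-year".

```python
import pandas as pd

df = pd.DataFrame({
    "date_in_weeks": ["1-2013", "1-2014", "1-2015", "10-2013", "10-2014"],
    "count": [362, 378, 201, 329, 314],
})

def year_week_key(col):
    # Split "week-year" strings into two integer columns (0 = week, 1 = year),
    # then combine them so the year dominates and the week breaks ties
    parts = col.str.split("-", expand=True).astype(int)
    return parts[1] * 100 + parts[0]

result = df.sort_values("date_in_weeks", key=year_week_key)
print(result)
```

Because the key is computed on the fly, the dataframe keeps its original columns and nothing has to be dropped afterwards.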

How to mask a data frame by the month?

I have a dataframe df1 with a column dates that holds dates. I want to plot the dataframe for just one month. The dataframe looks like:
Unnamed: 0 Unnamed: 0.1 dates DPD weekday
0 0 1612 2007-06-01 23575.0 4
1 3 1615 2007-06-04 28484.0 0
2 4 1616 2007-06-05 29544.0 1
3 5 1617 2007-06-06 29129.0 2
4 6 1618 2007-06-07 27836.0 3
5 7 1619 2007-06-08 23434.0 4
6 10 1622 2007-06-11 28893.0 0
7 11 1623 2007-06-12 28698.0 1
8 12 1624 2007-06-13 27959.0 2
9 13 1625 2007-06-14 28534.0 3
10 14 1626 2007-06-15 23974.0 4
.. ... ... ... ... ...
513 721 2351 2009-06-09 54658.0 1
514 722 2352 2009-06-10 51406.0 2
515 723 2353 2009-06-11 48255.0 3
516 724 2354 2009-06-12 40874.0 4
517 727 2357 2009-06-15 77085.0 0
518 728 2358 2009-06-16 77989.0 1
519 729 2359 2009-06-17 75209.0 2
520 730 2360 2009-06-18 72298.0 3
521 731 2361 2009-06-19 60037.0 4
522 734 2364 2009-06-22 69348.0 0
523 735 2365 2009-06-23 74086.0 1
524 736 2366 2009-06-24 69187.0 2
525 737 2367 2009-06-25 68912.0 3
526 738 2368 2009-06-26 57848.0 4
527 741 2371 2009-06-29 72718.0 0
528 742 2372 2009-06-30 72306.0 1
And I just want to have June 2007 for example.
df1 = pd.read_csv('DPD.csv')
df1['dates'] = pd.to_datetime(df1['dates'])
df1['month'] = pd.PeriodIndex(df1.dates, freq='M')
nov_mask=df1['month'] == 2007-06
plot_data= df1[nov_mask].pivot(index='dates', values='DPD')
plot_data.plot()
plt.show()
I don't know what's wrong with my code. The error indicates that something is wrong with 2007-06 where I define nov_mask. I think the data type is wrong, but I have tried a lot and nothing works.
You don't need PeriodIndex if you just want to get June 2007 data. I have no access to IPython right now but this should point you in the right direction.
df1 = pd.read_csv('DPD.csv')
df1['dates'] = pd.to_datetime(df1['dates'])
df1['year'] = df1['dates'].dt.year
df1['month'] = df1['dates'].dt.month
june_mask = ((df1['year'] == 2007) & (df1['month'] == 6))
filtered = df1[june_mask]
# ... Do something with filtered.
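If you prefer to keep the period-based approach from the question, the comparison has to be made against a Period object, not the bare expression 2007-06. The sketch below is illustrative only: it builds a small frame from a few rows of the question's data instead of reading DPD.csv.

```python
import pandas as pd

# A few rows taken from the question, standing in for DPD.csv
df1 = pd.DataFrame({
    "dates": ["2007-06-01", "2007-06-04", "2007-06-05", "2009-06-09", "2009-06-10"],
    "DPD": [23575.0, 28484.0, 29544.0, 54658.0, 51406.0],
})
df1["dates"] = pd.to_datetime(df1["dates"])

# Convert each date to its monthly period and compare with a Period
june_mask = df1["dates"].dt.to_period("M") == pd.Period("2007-06", freq="M")
june_2007 = df1[june_mask]
print(june_2007)
```

This keeps year and month in a single comparison, so no helper columns are needed.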

Read 4 lines of data into one row of pandas data frame

I have txt file with such values:
108,612,620,900
168,960,680,1248
312,264,768,564
516,1332,888,1596
I need to read all of this into a single row of data frame.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 108 612 620 900 168 960 680 1248 312 264 768 564 516 1332 888 1596
I have many such files and so I'll keep appending rows to this data frame.
I believe we need some kind of regex but I'm not able to figure it out. For now this is what I have:
df = pd.read_csv(f,sep=",| ", header = None)
But this treats the comma and the space as separators, whereas I want the newline to be treated as a separator.
First, read the data:
df = pd.read_csv('test/t.txt', header=None)
It gives you a DataFrame shaped like the CSV. Then concatenate:
s = pd.concat((df.loc[i] for i in df.index), ignore_index=True)
It gives you a Series:
0 108
1 612
2 620
3 900
4 168
5 960
6 680
7 1248
8 312
9 264
10 768
11 564
12 516
13 1332
14 888
15 1596
dtype: int64
Finally, if you really want a horizontal DataFrame:
pd.DataFrame([s])
Gives you:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 108 612 620 900 168 960 680 1248 312 264 768 564 516 1332 888 1596
Since you've mentioned in a comment that you have many such files, you should simply store all the Series in a list, and construct a DataFrame with all of them at once when you're finished loading them all.
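A minimal sketch of that last suggestion follows. The in-memory files and the helper name file_to_row are assumptions for illustration; the second file's contents are made up, and real code would pass file paths to read_csv instead.

```python
import io
import pandas as pd

# In-memory stand-ins for the text files (the second one is invented data)
files = [
    io.StringIO("108,612,620,900\n168,960,680,1248\n312,264,768,564\n516,1332,888,1596\n"),
    io.StringIO("1,2,3,4\n5,6,7,8\n9,10,11,12\n13,14,15,16\n"),
]

def file_to_row(f):
    # Read the grid as-is, then flatten it row by row into a 1-D array
    return pd.read_csv(f, header=None).to_numpy().ravel()

# Collect every flattened file first, then build the DataFrame once
df = pd.DataFrame([file_to_row(f) for f in files])
print(df)
```

Building the DataFrame once at the end avoids repeatedly appending rows, which copies the data each time.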

How to set array index in this case?

I have this file
0 0 716
0 1 851
0 2 900
1 0 724
1 1 857
1 2 903
2 0 812
2 1 858
2 2 902
3 0 799
3 1 852
3 2 905
4 0 833
4 1 871
4 2 907
5 0 940
5 1 955
5 2 995
6 0 941
6 1 956
6 2 996
7 0 942
7 1 957
7 2 999
8 0 944
8 1 958
8 2 992
9 0 946
9 1 952
9 2 998
I want to write third column values like this
0 0 716
1 0 724
2 0 812
3 0 799
4 0 833
0 1 851
1 1 857
2 1 858
3 1 852
4 1 871
0 2 900
1 2 903
2 2 902
3 2 905
4 2 907
5 0 940
6 0 941
7 0 942
8 0 944
9 0 946
5 1 955
6 1 956
7 1 957
8 1 958
9 1 952
5 2 995
6 2 996
7 2 999
8 2 992
9 2 998
I have read the file:
l= [line.rstrip('\n') for line in open('test.txt')]
Now I am stuck: how do I read this as a 3D array? The enumerate function does not work because it includes the first value on its own, which I do not need.
This works:
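with open('input.txt') as infile:
    # Parse each line into a list of three ints
    rows = [list(map(int, line.split())) for line in infile]
def part(minval, maxval):
    # Keep the rows whose first column lies in [minval, maxval]
    return [r for r in rows if minval <= r[0] <= maxval]
with open('output.txt', 'w') as outfile:
    for half in [part(0, 4), part(5, 9)]:
        # Within each half, sort by second column, then first, then third
        half.sort(key=lambda r: (r[1], r[0], r[2]))
        for row in half:
            outfile.write('%s %s %s\n' % tuple(row))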
Let me know if you have questions.
It would be very simple if you could use the pandas module:
import pandas as pd
fn = r'D:\temp\.data\37146154.txt'
df = pd.read_csv(fn, sep=r'\s+', header=None, names=['col1','col2','col3'])
df.sort_values(['col2','col1','col3'])
If you want to write it back to a new file:
df.sort_values(['col2','col1','col3']).to_csv('new_file', sep='\t', index=False, header=False)
Test:
In [15]: df.sort_values(['col2','col1','col3'])
Out[15]:
col1 col2 col3
0 0 0 716
3 1 0 724
6 2 0 812
9 3 0 799
12 4 0 833
15 5 0 940
18 6 0 941
21 7 0 942
24 8 0 944
27 9 0 946
1 0 1 851
4 1 1 857
7 2 1 858
10 3 1 852
13 4 1 871
16 5 1 955
19 6 1 956
22 7 1 957
25 8 1 958
28 9 1 952
2 0 2 900
5 1 2 903
8 2 2 902
11 3 2 905
14 4 2 907
17 5 2 995
20 6 2 996
23 7 2 999
26 8 2 992
29 9 2 998
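The plain sort above orders by col2 across all ten col1 values, whereas the desired output in the question is split into two halves (col1 0 to 4, then 5 to 9). A possible adjustment is sketched below, assuming the blocks are always five consecutive col1 values; the block column is a temporary helper introduced for this example.

```python
import pandas as pd

# Rebuild the 30 rows from the question: col1 in 0..9, col2 in 0..2
col3 = [716, 851, 900, 724, 857, 903, 812, 858, 902, 799, 852, 905, 833, 871, 907,
        940, 955, 995, 941, 956, 996, 942, 957, 999, 944, 958, 992, 946, 952, 998]
df = pd.DataFrame({
    "col1": [i for i in range(10) for _ in range(3)],
    "col2": [0, 1, 2] * 10,
    "col3": col3,
})

# Assumption: rows are grouped in blocks of five consecutive col1 values
df["block"] = df["col1"] // 5
# Sort by block first, then by col2 and col1 within each block
ordered = df.sort_values(["block", "col2", "col1"]).drop(columns="block")
print(ordered.to_string(index=False, header=False))
```

If the block size differs, only the divisor in the block computation needs to change.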
