Query a pandas DataFrame based on index and data columns - python

I have a dataset that looks like:
data="""cruiseid year station month day date lat lon depth_w taxon count
AA8704 1987 1 04 13 13-APR-87 35.85 -75.48 18 Centropages_typicus 75343
AA8704 1987 1 04 13 13-APR-87 35.85 -75.48 18 Gastropoda 0
AA8704 1987 1 04 13 13-APR-87 35.85 -75.48 18 Calanus_finmarchicus 2340
AA8704 1987 1 07 13 13-JUL-87 35.85 -75.48 18 Acartia_spp. 5616
AA8704 1987 1 07 13 13-JUL-87 35.85 -75.48 18 Metridia_lucens 468
AA8704 1987 1 08 13 13-AUG-87 35.85 -75.48 18 Evadne_spp. 0
AA8704 1987 1 08 13 13-AUG-87 35.85 -75.48 18 Salpa 0
AA8704 1987 1 08 13 13-AUG-87 35.85 -75.48 18 Oithona_spp. 468
"""
datafile = open('data.txt','w')
datafile.write(data)
datafile.close()
I read it into pandas with:
parse = lambda x: dt.datetime.strptime(x, '%d-%m-%Y')
df = pd.read_csv('data.txt',index_col=0, header=False, parse_dates={"Datetime" : [1,3,4]}, skipinitialspace=True, sep=' ', skiprows=0)
How can I generate a subset of this dataframe with all the records in April where the taxon is 'Calanus_finmarchicus' or 'Gastropoda'?
I can query the dataframe where taxon is equal to 'Calanus_finmarchicus' or 'Gastropoda' using
df[(df.taxon == 'Calanus_finmarchicus') | (df.taxon == 'Gastropoda')]
But I'm having trouble querying by time. Something similar in NumPy would look like:
import numpy as np
data = np.genfromtxt('data.txt', dtype=[('cruiseid','S6'), ('year','i4'), ('station','i4'), ('month','i4'), ('day','i4'), ('date','S9'), ('lat','f8'), ('lon','f8'), ('depth_w','i8'), ('taxon','S60'), ('count','i8')], skip_header=1)
selection = [np.where((data['taxon']=='Calanus_finmarchicus') | (data['taxon']=='Gastropoda') & ((data['month']==4) | (data['month']==3)))[0]]
data[selection]
Here's a link with a notebook to reproduce the example

You can refer to datetime's month attribute:
>>> df.index.month
array([4, 4, 4, 7, 7, 8, 8, 8], dtype=int32)
>>> df[((df.taxon == 'Calanus_finmarchicus') | (df.taxon == 'Gastropoda'))
... & (df.index.month == 4)]
cruiseid station date lat lon depth_w \
Datetime
1987-04-13 AA8704 1 13-APR-87 35.85 -75.48 18
1987-04-13 AA8704 1 13-APR-87 35.85 -75.48 18
taxon count Unnamed: 11
Datetime
1987-04-13 Gastropoda 0 NaN
1987-04-13 Calanus_finmarchicus 2340 NaN

As others said, you can use df.index.month to filter by month, but I also suggest using pandas.Series.isin() to check your taxon condition:
>>> df[df.taxon.isin(['Calanus_finmarchicus', 'Gastropoda']) & (df.index.month == 4)]
cruiseid station date lat lon depth_w \
Datetime
1987-04-13 AA8704 1 13-APR-87 35.85 -75.48 18
1987-04-13 AA8704 1 13-APR-87 35.85 -75.48 18
taxon count Unnamed: 11
Datetime
1987-04-13 Gastropoda 0 NaN
1987-04-13 Calanus_finmarchicus 2340 NaN

Use the month attribute of your index:
df[(df.index.month == 4) & ((df.taxon == 'Calanus_finmarchicus') | (df.taxon == 'Gastropoda'))]

I hadn't paid attention to the syntax (the order of the brackets) or to the attributes of the DataFrame index. This line gives me what I was looking for:
results = df[((df.taxon == 'Calanus_finmarchicus') | (df.taxon == 'Gastropoda')) & (df.index.month==4)]
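The answers above can be tried without the data file. The following is a minimal sketch, assuming a small hand-built frame (the rows and the `subset` name are illustrative, not the asker's full dataset) with a DatetimeIndex like the one produced by `parse_dates`:

```python
import pandas as pd

# Hypothetical miniature of the question's data, indexed by sampling date.
df = pd.DataFrame(
    {
        "taxon": ["Centropages_typicus", "Gastropoda", "Calanus_finmarchicus",
                  "Acartia_spp.", "Salpa"],
        "count": [75343, 0, 2340, 5616, 0],
    },
    index=pd.to_datetime(
        ["1987-04-13", "1987-04-13", "1987-04-13", "1987-07-13", "1987-08-13"]
    ),
)
df.index.name = "Datetime"

# Build each condition separately; each one is a boolean mask over the rows.
wanted_taxa = df["taxon"].isin(["Calanus_finmarchicus", "Gastropoda"])
in_april = df.index.month == 4

# Combine the masks with & (element-wise AND) and use the result to select rows.
subset = df[wanted_taxa & in_april]
print(subset)
```

Naming the masks first avoids the bracket-order mistakes mentioned above, because `&` and `|` bind more tightly than `==` and every comparison would otherwise need its own parentheses.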

Related

How do I assign year&months in PD dataframe?

My DataFrame looks very weird after running the code below. The data does not come with a year/month variable, so I have to add them manually. Is there a way I could do that?
sample = []
url1 = "https://api.census.gov/data/2018/cps/basic/jan?get=PEFNTVTY,PEMNTVTY&for=state:01&PEEDUCA=39&key=YOUR_KEY_GOES_HERE"
url2 = "https://api.census.gov/data/2018/cps/basic/feb?get=PEFNTVTY,PEMNTVTY&for=state:01&PEEDUCA=39&key=YOUR_KEY_GOES_HERE"
url3 = "https://api.census.gov/data/2018/cps/basic/mar?get=PEFNTVTY,PEMNTVTY&for=state:01&PEEDUCA=39&key=YOUR_KEY_GOES_HERE"
sample.append(requests.get(url1).text)
sample.append(requests.get(url2).text)
sample.append(requests.get(url3).text)
sample = [json.loads(i) for i in sample]
sample = pd.DataFrame(sample)
sample
Consider read_json to directly read the Census URL API inside a user-defined method. Then iterate pairwise through all possible pairs of years and months using itertools.product to build data frame and assign corresponding columns:
import pandas as pd
import calendar
import itertools
def get_census_data(year, month):
    # BUILD DYNAMIC URL
    url = (
        f"https://api.census.gov/data/{year}/cps/basic/{month.lower()}?"
        "get=PEFNTVTY,PEMNTVTY&for=state:01"
    )
    # CLEAN RAW DATA FOR APPROPRIATE ROWS AND COLS, ASSIGN YEAR/MONTH COLS
    raw_df = pd.read_json(url)
    cps_df = (
        pd.DataFrame(raw_df.iloc[1:,])
        .set_axis(raw_df.iloc[0,], axis="columns")
        .assign(year = year, month = month)
    )
    return cps_df
# MONTH AND YEAR LISTS
months_years = itertools.product(
    range(2010, 2021),
    calendar.month_abbr[1:13]
)
# ITERATE PAIRWISE THROUGH LISTS
cps_list = [get_census_data(yr, mo) for yr, mo in months_years]
# COMPILE AND CLEAN FINAL DATA FRAME
cps_df = (
    pd.concat(cps_list, ignore_index=True)
    .drop_duplicates()
    .reset_index(drop=True)
    .rename_axis(None, axis="columns")
)
Output
cps_df
PEFNTVTY PEMNTVTY state year month
0 57 57 1 2010 Jan
1 303 303 1 2010 Jan
2 233 233 1 2010 Jan
3 57 233 1 2010 Jan
4 73 73 1 2010 Jan
... ... ... ... ...
6447 210 139 1 2020 Dec
6448 363 363 1 2020 Dec
6449 301 57 1 2020 Dec
6450 57 242 1 2020 Dec
6451 416 416 1 2020 Dec
[6452 rows x 5 columns]
The response to each API call is a JSON array of arrays. You called the wrong DataFrame constructor. Try this:
import pandas as pd
import requests
base_url = "https://api.census.gov/data/2018/cps/basic"
params = {
    "get": "PEFNTVTY,PEMNTVTY",
    "for": "state:01",
    "PEEDUCA": 39,
}
frames = []
for month in ["jan", "feb", "mar"]:
    r = requests.get(f"{base_url}/{month}", params=params)
    r.raise_for_status()
    j = r.json()
    # First inner list is the header row; the rest are records
    frames.append(pd.DataFrame.from_records(j[1:], columns=j[0]).assign(month=month))
df = pd.concat(frames)
Result:
PEFNTVTY PEMNTVTY PEEDUCA state month
0 57 57 39 1 jan
1 57 57 39 1 jan
2 57 57 39 1 jan
3 57 57 39 1 jan
4 57 57 39 1 jan
...
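The "array of arrays" point can be checked offline. Below is a small sketch, assuming a hard-coded `payload` in the same header-first shape (the values and the `payload_to_frame` helper are made up for illustration and are not part of the Census API):

```python
import pandas as pd

# Hypothetical payload in the shape described above: a list of lists,
# where the first inner list holds the column names.
payload = [
    ["PEFNTVTY", "PEMNTVTY", "state"],
    ["57", "57", "01"],
    ["303", "233", "01"],
]

def payload_to_frame(rows, year, month):
    """Turn a header-first list of lists into a DataFrame tagged with year/month."""
    header, *records = rows
    return pd.DataFrame.from_records(records, columns=header).assign(
        year=year, month=month
    )

# One frame per (year, month); here the same payload is reused for brevity.
frames = [
    payload_to_frame(payload, 2018, "jan"),
    payload_to_frame(payload, 2018, "feb"),
]
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Passing the outer list straight to `pd.DataFrame(...)`, as in the question, would treat the header row as data, which is probably why the original frame looked odd.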

Find median in nth range in Python

I am trying to find the median of the values in my dataset for every 15-day period. The dataset has three columns: index, value and date.
Each median will then be evaluated against some conditions, and each 15-day period will get a new value according to those conditions.
I've tried several approaches (mostly list comprehensions), but I am still too much of a beginner to solve it properly.
value date index
14 13065 1983-07-15 14
15 13065 1983-07-16 15
16 13065 1983-07-17 16
17 13065 1983-07-18 17
18 13065 1983-07-19 18
19 13065 1983-07-20 19
20 13065 1983-07-21 20
21 13065 1983-07-22 21
22 13065 1983-07-23 22
23 ..... ......... ..
medians = [dataset['value'].median() for range(0, len(dataset['index']), 15) in dataset['value']]
I expect to get the medians from the dataframe in a new variable, but instead I get:
SyntaxError: can't assign to function call
Assuming you have data in the below format:
import numpy as np
import pandas as pd
test = pd.DataFrame({'date': pd.date_range(start = '2016/02/12', periods = 1000, freq='1D'),
                     'value': np.random.randint(1,1000,1000)})
test.head()
date value
0 2016-02-12 243
1 2016-02-13 313
2 2016-02-14 457
3 2016-02-15 236
4 2016-02-16 893
If you want the median for every 15 days, use pd.Grouper and group by date:
test.groupby(pd.Grouper(freq='15D', key='date')).median().reset_index()
date value
2016-02-12 457.0
2016-02-27 733.0
2016-03-13 688.0
2016-03-28 504.0
2016-04-12 591.0
Note that while using pd.Grouper, your date column should be of type datetime. If it's not, convert using:
test['date'] = pd.to_datetime(test['date'])
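Because the sample above uses random values, its medians cannot be reproduced. Here is a deterministic sketch, assuming 45 consecutive days with values 0 to 44 (made-up data), that also shows a position-based alternative for the case where "every 15 days" really means "every 15 rows":

```python
import numpy as np
import pandas as pd

# Hypothetical data: 45 consecutive days with values 0..44.
dataset = pd.DataFrame(
    {
        "date": pd.date_range("1983-07-01", periods=45, freq="D"),
        "value": np.arange(45),
    }
)

# Calendar-based: 15-day bins starting from the first day in the data.
by_calendar = dataset.groupby(pd.Grouper(key="date", freq="15D"))["value"].median()

# Position-based: every 15 consecutive rows, regardless of the dates.
by_position = dataset.groupby(np.arange(len(dataset)) // 15)["value"].median()

print(by_calendar)
print(by_position)
```

The two results agree here only because the dates have no gaps; with missing days, the calendar bins and the row chunks would differ.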

Pandas query using filter and sort, leading to unresolved errors

I am working on a problem for my coding class that is outlined in the docstring. I would appreciate any help optimizing my code, as well as an explanation of why I am receiving the following error despite resetting the index.
import pandas as pd
def beds_top_ten(df, facility_id):
    '''
    INPUT: DataFrame, int
    OUTPUT: date
    Write a pandas query that returns the ten census dates with the highest
    number of available beds for the nursing home with the specified facility id
    REQUIREMENTS:
    Do a filter followed by a sort rather than a sort followed by a merge.
    '''
    df = pd.read_csv('beds.csv', low_memory= False)
    df['Bed Census Date'] = pd.to_datetime(df['Bed Census Date'])
    df = df.filter(items =['Facility ID', 'Bed Census Date','Available Residential Beds'])
    df = df.sort_values(by =[ 'Facility ID', 'Available Residential Beds'], ascending= False)
    df_group_by_ten = df.groupby('Facility ID').head(10).reset_index(drop=True)
    dates = df_group_by_ten.loc[df_group_by_ten['Facility ID']==facility_id, 'Bed Census Date']
    return dates
This is what the table looks like after the first groupby:
Facility ID Bed Census Date Available Residential Beds
336 19 2011-01-05 29
339 19 2010-12-15 28
330 19 2011-02-23 27
332 19 2011-02-02 27
333 19 2011-01-26 27
334 19 2011-01-19 27
335 19 2011-01-12 27
338 19 2010-12-22 27
341 19 2010-12-01 27
331 19 2011-02-09 26
16 17 2013-04-10 22
87 17 2011-11-09 19
30 17 2013-01-02 17
37 17 2012-11-07 17
47 17 2012-08-29 17
31 17 2012-12-26 16
56 17 2012-06-20 16
10 17 2013-05-22 15
27 17 2013-01-23 15
61 17 2012-05-16 15
And when I run it from my command line:
In [15]: beds_top_ten('beds.csv',17)
Out[15]:
16 2013-04-10
87 2011-11-09
30 2013-01-02
37 2012-11-07
47 2012-08-29
31 2012-12-26
56 2012-06-20
10 2013-05-22
27 2013-01-23
61 2012-05-16
Name: Bed Census Date, dtype: datetime64[ns]
Yet when I run the same code on the online environment, I get the following error:
/usr/local/lib/python2.7/unittest/suite.py:108: DtypeWarning: Columns (10,45) have mixed types. Specify dtype option on import or set low_memory=False.
test(result)
E
======================================================================
ERROR: test_fourth_pandas (test_methods.Test)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/src/app/test_methods.py", line 25, in test_fourth_pandas
all_equal = np.all(result == answer)
File "/usr/local/lib/python2.7/site-packages/pandas/core/ops.py", line 812, in wrapper
raise ValueError(msg)
ValueError: Can only compare identically-labeled Series objects
----------------------------------------------------------------------
Ran 1 test in 19.743s
FAILED (errors=1)
There's nothing wrong with pd.to_datetime. It's possible you have erroneous dates. Try specifying a format, and errors='coerce' so that invalid dates are converted to NaT.
df['Bed Census Date'] = pd.to_datetime(df['Bed Census Date'].str.strip(),
format='%Y-%m-%d', errors='coerce')
Now, expanding on my comment, filter, sort, and get the first 10 items using head:
x = df[df['Facility ID'] == facility_id]\
.sort_values('Available Residential Beds', ascending=False).head(10)
return x['Bed Census Date']
Removing the date formatting line resolved the above error.
df = pd.read_csv('beds.csv', low_memory= False)
#df['Bed Census Date'] = pd.to_datetime(df['Bed Census Date'])
df = df.filter(items=['Facility ID', 'Bed Census Date','Available Residential Beds'])
x = df[df['Facility ID'] == facility_id].sort_values('Available Residential Beds', ascending=False).head(10)
return x['Bed Census Date']
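The filter-then-sort pattern from the answer can be sketched as a complete function. This is an illustration on made-up rows (the `beds` frame is hypothetical), and it takes the DataFrame as an argument instead of reading `beds.csv`:

```python
import pandas as pd

def beds_top_ten(df, facility_id):
    """Return the ten census dates with the most available beds for one facility."""
    # Filter first: keep only the rows for the requested facility.
    one_facility = df[df["Facility ID"] == facility_id]
    # Then sort the (much smaller) filtered frame and keep the top ten rows.
    top = one_facility.sort_values(
        "Available Residential Beds", ascending=False
    ).head(10)
    return top["Bed Census Date"]

# Hypothetical data for illustration only.
beds = pd.DataFrame(
    {
        "Facility ID": [17, 17, 19, 17],
        "Bed Census Date": ["2013-04-10", "2011-11-09", "2011-01-05", "2013-01-02"],
        "Available Residential Beds": [22, 19, 29, 17],
    }
)

result = beds_top_ten(beds, 17)
print(result)
```

The returned Series keeps the original row labels (0, 1 and 3 here). pandas raises "Can only compare identically-labeled Series objects" when two Series with different index labels are compared with `==`, so anything that relabels the rows, such as `reset_index(drop=True)`, may trigger that error if the grader's expected answer carries the original labels. The grader's answer is not shown in the thread, so that part is an assumption.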

Pandas - formatting a NxN matrix

I have to deal with a square matrix (N x N) (N will change depending on the system, but the matrix will always be a square matrix).
Here is an example:
0 1 2 3 4
0 5.1677124550E-001 5.4962112499E-005 3.2484393256E-002 -1.8901697652E-001 -6.7156804753E-003
1 5.5380106796E-005 5.6159927753E-001 -1.9000545049E-003 -1.4737748792E-002 -7.2598453774E-002
2 3.2486915835E-002 -1.8996351539E-003 5.6791783316E-001 7.2316374186E-002 1.5013066446E-003
3 -1.8901411495E-001 -1.4737367075E-002 7.2315825338E-002 6.2721160365E-001 3.1553528602E-002
4 -6.7136454124E-003 -7.2597907350E-002 1.5007743348E-003 3.1554372311E-002 2.7318109331E-001
5 6.6738948243E-002 1.4102132238E-003 -1.2689244944E-001 4.7666038803E-002 1.8559074897E-002
6 -2.5293332676E-002 3.7536452002E-002 -1.3453018251E-002 -1.3177136905E-001 6.8262612506E-002
7 5.0951492945E-003 2.1082303893E-005 2.2599127408E-004 1.0287898189E-001 -1.1117916184E-001
8 1.0818230191E-003 -1.2435319909E-002 8.1008075834E-003 -4.2864102001E-002 4.2865913226E-002
9 -1.8399671295E-002 -2.1579653853E-002 -8.3073582356E-003 -2.1848513510E-001 -7.3408914625E-002
10 3.4566032399E-003 -4.0687639382E-003 1.3769999130E-003 -1.1873434189E-001 -3.3274201039E-002
11 6.6093238125E-003 1.7153435473E-002 4.9392012712E-003 -8.4590814134E-002 -4.3601041176E-002
12 -1.1418316960E-001 -1.1241625427E-001 -3.2263873516E-002 -1.9323129435E-002 -2.6233049625E-002
13 -1.1352899039E-001 -2.2898299860E-001 -5.3035072561E-002 7.4480651562E-004 6.3778892206E-004
14 -3.2197359289E-002 -5.3404040557E-002 -6.2530142462E-002 9.6648204015E-003 1.5382174347E-002
15 -1.2210509335E-001 1.1380412205E-001 -3.8374895516E-002 -1.2823165326E-002 2.3865200517E-002
16 1.1478157080E-001 -2.1487971631E-001 5.9955334103E-002 -1.2803721235E-003 -2.2477259002E-004
17 -3.9162044498E-002 6.0167325377E-002 -6.7692892326E-002 6.3814569032E-003 -1.3309923267E-002
18 -5.1386866211E-002 -1.1483215267E-003 -3.8482481829E-002 2.2227734790E-003 2.4860195290E-004
19 -1.8287048910E-003 -4.5442287955E-002 -7.6787332291E-003 7.6970470456E-004 -1.8456603178E-003
20 -3.4812676792E-002 -7.8376169613E-003 -3.1205975353E-001 -2.8005140005E-003 3.9792109835E-004
21 2.6908361866E-003 3.7102890907E-004 2.8494060446E-002 -4.8904422930E-002 -5.8840348122E-004
22 -1.6354677061E-003 2.2592828188E-003 1.6591434361E-004 -4.9992263663E-003 -4.3243295112E-002
23 -1.4297833794E-003 -1.7830154308E-003 -1.1426700328E-002 1.7125095395E-003 -1.2016863398E-002
24 1.6271802154E-003 1.6383303957E-003 -7.8049656555E-004 3.7177399735E-003 -1.0472268655E-002
25 -4.1949740427E-004 1.5301971185E-004 -9.8681335931E-004 -2.2257204483E-004 -5.1722898203E-003
26 1.0290471110E-003 9.3255502541E-004 7.7166886713E-004 4.5630851485E-003 -4.3761358485E-003
27 -7.0031784470E-004 -3.5205332654E-003 -1.6311730073E-003 -1.2805479632E-002 -6.5565487971E-003
28 7.4046927792E-004 1.9332629981E-003 3.7374682636E-004 3.9965654817E-003 -6.2275912806E-003
29 -3.4680278867E-004 -2.3027344089E-003 -1.1338817043E-003 -1.2023581780E-002 -5.4242202971E-003
5 6 7 8 9
0 6.6743285428E-002 -2.5292337123E-002 5.0949675928E-003 1.0817408844E-003 -1.8399704662E-002
1 1.4100215877E-003 3.7536256943E-002 2.1212526899E-005 -1.2435482773E-002 -2.1579384876E-002
2 -1.2689432485E-001 -1.3453164785E-002 2.2618690004E-004 8.1008703937E-003 -8.3084039605E-003
3 4.7663851818E-002 -1.3181118094E-001 1.0290976691E-001 -4.2887391630E-002 -2.1847562123E-001
4 1.8558453001E-002 6.8311145594E-002 -1.1122358467E-001 4.2891711956E-002 -7.3413776745E-002
5 6.5246209445E-001 -3.7960754525E-002 5.8439215647E-002 -9.0620367134E-002 -8.4164313206E-002
6 -3.7935271881E-002 1.9415336793E-001 -6.8115262349E-002 5.0899890760E-002 -3.3687874555E-002
7 5.8422477033E-002 -6.8128901087E-002 3.9950499633E-001 -4.4336879147E-002 -4.0665928103E-002
8 -9.0612201567E-002 5.0902528870E-002 -4.4330072001E-002 1.2680415316E-001 1.7096405711E-002
9 -8.4167028549E-002 -3.3690056890E-002 -4.0677875424E-002 1.7097273427E-002 5.2579065978E-001
10 -6.4841142152E-002 -5.4453858464E-003 -2.4697277476E-001 8.5069643903E-005 1.8744016178E-001
11 -1.0367060076E-001 1.5864203200E-002 -1.6074822795E-002 -5.5265410413E-002 -7.3152548403E-002
12 -9.0665723957E-003 3.3027526012E-003 1.8484849938E-003 -7.5841163489E-004 -3.3700244298E-003
13 4.7717318460E-004 -1.8118719766E-003 1.6014630540E-003 -2.3830908057E-004 2.1049292570E-003
14 4.3836856576E-003 -1.7242302777E-003 -1.2023546553E-003 4.0533783460E-004 1.4850814596E-003
15 -1.2402059167E-002 -7.4793143461E-003 -3.8769252328E-004 3.9551076185E-003 1.0737706641E-003
16 -9.3076805579E-005 -1.6074185601E-003 1.7551579833E-003 -5.1663470094E-004 1.1072804383E-003
17 4.6817349747E-003 3.6900011954E-003 -8.6155331565E-004 -9.1007768778E-005 -7.3899260162E-004
18 3.2959550689E-002 3.0400921147E-003 3.9724187499E-004 -1.9220339108E-003 1.8075790317E-003
19 7.0905456379E-004 -5.0949208181E-004 -4.6021500516E-004 -7.9847500945E-004 1.4079850530E-004
20 -1.8687467448E-002 -6.3913023759E-004 -7.3566296037E-004 2.3726543730E-003 -1.0663719038E-003
21 3.6598966411E-003 -8.2335128379E-003 7.5645765132E-004 -2.1824880567E-002 -3.5125687811E-003
22 -1.6198130808E-002 8.4576317115E-003 -6.2045498682E-004 3.3460766491E-002 3.2638760335E-003
23 -3.2057393808E-001 -1.1315081941E-002 3.4822885510E-003 -5.8263446092E-003 2.9508421818E-004
24 -2.6366856593E-002 -5.8331954255E-004 1.1995976399E-003 3.4813904521E-003 -5.0942740761E-002
25 6.5474742063E-003 -5.7681583908E-003 -2.2680039574E-002 -3.3264360995E-002 4.8925407218E-003
26 -1.1288074542E-002 -4.5938216710E-003 -1.9339903561E-003 1.0812058656E-002 2.3005958417E-002
27 1.8937006089E-002 6.5590668002E-003 -2.9973042787E-003 -9.1466195902E-003 -2.0027029530E-001
28 -5.0006834397E-003 -3.1011487603E-002 -2.1071980031E-002 1.5171078954E-002 -6.3286786806E-002
29 1.0199591553E-002 -7.9372677248E-004 3.0157129340E-003 3.3043947441E-003 1.2554933598E-001
10 11 12 13 14
0 3.4566170422E-003 6.6091516193E-003 -1.1418209846E-001 -1.1352717720E-001 -3.2196213169E-002
1 -4.0687114857E-003 1.7153538295E-002 -1.1241515840E-001 -2.2897846552E-001 -5.3401852861E-002
2 1.3767476381E-003 4.9395834885E-003 -3.2262805417E-002 -5.3032729716E-002 -6.2527093260E-002
3 -1.1874067860E-001 -8.4586993618E-002 -1.9322697616E-002 7.4504831410E-004 9.6646936748E-003
4 -3.3280804952E-002 -4.3604931512E-002 -2.6232842935E-002 6.3789697287E-004 1.5382093474E-002
5 -6.4845769217E-002 -1.0366990398E-001 -9.0664935892E-003 4.7719667654E-004 4.3835884630E-003
6 -5.4306282394E-003 1.5863464756E-002 3.3027917727E-003 -1.8118646089E-003 -1.7242102753E-003
7 -2.4687457565E-001 -1.6075394559E-002 1.8484728466E-003 1.6014634135E-003 -1.2023496466E-003
8 8.5962912652E-005 -5.5265657567E-002 -7.5843145596E-004 -2.3831274033E-004 4.0533385644E-004
9 1.8744386918E-001 -7.3152643002E-002 -3.3700964189E-003 2.1048865009E-003 1.4850822567E-003
10 4.2975054072E-001 1.0364270794E-001 -1.5875283846E-003 6.7147216913E-004 1.2875627684E-004
11 1.0364402707E-001 6.0712435750E-001 5.1492123223E-003 8.2705404716E-004 -1.8653698814E-003
12 -1.5875318643E-003 5.1492269487E-003 1.2662026379E-001 1.2488481495E-001 3.3008712754E-002
13 6.7147489686E-004 8.2705994225E-004 1.2488477299E-001 2.4603749137E-001 5.7666439818E-002
14 1.2875157882E-004 -1.8653719810E-003 3.3008614344E-002 5.7666322609E-002 6.3196096154E-002
15 1.1375173141E-003 -1.2188735107E-003 9.5708352328E-003 -1.3282223067E-002 5.3571128896E-003
16 2.1319373893E-004 -2.6367828437E-004 1.4833724552E-002 -2.0115235494E-002 7.8461850894E-003
17 2.3051283757E-004 3.4044831571E-004 4.9262824289E-003 -6.6151918659E-003 1.1684894610E-003
18 -5.6658408835E-004 1.5710333316E-003 -2.6543076573E-003 1.0490950154E-003 -1.5676208892E-002
19 1.0005496308E-003 1.0400419914E-003 -2.7122935995E-003 -5.3716049248E-005 -2.6747366947E-002
20 3.1068907684E-004 5.3348953665E-004 -4.7934824223E-004 4.4853558686E-004 -6.0300656596E-003
21 2.7080517882E-003 -1.9033626829E-002 8.8615570289E-004 -3.7735646663E-004 -7.4101143501E-004
22 -2.9622921796E-003 -2.4159082408E-002 6.6943323966E-004 1.1154593780E-004 1.5914682394E-004
23 3.2842560830E-003 -6.2612752482E-003 1.5738434272E-004 4.6284599959E-004 4.0588132107E-004
24 1.6971737369E-003 2.4217812563E-002 4.3246402884E-004 9.5059931011E-005 3.5484698283E-004
25 -7.4868993750E-002 -8.7332668698E-002 -6.0147742690E-005 -4.8099146029E-005 1.1509155506E-004
26 -9.3177706949E-002 -2.9315061874E-001 2.1287190612E-004 5.0813661565E-005 2.6955715462E-004
27 -7.0097859908E-002 1.2458191360E-001 -1.2846480258E-003 1.2192486380E-004 4.6853704861E-004
28 -6.9485493530E-002 4.8763866344E-002 7.7223643475E-004 1.3853535883E-004 5.4636752811E-005
29 4.8961381968E-002 -1.5272337445E-001 -8.8648769643E-004 -4.4975303480E-005 5.9586006091E-004
15 16 17 18 19
0 -1.2210501176E-001 1.1478027359E-001 -3.9162145749E-002 -5.1389252158E-002 -1.8288904037E-003
1 1.1380272374E-001 -2.1487588526E-001 6.0165774430E-002 -1.1487007778E-003 -4.5441546655E-002
2 -3.8374694597E-002 5.9953296524E-002 -6.7691825286E-002 -3.8484030260E-002 -7.6800715249E-003
3 -1.2822729286E-002 -1.2805898275E-003 6.3813065178E-003 2.2220841872E-003 7.6991955181E-004
4 2.3864994996E-002 -2.2470892452E-004 -1.3309838494E-002 2.4851560674E-004 -1.8460620529E-003
5 -1.2402212045E-002 -9.2994801153E-005 4.6817064931E-003 3.2958166488E-002 7.0866732024E-004
6 -7.4793278406E-003 -1.6074103229E-003 3.6899979002E-003 3.0392561951E-003 -5.0946020505E-004
7 -3.8770026733E-004 1.7551659565E-003 -8.6155605026E-004 3.9692465089E-004 -4.6038088334E-004
8 3.9551171890E-003 -5.1663991899E-004 -9.1008948343E-005 -1.9220277566E-003 -7.9837924658E-004
9 1.0738350084E-003 1.1072790098E-003 -7.3897453645E-004 1.8057852560E-003 1.4013275714E-004
10 1.1375075076E-003 2.1317640112E-004 2.3050639764E-004 -5.6673414945E-004 1.0005316579E-003
11 -1.2189105982E-003 -2.6367792495E-004 3.4043235164E-004 1.5732522246E-003 1.0407973658E-003
12 9.5708232459E-003 1.4833737759E-002 4.9262816092E-003 -2.6542614308E-003 -2.7122986789E-003
13 -1.3282260152E-002 -2.0115238348E-002 -6.6152067653E-003 1.0491248568E-003 -5.3705750675E-005
14 5.3571028398E-003 7.8462085672E-003 1.1684872139E-003 -1.5676176683E-002 -2.6747374282E-002
15 1.3378635756E-001 -1.2613361119E-001 4.2401828623E-002 -2.6595403473E-003 1.9873360401E-003
16 -1.2613349126E-001 2.3154756121E-001 -6.5778628114E-002 -2.2828335280E-003 1.4601821131E-003
17 4.2401749392E-002 -6.5778591727E-002 6.8187241643E-002 -1.6653902450E-002 2.5505038138E-002
18 -2.6595920073E-003 -2.2828074980E-003 -1.6653942562E-002 5.4855247002E-002 2.4729783529E-003
19 1.9873415121E-003 1.4601899329E-003 2.5505058190E-002 2.4729967206E-003 4.4724663284E-002
20 -3.8366743828E-004 -8.8746730931E-004 -6.4420927497E-003 3.6656962180E-002 8.1224860664E-003
21 9.2845385141E-004 3.6802433505E-004 -9.5040708316E-004 -5.1941208846E-003 -1.2444625713E-004
22 -5.0318487549E-004 1.4342911215E-004 2.8985859503E-004 2.0416113478E-004 9.1951318240E-004
23 7.4036073171E-004 -3.4730013615E-004 -1.3351566400E-004 2.3474188588E-003 1.3102362758E-005
24 -2.7749145090E-004 4.7724454321E-005 5.5527644806E-005 -1.7302886151E-004 -1.7726879169E-004
25 -2.5090250470E-004 2.1741519930E-005 2.7208805916E-004 -2.5982303487E-004 -1.9668228900E-004
26 -1.4489113997E-004 -3.0397727583E-005 2.7239543481E-005 -6.0050637375E-004 -2.9892198193E-005
27 -1.6519482597E-005 1.6435294198E-004 5.0961893634E-005 1.4077278097E-004 -1.9027010603E-005
28 -2.3547595249E-004 7.6124571826E-005 1.0117983985E-004 -1.1534040559E-004 -1.0579685787E-004
29 7.0507166233E-005 1.1552377841E-004 -4.5931305760E-005 -2.0007797315E-004 -1.3505340062E-004
20 21 22 23 24
0 -3.4812101478E-002 2.6911592086E-003 -1.6354152863E-003 -1.4301333227E-003 1.6249964844E-003
1 -7.8382610347E-003 3.7103408229E-004 2.2593110441E-003 -1.7829862164E-003 1.6374435740E-003
2 -3.1205423941E-001 2.8493671639E-002 1.6587990556E-004 -1.1426237591E-002 -7.8189111866E-004
3 -2.8004725758E-003 -4.8903739721E-002 -4.9988134121E-003 1.7100983514E-003 3.7179545055E-003
4 3.9806443322E-004 -5.8790208912E-004 -4.3242458298E-002 -1.2016207108E-002 -1.0472139534E-002
5 -1.8686790048E-002 3.6592865292E-003 -1.6198931842E-002 -3.2057224847E-001 -2.6367531700E-002
6 -6.3919412091E-004 -8.2335246704E-003 8.4576155591E-003 -1.1315054733E-002 -5.8369163532E-004
7 -7.3581915791E-004 7.5646519519E-004 -6.2047477465E-004 3.4823216513E-003 1.1991380964E-003
8 2.3726528036E-003 -2.1824763131E-002 3.3460717579E-002 -5.8262172949E-003 3.4812921433E-003
9 -1.0665296285E-003 -3.5124206435E-003 3.2639684654E-003 2.9530797749E-004 -5.0943824872E-002
10 3.1067613876E-004 2.7079189356E-003 -2.9623459983E-003 3.2841200274E-003 1.6984442797E-003
11 5.3351732140E-004 -1.9033427571E-002 -2.4158940046E-002 -6.2609613281E-003 2.4221378111E-002
12 -4.7937892256E-004 8.8611314755E-004 6.6939922854E-004 1.5740024716E-004 4.3249394082E-004
13 4.4851926804E-004 -3.7736678097E-004 1.1153694999E-004 4.6284806253E-004 9.5077824774E-005
14 -6.0300787410E-003 -7.4096053004E-004 1.5918637627E-004 4.0586523098E-004 3.5485782222E-004
15 -3.8368712363E-004 9.2843754228E-004 -5.0316845184E-004 7.4036906127E-004 -2.7745851356E-004
16 -8.8745240886E-004 3.6801936222E-004 1.4342995270E-004 -3.4729860789E-004 4.7711904531E-005
17 -6.4420819427E-003 -9.5038506002E-004 2.8983698019E-004 -1.3352326563E-004 5.5544671478E-005
18 3.6656852373E-002 -5.1941195232E-003 2.0415783452E-004 2.3474119607E-003 -1.7153048632E-004
19 8.1224361521E-003 -1.2444681834E-004 9.1951236579E-004 1.3097434442E-005 -1.7668019335E-004
20 3.3911554853E-001 2.8652507893E-003 -6.8339696880E-005 3.7476484447E-004 8.3606654277E-004
21 2.8652527558E-003 6.1967615286E-002 -3.2455918220E-003 7.8074203872E-003 -1.5351890960E-003
22 -6.8340068690E-005 -3.2455946984E-003 4.1826230856E-002 6.5337193429E-003 -3.1932674182E-003
23 3.7476336333E-004 7.8073802579E-003 6.5336763366E-003 3.4246747567E-001 -2.2590437719E-005
24 8.3515185725E-004 -1.5351889308E-003 -3.1932682244E-003 -2.2585651674E-005 4.7006835231E-002
25 5.3158843621E-007 1.0652535047E-003 1.4954902777E-003 2.4073368793E-004 1.1954474977E-003
26 5.5963948637E-004 -4.4872582333E-004 -1.4772351943E-003 6.3199701928E-004 -2.1389718034E-002
27 -1.7619372799E-004 9.0741766644E-004 9.8175835796E-004 -2.9459682310E-004 7.2835611826E-004
28 2.5127782091E-004 -9.3298199434E-004 6.8787235133E-005 1.2732690365E-004 7.9688727422E-003
29 2.6201943695E-004 1.7128017387E-004 1.2934748675E-003 3.4008367645E-004 1.9615268308E-002
25 26 27 28 29
0 -4.2035299977E-004 1.0294528397E-003 -7.0032537135E-004 7.4047266192E-004 -3.4678947810E-004
1 1.5264932827E-004 9.3263518942E-004 -3.5205362458E-003 1.9332600101E-003 -2.3027335108E-003
2 -9.8735571502E-004 7.7177183895E-004 -1.6311830663E-003 3.7374078263E-004 -1.1338849320E-003
3 -2.2267753982E-004 4.5631164845E-003 -1.2805227755E-002 3.9967067646E-003 -1.2023590679E-002
4 -5.1722782688E-003 -4.3757731112E-003 -6.5561880794E-003 -6.2274289617E-003 -5.4242286711E-003
5 6.5472637324E-003 -1.1287788747E-002 1.8937046693E-002 -5.0006811267E-003 1.0199602824E-002
6 -5.7685226078E-003 -4.5935456207E-003 6.5591405092E-003 -3.1011377655E-002 -7.9382348181E-004
7 -2.2680665405E-002 -1.9338350120E-003 -2.9972765688E-003 -2.1071947728E-002 3.0156847654E-003
8 -3.3264515239E-002 1.0812126530E-002 -9.1466888768E-003 1.5170890552E-002 3.3044094214E-003
9 4.8928775025E-003 2.3007654009E-002 -2.0026482543E-001 -6.3285758846E-002 1.2554808336E-001
10 -7.4869041758E-002 -9.3178724533E-002 -7.0098856149E-002 -6.9485640501E-002 4.8962839723E-002
11 -8.7330564494E-002 -2.9314613543E-001 1.2458021507E-001 4.8763534298E-002 -1.5272144228E-001
12 -6.0132426168E-005 2.1286995818E-004 -1.2846479090E-003 7.7223667108E-004 -8.8648784383E-004
13 -4.8090893023E-005 5.0813447259E-005 1.2192474211E-004 1.3853537972E-004 -4.4975512069E-005
14 1.1509828375E-004 2.6955725919E-004 4.6853708025E-004 5.4636589826E-005 5.9585997916E-004
15 -2.5088560837E-004 -1.4490239429E-004 -1.6517113547E-005 -2.3547725232E-004 7.0506301073E-005
16 2.1741623849E-005 -3.0396484786E-005 1.6435437640E-004 7.6123660238E-005 1.1552303684E-004
17 2.7209709129E-004 2.7234932342E-005 5.0963084246E-005 1.0117936124E-004 -4.5931984725E-005
18 -2.5882735848E-004 -6.0031848430E-004 1.4070861538E-004 -1.1535910049E-004 -2.0001808065E-004
19 -1.9638025822E-004 -2.9919459983E-005 -1.9047914816E-005 -1.0580143635E-004 -1.3503643634E-004
20 8.4829116415E-007 5.5948891149E-004 -1.7619563318E-004 2.5127749619E-004 2.6202088722E-004
21 1.0652521780E-003 -4.4872868033E-004 9.0739586785E-004 -9.3299673048E-004 1.7126146660E-004
22 1.4954902653E-003 -1.4772362211E-003 9.8175151528E-004 6.8801505444E-005 1.2934673074E-003
23 2.4072903510E-004 6.3199689136E-004 -2.9460500091E-004 1.2731327319E-004 3.4007600115E-004
24 1.1952923145E-003 -2.1389995888E-002 7.2832026293E-004 7.9688600183E-003 1.9615297182E-002
25 9.4289717269E-002 1.0562741426E-001 -1.7552990896E-004 7.0060843371E-003 8.7782610441E-003
26 1.0562750999E-001 3.0308674016E-001 -1.6382699707E-003 -5.5832273099E-003 -1.1726448645E-002
27 -1.7551353029E-004 -1.6382784849E-003 2.0673701256E-001 8.2101212014E-002 -1.3115219203E-001
28 7.0060896795E-003 -5.5832572276E-003 8.2101377926E-002 8.7668224780E-002 -5.4259499038E-002
29 8.7782416309E-003 -1.1726450275E-002 -1.3115216547E-001 -5.4259354736E-002 1.5092602943E-001
This should be a 30x30 matrix and I'm trying:
data = pd.read_fwf('C:/Users/henri/Documents/Projects/Python-Lessons/ORCA/orca.hess',
widths=[9, 19, 19, 19, 19, 19])
But it reads as 185x6. I'd like to ignore the first column (numbering the lines) from 0-29 and I'm not using the columns indexes (from 0-29 too) to perform any mathematical operation. Also, Pandas is rounding my numbers and I'd like to keep the original format.
Here is a snip of my output:
Unnamed: 0 0 1 2 3 4
0 0.0 5.167712e-01 0.000055 0.032484 -0.189017 -0.006716
1 1.0 5.538011e-05 0.561599 -0.001900 -0.014738 -0.072598
2 2.0 3.248692e-02 -0.001900 0.567918 0.072316 0.001501
Any help is much appreciated, guys.
import pandas as pd
filename = 'data'
df = pd.read_fwf(filename, widths=[9, 19, 19, 19, 19, 19])
df = df.rename(columns={'Unnamed: 0':'row'})
df = df.dropna(subset=['row'], how='any')
df['col'] = df.groupby('row').cumcount()
df = df.pivot(index='row', columns='col')
df = df.dropna(how='any', axis=1)
df.columns = range(len(df.columns))
print(df.head())
yields
0 1 2 3 4 5 6 \
row
0.0 0.516771 0.066743 0.003457 -0.122105 -0.034812 -0.000420 0.000055
1.0 0.000055 0.001410 -0.004069 0.113803 -0.007838 0.000153 0.561599
2.0 0.032487 -0.126894 0.001377 -0.038375 -0.312054 -0.000987 -0.001900
3.0 -0.189014 0.047664 -0.118741 -0.012823 -0.002800 -0.000223 -0.014737
4.0 -0.006714 0.018558 -0.033281 0.023865 0.000398 -0.005172 -0.072598
7 8 9 ... 20 21 22 \
row ...
0.0 -0.025292 0.006609 0.114780 ... -0.113527 -0.051389 -0.001430
1.0 0.037536 0.017154 -0.214876 ... -0.228978 -0.001149 -0.001783
2.0 -0.013453 0.004940 0.059953 ... -0.053033 -0.038484 -0.011426
3.0 -0.131811 -0.084587 -0.001281 ... 0.000745 0.002222 0.001710
4.0 0.068311 -0.043605 -0.000225 ... 0.000638 0.000249 -0.012016
23 24 25 26 27 28 29
row
0.0 0.000740 -0.006716 -0.018400 -0.032196 -0.001829 0.001625 -0.000347
1.0 0.001933 -0.072598 -0.021579 -0.053402 -0.045442 0.001637 -0.002303
2.0 0.000374 0.001501 -0.008308 -0.062527 -0.007680 -0.000782 -0.001134
3.0 0.003997 0.031554 -0.218476 0.009665 0.000770 0.003718 -0.012024
4.0 -0.006227 0.273181 -0.073414 0.015382 -0.001846 -0.010472 -0.005424
[5 rows x 30 columns]
After parsing the file with
df = pd.read_fwf(filename, widths=[9, 19, 19, 19, 19, 19])
df = df.rename(columns={'Unnamed: 0':'row'})
the column headers can be identified by having a df['row'] value of NaN.
So they can be removed with
df = df.dropna(subset=['row'], how='any')
Now the row numbers keep repeating from 0 to 29. If we group by the row
value, then we can assign an intra-group "cumulative count" to the rows within
each group. That is, the first row of the group gets assigned the value 0, the
next row 1, etc. -- within that group -- and the process is repeated for each
group.
df['col'] = df.groupby('row').cumcount()
# row 0 1 2 3 4 col
# 0 0.0 5.167712e-01 0.000055 0.032484 -0.189017 -0.006716 0
# 1 1.0 5.538011e-05 0.561599 -0.001900 -0.014738 -0.072598 0
# 2 2.0 3.248692e-02 -0.001900 0.567918 0.072316 0.001501 0
# ...
# 182 27.0 -1.755135e-04 -0.001638 0.206737 0.082101 -0.131152 5
# 183 28.0 7.006090e-03 -0.005583 0.082101 0.087668 -0.054259 5
# 184 29.0 8.778242e-03 -0.011726 -0.131152 -0.054259 0.150926 5
Now the desired DataFrame can be obtained by pivoting:
df = df.pivot(index='row', columns='col')
and relabeling the columns:
df.columns = range(len(df.columns))
A more NumPy-based approach might look like this:
import numpy as np
import pandas as pd
filename = 'data'
df = pd.read_csv(filename, sep=r'\s+')
arr = df.values
N = df.index.max()+1
arr = np.delete(arr, np.arange(N, len(arr), N+1), axis=0)
chunks = np.split(arr, np.arange(N, len(arr), N))
result = pd.DataFrame(np.hstack(chunks)).dropna(axis=1)
print(result)
This will also work for any sized matrix.
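As an alternative to both approaches, the block layout can be parsed line by line without pandas. The sketch below assumes the layout shown in the question (a header line containing only column indices, then one line per row that starts with the row index); the 4x4 sample and the `parse_blocked_matrix` helper are made up for illustration:

```python
import numpy as np

# Hypothetical 4x4 matrix printed in blocks of two columns.
text = """\
          0                  1
   0   1.0000000000E+000   2.0000000000E-001
   1   2.0000000000E-001   3.0000000000E+000
   2   0.0000000000E+000  -4.0000000000E-003
   3   5.0000000000E-001   6.0000000000E-002
          2                  3
   0   0.0000000000E+000   5.0000000000E-001
   1  -4.0000000000E-003   6.0000000000E-002
   2   7.0000000000E+000   8.0000000000E-001
   3   8.0000000000E-001   9.0000000000E+000
"""

def parse_blocked_matrix(text):
    """Rebuild a square matrix from column blocks stacked on top of each other."""
    entries = {}  # (row, col) -> value
    cols = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if all(tok.isdigit() for tok in tokens):
            # A line of bare integers is the header of a new column block.
            cols = [int(tok) for tok in tokens]
            continue
        row = int(tokens[0])
        for col, tok in zip(cols, tokens[1:]):
            entries[(row, col)] = float(tok)
    n = max(r for r, _ in entries) + 1
    matrix = np.full((n, n), np.nan)
    for (r, c), value in entries.items():
        matrix[r, c] = value
    return matrix

matrix = parse_blocked_matrix(text)
print(matrix.shape)
# Full precision is stored; only the default display is shortened.
print(f"{matrix[0, 1]:.10E}")
```

On the rounding concern: the values are stored as ordinary 64-bit floats, and pandas or NumPy only shorten them when printing. Formatting with `.10E` shows the digits again, although Python writes a two-digit exponent where the file uses three.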

Pandas sort() ignoring negative sign

I want to sort a pandas df but I'm having problems with the negative values.
import pandas as pd
df = pd.read_csv('File.txt', sep='\t', header=None)
#Suppress scientific notation (finally)
pd.set_option('display.float_format', lambda x: '%.8f' % x)
print(df)
print(df.dtypes)
print(df.shape)
b = df.sort(axis=0, ascending=True)
print(b)
This gives me the ascending order but completely disregards the sign.
SPATA1 -0.00000005
HMBOX1 0.00000005
SLC38A11 -0.00000005
RP11-571M6.17 0.00000004
GNRH1 -0.00000004
PCDHB8 -0.00000004
CXCL1 0.00000004
RP11-48B3.3 -0.00000004
RNFT2 -0.00000004
GRIK3 -0.00000004
ZNF483 0.00000004
RP11-627G18.1 0.00000003
Any ideas what I'm doing wrong?
Thanks
Loading your file with:
df = pd.read_csv('File.txt', sep='\t', header=None)
Since sort(...) is deprecated (and has been removed from later pandas versions), use sort_values instead:
b = df.sort_values(by=[1], axis=0, ascending=True)
where [1] is your column of values. For me this returns:
0 1
0 ACTA1 -0.582570
1 MT-CO1 -0.543877
2 CKM -0.338265
3 MT-ND1 -0.306239
5 MT-CYB -0.128241
6 PDK4 -0.119309
8 GAPDH -0.090912
9 MYH1 -0.087777
12 RP5-940J5.9 -0.074280
13 MYH2 -0.072261
16 MT-ND2 -0.052551
18 MYL1 -0.049142
19 DES -0.048289
20 ALDOA -0.047661
22 ENO3 -0.046251
23 MT-CO2 -0.043684
26 RP11-799N11.1 -0.034972
28 TNNT3 -0.032226
29 MYBPC2 -0.030861
32 TNNI2 -0.026707
33 KLHL41 -0.026669
34 SOD2 -0.026166
35 GLUL -0.026122
42 TRIM63 -0.022971
47 FLNC -0.018180
48 ATP2A1 -0.017752
49 PYGM -0.016934
55 hsa-mir-6723 -0.015859
56 MT1A -0.015110
57 LDHA -0.014955
.. ... ...
60 RP1-178F15.4 0.013383
58 HSPB1 0.014894
54 UBB 0.015874
53 MIR1282 0.016318
52 ALDH2 0.016441
51 FTL 0.016543
50 RP11-317J10.2 0.016799
46 RP11-290D2.6 0.018803
45 RRAD 0.019449
44 MYF6 0.019954
43 STAC3 0.021931
41 RP11-138I1.4 0.023031
40 MYBPC1 0.024407
39 PDLIM3 0.025442
38 ANKRD1 0.025458
37 FTH1 0.025526
36 MT-RNR2 0.025887
31 HSPB6 0.027680
30 RP11-451G4.2 0.029969
27 AC002398.12 0.033219
25 MT-RNR1 0.040741
24 TNNC1 0.042251
21 TNNT1 0.047177
17 MT-ND3 0.051963
15 MTND1P23 0.059405
14 MB 0.063896
11 MYL2 0.076358
10 MT-ND5 0.076479
7 CA3 0.100221
4 MT-ND6 0.140729
[18152 rows x 2 columns]
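A self-contained sketch of the difference follows, assuming a few made-up tab-separated rows in place of `File.txt`. It sorts by the signed value, as in the answer, and then by magnitude for contrast; the `key` argument of `sort_values` needs pandas 1.1 or later:

```python
import io
import pandas as pd

# Hypothetical tab-separated contents, mirroring the layout in the question.
raw = (
    "SPATA1\t-0.00000005\n"
    "HMBOX1\t0.00000005\n"
    "GNRH1\t-0.00000004\n"
    "CXCL1\t0.00000004\n"
)
df = pd.read_csv(io.StringIO(raw), sep="\t", header=None)

# With header=None the columns are labelled 0 and 1; column 1 holds the numbers.
print(df.dtypes)

# Sort by the signed value: negatives come first.
by_value = df.sort_values(by=1, ascending=True)

# Sort by magnitude, ignoring the sign (largest absolute value first).
by_magnitude = df.sort_values(by=1, key=lambda s: s.abs(), ascending=False)

print(by_value)
print(by_magnitude)
```

The listing in the question looks like the magnitude ordering, which would be consistent with the file already being stored that way and the old `sort(axis=0)` call ordering by the row index, not by the value column. That reading is a guess from the output shown, not something the thread confirms.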
