Sort through a Pandas dataframe and save unique entries - python

I'm trying to figure out how to sort through rows in a spreadsheet read with pandas and save values to variables.
Here is my code so far:
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
df = pd.read_excel('data_file.xlsx', sheetname='Sheet 1')
for line in df:
    if line.startswith(line):
The data is formatted the following way:
Column 1 has runner numbers, column 2 has 100 meter sprint times, Column 3 has 400 meter sprint times.
Here's an example of the data:
Runner 100m 400m
1 43.7 93.5
1 37.5 87.6
1 39.2 82.5
2 28.9 67.9
2 26.2 69.9
2 33.3 60.25
2 34.2 60.65
3 19.9 45.5
3 19.8 44.0
4 18.7 50.0
4 19.0 52.4
How could I store the contents of all the rows starting with 1 in a unique variable, all the rows starting with 2 in another variable, 3, etc.? I know this has to involve a loop of some sort but I'm not sure about how to approach this problem.

Generally, you want to avoid programmatically creating uniquely named variables. This problem is best approached with a dictionary, keyed by each "Runner" ID (the runner IDs need to be unique).
You can quickly iterate through the data for each runner using pandas groupby. In the loop below, i is the "Runner" ID and tdf is a dataframe holding just that runner's rows. This stores a NumPy array of each runner's data in the dict d.
d = {}
for i, tdf in df.groupby('Runner'):
    d[i] = tdf[['100m', '400m']].values
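With the data above, d[1] would then hold a 3x2 array of runner 1's 100m and 400m times. A minimal self-contained sketch, using a subset of the sample data, looks like this:
import pandas as pd
# small illustration with a few rows of the question's data
df = pd.DataFrame({'Runner': [1, 1, 2, 2],
                   '100m': [43.7, 37.5, 28.9, 26.2],
                   '400m': [93.5, 87.6, 67.9, 69.9]})
d = {i: tdf[['100m', '400m']].values for i, tdf in df.groupby('Runner')}
print(d[1])  # [[43.7 93.5]
             #  [37.5 87.6]]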
EDIT:
If you really want to iterate line by line, you can use the df.iterrows() method.
d = {}
for i, x in df.iterrows():
    runner = x['Runner']
    data = x[['100m', '400m']].tolist()
    # d.get(runner, []).append(data) would store None (list.append returns None),
    # so use setdefault, which returns the stored list and appends in place
    d.setdefault(runner, []).append(data)

Related

How to add and update a value in pandas df each time a new value is found?

Most of the other questions about updating values in a pandas df focus on appending a new column or simply replacing a cell with a new value. My question is a bit different. Assuming my df already has values in it and I find a new value, I need to add it to the cell to update its value. For example, if a cell already holds 5 and I find the value 10 in my file for that column/row, the cell should now hold 15.
But I am having trouble writing this bit of code and even getting values to show up in my dataframe.
I have a dictionary, for example:
id_dict={'Treponema': ['162'], 'Leptospira': ['174'], 'Azospirillum': ['192'], 'Campylobacter': ['195', '197', '199', '201'], 'Pseudomonas': ['287'], 'NONE': ['2829358', '2806529']}
And I have sample id files that contain ids and the number of times each id showed up in a previous file; the first value is the count and the second value is the id.
cat Sample1_idsummary.txt
1,162
15,174
4,195
5,197
6,201
10,2829358
Some of the ids have the same key in id_dict and I need to create a dataframe like the following:
Sample Treponema Leptospira Azospirillum Campylobacter Pseudomonas NONE
0 sample1 1 15 0 15 0 10
Here is my script, but my issue is that my output is always zero for all columns.
samplefile = sys.argv[1]
sample_ID = samplefile.split("_")[0]  ## get just ID name

def get_ids_counts(id_dict, samplefie):
    '''Obtain a table of id counts from the samplefile.'''
    column_names = ["Sample"]
    column_names.extend([x for x in list(id_dict.keys())])
    df = pd.DataFrame(columns=column_names)
    df["Sample"] = [sample_ID]
    with open(samplefile) as sf:  # open the sample taxid count file
        for line in sf:
            id = line.split(",")[1]  # the taxid (multiple can hit the same lineage info)
            idcount = int(line.split(",")[0])  # the count from uniq
            # For all keys in the dict, if that key is in the sample id file use the count from the id file
            # Otherwise all keys not found in the file are "0" in the df
            if id in id_dict:
                df[list(id_dict.keys())[list(id_dict.values().index(id))]] = idcount
    return df.fillna(0)
It's the very last if statement that is confusing me. How to make idcount add each time it gives the same key and why do I always get zeros filled in?
The method mentioned below worked! Here is the updated code:
def get_ids_counts(id_dict, samplefie):
    '''Obtain a table of id counts from the samplefile.'''
    df = pd.DataFrame([id_dict]).stack().explode().to_frame('id').droplevel(0).reset_index().astype({'id': int})
    iddf = pd.read_csv(samplefile, sep=",", names=["count", "id"])
    df = df.merge(iddf, how='outer').fillna(0).groupby('index')['count'].sum().to_frame(sample_ID).T
    return df
And the output, which is still not coming up right:
index 0 Azospirillaceae Campylobacteraceae Leptospiraceae NONE Pseudomonadacea Treponemataceae
mini 106.0 0.0 20.0 0.0 0.0 0.0 5.0
UPDATE 2
With the code below and using my proper files I've managed to get the table but cannot for the life of me get the "NONE" column to show up anymore. Any suggestions? My output is essentially every key value with proper counts but "NONE" disappears.
Instead of doing it that way iteratively, you can automate it and let pandas perform those operations.
Start by creating the dataframe from id_dict:
df = pd.DataFrame([id_dict]).stack().explode().to_frame('id').droplevel(0).reset_index()\
.astype({'id': int})
index id
0 Treponema 162
1 Leptospira 174
2 Azospirillum 192
3 Campylobacter 195
4 Campylobacter 197
5 Campylobacter 199
6 Campylobacter 201
7 Pseudomonas 287
8 NONE 2829358
9 NONE 2806529
Read the count/id text file into a data frame:
idDF = pd.read_csv('Sample1_idsummary.txt', sep=',' , names=['count', 'id'])
count id
0 1 162
1 15 174
2 4 195
3 5 197
4 6 201
5 10 2829358
Now outer-merge both dataframes, fill the NaNs with 0, group by index, call sum, build the resulting dataframe with to_frame (passing the sample name as the column name), and finally transpose it:
df.merge(idDF, how='outer').fillna(0).groupby('index')['count'].sum().to_frame('Sample1').T
OUTPUT:
index Azospirillum Campylobacter Leptospira NONE Pseudomonas Treponema
Sample1 0.0 15.0 15.0 10.0 0.0 1.0
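Putting the steps together, a minimal self-contained sketch of this pipeline (using the id_dict and counts from the question; an in-memory buffer stands in for the sample file, and 'Sample1' is the column name from the example) looks like this:
from io import StringIO
import pandas as pd
id_dict = {'Treponema': ['162'], 'Leptospira': ['174'], 'Azospirillum': ['192'],
           'Campylobacter': ['195', '197', '199', '201'], 'Pseudomonas': ['287'],
           'NONE': ['2829358', '2806529']}
sample_file = StringIO('1,162\n15,174\n4,195\n5,197\n6,201\n10,2829358\n')
# long format: one row per (genus, id) pair
df = pd.DataFrame([id_dict]).stack().explode().to_frame('id').droplevel(0).reset_index().astype({'id': int})
# the observed counts
idDF = pd.read_csv(sample_file, sep=',', names=['count', 'id'])
# outer merge, sum counts per genus, one row named after the sample
out = df.merge(idDF, how='outer').fillna(0).groupby('index')['count'].sum().to_frame('Sample1').T
print(out)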

How could I transform the numpy array to pandas dataframe?

I am new to data analysis with Python, and I wonder how I can transform the format of the left table to the right one. My initial thought is to create a nested for loop.
The desired table
First, I read the required csv file.
Imported csv
Then, I count the number of countries in the column 'country' and the number of new column names.
`countries = len(test['country'])`
`columns = len(['Year', 'Values'])`
After that I should use a nested for loop, but I have no idea how to write the code. What I have come up with so far is:
`for i in countries:`
`for j in columns:`
You can use df.melt here:
In [3575]: df = pd.DataFrame({'country':['Afghanistan', 'Albania'], '1970':[1.36, 6.1], '1971':[1.39, 6.22], '1972':[1.43, 6.34]})
In [3576]: df
Out[3576]:
country 1970 1971 1972
0 Afghanistan 1.36 1.39 1.43
1 Albania 6.10 6.22 6.34
In [3609]: df = df.melt('country', var_name='Year', value_name='Values').sort_values('country')
In [3610]: df
Out[3610]:
country Year Values
0 Afghanistan 1970 1.36
2 Afghanistan 1971 1.39
4 Afghanistan 1972 1.43
1 Albania 1970 6.10
3 Albania 1971 6.22
5 Albania 1972 6.34
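If the data is read from a csv file as in the question (into a dataframe called test), the same melt call applies directly; a small sketch, with the file name as a placeholder:
import pandas as pd
# 'your_file.csv' is a placeholder for the csv imported in the question
test = pd.read_csv('your_file.csv')
result = test.melt('country', var_name='Year', value_name='Values').sort_values('country')
print(result.head())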
Not sure of what you want to do, but:
If you want to transform a column into a numpy array, you can use the following example:
import pandas as pd
import numpy as np
df = pd.DataFrame({"foo": [1,2,3], "bar": [10,20,30]})
print(df)
foo_array = np.array(df["foo"])
print(foo_array)
and then iterate over foo_array.
You can also loop over your dataframe using:
for row in df.iterrows():
print(row)
But this is not recommended, since you can often use built-in pandas functions to do the same job.
Note that your dataframe is also an iterable object, but iterating over it only yields the column names:
for d in df:
print(d)
# output:
# foo
# bar

Deleting entire rows of a dataset for outliers found in a single column

I am currently trying to remove the outlier values from my dataset, using the median absolute deviation method.
To do so, I followed the instructions given by #tanemaki in Detect and exclude outliers in Pandas data frame, which enables the deletion of entire rows that hold at least one outlier value.
In the post I linked, the same question was asked, but was not answered.
The problem is that I only want the outliers to be searched in a single column.
So, for example, my dataframe looks like:
Temperature Date
1 24.72 2.3
2 25.76 4.6
3 25.42 7.0
4 40.31 9.3
5 26.21 15.6
6 26.59 17.9
For example, there are two anomalies in the data:
The Temperature value in row [4]
The Date value in row [5]
So, what I want is for the outlier function to only 'notice' the anomaly in the Temperature column, and delete its corresponding row.
The outlier code I am using is:
df=pd.read_excel(r'/home/.../myfile.xlsx')
from scipy import stats
df[pd.isnull(df)]=0
dfn=df[(np.abs(stats.zscore(df))<4).all(axis=1)] ##taneski
print(dfn)
And my resulting data frame currently looks like:
Temperature Date
1 24.72 2.3
2 25.76 4.6
3 25.42 7.0
6 26.59 17.9
In case I am not getting my message across, the desired output would be:
Temperature Date
1 24.72 2.3
2 25.76 4.6
3 25.42 7.0
5 26.21 15.6
6 26.59 17.9
Any pointers would be of great help. Thanks!
You can always limit the stats.zscore operation to only the Temperature column instead of the whole df. Like this maybe:
In [573]: dfn = df[(np.abs(stats.zscore(df['Temperature']))<4)]
In [574]: dfn
Out[574]:
Temperature Date
1 24.72 2.3
2 25.76 4.6
3 25.42 7.0
5 26.21 15.6
6 26.59 17.9
At the moment, you're calculating the zscores for the whole dataframe and then filtering the dataframe with those calculated scores; what you want to do is just apply the same idea to one column.
Instead of
dfn=df[(np.abs(stats.zscore(df))<4).all(axis=1)]
You want to have
df[np.abs(stats.zscore(df["Temperature"])) < 4]
As a side note, I found that I was unable to get your example results by comparing the zscores to 4; I had to switch it down to 2.
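For reference, a self-contained sketch with the sample data from the question, using that threshold of 2 (with only six rows, the z-score of the 40.31 reading is only about 2.2):
import numpy as np
import pandas as pd
from scipy import stats
df = pd.DataFrame({'Temperature': [24.72, 25.76, 25.42, 40.31, 26.21, 26.59],
                   'Date': [2.3, 4.6, 7.0, 9.3, 15.6, 17.9]},
                  index=[1, 2, 3, 4, 5, 6])
# keep only rows whose Temperature z-score is within the threshold
dfn = df[np.abs(stats.zscore(df['Temperature'])) < 2]
print(dfn)  # row 4 (the 40.31 reading) is dropped; row 5 is kept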

How to enumerate rows in pandas with nonunique values in groups

I am working with expedition geodata. Could you help with enumerating stations and records for the same station, depending on expedition ID (ID), date (Date), latitude (Lat), longitude (Lon) and some value (Val, which is not relevant for the enumeration)? Assume that a station is a group of rows with the same (ID, Date, Lat, Lon), and an expedition is a group of rows with the same ID.
The dataframe is sorted by these 4 columns, as in the example.
Dataset and required columns
import pandas as pd
data = [[1, '2017/10/10', 70.1, 30.4, 10],
        [1, '2017/10/10', 70.1, 31.4, 20],
        [1, '2017/10/10', 70.1, 31.4, 10],
        [1, '2017/10/10', 70.1, 31.4, 10],
        [1, '2017/10/12', 70.1, 31.4, 20],
        [2, '2017/12/10', 70.1, 30.4, 20],
        [2, '2017/12/10', 70.1, 31.4, 20]]
df = pd.DataFrame(data, columns=['ID', 'Date', 'Lat', 'Lon', 'Val'])
Additionally, I need St (station number) and Rec (record number within the same station's data); the expected output for the example above is:
df['St'] = [1,2,2,2,3,1,2];
df['Rec'] = [1,1,2,3,1,1,1];
print(df)
I tried groupby/cumcount/agg/factorize but have not solved my problem.
Any help! Thanks!
To create 'St', you can use groupby on 'ID', check where any of the columns 'Date', 'Lat', 'Lon' differs from the previous row using shift, and then use cumsum to get the numbers you want, such as:
df['St'] = (df.groupby(['ID'])
            .apply(lambda x: (x[['Date','Lat','Lon']].shift() != x[['Date','Lat','Lon']])
            .any(axis=1).cumsum())).values
And to create 'Rec', you also need groupby, but on all the columns 'ID', 'Date', 'Lat', 'Lon', then use cumcount and add 1, such as:
df['Rec'] = df.groupby(['ID','Date','Lat','Lon']).cumcount().add(1)
and you get:
ID Date Lat Lon Val St Rec
0 1 2017/10/10 70.1 30.4 10 1 1
1 1 2017/10/10 70.1 31.4 20 2 1
2 1 2017/10/10 70.1 31.4 10 2 2
3 1 2017/10/10 70.1 31.4 10 2 3
4 1 2017/10/12 70.1 31.4 20 3 1
5 2 2017/12/10 70.1 30.4 20 1 1
6 2 2017/12/10 70.1 31.4 20 2 1
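As a side note (a sketch, not part of the answer above): for data sorted like this, the same 'St' numbering can also be obtained by labelling each distinct (Date, Lat, Lon) combination within an ID with groupby ngroup:
# alternative 'St': number the distinct stations within each expedition
df['St'] = (df.groupby('ID', group_keys=False)
              .apply(lambda g: g.groupby(['Date', 'Lat', 'Lon'], sort=False).ngroup() + 1))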

Column Manipulations with date-Time Pandas

I am trying to do some column manipulations involving rows and columns at the same time, including date and time series, in Pandas. Traditionally, without series, Python dictionaries work great, but with Pandas this is new to me.
Input files: N of them (File1.csv, File2.csv, File3.csv, ........... Filen.csv), for example:
File1.csv:        File2.csv:        File3.csv:
Ids,Date-time-1   Ids,Date-time-2   Ids,Date-time-1
56,4568           645,5545          25,54165
45,464            458,546
I am trying to merge the Date-time columns of all the files into one big data file with respect to Ids:
Ids,Date-time-ref,Date-time-1,date-time-2
56,100,4468,NAN
45,150,314,NAN
645,50,NAN,5495
458,200,NAN,346
25,250,53915,NAN
Check for the date-time column; if there is no match, create one, and then fill the values for each Id by subtracting the date-time-ref value of that respective Id from the current date-time value.
Fill empty places with NaN, and if a later file provides that value, replace the NaN with the new value.
If it were a straight column subtraction it would be pretty easy, but keeping it in sync with the date-time series and with respect to Ids seems a bit confusing.
Appreciate some suggestions to begin with. Thanks in advance.
Here is one way to do it.
import pandas as pd
import numpy as np
from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO
# your csv file contents
csv_file1 = 'Ids,Date-time-1\n56,4568\n45,464\n'
csv_file2 = 'Ids,Date-time-2\n645,5545\n458,546\n'
# add a duplicated Ids record for testing purpose
csv_file3 = 'Ids,Date-time-1\n25,54165\n645, 4354\n'
csv_file_all = [csv_file1, csv_file2, csv_file3]
# read csv into df using list comprehension
# I use buffer here, replace stringIO with your file path
df_all = [pd.read_csv(StringIO(csv_file)) for csv_file in csv_file_all]
# processing
# =====================================================
# concat along axis=0, outer join on axis=1
merged = pd.concat(df_all, axis=0, ignore_index=True, join='outer').set_index('Ids')
Out[206]:
Date-time-1 Date-time-2
Ids
56 4568 NaN
45 464 NaN
645 NaN 5545
458 NaN 546
25 54165 NaN
645 4354 NaN
# custom function to handle/merge duplicates on Ids (axis=0)
def apply_func(group):
return group.fillna(method='ffill').iloc[-1]
# remove Ids duplicates
merged_unique = merged.groupby(level='Ids').apply(apply_func)
Out[207]:
Date-time-1 Date-time-2
Ids
25 54165 NaN
45 464 NaN
56 4568 NaN
458 NaN 546
645 4354 5545
# do the subtraction
master_csv_file = 'Ids,Date-time-ref\n56,100\n45,150\n645,50\n458,200\n25,250\n'
df_master = pd.read_csv(StringIO(master_csv_file), index_col=['Ids']).sort_index()
# select matching records and horizontal concat
df_matched = pd.concat([df_master,merged_unique.reindex(df_master.index)], axis=1)
# use broadcasting
df_matched.iloc[:, 1:] = df_matched.iloc[:, 1:].sub(df_matched.iloc[:, 0], axis=0)
Out[208]:
Date-time-ref Date-time-1 Date-time-2
Ids
25 250 53915 NaN
45 150 314 NaN
56 100 4468 NaN
458 200 NaN 346
645 50 4304 5495
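As the comment in the code notes, the StringIO buffers just stand in for real file paths; if the inputs are actual csv files on disk, a minimal sketch of the read step (the file pattern here is an assumption) would be:
import glob
import pandas as pd
# hypothetical pattern; adjust to wherever File1.csv ... Filen.csv live
paths = sorted(glob.glob('File*.csv'))
df_all = [pd.read_csv(p) for p in paths]
merged = pd.concat(df_all, axis=0, ignore_index=True, join='outer').set_index('Ids')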
