Reading a pandas data frame having unequal columns in observations - python

I am trying to read this small data file,
Link - https://drive.google.com/open?id=1nAS5mpxQLVQn9s_aAKvJt8tWPrP_DUiJ
I am using the code -
df = pd.read_table('/Data/123451_date.csv', sep=';', index_col=0, engine='python', error_bad_lines=False)
It has ';' as a seprator, and values are missing in the file for some columns values in some observations (or rows).
How can I read it properly. I see the current dataframe, which is not loaded properly.

It looks like the data you use has some garbage in it. Precisely, rows 1-33 (inclusive) have additional, unnecessary (non-GPS) information included. You can either fix the database by manually removing the unneeded information from the datasheet, or use following code snippet to skip the rows that include it:
from pandas import read_table
data = read_table('34_2017-02-06.gpx.csv', sep=';', skiprows=list(range(1, 34)).drop("Unnamed: 28", axis=1)
The drop("Unnamed: 28", axis=1) is simply there to remove an additional column that is created probably due to each row in your datasheet ending with a ; (because it reads the empty space at the end of each line as data).
The result of print(data.head()) is then as follows:
index cumdist ele ... esttotalpower lat lon
0 49 340 -34.8 ... 9 52.077362 5.114530
1 51 350 -34.8 ... 17 52.077468 5.114543
2 52 360 -35.0 ... -54 52.077521 5.114551
3 53 370 -35.0 ... -173 52.077603 5.114505
4 54 380 -34.8 ... 335 52.077677 5.114387
[5 rows x 28 columns]
To explain the role of the drop command even more, here is what would happen without it (notice the last, weird column)
index cumdist ele ... lat lon Unnamed: 28
0 49 340 -34.8 ... 52.077362 5.114530 NaN
1 51 350 -34.8 ... 52.077468 5.114543 NaN
2 52 360 -35.0 ... 52.077521 5.114551 NaN
3 53 370 -35.0 ... 52.077603 5.114505 NaN
4 54 380 -34.8 ... 52.077677 5.114387 NaN
[5 rows x 29 columns]

Related

How can I subtract two panda data frame columns without getting an index error? [duplicate]

In python, how can I reference previous row and calculate something against it? Specifically, I am working with dataframes in pandas - I have a data frame full of stock price information that looks like this:
Date Close Adj Close
251 2011-01-03 147.48 143.25
250 2011-01-04 147.64 143.41
249 2011-01-05 147.05 142.83
248 2011-01-06 148.66 144.40
247 2011-01-07 147.93 143.69
Here is how I created this dataframe:
import pandas
url = 'http://ichart.finance.yahoo.com/table.csv?s=IBM&a=00&b=1&c=2011&d=11&e=31&f=2011&g=d&ignore=.csv'
data = data = pandas.read_csv(url)
## now I sorted the data frame ascending by date
data = data.sort(columns='Date')
Starting with row number 2, or in this case, I guess it's 250 (PS - is that the index?), I want to calculate the difference between 2011-01-03 and 2011-01-04, for every entry in this dataframe. I believe the appropriate way is to write a function that takes the current row, then figures out the previous row, and calculates the difference between them, the use the pandas apply function to update the dataframe with the value.
Is that the right approach? If so, should I be using the index to determine the difference? (note - I'm still in python beginner mode, so index may not be the right term, nor even the correct way to implement this)
I think you want to do something like this:
In [26]: data
Out[26]:
Date Close Adj Close
251 2011-01-03 147.48 143.25
250 2011-01-04 147.64 143.41
249 2011-01-05 147.05 142.83
248 2011-01-06 148.66 144.40
247 2011-01-07 147.93 143.69
In [27]: data.set_index('Date').diff()
Out[27]:
Close Adj Close
Date
2011-01-03 NaN NaN
2011-01-04 0.16 0.16
2011-01-05 -0.59 -0.58
2011-01-06 1.61 1.57
2011-01-07 -0.73 -0.71
To calculate difference of one column. Here is what you can do.
df=
A B
0 10 56
1 45 48
2 26 48
3 32 65
We want to compute row difference in A only and want to consider the rows which are less than 15.
df['A_dif'] = df['A'].diff()
df=
A B A_dif
0 10 56 Nan
1 45 48 35
2 26 48 19
3 32 65 6
df = df[df['A_dif']<15]
df=
A B A_dif
0 10 56 Nan
3 32 65 6
I don't know pandas, and I'm pretty sure it has something specific for this; however, I'll give you the pure-Python solution, that might be of some help even if you need to use pandas:
import csv
import urllib
# This basically retrieves the CSV files and loads it in a list, converting
# All numeric values to floats
url='http://ichart.finance.yahoo.com/table.csv?s=IBM&a=00&b=1&c=2011&d=11&e=31&f=2011&g=d&ignore=.csv'
reader = csv.reader(urllib.urlopen(url), delimiter=',')
# We sort the output list so the records are ordered by date
cleaned = sorted([[r[0]] + map(float, r[1:]) for r in list(reader)[1:]])
for i, row in enumerate(cleaned): # enumerate() yields two-tuples: (<id>, <item>)
# The try..except here is to skip the IndexError for line 0
try:
# This will calculate difference of each numeric field with the same field
# in the row before this one
print row[0], [(row[j] - cleaned[i-1][j]) for j in range(1, 7)]
except IndexError:
pass

Pandas: calculating mean value of multiple columns using datetime and Grouper removes columns or doesn't return correct Dataframe

As part of a larger task, I want to calculate the monthly mean values for each specific station. This is already difficult to do, but I am getting close.
The dataframe has many columns, but ultimately I only use the following information:
Date Value Station_Name
0 2006-01-03 18 2
1 2006-01-04 12 2
2 2006-01-05 11 2
3 2006-01-06 10 2
4 2006-01-09 22 2
... ... ...
3510 2006-12-23 47 45
3511 2006-12-24 46 45
3512 2006-12-26 35 45
3513 2006-12-27 35 45
3514 2006-12-30 28 45
I am running into two issues, using:
df.groupby(['Station_Name', pd.Grouper(freq='M')])['Value'].mean()
It results in something like:
Station_Name Date
2 2003-01-31 29.448387
2003-02-28 30.617857
2003-03-31 28.758065
2003-04-30 28.392593
2003-05-31 30.318519
...
45 2003-09-30 16.160000
2003-10-31 18.906452
2003-11-30 26.296667
2003-12-31 30.306667
2004-01-31 29.330000
Which I can't seem to use as a regular dataframe, and the datetime is messed up as it doesn't show the monthly mean but gives the last day back. Also the station name is a single index, and not for the whole column. Plus the mean value doesn't have a "column name" at all. This isn't a dataframe, but a pandas.core.series.Series. I can't convert this again because it's not correct, and using the .to_frame() method shows that it is still indeed a Dataframe. I don't get this part.
I found that in order to return a normal dataframe, to use
as_index = False
In the groupby method. But this results in the months not being shown:
df.groupby(['station_name', pd.Grouper(freq='M')], as_index = False)['Value'].mean()
Gives:
Station_Name Value
0 2 29.448387
1 2 30.617857
2 2 28.758065
3 2 28.392593
4 2 30.318519
... ... ...
142 45 16.160000
143 45 18.906452
144 45 26.296667
145 45 30.306667
146 45 29.330000
I can't just simply add the month later, as not every station has an observation in every month.
I've tried using other methods, such as
df.resample("M").mean()
But it doesn't seem possible to do this on multiple columns. It returns the mean value of everything.
Edit: This is ultimately what I would want.
Station_Name Date Value
0 2 2003-01 29.448387
1 2 2003-02 30.617857
2 2 2003-03 28.758065
3 2 2003-04 28.392593
4 2 2003-05 30.318519
... ... ...
142 45 2003-08 16.160000
143 45 2003-09 18.906452
144 45 2003-10 26.296667
145 45 2003-11 30.306667
146 45 2003-12 29.330000
ok , how baout this :
df = df.groupby(['Station_Name',df['Date'].dt.to_period('M')])['Value'].mean().reset_index()
outut:
>>
Station_Name Date Value
0 2 2006-01 14.6
1 45 2006-12 38.2

Find occurences where a column from one dataframe equals another, based on condition

I have the following two dataframes, which have different size: df1 (966 rows x 2 cols), df2 (36 rows, 2 cols), where
df1:
Video_# Selected Joint.1
484 1 Left_shoulder
778 1 Left_shoulder
418 1 Right_shoulder
964 1 Right_shoulder
193 1 Right_shoulder
... ... ... ... ...
285 36 Right_elbow
267 36 Left_hand
216 36 Shoulder_centre
139 36 Right_shoulder
df2:
Video_# Ann.1
0 1 Shoulder_center
1 2 Head
2 3 Right_hip
... ... ... ... ...
33 34 Left_knee
34 35 Right_knee
35 36 Right_shoulder
Where Video_# goes from 1-36. In df2 the Video_# column is just a single occurrence of each, so 1-36 just once. Whereas in df1 there are multiple occurrences of each 1-36, not the same number for each 1-36 (I hope that makes sense).
What I want to check is the number of occurrences df1['Selected Joint.1'] == df2['Ann.1'] based on Video_#. So the expected output is (eg.):
Video_# Equality Occurrences
1 3
2 5
... ... ...
36 6
Is that possible?
Use DataFrame.merge with GroupBy.size:
df = (df1.merge(df2,
left_on=['Video_#','Selected Joint.1'],
right_on=['Video_#','Ann.1'])
.groupby('Video_')
.size()
.reset_index(name='Equality Occurrences'))

Selecting top % of rows in pandas

I have a sample dataframe as below (actual dataset is roughly 300k entries long):
user_id revenue
----- --------- ---------
0 234 100
1 2873 200
2 827 489
3 12 237
4 8942 28934
... ... ...
96 498 892384
97 2345 92
98 239 2803
99 4985 98332
100 947 4588
which displays the revenue generated by users. I would like to select the rows where the top 20% of the revenue is generated (hence giving the top 20% revenue generating users).
The methods that come closest to mind for me is calculating the total number of users, working out 20% of this ,sorting the dataframe with sort_values() and then using head() or nlargest(), but I'd like to know if there is a simpler and elegant way.
Can anybody propose a way for this?
Thank you!
Suppose You have dataframe df:
user_id revenue
234 21
2873 20
827 23
12 23
8942 28
498 22
2345 20
239 24
4985 21
947 25
I've flatten revenue distribution to show the idea.
Now calculating step by step:
df = pd.read_clipboard()
df = df.sort_values(by = 'revenue', ascending = False)
df['revenue_cum'] = df['revenue'].cumsum()
df['%revenue_cum'] = df['revenue_cum']/df['revenue'].sum()
df
result:
user_id revenue revenue_cum %revenue_cum
4 8942 28 28 0.123348
9 947 25 53 0.233480
7 239 24 77 0.339207
2 827 23 100 0.440529
3 12 23 123 0.541850
5 498 22 145 0.638767
0 234 21 166 0.731278
8 4985 21 187 0.823789
1 2873 20 207 0.911894
6 2345 20 227 1.000000
Only 2 top users generate 23.3% of total revenue.
This seems to be the case for df.quantile, from pandas documentation if you are looking for the top 20% all you need to do is pass the correct quantile value you desire.
A case example from your dataset:
import pandas as pd
import numpy as np
df = pd.DataFrame({'user_id':[234,2873,827,12,8942],
'revenue':[100,200,489,237,28934]})
df.quantile([0.8,1],interpolation='nearest')
This would print the top 2 rows in value:
user_id revenue
0.8 2873 489
1.0 8942 28934
I usually find useful to use sort_values to see the cumulative effect of every row and then keep rows up to some threshold:
# Sort values from highest to lowest:
df = df.sort_values(by='revenue', ascending=False)
# Add a column with aggregated effect of the row:
df['cumulative_percentage'] = 100*df.revenue.cumsum()/df.revenue.sum()
# Define the threshold I need to analyze and keep those rows:
min_threshold = 30
top_percent = df.loc[df['cumulative_percentage'] <= min_threshold]
The original df will be nicely sorted with a clear indication of the top contributing rows and the created 'top_percent' df will contain the rows that need to be analyzed in particular.
I am assuming you are looking for the cumulative top 20% revenue generating users. Here is a function that will help you get the expected output and even more. Just specify your dataframe, column name of the revenue and the n_percent you are looking for:
import pandas as pd
def n_percent_revenue_generating_users(df, col, n_percent):
df.sort_values(by=[col], ascending=False, inplace=True)
df[f'{col}_cs'] = df[col].cumsum()
df[f'{col}_csp'] = 100*df[f'{col}_cs']/df[col].sum()
df_ = df[df[f'{col}_csp'] > n_percent]
index_nearest = (df_[f'{col}_csp']-n_percent).abs().idxmin()
threshold_revenue = df_.loc[index_nearest, col]
output = df[df[col] >= threshold_revenue].drop(columns=[f'{col}_cs', f'{col}_csp'])
return output
n_percent_revenue_generating_users(df, 'revenue', 20)

Calculating difference between two rows in Python / Pandas

In python, how can I reference previous row and calculate something against it? Specifically, I am working with dataframes in pandas - I have a data frame full of stock price information that looks like this:
Date Close Adj Close
251 2011-01-03 147.48 143.25
250 2011-01-04 147.64 143.41
249 2011-01-05 147.05 142.83
248 2011-01-06 148.66 144.40
247 2011-01-07 147.93 143.69
Here is how I created this dataframe:
import pandas
url = 'http://ichart.finance.yahoo.com/table.csv?s=IBM&a=00&b=1&c=2011&d=11&e=31&f=2011&g=d&ignore=.csv'
data = data = pandas.read_csv(url)
## now I sorted the data frame ascending by date
data = data.sort(columns='Date')
Starting with row number 2, or in this case, I guess it's 250 (PS - is that the index?), I want to calculate the difference between 2011-01-03 and 2011-01-04, for every entry in this dataframe. I believe the appropriate way is to write a function that takes the current row, then figures out the previous row, and calculates the difference between them, the use the pandas apply function to update the dataframe with the value.
Is that the right approach? If so, should I be using the index to determine the difference? (note - I'm still in python beginner mode, so index may not be the right term, nor even the correct way to implement this)
I think you want to do something like this:
In [26]: data
Out[26]:
Date Close Adj Close
251 2011-01-03 147.48 143.25
250 2011-01-04 147.64 143.41
249 2011-01-05 147.05 142.83
248 2011-01-06 148.66 144.40
247 2011-01-07 147.93 143.69
In [27]: data.set_index('Date').diff()
Out[27]:
Close Adj Close
Date
2011-01-03 NaN NaN
2011-01-04 0.16 0.16
2011-01-05 -0.59 -0.58
2011-01-06 1.61 1.57
2011-01-07 -0.73 -0.71
To calculate difference of one column. Here is what you can do.
df=
A B
0 10 56
1 45 48
2 26 48
3 32 65
We want to compute row difference in A only and want to consider the rows which are less than 15.
df['A_dif'] = df['A'].diff()
df=
A B A_dif
0 10 56 Nan
1 45 48 35
2 26 48 19
3 32 65 6
df = df[df['A_dif']<15]
df=
A B A_dif
0 10 56 Nan
3 32 65 6
I don't know pandas, and I'm pretty sure it has something specific for this; however, I'll give you the pure-Python solution, that might be of some help even if you need to use pandas:
import csv
import urllib
# This basically retrieves the CSV files and loads it in a list, converting
# All numeric values to floats
url='http://ichart.finance.yahoo.com/table.csv?s=IBM&a=00&b=1&c=2011&d=11&e=31&f=2011&g=d&ignore=.csv'
reader = csv.reader(urllib.urlopen(url), delimiter=',')
# We sort the output list so the records are ordered by date
cleaned = sorted([[r[0]] + map(float, r[1:]) for r in list(reader)[1:]])
for i, row in enumerate(cleaned): # enumerate() yields two-tuples: (<id>, <item>)
# The try..except here is to skip the IndexError for line 0
try:
# This will calculate difference of each numeric field with the same field
# in the row before this one
print row[0], [(row[j] - cleaned[i-1][j]) for j in range(1, 7)]
except IndexError:
pass

Categories

Resources