I am trying to perform some simple mathematical operations on the files below. The columns in the first file (master_ids.csv) are dynamic in nature; the number of columns will increase from time to time, so we cannot assume a fixed last column.
master_ids.csv : Before any pre-processing
Ids,ref0 #the columns increase dynamically
1234,1000
8435,5243
2341,563
7352,345
master_count.csv : Before any processing
Ids,Name,lat,lon,ref1
1234,London,40.4,10.1,500
8435,Paris,50.5,20.2,400
2341,NewYork,60.6,30.3,700
7352,Japan,70.7,80.8,500
1234,Prague,40.4,10.1,100
8435,Berlin,50.5,20.2,200
2341,Austria,60.6,30.3,500
7352,China,70.7,80.8,300
master_Ids.csv : after one pre-processing
Ids,ref,00:30:00
1234,1000,500
8435,5243,300
2341,563,400
7352,345,500
master_count.csv: expected Output (Append/merge)
Ids,Name,lat,lon,ref1,00:30:00
1234,London,40.4,10.1,500,750
8435,Paris,50.5,20.2,400,550
2341,NewYork,60.6,30.3,700,900
7352,Japan,70.7,80.8,500,750
1234,Prague,40.4,10.1,100,350
8435,Berlin,50.5,20.2,200,350
2341,Austria,60.6,30.3,500,700
7352,China,70.7,80.8,300,550
E.g. Ids 1234 appears 2 times, so the value for Ids 1234 at the current time (00:30:00), which is 500, is divided by the count of occurrences (500 / 2 = 250) and then added to the corresponding ref1 values, creating a new column named after the current time.
master_Ids.csv : After another pre-processing
Ids,ref,00:30:00,00:45:00
1234,1000,500,100
8435,5243,300,200
2341,563,400,400
7352,345,500,600
master_count.csv: expected output after another execution (Merge/append)
Ids,Name,lat,lon,ref1,00:30:00,00:45:00
1234,London,40.4,10.1,500,750,550
8435,Paris,50.5,20.2,400,550,500
2341,NewYork,60.6,30.3,700,900,900
7352,Japan,70.7,80.8,500,750,800
1234,Prague,40.4,10.1,100,350,150
8435,Berlin,50.5,20.2,200,350,300
2341,Austria,60.6,30.3,500,700,700
7352,China,70.7,80.8,300,550,600
So here the current time is 00:45:00: we divide the current-time value by the count of Ids occurrences and then add it to the corresponding ref1 values, creating a new column for the new current time.
Program: By Jianxun Li
import pandas as pd
import numpy as np
csv_file1 = '/Data_repository/master_ids.csv'
csv_file2 = '/Data_repository/master_count.csv'
df1 = pd.read_csv(csv_file1).set_index('Ids')
# need to sort index in file 2
df2 = pd.read_csv(csv_file2).set_index('Ids').sort_index()
# df2 already has its own reference column (ref1), so join df1 without its 1st column (ref)
temp = df2.join(df1.iloc[:, 1:])
# do the division by the number of occurrences of each Ids
# and add the reference column (ref1) to every time-series column
def my_func(group):
    num_obs = len(group)
    # process the time-series columns (position 4 and onward)
    group.iloc[:,4:] = (group.iloc[:,4:]/num_obs).add(group.iloc[:,3], axis=0)
    return group
result = temp.groupby(level='Ids').apply(my_func)
The program executes with no errors but produces no output. I need some suggestions for fixing it, please.
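One immediate observation: result is computed on the last line but never printed or written anywhere, so a script run shows nothing. A minimal sketch of how to surface it, assuming the paths above are valid (the output filename below is only illustrative, not part of the original setup):
print(result)
# hypothetical output path, used here only for illustration
result.reset_index().to_csv('/Data_repository/master_count_updated.csv', index=False)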
This program assumes updating of both master_counts.csv and master_ids.csv over time and should be robust to the timing of the updates. That is, it should produce correct results if run multiple times on the same update or if an update is missed.
# this program updates (and replaces) the original master_counts.csv with data
# in master_ids.csv, so we only want the first 5 columns when we read it in
master_counts = pd.read_csv('master_counts.csv').iloc[:,:5]
# this file is assumed to be periodically updated with the addition of new columns
master_ids = pd.read_csv('master_ids.csv')
for i in range(2, len(master_ids.columns)):
    master_counts = master_counts.merge(master_ids.iloc[:, [0, i]], on='Ids')
    count = master_counts.groupby('Ids')['ref1'].transform('count')
    master_counts.iloc[:, -1] = master_counts['ref1'] + master_counts.iloc[:, -1] / count
master_counts.to_csv('master_counts.csv', index=False)
%more master_counts.csv
Ids,Name,lat,lon,ref1,00:30:00,00:45:00
1234,London,40.4,10.1,500,750.0,550.0
1234,Prague,40.4,10.1,100,350.0,150.0
8435,Paris,50.5,20.2,400,550.0,500.0
8435,Berlin,50.5,20.2,200,350.0,300.0
2341,NewYork,60.6,30.3,700,900.0,900.0
2341,Austria,60.6,30.3,500,700.0,700.0
7352,Japan,70.7,80.8,500,750.0,800.0
7352,China,70.7,80.8,300,550.0,600.0
import pandas as pd
import numpy as np
csv_file1 = '/home/Jian/Downloads/stack_flow_bundle/Data_repository/master_lac_Test.csv'
csv_file2 = '/home/Jian/Downloads/stack_flow_bundle/Data_repository/lat_lon_master.csv'
df1 = pd.read_csv(csv_file1).set_index('Ids')
Out[53]:
00:00:00 00:30:00 00:45:00
Ids
1234 1000 500 100
8435 5243 300 200
2341 563 400 400
7352 345 500 600
# need to sort index in file 2
df2 = pd.read_csv(csv_file2).set_index('Ids').sort_index()
Out[81]:
Name lat lon 00:00:00
Ids
1234 London 40.4 10.1 500
1234 Prague 40.4 10.1 500
2341 NewYork 60.6 30.3 700
2341 Austria 60.6 30.3 700
7352 Japan 70.7 80.8 500
7352 China 70.7 80.8 500
8435 Paris 50.5 20.2 400
8435 Berlin 50.5 20.2 400
# df1 and df2 have a duplicated column 00:00:00, so use df1 without its 1st column
temp = df2.join(df1.iloc[:, 1:])
Out[55]:
Name lat lon 00:00:00 00:30:00 00:45:00
Ids
1234 London 40.4 10.1 500 500 100
1234 Prague 40.4 10.1 500 500 100
2341 NewYork 60.6 30.3 700 400 400
2341 Austria 60.6 30.3 700 400 400
7352 Japan 70.7 80.8 500 500 600
7352 China 70.7 80.8 500 500 600
8435 Paris 50.5 20.2 400 300 200
8435 Berlin 50.5 20.2 400 300 200
# do the division by the number of occurrences of each Ids
# and add column 00:00:00
def my_func(group):
    num_obs = len(group)
    # process columns from 00:30:00 onward (inclusive)
    group.iloc[:,4:] = (group.iloc[:,4:]/num_obs).add(group.iloc[:,3], axis=0)
    return group
result = temp.groupby(level='Ids').apply(my_func)
Out[104]:
Name lat lon 00:00:00 00:30:00 00:45:00
Ids
1234 London 40.4 10.1 500 750 550
1234 Prague 40.4 10.1 500 750 550
2341 NewYork 60.6 30.3 700 900 900
2341 Austria 60.6 30.3 700 900 900
7352 Japan 70.7 80.8 500 750 800
7352 China 70.7 80.8 500 750 800
8435 Paris 50.5 20.2 400 550 500
8435 Berlin 50.5 20.2 400 550 500
My suggestion is to reformat your data so that it's like this:
Ids,ref0,time,ref1
1234,1000,None,None
8435,5243,None,None
2341,563,None,None
7352,345,None,None
Then after your "first preprocess" it will become like this:
Ids,ref0,time,ref1
1234,1000,None,None
8435,5243,None,None
2341,563,None,None
7352,345,None,None
1234,1000,00:30:00,500
8435,5243,00:30:00,300
2341,563,00:30:00,400
7352,345,00:30:00,500
. . . and so on. The idea is that you should make a single column to hold the time information, and then for each preprocess, insert the new data into new rows, and give those rows a value in the time column indicating what time period they come from. You may or may not want to keep the initial rows with "None" in this table; maybe you just want to start with the "00:30:00" values and keep the "master ids" in a separate file.
I haven't completely followed how you're computing the new ref1 values, but the point is that doing this is likely to greatly simplify your life. In general, instead of adding an unbounded number of new columns, it can be much nicer to add a single new column whose values are what you were otherwise going to use as headers for the open-ended new columns.
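A minimal sketch of that long-format idea, assuming the rule from the question still applies (the current-time value per Ids is divided by the number of rows for that Ids and added to ref1); the file and column names below are only illustrative:
import pandas as pd

counts = pd.read_csv('master_count.csv')        # Ids,Name,lat,lon,ref1
ids = pd.read_csv('master_ids.csv')             # Ids,ref0,00:30:00,...
current_time = ids.columns[-1]                  # e.g. '00:30:00'

# long format: one row per (Ids, Name, time) instead of one column per time
new_rows = counts.merge(ids[['Ids', current_time]], on='Ids')
occurrences = new_rows.groupby('Ids')['Ids'].transform('count')
new_rows['value'] = new_rows['ref1'] + new_rows[current_time] / occurrences
new_rows['time'] = current_time
long_table = new_rows[['Ids', 'Name', 'time', 'value']]
long_table.to_csv('master_long.csv', index=False)   # in practice, concat with the existing long table first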
Related
I have a table:
-60 -40 -20 0 20 40 60
100 520 440 380 320 280 240 210
110 600 500 430 370 320 280 250
120 670 570 490 420 370 330 290
130 740 630 550 480 420 370 330
140 810 690 600 530 470 410 370
The headers along the top are a wind vector and the first col on the left is a distance. The actual data in the 'body' of the table is just a fuel additive.
I am very new to Pandas and Numpy so please excuse the simplicity of the question. What I would like to know is, how can I look up a single number in the table using the headers? I have seen that it's possible using indexes, but I don't want to use that method if I don't have to.
for example:
I have a wind unit of -60 and a distance of 120 so I need to retrieve the number 670. How can I use Numpy or Pandas to do this?
Also, if I have a wind unit of say -50 and a distance of 125, is it then possible to interpolate these in a simple way?
EDIT:
Here is what I've tried so far:
import pandas as pd
df = pd.read_table('fuel_adjustment.txt', delim_whitespace=True, header=0,index_col=0)
print(df.loc[120, -60])
But I get the error:
line 3083, in get_loc raise KeyError(key) from err
KeyError: -60
You can select any cell from existing indices using:
df.loc[120,-60]
However, the type of the indices needs to be integer. If not, you can fix it using:
df.index = df.index.map(int)
df.columns = df.columns.map(int)
For interpolation, you need to add the empty new rows/columns using reindex, then apply interpolate on each dimension.
(df.reindex(index=sorted(df.index.to_list() + [125]),
            columns=sorted(df.columns.to_list() + [-50]))
   .interpolate(axis=1, method='index')
   .interpolate(method='index')
)
Output:
-60 -50 -40 -20 0 20 40 60
100 520.0 480.0 440.0 380.0 320.0 280.0 240.0 210.0
110 600.0 550.0 500.0 430.0 370.0 320.0 280.0 250.0
120 670.0 620.0 570.0 490.0 420.0 370.0 330.0 290.0
125 705.0 652.5 600.0 520.0 450.0 395.0 350.0 310.0
130 740.0 685.0 630.0 550.0 480.0 420.0 370.0 330.0
140 810.0 750.0 690.0 600.0 530.0 470.0 410.0 370.0
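Assuming the chained expression above is saved to a name, say df_interp (a name introduced here only for illustration), the interpolated value can then be read back with label-based indexing:
df_interp.loc[125, -50]   # 652.5 for the table above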
You can simply use df.loc for that purpose
df.loc[120,-60]
You need to check the data types of the index and columns. That is likely the reason why df.loc[120, -60] failed.
Try:
df.loc[120, "-60"]
To validate the data type, you may call:
>>> df.index
Int64Index([100, 110, 120, 130, 140], dtype='int64')
>>> df.columns
Index(['-60', '-40', '-20', '0', '20', '40', '60'], dtype='object')
If you want to turn the column headers into int64, you may need to convert them to numeric:
df.columns = pd.to_numeric(df.columns)
For interpolation, I think the only way is to create the nonexistent index and column first, and then get the value. However, this will grow your df rapidly if it is queried frequently.
First, you need to add the nonexistent index and column.
Interpolate row-wise and column-wise.
Get your value.
new_index = df.index.to_list()
new_index.append(125)
new_index.sort()
new_col = df.columns.to_list()
new_col.append(-50)
new_col.sort()
df = df.reindex(index=new_index, columns=new_col)
df = df.interpolate(axis=1).interpolate()
print(df.loc[125, -50])
Another way is to write a function that fetches the relevant numbers and returns the interpolated result, as sketched after the steps below.
Find the upper and lower indexes and columns of your target.
Fetch the four numbers.
Sequentially interpolate the index and column.
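A rough sketch of such a function, assuming the index and column labels have already been converted to integers as shown earlier and that the query point lies inside the table (the function name and argument order are illustrative, not an established API):
import numpy as np

def lookup(df, dist, wind):
    # bracketing row and column labels (assumes sorted integer labels)
    rows = np.asarray(df.index)
    cols = np.asarray(df.columns)
    r0, r1 = rows[rows <= dist].max(), rows[rows >= dist].min()
    c0, c1 = cols[cols <= wind].max(), cols[cols >= wind].min()
    # the four surrounding values
    q00, q01 = df.loc[r0, c0], df.loc[r0, c1]
    q10, q11 = df.loc[r1, c0], df.loc[r1, c1]
    # interpolate along the columns first, then along the rows
    tc = 0 if c1 == c0 else (wind - c0) / (c1 - c0)
    tr = 0 if r1 == r0 else (dist - r0) / (r1 - r0)
    top = q00 + tc * (q01 - q00)
    bottom = q10 + tc * (q11 - q10)
    return top + tr * (bottom - top)

lookup(df, 125, -50)   # 652.5 for the table above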
I have a dataframe close consisting of close prices (with some calculations done beforehand) for some stocks, and I want to create a dataframe (with empty entries or random numbers) whose row names are the tickers of close and whose column names run from 10 to 300 with a step size of 10, i.e. 10, 20, 30, 40, 50, ...
I want to create this df in order to use a for loop to fill in all the entries.
The df close I have is like below:
Close \
ticker AAPL AMD BIDU GOOGL IXIC
Date
2011-06-01 12.339643 8.370000 132.470001 263.063049 2769.189941
2011-06-02 12.360714 8.240000 138.490005 264.294281 2773.310059
2011-06-03 12.265714 7.970000 133.210007 261.801788 2732.780029
2011-06-06 12.072857 7.800000 126.970001 260.790802 2702.560059
2011-06-07 11.858571 7.710000 124.820000 259.774780 2701.560059
......
I tried to check whether I am first creating this dataframe correctly, as below:
rows = close.iloc[0]
columns = [[i] for i in range(10,300,10)]
print(pd.DataFrame(rows, columns))
But what I got is:
2011-06-01
10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 NaN
After this, I would use something like
percent = pd.DataFrame(rows, columns)
for i in range(10, 300, 10):
    myerror = myfunction(close, i) # myfunction is a function defined beforehand
    extreme = myerror > 0.1
    percent.iloc[:,i] = extreme.mean()
To be specific, for i=10, my extreme.mean() is something like:
ticker
Absolute Error (Volatility) AAPL 0.420
AMD 0.724
BIDU 0.552
GOOGL 0.316
IXIC 0.176
MSFT 0.320
NDXT 0.228
NVDA 0.552
NXPI 0.476
QCOM 0.468
SWKS 0.560
TXN 0.332
dtype: float64
But when I tried it this way, I got:
IndexError: iloc cannot enlarge its target object
How shall I create this df first? Or do I even need to create this df first?
Here is how I would approach it:
from io import StringIO
import numpy as np
import pandas as pd
df = pd.read_csv(StringIO("""ticker_Date AAPL AMD BIDU GOOGL IXIC
2011-06-01 12.339643 8.370000 132.470001 263.063049 2769.189941
2011-06-02 12.360714 8.240000 138.490005 264.294281 2773.310059
2011-06-03 12.265714 7.970000 133.210007 261.801788 2732.780029
2011-06-06 12.072857 7.800000 126.970001 260.790802 2702.560059
2011-06-07 11.858571 7.710000 124.820000 259.774780 2701.560059 """), sep="\s+", index_col=0)
col_names = [f"col_{i}" for i in range(10, 300, 10)]
# generate random data
data = np.random.random((df.shape[1], len(col_names)))
# create dataframe
df = pd.DataFrame(data, columns=col_names, index=df.columns.values)
df.head()
This will generate:
col_10 col_20 col_30 col_40 col_50 col_60 col_70 col_80 col_90 col_100 ... col_200 col_210 col_220 col_230 col_240 col_250 col_260 col_270 col_280 col_290
AAPL 0.758983 0.990241 0.804344 0.143388 0.987025 0.402098 0.814308 0.302948 0.551587 0.107503 ... 0.270523 0.813130 0.354939 0.594897 0.711924 0.574312 0.124053 0.586718 0.182854 0.430028
AMD 0.280330 0.540498 0.958757 0.779778 0.988756 0.877748 0.083683 0.935331 0.601838 0.998863 ... 0.426469 0.459916 0.458180 0.047625 0.234591 0.831229 0.975838 0.277486 0.663604 0.773614
BIDU 0.488226 0.792466 0.488340 0.639612 0.829161 0.459805 0.619539 0.614297 0.337481 0.009500 ... 0.049147 0.452581 0.230441 0.943240 0.587269 0.703462 0.528252 0.099104 0.510057 0.151219
GOOGL 0.332762 0.135621 0.653414 0.955116 0.341629 0.213716 0.308320 0.982095 0.762138 0.532052 ... 0.095432 0.908001 0.077070 0.413706 0.036768 0.481697 0.092373 0.016260 0.394339 0.042559
IXIC 0.358842 0.653332 0.994692 0.863552 0.307594 0.269833 0.972357 0.520336 0.124850 0.907647 ... 0.189050 0.664955 0.167708 0.333537 0.295740 0.093228 0.762875 0.779000 0.316752 0.687238
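To fill the frame from the question's loop, one way to avoid the "iloc cannot enlarge its target object" error is to collect each column in a dict and build the DataFrame once at the end. This is only a sketch: myfunction, close, and the 0.1 threshold are the asker's own objects, assumed to exist as described in the question.
cols = {}
for i in range(10, 300, 10):
    myerror = myfunction(close, i)   # user-defined function from the question
    extreme = myerror > 0.1
    cols[i] = extreme.mean()         # one Series of per-ticker fractions per i
percent = pd.DataFrame(cols)         # rows: tickers, columns: 10, 20, ..., 290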
I have a dataframe with 3 columns. Something like this:
Data Initial_Amount Current
31-01-2018
28-02-2018
31-03-2018
30-04-2018 100 100
31-05-2018 100 90
30-06-2018 100 80
I would like to populate the prior rows with the Initial Amount as such:
Data Initial_Amount Current
31-01-2018 100 100
28-02-2018 100 100
31-03-2018 100 100
30-04-2018 100 100
31-05-2018 100 90
30-06-2018 100 80
So find the:
First non_empty row with Initial Amount populated
use that to backfill the initial Amounts to the starting date
If it is the first row and current is empty then copy Initial_Amount, else copy prior balance.
Pandas fillna with fill method 'bfill' (uses next valid observation to fill gap) should do what you're looking for:
In [13]: df.fillna(method='bfill')
Out[13]:
Data Initial_Amount Current
0 31-01-2018 100.0 100.0
1 28-02-2018 100.0 100.0
2 31-03-2018 100.0 100.0
3 30-04-2018 100.0 100.0
4 31-05-2018 100.0 90.0
5 30-06-2018 100.0 80.0
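As a side note, more recent pandas versions deprecate the method= argument of fillna, so the same backfill can be written as:
df.bfill()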
I have the last eight months of my customers' data; however, these are not the same calendar months for every customer, just the last months they happened to be with us. Monthly fees and penalties are stored in rows, but I want each of the last eight months to be a column.
What I have:
Customer Amount Penalties Month
123 500 200 1/7/2017
123 400 100 1/6/2017
...
213 300 150 1/4/2015
213 200 400 1/3/2015
What I want:
Customer Month-8-Amount Month-7-Amount ... Month-1-Amount Month-1-Penalties ...
123 500 400 450 300
213 900 250 300 200
...
What I've tried:
df = df.pivot(index=num, columns=[amount,penalties])
I got this error:
ValueError: all arrays must be same length
Is there some ideal way to do this?
You can do it with unstack and set_index
# assuming all dates are sorted properly, number each customer's rows with cumcount
df['Month'] = df.groupby('Customer').cumcount() + 1
# keep only the most recent 8 months
df = df.loc[df.Month <= 8, :]
# unstack to reshape the df
s = df.set_index(['Customer', 'Month']).unstack().sort_index(level=1, axis=1)
# flatten the MultiIndex columns into a single level
s.columns = s.columns.map('{0[0]}-{0[1]}'.format)
s.add_prefix("Month-")
Out[189]:
Month-Amount-1 Month-Penalties-1 Month-Amount-2 Month-Penalties-2
Customer
123 500 200 400 100
213 300 150 200 400
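Note that add_prefix returns a new DataFrame rather than modifying s in place, so assign the result back if the prefixed names should be kept:
s = s.add_prefix("Month-")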
With the following groupby, how can I ultimately group the data so that I can plot the price (x-axis) against size (y-axis) while iterating through every symbol and exchange? Thanks.
df_group = df.groupby(['symbol','exchange','price'])["size"].sum()
symbol exchange price
AAPL ARCA 154.630 800
154.640 641
154.650 100
154.660 300
154.670 400
154.675 100
154.680 300
154.690 1390
154.695 100
154.700 360
154.705 100
154.710 671
154.720 190
154.725 100
154.730 400
...
XOM PSX 80.67 1300
80.68 2721
80.69 1901
80.7 700
80.71 800
80.72 200
80.73 700
80.74 500
80.75 600
80.76 300
80.77 900
80.78 100
80.79 1000
80.8 1000
symbol exch price sizesizesizesizesizesizesizesizesizesizesizesi...
You can use aggregate functions:
fun = {'symbol': {'size': 'count'}}
df_group = df.groupby(['symbol','exchange','price']).agg(fun).reset_index()
df_group.columns=df_group.columns.droplevel(1)
df_group
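To then iterate over every symbol/exchange pair and plot price against the summed size, one possible sketch (assuming matplotlib is available and df is the original trade-level frame from the question):
import matplotlib.pyplot as plt

df_group = df.groupby(['symbol', 'exchange', 'price'])['size'].sum()
for (symbol, exchange), grp in df_group.groupby(level=['symbol', 'exchange']):
    prices = grp.index.get_level_values('price')
    plt.figure()
    plt.plot(prices, grp.values, marker='o')
    plt.title(f'{symbol} / {exchange}')
    plt.xlabel('price')
    plt.ylabel('size')
plt.show()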