Guessing Indentation of text file with python - python

I am working with a program that generates a specific file format that I have to read and modify with Python scripts. The file is supposed to be tab-delimited, but I haven't been able to find the tab character. Is there a good way to read this kind of file and generate a new one in the same formatting?
1. Base Year Data for Calibration
1.1 Observed Data per Internal Zone
Sector Zone ExogProd InducedPro ExogDemand Price ValueAdded Attractor
1 1 5000 0 0 14409.8204 0 1
1 2 800 0 0 12628.4625 0 1
1 3 1100 0 0 12676.3341 0 1
2 1 0 3393.2241 0 13944.0613 0 1
2 2 0 732.1119 0 12340.4575 0 1
2 3 0 974.6630 0 12132.7666 0 1
3 1 0 4491.8722 0 2701.8266 0 1
3 2 0 12755.9657 0 2445.0556 0 1
3 3 0 4752.1604 0 2671.2305 0 1
4 1 0 1790.7874 0 3858.0189 0 1
4 2 0 3076.6366 0 3337.8784 0 1
4 3 0 11132.5806 0 3728.1412 0 1
5 1 0 69.5126 0 250000 250000 1
5 2 0 109.5081 0 120000 120000 1
5 3 0 124.2133 0 180000 180000 1
The problem is that when I read this with Python using line.split('\t'), I end up with just the whole line.

As others have pointed out in the comments, this appears to be just a space separated file with a variable number of spaces between cells. If that is the case, you can extract the cells from a particular row like this:
cells = line.split()
As for regenerating it, you'll need to pad the various columns to different widths. One way would be with code like this:
widths = [12, 9, 11, 11, 11, 11, 11, 11]
paddedCells = [cell.rjust(widths[i]) for i, cell in enumerate(cells)]  # right-justify each cell to its column width
line = ''.join(paddedCells)

Actually, I am using
%12d %8d %10.2f %10.2f %10.2f %10.2f %10.2f %10.1f\n
The problem seems to be how the files are generated. I am pretty sure they are not tab-delimited.
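If the file really is written with that format string, a minimal round-trip sketch could look like the following (the file names and the 8-column layout of the data rows are assumptions based on the sample above):
fmt = '%12d %8d %10.2f %10.2f %10.2f %10.2f %10.2f %10.1f\n'
with open('observed_data.txt') as src, open('observed_data_out.txt', 'w') as dst:
    for line in src:
        cells = line.split()
        # only the data rows have exactly 8 numeric cells; copy titles and headers verbatim
        if len(cells) == 8 and all(c.replace('.', '', 1).isdigit() for c in cells):
            values = [int(cells[0]), int(cells[1])] + [float(c) for c in cells[2:]]
            dst.write(fmt % tuple(values))
        else:
            dst.write(line)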

Related

np where with two conditions and met first

I am trying to create a target variable based on two conditions. I have X values that are binary and X2 values that are also binary. My condition is: whenever X changes from 1 to 0, we put a 1 in y, but only if that is followed by a change from 0 to 1 in X2. If it was instead followed by another change from 0 to 1 in X, then we don't make the change in the first place. I attached a picture from Excel.
I also did the following to account for the change in X
df['X-prev'] = df['X'].shift(1)
df['Change-X'] = np.where(df['X-prev'] + df['X'] == 1, 1, 0)
# this is the data frame
X=[1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0]
X2=[0,0,0,0,0,0,0,0,0,1,1,1,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,1]
df=pd.DataFrame()
df['X']=X
df['X2']=X2
However, this is not enough, as I need to know which change came first after the X change. I attached a picture of the example.
Thanks a lot for all the contributions.
Keep the rows that match your transitions (X=1 followed by X=0, and X2=1 preceded by X2=0), then merge all selected rows into one series where a value of 0 means 'start a cycle' and 1 means 'end a cycle'.
But in this list you can have consecutive starts or ends, so you need to filter again to keep only (0, 1) cycles. After that, reindex this new series on your original dataframe index and back-fill with 1.
x1 = df['X'].sub(df['X'].shift(-1)).eq(1)    # True where X=1 and the next X is 0 (start of a cycle)
x2 = df['X2'].sub(df['X2'].shift(1)).eq(1)   # True where X2=1 and the previous X2 was 0 (end of a cycle)
sr1 = pd.Series(0, df.index[x1])             # 0 marks a start
sr2 = pd.Series(1, df.index[x2])             # 1 marks an end
sr = pd.concat([sr2, sr1]).sort_index()
# keep only (start, end) pairs, then back-fill 1 from each end to its start
df['Y'] = sr[sr.lt(sr.shift(-1)) | sr.gt(sr.shift(1))] \
    .reindex(df.index).bfill().fillna(0).astype(int)
>>> df
X X2 Y
0 1 0 0 # start here: (X=1, X+1=0) but never ended before another start
1 1 0 0
2 0 0 0
3 0 0 0
4 1 0 0 # start here: (X=1, X+1=0)
5 0 0 1 # <- fill with 1
6 0 0 1 # <- fill with 1
7 0 0 1 # <- fill with 1
8 0 0 1 # <- fill with 1
9 0 1 1 # end here: (X2=1, X2-1=0) so fill back rows with 1
10 0 1 0
11 0 1 0
12 0 1 0
13 0 1 0
14 0 0 0
15 0 0 0
16 0 1 0 # end here: (X2=1, X2-1=0) but never started before
17 0 0 0
18 0 0 0
19 0 0 0
20 1 0 0
21 1 0 0 # start here: (X=1, X+1=0)
22 0 0 1 # <- fill with 1
23 0 0 1 # <- fill with 1
24 0 0 1 # <- fill with 1
25 0 0 1 # <- fill with 1
26 0 0 1 # <- fill with 1
27 0 1 1 # end here: (X2=1, X2-1=0) so fill back rows with 1
28 0 1 0
29 0 1 0

Checking for subset in a column?

I'm trying to flag some price data as "stale" if the quoted price of the security hasn't changed over, let's say, 3 trading days. I'm currently trying it with:
firm["dev"] = np.std(firm["Price"],firm["Price"].shift(1),firm["Price"].shift(2))
firm["flag"] == np.where(firm["dev"] = 0, 1, 0)
But I'm getting nowhere with it. This is what my dataframe would look like.
Index  Price  Flag
1      10     0
2      11     0
3      12     0
4      12     0
5      12     1
6      11     0
7      13     0
Any help is appreciated!
If you are okay with other conditions, you can first check where series.diff() equals 0 and take the cumulative sum, checking whether the cumsum reaches 2 (n-1). Also check that the current row equals the previous one; when both conditions hold, assign a flag of 1, else 0.
n = 3
firm['Flag'] = (firm['Price'].diff().eq(0).cumsum().eq(n - 1) &
                firm['Price'].eq(firm['Price'].shift())).astype(int)
EDIT, to make it a generalized function with consecutive n, use this:
def fun(df, col, n):
    c = df[col].diff().eq(0)
    return (c | c.shift(-1)).cumsum().ge(n) & df[col].eq(df[col].shift())

firm['flag_2'] = fun(firm, 'Price', 2).astype(int)
firm['flag_3'] = fun(firm, 'Price', 3).astype(int)
print(firm)
Price Flag flag_2 flag_3
Index
1 10 0 0 0
2 11 0 0 0
3 12 0 0 0
4 12 0 1 0
5 12 1 1 1
6 11 0 0 0
7 13 0 0 0
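For reference, a minimal sketch that reproduces the sample dataframe above (the construction itself is an assumption; only the values come from the question):
import pandas as pd

# assumed setup matching the question's Index/Price table
firm = pd.DataFrame({'Price': [10, 11, 12, 12, 12, 11, 13]},
                    index=pd.Index(range(1, 8), name='Index'))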

Why apply function did not work on pandas dataframe

ct_data['IM NO'] = ct_data['IM NO'].apply(lambda x: pyffx.Integer(b'dkrya#Jppl1994', length=20).encrypt(int(x)))
I am trying to encrypt the IM NO column; below is the head of ct_data:
Unnamed: 0 IM NO CT ID
0 0 214281340 x1E5e3ukRyEFRT6SUAF6lg|d543d3d064da465b8576d87
1 1 214281244 -vf6738ee3bedf47e8acf4613034069ab0|aa0d2dac654
2 2 175326863 __g3d877adf9d154637be26d9a0111e1cd6|6FfHZRoiWs
3 3 299631931 __gbe204670ca784a01b7207b42a7e5a5d3|54e2c39cd3
4 4 214282320 773840905c424a10a4a31aba9d6458bb|__g1114a30c6e
But I get the following:
Unnamed: 0 ... CT ID
0 0 ... x1E5e3ukRyEFRT6SUAF6lg|d543d3d064da465b8576d87
1 1 ... aa0d2dac654d4154bf7c09f73faeaf62|-vf6738ee3bed
2 2 ... 6FfHZRoiWs2VO02Pruk07A|__g3d877adf9d154637be26
3 3 ... 54e2c39cd35044ffbd9c0918d07923dc|__gbe204670ca
4 4 ... __g1114a30c6ea548a2a83d5a51718ff0fd|773840905c
5 5 ... 9e6eb976075b4b189ae7dde42b67ca3d|WgpKucd28IcdE
The IM NO column header and its values should be encrypted as 20-digit numbers.
Normally the encryption is done as below:
import pyffx
strEncrypt = pyffx.Integer(b'dkrya#Jppl1994', length=20)
strEncrptVal = strEncrypt.encrypt(int('9digit IM No'))
ct_data.iloc[:, 1] displays the following:
0 214281340
1 214281244
2 175326863
3 299631931
4 214282320
5 214279026
This should be a comment but it contains formatted data.
It is probably a mere display problem. With the initial sample of your dataframe, I have executed your command and printed its returned values:
print(ct_data['IM NO'].apply(lambda x: pyffx.Integer(b'dkrya#Jppl1994', length=20).encrypt(int(x))))
0 88741194526272080902
1 2665012251053580165
2 18983388112345132770
3 85666027666173191357
4 78253063863998100367
Name: IM NO, dtype: object
So it is correctly executed. Let us go one step further:
ct_data['IM NO'] = ct_data['IM NO'].apply(lambda x: pyffx.Integer(b'dkrya#Jppl1994', length=20).encrypt(int(x)))
print(ct_data['IM NO'])
0 88741194526272080902
1 2665012251053580165
2 18983388112345132770
3 85666027666173191357
4 78253063863998100367
Name: IM NO, dtype: object
Again...
That means that your command was successful, but as the IM NO column is now wider, your system can no longer display all the columns, so it shows the first and last ones with ellipses (...) in the middle.
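If you want to see all the columns anyway, you can widen the pandas display options; a small sketch using standard pandas settings, applied to your ct_data:
import pandas as pd

# show every column and let the repr use the full available width
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
print(ct_data)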

Python Groupby and Count

I'm working on creating a sankey plot and have the raw data mapped so that I know the source and target node. I'm having an issue with grouping the source & target and then counting the number of times each pair occurs, e.g. using the table below, finding out how many times 0 -> 4 occurs and recording that in the dataframe.
index event_action_num next_action_num
227926 0 6
227928 1 5
227934 1 6
227945 1 7
227947 1 6
227951 0 7
227956 0 6
227958 2 6
227963 0 6
227965 1 6
227968 1 5
227972 3 6
Where I want to end up is:
event_action_num next_action_num count_of
0 4 1728
0 5 2382
0 6 3739
etc
Have tried:
df_new_2 = df_new.groupby(['event_action_num', 'next_action_num']).count()
but it doesn't give me the result I'm looking for.
Thanks in advance
Try to use agg('size') instead of count():
df_new.groupby(['event_action_num', 'next_action_num']).agg('size')
For your sample data the output will be:
event_action_num  next_action_num
0                 6                  3
                  7                  1
1                 5                  2
                  6                  3
                  7                  1
2                 6                  1
3                 6                  1
dtype: int64
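If you then want the flat shape shown in the question, with a count_of column (the column name is taken from your desired output), a sketch could be:
# flatten the grouped sizes into a dataframe with a count_of column
df_new_2 = (df_new.groupby(['event_action_num', 'next_action_num'])
                  .size()
                  .reset_index(name='count_of'))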

Why does this Python nested for loop produce the output I get?

I'm very new to learning Python. Though I understand the basics of looping, I am unable to understand how the output is arrived at.
In particular, how does the mapping of all three for loops happen to give this output? I find it impossible to work out the logic on paper without referring to the IDE.
Code:
n = 4
a = 3
z = 2
for i in range(n):
    for j in range(a):
        for p in range(z):
            print(i, j, p)
Output is:
0 0 0
0 0 1
0 1 0
0 1 1
0 2 0
0 2 1
1 0 0
1 0 1
1 1 0
1 1 1
1 2 0
1 2 1
2 0 0
2 0 1
2 1 0
2 1 1
2 2 0
2 2 1
3 0 0
3 0 1
3 1 0
3 1 1
3 2 0
3 2 1
The first loop iterates four times.
The second loop iterates three times. However, since it is nested inside the first loop, its body actually runs twelve times (4 * 3).
The third loop iterates two times. However, since it is nested inside the first and second loops, its body actually runs twenty-four times (4 * 3 * 2), which is why 24 lines are printed.
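One way to convince yourself of that count is to compare the nested loops with itertools.product, which walks through the same combinations in the same order; this comparison is only an illustration, not part of the original answer:
from itertools import product

n, a, z = 4, 3, 2
combos = list(product(range(n), range(a), range(z)))
print(len(combos))  # 24, i.e. 4 * 3 * 2
print(combos[:3])   # [(0, 0, 0), (0, 0, 1), (0, 1, 0)] -- same order as the nested loops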
