Regex/split strings in list for particular element - python

I have a list of items that looks like this:
[u'1111 aaaa 20 0 250m 149m 113m S 0.0 2.2 532:09.83 bbbb', u' 5555 cccc 20 0 218m 121m 91m S 0.0 3.3 288:50.20 dddd']
The only thing I care about from each item is the 2.2 and the 3.3. Everything else in each item is a variable and changes every time the process is run; the format, however, will always be the same.
Is there a way to regex each item in the list and check this value in each one?

If you just want the 2.2 and 3.3 values, you can do it without regexes:
data = [u'1111 aaaa 20 0 250m 149m 113m S 0.0 2.2 532:09.83 bbbb', u' 5555 cccc 20 0 218m 121m 91m S 0.0 3.3 288:50.20 dddd']
print([item.split()[9] for item in data]) # yields [u'2.2', u'3.3']
By default split splits on whitespace, and your 2.2 and 3.3 values happen to be the 10th field in each string. Python lists are 0-indexed, so the 10th field in human terms becomes index 9.
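If you do want a regex, one option (a sketch, assuming the field you need always sits immediately before the MM:SS.hh CPU-time column, as in top output) is to anchor on that time field:

```python
import re

data = [u'1111 aaaa 20 0 250m 149m 113m S 0.0 2.2 532:09.83 bbbb',
        u' 5555 cccc 20 0 218m 121m 91m S 0.0 3.3 288:50.20 dddd']

# Capture the field immediately before the MM:SS.hh time column.
pattern = re.compile(r'(\S+)\s+\d+:\d+\.\d+')
values = [pattern.search(item).group(1) for item in data]
print(values)  # ['2.2', '3.3']
```

This survives variable-width fields, but the positional split above is simpler if the column count is fixed.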

Sequential predictions on multiple sequences

I need to predict a sequence of responses based on a sequence of predictors using something like an LSTM. There are multiple sequences, and they are structured such that they cannot be stacked and still make sense.
For example, we might have one sequence with the sequential values
Observation  Location 1x  Location 1y  Location 2 (response)
1            3.8          2.5          9.4
2            3.9          2.7          9.7
and another with the values
Observation  Location 1x  Location 1y  Location 2 (response)
1            9.4          4.6          16.8
2            9.2          4.1          16.2
Observation 2 from the first table and observation 1 from the second table do not follow each other. I then need to predict on an unseen sequence like

Location 1x  Location 1y
5.6          8.4
5.6          8.1

which is also not correlated with the first two, except that the first two should provide a guideline on how to predict the sequence.
I've looked into multiple sequence prediction and haven't had much luck. Can anyone suggest what sort of strategies I might use for this problem? Thanks.

Split a list up to a maximum number of elements

I was wondering if someone could help me with the following problem: I have a text file that I split into rows and columns. The text file contains a variable number of columns, but I would like to split each row into exactly seven columns, no more, no less. To do that, I want to throw everything after the sixth column into a single column.
Example code:
rot = ['6697 1100.0 90.0 0.0 0.0 6609 !',
       '701 0.0 0.0 83.9 1.5 000 !AFR-AHS IndHS-AFR']
for i in range(len(rot)):
    rot[i] = rot[i].split()
Here, the list 'rot' contains 7 entries in the first row (the ! counts as a separate entry) and 8 in the second row. In both cases, everything from the ! onward should be grouped into the same column.
Many thanks!
You are almost there. split takes (as its second argument) the maximum number of splits to do.
https://docs.python.org/3.8/library/stdtypes.html#str.split
rot = ['6697 1100.0 90.0 0.0 0.0 6609 !',
       '701 0.0 0.0 83.9 1.5 000 !AFR-AHS IndHS-AFR']
for i in range(len(rot)):
    rot[i] = rot[i].split(maxsplit=6)
Note: You want six splits, which results in seven columns. You'll need to do some extra processing if the text can have fewer than seven columns though.
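A quick sketch of what maxsplit=6 produces for the two sample rows, showing that both end up with seven columns and that the trailing text stays in one piece:

```python
rot = ['6697 1100.0 90.0 0.0 0.0 6609 !',
       '701 0.0 0.0 83.9 1.5 000 !AFR-AHS IndHS-AFR']

# At most six splits, so everything after the sixth field stays together.
rows = [line.split(maxsplit=6) for line in rot]
print([len(r) for r in rows])  # [7, 7]
print(rows[1][6])              # !AFR-AHS IndHS-AFR
```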

combine two formats together

I am formatting a dataframe and need both a thousands separator and two decimal places. The problem is that when I combine the two formats, only the last one takes effect. I guess many people have the same confusion, as I have googled a lot and found nothing.
I tried .map(lambda x: ('%.2f') % x and format(x, ',')) to combine the two required formats, but only the last one takes effect:
DF_T_1_EQUITY_CHANGE_Summary_ADE['Sum of EQUITY_CHANGE'].map(lambda x:format(x,',') and ('%.2f')%x)
DF_T_1_EQUITY_CHANGE_Summary_ADE['Sum of EQUITY_CHANGE'].map(lambda x:('%.2f')%x and format(x,','))
the first result is:
0 -2905.22
1 -6574.62
2 -360.86
3 -3431.95
Name: Sum of EQUITY_CHANGE, dtype: object
the second result is:
0 -2,905.2200000000003
1 -6,574.62
2 -360.86
3 -3,431.9500000000003
Name: Sum of EQUITY_CHANGE, dtype: object
I tried a new way, by using
DF_T_1_EQUITY_CHANGE_Summary_ADE.to_string(formatters={'style1': '${:,.2f}'.format})
the result is:
Row Labels Sum of EQUITY_CHANGE Sum of TRUE_PROFIT Sum of total_cost Sum of FOREX VOL Sum of BULLION VOL Oil Sum of CFD VOL Sum of BITCOIN VOL Sum of DEPOSIT Sum of WITHDRAW Sum of IN/OUT
0 ADE A BOOK USD -2,905.2200000000003 638.09 134.83 15.590000000000002 2.76 0.0 0.0 0 0.0 0.0 0.0
1 ADE B BOOK USD -6,574.62 -1,179.3299999999997 983.2099999999999 21.819999999999997 30.979999999999993 72.02 0.0 0 8,166.9 0.0 8,166.9
2 ADE A BOOK AUD -360.86 235.39 64.44 5.369999999999999 0.0 0.0 0.0 0 700.0 0.0 700.0
3 ADE B BOOK AUD -3,431.9500000000003 190.66 88.42999999999999 11.88 3.14 0.03 2.0 0 20,700.0 -30,000.0 -9,300.0
the result confuses me, as I set the .2f format which is not in effect.
Using the format-spec mini-language you can add commas and set the decimals to 2 places with f'{x:,.2f}'.
import pandas as pd

df = pd.DataFrame({'EQUITY_CHANGE': [-2905.219262257907,
                                     -6574.619531995241,
                                     -360.85959369471186,
                                     -3431.9499712161164]})
df.EQUITY_CHANGE.apply(lambda x: f'{x:,.2f}')
# returns:
0 -2,905.22
1 -6,574.62
2 -360.86
3 -3,431.95
Name: EQUITY_CHANGE, dtype: object
The map method is not in-place; it does not modify the Series but returns a new one.
So assign the result of map back to the original column.
Here is the doc:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html
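Putting both answers together, a minimal sketch. Two side notes on why the original attempts failed: `a and b` evaluates to its second operand when `a` is truthy, so chaining the two format calls with `and` only ever applies one of them; and the keys of the formatters dict passed to to_string must be actual column names, so 'style1' was silently ignored and the .2f never ran.

```python
import pandas as pd

df = pd.DataFrame({'EQUITY_CHANGE': [-2905.219262257907,
                                     -6574.619531995241,
                                     -360.85959369471186,
                                     -3431.9499712161164]})

# map returns a new Series, so assign it back to the column.
df['EQUITY_CHANGE'] = df['EQUITY_CHANGE'].map(lambda x: f'{x:,.2f}')
print(df['EQUITY_CHANGE'].tolist())
# ['-2,905.22', '-6,574.62', '-360.86', '-3,431.95']
```

Note the column now holds strings, so do any arithmetic before formatting.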

Get time difference between two values in csv file [duplicate]

This question already has answers here:
Pandas: Difference to previous value
(2 answers)
Closed 3 years ago.
I am trying to get the average, max and min time difference between value occurrences in a csv file.
The file contains multiple columns and rows.
I am currently working in python and trying to use pandas to solve my problem.
I have managed to break the csv file down to the time column and the column I want the time differences from: the "payload" column, where the value occurrences happen. It looks like:

time  payload
12.1  2368
13.8  2508
I have also tried to collect the times of the value occurrences in an array and step through it, but failed badly. I felt there had to be an easier way.
def average_time(avg_file):
    avg_read = pd.read_csv(avg_file, skiprows=2, names=new_col_names,
                           usecols=[2, 3], na_filter=False,
                           skip_blank_lines=True)
    test = []
    i = 0
    for row in avg_read.payload:
        if row != None:
            test[i] = avg_read.time
            i += 1
        if len[test] > 2:
            average = test[1] - test[0]
            i = 0
            test = []
    return average
The csv file currently looks like:

time  payload
12.1  2250
12.5  2305
12.9  (blank)
13.1  (blank)
13.5  2309
14.6  2350
14.9  2680
15.0  (blank)
I want to get the time difference between the values in the payload column, skipping the blank rows. For example, the time between
2250 and 2305 --> 12.5 - 12.1 = 0.4 sec
and the difference between
2305 and 2309 --> 13.5 - 12.5 = 1 s
From these I then want the maximum, minimum and average difference.
First use dropna, then use Series.diff.
DataFrame used:
print(df)
time payload
0 12.1 2250.0
1 12.5 2305.0
2 12.9 NaN
3 13.1 NaN
4 13.5 2309.0
5 14.6 2350.0
6 14.9 2680.0
7 15.0 NaN
df.dropna().time.diff()
0 NaN
1 0.4
4 1.0
5 1.1
6 0.3
Name: time, dtype: float64
Note I assumed your (blank) values are NaN; if not, run the following before my code:
df.replace('(blank)', np.nan, inplace=True)
# Or if they are empty strings
df.replace('', np.nan, inplace=True)
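To finish the stated goal (maximum, minimum and average difference), a sketch building on the same dropna/diff chain, with the sample data inlined:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': [12.1, 12.5, 12.9, 13.1, 13.5, 14.6, 14.9, 15.0],
                   'payload': [2250, 2305, np.nan, np.nan, 2309, 2350, 2680, np.nan]})

# Drop the blank payload rows, then diff the remaining timestamps.
diffs = df.dropna()['time'].diff().dropna()
print(round(diffs.max(), 2), round(diffs.min(), 2), round(diffs.mean(), 2))
# 1.1 0.3 0.7
```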

Assigning arrays as values for a key in a dictionary

I have a dat file with different data. The file has numbers arranged in seven columns separated by two spaces. Is it possible to read and extract the data for each column and assign it to a key in a dictionary, using arrays? Is it possible to assign numpy arrays as values for a key in a dictionary?
The dat.file have numbers like this:
1 -0.8 92.3 2.8 150 0 0
2 -0.7 99.3 1.9 140 0 0
3 -0.3 96.4 2.5 120 0 0
4 -0.3 95.0 3.1 130 0 0
5 -0.8 95.7 3.1 130 0 0
6 -0.5 95.0 2.1 120 0 0
7 -0.7 90.9 3.6 110 0 0
8 -0.6 85.7 2.6 80 0 0
9 -0.7 85.7 3.1 60 0 0
10 -1.2 85.6 3.6 50 0 8
I first read all the lines, then split each line on whitespace. I tried to assign the values in each column to the corresponding key in the dictionary, but this does not work. I think I have to put the values in an array and then put the array into the dictionary somehow?
def read_data(filename):
    infile = open(filename, 'r')
    for line in infile.readlines():
        data = {'hour': None, 'temperature': None, 'humidity': None,
                'wind_speed': None, 'wind_direction': None,
                'direct_flux': None, 'diffuse_flux': None}
        lines = line.split()
        data['hour'] = lines[0]
        data['temperature'] = lines[1]
        data['humidity'] = lines[2]
        data['wind_speed'] = lines[3]
        data['wind_direction'] = lines[4]
        data['direct_flux'] = lines[5]
        data['diffuse_flux'] = lines[6]
    return data
EDIT: I realized numpy arrays are a specific scientific data structure. I have not used them, but converting the lists below (and their append operations) into numpy arrays should be trivial.
You are correct. A dictionary holds (key, value) pairs; an entry of the form (key, value, value, ..., value) is not acceptable. Using a list as the value (as you suggested) is a solution. Note that each list index then corresponds to the line number the data was in.
data = {'hour': None, 'temperature': None, 'humidity': None,
        'wind_speed': None, 'wind_direction': None,
        'direct_flux': None, 'diffuse_flux': None}
# For each key, initialize a list as its value.
for key in data:
    data[key] = list()
for line in infile.readlines():
    lines = line.split()
    # We simply append into the list this key references.
    data['hour'].append(lines[0])
    data['temperature'].append(lines[1])
    data['humidity'].append(lines[2])
    data['wind_speed'].append(lines[3])
    data['wind_direction'].append(lines[4])
    data['direct_flux'].append(lines[5])
    data['diffuse_flux'].append(lines[6])
return data
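For completeness, a hedged rewrite of read_data along these lines: the same append-per-column idea, with the dict built once before the loop and a with block managing the file.

```python
def read_data(filename):
    headers = ['hour', 'temperature', 'humidity', 'wind_speed',
               'wind_direction', 'direct_flux', 'diffuse_flux']
    # One list per column; each parsed line appends one value to every list.
    data = {key: [] for key in headers}
    with open(filename) as infile:
        for line in infile:
            fields = line.split()
            for key, value in zip(headers, fields):
                data[key].append(value)
    return data
```

The values stay strings here; convert with float() or int() as needed.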
I'm not quite sure I understood what you are asking for, but I'll try to answer.
I guess you want to load those tabulated data in a way you can easily work with, making use of numpy's functionality.
Then, I think you have two options.
Using PANDAS
Pandas (here the documentation) is a really complete package that uses numpy to let you work with labelled data (so that columns and rows have a name, and not only a positional index)
using pandas the idea would be to do:
import pandas as pd
df = pd.read_csv('data.tab', sep=" ", index_col=0, header=None,
                 names=['hour', 'temp', 'hum', 'w_speed', 'w_direction',
                        'direct_flux', 'diffuse_flux'])
df
      temp   hum  w_speed  w_direction  direct_flux  diffuse_flux
hour
1     -0.8  92.3      2.8          150            0             0
2     -0.7  99.3      1.9          140            0             0
3     -0.3  96.4      2.5          120            0             0
4     -0.3  95.0      3.1          130            0             0
5     -0.8  95.7      3.1          130            0             0
6     -0.5  95.0      2.1          120            0             0
7     -0.7  90.9      3.6          110            0             0
8     -0.6  85.7      2.6           80            0             0
9     -0.7  85.7      3.1           60            0             0
10    -1.2  85.6      3.6           50            0             8
Or, if you have the column names as the first row of the file simply:
import pandas as pd
df = pd.read_csv('data.tab', sep=" ", index_col=0)
If you haven't heard of this library and you are managing this kind of data, I think it is really worthwhile to give it a close look.
Using only Numpy
If you don't need to do much with those data, or won't do it again or whatever, getting Pandas may be a bit too much...
In any case, you can always read the tabulated file from numpy
import numpy as np
array = np.loadtxt("data.tab", delimiter=" ")
It will ignore comment lines (by default lines with #) and you can also skip the first row and so on.
Now you'll have all the data in array, and you can access it by slicing and indexing. If you want labelled categories (and you don't like the first option), you can build your dictionary of arrays following the last snippet of code:
data = {}
headers = ['hour', 'temp', 'hum', 'w_speed', 'w_direction', 'direct_flux',
           'diffuse_flux']
for i in range(len(headers)):
    data[headers[i]] = array[:, i]
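That loop can also be written as a dict comprehension. A self-contained sketch, using io.StringIO with a few of the sample rows standing in for the real file:

```python
import numpy as np
from io import StringIO

raw = """1  -0.8  92.3  2.8  150  0  0
2  -0.7  99.3  1.9  140  0  0
3  -0.3  96.4  2.5  120  0  0"""

# loadtxt splits on any run of whitespace by default.
array = np.loadtxt(StringIO(raw))
headers = ['hour', 'temp', 'hum', 'w_speed', 'w_direction',
           'direct_flux', 'diffuse_flux']
data = {name: array[:, i] for i, name in enumerate(headers)}
print(data['hum'])  # [92.3 99.3 96.4]
```

Each value is a numpy array, which answers the original question: numpy arrays are perfectly valid dictionary values.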
