Reading a multiline record into Pandas dataframe - python

I have earthquake data that I want to read into a Pandas dataframe. Data for each earthquake is spread over 5 fixed-format lines, and the format of each of the 5 lines is different. Some fields contain variable amounts of whitespace, so I can't just do a delimited read.
Is there an elegant way to parse this with read_fwf (or something else)? I think nesting loops with chunksize=1 might work, but it's not very clean. I could also reformat the file by concatenating each 5-line block onto a single line, but I'd rather use the original file.
Here's the first earthquake as an example:
MLI 1976/01/01 01:29:39.6 -28.61 -177.64 59.0 6.2 0.0 KERMADEC ISLANDS REGION
M010176A B: 0 0 0 S: 0 0 0 M: 12 30 135 CMT: 1 BOXHD: 9.4
CENTROID: 13.8 0.2 -29.25 0.02 -176.96 0.01 47.8 0.6 FREE O-00000000000000
26 7.680 0.090 0.090 0.060 -7.770 0.070 1.390 0.160 4.520 0.160 -3.260 0.060
V10 8.940 75 283 1.260 2 19 -10.190 15 110 9.560 202 30 93 18 60 88
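For concreteness, here is the kind of brute-force grouping I have in mind (a rough sketch only: 'cmt_catalog.txt' is a placeholder filename, every event is assumed to occupy exactly 5 non-blank lines, and the fields pulled out are illustrative rather than the real fixed-width spec, which would come from the catalog documentation):

import pandas as pd

# Read all non-blank lines, then walk them in groups of 5
with open('cmt_catalog.txt') as f:
    lines = [ln.rstrip('\n') for ln in f if ln.strip()]

records = []
for i in range(0, len(lines), 5):
    hypo = lines[i].split(None, 8)   # maxsplit keeps the spaces in the region name
    centroid = lines[i + 2].split()
    records.append({
        'date': hypo[1], 'time': hypo[2],
        'lat': float(hypo[3]), 'lon': float(hypo[4]),
        'depth': float(hypo[5]), 'mb': float(hypo[6]),
        'region': hypo[8],
        'centroid_lat': float(centroid[3]),
        'centroid_lon': float(centroid[5]),
    })

df = pd.DataFrame(records)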

Related

How to modify the number of rows in a .csv file and plot them

I read a .csv file using this command:
df = pd.read_csv('filename.csv', nrows=200)
I set the number of rows to 200, so it will only get the data for 200 rows (200 rows x 1 column):
data
1 4.33
2 6.98
.
.
200 100.896
I want to plot this data, but I would like to divide the row numbers by 50 (there will still be 200 elements, but the index values will be divided by 50):
data
0.02 4.33
0.04 6.98
.
.
4 100.896
I'm not sure how I would do that. Is there a way of doing this?
Just divide the index by 50.
Here is an example:
import pandas as pd
import random
data = pd.DataFrame({'col1': random.sample(range(300), 200)}, index=range(1, 201))
data.index = data.index / 50
data
      col1
0.02   196
0.04   198
0.06   278
0.08   209
0.10    36
 ...   ...
3.92   175
3.94    69
3.96   145
3.98    15
4.00    18

Unable to merge all of the desired columns from Pandas DataFrame

I am a beginner working with a clinical data set using Pandas in Jupyter Notebook.
A column of my data contains census tract codes and I am trying to merge my data with a large transportation data file that also has a column with census tract codes.
I initially only wanted 2 of the other columns from that transportation file, so after I downloaded the file I removed all of the columns except the 2 I wanted to add and the census tract column.
This is the code I used:
df_my_data = pd.read_excel("my_data.xlsx")
df_transportation_data = pd.read_excel("transportation_data.xlsx")
df_merged_file = pd.merge(df_my_data, df_transportation_data)
df_merged_file.to_excel('my_merged_file.xlsx', index = False)
This worked, but then I wanted to add the other columns from the transportation file, so I went back to my initial file (prior to adding the 2 transportation columns) and tried to merge the entire transportation file. This resulted in a new DataFrame with all of the desired columns but only 4 rows.
I thought maybe the transportation file was too big, so I tried merging individual columns (other than the 2 I was initially able to merge), and this again resulted in all of the correct columns but only 4 rows.
Any help would be much appreciated.
Edits:
Sorry for not being more clear.
Here is the code for the 2 initial columns I merged:
import pandas as pd
df_my_data = pd.read_excel('my_data.xlsx')
df_two_columns = pd.read_excel('two_columns_from_transportation_file.xlsx')
df_two_columns_merged = pd.merge(df_my_data, df_two_columns, on=['census_tract'])
df_two_columns_merged.to_excel('two_columns_merged.xlsx', index = False)
The outputs were:
df_my_data.head()
census_tract id e t
0 6037408401 1 1 1092
1 6037700200 2 1 1517
2 6065042740 3 1 2796
3 6037231210 4 1 1
4 6059076201 5 1 41
df_two_columns.head()
census_tract households_with_no_vehicle vehicles_per_household
0 6001400100 2.16 2.08
1 6001400200 6.90 1.50
2 6001400300 17.33 1.38
3 6001400400 8.97 1.41
4 6001400500 11.59 1.39
df_two_columns_merged.head()
census_tract id e t households_with_no_vehicle vehicles_per_household
0 6037408401 1 1 1092 4.52 2.43
1 6037700200 2 1 1517 9.88 1.26
2 6065042740 3 1 2796 2.71 1.49
3 6037231210 4 1 1 25.75 1.35
4 6059076201 5 1 41 1.63 2.22
df_my_data has 657 rows and df_two_columns_merged came out with 657 rows.
The code for when I tried to merge the entire transport file:
import pandas as pd
df_my_data = pd.read_excel('my_data.xlsx')
df_transportation_data = pd.read_excel('transportation_data.xlsx')
df_merged_file = pd.merge(df_my_data, df_transportation_data, on=['census_tract'])
df_merged_file.to_excel('my_merged_file.xlsx', index = False)
The output:
df_transportation_data.head()
census_tract Bike Carpooled Drove Alone Households No Vehicle Public Transportation Walk Vehicles per Household
0 6001400100 0.00 12.60 65.95 2.16 20.69 0.76 2.08
1 6001400200 5.68 3.66 45.79 6.90 39.01 5.22 1.50
2 6001400300 7.55 6.61 46.77 17.33 31.19 6.39 1.38
3 6001400400 8.85 11.29 43.91 8.97 27.67 4.33 1.41
4 6001400500 8.45 7.45 46.94 11.59 29.56 4.49 1.39
df_merged_file.head()
census_tract id e t Bike Carpooled Drove Alone Households No Vehicle Public Transportation Walk Vehicles per Household
0 6041119100 18 0 2755 1.71 3.02 82.12 4.78 8.96 3.32 2.10
1 6061023100 74 1 1201 0.00 9.85 86.01 0.50 2.43 1.16 2.22
2 6041110100 80 1 9 0.30 4.40 72.89 6.47 13.15 7.89 1.82
3 6029004902 123 0 1873 0.00 18.38 78.69 4.12 0.00 0.00 2.40
The df_merged_file only has 4 total rows.
So my question is: why am I able to merge those initial 2 columns from the transportation file and keep all of the rows from my file, but when I try to merge the entire transportation file I only get 4 rows of output?
I recommend specifying the merge type and the merge column(s).
When you use pd.merge(), the default is an inner merge on all identically named columns. You can specify the merge explicitly:
df_merged_file = pd.merge(df_my_data, df_transportation_data, how='left', left_on=[COLUMN], right_on=[COLUMN])
It is possible that one of the columns you removed from the "transportation_data.xlsx" file has the same name as a column in your "my_data.xlsx", so unmatched rows were dropped by the default inner merge.
A 'left' merge keeps every row of "my_data.xlsx" and attaches the columns you need from "transportation_data.xlsx" wherever there is a match on the merge column, so the merged DataFrame will have the same number of rows as your "my_data.xlsx" has currently.
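For example, a minimal sketch using the file names from the question (it assumes 'census_tract' is the shared key and is unique in the transportation file):

import pandas as pd

df_my_data = pd.read_excel('my_data.xlsx')
df_transportation_data = pd.read_excel('transportation_data.xlsx')

# Left merge: every row of df_my_data is kept; the transportation
# columns attach where census_tract matches and are NaN elsewhere
df_merged_file = pd.merge(df_my_data, df_transportation_data,
                          how='left', on='census_tract')
print(len(df_merged_file))  # should equal len(df_my_data), i.e. 657
df_merged_file.to_excel('my_merged_file.xlsx', index=False)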
Well, I think there was something wrong with the initial download of the transportation file. I downloaded it again and this time I was able to get a complete merge. Sorry for being an idiot. Thank you all for your help.

python: grouping or splitting up time series data based on conditions

I work a lot with time series data at my job and I have been trying to use python--specifically pandas--to make some of the work a little faster. I have some code that reads through data in a DataFrame and identifies segments where specified conditions are met. It then separates those segments into individual DataFrames.
I have a sample DataFrame here:
Date Time Pressure Temp Flow Valve Position
0 3/5/2020 12:00:01 5.32 22.12 199 1.00
1 3/5/2020 12:00:02 5.36 22.25 115 0.95
2 3/5/2020 12:00:03 5.33 22.18 109 0.92
3 3/5/2020 12:00:04 5.38 23.51 103 0.90
4 3/5/2020 12:00:05 5.42 24.27 99 0.89
5 3/5/2020 12:00:06 5.49 25.91 92 0.85
6 3/5/2020 12:00:07 5.55 26.78 85 0.82
7 3/5/2020 12:00:08 5.61 29.88 82 0.76
8 3/5/2020 12:00:09 5.69 31.16 87 0.79
9 3/5/2020 12:00:10 5.72 32.01 97 0.87
10 3/5/2020 12:00:11 5.59 29.68 104 0.90
11 3/5/2020 12:00:12 5.53 24.55 111 0.93
12 3/5/2020 12:00:13 5.48 23.54 116 0.96
13 3/5/2020 12:00:14 5.44 23.11 119 1.00
14 3/5/2020 12:00:15 5.41 23.08 121 1.00
The code I have written does what I want but is really difficult to follow, and I am sure it's offensive to experienced python users.
Here is what it does though:
I more or less create a mask based on a set of conditions and take the index locations of all the True values in the mask. Then I use NumPy's .diff() function to identify discontinuities in the indices. Inside the for loop, the mask is split at the location of each identified discontinuity. Once that is complete, I can use the now-separate sets of indices to slice the desired segments out of my original DataFrame. See the code below:
import pandas as pd
import numpy as np

df = pd.read_csv('sample_data.csv')
idx = np.where((df['Temp'] > 23) & (df['Temp'] < 30))[0]
discontinuity = np.where(np.diff(idx) > 1)[0]
intervals = {}
for i in range(len(discontinuity) + 1):
    if i == 0:
        intervals[i] = df.iloc[idx[0]:idx[discontinuity[i]], 1]
        if len(intervals[i].values) < 1:
            del intervals[i]
    elif i == len(discontinuity):
        intervals[i] = df.iloc[idx[discontinuity[i-1]+1]:idx[-1], 1]
        if len(intervals[i].values) < 1:
            del intervals[i]
    else:
        intervals[i] = df.iloc[idx[discontinuity[i-1]+1]:idx[discontinuity[i]], 1]
        if len(intervals[i].values) < 1:
            del intervals[i]
df1 = df.loc[intervals[0].index, :]
df2 = df.loc[intervals[1].index, :]
df1 and df2 contain all the data in the original DataFrame corresponding with the times (rows) that 'Temp' is between 23 and 30.
df1:
Date Time Pressure Temp Flow Valve Position
3 3/5/2020 12:00:04 5.38 23.51 103 0.90
4 3/5/2020 12:00:05 5.42 24.27 99 0.89
5 3/5/2020 12:00:06 5.49 25.91 92 0.85
6 3/5/2020 12:00:07 5.55 26.78 85 0.82
df2:
Date Time Pressure Temp Flow Valve Position
10 3/5/2020 12:00:11 5.59 29.68 104 0.90
11 3/5/2020 12:00:12 5.53 24.55 111 0.93
12 3/5/2020 12:00:13 5.48 23.54 116 0.96
13 3/5/2020 12:00:14 5.44 23.11 119 1.00
I am glad I was able to get this to work, and I can live with the couple of lines that get lost using this method, but I know this is a really pedestrian approach and I can't help but think that someone who is not a python beginner could do the same thing much more cleanly and efficiently.
Could groupby from itertools or pandas work for this? I haven't been able to find a way to make that work.
Welcome to Stack Overflow.
I think your code can be simplified as such:
# Get the subset that fulfills your conditions
df_conditioned = df.query('Temp > 23 and Temp < 30').copy()

# Check for discontinuities by looking at the indices;
# the new 'Group' column numbers each run of continuous indices
indices = df_conditioned.index.to_series()
df_conditioned['Group'] = ((indices - indices.shift(1)) != 1).cumsum()

# Store the groups (segments with the same group number) as individual frames in a list
df_list = []
for group in df_conditioned['Group'].unique():
    df_list.append(df_conditioned.query('Group == @group').drop(columns='Group'))
Hope it helps!
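Since you asked about groupby: here is a sketch of a pandas groupby variant of the same idea (assuming the same 'sample_data.csv'; this variant also keeps the boundary rows that the index slicing drops):

import pandas as pd

df = pd.read_csv('sample_data.csv')
mask = (df['Temp'] > 23) & (df['Temp'] < 30)

# Every False row bumps the counter, so each contiguous run of True
# rows shares a single group id
group_ids = (~mask).cumsum()[mask]
segments = [segment for _, segment in df[mask].groupby(group_ids)]
# segments[0] and segments[1] roughly correspond to your df1 and df2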

Select value from dataframe based on other dataframe

I am trying to calculate the position of an object based on a timestamp. For this I have two dataframes in pandas: one for the measurement data and one for the position. All the movement is straightforward acceleration.
Dataframe 1 contains the measurement data:
ms force ... ... ...
1 5 20
2 10 20
3 15 25
4 20 30
5 25 20
..... (~ 6000 lines)
Dataframe 2 contains "positioning data"
ms speed (m/s)
1 0 0.66
2 4500 0.66
3 8000 1.3
4 16000 3.0
5 20000 3.0
.....(~300 lines)
Now I want to calculate the position for each row of the first dataframe using the data from the second dataframe.
In Excel I solved the problem with an array formula, but now I have to use Python/Pandas and I can't find a way to select the correct row from dataframe 2.
My idea is to use an if-condition to pick the matching row (see the pseudo code in the update below).
In the end I want to plot "force vs. distance" rather than "force vs. time".
Thank you in advance.
==========================================================================
Update:
In the meantime I have almost solved my issue. Now my data looks like this:
Dataframe 2 (Speed Data):
pos v a t t-end t-start
0 -3.000 0.666667 0.000000 4.500000 4.500000 0.000000
1 0.000 0.666667 0.187037 0.071287 4.571287 4.500000
2 0.048 0.680000 0.650794 0.010244 4.581531 4.571287
3 0.055 0.686667 0.205432 0.064904 4.646435 4.581531
...
15 0.055 0.686667 0.5 0.064904 23.0 20.0
...
28 0.055 0.686667 0.6 0.064904 35.0 34.0
...
30 0.055 0.686667 0.9 0.064904 44.0 39.0
And Dataframe 1 (time based measurement):
Fx Fy Fz abs_t expected output ('a' from DF2)
0 -13.9 170.3 45.0 0.005 0.000000
1 -14.1 151.6 38.2 0.010 0.000000
...
200 -14.1 131.4 30.4 20.015 0.5
...
300 -14.3 111.9 21.1 34.01 0.6
...
400 -14.5 95.6 13.2 40.025
So I want to take the time (abs_t) from DF1 and look up the correct 'a' in DF2.
Something like this (pseudo code):
if DF1['t_abs'] between (DF2['t-start'], DF2['t-end']):
    DF1['a'] = DF2['a']
I could make two for loops, but that looks like the wrong approach and is very, very slow.
I hope you understand my problem; providing a running sample is very hard.
In Excel I did it with an array formula (screenshot not shown here).
I found a very slow solution, but at least it's working :(
df1['a'] = 0
for index, row in df2.iterrows():
    start = row['t-start']
    end = row['t-end']
    a = row['a']
    df1.loc[(df1['tabs'] > start) & (df1['tabs'] < end), 'a'] = a
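The iterrows() loop can likely be replaced with a single pd.merge_asof, which matches each measurement time to the most recent interval start. A sketch, assuming df2 is sorted by 't-start' and the intervals do not overlap:

import pandas as pd

# Drop the pre-filled 'a' so the merge does not create a_x/a_y suffixes
left = df1.drop(columns='a', errors='ignore').sort_values('tabs')
merged = pd.merge_asof(left, df2[['t-start', 't-end', 'a']],
                       left_on='tabs', right_on='t-start')
# Zero out rows whose time falls past the matched interval's end
merged.loc[merged['tabs'] > merged['t-end'], 'a'] = 0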

Python: Imported csv not being split into proper columns

I am importing a csv file into python using pandas, but the data frame comes out as a single column. I copied and pasted data in comma-separated format from the Player Standing Field table at this link (the second one) into an Excel file and saved it as a csv (originally as MS-DOS, then as both normal and UTF-8, per the recommendation by AllthingsGo42). But it still returned a single-column data frame.
Examples of what I tried:
dataset = pd.read_csv('MLB2016PlayerStats2.csv')
dataset = pd.read_csv('MLB2016PlayerStats2.csv', delimiter=',')
dataset = pd.read_csv('MLB2016PlayerStats2.csv', encoding='ISO-8859-9',
                      delimiter=',')
Each line of code above returned:
Rk,Name,Age,Tm,Lg,G,GS,CG,Inn,Ch,PO,A,E,DP,Fld%,Rtot,Rtot/yr,Rdrs,Rdrs/yr,RF/9,RF/G,Pos Summary
1,Fernando Abad\abadfe01,30,TOT,AL,57,0,0,46.2...
2,Jose Abreu\abreujo02,29,CHW,AL,152,152,150,1...
3,A.J. Achter\achteaj01,27,LAA,AL,27,0,0,37.2,...
4,Dustin Ackley\ackledu01,28,NYY,AL,23,16,10,1...
5,Cristhian Adames\adamecr01,24,COL,NL,69,43,3...
Also tried:
dataset = pd.read_csv('MLB2016PlayerStats2.csv', encoding='ISO-8859-9',
                      delimiter=',', quoting=3)
Which returned:
"Rk Name Age Tm Lg G GS CG Inn Ch
\
0 "1 Fernando Abad\abadfe01 30 TOT AL 57 0 0 46.2 4
1 "2 Jose Abreu\abreujo02 29 CHW AL 152 152 150 1355.2 1337
2 "3 A.J. Achter\achteaj01 27 LAA AL 27 0 0 37.2 6
3 "4 Dustin Ackley\ackledu01 28 NYY AL 23 16 10 140.1 97
4 "5 Cristhian Adames\adamecr01 24 COL NL 69 43 38 415.0 212
E DP Fld% Rtot Rtot/yr Rdrs Rdrs/yr RF/9 RF/G \
0 ... 0 1 1.000 NaN NaN NaN NaN 0.77 0.07
1 ... 10 131 0.993 -2.0 -2.0 -5.0 -4.0 8.81 8.73
2 ... 0 0 1.000 NaN NaN 0.0 0.0 1.43 0.22
3 ... 0 8 1.000 1.0 9.0 3.0 27.0 6.22 4.22
4 ... 6 24 0.972 -4.0 -12.0 1.0 3.0 4.47 2.99
Pos Summary"
0 P"
1 1B"
2 P"
3 1B-OF-2B"
4 SS-2B-3B"
Below is what the data looks like in notepad++
"Rk,Name,Age,Tm,Lg,G,GS,CG,Inn,Ch,PO,A,E,DP,Fld%,Rtot,Rtot/yr,Rdrs,Rdrs/yr,RF/9,RF/G,Pos Summary"
"1,Fernando Abad\abadfe01,30,TOT,AL,57,0,0,46.2,4,0,4,0,1,1.000,,,,,0.77,0.07,P"
"2,Jose Abreu\abreujo02,29,CHW,AL,152,152,150,1355.2,1337,1243,84,10,131,.993,-2,-2,-5,-4,8.81,8.73,1B"
"3,A.J. Achter\achteaj01,27,LAA,AL,27,0,0,37.2,6,2,4,0,0,1.000,,,0,0,1.43,0.22,P"
"4,Dustin Ackley\ackledu01,28,NYY,AL,23,16,10,140.1,97,89,8,0,8,1.000,1,9,3,27,6.22,4.22,1B-OF-2B"
"5,Cristhian Adames\adamecr01,24,COL,NL,69,43,38,415.0,212,68,138,6,24,.972,-4,-12,1,3,4.47,2.99,SS-2B-3B"
"6,Austin Adams\adamsau01,29,CLE,AL,19,0,0,18.1,1,0,0,1,0,.000,,,0,0,0.00,0.00,P"
Sorry for the confusion with my question before. I hope this edit will clear things up. Thank you to those that answered thus far.
Running it quickly myself, I was able to get what I understand to be the desired output.
My only thought is that there is no need to specify a delimiter for a csv, because a csv is a comma-separated values file, but that should not matter. I suspect there is something wrong with your actual data file, and I would go make sure it is saved correctly. I would echo the previous comments and make sure the csv is saved as UTF-8, and not MS-DOS or Macintosh (both options when saving in Excel).
Best of luck!
There is no need to specify a delimiter for a csv. You only have to change the separator from ";" to ",". For this you can open your csv file with Notepad and change the separators with the replace tool.
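Judging by the Notepad++ view above, each whole line is wrapped in double quotes, so the parser sees a single quoted field per row. Here is a sketch that strips the outer quotes before parsing (assuming the file matches that excerpt):

import pandas as pd
from io import StringIO

# Remove the surrounding double quotes from every line, then parse normally
with open('MLB2016PlayerStats2.csv') as f:
    cleaned = '\n'.join(line.strip().strip('"') for line in f)

dataset = pd.read_csv(StringIO(cleaned))
print(dataset.head())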
