Python: Imported csv not being split into proper columns - python
I am importing a CSV file into Python using pandas, but the resulting data frame has only one column. I copied the data in comma-separated format from the Player Standing Field table at this link (second one), pasted it into an Excel file, and saved it as a CSV (originally as MS-DOS, then as both normal and UTF-8 per the recommendation by AllthingsGo42). It still returned a single-column data frame.
Examples of what I tried:
dataset=pd.read_csv('MLB2016PlayerStats2.csv')
dataset=pd.read_csv('MLB2016PlayerStats2.csv', delimiter=',')
dataset=pd.read_csv('MLB2016PlayerStats2.csv',encoding='ISO-8859-9',
delimiter=',')
Each line of code above returned:
Rk,Name,Age,Tm,Lg,G,GS,CG,Inn,Ch,PO,A,E,DP,Fld%,Rtot,Rtot/yr,Rdrs,Rdrs/yr,RF/9,RF/G,Pos Summary
1,Fernando Abad\abadfe01,30,TOT,AL,57,0,0,46.2...
2,Jose Abreu\abreujo02,29,CHW,AL,152,152,150,1...
3,A.J. Achter\achteaj01,27,LAA,AL,27,0,0,37.2,...
4,Dustin Ackley\ackledu01,28,NYY,AL,23,16,10,1...
5,Cristhian Adames\adamecr01,24,COL,NL,69,43,3...
Also tried:
dataset=pd.read_csv('MLB2016PlayerStats2.csv',encoding='ISO-8859-9',
delimiter=',',quoting=3)
Which returned:
"Rk Name Age Tm Lg G GS CG Inn Ch
\
0 "1 Fernando Abad\abadfe01 30 TOT AL 57 0 0 46.2 4
1 "2 Jose Abreu\abreujo02 29 CHW AL 152 152 150 1355.2 1337
2 "3 A.J. Achter\achteaj01 27 LAA AL 27 0 0 37.2 6
3 "4 Dustin Ackley\ackledu01 28 NYY AL 23 16 10 140.1 97
4 "5 Cristhian Adames\adamecr01 24 COL NL 69 43 38 415.0 212
E DP Fld% Rtot Rtot/yr Rdrs Rdrs/yr RF/9 RF/G \
0 ... 0 1 1.000 NaN NaN NaN NaN 0.77 0.07
1 ... 10 131 0.993 -2.0 -2.0 -5.0 -4.0 8.81 8.73
2 ... 0 0 1.000 NaN NaN 0.0 0.0 1.43 0.22
3 ... 0 8 1.000 1.0 9.0 3.0 27.0 6.22 4.22
4 ... 6 24 0.972 -4.0 -12.0 1.0 3.0 4.47 2.99
Pos Summary"
0 P"
1 1B"
2 P"
3 1B-OF-2B"
4 SS-2B-3B"
Below is what the data looks like in notepad++
"Rk,Name,Age,Tm,Lg,G,GS,CG,Inn,Ch,PO,A,E,DP,Fld%,Rtot,Rtot/yr,Rdrs,Rdrs/yr,RF/9,RF/G,Pos Summary"
"1,Fernando Abad\abadfe01,30,TOT,AL,57,0,0,46.2,4,0,4,0,1,1.000,,,,,0.77,0.07,P"
"2,Jose Abreu\abreujo02,29,CHW,AL,152,152,150,1355.2,1337,1243,84,10,131,.993,-2,-2,-5,-4,8.81,8.73,1B"
"3,A.J. Achter\achteaj01,27,LAA,AL,27,0,0,37.2,6,2,4,0,0,1.000,,,0,0,1.43,0.22,P"
"4,Dustin Ackley\ackledu01,28,NYY,AL,23,16,10,140.1,97,89,8,0,8,1.000,1,9,3,27,6.22,4.22,1B-OF-2B"
"5,Cristhian Adames\adamecr01,24,COL,NL,69,43,38,415.0,212,68,138,6,24,.972,-4,-12,1,3,4.47,2.99,SS-2B-3B"
"6,Austin Adams\adamsau01,29,CLE,AL,19,0,0,18.1,1,0,0,1,0,.000,,,0,0,0.00,0.00,P"
Sorry for the confusion with my question before. I hope this edit will clear things up. Thank you to those that answered thus far.
Running it quickly myself, I was able to get what I am understanding is the desired output.
My only thought is that there is no need to specify a delimiter for a CSV, because a CSV is a comma-separated values file, but that should not matter. I think there is something wrong with your actual data file, and I would make sure it is saved correctly. I would echo previous comments and make sure that the CSV is saved as UTF-8, and not as MS-DOS or Macintosh (both are options when saving in Excel).
Best of luck!
There is no need to specify a delimiter for a CSV. You only have to change the separator from ";" to ",". To do this, open your CSV file in Notepad and change the separators with the Replace tool.
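Looking at the Notepad++ view in the question, every line is wrapped in one pair of double quotes, so the parser treats each row as a single quoted field, which would explain the one-column result. Here is a minimal sketch of one possible workaround, assuming that is indeed the cause: strip the outer quotes from each line, then parse what is left as ordinary CSV. The inline sample text stands in for the real file, and the names raw and cleaned are only illustrative.

```python
import io

import pandas as pd

# Stand-in for the file contents: each whole line is wrapped in double quotes
raw = (
    '"Rk,Name,Age,Tm"\n'
    '"1,Fernando Abad\\abadfe01,30,TOT"\n'
    '"2,Jose Abreu\\abreujo02,29,CHW"\n'
)

# Remove the quote at both ends of every line so the commas become visible as separators
cleaned = "\n".join(line.strip().strip('"') for line in raw.splitlines())

# Parse the cleaned text as a normal comma-separated file
dataset = pd.read_csv(io.StringIO(cleaned))
print(dataset)
```

For the real file, the same stripping could be applied to the lines read from open('MLB2016PlayerStats2.csv') before handing them to pd.read_csv.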
Related
How to split a dataframe containing voltage over time value, so that it can store values of each waveform/bit separately
I have several CSV files containing voltage-over-time data. Each file has approximately 7000 rows, and the data looks like this:
Time(us)  Voltage (V)
0         32.96554106
0.5       32.9149649
1         32.90484966
1.5       32.86438874
2         32.8542735
2.5       32.76323642
3         32.74300595
3.5       32.65196886
4         32.58116224
4.5       32.51035562
5         32.42943376
5.5       32.38897283
6         32.31816621
6.5       32.28782051
7         32.26759005
7.5       32.21701389
8         32.19678342
8.5       32.16643773
9         32.14620726
9.5       32.08551587
10        32.04505495
10.5      31.97424832
11        31.92367216
11.5      31.86298077
12        31.80228938
12.5      31.78205891
13        31.73148275
13.5      31.69102183
14        31.68090659
14.5      31.67079136
15        31.64044567
15.5      31.59998474
16        31.53929335
16.5      31.51906288
I read the CSV file into a pandas dataframe, and after plotting the data from one CSV file with matplotlib, the figure looks like below.
I would like to split out every single square waveform/bit and store the corresponding voltage values for each bit separately, so that the voltage values of each bit are stored in one row and look like this:
I don't have any idea how to do that. I guess I have to write a function with a threshold value, so that if the voltage values go down for maybe 20 time steps, it captures all the values, or if the voltage level goes up for 20 time steps, it captures all the voltage values. Could someone help?
If you get the gradient of your Voltage (here using diff as the time is regularly spaced), this gives you the following:
You can thus easily use a threshold (I tested with 2) to identify the peak starts. Then pivot your data:
# get threshold of gradient
m = df['Voltage (V)'].diff().gt(2)
# group start = value above threshold preceded by value below threshold
group = (m&~m.shift(fill_value=False)).cumsum().add(1)
df2 = (df
  .assign(id=group,
          t=lambda d: d['Time (us)'].groupby(group).apply(lambda s: s-s.iloc[0])
         )
  .pivot(index='id', columns='t', values='Voltage (V)')
)
output:
t       0.0        0.5        1.0        1.5        2.0        2.5    \
id
1   32.965541  32.914965  32.904850  32.864389  32.854273  32.763236
2   25.045314  27.543777  29.182444  30.588462  31.114454  31.984364
3   25.166697  27.746081  29.415095  30.719960  31.326873  32.125977
4   25.277965  27.877579  29.536477  30.912149  31.367334  32.206899
5   25.379117  27.978732  29.667975  30.780651  31.670791  32.338397
6   25.631998  27.634814  28.959909  30.173737  30.659268  31.053762
7   23.528030  26.137759  27.948386  29.253251  30.244544  30.649153
8   23.639297  26.380525  28.464263  29.971432  30.902034  31.458371
9   23.740449  26.542369  28.707028  30.295120  30.881803  31.862981
10  23.871948  26.673867  28.889103  30.305235  31.185260  31.873096
11  24.387824  26.694097  28.342880  29.678091  30.315350  31.134684
...
t       748.5      749.0
id
1         NaN        NaN
2         NaN        NaN
3         NaN        NaN
4         NaN        NaN
5         NaN        NaN
6   21.059913  21.161065
7         NaN        NaN
8         NaN        NaN
9         NaN        NaN
10        NaN        NaN
11        NaN        NaN
[11 rows x 1499 columns]
plot:
df2.T.plot()
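To try the thresholding and pivot steps without the original files, here is a small self-contained sketch on made-up numbers (the voltages below are invented purely for illustration). It follows the same idea as the answer, but computes the per-bit time offset with groupby(...).transform("first") instead of apply; that substitution is my own assumption about what behaves consistently across pandas versions, not part of the original answer.

```python
import pandas as pd

# Invented sample: three short "bits", each starting with a jump of more than 2 V
df = pd.DataFrame({
    "Time (us)": [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5],
    "Voltage (V)": [32.0, 31.5, 31.0, 25.0, 28.0, 30.0, 31.0, 24.0, 27.5, 29.5],
})

# True where the voltage rises by more than the threshold between samples
m = df["Voltage (V)"].diff().gt(2)

# A new group starts where the rise exceeds the threshold and the previous sample did not
group = (m & ~m.shift(fill_value=False)).cumsum().add(1)

# Time relative to the first sample of each group, then one row per group
df2 = (
    df.assign(
        id=group,
        t=lambda d: d["Time (us)"] - d.groupby(group)["Time (us)"].transform("first"),
    )
    .pivot(index="id", columns="t", values="Voltage (V)")
)
print(df2)
```

Groups of different lengths are padded with NaN on the right, as in the answer's output.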
Pandas how to preserve all values in dataframe into a csv?
I want to convert the html to csv using pandas functions
This is a part of what I read in the dataframe df
    0                1
0   sequence         2
1   trainNo          K805
2   trainNumber      K805
3   departStation    鹰潭
4   departStationPy  yingtan
5   arriveStation    南昌
6   arriveStationPy  nanchang
7   departDate       2020-05-24
8   departTime       03:55
9   arriveDate       2020-05-24
10  arriveTime       05:44
11  isStartStation   False
12  isEndStation     False
13  runTime          1小时49分钟
14  preSaleTime      NaN
15  takeDays         0
16  isBookable       True
17  seatList         seatNamepriceorderPriceinventoryisBookablebutt...
18  curSeatIndex     0
   seatName  price  orderPrice  inventory  isBookable  buttonDisplayName  buttonType
0  硬座      23.5   23.5        99         True        NaN                0
1  硬卧      69.5   69.5        99         True        NaN                0
2  软卧      104.5  104.5       4          True        NaN                0
   0                       1
0  departDate              2020-05-23
1  departStationList       NaN
2  endStationList          NaN
3  departStationFilterMap  NaN
4  endStationFilterMap     NaN
5  departCityName          上海
6  arriveCityName          南昌
7  gtMinPrice              NaN
My code is like this
for i, df in enumerate(pd.read_html(html,encoding='utf-8')):
    df.to_csv(".\other.csv",index=True,encoding='utf-8-sig')
To preserve the characters in csv, I need to use utf-8-sig encoding. But I don't know how to write the format symbol %
,0,1
0,departDate,2020-05-23
1,departStationList,
2,endStationList,
3,departStationFilterMap,
4,endStationFilterMap,
5,departCityName,上海
6,arriveCityName,南昌
7,gtMinPrice,
This is what I got in csv file, only the last part is preserved. The dataframe is correct, while the csv need correction. Can you show me how to make the correct output?
You're saving each dataframe to the same file, so each one gets overwritten until only the last remains. Note the addition of the f-string to change the save file name, e.g. f".\other_{i}.csv". Each dataframe is a different shape, so they won't all fit together properly.
To CSV
for i, df in enumerate(pd.read_html(html,encoding='utf-8')):
    df.to_csv(f".\other_{i}.csv", index=True, encoding='utf-8-sig')
To Excel
with pd.ExcelWriter('output.xlsx', mode='w') as writer:
    for i, df in enumerate(pd.read_html(html,encoding='utf-8')):
        df.to_excel(writer, sheet_name=f'Sheet{i}', encoding='utf-8-sig')
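The one-file-per-table idea can be checked without any HTML at all. In this rough sketch, the list called tables is a hypothetical stand-in for what pd.read_html would return, and a temporary directory is used instead of the current folder; both are assumptions made only so the snippet runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-ins for the dataframes that pd.read_html would return
tables = [
    pd.DataFrame({0: ["departStation", "arriveStation"], 1: ["鹰潭", "南昌"]}),
    pd.DataFrame({"seatName": ["硬座", "硬卧"], "price": [23.5, 69.5]}),
]

out_dir = Path(tempfile.mkdtemp())
written = []
for i, table in enumerate(tables):
    # A distinct file name per table, so nothing gets overwritten
    path = out_dir / f"other_{i}.csv"
    table.to_csv(path, index=True, encoding="utf-8-sig")
    written.append(path)

# Read the first file back to confirm the non-ASCII values survived
round_trip = pd.read_csv(written[0], index_col=0, encoding="utf-8-sig")
print(round_trip)
```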
Select value from dataframe based on other dataframe
I am trying to calculate the position of an object based on a timestamp. For this I have two dataframes in pandas: one for the measurement data and one for the position. All the movement is a straightforward acceleration.
Dataframe 1 contains the measurement data:
     ms   force
...  ...  ...
1    5    20
2    10   20
3    15   25
4    20   30
5    25   20
..... (~ 6000 lines)
Dataframe 2 contains "positioning data":
   ms     speed (m/s)
1  0      0.66
2  4500   0.66
3  8000   1.3
4  16000  3.0
5  20000  3.0
.....(~300 lines)
Now I want to calculate the position for the first dataframe with the data from the second dataframe. In Excel I solved the problem by using an array formula, but now I have to use Python/pandas and I can't find a way to select the correct row from dataframe 2.
My idea is to make something like this: if
In the end I want to display a graph "force <-> way" and not "force <-> time".
Thank you in advance
==========================================================================
Update: In the meantime I could almost solve my issue. Now my data looks like this:
Dataframe 2 (speed data):
    pos     v         a         t         t-end     t-start
0   -3.000  0.666667  0.000000  4.500000  4.500000  0.000000
1   0.000   0.666667  0.187037  0.071287  4.571287  4.500000
2   0.048   0.680000  0.650794  0.010244  4.581531  4.571287
3   0.055   0.686667  0.205432  0.064904  4.646435  4.581531
...
15  0.055   0.686667  0.5       0.064904  23.0      20.0
...
28  0.055   0.686667  0.6       0.064904  35.0      34.0
...
30  0.055   0.686667  0.9       0.064904  44.0      39.0
And Dataframe 1 (time-based measurement):
     Fx     Fy     Fz    abs_t   expected output ('a' from DF1)
0    -13.9  170.3  45.0  0.005   0.000000
1    -14.1  151.6  38.2  0.010   0.000000
...
200  -14.1  131.4  30.4  20.015  0.5
...
300  -14.3  111.9  21.1  34.01   0.6
...
400  -14.5  95.6   13.2  40.025
So I want to check the time (abs_t) from DF1 and search for the correct 'a' in DF2. So something like this (pseudocode):
if DF1['t_abs'] between (DF2['t-start'], DF2['t-end']):
    DF1['a'] = DF2['a']
I could make two for loops, but it looks like the wrong way and is very, very slow.
I hope you understand my problem; providing a running sample is very hard. In Excel I did it like this:
I found a very slow solution, but at least it's working :(
df1['a'] = 0
for index, row in df2.iterrows():
    start = row['t-start']
    end = row['t-end']
    a = row['a']
    df1.loc[(df1['tabs'] > start) & (df1['tabs'] < end), 'a'] = a
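A loop-free alternative might be pd.merge_asof, which matches each timestamp to the closest earlier t-start in a single pass. This is only a sketch under the assumption that the intervals in DF2 do not overlap. The interval values are taken loosely from the question, and the timestamp 30.0 is invented to show a time that falls in a gap. Unlike the loop above, this version treats the interval bounds as inclusive and leaves unmatched rows as NaN rather than 0.

```python
import pandas as pd

# Intervals and accelerations, loosely based on the question's DF2
df2 = pd.DataFrame({
    "t-start": [0.0, 4.5, 20.0, 34.0, 39.0],
    "t-end": [4.5, 4.571287, 23.0, 35.0, 44.0],
    "a": [0.0, 0.187037, 0.5, 0.6, 0.9],
})

# Measurement timestamps; 30.0 is an invented value that lies between two intervals
df1 = pd.DataFrame({"abs_t": [0.005, 0.010, 20.015, 30.0, 34.01, 40.025]})

# For every abs_t, pick the row of df2 with the latest t-start that is <= abs_t
merged = pd.merge_asof(
    df1.sort_values("abs_t"),
    df2.sort_values("t-start"),
    left_on="abs_t",
    right_on="t-start",
    direction="backward",
)

# Blank out matches where the timestamp lies past the end of the matched interval
merged["a"] = merged["a"].where(merged["abs_t"] <= merged["t-end"])
print(merged[["abs_t", "a"]])
```

Both frames must be sorted on their key columns for merge_asof, which is why sort_values is applied first.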
How to conditionally create pandas column from other column values
I have a dataframe that looks like this:
   word                start  stop   speaker
0  but that's alright  2.72   3.47   2
1  we'll have to       8.43   9.07   1
2  okay sure           9.19   11.01  2
3  what?               11.02  12.00  1
4  I agree             12.01  14.00  2
5  but i disagree      14.01  17.00  2
6  thats fine          17.01  19.00  1
7  however you are     19.01  22.00  1
8  like this           22.01  24.00  1
9  and                 24.01  25.00  1
I want to create two new columns, df.speaker_1 and df.speaker_2. When df.speaker == 2, I want df.speaker_2 to contain the values of df.word. When df.speaker != 2, I want it to contain an empty string. The same will be repeated for the other speaker value. It should look as below:
   word                start  stop   speaker  speaker_2           speaker_1
0  but that's alright  2.72   3.47   2        but that's alright
1  we'll have to       8.43   9.07   1                            we'll have to
2  okay sure           9.19   11.01  2        okay sure
3  what?               11.02  12.00  1                            what?
4  I agree             12.01  14.00  2        I agree
5  but i disagree      14.01  17.00  2        but i disagree
6  thats fine          17.01  19.00  1                            thats fine
7  however you are     19.01  22.00  1                            however you are
8  like this           22.01  24.00  1                            like this
9  and                 24.01  25.00  1                            and
Any advice would be appreciated, thanks.
You can copy values from your column word then replace with empty strings as needed:
df['speaker_1'] = df['word']
df['speaker_2'] = df['word']
df.loc[df['speaker'] != 1, 'speaker_1'] = ''
df.loc[df['speaker'] != 2, 'speaker_2'] = ''
Alternatively, you could use apply, but I find this is more straightforward in your case.
You could use pd.Series.mask():
df['speaker_1'] = df.word.mask(df.speaker!=1, '')
df['speaker_2'] = df.word.mask(df.speaker!=2, '')
#    word                start  ...  speaker_1        speaker_2
# 0  but that's alright  2.72   ...                   but that's alright
# 1  we'll have to       8.43   ...  we'll have to
# 2  okay sure           9.19   ...                   okay sure
# 3  what?               11.02  ...  what?
# 4  I agree             12.01  ...                   I agree
# 5  but i disagree      14.01  ...                   but i disagree
# 6  thats fine          17.01  ...  thats fine
# 7  however you are     19.01  ...  however you are
# 8  like this           22.01  ...  like this
# 9  and                 24.01  ...  and
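If there may be more than two speakers, the same idea could be written as a loop over the distinct speaker values. The snippet below is only a sketch on the first four rows of the question's data, and it uses Series.where (the complement of mask) so that the condition reads as "keep the word where the speaker matches".

```python
import pandas as pd

# First four rows of the question's dataframe, reduced to the relevant columns
df = pd.DataFrame({
    "word": ["but that's alright", "we'll have to", "okay sure", "what?"],
    "speaker": [2, 1, 2, 1],
})

# One new column per speaker: keep the word where the speaker matches, else an empty string
for speaker in sorted(df["speaker"].unique()):
    df[f"speaker_{speaker}"] = df["word"].where(df["speaker"] == speaker, "")

print(df)
```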
Reading a multiline record into Pandas dataframe
I have earthquake data that I want to read into a pandas dataframe. The data for each earthquake is spread over 5 fixed-format lines, and the format of each of the 5 lines is different. Some fields include variable whitespace, so I can't just do a delimited read. Is there an elegant way to parse that with read_fwf (or something else)? I think nesting loops with chunksize=1 might work, but it's not very clean. Or I could reformat the file to cat each 5-line block out to a single line, but I'd rather use the original file. Here's the first earthquake as an example:
MLI 1976/01/01 01:29:39.6 -28.61 -177.64 59.0 6.2 0.0 KERMADEC ISLANDS REGION
M010176A B: 0 0 0 S: 0 0 0 M: 12 30 135 CMT: 1 BOXHD: 9.4
CENTROID: 13.8 0.2 -29.25 0.02 -176.96 0.01 47.8 0.6 FREE O-00000000000000
26 7.680 0.090 0.090 0.060 -7.770 0.070 1.390 0.160 4.520 0.160 -3.260 0.060
V10 8.940 75 283 1.260 2 19 -10.190 15 110 9.560 202 30 93 18 60 88
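One possible approach, sketched here under several assumptions: read all lines, group them into blocks of five, put each block into one dataframe row with a column per line type, and then parse each column with its own rule. The sample below is the single record from the question with its whitespace already collapsed, so the first line is parsed with a whitespace split; on the real fixed-format file, slicing by column position (for example with .str.slice) would be the safer choice, and those positions are not known here. The column names are made up for illustration.

```python
import pandas as pd

# The one record shown in the question (original column spacing is not preserved)
raw_lines = [
    "MLI 1976/01/01 01:29:39.6 -28.61 -177.64 59.0 6.2 0.0 KERMADEC ISLANDS REGION",
    "M010176A B: 0 0 0 S: 0 0 0 M: 12 30 135 CMT: 1 BOXHD: 9.4",
    "CENTROID: 13.8 0.2 -29.25 0.02 -176.96 0.01 47.8 0.6 FREE O-00000000000000",
    "26 7.680 0.090 0.090 0.060 -7.770 0.070 1.390 0.160 4.520 0.160 -3.260 0.060",
    "V10 8.940 75 283 1.260 2 19 -10.190 15 110 9.560 202 30 93 18 60 88",
]

RECORD_LEN = 5

# Group consecutive lines into blocks of five, one block per earthquake
blocks = [raw_lines[i:i + RECORD_LEN] for i in range(0, len(raw_lines), RECORD_LEN)]
wide = pd.DataFrame(blocks, columns=[f"line{n}" for n in range(1, RECORD_LEN + 1)])

# Parse line 1: at most 8 splits, so the multi-word region name stays in one piece
quakes = wide["line1"].str.split(n=8, expand=True)
quakes.columns = ["catalog", "date", "time", "lat", "lon", "depth", "mag1", "mag2", "region"]
quakes = quakes.astype({"lat": float, "lon": float, "depth": float})
print(quakes)
```

The remaining columns (line2 to line5) can be parsed the same way, each with its own rule, and joined back with pd.concat(..., axis=1).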