ASC files not preserving empty columns when loaded into a pandas DataFrame
I have a large number of ASC files to extract data from. Some of the columns have empty cells where there is no data. When I load these files into a DataFrame, pandas shifts the values left to fill the first columns and pads the end of the row with NaN, like this:
a | b   | c
1 | 2   | nan
when I want it to be:
a | b   | c
1 | nan | 2
(I can't figure out how to make a table here to save my life.) Where there is no data, I want the empty cell preserved. Part of my code sets the separator to any run of two or more whitespace characters, so that headers containing a single space stay together. I think this is causing the issue, but I am not sure how to fix it. I've tried using astropy.io to open the files and determine the delimiter, but I get an error saying the number of columns doesn't match the data columns.
Here's an image of the general look of the files, showing the lack of character delimiters and the empty columns.
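import pandas as pd

# Header lines start with one of these prefixes
starting_words = ('Core no.', 'Core No.', 'Core')
data = []
file_paths = []
# filepaths is assumed to be an iterable of pathlib.Path objects defined elsewhere
for file in filepaths:
    with open(file) as f:
        for i, l in enumerate(f):
            if l.startswith(starting_words):
                # Skip the notes above the header; split on runs of 2+ whitespace characters
                df = pd.read_csv(file, sep=r'\s{2,}', engine='python', skiprows=i)
                file_paths.append(file.name)
                df.insert(0, 'Filepath', file)
                data.append(df)
                break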
This is the script I've used to open the files and keep the header words together. I never got the astropy approach to run: I either get the columns-don't-match error or a message that it could not determine the file format. Also, the code uses skiprows because the files all have random notes at the top that I don't want in my DataFrame.
Your data looks well behaved, so you could try pandas' read_fwf, which reads files with fixed-width formatted lines. If the column boundaries that read_fwf infers are not good enough for you, you can describe the extents of the fixed-width fields of each line manually with the colspecs parameter.
Sample
Core no.  Depth     Depth      Perm   Porosity  Saturations Oil
          ft        m          mD     %         %
1         5516.0    1681.277          40.0      1.0
2         5527.0    1684.630          39.0      16.0
3         5566.0    1696.517   508    37.0      4.0
          5571.0    1698.041   105    33.0      8.0
6         5693.0    1735.226          44.0      16.0
          5702.0    1737.970   4320   35.0      31.0
9         5686.0    1733.093   2420   33.0      26.0
df = pd.read_fwf('sample.txt', skiprows=2, header=None)
df.columns=['Core no.', 'Depth ft', 'Depth m' , 'Perm mD', 'Porosity%', 'Saturations Oil%']
print(df)
Output from df
   Core no.  Depth ft   Depth m  Perm mD  Porosity%  Saturations Oil%
0       1.0    5516.0  1681.277      NaN       40.0               1.0
1       2.0    5527.0  1684.630      NaN       39.0              16.0
2       3.0    5566.0  1696.517    508.0       37.0               4.0
3       NaN    5571.0  1698.041    105.0       33.0               8.0
4       6.0    5693.0  1735.226      NaN       44.0              16.0
5       NaN    5702.0  1737.970   4320.0       35.0              31.0
6       9.0    5686.0  1733.093   2420.0       33.0              26.0
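If the inferred boundaries turn out to be wrong, the colspecs route mentioned in the answer could look roughly like the sketch below. The field widths are assumptions chosen for this illustration, not measurements taken from the real ASC files, and the sample text is rebuilt in memory so the snippet runs on its own.

```python
import io

import pandas as pd

names = ["Core no.", "Depth ft", "Depth m", "Perm mD", "Porosity%", "Saturations Oil%"]
# Assumed field widths in characters; measure them in your own files
widths = [10, 10, 11, 7, 10, 16]

# Same values as the sample above; empty strings stand for the blank cells
rows = [
    ("1", "5516.0", "1681.277", "", "40.0", "1.0"),
    ("2", "5527.0", "1684.630", "", "39.0", "16.0"),
    ("3", "5566.0", "1696.517", "508", "37.0", "4.0"),
    ("", "5571.0", "1698.041", "105", "33.0", "8.0"),
    ("6", "5693.0", "1735.226", "", "44.0", "16.0"),
    ("", "5702.0", "1737.970", "4320", "35.0", "31.0"),
    ("9", "5686.0", "1733.093", "2420", "33.0", "26.0"),
]
# Left-justify every cell to its field width to build fixed-width lines
text = "\n".join(
    "".join(f"{cell:<{w}}" for cell, w in zip(row, widths)) for row in rows
)

# Turn the widths into half-open (start, end) character ranges
colspecs, start = [], 0
for w in widths:
    colspecs.append((start, start + w))
    start += w

df = pd.read_fwf(io.StringIO(text), colspecs=colspecs, header=None, names=names)
print(df)
```

Because each field is cut by character position rather than by whitespace, a blank cell stays in its own column and comes back as NaN.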
Related
How to trim down a Pandas data frame rows?
I'm trying to shorten the very large number of rows I get from an XML sitemap, but I can't find a way to trim it down.

import advertools as adv
import pandas as pd

site = "https://www.halfords.com/sitemap_index.xml"
sitemap = adv.sitemap_to_df(site)
sitemap = sitemap.dropna(subset=["loc"]).reset_index(drop=True)

# Some sitemaps keep URLs with "/" on the end, some have no "/"
# If there is "/" on the end, we take the second last column as slugs
# Else, the last column is the slug column
slugs = sitemap['loc'].dropna()[~sitemap['loc'].dropna().str.endswith('/')].str.split('/').str[-2].str.replace('-', ' ')
slugs2 = sitemap['loc'].dropna()[~sitemap['loc'].dropna().str.endswith('/')].str.split('/').str[-1].str.replace('-', ' ')
# Merge two series
slugs = list(slugs) + list(slugs2)
# adv.word_frequency automatically removes the stop words
word_counts_onegram = adv.word_frequency(slugs)
word_counts_twogram = adv.word_frequency(slugs, phrase_len=2)
competitor = pd.concat([word_counts_onegram, word_counts_twogram])\
    .rename({'abs_freq':'Count','word':'Ngram'}, axis=1)\
    .sort_values('Count', ascending=False)
competitor.to_csv('competitor.csv',index=False)
competitor
competitor.shape
(67758, 2)

I've been through several blogs and resources on Stack Overflow, but nothing seemed to work. This is probably down to my lack of coding experience.
Two things:

You can use adv.url_to_df to split URLs and get the slugs (there should be a column called last_dir):

urldf = adv.url_to_df(sitemap['loc'].dropna())
urldf

url scheme netloc path query fragment dir_1 dir_2 dir_3 dir_4 dir_5 dir_6 dir_7 dir_8 dir_9 last_dir
0 https://www.halfords.com/cycling/cycling-technology/helmet-cameras/removu-k1-4k-camera-and-stabiliser-694977.html https www.halfords.com /cycling/cycling-technology/helmet-cameras/removu-k1-4k-camera-and-stabiliser-694977.html nan nan cycling cycling-technology helmet-cameras removu-k1-4k-camera-and-stabiliser-694977.html nan nan nan nan nan removu-k1-4k-camera-and-stabiliser-694977.html
1 https://www.halfords.com/technology/bluetooth-car-kits/jabra-drive-bluetooth-speakerphone---white-695094.html https www.halfords.com /technology/bluetooth-car-kits/jabra-drive-bluetooth-speakerphone---white-695094.html nan nan technology bluetooth-car-kits jabra-drive-bluetooth-speakerphone---white-695094.html nan nan nan nan nan nan jabra-drive-bluetooth-speakerphone---white-695094.html
2 https://www.halfords.com/tools/power-tools-and-accessories/power-tools/stanley-fatmax-v20-18v-combi-drill-kit-695102.html https www.halfords.com /tools/power-tools-and-accessories/power-tools/stanley-fatmax-v20-18v-combi-drill-kit-695102.html nan nan tools power-tools-and-accessories power-tools stanley-fatmax-v20-18v-combi-drill-kit-695102.html nan nan nan nan nan stanley-fatmax-v20-18v-combi-drill-kit-695102.html
3 https://www.halfords.com/technology/dash-cams/mio-mivue-c450-695262.html https www.halfords.com /technology/dash-cams/mio-mivue-c450-695262.html nan nan technology dash-cams mio-mivue-c450-695262.html nan nan nan nan nan nan mio-mivue-c450-695262.html
4 https://www.halfords.com/technology/dash-cams/mio-mivue-818-695270.html https www.halfords.com /technology/dash-cams/mio-mivue-818-695270.html nan nan technology dash-cams mio-mivue-818-695270.html nan nan nan nan nan nan mio-mivue-818-695270.html

pandas provides display options that you can change. For example:

pd.options.display.max_rows
60
# change it to display more/fewer rows:
pd.options.display.max_rows = 100

As you did, you can easily create onegrams and bigrams, combine them, and display them:

text_list = urldf['last_dir'].str.replace('-', ' ').dropna()
one_grams = adv.word_frequency(text_list, phrase_len=1)
bigrams = adv.word_frequency(text_list, phrase_len=2)
print(pd.concat([one_grams, bigrams])
      .sort_values('abs_freq', ascending=False)
      .head(15)  # <-- change this to 100 for example
      .reset_index(drop=True))

        word  abs_freq
0   halfords      2985
1        car      1430
2       bike       922
3        kit       829
4      black       777
5      laser       686
6        set       614
7      wheel       540
8       pack       524
9       mats       511
10  car mats       478
11     thule       453
12     paint       419
13         4       413
14     spray       382

Hope that helps?
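As a small, self-contained illustration of trimming a frequency table, the sketch below keeps only the top few rows with nlargest and changes the display limit temporarily rather than globally. The five rows are copied from the output above; the cut-off of 3 is arbitrary.

```python
import pandas as pd

# A few rows copied from the word-frequency output above, deliberately unsorted
counts = pd.DataFrame({
    "word": ["kit", "halfords", "black", "car", "bike"],
    "abs_freq": [829, 2985, 777, 1430, 922],
})

# Keep only the 3 most frequent entries; same idea as sort_values(...).head(3)
top = counts.nlargest(3, "abs_freq").reset_index(drop=True)

# Raise the display limit only inside this block; the global option is untouched
with pd.option_context("display.max_rows", 100):
    print(top)
```

The same pattern applies to the full table: trim the data with nlargest or head, and touch the display options only if the issue is how many rows get printed.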
How to split a dataframe containing voltage over time value, so that it can store values of each waveform/bit separately
I have several CSV files containing voltage over time. Each file has approximately 7000 rows, and the data looks like this:

Time(us)  Voltage (V)
0         32.96554106
0.5       32.9149649
1         32.90484966
1.5       32.86438874
2         32.8542735
2.5       32.76323642
3         32.74300595
3.5       32.65196886
4         32.58116224
4.5       32.51035562
5         32.42943376
5.5       32.38897283
6         32.31816621
6.5       32.28782051
7         32.26759005
7.5       32.21701389
8         32.19678342
8.5       32.16643773
9         32.14620726
9.5       32.08551587
10        32.04505495
10.5      31.97424832
11        31.92367216
11.5      31.86298077
12        31.80228938
12.5      31.78205891
13        31.73148275
13.5      31.69102183
14        31.68090659
14.5      31.67079136
15        31.64044567
15.5      31.59998474
16        31.53929335
16.5      31.51906288

I read the CSV file into a pandas DataFrame, and after plotting the data from one file with matplotlib, the figure looks like below. I would like to split out every single square waveform/bit and store the corresponding voltage values for each bit separately, so that the voltage values of each bit are stored in one row and look like this: I don't have any idea how to do that. I guess I have to write a function with a threshold, so that if the voltage has been going down for maybe 20 time steps it captures all the values, or if the voltage has been going up for 20 time steps it captures all the values. Could someone help?
If you get the gradient of your Voltage (here using diff as the time is regularly spaced), this gives you the following: You can thus easily use a threshold (I tested with 2) to identify the peak starts. Then pivot your data:

# get threshold of gradient
m = df['Voltage (V)'].diff().gt(2)
# group start = value above threshold preceded by value below threshold
group = (m&~m.shift(fill_value=False)).cumsum().add(1)
df2 = (df
 .assign(id=group,
         t=lambda d: d['Time (us)'].groupby(group).apply(lambda s: s-s.iloc[0])
        )
 .pivot(index='id', columns='t', values='Voltage (V)')
)

output:

t          0.0        0.5        1.0        1.5        2.0        2.5  \
id
1    32.965541  32.914965  32.904850  32.864389  32.854273  32.763236
2    25.045314  27.543777  29.182444  30.588462  31.114454  31.984364
3    25.166697  27.746081  29.415095  30.719960  31.326873  32.125977
4    25.277965  27.877579  29.536477  30.912149  31.367334  32.206899
5    25.379117  27.978732  29.667975  30.780651  31.670791  32.338397
6    25.631998  27.634814  28.959909  30.173737  30.659268  31.053762
7    23.528030  26.137759  27.948386  29.253251  30.244544  30.649153
8    23.639297  26.380525  28.464263  29.971432  30.902034  31.458371
9    23.740449  26.542369  28.707028  30.295120  30.881803  31.862981
10   23.871948  26.673867  28.889103  30.305235  31.185260  31.873096
11   24.387824  26.694097  28.342880  29.678091  30.315350  31.134684
...
t        748.5      749.0
id
1          NaN        NaN
2          NaN        NaN
3          NaN        NaN
4          NaN        NaN
5          NaN        NaN
6    21.059913  21.161065
7          NaN        NaN
8          NaN        NaN
9          NaN        NaN
10         NaN        NaN
11         NaN        NaN

[11 rows x 1499 columns]

plot:

df2.T.plot()
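To try the idea without the original CSV files, here is a minimal sketch on a made-up signal: a short leading tail followed by three identical pulses. It follows the same diff-and-threshold logic as the answer, but computes the per-group time offset with transform("first") instead of groupby.apply; that substitution is a choice made for this sketch. The pulse values and the threshold of 2 are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up signal: a short tail from a previous bit, then three identical pulses
pulse = [25.0, 28.0, 30.0, 31.0, 31.5, 31.0, 29.0, 26.0, 23.0, 21.0]
voltage = [23.0, 21.0] + pulse * 3
df = pd.DataFrame({
    "Time (us)": np.arange(len(voltage)) * 0.5,
    "Voltage (V)": voltage,
})

# True wherever the sample-to-sample rise exceeds the threshold
m = df["Voltage (V)"].diff().gt(2)
# A bit starts where the rise is above threshold but the previous one was not
starts = m & ~m.shift(fill_value=False)
df["id"] = starts.cumsum().add(1)

# Time relative to the first sample of each bit
df["t"] = df["Time (us)"] - df.groupby("id")["Time (us)"].transform("first")

# One row per bit, one column per relative time
wide = df.pivot(index="id", columns="t", values="Voltage (V)")
print(wide)
```

With this input, the first row holds only the two-sample tail and the remaining rows hold one full pulse each, which mirrors the partially filled first row in the answer's output.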
Export multiple dataframe to different csv in python
I have this code:

import pandas as pd
import os

ext = ('.tsv')
for files in os.listdir(os.getcwd()):
    if files.endswith(ext):
        x = pd.read_table(files, sep='\t', usecols=['#Chrom','Pos','RawScore','PHRED'])
        x.drop_duplicates(subset ="Pos",keep = False, inplace = True)
        data_frame=x.head()
        print(data_frame)

       #Chrom        Pos  RawScore  PHRED
77171       6  167709702  7.852318   39.0
19180       6   31124849  7.623789   38.0
15823       6   29407955  6.982213   37.0
19182       6   31125257  6.817868   36.0
19974       6   31544591  6.201438   35.0
       #Chrom        Pos  RawScore  PHRED
52445       9  139634495  6.950686   36.0
46470       9  125391241  5.477094   34.0
49866       9  134385435  4.841222   33.0
48642       9  131475583  4.357986   31.0
40099       9  113233652  4.284035   31.0
       #Chrom        Pos  RawScore  PHRED
7050       13   32972626  6.472542   36.0
32416      13  100518634  5.405765   33.0
10834      13   42465713  4.406294   32.0
9963       13   39422624  4.374808   31.0
22993      13   76395620  4.193058   29.4

As you can imagine, I get multiple DataFrames with the same column names but from different chromosomes. How can I save these DataFrames to different CSV files?
You can save your DataFrames to .csv using pandas.DataFrame.to_csv (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html). More specifically, in your case you can do this:

for files in os.listdir(os.getcwd()):
    if files.endswith(ext):
        x = pd.read_table(files, sep='\t', usecols=['#Chrom','Pos','RawScore','PHRED'])
        x.drop_duplicates(subset ="Pos",keep = False, inplace = True)
        x.to_csv(f'Chrom{x.iloc[0,0]}.csv')

Here, x.iloc[0,0] takes the first element of the first column, which is the #Chrom value. You can also set the name manually. Note that this method will not work if two different DataFrames share the same #Chrom value, because the second file would overwrite the first. In that case, you have to set the name of each CSV file manually.
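One way around the name clash mentioned above is to name each output after its input file rather than after the chromosome. The sketch below assumes that approach; the helper export_per_file, the directory layout, and the two tiny demo files are all hypothetical and exist only so the snippet can run on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd

COLUMNS = ["#Chrom", "Pos", "RawScore", "PHRED"]


def export_per_file(src_dir, out_dir):
    """Write one CSV per input .tsv, named after the input file (hypothetical helper)."""
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for tsv in sorted(src_dir.glob("*.tsv")):
        x = pd.read_csv(tsv, sep="\t", usecols=COLUMNS)
        x = x.drop_duplicates(subset="Pos", keep=False)
        target = out_dir / f"{tsv.stem}.csv"  # sample_a.tsv -> sample_a.csv
        x.to_csv(target, index=False)
        written.append(target.name)
    return written


# Demo with two throwaway files that share the same chromosome
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "in"
    src.mkdir()
    header = "\t".join(COLUMNS)
    (src / "sample_a.tsv").write_text(header + "\n6\t167709702\t7.852318\t39.0\n")
    (src / "sample_b.tsv").write_text(header + "\n6\t31124849\t7.623789\t38.0\n")
    written = export_per_file(src, Path(tmp) / "out")
    n_rows = len(pd.read_csv(Path(tmp) / "out" / "sample_a.csv"))

print(written)
```

Since input file names are unique within a directory, two files for the same chromosome no longer overwrite each other.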
Row wise calculations(Python)
I'm trying to run the following code to create a new column 'Median Rank':

N=data2.Rank.count()
for i in data2.Rank:
    data2['Median_Rank']=i-0.3/(N+0.4)

But I'm getting a constant value of 0.99802, even though my Rank column is as follows:

data2.Rank.head()
Out[464]:
4131     1.0
4173     3.0
4172     3.0
4132     3.0
5335    10.0
4171    10.0
4159    10.0
5079    10.0
4115    10.0
4179    10.0
4180    10.0
4147    10.0
4181    10.0
4175    10.0
4170    10.0
4116    24.0
4129    24.0
4156    24.0
4153    24.0
4160    24.0
5358    24.0
4152    24.0

Could somebody please point out the errors in my code?
Your code isn't vectorised. Use this:

N = data2.Rank.count()
data2['Median_Rank'] = data2['Rank'] - 0.3 / (N+0.4)

Your code does not work because you assign the entire column on every pass of the loop. Only the last iteration sticks, so the values in data2['Median_Rank'] are guaranteed to be identical.
This occurs because every time you run data2['Median_Rank']=i-0.3/(N+0.4) you overwrite the entire column with the single value calculated by the expression. The easiest way to do this doesn't actually need a loop:

N=data2.Rank.count()
data2['Median_Rank'] = data2.Rank-0.3/(N+0.4)

This is possible because pandas supports element-wise operations on Series. If you still want to use a for loop, you will need to use .at and iterate by rows as follows:

for i, el in zip(data2.index, data2.Rank.values):
    data2.at[i,'Median_Rank']=el-0.3/(N+0.4)
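The difference between the two approaches is easy to see on a tiny frame. The sketch below uses five made-up ranks; the extra Benard column is an assumption about intent, added because the usual median-rank approximation is (i - 0.3) / (N + 0.4), which needs parentheses that the expression in the question lacks.

```python
import pandas as pd

# Five made-up ranks, purely for illustration
data2 = pd.DataFrame({"Rank": [1.0, 3.0, 3.0, 10.0, 24.0]})
N = data2["Rank"].count()

# Loop from the question: each pass overwrites the whole column with one scalar
looped = data2.copy()
for i in looped["Rank"]:
    looped["Median_Rank"] = i - 0.3 / (N + 0.4)

# Vectorised version from the answers: one value per row
vectorised = data2.copy()
vectorised["Median_Rank"] = vectorised["Rank"] - 0.3 / (N + 0.4)

# Assumption: if Benard's approximation was intended, parenthesise the numerator
vectorised["Benard"] = (vectorised["Rank"] - 0.3) / (N + 0.4)

print(looped["Median_Rank"].nunique(), vectorised["Median_Rank"].nunique())
```

The looped column ends up with a single distinct value (the one computed from the last rank), while the vectorised column varies with Rank.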
Python: Imported csv not being split into proper columns
I am importing a CSV file into Python using pandas, but the DataFrame ends up with only one column. I copied and pasted the data in comma-separated format from The Player Standing Field table at this link (second one) into an Excel file and saved it as a CSV (originally as MS-DOS, then both as normal and UTF-8, per the recommendation by AllthingsGo42). But it only returned a single-column DataFrame.

Examples of what I tried:

dataset=pd.read_csv('MLB2016PlayerStats2.csv')
dataset=pd.read_csv('MLB2016PlayerStats2.csv', delimiter=',')
dataset=pd.read_csv('MLB2016PlayerStats2.csv',encoding='ISO-8859-9', delimiter=',')

Each line of code above returned:

  Rk,Name,Age,Tm,Lg,G,GS,CG,Inn,Ch,PO,A,E,DP,Fld%,Rtot,Rtot/yr,Rdrs,Rdrs/yr,RF/9,RF/G,Pos Summary
1,Fernando Abad\abadfe01,30,TOT,AL,57,0,0,46.2...
2,Jose Abreu\abreujo02,29,CHW,AL,152,152,150,1...
3,A.J. Achter\achteaj01,27,LAA,AL,27,0,0,37.2,...
4,Dustin Ackley\ackledu01,28,NYY,AL,23,16,10,1...
5,Cristhian Adames\adamecr01,24,COL,NL,69,43,3...

Also tried:

dataset=pd.read_csv('MLB2016PlayerStats2.csv',encoding='ISO-8859-9', delimiter=',',quoting=3)

Which returned:

  "Rk                        Name  Age   Tm  Lg    G   GS   CG     Inn    Ch  \
0  "1      Fernando Abad\abadfe01   30  TOT  AL   57    0    0    46.2     4
1  "2        Jose Abreu\abreujo02   29  CHW  AL  152  152  150  1355.2  1337
2  "3       A.J. Achter\achteaj01   27  LAA  AL   27    0    0    37.2     6
3  "4     Dustin Ackley\ackledu01   28  NYY  AL   23   16   10   140.1    97
4  "5  Cristhian Adames\adamecr01   24  COL  NL   69   43   38   415.0   212

          E   DP   Fld%  Rtot  Rtot/yr  Rdrs  Rdrs/yr  RF/9  RF/G  \
0  ...    0    1  1.000   NaN      NaN   NaN      NaN  0.77  0.07
1  ...   10  131  0.993  -2.0     -2.0  -5.0     -4.0  8.81  8.73
2  ...    0    0  1.000   NaN      NaN   0.0      0.0  1.43  0.22
3  ...    0    8  1.000   1.0      9.0   3.0     27.0  6.22  4.22
4  ...    6   24  0.972  -4.0    -12.0   1.0      3.0  4.47  2.99

  Pos Summary"
0           P"
1          1B"
2           P"
3    1B-OF-2B"
4    SS-2B-3B"

Below is what the data looks like in Notepad++:

"Rk,Name,Age,Tm,Lg,G,GS,CG,Inn,Ch,PO,A,E,DP,Fld%,Rtot,Rtot/yr,Rdrs,Rdrs/yr,RF/9,RF/G,Pos Summary"
"1,Fernando Abad\abadfe01,30,TOT,AL,57,0,0,46.2,4,0,4,0,1,1.000,,,,,0.77,0.07,P"
"2,Jose Abreu\abreujo02,29,CHW,AL,152,152,150,1355.2,1337,1243,84,10,131,.993,-2,-2,-5,-4,8.81,8.73,1B"
"3,A.J. Achter\achteaj01,27,LAA,AL,27,0,0,37.2,6,2,4,0,0,1.000,,,0,0,1.43,0.22,P"
"4,Dustin Ackley\ackledu01,28,NYY,AL,23,16,10,140.1,97,89,8,0,8,1.000,1,9,3,27,6.22,4.22,1B-OF-2B"
"5,Cristhian Adames\adamecr01,24,COL,NL,69,43,38,415.0,212,68,138,6,24,.972,-4,-12,1,3,4.47,2.99,SS-2B-3B"
"6,Austin Adams\adamsau01,29,CLE,AL,19,0,0,18.1,1,0,0,1,0,.000,,,0,0,0.00,0.00,P"

Sorry for the confusion with my question before. I hope this edit will clear things up. Thank you to those who have answered so far.
Running it quickly myself, I was able to get what I understand to be the desired output. My only thought is that there is no need to specify a delimiter for a CSV, because a CSV is a comma-separated values file, but that should not matter. I think there is something wrong with your actual data file, and I would make sure it is saved correctly. I would echo the previous comments and make sure that the CSV is saved as UTF-8, and not as MS-DOS or Macintosh (both are options when saving in Excel). Best of luck!
There is no need to specify a delimiter for a CSV. You only have to change the separator from ";" to ",". To do this, you can open your CSV file in Notepad and change the separators with the replace tool.
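Another possible reading of the Notepad++ dump in the question is that every line is wrapped in one pair of double quotes, so the parser sees a single quoted field per row. Assuming that is the cause, a minimal sketch is to strip the outer quotes before handing the text to pandas. The three-line sample is shortened from the question's data, and this naive stripping would not be safe if individual fields contained quotes of their own.

```python
import io

import pandas as pd

# Shortened sample in the same shape as the Notepad++ dump: each line is one quoted string
raw = (
    '"Rk,Name,Age,Tm"\n'
    '"1,Fernando Abad\\abadfe01,30,TOT"\n'
    '"2,Jose Abreu\\abreujo02,29,CHW"\n'
)

# Remove the surrounding whitespace and the outer pair of quotes from every line
cleaned = "\n".join(line.strip().strip('"') for line in raw.splitlines())

dataset = pd.read_csv(io.StringIO(cleaned))
print(dataset)
```

For a real file, the same cleaning step could be applied to the lines read from disk before passing them to read_csv.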