I'm reading in an Excel-exported .csv file using pandas.read_csv(). I want to read in 2 separate column ranges of the spreadsheet, e.g. columns A:D AND H:J, to appear in the final DataFrame. I know I can do it once the file has been loaded in using indexing, but can I specify 2 ranges of columns to load in?
I've tried something like this:
usecols=[0:3,7:9]
I know I could list each column number individually, e.g.
usecols=[0,1,2,3,7,8,9]
but I have simplified the file in question; in my real file I have a large number of columns, so I need to be able to select 2 large ranges to read in...
I'm not sure if there's an official, pretty, pandaic way to do it with pandas, but you can do it this way:
# say you want to extract 2 ranges of columns
# columns 5 to 14
# and columns 30 to 66
import pandas as pd
range1 = list(range(5, 15))
range2 = list(range(30, 67))
usecols = range1 + range2
file_name = 'path/to/csv/file.csv'
df = pd.read_csv(file_name, usecols=usecols)
As @jezrael notes, you can use numpy.r_ to do this in a more Pythonic and legible way:
import pandas as pd
import numpy as np
file_name = 'path/to/csv/file.csv'
df = pd.read_csv(file_name, usecols=np.r_[0:4, 7:10])
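Note that np.r_ slices are half-open like Python ranges, so np.r_[0:4, 7:10] evaluates to array([0, 1, 2, 3, 7, 8, 9]), matching the explicit list above.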
Gotchas
Watch out when using this in combination with names: make sure you have allowed for the extra column pandas adds for the index. I.e. for csv data columns 1, 2, 3 (3 items), np.r_ needs to be 0:4 (4 items, since np.r_[0:4] is array([0, 1, 2, 3]), with one extra slot for the index column).
Related
So I am trying to create a new dataframe that includes some data from 300+ csv files.
Each file contains up to 200,000 rows of data, and I am only interested in 1 of the columns within each file (the same column for each file).
I am trying to combine these columns into 1 dataframe, where column 6 from csv 1 would be in the 1st column of the new dataframe, column 6 from csv 2 would be in the 2nd column of the new dataframe, and so on up until the 315th csv file.
I don't need all 200,000 rows of data to be extracted, but I am unsure of how I would extract just 2,000 rows from the middle section of the data (each file varies in its number of rows, so the exact same rows for each file aren't necessary, as long as it is the middle 2,000).
Any help in extracting the 2000 rows from each file to populate different columns in the new dataframe would be greatly appreciated.
So far, I have manipulated the data to only contain the relevant column for each file. This displays all the rows of data in the column for each file individually.
I tried to use the iloc function to reduce this to 2000 rows but it did not display any actual data in the output.
I am unsure how I would now extract this data into one dataframe containing all the columns.
import pandas as pd
import os
import glob

# glob to get all csv files
path = os.getcwd()
csv_files = glob.glob(os.path.join('filepath/', "*.csv"))

# loop over the list of csv files
for f in csv_files:
    df = pd.read_csv(f, header=None)
    df.rename(columns={6: 'AE'}, inplace=True)
    new_df = df.filter(['AE'])
    print('Location:', f)
    print('File Name:', f.split("\\")[-1])
    print('Content:')
    display(new_df)
    print()
Based on your description, I am inferring that you have a number of different files in csv format, each of which has at least 2000 lines and 6 columns. You want to take the data only from the 6th column of each file and only for the middle 2000 records in each file and to put all of those blocks of 2000 records into a new dataframe, with a column that in some way identifies which file the block came from.
You can read each file using pandas, as you have done, and then you need to use loc, as one of the commenters said, to select the 2000 records you want to keep. If you save each of those blocks of records in a separate dataframe you can then use the pandas concat method to join them all together into different columns of a new dataframe.
Here is some code that I hope will be self-explanatory. I have assumed that you want the 6th column, which is the one with index 5 in pandas because we start counting from 0. I have also used usecols to keep only the 6th column, and I rename the column to an index number based on the order in which the files are being read. You would need to change this for your own choice of column naming.
I choose the middle 2000 records by defining the starting point as record x, say, so that x + 2000 + x = total number of records, therefore x=(total number of records) / 2 - 1000. This might not be exactly how you want to define the middle 2000 records, so you could change this.
df_middles is a list to which we append every new dataframe of the new file's middle 2000 records. We use pd.concat at the end to put all the columns into a new dataframe.
import os
import glob
import pandas as pd

# glob to get all csv files
path = os.getcwd()
csv_files = glob.glob(os.path.join("filepath/", "*.csv"))

df_middles = []

# loop over the list of csv files
for idx, f in enumerate(csv_files, 1):
    # only keep the 6th column (index 5)
    df = pd.read_csv(f, header=None, usecols=[5])
    colname = f"column_{idx}"
    df.rename(columns={5: colname}, inplace=True)

    number_of_lines = len(df)
    if number_of_lines < 2000:
        raise IOError(f"Not enough lines in the input file: {f}")

    middle_range_start = int(number_of_lines / 2) - 1000
    middle_range_end = middle_range_start + 1999  # .loc slicing is inclusive
    df_middle = df.loc[middle_range_start:middle_range_end].reset_index(drop=True)
    df_middles.append(df_middle)

df_final = pd.concat(df_middles, axis="columns")
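Note that reset_index(drop=True) is what lets the 2000-row blocks line up side by side: pd.concat aligns on the index, so without the reset each block would keep its original row labels and the concatenation would be full of NaNs.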
I have a dataset consisting of 1 column. Each row in this column contains 2 values. How can I separate the values into a second column? I want my dataset to be 2 columns. Thank you.
If your file is tab delimited, just use:
import pandas as pd
df = pd.read_csv(file_path, sep='\t')
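If you're not certain the file really is tab delimited, passing sep=None together with engine='python' makes pandas sniff the delimiter from the data (a csv.Sniffer-based guess, so treat it as a convenience rather than a guarantee).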
Try this:
Directly read the file using read_csv and define these parameters to specify '\t' as the delimiter, and set error_bad_lines=False to skip malformed lines.
df = pd.read_csv('SINGGALANG.tsv', sep='\t', header=None, names=["Tokens","Tags"], error_bad_lines=False, engine="python")
Alternatively, if you want to read all the data into 1 column and then split it into different columns, you can do the following:
Assuming you have a dataframe like this:
df = pd.DataFrame(['val1 0','val2 0','val3 0'])
You can use split and create columns.
df[["col1","col2"]] = df[0].str.split(expand=True)
I have a CSV file, and I want to delete some rows of it based on the values of one of the columns. I do not know the code to delete specific rows of a CSV file once it is loaded as a pandas.core.frame.DataFrame.
I read related questions, and I found that people suggest writing every acceptable line to a new file. I do not want to do that. What I want is either:
1) to delete the rows whose index (row number) I know,
or
2) to make a new CSV in Python's memory (not to write it out and read it again).
Here's an example of what you can do with pandas. If you need more detail, you might find Indexing and Selecting Data a helpful resource.
import pandas as pd
from io import StringIO
mystr = StringIO("""speed,time,date
12,22:05,1
15,22:10,1
13,22:15,1""")
# replace mystr with 'file.csv'
df = pd.read_csv(mystr)
# convert time column to timedelta, assuming mm:ss
df['time'] = pd.to_timedelta('00:'+df['time'])
# filter for >= 22:10, i.e. second item
df = df[df['time'] >= df['time'].loc[1]]
print(df)
speed time date
1 15 00:22:10 1
2 13 00:22:15 1
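For option 1 in the question, deleting rows whose row numbers you already know, DataFrame.drop works entirely in memory, with no need to write out a new file. A minimal sketch (the positions [0, 2] are hypothetical):
# df.index[[0, 2]] converts row positions 0 and 2 into index labels,
# which drop then removes from the in-memory frame
df = df.drop(df.index[[0, 2]])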
So I've currently got a dataset that has a column called 'logid' which consists of 4-digit numbers. I have about 200k rows in my csv files and I would like to count each unique logid and output something like this:
Logid | #ofoccurrences for each unique id. So it might be 1000 | 10, meaning that the logid 1000 is seen 10 times in the csv file column 'logid'. The separator | is not necessary, just making it easier for you guys to read. This is my code currently:
import pandas as pd
import os, sys
import glob
count = 0
path = "C:\\Users\\cam19\\Desktop\\New folder\\*.csv"
for fname in glob.glob(path):
    df = pd.read_csv(fname, dtype=None, names=['my_data'], low_memory=False)
    counts = df['my_data'].value_counts()
counts
Using this I get a weird output that I don't quite understand:
4 16463
10013 490
pserverno 1
Name: my_data, dtype: int64
I know I am doing something wrong in the last line
counts = df['my_data'].value_counts()
but I am not too sure what. For reference, the values I am extracting are from column C in the excel file (so I guess that's column 3?). Thanks in advance!
OK, from my understanding, I think the csv file may be like this:
row1,row1,row1
row2,row2,row2
row3,row3,row3
logid,header1,header2
1000,a,b
1001,c,d
1000,e,f
1001,g,h
And I have handled this format of csv file like so:
# skipping the first three rows
df = pd.read_csv("file_name.csv", skiprows=3)
print(df['logid'].value_counts())
And the output looks like this:
1001 2
1000 2
Hope this will help.
update 1
df = pd.read_csv(fname, dtype=None, names=['my_data'], low_memory=False)
In this line the parameter names=['my_data'] creates a new header for the data frame. As your csv file has its own header row, you can skip this parameter. And since you want row 3 as the main header, you can skip the first three rows. One last thing: you are reading every csv file in the given path, so make sure all of the csv files have the same format. Happy coding.
I think you need to create one big DataFrame by appending all the DataFrames to a list and then concatenating them first:
dfs = []
path = "C:\\Users\\cam19\\Desktop\\New folder\\*.csv"
for fname in glob.glob(path):
    df = pd.read_csv(fname, dtype=None, usecols=['logid'], low_memory=False)
    dfs.append(df)
df = pd.concat(dfs)
Then use value_counts; the output is a Series, so for a 2-column DataFrame you need rename_axis with reset_index:
counts = df['logid'].value_counts().rename_axis('logid').reset_index(name='count')
counts
Or use groupby and aggregate size:
counts = df.groupby('logid').size().reset_index(name='count')
counts
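With data like the sample file shown earlier, both variants give a two-column frame along these lines (the counts are illustrative):
   logid  count
0   1000      2
1   1001      2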
You may try this:
counts = df['logid'].value_counts()
Now counts should give you the count of each value.
I am trying to import a file from xlsx into a Python Pandas dataframe. I would like to prevent fields/columns being interpreted as integers and thus losing leading zeros or other desired heterogenous formatting.
So for an Excel sheet with 100 columns, I would do the following using a dict comprehension with range(100).
import pandas as pd
filename = r'C:\DemoFile.xlsx'
fields = {col: str for col in range(100)}
df = pd.read_excel(filename, sheet_name=0, converters=fields)
These import files have a varying number of columns, and I am looking to handle this without changing the range manually every time.
Does somebody have any further suggestions or alternatives for reading Excel files into a dataframe and treating all fields as strings by default?
Many thanks!
Try this:
xl = pd.ExcelFile(r'C:\DemoFile.xlsx')
ncols = xl.book.sheet_by_index(0).ncols
df = xl.parse(0, converters={i : str for i in range(ncols)})
UPDATE:
In [261]: type(xl)
Out[261]: pandas.io.excel.ExcelFile
In [262]: type(xl.book)
Out[262]: xlrd.book.Book
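Note that xl.book being an xlrd Book assumes the legacy xlrd engine; in newer pandas, .xlsx files are read through openpyxl, where xl.book is an openpyxl Workbook and the ncols lookup above would need that library's API instead.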
Use dtype=str when calling .read_excel()
import pandas as pd
filename = r'C:\DemoFile.xlsx'
df = pd.read_excel(filename, dtype=str)
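This forces every cell to come through as text, so a value like 00123 keeps its leading zeros. One caveat: if Excel itself already stored the cell as the number 123, the zeros are gone before pandas ever sees the file.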
The usual solution is (a sketch follows the list):
1) read in one row of data just to get the column names and number of columns,
2) create the dictionary automatically where each column has a string type,
3) re-read the full data using the dictionary created at step 2.
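A minimal sketch of those three steps, reusing the hypothetical file name from the question:
import pandas as pd

filename = r'C:\DemoFile.xlsx'

# step 1: read a single row just to discover how many columns there are
header = pd.read_excel(filename, sheet_name=0, nrows=1)

# step 2: build the converters dict automatically, one str entry per column
fields = {col: str for col in range(len(header.columns))}

# step 3: re-read the full sheet with every column forced to string
df = pd.read_excel(filename, sheet_name=0, converters=fields)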