In the code below, I read CSV files from one folder and write them out to another. Each of these CSVs contains the two columns that were selected when the DataFrame was defined. For column f I need to count how many times its value was above 50.025 and write that count into some column.
CODE:
import pandas as pd
import numpy as np
import glob
import os
all_files = glob.glob("C:/Users/Gamer/Documents/Colbun/Saturn/*.csv")
file_list = []
for i,f in enumerate(all_files):
    df = pd.read_csv(f, header=0, usecols=["t","f"])
    df.apply(lambda x: x['f'] > 50.025, axis=1)
    df.to_csv(f'C:/Users/Gamer/Documents/Colbun/Saturn2/{os.path.basename(f).split(".")[0]}_ext.csv')
It's not logical to store it in some column, since it's a summary of the entire table and not specific to any row.
df = pd.read_csv(f,header=0,usecols=["t","f"])
how_many_times = len(df[df['f'] > 50.025])
# you may store it in some column, but it still doesn't make much sense
df['newcol'] = how_many_times
To output the count of occurrences that satisfy a particular filter and add it to every row in another column, you can simply do the following:
df['cnt'] = df[df['f'] > 50.025]['f'].count()
If you need to use that count to then perform a calculation, it would be better to store it in a variable and then perform the calculation using said variable, rather than storing it in your dataframe as an entire column.
In addition, I can see from the comments on your question that you also want to remove the index when outputting to CSV; to do that, add index=False to the df.to_csv() call.
Your code should look something like this:
import pandas as pd
import numpy as np
import glob
import os
all_files = glob.glob("C:/Users/Gamer/Documents/Colbun/Saturn/*.csv")
file_list = []
for i,f in enumerate(all_files):
    df = pd.read_csv(f, header=0, usecols=["t","f"])
    df['cnt'] = df[df['f'] > 50.025]['f'].count()
    df.to_csv(f'C:/Users/Gamer/Documents/Colbun/Saturn2/{os.path.basename(f).split(".")[0]}_ext.csv', index=False)
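If you prefer the variable approach mentioned above instead of adding a whole column, a minimal sketch could look like this (the file name and the fraction calculation are purely illustrative, not part of the original question):
import pandas as pd

df = pd.read_csv("some_file.csv", header=0, usecols=["t","f"])
count_above = (df['f'] > 50.025).sum()        # keep the count in a plain variable
fraction_above = count_above / len(df)        # use it in a follow-up calculation
print(f"{count_above} rows ({fraction_above:.1%}) are above 50.025")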
I want to read a CSV using Pandas, but only certain columns and only rows with specific values. For example, I have a CSV of people and their heights, and I want to read the "name" and "height" columns only for people whose height is > 160cm. I want to do this in the read_csv() step itself, not after loading it.
import pandas as pd
cols = ['name','height']
df = pd.read_csv("people_and_heights.csv", usecols=cols)
So I want to add a condition to read only rows with certain values, or rows that don't have nulls, for example.
How about this?:
import pandas as pd
from io import StringIO
with open("people_and_heights.csv") as file:
colNames = "\"col1\",\"name\",\"col3\",\"height\""
filteredCsv = "\n".join([colNames,"".join([line for index,line in enumerate(file) if index != 0 and int(line.split(',')[3]) >= 165])])
df = pd.read_csv(StringIO(filteredCsv),usecols=["name","height"])
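read_csv itself has no row-value filter, so if the file is too large to load whole, another option (a sketch, assuming the height column holds plain numbers) is to read it in chunks and keep only the matching rows as you go:
import pandas as pd

cols = ['name', 'height']
chunks = pd.read_csv("people_and_heights.csv", usecols=cols, chunksize=10_000)
# filter each chunk as it is read, then stitch the survivors together
df = pd.concat(chunk[chunk['height'] > 160] for chunk in chunks)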
I have 7 csv files that each contain the same number of columns and rows. I am trying to merge the data from these into one csv where each cell is the average of the 7 corresponding cells (e.g. new-csv(C3) = average of the input csvs' C3).
Here is an example of what the inputs look like. The output should look identical (6 columns x 15 rows) except the values will be averaged in each cell.
So far I have this code to load the csv files, and I am reading about turning them into a matrix, but I don't see anything about merging and averaging by each cell, only by row or column.
listdrs = os.listdir(dir_path)
listdrs_path = [ dir_path + x for x in listdrs]
failed_list = []
csv_matrix = []
for file_path in listdrs_path:
    tickercsv = file_path.replace(string, '')
    ticker = tickercsv.replace('.csv', '')
    data = pd.read_csv(file_path, index_col=0)
    csv_matrix.append(data)
If you run this in the directory with all of your csv files, you can use glob to find them all, then create a generator of DataFrames using pd.read_csv(), with the optional parameter header=None depending on whether or not you have column names. Then you can concat them, group by the index, and take the mean.
import pandas as pd
import glob
files = glob.glob('*.csv')
dfs = (pd.read_csv(f, header=None) for f in files)
pd.concat(dfs).groupby(level=0).mean()
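If you also want the averaged result written back to a new CSV, as the question describes, you could capture the result of that last line and save it (the output file name is just an example):
averaged = pd.concat(dfs).groupby(level=0).mean()
averaged.to_csv('averaged.csv', header=False, index=False)   # match the headerless input files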
I would like to create scalable code to import multiple CSV files, standardize the order of the columns based on the column names, and re-write the CSV files.
import glob
import pandas as pd
# Get a list of all the csv files
csv_files = glob.glob('*.csv')
# List comprehension that loads of all the files
dfs = [pd.read_csv(x,delimiter=";") for x in csv_files]
A=pd.DataFrame(dfs[0])
B=pd.DataFrame(dfs[1])
alpha=A.columns.values.tolist()
print([pd.DataFrame(x[alpha]) for x in dfs])
I would like to be able to split this object and write a CSV for each of the files, renaming them with the original names. Is that easily possible with Python? Thanks for your help.
If you want to reorder columns into a consistent order, assuming that all the csv's have the same column names but in a different order, you can sort one file's column-name list and then order the other files by that list. Using your example:
csv_files = glob.glob('*.csv')
sorted_columns = []
for e, x in enumerate(csv_files):
    df = pd.read_csv(x, delimiter=";")
    if e == 0:
        sorted_columns = sorted(df.columns.values.tolist())
    df[sorted_columns].to_csv(x, sep=";")
I'm trying to figure out a way to select only the rows that satisfy my regular expression via pandas. My actual dataset, data.csv, has one column (the heading is not labeled) and millions of rows. The first four rows look like:
5;4Z13H;;L
5;346;4567;;O
5;342;4563;;P
5;3LPH14;4567;;O
and I wrote the following regular expression
([1-9][A-Z](.*?);|[A-Z][A-Z](.*?);|[A-Z][1-9](.*?);)
which would identify 4Z13H; from row 1 and 3LPH14; from row 4. Basically I would like pandas to filter my data and select rows 1 and 4.
So my desired output would be
5;4Z13H;;L
5;3LPH14;4567;;O
I would then like to save the subset of filtered rows into a new csv, filteredData.csv. So far I only have this:
import pandas as pd
import numpy as np
import sys
import re
sys.stdout=open("filteredData.csv","w")
def Process(filename, chunksize):
    for chunk in pd.read_csv(filename, chunksize=chunksize):
        df[0] = df[0].re.compile(r"([1-9][A-Z]|[A-Z][A-Z]|[A-Z][1-9])(.*?);")
    sys.stdout.close()

if __name__ == "__main__":
    Process('data.csv', 10 ** 4)
I'm still relatively new to Python, so the code above has some syntax issues (I'm still trying to figure out how to use pandas' chunksize). However, the main issue is filtering the rows by the regular expression. I'd greatly appreciate anyone's advice.
One way is to read the csv as a pandas dataframe and then use str.contains to create a mask column:
df['mask'] = df[0].str.contains(r'\d+[A-Z]+\d+')   # 0 is the column name
df = df[df['mask'] == True].drop('mask', axis=1)
You get the desired dataframe. If you wish, you can reset the index using df = df.reset_index()
0
0 5;4Z13H;;L
3 5;3LPH14;4567;;O
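For completeness, the DataFrame above assumes data.csv was read as a single unnamed column, which (since the sample lines contain no commas) could be done like this:
import pandas as pd
df = pd.read_csv('data.csv', header=None)   # each semicolon-separated line lands in column 0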
The second way is to first read the csv and create an edited file with only the filtered rows, and then read that filtered csv to create the dataframe:
import csv
import re
import pandas as pd

with open('data.csv', 'r') as f_in:                                # read the original data file
    with open('filteredData_edit.csv', 'w', newline='') as f_outfile:
        f_out = csv.writer(f_outfile)
        for line in f_in:
            line = line.strip()
            row = []
            if bool(re.search(r"\d+[A-Z]+\d+", line)):
                row.append(line)
                f_out.writerow(row)

df = pd.read_csv('filteredData_edit.csv', header=None)
You get
0
0 5;4Z13H;;L
1 5;3LPH14;4567;;O
From my experience, I would prefer the second method as it would be more efficient to filter out the undesired rows before creating the dataframe.
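Since the question also asks about processing the file with chunksize, here is a rough sketch (not from the original answers) of how the str.contains filter could be applied chunk by chunk, writing the matches straight to filteredData.csv:
import pandas as pd

def process(filename, chunksize):
    first = True
    for chunk in pd.read_csv(filename, header=None, chunksize=chunksize):
        matches = chunk[chunk[0].str.contains(r'\d+[A-Z]+\d+')]   # keep digit-letter-digit rows
        matches.to_csv('filteredData.csv', mode='w' if first else 'a',
                       header=False, index=False)
        first = False

if __name__ == "__main__":
    process('data.csv', 10 ** 4)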
Hello, I have iterated through multiple CSV files and it worked, but the column names in all the CSV files are in this order:
Expected output: id title content tags
However my code outputs the columns in this order:
Actual output: content id tags title
How do I get it back into the order that all the CSV files use?
Here is my code below:
import glob
import os
import pandas as pd
pd.set_option("display.max_rows", 999)
pd.set_option('max_colwidth',100)
import numpy as np
from IPython.display import display
%matplotlib inline
file_path = 'data/'
all_files = glob.glob(os.path.join(file_path, "*.csv"))
merging_csv_files = (pd.read_csv(f) for f in all_files)
stack_exchange_data = pd.concat(merging_csv_files, ignore_index=True)
print ("Data loaded succesfully!")
print ("Stack Exchane Data has {} rows with {} columns each.".format(*stack_exchange_data.shape))
The general way to select a DataFrame's columns in a specific order is simply to create a list of the order you desire and then pass that list to the DataFrame's bracket operator, like this:
my_col_order = ['id', 'title', 'content', 'tags']
df[my_col_order]
Also you might want to check that all the DataFrames indeed have the same column order. I don't believe Pandas will sort the column names in concat unless there is at least one DataFrame that has a different column ordering. You might want to print out all the column names from all the DataFrames you are concatenating.
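Applied to the code in the question, that could look something like this (a sketch, assuming id, title, content and tags are the only columns in the files):
import glob
import os
import pandas as pd

file_path = 'data/'
all_files = glob.glob(os.path.join(file_path, "*.csv"))

merging_csv_files = (pd.read_csv(f) for f in all_files)
stack_exchange_data = pd.concat(merging_csv_files, ignore_index=True)

# restore the column order used in the source files
my_col_order = ['id', 'title', 'content', 'tags']
stack_exchange_data = stack_exchange_data[my_col_order]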