How to create an array of several .txt files - python

I am trying to create some plots from some data in my research lab. Each data file is saved into tab-delimited text files.
My goal is to write a script that can read the text file, add the columns within each file to an array, and then eventually slice through the array at different points to create my plots.
My problem is that I'm struggling with just starting the script. Rather than hardcode each txt file to be added to the same array, is there a way to loop over each file in my directory to add the necessary files to the array, then slice through them?
I apologize if my question is not clear; I am new to Python and it is quite a steep learning curve for me. I can try to clear up any confusion if what I am asking doesn't make sense.
I am also using Canopy to write my script if this matters.

You could do something like:
from csv import DictReader  # CSV reader can be used as TSV reader
from glob import iglob

readers = []
for path in iglob("*.txt"):
    reader = DictReader(open(path), delimiter='\t')
    readers.append(reader)
glob.iglob("*.txt") returns an iterator over all files with extension .txt in the current working directory.
csv.DictReader reads a CSV file as an iterator of dicts; a tab-delimited text file is the same thing with a different delimiter.
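Since the end goal is slicing and plotting, numpy may be more convenient than csv. A minimal sketch, assuming every file is purely numeric, tab-delimited, has a one-line header, and has the same shape (all assumptions worth checking against your real data):

import numpy as np
from glob import iglob

# Hypothetical sketch: load every .txt file's columns into one 3-D array.
arrays = [np.loadtxt(path, delimiter='\t', skiprows=1) for path in iglob("*.txt")]
data = np.stack(arrays)             # shape: (n_files, n_rows, n_cols)
first_col_per_file = data[:, :, 0]  # slice out column 0 of every file

np.stack only works if every file has identical dimensions; if they differ, keep the per-file arrays in the list and slice them individually.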

Related

funny behaviour when editing a csv file in excel and then doing some data filtering in pandas

I was wondering why I get funny behaviour using a csv file that has been "changed" in excel.
I have a csv file of around 211,029 rows, which I pass into pandas in a Jupyter notebook.
The simplest example of a change I can give is simply clicking the filter icon in Excel, saving the file, then unclicking the filter icon and saving again (making no actual changes to the data).
When I pass my csv file through pandas, after a few filter operations, some rows go missing.
This is in comparison to that of doing absolutely nothing with the csv file. Leaving the csv file completely alone gives me the correct number of rows I need after filtering compared to "making changes" to the csv file.
Why is this? Is it because of the number of rows in a csv file? Are we supposed to leave csv files untouched if we are planning to filter through pandas anyways?
(As a side note I'm using Excel on a MacBook.)
Excel does not leave any file "untouched". It applies formatting to every file it opens (e.g. a float value like "5.06" may be interpreted as a date and changed to "05 Jun"). Depending on the expected datatype, these rows might be displayed wrongly or go missing in your notebook.
Better use sed or awk to manipulate csv files (or a text editor for smaller files).
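If you want to confirm that Excel changed something, one option is to load both versions with every column forced to text, so pandas does no type coercion, and compare them. A minimal sketch (the file names are placeholders):

import pandas as pd

# Hypothetical sketch: read every column as a string to prevent silent
# type coercion, then compare row counts and contents of the two files.
untouched = pd.read_csv('original.csv', dtype=str)
resaved = pd.read_csv('excel_saved.csv', dtype=str)
print(len(untouched), len(resaved))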

How to filter out useable data from csv files using python?

Please help me extract the important data from a .csv file using Python. I got the .csv file from 'citrine'.
I want to extract the element names and atomic percentages as a string of the form "Al2.5B0.02C0.025Co14.7Cr16.0Mo3.0Ni57.48Ti5.0W1.25Zr0.03".
ORIGINAL
[{""element"":""Al"",""idealAtomicPercent"":{""value"":""5.4""}},{""element"":""B"",""idealAtomicPercent"":{""value"":""0.02""}},{""element"":""C"",""idealAtomicPercent"":{""value"":""0.13""}},{""element"":""Co"",""idealAtomicPercent"":{""value"":""7.5""}},{""element"":""Cr"",""idealAtomicPercent"":{""value"":""6.1""}},{""element"":""Mo"",""idealAtomicPercent"":{""value"":""2.0""}},{""element"":""Nb"",""idealAtomicPercent"":{""value"":""0.5""}},{""element"":""Ni"",""idealAtomicPercent"":{""value"":""61.0""}},{""element"":""Re"",""idealAtomicPercent"":{""value"":""0.5""}},{""element"":""Ta"",""idealAtomicPercent"":{""value"":""9.0""}},{""element"":""Ti"",""idealAtomicPercent"":{""value"":""1.0""}},{""element"":""W"",""idealAtomicPercent"":{""value"":""5.8""}},{""element"":""Zr"",""idealAtomicPercent"":{""value"":""0.13""}}]
(The original CSV and the expected output were attached as images.)
Without having the file structure it is hard to tell.
Try to load the file using:
import csv

with open(file_path) as file:
    reader = csv.DictReader(...)
You will have to figure out the arguments for the function, which depend on the file.
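For the data shown above, a hedged sketch: assuming the JSON list lives in a single CSV column (called 'composition' here, a guessed name; the csv module un-doubles the "" escapes for you), you could build the formula string like this:

import csv
import json

# Hypothetical sketch: parse the per-row JSON list of elements and
# concatenate element symbol + atomic percentage into one string.
with open(file_path) as file:
    reader = csv.DictReader(file)
    for row in reader:
        elements = json.loads(row['composition'])  # hypothetical column name
        formula = ''.join(
            e['element'] + e['idealAtomicPercent']['value'] for e in elements
        )
        print(formula)  # e.g. Al5.4B0.02C0.13Co7.5...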

Merge multiple JSON files into one file by using Python (stream twitter)

I've pulled data from Twitter. Currently, the data is in multiple files and I could not merge it into one single file.
Note: all files are in JSON format.
The code I have used is here and here.
It has been suggested to work with glob to combine the JSON files.
I wrote this code as shown in some tutorials about merging JSON with Python:
from glob import glob
import json
import pandas as pd

with open('Desktop/json/finalmerge.json', 'w') as f:
    for fname in glob('Desktop/json/*.json'):  # reads every .json file in Desktop/json
        with open(fname) as j:
            f.write(str(j.read()))
            f.write('\n')
I successfully merged all the files into one file, finalmerge.json.
Now I used this as suggested in several threads:
df_lines = pd.read_json('finalmerge.json', lines=True)
df_lines
The result is 1000000 rows × 23 columns.
What should I do to put each feature in a separate column?
I'm not sure what's wrong with the JSON files; when I checked the merged file, I found it is not valid JSON. What should I do to turn this into a data frame?
The reason I am asking is that I have very basic Python knowledge, and all the answers to similar questions I have found are more complicated than I can understand. Please help this new Python user convert multiple JSON files into one JSON file.
I think the problem is that your files are not really JSON (or rather, they are structured as JSONL). You have two ways of proceeding:
you could read every file as a text file and merge them line by line
you could convert them to JSON (add square brackets at the beginning and end of the file, and a comma after every JSON element except the last)
Try following this question and let me know if it solves your problem: Loading JSONL file as JSON objects
You can also try to edit your code this way:
with open('finalmerge.json', 'w') as f:
    for fname in glob('Desktop/json/*.json'):
        with open(fname) as j:
            f.write(str(j.read()))
            f.write('\n')
Every line will be a different json element.
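Assuming the merge really did produce one JSON object per line (JSONL), a minimal sketch for getting a data frame with each feature in its own column:

import json
import pandas as pd

# Sketch: parse each non-empty line as one JSON object.
records = []
with open('finalmerge.json') as f:
    for line in f:
        line = line.strip()
        if line:
            records.append(json.loads(line))

df = pd.json_normalize(records)  # nested fields become their own columns

If json.loads fails on some line, that source file was not a single-line JSON object, which points back to the merging step.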

Taking output calculated from data in multiple text files and writing them into different columns of a CSV file

I am calculating values (numbers) from two numbers in differing columns of a text file, and I am iterating over multiple text files to do the same calculation. I need to write the output to different columns of a CSV file, where each column corresponds to the calculations obtained from an individual text file. I more or less know how to iterate over different files, but I don't know how to tell Python to write to a different column. Any guidance is appreciated.
You can use the fact that zip provides lazy iteration to do this pretty efficiently. You can define a simple generator function that yields a calculation for every line of the file it is initialized with. You can also use contextlib.ExitStack to manage your open files in a single context manager:
from contextlib import ExitStack
from csv import writer

def calc(line):
    # Ingest a line, do some calculations on it.
    # This is the function you wrote.
    ...

input_files = ['file1.txt', 'file2.txt', ...]

def calculator(file):
    """
    A generator function that will lazily apply the calculation
    to each line of the file it is initialized with.
    """
    for line in file:
        yield calc(line)

with open('output.csv', 'w', newline='') as output, ExitStack() as input_stack:
    inputs = [calculator(input_stack.enter_context(open(file))) for file in input_files]
    output_csv = writer(output)
    output_csv.writerow(input_files)  # Write heading based on input file names
    for row in zip(*inputs):
        output_csv.writerow(row)
The output columns in the CSV will be in the same order as the file names in input_files. Note that zip stops at the shortest input, so rows are only written while every input file still has a line.

csv module in python troubles

I've read countless threads on here, but I'm still unable to figure out exactly how to do this. I'm using the csv module in Python to write data to a csv file. My difficulty is that I've stored the headers in a list (called header) that contains a variable number of columns. I need to reference each column name so I can write it to my file, which would be easy except that I can't figure out how to have a variable number of lists to write from. (Of course I'm using zip(*header, list1, list2, list3, ...) to write to the csv file, but how do I generate the list(i) so that header[i] populates the i-th list?) I'm sorry for the lack of code; I just can't figure out how to even begin.
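One common pattern for a variable number of columns is to keep the per-column value lists in a single list, unpack it with *, and pad the short ones with itertools.zip_longest. A minimal sketch with made-up data:

import csv
from itertools import zip_longest

# Hypothetical data: one list of values per column, lined up with the headers.
header = ['a', 'b', 'c']
columns = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

with open('out.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(header)
    for row in zip_longest(*columns, fillvalue=''):  # transpose columns to rows
        w.writerow(row)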
