I'm using TweetScraper to scrape tweets with certain keywords. Right now, each tweet gets saved to a separate JSON file in a set folder, so I end up with thousands of JSON files. Is there a way to make each new tweet append to one big JSON file? If not, how do I process/work with thousands of small JSON files in Python?
Here's the part of settings.py that handles saving data:
# settings for where to save data on disk
SAVE_TWEET_PATH = './Data/tweet/'
SAVE_USER_PATH = './Data/user/'
I would read all the files, put the data in a list, and save it again as one JSON file:
import os
import json

folder = '.'  # e.g. './Data/tweet/' from the settings above
all_tweets = []

# --- read ---
for filename in sorted(os.listdir(folder)):
    if filename.endswith('.json'):
        fullpath = os.path.join(folder, filename)
        with open(fullpath) as fh:
            tweet = json.load(fh)
            all_tweets.append(tweet)

# --- save ---
with open('all_tweets.json', 'w') as fh:
    json.dump(all_tweets, fh)
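If appending to a single file as tweets arrive is the goal, one option (not covered above) is the JSON Lines layout: one JSON object per line, so each new tweet can be appended without rewriting the whole file. A minimal sketch, assuming each tweet is a plain dict; wiring it into TweetScraper's own pipeline is left out since that code isn't shown here:
import json

def append_tweet(tweet, path='all_tweets.jsonl'):
    # append one tweet as a single line of JSON
    with open(path, 'a', encoding='utf-8') as fh:
        fh.write(json.dumps(tweet) + '\n')

def read_tweets(path='all_tweets.jsonl'):
    # stream the tweets back one at a time
    with open(path, encoding='utf-8') as fh:
        for line in fh:
            yield json.loads(line)
This also helps with the second half of the question: reading one merged .jsonl file line by line avoids opening thousands of small files on every run.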
Related
I want to extract text from multiple text files; the idea is that I have a folder and all the text files are in that folder.
I have tried and successfully get the text, but when I use that string buffer somewhere else, only the text of the first file is visible to me.
I want to store these texts in a particular string buffer.
What I have done:
import glob
import io

Raw_txt = " "
files = [file for file in glob.glob(r'C:\Users\Hp\Desktop\RAW\*.txt')]
for file_name in files:
    with io.open(file_name, 'r') as image_file:
        content1 = image_file.read()
        Raw_txt = content1
        print(Raw_txt)
This Raw_txt buffer only works inside the loop, but I want to use this buffer somewhere else.
Thanks!
I think the issue is related to where you load the content of your text files.
Raw_txt is overwritten with each file.
I would recommend doing something like this, where the text is appended instead:
import glob

Raw_txt = ""
files = [file for file in glob.glob(r'C:\Users\Hp\Desktop\RAW\*.txt')]
for file_name in files:
    with open(file_name, "r+") as file:
        # a newline is added in case you want to separate the contents of each file
        Raw_txt += file.read() + "\n"
print(Raw_txt)
Also, you don't need the io module to read a text file; the built-in open is enough.
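As a side note, the same accumulation can be written with pathlib, which handles the globbing and reading in one place. A small sketch, assuming the same folder as above:
from pathlib import Path

Raw_txt = ""
for txt_path in sorted(Path(r'C:\Users\Hp\Desktop\RAW').glob('*.txt')):
    Raw_txt += txt_path.read_text() + "\n"
print(Raw_txt)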
I've used fitz from the PyMuPDF module to extract the data, and then converted the extracted data to a DataFrame with pandas.
# Code to read multiple PDFs from the folder:
import fitz  # PyMuPDF
from pathlib import Path

# returns all file paths that have .pdf as extension in the specified directory
pdf_search = Path("C:/Users/Ayesha.Gondekar/Eversana-CVs/").glob("*.pdf")

# convert the glob generator output to a list
pdf_files = [str(file.absolute()) for file in pdf_search]

# Code to extract the data:
for pdf in pdf_files:
    with fitz.open(pdf) as doc:
        pypdf_text = ""
        for page in doc:
            pypdf_text += page.getText()
But the above code only extracts the data for the last PDF in the folder, and therefore gives a result for only that PDF,
although the desired goal is to extract the data from all the PDFs in the folder, one by one.
Please help me understand and resolve why this is happening.
The following code worked for me:
import fitz  # PyMuPDF
import pandas as pd
from pathlib import Path

# returns all file paths that have .pdf as extension in the specified directory
pdf_search = Path("C:/Users/Ayesha.Gondekar/Eversana-CVs/").glob("*.pdf")

# convert the glob generator output to a list
pdf_files = [str(file.absolute()) for file in pdf_search]

# Code to extract the data:
pdf_txt = ""
for pdf in pdf_files:
    with fitz.open(pdf) as doc:
        for page in doc:
            pdf_txt += page.getText()

# Converting the extracted data to a data frame:
with open('pdf_txt.txt', 'w', encoding='utf-8') as f:  # write the extracted text to a file
    f.write(pdf_txt)
data = pd.read_table('pdf_txt.txt', sep='\n')  # read the text file into a dataframe
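As a side note, newer PyMuPDF releases use page.get_text() (the camel-case getText alias is deprecated and removed in recent versions), and the extracted text can go straight into a DataFrame without the intermediate .txt file. A small sketch along those lines; the single "text" column is just an assumed layout:
import fitz  # PyMuPDF
import pandas as pd

pdf_txt = ""
for pdf in pdf_files:
    with fitz.open(pdf) as doc:
        for page in doc:
            pdf_txt += page.get_text()

# one row per non-empty line of extracted text
data = pd.DataFrame({"text": [line for line in pdf_txt.splitlines() if line.strip()]})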
Thank you @Yevhen Kuzmovych for your help!
Change this line:
Path("C:/Users/Ayesha.Gondekar/Eversana-CVs/").glob("*.pdf")
to:
files_pdf = [file for file in glob.glob(path + r"\*.pdf", recursive=True)]
and pass the folder as a path variable (this requires import glob).
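For example, using the folder from the question as the path variable:
import glob

path = r"C:\Users\Ayesha.Gondekar\Eversana-CVs"
files_pdf = [file for file in glob.glob(path + r"\*.pdf", recursive=True)]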
The goal is to download GTFS data through Python web scraping, starting with https://transitfeeds.com/p/agence-metropolitaine-de-transport/129/latest/download
Currently, I'm using requests like so:
import requests

def download(url):
    fpath = "prov/city/GTFS"
    r = requests.get(url)
    if r.ok:
        print("Saving file.")
        open(fpath, "wb").write(r.content)
    else:
        print("Download failed.")
Unfortunately, r.content for the above URL renders as unreadable binary output.
You can see the files of interest within that output (e.g. stops.txt), but how might I access them to read/write?
I fear you're trying to read a zip file as plain text; perhaps you should try the zipfile module.
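For instance, a minimal sketch that unpacks the response in memory, without saving the zip to disk first (and assuming the download succeeds):
import io
import zipfile

import requests

r = requests.get("https://transitfeeds.com/p/agence-metropolitaine-de-transport/129/latest/download")
with zipfile.ZipFile(io.BytesIO(r.content)) as z:
    print(z.namelist())                          # lists the feed files, e.g. stops.txt
    stops = z.read("stops.txt").decode("utf-8")  # raw text of one file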
The following worked:
import requests

def download(url):
    fpath = "path/to/output/"
    # headers is assumed to be defined elsewhere (e.g. a User-Agent dict)
    f = requests.get(url, stream=True, headers=headers)
    if f.ok:
        print("Saving to {}".format(fpath))
        g = open(fpath + 'output.zip', 'wb')
        g.write(f.content)
        g.close()
    else:
        print("Download failed with error code: ", f.status_code)
You need to write the downloaded content to a zip file first.
import requests
url = "https://transitfeeds.com/p/agence-metropolitaine-de-transport/129/latest/download"
fname = "gtfs.zip"
r = requests.get(url)
open(fname, "wb").write(r.content)
Now fname exists and has several text files inside. If you want to programmatically extract the zip and then read the content of a file, for example stops.txt, you can either extract a single file or simply use extractall.
import zipfile
# this will extract only a single file, and
# raise a KeyError if the file is missing from the archive
zipfile.ZipFile(fname).extract("stops.txt")
# this will extract all the files found from the archive,
# overwriting files in the process
zipfile.ZipFile(fname).extractall()
Now you just need to work with your file(s).
thefile = "stops.txt"
# just plain text
text = open(thefile).read()
# csv file
import csv
reader = csv.reader(open(thefile))
for row in reader:
    ...
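Since GTFS text files are CSVs with a header row, csv.DictReader can be a bit more convenient; a short sketch (stop_id and stop_name are standard GTFS column names):
import csv

with open("stops.txt", newline="", encoding="utf-8") as fh:
    for stop in csv.DictReader(fh):
        print(stop["stop_id"], stop["stop_name"])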
I have multiple JSON files saved in a folder. I would like to parse each JSON file, use the flatten library, and save each result as a separate JSON file.
I have managed to do this with one JSON file, but I'm struggling to parse several JSON files at once without merging the data, and then save them.
I think I need to create a loop that loads a JSON file, flattens it, and saves it until there are no more JSON files in the folder. Is this possible?
This still seems to parse only one JSON file:
import os
import json

path_to_json = 'json_test/'
for file in [file for file in os.listdir(path_to_json) if file.endswith('.json')]:
    with open(path_to_json + file) as json_file:
        data1 = json.load(json_file)
Any help would be much appreciated thanks!
On every loop iteration, data1 is reassigned to the newly loaded JSON file, so you end up with only one result.
Instead, append to a list:
import os
import json

# Flatten not supported on 3.8.3
path = 'X:test folder/'
file_list = [p for p in os.listdir(path) if p.endswith('.json')]

flattened = []
for file in file_list:
    with open(path + file) as json_file:
        # flatten json here, can't install from pip.
        flattened.append(json.load(json_file))

for file, flat_json in zip(file_list, flattened):
    json.dump(flat_json, open(path + file + '_flattened.json', "w"), indent=2)
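If the flatten package can't be installed, a hand-rolled flattener is short enough to inline. A minimal sketch; dot-separated keys are an arbitrary choice, and the real flatten library's output format may differ:
def flatten(obj, parent_key="", sep="."):
    # recursively flatten nested dicts/lists into a single-level dict
    items = {}
    if isinstance(obj, dict):
        pairs = obj.items()
    elif isinstance(obj, list):
        pairs = enumerate(obj)
    else:
        return {parent_key: obj}
    for key, value in pairs:
        new_key = "{}{}{}".format(parent_key, sep, key) if parent_key else str(key)
        items.update(flatten(value, new_key, sep))
    return items
With that in place, the loop above could call flattened.append(flatten(json.load(json_file))) instead of appending the raw data.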
# Can you try this out?
# https://stackoverflow.com/questions/23520542/issue-with-merging-multiple-json-files-in-python
import glob

read_files = glob.glob("*.json")
with open("merged_file.json", "w") as outfile:
    outfile.write('[{}]'.format(
        ','.join([open(f).read() for f in read_files])))
Question: How can I read in many PDFs in the same path using Python package "slate"?
I have a folder with over 600 PDFs.
I know how to use the slate package to convert single PDFs to text, using this code:
import os
import re
import slate

migFiles = [filename for filename in os.listdir(path)
            if re.search(r'(.*\.pdf$)', filename) is not None]
with open(migFiles[0]) as f:
    doc = slate.PDF(f)
len(doc)
However, this limits you to one PDF at a time, specified by "migFiles[0]" - 0 being the first PDF in my folder.
How can I read in many PDFs to text at once, retaining them as separate strings or txt files? Should I use another package? How could I create a "for loop" to read in all the PDFs in the path?
Try this version:
import glob
import os
import slate

for pdf_file in glob.glob("{}/{}".format(path, "*.pdf")):
    with open(pdf_file) as pdf:
        txt_file = "{}.txt".format(os.path.splitext(pdf_file)[0])
        with open(txt_file, 'w') as txt:
            txt.write(" ".join(slate.PDF(pdf)))  # join the pages into one string
This will create, next to each PDF, a text file with the same name containing the converted contents.
Or, if you want to keep the contents in memory, try this version; but keep in mind that if the extracted content is large you may exhaust your available memory:
import glob
import os
import slate

pdf_as_text = {}
for pdf_file in glob.glob("{}/{}".format(path, "*.pdf")):
    with open(pdf_file) as pdf:
        file_without_extension = os.path.splitext(pdf_file)[0]
        pdf_as_text[file_without_extension] = slate.PDF(pdf)
Now you can use pdf_as_text['somefile'] to get the text contents.
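A quick usage note: since the keys are built with os.path.splitext(pdf_file)[0], they keep any directory prefix, so the lookup key is the PDF's path minus its .pdf extension. For example, for a hypothetical somefile.pdf in the scanned folder:
key = "{}/{}".format(path, "somefile")   # "somefile.pdf" without the extension
pages = pdf_as_text[key]                 # slate returns a list of page strings
print(" ".join(pages)[:200])             # first 200 characters of the document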
What you can do is use a simple loop:
docs = []
for filename in migFiles:
    with open(filename) as f:
        docs.append(slate.PDF(f))
        # or, instead of keeping the text in memory, just process it now
Then, docs[i] will hold the text of the (i+1)-th pdf file, and you can do whatever you want with the file whenever you want. Alternatively, you can process the file inside the for loop.
If you want to convert to text, you can do:
docs = []
# The character used to separate the contents of consecutive pages;
# if you want the contents of each page separated by a newline, use separator = '\n'
separator = ' '
for filename in migFiles:
    with open(filename) as f:
        docs.append(separator.join(slate.PDF(f)))  # turn the pages into plain text
or
separator = ' '
for filename in migFiles:
    with open(filename) as f:
        # if filename = "abc.pdf", then filename[:-4] = "abc"
        txtfile = open(filename[:-4] + ".txt", 'w')
        txtfile.write(separator.join(slate.PDF(f)))
        txtfile.close()