I am trying to create a dataframe (a table with three columns) from a .txt file.
I prepared the .txt file so it has this format:
Car
Audi A4 10000
Audi A6 12000
....
Bus
VW Transporter 15000
...
Camper
VW California 20000
...
This is the whole code:
cars = ""
with open("cars.txt", "r", encoding = "utf-8") as f:
cars = f.read()
print(cars)
def generate_car_table(table):
table = pd.DataFrame(columns = ['category', 'model','price'])
return table
cars_table = generate_car_table(cars)
I expect a table with three columns: category (whether the vehicle is a car, bus, or camper), model, and price.
Thank you in advance!
Update:
With your comments in mind, I see that I misunderstood your question.
If your text file (cars.txt) looks as follows:
Car
Audi A4 10000
Audi A6 12000
Bus
VW Transporter 15000
Camper
VW California 20000
so that every category stands on its own line and the model and price are separated by a tab, you could run the following code:
import pandas as pd

# Read the file
data = pd.read_csv('cars.txt', names=['Model', 'Price', 'Category'], sep='\t')

# Transform the unstructured data: category lines have no price,
# so copy the model value into the category column and forward-fill it
data.loc[data['Price'].isnull(), 'Category'] = data['Model']
data['Category'].fillna(method='ffill', inplace=True)
data.dropna(axis=0, subset=['Price'], inplace=True)

# Clean the dataframe
data.reset_index(drop=True, inplace=True)
data = data[['Category', 'Model', 'Price']]
print(data)
This results in the following table:
Category Model Price
0 Car Audi A4 10000.0
1 Car Audi A6 12000.0
2 Bus VW Transporter 15000.0
3 Camper VW California 20000.0
Old Answer:
Your text file needs a fixed structure (for example, all values separated by a tab or a line break).
Then you can use pd.read_csv and set the separator by hand with pd.read_csv('yourFileName', sep='your_separator').
Tabs are \t and line breaks are \n, for example.
The following cars.txt, for example, is structured using tabs and can be read with:
import pandas as pd
pd.read_csv('cars.txt', sep = '\t')
It is likely far easier to create a table from a CSV file than from a free-form text file: parsing becomes much simpler, and the file can also be viewed as a table in spreadsheet applications such as Excel.
Create the file so that it looks something like this:
category,model,price
Car,Audi A4,10000
Car,Audi A6,12000
...
Then use the csv module to read/write the data in tabular form.
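A minimal sketch of reading such a file with the csv module (assuming it is saved as cars.csv with the header shown above):
import csv

with open('cars.csv', newline='') as f:
    reader = csv.DictReader(f)   # uses the header row as the keys
    rows = list(reader)          # list of dicts like {'category': 'Car', 'model': 'Audi A4', 'price': '10000'}
print(rows[0]['model'], rows[0]['price'])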
Related
I would like to parse the following idx file: https://www.sec.gov/Archives/edgar/daily-index/2022/QTR1/company.20220112.idx into a Pandas DataFrame.
I use the following code to check how it would look as a text file:
import os, requests

base_path = '/Users/GunardiLin/Desktop/Insider_Ranking/temp/'
current_dirs = os.listdir(path=base_path)
local_filename = f'20200102'
local_file_path = '/'.join([base_path, local_filename])

if local_filename in base_path:
    print(f'Skipping index file for {local_filename} because it is already saved.')

url = f'https://www.sec.gov/Archives/edgar/daily-index/2020/QTR1/company.20200102.idx'
r = requests.get(url, stream=True, headers={'user-agent': 'MyName myname#outlook.com'})
with open(local_file_path, 'wb') as f:
    for chunk in r.iter_content(chunk_size=10240):
        f.write(chunk)
Next I would like to build a parser that is fault-tolerant, because it has to parse a new idx file into a pd.DataFrame every day.
My idea was to use string manipulation, but that would be very complicated and not fault-tolerant.
I would be thankful if someone could show the best practice for parsing this and provide some boilerplate code.
Since this is mostly a fixed-width file, you could use pandas' read_fwf to read it. You can skip over the leading information (via skiprows=) and get straight to the data. The column names are predefined and assigned when read:
import pandas as pd

idx_path = 'company.20220112.idx'
names = ['Company Name', 'Form Type', 'CIK', 'Date Filed', 'File Name']
df = pd.read_fwf(idx_path, colspecs=[(0,61),(62,74),(74,84),(86,94),(98,146)], names=names, skiprows=11)
df.head(10)
Company Name Form Type CIK Date Filed File Name
0 005 - Series of IPOSharks Venture Master Fund,... D 1888451 20220112 edgar/data/1888451/0001888451-22-000002.txt
1 10X Capital Venture Acquisition Corp. III EFFECT 1848948 20220111 edgar/data/1848948/9999999995-22-000102.txt
2 110 White Partners LLC D 1903845 20220112 edgar/data/1903845/0001884293-22-000001.txt
3 15 Beach, MHC 3 1903509 20220112 edgar/data/1903509/0001567619-22-001073.txt
4 15 Beach, MHC SC 13D 1903509 20220112 edgar/data/1903509/0000943374-22-000014.txt
5 170 Valley LLC D 1903913 20220112 edgar/data/1903913/0001903913-22-000001.txt
6 1st FRANKLIN FINANCIAL CORP 424B3 38723 20220112 edgar/data/38723/0000038723-22-000003.txt
7 1st FRANKLIN FINANCIAL CORP 424B3 38723 20220112 edgar/data/38723/0000038723-22-000004.txt
8 215 BF Associates LLC D 1904145 20220112 edgar/data/1904145/0001904145-22-000001.txt
9 2401 Midpoint Drive REIT, LLC D 1903337 20220112 edgar/data/1903337/0001903337-22-000001.txt
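To address the fault-tolerance part of the question, one possible sketch (my own suggestion, not part of the answer above) wraps the read in a small function that returns None instead of raising when a daily file is missing or malformed; the colspecs are the ones used above and may need adjusting if the layout ever drifts:
import pandas as pd

def parse_idx(path):
    # Parse one saved daily .idx file; return None instead of raising on failure
    names = ['Company Name', 'Form Type', 'CIK', 'Date Filed', 'File Name']
    colspecs = [(0, 61), (62, 74), (74, 84), (86, 94), (98, 146)]
    try:
        return pd.read_fwf(path, colspecs=colspecs, names=names, skiprows=11)
    except (FileNotFoundError, ValueError) as exc:
        print(f'Could not parse {path}: {exc}')
        return None

df = parse_idx('company.20220112.idx')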
I have a large file (>500 rows) with multiple data points for each unique item in the list, something like:
cheese     weight   location
gouda      1.4      AL
gouda      2        TX
gouda      1.2      CA
cheddar    5.3      AL
cheddar    6        MN
chaddar    2        WA
Havarti    4        CA
Havarti    4.2      AL
I want to make a data frame for each cheese to store the relevant data.
I have this:
import pandas as pd

main_cheese_file = pd.read_csv('CheeseMaster.csv')
cut_the_cheese = main_cheese_file.cheese.unique()
melted = {elem: pd.DataFrame() for elem in cut_the_cheese}
for key in melted.keys():
    melted[key] = main_cheese_file[:][main_cheese_file.cheese == key]
to split it up on the unique thing I want.
What I want to do with it is make df's that can be exported for each cheese with the cheese name as the file name.
So far I can force it with
melted['Cheddar'].to_csv('Cheddar.csv')
and get the Cheddars ....
but I don't want to have to know and type out each type of cheese on the list of 500 rows...
Is there a way to add this to my loop?
You can just iterate over a groupby object
import pandas as pd

df = pd.read_csv('CheeseMaster.csv')
for k, v in df.groupby('cheese'):
    v.to_csv(f'{k}.csv', index=False)
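If you also want the per-cheese frames in memory (like the melted dict in the question), the same groupby can fill it; a small sketch assuming the df read above:
# dict of per-cheese DataFrames keyed by cheese name
melted = {cheese: frame for cheese, frame in df.groupby('cheese')}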
I have 4 csv files. Each file has different fields, e.g. name, id_number, etc. Each file is talking about the same thing, for which there is a unique id that each file has. So, I would like to concatenate the fields of each of the 4 files into a single DataFrame. For instance, one file contains first_name, another file contains last_name, then I want to merge those two, so that I can have first and last name for each object.
Doing that is trivial, but I'd like to know the most efficient way, or if there is some built-in function that does it very efficiently.
The files look something like this:
file1:
id name age pets
b13 Marge 18 cat
y47 Dan 13 dog
h78 Mark 20 lizard
file2:
id last_name income city
y47 Schmidt 1800 Dallas
b13 Olson 1670 Paris
h78 Diaz 2010 London
Files 3 and 4 are like that, with different fields. The ids are not necessarily ordered. The goal, again, is to have one DataFrame looking like this:
id name age pets last_name income city
b13 Marge 18 cat Olson 1670 Paris
y47 Dan 13 dog Schmidt 1800 Dallas
h78 Mark 20 lizard Diaz 2010 London
What I've done is this:
import pandas as pd

file1 = pd.read_csv('file1.csv')
file2 = pd.read_csv('file2.csv')
file3 = pd.read_csv('file3.csv')
file4 = pd.read_csv('file4.csv')

f1_group = file1.groupby(['id'])
f2_group = file2.groupby(['id'])
f3_group = file3.groupby(['id'])
f4_group = file4.groupby(['id'])

data = []
for id1, group1 in f1_group:
    for id2, group2 in f2_group:
        for id3, group3 in f3_group:
            for id4, group4 in f4_group:
                if id1 == id2 == id3 == id4:
                    frames = [group1, group2, group3, group4]
                    con = pd.concat(frames, axis=1)
                    data.append(con)
That works but is extremely inefficient. If I could eliminate each element from group1, group2, etc. once it has been considered, that would help, but it would still be inefficient.
Thanks in advance.
Hi maybe you can try this :)
https://www.freecodecamp.org/news/how-to-combine-multiple-csv-files-with-8-lines-of-code-265183e0854/
import os
import glob
import pandas as pd
#set working directory
os.chdir("/mydir")
#find all csv files in the folder
#use glob pattern matching -> extension = 'csv'
#save result in list -> all_filenames
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
#print(all_filenames)
#combine all files in the list
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames ])
#export to csv
combined_csv.to_csv( "combined_csv.csv", index=False, encoding='utf-8-sig')
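Note that pd.concat as used above stacks the files on top of each other; if the goal is to line the rows up on the shared id column (as described in the question), a merge-based sketch might look like the following. The file names and the use of functools.reduce are my assumptions, not part of the linked article:
import functools
import pandas as pd

files = ['file1.csv', 'file2.csv', 'file3.csv', 'file4.csv']
frames = [pd.read_csv(f) for f in files]

# successively inner-merge each frame on the shared 'id' column
merged = functools.reduce(lambda left, right: pd.merge(left, right, on='id'), frames)
print(merged)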
I've been wracking my head over this and probably just need to step back.
I have a CSV file like this (dummy data; there could be 1-20 parameters):
CAR,NAME,AGE,COLOUR
Ford,Mike,45,Blue
VW,Peter,67,Yellow
And I need:
CAR,PARAMETER,VALUE
Ford,NAME,Mike
Ford,AGE,45
Ford,COLOUR,BLUE
VW,NAME,Peter
VW,AGE,67
VW,COLOUR,Yellow
I'm looking at:
How to transpose a dataset in a csv file?
Python writing a .csv file with rows and columns transpose
But I think that because I want to keep the CAR column static, the Python zip function might not hack it...
Any thoughts on this Sunny Friday Gurus?
Regards!
Python - Transpose columns to rows within data operation and before writing to file
Use pandas:
import pandas as pd

df_in = pd.read_csv('infile.csv')
df_out = df_in.set_index('CAR').stack().reset_index()
df_out.columns = ['CAR', 'PARAMETER', 'VALUE']
df_out.to_csv('outfile.csv', index=False)
Input and output example:
>>> df_in
CAR NAME AGE COLOUR
0 Ford Mike 45 Blue
1 VW Peter 67 Yellow
>>> df_out
CAR PARAMETER VALUE
0 Ford NAME Mike
1 Ford AGE 45
2 Ford COLOUR Blue
3 VW NAME Peter
4 VW AGE 67
5 VW COLOUR Yellow
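Equivalently, pd.melt produces the same long format in a single step; a small sketch assuming the same df_in (the row order may differ from the stack version):
df_out = df_in.melt(id_vars='CAR', var_name='PARAMETER', value_name='VALUE')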
I was able to use Python - Transpose columns to rows within data operation and before writing to file with some tweaks, and it is all working well now.
import csv

with open('transposed.csv', 'wt') as destfile:
    writer = csv.writer(destfile)
    writer.writerow(['car', 'parameter', 'value'])
    with open('input.csv', 'rt') as sourcefile:
        for d in csv.DictReader(sourcefile):
            car = d.pop('car')
            for parameter, value in sorted(d.items()):
                row = [car, parameter.upper(), value]
                writer.writerow(row)
Hi guys.
I've got a bit of a unique issue trying to merge two big data files together. Both files have a column of the same data (patent number) with all other columns different.
The idea is to join them such that these patent number columns align so the other data is readable and connected.
The first few lines of the .dat file look like this:
IL 1 Chicago 10030271 0 3930271
PA 1 Bedford 10156902 0 3930272
MO 1 St. Louis 10112031 0 3930273
IL 1 Chicago 10030276 0 3930276
And the .asc:
02 US corporation No change 11151713 TRANSCO PROD INC 58419
02 US corporation No change 11151720 SECURE TELECOM INC 502530
02 US corporation No change 11151725 SOA SYSTEMS INC 520365
02 US corporation No change 11151738 REVTEK INC 473150
The .dat file is too large to open fully in Excel, so I don't think reorganizing it there is an option (or at least I don't know whether it is via any of the macros I've found online so far).
Quite a newbie question, I feel, but does anyone know how I could link these data sets together (preferably using Python) using this patent number as a unique identifier?
You will want to write a program that reads in the data from the two files you would like to merge. You will open each file and parse the data line by line. From there you can write the data to a new file in any order you would like. This is accomplishable through Python file IO.
pseudo code:
def filehandler(self, filename1, filename2):
    fd1 = open(filename1, "r")
    fd2 = open(filename2, "r")
    while True:
        line1 = fd1.readline()
        if not line1:
            break  # exit the loop when there is nothing left to read
        line1_array = line1.split()
        # the line from the first file is split into an array, delimited by spaces