I've been wrecking my head with this and probably just need to step back.
I have a CSV file like this (dummy data; there could be 1-20 parameters):
CAR,NAME,AGE,COLOUR
Ford,Mike,45,Blue
VW,Peter,67,Yellow
And I need:
CAR,PARAMETER,VALUE
Ford,NAME,Mike
Ford,AGE,45
Ford,COLOUR,Blue
VW,NAME,Peter
VW,AGE,67
VW,COLOUR,Yellow
I'm looking at:
How to transpose a dataset in a csv file?
Python writing a .csv file with rows and columns transpose
But I think, because I want to keep the CAR column static, the Python zip function might not hack it.
Any thoughts on this sunny Friday, gurus?
Regards!
Python - Transpose columns to rows within data operation and before writing to file
Use pandas:
import pandas as pd

df_in = pd.read_csv('infile.csv')
df_out = df_in.set_index('CAR').stack().reset_index()
df_out.columns = ['CAR', 'PARAMETER', 'VALUE']
df_out.to_csv('outfile.csv', index=False)
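For reference, the same wide-to-long reshape can also be written with DataFrame.melt (an alternative sketch, not what the answer above used); note that melt orders the output by parameter rather than by car, so sort afterwards if the row order matters:

# Same reshape via melt: keep CAR fixed, turn every other column
# into PARAMETER/VALUE pairs, then restore the per-car ordering.
df_out = df_in.melt(id_vars='CAR', var_name='PARAMETER', value_name='VALUE')
df_out = df_out.sort_values('CAR', kind='stable').reset_index(drop=True)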
Input and output example:
>>> df_in
CAR NAME AGE COLOUR
0 Ford Mike 45 Blue
1 VW Peter 67 Yellow
>>> df_out
CAR PARAMETER VALUE
0 Ford NAME Mike
1 Ford AGE 45
2 Ford COLOUR Blue
3 VW NAME Peter
4 VW AGE 67
5 VW COLOUR Yellow
I was able to use the approach from Python - Transpose columns to rows within data operation and before writing to file with some tweaks, and all is working well now.
import csv

# Write one (car, parameter, value) row per cell of the wide input.
# Note: sorted(d.items()) emits the parameters alphabetically
# (AGE, COLOUR, NAME), not in the input column order.
with open('transposed.csv', 'w', newline='') as destfile:
    writer = csv.writer(destfile)
    writer.writerow(['CAR', 'PARAMETER', 'VALUE'])
    with open('input.csv', newline='') as sourcefile:
        for d in csv.DictReader(sourcefile):
            car = d.pop('CAR')  # key must match the input header exactly
            for parameter, value in sorted(d.items()):
                writer.writerow([car, parameter, value])
New Python user here. I'm trying to figure out how to split a dataframe into two Excel files, filtered based on whether two of the columns contain any of a specific set of words ("blue" or "mozzarella"). If either of the two columns DOES contain any of the words, I want that entire row added to one Excel file. If neither does, I want that entire row added to a different Excel file.
>>> data
cheese 1 cheese 2 cheese 3
0 blue mozzarella camembert
1 mozzarella munster edam
2 maccagno mozzarella ricotta
3 brie berkswell parmigiano
I've searched around and attempted the below, but it gives me an invalid index error for data.at[row, 'cheese 1'].
writer = pd.ExcelWriter('./Included cheese.xlsx', engine='xlsxwriter')
writer_2 = pd.ExcelWriter('./Excluded cheese.xlsx', engine='xlsxwriter')
pattern = 'blue|mozzarella'
for index, row in data.iterrows():
    if pattern in data.at[row, 'cheese 1']|data.at[row, 'cheese 2']:
        data.at[row:].to_excel(writer)
    else:
        data.at[row:].to_excel(writer_2)
How should I fix this to achieve my desired filtered outputs?
If I read the question correctly, you can filter the overall dataframe into two dataframes, one with the pattern and one without, and then write the separate dataframes to Excel. Using |, you can write an "or" condition that checks whether the pattern is in either column:
pattern_df = data[(data['cheese 1'].str.contains(pattern)) | (data['cheese 2'].str.contains(pattern))]
Then you can use ~ to negate the check, together with &, to find all rows without the pattern in either column:
non_pattern_df = data[~(data['cheese 1'].str.contains(pattern)) & ~(data['cheese 2'].str.contains(pattern))]
Hope this helps.
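Putting the two filters together with the Excel output, here is a minimal end-to-end sketch (assuming data is already loaded as in the question and an engine such as xlsxwriter or openpyxl is installed):

import pandas as pd

pattern = 'blue|mozzarella'
# One boolean mask for "either column matches"; ~mask is its complement.
mask = (data['cheese 1'].str.contains(pattern)
        | data['cheese 2'].str.contains(pattern))
data[mask].to_excel('Included cheese.xlsx', index=False)
data[~mask].to_excel('Excluded cheese.xlsx', index=False)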
I'm not very familiar with Pandas or with handling xlsx files, but the code below works for me:
import pandas as pd

data = pd.read_excel("test.xlsx")
columns = data.columns.values
check_set = {"blue", "mozzarella"}

# True where the first or second column is exactly one of the words.
conditional = data[columns[0]].isin(check_set) | data[columns[1]].isin(check_set)
data_in = data[conditional]
data_out = data[~conditional]
data_in.to_excel("Included_cheese.xlsx", index=False)
data_out.to_excel("Excluded_cheese.xlsx", index=False)
At first, I tried to write an xlsx file as in your code:
writer = pd.ExcelWriter('Included_cheese.xlsx', engine='xlsxwriter')
writer_2 = pd.ExcelWriter('Excluded_cheese.xlsx', engine='xlsxwriter')
data_in.to_excel(writer)
data_out.to_excel(writer_2)
but I could not open the written file in Excel that way; the ExcelWriter buffers its output, and the file is only finalized once the writer is saved or closed.
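A minimal fix, assuming data_in from the code above: use the writer as a context manager so it is saved and closed automatically.

# Closing the writer (here via `with`) finalizes the xlsx file on disk.
with pd.ExcelWriter('Included_cheese.xlsx', engine='xlsxwriter') as writer:
    data_in.to_excel(writer, index=False)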
Given this dataframe (and pip install openpyxl for the Excel output):
cheese_1 cheese_2 cheese_3
0 blue mozzarella camembert
1 mozzarella munster edam
2 maccagno mozzarella ricotta
3 brie berkswell parmigiano
# Does the row contain mozzarella or blue (in any column)?
mask = (df.eq('mozzarella').any(axis=1) | df.eq('blue').any(axis=1))
data_in = df[mask]
data_out = df[~mask]
data_in.to_excel("Included_cheese.xlsx", index=False)
data_out.to_excel("Excluded_cheese.xlsx", index=False)
>>> print(data_in)
     cheese_1    cheese_2   cheese_3
0        blue  mozzarella  camembert
1  mozzarella     munster       edam
2    maccagno  mozzarella    ricotta
>>> print(data_out)
  cheese_1   cheese_2    cheese_3
3     brie  berkswell  parmigiano
You should first filter the rows you want to save into another dataframe and then write it to the file, something like this:
df_mozza = data[(data["cheese 1"]=="mozzarella") | (data["cheese 1"]=="blue") | (data["cheese 2"]=="mozzarella") | (data["cheese 2"]=="blue")]
df_mozza.reset_index(drop=True, inplace=True)
df_mozza.to_excel(...)
df_other = data[(data["cheese 1"]!="mozzarella") & (data["cheese 1"]!="blue") & (data["cheese 2"]!="mozzarella") & (data["cheese 2"]!="blue")]
df_other.reset_index(drop=True, inplace=True)
df_other.to_excel(...)
I have a CSV file with a lot of rows and a varying number of columns.
How can I group the data by column count and show each group in a different frame?
File CSV has the following data:
1 OLEG US FRANCE BIG
1 OLEG FR 18
1 NATA 18
Because each row has a different number of columns, I have to group the rows by column count and show three frames so that I can then set a header for each:
FR1:
ID NAME STATE COUNTRY HOBBY
1 OLEG US FRANCE BIG

FR2:
ID NAME COUNTRY AGE
1 OLEG FR 18

FR3:
ID NAME AGE
1 NATA 18
In other words, I need to group the rows by column count and show them in different dataframes.
Since pandas doesn't handle rows of different lengths well on import, just don't use it to import your data. Your goal is to create three separate dfs, so first import the data as lists, then deal with the different lengths.
One way to solve this is to read the data with csv.reader and create the dfs with a list comprehension plus a condition on the length of each list:
import csv
import pandas as pd

with open('input.csv', 'r') as f:
    reader = csv.reader(f, delimiter=' ')
    data = list(reader)

df1 = pd.DataFrame([item for item in data if len(item)==3], columns='ID NAME AGE'.split())
df2 = pd.DataFrame([item for item in data if len(item)==4], columns='ID NAME COUNTRY AGE'.split())
df3 = pd.DataFrame([item for item in data if len(item)==5], columns='ID NAME STATE COUNTRY HOBBY'.split())
print(df1, df2, df3, sep='\n\n')
ID NAME AGE
0 1 NATA 18
ID NAME COUNTRY AGE
0 1 OLEG FR 18
ID NAME STATE COUNTRY HOBBY
0 1 OLEG US FRANCE BIG
If you would need to hardcode too many lines for the same step (e.g. too many dfs), you should consider using a loop to create them and store each dataframe as a key/value pair in a dictionary.
EDIT
Here is a slightly optimized way of creating those dfs. I don't think you can get around creating a list of the columns you want for the separate dfs, so you need to know which column-count variations appear in your data (unless you want to create those dfs without naming the columns).
col_list = [['ID', 'NAME', 'AGE'],
            ['ID', 'NAME', 'COUNTRY', 'AGE'],
            ['ID', 'NAME', 'STATE', 'COUNTRY', 'HOBBY']]

with open('input.csv', 'r') as f:
    reader = csv.reader(f, delimiter=' ')
    data = list(reader)

dict_of_dfs = {}
for cols in col_list:
    dict_of_dfs[f'df_{len(cols)}'] = pd.DataFrame([item for item in data if len(item)==len(cols)],
                                                  columns=cols)

for key, val in dict_of_dfs.items():
    print(f'{key=}: \n {val} \n')
key='df_3':
ID NAME AGE
0 1 NATA 18
key='df_4':
ID NAME COUNTRY AGE
0 1 OLEG FR 18
key='df_5':
ID NAME STATE COUNTRY HOBBY
0 1 OLEG US FRANCE BIG
Now you don't have separate variables for your dfs; instead you have them in a dictionary as keys. (I named each df after its number of columns; df_3 is the df with three columns.)
If you need to import the data with pandas, you could have a look at this post.
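For completeness, a hedged sketch of that pandas route (assuming the same space-separated input.csv and at most five columns): giving read_csv a label for every possible column makes it pad short rows with NaN, after which the rows can be grouped by their non-null count.

import pandas as pd

# Pad ragged rows with NaN by supplying a name for every possible column.
raw = pd.read_csv('input.csv', sep=' ', header=None, names=range(5))

# Group rows by how many real (non-null) fields they have and drop the
# all-NaN padding columns inside each group.
counts = raw.notna().sum(axis=1)
dict_of_dfs = {f'df_{n}': raw[counts == n].dropna(axis=1, how='all')
               for n in sorted(counts.unique())}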
Can someone help me extract multiple tables from ONE PDF file? I have 5 pages, and every page has a table with the same header columns, for example:
Example table on every page:
student Score Rang
Alex 50 23
Julia 80 12
Mariana 94 4
I want to extract all these tables into one dataframe. First I did:
df = tabula.read_pdf(file_path,pages='all',multiple_tables=True)
But I got a messy output that looks like this:
[student Score Rang
Alex 50 23
Julia 80 12
Mariana 94 4 ,student Score Rang
Maxim 43 34
Nourah 93 5]
So I edited my code like this:
import pandas as pd
import tabula
file_path = "filePath.pdf"
# read my file
df1 = tabula.read_pdf(file_path, pages=1, multiple_tables=True)
df2 = tabula.read_pdf(file_path, pages=2, multiple_tables=True)
df3 = tabula.read_pdf(file_path, pages=3, multiple_tables=True)
df4 = tabula.read_pdf(file_path, pages=4, multiple_tables=True)
df5 = tabula.read_pdf(file_path, pages=5, multiple_tables=True)
This gives me a dataframe for each table, but I don't know how to combine them into one single dataframe, or how to avoid repeating the same line of code.
According to the documentation of tabula, read_pdf returns a list when passed the multiple_tables=True option.
Thus, you can use pandas.concat on its output to concatenate the dataframes:
df = pd.concat(tabula.read_pdf(file_path,pages='all',multiple_tables=True))
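If you also want a clean 0..n-1 index instead of each page's own row numbers, pass ignore_index=True to concat (a small addition, not part of the original answer):

df = pd.concat(tabula.read_pdf(file_path, pages='all', multiple_tables=True),
               ignore_index=True)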
I have a .csv file that contains 3 types of records, each with a different number of columns.
I know the structure of each record type, and the rows always appear as type1 first, then type2, with type3 at the end, but I don't know how many rows of each record type there are.
The first 4 characters of each row define the record type of that row.
CSV Example:
typ1,John,Smith,40,M,Single
typ1,Harry,Potter,22,M,Married
typ1,Eva,Adams,35,F,Single
typ2,2020,08,16,A
typ2,2020,09,02,A
typ3,Chevrolet,FC101TT,2017
typ3,Toyota,CE972SY,2004
How can I read it with Pandas? It doesn't matter if I have to read one record type at a time.
Thanks!!
Here is a pandas solution.
First we must read the csv file in a way that keeps each entire line in one cell. We do that by simply using a separator that cannot occur in the data, such as the hash symbol '#'. It can be whatever we want, as long as we can guarantee it never appears in the data file.
import pandas as pd

wrong_sep = '#'
right_sep = ','
# header=None keeps the first data row from being consumed as a header.
df = pd.read_csv('my_file.csv', sep=wrong_sep, header=None).iloc[:, 0]
The .iloc[:, 0] is used as a quick way to convert a DataFrame into a Series.
Then we use a loop to select the rows that belong to each record type based on their starting characters, and split the selected rows on the "right separator" (here a comma ',') to build real DataFrames:
starters = ['typ1', 'typ2', 'typ3']
detected_dfs = dict()
for start in starters:
    _df = df[df.str.startswith(start)].str.split(right_sep, expand=True)
    detected_dfs[start] = _df
And here you go. If we print the resulting DataFrames, we get:
      0      1       2   3  4        5
0  typ1   John   Smith  40  M   Single
1  typ1  Harry  Potter  22  M  Married
2  typ1    Eva   Adams  35  F   Single

      0     1   2   3  4
3  typ2  2020  08  16  A
4  typ2  2020  09  02  A

      0          1        2     3
5  typ3  Chevrolet  FC101TT  2017
6  typ3     Toyota  CE972SY  2004
Let me know if it helped you!
Not Pandas:
from collections import defaultdict

filename2 = 'Types.txt'
with open(filename2) as dataLines:
    nL = dataLines.read().splitlines()

# Bucket each line under the record type it starts with.
defDList = defaultdict(list)
subs = ['typ1', 'typ2', 'typ3']
for i in subs:
    for j in nL:
        if j.startswith(i):
            defDList[i].append(j)

print(defDList)
Output:
defaultdict(<class 'list'>, {'typ1': ['typ1,John,Smith,40,M,Single', 'typ1,Harry,Potter,22,M,Married', 'typ1,Eva,Adams,35,F,Single'], 'typ2': ['typ2,2020,08,16,A', 'typ2,2020,09,02,A'], 'typ3': ['typ3,Chevrolet,FC101TT,2017', 'typ3,Toyota,CE972SY,2004']})
You can make use of the skiprows parameter of pandas' read_csv method to skip the rows you are not interested in for a particular record type. The following gives you a dictionary dfs of dataframes, one for each type. An advantage is that records of the same type don't have to be adjacent to each other in the csv file.
For larger files you might want to adjust the code so that the file is only read once instead of twice; a sketch of that variant follows the code below.
import pandas as pd
from collections import defaultdict

indices = defaultdict(list)
types = ['typ1', 'typ2', 'typ3']
filename = 'test.csv'

# First pass: record which line numbers belong to which record type.
with open(filename) as csv:
    for idx, line in enumerate(csv.readlines()):
        for typ in types:
            if line.startswith(typ):
                indices[typ].append(idx)

# Second pass: read the file once per type, skipping all other rows.
dfs = {typ: pd.read_csv(filename, header=None,
                        skiprows=lambda x: x not in indices[typ])
       for typ in types}
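As a sketch of the single-pass variant mentioned above (same filename and types assumed), you can bucket the raw lines by record type in one read and parse each bucket from memory:

import io

# One pass over the file: collect the raw lines per record type.
groups = defaultdict(list)
with open(filename) as csv_file:
    for line in csv_file:
        for typ in types:
            if line.startswith(typ):
                groups[typ].append(line)

# Parse each bucket with pandas from an in-memory buffer.
dfs = {typ: pd.read_csv(io.StringIO(''.join(lines)), header=None)
       for typ, lines in groups.items()}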
Read the file as a CSV file using the CSV reader. The reader fortunately does not care about line formats:
import csv

with open("yourfile.csv") as infile:
    data = list(csv.reader(infile))
Collect the rows with the same first element and build a dataframe of them:
import pandas as pd
from itertools import groupby
dfs = [pd.DataFrame(v) for _,v in groupby(data, lambda x: x[0])]
You've got a list of three dataframes (or as many as necessary). Note that itertools.groupby only groups consecutive rows with equal keys, which is fine here because the file is ordered by record type.
dfs[1]
# 0 1 2 3 4
#0 typ2 2020 08 16 A
#1 typ2 2020 09 02 A
I have a CSV file that looks something like this:
# data.csv (this line is not there in the file)
Names, Age, Names
John, 5, Jane
Rian, 29, Rath
And when I read it through Pandas in Python I get something like this:
import pandas as pd
data = pd.read_csv("data.csv")
print(data)
And the output of the program is:
Names Age Names
0 John 5 Jane
1 Rian 29 Rath
Is there any way to get:
Names Age
0 John 5
1 Rian 29
2 Jane
3 Rath
First, I'd suggest having unique names for each column: either go into the csv file and change a column header, or do so in pandas.
Using 'Names2' as the header of the column with the second occurrence of the same name, try this:
Starting from
import pandas as pd

datalist = [['John', 5, 'Jane'], ['Rian', 29, 'Rath']]
df = pd.DataFrame(datalist, columns=['Names', 'Age', 'Names2'])
We have
Names Age Names
0 John 5 Jane
1 Rian 29 Rath
So, use:
dff = (pd.concat([pd.concat([df['Names'], df['Names2']]).reset_index(drop=True),
                  df.iloc[:, 1]],
                 ignore_index=True, axis=1)
         .fillna('')
         .rename(columns=dict(enumerate(['Names', 'Age']))))
to get your desired result.
From the inside out:
The inner pd.concat stacks the two name columns into one (Series.append would also work, but it is deprecated in newer pandas).
The outer pd.concat(...) combines that stacked column with the Age column of the dataframe.
To discover what the other commands do, I suggest removing them one by one and looking at the results.
Please forgive the formatting of dff; I'm trying to make everything clear from an educational perspective.
You can use:
usecols, which reads only the selected columns.
low_memory=False, so pandas reads the whole file at once instead of internally processing it in chunks.
import pandas as pd

data = pd.read_csv("data.csv", usecols=['Names', 'Age'], low_memory=False)
print(data)
Please use unique column names in your CSV.
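One caveat worth knowing (default pandas behavior, not part of the answer above): when a CSV has duplicate headers, read_csv deduplicates them on read, so the second Names column comes back under a mangled name such as Names.1 and can be selected explicitly:

import pandas as pd

data = pd.read_csv("data.csv")
print(data.columns.tolist())  # e.g. ['Names', 'Age', 'Names.1']
print(data["Names.1"])        # the second Names column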