The DataFrame looks something like this and extends for thousands of rows (i.e., every possible combination of 'Type' and 'Name'):
| total | big | med | small | Type | Name |
|:-----:|:-----:|:-----:|:----:|:--------:|:--------:|
| 5 | 4 | 0 | 1 | Pig | John |
| 6 | 0 | 3 | 3 | Horse | Mike |
| 5 | 2 | 3 | 0 | Cow | Rick |
| 5 | 2 | 3 | 0 | Horse | Rick |
| 5 | 2 | 3 | 0 | Cow | John |
| 5 | 2 | 3 | 0 | Pig | Mike |
I would like to write code that writes files to Excel based on the 'Type' column value. In the example above there are three different Types, so I'd like one file each for Pig, Horse, and Cow.
I have been able to do this using two columns, but for some reason have not been able to do it with just one. See the code below.
for idx, df in data.groupby(['Type', 'Name']):
    table_1 = function_1(df)
    table_2 = function_2(df)
    with pd.ExcelWriter(f"{'STRING1'+ '_' + ('_'.join(idx)) + '_' + 'STRING2'}.xlsx") as writer:
        table_1.to_excel(writer, sheet_name='Table 1', index=False)
        table_2.to_excel(writer, sheet_name='Table 2', index=False)
Current result is:
STRING1_Pig_John_STRING2.xlsx (all the rows that have Pig and John)
What I would like is:
STRING1_Pig_STRING2.xlsx (all the rows that have Pig)
Do you have anything against boolean indexing? If not:
vals = df['Type'].unique().tolist()
with pd.ExcelWriter("blah.xlsx") as writer:
    for val in vals:
        # select the rows for this Type and give them their own sheet
        df[df['Type'] == val].to_excel(writer, sheet_name=str(val), index=False)
EDIT:
If you want to stick to groupby, that would be:
with pd.ExcelWriter("blah.xlsx") as writer:
    for val, grp in data.groupby('Type'):  # grouping by the bare column name makes val the Type value itself
        grp.to_excel(writer, sheet_name=str(val), index=False)
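Note that both versions above put every Type on its own sheet of a single workbook. If you literally want one workbook per Type, named STRING1_Pig_STRING2.xlsx as in the question, here is a minimal sketch reusing the question's function_1 and function_2:
for val, grp in data.groupby('Type'):
    table_1 = function_1(grp)
    table_2 = function_2(grp)
    # val is the bare Type value (e.g. 'Pig'), so no '_'.join is needed
    with pd.ExcelWriter(f"STRING1_{val}_STRING2.xlsx") as writer:
        table_1.to_excel(writer, sheet_name='Table 1', index=False)
        table_2.to_excel(writer, sheet_name='Table 2', index=False)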
I have a DataFrame where the column row is actually data: I'd like to know how to set new column names and push the old header down as the first data row.
For example:
| 4 | 3 | dog |
| --- | --- | --- |
| 1 | 2 | cat |
I want to change that DataFrame to be:
| number_1 | number_2 | animal |
| -------- | -------- | ------ |
| 4 | 3 | dog |
| 1 | 2 | cat |
What would be the best way to do this?
Let's create a new DataFrame with the old column row as the first row, followed by the remaining rows:
pd.DataFrame([df.columns, *df.values], columns=['num_1', 'num_2', 'animal'])
num_1 num_2 animal
0 4 3 dog
1 1 2 cat
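One caveat: the old header values come back as strings, so the numeric columns end up with object dtype. A small follow-up sketch to restore numeric types, reusing the column names from above:
out = pd.DataFrame([df.columns, *df.values], columns=['num_1', 'num_2', 'animal'])
out[['num_1', 'num_2']] = out[['num_1', 'num_2']].apply(pd.to_numeric)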
I have a CSV with data that looks like this:
| id | code | date                      |
|----|------|---------------------------|
| 1  | 2    | 2022-10-05 07:22:39+00:00 |
| 1  | 0    | 2022-11-05 02:22:35+00:00 |
| 2  | 3    | 2021-01-05 10:10:15+00:00 |
| 2  | 0    | 2019-01-11 10:05:21+00:00 |
| 2  | 1    | 2022-01-11 10:05:22+00:00 |
| 3  | 2    | 2022-10-10 11:23:43+00:00 |
I want to remove duplicate ids based on the following conditions:
For the code column, keep a value that is not equal to 0, and if there are several, choose the one with the latest timestamp.
Add another column, prev_code, containing a list of all the remaining code values for that id (the ones not kept in the code column).
Something like this:
| id | code | prev_code |
|----|------|-----------|
| 1  | 2    | [0]       |
| 2  | 1    | [0, 3]    |
| 3  | 2    | []        |
There is probably a sleeker solution, but something along the following lines should work:
import pandas as pd

df = pd.read_csv('file.csv')

# For each id, take the 'code' of the latest-dated row among the non-zero codes
lastcode = (df[df.code != 0]
            .groupby('id')
            .apply(lambda block: block[block['date'] == block['date'].max()]['code']))

# For each id, collect every code except the one kept above
prev_codes = df.groupby('id').agg(
    code=('code', lambda x: [val for val in x if val != lastcode[x.name].values[0]])
)['code']

pd.DataFrame({'id': lastcode.index.get_level_values('id'),
              'code': lastcode.values,
              'prev_code': prev_codes.values})
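For what it's worth, a possibly sleeker sketch of the same idea using sort_values and drop_duplicates; it assumes the date column parses cleanly and makes no promise about the order of the lists in prev_code:
import pandas as pd

df = pd.read_csv('file.csv')
df['date'] = pd.to_datetime(df['date'])

# latest non-zero code per id
latest = (df[df['code'] != 0]
          .sort_values('date')
          .drop_duplicates('id', keep='last')
          .set_index('id')['code'])

# every other code per id
prev = df.groupby('id')['code'].apply(lambda s: [c for c in s if c != latest.get(s.name)])

result = pd.DataFrame({'code': latest, 'prev_code': prev}).reset_index()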
I currently have a large CSV containing data for a number of events.
Column one contains a number of dates as well as an ID for each event, for example.
Basically, I want to write something in Python so that whenever there is an ID number (AL.....) it creates a new CSV, titled with that ID number, containing all the data before the next ID number, so I end up with one CSV per event.
For info, the whole CSV contains 8 columns, but the split into individual CSVs is predicated only on column one.
Use Python to split a CSV file with multiple headers
I notice this question is quite similar, but in my case I have 'AL' followed by a different string of numbers each time, and I also want to name the new CSVs after the ID numbers.
You can achieve this using pandas, so let's first generate some data:
import pandas as pd
import numpy as np

def date_string():
    # a random day/month in 1997
    return str(np.random.randint(1, 32)) + "/" + str(np.random.randint(1, 13)) + "/1997"

l = [date_string() for x in range(20)]
l[0] = "AL123"
l[10] = "AL321"
df = pd.DataFrame(l, columns=['idx'])
# -->
| | idx |
|---:|:-----------|
| 0 | AL123 |
| 1 | 24/3/1997 |
| 2 | 8/6/1997 |
| 3 | 6/9/1997 |
| 4 | 31/12/1997 |
| 5 | 11/6/1997 |
| 6 | 2/3/1997 |
| 7 | 31/8/1997 |
| 8 | 21/5/1997 |
| 9 | 30/1/1997 |
| 10 | AL321 |
| 11 | 8/4/1997 |
| 12 | 21/7/1997 |
| 13 | 9/10/1997 |
| 14 | 31/12/1997 |
| 15 | 15/2/1997 |
| 16 | 21/2/1997 |
| 17 | 3/3/1997 |
| 18 | 16/12/1997 |
| 19 | 16/2/1997 |
So the interesting positions are 0 and 10, as those hold the AL* strings.
Now, to find the AL* rows and split on them, you can use:
idx = df.index[df['idx'].str.startswith('AL')]  # gets every index where an AL* string sits
dfs = np.split(df, idx)                         # splits the data at those positions
for out in dfs[1:]:                             # dfs[0] is the (empty) slice before the first AL*
    name = out.iloc[0, 0]
    out.to_csv(name + ".csv", index=False, header=False)  # saves the chunk
This gives you two CSV files, named AL123.csv and AL321.csv, each with its AL* string as the first line.
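Applied to the real file, the same recipe might look like the sketch below; the file name events.csv and the absence of a header row are assumptions, and per the question the split is keyed only on column one of the 8:
import pandas as pd
import numpy as np

full = pd.read_csv('events.csv', header=None)  # hypothetical file name; 8 columns, no header assumed
starts = full.index[full[0].astype(str).str.startswith('AL')]
for chunk in np.split(full, starts)[1:]:
    chunk.to_csv(f"{chunk.iloc[0, 0]}.csv", index=False, header=False)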
Hopefully a very simple question from a Pandas newbie.
How can I make the value of one column equal the value of another in a dataframe? Replace the value in every row. No conditionals, etc.
Context:
I have two CSV's, loaded into dataframe 'a' and dataframe 'b' respectively.
These CSVs are basically the same, except 'a' has a field that was improperly carried forward from another process - floats were rounded to ints. Not my script, can't influence it, I just have the CSVs now.
In reality I probably have 2mil rows and about 60-70 columns in the merged dataframe - so if it's possible to address the columns by their header (in the example these are Col1 and xyz_Col1), that would sure help.
I have joined the CSVs on their common field, so now I have a scenario where I have a dataframe that can be represented by the following:
+--------+------+--------+------------+----------+----------+
| CellID | Col1 | Col2 | xyz_CellID | xyz_Col1 | xyz_Col2 |
+--------+------+--------+------------+----------+----------+
| 1 | 0 | apple | 1 | 0.23 | apple |
| 2 | 0 | orange | 2 | 0.45 | orange |
| 3 | 1 | banana | 3 | 0.68 | banana |
+--------+------+--------+------------+----------+----------+
The result should be such that Col1 = xyz_Col1:
+--------+------+--------+------------+----------+----------+
| CellID | Col1 | Col2 | xyz_CellID | xyz_Col1 | xyz_Col2 |
+--------+------+--------+------------+----------+----------+
| 1 | 0.23 | apple | 1 | 0.23 | apple |
| 2 | 0.45 | orange | 2 | 0.45 | orange |
| 3 | 0.68 | banana | 3 | 0.68 | banana |
+--------+------+--------+------------+----------+----------+
What I have in code so far:
import pandas as pd

a = pd.read_csv('csv1.csv')
b = pd.read_csv('csv2.csv')

# b = b.dropna(axis=1)  # drop any unnamed fields

# define 'b' cols by adding an xyz_ prefix, as xyz is unique
b = b.add_prefix('xyz_')

# Join the dataframes on the common ID field into a new dataframe named merged
merged = pd.merge(a, b, left_on='CellID', right_on='xyz_CellID')
merged.head(5)

# This is where the xyz_Col1 to Col1 code goes...

# drop unwanted cols
merged = merged[merged.columns.drop(list(merged.filter(regex='xyz')))]

# output to file
merged.to_csv("output.csv", index=False)
Thanks
merged['Col1'] = merged['xyz_Col1']
or
merged.loc[:, 'Col1'] = merged.loc[:, 'xyz_Col1']
(Note the capital C: pandas column access is case-sensitive, so 'col1' would silently create a new column instead of overwriting Col1.)
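Since the real merged frame has 60-70 columns, it may be worth copying every prefixed column back in one pass instead of naming each pair. A hedged sketch, assuming every xyz_-prefixed column should overwrite its unprefixed twin where one exists:
for col in merged.filter(regex='^xyz_').columns:
    target = col[len('xyz_'):]        # e.g. 'xyz_Col1' -> 'Col1'
    if target in merged.columns:
        merged[target] = merged[col]  # overwrite the rounded values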
I would like to import a text file using pandas.read_csv:
1541783101 8901951488 file.log 12345 123456
1541783401 21872967680 other file.log 23456 123
1541783701 3 third file.log 23456 123
The difficulty here is that the columns are separated by one or more spaces, but one column contains file names that may themselves contain spaces. So I can't use sep=r"\s+" to identify the columns, as that would fail at the first file name containing a space. The file format does not have a fixed column width.
However, each file name ends with ".log". I could write separate regular expressions matching each column. Is it possible to use these to identify the columns to import? Or is it possible to write a separator regular expression that selects all characters NOT matching any of the column-matching regular expressions?
Answer for the updated question:
Here's code that will not fail whatever the data width may be. You can modify it as per your needs.
import numpy as np
import pandas as pd

df = pd.read_table('file.txt', header=None)

# Collapse runs of spaces into single spaces
df = df[0].apply(lambda x: ' '.join(x.split()))

# An empty dataframe to hold the output
out = pd.DataFrame(np.nan, index=df.index, columns=['col1', 'col2', 'col3', 'col4', 'col5'])

n_cols = 5  # number of columns
for i in range(n_cols - 2):
    if i == 0 or i == 1:
        # peel columns 1 and 2 off the left
        out.iloc[:, i] = df.str.partition(' ').iloc[:, 0]
        df = df.str.partition(' ').iloc[:, 2]
    else:
        # peel the last column off the right
        out.iloc[:, 4] = df.str.rpartition(' ').iloc[:, 2]
        df = df.str.rpartition(' ').iloc[:, 0]

# what remains is 'col3 col4'; split it once more from the right
out.iloc[:, 3] = df.str.rpartition(' ').iloc[:, 2]
out.iloc[:, 2] = df.str.rpartition(' ').iloc[:, 0]
print(out)
+---+------------+-------------+----------------+-------+--------+
| | col1 | col2 | col3 | col4 | col5 |
+---+------------+-------------+----------------+-------+--------+
| 0 | 1541783101 | 8901951488 | file.log | 12345 | 123456 |
| 1 | 1541783401 | 21872967680 | other file.log | 23456 | 123 |
| 2 | 1541783701 | 3 | third file.log | 23456 | 123 |
+---+------------+-------------+----------------+-------+--------+
Note: the code above is hardcoded for 5 columns, but it can be generalized.
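Since the question asks specifically about regular expressions: the file name is the only field that can contain spaces, and it always ends in ".log", so a single str.extract pattern can pull all five fields out at once. A sketch under those assumptions (the column names are placeholders):
import pandas as pd

raw = pd.read_table('file.txt', header=None)[0]  # one raw string per line
pattern = r'^(\S+)\s+(\S+)\s+(.+\.log)\s+(\S+)\s+(\S+)$'
out = raw.str.extract(pattern)
out.columns = ['col1', 'col2', 'col3', 'col4', 'col5']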
Previous answer:
Use pd.read_fwf() to read fixed-width files.
In your case:
pd.read_fwf('file.txt', header=None)
+---+----------+-----+-------------------+-------+--------+
| | 0 | 1 | 2 | 3 | 4 |
+---+----------+-----+-------------------+-------+--------+
| 0 | 20181201 | 3 | file.log | 12345 | 123456 |
| 1 | 20181201 | 12 | otherfile.log | 23456 | 123 |
| 2 | 20181201 | 200 | odd file name.log | 23456 | 123 |
+---+----------+-----+-------------------+-------+--------+