import text data having spaces using pandas.read_csv - python

I would like to import a text file using pandas.read_csv:
1541783101 8901951488 file.log 12345 123456
1541783401 21872967680 other file.log 23456 123
1541783701 3 third file.log 23456 123
The difficulty here is that the columns are separated by one or more spaces, but one column contains a file name that may itself contain spaces. So I can't use sep=r"\s+" to identify the columns, as that would fail at the first file name containing a space. The file format does not have fixed column widths.
However, each file name ends with ".log". I could write a separate regular expression matching each column. Is it possible to use these to identify the columns to import? Or is it possible to write a separator regular expression that selects all characters NOT matched by any of the column regular expressions?
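For reference, a rough sketch of what that per-column regex idea could look like (my own illustration, assuming exactly five fields and that only the third field, the ".log" file name, can contain spaces): read each line whole and pull the fields out with Series.str.extract.
import pandas as pd

# Each capture group matches one column; only the file-name group is
# allowed to contain spaces and must end in ".log".
pattern = r'^(\d+)\s+(\d+)\s+(.+?\.log)\s+(\d+)\s+(\d+)$'

with open('file.txt') as fh:
    lines = pd.Series(fh.read().splitlines())

df = lines.str.extract(pattern)
df.columns = ['col1', 'col2', 'col3', 'col4', 'col5']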

Answer for updated question -
Here's code that will not fail regardless of the data widths. You can modify it as per your needs.
import numpy as np
import pandas as pd

df = pd.read_table('file.txt', header=None)

# Replace runs of spaces with a single space
df = df[0].apply(lambda x: ' '.join(x.split()))

# An empty dataframe to hold the output
out = pd.DataFrame(np.nan, index=df.index,
                   columns=['col1', 'col2', 'col3', 'col4', 'col5'])

n_cols = 5  # number of columns
for i in range(n_cols - 2):
    if i == 0 or i == 1:
        # Peel the first two columns off the left
        out.iloc[:, i] = df.str.partition(' ').iloc[:, 0]
        df = df.str.partition(' ').iloc[:, 2]
    else:
        # Peel the last column off the right
        out.iloc[:, 4] = df.str.rpartition(' ').iloc[:, 2]
        df = df.str.rpartition(' ').iloc[:, 0]

# What remains is "<file name> <col4>": split it from the right
out.iloc[:, 3] = df.str.rpartition(' ').iloc[:, 2]
out.iloc[:, 2] = df.str.rpartition(' ').iloc[:, 0]

print(out)
+---+------------+-------------+----------------+-------+--------+
| | col1 | col2 | col3 | col4 | col5 |
+---+------------+-------------+----------------+-------+--------+
| 0 | 1541783101 | 8901951488 | file.log | 12345 | 123456 |
| 1 | 1541783401 | 21872967680 | other file.log | 23456 | 123 |
| 2 | 1541783701 | 3 | third file.log | 23456 | 123 |
+---+------------+-------------+----------------+-------+--------+
Note - The code is hardcoded for 5 columns. It can be generalized too.
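For example, a minimal sketch of one such generalization (my own, assuming the leading and trailing columns never contain spaces and only the single middle column, the file name, can):
import pandas as pd

def split_row(line, n_left, n_right):
    # Split one line on whitespace, keep the first n_left and last n_right
    # tokens as their own columns, and re-join whatever is left in the middle.
    tokens = line.split()
    middle = ' '.join(tokens[n_left:len(tokens) - n_right])
    return tokens[:n_left] + [middle] + tokens[-n_right:]

with open('file.txt') as fh:
    rows = [split_row(line, 2, 2) for line in fh.read().splitlines()]

out = pd.DataFrame(rows, columns=['col1', 'col2', 'col3', 'col4', 'col5'])
print(out)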
Previous answer -
Use pd.read_fwf() to read files with fixed-width fields.
In your case:
pd.read_fwf('file.txt', header=None)
+---+----------+-----+-------------------+-------+--------+
| | 0 | 1 | 2 | 3 | 4 |
+---+----------+-----+-------------------+-------+--------+
| 0 | 20181201 | 3 | file.log | 12345 | 123456 |
| 1 | 20181201 | 12 | otherfile.log | 23456 | 123 |
| 2 | 20181201 | 200 | odd file name.log | 23456 | 123 |
+---+----------+-----+-------------------+-------+--------+

Related

Comparing two DataFrames and creating a third one where certain conditions are met

I am trying to compare two different dataframes that have the same column names and indexes (not numerical), and I need to obtain a third df containing, for each row and column, the largest of the two values.
Example
df1=
| | col_1 | col2 | col-3 |
| rft_12312 | 4 | 7 | 4 |
| rft_321321 | 3 | 4 | 1 |
df2=
| | col_1 | col2 | col-3 |
| rft_12312 | 7 | 3 | 4 |
| rft_321321 | 3 | 7 | 6 |
Required result
|            | col_1 | col2 | col-3 |
| rft_12312  | 7     | 7    | 4     |
| rft_321321 | 3     | 7    | 6     |
(For rft_12312, col_1 is 7 because df2's value at that [row, column] is greater than df1's; when the two values are equal, as for rft_321321's col_1, it doesn't matter which dataframe the value comes from.)
I've already tried pd.update with filter_func defined as:
def filtration_function(val1, val2):
    if val1 >= val2:
        return val1
    else:
        return val2
but it is not working. I need the check for each column with the same name.
I also tried pd.compare, but it does not allow me to pick the right values.
Thank you in advance :)
I think one possibility would be to use "combine". This method combines the two dataframes column by column using the function you pass; returning the element-wise maximum of each column pair gives the result you want.
Example:
import numpy as np
import pandas as pd

def filtration_function(val1, val2):
    # combine passes whole columns (Series), so return the element-wise maximum
    return np.maximum(val1, val2)

result = df1.combine(df2, filtration_function)
I think the method "where" can work too:
import pandas as pd
result = df1.where(df1 >= df2, df2)
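For reference, a self-contained run with the frames from the question (rebuilt here by hand) gives the required result:
import pandas as pd

# Rebuild the example frames from the question
df1 = pd.DataFrame({'col_1': [4, 3], 'col2': [7, 4], 'col-3': [4, 1]},
                   index=['rft_12312', 'rft_321321'])
df2 = pd.DataFrame({'col_1': [7, 3], 'col2': [3, 7], 'col-3': [4, 6]},
                   index=['rft_12312', 'rft_321321'])

print(df1.where(df1 >= df2, df2))
#             col_1  col2  col-3
# rft_12312       7     7      4
# rft_321321      3     7      6
Running the same frames through df1.combine(df2, filtration_function) gives the identical result.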

Python - pandas remove duplicate rows based on condition

I have a csv which has data that looks like this
| id | code | date                       |
+----+------+----------------------------+
| 1  | 2    | 2022-10-05 07:22:39+00::00 |
| 1  | 0    | 2022-11-05 02:22:35+00::00 |
| 2  | 3    | 2021-01-05 10:10:15+00::00 |
| 2  | 0    | 2019-01-11 10:05:21+00::00 |
| 2  | 1    | 2022-01-11 10:05:22+00::00 |
| 3  | 2    | 2022-10-10 11:23:43+00::00 |
I want to remove duplicate ids based on the following conditions -
For the code column, choose a value that is not equal to 0 and, among those, the one with the latest timestamp.
Add another column prev_code, which contains a list of all the remaining code values that did not end up in the code column.
Something like this -
| id | code | prev_code |
+----+------+-----------+
| 1  | 2    | [0]       |
| 2  | 1    | [0,2]     |
| 3  | 2    | []        |
There is probably a sleeker solution but something along the following lines should work.
df = pd.read_csv('file.csv')
# For each id, keep the non-zero code row with the latest timestamp
lastcode = df[df.code != 0].groupby('id').apply(
    lambda block: block[block['date'] == block['date'].max()]['code'])
# Every other code value for that id goes into prev_code
prev_codes = df.groupby('id').agg(
    code=('code', lambda x: [val for val in x if val != lastcode[x.name].values[0]]))['code']
pd.DataFrame({'id': [ix[0] for ix in lastcode.index],
              'code': lastcode.values, 'prev_code': prev_codes.values})
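A possibly sleeker variant (my own sketch, not part of the answer above), assuming the date strings sort chronologically as written or are first converted with pd.to_datetime:
import pandas as pd

df = pd.read_csv('file.csv')

# Within each id, put non-zero codes before zeros and later dates first,
# so the first row per id is exactly the one the question wants to keep.
ordered = (df.assign(nonzero=df['code'].ne(0))
             .sort_values(['id', 'nonzero', 'date'], ascending=[True, False, False]))

picked = ordered.drop_duplicates('id')[['id', 'code']]

# Every other row's code per id becomes prev_code; ids with no leftovers get [].
prev = (ordered[ordered.duplicated('id')]
        .groupby('id')['code'].agg(list).rename('prev_code').reset_index())
out = picked.merge(prev, on='id', how='left')
out['prev_code'] = out['prev_code'].apply(lambda v: v if isinstance(v, list) else [])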

Writing to excel file based on column value & string naming

The DF looks something like this and extends for thousands of rows (i.e. every possible combination of 'Type' & 'Name'):
| total | big | med | small| Type | Name |
|:-----:|:-----:|:-----:|:----:|:--------:|:--------:|
| 5 | 4 | 0 | 1 | Pig | John |
| 6 | 0 | 3 | 3 | Horse | Mike |
| 5 | 2 | 3 | 0 | Cow | Rick |
| 5 | 2 | 3 | 0 | Horse | Rick |
| 5 | 2 | 3 | 0 | Cow | John |
| 5 | 2 | 3 | 0 | Pig | Mike |
I would like to write code that writes files to excel based on the 'Type' column value. In the example above there are 3 different "Types" so I'd like one file for Pig, one for Horse, one for Cow respectively.
I have been able to do this using two columns, but for some reason have not been able to do it with just one. See code below.
for idx, df in data.groupby(['Type', 'Name']):
    table_1 = function_1(df)
    table_2 = function_2(df)
    with pd.ExcelWriter(f"{'STRING1' + '_' + ('_'.join(idx)) + '_' + 'STRING2'}.xlsx") as writer:
        table_1.to_excel(writer, sheet_name='Table 1', index=False)
        table_2.to_excel(writer, sheet_name='Table 2', index=False)
Current result is:
STRING1_Pig_John_STRING2.xlsx (all the rows that have Pig and John)
What I would like is:
STRING1_Pig_STRING2.xlsx (all the rows that have Pig)
Do you have anything against boolean indexing? If not:
vals = df['Type'].unique().tolist()
with pd.ExcelWriter("blah.xlsx") as writer:
    for val in vals:
        ix = df[df['Type'] == val].index
        df.loc[ix].to_excel(writer, sheet_name=str(val), index=False)
EDIT:
If you want to stick to groupby, that would be:
with pd.ExcelWriter("blah.xlsx") as writer:
    for idx, df in data.groupby(['Type']):
        val = list(set(df.Type))[0]
        df.to_excel(writer, sheet_name=str(val), index=False)
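If you actually need one file per Type, named as in the question, rather than one sheet per Type, a minimal sketch reusing the question's own function_1/function_2 and STRING1/STRING2 naming would be:
# Write one workbook per Type, keeping the two-table layout from the question
for type_val, group in data.groupby('Type'):
    with pd.ExcelWriter(f"STRING1_{type_val}_STRING2.xlsx") as writer:
        function_1(group).to_excel(writer, sheet_name='Table 1', index=False)
        function_2(group).to_excel(writer, sheet_name='Table 2', index=False)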

How to scan characters in strings to match to another string in different column

I have 2 columns of strings and I'd like to match the strings based on the first 3 characters of each string. Basically, code that goes over each value in column 1, row by row, and compares it with the values in column 2 to find the best match.
I.e., for row 1 of column 1 it scans "p", "a", "s" and looks in Col2 for strings starting with "pas", and so on for row 2 of column 1.
I'm fairly new to python; my apologies.
Original Table (unsorted):
+-------------+---------+----------+
| Row Index | Col1 | Col2 |
+-------------+---------+----------+
| 1 | pasta | sauce |
| 2 | sauce | orange |
| 3 | orange | pasta |
+-------------+---------+----------+
Expected Table (after matching)
+-------------+---------+----------+
| Row Index | Col1 | Col2 |
+-------------+---------+----------+
| 1 | pasta | pasta |
| 2 | sauce | sauce |
| 3 | orange | orange |
+-------------+---------+----------+
I don't have any code to show as I'm not sure how to start this. Thanks.
Probably not the fastest or cleanest solution, but it will return what you're asking for:
df['Col3'] = df.Col1.apply(lambda x: [i for i in df.Col2 if i.startswith(x[:3])][0])
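If a row in Col1 might have no counterpart in Col2, a slightly more defensive variant (my own tweak, not in the original answer) falls back to None instead of raising an IndexError:
# next() with a default avoids IndexError when nothing in Col2 starts with the prefix
df['Col3'] = df['Col1'].apply(
    lambda x: next((i for i in df.Col2 if i.startswith(x[:3])), None)
)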

Pandas: Make the value of one column equal to the value of another

Hopefully a very simple question from a Pandas newbie.
How can I make the value of one column equal the value of another in a dataframe? Replace the value in every row. No conditionals, etc.
Context:
I have two CSV's, loaded into dataframe 'a' and dataframe 'b' respectively.
These CSVs are basically the same, except 'a' has a field that was improperly carried forward from another process - floats were rounded to ints. Not my script, can't influence it, I just have the CSVs now.
In reality I probably have 2mil rows and about 60-70 columns in the merged dataframe - so if it's possible to address the columns by their header (in the example these are Col1 and xyz_Col1), that would sure help.
I have joined the CSVs on their common field, so now I have a scenario where I have a dataframe that can be represented by the following:
+--------+------+--------+------------+----------+----------+
| CellID | Col1 | Col2 | xyz_CellID | xyz_Col1 | xyz_Col2 |
+--------+------+--------+------------+----------+----------+
| 1 | 0 | apple | 1 | 0.23 | apple |
| 2 | 0 | orange | 2 | 0.45 | orange |
| 3 | 1 | banana | 3 | 0.68 | banana |
+--------+------+--------+------------+----------+----------+
The result should be such that Col1 = xyz_Col1:
+--------+------+--------+------------+----------+----------+
| CellID | Col1 | Col2 | xyz_CellID | xyz_Col1 | xyz_Col2 |
+--------+------+--------+------------+----------+----------+
| 1 | 0.23 | apple | 1 | 0.23 | apple |
| 2 | 0.45 | orange | 2 | 0.45 | orange |
| 3 | 0.68 | banana | 3 | 0.68 | banana |
+--------+------+--------+------------+----------+----------+
What I have in code so far:
import pandas as pd
a = pd.read_csv('csv1.csv')
b = pd.read_csv('csv2.csv')
# b = b.dropna(axis=1)  # drop any unnamed fields
# define 'b' cols by adding an xyz_ prefix, as xyz is unique
b = b.add_prefix('xyz_')
#Join the dataframes into a new dataframe named merged
merged = pd.merge(a, b, left_on='CellID', right_on='xyz_CellID')
merged.head(5)
#This is where the xyz_Col1 to Col1 code goes...
#drop unwanted cols
merged = merged[merged.columns.drop(list(merged.filter(regex='xyz')))]
#output to file
merged.to_csv("output.csv", index=False)
Thanks
merged['Col1'] = merged['xyz_Col1']
or
merged.loc[:, 'Col1'] = merged.loc[:, 'xyz_Col1']
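Slotted into the question's own script, the assignment goes where the placeholder comment sits, before the xyz_ columns are dropped:
# copy the float values back over the rounded ints, then drop the xyz_ columns
merged['Col1'] = merged['xyz_Col1']
merged = merged[merged.columns.drop(list(merged.filter(regex='xyz')))]
merged.to_csv("output.csv", index=False)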
